Chapter 4 Missing values
4.1 Data Set
In our data set of risk factors for cervical cancer, missing values were originally represented by question marks (“?”). We first replace those marks with NA and then extract the first 100 rows as our example for this first draft.
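The preprocessing step described above can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not necessarily the code used to produce this chapter: the inline CSV text and the object names `raw_text`, `cancer`, and `cancer100` are placeholders invented for illustration, and the real data would be read from a file rather than from a string.

```r
# Hypothetical stand-in for the raw file: "?" marks a missing value.
raw_text <- "Age,Smokes,IUD
18,0,?
15,?,0
34,1,1"

# na.strings = "?" converts every "?" field to NA while reading;
# check.names = FALSE keeps column names such as "STDs:HPV" unchanged.
cancer <- read.csv(text = raw_text, na.strings = "?", check.names = FALSE)

# Keep the first 100 rows (head() returns all rows if there are fewer).
cancer100 <- head(cancer, 100)
```

Converting the marks at read time, rather than afterwards, also lets R parse the affected columns as numeric instead of character.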
4.2 Missing Values by Column
## STDs: Time since first diagnosis    93
## STDs: Time since last diagnosis     93
## IUD                                  7
## IUD (years)                          7
## Num of pregnancies                   5
## Hormonal Contraceptives              5
## Hormonal Contraceptives (years)      5
## STDs                                 5
## STDs (number)                        5
## STDs:condylomatosis                  5
## STDs:cervical condylomatosis         5
## STDs:vaginal condylomatosis          5
## STDs:vulvo-perineal condylomatosis   5
## STDs:syphilis                        5
## STDs:pelvic inflammatory disease     5
## STDs:genital herpes                  5
## STDs:molluscum contagiosum           5
## STDs:AIDS                            5
## STDs:HIV                             5
## STDs:Hepatitis B                     5
## STDs:HPV                             5
## Number of sexual partners            3
## First sexual intercourse             2
## Smokes                               1
## Smokes (years)                       1
## Smokes (packs/year)                  1
## Age                                  0
## STDs: Number of diagnosis            0
## Dx:Cancer                            0
## Dx:CIN                               0
## Dx:HPV                               0
## Dx                                   0
## Hinselmann                           0
## Schiller                             0
## Citology                             0
## Biopsy                               0
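Counts like the ones above, sorted from most to least missing, can be obtained with a one-line base R idiom. The sketch below uses a small made-up data frame (`df` and its values are hypothetical) rather than the cervical cancer data, so its numbers do not match the output above.

```r
# Hypothetical toy data frame; NA marks a missing value.
df <- data.frame(
  Age    = c(18, 15, 34, 52),
  Smokes = c(0, NA, 1, 0),
  IUD    = c(NA, 0, NA, NA),
  check.names = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
missing_by_column <- sort(colSums(is.na(df)), decreasing = TRUE)
print(missing_by_column)
```

Sorting in decreasing order puts the columns that need the most attention at the top of the printout.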
4.3 Missing Value Plots
Here is a missing-value heat map of the first 100 rows of the data.
## NOTE: The following pairs of variables appear to have the same missingness pattern.
## Please verify whether they are in fact logically distinct variables.
## [,1] [,2]
## [1,] "Sm" "S()"
## [2,] "Sm" "S(/"
## [3,] "S()" "S(/"
## [4,] "HC" "HC("
## [5,] "HC" "STDs"
## [6,] "HC" "ST("
## [7,] "HC" "STDs:cn"
## [8,] "HC" "STDs:cc"
## [9,] "HC" "STDs:vc"
## [10,] "HC" "STD:-c"
## [11,] "HC" "STDs:s"
## [12,] "HC" "Sid"
## [13,] "HC" "Sh"
## [14,] "HC" "STDs:mc"
## [15,] "HC" "STD:A"
## [16,] "HC" "STD:HI"
## [17,] "HC" "SB"
## [18,] "HC" "STD:HP"
## [19,] "HC(" "STDs"
## [20,] "HC(" "ST("
## [21,] "HC(" "STDs:cn"
## [22,] "HC(" "STDs:cc"
## [23,] "HC(" "STDs:vc"
## [24,] "HC(" "STD:-c"
## [25,] "HC(" "STDs:s"
## [26,] "HC(" "Sid"
## [27,] "HC(" "Sh"
## [28,] "HC(" "STDs:mc"
## [29,] "HC(" "STD:A"
## [30,] "HC(" "STD:HI"
## [31,] "HC(" "SB"
## [32,] "HC(" "STD:HP"
## [33,] "IU" "I("
## [34,] "STDs" "ST("
## [35,] "STDs" "STDs:cn"
## [36,] "STDs" "STDs:cc"
## [37,] "STDs" "STDs:vc"
## [38,] "STDs" "STD:-c"
## [39,] "STDs" "STDs:s"
## [40,] "STDs" "Sid"
## [41,] "STDs" "Sh"
## [42,] "STDs" "STDs:mc"
## [43,] "STDs" "STD:A"
## [44,] "STDs" "STD:HI"
## [45,] "STDs" "SB"
## [46,] "STDs" "STD:HP"
## [47,] "ST(" "STDs:cn"
## [48,] "ST(" "STDs:cc"
## [49,] "ST(" "STDs:vc"
## [50,] "ST(" "STD:-c"
## [51,] "ST(" "STDs:s"
## [52,] "ST(" "Sid"
## [53,] "ST(" "Sh"
## [54,] "ST(" "STDs:mc"
## [55,] "ST(" "STD:A"
## [56,] "ST(" "STD:HI"
## [57,] "ST(" "SB"
## [58,] "ST(" "STD:HP"
## [59,] "STDs:cn" "STDs:cc"
## [60,] "STDs:cn" "STDs:vc"
## [61,] "STDs:cn" "STD:-c"
## [62,] "STDs:cn" "STDs:s"
## [63,] "STDs:cn" "Sid"
## [64,] "STDs:cn" "Sh"
## [65,] "STDs:cn" "STDs:mc"
## [66,] "STDs:cn" "STD:A"
## [67,] "STDs:cn" "STD:HI"
## [68,] "STDs:cn" "SB"
## [69,] "STDs:cn" "STD:HP"
## [70,] "STDs:cc" "STDs:vc"
## [71,] "STDs:cc" "STD:-c"
## [72,] "STDs:cc" "STDs:s"
## [73,] "STDs:cc" "Sid"
## [74,] "STDs:cc" "Sh"
## [75,] "STDs:cc" "STDs:mc"
## [76,] "STDs:cc" "STD:A"
## [77,] "STDs:cc" "STD:HI"
## [78,] "STDs:cc" "SB"
## [79,] "STDs:cc" "STD:HP"
## [80,] "STDs:vc" "STD:-c"
## [81,] "STDs:vc" "STDs:s"
## [82,] "STDs:vc" "Sid"
## [83,] "STDs:vc" "Sh"
## [84,] "STDs:vc" "STDs:mc"
## [85,] "STDs:vc" "STD:A"
## [86,] "STDs:vc" "STD:HI"
## [87,] "STDs:vc" "SB"
## [88,] "STDs:vc" "STD:HP"
## [89,] "STD:-c" "STDs:s"
## [90,] "STD:-c" "Sid"
## [91,] "STD:-c" "Sh"
## [92,] "STD:-c" "STDs:mc"
## [93,] "STD:-c" "STD:A"
## [94,] "STD:-c" "STD:HI"
## [95,] "STD:-c" "SB"
## [96,] "STD:-c" "STD:HP"
## [97,] "STDs:s" "Sid"
## [98,] "STDs:s" "Sh"
## [99,] "STDs:s" "STDs:mc"
## [100,] "STDs:s" "STD:A"
## [101,] "STDs:s" "STD:HI"
## [102,] "STDs:s" "SB"
## [103,] "STDs:s" "STD:HP"
## [104,] "Sid" "Sh"
## [105,] "Sid" "STDs:mc"
## [106,] "Sid" "STD:A"
## [107,] "Sid" "STD:HI"
## [108,] "Sid" "SB"
## [109,] "Sid" "STD:HP"
## [110,] "Sh" "STDs:mc"
## [111,] "Sh" "STD:A"
## [112,] "Sh" "STD:HI"
## [113,] "Sh" "SB"
## [114,] "Sh" "STD:HP"
## [115,] "STDs:mc" "STD:A"
## [116,] "STDs:mc" "STD:HI"
## [117,] "STDs:mc" "SB"
## [118,] "STDs:mc" "STD:HP"
## [119,] "STD:A" "STD:HI"
## [120,] "STD:A" "SB"
## [121,] "STD:A" "STD:HP"
## [122,] "STD:HI" "SB"
## [123,] "STD:HI" "STD:HP"
## [124,] "SB" "STD:HP"
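The note above reports pairs of variables that are missing in exactly the same rows. A rough base R analogue of that check is sketched below; it is an illustration on a made-up data frame (`df`, `pattern`, `groups`, and `shared` are hypothetical names), not the routine that printed the note, and it reports groups of columns rather than pairs.

```r
# Hypothetical toy data frame; NA marks a missing value.
df <- data.frame(
  Smokes           = c(0, NA, 1, 0),
  `Smokes (years)` = c(0, NA, 3, 0),
  IUD              = c(NA, 0, 1, 0),
  Age              = c(18, 15, 34, 52),
  check.names = FALSE
)

# Encode each column's missingness as a string such as "0100".
pattern <- vapply(
  df,
  function(col) paste(as.integer(is.na(col)), collapse = ""),
  character(1)
)

# Only columns that actually contain NAs are of interest here.
pattern <- pattern[grepl("1", pattern)]

# Group column names by pattern and keep groups with two or more members.
groups <- split(names(pattern), pattern)
shared <- Filter(function(g) length(g) > 1, groups)
print(unname(shared))
```

Grouping makes long pair lists easier to read: a group of k columns that share one pattern corresponds to k(k-1)/2 reported pairs.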
Here is a plot of missing values by variable for the first 100 rows of the data.
4.4 Using Problem 2 missing value function
As the graph shows, the columns STDs: Time since first diagnosis and STDs: Time since last diagnosis both contain a large number of missing values: 93 of the 100 rows in each. We may need to look into these two columns and decide whether to keep them given this high volume of NAs. The other columns with missing values are missing only a small share of their entries, at most 7 of the 100 rows.
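One way to make the keep-or-drop decision explicit is to compute the share of missing values per column and flag columns above a cutoff. The sketch below assumes a cutoff of 0.5, chosen purely for illustration, and uses a made-up data frame; `df`, `threshold`, `to_drop`, and `df_reduced` are hypothetical names.

```r
# Hypothetical toy data frame; NA marks a missing value.
df <- data.frame(
  Age = c(18, 15, 34, 52),
  `Time since first diagnosis` = c(NA, NA, NA, 2),
  Smokes = c(0, NA, 1, 0),
  check.names = FALSE
)

# colMeans() of the logical matrix gives the fraction of NAs per column.
missing_share <- colMeans(is.na(df))

# Assumed cutoff for illustration only; choose one that suits the analysis.
threshold <- 0.5
to_drop <- names(missing_share)[missing_share > threshold]

# drop = FALSE keeps the result a data frame even if one column remains.
df_reduced <- df[, !(names(df) %in% to_drop), drop = FALSE]
```

Whether a heavily missing column should actually be dropped depends on why the values are missing, so a cutoff like this is a screening aid rather than a rule.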