Chapter 3 Data transformation

Our data for this project on cervical cancer risk factors was downloaded directly from the UCI Machine Learning Repository. The download contains a single file, risk_factors_cervical_cancer.csv. Because the file is already in CSV format, little transformation is needed: R can read the data directly from the CSV file.
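
As a minimal sketch of that reading step, the code below wraps `read.csv` in a small helper. The helper name `read_risk_factors` and the two-row sample file are assumptions made for illustration; they are not taken from the project code. `check.names = FALSE` is used so that column names such as "Smokes (years)" are kept as written.

```r
# Hypothetical helper: read the downloaded CSV into a data frame.
# check.names = FALSE keeps names such as "Smokes (years)" unchanged.
read_risk_factors <- function(path) {
  read.csv(path, check.names = FALSE, stringsAsFactors = FALSE)
}

# Made-up two-row sample written to a temporary file so the sketch runs
# without the real download.
sample_path <- tempfile(fileext = ".csv")
writeLines(c("Age,Smokes (years),Biopsy",
             "18,0.0,0",
             "34,1.5,1"), sample_path)

cervical <- read_risk_factors(sample_path)
print(names(cervical))
print(dim(cervical))
```

With the real data, the call would presumably be `read_risk_factors("risk_factors_cervical_cancer.csv")`, assuming the file sits in the working directory.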

We present some basic information about the data: the variable names, the variable types, the shape of the data set, and basic observations on the data structure.

3.1 Variable Names and Variable Types

##    Type  Variable Name                     
## 1  int   Age                               
## 2  int   Number of sexual partners         
## 3  int   First sexual intercourse (age)    
## 4  int   Num of pregnancies                
## 5  bool  Smokes                            
## 6  float Smokes (years)                    
## 7  float Smokes (packs/year)               
## 8  bool  Hormonal Contraceptives           
## 9  float Hormonal Contraceptives (years)   
## 10 bool  IUD                               
## 11 float IUD (years)                       
## 12 bool  STDs                              
## 13 int   STDs (number)                     
## 14 bool  STDs:condylomatosis               
## 15 bool  STDs:cervical condylomatosis      
## 16 bool  STDs:vaginal condylomatosis       
## 17 bool  STDs:vulvo-perineal condylomatosis
## 18 bool  STDs:syphilis                     
## 19 bool  STDs:pelvic inflammatory disease  
## 20 bool  STDs:genital herpes               
## 21 bool  STDs:molluscum contagiosum        
## 22 bool  STDs:AIDS                         
## 23 bool  STDs:HIV                          
## 24 bool  STDs:Hepatitis B                  
## 25 bool  STDs:HPV                          
## 26 int   STDs: Number of diagnosis         
## 27 int   STDs: Time since first diagnosis  
## 28 int   STDs: Time since last diagnosis   
## 29 bool  Dx:Cancer                         
## 30 bool  Dx:CIN                            
## 31 bool  Dx:HPV                            
## 32 bool  Dx                                
## 33 bool  Hinselmann: target variable       
## 34 bool  Schiller: target variable         
## 35 bool  Cytology: target variable         
## 36 bool  Biopsy: target variable
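
A table like the one above could be derived from the data rather than written by hand. The sketch below is one possible approach, not necessarily how the table was produced: the helper `classify_column` is hypothetical, and it labels a column "bool" when all non-missing values are 0 or 1, "int" when all are whole numbers, and "float" otherwise. The three-column data frame is made up for illustration.

```r
# Hypothetical helper: assign a coarse type label to a numeric column.
# Caveat: a duration column that happens to contain only 0s and 1s
# would also be labelled "bool".
classify_column <- function(x) {
  values <- x[!is.na(x)]
  if (all(values %in% c(0, 1))) {
    "bool"
  } else if (all(values == round(values))) {
    "int"
  } else {
    "float"
  }
}

# Made-up sample with one column of each kind.
toy <- data.frame(
  Age = c(18, 34, 52),
  Smokes = c(0, 1, 0),
  `Smokes (years)` = c(0, 1.5, 0),
  check.names = FALSE
)

type_table <- data.frame(
  Type = unname(vapply(toy, classify_column, character(1))),
  `Variable Name` = names(toy),
  check.names = FALSE
)
print(type_table)
```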

Dx: oncology tests for specific diseases, including tests on genes and mutations.

Hinselmann: a test for cervical cancer in which the cells are examined with an instrument called a colposcope.

Schiller: a preliminary test for cancer of the uterine cervix in which the cervix is painted with an aqueous solution of iodine and potassium iodide.

Cytology: a test used to look closely at cells and body fluids.

Biopsy: a procedure that removes a small amount of tissue for examination under a microscope.

3.2 Dataset Shape and Observations

Number of rows: 858. Number of columns: 36.

Observations: The dataset contains ‘?’ values, which indicate missing data. Patients may have decided not to answer the questions behind those columns, possibly because of privacy concerns or because they misunderstood the question.
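
One way to handle the ‘?’ markers is to have `read.csv` convert them to `NA` through its `na.strings` argument, so that the affected columns are still parsed as numbers. The sketch below uses a made-up three-row sample passed through the `text` argument; only the column names come from the dataset.

```r
# Made-up sample in which '?' marks a missing answer.
csv_text <- "Age,Smokes,IUD (years)
18,0,?
34,?,2.5
52,1,?"

# na.strings = "?" turns every '?' into NA while reading.
toy <- read.csv(text = csv_text, na.strings = "?", check.names = FALSE)

# Count the missing values in each column.
missing_counts <- colSums(is.na(toy))
print(missing_counts)
```

Without `na.strings = "?"`, a column containing ‘?’ would be read as character rather than numeric.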

The Smokes (years), Hormonal Contraceptives (years), and IUD (years) values are floats rather than integers. This suggests that patients included months or days when reporting durations in the survey. Recording durations down to the month makes the data more precise and may make analyses of the correlation between these duration variables and cervical cancer more reliable.
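
To illustrate how a fractional year value can be read, the sketch below splits a duration into whole years and remaining months. The helper `split_years` and the sample values are hypothetical, and the assumption that the fractional part encodes months is ours, not something the dataset documents.

```r
# Hypothetical helper: split fractional years into whole years and months,
# assuming the fractional part represents a share of a 12-month year.
split_years <- function(x) {
  whole <- floor(x)
  data.frame(
    years = whole,
    months = round((x - whole) * 12)
  )
}

# Made-up durations in years.
parts <- split_years(c(0, 1.5, 2.25))
print(parts)
```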

Some variables are related by their definitions. For example, when a patient does not smoke, her Smokes (years) value is 0. When a patient has no STD diagnoses, the STDs: Time since first diagnosis and STDs: Time since last diagnosis variables are left as ‘?’. In this case the missing values are logically expected.
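
These relationships can be checked as consistency rules. The sketch below is an illustration on a made-up four-row data frame, in which rows 3 and 4 deliberately violate the rules. It assumes the ‘?’ values have already been converted to `NA`.

```r
# Made-up sample: row 3 is a non-smoker with a nonzero duration,
# row 4 has no diagnosis but a recorded time since first diagnosis.
toy <- data.frame(
  Smokes = c(0, 1, 0, NA),
  `Smokes (years)` = c(0, 3.5, 2, NA),
  `STDs: Number of diagnosis` = c(0, 1, 0, 0),
  `STDs: Time since first diagnosis` = c(NA, 4, NA, 7),
  check.names = FALSE
)

# Rows where a non-smoker has a nonzero smoking duration.
# which() drops rows where the comparison is NA.
bad_smoke <- which(toy$Smokes == 0 & toy$`Smokes (years)` != 0)

# Rows with zero diagnoses but a non-missing time since first diagnosis.
bad_std <- which(toy$`STDs: Number of diagnosis` == 0 &
                   !is.na(toy$`STDs: Time since first diagnosis`))

print(bad_smoke)
print(bad_std)
```

On the real data, empty results for both checks would support the observation that these missing values are structural rather than random.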