Chapter 3 Data transformation
The data for this project on cervical cancer risk factors was downloaded directly from the UCI Machine Learning Repository. The download contains a single file, risk_factors_cervical_cancer.csv. Because the file is already in CSV format, little transformation is needed: R can read the data directly from the CSV file.
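As a minimal sketch of the reading step, the snippet below loads a CSV with base R and reports its shape and column types. The three data rows are a made-up stand-in written to a temporary file, not records from the real data set; with the real download, `path` would point at risk_factors_cervical_cancer.csv instead. The name `risk` is an assumption chosen for illustration.

```r
# Made-up stand-in for risk_factors_cervical_cancer.csv (not real records).
sample_csv <- c(
  "Age,Smokes,Smokes (years),STDs: Time since first diagnosis",
  "18,0.0,0.0,?",
  "34,1.0,1.27,?",
  "52,?,?,21.0"
)
path <- tempfile(fileext = ".csv")
writeLines(sample_csv, path)

# check.names = FALSE keeps column names such as "Smokes (years)" unchanged
# instead of rewriting them into syntactically valid R names.
risk <- read.csv(path, check.names = FALSE, stringsAsFactors = FALSE)

dim(risk)            # number of rows, number of columns
sapply(risk, class)  # storage type R assigned to each column
```

Note that with these default settings a column containing the text marker `?` is read as character rather than numeric.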
We would like to present some basic information on the data, including the variable names, variable types, data set shape, and basic observations on the data structure.
3.1 Variable Names and Variable Types
##    Type   Variable Name
## 1  int    Age
## 2  int    Number of sexual partners
## 3  int    First sexual intercourse (age)
## 4  int    Num of pregnancies
## 5  bool   Smokes
## 6  float  Smokes (years)
## 7  float  Smokes (packs/year)
## 8  bool   Hormonal Contraceptives
## 9  float  Hormonal Contraceptives (years)
## 10 bool   IUD
## 11 float  IUD (years)
## 12 bool   STDs
## 13 int    STDs (number)
## 14 bool   STDs:condylomatosis
## 15 bool   STDs:cervical condylomatosis
## 16 bool   STDs:vaginal condylomatosis
## 17 bool   STDs:vulvo-perineal condylomatosis
## 18 bool   STDs:syphilis
## 19 bool   STDs:pelvic inflammatory disease
## 20 bool   STDs:genital herpes
## 21 bool   STDs:molluscum contagiosum
## 22 bool   STDs:AIDS
## 23 bool   STDs:HIV
## 24 bool   STDs:Hepatitis B
## 25 bool   STDs:HPV
## 26 int    STDs: Number of diagnosis
## 27 int    STDs: Time since first diagnosis
## 28 int    STDs: Time since last diagnosis
## 29 bool   Dx:Cancer
## 30 bool   Dx:CIN
## 31 bool   Dx:HPV
## 32 bool   Dx
## 33 bool   Hinselmann: target variable
## 34 bool   Schiller: target variable
## 35 bool   Cytology: target variable
## 36 bool   Biopsy: target variable
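Dx: a diagnosis flag (Dx is the standard medical abbreviation for diagnosis). Dx:Cancer, Dx:CIN, and Dx:HPV record a diagnosis of cancer, cervical intraepithelial neoplasia (CIN), and HPV, respectively.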
Hinselmann: a test for cervical cancer in which the cervix is examined with an instrument called a colposcope.
Schiller: a preliminary test for cancer of the uterine cervix in which the cervix is painted with an aqueous solution of iodine and potassium iodide.
Cytology: a test that looks closely at cells and body fluids.
Biopsy: a procedure that removes a small amount of tissue for examination under a microscope.
3.2 Dataset Shape and Observations
Number of rows: 858. Number of columns: 36.
Observations: The dataset contains ‘?’ values, which indicate missing data. Patients may have decided not to answer the questions behind these columns, possibly because of privacy concerns or because they misunderstood the question.
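One way to handle the ‘?’ marker, sketched here under assumptions, is to tell `read.csv` to treat it as missing through its `na.strings` argument and then count the missing values per column. The rows below are made up for illustration, and the names `risk` and `missing_per_column` are hypothetical.

```r
# Made-up rows using the same '?' convention as the data set (not real records).
sample_csv <- "Age,Smokes,Smokes (years),IUD
18,0.0,0.0,?
34,1.0,1.27,0.0
52,?,?,1.0"

# na.strings = "?" converts every '?' cell to NA while reading, so the
# affected columns are parsed as numeric instead of character.
risk <- read.csv(text = sample_csv, na.strings = "?", check.names = FALSE)

# Count the missing values in each column.
missing_per_column <- colSums(is.na(risk))
missing_per_column

sapply(risk, class)  # confirm the columns are numeric after the conversion
```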
The Smokes (years), Hormonal Contraceptives (years), and IUD (years) values are floats rather than integers. This suggests that patients included months or days in their answers when they were surveyed. Durations recorded down to the month are more precise and may make the analysis of correlations between these year variables and cervical cancer more reliable.
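The observation about fractional year values can be checked with a short sketch like the one below. The vector holds made-up values rather than real records, the names `smokes_years` and `has_fraction` are hypothetical, and reading the fractional part as months is the interpretation stated above, not something the data set documents.

```r
# Made-up year values, including a missing entry (not real records).
smokes_years <- c(0, 1.27, 3, 0.5, NA, 22)

# A value has a fractional part when its remainder modulo 1 is non-zero.
# Missing values are excluded explicitly so they do not propagate as NA.
has_fraction <- !is.na(smokes_years) & (smokes_years %% 1 != 0)

sum(has_fraction)  # how many entries are not whole years

# Fractional part expressed in months, assuming the values are years.
round((smokes_years[has_fraction] %% 1) * 12, 1)
```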
Some variables are logically related. For example, when a patient does not smoke, her Smokes (years) value is 0. When a patient has no STD diagnoses, the STDs: Time since first diagnosis and STDs: Time since last diagnosis variables are left as ‘?’. In this case the missing values are logically expected rather than unanswered.
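These two relationships can be tested directly. The sketch below uses a small made-up data frame with the same column names as the data set, with ‘?’ already represented as NA; the variable names `non_smokers_ok` and `no_dx_ok` are hypothetical.

```r
# Made-up rows with the data set's column names; '?' is represented as NA.
risk <- data.frame(
  Smokes = c(0, 1, 0, NA),
  `Smokes (years)` = c(0, 1.27, 0, NA),
  `STDs: Number of diagnosis` = c(0L, 1L, 0L, 0L),
  `STDs: Time since first diagnosis` = c(NA, 21, NA, NA),
  check.names = FALSE
)

# Rule 1: every non-smoker should have 0 smoking years.
# which() drops rows where Smokes itself is missing.
non_smokers <- which(risk$Smokes == 0)
non_smokers_ok <- all(risk[["Smokes (years)"]][non_smokers] == 0)

# Rule 2: patients with no STD diagnosis should have a missing
# "time since first diagnosis" value.
no_dx <- which(risk[["STDs: Number of diagnosis"]] == 0)
no_dx_ok <- all(is.na(risk[["STDs: Time since first diagnosis"]][no_dx]))

non_smokers_ok
no_dx_ok
```

If either check returns FALSE on the real data, the corresponding rows would be worth inspecting before treating the missing values as structural.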