Chapter 2 Data sources

The primary data source of this project is the Cervical cancer (Risk Factors) Data Set from the UCI Machine Learning Repository. It contains demographic information, habits, and historical medical records of 858 patients at the ‘Hospital Universitario de Caracas’ in Caracas, Venezuela. The data set was donated on 2017-03-03.

There are 36 features in the data set, including non-disease behaviors, such as smoking, and disease conditions, such as HIV and HPV. Four of the 36 (Hinselmann, Schiller, Citology, and Biopsy) are target variables, which indicate whether the patient has cervical cancer.

One issue with this data set is that several patients chose not to answer some of the questions because of privacy concerns, so it contains missing values (shown as “?” in Table 2.1). Additionally, the ratio of samples labeled “1” (diagnosed with cervical cancer) to samples labeled “0” (not diagnosed with cervical cancer) is quite small, which means the data set is imbalanced.
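Both issues can be quantified with a few lines of code. The following is a minimal Python sketch, assuming the data is available as CSV text in which "?" marks an unanswered question. The embedded excerpt (six columns of the first five records), the choice of Biopsy as the label, and the helper names load_rows, missing_counts, and class_counts are illustrative assumptions, not part of the project's code.

```python
import csv
import io

MISSING = "?"  # marker used in the data set for unanswered questions

# Illustrative excerpt: six of the 36 columns for the first five records.
SAMPLE = """Age,Number of sexual partners,First sexual intercourse,STDs Time since first diagnosis,Dx Cancer,Biopsy
18,4.0,15.0,?,0,0
15,1.0,14.0,?,0,0
34,1.0,?,?,0,0
52,5.0,16.0,?,1,0
46,3.0,21.0,?,0,0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, mapping the "?" marker to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: (None if value == MISSING else value) for name, value in row.items()}
        for row in reader
    ]


def missing_counts(rows):
    """Return the number of missing values per column."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


def class_counts(rows, target):
    """Return (positives, negatives) for a 0/1 target column, skipping missing labels."""
    labels = [float(row[target]) for row in rows if row[target] is not None]
    positives = sum(1 for label in labels if label == 1.0)
    return positives, len(labels) - positives


rows = load_rows(SAMPLE)
for name, count in missing_counts(rows).items():
    print(f"{name}: {count} missing of {len(rows)}")

positives, negatives = class_counts(rows, "Biopsy")
print(f"Biopsy: {positives} positive, {negatives} negative")
```

On the full file, the same functions would be applied to all 858 records and to each of the four target columns. The five-record excerpt happens to contain no positive Biopsy label, so it cannot be used to estimate the actual class ratio.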

Table 2.1: Risk Factors Data Set (first five records, shown transposed: one row per feature, one column per record)

Feature                             Row 1   Row 2   Row 3   Row 4   Row 5
Age                                 18      15      34      52      46
Number of sexual partners           4.0     1.0     1.0     5.0     3.0
First sexual intercourse            15.0    14.0    ?       16.0    21.0
Num of pregnancies                  1.0     1.0     1.0     4.0     4.0
Smokes                              0.0     0.0     0.0     1.0     0.0
Smokes years                        0.0     0.0     0.0     37.0    0.0
Smokes packs year                   0.0     0.0     0.0     37.0    0.0
Hormonal Contraceptives             0.0     0.0     0.0     1.0     1.0
Hormonal Contraceptives years       0.0     0.0     0.0     3.0     15.0
IUD                                 0.0     0.0     0.0     0.0     0.0
IUD years                           0.0     0.0     0.0     0.0     0.0
STDs                                0.0     0.0     0.0     0.0     0.0
STDs number                         0.0     0.0     0.0     0.0     0.0
STDs condylomatosis                 0.0     0.0     0.0     0.0     0.0
STDs cervical condylomatosis        0.0     0.0     0.0     0.0     0.0
STDs vaginal condylomatosis         0.0     0.0     0.0     0.0     0.0
STDs vulvo perineal condylomatosis  0.0     0.0     0.0     0.0     0.0
STDs syphilis                       0.0     0.0     0.0     0.0     0.0
STDs pelvic inflammatory disease    0.0     0.0     0.0     0.0     0.0
STDs genital herpes                 0.0     0.0     0.0     0.0     0.0
STDs molluscum contagiosum          0.0     0.0     0.0     0.0     0.0
STDs AIDS                           0.0     0.0     0.0     0.0     0.0
STDs HIV                            0.0     0.0     0.0     0.0     0.0
STDs Hepatitis B                    0.0     0.0     0.0     0.0     0.0
STDs HPV                            0.0     0.0     0.0     0.0     0.0
STDs Number of diagnosis            0       0       0       0       0
STDs Time since first diagnosis     ?       ?       ?       ?       ?
STDs Time since last diagnosis      ?       ?       ?       ?       ?
Dx Cancer                           0       0       0       1       0
Dx CIN                              0       0       0       0       0
Dx HPV                              0       0       0       1       0
Dx                                  0       0       0       0       0
Hinselmann                          0       0       0       0       0
Schiller                            0       0       0       0       0
Citology                            0       0       0       0       0
Biopsy                              0       0       0       0       0