Introduction

Cancer remains one of the leading causes of morbidity and mortality worldwide. Recently, cell-free DNA (cfDNA), which consists of small fragments of DNA that circulate freely in the bloodstream, has emerged as a promising tool for non-invasive cancer diagnostics [1] [2] (Figure 1).

Figure 1: A simple diagram describing cell-free DNA. [2]

Aberrations in DNA methylation patterns can silence or activate critical genes that control the cell cycle, apoptosis, and DNA repair, contributing to tumorigenesis and progression [3] [4] [5], and may therefore be useful for cancer diagnostics (Figure 2). Among DNA modifications, 5-hydroxymethylcytosine (5-hmC) stands out for its role in the active DNA demethylation process. Because tumor-derived DNA makes up only a small fraction of the total cfDNA in blood, analytical techniques must be capable of discerning subtle yet critical differences in methylation patterns between normal and tumor-derived DNA [6].

Figure 2: A simple diagram describing the effect of DNA methylation on gene activity [4].

Problem Definition

This project aims to investigate the utility of 5-hmC cfDNA data for cancer diagnostics using machine learning, considering the role of 5-hmC marks in gene bodies as indicators of active gene transcription. If we can classify a sample as cancerous, or identify the cancer type, based on patterns of gene expression, this would suggest that the occurrence or type of cancer may be determined from a simple blood draw rather than from more involved screening methods such as CT scans, MRIs, or surgical and microscopy approaches, which require specialized equipment and may be more expensive than molecular approaches. This may make cancer screening more accessible to patients living in rural communities or enable routine cancer screening as part of annual checkups.

Data

We worked with two cfDNA 5-hmC datasets from NCBI’s Gene Expression Omnibus (GEO): GSE81314 [7] and GSE89570 [8]. GSE81314 contains 49 cancer patients and 8 healthy individuals. GSE89570 contains 265 cancer patients and 96 healthy individuals.

Summary Statistics

Combining the two datasets, there are a total of 461 cfDNA samples and 17126 genes. Additional metadata features included age, gender, cancer type, grade, stage, tissue type, batch, sequencing mode, and study.

Figure 3: Distribution of cancerous and healthy samples

Figure 4: Distribution of cancer types

Data Harmonization

The two datasets were obtained from separate facilities using distinct sequencing instruments and methodologies. The data points for GSE81314 are normalized counts per gene in RPKM (reads per kilobase of transcript per million reads mapped), whereas those for GSE89570 are raw counts, which must be converted to RPKM for a meaningful comparison. RPKM normalization enables comparison of gene expression (or gene counts) across samples by accounting for differences in gene length and sequencing depth. To convert counts to RPKM:

Calculate the Gene Lengths in Kilobases: For each gene, calculate the length of the transcript in kilobases (kb).
```
Gene Length (kb) = Gene Length (bp) / 1000
```
Calculate the Total Number of Mapped Reads: Sum up all the read counts across all genes to get the total number of mapped reads in the sample.
```
Total Mapped Reads = Sum of Read Count over all genes
```
Normalize Read Counts by Gene Length: For each gene, divide the read count by the gene length in kilobases to get the read counts per kilobase (RPK).
```
RPK = Read Count / Gene Length (kb)
```
Normalize by Total Reads to Get RPKM: Finally, to get the RPKM value, divide the RPK value by the total number of mapped reads in millions.
```
RPKM = RPK * 10^6 / Total Mapped Reads
```
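The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the project's actual pipeline: the function name `counts_to_rpkm` and the toy counts and gene lengths are hypothetical, and total mapped reads is taken as the sum of counts over all genes, as defined above.

```python
import numpy as np

def counts_to_rpkm(counts, gene_lengths_bp):
    """Convert raw read counts for ONE sample to RPKM (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    # Step 1: gene length in kilobases
    gene_lengths_kb = np.asarray(gene_lengths_bp, dtype=float) / 1000.0
    # Step 2: total mapped reads = sum of read counts over all genes
    total_mapped_reads = counts.sum()
    # Step 3: reads per kilobase (RPK)
    rpk = counts / gene_lengths_kb
    # Step 4: scale by total mapped reads expressed in millions
    return rpk * 1e6 / total_mapped_reads

# Toy values, for illustration only (not taken from GSE89570)
counts = [100, 300, 600]
lengths_bp = [1000, 2000, 3000]
rpkm = counts_to_rpkm(counts, lengths_bp)
```

For a matrix of samples by genes, the same function could be applied row by row so that each sample is normalized by its own total.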

Both datasets were further scaled to ensure all features were within a range of -1 to 1. Additionally, standardization was applied to set the mean of the features to 0 and the standard deviation to 1. Genes that are not shared by the two datasets were excluded.
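The gene filtering and rescaling described above might look like the following pandas sketch. The helper names, the toy gene names, and the values are hypothetical. The text does not state the order of the two transformations or whether they were fitted per study or on the combined data, so each transformation is shown separately here, applied to the combined table, as an assumption.

```python
import pandas as pd

def keep_shared_genes(df_a, df_b):
    """Keep only gene columns present in both studies (hypothetical helper)."""
    shared = df_a.columns.intersection(df_b.columns)
    return df_a[shared], df_b[shared]

def standardize(df):
    """Per-gene z-score: mean 0, standard deviation 1 (population std)."""
    return (df - df.mean()) / df.std(ddof=0)

def scale_to_unit_range(df):
    """Per-gene min-max scaling to the range [-1, 1]."""
    lo, hi = df.min(), df.max()
    return 2 * (df - lo) / (hi - lo) - 1

# Toy RPKM tables (rows = samples, columns = genes); values are made up
study_a = pd.DataFrame({"GENE_A": [1.0, 2.0, 3.0],
                        "GENE_B": [2.0, 4.0, 6.0],
                        "GENE_C": [0.0, 5.0, 10.0]})
study_b = pd.DataFrame({"GENE_B": [1.0, 3.0, 5.0],
                        "GENE_C": [2.0, 4.0, 9.0],
                        "GENE_D": [7.0, 8.0, 9.0]})

a_shared, b_shared = keep_shared_genes(study_a, study_b)
combined = pd.concat([a_shared, b_shared], ignore_index=True)
z_scored = standardize(combined)
unit_range = scale_to_unit_range(combined)
```

Both helpers divide by a per-gene spread, so a gene with identical values in every sample would need to be dropped or handled separately before rescaling.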

After data harmonization (described in the Methods section), we examined histograms of gene features to determine whether there is a distribution shift between the two studies. We found no discernible difference in the shapes of the histograms (Figure 5).

Figure 5: Distribution of six randomly selected gene features to compare data distribution of the two studies.
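One way to make such a per-gene comparison is to bin both studies on a common set of bin edges, so that the two histograms are directly comparable despite the different sample sizes. The sketch below is an assumption about how this could be done, not the code used to produce Figure 5. The function name is hypothetical and the data are synthetic draws, with sample sizes matching the study totals given in the Data section.

```python
import numpy as np

def shared_bin_histograms(values_a, values_b, bins=20):
    """Density histograms of one gene feature for two studies on shared bins."""
    values_a = np.asarray(values_a, dtype=float)
    values_b = np.asarray(values_b, dtype=float)
    # Bin edges computed from the pooled values so both studies share them
    edges = np.histogram_bin_edges(np.concatenate([values_a, values_b]), bins=bins)
    # density=True normalizes each histogram to unit area
    density_a, _ = np.histogram(values_a, bins=edges, density=True)
    density_b, _ = np.histogram(values_b, bins=edges, density=True)
    return edges, density_a, density_b

# Synthetic standardized values for a single gene (illustration only)
rng = np.random.default_rng(0)
study_a = rng.normal(size=57)    # 49 + 8 samples, as in GSE81314
study_b = rng.normal(size=361)   # 265 + 96 samples, as in GSE89570
edges, density_a, density_b = shared_bin_histograms(study_a, study_b)
```

The two density arrays can then be plotted on the same axes for visual inspection.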

Splitting Data into Training and Testing Datasets

Because our analysis involves supervised learning methods, we need to split the data into a training set, which is used for model training and hyperparameter tuning via cross-validation, and a test set, which is used for model evaluation. We used two different training/testing split methods.

Random split with stratified sampling: We combined GSE81314 and GSE89570 into a single dataset, which was then split into 80% of samples for training and 20% for testing. Stratified sampling was used so that each cancer type is represented in approximately the same proportion in the training and testing datasets.

Figure 6: Class distribution of training/test dataset split based on stratified sampling.
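An 80/20 stratified split of this kind can be sketched with scikit-learn's `train_test_split`, whose `stratify` argument preserves class proportions. The feature matrix, class labels, class sizes, and random seed below are synthetic placeholders, not the project's data or settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 samples, 5 gene features, 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array(["healthy"] * 40 + ["cancer_a"] * 40 + ["cancer_b"] * 20)

# stratify=y keeps each class at the same proportion in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Per-class counts in the test set, for checking the proportions
test_counts = {label: int((y_test == label).sum()) for label in np.unique(y)}
```

Fixing `random_state` makes the split reproducible across runs.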

Split by study: The second training/testing split was based on the study of origin. We used the dataset from the larger study (GSE89570) as our training set and tested on the smaller dataset (GSE81314). The purpose of this approach is to assess how well the models generalize when trained on one dataset and evaluated on an independent dataset, which is a typical challenge for machine learning models. Although data harmonization was performed, the two datasets were collected at different facilities using distinct sequencing instruments and methodologies, which could lead to distribution shifts.

Figure 7: Class distribution of training/test dataset split by study.
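With the study recorded as a metadata column, the split by study reduces to a boolean mask. The sketch below assumes a combined table with a `study` column. The column names, the single gene feature, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical combined table: one row per sample
samples = pd.DataFrame({
    "study": ["GSE89570", "GSE89570", "GSE81314", "GSE89570", "GSE81314"],
    "gene_1": [0.2, -1.1, 0.5, 0.9, -0.3],   # made-up feature values
    "label": ["cancer", "healthy", "cancer", "cancer", "healthy"],
})

# Train on the larger study, test on the smaller, independent study
train = samples[samples["study"] == "GSE89570"]
test = samples[samples["study"] == "GSE81314"]

X_train, y_train = train.drop(columns=["study", "label"]), train["label"]
X_test, y_test = test.drop(columns=["study", "label"]), test["label"]
```

Because no sample from the test study is seen during training, any preprocessing fitted on data (such as scaling parameters) would ideally be estimated on the training study alone.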

References

[1] Y. Van Der Pol and F. Mouliere, “Toward the early detection of cancer by decoding the epigenetic and environmental fingerprints of cell-free DNA,” Cancer Cell, vol. 36, no. 4, pp. 350–368, 2019.
[2] Q. Gao et al., “Circulating cell-free DNA for cancer early detection,” The Innovation, vol. 3, no. 4, 2022.
[3] D. Cheishvili, L. Boureau, and M. Szyf, “DNA demethylation and invasive cancer: implications for therapeutics,” British Journal of Pharmacology, vol. 172, no. 11, pp. 2705–2715, 2015.
[4] C. Plass, S. M. Pfister, A. M. Lindroth, O. Bogatyrova, R. Claus, and P. Lichter, “Mutations in regulators of the epigenome and their connections to global chromatin patterns in cancer,” Nature Reviews Genetics, vol. 14, no. 11, pp. 765–780, 2013.
[5] D. M. Roy, L. A. Walsh, and T. A. Chan, “Driver mutations of cancer epigenomes,” Protein & Cell, vol. 5, no. 4, pp. 265–296, 2014.
[6] G. Eraslan, Ž. Avsec, J. Gagneur, and F. J. Theis, “Deep learning: new computational modelling techniques for genomics,” Nature Reviews Genetics, vol. 20, no. 7, pp. 389–403, 2019.
[7] C.-X. Song et al., “5-Hydroxymethylcytosine signatures in cell-free DNA provide information about tumor types and stages,” Cell Research, vol. 27, no. 10, pp. 1231–1242, 2017.
[8] W. Li et al., “5-Hydroxymethylcytosine signatures in circulating cell-free DNA as diagnostic biomarkers for human cancers,” Cell Research, vol. 27, no. 10, pp. 1243–1257, 2017.