Reducing Cloud Computing Costs Through Compression of Genome Sequencing Data

Student: Isaiah Brant ’25
Research Mentors: Patrick Brennan, Samuel Franklin, Ben Kelly, and Alejandro Otero-Bravo (The Institute of Genomic Medicine)

The file format currently used for storage at the Institute of Genomic Medicine costs the institute $100,000 annually. To address this, we evaluated a new file format that reduces file size and is projected to save the Institute of Genomic Medicine over $50,000. We also analyzed the new files for data loss to ensure that the change will not hinder the Institute of Genomic Medicine's workflow.

The Institute of Genomic Medicine (IGM) spends over $100,000 annually on storing BAM files. A BAM file holds the sequenced genetic data of an individual (in our case, a patient), and IGM generates a large number of BAM files, which are eventually archived. BAM files are quite large, and our goal was to reduce file size using SAMtools CRAM compression. CRAM files are reported to be 40% smaller than BAM files, so we tested different SAMtools CRAM commands on our own BAM files, both to assess whether the 40% reduction holds and to observe the quality of the resulting CRAM files, since data may be lost during compression. With the archive SAMtools CRAM command we observed a 55% reduction in file size, which led us to extend the command with multithreading. By default, SAMtools CRAM compression runs on a single thread, which can become a bottleneck during compression and negatively affect the result. To address this, we ran a series of tests with different thread counts and found that 8 threads was optimal for CRAM compression, giving a 62% reduction in file size when multithreading is combined with the archive SAMtools CRAM command. Based on these results we estimate a cost saving of over $50,000, but we did observe a loss of data during compression. SAMtools CRAM removed the NM and MD tags from the read data: the NM tag is the edit distance to the reference, and the MD tag is a string encoding the mismatched and deleted reference bases. We consulted the analysis team that works closely with these BAM files, and they verified that the removal of these tags alone will not hinder their analysis.
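The conversion described above, using the archive profile with 8 threads, could be scripted roughly as follows. This is a minimal sketch, not the exact command used in this study: the function names, file names, and reference FASTA path are hypothetical, and it assumes a SAMtools release recent enough to support the `archive` CRAM profile.

```python
import subprocess


def build_cram_command(bam_path, cram_path, reference_fasta, threads=8):
    """Build a samtools argument list that converts a BAM file to CRAM.

    All paths are hypothetical placeholders supplied by the caller.
    """
    return [
        "samtools", "view",
        "-O", "cram,archive",   # CRAM output using the archive compression profile
        "-T", reference_fasta,  # reference FASTA the reads were aligned to
        "-@", str(threads),     # additional compression threads
        "-o", cram_path,        # output CRAM file
        bam_path,               # input BAM file
    ]


def convert_to_cram(bam_path, cram_path, reference_fasta, threads=8):
    """Run the conversion; raises CalledProcessError if samtools fails."""
    command = build_cram_command(bam_path, cram_path, reference_fasta, threads)
    subprocess.run(command, check=True)
```

For example, `build_cram_command("sample.bam", "sample.cram", "ref.fa")` returns the argument list without running anything, which makes it easy to inspect the command before applying it to real data.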
As our next step, we intend to apply SAMtools CRAM compression to a cohort of BAM files to observe how the SAMtools CRAM command performs on a larger data set, with the intent of fully implementing SAMtools CRAM.
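For a cohort-level evaluation, the size reduction and the projected saving could be tallied with a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the file sizes would come from the actual BAM and CRAM files, and the saving estimate assumes that storage cost scales linearly with the amount of data stored.

```python
def percent_reduction(bam_bytes, cram_bytes):
    """Percentage by which the CRAM size is smaller than the BAM size."""
    if bam_bytes <= 0:
        raise ValueError("bam_bytes must be positive")
    return 100.0 * (bam_bytes - cram_bytes) / bam_bytes


def cohort_reduction(sizes):
    """Overall reduction for a cohort given (bam_bytes, cram_bytes) pairs."""
    total_bam = sum(bam for bam, _ in sizes)
    total_cram = sum(cram for _, cram in sizes)
    return percent_reduction(total_bam, total_cram)


def projected_saving(annual_cost, reduction_percent):
    """Estimated saving, assuming cost is proportional to stored bytes."""
    return annual_cost * reduction_percent / 100.0
```

Under the linear-cost assumption, `projected_saving(100_000, 55)` gives 55000.0 and `projected_saving(100_000, 62)` gives 62000.0, both consistent with the estimate of over $50,000.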