February 12th, 2018
The Broad Institute of MIT and Harvard, the pioneering biomedical and genomics research center, has been a valued collaborator since the early days of Google Cloud Platform (GCP). We helped them move their genomics data storage and analysis from an on-premises data center to the cloud; they helped us make GCP a useful and cost-effective place to conduct some of the most important research that science has to offer.
It’s been a privilege to support the Broad in their mission to advance our understanding of the biology and treatment of human disease, and to help lay the groundwork for a new generation of therapies. Today, we’re thrilled to be a part of an important milestone for Broad and the genomics community: On GCP, the cost of running Broad’s GATK Best Practices pipeline has been reduced to a little over $5 per genome.
Broad is one of the largest genomic sequencing centers around; on average a human genome goes onto a sequencer every 10 minutes. To date, Broad has processed more than 76,000 genomes, generating 24TB of data per day, and stores more than 36PB of data on GCP.
Once genomic data comes off a sequencer, processing and analysis is done in several steps. Those steps are strung together in an automated pipeline called GATK Best Practices.
When Broad first brought the GATK Best Practices pipeline to GCP in 2015, the cost to run it was $45. Since then, Broad has steadily brought down the cost by twiddling a variety of GCP knobs and dials to arrive at a 90 percent cost reduction while maintaining (and even improving) the quality of the output. Here’s a sampling of what they did:
- Task splitting. In Broad’s on-premises environment, disk ate the lion’s share of its computing budget, so it strung together compute tasks so they could be performed by the same machine, minimizing disk access. However, any two tasks could have vastly different compute and memory usage profiles. In the cloud, Broad drove higher utilization and lower costs by appropriately sizing the machines to the task at hand (even though it involved more data transfers between machines). This resulted in a cost savings of approximately 30 percent.
- Preemptible VMs. Broad’s genomics pipeline is short-lived and fault-tolerant, so it can take advantage of Preemptible VMs without worrying about shutdowns due to demand spikes. Broad programmed the pipeline to use Preemptible VMs, which are 80 percent cheaper than non-preemptible instances, by default. This resulted in an extra 35 percent savings.
- Persistent Disk provides durable HDD and SSD storage to GCP instances, and allows Broad to match its workloads with appropriately-sized and performant storage. Behold another 15 percent savings.
- Data streaming. Verily Life Sciences contributed code to the HTSJDK, a library that underlies Broad’s GATK. HTSJDK lets algorithms read data directly from a Google Cloud Storage bucket, so the job requires less disk space. The cost of Broad’s production pipeline is now down to $5!
Most recently, Broad released the open-source version 4.0 of GATK, and researchers of all backgrounds, including those without computational training, can access the GATK Best Practices Pipeline through FireCloud, Broad’s cloud-based analysis portal. The pipeline includes all the cost optimizations that Broad has achieved with its production pipeline, and is ready to run on preloaded example datasets.
And while all users can access FireCloud at no cost, cloud providers do charge their own fees for data storage and processing. To help make it easier for more researchers to leverage this critical work, we’re offering a $250 credit per user toward compute and storage costs to the first 1,000 applicants. You can learn more on the FireCloud website.
We continue to be amazed by the progress the Broad Institute enables in the field of genomics, and are so happy it chose GCP as its cloud infrastructure partner. Learn more about the GATK Best Practices pipeline by reading the Broad’s blog post.