Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Lessons from a Petabyte-Scale Science Cloud Service Provider (CSP)
Robert Grossman
Institute for Genomics & Systems Biology
Center for Research Informatics
Computation Institute
Department of Medicine
University of Chicago & Open Data Group
September 11, 2012
The OSDC & Bionimbus Teams
• Open Science Data Cloud (OSDC) Team
  – Matt Greenway, Allison Heath, Ray Powell, Rafael Suarez.
  – Major funding for the OSDC is provided by the Gordon and Betty Moore Foundation.
• Bionimbus Team
  – Elizabeth Bartom, Casey Brown, Jason Grundstad, David Hanley, Nicolas Negre, Tom Stricker, Matt Slattery, Rebecca Spokony & Kevin White.
  – Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago and uses in part the OSDC infrastructure.
Let’s Step Back 20 Years
• 1992-96: Petabyte Access & Storage Solutions (PASS) Project for SSC.
• It developed & benchmarked federated relational, OO DB, object stores, & column-oriented data warehouse solutions at the TB-scale.
A picture of CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B.
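The arithmetic behind these estimates can be reproduced in a few lines. This is a back-of-envelope sketch that uses only the figures quoted on the slide; the 10x compression ratio is an assumption inferred from the drop from 1000 PB to about 100 PB, and the function name is hypothetical.

```python
# Back-of-envelope estimate using the figures quoted on the slide.
TB_PER_PATIENT = 1             # tumor + normal tissue, per the slide
COST_PER_GENOME_USD = 1_000    # per the slide
COMPRESSION_RATIO = 10         # assumption: inferred from 1000 PB -> ~100 PB
TB_PER_PB = 1_000              # decimal units
PB_PER_EB = 1_000


def estimate(genomes=1_000_000):
    """Return raw size, compressed size, and sequencing cost for `genomes` patients."""
    raw_pb = genomes * TB_PER_PATIENT / TB_PER_PB
    return {
        "raw_pb": raw_pb,
        "raw_eb": raw_pb / PB_PER_EB,
        "compressed_pb": raw_pb / COMPRESSION_RATIO,
        "sequencing_cost_usd": genomes * COST_PER_GENOME_USD,
    }


if __name__ == "__main__":
    for key, value in estimate().items():
        print(f"{key}: {value:,.0f}")
```

Changing the per-patient size or the compression ratio shows how sensitive the storage estimate is to either assumption.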
Big data driven discovery on 1,000,000 genomes and 1 EB of data.
• Genomic-driven diagnosis
• Improved understanding of genomic science
• Genomic-driven drug development
Precision diagnosis and treatment. Preventive health care.
Figure panels: ER+, TNBC.
With genomics, we can stratify diseases and treat each stratum differently. Source: White Lab, University of Chicago.
Clonal Evolution of Tumors
Tumors evolve temporally and spatially. Source: Mel Greaves & Carlo C. Maley, Clonal evolution in cancer, Nature, Volume 481, pages 306-313, 2012.
Combinations of Rare Alleles
Figure: penetrance (Low, Modest, Intermediate, High) plotted against allele frequency (0.001, 0.01, 0.1; Very rare, Rare, Uncommon, Common). Regions of the plot:
• Rare alleles causing Mendelian disease (high penetrance, very rare).
• Examples of high-penetrance common variants influencing common disease (high penetrance, common).
• Low-frequency variants with intermediate penetrance.
• Rare variants of small effect, very hard to identify by genetic means (low penetrance, rare).
• Most common variants implicated in common disease by GWA (modest penetrance, common).
Source: Mark McCarthy
TCGA Analysis of Lung Cancer
• 178 cases of SQCC (lung cancer)
• Matched tumor & normal
• Mean of 360 exonic mutations, 323 CNV, & 165 rearrangements per tumor
Source: The Cancer Genome Atlas Research Network, Comprehensive genomic characterization of squamous cell lung cancers, Nature, 2012, doi:10.1038/nature11404.
Some Examples of Big Data Science
Discipline          Duration     Size             # Devices
HEP - LHC           10 years     15 PB/year*      One
Astronomy - LSST    10 years     12 PB/year**     One
Genomics - NGS      2-4 years    0.5 TB/genome    1000’s
*At full capacity, the Large Hadron Collider (LHC), the world’s largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
Another way: think of data as big if you measure it in MW, as in Facebook’s Prineville Data Center is 30 MW. See opencompute.org.
An algorithm and computing infrastructure is “big-data scalable” if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
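One way to read this definition is as a weak-scaling test: data and processors grow together, and run time should stay flat. The sketch below checks that property on a set of timing measurements. The `Run` record, the function name, and the 10% tolerance are all assumptions made for illustration; they are not part of the talk.

```python
import math
from dataclasses import dataclass


@dataclass
class Run:
    racks: int       # racks of data plus corresponding processors
    data_tb: float   # total data processed, in TB
    seconds: float   # wall-clock time for the computation


def is_big_data_scalable(runs, tolerance=0.10):
    """Return True if run time stays within `tolerance` of the smallest run
    while the amount of data per rack is held constant (assumed criterion)."""
    base = min(runs, key=lambda r: r.racks)
    tb_per_rack = base.data_tb / base.racks
    for run in runs:
        # The definition only applies when each added rack brings its own data.
        if not math.isclose(run.data_tb / run.racks, tb_per_rack, rel_tol=1e-6):
            raise ValueError("data per rack must stay constant across runs")
        if run.seconds > base.seconds * (1 + tolerance):
            return False
    return True
```

For example, timings of 3600 s, 3650 s, and 3700 s on 1, 2, and 4 racks would pass, while a run that takes twice as long on four racks would not.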
Commercial Cloud Service Provider (CSP): 15 MW Data Center
Components:
• Customer facing portal
• Accounting and billing
• Monitoring, network security and forensics
• Automatic provisioning and infrastructure management
• Data center network
Scale:
• 100,000 servers
• 1 PB DRAM
• 100’s of PB of disk
• ~1 Tbps egress bandwidth
• 25 operators for a 15 MW commercial cloud
What are some of the important differences between commercial and research-focused CSPs?
Science Clouds
• POV
  – Science CSP: Democratize access to data. Integrate data to make discoveries. Long term archive.
  – Commercial CSP: As long as you pay the bill; as long as the business model holds.
• Data & Storage
  – Science CSP: Data intensive computing & HP storage.
  – Commercial CSP: Internet style scale out and object-based storage.
• Flows
  – Science CSP: Large data flows in and out.
  – Commercial CSP: Lots of small web flows.
• Streams
  – Science CSP: Streaming processing required.
  – Commercial CSP: NA.
• Accounting
  – Science CSP: Essential.
  – Commercial CSP: Essential.
• Lock in
  – Science CSP: Moving environment between CSPs essential.
  – Commercial CSP: Lock in is good.
Part 3. The Open Cloud Consortium’s Open Science Data Cloud
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
• Manages cloud computing testbeds: Open Cloud Testbed.
www.opencloudconsortium.org
Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (CSOC).
• It is a CSOC focused on supporting Science Clouds for researchers.
• Compare to a Network Operations Center (NOC).
• Both are an important part of cyber infrastructure for big data science.
Different Styles of OSDC Racks
• Design 1: Put cores over spindles.
  – Higher cost but easy to compute over all the data.
• Design 2: Separate (some of the) storage from the compute.
2012 OSDC rack design (draft)
• 950 TB / rack
• 600 cores / rack
Open Science Data Cloud (OSDC)
Components:
• Customer facing portal (Tukey)
• Science Cloud SW & Services
• Accounting and billing
• Monitoring, compliance, & security
• Automatic provisioning and infrastructure management
• Data center network
Scale:
• 3 PB in 2011, 10 PB in 2012; able to scale to 100 PB?
• ~100 Gbps bandwidth
• 5-12 operators to operate a 1-5 MW Science Cloud
OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
OSDC Philosophy
• We try to automate as much as possible (we automate the setup & operations of a rack).
• We try to write as little software as possible.
• Each project is a bit different, but in general:
  – We assign (permanent) IDs to data managed by the OSDC and manage associated metadata.
  – We assign and enforce permissions for users & groups of users and for files/objects, collections of files/objects, and collections of collections.
  – We support RESTful interfaces.
  – We do accounting for storage and core-hours.
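The ID and permission model described above can be illustrated with a small sketch. Nothing here reflects the actual OSDC implementation: the `Node` class, the use of random UUIDs as permanent IDs, and the rule that a grant on a collection is inherited by everything inside it are all assumptions chosen to keep the example short.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A file/object or a collection; collections may contain collections."""
    name: str
    parent: Optional["Node"] = None
    # Assumption: a random UUID stands in for a permanent ID.
    node_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    # Users or groups granted read access directly on this node.
    readers: set = field(default_factory=set)


def can_read(node, user, groups):
    """Allow access if the user, or any of the user's groups, holds a grant
    on the node or on any enclosing collection (assumed inheritance rule)."""
    principals = {user, *groups}
    while node is not None:
        if node.readers & principals:
            return True
        node = node.parent
    return False
```

Granting a group access to a top-level collection then covers every nested collection and file beneath it, which is one simple way to handle "collections of collections".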
Some Of Our Biggest Mistakes
• Not charging those who were the largest users of our services. This resulted in a lot of bad behavior.
• Trying to support donated equipment without adequate staff.
• Being too optimistic about when big data software would be ready for prime time.
• Some problems with big data software don’t show up at less than the full scale of the OSDC, but we have only one OSDC and it is difficult to test at this scale.
Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
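As an illustration of the billing and accounting service in this list, the sketch below totals core-hours and storage use per project. The record layouts, the TB-days storage unit, and the function name are hypothetical; the talk does not describe how the OSDC computes these figures.

```python
from collections import defaultdict


def summarize_usage(compute_jobs, storage_samples):
    """Aggregate usage per project.

    compute_jobs:    iterable of (project, cores, hours) tuples
    storage_samples: iterable of (project, terabytes, days) tuples
    Both layouts are assumptions made for this sketch.
    """
    totals = defaultdict(lambda: {"core_hours": 0.0, "tb_days": 0.0})
    for project, cores, hours in compute_jobs:
        totals[project]["core_hours"] += cores * hours
    for project, terabytes, days in storage_samples:
        totals[project]["tb_days"] += terabytes * days
    return dict(totals)


def largest_users(totals, key="core_hours"):
    """Project names sorted from heaviest to lightest user of one resource."""
    return sorted(totals, key=lambda project: totals[project][key], reverse=True)
```

Sorting by either resource makes the heaviest users visible, which is the information a provider needs before it can charge for usage.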
Number of projects, data size, and infrastructure
• 1000’s: Individual scientists & small projects. Data size: small. Public infrastructure.
• 100’s: Community based science via Science as a Service. Data size: medium to large. Shared community infrastructure.
• 10’s: Very large projects. Data size: very large. Dedicated infrastructure.
Part 4. Bionimbus
Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago.
Bionimbus Community Genomic Cloud
The researcher works with:
• Cloud for Public Data (1K genomes, PubMed, etc.)
• Personal “dropbox” + compute
Bionimbus Private Genomic Cloud
The researcher works with:
• Cloud for Public Data (1K genomes, PubMed, etc.)
• Cloud for Controlled Data (TCGA, dbGaP)
• Personal “dropbox” & compute
Bionimbus Private Biomedical Cloud
The researcher works with:
• Cloud for Public Data (1K genomes, PubMed, etc.)
• Cloud for Controlled Data (TCGA, dbGaP)
• Cloud for PHI data (Clinical Research Data Warehouse)
• Personal “dropbox” plus compute
• Scatter, gather queries
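The slide names scatter/gather queries without describing them. The sketch below shows the general pattern only: the same query is sent in parallel to each cloud the user is authorized to reach, and the per-cloud answers are collected. The function name, the callable-per-cloud interface, and the authorization set are assumptions for illustration, not the Bionimbus design.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(query, clouds, allowed):
    """Run `query` against every authorized cloud and collect the results.

    clouds:  dict mapping a cloud name to a callable taking the query and
             returning a list of rows (hypothetical interface)
    allowed: set of cloud names this user may query
    """
    targets = {name: fn for name, fn in clouds.items() if name in allowed}
    # Scatter: one worker thread per target cloud.
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in targets.items()}
        # Gather: block until each cloud has answered.
        return {name: future.result() for name, future in futures.items()}
```

Filtering the targets before scattering means a query never reaches a cloud, such as the PHI cloud, that the user is not authorized to access.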
Bionimbus workflow
• Step 1. Get Bionimbus ID (BID), assign project, private/community, public cloud, etc. (BID Generator)
• Step 2. Send sample to be sequenced. (Internal sequencers or external sequencing partner)
• Step 3a. Return raw reads.
• Step 3b. Return variant calls, CNV, annotation…
• Step 4. Secure data routing to appropriate cloud based upon BID. (Bionimbus Private Cloud UC, Bionimbus Community Cloud, Bionimbus Private Cloud XY, dbGaP, Amazon)
• Step 5. Cloud based analysis using IGSB and 3rd party tools and applications.
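Steps 1 and 4 of this workflow amount to recording a destination when an ID is issued and looking it up when data returns from the sequencer. The sketch below shows that idea; the `BID-000001` format, the destination labels, and the class names are hypothetical and are not taken from Bionimbus.

```python
import itertools
from dataclasses import dataclass

# Hypothetical labels for the destination clouds shown on the slide.
DESTINATIONS = {"private-uc", "community", "private-xy", "dbgap", "amazon"}


@dataclass(frozen=True)
class BidRecord:
    bid: str
    project: str
    destination: str


class BidRegistry:
    """Issues Bionimbus IDs (Step 1) and routes returned data by BID (Step 4)."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._records = {}

    def issue(self, project, destination):
        if destination not in DESTINATIONS:
            raise ValueError(f"unknown destination: {destination}")
        bid = f"BID-{next(self._counter):06d}"  # assumed ID format
        record = BidRecord(bid, project, destination)
        self._records[bid] = record
        return record

    def route(self, bid):
        """Return the cloud that data tagged with `bid` should be sent to.
        Raises KeyError for a BID that was never issued."""
        return self._records[bid].destination
```

Because the destination is fixed when the ID is issued, data coming back from an internal or external sequencer only needs to carry its BID to reach the right cloud.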
Bionimbus architecture
• web2py-based Front End
• Utility Cloud Services (Eucalyptus, OpenStack)
• Database Services (PostgreSQL)
• Analysis Pipelines & Re-analysis Services
• Intercloud Services (IDs, etc.; UDT, replication)
• Data Ingestion Services
• Data Cloud Services (Hadoop, Sector/Sphere)
Data tiers and candidate storage technologies
• Enrich with summary level clinical data (10-100 TB): relational databases.
• Variation (VCF) files (1-10 PB), genomic variation: NoSQL & scientific databases.
• Sequence (BAM) files (100-1000 PB), sequence data in binary form: NoSQL, DFS, file overlays?
Acknowledgements
Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities.
Additional funding for the OSDC has been provided by the following sponsors:
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at email@example.com.
For more information
• You can find some more information on my blog: rgrossman.com.
• Some of my technical papers are also available there.
• My email address is robert.grossman at uchicago dot edu
• I recently wrote a popular book about computing called The Structure of Digital Computing: From Mainframes to Big Data, which you can buy from Amazon.
Sources for images
• The image of the hard disk is from Norlando Pobre, Creative Commons.
• The image of the Facebook Prineville Data Center is from the Intel Free Press, www.flickr.com/photos/intelfreepress/6722296855/, Creative Commons BY 2.0.
• The image of the LHC is from Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732