The Open Science Data Cloud: Empowering the Long Tail of Science
A 501(c)(3) not-for-profit operating clouds for science.
October 12, 2012
Robert L. Grossman, University of Chicago and Open Cloud Consortium
Question 1. What is the cyberinfrastructure required to manage, analyze, archive and share big data? Call this analytic infrastructure.
Question 2. What is the analog of the GLIF* for analytic infrastructure?
*GLIF (www.glif.is), the Global Lambda Integrated Facility, is an international virtual organization that promotes the paradigm of lambda networking. GLIF provides lambdas internationally as an integrated facility to support data-intensive scientific research, and supports middleware development for lambda networking.
Number of projects, by kind of project, data size and infrastructure:
• 1000's: individual scientists & small projects; small data; public infrastructure
• 100's: community based science via Science as a Service; medium to large data; shared community infrastructure
• 10's: very large projects; very large data; dedicated infrastructure
The long tail of data science: a few large data science projects, many smaller data science projects.
Part 1. What Instrument Do We Use to Make Big Data Discoveries? How do we build a “datascope”?
Another way (opencompute.org): think of data as big if you measure it in MW, as in "Facebook’s Prineville Data Center is 30 MW."
An algorithm and computing infrastructure is “big-data scalable” if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
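The definition above can be illustrated with a minimal sketch, assuming an idealized model in which each rack holds a fixed amount of data and scans it at a fixed rate, with no coordination overhead. The function name and all numbers are hypothetical and chosen only for illustration.

```python
def completion_time_hours(racks, tb_per_rack, tb_per_hour_per_rack):
    """Idealized time to process all data when every rack handles its own share.

    Assumption: no cross-rack coordination cost, so aggregate throughput
    grows linearly with the number of racks.
    """
    total_data_tb = racks * tb_per_rack
    total_throughput_tb_per_hour = racks * tb_per_hour_per_rack
    return total_data_tb / total_throughput_tb_per_hour


# Hypothetical figures: 950 TB per rack, 10 TB/hour processed per rack.
one_rack = completion_time_hours(1, 950, 10)
ten_racks = completion_time_hours(10, 950, 10)
print(one_rack, ten_racks)  # same time, ten times the data
```

Under this model the computation is big-data scalable. A real system falls short when a shared bottleneck (for example, the network or a central scheduler) makes the time grow as racks are added.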
Commercial Cloud Service Provider (CSP): 15 MW Data Center
• Monitoring, network security and forensics
• Accounting and billing
• Customer facing portal
• Automatic provisioning and infrastructure management
• 100,000 servers, 1 PB DRAM, 100's of PB of disk
• ~1 Tbps egress bandwidth
• Data center network
• 25 operators for a 15 MW commercial cloud
My vote for a datascope: a (boutique) data center scale facility with a big-data scalable analytic infrastructure. What would a global integrated facility for datascopes look like?
Some Examples of Big Data Science
Discipline          Duration     Size             # Devices
HEP - LHC           10 years     15 PB/year*      One
Astronomy - LSST    10 years     12 PB/year**     One
Genomics - NGS      2-4 years    0.5 TB/genome    1000's
*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
Datascope – Science Cloud Service Provider (Sci CSP)
[Diagram: data scientist, Sci CSP services]
What are some of the important differences between commercial CSPs and research-focused Sci CSPs?
Science Clouds: Science CSP vs. Commercial CSP
• POV. Science CSP: democratize access to data; integrate data to make discoveries; long term archive. Commercial CSP: as long as you pay the bill; as long as the business model holds.
• Data & Storage. Science CSP: data intensive computing & HP storage. Commercial CSP: Internet style scale out and object-based storage.
• Flows. Science CSP: large data flows in and out. Commercial CSP: lots of small web flows.
• Streams. Science CSP: streaming processing required. Commercial CSP: NA.
• Accounting. Science CSP: essential. Commercial CSP: essential.
• Lock in. Science CSP: moving environment between CSPs essential. Commercial CSP: lock in is good.
Part 2. The Open Cloud Consortium’s Open Science Data Cloud
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
• Manages cloud computing testbeds: Open Cloud Testbed.
www.opencloudconsortium.org
OCC Members & Partners
• Companies: Cisco, Yahoo!, Citrix, …
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, …
• Federal agencies and labs: NASA, LLNL, ORNL
• International Partners: AIST (Japan), U. Edinburgh, U. Amsterdam, …
• Partners: National Lambda Rail
OCC 2011 Resources
Resource                 Type            Comments
OSDC Adler & Sullivan    Utility Cloud   1248 cores and 0.4 PB disk
OCC-Y                    Data Cloud      928 cores and 1.0 PB disk
OCC-Matsu                Mixed           1 rack
OSDC Root                Storage         0.8 PB
• OSDC-Adler, Sullivan & Root will more than double in size in 2012.
Bionimbus WG bionimbus.opensciencedatacloud.org (biological data)
One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B.
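The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 PB = 1000 TB, 1 EB = 1000 PB) and a 10:1 compression ratio, which is what the step from 1000 PB to about 100 PB implies; the variable names are illustrative.

```python
GENOMES = 1_000_000
TB_PER_PATIENT = 1            # tumor + normal tissue, per the slide
COST_PER_GENOME_USD = 1000
COMPRESSION_RATIO = 10        # assumption implied by 1000 PB -> ~100 PB

total_tb = GENOMES * TB_PER_PATIENT
total_pb = total_tb / 1000    # decimal units assumed
total_eb = total_pb / 1000
compressed_pb = total_pb / COMPRESSION_RATIO
sequencing_cost_usd = GENOMES * COST_PER_GENOME_USD

print(f"{total_pb:.0f} PB = {total_eb:.0f} EB raw, "
      f"{compressed_pb:.0f} PB compressed, "
      f"${sequencing_cost_usd / 1e9:.0f}B to sequence")
```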
Big data driven discovery on 1,000,000 genomes and 1 EB of data.
• Genomic-driven diagnosis
• Improved understanding of genomic science
• Genomic-driven drug development
Precision diagnosis and treatment. Preventive health care.
Project Matsu WG: Clouds to Support Earth Science
matsu.opensciencedatacloud.org
UDR
• UDT is a high performance network transport protocol.
• UDR = rsync + UDT
• It is easy for an average systems administrator to keep 100's of TB of distributed data synchronized.
• We are using it to distribute about 1 PB from the OSDC.
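To give a sense of scale for distributing about 1 PB, here is a back-of-the-envelope sketch of transfer time over a 10 Gbps wide area link (the link speed cited in the acknowledgements). It does not model UDT or UDR themselves; the function name and the `efficiency` parameter are hypothetical, and decimal units are assumed.

```python
def transfer_days(data_bytes, link_gbps, efficiency=1.0):
    """Days needed to move data_bytes over a link of link_gbps gigabits/s.

    efficiency is the assumed fraction of the link the transfer
    actually sustains (1.0 means the link is fully used).
    """
    bits = data_bytes * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400


PB = 10**15  # decimal petabyte
ideal = transfer_days(PB, 10)        # link fully used
half = transfer_days(PB, 10, 0.5)    # hypothetical 50% sustained utilization
print(f"{ideal:.1f} days ideal, {half:.1f} days at 50% utilization")
```

Even in the ideal case the transfer takes more than a week, which is why sustained utilization of the link matters at this scale.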
OpenFlow-Enabled Hadoop WG
• When running Hadoop, some map and reduce jobs take significantly longer than others.
• These are stragglers and can significantly slow down a MapReduce computation.
• Stragglers are common (dirty secret about Hadoop).
• Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers.
• We have a testbed for a wide area version of this project.
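Why stragglers matter can be seen in a toy model: a map or reduce phase finishes only when its slowest task finishes. The sketch below is not the working group's method. It assumes, purely for illustration, that extra bandwidth acts as a fixed speedup on any task running well above the median; the function names, thresholds and task times are all hypothetical.

```python
import statistics


def phase_time(task_seconds):
    """A phase ends when its slowest task ends."""
    return max(task_seconds)


def boost_stragglers(task_seconds, threshold_factor=1.5, speedup=2.0):
    """Speed up tasks slower than threshold_factor x the median duration.

    Assumption: additional bandwidth translates into a flat speedup
    for the straggling task; real behavior depends on the bottleneck.
    """
    median = statistics.median(task_seconds)
    return [t / speedup if t > threshold_factor * median else t
            for t in task_seconds]


tasks = [60, 62, 58, 61, 180]       # hypothetical durations; one straggler
before = phase_time(tasks)
after = phase_time(boost_stragglers(tasks))
print(before, after)
```

In this example a single slow task triples the phase time, and helping only that task halves it.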
OSDC PIRE Project
We select OSDC PIRE Fellows (US citizens or permanent residents):
• We give them tutorials and training on big data science.
• We provide them fellowships to work with OSDC international partners.
• We give them preferred access to the OSDC.
Nominate your favorite scientist as an OSDC PIRE Fellow.
www.opensciencedatacloud.org (look for PIRE)
Open Science Data Cloud (OSDC)
• Accounting and billing
• Monitoring, compliance, & security
• Customer facing portal (Tukey)
• Science cloud SW & services
• Automatic provisioning and infrastructure management
• 3 PB (2011), 10 PB (2012); able to scale to 100 PB?
• ~100 Gbps bandwidth
• Data center network
• 5-12 operators to operate a 1-5 MW science cloud
• OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (or CSOC).
• It is a CSOC focused on supporting Science Clouds for researchers.
• Compare to a Network Operations Center or NOC.
• Both are an important part of cyberinfrastructure for big data science.
OSDC Racks
• How quickly can we set up a rack?
• How efficiently can we operate a rack? (racks/admin)
2012 OSDC rack design (draft)
• 950 TB / rack
• 600 cores / rack
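As a rough sizing sketch using the draft rack design, the snippet below estimates how many racks the 10 PB (2012) and 100 PB capacity figures mentioned for the OSDC would need. It assumes decimal units and raw capacity only, ignoring replication, spares and file system overhead; the function name is hypothetical.

```python
import math

TB_PER_RACK = 950      # draft 2012 rack design
CORES_PER_RACK = 600


def racks_needed(capacity_tb, tb_per_rack=TB_PER_RACK):
    """Smallest whole number of racks providing capacity_tb of raw disk."""
    return math.ceil(capacity_tb / tb_per_rack)


racks_10pb = racks_needed(10_000)     # 10 PB in TB, decimal units assumed
racks_100pb = racks_needed(100_000)   # 100 PB in TB
print(racks_10pb, racks_10pb * CORES_PER_RACK)
print(racks_100pb, racks_100pb * CORES_PER_RACK)
```

Dividing the rack count by the number of operators gives the racks/admin figure the slide asks about.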
Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
Please Join Us! (Help keep us from making even more mistakes.)
Acknowledgements
Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities.
Additional funding for the OSDC has been provided by the following sponsors:
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at email@example.com.
For more information
• You can find some more information on my blog: rgrossman.com.
• Some of my technical papers are also available there.
• My email address is robert.grossman at uchicago dot edu.
Center for Research Informatics