Big Process for Big Data

Talk at the DOE CIO's Big Data Tech Summit -- the latest take on the why and wherefore of software as a service (SaaS) for science, and the Globus Online work we are doing, with various DOE examples.

  • We will hear numerous talks today on issues relating to the management and analysis of big data: data that stresses our capabilities in terms of its volume, velocity, variety, or variability. I'd like to spend my time speaking to the importance of the related problems of process. I'll do so from the perspective of the sciences, because that is where I have the most experience. As data volumes increase exponentially, the individual's ability to operate on that data has to improve exponentially too, if big data is to be an opportunity and not a curse. This is especially true as the number of data sources grows rapidly, so that even the smallest lab (or company) is exposed to the data deluge.
  • A single next-generation sequencing machine can generate 40 Gbase/day. The gap is >1000, and many more systems are appearing as people jump on the bandwagon. Meanwhile, other resources (money, people) stay flat.
  • Storage statistics synthesis
  • See http://en.wikipedia.org/wiki/File:LLNL_US_Energy_Flow_2009.png for inspiration. Data rates are in TB/day; line thicknesses are 5 TB/day/pt. Numbers: APS is 163 TB/day (preliminary data from de Carlo); ALCF is 150 TB/day (a number given in Carns et al., though presumably meant there as input *and* output); external sources are 8.6 TB/day (100 MB/s), a rough guess; the others are also rough guesses. By comparison, all observational and simulation data from the LHC is 15 PB/yr (Wikipedia), i.e., 475 MB/s.
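The LHC comparison in this note can be sanity-checked with a quick unit conversion (a back-of-envelope sketch; decimal SI units assumed throughout):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s

def pb_per_year_to_mb_per_s(pb_per_year: float) -> float:
    """Convert a data rate from PB/year to MB/s (decimal SI units)."""
    bytes_per_year = pb_per_year * 1e15
    return bytes_per_year / SECONDS_PER_YEAR / 1e6

rate = pb_per_year_to_mb_per_s(15)  # LHC: ~15 PB/year
print(f"{rate:.0f} MB/s")  # ~476 MB/s, consistent with the ~475 MB/s figure cited
```

The small discrepancy with the cited 475 MB/s comes down to rounding and the choice of decimal vs. binary units.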
  • See http://labmed.ascpjournals.org/content/40/1/5/F7.expansion.html. Old tools (PCs, spreadsheets, etc.) can't handle these issues effectively.
  • Aside: another area in which I encounter substantial and growing complexity is travel. This being consumer space, there's an app for that! A "software as a service" (aka cloud) app.
  • Small labs …. A potential solution? Outsource complex, time-consuming, mundane activities to third parties, that is, to software-as-a-service (SaaS) providers: a "research cloud" focused on process automation. Question: which steps can we outsource in that way?
  • https://plasmasim.physics.ucla.edu/research/winjum
  • Automated ingest; cataloging
  • Diagnosis; provenance
  • Geophysical variables: Wind speeds, rainfall rate, temperatures, liquid water content, raindrop shape properties, etc.
  • Big Process for Big Data

    1. Big process for big data: Process automation for data-driven science. Ian Foster, Computation Institute; Mathematics and Computer Science Division; Department of Computer Science; Argonne National Laboratory & The University of Chicago. Talk at DOE Big Data Technology Summit, Washington DC, October 9, 2012. www.ci.anl.gov www.ci.uchicago.edu
    2. Big data is not new at DOE. The Large Hadron Collider Higgs discovery was "only possible because of the extraordinary achievements of … grid computing" (Rolf Heuer, CERN DG). 15 PB/year; 173 TB/day; 500 MB/sec. LHC Computing Grid (10+ GB/sec).
    3. But it is now ubiquitous: e.g., genomics. (Kahn, Science, 331 (6018): 728-729)
    4. But it is now ubiquitous: e.g., genomics. Over 6 years: computing x10 (x30 at DOE). (Kahn, Science, 331 (6018): 728-729)
    5. But it is now ubiquitous: e.g., genomics. Over 6 years: computing x10 (x30 at DOE); genome sequencing x10^5. (Kahn, Science, 331 (6018): 728-729)
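Those 6-year multipliers imply very different annualized growth rates; a back-of-envelope calculation (illustrative arithmetic only) makes the divergence concrete:

```python
def annual_factor(total_growth: float, years: float) -> float:
    """Per-year multiplier implied by a total growth factor over `years` years."""
    return total_growth ** (1 / years)

computing = annual_factor(10, 6)     # ~1.47x per year
sequencing = annual_factor(1e5, 6)   # ~6.8x per year
gap_after_6_years = 1e5 / 10         # sequencing outpaces computing by 10,000x
print(f"computing: {computing:.2f}x/yr, sequencing: {sequencing:.2f}x/yr, "
      f"6-year gap: {gap_after_6_years:.0f}x")
```

Even with DOE's faster x30 computing growth, the gap after 6 years is still more than three orders of magnitude, which is the point of the slide.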
    6. Now ubiquitous: e.g., light sources. 18 orders of magnitude in 5 decades, versus 12 orders of magnitude in 6 decades. Credit: Linda Young.
    7. Now ubiquitous: e.g., light sources. Source: Francesco de Carlo.
    8. Local flows already exceed those of LHC. [Diagram: Argonne data flows in TB/day (estimates): the Advanced Photon Source (163 TB/day), the Argonne Leadership Computing Facility (150 TB/day), and external data sources, feeding short-term storage, long-term storage, and data analysis; other sources remain to be quantified.]
    9. Big data demands new analysis models. Today vs. desired. Source: Francesco de Carlo.
    10. It's velocity and variety as well as volume. [Diagram: genomes, transcriptomics, proteomics, metabolomics, phenotypes, and growth curves feed assembly, annotation, metabolic and regulatory models, and an integrated, reconciled model, yielding flux predictions, phenotype predictions, model predictions, hypotheses, regulon predictions, and pathway designs.] Credit: Chris Henry et al.
    11. Exponentially increasing complexity: run experiment; collect data; move data; check data; annotate data; share data; find similar data; link to literature; analyze data; publish data.
    12. (Image-only slide.)
    13. Tripit exemplifies process automation. [Diagram: me and other services: book flights, record flights, suggest hotel, book hotel, record hotel, get weather, prepare maps, share info, monitor prices, monitor flight.]
    14. Big data requires big process: run experiment; collect data; move data; check data; annotate data; share data; find similar data; link to literature; analyze data; publish data. Research IT as a service: outsourced, intuitive, integrative; secure, performant, reliable.
    15. Characterizing big process requirements. In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, and burdensome reporting. [Diagram: telescope, simulation, and next-gen genome sequencer feeding staging, ingest, registry, community repository, analysis, archive, and mirror.] Accelerate discovery and innovation by outsourcing difficult tasks.
    16. Characterizing big process requirements (continued). Data movement is a frequent challenge: between facilities, archives, and researchers; many files and large data volumes; with security, reliability, and performance.
    17. Globus Online: big process for big data. Data movement as a service: secure, automated, reliable, high-speed movement and synchronization of many files.
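The slide describes data movement as a service rather than any API. As a toy sketch of what "reliable, verified synchronization of many files" involves under the hood (this is not the Globus Online API; the function name and logic are illustrative assumptions), a transfer service must skip files already in sync, verify each copy against a checksum, and retry failures:

```python
import hashlib
import shutil
from pathlib import Path

def checksummed_sync(src: Path, dst: Path, retries: int = 3) -> int:
    """Toy mirror of a directory tree: copy each file under src to dst,
    skip files whose checksum already matches, verify every copy, and
    retry failed copies. Returns the number of files actually copied."""
    def digest(p: Path) -> str:
        return hashlib.sha256(p.read_bytes()).hexdigest()

    copied = 0
    for f in src.rglob("*"):
        if not f.is_file():
            continue
        target = dst / f.relative_to(src)
        if target.exists() and digest(target) == digest(f):
            continue  # already in sync; nothing to transfer
        target.parent.mkdir(parents=True, exist_ok=True)
        for _ in range(retries):
            shutil.copy2(f, target)
            if digest(target) == digest(f):  # verify after copy
                copied += 1
                break
        else:
            raise IOError(f"failed to verify {f} after {retries} attempts")
    return copied
```

A real service adds the parts that are hard to get right at scale: parallel streams, wide-area protocols, credential management, and restart of interrupted transfers, which is exactly why the slide argues for outsourcing it.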
    18. 6,000 users; 500 M files and 7 PB moved; 99.9% availability.
    19. Examples of Globus Online in action: K. Heitmann (ANL) moves 22 TB of cosmology data at 5 Gb/s, LANL to ANL; B. Winjum (UCLA) moves 900K-file plasma physics datasets, UCLA to NERSC; Dan Kozak (Caltech) replicates 1 PB of LIGO astronomy data for resilience; supercomputer centers, genome facilities, light sources, and universities all recommend it.
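The Heitmann transfer can be sanity-checked with simple arithmetic (a rough sketch; decimal TB and a sustained line rate assumed):

```python
def transfer_hours(terabytes: float, gbit_per_s: float) -> float:
    """Hours to move `terabytes` (decimal TB) at a sustained rate in Gb/s."""
    bits = terabytes * 1e12 * 8
    return bits / (gbit_per_s * 1e9) / 3600

print(f"{transfer_hours(22, 5):.1f} h")  # ~9.8 h for the 22 TB cosmology transfer
```

At 5 Gb/s the 22 TB dataset moves in under half a day, versus weeks of shipping disks or babysitting scripted retries, which is the productivity argument behind these examples.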
    20. Automation expands use of networks. [Plot: transfers from January to June 2012; transfer size in bytes vs. time, with circle size proportional to log(transfer rate). Red: NERSC/LBL/ESnet; green: ORNL/BNL; blue: ANL; yellow: FNAL; grey: other.]
    21. Need much more than data movement. [Diagram: telescope, simulation, and next-gen genome sequencer feeding staging, ingest, registry, community repository, analysis, archive, and mirror.] Accelerate discovery and innovation by outsourcing difficult tasks.
    22. Need much more than data movement: ingest, cataloging, integration; sharing, collaboration, annotation; identity, groups, security; analysis, simulation, visualization; … Accelerate discovery and innovation by outsourcing difficult tasks.
    23. Earth System Grid: data movement. Outsource data transfer: client data download; replication between sites. No ESGF client software needed. 20+ times faster than HTTP. earthsystemgrid.org
    24. Kbase: identity, groups, data movement. kbase.science.energy.gov
    25. Genomics: data movement and analysis. Galaxy-based workflow management: web-based UI; drag-and-drop workflow creation; easily add new analytical tools; tools run on scalable cloud computers. Globus Online provides high-performance, fault-tolerant, secure file transfer between all data endpoints: public data, sequencing centers, lab storage, local cluster/cloud, and Galaxy in the cloud. Data management plus data analysis. Source: Ravi Madduri.
    26. Integrating observation and simulation. (1) Cloud properties and precipitation characteristics in large-scale models and cloud-resolving models (e.g., CMIP5 models, GCRM). (2) Construct structured 4-D atmospheric state ("CAN"); retrieve and compare; analytics. (3) Precipitating storm structures; storm lifecycles; statistical representation of storm-scale properties; predictive cloud models. [Figure: percentage of mapped radar domain in Darwin with returns >10 dBz over the period 19 to 22 January 2006.] Scott Collis.
    27. Integrating observation and simulation. Level 1 (PBs), Level 2 (TBs), Level 3 (GBs). Salman Habib, Katrin Heitmann.
    28. Integrating observation and simulation. Salman Habib, Katrin Heitmann.
    29. In summary: big process for big data. Accelerate discovery and innovation worldwide by providing research IT as a service. Outsource time-consuming tasks to: provide large numbers of researchers with unprecedented access to powerful tools; enable a massive shortening of cycle times in time-consuming research processes; and reduce research IT costs via economies of scale. Accelerate existing science; enable new science.
    30. Thank you! foster@anl.gov, www.ci.anl.gov, www.mcs.anl.gov, www.globusonline.org
