Big Process for Big Data @ PNNL, May 2013

Slide notes
  • The Computation Institute (or CI): a joint initiative between UChicago and Argonne National Lab, and a place where researchers from multiple disciplines come together and engage in research that is fundamentally enabled by computation. More recently we've been talking about it as the home of the research cloud, and I'll describe what we mean by that throughout this talk.
  • Here are some of the areas where we have active projects. Focus on areas of particular interest to I2/ESnet, namely HEP, climate change, and genomics (up and coming).
  • And the reason is pretty obvious. This chart and others like it are becoming a cliché in next-gen sequencing and big data presentations, but the point is that while Moore's law translates to roughly a 10x increase in processor power, data volumes are growing many orders of magnitude faster. And meanwhile, other necessary resources [money, people] are staying pretty flat. So we have a crisis, and we hear that the magic bullet of "the cloud" is going to solve it. Well, as far as cost goes, clouds are helping, but many issues remain.
  • 173 TB/day
  • Another example is the Earth System Grid, which provides data and tools to over 20,000 climate scientists around the world. So what's notable about these examples? It's the combination of the amount of data being managed and the number of people that need access to that data. We heard Martin Leach tell us that the Broad Institute hit 10 PB of spinning disk last year, and that it's not a big deal. To a select few, these numbers are routine. And for the projects I just talked about, the IT infrastructure is in place. They have robust production solutions, built by substantial teams at great expense through sustained, multi-year efforts: application-specific solutions, built mostly on common/homogeneous technology platforms.
  • The point is, the 1% of projects are in good shape
  • But what about the 99%? There are hundreds of thousands of small and medium labs around the world that are faced with similar data management challenges. They don't have the resources to deal with these challenges, so their research suffers, and over time many may become irrelevant. So at the CI we asked ourselves a question (many questions, actually) about how we can help avert this crisis. And one question that kind of sums up a lot of our thinking is…
  • We can't just expect to throw more people and $$$ at the problem … we're already seeing the limits.
  • Many in this room are probably users of Dropbox or similar services for keeping their files synced across multiple machines. Well, the scientific research equivalent is a little different.
  • We figured it needs to allow a group of collaborating researchers to do many or all of these things with their data: not just the 2 GB of PowerPoints or the 100 GB of family photos and videos, but the petabytes and exabytes of data that will soon be the norm for many.
  • So how would such a dropbox for science be used? Let's look at a very typical scientific data workflow. Data is generated by some instrument (a sequencer at JGI, or a light source like APS/ALS). Since these instruments are in high demand, users have to get their data off the instrument to make way for the next user, so the data is typically moved from a staging area to some type of ingest store. Et cetera for analysis, sharing of results with collaborators, annotation with metadata for future search, backup/sync/archival, …
  • Started with seemingly simple/mundane task of transferring files …etc.
  • And when we spoke with IT folks at various research communities they insisted that some things were not up for negotiation
  • This image shows a 3D rendering of a Shewanella biofilm grown on a flat plastic substrate in a Constant Depth bioFilm Fermenter (CDFF).  The image was generated using x-ray microtomography at the Advanced Photon Source, Argonne National Laboratory.   
  • http://datasets.globus.org/carl-catalog/query/propertyA=value1
  • http://www.blyberg.net/card-generator/ and http://www.sciencemag.org/content/332/6025/88/F1.large.jpg

    1. Big process for big data. Ian Foster, foster@anl.gov
    2. Thanks to great colleagues and collaborators:
       • Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago
       • Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI
       • Francesco de Carlo, Chris Jacobsen, and others at Argonne
       • Kerstin Kleese-van Dam, Carina Lansing, and others at PNNL
    3. The Computation Institute = UChicago + Argonne = Cross-disciplinary nexus = Home of the Discovery Cloud
    4. (no slide text extracted)
    5. x10 in 6 years; x10^5 in 6 years. Will data kill genomics? Kahn, Science, 331 (6018): 728-729
    6. Moore's Law for X-Ray Sources: 18 orders of magnitude in 5 decades! (12 orders of magnitude in 6 decades!)
    7. Large Hadron Collider. Higgs discovery "only possible because of the extraordinary achievements of … grid computing" (Rolf Heuer, CERN DG)
    8. (no slide text extracted)
    9. 1.2 PB of climate data delivered to 23,000 users
    10. We have exceptional infrastructure for the 1%
    11. What about the 99%?
    12. Big science. Small labs.
    13. Need: a new way to deliver research cyberinfrastructure. Frictionless. Affordable. Sustainable.
    14. We asked ourselves: what if the research workflow could be managed as easily as … our pictures … home entertainment … our e-mail?
    15. What makes these services great? Great user experience + high-performance (but invisible) infrastructure
    16. We aspire (initially) to create a great user experience for research data management. What would a "dropbox for science" look like?
    17. BIG DATA: Collect • Move • Sync • Share • Analyze • Annotate • Publish • Search • Backup • Archive
    18. It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA … but in reality it's often very challenging. [Diagram: data flowing through registry, staging store, ingest store, analysis store, community store, archive, and mirror, with failures along the way: "Quota exceeded!", "Expired credentials!", "Network failed. Retry!", "Permission denied!"]
    19. Automation and outsourcing are key. Automation is required to apply more sophisticated methods to far more data.
    20. Automation and outsourcing are key. Automation is required to apply more sophisticated methods to far more data. Outsourcing is needed to achieve economies of scale in the use of automated methods.
    21. Building a discovery cloud: research strategy
        • Identify a time-consuming activity that appears amenable to automation and outsourcing
        • Implement the activity as a high-quality, low-touch SaaS solution, leveraging commercial IaaS for high reliability and economies of scale
        • Evaluate
        • Extract common elements as a research automation platform
        • Repeat
        Bonus question: identify methods for delivering SaaS solutions sustainably (software as a service, platform as a service, infrastructure as a service)
    22. BIG DATA: Collect • Move • Sync • Share • Analyze • Annotate • Publish • Search • Backup • Archive
    23. Capabilities delivered using the Software-as-a-Service (SaaS) model: Collect • Move • Sync • Share (highlighted from the full list: Collect • Move • Sync • Share • Analyze • Annotate • Publish • Search • Backup • Archive)
    24. [Diagram: Data Source → Data Destination] (1) User initiates transfer request; (2) Globus Online moves/syncs files; (3) Globus Online notifies user
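To make the three-step flow on slide 24 concrete, here is a minimal Python sketch of submitting a transfer as a single REST call. The base URL, request fields, and endpoint names (lab#instrument, uni#cluster) are illustrative placeholders, not the documented Globus Transfer API, and token acquisition is omitted.

```python
import requests

# Illustrative placeholders only; not the documented service endpoints.
TRANSFER_API = "https://transfer.api.example.org/v0.10"
TOKEN = "..."  # access token obtained out of band (e.g. via OAuth)

def submit_transfer(src_endpoint, src_path, dst_endpoint, dst_path):
    """Step 1: the user initiates a transfer request."""
    task = {
        "source_endpoint": src_endpoint,
        "destination_endpoint": dst_endpoint,
        "items": [{"source_path": src_path,
                   "destination_path": dst_path,
                   "recursive": True}],
        "notify_on_succeeded": True,  # step 3: the service notifies the user
    }
    resp = requests.post(f"{TRANSFER_API}/transfer", json=task,
                         headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()
    # Step 2 happens server-side: the service moves/syncs the files,
    # retrying transient failures, and fires the notification when done.
    return resp.json()["task_id"]

task_id = submit_transfer("lab#instrument", "/staging/run42/",
                          "uni#cluster", "/ingest/run42/")
```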
    25. [Diagram: sharing from a Data Source] (1) User A selects file(s) to share, selects user/group, sets share permissions; (2) Globus Online tracks shared files; no need to move files to cloud storage! (3) User B logs in to Globus Online and accesses the shared file
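Slide 25's in-place sharing can be pictured the same way: user A posts an access rule rather than copying data. Again a hedged sketch; the /access path, field names, and identities are assumptions for illustration, reusing the placeholder base URL and token from the previous sketch.

```python
import requests

TRANSFER_API = "https://transfer.api.example.org/v0.10"  # placeholder, as above
TOKEN = "..."

def share_path(endpoint, path, grantee, permissions="r"):
    """User A grants user B access to a path; no data moves to cloud storage."""
    rule = {"principal": grantee,        # user B, as known to the identity service
            "path": path,
            "permissions": permissions}  # "r" = read-only share
    resp = requests.post(f"{TRANSFER_API}/endpoint/{endpoint}/access",
                         json=rule,
                         headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()
    return resp.json()

# User B then logs in and reads /ingest/run42/ directly from the source endpoint.
share_path("lab#instrument", "/ingest/run42/", "userB@example.org")
```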
    26. Extreme ease of use
        • InCommon, OAuth, OpenID, X.509, …
        • Credential management
        • Group definition and management
        • Transfer management and optimization
        • Reliability via transfer retries
        • Web interface, REST API, command line
        • One-click "Globus Connect" install
        • 5-minute Globus Connect Multi User install
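The "reliability via transfer retries" bullet is handled inside the service, but the idea is easy to picture client-side: a generic exponential-backoff wrapper, shown here purely as an illustration and not as anything Globus Online exposes.

```python
import time

def with_retries(fn, attempts=5, base_delay=2.0):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError) as exc:  # transient failures only
            if attempt == attempts - 1:
                raise                                   # give up after the last try
            delay = base_delay * (2 ** attempt)
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

# e.g. with_retries(lambda: submit_transfer(...)), using the earlier sketch
```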
    27. Early adoption is encouraging
    28. Early adoption is encouraging: 8,000 registered users, >100 daily; ~16 PB moved, ~1B files; 10x (or better) performance vs. scp; 99.9% availability; entirely hosted on Amazon
    29. [Chart: distribution of transfer task durations, 2011 vs. 2012, on a log scale from ~0.1 s to 10^7 s, with reference lines at 1 second, 1 minute, 1 hour, 1 day, and 1 week]
    30. We benefit greatly from ESnet's "Science DMZ". Three key components, all required:
        • "Friction free" network path: highly capable network devices (wire-speed, deep queues); virtual circuit connectivity option; security policy and enforcement specific to science workflows; located at or near the site perimeter if possible
        • Dedicated, high-performance Data Transfer Nodes (DTNs): hardware, operating system, and libraries optimized for transfer; optimized data transfer tools (Globus Online, GridFTP)
        • Performance measurement/test node: perfSONAR
        Details at http://fasterdata.es.net/science-dmz/
    31. K. Heitmann (Argonne) moves 22 TB of cosmology data LANL → ANL at 5 Gb/s
    32. B. Winjum (UCLA) moves 900K-file plasma physics datasets UCLA → NERSC
    33. Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience
    34. Erin Miller (PNNL) collects data at the Advanced Photon Source, renders at PNNL, and views at ANL (credit: Kerstin Kleese-van Dam)
    35. BIG DATA: Collect • Move • Sync • Share • Analyze • Annotate • Publish • Search • Backup • Archive
    36. BIG DATA: Collect • Move • Sync • Share • Analyze • Annotate • Publish • Search • Backup • Archive
    37. Globus Online already does a lot: Globus Nexus (identity, group, profile), Transfer Service, Sharing Service, Globus Online APIs, Globus Connect, Globus Toolkit
    38. Data management SaaS (Globus) + next-gen sequence analysis pipelines (Galaxy) + cloud IaaS (Amazon) = flexible, scalable, easy-to-use genomics analysis for all biologists: globus genomics
    39. A platform for integration
    40. A platform for integration
    41. A platform for integration
    42. We are also adding capabilities: Globus Nexus (identity, group, profile), Transfer Service, Sharing Service, Globus Online APIs, Globus Connect, Globus Toolkit
    43. More capabilities underway: Dataset Services, plus Globus Nexus (identity, group, profile), Transfer Service, Sharing Service, Globus Online APIs, Globus Connect, Globus Toolkit
    44. Expanding Globus Online services
        • Ingest and publication: imagine a Dropbox that not only replicates, but also extracts metadata, catalogs, converts (see the sketch below)
        • Cataloging: virtual views of data based on user-defined and/or automatically extracted metadata
        • Computation: associate computational procedures, orchestrate applications, catalog results, record provenance
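As a rough picture of the "ingest and publication" bullet above, the sketch below shows what an on-arrival hook might do: extract metadata, register it in a catalog, and flag a format conversion. All three helpers are hypothetical stand-ins for site-specific code, not Globus APIs.

```python
from pathlib import Path

def extract_metadata(path: Path) -> dict:
    """Hypothetical: a real version might read HDF5/CDF attributes or headers."""
    return {"name": path.name,
            "size": path.stat().st_size,
            "format": path.suffix.lstrip(".").lower()}

def register_in_catalog(subject: str, tags: dict) -> None:
    """Hypothetical: a real version would write <subject, name, value> tags
    to a hosted catalog (see slide 48)."""
    print(f"catalog <- {subject}: {tags}")

def on_ingest(path: Path) -> None:
    """Replicate-plus: extract metadata, catalog it, and normalize the format."""
    tags = extract_metadata(path)
    register_in_catalog(path.name, tags)
    if tags["format"] not in ("h5", "hdf5"):
        print(f"would convert {path} to HDF5")  # conversion step elided

on_ingest(Path(__file__))  # demo call on any existing file
```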
    45. Looking deeply at how researchers use data
        • A single research question often requires the integration of many data elements, which are: in different locations; in different formats (Excel, text, CDF, HDF, …); described in different ways
        • The best grouping can vary during an investigation: longitudinal, vertical, cross-cutting
        • But it always needs to be operated on as a unit: share, annotate, process, copy, archive, …
    46. How do we manage data today?
        • Often, a curious mix of ad hoc methods: organize in directories using file and directory naming conventions; capture status in README files, spreadsheets, notebooks
        • Time-consuming, complex, error prone
        Why can't we manage our data like we manage our pictures and music?
    47. Introducing the dataset (see the sketch below)
        • Group data based on use, not location: logical grouping to organize, reorganize, search, and describe usage
        • Tag with characteristics that reflect content: capture as much existing information as we can
        • … or that reflect current status in the investigation: stage of processing, provenance, validation, …
        • Share datasets for collaboration: control access to data and metadata
        • Operate on datasets as units: copy, export, analyze, tag, archive, …
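A minimal sketch of what such a dataset abstraction might look like in code, purely illustrative and not a Globus data type: a logical grouping defined by use rather than location, carrying tags, and shared and operated on as a unit. The member path and collaborator identity are made up for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """Logical grouping of data elements, independent of where they live."""
    name: str
    members: list = field(default_factory=list)   # URLs/paths in any location
    tags: dict = field(default_factory=dict)      # content, status, provenance, ...
    readers: set = field(default_factory=set)     # collaborators with access

    def add(self, url: str) -> None:
        self.members.append(url)        # group by use, not location

    def tag(self, name: str, value: str) -> None:
        self.tags[name] = value         # e.g. stage of processing, validation status

    def share(self, identity: str) -> None:
        self.readers.add(identity)      # control access to data and metadata

ds = Dataset("mydata42")
ds.add("lab#instrument:/ingest/run42/scan.h5")
ds.tag("beamline", "2BM")
ds.share("collaborator@example.org")
```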
    48. Builds on catalog as a service (based on the USC Tagfiler project, C. Kesselman et al.)
        Approach:
        • Hosted user-defined catalogs
        • Based on a tag model: <subject, name, value>
        • Optional schema constraints
        • Integrated with other Globus services
        Three REST APIs (exercised in the sketch below):
        • /query/ : retrieve subjects
        • /tags/ : create, delete, retrieve tags
        • /tagdef/ : create, delete, retrieve tag definitions
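The speaker notes give an example query URL (http://datasets.globus.org/carl-catalog/query/propertyA=value1), which suggests how the three APIs compose. The sketch below exercises them with the "mydata42" tags from slide 50; the HTTP verbs, request bodies, and missing authentication are assumptions rather than the documented Tagfiler/catalog interface.

```python
import requests

CATALOG = "http://datasets.globus.org/carl-catalog"  # base URL from the notes' example

# /tagdef/: define a tag (name plus an optional type/schema constraint) -- assumed body
requests.put(f"{CATALOG}/tagdef/beamline", json={"type": "text"})

# /tags/: attach <subject, name, value> tags to a subject (here, dataset "mydata42")
requests.put(f"{CATALOG}/tags/mydata42",
             json={"owner": "Francesco", "type": "3dtomo",
                   "format": "HDF5", "beamline": "2BM"})

# /query/: retrieve subjects whose tags match, mirroring the notes' example URL
resp = requests.get(f"{CATALOG}/query/beamline=2BM")
print(resp.json())
```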
    49. Multi-scale imaging at APS
        • Beamline 2-BM-B: ~1.5 µm resolution
        • Beamline 32-ID-C: 20-50 nm resolution
        • Each pipeline: storage → image processing (noise removal, etc.) → tomographic reconstruction → visual inspection → selection; selections feed multi-scale image fusion → visual inspection
        • Data rates: up to 100 fps, 2K x 2K, 16 bits, 11 GB raw data; 1,500 fps, 2K x 2K, 16 bits, 1 min readout, 11 GB raw data
    50. [Diagram: dataset lifecycle for "mydata42" (owner: Francesco, type: 3dtomo, format: HDF5, beamline: 2BM)]
        Define dataset → infer type → extract metadata → populate catalog(s) → locate datasets → access files → analyze → catalog derived products → transfer/schedule
        Cross-cutting: orchestration, organization, record provenance, annotate/share/browse/search
    51. (no slide text extracted)
    52. (no slide text extracted)
    53. (no slide text extracted)
    54. Building a discovery cloud: research strategy
        • Identify a time-consuming activity that appears amenable to automation and outsourcing
        • Implement the activity as a high-quality, low-touch SaaS solution, leveraging commercial IaaS for high reliability and economies of scale
        • Evaluate
        • Extract common elements as a research automation platform
        • Repeat
        Bonus question: identify methods for delivering SaaS solutions sustainably (software as a service, platform as a service, infrastructure as a service)
    55. Our challenge: sustainability. We are a non-profit service provider to the non-profit research community.
    56. Globus Online Provider Plans: support ongoing operations, offer value-added capabilities, engage more closely with users
    57. Provider Plans offer… (starting at $20k per year)
        • Provider endpoints with sharing
        • Multiple GridFTP servers per endpoint
        • Branded web sites
        • Alternate identity provider
        • Usage reporting
        • MSS optimizations
        • Operations monitoring and management
        • Input into and access to the product roadmap
    58. "Science as a service": our vision for a 21st century discovery infrastructure. To provide more capability for more people at substantially lower cost by creatively aggregating ("cloud") and federating ("grid") resources.
    59. It's a time of great opportunity … to develop and apply Science aaS: Globus Nexus (identity, group, profile), Transfer Service, Sharing Service, Dataset Services, Globus Online APIs, Globus Connect, Globus Toolkit, …
    60. Thanks to great colleagues and collaborators:
        • Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago
        • Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI
        • Francesco de Carlo, Chris Jacobsen, and others at Argonne
        • Kerstin Kleese-van Dam, Carina Lansing, and others at PNNL
    61. Thank you to our sponsors!
