Open Analytics Environment

I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.

Published in: Technology, Education

  • Open Analytics Environment

    1. 1. Towards an Open Analytics Environment. Ian Foster, Computation Institute, Argonne National Lab & University of Chicago
    2. 2. The Computation Institute
        • A joint institute of Argonne and the University of Chicago, focused on furthering system-level science via the development and use of advanced computational methods.
        • Solutions to many grand challenges facing science and society today require the analysis and understanding of entire systems, not just individual components. They require not reductionist approaches but the synthesis of knowledge from multiple levels of a system, whether biological, physical, or social (or all three).
        www.ci.uchicago.edu
        Faculty, fellows, staff, students, computers, projects.
    3. 3. The Good Old Days: Astronomy ~1600 30 years ? years 10 years 6 years 2 years
    4. 4. Astronomy, from 1600 to 2000
        • Automation: 10^-1 → 10^8 Hz data capture
        • Community: 10^0 → 10^4 astronomers (10^6 amateur)
        • Computation: 10^-1 → 10^15 Hz peak
        • Data: 10^6 → 10^15 B aggregate
        • Literature: 10^1 → 10^5 pages/year
    5. 5. Biomedical Research ~1600
    6. 6. Biomedical Research ~2000 ... atcgaattccaggcgtcacattctcaattcca... MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT... Protein-Protein Interactions metabolism pathways receptor-ligand 4º structure Polymorphism and Variants genetic variants individual patients epidemiology Physiology Cellular biology Biochemistry Neurobiology Endocrinology etc. >10^6 ESTs Expression patterns Large-scale screens Genetics and Maps Linkage Cytogenetic Clone-based From John Wooley >10^6 >10^9 >10^6 >10^5 >10^9 DNA sequences alignments Proteins sequence 2º structure 3º structure
    7. 7. Growth of Sequences and Annotations since 1982 Folker Meyer, Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade, CTWatch , August 2006.
    8. 8. The Analyst in Denial: "I just need a bigger disk (and workstation)"
    9. 9. An Open Analytics Environment
        Data in; programs & rules in; results out
        • "No limits": storage, computing, format, program
        • Allowing for: versioning, provenance, collaboration, annotation
    10. 10. o·pen [oh-puhn] adjective
        • having the interior immediately accessible
        • relatively free of obstructions to sight, movement, or internal arrangement
        • generous, liberal, or bounteous
        • in operation; live
        • readily admitting new members
        • not constipated
    11. 11. What Goes In (1)
    12. 12. What Goes In (2) Rules Workflows Dryad MapReduce Parallel programs SQL BPEL Swift SCFL R MatLab Octave
    13. 13. How it Cooks
        • Virtualization: run any program, store any data
        • Indexing: automated maintenance
        • Provisioning: policy-driven allocation of resources to competing demands
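
The provisioning bullet ("policy-driven allocation of resources to competing demands") can be illustrated with a small weighted fair-share sketch. This is not the allocation policy used in the project; the function name `allocate`, the group names, the weights, and the core counts are all hypothetical and chosen only for illustration.

```python
from typing import Dict


def allocate(total_cores: int, demand: Dict[str, int], weight: Dict[str, float]) -> Dict[str, int]:
    """Weighted fair share: split cores by policy weight, never exceeding a demand.

    Cores that a group cannot use are redistributed among the groups that
    still have unmet demand, in proportion to their weights.
    """
    grant = {name: 0 for name in demand}
    active = {name for name, d in demand.items() if d > 0}
    remaining = total_cores
    while remaining > 0 and active:
        total_weight = sum(weight[name] for name in active)
        pool = remaining
        handed_out = 0
        for name in sorted(active):
            share = int(pool * weight[name] / total_weight)
            give = min(share, demand[name] - grant[name])
            grant[name] += give
            handed_out += give
        if handed_out == 0:
            # Every share rounded down to zero: give one core to the heaviest unmet group.
            name = max(sorted(active), key=lambda n: weight[n])
            grant[name] += 1
            handed_out = 1
        remaining -= handed_out
        active = {name for name in active if grant[name] < demand[name]}
    return grant


# Hypothetical demands (cores requested) and policy weights.
demand = {"astro": 80, "genomics": 30, "econ": 10}
weight = {"astro": 2, "genomics": 1, "econ": 1}
grant = allocate(100, demand, weight)
print(grant)
```

In this example the small request is satisfied in full and the cores it leaves unused flow back to the groups that still have unmet demand.
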
    14. 14. What Comes Out Data Data Virtual Data Schema
    15. 15. Analysis as (Collaborative) Process Transform Annotate Search Add to Tag Visualize Discover Extend Group Share
    16. 16. Centralized or Distributed? Both
    17. 17. Towards an Open Analysis Environment: (1) Applications
        • Astrophysics
        • Cognitive science
        • East Asian studies
        • Economics
        • Environmental science
        • Epidemiology
        • Genomic medicine
        • Neuroscience
        • Political science
        • Sociology
        • Solid state physics
    18. 18. Towards an Open Analysis Environment: (2) Hardware SiCortex 6K cores, 6 Top/s IBM BG/P 160K cores, 500 Top/s PADS 10-40 Gbit/s
    19. 19. PADS: Petascale Active Data Store
        • 500 TB reliable storage (data & metadata)
        • 180 TB, 180 GB/s
        • 17 Top/s analysis
        • 1000 TB tape backup
        • Functions: data ingest, dynamic provisioning, parallel analysis, remote access, offload to remote data centers
        • Diverse users, diverse data sources
    20. 20. Towards an Open Analysis Environment: (3) Methods
        • HPC systems software (MPICH, PVFS, etc.)
        • Collaborative data tagging (GLOSS)
        • Data integration (XDTM)
        • HPC data analytics and visualization
        • Loosely coupled parallelism (Swift, Hadoop)
        • Dynamic provisioning (Falkon)
        • Service authoring (Introduce, caGrid, gRAVI)
        • Provenance recording and query (Swift)
        • Service composition and workflow (Taverna)
        • Virtualization management
        • Distributed data management (GridFTP, etc.)
    21. 21. Tagging & Social Networking. GLOSS: Generalized Labels Over Scientific data Sources
    22. 22. XDTM: XML Data Typing & Mapping
        ./group23:
        drwxr-xr-x 4 yongzh users 2048 Nov 12 14:15 AA
        drwxr-xr-x 4 yongzh users 2048 Nov 11 21:13 CH
        drwxr-xr-x 4 yongzh users 2048 Nov 11 16:32 EC
        ./group23/AA:
        drwxr-xr-x 5 yongzh users 2048 Nov 5 12:41 04nov06aa
        drwxr-xr-x 4 yongzh users 2048 Dec 6 12:24 11nov06aa
        ./group23/AA/04nov06aa:
        drwxr-xr-x 2 yongzh users 2048 Nov 5 12:52 ANATOMY
        drwxr-xr-x 2 yongzh users 49152 Dec 5 11:40 FUNCTIONAL
        ./group23/AA/04nov06aa/ANATOMY:
        -rw-r--r-- 1 yongzh users 348 Nov 5 12:29 coplanar.hdr
        -rw-r--r-- 1 yongzh users 16777216 Nov 5 12:29 coplanar.img
        ./group23/AA/04nov06aa/FUNCTIONAL:
        -rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0001.hdr
        -rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0001.img
        -rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0002.hdr
        -rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0002.img
        -rw-r--r-- 1 yongzh users 496 Nov 15 20:44 bold1_0002.mat
        -rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0003.hdr
        -rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0003.img
        (Figure labels: Logical, Physical)
    23. 23. fMRI Type Definitions
        type Study { Group g[ ]; }
        type Group { Subject s[ ]; }
        type Subject { Volume anat; Run run[ ]; }
        type Run { Volume v[ ]; }
        type Volume { Image img; Header hdr; }
        type Image {};
        type Header {};
        type Warp {};
        type Air {};
        type AirVec { Air a[ ]; }
        type NormAnat { Volume anat; Warp aWarp; Volume nHires; }
    24. 24. High-Performance Data Analytics. Functional MRI. Ben Clifford, Mihael Hategan, Mike Wilde, Yong Zhao
    25. 25. SwiftScript for fMRI Data Analysis
        (Run snr) functional ( Run r, NormAnat a, Air shrink ) {
          Run yroRun = reorientRun ( r , "y" );
          Run roRun = reorientRun ( yroRun , "x" );
          Volume std = roRun[0];
          Run rndr = random_select ( roRun, 0.1 );
          AirVector rndAirVec = align_linearRun ( rndr, std, 12, 1000, 1000, "81 3 3" );
          Run reslicedRndr = resliceRun ( rndr, rndAirVec, "o", "k" );
          Volume meanRand = softmean ( reslicedRndr, "y", "null" );
          Air mnQAAir = alignlinear ( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
          Warp boldNormWarp = combinewarp ( shrink, a.aWarp, mnQAAir );
          …
        }

        (Run or) reorientRun (Run ir, string direction) {
          foreach Volume iv, i in ir.v {
            or.v[i] = reorient( iv, direction );
          }
        }
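
To make the logical/physical mapping idea concrete, here is a minimal Python sketch that mirrors two of the types above (`Volume` and `Run`) and builds a `Run` from file names like those in the FUNCTIONAL directory listing. It is an illustration only, not the XDTM mapper: the function `map_run` and the choice to store file paths in the fields are assumptions made for this example.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Dict, List


@dataclass
class Volume:
    """Mirrors `type Volume { Image img; Header hdr; }`; fields hold file paths."""
    img: str
    hdr: str


@dataclass
class Run:
    """Mirrors `type Run { Volume v[ ]; }`."""
    v: List[Volume] = field(default_factory=list)


def map_run(paths: List[str]) -> Run:
    """Pair each <stem>.img with its <stem>.hdr (hypothetical mapper).

    Assumes all paths come from one directory, so the file stem is a unique key.
    """
    by_stem: Dict[str, Dict[str, str]] = {}
    for p in paths:
        path = PurePosixPath(p)
        suffix = path.suffix.lstrip(".")
        if suffix in ("img", "hdr"):  # other files, such as .mat, are ignored
            by_stem.setdefault(path.stem, {})[suffix] = p
    run = Run()
    for stem in sorted(by_stem):
        parts = by_stem[stem]
        if "img" in parts and "hdr" in parts:  # only complete pairs form a Volume
            run.v.append(Volume(img=parts["img"], hdr=parts["hdr"]))
    return run


base = "group23/AA/04nov06aa/FUNCTIONAL/"
functional = [base + name for name in (
    "bold1_0001.hdr", "bold1_0001.img",
    "bold1_0002.hdr", "bold1_0002.img", "bold1_0002.mat",
    "bold1_0003.hdr", "bold1_0003.img",
)]
run = map_run(functional)
print(len(run.v), run.v[0])
```

Code written against `Run` and `Volume` then never needs to know the directory layout; only the mapper does.
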
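
The `foreach` in `reorientRun` applies the same operation to every volume independently, which is what allows the iterations to run in parallel. A rough Python analogue follows, assuming a thread pool stands in for the parallel runtime; the `reorient` function here is a hypothetical stand-in that only renames its input, not the real image tool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def reorient(volume: str, direction: str) -> str:
    """Hypothetical stand-in for the external reorient program."""
    return f"{volume}.reoriented_{direction}"


def reorient_run(volumes: List[str], direction: str, workers: int = 4) -> List[str]:
    """Apply reorient to every volume independently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda v: reorient(v, direction), volumes))


run = ["bold1_0001", "bold1_0002", "bold1_0003"]
yro_run = reorient_run(run, "y")     # corresponds to yroRun in the script
ro_run = reorient_run(yro_run, "x")  # corresponds to roRun in the script
print(ro_run)
```

Because no iteration reads another iteration's output, the only ordering constraint is between the two calls: the "x" pass starts after the "y" pass has produced its results.
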
    26. 26. Provenance Data Model
    27. 27. Multi-level Scheduling
        • Specification: SwiftScript, abstract computation, Virtual Data Catalog, SwiftScript compiler
        • Scheduling: execution engine (Karajan w/ Swift runtime), Swift runtime callouts, status reporting
        • Execution: virtual node(s), worker nodes, launchers, applications (App F1, App F2), files (file1, file2, file3)
        • Provenance: provenance data, provenance collector
        • Provisioning: Falkon resource provisioner, Amazon EC2
    28. 28. DOCK on SiCortex
        • CPU cores: 5760
        • Power: 15,000 W
        • Tasks: 92160
        • Elapsed time: 12821 sec
        • Compute time: 1.94 CPU years
        (does not include ~800 sec to stage input data)
        Ioan Raicu, Zhao Zhang
    29. 29. LIGO Gravitational Wave Observatory
        • >1 Terabyte/day to 8 sites
        • 770 TB replicated to date: >120 million replicas
        • MTBF = 1 month
        • Map labels: Birmingham, Cardiff, AEI/Golm
        Ann Chervenak et al., ISI; Scott Koranda et al., LIGO
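
The figures on the DOCK slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers shown on the slide, assumes a 365-day year when converting CPU years to seconds, and ignores the ~800 sec of input staging, as the slide does. The derived quantities (utilization, mean task length, energy) are not stated in the original and are rough estimates.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

cores = 5760
power_w = 15_000
tasks = 92_160
elapsed_s = 12_821
compute_cpu_years = 1.94

# CPU time available if every core were busy for the whole run.
available_cpu_years = cores * elapsed_s / SECONDS_PER_YEAR
# Fraction of that time spent computing.
utilization = compute_cpu_years / available_cpu_years
# Average compute time per task, in seconds.
mean_task_s = compute_cpu_years * SECONDS_PER_YEAR / tasks
# Average number of tasks handled by each core.
tasks_per_core = tasks / cores
# Energy drawn over the run at the stated power, in kilowatt-hours.
energy_kwh = power_w * elapsed_s / 3.6e6

print(f"available: {available_cpu_years:.2f} CPU years")
print(f"utilization: {utilization:.1%}")
print(f"mean task: {mean_task_s:.0f} s, tasks per core: {tasks_per_core:.0f}")
print(f"energy: {energy_kwh:.1f} kWh")
```

Under these assumptions the run kept the cores busy for roughly 83% of the elapsed time, with each core handling 16 tasks of about 11 minutes each on average.
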
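
For a sense of scale, the LIGO replication figures translate into a sustained transfer rate and an average replica size. The sketch below assumes decimal terabytes and treats the ">" values as exact, so the rate is a lower bound and the average size an upper bound. Whether 1 TB/day is per site or aggregate is not stated on the slide, so the rate is computed for a single 1 TB/day stream.

```python
TB = 10**12  # assumption: decimal terabytes
SECONDS_PER_DAY = 86_400

daily_bytes = 1 * TB
replicated_bytes = 770 * TB
replicas = 120_000_000

# Sustained rate needed to move 1 TB in one day.
sustained_mb_per_s = daily_bytes / SECONDS_PER_DAY / 1e6
sustained_mbit_per_s = sustained_mb_per_s * 8
# Average size of one replica if 770 TB is spread over 120 million replicas.
mean_replica_mb = replicated_bytes / replicas / 1e6

print(f"{sustained_mb_per_s:.1f} MB/s ({sustained_mbit_per_s:.0f} Mbit/s) sustained")
print(f"{mean_replica_mb:.1f} MB per replica on average")
```
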
    30. 30. Lag Plot for Data Transfers to Caltech Credit: Kevin Flasch, LIGO
    31. 31. SIDGrid: B. Bertenthal et al., U.Chicago, IU, UIC
    32. 32. Social Informatics Data Grid (SIDgrid) TeraGrid PADS … SIDgrid Collaborative, multi-modal analysis of cognitive science data Diverse experimental data & metadata Browse data Search Content preview Transcode Download Analyze
    33. 33. ELAN SIDGrid Portal
    34. 35. A Community Integrated Model for Economic and Resource Trajectories for Humankind (CIM-EARTH) Dynamics, foresight, uncertainty, resolution, … Agriculture, transport, taxation, … Data (global, local, …) (Super) computers CIM-EARTH Framework Community process Open code, data
    35. 36. Alleviating Poverty in Thailand: Modeling Entrepreneurship Consider only wealth, access to capital Consider also distance to 6 major cities Rob Townsend, Victor Zhorin, et al. Match High Low
    36. 37. Text Mining
    37. 38. GeneWays Online Journals Pathways GeneWays Andrey Rzhetsky et al. Screening 250,000 journal articles 2.5M reasoning chains 4M statements
    38. 39. Evidence Integration: Genetics & Disease Susceptibility Identify Genes Phenotype 1 Phenotype 2 Phenotype 3 Phenotype 4 Predictive Disease Susceptibility Physiology Metabolism Endocrine Proteome Immune Transcriptome Biomarker Signatures Morphometrics Pharmacokinetics Ethnicity Environment Age Gender Source: Terry Magnuson
    39. 40. James Evans, U.Chicago Arabidopsis articles
    40. 41. An Open Analytics Environment
        Data in; programs & rules in; results out
        • "No limits": storage, computing, format, program
        • Allowing for: versioning, provenance, collaboration, annotation
