Extreme Scripting July 2009

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    Notes on slide 1

    Another perspective on the problem. A few words of explanation. If we are deploying a hospital IT system, we are (hopefully) in the bottom left hand corner.“You can’t achieve success via central planning.” Quoted in Crossing the Quality Chasm, p. 312In our scenarios, we don’t have that ability to control.

    What is the alternative? We can put in place mechanisms that facilitate groups with some common goal to form and function.Over time, things change, these groups evolve.If we are successful, they can expand, perhaps merge.Challenges: make this easy. Leverage scale effects.

    Principles and mechanisms that has been under development for some years.First CS, then physical sciences, then biology, most recently biomedicine –

    1 Favorite

    Extreme Scripting July 2009 - Presentation Transcript

    1. Extreme scriptingand other adventures in data-intensive computing
      Ian FosterAllan Espinosa, Ioan Raicu, Mike Wilde, Zhao Zhang
      Computation Institute
      Argonne National Lab & University of Chicago
    2. How data analysis happens at data-intensive computing workshops
    3. How data analysis really happensin scientific laboratories
      %foo file1 > file2
      % bar file2 > file3
      %foo file1 | bar > file3
      % foreachf (f1 f2 f3 f4 f5 f6 f7 … f100)
      foreach?foo $f.in | bar > $f.out
      foreach? end
      %
      % Now where on earth is f98.out, and how did I generate it again?
      Now: command not found.
      %
    4. Extreme scripting
      Complex
      scripts
      Swift
      Many activities
      Numerous files
      Complex data
      Data dependencies
      Many programs
      Simple
      scripts
      Big
      computers
      Small
      computers
      Many processors
      Storage hierarchy
      Failure
      Heterogeneity
      Preserving
      file system semantics,
      ability to call arbitrary executables
    5. Functional magnetic resonance imaging (fMRI) data analysis
    6. AIRSN program definition
      (Run or) reorientRun (Run ir, string direction) {
      foreachVolume iv, i in ir.v {
      or.v[i] = reorient(iv, direction);
      }
      }
      (Run snr) functional ( Run r, NormAnat a, Air shrink ) {
      Run yroRun = reorientRun( r , "y" );
      Run roRun = reorientRun( yroRun , "x" );
      Volume std = roRun[0];
      Run rndr = random_select( roRun, 0.1 );
      AirVectorrndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
      Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
      Volume meanRand = softmean( reslicedRndr, "y", "null" );
      Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
      Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
      Run nr = reslice_warp_run( boldNormWarp, roRun );
      Volume meanAll = strictmean( nr, "y", "null" )
      Volume boldMask = binarize( meanAll, "y" );
      snr = gsmoothRun( nr, boldMask, "6 6 6" );
      }
    7. Many many tasks:Identifying potential drug targets
      2M+ ligands
      Protein xtarget(s)
      Benoit Roux et al.
    8. 6 GB
      2M structures
      (6 GB)
      ~4M x 60s x 1 cpu
      ~60K cpu-hrs
      FRED
      DOCK6
      Select best ~5K
      Select best ~5K
      ~10K x 20m x 1 cpu
      ~3K cpu-hrs
      Amber
      Select best ~500
      ~500 x 10hr x 100 cpu
      ~500K cpu-hrs
      GCMC
      ZINC
      3-D
      structures
      Manually prep
      DOCK6 rec file
      Manually prep
      FRED rec file
      NAB scriptparameters
      (defines flexible
      residues,
      #MDsteps)
      NAB
      Script
      Template
      DOCK6
      Receptor
      (1 per protein:
      defines pocket
      to bind to)
      FRED
      Receptor
      (1 per protein:
      defines pocket
      to bind to)
      PDB
      protein
      descriptions
      1 protein
      (1MB)
      BuildNABScript
      Amber prep:
      2. AmberizeReceptor
      4. perl: gen nabscript
      NAB
      Script
      start
      Amber Score:
      1. AmberizeLigand
      3. AmberizeComplex
      5. RunNABScript
      For 1 target:
      4 million tasks500,000 cpu-hrs
      (50 cpu-years)
      end
      report
      ligands
      complexes
    9. IBM BG/P
      570 Teraflop/s, 164,000 cores, 80 TB
    10. DOCK on BG/P: ~1M tasks on 119,000 CPUs
      118784 cores
      934803 tasks
      Elapsed time: 7257 sec
      Compute time: 21.43 CPU years
      Average task: 667 sec
      Time (sec)
      Relativeefficiency 99.7% (from 16 to 32 racks)
      Utilization: 99.6% sustained, 78.3% overall
      Ioan Raicu et al.
    11. Managing 160,000 cores
      Falkon
      High-speed local “disk”
      Slower shared storage
    12. Scaling Posix to petascale
      Global file system
      Chirp(multicast)
      Staging
       Torus and tree interconnects 
      Intermediate
      CN-striped intermediate file system
      Largedataset
      MosaStore(striping)

      IFScompute
      node
      IFScompute
      node
      LFS
      LFS
      IFSseg
      IFSseg
      ZOID on I/O node
      Computenode(local datasets)
      Computenode(local datasets)
      ZOID IFS
      Local
      . . .
    13. Efficiency for 4 second tasks and varying data size(1KB to 1MB) for CIO and GPFS up to 32K processors
    14. +
      +
      +
      +
      +
      +
      +
      =
      Provisioning for data-intensive workloads
      Example: on-demand “stacking” of arbitrary locations within ~10TB sky survey
      Challenges
      Random data access
      Much computing
      Time-varying load
      Solution
      Dynamic acquisition of compute & storage
      Data diffusion
      Sloan
      Data
      S
      IoanRaicu
    15. IoanRaicu
      “Sine” workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50GB caches/node
    16. Same scenario, but with dynamic resource provisioning
    17. Data diffusion sine-wave workload: Summary
      GPFS  5.70 hrs, ~8Gb/s, 1138 CPU hrs
      DD+SRP  1.80 hrs, ~25Gb/s, 361 CPU hrs
      DD+DRP  1.86 hrs, ~24Gb/s, 253 CPU hrs
    18. Data-intensive computing @ Computation Institute: Example applications
      Astrophysics
      Cognitive science
      East Asian studies
      Economics
      Environmental science
      Epidemiology
      Genomic medicine
      Neuroscience
      Political science
      Sociology
      Solid state physics
    19. Sequencing outpaces Moore’s law
      BLAST
      On EC2,
      US$
      Next-gen Solexa
      454
      Solexa
      Gigabases
      Folker Meyer, Computation Institute
    20. Data-intensive computing @ Computation Institute: Hardware
      1000 TBtape backup
      Dynamic provisioning
      500 TB reliable storage (data & metadata)
      Parallel analysis
      Diversedatasources
      Remote access
      P A D S
      180 TB,
      180 GB/s
      17 Top/s
      analysis
      Diverseusers
      Data
      ingest
      Offload to remote data centers
      PADS: Petascale Active Data Store (NSF MRI)
    21. Data-intensive computing @ Computation Institute: Software
      HPC systems software (MPICH, PVFS, ZeptOS)
      Collaborative data tagging (GLOSS)
      Data integration (XDTM)
      HPC data analytics and visualization
      Loosely coupled parallelism (Swift, Hadoop)
      Dynamic provisioning (Falkon)
      Service authoring (Introduce, caGrid, gRAVI)
      Provenance recording and query (Swift)
      Service composition and workflow (Taverna)
      Virtualization management (Workspace Service)
      Distributed data management (GridFTP, etc.)
    22. Data-intensive computing is an end-to-end problem
      Low
      Chaos
      Zone ofcomplexity
      Agreement
      about
      outcomes
      Plan and control
      High
      Low
      High
      Certainty about outcomes
      Ralph Stacey, Complexity and Creativity in Organizations, 1996
    23. We need to function in the zone of complexity
      Low
      Chaos
      Agreement
      about
      outcomes
      Plan and control
      High
      Low
      High
      Certainty about outcomes
      Ralph Stacey, Complexity and Creativity in Organizations, 1996
    24. The Grid paradigm
      Principles and mechanisms for dynamic virtual organizations
      Leverage service oriented architecture
      Loose coupling of data and services
      Open software,architecture
      Engineering
      Biomedicine
      Computer science
      Physics
      Healthcare
      Astronomy
      Biology
      1995 2000 2005 2010
    25. As of Oct19, 2008:
      122 participants
      105 services
      70 data
      35 analytical
    26. Multi-center clinical cancer trials image capture and review
      (Center for Health Informatics)
    27. Summary
      Extreme scripting offers the potential for easy scaling of proven working practices
      Interesting technical problems relating to programming and I/O models
      Many wonderful applications
      Data-intensive computing is an end-to-end problem
      Data generation, integration, analysis, etc., is a continuous, loosely coupled process
    28. Thank you!
      Computation Institutewww.ci.uchicago.edu

    + Ian FosterIan Foster, 4 months ago

    custom

    312 views, 1 favs, 0 embeds more stats

    A talk presented at an NSF Workshop on Data-Intensi more

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 312
      • 312 on SlideShare
      • 0 from embeds
    • Comments 0
    • Favorites 1
    • Downloads 6
    Most viewed embeds

    more

    All embeds

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories