End-to-End eScience

Invited talk at Microsoft Research, Spring 2009

  • There have traditionally been three legs to the scientific stool: theory, experiment, and observation.
  • These were mutually reinforcing: for example, observations might suggest theories, which could be tested by experiments.
  • Over the past 50 years, we have augmented the traditional “three legs of the stool” with an incredibly powerful new tool: high-speed computation. In traditional “computational science,” we use simulation to conduct “virtual experiments” – experiments that can’t be conducted in the lab, for various reasons.
  • In the past 10 years, a fourth method of scientific discovery has emerged: Acquire data en masse, independent of any hypothesis, and then ask questions about it post hoc. eScience is about massive and complex data -- data large enough to require automated or semi-automated analysis -- there’s too much to look at manually. Relevant tools are databases, visualization, cluster computing, data mining, machine learning, workflow, web services -- all integrated and optimized for scientific use.
  • Drowning in data; starving for information. We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
  • The long tail of eScience -- a huge number of scientists who struggle with data management, but do not have access to IT resources -- no clusters, no system administrators, no programmers, and no computer scientists. They rely on spreadsheets, email, and maybe a shared file system. Their data challenges have more to do with heterogeneity than size: tens of spreadsheets from different sources. However: the long tail is becoming the fat tail. Tens of spreadsheets are growing to hundreds, and the number of records in each goes from hundreds to thousands. How many of you know someone who was forced to split a large spreadsheet into multiple files in order to get around the 65k record limit in certain versions of Excel? Further, medium data (gigabytes) becomes big data (terabytes). Ocean modelers are moving from regional-focus to meso-scale simulations to global simulations.
  • Data Management != Storage Management
    • Storage Management is SATA/SCSI/Fiber; backup policies and procedures; redundancy decisions (RAID 0, 1+0, 0+1, 5)
    • Access methods; query languages; data mining, analysis, visualization; data integration
  • The blind men and the eScience elephant.
  • Collaboration across many state, federal, and university agencies
  • Very interesting datasets, and all of it is freely available for any purpose.
  • File formats, programming languages, and DBMS exist that are organized around this simple property
  • With an unstructured grid, you explicitly track cells of various dimensions and the incidence relationship that connects them. Specialized “Boundary Representation” data structures exist for special cases, but there is no general data model.
  • Different semantics for subsetting may be defined. One particular semantics tends to preserve intuitive correctness properties.
  • On the order of hundreds of points. Manual browsing.
  • “Make mashups easy to create”; “Raise the level of abstraction”; “Empower non-programmers to be programmers”
  • The seven people who know your
  • Climatology is long-term average
  • ~1 million observations. Can’t render each dot in, say, JavaScript. We need services that can produce these visualizations given parameters. This work is about synchronizing visualizations, blessing them with interactivity, and publishing them on the web.
  • The blind men and the eScience elephant.


  • 1. End-to-End eScience: Integrating Query, Workflow, Visualization, and Mashups at an Ocean Observatory
    • Bill Howe, University of Washington
    • Harrison Green-Fishback, PSU
    • David Maier, PSU
    • Erik Anderson, Utah
    • Emanuele Santos, Utah
    • Juliana Freire, Utah
    • Carlos Scheidegger, Utah
    • Claudio Silva, Utah
    • Antonio Baptista, OHSU
    • Peter Lawson, OSU
    • Renee Bellinger, OSU
    • http://dev.pacificfishtrax.org/
  • 2. Outline
    • eScience
    • Brief Demo
    • A Domain-Specific Query Algebra
    • Mashups
  • 3. Theory Experiment Observation slide: Ed Lazowska
  • 4. Theory Experiment Observation slide: Ed Lazowska
  • 5. Theory Experiment Observation slide: Ed Lazowska
  • 6. Theory Experiment Observation Computational Science slide: Ed Lazowska
  • 7. Theory Experiment Observation Computational Science eScience
  • 8. All Science is becoming eScience
    • Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
    • New model: “Download the world” (Data acquired en masse, independent of hypotheses)
    • But: Acquisition now outpaces analysis
      • Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
      • Medicine: ubiquitous digital records, MRI, ultrasound
      • Oceanography: high-resolution models, cheap sensors, satellites
      • Biology: lab automation, high-throughput sequencing
    “Increase Data Collection Exponentially in Less Time, with FlowCAM”
    Empirical X → Analytical X → Computational X → X-informatics
  • 9. The Long Tail
    • The long tail is getting fatter: notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB)
    • [Figure axes: data inventory, ordinal position]
    • Researchers with growing data management challenges but limited resources for cyberinfrastructure
    • No dedicated IT staff
    • Overreliance on simple tools (e.g., spreadsheets)
    • [Figure labels: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), Ocean Modelers, Seismologists, Microbiologists, <Spreadsheet users>]
    “The future is already here. It’s just not very evenly distributed.” -- William Gibson
  • 10. eScience Institute at UW
    • Mission
      • Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon these techniques and technologies
    • Strategy
      • Increase the sharing of expertise and facilities
      • Bootstrap a cadre of Research Scientists
      • Add faculty in key fields
      • Make the entire University more effective
    • Launched July 1 with $1 million in permanent funding from the Washington State Legislature
      • Sought, and need, $2 million
  • 11. Facets of Database Research
    • My research: customize and optimize for science
    • Web Services
    • Query Languages
    • Storage Management
    • Visualization; Workflow
    • Data Integration
    • Knowledge Extraction, Crawlers
    • Access Methods
    • Data Mining, Parallel Programming Models, Provenance
    • complexity-hiding interfaces
  • 12. The eScience Elephant
    • Workflow: “flexibility; web services; integration”
    • Databases: “query processing; data independence; algebraic optimization; needles in haystacks”
    • Visualization: “Exploratory science; mapping quantitative data to intuition”
    • Provenance: “Reproducibility; forensics; sharing/reuse”
    • Cloud/Cluster: “Massive data parallelism”
    • Mashups: “Rapid Prototyping; Simplified web programming”
  • 13. Some eScience Research
    • Topics: Query Algebra for new Data Type; Scientific Workflow Systems; Science Mashups; “Dataspace” systems
    • Citations: [Howe, Freire, Silva, et al. 2008]; [Howe, Green-Fishback, Maier, 2009]; [Howe, Maier, Rayner, Rucker 2008]; [Howe, Maier. 2004, 2005, 2006]
    • Label: this talk
  • 14. Outline
    • eScience
    • Brief Demo
    • A Domain-Specific Query Algebra
    • Science Mashups
  • 15. VisTrails for Computation
  • 16. Spatial Patterns in Fisheries: new techniques, new opportunities for ecosystem-based management
    • Peter Lawson (1), Lorenzo Cianelli (2), Bobby Ireland (2)
  • 17. Enabling Scientific Discourse between Fishermen and Fisheries Managers
  • 18.  
  • 19.  
  • 20.  
  • 21. VisTrails for Collaboration
    • Bill Howe @ CMOP computes salt flux using GridFields
    • Erik Anderson @ Utah adds vector streamlines and adjusts opacity
    • Bill Howe @ CMOP adds an isosurface of salinity
    • Peter Lawson adds discussion of the scientific interpretation
  • 22. Outline
    • eScience
    • Brief Demo
    • A Domain-Specific Query Algebra
    • Mashups
  • 23. CMOP
  • 24. Columbia River Estuary red = high salinity (~34psu) blue = fresh water (~0 psu)
  • 25. Accessing Model Results
    • CMOP ocean circulation models run in forecast or hindcast mode
    • Models run serially in ~1/5 real time
      • On MPICH2, about 10x speedup before overhead dominates
    • Forecasts kept for 10 days, hindcasts kept indefinitely (40TB + 25TB/year)
    • Access via a GridFields Web Service
      • GFServer optimizes and evaluates GF expressions and returns the result
  • 26. Unstructured Grids
    • “Unstructured grids” model complex domains at multiple scales simultaneously ... but complicate processing
    • [Figure: Columbia River Estuary; red = high salinity (~34 psu), blue = fresh water (~0 psu)]
  • 27. “Structured” Grids
    • Simple: isomorphic to multidimensional arrays
    • But “structured grids” do a poor job of modeling complex features and complicate multi-scale analysis:
      • 1) Coastlines are not rectilinear
        • Missing values = wasted effort
        • Higher resolution = wasted effort in areas of low dynamism
      • 2) Data associated with cells at multiple dimensions
  • 28. Structured grids are easy
    • The data model
      • (Cartesian products of coordinate variables)
    • immediately implies a representation,
      • (multidimensional arrays)
    • an API,
      • (reading and writing subslabs )
    • and an efficient implementation
      • (address calculation using array “shape”)
  • 29. Structured grid example
    • Variables: f(i, j), x(i), y(j)
    • f[4:6, 1:4] = for i in [4:6]: for j in [1:4]: addr = &f + j*|x| + i
    • NetCDF, MATLAB, RasDaMan, SciDB (soon), many more
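The address calculation on the structured grid example slide can be sketched in Python. This is a minimal illustration, not code from the talk: the names `flat_index` and `read_subslab`, the grid sizes, and the sample values are assumptions, and the slide's `[4:6]` and `[1:4]` are read here as Python half-open ranges.

```python
def flat_index(i, j, nx):
    """Offset of f(i, j) in a flat buffer where i varies fastest (j*|x| + i)."""
    return j * nx + i


def read_subslab(f, nx, i_range, j_range):
    """Read a rectangular subslab using only address arithmetic on the flat buffer."""
    return [[f[flat_index(i, j, nx)] for j in j_range] for i in i_range]


# Hypothetical 8 x 5 grid; each value encodes its own (i, j) as 100*j + i.
nx, ny = 8, 5
f = [100 * j + i for j in range(ny) for i in range(nx)]

# Analogue of f[4:6, 1:4] from the slide (half-open ranges assumed).
slab = read_subslab(f, nx, range(4, 6), range(1, 4))
print(slab)
```

No lookup structure is needed: the array "shape" alone determines where every value lives, which is what makes structured grids easy to implement efficiently.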
  • 30. Unstructured Grids
    • Grid (E, I) = one 2-cell A with 1-cells x, y, z and 0-cells 2, 3, 4
    • E_0 = {2,3,4}, E_1 = {x,y,z}, E_2 = {A}
    • I = {(z,2), (z,4), (A,z), (x,2), (x,3), (A,x), (A,y), (y,4), (y,3)} … plus the transitive closure
  • 31. Subsetting
    • Full grid: Eastern Pacific
    • Subset: mouth of Columbia River
    • [Figure: color = bathymetry; map labels Washington, Oregon, California]
  • 32. Correctness properties preserved
    • Grid is well-supported (no ragged edges)
  • 33. Subset semantics
    • Cut everything labeled “0”. What should be kept?
    • [Figure: labeled grids captioned Input, Simple Drop, and “Exact”; cell labels omitted]
  • 34. What about Visualization Libs?
    • Different C++ classes, each dependent on data characteristics.
    • Changes to data characteristics require changes to the program
    • Logical equivalences obscured
    • No data independence
    • In VTK: vtkExtractGeometry, vtkThreshold, vtkExtractGrid, vtkExtractVOI, vtkThresholdPoints
    • We want: [figure]
  • 35. GridField Data Model
    • A GridField with two attributes bound to the 2-cells and four attributes bound to the 0-cells
    • 0-cells:
        temp   salt   y      x
        13.2   30.1   9.0    13.4
        12.0   28.0   9.0    14.3
        12.5   29.8   9.4    13.9
        12.1   29.4   10.6   13.8
    • 2-cells:
        area   flux
        5.5    13.9
        4.5    13.1
        3.3    11.5
  • 36. GridField Operations
    • Lifted set operations
      • Union, Intersection, Cross Product
    • Scan/Bind
      • Read a grid/attribute
    • Restrict
      • Remove cells that do not satisfy a predicate
    • Accrete
      • Grow a grid by adding neighbors of cells
    • Regrid
      • Map the data of one grid onto another
  • 37. Usage Example (1)
    • H = Scan(context, "H")
    • rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
    • Annotations: the string is the predicate; 0 is the dimension
    • [Figures: H and rH; color: bathymetry]
  • 38. Usage Example (2)
    • H = Scan(context, "H")
    • rH = Restrict("h<500", 0, H)
    • [Figures: H and rH; color: bathymetry]
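To make the Restrict usage examples concrete, here is a rough Python sketch of what restricting the 0-cells of a small grid might look like. It is not the GridFields API: the `restrict` function, the dictionary layout, and the sample depths are all hypothetical. It also assumes one particular subsetting semantics (a 2-cell survives only if every one of its nodes survives), which is only one of the possible choices.

```python
def restrict(predicate, nodes, cells):
    """Keep 0-cells (nodes) satisfying the predicate, then keep only those
    2-cells whose nodes all survived (assumed semantics, see lead-in)."""
    kept_nodes = {n: attrs for n, attrs in nodes.items() if predicate(attrs)}
    kept_cells = {
        c: node_ids
        for c, node_ids in cells.items()
        if all(n in kept_nodes for n in node_ids)
    }
    return kept_nodes, kept_cells


# Hypothetical grid: four nodes with a depth attribute h, two triangles.
nodes = {
    1: {"x": 0.0, "y": 0.0, "h": 120.0},
    2: {"x": 1.0, "y": 0.0, "h": 300.0},
    3: {"x": 0.0, "y": 1.0, "h": 450.0},
    4: {"x": 1.0, "y": 1.0, "h": 800.0},
}
cells = {"A": (1, 2, 3), "B": (2, 3, 4)}

# Analogue of Restrict("h<500", 0, H): the predicate is applied at dimension 0.
shallow_nodes, shallow_cells = restrict(lambda a: a["h"] < 500, nodes, cells)
print(sorted(shallow_nodes), shallow_cells)
```

Dropping triangle B along with node 4 keeps the result free of cells that reference missing nodes, which is the kind of correctness property the subsetting slides are concerned with.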
  • 39. Longer Example
    • H : (x,y,b), V : (z)
    • ⊗: (H ⊗ V)
    • r(z>b): r(H ⊗ V)
    • b(s): b(r(H ⊗ V))
    • r(region): r(b(r(H ⊗ V)))
    • render
  • 40. Optimization
    • First plan diagram: ⊗, H(x,y,b), V(z), r(z>b), b(s), r(region)
    • Second plan diagram: ⊗, H(x,y,b), V(z), r(z>b), b(s), r(x,y), r(z)
    • *Howe, Maier, Algebraic Manipulation of Scientific Datasets. VLDB Journal, 14:4, 2005
  • 41. Transect (Vertical Slice) P
  • 42. Transect: Bad Plan
    • Plan diagram: ⊗, H(x,y,b), V(z), r(z>b), b(s), regrid, ⊗, P, P ⊗ V
    • 1) Construct full-size 3D grid
    • 2) Construct 2D transect grid
    • 3) Spatial Join 1) with 2)
  • 43. Transect: Optimized Plan
    • Plan diagram: P ⊗ V, V(z), P, H(x,y,b), regrid, b(s), ⊗, regrid, ⊗
    • 1) Find 2D cells containing points
    • 2) Create “stacks” of 2D cells carrying data
    • 3) Create 2D transect grid
    • 4) Spatial Join 2) with 3)
  • 44. 1) Find cells containing points in P
  • 45.
    • 1) Find cells containing points in P
    • 2) Construct “stacks” of cells
    • 4) Join 2) with 3)
  • 46. Transect: Results
    • [Chart: running time in secs; 800 MB dataset]
    • simple = nearest neighbor interpolation
    • *_o = optimized by restricting to the region of interest
  • 47. Ongoing work
    • NSF Cluster Exploratory Award:
      • Where the Ocean Meets the Cloud:
        • Ad Hoc Longitudinal Analysis of Massive Mesh Data
    • Partnership between NSF, IBM, Google
    • Data-intensive computing
      • massive queries, not massive simulations
    • To “Cloud-Enable” GridFields and VisTrails
      • Goal: 10+-year climatologies at interactive speeds
      • Parallel implementations of GridField operators
        • via Hadoop (and Dryad!)
      • Provenance, repeatability, visualization via VisTrails
        • Connect rich desktop experience
    • Co-PIs from University of Utah
      • Claudio Silva and Juliana Freire
  • 48. Outline
    • eScience
    • Brief Demo
    • A Domain-Specific Query Algebra
    • Scientific Mashups
  • 49. Why Mashups?
    • Jim Gray: # of datasets scales as N^2
      • Each pairwise comparison generates a new dataset
    • Corollary: # of apps scales as N^2
      • Every pairwise comparison motivates a new mashup
    • To keep up, we need to
      • entrain new programmers,
      • make existing programmers more productive,
      • or both
  • 50. Satellite Images + Crime Incidence Reports
  • 51. Twitter Feed + Flickr Stream
  • 52. Mashup Frameworks
    • A bottom up approach
    • Start with a GPL, add
      • Visual programming
      • Interactive type checking
      • Exploit a corpus of previous examples
        • bootstrapping a mashup
        • mashup “autocomplete”
        • emit warnings
  • 53.  
  • 54.  
  • 55.  
  • 56. Scientific Mashup Characteristics
    • Turn over more data per operation
    • Involve subtle visualizations
    • Must serve a diverse audience
  • 57. A Model for Scientific Mashups
    • The “Data Product” is the currency of scientific communication with the public
    • Scientists are already adept at crafting them (consider powerpoint slides and figures)
    • We take a top down approach:
      • Take a static data product ensemble,
      • endow it with interactivity,
      • publish it online,
      • allow others to repurpose it at runtime
  • 58. Data Product Ensemble
  • 59. Mashup
  • 60. CTD: Conductivity, Temperature, Depth
  • 61. Sampling
  • 62. Event Detection: Red Water
  • 63. CTD Cast
  • 64. Flowthrough
  • 65. Mashup
  • 66. Mashup
  • 67. Key Concepts
    • A mashup is a synchronized ensemble of data products
    • A data product is a mashable that has been adapted for a particular purpose
    • A mashable is an arbitrarily-complex computation that returns a relation
    • An adaptor displays the relation to the user and returns a subset
    • All adapted mashables accept input
    • Hence, user controls are modeled as adapted mashables just like “visual” data products
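The key concepts above can be illustrated with a small Python sketch. It is only one possible reading of the definitions, not the system described in the talk: the names `casts` and `adapt`, the `env` dictionary used as input, and the sample rows are all hypothetical, and "displaying" a relation is reduced here to filtering it.

```python
def casts(env):
    """A mashable: a computation that accepts input and returns a relation
    (here, a list of dict rows). The data is made up for illustration."""
    rows = [
        {"station": "s1", "depth": 5, "salinity": 12.0},
        {"station": "s2", "depth": 5, "salinity": 30.1},
        {"station": "s2", "depth": 20, "salinity": 33.4},
    ]
    wanted = env.get("station")
    return [r for r in rows if wanted is None or r["station"] == wanted]


def adapt(mashable, choose):
    """Wrap a mashable with an adaptor: the adaptor would display the relation
    and returns the subset picked out by `choose`."""
    def data_product(env):
        relation = mashable(env)
        return [row for row in relation if choose(row)]
    return data_product


# A user control modeled as an adapted mashable: it "selects" station s2.
station_picker = adapt(casts, lambda row: row["station"] == "s2")
picked = station_picker({})

# A second data product consumes the control's output as its input.
deep_casts = adapt(casts, lambda row: row["depth"] > 10)
result = deep_casts({"station": picked[0]["station"]})
print(result)
```

Because the control and the "visual" product share one shape (input in, subset of a relation out), they can be wired together uniformly, which is the point of modeling user controls as adapted mashables.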
  • 68. Adapted Mashables
  • 69. Data Flow Graph
  • 70. Inferring Data Flow
    • provides: {ABC}; requires: {AB}
  • 71. Inferring Data Flow
    • provides: {AC}; requires: {AB}; provides: {B}
  • 72. Inferring Data Flow: underspecified mashup
    • provides: {AC}; requires: {AB}
    • Solution:
      • use defaults
      • root environment
      • hand-specified parameter
  • 73. Inferring Data Flow: overspecified mashup
    • provides: {AB}; requires: {AB}; provides: {B}
    • Solution: Break ties:
      • Prefer nodes on longer paths
      • Use layout information
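The provides/requires matching on the Inferring Data Flow slides can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the talk's algorithm: the function name `infer_data_flow` and the producer names `p1` and `p2` are hypothetical, and the tie-breaking rules (longer paths, layout information) are not implemented; ambiguous attributes are only reported.

```python
def infer_data_flow(providers, requires):
    """Match each required attribute to the data products that provide it.

    providers: dict mapping a product name to the set of attributes it provides.
    requires:  set of attributes the consuming product needs.
    Returns (edges, missing, ambiguous):
      edges     -- attribute -> the single product that supplies it
      missing   -- attributes nobody provides (underspecified mashup)
      ambiguous -- attribute -> several candidate products (overspecified mashup)
    """
    edges, missing, ambiguous = {}, set(), {}
    for attr in sorted(requires):
        sources = sorted(name for name, attrs in providers.items() if attr in attrs)
        if not sources:
            missing.add(attr)
        elif len(sources) > 1:
            ambiguous[attr] = sources
        else:
            edges[attr] = sources[0]
    return edges, missing, ambiguous


# The four slide scenarios, with hypothetical product names.
exact = infer_data_flow({"p1": {"A", "B", "C"}}, {"A", "B"})
split = infer_data_flow({"p1": {"A", "C"}, "p2": {"B"}}, {"A", "B"})
under = infer_data_flow({"p1": {"A", "C"}}, {"A", "B"})
over = infer_data_flow({"p1": {"A", "B"}, "p2": {"B"}}, {"A", "B"})
print(exact, split, under, over, sep="\n")
```

In the underspecified case the missing attribute would have to come from a default, the root environment, or a hand-specified parameter; in the overspecified case a tie-breaking rule would pick among the candidates.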
  • 74. Audience-Tailored Mashups
    • K12 students
    • Experts
  • 75. Conclusions and Future Directions
    • We want to augment scientists, not programmers
      • Requires limiting expressiveness -- not yet clear where to draw the line
    • More work on semi-automatically tailoring a mashup at runtime
      • Automatically insert “context products”
        • See salinity, add a salinity colorbar
        • See a time, add a tide chart
        • See a location, add a map
      • Re-skin data products
      • “ Dashboard-style” vs. “Wizard-style” apps
  • 76. http://escience.washington.edu (retooled website coming soon)
  • 77. Comparison (columns: Data Model; Operations; Services)
    • GPL: *; *; typing, maybe
    • MapReduce: [(key,value)]; Map, Reduce; massive data parallelism, fault tolerance
    • MPI: Arrays/Matrices; 70+ ops; data parallelism, full control
    • Relational Algebra: Relations; Select, Project, Join, Aggregate, …; optimization, physical data independence, data parallelism
    • MS Dryad: IQueryable, IEnumerable; RA + Apply + Partitioning; typing, massive data parallelism, fault tolerance
    • Workflow: *; arbitrary boxes-and-arrows; typing, provenance, Pegasus-style resource mapping, task parallelism
  • 78. Mashups serve a diverse audience student public scientist
  • 79. Computational Science
    • Theory
    • Experiment
    • Observation
    • Simulation (in silico)
    • Analysis (in ferro)
    Data acquisition is hypothesis-driven
    Data acquisition is technology-driven
  • 80. Motivation
    • Explore architectures blending techniques from
    • mashups (rapid prototyping),
    • visualization (interactivity, richness),
    • workflow (data integration, provenance),
    • databases (optimization, data independence)
    • to answer science questions at an Ocean Observatory
  • 81. Visualization
    • PLOT3D, GDAL, ShapeFile, OGC, .obj, .vtk, netCDF, HDF5, FITS, others
    • Optimized for “throwing datasets” and interactivity
    • Declarative query, interoperability, repeatability generally lacking
    • Source: MayaVi website
    • Source: http://pogl.wordpress.com/2007/06/
  • 82. Workflow
    • Emphasis on integration, web services, flexibility
    • Unconstrained boxes-and-arrows
      • Any operation on any data type
    • Very expressive, but limited opportunities for static reasoning
      • Type safety
      • Task parallelism
      • Cache safety
      • Optimization via rewrite rules
      • Result size / execution time estimation
      • Transparent data parallelism
      • Platform portability
    To move the earth, you need somewhere to stand
  • 83. Databases
    • Pre-relational DBMS brittleness: if your data changed, your application broke.
    • Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
    • [Figure labels: physical data independence, logical data independence; files and pointers, relations, views]
    • “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”
    • Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independent of physical data representation
  • 84. Heterogeneity also drives costs
    • [Figure axes: # of bytes, # of data types]
    • CERN (~15PB/year, particle interactions)
    • LSST (~100PB; images, objects)
    • PanSTARRS (~40PB; images, objects, trajectories)
    • OOI (~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)
    • SDSS (~100TB; images, objects)
    • Biologists (~10TB, sequences, alignments, annotations, BLAST hits, metadata, phylogeny trees)
  • 85. The eScience Elephant
    • “Like a snake”; “Like a hand fan”; “Like a wall”; “Like a tree trunk”; “Like a spear”; “Like a rope”
  • 86.