End-to-End eScience
Integrating Query, Workflow,
Visualization, and Mashups
at an Ocean Observatory
Bill Howe,
University ...
01/30/15 Bill Howe, eScience Institute 2
Outline
• eScience
• Brief Demo
• A Domain-Specific Query Algebra
• Mashups
Theory
Experiment
Observation
slide: Ed Lazowska
Theory
Experiment
Observation
slide: Ed Lazowska
Theory
Experiment
Observation
slide: Ed Lazowska
Theory
Experiment
Observation
Computational
Science
slide: Ed Lazowska
Theory
Experiment
Observation
Computational
Science
eScience
01/30/15 Bill Howe, eScience Institute 8
All Science is becoming eScience
Old model: “Query the world” (Data acquisition c...
01/30/15 Bill Howe, eScience Institute 9
The long tail is getting fatter:
notebooks become spreadsheets (MB),
spreadsheets...
01/30/15 Bill Howe, eScience Institute 10
eScience Institute at UW
• Mission
• Help position the University of Washington ...
01/30/15 Bill Howe, eScience Institute 11
Web
Services
Facets of Database Research
Query
Languages
Storage
Management
Visu...
01/30/15 Bill Howe, eScience Institute 12
The eScience Elephant
eScience
Cloud/Cluster
Workflow
Databases
Visualization Pr...
01/30/15 Bill Howe, eScience Institute 13
Some eScience Research
Query Algebra for new Data Type
Scientific Workflow Syste...
01/30/15 Bill Howe, eScience Institute 14
Outline
• eScience
• Brief Demo
• A Domain-Specific Query Algebra
• Science Mash...
01/30/15 Bill Howe, eScience Institute 15
VisTrails for Computation
Spatial Patterns in Fisheries: new
techniques, new opportunities for...
01/30/15 Bill Howe, eScience Institute 17
Enabling Scientific Discourse between
Fishermen and Fisheries Managers
01/30/15 Bill Howe, eScience Institute 18
01/30/15 Bill Howe, eScience Institute 19
01/30/15 Bill Howe, eScience Institute 20
01/30/15 Bill Howe, eScience Institute 21
VisTrails for Collaboration
Bill Howe @ CMOP
computes salt flux
using GridFields...
01/30/15 Bill Howe, eScience Institute 22
Outline
• eScience
• Brief Demo
• A Domain-Specific Query Algebra
• Mashups
01/30/15 Bill Howe, eScience Institute 23
CMOP
01/30/15 Bill Howe, eScience Institute 24
Columbia River Estuary
red = high salinity (~34 psu)
blue = fresh water (~0 psu)
01/30/15 Bill Howe, eScience Institute 25
Accessing Model Results
• CMOP ocean circulation models run in forecast or
hindc...
01/30/15 Bill Howe, eScience Institute 26
Unstructured Grids
“unstructured grids” model
complex domains at multiple
scales...
01/30/15 Bill Howe, eScience Institute 27
“Structured” Grids
“structured grids” do a poor job of
modeling complex features...
01/30/15 Bill Howe, eScience Institute 28
Structured grids are easy
• The data model
(Cartesian products of coordinate var...
01/30/15 Bill Howe, eScience Institute 29
Structured grid example
f( i, j )
x( i)
y( j)
for i in [4:6]:
for j in [1:4]:
ad...
01/30/15 Bill Howe, eScience Institute 30
Unstructured Grids
Grid = (E, I)
[Figure: triangle A with nodes 2, 3, 4 and edges x, y, z]
E0 = {2,3,4}
E1 = {x,y,z}
E2 = {A}
I...
01/30/15 Bill Howe, eScience Institute 31
Subsetting
Full grid: Eastern Pacific Subset: mouth of
Columbia River
color: bat...
01/30/15 Bill Howe, eScience Institute 32
Correctness properties preserved
Grid is well-supported
(no ragged edges)
01/30/15 Bill Howe, eScience Institute 33
Subset semantics
[Figure: a grid with numeric cell labels, shown in four panels: Input, Simple, Drop, “Exact”]
...
01/30/15 Bill Howe, eScience Institute 34
What about Visualization Libs?
• Different C++ classes, each dependent on data c...
01/30/15 Bill Howe, eScience Institute 35
GridField Data Model
A GridField with two attributes bound to the 2-cells
and fo...
01/30/15 Bill Howe, eScience Institute 36
GridField Operations
• Lifted set operations
• Union, Intersection, Cross Produc...
01/30/15 Bill Howe, eScience Institute 37
Usage Example (1)
H = Scan(context, "H")
rH = Restrict("(326<x) & (x<345) & (287...
01/30/15 Bill Howe, eScience Institute 38
Usage Example (2)
H = Scan(context, "H")
rH = Restrict("h<500", 0, H)
H = rH =
c...
01/30/15 Bill Howe, eScience Institute 39
Longer Example
H : (x,y,b)
V : (z)
render
H V
⊗
(H × V)
r(z>b)
r(H × V)
b(s)
b(r...
01/30/15 Bill Howe, eScience Institute 40
⊗
H(x,y,b)
V(z)
r(z>b) b(s) r(region)
⊗
H(x,y,b)
V(z)
r(z>b) b(s)
r(x,y)
r(z)
Op...
01/30/15 Bill Howe, eScience Institute 41
Transect (Vertical Slice)
P
01/30/15 Bill Howe, eScience Institute 42
Transect: Bad Plan
⊗
H(x,y,b)
V(z)
r(z>b) b(s) regrid
⊗
P
P ⊗ V
1) Construct ful...
01/30/15 Bill Howe, eScience Institute 43
Transect: Optimized Plan
P ⊗ V
V(z)
P
H(x,y,b)
regrid b(s)⊗ regrid
⊗
1) Find 2D ...
01/30/15 Bill Howe, eScience Institute 44
1) Find cells containing points in P
01/30/15 Bill Howe, eScience Institute 45
1) Find cells containing points in P
2) Construct “stacks” of cells
4) ...
01/30/15 Bill Howe, eScience Institute 46
[Bar chart, axis 0 to 45: vtk(3D), interpolate, simple, interp_o, simple_o]
Transec...
01/30/15 Bill Howe, eScience Institute 47
Ongoing work
• NSF Cluster Exploratory Award:
• Where the Ocean Meets the Cloud:...
01/30/15 Bill Howe, eScience Institute 48
Outline
• eScience
• Brief Demo
• A Domain-Specific Query Algebra
• Scientific M...
01/30/15 Bill Howe, eScience Institute 49
Why Mashups?
• Jim Gray: # of datasets scales as N²
• Each pairwise comparison g...
01/30/15 Bill Howe, eScience Institute 50
Satellite Images + Crime Incidence Reports
01/30/15 Bill Howe, eScience Institute 51
Twitter Feed + Flickr Stream
01/30/15 Bill Howe, eScience Institute 52
Mashup Frameworks
• A bottom up approach
• Start with a GPL, add
• Visual progra...
01/30/15 Bill Howe, eScience Institute 53
01/30/15 Bill Howe, eScience Institute 54
01/30/15 Bill Howe, eScience Institute 55
01/30/15 Bill Howe, eScience Institute 56
Scientific Mashup Characteristics
• Turn over more data per operation
• Involve ...
01/30/15 Bill Howe, eScience Institute 57
A Model for Scientific Mashups
• The “Data Product” is the currency of scientifi...
01/30/15 Bill Howe, eScience Institute 58
Data Product Ensemble
01/30/15 Bill Howe, eScience Institute 59
Mashup
01/30/15 Bill Howe, eScience Institute 60
CTD: Conductivity, Temperature, Depth
01/30/15 Bill Howe, eScience Institute 61
Sampling
01/30/15 Bill Howe, eScience Institute 62
Event Detection: Red Water
01/30/15 Bill Howe, eScience Institute 63
CTD Cast
01/30/15 Bill Howe, eScience Institute 64
Flowthrough
01/30/15 Bill Howe, eScience Institute 65
Mashup
01/30/15 Bill Howe, eScience Institute 66
Mashup
01/30/15 Bill Howe, eScience Institute 67
Key Concepts
• A mashup is a synchronized
ensemble of data products
• A data pro...
01/30/15 Bill Howe, eScience Institute 68
Adapted Mashables
01/30/15 Bill Howe, eScience Institute 69
Data Flow Graph
01/30/15 Bill Howe, eScience Institute 70
Inferring Data Flow
provides: {ABC}
requires: {AB}
01/30/15 Bill Howe, eScience Institute 71
Inferring Data Flow
provides: {AC}
requires: {AB}
provides: {B}
01/30/15 Bill Howe, eScience Institute 72
Inferring Data Flow
provides: {AC}
requires: {AB}
underspecified mashup
Solution...
01/30/15 Bill Howe, eScience Institute 73
Inferring Data Flow
provides: {AB}
requires: {AB}
provides: {B}
overspecified ma...
01/30/15 Bill Howe, eScience Institute 74
Audience-Tailored Mashups
K12 students / Experts
01/30/15 Bill Howe, eScience Institute 75
Conclusions and Future Directions
• We want to augment scientists, not programme...
01/30/15 Bill Howe, eScience Institute 76
http://escience.washington.edu
(retooled website coming soon)
01/30/15 Bill Howe, eScience Institute 77
Comparison
Data Model | Operations | Services
GPL * * Typing, maybe
Workflow * arbitr...
01/30/15 Bill Howe, eScience Institute 78
Mashups serve a diverse audience
student
public
scientist
01/30/15 Bill Howe, eScience Institute 79
Computational Science
• Theory
• Experiment
• Observation
• Simulation (in silic...
01/30/15 Bill Howe, eScience Institute 80
Explore architectures blending techniques from
• mashups (rapid prototyping),
• ...
01/30/15 Bill Howe, eScience Institute 81
Source: MayaVi website
PLOT3D, GDAL,
ShapeFile, OGC,
.obj, .vtk,
netCDF, HDF5,
F...
01/30/15 Bill Howe, eScience Institute 82
Workflow
• Emphasis on integration, web
services, flexibility
• Unconstrained bo...
01/30/15 Bill Howe, eScience Institute 83
Databases
Pre-relational DBMS brittleness: if your
data changed, your applicatio...
01/30/15 Bill Howe, eScience Institute 84
Heterogeneity also drives costs
# of bytes
# of data types
CERN
(~15PB/year, partic...
01/30/15 Bill Howe, eScience Institute 85
The eScience Elephant
“Like a snake”
“Like a hand fan” “Like a wall” “Like tre...
01/30/15 Bill Howe, eScience Institute 86
Invited talk at Microsoft Research, Spring 2009

  • There have traditionally been three legs to the scientific stool: theory, experiment, and observation.
  • These were mutually reinforcing: for example, observations might suggest theories, which could be tested by experiments.
  • Over the past 50 years, we have augmented the traditional “three legs of the stool” with an incredibly powerful new tool: high-speed computation.
    In traditional “computational science,” we use simulation to conduct “virtual experiments” – experiments that can’t be conducted in the lab, for various reasons.
  • In the past 10 years, a fourth method of scientific discovery has emerged: Acquire data en masse, independent of any hypothesis, and then ask questions about it post hoc.
    eScience is about massive and complex data -- data large enough to require automated or semi-automated analysis -- there’s too much to look at manually. Relevant tools are databases, visualization, cluster computing, data mining, machine learning, workflow, web services -- all integrated and optimized for scientific use.
  • Drowning in data; starving for information
    We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
  • The long tail of eScience -- a huge number of scientists who struggle with data management but do not have access to IT resources -- no clusters, no system administrators, no programmers, and no computer scientists.
    They rely on spreadsheets, email, and maybe a shared file system.
    Their data challenges have more to do with heterogeneity than size: tens of spreadsheets from different sources.
    However: the long tail is becoming the fat tail. Tens of spreadsheets are growing to hundreds, and the number of records in each goes from hundreds to thousands. How many of you know someone who was forced to split a large spreadsheet into multiple files in order to get around the 65k record limit in certain versions of Excel?
    Further, medium data (gigabytes) becomes big data (terabytes). Ocean modelers are moving from regional-focus to meso-scale simulations to global simulations.
  • Data Management != Storage Management
    Storage Management is
    SATA/SCSI/Fiber
    Backup policies and procedures
    redundancy decisions (RAID 0, 1+0, 0+1, 5)
    Data Management is
    Access methods
    Query languages
    Data Mining, Analysis, Visualization
    Data Integration
  • The blind men and the eScience elephant.
  • Collaboration across many state, federal, and university agencies
  • Very interesting datasets, and all of it is freely available for any purpose.
  • File formats, programming languages, and DBMS exist that are organized around this simple property
  • With an unstructured grid, you explicitly track cells of various dimensions and the incidence relationship that connects them.
    Specialized “Boundary Representation” data structures exist for special cases, but there is no general data model.
  • Different semantics for subsetting may be defined. One particular semantics tends to preserve intuitive correctness properties.
  • On the order of hundreds of points. Manual browsing.
  • “Make mashups easy to create”
    “Raise the level of abstraction”
    “Empower non-programmers to be programmers”
  • The seven people who know your
  • A climatology is a long-term average
  • ~1 million observations. We can’t render each dot in, say, JavaScript. We need services that can produce these visualizations given parameters. This work is about synchronizing visualizations, blessing them with interactivity, and publishing them on the web.
  • The blind men and the eScience elephant.
  • End-to-End eScience

    1. 1. End-to-End eScience: Integrating Query, Workflow, Visualization, and Mashups at an Ocean Observatory
        Bill Howe, University of Washington
        Harrison Green-Fishback, PSU
        David Maier, PSU
        Erik Anderson, Utah
        Emanuele Santos, Utah
        Juliana Freire, Utah
        Carlos Scheidegger, Utah
        Claudio Silva, Utah
        Antonio Baptista, OHSU
        Peter Lawson, OSU
        Renee Bellinger, OSU
        http://dev.pacificfishtrax.org/
    2. 2. Outline
        • eScience
        • Brief Demo
        • A Domain-Specific Query Algebra
        • Mashups
    3. 3. Theory Experiment Observation slide: Ed Lazowska
    4. 4. Theory Experiment Observation slide: Ed Lazowska
    5. 5. Theory Experiment Observation slide: Ed Lazowska
    6. 6. Theory Experiment Observation Computational Science slide: Ed Lazowska
    7. 7. Theory Experiment Observation Computational Science eScience
    8. 8. All Science is becoming eScience
        Old model: “Query the world” (data acquisition coupled to a specific hypothesis)
        New model: “Download the world” (data acquired en masse, independent of hypotheses)
        But: acquisition now outpaces analysis
        • Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
        • Medicine: ubiquitous digital records, MRI, ultrasound
        • Oceanography: high-resolution models, cheap sensors, satellites
        • Biology: lab automation, high-throughput sequencing
        “Increase Data Collection Exponentially in Less Time, with FlowCAM”
        Empirical X • Analytical X • Computational X • X-informatics
    9. 9. The long tail is getting fatter:
        notebooks become spreadsheets (MB),
        spreadsheets become databases (GB),
        databases become clusters (TB),
        clusters become clouds (PB)
        [Figure: “The Long Tail”, data inventory vs. ordinal position. Labels: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), Ocean Modelers, <Spreadsheet users>, SDSS (~100TB), Seismologists, Microbiologists, CARMEN (~50TB)]
        Researchers with growing data management challenges but limited resources for cyberinfrastructure
        • No dedicated IT staff
        • Overreliance on simple tools (e.g., spreadsheets)
        “The future is already here. It’s just not very evenly distributed.” -- William Gibson
    10. 10. eScience Institute at UW
        • Mission
            • Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon these techniques and technologies
        • Strategy
            • Increase the sharing of expertise and facilities
            • Bootstrap a cadre of Research Scientists
            • Add faculty in key fields
            • Make the entire University more effective
        • Launched July 1 with $1 million in permanent funding from the Washington State Legislature
        • Sought, and need, $2 million
    11. 11. Facets of Database Research
        Web Services; Query Languages; Storage Management; Visualization; Workflow; Data Integration; Knowledge Extraction, Crawlers; Access Methods; Data Mining, Parallel Programming Models, Provenance; complexity-hiding interfaces
        My research: customize and optimize for science
    12. 12. The eScience Elephant
        • Workflow: “flexibility; web services; integration”
        • Databases: “query processing; data independence; algebraic optimization; needles in haystacks”
        • Visualization: “Exploratory science; mapping quantitative data to intuition”
        • Provenance: “Reproducibility; forensics; sharing/reuse”
        • Cloud/Cluster: “Massive data parallelism”
        • Mashups: “Rapid Prototyping; Simplified web programming”
    13. 13. Some eScience Research
        Query Algebra for new Data Type; Scientific Workflow Systems; Science Mashups; “Dataspace” systems
        [Howe, Freire, Silva, et al. 2008] [Howe, Green-Fishback, Maier, 2009] [Howe, Maier, Rayner, Rucker 2008] [Howe, Maier. 2004, 2005, 2006]
        (this talk)
    14. 14. Outline
        • eScience
        • Brief Demo
        • A Domain-Specific Query Algebra
        • Science Mashups
    15. 15. VisTrails for Computation
    16. 16. Spatial Patterns in Fisheries: new techniques, new opportunities for ecosystem-based management
        Peter Lawson¹, Lorenzo Cianelli², Bobby Ireland²
    17. 17. Enabling Scientific Discourse between Fishermen and Fisheries Managers
    18. 18. 01/30/15 Bill Howe, eScience Institute 18
    19. 19. 01/30/15 Bill Howe, eScience Institute 19
    20. 20. 01/30/15 Bill Howe, eScience Institute 20
    21. 21. VisTrails for Collaboration
        • Bill Howe @ CMOP computes salt flux using GridFields
        • Erik Anderson @ Utah adds vector streamlines and adjusts opacity
        • Bill Howe @ CMOP adds an isosurface of salinity
        • Peter Lawson adds discussion of the scientific interpretation
    22. 22. Outline
        • eScience
        • Brief Demo
        • A Domain-Specific Query Algebra
        • Mashups
    23. 23. CMOP
    24. 24. Columbia River Estuary
        red = high salinity (~34 psu)
        blue = fresh water (~0 psu)
    25. 25. Accessing Model Results
        • CMOP ocean circulation models run in forecast or hindcast mode
        • Models run serially in ~1/5 real time
        • On MPICH2, about 10x speedup before overhead dominates
        • Forecasts kept for 10 days, hindcasts kept indefinitely (40TB + 25TB/year)
        • Access via a GridFields Web Service
        • GFServer optimizes and evaluates GF expressions and returns the result
    26. 26. Unstructured Grids
        “Unstructured grids” model complex domains at multiple scales simultaneously
        ...but complicate processing
        [Figure: Columbia River Estuary; red = high salinity (~34 psu), blue = fresh water (~0 psu)]
    27. 27. “Structured” Grids
        Simple: isomorphic to multidimensional arrays
        But: “structured grids” do a poor job of modeling complex features and complicate multi-scale analysis
        • Coastlines are not rectilinear
        1) Missing values = wasted effort
           Higher resolution = wasted effort in areas of low dynamism
        2) Data associated with cells at multiple dimensions
        [Figure: grid with cells marked “x”]
    28. 28. Structured grids are easy
        • The data model (Cartesian products of coordinate variables)
        • immediately implies a representation (multidimensional arrays),
        • an API (reading and writing subslabs),
        • and an efficient implementation (address calculation using array “shape”)
    29. 29. Structured grid example
        f(i, j), x(i), y(j)
        for i in [4:6]:
            for j in [1:4]:
                addr = &f + j*|x| + i
        = f[4:6, 1:4]
        NetCDF, MATLAB, RasDaMan, SciDB (soon), many more
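A minimal Python sketch of the address calculation on this slide, assuming the grid is stored as one flat array with x (index i) varying fastest. The sizes `nx` and `ny`, the sample values, and the helper names `address` and `subslab` are illustrative assumptions, not part of the talk. Ranges follow Python's half-open convention.

```python
# Hypothetical 2-D structured grid stored as one flat list; x (index i) varies fastest.
nx, ny = 8, 5  # |x| and |y|: assumed sizes, not taken from the slide

# Encode f(i, j) as 100*j + i so that each value reveals its own coordinates.
f = [100 * j + i for j in range(ny) for i in range(nx)]


def address(i, j, nx):
    """Offset of f(i, j) from the start of the array: j*|x| + i."""
    return j * nx + i


def subslab(f, nx, i_range, j_range):
    """Read the block f[i_range, j_range] using only address arithmetic."""
    return [[f[address(i, j, nx)] for j in j_range] for i in i_range]


# Counterpart of f[4:6, 1:4] on the slide.
block = subslab(f, nx, range(4, 6), range(1, 4))
```

No index structure is needed here: the array shape alone determines where every value lives, which is the property the slide calls "easy".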
    30. 30. Unstructured Grids
        Grid = (E, I)
        [Figure: triangle A with nodes 2, 3, 4 and edges x, y, z]
        E0 = {2,3,4}
        E1 = {x,y,z}
        E2 = {A}
        I = {z2, z4, Az, x2, x3, Ax, Ay, y4, y3} ...plus the transitive closure
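The triangle on this slide can be written down directly, as a rough sketch. Reading "z2" as "edge z is incident to node 2" is an interpretation of the slide, and the function name `transitive_closure` is illustrative only.

```python
from itertools import product

# Cells by dimension, as on the slide: nodes 2, 3, 4; edges x, y, z; face A.
E = {0: {"2", "3", "4"}, 1: {"x", "y", "z"}, 2: {"A"}}

# Direct incidence pairs (higher-dimensional cell, lower-dimensional cell).
I = {("z", "2"), ("z", "4"), ("x", "2"), ("x", "3"), ("y", "4"), ("y", "3"),
     ("A", "x"), ("A", "y"), ("A", "z")}


def transitive_closure(pairs):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are both present."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b), (c, d) in product(closure, repeat=2) if b == c}
        if new <= closure:
            return closure
        closure |= new


closure = transitive_closure(I)
implied = closure - I  # the face-to-node pairs the slide leaves implicit
```

Here `implied` holds the three face-to-node pairs, which is what "plus the transitive closure" adds to the nine listed pairs.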
    31. 31. Subsetting
        Full grid: Eastern Pacific
        Subset: mouth of Columbia River
        [Figure: color = bathymetry; labels Washington, Oregon, California]
    32. 32. Correctness properties preserved
        Grid is well-supported (no ragged edges)
    33. 33. Subset semantics
        Cut everything labeled “0”. What should be kept?
        [Figure: a grid with numeric cell labels, shown in four panels: Input, Simple, Drop, “Exact”]
    34. 34. What about Visualization Libs?
        • Different C++ classes, each dependent on data characteristics
        • Changes to data characteristics require changes to the program
        • Logical equivalences obscured
        • No data independence
        We want: [figure]
        In VTK: vtkExtractGeometry, vtkThreshold, vtkExtractGrid, vtkExtractVOI, vtkThresholdPoints
    35. 35. GridField Data Model
        A GridField with two attributes bound to the 2-cells and four attributes bound to the 0-cells
        0-cells:
            x     y     salt  temp
            13.8  10.6  29.4  12.1
            13.9  9.4   29.8  12.5
            14.3  9.0   28.0  12.0
            13.4  9.0   30.1  13.2
        2-cells:
            flux  area
            11.5  3.3
            13.9  5.5
            13.1  4.5
    36. 36. GridField Operations
        • Lifted set operations: Union, Intersection, Cross Product
        • Scan/Bind: Read a grid/attribute
        • Restrict: Remove cells that do not satisfy a predicate
        • Accrete: Grow a grid by adding neighbors of cells
        • Regrid: Map the data of one grid onto another
    37. 37. Usage Example (1)
        H = Scan(context, "H")
        rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
        (callouts: the string argument is the predicate, the integer argument is the dimension)
        [Figure: H and rH; color = bathymetry]
    38. 38. Usage Example (2)
        H = Scan(context, "H")
        rH = Restrict("h<500", 0, H)
        [Figure: H and rH; color = bathymetry]
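To make the Restrict step concrete, here is a toy stand-in written from scratch. It is not the GridFields API: `ToyGridField`, the lowercase `restrict`, and the sample values are all hypothetical, and the predicate is a Python callable rather than a string. Dropping every 2-cell that loses a node is one assumed way to avoid ragged edges; the talk does not say which subset semantics Restrict uses.

```python
from dataclasses import dataclass


@dataclass
class ToyGridField:
    nodes: dict  # node id -> attributes bound to that 0-cell
    cells: dict  # 2-cell id -> tuple of node ids


def restrict(predicate, gf):
    """Keep nodes that satisfy predicate; keep 2-cells only if all their nodes survive."""
    kept_nodes = {n: a for n, a in gf.nodes.items() if predicate(a)}
    kept_cells = {c: ns for c, ns in gf.cells.items()
                  if all(n in kept_nodes for n in ns)}
    return ToyGridField(kept_nodes, kept_cells)


# Made-up sample data: four nodes with position and depth h, three triangles.
H = ToyGridField(
    nodes={1: {"x": 330, "y": 290, "h": 120},
           2: {"x": 340, "y": 295, "h": 450},
           3: {"x": 335, "y": 300, "h": 800},
           4: {"x": 350, "y": 298, "h": 300}},
    cells={"A": (1, 2, 3), "B": (2, 3, 4), "C": (1, 2, 4)},
)

# Toy counterpart of rH = Restrict("h<500", 0, H).
rH = restrict(lambda a: a["h"] < 500, H)
```

Node 3 fails the predicate, so triangles A and B go with it and only C remains.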
    39. 39. Longer Example
        H : (x,y,b)
        V : (z)
        ⊗ gives (H × V)
        r(z>b) gives r(H × V)
        b(s) gives b(r(H × V))
        r(region) gives r(b(r(H × V)))
        render
    40. 40. Optimization
        Before: H(x,y,b) ⊗ V(z), then r(z>b), b(s), r(region)
        After: r(x,y) on H(x,y,b) and r(z) on V(z), then ⊗, r(z>b), b(s)
        *Howe, Maier, Algebraic Manipulation of Scientific Datasets. VLDB Journal, 14:4, 2005
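The rewrite on this slide, pushing the region restriction below the cross product, can be checked on plain relations. This is a sketch under simplifying assumptions: grids are reduced to lists of rows, the bind step b(s) is omitted, and the sample values and predicate names are invented for illustration.

```python
from itertools import product

# Made-up horizontal cells (x, y, bathymetry b) and vertical levels z.
H = [{"x": x, "y": y, "b": b} for x, y, b in [(1, 1, 5), (2, 1, 8), (9, 9, 3)]]
V = [{"z": z} for z in range(10)]


def cross(h, v):
    """Cross product of two relations: every pairing of rows, merged."""
    return [{**p, **q} for p, q in product(h, v)]


def restrict(pred, rows):
    """Keep only the rows that satisfy pred."""
    return [r for r in rows if pred(r)]


def in_xy(r):
    return r["x"] <= 2 and r["y"] <= 2   # region of interest, horizontal part


def in_z(r):
    return r["z"] <= 6                   # region of interest, vertical part


def z_gt_b(r):
    return r["z"] > r["b"]               # r(z>b), written as on the slide


# Naive plan: build the full product, then filter.
naive = restrict(lambda r: in_xy(r) and in_z(r), restrict(z_gt_b, cross(H, V)))

# Rewritten plan: filter each factor first, then build a smaller product.
small = cross(restrict(in_xy, H), restrict(in_z, V))
pushed = restrict(z_gt_b, small)
```

Both plans return the same rows, but the rewritten plan materializes 14 product rows instead of 30. The join predicate z>b cannot be pushed down because it mentions attributes from both factors.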
    41. 41. Transect (Vertical Slice): P
    42. 42. Transect: Bad Plan
        [Plan diagram: ⊗ H(x,y,b) V(z) r(z>b) b(s) regrid ⊗ P P ⊗ V]
        1) Construct full-size 3D grid
        2) Construct 2D transect grid
        3) Spatial Join 1) with 2)
    43. 43. Transect: Optimized Plan
        [Plan diagram: P ⊗ V V(z) P H(x,y,b) regrid b(s) ⊗ regrid ⊗]
        1) Find 2D cells containing points
        2) Create “stacks” of 2D cells carrying data
        3) Create 2D transect grid
        4) Spatial Join 2) with 3)
    44. 44. 1) Find cells containing points in P
    45. 45. 1) Find cells containing points in P
        2) Construct “stacks” of cells
        4) Join 2) with 3)
    46. 46. Transect: Results
        [Bar chart: time in secs (axis 0 to 45) for vtk(3D), interpolate, simple, interp_o, simple_o]
        800 MB dataset
        simple = nearest neighbor interpolation
        *_o = optimized by restricting to the region of interest
    47. 47. Ongoing work
        • NSF Cluster Exploratory Award:
            • Where the Ocean Meets the Cloud: Ad Hoc Longitudinal Analysis of Massive Mesh Data
            • Partnership between NSF, IBM, Google
        • Data-intensive computing
            • massive queries, not massive simulations
        • To “Cloud-Enable” GridFields and VisTrails
            • Goal: 10+-year climatologies at interactive speeds
            • Parallel implementations of GridField operators
                • via Hadoop (and Dryad!)
            • Provenance, repeatability, visualization via VisTrails
            • Connect rich desktop experience
        • Co-PIs from University of Utah
            • Claudio Silva and Juliana Freire
    48. 48. Outline
        • eScience
        • Brief Demo
        • A Domain-Specific Query Algebra
        • Scientific Mashups
    49. 49. Why Mashups?
        • Jim Gray: # of datasets scales as N²
            • Each pairwise comparison generates a new dataset
        • Corollary: # of apps scales as N²
            • Every pairwise comparison motivates a new mashup
        • To keep up, we need to
            • entrain new programmers,
            • make existing programmers more productive,
            • or both
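One way to read the N² claim, assuming "pairwise comparison" means an unordered pair of distinct datasets (an interpretation, not stated on the slide), is the standard pair count:

```latex
\[
\binom{N}{2} \;=\; \frac{N(N-1)}{2} \;=\; O(N^2),
\qquad \text{e.g. } N = 10 \Rightarrow 45, \quad N = 100 \Rightarrow 4950 .
\]
```

Growing the number of source datasets tenfold therefore grows the number of possible comparisons roughly a hundredfold.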
    50. 50. Satellite Images + Crime Incidence Reports
    51. 51. Twitter Feed + Flickr Stream
    52. 52. Mashup Frameworks
        • A bottom up approach
        • Start with a GPL, add
            • Visual programming
            • Interactive type checking
        • Exploit a corpus of previous examples
            • bootstrapping a mashup
            • mashup “autocomplete”
            • emit warnings
    53. 53. 01/30/15 Bill Howe, eScience Institute 53
    54. 54. 01/30/15 Bill Howe, eScience Institute 54
    55. 55. 01/30/15 Bill Howe, eScience Institute 55
    56. 56. Scientific Mashup Characteristics
        • Turn over more data per operation
        • Involve subtle visualizations
        • Must serve a diverse audience
    57. 57. A Model for Scientific Mashups
        • The “Data Product” is the currency of scientific communication with the public
        • Scientists are already adept at crafting them (consider PowerPoint slides and figures)
        • We take a top down approach:
            • Take a static data product ensemble,
            • endow it with interactivity,
            • publish it online,
            • allow others to repurpose it at runtime
    58. 58. Data Product Ensemble
    59. 59. Mashup
    60. 60. CTD: Conductivity, Temperature, Depth
    61. 61. Sampling
    62. 62. Event Detection: Red Water
    63. 63. CTD Cast
    64. 64. Flowthrough
    65. 65. Mashup
    66. 66. Mashup
    67. 67. Key Concepts
        • A mashup is a synchronized ensemble of data products
        • A data product is a mashable that has been adapted for a particular purpose
        • A mashable is an arbitrarily-complex computation that returns a relation
        • An adaptor displays the relation to the user and returns a subset
        • All adapted mashables accept input
        • Hence, user controls are modeled as adapted mashables just like “visual” data products
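These definitions can be sketched in a few lines, assuming a relation is simply a list of rows. Every name below (`cruises`, `casts`, `dropdown_adaptor`, `table_adaptor`) and every data value is hypothetical; the point is only that a user control and a visual product share one shape, a mashable plus an adaptor that returns a subset.

```python
def cruises(env):
    """Hypothetical mashable: the relation of available cruises."""
    return [{"cruise": "c1"}, {"cruise": "c2"}]


def casts(env):
    """Hypothetical mashable: CTD casts, filtered by env['cruise'] when present."""
    rows = [
        {"cruise": "c1", "depth": 1, "salinity": 12.0},
        {"cruise": "c1", "depth": 5, "salinity": 28.5},
        {"cruise": "c2", "depth": 1, "salinity": 3.2},
    ]
    wanted = env.get("cruise")
    return [r for r in rows if wanted is None or r["cruise"] == wanted]


def dropdown_adaptor(relation, picked):
    """Adaptor for a user control: the picked row is returned as a one-row subset."""
    return relation[picked:picked + 1]


def table_adaptor(relation):
    """Adaptor for a visual data product: here the returned subset is every row."""
    return list(relation)


selection = dropdown_adaptor(cruises({}), 1)  # the user picks the second cruise
env = dict(selection[0])                      # the selected subset feeds downstream products
shown = table_adaptor(casts(env))
```

Because the drop-down returns a subset of a relation, its output can drive any other data product that accepts a `cruise` attribute as input.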
    68. 68. Adapted Mashables
    69. 69. Data Flow Graph
    70. 70. Inferring Data Flow
        provides: {ABC}
        requires: {AB}
    71. 71. Inferring Data Flow
        provides: {AC}
        requires: {AB}
        provides: {B}
    72. 72. Inferring Data Flow
        provides: {AC}
        requires: {AB}
        underspecified mashup
        Solution: 1) use defaults 2) root environment 3) hand-specified parameter
    73. 73. Inferring Data Flow
        provides: {AB}
        requires: {AB}
        provides: {B}
        overspecified mashup
        Solution: Break ties: 1) Prefer nodes on longer paths 2) Use layout information
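The four cases on slides 70 to 73 can be reproduced with a small matching routine. This is a sketch, not the talk's implementation: `infer_data_flow` and the node names `p1`, `p2`, `q` are made up, and the tie-breaking rules (longer paths, layout) and fallbacks (defaults, root environment) are only detected here, not applied.

```python
def infer_data_flow(provides, requires):
    """Match each required attribute to the nodes that provide it.

    provides: node name -> set of attribute names the node emits
    requires: node name -> set of attribute names the node needs
    Returns (edges, missing, ambiguous).
    """
    edges, missing, ambiguous = [], [], []
    for consumer, needed in requires.items():
        for attr in sorted(needed):
            sources = sorted(n for n, p in provides.items()
                             if attr in p and n != consumer)
            if not sources:
                missing.append((consumer, attr))             # underspecified
            elif len(sources) > 1:
                ambiguous.append((consumer, attr, sources))  # overspecified
            else:
                edges.append((sources[0], consumer, attr))
    return edges, missing, ambiguous


need = {"q": {"A", "B"}}
slide70 = infer_data_flow({"p1": {"A", "B", "C"}}, need)
slide71 = infer_data_flow({"p1": {"A", "C"}, "p2": {"B"}}, need)
slide72 = infer_data_flow({"p1": {"A", "C"}}, need)
slide73 = infer_data_flow({"p1": {"A", "B"}, "p2": {"B"}}, need)
```

In the slide 72 case, `missing` reports that B has no provider. In the slide 73 case, `ambiguous` reports that B has two, which is where a tie-break would be needed.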
    74. 74. Audience-Tailored Mashups
        K12 students / Experts
    75. 75. Conclusions and Future Directions
        • We want to augment scientists, not programmers
            • Requires limiting expressiveness -- not yet clear where to draw the line
        • More work on semi-automatically tailoring a mashup at runtime
            • Automatically insert “context products”
                • See salinity, add a salinity colorbar
                • See a time, add a tide chart
                • See a location, add a map
            • Re-skin data products
                • “Dashboard-style” vs. “Wizard-style” apps
    76. 76. http://escience.washington.edu (retooled website coming soon)
    77. 77. Comparison
         | Data Model | Operations | Services
        GPL | * | * | Typing, maybe
        Workflow | * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism
        Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, data parallelism
        MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance
        MS Dryad | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
        MPI | Arrays/Matrices | 70+ ops | data parallelism, full control
    78. 78. Mashups serve a diverse audience
        student, public, scientist
    79. 79. Computational Science
        • Theory
        • Experiment
        • Observation
        • Simulation (in silico)
        • Analysis (in ferro)
        Data acquisition is hypothesis-driven
        Data acquisition is technology-driven
    80. 80. Motivation
        Explore architectures blending techniques from
        • mashups (rapid prototyping),
        • visualization (interactivity, richness),
        • workflow (data integration, provenance),
        • databases (optimization, data independence)
        to answer science questions at an Ocean Observatory
    81. 81. Visualization
        PLOT3D, GDAL, ShapeFile, OGC, .obj, .vtk, netCDF, HDF5, FITS, others
        Optimized for “throwing datasets” and interactivity
        Declarative query, interoperability, repeatability generally lacking
        [Figures. Source: MayaVi website; Source: http://pogl.wordpress.com/2007/06/]
    82. 82. Workflow
        • Emphasis on integration, web services, flexibility
        • Unconstrained boxes-and-arrows
        • Any operation on any data type
        • Very expressive, but limited opportunities for static reasoning
            • Type safety
            • Task parallelism
            • Cache safety
            • Optimization via rewrite rules
            • Result size / execution time estimation
            • Transparent data parallelism
            • Platform portability
        To move the earth, you need somewhere to stand
    83. 83. Databases
        Pre-relational DBMS brittleness: if your data changed, your application broke.
        Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
        [Diagram labels: files and pointers; physical data independence; relations; logical data independence; views]
        “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”
        Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independent of physical data representation
    84. 84. Heterogeneity also drives costs
        [Chart axes: # of bytes vs. # of data types]
        CERN (~15PB/year, particle interactions)
        LSST (~100PB; images, objects)
        PanSTARRS (~40PB; images, objects, trajectories)
        OOI (~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)
        SDSS (~100TB; images, objects)
        Biologists (~10TB, sequences, alignments, annotations, BLAST hits, metadata, phylogeny trees)
    85. 85. The eScience Elephant
        “Like a snake”, “Like a hand fan”, “Like a wall”, “Like tree trunk”, “Like a spear”, “Like a rope”
    86. 86. 01/30/15 Bill Howe, eScience Institute 86