Visual Data Analytics in the Cloud for Exploratory Science

Invited talk at Tableau, Inc. Part 1: Large-scale 3D Visualization in the cloud. Part 2: Semi-automatic mashups for eScience.

Usage Rights

© All Rights Reserved

  • Drowning in data; starving for information. We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem. “Typical large pharmas today are generating 20 terabytes of data daily. That’s probably going up to 100 terabytes per day in the next year or so.” “tens of terabytes of data per day” -- genome center at Washington University. “Increase data collection exponentially with FlowCAM.”
  • Vertical: fewer, bigger apps. Horizontal: many, smaller apps. Limiting resource: Effort = N_apps × N_bytes.
  • Analytics and visualization are mutually dependent. Shared requirements: scalability; fault tolerance; exploiting shared-nothing, commodity clusters. In general: move computation to the data. Data is ending up in the cloud; we need to figure out how to use it.
  • Visualization is a more efficient way to query data -- you can browse and explore. But you need to be able to switch back and forth between interactive browsing and symbolic querying
  • What exactly is Ad Hoc Research data? It is data that can come in any size, shape, or form, and that is heterogeneous in structure, format, quality, and more.
  • (granted we had a minute for Bill (clearly Bill) to describe this new eScience movement) We want to give a little background of our project before we launch into it, so we will discuss the problem we are trying to solve. Essentially, we want to remove the speed-bump of data handling from the scientists.
  • To begin, we ask: what kind of questions would you ask your data once it is ready to be worked on? For just about EVERY question that we have heard a scientist ask, we have found an equivalent SQL statement. If we could just turn their questions into SQL, our job would be done, but there are many other problems to solve before that becomes a reality. For example, their data may not reside in a relational database. This brings us to our next problem: how can we bring the power of SQL to scientists without the overhead of everything that a database administrator would need to do?
  • One claim we are trying to prove with this project is that scientists are not afraid to learn a bit of SQL.
  • In our first-generation deployment, we used an ASP.NET front end on the Windows Azure cloud to host our web service and Amazon’s EC2 cloud as the back end to host our Microsoft SQL Server database.
  • Data products are the currency of scientific and statistical communication with the public. Ex: Obama map. Ex: Mars Rover pictures generate 218M hits in 24 hrs. But: datasets are growing too big and too complex to view through a few static images. Scientists want to create interactive visualizations that allow others to explore their results. Ex: NASA 3D with Photosynth. Ex: CAMERA. Ex:
  • On the order of hundreds of points. Manual browsing.
  • Ex: NASA 3D with Photosynth. Ex: CAMERA. Ex:
  • Data-intensive science
  • This movie was rendered offline, but it’s increasingly important to be able to create visualizations on the fly to allow interactive exploration of large datasets.
  • Need to consider private clouds. Not just renting hardware: general-purpose data processing.
  • The goal here is to make Shared Nothing Architectures easier to program.
  • We only wrap the interface for Hadoop Streaming in VisTrails, with additional support for HDFS operations to upload/download data/libraries for the job. Hadoop Streaming is plugged into a local VTK rendering pipeline that grabs data from the cloud and generates an animation on the VisTrails Spreadsheet. Users can specify their own Python source as mapper/reducer. In this case, a VTK script is specified in the mapper. The VTK libraries are shipped along with the code to the computing node, using the underlying -cacheArchive option of Hadoop Streaming.
  • By default, Hadoop logs are written to the standard output of the VisTrails app. Jobs are killed by terminating the program and running an extra command returned by Hadoop. However, one can plug a HadoopTrackerCell into the end of the pipeline to have the log messages monitored on the VisTrails Spreadsheet. There are also buttons to kill the job or show the Job Tracker, which automatically connects through the CLuE-specific proxy to show additional logs/error messages of jobs.
  • Drowning in data; starving for information. We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
  • Need to assign workflows to resources for execution in a heterogeneous compute environment. Parts of this workflow can be compiled into Hadoop jobs; parts should be run locally so that they exploit hardware acceleration. But this is not just computation placement -- there are different execution plans, similar to relational execution plans. GridFields expressions can be algebraically optimized, for example.
  • Plan C: Build a spatial index to support panning. Plan D: Build a multi-resolution index to support zoom … and so on. Why not precompute all appropriate indexes? Some will (partially) reside on the client, and storage is not as cheap as we pretend. We need a flexible system where a “query result” can be explored interactively and we prepare for similar queries, with similarity defined by natural “browsing patterns” in visualization systems.
  • We can’t just precompute the indexes, since they may reside on
  • Analytics and visualization are mutually dependent. Shared requirements: scalability; fault tolerance; exploiting shared-nothing, commodity clusters. In general: move computation to the data.
  • Upper left: Average
  • Sweeping through the velocity fields quickly exposed the location of the “upstream” salt flux -- where salty water made its way back upstream.

Visual Data Analytics in the Cloud for Exploratory Science: Presentation Transcript

  • Visual Data Analytics in the Cloud for Exploratory Science. Bill Howe, UW; Huy Vo, Utah; Claudio Silva, Utah; Juliana Freire, Utah; YingYi Bu, UW
  • Data acquisition is no longer the bottleneck
    • Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
    • New model: “Download the world” (Data acquired en masse, in support of many hypotheses)
      • Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
      • Oceanography: high-resolution models, cheap sensors, satellites
      • Biology: lab automation, high-throughput sequencing
  • Two dimensions [Figure: projects plotted by # of bytes vs. # of apps: LSST, SDSS, Galaxy, BioMart, GEO, IOOS, OOI, LANL HIV, Pathway Commons, PanSTARRS; domains: Biology, Oceanography, Astronomy]
  • This Talk
    • # of Bytes: MapReduce for Scientific Viz
    • # of Apps: Other VDA Projects
  • Converging Requirements [Figure: Vis and DB]
  • Why Vis Needs DB: “Transferring the whole data generated … to a storage device or a visualization machine could become a serious bottleneck, because I/O would take most of the … time. A more feasible approach is to reduce and prepare the data in situ for subsequent visualization and data analysis tasks.” -- SciDAC Review. Current Research Topics in Vis:
    • “Query-driven Visualization”
    • “In Situ Visualization”
    • “Remote Visualization”
  • Why DB Needs Vis
  • Why DB Needs Vis (2)
    • “What does the salt wedge look like?”
  • Thesis
    • We can no longer afford to build separate visualization and data management systems
    • Data is increasingly destined for the cloud
    • First Attack: Implement Vis primitives in an existing “cloud” DM system
  • Core Vis Algorithms in MapReduce
    • Scalar/Volume Rendering
    • Isosurface Extraction
    • Mesh Simplification
  • Some distributed algorithm… [Diagram: Map → (Shuffle) → Reduce]
  • CluE Cluster
    • 410 nodes
    • Dual Intel Xeon 2.8GHz, hyperthreading
    • 8GB main memory each
    • Hadoop, no access to OS
    • Google provided, IBM maintained, NSF funded
  • CluE Cluster Scaling
  • Isosurface Example
  • Isosurface Example
  • Isosurface Example
  • Isosurface Example
  • Isosurface Extraction
  • Isosurface Extraction
  • Isosurface Results [Plot annotations: O(N^2), O(N)]
  • Scalable Rendering
  • Scalable Rendering
    • Left: Atlas
      • 18GB
      • 500M triangles
    • Right: St. Matthew
      • 13GB
      • 372M triangles
    • Laser scans, Digital Michelangelo Project
    src: Digital Michelangelo Project
  • Rendering Results
  • Roadmap
    • # of Bytes: MapReduce for Scientific Viz
    • # of Apps: Other VDA projects
      • Azure Ocean
      • SQLShare
      • Automating Mashups
  • [John Delaney, University of Washington]
  • Azure Ocean: COVE for Visualization + Trident for Processing + Azure for Data
  • SQLShare: Query Services for Ad Hoc Research Data
  • Ad Hoc Research Data (5/18/10, Garret Cole, eScience Institute): FASTA format, spreadsheets, tabular data
  • Problem: “I spend 90% of my time handling data rather than doing science” -- Robin Kodner, Postdoc, Armbrust Lab
  • An observation about “handling data”
    • How often does each RNA hit appear inside my annotated surface group?
    • SELECT hit, COUNT(*) AS cnt FROM tigrfamannotation_surface GROUP BY hit ORDER BY cnt DESC
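The query on this slide can be tried end to end with Python's built-in sqlite3 module. The following is a minimal sketch, not the SQLShare implementation: the sample `hit` values are hypothetical, and the real `tigrfamannotation_surface` table (which the talk does not show) likely has more columns.

```python
import sqlite3

# Hypothetical sample rows; the real table contents are not shown in the talk.
rows = [("TIGR00001",), ("TIGR00002",), ("TIGR00001",),
        ("TIGR00003",), ("TIGR00001",), ("TIGR00002",)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tigrfamannotation_surface (hit TEXT)")
conn.executemany("INSERT INTO tigrfamannotation_surface VALUES (?)", rows)

# The query from the slide: how often does each hit appear?
query = """
    SELECT hit, COUNT(*) AS cnt
    FROM tigrfamannotation_surface
    GROUP BY hit
    ORDER BY cnt DESC
"""
hit_counts = conn.execute(query).fetchall()
for hit, cnt in hit_counts:
    print(hit, cnt)
```

The point of the sketch is that the scientist's question maps onto a single declarative statement; everything else here is setup that a hosted service would take off the scientist's hands.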
  • Discovery: SQL Does Not Terrify Scientists
  • Technology used in 1st Gen: Component Stack
  • SQLShare Redux
    • Conventional wisdom says “Scientists won’t write SQL”
      • We don’t believe it!
    • Instead, we implicate difficulty in
      • installation
      • configuration
      • schema design
      • performance tuning
      • data ingest
      • over-reliance on GUIs
    • Critical need for visualization
      • Clear role for Tableau!
    We are asking “What kind of platform will make SQL useful for scientific inquiry?”
  • Automating Mashups
  • Why Mashups?
    • Jim Gray: # of datasets scales as N^2
      • Each pairwise comparison generates a new dataset
    • Corollary: # of apps scales as N^2
      • Every pairwise comparison motivates a new mashup
    • To keep up, we need to
      • entrain new programmers,
      • make existing programmers more productive,
      • or both
  • Satellite Images + Crime Incidence Reports
  • Twitter Feed + Flickr Stream
  • Why Mashups?
    • The days when one’s data could fit into a 15-page research paper are past.
    • Datasets are too large and complex to be conveyed with a handful of static images
    • Prediction: succinct, targeted, interactive web apps will become the currency of scientific communication
      • with the public
      • with policy makers
      • with colleagues in other disciplines
      • with peers
      • with students (K12 - grad)
  • Tableau Mashups
  • Conclusions
    • Converging requirements for DB and Vis
    • At high scale:
      • A Vis library in MapReduce
    • At high complexity:
      • Azure Ocean
        • Data + Workflow + Vis
        • “Client + Cloud”, “Computational mobility”
      • SQLShare
        • Ad Hoc data -- “anything goes”
        • Visualization critical
      • (semi-)automated mashups
        • “Show me what’s interesting”
  • Acknowledgments http://escience.washington.edu
  • BACKUP SLIDES
  • [John Delaney, University of Washington]
  • John Delaney
  • Azure Ocean: COVE for Visualization + Trident for Processing + Azure for Data
  • COVE
    • Research into new interfaces for cross-disciplinary ocean science
    • Extensive instrument and cable layout for creating experiments
    • Flexible terrain and image engine for visualizing site
    • True 3D/4D science dataset visualization
    • Field tested in RSN observatory layout and on ocean expeditions
    • Cross-platform and extensible with Python and workflow systems
  • Trident
    • Microsoft Research scientific workflow system
    • Visual programming environment for connecting tasks
    • Science-specific task libraries including one for ocean sciences
    • Automated provenance capture, monitoring, and fault tolerance
    • Runs on local system, Windows server, or HPC Cluster
    • Cross platform with Silverlight and web service interface
  • Azure
    • Microsoft’s cloud computing platform
    • Provides storage and computing as pay-as-you-go services
    • From a development standpoint, the system looks like provisioned VMs
    • SQL, table, and blob (file system) storage models are included
    • Access to storage via RESTful HTTP interface
  • Azure Ocean
    • COVE + Trident + Azure provides visual analytics to scientists
    • Any component (Visualization, Computing, or Data) can be provisioned locally, on a server, or in the cloud
    • When on same machine, system APIs are leveraged for speed
    • When distributed, communication is through HTTP and RESTful APIs
    • Flexible platform for the diverse ocean science needs
  • MapReduce Programming Model
    • Input & Output: each a set of key/value pairs
    • Programmer specifies two functions:
      • map (in_key, in_value) -> list(out_key, intermediate_value)
        • Processes input key/value pair
        • Produces set of intermediate pairs
      • reduce (out_key, list(intermediate_value)) -> list(out_value)
        • Combines all intermediate values for a particular key
        • Produces a set of merged output values (usually just one)
    slide source: Google, Inc.
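The two signatures above can be exercised with a small in-memory driver. This is a sketch of the programming model only, assuming a single process and no fault tolerance; `run_mapreduce` and the word-count functions are illustrative names, not part of Hadoop or any Google API.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """map: emit (word, 1) for each word in a line; in_key (line number) is unused."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """reduce: merge all values for one key into a single output value."""
    return [sum(intermediate_values)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # "Shuffle": group every intermediate value under its key.
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    # Apply reduce once per distinct key.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

lines = [(0, "salt flux salt"), (1, "flux")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

In a real deployment the map calls and the per-key reduce calls run on different machines; the grouping step in the middle is what the framework provides.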
  • Isosurface Example
  • Isosurface Example <Vis movie> Key idea: Zooplankton correlated with temperature
  • Example Query Results
  • Example Query: Climatology [Figure (animation): Average Surface Salinity by Month, Columbia River Plume, 1999-2006; panels: Feb, May; color scale in psu; map labels: Columbia River, Washington, Oregon]
  • UW + Utah CluE Program
    • Goals
      • 10+-year “climatologies” at interactive speeds
      • … with provenance, reproducibility, collaboration … on a shared-nothing, commodity platform
      • In general: Explore the intersection of scientific databases and scientific visualization, at scale
    • Methods
      • “Cloud-Enable” two projects
        • GridFields: Query algebra for mesh data
        • VisTrails: Scientific workflow and provenance
  • Converging Requirements. Vis: “Query-driven Visualization”; Vis: “In Situ Visualization”; Vis: “Remote Visualization”; DB: Millions of tuples per result [Figure: Vis and DB]
  • Preliminary results
    • Managing Hadoop jobs with VisTrails
    • GridField queries in Hadoop
    • Core Visualization algorithms in Hadoop
  • Core Vis Algorithms in MapReduce
    • Scalar/Volume Rendering
      • Map: Rasterization
      • Reduce: Compositing, blending
    • Isosurface Extraction
      • Map: Isosurface Extraction
      • Reduce: Combine like isovalues
    • Mesh Simplification
      • Map: Bin vertices
      • Reduce: Collapse binned triangles
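The first decomposition above (Map: rasterization; Reduce: compositing) can be sketched as follows. This is a deliberately simplified illustration, not the authors' renderer: it assumes an orthographic projection, treats each input point as covering exactly one pixel, and composites by keeping the nearest fragment (a z-buffer test) with no blending. All function names are hypothetical.

```python
from collections import defaultdict

def map_rasterize(point):
    """Map: project a 3D point onto an integer pixel grid, keyed by pixel."""
    x, y, z, color = point
    return (int(x), int(y)), (z, color)

def reduce_composite(pixel, fragments):
    """Reduce: keep the fragment nearest the viewer (smallest depth)."""
    depth, color = min(fragments)
    return color

def render(points):
    # Shuffle step: collect all fragments that landed on the same pixel.
    per_pixel = defaultdict(list)
    for point in points:
        pixel, fragment = map_rasterize(point)
        per_pixel[pixel].append(fragment)
    return {pixel: reduce_composite(pixel, frags)
            for pixel, frags in per_pixel.items()}

points = [(0.2, 0.7, 5.0, "red"), (0.9, 0.1, 2.0, "blue"), (1.5, 0.5, 3.0, "green")]
image = render(points)
print(image)
```

Keying on the pixel is the design choice that makes this fit MapReduce: every fragment for a given pixel arrives at the same reducer, so compositing needs no communication between reducers.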
  • ATLAS dataset
  • Rendering (not CluE) [Plot: results vs. # of mappers; 57-node Nehalem]
  • Isosurface Extraction (Preliminary) [Plot; axis ticks: 32, 48, 64, 96, 128]
  • “Query-Driven Visualization”
    • Vis perspective:
      • query = subsetting
    • DB perspective:
      • query = manipulation, preparation, restructuring, index-building, aggregation, regridding, downsampling, simplification, reformatting, etc.
    • Database Maxims:
    • Push the computation to the data.
    • Declarative programming is a good thing.
  • Why Cloud?
    • “Cloud”?
      • Software as a Service (SaaS)
      • Infrastructure as a Service (IaaS)
      • Platform as a Service (PaaS)
    • Working definition:
    General, elastic, data-intensive, scalable computing.
    This work: Vis techniques + DB techniques in the Cloud
  • Shared Nothing Parallel Databases
    • Teradata
    • Greenplum
    • Netezza
    • Aster Data Systems
    • DATAllegro (Microsoft)
    • Vertica
    • MonetDB (recently commercialized as “Vectorwise”)
  • Taxonomy of Parallel Architectures [Figure annotations: “Easiest to program, but $$$$”; “Scales to 1000s of nodes”]
  • VisTrails screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
  • Version Tree screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
  • Collaboration (Howe et al., eScience 2008): Bill Howe @ UW computes salt flux using GridFields; Erik Anderson @ Utah adds vector streamlines and adjusts opacity; Bill Howe @ UW adds an isosurface of salinity; Peter Lawson adds discussion of the scientific interpretation
  • Preliminary results
    • Managing Hadoop jobs with VisTrails
    • GridField queries in Hadoop
    • Core Visualization algorithms in Hadoop
  • Hadoop in VisTrails
    • Wrap Hadoop Streaming/HDFS Operations
    • Plug “PreProcess” to actual Vis Pipeline
    3/12/09
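Hadoop Streaming runs any executable as mapper or reducer, feeding it records on standard input and reading tab-separated key/value lines from standard output; reducer input arrives sorted by key. The sketch below shows a Python mapper/reducer pair in that style. It is an illustration only: the comma-separated record layout (`timestep,x,y,salinity`) is an assumption, and the talk's actual mapper ran a VTK script, which is not reproduced here.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit 'timestep<TAB>salinity' for each 'timestep,x,y,salinity' record."""
    for line in lines:
        timestep, _x, _y, salinity = line.strip().split(",")
        yield f"{timestep}\t{salinity}"

def reducer(lines):
    """Average the values per key; input lines must already be sorted by key."""
    pairs = (line.strip().split("\t") for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(v) for _, v in group]
        yield f"{key}\t{sum(values) / len(values)}"

# When run as a script with a single 'map' or 'reduce' argument,
# behave like a streaming stage: stdin in, stdout out.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

Locally, the pair can be checked without a cluster by sorting the mapper output and passing it to the reducer, which mimics the shuffle that Hadoop performs between the two stages.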
  • Hadoop in VisTrails
    • Provenance and Monitoring
    3/12/09
  • Preliminary results
    • Managing Hadoop jobs with VisTrails
    • GridField queries in Hadoop
    • Core Visualization algorithms in Hadoop
  • All Science is reducing to a database problem
    • Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
    • New model: “Download the world” (Data acquired en masse, independent of hypotheses)
      • Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
      • Medicine: ubiquitous digital records, MRI, ultrasound
      • Oceanography: high-resolution models, cheap sensors, satellites
      • Biology: lab automation, high-throughput sequencing
    “Increase Data Collection Exponentially in Less Time, with FlowCAM”
    Empirical X → Analytical X → Computational X → X-informatics
  • Key Idea: Declarative Languages
    • Find all orders from today, along with the items ordered
    • SELECT * FROM Order o, Item i WHERE o.item = i.item AND o.date = today()
    • [Plan diagram: join (o.item = i.item) over select (date = today()) on scan (Order o), and scan (Item i)]
  • Example System: Teradata (AMP = unit of parallelism)
  • Example System: Teradata. AMPs 1-3 each run: scan Order o → select date=today() → hash h(item), sending rows to AMPs 4-6
  • Example System: Teradata. AMPs 1-3 each run: scan Item i → hash h(item), sending rows to AMPs 4-6
  • Example System: Teradata. AMPs 4-6 each run: join o.item = i.item. AMP 4 contains all orders and all lines where hash(item) = 1; AMP 5, where hash(item) = 2; AMP 6, where hash(item) = 3
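The three Teradata slides describe a hash-partitioned parallel join: filter, hash both inputs on the join key, then join each partition independently. The sketch below simulates that plan in one process. It is not Teradata's implementation; the row layout, the use of Python's built-in `hash`, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

NUM_PARTITIONS = 3  # stands in for AMPs 4-6 on the slides

def partition(rows, key, n=NUM_PARTITIONS):
    """hash h(item): route each row to a partition by hashing its join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def parallel_join(orders, items, today):
    todays_orders = [o for o in orders if o["date"] == today]  # select date = today()
    order_parts = partition(todays_orders, "item")
    item_parts = partition(items, "item")
    result = []
    for p in range(NUM_PARTITIONS):  # each partition could run on its own node
        by_item = defaultdict(list)
        for i in item_parts.get(p, []):
            by_item[i["item"]].append(i)
        for o in order_parts.get(p, []):
            for i in by_item[o["item"]]:  # join o.item = i.item
                result.append({**o, **i})
    return result

orders = [{"order_id": 1, "item": 10, "date": "2010-05-18"},
          {"order_id": 2, "item": 11, "date": "2010-05-17"},
          {"order_id": 3, "item": 12, "date": "2010-05-18"}]
items = [{"item": 10, "name": "sensor"}, {"item": 11, "name": "cable"},
         {"item": 12, "name": "buoy"}]
joined = parallel_join(orders, items, "2010-05-18")
print(sorted((r["order_id"], r["name"]) for r in joined))
```

Because matching rows always hash to the same partition, no partition needs to see another's data, which is what lets the join run on a shared-nothing cluster.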
  • Workflow Execution Plans Need execution plans spanning client/server/cloud
  • Example: Isosurface Browsing
  • Example: Isosurface Browsing
    • Plan A
    [Diagram: a Subset stage for each of tstep 0 through tstep 3]
  • Example: Isosurface Browsing
    • Plan B: Build an index
    [Diagram: Build Index, e.g., an Interval Tree (Cignoni 97); per-timestep stages Subset, Isosurface, Render for tstep 0 through tstep 3]
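Plan B pays an up-front cost to build an index so that each isovalue query touches only the cells that can contain the surface. The sketch below illustrates the idea with a simpler structure than the interval tree the slide cites: cells are sorted by their minimum value, a binary search discards cells whose minimum exceeds the isovalue, and the remainder are filtered by maximum. The `SpanIndex` class and the cell ranges are hypothetical.

```python
import bisect

class SpanIndex:
    """Index cells by (min, max) scalar range; not a true interval tree."""

    def __init__(self, cell_ranges):
        # cell_ranges: {cell_id: (min_value, max_value)}
        self._cells = sorted((lo, hi, cid) for cid, (lo, hi) in cell_ranges.items())
        self._mins = [lo for lo, _, _ in self._cells]

    def query(self, isovalue):
        """Return ids of cells whose range contains the isovalue."""
        # Cells past this position have min > isovalue and cannot be active.
        end = bisect.bisect_right(self._mins, isovalue)
        return sorted(cid for lo, hi, cid in self._cells[:end] if hi >= isovalue)

cells = {"c0": (0.0, 10.0), "c1": (12.0, 20.0), "c2": (8.0, 15.0), "c3": (25.0, 31.0)}
index = SpanIndex(cells)
active = index.query(9.0)
print(active)
```

A real interval tree would also bound the work spent on the maximum-value filter; the trade-off the slide raises is the same either way: the index is built once and reused across every isovalue the user browses.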
  • Example: Isosurface Browsing
    • Plan C: Build a spatial index to support panning
    • Plan D: Build a multi-resolution index to support zoom
    • … and so on
    • Why not precompute all appropriate indexes?
      • Some will (partially) reside on client
      • Storage is not as cheap as we pretend
    • Need a flexible system where
      • a “query result” can be explored interactively, and
      • we prepare for similar queries
      • similarity defined by natural “browsing patterns” in visualization systems
  • Why MapReduce/Hadoop?
    • Popular
      • AWS Elastic MapReduce
      • 100s of startups
      • # of downloads
      • # of blog posts
    • Free as in Speech
    • Free as in Beer
    • Flexible, Lightweight
    • Scalable
    • Fault-tolerant
  • Reducing Latency
    • Online processing/progressive refinement
      • Deliver approximate/partial results
    • Standing Queries/Prepared plans
    • Exploit indexes
    Changes to Hadoop and/or other tools required (e.g., HBase)
  • Masking Latency
    • Caching/materialized views
      • Reuse old results
    • Pre-fetching
      • Stage and prepare new results
    • Speculative processing
      • Anticipate future results
    No change to Hadoop required
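Two of the masking techniques above, caching and pre-fetching, can be layered on top of an unmodified backend. The sketch below uses Python's `functools.lru_cache` as the cache and pre-fetches the next timestep whenever the user requests one. It is a toy illustration: `extract_isosurface` is a hypothetical stand-in for an expensive Hadoop job, and the pre-fetch runs synchronously here, whereas a real system would stage it in the background.

```python
from functools import lru_cache

calls = []  # records which (timestep, isovalue) pairs were actually computed

@lru_cache(maxsize=128)
def extract_isosurface(timestep, isovalue):
    """Stand-in for an expensive remote job; results are cached for reuse."""
    calls.append((timestep, isovalue))
    return f"isosurface(t={timestep}, v={isovalue})"

def browse(timestep, isovalue):
    result = extract_isosurface(timestep, isovalue)
    # Pre-fetch: users stepping through time usually ask for the next step.
    extract_isosurface(timestep + 1, isovalue)
    return result

browse(0, 28.0)
browse(1, 28.0)  # served from the cache thanks to the earlier pre-fetch
print(calls)
```

The guess that the next timestep follows the current one is an example of the natural browsing patterns that determine which results are worth preparing ahead of time.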
  • source: Antonio Baptista, NSF CMOP STC
  • Why Visualization? (2) [Figure labels: north channel, south channel]
  • MapReduce?
    • Hadoop simplifies parallel data processing
      • ++ scalability
      • ++ fault tolerance
      • ++ less programming
      • -- latency is an issue
  • Climatology Queries [Figure, panel (b): numbered items 1-31; scale in psu]
  • As a GridField Expression [Diagram residue, symbols lost in extraction: H0 : (x,y,b); V0 : ( ); apply(0, z=(surf b) * ); bind(0, surf)]
    H = Scan(contxt, "H")
    rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
    T = Scan(contxt, "T")
    V = Scan(contxt, "V")
    HxV = Cross(H, V)
    HxVxT = Cross(HxV, T)
    salt = Bind(contxt, HxVxT, "salt")
    onemonth = Regrid(salt, HxV, equijoin("hpos,vpos"), avg())
  • As a SQL Query: SELECT hpos, vpos, avg(salt) FROM ocean GROUP BY hpos, vpos
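The SQL form of the climatology query can be run against a toy table with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the `ocean` schema (including the time column `t`) and the sample salinity values are hypothetical, and an ORDER BY is added only to make the printed output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one salinity sample per (hpos, vpos, t).
conn.execute("CREATE TABLE ocean (hpos INTEGER, vpos INTEGER, t INTEGER, salt REAL)")
rows = [(0, 0, 0, 30.0), (0, 0, 1, 32.0),
        (1, 0, 0, 20.0), (1, 0, 1, 24.0)]
conn.executemany("INSERT INTO ocean VALUES (?, ?, ?, ?)", rows)

# Average over time at each mesh position, as in the slide's query.
climatology = conn.execute(
    "SELECT hpos, vpos, AVG(salt) FROM ocean "
    "GROUP BY hpos, vpos ORDER BY hpos, vpos"
).fetchall()
print(climatology)
```

The GROUP BY over (hpos, vpos) plays the role of the Regrid step in the GridField expression: both collapse the time dimension onto the horizontal-by-vertical grid with an average.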
  • Scientific Workflow Systems
    • Value proposition: More time on science, less time on code
    • How: By providing language features emphasizing sharing, reuse, reproducibility, rapid prototyping, efficiency
      • Provenance
      • Visual programming
      • Caching
      • Integration with domain-specific tools
      • Scheduling
  • Related Vis Work
    • Parallel visualization systems
      • ParaView, VisIt
    • Query-Driven Visualization
      • [Bethel et al 2006,2008,2009]
    • FastBit Index
      • [Shoshani et al 2007]
    • DB Vis systems
      • Tableau
  • Feeding the Pipeline [Figure, source: Ken Moreland; annotation: “missing step?”]
  • Cannot Ignore “Preprocessing” Hadoop
  • Role 2: Move Computation to the Data “Transferring the whole data generated … to a storage device or a visualization machine could become a serious bottleneck, because I/O would take most of the … time. A more feasible approach is to reduce and prepare the data in situ for subsequent visualization and data analysis tasks.” -- SciDAC Review
  • Remote Visualization
    • Reduce and render remotely, transfer images
      • ++ transfers less data
      • -- specialized hardware, high load
    • Reduce remotely, transfer data/geometry, render locally
      • ++ uses local graphics pipeline
      • -- transfers more data
  • Scientific Vis System Roundup
    • General
      • ParaView [Kitware, Los Alamos, Sandia]
      • VisIt [LLNL]
    • Specialized
      • SALSA, particles, Quinn, UW
      • VISUS, streaming/progressive, Jones, LLNL
      • SAGE,
      • Hyperwall, tiled display, NASA