A New Partnership for Cross-Scale, Cross-Domain eScience

Overview of the Moore Foundation-funded collaborative project between CMU and UW to advance eScience, given at the Microsoft eScience 2009 workshop.

Speaker Notes

  • My name is Bill Howe. I’m not Ed Lazowska. In all fields of science, data is starting to come in faster than it can be analyzed, so we need to advance and proliferate computational technologies in sensor networking, databases and data mining, visualization, machine learning, and cluster/cloud computing. If we don’t, we see UW losing its competitive edge; the mission of the eScience Institute is to prevent that from happening. So, by the animation loophole, there we go. Funding! We have $1M from the state, we just got a nice award from the Moore Foundation, and several proposals are outstanding. People! We have a fantastic team: Dave Beck in biosciences, Jeff Gardner in astrophysics and HPC, myself in databases, Ed and Erik, Mette Peters in health sciences, and Chance Reschke in large-scale computing platforms. And there’s our URL: escience.washington.edu
  • Drowning in data; starving for information. We’re at war with these engineering companies: FlowCAM brags about the amount of data it can spray out of its device, and how to use this enormous data stream to answer scientific questions is someone else’s problem.
  • The long tail of eScience: a huge number of scientists who struggle with data management but do not have access to IT resources -- no clusters, no system administrators, no programmers, and no computer scientists. They rely on spreadsheets, email, and maybe a shared file system. Their data challenges have more to do with heterogeneity than size: tens of spreadsheets from different sources. However, the long tail is becoming the fat tail. Tens of spreadsheets grow to hundreds, and the number of records in each goes from hundreds to thousands. How many of you know someone who was forced to split a large spreadsheet into multiple files to get around the 65k-row limit in certain versions of Excel? Further, medium data (gigabytes) becomes big data (terabytes): ocean modelers are moving from regional simulations to meso-scale simulations to global simulations.
  • Armbrust Lab combines lab-based and field-based studies to address basic questions about the function of marine ecosystems.
  • An asterisk or underline (*) indicates custom software developed in the Armbrust Lab. Blue: traditional tools for “basement” bioinformatics -- individual scientists. Orange: increased centralization, economies of scale, shared resources; deployed in the Armbrust Lab. Yellow: third-party tools developed for scalable bioinformatics. Purple: emerging tools under evaluation for convenient petascale bioinformatics, through a collaboration with the eScience Institute (under funding review by the Moore Foundation!). Thanks to advances in sensors, sequencing instruments, and algorithms, the field of bioinformatics is moving away from “single-task” software that operates on datasets that fit on a single computer, in favor of flexible, “multi-purpose” frameworks that can operate on datasets spanning clusters of computers. In our lab, we have deployed a variety of flexible tools and have developed our own software to streamline our scientific process and reduce the overall “time to insight”. (Maybe talk about WebBlast and PPlacer here.) Observing that the amount of data collected is doubling every year (outpacing even Moore’s Law!), we are also collaborating with the UW eScience Institute to explore ways to harness emerging technologies for massively parallel data analysis involving hundreds or thousands of machines. Some of these frameworks involve “cloud computing” -- the use of computational infrastructure provided, inexpensively, by “big players” in software and computing: Amazon, Microsoft, Google. [Maybe more on the eScience Institute?]
  • Dial down the expressiveness but dial up the programming and execution services
  • It turns out that you can express a wide variety of computations using only a handful of operators.
  • Two nodes slower than one, four nodes slower than eight -- shows the overhead of providing parallel processing.
  • Data products are the currency of scientific and statistical communication with the public. Ex: the Obama map. Ex: Mars Rover pictures generated 218M hits in 24 hours. But datasets are growing too big and too complex to view through a few static images; scientists want to create interactive visualizations that allow others to explore their results. Ex: NASA 3D with Photosynth. Ex: CAMERA.
  • On the order of hundreds of points. Manual browsing.
  • This movie was rendered offline, but it’s increasingly important to be able to create visualizations on the fly to allow interactive exploration of large datasets.
  • Visualization is a more efficient way to query data -- you can browse and explore. But you need to be able to switch back and forth between interactive browsing and symbolic querying
  • Climatology is the long-term average.
  • We want to know the makeup of the text by word length. For example, how many words have more than 10 characters? How many have between 5 and 9 characters, between 2 and 4, or just 1? Map will read the text and tag each word with a different color depending on its length.
  • Motivating the Map task and the intuition behind map: think of map as a group-by. Distribution of word lengths.
  • It provides a means of describing data with its natural structure only--that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation on the other.
  • So these two different views of the world, RDBMS and MapReduce, are not really different at all -- just different feature sets along a continuum of data processing. As evidence: Teradata, Greenplum, Netezza, Aster Data Systems, Dataupia, Vertica, MonetDB.
  • Hadoop is an implementation based on the details in the 2004 MapReduce paper.
  • You don’t have to write separate map and reduce functions -- the system will take care of that for you, as well as optimize for you. This is by no means an exhaustive list of operators.
  • The goal here is to make shared-nothing architectures easier to program.

A New Partnership for Cross-Scale, Cross-Domain eScience -- Presentation Transcript

  • A New Partnership for eScience. Bill Howe, UW; Ed Lazowska, UW; Garth Gibson, CMU; Christos Faloutsos, CMU; Peter Lee, CMU (DARPA); Chris Mentzel, Moore Foundation
  • http://escience.washington.edu
  • The University of Washington eScience Institute
    • Rationale
      • The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich
      • Techniques and technologies include
        • Sensors and sensor networks, databases, data mining, machine learning, visualization, cluster/cloud computing
      • If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive
    • Mission
      • Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them
    • Strategy
      • Bootstrap a cadre of Research Scientists
      • Add faculty in key fields
      • Build out a “consultancy” of students and non-research staff
  • Staff and Funding
    • Funding
      • $1M/year direct appropriation from WA State Legislature
      • $1.5M from Gordon and Betty Moore Foundation (joint with CMU)
      • Multiple proposals outstanding
    • Staffing
      • Dave Beck, Research Scientist: Biosciences and software eng.
      • Jeff Gardner, Research Scientist: Astrophysics and HPC
      • Bill Howe, Research Scientist: Databases, visualization, DISC
      • Ed Lazowska, Director
      • Erik Lundberg (50%), Operations Director
      • Mette Peters, Health Sciences Liaison
      • Chance Reschke, Research Engineer: large scale computing platforms
      • … plus a senior faculty search underway
      • … plus a “consultancy” of students and professional staff
  • All science is reducing to a database problem
    • Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
    • New model: “Download the world” (Data acquired en masse, in support of many hypotheses)
      • Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
      • Medicine: ubiquitous digital records, MRI, ultrasound
      • Oceanography: high-resolution models, cheap sensors, satellites
      • Biology: lab automation, high-throughput sequencing
    “Increase data collection exponentially with FlowCam!”
  • The Long Tail. The long tail is getting fatter: notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB). [Figure: data volume vs. rank]
    • Researchers with growing data management challenges but limited resources for cyberinfrastructure
    • No dedicated IT staff
    • Over-reliance on inadequate but familiar tools
    [Figure: CERN (~15 PB/year), LSST (~100 PB), PanSTARRS (~40 PB), SDSS (~100 TB), CARMEN (~50 TB), ocean modelers, seismologists, microbiologists, and spreadsheet users, arranged along the tail] “The future is already here. It’s just not very evenly distributed.” -- William Gibson
  • Case Study: Armbrust Lab
  • Armbrust Lab Tech Roadmap [Figure: tools arranged by scalability (workstation/server → cluster/cloud) and machine specialization (specific tasks → general tasks), from past to present to soon: Excel, ClustalW, NCBI BLAST, Phred/Phrap, MAQ, CLC Genomics, R, RDBMS, BioPython, AnnoJ, WebBlast*, PPlacer*, CloudBurst, Hadoop/Dryad, Parallel Databases (?), Azure, AWS, and other tools; * indicates custom software developed in the Armbrust Lab]
  • What Does Scalable Mean?
    • Operationally :
      • In the past: “Works even if data doesn’t fit in main memory”
      • Now: “Can make use of 1000s of cheap computers”
    • Formally :
      • In the past: polynomial time and space. If you have N data items, you must do no more than N^k operations
      • Soon: logarithmic time and linear space. If you have N data items, you must do no more than N log(N) operations
      • Soon, you’ll only get one pass at the data
      • So you better make that one pass count
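  • A minimal Python sketch (illustrative only, not from the talk) of what “one pass” forces on the programmer: every statistic must be maintained incrementally, because the stream cannot be rewound:

    # One-pass, constant-space aggregation over a data stream.
    def one_pass_stats(stream):
        count, mean, maximum = 0, 0.0, float("-inf")
        for x in stream:
            count += 1
            mean += (x - mean) / count     # incremental running mean
            maximum = max(maximum, x)
        return count, mean, maximum

    # Works the same for a 10-element list or a 10-billion-row scan.
    print(one_pass_stats(iter([3.0, 1.0, 4.0, 1.0, 5.0])))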
  • A Goal: Cross-Scale Solutions
    • Gracefully scale up
      • from files to databases to cluster to cloud
      • from MB to GB to TB to PB
    • “Gracefully” means:
      • logical data independence
      • no expensive ETL migration projects
    • “Gracefully” means: everyone can use it
      • Hackers / Computational Scientists
      • Lab/Field Scientists
      • The Public
      • K12
      • Legislators
  • Systems compared by Data Model, Operations, and Services:
    • MapReduce -- Data Model: [(key,value)]; Operations: Map, Reduce; Services: massive data parallelism, fault tolerance, scheduling
    • MPI -- Data Model: Arrays/Matrices; Operations: 70+ ops; Services: data parallelism, full control
    • SQL / Relational Algebra -- Data Model: Relations; Operations: Select, Project, Join, Aggregate, …; Services: optimization, physical data independence, indexing, parallelism
    • DryadLINQ -- Data Model: IQueryable, IEnumerable; Operations: RA + Apply + Partitioning; Services: typing, massive data parallelism, fault tolerance
    • Workflow -- Data Model: *; Operations: arbitrary boxes-and-arrows; Services: typing, provenance, Pegasus-style resource mapping, task parallelism
    • Pig -- Data Model: Nested Relations; Operations: RA-like, with Nest/Flatten; Services: optimization, monitoring, scheduling
    [Residual slide annotations: “None for free”, “* GPL”]
  • MapReduce
    • Many tasks process big data, produce big data
    • Want to use hundreds or thousands of CPUs
      • ... but this needs to be easy
      • Parallel databases exist, but require DBAs and $$$$
      • … and do not easily scale to thousands of computers
    • MapReduce is a lightweight framework, providing:
      • Automatic parallelization and distribution
      • Fault-tolerance
      • I/O scheduling
      • Status and monitoring
  • A complete DryadLINQ program:

    public class LogEntry {
      public string user, ip, page;
      public LogEntry(string line) {
        string[] fields = line.Split(' ');
        this.user = fields[8];
        this.ip = fields[9];
        this.page = fields[5];
      }
    }
    public class UserPageCount {
      public string user, page;
      public int count;
      public UserPageCount(string usr, string page, int cnt) {
        this.user = usr; this.page = page; this.count = cnt;
      }
    }

    PartitionedTable<string> logs = PartitionedTable.Get<string>(@"file:logfile.pt");
    var logentries = from line in logs
                     where !line.StartsWith("#")
                     select new LogEntry(line);
    var user = from access in logentries
               where access.user.EndsWith(@"ulfar")
               select access;
    var accesses = from access in user
                   group access by access.page into pages
                   select new UserPageCount("ulfar", pages.Key, pages.Count());
    var htmAccesses = from access in accesses
                      where access.page.EndsWith(".htm")
                      orderby access.count descending
                      select access;
    htmAccesses.ToPartitionedTable(@"file:results.pt");

    slide source: Christophe Poulain, MSR
  • Relational Databases. Pre-relational DBMS brittleness: if your data changed, your application often broke. Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code. [Figure: physical data independence takes you from files and pointers to relations; logical data independence from relations to views] “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” Key Idea: programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation
  • Relational Databases
    • Databases are especially, but not exclusively, effective at “Needle in Haystack” problems:
      • Extracting small results from big datasets
      • Transparently provide “old style” scalability
      • Your query will always* finish, regardless of dataset size.
      • Indexes are easily built and automatically used when appropriate
    CREATE INDEX seq_idx ON sequence(seq);
    SELECT seq FROM sequence WHERE seq = 'GATTACGATATTA';
    (*almost)
  • Key Idea: Data Independence. [Figure: physical data independence (files and pointers → relations); logical data independence (relations → views)]

    SELECT * FROM my_sequences

    SELECT seq FROM ncbi_sequences WHERE seq = 'GATTACGATATTA';

    f = fopen('table_file');
    fseek(10030440);
    while (True) {
      fread(&buf, 1, 8192, f);
      if (buf == GATTACGATATTA) { . . .
  • Key Idea: An Algebra of Tables. [Figure: operator tree of select, project, and two joins] Other operators: aggregate, union, difference, cross product
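  • To make the algebra concrete, here is a minimal Python sketch (hypothetical, not from the slides) treating a relation as a list of dicts and implementing select, project, and join directly:

    def select(relation, predicate):
        return [row for row in relation if predicate(row)]

    def project(relation, columns):
        return [{c: row[c] for c in columns} for row in relation]

    def join(r, s, key):
        return [{**x, **y} for x in r for y in s if x[key] == y[key]]

    orders = [{"item": "disk", "qty": 2}, {"item": "cpu", "qty": 1}]
    items  = [{"item": "disk", "price": 80}, {"item": "cpu", "price": 200}]

    # Compose operators exactly like algebraic expressions.
    pricey = select(join(orders, items, "item"), lambda t: t["price"] > 100)
    print(project(pricey, ["item", "qty"]))   # [{'item': 'cpu', 'qty': 1}]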
  • Key Idea: Algebraic Optimization
    • N = ((z*2)+((z*3)+0))/1
    • Algebraic Laws:
    • 1. (+) identity: x+0 = x
    • 2. (/) identity: x/1 = x
    • 3. (*) distributes: (n*x+n*y) = n*(x+y)
    • 4. (*) commutes: x*y = y*x
    • Apply rules 1, 3, 4, 2:
    • N = (2+3)*z
    • two operations instead of five, no division operator
    Same idea works with the Relational Algebra!
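  • A toy Python sketch of the same rewrite in the relational algebra (relation and column names are illustrative, echoing the hot-gas query later in the talk): pushing a selection below a join gives the same answer with a much smaller join input:

    def join(r, s, key):
        return [{**x, **y} for x in r for y in s if x[key] == y[key]]

    def select(rel, pred):
        return [t for t in rel if pred(t)]

    gas1 = [{"id": i, "temp": 100000 * i} for i in range(1, 1001)]
    gas2 = [{"id": i, "rho": 0.1 * i} for i in range(1, 1001)]
    hot = lambda t: t["temp"] > 150000

    # Naive plan: join 1000 x 1000 rows, then filter.
    plan_a = select(join(gas1, gas2, "id"), hot)
    # Optimized plan: filter first, then join only the survivors.
    plan_b = join(select(gas1, hot), gas2, "id")

    assert plan_a == plan_b   # same answer, far less work in the join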
  • Shared Nothing Parallel Databases
    • Teradata
    • Greenplum
    • Netezza
    • Aster Data Systems
    • DATAllegro (acquired by Microsoft)
    • Vertica
    • MonetDB (recently commercialized as “Vectorwise”)
  • Case Study: Astrophysics Simulation
  • N-body Astrophysics Simulation
    • 15 years in development
    • 10^9 particles
    • Gravity
    • Months to run
    • 7.5 million CPU hours
    • 500 timesteps
    • Big Bang to now
    Simulations from Tom Quinn’s Lab, work by Sarah Loebman, YongChul Kwon, Bill Howe, Jeff Gardner, Magda Balazinska
  • Q1: Find Hot Gas
    • SELECT id
    • FROM gas
    • WHERE temp > 150000
  • Single Node: Query 1 [Chart: runtimes on datasets of 169 MB, 1.4 GB, 36 GB]
  • Multiple Nodes: Query 1 [Chart: Database Z]
  • Multiple Nodes: Query 2 [Chart: Database Z]
  • Q4: Gas Deletion -- particles removed between two timesteps:

    SELECT gas1.id
    FROM gas1 FULL OUTER JOIN gas2 ON gas1.id = gas2.id
    WHERE gas2.id IS NULL
  • Single Node: Query 4
  • Multiple Nodes: Query 4
  • Ease of Use

    star43 = FOREACH rawGas43 GENERATE $0 AS pid:long;
    star60 = FOREACH rawGas60 GENERATE $0 AS pid:long;
    groupedGas = COGROUP star43 BY pid, star60 BY pid;
    selectedGas = FOREACH groupedGas GENERATE
        FLATTEN((IsEmpty(star43) ? null : star43)) AS s43,
        FLATTEN((IsEmpty(star60) ? null : star60)) AS s60;
    destroyed = FILTER selectedGas BY s60 IS NULL;
  • Visualization and Mashups: Dancing with Data
  • Data explosion, again
    • Data growth is outpacing Moore’s Law
      • Why?
      • Cost of acquisition has dropped through the floor
      • Every pairwise comparison of datasets generates a new dataset -- N^2 growth
    • So: Scalable analysis is necessary
    • But: Scalable analysis is hard
  • It’s not just the size….
    • Corollary: # of apps scales as N^2
      • Every pairwise comparison motivates a new application
    • To keep up, we need to
      • entrain new programmers,
      • make existing programmers more productive,
      • or both
  • Satellite Images + Crime Incidence Reports
  • Twitter Feed + Flickr Stream
  • Zooplankton and Temperature <Vis movie>
  • Why Visualization?
    • High bandwidth of the human visual cortex
    • Query-writing presumes a precise goal
    • Try this in SQL: “What does the salt wedge look like?”
  • Data Product Ensembles [source: Antonio Baptista, Center for Coastal Margin Observation and Prediction]
  • Example: Find matching sequences
    • Given a set of sequences
    • Find all sequences equal to “GATTACGATATTA”
  • Example System: Teradata. AMP = unit of parallelism.
  • Example System: Teradata. Find all orders from today, along with the items ordered:

    SELECT * FROM Orders o, Lines i
    WHERE o.item = i.item AND o.date = today()

    [Plan: scan Order o; scan Item i; select date = today(); join on o.item = i.item]
  • Example System: Teradata. [Diagram: AMPs 1-3 each scan their partition of Order o, apply select date = today(), and hash rows by h(item) to AMPs 1-3]
  • Example System: Teradata. [Diagram: AMPs 1-3 each scan their partition of Item i and hash rows by h(item) to AMPs 1-3]
  • Example System: Teradata. [Diagram: AMPs 1-3 each join on o.item = i.item; AMP k contains all orders and all lines where hash(item) = k]
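  • A Python sketch of the hash-partitioned join just walked through (simulated in one process; a real system ships rows over the interconnect). Rows of both relations are hashed on the join key so matches land on the same AMP, and each AMP joins only its own partition:

    N_AMPS = 3

    def partition(relation, key):
        parts = [[] for _ in range(N_AMPS)]
        for row in relation:
            parts[hash(row[key]) % N_AMPS].append(row)   # hash h(item)
        return parts

    def local_join(r_part, s_part, key):
        index = {}
        for y in s_part:
            index.setdefault(y[key], []).append(y)
        return [{**x, **y} for x in r_part for y in index.get(x[key], [])]

    def parallel_join(r, s, key):
        r_parts, s_parts = partition(r, key), partition(s, key)
        out = []
        for amp in range(N_AMPS):            # each AMP works independently
            out.extend(local_join(r_parts[amp], s_parts[amp], key))
        return out

    orders = [{"item": "disk", "date": "today"}, {"item": "cpu", "date": "today"}]
    lines  = [{"item": "disk", "qty": 4}, {"item": "cpu", "qty": 1}]
    print(parallel_join(orders, lines, "item"))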
  • MapReduce Programming Model
    • Input & Output: each a set of key/value pairs
    • Programmer specifies two functions:
      • map: processes an input key/value pair and produces a set of intermediate pairs
      • reduce: combines all intermediate values for a particular key and produces a set of merged output values (usually just one)
    map(in_key, in_value) -> list(out_key, intermediate_value)
    reduce(out_key, list(intermediate_value)) -> list(out_value)
    Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell. slide source: Google, Inc.
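  • A single-process Python sketch of this contract (illustrative only; real frameworks add partitioning, distribution, and fault tolerance, but the data flow is the same):

    from collections import defaultdict

    def run_mapreduce(inputs, mapper, reducer):
        groups = defaultdict(list)
        for in_key, in_value in inputs:                  # map phase
            for out_key, v in mapper(in_key, in_value):
                groups[out_key].append(v)                # "shuffle": group by out_key
        return {k: reducer(k, vs) for k, vs in groups.items()}   # reduce phase

    # Demo: word count over two "documents".
    docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
    counts = run_mapreduce(docs,
                           mapper=lambda name, text: [(w, 1) for w in text.split()],
                           reducer=lambda word, values: sum(values))
    print(counts)   # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}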
  • Example: Document Processing
  • Example: Word length histogram How many “big”, “medium”, and “small” words are used?
  • Example: Word length histogram Big = Yellow = 10+ letters Medium = Red = 5..9 letters Small = Blue = 2..4 letters Tiny = Pink = 1 letter
  • Example: Word length histogram Split the document into chunks and process each chunk on a different computer Chunk 1 Chunk 2
  • Example: Word length histogram. Map Task 1 (204 words) emits (yellow, 17) (red, 77) (blue, 107) (pink, 3); Map Task 2 (190 words) emits (yellow, 20) (red, 71) (blue, 93) (pink, 6) -- each a (key, value) pair.
  • Example: Word length histogram. The “shuffle step” routes all pairs with the same key to one reduce task: (yellow, 17) (yellow, 20) → (yellow, 37); (red, 77) (red, 71) → (red, 148); (blue, 107) (blue, 93) → (blue, 200); (pink, 3) (pink, 6) → (pink, 9).
  • New Example: What does this do?

    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, 1);

    reduce(String output_key, Iterator intermediate_values):
      // output_key: word
      // output_values: ????
      int result = 0;
      for each v in intermediate_values:
        result += v;
      Emit(result);

    slide source: Google, Inc.
  • Relational Database Management Systems (RDBMS). Before RDBMS: if your data changed, your application broke. Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code. “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E.F. Codd, 1979. Key Ideas: programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation
  • MapReduce is a Nascent Database Engine
    • Access Methods and Scheduling: [graphic]
    • Query Language: Pig Latin
    • Query Optimizer: [graphic]
    Graphics taken from: hadoop.apache.org and research.yahoo.com/node/90
  • MapReduce and Hadoop
    • MR introduced by Google
      • Published paper in OSDI 2004
    • MR: high-level programming model and implementation for large-scale parallel data processing
    • Hadoop
      • Open source MR implementation
      • Yahoo!, Facebook, New York Times
  • A Query Language for MR: Pig Latin
    • High-level, SQL-like dataflow language for MR
    • Goal: Sweet spot between SQL and MR
      • Applies SQL-like, high-level language constructs to accomplish low-level MR programming.
    • operators:
    • LOAD
    • STORE
    • FILTER
    • FOREACH … GENERATE
    • GROUP
    • binary operators:
    • JOIN
    • COGROUP
    • UNION
    • other support:
    • UDFs
    • COUNT
    • SUM
    • AVG
    • MIN/MAX
    Additional operators: http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm
  • New Task: k-mer Similarity
    • Given a set of database sequences and a set of query sequences
    • Return the top N similar pairs, where similarity is defined as the number of common k-mers
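  • A naive single-machine Python baseline for this task (illustrative only; the Pig Latin program below is the version that scales out). Similarity here is the count of distinct shared k-mers:

    def kmers(k, sequence):
        return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

    def top_similar_pairs(db_seqs, query_seqs, k=7, n=10):
        scores = []
        for qid, q in query_seqs.items():
            qk = kmers(k, q)
            for did, d in db_seqs.items():
                scores.append((len(qk & kmers(k, d)), qid, did))
        return sorted(scores, reverse=True)[:n]

    db = {"d1": "GATTACGATATTA", "d2": "CCCCCCCCCCCCC"}
    qs = {"q1": "ATTACGATATT"}
    print(top_similar_pairs(db, qs))   # [(5, 'q1', 'd1'), (0, 'q1', 'd2')]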
  • Pig Latin program

    D = LOAD 'db_sequences.fasta' USING FASTA() AS (did, dsequence);
    Q = LOAD 'query_sequences.fasta' USING FASTA() AS (qid, qsequence);
    Kd = FOREACH D GENERATE did, FLATTEN(kmers(7, dsequence));
    Kq = FOREACH Q GENERATE qid, FLATTEN(kmers(7, qsequence));
    R = JOIN Kd BY kmer, Kq BY kmer;
    G = GROUP R BY (qid, did);
    C = FOREACH G GENERATE qid, did, COUNT(kmer) AS score;
    T = FILTER C BY score > 4;
    STORE T INTO 'seqs.txt';
  • New Task: Alignment
    • RMAP alignment implemented in Hadoop
      • Michael Schatz, CloudBurst: highly sensitive read mapping with MapReduce, Bioinformatics 25(11), April 2009
    • Goal: Align reads to a reference genome
    • Overview:
      • Map: Split reads and reference into k-mers
      • Reduce: for matching k-mers, find end-to-end alignments (seed and extend)
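  • A minimal Python sketch of the seeding idea (a hypothetical simplification, not CloudBurst’s actual code): index the reference’s k-mers, then treat each k-mer shared with a read as a candidate alignment position that a real aligner would extend and score:

    def seed_positions(reference, read, k=8):
        index = {}                            # k-mer -> positions in reference
        for i in range(len(reference) - k + 1):
            index.setdefault(reference[i:i + k], []).append(i)
        seeds = set()
        for j in range(len(read) - k + 1):    # each k-mer of the read
            for i in index.get(read[j:j + k], []):
                seeds.add(i - j)              # implied read start in reference
        return sorted(seeds)

    ref = "ACGTACGTGATTACAGGT"
    print(seed_positions(ref, "GATTACAG"))    # [8] -> candidate alignment at offset 8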
  • MapReduce Overhead
  • Elastic MapReduce
    • Custom Jar
      • Java
    • Streaming
      • Any language that can read/write stdin/stdout (see the sketch after this list)
    • Pig
      • Simple data flow language
    • Hive
      • SQL
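  • A sketch of the Streaming option in Python (illustrative; the file name wc_stream.py is hypothetical, the mapper and reducer run as separate executables in practice, and the framework sorts mapper output by key between the two phases):

    #!/usr/bin/env python
    # Streaming-style word count:
    #   cat input.txt | python wc_stream.py map | sort | python wc_stream.py reduce
    import sys

    def mapper():                      # stdin: raw lines; stdout: "word\t1" lines
        for line in sys.stdin:
            for word in line.split():
                print(word + "\t1")

    def reducer():                     # stdin: "word\tcount" lines, sorted by word
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rsplit("\t", 1)
            if word != current and current is not None:
                print(current + "\t" + str(total))
                total = 0
            current = word
            total += int(count)
        if current is not None:
            print(current + "\t" + str(total))

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()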
  • Taxonomy of Parallel Architectures [Figure: spectrum from architectures that are easiest to program, but $$$$, to those that scale to 1000s of computers]