The document discusses the formation of a new partnership between the University of Washington and Carnegie Mellon University, the eScience Institute. The partnership will receive $1 million per year in funding from the state of Washington and $1.5 million from the Gordon and Betty Moore Foundation. The goal of the institute is to keep the universities competitive by positioning them at the forefront of modern data-intensive science techniques and technologies, such as sensors, databases, and data mining.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
A talk at the RPI-NSF Workshop on Multiscale Modeling of Complex Data, September 12, 2011, Troy NY, USA.
We have made much progress over the past decade toward effectively
harnessing the collective power of IT resources distributed across the
globe. In fields such as high-energy physics, astronomy, and climate,
thousands benefit daily from tools that manage and analyze large
quantities of data produced and consumed by large collaborative teams.
But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that far more--ultimately
most?--researchers will soon require capabilities not so different from those used by these big-science teams. How is the general population of researchers and institutions to meet these needs? Must every lab be filled
with computers loaded with sophisticated software, and every researcher become an information technology (IT) specialist? Can we possibly afford to equip our labs in this way, and where would we find the experts to operate them?
Consumers and businesses face similar challenges, and industry has
responded by moving IT out of homes and offices to so-called cloud providers (e.g., Gmail, Google Docs, Salesforce), slashing costs and complexity. I suggest that by similarly moving research IT out of the lab, we can realize comparable economies of scale and reductions in complexity. More importantly, we can free researchers from the burden of managing IT, giving them back their time to focus on research and empowering them to go beyond the scope of what was previously possible.
I describe work we are doing at the Computation Institute to realize this approach, focusing initially on research data lifecycle management. I present promising results obtained to date and suggest a path towards
large-scale delivery of these capabilities.
A Biological Internet: Building Eywa from a Social Web of Things with a Little Fog, Stream processing and Linked Data.
Keynote at the Web Science Summer School 2017.
http://www.webscience.org/2017/04/19/shenzhen-web-science-summer-school-2017/
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and “where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
The science performed in Astronomy is digital science, from observing proposals to final publication, including the data and software used: every element and action involved in the scientific output could be recorded in electronic form.
Even so, the final outcome of an experiment is often still difficult to reproduce. An exhaustive documentation process can be long and tedious, access to all the resources must be granted, and even then the repeatability of results is not guaranteed. At the same time, we have access to a wealth of files, observational data, and publications that could be used more efficiently if the scientific production were more visible, avoiding duplication of effort and reinvention.
These are the slides from a plenary panel that I participated in at IEEE Cloud 2011 on July 5, 2011, in Washington, D.C. I discussed the Open Science Data Cloud and concluded the talk with three research questions.
Data-intensive applications on cloud computing resources: Applications in lif... (Ola Spjuth)
Presentation at the de.NBI 2017 symposium “The Future Development of Bioinformatics in Germany and Europe” held at the Center for Interdisciplinary Research (ZiF) of Bielefeld University, October 23-25, 2017.
https://www.denbi.de/symposium2017
Astronomy is a collaborative science, but like many other disciplines it has also become highly specialized. Improved sharing, discovery, and access to resources will enable astronomers to benefit greatly from each other's highly specialized know-how. Some initiatives led by scientists and publishers complement traditional paper publishing with assets published in more interactive digital formats. Among the main goals of these efforts are improving the reproducibility and clarity of the scientific outcome, going beyond the static PDF file, and fostering re-use, which translates into more efficient exploitation of available digital resources.
Using the Open Science Data Cloud for Data Science Research (Robert Grossman)
The Open Science Data Cloud is a petabyte scale science cloud for managing, analyzing, and sharing large datasets. We give an overview of the Open Science Data Cloud and how it can be used for data science research.
Data Tribology: Overcoming Data Friction with Cloud Automation (Ian Foster)
A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
My talk at the Winter School on Big Data in Tarragona, Spain.
Abstract: We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers.
IPython Notebooks have given us a substantial improvement in the documentation of scripts, as well as their inspection and greater re-use. IPython Notebooks also provide access to different programming languages (Fortran, IDL, R, Shell, ...) in a single script, which, together with their web-based access, makes them an ideal tool for collaborative work (multi-language, multi-user, multi-platform, etc.). I will describe the kinds of things that can be done with IPython Notebooks, from collaborative development of multi-language code, through the re-use of tutorials and interactive visualization of results, to the distribution of more modular code and the final publication of a verifiable and reproducible digital experiment: the prelude to executable papers.
Digital Science: Reproducibility and Visibility in Astronomy (Jose Enrique Ruiz)
The science done in Astronomy is digital science, from observing proposals to final publication, including the data and software used: every element and action involved in scientific output could be recorded in electronic form. Even so, the final outcome of an experiment is often still difficult to reproduce. The procedure can be long, tedious, and not easily accessible or understandable, even to the author. At the same time, we have a rich infrastructure of files, observational data, and publications. These could be used more efficiently if we achieved greater visibility of the scientific production, avoiding duplication of effort and reinvention.
Reproducibility is a cornerstone of the scientific method, and extraction of relevant information from the current and future data flood is key in Astronomy. The AMIGA group (Analysis of the interstellar Medium of Isolated GAlaxies, IAA-CSIC, http://amiga.iaa.es) faces these two challenges in the European project "Wf4Ever: Advanced Workflow Preservation Technologies for Enhanced Science", which aims to preserve scientific methodology in scalable semantic repositories that facilitate its discovery, access, inspection, exploitation, and distribution. These repositories store experiments as "Research Objects" whose main constituents are digital scientific workflows. Research Objects provide a comprehensive view and clear scientific interpretation of the experiment, as well as automation of the method, going beyond the usual pipelines that normally end at data processing.
The quantitative leap in the volume and complexity of the next generation of archives will require analysis and data mining tasks to live closer to the data, in distributed computing and storage environments; but these tasks should also be modular enough to allow customization by scientists, and easily accessible to foster their dissemination in the community. Astronomy is a collaborative science, but like many other disciplines it has also become highly specialized. Sharing, preservation, discovery, and much simplified access to resources in the composition of scientific workflows will enable astronomers to benefit greatly from each other's highly specialized know-how; they constitute a way to push Astronomy to share and publish not only results and data, but also processes and methodologies.
We will show how the use of scientific workflows can improve the reproducibility of experiments, enable more efficient exploitation of astronomical archives, and increase the visibility and re-use of the scientific methodology.
Accelerating data-intensive science by outsourcing the mundane (Ian Foster)
Talk at eResearch New Zealand Conference, June 2011 (given remotely from Italy, unfortunately!)
Abstract: Whitehead observed that "civilization advances by extending the number of important operations which we can perform without thinking of them." I propose that cloud computing can allow us to accelerate dramatically the pace of discovery by removing a range of mundane but time-consuming research data management tasks from our consciousness. I describe the Globus Online system that we are developing to explore these possibilities, and propose milestones for evaluating progress towards smarter science.
Large Scale On-Demand Image Processing For Disaster Relief (Robert Grossman)
This is a status update (as of Feb 22, 2010) of a new Open Cloud Consortium project that will provide on-demand, large scale image processing to assist with disaster relief efforts.
A talk at the Urban Science workshop at the Puget Sound Regional Council, July 20, 2014, organized by the Northwest Institute for Advanced Computing, a joint effort between Pacific Northwest National Laboratory and the University of Washington.
A 25-minute talk from a panel on big data curricula at JSM 2013
http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664
A taxonomy for data science curricula; a motivation for choosing a particular point in the design space; and an overview of some of our activities, including a Coursera course slated for Spring 2012.
Relational databases remain underused in the long tail of science, despite a number of significant success stories and a natural correspondence between scientific inquiry and ad hoc database query. Barriers to adoption have been articulated in the past, but spreadsheets and other file-oriented approaches still dominate. At the University of Washington eScience Institute, we are exploring a new "delivery vector" for selected database features targeting researchers in the long tail: a web-based query-as-a-service system called SQLShare that eschews conventional database design, instead emphasizing a simple Upload-Query-Share workflow and exposing a direct, full-SQL query interface over "raw" tabular data. We augment the basic query interface with services for cleaning and integrating data, recommending and authoring queries, and automatically generating visualizations. We find that even non-programmers are able to create and share SQL views for a variety of tasks, including quality control, integration, basic analysis, and access control. Researchers in oceanography, molecular biology, and ecology report migrating data to our system from spreadsheets, from conventional databases, and from ASCII files. In this paper, we provide some examples of how the platform has enabled science in other domains, describe our SQLShare system, and propose some emerging research directions in this space for the database community.
Making It Your Own: Transitioning Into a New Electronic Resources Role (Alana Nuth)
Making it Your Own: Transitioning into a New Electronic Resources Role
Presented by Kelly Blanchat and Alana Verminski
ER&L Conference 2015
Austin, TX
Kelly Blanchat
Electronic Resources Librarian
Queens College, CUNY
kelly.blanchat@qc.cuny.edu
@kellyblanchat
Alana Verminski
Reference and Instruction Librarian
St. Mary's College of Maryland Library
alana.verminski@gmail.com
www.alanaverminski.com
Abstract:
Transitioning into a new role is challenging, especially one as vast and nuanced as electronic resources. With a new position come endless opportunities, along with unknown or unexpected situations. Using entertaining anecdotes, the presenters will share strategies to revamp legacy workflows, assess current practices, and make a new position your own.
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle, Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
Talk at the JISC Repositories conference, intended for repository managers or research managers, on some of the issues involved. The talk originally had to be given unaided because of a technology problem!
Thoughts on Knowledge Graphs & Deeper Provenance (Paul Groth)
Thinking about the need for deeper provenance for knowledge graphs but also using knowledge graphs to enrich provenance. Presented at https://seminariomirianandres.unirioja.es/sw19/
Talk delivered at High Performance Transaction Processing 2013
Myria is a new Big Data service being developed at the University of Washington. We feature high-level language interfaces, a hybrid graph-relational data model, database-style algebraic optimization, a comprehensive REST API, an iterative programming model suitable for machine learning and graph analytics applications, and a tight connection to new theories of parallel computation.
In this talk, we describe the motivation for another big data platform, emphasizing requirements emerging from the physical, life, and social sciences.
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences (Ian Foster)
Argonne’s Discovery Engines for Big Data project is working to enable new research modalities based on the integration of advanced computing with experiments at facilities such as the Advanced Photon Source (APS). I review science drivers and initial results in diffuse scattering, high energy diffraction microscopy, tomography, and ptychography. I also describe the computational methods and infrastructure that we leverage to support such applications, which include the Petrel online data store, ALCF supercomputers, Globus research data management services, and Swift parallel scripting. This work points to a future in which tight integration of DOE’s experimental and computational facilities enables both new science and more efficient and rapid discovery.
Accelerating Discovery via Science Services (Ian Foster)
[A talk presented at Oak Ridge National Laboratory on October 15, 2015]
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In big-science projects in high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to develop suites of science services to which researchers can dispatch mundane but time-consuming tasks, and thus to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers. I use examples from Globus and other projects to demonstrate what can be achieved.
The Discovery Cloud: Accelerating Science via Outsourcing and Automation (Ian Foster)
Director's Colloquium at Los Alamos National Laboratory, September 18, 2014.
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. In this talk, I explore the past, current, and potential future of large-scale outsourcing and automation for science.
Computational infrastructure is becoming a vast interconnected fabric of formal methods, including a major shift from 2D grids to 3D graphs in machine learning architectures.
The implication is systems-level digital science at unprecedented scale, enabling discovery in a diverse range of scientific disciplines.
Scott Edmunds' slides for class 8 of the HKU Data Curation course (module MLIM7350, Faculty of Education), covering science data, medical data and ethics, and the FAIR data principles.
Python's Role in the Future of Data Analysis (Peter Wang)
Why is "big data" a challenge, and what roles do high-level languages like Python have to play in this space?
The video of this talk is at: https://vimeo.com/79826022
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ... (Safe Software)
University and college campuses are complex environments. A campus comprises many physical sub-systems, such as buildings, outdoor spaces, utilities, and transportation, which are maintained by several divisions using multiple IT tools and different formats. Campus-wide analytics requires bringing all these data elements and formats (CAD, GIS, BIM) together to create a comprehensive common operating picture. In this presentation we will demonstrate how FME is a key technology for creating a campus-wide data warehouse.
We present a system to support generalized SQL workload analysis and management for multi-tenant and multi-database platforms. Workload analysis applications are becoming more sophisticated to support database administration, model user behavior, audit security, and route queries, but the methods rely on specialized feature engineering, and therefore must be carefully implemented and reimplemented for each SQL dialect, database system, and application. Meanwhile, the size and complexity of workloads are increasing as systems centralize in the cloud. We model workload analysis and management tasks as variations on query labeling, and propose a system design that can support general query labeling routines across multiple applications and database backends. The design relies on the use of learned vector embeddings for SQL queries as a replacement for application-specific syntactic features, reducing custom code and allowing the use of off-the-shelf machine learning algorithms for labeling. The key hypothesis, for which we provide evidence in this paper, is that these learned features can outperform conventional feature engineering on representative machine learning tasks. We present the design of a database-agnostic workload management and analytics service, describe potential applications, and show that separating workload representation from labeling tasks affords new capabilities and can outperform existing solutions for representative tasks, including workload sampling for index recommendation and user labeling for security audits.
Brief remarks on big data trends and responsible data science at the Workshop on Science and Technology for Washington State: Advising the Legislature, October 4th 2017 in Seattle.
Talk at ISIM 2017 in Durham, UK on applying database techniques to querying model results in the geosciences, with a broader position about the interaction between data science and simulation as modes of scientific inquiry.
Data science remains a high-touch activity, especially in life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: Specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer curation and quality problems.
Together, these problems motivate a research agenda around “human-data interaction:” understanding and optimizing how people use and share quantitative information.
I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
A talk I gave at the MMDS workshop June 2014 on the Myria system as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
A New Partnership for Cross-Scale, Cross-Domain eScience
1. A New Partnership for eScience
Bill Howe, UW
Ed Lazowska, UW
Garth Gibson, CMU
Christos Faloutsos, CMU
Peter Lee, CMU (DARPA)
Chris Mentzel, Moore
6. The University of Washington eScience Institute
Rationale: The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich. Techniques and technologies include sensors and sensor networks, databases, data mining, machine learning, visualization, and cluster/cloud computing. If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive.
Mission: Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them.
Strategy: Bootstrap a cadre of Research Scientists; add faculty in key fields; build out a “consultancy” of students and non-research staff.
7. Staff and Funding
Funding:
$1M/year direct appropriation from the WA State Legislature
$1.5M from the Gordon and Betty Moore Foundation (joint with CMU)
Multiple proposals outstanding
Staffing:
Dave Beck, Research Scientist: biosciences and software engineering
Jeff Gardner, Research Scientist: astrophysics and HPC
Bill Howe, Research Scientist: databases, visualization, DISC
Ed Lazowska, Director
Erik Lundberg (50%), Operations Director
Mette Peters, Health Sciences Liaison
Chance Reschke, Research Engineer: large-scale computing platforms
…plus a senior faculty search underway
…plus a “consultancy” of students and professional staff
8. All science is reducing to a database problem
Old model: “Query the world” (data acquisition coupled to a specific hypothesis)
New model: “Download the world” (data acquired en masse, in support of many hypotheses)
Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Medicine: ubiquitous digital records, MRI, ultrasound
Oceanography: high-resolution models, cheap sensors, satellites
Biology: lab automation, high-throughput sequencing
“Increase data collection exponentially with FlowCam!”
9. The Long Tail
The long tail is getting fatter: notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB).
[Figure: data volume vs. rank, from big-science projects -- CERN (~15 PB/year), LSST (~100 PB), PanSTARRS (~40 PB), SDSS (~100 TB), CARMEN (~50 TB) -- down through ocean modelers, seismologists, and microbiologists to spreadsheet users.]
Researchers with growing data management challenges but limited resources for cyberinfrastructure:
• No dedicated IT staff
• Over-reliance on inadequate but familiar tools
“The future is already here. It’s just not very evenly distributed.” -- William Gibson
12. What Does Scalable Mean?
Operationally:
In the past: “Works even if data doesn’t fit in main memory”
Now: “Can make use of 1000s of cheap computers”
Formally:
In the past: polynomial time and space. If you have N data items, you must do no more than N^k operations.
Soon: logarithmic time and linear space. If you have N data items, you must do no more than N log(N) operations.
Soon, you’ll only get one pass at the data -- so you better make that one pass count.
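To make the one-pass constraint concrete, here is a minimal Python sketch (illustrative, not from the slides; the file name and statistics are hypothetical) that computes several aggregates in a single streaming pass, touching each record exactly once and using constant memory:

count = 0
total = 0.0
minimum = float("inf")
maximum = float("-inf")
with open("records.txt") as f:                  # hypothetical input: one number per line
    for x in (float(line) for line in f):       # each record is visited exactly once
        count += 1
        total += x
        minimum = min(minimum, x)
        maximum = max(maximum, x)
if count:
    print(count, total / count, minimum, maximum)   # count, mean, min, max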
13. A Goal: Cross-Scale Solutions
Gracefully scale up: from files to databases to clusters to clouds; from MB to GB to TB to PB.
“Gracefully” means logical data independence: no expensive ETL migration projects.
“Gracefully” also means everyone can use it: hackers / computational scientists, lab/field scientists, the public, K-12, legislators.
14. Data models, operations, and services offered by each class of system:

System | Data Model | Operations | Services
GPL | * | * | None for free
Workflow | * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism
SQL / Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, indexing, parallelism
MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance, scheduling
Pig | Nested Relations | RA-like, with Nest/Flatten | optimization, monitoring, scheduling
DryadLINQ | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
MPI | Arrays/Matrices | 70+ ops | data parallelism, full control
15. MapReduce
Many tasks process big data and produce big data. We want to use hundreds or thousands of CPUs... but this needs to be easy.
Parallel databases exist, but they require DBAs and $$$$, and do not easily scale to thousands of computers.
MapReduce is a lightweight framework, providing:
Automatic parallelization and distribution
Fault-tolerance
I/O scheduling
Status and monitoring
16. A complete DryadLINQ program (slide source: Christophe Poulain, MSR)

public class LogEntry {
  public string user, ip;
  public string page;
  public LogEntry(string line) {
    string[] fields = line.Split(' ');
    this.user = fields[8];
    this.ip = fields[9];
    this.page = fields[5];
  }
}

public class UserPageCount {
  public string user, page;
  public int count;
  public UserPageCount(string usr, string page, int cnt) {
    this.user = usr;
    this.page = page;
    this.count = cnt;
  }
}

PartitionedTable<string> logs =
  PartitionedTable.Get<string>(@"file:…logfile.pt");

var logentries =
  from line in logs
  where !line.StartsWith("#")
  select new LogEntry(line);

var user =
  from access in logentries
  where access.user.EndsWith(@"ulfar")
  select access;

var accesses =
  from access in user
  group access by access.page into pages
  select new UserPageCount("ulfar", pages.Key, pages.Count());

var htmAccesses =
  from access in accesses
  where access.page.EndsWith(".htm")
  orderby access.count descending
  select access;

htmAccesses.ToPartitionedTable(@"file:…results.pt");
17. Relational Databases
Pre-relational DBMS brittleness: if your data changed, your application often broke. Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
[Figure: layers of data independence -- physical data independence between files-and-pointers and relations, logical data independence between relations and views.]
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E. F. Codd, 1970
Key idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation.
18. Relational Databases
Databases are especially, but not exclusively, effective at “Needle in Haystack” problems: extracting small results from big datasets.
They transparently provide “old style” scalability: your query will always* finish, regardless of dataset size.
Indexes are easily built and automatically used when appropriate:

CREATE INDEX seq_idx ON sequence(seq);

SELECT seq
FROM sequence
WHERE seq = 'GATTACGATATTA';

*almost
19. Key Idea: Data Independence
[Figure: physical data independence separates files-and-pointers from relations; logical data independence separates relations from views.]
With data independence, a request is a short declarative query:

SELECT *
FROM my_sequences

SELECT seq
FROM ncbi_sequences
WHERE seq = 'GATTACGATATTA';

Without it, the application must navigate files and pointers by hand:

f = fopen('table_file');
fseek(f, 10030440, SEEK_SET);
while (true) {
  fread(&buf, 1, 8192, f);
  if (buf == GATTACGATATTA) {
    . . .
20. Key Idea: An Algebra of Tables
[Figure: a query plan tree composing select, project, and join operators.]
Other operators: aggregate, union, difference, cross product.
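These operators compose like ordinary algebraic expressions. Below is a minimal Python sketch (illustrative only; the tables and column names are made up) of select, project, and join over lists of dictionaries:

def select(rows, pred):                   # keep rows satisfying a predicate
    return [r for r in rows if pred(r)]

def project(rows, cols):                  # keep only the named columns
    return [{c: r[c] for c in cols} for r in rows]

def join(r1, r2, key):                    # natural join on a shared column
    return [{**a, **b} for a in r1 for b in r2 if a[key] == b[key]]

people = [{"name": "Ann", "dept": 1}, {"name": "Bo", "dept": 2}]
depts = [{"dept": 1, "dname": "Ocean"}, {"dept": 2, "dname": "Astro"}]
print(project(select(join(people, depts, "dept"),
                     lambda r: r["dname"] == "Ocean"),
              ["name"]))                  # -> [{'name': 'Ann'}]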
21. Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2: N = (2+3)*z
Two operations instead of five, and no division operator. The same idea works with the Relational Algebra!
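As a sanity check on the arithmetic analogy (an illustrative sketch, not from the slides), the two forms can be compared directly in Python:

def n_original(z):
    return ((z * 2) + ((z * 3) + 0)) / 1   # five operations, including a division

def n_optimized(z):
    return (2 + 3) * z                     # two operations (and the constant sum folds away)

# The rewrite never changes the result:
assert all(n_original(z) == n_optimized(z) for z in range(-1000, 1000))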
22. Shared Nothing Parallel Databases
Teradata
Greenplum
Netezza
Aster Data Systems
DataAllegro (acquired by Microsoft)
Vertica
MonetDB (recently commercialized as “Vectorwise”)
24. N-body Astrophysics Simulation
• 15 years in development
• 10^9 particles
• Gravity
• Months to run
• 7.5 million CPU hours
• 500 timesteps
• Big Bang to now
Simulations from Tom Quinn's lab; work by Sarah Loebman, YongChul Kwon, Bill Howe, Jeff Gardner, Magda Balazinska.
34. Data explosion, again
Data growth is outpacing Moore's Law. Why? The cost of acquisition has dropped through the floor, and every pairwise comparison of datasets generates a new dataset -- N^2 growth (for example, 1,000 datasets admit roughly 500,000 pairwise comparisons).
So: scalable analysis is necessary. But: scalable analysis is hard.
35. It’s not just the size…
Corollary: the number of applications also scales as N^2, since every pairwise comparison motivates a new application.
To keep up, we need to entrain new programmers, make existing programmers more productive, or both.
38. Zooplankton and Temperature
<Vis movie>
39. Why Visualization?
High bandwidth of the human visual cortex
Query-writing presumes a precise goal
Try this in SQL: “What does the salt wedge look like?”
40. Data Product Ensembles
source: Antonio Baptista, Center for Coastal Margin Observation and Prediction
41. Example: Find matching sequences
Given a set of sequences, find all sequences equal to “GATTACGATATTA”.
42. Example System: Teradata
AMP (Access Module Processor) = unit of parallelism
43. Example System: Teradata
Find all orders from today, along with the items ordered:
SELECT *
FROM Orders o, Lines i
WHERE o.item = i.item
AND o.date = today()
[Plan diagram: scan Order o, select date = today(); scan Item i; join on o.item = i.item]
44. Example System: Teradata
[Diagram: AMPs 1-3 each scan their local partition of Order o, apply select date = today(), then hash each surviving row by h(item) and send it to the AMP that owns that hash value]
45. Example System: Teradata
[Diagram: AMPs 1-3 each scan their local partition of Item i, hash each row by h(item), and redistribute it the same way]
46. Example System: Teradata
[Diagram: AMPs 1-3 each run join on o.item = i.item; AMP k now holds all orders and all lines where hash(item) = k, so every join runs locally]
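A minimal sketch of the idea behind slides 43-46 (hypothetical data; three simulated AMPs): because both inputs are partitioned by the same hash of the join key, matching rows always land on the same AMP, so each partition pair can be joined independently and the results simply unioned.

NUM_AMPS = 3

def partition(rel, key):
    """Distribute tuples to AMPs by hashing the join key."""
    parts = [[] for _ in range(NUM_AMPS)]
    for t in rel:
        parts[hash(t[key]) % NUM_AMPS].append(t)
    return parts

def local_join(r, s, key):
    return [{**x, **y} for x in r for y in s if x[key] == y[key]]

orders = [{"item": i % 5, "order_id": i} for i in range(10)]
items  = [{"item": i, "name": f"item-{i}"} for i in range(5)]

o_parts, i_parts = partition(orders, "item"), partition(items, "item")

# Each AMP joins only its own partitions; results are unioned.
result = []
for amp in range(NUM_AMPS):
    result.extend(local_join(o_parts[amp], i_parts[amp], "item"))

assert len(result) == len(orders)  # every order matches exactly one item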
47. MapReduce Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map: processes an input key/value pair and produces a set of intermediate pairs
reduce: combines all intermediate values for a particular key and produces a set of merged output values (usually just one)
map (in_key, in_value) -> list(out_key, intermediate_value)
reduce (out_key, list(intermediate_value)) -> list(out_value)
Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
slide source: Google, Inc.
48. Example: Document Processing
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from that subordination in
which they have hitherto remained, and to assume among powers of the earth the equal and independent station
to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind
requires that they should declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent; that from that equal
creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and
the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just
power from the consent of the governed; that whenever any form of government shall become destructive of
these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's
foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect
their safety and happiness. Prudence indeed will dictate that governments long established should not be
changed for light and transient causes: and accordingly all experience hath shewn that mankind are more
disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are
accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing
invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to
throw off such government and to provide new guards for future security. Such has been the patient sufferings
of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which
no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object
the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid
world, for the truth of which we pledge a faith yet unsullied by falsehood.
49. Example: Word length histogram
(The same abridged Declaration of Independence text as above.)
How many “big”, “medium”, and “small” words are used?
50. Example: Word length histogram
(Declaration text as above, with every word color-coded by length:)
Big = Yellow = 10+ letters
Medium = Red = 5..9 letters
Small = Blue = 2..4 letters
Tiny = Pink = 1 letter
51. Example: Word length histogram
(Declaration text as above.)
Split the document into chunks and process each chunk on a different computer (Chunk 1, Chunk 2).
52. Example: Word length histogram
(Declaration text as above, divided between two map tasks.)
Each map task emits (key, value) pairs for its chunk (the counts sum to each task's word total):
Map Task 1 (204 words): (yellow, 17), (red, 77), (blue, 107), (pink, 3)
Map Task 2 (190 words): (yellow, 20), (red, 71), (blue, 93), (pink, 6)
53. Example: Word length histogram
“Shuffle step”: group the intermediate pairs by key across both map outputs:
(yellow, 17), (yellow, 20)
(red, 77), (red, 71)
(blue, 93), (blue, 107)
(pink, 6), (pink, 3)
Reduce tasks sum the values for each key:
(yellow, 37)
(red, 148)
(blue, 200)
(pink, 9)
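Putting the three phases together, a minimal runnable Python sketch of this pipeline (not from the talk; the chunking, function names, and sample text are illustrative; the buckets follow the legend on slide 50):

from collections import defaultdict

def bucket(word):
    n = len(word)
    if n >= 10: return "yellow"  # big: 10+ letters
    if n >= 5:  return "red"     # medium: 5..9 letters
    if n >= 2:  return "blue"    # small: 2..4 letters
    return "pink"                # tiny: 1 letter

def map_task(chunk):
    """Tag each word with its size bucket, emitting (bucket, 1) pairs."""
    return [(bucket(w), 1) for w in chunk.split()]

def shuffle(pairs):
    """Group all intermediate values by key across the map outputs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Sum the per-chunk counts for one bucket."""
    return key, sum(values)

chunks = ["We hold these truths to be self-evident",
          "that all men are created equal and independent"]
intermediate = [pair for chunk in chunks for pair in map_task(chunk)]
histogram = dict(reduce_task(k, v) for k, v in shuffle(intermediate).items())
print(histogram)  # {'blue': 9, 'red': 4, 'yellow': 2}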
54. New Example: What does this do?
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
  // output_key: word
  // output_values: ????
  int result = 0;
  for each v in intermediate_values:
    result += v;
  Emit(result);
slide source: Google, Inc.
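For reference, this is the classic word-count example: it counts the occurrences of each word across the input documents. A minimal runnable Python transcription (the driver and names are illustrative, not Google's code):

from collections import defaultdict

def map_fn(doc_name, contents):
    """For each word in the document, emit (word, 1)."""
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    """Sum the 1s emitted for this word."""
    return word, sum(counts)

docs = {"doc1": "to be or not to be"}
groups = defaultdict(list)
for name, text in docs.items():
    for key, value in map_fn(name, text):
        groups[key].append(value)

print(dict(reduce_fn(k, v) for k, v in groups.items()))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}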
55. Relational Database Management Systems (RDBMS)
Before RDBMS: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E.F. Codd, 1979
Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation.
56. MapReduce is a Nascent Database Engine
[Diagram mapping database-engine components (access methods and scheduling, query language, query optimizer) onto the Hadoop stack, with Pig Latin as the query language]
Graphics taken from: hadoop.apache.org and research.yahoo.com/node/90
57. MapReduce and Hadoop
MR introduced by Google
Published paper in OSDI 2004
MR: a high-level programming model and implementation for large-scale parallel data processing
Hadoop: open-source MR implementation, used by Yahoo!, Facebook, and the New York Times
58. A Query Language for MR: Pig Latin
High-level, SQL-like dataflow language for MR.
Goal: a sweet spot between SQL and MR; applies SQL-like, high-level language constructs to accomplish low-level MR programming.
Operators:
• LOAD
• STORE
• FILTER
• FOREACH … GENERATE
• GROUP
Binary operators:
• JOIN
• COGROUP
• UNION
Other support:
• UDFs
• COUNT
• SUM
• AVG
• MIN/MAX
Additional operators: http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm
59. New Task: k-mer Similarity
Given a set of database sequences and a set of query sequences, return the top N similar pairs, where similarity is defined as the number of common k-mers.
60. Pig Latin program
D = LOAD 'db_sequences.fasta' USING FASTA() AS (did, dsequence);
Q = LOAD 'query_sequences.fasta' USING FASTA() AS (qid, qsequence);
Kd = FOREACH D GENERATE did, FLATTEN(kmers(7, dsequence));
Kq = FOREACH Q GENERATE qid, FLATTEN(kmers(7, qsequence));
R = JOIN Kd BY kmer, Kq BY kmer;
G = GROUP R BY (qid, did);
C = FOREACH G GENERATE group.qid, group.did, COUNT(R) AS score;
T = FILTER C BY score > 4;
STORE T INTO 'seqs.txt';
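Here FASTA() and kmers() are user-defined functions. A minimal Python sketch of the semantics I assume for kmers(), plus the score the JOIN/GROUP/COUNT pipeline computes for one (query, database) pair:

from collections import Counter

def kmers(k, seq):
    """All overlapping length-k substrings of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def score(qseq, dseq, k=7):
    """One match per pair of equal k-mer occurrences, since the JOIN
    produces one row per matching (Kd, Kq) pair."""
    cq, cd = Counter(kmers(k, qseq)), Counter(kmers(k, dseq))
    return sum(cq[m] * cd[m] for m in cq.keys() & cd.keys())

print(kmers(3, "GATTAC"))                 # ['GAT', 'ATT', 'TTA', 'TAC']
print(score("GATTACGA", "TTACGAT", k=4))  # 3 shared 4-mer occurrences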
61. New Task: Alignment
Goal: align reads to a reference genome.
RMAP alignment implemented in Hadoop: Michael Schatz, CloudBurst: highly sensitive read mapping with MapReduce, Bioinformatics 25(11), April 2009.
Overview:
Map: split reads and reference into k-mers
Reduce: for matching k-mers, find end-to-end alignments (seed and extend)
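A rough, simplified sketch of the map phase (not CloudBurst's actual code or record format): seeds are emitted keyed by k-mer, so the shuffle brings matching read and reference positions together for the reduce phase to extend.

def map_seeds(source, seq_id, sequence, k=8):
    """Emit (k-mer, (source, seq_id, position)) pairs.
    source tags the sequence as a 'read' or the 'reference'."""
    for pos in range(len(sequence) - k + 1):
        yield sequence[pos:pos + k], (source, seq_id, pos)

# After the shuffle, each reduce group holds every occurrence of one
# k-mer; read/reference pairs in the same group are alignment seeds
# to extend end-to-end.
pairs = list(map_seeds("read", "read_1", "GATTACGATATTA"))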
62. MapReduce Overhead
[Figure: runtimes at different cluster sizes, showing the overhead of parallel processing]
63. Elastic MapReduce
Custom Jar: Java
Streaming: any language that can read/write stdin/stdout
Pig: simple data flow language
Hive: SQL
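For the Streaming option, the mapper and reducer are ordinary executables that read stdin and write tab-separated key/value lines to stdout; the framework sorts the reducer's input by key. A minimal word-count pair in Python (illustrative file names):

# streaming_mapper.py: emit "word<TAB>1" for every word on stdin.
import sys
for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# streaming_reducer.py: input arrives sorted by key; sum counts per word.
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")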
64. Taxonomy of Parallel Architectures
[Diagram: shared-memory designs are easiest to program, but $$$$; shared-nothing designs scale to 1000s of computers]
Editor's Notes
My name is Bill Howe. I’m not Ed Lazowska.
In all fields of science, data is starting to come in faster than it can be analyzed, so we need to advance and proliferate computational technologies in sensor networking, databases and data mining, visualization, machine learning, and cluster/cloud computing.
And if we don’t, we see UW losing its competitive edge.
The mission of the eScience Institute is to prevent that from happening
So by the animation loophole, there we go.
Funding! We have $1M from the state, and we just got a nice award from the Moore Foundation, and several proposals outstanding.
People! We have a fantastic team: Dave Beck in Biosciences, Jeff Gardner in Astrophysics and HPC, myself in Databases, Ed and Erik, Mette Peters in Health Sciences, and Chance Reshke in large-scale computing platforms.
And there’s our URL: escience.washington
Drowning in data; starving for information
We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
The long tail of eScience -- huge number of scientists who struggle with data management, but do not have access to IT resources -- no clusters, no system administrators, no programmers, and no computer scientists.
They rely on spreadsheets, email, and maybe a shared file system.
Their data challenges have more to do with heterogeneity than size: tens of spreadsheets from different sources.
However: the long tail is becoming the fat tail. Tens of spreadsheets are growing to hundreds, and the number of records in each goes from hundreds to thousands. How many of you know someone who was forced to split a large spreadsheet into multiple files in order to get around the 65k record limit in certain versions of Excel?
Further, medium data (gigabytes) becomes big data (terabytes). Ocean modelers are moving from regional-focus to meso-scale simulations to global simulations.
Armbrust Lab combines lab-based and field-based studies to address basic questions about the function of marine ecosystems.
Asterisk/underlined (*) indicates custom software developed in the Armbrust Lab.
Blue: Traditional tools for “basement” bioinformatics -- individual scientists
Orange: Increased centralization, economies of scale, shared resources. Deployed in the Armbrust Lab
Yellow: Third-party tools developed for scalable bioinformatics
Purple: Emerging tools under evaluation for convenient petascale bioinformatics. Through a collaboration with the eScience Institute (under funding review by Moore Foundation!)
Thanks to advances in sensors, sequencing instruments, and algorithms, the field of bioinformatics is moving away from "single-task" software that operates on datasets that fit on a single computer in favor of flexible, "multi-purpose" frameworks that can operate on datasets that span clusters of computers.
In our lab, we have deployed a variety of flexible tools, and have developed our own software to streamline our scientific process and reduce the overall "time to insight". (Maybe talk about WebBlast and PPlacer here.)
Observing that the amount of data collected is doubling every year (outpacing even Moore's Law!), we are also collaborating with the UW eScience Institute to explore ways we can harness emerging technologies for massively parallel data analysis involving hundreds or thousands of machines. Some of these frameworks involve "cloud computing" -- the use of computational infrastructure provided, inexpensively, by "big players" in software and computing: Amazon, Microsoft, Google. [Maybe more on the eScience Institute?]
Dial down the expressiveness but dial up the programming and execution services
It turns out that you can express a wide variety of computations using only a handful of operators.
Two nodes slower than one, four nodes slower than 8 -- shows overhead of providing parallel processing
Data products are the currency of scientific and statistical communication with the public
Ex: Obama map
Ex: Mars Rover pictures generate 218M hits in 24 hrs
But: Datasets are growing too big and too complex to view through a few static images
Scientists want to create interactive visualizations that allow others to explore their results
Ex: Nasa 3D with Photosynth
Ex: CAMERA
Ex:
On the order of hundreds of points. Manual browsing.
This movie was rendered offline, but it’s increasingly important to be able to create visualizations on the fly to allow interactive exploration of large datasets.
Visualization is a more efficient way to query data -- you can browse and explore.
But you need to be able to switch back and forth between interactive browsing and symbolic querying
Climatology is long-term average
Want to know the makeup of the text by word length. For example, we’d like to know how many words have greater than 10 characters. We’d also like to know how many words have between 5 and 9 characters, between 2 and 4 and those with just 1 character.
Map will read in text and tag each word as a different color depending on the length of the word.
Motivating Map task and intuition behind map…. Think of map as a group by.
Distribution of word lengths
It provides a means of describing data with its natural structure only--that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation on the other.
So these two different views of the world, RDBMS and MapReduce, are not really different at all -- just different feature sets along a continuum of data processing.
As evidence:
Teradata
Greenplum
Netezza
Aster Data Systems
Dataupia
Vertica
MonetDB
Hadoop implementation based on details in the MR 2004 paper.
You don't have to write separate map and reduce functions; Pig will take care of that for you, as well as optimize for you.
This is by no means an exhaustive list of operators
The goal here is to make Shared Nothing Architectures easier to program.