The Other HPC: High
Productivity Computing in
Polystore Environments
Bill Howe, Ph.D.
Associate Director, eScience Institute
Senior Data Science Fellow, eScience Institute
Affiliate Associate Professor, Computer Science & Engineering
11/23/2015 Bill Howe, UW 1
[Figure: two plots against time: amount of data in the world, and processing power]
What is the rate-limiting step in data understanding?
[Figure: curves for processing power (Moore’s Law) and amount of data in the world, plotted against time]
What is the rate-limiting step in data understanding?
[Figure: curves for processing power (Moore’s Law), amount of data in the world, and human cognitive capacity]
Idea adapted from “Less is More” by Bill Buxton (2001)
slide src: Cecilia Aragon, UW HCDE
How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
“[This was hard] due to the large amount of data (e.g. data indexes for data retrieval,
dissection into data blocks and processing steps, order in which steps are performed
to match memory/time requirements, file formats required by software used).
In addition we actually spend quite some time in iterations fixing problems with
certain features (e.g. capping ENCODE data), testing features and feature products
to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs
human-derived variants)
So roughly 50% of the project was testing and improving the model, 30% figuring out
how to do things (engineering) and 20% getting files and getting them into the right
format.
I guess in total [I spent] 6 months [on this project].”
At least 3 months on issues of
scale, file handling, and feature
extraction.
Martin Kircher, Genome Sciences
Why?
3k NSF postdocs in 2010
$50k / postdoc
at least 50% overhead
maybe $75M annually
at NSF alone?
Where does the time go? (2)
Productivity
How long I have to wait for results
months, weeks, days, hours, minutes, seconds, milliseconds
HPC
Systems
Databases
feasibility
threshold
interactivity
threshold
These two performance
thresholds are really important;
other requirements are
situation-specific
Data models: Table, Graph, Array, Matrix, Key-Value, Dataframe
Systems: MATLAB, GEMS, GraphX, Neo4J, Dato, RDBMS, HIVE, Spark, R, Pandas, Ibis, Accumulo, SciDB, HDF5, Myria
Polystore Algebra
Desiderata for a Polystore Algebra
• Captures user intent
• Affords reasoning and optimization
• Accommodates best-known algorithms
Why do we care? Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra
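The rewrite above can be sketched as a tiny expression-tree optimizer. This is only an illustration of rule-based algebraic rewriting, not the Myria/RACO optimizer; the class and function names (`Var`, `Num`, `BinOp`, `rewrite`, `count_ops`, `show`) are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Num:
    value: int


@dataclass(frozen=True)
class BinOp:
    op: str
    left: object
    right: object


def rewrite(e):
    """Apply the four algebraic laws bottom-up in a single pass."""
    if not isinstance(e, BinOp):
        return e
    left, right = rewrite(e.left), rewrite(e.right)
    # Law 1, (+) identity: x + 0 = x
    if e.op == "+" and right == Num(0):
        return left
    # Law 2, (/) identity: x / 1 = x
    if e.op == "/" and right == Num(1):
        return left
    # Laws 4 and 3 together: x*a + x*b = (a + b)*x
    if (e.op == "+" and isinstance(left, BinOp) and isinstance(right, BinOp)
            and left.op == "*" and right.op == "*" and left.left == right.left):
        return BinOp("*", BinOp("+", left.right, right.right), left.left)
    return BinOp(e.op, left, right)


def count_ops(e):
    """Count the operators in an expression tree."""
    if isinstance(e, BinOp):
        return 1 + count_ops(e.left) + count_ops(e.right)
    return 0


def show(e):
    """Render a fully parenthesized expression."""
    if isinstance(e, Var):
        return e.name
    if isinstance(e, Num):
        return str(e.value)
    return f"({show(e.left)}{e.op}{show(e.right)})"


z = Var("z")
# N = ((z*2)+((z*3)+0))/1
original = BinOp("/",
                 BinOp("+",
                       BinOp("*", z, Num(2)),
                       BinOp("+", BinOp("*", z, Num(3)), Num(0))),
                 Num(1))
optimized = rewrite(original)
print(show(original), "has", count_ops(original), "operators")
print(show(optimized), "has", count_ops(optimized), "operators")
```

A relational optimizer follows the same pattern, with laws such as pushing a selection below a join in place of the arithmetic identities.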
The Myria Algebra is…
Relational Algebra
+ While / Sequence
+ Flatmap
+ Window Ops
+ Sample
(+ Dimension Bounds)
https://github.com/uwescience/raco/
MyriaX Radish SciDB GEMS
Parallel
Algebra
Polystore
Algebra
Middleware
SciDB
API
MyriaX
API
Radish
API
Graph
API
rewrite rules
Array Algebra
MyriaL
Services: visualization, logging, discovery, history, browsing
Orchestration
How does this actually work?
(1) Client submits a program
in one of several Big Data
languages….
(2) Program is parsed as
an expression tree….
(or programs directly against the API…)
(3) Expression tree is optimized
into a parallel, federated
execution plan involving one
or more Big Data platforms.
(4) Depending on the back end,
parallel plan may be directly
compiled into executable
code
How does this actually work?
(5) Orchestrates the parallel,
federated plan execution
across the platforms
[Diagram: Client, MyriaQ, Sys1, Sys2]
How does this actually work?
(6) Exposes query execution
logs and results through
a REST API and a visual
web-based interface
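Steps (3) and (5) can be pictured as cutting one plan into per-platform fragments and running them in dependency order. The sketch below is a hypothetical illustration only: `Op`, `fragments`, and the operator names are invented here and are not the Myria middleware API.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str       # hypothetical operator label
    platform: str   # back end assumed to run this operator
    children: list = field(default_factory=list)


def fragments(root):
    """Split a plan into single-platform fragments, inputs before consumers."""
    out = []

    def visit(op):
        ops = []
        for child in op.children:
            child_ops = visit(child)
            if child.platform == op.platform:
                # Same platform: keep the child in the parent's fragment.
                ops.extend(child_ops)
            else:
                # Platform boundary: the child becomes its own fragment,
                # and its result would be shipped to the parent's platform.
                out.append((child.platform, child_ops))
        ops.append(op.name)
        return ops

    out.append((root.platform, visit(root)))
    return out


# A join on MyriaX that consumes an aggregate computed on SciDB.
plan = Op("join", "MyriaX", [
    Op("scan_structured", "MyriaX"),
    Op("aggregate", "SciDB", [Op("scan_waveform", "SciDB")]),
])

for platform, ops in fragments(plan):
    print(platform, ops)
```

An orchestrator would then submit each fragment to its platform in the listed order, moving intermediate results across the boundaries.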
How does this actually work?
What can you do with a Polystore Algebra?
1) Facilitate Experiments
– Provide reference implementations
– Apply shared optimizations for apples-to-apples
comparisons
– K-means, Markov chain, Naïve Bayes, TPC-H,
Betweenness Centrality, Sigma-clipping, Linear
Algebra
– LANL using this idea to express algorithms to solve
governing equations for heat transfer models!
What can you do with a Polystore Algebra?
2) Rapidly develop new applications
– Microbial Oceanography
– Neuroanatomy
– Music Analytics
– Video Analytics
– Clinical Analytics
– Astronomical Image de-noising
[Diagram: flow cytometer optics with laser, microscope objective, pinhole lens, and nozzle; labels d1 and d2; channels FSC (forward scatter), orange fluorescence, and red fluorescence]
EX: SeaFlow
Francois Ribalet, Jarred Swalwell, Ginger Armbrust
Ex: SeaFlow
[Figure: flow cytometry scatter plots on log axes from 10^0 to 10^4; panels "ps3.fcs…subset" and "P35-surf"; axes FSC, 692-40 red fluorescence, and 580-30; labeled populations: Picoplankton, Nanoplankton, Ultraplankton, Prochlorococcus, Small Stuff, IS]
• Continuous observations of various phytoplankton groups from 1-20 µm in size
• Based on RED fluo: Prochlorococcus, Pico-, Ultra- and Nanoplankton
• Based on ORANGE fluo: Synechococcus, Cryptophytes
• Based on FSC: Coccolithophores
SeaFlow in Myria
• “That 5-line MyriaL program was 100x faster than my R cluster,
and much simpler”
Dan Halperin Sophie Clayton
select a.annotation
, var_samp(d.density) as var
from density d join annotation a
on d.x = a.x
and d.y = a.y
and d.z = a.z
group by a.annotation
order by var desc
limit 10
Sample variance by annotation
across all experiments
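For readers less used to SQL, the query above can be read as a join on voxel coordinates followed by a grouped sample variance. The following is a minimal in-memory sketch of that reading; the tuple layouts and the function name `variance_by_annotation` are assumptions for illustration, and groups with fewer than two values are skipped where SQL's `var_samp` would return NULL.

```python
from collections import defaultdict
from statistics import variance


def variance_by_annotation(density, annotation, limit=10):
    """density: (x, y, z, density) rows; annotation: (x, y, z, label) rows."""
    # Index annotations by coordinate so the join is a dictionary lookup.
    labels_at = defaultdict(list)
    for x, y, z, label in annotation:
        labels_at[(x, y, z)].append(label)

    # Join on (x, y, z), then group the density values by annotation.
    groups = defaultdict(list)
    for x, y, z, d in density:
        for label in labels_at.get((x, y, z), []):
            groups[label].append(d)

    # Sample variance needs at least two values; smaller groups are dropped here.
    rows = [(label, variance(vals)) for label, vals in groups.items()
            if len(vals) >= 2]
    rows.sort(key=lambda row: row[1], reverse=True)   # order by var desc
    return rows[:limit]                               # limit 10


density = [(0, 0, 0, 1.0), (0, 0, 1, 3.0), (1, 0, 0, 2.0), (1, 0, 1, 2.0)]
annotation = [(0, 0, 0, "A"), (0, 0, 1, "A"), (1, 0, 0, "B"), (1, 0, 1, "B")]
print(variance_by_annotation(density, annotation))
```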
Are two regions connected?
adjacent(r1, r2) :-
annotation(experiment, x1, y1, z1, r1),
annotation(experiment, x2, y2, z2, r2),
x2 = x1+1 or y2 = y1+1 or z2 = z1 + 1
connected(r1, r2) :- adjacent(r1,r2)
connected(r1, r3) :- connected(r1, r2), adjacent(r2, r3)
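The two `connected` rules define a transitive closure over `adjacent`. One way to evaluate them is to iterate to a fixed point, extending only the pairs discovered in the previous round. The sketch below assumes `adjacent` is already available as a set of region pairs; the function name and sample data are illustrative.

```python
def connected(adjacent):
    """Evaluate the recursive rules over a set of (r1, r2) adjacency pairs."""
    reach = set(adjacent)      # connected(r1, r2) :- adjacent(r1, r2)
    frontier = set(adjacent)
    while frontier:
        # connected(r1, r3) :- connected(r1, r2), adjacent(r2, r3)
        new = {(r1, r3)
               for (r1, r2) in frontier
               for (a, r3) in adjacent
               if a == r2} - reach
        reach |= new
        frontier = new         # only newly derived pairs can derive more
    return reach


adjacent = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(connected(adjacent)))
```

The loop stops when a round derives nothing new, which is the same termination condition a Datalog engine uses.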
Music Analytics
segments = scan(Jeremy:MSD:SegmentsTable);
songs = scan(Jeremy:MSD:SongsTable);
seg_count = select song_id, count(segment_number) as c from
segments;
density = select songs.song_id,
(seg_count.c / songs.duration) as density
from songs, seg_count
where songs.song_id = seg_count.song_id;
store(density, public:adhoc:song_density);
Computing song density
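The same computation can be sketched in plain Python to show what the MyriaL program does: count segments per song, join with the songs table, and divide by duration. The tuple layouts and the name `song_density` are assumptions for illustration; the guard against zero durations is an addition that the original query does not have.

```python
from collections import Counter


def song_density(segments, songs):
    """segments: (song_id, segment_number) rows; songs: (song_id, duration) rows."""
    # seg_count = select song_id, count(segment_number) as c from segments
    seg_count = Counter(song_id for song_id, _ in segments)
    # Join on song_id and compute segments per unit of duration.
    return {
        song_id: seg_count[song_id] / duration
        for song_id, duration in songs
        if song_id in seg_count and duration > 0   # added guard, see lead-in
    }


segments = [("s1", 0), ("s1", 1), ("s1", 2), ("s2", 0)]
songs = [("s1", 6.0), ("s2", 4.0), ("s3", 10.0)]
print(song_density(segments, songs))
```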
Million-Song Dataset
http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/
Blog post on how to run it in 20 minutes on Hadoop…
-- calculate probability of outcomes
Poe = select input_sp.id as inputId,
sum(CondP.lp) as lprob,
CondP.outcome as outcome
from CondP, input_sp
where CondP.index=input_sp.index
and CondP.value=input_sp.value;
-- select the max probability outcome
classes = select inputId,
ArgMax(outcome, lprob)
from Poe;
Naïve Bayes Classification:
Million Song Dataset
Predict song year in a 515,345-song
dataset using eight timbre features,
discretized into intervals of size 10
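The two queries above amount to summing log-probabilities per (input, outcome) pair and keeping the outcome with the largest sum. A minimal sketch of that logic follows; the row layouts mirror the column names in the query, while the function name `classify` and the toy probabilities are made up. Like the query as shown, it adds no class prior.

```python
from collections import defaultdict
from math import log


def classify(cond_p, input_sp):
    """cond_p: (index, value, outcome, lp) rows; input_sp: (id, index, value) rows."""
    # Index conditional log-probabilities by (feature index, feature value).
    by_feature = defaultdict(list)
    for index, value, outcome, lp in cond_p:
        by_feature[(index, value)].append((outcome, lp))

    # Poe: sum(lp) grouped by input id and outcome.
    lprob = defaultdict(float)
    for input_id, index, value in input_sp:
        for outcome, lp in by_feature.get((index, value), []):
            lprob[(input_id, outcome)] += lp

    # classes: ArgMax(outcome, lprob) per input id.
    best = {}
    for (input_id, outcome), lp in lprob.items():
        if input_id not in best or lp > best[input_id][1]:
            best[input_id] = (outcome, lp)
    return {input_id: outcome for input_id, (outcome, _) in best.items()}


cond_p = [
    (0, 1, 1990, log(0.8)), (0, 1, 2000, log(0.2)),
    (0, 2, 1990, log(0.1)), (0, 2, 2000, log(0.9)),
    (1, 5, 1990, log(0.3)), (1, 5, 2000, log(0.7)),
]
input_sp = [(7, 0, 1), (7, 1, 5), (8, 0, 2), (8, 1, 5)]
print(classify(cond_p, input_sp))
```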
[Figure: average heart rate (beats/minute) and average relative heart rate variance over time (hours); annotations: "bad data?" and "lower heart rate variance"]
MIMIC Information Flow
Client MyriaMiddleware MyriaX SciDB
Waveform
data
Structured
data
headless
Octave + Web
interface
REST
interface,
optimization,
orchestration
server, client
https://metanautix.com/tr/01_big_data_techniques_for_media_graphics.pdf
Ollie Lo, Los Alamos National Lab
What can you do with a Polystore Algebra?
3) Reason about algorithms
• Apply application-specific optimizations (in addition to
automatic optimizations)
CurGood = SCAN(public:adhoc:sc_points);
DO
mean = [FROM CurGood EMIT val=AVG(v)];
std = [FROM CurGood EMIT val=STDEV(v)];
NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
CurGood = CurGood - NewBad;
continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);
Sigma-clipping, V0
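The V0 loop can be sketched in a few lines of Python: recompute the mean and standard deviation from scratch on every pass and drop points more than 2 standard deviations from the mean until nothing changes. The function name `sigma_clip` and the sample data are illustrative, and the sample standard deviation is an assumption about what `STDEV` computes.

```python
from statistics import mean, stdev


def sigma_clip(values, k=2.0):
    """Repeatedly drop points more than k standard deviations from the mean."""
    good = list(values)
    while len(good) >= 2:                 # stdev needs at least two points
        m, s = mean(good), stdev(good)    # full recomputation on every pass
        kept = [v for v in good if abs(v - m) <= k * s]
        if len(kept) == len(good):        # no new outliers: fixed point reached
            break
        good = kept
    return good


data = [10, 11, 9, 10, 12, 10, 9, 11, 10, 100]
print(sigma_clip(data))
```

Each pass rescans every surviving point twice for the aggregates.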
CurGood = P
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)];
cnt = [FROM CurGood EMIT CNT(*)];
NewBad = []
DO
sum = sum - [FROM NewBad EMIT SUM(val)];
sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
cnt = cnt - [FROM NewBad EMIT CNT(*)];
mean = sum / cnt
std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum))
NewBad = FILTER([ABS(val-mean) > 2*std], CurGood)
CurGood = CurGood - NewBad
WHILE NewBad != {}
Sigma-clipping, V1: Incremental
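The incremental idea is that the three aggregates (sum, sum of squares, count) can be updated by subtracting the contribution of the points just removed, so the aggregates are never recomputed over all surviving points. A minimal Python sketch, assuming a threshold of 2 standard deviations and using the same sample-standard-deviation identity as the slide; the function name is illustrative.

```python
from math import sqrt


def sigma_clip_incremental(values, k=2.0):
    """Sigma clipping that maintains running aggregates instead of rescanning."""
    good = list(values)
    total = sum(good)
    total_sq = sum(v * v for v in good)
    count = len(good)
    new_bad = []
    while True:
        # Subtract only the contribution of the points removed last round.
        total -= sum(new_bad)
        total_sq -= sum(v * v for v in new_bad)
        count -= len(new_bad)
        if count < 2:
            break
        mean = total / count
        # std = sqrt((cnt * sumsq - sum^2) / (cnt * (cnt - 1)));
        # clamp at zero to absorb floating-point rounding.
        std = sqrt(max(count * total_sq - total * total, 0.0)
                   / (count * (count - 1)))
        new_bad = [v for v in good if abs(v - mean) > k * std]
        if not new_bad:
            break
        good = [v for v in good if abs(v - mean) <= k * std]
    return good


data = [10, 11, 9, 10, 12, 10, 9, 11, 10, 100]
print(sigma_clip_incremental(data))
```

The filter still scans the surviving points each round; restricting that scan to the band between the old and new bounds is a further refinement.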
Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = []
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
DO
new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum,
sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
stats = [FROM aggs EMIT mean=_sum/cnt,
std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v
AND v >= bounds.lower EMIT v=Points.v];
tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v
AND v <= bounds.upper EMIT v=Points.v];
newBad = UNIONALL(tooLow, tooHigh);
bounds = newBounds;
continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;
output = [FROM Points, bounds WHERE Points.v > bounds.lower AND
Points.v < bounds.upper EMIT v=Points.v];
DUMP(output);
Sigma-clipping, V2
What can you do with a Polystore Algebra?
3) Orchestrate Federated Workflows
Client MyriaX SciDB
More Orchestrating Federated Workflows
Spark, Hadoop, RDBMS, MyriaQ
What can you do with a Polystore Algebra?
4) Study the price of abstraction
Compiling the Myria algebra to bare metal PGAS programs
RADISH
ICDE 15
Brandon
Myers
Query compilation for distributed processing
pipeline as parallel code → parallel compiler → machine code [Myers ’14]
pipeline fragment code → sequential compiler → machine code [Crotty ’14, Li ’14, Seo ’14, Murray ’11]
1% selection microbenchmark, 20GB
Avoid long code paths
Q2 SP2Bench, 100M triples, multiple self-joins
Communication optimization
Graph Patterns
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then
compiled again by a low-level PGAS compiler
• One of Myria’s supported back ends
• Comparison with Shark/Spark, which itself has been shown to
be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
Matrix multiply in RA
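The query treats each matrix as a relation of (row, column, value) triples holding only the non-zeros: the join on `A.j = B.j` pairs up the terms of each dot product, and the group-by sums them. A minimal Python sketch of that evaluation strategy, using a hash join; the function name and the small test matrices are illustrative.

```python
from collections import defaultdict


def matmul_relational(A, B):
    """A, B: iterables of (row, col, val) triples for the non-zero entries."""
    # Build side of the hash join: index B by its row index j.
    b_by_row = defaultdict(list)
    for j, k, val in B:
        b_by_row[j].append((k, val))

    # Probe with A on A.j = B.j, then group by (A.i, B.k) and sum the products.
    out = defaultdict(float)
    for i, j, a_val in A:
        for k, b_val in b_by_row.get(j, []):
            out[(i, k)] += a_val * b_val
    return dict(out)


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]] as sparse triples.
A = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
B = [(0, 0, 4), (1, 0, 5), (1, 1, 6)]
print(matmul_relational(A, B))
```

The work done is proportional to the number of matching non-zero pairs, which is why this formulation suits sparse matrices.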
Matrix multiply
Complexity of matrix multiply, plotted as complexity exponent vs. sparsity exponent ($r$ such that $m = n^r$)
n = number of rows
m = number of non-zeros
best known dense algorithm: $n^{2.38}$
naïve sparse algorithm: $mn$
best known sparse algorithm: $m^{0.7} n^{1.2} + n^2$
lots of room here
slide adapted from Zwick: R. Yuster and U. Zwick, Fast Sparse Matrix
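As a worked reading of the plot, substituting $m = n^r$ puts all three bounds in terms of $n$ alone, which shows where each algorithm wins. This uses only the exponents quoted on the slide; the crossover values are derived here and rounded.

```latex
\begin{align*}
m = n^{r} \;&\Rightarrow\; mn = n^{1+r}, \qquad
  m^{0.7} n^{1.2} + n^{2} = n^{0.7r + 1.2} + n^{2} \\
\text{naive sparse beats dense:}\quad 1 + r < 2.38 \;&\iff\; r < 1.38 \\
\text{best sparse beats dense:}\quad 0.7r + 1.2 < 2.38 \;&\iff\; r < \tfrac{1.18}{0.7} \approx 1.69 \\
\text{first term dominates } n^{2}:\quad 0.7r + 1.2 > 2 \;&\iff\; r > \tfrac{0.8}{0.7} \approx 1.14
\end{align*}
```

For $r$ below about 1.14 the best sparse bound is dominated by its $n^2$ term, the cost of an $n \times n$ output.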
BLAS vs. SpBLAS vs. SQL (10k)
off the shelf
database
15X
Relative Speedup of SpBLAS vs. HyperDB
- speedup = T_HyperDB / T_SpBLAS
- benchmark datasets with r = 1.2, plus the real-data cases (the three largest datasets: 1.17 < r < 1.20)
- run on star (nTh = 12) and on dragon (nTh = 60)
- As n increases, the relative speedup of SpBLAS over HyperDB decreases.
- soc-Pokec: the speedup is only around 5 times.
- On star, HyperDB got stuck thrashing on the soc-Pokec data.
20k X 20k matrix multiply by sparsity
CombBLAS, MyriaX, Radish
50k X 50k matrix multiply by sparsity
CombBLAS, MyriaX, Radish
Filter to upper left corner of result matrix
What can you do with a Polystore Algebra?
5) Provide new services over a Polystore
Ecosystem
Lowering barrier to entry
Exposing Performance Issues
Dominik Moritz
EuroSys 15
Source worker
Destination worker
Kanit "Ham"
Wongsuphasawat
Voyager: Visualization
Recommendation
InfoVis 15
Seung-Hee Bae
Scalable Graph Clustering
Version 1
Parallelize Best-known
Serial Algorithm
ICDM 2013
Version 2
Free 30% improvement
for any algorithm
TKDD 2014 SC 2015
Version 3
Distributed approx.
algorithm, 1.5B edges
Viziometrics: Analysis of Visualization
in the Scientific Literature
Proportion of
non-quantitative
figures in paper
Paper impact, grouped into 5% percentiles
Poshen Lee
http://escience.washington.edu
http://myria.cs.washington.edu
http://uwescience.github.io/sqlshare/

More Related Content

What's hot

Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceUniversity of Washington
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaUniversity of Washington
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Big Data Spain
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchUniversity of Washington
 
Big Data in NATO and Your Role
Big Data in NATO and Your RoleBig Data in NATO and Your Role
Big Data in NATO and Your RoleJay Gendron
 
Introduction to Big Data and Data Science
Introduction to Big Data and Data ScienceIntroduction to Big Data and Data Science
Introduction to Big Data and Data ScienceFeyzi R. Bagirov
 
New Trends and Directions in Data Science - MIT Information Quality Conferenc...
New Trends and Directions in Data Science - MIT Information Quality Conferenc...New Trends and Directions in Data Science - MIT Information Quality Conferenc...
New Trends and Directions in Data Science - MIT Information Quality Conferenc...Mario Faria
 
DSSG Speaker Series: Paco Nathan
DSSG Speaker Series: Paco NathanDSSG Speaker Series: Paco Nathan
DSSG Speaker Series: Paco NathanPaco Nathan
 
Facilitating Web Science Collaboration through Semantic Markup
Facilitating Web Science Collaboration through Semantic MarkupFacilitating Web Science Collaboration through Semantic Markup
Facilitating Web Science Collaboration through Semantic MarkupJames Hendler
 
Data Big and Broad (Oxford, 2012)
Data Big and Broad (Oxford, 2012)Data Big and Broad (Oxford, 2012)
Data Big and Broad (Oxford, 2012)James Hendler
 
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...Artificial Intelligence Institute at UofSC
 
Hector Guerrero- Road to Business Analytics
Hector Guerrero- Road to Business AnalyticsHector Guerrero- Road to Business Analytics
Hector Guerrero- Road to Business AnalyticsErika Marr
 
Research Metadata Mechanics - Simon Porter
Research Metadata Mechanics - Simon PorterResearch Metadata Mechanics - Simon Porter
Research Metadata Mechanics - Simon PorterCASRAI
 
Data Center Computing for Data Science: an evolution of machines, middleware,...
Data Center Computing for Data Science: an evolution of machines, middleware,...Data Center Computing for Data Science: an evolution of machines, middleware,...
Data Center Computing for Data Science: an evolution of machines, middleware,...Paco Nathan
 
Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Alexandru Iosup
 
A Blind Date With (Big) Data: Student Data in (Higher) Education
A Blind Date With (Big) Data: Student Data in (Higher) EducationA Blind Date With (Big) Data: Student Data in (Higher) Education
A Blind Date With (Big) Data: Student Data in (Higher) EducationUniversity of South Africa (Unisa)
 

What's hot (20)

Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and Myria
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
End-to-End eScience
End-to-End eScienceEnd-to-End eScience
End-to-End eScience
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible Research
 
Big Data in NATO and Your Role
Big Data in NATO and Your RoleBig Data in NATO and Your Role
Big Data in NATO and Your Role
 
Introduction to Big Data and Data Science
Introduction to Big Data and Data ScienceIntroduction to Big Data and Data Science
Introduction to Big Data and Data Science
 
eResearch New Zealand Keynote
eResearch New Zealand KeynoteeResearch New Zealand Keynote
eResearch New Zealand Keynote
 
Broad Data
Broad DataBroad Data
Broad Data
 
New Trends and Directions in Data Science - MIT Information Quality Conferenc...
New Trends and Directions in Data Science - MIT Information Quality Conferenc...New Trends and Directions in Data Science - MIT Information Quality Conferenc...
New Trends and Directions in Data Science - MIT Information Quality Conferenc...
 
DSSG Speaker Series: Paco Nathan
DSSG Speaker Series: Paco NathanDSSG Speaker Series: Paco Nathan
DSSG Speaker Series: Paco Nathan
 
Facilitating Web Science Collaboration through Semantic Markup
Facilitating Web Science Collaboration through Semantic MarkupFacilitating Web Science Collaboration through Semantic Markup
Facilitating Web Science Collaboration through Semantic Markup
 
Data Big and Broad (Oxford, 2012)
Data Big and Broad (Oxford, 2012)Data Big and Broad (Oxford, 2012)
Data Big and Broad (Oxford, 2012)
 
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
 
Hector Guerrero- Road to Business Analytics
Hector Guerrero- Road to Business AnalyticsHector Guerrero- Road to Business Analytics
Hector Guerrero- Road to Business Analytics
 
Research Metadata Mechanics - Simon Porter
Research Metadata Mechanics - Simon PorterResearch Metadata Mechanics - Simon Porter
Research Metadata Mechanics - Simon Porter
 
Data Center Computing for Data Science: an evolution of machines, middleware,...
Data Center Computing for Data Science: an evolution of machines, middleware,...Data Center Computing for Data Science: an evolution of machines, middleware,...
Data Center Computing for Data Science: an evolution of machines, middleware,...
 
Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.
 
A Blind Date With (Big) Data: Student Data in (Higher) Education
A Blind Date With (Big) Data: Student Data in (Higher) EducationA Blind Date With (Big) Data: Student Data in (Higher) Education
A Blind Date With (Big) Data: Student Data in (Higher) Education
 

Similar to The Other HPC: High Productivity Computing

Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...Zbigniew Jerzak
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible researchYannick Wurm
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poliivascucristian
 
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceSQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceUniversity of Washington
 
Automatic and Interpretable Machine Learning with H2O and LIME
Automatic and Interpretable Machine Learning with H2O and LIMEAutomatic and Interpretable Machine Learning with H2O and LIME
Automatic and Interpretable Machine Learning with H2O and LIMEJo-fai Chow
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
MLconf NYC Shan Shan Huang
MLconf NYC Shan Shan HuangMLconf NYC Shan Shan Huang
MLconf NYC Shan Shan HuangMLconf
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdfsash236
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverDataWorks Summit
 
Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019 Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019 Tal Bar-Zvi
 
GALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software EngineeringGALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software EngineeringCS, NcState
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineeringjtdudley
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Julian Hyde
 
Products go Green: Worst-Case Energy Consumption in Software Product Lines
Products go Green: Worst-Case Energy Consumption in Software Product LinesProducts go Green: Worst-Case Energy Consumption in Software Product Lines
Products go Green: Worst-Case Energy Consumption in Software Product LinesGreenLabAtDI
 
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...Matthew J Collins
 
And Then There Are Algorithms
And Then There Are AlgorithmsAnd Then There Are Algorithms
And Then There Are AlgorithmsInfluxData
 

Similar to The Other HPC: High Productivity Computing (20)

Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
BDS_QA.pdf
BDS_QA.pdfBDS_QA.pdf
BDS_QA.pdf
 
2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poli
 
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceSQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
 
Automatic and Interpretable Machine Learning with H2O and LIME
Automatic and Interpretable Machine Learning with H2O and LIMEAutomatic and Interpretable Machine Learning with H2O and LIME
Automatic and Interpretable Machine Learning with H2O and LIME
 
Blinkdb
BlinkdbBlinkdb
Blinkdb
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
MLconf NYC Shan Shan Huang
MLconf NYC Shan Shan HuangMLconf NYC Shan Shan Huang
MLconf NYC Shan Shan Huang
 
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdf
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
 
Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019 Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019
 
GALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software EngineeringGALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software Engineering
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineering
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
 
Products go Green: Worst-Case Energy Consumption in Software Product Lines
Products go Green: Worst-Case Energy Consumption in Software Product LinesProducts go Green: Worst-Case Energy Consumption in Software Product Lines
Products go Green: Worst-Case Energy Consumption in Software Product Lines
 
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
 
And Then There Are Algorithms
And Then There Are AlgorithmsAnd Then There Are Algorithms
And Then There Are Algorithms
 

More from University of Washington

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)University of Washington
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceUniversity of Washington
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureUniversity of Washington
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsUniversity of Washington
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsUniversity of Washington
 
The Other HPC: High Productivity Computing

  • 1. The Other HPC: High Productivity Computing in Polystore Environments. Bill Howe, Ph.D.: Associate Director, eScience Institute; Senior Data Science Fellow, eScience Institute; Affiliate Associate Professor, Computer Science & Engineering. 11/23/2015, Bill Howe, UW
  • 2. What is the rate-limiting step in data understanding? [Figure: amount of data in the world and processing power (Moore’s Law), each plotted against time.]
  • 3. What is the rate-limiting step in data understanding? [Figure: amount of data in the world, processing power (Moore’s Law), and human cognitive capacity plotted against time.] Idea adapted from “Less is More” by Bill Buxton (2001). Slide source: Cecilia Aragon, UW HCDE.
  • 4. How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%”
  • 5. Where does the time go? (2) “[This was hard] due to the large amount of data (e.g. data indexes for data retrieval, dissection into data blocks and processing steps, order in which steps are performed to match memory/time requirements, file formats required by software used). In addition we actually spend quite some time in iterations fixing problems with certain features (e.g. capping ENCODE data), testing features and feature products to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs human-derived variants). So roughly 50% of the project was testing and improving the model, 30% figuring out how to do things (engineering) and 20% getting files and getting them into the right format. I guess in total [I spent] 6 months [on this project].” Martin Kircher, Genome Sciences. At least 3 months on issues of scale, file handling, and feature extraction. Why? 3k NSF postdocs in 2010; $50k / postdoc; at least 50% overhead; maybe $75M annually at NSF alone?
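The back-of-the-envelope estimate on slide 5 can be checked in a few lines. This is only a sketch: it assumes "50% overhead" means that half of each postdoc's funded time goes to data handling, and the variable names are illustrative rather than taken from the talk.

```python
# Rough annual cost of data handling, using the figures quoted on the slide.
postdocs = 3_000            # NSF postdocs in 2010 (from the slide)
cost_per_postdoc = 50_000   # dollars per postdoc per year (from the slide)
overhead_fraction = 0.5     # assumption: share of time spent handling data

annual_overhead = postdocs * cost_per_postdoc * overhead_fraction
print(f"${annual_overhead / 1e6:.0f}M per year")
```

Under that reading, the product matches the slide's "maybe $75M annually".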
  • 6. Productivity: how long I have to wait for results. [Figure: time scale of months, weeks, days, hours, minutes, seconds, milliseconds; labels: HPC, Systems, Databases, feasibility threshold, interactivity threshold.] These two performance thresholds are really important; other requirements are situation-specific.
  • 7. Polystore Algebra. [Figure: data models (Table, Graph, Array, Matrix, Key-Value, Dataframe) and systems (MATLAB, GEMS, GraphX, Neo4J, Dato, RDBMS, HIVE, Spark, R, Pandas, Ibis, Accumulo, SciDB, HDF5, Myria).]
  • 8. Desiderata for a Polystore Algebra: captures user intent; affords reasoning and optimization; accommodates best-known algorithms.
  • 10. Why do we care? Algebraic Optimization
      N = ((z*2)+((z*3)+0))/1
      Algebraic laws:
        1. (+) identity: x+0 = x
        2. (/) identity: x/1 = x
        3. (*) distributes: (n*x+n*y) = n*(x+y)
        4. (*) commutes: x*y = y*x
      Apply rules 1, 3, 4, 2: N = (2+3)*z
      Two operations instead of five, and no division operator.
      The same idea works with the relational algebra.
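The rewrite on slide 10 can be sketched as a tiny rule-based optimizer. This is an illustrative toy, not the Myria/raco optimizer: expressions are nested tuples `(op, left, right)`, and the function names `rewrite`, `count_ops`, and `show` are hypothetical. Rule 4 is applied only to move a compound operand to the left, which is one possible way to keep commutation from looping.

```python
def rewrite(e):
    """Apply the four algebraic laws bottom-up to an expression tree."""
    if not isinstance(e, tuple):
        return e                      # variable name or numeric constant
    op, a, b = e
    a, b = rewrite(a), rewrite(b)
    if op == '+' and b == 0:          # rule 1: x + 0 = x
        return a
    if op == '/' and b == 1:          # rule 2: x / 1 = x
        return a
    if (op == '+' and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == '*' and b[0] == '*' and a[1] == b[1]):
        # rule 3: n*x + n*y = n*(x + y)
        return rewrite(('*', a[1], ('+', a[2], b[2])))
    if op == '*' and not isinstance(a, tuple) and isinstance(b, tuple):
        return ('*', b, a)            # rule 4: x*y = y*x (compound operand first)
    return (op, a, b)


def count_ops(e):
    """Number of operators in an expression tree."""
    if not isinstance(e, tuple):
        return 0
    return 1 + count_ops(e[1]) + count_ops(e[2])


def show(e):
    """Fully parenthesized string form of an expression tree."""
    if not isinstance(e, tuple):
        return str(e)
    return f"({show(e[1])}{e[0]}{show(e[2])})"


# N = ((z*2)+((z*3)+0))/1
expr = ('/', ('+', ('*', 'z', 2), ('+', ('*', 'z', 3), 0)), 1)
optimized = rewrite(expr)
print(show(expr), "->", show(optimized))
print(count_ops(expr), "operations ->", count_ops(optimized))
```

Adding a law means adding one more pattern check to `rewrite`, which is the property that makes rule-based optimizers easy to extend.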
  • 11. The Myria Algebra is… Relational Algebra + While / Sequence + Flatmap + Window Ops + Sample (+ Dimension Bounds) https://github.com/uwescience/raco/
  • 12. [Architecture diagram labels] Languages and algebras: MyriaL, Array Algebra, Polystore Algebra, Parallel Algebra, rewrite rules. Middleware: orchestration; services: visualization, logging, discovery, history, browsing. APIs: SciDB API, MyriaX API, Radish API, Graph API. Back ends: MyriaX, Radish, SciDB, GEMS.
  • 13. How does this actually work? (1) Client submits a program in one of several Big Data languages (or programs directly against the API). (2) Program is parsed as an expression tree.
  • 14. How does this actually work? (3) Expression tree is optimized into a parallel, federated execution plan involving one or more Big Data platforms. (4) Depending on the back end, the parallel plan may be directly compiled into executable code.
  • 15. How does this actually work? (5) Orchestrates the parallel, federated plan execution across the platforms. [Diagram labels: Client, MyriaQ, Sys1, Sys2]
  • 16. How does this actually work? (6) Exposes query execution logs and results through a REST API and a visual web-based interface.
  • 17. What can you do with a Polystore Algebra? 1) Facilitate experiments: provide reference implementations; apply shared optimizations for apples-to-apples comparisons; K-means, Markov chain, Naïve Bayes, TPC-H, Betweenness Centrality, Sigma-clipping, Linear Algebra; LANL is using this idea to express algorithms to solve governing equations for heat transfer models!
  • 18. What can you do with a Polystore Algebra? 2) Rapidly develop new applications: Microbial Oceanography; Neuroanatomy; Music Analytics; Video Analytics; Clinical Analytics; Astronomical Image de-noising.
  • 19. Ex: SeaFlow. [Instrument diagram labels: laser, microscope objective, pinhole, lens, nozzle, d1, d2, FSC (forward scatter), orange fluorescence, red fluorescence.] Francois Ribalet, Jarred Swalwell, Ginger Armbrust
  • 20. Ex: SeaFlow. [Scatter plots on log axes from 10^0 to 10^4; panels: ps3.fcs…subset, P35-surf; axes: FSC, 692-40 RED fluorescence, 580-30; labeled populations: Picoplankton, Nanoplankton, Ultraplankton, Prochlorococcus, Small Stuff, IS.] Continuous observations of various phytoplankton groups from 1-20 µm in size. Based on RED fluorescence: Prochlorococcus, Pico-, Ultra- and Nanoplankton. Based on ORANGE fluorescence: Synechococcus, Cryptophytes. Based on FSC: Coccolithophores. Francois Ribalet, Jarred Swalwell, Ginger Armbrust
  • 21. SeaFlow in Myria: “That 5-line MyriaL program was 100x faster than my R cluster, and much simpler.” Dan Halperin, Sophie Clayton
  • 25. Sample variance by annotation across all experiments
      select a.annotation, var_samp(d.density) as var
      from density d
      join annotation a
        on d.x = a.x and d.y = a.y and d.z = a.z
      group by a.annotation
      order by var desc
      limit 10
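To make the query on slide 25 concrete, here is a minimal in-memory sketch of the same join, group-by, and sample variance. The toy tables and the region labels are invented for illustration, and the sketch assumes one row per voxel coordinate. Groups with fewer than two values are skipped here, whereas SQL's `var_samp` would return NULL for them.

```python
from collections import defaultdict
from statistics import variance

# Hypothetical toy tables keyed by voxel coordinate (x, y, z).
density = {
    (0, 0, 0): 1.0, (1, 0, 0): 3.0,
    (0, 1, 0): 2.0, (1, 1, 0): 2.0,
    (0, 0, 1): 5.0,                    # no annotation: dropped by the join
}
annotation = {
    (0, 0, 0): "cortex", (1, 0, 0): "cortex",
    (0, 1, 0): "thalamus", (1, 1, 0): "thalamus",
}


def variance_by_annotation(density, annotation, limit=10):
    """Join on (x, y, z), group by annotation, rank by sample variance."""
    groups = defaultdict(list)
    for coord, value in density.items():
        if coord in annotation:                       # inner join
            groups[annotation[coord]].append(value)
    rows = [(label, variance(vals))                   # sample variance (n - 1)
            for label, vals in groups.items() if len(vals) >= 2]
    rows.sort(key=lambda row: row[1], reverse=True)   # order by var desc
    return rows[:limit]                               # limit 10


top = variance_by_annotation(density, annotation)
print(top)
```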
  • 26. Are two regions connected?
      adjacent(r1, r2) :- annotation(experiment, x1, y1, z1, r1),
                          annotation(experiment, x2, y2, z2, r2),
                          x2 = x1+1 or y2 = y1+1 or z2 = z1+1
      connected(r1, r2) :- adjacent(r1, r2)
      connected(r1, r3) :- connected(r1, r2), adjacent(r2, r3)
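The two `connected` rules on slide 26 define a transitive closure. A minimal sketch of how such a recursive query can be evaluated is shown below, using semi-naive evaluation (only newly derived facts are joined in each round). The `adjacent` pairs are hypothetical sample data, and the relation is treated as directed, as in the rules as written.

```python
def connected(adjacent):
    """Transitive closure of a set of (r1, r2) adjacency pairs."""
    closure = set(adjacent)      # connected(r1, r2) :- adjacent(r1, r2)
    delta = set(adjacent)        # facts derived in the previous round
    while delta:
        # connected(r1, r3) :- connected(r1, r2), adjacent(r2, r3)
        derived = {(r1, r3)
                   for (r1, r2) in delta
                   for (start, r3) in adjacent
                   if start == r2}
        delta = derived - closure    # keep only genuinely new facts
        closure |= delta
    return closure


# Hypothetical adjacency facts between annotated regions.
adjacent = {("A", "B"), ("B", "C"), ("D", "E")}
result = connected(adjacent)
print(sorted(result))
```

The loop stops as soon as a round derives nothing new, which plays the same role as the termination condition of a While operator in the algebra.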
  • 27. Music Analytics: computing song density (Million-Song Dataset)
      segments = scan(Jeremy:MSD:SegmentsTable);
      songs = scan(Jeremy:MSD:SongsTable);
      seg_count = select song_id, count(segment_number) as c from segments;
      density = select songs.song_id, (seg_count.c / songs.duration) as density
                from songs, seg_count
                where songs.song_id = seg_count.song_id;
      store(density, public:adhoc:song_density);
      Blog post on how to run it in 20 minutes on Hadoop… http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/
  • 28. Naïve Bayes Classification: Million Song Dataset. Predict song year in a 515,345-song dataset using eight timbre features, discretized into intervals of size 10.
      -- calculate probability of outcomes
      Poe = select input_sp.id as inputId, sum(CondP.lp) as lprob, CondP.outcome as outcome
            from CondP, input_sp
            where CondP.index=input_sp.index and CondP.value=input_sp.value;
      -- select the max probability outcome
      classes = select inputId, ArgMax(outcome, lprob) from Poe;
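The classification step on slide 28 is a join followed by a sum of log-probabilities and an arg-max. The sketch below mirrors that shape on invented data: the `cond_lp` and `inputs` tables and the decade labels are hypothetical, and, like the query on the slide, it sums only conditional log-probabilities and applies no class prior.

```python
from collections import defaultdict

# Hypothetical CondP table: (feature index, value, outcome) -> log P(value | outcome)
cond_lp = {
    (0, 1, "1990s"): -0.2, (0, 1, "2000s"): -1.5,
    (0, 2, "1990s"): -2.0, (0, 2, "2000s"): -0.4,
    (1, 4, "1990s"): -0.9, (1, 4, "2000s"): -0.3,
}
# Hypothetical input_sp table: (input id, feature index, value)
inputs = [(101, 0, 1), (101, 1, 4), (102, 0, 2), (102, 1, 4)]


def classify(inputs, cond_lp):
    """Return {input id: most probable outcome}."""
    # Poe: join on (index, value), sum lp per (input id, outcome)
    scores = defaultdict(float)
    for input_id, index, value in inputs:
        for (i, v, outcome), lp in cond_lp.items():
            if (i, v) == (index, value):
                scores[(input_id, outcome)] += lp
    # classes: ArgMax(outcome, lprob) per input id
    best = {}
    for (input_id, outcome), lprob in scores.items():
        if input_id not in best or lprob > best[input_id][1]:
            best[input_id] = (outcome, lprob)
    return {input_id: outcome for input_id, (outcome, _) in best.items()}


classes = classify(inputs, cond_lp)
print(classes)
```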
  • 29. [Plots: average heart rate (beats/minute) and average relative heart rate variance over time (hours); annotations: “bad data?”, “lower heart rate variance”.]
  • 30. MIMIC Information Flow. [Diagram labels: Client, Myria Middleware, MyriaX, SciDB, waveform data, structured data, headless Octave + Web interface, REST interface, optimization, orchestration, server, client.]
  • 33. Ollie Lo, Los Alamos National Lab
  • 34. What can you do with a Polystore Algebra? 3) Reason about algorithms: apply application-specific optimizations (in addition to automatic optimizations).
  • 35. Sigma-clipping, V0
      CurGood = SCAN(public:adhoc:sc_points);
      DO
        mean = [FROM CurGood EMIT val=AVG(v)];
        std = [FROM CurGood EMIT val=STDEV(v)];
        NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
        CurGood = CurGood - NewBad;
        continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
      WHILE continue;
      DUMP(CurGood);
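For readers unfamiliar with sigma-clipping, the V0 loop on slide 35 can be sketched in plain Python. This is an illustrative equivalent, not the MyriaL program: the function name, the `k` parameter, and the sample data are assumptions, and the sample standard deviation is used for `STDEV`.

```python
from statistics import mean, stdev


def sigma_clip(values, k=2.0):
    """Repeatedly drop points more than k standard deviations from the mean."""
    good = list(values)
    while len(good) >= 2:
        m, s = mean(good), stdev(good)        # recomputed from scratch each pass
        kept = [v for v in good if abs(v - m) <= k * s]
        if len(kept) == len(good):            # no new outliers: converged
            break
        good = kept
    return good


# Hypothetical measurements with one obvious outlier.
points = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
clipped = sigma_clip(points)
print(clipped)
```

Each pass rescans all surviving points to recompute the mean and standard deviation, which is the cost the later versions of the program try to avoid.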
  • 36. Sigma-clipping, V1: Incremental
      CurGood = P
      sum = [FROM CurGood EMIT SUM(val)];
      sumsq = [FROM CurGood EMIT SUM(val*val)];
      cnt = [FROM CurGood EMIT CNT(*)];
      NewBad = []
      DO
        sum = sum - [FROM NewBad EMIT SUM(val)];
        sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
        cnt = cnt - [FROM NewBad EMIT CNT(*)];
        mean = sum / cnt
        std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum))
        NewBad = FILTER([ABS(val-mean)>std], CurGood)
        CurGood = CurGood - NewBad
      WHILE NewBad != {}
  • 37. Sigma-clipping, V2
      Points = SCAN(public:adhoc:sc_points);
      aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
      newBad = []
      bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
      DO
        new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
        aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum,
                sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
        stats = [FROM aggs EMIT mean=_sum/cnt,
                 std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
        newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
        tooLow = [FROM Points, bounds, newBounds
                  WHERE newBounds.lower > v AND v >= bounds.lower EMIT v=Points.v];
        tooHigh = [FROM Points, bounds, newBounds
                   WHERE newBounds.upper < v AND v <= bounds.upper EMIT v=Points.v];
        newBad = UNIONALL(tooLow, tooHigh);
        bounds = newBounds;
        continue = [FROM newBad EMIT COUNT(v) > 0];
      WHILE continue;
      output = [FROM Points, bounds
                WHERE Points.v > bounds.lower AND Points.v < bounds.upper EMIT v=Points.v];
      DUMP(output);
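The idea behind the incremental versions (slides 36 and 37) is to keep running aggregates (sum, sum of squares, count) and a pair of bounds, so each pass only touches the points that are newly rejected. The sketch below illustrates that idea in Python under stated assumptions: the names and sample data are hypothetical, and the bounds are clamped so they never widen, which is a simplification added here and not something shown on the slides.

```python
from math import sqrt


def sigma_clip_incremental(values, k=2.0):
    """Sigma-clipping that updates running aggregates instead of rescanning."""
    points = list(values)
    if not points:
        return []
    total = sum(points)
    total_sq = sum(v * v for v in points)
    count = len(points)
    lower, upper = min(points), max(points)
    while count >= 2:
        mean = total / count
        # sample standard deviation from the running aggregates
        std = sqrt(max(count * total_sq - total * total, 0.0)
                   / (count * (count - 1)))
        new_lower, new_upper = mean - k * std, mean + k * std
        # only points between the old and the new bounds are newly rejected
        new_bad = [v for v in points
                   if lower <= v < new_lower or new_upper < v <= upper]
        if not new_bad:
            break
        total -= sum(new_bad)
        total_sq -= sum(v * v for v in new_bad)
        count -= len(new_bad)
        # assumption: clamp so the bounds only tighten
        lower, upper = max(lower, new_lower), min(upper, new_upper)
    return [v for v in points if lower <= v <= upper]


# Same hypothetical measurements as a naive version would use.
points = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
clipped = sigma_clip_incremental(points)
print(clipped)
```

Because the statistics come from three running numbers, the per-iteration work is a filter over a narrow band of values, which is the kind of application-specific optimization the slide refers to.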
  • 38. What can you do with a Polystore Algebra? 3) Orchestrate Federated Workflows
  • 39. Orchestrating Federated Workflows. [Diagram labels: Client, MyriaQ, MyriaX, SciDB, Spark, Hadoop, RDBMS, More]
  • 40. What can you do with a Polystore Algebra? 4) Study the price of abstraction
  • 41. RADISH: Compiling the Myria algebra to bare metal PGAS programs. ICDE 15. Brandon Myers
  • 43. Query compilation for distributed processing. [Myers ’14]: pipeline as parallel code -> parallel compiler -> machine code. [Crotty ’14, Li ’14, Seo ’14, Murray ’11]: pipeline fragment code -> sequential compiler -> machine code, once per fragment.
  • 44. 1% selection microbenchmark, 20GB. Avoid long code paths.
  • 45. Q2 SP2Bench, 100M triples, multiple self-joins. Communication optimization.
  • 46. Graph Patterns (RADISH, ICDE 15) • SP2Bench, 100 million triples • Queries compiled to a PGAS C++ language layer, then compiled again by a low-level PGAS compiler • One of Myria’s supported back ends • Comparison with Shark/Spark, which itself has been shown to be 100X faster than Hadoop-based systems • …plus PageRank, Naïve Bayes, and more
  • 48. Matrix multiply in RA
      select A.i, B.k, sum(A.val*B.val)
      from A, B
      where A.j = B.j
      group by A.i, B.k
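The relational formulation on slide 48 treats each matrix as a table of non-zero entries `(row, col, val)`. A minimal sketch of the same join and aggregation in Python follows; the function name and the two small matrices are illustrative assumptions.

```python
from collections import defaultdict


def matmul_relational(A, B):
    """Multiply sparse matrices given as lists of (row, col, val) triples."""
    # Build a hash index on B's row index for the join condition A.j = B.j.
    b_by_row = defaultdict(list)
    for j, k, val in B:
        b_by_row[j].append((k, val))
    # group by (A.i, B.k), sum(A.val * B.val)
    C = defaultdict(float)
    for i, j, a_val in A:
        for k, b_val in b_by_row.get(j, ()):
            C[(i, k)] += a_val * b_val
    return dict(C)


A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]   # [[1, 2], [0, 3]]
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]   # [[4, 0], [5, 6]]
C = matmul_relational(A, B)
print(C)
```

Only non-zero entries are stored or touched, so the work scales with the number of matching pairs rather than with the full dimensions, which is why sparsity matters for this formulation.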
  • 49. Complexity of matrix multiply. n = number of rows, m = number of non-zeros. [Plot: complexity exponent vs. sparsity exponent (r s.t. m = n^r).] Naïve sparse algorithm: mn. Best known sparse algorithm: m^0.7 n^1.2 + n^2. Best known dense algorithm: n^2.38. Annotation: “lots of room here”. Slide adapted from Zwick; R. Yuster and U. Zwick, Fast Sparse Matrix
  • 50. BLAS vs. SpBLAS vs. SQL (10k). Off-the-shelf database: 15X
  • 51. Relative Speedup of SpBLAS vs. HyperDB - speedup = T_HyperDB / T_SpBLAS - benchmark datasets with r = 1.2 and the real-data cases (the three largest datasets: 1.17 < r < 1.20) - on star (nTh = 12), on dragon (nTh = 60) - As n increases, the relative speedup of SpBLAS over HyperDB decreases. - soc-Pokec: the speedup is only around 5 times. On star, HyperDB gets stuck thrashing on the soc-Pokec data.
  • 52. 20k X 20k matrix multiply by sparsity: CombBLAS, MyriaX, Radish
  • 53. 50k X 50k matrix multiply by sparsity: CombBLAS, MyriaX, Radish. Filter to upper left corner of result matrix.
  • 54. What can you do with a Polystore Algebra? 5) Provide new services over a Polystore Ecosystem
  • 57. Exposing Performance Issues. Dominik Moritz, EuroSys 15. [Figure axes: source worker, destination worker]
  • 59. Scalable Graph Clustering (Seung-Hee Bae). Version 1: parallelize best-known serial algorithm (ICDM 2013). Version 2: free 30% improvement for any algorithm (TKDD 2014). Version 3: distributed approx. algorithm, 1.5B edges (SC 2015).
  • 60. Viziometrics: Analysis of Visualization in the Scientific Literature. [Plot: proportion of non-quantitative figures in paper vs. paper impact, grouped into 5% percentiles.] Poshen Lee

Editor's Notes

  1. And processing power, either as raw processor speed or via novel multi-core and many-core architectures, is also continuing to increase exponentially…
  2. … but human cognitive capacity is remaining constant. How can computing technologies help scientists make sense out of these vast and complex data sets?
  3. We want to give a little background of our project before we launch into it, so we will discuss the problem we are trying to solve. Essentially, we want to remove the speed-bump of data handling from the scientists.
  4. Express these plans; optimize these plans; compile these plans; execute these plans.
  5. So our approach is to model this overlap in capabilities as its own language. We start
  6. Matrices and linear algebra are a terrible programming model, but so much math has been developed around them that they are here to stay. The functional programming crowd has been poised to solve all the world’s ills for 60 years, but they tend to have trouble pulling their heads out of their own navels long enough to solve someone’s actual problem in practice. Objects and methods are great for building software systems, but get in the way for data analysis. Files and scripts aren’t really data analysis; they are low-level operating system concepts. Data frames are just relations. Key-value pairs: I’ll talk more about this in a bit. Scale. “While the community was skeptical that this new method could possibly outperform hand-coding, it reduced the number of programming statements necessary to operate a machine by a factor of 20, and quickly gained acceptance.” “Relational model was buggy and slow, but you only had to write 5% of the code you used to have to write.”
  7. We hoist
  8. MyriaL is an imperative language we like; I’ll show you some examples of that. The whole program is chained together as one big expression, perhaps with loops
  9. The logical plan is translated into a possibly federated, typically parallel, back-end-specific physical plan. Optimization rules are applied as appropriate. We’ve gotten more mileage than we expected out of just a simple rule-based optimizer, for two reasons: we have tried to make it very easy to add new rules on the fly, and we have made some algorithmic developments. For example, there’s been a lot of recent work on worst-case optimal join algorithms that scale with the size of the output rather than (only) the size of the input. One of our students has developed a variant of these worst-case optimal, multi-way join algorithms that looks like it could subsume the need for a lot of fretting about join order, skew handling, broadcasting, merge vs. hash, etc.
  10. Single interface to multiple big data systems. No one size fits all: there WILL be multiple systems and multiple tasks in play in realistic scenarios. Developer attention span is the bottleneck: your data scientists can’t/won’t do the plumbing to make these systems talk to each other. Every system either (a) claims to do everything or (b) claims nobody else can do “their” thing. We need to stop the madness and do some good science. We need a middleware.
  11. Advantages/inconveniences of sheath fluid: alignment of particles with the laser; sheath fluid replacement; loading samples into the instrument. Advantages/inconveniences of sheathless operation.
  12. And that’s just using a parallel database. If we instead generate parallel programs and compile them the way the HPC folks do, we can beat up on Spark/Shark, basically due to aggregating messages and removing serialization overhead.
  13. NOTES: Optimizations enable? with better semantics on a hash table join with UDFs, can do redundant computation elimination, code motion from UDF
  14. Can you just run this in a database and expect good performance? Of course not. But is it a fundamentally bad idea to run it this way? Maybe not.
  15. This is the complexity of three matrix multiply algorithms plotted against the sparsi – a naïve sparse