Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
1. The Other HPC: High Productivity Computing in Polystore Environments
Bill Howe, Ph.D.
Associate Director, eScience Institute
Senior Data Science Fellow, eScience Institute
Affiliate Associate Professor, Computer Science & Engineering
11/23/2015 Bill Howe, UW 1
3. What is the rate-limiting step in data understanding?
[Chart: processing power (Moore's Law) and the amount of data in the world both grow exponentially over time, while human cognitive capacity stays flat.]
Idea adapted from "Less is More" by Bill Buxton (2001)
slide src: Cecilia Aragon, UW HCDE
4. How much time do you spend "handling data" as opposed to "doing science"?
Mode answer: "90%"
5. "[This was hard] due to the large amount of data (e.g. data indexes for data retrieval, dissection into data blocks and processing steps, order in which steps are performed to match memory/time requirements, file formats required by software used).
In addition we actually spend quite some time in iterations fixing problems with certain features (e.g. capping ENCODE data), testing features and feature products to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs human-derived variants).
So roughly 50% of the project was testing and improving the model, 30% figuring out how to do things (engineering) and 20% getting files and getting them into the right format.
I guess in total [I spent] 6 months [on this project]."
- Martin Kircher, Genome Sciences
At least 3 months went to issues of scale, file handling, and feature extraction.
Why does this matter? 3k NSF postdocs in 2010, $50k / postdoc, at least 50% overhead: maybe $75M annually at NSF alone?
Where does the time go?
6. Productivity: how long I have to wait for results
[Chart: time-to-result from milliseconds through seconds, minutes, hours, days, weeks, to months, locating HPC, Systems, and Databases relative to a feasibility threshold and an interactivity threshold.]
These two performance thresholds are really important; other requirements are situation-specific.
7. [Diagram: data models (Table, Graph, Array, Matrix, Key-Value, Dataframe) and the systems built around them (RDBMS, HIVE, Spark, GraphX, Neo4J, Dato, GEMS, MATLAB, R, Pandas, Ibis, Accumulo, SciDB, HDF5, Myria), all connected through a shared Polystore Algebra.]
8. Desiderata for a Polystore Algebra
• Captures user intent
• Affords reasoning and optimization
• Accommodates best-known algorithms
9.
10.
Why do we care? Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra
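As a concrete illustration of this kind of rule-driven rewriting (a toy sketch, not the actual optimizer), expressions can be modeled as nested tuples and simplified bottom-up until no rule fires:

```python
# A minimal sketch of rule-based algebraic rewriting, the same idea a
# relational optimizer applies to query plans. Expressions are nested
# tuples ('+', a, b), ('*', a, b), ('/', a, b), or a leaf (int or name).

def simplify(e):
    """Apply identity, distributivity, and folding rules bottom-up."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e[0], simplify(e[1]), simplify(e[2])
    # (+) identity: x + 0 = x
    if op == '+' and b == 0: return a
    if op == '+' and a == 0: return b
    # (/) identity: x / 1 = x
    if op == '/' and b == 1: return a
    # (*) distributes (with commutativity): n*x + n*y = n*(x+y)
    if op == '+' and isinstance(a, tuple) and isinstance(b, tuple) \
            and a[0] == '*' and b[0] == '*' and a[1] == b[1]:
        return simplify(('*', a[1], ('+', a[2], b[2])))
    # constant folding (toy: integer division is fine here)
    if isinstance(a, int) and isinstance(b, int):
        return {'+': a + b, '*': a * b, '/': a // b}[op]
    return (op, a, b)

# N = ((z*2) + ((z*3) + 0)) / 1
expr = ('/', ('+', ('*', 'z', 2), ('+', ('*', 'z', 3), 0)), 1)
print(simplify(expr))  # ('*', 'z', 5): a two-op plan instead of five
```

The same pattern, with selections, projections, and joins as node types, is how relational rewrite rules are applied to query trees.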
12. [Architecture diagram: MyriaL programs, plus Graph and Array algebra APIs, compile into a shared Polystore Algebra in the middleware layer; rewrite rules map it onto per-engine parallel algebras exposed through the MyriaX, Radish, SciDB, and GEMS APIs.]
Services: visualization, logging, discovery, history, browsing
Orchestration
13. How does this actually work?
(1) Client submits a program in one of several Big Data languages… (or programs directly against the API…)
(2) Program is parsed as an expression tree…
14. How does this actually work?
(3) Expression tree is optimized into a parallel, federated execution plan involving one or more Big Data platforms.
(4) Depending on the back end, the parallel plan may be directly compiled into executable code.
15. How does this actually work?
(5) Orchestrates the parallel, federated plan execution across the platforms.
[Diagram: Client → MyriaQ → Sys1, Sys2]
16. How does this actually work?
(6) Exposes query execution logs and results through a REST API and a visual web-based interface.
17. What can you do with a Polystore Algebra?
1) Facilitate Experiments
– Provide reference implementations
– Apply shared optimizations for apples-to-apples comparisons
– K-means, Markov chain, Naïve Bayes, TPC-H, Betweenness Centrality, Sigma-clipping, Linear Algebra
– LANL is using this idea to express algorithms that solve the governing equations for heat transfer models!
18. What can you do with a Polystore Algebra?
2) Rapidly develop new applications
– Microbial Oceanography
– Neuroanatomy
– Music Analytics
– Video Analytics
– Clinical Analytics
– Astronomical Image de-noising
25. Sample variance by annotation across all experiments

select a.annotation,
       var_samp(d.density) as var
from density d
join annotation a
  on d.x = a.x
 and d.y = a.y
 and d.z = a.z
group by a.annotation
order by var desc
limit 10
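The same join-then-aggregate can be sketched in plain Python for readers without a deployed engine. The voxel coordinates, labels, and densities below are hypothetical stand-ins for the `density` and `annotation` tables:

```python
from statistics import variance
from collections import defaultdict

# Hypothetical stand-ins for the tables in the query above:
# density rows are (x, y, z, density); annotation maps (x, y, z) -> label.
density = [(0, 0, 0, 1.0), (0, 0, 0, 3.0), (1, 0, 0, 2.0), (1, 0, 0, 6.0)]
annotation = {(0, 0, 0): 'cortex', (1, 0, 0): 'thalamus'}

# join on voxel coordinates, grouping densities by annotation label
groups = defaultdict(list)
for x, y, z, d in density:
    label = annotation.get((x, y, z))
    if label is not None:
        groups[label].append(d)

# sample variance per annotation (var_samp ~ ddof=1), top 10 by variance
top10 = sorted(((label, variance(vals)) for label, vals in groups.items()),
               key=lambda kv: kv[1], reverse=True)[:10]
print(top10)  # [('thalamus', 8.0), ('cortex', 2.0)]
```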
26. Are two regions connected?

adjacent(r1, r2) :-
    annotation(experiment, x1, y1, z1, r1),
    annotation(experiment, x2, y2, z2, r2),
    x2 = x1+1 or y2 = y1+1 or z2 = z1+1

connected(r1, r2) :- adjacent(r1, r2)
connected(r1, r3) :- connected(r1, r2), adjacent(r2, r3)
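The recursive rules above compute a transitive closure by fixpoint iteration. A minimal Python sketch of that fixpoint, over toy adjacency facts with hypothetical region names:

```python
# Toy adjacency facts among hypothetical regions. connected is the least
# fixpoint of: connected ⊇ adjacent, connected ⊇ connected ∘ adjacent.
adjacent = {('r1', 'r2'), ('r2', 'r3'), ('r4', 'r5')}

connected = set(adjacent)
while True:
    # one round of the recursive rule: extend each connected pair by one hop
    new = {(a, c) for (a, b) in connected for (b2, c) in adjacent if b == b2}
    if new <= connected:
        break  # fixpoint reached: no new facts derived
    connected |= new

print(('r1', 'r3') in connected)  # True: r1 reaches r3 via r2
print(('r1', 'r4') in connected)  # False: different component
```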
27. Music Analytics
Computing song density over the Million-Song Dataset

segments = scan(Jeremy:MSD:SegmentsTable);
songs = scan(Jeremy:MSD:SongsTable);
seg_count = select song_id, count(segment_number) as c
            from segments;
density = select songs.song_id,
                 (seg_count.c / songs.duration) as density
          from songs, seg_count
          where songs.song_id = seg_count.song_id;
store(density, public:adhoc:song_density);

Blog post on how to run it in 20 minutes on Hadoop:
http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/
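The MyriaL program above is just a per-song count joined back to durations. A plain-Python sketch on made-up data (the song IDs and durations below are hypothetical, not MSD values):

```python
from collections import Counter

# Hypothetical rows mirroring the two tables scanned above.
segments = [('SOAAA', 0), ('SOAAA', 1), ('SOAAA', 2),
            ('SOBBB', 0)]                      # (song_id, segment_number)
songs = {'SOAAA': 200.0, 'SOBBB': 100.0}       # song_id -> duration (seconds)

# seg_count: segments per song; density: segments per second of audio
seg_count = Counter(song_id for song_id, _ in segments)
density = {song_id: seg_count[song_id] / dur for song_id, dur in songs.items()}
print(density)  # {'SOAAA': 0.015, 'SOBBB': 0.01}
```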
28. Naïve Bayes Classification: Million Song Dataset
Predict song year in a 515,345-song dataset using eight timbre features, discretized into intervals of size 10

-- calculate probability of outcomes
Poe = select input_sp.id as inputId,
             sum(CondP.lp) as lprob,
             CondP.outcome as outcome
      from CondP, input_sp
      where CondP.index = input_sp.index
        and CondP.value = input_sp.value;

-- select the max probability outcome
classes = select inputId,
                 ArgMax(outcome, lprob)
      from Poe;
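The two queries implement the classic Naïve Bayes decision rule: join the input's sparse features against a table of log conditional probabilities, sum per outcome, and take the argmax. A hedged pure-Python sketch; the feature indices, discretized values, and log-probabilities below are illustrative, not trained values:

```python
from collections import defaultdict

# CondP: (feature_index, discretized_value, outcome) -> log-probability.
# Illustrative numbers only, not values trained on the MSD.
cond_lp = {(0, 10, 1990): -0.5, (1, 20, 1990): -0.75,
           (0, 10, 2000): -1.0, (1, 20, 2000): -0.5}

# input_sp: one input song as sparse (feature_index, value) pairs.
input_sp = [(0, 10), (1, 20)]
input_set = set(input_sp)

# Poe: total log-probability per candidate outcome (the join + sum query).
lprob = defaultdict(float)
for (index, value, outcome), lp in cond_lp.items():
    if (index, value) in input_set:
        lprob[outcome] += lp

# classes: the ArgMax over outcomes (the second query).
prediction = max(lprob, key=lprob.get)
print(prediction, lprob[prediction])  # 1990 -1.25
```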
29. [Charts: average heart rate (beats/minute) and average relative heart rate variance vs. time (hours); one trace shows lower heart rate variance, raising the question: bad data?]
30. MIMIC Information Flow
[Diagram: Client (headless Octave + web interface) → Myria middleware (REST interface, optimization, orchestration) → MyriaX for structured data and SciDB for waveform data.]
34. What can you do with a Polystore Algebra?
3) Reason about algorithms
• Apply application-specific optimizations (in addition to automatic optimizations)
35. Sigma-clipping, V0

CurGood = SCAN(public:adhoc:sc_points);
DO
    mean = [FROM CurGood EMIT val=AVG(v)];
    std = [FROM CurGood EMIT val=STDEV(v)];
    NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
    CurGood = CurGood - NewBad;
    continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);
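The V0 algorithm in ordinary Python, for reference: recompute mean and standard deviation over the surviving points on every pass, drop outliers, repeat until stable. The sample data is made up; the 2-sigma cutoff matches the query above:

```python
from statistics import mean, stdev

def sigma_clip(points, k=2.0):
    """Iteratively drop points more than k standard deviations from the
    mean, recomputing mean/std over survivors each pass (the V0 scheme)."""
    good = list(points)
    while len(good) > 1:
        m, s = mean(good), stdev(good)
        kept = [v for v in good if abs(v - m) <= k * s]
        if len(kept) == len(good):
            break  # nothing removed this pass: converged
        good = kept
    return good

data = [9.8, 10.1, 10.0, 9.9, 10.2, 50.0]  # one obvious outlier
print(sigma_clip(data))  # [9.8, 10.1, 10.0, 9.9, 10.2]
```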
36. Sigma-clipping, V1: Incremental

CurGood = P
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)];
cnt = [FROM CurGood EMIT CNT(*)];
NewBad = []
DO
    sum = sum - [FROM NewBad EMIT SUM(val)];
    sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
    cnt = cnt - [FROM NewBad EMIT CNT(*)];
    mean = sum / cnt
    std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum))
    NewBad = FILTER([ABS(val-mean) > std], CurGood)
    CurGood = CurGood - NewBad
WHILE NewBad != {}
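A minimal Python rendering of the incremental trick: maintain sum, sum of squares, and count, and subtract only the removed points' contributions instead of rescanning the survivors. The data is made up, and the cutoff here is 2 standard deviations (matching V0) rather than the 1-sigma filter shown on the slide:

```python
import math

def sigma_clip_incremental(points, k=2.0):
    """Sigma-clipping that updates sum/sumsq/count incrementally
    instead of recomputing aggregates over survivors each pass."""
    good = set(range(len(points)))          # indices of surviving points
    s = sum(points[i] for i in good)        # running SUM(val)
    sq = sum(points[i] ** 2 for i in good)  # running SUM(val*val)
    n = len(good)                           # running CNT(*)
    while n > 1:
        mean = s / n
        # sample std from the running aggregates; max() guards against
        # tiny negative values from floating-point round-off
        std = math.sqrt(max(n * sq - s * s, 0.0) / (n * (n - 1)))
        bad = {i for i in good if abs(points[i] - mean) > k * std}
        if not bad:
            break
        # subtract the removed points' contributions: no rescan needed
        s -= sum(points[i] for i in bad)
        sq -= sum(points[i] ** 2 for i in bad)
        n -= len(bad)
        good -= bad
    return sorted(points[i] for i in good)

print(sigma_clip_incremental([9.8, 10.1, 10.0, 9.9, 10.2, 50.0]))
```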
43. Query compilation for distributed processing
[Diagram, two approaches to compiling a pipeline down to machine code:]
(a) compile the whole pipeline as parallel code with a parallel compiler [Myers ’14]
(b) split the pipeline into fragment code and compile each fragment with a sequential compiler [Crotty ’14, Li ’14, Seo ’14, Murray ’11]
44. 1% selection microbenchmark, 20GB
Avoid long code paths
45. Q2 SP2Bench, 100M triples, multiple self-joins
Communication optimization
46. Graph Patterns (RADISH, ICDE ’15)
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then compiled again by a low-level PGAS compiler
• One of Myria’s supported back ends
• Comparison with Shark/Spark, which itself has been shown to be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
48. Matrix multiply in RA

select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
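The query above is sparse matrix multiply as join-plus-aggregate: represent each matrix as (row, col, val) triples, join on the shared index, group by the output coordinates, and sum the products. A tiny Python sketch with made-up values:

```python
from collections import defaultdict

# Sparse matrices as relations of (row, col, val) triples, exactly the
# schema the SQL above joins and aggregates. Values are made up.
A = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]   # (i, j, val)
B = [(0, 0, 4.0), (1, 0, 1.0), (1, 1, 5.0)]   # (j, k, val)

# join on the shared index j, then group by (i, k) and sum the products
C = defaultdict(float)
for i, j, av in A:
    for j2, k, bv in B:
        if j == j2:
            C[(i, k)] += av * bv

print(dict(C))  # {(0, 0): 9.0, (0, 1): 5.0, (1, 0): 3.0, (1, 1): 15.0}
```

A real engine would of course hash-join on j rather than use this nested loop; the point is that the data model and operators are purely relational.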
49. Complexity of matrix multiply
[Chart: complexity exponent vs. sparsity exponent (r such that m = n^r), with n = number of rows and m = number of non-zeros. The naïve sparse algorithm costs mn; the best known dense algorithm costs n^2.38; the best known sparse algorithm costs m^0.7 n^1.2 + n^2. There is lots of room between them.]
slide adapted from R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication
51. Relative Speedup of SpBLAS vs. HyperDB
- speedup = T_HyperDB / T_SpBLAS
- benchmark datasets with r = 1.2, plus the real data cases (the three largest datasets: 1.17 < r < 1.20)
- on star (nTh = 12), on dragon (nTh = 60)
- As n increases, the relative speedup of SpBLAS over HyperDB shrinks.
- soc-Pokec: the speedup is only around 5x.
- On star, HyperDB got stuck thrashing on the soc-Pokec data.
52. 20k X 20k matrix multiply by sparsity
CombBLAS, MyriaX, Radish
53. 50k X 50k matrix multiply by sparsity
CombBLAS, MyriaX, Radish
Filter to upper left corner of result matrix
54. What can you do with a Polystore Algebra?
5) Provide new services over a Polystore Ecosystem
59. Scalable Graph Clustering (Seung-Hee Bae)
Version 1: Parallelize best-known serial algorithm [ICDM 2013]
Version 2: Free 30% improvement for any algorithm [TKDD 2014]
Version 3: Distributed approximate algorithm, 1.5B edges [SC 2015]
60. Viziometrics: Analysis of Visualization in the Scientific Literature (Poshen Lee)
[Chart: proportion of non-quantitative figures in a paper vs. paper impact, grouped into 5% percentiles.]
And processing power, either as raw processor speed or via novel multi-core and many-core architectures, is also continuing to increase exponentially…
… but human cognitive capacity is remaining constant. How can computing technologies help scientists make sense out of these vast and complex data sets?
Before we launch into the project itself, we want to give a little background on the problem we are trying to solve.
Essentially, we want to remove the speed-bump of data handling from the scientists.
Express these plans
Optimize these plans
Compile these plans
Execute these plans
So our approach is to model this overlap in capabilities as its own language.
We start
matrices and linear algebra are a terrible programming model, but there's just so goddamn much math that has been developed around them that they're here to stay.
the functional programming crowd has been poised to solve all the world’s ills for 60 years, but they tend to have trouble pulling their heads out of their own navels long enough to solve someone’s actual problem in practice
objects and methods are great for building software systems, but get in the way for data analysis
files and scripts aren’t really data analysis – they are low-level operating system concepts
data frames are just relations
key-value pairs -- I’ll talk more about this in a bit
Scale
“While the community was skeptical that this new method could possibly outperform hand-coding, it reduced the number of programming statements necessary to operate a machine by a factor of 20, and quickly gained acceptance. “
“Relational model was buggy and slow, but you only had to write 5% of the code you used to have to write”
We hoist
MyriaL is an imperative language we like; I’ll show you some examples of that.
The whole program is chained together as one big expression, perhaps with loops
The logical plan is translated into a possibly federated, typically parallel back-end specific physical plan.
Optimization rules are applied as appropriate.
We’ve gotten more mileage than we expected out of just a simple rule-based optimizer, for two reasons: we have tried to make it very easy to add new rules on the fly, and we have made some algorithmic developments.
For example, there’s been a lot of recent work on worst-case optimal join algorithms that scale with the size of the output rather than (only) the size of the input.
One of our students has developed a variant of these worst-case optimal, multi-way join algorithms that looks like it could subsume the need for a lot of fretting about join order, skew handling, broadcasting, merge vs. hash, etc.
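To make "worst-case optimal" concrete, here is a toy generic-join sketch for the triangle query Q(a,b,c) :- R(a,b), S(b,c), T(a,c). It binds one attribute at a time by intersecting the relations that constrain it, rather than picking a pairwise join order; the relations are hypothetical and this is not the student's algorithm, just the textbook idea:

```python
from collections import defaultdict

# Triangle query Q(a,b,c) :- R(a,b), S(b,c), T(a,c), on toy relations.
R = {(1, 2), (1, 3), (2, 3)}
S = {(2, 3), (3, 1)}
T = {(1, 3), (2, 4)}

# Index each relation by its first attribute for fast intersection.
R_by_a = defaultdict(set)
S_by_b = defaultdict(set)
T_by_a = defaultdict(set)
for a, b in R: R_by_a[a].add(b)
for b, c in S: S_by_b[b].add(c)
for a, c in T: T_by_a[a].add(c)

# Generic join: bind a, then b, then c, intersecting the candidate sets
# from every relation that mentions the attribute. Work is bounded by
# the sizes of these intersections, not by any pairwise join order.
triangles = []
for a in R_by_a.keys() & T_by_a.keys():
    for b in R_by_a[a] & S_by_b.keys():
        for c in S_by_b[b] & T_by_a[a]:
            triangles.append((a, b, c))

print(triangles)  # [(1, 2, 3)]
```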
Single interface to multiple big data systems
* No one size fits all – there WILL be multiple systems and multiple tasks in play in realistic scenarios
* Developer attention span is the bottleneck: Your data scientists can’t/won’t do the plumbing to make these systems talk to each other
* Every system either a) claims to do everything or b) claims nobody else can do “their” thing. We need to stop the madness and do some good science.
We need a middleware layer.
Advantages/disadvantages of sheath fluid: particle/laser alignment, sheath fluid replacement, loading samples into the instrument.
Advantages/disadvantages of sheathless operation.
And that’s just using a parallel database. If we instead generate parallel programs and compile them the way the HPC folks do, we can beat Spark/Shark, basically by aggregating messages and removing serialization overhead.
NOTES:
What do the optimizations enable?
With better semantics for a hash-table join with UDFs, we can do redundant computation elimination and code motion out of the UDF.
Can you just run this in a database and expect good performance? Of course not.
But is this a fundamentally bad idea to run it this way?
Maybe not.
This is the complexity of three matrix multiply algorithms plotted against the sparsity: a naïve sparse