Myria: Analytics-as-a-Service for (Data) Scientists

Talk delivered at High Performance Transaction Processing 2013

Myria is a new Big Data service being developed at the University of Washington. We feature high level language interfaces, a hybrid graph-relational data model, database-style algebraic optimization, a comprehensive REST API, an iterative programming model suitable for machine learning and graph analytics applications, and a tight connection to new theories of parallel computation.

In this talk, we describe the motivation for another big data platform emphasizing requirements emerging from the physical, life, and social sciences.

  • So, in part motivated by this, there’s a group of great database researchers who work deeply with scientists: Dave Maier, my advisor. Jignesh, who left. Natassa, who left. Yannis, Alex, others. And we recently attracted some new blood to the science data arena. But this community of science databases has something in common with the HPTS community: Jim was a luminary of HPTS, and no less so a luminary of science databases. The Sloan Digital Sk
  • To understand the problem, it’s useful to consider past successes. The Sloan Digital Sky Survey used a relational database with a carefully engineered schema, and then served the database online using a carefully engineered infrastructure. This approach requires a lot of people, expertise, money, and time – things that small and medium-sized projects don’t typically have. So the question we explore is: How can we support 1000 little “SDSSs” for small- and medium-sized projects? --- We started thinking about a new tool. The SDSS schema was designed in part by a Turing Award-winning database expert. We can't afford to build a database + applications from scratch for every project, and nobody wants to maintain such a system anyway. Most importantly, the data comes from all over the place instead of a single source like SDSS; we can't pretend the data will arrive clean and coherent.
  • …where I had to disguise myself as an oceanographer in order to do data science work. This is me on a research cruise in 2007.
  • But since joining the eScience Institute, I can mingle freely with the scientists in their natural habitat, and I sometimes get invited to their events.
  • In every discipline, you can play Where’s Waldo in these group photos and find me.
  • The problem is not only scale, and not even usually scale – it’s what Stratos called DB exploration. Grubbing around in messy data with unknown quality, properties, etc. And, working
  • But the fundamental error made by computer scientists, and it’s probably the fault of the database community, is to assume that strong semantic integration is a prerequisite for query and analytics. It isn’t. It’s the final goal, not some insignificant preamble to analysis. Domain scientists know this – they take a very pragmatic approach. They write code to do data handling, they write code to do analytics, and they do data integration on the fly in a task-specific way. So one of my goals is to convince you that you can decouple declarative query from semantic integration, and that doing so gives scientists a very powerful tool.
  • So we developed SQLShare to support a very simple workflow: you can upload data “as is” from spreadsheets or anything. It’s in the cloud, so no need to install or design a database. You can immediately begin writing queries, right in your browser, and put queries on top of queries on top of queries. Then you can share the results online: your colleagues can browse the science questions and see the SQL that answers them. ---- Key ideas to get data in: a) Use the cloud to avoid having to install and run a database. b) Give up on the schema: just throw your data in “as is” and do “lazy integration.” c) Use some magic to automate parsing, integration, recommendations, and more. Key ideas to get data out: a) Associate science questions (in English) with each SQL query, which makes them easy to understand and easy to find. b) Saving and reusing queries is a first-class requirement. Given an example, it's easy to modify it into an “adjacent” query. c) Expose the whole system through a REST API to make it easy to bring new client applications online.
  • Multiple input languages, multiple output languages, all RA based. Database on every node for local processing. Everything in memory, but can push down into database. Push-based processing with back pressure to keep queues filled (a bit of streaming influence). Column-oriented tuple-batches between workers. Row-oriented on disk, typically, but depends on the database. Support
  • Four points to make: 0) This is the time for *join only*, not the overall iteration time. 1) First iteration is slower, as the cache is filled. 2) Each iteration is about 23X faster by joining against cached results. 3) The gaps are failures, which are a reality at this scale. Recovery proceeded as usual. HaLoop showed similar results, but did not evaluate complete datalog queries.
  • Two points to make: 1) 20% speedup on the overall iteration time from this specialization. This optimization violates MapReduce semantics, but is safe given our target language of datalog. 2) The outliers represent failures, which are a reality of dealing with large-scale data, and are a key reason why HaLoop is popular.
  • The accumulated result is not loop-invariant, but it changes relatively slowly, and is needed on every iteration to check for duplicates. Extend the cache to support append, and we can use it for Dupe-Elim as well.
  • The diff cache works. Maybe ignore the failure comment; it is there just in case a question arises about why failures appear to be more common without the cache. The answer: we’re not sure, but we know more data is being transferred over the network without the cache.
  • But if we can’t express important analysis tasks, they’ll export their data and use some parallel cloudy R monstrosity
  • Myria: Analytics-as-a-Service for (Data) Scientists

    1. Myria: Analytics-as-a-Service for (Data) Scientists. Bill Howe, University of Washington. 10/13/2013, Bill Howe, UW
    2. “It’s a great time to be a data geek.” -- Roger Barga, Microsoft Research. “The greatest minds of my generation are trying to figure out how to make people click on ads” -- Jeff Hammerbacher, co-founder, Cloudera
    3. How can we deliver 1000 little SDSSs to anyone who wants one?
    4. R/V Wecoma, April 2007
    5. Armbrust Lab Retreat, 2009 (Biology, Oceanography)
    6. Astronomy Visualization Workshop, 2011
    7. Big Data in the Long Tail Workshop, 2012 (Social Sciences)
    8. Maier’s 2nd Maxim: Working with scientists is like working with 7 year olds: They think they know everything and they don’t have any money
    9. My Goal: Expose all the world’s science data through declarative query interfaces
    10. Problem: How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%”
    11. Simple Example. [Spreadsheet: ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome; columns: ###query, length, COG hit #1, e-value #1, identity #1, score #1, hit length #1, description #1; one row per genome region, e.g. chr_4[480001-580000].287.] [Spreadsheet: COGAnnotation_coastal_sample.txt; columns: id, query, hit, e_value, identity_, score, query_start, query_end, hit_start, hit_end, hit_length; one row per read, e.g. FHJ7DRN01A0TND.1.] The two files are joined with: SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
    12. Maslow’s Needs Hierarchy: “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43
    13. A “Needs Hierarchy” of Science Data Management: “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43. Levels: analytics, query, curation, sharing, storage
    14. A “Needs Hierarchy” of Science Data Management: “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43. Levels: analytics, query, semantic integration, sharing, storage
    15. Why should you care? Science == Data Science
    16. Version 1: QUERY-AS-A-SERVICE, 2010 - present
    17. 1) Upload data “as is”: Cloud-hosted; no need to install or design a database; no pre-defined schema. 2) Write SQL: Right in your browser, writing queries on top of queries on top of queries ... 3) Share the results: Make them public, tag them, share with specific colleagues – anyone with access can query. Example: SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit ORDER BY cnt DESC
    18. Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations):
        SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
        UNION SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
        UNION SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
        EXCEPT
        SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
        INTERSECT SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
        INTERSECT SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
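The set logic of the slide 18 query (everything in the union of the three samples, minus everything in their intersection) can be sketched with Python sets. This is only an illustration: the function name `missing_from_some` and the sample ids are made up, and it assumes the SQL dialect evaluates INTERSECT before UNION and EXCEPT.

```python
def missing_from_some(*samples):
    """Return ids that occur in at least one sample but not in every sample."""
    sets = [set(s) for s in samples]
    if not sets:
        return set()
    # (A ∪ B ∪ C) minus (A ∩ B ∩ C)
    return set.union(*sets) - set.intersection(*sets)


# Hypothetical ids, for illustration only.
refseq = {"TIGR00001", "TIGR00002", "TIGR00003"}
est = {"TIGR00001", "TIGR00003"}
combo = {"TIGR00001", "TIGR00002", "TIGR00004"}

missing = missing_from_some(refseq, est, combo)
print(sorted(missing))
```

Only the id shared by all three samples is excluded from the result.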
    19. Non-programmers can write very complex queries (rather than relying on staff programmers). We see thousands of queries written by non-programmers. Example: Computing the overlaps of two sets of blast results
        SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
             , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
             , w.category as nc_category
             , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1
                    WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1
                    WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1
               END AS len_overlap
        FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
        INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w ON x.chr = w.chr
        WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
           OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
           OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
        ORDER BY x.strain, x.chr ASC, x.start_bp ASC
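The overlap length that the CASE expression on slide 19 computes can also be written with a single max/min formula. The sketch below is an alternative formulation, not the query author's method, and the function name `overlap_length` is hypothetical. It differs from the CASE expression in one situation: when the second interval lies strictly inside the first, the CASE expression's second branch fires and returns `x.end_bp - w.start_bp + 1`, whereas this version returns the length of the inner interval.

```python
def overlap_length(x_start, x_end, w_start, w_end):
    """Length in base pairs of the overlap of two closed intervals, 0 if disjoint."""
    start = max(x_start, w_start)  # overlap begins at the later start
    end = min(x_end, w_end)        # overlap ends at the earlier end
    return max(0, end - start + 1)


print(overlap_length(10, 20, 5, 30))    # x inside w
print(overlap_length(5, 15, 10, 30))    # w starts inside x
print(overlap_length(10, 100, 20, 30))  # w strictly inside x
print(overlap_length(1, 5, 10, 20))     # disjoint
```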
    20. Howe, et al., CISE 2012
    21. Steven Roberts. SQL as a lab notebook: http://bit.ly/16Xj2JP. Popular service for Bioinformatics Workflows. [Workflow diagram, shown twice: inputs are GFF of methylated CG locations, GFF of all genes, GFF of all CG locations, and gene descriptions; steps include Join, Reorder columns, Count, Trim, Excel, Calculate # methylated CGs, Calculate # all CGs, Calculate methylation ratio, and Link methylation with gene description; one step is labeled “Compute misstep: join w/ wrong fill”.]
    22. Halperin, Howe, et al. SSDBM 2013
    23. Andrew White, UW Chemistry: “An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.” -- Andrew D White. Decoding nonspecific interactions from nature. A. White, A. Nowinski, W. Huang, A. Keefe, F. Sun, S. Jiang. (2012) Chemical Science. Accepted
    24. Scientific data management reduces to sharing views (SSDBM 2011)
        • Integrate data from multiple sources? – joins and unions with views
        • Standardize on units, apply naming conventions? – rename columns, apply functions with views
        • Attach metadata? – add new tables with descriptive names, add new columns with views
        • Data cleaning, quality control? – hide bad values with views
        • Maintain provenance? – inspect view dependencies
        • Propagate updates? – view maintenance
        • Protect sensitive data? – expose subsets with views (assuming views carry permissions)
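A few of the items on slide 24 can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch, not SQLShare itself: the table `raw_casts`, its columns, the `-999.0` sentinel for bad values, and the view names are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data uploaded "as is": generic column names and a sentinel for bad readings.
conn.execute("CREATE TABLE raw_casts (c0 TEXT, c1 REAL, c2 REAL)")
conn.executemany(
    "INSERT INTO raw_casts VALUES (?, ?, ?)",
    [("s1", 10.0, 12.5), ("s2", 20.0, -999.0), ("s3", 30.0, 11.0)],
)

# Naming conventions and quality control: rename columns, hide bad values.
conn.execute(
    """
    CREATE VIEW clean_casts AS
    SELECT c0 AS station, c1 AS depth_m, c2 AS temp_c
    FROM raw_casts
    WHERE c2 <> -999.0
    """
)

# Unit standardization as a view on top of a view.
conn.execute(
    """
    CREATE VIEW clean_casts_f AS
    SELECT station, depth_m, temp_c * 9.0 / 5.0 + 32.0 AS temp_f
    FROM clean_casts
    """
)

rows = conn.execute(
    "SELECT station, temp_f FROM clean_casts_f ORDER BY station"
).fetchall()
print(rows)
```

The raw table is never modified; each cleaning step is a named, shareable view, which is also what makes provenance inspectable through view dependencies.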
    25. Two Problems with SQLShare • No help for really big datasets • No iteration
    26. Myria is… • A compiler framework for multiple iterative RA-based languages • A parallel, shared-nothing, iterative execution engine • A RESTful Query-as-a-Service platform • A prefix meaning “ten thousand” in Greek
    27. Myria Team: Dan Suciu, Magda Balazinska, Bill Howe; Dan Halperin (postdoc, technical lead), Victor Almeida (postdoc), Andrew Whitaker (research scientist). Students: Paris Koutris, Emad Soroush, Jingjing Wang, ShengLiang Xu, Jennifer Ortiz, Jeremy Hyrkas, Shumo Chu
    28. Myria Architecture. [Diagram components: Web UI and Language Parser (Google App Engine); Myria Compiler with Logical Optimizer for RA+While; MyriaL; C Compiler and Grappa; json query plan; MyriaDB: REST Server, Coordinator with Catalog, netty protocols, Workers each with a Catalog and a jdbc connection to an RDBMS, and HDFS.]
    29. A(y) :- R(‘a’, y)
        A(y) :- A(x), R(x,y)
    30. A = LOAD('points.txt', id:int, x:float, y:float)
        E = LIMIT(A, 4);
        F = SEQUENCE();
        Centroids = [FROM E EMIT (id=F.next, x=E.x, y=E.y)];
        Kmeans = [FROM A EMIT (id=id, x=x, y=y, cluster_id=0)]
        DO
          I = CROSS(Kmeans, Centroids);
          J = [FROM I EMIT (Kmeans.id, Kmeans.x, Kmeans.y, Centroids.cluster_id, $distance(Kmeans.x, Kmeans.y, Centroids.x, Centroids.y))];
          K = [FROM J EMIT id, distance=$min(distance)];
          L = JOIN(J, id, K, id)
          M = [FROM L WHERE J.distance <= K.distance EMIT (id=J.id, x=J.x, y=J.y, cluster_id=J.cluster_id)];
          Kmeans' = [FROM M EMIT (id, x, y, $min(cluster_id))];
          Delta = DIFF(Kmeans', Kmeans)
          Kmeans = Kmeans'
          Centroids = [FROM Kmeans' EMIT (cluster_id, x=avg(x), y=avg(y))];
        WHILE DELTA != {}
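For readers less familiar with MyriaL, the loop on slide 30 follows the usual k-means pattern: assign every point to its nearest centroid, recompute centroids as averages, and stop when no assignment changes. The following is a plain-Python sketch of that pattern, not a translation of the MyriaL operators. The function name `kmeans`, the `max_iters` guard, and the sample points are assumptions of the example; ties are broken toward the smallest cluster id, mirroring the `$min(cluster_id)` step.

```python
import math


def kmeans(points, k, max_iters=100):
    """points: list of (id, x, y). Returns ({id: cluster_id}, {cluster_id: (x, y)})."""
    # Seed the centroids with the first k points (the slide uses LIMIT(A, 4)).
    centroids = {cid: (x, y) for cid, (_, x, y) in enumerate(points[:k])}
    assignment = None
    for _ in range(max_iters):
        new_assignment = {}
        for pid, x, y in points:
            # Nearest centroid; min over (distance, id) breaks ties by smallest id.
            _, best = min(
                (math.hypot(x - cx, y - cy), cid)
                for cid, (cx, cy) in centroids.items()
            )
            new_assignment[pid] = best
        if new_assignment == assignment:  # the "Delta" is empty: converged
            break
        assignment = new_assignment
        # Recompute each non-empty cluster's centroid as the mean of its points.
        sums = {}
        for pid, x, y in points:
            cid = assignment[pid]
            sx, sy, n = sums.get(cid, (0.0, 0.0, 0))
            sums[cid] = (sx + x, sy + y, n + 1)
        for cid, (sx, sy, n) in sums.items():
            centroids[cid] = (sx / n, sy / n)
    return assignment, centroids


pts = [(0, 0.0, 0.0), (1, 0.0, 1.0), (2, 10.0, 10.0), (3, 10.0, 11.0)]
assignment, centroids = kmeans(pts, k=2)
print(assignment, centroids)
```

The `if new_assignment == assignment` check plays the role of `Delta = DIFF(Kmeans', Kmeans)` followed by the `WHILE` test: iteration is driven by whether the result relation changed.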
    31. Why Iteration Matters. Datalog: reachability in a graph with 1.4B unique edges: almost 200 iterations
    32. Why Iteration Matters. Datalog: reachability in a graph with 1.4B unique edges: almost 200 iterations. Vast majority of reachable tuples discovered by iteration 25
    33. Why Iteration Matters. Datalog: reachability in a graph with 1.4B unique edges: almost 200 iterations. Vast majority of reachable tuples discovered by iteration 25. The datalog program continues for almost 200 iterations, each almost as expensive as the early steps
    34. Fewer Iterations: Endgame Problem [Afrati 10]. [Plot: # of tuples discovered (log scale, 1 to 100,000,000) vs. iteration # (0 to 180); series: frontier tuples; previously discovered tuples removed.]
    35. Reachability from ‘a’ in datalog: Basic Semi-Naïve Evaluation
        A(y) :- R(‘a’, y)
        A(y) :- A(x), R(x,y)
        [Diagram: each iteration is a Join followed by Dupe-elim.]
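The Join and Dupe-elim loop on slide 35 can be sketched in a few lines of single-machine Python. This is an illustration of semi-naive evaluation in general, under the assumption that the whole edge relation fits in memory; the name `reachable_from` and the toy edge list are made up. Each round joins only the newly discovered tuples (the delta) with R, then removes anything already known.

```python
from collections import defaultdict


def reachable_from(edges, source):
    """Semi-naive evaluation of A(y) :- R(source, y); A(y) :- A(x), R(x, y)."""
    neighbors = defaultdict(set)
    for x, y in edges:
        neighbors[x].add(y)

    reached = set(neighbors.get(source, ()))  # base rule
    delta = set(reached)                      # tuples discovered in the last round
    frontier_sizes = [len(delta)]
    while delta:
        # Join: follow one edge from every newly discovered vertex.
        joined = {y for x in delta for y in neighbors.get(x, ())}
        # Dupe-elim: keep only vertices not seen in any earlier round.
        delta = joined - reached
        reached |= delta
        frontier_sizes.append(len(delta))
    return reached, frontier_sizes


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"), ("x", "y")]
reached, frontier_sizes = reachable_from(edges, "a")
print(sorted(reached), frontier_sizes)
```

The `frontier_sizes` list is the quantity plotted on the endgame slide: on a long-tailed graph it shrinks to a trickle while the loop keeps paying the per-iteration cost.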
    36. MAYBE JUST USE HADOOP?
    37. Bu, Howe, Balazinska, Ernst (VLDB 2010, VLDBJ 2011; VLDB10, VLDBJ12, Datalog12). [Diagram: Join and Difference as map/reduce stages over ΔA_{i-1}, the partitions R(0), R(1) and the partitions A_i(0), A_i(1).] (a) R is loop invariant, but gets loaded and shuffled on each iteration. (b) A_i grows slowly and monotonically, but is loaded and shuffled on each iteration. HaLoop’s Reducer Input Cache addressed (a), but did not support the append semantics needed for (b).
    38. Inter-loop caching (VLDB 2010, VLDBJ 2011). Iteration i = 0: Load a distributed cache. Iteration i > 0: [Diagram: Join and Difference over ΔA_{i-1}, with R(0), R(1) and A(0), A(1) read from the cache.] Bu, Howe, Balazinska, Ernst VLDB10, VLDBJ12, Datalog12
    39. Caching Loop-Invariant Data. [Plot: time (s), 0 to 1200, vs. iteration #, 0 to 40; series: no cache, cache; one point marked as a failure.] First iteration is slow, as the invariant graph is shuffled and cached. Annotated speedup: 23X
    40. Specialize Cache for Query Semantics. MapReduce semantics require that all keys from the cache be extracted and passed to reducers. But we only care about keys that join. [Diagram: Reducer for Join receives join keys arriving from mappers and all tuples from cache.]
    41. Second optimization: Specialization for Equijoin. Index the cache, and only extract keys that join. [Diagram: Reducer for Join receives join keys arriving from mappers; keys that join are found by indexed cache lookup.]
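The difference between handing every cached tuple to the reducer and probing an indexed cache can be sketched as follows. This is a single-process illustration with invented names (`build_indexed_cache`, `join_scan_all`, `join_indexed`), not HaLoop's or Hadoop's API; the `touched` counter simply counts how many cached tuples each strategy has to look at.

```python
from collections import defaultdict


def build_indexed_cache(r_tuples):
    """Index the loop-invariant relation R(x, y) by its join key x."""
    index = defaultdict(list)
    for x, y in r_tuples:
        index[x].append(y)
    return index


def join_scan_all(delta_keys, r_tuples):
    """MapReduce-style: every cached tuple is extracted and inspected."""
    out, touched = set(), 0
    for x, y in r_tuples:
        touched += 1
        if x in delta_keys:
            out.add(y)
    return out, touched


def join_indexed(delta_keys, index):
    """Equijoin specialization: only cached tuples whose key joins are read."""
    out, touched = set(), 0
    for x in delta_keys:
        for y in index.get(x, ()):
            touched += 1
            out.add(y)
    return out, touched


r = [(i, i + 1) for i in range(1000)]
delta = {5, 7}
scan_result, scan_touched = join_scan_all(delta, r)
idx_result, idx_touched = join_indexed(delta, build_indexed_cache(r))
print(scan_result == idx_result, scan_touched, idx_touched)
```

Both strategies produce the same join result; the indexed version does work proportional to the size of the delta rather than the size of the cache.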
    42. Effect of equijoin specialization. [Plot: total time for loop body (s), 0 to 160, vs. iteration #, 0 to 80; series: Equijoin semantics, MapReduce semantics; one point marked “Failure occurred”.] Annotated improvement: ~20%
    43. Third Optimization: Extend Cache to Support Duplicate Elimination. The accumulated result is not loop-invariant, but it changes relatively slowly, and is needed on every iteration to check for duplicates. Extend the cache to support append, and we can use it for Dupe-Elim as well. [Diagram: Reducer for Dupe-elim receives tuples arriving from mappers and emits unique keys via indexed cache lookup, with new tuples inserted.]
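An append-capable cache for duplicate elimination can be sketched as a set that both answers lookups and absorbs new tuples in the same pass. The class name `AppendCache` and its method are hypothetical; a real system would keep one such cache per reducer partition.

```python
class AppendCache:
    """Accumulated result kept across iterations; supports lookup and append."""

    def __init__(self):
        self._seen = set()

    def filter_new(self, tuples):
        """Return tuples not seen in any earlier call, and remember them."""
        new = []
        for t in tuples:
            if t not in self._seen:  # indexed lookup against the cache
                self._seen.add(t)    # append: the cache grows monotonically
                new.append(t)
        return new


cache = AppendCache()
first = cache.filter_new(["b", "c", "b"])
second = cache.filter_new(["c", "d"])
print(first, second)
```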
    44. Effect of Diff Cache. [Plot: total time for loop body (s), 0 to 100, vs. iteration #, 0 to 50; series: no diff cache, with diff cache.] Failures may be more likely due to extra network traffic. ~20% overall improvement
    45. Overall. [Plot: time (s), 0 to 35000, vs. iteration #, 0 to 250; series: (a) no optimizations, (b) HaLoop, (c) all optimizations, (d) raw Hadoop overhead.]
    46. Fewer Iterations: Loop unrolling. Run two joins for every dupe-elim
    47. Half the iterations, but each is more expensive. [Plot annotation: change strategies]
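Loop unrolling for reachability can be sketched by letting each round perform several joins before a single duplicate-elimination step. The function below is an illustrative single-machine version with made-up names (`reachable`, `joins_per_round`); it returns the number of rounds so the trade-off on slide 47 is visible.

```python
from collections import defaultdict


def reachable(edges, source, joins_per_round=1):
    """Reachability with `joins_per_round` joins per duplicate elimination."""
    neighbors = defaultdict(set)
    for x, y in edges:
        neighbors[x].add(y)

    reached = set(neighbors.get(source, ()))
    delta = set(reached)
    rounds = 0
    while delta:
        candidates = set()
        frontier = delta
        for _ in range(joins_per_round):
            # Each join extends the frontier by one more hop.
            frontier = {y for x in frontier for y in neighbors.get(x, ())}
            candidates |= frontier
        delta = candidates - reached  # one dupe-elim per round
        reached |= delta
        rounds += 1
    return reached, rounds


chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
plain, plain_rounds = reachable(chain, "a", joins_per_round=1)
unrolled, unrolled_rounds = reachable(chain, "a", joins_per_round=2)
print(plain == unrolled, plain_rounds, unrolled_rounds)
```

On this chain the unrolled version needs fewer rounds, but every round does two joins, which is the "each is more expensive" half of the trade-off.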
    48. reachable(Y) :- edge(5,Y)
        reachable(Y) :- edge(X,Y), reachable(X)
        [Plot: # of Newly Discovered Facts (log scale, 1 to 10,000,000) vs. Iteration (1 to 23); series: Greenplum, Myria; annotation: not much useful work.]
    49. [Plot: Total Time (second), 0 to 700, vs. Iteration (1 to 23); series: Greenplum; Myria; Greenplum, incremental; Greenplum, incremental+index; annotation: Low per-iteration cost.]
    50. Summary • Goal: Expose all the world’s science data through declarative query interfaces! • Motivated by real science • Data and query model is iterative relational algebra • Industrial-strength Query-as-a-Service. http://db.cs.washington.edu/myria/ http://myria-web.appspot.com/
    52. [Diagram components: Datalog Parser, Logical Optimizer, Myria Compiler, C Compiler, Grappa, Google App Engine.] • Hypothesis: The performance difference between hand-coded graph algorithms and relational query plans amounts to implementation details • Can we generate “hand-coded” plans?
    53. Path-Counting Queries. Ex: Count the number of unique 2-hops
    54. Assume a collection edges
        answers = set()
        for all (x, y1) in edges:
          for all (y2, z) in edges:
            if y1 == y2:
              answers.insert((x,z))
        count = answers.size()
        In an RDBMS: “Nested Loops Join”
    55. Assume a collection edges, but also an index neighbors: vertex -> [vertex]
        answers = set()
        for all (x, y) in edges:
          for all z in neighbors[y]:
            answers.insert((x,z))
        count = answers.size()
        In an RDBMS: “Hash Join”
    56. Just drop the edges collection entirely, leaving only the index neighbors: vertex -> [vertex]
        answers = set()
        for all x in neighbors:
          for all y in neighbors[x]:
            for all z in neighbors[y]:
              answers.insert((x,z))
        count = answers.size()
        In an RDBMS: Still a Hash Join
    57. Just drop the edges collection entirely, leaving only the index neighbors: vertex -> [vertex]
        count = 0
        answers = set()
        for all x in neighbors:
          for all y in neighbors[x]:
            for all z in neighbors[y]:
              answers.insert(z)    (only one value, so the set stays small)
          count += answers.size()
          answers.clear()
        RDBMS don’t express this, but there’s no reason they couldn’t
    58. Or if you prefer…assume a collection of vertices, where each vertex points directly to its neighbors
        count = 0
        answers = set()
        for all x in vertices:
          for all y in x.neighbors():
            for all z in y.neighbors():
              answers.insert(z)    (only one value, so stays small)
          count += answers.size()
          answers.clear()
        Boils down to dereferencing a pointer vs. probing a hash table
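The pseudocode variants on slides 54 to 57 can be run side by side to check that they count the same set of distinct 2-hop pairs. This is a direct Python rendering under the assumption of a small in-memory edge list; the function names and the toy graph are made up for the example.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Index: vertex -> list of out-neighbors."""
    neighbors = defaultdict(list)
    for x, y in edges:
        neighbors[x].append(y)
    return dict(neighbors)


def two_hops_nested_loops(edges):
    """Compare every pair of edges (nested loops join)."""
    answers = set()
    for x, y1 in edges:
        for y2, z in edges:
            if y1 == y2:
                answers.add((x, z))
    return len(answers)


def two_hops_hash_join(edges, neighbors):
    """Probe the index instead of rescanning the edges (hash join)."""
    answers = set()
    for x, y in edges:
        for z in neighbors.get(y, ()):
            answers.add((x, z))
    return len(answers)


def two_hops_per_source(neighbors):
    """Index only; dedupe per source vertex so the working set stays small."""
    count = 0
    answers = set()
    for x in neighbors:
        for y in neighbors[x]:
            for z in neighbors.get(y, ()):
                answers.add(z)
        count += len(answers)
        answers.clear()
    return count


edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 1), (2, 3)]
nbrs = build_neighbors(edges)
counts = (
    two_hops_nested_loops(edges),
    two_hops_hash_join(edges, nbrs),
    two_hops_per_source(nbrs),
)
print(counts)
```

The per-source variant holds at most one vertex's 2-hop neighborhood in memory at a time, instead of every (x, z) pair in the graph.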
    59. Experiments • Data sets:
        Dataset         | # Vertices | # Edges    | # Distinct 2-hop Paths | # Triangles
        BSN*            | 685,230    | 7,600,595  | 78,350,597             | 6,935,709
        Twitter 4MEⱡ    | 166,317    | 4,532,185  | 1,056,317,985          | 14,912,950
        comlivejournal* | 3,997,962  | 34,681,189 |                        |
        soclivejournal* | 4,874,571  | 68,993,773 |                        |
        Two further values, 735,398,579 and 112,319,229, appear with the LiveJournal rows; their columns are not recoverable from the extracted text.
        ⱡ Kwak, H. et al. 2010. * http://snap.stanford.edu/
    60. Experiments. [Plots for the BSN data set and the Twitter 4ME data set; labels: no dupe elim, dupe elim, single-threaded.]
    61. Experiments
    62. Experiments • Parallel system performance