Democratizing Data Science
in the Cloud
Bill Howe, Ph.D.
Associate Director and Senior Data Science Fellow, eScience Institute
Affiliate Associate Professor, Computer Science & Engineering
11/1/2016, Bill Howe, UW
Cloud Data Management is about sharing resources between tenants.
We're interested in new services powered by sharing more than infrastructure: schema, data, and queries.
Why? Example: JBOT* Open Data systems (e.g., Google Fusion Tables)
Entrepreneurship
1) “Data once guarded for assumed but untested
reasons is now open, and we're seeing benefits.”
-- Nigel Shadbolt, Open Data Institute
2) Need to help “non-specialists within an
organization use data that had been the
realm of programmers and DB admins”
-- Benjamin Romano, Xconomy
“Businesses are now using data the way
scientists always have”
-- Jeff Hammerbacher
Mt. Sinai, formerly Cloudera
*Just a Bunch of Tables
Data, data, data
Kevin Merritt, CEO, Socrata
Deep Dhillon, CTO, Socrata
[Diagram: tenant queries arriving at a three-layer stack: Application (schema, data, query logs) over Data Plane (database system) over Control Plane (infrastructure)]
Virtualization: tenants share the Control Plane / Infrastructure layer.
Benefits: significantly reduced management overhead
Challenges: security, scheduling, SLAs, isolation
DB-as-a-Service: tenants also share the Data Plane / Database system layer.
Benefits: significantly reduced management overhead
Challenges: security, scheduling, SLAs, isolation
JBOT* Query-as-a-Service Systems: tenants share the Application layer itself, including schema, data, and query logs.
Goal: smart cross-tenant services, trained on everyone's data
• Metadata inference and data curation
• Query recommendation via common idioms
• Data discovery – e.g., “find me things to join with”
• Visualization recommendation
• Semi-automatic integration services
*Just a Bunch of Tables
Example Service: Automated Data Curation
[Chart: microarray samples submitted to the Gene Expression Omnibus over time]
Curation is fast becoming the bottleneck to data sharing.
Maxim Grechkin, Hoifung Poon
Goal: Repair metadata for genetic
datasets using the content of the data, the
structure of an associated ontology, the
abstract of the paper, and everything else.
A deep neural network predicts tissue-type labels from the paper abstract (and the content of the data itself), with innovations in transfer learning, handling poor training data, etc.
Iterative co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other's results.
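To make the co-learning step concrete, here is a minimal co-training sketch in Python: a text-side model and an expression-side model repeatedly pseudo-label an unlabeled pool for each other. Everything in it (data, feature dimensions, classifiers, thresholds) is a synthetic stand-in, not the actual GEO curation pipeline.

# Minimal co-training sketch; all data and models are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_lab, n_unlab = 200, 1000
X_text = rng.normal(size=(n_lab + n_unlab, 50))     # e.g., abstract features
X_expr = rng.normal(size=(n_lab + n_unlab, 100))    # e.g., expression features
y = rng.integers(0, 5, size=n_lab)                  # tissue-type labels for the labeled part

labeled = np.arange(n_lab)
unlabeled = np.arange(n_lab, n_lab + n_unlab)
text_clf = LogisticRegression(max_iter=500)
expr_clf = LogisticRegression(max_iter=500)

for _ in range(5):
    text_clf.fit(X_text[labeled], y)
    expr_clf.fit(X_expr[labeled], y)
    # each model labels the pool; keep only confident predictions the models agree on
    p_text = text_clf.predict_proba(X_text[unlabeled])
    p_expr = expr_clf.predict_proba(X_expr[unlabeled])
    agree = (p_text.argmax(1) == p_expr.argmax(1)) & (p_text.max(1) > 0.6) & (p_expr.max(1) > 0.6)
    labeled = np.concatenate([labeled, unlabeled[agree]])
    y = np.concatenate([y, p_text.argmax(1)[agree]])
    unlabeled = unlabeled[~agree]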
Some Cloud Data Systems
• SQLShare: Query-as-a-Service
• VizDeck: Visualization recommendation
• Myria: Big Data Ecosystems
1) Upload data “as is”
Cloud-hosted, secure; no
need to install or design a
database; no pre-defined
schema; schema inference;
some integration
2) Write Queries
Right in your browser,
writing views on top of
views on top of views ...
SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
3) Share the results
Make them public, tag them,
share with specific colleagues –
anyone with access can query
http://sqlshare.escience.washington.edu
SIGMOD 2016
SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
, x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
, w.category as nc_category
, CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
THEN x.end_bp - x.start_bp + 1
WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
THEN x.end_bp - w.start_bp + 1
WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
THEN w.end_bp - x.start_bp + 1
END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Non-programmers can write very complex queries
(rather than relying on staff programmers)
Example: Computing the overlaps of two sets of BLAST results
We see thousands of
queries written by
non-programmers
The SQLShare Corpus:
A multi-year log of hand-written analytics queries
Queries: 24,275
Views: 4,535
Tables: 3,891
Users: 591
SIGMOD 2016
Shrainik Jain
https://uwescience.github.io/sqlshare
A SQL “learner”
http://uwescience.github.io/sqlshare/
Latent Idioms for Schema-Independent Query Recommendation
Background on Word2Vec and GloVe: map each term in a corpus to a vector in a high-dimensional space based on its co-occurrences. Linear relationships between these vectors appear to capture remarkable semantic properties.
…
SELECT COUNT(*) FROM [candrzejowiec@yahoo.com].[table_Firearms.txt]
SELECT COUNT (HiLo) FROM [roula.cardaras@gmail.com].[table_MUK.csv]
SELECT count(*) FROM [leslie@westerncatholic.org].[Depth_combined]
select count(Wave_Height) from [christa.kohnert@gmail.com].[Join]
SELECT count(*) FROM [wenjunh@washington.edu].[ecoli_nogaps_1.csv]
SELECT Count(*) FROM [latcron@gmail.com].[TargetTrackFeatures.csv]
SELECT count(*) FROM [billhowe].[sunrise sunset times 2009 - 2011]
SELECT Count(*) FROM [bifxcore@gmail.com].[table_ec_pdb_genus.csv]
SELECT count(*) FROM [whitead@washington.edu].[ecoli_nogaps_1.csv]
SELECT COUNT(*) FROM [ribalet@washington.edu].[Tokyo_0_merged.csv]
SELECT COUNT(*) FROM [dhalperi@washington.edu].[SPID_GOnumber.txt]
SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Orthosia]
SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Leucania]
…
Latent SQL Idioms: apply the same trick to the SQLShare corpus and cluster the results. The COUNT(*) queries above form one not-very-interesting cluster.
…
SELECT COUNT(*) FROM [ajw123@washington.edu].[table_proteins.csv] WHERE species LIKE 'Homo sapiens
SELECT count (*) FROM [1029880@apps.nsd.org].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'
SELECT count (*) FROM [1029880@apps.nsd.org].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'
SELECT Count (*) FROM [kzoehayes@gmail.com].[Dated_Join] WHERE Category = 'Warm'
SELECT COUNT (*) FROM [ethanknight08@gmail.com].[table_PopulationV2.txt] WHERE Column1='Country'
SELECT COUNT(*) FROM [missmelupton@gmail.com].[table_pHWaterTemp] WHERE TempCategory='normal'
SELECT COUNT(*) FROM [1004387@apps.nsd.org].[no retweete] WHERE hashtags_in_text LIKE '%#odisha
…
The single-table filtered counts above form another not-very-interesting cluster. We see other clusters that seem to capture more basics: “union,” “group by with one grouping column,” “left outer join,” “string manipulation,” etc.
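A lightweight sketch of this idiom-mining pipeline in Python, using an SVD of a token-query matrix as a stand-in for the Word2Vec/GloVe embeddings described above; the queries, tokenizer, and cluster count are illustrative only.

# Embed SQL query strings and cluster them into candidate idioms (a sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

queries = [
    "SELECT COUNT(*) FROM t1",
    "SELECT COUNT(species) FROM t2",
    "SELECT a, b FROM t3 INNER JOIN t4 ON t3.k = t4.k",
    "SELECT x FROM t5 LEFT OUTER JOIN t6 ON t5.k = t6.k",
    "SELECT c, COUNT(*) FROM t7 GROUP BY c",
    "SELECT d, COUNT(*) FROM t8 GROUP BY d",
]

vec = TfidfVectorizer(token_pattern=r"[A-Za-z_*]+")   # crude SQL tokenizer
X = vec.fit_transform(queries)
emb = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
for cluster, q in sorted(zip(labels, queries)):
    print(cluster, q)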
Latent SQL Idioms
More interesting examples:
select floor(latitude/0.7)*0.7 as latbin
, floor(longitude/0.7)*0.7 as lonbin
, species
FROM [koenigk92@gmail.com].[All3col]
select distinct case when patindex('%[0-9]%', [protein]) = 1 -- first char is number
and charindex(',', [protein]) = 0 -- and no comma present
then [protein]
else substring([protein], patindex('%[0-9]%', [protein]),
charindex(',', [protein])-patindex('%[0-9]%', [protein]))
end as [protein d1124],
[tot indep spectra] as [tot spectra d1124]
from [emmats@washington.edu].[d1_file124.txt]
Expressions for binning space and time columns (first query); parsing a common bioinformatics file format (second query).
MYRIA: BIG DATA POLYSTORES
Polystore Ecosystems: “Software Defined Databases”
[Diagram: a single Application layer (schema, data, query logs) spanning multiple data planes: RDBMS, HPC / linear algebra, and graph systems]
[Diagram: RACO, the Relational Algebra COmpiler. MyriaL queries are translated into a logical Myria algebra over tables, key-value pairs, arrays, and graphs; rewrite rules lower plans through parallel and array algebras to back-end APIs (Spark, Accumulo, CombBLAS, and graph APIs such as GraphX). A polystore execution plan moves data and executes queries across back ends, alongside services for visualization, logging, discovery, history, and browsing, plus orchestration.]
https://github.com/uwescience/raco
https://metanautix.com/tr/01_big_data_techniques_for_media_graphics.pdf
Ollie Lo, Los Alamos National Lab
CurGood = SCAN(public:adhoc:sc_points);
DO
mean = [FROM CurGood EMIT val=AVG(v)];
std = [FROM CurGood EMIT val=STDEV(v)];
NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
CurGood = CurGood - NewBad;
continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);
Sigma-clipping, V0
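For reference, the same 2-sigma clipping loop that V0 expresses in MyriaL, written directly in Python/numpy; this is a sketch, and the dataset below is synthetic.

# Repeatedly drop points more than 2 sample standard deviations from the mean
# until nothing changes.
import numpy as np

def sigma_clip(points, k=2.0):
    good = np.asarray(points, dtype=float)
    while True:
        mean, std = good.mean(), good.std(ddof=1)   # STDEV in V0 is the sample std
        keep = np.abs(good - mean) <= k * std
        if keep.all():
            return good
        good = good[keep]

data = np.concatenate([np.random.default_rng(0).normal(0, 1, 1000), [50.0, -40.0]])
print(len(data), "->", len(sigma_clip(data)))   # the two outliers are removed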
CurGood = SCAN(public:adhoc:sc_points);
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)];
cnt = [FROM CurGood EMIT CNT(*)];
NewBad = []
DO
sum = sum - [FROM NewBad EMIT SUM(val)];
sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
cnt = cnt - [FROM NewBad EMIT CNT(*)];
mean = sum / cnt
std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum))
NewBad = FILTER([ABS(val-mean) > 2 * std], CurGood)
CurGood = CurGood - NewBad
WHILE NewBad != {}
Sigma-clipping, V1: Incremental
Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = []
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
DO
new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum,
sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
stats = [FROM aggs EMIT mean=_sum/cnt,
std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v
AND v >= bounds.lower EMIT v=Points.v];
tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v
AND v <= bounds.upper EMIT v=Points.v];
newBad = UNIONALL(tooLow, tooHigh);
bounds = newBounds;
continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;
output = [FROM Points, bounds WHERE Points.v > bounds.lower AND
Points.v < bounds.upper EMIT v=Points.v];
DUMP(output);
Sigma-clipping, V2
Empower the end user to do performance profiling, debugging, etc.
[Screenshot: diagnosing problems in traffic between source and destination nodes]
Dominik Moritz, EuroVis 15
Some ongoing work
• “from scratch” polystore optimizer
– Columbia-style, with some ideas from PL community
• Anecdotal Optimization
– Infer optimization decisions based on coarse-grained experimental
results from unreliable sources (blogs, literature)
– “System X is 2X faster than System Y on PageRank”
• Benchmarking Linear Algebra Systems vs. Databases
– HPC community thinks they are 1000X faster; they aren’t
– DB community thinks they are competitive; they aren’t
• Query compilation
– Bridge the gap between MPI and DB
• New query language Kamooks blending arrays and relations
Query compilation for distributed processing
[Diagram: two strategies. Radish compiles each pipeline as parallel code with a parallel compiler down to machine code [Myers '14]; other systems compile per-node pipeline fragments with a sequential compiler [Crotty '14, Li '14, Seo '14, Murray '11].]
RADISH: Brandon Myers, ICS 16
[Chart: 1% selection microbenchmark, 20 GB; avoid long code paths]
Brandon Myers, ICS 16
[Chart: Q2 of SP2Bench, 100M triples, multiple self-joins; communication optimization]
Brandon Myers, ICS 16
Graph Patterns (RADISH: Brandon Myers, ICS 16)
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then compiled again by a low-level PGAS compiler
• One of Myria's supported back ends
• Comparison with Shark/Spark, which itself has been shown to be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
[Chart: TPC-H results]
RADISH: Brandon Myers, ICS 15 / ICS 16
Some ongoing work
• “from scratch” polystore optimizer
– Columbia-style, with some ideas from PL community
• Anecdotal Optimization
– Infer optimization decisions based on coarse-grained experimental
results from unreliable sources (blogs, literature)
– “System X is 2X faster than System Y on PageRank”
• Benchmarking Linear Algebra Systems vs. Databases
– HPC community thinks they are 1000X faster; they aren’t
– DB community thinks they are competitive; they aren’t
• Query compilation
– “Software-defined Databases”
– Bridge the gap between MPI and DB
• New query language Kamooks blending arrays and relations
select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
Matrix multiply in RA
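To see why this query is a matrix multiply: with A and B stored as (row, column, value) triples, the join on j plus the GROUP BY on (i, k) computes C(i, k) = sum over j of A(i, j) * B(j, k). A small Python check of that correspondence (the tiny matrices are made up):

# Sparse matrix multiply over (row, col, val) triples, mirroring the SQL above.
from collections import defaultdict

def spgemm(A, B):
    B_by_j = defaultdict(list)
    for (j, k, bv) in B:
        B_by_j[j].append((k, bv))
    C = defaultdict(float)
    for (i, j, av) in A:                   # join on the shared index j ...
        for (k, bv) in B_by_j.get(j, []):
            C[(i, k)] += av * bv           # ... then group by (i, k) and sum
    return dict(C)

A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]   # A(i, j, val)
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]   # B(j, k, val)
print(spgemm(A, B))   # {(0, 0): 14.0, (0, 1): 12.0, (1, 0): 15.0, (1, 1): 18.0}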
Complexity of matrix multiply, plotted against the sparsity exponent r (such that m = n^r), where n = number of rows and m = number of non-zeros (slide adapted from Zwick; R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication):
• naïve sparse algorithm: mn
• best known sparse algorithm: m^0.7 n^1.2 + n^2
• best known dense algorithm: n^2.38
There is lots of room between these bounds.
[Chart: BLAS vs. SpBLAS vs. SQL, 10k x 10k; the off-the-shelf database is roughly 15X off]
[Chart: 20k x 20k matrix multiply by sparsity: CombBLAS, MyriaX, Radish]
[Chart: 50k x 50k matrix multiply by sparsity: CombBLAS, MyriaX, Radish, filtered to the upper-left corner of the result matrix]
select AB.i, C.m, sum(AB.val*C.val)
from
(select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
) AB,
C
where AB.k = C.k
group by AB.i, C.m
A x B x C
select A.i, C.m, sum(A.val*B.val*C.val)
from A, B, C
where A.j = B.j
and B.k = C.k
group by A.i, C.m
Take three sparse matrices: A(i, j, val), B(j, k, val), C(k, m, val).
Now compute the multiway hypercube join, O(|A|/p + |B|/p^2 + |C|/p), followed by the group-by, ~O(N).
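A toy sketch of what the hypercube shuffle does for this chain join on a p1 x p2 worker grid: B tuples are hashed to exactly one cell, while A and C are replicated along one grid axis so each worker can join its fragments locally. The grid shape (and hence the communication cost quoted above) is chosen from the relation sizes; the code below is illustrative, not Myria's implementation.

# Assign A(i,j,val), B(j,k,val), C(k,m,val) tuples to cells of a P1 x P2 worker grid.
from collections import defaultdict

P1, P2 = 2, 2  # hypothetical grid dimensions

def hypercube_shuffle(A, B, C):
    cells = defaultdict(lambda: ([], [], []))
    for t in A:                                   # partition A on j, replicate along the k-axis
        for y in range(P2):
            cells[(hash(t[1]) % P1, y)][0].append(t)
    for t in B:                                   # B lands on exactly one cell: no replication
        cells[(hash(t[0]) % P1, hash(t[1]) % P2)][1].append(t)
    for t in C:                                   # partition C on k, replicate along the j-axis
        for x in range(P1):
            cells[(x, hash(t[0]) % P2)][2].append(t)
    return cells  # each worker then runs the local 3-way join and a partial group-by

A = [(0, 1, 1.0)]; B = [(1, 2, 2.0)]; C = [(2, 3, 3.0)]
for cell, (a, b, c) in sorted(hypercube_shuffle(A, B, C).items()):
    print(cell, len(a), len(b), len(c))   # exactly one cell holds all three matching tuples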
But wait, there’s more…..
Task: self-multiply with 1M non-zeros.
Hypercube shuffle: 2 seconds, balanced.
Partitioned hash join: 43 seconds, tons of skew.
Scalable Graph Clustering (Seung-Hee Bae)
Version 1: Parallelize the best-known serial algorithm (ICDM 2013)
Version 2: Free 30% improvement for any algorithm (TKDD 2014)
Version 3: Distributed approximate algorithm, 1.5B edges (SC 2015)
http://escience.washington.edu
http://myria.cs.washington.edu
http://uwescience.github.io/sqlshare/
VIZDECK: VISUALIZATION RECOMMENDATION
“Data Triage” Pipeline
[Diagram: files in SAS, Excel, XML, and CSV are parsed/extracted into tables and views in SQL Azure for “relational analysis” (SIGMOD 11, SSDBM 13, SIGMOD 16; sqlshare.escience.washington.edu), then fed to visual analysis that produces visualizations (SSDBM 11, CHI 12, SIGMOD 12, iConference 13, CiSE 13, SSDBM 15)]
[Video demo]
[Chart: Task Completion Rate / Time, all questions, comparing Fusion, VizDeck, ManyEyes, and Tableau] (CHI 13)
Visualization Recommendation
• Model each “vizlet” as a triple (x_column, y_column, vizlet_type)
• Extract features from each column: (f1x, f2x, …, fNx, f1y, f2y, …, fNy, vizlet_type)
• Interpret each “promotion” as a yes vote and each “discard” as a no vote
• Train a (simple) model to predict vizlet type from features
• Recommend highest-scoring vizlets
• Add a diversity term to prevent a bunch of similar plots
• Incorporate score modifiers defined by the vizlet designer
– “My bar chart looks best when there are about 5 bars.”
– “My timeseries plot ignores null values”
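A minimal sketch of the loop above in Python: featurize each (x_column, y_column, vizlet_type) candidate, fit a simple classifier on past promote/discard votes, and rank fresh candidates by score. The feature choices, training data, and promote/discard framing here are assumptions for illustration, not the actual VizDeck model.

# Train on past votes, then score and rank new vizlet candidates.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def column_features(col):
    col = np.asarray(col, dtype=float)
    vals, counts = np.unique(col, return_counts=True)
    p = counts / counts.sum()
    entropy = float(-(p * np.log2(p)).sum())        # low entropy = few distinct values
    return [entropy, len(vals) / len(col), float(col.std())]

def vizlet_features(x_col, y_col, vizlet_type):
    # crude categorical encoding of the vizlet type, purely for illustration
    return column_features(x_col) + column_features(y_col) + [hash(vizlet_type) % 7]

# synthetic training set: past vizlets with promote (1) / discard (0) votes
rng = np.random.default_rng(1)
X = [vizlet_features(rng.normal(size=50), rng.normal(size=50), t)
     for t in ["scatter", "bar", "timeseries"] * 20]
votes = rng.integers(0, 2, size=len(X))

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, votes)

# score fresh candidates and recommend the highest-scoring ones
candidates = [("scatter", rng.normal(size=50), rng.normal(size=50)),
              ("bar", rng.integers(0, 3, size=50), rng.normal(size=50))]
scores = model.predict_proba([vizlet_features(x, y, t) for t, x, y in candidates])[:, 1]
for (t, _, _), s in sorted(zip(candidates, scores), key=lambda z: -float(z[1])):
    print(t, round(float(s), 2))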
Example of a Learned Rule (1)
low x-entropy => bad scatter plot
[Images: good scatter plot vs. bad scatter plot]
Example of a Learned Rule (2)
low x-entropy => histogram
[Images: bad scatter plot vs. good histogram]
Example of a Learned Rule (3)
high x-periodicity => timeseries plot
(periodicity = 1 / variance in gap length between successive values)
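The two column statistics behind these rules are easy to state in code. A small sketch follows; the thresholds for "low" and "high" are not given on the slides, so none are hard-coded here.

# Entropy of a column's values, and periodicity as defined above.
import numpy as np

def x_entropy(col):
    _, counts = np.unique(col, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def x_periodicity(col):
    gaps = np.diff(np.sort(np.asarray(col, dtype=float)))   # gaps between successive values
    var = gaps.var()
    return float("inf") if var == 0 else 1.0 / var           # 1 / variance of gap length

print(x_entropy([0, 0, 1, 1]))            # low entropy: only two distinct values
print(x_periodicity(np.arange(0, 48)))    # evenly spaced values: very high periodicity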
Voyager
Kanit “Ham” Wongsuphasawat, Dominik Moritz
InfoVis 15
Within the first few queries, you’ve
touched all the tables.
SIGMOD 2016
Shrainik Jain
http://uwescience.github.io/sqlshare/


Editor's Notes

• #4 Let me give you a brief example of a project a little further upstream that the incubation program can provide access to. This work is in the space of open data sharing platforms, along with Socrata here in Seattle, products from Google and Microsoft, and a number of other companies. Two observations motivate the products in this space. First, there's a movement toward open data that has researchers, government agencies, and even companies exposing their data assets online for use by others, for reasons of transparency, efficiency, and accountability. Even for commercial data, there are marketplaces emerging to facilitate the buying and selling of data. All of these use cases need new technology. So that's one reason. Second, if you're going to use someone else's data, you need it to be as accessible as possible. In particular, you need to help data analysts use data that "had previously been the realm of programmers and DB administrators" – here I'm quoting Benjamin Romano in an Xconomy article about Socrata. SQLShare is an open data system, but it emphasizes rich data manipulation rather than just fetch and retrieval, interoperability with external tools and existing databases, local or cloud deployments, and built-in services for data integration, profiling, and visualization. Ginger mentioned this system in her talk – we have maintained a production deployment here on campus for three years, focusing on science users. Our observation is that science use cases are a predictor for commercial use cases – businesses are beginning to use data the same way scientists always have: they collect it aggressively, torture it with analytics, and use it to make predictions about the world. So we think that if we can handle these difficult science use cases, we will also be addressing a significant commercial problem.
• #5 Solutions are emerging, powered by the open data movement. Socrata, a local Seattle company, has built a very successful business of helping cities jailbreak their data, and is now engaged in climbing the application stack to support analytics and visualization. Essentially every URL of the form data.yourcity.gov is powered by Socrata's technology. Data, People, and Infrastructure.
  • #8 and you can extend this model to the database layer to help share services like backup, recovery, caching, load balancing
  • #9 If you go up a level, you have what you might call “query as a service” – you’re querying your own data, but you might query others peoples data as well. And, even if you want to remain logically isolated, you can still benefit from services that are powered by mining everyone’s schema, data, and workload. For example, query recommendation in the past assumed a fixed schema. With this model, you can recommend “idioms” across different schemas. You can discover public datasets to join with, like Alon worked on with Fusion Tables You can recommend visualizations automatically You can automatically infer and attach metadata – semi-automatic data curation. A big globally shared data lake “Precision Medicine for Databases”
• #15 So we developed SQLShare to support a very simple workflow: you can upload data "as is" from spreadsheets or anything. It's in the cloud, so there is no need to install or design a database. You can immediately begin writing queries, right in your browser, and put queries on top of queries on top of queries. Then you can share the results online: your colleagues can browse the science questions and see the SQL that answers them. ---- Key ideas to get data in: a) Use the cloud to avoid having to install and run a database. b) Give up on the schema -- just throw your data in "as is" and do "lazy integration." c) Use some magic to automate parsing, integration, recommendations, and more. Key ideas to get data out: a) Associate science questions (in English) with each SQL query -- this makes them easy to understand and easy to find. b) Saving and reusing queries is a first-class requirement. Given an example, it's easy to modify it into an "adjacent" query. c) Expose the whole system through a REST API to make it easy to bring new client applications online.
• #16 Lots of features you can imagine here – anything you can do with a YouTube video, you should be able to do with a query: share it, rate it, "more like this", recommendations. We are exploring some of these.
  • #18 We see non-programmers who write these wonderful 40-line queries. This one does interval queries on genomic sequences. She doesn’t write any R, any Python, but she can do this, and she’s no longer dependent on staff programmers.
  • #27 If you go up a level, you have what you might call “query as a service” – you’re querying your own data, but you might query others peoples data as well. And, even if you want to remain logically isolated, you can still benefit from services that are powered by mining everyone’s schema, data, and workload. For example, query recommendation in the past assumed a fixed schema. With this model, you can recommend “idioms” across different schemas. You can discover public datasets to join with, like Alon worked on with Fusion Tables You can recommend visualizations automatically You can automatically infer and attach metadata – semi-automatic data curation. A big globally shared data lake
  • #28 Express these plans Optimize these plans Compile these plans Execute these plans
  • #29  Express these plans Optimize these plans Compile these plans Execute these plans
  • #30  So our approach is to model this overlap in capabilities as its own language. We start
  • #31 We hoist
• #44 NOTES: What optimizations does this enable? With better semantics on a hash table join with UDFs, we can do redundant computation elimination and code motion from UDFs.
• #51 Can you just run this in a database and expect good performance? Of course not. But is it a fundamentally bad idea to run it this way? Maybe not.
• #52 This is the complexity of three matrix multiply algorithms plotted against the sparsity exponent – a naïve sparse
  • #57 Now let’s do
• #62 If you can automate, you can precompute speculatively.
• #67 On Big Data: interactive visualization on the web may or may not be feasible for very big data, but a good model for visualization recommendation can enable speculative generation.
  • #74 Why do we care about lifetime? Table usage predictions for caching and partitioning. Move from reactive to proactive physical design services. Query idioms are consistent, while the data is fleeting. Not exact queries as in a streaming system, but the “methods” are reused over and over. Extracting and optimizing these idioms across tenants is our goal.
  • #76 And processing power, either as raw processor speed or via novel multi-core and many-core architectures, is also continuing to increase exponentially…
  • #77 … but human cognitive capacity is remaining constant. How can computing technologies help scientists make sense out of these vast and complex data sets?
  • #78  We want to give a little background of our project before we launch into it, so we will discuss the problem we are trying to solve. Essentially, we want to remove the speed-bump of data handling from the scientists.