Democratizing Data Science
in the Cloud
Bill Howe, Ph.D.
Associate Director and Senior Data Science Fellow, eScience Institute
Affiliate Associate Professor, Computer Science & Engineering
11/1/2016, Bill Howe, UW
Cloud Data Management is about sharing resources between tenants.
We're interested in new services powered by sharing more than infrastructure: schema, data, and queries.
Why? Example: JBOT* Open Data systems (e.g., Google Fusion Tables)
Entrepreneurship
1) “Data once guarded for assumed but untested
reasons is now open, and we're seeing benefits.”
-- Nigel Shadbolt, Open Data Institute
2) Need to help “non-specialists within an
organization use data that had been the
realm of programmers and DB admins”
-- Benjamin Romano, Xconomy
“Businesses are now using data the way
scientists always have”
-- Jeff Hammerbacher
Mt. Sinai, formerly Cloudera
*Just a Bunch of Tables
Data, data, data
Kevin Merritt, CEO, Socrata
Deep Dhillon, CTO, Socrata
[Diagram: tenant queries arriving at a three-layer stack: Application (schema, data, query logs) over Data Plane (database system) over Control Plane (infrastructure)]
Virtualization: tenants share the Control Plane / Infrastructure layer.
Benefits: significantly reduced management overhead
Challenges: security, scheduling, SLAs, isolation
DB-as-a-Service: tenants also share the Data Plane / Database system layer.
Benefits: significantly reduced management overhead
Challenges: security, scheduling, SLAs, isolation
JBOT* Query-as-a-Service Systems: tenants share the Application layer itself, including schema, data, and query logs.
Goal: smart cross-tenant services, trained on everyone's data
• Metadata inference and data curation
• Query recommendation via common idioms
• Data discovery – e.g., “find me things to join with”
• Visualization recommendation
• Semi-automatic integration services
*Just a Bunch of Tables
Example Service: Automated Data Curation
[Chart: microarray samples submitted to the Gene Expression Omnibus over time]
Curation is fast becoming the bottleneck to data sharing.
Maxim Grechkin, Hoifung Poon
Goal: Repair metadata for genetic
datasets using the content of the data, the
structure of an associated ontology, the
abstract of the paper, and everything else.
A deep neural network predicts tissue-type labels from the paper abstract (and the content of the data itself), with innovations in transfer learning, handling poor training data, etc.
Iterative co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other's results.
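To make the co-learning step concrete, here is a minimal co-training sketch in Python: a text-side model and an expression-side model repeatedly pseudo-label an unlabeled pool for each other. Everything in it (data, feature dimensions, classifiers, thresholds) is a synthetic stand-in, not the actual GEO curation pipeline.

# Minimal co-training sketch; all data and models are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_lab, n_unlab = 200, 1000
X_text = rng.normal(size=(n_lab + n_unlab, 50))     # e.g., abstract features
X_expr = rng.normal(size=(n_lab + n_unlab, 100))    # e.g., expression features
y = rng.integers(0, 5, size=n_lab)                  # tissue-type labels for the labeled part

labeled = np.arange(n_lab)
unlabeled = np.arange(n_lab, n_lab + n_unlab)
text_clf = LogisticRegression(max_iter=500)
expr_clf = LogisticRegression(max_iter=500)

for _ in range(5):
    text_clf.fit(X_text[labeled], y)
    expr_clf.fit(X_expr[labeled], y)
    # each model labels the pool; keep only confident predictions the models agree on
    p_text = text_clf.predict_proba(X_text[unlabeled])
    p_expr = expr_clf.predict_proba(X_expr[unlabeled])
    agree = (p_text.argmax(1) == p_expr.argmax(1)) & (p_text.max(1) > 0.6) & (p_expr.max(1) > 0.6)
    labeled = np.concatenate([labeled, unlabeled[agree]])
    y = np.concatenate([y, p_text.argmax(1)[agree]])
    unlabeled = unlabeled[~agree]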
Some Cloud Data Systems
• SQLShare: Query-as-a-Service
• VizDeck: Visualization recommendation
• Myria: Big Data Ecosystems
1) Upload data “as is”
Cloud-hosted, secure; no
need to install or design a
database; no pre-defined
schema; schema inference;
some integration
2) Write Queries
Right in your browser,
writing views on top of
views on top of views ...
SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
3) Share the results
Make them public, tag them,
share with specific colleagues –
anyone with access can query
http://sqlshare.escience.washington.edu
SIGMOD 2016
SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
, x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
, w.category as nc_category
, CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
THEN x.end_bp - x.start_bp + 1
WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
THEN x.end_bp - w.start_bp + 1
WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
THEN w.end_bp - x.start_bp + 1
END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Non-programmers can write very complex queries
(rather than relying on staff programmers)
Example: Computing the overlaps of two sets of BLAST results
We see thousands of
queries written by
non-programmers
The SQLShare Corpus:
A multi-year log of hand-written analytics queries
Queries: 24,275
Views: 4,535
Tables: 3,891
Users: 591
SIGMOD 2016
Shrainik Jain
https://uwescience.github.io/sqlshare
A SQL “learner”
http://uwescience.github.io/sqlshare/
Latent Idioms for Schema-Independent Query Recommendation
Background on Word2Vec and GloVe: map each term in a corpus to a vector in a high-dimensional space based on its co-occurrences. Linear relationships between these vectors appear to capture remarkable semantic properties.
…
SELECT COUNT(*) FROM [candrzejowiec@yahoo.com].[table_Firearms.txt]
SELECT COUNT (HiLo) FROM [roula.cardaras@gmail.com].[table_MUK.csv]
SELECT count(*) FROM [leslie@westerncatholic.org].[Depth_combined]
select count(Wave_Height) from [christa.kohnert@gmail.com].[Join]
SELECT count(*) FROM [wenjunh@washington.edu].[ecoli_nogaps_1.csv]
SELECT Count(*) FROM [latcron@gmail.com].[TargetTrackFeatures.csv]
SELECT count(*) FROM [billhowe].[sunrise sunset times 2009 - 2011]
SELECT Count(*) FROM [bifxcore@gmail.com].[table_ec_pdb_genus.csv]
SELECT count(*) FROM [whitead@washington.edu].[ecoli_nogaps_1.csv]
SELECT COUNT(*) FROM [ribalet@washington.edu].[Tokyo_0_merged.csv]
SELECT COUNT(*) FROM [dhalperi@washington.edu].[SPID_GOnumber.txt]
SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Orthosia]
SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Leucania]
…
Latent SQL Idioms: apply the same trick to the SQLShare corpus and cluster the results. The COUNT(*) queries above form one not-very-interesting cluster.
…
SELECT COUNT(*) FROM [ajw123@washington.edu].[table_proteins.csv] WHERE species LIKE 'Homo sapiens
SELECT count (*) FROM [1029880@apps.nsd.org].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'
SELECT count (*) FROM [1029880@apps.nsd.org].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'
SELECT Count (*) FROM [kzoehayes@gmail.com].[Dated_Join] WHERE Category = 'Warm'
SELECT COUNT (*) FROM [ethanknight08@gmail.com].[table_PopulationV2.txt] WHERE Column1='Country'
SELECT COUNT(*) FROM [missmelupton@gmail.com].[table_pHWaterTemp] WHERE TempCategory='normal'
SELECT COUNT(*) FROM [1004387@apps.nsd.org].[no retweete] WHERE hashtags_in_text LIKE '%#odisha
…
The single-table filtered counts above form another not-very-interesting cluster. We see other clusters that seem to capture more basics: “union,” “group by with one grouping column,” “left outer join,” “string manipulation,” etc.
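A lightweight sketch of this idiom-mining pipeline in Python, using an SVD of a token-query matrix as a stand-in for the Word2Vec/GloVe embeddings described above; the queries, tokenizer, and cluster count are illustrative only.

# Embed SQL query strings and cluster them into candidate idioms (a sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

queries = [
    "SELECT COUNT(*) FROM t1",
    "SELECT COUNT(species) FROM t2",
    "SELECT a, b FROM t3 INNER JOIN t4 ON t3.k = t4.k",
    "SELECT x FROM t5 LEFT OUTER JOIN t6 ON t5.k = t6.k",
    "SELECT c, COUNT(*) FROM t7 GROUP BY c",
    "SELECT d, COUNT(*) FROM t8 GROUP BY d",
]

vec = TfidfVectorizer(token_pattern=r"[A-Za-z_*]+")   # crude SQL tokenizer
X = vec.fit_transform(queries)
emb = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
for cluster, q in sorted(zip(labels, queries)):
    print(cluster, q)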
Latent SQL Idioms
More interesting examples:
select floor(latitude/0.7)*0.7 as latbin
, floor(longitude/0.7)*0.7 as lonbin
, species
FROM [koenigk92@gmail.com].[All3col]
select distinct case when patindex('%[0-9]%', [protein]) = 1 -- first char is number
and charindex(',', [protein]) = 0 -- and no comma present
then [protein]
else substring([protein], patindex('%[0-9]%', [protein]),
charindex(',', [protein])-patindex('%[0-9]%', [protein]))
end as [protein d1124],
[tot indep spectra] as [tot spectra d1124]
from [emmats@washington.edu].[d1_file124.txt]
Expressions for binning space and time columns (first query); parsing a common bioinformatics file format (second query).
MYRIA: BIG DATA POLYSTORES
Polystore Ecosystems: “Software Defined Databases”
[Diagram: a single Application layer (schema, data, query logs) spanning multiple data planes: RDBMS, HPC / linear algebra, and graph systems]
[Diagram: RACO, the Relational Algebra COmpiler. MyriaL queries are translated into a logical Myria algebra over tables, key-value pairs, arrays, and graphs; rewrite rules lower plans through parallel and array algebras to back-end APIs (Spark, Accumulo, CombBLAS, and graph APIs such as GraphX). A polystore execution plan moves data and executes queries across back ends, alongside services for visualization, logging, discovery, history, and browsing, plus orchestration.]
https://github.com/uwescience/raco
https://metanautix.com/tr/01_big_data_techniques_for_media_graphics.pdf
Ollie Lo, Los Alamos National Lab
CurGood = SCAN(public:adhoc:sc_points);
DO
mean = [FROM CurGood EMIT val=AVG(v)];
std = [FROM CurGood EMIT val=STDEV(v)];
NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
CurGood = CurGood - NewBad;
continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);
Sigma-clipping, V0
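For reference, the same 2-sigma clipping loop that V0 expresses in MyriaL, written directly in Python/numpy; this is a sketch, and the dataset below is synthetic.

# Repeatedly drop points more than 2 sample standard deviations from the mean
# until nothing changes.
import numpy as np

def sigma_clip(points, k=2.0):
    good = np.asarray(points, dtype=float)
    while True:
        mean, std = good.mean(), good.std(ddof=1)   # STDEV in V0 is the sample std
        keep = np.abs(good - mean) <= k * std
        if keep.all():
            return good
        good = good[keep]

data = np.concatenate([np.random.default_rng(0).normal(0, 1, 1000), [50.0, -40.0]])
print(len(data), "->", len(sigma_clip(data)))   # the two outliers are removed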
CurGood = SCAN(public:adhoc:sc_points);
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)];
cnt = [FROM CurGood EMIT CNT(*)];
NewBad = []
DO
sum = sum - [FROM NewBad EMIT SUM(val)];
sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
cnt = cnt - [FROM NewBad EMIT CNT(*)];
mean = sum / cnt
std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum))
NewBad = FILTER([ABS(val-mean) > 2 * std], CurGood)
CurGood = CurGood - NewBad
WHILE NewBad != {}
Sigma-clipping, V1: Incremental
Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = []
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
DO
new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum,
sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
stats = [FROM aggs EMIT mean=_sum/cnt,
std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v
AND v >= bounds.lower EMIT v=Points.v];
tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v
AND v <= bounds.upper EMIT v=Points.v];
newBad = UNIONALL(tooLow, tooHigh);
bounds = newBounds;
continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;
output = [FROM Points, bounds WHERE Points.v > bounds.lower AND
Points.v < bounds.upper EMIT v=Points.v];
DUMP(output);
Sigma-clipping, V2
Empower the end user to do performance profiling, debugging, etc.
[Screenshot: diagnosing problems in traffic between source and destination nodes]
Dominik Moritz, EuroVis 15
Some ongoing work
• “from scratch” polystore optimizer
– Columbia-style, with some ideas from PL community
• Anecdotal Optimization
– Infer optimization decisions based on coarse-grained experimental
results from unreliable sources (blogs, literature)
– “System X is 2X faster than System Y on PageRank”
• Benchmarking Linear Algebra Systems vs. Databases
– HPC community thinks they are 1000X faster; they aren’t
– DB community thinks they are competitive; they aren’t
• Query compilation
– Bridge the gap between MPI and DB
• New query language Kamooks blending arrays and relations
Query compilation for distributed processing
[Diagram: two strategies. Radish compiles each pipeline as parallel code with a parallel compiler down to machine code [Myers '14]; other systems compile per-node pipeline fragments with a sequential compiler [Crotty '14, Li '14, Seo '14, Murray '11].]
RADISH: Brandon Myers, ICS 16
[Chart: 1% selection microbenchmark, 20 GB; avoid long code paths]
Brandon Myers, ICS 16
[Chart: Q2 of SP2Bench, 100M triples, multiple self-joins; communication optimization]
Brandon Myers, ICS 16
Graph Patterns (RADISH: Brandon Myers, ICS 16)
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then compiled again by a low-level PGAS compiler
• One of Myria's supported back ends
• Comparison with Shark/Spark, which itself has been shown to be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
[Chart: TPC-H results]
RADISH: Brandon Myers, ICS 15 / ICS 16
Some ongoing work
• “from scratch” polystore optimizer
– Columbia-style, with some ideas from PL community
• Anecdotal Optimization
– Infer optimization decisions based on coarse-grained experimental
results from unreliable sources (blogs, literature)
– “System X is 2X faster than System Y on PageRank”
• Benchmarking Linear Algebra Systems vs. Databases
– HPC community thinks they are 1000X faster; they aren’t
– DB community thinks they are competitive; they aren’t
• Query compilation
– “Software-defined Databases”
– Bridge the gap between MPI and DB
• New query language Kamooks blending arrays and relations
select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
Matrix multiply in RA
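To see why this query is a matrix multiply: with A and B stored as (row, column, value) triples, the join on j plus the GROUP BY on (i, k) computes C(i, k) = sum over j of A(i, j) * B(j, k). A small Python check of that correspondence (the tiny matrices are made up):

# Sparse matrix multiply over (row, col, val) triples, mirroring the SQL above.
from collections import defaultdict

def spgemm(A, B):
    B_by_j = defaultdict(list)
    for (j, k, bv) in B:
        B_by_j[j].append((k, bv))
    C = defaultdict(float)
    for (i, j, av) in A:                   # join on the shared index j ...
        for (k, bv) in B_by_j.get(j, []):
            C[(i, k)] += av * bv           # ... then group by (i, k) and sum
    return dict(C)

A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]   # A(i, j, val)
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]   # B(j, k, val)
print(spgemm(A, B))   # {(0, 0): 14.0, (0, 1): 12.0, (1, 0): 15.0, (1, 1): 18.0}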
Complexity of matrix multiply, plotted against the sparsity exponent r (such that m = n^r), where n = number of rows and m = number of non-zeros (slide adapted from Zwick; R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication):
• naïve sparse algorithm: mn
• best known sparse algorithm: m^0.7 n^1.2 + n^2
• best known dense algorithm: n^2.38
There is lots of room between these bounds.
[Chart: BLAS vs. SpBLAS vs. SQL, 10k x 10k; the off-the-shelf database is roughly 15X off]
[Chart: 20k x 20k matrix multiply by sparsity: CombBLAS, MyriaX, Radish]
[Chart: 50k x 50k matrix multiply by sparsity: CombBLAS, MyriaX, Radish, filtered to the upper-left corner of the result matrix]
select AB.i, C.m, sum(AB.val*C.val)
from
(select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
) AB,
C
where AB.k = C.k
group by AB.i, C.m
A x B x C
select A.i, C.m, sum(A.val*B.val*C.val)
from A, B, C
where A.j = B.j
and B.k = C.k
group by A.i, C.m
Take three sparse matrices: A(i, j, val), B(j, k, val), C(k, m, val).
Now compute the multiway hypercube join, O(|A|/p + |B|/p^2 + |C|/p), followed by the group-by, ~O(N).
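A toy sketch of what the hypercube shuffle does for this chain join on a p1 x p2 worker grid: B tuples are hashed to exactly one cell, while A and C are replicated along one grid axis so each worker can join its fragments locally. The grid shape (and hence the communication cost quoted above) is chosen from the relation sizes; the code below is illustrative, not Myria's implementation.

# Assign A(i,j,val), B(j,k,val), C(k,m,val) tuples to cells of a P1 x P2 worker grid.
from collections import defaultdict

P1, P2 = 2, 2  # hypothetical grid dimensions

def hypercube_shuffle(A, B, C):
    cells = defaultdict(lambda: ([], [], []))
    for t in A:                                   # partition A on j, replicate along the k-axis
        for y in range(P2):
            cells[(hash(t[1]) % P1, y)][0].append(t)
    for t in B:                                   # B lands on exactly one cell: no replication
        cells[(hash(t[0]) % P1, hash(t[1]) % P2)][1].append(t)
    for t in C:                                   # partition C on k, replicate along the j-axis
        for x in range(P1):
            cells[(x, hash(t[0]) % P2)][2].append(t)
    return cells  # each worker then runs the local 3-way join and a partial group-by

A = [(0, 1, 1.0)]; B = [(1, 2, 2.0)]; C = [(2, 3, 3.0)]
for cell, (a, b, c) in sorted(hypercube_shuffle(A, B, C).items()):
    print(cell, len(a), len(b), len(c))   # exactly one cell holds all three matching tuples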
But wait, there’s more…..
Task: self-multiply with 1M non-zeros.
Hypercube shuffle: 2 seconds, balanced.
Partitioned hash join: 43 seconds, tons of skew.
Scalable Graph Clustering (Seung-Hee Bae)
Version 1: Parallelize the best-known serial algorithm (ICDM 2013)
Version 2: Free 30% improvement for any algorithm (TKDD 2014)
Version 3: Distributed approximate algorithm, 1.5B edges (SC 2015)
http://escience.washington.edu
http://myria.cs.washington.edu
http://uwescience.github.io/sqlshare/
VIZDECK: VISUALIZATION RECOMMENDATION
“Data Triage” Pipeline
[Diagram: files in SAS, Excel, XML, and CSV are parsed/extracted into tables and views in SQL Azure for “relational analysis” (SIGMOD 11, SSDBM 13, SIGMOD 16; sqlshare.escience.washington.edu), then fed to visual analysis that produces visualizations (SSDBM 11, CHI 12, SIGMOD 12, iConference 13, CiSE 13, SSDBM 15)]
[Video demo]
[Chart: Task Completion Rate / Time, all questions, comparing Fusion, VizDeck, ManyEyes, and Tableau] (CHI 13)
Visualization Recommendation
• Model each “vizlet” as a triple (x_column, y_column, vizlet_type)
• Extract features from each column: (f1x, f2x, …, fNx, f1y, f2y, …, fNy, vizlet_type)
• Interpret each “promotion” as a yes vote and each “discard” as a no vote
• Train a (simple) model to predict vizlet type from features
• Recommend highest-scoring vizlets
• Add a diversity term to prevent a bunch of similar plots
• Incorporate score modifiers defined by the vizlet designer
– “My bar chart looks best when there are about 5 bars.”
– “My timeseries plot ignores null values”
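A minimal sketch of the loop above in Python: featurize each (x_column, y_column, vizlet_type) candidate, fit a simple classifier on past promote/discard votes, and rank fresh candidates by score. The feature choices, training data, and promote/discard framing here are assumptions for illustration, not the actual VizDeck model.

# Train on past votes, then score and rank new vizlet candidates.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def column_features(col):
    col = np.asarray(col, dtype=float)
    vals, counts = np.unique(col, return_counts=True)
    p = counts / counts.sum()
    entropy = float(-(p * np.log2(p)).sum())        # low entropy = few distinct values
    return [entropy, len(vals) / len(col), float(col.std())]

def vizlet_features(x_col, y_col, vizlet_type):
    # crude categorical encoding of the vizlet type, purely for illustration
    return column_features(x_col) + column_features(y_col) + [hash(vizlet_type) % 7]

# synthetic training set: past vizlets with promote (1) / discard (0) votes
rng = np.random.default_rng(1)
X = [vizlet_features(rng.normal(size=50), rng.normal(size=50), t)
     for t in ["scatter", "bar", "timeseries"] * 20]
votes = rng.integers(0, 2, size=len(X))

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, votes)

# score fresh candidates and recommend the highest-scoring ones
candidates = [("scatter", rng.normal(size=50), rng.normal(size=50)),
              ("bar", rng.integers(0, 3, size=50), rng.normal(size=50))]
scores = model.predict_proba([vizlet_features(x, y, t) for t, x, y in candidates])[:, 1]
for (t, _, _), s in sorted(zip(candidates, scores), key=lambda z: -float(z[1])):
    print(t, round(float(s), 2))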
Example of a Learned Rule (1)
low x-entropy => bad scatter plot
[Images: good scatter plot vs. bad scatter plot]
Example of a Learned Rule (2)
low x-entropy => histogram
[Images: bad scatter plot vs. good histogram]
Example of a Learned Rule (3)
high x-periodicity => timeseries plot
(periodicity = 1 / variance in gap length between successive values)
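The two column statistics behind these rules are easy to state in code. A small sketch follows; the thresholds for "low" and "high" are not given on the slides, so none are hard-coded here.

# Entropy of a column's values, and periodicity as defined above.
import numpy as np

def x_entropy(col):
    _, counts = np.unique(col, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def x_periodicity(col):
    gaps = np.diff(np.sort(np.asarray(col, dtype=float)))   # gaps between successive values
    var = gaps.var()
    return float("inf") if var == 0 else 1.0 / var           # 1 / variance of gap length

print(x_entropy([0, 0, 1, 1]))            # low entropy: only two distinct values
print(x_periodicity(np.arange(0, 48)))    # evenly spaced values: very high periodicity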
Voyager
Kanit “Ham” Wongsuphasawat, Dominik Moritz
InfoVis 15
Within the first few queries, you’ve
touched all the tables.
SIGMOD 2016
Shrainik Jain
http://uwescience.github.io/sqlshare/


Editor's Notes

• #4 Let me give you a brief example of a project a little further upstream that the incubation program can provide access to. This work is in the space of open data sharing platforms, along with Socrata here in Seattle, products from Google and Microsoft, and a number of other companies. Two observations motivate the products in this space. First, there's a movement toward open data that has researchers, government agencies, and even companies exposing their data assets online for use by others, for reasons of transparency, efficiency, and accountability. Even for commercial data, there are marketplaces emerging to facilitate the buying and selling of data. All of these use cases need new technology. So that's one reason. Second, if you're going to use someone else's data, you need it to be as accessible as possible. In particular, you need to help data analysts use data that "had previously been the realm of programmers and DB administrators" – here I'm quoting Benjamin Romano in an Xconomy article about Socrata. SQLShare is an open data system, but it emphasizes rich data manipulation rather than just fetch and retrieval, interoperability with external tools and existing databases, local or cloud deployments, and built-in services for data integration, profiling, and visualization. Ginger mentioned this system in her talk – we have maintained a production deployment here on campus for three years, focusing on science users. Our observation is that science use cases are a predictor for commercial use cases – businesses are beginning to use data the same way scientists always have: they collect it aggressively, torture it with analytics, and use it to make predictions about the world. So we think that if we can handle these difficult science use cases, we will also be addressing a significant commercial problem.
• #5 Solutions are emerging, powered by the open data movement. Socrata, a local Seattle company, has built a very successful business of helping cities jailbreak their data, and is now engaged in climbing the application stack to support analytics and visualization. Essentially every URL of the form data.yourcity.gov is powered by Socrata's technology. Data, People, and Infrastructure.
  • #8 and you can extend this model to the database layer to help share services like backup, recovery, caching, load balancing
  • #9 If you go up a level, you have what you might call “query as a service” – you’re querying your own data, but you might query others peoples data as well. And, even if you want to remain logically isolated, you can still benefit from services that are powered by mining everyone’s schema, data, and workload. For example, query recommendation in the past assumed a fixed schema. With this model, you can recommend “idioms” across different schemas. You can discover public datasets to join with, like Alon worked on with Fusion Tables You can recommend visualizations automatically You can automatically infer and attach metadata – semi-automatic data curation. A big globally shared data lake “Precision Medicine for Databases”
• #15 So we developed SQLShare to support a very simple workflow: you can upload data "as is" from spreadsheets or anything. It's in the cloud, so there is no need to install or design a database. You can immediately begin writing queries, right in your browser, and put queries on top of queries on top of queries. Then you can share the results online: your colleagues can browse the science questions and see the SQL that answers them. ---- Key ideas to get data in: a) Use the cloud to avoid having to install and run a database. b) Give up on the schema -- just throw your data in "as is" and do "lazy integration." c) Use some magic to automate parsing, integration, recommendations, and more. Key ideas to get data out: a) Associate science questions (in English) with each SQL query -- this makes them easy to understand and easy to find. b) Saving and reusing queries is a first-class requirement. Given an example, it's easy to modify it into an "adjacent" query. c) Expose the whole system through a REST API to make it easy to bring new client applications online.
• #16 Lots of features you can imagine here – anything you can do with a YouTube video, you should be able to do with a query: share it, rate it, "more like this", recommendations. We are exploring some of these.
  • #18 We see non-programmers who write these wonderful 40-line queries. This one does interval queries on genomic sequences. She doesn’t write any R, any Python, but she can do this, and she’s no longer dependent on staff programmers.
  • #27 If you go up a level, you have what you might call “query as a service” – you’re querying your own data, but you might query others peoples data as well. And, even if you want to remain logically isolated, you can still benefit from services that are powered by mining everyone’s schema, data, and workload. For example, query recommendation in the past assumed a fixed schema. With this model, you can recommend “idioms” across different schemas. You can discover public datasets to join with, like Alon worked on with Fusion Tables You can recommend visualizations automatically You can automatically infer and attach metadata – semi-automatic data curation. A big globally shared data lake
  • #28 Express these plans Optimize these plans Compile these plans Execute these plans
  • #29  Express these plans Optimize these plans Compile these plans Execute these plans
  • #30  So our approach is to model this overlap in capabilities as its own language. We start
  • #31 We hoist
• #44 NOTES: What optimizations does this enable? With better semantics on a hash table join with UDFs, we can do redundant computation elimination and code motion from UDFs.
• #51 Can you just run this in a database and expect good performance? Of course not. But is it a fundamentally bad idea to run it this way? Maybe not.
• #52 This is the complexity of three matrix multiply algorithms plotted against the sparsity exponent – a naïve sparse
  • #57 Now let’s do
• #62 If you can automate, you can precompute speculatively.
• #67 On Big Data: interactive visualization on the web may or may not be feasible for very big data, but a good model for visualization recommendation can enable speculative generation.
  • #74 Why do we care about lifetime? Table usage predictions for caching and partitioning. Move from reactive to proactive physical design services. Query idioms are consistent, while the data is fleeting. Not exact queries as in a streaming system, but the “methods” are reused over and over. Extracting and optimizing these idioms across tenants is our goal.
  • #76 And processing power, either as raw processor speed or via novel multi-core and many-core architectures, is also continuing to increase exponentially…
  • #77 … but human cognitive capacity is remaining constant. How can computing technologies help scientists make sense out of these vast and complex data sets?
  • #78  We want to give a little background of our project before we launch into it, so we will discuss the problem we are trying to solve. Essentially, we want to remove the speed-bump of data handling from the scientists.