I presented these slides as a keynote at the Enterprise Intelligence Workshop at KDD 2016 in San Francisco.
In these slides, I describe our work towards developing a Maslow's Hierarchy for Human-in-the-Loop Data Analytics!
Many many contributors!
• PIs: Kevin Chang, Karrie Karahalios, Aaron Elmore, Sam Madden, Amol Deshpande (spanning Illinois, UMD, MIT, Chicago)
• PhD Students: Mangesh Bendre, Himel Dev, John Lee, Albert Kim, Manasi Vartak, Liqi Xu, Silu Huang, Sajjadur Rahman, Stephen Macke
• MS Students: Vipul Venkataraman, Tarique Siddiqui, Chao Wang, Sili Hui
• Undergrads: Paul Zhou, Ding Zhang, Kejia Jiang, Bofan Sun, Ed Xue, Sean Zou, Jialin Liu, Changfeng Liu, Xiaofo Yu
Scale is a Solved Problem
Most work in the database community is myopically focused on
scale: the ability to pose SQL queries on larger and larger datasets.
My claim:
Scale is a solved problem.
Findings:
– Median job size at Microsoft and Yahoo is 16GB;
– >90% of the jobs within Facebook are <100GB
The bottleneck is no longer our ability to pose SQL queries on
large datasets!
Of course, exceptions exist: the “1%” of data analysis needs
What about the Needs of the 99%?
The bottleneck is actually the “humans-in-the-loop”
As our data size has grown, what has stayed constant is
• the time for analysis,
• the human cognitive load,
• the skills to extract value from data
There is a severe need for tools that can help analysts extract
value from even moderately sized datasets
From “Big Data and Its Technical Challenges”, CACM 2014:
For big data to fully reach its potential, we need to consider scale not just for the system but also from the perspective of humans. We have to make sure that the end points—humans—can properly “absorb” the results of the analysis and not get lost in a sea of data.
Need of the hour: Human-in-the-Loop Data Analytics Tools
HILDA tools:
• treat both humans and data
as first-class citizens
• reduce human labor
• minimize complexity
[Venn diagram: HILDA sits at the intersection of Interaction (“taking the human perspective into account”), Data Mining (“go beyond SQL”), and Databases (“scalability/interactivity is still important”); the magic happens at the intersection]
A Maslow’s Hierarchy for HILDA
Background: in 1943, Maslow developed a highly influential theory of what motivates individuals.
[Pyramid: Basic Needs at the bottom, Complex Needs at the top]
A Maslow’s Hierarchy for HILDA
[Pyramid, bottom to top: Touch & Feel, Play & View, Share & Collaborate; increasing sophistication of analysis]
Touch and Feel:
DataSpread is a spreadsheet-database hybrid:
Goal: Marrying the flexibility and ease of use of
spreadsheets with the scalability and power of databases
Enables the “99%” with large datasets but limited prog.
skills to open, touch, and examine their datasets
http://dataspread.github.io
[VLDB’15, VLDB’15, ICDE’16]
Play and View:
Zenvisage is an effortless visual exploration tool.
Goal: “fast-forward” to visual patterns and trends, without having the analyst step through each one individually
Enables individuals to play with, and extract insights from, large datasets in a fraction of the time.
http://zenvisage.github.io
[TR’16, VLDB’16, VLDB’15, DSIA’15, VLDB’14, VLDB’14]
Collaborate and Share:
OrpheusDB is a tool for managing dataset versions with a database
Goal: building a versioned database system to reduce the burden of
recording datasets in various stages of analysis
Enables individuals to collaborate on data analysis, and share, keep
track of, and retrieve dataset versions.
http://orpheus-db.github.io
[VLDB’16, VLDB’15, VLDB’15, TAPP’15, CIDR’15]
(also part of datahub, a collaborative analysis system with MIT & UMD)
This talk
About 10 minutes per system:
overview + architecture + one key technical challenge
Common theme: if you torture databases enough, you can get them to do
what you want!
[Roadmap pyramid, bottom to top: Touch & Feel, Play & View, Share & Collaborate; increasing sophistication of analysis]
Motivation
Most of the people doing ad-hoc data
manipulation and analysis use spreadsheets,
e.g., Excel
Why?
• Easy to use: direct manipulation
• Built-in visualization capabilities
• Flexible: no need for a schema
But Spreadsheets are Terrible!
– Slow
• a single change can mean waiting minutes on a 10,000 x 10 spreadsheet
• can’t even open a spreadsheet with >1M cells
• speed by itself can prevent analysis
– Tedious + not Powerful
• filters via copy-paste
• only FK joins, via VLOOKUPs; others are impossible
• even simple operations are cumbersome
– Brittle
• sharing Excel sheets around, no collaboration/recovery
• using spreadsheets for collaboration is painful and error-prone
Let’s turn to Databases
Databases are:
• not slow: scalable
• not tedious or underpowered: powerful and expressive (SQL)
• not brittle: collaboration, recovery, succinctness
So why not use databases?
Well, for the same reason why spreadsheets are so useful, databases are:
• not easy to use
• lacking built-in visualization
• not flexible
Combining the benefits of
spreadsheets and databases
Spreadsheet as a frontend interface
Databases as a backend engine
Result: retain the benefits of both!
But it’s not that simple…
Different Ideologies
Databases and spreadsheets have different
ideologies that need to be reconciled…
Due to this, the integration is not trivial…
Feature        | Databases                     | Spreadsheets
Data Model     | Schema-first                  | Dynamic / no schema
Addressing     | Tuples, with primary keys     | Cells, using row/column
Presentation   | Set-oriented; no such notion  | Notion of current window, order
Modifications  | Must correspond to queries    | Can be done at any granularity
Computation    | A query at a time             | A value at a time
First Problem: Representation
Q: how do we represent spreadsheet data?
Dense spreadsheets: represent as tables
(Row #, Col1 val, Col2 val, …)
Sparse spreadsheets: represent as triples
(Row #, Column #, Value)
First Problem: Representation
Q: how do we represent spreadsheet data?
Can we do even better than the two extremes? Yes!
Carve out dense areas and store them as tables; store sparse areas as triples (sketched below).
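To make the two layouts concrete, here is a minimal sketch of the dense (table) and sparse (triple) representations; the table and column names (sheet_dense, sheet_sparse) are illustrative assumptions, not DataSpread's actual schema, and sqlite3 stands in for the Postgres backend:

```python
import sqlite3  # stand-in for the Postgres backend

conn = sqlite3.connect(":memory:")

# Dense region: one relational row per spreadsheet row, one column per sheet column.
conn.execute("""
    CREATE TABLE sheet_dense (
        row_num INTEGER PRIMARY KEY,
        col_a   TEXT,
        col_b   TEXT,
        col_c   TEXT
    )
""")

# Sparse region: one (row, column, value) triple per non-empty cell.
conn.execute("""
    CREATE TABLE sheet_sparse (
        row_num INTEGER,
        col_num INTEGER,
        value   TEXT,
        PRIMARY KEY (row_num, col_num)
    )
""")

# Fetching the visible window (rows 100-150) touches only the relevant layout.
dense_window = conn.execute(
    "SELECT * FROM sheet_dense WHERE row_num BETWEEN 100 AND 150").fetchall()
sparse_window = conn.execute(
    "SELECT * FROM sheet_sparse WHERE row_num BETWEEN 100 AND 150").fetchall()
```

The hybrid scheme would route each carved-out region of the sheet to one of these two layouts, paying for empty cells only where the region is dense.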
First Problem: Representation
However, even if we only use “tables”, carving out the ideal set of partitions (minimizing storage, modification, and access costs) is NP-hard.
Reduction from minimum edge-length partitioning of rectilinear polygons.
Thankfully, we have a way out…
Solution: Constrain the Problem
A new class of partitionings: recursive decomposition
A very natural class of partitionings!
Solution: Constrain the Problem
The optimal recursive decomposition partitioning can be found in PTIME using dynamic programming.
This is still quadratic in the number of rows and columns.
Merging rows/columns with identical signatures brings this down to roughly the time for a single scan.
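As a small illustration of the signature-merging step (the dynamic program itself is in the papers), consecutive rows with the same fill pattern can be collapsed into one weighted row before the DP runs; this sketch and its data are illustrative assumptions:

```python
from itertools import groupby

# A sheet region as a list of rows; each row is a tuple of booleans marking filled cells.
def merge_identical_rows(rows):
    """Collapse runs of rows with identical fill signatures into (signature, count) pairs.

    The partitioning DP only needs to distinguish rows whose fill patterns differ,
    so runs of identical rows can be treated as a single weighted row.
    """
    merged = []
    for signature, group in groupby(rows):
        merged.append((signature, sum(1 for _ in group)))
    return merged

rows = [
    (True,  True,  False),
    (True,  True,  False),
    (True,  True,  False),
    (False, False, True),
    (False, False, True),
]
print(merge_identical_rows(rows))
# [((True, True, False), 3), ((False, False, True), 2)]
```

The same trick applies to columns, which is what shrinks the DP's effective input to roughly a single scan's worth of work.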
Initial Progress and Architecture
Postgres backend
ZK Spreadsheet
• an open-source web frontend
Comfortably scales to arbitrarily many rows, and handles SQL queries
Hopefully this will bring spreadsheets to the big data age!
[Architecture diagram: users (Sally, Bob, Sue) and other applications interact through a spreadsheet frontend, issuing spreadsheet formulae, vanilla SQL, and queries in a new interface algebra; these interface-embedded queries are handled by an interface query processor, interface-aware indexes, an interface storage manager, and an interface transaction manager over the underlying data]
Standard Visual Data Analysis Recipe:
1. Load dataset into viz tool
2. Select viz to be generated
3. See if it matches desired visual
pattern or insight
4. Repeat until you find a match
Key Issue:
Visualizations can be generated by
• varying subsets of data, and
• varying attributes being visualized
Too many visualizations to look at to find
desired visual patterns!
Motivation
This is a real problem!
• Advertisers at Turn
– find keywords with similar CTRs to a specific one
• Bioinformaticians at an NIH genomics center
– find aspects on which two sets of genes differ
• Battery scientists at CMU
– find solvents with desired properties
Common theme: finding the “right” visualization can take
several hours of combing through visualizations manually.
Key Insight
We can automate that!
• instead of combing through visualizations manually
• tell us what you want, and we can “fast-forward” to desired insights
Desiderata for automation:
• Expressive – the ability to specify what you want
• Interactive – interact with the results, catering to non-programmers
• Scalable – get interesting results quickly
Enter Zenvisage:
(zen + envisage: to effortlessly visualize)
Effortless Visual Exploration of Large Datasets with Zenvisage
Ingredients
• Drag-and-drop and sketch-based interactions
• to find specific patterns
• A sophisticated visual exploration language, ZQL
• to ask more elaborate questions
• A scalable visualization generation engine
• preprocessing, batched and parallel evaluation for interactive results
• Rapid pattern-matching algorithms (a small sketch follows this list)
• sampling-based techniques
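As one illustration of the pattern-matching ingredient, here is a minimal sketch that matches a sketched trend against candidate visualizations by z-normalizing each series and ranking by Euclidean distance; the normalization, distance measure, and data are assumptions for illustration, not necessarily what Zenvisage ships:

```python
import math

def normalize(series):
    """Z-normalize a series so matching focuses on shape, not absolute scale."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series)) or 1.0
    return [(v - mean) / std for v in series]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_similarity(sketch, candidates, k=3):
    """Return the k candidate visualizations closest to the user's sketched pattern."""
    sketch_n = normalize(sketch)
    scored = [(name, distance(sketch_n, normalize(series)))
              for name, series in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy usage: a rising sketch matched against three candidate trends.
sketch = [1, 2, 3, 4, 5]
candidates = {
    "keyword_A": [10, 12, 15, 19, 25],   # rising: should rank first
    "keyword_B": [25, 19, 15, 12, 10],   # falling
    "keyword_C": [15, 15, 15, 15, 15],   # flat
}
print(rank_by_similarity(sketch, candidates))
```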
Challenges: One Specific Instance
Find visualizations on which two groups of data differ most.
Examples:
• find visualizations where solvent x differs from solvent y
• find visualizations where product x differs from product y
We represent a visualization using [d, m, f]
• dimension = x axis
• measure = y axis
• function = aggregate applied to y
Each [d,m,f] on a specific subset of data can be computed using a
single SQL query.
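Concretely, a [d, m, f] visualization over one subset of the data corresponds to a grouped aggregate query; the table and predicate below (sales, product = 'chair') are hypothetical:

```python
def visualization_query(d, m, f, table="sales", predicate="product = 'chair'"):
    """Build the single SQL query behind the [d, m, f] visualization of one data subset."""
    return (f"SELECT {d} AS x, {f}({m}) AS y "
            f"FROM {table} WHERE {predicate} GROUP BY {d} ORDER BY {d}")

# e.g. x = year, y = AVG(profit), restricted to product 'chair'
print(visualization_query(d="year", m="profit", f="AVG"))
```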
Challenge: One Specific Instance
Find visualizations on which two groups of data differ most.
Naïve approach:
For each [d, m, f]:
Compute visualization for both products (two SQL queries),
then compare
Pick k best (“highest utility”) [d, m, f]
Utility Metric: we ignore how to compare for now, but there are many standard distance metrics.
Scale: 10s of dimensions, 10s of measures, and a handful of aggregates → 100s of queries for a single user task!
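As a sketch, the naive approach translates directly into a nested loop over candidate [d, m, f] triples, issuing two queries per triple and keeping the k best by some distance-based utility; all names (run_viz, utility, the group labels) are illustrative assumptions:

```python
import heapq

def naive_search(dims, measures, funcs, run_viz, utility, k=5):
    """Evaluate every [d, m, f], compare the two groups, and keep the k highest-utility ones.

    run_viz(d, m, f, group) is assumed to issue one SQL query and return a y-series;
    utility(a, b) is any standard distance between the two resulting series.
    """
    scored = []
    for d in dims:
        for m in measures:
            for f in funcs:
                viz_x = run_viz(d, m, f, group="product_x")   # one SQL query
                viz_y = run_viz(d, m, f, group="product_y")   # a second SQL query
                scored.append((utility(viz_x, viz_y), (d, m, f)))
    return heapq.nlargest(k, scored)
```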
Issues w/ Naïve Approach
• Repeated processing of the same data in sequence across queries → Sharing (sketched below)
• Computation wasted on low-utility visualizations → Pruning
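To illustrate the sharing idea, several [d, m, f] visualizations over the same dimension and subsets can be served by one pass over the data, by fusing their aggregates into a single grouped query; table and column names here are hypothetical:

```python
def shared_query(d, aggregates, table="sales",
                 predicate="product IN ('chair', 'table')"):
    """Fuse several (measure, function) pairs into one grouped query,
    instead of issuing one SQL query per visualization."""
    select_list = ", ".join(f"{f}({m}) AS {f.lower()}_{m}" for m, f in aggregates)
    return (f"SELECT product, {d} AS x, {select_list} "
            f"FROM {table} WHERE {predicate} GROUP BY product, {d}")

# One scan now serves six visualizations (3 aggregates x 2 products) on dimension 'year'.
print(shared_query("year", [("profit", "AVG"), ("sales", "SUM"), ("quantity", "COUNT")]))
```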
Pruning Optimizations
• Keep running estimates of utility
• Prune visualizations based on estimates:
Two flavors
– Vanilla Confidence Interval based Pruning
– Multi-armed Bandit Pruning
Discard low-utility views early to avoid wasted computation
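A minimal sketch of the confidence-interval flavor, assuming utilities are estimated from progressively larger samples: a visualization is dropped once the upper bound of its interval falls below the lower bounds of the current top-k. The Hoeffding-style interval width used here is a generic assumption, not necessarily the bound used in Zenvisage:

```python
import math

def prune(estimates, counts, k, delta=0.05):
    """Return the visualization ids that can be safely discarded.

    estimates: dict id -> running utility estimate (assumed to lie in [0, 1])
    counts:    dict id -> number of samples seen so far for that visualization
    """
    # Generic Hoeffding-style half-width; shrinks as more samples are seen.
    def half_width(n):
        return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

    bounds = {vid: (estimates[vid] - half_width(counts[vid]),
                    estimates[vid] + half_width(counts[vid]))
              for vid in estimates}
    # k-th largest lower bound among the current candidates.
    kth_lower = sorted((lo for lo, _ in bounds.values()), reverse=True)[k - 1]
    # Anything whose upper bound cannot reach the current top-k is pruned.
    return [vid for vid, (_, hi) in bounds.items() if hi < kth_lower]

estimates = {"viz1": 0.9, "viz2": 0.5, "viz3": 0.1}
counts = {"viz1": 200, "viz2": 200, "viz3": 200}
print(prune(estimates, counts, k=1))  # prunes viz2 and viz3: their upper bounds fall below viz1's lower bound
```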
Motivation
Collaborative data science is
ubiquitous
• Many users, many versions of the
same dataset stored at many
stages of analysis
• Status quo:
– Stored in a file system, relationships
unknown
Challenge: can we build a versioned
data store?
– Support efficient access, retrieval,
querying, and modification of
versions
Motivation: Starting Points
• VCS: Git/SVN are inefficient and unsuitable
– Ordered semantics
– No data manipulation API
– No efficient multi-version queries
– Poor support for massive files
• DBMS: Relational databases don’t support
versioning, but are efficient and scalable
Challenge: Storing Versions Compactly / Retrieving Versions Quickly
1000s of versions, spanning millions of records.
• Store all versions independently: huge storage, but version access time is very small
• Store one version and derive all others via chains of “deltas”: very small storage, but version access time is high
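One middle-ground layout stores each distinct record once and keeps a separate version-membership table, so checkout is a join rather than a delta replay; the schema below is an illustrative assumption (sqlite3 stands in for a relational backend), not OrpheusDB's exact design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Every distinct record is stored exactly once, identified by a record id (rid).
conn.execute("CREATE TABLE data (rid INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# Membership table: which records belong to which version.
# (Array-valued columns or bitmaps could be used instead; a plain mapping keeps it simple.)
conn.execute("CREATE TABLE version_records (vid INTEGER, rid INTEGER)")

conn.executemany("INSERT INTO data VALUES (?, ?, ?)",
                 [(1, "alice", 900), (2, "bob", 1200), (3, "carol", 1500)])
# Version 1 contains records 1 and 2; version 2 adds record 3 without copying 1 and 2.
conn.executemany("INSERT INTO version_records VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3)])

# Checking out version 2 is a single join, not a replay of a delta chain.
rows = conn.execute("""
    SELECT d.* FROM data d
    JOIN version_records v ON d.rid = v.rid
    WHERE v.vid = 2
""").fetchall()
print(rows)
```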
And Answer Queries…
• Retrieve the first version that contains this tuple
• Find versions where the average(salary) is
greater than 1000
• Find all pairs of versions where over 100 new
tuples were added
• Show the history of the tuple with record id 34.
For more examples, see [TAPP’15]
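Against the illustrative record/version-membership layout sketched above (again, an assumed schema, not OrpheusDB's exact one), two of these queries become ordinary SQL:

```python
import sqlite3

# Same illustrative layout and toy data as the storage sketch above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (rid INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.execute("CREATE TABLE version_records (vid INTEGER, rid INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)",
                 [(1, "alice", 900), (2, "bob", 1200), (3, "carol", 1500)])
conn.executemany("INSERT INTO version_records VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3)])

# "Retrieve the first version that contains this tuple" (here, rid = 3).
print(conn.execute(
    "SELECT MIN(vid) FROM version_records WHERE rid = 3").fetchone())  # (2,)

# "Find versions where the average(salary) is greater than 1000."
print(conn.execute("""
    SELECT v.vid, AVG(d.salary) AS avg_salary
    FROM version_records v
    JOIN data d ON d.rid = v.rid
    GROUP BY v.vid
    HAVING AVG(d.salary) > 1000
""").fetchall())  # [(1, 1050.0), (2, 1200.0)]
```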
Summary:
Make Data Analytics Great Again!
[Pyramid, bottom to top: Touch & Feel (dataspread.github.io), Play & View (zenvisage.github.io), Share & Collaborate (orpheus-db.github.io); increasing sophistication of analysis]
My website: http://data-people.cs.illinois.edu
Twitter: @adityagp