Three Tools for "Human-in-the-loop" Data Science

I presented these slides as a keynote at the Enterprise Intelligence Workshop at KDD 2016 in San Francisco.

In these slides, I describe our work towards developing a Maslow's Hierarchy for Human-in-the-Loop Data Analytics!


1. Aditya Parameswaran, Assistant Professor, University of Illinois
http://data-people.cs.illinois.edu
Three Tools for “Human-in-the-loop” Data Science
2. Many, many contributors!
• PIs: Kevin Chang, Karrie Karahalios, Aaron Elmore, Sam Madden, Amol Deshpande (spanning Illinois, UMD, MIT, Chicago)
• PhD students: Mangesh Bendre, Himel Dev, John Lee, Albert Kim, Manasi Vartak, Liqi Xu, Silu Huang, Sajjadur Rahman, Stephen Macke
• MS students: Vipul Venkataraman, Tarique Siddiqui, Chao Wang, Sili Hui
• Undergrads: Paul Zhou, Ding Zhang, Kejia Jiang, Bofan Sun, Ed Xue, Sean Zou, Jialin Liu, Changfeng Liu, Xiaofo Yu
3. Scale is a Solved Problem
Most work in the database community is myopically focused on scale: the ability to pose SQL queries on larger and larger datasets.
My claim: scale is a solved problem. Findings:
– Median job size at Microsoft and Yahoo is 16GB
– >90% of the jobs within Facebook are <100GB
The bottleneck is no longer our ability to pose SQL queries on large datasets!
Of course, exceptions exist: the “1%” of data analysis needs.
4. What about the Needs of the 99%?
The bottleneck is actually the “humans-in-the-loop”.
As our data size has grown, what has stayed constant is
• the time for analysis,
• the human cognitive load, and
• the skills to extract value from data.
There is a severe need for tools that can help analysts extract value from even moderately sized datasets.
From “Big Data and its Technical Challenges”, CACM 2014: “For big data to fully reach its potential, we need to consider scale not just for the system but also from the perspective of humans. We have to make sure that the end points—humans—can properly ‘absorb’ the results of the analysis and not get lost in a sea of data.”
5. Need of the Hour: Human-in-the-Loop Data Analytics Tools
HILDA tools:
• treat both humans and data as first-class citizens
• reduce human labor
• minimize complexity
(Diagram: Interaction, Data Mining, Databases; taking the human perspective into account; going beyond SQL; scalability/interactivity is still important; magic happens here.)
6. A Maslow’s Hierarchy for HILDA
Background: Maslow developed a theory of what motivates individuals in 1943; highly influential.
(Pyramid: basic needs at the bottom, complex needs at the top.)
7. A Maslow’s Hierarchy for HILDA
(Pyramid, bottom to top: Touch & Feel; Play & View; Share & Collaborate. Increasing sophistication of analysis.)
8. Touch and Feel: DataSpread
DataSpread is a spreadsheet-database hybrid.
Goal: marrying the flexibility and ease of use of spreadsheets with the scalability and power of databases.
Enables the “99%” with large datasets but limited programming skills to open, touch, and examine their datasets.
http://dataspread.github.io [VLDB’15, VLDB’15, ICDE’16]
9. Play and View: Zenvisage
Zenvisage is an effortless visual exploration tool.
Goal: “fast-forward” to visual patterns and trends without having the analyst step through each one individually.
Enables individuals to play with, and extract insights from, large datasets in a fraction of the time.
http://zenvisage.github.io [TR’16, VLDB’16, VLDB’15, DSIA’15, VLDB’14, VLDB’14]
10. Collaborate and Share: OrpheusDB
OrpheusDB is a tool for managing dataset versions with a database.
Goal: building a versioned database system to reduce the burden of recording datasets at various stages of analysis.
Enables individuals to collaborate on data analysis, and to share, keep track of, and retrieve dataset versions.
http://orpheus-db.github.io [VLDB’16, VLDB’15, VLDB’15, TAPP’15, CIDR’15]
(Also part of DataHub, a collaborative analysis system with MIT & UMD.)
11. This Talk
About 10 minutes per system: overview + architecture + one key technical challenge.
Common theme: if you torture databases enough, you can get them to do what you want!
(Pyramid recap: Touch & Feel; Play & View; Share & Collaborate.)
12. (Section divider: DataSpread.)
13. Motivation
Most of the people doing ad-hoc data manipulation and analysis use spreadsheets, e.g., Excel. Why?
• Easy to use: direct manipulation
• Built-in visualization capabilities
• Flexible: no need for a schema
14. But Spreadsheets are Terrible!
– Slow
• a single change → wait minutes on a 10,000 x 10 spreadsheet
• can’t even open a spreadsheet with >1M cells
• speed by itself can prevent analysis
– Tedious + not Powerful
• filters via copy-paste
• only FK joins via VLOOKUPs; others impossible
• even simple operations are cumbersome
– Brittle
• sharing Excel sheets around, no collaboration/recovery
• using spreadsheets for collaboration is painful and error-prone
15. Let’s Turn to Databases
Databases fix each spreadsheet weakness:
• Slow → Scalable
• Tedious + not Powerful → Powerful and expressive (SQL)
• Brittle → Collaboration, recovery, succinct
So why not use databases? Well, for the same reason why spreadsheets are so useful:
• Easy to use → Not easy to use
• Built-in visualization → No built-in visualization
• Flexible → Not flexible
16. Combining the Benefits of Spreadsheets and Databases
Spreadsheet as a frontend interface; database as a backend engine.
Result: retain the benefits of both! But it’s not that simple…
17. Different Ideologies
Databases and spreadsheets have different ideologies that need to be reconciled; due to this, the integration is not trivial. (Databases vs. spreadsheets:)
• Data Model: schema-first vs. dynamic/no schema
• Addressing: tuples with a primary key vs. cells, using row/column
• Presentation: set-oriented, no such notion vs. a notion of a current window and order
• Modifications: must correspond to queries vs. can be done at any granularity
• Computation: a query at a time vs. a value at a time
18. First Problem: Representation
Q: how do we represent spreadsheet data?
• Dense spreadsheets: represent as tables (Row #, Col1 val, Col2 val, …)
• Sparse spreadsheets: represent as triples (Row #, Column #, Value)
19. First Problem: Representation
Q: how do we represent spreadsheet data?
Can we do even better than the two extremes? Yes!
Carve out dense areas → store as tables; sparse areas → store as triples (sketched below).
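To make the two encodings concrete, here is a small Python sketch (hypothetical code, not DataSpread's actual storage layer) of the table and triple representations, plus a simple density heuristic for choosing between them:

```python
# Hypothetical sketch, not DataSpread's actual storage layer.

def as_table(cells, r0, r1, c0, c1):
    """Dense encoding: one tuple per row; absent cells become None."""
    return [(r, [cells.get((r, c)) for c in range(c0, c1 + 1)])
            for r in range(r0, r1 + 1)]

def as_triples(cells):
    """Sparse encoding: one (row, col, value) triple per occupied cell."""
    return [(r, c, v) for (r, c), v in sorted(cells.items())]

def pick_encoding(cells, r0, r1, c0, c1, density_threshold=0.5):
    """Heuristic: store a region as a table when it is dense enough."""
    area = (r1 - r0 + 1) * (c1 - c0 + 1)
    occupied = sum(1 for (r, c) in cells if r0 <= r <= r1 and c0 <= c <= c1)
    return "table" if occupied / area >= density_threshold else "triples"

cells = {(1, 1): "x", (1, 2): "y", (2, 1): "z"}   # a tiny spreadsheet
print(pick_encoding(cells, 1, 2, 1, 2))           # -> table (3/4 occupied)
```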
20. First Problem: Representation
However, even if we only use “tables”, carving out the ideal number of partitions (minimizing storage, modification, and access costs) is NP-Hard.
Reduction from the minimum edge-length partition of rectilinear polygons.
Thankfully, we have a way out…
21. Solution: Constrain the Problem
A new class of partitionings: recursive decomposition.
A very natural class of partitionings!
22. Solution: Constrain the Problem
The optimal recursive-decomposition partitioning can be found in PTIME using dynamic programming.
However, this is still quadratic in the number of rows and columns.
Merging rows/columns with identical signatures brings this down to roughly the time for a single scan.
(A toy version of the dynamic program is sketched below.)
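A toy Python version of this dynamic program follows. The cost model (table cost vs. triple cost per rectangle) is a made-up stand-in for the paper's storage and access objective, and the signature-merging speedup is omitted; the point is only the recursive, guillotine-cut structure:

```python
# A minimal sketch (assumed cost model, not the paper's) of recursive
# decomposition: the best partitioning of a rectangle is either "store it
# whole" or the best full-width/full-height cut, found by memoized DP.
from functools import lru_cache

cells = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (3, 3): 9}   # toy sparse sheet
N_ROWS, N_COLS = 4, 4

def leaf_cost(r0, r1, c0, c1):
    """Hypothetical cost: table pays for every slot; triples pay 3/cell."""
    area = (r1 - r0 + 1) * (c1 - c0 + 1)
    occupied = sum(1 for (r, c) in cells
                   if r0 <= r <= r1 and c0 <= c <= c1)
    return min(area, 3 * occupied)

@lru_cache(maxsize=None)
def best(r0, r1, c0, c1):
    cost = leaf_cost(r0, r1, c0, c1)
    for r in range(r0, r1):     # horizontal guillotine cuts
        cost = min(cost, best(r0, r, c0, c1) + best(r + 1, r1, c0, c1))
    for c in range(c0, c1):     # vertical guillotine cuts
        cost = min(cost, best(r0, r1, c0, c) + best(r0, r1, c + 1, c1))
    return cost

print(best(0, N_ROWS - 1, 0, N_COLS - 1))   # optimal cost for the toy sheet
```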
23. Initial Progress and Architecture
• Postgres backend
• ZK spreadsheet: open-source web frontend
Comfortably scales to arbitrarily many rows and handles SQL queries. Hopefully this brings spreadsheets to the big-data age!
(Architecture diagram: spreadsheet formulae and vanilla SQL interfaces; interface-embedded queries; an interface query processor with a new interface algebra; interface-aware indexes; interface storage manager and transaction manager over the underlying data; other applications; users Sally, Bob, Sue.)
24. Standard Visual Data Analysis Recipe
1. Load the dataset into the viz tool
2. Select the viz to be generated
3. See if it matches the desired visual pattern or insight
4. Repeat until you find a match
25. Tedious and Time-consuming!
26. Key Issue
Visualizations can be generated by
• varying subsets of the data, and
• varying the attributes being visualized.
Too many visualizations to look through to find the desired visual patterns!
27. Motivation
This is a real problem!
• Advertisers at Turn: find keywords with similar CTRs to a specific one
• Bioinformaticians at an NIH genomics center: find aspects on which two sets of genes differ
• Battery scientists at CMU: find solvents with desired properties
Common theme: finding the “right” visualization can take several hours of combing through visualizations manually.
28. Key Insight
We can automate that! Instead of combing through visualizations manually, tell us what you want, and we can “fast-forward” to the desired insights.
Desiderata for automation:
• Expressive: the ability to specify what you want
• Interactive: interact with the results, catering to non-programmers
• Scalable: get interesting results quickly
Enter Zenvisage (zen + envisage: to effortlessly visualize).
29. Effortless Visual Exploration of Large Datasets with Zenvisage
Ingredients:
• Drag-and-drop and sketch-based interactions, to find specific patterns
• A sophisticated visual exploration language, ZQL, to ask more elaborate questions
• A scalable visualization generation engine: preprocessing, batching, and parallel evaluation for interactive results
• Rapid pattern-matching algorithms: sampling-based techniques
30. Screenshots
(Interface screenshot, annotated: attribute selection, sketching canvas, matches, typical trends and outliers, and the ZQL advanced exploration interface.)
31. Screenshots
(Further interface screenshots.)
32. Challenges: One Specific Instance
Find visualizations on which two groups of data differ most. Examples:
• find visualizations where solvent x differs from solvent y
• find visualizations where product x differs from product y
We represent a visualization using [d, m, f]:
• dimension (d) = x axis
• measure (m) = y axis
• function (f) = aggregate applied to the y axis
Each [d, m, f] on a specific subset of data can be computed using a single SQL query (sketched below).
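As a concrete illustration, here is a small Python helper that maps a [d, m, f] triple and a data subset to its SQL query; the table, column, and predicate names are hypothetical, not Zenvisage's actual schema:

```python
# A minimal sketch of the [d, m, f] -> SQL mapping; names are hypothetical.

def viz_query(d, m, f, table, predicate):
    """SQL that computes the visualization [d, m, f] over one data subset."""
    return f"SELECT {d}, {f}({m}) FROM {table} WHERE {predicate} GROUP BY {d}"

# One query per group being compared:
print(viz_query("week", "sales", "AVG", "products", "product = 'x'"))
print(viz_query("week", "sales", "AVG", "products", "product = 'y'"))
```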
33. Challenge: One Specific Instance
Find visualizations on which two groups of data differ most.
Naïve approach: for each [d, m, f], compute the visualization for both products (two SQL queries), then compare; pick the k best (“highest utility”) [d, m, f].
Utility metric: we ignore how to compare for now, but there are many standard distance metrics.
Scale: tens of dimensions, tens of measures, and a handful of aggregates → hundreds of queries for a single user task! (The naïve loop is sketched below.)
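A Python sketch of this naïve enumeration makes the query blowup explicit; `run_sql` and `utility` are placeholder stand-ins for a real DBMS call and a real distance metric:

```python
# A sketch of the naïve baseline: two SQL queries per candidate [d, m, f].

dims = ["week", "region"]        # tens of these in practice
measures = ["sales", "profit"]   # tens of these too
aggs = ["AVG", "SUM"]

def run_sql(query):              # placeholder: would hit the DBMS
    return [1.0, 2.0, 3.0]

def utility(a, b):               # placeholder: e.g., Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

scored = []
for d in dims:
    for m in measures:
        for f in aggs:           # 2 queries per candidate visualization
            vx = run_sql(f"SELECT {d}, {f}({m}) FROM t WHERE grp = 'x' GROUP BY {d}")
            vy = run_sql(f"SELECT {d}, {f}({m}) FROM t WHERE grp = 'y' GROUP BY {d}")
            scored.append((utility(vx, vy), (d, m, f)))

k = 3
top_k = sorted(scored, reverse=True)[:k]   # the k highest-utility candidates
```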
34. Issues with the Naïve Approach
• Repeated processing of the same data in sequence across queries → addressed by sharing
• Computation wasted on low-utility visualizations → addressed by pruning
35. Sharing Optimizations
1. Minimize the number of queries: group queries together (sketched below).
• Combine multiple aggregates: (d1, m1, f1), (d1, m2, f1) → (d1, [m1, m2], f1)
• Combine multiple group-bys: (d1, m1, f1), (d2, m1, f1) → ([d1, d2], m1, f1)
2. Minimize sequential execution: parallel query evaluation. A bit tricky!
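A minimal Python sketch of the first rewrite: (d, m, f) triples that share a dimension collapse into one multi-aggregate query, so the data is scanned once per dimension instead of once per triple (names are illustrative):

```python
# Aggregate-combining rewrite: group triples by shared dimension.
from collections import defaultdict

def combine_aggregates(triples, table):
    by_dim = defaultdict(list)
    for d, m, f in triples:
        by_dim[d].append(f"{f}({m})")
    return [f"SELECT {d}, {', '.join(aggs)} FROM {table} GROUP BY {d}"
            for d, aggs in by_dim.items()]

queries = combine_aggregates(
    [("d1", "m1", "AVG"), ("d1", "m2", "AVG"), ("d2", "m1", "AVG")], "products")
# Two queries instead of three:
#   SELECT d1, AVG(m1), AVG(m2) FROM products GROUP BY d1
#   SELECT d2, AVG(m1) FROM products GROUP BY d2
```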
36. Pruning Optimizations
• Keep running estimates of utility.
• Prune visualizations based on the estimates. Two flavors:
– vanilla confidence-interval-based pruning
– multi-armed-bandit pruning
Discard low-utility views early to avoid wasted computation (the confidence-interval flavor is sketched below).
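Here is a hypothetical sketch (not the actual Zenvisage code) of the confidence-interval flavor: a visualization whose upper confidence bound on utility falls below the k-th best lower bound can never make the top k, so it is safe to discard early:

```python
# Confidence-interval pruning over running utility estimates.
import math

def prunable(estimates, counts, k, z=2.0):
    """estimates/counts: running mean utility and #samples per viz id."""
    half_width = {v: z / math.sqrt(counts[v]) for v in estimates}
    lower = {v: estimates[v] - half_width[v] for v in estimates}
    kth_best_lower = sorted(lower.values(), reverse=True)[k - 1]
    return {v for v in estimates
            if estimates[v] + half_width[v] < kth_best_lower}

est = {"viz1": 0.9, "viz2": 0.5, "viz3": 0.1}
cnt = {"viz1": 100, "viz2": 100, "viz3": 100}
print(prunable(est, cnt, k=1))   # {'viz3'}: its upper bound 0.3 < viz1's lower bound 0.7
```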
37. (Architecture diagram: the viz interface sends hundreds of visualization queries to a middleware layer, whose optimizer applies sharing and pruning before the queries reach the DBMS; visualizations flow back to the interface.)
38. Experimental Findings
Up to 300X speedup: under 1s for the small and medium datasets, 4s for the large one.
39. Effortless Visual Exploration of Large Datasets with Zenvisage (recap)
Ingredients:
• Drag-and-drop and sketch-based interactions
• A sophisticated visual exploration language, ZQL
• A scalable visualization generation engine
• Rapid pattern-matching algorithms
40. (Section divider: OrpheusDB.)
41. Motivation
Collaborative data science is ubiquitous: many users, and many versions of the same dataset stored at many stages of analysis.
Status quo: versions are stored in a file system, with their relationships unknown.
Challenge: can we build a versioned data store that supports efficient access, retrieval, querying, and modification of versions?
42. Motivation: Starting Points
• VCS: Git/svn is inefficient and unsuitable
– ordered semantics
– no data manipulation API
– no efficient multi-version queries
– poor support for massive files
• DBMS: relational databases don’t support versioning, but are efficient and scalable
43. OrpheusDB: Current Focus
PostgreSQL + versioning commands.
44. Challenge: Storing Versions Compactly / Retrieving Versions Quickly
Thousands of versions, spanning millions of records.
• Store all versions independently: huge storage, but version access time is very small.
• Store one version and all others via chains of “deltas”: very small storage, but version access time is high.
(A toy sketch of the delta-chain extreme appears below.)
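The trade-off is easy to see in a toy Python sketch (hypothetical, not OrpheusDB's storage layer): a delta chain stores only the changes each version makes, but checking out a version means replaying the whole chain:

```python
# Delta-chain extreme: small storage, slow checkout.

deltas = {}   # version id -> (parent version id or None, {record id: value})

def commit(vid, parent, changes):
    deltas[vid] = (parent, dict(changes))

def checkout(vid):
    """Materialize a version by replaying its delta chain from the root."""
    chain = []
    while vid is not None:
        parent, changes = deltas[vid]
        chain.append(changes)
        vid = parent
    records = {}
    for changes in reversed(chain):   # apply oldest deltas first
        records.update(changes)
    return records

commit("v1", None, {1: "a", 2: "b"})
commit("v2", "v1", {2: "b2"})
print(checkout("v2"))   # {1: 'a', 2: 'b2'}
```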
45. And Answer Queries…
• Retrieve the first version that contains this tuple
• Find versions where average(salary) is greater than 1000
• Find all pairs of versions where over 100 new tuples were added
• Show the history of the tuple with record id 34
For more examples, see [TAPP’15]. (A plausible relational encoding for such queries is sketched below.)
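One plausible relational encoding for such queries, shown as SQL strings in Python (an assumption for illustration, not necessarily OrpheusDB's exact schema), keeps one row per distinct record with a Postgres array column listing the versions that contain it:

```python
# Hypothetical schema: records(rid, salary, ..., vlist int[]) and versions(vid).

avg_salary_versions = """
SELECT v.vid
FROM   versions v
JOIN   records r ON v.vid = ANY (r.vlist)   -- record appears in version
GROUP  BY v.vid
HAVING AVG(r.salary) > 1000;
"""

first_version_with_record_34 = """
SELECT MIN(v.vid)                           -- assuming vids increase over time
FROM   versions v
JOIN   records r ON v.vid = ANY (r.vlist)
WHERE  r.rid = 34;
"""
```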
46. Framework
(Layer diagram, top to bottom:)
• User interface layer: git commands, or SQL (versions as relations)
• “Versioning” layer (translation/bookkeeping): parser & translator, layout optimizer
• Unmodified Postgres backend (not aware of versions)
47. Summary: Make Data Analytics Great Again!
(Pyramid recap, with increasing sophistication of analysis: Touch & Feel → dataspread.github.io; Play & View → zenvisage.github.io; Share & Collaborate → orpheus-db.github.io.)
My website: http://data-people.cs.illinois.edu
Twitter: @adityagp
