SlideShare a Scribd company logo
Aditya Parameswaran
Assistant Professor
University of Illinois
http://data-people.cs.illinois.edu
ThreeTools for
“Human-in-the-loop”
Data Science
Many many contributors!
• PIs: Kevin Chang, Karrie Karahalios,Aaron Elmore, Sam Madden, Amol
Deshpande (Spanning Illinois,UMD, MIT,Chicago)
• PhD Students: Mangesh Bendre, Himel Dev, John Lee, Albert Kim,
ManasiVartak, Liqi Xu, Silu Huang, Sajjadur Rahman, Stephen Macke
• MS Students:VipulVenkataraman,Tarique Siddiqui, ChaoWang, Sili Hui
• Undergrads: Paul Zhou, Ding Zhang, Kejia Jiang, Bofan Sun, Ed Xue, Sean
Zou, Jialin Liu, Changfeng Liu, XiaofoYu
2
Scale is a Solved Problem
Most work in the database community is myopically focused on
scale: the ability to pose SQL queries on larger and larger datasets.
My claim:
Scale is a solved problem.
Findings:
– Median job size at Microsoft andYahoo is 16GB;
– >90% of the jobs within Facebook are <100GB
The bottleneck is no longer our ability to pose SQL queries on
large datasets!
Of course, exceptions exist: the “1%” of data analysis needs
3
What about the Needs of the 99%?
The bottleneck is actually the “humans-in-the-loop”
As our data size has grown, what has stayed constant is
• the time for analysis,
• the human cognitive load,
• the skills to extract value from data
There is a severe need for tools that can help analysts extract
value from even moderately sized datasets
From “Big data and and itsTechnical Challenges”, CACM 2014
For big data to fully reach its potential, we need to consider scale not just for the system but
also from the perspective of humans.We have to make sure that the end points—humans—
can properly “absorb” the results of the analysis and not get lost in a sea of data.
4
Need of the hour: Human-In-the-Loop
Data AnalyticsTools
HILDA tools:
• treat both humans and data
as first-class citizens
• reduce human labor
• minimize complexity
Interaction Data Mining
Databases
Taking the human
perspective into
account
Go beyond SQL
Scalability/Interacti
vity is still
important
Magic happens here
5
A Maslow’s Hierarchy for HILDA
Background: Maslow developed a theory for what motivates
individuals in 1943; highly influential
Complex Needs
Basic Needs
6
A Maslow’s Hierarchy for HILDA
Share &
Collaborate
Play &
View
Touch &
Feel
Increasingsophisticationofanalysis
7
Touch and Feel:
DataSpread is a spreadsheet-database hybrid:
Goal: Marrying the flexibility and ease of use of
spreadsheets with the scalability and power of databases
Enables the “99%” with large datasets but limited prog.
skills to open, touch, and examine their datasets
http://dataspread.github.io
[VLDB’15,VLDB’15,ICDE’16]
8
Play andView:
Zenvisage is effortless visual exploration tool.
Goal: “fast-forward” to visual patterns, trends, without
having analyst step through each one individually
Enables individuals to play with, and extract insights
from large datasets at a fraction of the time.
http://zenvisage.github.io
[TR’16,VLDB’16,VLDB’15,DSIA’15,VLDB’14,VLDB’14]
9
Collaborate and Share:
OrpheusDB is a tool for managing dataset versions with a database
Goal: building a versioned database system to reduce the burden of
recording datasets in various stages of analysis
Enables individuals to collaborate on data analysis, and share, keep
track of, and retrieve dataset versions.
http://orpheus-db.github.io
[VLDB’16,VLDB’15,VLDB’15,TAPP’15,CIDR’15]
(also part of : a collab. analysis system w/ MIT & UMD)
datahub
10
This talk
About 10 minutes per system:
overview + architecture + one key technical challenge
Common theme: if you torture databases enough, you can get them to do
what you want!
Share &
Collaborate
Play &
View
Touch &
Feel
Increasingsophisticationofanalysis
11
12
Motivation
Most of the people doing ad-hoc data
manipulation and analysis use spreadsheets,
e.g., Excel
Why?
• Easy to use: direct manipulation
• Built-in visualization capabilities
• Flexible: no need for a schema
13
But Spreadsheets areTerrible!
– Slow
• single change  wait minutes on a 10,000 x 10 spreadsheet
• can’t even open a spreadsheet with >1M cells
• speed by itself can prevent analysis
– Tedious + not Powerful
• filters via copy-paste
• only FK joins viaVLOOKUPs; others impossible
• even simple operations are cumbersome
– Brittle
• sharing excel sheets around, no collab/recovery
• using spreadsheets for collaboration is painful and error-prone
14
Let’s turn to Databases
Databases are:
• Slow Scalable
• Tedious + not Powerful Powerful and expressive (SQL)
• Brittle Collaboration, recovery, succinct
So why not use databases?
Well, for the same reason why spreadsheets are so useful:
• Easy to use Not easy to use
• Built-in visualization No built-in visualization
• Flexible Not flexible
15
Combining the benefits of
spreadsheets and databases
Spreadsheet as a frontend interface
Databases as a backend engine
Result: retain the benefits of both!
But it’s not that simple…
16
Different Ideologies
Databases and spreadsheets have different
ideologies that need to be reconciled…
Due to this, the integration is not trivial…
Feature Databases Spreadsheets
Data Model Schema-first Dynamic/No Schema
Addressing Tuples with PK Cells, using Row/Col
Presentation Set-oriented, no such
notion
Notion of current window,
order
Modifications Must correspond to
queries
Can be done at any
granularity
Computation Query at a time Value at a time
17
First Problem: Representation
Q: how do we represent spreadsheet data?
Dense spreadsheets: represent as tables
(Row #, Col1 val, Col2 val, …)
Sparse spreadsheets: represent as triples
(Row #, Column #,Value)
18
First Problem: Representation
Q: how do we represent spreadsheet data?
Can we do even better than the two
extremes?Yes!
Carve out
dense areas  store as tables,
sparse areas  store as triples
19
First Problem: Representation
However, even if we only use “tables”, carving out
the ideal # partitions (min. storage, modif., access)
is NP-Hard
Reduction from min. edge-length partition of
rectilinear polygons
Thankfully, we have a way out…
20
Solution: Constrain the Problem
A new class of partitionings: recursive decomp.
A very natural class of partitionings! 21
Solution: Constrain the Problem
The optimal recursive
decomp. partitioning can be
found in PTIME using DP
 Still quadratic in # rows,
columns 
Merge rows/columns with
identical signatures
~ the time for a single scan
22
Initial Progress and Architecture
Postgres backend
ZK spreadsheet
• open-source web
frontend
Comfortably scales to
arbitrarily many rows
+ handle SQL queries
Hopefully bring
spreadsheets to the big
data age!
Underlying Data
Interface-Embedded
Queries
Interface-Aware
Indexes
Interface Query Processor
Interface Storage Manager
Spreadsheet
SQL
Spreadsheet
Formulae
New Interface
Algebra
…
Vanilla
SQL
Interface Transaction Manager
Other Applications Sally Bob Sue
23
1224560
StandardVisual Data Analysis Recipe:
1. Load dataset into viz tool
2. Select viz to be generated
3. See if it matches desired visual
pattern or insight
4. Repeat until you find a match
25
Tedious andTime-consuming!
26
Key Issue:
Visualizations can be generated by
• varying subsets of data, and
• varying attributes being visualized
Too many visualizations to look at to find
desired visual patterns!
27
Motivation
This is a real problem!
• Advertisers atTurn
– find keywords with similar CTRs to a specific one
• Bioinformaticians at an NIH genomics center
– find aspects on which two sets of genes differ
• Battery scientists at CMU
– find solvents with desired properties
Common theme: finding the “right” visualization can take
several hours of combing through visualizations manually.
28
Key Insight
We can automate that!
• instead of combing through visualizations manually
• tell us what you want, and we can “fast-forward” to desired insights
Desiderata for automation:
• Expressive – the ability to specify what you want
• Interactive – interact with the results, catering to non-programmers
• Scalable – get interesting results quickly
Enter Zenvisage:
(zen + envisage: to effortlessly visualize)
29
EffortlessVisual Exploration
of Large Datasets with
Ingredients
• Drag-and-drop and sketch based interactions
• to find specific patterns
• Sophisticated visual exploration language, ZQL
• to ask more elaborate questions
• Scalable visualization generation engine
• preprocess, batch and parallel eval. for interactive results
• Rapid pattern matching algorithms
• sampling-based techniques
30
Attribute Selection
Sketching Canvas
Matches TypicalTrends and Outliers
ZQL:Advanced Exploration Interface
Screenshots
31
Screenshots
32
Challenges: One Specific Instance
Find visualizations on which two groups of data differ most.
Examples:
• find visualizations where solvent x differs from solvent y
• find visualizations where product x differs from product y
We represent a visualization using [d, m, f]
• dimension = x axis
• measure = y axis
• function = aggregate applied to y
Each [d,m,f] on a specific subset of data can be computed using a
single SQL query.
33
Challenge: One Specific Instance
Find visualizations on which two groups of data differ most.
Naïve approach:
For each [d, m, f]:
Compute visualization for both products (two SQL queries),
then compare
Pick k best (“highest utility”) [d, m, f]
Utility Metric:We ignore how to compare for now, but there are
many standard distance metrics
Scale: 10s of dimensions, 10s of measures, handful of
aggregates  100s of queries for a single user task!
34
Issues w/ Naïve Approach
• Repeated processing of same
data in sequence across queries
• Computation wasted on low-
utility visualizations
Sharing
Pruning
35
Sharing Optimizations
1. Minimize # of queries: Group queries together
• Combine multiple aggregates:
(d1, m1, f1), (d1, m2, f1) —> (d1, [m1, m2], f1)
• Combine multiple group-bys:
(d1, m1, f1), (d2, m1, f1) —> ([d1, d2], m1, f1)
2. Minimize sequential execution: Parallel query evaluation
A bit tricky!
36
Pruning Optimizations
• Keep running estimates of utility
• Prune visualizations based on estimates:
Two flavors
– Vanilla Confidence Interval based Pruning
– Multi-armed Bandit Pruning
Discard low-utility views early to avoid wasted computation
37
Visualizations
Queries (100s)
Sharing
Pruning
Optimizer
DBMS
Middleware
Layer
Viz
interface
38
Up to 300X speedup: <1s for SM, 4s for L
Experimental Findings
39
EffortlessVisual Exploration
of Large Datasets with
Ingredients
• Drag-and-drop and
sketch based
interactions
• Sophisticated visual
exploration language,
ZQL
• Scalable visualization
generation engine
• Rapid pattern matching
algorithms
40
41
Motivation
Collaborative data science is
ubiquitous
• Many users, many versions of the
same dataset stored at many
stages of analysis
• Status quo:
– Stored in a file system, relationships
unknown
Challenge: can we build a versioned
data store?
– Support efficient access, retrieval,
querying, and modification of
versions
42
Motivation: Starting Points
• VCS: Git/svn is inefficient and unsuitable
– Ordered semantics
– No data manipulation API
– No efficient multi-version queries
– Poor support for massive files
• DBMS: Relational databases don’t support
versioning, but are efficient and scalable
43
OrpheusDB: Current Focus
PostgreSQL +Versioning Commands
44
Challenge: StoringVersions
Compactly/RetrievingVersions Quickly
1000s of versions, spanning millions of records.
Store all versions independently
Huge storage, version access time is very small
Store one version, all others via chains of “deltas”
Very small storage, version access time is high
45
And Answer Queries…
• Retrieve the first version that contains this tuple
• Find versions where the average(salary) is
greater than 1000
• Find all pairs of versions where over 100 new
tuples were added
• Show the history of the tuple with record id 34.
For more examples, see [TAPP’15]
46
Framework
“Versioning” Layer
(translation/bookkeeping)
User Interface Layer
47
UnmodifiedPostgres Backend
(not aware of versions)
Parser &
Translator
Layout
Optimizer
DBMS
git commands, or
SQL (versions as rel)
Summary:
Make Data Analytics Great Again!
orpheus-db.github.ioShare &
Collaborate
Play &
View
Touch &
Feel
Increasingsophisticationofanalysis
zenvisage.github.io
dataspread.github.io
My website: http://data-people.cs.illinois.edu
Twitter: @adityagp 48

More Related Content

What's hot

Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11
Dr.Sotarat Thammaboosadee CIMP-Data Governance
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science project
bodaceacat
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
bodaceacat
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
DataWorks Summit
 
Mauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopMauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshop
CosmoAIMS Bassett
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Data Science London
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our Lives
Rukshan Batuwita
 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko Grobelnik
European Data Forum
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
Lars Marius Garshol
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
Sampath Kumar
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Caserta
 
Machine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesMachine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & Opportunities
CodePolitan
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)
heba_ahmad
 
Data science presentation
Data science presentationData science presentation
Data science presentation
MSDEVMTL
 
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Ilkay Altintas, Ph.D.
 
8 minute intro to data science
8 minute intro to data science 8 minute intro to data science
8 minute intro to data science
Mahesh Kumar CV
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
Chandan Rajah
 
Machine Learning using Big data
Machine Learning using Big data Machine Learning using Big data
Machine Learning using Big data
Vaibhav Kurkute
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Ferdin Joe John Joseph PhD
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...
Armando Vieira
 

What's hot (20)

Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science project
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
Mauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopMauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshop
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our Lives
 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko Grobelnik
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Machine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesMachine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & Opportunities
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
 
8 minute intro to data science
8 minute intro to data science 8 minute intro to data science
8 minute intro to data science
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Machine Learning using Big data
Machine Learning using Big data Machine Learning using Big data
Machine Learning using Big data
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...
 

Similar to Three Tools for "Human-in-the-loop" Data Science

Lecture_1_Intro.pdf
Lecture_1_Intro.pdfLecture_1_Intro.pdf
Lecture_1_Intro.pdf
paijitk
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
Bhupesh Bansal
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics Applications
Russell Jurney
 
201407 MIT CDO IQ conceptual data modeling, big data, and information quality
201407 MIT CDO IQ conceptual data modeling, big data, and information quality201407 MIT CDO IQ conceptual data modeling, big data, and information quality
201407 MIT CDO IQ conceptual data modeling, big data, and information quality
Peter O'Kelly
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
The Hive
 
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine LearningLucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
Joaquin Delgado PhD.
 
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine LearningLucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
S. Diana Hu
 
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Lucidworks
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
Russell Jurney
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
DataWorks Summit
 
Lunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataLunch & Learn Intro to Big Data
Lunch & Learn Intro to Big Data
Melissa Hornbostel
 
NoSQL Simplified: Schema vs. Schema-less
NoSQL Simplified: Schema vs. Schema-lessNoSQL Simplified: Schema vs. Schema-less
NoSQL Simplified: Schema vs. Schema-less
InfiniteGraph
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera, Inc.
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data Scientists
Sri Ambati
 
H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin Ledell
Sri Ambati
 
李俊良/Feature Engineering in Machine Learning
李俊良/Feature Engineering in Machine Learning李俊良/Feature Engineering in Machine Learning
李俊良/Feature Engineering in Machine Learning
台灣資料科學年會
 
Tour of Big Data
Tour of Big DataTour of Big Data
Tour of Big Data
Raymond Yu
 
Patterns for Successful Data Science Projects (Spark AI Summit)
Patterns for Successful Data Science Projects (Spark AI Summit)Patterns for Successful Data Science Projects (Spark AI Summit)
Patterns for Successful Data Science Projects (Spark AI Summit)
Bill Chambers
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Knowledge And Skill Forum
 

Similar to Three Tools for "Human-in-the-loop" Data Science (20)

Lecture_1_Intro.pdf
Lecture_1_Intro.pdfLecture_1_Intro.pdf
Lecture_1_Intro.pdf
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics Applications
 
201407 MIT CDO IQ conceptual data modeling, big data, and information quality
201407 MIT CDO IQ conceptual data modeling, big data, and information quality201407 MIT CDO IQ conceptual data modeling, big data, and information quality
201407 MIT CDO IQ conceptual data modeling, big data, and information quality
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine LearningLucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
 
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine LearningLucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
 
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
 
Lunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataLunch & Learn Intro to Big Data
Lunch & Learn Intro to Big Data
 
NoSQL Simplified: Schema vs. Schema-less
NoSQL Simplified: Schema vs. Schema-lessNoSQL Simplified: Schema vs. Schema-less
NoSQL Simplified: Schema vs. Schema-less
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data Scientists
 
H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin Ledell
 
李俊良/Feature Engineering in Machine Learning
李俊良/Feature Engineering in Machine Learning李俊良/Feature Engineering in Machine Learning
李俊良/Feature Engineering in Machine Learning
 
Tour of Big Data
Tour of Big DataTour of Big Data
Tour of Big Data
 
Patterns for Successful Data Science Projects (Spark AI Summit)
Patterns for Successful Data Science Projects (Spark AI Summit)Patterns for Successful Data Science Projects (Spark AI Summit)
Patterns for Successful Data Science Projects (Spark AI Summit)
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 

Recently uploaded

Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Pitangent Analytics & Technology Solutions Pvt. Ltd
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
BibashShahi
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 

Recently uploaded (20)

Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 

Three Tools for "Human-in-the-loop" Data Science

  • 1. Aditya Parameswaran Assistant Professor University of Illinois http://data-people.cs.illinois.edu ThreeTools for “Human-in-the-loop” Data Science
  • 2. Many many contributors! • PIs: Kevin Chang, Karrie Karahalios,Aaron Elmore, Sam Madden, Amol Deshpande (Spanning Illinois,UMD, MIT,Chicago) • PhD Students: Mangesh Bendre, Himel Dev, John Lee, Albert Kim, ManasiVartak, Liqi Xu, Silu Huang, Sajjadur Rahman, Stephen Macke • MS Students:VipulVenkataraman,Tarique Siddiqui, ChaoWang, Sili Hui • Undergrads: Paul Zhou, Ding Zhang, Kejia Jiang, Bofan Sun, Ed Xue, Sean Zou, Jialin Liu, Changfeng Liu, XiaofoYu 2
  • 3. Scale is a Solved Problem Most work in the database community is myopically focused on scale: the ability to pose SQL queries on larger and larger datasets. My claim: Scale is a solved problem. Findings: – Median job size at Microsoft andYahoo is 16GB; – >90% of the jobs within Facebook are <100GB The bottleneck is no longer our ability to pose SQL queries on large datasets! Of course, exceptions exist: the “1%” of data analysis needs 3
  • 4. What about the Needs of the 99%? The bottleneck is actually the “humans-in-the-loop” As our data size has grown, what has stayed constant is • the time for analysis, • the human cognitive load, • the skills to extract value from data There is a severe need for tools that can help analysts extract value from even moderately sized datasets From “Big data and and itsTechnical Challenges”, CACM 2014 For big data to fully reach its potential, we need to consider scale not just for the system but also from the perspective of humans.We have to make sure that the end points—humans— can properly “absorb” the results of the analysis and not get lost in a sea of data. 4
  • 5. Need of the hour: Human-In-the-Loop Data AnalyticsTools HILDA tools: • treat both humans and data as first-class citizens • reduce human labor • minimize complexity Interaction Data Mining Databases Taking the human perspective into account Go beyond SQL Scalability/Interacti vity is still important Magic happens here 5
  • 6. A Maslow’s Hierarchy for HILDA Background: Maslow developed a theory for what motivates individuals in 1943; highly influential Complex Needs Basic Needs 6
  • 7. A Maslow’s Hierarchy for HILDA Share & Collaborate Play & View Touch & Feel Increasingsophisticationofanalysis 7
  • 8. Touch and Feel: DataSpread is a spreadsheet-database hybrid: Goal: Marrying the flexibility and ease of use of spreadsheets with the scalability and power of databases Enables the “99%” with large datasets but limited prog. skills to open, touch, and examine their datasets http://dataspread.github.io [VLDB’15,VLDB’15,ICDE’16] 8
  • 9. Play andView: Zenvisage is effortless visual exploration tool. Goal: “fast-forward” to visual patterns, trends, without having analyst step through each one individually Enables individuals to play with, and extract insights from large datasets at a fraction of the time. http://zenvisage.github.io [TR’16,VLDB’16,VLDB’15,DSIA’15,VLDB’14,VLDB’14] 9
  • 10. Collaborate and Share: OrpheusDB is a tool for managing dataset versions with a database Goal: building a versioned database system to reduce the burden of recording datasets in various stages of analysis Enables individuals to collaborate on data analysis, and share, keep track of, and retrieve dataset versions. http://orpheus-db.github.io [VLDB’16,VLDB’15,VLDB’15,TAPP’15,CIDR’15] (also part of : a collab. analysis system w/ MIT & UMD) datahub 10
  • 11. This talk About 10 minutes per system: overview + architecture + one key technical challenge Common theme: if you torture databases enough, you can get them to do what you want! Share & Collaborate Play & View Touch & Feel Increasingsophisticationofanalysis 11
  • 12. 12
  • 13. Motivation Most of the people doing ad-hoc data manipulation and analysis use spreadsheets, e.g., Excel Why? • Easy to use: direct manipulation • Built-in visualization capabilities • Flexible: no need for a schema 13
  • 14. But Spreadsheets areTerrible! – Slow • single change  wait minutes on a 10,000 x 10 spreadsheet • can’t even open a spreadsheet with >1M cells • speed by itself can prevent analysis – Tedious + not Powerful • filters via copy-paste • only FK joins viaVLOOKUPs; others impossible • even simple operations are cumbersome – Brittle • sharing excel sheets around, no collab/recovery • using spreadsheets for collaboration is painful and error-prone 14
  • 15. Let’s turn to Databases Databases are: • Slow Scalable • Tedious + not Powerful Powerful and expressive (SQL) • Brittle Collaboration, recovery, succinct So why not use databases? Well, for the same reason why spreadsheets are so useful: • Easy to use Not easy to use • Built-in visualization No built-in visualization • Flexible Not flexible 15
  • 16. Combining the benefits of spreadsheets and databases Spreadsheet as a frontend interface Databases as a backend engine Result: retain the benefits of both! But it’s not that simple… 16
  • 17. Different Ideologies Databases and spreadsheets have different ideologies that need to be reconciled… Due to this, the integration is not trivial… Feature Databases Spreadsheets Data Model Schema-first Dynamic/No Schema Addressing Tuples with PK Cells, using Row/Col Presentation Set-oriented, no such notion Notion of current window, order Modifications Must correspond to queries Can be done at any granularity Computation Query at a time Value at a time 17
  • 18. First Problem: Representation Q: how do we represent spreadsheet data? Dense spreadsheets: represent as tables (Row #, Col1 val, Col2 val, …) Sparse spreadsheets: represent as triples (Row #, Column #,Value) 18
  • 19. First Problem: Representation Q: how do we represent spreadsheet data? Can we do even better than the two extremes?Yes! Carve out dense areas  store as tables, sparse areas  store as triples 19
  • 20. First Problem: Representation However, even if we only use “tables”, carving out the ideal # partitions (min. storage, modif., access) is NP-Hard Reduction from min. edge-length partition of rectilinear polygons Thankfully, we have a way out… 20
  • 21. Solution: Constrain the Problem A new class of partitionings: recursive decomp. A very natural class of partitionings! 21
  • 22. Solution: Constrain the Problem The optimal recursive decomp. partitioning can be found in PTIME using DP  Still quadratic in # rows, columns  Merge rows/columns with identical signatures ~ the time for a single scan 22
  • 23. Initial Progress and Architecture Postgres backend ZK spreadsheet • open-source web frontend Comfortably scales to arbitrarily many rows + handle SQL queries Hopefully bring spreadsheets to the big data age! Underlying Data Interface-Embedded Queries Interface-Aware Indexes Interface Query Processor Interface Storage Manager Spreadsheet SQL Spreadsheet Formulae New Interface Algebra … Vanilla SQL Interface Transaction Manager Other Applications Sally Bob Sue 23 1224560
  • 24.
  • 25. StandardVisual Data Analysis Recipe: 1. Load dataset into viz tool 2. Select viz to be generated 3. See if it matches desired visual pattern or insight 4. Repeat until you find a match 25
  • 27. Key Issue: Visualizations can be generated by • varying subsets of data, and • varying attributes being visualized Too many visualizations to look at to find desired visual patterns! 27
  • 28. Motivation This is a real problem! • Advertisers atTurn – find keywords with similar CTRs to a specific one • Bioinformaticians at an NIH genomics center – find aspects on which two sets of genes differ • Battery scientists at CMU – find solvents with desired properties Common theme: finding the “right” visualization can take several hours of combing through visualizations manually. 28
  • 29. Key Insight We can automate that! • instead of combing through visualizations manually • tell us what you want, and we can “fast-forward” to desired insights Desiderata for automation: • Expressive – the ability to specify what you want • Interactive – interact with the results, catering to non-programmers • Scalable – get interesting results quickly Enter Zenvisage: (zen + envisage: to effortlessly visualize) 29
  • 30. EffortlessVisual Exploration of Large Datasets with Ingredients • Drag-and-drop and sketch based interactions • to find specific patterns • Sophisticated visual exploration language, ZQL • to ask more elaborate questions • Scalable visualization generation engine • preprocess, batch and parallel eval. for interactive results • Rapid pattern matching algorithms • sampling-based techniques 30
  • 31. Attribute Selection Sketching Canvas Matches TypicalTrends and Outliers ZQL:Advanced Exploration Interface Screenshots 31
  • 33. Challenges: One Specific Instance Find visualizations on which two groups of data differ most. Examples: • find visualizations where solvent x differs from solvent y • find visualizations where product x differs from product y We represent a visualization using [d, m, f] • dimension = x axis • measure = y axis • function = aggregate applied to y Each [d,m,f] on a specific subset of data can be computed using a single SQL query. 33
  • 34. Challenge: One Specific Instance Find visualizations on which two groups of data differ most. Naïve approach: For each [d, m, f]: Compute visualization for both products (two SQL queries), then compare Pick k best (“highest utility”) [d, m, f] Utility Metric:We ignore how to compare for now, but there are many standard distance metrics Scale: 10s of dimensions, 10s of measures, handful of aggregates  100s of queries for a single user task! 34
  • 35. Issues w/ Naïve Approach • Repeated processing of same data in sequence across queries • Computation wasted on low- utility visualizations Sharing Pruning 35
  • 36. Sharing Optimizations 1. Minimize # of queries: Group queries together • Combine multiple aggregates: (d1, m1, f1), (d1, m2, f1) —> (d1, [m1, m2], f1) • Combine multiple group-bys: (d1, m1, f1), (d2, m1, f1) —> ([d1, d2], m1, f1) 2. Minimize sequential execution: Parallel query evaluation A bit tricky! 36
  • 37. Pruning Optimizations • Keep running estimates of utility • Prune visualizations based on estimates: Two flavors – Vanilla Confidence Interval based Pruning – Multi-armed Bandit Pruning Discard low-utility views early to avoid wasted computation 37
  • 39. Up to 300X speedup: <1s for SM, 4s for L Experimental Findings 39
  • 40. EffortlessVisual Exploration of Large Datasets with Ingredients • Drag-and-drop and sketch based interactions • Sophisticated visual exploration language, ZQL • Scalable visualization generation engine • Rapid pattern matching algorithms 40
  • 41. 41
  • 42. Motivation Collaborative data science is ubiquitous • Many users, many versions of the same dataset stored at many stages of analysis • Status quo: – Stored in a file system, relationships unknown Challenge: can we build a versioned data store? – Support efficient access, retrieval, querying, and modification of versions 42
  • 43. Motivation: Starting Points • VCS: Git/svn is inefficient and unsuitable – Ordered semantics – No data manipulation API – No efficient multi-version queries – Poor support for massive files • DBMS: Relational databases don’t support versioning, but are efficient and scalable 43
  • 44. OrpheusDB: Current Focus PostgreSQL +Versioning Commands 44
  • 45. Challenge: StoringVersions Compactly/RetrievingVersions Quickly 1000s of versions, spanning millions of records. Store all versions independently Huge storage, version access time is very small Store one version, all others via chains of “deltas” Very small storage, version access time is high 45
  • 46. And Answer Queries… • Retrieve the first version that contains this tuple • Find versions where the average(salary) is greater than 1000 • Find all pairs of versions where over 100 new tuples were added • Show the history of the tuple with record id 34. For more examples, see [TAPP’15] 46
  • 47. Framework “Versioning” Layer (translation/bookkeeping) User Interface Layer 47 UnmodifiedPostgres Backend (not aware of versions) Parser & Translator Layout Optimizer DBMS git commands, or SQL (versions as rel)
  • 48. Summary: Make Data Analytics Great Again! orpheus-db.github.ioShare & Collaborate Play & View Touch & Feel Increasingsophisticationofanalysis zenvisage.github.io dataspread.github.io My website: http://data-people.cs.illinois.edu Twitter: @adityagp 48