More Related Content

Slideshows for you(20)

Similar to Three Tools for "Human-in-the-loop" Data Science(20)


Three Tools for "Human-in-the-loop" Data Science

  1. Aditya Parameswaran Assistant Professor University of Illinois ThreeTools for “Human-in-the-loop” Data Science
  2. Many many contributors! • PIs: Kevin Chang, Karrie Karahalios,Aaron Elmore, Sam Madden, Amol Deshpande (Spanning Illinois,UMD, MIT,Chicago) • PhD Students: Mangesh Bendre, Himel Dev, John Lee, Albert Kim, ManasiVartak, Liqi Xu, Silu Huang, Sajjadur Rahman, Stephen Macke • MS Students:VipulVenkataraman,Tarique Siddiqui, ChaoWang, Sili Hui • Undergrads: Paul Zhou, Ding Zhang, Kejia Jiang, Bofan Sun, Ed Xue, Sean Zou, Jialin Liu, Changfeng Liu, XiaofoYu 2
  3. Scale is a Solved Problem Most work in the database community is myopically focused on scale: the ability to pose SQL queries on larger and larger datasets. My claim: Scale is a solved problem. Findings: – Median job size at Microsoft andYahoo is 16GB; – >90% of the jobs within Facebook are <100GB The bottleneck is no longer our ability to pose SQL queries on large datasets! Of course, exceptions exist: the “1%” of data analysis needs 3
  4. What about the Needs of the 99%? The bottleneck is actually the “humans-in-the-loop” As our data size has grown, what has stayed constant is • the time for analysis, • the human cognitive load, • the skills to extract value from data There is a severe need for tools that can help analysts extract value from even moderately sized datasets From “Big data and and itsTechnical Challenges”, CACM 2014 For big data to fully reach its potential, we need to consider scale not just for the system but also from the perspective of humans.We have to make sure that the end points—humans— can properly “absorb” the results of the analysis and not get lost in a sea of data. 4
  5. Need of the hour: Human-In-the-Loop Data AnalyticsTools HILDA tools: • treat both humans and data as first-class citizens • reduce human labor • minimize complexity Interaction Data Mining Databases Taking the human perspective into account Go beyond SQL Scalability/Interacti vity is still important Magic happens here 5
  6. A Maslow’s Hierarchy for HILDA Background: Maslow developed a theory for what motivates individuals in 1943; highly influential Complex Needs Basic Needs 6
  7. A Maslow’s Hierarchy for HILDA Share & Collaborate Play & View Touch & Feel Increasingsophisticationofanalysis 7
  8. Touch and Feel: DataSpread is a spreadsheet-database hybrid: Goal: Marrying the flexibility and ease of use of spreadsheets with the scalability and power of databases Enables the “99%” with large datasets but limited prog. skills to open, touch, and examine their datasets [VLDB’15,VLDB’15,ICDE’16] 8
  9. Play andView: Zenvisage is effortless visual exploration tool. Goal: “fast-forward” to visual patterns, trends, without having analyst step through each one individually Enables individuals to play with, and extract insights from large datasets at a fraction of the time. [TR’16,VLDB’16,VLDB’15,DSIA’15,VLDB’14,VLDB’14] 9
  10. Collaborate and Share: OrpheusDB is a tool for managing dataset versions with a database Goal: building a versioned database system to reduce the burden of recording datasets in various stages of analysis Enables individuals to collaborate on data analysis, and share, keep track of, and retrieve dataset versions. [VLDB’16,VLDB’15,VLDB’15,TAPP’15,CIDR’15] (also part of : a collab. analysis system w/ MIT & UMD) datahub 10
  11. This talk About 10 minutes per system: overview + architecture + one key technical challenge Common theme: if you torture databases enough, you can get them to do what you want! Share & Collaborate Play & View Touch & Feel Increasingsophisticationofanalysis 11
  12. 12
  13. Motivation Most of the people doing ad-hoc data manipulation and analysis use spreadsheets, e.g., Excel Why? • Easy to use: direct manipulation • Built-in visualization capabilities • Flexible: no need for a schema 13
  14. But Spreadsheets areTerrible! – Slow • single change  wait minutes on a 10,000 x 10 spreadsheet • can’t even open a spreadsheet with >1M cells • speed by itself can prevent analysis – Tedious + not Powerful • filters via copy-paste • only FK joins viaVLOOKUPs; others impossible • even simple operations are cumbersome – Brittle • sharing excel sheets around, no collab/recovery • using spreadsheets for collaboration is painful and error-prone 14
  15. Let’s turn to Databases Databases are: • Slow Scalable • Tedious + not Powerful Powerful and expressive (SQL) • Brittle Collaboration, recovery, succinct So why not use databases? Well, for the same reason why spreadsheets are so useful: • Easy to use Not easy to use • Built-in visualization No built-in visualization • Flexible Not flexible 15
  16. Combining the benefits of spreadsheets and databases Spreadsheet as a frontend interface Databases as a backend engine Result: retain the benefits of both! But it’s not that simple… 16
  17. Different Ideologies Databases and spreadsheets have different ideologies that need to be reconciled… Due to this, the integration is not trivial… Feature Databases Spreadsheets Data Model Schema-first Dynamic/No Schema Addressing Tuples with PK Cells, using Row/Col Presentation Set-oriented, no such notion Notion of current window, order Modifications Must correspond to queries Can be done at any granularity Computation Query at a time Value at a time 17
  18. First Problem: Representation Q: how do we represent spreadsheet data? Dense spreadsheets: represent as tables (Row #, Col1 val, Col2 val, …) Sparse spreadsheets: represent as triples (Row #, Column #,Value) 18
  19. First Problem: Representation Q: how do we represent spreadsheet data? Can we do even better than the two extremes?Yes! Carve out dense areas  store as tables, sparse areas  store as triples 19
  20. First Problem: Representation However, even if we only use “tables”, carving out the ideal # partitions (min. storage, modif., access) is NP-Hard Reduction from min. edge-length partition of rectilinear polygons Thankfully, we have a way out… 20
  21. Solution: Constrain the Problem A new class of partitionings: recursive decomp. A very natural class of partitionings! 21
  22. Solution: Constrain the Problem The optimal recursive decomp. partitioning can be found in PTIME using DP  Still quadratic in # rows, columns  Merge rows/columns with identical signatures ~ the time for a single scan 22
  23. Initial Progress and Architecture Postgres backend ZK spreadsheet • open-source web frontend Comfortably scales to arbitrarily many rows + handle SQL queries Hopefully bring spreadsheets to the big data age! Underlying Data Interface-Embedded Queries Interface-Aware Indexes Interface Query Processor Interface Storage Manager Spreadsheet SQL Spreadsheet Formulae New Interface Algebra … Vanilla SQL Interface Transaction Manager Other Applications Sally Bob Sue 23 1224560
  24. StandardVisual Data Analysis Recipe: 1. Load dataset into viz tool 2. Select viz to be generated 3. See if it matches desired visual pattern or insight 4. Repeat until you find a match 25
  25. Tedious andTime-consuming! 26
  26. Key Issue: Visualizations can be generated by • varying subsets of data, and • varying attributes being visualized Too many visualizations to look at to find desired visual patterns! 27
  27. Motivation This is a real problem! • Advertisers atTurn – find keywords with similar CTRs to a specific one • Bioinformaticians at an NIH genomics center – find aspects on which two sets of genes differ • Battery scientists at CMU – find solvents with desired properties Common theme: finding the “right” visualization can take several hours of combing through visualizations manually. 28
  28. Key Insight We can automate that! • instead of combing through visualizations manually • tell us what you want, and we can “fast-forward” to desired insights Desiderata for automation: • Expressive – the ability to specify what you want • Interactive – interact with the results, catering to non-programmers • Scalable – get interesting results quickly Enter Zenvisage: (zen + envisage: to effortlessly visualize) 29
  29. EffortlessVisual Exploration of Large Datasets with Ingredients • Drag-and-drop and sketch based interactions • to find specific patterns • Sophisticated visual exploration language, ZQL • to ask more elaborate questions • Scalable visualization generation engine • preprocess, batch and parallel eval. for interactive results • Rapid pattern matching algorithms • sampling-based techniques 30
  30. Attribute Selection Sketching Canvas Matches TypicalTrends and Outliers ZQL:Advanced Exploration Interface Screenshots 31
  31. Screenshots 32
  32. Challenges: One Specific Instance Find visualizations on which two groups of data differ most. Examples: • find visualizations where solvent x differs from solvent y • find visualizations where product x differs from product y We represent a visualization using [d, m, f] • dimension = x axis • measure = y axis • function = aggregate applied to y Each [d,m,f] on a specific subset of data can be computed using a single SQL query. 33
  33. Challenge: One Specific Instance Find visualizations on which two groups of data differ most. Naïve approach: For each [d, m, f]: Compute visualization for both products (two SQL queries), then compare Pick k best (“highest utility”) [d, m, f] Utility Metric:We ignore how to compare for now, but there are many standard distance metrics Scale: 10s of dimensions, 10s of measures, handful of aggregates  100s of queries for a single user task! 34
  34. Issues w/ Naïve Approach • Repeated processing of same data in sequence across queries • Computation wasted on low- utility visualizations Sharing Pruning 35
  35. Sharing Optimizations 1. Minimize # of queries: Group queries together • Combine multiple aggregates: (d1, m1, f1), (d1, m2, f1) —> (d1, [m1, m2], f1) • Combine multiple group-bys: (d1, m1, f1), (d2, m1, f1) —> ([d1, d2], m1, f1) 2. Minimize sequential execution: Parallel query evaluation A bit tricky! 36
  36. Pruning Optimizations • Keep running estimates of utility • Prune visualizations based on estimates: Two flavors – Vanilla Confidence Interval based Pruning – Multi-armed Bandit Pruning Discard low-utility views early to avoid wasted computation 37
  37. Visualizations Queries (100s) Sharing Pruning Optimizer DBMS Middleware Layer Viz interface 38
  38. Up to 300X speedup: <1s for SM, 4s for L Experimental Findings 39
  39. EffortlessVisual Exploration of Large Datasets with Ingredients • Drag-and-drop and sketch based interactions • Sophisticated visual exploration language, ZQL • Scalable visualization generation engine • Rapid pattern matching algorithms 40
  40. 41
  41. Motivation Collaborative data science is ubiquitous • Many users, many versions of the same dataset stored at many stages of analysis • Status quo: – Stored in a file system, relationships unknown Challenge: can we build a versioned data store? – Support efficient access, retrieval, querying, and modification of versions 42
  42. Motivation: Starting Points • VCS: Git/svn is inefficient and unsuitable – Ordered semantics – No data manipulation API – No efficient multi-version queries – Poor support for massive files • DBMS: Relational databases don’t support versioning, but are efficient and scalable 43
  43. OrpheusDB: Current Focus PostgreSQL +Versioning Commands 44
  44. Challenge: StoringVersions Compactly/RetrievingVersions Quickly 1000s of versions, spanning millions of records. Store all versions independently Huge storage, version access time is very small Store one version, all others via chains of “deltas” Very small storage, version access time is high 45
  45. And Answer Queries… • Retrieve the first version that contains this tuple • Find versions where the average(salary) is greater than 1000 • Find all pairs of versions where over 100 new tuples were added • Show the history of the tuple with record id 34. For more examples, see [TAPP’15] 46
  46. Framework “Versioning” Layer (translation/bookkeeping) User Interface Layer 47 UnmodifiedPostgres Backend (not aware of versions) Parser & Translator Layout Optimizer DBMS git commands, or SQL (versions as rel)
  47. Summary: Make Data Analytics Great Again! orpheus-db.github.ioShare & Collaborate Play & View Touch & Feel Increasingsophisticationofanalysis My website: Twitter: @adityagp 48