Letizia Tanca - Exploring Databases: The Indiana Project


Letizia Tanca, Politecnico di Milano, gave this presentation for the Cognitive Systems Institute Speaker Series on July 21, 2016.

Published in: Technology

  1. 1. Letizia Tanca, Politecnico di Milano. Joint work with Università della Basilicata (credits in the last slide). Cognitive Systems Institute Speaker Series
  2. 2. User Interaction, Visualize, Annotation, Collaboration, Efficiency, Explanations, Sampling, Personalization, Intensional view, Query Suggestion
  3. 3. •  Rich data •  Dialogue-based interaction •  Based on intensional characterization of the information •  Meaningful feedback (relevance) •  User experience Database Exploration as a viewpoint of Exploratory Computing: → only, more emphasis on efficiency
  4. 4. •  Starting point: a large, “semantically rich” database •  Goals •  explore, to learn interesting things •  without a clear, a priori perception of what we are looking for
  5. 5. •  A classical database is inherently transactional •  “Data Enthusiasts” are unwilling to bear the cost of building a warehouse •  Interactive Data Cleaning •  Let’s do it on the database!
  6. 6. Architecture diagram: The UI Layer, The Engine Layer, The DB Layer. Example table with its “interesting” attributes: Activity (id, type, start, length, userId)
  7. 7. Diagram (The Engine Layer, The DB Layer). Tables: AcmeUser, Activity, Location, Sleep. Joined views: AcmeUser ⨝ Location, Activity ⨝ AcmeUser, Sleep ⨝ AcmeUser. Attributes: type, sex, quality. “View X is a parent of view Y” means that Y contains X as a subexpression
  8. 8. •  Query Engine •  Frequency distributions of attribute values •  Sampling •  Statistical hypothesis tests: •  Real-valued attributes: Kolmogorov-Smirnov •  Categorical attributes: Chi-Square, or Entropy Test for low frequencies. Diagram labels: Query Engine, Computing Distributions, Running Hypothesis Tests
  9. 9. 1) Extraction 3) Iteration 4) Ranking of the analyses based on the Hellinger Distance between the distributions
  10. 10. An interactive dialogue: •  Users may change their minds •  Feedback: emphasis on dataset properties, not on extensions •  Summarization What is interesting is discovered: •  Discontinuities •  Niche knowledge detection is serendipitous: surprise vs. previous subsets or vs. user’s expectations •  At each iteration the user should understand •  the “current” subset of items (its properties) •  the main differences vs. one or more of the previous subsets •  where to focus her attention (what is interesting?) •  Statistical approach to finding discrepancies •  A way to highlight relevant properties
  11. 11. •  Politecnico di Milano: Paolo Paolini, Nicoletta Di Blas, Elisa Quintarelli, Manuel Roveri, Mirjana Mazuran •  Università della Basilicata: Giansalvatore Mecca, Donatello Santoro, Marcello Buoncristiano, Antonio Giuzio •  M. Buoncristiano, G. Mecca, E. Quintarelli, M. Roveri, D. Santoro, L. Tanca: Database Challenges for Exploratory Computing. SIGMOD Record, 2015 •  N. Di Blas, M. Mazuran, P. Paolini, E. Quintarelli, L. Tanca: Exploratory computing: a draft Manifesto. DSAA 2014 •  S. Idreos, O. Papaemmanouil, S. Chaudhuri: Overview of Data Exploration Techniques. SIGMOD 2015. •  My post on the SIGMOD Blog
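
Slide 7 states that view X is a parent of view Y when Y contains X as a subexpression. The sketch below illustrates that relation in Python under a simplifying assumption that is not in the slides: every view is a natural join of base tables, so a view can be modelled as the set of tables it joins. The helper names (`view`, `is_parent`, `parents_of`) are hypothetical and are not the project's actual representation.

```python
# Assumption: a view is a natural join of base tables, so it can be
# represented by the set of tables it joins. All names are illustrative.

def view(*tables):
    """Build a view from the names of the base tables it joins."""
    return frozenset(tables)

def is_parent(x, y):
    """X is a parent of Y when Y contains X as a (proper) subexpression."""
    return x < y  # proper-subset test on frozensets

def label(v):
    """Human-readable name of a view, with tables in alphabetical order."""
    return " ⨝ ".join(sorted(v))

# Tables and joined views named on slide 7.
BASE_VIEWS = [view("AcmeUser"), view("Activity"), view("Location"), view("Sleep")]
JOIN_VIEWS = [
    view("AcmeUser", "Location"),
    view("Activity", "AcmeUser"),
    view("Sleep", "AcmeUser"),
]

def parents_of(y, candidates):
    """Labels of all candidate views that are parents of view y."""
    return sorted(label(x) for x in candidates if is_parent(x, y))

if __name__ == "__main__":
    for y in JOIN_VIEWS:
        print(label(y), "has parents", parents_of(y, BASE_VIEWS))
```

With this set-based model, a subexpression check reduces to a subset test. Views that use selections or projections would need a richer representation.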
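
Slide 8 lists frequency distributions of attribute values and two hypothesis tests: Kolmogorov-Smirnov for real-valued attributes and Chi-Square for categorical ones. The following is a minimal sketch of the corresponding test statistics in plain Python. It computes the statistics only, not p-values, and does not cover the entropy test mentioned for low frequencies. The function names are assumptions made for illustration, not the engine's API.

```python
from bisect import bisect_right
from collections import Counter

def frequency_distribution(values):
    """Relative frequency of each distinct attribute value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    This is the largest absolute gap between the two empirical CDFs,
    evaluated at every observed point.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def chi_square_statistic(observed_values, expected_dist):
    """Pearson chi-square statistic of a sample against expected frequencies.

    expected_dist maps each category to a strictly positive probability.
    Categories seen in the sample but absent from expected_dist are ignored
    in this sketch.
    """
    counts = Counter(observed_values)
    n = len(observed_values)
    statistic = 0.0
    for category, probability in expected_dist.items():
        expected = n * probability
        statistic += (counts.get(category, 0) - expected) ** 2 / expected
    return statistic

if __name__ == "__main__":
    # Hypothetical data: activity types in the whole table vs. a user subset.
    whole = ["walk"] * 50 + ["run"] * 30 + ["bike"] * 20
    subset = ["walk"] * 5 + ["run"] * 15 + ["bike"] * 10
    print(chi_square_statistic(subset, frequency_distribution(whole)))
    print(ks_statistic([6.5, 7.0, 8.0], [4.0, 5.0, 7.5]))
```

In practice a statistics library would supply the p-values as well. SciPy, for example, provides `scipy.stats.ks_2samp` and `scipy.stats.chisquare`.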
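
Slide 9 ranks the analyses by the Hellinger distance between distributions. For two discrete distributions P and Q, the distance is (1/√2) · sqrt(Σ (√p_i − √q_i)²), which ranges from 0 (identical) to 1 (disjoint support). The sketch below computes it in Python and sorts candidate distributions by their distance from a reference. Ranking the largest distance first is an assumption, chosen because the slides stress surprise and discrepancies. The function names and the example data are hypothetical.

```python
from math import sqrt

def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    p and q map categories to probabilities; a missing category counts as 0.
    """
    categories = set(p) | set(q)
    total = sum(
        (sqrt(p.get(c, 0.0)) - sqrt(q.get(c, 0.0))) ** 2 for c in categories
    )
    return sqrt(total) / sqrt(2)

def rank_by_hellinger(reference, candidates):
    """Return (name, distance) pairs, most different from reference first.

    Assumption: a larger distance means a more interesting analysis.
    """
    scored = [(name, hellinger(reference, dist)) for name, dist in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical distributions of activity type in three subsets.
    reference = {"walk": 0.5, "run": 0.5}
    candidates = {
        "subset_a": {"walk": 0.5, "run": 0.5},
        "subset_b": {"walk": 0.9, "run": 0.1},
        "subset_c": {"walk": 0.6, "run": 0.4},
    }
    for name, distance in rank_by_hellinger(reference, candidates):
        print(name, round(distance, 3))
```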