
Letizia Tanca - Exploring Databases: The Indiana Project


Letizia Tanca, Politecnico di Milano, made this presentation for the Cognitive Systems Institute Speaker Series on July 21, 2016.

Published in: Technology

  1. 1. Letizia Tanca, Politecnico di Milano. Joint work with Università della Basilicata (credits in the last slide). Cognitive Systems Institute Speaker Series
  2. 2. User Interaction Visualize Annotation Collaboration Efficiency Explanations Sampling Personalization Intensional view Query Suggestion
  3. 3. Database Exploration as a viewpoint of Exploratory Computing: •  Rich data •  Dialogue-based interaction •  Based on intensional characterization of the information •  Meaningful feedback (relevance) •  User experience → only, more emphasis on efficiency
  4. 4. •  Starting point: a large, “semantically-rich” db •  Goals •  explore, to learn interesting things •  without a clear, a-priori perception of what we are looking for
  5. 5. •  A classical db is inherently transactional •  “Data Enthusiasts” are not willing to bear the cost of building a warehouse •  Interactive Data Cleaning •  Let’s do it on the database!
  6. 6. The UI Layer The Engine Layer The DB Layer “interesting” attributes Activity id type start length userId
  7. 7. AcmeUser Activity Location Sleep The Engine Layer The DB Layer AcmeUser ⨝ Location Activity ⨝ AcmeUser Sleep ⨝ AcmeUser type sex quality “view X is a parent of view Y” means Y contains X as a subexpression
  8. 8. •  Query Engine •  Frequency distributions of attribute values •  Sampling •  Statistical hypothesis tests: •  Real-valued attributes: •  Kolmogorov-Smirnov •  Categorical attributes •  Chi-Square •  or Entropy Test for low frequencies Query Engine Computing Distributions Running Hypothesis Tests
  9. 9. 1) Extraction 3) Iteration 4) Ranking of the analyses based on the Hellinger Distance between the distributions
  10. 10. An interactive dialogue: •  Users may change their minds •  Feedback: emphasis on dataset properties, not on extensions •  Summarization What is interesting is discovered: •  Discontinuities •  Niche knowledge detection is serendipitous: surprise vs. previous subsets or vs. user’s expectations •  At each iteration the user should understand •  the “current” subset of items (its properties) •  the main differences vs. one or more of the previous subsets •  where to focus her attention (what is interesting?) •  Statistical approach to finding discrepancies •  A way to highlight relevant properties
  11. 11. •  Politecnico di Milano: Paolo Paolini, Nicoletta Di Blas, Elisa Quintarelli, Manuel Roveri, Mirjana Mazuran •  Università della Basilicata: Giansalvatore Mecca, Donatello Santoro, Marcello Buoncristiano, Antonio Giuzio •  M. Buoncristiano, G. Mecca, E. Quintarelli, M. Roveri, D. Santoro, L. Tanca: Database Challenges for Exploratory Computing. SIGMOD Record, 2015 •  N. Di Blas, M. Mazuran, P. Paolini, E. Quintarelli, L. Tanca: Exploratory computing: a draft Manifesto. DSAA 2014 •  S. Idreos, O. Papaemmanouil, S. Chaudhuri: Overview of Data Exploration Techniques. SIGMOD 2015 •  My post on the SIGMOD Blog
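
Slides 8 and 9 name the statistical machinery: frequency distributions of attribute values, the Kolmogorov-Smirnov test for real-valued attributes, the Chi-Square test for categorical ones, and a ranking based on the Hellinger distance between distributions. The following is a minimal sketch of those quantities in plain Python, not the project's actual implementation. The toy rows, the `length_band` attribute, and all function names are assumptions made for illustration. Only the test statistics are computed, p-values are omitted, and the choice to compare each subset against the whole table is also an assumption.

```python
import math
from bisect import bisect_right
from collections import Counter


def frequency_distribution(values, categories):
    """Relative frequency of each category in `values` (0.0 if absent)."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


def chi_square_statistic(sample, reference, categories):
    """Pearson chi-square statistic of `sample` counts against the counts
    expected under the distribution of `reference`."""
    observed = Counter(sample)
    n = len(sample)
    expected = frequency_distribution(reference, categories)
    return sum(
        (observed[c] - n * e) ** 2 / (n * e)
        for c, e in zip(categories, expected)
        if e > 0
    )


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    return max(
        abs(bisect_right(xs, t) / len(xs) - bisect_right(ys, t) / len(ys))
        for t in points
    )


def rank_attributes(rows, subset, attributes):
    """Rank attributes by how far the subset's distribution lies from the
    whole table's distribution (largest Hellinger distance first)."""
    scored = []
    for attr in attributes:
        categories = sorted({r[attr] for r in rows})
        p = frequency_distribution([r[attr] for r in subset], categories)
        q = frequency_distribution([r[attr] for r in rows], categories)
        scored.append((attr, hellinger(p, q)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical rows standing in for Activity joined with AcmeUser.
rows = [
    {"sex": "F", "type": "running", "length_band": "short"},
    {"sex": "F", "type": "running", "length_band": "short"},
    {"sex": "F", "type": "running", "length_band": "long"},
    {"sex": "F", "type": "walking", "length_band": "short"},
    {"sex": "M", "type": "walking", "length_band": "long"},
    {"sex": "M", "type": "walking", "length_band": "long"},
    {"sex": "M", "type": "cycling", "length_band": "long"},
    {"sex": "M", "type": "walking", "length_band": "short"},
]
subset = [r for r in rows if r["sex"] == "F"]
ranking = rank_attributes(rows, subset, ["type", "length_band"])

if __name__ == "__main__":
    for attribute, distance in ranking:
        print(f"{attribute}: {distance:.3f}")
```

In this toy data the `type` attribute ranks above `length_band`, because the subset's activity types deviate more from the whole table than its length bands do. A real system would likely rely on a statistics library for p-values and on sampling for efficiency, as slide 8 suggests.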