Data Quality: Not Your Typical Database Problem

Ahmed K. Elmagarmid (IEEE Fellow and ACM Distinguished Scientist) gave a lecture on Data Quality: Not Your Typical Database Problem in the Distinguished Lecturer Series - Leon The Mathematician.

Data Quality: Not Your Typical Database Problem Presentation Transcript

  • Data Quality: Not Your Typical Database Problem. Ahmed Elmagarmid, Executive Director, Qatar Computing Research Institute. 2011 © Copyright QCRI. Confidential document.
  • Where are we located?
  • Qatar Foundation
  • Education; Science & Research; Community Development. 2.8 percent of GDP to be spent on research annually by 2015.
  • Qatar Foundation Research Division: Qatar Computing Research Institute (QCRI), Qatar Energy & Environment Research Institute (QEERI), Qatar Biomedical Research Institute (QBRI)
  • QCRI Overview
  • QCRI Vision: To make Qatar a global center for computing research by becoming the world’s recognized leader in Arabic language technologies and in key areas vital to the global growth of Qatari business and entrepreneurial activity.
  • QCRI Model (diagram): Grand Challenges
    – National institutions (QCRI): grand practical challenges; national and global impact; large teams and long term; example peers: INRIA, MPI
    – Academia: localized skills & knowledge; individual projects; students move on
    – Research parks: commercialization; entrepreneurship; incubation
    – Other labels: theoretical & basic research; project-based research; an axis running from basic research to applied research
  • QCRI Ecosystem (diagram): QU, Sidra, QBRI, MIT, HKU, QEERI, QCRI, WikiMedia, QSTP, Aljazeera, QP, ALTIS, Boeing, Energy Co., Google, MEEZA, Yahoo, QSA, IBM, Microsoft
  • QCRI Research Centers: Arabic Language Technologies, Social Computing, Scientific Computing, Data Analytics, Cloud Computing
  • QCRI Scientific Advisory Council
    – Prof. Rich DeMillo, Georgia Tech (Chair)
    – Lord Rupert Redesdale, UK House of Lords
    – Prof. Joichi Ito, MIT Media Lab Director
    – Prof. Ruzena Bajcsy, University of California – Berkeley
    – Lew Tucker, Vice President, Cisco
    – Prof. Alfred V. Aho, Columbia University
    – Prof. Dick Lipton, Georgia Tech
    – Yousef Khalidi, Vice President, Microsoft
  • The 60 Doers! (team chart of staff first names, grouped into: Management and Support Team, Data Analytics, Scientific Computing, Arabic Language Technologies, Cloud Computing, Social Computing)
  • Strategic Partnerships
  • Agenda: Strategic Partnerships
  • 5-Year QCRI Manpower Plan (chart: headcount per year, with year-over-year increase)
    – 10-11: 21
    – 11-12: 34 (+13)
    – 12-13: 82 (+48)
    – 13-14: 102 (+20)
    – 14-15: 110 (+8)
  • This Talk: Data Quality
  • Data Quality: enhancing the usability of the acquired data and increasing the confidence of query results. "Poor data quality is the norm rather than the exception, but most organizations are in a state of denial about this issue." (Gartner Group)
  • Dirty Data is Expensive
    – Real-life data is often dirty: data error rates in industry range from 1% to 30% (Redman, 1998)
    – In 2009 the Obama administration offered $19 billion in grants for health IT, i.e. to improve EMRs
    – Erroneously priced data in retail databases costs US customers $2.5 billion each year
    – The Data Warehousing Institute estimates that data quality problems cost U.S. businesses more than $600 billion a year (2002)
  • Where to start? Data quality everywhere!
    – Data entry
    – Information extraction
    – Integration from multiple sources
    – Standardization and transformation
    – Business rules compliance
  • “Academic” Data Cleaning
    – Pick a well-understood data problem under some scoping assumptions and solve it independently: duplicates, functional dependency (FD) violations, matching dependency violations, missing value imputation
    – Piecemeal approach to tackle the complexity, and sometimes the intractability, of the problem: repairing violations of FD constraints in special cases (no deletion, left-hand-side changes only, allowing variables, etc.)
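To make one of the problems named on this slide concrete, here is a minimal sketch of detecting functional dependency violations. It is an illustration only: the function name `fd_violations`, the sample rows, and the dependency `zip → city` are assumptions, not material from the talk.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`.

    A functional dependency lhs -> rhs holds when every pair of rows with
    equal lhs values also has equal rhs values.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key].append(row)

    violations = {}
    for key, members in groups.items():
        rhs_values = {tuple(row[attr] for attr in rhs) for row in members}
        if len(rhs_values) > 1:  # same lhs, different rhs: violation
            violations[key] = members
    return violations


# Hypothetical sample data (not from the slides).
rows = [
    {"id": 1, "zip": "51519", "city": "Springfield"},
    {"id": 2, "zip": "51519", "city": "Sprngfield"},
    {"id": 3, "zip": "30528", "city": "Cleveland"},
]

bad = fd_violations(rows, lhs=["zip"], rhs=["city"])
for key, members in bad.items():
    print(key, [row["id"] for row in members])
```

Detection is the easy half. Deciding which of the conflicting cells to change is where the scoping assumptions listed on the slide come in.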
  • “Academic” Data Cleaning
    – Despite their theoretical and algorithmic beauty, these techniques are rarely used:
        – Problems never exist in isolation
        – Fixes to one problem often introduce “other” problems
        – Data is usually not accessible to mess with
        – Integrity constraints!... What integrity constraints?!!
  • “Practitioner” Data Cleaning
    – Will share some scary stories:
        – “Post-it notes” as an expert messaging system
        – “Written permission” to change the value of a record
        – Default values and best practices
        – “Call John... He will know what to do”
  • This Talk
    – A few data quality challenges and (hopefully) research directions
    – Summary of recent efforts at QCRI
  • 10 Data Quality Issues
  • Issue 1: The data trio (diagram label: Data Quality)
  • Extraction remains a key source of data errors: acquiring the semantics/schema of the underlying unstructured data sources (documents, emails, related Web info, click traces, profiles, interests, etc.)
  • Integration aggravates the problem. Linked data as an attempt to live with errors: link as you go
  • Issue 2: Data level or application level
    – Cleaning data tables by trusting the schema table is rarely useful
    – Will share a story:
        – Bellcore with 1800 inter-linked databases
        – Rule-based logic for sanity checking
        – Post-it messages to communicate between data quality officers, who work in shifts!
        – A data cleaning action is meaningless if not tied to business logic or to a process. Should never be against FDs
  • Issue 3: Protect your gain: DQ Dashboard
    – How to protect against going backwards
    – How to protect your gains during the cleansing process
    – Metrics:
        – Minimality principle: the most widely used in academic cleaning
        – Value of information: to spot the most important problem to fix
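The minimality principle mentioned on this slide can be sketched as follows: among candidate repairs, prefer the one that changes the fewest cells. This is a simplified reading under stated assumptions; the cell-count cost, the function names, and the sample tables are all hypothetical, and published repair algorithms use more refined cost models.

```python
def repair_cost(original, repaired):
    """Count cells whose value differs between two tables of the same shape.

    Each table is a list of dicts; row i of `repaired` corresponds to
    row i of `original`.
    """
    return sum(
        1
        for before, after in zip(original, repaired)
        for column in before
        if before[column] != after[column]
    )


def minimal_repair(original, candidates):
    """Pick the candidate repair that modifies the fewest cells."""
    return min(candidates, key=lambda cand: repair_cost(original, cand))


# Hypothetical dirty table: the two rows share a zip but disagree on city.
dirty = [
    {"zip": "51519", "city": "Springfield"},
    {"zip": "51519", "city": "Sprngfield"},
]

# Two hypothetical repairs that both restore consistency.
fix_one_cell = [
    {"zip": "51519", "city": "Springfield"},
    {"zip": "51519", "city": "Springfield"},
]
fix_two_cells = [
    {"zip": "51519", "city": "Shelbyville"},
    {"zip": "51519", "city": "Shelbyville"},
]

best = minimal_repair(dirty, [fix_two_cells, fix_one_cell])
print(repair_cost(dirty, fix_one_cell), repair_cost(dirty, fix_two_cells))
```

The same cost could serve as a crude dashboard number: tracking how far the current table has drifted from a trusted snapshot is one way to notice "going backwards".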
  • Issue 3: Protect your gain - Ideas
    – Root-cause analysis for data cleaning
    – Chase problems to the source to reason about “progress”
    – Leveraging “provenance” to design progress meters
  • Issue 4: Data is not an orphan!
    – Data stewards are not imaginary characters! Important data has stewards and custodians
    – Need to go through these guardians first: some health care settings require a signed form per changed cell, stating the reasons for the change
    – Possible approaches: How to avoid stewards? How to integrate them in the process or minimize their involvement?
  • Issue 5: How clean is clean?
    – Quality awareness eats up 10% of the budget [Telecom Experience]
    – How to avoid over-cleaning
    – Example: “Bill Forgiveness”, a real-life experience: roaming charges and cross-carrier calls have a very complicated business model
    – Possible approaches:
        – Measure cleaning progress
        – Clean only to satisfy some application needs
  • Issue 6: Online cleaning is a necessity, not a feature
    – We live in a complex world → complex applications with hundreds or thousands of components and parameters
    – Clean as you go, clean on demand, clean opportunistically: this can be the only hope
    – New concepts: iterative cleaning; cleaning dynamic and evolving data
    – Off-line cleaning can still benefit historical data but is becoming less and less important
  • Issue 7: Application quality
    – Data quality → information quality → application quality
    – Realize the levels of complexity in current BI apps
    – Data usage should influence data cleaning: “usage-based” data cleaning
  • Issue 8: SW engineering DQ
    – Current focus is on discrete values with simple integrity constraints (FD, uniqueness, …)
    – We are good at checking whether data complies with rules
    – Real business rules are often “assertions” expressed in “Turing-complete” languages
    – Checking “did we write the assertions right?” becomes a lot harder
    – But we also need to ask whether we wrote the right assertions!
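One hedged way to picture the slide's point: once a business rule is ordinary code, the rule itself needs testing. The sketch below checks a rule against records whose verdict was labeled by hand. The rule, the field names, and the labeled records are all hypothetical, loosely inspired by the telecom billing example elsewhere in the talk.

```python
def bill_rule(record):
    """Hypothetical business rule: charges are never negative, and a
    roaming call must name a partner carrier."""
    if record["charge"] < 0:
        return False
    if record["roaming"] and not record["partner_carrier"]:
        return False
    return True


def rule_disagreements(rule, labeled_examples):
    """Return the records where the rule's verdict differs from the label."""
    return [rec for rec, expected in labeled_examples if rule(rec) != expected]


# Hand-labeled records: (record, is_valid according to a domain expert).
labeled = [
    ({"charge": 1.2, "roaming": False, "partner_carrier": None}, True),
    ({"charge": -3.0, "roaming": False, "partner_carrier": None}, False),
    ({"charge": 0.8, "roaming": True, "partner_carrier": None}, False),
    # The expert rejects a non-roaming call that names a partner carrier;
    # the rule as written does not cover this case.
    ({"charge": 0.5, "roaming": False, "partner_carrier": "CarrierB"}, False),
]

missed = rule_disagreements(bill_rule, labeled)
print(missed)
```

A disagreement shows that the assertion was written wrong or incompletely. It cannot show whether the expert's labels describe the right rule in the first place, which is the slide's second, harder question.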
  • Issue 9: DQ Theory?
    – The ACID properties in transaction management were not only sensible requirements; they also came with algorithms and methods to enforce them during transaction processing
    – Does it make sense to do the same for quality? Plausible properties along with actions for maintaining acceptable quality during data manipulation
    – Some of these already exist (timeliness, currency, consistency, etc.) but lack methods of enforcement
  • Issue 10: Scale... Scale
    – Terabytes and petabytes of data require new ways to enforce data quality
    – Which ball to drop
    – Leveraging application semantics and data usage
    – Sampling to learn from the few and apply to the masses
    – Active learning to replace human feedback (GDR as a solution)
  • Sample QCRI Projects
  • GDR – Guided Data Repair
    – Scalable ways to involve experts
    – Repurposing destructive automatic techniques to guide repairs
    – Value of Information measures to generate the most important questions
    – Judicious use of active learning from user feedback
    – Diagram labels: Input Database Instance; Detect Errors and Violations; Learn and Repair; Clean Database Instance; User Query; Results
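The slide does not say how GDR computes value of information, so the following is only a sketch of the general idea under a loudly stated assumption: each suggested update carries an estimated probability of being correct and a count of violations it would resolve, and questions are ranked by the product of the two. The class, field names, and numbers are hypothetical and are not GDR's actual measure.

```python
from dataclasses import dataclass


@dataclass
class SuggestedUpdate:
    cell: tuple            # (row id, column name)
    new_value: str
    p_correct: float       # assumed estimate that the update is right
    violations_fixed: int  # assumed number of rule violations it resolves


def expected_benefit(update):
    """Assumed scoring: probability of being right times violations resolved."""
    return update.p_correct * update.violations_fixed


def rank_questions(updates, budget):
    """Return the `budget` updates most worth showing to a human expert."""
    return sorted(updates, key=expected_benefit, reverse=True)[:budget]


# Hypothetical suggestions produced by an automatic repair technique.
suggestions = [
    SuggestedUpdate(("P5", "Name"), "Green", p_correct=0.9, violations_fixed=1),
    SuggestedUpdate(("P2", "ZIP"), "51519", p_correct=0.5, violations_fixed=4),
    SuggestedUpdate(("P6", "ZIP"), "51518", p_correct=0.2, violations_fixed=3),
]

to_ask = rank_questions(suggestions, budget=2)
print([u.cell for u in to_ask])
```

In an active-learning loop, the expert's answers would feed back into the `p_correct` estimates, so that fewer questions need to be asked over time.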
  • GDR Architecture
  • Probabilistic Data Cleaning (diagram labels: Input Database Instance; Error Detection; Uncertain Repair Generation; Possible Clean Instances; Clean Database Instance; User Query; Probabilistic Results)
  • Possible Repairs: a possible repair is a clustering of the input tuples
        Person
        ID   Name    ZIP     Income
        P1   Green   51519   30k
        P2   Green   51518   32k
        P3   Peter   30528   40k
        P4   Peter   30528   40k
        P5   Gree    51519   55k
        P6   Chuck   51519   30k

        Possible repairs (uncertain clustering)
        X1: {P1}, {P2}, {P3,P4}, {P5}, {P6}
        X2: {P1,P2}, {P3,P4}, {P5}, {P6}
        X3: {P1,P2,P5}, {P3,P4}, {P6}
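The three possible repairs on this slide can be written down directly as clusterings and queried. The sketch below is an illustration rather than the system's implementation: it computes which pairs of tuples are merged in every repair and which are merged in only some. The helper name `merged_pairs` is an assumption, and the slide gives no probabilities for X1 to X3, so none are used.

```python
from itertools import combinations

# Clusterings X1, X2, X3 as read from the slide.
possible_repairs = {
    "X1": [{"P1"}, {"P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X2": [{"P1", "P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X3": [{"P1", "P2", "P5"}, {"P3", "P4"}, {"P6"}],
}


def merged_pairs(clustering):
    """Return every pair of tuple ids that share a cluster."""
    pairs = set()
    for cluster in clustering:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs


pairs_by_repair = {name: merged_pairs(c) for name, c in possible_repairs.items()}

# Pairs treated as duplicates in every repair vs. in at least one repair.
certain = set.intersection(*pairs_by_repair.values())
possible = set.union(*pairs_by_repair.values())
uncertain = possible - certain

print("certain:", sorted(certain))
print("uncertain:", sorted(uncertain))
```

Only P3 and P4 are duplicates under every repair. Whether P1, P2, and P5 refer to the same person depends on which repair is chosen, which is the uncertainty a probabilistic query answer would have to carry.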
  • Thank You. www.qcri.qa