Data Quality: Not Your Typical Database Problem

Ahmed K. Elmagarmid (IEEE Fellow and ACM Distinguished Scientist) gave a lecture on Data Quality: Not Your Typical Database Problem in the Distinguished Lecturer Series - Leon The Mathematician.

Published in: Technology, Business

Transcript of "Data Quality: Not Your Typical Database Problem"

  1. Data Quality: Not your Typical Database Problem. Ahmed Elmagarmid, Executive Director, Qatar Computing Research Institute. 2011 © Copyright QCRI. Confidential document.
  2. Where are we located?
  3. [no text on slide]
  4. [no text on slide]
  5. Qatar Foundation
  6. Science & Research; Community Development; Education. 2.8 percent of GDP to be spent on research annually by 2015
  7. Qatar Foundation Research Division: Qatar Computing Research Institute (QCRI); Qatar Energy & Environment Research Institute (QEERI); Qatar Biomedical Research Institute (QBRI)
  8. QCRI Overview
  9. QCRI Vision: To make Qatar a global center for computing research by becoming the world’s recognized leader in Arabic language technologies and in key areas vital to the global growth of Qatari business and entrepreneurial activity.
  10. QCRI Model (diagram; axes: Grand Challenges, Basic Research to Applied Research)
      • National Institutions (QCRI): Grand practical challenges; National and global impact; Large teams and long term; Example peers: INRIA, MPI
      • Academia: Localized skills & knowledge; Individual projects; Students move on; Theoretical & basic; Project-based research
      • Research Parks: Commercialization; Entrepreneurship; Incubation
  11. QCRI Ecosystem: QU Sidra QBRI MIT HKU QEERI QCRI WikiMedia QSTP Aljazeera QP ALTIS Boeing Energy Google MEEZA Yahoo Co. QSA IBM Microsoft
  12. QCRI Research Centers: Arabic Language Technologies; Social Computing; Scientific Computing; Data Analytics; Cloud Computing
  13. QCRI Scientific Advisory Council
      • Prof. Rich DeMillo, Georgia Tech, Chair
      • Lord Rupert Redesdale, UK House of Lords
      • Prof. Joichi Ito, MIT Media Lab Director
      • Prof. Ruzena Bajcsy, University of California – Berkeley
      • Lew Tucker, Vice President, Cisco
      • Prof. Alfred V. Aho, Columbia University
      • Prof. Dick Lipton, Georgia Tech
      • Yousef Khalidi, Vice President, Microsoft
  14. The 60 Doers! Abdellatif Ahmed Richard Jill Management Ihab Nan Mourad and Support Team Richard P. Paolo Melissa Data Analytics Amr Kamal Halima Amal John Rashid Nada Agathe Scientific Michele Hend Chu ElKindi Computing Kulood Samreen Mohamed Simon P. Mustafa Tarek Preslav Othmane Kareem Stephan Ahmed A. Wei William Arabic Cloud Ahmed T. Language ThuyLinh Computing Sihem Maged Gautam Khaled Aysha Ahmed M. Technologies Sofiane Social Ahmed A. Gokop Computing Ahmed T. Lolwa Safdar Amira Aybuke Shameem Francisco Simon G. Walid Peng Mikalai Khulood Ruth
  15. Strategic Partnerships
  16. Agenda: Strategic Partnerships
  17. 5-YEAR QCRI MANPOWER PLAN
      Year:      10-11  11-12  12-13  13-14  14-15
      Manpower:  21     34     82     102    110
      Increase:         +13    +48    +20    +8
  18. This Talk: Data Quality
  19. Data Quality: Enhancing the usability of the acquired data and increasing the confidence of query results. "Poor data quality is the norm rather than the exception, but most organizations are in a state of denial about this issue." -Gartner Group
  20. Dirty Data is Expensive
      • Real life data is often dirty: data error rates in industry: 1% - 30% (Redman, 1998)
      • Erroneously priced data in retail databases costs US customers $2.5 billion each year
      • Obama administration offered $19 billion grants for health IT, i.e. improve EMRs in 2009
      • The Data Warehousing Institute estimates that data quality problems cost U.S. businesses more than $600 billion a year (2002)
  21. Where to start? Data Quality everywhere!
      • Data Entry
      • Information Extraction
      • Integration from multiple sources
      • Standardization and transformation
      • Business rules compliance
  22. “Academic” Data Cleaning
      ● Pick a well understood data problem under some scoping assumptions and solve independently
         – Duplicates
         – Functional Dependency violations
         – Matching dependency violations
         – Missing value imputation
      ● Piece-meal approach to tackle the complexity and sometimes the intractability of the problem
         – Repairing violations of FD constraints in special cases (no deletion, left hand side changes only, allowing variable etc.)
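The "Functional Dependency violations" item on slide 22 can be made concrete with a small sketch. This is only an illustration, not a method from the talk: the function name `fd_violations`, the FD zip → city, and the sample rows are all made up for the example. It groups rows by the left-hand side of the FD and reports every group whose right-hand side values disagree.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {lhs_key: rows} for groups that agree on `lhs` but differ on `rhs`."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key].append(row)
    return {
        key: members
        for key, members in groups.items()
        # A group violates the FD when it carries more than one distinct rhs value.
        if len({tuple(m[attr] for attr in rhs) for m in members}) > 1
    }


# Hypothetical sample data: the FD zip -> city should hold.
rows = [
    {"id": "P1", "zip": "51519", "city": "Springfield"},
    {"id": "P2", "zip": "51519", "city": "Sprngfield"},  # typo breaks the FD
    {"id": "P3", "zip": "30528", "city": "Riverton"},
    {"id": "P4", "zip": "30528", "city": "Riverton"},
]

violations = fd_violations(rows, lhs=["zip"], rhs=["city"])
for key, members in violations.items():
    print(key, [m["id"] for m in members])
```

Detection is the easy half. Deciding which of the conflicting cells to change is where the scoping assumptions listed on the slide come in.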
  23. “Academic” Data Cleaning
      • Despite their theoretic and algorithmic beauty, rarely used
         – Problems never exist in isolation
         – Fixes to one problem often introduce “other” problems
         – Data usually not accessible to mess with
         – Integrity constraints!... What integrity constraints?!!
  24. “Practitioner” Data Cleaning
      • Will share some scary stories
         – “post-it notes” as an expert messaging system
         – “written permission” to change value of a record
         – Default values and best practices
         – “Call John.. He will know what to do”
  25. This Talk
      ● A few data quality challenges and (hopefully) research directions
      ● Summary of recent efforts at QCRI
  26. 10 Data Quality Issues
  27. Issue 1: The data trio. DATA Quality
  28. Extraction remains a key source of data errors: Acquiring the semantics/schema of the underlying unstructured data sources (documents, emails, related Web info, click traces, profiles, interests, etc.)
  29. Integration aggravates the problem. Linked data as an attempt to live with errors .. link as you go [comment m1]
  30. Comment m1 on slide 29 (mourad, 7/23/2011): I'm not sure about this idea of putting "linked data" so prominent in this slide on II
  31. Issue 2: Data level or application level
      • Cleaning data tables by trusting the schema table is rarely useful
      • Will share a story
         – Bellcore with 1800 inter-linked databases
         – Rule-based logic for sanity checking
         – Post-it messages to communicate between data quality officers .. who work in shifts!
         – Data cleaning action is meaningless if not tied to a business logic or to a process. Should never be against FDs
  32. Issue 3: Protect your gain: DQ Dashboard
      ● How to protect against going backwards
      ● How to protect your gains during the cleansing process
      ● Metrics:
         – Minimality Principle: mostly and widely used in academic cleaning
         – Value of information: to spot the most important problem to fix
  33. Issue 3: Protect your gain - Ideas
      • Root-cause analysis for data cleaning
      • Chase problems to the source to reason about “progress”
      • Leveraging “Provenance” to design progress meters
  34. Issue 4: Data is not an orphan!
      ● Data Stewards are not imaginary characters! Important data has stewards and custodians
      ● Need to go through these guardians first
         – Some health care requires a signed form per changed cell stating reasons for change
      ● Possible approaches:
         – How to avoid stewards?
         – How to integrate them in the process or minimize their involvement?
  35. Issue 5: How clean is clean?
      • Quality awareness eats up 10% of the budget [Telecom Experience]
      • How to avoid over-cleaning
      • Example: “Bill Forgiveness”, a real-life experience: roaming charges and cross-carrier calls have a very complicated business model
      • Possible approaches
         – Measure cleaning progress
         – Clean only to satisfy some application needs
  36. Issue 6: Online cleaning a necessity not a feature
      ● We live in a complex world → complex applications with 100s and 1000s of components and parameters
      ● Clean as you go .. Clean on demand .. Clean opportunistically .. Can be the only hope
      ● New concepts:
         – Iterative cleaning
         – Cleaning dynamic and evolving data
      ● Off-line cleaning can still benefit historical data but is becoming less and less important
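One way to picture "clean on demand" from Issue 6 is a wrapper that cleans a record only when a query touches it, and re-cleans it when the underlying data changes. The sketch below is an assumption-laden illustration, not something shown in the talk: the class `OnDemandCleaner`, the `normalize_zip` rule, and the sample records are all hypothetical.

```python
class OnDemandCleaner:
    """Cleans records lazily: only when read, and again after they change."""

    def __init__(self, records, clean_fn):
        self._records = dict(records)  # raw, possibly dirty data
        self._clean_fn = clean_fn      # hypothetical per-record cleaning rule
        self._cleaned = {}             # cache of records cleaned so far

    def get(self, key):
        if key not in self._cleaned:
            self._cleaned[key] = self._clean_fn(self._records[key])
        return self._cleaned[key]

    def update(self, key, record):
        # Evolving data: drop the cached clean version so it is redone on next read.
        self._records[key] = record
        self._cleaned.pop(key, None)

    def cleaned_keys(self):
        return sorted(self._cleaned)


def normalize_zip(record):
    """Illustrative rule: strip whitespace and left-pad ZIP codes to 5 digits."""
    cleaned = dict(record)
    cleaned["zip"] = record["zip"].strip().zfill(5)
    return cleaned


cleaner = OnDemandCleaner(
    {"P1": {"zip": " 51519 "}, "P2": {"zip": "518"}},
    normalize_zip,
)
print(cleaner.get("P1"))       # only P1 is cleaned at this point
print(cleaner.cleaned_keys())
```

The untouched record `P2` stays dirty until some query needs it, which is the trade-off the slide points at: cleaning cost follows usage rather than data volume.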
  37. Issue 7: Application quality
      • Data Quality → Information Quality → Application quality
      • Realize the levels of complexity in current BI apps
      • Data usage should influence data cleaning
         – “Usage-based” data cleaning
  38. Issue 8: SW engineering DQ
      • Current focus on discrete values with simple integrity constraints (FD, uniqueness…)
      • We are good at checking if data complies with rules
      • Real business rules are often “assertions” and expressed in “Turing-complete” languages
      • Checking “did we write the assertions right?” becomes a lot harder
      • But also.. need to think if we wrote the right assertions!
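To illustrate the contrast in Issue 8, here is a sketch of a business rule written as an arbitrary predicate in a general-purpose language rather than as an FD or uniqueness constraint. The rule itself (discounts above 20% need an approver), the field names, and the sample orders are invented for the example.

```python
def rule_discount_requires_approval(order):
    """Hypothetical business rule: a discount above 20% needs a named approver."""
    return order["discount_pct"] <= 20 or order.get("approved_by") is not None


def find_violations(records, rules):
    """Return (record id, rule name) for every rule a record fails."""
    return [
        (record["id"], rule.__name__)
        for record in records
        for rule in rules
        if not rule(record)
    ]


# Made-up orders, used both as data to check and as test cases for the rule itself.
orders = [
    {"id": 1, "discount_pct": 10},
    {"id": 2, "discount_pct": 35, "approved_by": "manager-7"},
    {"id": 3, "discount_pct": 35},
]

print(find_violations(orders, [rule_discount_requires_approval]))
```

Checking data against such a predicate is straightforward. Checking the predicate is harder: small hand-labelled cases like the three orders above give some evidence that the assertion was written right, but say nothing about whether it is the right assertion.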
  39. Issue 9: DQ Theory?
      • The ACID properties in transaction management were not only sensible requirements but also had algorithms and methods to enforce them during transaction processing
      • Does it make sense to do the same for Quality? Plausible properties along with actions for maintaining acceptable quality during data manipulation
      • Some of these already exist: Timeliness, Currency, Consistency, etc. but lack methods of enforcement
  40. Issue 10: Scale .. Scale
      • Terabytes and Petabytes of data require new ways to enforce data quality
      • Which ball to drop
      • Leveraging application semantics and data usage
      • Sampling to learn from the few and apply on the masses
      • Active learning to replace human feedback (GDR as a solution)
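"Sampling to learn from the few and apply on the masses" can be sketched in a few lines. This is a deliberately naive illustration under stated assumptions: a human has labelled a small sample of (dirty value, corrected value) pairs, and the learned "model" is just a majority-vote lookup table. The function names and the data are hypothetical, and a real system would learn something far more general than exact-match corrections.

```python
from collections import Counter, defaultdict


def learn_corrections(labeled_sample):
    """Learn dirty -> clean mappings by majority vote over a small labelled sample."""
    votes = defaultdict(Counter)
    for dirty, clean in labeled_sample:
        votes[dirty][clean] += 1
    return {dirty: counts.most_common(1)[0][0] for dirty, counts in votes.items()}


def apply_corrections(values, corrections):
    """Apply learned corrections to the full column; unknown values pass through."""
    return [corrections.get(value, value) for value in values]


# Hypothetical expert-labelled sample (the "few").
sample = [("Gree", "Green"), ("Gree", "Green"), ("Gree", "Greer"), ("Petr", "Peter")]
corrections = learn_corrections(sample)

# The full column (the "masses"), also made up.
column = ["Green", "Gree", "Petr", "Chuck", "Gree"]
print(apply_corrections(column, corrections))
```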
  41. Sample QCRI Projects
  42. GDR – Guided Data Repair
      • Scalable ways to involve experts
      • Repurposing destructive automatic techniques to guide repairs
      • Value of Information measures to generate the most important questions
      • Judicious use of active learning from user feedback
      [Diagram labels: Input Database Instance; Detect Errors and Violations; Learn and Repair Database; Clean Database Instance; User Query; Results]
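The idea of using a value-of-information measure to pick "the most important questions" for an expert can be sketched as a ranking problem. The scoring below is an assumption made for illustration and is not GDR's actual formula: each candidate repair carries an estimated probability of being correct and a count of violations it would resolve, and the expert is asked about the candidates with the highest expected benefit. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidateRepair:
    tuple_id: str
    attribute: str
    new_value: str
    p_correct: float          # assumed estimate that the suggestion is right
    violations_resolved: int  # violations removed if the suggestion is applied


def expected_benefit(candidate):
    """Illustrative score: expected number of violations resolved."""
    return candidate.p_correct * candidate.violations_resolved


def questions_for_expert(candidates, budget):
    """Pick the `budget` candidates with the highest expected benefit."""
    return sorted(candidates, key=expected_benefit, reverse=True)[:budget]


candidates = [
    CandidateRepair("P5", "Name", "Green", p_correct=0.9, violations_resolved=1),
    CandidateRepair("P2", "ZIP", "51519", p_correct=0.5, violations_resolved=4),
    CandidateRepair("P6", "ZIP", "51518", p_correct=0.2, violations_resolved=3),
]

for question in questions_for_expert(candidates, budget=2):
    print(question.tuple_id, question.attribute, "->", question.new_value)
```

In an active-learning loop, the expert's answers would then be fed back to update the `p_correct` estimates, so that fewer questions are needed over time.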
  43. GDR Architecture
  44. Probabilistic Data Cleaning
      [Diagram labels: Input Database Instance; Error Detection; Uncertain Repair Generation; Possible Clean Database Instances; User Query; Probabilistic Results]
  45. Possible Repairs: A possible repair is a clustering of the input tuples

      Person
      ID   Name   ZIP    Income
      P1   Green  51519  30k
      P2   Green  51518  32k
      P3   Peter  30528  40k
      P4   Peter  30528  40k
      P5   Gree   51519  55k
      P6   Chuck  51519  30k

      Possible Repairs (Uncertain Clustering)
      X1: {P1} {P2} {P3,P4} {P5} {P6}
      X2: {P1,P2} {P3,P4} {P5} {P6}
      X3: {P1,P2,P5} {P3,P4} {P6}
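The three possible repairs on slide 45 can be queried directly once they are written down as clusterings. The sketch below encodes X1, X2 and X3 from the slide. The slide gives no probabilities for the repairs, so the uniform weights are purely an assumption for illustration, as are the two helper functions.

```python
from fractions import Fraction

# Possible repairs from slide 45: each repair is a clustering of the Person tuples.
possible_repairs = {
    "X1": [{"P1"}, {"P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X2": [{"P1", "P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X3": [{"P1", "P2", "P5"}, {"P3", "P4"}, {"P6"}],
}

# ASSUMPTION: uniform weights; the slide does not state any probabilities.
weights = {name: Fraction(1, len(possible_repairs)) for name in possible_repairs}


def same_entity_probability(a, b):
    """Total weight of the repairs in which tuples `a` and `b` share a cluster."""
    return sum(
        weights[name]
        for name, clusters in possible_repairs.items()
        if any(a in cluster and b in cluster for cluster in clusters)
    )


def entity_count_distribution():
    """Distribution over 'how many distinct persons are there?' across repairs."""
    distribution = {}
    for name, clusters in possible_repairs.items():
        count = len(clusters)
        distribution[count] = distribution.get(count, 0) + weights[name]
    return distribution


print(same_entity_probability("P1", "P2"))
print(same_entity_probability("P1", "P5"))
print(entity_count_distribution())
```

Instead of committing to one clean instance, a query is answered over all possible repairs, and each answer carries the total weight of the repairs that support it.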
  46. Thank You. www.qcri.qa