Managing uncertainty in data - Presentation at Data Science Northeast Netherlands Meetup 14 Jan 2016

Managing uncertainty in data: the key to effective management of data quality problems
Business analytics and data science are significantly impaired by a wide variety of 'data handling' issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems centers around data semantics and data quality. We have developed a generic method which is based on modeling such problems as uncertainty *in* the data. A recently conceived new kind of DBMS can store, manage, and query large volumes of uncertain data: the UDBMS or "Uncertain Database". Together, they allow one to, e.g., postpone the resolution of data problems, assess what their influence is on analytical results, etc. We furthermore develop technology for data cleansing, web harvesting, and natural language processing which uses this method to deal with ambiguity of natural language and many other problems encountered when using unstructured data.

Published in: Science

  1. MANAGING UNCERTAINTY IN DATA: THE KEY TO EFFECTIVE MANAGEMENT OF DATA QUALITY PROBLEMS
     MAURICE VAN KEULEN
  2. REVOLUTION IN SCIENTIFIC METHOD
     Paradigms of scientific method
     • Empiricism
     • Mathematical modeling
     • Simulation
     A new paradigm: Data-intensive Scientific Discovery
     • Combining and analyzing data in novel ways is capable of tackling research questions that could not be answered before
     Bio-Informatics professor: “PhD of 4 years, 3 years devoted to ‘data fiddling’”
     (14 Jan 2016, Managing uncertainty in data: the key to effective management of data quality problems)
  3. A FIRST STORY: PREGNANCY RESEARCH
     Research on pregnancy processes based on Electronic Patient Dossiers (EPDs) of some population of women
     • Select consult & treatment records from their EPDs from multiple sources
     • After first analysis one discovers many records not related to pregnancy (e.g., dermatologist consult)
     • Assumption that all records that belong to a pregnant woman are related to pregnancy is wrong, hence also the selection criterion!
     • There is no objective means to ascertain this such as a field ‘related to pregnancy’
  4. A FIRST STORY: PREGNANCY RESEARCH
     • A painstaking process follows with specifying filter rules and manually inspecting samples of results
     • Imperfect process so noisy records remain!
     • Wrong diagnoses cause more records to be erroneously in or out → more noisy records
     • Then, one looks at a sample and notices something strange in the times of consults: many appear close to each other and in the evening
     • Modification time of EPD record (what is recorded) does not reflect actual moment of activity (semantics) → sequence and duration noise
  5. GEO-SOCIAL RECOMMENDATION: GPS TRAJECTORIES
     • Detect visits from trajectories
     • GPS traces from mobile phones
     • Point-Of-Interest (POI) data harvested from the internet
     • Purpose: construct profiles of customers and products for recommendation
     • Holiday homes
     • Greeting cards
  6. DATA-DRIVEN FRAUD RISK ANALYSIS
     Substantial amount of money involved in fraud. Ease of committing fraud incites otherwise decent people to do it as well. Danger to society
     • Inspect where there is a high risk of fraud
     • Example ISZW: labor market, labor circumstances, etc.
     • But: government data represents paper reality!
     Include traces from the internet (social media, web forums): customers, employees, and by-standers leave behind observations and opinions
     • But natural language: about which company do they talk?
  7. NATURAL LANGUAGE PROCESSING: AMBIGUITY ABOUNDS
     • Paris Hilton stayed in the Paris Hilton
     • Lady Gaga - Speechless live @ Helsinki 10/13/2010 http://www.youtube.com/watch?v=yREociHyijk . . . @ladygaga also talks about her Grampa who died recently
     • Laelith Demonia has just defeated liwanu Hird. Career wins is 575, career losses is 966.
     • Adding Win7Beta, Win2008, and Vista x64 and x86 images to munin. #wds
     • history should show that bush jr should be in jail or at least never should have been president
  8. TECHNOLOGY WE WORK ON (WE = DATABASES GROUP FROM UNIVERSITY OF TWENTE)
     • Search (finding the needle in the haystack)
     • Information extraction from unstructured sources
     • Natural language processing
     • Web harvesting (both produce lower quality structured data)
     • Data quality management
     • Responsible analytics is (among other things) “Knowing how data quality problems in the source data affect the analytical results”
     Equally true for Business Analytics
  9. IMPACT OF DATA IMPERFECTIONS
     CustID  Sales  Name
     1234    6000   John
     2345    5000   Mary
     3456    12000  Bart
     …       …      …
     SELECT SUM(Sales) FROM CustSales
     Result: 3423000
     • Sample of data looks fine
     • Result of analysis looks perfectly reasonable
     If you don’t look hard enough, if you don’t properly pay attention to it … you will be unaware … that you are possibly looking at significantly erroneous figures!!!
  10. IMPACT OF DATA IMPERFECTIONS
      CustID  Sales  Name
      1234    6000   John
      2345    5000   Mary
      3456    12000  Bart
      …       …      …
      SELECT SUM(Sales) FROM CustSales
      Result: 3423000 ????
      CustID  Sales  Name
      6789    2      Tom
      4567    6000   Jon
      5678    NULL   Nina
      …       …      …
      • Wrong figures included
      • Missing figures
      • Double counting
      • etc.
      Many more problems at value, record, schema, source, trust levels
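A minimal sketch of how the imperfections on this slide can hide inside a plausible-looking total, assuming only the six rows shown on the slides (the real CustSales table and its 3423000 total are not reproduced). The interpretation of each suspicious row in the comments and the threshold of 10 are illustrative assumptions, not part of the talk.

```python
import sqlite3

# Rows copied from the slides; the comments are one possible reading of each problem.
rows = [
    (1234, 6000, "John"),
    (2345, 5000, "Mary"),
    (3456, 12000, "Bart"),
    (6789, 2, "Tom"),      # suspiciously small: possibly a wrong figure
    (4567, 6000, "Jon"),   # same amount as John: possibly a double count
    (5678, None, "Nina"),  # missing figure (NULL)
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CustSales (CustID INTEGER, Sales INTEGER, Name TEXT)")
conn.executemany("INSERT INTO CustSales VALUES (?, ?, ?)", rows)

# SUM silently skips NULLs and happily adds wrong or duplicated figures.
total = conn.execute("SELECT SUM(Sales) FROM CustSales").fetchone()[0]

# Simple checks that make some of the problems visible.
missing = conn.execute(
    "SELECT COUNT(*) FROM CustSales WHERE Sales IS NULL"
).fetchone()[0]
tiny = conn.execute(
    "SELECT COUNT(*) FROM CustSales WHERE Sales < 10"  # hypothetical threshold
).fetchone()[0]
conn.close()

print(total, missing, tiny)
```

The total comes back as an ordinary number with no hint that one value was missing and two others are doubtful, which is the point the slide makes.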
  11. PROBABILISTIC DATABASES TO THE RESCUE
      Probabilistic database technology can store, query, analyze, reason with data taking into account possible influence on the results
      • Treats data quality problems as a fact of life
      • Responsible analytics: know deficiencies of results
      • Generic and scalable approach and technology
      • Nice properties for application: postpone resolution/cleaning, pay-as-you-go; good-is-good-enough; human-in-the-loop
  12. PROBABILISTIC DATA INTEGRATION
      Let’s go for an initial integration that can readily and meaningfully be used
      “Good is good enough” for meaningful use in many applications (can be achieved 10x earlier)
      Let it improve during use
      [Diagram] Initial integration: partial data integration; enumerate cases for remaining problems; store data with uncertainty in UDBMS
      [Diagram] Continuous improvement: use (analytics); measure quality; improve data quality
      Callouts: postpone problems; stop earlier; pay as you go; human in the loop
  13. COMBINING DATA …
      Three source tables:
      Car brand  Sales
      B.M.W.     25
      Mercedes   32
      Renault    10

      Car brand      Sales
      BMW            72
      Mercedes-Benz  39
      Renault        20

      Car brand                 Sales
      Bayerische Motoren Werke  8
      Mercedes                  35
      Renault                   15

      Combined table:
      Car brand                 Sales
      B.M.W.                    25
      Bayerische Motoren Werke  8
      BMW                       72
      Mercedes                  67
      Mercedes-Benz             39
      Renault                   45

      Keulen, M. (2012) Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54 (3). pp. 138-146. ISSN 1611-2776
  14. … AND THE PROBLEM OF SEMANTIC DUPLICATES
      Car brand                 Sales
      B.M.W.                    25
      Bayerische Motoren Werke  8
      BMW                       72
      Mercedes                  67
      Mercedes-Benz             39
      Renault                   45
      Preferred customers …
      SELECT SUM(Sales) FROM CarSales WHERE Sales>100
      Result: 0 (‘No preferred customers’)
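A small sketch of why the query on this slide finds no preferred customers, and what changes once the semantic duplicates are merged. The alias map below is a hand-written assumption standing in for a duplicate detection tool; the function and column names are illustrative. Note that in standard SQL a SUM over zero rows yields NULL, so COALESCE is used to obtain the 0 shown on the slide.

```python
import sqlite3

# Combined table from the slide.
combined = [
    ("B.M.W.", 25), ("Bayerische Motoren Werke", 8), ("BMW", 72),
    ("Mercedes", 67), ("Mercedes-Benz", 39), ("Renault", 45),
]

# ASSUMPTION: a hand-written alias map, standing in for a duplicate detection tool.
ALIASES = {
    "B.M.W.": "BMW",
    "Bayerische Motoren Werke": "BMW",
    "Mercedes-Benz": "Mercedes",
}

def preferred_total(rows, threshold=100):
    """Run the slide's query (Sales > threshold) on an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE CarSales (Brand TEXT, Sales INTEGER)")
    conn.executemany("INSERT INTO CarSales VALUES (?, ?)", rows)
    # SUM over zero rows is NULL; COALESCE maps it to 0.
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(Sales), 0) FROM CarSales WHERE Sales > ?",
        (threshold,),
    ).fetchone()
    conn.close()
    return total

def merge_duplicates(rows, aliases):
    """Add up the sales of rows that the alias map treats as the same brand."""
    merged = {}
    for brand, sales in rows:
        canonical = aliases.get(brand, brand)
        merged[canonical] = merged.get(canonical, 0) + sales
    return sorted(merged.items())

naive = preferred_total(combined)        # duplicates kept apart
merged_rows = merge_duplicates(combined, ALIASES)
resolved = preferred_total(merged_rows)  # duplicates merged
print(naive, merged_rows, resolved)
```

With the duplicates kept apart no row exceeds 100; with all of them merged, both BMW (105) and Mercedes (106) qualify. The sketch treats the merge as certain, which is exactly the decision the later slides propose to keep open.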
  15. SEMANTIC DUPLICATES
      [Figure: the six database rows (d1 to d6) of the Car brand / Sales table (Mercedes-Benz 39, BMW 72, Renault 45, Mercedes 67, Bayerische Motoren Werke 8, B.M.W. 25) set against the real world of car brands (ω, objects o1 to o4)]
  16. MOST DATA QUALITY PROBLEMS CAN BE MODELED AS UNCERTAINTY IN DATA
      Run some duplicate detection tool
      #  Car brand                 Sales  Condition
      1  B.M.W.                    25
      2  Bayerische Motoren Werke  8
      3  BMW                       72
      4  Mercedes                  67     X=0
      5  Mercedes-Benz             39     X=0
      6  Renault                   45
         Mercedes                  106    X=1 Y=0
         Mercedes-Benz             106    X=1 Y=1

      Variable  Meaning                        Probability
      X=0       4 and 5 different              0.2
      X=1       4 and 5 the same               0.8
      Y=0       “Mercedes” correct name        0.5
      Y=1       “Mercedes-Benz” correct name   0.5
      B.M.W. / BMW / Bayerische Motoren Werke analogously
  17. IMPORTANT TOOL: PROBABILISTIC DATABASE
      • Looks like ordinary database
      • Several “possible” answers or approximate answers to queries
      • Important: Scalability (big data!)
      Sales of “preferred customers”
      • SELECT SUM(sales) FROM carsales WHERE sales≥ 100
      SUM(sales)  P
      0           14%
      105         6%
      106         56%
      211         24%
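The answer distribution on this slide can be reproduced by enumerating possible worlds, sketched below under stated assumptions. The probability 0.8 that the two Mercedes rows are the same comes from slide 16. The slides give no numbers for the BMW rows ("analogously"), so the 0.3 probability that all three BMW rows denote one brand is an assumption, chosen because it is the value consistent with the table above. Partial BMW merges are left out because none of them reaches 100 (72 + 25 = 97), and the name variable Y is left out because it does not change the sum.

```python
from collections import defaultdict
from itertools import product

P_X = {0: 0.2, 1: 0.8}  # slide 16: Mercedes / Mercedes-Benz different (0) or the same (1)
P_B = {0: 0.7, 1: 0.3}  # ASSUMPTION: all three BMW rows the same (1) or not (0)

def carsales_world(x, b):
    """Build the carsales table for one combination of choices."""
    rows = [("Renault", 45)]
    if x:
        rows.append(("Mercedes", 106))  # 67 + 39 merged
    else:
        rows += [("Mercedes", 67), ("Mercedes-Benz", 39)]
    if b:
        rows.append(("BMW", 105))       # 25 + 8 + 72 merged
    else:
        rows += [("B.M.W.", 25), ("Bayerische Motoren Werke", 8), ("BMW", 72)]
    return rows

def preferred_sum(rows):
    """SELECT SUM(sales) FROM carsales WHERE sales >= 100, with 0 for no rows."""
    return sum(sales for _, sales in rows if sales >= 100)

# Each world contributes its probability to the answer it produces.
dist = defaultdict(float)
for x, b in product(P_X, P_B):
    dist[preferred_sum(carsales_world(x, b))] += P_X[x] * P_B[b]

for value in sorted(dist):
    print(f"{value:>4}  {dist[value]:.0%}")
```

Enumerating every world is only workable for a toy example like this one; a probabilistic database is meant to produce such answers without materializing all worlds, which is why the slide stresses scalability.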
  18. QUERYING AND RELIABILITY ASSESSMENT
      Sales of “preferred customers”
      • SELECT SUM(sales) FROM carsales WHERE sales≥ 100
      • Answer: 106
      • Risk = Probability * Impact
      • Analyst only bothered with problems that matter
      SUM(sales)  P
      0           14%
      105         6%
      106         56%
      211         24%
      Second most likely answer at 24% with impact factor 2 in sales (211 vs 106)
      Risk of substantially wrong answer
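As a rough illustration of "Risk = Probability * Impact", the sketch below ranks the alternative answers from the table by risk. The slide does not define how impact is measured; taking it as the relative deviation from the most likely answer is an assumption made here for illustration.

```python
# Answer distribution from the slide.
dist = {0: 0.14, 105: 0.06, 106: 0.56, 211: 0.24}

# The most likely answer is the one reported to the analyst.
best = max(dist, key=dist.get)

def impact(value, reference):
    """ASSUMPTION: impact measured as relative deviation from the reported answer."""
    return abs(value - reference) / reference

# Risk of each alternative answer = its probability times its impact.
risks = {v: p * impact(v, best) for v, p in dist.items() if v != best}
riskiest = max(risks, key=risks.get)

for value, risk in sorted(risks.items(), key=lambda kv: -kv[1]):
    print(f"{value:>4}  risk={risk:.3f}")
```

Under this measure the answer 105 hardly matters even though it is possible, while 211 combines a sizeable probability with a large deviation, matching the slide's remark that the analyst is only bothered with problems that matter.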
  19. BACK TO GEO-SOCIAL RECOMMENDATION: HOW TO MODEL THE GPS TRAJECTORY PROBLEM?
      • Smoothing: any jumps and/or sudden sharp angles are suspicious and probably wrong
      • Points become estimated points
      • Some points are possibly suspicious
      • Some are more suspicious than others
      Model the uncertainty explicitly in the data
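One way to picture "some are more suspicious than others" is to attach a suspicion score to every GPS point instead of deleting outliers. The sketch below is purely illustrative: the scoring rule, the 10 m/s speed limit, the planar coordinates in meters, and the sample trajectory are all assumptions, not the method used in the talk.

```python
import math

def segment_speeds(points):
    """Speed in m/s between consecutive (t_seconds, x_m, y_m) points."""
    speeds = []
    for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:]):
        speeds.append(math.hypot(x1 - x0, y1 - y0) / (t1 - t0))
    return speeds

def suspicion_scores(points, limit=10.0):
    """Score each point in [0, 1]; hypothetical rule based on implied speed.

    A point is suspicious only if BOTH the jump towards it and the jump away
    from it are implausibly fast; end points have a single adjacent segment.
    """
    speeds = segment_speeds(points)
    scores = []
    for i in range(len(points)):
        adjacent = speeds[max(i - 1, 0):i + 1]  # incoming and outgoing segments
        excess = (min(adjacent) - limit) / limit
        scores.append(min(1.0, max(0.0, excess)))
    return scores

# Made-up trajectory with a mild jump (3rd point) and a large jump (6th point).
trajectory = [
    (0, 0, 0), (10, 10, 0), (20, 170, 0), (30, 20, 0),
    (40, 30, 0), (50, 500, 0), (60, 50, 0), (70, 60, 0),
]
scores = suspicion_scores(trajectory)
print(scores)
```

The scores stay with the data as explicit uncertainty, so later analysis can weigh a doubtful point rather than depend on an early keep-or-drop decision.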
  20. AMBIGUITY IN NATURAL LANGUAGE PROCESSING AND ITS CONSEQUENCES FOR FRAUD RISK ANALYSIS
      Fraud risk analysis
      • about which company do they talk?
      • Indicators become possible indicators
      • Fraud risk analysis is statistics / probability theory! Reasoning with possible indicators is very easy. It’s just more data
      Paris Hilton stayed in the Paris Hilton
      Phrase        begin  end  type       ref
      Paris         1      1    City       sws.geonames.org/2988507
      Paris         1      1    Firstname
      Hilton        1      1    Lastname
      Paris Hilton  1      2    Person     https://en.wikipedia.org/wiki/Paris_Hilton
      Paris Hilton  1      2    Hotel      www.hilton.com/Paris
      …             …      …    …
      Callout in figure: “belong together”
  21. KNOW WHEN TO STOP CLEANING: MEASURING QUALITY
      • Inspired by information retrieval (search engine evaluation)
      • Precision = ratio of answers that are correct (3/5 = 60%)
      • Recall = ratio of correct answers given (3/4 = 75%)
      • Expected precision and recall
      • A correct answer is better if the system dares to claim that it is correct with a higher probability
      • Analogously, incorrect answers with a high probability are worse than incorrect answers with a low probability
      • Expected precision = (0.8+0.7+0.2) / 2.3 = 74%
      • Expected recall = (0.8+0.7+0.2) / 4 = 43%
      [Figure: answers A to G, with probabilities 80%, 70%, 50%, 20%, 10%]
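The arithmetic on this slide can be checked with a short sketch. The slide's formulas imply that the three correct answers are the ones returned with probabilities 0.8, 0.7, and 0.2, that the incorrect ones have 0.5 and 0.1, and that four correct answers exist in total; the function names and the data layout are illustrative.

```python
def precision_recall(answers, total_correct):
    """Classic precision and recall; answers is a list of (probability, is_correct)."""
    hits = sum(1 for _, ok in answers if ok)
    return hits / len(answers), hits / total_correct

def expected_precision_recall(answers, total_correct):
    """Probability-weighted variants: each answer counts for its probability."""
    mass_correct = sum(p for p, ok in answers if ok)
    mass_all = sum(p for p, _ in answers)
    return mass_correct / mass_all, mass_correct / total_correct

# Five returned answers; which ones are correct is inferred from the slide's formulas.
answers = [(0.8, True), (0.7, True), (0.5, False), (0.2, True), (0.1, False)]
TOTAL_CORRECT = 4  # one correct answer was not returned at all

precision, recall = precision_recall(answers, TOTAL_CORRECT)
exp_precision, exp_recall = expected_precision_recall(answers, TOTAL_CORRECT)

print(f"precision={precision:.0%} recall={recall:.0%}")
print(f"expected precision={exp_precision:.1%} expected recall={exp_recall:.1%}")
```

Expected precision rises above plain precision because the incorrect answers carry little probability mass, while expected recall drops because one correct answer was returned with only 20% confidence.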
  22. CONCLUSIONS
      Data quality: intangible problem with unknown impact
      The key to effective management of DQ problems
      • Model DQ problems as uncertainty *in* the data
      • Probabilistic database technology for scalability
      • Postpone resolution/cleaning: pay-as-you-go
      • Measure and know when to stop: good-is-good-enough; human-in-the-loop
      Bio-Informatics professor: “PhD of 4 years, 3 years devoted to ‘data fiddling’”
      If we can reduce the data fiddling by 1 year (33%), we make the scientist twice as productive!
