Data Quality
The True Big Data Challenge
Dr. Stefan Kühn
Lead Data Scientist
data2day 2016 - Karlsruhe
A short motivation
• Some „famous“ quotes
• "Data are becoming the new raw material of
business."
• "The data fabric is the next middleware."
• "Data matures like wine, applications like fish."
• "There were 5 Exabytes of information created
between the dawn of civilization through 2003, but
that much information is now created every 2 days."
• "Information is the oil of the 21st century, and
analytics is the combustion engine."
2
A short motivation
4
Data matures like wine?
More like grapes…
A short motivation
• Some „critical“ quotes
• "Big Data is not the new oil."
• "Data is not information, information is not
knowledge, knowledge is not understanding,
understanding is not wisdom."
• "It’s easy to lie with statistics. It’s hard to tell the
truth without statistics."
• "Anything that can be measured can be
improved."
5
Data Quality
Fundamentals
6
Twofold Approach to Data Quality
• Does Data represent the real-world objects /
events / concepts it is supposed to?
• Does Data meet the expectations of the Data
consumers and the requirements of intended
usage?
• Warning: Data is not fact!
Data does not exist independently of its
creation.
7
Data as representation
8
[Figure: Semiotic Triangle with corners Idea, Word and World. Where is the Data?]
Data as representation
9
[Figure: Semiotic Triangle with corners Metadata, Data and World. Here is the Data.]
Data and Metadata
• Data implies a context -> Metadata
• Metadata provides explicit knowledge about Data
• Metadata enables a common understanding of
Data inside an organization
• Metadata serves as documentation and
dictionary, as context for Data Understanding
Metadata is absolutely necessary for the effective
use of Data.
10
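A minimal sketch of metadata acting as documentation and dictionary. The `FieldMetadata` record, the `order_total` field and all of its values are hypothetical and only illustrate the idea; they are not taken from the talk.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FieldMetadata:
    """Explicit knowledge about one data field (hypothetical structure)."""
    name: str
    description: str
    unit: str
    owner: str                          # team responsible for the field
    allowed_range: Tuple[float, float]  # inclusive lower and upper bound


# A tiny data dictionary: field name -> metadata (illustrative values only).
DATA_DICTIONARY = {
    "order_total": FieldMetadata(
        name="order_total",
        description="Gross order value including tax",
        unit="EUR",
        owner="Sales",
        allowed_range=(0.0, 100_000.0),
    ),
}


def describe(field: str) -> str:
    """Return a one-line, human-readable description of a field."""
    meta = DATA_DICTIONARY[field]
    return f"{meta.name} [{meta.unit}]: {meta.description} (owner: {meta.owner})"


print(describe("order_total"))
```

Keeping such a dictionary next to the data is one way to make the context explicit for every consumer inside the organization.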
Responsibility for Data
Common Misunderstanding
• Data and Data-related systems typically are managed
and hosted by IT, therefore most people (from
business and IT) tend to think that Data is part of IT
and not of Business
• BUT: Data is not a by-product of Business Processes.
Data is THE product of Business Processes.
• Data Quality Improvement as Business Strategy
Shared Responsibility
11
Data Creation as Observation
• Data is created under specific Conditions and for
specific Purposes
• Creation process involves
• Observed Object
• Observer
• Instrument
• Example - Customer Self-Registration Form
• Customer Information as Observed Object
• Customer as Observer
• Registration Form as Instrument
The Instrument is neither built by nor known to the Observer.
12
Data as Product
• Analogy between manufacturing of products and
creation / production of data
• Data as core product of a business process
• Transfer quality concepts from Software Development
to „Data Development“
• Testing
• Staging
• Versioning
• Continuous Delivery / Improvement
• Product Management
• Standardization
Data Quality as Manufacturing Quality
13
Expectations and Requirements
• Implicit assumptions for usage of Data
• Creation of Data is a business process
• Expectations and requirements have to be
explicitly known when defining the process
• Data Quality is Business Process Quality
• Constantly changing expectations and
requirements make Data age like grapes…
Make all assumptions explicit.
14
Data Producers
• People or systems that create Data
• Producers have control over what they create
(given the functionality of the instrument)
• Producers don’t have control over possible uses of data
• Most Data is produced for a dedicated purpose but used
for several purposes
• Data Quality is fixed at the moment of creation
Data Quality starts with enabling producers to produce
high-quality Data -> useable Data
15
Data Consumers
• People or systems that use Data within its lifecycle
• Multiple systems and people can consume data
• Often, Consumers are Producers at the same time
• Consumers do not control the production of Data but
have implicit assumptions and expectations about it
Data Quality Processes are Consumers of Data of
an unknown Quality and Producers of Data of a
defined Quality
16
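The idea of a Data Quality Process as a consumer of data of unknown quality and a producer of data of defined quality can be sketched as a simple gate. The function `quality_gate`, the rule names and the sample records are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Rule = Callable[[Record], bool]


def quality_gate(
    records: List[Record], rules: Dict[str, Rule]
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into accepted ones and rejected ones with failed rule names."""
    accepted: List[Record] = []
    rejected: List[Tuple[Record, List[str]]] = []
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            rejected.append((record, failed))
        else:
            accepted.append(record)
    return accepted, rejected


# Hypothetical rules that make the consumer's assumptions explicit.
RULES: Dict[str, Rule] = {
    "email_present": lambda r: bool(r.get("email")),
    "age_plausible": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
}

raw = [
    {"email": "a@example.com", "age": 34},
    {"email": "", "age": 34},
    {"email": "b@example.com", "age": 250},
]

accepted, rejected = quality_gate(raw, RULES)
print(len(accepted), "accepted;", len(rejected), "rejected")
for record, failed in rejected:
    print(record, "failed:", failed)
```

Everything that leaves the gate satisfies the stated rules, so downstream consumers receive data of a defined quality rather than an unknown one.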
17
Data Quality
Problems
Problematic Aspects of Data Management
• Data crosses Organizational Boundaries
• Technical (IT) and non-technical (Business)
roles have to communicate
• Shared Responsibility instead of „Ownership“
• No common definitions
• Twelve Barriers to Effective Management of
Data and Information Assets (Th. Redman)
Holistic Approach to Data Quality required
18
20
Big Data Quality
Big Problems
Summary
Big Data
• "Big data is what happened when the cost of
storing information became less than the cost of
making the decision to throw it away." (George Dyson)
21
What is Big Data?
• Different Data sources
• External Data
• No control over data production
• Insufficient documentation (Metadata)
• No quality definitions available
• Incompatible schema
• Example: Car callbacks
• Even more implicit assumptions
• Big Data implies less information per unit of data
• Lots of data points are redundant
• Example: Measure a constant quantity once per day or once
per second
22
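The example of measuring a constant quantity once per day or once per second can be made concrete with a small sketch. The sensor value is hypothetical, and compressed size is used here only as a rough stand-in for information content, which is an assumption of this sketch.

```python
import zlib

SECONDS_PER_DAY = 24 * 60 * 60
CONSTANT_READING = 21.5  # hypothetical value that never changes

daily = [CONSTANT_READING]                      # one sample per day
per_second = [CONSTANT_READING] * SECONDS_PER_DAY  # one sample per second


def summarize(samples):
    """Count samples and distinct values; compare raw and compressed size."""
    raw = ",".join(f"{x:.1f}" for x in samples).encode()
    return {
        "samples": len(samples),
        "distinct": len(set(samples)),
        "raw_bytes": len(raw),
        "compressed_bytes": len(zlib.compress(raw)),
    }


print("daily:     ", summarize(daily))
print("per second:", summarize(per_second))
```

The per-second series holds 86,400 data points but still only one distinct value, and it compresses to a small fraction of its raw size: more data, not more information.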
What is Big Data in the Media?
• „new oil“
• „gold“
• „revolution“
• „raw material“
• „the future“
• „bigger, better, faster, more“
• „more data beats better algorithms“
• …
23
Three major problems
• Redundancy
• Big Data by Copy/Paste
• Resolution
• Every problem has an inherent time scale of change
• Every problem has an inherent level of uncertainty
• Increasing the resolution beyond these levels only
resolves noise
• Noise
• Adding noisy features decreases the signal-to-noise ratio
• Adding good but irrelevant features increases
complexity and can look like noise
24
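A small simulation of the noise point, under a simplifying assumption: one informative feature and k pure-noise features are combined without weights (as distance-based methods effectively do), and the signal-to-noise ratio is taken as the variance of the informative part divided by the variance of the summed noise. The function name and all parameters are illustrative.

```python
import random
import statistics


def aggregate_snr(n_noise_features, n_samples=2000, seed=0):
    """Variance of one signal feature over variance of the summed noise features."""
    rng = random.Random(seed)
    signal = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    noise = [
        sum(rng.gauss(0.0, 1.0) for _ in range(n_noise_features))
        for _ in range(n_samples)
    ]
    return statistics.pvariance(signal) / statistics.pvariance(noise)


for k in (1, 10, 100):
    print(f"{k:>3} noisy features -> SNR ~ {aggregate_snr(k):.3f}")
```

With unit-variance features the ratio falls roughly like 1/k, so every additional noisy feature dilutes the one feature that carries the signal.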
[Figure: Redundancy]
25
[Figures: Resolution]
26-27
[Figures: Noise]
28-29
Example from Kaggle
30
349 variables - basically rank 1
31
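What "349 variables, basically rank 1" can look like is sketched below on synthetic data. This is not the Kaggle dataset from the slide: the latent factor, the loadings and the noise level are invented, and the share of the first singular value is estimated with plain power iteration on the raw (uncentered) matrix.

```python
import random


def top_component_share(matrix, iterations=20):
    """Share of the squared Frobenius norm captured by the first singular value."""
    n_cols = len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        # One power-iteration step on A^T A: u = A v, then v = A^T u, normalized.
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        v = [sum(row[j] * u_i for row, u_i in zip(matrix, u)) for j in range(n_cols)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
    sigma1_squared = sum(x * x for x in u)
    total = sum(x * x for row in matrix for x in row)
    return sigma1_squared / total


rng = random.Random(42)
N_ROWS, N_COLS = 100, 349

# Synthetic data: every column is one latent factor times a loading, plus tiny noise.
latent = [rng.gauss(0.0, 1.0) for _ in range(N_ROWS)]
loadings = [rng.uniform(0.5, 2.0) for _ in range(N_COLS)]
data = [
    [latent[i] * loadings[j] + rng.gauss(0.0, 0.01) for j in range(N_COLS)]
    for i in range(N_ROWS)
]

share = top_component_share(data)
print(f"{N_COLS} variables, first component explains {share:.4%} of the total")
```

A table like this looks wide, yet almost all of its content sits in a single direction; the remaining 348 dimensions contribute next to nothing.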
Moore’s Law
32
Moore’s Law: By Wgsimon - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542
What’s the point?
• Moore’s Law
• The number of transistors per unit area doubles roughly
every two years
• Real-world Problem sizes
• Grow at approximately the same speed
• Algorithmic requirements
• For answering the same questions in the same
time, we need algorithms with linear complexity
33
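The argument for linear complexity can be checked with a little arithmetic. The sketch assumes that compute throughput and problem size both double once per period (treating transistor density as a proxy for throughput is a simplification) and compares the resulting runtime with today's runtime for three cost models. The starting size `n0` is arbitrary.

```python
import math


def relative_runtime(cost, periods, n0=1_000_000):
    """Runtime after `periods` doublings of both data size and compute, relative to today."""
    n = n0 * 2 ** periods      # problem size doubles every period
    speedup = 2 ** periods     # compute doubles every period (assumption)
    return (cost(n) / speedup) / cost(n0)


COSTS = {
    "linear": lambda n: n,
    "n log n": lambda n: n * math.log2(n),
    "quadratic": lambda n: n ** 2,
}

for name, cost in COSTS.items():
    print(f"{name:>9}: {relative_runtime(cost, periods=5):.2f}x today's runtime")
```

Only the linear algorithm keeps its runtime constant. After five doublings the quadratic one takes 32 times as long, and even n log n drifts upward slowly.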
Solutions?
34
Overall Goals
• Implement Data Quality Standards
• Detect Data Quality Problems
• Manage Data Quality Problems
• Root Cause Analysis of Data Quality Problems
• Measure Costs of „poor“ Data Quality
• Measure Value of Data / „high“ Data Quality
• Measure Effects of Data Quality
Improvements
35
Typical Approaches
• Force Data Quality (via order)
• Fillrate: Make certain fields a must
• Range: Prescribe list of valid options
• Buy tool
• Hire expert
• Fire expert
• Collect more bad Data
• Relabel „bad“ Data Pool as Data Lake
• …
36
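The two "forced" checks from this list, fill rate and range, are easy to sketch, and the sketch also shows their limits. The field name, the list of valid options and the sample records are hypothetical.

```python
def fill_rate(records, field):
    """Fraction of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def out_of_range(records, field, valid_options):
    """Records whose value for `field` is not one of the valid options."""
    return [r for r in records if r.get(field) not in valid_options]


records = [
    {"country": "DE"},
    {"country": ""},    # empty: lowers the fill rate
    {"country": "XX"},  # filled, but only to satisfy a mandatory field
    {"country": "FR"},
]
VALID_COUNTRIES = {"DE", "FR", "AT"}

print("fill rate:", fill_rate(records, "country"))
print("invalid:  ", out_of_range(records, "country", VALID_COUNTRIES))
```

Making the field mandatory would push the fill rate to 100%, but a placeholder such as "XX" then passes the fill-rate check while carrying no usable information. The checks detect symptoms; they do not create quality.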
Summary of the problem
Big Data
• "Big data is what happened when the cost of
storing information became less than the cost of
making the decision to throw it away." (George Dyson)
37
Useful Approaches
• Hire expert ;-)
• Shared Responsibility
• Common Understanding of and access to Metadata
• This does not imply that the terminology has to change
• Typically, the same term has a different meaning in
different departments
• Bounded contexts! (DDD)
• Invest in creating better Data instead of fixing old
and broken Data
Treat Data as Product, not as Fact
38
39
Thanks a lot!
www.codecentric.de
blog.codecentric.de
stefan.kuehn@codecentric.de
datascience@codecentric.de

Data quality - The True Big Data Challenge

  • 1. Data Quality The True Big Data Challenge Dr. Stefan Kühn Lead Data Scientist data2day 2016 - Karlsruhe
  • 2. A short motivation • Some „famous“ quotes • "Data are becoming the new raw material of business." • "The data fabric is the next middleware.“ • "Data matures like wine, applications like fish." • "There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days." • "Information is the oil of the 21st century, and analytics is the combustion engine." 2
  • 3. A short motivation 3 Data matures like wine?
  • 4. A short motivation 4 Data matures like wine? More like grapes…
  • 5. A short motivation • Some „critical“ quotes • "Big Data is not the new oil." • "Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom." • "It’s easy to lie with statistics. It’s hard to tell the truth without statistics." • "Anything that can be measured can be improved.“ 5
  • 7. Twofold Approach to Data Quality • Does Data represent the real-world objects / events / concepts it is supposed to? • Does Data meet the expectations of the Data consumers and the requirements of intended usage? • Warning: Data is not facts! Data is not existing independent from its creation. 7
  • 8. Data as representation 8 Idea Word World Semiotic Triangle Where is the Data?
  • 9. Data as representation 9 Metadata Data World Semiotic Triangle Here is the Data
  • 10. Data and Metadata • Data implies a context -> Metadata • Metadata provides explicit knowledge about Data • Metadata enables a common understanding of Data inside an organization • Metadata serves as documentation and dictionary, as context for Data Understanding Metadata is absolutely necessary for the effective use of Data. 10
  • 11. Responsibility for Data Common Misunderstanding • Data and Data-related systems typically are managed and hosted by IT, therefore most people (from business and IT) tend to think that Data is part of IT and not of Business • BUT: Data is not the by-product of Business processes Data is THE product of Business Processes • Data Quality Improvement as Business Strategy Shared Responsibility 11
  • 12. Data Creation as Observation • Data is created under specific Conditions and for specific Purposes • Creation process involves • Observed Object • Observer • Instrument • Example - Customer Self-Registration Form • Customer Information as Observed Object • Customer as Observer • Registration Form as Instrument Instrument is not built / known by Observer. 12
  • 13. Data as Product • Analogy between manufacturing of products and creation / production of data • Data as core product of a business process • Transfer quality concepts from Software Development to „Data Development“ • Testing • Staging • Versioning • Continuous Delivery / Improvement • Product Management • Standardization Data Quality as Manufactoring Quality 13
  • 14. Expectations and Requirements • Implicit assumptions for usage of Data • Creation of Data is a business process • Expectations and requirements have to be explicitly known when defining the process • Data Quality is Business Process Quality • Constantly changing expectations and requirements make Data age like grapes… • Make all assumptions explicit.
  • 15. Data Producers • People or systems that create Data • Producers have control over what they create (given the functionality of the instrument) • Producers don’t have control over possible uses of data • Most Data is produced for a dedicated purpose but used for several purposes • Data Quality is fixed at the moment of creation • Data Quality starts with enabling producers to produce high-quality Data -> usable Data
  • 16. Data Consumers • People or systems that use Data within its lifecycle • Multiple systems and people can consume data • Often, Consumers are Producers at the same time • Consumers do not control the production of Data but have implicit assumptions and expectations about it • Data Quality Processes are Consumers of Data of an unknown Quality and Producers of Data of a defined Quality
  • 18. Problematic Aspects of Data Management • Data crosses Organizational Boundaries • Technical (IT) and non-technical (Business) roles have to communicate • Shared Responsibility instead of "Ownership" • No common definitions • Twelve Barriers to Effective Management of Data and Information Assets (Th. Redman) • Holistic Approach to Data Quality required
  • 19. Problematic Aspects of Data Management
  • 21. Summary Big Data • "Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." (George Dyson)
  • 22. What is Big Data? • Different Data sources • External Data • No control over data production • Insufficient documentation (Metadata) • No quality definitions available • Incompatible schemas • Example: Car callbacks • Even more implicit assumptions • Big Data implies less information per unit of data • Lots of data points are redundant • Example: Measure a constant quantity once per day or once per second
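The last example on this slide can be made concrete with a minimal sketch. It assumes a hypothetical sensor that always reports the same value; the function names `sample_constant` and `distinct_values` and the reading `21.5` are illustrative and do not come from the talk. Sampling once per second yields 86,400 times as many data points per day as sampling once per day, yet the number of distinct values stays the same.

```python
def sample_constant(value: float, samples_per_day: int, days: int) -> list[float]:
    """Simulate measuring a constant quantity at a fixed sampling rate."""
    return [value] * (samples_per_day * days)


def distinct_values(readings: list[float]) -> int:
    """Count distinct readings as a crude proxy for information content."""
    return len(set(readings))


# Hypothetical one-week measurement of the same constant quantity.
daily = sample_constant(21.5, samples_per_day=1, days=7)
per_second = sample_constant(21.5, samples_per_day=86_400, days=7)

print(len(daily), distinct_values(daily))
print(len(per_second), distinct_values(per_second))
```

The per-second series is far larger, but under this assumption it carries no more information than the daily one.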
  • 23. What is Big Data in the Media? • "new oil" • "gold" • "revolution" • "raw material" • "the future" • "bigger, better, faster, more" • "more data beats better algorithms" • …
  • 24. Three major problems • Redundancy • Big Data by Copy/Paste • Resolution • Every problem has an inherent time scale of change • Every problem has an inherent level of uncertainty • Increasing the resolution beyond these levels only resolves noise • Noise • Adding noisy features decreases the signal-to-noise ratio • Adding good but irrelevant features increases complexity and can look like noise
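One way to read the noise claim is through distances between data points. The sketch below is an illustration under stated assumptions, not a model taken from the talk: a single informative feature separates two points by `signal_gap`, and each additional noise feature is independent with standard deviation `noise_std`. The function name `distance_snr` is hypothetical.

```python
def distance_snr(signal_gap: float, noise_std: float, n_noise_features: int) -> float:
    """Ratio of signal to expected noise in a squared Euclidean distance.

    Assumption: the difference of two independent draws of a noise feature
    has expected square 2 * noise_std**2, and contributions add up across
    features. Requires n_noise_features >= 1 and noise_std > 0.
    """
    expected_noise = 2.0 * noise_std**2 * n_noise_features
    return signal_gap**2 / expected_noise


# The signal stays the same while the number of noise features grows.
for k in (1, 10, 100):
    print(k, distance_snr(signal_gap=1.0, noise_std=0.5, n_noise_features=k))
```

Under these assumptions the ratio falls in proportion to the number of noise features, so a hundred noisy columns bury a signal that a single one would leave clearly visible.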
  • 31. 349 variables - basically rank 1
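A claim like this can be checked by asking how much of a table's total squared magnitude the largest singular value captures. The following is a minimal sketch in plain Python using power iteration on the Gram matrix; the function name `top_singular_share` and the toy table are assumptions for illustration, and the talk does not say how its result was obtained.

```python
import math


def top_singular_share(rows: list[list[float]], iterations: int = 100) -> float:
    """Share of the squared Frobenius norm captured by the top singular value.

    Runs power iteration on the Gram matrix X^T X. A share close to 1.0
    means the table is numerically close to rank 1. Assumes the start
    vector is not orthogonal to the dominant direction.
    """
    n_cols = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n_cols)]
            for i in range(n_cols)]
    total = sum(gram[i][i] for i in range(n_cols))  # trace = sum of squared singular values
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n_cols)) for i in range(n_cols)]
        eigenvalue = math.sqrt(sum(x * x for x in w))  # norm of G v for unit v
        v = [x / eigenvalue for x in w]
    return eigenvalue / total


# Toy table: four "variables" that are all multiples of one base column.
base = [1.0, 2.0, 3.0, 4.0, 5.0]
factors = [1.0, 2.0, -0.5, 3.0]
collinear = [[b * f for f in factors] for b in base]

# Contrast: three variables that share nothing with each other.
independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(top_singular_share(collinear))
print(top_singular_share(independent))
```

For real data a numerical library with a full singular value decomposition would be the more robust choice; the hand-rolled iteration only keeps the sketch free of dependencies.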
  • 32. Moore’s Law • [Figure: Moore’s Law. By Wgsimon - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542]
  • 33. What’s the point? • Moore’s Law • Number of transistors per area doubles every two years • Real-world Problem sizes • Grow at approximately the same rate • Algorithmic requirements • For answering the same questions in the same time, we need algorithms with linear complexity
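The argument on this slide can be followed with a little arithmetic. The sketch below assumes, as a simplification, that available compute and problem size both double every two years and that runtime is work divided by compute; the function name `relative_runtime` is illustrative.

```python
def relative_runtime(years: float, exponent: float, doubling_period: float = 2.0) -> float:
    """Runtime after `years`, relative to today, for an O(n**exponent) algorithm.

    Assumption: problem size n and available compute both grow by the same
    factor, so work grows as growth**exponent while throughput grows as growth.
    """
    growth = 2.0 ** (years / doubling_period)
    return growth**exponent / growth


# Ten years means five doublings, i.e. a growth factor of 32.
for name, p in (("linear", 1.0), ("quadratic", 2.0), ("cubic", 3.0)):
    print(name, relative_runtime(years=10, exponent=p))
```

Only the linear algorithm keeps its runtime constant under these assumptions; anything superlinear falls behind even though the hardware keeps improving.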
  • 35. Overall Goals • Implement Data Quality Standards • Detect Data Quality Problems • Manage Data Quality Problems • Root Cause Analysis of Data Quality Problems • Measure Costs of "poor" Data Quality • Measure Value of Data / "high" Data Quality • Measure Effects of Data Quality Improvements
  • 36. Typical Approaches • Force Data Quality (via order) • Fill rate: Make certain fields mandatory • Range: Prescribe a list of valid options • Buy tool • Hire expert • Fire expert • Collect more bad Data • Relabel "bad" Data Pool as Data Lake • …
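Fill rate and range rules can be measured as well as enforced. The following is a minimal sketch of both checks on a toy customer table; the helper names `fill_rate` and `range_violations`, the field names, and the list of valid country codes are all assumptions made for illustration.

```python
Record = dict[str, object]


def fill_rate(records: list[Record], field: str) -> float:
    """Fraction of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def range_violations(records: list[Record], field: str, valid_options: set) -> list[Record]:
    """Records whose filled `field` is not in the prescribed list of options."""
    return [
        r for r in records
        if r.get(field) not in (None, "") and r[field] not in valid_options
    ]


# Hypothetical registration data.
customers = [
    {"name": "Ada", "country": "DE"},
    {"name": "", "country": "XX"},
    {"name": "Grace", "country": None},
    {"name": "Linus", "country": "FI"},
]

print(fill_rate(customers, "name"))
print(fill_rate(customers, "country"))
print(range_violations(customers, "country", {"DE", "FI", "US"}))
```

Note that the record with country `"XX"` would pass a mandatory-field rule, which illustrates why forcing a fill rate alone does not produce correct Data.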
  • 37. Summary of the problem Big Data • "Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." (George Dyson)
  • 38. Useful Approaches • Hire expert ;-) • Shared Responsibility • Common Understanding of and access to Metadata • This does not imply that the terminology has to change • Typically, the same term has a different meaning in different departments • Bounded contexts! (DDD) • Invest in creating better Data instead of fixing old and broken Data • Treat Data as Product, not as Fact