Data Quality
The True Big Data Challenge
Dr. Stefan Kühn
Lead Data Scientist
data2day 2016 - Karlsruhe
A short motivation
• Some „famous“ quotes
• "Data are becoming the new raw material of
business."
• "The data fabric is the next middleware."
• "Data matures like wine, applications like fish."
• "There were 5 Exabytes of information created
between the dawn of civilization through 2003, but
that much information is now created every 2 days."
• "Information is the oil of the 21st century, and
analytics is the combustion engine."
2
A short motivation
3
Data matures like wine?
More like grapes…
A short motivation
• Some „critical“ quotes
• "Big Data is not the new oil."
• "Data is not information, information is not
knowledge, knowledge is not understanding,
understanding is not wisdom."
• "It’s easy to lie with statistics. It’s hard to tell the
truth without statistics."
• "Anything that can be measured can be
improved."
5
Data Quality
Fundamentals
6
Twofold Approach to Data Quality
• Does Data represent the real-world objects /
events / concepts it is supposed to?
• Does Data meet the expectations of the Data
consumers and the requirements of intended
usage?
• Warning: Data is not facts!
Data does not exist independently of its
creation.
7
Data as representation
8
Idea
Word World
Semiotic
Triangle
Where
is the
Data?
Data as representation
9
Metadata
Data World
Semiotic
Triangle
Here
is the
Data
Data and Metadata
• Data implies a context -> Metadata
• Metadata provides explicit knowledge about Data
• Metadata enables a common understanding of
Data inside an organization
• Metadata serves as documentation and
dictionary, as context for Data Understanding
Metadata is absolutely necessary for the effective
use of Data.
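A minimal sketch of what such a dictionary could look like in code. The field name, description, unit, and source below are hypothetical and only illustrate how Metadata makes the context of a value explicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMetadata:
    """Explicit knowledge about one data field (illustrative structure)."""
    name: str
    description: str
    unit: str
    source: str


# Hypothetical data dictionary with a single entry.
DATA_DICTIONARY = {
    "order_total": FieldMetadata(
        name="order_total",
        description="Gross order value including VAT",
        unit="EUR",
        source="web shop checkout",
    ),
}


def describe(field_name: str) -> str:
    """Return a human-readable description of a field from the dictionary."""
    meta = DATA_DICTIONARY[field_name]
    return f"{meta.name} [{meta.unit}]: {meta.description} (source: {meta.source})"


print(describe("order_total"))
```

Without the unit and the description, the bare number in `order_total` could be read as net or gross, in any currency.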
10
Responsibility for Data
Common Misunderstanding
• Data and Data-related systems typically are managed
and hosted by IT, therefore most people (from
business and IT) tend to think that Data is part of IT
and not of Business
• BUT: Data is not the by-product of Business Processes,
Data is THE product of Business Processes
• Data Quality Improvement as Business Strategy
Shared Responsibility
11
Data Creation as Observation
• Data is created under specific Conditions and for
specific Purposes
• Creation process involves
• Observed Object
• Observer
• Instrument
• Example - Customer Self-Registration Form
• Customer Information as Observed Object
• Customer as Observer
• Registration Form as Instrument
Instrument is not built / known by Observer.
12
Data as Product
• Analogy between manufacturing of products and
creation / production of data
• Data as core product of a business process
• Transfer quality concepts from Software Development
to „Data Development“
• Testing
• Staging
• Versioning
• Continuous Delivery / Improvement
• Product Management
• Standardization
Data Quality as Manufacturing Quality
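As a rough illustration of "Testing" applied to Data, the sketch below runs two simple checks against a batch of records: a fill rate (share of records with a non-empty value) and a range check against a list of valid options. The records, field names, and allowed values are hypothetical, and real data tests would usually cover much more.

```python
def fill_rate(records, field):
    """Fraction of records where the field is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def out_of_range(records, field, valid_values):
    """Records whose value for the field is set but not in the allowed list."""
    return [
        r for r in records
        if r.get(field) not in (None, "") and r[field] not in valid_values
    ]


# Hypothetical customer records, as they might arrive from a registration form.
customers = [
    {"id": 1, "email": "a@example.com", "country": "DE"},
    {"id": 2, "email": "", "country": "FR"},
    {"id": 3, "email": "c@example.com", "country": "XX"},
    {"id": 4, "email": None, "country": "DE"},
]

email_fill_rate = fill_rate(customers, "email")
bad_countries = out_of_range(customers, "country", {"DE", "FR"})

print("email fill rate:", email_fill_rate)
print("records with invalid country:", [r["id"] for r in bad_countries])
```

Such checks could run on every new batch, much like unit tests run on every commit.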
13
Expectations and Requirements
• Implicit assumptions for usage of Data
• Creation of Data is a business process
• Expectations and requirements have to be
explicitly known when defining the process
• Data Quality is Business Process Quality
• Constantly changing expectations and
requirements make Data age like grapes…
Make all assumptions explicit.
14
Data Producers
• People or systems that create Data
• Producers have control over what they create
(given the functionality of the instrument)
• Producers don’t have control over possible uses of data
• Most Data is produced for a dedicated purpose but used
for several purposes
• Data Quality is fixed at the moment of creation
Data Quality starts with enabling producers to produce
high-quality Data -> usable Data
15
Data Consumers
• People or systems that use Data within its lifecycle
• Multiple systems and people can consume data
• Often, Consumers are Producers at the same time
• Consumers do not control the production of Data but
have implicit assumptions and expectations about it
Data Quality Processes are Consumers of Data of
an unknown Quality and Producers of Data of a
defined Quality
16
17
Data Quality
Problems
Problematic Aspects of Data Management
• Data crosses Organizational Boundaries
• Technical (IT) and non-technical (Business)
roles have to communicate
• Shared Responsibility instead of „Ownership“
• No common definitions
• Twelve Barriers to Effective Management of
Data and Information Assets (Th. Redman)
Holistic Approach to Data Quality required
18
Problematic Aspects of Data Management
19
20
Big Data Quality
Big Problems
Summary
Big Data
• "Big data is what happened when the cost of
storing information became less than the cost of
making the decision to throw it away." (George Dyson)
21
What is Big Data?
• Different Data sources
• External Data
• No control over data production
• No sufficient documentation (Metadata)
• No quality definitions available
• Incompatible schema
• Example: Car recalls
• Even more implicit assumptions
• Big Data implies less information per unit of data
• Lots of data points are redundant
• Example: Measure a constant quantity once per day or once
per second
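The constant-quantity example can be made concrete with a small sketch. It uses the empirical Shannon entropy as one possible measure of "information per unit of data"; the measured value and the sampling rates are made up for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Empirical Shannon entropy of a sequence, in bits per sample."""
    counts = Counter(values)
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# The same constant quantity (hypothetical reading), sampled at two rates.
once_per_day = [21.5] * 1
once_per_second = [21.5] * 86_400

for name, samples in [("per day", once_per_day), ("per second", once_per_second)]:
    print(name, "samples:", len(samples), "entropy:", shannon_entropy(samples))
```

Both series carry the same single fact. The per-second series is 86,400 times larger, but its entropy per sample is zero just like the single daily reading.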
22
What is Big Data in the Media?
• „new oil“
• „gold“
• „revolution“
• „raw material“
• „the future“
• „bigger, better, faster, more“
• „more data beats better algorithms“
• …
23
Three major problems
• Redundancy
• Big Data by Copy/Paste
• Resolution
• Every problem has an inherent time scale of change
• Every problem has an inherent level of uncertainty
• Increasing the resolution beyond these levels only
resolves noise
• Noise
• Adding noisy features decreases the signal-to-noise ratio
• Adding good but irrelevant features increases
complexity and can look like noise
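One way to see the first noise claim in numbers: assume a model that weights every feature equally (for example a plain sum), with one informative feature and a growing number of pure-noise features. The sketch below uses synthetic Gaussian data; the equal weighting and all parameters are assumptions made for illustration only.

```python
import random
import statistics


def snr_of_sum(n_noise_features, n_samples=5000, seed=0):
    """Variance of one signal feature divided by the variance of the summed noise features."""
    rng = random.Random(seed)
    signal = [rng.gauss(0, 1) for _ in range(n_samples)]
    noise = [
        sum(rng.gauss(0, 1) for _ in range(n_noise_features))
        for _ in range(n_samples)
    ]
    return statistics.pvariance(signal) / statistics.pvariance(noise)


for k in (1, 10, 100):
    print(k, "noise features -> signal-to-noise ratio", round(snr_of_sum(k), 3))
```

Under these assumptions the signal variance stays fixed while the noise variance grows with the number of noisy features, so the ratio shrinks roughly in proportion to 1/k.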
24
Redundancy
25
Resolution
26
Resolution
27
Noise
28
Noise
29
Example from Kaggle
30
349 variables - basically rank 1
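What "basically rank 1" can mean is sketched below on synthetic data, not on the Kaggle dataset from the slide. The sketch builds a table of 349 columns that are all noisy multiples of one hidden column and estimates, via power iteration, how much of the total squared magnitude the largest singular value captures. Row count, noise level, and iteration count are arbitrary assumptions.

```python
import math
import random


def top_singular_share(rows, iterations=10):
    """Estimate sigma_1^2 / sum(sigma_i^2) with power iteration on A^T A."""
    n_cols = len(rows[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # unit-length start vector
    for _ in range(iterations):
        av = [sum(a * b for a, b in zip(row, v)) for row in rows]  # A v
        w = [sum(row[j] * av[i] for i, row in enumerate(rows))     # A^T (A v)
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = [sum(a * b for a, b in zip(row, v)) for row in rows]
    top = sum(x * x for x in av)                       # approximates sigma_1^2
    total = sum(x * x for row in rows for x in row)    # squared Frobenius norm
    return top / total


rng = random.Random(42)
n_rows, n_cols = 100, 349

# 349 columns that are all scaled copies of one hidden column plus small noise.
base = [rng.gauss(0, 1) for _ in range(n_rows)]
loadings = [rng.uniform(0.5, 2.0) for _ in range(n_cols)]
redundant = [[b * l + rng.gauss(0, 0.01) for l in loadings] for b in base]

# For contrast: 349 independent columns.
independent = [[rng.gauss(0, 1) for _ in range(n_cols)] for _ in range(n_rows)]

share_redundant = top_singular_share(redundant)
share_independent = top_singular_share(independent)
print("redundant table:  ", round(share_redundant, 4))
print("independent table:", round(share_independent, 4))
```

A share close to 1 means that a single direction explains almost everything, so the remaining 348 variables add hardly any information.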
31
Moore’s Law
32
Moore’s Law: By Wgsimon - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542
What’s the point?
• Moore’s Law
• Number of transistors per area doubles roughly every two
years
• Real-world Problem sizes
• Grow at approximately the same speed
• Algorithmic requirements
• For answering the same questions in the same
time, we need algorithms with linear complexity
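The argument can be checked with simple arithmetic. The sketch assumes, as a simplification, that available compute and problem size both double in each two-year period, and that an algorithm needs work proportional to n to the power of some exponent; the function name is made up for illustration.

```python
def relative_runtime(exponent, periods):
    """Runtime relative to today for an O(n**exponent) algorithm,
    assuming problem size and compute both double every period."""
    growth = 2 ** periods       # growth factor of problem size and of compute
    work = growth ** exponent   # growth factor of the required work
    return work / growth


print(relative_runtime(1, 5))  # linear algorithm after 10 years: 1.0
print(relative_runtime(2, 5))  # quadratic algorithm after 10 years: 32.0
```

Under these assumptions only a linear algorithm keeps its runtime constant; a quadratic one becomes 32 times slower within ten years despite the faster hardware.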
33
Solutions?
34
Overall Goals
• Implement Data Quality Standards
• Detect Data Quality Problems
• Manage Data Quality Problems
• Root Cause Analysis of Data Quality Problems
• Measure Costs of „poor“ Data Quality
• Measure Value of Data / „high“ Data Quality
• Measure Effects of Data Quality
Improvements
35
Typical Approaches
• Force Data Quality (via order)
• Fillrate: Make certain fields a must
• Range: Prescribe list of valid options
• Buy tool
• Hire expert
• Fire expert
• Collect more bad Data
• Relabel „bad“ Data Pool as Data Lake
• …
36
Summary of the problem
Big Data
• "Big data is what happened when the cost of
storing information became less than the cost of
making the decision to throw it away." (George Dyson)
37
Useful Approaches
• Hire expert ;-)
• Shared Responsibility
• Common Understanding of and access to Metadata
• This does not imply that the terminology has to change
• Typically, the same term has a different meaning in
different departments
• Bounded contexts! (DDD)
• Invest in creating better Data instead of fixing old
and broken Data
Treat Data as Product, not as Fact
38
39
Thanks a lot!
www.codecentric.de
blog.codecentric.de
stefan.kuehn@codecentric.de
datascience@codecentric.de

More Related Content

What's hot

Understanding big data and data analytics big data
Understanding big data and data analytics big dataUnderstanding big data and data analytics big data
Understanding big data and data analytics big data
Seta Wicaksana
 
Enterprise Analytics: Serving Big Data Projects for Healthcare
Enterprise Analytics: Serving Big Data Projects for HealthcareEnterprise Analytics: Serving Big Data Projects for Healthcare
Enterprise Analytics: Serving Big Data Projects for Healthcare
DATA360US
 
Data Analytics
Data AnalyticsData Analytics
Data Analytics
Srinimf-Slides
 
Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratchdmurph4
 
Data Quality Rules introduction
Data Quality Rules introductionData Quality Rules introduction
Data Quality Rules introduction
datatovalue
 
Big, small or just complex data?
Big, small or just complex data?Big, small or just complex data?
Big, small or just complex data?
panoratio
 
Introduction to Big Data & Analytics
Introduction to Big Data & AnalyticsIntroduction to Big Data & Analytics
Introduction to Big Data & Analytics
Prasad Chitta
 
Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analytics
SSaudia
 
Data Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityData Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data Quality
Precisely
 
Big data
Big dataBig data
Big data
Claire Choong
 
AWC Career Bootcamp- August 21, 2013
AWC Career Bootcamp- August 21, 2013AWC Career Bootcamp- August 21, 2013
AWC Career Bootcamp- August 21, 2013
Patricia A Gilson
 
000 introduction to big data analytics 2021
000   introduction to big data analytics  2021000   introduction to big data analytics  2021
000 introduction to big data analytics 2021
Dendej Sawarnkatat
 
Information Security Forum (ISF) Congress 2013
Information Security Forum (ISF) Congress 2013 Information Security Forum (ISF) Congress 2013
Information Security Forum (ISF) Congress 2013
NIHR Clinical Research Network
 
Big data and the data quality imperative
Big data and the data quality imperativeBig data and the data quality imperative
Big data and the data quality imperativeTrillium Software
 
Data Quality Integration (ETL) Open Source
Data Quality Integration (ETL) Open SourceData Quality Integration (ETL) Open Source
Data Quality Integration (ETL) Open SourceStratebi
 
( Big ) Data Management - Data Quality - Global concepts in 5 slides
( Big ) Data Management - Data Quality - Global concepts in 5 slides( Big ) Data Management - Data Quality - Global concepts in 5 slides
( Big ) Data Management - Data Quality - Global concepts in 5 slides
Nicolas Sarramagna
 
Тестирование данных с помощью Data Quality Services (MS SQL 12)
Тестирование данных с помощью Data Quality Services (MS SQL 12)Тестирование данных с помощью Data Quality Services (MS SQL 12)
Тестирование данных с помощью Data Quality Services (MS SQL 12)
SQALab
 
DGIQ 2015 The Fundamentals of Data Quality
DGIQ 2015 The Fundamentals of Data QualityDGIQ 2015 The Fundamentals of Data Quality
DGIQ 2015 The Fundamentals of Data Quality
Caserta
 
Adventures in Data Profiling
Adventures in Data ProfilingAdventures in Data Profiling
Adventures in Data Profiling
Jim Harris
 
The Economic Value of Data: A New Revenue Stream for Global Custodians
The Economic Value of Data: A New Revenue Stream for Global CustodiansThe Economic Value of Data: A New Revenue Stream for Global Custodians
The Economic Value of Data: A New Revenue Stream for Global Custodians
Cognizant
 

What's hot (20)

Understanding big data and data analytics big data
Understanding big data and data analytics big dataUnderstanding big data and data analytics big data
Understanding big data and data analytics big data
 
Enterprise Analytics: Serving Big Data Projects for Healthcare
Enterprise Analytics: Serving Big Data Projects for HealthcareEnterprise Analytics: Serving Big Data Projects for Healthcare
Enterprise Analytics: Serving Big Data Projects for Healthcare
 
Data Analytics
Data AnalyticsData Analytics
Data Analytics
 
Building a Data Quality Program from Scratch
Building a Data Quality Program from ScratchBuilding a Data Quality Program from Scratch
Building a Data Quality Program from Scratch
 
Data Quality Rules introduction
Data Quality Rules introductionData Quality Rules introduction
Data Quality Rules introduction
 
Big, small or just complex data?
Big, small or just complex data?Big, small or just complex data?
Big, small or just complex data?
 
Introduction to Big Data & Analytics
Introduction to Big Data & AnalyticsIntroduction to Big Data & Analytics
Introduction to Big Data & Analytics
 
Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analytics
 
Data Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityData Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data Quality
 
Big data
Big dataBig data
Big data
 
AWC Career Bootcamp- August 21, 2013
AWC Career Bootcamp- August 21, 2013AWC Career Bootcamp- August 21, 2013
AWC Career Bootcamp- August 21, 2013
 
000 introduction to big data analytics 2021
000   introduction to big data analytics  2021000   introduction to big data analytics  2021
000 introduction to big data analytics 2021
 
Information Security Forum (ISF) Congress 2013
Information Security Forum (ISF) Congress 2013 Information Security Forum (ISF) Congress 2013
Information Security Forum (ISF) Congress 2013
 
Big data and the data quality imperative
Big data and the data quality imperativeBig data and the data quality imperative
Big data and the data quality imperative
 
Data Quality Integration (ETL) Open Source
Data Quality Integration (ETL) Open SourceData Quality Integration (ETL) Open Source
Data Quality Integration (ETL) Open Source
 
( Big ) Data Management - Data Quality - Global concepts in 5 slides
( Big ) Data Management - Data Quality - Global concepts in 5 slides( Big ) Data Management - Data Quality - Global concepts in 5 slides
( Big ) Data Management - Data Quality - Global concepts in 5 slides
 
Тестирование данных с помощью Data Quality Services (MS SQL 12)
Тестирование данных с помощью Data Quality Services (MS SQL 12)Тестирование данных с помощью Data Quality Services (MS SQL 12)
Тестирование данных с помощью Data Quality Services (MS SQL 12)
 
DGIQ 2015 The Fundamentals of Data Quality
DGIQ 2015 The Fundamentals of Data QualityDGIQ 2015 The Fundamentals of Data Quality
DGIQ 2015 The Fundamentals of Data Quality
 
Adventures in Data Profiling
Adventures in Data ProfilingAdventures in Data Profiling
Adventures in Data Profiling
 
The Economic Value of Data: A New Revenue Stream for Global Custodians
The Economic Value of Data: A New Revenue Stream for Global CustodiansThe Economic Value of Data: A New Revenue Stream for Global Custodians
The Economic Value of Data: A New Revenue Stream for Global Custodians
 

Viewers also liked

Big data for quality education
Big data for quality educationBig data for quality education
Big data for quality education
Malintha Adikari
 
Data-Ed Online: Engineering Solutions to Data Quality Challenges
Data-Ed Online: Engineering Solutions to Data Quality ChallengesData-Ed Online: Engineering Solutions to Data Quality Challenges
Data-Ed Online: Engineering Solutions to Data Quality Challenges
Data Blueprint
 
DICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made EasyDICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made Easy
Cloudify Community
 
Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profiling
Shailja Khurana
 
Data Quality Testing Generic (http://www.geektester.blogspot.com/)
Data Quality Testing Generic (http://www.geektester.blogspot.com/)Data Quality Testing Generic (http://www.geektester.blogspot.com/)
Data Quality Testing Generic (http://www.geektester.blogspot.com/)
raj.kamal13
 
DAMA Webinar - Big and Little Data Quality
DAMA Webinar - Big and Little Data QualityDAMA Webinar - Big and Little Data Quality
DAMA Webinar - Big and Little Data Quality
DATAVERSITY
 

Viewers also liked (6)

Big data for quality education
Big data for quality educationBig data for quality education
Big data for quality education
 
Data-Ed Online: Engineering Solutions to Data Quality Challenges
Data-Ed Online: Engineering Solutions to Data Quality ChallengesData-Ed Online: Engineering Solutions to Data Quality Challenges
Data-Ed Online: Engineering Solutions to Data Quality Challenges
 
DICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made EasyDICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made Easy
 
Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profiling
 
Data Quality Testing Generic (http://www.geektester.blogspot.com/)
Data Quality Testing Generic (http://www.geektester.blogspot.com/)Data Quality Testing Generic (http://www.geektester.blogspot.com/)
Data Quality Testing Generic (http://www.geektester.blogspot.com/)
 
DAMA Webinar - Big and Little Data Quality
DAMA Webinar - Big and Little Data QualityDAMA Webinar - Big and Little Data Quality
DAMA Webinar - Big and Little Data Quality
 

Similar to Data quality - The True Big Data Challenge

Data Governance in the Big Data Era
Data Governance in the Big Data EraData Governance in the Big Data Era
Data Governance in the Big Data Era
Pieter De Leenheer
 
Data Governance in a big data era
Data Governance in a big data eraData Governance in a big data era
Data Governance in a big data era
Pieter De Leenheer
 
The New Age Data Quality
The New Age Data QualityThe New Age Data Quality
The New Age Data Quality
Ranjeet202050
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Umair Shafique
 
Presentation Big Data
Presentation Big DataPresentation Big Data
Presentation Big Data
René Kuipers
 
final oracle presentation
final oracle presentationfinal oracle presentation
final oracle presentationPriyesh Patel
 
Industrial Data Science
Industrial Data ScienceIndustrial Data Science
Industrial Data Science
Niko Vuokko
 
Getting Started in Data Science
Getting Started in Data ScienceGetting Started in Data Science
Getting Started in Data Science
Thinkful
 
Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...
Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...
Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...
Denodo
 
Correlation does not mean causation
Correlation does not mean causationCorrelation does not mean causation
Correlation does not mean causation
Peter Varhol
 
Big data
Big dataBig data
¿En qué se parece el Gobierno del Dato a un parque de atracciones?
¿En qué se parece el Gobierno del Dato a un parque de atracciones?¿En qué se parece el Gobierno del Dato a un parque de atracciones?
¿En qué se parece el Gobierno del Dato a un parque de atracciones?
Denodo
 
Usama Fayyad talk in South Africa: From BigData to Data Science
Usama Fayyad talk in South Africa:  From BigData to Data ScienceUsama Fayyad talk in South Africa:  From BigData to Data Science
Usama Fayyad talk in South Africa: From BigData to Data Science
Usama Fayyad
 
01-introduction.ppt the paper that you can unless you want to join me because...
01-introduction.ppt the paper that you can unless you want to join me because...01-introduction.ppt the paper that you can unless you want to join me because...
01-introduction.ppt the paper that you can unless you want to join me because...
teodroscampaus
 
Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)
Thinkful
 
DataSpryng Overview
DataSpryng OverviewDataSpryng Overview
DataSpryng Overview
jkvr
 
Foundational Strategies for Trust in Big Data Part 2: Understanding Your Data
Foundational Strategies for Trust in Big Data Part 2: Understanding Your DataFoundational Strategies for Trust in Big Data Part 2: Understanding Your Data
Foundational Strategies for Trust in Big Data Part 2: Understanding Your Data
Precisely
 
DataScienceIntroduction.pptx
DataScienceIntroduction.pptxDataScienceIntroduction.pptx
DataScienceIntroduction.pptx
KannanThangavelu2
 

Similar to Data quality - The True Big Data Challenge (20)

Data Governance in the Big Data Era
Data Governance in the Big Data EraData Governance in the Big Data Era
Data Governance in the Big Data Era
 
Data Governance in a big data era
Data Governance in a big data eraData Governance in a big data era
Data Governance in a big data era
 
Big data
Big dataBig data
Big data
 
The New Age Data Quality
The New Age Data QualityThe New Age Data Quality
The New Age Data Quality
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Presentation Big Data
Presentation Big DataPresentation Big Data
Presentation Big Data
 
final oracle presentation
final oracle presentationfinal oracle presentation
final oracle presentation
 
Industrial Data Science
Industrial Data ScienceIndustrial Data Science
Industrial Data Science
 
Getting Started in Data Science
Getting Started in Data ScienceGetting Started in Data Science
Getting Started in Data Science
 
Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...
Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...
Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...
 
Correlation does not mean causation
Correlation does not mean causationCorrelation does not mean causation
Correlation does not mean causation
 
Big data
Big dataBig data
Big data
 
¿En qué se parece el Gobierno del Dato a un parque de atracciones?
¿En qué se parece el Gobierno del Dato a un parque de atracciones?¿En qué se parece el Gobierno del Dato a un parque de atracciones?
¿En qué se parece el Gobierno del Dato a un parque de atracciones?
 
Usama Fayyad talk in South Africa: From BigData to Data Science
Usama Fayyad talk in South Africa:  From BigData to Data ScienceUsama Fayyad talk in South Africa:  From BigData to Data Science
Usama Fayyad talk in South Africa: From BigData to Data Science
 
01-introduction.ppt the paper that you can unless you want to join me because...
01-introduction.ppt the paper that you can unless you want to join me because...01-introduction.ppt the paper that you can unless you want to join me because...
01-introduction.ppt the paper that you can unless you want to join me because...
 
Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)
 
DataSpryng Overview
DataSpryng OverviewDataSpryng Overview
DataSpryng Overview
 
Foundational Strategies for Trust in Big Data Part 2: Understanding Your Data
Foundational Strategies for Trust in Big Data Part 2: Understanding Your DataFoundational Strategies for Trust in Big Data Part 2: Understanding Your Data
Foundational Strategies for Trust in Big Data Part 2: Understanding Your Data
 
DataScienceIntroduction.pptx
DataScienceIntroduction.pptxDataScienceIntroduction.pptx
DataScienceIntroduction.pptx
 
CDOVision - RJA Presentation FINAL
CDOVision - RJA Presentation FINALCDOVision - RJA Presentation FINAL
CDOVision - RJA Presentation FINAL
 

More from Stefan Kühn

data2day2023_SKuehn_DataPlatformFallacy.pdf
data2day2023_SKuehn_DataPlatformFallacy.pdfdata2day2023_SKuehn_DataPlatformFallacy.pdf
data2day2023_SKuehn_DataPlatformFallacy.pdf
Stefan Kühn
 
data2day2022_SKuehn_DataValueChain.pdf
data2day2022_SKuehn_DataValueChain.pdfdata2day2022_SKuehn_DataValueChain.pdf
data2day2022_SKuehn_DataValueChain.pdf
Stefan Kühn
 
Talk at MCubed London about Manifold Learning and Applications
Talk at MCubed London about Manifold Learning and ApplicationsTalk at MCubed London about Manifold Learning and Applications
Talk at MCubed London about Manifold Learning and Applications
Stefan Kühn
 
Data Science - Cargo Cult - Organizational Change
Data Science - Cargo Cult - Organizational ChangeData Science - Cargo Cult - Organizational Change
Data Science - Cargo Cult - Organizational Change
Stefan Kühn
 
Interactive Dashboards with R
Interactive Dashboards with RInteractive Dashboards with R
Interactive Dashboards with R
Stefan Kühn
 
Talk at PyData Berlin about Manifold Learning and Applications
Talk at PyData Berlin about Manifold Learning and ApplicationsTalk at PyData Berlin about Manifold Learning and Applications
Talk at PyData Berlin about Manifold Learning and Applications
Stefan Kühn
 
Bridging the gap
Bridging the gapBridging the gap
Bridging the gap
Stefan Kühn
 
The Machinery behind Deep Learning
The Machinery behind Deep LearningThe Machinery behind Deep Learning
The Machinery behind Deep Learning
Stefan Kühn
 
Manifold Learning and Data Visualization
Manifold Learning and Data VisualizationManifold Learning and Data Visualization
Manifold Learning and Data Visualization
Stefan Kühn
 
Becoming Data-driven - Machine Learning @ XING Marketing Solutions
Becoming Data-driven - Machine Learning @ XING Marketing SolutionsBecoming Data-driven - Machine Learning @ XING Marketing Solutions
Becoming Data-driven - Machine Learning @ XING Marketing Solutions
Stefan Kühn
 
Learning To Rank data2day 2017
Learning To Rank data2day 2017Learning To Rank data2day 2017
Learning To Rank data2day 2017
Stefan Kühn
 
Deep Learning and Optimization Methods
Deep Learning and Optimization MethodsDeep Learning and Optimization Methods
Deep Learning and Optimization Methods
Stefan Kühn
 
Visualizing and Communicating High-dimensional Data
Visualizing and Communicating High-dimensional DataVisualizing and Communicating High-dimensional Data
Visualizing and Communicating High-dimensional Data
Stefan Kühn
 
Data Visualization at codetalks 2016
Data Visualization at codetalks 2016Data Visualization at codetalks 2016
Data Visualization at codetalks 2016
Stefan Kühn
 
SKuehn_MachineLearningAndOptimization_2015
SKuehn_MachineLearningAndOptimization_2015SKuehn_MachineLearningAndOptimization_2015
SKuehn_MachineLearningAndOptimization_2015
Stefan Kühn
 
SKuehn_Talk_FootballAnalytics_data2day2015
SKuehn_Talk_FootballAnalytics_data2day2015SKuehn_Talk_FootballAnalytics_data2day2015
SKuehn_Talk_FootballAnalytics_data2day2015
Stefan Kühn
 

More from Stefan Kühn (16)

data2day2023_SKuehn_DataPlatformFallacy.pdf
data2day2023_SKuehn_DataPlatformFallacy.pdfdata2day2023_SKuehn_DataPlatformFallacy.pdf
data2day2023_SKuehn_DataPlatformFallacy.pdf
 
data2day2022_SKuehn_DataValueChain.pdf
data2day2022_SKuehn_DataValueChain.pdfdata2day2022_SKuehn_DataValueChain.pdf
data2day2022_SKuehn_DataValueChain.pdf
 
Talk at MCubed London about Manifold Learning and Applications
Talk at MCubed London about Manifold Learning and ApplicationsTalk at MCubed London about Manifold Learning and Applications
Talk at MCubed London about Manifold Learning and Applications
 
Data Science - Cargo Cult - Organizational Change
Data Science - Cargo Cult - Organizational ChangeData Science - Cargo Cult - Organizational Change
Data Science - Cargo Cult - Organizational Change
 
Interactive Dashboards with R
Interactive Dashboards with RInteractive Dashboards with R
Interactive Dashboards with R
 
Talk at PyData Berlin about Manifold Learning and Applications
Talk at PyData Berlin about Manifold Learning and ApplicationsTalk at PyData Berlin about Manifold Learning and Applications
Talk at PyData Berlin about Manifold Learning and Applications
 
Bridging the gap
Bridging the gapBridging the gap
Bridging the gap
 
The Machinery behind Deep Learning
The Machinery behind Deep LearningThe Machinery behind Deep Learning
The Machinery behind Deep Learning
 
Manifold Learning and Data Visualization
Manifold Learning and Data VisualizationManifold Learning and Data Visualization
Manifold Learning and Data Visualization
 
Becoming Data-driven - Machine Learning @ XING Marketing Solutions
Becoming Data-driven - Machine Learning @ XING Marketing SolutionsBecoming Data-driven - Machine Learning @ XING Marketing Solutions
Becoming Data-driven - Machine Learning @ XING Marketing Solutions
 
Learning To Rank data2day 2017
Learning To Rank data2day 2017Learning To Rank data2day 2017
Learning To Rank data2day 2017
 
Deep Learning and Optimization Methods
Deep Learning and Optimization MethodsDeep Learning and Optimization Methods
Deep Learning and Optimization Methods
 
Visualizing and Communicating High-dimensional Data
Visualizing and Communicating High-dimensional DataVisualizing and Communicating High-dimensional Data
Visualizing and Communicating High-dimensional Data
 
Data Visualization at codetalks 2016
Data Visualization at codetalks 2016Data Visualization at codetalks 2016
Data Visualization at codetalks 2016
 
SKuehn_MachineLearningAndOptimization_2015
SKuehn_MachineLearningAndOptimization_2015SKuehn_MachineLearningAndOptimization_2015
SKuehn_MachineLearningAndOptimization_2015
 
SKuehn_Talk_FootballAnalytics_data2day2015
SKuehn_Talk_FootballAnalytics_data2day2015SKuehn_Talk_FootballAnalytics_data2day2015
SKuehn_Talk_FootballAnalytics_data2day2015
 

Recently uploaded

一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Analysis insight about a Flyball dog competition team's performance
Data quality - The True Big Data Challenge

  • 1. Data Quality The True Big Data Challenge Dr. Stefan Kühn Lead Data Scientist data2day 2016 - Karlsruhe
  • 2. A short motivation • Some "famous" quotes • "Data are becoming the new raw material of business." • "The data fabric is the next middleware." • "Data matures like wine, applications like fish." • "There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days." • "Information is the oil of the 21st century, and analytics is the combustion engine."
  • 3. A short motivation • Data matures like wine?
  • 4. A short motivation • Data matures like wine? More like grapes…
  • 5. A short motivation • Some "critical" quotes • "Big Data is not the new oil." • "Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom." • "It’s easy to lie with statistics. It’s hard to tell the truth without statistics." • "Anything that can be measured can be improved."
  • 7. Twofold Approach to Data Quality • Does Data represent the real-world objects / events / concepts it is supposed to? • Does Data meet the expectations of the Data consumers and the requirements of intended usage? • Warning: Data is not facts! Data does not exist independently of its creation.
  • 8. Data as representation • Semiotic Triangle: Idea, Word, World • Where is the Data?
  • 9. Data as representation • Semiotic Triangle: Metadata, Data, World • Here is the Data
  • 10. Data and Metadata • Data implies a context -> Metadata • Metadata provides explicit knowledge about Data • Metadata enables a common understanding of Data inside an organization • Metadata serves as documentation and dictionary, as context for Data Understanding • Metadata is absolutely necessary for the effective use of Data.
  • 11. Responsibility for Data • Common Misunderstanding • Data and Data-related systems typically are managed and hosted by IT, therefore most people (from business and IT) tend to think that Data is part of IT and not of Business • BUT: Data is not the by-product of Business Processes, Data is THE product of Business Processes • Data Quality Improvement as Business Strategy • Shared Responsibility
  • 12. Data Creation as Observation • Data is created under specific Conditions and for specific Purposes • Creation process involves • Observed Object • Observer • Instrument • Example - Customer Self-Registration Form • Customer Information as Observed Object • Customer as Observer • Registration Form as Instrument • Instrument is not built / known by Observer.
  • 13. Data as Product • Analogy between manufacturing of products and creation / production of data • Data as core product of a business process • Transfer quality concepts from Software Development to "Data Development" • Testing • Staging • Versioning • Continuous Delivery / Improvement • Product Management • Standardization • Data Quality as Manufacturing Quality
  • 14. Expectations and Requirements • Implicit assumptions for usage of Data • Creation of Data is a business process • Expectations and requirements have to be explicitly known when defining the process • Data Quality is Business Process Quality • Constantly changing expectations and requirements make Data age like grapes… • Make all assumptions explicit.
  • 15. Data Producers • People or systems that create Data • Producers have control over what they create (given the functionality of the instrument) • Producers don’t have control over possible uses of data • Most Data is produced for a dedicated purpose but used for several purposes • Data Quality is fixed at the moment of creation • Data Quality starts with enabling producers to produce high-quality Data -> usable Data
  • 16. Data Consumers • People or systems that use Data within its lifecycle • Multiple systems and people can consume data • Often, Consumers are Producers at the same time • Consumers do not control the production of Data but have implicit assumptions and expectations about it • Data Quality Processes are Consumers of Data of an unknown Quality and Producers of Data of a defined Quality
  • 18. Problematic Aspects of Data Management • Data crosses Organizational Boundaries • Technical (IT) and non-technical (Business) roles have to communicate • Shared Responsibility instead of "Ownership" • No common definitions • Twelve Barriers to Effective Management of Data and Information Assets (Th. Redman) • Holistic Approach to Data Quality required
  • 19. Problematic Aspects of Data Management
  • 21. Summary Big Data • "Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." (George Dyson)
  • 22. What is Big Data? • Different Data sources • External Data • No control over data production • No sufficient documentation (Metadata) • No quality definitions available • Incompatible schema • Example: Car callbacks • Even more implicit assumptions • Big Data implies less information per unit of data • Lots of data points are redundant • Example: Measure a constant quantity once per day or once per second
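The redundancy example on slide 22 can be sketched in a few lines. This is only an illustration, not something taken from the talk: the reading `21.5`, the one-week window, and the helper name `entropy_bits` are assumptions. The sketch uses the empirical Shannon entropy as a rough stand-in for "information per data point".

```python
from collections import Counter
from math import log2

def entropy_bits(samples):
    """Empirical Shannon entropy (bits per sample) of a list of readings."""
    counts = Counter(samples)
    n = len(samples)
    # log2(n / c) is the surprise of a value that occurs c times out of n
    return sum((c / n) * log2(n / c) for c in counts.values())

# Hypothetical constant quantity, observed for one week
daily = [21.5] * 7                          # one reading per day
per_second = [21.5] * (7 * 24 * 60 * 60)    # one reading per second

# 86400 times more data points, but no additional information per point
size_factor = len(per_second) // len(daily)
info_daily = entropy_bits(daily)
info_per_second = entropy_bits(per_second)
```

Under these assumptions both series carry zero bits per sample, so the per-second series is larger without being more informative.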
  • 23. What is Big Data in the Media? • "new oil" • "gold" • "revolution" • "raw material" • "the future" • "bigger, better, faster, more" • "more data beats better algorithms" • …
  • 24. Three major problems • Redundancy • Big Data by Copy/Paste • Resolution • Every problem has an inherent time scale of change • Every problem has an inherent level of uncertainty • Increasing the resolution beyond these levels only resolves noise • Noise • Adding noisy features decreases the signal-to-noise ratio • Adding good but irrelevant features increases complexity and can look like noise
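The noise point on slide 24 can be illustrated with a small simulation. This is a sketch under assumptions that are not in the slides: two classes separated by 3 units on a single informative feature, unit-variance Gaussian noise, and squared Euclidean distance as the measure of separation. The names `sample`, `sq_dist` and `contrast` are hypothetical.

```python
import random

def sample(class_mean, n_noise, rng):
    """One informative feature plus n_noise pure-noise features."""
    informative = rng.gauss(class_mean, 1.0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
    return [informative] + noise

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def contrast(n_noise, n_pairs=2000, seed=0):
    """Ratio of between-class to within-class mean squared distance.

    A ratio near 1 means the classes are no longer distinguishable
    by distance, i.e. the signal is drowned in noise.
    """
    rng = random.Random(seed)
    between = within = 0.0
    for _ in range(n_pairs):
        a1 = sample(0.0, n_noise, rng)
        a2 = sample(0.0, n_noise, rng)
        b = sample(3.0, n_noise, rng)
        within += sq_dist(a1, a2)
        between += sq_dist(a1, b)
    return between / within

clean = contrast(n_noise=0)    # only the informative feature
noisy = contrast(n_noise=50)   # same signal plus 50 noise features
```

In this setup the signal stays fixed while every extra noise feature adds to both distances, so the ratio drifts towards 1 as noise features are added.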
  • 31. 349 variables - basically rank 1
  • 32. Moore’s Law • Image: Moore’s Law, by Wgsimon - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542
  • 33. What’s the point? • Moore’s Law • Number of transistors per area doubles every two years • Real-world Problem sizes • Grow at approximately the same speed • Algorithmic requirements • To answer the same questions in the same time, we need algorithms with linear complexity
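The argument on slide 33 can be checked with simple arithmetic. The sketch below assumes, as a simplification, that available compute and problem size both double exactly every two years and that an algorithm costs n^k operations; the function name `relative_runtime` is hypothetical.

```python
def relative_runtime(years, exponent, doubling_period=2.0):
    """Runtime after `years`, relative to today, for an O(n**exponent) algorithm.

    Assumes compute and problem size both grow by the same factor
    2 ** (years / doubling_period).
    """
    growth = 2.0 ** (years / doubling_period)
    work = growth ** exponent      # operations needed on the larger problem
    return work / growth           # divided by the faster hardware

linear_after_10y = relative_runtime(10, exponent=1)      # 1.0: same time as today
quadratic_after_10y = relative_runtime(10, exponent=2)   # 32.0: 32 times slower
```

Under these assumptions only the linear algorithm keeps pace; a quadratic one falls behind by the full growth factor.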
  • 35. Overall Goals • Implement Data Quality Standards • Detect Data Quality Problems • Manage Data Quality Problems • Root Cause Analysis of Data Quality Problems • Measure Costs of "poor" Data Quality • Measure Value of Data / "high" Data Quality • Measure Effects of Data Quality Improvements
  • 36. Typical Approaches • Force Data Quality (via order) • Fillrate: Make certain fields a must • Range: Prescribe list of valid options • Buy tool • Hire expert • Fire expert • Collect more bad Data • Relabel "bad" Data Pool as Data Lake • …
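The two checks named on slide 36, fill rate and range, might look like the following sketch. The records, the `country` field and the list of valid codes are invented for illustration, and the helper names are hypothetical. It also hints at why forcing a field to be mandatory does not by itself make the data valid.

```python
def fill_rate(records, field):
    """Share of records in which `field` is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def range_violations(records, field, valid):
    """Records whose non-empty `field` value is not in the list of valid options."""
    return [
        r for r in records
        if r.get(field) not in (None, "") and r[field] not in valid
    ]

# Hypothetical customer records from a self-registration form
customers = [
    {"name": "A", "country": "DE"},
    {"name": "B", "country": ""},      # field left empty
    {"name": "C", "country": "xx"},    # filled, but with a junk value
    {"name": "D", "country": "FR"},
]
valid_countries = {"DE", "FR", "AT"}   # assumed list of valid options

rate = fill_rate(customers, "country")
bad = range_violations(customers, "country", valid_countries)
```

Here the fill rate is 0.75, and the record with `"xx"` counts as filled while still failing the range check.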
  • 37. Summary of the problem Big Data • "Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." (George Dyson)
  • 38. Useful Approaches • Hire expert ;-) • Shared Responsibility • Common Understanding of and access to Metadata • This does not imply that the terminology has to change • Typically, the same term has a different meaning in different departments • Bounded contexts! (DDD) • Invest in creating better Data instead of fixing old and broken Data • Treat Data as Product, not as Fact