SlideShare a Scribd company logo
1 of 48
Data Cleansing and Data Understanding
Best Practices and Lessons from the Field
Casey Stella
@casey_stella
2017
1
InfoQ.com: News & Community Site
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
practices-data-curate
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon London
www.qconlondon.com
Hi, I’m Casey Stella!
• I’m a Principal Software Engineer at Hortonworks, a Hadoop vendor, writing open
source software
• I work on Apache Metron (Incubating), constructing a platform to do advanced
analytics and data science for cyber security at scale
2
Hi, I’m Casey Stella!
• I’m a Principal Software Engineer at Hortonworks, a Hadoop vendor, writing open
source software
• I work on Apache Metron (Incubating), constructing a platform to do advanced
analytics and data science for cyber security at scale
• Prior to this, I was
• Doing data science consulting on the Hadoop ecosystem for Hortonworks
• Doing data mining on medical data at Explorys using the Hadoop ecosystem
• Doing signal processing on seismic data at Ion Geophysical
• A graduate student in the Math department at Texas A&M in algorithmic
complexity theory
2
Garbage In =⇒ Garbage Out
“80% of the work in any data project is in cleaning the data.”
— D.J. Patel in Data Jujitsu
3
Data Cleansing =⇒ Data Understanding
There are two ways to understand your data
• Syntactic Understanding
• Semantic Understanding
If you hope to get anything out of your data, you have to have a handle on both.
4
Syntactic Understanding: True Types
A true type is a label applied to data points xi such that xi are mutually comparable.
• Schemas type != true data type
• A specific column can have many different types
5
Syntactic Understanding: True Types
A true type is a label applied to data points xi such that xi are mutually comparable.
• Schemas type != true data type
• A specific column can have many different types
“735” has a true type of integer but could have a schema type of string or double
5
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
6
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative.
6
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative.
• For non-numeric data, counts and distinct counts of a canonical representation are
extremely useful.
6
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative.
• For non-numeric data, counts and distinct counts of a canonical representation are
extremely useful.
• For ALL data, an indication of how “empty” the data is.
6
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative.
• For non-numeric data, counts and distinct counts of a canonical representation are
extremely useful.
• For ALL data, an indication of how “empty” the data is.
Canonical representations are representations which give you an idea at a glance of the
data format
6
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative.
• For non-numeric data, counts and distinct counts of a canonical representation are
extremely useful.
• For ALL data, an indication of how “empty” the data is.
Canonical representations are representations which give you an idea at a glance of the
data format
• Replacing digits with the character ‘d’
• Stripping whitespace
• Normalizing punctuation
6
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative.
• For non-numeric data, counts and distinct counts of a canonical representation are
extremely useful.
• For ALL data, an indication of how “empty” the data is.
Canonical representations are representations which give you an idea at a glance of the
data format
• Replacing digits with the character ‘d’
• Stripping whitespace
• Normalizing punctuation
Data density is an assumption underlying any conclusions drawn from your
data. 6
Syntactic Understanding: Density over Time
∆Density
∆t is how data clumps change over time.
7
Syntactic Understanding: Density over Time
∆Density
∆t is how data clumps change over time.
This kind of analysis can show
• Problems in the data pipeline
7
Syntactic Understanding: Density over Time
∆Density
∆t is how data clumps change over time.
This kind of analysis can show
• Problems in the data pipeline
• Whether the assumptions of your analysis are violated
7
Syntactic Understanding: Density over Time
∆Density
∆t is how data clumps change over time.
This kind of analysis can show
• Problems in the data pipeline
• Whether the assumptions of your analysis are violated
∆Density
∆t =⇒
• Automation
7
Syntactic Understanding: Density over Time
∆Density
∆t is how data clumps change over time.
This kind of analysis can show
• Problems in the data pipeline
• Whether the assumptions of your analysis are violated
∆Density
∆t =⇒
• Automation
• Outlier Alerting
7
Story Time: A Summation over Time Saves Face
The stage: An unnamed startup analyzing clinical effectiveness measurement for
hospital.
8
Story Time: A Summation over Time Saves Face
The stage: An unnamed startup analyzing clinical effectiveness measurement for
hospital.
• Think of it as a rules engine that takes medical data and outputs how well doctors
and departments are doing
8
Story Time: A Summation over Time Saves Face
The stage: An unnamed startup analyzing clinical effectiveness measurement for
hospital.
• Think of it as a rules engine that takes medical data and outputs how well doctors
and departments are doing
• Insights aren’t trusted if they’re wrong.
8
Story Time: A Summation over Time Saves Face
The stage: An unnamed startup analyzing clinical effectiveness measurement for
hospital.
• Think of it as a rules engine that takes medical data and outputs how well doctors
and departments are doing
• Insights aren’t trusted if they’re wrong.
• Correctness depend on good data
8
Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than
how it is stored.
9
Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than
how it is stored.
• Finding equivalences based on semantic understanding are often context sensitive.
9
Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than
how it is stored.
• Finding equivalences based on semantic understanding are often context sensitive.
• May come from humans (e.g. domain experience and ontologies)
9
Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than
how it is stored.
• Finding equivalences based on semantic understanding are often context sensitive.
• May come from humans (e.g. domain experience and ontologies)
• May come from machine learning (e.g. analyzing usage patterns to find synonyms)
9
Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than
how it is stored.
• Finding equivalences based on semantic understanding are often context sensitive.
• May come from humans (e.g. domain experience and ontologies)
• May come from machine learning (e.g. analyzing usage patterns to find synonyms)
Semantic understanding may require data science. At the same time, data science will
require semantic understanding.
9
Story Time: Very Busy Bags of Mostly Water; Very Confused Boxes of Silicon
The stage: An unnamed insurance company building models to predict disease
10
Story Time: Very Busy Bags of Mostly Water; Very Confused Boxes of Silicon
The stage: An unnamed insurance company building models to predict disease
• Doctors and Nurses are busy people
10
Story Time: Very Busy Bags of Mostly Water; Very Confused Boxes of Silicon
The stage: An unnamed insurance company building models to predict disease
• Doctors and Nurses are busy people
• Humans suffer from confirmation bias
10
Story Time: Very Busy Bags of Mostly Water; Very Confused Boxes of Silicon
The stage: An unnamed insurance company building models to predict disease
• Doctors and Nurses are busy people
• Humans suffer from confirmation bias
• Machines can only interpret what they can see
10
Story Time: Very Busy Bags of Mostly Water; Very Confused Boxes of Silicon
The stage: An unnamed insurance company building models to predict disease
• Doctors and Nurses are busy people
• Humans suffer from confirmation bias
• Machines can only interpret what they can see
• Together we can fill in the gaps
10
SummarizerCLI
Implications for Team Structure
To be successful,
15
Implications for Team Structure
To be successful,
• Your data science teams have to be integrally involved in the data transformation
and understanding.
15
Implications for Team Structure
To be successful,
• Your data science teams have to be integrally involved in the data transformation
and understanding.
• Your data science teams have to be willing to get their hands dirty
15
Implications for Team Structure
To be successful,
• Your data science teams have to be integrally involved in the data transformation
and understanding.
• Your data science teams have to be willing to get their hands dirty
• Your data science teams have to be allowed to get their hands dirty
15
Implications for Team Structure
To be successful,
• Your data science teams have to be integrally involved in the data transformation
and understanding.
• Your data science teams have to be willing to get their hands dirty
• Your data science teams have to be allowed to get their hands dirty
• Your data science teams need software engineering chops.
15
Questions
Questions
Thanks for your attention! Questions?
• Code & scripts for this talk available on my github presentation page.1
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
1
http://github.com/cestella/presentations/
16
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
practices-data-curate

More Related Content

Viewers also liked

Essay Writing Simplified Secret Effective Shortcuts
Essay Writing Simplified  Secret Effective ShortcutsEssay Writing Simplified  Secret Effective Shortcuts
Essay Writing Simplified Secret Effective ShortcutsEssayWriter.Co.Uk
 
Building a Great Engineering Culture
Building a Great Engineering CultureBuilding a Great Engineering Culture
Building a Great Engineering CultureSimon Guest
 
A Practical Guide to Cynefin
A Practical Guide to CynefinA Practical Guide to Cynefin
A Practical Guide to CynefinDoc Norton
 
From Building a Marketplace to Building Teams
From Building a Marketplace to Building TeamsFrom Building a Marketplace to Building Teams
From Building a Marketplace to Building TeamsMike Brittain
 
Kevin beyond theschlockofthenew_feb2015
Kevin beyond theschlockofthenew_feb2015Kevin beyond theschlockofthenew_feb2015
Kevin beyond theschlockofthenew_feb2015Plan
 
Christian Titze, "Hello From the Other Side: Adapting the Agile Agency to Cli...
Christian Titze, "Hello From the Other Side: Adapting the Agile Agency to Cli...Christian Titze, "Hello From the Other Side: Adapting the Agile Agency to Cli...
Christian Titze, "Hello From the Other Side: Adapting the Agile Agency to Cli...WebVisions
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientistsAjay Ohri
 
Becoming a Better Programmer
Becoming a Better ProgrammerBecoming a Better Programmer
Becoming a Better ProgrammerPete Goodliffe
 
data science: past present & future [American Statistical Association (ASA) C...
data science: past present & future [American Statistical Association (ASA) C...data science: past present & future [American Statistical Association (ASA) C...
data science: past present & future [American Statistical Association (ASA) C...chris wiggins
 

Viewers also liked (14)

Essay Writing Simplified Secret Effective Shortcuts
Essay Writing Simplified  Secret Effective ShortcutsEssay Writing Simplified  Secret Effective Shortcuts
Essay Writing Simplified Secret Effective Shortcuts
 
Building a Great Engineering Culture
Building a Great Engineering CultureBuilding a Great Engineering Culture
Building a Great Engineering Culture
 
A Practical Guide to Cynefin
A Practical Guide to CynefinA Practical Guide to Cynefin
A Practical Guide to Cynefin
 
From Building a Marketplace to Building Teams
From Building a Marketplace to Building TeamsFrom Building a Marketplace to Building Teams
From Building a Marketplace to Building Teams
 
Kevin beyond theschlockofthenew_feb2015
Kevin beyond theschlockofthenew_feb2015Kevin beyond theschlockofthenew_feb2015
Kevin beyond theschlockofthenew_feb2015
 
Brayton cycle
Brayton cycleBrayton cycle
Brayton cycle
 
Natural disasters of pakistan
Natural disasters of pakistanNatural disasters of pakistan
Natural disasters of pakistan
 
Brayton cycle
Brayton cycleBrayton cycle
Brayton cycle
 
Javaで和暦と元号
Javaで和暦と元号Javaで和暦と元号
Javaで和暦と元号
 
Christian Titze, "Hello From the Other Side: Adapting the Agile Agency to Cli...
Christian Titze, "Hello From the Other Side: Adapting the Agile Agency to Cli...Christian Titze, "Hello From the Other Side: Adapting the Agile Agency to Cli...
Christian Titze, "Hello From the Other Side: Adapting the Agile Agency to Cli...
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
 
Becoming a Better Programmer
Becoming a Better ProgrammerBecoming a Better Programmer
Becoming a Better Programmer
 
data science: past present & future [American Statistical Association (ASA) C...
data science: past present & future [American Statistical Association (ASA) C...data science: past present & future [American Statistical Association (ASA) C...
data science: past present & future [American Statistical Association (ASA) C...
 
The 2017 Budget and Economic Outlook
The 2017 Budget and Economic OutlookThe 2017 Budget and Economic Outlook
The 2017 Budget and Economic Outlook
 

More from C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoC4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileC4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 

More from C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Recently uploaded

Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastUXDXConf
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityScyllaDB
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfSrushith Repakula
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfFIDO Alliance
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandIES VE
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeCzechDreamin
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...FIDO Alliance
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...CzechDreamin
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty SecureFemke de Vroome
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Julian Hyde
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Patrick Viafore
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGDSC PJATK
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIES VE
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireExakis Nelite
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...marcuskenyatta275
 

Recently uploaded (20)

Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & Ireland
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 

Data Cleansing and Understanding Best Practices

  • 1. Data Cleansing and Data Understanding Best Practices and Lessons from the Field Casey Stella @casey_stella 2017 1
  • 2. InfoQ.com: News & Community Site Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ practices-data-curate • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  • 4. Hi, I’m Casey Stella! • I’m a Principal Software Engineer at Hortonworks, a Hadoop vendor, writing open source software • I work on Apache Metron (Incubating), constructing a platform to do advanced analytics and data science for cyber security at scale 2
  • 5. Hi, I’m Casey Stella! • I’m a Principal Software Engineer at Hortonworks, a Hadoop vendor, writing open source software • I work on Apache Metron (Incubating), constructing a platform to do advanced analytics and data science for cyber security at scale • Prior to this, I was • Doing data science consulting on the Hadoop ecosystem for Hortonworks • Doing data mining on medical data at Explorys using the Hadoop ecosystem • Doing signal processing on seismic data at Ion Geophysical • A graduate student in the Math department at Texas A&M in algorithmic complexity theory 2
  • 6. Garbage In =⇒ Garbage Out “80% of the work in any data project is in cleaning the data.” — D.J. Patel in Data Jujitsu 3
  • 7. Data Cleansing =⇒ Data Understanding There are two ways to understand your data • Syntactic Understanding • Semantic Understanding If you hope to get anything out of your data, you have to have a handle on both. 4
  • 8. Syntactic Understanding: True Types A true type is a label applied to data points xi such that xi are mutually comparable. • Schemas type != true data type • A specific column can have many different types 5
  • 9. Syntactic Understanding: True Types A true type is a label applied to data points xi such that xi are mutually comparable. • Schemas type != true data type • A specific column can have many different types “735” has a true type of integer but could have a schema type of string or double 5
  • 10. Syntactic Understanding: Density Data density is an indication of how data is clumped together. 6
  • 11. Syntactic Understanding: Density Data density is an indication of how data is clumped together. • For numerical data, distributions and statistical characteristics are informative. 6
  • 12. Syntactic Understanding: Density Data density is an indication of how data is clumped together. • For numerical data, distributions and statistical characteristics are informative. • For non-numeric data, counts and distinct counts of a canonical representation are extremely useful. 6
  • 13. Syntactic Understanding: Density Data density is an indication of how data is clumped together. • For numerical data, distributions and statistical characteristics are informative. • For non-numeric data, counts and distinct counts of a canonical representation are extremely useful. • For ALL data, an indication of how “empty” the data is. 6
  • 14. Syntactic Understanding: Density Data density is an indication of how data is clumped together. • For numerical data, distributions and statistical characteristics are informative. • For non-numeric data, counts and distinct counts of a canonical representation are extremely useful. • For ALL data, an indication of how “empty” the data is. Canonical representations are representations which give you an idea at a glance of the data format 6
  • 15. Syntactic Understanding: Density Data density is an indication of how data is clumped together. • For numerical data, distributions and statistical characteristics are informative. • For non-numeric data, counts and distinct counts of a canonical representation are extremely useful. • For ALL data, an indication of how “empty” the data is. Canonical representations are representations which give you an idea at a glance of the data format • Replacing digits with the character ‘d’ • Stripping whitespace • Normalizing punctuation 6
  • 16. Syntactic Understanding: Density Data density is an indication of how data is clumped together. • For numerical data, distributions and statistical characteristics are informative. • For non-numeric data, counts and distinct counts of a canonical representation are extremely useful. • For ALL data, an indication of how “empty” the data is. Canonical representations are representations which give you an idea at a glance of the data format • Replacing digits with the character ‘d’ • Stripping whitespace • Normalizing punctuation Data density is an assumption underlying any conclusions drawn from your data. 6
  • 17. Syntactic Understanding: Density over Time ∆Density ∆t is how data clumps change over time. 7
  • 18. Syntactic Understanding: Density over Time ∆Density ∆t is how data clumps change over time. This kind of analysis can show • Problems in the data pipeline 7
  • 19. Syntactic Understanding: Density over Time ∆Density ∆t is how data clumps change over time. This kind of analysis can show • Problems in the data pipeline • Whether the assumptions of your analysis are violated 7
  • 20. Syntactic Understanding: Density over Time ∆Density ∆t is how data clumps change over time. This kind of analysis can show • Problems in the data pipeline • Whether the assumptions of your analysis are violated ∆Density ∆t =⇒ • Automation 7
  • 21. Syntactic Understanding: Density over Time ∆Density ∆t is how data clumps change over time. This kind of analysis can show • Problems in the data pipeline • Whether the assumptions of your analysis are violated ∆Density ∆t =⇒ • Automation • Outlier Alerting 7
  • 22. Story Time: A Summation over Time Saves Face The stage: An unnamed startup analyzing clinical effectiveness measurement for hospital. 8
  • 23. Story Time: A Summation over Time Saves Face The stage: An unnamed startup analyzing clinical effectiveness measurement for hospital. • Think of it as a rules engine that takes medical data and outputs how well doctors and departments are doing 8
  • 24. Story Time: A Summation over Time Saves Face The stage: An unnamed startup analyzing clinical effectiveness measurement for hospital. • Think of it as a rules engine that takes medical data and outputs how well doctors and departments are doing • Insights aren’t trusted if they’re wrong. 8
  • 25. Story Time: A Summation over Time Saves Face The stage: An unnamed startup analyzing clinical effectiveness measurement for hospital. • Think of it as a rules engine that takes medical data and outputs how well doctors and departments are doing • Insights aren’t trusted if they’re wrong. • Correctness depend on good data 8
  • 26. Semantic Understanding: “Do what I mean, not what I say” Semantic understanding is understanding based on how the data is used rather than how it is stored. 9
  • 27. Semantic Understanding: “Do what I mean, not what I say” Semantic understanding is understanding based on how the data is used rather than how it is stored. • Finding equivalences based on semantic understanding are often context sensitive. 9
  • 28. Semantic Understanding: “Do what I mean, not what I say” Semantic understanding is understanding based on how the data is used rather than how it is stored. • Finding equivalences based on semantic understanding are often context sensitive. • May come from humans (e.g. domain experience and ontologies) 9
  • 29. Semantic Understanding: “Do what I mean, not what I say” Semantic understanding is understanding based on how the data is used rather than how it is stored. • Finding equivalences based on semantic understanding are often context sensitive. • May come from humans (e.g. domain experience and ontologies) • May come from machine learning (e.g. analyzing usage patterns to find synonyms) 9
  • 30. Semantic Understanding: “Do what I mean, not what I say” Semantic understanding is understanding based on how the data is used rather than how it is stored. • Finding equivalences based on semantic understanding are often context sensitive. • May come from humans (e.g. domain experience and ontologies) • May come from machine learning (e.g. analyzing usage patterns to find synonyms) Semantic understanding may require data science. At the same time, data science will require semantic understanding. 9
  • 31. Story Time: Very Busy Bags of Mostly Water; Very Confused Boxes of Silicon The stage: An unnamed insurance company building models to predict disease 10
  • 32. Story Time: Very Busy Bags of Mostly Water; Very Confused Boxes of Silicon The stage: An unnamed insurance company building models to predict disease • Doctors and Nurses are busy people 10
  • 33. Story Time: Very Busy Bags of Mostly Water; Very Confused Boxes of Silicon The stage: An unnamed insurance company building models to predict disease • Doctors and Nurses are busy people • Humans suffer from confirmation bias 10
  • 34. Story Time: Very Busy Bags of Mostly Water; Very Confused Boxes of Silicon The stage: An unnamed insurance company building models to predict disease • Doctors and Nurses are busy people • Humans suffer from confirmation bias • Machines can only interpret what they can see 10
  • 35. Story Time: Very Busy Bags of Mostly Water; Very Confused Boxes of Silicon The stage: An unnamed insurance company building models to predict disease • Doctors and Nurses are busy people • Humans suffer from confirmation bias • Machines can only interpret what they can see • Together we can fill in the gaps 10
  • 37.
  • 38.
  • 39.
  • 40.
  • 41. Implications for Team Structure To be successful, 15
  • 42. Implications for Team Structure To be successful, • Your data science teams have to be integrally involved in the data transformation and understanding. 15
  • 43. Implications for Team Structure To be successful, • Your data science teams have to be integrally involved in the data transformation and understanding. • Your data science teams have to be willing to get their hands dirty 15
  • 44. Implications for Team Structure To be successful, • Your data science teams have to be integrally involved in the data transformation and understanding. • Your data science teams have to be willing to get their hands dirty • Your data science teams have to be allowed to get their hands dirty 15
  • 45. Implications for Team Structure To be successful, • Your data science teams have to be integrally involved in the data transformation and understanding. • Your data science teams have to be willing to get their hands dirty • Your data science teams have to be allowed to get their hands dirty • Your data science teams need software engineering chops. 15
  • 47. Questions Thanks for your attention! Questions? • Code & scripts for this talk available on my github presentation page.1 • Find me at http://caseystella.com • Twitter handle: @casey_stella • Email address: cstella@hortonworks.com 1 http://github.com/cestella/presentations/ 16
  • 48. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ practices-data-curate