1
DATA WRANGLING
FIND LOAD CLEAN
2
DATA WRANGLING
FIND LOAD CLEAN
WHERE CAN I GET DATA FROM?
Client data isn't easy to get
THERE'S CLIENT DATA, AND THERE'S PUBLIC DATA
3
Public data isn't relevant
“We have internal information. Getting information from outside is our challenge. There’s no way of doing that.”
– Senior Editor, Leading Media Company
INDIA’S RELIGIONS
5
If you search on google.co.in for "how do I convert to", here are the suggestions Google shows
The popularity influences the order.
So there's a good chance that the religions on top are more often searched for.
AUSTRALIA’S RELIGIONS
6
But be careful of how you interpret it.
In Australia, PDF is not a religion. Unless you're a data scientist.
7
USE MULTIPLE APPROACHES TO FIND YOUR DATA
8
Public data catalogues
https://github.com/caesar0301/awesome-public-datasets
https://github.com/rasbt/pattern_classification/blob/master/resources/dataset_collections.md
Govt data websites
https://data.gov.in/
https://data.gov/
https://data.gov.uk/
https://data.gov.sg/
http://publicdata.eu/
or search on Google
https://www.google.com/
or ask people
Humans™
9
EXERCISE
LET'S FIND SOME DATASETS
(YOU PICK WHAT YOU WANT TO FIND. WE WILL SEARCH FOR IT)
10
DATA WRANGLING
FIND LOAD CLEAN
HOW DO I STORE & PROCESS DATA?
WE LOAD DATA INTO OUR PROGRAMS OR OTHERS'
11
Files
• Delimited text: CSV, TSV, PSV
• Formatted text: TXT, PRN
• Marked up text: HTML, XML, JSON, JSON Lines, YAML, SQL
• Spreadsheets: XLS*, ODS, MDB, ACCDB, DBF
• Specialised formats: HDF5, SQLite, DTA (Stata), C4.5, CDF
• Graph formats: GEXF, GDF, GML, GraphML, GraphViz DOT
• Unstructured: TXT, PDF, Images, Audio, Video, ...
Databases
• In-memory databases: DataFrames
• Relational databases: Oracle, MySQL, PostgreSQL, SQL Server, DB2, Sybase, Informix, ...
• Document databases: MongoDB, CouchDB, ElasticSearch, Firebase
• Distributed databases: HDFS, Spark
• Cloud data stores: BigQuery, DynamoDB, RedShift, Azure SQL Database, DocumentDB, ...
• APIs: Twitter, Facebook, Google, Wikipedia, YouTube, ...
Use CSV when sharing tabular data.
Use JSON for hierarchical data.
Use in-memory, else relational databases.
Don't analyse big data. Shrink it.
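The rules of thumb above can be sketched with Python's standard library. This is only an illustration: the column names (`id`, `age`, `city`, `notes`) and the rows are hypothetical, and keeping a few columns plus every n-th row is just one simple way to shrink data before analysing it.

```python
import csv
import io
import json

# Hypothetical tabular data; in practice this would be a CSV file on disk.
raw = io.StringIO(
    "id,age,city,notes\n"
    "1,34,Delhi,long free text\n"
    "2,29,Mumbai,more free text\n"
    "3,41,Chennai,even more text\n"
    "4,38,Pune,yet more text\n"
)

def shrink(rows, columns, every_nth=1):
    """Keep only the selected columns and every n-th row."""
    for i, row in enumerate(rows):
        if i % every_nth == 0:
            yield {col: row[col] for col in columns}

small = list(shrink(csv.DictReader(raw), columns=["id", "city"], every_nth=2))

# JSON suits hierarchical data, e.g. nesting the rows under a key.
as_json = json.dumps({"cities": small})
```

Because `shrink` is a generator, it never holds more than one input row in memory, which matters when the source file is large.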
12
EXERCISE
LET'S LOAD FROM A SITE
THE GOOGLE SEARCH DATA YOU SAW EARLIER
LET'S LOAD A BIG DATASET
A FEW COLUMNS FROM A LEAKED OK CUPID SURVEY
LET'S LOAD AN UNSTRUCTURED TABLE
A TABLE FROM THE MEDICAL CERTIFICATION OF CAUSE OF DEATH 2013 PDF
13
DATA WRANGLING
FIND LOAD CLEAN
HOW DO I FIX THE DATA ISSUES?
CHECK FOR ALL THESE DATA CLEANSING ACTIVITIES
14
Fix rows & columns
Fix missing values
Standardise values
Fix invalid values
Filter data
When we receive a dataset, we find a pattern of things that go wrong. These
can be fixed in specific ways.
Here's a workflow / checklist of things to look out for and fix.
After this, check if the data is complete, and sufficient to solve the problem.
FIX ROWS AND COLUMNS
15
Fix rows (with examples)
• Delete incorrect rows: header rows, footer rows
• Delete summary rows: total, subtotal rows
• Delete extra rows: column number indicators (1), (2), ...; blank rows
Fix columns (with examples)
• Add column names if missing: files with missing header row
• Rename columns consistently: abbreviations, encoded columns
• Delete unnecessary columns: unidentified columns, irrelevant columns
• Split columns for more data: split http://host:port/path into [Host, Port, Path]
• Merge columns for identifiers: merge Firstname, Lastname into Name; merge State, District into FullDistrict
• Align misaligned columns: dataset may have shifted columns
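A few of the row and column fixes above can be sketched in plain Python. The rows, names, and URLs below are hypothetical, and the junk-row rules are assumptions that would need tuning for a real file.

```python
from urllib.parse import urlsplit

# Hypothetical rows as read from a messy file: a column-number row,
# a blank row, and a summary row surround the real data.
rows = [
    ["(1)", "(2)", "(3)"],
    ["Asha", "Rao", "http://example.com:8080/reports"],
    ["", "", ""],
    ["Ravi", "Kumar", "http://example.org:443/data/2016"],
    ["Total", "", ""],
]

def is_junk(row):
    """True for blank rows, column-number indicators, and summary rows."""
    blank = all(cell.strip() == "" for cell in row)
    numbered = all(cell.startswith("(") and cell.endswith(")") for cell in row)
    summary = row[0].strip().lower() in {"total", "subtotal"}
    return blank or numbered or summary

clean = []
for first, last, url in (r for r in rows if not is_junk(r)):
    parts = urlsplit(url)
    clean.append({
        "Name": f"{first} {last}",  # merge two columns into one identifier
        "Host": parts.hostname,     # split one column into three
        "Port": parts.port,
        "Path": parts.path,
    })
```

Using `urlsplit` instead of manual string splitting avoids mistakes when the port or path is absent.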
FIX MISSING VALUES
16
Fix missing values (with examples)
• Set values as missing values: treat blanks, "NA", "XX", "999", etc. as missing
• Fill missing values with: a constant (e.g. zero); a column (e.g. created date defaults to updated date); a function (e.g. average of rows/columns); external data
• Remove missing values: delete row; delete column
• Fill partial missing values: missing time zone, century, etc.
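As a rough sketch of the options above, the snippet below marks placeholder strings as missing, then either fills the gaps with the column mean or drops them. The marker set and the sample column are assumptions; each dataset has its own conventions.

```python
from statistics import mean

# Assumed missing-value markers; adjust these per dataset.
MISSING_MARKERS = {"", "NA", "XX", "999"}

def to_number(text):
    """Return a float, or None when the cell holds a missing-value marker."""
    text = text.strip()
    return None if text in MISSING_MARKERS else float(text)

raw_ages = ["34", "NA", "29", "", "999", "41"]  # hypothetical column
ages = [to_number(a) for a in raw_ages]

# Option 1: fill with a function of the column (here, the mean of known values).
known = [a for a in ages if a is not None]
fill_value = mean(known)
filled = [a if a is not None else fill_value for a in ages]

# Option 2: remove the entries with missing values.
dropped = [a for a in ages if a is not None]
```

Which option is right depends on the analysis: filling keeps the row count stable, while dropping avoids inventing values.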
STANDARDISE VALUES
17
Standardise numbers (with examples)
• Remove outliers: removing high and low values
• Standardise units: lbs to kgs, m/s for speed
• Scale values if required: fit to percentage scale
• Standardise precision: 2.1 to 2.10
Standardise text (with examples)
• Remove extra characters: common prefix/suffix, leading/trailing/multiple spaces
• Standardise case: uppercase, lowercase, Title Case, Sentence case, etc.
• Standardise format: 23/10/16 to 2016/10/23; "Modi, Narendra" to "Narendra Modi"
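The standardisation steps above might look like the following helpers. This is a sketch under stated assumptions: the source date format is taken to be day/month/two-digit year, and names containing a comma are taken to be "Last, First".

```python
import re
from datetime import datetime

def clean_text(text):
    """Trim the ends and collapse runs of whitespace into one space."""
    return re.sub(r"\s+", " ", text).strip()

def standardise_date(text, source_format="%d/%m/%y"):
    """Convert a date to YYYY/MM/DD. The source format is an assumption."""
    return datetime.strptime(text, source_format).strftime("%Y/%m/%d")

def standardise_name(text):
    """Turn 'Last, First' into 'First Last'; clean other names as-is."""
    if "," in text:
        last, first = (part.strip() for part in text.split(",", 1))
        return f"{first} {last}"
    return clean_text(text)

def lbs_to_kg(pounds):
    """Convert pounds to kilograms (1 lb = 0.45359237 kg)."""
    return pounds * 0.45359237
```

For example, `standardise_name("Modi, Narendra")` gives `"Narendra Modi"`, and `clean_text("  New   Delhi ")` gives `"New Delhi"`.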
FIX INVALID VALUES
18
Fix invalid values (with examples)
• Encode unicode properly: CP1252 instead of UTF-8
• Convert incorrect data types: string to number ("12,300"); string to date ("2013-Aug"); number to string (PIN code 110001 to "110001")
• Correct values not in list: non-existent country, PIN code
• Correct wrong structure: phone number with over 10 digits
• Correct values beyond range: temperature below -273.15 °C (0 K)
• Validate internal rules: Gross sales > Net sales; Date of delivery > Date of ordering; if Title is "Mr" then Gender is "M"
In these cases, treat the value as "missing". Remove it, or fix it with a formula. The formula may involve the value, row, column, entire dataset, or external data.
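The type conversions and rule checks above can be sketched as below. The field names (`temperature_c`, `gross_sales`, and so on) are hypothetical, and the sketch assumes commas are thousands separators and that month abbreviations are in English.

```python
from datetime import datetime

def parse_number(text):
    """'12,300' -> 12300. Assumes commas are thousands separators."""
    return int(text.replace(",", ""))

def parse_year_month(text):
    """'2013-Aug' -> first day of that month. Assumes English month names."""
    return datetime.strptime(text, "%Y-%b")

def pin_to_text(pin):
    """Store PIN codes as text so they are never treated as quantities."""
    return str(pin)

def validate(record):
    """Return the fields that break a rule, so they can be treated as missing."""
    problems = []
    if record["temperature_c"] < -273.15:          # below absolute zero
        problems.append("temperature_c")
    if record["gross_sales"] < record["net_sales"]:  # gross should not be below net
        problems.append("gross_sales")
    if record["delivered"] < record["ordered"]:      # delivery cannot precede order
        problems.append("delivered")
    return problems

record = {
    "temperature_c": -300.0,
    "gross_sales": 120,
    "net_sales": 100,
    "ordered": datetime(2016, 10, 1),
    "delivered": datetime(2016, 9, 28),
}
invalid_fields = validate(record)
```

Returning the list of offending fields, rather than raising an error, lets the caller decide whether to blank the value, drop the row, or repair it with a formula.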
FILTER DATA
19
Filter data (with examples)
• Deduplicate data: remove identical rows; remove rows where some columns are identical
• Filter rows: filter by segments; filter by date period
• Filter columns: pick columns relevant to analysis
• Aggregate data: group by required keys, aggregate the rest
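A minimal sketch of the filter steps above, using hypothetical sales records: deduplicate identical rows, keep a date period, then group by one key and sum the rest.

```python
from collections import defaultdict

# Hypothetical sales records; the second row is an exact duplicate.
rows = [
    {"state": "TN", "year": 2015, "sales": 100},
    {"state": "TN", "year": 2015, "sales": 100},
    {"state": "TN", "year": 2016, "sales": 150},
    {"state": "KA", "year": 2016, "sales": 80},
    {"state": "KA", "year": 2016, "sales": 20},
]

# Deduplicate: keep the first occurrence of each identical row.
seen = set()
unique = []
for row in rows:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        unique.append(row)

# Filter rows by date period.
recent = [r for r in unique if r["year"] >= 2016]

# Aggregate: group by the required key and sum the rest.
totals = defaultdict(int)
for r in recent:
    totals[r["state"]] += r["sales"]
```

To deduplicate on only some columns, build `key` from those columns instead of from the whole row.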
20
EXERCISE
ASSEMBLY ELECTION DATA
SOMETHING WE DID A FEW YEARS AGO, AND IS WELL DOCUMENTED
The ECI website has this data.
21
… and, most of the data is in PDFs
22
The PDF files have a reasonably clear structure
23
… that translates into text that can be parsed
24
… which, with some effort, can be converted into a structured format
… and at this point, we need to start checking for errors.
25
At this point, we start checking what’s gone wrong
Each row here is one constituency. The number of candidates that have contested in each constituency in every year is shown as a table. You can see that some patterns emerge here.
26
Not every spelling error is easily identifiable by the first letter
• Parties are mis-spelt: MADMK, MAMAK, MDMK
• Party names change: AIADMK, ADMK, ADK
• Parties restructure: INC(I), INC
• Constituency names mis-spelt: BHADRACHALAM, BHADRACHELAM, BHADRAHCALAM
27
Fortunately, large scale data itself can provide a solution
28
… with modern tools that support machine learning
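The slides do not say which technique was used here. As one simple stand-in (string similarity from Python's `difflib`, not machine learning), the sketch below assumes the correct spelling is the most frequent one and maps rarer, similar spellings onto it. The counts and the `cutoff` value are hypothetical.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical counts: the correct spelling appears far more often than typos.
names = (["BHADRACHALAM"] * 8 + ["BHADRACHELAM"] + ["BHADRAHCALAM"]
         + ["MDMK"] * 5 + ["MADMK"])

def canonical_spellings(values, cutoff=0.8):
    """Map each spelling to the most frequent spelling that is similar to it."""
    counts = Counter(values)
    canonical = []  # spellings accepted so far, most frequent first
    mapping = {}
    for name, _ in counts.most_common():
        # Compare only against spellings already accepted as canonical.
        matches = get_close_matches(name, canonical, n=1, cutoff=cutoff)
        if matches:
            mapping[name] = matches[0]
        else:
            canonical.append(name)
            mapping[name] = name
    return mapping

mapping = canonical_spellings(names)
```

Similar strings are not always the same entity (party names can legitimately change or split), so a mapping like this is best treated as a list of suggestions to review by hand.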
29
30
DATA WRANGLING
FIND LOAD CLEAN

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 

Data Wrangling

  • 2. DATA WRANGLING: FIND, LOAD, CLEAN. WHERE CAN I GET DATA FROM?
  • 3. THERE'S CLIENT DATA, AND THERE'S PUBLIC DATA. Client data isn't easy to get. Public data isn't relevant.
  • 4. “We have internal information. Getting information from outside is our challenge. There’s no way of doing that.” – Senior Editor, Leading Media Company
  • 5. INDIA’S RELIGIONS. If you search on google.co.in for "how do I convert to", here are the suggestions Google shows. The popularity influences the order. So there's a good chance that the religions on top are more often searched for.
  • 6. AUSTRALIA’S RELIGIONS. But be careful of how you interpret it. In Australia, PDF is not a religion. Unless you're a data scientist.
  • 7. 7
  • 8. USE MULTIPLE APPROACHES TO FIND YOUR DATA. (1) Public data catalogues: https://github.com/caesar0301/awesome-public-datasets and https://github.com/rasbt/pattern_classification/blob/master/resources/dataset_collections.md (2) Govt data websites: https://data.gov.in/ https://data.gov/ https://data.gov.uk/ https://data.gov.sg/ http://publicdata.eu/ (3) or search on Google: https://www.google.com/ (4) or ask people: Humans™
  • 9. EXERCISE: LET'S FIND SOME DATASETS (YOU PICK WHAT YOU WANT TO FIND. WE WILL SEARCH FOR IT)
  • 10. DATA WRANGLING: FIND, LOAD, CLEAN. HOW DO I STORE & PROCESS DATA?
  • 11. WE LOAD DATA INTO OUR PROGRAMS OR OTHERS'. Files: delimited text (CSV, TSV, PSV); formatted text (TXT, PRN); marked up text (HTML, XML, JSON, JSON Lines, YAML, SQL); spreadsheets (XLS*, ODS, MDB, ACCDB, DBF); specialised formats (HDF5, SQLite, DTA (Stata), C4.5, CDF); graph formats (GEXF, GDF, GML, GraphML, GraphViz DOT); unstructured (TXT, PDF, Images, Audio, Video, ...). Databases: in-memory databases (DataFrames); relational databases (Oracle, MySQL, PostgreSQL, SQL Server, DB2, Sybase, Informix, ...); document databases (MongoDB, CouchDB, ElasticSearch, Firebase); distributed databases (HFS, Spark); cloud data stores (BigQuery, DynamoDB, RedShift, Azure SQL Database, DocumentDB, ...). APIs: Twitter, Facebook, Google, Wikipedia, YouTube, ... Use CSV when sharing tabular data. Use JSON for hierarchical data. Use in-memory, else relational databases. Don't analyse big data. Shrink it.
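The advice on this slide (CSV for tabular data, JSON for hierarchical data) can be sketched with Python's standard library. This is a minimal illustration, not the workshop's own code; the inline data, column names and values are invented placeholders.

```python
import csv
import io
import json

# Hypothetical tabular data, as it might arrive in a shared CSV file.
csv_text = "state,district,value\nS1,D1,100\nS1,D2,200\n"

# csv.DictReader yields one dict per row, keyed by the header row.
# Every value is read as a string; type conversion is a separate step.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Hypothetical hierarchical data: nesting is natural in JSON, awkward in CSV.
json_text = '{"state": "S1", "districts": [{"name": "D1"}, {"name": "D2"}]}'
doc = json.loads(json_text)

district_names = [d["name"] for d in doc["districts"]]
print(rows[0]["district"], district_names)
```

For real files, `io.StringIO(csv_text)` would be replaced by an opened file handle.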
  • 12. EXERCISE. LET'S LOAD FROM A SITE: THE GOOGLE SEARCH DATA YOU SAW EARLIER. LET'S LOAD A BIG DATASET: A FEW COLUMNS FROM A LEAKED OK CUPID SURVEY. LET'S LOAD AN UNSTRUCTURED TABLE: A TABLE FROM THE MEDICAL CERTIFICATION OF CAUSE OF DEATH 2013 PDF.
  • 13. DATA WRANGLING: FIND, LOAD, CLEAN. HOW DO I FIX THE DATA ISSUES?
  • 14. CHECK FOR ALL THESE DATA CLEANSING ACTIVITIES: fix rows & columns, fix missing values, standardise values, fix invalid values, filter data. When we receive a dataset, we find a pattern of things that go wrong. These can be fixed in specific ways. Here's a workflow / checklist of things to look out for and fix. After this, check if the data is complete, and sufficient to solve the problem.
  • 15. FIX ROWS AND COLUMNS. Fix rows: delete incorrect rows (header rows, footer rows); delete summary rows (total, subtotal rows); delete extra rows (column number indicators (1), (2), ...; blank rows). Fix columns: add column names if missing (files with missing header row); rename columns consistently (abbreviations, encoded columns); delete unnecessary columns (unidentified columns, irrelevant columns); split columns for more data (split http://host:port/path into [Host, Port, Path]); merge columns for identifiers (merge Firstname, Lastname into Name; merge State, District into FullDistrict); align misaligned columns (dataset may have shifted columns).
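A few of the row and column fixes above can be sketched as follows. The rows, the column names (`first`, `last`, `url`) and the helper functions are hypothetical, chosen only to illustrate deleting blank and summary rows, merging name columns, and splitting a URL into host, port and path.

```python
from urllib.parse import urlsplit

# Hypothetical raw rows: two real records, a blank row and a summary row.
raw_rows = [
    {"first": "Asha", "last": "Rao", "url": "http://example.com:8080/a/b"},
    {"first": "", "last": "", "url": ""},
    {"first": "Total", "last": "", "url": ""},
    {"first": "Ravi", "last": "Iyer", "url": "http://example.org:80/index"},
]

def is_real_record(row):
    """Keep rows that are neither blank nor a summary (total) row."""
    if not any(value.strip() for value in row.values()):
        return False
    return row["first"].strip().lower() not in {"total", "subtotal"}

def fix_columns(row):
    """Merge the name columns and split the URL into host, port, path."""
    parts = urlsplit(row["url"])
    return {
        "name": f'{row["first"]} {row["last"]}',
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
    }

clean_rows = [fix_columns(row) for row in raw_rows if is_real_record(row)]
```

Detecting summary rows by the word "Total" is an assumption that suits this toy data; real files usually need a rule specific to their layout.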
  • 16. FIX MISSING VALUES. Set values as missing values: treat blanks, "NA", "XX", "999", etc. as missing. Fill missing values with: a constant (e.g. zero); a column (e.g. created date defaults to updated date); a function (e.g. average of rows/columns); external data. Remove missing values: delete row; delete column. Fill partial missing values: missing time zone, century, etc.
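As a rough sketch of the first three steps on this slide: mark placeholder strings as missing, then either fill them with a function of the column (here the mean) or drop them. The marker set comes from the slide; the sample column and function name are made up.

```python
from statistics import mean

# Placeholder strings that should be treated as missing (from the slide).
MISSING_MARKERS = {"", "NA", "XX", "999"}

def to_number_or_none(text):
    """Convert a raw cell to float, or None when it is a missing marker."""
    text = text.strip()
    return None if text in MISSING_MARKERS else float(text)

# Hypothetical raw column of ages, read as strings.
raw_ages = ["34", "NA", "28", "", "999", "40"]
ages = [to_number_or_none(cell) for cell in raw_ages]

# Option 1: fill missing values with the mean of the known values.
known = [age for age in ages if age is not None]
column_mean = mean(known)
filled = [column_mean if age is None else age for age in ages]

# Option 2: remove the missing values (delete the rows).
dropped = known
```

Whether "999" is a missing marker or a genuine value depends on the column, so the marker set is best chosen per column rather than globally.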
  • 17. STANDARDISE VALUES. Standardise numbers: remove outliers (removing high and low values); standardise units (lbs to kgs, m/s for speed); scale values if required (fit to percentage scale); standardise precision (2.1 to 2.10). Standardise text: remove extra characters (common prefix/suffix, leading/trailing/multiple spaces); standardise case (uppercase, lowercase, Title Case, Sentence case, etc.); standardise format (23/10/16 to 2016/10/23; "Modi, Narendra" to "Narendra Modi").
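The text and unit fixes on this slide map onto small helper functions. The sketch below assumes dates arrive as day/month/two-digit-year and names as "Last, First"; the function names are illustrative, not part of the original material.

```python
from datetime import datetime

def standardise_text(value):
    """Collapse repeated whitespace, trim the ends, and apply Title Case."""
    return " ".join(value.split()).title()

def standardise_date(value):
    """Convert dd/mm/yy into yyyy/mm/dd."""
    return datetime.strptime(value, "%d/%m/%y").strftime("%Y/%m/%d")

def standardise_name(value):
    """Convert "Last, First" into "First Last"."""
    last, first = [part.strip() for part in value.split(",", 1)]
    return f"{first} {last}"

def lbs_to_kg(pounds):
    """Convert pounds to kilograms, rounded to two decimal places."""
    return round(pounds * 0.45359237, 2)

print(standardise_text("  narendra   MODI "))
print(standardise_date("23/10/16"))
print(standardise_name("Modi, Narendra"))
print(lbs_to_kg(10))
```

A two-digit year is ambiguous about the century; `%y` resolves it with a fixed rule, which is one instance of the "missing century" problem listed under missing values.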
  • 18. FIX INVALID VALUES. Encode Unicode properly (e.g. CP1252 instead of UTF-8). Convert incorrect data types: string to number ("12,300"); string to date ("2013-Aug"); number to string (PIN Code 110001 to "110001"). Correct values not in list (non-existent country, PIN code). Correct wrong structure (phone number with over 10 digits). Correct values beyond range (temperature below -273.15 °C, i.e. 0 K). Validate internal rules: Gross sales > Net sales; Date of delivery > Date of ordering; if Title is "Mr" then Gender is "M". In these cases, treat the value as "missing". Remove it, or fix it with a formula. The formula may involve the value, row, column, entire dataset, or external data.
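The type conversions, range check and internal rules above might look like this in code. The record layout and field names (`gross_sales`, `net_sales`, `ordered`, `delivered`) are assumptions made for the example, and invalid values are turned into `None` so they can be handled as missing.

```python
from datetime import date, datetime

ABSOLUTE_ZERO_C = -273.15

def parse_number(text):
    """String to number: strip thousands separators before converting."""
    return int(text.replace(",", ""))

def parse_year_month(text):
    """String to date: the first day of the given month.
    %b is locale-dependent; English month abbreviations are assumed."""
    return datetime.strptime(text, "%Y-%b").date()

def clean_temperature(celsius):
    """Values below absolute zero are treated as missing (None)."""
    return celsius if celsius >= ABSOLUTE_ZERO_C else None

def broken_rules(order):
    """Return the internal rules that a record violates."""
    problems = []
    if order["net_sales"] > order["gross_sales"]:
        problems.append("net sales exceed gross sales")
    if order["delivered"] < order["ordered"]:
        problems.append("delivered before ordered")
    return problems

# Hypothetical record that breaks both rules.
order = {
    "gross_sales": 100, "net_sales": 120,
    "ordered": date(2016, 10, 23), "delivered": date(2016, 10, 20),
}

# Number to string: a PIN code is an identifier, not a quantity.
pin_code = str(110001)
```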
  • 19. FILTER DATA. Deduplicate data: remove identical rows; remove rows where some columns are identical. Filter rows: filter by segments; filter by date period. Filter columns: pick columns relevant to analysis. Aggregate data: group by required keys, aggregate the rest.
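Deduplicating, filtering and aggregating can be shown on a handful of invented (year, region, sales) rows. This is a minimal sketch with plain Python containers; a DataFrame library would express the same steps more compactly.

```python
from collections import defaultdict

# Hypothetical rows of (year, region, sales).
rows = [
    ("2016", "North", 10),
    ("2016", "North", 10),  # exact duplicate of the row above
    ("2016", "South", 5),
    ("2017", "North", 7),
    ("2017", "South", 3),
]

# Deduplicate: dict.fromkeys keeps the first occurrence and preserves order.
unique_rows = list(dict.fromkeys(rows))

# Filter rows by period.
rows_2016 = [row for row in unique_rows if row[0] == "2016"]

# Aggregate: group by region and sum the sales.
totals = defaultdict(int)
for year, region, sales in unique_rows:
    totals[region] += sales
```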
  • 20. EXERCISE: ASSEMBLY ELECTION DATA. SOMETHING WE DID A FEW YEARS AGO, AND WHICH IS WELL DOCUMENTED.
  • 21. The ECI website has this data.
  • 22. … and, most of the data is in PDFs
  • 23. The PDF files have a reasonably clear structure
  • 24. … that translates into text that can be parsed
  • 25. … which, with some effort, can be converted into a structured format … and at this point, we need to start checking for errors.
  • 26. At this point, we start checking what’s gone wrong. Each row here is one constituency. The number of candidates that have contested in each constituency in every year is shown as a table. You can see that some patterns emerge here.
  • 27. Not every spelling error is easily identifiable by the first letter. Parties are mis-spelt: MADMK, MAMAK, MDMK. Party names change: AIADMK, ADMK, ADK. Parties restructure: INC(I), INC. Constituency names mis-spelt: BHADRACHALAM, BHADRACHELAM, BHADRAHCALAM.
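One simple way to catch spelling variants like these is approximate string matching against a list of known names. The sketch below uses `difflib` from Python's standard library; it is a stand-in for illustration, not the approach used in the original project, and the list of known names and the cutoff of 0.8 are assumptions.

```python
import difflib

# Assumed list of canonical names; in practice this would come from a
# trusted reference list or from the most frequent spellings in the data.
KNOWN_NAMES = ["BHADRACHALAM", "MDMK", "AIADMK"]

def canonicalise(name, known_names=KNOWN_NAMES, cutoff=0.8):
    """Return the closest known name, or the input unchanged if none is close."""
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else name

for raw in ["BHADRACHELAM", "BHADRAHCALAM", "MADMK"]:
    print(raw, "->", canonicalise(raw))
```

String similarity only addresses misspellings. Renamed or restructured parties (AIADMK and ADMK, INC(I) and INC) need an explicit mapping, and short names that differ by several letters may fall below a strict cutoff.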
  • 28. Fortunately, large scale data itself can provide a solution
  • 29. … with modern tools that support machine learning