
Data Wrangling

Data wrangling is the process of finding, loading and cleaning data in the real world. It is the precursor to data analysis and data science.

This presentation explains sources where you can find data, the various formats in which data is usually available or stored, and a checklist of activities to perform when cleaning the data.

Published in: Data & Analytics

  1. DATA WRANGLING: FIND, LOAD, CLEAN
  2. DATA WRANGLING: FIND, LOAD, CLEAN
     WHERE CAN I GET DATA FROM?
  3. THERE'S CLIENT DATA, AND THERE'S PUBLIC DATA
     • Client data isn't easy to get
     • Public data isn't relevant
  4. “We have internal information. Getting information from outside is our challenge. There’s no way of doing that.” – Senior Editor, Leading Media Company
  5. INDIA’S RELIGIONS
     If you search on google.co.in for "how do I convert to", here are the suggestions Google shows. The popularity influences the order. So there's a good chance that the religions on top are more often searched for.
  6. AUSTRALIA’S RELIGIONS
     But be careful of how you interpret it. In Australia, PDF is not a religion. Unless you're a data scientist.
  7. [No text on this slide]
  8. USE MULTIPLE APPROACHES TO FIND YOUR DATA
     1. Public data catalogues
        https://github.com/caesar0301/awesome-public-datasets
        https://github.com/rasbt/pattern_classification/blob/master/resources/dataset_collections.md
     2. Govt data websites
        https://data.gov.in/
        https://data.gov/
        https://data.gov.uk/
        https://data.gov.sg/
        http://publicdata.eu/
     3. or search on Google
        https://www.google.com/
     4. or ask people
        Humans™
  9. EXERCISE: LET'S FIND SOME DATASETS
     (YOU PICK WHAT YOU WANT TO FIND. WE WILL SEARCH FOR IT)
  10. DATA WRANGLING: FIND, LOAD, CLEAN
      HOW DO I STORE & PROCESS DATA?
  11. WE LOAD DATA INTO OUR PROGRAMS OR OTHERS'
      Files
      • Delimited text: CSV, TSV, PSV
      • Formatted text: TXT, PRN
      • Marked up text: HTML, XML, JSON, JSON Line, YAML, SQL
      • Spreadsheets: XLS*, ODS, MDB, ACCDB, DBF
      • Specialised formats: HDF5, SQLite, DTA (Stata), C4.5, CDF
      • Graph formats: GEXF, GDF, GML, GraphML, GraphViz DOT
      • Unstructured: TXT, PDF, Images, Audio, Video, ...
      Databases
      • In-memory databases: DataFrames
      • Relational databases: Oracle, MySQL, PostgreSQL, SQL Server, DB2, Sybase, Informix, ...
      • Document databases: MongoDB, CouchDB, ElasticSearch, Firebase
      • Distributed databases: HFS, Spark
      • Cloud data stores: BigQuery, DynamoDB, RedShift, Azure SQL Database, DocumentDB, ...
      • APIs: Twitter, Facebook, Google, Wikipedia, YouTube, ...
      Use CSV when sharing tabular data. Use JSON for hierarchical data. Use in-memory, else relational databases. Don't analyse big data. Shrink it.
  12. EXERCISE
      • LET'S LOAD FROM A SITE: THE GOOGLE SEARCH DATA YOU SAW EARLIER
      • LET'S LOAD A BIG DATASET: A FEW COLUMNS FROM A LEAKED OK CUPID SURVEY
      • LET'S LOAD AN UNSTRUCTURED TABLE: A TABLE FROM THE MEDICAL CERTIFICATION OF CAUSE OF DEATH 2013 PDF
  13. DATA WRANGLING: FIND, LOAD, CLEAN
      HOW DO I FIX THE DATA ISSUES?
  14. CHECK FOR ALL THESE DATA CLEANSING ACTIVITIES
      • Fix rows & columns
      • Fix missing values
      • Standardise values
      • Fix invalid values
      • Filter data
      When we receive a dataset, we find a pattern of things that go wrong. These can be fixed in specific ways. Here's a workflow / checklist of things to look out for and fix. After this, check if the data is complete, and sufficient to solve the problem.
  15. FIX ROWS AND COLUMNS
      Fix rows (with examples)
      • Delete incorrect rows: header rows, footer rows
      • Delete summary rows: total, subtotal rows
      • Delete extra rows: column number indicators (1), (2), ...; blank rows
      Fix columns (with examples)
      • Add column names if missing: files with missing header row
      • Rename columns consistently: abbreviations, encoded columns
      • Delete unnecessary columns: unidentified columns, irrelevant columns
      • Split columns for more data: split http://host:port/path into [Host, Port, Path]
      • Merge columns for identifiers: merge Firstname, Lastname into Name; merge State, District into FullDistrict
      • Align misaligned columns: dataset may have shifted columns
  16. FIX MISSING VALUES
      Fix missing values (with examples)
      • Set values as missing values: treat blanks, "NA", "XX", "999", etc. as missing
      • Fill missing values with a constant (e.g. zero), a column (e.g. created date defaults to updated date), a function (e.g. average of rows/columns), or external data
      • Remove missing values: delete row, delete column
      • Fill partial missing values: missing time zone, century, etc.
  17. STANDARDISE VALUES
      Standardise numbers (with examples)
      • Remove outliers: removing high and low values
      • Standardise units: lbs to kgs, m/s for speed
      • Scale values if required: fit to percentage scale
      • Standardise precision: 2.1 to 2.10
      Standardise text (with examples)
      • Remove extra characters: common prefix/suffix, leading/trailing/multiple spaces
      • Standardise case: uppercase, lowercase, Title Case, Sentence case, etc.
      • Standardise format: 23/10/16 to 2016/10/23; "Modi, Narendra" to "Narendra Modi"
  18. FIX INVALID VALUES
      Fix invalid values (with examples)
      • Encode unicode properly: CP1252 instead of UTF-8
      • Convert incorrect data types: string to number ("12,300"); string to date ("2013-Aug"); number to string (PIN Code 110001 to "110001")
      • Correct values not in list: non-existent country, PIN code
      • Correct wrong structure: phone number with over 10 digits
      • Correct values beyond range: temperature less than -273.15 °C (0 K)
      • Validate internal rules: Gross sales > Net sales; Date of delivery > Date of ordering; if Title is "Mr" then Gender is "M"
      In these cases, treat the value as "missing". Remove it, or fix it with a formula. The formula may involve the value, row, column, entire dataset, or external data.
  19. FILTER DATA
      Filter data (with examples)
      • Deduplicate data: remove identical rows; remove rows where some columns are identical
      • Filter rows: filter by segments; filter by date period
      • Filter columns: pick columns relevant to analysis
      • Aggregate data: group by required keys, aggregate the rest
  20. EXERCISE: ASSEMBLY ELECTION DATA
      SOMETHING WE DID A FEW YEARS AGO, AND IS WELL DOCUMENTED
  21. The ECI website has this data.
  22. … and, most of the data is in PDFs
  23. The PDF files have a reasonably clear structure
  24. … that translates into text that can be parsed
  25. … which, with some effort, can be converted into a structured format … and at this point, we need to start checking for errors.
  26. At this point, we start checking what’s gone wrong
      Each row here is one constituency. The number of candidates that have contested in each constituency in every year is shown as a table. You can see that some patterns emerge here.
  27. Not every spelling error is easily identifiable by the first letter
      • Parties are mis-spelt: MADMK, MAMAK, MDMK
      • Party names change: AIADMK, ADMK, ADK
      • Parties restructure: INC(I), INC
      • Constituency names mis-spelt: BHADRACHALAM, BHADRACHELAM, BHADRAHCALAM
  28. Fortunately, large scale data itself can provide a solution
  29. … with modern tools that support machine learning
  30. DATA WRANGLING: FIND, LOAD, CLEAN
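
A few items from the FIX MISSING VALUES, FIX INVALID VALUES and FILTER DATA slides can be sketched in plain Python. This is an illustration under stated assumptions, not the presenter's code: the `sales` and `temp_c` column names, the helper names and the sample rows are hypothetical. The missing-value markers and the "12,300" example come from the slides.

```python
# Markers the slides suggest treating as missing (blank, "NA", "XX", "999").
MISSING_MARKERS = {"", "NA", "XX", "999"}

ABSOLUTE_ZERO_C = -273.15  # no valid temperature lies below this


def clean_value(raw):
    """Return None for a missing marker, otherwise the stripped string."""
    text = raw.strip()
    return None if text in MISSING_MARKERS else text


def to_number(text):
    """Convert a string such as "12,300" to a float; None stays None."""
    if text is None:
        return None
    return float(text.replace(",", ""))


def clean_rows(rows):
    """Clean dict rows with hypothetical 'sales' and 'temp_c' columns."""
    seen = set()
    cleaned = []
    for row in rows:
        sales = to_number(clean_value(row["sales"]))
        temp = to_number(clean_value(row["temp_c"]))
        # A value beyond the valid range is treated as missing.
        if temp is not None and temp < ABSOLUTE_ZERO_C:
            temp = None
        record = (sales, temp)
        if record in seen:  # deduplicate identical rows
            continue
        seen.add(record)
        cleaned.append({"sales": sales, "temp_c": temp})
    return cleaned


# Hypothetical sample data, invented for this sketch.
sample_rows = [
    {"sales": "12,300", "temp_c": "21.5"},
    {"sales": " NA ", "temp_c": "-300"},
    {"sales": "12,300", "temp_c": "21.5"},
]
print(clean_rows(sample_rows))
```

Treating "999" as missing also discards any genuine value of 999, so the marker set should be chosen per column rather than applied blindly.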
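
The election-data slides note that the volume of data itself can help resolve mis-spelt names. One way to read that idea, sketched here as an assumption and not as the method actually used in the project, is to treat the most frequent spelling as canonical and map rare, similar-looking spellings onto it. The sketch uses `difflib` from the Python standard library as a simple stand-in for the machine learning tools the slides mention. The function name, the 0.85 similarity cutoff and the occurrence counts are invented for illustration; the constituency spellings are from the slides.

```python
from collections import Counter
from difflib import get_close_matches


def canonical_spellings(names, cutoff=0.85):
    """Map each spelling to the most frequent similar spelling in the data."""
    counts = Counter(names)
    # Visit frequent spellings first: they are assumed to be correct.
    ranked = [name for name, _ in counts.most_common()]
    canonical = []
    mapping = {}
    for name in ranked:
        match = get_close_matches(name, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[name] = match[0]  # rare variant of a known spelling
        else:
            canonical.append(name)    # new canonical spelling
            mapping[name] = name
    return mapping


# Occurrence counts are made up; only the spellings come from the slides.
observed = (
    ["BHADRACHALAM"] * 5
    + ["BHADRACHELAM", "BHADRAHCALAM"]
    + ["KHAMMAM"] * 2  # hypothetical unrelated name
)
for variant, target in canonical_spellings(observed).items():
    print(variant, "->", target)
```

String similarity alone cannot tell a typo from a genuinely different name. On the slides, MDMK is a party name while MADMK is a mis-spelling, so any mapping produced this way needs a manual review before it is applied.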
