SlideShare a Scribd company logo
1 of 36
Cleaning and Exploring
Data
Datascience session 5
Lab 5: your 5-7 things
Data cleaning
Basic data cleaning with Python
Using OpenRefine
Exploring Data
The Pandas library
The Seaborn library
The R language
Data Cleaning
Algorithms want their data to be:
Machine-readable
Consistent format (e.g. text is all lowercase)
Consistent labels (e.g. use M/F, Male/Female, 0/1/2, but not *all* of these)
No whitespace hiding in number or text cells
No junk characters
No strange outliers (e.g. 200 year old living people)
In vectors and matrices
Normalised
Cleaning with Python
Cleaning Strings
Removing capitals and whitespace:
mystring = " CApiTalIsaTion Sucks "
mystring.lower().strip()
original text is - CApiTalIsaTion Sucks -
lowercased text is - capitalisation sucks -
Text without whitespace is -capitalisation sucks-
Regular Expressions: repeated spaces
There’s a repeated space in capitalisation sucks
import re
re.sub(r's', '.', 'this is a string')
re.sub(r's+', '.', 'this is a string')
'this.is..a.string'
'this.is.a.string'
Regular Expressions: junk
import re
string1 = “This is a! sentence&& with junk!@“
cleanstring1 = re.sub(r'[^w ]', '', string1)
This is a sentence with junk
Converting Date/Times
European vs American? Name of month vs number? Python comes with a bunch
of date reformatting libraries that can convert between these. For example:
import datetime
date_string = “14/03/48"
datetime.datetime.strptime(date_string, ‘%m/%d/%y').strftime('%m/%d/%Y')
Cleaning with Open Refine
Our input file
Getting started
Inputting data
Cleaning up the import
The imported data
Cleaning up columns
Facets
Exploring Data
Exploring Data
Eyeball your data
Plot your data - visually look for trends and outliers
Get the basics statistics (mean, sd etc) of your data
Create pivot tables to help understand how columns interact
Do more cleaning if you need to (e.g. those outliers)
Exploring with Pandas
Reading in data files with Pandas
read_csv
read_excel
read_sql
read_json
read_html
read_stata
read_clipboard
Eyeballing rows
How many rows are there in this dataset?
len(df)
What do my data rows look like?
df.head(5)
df.tail()
df[10:20]
Eyeballing columns
What’s in these columns?
df[‘sourceid’]
df[[‘sourceid’,’ag12a_01','ag12a_02_2']]
What’s in the columns when these are true?
df[df.ag12a_01 == ‘YES’]
df[(df.ag12a_01 == 'YES') & (df.ag12a_02_1 == 'NO')]
Summarising columns
What are my column names and types?
df.columns
df.dtypes
Which labels do I have in this column?
df['ag12a_03'].unique()
df['ag12a_03'].value_counts()
Pivot Tables: Combining data from one dataframe
● pd.pivot_table(df, index=[‘sourceid’, ‘ag12a_03’])
Merge: Combining data from multiple frames
longnames = pd.DataFrame({ 'country' : pd.Series(['United States of America', 'Zaire', 'Egypt']),
'longname' : pd.Series([True, True, False])})
merged_data = pd.merge(
left=popstats,
right=longnames,
left_on='Country/territory of residence',
right_on='country')
merged_data[['Year', 'Country/territory of residence', 'longname', 'Total population', 'Origin / Returned from']]
Left Joins: Keep everything from the left table…
longnames = pd.DataFrame({ 'country' : pd.Series(['United States of America', 'Zaire', 'Egypt']),
'longname' : pd.Series([True, True, False])})
merged_data = pd.merge(
left=popstats,
right=longnames,
how='left',
left_on='Country/territory of residence',
right_on='country')
merged_data[['Year', 'Country/territory of residence', 'longname', 'Total population', 'Origin / Returned from']]
Normalising
Use pd.stack()
The Seaborn Library
The Iris dataset
import seaborn as sns
iris = sns.load_dataset('iris')
Visualising Iris data with Seaborn
sns.pairplot(iris, hue='species', size=2)
Exploring with R
R
Matrix analysis (similar to Pandas)
Good at:
Rapid statistical analysis (4000+ R libraries)
Rapidly-created static graphics
Not so good at:
Non-statistical things (e.g. GIS data analysis)
Running R code
● Running R files:
○ From the terminal window: “R <myscript.r —no-save”
○ From inside another R program: source('myscript.r')
● Writing your own R code:
○ iPython notebooks: create “R” notebook (instead of python3)
○ Terminal window: type “r” (and “q()” to quit)
Exercises
Code
Try running the Python and R code in the 5.x set of notebooks

More Related Content

What's hot (16)

Python language data types
Python language data typesPython language data types
Python language data types
 
python codes
python codespython codes
python codes
 
Biopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and OutlookBiopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and Outlook
 
PYTHON -Chapter 2 - Functions, Exception, Modules and Files -MAULIK BOR...
PYTHON -Chapter 2 - Functions,   Exception, Modules  and    Files -MAULIK BOR...PYTHON -Chapter 2 - Functions,   Exception, Modules  and    Files -MAULIK BOR...
PYTHON -Chapter 2 - Functions, Exception, Modules and Files -MAULIK BOR...
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
 
Programming in Python
Programming in Python Programming in Python
Programming in Python
 
Python advance
Python advancePython advance
Python advance
 
Python basics
Python basicsPython basics
Python basics
 
An Intro to Python in 30 minutes
An Intro to Python in 30 minutesAn Intro to Python in 30 minutes
An Intro to Python in 30 minutes
 
Python in 90 minutes
Python in 90 minutesPython in 90 minutes
Python in 90 minutes
 
Python :variable types
Python :variable typesPython :variable types
Python :variable types
 
Python Basics
Python Basics Python Basics
Python Basics
 
Arrays C#
Arrays C#Arrays C#
Arrays C#
 
AmI 2015 - Python basics
AmI 2015 - Python basicsAmI 2015 - Python basics
AmI 2015 - Python basics
 
Learn python - for beginners - part-2
Learn python - for beginners - part-2Learn python - for beginners - part-2
Learn python - for beginners - part-2
 
Python basics
Python basicsPython basics
Python basics
 

Viewers also liked

PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet Pôle Systematic Paris-Region
 
H2O World - Munging, modeling, and pipelines using Python - Hank Roark
H2O World - Munging, modeling, and pipelines using Python - Hank RoarkH2O World - Munging, modeling, and pipelines using Python - Hank Roark
H2O World - Munging, modeling, and pipelines using Python - Hank RoarkSri Ambati
 
Pandas/Data Analysis at Baypiggies
Pandas/Data Analysis at BaypiggiesPandas/Data Analysis at Baypiggies
Pandas/Data Analysis at BaypiggiesAndy Hayden
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandasmaikroeder
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and developmentWes McKinney
 
Katharine Jarmul, Founder at Kjamistan - "Learn Data Wrangling with Python"
Katharine Jarmul, Founder at Kjamistan - "Learn Data Wrangling with Python"Katharine Jarmul, Founder at Kjamistan - "Learn Data Wrangling with Python"
Katharine Jarmul, Founder at Kjamistan - "Learn Data Wrangling with Python"Dataconomy Media
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data AnalysisAndrew Henshaw
 
Intro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LAIntro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LASri Ambati
 
Build Your Own Recommendation Engine
Build Your Own Recommendation EngineBuild Your Own Recommendation Engine
Build Your Own Recommendation EngineSri Ambati
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and StatisticsWes McKinney
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Cloudera, Inc.
 
Enabling the Connected Car Revolution

Enabling the Connected Car Revolution
Enabling the Connected Car Revolution

Enabling the Connected Car Revolution
Cloudera, Inc.
 

Viewers also liked (18)

PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
 
Python projects
Python projectsPython projects
Python projects
 
H2O World - Munging, modeling, and pipelines using Python - Hank Roark
H2O World - Munging, modeling, and pipelines using Python - Hank RoarkH2O World - Munging, modeling, and pipelines using Python - Hank Roark
H2O World - Munging, modeling, and pipelines using Python - Hank Roark
 
Pre processing
Pre processingPre processing
Pre processing
 
Pre processing big data
Pre processing big dataPre processing big data
Pre processing big data
 
Pandas/Data Analysis at Baypiggies
Pandas/Data Analysis at BaypiggiesPandas/Data Analysis at Baypiggies
Pandas/Data Analysis at Baypiggies
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandas
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
 
Katharine Jarmul, Founder at Kjamistan - "Learn Data Wrangling with Python"
Katharine Jarmul, Founder at Kjamistan - "Learn Data Wrangling with Python"Katharine Jarmul, Founder at Kjamistan - "Learn Data Wrangling with Python"
Katharine Jarmul, Founder at Kjamistan - "Learn Data Wrangling with Python"
 
Data Analysis With Pandas
Data Analysis With PandasData Analysis With Pandas
Data Analysis With Pandas
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data Analysis
 
R vs Python vs SAS
R vs Python vs SASR vs Python vs SAS
R vs Python vs SAS
 
Intro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LAIntro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LA
 
Build Your Own Recommendation Engine
Build Your Own Recommendation EngineBuild Your Own Recommendation Engine
Build Your Own Recommendation Engine
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr

 
Enabling the Connected Car Revolution

Enabling the Connected Car Revolution
Enabling the Connected Car Revolution

Enabling the Connected Car Revolution

 

Similar to Session 05 cleaning and exploring

Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programaciónSoftware Guru
 
pandas dataframe notes.pdf
pandas dataframe notes.pdfpandas dataframe notes.pdf
pandas dataframe notes.pdfAjeshSurejan2
 
python beginner talk slide
python beginner talk slidepython beginner talk slide
python beginner talk slidejonycse
 
Pandas Dataframe reading data Kirti final.pptx
Pandas Dataframe reading data  Kirti final.pptxPandas Dataframe reading data  Kirti final.pptx
Pandas Dataframe reading data Kirti final.pptxKirti Verma
 
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptxfINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptxdataKarthik
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming languageJulian Hyde
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Guy Lebanon
 
Unit 3_Numpy_Vsp.pptx
Unit 3_Numpy_Vsp.pptxUnit 3_Numpy_Vsp.pptx
Unit 3_Numpy_Vsp.pptxprakashvs7
 
Python Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayPython Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayUtkarsh Sengar
 
XII - 2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdf
XII -  2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdfXII -  2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdf
XII - 2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdfKrishnaJyotish1
 
Get started with R lang
Get started with R langGet started with R lang
Get started with R langsenthil0809
 
Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)Roy Zimmer
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxParveenShaik21
 
20130215 Reading data into R
20130215 Reading data into R20130215 Reading data into R
20130215 Reading data into RKazuki Yoshida
 

Similar to Session 05 cleaning and exploring (20)

Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programación
 
pandas dataframe notes.pdf
pandas dataframe notes.pdfpandas dataframe notes.pdf
pandas dataframe notes.pdf
 
Lecture 9.pptx
Lecture 9.pptxLecture 9.pptx
Lecture 9.pptx
 
python beginner talk slide
python beginner talk slidepython beginner talk slide
python beginner talk slide
 
DS_PPT.pptx
DS_PPT.pptxDS_PPT.pptx
DS_PPT.pptx
 
Pandas Dataframe reading data Kirti final.pptx
Pandas Dataframe reading data  Kirti final.pptxPandas Dataframe reading data  Kirti final.pptx
Pandas Dataframe reading data Kirti final.pptx
 
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptxfINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming language
 
Control statements
Control statementsControl statements
Control statements
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)
 
Unit 3_Numpy_Vsp.pptx
Unit 3_Numpy_Vsp.pptxUnit 3_Numpy_Vsp.pptx
Unit 3_Numpy_Vsp.pptx
 
Python Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayPython Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard Way
 
Data Management in Python
Data Management in PythonData Management in Python
Data Management in Python
 
XII - 2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdf
XII -  2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdfXII -  2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdf
XII - 2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdf
 
Get started with R lang
Get started with R langGet started with R lang
Get started with R lang
 
Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 
Arrays in c
Arrays in cArrays in c
Arrays in c
 
20130215 Reading data into R
20130215 Reading data into R20130215 Reading data into R
20130215 Reading data into R
 
interenship.pptx
interenship.pptxinterenship.pptx
interenship.pptx
 

More from bodaceacat

CansecWest2019: Infosec Frameworks for Misinformation
CansecWest2019: Infosec Frameworks for MisinformationCansecWest2019: Infosec Frameworks for Misinformation
CansecWest2019: Infosec Frameworks for Misinformationbodaceacat
 
2019 11 terp_breuer_disclosure_master
2019 11 terp_breuer_disclosure_master2019 11 terp_breuer_disclosure_master
2019 11 terp_breuer_disclosure_masterbodaceacat
 
Terp breuer misinfosecframeworks_cansecwest2019
Terp breuer misinfosecframeworks_cansecwest2019Terp breuer misinfosecframeworks_cansecwest2019
Terp breuer misinfosecframeworks_cansecwest2019bodaceacat
 
Misinfosec frameworks Cansecwest 2019
Misinfosec frameworks Cansecwest 2019Misinfosec frameworks Cansecwest 2019
Misinfosec frameworks Cansecwest 2019bodaceacat
 
Sjterp ds_of_misinfo_feb_2019
Sjterp ds_of_misinfo_feb_2019Sjterp ds_of_misinfo_feb_2019
Sjterp ds_of_misinfo_feb_2019bodaceacat
 
Practical Influence Operations, presentation at Sofwerx Dec 2018
Practical Influence Operations, presentation at Sofwerx Dec 2018Practical Influence Operations, presentation at Sofwerx Dec 2018
Practical Influence Operations, presentation at Sofwerx Dec 2018bodaceacat
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger databodaceacat
 
Session 09 learning relationships.pptx
Session 09 learning relationships.pptxSession 09 learning relationships.pptx
Session 09 learning relationships.pptxbodaceacat
 
Session 08 geospatial data
Session 08 geospatial dataSession 08 geospatial data
Session 08 geospatial databodaceacat
 
Session 06 machine learning.pptx
Session 06 machine learning.pptxSession 06 machine learning.pptx
Session 06 machine learning.pptxbodaceacat
 
Session 04 communicating results
Session 04 communicating resultsSession 04 communicating results
Session 04 communicating resultsbodaceacat
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring databodaceacat
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectbodaceacat
 
Gp technologybuilds july2011
Gp technologybuilds july2011Gp technologybuilds july2011
Gp technologybuilds july2011bodaceacat
 
Gp technologybuilds july2011
Gp technologybuilds july2011Gp technologybuilds july2011
Gp technologybuilds july2011bodaceacat
 
Ardrone represent
Ardrone representArdrone represent
Ardrone representbodaceacat
 
Global pulse app connection manager
Global pulse app connection managerGlobal pulse app connection manager
Global pulse app connection managerbodaceacat
 
Un Pulse Camp - Humanitarian Innovation
Un Pulse Camp - Humanitarian InnovationUn Pulse Camp - Humanitarian Innovation
Un Pulse Camp - Humanitarian Innovationbodaceacat
 
Blue light services
Blue light servicesBlue light services
Blue light servicesbodaceacat
 
Rhok and opendata hackathon intro
Rhok and opendata hackathon introRhok and opendata hackathon intro
Rhok and opendata hackathon introbodaceacat
 

More from bodaceacat (20)

CansecWest2019: Infosec Frameworks for Misinformation
CansecWest2019: Infosec Frameworks for MisinformationCansecWest2019: Infosec Frameworks for Misinformation
CansecWest2019: Infosec Frameworks for Misinformation
 
2019 11 terp_breuer_disclosure_master
2019 11 terp_breuer_disclosure_master2019 11 terp_breuer_disclosure_master
2019 11 terp_breuer_disclosure_master
 
Terp breuer misinfosecframeworks_cansecwest2019
Terp breuer misinfosecframeworks_cansecwest2019Terp breuer misinfosecframeworks_cansecwest2019
Terp breuer misinfosecframeworks_cansecwest2019
 
Misinfosec frameworks Cansecwest 2019
Misinfosec frameworks Cansecwest 2019Misinfosec frameworks Cansecwest 2019
Misinfosec frameworks Cansecwest 2019
 
Sjterp ds_of_misinfo_feb_2019
Sjterp ds_of_misinfo_feb_2019Sjterp ds_of_misinfo_feb_2019
Sjterp ds_of_misinfo_feb_2019
 
Practical Influence Operations, presentation at Sofwerx Dec 2018
Practical Influence Operations, presentation at Sofwerx Dec 2018Practical Influence Operations, presentation at Sofwerx Dec 2018
Practical Influence Operations, presentation at Sofwerx Dec 2018
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
 
Session 09 learning relationships.pptx
Session 09 learning relationships.pptxSession 09 learning relationships.pptx
Session 09 learning relationships.pptx
 
Session 08 geospatial data
Session 08 geospatial dataSession 08 geospatial data
Session 08 geospatial data
 
Session 06 machine learning.pptx
Session 06 machine learning.pptxSession 06 machine learning.pptx
Session 06 machine learning.pptx
 
Session 04 communicating results
Session 04 communicating resultsSession 04 communicating results
Session 04 communicating results
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science project
 
Gp technologybuilds july2011
Gp technologybuilds july2011Gp technologybuilds july2011
Gp technologybuilds july2011
 
Gp technologybuilds july2011
Gp technologybuilds july2011Gp technologybuilds july2011
Gp technologybuilds july2011
 
Ardrone represent
Ardrone representArdrone represent
Ardrone represent
 
Global pulse app connection manager
Global pulse app connection managerGlobal pulse app connection manager
Global pulse app connection manager
 
Un Pulse Camp - Humanitarian Innovation
Un Pulse Camp - Humanitarian InnovationUn Pulse Camp - Humanitarian Innovation
Un Pulse Camp - Humanitarian Innovation
 
Blue light services
Blue light servicesBlue light services
Blue light services
 
Rhok and opendata hackathon intro
Rhok and opendata hackathon introRhok and opendata hackathon intro
Rhok and opendata hackathon intro
 

Recently uploaded

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 

Recently uploaded (20)

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 

Session 05 cleaning and exploring

  • 2. Lab 5: your 5-7 things Data cleaning Basic data cleaning with Python Using OpenRefine Exploring Data The Pandas library The Seaborn library The R language
  • 4. Algorithms want their data to be: Machine-readable Consistent format (e.g. text is all lowercase) Consistent labels (e.g. use M/F, Male/Female, 0/1/2, but not *all* of these) No whitespace hiding in number or text cells No junk characters No strange outliers (e.g. 200 year old living people) In vectors and matrices Normalised
  • 6. Cleaning Strings Removing capitals and whitespace: mystring = " CApiTalIsaTion Sucks " mystring.lower().strip() original text is - CApiTalIsaTion Sucks - lowercased text is - capitalisation sucks - Text without whitespace is -capitalisation sucks-
  • 7. Regular Expressions: repeated spaces There’s a repeated space in capitalisation sucks import re re.sub(r's', '.', 'this is a string') re.sub(r's+', '.', 'this is a string') 'this.is..a.string' 'this.is.a.string'
  • 8. Regular Expressions: junk import re string1 = “This is a! sentence&& with junk!@“ cleanstring1 = re.sub(r'[^w ]', '', string1) This is a sentence with junk
  • 9. Converting Date/Times European vs American? Name of month vs number? Python comes with a bunch of date reformatting libraries that can convert between these. For example: import datetime date_string = “14/03/48" datetime.datetime.strptime(date_string, ‘%m/%d/%y').strftime('%m/%d/%Y')
  • 14. Cleaning up the import
  • 19. Exploring Data Eyeball your data Plot your data - visually look for trends and outliers Get the basics statistics (mean, sd etc) of your data Create pivot tables to help understand how columns interact Do more cleaning if you need to (e.g. those outliers)
  • 21. Reading in data files with Pandas read_csv read_excel read_sql read_json read_html read_stata read_clipboard
  • 22. Eyeballing rows How many rows are there in this dataset? len(df) What do my data rows look like? df.head(5) df.tail() df[10:20]
  • 23. Eyeballing columns What’s in these columns? df[‘sourceid’] df[[‘sourceid’,’ag12a_01','ag12a_02_2']] What’s in the columns when these are true? df[df.ag12a_01 == ‘YES’] df[(df.ag12a_01 == 'YES') & (df.ag12a_02_1 == 'NO')]
  • 24. Summarising columns What are my column names and types? df.columns df.dtypes Which labels do I have in this column? df['ag12a_03'].unique() df['ag12a_03'].value_counts()
  • 25. Pivot Tables: Combining data from one dataframe ● pd.pivot_table(df, index=[‘sourceid’, ‘ag12a_03’])
  • 26. Merge: Combining data from multiple frames longnames = pd.DataFrame({ 'country' : pd.Series(['United States of America', 'Zaire', 'Egypt']), 'longname' : pd.Series([True, True, False])}) merged_data = pd.merge( left=popstats, right=longnames, left_on='Country/territory of residence', right_on='country') merged_data[['Year', 'Country/territory of residence', 'longname', 'Total population', 'Origin / Returned from']]
  • 27. Left Joins: Keep everything from the left table… longnames = pd.DataFrame({ 'country' : pd.Series(['United States of America', 'Zaire', 'Egypt']), 'longname' : pd.Series([True, True, False])}) merged_data = pd.merge( left=popstats, right=longnames, how='left', left_on='Country/territory of residence', right_on='country') merged_data[['Year', 'Country/territory of residence', 'longname', 'Total population', 'Origin / Returned from']]
  • 30. The Iris dataset import seaborn as sns iris = sns.load_dataset('iris')
  • 31. Visualising Iris data with Seaborn sns.pairplot(iris, hue='species', size=2)
  • 33. R Matrix analysis (similar to Pandas) Good at: Rapid statistical analysis (4000+ R libraries) Rapidly-created static graphics Not so good at: Non-statistical things (e.g. GIS data analysis)
  • 34. Running R code ● Running R files: ○ From the terminal window: “R <myscript.r —no-save” ○ From inside another R program: source('myscript.r') ● Writing your own R code: ○ iPython notebooks: create “R” notebook (instead of python3) ○ Terminal window: type “r” (and “q()” to quit)
  • 36. Code Try running the Python and R code in the 5.x set of notebooks

Editor's Notes

  1. We’re talking today about cleaning and exploring data. What we’re really talking about is making friends with your data; understanding it yourself before you run any algorithms (e.g. machine learning algorithms) on it. We do this because a) it’s hard to run algorithms on badly-formatted data, and b) discovering data issues when you’re trying to train a classifier sucks - you have enough on your hands without dealing with outliers and spelling errors too.
  2. Data cleaning is the process of removing errors (spelling mistakes, extraneous characters, corrupted data etc) from datafiles, to prepare them for use in algorithms and visualisation. Data cleaning is sometimes also called data scrubbing or cleansing.
  3. Normalised: each datapoint has its own row in the data matrix.
  4. Basic text cleaning We'll spend a lot of time cleaning up text. Mostly this is because: A) although you see 'Capital' and 'capital' as the same words, an algorithm will see these as different because of the capital letter in one of them B) people leave a lot of invisible characters in their data (NB they do this to string representations of numerical data too - and many spreadsheet programs will store numbers as strings) - this is known as "whitespace", and can really mess up your day because "whitespace" and "whitespace " look the same to you, but an algorithm will see as different. In the example, lower() converts a string into lowercase (upper() converts it into uppercase, but the convention in data science is to use all-lowercase, probably because it's less shouty to read), and strip() removes any whitespace before the first character, and after the last character (a-z etc) in the string.
  5. Use the RE (regular expression) library to clean up strings. To use a regular expression (aka RegEx), you need to specify two patterns: one input pattern that will be applied repeatedly to the input text, and one output pattern that will be used in place of anything that matches the input pattern. Regular expressions can be very powerful and can take a while to learn, but here are a couple of patterns that you’ll probably find helpful at some point.
  6. ^\w = everything that isn’t a character or number. [] = a group of possible characters, e.g. [^\w ] = alphanumeric plus space. \s+,\s+ = one or more spaces followed by a comma then one or more spaces
  7. More about date formats in https://docs.python.org/3/library/datetime.html
  8. Open refine is a powerful data cleaning tool. It doesn’t do the cleaning for you, but it does make doing repeated tasks much much easier.
  9. This is file 2009_2013_popstats_PSQ_POC.csv in directory notebooks/example_data
  10. First, click on the OpenRefine icon. This will start a webpage for you, at URL 127.0.0.1:3333 Click on create project, then select a file. Click “next”.
  11. I’ve selected file Notebooks/example_data/2009_2013_popstats_PSQ_POC.csv Now I can see a preview of the data as it will be fed into the system, and a set of buttons for changing that import. And it’s a mess. OpenRefine has put all the data into one column - it’s ignored the commas that separate columns, and it’s used the first row (which is a comment about the file) as the column headings.
  12. Fortunately, OpenRefine has ways to start cleaning as you import the file. Here, we’ve selected “commas” as the column separators, and told OpenRefine to ignore the first 4 lines in the file. Now all we’ve got left to do is to clean up those annoying “*”s. First, click on “create project”.
  13. Here’s your data. You can do many things with this: clean text en-masse, move columns around (or add and delete columns), split or merge data in columns. We’ll play with just a few of these things.
  14. If you right-click on a cell that you want to change, a little blue “edit” box will appear. Click on this, then edit the box that appears. Now a powerful thing happens. You can apply whatever you did to that cell, to all other identical cells in this column. So if I want to remove cells with ‘*’ in them, I click on one of them, edit out the star, then click on “apply to all identical cells”.
  15. Facets are also really powerful ways to look at and edit the data in cells. For instance, if you click on the arrow next to “column 3”, you’ll get a popup menu. Click on “facet” then “text facet”, and a facet box will appear on the left side of the screen. Now, if you want to change every instance of “Viet Nam” to “Vietnam”, you just need to edit the text here (hover over the word “Viet Nam” and an “edit” box will appear. If you have really messy inputs, you can cluster them (see the “cluster” button?) into similar fields, and assign the text that should replace everything in the cluster. This is a useful way to deal with spelling variations, misspellings, spaces etc. When you’re finished, click “export” (top right side of the page) to write your data out to CSV etc. I’ve just shown you some of OpenRefine’s power. For example, there are a bunch of OpenRefine recipes at https://github.com/OpenRefine/OpenRefine/wiki/Recipes
  16. Get to know your dataset, before you ask a machine to understand it.
  17. Pandas is a Python data analysis library.
  18. Read_sql reads database files; read_html reads tables from html pages; read_clipboard reads from your PC’s clipboard. Beware: if you have more columns of data than you have column headings, Pandas read_csv can fail. If this happens, there are lots of optional parameters to read_csv that can help, but in practice it’s better to feed in clean data.
  19. df[k].unique()
  20. NB df.describe will only find mean and SD for numerical columns - which is reasonable.
  21. This is very similar to the pivot tables in Excel. More at http://pbpython.com/pandas-pivot-table-explained.html Those NaNs? “Not a numbers”.
  22. Pandas merge defaults to an ‘inner join’: only keep the rows where data exists in *both* tables. See e.g. http://www.datacarpentry.org/python-ecology/04-merging-data
  23. Left join: keep all rows from the first table; combine rows where you can, put “NaN”s in the rows when you can’t. Some great visuals about joins: http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
  24. You can normalise your tables in Pandas by using the stack function - see e.g. http://pandas.pydata.org/pandas-docs/stable/reshaping.html Image: UNICEF state of the world’s children report.
  25. You might want to do “sns.set()” before plotting…
  26. You can also run R from Python code, using the rpy2 library.