Cleaning and Exploring Data
Datascience session 5
Lab 5: your 5-7 things
Data cleaning
Basic data cleaning with Python
Using OpenRefine
Exploring Data
The Pandas library
The Seaborn library
The R language
Data Cleaning
Algorithms want their data to be:
Machine-readable
Consistent format (e.g. text is all lowercase)
Consistent labels (e.g. use M/F, Male/Female, 0/1/2, but not *all* of these)
No whitespace hiding in number or text cells
No junk characters
No strange outliers (e.g. 200 year old living people)
In vectors and matrices
Normalised
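As a minimal sketch of the "consistent labels" point above, the snippet below maps several spellings of a label onto one code. The mapping table and the sample values are made up for illustration and do not come from the session's datasets.

```python
# Hypothetical mapping from raw label variants to one consistent code.
LABEL_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}

def normalise_label(raw):
    """Return the consistent code for a raw label, or None if it is unknown."""
    key = str(raw).strip().lower()  # remove hidden whitespace, ignore case
    return LABEL_MAP.get(key)

raw_labels = ["M", " male", "Female ", "f", "unknown"]
clean_labels = [normalise_label(x) for x in raw_labels]
print(clean_labels)
```

Returning None for unknown labels keeps the problem rows visible so they can be checked by hand rather than silently guessed.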
Cleaning with Python
Cleaning Strings
Removing capitals and whitespace:
mystring = " CApiTalIsaTion Sucks "
mystring.lower().strip()
original text is - CApiTalIsaTion Sucks -
lowercased text is - capitalisation sucks -
Text without whitespace is -capitalisation sucks-
Regular Expressions: repeated spaces
There’s a repeated space in capitalisation sucks
import re
re.sub(r'\s', '.', 'this is  a string')
re.sub(r'\s+', '.', 'this is  a string')
'this.is..a.string'
'this.is.a.string'
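The lowercase, strip and repeated-space steps can be combined into one helper. This is a sketch only; the function name `tidy_text` is an assumption, not something from the slides.

```python
import re

def tidy_text(text):
    """Lowercase the text, collapse runs of whitespace, and trim both ends."""
    collapsed = re.sub(r"\s+", " ", text)  # any run of whitespace becomes one space
    return collapsed.lower().strip()

cleaned = tidy_text("  CApiTalIsaTion   Sucks ")
print(cleaned)
```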
Regular Expressions: junk
import re
string1 = "This is a! sentence&& with junk!@"
cleanstring1 = re.sub(r'[^\w ]', '', string1)
This is a sentence with junk
Converting Date/Times
European vs American? Name of month vs number? Python comes with a bunch
of date reformatting libraries that can convert between these. For example:
import datetime
date_string = "14/03/48"
datetime.datetime.strptime(date_string, '%d/%m/%y').strftime('%m/%d/%Y')
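When a column mixes European and American dates, one possible approach is to try a list of formats in turn. The sketch below assumes four-digit years and a hypothetical list of candidate formats; a date such as 03/04/2016 is genuinely ambiguous and is read using whichever format comes first in the list.

```python
import datetime

# Candidate formats are an assumption for illustration; order decides ambiguous cases.
CANDIDATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d"]

def to_iso_date(date_string, formats=CANDIDATE_FORMATS):
    """Return the date as YYYY-MM-DD, or None if no candidate format matches."""
    for fmt in formats:
        try:
            parsed = datetime.datetime.strptime(date_string.strip(), fmt)
        except ValueError:
            continue  # this format did not fit, try the next one
        return parsed.strftime("%Y-%m-%d")
    return None

print(to_iso_date("14/03/2016"))
print(to_iso_date("03/14/2016"))
print(to_iso_date("not a date"))
```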
Cleaning with Open Refine
Our input file
Getting started
Inputting data
Cleaning up the import
The imported data
Cleaning up columns
Facets
Exploring Data
Exploring Data
Eyeball your data
Plot your data - visually look for trends and outliers
Get the basic statistics (mean, standard deviation, etc.) of your data
Create pivot tables to help understand how columns interact
Do more cleaning if you need to (e.g. those outliers)
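A small sketch of the "basic statistics" and "outliers" steps, using only the standard library. The ages and the 120-year plausibility limit are invented for illustration.

```python
import statistics

ages = [23, 35, 41, 29, 200]  # made-up sample containing one implausible value

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
sd_age = statistics.stdev(ages)  # sample standard deviation

# Flag values outside a plausible range (the limit is an assumption).
suspect = [a for a in ages if a > 120]

print(mean_age, median_age, round(sd_age, 1), suspect)
```

The gap between the mean and the median is a quick hint that an outlier is pulling the mean upwards.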
Exploring with Pandas
Reading in data files with Pandas
read_csv
read_excel
read_sql
read_json
read_html
read_stata
read_clipboard
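As a sketch of how these readers are called, the example below uses `pd.read_csv` on an in-memory string so it runs without a file on disk. The column names echo the ones used later in the slides, but the values are invented.

```python
import io

import pandas as pd

# In-memory CSV text standing in for a real file (values are made up).
csv_text = "sourceid,ag12a_01\n1,YES\n2,NO\n3,YES\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```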
Eyeballing rows
How many rows are there in this dataset?
len(df)
What do my data rows look like?
df.head(5)
df.tail()
df[10:20]
Eyeballing columns
What’s in these columns?
df['sourceid']
df[['sourceid', 'ag12a_01', 'ag12a_02_2']]
What’s in the columns when these are true?
df[df.ag12a_01 == 'YES']
df[(df.ag12a_01 == 'YES') & (df.ag12a_02_1 == 'NO')]
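A runnable sketch of the same kind of row filtering on a tiny, made-up dataframe. Each condition needs its own parentheses because `&` and `|` bind more tightly than `==`.

```python
import pandas as pd

# Invented rows that reuse the column names from the slides.
df = pd.DataFrame({
    "sourceid": [1, 2, 3, 4],
    "ag12a_01": ["YES", "NO", "YES", "YES"],
    "ag12a_02_1": ["NO", "NO", "YES", "NO"],
})

# Rows where both conditions hold.
both = df[(df["ag12a_01"] == "YES") & (df["ag12a_02_1"] == "NO")]

# Rows where at least one column is 'YES'.
either = df[(df["ag12a_01"] == "YES") | (df["ag12a_02_1"] == "YES")]

print(both)
print(either)
```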
Summarising columns
What are my column names and types?
df.columns
df.dtypes
Which labels do I have in this column?
df['ag12a_03'].unique()
df['ag12a_03'].value_counts()
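`value_counts()` is also a quick way to spot inconsistent labels. The sketch below, on invented values, counts the raw labels and then counts them again after stripping whitespace and uppercasing; `dropna=False` keeps missing values in the tally.

```python
import pandas as pd

col = pd.Series(["YES", "yes ", "NO", None, "YES"])  # made-up labels

# Raw counts treat "YES" and "yes " as different labels.
raw_counts = col.value_counts(dropna=False)

# Counts after cleaning whitespace and case.
clean_counts = col.str.strip().str.upper().value_counts(dropna=False)

print(raw_counts)
print(clean_counts)
```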
Pivot Tables: Combining data from one dataframe
● pd.pivot_table(df, index=['sourceid', 'ag12a_03'])
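A sketch of a pivot table with an explicit value column and aggregation function. The `amount` column and all values are hypothetical; by default `pivot_table` averages the numeric columns, and `aggfunc` lets you state the aggregation explicitly.

```python
import pandas as pd

df = pd.DataFrame({
    "sourceid": [1, 1, 2, 2],
    "ag12a_03": ["A", "B", "A", "A"],
    "amount": [10.0, 20.0, 30.0, 50.0],  # hypothetical numeric column
})

# One row per (sourceid, ag12a_03) pair, holding the mean of 'amount'.
table = pd.pivot_table(
    df,
    index=["sourceid", "ag12a_03"],
    values="amount",
    aggfunc="mean",
)
print(table)
```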
Merge: Combining data from multiple frames
longnames = pd.DataFrame({ 'country' : pd.Series(['United States of America', 'Zaire', 'Egypt']),
'longname' : pd.Series([True, True, False])})
merged_data = pd.merge(
left=popstats,
right=longnames,
left_on='Country/territory of residence',
right_on='country')
merged_data[['Year', 'Country/territory of residence', 'longname', 'Total population', 'Origin / Returned from']]
Left Joins: Keep everything from the left table…
longnames = pd.DataFrame({ 'country' : pd.Series(['United States of America', 'Zaire', 'Egypt']),
'longname' : pd.Series([True, True, False])})
merged_data = pd.merge(
left=popstats,
right=longnames,
how='left',
left_on='Country/territory of residence',
right_on='country')
merged_data[['Year', 'Country/territory of residence', 'longname', 'Total population', 'Origin / Returned from']]
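To see the difference between the default (inner) merge and a left join, the sketch below uses two tiny invented frames in place of `popstats` and `longnames`. The `indicator=True` option adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Tiny stand-ins for the frames on the slides (all values are made up).
popstats = pd.DataFrame({
    "Country/territory of residence": ["Egypt", "Kenya"],
    "Total population": [100, 200],
})
longnames = pd.DataFrame({
    "country": ["Egypt", "Zaire"],
    "longname": [False, True],
})

# Default inner join: only countries present in both frames survive.
inner = pd.merge(popstats, longnames,
                 left_on="Country/territory of residence", right_on="country")

# Left join: every popstats row is kept; unmatched rows get NaN for longname.
left = pd.merge(popstats, longnames, how="left", indicator=True,
                left_on="Country/territory of residence", right_on="country")

print(inner)
print(left)
```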
Normalising
Use df.stack()
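A sketch of what stacking does: a wide table with one column per year becomes a long table with one row per (country, year) pair. `stack` is a method on the dataframe itself; the country names and numbers here are invented.

```python
import pandas as pd

# Wide layout: one column per year (values are made up).
wide = pd.DataFrame({"2014": [1, 2], "2015": [3, 4]}, index=["Egypt", "Kenya"])

# stack() moves the column labels into a second index level.
long = wide.stack()

# Name the index levels and turn them back into ordinary columns.
tidy = long.rename_axis(["country", "year"]).reset_index(name="value")
print(tidy)
```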
The Seaborn Library
The Iris dataset
import seaborn as sns
iris = sns.load_dataset('iris')
Visualising Iris data with Seaborn
sns.pairplot(iris, hue='species', height=2)
Exploring with R
R
Matrix analysis (similar to Pandas)
Good at:
Rapid statistical analysis (4000+ R libraries)
Rapidly-created static graphics
Not so good at:
Non-statistical things (e.g. GIS data analysis)
Running R code
● Running R files:
○ From the terminal window: “R --no-save < myscript.r”
○ From inside another R program: source('myscript.r')
● Writing your own R code:
○ IPython (Jupyter) notebooks: create an “R” notebook (instead of python3)
○ Terminal window: type “R” (and “q()” to quit)
Exercises
Code
Try running the Python and R code in the 5.x set of notebooks

More Related Content

What's hot

Dev Concepts: Data Structures and Algorithms
Dev Concepts: Data Structures and AlgorithmsDev Concepts: Data Structures and Algorithms
Dev Concepts: Data Structures and Algorithms
Svetlin Nakov
 
Structures in c language
Structures in c languageStructures in c language
Structures in c language
tanmaymodi4
 
Computer programming(C++): Structures
Computer programming(C++): StructuresComputer programming(C++): Structures
Computer programming(C++): Structures
JishnuNath7
 
Biopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and OutlookBiopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and Outlook
Asociación Argentina de Bioinformática y Biología Computacional
 
Datatypes in python
Datatypes in pythonDatatypes in python
Datatypes in python
eShikshak
 
Basics of Python programming (part 2)
Basics of Python programming (part 2)Basics of Python programming (part 2)
Basics of Python programming (part 2)
Pedro Rodrigues
 
C arrays
C arraysC arrays
One dimensional 2
One dimensional 2One dimensional 2
One dimensional 2
Rajendran
 
Arrays in C
Arrays in CArrays in C
Arrays in C
Kamruddin Nur
 
Java chapter 6 - Arrays -syntax and use
Java chapter 6 - Arrays -syntax and useJava chapter 6 - Arrays -syntax and use
Java chapter 6 - Arrays -syntax and use
Mukesh Tekwani
 
cs8251 unit 1 ppt
cs8251 unit 1 pptcs8251 unit 1 ppt
cs8251 unit 1 ppt
praveenaprakasam
 
Concept Of C++ Data Types
Concept Of C++ Data TypesConcept Of C++ Data Types
Concept Of C++ Data Types
k v
 
Practical cats
Practical catsPractical cats
Practical cats
Raymond Tay
 
20130215 Reading data into R
20130215 Reading data into R20130215 Reading data into R
20130215 Reading data into R
Kazuki Yoshida
 
Computer programming 2 Lesson 5
Computer programming 2  Lesson 5Computer programming 2  Lesson 5
Computer programming 2 Lesson 5
MLG College of Learning, Inc
 
Is this good Python? PyCon WEB 2017 Lightning Talk
Is this good Python? PyCon WEB 2017 Lightning TalkIs this good Python? PyCon WEB 2017 Lightning Talk
Is this good Python? PyCon WEB 2017 Lightning Talk
Steffen Wenz
 
Python - variable types
Python - variable typesPython - variable types
Python - variable types
Learnbay Datascience
 
Data Types | CS8251- Programming in c | Learn Hub
Data Types | CS8251- Programming in c | Learn HubData Types | CS8251- Programming in c | Learn Hub
Data Types | CS8251- Programming in c | Learn Hub
Learn Hub
 

What's hot (18)

Dev Concepts: Data Structures and Algorithms
Dev Concepts: Data Structures and AlgorithmsDev Concepts: Data Structures and Algorithms
Dev Concepts: Data Structures and Algorithms
 
Structures in c language
Structures in c languageStructures in c language
Structures in c language
 
Computer programming(C++): Structures
Computer programming(C++): StructuresComputer programming(C++): Structures
Computer programming(C++): Structures
 
Biopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and OutlookBiopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and Outlook
 
Datatypes in python
Datatypes in pythonDatatypes in python
Datatypes in python
 
Basics of Python programming (part 2)
Basics of Python programming (part 2)Basics of Python programming (part 2)
Basics of Python programming (part 2)
 
C arrays
C arraysC arrays
C arrays
 
One dimensional 2
One dimensional 2One dimensional 2
One dimensional 2
 
Arrays in C
Arrays in CArrays in C
Arrays in C
 
Java chapter 6 - Arrays -syntax and use
Java chapter 6 - Arrays -syntax and useJava chapter 6 - Arrays -syntax and use
Java chapter 6 - Arrays -syntax and use
 
cs8251 unit 1 ppt
cs8251 unit 1 pptcs8251 unit 1 ppt
cs8251 unit 1 ppt
 
Concept Of C++ Data Types
Concept Of C++ Data TypesConcept Of C++ Data Types
Concept Of C++ Data Types
 
Practical cats
Practical catsPractical cats
Practical cats
 
20130215 Reading data into R
20130215 Reading data into R20130215 Reading data into R
20130215 Reading data into R
 
Computer programming 2 Lesson 5
Computer programming 2  Lesson 5Computer programming 2  Lesson 5
Computer programming 2 Lesson 5
 
Is this good Python? PyCon WEB 2017 Lightning Talk
Is this good Python? PyCon WEB 2017 Lightning TalkIs this good Python? PyCon WEB 2017 Lightning Talk
Is this good Python? PyCon WEB 2017 Lightning Talk
 
Python - variable types
Python - variable typesPython - variable types
Python - variable types
 
Data Types | CS8251- Programming in c | Learn Hub
Data Types | CS8251- Programming in c | Learn HubData Types | CS8251- Programming in c | Learn Hub
Data Types | CS8251- Programming in c | Learn Hub
 

Viewers also liked

Sales Leadership...Making Customers Style
Sales Leadership...Making Customers StyleSales Leadership...Making Customers Style
Sales Leadership...Making Customers Style
Mike Moore
 
Reading Comprehension Test for FMDC
Reading Comprehension Test for FMDCReading Comprehension Test for FMDC
Reading Comprehension Test for FMDC
Atiqa khan
 
PPT FOR DELICIOUS FOODS
PPT FOR DELICIOUS FOODSPPT FOR DELICIOUS FOODS
PPT FOR DELICIOUS FOODS
Mrunal Khare
 
Bigm 140316004148-phpapp02
Bigm 140316004148-phpapp02Bigm 140316004148-phpapp02
Bigm 140316004148-phpapp02
kongara
 
7.17.14 HOLDENEBONY MEDTERM REVISED4.27pm
7.17.14 HOLDENEBONY MEDTERM REVISED4.27pm7.17.14 HOLDENEBONY MEDTERM REVISED4.27pm
7.17.14 HOLDENEBONY MEDTERM REVISED4.27pm
Ebony Holden
 
0 i isem modelqp2014
0 i isem modelqp20140 i isem modelqp2014
0 i isem modelqp2014
EDIN BROW, DCE, AMET
 
Qyle 110 Tabla de tareas
Qyle 110 Tabla de tareasQyle 110 Tabla de tareas
Project design for change .doc pums kollimalai
Project design for change .doc pums kollimalaiProject design for change .doc pums kollimalai
Project design for change .doc pums kollimalai
designtn
 
11 คลื่นกล
11 คลื่นกล11 คลื่นกล
11 คลื่นกล
topofzeed
 
Cataloging Problems in Chinese Material Records of the University of Southern...
Cataloging Problems in Chinese Material Records of the University of Southern...Cataloging Problems in Chinese Material Records of the University of Southern...
Cataloging Problems in Chinese Material Records of the University of Southern...
Georgia Libraries Conference (formerly Ga COMO).
 
Question 3 sasiane evaluation
Question 3 sasiane evaluationQuestion 3 sasiane evaluation
Question 3 sasiane evaluation
sapphire29
 
Cruz roja
Cruz rojaCruz roja
Cruz roja
matiasmrossi
 
Comic1
Comic1Comic1
WISBOX Mini Brochure
WISBOX Mini BrochureWISBOX Mini Brochure
WISBOX Mini BrochureDennis WONG
 
2014.2 journal of literature and art studies
2014.2 journal of literature and art studies2014.2 journal of literature and art studies
2014.2 journal of literature and art studies
Doris Carly
 
Creating web sites using datalife engine
Creating web sites using datalife engineCreating web sites using datalife engine
Creating web sites using datalife engine
Japprend.Com
 
Rescue1.asd
Rescue1.asdRescue1.asd
Rescue1.asd
myHONforever
 
Do you live in a house or a flat?
Do you live in a house or a flat?Do you live in a house or a flat?
Do you live in a house or a flat?
عصام عمر الدسيسابي
 
NHQA Criteria 4. Performance Management
NHQA Criteria 4. Performance ManagementNHQA Criteria 4. Performance Management
NHQA Criteria 4. Performance Management
Tom Gillespie
 

Viewers also liked (20)

Sales Leadership...Making Customers Style
Sales Leadership...Making Customers StyleSales Leadership...Making Customers Style
Sales Leadership...Making Customers Style
 
Reading Comprehension Test for FMDC
Reading Comprehension Test for FMDCReading Comprehension Test for FMDC
Reading Comprehension Test for FMDC
 
PPT FOR DELICIOUS FOODS
PPT FOR DELICIOUS FOODSPPT FOR DELICIOUS FOODS
PPT FOR DELICIOUS FOODS
 
Bigm 140316004148-phpapp02
Bigm 140316004148-phpapp02Bigm 140316004148-phpapp02
Bigm 140316004148-phpapp02
 
7.17.14 HOLDENEBONY MEDTERM REVISED4.27pm
7.17.14 HOLDENEBONY MEDTERM REVISED4.27pm7.17.14 HOLDENEBONY MEDTERM REVISED4.27pm
7.17.14 HOLDENEBONY MEDTERM REVISED4.27pm
 
0 i isem modelqp2014
0 i isem modelqp20140 i isem modelqp2014
0 i isem modelqp2014
 
Qyle 110 Tabla de tareas
Qyle 110 Tabla de tareasQyle 110 Tabla de tareas
Qyle 110 Tabla de tareas
 
Project design for change .doc pums kollimalai
Project design for change .doc pums kollimalaiProject design for change .doc pums kollimalai
Project design for change .doc pums kollimalai
 
11 คลื่นกล
11 คลื่นกล11 คลื่นกล
11 คลื่นกล
 
Cataloging Problems in Chinese Material Records of the University of Southern...
Cataloging Problems in Chinese Material Records of the University of Southern...Cataloging Problems in Chinese Material Records of the University of Southern...
Cataloging Problems in Chinese Material Records of the University of Southern...
 
Question 3 sasiane evaluation
Question 3 sasiane evaluationQuestion 3 sasiane evaluation
Question 3 sasiane evaluation
 
Cruz roja
Cruz rojaCruz roja
Cruz roja
 
Comic1
Comic1Comic1
Comic1
 
WISBOX Mini Brochure
WISBOX Mini BrochureWISBOX Mini Brochure
WISBOX Mini Brochure
 
Tafseere namoonavol2
Tafseere namoonavol2Tafseere namoonavol2
Tafseere namoonavol2
 
2014.2 journal of literature and art studies
2014.2 journal of literature and art studies2014.2 journal of literature and art studies
2014.2 journal of literature and art studies
 
Creating web sites using datalife engine
Creating web sites using datalife engineCreating web sites using datalife engine
Creating web sites using datalife engine
 
Rescue1.asd
Rescue1.asdRescue1.asd
Rescue1.asd
 
Do you live in a house or a flat?
Do you live in a house or a flat?Do you live in a house or a flat?
Do you live in a house or a flat?
 
NHQA Criteria 4. Performance Management
NHQA Criteria 4. Performance ManagementNHQA Criteria 4. Performance Management
NHQA Criteria 4. Performance Management
 

Similar to Session 05 cleaning and exploring

Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programación
Software Guru
 
pandas dataframe notes.pdf
pandas dataframe notes.pdfpandas dataframe notes.pdf
pandas dataframe notes.pdf
AjeshSurejan2
 
Lecture 9.pptx
Lecture 9.pptxLecture 9.pptx
Lecture 9.pptx
MathewJohnSinoCruz
 
python beginner talk slide
python beginner talk slidepython beginner talk slide
python beginner talk slide
jonycse
 
DS_PPT.pptx
DS_PPT.pptxDS_PPT.pptx
DS_PPT.pptx
MeghaKulkarni27
 
Pandas Dataframe reading data Kirti final.pptx
Pandas Dataframe reading data  Kirti final.pptxPandas Dataframe reading data  Kirti final.pptx
Pandas Dataframe reading data Kirti final.pptx
Kirti Verma
 
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptxfINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
dataKarthik
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming language
Julian Hyde
 
Control statements
Control statementsControl statements
Control statements
Pramod Rathore
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)
Guy Lebanon
 
Unit 3_Numpy_Vsp.pptx
Unit 3_Numpy_Vsp.pptxUnit 3_Numpy_Vsp.pptx
Unit 3_Numpy_Vsp.pptx
prakashvs7
 
Python Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayPython Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard Way
Utkarsh Sengar
 
Data Management in Python
Data Management in PythonData Management in Python
Data Management in Python
Sankhya_Analytics
 
XII - 2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdf
XII -  2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdfXII -  2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdf
XII - 2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdf
KrishnaJyotish1
 
Get started with R lang
Get started with R langGet started with R lang
Get started with R lang
senthil0809
 
Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)
Roy Zimmer
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
ParveenShaik21
 
interenship.pptx
interenship.pptxinterenship.pptx
interenship.pptx
Naveen316549
 
Pandas.pptx
Pandas.pptxPandas.pptx
Pandas.pptx
Govardhan Bhavani
 
pandas-221217084954-937bb582.pdf
pandas-221217084954-937bb582.pdfpandas-221217084954-937bb582.pdf
pandas-221217084954-937bb582.pdf
scorsam1
 

Similar to Session 05 cleaning and exploring (20)

Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programación
 
pandas dataframe notes.pdf
pandas dataframe notes.pdfpandas dataframe notes.pdf
pandas dataframe notes.pdf
 
Lecture 9.pptx
Lecture 9.pptxLecture 9.pptx
Lecture 9.pptx
 
python beginner talk slide
python beginner talk slidepython beginner talk slide
python beginner talk slide
 
DS_PPT.pptx
DS_PPT.pptxDS_PPT.pptx
DS_PPT.pptx
 
Pandas Dataframe reading data Kirti final.pptx
Pandas Dataframe reading data  Kirti final.pptxPandas Dataframe reading data  Kirti final.pptx
Pandas Dataframe reading data Kirti final.pptx
 
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptxfINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptx
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming language
 
Control statements
Control statementsControl statements
Control statements
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)
 
Unit 3_Numpy_Vsp.pptx
Unit 3_Numpy_Vsp.pptxUnit 3_Numpy_Vsp.pptx
Unit 3_Numpy_Vsp.pptx
 
Python Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard WayPython Workshop - Learn Python the Hard Way
Python Workshop - Learn Python the Hard Way
 
Data Management in Python
Data Management in PythonData Management in Python
Data Management in Python
 
XII - 2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdf
XII -  2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdfXII -  2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdf
XII - 2022-23 - IP - RAIPUR (CBSE FINAL EXAM).pdf
 
Get started with R lang
Get started with R langGet started with R lang
Get started with R lang
 
Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 
interenship.pptx
interenship.pptxinterenship.pptx
interenship.pptx
 
Pandas.pptx
Pandas.pptxPandas.pptx
Pandas.pptx
 
pandas-221217084954-937bb582.pdf
pandas-221217084954-937bb582.pdfpandas-221217084954-937bb582.pdf
pandas-221217084954-937bb582.pdf
 

More from Sara-Jayne Terp

Distributed defense against disinformation: disinformation risk management an...
Distributed defense against disinformation: disinformation risk management an...Distributed defense against disinformation: disinformation risk management an...
Distributed defense against disinformation: disinformation risk management an...
Sara-Jayne Terp
 
Risk, SOCs, and mitigations: cognitive security is coming of age
Risk, SOCs, and mitigations: cognitive security is coming of ageRisk, SOCs, and mitigations: cognitive security is coming of age
Risk, SOCs, and mitigations: cognitive security is coming of age
Sara-Jayne Terp
 
disinformation risk management: leveraging cyber security best practices to s...
disinformation risk management: leveraging cyber security best practices to s...disinformation risk management: leveraging cyber security best practices to s...
disinformation risk management: leveraging cyber security best practices to s...
Sara-Jayne Terp
 
Cognitive security: all the other things
Cognitive security: all the other thingsCognitive security: all the other things
Cognitive security: all the other things
Sara-Jayne Terp
 
The Business(es) of Disinformation
The Business(es) of DisinformationThe Business(es) of Disinformation
The Business(es) of Disinformation
Sara-Jayne Terp
 
2021-05-SJTerp-AMITT_disinfoSoc-umaryland
2021-05-SJTerp-AMITT_disinfoSoc-umaryland2021-05-SJTerp-AMITT_disinfoSoc-umaryland
2021-05-SJTerp-AMITT_disinfoSoc-umaryland
Sara-Jayne Terp
 
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
Sara-Jayne Terp
 
2021-02-10_CogSecCollab_UBerkeley
2021-02-10_CogSecCollab_UBerkeley2021-02-10_CogSecCollab_UBerkeley
2021-02-10_CogSecCollab_UBerkeley
Sara-Jayne Terp
 
Using AMITT and ATT&CK frameworks
Using AMITT and ATT&CK frameworksUsing AMITT and ATT&CK frameworks
Using AMITT and ATT&CK frameworks
Sara-Jayne Terp
 
2020 12 nyu-workshop_cog_sec
2020 12 nyu-workshop_cog_sec2020 12 nyu-workshop_cog_sec
2020 12 nyu-workshop_cog_sec
Sara-Jayne Terp
 
2020 09-01 disclosure
2020 09-01 disclosure2020 09-01 disclosure
2020 09-01 disclosure
Sara-Jayne Terp
 
2019 11 terp_mansonbulletproof_master copy
2019 11 terp_mansonbulletproof_master copy2019 11 terp_mansonbulletproof_master copy
2019 11 terp_mansonbulletproof_master copy
Sara-Jayne Terp
 
BSidesLV 2018 talk: social engineering at scale, a community guide
BSidesLV 2018 talk: social engineering at scale, a community guideBSidesLV 2018 talk: social engineering at scale, a community guide
BSidesLV 2018 talk: social engineering at scale, a community guide
Sara-Jayne Terp
 
Social engineering at scale
Social engineering at scaleSocial engineering at scale
Social engineering at scale
Sara-Jayne Terp
 
engineering misinformation
engineering misinformationengineering misinformation
engineering misinformation
Sara-Jayne Terp
 
Online misinformation: they're coming for our brainz now
Online misinformation: they're coming for our brainz nowOnline misinformation: they're coming for our brainz now
Online misinformation: they're coming for our brainz now
Sara-Jayne Terp
 
Sj terp ciwg_nyc2017_credibility_belief
Sj terp ciwg_nyc2017_credibility_beliefSj terp ciwg_nyc2017_credibility_belief
Sj terp ciwg_nyc2017_credibility_belief
Sara-Jayne Terp
 
Belief: learning about new problems from old things
Belief: learning about new problems from old thingsBelief: learning about new problems from old things
Belief: learning about new problems from old things
Sara-Jayne Terp
 
risks and mitigations of releasing data
risks and mitigations of releasing datarisks and mitigations of releasing data
risks and mitigations of releasing data
Sara-Jayne Terp
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
Sara-Jayne Terp
 

More from Sara-Jayne Terp (20)

Distributed defense against disinformation: disinformation risk management an...
Distributed defense against disinformation: disinformation risk management an...Distributed defense against disinformation: disinformation risk management an...
Distributed defense against disinformation: disinformation risk management an...
 
Risk, SOCs, and mitigations: cognitive security is coming of age
Risk, SOCs, and mitigations: cognitive security is coming of ageRisk, SOCs, and mitigations: cognitive security is coming of age
Risk, SOCs, and mitigations: cognitive security is coming of age
 
disinformation risk management: leveraging cyber security best practices to s...
disinformation risk management: leveraging cyber security best practices to s...disinformation risk management: leveraging cyber security best practices to s...
disinformation risk management: leveraging cyber security best practices to s...
 
Cognitive security: all the other things
Cognitive security: all the other thingsCognitive security: all the other things
Cognitive security: all the other things
 
The Business(es) of Disinformation
The Business(es) of DisinformationThe Business(es) of Disinformation
The Business(es) of Disinformation
 
2021-05-SJTerp-AMITT_disinfoSoc-umaryland
2021-05-SJTerp-AMITT_disinfoSoc-umaryland2021-05-SJTerp-AMITT_disinfoSoc-umaryland
2021-05-SJTerp-AMITT_disinfoSoc-umaryland
 
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
 
2021-02-10_CogSecCollab_UBerkeley
2021-02-10_CogSecCollab_UBerkeley2021-02-10_CogSecCollab_UBerkeley
2021-02-10_CogSecCollab_UBerkeley
 
Using AMITT and ATT&CK frameworks
Using AMITT and ATT&CK frameworksUsing AMITT and ATT&CK frameworks
Using AMITT and ATT&CK frameworks
 
2020 12 nyu-workshop_cog_sec
2020 12 nyu-workshop_cog_sec2020 12 nyu-workshop_cog_sec
2020 12 nyu-workshop_cog_sec
 
2020 09-01 disclosure
2020 09-01 disclosure2020 09-01 disclosure
2020 09-01 disclosure
 
2019 11 terp_mansonbulletproof_master copy
2019 11 terp_mansonbulletproof_master copy2019 11 terp_mansonbulletproof_master copy
2019 11 terp_mansonbulletproof_master copy
 
BSidesLV 2018 talk: social engineering at scale, a community guide
BSidesLV 2018 talk: social engineering at scale, a community guideBSidesLV 2018 talk: social engineering at scale, a community guide
BSidesLV 2018 talk: social engineering at scale, a community guide
 
Social engineering at scale
Social engineering at scaleSocial engineering at scale
Social engineering at scale
 
engineering misinformation
engineering misinformationengineering misinformation
engineering misinformation
 
Online misinformation: they're coming for our brainz now
Online misinformation: they're coming for our brainz nowOnline misinformation: they're coming for our brainz now
Online misinformation: they're coming for our brainz now
 
Sj terp ciwg_nyc2017_credibility_belief
Sj terp ciwg_nyc2017_credibility_beliefSj terp ciwg_nyc2017_credibility_belief
Sj terp ciwg_nyc2017_credibility_belief
 
Belief: learning about new problems from old things
Belief: learning about new problems from old thingsBelief: learning about new problems from old things
Belief: learning about new problems from old things
 
risks and mitigations of releasing data
risks and mitigations of releasing datarisks and mitigations of releasing data
risks and mitigations of releasing data
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
 

Recently uploaded

End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
facilitymanager11
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
UofT毕业证如何办理
UofT毕业证如何办理UofT毕业证如何办理
UofT毕业证如何办理
exukyp
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
y3i0qsdzb
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
Session 05 cleaning and exploring

  • 2. Lab 5: your 5-7 things Data cleaning Basic data cleaning with Python Using OpenRefine Exploring Data The Pandas library The Seaborn library The R language
  • 4. Algorithms want their data to be: Machine-readable Consistent format (e.g. text is all lowercase) Consistent labels (e.g. use M/F, Male/Female, 0/1/2, but not *all* of these) No whitespace hiding in number or text cells No junk characters No strange outliers (e.g. 200 year old living people) In vectors and matrices Normalised
  • 6. Cleaning Strings Removing capitals and whitespace: mystring = " CApiTalIsaTion Sucks " mystring.lower().strip() original text is - CApiTalIsaTion Sucks - lowercased text is - capitalisation sucks - Text without whitespace is -capitalisation sucks-
  • 7. Regular Expressions: repeated spaces There’s a repeated space in capitalisation sucks import re re.sub(r'\s', '.', 'this is  a string') re.sub(r'\s+', '.', 'this is  a string') 'this.is..a.string' 'this.is.a.string'
  • 8. Regular Expressions: junk import re string1 = "This is a! sentence&& with junk!@" cleanstring1 = re.sub(r'[^\w ]', '', string1) This is a sentence with junk
  • 9. Converting Date/Times European vs American? Name of month vs number? Python comes with a bunch of date reformatting libraries that can convert between these. For example, to convert a European day/month/year string into American month/day/year: import datetime date_string = "14/03/48" datetime.datetime.strptime(date_string, '%d/%m/%y').strftime('%m/%d/%Y')
  • 14. Cleaning up the import
  • 19. Exploring Data Eyeball your data Plot your data - visually look for trends and outliers Get the basic statistics (mean, sd etc) of your data Create pivot tables to help understand how columns interact Do more cleaning if you need to (e.g. those outliers)
  • 21. Reading in data files with Pandas read_csv read_excel read_sql read_json read_html read_stata read_clipboard
  • 22. Eyeballing rows How many rows are there in this dataset? len(df) What do my data rows look like? df.head(5) df.tail() df[10:20]
  • 23. Eyeballing columns What’s in these columns? df['sourceid'] df[['sourceid','ag12a_01','ag12a_02_2']] What’s in the columns when these are true? df[df.ag12a_01 == 'YES'] df[(df.ag12a_01 == 'YES') & (df.ag12a_02_1 == 'NO')]
  • 24. Summarising columns What are my column names and types? df.columns df.dtypes Which labels do I have in this column? df['ag12a_03'].unique() df['ag12a_03'].value_counts()
  • 25. Pivot Tables: Combining data from one dataframe pd.pivot_table(df, index=['sourceid', 'ag12a_03'])
  • 26. Merge: Combining data from multiple frames longnames = pd.DataFrame({ 'country' : pd.Series(['United States of America', 'Zaire', 'Egypt']), 'longname' : pd.Series([True, True, False])}) merged_data = pd.merge( left=popstats, right=longnames, left_on='Country/territory of residence', right_on='country') merged_data[['Year', 'Country/territory of residence', 'longname', 'Total population', 'Origin / Returned from']]
  • 27. Left Joins: Keep everything from the left table… longnames = pd.DataFrame({ 'country' : pd.Series(['United States of America', 'Zaire', 'Egypt']), 'longname' : pd.Series([True, True, False])}) merged_data = pd.merge( left=popstats, right=longnames, how='left', left_on='Country/territory of residence', right_on='country') merged_data[['Year', 'Country/territory of residence', 'longname', 'Total population', 'Origin / Returned from']]
  • 30. The Iris dataset import seaborn as sns iris = sns.load_dataset('iris')
  • 31. Visualising Iris data with Seaborn sns.pairplot(iris, hue='species', height=2)
  • 33. R Matrix analysis (similar to Pandas) Good at: Rapid statistical analysis (4000+ R libraries) Rapidly-created static graphics Not so good at: Non-statistical things (e.g. GIS data analysis)
  • 34. Running R code ● Running R files: ○ From the terminal window: “R < myscript.r --no-save” ○ From inside another R program: source('myscript.r') ● Writing your own R code: ○ IPython notebooks: create an “R” notebook (instead of python3) ○ Terminal window: type “R” (and “q()” to quit)
  • 36. Code Try running the Python and R code in the 5.x set of notebooks
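
The string, regular-expression and date steps from slides 6 to 9 can be combined into one small sketch. The helper names clean_text and european_to_american are made up for illustration and are not from the course notebooks; the sketch also assumes the two-digit year should be read the way Python's %y reads it (00-68 becomes 2000-2068).

```python
import re
from datetime import datetime


def clean_text(raw):
    """Lowercase, drop junk characters, and collapse runs of whitespace."""
    text = raw.lower().strip()
    # Keep only word characters (letters, digits, underscore) and spaces.
    text = re.sub(r'[^\w ]', '', text)
    # Collapse any run of whitespace into a single space.
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


def european_to_american(date_string):
    """Convert dd/mm/yy (European) to mm/dd/YYYY (American)."""
    parsed = datetime.strptime(date_string, '%d/%m/%y')
    return parsed.strftime('%m/%d/%Y')


cleaned = clean_text("  This is a!  sentence&& with JUNK!@ ")
converted = european_to_american("14/03/48")
print(cleaned)    # this is a sentence with junk
print(converted)  # 03/14/2048
```

Lowercasing happens first so that the later patterns only ever see one case; the junk removal runs before the whitespace collapse because deleting characters can leave new double spaces behind.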

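The difference between the default (inner) merge on slide 26 and the left join on slide 27 is easiest to see on a tiny table. The popstats frame below is a hypothetical three-row stand-in for the real file, with invented numbers; only the column names and the longnames frame follow the slides.

```python
import pandas as pd

# Hypothetical stand-in for the popstats table used in the slides.
popstats = pd.DataFrame({
    'Year': [2009, 2009, 2010],
    'Country/territory of residence': ['Egypt', 'Kenya', 'Zaire'],
    'Total population': [100, 200, 300],
})
longnames = pd.DataFrame({
    'country': ['United States of America', 'Zaire', 'Egypt'],
    'longname': [True, True, False],
})

# Inner join (the default): only countries present in both tables survive.
inner = pd.merge(popstats, longnames,
                 left_on='Country/territory of residence', right_on='country')

# Left join: every popstats row is kept; unmatched rows get NaN.
left = pd.merge(popstats, longnames, how='left',
                left_on='Country/territory of residence', right_on='country')

print(len(inner), len(left))
print(left[left['longname'].isna()])  # the rows that found no match
```

Comparing the row counts of the two results is a quick check for how many rows an inner join would silently drop.
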
Editor's Notes

  1. We’re talking today about cleaning and exploring data. What we’re really talking about is making friends with your data; understanding it yourself before you run any algorithms (e.g. machine learning algorithms) on it. We do this because a) it’s hard to run algorithms on badly-formatted data, and b) discovering data issues when you’re trying to train a classifier sucks - you have enough on your hands without dealing with outliers and spelling errors too.
  2. Data cleaning is the process of removing errors (spelling mistakes, extraneous characters, corrupted data etc) from datafiles, to prepare them for use in algorithms and visualisation. Data cleaning is sometimes also called data scrubbing or cleansing.
  3. Normalised: each datapoint has its own row in the data matrix.
  4. Basic text cleaning. We'll spend a lot of time cleaning up text. Mostly this is because: A) although you see 'Capital' and 'capital' as the same word, an algorithm will see them as different because of the capital letter in one of them; B) people leave a lot of invisible characters in their data (NB they do this to string representations of numerical data too - and many spreadsheet programs will store numbers as strings). These invisible characters are known as "whitespace", and they can really mess up your day because "whitespace" and "whitespace " look the same to you, but an algorithm will see them as different. In the example, lower() converts a string into lowercase (upper() converts it into uppercase, but the convention in data science is to use all-lowercase, probably because it's less shouty to read), and strip() removes any whitespace before the first visible character and after the last visible character in the string.
  5. Use the RE (regular expression) library to clean up strings. To use a regular expression (aka RegEx), you need to specify two patterns: one input pattern that will be applied repeatedly to the input text, and one output pattern that will be used in place of anything that matches the input pattern. Regular expressions can be very powerful and can take a while to learn, but here are a couple of patterns that you’ll probably find helpful at some point.
  6. \w = any letter, number or underscore. [] = a group of possible characters, and a ^ at the start of the group negates it, e.g. [^\w ] = everything that isn’t alphanumeric or a space. \s+,\s+ = one or more whitespace characters followed by a comma then one or more whitespace characters
  7. More about date formats in https://docs.python.org/3/library/datetime.html
  8. OpenRefine is a powerful data cleaning tool. It doesn’t do the cleaning for you, but it does make doing repeated tasks much much easier.
  9. This is file 2009_2013_popstats_PSQ_POC.csv in directory notebooks/example_data
  10. First, click on the OpenRefine icon. This will start a webpage for you, at URL 127.0.0.1:3333 Click on create project, then select a file. Click “next”.
  11. I’ve selected file Notebooks/example_data/2009_2013_popstats_PSQ_POC.csv Now I can see a preview of the data as it will be fed into the system, and a set of buttons for changing that import. And it’s a mess. OpenRefine has put all the data into one column - it’s ignored the commas that separate columns, and it’s used the first row (which is a comment about the file) as the column headings.
  12. Fortunately, OpenRefine has ways to start cleaning as you import the file. Here, we’ve selected “commas” as the column separators, and told OpenRefine to ignore the first 4 lines in the file. Now all we’ve got left to do is to clean up those annoying “*”s. First, click on “create project”.
  13. Here’s your data. You can do many things with this: clean text en-masse, move columns around (or add and delete columns), split or merge data in columns. We’ll play with just a few of these things.
  14. If you right-click on a cell that you want to change, a little blue “edit” box will appear. Click on this, then edit the box that appears. Now a powerful thing happens. You can apply whatever you did to that cell, to all other identical cells in this column. So if I want to remove cells with ‘*’ in them, I click on one of them, edit out the star, then click on “apply to all identical cells”.
  15. Facets are also really powerful ways to look at and edit the data in cells. For instance, if you click on the arrow next to “column 3”, you’ll get a popup menu. Click on “facet” then “text facet”, and a facet box will appear on the left side of the screen. Now, if you want to change every instance of “Viet Nam” to “Vietnam”, you just need to edit the text here (hover over the word “Viet Nam” and an “edit” box will appear). If you have really messy inputs, you can cluster them (see the “cluster” button?) into similar fields, and assign the text that should replace everything in the cluster. This is a useful way to deal with spelling variations, misspellings, spaces etc. When you’re finished, click “export” (top right side of the page) to write your data out to CSV etc. I’ve just shown you some of OpenRefine’s power. For more, there are a bunch of OpenRefine recipes at https://github.com/OpenRefine/OpenRefine/wiki/Recipes
  16. Get to know your dataset, before you ask a machine to understand it.
  17. Pandas is a Python data analysis library.
  18. read_sql reads the results of a SQL query or a database table; read_html reads tables from html pages; read_clipboard reads from your PC’s clipboard. Beware: if you have more columns of data than you have column headings, Pandas read_csv can fail. If this happens, there are lots of optional parameters to read_csv that can help, but in practice it’s better to feed in clean data.
  19. df[k].unique()
  20. NB df.describe will only find mean and SD for numerical columns - which is reasonable.
  21. This is very similar to the pivot tables in Excel. More at http://pbpython.com/pandas-pivot-table-explained.html Those NaNs? NaN stands for “Not a Number”; Pandas uses it to mark missing values.
  22. Pandas merge defaults to an ‘inner join’: only keep the rows where data exists in *both* tables. See e.g. http://www.datacarpentry.org/python-ecology/04-merging-data
  23. Left join: keep all rows from the first table; combine rows where you can, put “NaN”s in the rows when you can’t. Some great visuals about joins: http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
  24. You can normalise your tables in Pandas by using the stack function - see e.g. http://pandas.pydata.org/pandas-docs/stable/reshaping.html Image: UNICEF state of the world’s children report.
  25. You might want to do “sns.set()” before plotting…
  26. You can also run R from Python code, using the rpy2 library.
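
Notes 20 and 24 can be tried together on a toy table. The sketch below assumes a hypothetical wide layout (one row per country, one column per year, invented numbers) and uses stack to give each datapoint its own row, then describe to summarise the result.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
wide = pd.DataFrame(
    {'2009': [10, 30], '2010': [20, 40]},
    index=pd.Index(['Egypt', 'Kenya'], name='country'),
)
wide.columns.name = 'year'

# stack() moves the column labels into the index,
# giving one row per (country, year) pair.
tall = wide.stack().reset_index(name='population')

print(tall)
# describe() summarises only the numerical column here ('population');
# 'country' and 'year' hold text, so they are skipped by default.
print(tall.describe())
```

The year labels are kept as strings on purpose, so the describe output shows the numeric-only behaviour from note 20.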