SlideShare a Scribd company logo
Machine Learning
Dataset Preparation
Portland Data Science Group
Created by Andrew Ferlitsch
Community Outreach Officer
July, 2017
Dataset Preparation
• Prior to using a dataset to train a model, the dataset
must be prepared.
1. Import the data
2. Clean the data (Data Wrangling)
3. Replace Missing Values
4. Categorical Value Conversion
5. Feature Scaling
Importing the Dataset
• Datasets are generally imported as a raw data files
(e.g., US Census) or via an API service (e.g., NWS
Weather Data SOAP API).
• Datasets are generally in the form of CSV, JSON or XML
data format.
• For the purpose of this tutorial, CSV is used in the
accompanying examples.
Importing the Dataset - Python
import pandas as pd # use pandas library for data frames
dataset = pd.read_csv( ‘data.csv’ ) # read CSV file into a data frame
pathname to raw data file
Function to read a CSV file
CSV data converted
to data frame.
Example Data (CSV File): Generated Data Frame:
Age, Gender, Income, Spending
22,M,18000,6000
25,F,30000,8000
31,F,35000,12000
35,M,40000,18000
Age Gender Income Spending
0 22 M 18000 6000
1 25 F 30000 8000
2 31 F 35000 12000
3 35 M 40000 18000
Data Frame adds these indices
Cleaning the Data (Data Wrangling)
• It is not uncommon for datasets to have some dirty
data entries (i.e., samples, rows in CSV file, …)
• Common Problems
• Bad Character Encodings (Funny Characters)
• Misaligned Data (e.g., row has too few/many columns)
• Data in wrong format.
Great Britain and the United States are two of the few places in the world that use a period to indicate the
decimal place. Many other countries use a comma instead. Likewise, while the U.K. and U.S. use a comma to
separate groups of thousands, many other countries use a period instead, and some countries separate
thousands groups with a thin space.
https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html
• Data Wrangling is an expertise/occupation all in its own.
Common Practices in Data Wrangling
• Know the character encoding of the data file and
intended character encoding of the data.
Convert the data encoding format of the file if necessary.
e.g., Notepad++ -> Encodings
• Know the data format of the source and expected
data format.
Convert the data format using a batch preprocessing file.
e.g., 1 000 000 -> 1,000,000
Replace Missing Values
• Not unusual for samples (rows) to contain missing (blank)
entries, or not a number (NaN).
• Blank/NaN entries do not work for Machine Learning!
• Need to replace the blank/NaN entry with something
meaningful.
• Delete the rows (generally not desirable)
• Replace with a Single Value
• Mean Average
• Multivariate Imputation using Chained Equations (MICS)
https://msdn.microsoft.com/en-us/library/azure/dn906028.aspx
Missing Values – Mean Value
from sklearn.preprocessing import Imputer # scikit-learn module
# Create imputer object to replace NaN values with the mean value of the column
imputer = Imputer( missing_values=‘NaN’,
strategy=‘mean’ )
# Fit the data to the imputer object
imputer = imputer.fit( dataset[ :, 2 ] )
# do the replacement and update the dataset
dataset[ :, 2 ] = imputer.transform( dataset[ :, 2 ] )
scikit-learn class for handling missing data
original dataset
replace missing values in column 2 (index starts at 0)
select all rows
needs to be the same columns in dataset
Categorical Variables
Age Gender Income
25 Male 25000
26 Female 22000
30 Male 45000
24 Female 26000
Independent Variables (Features)
Dependent Variables (Label)
Real Values Value to Predict
Categorical Values
Dummy Variable Conversion
Known in Python as OneHotEncoder
For each categorical feature:
1. Scan the dataset and determine all the unique instances.
2. Create a new feature (i.e., dummy variable) in dataset, one
per unique instance.
3. Remove the categorical feature from the dataset.
4. For each sample (row), set a 1 in the feature (dummy
variable) that corresponds to that categorical value instance,
and:
5. Set a 0 in the remaining features (dummy variables) for that
categorical field.
6. Remove one dummy variable field.
Dummy Variable Trap
Gender
Male
Female
Male
Female
Need to Drop one Dummy Variable!
Male Female
1 0
0 1
1 0
0 1
x1 x2 x3
Multicollinearity occurs when one variable predicts another.
i.e., x2 = ( 1 – x3)
As a result, a regression analysis cannot distinguish between the
contribution of x2 and x3.
Categorical Variable Conversion
from sklearn.preprocessing import LabelEncoder # scikit-learn module
# Create an encoder object to numerically (enumeration) encode categorical variables
labelEncoder = LabelEncoder()
# Fit the data to the Encoder object
labelEncoder.fit_transform()
dataset[ :, 1 ] = labelEncoder.fit_transform( dataset[ :, 1 ] )
# Create an encoder to convert numerical encodings to 1-encoded dummy variables
onehotencoder = OneHotEncoder( categorical_features = [ 1 ] )
# Replace the encoded categorical values with the 1-encoded dummy variables
dataset = onehotencoder.fit_transform( dataset )
scikit-learn class for categorical variable conversion
original dataset
encode the categorical values in column 1 (index starts at 0)
select all rows
needs to be the same columns in dataset
Categorical variables to convert are in column 1
Dataset with converted categorical variables
Feature Scaling
• If features do not have the same numerical scale
in values, will cause issues in training a mode.
• If the scale of one independent variable (feature) is
greater than another independent variable, the model
will give more importance (skew) to the independent
variable with the larger range.
• To eliminate this problem, one converts all the
independent variables to use the same scale.
• Normalization ( 0 to 1 )
• Standardization ( -1 to 1 )
Scaling Issue - Euclidean Distance
• Most machine learning models use Euclidean distance
between two points in 2D Cartesian space.
𝒙 𝟐 − 𝒙 𝟏
𝟐 + (𝒚 𝟐 − 𝒚 𝟏) 𝟐
• Given two independent variables (x1 = Age, x2 = Income)
and a dependent variable (y = spending), becomes for
a given sample (row) i:
𝒙𝟐𝒊 − 𝒙𝟏𝒊
𝟐 + 𝒚𝒊 − 𝒚𝒊 𝟐 = 𝒙𝟐𝒊 − 𝒙𝟏𝒊
𝟐
• If x1 or x2 is a substantially greater scale than the other,
the corresponding independent variable will dominate
the result, and will contribute more to the model.
Normalization or Standardization
• Feature Scaling means scaling features to the same scale.
• Normalization scales features between 0 and 1, retaining
their proportional range to each other.
• Standardization scales features to have a mean (u) of 0
and standard deviation (a) of 1.
X’ =
𝑥 − min(𝑥)
max 𝑥 − min(𝑥)
Normalization
original valuenew value
X’ =
𝑥 − 𝑢
𝑎
Standardization
original valuenew value
mean
standard deviation
Feature Scaling in Python
from sklearn.preprocessing import StandardScalar # scikit-learn module
# Create a scaling object to scale the features.
scale = StandardScalar()
# Fit the data to the Scaling object and transform the data
dataset [:,-1] = scale.fit_transform( dataset[:,-1] )
scikit-learn class for Feature Scaling
feature scale all the variables except the last column (y or label)

More Related Content

What's hot

Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
Benjamin Bengfort
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
Sungjoon Choi
 
supervised learning
supervised learningsupervised learning
supervised learning
Amar Tripathi
 
Applications in Machine Learning
Applications in Machine LearningApplications in Machine Learning
Applications in Machine Learning
Joel Graff
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Rahul Jain
 
Types of Machine Learning
Types of Machine LearningTypes of Machine Learning
Types of Machine Learning
Samra Shahzadi
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
mrizwan969
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learningbutest
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
Sebastian Raschka
 
Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.
Rohit Kumar
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learning
Tonmoy Bhagawati
 
Feature selection concepts and methods
Feature selection concepts and methodsFeature selection concepts and methods
Feature selection concepts and methodsReza Ramezani
 
Basics of Machine Learning
Basics of Machine LearningBasics of Machine Learning
Basics of Machine Learningbutest
 
Supervised Unsupervised and Reinforcement Learning
Supervised Unsupervised and Reinforcement Learning Supervised Unsupervised and Reinforcement Learning
Supervised Unsupervised and Reinforcement Learning
Aakash Chotrani
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.butest
 
Machine Learning - Splitting Datasets
Machine Learning - Splitting DatasetsMachine Learning - Splitting Datasets
Machine Learning - Splitting Datasets
Andrew Ferlitsch
 
Adversarial Attacks and Defense
Adversarial Attacks and DefenseAdversarial Attacks and Defense
Adversarial Attacks and Defense
Kishor Datta Gupta
 
Machine learning
Machine learningMachine learning
Machine learning
Shailja Tripathi
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Shrey Malik
 

What's hot (20)

Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
 
supervised learning
supervised learningsupervised learning
supervised learning
 
Applications in Machine Learning
Applications in Machine LearningApplications in Machine Learning
Applications in Machine Learning
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Types of Machine Learning
Types of Machine LearningTypes of Machine Learning
Types of Machine Learning
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
 
Pattern Recognition
Pattern RecognitionPattern Recognition
Pattern Recognition
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 
Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learning
 
Feature selection concepts and methods
Feature selection concepts and methodsFeature selection concepts and methods
Feature selection concepts and methods
 
Basics of Machine Learning
Basics of Machine LearningBasics of Machine Learning
Basics of Machine Learning
 
Supervised Unsupervised and Reinforcement Learning
Supervised Unsupervised and Reinforcement Learning Supervised Unsupervised and Reinforcement Learning
Supervised Unsupervised and Reinforcement Learning
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
 
Machine Learning - Splitting Datasets
Machine Learning - Splitting DatasetsMachine Learning - Splitting Datasets
Machine Learning - Splitting Datasets
 
Adversarial Attacks and Defense
Adversarial Attacks and DefenseAdversarial Attacks and Defense
Adversarial Attacks and Defense
 
Machine learning
Machine learningMachine learning
Machine learning
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 

Similar to Machine Learning - Dataset Preparation

data science with python_UNIT 2_full notes.pdf
data science with python_UNIT 2_full notes.pdfdata science with python_UNIT 2_full notes.pdf
data science with python_UNIT 2_full notes.pdf
mukeshgarg02
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
TanujaSomvanshi1
 
Think Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseThink Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use Case
Rachel Warren
 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear Regression
Siddharth Shrivastava
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
University of Huddersfield
 
Data structure
Data structureData structure
Data structure
Muhammad Farhan
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
Ralf Gommers
 
Search-Based Robustness Testing of Data Processing Systems
Search-Based Robustness Testing of Data Processing SystemsSearch-Based Robustness Testing of Data Processing Systems
Search-Based Robustness Testing of Data Processing Systems
Lionel Briand
 
presentation.ppt
presentation.pptpresentation.ppt
presentation.ppt
MadhuriChandanbatwe
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
pyingkodi maran
 
Data structure and algorithm.
Data structure and algorithm. Data structure and algorithm.
Data structure and algorithm.
Abdul salam
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
Daiki Tanaka
 
Unit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxUnit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptx
avinashBajpayee1
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with Azure
Ivo Andreev
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013
Sri Ambati
 
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data AnalysisWorkshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Olga Scrivner
 
Java 101 intro to programming with java
Java 101  intro to programming with javaJava 101  intro to programming with java
Java 101 intro to programming with java
Hawkman Academy
 
Java 101 Intro to Java Programming
Java 101 Intro to Java ProgrammingJava 101 Intro to Java Programming
Java 101 Intro to Java Programming
agorolabs
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Boston Institute of Analytics
 

Similar to Machine Learning - Dataset Preparation (20)

data science with python_UNIT 2_full notes.pdf
data science with python_UNIT 2_full notes.pdfdata science with python_UNIT 2_full notes.pdf
data science with python_UNIT 2_full notes.pdf
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
 
Think Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseThink Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use Case
 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear Regression
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
 
Data structure
Data structureData structure
Data structure
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
 
Search-Based Robustness Testing of Data Processing Systems
Search-Based Robustness Testing of Data Processing SystemsSearch-Based Robustness Testing of Data Processing Systems
Search-Based Robustness Testing of Data Processing Systems
 
presentation.ppt
presentation.pptpresentation.ppt
presentation.ppt
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
 
Data structure and algorithm.
Data structure and algorithm. Data structure and algorithm.
Data structure and algorithm.
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Unit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxUnit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptx
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with Azure
 
Java
JavaJava
Java
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013
 
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data AnalysisWorkshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
 
Java 101 intro to programming with java
Java 101  intro to programming with javaJava 101  intro to programming with java
Java 101 intro to programming with java
 
Java 101 Intro to Java Programming
Java 101 Intro to Java ProgrammingJava 101 Intro to Java Programming
Java 101 Intro to Java Programming
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 

More from Andrew Ferlitsch

AI - Intelligent Agents
AI - Intelligent AgentsAI - Intelligent Agents
AI - Intelligent Agents
Andrew Ferlitsch
 
Pareto Principle Applied to QA
Pareto Principle Applied to QAPareto Principle Applied to QA
Pareto Principle Applied to QA
Andrew Ferlitsch
 
Whiteboarding Coding Challenges in Python
Whiteboarding Coding Challenges in PythonWhiteboarding Coding Challenges in Python
Whiteboarding Coding Challenges in Python
Andrew Ferlitsch
 
Object Oriented Programming Principles
Object Oriented Programming PrinciplesObject Oriented Programming Principles
Object Oriented Programming Principles
Andrew Ferlitsch
 
Python - OOP Programming
Python - OOP ProgrammingPython - OOP Programming
Python - OOP Programming
Andrew Ferlitsch
 
Python - Installing and Using Python and Jupyter Notepad
Python - Installing and Using Python and Jupyter NotepadPython - Installing and Using Python and Jupyter Notepad
Python - Installing and Using Python and Jupyter Notepad
Andrew Ferlitsch
 
Natural Language Processing - Groupings (Associations) Generation
Natural Language Processing - Groupings (Associations) GenerationNatural Language Processing - Groupings (Associations) Generation
Natural Language Processing - Groupings (Associations) Generation
Andrew Ferlitsch
 
Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...
Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...
Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...
Andrew Ferlitsch
 
Machine Learning - Introduction to Recurrent Neural Networks
Machine Learning - Introduction to Recurrent Neural NetworksMachine Learning - Introduction to Recurrent Neural Networks
Machine Learning - Introduction to Recurrent Neural Networks
Andrew Ferlitsch
 
Machine Learning - Introduction to Convolutional Neural Networks
Machine Learning - Introduction to Convolutional Neural NetworksMachine Learning - Introduction to Convolutional Neural Networks
Machine Learning - Introduction to Convolutional Neural Networks
Andrew Ferlitsch
 
Machine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural NetworksMachine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural Networks
Andrew Ferlitsch
 
Python - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesPython - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning Libraries
Andrew Ferlitsch
 
Machine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion Matrix
Andrew Ferlitsch
 
Machine Learning - Ensemble Methods
Machine Learning - Ensemble MethodsMachine Learning - Ensemble Methods
Machine Learning - Ensemble Methods
Andrew Ferlitsch
 
ML - Multiple Linear Regression
ML - Multiple Linear RegressionML - Multiple Linear Regression
ML - Multiple Linear Regression
Andrew Ferlitsch
 
ML - Simple Linear Regression
ML - Simple Linear RegressionML - Simple Linear Regression
ML - Simple Linear Regression
Andrew Ferlitsch
 
Machine Learning - Dummy Variable Conversion
Machine Learning - Dummy Variable ConversionMachine Learning - Dummy Variable Conversion
Machine Learning - Dummy Variable Conversion
Andrew Ferlitsch
 
Machine Learning - Introduction to Tensorflow
Machine Learning - Introduction to TensorflowMachine Learning - Introduction to Tensorflow
Machine Learning - Introduction to Tensorflow
Andrew Ferlitsch
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Andrew Ferlitsch
 
AI - Introduction to Dynamic Programming
AI - Introduction to Dynamic ProgrammingAI - Introduction to Dynamic Programming
AI - Introduction to Dynamic Programming
Andrew Ferlitsch
 

More from Andrew Ferlitsch (20)

AI - Intelligent Agents
AI - Intelligent AgentsAI - Intelligent Agents
AI - Intelligent Agents
 
Pareto Principle Applied to QA
Pareto Principle Applied to QAPareto Principle Applied to QA
Pareto Principle Applied to QA
 
Whiteboarding Coding Challenges in Python
Whiteboarding Coding Challenges in PythonWhiteboarding Coding Challenges in Python
Whiteboarding Coding Challenges in Python
 
Object Oriented Programming Principles
Object Oriented Programming PrinciplesObject Oriented Programming Principles
Object Oriented Programming Principles
 
Python - OOP Programming
Python - OOP ProgrammingPython - OOP Programming
Python - OOP Programming
 
Python - Installing and Using Python and Jupyter Notepad
Python - Installing and Using Python and Jupyter NotepadPython - Installing and Using Python and Jupyter Notepad
Python - Installing and Using Python and Jupyter Notepad
 
Natural Language Processing - Groupings (Associations) Generation
Natural Language Processing - Groupings (Associations) GenerationNatural Language Processing - Groupings (Associations) Generation
Natural Language Processing - Groupings (Associations) Generation
 
Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...
Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...
Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...
 
Machine Learning - Introduction to Recurrent Neural Networks
Machine Learning - Introduction to Recurrent Neural NetworksMachine Learning - Introduction to Recurrent Neural Networks
Machine Learning - Introduction to Recurrent Neural Networks
 
Machine Learning - Introduction to Convolutional Neural Networks
Machine Learning - Introduction to Convolutional Neural NetworksMachine Learning - Introduction to Convolutional Neural Networks
Machine Learning - Introduction to Convolutional Neural Networks
 
Machine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural NetworksMachine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural Networks
 
Python - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesPython - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning Libraries
 
Machine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion Matrix
 
Machine Learning - Ensemble Methods
Machine Learning - Ensemble MethodsMachine Learning - Ensemble Methods
Machine Learning - Ensemble Methods
 
ML - Multiple Linear Regression
ML - Multiple Linear RegressionML - Multiple Linear Regression
ML - Multiple Linear Regression
 
ML - Simple Linear Regression
ML - Simple Linear RegressionML - Simple Linear Regression
ML - Simple Linear Regression
 
Machine Learning - Dummy Variable Conversion
Machine Learning - Dummy Variable ConversionMachine Learning - Dummy Variable Conversion
Machine Learning - Dummy Variable Conversion
 
Machine Learning - Introduction to Tensorflow
Machine Learning - Introduction to TensorflowMachine Learning - Introduction to Tensorflow
Machine Learning - Introduction to Tensorflow
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
AI - Introduction to Dynamic Programming
AI - Introduction to Dynamic ProgrammingAI - Introduction to Dynamic Programming
AI - Introduction to Dynamic Programming
 

Recently uploaded

Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 

Recently uploaded (20)

Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 

Machine Learning - Dataset Preparation

  • 1. Machine Learning Dataset Preparation Portland Data Science Group Created by Andrew Ferlitsch Community Outreach Officer July, 2017
  • 2. Dataset Preparation • Prior to using a dataset to train a model, the dataset must be prepared. 1. Import the data 2. Clean the data (Data Wrangling) 3. Replace Missing Values 4. Categorical Value Conversion 5. Feature Scaling
  • 3. Importing the Dataset • Datasets are generally imported as a raw data files (e.g., US Census) or via an API service (e.g., NWS Weather Data SOAP API). • Datasets are generally in the form of CSV, JSON or XML data format. • For the purpose of this tutorial, CSV is used in the accompanying examples.
  • 4. Importing the Dataset - Python import pandas as pd # use pandas library for data frames dataset = pd.read_csv( ‘data.csv’ ) # read CSV file into a data frame pathname to raw data file Function to read a CSV file CSV data converted to data frame. Example Data (CSV File): Generated Data Frame: Age, Gender, Income, Spending 22,M,18000,6000 25,F,30000,8000 31,F,35000,12000 35,M,40000,18000 Age Gender Income Spending 0 22 M 18000 6000 1 25 F 30000 8000 2 31 F 35000 12000 3 35 M 40000 18000 Data Frame adds these indices
  • 5. Cleaning the Data (Data Wrangling) • It is not uncommon for datasets to have some dirty data entries (i.e., samples, rows in CSV file, …) • Common Problems • Bad Character Encodings (Funny Characters) • Misaligned Data (e.g., row has too few/many columns) • Data in wrong format. Great Britain and the United States are two of the few places in the world that use a period to indicate the decimal place. Many other countries use a comma instead. Likewise, while the U.K. and U.S. use a comma to separate groups of thousands, many other countries use a period instead, and some countries separate thousands groups with a thin space. https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html • Data Wrangling is an expertise/occupation all in its own.
  • 6. Common Practices in Data Wrangling • Know the character encoding of the data file and intended character encoding of the data. Convert the data encoding format of the file if necessary. e.g., Notepad++ -> Encodings • Know the data format of the source and expected data format. Convert the data format using a batch preprocessing file. e.g., 1 000 000 -> 1,000,000
  • 7. Replace Missing Values • Not unusual for samples (rows) to contain missing (blank) entries, or not a number (NaN). • Blank/NaN entries do not work for Machine Learning! • Need to replace the blank/NaN entry with something meaningful. • Delete the rows (generally not desirable) • Replace with a Single Value • Mean Average • Multivariate Imputation using Chained Equations (MICS) https://msdn.microsoft.com/en-us/library/azure/dn906028.aspx
  • 8. Missing Values – Mean Value from sklearn.preprocessing import Imputer # scikit-learn module # Create imputer object to replace NaN values with the mean value of the column imputer = Imputer( missing_values=‘NaN’, strategy=‘mean’ ) # Fit the data to the imputer object imputer = imputer.fit( dataset[ :, 2 ] ) # do the replacement and update the dataset dataset[ :, 2 ] = imputer.transform( dataset[ :, 2 ] ) scikit-learn class for handling missing data original dataset replace missing values in column 2 (index starts at 0) select all rows needs to be the same columns in dataset
  • 9. Categorical Variables Age Gender Income 25 Male 25000 26 Female 22000 30 Male 45000 24 Female 26000 Independent Variables (Features) Dependent Variables (Label) Real Values Value to Predict Categorical Values
  • 10. Dummy Variable Conversion Known in Python as OneHotEncoder For each categorical feature: 1. Scan the dataset and determine all the unique instances. 2. Create a new feature (i.e., dummy variable) in dataset, one per unique instance. 3. Remove the categorical feature from the dataset. 4. For each sample (row), set a 1 in the feature (dummy variable) that corresponds to that categorical value instance, and: 5. Set a 0 in the remaining features (dummy variables) for that categorical field. 6. Remove one dummy variable field.
  • 11. Dummy Variable Trap Gender Male Female Male Female Need to Drop one Dummy Variable! Male Female 1 0 0 1 1 0 0 1 x1 x2 x3 Multicollinearity occurs when one variable predicts another. i.e., x2 = ( 1 – x3) As a result, a regression analysis cannot distinguish between the contribution of x2 and x3.
  • 12. Categorical Variable Conversion from sklearn.preprocessing import LabelEncoder # scikit-learn module # Create an encoder object to numerically (enumeration) encode categorical variables labelEncoder = LabelEncoder() # Fit the data to the Encoder object labelEncoder.fit_transform() dataset[ :, 1 ] = labelEncoder.fit_transform( dataset[ :, 1 ] ) # Create an encoder to convert numerical encodings to 1-encoded dummy variables onehotencoder = OneHotEncoder( categorical_features = [ 1 ] ) # Replace the encoded categorical values with the 1-encoded dummy variables dataset = onehotencoder.fit_transform( dataset ) scikit-learn class for categorical variable conversion original dataset encode the categorical values in column 1 (index starts at 0) select all rows needs to be the same columns in dataset Categorical variables to convert are in column 1 Dataset with converted categorical variables
  • 13. Feature Scaling • If features do not have the same numerical scale in values, will cause issues in training a mode. • If the scale of one independent variable (feature) is greater than another independent variable, the model will give more importance (skew) to the independent variable with the larger range. • To eliminate this problem, one converts all the independent variables to use the same scale. • Normalization ( 0 to 1 ) • Standardization ( -1 to 1 )
  • 14. Scaling Issue - Euclidean Distance • Most machine learning models use Euclidean distance between two points in 2D Cartesian space. 𝒙 𝟐 − 𝒙 𝟏 𝟐 + (𝒚 𝟐 − 𝒚 𝟏) 𝟐 • Given two independent variables (x1 = Age, x2 = Income) and a dependent variable (y = spending), becomes for a given sample (row) i: 𝒙𝟐𝒊 − 𝒙𝟏𝒊 𝟐 + 𝒚𝒊 − 𝒚𝒊 𝟐 = 𝒙𝟐𝒊 − 𝒙𝟏𝒊 𝟐 • If x1 or x2 is a substantially greater scale than the other, the corresponding independent variable will dominate the result, and will contribute more to the model.
  • 15. Normalization or Standardization • Feature Scaling means scaling features to the same scale. • Normalization scales features between 0 and 1, retaining their proportional range to each other. • Standardization scales features to have a mean (u) of 0 and standard deviation (a) of 1. X’ = 𝑥 − min(𝑥) max 𝑥 − min(𝑥) Normalization original valuenew value X’ = 𝑥 − 𝑢 𝑎 Standardization original valuenew value mean standard deviation
  • 16. Feature Scaling in Python from sklearn.preprocessing import StandardScalar # scikit-learn module # Create a scaling object to scale the features. scale = StandardScalar() # Fit the data to the Scaling object and transform the data dataset [:,-1] = scale.fit_transform( dataset[:,-1] ) scikit-learn class for Feature Scaling feature scale all the variables except the last column (y or label)