SlideShare a Scribd company logo
1 of 9
Regression Methods in
Machine Learning
Splitting Datasets
Portland Data Science Group
Andrew Ferlitsch
Community Outreach Officer
July, 2017
Splitting Datasets
• To use a dataset in Machine Learning, the dataset is
first split into a training and test set.
• The training set is used to train the model.
• The test set is used to test the accuracy of the model.
• Typically, split 80% training, 20% test.
It’s About Training
Machine Learning is about using data to train a model
DATA
Training
Data
Test
Data
Train
Split Dataset into Training and Test
Model
Use Training Data to Train the Model
Produce Model
Test the Model Accuracy
Determine
Accuracy
Serial Splitting of the Dataset
• Simplest method of splitting data is to split it serially.
• Take first 80% rows and put into training set.
• Take remaining 20% rows and put into test set.
import pandas as pd # pandas library
dataset = pd.read_csv("Data.csv") # read in data as panda dataframe
nrows = dataset.shape[ 0 ] # property shape[ 0 ] is the number of rows
train = dataset.iloc[ 1: int(nrows * .8) , : ]
80% rows 20% rows all columns
test = dataset.iloc[int(nrows * .8) +1, nrows, : ]
Random Splitting of the Dataset
• Another method is too pick rows at random.
• Sci-kit learn has built-in method
from sklearn.cross_validation import train_test_split
ncols = dataset.shape[ 1 ] # property shape[ 0 ] is the number of columns
# Assume label is last column in dataset
X = dataset.iloc[ :, :-1 ] # X is all the features (exclude last column)
y = dataset.iloc[ :, ncols ] # Y is the label (last column)
# Split the data, with 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
split size
seed for random
Number generator
Data Imbalance - Overfitting
• If the training data is overly unbalanced, then the model
will predict a non-meaningful result.
• For example, if the model is a binary classifier (e.g.,
apple vs. pear), and nearly all the samples are of the
same label (e.g., apple), then the model will simply
learn that everything is a that label (apple).
• This is called overfitting. To prevent overfitting, there
needs to be a fairly equal distribution of training samples
for each classification, or range if label is a real value.
Data Imbalance - Overfitting
Training
Data
(90%
Apple,
10%
Pear)
Train
In an imbalance, the model will fit itself to the imbalance, not the predictor.
Nearly all samples are an Apple
ModelPear
Test Sample is a Pear
Apple
The model will most likely predict
the pear as an apple.
Cross Validation
DATA
Training
Data
Test
Data
Train
Split Dataset into Training and Test
Model
Use Training Data to Train the Model
Accuracy
Predict
Accuracy
Training
Data
Validation
Data
Use Validation Data
To predict accuracy
of the model.
Make adjustments to training
method to improve accuracy
Randomly pick split
Repeat Process until Converge to Desired Accuracy
K-Fold Cross Validation
• K-Fold is a well-known form of cross validation.
• Steps:
1. Partition the dataset into k equal sized partitions.
2. Select one partition as the validation data.
3. Use the remaining k-1 as the training data.
4. Train the model and determine accuracy from
the validation data.
5. Repeat the process k times, selecting a different
partition each time for the validation data.
5. Average the accuracy results.

More Related Content

What's hot

Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning Mohammad Junaid Khan
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset PreparationAndrew Ferlitsch
 
Machine Learning
Machine LearningMachine Learning
Machine LearningKumar P
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithmRashid Ansari
 
Machine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksFrancesco Collova'
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachinePulse
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learningTonmoy Bhagawati
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Predictionsriram30691
 
2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic conceptsKrish_ver2
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and RegressionMegha Sharma
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 

What's hot (20)

Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Decision tree and random forest
Decision tree and random forestDecision tree and random forest
Decision tree and random forest
 
Lstm
LstmLstm
Lstm
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithm
 
Machine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural Networks
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Deep learning
Deep learningDeep learning
Deep learning
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learning
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Prediction
 
Deep learning
Deep learningDeep learning
Deep learning
 
Decision tree
Decision treeDecision tree
Decision tree
 
2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts
 
Random forest
Random forestRandom forest
Random forest
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
 
Machine learning
Machine learning Machine learning
Machine learning
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 

Similar to Machine Learning - Splitting Datasets

Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsSri Ambati
 
in5490-classification (1).pptx
in5490-classification (1).pptxin5490-classification (1).pptx
in5490-classification (1).pptxMonicaTimber
 
Top 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandryTop 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandrySri Ambati
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandrySri Ambati
 
Machine learning and_nlp
Machine learning and_nlpMachine learning and_nlp
Machine learning and_nlpankit_ppt
 
Machine Learning Explained and how apply lean startup to develop a MVP tool
Machine Learning Explained and how apply lean startup to develop a MVP toolMachine Learning Explained and how apply lean startup to develop a MVP tool
Machine Learning Explained and how apply lean startup to develop a MVP toolFranki Chamaki
 
Random Forest / Bootstrap Aggregation
Random Forest / Bootstrap AggregationRandom Forest / Bootstrap Aggregation
Random Forest / Bootstrap AggregationRupak Roy
 
Getting started with Machine Learning
Getting started with Machine LearningGetting started with Machine Learning
Getting started with Machine LearningGaurav Bhalotia
 
Machine Learning Interview Questions and Answers
Machine Learning Interview Questions and AnswersMachine Learning Interview Questions and Answers
Machine Learning Interview Questions and AnswersSatyam Jaiswal
 
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...Intel® Software
 
Lecture 10 - Model Testing and Evaluation, a lecture in subject module Statis...
Lecture 10 - Model Testing and Evaluation, a lecture in subject module Statis...Lecture 10 - Model Testing and Evaluation, a lecture in subject module Statis...
Lecture 10 - Model Testing and Evaluation, a lecture in subject module Statis...Maninda Edirisooriya
 
Introduction to Machine Learning concepts
Introduction to Machine Learning conceptsIntroduction to Machine Learning concepts
Introduction to Machine Learning conceptsStefano Dalla Palma
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Marina Santini
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsSri Ambati
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)Abhimanyu Dwivedi
 

Similar to Machine Learning - Splitting Datasets (20)

Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
in5490-classification (1).pptx
in5490-classification (1).pptxin5490-classification (1).pptx
in5490-classification (1).pptx
 
Top 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandryTop 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark Landry
 
Machine Learning_Unit 2_Full.ppt.pdf
Machine Learning_Unit 2_Full.ppt.pdfMachine Learning_Unit 2_Full.ppt.pdf
Machine Learning_Unit 2_Full.ppt.pdf
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark Landry
 
Machine learning and_nlp
Machine learning and_nlpMachine learning and_nlp
Machine learning and_nlp
 
Machine Learning Explained and how apply lean startup to develop a MVP tool
Machine Learning Explained and how apply lean startup to develop a MVP toolMachine Learning Explained and how apply lean startup to develop a MVP tool
Machine Learning Explained and how apply lean startup to develop a MVP tool
 
Random Forest Decision Tree.pptx
Random Forest Decision Tree.pptxRandom Forest Decision Tree.pptx
Random Forest Decision Tree.pptx
 
Random Forest / Bootstrap Aggregation
Random Forest / Bootstrap AggregationRandom Forest / Bootstrap Aggregation
Random Forest / Bootstrap Aggregation
 
Getting started with Machine Learning
Getting started with Machine LearningGetting started with Machine Learning
Getting started with Machine Learning
 
Machine Learning Interview Questions and Answers
Machine Learning Interview Questions and AnswersMachine Learning Interview Questions and Answers
Machine Learning Interview Questions and Answers
 
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
 
Lecture 10 - Model Testing and Evaluation, a lecture in subject module Statis...
Lecture 10 - Model Testing and Evaluation, a lecture in subject module Statis...Lecture 10 - Model Testing and Evaluation, a lecture in subject module Statis...
Lecture 10 - Model Testing and Evaluation, a lecture in subject module Statis...
 
Introduction to Machine Learning concepts
Introduction to Machine Learning conceptsIntroduction to Machine Learning concepts
Introduction to Machine Learning concepts
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
ANN - UNIT 3.pptx
ANN - UNIT 3.pptxANN - UNIT 3.pptx
ANN - UNIT 3.pptx
 
ANN - UNIT 3.pptx
ANN - UNIT 3.pptxANN - UNIT 3.pptx
ANN - UNIT 3.pptx
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)
 
Artificial Neural Networks , Recurrent networks , Perceptron's
Artificial Neural Networks , Recurrent networks , Perceptron'sArtificial Neural Networks , Recurrent networks , Perceptron's
Artificial Neural Networks , Recurrent networks , Perceptron's
 

More from Andrew Ferlitsch

Pareto Principle Applied to QA
Pareto Principle Applied to QAPareto Principle Applied to QA
Pareto Principle Applied to QAAndrew Ferlitsch
 
Whiteboarding Coding Challenges in Python
Whiteboarding Coding Challenges in PythonWhiteboarding Coding Challenges in Python
Whiteboarding Coding Challenges in PythonAndrew Ferlitsch
 
Object Oriented Programming Principles
Object Oriented Programming PrinciplesObject Oriented Programming Principles
Object Oriented Programming PrinciplesAndrew Ferlitsch
 
Python - Installing and Using Python and Jupyter Notepad
Python - Installing and Using Python and Jupyter NotepadPython - Installing and Using Python and Jupyter Notepad
Python - Installing and Using Python and Jupyter NotepadAndrew Ferlitsch
 
Natural Language Processing - Groupings (Associations) Generation
Natural Language Processing - Groupings (Associations) GenerationNatural Language Processing - Groupings (Associations) Generation
Natural Language Processing - Groupings (Associations) GenerationAndrew Ferlitsch
 
Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...
Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...
Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...Andrew Ferlitsch
 
Machine Learning - Introduction to Recurrent Neural Networks
Machine Learning - Introduction to Recurrent Neural NetworksMachine Learning - Introduction to Recurrent Neural Networks
Machine Learning - Introduction to Recurrent Neural NetworksAndrew Ferlitsch
 
Machine Learning - Introduction to Convolutional Neural Networks
Machine Learning - Introduction to Convolutional Neural NetworksMachine Learning - Introduction to Convolutional Neural Networks
Machine Learning - Introduction to Convolutional Neural NetworksAndrew Ferlitsch
 
Machine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural NetworksMachine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural NetworksAndrew Ferlitsch
 
Python - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesPython - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesAndrew Ferlitsch
 
Machine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixAndrew Ferlitsch
 
Machine Learning - Ensemble Methods
Machine Learning - Ensemble MethodsMachine Learning - Ensemble Methods
Machine Learning - Ensemble MethodsAndrew Ferlitsch
 
ML - Multiple Linear Regression
ML - Multiple Linear RegressionML - Multiple Linear Regression
ML - Multiple Linear RegressionAndrew Ferlitsch
 
ML - Simple Linear Regression
ML - Simple Linear RegressionML - Simple Linear Regression
ML - Simple Linear RegressionAndrew Ferlitsch
 
Machine Learning - Dummy Variable Conversion
Machine Learning - Dummy Variable ConversionMachine Learning - Dummy Variable Conversion
Machine Learning - Dummy Variable ConversionAndrew Ferlitsch
 
Machine Learning - Introduction to Tensorflow
Machine Learning - Introduction to TensorflowMachine Learning - Introduction to Tensorflow
Machine Learning - Introduction to TensorflowAndrew Ferlitsch
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningAndrew Ferlitsch
 
AI - Introduction to Dynamic Programming
AI - Introduction to Dynamic ProgrammingAI - Introduction to Dynamic Programming
AI - Introduction to Dynamic ProgrammingAndrew Ferlitsch
 

More from Andrew Ferlitsch (20)

AI - Intelligent Agents
AI - Intelligent AgentsAI - Intelligent Agents
AI - Intelligent Agents
 
Pareto Principle Applied to QA
Pareto Principle Applied to QAPareto Principle Applied to QA
Pareto Principle Applied to QA
 
Whiteboarding Coding Challenges in Python
Whiteboarding Coding Challenges in PythonWhiteboarding Coding Challenges in Python
Whiteboarding Coding Challenges in Python
 
Object Oriented Programming Principles
Object Oriented Programming PrinciplesObject Oriented Programming Principles
Object Oriented Programming Principles
 
Python - OOP Programming
Python - OOP ProgrammingPython - OOP Programming
Python - OOP Programming
 
Python - Installing and Using Python and Jupyter Notepad
Python - Installing and Using Python and Jupyter NotepadPython - Installing and Using Python and Jupyter Notepad
Python - Installing and Using Python and Jupyter Notepad
 
Natural Language Processing - Groupings (Associations) Generation
Natural Language Processing - Groupings (Associations) GenerationNatural Language Processing - Groupings (Associations) Generation
Natural Language Processing - Groupings (Associations) Generation
 
Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...
Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...
Natural Language Provessing - Handling Narrarive Fields in Datasets for Class...
 
Machine Learning - Introduction to Recurrent Neural Networks
Machine Learning - Introduction to Recurrent Neural NetworksMachine Learning - Introduction to Recurrent Neural Networks
Machine Learning - Introduction to Recurrent Neural Networks
 
Machine Learning - Introduction to Convolutional Neural Networks
Machine Learning - Introduction to Convolutional Neural NetworksMachine Learning - Introduction to Convolutional Neural Networks
Machine Learning - Introduction to Convolutional Neural Networks
 
Machine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural NetworksMachine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural Networks
 
Python - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesPython - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning Libraries
 
Machine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion Matrix
 
Machine Learning - Ensemble Methods
Machine Learning - Ensemble MethodsMachine Learning - Ensemble Methods
Machine Learning - Ensemble Methods
 
ML - Multiple Linear Regression
ML - Multiple Linear RegressionML - Multiple Linear Regression
ML - Multiple Linear Regression
 
ML - Simple Linear Regression
ML - Simple Linear RegressionML - Simple Linear Regression
ML - Simple Linear Regression
 
Machine Learning - Dummy Variable Conversion
Machine Learning - Dummy Variable ConversionMachine Learning - Dummy Variable Conversion
Machine Learning - Dummy Variable Conversion
 
Machine Learning - Introduction to Tensorflow
Machine Learning - Introduction to TensorflowMachine Learning - Introduction to Tensorflow
Machine Learning - Introduction to Tensorflow
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
AI - Introduction to Dynamic Programming
AI - Introduction to Dynamic ProgrammingAI - Introduction to Dynamic Programming
AI - Introduction to Dynamic Programming
 

Recently uploaded

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 

Recently uploaded (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 

Machine Learning - Splitting Datasets

  • 1. Regression Methods in Machine Learning Splitting Datasets Portland Data Science Group Andrew Ferlitsch Community Outreach Officer July, 2017
  • 2. Splitting Datasets • To use a dataset in Machine Learning, the dataset is first split into a training and test set. • The training set is used to train the model. • The test set is used to test the accuracy of the model. • Typically, split 80% training, 20% test.
  • 3. It’s About Training Machine Learning is about using data to train a model DATA Training Data Test Data Train Split Dataset into Training and Test Model Use Training Data to Train the Model Produce Model Test the Model Accuracy Determine Accuracy
  • 4. Serial Splitting of the Dataset • Simplest method of splitting data is to split it serially. • Take first 80% rows and put into training set. • Take remaining 20% rows and put into test set. import pandas as pd # pandas library dataset = pd.read_csv("Data.csv") # read in data as panda dataframe nrows = dataset.shape[ 0 ] # property shape[ 0 ] is the number of rows train = dataset.iloc[ 1: int(nrows * .8) , : ] 80% rows 20% rows all columns test = dataset.iloc[int(nrows * .8) +1, nrows, : ]
  • 5. Random Splitting of the Dataset • Another method is too pick rows at random. • Sci-kit learn has built-in method from sklearn.cross_validation import train_test_split ncols = dataset.shape[ 1 ] # property shape[ 0 ] is the number of columns # Assume label is last column in dataset X = dataset.iloc[ :, :-1 ] # X is all the features (exclude last column) y = dataset.iloc[ :, ncols ] # Y is the label (last column) # Split the data, with 80% train and 20% test X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0) split size seed for random Number generator
  • 6. Data Imbalance - Overfitting • If the training data is overly unbalanced, then the model will predict a non-meaningful result. • For example, if the model is a binary classifier (e.g., apple vs. pear), and nearly all the samples are of the same label (e.g., apple), then the model will simply learn that everything is a that label (apple). • This is called overfitting. To prevent overfitting, there needs to be a fairly equal distribution of training samples for each classification, or range if label is a real value.
  • 7. Data Imbalance - Overfitting Training Data (90% Apple, 10% Pear) Train In an imbalance, the model will fit itself to the imbalance, not the predictor. Nearly all samples are an Apple ModelPear Test Sample is a Pear Apple The model will most likely predict the pear as an apple.
  • 8. Cross Validation DATA Training Data Test Data Train Split Dataset into Training and Test Model Use Training Data to Train the Model Accuracy Predict Accuracy Training Data Validation Data Use Validation Data To predict accuracy of the model. Make adjustments to training method to improve accuracy Randomly pick split Repeat Process until Converge to Desired Accuracy
  • 9. K-Fold Cross Validation • K-Fold is a well-known form of cross validation. • Steps: 1. Partition the dataset into k equal sized partitions. 2. Select one partition as the validation data. 3. Use the remaining k-1 as the training data. 4. Train the model and determine accuracy from the validation data. 5. Repeat the process k times, selecting a different partition each time for the validation data. 5. Average the accuracy results.