A Beginner’s Guide to Machine Learning with Scikit-Learn
Sarah Guido
PyTennessee 2014
All about me
• Grad student at the University of Michigan

• Data analyst for HathiTrust
• Organizer of Ann Arbor PyLadies chapter
My talk
• Machine learning and scikit-learn
• Supervised and unsupervised learning
• Preprocessing, validation and testing, strategies for machine learning
What is machine learning?
• Application of algorithms that learn from examples
• Representation and generalization
Why should we care?
• Useful in everyday life
• Email spam, handwriting analysis, stock market analysis, Netflix
• Especially useful in data analysis
• Feature extraction, linear regression, classification, clustering
Machine Learning Vocab
• Instance

• Feature
• Class
• Categorical
• Nominal
• Ordinal
• Continuous
Machine Learning Vocab
[Figure: example with a feature, a class, and an instance labeled]
Scikit-Learn
• Machine learning module

• Open-source
• Built-in datasets
• Good resources for learning
Scikit-Learn
• model = EstimatorObject()
• model.fit(dataset.data, dataset.target)
• dataset.data = feature values, one row per instance
• dataset.target = labels
• model.predict(dataset.data)
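The create, fit, predict pattern on this slide can be sketched as follows. This is a minimal illustration, not the code from the talk: the choice of the built-in iris dataset and of GaussianNB as the estimator are assumptions made here so the example runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Assumption: iris stands in for "dataset"; any built-in dataset works the same way.
dataset = load_iris()

# Assumption: GaussianNB stands in for "EstimatorObject".
model = GaussianNB()

# fit() learns from the feature values (data) and the labels (target).
model.fit(dataset.data, dataset.target)

# predict() returns one predicted label per instance.
predictions = model.predict(dataset.data)
print(predictions[:5])
```

Swapping in another estimator should only require changing the line that creates the model, since scikit-learn estimators share the fit and predict methods.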
Scikit-Learn
• Supervised

• Unsupervised
• Semi-supervised
• Reinforcement learning

• Neural networks
• …and many more!
Supervised learning
• Labeled data

• You know what you’re looking for
• Classification: predict categorical labels
• Regression: predict continuous target variables
Classification
• Categorical variables

• Relationship between instance and feature
• Classification algorithms == classifiers
Classification
• Naïve Bayes classifier
• Assumes features are independent
• Fast performance
• Decent classifier
Classification
• Car evaluation dataset (UCI)
• Features: buying price, maintenance price, number of doors, number of seats, size of the trunk, and safety ranking
• Labels: unacceptable, acceptable, good, or very good
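A rough sketch of classifying cars with Naïve Bayes is shown here. The rows are hypothetical and only imitate the style of the car evaluation dataset; they are not taken from it. CategoricalNB and OrdinalEncoder are choices made for this illustration, since the slides do not say which Naïve Bayes variant or encoding was used.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows: buying, maintenance, doors, seats, trunk size, safety.
X_raw = [
    ["vhigh", "vhigh", "2", "2", "small", "low"],
    ["high", "high", "2", "2", "small", "low"],
    ["med", "med", "4", "4", "med", "med"],
    ["low", "med", "4", "4", "med", "high"],
    ["low", "low", "4", "more", "big", "high"],
    ["med", "low", "4", "more", "big", "high"],
]
# Hypothetical labels, abbreviated: unacceptable, acceptable, very good.
y = ["unacc", "unacc", "acc", "acc", "vgood", "vgood"]

# Turn each string category into an integer code, column by column.
encoder = OrdinalEncoder()
X = encoder.fit_transform(X_raw).astype(int)

# Fit a Naive Bayes classifier designed for categorical features.
model = CategoricalNB()
model.fit(X, y)

# Classify a car described with categories the encoder has already seen.
new_car = [["low", "low", "4", "more", "big", "high"]]
prediction = model.predict(encoder.transform(new_car).astype(int))
print(prediction[0])
```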
Unsupervised algorithms
• Unlabeled data

• You might have no idea what you’re looking for
• Clustering: splitting observations into groups
• Dimensionality reduction: flatten data to fewer dimensions
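As one possible illustration of dimensionality reduction, the sketch below projects the four iris features onto two dimensions. PCA is an assumption here; the slides do not name a specific technique.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Keep the two directions that capture the most variance in the data.
pca = PCA(n_components=2)
reduced = pca.fit_transform(iris.data)

# The labels in iris.target are never used: this is unsupervised.
print(iris.data.shape, "->", reduced.shape)
```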
Clustering
• Exploring the data

• Similar objects in the same group
• Distance between data points
Clustering
• K-means clustering

• Three steps
• Chooses initial cluster centers
• Assigns each data instance to its nearest cluster center
• Recalculates each cluster center as the mean of its assigned instances
• Efficient
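The K-means steps above can be tried with scikit-learn's KMeans estimator. This is a minimal sketch on synthetic data: the make_blobs call, the choice of three clusters, and the random_state values are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points drawn around three centers; the true labels are discarded.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# KMeans picks initial centers, then alternates between assigning points
# to the nearest center and recalculating each center until it converges.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # one row of coordinates per cluster
print(labels[:10])              # cluster index assigned to the first ten points
```

The number of clusters has to be chosen up front, which is worth keeping in mind when exploring data with no obvious grouping.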
Data preprocessing
• Encoding categorical features
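One way to encode categorical values is sketched below, using OneHotEncoder for a feature and LabelEncoder for class labels. The sample values are hypothetical, and the slides do not say which encoder the talk used.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical feature column: one safety ranking per car.
safety = [["low"], ["med"], ["high"], ["med"]]

# One-hot encoding creates one 0/1 column per category.
onehot = OneHotEncoder()
encoded = onehot.fit_transform(safety).toarray()
print(onehot.categories_[0])  # categories are stored in sorted order
print(encoded)

# Label encoding maps each class name to an integer.
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(["unacc", "acc", "good", "acc"])
print(label_encoder.classes_, labels)
```

One-hot encoding avoids implying an order between categories, while integer codes are usually enough for the target labels.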
Data preprocessing
• Split the dataset into training and test data
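Splitting a dataset might look like the sketch below. The iris dataset, the 25% test size, and the random_state value are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out a quarter of the instances for testing; the rest is for training.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

The model is then fit on the training portion only, so the test portion gives an estimate of how it handles data it has not seen.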
Validation and testing
• Model evaluation

• Cross-validation
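Cross-validation can be sketched with cross_val_score, as below. The iris dataset, GaussianNB, and the choice of five folds are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# Split the data into 5 folds; each fold is used once as the test set
# while the model is trained on the other four.
scores = cross_val_score(GaussianNB(), iris.data, iris.target, cv=5)

print(scores)         # one accuracy score per fold
print(scores.mean())  # average accuracy across folds
```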
Good strategies
• Avoid overfitting

• Use lots of data
• Intuition fails in high dimensions
My materials
• Scikit-learn.org documentation and tutorials

• Machine learning class at U of M
• Scikit-learn talks
Resources
• Scikit-learn documentation and tutorials
• scikit-learn.org/stable/documentation.html
• Other resources
• http://archive.ics.uci.edu/ml/datasets.html
• Mldata.org
• Videos
• Scikit-learn tutorial: http://vimeo.com/53062607
• Intro to scikit-learn: http://vimeo.com/72859487
Contact me!
• @sarah_guido

• Linkedin.com/sarahguido
• github.com/sarguido


Editor's Notes

  1. Pictures of both
  2. Snapshot of dataset
  3. Better explanation of functions
  4. Better explanation of functions