Intro to scikit-learn
Michael Becker
PyData Boston 2013
Who is this guy?
Software Engineer @ AWeber
Founder of the DataPhilly Meetup group
@beckerfuffle
beckerfuffle.com
These slides and more @ github.com/mdbecker
On the shoulders of giants
• Machine Learning 101 tutorial from scikit-learn.
• IPython notebooks from PyCon 2013.
What is Machine Learning?
Data in scikit-learn
• Stored as a 2d-array
• [n_samples, n_features]
• n_samples: items to process
• n_features: distinct traits
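As a small illustration of the [n_samples, n_features] layout, here is a sketch using NumPy. The numbers are hypothetical toy values, not taken from the talk.

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows), each described by 3 features (columns).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
])

# The first axis counts items to process, the second counts distinct traits.
n_samples, n_features = X.shape
print(n_samples, n_features)
```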
The Iris Dataset
The Iris Dataset: Loading
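The loading code shown on the original slides did not survive extraction. A minimal sketch of how the dataset can be loaded with scikit-learn's bundled `load_iris` helper might look like this:

```python
from sklearn.datasets import load_iris

# load_iris returns a dictionary-like object with the data and its metadata.
iris = load_iris()

X = iris.data      # feature matrix, shape [n_samples, n_features]
y = iris.target    # integer class label for each sample

print(X.shape)
print(y.shape)
print(iris.feature_names)
print(iris.target_names)
```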
Machine Learning: Supervised
Machine Learning: Unsupervised
Scikit-learn's interface
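The slide body is missing here, so the following is only a sketch of the common estimator pattern in scikit-learn: `fit` learns from data, `transform` reshapes data, and `predict` produces labels. The choice of `StandardScaler` and `LogisticRegression` is an assumption made for illustration, not necessarily what the talk used.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformers: fit() learns parameters (here per-feature mean and variance),
# transform() applies them to data.
scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)

# Predictors: fit() learns from (X, y), predict() labels samples.
model = LogisticRegression(max_iter=1000)
model.fit(X_scaled, y)
predictions = model.predict(X_scaled[:5])
print(predictions)
```

Because every estimator exposes the same few methods, swapping one model for another usually means changing a single line.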
Feature Extraction
Often data is unstructured & non-numerical:
• Text documents
• Images
• Sounds
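For text documents, one common way to get numerical features is a bag-of-words count matrix. The sketch below uses scikit-learn's `CountVectorizer` on two made-up sentences; the sentences are hypothetical and only show the idea.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Each distinct word becomes one feature; each document becomes one row of counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix, shape [n_samples, n_features]

print(X.shape)
print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```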
Supervised Learning: Classification
• Email classification
• Language identification
• News article categorization
• Sentiment analysis
• Facial recognition
• ...
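A minimal classification sketch on the Iris data, assuming a k-nearest-neighbors classifier (the classifier used on the original slides is not recoverable from the text). The new measurement is a hypothetical flower.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Learn the mapping from measurements to species labels.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(iris.data, iris.target)

# Hypothetical new flower: sepal length, sepal width, petal length, petal width (cm).
new_flower = [[5.0, 3.6, 1.3, 0.25]]
predicted = clf.predict(new_flower)
print(iris.target_names[predicted[0]])
```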
Unsupervised Learning
Dimensionality Reduction
Principal Component Analysis
Unsupervised Learning: PCA
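A sketch of PCA as dimensionality reduction, projecting the four Iris features down to two components. The choice of two components is an assumption made so the result could be plotted in 2D.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# PCA is unsupervised: fit_transform only sees X, never the labels y.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)
# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
```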
Validation & Testing
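One way to test a model is to hold out part of the data and score only on samples the model never saw. This sketch assumes a current scikit-learn release, where `train_test_split` lives in `sklearn.model_selection` (in 2013 it was in `sklearn.cross_validation`); the classifier and the 25% split are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples; stratify keeps class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Accuracy on data that was not used for fitting.
test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)
```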
Overfitting
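To make overfitting concrete, here is a sketch on synthetic data where the labels are random and unrelated to the features, so there is nothing real to learn. The data and the unrestricted decision tree are assumptions chosen for illustration: the tree memorizes the training set but does no better than chance on held-out samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: random features, random binary labels (no real signal).
rng = np.random.RandomState(0)
X = rng.rand(400, 5)
y = rng.randint(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# With no depth limit, the tree keeps splitting until it fits every training sample.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print(train_accuracy, test_accuracy)
```

A large gap between training and test accuracy is the usual warning sign.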
Cross-Validation
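Cross-validation repeats the train/test split several times so that every sample is used for testing once. A minimal sketch with `cross_val_score`, assuming 5 folds and a k-nearest-neighbors classifier (both illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5)

# Fit and score the classifier on 5 different train/test partitions.
scores = cross_val_score(clf, X, y, cv=5)

print(scores)
print(scores.mean())
```

The mean of the fold scores is typically a steadier estimate than a single held-out split.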
Additional Resources
• Machine Learning 101 tutorial from scikit-learn.
• IPython notebooks from PyCon 2013.
My info
Tweet me @beckerfuffle
Find me at beckerfuffle.com
These slides and more @ github.com/mdbecker

More Related Content

What's hot

CS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I PPT IN PDF
CS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I  PPT  IN PDFCS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I  PPT  IN PDF
CS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I PPT IN PDF
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
Python
PythonPython
Ejb and jsp
Ejb and jspEjb and jsp
Ejb and jsp
rajshreemuthiah
 
SOFTWARE ENGINEERING - FINAL PRESENTATION Slides
SOFTWARE ENGINEERING - FINAL PRESENTATION SlidesSOFTWARE ENGINEERING - FINAL PRESENTATION Slides
SOFTWARE ENGINEERING - FINAL PRESENTATION Slides
Jeremy Zhong
 
Oracle REST Data Services: Options for your Web Services
Oracle REST Data Services: Options for your Web ServicesOracle REST Data Services: Options for your Web Services
Oracle REST Data Services: Options for your Web Services
Jeff Smith
 
DBT PU BI Lab Manual for ETL Exercise.pdf
DBT PU BI Lab Manual for ETL Exercise.pdfDBT PU BI Lab Manual for ETL Exercise.pdf
DBT PU BI Lab Manual for ETL Exercise.pdf
JanakiramanS13
 
Nginx internals
Nginx internalsNginx internals
Nginx internals
liqiang xu
 
Julien Maitrehenry - Docker, ça mange quoi au printemps
Julien Maitrehenry - Docker, ça mange quoi au printempsJulien Maitrehenry - Docker, ça mange quoi au printemps
Julien Maitrehenry - Docker, ça mange quoi au printemps
Web à Québec
 
Apache tomcat
Apache tomcatApache tomcat
Apache tomcat
Shashwat Shriparv
 
Maven Introduction
Maven IntroductionMaven Introduction
Maven Introduction
Sandeep Chawla
 
Maven ppt
Maven pptMaven ppt
Maven ppt
natashasweety7
 
UCS Management APIs A Technical Deep Dive
UCS Management APIs A Technical Deep DiveUCS Management APIs A Technical Deep Dive
UCS Management APIs A Technical Deep Dive
Cisco DevNet
 
Being Functional on Reactive Streams with Spring Reactor
Being Functional on Reactive Streams with Spring ReactorBeing Functional on Reactive Streams with Spring Reactor
Being Functional on Reactive Streams with Spring Reactor
Max Huang
 
Survey on Software Defect Prediction
Survey on Software Defect PredictionSurvey on Software Defect Prediction
Survey on Software Defect Prediction
Sung Kim
 
Python RESTful webservices with Python: Flask and Django solutions
Python RESTful webservices with Python: Flask and Django solutionsPython RESTful webservices with Python: Flask and Django solutions
Python RESTful webservices with Python: Flask and Django solutions
Solution4Future
 
Case Study of Convolutional Neural Network
Case Study of Convolutional Neural NetworkCase Study of Convolutional Neural Network
Case Study of Convolutional Neural Network
NamHyuk Ahn
 
[OpenInfra Days Korea 2018] Day 2 - E5: GPU on Kubernetes
[OpenInfra Days Korea 2018] Day 2 - E5: GPU on Kubernetes[OpenInfra Days Korea 2018] Day 2 - E5: GPU on Kubernetes
[OpenInfra Days Korea 2018] Day 2 - E5: GPU on Kubernetes
OpenStack Korea Community
 
Clean Architecture Applications in Python
Clean Architecture Applications in PythonClean Architecture Applications in Python
Clean Architecture Applications in Python
Subhash Bhushan
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
DigiGurukul
 
API Test Automation Using Karate (Anil Kumar Moka)
API Test Automation Using Karate (Anil Kumar Moka)API Test Automation Using Karate (Anil Kumar Moka)
API Test Automation Using Karate (Anil Kumar Moka)
Peter Thomas
 

What's hot (20)

CS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I PPT IN PDF
CS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I  PPT  IN PDFCS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I  PPT  IN PDF
CS8080 INFORMATION RETRIEVAL TECHNIQUES - IRT - UNIT - I PPT IN PDF
 
Python
PythonPython
Python
 
Ejb and jsp
Ejb and jspEjb and jsp
Ejb and jsp
 
SOFTWARE ENGINEERING - FINAL PRESENTATION Slides
SOFTWARE ENGINEERING - FINAL PRESENTATION SlidesSOFTWARE ENGINEERING - FINAL PRESENTATION Slides
SOFTWARE ENGINEERING - FINAL PRESENTATION Slides
 
Oracle REST Data Services: Options for your Web Services
Oracle REST Data Services: Options for your Web ServicesOracle REST Data Services: Options for your Web Services
Oracle REST Data Services: Options for your Web Services
 
DBT PU BI Lab Manual for ETL Exercise.pdf
DBT PU BI Lab Manual for ETL Exercise.pdfDBT PU BI Lab Manual for ETL Exercise.pdf
DBT PU BI Lab Manual for ETL Exercise.pdf
 
Nginx internals
Nginx internalsNginx internals
Nginx internals
 
Julien Maitrehenry - Docker, ça mange quoi au printemps
Julien Maitrehenry - Docker, ça mange quoi au printempsJulien Maitrehenry - Docker, ça mange quoi au printemps
Julien Maitrehenry - Docker, ça mange quoi au printemps
 
Apache tomcat
Apache tomcatApache tomcat
Apache tomcat
 
Maven Introduction
Maven IntroductionMaven Introduction
Maven Introduction
 
Maven ppt
Maven pptMaven ppt
Maven ppt
 
UCS Management APIs A Technical Deep Dive
UCS Management APIs A Technical Deep DiveUCS Management APIs A Technical Deep Dive
UCS Management APIs A Technical Deep Dive
 
Being Functional on Reactive Streams with Spring Reactor
Being Functional on Reactive Streams with Spring ReactorBeing Functional on Reactive Streams with Spring Reactor
Being Functional on Reactive Streams with Spring Reactor
 
Survey on Software Defect Prediction
Survey on Software Defect PredictionSurvey on Software Defect Prediction
Survey on Software Defect Prediction
 
Python RESTful webservices with Python: Flask and Django solutions
Python RESTful webservices with Python: Flask and Django solutionsPython RESTful webservices with Python: Flask and Django solutions
Python RESTful webservices with Python: Flask and Django solutions
 
Case Study of Convolutional Neural Network
Case Study of Convolutional Neural NetworkCase Study of Convolutional Neural Network
Case Study of Convolutional Neural Network
 
[OpenInfra Days Korea 2018] Day 2 - E5: GPU on Kubernetes
[OpenInfra Days Korea 2018] Day 2 - E5: GPU on Kubernetes[OpenInfra Days Korea 2018] Day 2 - E5: GPU on Kubernetes
[OpenInfra Days Korea 2018] Day 2 - E5: GPU on Kubernetes
 
Clean Architecture Applications in Python
Clean Architecture Applications in PythonClean Architecture Applications in Python
Clean Architecture Applications in Python
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
 
API Test Automation Using Karate (Anil Kumar Moka)
API Test Automation Using Karate (Anil Kumar Moka)API Test Automation Using Karate (Anil Kumar Moka)
API Test Automation Using Karate (Anil Kumar Moka)
 

Viewers also liked

Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learn
Jeff Klukas
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Arnaud Joly
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pôle Systematic Paris-Region
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learn
Yoss Cohen
 
Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
Damian R. Mingle, MBA
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learn
odsc
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
Chetan Khatri
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
Gael Varoquaux
 
Intro to scikit learn may 2017
Intro to scikit learn may 2017Intro to scikit learn may 2017
Intro to scikit learn may 2017
Francesco Mosconi
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learn
Asim Jalis
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
Gilles Louppe
 
Realtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnRealtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learn
AWeber
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-Learn
Kan Ouivirach, Ph.D.
 
Introduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnIntroduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learn
Matt Hagy
 
Machine learning with scikit-learn
Machine learning with scikit-learnMachine learning with scikit-learn
Machine learning with scikit-learn
Qingkai Kong
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
PyData
 
Converting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLConverting Scikit-Learn to PMML
Converting Scikit-Learn to PMML
Villu Ruusmann
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/Categorization
Oswal Abhishek
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
Gael Varoquaux
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-Learn
Gilles Louppe
 

Viewers also liked (20)

Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learn
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learn
 
Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learn
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
 
Intro to scikit learn may 2017
Intro to scikit learn may 2017Intro to scikit learn may 2017
Intro to scikit learn may 2017
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learn
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
Realtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnRealtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learn
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-Learn
 
Introduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnIntroduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learn
 
Machine learning with scikit-learn
Machine learning with scikit-learnMachine learning with scikit-learn
Machine learning with scikit-learn
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Converting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLConverting Scikit-Learn to PMML
Converting Scikit-Learn to PMML
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/Categorization
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-Learn
 

Similar to Intro to scikit-learn

Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
Travis Oliphant
 
Machine Learning - Classification
Machine Learning - ClassificationMachine Learning - Classification
Machine Learning - Classification
Vikram Nandini
 
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...
Keiichiro Ono
 
Laying the Foundation for Ionic Platform Insights on Spark
Laying the Foundation for Ionic Platform Insights on SparkLaying the Foundation for Ionic Platform Insights on Spark
Laying the Foundation for Ionic Platform Insights on Spark
Ionic Security
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
Víctor Zabalza
 
MetricMiner: Supporting Researchers in Mining Software Repositories - SCAM 2013
MetricMiner: Supporting Researchers in Mining Software Repositories - SCAM 2013MetricMiner: Supporting Researchers in Mining Software Repositories - SCAM 2013
MetricMiner: Supporting Researchers in Mining Software Repositories - SCAM 2013
Maurício Aniche
 
NDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data ScienceNDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data Science
Mark West
 
Neo4j Import Webinar
Neo4j Import WebinarNeo4j Import Webinar
Neo4j Import Webinar
Neo4j
 
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxBuilding a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
PyData
 
Python ml
Python mlPython ml
Python ml
Shubham Sharma
 
Sql Server 2008 Portfolio
Sql Server 2008 PortfolioSql Server 2008 Portfolio
Sql Server 2008 Portfolio
Eugene Kilpatrick
 
New M-Culture + Elementary WordPress
New M-Culture + Elementary WordPressNew M-Culture + Elementary WordPress
New M-Culture + Elementary WordPress
Sitdhibong Laokok
 
dbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchezdbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchez
GoDataDriven
 
Web scrapping - practical guide
Web scrapping - practical guideWeb scrapping - practical guide
Web scrapping - practical guide
SeeQuality.net
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Sven Kreiss, Lead Data Scientist, Wildcard at MLconf ATL - 9/18/15
Sven Kreiss, Lead Data Scientist, Wildcard at MLconf ATL - 9/18/15Sven Kreiss, Lead Data Scientist, Wildcard at MLconf ATL - 9/18/15
Sven Kreiss, Lead Data Scientist, Wildcard at MLconf ATL - 9/18/15
MLconf
 
Azure Data Factory v2
Azure Data Factory v2Azure Data Factory v2
Azure Data Factory v2
inovex GmbH
 
Scaling Up Presentation
Scaling Up PresentationScaling Up Presentation
Scaling Up Presentation
Jiaqi Xie
 
Machine learning on Hadoop data lakes
Machine learning on Hadoop data lakesMachine learning on Hadoop data lakes
Machine learning on Hadoop data lakes
DataWorks Summit
 
The Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data ImplementationThe Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data Implementation
Inside Analysis
 

Similar to Intro to scikit-learn (20)

Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Machine Learning - Classification
Machine Learning - ClassificationMachine Learning - Classification
Machine Learning - Classification
 
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc...
 
Laying the Foundation for Ionic Platform Insights on Spark
Laying the Foundation for Ionic Platform Insights on SparkLaying the Foundation for Ionic Platform Insights on Spark
Laying the Foundation for Ionic Platform Insights on Spark
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
 
MetricMiner: Supporting Researchers in Mining Software Repositories - SCAM 2013
MetricMiner: Supporting Researchers in Mining Software Repositories - SCAM 2013MetricMiner: Supporting Researchers in Mining Software Repositories - SCAM 2013
MetricMiner: Supporting Researchers in Mining Software Repositories - SCAM 2013
 
NDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data ScienceNDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data Science
 
Neo4j Import Webinar
Neo4j Import WebinarNeo4j Import Webinar
Neo4j Import Webinar
 
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxBuilding a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
 
Python ml
Python mlPython ml
Python ml
 
Sql Server 2008 Portfolio
Sql Server 2008 PortfolioSql Server 2008 Portfolio
Sql Server 2008 Portfolio
 
New M-Culture + Elementary WordPress
New M-Culture + Elementary WordPressNew M-Culture + Elementary WordPress
New M-Culture + Elementary WordPress
 
dbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchezdbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchez
 
Web scrapping - practical guide
Web scrapping - practical guideWeb scrapping - practical guide
Web scrapping - practical guide
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Sven Kreiss, Lead Data Scientist, Wildcard at MLconf ATL - 9/18/15
Sven Kreiss, Lead Data Scientist, Wildcard at MLconf ATL - 9/18/15Sven Kreiss, Lead Data Scientist, Wildcard at MLconf ATL - 9/18/15
Sven Kreiss, Lead Data Scientist, Wildcard at MLconf ATL - 9/18/15
 
Azure Data Factory v2
Azure Data Factory v2Azure Data Factory v2
Azure Data Factory v2
 
Scaling Up Presentation
Scaling Up PresentationScaling Up Presentation
Scaling Up Presentation
 
Machine learning on Hadoop data lakes
Machine learning on Hadoop data lakesMachine learning on Hadoop data lakes
Machine learning on Hadoop data lakes
 
The Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data ImplementationThe Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data Implementation
 

More from AWeber

ASCEND Content Marketing Power Tools
ASCEND Content Marketing Power ToolsASCEND Content Marketing Power Tools
ASCEND Content Marketing Power Tools
AWeber
 
ASCEND Multichannel Marketing Power Tools
ASCEND Multichannel Marketing Power ToolsASCEND Multichannel Marketing Power Tools
ASCEND Multichannel Marketing Power Tools
AWeber
 
Beginner's Guide to Marketing on Social Networks
Beginner's Guide to Marketing on Social NetworksBeginner's Guide to Marketing on Social Networks
Beginner's Guide to Marketing on Social Networks
AWeber
 
5 Content Blind Spots and How to Avoid Them
5 Content Blind Spots and How to Avoid Them5 Content Blind Spots and How to Avoid Them
5 Content Blind Spots and How to Avoid Them
AWeber
 
Digital Marketing Tips from Experts at the Top of the Summit
Digital Marketing Tips from Experts at the Top of the SummitDigital Marketing Tips from Experts at the Top of the Summit
Digital Marketing Tips from Experts at the Top of the Summit
AWeber
 
Data Processing with Mechanical Turk
Data Processing with Mechanical TurkData Processing with Mechanical Turk
Data Processing with Mechanical Turk
AWeber
 
5 WordPress Plugins that will Rock Your World
5 WordPress Plugins that will Rock Your World5 WordPress Plugins that will Rock Your World
5 WordPress Plugins that will Rock Your World
AWeber
 
How to Grow Your Email List Like the Pros
How to Grow Your Email List Like the ProsHow to Grow Your Email List Like the Pros
How to Grow Your Email List Like the Pros
AWeber
 
How to Create Killer Emails that Make Readers Love You
How to Create Killer Emails that Make Readers Love YouHow to Create Killer Emails that Make Readers Love You
How to Create Killer Emails that Make Readers Love You
AWeber
 
Breathing Life (and ROI) Back Into Your Email Marketing
Breathing Life (and ROI) Back Into Your Email MarketingBreathing Life (and ROI) Back Into Your Email Marketing
Breathing Life (and ROI) Back Into Your Email Marketing
AWeber
 
More Engagement, Less Effort: The Lowdown on Marketing Automation
More Engagement, Less Effort: The Lowdown on Marketing AutomationMore Engagement, Less Effort: The Lowdown on Marketing Automation
More Engagement, Less Effort: The Lowdown on Marketing Automation
AWeber
 
25 List Building Tricks: Ideas, Examples and Resources to Improve Your Email ROI
25 List Building Tricks: Ideas, Examples and Resources to Improve Your Email ROI25 List Building Tricks: Ideas, Examples and Resources to Improve Your Email ROI
25 List Building Tricks: Ideas, Examples and Resources to Improve Your Email ROI
AWeber
 
Email List-Building 101: How to Reel In New Readers with a Few Simple Steps
Email List-Building 101: How to Reel In New Readers with a Few Simple StepsEmail List-Building 101: How to Reel In New Readers with a Few Simple Steps
Email List-Building 101: How to Reel In New Readers with a Few Simple Steps
AWeber
 
30 Ideas in 30 Minutes: Top Holiday Marketing Ideas You Can Steal For 2012
30 Ideas in 30 Minutes: Top Holiday Marketing Ideas You Can Steal For 201230 Ideas in 30 Minutes: Top Holiday Marketing Ideas You Can Steal For 2012
30 Ideas in 30 Minutes: Top Holiday Marketing Ideas You Can Steal For 2012
AWeber
 
How To Get The Results You Want From An Email Campaign
How To Get The Results You Want From An Email CampaignHow To Get The Results You Want From An Email Campaign
How To Get The Results You Want From An Email Campaign
AWeber
 
Smart Email Marketing: Engage Your Customers and Grow Your Business
Smart Email Marketing: Engage Your Customers and Grow Your BusinessSmart Email Marketing: Engage Your Customers and Grow Your Business
Smart Email Marketing: Engage Your Customers and Grow Your Business
AWeber
 
Get More Email Subscribers
Get More Email SubscribersGet More Email Subscribers
Get More Email Subscribers
AWeber
 
Efficient Marketing: The Tools You Need and How to Use Them
Efficient Marketing: The Tools You Need and How to Use ThemEfficient Marketing: The Tools You Need and How to Use Them
Efficient Marketing: The Tools You Need and How to Use Them
AWeber
 
From Local Business to National Sensation
From Local Business to National SensationFrom Local Business to National Sensation
From Local Business to National Sensation
AWeber
 
Live h2gs
Live h2gsLive h2gs
Live h2gs
AWeber
 

More from AWeber (20)

ASCEND Content Marketing Power Tools
ASCEND Content Marketing Power ToolsASCEND Content Marketing Power Tools
ASCEND Content Marketing Power Tools
 
ASCEND Multichannel Marketing Power Tools
ASCEND Multichannel Marketing Power ToolsASCEND Multichannel Marketing Power Tools
ASCEND Multichannel Marketing Power Tools
 
Beginner's Guide to Marketing on Social Networks


Intro to scikit-learn

Editor's Notes

  1. Good morning, everyone. My name is Michael Becker. I work on the Data Analysis and Management team at AWeber, an email marketing company in Chalfont, PA. I'm also the founder of the DataPhilly Meetup group. You can find me online as @beckerfuffle on Twitter, at beckerfuffle.com, and as mdbecker on GitHub. I'll be posting the materials for this talk on my GitHub.
  2. So I want to start this talk by thanking those who came before me. None of the content from this talk is original. It's been influenced heavily by various other talks and resources around the web. This talk is based primarily on the "Machine Learning 101" tutorial from the scikit-learn documentation.
  3. Thanks also to Jake Vanderplas for creating an excellent set of IPython notebooks for PyCon 2013, which I've used for my code samples. This talk will only cover a subset of what's available in these resources, so I recommend you have a look at them to learn more about scikit-learn. I’m not currently a contributor to the scikit-learn project or in any way affiliated with it. I’m just a very happy user.
  4. Machine learning algorithms can figure out how to perform important tasks based on previously seen data. To illustrate this point, let's take a look at two simple machine learning tasks. This plot represents data of two types. One is colored red; the other is colored blue. A classification algorithm may be used to draw a dividing line between the two clusters of points. This task may seem simple, but it illustrates an important point. By drawing this separating line, we have created a model which can generalize to new data: if you drop another point onto the plane which is unlabeled, this model can predict whether it's a blue or a red point.
  5. This plot shows a series of values that appear correlated. A plot like this could for example represent the prices of houses on the y axis and the square footage of those houses on the x axis. We can pretty easily fit a line to this set of data. Again, this is an example of fitting a model to data, such that the model can make generalizations about new data. The model has been learned from the training data, and can be used to predict the result of test data: we might be given an x-value (square footage), and the model would allow us to predict the y value (price). Again, this might seem like a trivial task, but it is a basic example of the type of problem you can solve with Machine Learning.
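The house-price example above can be sketched in a few lines. This is a minimal illustration, not the code from the slides: the square footage and price values are made up, and the prices are generated from an exact line so the fitted model is easy to sanity-check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: square footage (x) and sale price (y).
# Prices follow an exact line (100 per sq ft + 50,000) purely for illustration.
square_feet = np.array([[800], [1200], [1500], [2000], [2400]])  # shape (5, 1)
price = 100.0 * square_feet.ravel() + 50000.0

# Learn the line from the training data.
model = LinearRegression()
model.fit(square_feet, price)

# Use the learned model to predict the price of an unseen 1,800 sq ft house.
predicted_price = model.predict(np.array([[1800]]))[0]
print(round(predicted_price))
```

Real housing data would be noisy, so the fitted line would only approximate the points rather than pass through them.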
  6. Data in scikit-learn is usually represented as a 2D array. The size of the array is expected to be [n_samples, n_features]. n_samples refers to the number of samples: each sample is an item to process. A sample can be a document, a picture, a sound, a row in a database, or anything you can describe with a fixed set of quantitative traits. n_features refers to the number of features, or distinct traits, that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-valued in some cases. You have to choose your features in advance, but you can have as few or as many features as you want. Some of your features can represent traits that are relatively rare in your data set. In that case, the feature is set to zero for samples where it is not found.
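As a small sketch of this layout, the array below holds three made-up samples with four features each. The values and the choice of which feature is "rare" are assumptions for illustration only.

```python
import numpy as np

# Three hypothetical samples (rows), each described by four numeric features (columns).
# The last feature is "rare": it is 0.0 for the samples where the trait is absent.
X = np.array([
    [5.1, 3.5, 1.4, 0.0],
    [4.9, 3.0, 1.4, 0.0],
    [6.3, 3.3, 6.0, 2.5],
])

n_samples, n_features = X.shape
print(n_samples, n_features)
```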
  7. As an example of a simple dataset, let's look at the iris dataset which comes with scikit-learn. The data consists of measurements of three different species of irises.
  8. The data contains 4 features for each sample. Each sample represents an individual flower. For each flower the features are: sepal length, sepal width, petal length, and petal width.
  9. The iris dataset also contains the species of flower which is one of 3 classes.
  10. scikit-learn embeds a copy of the iris data along with a helper function to load it into numpy arrays.
  11. The resulting dataset is a “Bunch” object: you can see what's available using the keys() method.
  12. The features of the sampled flowers are stored in the data attribute of the dataset. data is a 2D array of 150 samples by 4 features. Here we can see what an individual sample looks like.
  13. The information about the class of each sample is stored in the target attribute of the dataset. While data is a 2D array...
  14. target is a 1d array with 1 class per sample (150).
  15. The names of the classes are stored in the target_names attribute. This can be used to convert the numerical target values to a human readable format.
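The loading steps described in the last few notes might look roughly like this. It is a sketch reconstructed from the notes, not a copy of the slide code.

```python
from sklearn.datasets import load_iris

# Load the bundled iris data; the result is a Bunch (a dict-like object).
iris = load_iris()
print(sorted(iris.keys()))

# Features: a 2D array of 150 samples by 4 features.
print(iris.data.shape)
print(iris.data[0])  # measurements for the first flower

# Labels: a 1D array with one class index per sample.
print(iris.target.shape)

# Map the numeric class of the first sample to a human-readable name.
first_species = iris.target_names[iris.target[0]]
print(first_species)
```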
  16. The iris data has 4 features. We can’t easily visualize all 4 features plus the labels in a 2 or 3 dimensional graph. However, one method for visualizing this data is to plot two of the dimensions using a simple scatter plot. In this plot we’ve graphed sepal width on the y-axis and sepal length on the x-axis. The blue class seems reasonably distinct in this visualization. Unfortunately, it's hard to visually separate the green and the red classes using this technique.
  17. Now let’s explore the different types of machine learning. Machine learning can be broken into two broad categories: supervised learning and unsupervised learning. In Supervised Learning, we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features. Using our iris data as an example, we could try to predict the species of iris given a set of measurements of its flower. Supervised learning can be further broken down into two categories, classification and regression. In classification, the label is discrete, while in regression, the label is continuous. Our iris labels are discrete, there are only 3 possible values. Therefore predicting the species based on flower measurements would be a classification task.
  18. Unsupervised Learning addresses a different sort of problem. Here the data has no labels, and we are interested in finding similarities between the samples. You can think of unsupervised learning as a means of discovering labels from the data itself. Unsupervised learning comprises tasks such as dimensionality reduction and clustering. For example, in the iris data, we can use unsupervised methods to determine combinations of the features which are good at visualizing the structure of the data in 2 dimensions. We’ll see an example of this later. Sometimes you can even combine supervised and unsupervised learning. For example, unsupervised learning can be used to find useful features in the data, and then these features can be used within a supervised model.
  19. In scikit-learn, almost all operations are done through an estimator object. For example, a linear regression estimator can be instantiated as follows:
  20. Scikit-learn strives to have a uniform interface across all methods. Given a scikit-learn estimator object (named model), the following methods are available. All estimators have a fit method, which fits the model to a set of training data. Supervised estimators have a few additional methods. All supervised estimators have a predict method: given a trained model, this method predicts the label of a new set of data. For classification problems, some estimators also provide the predict_proba method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by the predict method. For classification or regression problems, most estimators implement a score method, where a larger score indicates a better fit. For classifiers, score reports the accuracy of the model, which lies between 0 and 1; for regressors, it reports the R² coefficient, which is at most 1 and can be negative for a poor fit. Unsupervised estimators have a few unique methods. The transform method transforms data into a new format. Some estimators implement the fit_transform method, which more efficiently performs a fit and a transform on the same input data.
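To make the shared interface concrete, here is a hedged sketch that exercises fit, predict, predict_proba, score, and fit_transform. The choice of LogisticRegression and PCA is an assumption made for illustration; the slides may have used other estimators.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Supervised estimator: fit, then predict / predict_proba / score.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
labels = model.predict(X[:3])               # predicted class per sample
probabilities = model.predict_proba(X[:3])  # one probability per class, per sample
accuracy = model.score(X, y)                # mean accuracy for a classifier

# Unsupervised estimator: fit_transform fits and transforms in one call.
reducer = PCA(n_components=2)
X_2d = reducer.fit_transform(X)

print(labels, probabilities.shape, X_2d.shape)
```

Note that the accuracy here is measured on the training data, which is only meant to show the method call; it says little about how the model generalizes.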
  21. Often, data does not come in a nice, structured CSV file where every column measures the same thing. Let’s explore some common methods for extracting features in these cases. Text documents: count the frequency of each word, or of each pair of consecutive words (n-grams), in each document. This approach is called Bag of Words.
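A Bag of Words representation can be sketched with scikit-learn's CountVectorizer. The two example documents are made up, and ngram_range=(1, 2) is chosen here so that both single words and pairs of consecutive words are counted.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents.
documents = ["the cat sat on the mat", "the dog chased the cat"]

# Count single words and pairs of consecutive words (unigrams and bigrams).
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(documents)  # sparse matrix: [n_samples, n_features]
counts = X.toarray()

# vocabulary_ maps each term to its column index in the count matrix.
the_column = vectorizer.vocabulary_["the"]
the_cat_column = vectorizer.vocabulary_["the cat"]
print(counts[:, the_column], counts[:, the_cat_column])
```

Each row is one document and each column one term, which is exactly the [n_samples, n_features] layout scikit-learn expects.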
  22. Extracting features from images: rescale the picture to a fixed size and take all the raw pixel values (with or without luminosity normalization); take some transformation of the image; or perform local feature extraction: split the image into small regions, perform feature extraction locally in each area, then combine all the features of the individual areas into a single array.
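Two of these strategies can be sketched with plain NumPy. The image below is random data standing in for a grayscale picture that has already been rescaled to 32x32, and the per-block mean and standard deviation are just one hypothetical choice of local feature.

```python
import numpy as np

# Stand-in for a grayscale image already rescaled to a fixed 32x32 size.
rng = np.random.default_rng(0)
image = rng.random((32, 32))

# Raw pixels: flatten the image into a single feature vector.
raw_features = image.ravel()  # 1024 features

# Local feature extraction: split into 8x8 regions, summarize each region,
# then combine the per-region summaries into one array.
block = 8
blocks = image.reshape(32 // block, block, 32 // block, block).swapaxes(1, 2)
# blocks[i, j] is the 8x8 region in block-row i, block-column j.
local_features = np.concatenate([
    blocks.mean(axis=(2, 3)).ravel(),  # mean intensity per region
    blocks.std(axis=(2, 3)).ravel(),   # contrast per region
])

print(raw_features.shape, local_features.shape)
```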
  23. Extracting features from Sounds: Same type of strategies as for images; the difference is it’s a 1D rather than 2D space.
  24. Now that we’ve covered all the basics, let’s train a classification model using the iris dataset. First let’s load the iris data like before.
  25. Let’s say that we were assigned the task of guessing the class of an individual flower given the measurements of its petals and sepals. This is a classification task. You’ll note that we’re using the uppercase variable X to represent our data and the lowercase variable y to represent our targets. These two names are conventional in the Machine Learning field, so you will likely see this format frequently. Once the data has this format it is trivial to train a classifier...
  26. ...for example let’s try out a support vector machine.
  27. The first thing to do is to create an instance of the classifier. This can be done simply by calling the class name, with any arguments that the object accepts.
  28. clf is a statistical model that has parameters that control the learning algorithm. Those parameters can be supplied by the user in the constructor of the model. Each estimator has different parameters. There are several methods for choosing good values for each parameter. I won’t cover these methods in this talk, but these are covered in Jake Vanderplas’ ipython notebooks.
  29. By default the model's fit parameters are not initialized. They will be tuned automatically from the data by calling the fit method with the data - X and labels - y
  30. We can now see some of the fit parameters within the classifier object. In scikit-learn, parameters defined during training have a trailing underscore.
  31. Once the model is trained, it can be used to predict the most likely outcome on new data. For instance, let us define a list of simple samples that look like the first sample of the iris dataset. Our model predicts that the sample is of class 0.
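The steps from the last several notes (instantiate, fit, inspect fitted parameters, predict) can be put together as follows. This is a sketch under assumptions: the notes only say "support vector machine", so the use of SVC with a linear kernel and C=1.0 is a guess at a reasonable configuration, not necessarily what the slides show.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# Create the classifier; constructor arguments control the learning algorithm.
clf = SVC(kernel="linear", C=1.0)

# Fit the model: parameters learned from the data get a trailing underscore.
clf.fit(X, y)
print(clf.support_vectors_.shape)

# Predict the class of a new sample resembling the first iris sample.
new_samples = [[5.1, 3.5, 1.4, 0.2]]
predicted = clf.predict(new_samples)
print(predicted, iris.target_names[predicted])
```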
  32. So now that we’ve trained our first model, let’s revisit the previous diagram and see where our fit and predict calls fit in. We can see that we called fit with our vectorized features. We were able to skip the feature-extraction step in this example because the iris dataset is already vectorized. We can see that once fit was called, we called predict on a new data point and got an expected label as output.
  33. Classification involves predicting an unknown category based on observed features. Let’s go over a few examples of interesting classification tasks. E-mail classification: labeling email as spam or ham. Language identification: labeling documents as English, Spanish, etc. News article categorization: labeling articles as business, technology, and so on. Sentiment analysis: labeling customer feedback as negative, neutral, or positive. Facial recognition: labeling images as matching or not matching a person.
  34. Let’s revisit unsupervised learning. The major difference between supervised and unsupervised learning is that in the case of unsupervised learning, our data is unlabeled. Previously we visualized the iris data by plotting pairs of dimensions. Here we will use an unsupervised dimensionality reduction algorithm to improve on our previous technique.
  35. Dimensionality reduction is the task of deriving a set of new abstract features that is smaller than the original feature set while retaining most of the variance of the original data. Here we'll use a common but powerful dimensionality reduction technique called Principal Component Analysis (PCA). We'll perform PCA on the iris dataset that we saw before. Since this is unsupervised learning, the target (y) will be unused; however, we'll use it later to visualize our results.
  36. PCA allows you to re-express a set of data points in terms of basic components that explain the most variance in the data. This is accomplished by combining the original features. If the number of retained components is 2 or 3, PCA can be used to visualize the dataset.
  37. We’ve used PCA to transform our original 4D data into 2D data.
  38. PCA, when whitening is enabled, centers and rescales the data, which means that the data is now centered on both components. The mean of both of the artificial components is 0...
  39. ... and the standard deviation is 1.
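A sketch of the PCA steps described above follows. It assumes the slides used two components with whitening enabled (whiten=True), since whitening is what gives each component unit standard deviation; that setting is an inference from the notes rather than something they state.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data  # the labels are not used: this is unsupervised

# Reduce 4 features to 2 components; whiten=True rescales each component.
pca = PCA(n_components=2, whiten=True)
X_pca = pca.fit_transform(X)

print(X_pca.shape)
print(np.round(X_pca.mean(axis=0), 6))         # close to 0 for both components
print(np.round(X_pca.std(axis=0, ddof=1), 3))  # close to 1 for both components

# Fraction of the original variance captured by each retained component.
print(pca.explained_variance_ratio_)
```

Plotting X_pca[:, 0] against X_pca[:, 1], colored by iris.target, would give a 2D view of the dataset.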
  40. Now we can visualize the iris dataset along the two new dimensions. Note that this visualization was generated without any information about the labels (y), which are represented by the colors: this is the sense in which the learning is unsupervised. Even so, we see that the projection gives us insight into the distribution of the different flowers: notably, the red class is much more distinct than the other two species. And even among the green and blue classes, there is a pretty good division line that can be drawn.
  41. The last thing we’ll cover in this talk is validation and testing. The most common mistake beginners make when training statistical models is to evaluate the quality of the model on the same data used for fitting the model.
  42. Here we're training the classifier with all the data.
  43. We’re getting pretty high accuracy with this model. Question: what might be the problem with this approach?
  44. The problem is that some models can be subject to overfitting: they can learn the training data by heart without generalizing. The symptoms are: the accuracy on the data used for training can be excellent (sometimes 100%), while the models do little better than random predictions when facing new data that was not part of the training set. If you evaluate your model on your training data, you won’t be able to tell whether your model is overfitting or not.
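To see why training accuracy is misleading, consider a model that effectively memorizes its training set. The sketch below uses a 1-nearest-neighbor classifier as a stand-in for such a model; this choice is an assumption for illustration and is not taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# With n_neighbors=1, each training sample's nearest neighbor is itself
# (or an identical point), so the model simply recalls the stored labels.
memorizer = KNeighborsClassifier(n_neighbors=1)
memorizer.fit(X, y)

# Scoring on the training data looks excellent but says nothing about new data.
train_accuracy = memorizer.score(X, y)
print(train_accuracy)
```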
  45. Learning the parameters of a prediction function and testing it on the same data is a mistake: a model that would just repeat the labels of the samples that it has seen would have a perfect score but would fail to predict anything useful on new data. To avoid overfitting, we have to define two different sets: a training set (X_train, y_train), which is used for training the model, and a testing set (X_test, y_test), which is used for evaluating the fitted model. In scikit-learn such a random split can be quickly computed with the train_test_split helper function.
  46. Using train_test_split, we can train on the training data... ...and test on the testing data.
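A hedged sketch of the split-then-evaluate workflow is shown below. It imports train_test_split from sklearn.model_selection, where it lives in current scikit-learn releases; at the time of the talk it was in a different module. The 25% test size, the random_state value, and the linear SVC are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# Hold out a quarter of the samples for testing; fix the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train only on the training set...
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# ...and evaluate only on data the model has never seen.
test_accuracy = clf.score(X_test, y_test)
print(len(X_train), len(X_test), test_accuracy)
```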
  47. There is an issue here, however: by defining these two sets, we significantly reduce the number of samples which can be used for training the model, and the results can depend on a particular random choice for the pair of (train, test) sets. A solution is to split the whole dataset a few times randomly into different training and testing sets, and to calculate the average value of the prediction scores obtained with the different sets. Such a procedure is called cross-validation. This approach can be computationally expensive, but does not waste too much data. Information on cross validation, and a lot of other awesome things which I haven’t covered can be found in the following resources.
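The repeated random splitting described above can be sketched with cross_val_score and a ShuffleSplit splitter. The number of splits, test size, seed, and classifier are assumptions chosen for illustration; scikit-learn also offers other strategies such as k-fold splitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# Five independent random train/test splits, each holding out 25% of the data.
splitter = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

# Fit and score a fresh copy of the classifier on each split.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=splitter)

# Average the per-split scores to get a more stable accuracy estimate.
mean_score = scores.mean()
print(scores, mean_score)
```

Averaging over several splits reduces the dependence on any single random partition, at the cost of training the model several times.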
  48. Thanks go to Jake Vanderplas for creating an excellent set of IPython notebooks for PyCon 2013, which I've used for my code samples.
  49. You can find me online as @beckerfuffle on Twitter, at beckerfuffle.com, and as mdbecker on GitHub. I'll be posting the materials for this talk on my GitHub.