A Beginner’s Guide to Machine Learning with Scikit-Learn
Sarah Guido
PyTennessee 2014
All about me
• Grad student at the University of Michigan

• Data analyst for HathiTrust
• Organizer of Ann Arbor PyLadies chapter
My talk
• Machine learning and scikit-learn

• Supervised and unsupervised learning
• Preprocessing, validation and testing, strategies for machine learning
What is machine learning?
• Application of algorithms that learn from examples
• Representation and generalization
Why should we care?
• Useful in everyday life
• Email spam, handwriting analysis, stock market analysis, Netflix
• Especially useful in data analysis
• Feature extraction, linear regression, classification, clustering
Machine Learning Vocab
• Instance

• Feature
• Class
• Categorical
• Nominal
• Ordinal
• Continuous
Machine Learning Vocab
[Figure: example data annotated with the labels Feature, Class, and Instance]
Scikit-Learn
• Machine learning module

• Open-source
• Built-in datasets
• Good resources for learning
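The vocabulary above can be seen in one of the built-in datasets. This is a minimal sketch, assuming a recent scikit-learn install; the iris dataset is chosen here only because it ships with the library, not because the talk uses it.

```python
from sklearn.datasets import load_iris

# Load a built-in dataset (assumption: iris, used only for illustration).
iris = load_iris()

# Each row of iris.data is one instance; each column is one feature.
n_instances, n_features = iris.data.shape
print(n_instances, n_features)

# The feature names; in this dataset every feature is continuous.
print(iris.feature_names)

# iris.target stores the class of each instance as an integer code,
# and iris.target_names maps each code to a readable class name.
first_instance = iris.data[0]
first_class = iris.target_names[iris.target[0]]
print(first_instance, first_class)
```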
Scikit-Learn
• Model = EstimatorObject()

• Model.fit(dataset.data, dataset.target)
• dataset.data = the feature values (one row per instance)
• dataset.target = labels
• Model.predict(dataset.data)
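The pattern above, written out as a runnable sketch. The estimator (KNeighborsClassifier) and the dataset (iris) are stand-ins picked for illustration; any scikit-learn estimator follows the same fit/predict shape.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# dataset.data holds the feature values, dataset.target holds the labels.
dataset = load_iris()

# Model = EstimatorObject(): here a k-nearest-neighbors classifier
# (an arbitrary choice for this sketch).
model = KNeighborsClassifier(n_neighbors=5)

# Learn from the labeled examples.
model.fit(dataset.data, dataset.target)

# Predict a label for each instance. Predicting on the training data
# is done here only to show the call; it is not a fair evaluation.
predictions = model.predict(dataset.data)
print(predictions[:5])
```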
Scikit-Learn
• Supervised

• Unsupervised
• Semi-supervised
• Reinforcement learning

• Neural networks
• …and many more!
Supervised learning
• Labeled data

• You know what you’re looking for
• Classification: predict categorical labels
• Regression: predict continuous target variables
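Classification gets its own example later in the talk, so here is a small regression sketch. The data is synthetic and generated only for this illustration (a line with slope 3 and intercept 2 plus noise); LinearRegression is one possible regressor, not necessarily the one used in the talk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration: y is roughly 3 * x + 2 with some noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=100)

# Fit a linear model to the labeled (continuous-target) examples.
reg = LinearRegression()
reg.fit(X, y)

# The learned slope and intercept should land near the values used above.
slope = reg.coef_[0]
intercept = reg.intercept_
print(slope, intercept)

# Predict a continuous target for a new, unseen input.
predicted = reg.predict(np.array([[4.0]]))
print(predicted)
```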
Classification
• Categorical variables

• Relationship between instance and feature
• Classification algorithms == classifiers
Classification
• Naïve Bayes classifier

• Features are independent
• Fast performance
• Decent classifier
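A minimal Naïve Bayes sketch, assuming a recent scikit-learn (where train_test_split lives in sklearn.model_selection). It uses GaussianNB on the built-in iris dataset as a stand-in; the variant and dataset are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# Hold out part of the data so the score reflects unseen instances.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0,
    stratify=iris.target)

# GaussianNB treats the features as independent given the class and
# models each one with a normal distribution.
clf = GaussianNB()
clf.fit(X_train, y_train)

# Fraction of held-out instances that were classified correctly.
accuracy = clf.score(X_test, y_test)
print(accuracy)
```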
Classification
• Car Evaluation dataset (UCI)
• Features: buying price, the maintenance price, the number of doors, the number of seats, the size of the trunk, and the safety ranking
• Labels: unacceptable, acceptable, good, or very good
Unsupervised algorithms
• Unlabeled data

• You might have no idea what you’re looking for
• Clustering: splitting observations into groups
• Dimensionality reduction: flatten data to fewer dimensions
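Dimensionality reduction can be sketched with principal component analysis (PCA). PCA is one common technique picked for this example; the slides do not name a specific method. The iris dataset is again just a convenient built-in.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Project the four original features down to two dimensions.
# No labels are passed in: this is unsupervised.
pca = PCA(n_components=2)
reduced = pca.fit_transform(iris.data)
print(iris.data.shape, "->", reduced.shape)

# Share of the original variance kept by each of the two new dimensions.
variance_kept = pca.explained_variance_ratio_
print(variance_kept)
```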
Clustering
• Exploring the data

• Similar objects in the same group
• Distance between data points
Clustering
• K-means clustering

• Three steps
• Chooses initial cluster centers
• Assigns each data instance to the nearest cluster center
• Recalculates each cluster center as the mean of its instances
• Efficient
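The steps above are what scikit-learn's KMeans runs internally, repeating the assign and recalculate steps until the centers settle. This is a sketch on synthetic data: the two blobs are generated here purely for illustration, and n_clusters=2 is chosen because we know there are two.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data for illustration: two well-separated blobs of 50 points.
rng = np.random.RandomState(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Fit k-means with two clusters. No labels are given to the algorithm.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# One center per cluster, in the same feature space as the data.
centers = kmeans.cluster_centers_
print(labels[:5], labels[-5:])
print(centers)
```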
Data preprocessing
• Encoding categorical features
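Most scikit-learn estimators expect numbers, so categorical features have to be encoded first. A minimal sketch follows, assuming scikit-learn 0.20 or later; the rows are made up for illustration (they are not taken from a real dataset). OrdinalEncoder suits ordinal features, OneHotEncoder suits nominal ones.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up rows for illustration: [buying price, safety ranking].
rows = np.array([
    ["low", "high"],
    ["high", "low"],
    ["med", "med"],
    ["low", "med"],
], dtype=object)

# Ordinal encoding: spell out the order so that low < med < high
# becomes 0 < 1 < 2 in each column.
levels = ["low", "med", "high"]
ordinal = OrdinalEncoder(categories=[levels, levels])
ordinal_codes = ordinal.fit_transform(rows)
print(ordinal_codes)

# One-hot encoding: one 0/1 column per category value, with no implied
# order. The result is sparse by default, so convert it for printing.
one_hot = OneHotEncoder()
one_hot_codes = one_hot.fit_transform(rows).toarray()
print(one_hot_codes)
```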
Data preprocessing
• Split the dataset into training and test data
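A sketch of the split, assuming a recent scikit-learn where train_test_split is in sklearn.model_selection (at the time of the talk it lived in a different module). The 80/20 ratio and the iris dataset are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Keep 20% of the instances aside for testing; the model should only
# ever be fit on the training portion.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
```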
Validation and testing
• Model evaluation

• Cross-validation
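Cross-validation repeats the train/test split several times and averages the scores, which gives a steadier estimate than a single split. A minimal sketch, assuming a recent scikit-learn; the estimator (GaussianNB), the dataset (iris), and the fold count (5) are all choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# 5-fold cross-validation: the data is split into five parts, and each
# part takes one turn as the test set while the rest is used for training.
scores = cross_val_score(GaussianNB(), iris.data, iris.target, cv=5)
print(scores)

# The mean over the folds is the usual summary number.
mean_accuracy = scores.mean()
print(mean_accuracy)
```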
Good strategies
• Avoid overfitting

• Use lots of data
• Intuition fails in high dimensions
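Overfitting shows up as a gap between training accuracy and test accuracy. The sketch below is one way to see it, using a decision tree with no depth limit on synthetic, deliberately noisy data. Both the model and the data are assumptions made for this illustration and do not come from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration; flip_y randomly reassigns about 20%
# of the labels, so a perfect fit to the training set means the model
# has memorized noise.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A tree with no depth limit can keep splitting until every training
# instance is classified correctly.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print(train_accuracy, test_accuracy)
```

A large gap between the two numbers is the warning sign; held-out test data and cross-validation are how you catch it.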
My materials
• Scikit-learn.org documentation and tutorials

• Machine learning class at U of M
• Scikit-learn talks
Resources
• Scikit-learn documentation and tutorials
• scikit-learn.org/stable/documentation.html
• Other resources
• http://archive.ics.uci.edu/ml/datasets.html
• Mldata.org
• Videos
• Scikit-learn tutorial: http://vimeo.com/53062607
• Intro to scikit-learn: http://vimeo.com/72859487
Contact me!
• @sarah_guido

• Linkedin.com/sarahguido
• github.com/sarguido

