Basic machine learning background
with Python scikit-learn
Usman Roshan
Python
• An object-oriented, interpreted programming language
• Basic essentials
– Data types are numbers, strings, lists, and dictionaries
(hash-tables)
– For loops and conditionals
– Functions
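– Lists and hash-tables are references (like pointers in C)
– Arguments are passed by object reference: a function can modify a list or dictionary passed to it, but rebinding the parameter name does not affect the caller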
Python
• Simple Python programs
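The programs shown on this slide are not part of the extracted text. As a minimal sketch of the essentials listed earlier (all names and values are made up for illustration), a simple program might look like this:

```python
# Basic data types: numbers, strings, lists, and dictionaries.
count = 3
name = "scikit-learn"
values = [1.0, 2.5, 4.0]
ages = {"alice": 30, "bob": 25}

# A for loop with a conditional.
large = []
for v in values:
    if v > 2.0:
        large.append(v)

def mean(xs):
    """Return the arithmetic mean of a list of numbers."""
    return sum(xs) / len(xs)

def append_zero(xs):
    """Mutate the caller's list: xs refers to the same list object."""
    xs.append(0.0)

def rebind(xs):
    """Rebinding the local name xs does not change the caller's list."""
    xs = [99.0]
    return xs

append_zero(values)   # the caller's list gains one element
rebind(values)        # the caller's list is left unchanged
print(count, name, ages["alice"], large, mean(values), values)
```

The two helper functions illustrate why lists behave like references: mutation is visible to the caller, reassignment is not.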
Data science
• Data science
– Simple definition: reasoning and making decisions from
data
– Contains machine learning, statistics, algorithms,
programming, big data
• Basic machine learning problems
– Classification
• Linear methods
• Non-linear
– Feature selection
– Clustering (unsupervised learning)
– Visualization with PCA
Python scikit-learn
• Popular machine learning toolkit in Python
http://scikit-learn.org/stable/
• Requirements
– Anaconda
– Available from
https://www.continuum.io/downloads
– Includes numpy, scipy, and scikit-learn (former
two are necessary for scikit-learn)
Data
• We think of data as vectors in a fixed
dimensional space
• For example
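The example on this slide was a figure and is not recoverable. As a stand-in, here is a small sketch with hypothetical values, assuming numpy is used to hold the data: each row is one data point, and every row has the same number of coordinates.

```python
import numpy as np

# Hypothetical data: 4 points, each a vector in 2-dimensional space.
X = np.array([[1.0, 3.0],
              [2.0, 2.0],
              [3.0, 1.0],
              [4.0, 2.0]])

# One class label per row (only needed for supervised tasks).
y = np.array([1, 1, -1, -1])

n_points, n_dims = X.shape
print(n_points, n_dims)   # number of vectors and their dimension
print(X.mean(axis=0))     # mean vector, computed per coordinate
```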
Classification
• Widely used task: given data determine the
class it belongs to. The class will lead to a
decision or outcome.
• Used in many different places:
– DNA and protein sequence classification
– Insurance
– Weather
– Experimental physics: Higgs Boson determination
Linear models
• We think of a linear model
as a hyperplane in space
• The margin is the minimum distance from the hyperplane to the closest data point (misclassified points have negative distance)
• The support vector
machine is the hyperplane
with largest margin
[Figure: plot with axes x and y]
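The margin definition can be checked numerically. The sketch below assumes labels in {-1, +1} and a hyperplane written as w·x + w0 = 0; the data and the two candidate hyperplanes are hypothetical.

```python
import numpy as np

def signed_distances(X, y, w, w0):
    """Signed distance of each point to the hyperplane w.x + w0 = 0.
    Positive when the point lies on the side matching its label y."""
    return y * (X @ w + w0) / np.linalg.norm(w)

def margin(X, y, w, w0):
    """Margin of the hyperplane: the smallest signed distance."""
    return signed_distances(X, y, w, w0).min()

# Hypothetical data: two points per class.
X = np.array([[0.0, 2.0], [1.0, 3.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([1, 1, -1, -1])

good = margin(X, y, np.array([-1.0, 1.0]), 0.0)   # separates the classes
bad = margin(X, y, np.array([1.0, 0.0]), -0.5)    # misclassifies points
print(good, bad)
```

A hyperplane that misclassifies at least one point gets a negative margin, which is why maximizing the margin favors separating hyperplanes.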
Support vector machine:
optimally separating hyperplane
In practice we allow for error terms in case there is no separating hyperplane.
$$\min_{w,\, w_0,\, \xi} \left( \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \right)$$
$$\text{s.t.} \quad y_i (w^T x_i + w_0) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$
Optimally separating hyperplane
with errors
[Figure: plot with axes x and y, showing the vector w]
SVM on simple data
• Run SVM on example data shown earlier
• Solid line is SVM and dashed indicates margin
SVM in scikit-learn
• Datasets are taken from the UCI machine learning repository
• Learn an SVM model on training data
• Which parameter settings?
– C: tradeoff between error and model complexity
(margin)
– max_iter: maximum number of iterations of the optimization algorithm
• Predict on test data
SVM in scikit-learn
Analysis of SVM program on
breast cancer data
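The program on this slide is not in the extracted text. The following is a minimal sketch under stated assumptions: it uses the breast cancer dataset bundled with scikit-learn (`load_breast_cancer`), a linear SVM (`LinearSVC`), a 70/30 split, and feature scaling. None of these choices is confirmed by the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Assumption: the dataset bundled with scikit-learn stands in for the UCI file.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Scale features using statistics from the training data only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# C trades off training error against margin; max_iter caps solver iterations.
clf = LinearSVC(C=1.0, max_iter=10000)
clf.fit(X_train_s, y_train)

train_acc = clf.score(X_train_s, y_train)
test_acc = clf.score(X_test_s, y_test)
print(train_acc, test_acc)
```

Trying several values of C (for example 0.01, 1, 100) and comparing test accuracy is one way to explore the parameter question raised above.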
Non-linear classification
• In practice some datasets may not be linearly separable.
• Remember this may not be a big deal because the test error is more important than the training error
Non-linear classification
• Neural networks
– Create a new representation of the data where it
is linearly separable
– Large networks lead to deep learning
• Decision trees
– Use several linear hyperplanes arranged in a tree
– Ensembles of decision trees are state of the art
such as random forest and boosting
Decision tree
From Alpaydin, 2010
Combining classifiers by bagging
• A single decision tree can overfit the data and have poor generalization (high error on test data). We can mitigate this by bagging
• Bagging
– Randomly sample training data by bootstrapping
– Determine classifier Ci on sampled data
– Repeat the two steps above m times
– For final classifier output the majority vote
• Tree bagging works the same way
– Compute decision trees on bootstrapped datasets
– Return majority vote
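The bagging steps can be sketched by hand. This is an illustration, not the deck's program: the dataset (`load_breast_cancer`), the split, and m = 25 are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # labels are 0/1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
m = 25                 # number of bootstrap rounds (odd, so votes cannot tie)
n = len(X_train)

classifiers = []
for _ in range(m):
    # Bootstrap: sample n training rows with replacement.
    idx = rng.integers(0, n, size=n)
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    classifiers.append(tree)

# One row of 0/1 predictions per classifier, shape (m, n_test).
votes = np.array([c.predict(X_test) for c in classifiers])

# Majority vote: predict 1 when more than half of the trees say 1.
majority = (votes.sum(axis=0) > m / 2).astype(int)
bagged_acc = (majority == y_test).mean()
print(bagged_acc)
```

scikit-learn also ships `BaggingClassifier`, which wraps the same idea; the manual loop is used here only to mirror the listed steps.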
Variance reduction by voting
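• What is the variance of the average output of k classifiers X_1, …, X_k?
$$\mathrm{Var}\left(\frac{1}{k}\sum_{i=1}^{k} X_i\right) = \frac{1}{k^2}\left(\sum_{i=1}^{k}\mathrm{Var}(X_i) + 2\sum_{i<j}\mathrm{Cov}(X_i, X_j)\right)$$
• The covariance terms vanish when the classifiers are independent, so we want classifiers to be independent to minimize variance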
• Given independent binary classifiers, each with accuracy > ½, the majority vote accuracy increases as we increase the number of classifiers (Hansen and Salamon, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1990)
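The majority-vote claim can be checked with a short calculation. The sketch assumes k independent classifiers, each correct with the same probability p, and an odd k so that votes cannot tie; the value p = 0.6 is arbitrary.

```python
from math import comb

def majority_vote_accuracy(p, k):
    """Probability that more than half of k independent classifiers,
    each correct with probability p, are correct (k assumed odd)."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))

# Accuracy of the vote for a growing number of classifiers.
for k in (1, 3, 5, 25, 101):
    print(k, majority_vote_accuracy(0.6, k))
```

The sum is the upper tail of a binomial distribution. With p below ½ the same formula shows the vote getting worse as k grows, which is why the accuracy condition matters.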
Random forest
• In addition to sampling datapoints (feature
vectors) we also sample features (to increase
independence among classifiers)
• Compute many decision trees and output
majority vote
• Can also rank features
• An alternative to bagging is to select datapoints with probabilities that change as the algorithm runs (called boosting)
Decision tree and random forest in
scikit-learn
• Learn a decision tree and random forest on
training data
• Which parameter settings?
– Decision tree:
• Depth of tree
– Random forest:
• Number of trees
• Percentage of columns
• Predict on test data
Decision tree and random forest in
scikit-learn
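The program on this slide is not in the extracted text. A minimal sketch, assuming the bundled breast cancer dataset and illustrative parameter values: `max_depth` is the depth of the tree, `n_estimators` the number of trees, and `max_features` the fraction of columns considered at each split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Decision tree: the main setting is the depth of the tree.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# Random forest: number of trees and fraction of columns tried per split.
forest = RandomForestClassifier(
    n_estimators=100, max_features=0.3, random_state=0)
forest.fit(X_train, y_train)

tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)

# Feature ranking: column indices sorted from most to least important.
ranking = forest.feature_importances_.argsort()[::-1]
print(tree_acc, forest_acc, ranking[:5])
```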
Data projection
• What are the mean and variance here?
[Figure: scatter plot of data points; horizontal axis ticks 1 to 4, vertical axis ticks 1 to 3]
Data projection
• Which line maximizes variance?
[Figure: scatter plot of data points; horizontal axis ticks 1 to 4, vertical axis ticks 1 to 3]
Principal component analysis
• Find vector w of length 1 that maximizes
variance of projected data
PCA optimization problem
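$$\arg\max_{w} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - w^T m)^2 \quad \text{subject to } w^T w = 1$$
The optimization criterion can be rewritten as
$$\begin{aligned}
&\arg\max_{w} \frac{1}{n} \sum_{i=1}^{n} \left(w^T (x_i - m)\right)^2 \\
&= \arg\max_{w} \frac{1}{n} \sum_{i=1}^{n} \left(w^T (x_i - m)\right)^T \left(w^T (x_i - m)\right) \\
&= \arg\max_{w} \frac{1}{n} \sum_{i=1}^{n} \left((x_i - m)^T w\right)\left(w^T (x_i - m)\right) \\
&= \arg\max_{w} \frac{1}{n} \sum_{i=1}^{n} w^T (x_i - m)(x_i - m)^T w \\
&= \arg\max_{w} w^T \left(\frac{1}{n} \sum_{i=1}^{n} (x_i - m)(x_i - m)^T\right) w \\
&= \arg\max_{w} w^T \Sigma w \quad \text{subject to } w^T w = 1
\end{aligned}$$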
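The final form can be checked numerically. The slides do not state how the problem is solved; the sketch below relies on the standard result that the maximizer of w^T Σ w under w^T w = 1 is the eigenvector of Σ with the largest eigenvalue. The data are synthetic and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data, stretched so that one direction has more variance.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

m = X.mean(axis=0)
# Sigma = (1/n) * sum_i (x_i - m)(x_i - m)^T
Sigma = (X - m).T @ (X - m) / len(X)

# eigh returns eigenvalues of a symmetric matrix in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
w = eigvecs[:, -1]          # unit-length eigenvector, largest eigenvalue

# Variance of the projected data: (1/n) * sum_i (w^T x_i - w^T m)^2
projected_variance = np.var(X @ w)

# Any other unit vector should give a projected variance that is no larger.
u = rng.normal(size=2)
u /= np.linalg.norm(u)
other_variance = np.var(X @ u)
print(projected_variance, eigvals[-1], other_variance)
```

The projected variance along w equals the largest eigenvalue, which is the value of w^T Σ w at the maximizer.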
Dimensionality reduction and visualization with PCA
Dimensionality reduction and visualization with PCA
PCA plot of breast cancer data (output of program in previous slide)
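The program and plot referred to here are not in the extracted text. A minimal sketch, assuming the bundled breast cancer dataset and feature scaling before projection (both assumptions), that produces the two coordinates one would plot:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Assumption: scale features first so that no single column dominates.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                      # one 2-D point per sample
print(pca.explained_variance_ratio_)   # share of variance kept per component
```

The two columns of `X_2d` can be passed to any scatter-plot routine, with `y` used to color the points by class.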
Unsupervised learning - clustering
• K-means: a popular, fast algorithm for clustering data
• Objective: find k clusters that minimize the squared Euclidean distance of the points in each cluster to their centers (means); for k = 2:
$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2$$
K-means algorithm for two clusters
Input: $x_i \in \mathbb{R}^d,\ i = 1 \ldots n$
Algorithm:
1. Initialize: assign $x_i$ to $C_1$ or $C_2$ with equal probability and compute means:
$$m_1 = \frac{1}{|C_1|} \sum_{x_i \in C_1} x_i \qquad m_2 = \frac{1}{|C_2|} \sum_{x_i \in C_2} x_i$$
2. Recompute clusters: assign $x_i$ to $C_1$ if $\|x_i - m_1\| < \|x_i - m_2\|$, otherwise assign to $C_2$
3. Recompute means $m_1$ and $m_2$
4. Compute objective
$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2$$
5. Compute objective of new clustering. If difference is smaller than $\delta$ then stop, otherwise go to step 2.
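The two-cluster algorithm can be sketched directly in numpy. The function names, the default threshold, the seed, and the guard against an empty cluster are assumptions added for the example; they are not part of the slide.

```python
import numpy as np

def objective(X, labels, means):
    """Sum of squared distances of each point to its cluster mean."""
    return sum(((X[labels == i] - means[i]) ** 2).sum() for i in (0, 1))

def cluster_means(X, labels):
    """Mean vector of each of the two clusters."""
    return np.array([X[labels == i].mean(axis=0) for i in (0, 1)])

def two_means(X, delta=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random assignment with equal probability
    # (redrawn if one cluster happens to be empty).
    labels = rng.integers(0, 2, size=len(X))
    while labels.min() == labels.max():
        labels = rng.integers(0, 2, size=len(X))
    means = cluster_means(X, labels)
    prev = objective(X, labels, means)
    while True:
        # Step 2: assign to cluster 0 if strictly closer to its mean, else 1.
        dist = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = np.where(dist[:, 0] < dist[:, 1], 0, 1)
        if new_labels.min() == new_labels.max():
            return labels, means   # degenerate case: keep last valid clustering
        labels = new_labels
        # Step 3: recompute the means.
        means = cluster_means(X, labels)
        # Steps 4-5: stop when the objective improves by less than delta.
        obj = objective(X, labels, means)
        if prev - obj < delta:
            return labels, means
        prev = obj

# Hypothetical data: two groups of points around (0, 0) and (10, 10).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(10.0, 1.0, size=(50, 2))])
labels, means = two_means(X)
print(means)
```

The result depends on the random initialization, so in practice the algorithm is often run several times and the clustering with the lowest objective is kept.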
K-means in scikit-learn
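The program on this slide is not in the extracted text. A minimal sketch, assuming the bundled breast cancer dataset, scaled features, and two clusters (all assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# n_init restarts k-means from several initializations and keeps the best.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_scaled)

# Cluster ids are arbitrary, so compare against the true labels both ways.
match = (cluster_labels == y).mean()
agreement = max(match, 1.0 - match)
print(agreement, kmeans.inertia_)   # inertia_ is the k-means objective
```

The true labels are used only to evaluate the clustering afterwards; k-means itself never sees them.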
K-means PCA plot in scikit-learn
K-means PCA plot in scikit-learn
PCA plot of breast cancer data
colored by true labels
PCA plot of breast cancer data
colored by k-means labels
Conclusion
• We saw basic data science and machine learning
tasks in Python scikit-learn
• Can we handle very large datasets in Python
scikit-learn? Yes
– For space, use numpy arrays, which store raw fixed-width values: 1 byte for a char and 4 bytes for a 32-bit float or int. Plain Python lists use more space because every element is a full Python object
– For speed, use stochastic gradient descent in scikit-learn (doesn’t come with mini-batch though) and mini-batch k-means
• Deep learning stuff: Keras
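The space and speed points can be illustrated with a short sketch. The data are synthetic and the sizes are arbitrary; feeding `SGDClassifier` chunk by chunk through `partial_fit` is one assumed way to handle data that does not fit in memory, not necessarily what the deck had in mind.

```python
import sys
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import SGDClassifier

# Space: a Python list of floats versus a numpy float32 array.
n = 100_000
as_list = [float(i) for i in range(n)]
as_array = np.arange(n, dtype=np.float32)
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)
array_bytes = as_array.nbytes          # 4 bytes per float32 element
print(list_bytes, array_bytes)

# Speed: synthetic data with a linear decision rule (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20)).astype(np.float32)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Stochastic gradient descent, fed one chunk of rows at a time.
sgd = SGDClassifier(random_state=0)
for start in range(0, len(X), 1000):
    chunk = slice(start, start + 1000)
    sgd.partial_fit(X[chunk], y[chunk], classes=np.array([0, 1]))

# Mini-batch k-means updates the centers from small random batches.
mbk = MiniBatchKMeans(n_clusters=2, batch_size=1000, n_init=3, random_state=0)
mbk.fit(X)
print(sgd.score(X, y), mbk.cluster_centers_.shape)
```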
