ML MODULE 1_slideshare.pdf

SHIWANI GUPTA
MACHINE LEARNING

SYLLABUS
Introduction to Machine Learning (1) 6
Machine Learning terminology, Types of Machine Learning, Issues in Machine Learning, Application of Machine
Learning, Steps in developing ML application, How to choose the right algorithm
Data Preprocessing (3) 10
Data Cleaning (missing value, outlier), Exploratory Data Analysis (descriptive statistics, Visualization), Feature
Engineering (Data Transformation (encoding, skew, scale), Feature selection)
Supervised Learning with Regression (1) 5
Simple Linear, Multiple Linear, Polynomial, Overfit/Underfit, Regularization, Evaluation Metric, Use case
Supervised Learning with Classification (3) 12
k Nearest Neighbor, Logistic Regression, Linear SVM, Kernels, Decision Tree (CART), Issues in DT learning,
Ensembles (Bagging – Random Forest, Boosting – Gradient Boost), Evaluation metric, Use case
Optimization Techniques (2) 6
Model Selection techniques (Cross Validation), Gradient Descent Algorithm, Grid Search method, Model Evaluation
technique (Bias, Variance)
Unsupervised Learning with clustering and Reinforcement Learning (2) 6
k Means algorithm, Dimensionality Reduction, Use case, Elements of Reinforcement Learning, Temporal Difference
Learning, Online Learning, Use case

MODULE 1 (6 HOURS)
• Machine Learning terminology
• Types of Machine Learning
• Issues in Machine Learning
• Application of Machine Learning
• Steps in developing ML application
• How to choose the right algorithm

S/W AND H/W REQUIREMENT
16+ GB RAM, 4+ CORES, SSD storage, Amazon AWS, MS Azure, Google cloud
Python Data Science S/W stack (pip, conda)
NumPy – Linear Algebra
Pandas – Data read / process
Scikit-Learn – ML algo
Matplotlib – Visualization
Seaborn – more aesthetically pleasing
Plotly – interactive visualization library
t-SNE – high-dimensional visualization
statsmodels – statistical models
SciPy – optimization
Tkinter – GUI lib for python
PyTorch – open source framework
Keras – high level API and open source framework
TensorFlow – open source framework
Theano – multidim array manipulation
NLTK – human language data
BeautifulSoup – navigating webpage
Bokeh – interactive visualizations
TextBlob – process textual data
SHAP – SHapley Additive exPlanations
xAI – eXplainable AI
• IDE – Spyder, Jupyter notebook, PyCharm, Google Colab
PROJECT

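import numpy as np
a = np.array([1, 2, 3])  # rank-1 array
b = np.array([[1, 2, 3], [4, 5, 6]])  # rank-2 array
b.shape  # (rows, cols) -> (2, 3)
b[:2, 1:3]  # first 2 rows, columns 1 and 2
a.dtype  # data type, e.g. int64 or float64
v = np.array([1, 2, 3]); w = np.array([4, 5])
np.reshape(v, (3, 1)) * w  # broadcasting: (3, 1) * (2,) -> (3, 2)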
PROJECT
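import pandas as pd
df = pd.read_csv('data.csv')  # load a CSV file into a DataFrame
pd.DataFrame(mydataset)  # build a DataFrame from a dict of columns (mydataset is assumed to exist)
df.head(10)  # first 10 rows
df.tail()  # last 5 rows
df.dropna()  # returns a copy without rows that contain missing values
df.corr()  # pairwise correlation (columns must be numeric)
df.plot()  # line plot of the numeric columns (needs Matplotlib)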

PREREQUISITES
• Probability and Statistics (random variables, probability distributions, statistics – mean,
median, mode, variance, standard deviation, covariance, Bayes' theorem,
entropy)
• Linear Algebra (matrices, vectors, tensors, eigenvalues,
eigenvectors)
• Calculus (functions, derivatives of single variable and
multivariate functions)
• Python language
• Structured thinking, communication and problem solving
• Business understanding

WHY IS ML GETTING ATTENTION RECENTLY
This development is driven by a few underlying forces:
• The amount of data generated is increasing significantly with reduction in the cost of
sensors
• The cost of storing this data has reduced significantly
• The cost of computing has come down significantly
• Cloud has democratized compute for the masses
FUTURE

ML VS AUTOMATION
• If you are thinking that machine learning is nothing but a new name for automation – you
would be wrong. Most of the automation which has happened in the last few decades has
been rule-driven automation. For example – automating flows in our mailbox needs us to
define the rules. These rules act in the same manner every time.
• On the other hand, machine learning helps machines learn from past data and change their
decisions/performance accordingly. Spam detection in our mailboxes is driven by machine
learning. Hence, it continues to evolve with time.
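The contrast above can be sketched in a few lines. This is only an illustrative toy, not a real spam filter: the function names, the keyword rule and the example messages are all assumptions made for this sketch. The fixed rule behaves the same way every time, while the learned filter changes its decision once it has seen more labelled data.

```python
from collections import Counter

def rule_based_is_spam(message):
    """Rule-driven automation: a fixed, hand-written rule that never changes."""
    return "lottery" in message.lower()

def train_word_scores(labelled_messages):
    """Learn per-word spam evidence from (text, is_spam) pairs."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in labelled_messages:
        target = spam_counts if is_spam else ham_counts
        target.update(set(text.lower().split()))
    return spam_counts, ham_counts

def learned_is_spam(message, spam_counts, ham_counts):
    """Flag a message when its words were seen more often in spam than in ham."""
    words = set(message.lower().split())
    score = sum(spam_counts[w] - ham_counts[w] for w in words)
    return score > 0

# Hypothetical toy data, invented for illustration only.
history = [
    ("win the lottery now", True),
    ("meeting agenda for monday", False),
]
new_examples = [
    ("claim your free prize", True),
    ("free prize inside claim today", True),
    ("lunch on monday", False),
]

test_message = "your free prize is waiting"

before = learned_is_spam(test_message, *train_word_scores(history))
after = learned_is_spam(test_message, *train_word_scores(history + new_examples))

print(rule_based_is_spam(test_message))  # the fixed rule misses this message
print(before, after)                     # the learned filter changes with new data
```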
PROJECT

DEFINITION
“A computer program is said to learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with experience E” - Tom Mitchell
“Machine learning enables a machine to automatically learn from data, improve performance from experiences,
and predict things without being explicitly programmed.”
“Machine learning is a subfield of artificial intelligence, which enables machines to learn from past data or
experiences without being explicitly programmed.”
“Science of getting computers to act without explicit programming” - Arthur Samuel
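Tom Mitchell's definition can be made concrete with a small sketch. Everything here is an assumption chosen for illustration: the task T (labelling a reading as high or low), the threshold learner, and the example values. Performance P is accuracy on a fixed test set, and experience E is the number of labelled examples seen.

```python
def fit_threshold(examples):
    """Learn a decision threshold from labelled (value, label) pairs."""
    largest_negative = max(x for x, label in examples if label == 0)
    smallest_positive = min(x for x, label in examples if label == 1)
    return (largest_negative + smallest_positive) / 2

def accuracy(threshold, test_set):
    """Performance measure P: fraction of test points classified correctly."""
    correct = sum((x >= threshold) == bool(label) for x, label in test_set)
    return correct / len(test_set)

# Task T: decide whether a reading is "high" (label 1) or "low" (label 0).
# The true boundary (50) is hidden from the learner.
test_set = [(x, int(x >= 50)) for x in range(100)]

# Experience E: labelled examples, used two more at a time (hypothetical values).
experience = [(10, 0), (80, 1), (30, 0), (60, 1), (45, 0), (52, 1), (49, 0), (51, 1)]

scores = []
for n in (2, 4, 6, 8):
    threshold = fit_threshold(experience[:n])
    scores.append(accuracy(threshold, test_set))

print(scores)  # [0.95, 0.95, 0.99, 1.0]
```

In this toy run P does not drop as E grows, which is the behaviour the definition calls learning. Real learners are not guaranteed to improve at every step.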
EXAM

SCIENCE OF TEACHING MACHINES HOW TO LEARN BY THEMSELVES
Eg. the task of mopping and cleaning the floor.
• When a human does the task – the quality of outcome would vary. The human would get exhausted / bored after a few hours of
work. The human would also get sick at times. Depending on the place – it could also be hazardous or risky for a human.
• Machines can do high-frequency repetitive tasks with high accuracy without getting tired. If, in addition, we can teach
machines to detect whether the floor needs cleaning and mopping, and how much cleaning is required based on the condition and
type of the floor, they would be far better at the job, and they could keep doing it without
getting tired or sick!
• This is what Machine Learning aims to do - enable machines to learn on their own.
To do this, a machine must answer questions like:
• Does the floor need cleaning and mopping?
• For how long does the floor need to be cleaned?
• Machines need a way to think, and this is precisely where machine learning models help. The machine captures data from the
environment and feeds it to the machine learning model. The model then uses this data to predict whether the floor needs cleaning
and, if so, for how long.

HOW DO MACHINES LEARN
• Tasks difficult for humans can be very simple for machines. e.g. multiplying very large numbers.
• Tasks which look simple to humans can be very difficult for machines!
• You only need to demonstrate cleaning and mopping to a human a few times before they can perform it on
their own.
• But, that is not the case with machines. We need to collect a lot of data along with the desired outcomes in
order to teach machines to perform specific tasks.
• This is where machine learning comes into play. Machine Learning would help the machine understand the
kind of cleaning, the intensity of cleaning, and duration of cleaning based on the conditions and nature of the
floor.

TOOLS
Language
• R
• Python
• SAS
• Julia
• Java
• Scala
Database
• SQL
• Oracle
• Hadoop
Visualisation
• D3.js
• Tableau
• QlikView
FUTURE

TERMINOLOGY
• Dataset (training, validation, testing)
• .csv file
• Structured vs unstructured data
• predictor, target, explanatory, independent, dependent, response variable
• Instance
• Features (numerical, discrete, categorical, ordinal, nominal)
• Model
• Hypothesis
PROJECT

TYPES
• Supervised Learning – labelled (binary and multi class)
• Classification – discrete response eg. LoR, NB, kNN,
SVM, DT, RF, GBM, XGB, NN
Eg. spam filtering, waste classification
• Regression – continuous response eg. LR, SVR, DTR,
RFR
Eg. changes in temperature, stock price prediction
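The difference between the two supervised settings can be sketched with a nearest-neighbour learner (kNN, from the list above). The data and function names are hypothetical and chosen only for illustration. The same neighbour search yields a discrete label for classification and a continuous value for regression.

```python
from collections import Counter

def k_nearest(train, query, k):
    """Return the labels of the k training points closest to the query."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - query))
    return [label for _, label in ranked[:k]]

def knn_classify(train, query, k=3):
    """Classification: discrete response chosen by majority vote."""
    return Counter(k_nearest(train, query, k)).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: continuous response computed as the neighbours' mean."""
    neighbours = k_nearest(train, query, k)
    return sum(neighbours) / len(neighbours)

# Hypothetical labelled data: (number of suspicious words, label).
spam_train = [(0, "ham"), (1, "ham"), (2, "ham"), (7, "spam"), (8, "spam"), (9, "spam")]
# Hypothetical labelled data: (hour of day, temperature in degrees Celsius).
temp_train = [(6, 18.0), (9, 22.0), (12, 27.0), (15, 29.0), (18, 24.0)]

print(knn_classify(spam_train, 6))  # spam
print(knn_regress(temp_train, 13))  # mean of the 3 nearest readings
```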
EXAM

TYPES
• Unsupervised Learning - unlabelled
• Clustering eg. k means, hierarchical, NN
Eg. customer segmentation, city planning, cell phone tower for optimal signal reception
• Association eg. Apriori
Eg. diaper and beer, bread and milk
• Dimensionality Reduction eg. PCA, SVD
Eg. MNIST data (70000X784), face recognition (698X4096)
• Anomaly Detection eg. kNN, kMeans
Eg. Fraud detection, fault detection, outlier detection
• Semi supervised learning
• Speech Analysis, Web content classification, Google Expander
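The k-means entry in the clustering bullet above can be sketched on one-dimensional data. This is a minimal version under stated assumptions: the customer spend values and the starting centroids are invented, and a real application would use several features and a library implementation such as the one in Scikit-Learn.

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D data: assign to nearest centroid, then re-average."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical annual spend of eight customers; no labels are provided.
spend = [10, 12, 11, 13, 90, 95, 92, 91]
centroids, clusters = k_means_1d(spend, centroids=[0, 50])
print(centroids)  # [11.5, 92.0]
print(clusters)   # two customer segments found without any labels
```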
EXAM

TYPES
• Reinforcement Learning maximise cumulative reward eg. Q-Learning, SARSA, DQN
Eg. robotic dog, Tic Tac Toe
• Neural Network eg. recognise dog
• Deep Learning eg. chat bot, real time bidding, recommender system
• Natural Language Processing eg. Lemmatisation, Stemming
Eg. customer service complaints, virtual assistant
• Computer Vision eg. Canny edge detection, Haar Cascade classifier
Eg. skin cancer diagnosis, detect real time traffic, guided surgery
• Evolutionary Learning (GA, Optimisation algorithms)
Eg. Super Mario
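The Q-Learning entry in the list above can be sketched on a toy problem. The environment (five states on a line with a reward at the right end), the learning rate and the discount factor are assumptions chosen for illustration. To keep the run reproducible, the sketch sweeps over every state-action pair instead of exploring randomly as an agent normally would.

```python
N_STATES = 5          # states 0..4 on a line; state 4 is the goal (terminal)
ACTIONS = (0, 1)      # 0 = step left, 1 = step right
ALPHA, GAMMA = 0.5, 0.9

def step(state, action):
    """Deterministic toy environment: reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

q = [[0.0, 0.0] for _ in range(N_STATES)]

# For reproducibility, sweep every state-action pair instead of exploring randomly.
for _ in range(200):
    for state in range(N_STATES - 1):
        for action in ACTIONS:
            next_state, reward = step(state, action)
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])

policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(policy)  # [1, 1, 1, 1]: always move right, towards the reward
```

The learned values fall off by the discount factor with distance from the goal, which is how the cumulative reward is propagated backwards through the states.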
EXAM

ISSUES IN MACHINE LEARNING
• What are the existing algorithms for learning?
• When will an algorithm converge?
• Which algorithm performs best for which kind of problem?
• How much data is sufficient? e.g. training to classify cat and dog
• Non-representative training data e.g. exit polls during elections
• Poor quality of data e.g. outliers, missing values
• How many features are required? Irrelevant features
• Overfitting training data
• Underfitting training data
• Computation power? eg. GPU and TPU for ML and DL
• Interpretability of model? e.g. why a bank declined a customer's loan
• How to improve learning?
• Optimization vs Generalization?
• New and better algorithms required
• Need for more data scientists
EXAM

PROJECT IDEAS (40)
• Fraud detection
• Predict low oxygen level during surgery
• Recognise CVD factors
• Movie recommendation (Netflix)
• Marketing and Sales
• Weather prediction
• Traffic Prediction (Uber ATG)
• Loan defaulting prediction
• Handwriting recognition
• Sentiment analysis
• Human activity recognition
• Sports predictor
• Big Mart Sales prediction
• Fake news detection
• Disease prediction
• Stock market analysis
• Amazon Alexa
• Search Engine Optimization
• Auto-tagging and Friend suggestion (Facebook)
• Swiggy and Uber Eats
• House price prediction
• Market Analysis
• Handwritten digit recognition
• Equipment failure prediction
• Prospective insurance buyer
• Google News
• Video Surveillance
• Movie Ticket pricing system
• Object Detection
PROJECT

ML USE CASE IN SMARTPHONES
• From the voice assistant that sets your alarm and finds you the best restaurants to the simple
use case of unlocking your phone via facial recognition – Machine Learning is truly
embedded in our favourite devices.
• Voice Assistants
• Smartphone Cameras
• App Store and Play Store Recommendations
• Face Unlock
EXAM

ML USE CASE IN TRANSPORTATION
• The application of machine learning in the transport industry has gone to an entirely different
level in the last decade. This coincides with the rise of ride-hailing apps like Uber, Lyft, Ola,
etc. These companies use machine learning throughout their many products, from planning
optimal routes to deciding prices for the rides we take. So, let’s look at a few popular use
cases in transportation which use machine learning heavily.
• Dynamic Pricing in Travel
• Transporting and Commuting - Uber
• Google Maps
EXAM

ML USE CASE IN WEB SERVICES
• We interact with certain applications multiple times every day. What we perhaps did not
realize until recently is that most of these applications work thanks to the power and flexibility of
Machine Learning.
• Email Filtering
• Google Search
• Google Translate
• Facebook and LinkedIn Recommendations
EXAM

ML USE CASE IN SALES AND MARKETING
• Top companies in the world are using Machine Learning to transform their strategies from top
to bottom. The two most impacted functions? Marketing and Sales!
• These days if you’re working in the Marketing and Sales field, you need to know at least one
Business Intelligence tool (like Tableau or Power BI). Additionally, marketers are expected to
know how to leverage Machine Learning in their day-to-day role to increase brand
awareness, improve the bottom line, etc.
• Recommendation Engine
• Personalized Marketing
• Customer Support (Chatbots)
EXAM

ML USE CASE IN FINANCIAL DOMAIN
• Many jobs in Machine Learning are geared towards the financial domain. And that
makes sense! This is the ultimate numbers field. Until recently, many banking institutions
leaned on Logistic Regression (a simple machine learning algorithm) to crunch these
numbers.
• Fraud Detection
• Personalized Banking
EXAM

STEPS IN BUILDING AN ML APPLICATION
• Frame the business problem as an ML problem
• What is the main objective? What are we trying to predict?
• What are the target features?
• What is the input data? Is it available?
• What kind of problem are we facing? Binary classification? Clustering?
• What is the expected improvement?
• Define performance metric
• Regression problems use certain evaluation metrics such as Mean Squared Error (MSE).
• Classification problems use evaluation metrics such as Precision, Accuracy and Recall.
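The metrics named above follow directly from their definitions. The sketch below uses hypothetical predictions and helper names chosen for illustration; in practice Scikit-Learn provides equivalent functions in its metrics module.

```python
def mean_squared_error(y_true, y_pred):
    """MSE for regression: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def classification_report(y_true, y_pred, positive=1):
    """Accuracy, precision and recall for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical values, chosen only to make the arithmetic easy to follow.
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
report = classification_report([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
print(mse)     # (1 + 0 + 4) / 3
print(report)  # accuracy 6/8, precision 2/3, recall 2/3
```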
EXAM

• Gathering Data
• RSS feed, web scraping, API
• Generating Hypothesis
• Can our outputs be predicted given the inputs?
• Is our available data informative enough to learn the relationship between the inputs and the outputs?
• Exploratory Data Analysis (Visualisation for outlier)
• Data Preparation and cleaning (Missing Value)
• Delete the features or samples that contain missing values
• Missing value imputation
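The outlier check and the imputation step above might look like the following sketch. The "age" column is hypothetical, and both the 1.5 × IQR rule and the median fill are common conventions assumed here, not choices stated in the slides.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the box-plot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

def impute_missing(column, fill_value):
    """Replace missing entries (None) with a single fill value."""
    return [fill_value if v is None else v for v in column]

# Hypothetical "age" column with two missing entries and one implausible value.
ages = [22, 25, None, 27, 24, 26, 23, 120, None, 25]

observed = [a for a in ages if a is not None]
outliers = iqr_outliers(observed)
typical = [a for a in observed if a not in outliers]
# The median is used as the fill value because it is robust to extreme values.
filled = impute_missing(ages, statistics.median(typical))

print(outliers)  # [120]
print(filled)    # missing entries replaced by 25; the outlier is only flagged
```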
EXAM

• Feature Engineering (Encoding, Transformation)
• Mapping Ordinal features
• Encoding Nominal class labels
• Normalization, Standardization
• Define benchmark / baseline model (kNN, NB)
• Choose model
• Train/build Model (train:validation:test)
• Shuffle for classification
• For weather prediction, stock price prediction etc. data should not be shuffled, as the sequence of data is a crucial feature.
• Evaluate Model for Optimal Hyperparameters (cross validation)
• Tune Model (Grid search, Randomized search)
• Model testing and Deployment for prediction
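The cross validation and grid search steps above can be sketched together. The 1-D dataset, the kNN model and the candidate values of k are all assumptions made for this example. The folds are interleaved rather than shuffled, which is acceptable for independent samples but, as noted above, not for sequence data such as weather or stock prices.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(data, k, n_folds=3):
    """Mean validation accuracy over n_folds interleaved folds."""
    fold_scores = []
    for fold in range(n_folds):
        valid = [d for i, d in enumerate(data) if i % n_folds == fold]
        train = [d for i, d in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in valid)
        fold_scores.append(hits / len(valid))
    return sum(fold_scores) / n_folds

# Hypothetical 1-D data: two well-separated classes plus one noisy label (34, 1).
data = [(10, 0), (20, 0), (30, 0), (40, 0), (50, 0), (60, 0),
        (110, 1), (120, 1), (130, 1), (140, 1), (150, 1), (160, 1),
        (34, 1)]

# Grid search: evaluate every candidate hyperparameter value, keep the best.
scores = {k: cross_val_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)  # 3
```

With k = 1 the single noisy label misleads its neighbour, so a larger k scores better here. The final model would still be checked once on a held-out test set.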
EXAM

CHOICE OF RIGHT ALGORITHM
EXAM

STEPS FOR SELECTING RIGHT ML ALGO
• Understand your Data
• Type of data will decide algorithm
• Algo will decide no. of samples
Eg. NB will work with categorical data and is not sensitive to missing data
• Stats and Visualization to know your data
• Percentile helps to identify outlier, median to identify central tendency
• Box plot (outlier), Histogram (spread), Scatter plot (bivariate relationship)
• Clean data w.r.t Missing value
• Feature Engineering
• Encoding
• Feature creation
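The encoding step above, together with the usual scaling transforms, can be sketched as follows. The columns and helper names are hypothetical. Ordinal features keep their order as integers, nominal features become one 0/1 column per category, and numeric features are rescaled.

```python
def map_ordinal(values, order):
    """Ordinal feature: map each category to its rank in a known order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]

def one_hot(values):
    """Nominal feature: one 0/1 column per category (sorted alphabetically)."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]

def min_max_normalize(values):
    """Normalization: rescale to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def standardize(values):
    """Standardization: zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

# Hypothetical columns, invented for illustration.
sizes = ["M", "S", "L", "M"]
colours = ["red", "green", "red", "blue"]
prices = [10.0, 20.0, 30.0, 40.0]

print(map_ordinal(sizes, order=["S", "M", "L"]))  # [1, 0, 2, 1]
print(one_hot(colours))          # columns: blue, green, red
print(min_max_normalize(prices))
print(standardize(prices))
```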
EXAM

STEPS FOR SELECTING RIGHT ML ALGO
• Categorize the problem
• By I/P (supervised, unsupervised)
• By O/P (regression, classification, clustering, anomaly detection)
• Understand constraints (data storage capacity, real time applications, fast learning)
• Look for available algorithm (business goals met?, preprocessing required?, accuracy?, explainability?,
speed?, scalable?)
• Try each, assess and compare
• Optimize
• Evaluate performance
• Repeat if required
EXAM

CHOICE OF MODEL (USE CASE)
• Linear Regression: unstable with redundant feature
Eg. Sales prediction, Time for commuting
• Logistic Regression: not blackbox, works with correlated features
Eg. Fraud detection, Customer churn prediction
• Decision Tree: can handle outliers but overfit and take large memory
Eg. Bank loan defaulter, Investment decision
• SVM: memory intensive, hard to interpret and difficult to tune
Eg. Text classification, Handwritten character recognition
• NB: less training data required, low memory requirement, faster
Eg. Sentiment analysis, Recommender systems
• RF: works well with large data and high dimension
Eg. Predict loan defaulters, Predict patients for high risk
• NN: resource and memory intensive
Eg. Object Recognition, Natural Language Translation
• K-means: groups unlabelled data, but the no. of groups (k) must be chosen in advance
Eg. Customer Segmentation, Crime locality identification
• PCA: dimensionality reduction
Eg. MNIST digits
PROJECT

CHOICE OF METRIC
• Regression
• Mean Squared Error (MSE), Root MSE (RMSE)
• Mean Absolute Error (MAE) if outliers are present
• R-squared (R2)
• Classification
• Accuracy, LogLoss, ROC-AUC, Precision Recall
• Kappa score, MCC
• Unsupervised
• Mutual Information
• RAND index
• Reinforcement Learning
• Dispersion across Time
• Risk across Time
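The Kappa score and MCC from the classification list above are useful when classes are imbalanced, because accuracy alone can look good there. The sketch below computes both from a confusion matrix. The test set is hypothetical and the helper names are illustrative.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def cohen_kappa(tp, fp, fn, tn):
    """Agreement between prediction and truth, corrected for chance agreement."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return (observed - expected) / (1 - expected)

def matthews_corrcoef(tp, fp, fn, tn):
    """MCC: correlation between predicted and true labels (0.0 if undefined)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical imbalanced test set: 10 positives, 90 negatives.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 85

counts = confusion_counts(y_true, y_pred)
accuracy = (counts[0] + counts[3]) / len(y_true)
kappa = cohen_kappa(*counts)
mcc = matthews_corrcoef(*counts)
print(counts, accuracy)  # accuracy is 0.9 ...
print(kappa, mcc)        # ... yet both scores are only about 0.44
```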
PROJECT

PROJECT LAB ORIENTATION
Installing Anaconda and Python
Step-1: Download Anaconda Python: www.anaconda.com/distribution/
Step-2: Install Anaconda Python (Python 3.7 version): double click on the ".exe" file of
Anaconda
Step-3: Open Anaconda Navigator: use Anaconda Navigator to launch a Python IDE such as
Spyder or Jupyter Notebook
Step-4: Close the Spyder/Jupyter Notebook IDE.
https://colab.research.google.com
https://github.com
PROJECT

PROJECT TASK LIST
Study tool for implementation
Project title and Course identification
Choose data (Understand Domain and data)
Perform EDA
Perform Feature Engineering
Choose model
Train and Validate model
Tune Hyperparameters
Test and Evaluate model
Prepare Report
Prepare Technical Paper
Present Case Study
PROJECT

EXPECTATIONS
Case Study Presentation
Mini Project
Technical Paper
Report
Competition (Inhouse, Online)
PROJECT

CASE STUDY TITLES (31)
MNIST
MS-COCO
ImageNet
CIFAR
IMDB Reviews
WordNet
Twitter Sentiment Analysis
BreastCancer Wisconsin
BBC News
Wheat seeds
Amazon Reviews
Facial Image
Spam SMS
YouTube
Chars74K
WineQuality
IrisFlowers
LabelMe
HotpotQA
Ionosphere
xView
US Census
Boston House Price
BankNote authentication
PIMA Indian Diabetes
BBC Sport
Titanic
Santander Product Recommendation
Sonar
Swedish Auto Insurance
Abalone
PROJECT

BOOKS AND DATASET RESOURCES
• https://www.kaggle.com/datasets
• https://archive.ics.uci.edu/ml/index.php
• https://registry.opendata.aws/
• https://toolbox.google.com/datasetsearch
• https://msropendata.com/
• https://github.com/awesomedata/awesome-public-datasets
• Indian Government dataset
• US Government Dataset
• Northern Ireland Public Sector Datasets
• European Union Open Data Portal
• https://scikit-learn.org/stable/datasets/index.html
• https://data.world
• http://archive.ics.uci.edu/ml/datasets
• https://www.ehdp.com/vitalnet/datasets.htm
• https://www.data.gov/health/
• “Python Machine Learning”, Sebastian Raschka, Packt Publishing
• “Machine Learning in Action”, Peter Harrington, DreamTech Press
• “Introduction to Machine Learning”, Ethem Alpaydın, MIT Press
• “Machine Learning”, Tom M. Mitchell, McGraw Hill
• “Machine Learning: An Algorithmic Perspective”, Stephen Marsland, CRC Press
• “Machine Learning: A Probabilistic Perspective”, Kevin P. Murphy, MIT Press
• “Pattern Recognition and Machine Learning”, Christopher M. Bishop, Springer
• “The Elements of Statistical Learning”, Trevor Hastie, Robert Tibshirani, Jerome Friedman, Springer

LEARNING RESOURCES
• https://www.analyticsvidhya.com
• https://towardsdatascience.com
• https://analyticsindiamag.com
• https://machinelearningmastery.com
• https://www.datacamp.com
• https://www.superdatascience.com
• https://www.elitedatascience.com
• https://medium.com
• Siraj Raval youtube channel
• https://mlcontests.com
• https://www.datasciencechallenge.net
• https://www.machinehack.com
• https://www.hackerearth.com
• www.kaggle.com/competitions
• www.smartindiahackathon.gov.in
• www.datahack.analyticsvidhya.com
• www.daretocompete.com
• https://github.com

SUMMARY (SUMMATIVE ASSESSMENT)
• Examine steps in developing Machine Learning application with respect to your mini project. [10]
• Review the issues in Machine Learning. [10]
• State applicable use case for each ML algorithm. [10]
• Examine Applications of AI. [10]
• Illustrate steps for selecting right ML algorithm. [10]
• Define ML and differentiate between Supervised, Unsupervised and Reinforcement learning with the help of suitable examples. [10]
• Explain ML w.r.t. identifying Tasks, Experience and Performance measure (Tom Mitchell). [10]
• designing a checkers learning problem
• designing a handwriting recognition learning problem
• designing a Robot driving learning problem
• Illustrate with example how Supervised learning can be used in handling loan defaulters. [10]
• Explain Supervised Learning with neat diagram. [10]
EXAM

QUERIES?
THANK YOU
