Machine Learning for Statisticians - Introduction

Introduction to Machine Learning for
Statisticians
ganesh.vigneswara@gmail.com, ganesh@ganeshniyer.com
Dr Ganesh Neelakanta Iyer
Industry Expert, Academician, Researcher, YouTuber, Kathakali Artist
http://ganeshniyer.com, https://www.linkedin.com/in/ganeshniyer/

About Me • Masters & PhD from National University of Singapore (NUS)
• Several years in Industry/Academia
• Architect, Manager, Technology Evangelist, Professor
• Talks/workshops in USA, Europe, Australia, Asia
• Cloud Computing, Game Theory, Machine Learning,
DevOps, SRE
• Kathakali Artist, Composer, Speaker, Traveler, YouTuber
https://bit.ly/MLPlaylistGanesh

Agenda
Introduction
• Artificial Intelligence
• AI vs ML
Machine Learning
• Introduction
• Types of ML
• Applications
• ML Algorithms
ML vs Statistics
ML resources
• Courses
• Data Sets
• Projects

DISCLAIMER
• I am NOT an expert in Machine Learning. I intend to share
some of what I know to help kick-start your interest
• I have been told that the audience is new to this area, so
this session is a GENTLE introduction to ML and what it means
for statisticians
• For everyone who was made to be here today, please enjoy the
Dilbert cartoons and pictures of countries I have been to

nCorona

https://interestingengineering.com/china-uses-drones-and-ai-robots-to-fight-the-coronavirus-outbreak
https://www.dailymail.co.uk/news/article-7948181/Chinese-hospitals-start-use-AI-powered-robots-treat-coronavirus-patients.html
https://techresider.com/robots/the-penaut-robot-which-takes-food-to-patients-isolated-by-the-coronavirus/
https://www.archyde.com/body-heat-detector-drones-china-makes-massive-use-of-technologies-to-contain-the-coronavirus/
http://engnews24h.com/corona-virus-drones-with-speakers-are-patrolling-in-china/

BlueDot, an AI company, issued its first alert on December 31st.
This was ahead of the US Centers for Disease Control and
Prevention, which made its own determination on January 6th.
https://www.forbes.com/sites/tomtaulli/2020/02/02/coronavirus-can-ai-artificial-intelligence-make-a-difference/#41dd3f555817

nCorona - AI
• “We are currently using natural language processing (NLP) and
machine learning (ML) to process vast amounts of unstructured
text data, currently in 65 languages, to track outbreaks of over
100 different diseases, every 15 minutes around the clock,” said
Kamran Khan, founder of BlueDot
• “If we did this work manually, we would probably need over a
hundred people to do it well. These data analytics enable health
experts to focus their time and energy on how to respond to
infectious disease risks, rather than spending their time and
energy gathering and organizing information.”
https://www.forbes.com/sites/tomtaulli/2020/02/02/coronavirus-can-ai-artificial-intelligence-make-a-difference/#41dd3f555817

Artificial Intelligence
• “The study of the modelling of human mental functions by
computer programs.” — Collins Dictionary
https://medium.com/life-of-a-technologist/what-would-the-managers-manage-in-the-age-of-ai-6a00c26df257

Artificial Intelligence
• AI is composed of two words: Artificial and Intelligence
• Anything that is not natural and is created by humans is artificial
• Intelligence means the ability to understand, reason, plan, etc.
• So any code, technology or algorithm that enables a machine to mimic,
develop or demonstrate human cognition or behavior is AI

McDonald’s + Dynamic Yield
• McDonald’s thinks AI can help it sell more fast food to customers
• The company has announced that it is acquiring Dynamic Yield, an Israeli company
that uses AI to customise experiences
• McDonald's would use AI to tweak the menu options on the displays in the outlets,
based on factors such as the time of day, the weather outside and how busy the
restaurant is at the time
• If it is warm outside, the menu could offer more options for cold drinks such as
shakes, and perhaps more warm tea options if it is cold outside
• The system will also make recommendations in real-time for additional items that a
customer might want to order, based on what they had already ordered
https://www.news18.com/news/tech/a-burger-french-fries-and-some-artificial-intelligence-with-your-next-mcdonalds-order-2078213.html

Artificial Intelligence vs Machine Learning

AI vs ML
http://godigitalcrazy.com/artificial-intelligence-machine-learning-data-analytics/

Machine Learning
• Machine learning is the field of study that gives computers
the ability to learn without being explicitly programmed.
• In simple terms, machine learning means making
predictions based on data

Machine Learning
https://towardsdatascience.com/machine-learning-65dbd95f1603

A quick history: from intuition to machine learning
Timeline: Early 1900s, 1970s, 1990s, Now
• Intuition: using experience and judgement to predict outcomes
• Manual analysis: manual calculations to predict outcomes
• Statistical programming languages: writing code to construct
statistical models
• Visual statistical software: drag-and-drop workflows with menu-driven
commands to set up statistical analysis
• Automated machine learning: the software knows how to analyse
your data and does it for you
Slide credit: Edit

Why Machine Learning is Hard
You See Your ML Algorithm Sees

Why Machine Learning Is Hard, Redux
What is a “2”?

Why is machine learning hard?
Learning to identify an ‘apple’?
          Apple    Apple corporation    Peach
Colour    Red      White                Red
Type      Fruit    Logo                 Fruit
Shape     Oval     Cut oval             Round
Slide credit: Edit

So much for a cat.
Principle of machine learning
Slide credit: Edit

Google ML

Google Translate

Google Voice search

Google Photos

Gmail smart reply

Google Maps

Example 101

Example
• Suppose we want to create a
system that tells us the
expected weight of a person
based on their height
• First, we collect the data
• Each point on the graph
represents a data point
https://towardsdatascience.com/cousins-of-artificial-intelligence-dda4edc27b55

Example
• To start with, we will draw a
simple line to predict weight
based on height
• A simple line could be W = H - 100
• Where
– W = weight in kg
– H = height in cm
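The rule of thumb above can be tried out in a few lines. This is only a sketch: the (height, weight) pairs are invented for illustration, and the names `predict_weight`, `errors` and `mean_abs_error` are not from the slides.

```python
# Hypothetical (height in cm, weight in kg) pairs, invented for illustration.
data = [(150, 52), (160, 58), (170, 69), (180, 81), (190, 88)]


def predict_weight(height_cm):
    """Rule of thumb from the slide: W = H - 100."""
    return height_cm - 100


# Error = actual weight minus the weight the line predicts.
errors = [weight - predict_weight(height) for height, weight in data]

# Average size of the error, ignoring its sign.
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

print(errors)
print(mean_abs_error)
```

A smaller mean absolute error would mean the line sits closer to the observed points.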

Example
• This line can help us make
predictions
• Our main goal is to reduce the
distance between the estimated
value and the actual value, i.e. the
error
• To achieve this, we will draw
a straight line that fits as closely
as possible to all the points

Example
• Our main goal is to make the
error as small as possible
• Decreasing the error between the actual
and estimated values improves the
performance of the model, and
the more data points we collect, the
better our model will generally become
• So when we feed in new data (the height of
a person), the model can estimate the
weight of that person
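One common way to find the line with the smallest error is ordinary least squares, which minimizes the squared error. The sketch below assumes that choice (the slides do not name a method) and uses invented height and weight values; `fit_line` and `mean_squared_error` are illustrative names.

```python
def fit_line(heights, weights):
    """Ordinary least squares for weight = slope * height + intercept."""
    n = len(heights)
    mean_h = sum(heights) / n
    mean_w = sum(weights) / n
    sxy = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
    sxx = sum((h - mean_h) ** 2 for h in heights)
    slope = sxy / sxx
    intercept = mean_w - slope * mean_h
    return slope, intercept


def mean_squared_error(heights, weights, slope, intercept):
    """Average squared distance between actual and estimated weights."""
    residuals = [w - (slope * h + intercept) for h, w in zip(heights, weights)]
    return sum(r * r for r in residuals) / len(residuals)


# Hypothetical measurements, invented for illustration.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 81, 88]

slope, intercept = fit_line(heights, weights)
mse_fitted = mean_squared_error(heights, weights, slope, intercept)
mse_rule = mean_squared_error(heights, weights, 1.0, -100.0)  # W = H - 100

# Feed in new data: estimated weight for a person 175 cm tall.
predicted = slope * 175 + intercept
print(slope, intercept, mse_fitted, mse_rule, predicted)
```

On these numbers the fitted line has a lower squared error than the W = H - 100 rule, which is what least squares guarantees on the data it was fitted to.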

Types of Data
• Numerical
– Discrete, Continuous
– Interval, Ratio
• Categorical
– Nominal, Ordinal
• Time series
• Text
Resources: Datasets
• UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
• UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html
• Kaggle https://www.kaggle.com/
• India Govt ISRO Data Sets https://bhuvan.nrsc.gov.in/bhuvan_links.php
• NIST https://data.world/nist
• Statlib: http://lib.stat.cmu.edu/
• Delve: http://www.cs.utoronto.ca/~delve/

Generate your own set
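These slides show only figures, so the following is just one possible way to generate a small synthetic data set with the standard library. The height range, the W = H - 100 trend and the 5 kg noise level are all assumptions chosen for illustration.

```python
import random


def make_height_weight_data(n, seed=0):
    """Draw n hypothetical (height_cm, weight_kg) pairs around W = H - 100."""
    rng = random.Random(seed)  # fixed seed so the data set is reproducible
    rows = []
    for _ in range(n):
        height = rng.uniform(150, 190)
        noise = rng.gauss(0, 5)  # a spread of 5 kg is an arbitrary choice
        rows.append((round(height, 1), round(height - 100 + noise, 1)))
    return rows


dataset = make_height_weight_data(100)
print(dataset[:3])
```

Synthetic data like this is handy for checking that a model recovers a trend you planted yourself before moving on to real data.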

Dimensionality Reduction
• It is so easy and convenient to collect data
– An experiment
• Data is not collected only for data mining
• Data accumulates at an unprecedented speed
• Data preprocessing is an important part of effective machine
learning and data mining
• Dimensionality reduction is an effective approach to
downsizing data

Document Classification
Sources: Internet (Web Pages, Emails); Digital Libraries (ACM Portal, PubMed, IEEE Xplore)
■ Task: To classify unlabeled
documents into categories
■ Challenge: thousands of terms
■ Solution: to apply
dimensionality reduction
Documents (rows) by Terms (columns), with category C:
        T1    T2    ...   TN    C
D1      12    0     ...   6     Sports
D2      3     10    ...   28    Travel
...     ...   ...   ...   ...   ...
DM      0     11    ...   16    Jobs

Dimensionality Reduction
• Feature Selection: selecting the most relevant attributes
• Feature Extraction: combining attributes into a new, reduced set of
features
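The two approaches can be contrasted on a tiny term-count matrix. This is a sketch under assumptions: "most relevant" is approximated by highest variance, feature extraction is illustrated with the first principal component found by power iteration, and the fourth row of counts is invented.

```python
def column_variances(rows):
    """Variance of each column (attribute) across all rows."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    return [sum((x - m) ** 2 for x in c) / n for c, m in zip(cols, means)]


def select_features(rows, k):
    """Feature selection: keep the k original columns with the largest variance."""
    variances = column_variances(rows)
    ranked = sorted(range(len(variances)), key=lambda j: -variances[j])
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in rows]


def first_principal_component(rows, iterations=200):
    """Feature extraction: combine all columns into one new feature (top PCA direction)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):  # power iteration on the covariance matrix
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Term counts per document; the last row is hypothetical.
counts = [[12, 0, 6], [3, 10, 28], [0, 11, 16], [9, 1, 5]]

keep, selected = select_features(counts, k=2)
direction, scores = first_principal_component(counts)
print(keep, selected)
print(direction, scores)
```

Selection keeps a subset of the original term columns, while extraction replaces them with a new column that is a weighted combination of all terms.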

https://www.clariba.com/machine-learning-for-business

Types of ML Algorithms

Classification vs Regression
https://medium.com/@ali_88273/regression-vs-classification-87c224350d69

Classification
• A classification problem is when the output variable is a category,
such as “red” or “blue” or “disease” and “no disease”
• A classification model attempts to draw some conclusion from
observed values
• Given one or more inputs a classification model will try to predict the
value of one or more outcomes
https://developers.google.com/machine-learning/guides/text-classification/
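As a concrete illustration of predicting a category from inputs, here is a minimal nearest-centroid classifier. The technique is chosen only because it is short; the slides do not prescribe it, and the two-feature measurements and labels are invented.

```python
import math


def fit_centroids(points, labels):
    """Average the training points of each class to get one centroid per class."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for j, value in enumerate(point):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(point, centroids):
    """Predict the class whose centroid is closest to the point."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


# Hypothetical measurements with known outcomes.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (3.0, 3.2), (3.2, 2.9), (2.9, 3.1)]
labels = ["no disease"] * 3 + ["disease"] * 3

centroids = fit_centroids(points, labels)
print(classify((1.0, 1.0), centroids))
print(classify((3.1, 3.0), centroids))
```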

Regression
• A regression problem is when the output variable is a real or
continuous value, such as “salary” or “weight”
• Many different models can be used; the simplest is linear
regression
• It tries to fit the data with the best hyperplane, the one that passes
as close as possible to the points
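With more than one input, the fitted line becomes a hyperplane. A minimal sketch, assuming least squares solved through the normal equations; the two features and the noise-free targets (generated from 10 + 2a + 3b) are invented so the result is easy to check.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_hyperplane(features, targets):
    """Least squares via the normal equations (X^T X) beta = X^T y."""
    x = [[1.0] + list(row) for row in features]  # leading 1 is the intercept
    d = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(d)]
    return solve(xtx, xty)


# Hypothetical, noise-free data: target = 10 + 2 * a + 3 * b.
features = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 5)]
targets = [15, 17, 22, 27, 35]

coefficients = fit_hyperplane(features, targets)
print(coefficients)  # intercept first, then one weight per feature
```

Real data is noisy, so the hyperplane will not pass through every point; it only minimizes the total squared distance to them.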

Examples
• Regression vs Classification
– Predicting age of a person
– Predicting nationality of a person
– Predicting whether stock price of a company will increase tomorrow
– Predicting the gender of a person by his/her handwriting style
– Predicting house price based on area
– Predicting whether monsoon will be normal next year
– Predicting the number of copies of a music album that will be sold next month

Evaluation Metrics
• Accuracy
• Confusion Matrix
• Precision
• Recall / Sensitivity
• Specificity
• F1 Score
• Gain and Lift charts
• Root Mean Squared Error
• Root Mean Squared Logarithmic Error
• R-squared
• Cross-validation
• Gini coefficient
https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/
https://medium.com/thalus-ai/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b
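Several of these metrics follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the actual and predicted labels are invented, and the code assumes each class appears at least once so no denominator is zero.

```python
import math


def confusion_counts(actual, predicted, positive=1):
    """Return (true positives, true negatives, false positives, false negatives)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn


def classification_metrics(actual, predicted, positive=1):
    tp, tn, fp, fn = confusion_counts(actual, predicted, positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


def rmse(actual, predicted):
    """Root mean squared error, for regression outputs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical labels: 1 = positive class, 0 = negative class.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(actual, predicted)
error = rmse([3, 5, 7], [2, 5, 9])
print(metrics, error)
```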

Statistics vs ML
https://qph.fs.quoracdn.net/main-qimg-220b49a6aa9c221f5d44877ad1f6dfd7
https://www.unitedglobalgrp.com/wp-content/uploads/2018/05/machineLearning2-830x829.png

Statistics vs ML
• The major difference
between machine learning
and statistics is their purpose
• Machine learning models are
designed to make the most
accurate predictions possible
• Statistical models are
designed for inference about
the relationships between
variables
https://www.analyticsvidhya.com/blog/2015/12/hilarious-jokes-videos-statistics-data-science/

Statistics vs ML
https://towardsdatascience.com/the-actual-difference-between-statistics-and-machine-learning-64b49f07ea3

ML is built upon Statistics
• Machine learning involves data, and data has to be described
using a statistical framework
• Machine learning draws upon a large number of other fields
of mathematics and computer science, for example:
– ML theory from fields like mathematics & statistics
– ML algorithms from fields like optimization, matrix algebra,
calculus
– ML implementations from computer science & engineering
concepts (e.g. kernel tricks, feature hashing)

Both machine learning and statistics have the
same objective
Statistics        Machine Learning
Estimation        Learning
Classifier        Hypothesis
Data Point        Example / Instance
Regression        Supervised Learning
Classification    Supervised Learning
Covariate         Feature
Response          Label
https://www.kdnuggets.com/2016/11/machine-learning-vs-statistics.html

Methodological differences between machine
learning and statistics
• ML professional: “The model is 85% accurate in predicting
Y, given a, b and c.”
• Statistician: “The model is 85% accurate in predicting Y,
given a, b and c; and I am 90% certain that you will obtain
the same result.”
https://www.kdnuggets.com/2016/11/machine-learning-vs-statistics.html

How is statistics used in Machine Learning?
• Do you have outliers?
• Is your data independent or correlated?
• Is your data sample identically distributed?
• Is the metric you have used to evaluate your model the
best one?
• How confident are you about the produced results?
• How can you construct a confidence interval for your
results?
https://www.quora.com/How-statistics-is-used-in-Machine-Learning
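One way to answer the confidence interval question is the percentile bootstrap: resample the test results with replacement many times and look at the spread of the recomputed accuracy. The sketch assumes a hypothetical test set of 100 predictions, 85 of them correct; the number of resamples and the 90% level are arbitrary choices.

```python
import random


def bootstrap_accuracy_interval(correct, n_resamples=2000, level=0.90, seed=0):
    """Percentile bootstrap interval for accuracy.

    `correct` holds 1 for each correct prediction and 0 for each wrong one.
    """
    rng = random.Random(seed)
    n = len(correct)
    accuracies = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = accuracies[int(tail * n_resamples)]
    upper = accuracies[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical test results: 85 correct predictions out of 100.
correct = [1] * 85 + [0] * 15

lower, upper = bootstrap_accuracy_interval(correct)
print(lower, upper)
```

A wide interval is a sign that the test set is too small to pin the accuracy down.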

7 WAYS DATA SCIENTISTS
USE STATISTICS

1. Design and interpret experiments to inform
product decisions
Observation: Advertisement variant A has a 5% higher click-through rate than
variant B.
Let's say you're a national retailer and you're trying to test the effect of a new
marketing campaign. Data Scientists can help you decide which stores you
should assign to the experimental group to get a good balance between the
experimental and control groups, what sample size you should assign to the
experimental group to get clear results, and how to run the study spending as
little money as possible.
Statistics Used: Experimental Design, Frequentist Statistics (Hypothesis
Tests and Confidence Intervals)
https://www.quora.com/How-do-data-scientists-use-statistics
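Whether a 5% higher click-through rate is a real difference or noise can be checked with a hypothesis test. A minimal sketch, assuming a two-sided two-proportion z-test and invented counts (210 versus 200 clicks on 2,000 views each, i.e. a 5% relative lift):

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference between two click-through rates."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_a - p_b) / std_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: variant A at 10.5%, variant B at 10.0%.
z, p_value = two_proportion_z_test(210, 2000, 200, 2000)
print(z, p_value)
```

With these sample sizes the p-value is well above 0.05, so the observed lift alone would not be convincing evidence; this is where sample size planning comes in.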

2. Build models that predict signal, not noise
Observation: Sales in December increased by 5%.
Data Scientists can tell you potential reasons why sales have increased by
5%. Data scientists can help you understand what drives sales, what sales
could look like next month, and potential trends to pay attention to.
Statistics Used: Regression, Classification, Time Series Analysis, Causal
Analysis

3. Turn big data into the big picture
Observation: Some customers only buy healthy food, while others only buy
when there's a sale.
Data Scientists can help you label each customer, group them with similar
customers, and understand their buying habits. This allows you to see how
business developments can affect certain groups of the population, instead of
looking at everyone as a whole or looking at everyone individually.
Statistics Used: Clustering, Dimensionality Reduction, Latent Variable
Analysis
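Grouping similar customers can be sketched with k-means clustering. The example assumes two invented features per customer (share of healthy items bought, share of purchases made during a sale) and starts from two hand-picked centroids to keep the run deterministic.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(point, centroids[i])
            )
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    labels = [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]
    return centroids, labels


# Hypothetical customers: (share of healthy items, share bought on sale).
customers = [
    (0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
    (0.2, 0.9), (0.1, 0.8), (0.15, 0.95),
]

centroids, labels = k_means(customers, [customers[0], customers[3]])
print(centroids)
print(labels)
```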

4. Understand user engagement, retention,
conversion, and leads
Observation: A lot of people are signing up for our site and never coming
back.
Why do your customers buy items from your site? How do you keep your
clients coming back? Why are users dropping out of your funnel? When will
they come out next? What kinds of emails from your company are most
successfully engaging users? What are some leading indicators of
engagement, activity, or success? What are some good sales leads?
Statistics Used: Regression, Causal Effects Analysis, Latent Variable
analysis, Survey Design

5. Give your users what they want
Given a matrix of users (customers, clients, users), and their interactions
(clicks, purchases, ratings) with your company's items (ads, goods, movies),
can you suggest what items your users will want next?
Statistics Used: Predictive Modeling, Latent Variable Analysis, Dimensionality
Reduction, Collaborative Filtering, Clustering

6. Estimate intelligently
Observation: We have a banner with 100 impressions and 0 clicks.
Is 0% a good estimate of the click-through-rate?
Data Scientists can incorporate data, global data, and prior knowledge to get
a desirable estimate, tell you the properties of that estimate, and summarize
what the estimate means.
Statistics Used: Bayesian Data Analysis
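A small Bayesian sketch of the banner example: combine the observed clicks with a Beta prior and report the posterior mean. The prior used here, Beta(1, 99) with a mean of 1%, is a hypothetical stand-in for "global data" and is not taken from the slides.

```python
def posterior_click_rate(clicks, impressions, prior_clicks=1.0, prior_misses=99.0):
    """Posterior mean click-through rate under a Beta(prior_clicks, prior_misses) prior."""
    alpha = prior_clicks + clicks
    beta = prior_misses + (impressions - clicks)
    return alpha / (alpha + beta)


# The banner from the slide: 100 impressions and 0 clicks.
estimate = posterior_click_rate(clicks=0, impressions=100)
print(estimate)
```

The estimate is pulled below the prior mean by the data but stays above 0%, which is usually a more sensible answer than claiming the banner can never be clicked.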

7. Tell the story with the data
The Data Scientist's role in the company is to serve as the ambassador
between the data and the company. Communication is key, and the Data
Scientist must be able to explain their insights in a way that the company can get
on board with, without sacrificing the fidelity of the data.
The Data Scientist does not simply summarize the numbers, but explains why
the numbers are important and what actionable insights one can get from these.
The Data Scientist is the storyteller of the company, communicating the
meaning of the data and why it is important to the company.
Statistics Used: Presenting and Communicating Data, Data Visualization

Resources for
you to start….

Fun ML projects for beginners
• Machine Learning Gladiator
• Play Money Ball
• Predict Stock Prices
• Teach a Neural Network to Read Handwriting
• Investigate Enron
• Write ML Algorithms from Scratch
• Mine Social Media Sentiment
• Improve Health Care
https://elitedatascience.com/machine-learning-projects-for-beginners

Predict Stock Prices
https://elitedatascience.com/machine-learning-projects-for-beginners

Interesting ML projects to start trying
• Beginner Level
– Iris Data
– Loan Prediction Data
– Bigmart Sales Data
– Boston Housing Data
– Time Series Analysis Data
– Wine Quality Data
– Turkiye Student Evaluation Data
– Heights and Weights Data
• Intermediate Level
– Black Friday Data
– Human Activity Recognition Data
– Siam Competition Data
– Trip History Data
– Million Song Data
– Census Income Data
– Movie Lens Data
– Twitter Classification Data
• Advanced Level
– Identify your Digits
– Urban Sound Classification
– Vox Celebrity Data
– ImageNet Data
– Chicago Crime Data
– Age Detection of Indian Actors Data
– Recommendation Engine Data
– VisualQA Data
https://www.analyticsvidhya.com/blog/2018/05/24-ultimate-data-science-projects-to-boost-your-knowledge-and-skills/

ganesh@ganeshniyer.com
ganesh.vigneswara@gmail.com
http://ganeshniyer.com
https://www.linkedin.com/in/ganeshniyer/
https://bit.ly/MLPlaylistGanesh
