MACHINE LEARNING
Evren Korpeoglu, Data Science
Aarthi Srinivasan, Product Management
/Productschool @ProdSchool /ProductmanagementSV
What to Expect?
• What is machine learning?
• Why is it important?
• How do we use it?
• Technical Concepts
• Examples
What is Machine Learning?
1. The science of getting computers to learn or recognize something without being explicitly programmed – Andrew Ng
• A branch of Artificial Intelligence, which is itself a branch of Computer Science
• Give lots of data to the computer so that it can figure out the patterns itself
• One of the first examples is the computer checkers program by Arthur Samuel
* - ref: Andrew Ng Courses, Big data: A revolution
2. Distinguishing big data and machine learning: big data is the seed from which machine learning forests grow
• Big data collects information from our digital exhaust (the crumbs we leave in the digital world), demographics, preferences, health, etc.
• Machine learning mines this data and models behaviors, producing interactive responses based on it
Why do we need this?
1. Tons of applications impacting human health, utility and future simplification
Health & Wellness | Utilitarian | Future
• DNA sampling & diagnosis
• Health reminders & prevention through AI tools
• Correlation studies
• Customizable tablets
• Real-time optimized path maps
• Search ranking
• Spam filter on email
• News aggregators
• Shopping recommendations
• Facebook face recognition
• Age recognition
• Voice recognition (Siri, Alexa)
• Driverless cars
• Home decoration
Key Terms
Training Set
• A set of data used to learn relationships, with the data and the answer for each sample.
• E.g. A diamond's size, cut and clarity help predict the price.
Supervised Learning
• Uses the training set to make a prediction.
• E.g. Model predicts diamond prices based on past prices.
Unsupervised Learning
• Provide data without suggesting anything so the computer can identify patterns or groupings.
• E.g. Customer segmentation, DNA groupings.
Features / Variables / Attributes
• Each distinct measurable data value you select in the training data set.
• E.g. A diamond's size is one of the features for predicting price.
Supervised: Regression
• Use the features provided in the training set to make a prediction. Fit a curve to the data provided.
• E.g. Price of diamond = X*Cut + Y*Clarity + Z*Size + features…
Supervised: Classification
• A defined set of categories for placing new data (observations).
• E.g. Presence or absence of cancer; types of diabetes.
Unsupervised: Clustering
• Process of assigning observations into subsets.
• E.g. Customer segment creation.
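The regression idea in Key Terms (price as a weighted sum of features) can be illustrated with a minimal sketch. It is simplified to a single feature, size in carats, and fitted by ordinary least squares. The data values and the names `fit_line` and `predict_price` are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training set: diamond size (carats) and the known price.
carats = [0.5, 1.0, 1.5, 2.0]
prices = [1500.0, 3000.0, 4500.0, 6000.0]

slope, intercept = fit_line(carats, prices)


def predict_price(size):
    """Predict a price for an unseen diamond size using the fitted line."""
    return slope * size + intercept


print(predict_price(1.2))
```

With more features (cut, clarity), the same idea extends to one weight per feature, which is the X, Y, Z form shown on the slide.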
Learning Steps
1. Collect / update user data
2. Create / update training set data
3. Create / update algorithm for training data (loop: validate algorithm, update algorithm)
4. Create predictive model
5. A/B test & launch on production (new real-time observations)
Data Wrangling and Feature Extraction
Spam Email Detection features:
• Title
• Sender domain
• # of recipients
• Email content
• Country of origin
• Non-dictionary words
• Hyperlinks
• Address book
• Length of email
• Structured Data (Best)
– RDBMS, columnar data
– Strict Schema
– SQL
• Semi-Structured Data (Better)
– JSON, XML
– Enforce minimum schema
– JSON, XML Parser
• Unstructured Data
– Text, Image, Raw email
– No Schema
– Batch processing
– Regular expressions
– MapReduce
GARBAGE IN GARBAGE OUT
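As a rough illustration of feature extraction for the spam example, the sketch below turns a raw email into a few of the features listed on this slide, using regular expressions. The email dictionary layout, the tiny `DICTIONARY` word list and the function name `extract_features` are assumptions made for this example, not part of the original deck.

```python
import re

# Hypothetical mini-dictionary; a real system would load a full word list.
DICTIONARY = {"win", "free", "money", "now", "click", "please", "see", "the", "agenda"}

LINK_PATTERN = r"https?://\S+"


def extract_features(email):
    """Turn a raw email (a dict with assumed keys) into a flat feature dict."""
    body = email["body"]
    links = re.findall(LINK_PATTERN, body)
    # Remove links before counting words so URL fragments are not counted.
    text_only = re.sub(LINK_PATTERN, " ", body)
    words = re.findall(r"[a-z']+", text_only.lower())
    return {
        "sender_domain": email["sender"].split("@")[-1].lower(),
        "num_recipients": len(email["recipients"]),
        "num_hyperlinks": len(links),
        "num_non_dictionary_words": sum(1 for w in words if w not in DICTIONARY),
        "length": len(body),
        "sender_in_address_book": email["sender"] in email["address_book"],
    }


# Hypothetical raw email.
email = {
    "title": "WIN FREE MONEY",
    "sender": "promo@example.test",
    "recipients": ["a@example.com", "b@example.com", "c@example.com"],
    "body": "Win free money now! Click http://example.test/prize xqzt",
    "address_book": {"friend@example.com"},
}

features = extract_features(email)
print(features)
```

The resulting dictionary is the kind of structured record that a learning algorithm can consume, in contrast to the raw, unstructured email text.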
Model Training
Training / Testing:
Text documents, User Activity, Images, Transaction history → Feature Extraction (Feature vector) + Labels → Machine Learning Algorithm → Predictive Model → Model Evaluation
New:
Text documents, User Activity, Images, Transaction history → Feature Extraction (Feature vector) → Predictive Model → Expected Label
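The training / testing flow above can be sketched end to end in a few lines. This is a minimal illustration, assuming features have already been extracted into numeric vectors. The nearest-centroid classifier is a deliberately simple stand-in for the "Machine Learning Algorithm" box and is not a method named in the deck. All data and function names are hypothetical.

```python
import random


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle labeled samples and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_centroids(train):
    """Learn one mean feature vector (centroid) per label."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the feature vector."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def accuracy(model, test):
    """Fraction of held-out samples whose predicted label matches the true label."""
    return sum(1 for f, label in test if predict(model, f) == label) / len(test)


# Hypothetical feature vectors: (number of hyperlinks, number of recipients).
samples = [
    ((0, 1), "ham"), ((1, 1), "ham"), ((0, 2), "ham"), ((1, 2), "ham"),
    ((8, 20), "spam"), ((9, 25), "spam"), ((7, 30), "spam"), ((10, 22), "spam"),
]

train, test = train_test_split(samples)
model = train_centroids(train)
print(accuracy(model, test))       # model evaluation on held-out data
print(predict(model, (9, 18)))     # expected label for a new observation
```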
Supervised learning techniques
• Linear classifier (numerical functions)
• Parametric (probabilistic functions)
– Naïve Bayes, hidden Markov models (HMM), probabilistic graphical models
• Non-parametric (instance-based functions)
– K-nearest neighbors
• Non-metric (symbolic functions)
– Classification and regression tree (CART)
• Aggregation
– Bagging (bootstrap + aggregation), AdaBoost, random forest, ensemble models
Linear Classifiers
• Logistic regression
– $P(y = 1 \mid x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$
– Find w with minimum loss
– Solve iteratively using gradient descent
• Support vector machine (SVM)
– Maximum margin classifier
• Artificial Neural Networks
– Inspired by how neurons work
– Activation function (Sigmoid, ReLU etc.)
– Deep Learning
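The logistic regression bullets (find the weights w with minimum loss, solved iteratively with gradient descent) can be sketched as follows. This is a minimal illustration using batch gradient descent on the log loss. The learning rate, epoch count, toy data and function names are assumptions chosen for the example.

```python
import math


def sigmoid(z):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n_features = len(xs[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the score
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        # Step against the average gradient.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical one-feature data: small values are class 0, large values class 1.
xs = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
ys = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(xs, ys)


def predict_proba(x):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


print(predict_proba([1.0]), predict_proba([3.5]))
```

A prediction above 0.5 is read as class 1 and below 0.5 as class 0, which makes the decision boundary linear in the features.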
KNN / CART
• K-Nearest Neighbors
– Find K nearest training examples
– Majority vote
– Easy to implement
– Not scalable for real time predictions
• Classification and Regression Trees
– Easy to interpret for small trees
• Random Forests
– Ensemble of decision trees
– Usually performs very well
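The K-nearest neighbors steps (find the K nearest training examples, then take a majority vote) fit in a few lines. The sketch below uses squared Euclidean distance and toy data. The function name `knn_predict` and the sample points are hypothetical.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples."""
    # Sort the whole training set by squared Euclidean distance to the query.
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled training examples: (feature vector, label).
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.8), "B"), ((4.9, 5.1), "B"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.0, 4.9)))
```

Every prediction scans and sorts the full training set, which illustrates why the slide calls the method easy to implement but not scalable for real-time predictions.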
Unsupervised Learning
• Clustering
– K-means clustering
– Spectral clustering
• Dimensionality reduction
– Principal component analysis (PCA)
– Factor analysis
• Product Recommendations
– Collaborative Filtering
• Association Rules
– Market Basket Analysis
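As an illustration of the clustering bullet, here is a minimal sketch of K-means (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The starting centroids are supplied by hand to keep the example deterministic. That choice, the fixed iteration count and the toy points are assumptions for this example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of assign/update rounds; returns (centroids, clusters)."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

No labels are provided at any point, so the groupings come only from the structure of the data, which is the defining property of unsupervised learning.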
Model Evaluation
• Measure model performance
• Optimize model to improve prediction quality
– Feature selection
– Hyperparameter tuning
• A/B Testing
• Explore/Exploit
• http://en.wikipedia.org/wiki/Precision_and_recall
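One common way to measure model performance for a classifier is precision and recall, the topic of the link above. The sketch below computes both from true and predicted labels. The label lists are hypothetical, and the function name `precision_recall` is chosen for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Precision: of everything flagged positive, how much really was positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything truly positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

In this toy data there are 3 true positives, 1 false positive and 1 false negative, so both values come out to 0.75.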
Sample Architecture
• Hadoop, Spark
• Prediction engine
• Real-time data
• SQL / NoSQL database
• Client
• Machine learning system
Health & Wellness: Sen.se Mother (IoT)
Amazon Echo & Personalization
Houzz Visual Match: Deep Learning
