A Nontechnical Introduction to Machine Learning
About Me
Wesam Elshamy
Data Scientist @ American Express
Founder: Gramaunger (Pandora for food)
PhD Computer Science, Kansas State University
xyzwesamelshamy@gmail.com
Your Email Inbox
From: avoth@cogeco.ca [mailto:avoth@cogeco.ca] On Behalf Of
Webmail.Uchicago.edu@cogeco.ca
Sent: Thursday, July 24, 2008 6:46 PM
Subject: Quoting Uchicago.edu, Member.Services@ Uchicago.edu
Dear Uchicago.edu, email account user,
We are currently verifying our subscribers email accounts in other to
increase the efficiency of our webmail futures. During this course you
are required to provide the verification desk with the following details
so that your account could be verified;
CNetID::....................
Password:..............
Territory:...................
Kindly send these details so as to avoid the cancelation of your email
account.
Thanks, Uchicago.edu, Team
Reply?
2015 email Spam rate 53% [1]
$22 billion lost productivity cost in USA [2]
[1] 2016 Internet Security Threat Report, Symantec Corp
[2] 2004 National Technology Readiness Survey, Center for Excellence in Service at the University of Maryland
Too expensive to filter manually.
Credit Card Transactions
~80 billion legit transactions in the US [1]
~29 million unauthorized ($4 billion in value) [1]
[1] The 2013 Federal Reserve Payments Study, Federal Reserve System
Too expensive to authorize manually.
Zero liability for card holder.
Marketing
Online stores promote a handful of personalized item recommendations out of millions in their inventory.
amazon.com sells 488 million items [1] to 244 million users [2].
[1] (2015) How Many Products Does Amazon Sell?, export-x.com
[2] Amazon Adds 30 Million Customers in the Past Year, Amazon Annual Shareholder Meeting 2014
Why Machine Learning?
We need a process that improves a system’s
performance by learning from experience.
Tune Machine Learning System
Collect more data
Deploy
Spam Filtering
Tune Spam Filter
Label Spam messages
Deploy
Spam filter improves performance with every
correctly labeled Spam message.
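The label, tune, deploy loop above can be sketched with a very small word-count filter. This is only an illustration under stated assumptions: the class name `TinySpamFilter`, the naive Bayes scoring with add-one smoothing, and the example messages are all made up here and are not taken from the talk.

```python
import math
from collections import Counter


class TinySpamFilter:
    """Toy word-count naive Bayes filter (hypothetical, for illustration only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def label(self, text, is_spam):
        """Record one labeled message; every label updates the counts."""
        kind = "spam" if is_spam else "ham"
        self.message_counts[kind] += 1
        self.word_counts[kind].update(text.lower().split())

    def _log_score(self, words, kind):
        total_messages = sum(self.message_counts.values())
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[kind].values())
        # Add-one smoothing keeps unseen words from producing log(0).
        score = math.log((self.message_counts[kind] + 1) / (total_messages + 2))
        for word in words:
            count = self.word_counts[kind][word]
            score += math.log((count + 1) / (total_words + len(vocabulary) + 1))
        return score

    def is_spam(self, text):
        words = text.lower().split()
        return self._log_score(words, "spam") > self._log_score(words, "ham")


# Invented training messages; a real filter would need far more labeled data.
spam_filter = TinySpamFilter()
spam_filter.label("verify your account password now", is_spam=True)
spam_filter.label("send your password to avoid cancelation", is_spam=True)
spam_filter.label("lunch meeting moved to noon", is_spam=False)
spam_filter.label("notes from the project meeting", is_spam=False)

print(spam_filter.is_spam("please verify your password"))
print(spam_filter.is_spam("project meeting at noon"))
```

Each call to `label` changes the word counts, which is the sense in which the filter is tuned by every correctly labeled message.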
Targeted Advertising
Every time you click an advertisement, you
tell the system “I’m interested.”
Tune ad targeting system
Collect click data
Deploy
US Internet advertising revenue (2015) ~ $60 billion [1]
[1] IAB Internet Advertising Revenue Report—2015 Full Year Results, PricewaterhouseCoopers.
A big share of it is programmatic buying.
Why Machine Learning?
Build systems that learn from individual and
collective user behavior and customize to
the individual user.
Discover knowledge in large databases.
If you buy x and y, what else might you buy?
Develop specialized systems to replace specific
human tasks that require some intelligence.
Face detection
Document classification
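The "if you buy x and y" question above can be sketched as a simple co-purchase count. The function name `also_bought` and the purchase history are hypothetical, and real recommendation systems are considerably more involved than this.

```python
from collections import Counter


def also_bought(baskets, owned, top_n=3):
    """Rank items that appear in past baskets containing every item in `owned`."""
    owned = set(owned)
    counts = Counter()
    for basket in baskets:
        items = set(basket)
        if owned <= items:  # this basket contains everything the shopper has
            counts.update(items - owned)
    return counts.most_common(top_n)


# Invented purchase history, for illustration only.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter", "jam", "milk"],
    ["bread", "butter", "jam", "milk", "eggs"],
    ["bread", "eggs"],
    ["butter", "jam"],
]
suggestions = also_bought(baskets, ["bread", "butter"])
print(suggestions)
```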
Why Now?
Businesses collect massive amounts of our data.
Clicks
Internet browsing
Travel
Nest AC thermostat
Calendar events
Demographics
Internet search
Purchases
Emails
...
Servers are much more powerful and cheaper than ever.
Cloud computing: cheaply lease servers as
you need them, when you need them.
Rapid advances in theory and algorithms.
Hampered for decades by a shortage of readily
available large-scale data and powerful computers.
Skills Needed
● Probability theory
● Linear algebra
● Calculus
● Graph theory
● ...
● Programming with multiple languages
● Text manipulation at the command line
● Algorithmic thinking
● Vectorized operations
● ...
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Types of Learning
Supervised
1- Get labeled training data
2- Train model to correctly label data
3- Use trained model to label new data
[Figure: training items labeled apple, apple, banana; new item: "Apple or banana?"]
Unsupervised
1- Get training data without labels
2- Train model to group similar items together
3- Use trained model to label new data
[Figure: items grouped into groups 1 and 2; new item: "Which group?"]
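The three supervised steps can be sketched with a nearest-centroid classifier. This is one simple choice among many, not a method named in the talk; the function names, the features (length and width in centimeters), and the measurements are assumptions made up for the example.

```python
import math
from collections import defaultdict


def train(examples):
    """Step 2: learn one average feature vector (centroid) per label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(model, features):
    """Step 3: give a new item the label of the closest centroid."""
    return min(model, key=lambda label: math.dist(model[label], features))


# Step 1: labeled training data as ((length_cm, width_cm), label); values invented.
labeled_fruit = [
    ((7.0, 7.0), "apple"),
    ((8.0, 7.5), "apple"),
    ((18.0, 3.5), "banana"),
    ((20.0, 4.0), "banana"),
]
model = train(labeled_fruit)
print(predict(model, (19.0, 3.8)))  # apple or banana?
```

The unsupervised case differs only in step 1: the labels are missing, so the model has to find the groups on its own.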
K-Means
One of the most well-known clustering algorithms.
[Figure: scatter plot with axes Weight (lbs) and Height (feet)]
Cluster data into homogeneous groups.
Is clustering supervised or unsupervised?
How many clusters?
K-Means
No changes: Done training
How to tell how good it is?
K-Means
Assign cluster to new point
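The K-Means steps walked through above (assign points, move centers, stop when nothing changes, then place a new point) can be sketched in plain Python. The function names, the random initialization from data points, and the weight and height values are assumptions for illustration. The within-cluster sum of squared distances is one common answer to "how good is it?", not necessarily the one intended in the talk.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, k, seed=0, max_iterations=100):
    """Alternate cluster assignment and centroid updates until nothing changes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [nearest(p, centroids) for p in points]
        if new_assignments == assignments:  # no changes: done training
            break
        assignments = new_assignments
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster went empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


def within_cluster_sse(points, centroids, assignments):
    """Total squared distance from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))


# Invented (weight_lbs, height_feet) measurements, for illustration only.
people = [(60, 4.0), (65, 4.2), (70, 4.1), (180, 5.9), (190, 6.0), (200, 6.1)]
centroids, assignments = k_means(people, k=2)
sse = within_cluster_sse(people, centroids, assignments)
print(centroids, sse)
print(nearest((185, 5.8), centroids))  # assign a cluster to a new point
```

Because weight spans a much larger numeric range than height here, it dominates the distance; rescaling the features is a common remedy and relates to the question of which features to use.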
Applications of Clustering
Text document clustering: sports, politics, economy, entertainment, ...
Spam filters: spam vs. ham.
Image clustering / search: find images that look like this one.
Product clustering (amazon.com): recommend similar products.
What features to use?
Issues
Data is usually noisy
Faulty sensors
Human error
Bad data transfer
Outliers
...
In Data Science, 80% of
time spent prepare data,
20% of time spent complain
about need for prepare data.
https://twitter.com/BigDataBorat
Issues
Underfitting: the algorithm used cannot capture the patterns in the data.
http://www.slideshare.net/larsga/introduction-to-big-datamachine-learning/34-Overfitting_Tuning_the_algorithm_so
Issues
Overfitting: the algorithm is so overtuned that it fits the noise in the data.
It cannot generalize well.
http://www.slideshare.net/larsga/introduction-to-big-datamachine-learning/34-Overfitting_Tuning_the_algorithm_so
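Underfitting and overfitting can be made concrete by comparing accuracy on the training data with accuracy on held-out data. The sketch below uses k nearest neighbor voting only as a convenient stand-in; the data, including one deliberately mislabeled training point, is invented, and the choice of k values is an assumption for illustration.

```python
import math
from collections import Counter


def knn_predict(training, point, k):
    """Majority label among the k training examples closest to `point`."""
    neighbors = sorted(training, key=lambda example: math.dist(example[0], point))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def accuracy(training, examples, k):
    """Fraction of `examples` whose predicted label matches the given label."""
    hits = sum(knn_predict(training, x, k) == label for x, label in examples)
    return hits / len(examples)


# Invented 1-D data: small values are "ham", large values are "spam".
training = [
    ((1,), "ham"), ((2,), "ham"),
    ((3,), "spam"),  # deliberately mislabeled: this point is noise
    ((4,), "ham"), ((5,), "ham"),
    ((7,), "spam"), ((8,), "spam"), ((9,), "spam"), ((10,), "spam"),
]
held_out = [((2.9,), "ham"), ((3.2,), "ham"), ((6.8,), "spam"), ((9.5,), "spam")]

for k in (1, 3, len(training)):
    print(k, accuracy(training, training, k), accuracy(training, held_out, k))
```

On this toy data, k = 1 reproduces every training label, noise included, yet misclassifies the held-out points next to the noisy example, which is the overfitting pattern. Setting k to the full training set size always predicts the overall majority label, so it captures no pattern at all, which is the underfitting pattern.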
Clustering Algorithms
K-Means
K Nearest Neighbor (strictly a supervised classification method, not a clustering algorithm)
DBSCAN
Hierarchical Clustering
Mixture of Gaussians
Fuzzy C-means
...
Which one to use?
No Free Lunch Theorem in Machine Learning
There is no single learning algorithm that will outperform
all other algorithms in solving all problems.
No magic do-it-all algorithm exists.
The most valuable data scientist skill is picking the
right algorithm for the problem at hand.
To Learn More
http://www2.ift.ulaval.ca/~chaib/IFT-4102-7025/public_html/Fichiers/Machine_Learning_in_Action.pdf
Free online course by one of the best
researchers in Machine Learning.
Starts Oct 3rd
https://www.coursera.org/learn/machine-learning
Thanks
