Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You

© 2017 MapR Technologies 1
Machine Learning

Machine Learning:
What Works

Machine Learning:
What Works and
What They Won't Tell You

Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Committer, PMC member, board member, ASF
O’Reilly author
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning

Traditional problems

Where We Are Today

Why is cheap better than deep (sometimes)
Greenfield problems can be
– Easy (large number of these)
– Impossible (large number of these)
– Hard but possible (right on the boundary)
Mature problems can be
– Easy (these are already done)
– Impossible (still a large number of these)
– Hard but possible (now the majority of the effort)

Most data isn’t worth much in isolation
First data is valuable
Later data is dregs

Suddenly worth processing
First data is valuable
Later data is dregs
But has high aggregate value

If we can handle the scale
It’s really big

With great scale comes great opportunity
• Increasing scale by 1000x changes the game
• We essentially have green fields opening up all around
• Most of the opportunities don’t require advanced learning

A simple example - security monitoring
• “Small” data
– Capture IDS logs
– Detect what you already know
• “Big” data
– Capture switch, server, firewall logs as well
– New patterns emerge immediately

Another example – fraud detection
• “Small” data
– Maintain card profiles
– Segment models
– Evaluate all transactions
• “Big” Data
– Maintain card profiles, full 90 day transaction history
– Per user hierarchical models
– Evaluate all transactions

Easy != Stupid
• You still have to do things reasonably well
– Techniques that are not well founded are still problems
• Let’s talk about some of easy smart tricks today

Scale does not cure wrong
It just makes easy more common

A core technique
• Many of these easy problems reduce to finding interesting
coincidences
• This can be summarized as a 2 x 2 table
• Actually, many of these tables
A Other
B k11 k12
Other k21 k22

How do you do that?
• This is well handled using G-test
– See wikipedia
– See http://bit.ly/surprise-and-coincidence
• Original application in linguistics now cited > 2000 times
• Available in ElasticSearch, in Solr, in Mahout
• Available in R, C, Java, Python

Which one is the anomalous co-occurrence?
A not A
B 13 1000
not B 1000 100,000
A not A
B 1 0
not B 0 10,000
A not A
B 10 0
not B 0 100,000
A not A
B 1 0
not B 0 2

Which one is the anomalous co-occurrence?
A not A
B 13 1000
not B 1000 100,000
A not A
B 1 0
not B 0 10,000
A not A
B 10 0
not B 0 100,000
A not A
B 1 0
not B 0 2
0.90 1.95
4.52 14.3
Dunning Ted, Accurate Methods for the Statistics of Surprise and Coincidence,
Computational Linguistics vol 19 no. 1 (1993)

So we can find interesting coincidence

So we can find interesting coincidence
and that gets us exactly what?

Cooccurrence AnalysisCooccurrence Analysis

Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres de paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff

Real-life example

Any other domains?

Document classification

Language identification

Species identification

Anything useful?
Like, to do with money?

Common Point of Compromise
• Scenario:
– Merchant 0 is compromised, leaks account data during compromise
– Fraud committed elsewhere during exploit
– High background level of fraud
– Limited detection rate for exploits
• Goal:
– Find merchant 0
• Meta-goal:
– Screen algorithms for this task without leaking sensitive data

Example 2 - Common Point of Compromise
skim exploit
Merchant 0
Skimmed
data
Merchant n
Card data is stolen
from Merchant 0
That data is used
in frauds at other
merchants

So how does it work in real-life?

●●●●●●●●●●●●●●●●●●●● ● ●● ●●● ●●● ●●●●● ●●●●● ●●● ●●● ●● ● ●● ●● ●● ● ●●●● ●●●● ●● ●●●● ●●●● ●●● ●● ●● ● ●● ● ●●●● ●● ● ●●●● ●●●●●● ●● ●● ●●● ●●● ●●●●● ● ●●● ●● ●●● ●●● ●● ●●●● ●
●● ●●● ●●● ●
●
● ●●
●
●
●
●●
020406080
LLR score for real data
Number of Merchants
BreachScore(LLR)
Real truly bad guys
100
101
102
103
104
105
106
Really truly bad guys

Example 3: Tensor Chicken
• Even simple image recognition is really handy
• Until very recently, it was really hard
• Lately, it is quite doable

Inception Model Architecture
• Inception is an advanced image recognizer trained on the
Imagenet task
• But it takes 10’s of millions of images and weeks to train
Rethinking the Inception Architecture for Computer Vision
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon
Shlens, Zbigniew Wojna
http://arxiv.org/abs/1512.00567

Machine Learning

Machine Learning
Profit!
World domination

But that requires Vats o’ Data™
and tons of CPU / GPU time

Idea:
train on one domain (big, hard)
retrain on second one (small, easy)

Transfer Learning

Transfer Learning
Profit!
World domination

Just Add Water: Tensor Chicken!
• Start with Inception v3: http://bit.ly/inception-v3
• Add a few thousand pictures of chickens
– And a few blue jays
• Voila!
• See Ian’s blog: http://bit.ly/tensor-chicken-blog
• @tensorchicken in action: https://twitter.com/tensorchicken

Example Images

Example Images
Plymouth
rock
Plymouth
rock Rhode
Island
Red
Open
door

Example Images
Plymouth
rock
Plymouth
rock Rhode
Island
Red
Open
door
Blue Jay!

Cooccurrence Analysis
Summary
• There are easy techniques to get good results
– The easy stuff often doesn’t make it in academia
• Cooccurrence / cross-occurrence is hugely powerful
– And easy
• Transfer learning turns deep learning into
cheap learning

Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Committer, PMC member, board member, ASF
O’Reilly author
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning

Additional Resources
O’Reilly report by Ted Dunning & Ellen Friedman © March 2017
Read free courtesy of MapR:
https://mapr.com/geo-distribution-big-data-and-analytics/
O’Reilly book by Ted Dunning & Ellen Friedman
© March 2016
https://mapr.com/streaming-architecture-using-
apache-kafka-mapr-streams/

Additional Resources
O’Reilly book by Ted Dunning & Ellen Friedman
© June 2014
https://mapr.com/practical-machine-learning-
new-look-anomaly-detection/
O’Reilly book by Ellen Friedman & Ted Dunning
© February 2014
https://mapr.com/practical-machine-learning/

New book: Machine Learning Logistics
Model Management in the Real World
O’Reilly book by Ellen Friedman & Ted Dunning
Download free from MapR
https://mapr.com/ebook/machine-learning-logistics/

Please support women in tech – help build
girls’ dreams of what they can accomplish
© Ellen Friedman 2015#womenintech #datawomen

Q&A
@mapr
tdunning@mapr.com
ENGAGE WITH US
@ Ted_Dunning

Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You

Similar to Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You (20)

More from Matt Stubbs

More from Matt Stubbs (20)

Recently uploaded

Recently uploaded (20)

Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You