Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15

Complement Deep Learning with Cheap Learning: Recent results of deep learning on hard problems have set the data world all atwitter and made deep learning the fashion of the time.

But it is very important to remember that as data expands, the learning problems that are encountered are often nearly greenfield problems, and it is often possible to solve these problems using remarkably simple techniques. Indeed, on many problems these simple techniques will give results as good as more complex ones, not because they are profound, but because many problems become simpler at scale.

That said, it isn’t always obvious how to do this. I will describe some of these techniques and show how they can be applied in practice.

  1. © 2014 MapR Technologies
  2. Me, Us
     • Ted Dunning, Chief Application Architect, MapR
       – Committer, PMC member: ZooKeeper, Drill
       – VP Incubator
       – Bought the beer at the first HUG
     • MapR
       – Distributes more open source components for Hadoop
       – Adds major technology for performance, HA, industry standard APIs
     • Info
       – Hash tag: #mapr #mlconfatl
       – See also: @ApacheDrill, @ted_dunning and @mapR
  3. Agenda
     • Rationale
     • Why cheap isn't the same as simple-minded
     • Some techniques
     • Examples
  4. Why is cheap better than deep (sometimes)
     Greenfield problems can be
       – Easy (large number of these)
       – Impossible (large number of these)
       – Hard but possible (right on the boundary)
     Mature problems can be
       – Easy (these are already done)
       – Impossible (still a large number of these)
       – Hard but possible (now the majority of the effort)
  5. Most data isn’t worth much in isolation
     First data is valuable
     Later data is dregs
  6. Suddenly worth processing
     First data is valuable
     Later data is dregs
     But has high aggregate value
  7. If we can handle the scale
     It’s really big
  8. With great scale comes great opportunity
     • Increasing scale by 1000x changes the game
     • We essentially have green fields opening up all around
     • Most of the opportunities don’t require advanced learning
  9. A simple example - security monitoring
     • “Small” data
       – Capture IDS logs
       – Detect what you already know
     • “Big” data
       – Capture switch, server, firewall logs as well
       – New patterns emerge immediately
  10. Another example – fraud detection
     • “Small” data
       – Maintain card profiles
       – Segment models
       – Evaluate all transactions
     • “Big” data
       – Maintain card profiles, full 90 day transaction history
       – Per user hierarchical models
       – Evaluate all transactions
  11. Easy != Stupid
     • You still have to do things reasonably well
       – Techniques that are not well founded are still problems
     • Heuristic frequency ratios still fail
       – Coincidences still dominate the data
       – Accidental 100% correlations abound
     • Related techniques still broken for coincidence
       – Pearson’s χ²
       – Simple correlations
  12. Blast from the past
  13. Scale does not cure wrong
     It just makes easy more common
  14. A core technique
     • Many of these easy problems reduce to finding interesting coincidences
     • This can be summarized as a 2 x 2 table
     • Actually, many of these tables

                 A      Other
       B         k11    k12
       Other     k21    k22
  15. How do you do that?
     • This is well handled using the G-test
       – See Wikipedia
       – See http://bit.ly/surprise-and-coincidence
     • Original application in linguistics now cited > 2000 times
     • Available in ElasticSearch, in Solr, in Mahout
     • Available in R, C, Java, Python
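For reference, the G-test statistic for a 2 x 2 table with cells k11, k12, k21, k22 is usually written as below. This is the standard textbook form rather than anything shown on the slides; the symbols E_ij (the count expected if A and B were independent) and N (the table total) are introduced here for illustration.

```latex
G^2 = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} k_{ij} \,\ln\frac{k_{ij}}{E_{ij}},
\qquad
E_{ij} = \frac{(k_{i1} + k_{i2})\,(k_{1j} + k_{2j})}{N},
\qquad
N = k_{11} + k_{12} + k_{21} + k_{22}
```

Cells with k_ij = 0 contribute nothing to the sum, which is why tables with zero cells still get a finite score.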
  16. Which one is the anomalous co-occurrence?

                 A       not A
       B         13      1000
       not B     1000    100,000

                 A       not A
       B         1       0
       not B     0       10,000

                 A       not A
       B         10      0
       not B     0       100,000

                 A       not A
       B         1       0
       not B     0       2

  17. Which one is the anomalous co-occurrence?
     Same four tables as slide 16, now annotated with the scores 0.90, 1.95, 4.52, 14.3
     Dunning, Ted. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, vol. 19, no. 1 (1993)
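The scores on the slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the slide reports the signed square root of the G² statistic (often called root LLR); the slide itself does not say so, and the function names `llr_2x2` and `root_llr` are made up for this illustration.

```python
import math


def llr_2x2(k11, k12, k21, k22):
    """G^2 (log-likelihood ratio) statistic for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # an empty cell contributes zero to the sum
            expected = rows[i] * cols[j] / n
            total += k * math.log(k / expected)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * total)


def root_llr(k11, k12, k21, k22):
    """Signed square root of G^2 (assumed here to be the slide's score)."""
    root = math.sqrt(llr_2x2(k11, k12, k21, k22))
    # Negative when A and B co-occur less often than independence predicts.
    if k11 * (k21 + k22) < k21 * (k11 + k12):
        return -root
    return root


# The four tables from the slide, in the order they appear.
tables = [
    (13, 1000, 1000, 100_000),
    (1, 0, 0, 10_000),
    (10, 0, 0, 100_000),
    (1, 0, 0, 2),
]
scores = [round(root_llr(*t), 2) for t in tables]
print(scores)
```

Under that assumption the four tables score roughly 0.90, 4.52, 14.3 and 1.95 in the order listed, so the table with the largest raw counts (13 co-occurrences) is the least surprising one, and the slide's list of scores appears to be sorted rather than in table order.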
  18. So we can find interesting coincidence
     and that gets us exactly what?
  19. Cooccurrence Analysis
  20. Real-life example
     • Query: “Paco de Lucia”
     • Conventional meta-data search results:
       – “hombres de paco” times 400
       – not much else
     • Recommendation based search:
       – Flamenco guitar and dancers
       – Spanish and classical guitar
       – Van Halen doing a classical/flamenco riff
  21. Real-life example
  22. Any other domains?
  23. Document classification
  24. Language identification
  25. OK … Works for language
     Anything else?
  26. Species identification
  27. Anything useful?
     Like, to do with money?
  28. Common Point of Compromise
     • Scenario:
       – Merchant 0 is compromised, leaks account data during compromise
       – Fraud committed elsewhere during exploit
       – High background level of fraud
       – Limited detection rate for exploits
     • Goal:
       – Find merchant 0
     • Meta-goal:
       – Screen algorithms for this task without leaking sensitive data
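One way to attack the "find merchant 0" goal with the coincidence test from earlier in the deck is to build one 2 x 2 table per merchant: consumers who did or did not shop there, against consumers who did or did not later suffer detected fraud. The sketch below is only an illustration of that idea on toy data. The function `breach_scores`, the merchant names and the consumer ids are all hypothetical, and the slides do not spell out the exact scoring procedure used in the talk.

```python
import math


def root_llr(k11, k12, k21, k22):
    """Signed square root of the G^2 statistic for a 2x2 table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    total = 0.0
    for k, i, j in ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1)):
        if k > 0:  # empty cells contribute nothing
            total += k * math.log(k * n / (rows[i] * cols[j]))
    root = math.sqrt(max(0.0, 2.0 * total))
    # Negative when fraud is rarer among visitors than among non-visitors.
    return -root if k11 * (k21 + k22) < k21 * (k11 + k12) else root


def breach_scores(visits, victims, n_consumers):
    """Score each merchant by how strongly visiting it coincides with fraud.

    visits: dict mapping merchant id -> set of consumer ids seen there
    victims: set of consumer ids with detected fraud
    """
    scores = {}
    for merchant, visitors in visits.items():
        k11 = len(visitors & victims)      # visited, defrauded
        k12 = len(visitors) - k11          # visited, not defrauded
        k21 = len(victims) - k11           # not visited, defrauded
        k22 = n_consumers - k11 - k12 - k21
        scores[merchant] = root_llr(k11, k12, k21, k22)
    return scores


# Toy data (hypothetical): 1000 consumers, three merchants.
visits = {
    "m0": set(range(0, 100)),       # the compromised merchant
    "m1": set(range(0, 1000, 2)),   # a very popular merchant
    "m2": set(range(50, 150)),      # overlaps m0's customers, but no victims
}
victims = set(range(0, 40)) | {500, 600, 700}  # 40 from m0 plus background fraud
scores = breach_scores(visits, victims, n_consumers=1000)
suspect = max(scores, key=scores.get)
print(suspect, scores)
```

The popular merchant m1 has seen many victims simply because it has seen half of all consumers, so its score stays near zero, while m0 stands out.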
  29. Simulation Setup
     [Figure: counts of compromises and frauds by day (x axis: day, 0 to 100; y axis: count), with the compromise period and the exploit period marked]
  30. Simulation Strategy
     • For each consumer
       – Pick consumer parameters such as transaction rate, preferences
       – Generate transactions until end of sim-time
         • If merchant 0 during compromise time, possibly mark as compromised
         • For all transactions, possibly mark as fraud, probability depends on history
         • Merchants are selected using hierarchical Pitman-Yor
     • Restate data
       – Flatten transaction streams
       – Sort by time
     • Tunables
       – Compromise probability, background fraud, detection probability
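The strategy above can be sketched in a few dozen lines. This is a simplified illustration, not the simulator behind the talk: it uses a single shared Pitman-Yor process where the slide says hierarchical, and every numeric value (period boundaries, rates, probabilities, `alpha`, `discount`) is an assumption chosen only to make the example run.

```python
import random


class PitmanYor:
    """Single-level Pitman-Yor process over merchant ids.

    A simplification: the slide mentions a hierarchical version.
    """

    def __init__(self, alpha, discount, rng):
        self.alpha, self.discount, self.rng = alpha, discount, rng
        self.counts = []  # counts[i] = number of times merchant i was picked
        self.total = 0

    def sample(self):
        k = len(self.counts)
        u = self.rng.random() * (self.alpha + self.total)
        new_weight = self.alpha + self.discount * k
        if u < new_weight:  # open a previously unseen merchant
            self.counts.append(1)
            self.total += 1
            return k
        u -= new_weight
        for i, c in enumerate(self.counts):
            u -= c - self.discount
            if u < 0:
                break
        # If round-off leaves u >= 0, i is already the last merchant.
        self.counts[i] += 1
        self.total += 1
        return i


def simulate(n_consumers=100, sim_days=100.0, compromise=(20.0, 40.0),
             exploit=(50.0, 80.0), p_compromise=0.5, p_background=0.001,
             p_exploit=0.2, p_detect=0.5, seed=1):
    """Return (time, consumer, merchant, fraud, detected) tuples sorted by time."""
    rng = random.Random(seed)
    merchants = PitmanYor(alpha=10.0, discount=0.5, rng=rng)
    txns = []
    for consumer in range(n_consumers):
        rate = rng.uniform(0.1, 1.0)  # consumer parameter: transactions per day
        compromised = False
        t = rng.expovariate(rate)
        while t < sim_days:
            m = merchants.sample()
            in_compromise = compromise[0] <= t < compromise[1]
            if m == 0 and in_compromise and rng.random() < p_compromise:
                compromised = True
            # Fraud probability depends on the consumer's history.
            in_exploit = exploit[0] <= t < exploit[1]
            p_fraud = p_exploit if compromised and in_exploit else p_background
            fraud = rng.random() < p_fraud
            detected = fraud and rng.random() < p_detect
            txns.append((t, consumer, m, fraud, detected))
            t += rng.expovariate(rate)
    txns.sort()  # flatten the per-consumer streams and sort by time
    return txns


txns = simulate()
print(len(txns), "transactions,", sum(x[4] for x in txns), "detected frauds")
```

The three tunables named on the slide map to `p_compromise`, `p_background` and `p_detect`. The Pitman-Yor discount gives a long tail of rarely visited merchants, which is the property that makes merchant popularity look realistic.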
  31. But that isn’t very realistic!
     • No details of the fraud
     • No details of the fraudsters
     • No details on the transactions
     • No details on the models
     • How can this be any good at all?
  32. Secure Development is Hard
     [Diagram labels: System knowledge, Observed data, Training algorithm, Model, New measurements, Model, Anomaly scores, Model deployment]
  33. Secure Development is Hard
     [Same diagram as slide 32]
     Outside collaborators are outside the security perimeter
     They can’t see the data and they can’t tune new algorithms to fit reality
  34. How To Make Realistic Data
     [Diagram labels: System under test, Live data, Failure signatures, Fake data, Failure signatures]
  35. Parametric Simulation
     [Diagram labels: Match here, Live data, System under test, Failure signatures, Fake data, Failure signatures, Fake data, System under test, Failure signatures]
     Parametric matching of failure signatures allows emulation of complex data properties
     Matching on KPIs and failure modes guarantees practical fidelity
  36. Performance Indicators to Match
     • User and merchant population
     • Transaction count/consumer
     • Merchant propensity skew
     • Level of detected fraud
     • Spectrum of meta-model scores
  37. So how does it work in practice?
  38. [No text on this slide]
  39. [Figure: “LLR score for real data”; x axis: Number of Merchants (log scale, 10^0 to 10^6); y axis: Breach Score (LLR), 0 to 80; annotations: “Real truly bad guys”, “Really truly bad guys”]
  40. Summary
     [Figures repeated from earlier slides: “LLR score for real data” (Number of Merchants vs. Breach Score (LLR), annotation “Real truly bad guys”), a partial label “Cooccurrence An”, and the simulation setup plot of compromises and frauds by day]
     • We live in a golden age of newly achieved scale
     • That scale has lowered the tree
       – Hard problems are much easier
       – Lots of low-hanging fruit all around us
     • Cheap learning has huge value
     • Code available at http://github.com/tdunning
  41. Me, Us
     • Ted Dunning, Chief Application Architect, MapR
       – Committer, PMC member: ZooKeeper, Drill
       – VP Incubator
       – Bought the beer at the first HUG
     • MapR
       – Distributes more open source components for Hadoop
       – Adds major technology for performance, HA, industry standard APIs
     • Info
       – Hash tag: #mapr #mlconfatl
       – See also: @ted_dunning and @mapR