© 2014 MapR Technologies
Who am I?
Ted Dunning, Chief Applications Architect MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
Short Books by Ted Dunning & Ellen Friedman
• Published by O’Reilly in 2014 and 2015
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR
http://bit.ly/ebook-real-world-hadoop
http://bit.ly/mapr-tsdb-ebook
http://bit.ly/ebook-anomaly
http://bit.ly/recommendation-ebook
The basic idea
Anomaly Detection and Fraud Analytics
• Financial customer wants to identify zero-day attacks
• And advanced persistent threats
• By sophisticated adversaries who don’t use known vectors
• Must keep logs and other data secret
– But must also collaborate on detection algorithms
Secure Development is Hard
[Diagram: system knowledge and observed data, training algorithm, model, model deployment, new measurements, anomaly scores]
Outside collaborators are outside the security perimeter.
They can’t see the data, and they can’t tune new algorithms to fit reality.
How To Make Realistic Data
[Diagram: system under test; live data and its failure signatures; fake data and its failure signatures]
Parametric Simulation
[Diagram: live data and fake data are each run through the system under test; the resulting failure signatures are compared at the point labeled "Match here"]
• Parametric matching of failure signatures allows emulation of complex data properties
• Matching on KPIs and failure modes guarantees practical fidelity
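One way to read the parametric-matching idea, as a minimal sketch only: choose a KPI of the system under test, then tune a generator parameter until fake data reproduces that KPI. Everything named here is a hypothetical illustration and not part of log-synth: the threshold detector, the value of `target_rate`, and the one-parameter normal generator.

```r
# Hypothetical system under test: flags any measurement above a fixed threshold.
threshold <- 3
flag_rate <- function(x) mean(x > threshold)

# Assumed KPI measured inside the security perimeter (illustrative value only).
target_rate <- 0.05

# Fake-data generator with one tunable parameter: the standard deviation.
make_fake <- function(n, sd) rnorm(n, mean = 0, sd = sd)

# Expected flag rate for a given sd, computed analytically for this toy generator.
expected_rate <- function(sd) pnorm(threshold, mean = 0, sd = sd, lower.tail = FALSE)

# Tune sd so the expected flag rate matches the target KPI.
fit <- uniroot(function(sd) expected_rate(sd) - target_rate, interval = c(0.1, 100))
sd_hat <- fit$root

# Check the match empirically on generated data.
set.seed(1)
fake <- make_fake(100000, sd_hat)
observed_rate <- flag_rate(fake)
```

Only the KPI crosses the security boundary in this sketch; the generator never sees a live record, which is the point of matching on failure signatures rather than on the data distribution.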
Do’s and Don’ts
• Do match the KPIs and failure modes
– Speed
– Score distribution
– False positive rates versus score
• Don’t try to match the actual data distribution precisely
– Good enough is good enough; we want to imitate failures, not create new life forms
– Probably impossible to do precisely
– Even if possible, it is vastly harder to match distributions
Methods for Generating Numbers
• Well-known distributions
– Uniform, normal, gamma, Poisson
– Truncations
• Cumulations
– Random walk v1
• Mixture distributions
• Hyper-parameters
– Random walk v2
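As a sketch of the first bullet group, the well-known distributions are all one-liners in base R, and truncation can be done by sampling through the inverse CDF. The parameter values and the helper `rtrunc_norm` are illustrative assumptions, not something from the slides.

```r
set.seed(42)
n <- 10000

# Draws from the well-known distributions named above.
u <- runif(n, min = 0, max = 1)
z <- rnorm(n, mean = 0, sd = 1)
g <- rgamma(n, shape = 2, rate = 1)
p <- rpois(n, lambda = 3)

# Truncation by inverse-CDF sampling: a normal restricted to [lo, hi].
# rtrunc_norm is a hypothetical helper written for this example.
rtrunc_norm <- function(n, lo, hi, mean = 0, sd = 1) {
  p_lo <- pnorm(lo, mean, sd)
  p_hi <- pnorm(hi, mean, sd)
  # Uniform draws between the two CDF values map back into [lo, hi].
  qnorm(runif(n, p_lo, p_hi), mean, sd)
}
tz <- rtrunc_norm(n, lo = -1, hi = 2)
```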
Normal
data = data.frame(x=rnorm(10000), y=rnorm(10000))
Mixture of Normals
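The slide text carries no code for this case, so here is a minimal sketch in the style of the neighboring slides. A mixture can be sampled in two steps: pick a component for each sample, then draw from that component. The weights, means, and standard deviations are illustrative assumptions.

```r
set.seed(7)
n <- 10000

# Illustrative component parameters (assumptions, not from the slides).
weights <- c(0.5, 0.3, 0.2)
means   <- c(-3, 0, 4)
sds     <- c(1, 0.5, 2)

# Step 1: pick a component for every sample (a multinomial draw).
component <- sample(seq_along(weights), size = n, replace = TRUE, prob = weights)

# Step 2: draw each sample from its component's normal distribution.
x <- rnorm(n, mean = means[component], sd = sds[component])
```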
Random Walk
y = cumsum(rnorm(10000))
Pick Mean from Multinomial
Random Walk with Variable Standard Deviation
y = cumsum(rt(10000, df=0.9))
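One way to read the slide title: a t-distributed step is a normal step whose standard deviation is itself drawn at random. The sketch below makes that hyper-parameter explicit; the variable names are illustrative, and the steps it produces have the same distribution as `rt(n, df)`.

```r
set.seed(3)
n <- 10000
df <- 0.9

# Hyper-parameter: each step gets its own standard deviation.
# If V ~ chi-squared(df) and Z ~ N(0, 1), then Z / sqrt(V / df)
# follows a t distribution with df degrees of freedom.
step_sd <- 1 / sqrt(rchisq(n, df = df) / df)

# Normal steps with per-step standard deviation, then cumulate into a walk.
steps <- rnorm(n, mean = 0, sd = step_sd)
y <- cumsum(steps)
```

With `df` this small, occasional steps are enormous, so the walk is dominated by a few large jumps rather than by steady diffusion.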
Methods for Generating Symbols
• Symbols are really just integers with a dictionary
• Well-known distributions
– Multinomial
– Dirichlet processes
– Rich-get-richer, Pitman-Yor
• Mixture distributions
• Hyper-parameters
• Lookup tables!!!
– Simple tables
– Data table joins for correlated components
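The rich-get-richer idea can be sketched with the Chinese restaurant construction of a Pitman-Yor process, followed by a dictionary lookup to turn integers into symbols. The function name `r_pitman_yor`, the parameter values, and the `sym` labels are illustrative assumptions; this is not how log-synth implements its samplers.

```r
# Sample n symbol ids from a Pitman-Yor process (Chinese restaurant scheme).
# alpha is the concentration; discount = 0 reduces to a Dirichlet process.
r_pitman_yor <- function(n, alpha = 1, discount = 0.5) {
  counts <- integer(0)          # counts[k] = times symbol k has been seen so far
  out <- integer(n)
  for (i in seq_len(n)) {
    k <- length(counts)
    # Existing symbol j has weight counts[j] - discount;
    # a brand-new symbol has weight alpha + discount * k.
    w <- c(counts - discount, alpha + discount * k)
    pick <- sample.int(k + 1, size = 1, prob = w)
    if (pick > k) counts <- c(counts, 1L) else counts[pick] <- counts[pick] + 1L
    out[i] <- pick
  }
  out
}

set.seed(11)
ids <- r_pitman_yor(5000)

# Symbols are integers plus a dictionary: map each id to a label.
dictionary <- paste0("sym", seq_len(max(ids)))
symbols <- dictionary[ids]
head(sort(table(symbols), decreasing = TRUE))
```

Swapping `dictionary` for a lookup table of real-looking names or codes gives the "simple tables" case from the last bullet.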
Skewed Integers
207 3
203 0
198 7
196 4
195 12
193 10
189 2
187 1
185 13
179 6
178 9
177 5
177 25
174 21
173 8
173 14
170 18
[
{"name":"x", "class":"int", "skew":1}
]
Methods for Generating Behaviors
• Use structured data!
– Generate user meta-data
– Generate list of transactions
• Only flatten if necessary
• See Apache Drill for post-processing
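A minimal sketch of the structured approach: each user is a record holding metadata plus a nested list of transactions, and flattening happens only at the end. The field names and distributions are illustrative assumptions, and the flattening is done in R here only as a stand-in for whatever post-processing tool is used.

```r
set.seed(5)

# Generate one user: metadata plus a nested table of transactions.
# All field names and distributions here are illustrative assumptions.
make_user <- function(id) {
  n_tx <- rpois(1, lambda = 5) + 1            # at least one transaction
  list(
    id = id,
    state = sample(c("CA", "NY", "TX"), 1),
    transactions = data.frame(
      day = sort(sample(0:100, n_tx, replace = TRUE)),
      amount = round(rgamma(n_tx, shape = 2, rate = 0.05), 2)
    )
  )
}

users <- lapply(1:100, make_user)

# Flatten only when a downstream tool needs rows: one row per transaction.
flat <- do.call(rbind, lapply(users, function(u) {
  data.frame(user_id = u$id, state = u$state, u$transactions)
}))
```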
Methods for Generating Databases
• Use integers (see previous) as foreign keys
• Normalized form implies (approximate) independence of tables
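As a sketch of both bullets: generate each table independently, and let a skewed random integer serve as the foreign key that ties them together. The table and column names and the 1/rank weighting are illustrative assumptions.

```r
set.seed(9)
n_users <- 1000
n_tx <- 5000

# Dimension table: one row per user, keyed by an integer id.
users <- data.frame(
  user_id = seq_len(n_users),
  segment = sample(c("retail", "business"), n_users,
                   replace = TRUE, prob = c(0.8, 0.2))
)

# Fact table: the foreign key is a skewed random integer, drawn
# independently of the user attributes. The 1 / rank weighting is an
# illustrative choice of skew.
tx <- data.frame(
  tx_id = seq_len(n_tx),
  user_id = sample(n_users, n_tx, replace = TRUE, prob = 1 / seq_len(n_users)),
  amount = round(rgamma(n_tx, shape = 2, rate = 0.05), 2)
)

# Joining on the key recovers a denormalized view when one is needed.
joined <- merge(tx, users, by = "user_id")
```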
Go get log-synth
https://github.com/tdunning/log-synth
A worked example...
Simulation Setup
[Plot: count (0 to 500) versus day (0 to 100) for compromises and frauds, with the compromise period and the exploit period marked]
Questions?
Thank You
Ted Dunning, Chief Applications Architect, MapR Technologies
tdunning@mapr.com
tdunning@apache.org
@mapr, maprtech, mapr-technologies

Realistic Synthetic Generation Allows Secure Development
