Realistic Synthetic Generation Allows Secure Development

From the Hadoop Summit 2015 session with Ted Dunning. Open source is great, if developed in the open. Privacy is great, but things have to be private. So what happens when you find an open source bug with private data? How do you even file the bug report? Likewise, how can you develop fraud detection algorithms in academic settings when the training data can't be transported outside a secure perimeter? One answer is really good fake data. Good enough to fool the bug. Good enough to emulate the fraud. I will describe log-synth and several physics-based approaches that can do this and tell some real stories about fake data.

© 2014 MapR Technologies

Who am I?
Ted Dunning, Chief Applications Architect MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning

Short Books by Ted Dunning & Ellen Friedman
• Published by O’Reilly in 2014 and 2015
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR
http://bit.ly/ebook-real-world-hadoop
http://bit.ly/mapr-tsdb-ebook
http://bit.ly/ebook-anomaly
http://bit.ly/recommendation-ebook

The basic idea

Anomaly Detection and Fraud Analytics
• Financial customer wants to identify zero-day attacks
• And advanced persistent threats
• By sophisticated adversaries who don’t use known vectors
• Must keep logs and other data secret
– But must also collaborate on detection algorithms

Secure Development is Hard
[Diagram: system knowledge and observed data feed a training algorithm that produces a model; after model deployment, the model turns new measurements into anomaly scores]

Secure Development is Hard
[Diagram repeated from the previous slide, now annotated:]
Outside collaborators are outside the security perimeter.
They can’t see the data, and they can’t tune new algorithms to fit reality.

How To Make Realistic Data
[Diagram labels: Live data, System under test, Failure signatures; Fake data, Failure signatures]

Parametric Simulation
[Diagram: live data and fake data each pass through the system under test to yield failure signatures; "Match here" marks the comparison of the two sets of signatures]
Parametric matching of failure signatures allows emulation of complex data properties.
Matching on KPIs and failure modes guarantees practical fidelity.
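One way to read parametric matching, as a minimal sketch rather than the method used in the talk: tune a generator parameter until a KPI measured on fake data agrees with the same KPI measured on live data. Everything here is hypothetical: `system_under_test` is a toy threshold detector, `generate_fake` has a single tunable spread, and `target_kpi` stands in for a number that would be measured inside the security perimeter.

```python
import random

def system_under_test(values, threshold=3.0):
    """Toy stand-in for the real system: fraction of values flagged as anomalous."""
    return sum(1 for v in values if abs(v) > threshold) / len(values)

def generate_fake(scale, n=20000, seed=0):
    """Hypothetical generator with one tunable parameter (the spread)."""
    rng = random.Random(seed)  # fixed seed keeps the KPI monotone in scale
    return [rng.gauss(0.0, scale) for _ in range(n)]

def fit_scale(target_kpi, lo=0.5, hi=5.0, steps=30):
    """Bisect on the spread until the fake-data KPI matches the target KPI."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if system_under_test(generate_fake(mid)) < target_kpi:
            lo = mid   # too few alarms: widen the fake data
        else:
            hi = mid   # too many alarms: narrow the fake data
    return (lo + hi) / 2

target_kpi = 0.05  # assumed alarm rate observed on live data
scale = fit_scale(target_kpi)
fake_kpi = system_under_test(generate_fake(scale))
print(f"scale={scale:.3f} fake KPI={fake_kpi:.4f} target={target_kpi}")
```

Only the KPI crosses the perimeter in this sketch; the live data itself never does.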

Do’s and Don’ts
• Do match the KPIs and failure modes
– Speed
– Score distribution
– False positive rates versus score
• Don’t try to match the actual data distribution precisely
– Good enough is good enough, and we want to imitate failures, not create new life forms
– Probably impossible to do precisely
– Even if possible, it is vastly harder to match distributions than to match KPIs
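To illustrate why matching KPIs can be enough, here is a small sketch that compares the false positive rate versus score threshold for two score samples with deliberately different distributions. The two samples are invented stand-ins: `live_scores` is normal and `fake_scores` is a sum of three uniform draws with the same variance. The function names are assumptions for this example only.

```python
import random

def false_positive_curve(scores, thresholds):
    """Fraction of (non-fraud) scores that exceed each threshold."""
    n = len(scores)
    return [sum(1 for s in scores if s > t) / n for t in thresholds]

def max_gap(curve_a, curve_b):
    """Largest pointwise difference between two KPI curves."""
    return max(abs(a - b) for a, b in zip(curve_a, curve_b))

rng = random.Random(42)
# Stand-in for scores observed on live data (normal, variance 1).
live_scores = [rng.gauss(0.0, 1.0) for _ in range(5000)]
# Fake scores from a different distribution with the same variance.
fake_scores = [sum(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(5000)]

thresholds = [0.5, 1.0, 1.5, 2.0, 2.5]
gap = max_gap(false_positive_curve(live_scores, thresholds),
              false_positive_curve(fake_scores, thresholds))
print(f"largest false-positive-rate gap: {gap:.4f}")
```

The underlying distributions differ, yet the KPI curves stay close, which is the kind of fidelity the slide asks for.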

Methods for Generating Numbers
• Well-known distributions
– Uniform, normal, gamma, Poisson
– Truncations
• Cumulations
– Random walk v1
• Mixture distributions
• Hyper-parameters
– Random walk v2

Normal
data = data.frame(x=rnorm(10000), y=rnorm(10000))

Mixture of Normals
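The slide shows only a plot, so the following is a sketch under assumed parameters, not the code behind the figure. A mixture is sampled in two stages: pick a component from a multinomial, then draw from that component's normal distribution. The weights, means, and standard deviations below are made up.

```python
import random

def sample_mixture(rng, components):
    """components is a list of (weight, mean, sd) tuples."""
    weights = [w for w, _, _ in components]
    # Stage 1: pick a component; stage 2: draw from its normal distribution.
    _, mean, sd = rng.choices(components, weights=weights, k=1)[0]
    return rng.gauss(mean, sd)

rng = random.Random(1)
components = [(0.7, 0.0, 1.0), (0.3, 5.0, 0.5)]  # assumed parameters
xs = [sample_mixture(rng, components) for _ in range(10000)]
mean_x = sum(xs) / len(xs)
print(f"sample mean: {mean_x:.3f}")
```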

Random Walk
y = cumsum(rnorm(10000))

Pick Mean from Multinomial

Random Walk with Variable Standard Deviation
y = cumsum(rt(10000, df=0.9))
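The hyper-parameter view of this walk can be spelled out in a short sketch. A Student t variate is a normal draw whose standard deviation is itself random, so each step first samples a scale and then a normal step with that scale. The function name and the small guard against division by zero are assumptions of this example; `df=0.9` mirrors the R call above.

```python
import random
from itertools import accumulate

def variable_sd_steps(rng, n, df=0.9):
    """Normal steps whose standard deviation is redrawn for every step."""
    steps = []
    for _ in range(n):
        # Chi-squared with df degrees of freedom is Gamma(shape=df/2, scale=2).
        v = rng.gammavariate(df / 2.0, 2.0)
        sd = (df / max(v, 1e-12)) ** 0.5  # guard against a zero draw
        steps.append(rng.gauss(0.0, sd))
    return steps

rng = random.Random(7)
steps = variable_sd_steps(rng, 10000)
walk = list(accumulate(steps))  # cumulative sum, like cumsum() in R
print(f"largest single step: {max(abs(s) for s in steps):.1f}")
```

The occasional very large step is what gives this walk its sudden jumps compared with the plain normal random walk.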

Methods for Generating Symbols
• Symbols are really just integers with a dictionary
• Well-known distributions
– Multinomial
– Dirichlet processes
– Rich-get-richer, Pitman-Yor
• Mixture distributions
• Hyper-parameters
• Lookup tables!!!
– Simple tables
– Data table joins for correlated components
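A rich-get-richer symbol stream can be sketched with the Chinese restaurant view of a Pitman-Yor process: each draw either invents a new integer symbol or repeats an old one in proportion to how often it has already appeared. The parameter values and the `merchant-` dictionary prefix are assumptions for illustration.

```python
import random
from collections import Counter

def pitman_yor_symbols(rng, n, alpha=10.0, discount=0.5):
    """Emit n integer symbols with rich-get-richer frequencies."""
    counts = []  # counts[k] = how many times symbol k has been emitted
    out = []
    for i in range(n):
        k_new = len(counts)
        p_new = (alpha + discount * k_new) / (i + alpha)
        if rng.random() < p_new:
            counts.append(1)      # invent a new symbol
            out.append(k_new)
        else:
            # Repeat an existing symbol, favoring the popular ones.
            weights = [c - discount for c in counts]
            k = rng.choices(range(k_new), weights=weights, k=1)[0]
            counts[k] += 1
            out.append(k)
    return out

rng = random.Random(2)
symbols = pitman_yor_symbols(rng, 5000)
# Symbols are just integers; a dictionary turns them into readable names.
names = [f"merchant-{k}" for k in symbols]
print(Counter(names).most_common(3))
```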

Skewed Integers
count value
207 3
203 0
198 7
196 4
195 12
193 10
189 2
187 1
185 13
179 6
178 9
177 5
177 25
174 21
173 8
173 14
170 18
[
{"name":"x", "class":"int", "skew":1}
]
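The schema above is read by log-synth itself. The sketch below is not log-synth's implementation; it only illustrates one plausible meaning of a skewed integer field, using a Zipf-like weighting. The range 0 to 99, the weighting formula, and the helper name `skewed_int` are all assumptions.

```python
import json
import random
from collections import Counter

schema = json.loads('[{"name":"x", "class":"int", "skew":1}]')

def skewed_int(rng, lo, hi, skew):
    """Hypothetical sampler: value lo + k gets weight 1 / (k + 1) ** skew."""
    values = range(lo, hi)
    weights = [1.0 / (v - lo + 1) ** skew for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

rng = random.Random(3)
field = schema[0]
rows = [{field["name"]: skewed_int(rng, 0, 100, field["skew"])}
        for _ in range(5000)]
counts = Counter(row["x"] for row in rows)
print(counts.most_common(5))
```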

Methods for Generating Behaviors
• Use structured data!
– Generate user meta-data
– Generate list of transactions
• Only flatten if necessary
• See Apache Drill for post-processing
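As a sketch of the structured approach, each generated user below carries its own metadata and a nested list of transactions, written out as one JSON document per user. Field names, value ranges, and the `flatten` helper are invented for this example; flattening happens only when a consumer needs one row per transaction.

```python
import json
import random

def make_user(rng, user_id):
    """One structured record: user metadata plus a nested transaction list."""
    n_tx = rng.randint(1, 5)
    return {
        "user_id": user_id,
        "meta": {"zip": f"{rng.randint(0, 99999):05d}",
                 "age": rng.randint(18, 90)},
        "transactions": [
            {"day": rng.randint(0, 99),
             "amount": round(rng.expovariate(1 / 40.0), 2)}
            for _ in range(n_tx)
        ],
    }

def flatten(user):
    """Optional step: one flat row per transaction."""
    return [{"user_id": user["user_id"], **user["meta"], **tx}
            for tx in user["transactions"]]

rng = random.Random(11)
users = [make_user(rng, i) for i in range(100)]
line = json.dumps(users[0])  # nested form, one JSON document per user
flat_rows = [row for u in users for row in flatten(u)]
print(line)
```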

Methods for Generating Databases
• Use integers (see previous) as foreign keys
• Normalized form implies (approximate) independence of tables
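A minimal sketch of the foreign-key idea: generate each table independently and let randomly drawn integers tie them together. The table names, columns, and sizes are hypothetical.

```python
import random

rng = random.Random(5)

# Each table is generated on its own; integer ids act as primary keys.
customers = [{"customer_id": i, "segment": rng.choice(["retail", "business"])}
             for i in range(50)]
merchants = [{"merchant_id": i, "category": rng.choice(["food", "fuel", "travel"])}
             for i in range(20)]

# Foreign keys are just integers drawn from the key ranges of the other tables.
transactions = [
    {"tx_id": t,
     "customer_id": rng.randrange(len(customers)),
     "merchant_id": rng.randrange(len(merchants)),
     "amount": round(rng.expovariate(1 / 25.0), 2)}
    for t in range(1000)
]

# A join recovers the denormalized view when it is needed.
customer_by_id = {c["customer_id"]: c for c in customers}
joined = [{**tx, "segment": customer_by_id[tx["customer_id"]]["segment"]}
          for tx in transactions]
print(joined[0])
```

Uniform key draws are the simplest choice; a skewed integer sampler could be swapped in to make some customers or merchants far busier than others.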

Go get log-synth
https://github.com/tdunning/log-synth

A worked example...

Simulation Setup
[Plot: count (0 to 500) versus day (0 to 100), with series "compromises" and "frauds" and marked intervals "Compromise period" and "Exploit period"]

Thank You
Ted Dunning, Chief Application Architect, MapR Technologies
tdunning@mapr.com
tdunning@apache.org
Social: @mapr, maprtech, mapr-technologies
