
Finding Patterns in Data Breaches




One problem that every information security organization faces is how to accurately quantify the risks that they manage. In most cases, there is not enough information available to do this. There is now enough known about data breaches to let us draw interesting conclusions, some of which may even have implications in other areas of information security. This talk describes what we can learn from a careful analysis of the available information on data breaches, how we can extend what we learn about data breaches to other aspects of information security, and why doing this makes sense.

Luther Martin, Chief Security Architect, Voltage Security, Inc.

Luther Martin is the Chief Security Architect at Voltage Security, Inc., a vendor of encryption technology and products. He began his career in information security at the National Security Agency, where he graduated from the NSA's Cryptologic Mathematician Program in 1991, and eventually became the Technical Director of the NSA's Engineering and Physical Sciences Security Division.

Since leaving the NSA, he has worked at both security consulting and product companies. Notable accomplishments during this period include creating the security code review practice at the consulting firm Ernst & Young, running the first commercial security code review projects, and creating the public-key infrastructure technology that was used in the U.S. Postal Service's PC Postage program.

He is the author of Introduction to Identity-based Encryption, and has contributed to seven other books and over 100 articles on the topics of information security and risk management.




Speaker Notes

  • ISSA Journal, March 2008, https://www.issa.org/Library/Journals/2008/March/Martin%20-%20The%20Information%20Security%20Life%20Cycle.pdf
  • Offer free VSN to the person with the best answer
  • Monthly foreclosure odds, with the annualized rate in parentheses: NY 1 in 1,660 (0.7%); CA 1 in 194 (6%); NV 1 in 84 (13%); WY 1 in 2,621 (0.4%)
  • About $0.09 per $100 in the US; much less in other countries
  • Historically about 4%; now about 10%
  • So Lake Wobegon is almost possible – all but one child can be above average
  • Bell Atlantic and Dean’s troposcatter system
  • HM Revenue & Customs, National Archives and Records Administration
  • Frank Benford, “The law of anomalous numbers,” Proceedings of the American Philosophical Society, Vol. 78, No. 4, pp. 551–572, 1938.

Finding Patterns in Data Breaches: Presentation Transcript

  • Finding Patterns in Data Breaches
    Luther Martin
    October 21, 2010
  • Overview
    Attempt at humor
    Getting in the right frame of mind to think about statistics
    A reminder of some concepts from statistics
    What we can learn from data breaches
    What this tells us
    Some generalizations that might or might not be accurate
  • System development lifecycle (SDLC)
  • Security development lifecycle (SDLC)
  • Estimating some numbers
    What’s the probability of an exploitable vulnerability existing in your web server right now?
    What’s the probability of your web server being hacked in the next 12 months?
    If you don’t encrypt email, what’s the probability of it being intercepted and read on the Internet?
    Too hard?
  • Some easier questions
    What’s the current mortgage foreclosure rate?
    What’s the current fraud loss rate in the US for payment (credit and debit) cards?
    What’s the current charge-off rate in the US for credit card loans?
  • The foreclosure rate
    Currently about 1 in 381 per month, which compounds to about 3 percent per year (1 - (1 - 1/381)^12 ≈ 3.1%)
  • Payment card fraud loss rate
  • The charge-off rate for credit cards
  • More about statistics
    We described each of these using only one number
    An average
    That’s not the whole story
    The average person has less than 2 legs!
    1.99… < 2
    Most people have an above-average number of legs!
  • Even more about statistics
    It’s often useful to have a second number that tells how much variation we have in our data
    Two sets of data can have the same average but be very different
    Same mean, different variance
  • The normal distribution
    The so-called normal distribution (“bell curve”) appears again and again in statistics
    Many things end up with a normal distribution when you might not expect it
  • The Central Limit Theorem
    If you add random values together you tend to get a normal distribution
    Proof by picture (the chart isn’t reproduced in this transcript; a simulation sketch follows below)
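Since the "proof by picture" chart is missing here, the following is a minimal Python sketch of the same idea: sums of independent uniform random values pile up into a bell shape, exactly as the Central Limit Theorem predicts.

    import random

    # Sum 20 independent uniform(0, 1) values, repeat many times, and
    # draw a crude text histogram of the sums. The Central Limit Theorem
    # says the counts should form a rough bell curve centered near 10.
    sums = [sum(random.random() for _ in range(20)) for _ in range(10_000)]
    for lo in range(5, 15):
        count = sum(lo <= s < lo + 1 for s in sums)
        print(f"{lo:2}-{lo + 1:<2} {'#' * (count // 100)}")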
  • Why a known distribution is useful
    If we know that we have data that follows a particular probability distribution we can predict what we’ll see in the future with fairly good accuracy
    If you flip a fair coin 100 times then
    You’ll get about 50 Heads
    There’s about a 73 percent chance of getting 45 to 55 Heads
    There’s about a 2 percent chance of getting more than 60 Heads
    What this doesn’t do is predict how any particular flip of the coin will turn out
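These coin-flip figures are easy to verify from the exact binomial distribution. A quick check in Python (assuming scipy is available):

    from scipy.stats import binom

    n, p = 100, 0.5  # 100 flips of a fair coin

    # P(45 <= heads <= 55): about 0.73, matching the slide
    print(binom.cdf(55, n, p) - binom.cdf(44, n, p))

    # P(heads > 60): about 0.018, i.e. roughly 2 percent
    print(binom.sf(60, n, p))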
  • One more review of math: logarithms
    Logarithms are exponents
    So if we have these numbers:
    10, 100, 1,000, 10,000, …
    or 10^1, 10^2, 10^3, 10^4, …
    Then their logarithms are
    1, 2, 3, 4, …
    Note that multiplying corresponds to adding exponents (logs): 10^2 × 10^3 = 10^(2+3) = 10^5
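The multiply-versus-add identity takes one line to confirm in Python:

    import math

    # log10(a * b) == log10(a) + log10(b): multiplying numbers
    # corresponds to adding their logarithms.
    a, b = 10**2, 10**3
    print(math.log10(a * b))              # 5.0
    print(math.log10(a) + math.log10(b))  # 5.0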
  • Logarithms naturally occur in lots of ways
    Human perception of sound (or light) is roughly proportional to the logarithm of the stimulus level rather than to the level itself
    If you double the sound pressure level it doesn’t double how loud it sounds to us
    Doubling the perceived loudness instead corresponds to doubling the logarithm of the sound pressure level
    That’s why decibels are used to measure sound levels, etc.
    So logarithms may be annoying, but they’re also useful in some cases
  • Another use for logarithms
    Logarithms are also a good way to handle big ranges in numbers
    Radio: transmit kilowatts (1,000 watts), receive milliwatts (0.001 watts)
    Hard to plot big ranges on one graph
    Very small numbers look just like zero
    Taking logarithms makes a big range easier to handle
    3 to -3 instead of 1,000 to 0.001
  • What about data breaches?
    The most comprehensive data is that maintained by the Open Security Foundation
    Currently has information on close to 3,000 data breaches
    Probably the most useful source of information on data breaches
    What patterns can we find in the OSF’s data?
  • Data breaches since 2006
  • Making the range of values smaller
  • Sort these values to get…
  • The log of breach size matches a normal distribution very well
    Mean 3.2, standard deviation 1.2 for the base-10 log of the number of records exposed (see the fitting sketch below)
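For illustration, here is a sketch of how such a fit might be checked; breach_sizes is a hypothetical stand-in for the OSF records-exposed counts, which in practice would be loaded from the OSF's database export.

    import math
    import statistics

    # Hypothetical stand-in for the OSF "records exposed" counts
    breach_sizes = [120, 5_000, 830, 2_600_000, 75, 14_000, 300_000]

    logs = [math.log10(n) for n in breach_sizes]
    # The talk reports roughly mean 3.2 and standard deviation 1.2
    # for the full data set.
    print(f"mean of log10(size) = {statistics.mean(logs):.2f}")
    print(f"std dev of log10(size) = {statistics.stdev(logs):.2f}")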
  • What does this tell us?
    We may be able to understand the process that leads to data breaches
    We may be able to predict some things about future data breaches
    We may be able to find a good metric for industry-wide efforts to reduce data breaches
    We really need comprehensive data to find patterns that might be there
    Very small breaches are as important as very big ones
  • Understanding the process
    Just like we get a normal distribution from adding several random values together, we get a lognormal distribution when we multiply several random values together
    Multiplying corresponds to adding exponents (logs)
    This suggests that what we see for data breaches may be explained by a layered model of security
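A minimal simulation of that multiplicative story: build each "breach" as a product of independent random factors, take base-10 logs, and the logs come out roughly normal. The number of layers and the factor distribution below are arbitrary assumptions, chosen only to make the point.

    import math
    import random
    import statistics

    def simulated_breach():
        # One multiplicative factor per security layer the attacker
        # gets through (ten layers assumed here, purely for illustration)
        size = 1.0
        for _ in range(10):
            size *= random.uniform(1, 20)
        return size

    # The product is roughly lognormal, so log10 of it is roughly normal
    logs = [math.log10(simulated_breach()) for _ in range(10_000)]
    print(statistics.mean(logs), statistics.stdev(logs))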
  • Abstract layered model of security
  • The general case: if we have the following (formalized in the sketch after this list)
    The security provided by two technologies when they’re both used is greater than or equal to the security of each of the components when they’re used by themselves
    If two technologies are independent then the security provided by the two technologies when they’re used together is equal to the sum of the security provided by each of the technologies
    The security provided by any technology is non-negative
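In symbols (a sketch; the notation S(x) for "the security provided by x" and A ∘ B for "A and B used together" is mine, not the deck's):

    \begin{align}
      S(A \circ B) &\ge \max\{S(A),\, S(B)\} \\
      S(A \circ B) &= S(A) + S(B) \quad \text{if $A$ and $B$ are independent} \\
      S(A)         &\ge 0
    \end{align}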
  • It’s more than just data breaches
    Note that this model, in which each layer of security that a hacker bypasses multiplies the hacker’s chance of success, doesn’t just apply to data breaches
    It also applies to any other aspect of information security
    When we learn how to quantify other types of security incidents we’ll probably find that the damage from them also follows a lognormal distribution
  • Then we have to have…
    A measure of security that works that way has to essentially be a logarithm
    Measuring security breaches in terms of logarithms may end up making more sense than measuring security breaches directly
    We see it with data breaches
    We’ll probably see it for other types of losses once we learn how to quantify those losses
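One way to fill in the step from "layers multiply" to "the measure is a logarithm" (a sketch; p(x), the probability that an attacker defeats x, is my notation): for independent layers the defeat probabilities multiply while, by the slide above, the security measures add, and the logarithm is essentially the only continuous function that turns products into sums:

    \[
      p(A \circ B) = p(A)\,p(B)
      \quad\text{and}\quad
      S(A \circ B) = S(A) + S(B)
      \;\Longrightarrow\;
      S(x) = -c \log p(x) \quad \text{for some constant } c > 0.
    \]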
  • Does this interpretation make sense?
    Other places where the lognormal distribution appears:
    The concentration of gold or uranium in ore deposits
    The latency period of bacterial food poisoning
    The age of the onset of Alzheimer's disease
    The amount of air pollution in Los Angeles
    The abundance of fish species
    The size of ice crystals in ice cream
    The number of words spoken in a telephone conversation
    The length of sentences written by George Bernard Shaw or Gilbert K. Chesterton
  • What can we predict?
    There’s about a 1 percent chance of any breach exposing 1 million or more records
    There’s about a 0.1 percent chance of any breach exposing 10 million or more records
    We can expect about 68 percent of breaches to expose between 100 and 25,000 records
    We can expect about 95 percent of breaches to expose between 6 and 400,000 records
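All four predictions follow from a normal distribution with mean 3.2 and standard deviation 1.2 for log10 of the breach size. A quick check (assuming scipy is available):

    from scipy.stats import norm

    mu, sigma = 3.2, 1.2  # fitted parameters for log10(records exposed)

    print(norm.sf(6, mu, sigma))  # P(size >= 10^6): about 0.01
    print(norm.sf(7, mu, sigma))  # P(size >= 10^7): about 0.001

    # Central intervals: +/-1 sigma covers ~68%, +/-2 sigma covers ~95%
    print(10**(mu - sigma), 10**(mu + sigma))      # ~100 to ~25,000
    print(10**(mu - 2*sigma), 10**(mu + 2*sigma))  # ~6 to ~400,000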
  • What can we NOT predict?
    How many data breaches we should expect to see in the next 12 months
    Whether or not any particular business will suffer a data breach in the next 12 months
    Whether or not your business will suffer a data breach in the next 12 months
  • Other patterns: Benford’s law
    Benford’s law tells us that the leading digits in data tend not to be evenly distributed
    The probability of the leading digit being n is
    P(n) = log10(1 + 1/n)
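The formula is easy to tabulate:

    import math

    # Benford's law: P(leading digit = n) = log10(1 + 1/n)
    for n in range(1, 10):
        print(n, round(math.log10(1 + 1 / n), 3))
    # 1 -> 0.301, 2 -> 0.176, ..., 9 -> 0.046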
  • Why Benford’s law might make sense
    Consider what happens with exponential growth
    Start with 1 and multiply by 1.1 at each step:
    1, 1.10, 1.21, 1.33, 1.46, 1.61, 1.77, 1.95,
    2.14, 2.36, 2.59, 2.85, 3.14, 3.45, 3.80,
    4.18, 4.59, 5.05, 5.56, 6.12, 6.73, 7.40,
    8.14, 8.95, 9.85, 10.83, …
    Note that 1 is the most common leading digit, etc. (counted in the sketch below)
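Extending the sequence for a few hundred steps and counting leading digits reproduces Benford's proportions. A minimal sketch:

    from collections import Counter

    # Exponential growth: start at 1 and multiply by 1.1 at each step,
    # then count the leading digits of the first 500 terms.
    value, digits = 1.0, []
    for _ in range(500):
        digits.append(str(value)[0])  # first character is the leading digit
        value *= 1.1
    print(Counter(digits))  # '1' is the most common, '9' the rarest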
  • Benford’s law for breaches
  • Other patterns
    There are other patterns that we can find
    But they’re really just ways to repackage the exponential growth idea
    No real new ideas
    Zipf’s law
    Pareto’s principle
  • Zipf’s law
    Order the data from biggest to smallest
    Then the total contribution from any entry is inversely proportional to its position in the table
    Second entry is about 1/2 of the first one
    Third entry is about 1/3 of the first one
    The nth entry is about 1/n of the first one
    R² = 0.873
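A sketch of the kind of check behind a figure like that R²; the sorted sizes here are hypothetical stand-ins, not the actual OSF data:

    # Zipf's law: after sorting from biggest to smallest, the n-th entry
    # should be roughly 1/n of the first one.
    sizes = [94_000_000, 40_000_000, 26_500_000, 25_000_000, 17_000_000]

    first = sizes[0]
    for rank, size in enumerate(sizes, start=1):
        print(rank, size, round(first / rank))  # observed vs Zipf prediction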
  • Pareto’s principle
    Sometimes known as the “80-20 rule”
    Very similar to the others that we’ve mentioned
    In general, k% of the population accounts for (100 - k)% of something, for some k between 50 and 100
    For k = 80 we get the 80-20 rule
    Empirically, most data sets seem to cluster around k in the middle of this range
    It’s yet another power law
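Finding k for a given data set takes only a few lines; breach_sizes is again a hypothetical stand-in:

    # Find the Pareto point: walk down the breaches from biggest to
    # smallest until the share of records seen so far crosses
    # 1 - (fraction of breaches seen so far).
    breach_sizes = sorted([120, 5_000, 830, 2_600_000, 75, 14_000,
                           300_000, 9_400, 61, 1_200], reverse=True)

    total = sum(breach_sizes)
    running = 0
    for i, size in enumerate(breach_sizes, start=1):
        running += size
        share = running / total       # fraction of all records so far
        frac = i / len(breach_sizes)  # fraction of all breaches so far
        if share + frac >= 1:
            print(f"top {frac:.0%} of breaches hold {share:.0%} of records")
            break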
  • Bottom line
    It certainly looks like it’s possible to find some interesting structure in the data that’s available for data breaches
    The size of data breaches seems to follow a very well defined pattern
    We may see this same pattern in other parts of information security when we learn how to quantify other types of losses due to security breaches
    We need lots of data to see the patterns in it
    Data on small breaches is as important as data on the big ones
  • Practical implications (so what?)
    Developing metrics
    Developing ROI models
    Pricing insurance
    Are we winning or are hackers winning?
    Any time when quantifying a loss is useful
  • Some useful references
    The OSF’s data breach database
    E. Limpert, W. A. Stahel and M. Abbt, “Log-normal Distributions across the Sciences: Keys and Clues,” BioScience, Vol. 51, No. 5, pp. 341–352, 2001
    The Voltage corporate blog
    CSO Magazine article on finding patterns in data breaches