Data analysis00 commonprobabilitymodels

http://publicationslist.org/junio
Data Analysis
Common Probability Models
Prof. Dr. Jose Fernando Rodrigues Junior
ICMC-USP

What is it about?
Some systems are too difficult, if not impossible, to model
In these circumstances, a better approach is to match the
system with a known probability model
Instead of modeling how a system works, we model how its
outcomes behave
There are many such models; a few of them, though, are observed
most often and can be handled with moderately advanced
algebraic tools

Binomial distribution (Bernoulli trials)
Bernoulli trials refer to random events with only two possible
outcomes, usually named success and failure
The two outcomes are mutually exclusive, the trials are independent,
and the outcomes happen with probabilities p and 1-p
Examples:
  Coin: success and failure each with probability ½
  Die: success for one chosen face has probability 1/6, against 5/6 for failure
  A basket with r red and b black balls: success for red is r/(r+b)
  Two coins: success for two heads is ¼
The nature of the Bernoulli trial lends itself to a simple model

For N trials, the probability of exactly k successes (in any order),
each with probability p, is given by the Binomial distribution:
$$P(k; N, p) = \binom{N}{k} p^k (1-p)^{N-k}$$
where $\binom{N}{k} = \frac{N!}{k!\,(N-k)!}$ is the number of distinct
arrangements of k successes and N-k failures

The Binomial distribution tells two things:
Too many successes are not probable; and
Too few successes are not probable either
That is, if you take the risk in a binary event, the
more you try, the more you fail and also the more
you succeed
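As a minimal sketch of the Binomial formula, the snippet below evaluates the probability of exactly k successes for ten fair coin flips. The function name `binomial_pmf` and the choice of N = 10, p = 0.5 are illustrative assumptions, not part of the original slides.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Illustrative choice: ten fair coin flips.
n, p = 10, 0.5
for k in range(n + 1):
    print(f"k={k:2d}  P={binomial_pmf(k, n, p):.4f}")
```

The printed values are smallest at k = 0 and k = 10 and largest at k = 5, which is the "neither too many nor too few successes" behaviour described above.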

Following from this, the expected number (mean) of
successes in N trials is:
$$\mu = Np$$
With standard deviation:
$$\sigma = \sqrt{Np(1-p)}$$
Notice that the standard deviation grows as $\sqrt{N}$, more slowly
than the mean, which grows as N

Example: suppose we try to develop a model to predict the staffing
required for a call center. We know that about one in every thousand
orders will lead to a complaint (hence p = 1/1000) and that we take
about Np=1000 complaints a day, as 1 million orders are shipped
every day.
The standard deviation in this example comes out to be
√(Np(1 − p)) ≈ √1000 ≈ 32, as 1 − p is very close to 1 for the current
value of p. A deviation of about 3% is quite acceptable for an expected
value of 1000.
For this simple example, the required staff is determined according to
the number of complaints an employee can attend per day, considering
Np complaints a day.
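The arithmetic of the call center example can be sketched as follows. The order volume and p come from the example; the per-employee capacity of 50 complaints a day and the two-standard-deviation safety margin are hypothetical values chosen only for illustration.

```python
from math import ceil, sqrt

orders_per_day = 1_000_000    # from the example
p_complaint = 1 / 1000        # from the example
complaints_per_employee = 50  # hypothetical capacity, not from the source

# Mean and standard deviation of the Binomial distribution.
mean = orders_per_day * p_complaint
sd = sqrt(orders_per_day * p_complaint * (1 - p_complaint))

# Staff needed for the expected load, and with an assumed 2-sd margin.
staff_for_mean = ceil(mean / complaints_per_employee)
staff_with_margin = ceil((mean + 2 * sd) / complaints_per_employee)

print(f"mean={mean:.0f}, sd={sd:.1f}")
print(f"staff for the mean: {staff_for_mean}, with margin: {staff_with_margin}")
```

Because the standard deviation is small relative to the mean, the margin adds only a couple of employees to the estimate.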

Gaussian distribution
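Also known as the Normal distribution or the Bell curve, the Gaussian
distribution is the most commonly observed distribution. It is given
by:
$$p(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where $\mu$ is the mean
and $\sigma$ is the standard deviation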

Some examples of Gaussian curves:

The cumulative distribution function (CDF) describes the probability of a random
variable falling in the interval (−∞, x]. For some examples of Gaussian curves, we have:
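A minimal sketch of the Gaussian density and its CDF, using only the Python standard library, may help make the two definitions concrete. The function names `gaussian_pdf` and `gaussian_cdf` are illustrative; the CDF is written in terms of the error function `math.erf`.

```python
from math import erf, exp, pi, sqrt


def gaussian_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Gaussian probability density at x."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))


def gaussian_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Probability that a Gaussian variable falls in (-inf, x]."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  pdf={gaussian_pdf(x):.4f}  cdf={gaussian_cdf(x):.4f}")
```

The density is symmetric around the mean, and the CDF passes through 0.5 exactly at the mean.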

Gaussians are so useful because:
  Most events fall close to the mean (about 68% within [mean − sd, mean + sd]),
which simplifies probability expectations: outliers are not expected
  Identifying a Gaussian distribution leads to a simpler, though rigorous,
comprehension of the phenomenon without investigating the system in
depth
  For Gaussian distributions, the basic statistical summaries, mean and standard
deviation, are applicable
  It is simpler to perform calculations over the Gaussian distribution, especially
integrals; for this reason, it is often used as a kernel
Gaussians are not always useful because:
  They predict the absence of outliers, which is not the case in many real situations
  There are many phenomena that are not Gaussian, in which case mean and
standard deviation are misleading
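How strongly a Gaussian concentrates around its mean, and how firmly it rules out outliers, can be checked numerically. The sketch below uses the standard identities P(|x − mean| ≤ k·sd) = erf(k/√2) and its complement erfc(k/√2); the helper names are illustrative.

```python
from math import erf, erfc, sqrt


def prob_within(k: float) -> float:
    """Probability that a Gaussian value lies within k standard deviations."""
    return erf(k / sqrt(2))


def prob_beyond(k: float) -> float:
    """Probability that a Gaussian value lies more than k sd from the mean."""
    return erfc(k / sqrt(2))


for k in (1, 2, 3, 5):
    print(f"k={k}: within={prob_within(k):.6f}  beyond={prob_beyond(k):.2e}")
```

Under a Gaussian model a value five standard deviations from the mean is expected less than once in a million observations, so data that show such values regularly are a sign that the model does not apply.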

Power-law distribution
The power-law (Zipf or Pareto) is a special case of non-normal
statistics
Example: consider the number of visits per person in a website

In the plot two facts stand out:
 the huge number of people who made a handful of visits (fewer than 5 or 6)
 at the other extreme, the huge number of visits that a few people made
This kind of distribution is mostly composed of outliers. Its mean is 26
visits per person, which makes no sense for the observed data; the
standard deviation, 437, makes even less sense, as it implies negative
numbers of visits
In contrast to Gaussian distributions, with their short, quickly falling
tails, power-law distributions are characterized by “heavy (fat, long)
tails”
Such distributions can be identified with a log-log plot: the data fall on
a straight line whose slope is the exponent (power) of the distribution function

For the website example, the log-log plot indicates a line with slope
-1.9, hence the data are modeled as
number of users ~ (number of visits per user)^(-1.9)
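The slope of a log-log plot can be estimated with an ordinary least-squares line fitted to the logarithms of both axes. The sketch below does this with the standard library only; the data are synthetic, generated from an exact power law with exponent -1.9 rather than taken from the website example, and `loglog_slope` is an illustrative name.

```python
from math import log10


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [log10(x) for x in xs]
    ly = [log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


# Synthetic data: number of users following an exact power law.
visits = [1, 2, 5, 10, 20, 50, 100]
users = [1000 * v ** -1.9 for v in visits]

slope = loglog_slope(visits, users)
print(f"estimated exponent: {slope:.2f}")
```

With real, noisy data the points only approximate a line, and the fit is usually restricted to the range where the plot looks straight.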

Well-known power-law distributions:
 the frequency with which words are used in texts
 the magnitude of earthquakes
 the size of files
 the copies of books sold
 the intensity of wars
 the sizes of sand particles and solar flares
 the population of cities
 and the distribution of wealth
Challenges imposed by the distribution:
  Observations span a wide range of values, often many orders of magnitude
  There is no typical scale or value that could be used for summarization
  The distribution is extremely skewed, with many data points at the low end and
few (but not negligibly few) data points at very high values
  Expectation values often depend on the sample size and, in contrast to other
distributions, do not settle down as more values are considered
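One way to see why expectation values do not settle is to compute the mean of a power law truncated at a largest observed value, which tends to grow with the sample size. The sketch below assumes a discrete distribution with probability proportional to x^(-1.9) for x = 1..max_value; the exponent is borrowed from the website example, while the truncation points and the function name are hypothetical.

```python
def truncated_power_law_mean(exponent: float, max_value: int) -> float:
    """Mean of a discrete power law p(x) ~ x**(-exponent) on 1..max_value."""
    values = range(1, max_value + 1)
    weights = [x ** -exponent for x in values]
    total = sum(weights)
    return sum(x * w for x, w in zip(values, weights)) / total


# The mean keeps growing as larger values become observable.
for max_value in (10, 1_000, 100_000):
    mean = truncated_power_law_mean(1.9, max_value)
    print(f"largest value {max_value:>7}: mean = {mean:.2f}")
```

For an exponent below 2 the mean has no finite limit, so each larger cutoff yields a larger mean instead of converging to a stable value.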

How to work with power-law distributions?
  Do not use classical methods on the whole data set, especially mean and standard deviation
  Segment the data into:
    The majority of data points at small values
    The set of points in the tail
    The intermediate points
  For each segment, try to use classical methods
  Go into the problem domain so as to explain the behavior of each segment
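The segmentation steps above can be sketched as follows. The visit counts are made up for illustration, and the cut points (5 and 100) are hypothetical; in practice they would come from inspecting the data and the problem domain.

```python
from statistics import mean, pstdev


def summarize_segments(values, low_cut, high_cut):
    """Split values into head, middle and tail, then summarize each segment."""
    segments = {
        "head": [v for v in values if v <= low_cut],
        "middle": [v for v in values if low_cut < v < high_cut],
        "tail": [v for v in values if v >= high_cut],
    }
    return {
        name: {"n": len(seg), "mean": mean(seg), "sd": pstdev(seg)}
        for name, seg in segments.items()
        if seg  # skip empty segments
    }


# Made-up visit counts: many small values, a few very large ones.
visits = [1] * 60 + [2] * 20 + [3] * 8 + [5] * 4 + [12, 20, 35, 80] + [900, 2500]

print(f"overall mean: {mean(visits):.1f}")
for name, stats in summarize_segments(visits, low_cut=5, high_cut=100).items():
    print(f"{name:>6}: n={stats['n']:3d}  mean={stats['mean']:.1f}  sd={stats['sd']:.1f}")
```

The overall mean describes none of the visitors well, while the per-segment summaries stay close to the values actually observed in each group.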

References
 Philipp K. Janert, Data Analysis with Open Source Tools,
O’Reilly, 2010.
 Wikipedia, http://en.wikipedia.org
 Wolfram MathWorld, http://mathworld.wolfram.com/
