Data Analysis 00: Common Probability Models

Revised version - spell checked.

1. Data Analysis: Common Probability Models. Prof. Dr. Jose Fernando Rodrigues Junior, ICMC-USP. http://publicationslist.org/junio
2. What is it about? Some systems are too difficult, if not impossible, to model directly. In these circumstances, a better approach is to match the system with a known probability model. Instead of modeling how a system works, we model how its outcomes behave. There are many such models, but a few are observed especially often and can be handled with moderately advanced algebraic tools.
3. Binomial distribution (Bernoulli trials)
Bernoulli trials are random events with only two possible outcomes, usually named success and failure. The two outcomes are mutually exclusive and occur with probabilities p and 1-p; successive trials are independent of each other.
Examples:
  • Coin: success and failure each have probability ½
  • Die: success for one given face has probability 1/6, against 5/6 for failure
  • A basket with r red and b black balls: the probability of success for red is r/(r+b)
  • Two coins: success for two heads has probability ¼
The nature of the Bernoulli trial lends itself to a simple model.
4. Binomial distribution (Bernoulli trials)
For N trials, each succeeding with probability p, the probability of exactly k successes (in any order) is given by the Binomial distribution:
$$P(k; N, p) = \binom{N}{k}\, p^k (1-p)^{N-k}$$
where:
$$\binom{N}{k} = \frac{N!}{k!\,(N-k)!}$$
is the number of distinct arrangements of k successes and N-k failures.
The Binomial distribution tells two things:
  • Too many successes are not probable; and
  • Too few successes are not probable either
That is, if you take a risk on a binary event, the more you try, the more you fail and also the more you succeed.
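The Binomial formula can be evaluated directly. The following is a minimal sketch, not part of the original slides; the function name `binomial_pmf` and the ten-coin-toss example are assumptions chosen for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli trials."""
    if not 0 <= k <= n:
        return 0.0
    # comb(n, k) counts the arrangements of k successes among n trials.
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Illustrative case (assumption): ten tosses of a fair coin.
n_trials, p_success = 10, 0.5
for k in range(n_trials + 1):
    print(f"{k:2d} successes: {binomial_pmf(k, n_trials, p_success):.4f}")
```

The printed probabilities peak around k = 5 and fall off toward both ends, which is the "too many and too few successes are both improbable" behavior described on the slide.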
7. Binomial distribution (Bernoulli trials)
Following from this, the expected number (mean) of successes in N trials is:
$$\mu = Np$$
with standard deviation:
$$\sigma = \sqrt{Np(1-p)}$$
Notice that the standard deviation grows as $\sqrt{N}$, that is, more slowly than the mean, which grows linearly in N.
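To see how the spread shrinks relative to the mean as the number of trials grows, the two formulas can be tabulated. This is a small sketch; the helper name `binomial_mean_sd` and the chosen values of N and p are illustrative assumptions.

```python
from math import sqrt


def binomial_mean_sd(n: int, p: float) -> tuple[float, float]:
    """Return (mean, standard deviation) of the number of successes in n trials."""
    return n * p, sqrt(n * p * (1 - p))


# Illustrative values (assumption): fair coin, increasing number of trials.
for n in (100, 10_000, 1_000_000):
    mean, sd = binomial_mean_sd(n, 0.5)
    # The ratio sd/mean falls as 1/sqrt(N).
    print(f"N={n:>9,}  mean={mean:>9,.0f}  sd={sd:>6,.1f}  sd/mean={sd / mean:.4f}")
```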
8. Binomial distribution (Bernoulli trials)
Example: suppose we try to develop a model to predict the staffing required for a call center. We know that about one in every thousand orders will lead to a complaint (hence p = 1/1000) and that 1 million orders are shipped every day, so we expect about Np = 1000 complaints a day. The standard deviation in this example comes out to be √(Np(1 − p)) ≈ √1000 ≈ 32, as 1 − p is very close to 1 for the current value of p. A deviation of roughly 3% is quite acceptable for an expected value of 1000. For this simple example, the required staff is determined by the number of complaints an employee can handle per day, considering Np complaints a day.
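The call-center arithmetic can be reproduced in a few lines. This is a hedged sketch: the order volume and complaint rate come from the example, while `complaints_per_employee` and the two-standard-deviation margin are hypothetical figures introduced only to show how the staffing step could follow.

```python
from math import ceil, sqrt

orders_per_day = 1_000_000
p_complaint = 1 / 1000

# Mean and standard deviation of daily complaints under the Binomial model.
expected = orders_per_day * p_complaint
sd = sqrt(orders_per_day * p_complaint * (1 - p_complaint))

# Hypothetical figure, not from the slides: complaints one employee handles per day.
complaints_per_employee = 50

# Staff needed for the expected load, and with an assumed margin of two standard deviations.
staff_for_mean = ceil(expected / complaints_per_employee)
staff_with_margin = ceil((expected + 2 * sd) / complaints_per_employee)

print(f"expected complaints: {expected:.0f}, standard deviation: {sd:.1f}")
print(f"staff for the mean: {staff_for_mean}, with 2-sd margin: {staff_with_margin}")
```

Because the standard deviation is small relative to the mean, the margin adds only a couple of employees under these assumed numbers.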
9. Gaussian distribution
Also known as the Normal distribution or Bell curve, the Gaussian distribution is the most commonly observed distribution. It is given by:
$$p(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where the mean is given by $\mu$ and the standard deviation by $\sigma$.
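The Gaussian density translates directly into code. The following is a minimal sketch with an assumed function name, `gaussian_pdf`, and illustrative parameter choices.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Gaussian probability density with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))


# Illustrative curves (assumption): same mean, different standard deviations.
for sigma in (0.5, 1.0, 2.0):
    row = "  ".join(f"{gaussian_pdf(x, 0.0, sigma):.3f}" for x in (-2, -1, 0, 1, 2))
    print(f"sigma={sigma}: {row}")
```

A smaller standard deviation gives a taller, narrower curve; a larger one gives a flatter, wider curve.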
10. Gaussian distribution
[Figure: some examples of Gaussian curves]
11. Gaussian distribution
The cumulative distribution function (CDF) describes the probability of a random variable falling in the interval (−∞, x].
[Figure: CDFs for some examples of Gaussian curves]
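For a Gaussian, the CDF can be written in terms of the error function, which Python exposes as `math.erf`. The sketch below is illustrative; the function name `gaussian_cdf` is an assumption.

```python
from math import erf, sqrt


def gaussian_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Probability that a Gaussian random variable falls in (-inf, x]."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


# Probability mass within one and two standard deviations of the mean.
within_one_sd = gaussian_cdf(1.0) - gaussian_cdf(-1.0)
within_two_sd = gaussian_cdf(2.0) - gaussian_cdf(-2.0)
print(f"within 1 sd: {within_one_sd:.4f}, within 2 sd: {within_two_sd:.4f}")
```

Differences of CDF values give the probability of landing in an interval, which is how statements such as "most events fall within one standard deviation of the mean" can be quantified.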
12. Gaussian distribution
Gaussians are so useful because:
  • Most events (about 68%) occur in the range [mean − sd, mean + sd], which simplifies probability expectations: outliers are rare
  • Identifying a Gaussian distribution leads to a simpler, though rigorous, comprehension of the phenomenon without having to investigate the system in depth
  • For Gaussian distributions, the basic statistical summaries, mean and standard deviation, are applicable
  • Calculations over the Gaussian distribution, especially integrals, are comparatively simple; for this reason, it is often used as a kernel
Gaussians are not useful because:
  • They predict that outliers are extremely rare, which is not the case in many real situations
  • There are many phenomena that are not Gaussian, cases in which mean and standard deviation are misleading
13. Power-law distribution
The power-law (Zipf or Pareto) distribution is a special case of non-normal statistics. Example: consider the number of visits per person to a website.
14. Power-law distribution
In the plot, two facts stand out:
  • the huge number of people who made only a handful of visits (fewer than 5 or 6)
  • at the other extreme, the huge number of visits that a few people made
This kind of distribution is mostly composed of outliers. Its mean is 26 visits per person, which makes no sense for the observed data; the standard deviation, 437, makes even less sense, as mean − sd would imply negative numbers of visits. In contrast to Gaussian distributions, with their quickly falling short tails, power-law distributions are characterized by "heavy" (fat, long) tails.
15. Power-law distribution
Such distributions can be identified by a log-log plot, in which the data fall on a straight line whose slope is the exponent of the distribution function. For the website example, the log-log plot indicates a line with slope −1.9, hence the data are modeled as:
$$\text{number of users} \sim (\text{number of visits per user})^{-1.9}$$
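The slope of the line in a log-log plot can be estimated with an ordinary least-squares fit on the logarithms. The sketch below uses synthetic, noise-free data generated with exponent −1.9, not the website data from the slides; the function name `loglog_slope` and the scale factor are assumptions. On real, noisy data a straight-line fit like this is only a rough first check.

```python
from math import log10


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [log10(x) for x in xs]
    ly = [log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


# Synthetic data (assumption): users ~ visits^-1.9 with an arbitrary scale of 10,000.
visits = [1, 2, 5, 10, 20, 50, 100]
users = [10_000 * v ** -1.9 for v in visits]

slope = loglog_slope(visits, users)
print(f"estimated exponent: {slope:.2f}")
```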
16. Power-law distribution
Well-known power-law distributions:
  • the frequency with which words are used in texts
  • the magnitude of earthquakes
  • the size of files
  • the copies of books sold
  • the intensity of wars
  • the sizes of sand particles and solar flares
  • the population of cities
  • the distribution of wealth
Challenges imposed by the distribution:
  • Observations span a wide range of values, often many orders of magnitude
  • There is no typical scale or value that could be used for summarization
  • The distribution is extremely skewed, with many data points at the low end and few (but not negligibly few) data points at very high values
  • Expectation values often depend on the sample size and, in contrast to other distributions, fail to settle as more values are considered
17. Power-law distribution
How to work with power-law distributions?
  • Do not use classical methods, especially mean and standard deviation
  • Segment the data:
      • the majority of data points at small values
      • the set of points in the tail
      • the intermediate points
  • For each segment, try to use classical methods
  • Go into the problem domain in order to explain the behavior of each segment
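The segmentation step could be sketched as follows. Everything specific here is hypothetical: the `segment` helper, the two thresholds, and the small list of visit counts are made up for illustration and are not the data behind the slides.

```python
from statistics import mean, median


def segment(values, head_max, tail_min):
    """Split values into head (small), middle, and tail (large) segments."""
    return {
        "head": [v for v in values if v <= head_max],
        "middle": [v for v in values if head_max < v < tail_min],
        "tail": [v for v in values if v >= tail_min],
    }


# Hypothetical visits per person: many small values, a few very large ones.
visits = [1] * 60 + [2] * 20 + [3] * 8 + [5] * 4 + [12, 20, 35, 80] + [400, 2500]

# The overall mean is pulled far from the typical value by the tail.
print(f"all data: mean={mean(visits):.1f}, median={median(visits)}")

# Thresholds are assumptions; in practice they come from the problem domain.
for name, part in segment(visits, head_max=5, tail_min=100).items():
    print(f"{name:>6}: n={len(part):3d}, mean={mean(part):.1f}, median={median(part)}")
```

Within each segment the values are of comparable size, so summaries such as the mean describe that segment far better than a single mean describes the whole data set.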