http://publicationslist.org/junio
Data Analysis
Common Probability Models
Prof. Dr. Jose Fernando Rodrigues Junior
ICMC-USP
What is it about?
Some systems are too difficult, if not impossible, to model
In these circumstances, a better approach is to match the
system with a known probability model
Instead of modeling how a system works, we model how its
outcomes behave
There are many such models, but a few are observed most often and
can be handled with moderately advanced algebraic tools
Binomial distribution (Bernoulli trials)
Bernoulli trials refer to random events with only two possible
outcomes, usually named success and failure
The two outcomes are mutually exclusive and occur with probabilities
p and 1-p; successive trials are independent of each other
Examples:
 Coin: success and failure each have probability ½
 Die: success for one chosen face has probability 1/6, failure 5/6
 A basket with r red and b black balls: success for red is r/(r+b)
 Two coins: success for two heads is ¼
The nature of the Bernoulli trial lends itself to a simple model
Binomial distribution (Bernoulli trials)
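For N independent trials, each with success probability p, the
probability of exactly k successes (in any order) is given by the
Binomial distribution:
$$P(k; N, p) = \binom{N}{k} \, p^k (1-p)^{N-k}$$
where $\binom{N}{k} = \frac{N!}{k!\,(N-k)!}$ is the number of distinct
arrangements of k successes and N-k failures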
The Binomial distribution tells two things:
Too many successes are not probable; and
Too few successes are not probable either
That is, if you are taking a risk on a binary event, the
more you try, the more you fail and also the more
you succeed
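As a minimal sketch, the Binomial formula P(k) = C(N, k) p^k (1-p)^(N-k) can be evaluated directly with the standard library. The function name `binomial_pmf` and the fair-coin parameters are illustrative choices, not part of the slides.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Illustrative case: a fair coin tossed 10 times.
# The extremes (k near 0 or 10) are rare; the middle is the most likely.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
for k, prob in enumerate(pmf):
    print(f"k={k:2d}  P={prob:.4f}")
```

The printed table should peak at k = 5 and fall off symmetrically toward both ends, which is the "too many and too few successes are both improbable" behavior described above.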
Binomial distribution (Bernoulli trials)
Following from this, the expected number (mean) of
successes in N trials is:
$$\mu = Np$$
With standard deviation:
$$\sigma = \sqrt{Np(1-p)}$$
Notice that the standard deviation grows as $\sqrt{N}$, more slowly
than the mean, which grows as N
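The following sketch tabulates the mean Np and the standard deviation sqrt(Np(1-p)) for growing N, to show the relative spread shrinking. The helper name `binomial_mean_sd` and the value p = 0.5 are assumptions made for illustration.

```python
from math import sqrt

def binomial_mean_sd(n: int, p: float) -> tuple[float, float]:
    """Mean Np and standard deviation sqrt(Np(1-p)) of a Binomial."""
    return n * p, sqrt(n * p * (1 - p))

p = 0.5  # illustrative value, not from the slides
for n in (100, 10_000, 1_000_000):
    mean, sd = binomial_mean_sd(n, p)
    print(f"N={n:>9,}  mean={mean:>10.1f}  sd={sd:>7.1f}  sd/mean={sd / mean:.4f}")
```

Each hundredfold increase in N multiplies the mean by 100 but the standard deviation only by 10, so the ratio sd/mean drops by a factor of 10.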
Binomial distribution (Bernoulli trials)
Example: suppose we want a model to predict the staffing required
for a call center. We know that about one in every thousand orders
leads to a complaint (hence p = 1/1000) and that 1 million orders are
shipped every day (N = 1,000,000), so we expect about Np = 1000
complaints a day.
The standard deviation in this example comes out to be
√(Np(1 − p)) ≈ √1000 ≈ 32, as 1 − p is very close to 1 for the current
value of p. This spread, about 3% of the expected value of 1000, is
quite acceptable.
For this simple example, the required staff follows from the number
of complaints an employee can handle per day, given Np complaints
a day.
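The arithmetic of the call-center example can be sketched as below. N and p come from the example; the per-employee capacity and the two-standard-deviation safety margin are hypothetical values that the slides do not give.

```python
from math import ceil, sqrt

orders_per_day = 1_000_000      # N, from the example
p_complaint = 1 / 1000          # p, from the example
per_employee = 40               # ASSUMPTION: complaints one employee handles per day
safety_sds = 2                  # ASSUMPTION: headroom, in standard deviations

mean = orders_per_day * p_complaint
sd = sqrt(orders_per_day * p_complaint * (1 - p_complaint))

# Staff needed for the expected load, and for the load plus a safety margin.
staff_for_mean = ceil(mean / per_employee)
staff_with_margin = ceil((mean + safety_sds * sd) / per_employee)
print(f"mean={mean:.0f}, sd={sd:.1f}")
print(f"staff for the mean: {staff_for_mean}, with margin: {staff_with_margin}")
```

Because the standard deviation is small relative to the mean, covering a busy day requires only a few more people than covering an average one.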
Gaussian distribution
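Also known as the Normal distribution or the Bell curve, the Gaussian
distribution is the most commonly observed distribution. It is given
by:
$$p(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where $\mu$ is the mean
and $\sigma$ is the standard deviation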
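A minimal sketch of the Gaussian density, exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi)), using only the standard library. The function name `gaussian_pdf` and the default parameters (mean 0, standard deviation 1) are illustrative assumptions.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of the Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

# The curve peaks at the mean and is symmetric around it.
for x in (-2, -1, 0, 1, 2):
    print(f"x={x:+d}  pdf={gaussian_pdf(x):.4f}")
```

Doubling sigma halves the height of the peak and spreads the curve, since the total area under it stays equal to 1.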
Gaussian distribution
[Figure: some examples of Gaussian curves]
Gaussian distribution
The cumulative distribution function (CDF) gives the probability of a random
variable falling in the interval (−∞, x].
[Figure: CDFs of some example Gaussian curves]
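The Gaussian CDF has no elementary closed form, but it can be written with the error function. The sketch below uses `math.erf`; the function name `gaussian_cdf` and the example parameters (mean 100, standard deviation 10) are illustrative assumptions.

```python
from math import erf, sqrt

def gaussian_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """P(X <= x) for a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

print(gaussian_cdf(0.0))                      # half of the mass lies below the mean
print(gaussian_cdf(110, mu=100, sigma=10))    # one sd above the mean
```

The second value is about 0.84: roughly 84% of the outcomes fall at or below one standard deviation above the mean.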
Gaussian distribution
Gaussians are so useful because:
 Most events (about 68%) occur in the range [mean − sd, mean + sd], which
simplifies probability expectations: outliers are not expected
 Identifying a Gaussian distribution leads to a simpler, though still rigorous,
understanding of the phenomenon without investigating the system in
depth
 For Gaussian distributions, the basic summary statistics, mean and standard
deviation, are applicable
 Calculations over the Gaussian distribution, especially integrals, are
simpler; for this reason, it is often used as a kernel
Gaussians are less useful because:
 They predict the absence of outliers, which is not the case in real situations
 Many phenomena are not Gaussian, and for them mean and
standard deviation are misleading
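To make "outliers are not expected" concrete, the probability that a Gaussian value lies within k standard deviations of the mean is erf(k / sqrt(2)). The sketch below tabulates it; the function name `mass_within` and the chosen values of k are illustrative.

```python
from math import erf, sqrt

def mass_within(k: float) -> float:
    """Probability that a Gaussian value lies within k standard deviations of the mean."""
    return erf(k / sqrt(2.0))

for k in (1, 2, 3, 5):
    inside = mass_within(k)
    print(f"within {k} sd: {inside:.7f}   outside: {1 - inside:.2e}")
```

Under a Gaussian model, values more than 5 standard deviations from the mean occur less than once in a million observations. Real data with frequent extreme values therefore signals that the Gaussian model does not apply.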
Power-law distribution
The power-law (Zipf or Pareto) distribution is a special case of
non-normal statistics
Example: consider the number of visits per person to a website
Power-law distribution
In the plot two facts stand out:
 the huge number of people who made only a handful of visits (fewer than 5 or 6)
 at the other extreme, the huge number of visits made by a few people
This kind of distribution is mostly composed of outliers. Its mean is 26
visits per person, which is not representative of the observed data; the
standard deviation, 437, makes even less sense, as mean − sd implies
negative numbers of visits
In contrast to Gaussian distributions, with their short, quickly falling
tails, power-law distributions are characterized by “heavy (fat, long)
tails”
Power-law distribution
Such distributions can be identified by a log-log plot, where the data
fall on a straight line whose slope is the exponent of the power law
For the website example, the log-log plot indicates a line with slope
-1.9, hence the data is modeled as
number of users ~ (number of visits per user)^-1.9
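The log-log identification can be sketched as a least-squares line fit on the logarithms of both axes. The data below is synthetic, generated from an exact power law with exponent -1.9 and an arbitrary prefactor; it is not the website data from the slides, and `loglog_slope` is a hypothetical helper name.

```python
from math import log

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# SYNTHETIC data (not the website data from the slides): user counts that
# follow an exact power law with exponent -1.9 and an arbitrary prefactor.
visits = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
users = [50_000 * v ** -1.9 for v in visits]

print(f"fitted slope: {loglog_slope(visits, users):.3f}")
```

On noise-free data the fit recovers the exponent. On real, noisy counts a straight-line fit to a log-log plot is better treated as a rough estimate of the exponent.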
Power-law distribution
Well-known power-law distributions:
 the frequency with which words are used in texts
 the magnitude of earthquakes
 the size of files
 the number of copies of books sold
 the intensity of wars
 the sizes of sand particles and solar flares
 the population of cities
 and the distribution of wealth
Challenges imposed by the distribution:
 Observations span a wide range of values, often many orders of magnitude
 There is no typical scale or value that could be used for summarization
 The distribution is extremely skewed, with many data points at the low end and
few (but not negligibly few) data points at very high values
 Expectation values often depend on the sample size and, in contrast to other
distributions, fail to settle down as more values are considered
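The dependence of the mean on sample size can be illustrated with a Pareto law whose tail index is below 1, for which the theoretical mean is infinite. This sketch makes two assumptions: the tail index alpha = 0.9 (a density falling off as x^-1.9) is chosen to echo the exponent above, and the "sample" consists of evenly spaced quantiles rather than random draws so that the output is reproducible.

```python
from statistics import mean, median

def pareto_quantile_sample(n: int, alpha: float) -> list[float]:
    """n evenly spaced quantiles of a Pareto law with minimum 1 and tail index alpha."""
    # Inverse CDF of the Pareto law: x = (1 - u) ** (-1 / alpha), for 0 < u < 1.
    return [(1.0 - (i + 0.5) / n) ** (-1.0 / alpha) for i in range(n)]

alpha = 0.9  # ASSUMPTION: density falls off as x^-(alpha + 1) = x^-1.9
for n in (100, 1_000, 10_000, 100_000):
    sample = pareto_quantile_sample(n, alpha)
    print(f"n={n:>7,}  mean={mean(sample):8.2f}  median={median(sample):.3f}")
```

The mean keeps growing as n increases because ever larger tail values enter the sample, while the median stays essentially fixed. This is one reason rank-based summaries are often preferred for heavy-tailed data.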
Power-law distribution
How to work with power-law distributions?
 Do not use classical methods, especially mean and standard deviation, on the whole data set
 Segment the data
   The majority of data points, at small values
   The set of points in the tail
   The intermediate points
 For each segment, try to use classical methods
 Go into the problem domain so as to explain the behavior of each segment
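The segmentation steps above can be sketched as follows. The visit counts are made up for illustration, the cut points (5 and 50) are hypothetical values that would normally come from the problem domain, and `segment` is an illustrative helper name.

```python
from statistics import mean, median

def segment(values, low_cut, high_cut):
    """Split values into head (<= low_cut), middle, and tail (> high_cut)."""
    head = [v for v in values if v <= low_cut]
    middle = [v for v in values if low_cut < v <= high_cut]
    tail = [v for v in values if v > high_cut]
    return {"head": head, "middle": middle, "tail": tail}

# MADE-UP visit counts per user, for illustration only.
visits = [1, 1, 1, 2, 2, 3, 3, 4, 5, 8, 12, 20, 35, 400, 2500]
# ASSUMPTION: cut points chosen by eye; in practice they come from the domain.
parts = segment(visits, low_cut=5, high_cut=50)

for name, part in parts.items():
    print(f"{name:6s} n={len(part):2d}  mean={mean(part):7.1f}  median={median(part)}")
```

Within each segment the values span a much narrower range, so the mean and median become meaningful summaries again, which they are not for the data set as a whole.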