Empirical distributions are observed in nature; to study them, reference theoretical distributions are needed. For a discrete phenomenon, such as rolling dice, the theoretical distribution can be matched to the empirical one, which makes it possible to compute relative frequencies, the mean, and the standard deviation.
If the phenomenon is continuous, one instead considers the density function and derives the theoretical frequencies from it by integration.
The document discusses generalized linear models (GLMs) and provides examples of logistic regression and Poisson regression. Some key points covered include:
- GLMs allow for non-normal distributions of the response variable and non-constant variance, which makes them useful for binary, count, and other types of data.
- The document outlines the framework for GLMs, including the link function that transforms the mean to the scale of the linear predictor and the inverse link that transforms it back.
- Logistic regression is presented as a GLM example for binary data with a logit link function. Poisson regression is given for count data with a log link.
- Examples are provided to demonstrate how to fit and interpret a logistic regression model.
The document summarizes a time series analysis workshop presented by Sri Krishnamurthy on December 20, 2018 in Boston. The workshop was hosted by QuantUniversity, which provides data science and quantitative finance programs and advisory services. Upcoming events from QuantUniversity include time series analysis and machine learning workshops in early 2019.
Classical linear regression model (1) (1) by Tanvi Ahuja
This document presents an econometric analysis of the population growth of India from 2005 to 2014. A linear regression model is used to test the relationship between population and life expectancy. The results show a positive correlation, with a high R-squared value of 0.938 indicating life expectancy explains 93.8% of the variation in population. Specifically, for every one year increase in life expectancy, the population increases by approximately 41,922 people on average. Therefore, the analysis finds support for the hypothesis that countries with higher life expectancies will have higher populations.
The document outlines auto-regressive (AR) processes of order p. It begins by introducing AR(p) processes formally and discussing white noise. It then derives the first and second moments of an AR(p) process. Specific details are provided about AR(1) and AR(2) processes, including equations for their variance as a function of the noise variance and AR coefficients. Examples of simulated AR(1) processes are shown for different coefficient values.
Data Science - Part IV - Regression Analysis & ANOVA by Derek Kane
This lecture provides an overview of linear regression analysis, interaction terms, ANOVA, optimization, log-level, and log-log transformations. The first practical example centers around the Boston housing market, while the second dives into business applications of regression analysis for a supermarket retailer.
The document discusses smoking-related deaths in the United States each year, including 123,800 from lung cancer out of a total of 438,000 smoking-related deaths. It then defines conditional probability as the probability of one event occurring given that another event has already occurred. Using this definition, it calculates the conditional probability that a smoking-related death will be caused by lung cancer as 28%, based on the number of lung cancer deaths divided by the total number of smoking-related deaths.
This document discusses point estimation and the criteria for a good point estimator. It defines point estimation, estimators, and estimates. The key criteria for a good point estimator are discussed as unbiasedness, consistency, efficiency, and sufficiency. Unbiasedness means the expected value of the estimator is equal to the true parameter value. Consistency means the estimator approaches the true value as the sample size increases. Efficiency refers to the estimator having the minimum possible variance. Sufficiency means the estimator uses all the information in the sample. Examples are provided for each concept.
This document discusses stationarity in time series analysis. It defines stationarity as a time series having a constant mean, constant variance, and constant autocorrelation structure over time. Non-stationary time series can be identified through run sequence plots, summary statistics, histograms, and augmented Dickey-Fuller tests. Common transformations like removing trends, heteroscedasticity through logging, differencing to remove autocorrelation, and removing seasonality can be used to make non-stationary time series data stationary. Python is used to demonstrate identifying and transforming non-stationary time series data.
Quantile regression is an extension of linear regression that relates specific quantiles (percentiles) of the target variable to the predictor variables rather than just the mean. It makes fewer assumptions than ordinary least squares regression about the distribution of the target variable and is more robust to outliers. Quantile regression can provide a more complete picture of the relationship between variables by examining how predictors influence different parts of the conditional distribution.
The document discusses several types of discrete probability distributions:
- Bernoulli distribution models experiments with two outcomes and is defined by the probability of success.
- Binomial distribution describes repeated Bernoulli trials and is defined by the number of trials and probability of success.
- Poisson distribution describes the number of occurrences within a time period and is defined by the average number of occurrences.
- Hypergeometric distribution describes sampling without replacement and is defined by the population size, sample size, and number of successes in the population.
Autocorrelation- Detection- part 1- Durbin-Watson d test by Shilpa Chaudhary
This document discusses various methods to detect autocorrelation in regression models, including graphical examination of residuals and formal statistical tests like the Durbin-Watson d test, Durbin's h test, and the Breusch-Godfrey test. The Durbin-Watson test compares the statistic d to critical values dL and dU, which depend on the sample size and the number of regressors. Values of d below 2 suggest positive autocorrelation, values above 2 suggest negative autocorrelation, and a value near 2 indicates no autocorrelation. The document provides an example of applying the Durbin-Watson test to check for first-order autocorrelation.
This document provides an overview of basic statistical concepts for bio science students. It defines measures of central tendency including mean, median, and mode. It also discusses measures of dispersion like range and standard deviation. Common probability distributions such as binomial, Poisson, and normal distributions are explained. Hypothesis testing concepts like p-values and types of statistical tests for different types of data like t-tests for continuous variables and chi-square tests for categorical data are summarized along with examples.
The document discusses various statistical tests used for hypothesis testing, including parametric and non-parametric tests. Parametric tests like the z-test and t-test assume a normal distribution, while non-parametric tests like the chi-square test, sign test, and Mann-Whitney test make fewer assumptions. The z-test specifically compares a sample mean to a hypothesized population mean for large samples or when the population variance is known.
This document discusses heteroskedasticity in multiple linear regression models. Heteroskedasticity occurs when the variance of the error term is not constant, violating the assumption of homoskedasticity. If heteroskedasticity is present, ordinary least squares (OLS) estimates are still unbiased but the standard errors are biased. Various tests for heteroskedasticity are presented, including the Breusch-Pagan and White tests. Weighted least squares (WLS) methods like feasible generalized least squares (FGLS) can produce more efficient estimates than OLS when the form of heteroskedasticity is known or can be estimated.
Final generalized linear modeling by idrees waris iugc (Id'rees Waris)
This document discusses generalized linear models (GLM). It begins by introducing the topic and outlines the main points to be covered, including the history of GLM, assumptions for using GLM, and how to run GLM in SPSS. The document then covers the components of GLM, including the random, systematic, and link components. It discusses various distributions and link functions that can be used in GLM. The document concludes by providing an example of how to analyze shipping damage incident data using Poisson GLM in SPSS.
1) The document discusses measures of association and correlation between variables, including Pearson's correlation coefficient.
2) It presents examples of possible relationships between variables, such as age and height, or advertising spending and revenue.
3) It discusses concepts such as positive correlation, negative correlation, and the absence of correlation between variables.
The document discusses probability distributions for continuous and discrete variables, including probability functions, the uniform, triangular, exponential, and gamma distributions, and the relationships among them.
Class lecture notes # 2 (statistics for research) by Harve Abella
The document discusses different types of variables and scales of measurement used in research. It defines qualitative and quantitative variables, and describes discrete and continuous quantitative variables. It also outlines four scales of measurement - nominal, ordinal, interval, and ratio scales - and provides examples. The document emphasizes that statistics play a vital role in research design, validity/reliability testing, data organization and interpretation, and determining significance of findings.
This document provides an overview of logistic regression. It begins by defining logistic regression as a specialized form of regression used when the dependent variable is dichotomous while the independent variables can be of any type. It notes logistic regression allows prediction of discrete variables from continuous and discrete predictors without assumptions about variable distributions. The document then discusses why logistic regression is used when assumptions of other regressions like normality and equal variance are violated. It also outlines how to perform and interpret logistic regression including assessing model fit. Finally, it provides an example research question and hypotheses about predicting solar panel adoption using household income and mortgage as predictors.
Time Series Analysis and Forecasting.ppt by ssuser220491
This document discusses time series analysis and forecasting. It introduces time series data and examples. The main methods for forecasting time series are regression analysis and time series analysis (TSA), which examines past behavior to predict future behavior without causal variables. TSA involves analyzing trends, cycles, seasonality, and random variations. Forecasting accuracy is measured using techniques like mean absolute deviation and mean square error. Extrapolation models like moving averages, weighted moving averages, and exponential smoothing are discussed for forecasting, as well as approaches for stationary, additive seasonal, multiplicative seasonal, and trend data.
Logistic regression is a statistical model used to predict binary outcomes like disease presence/absence from several explanatory variables. It is similar to linear regression but for binary rather than continuous outcomes. The document provides an example analysis using logistic regression to predict risk of HHV8 infection from sexual behaviors and infections like HIV. The analysis found HIV and HSV2 history were associated with higher odds of HHV8 after adjusting for other variables, while gonorrhea history was not a significant independent predictor.
The document discusses random variables and vectors. It defines random variables as functions that assign outcomes of random experiments to real numbers. There are two types of random variables: discrete and continuous. Random variables are characterized by their expected value, variance/standard deviation, and other moments. Random vectors are multivariate random variables. Key concepts covered include probability mass functions, probability density functions, expected value, variance, and how these properties change when random variables are scaled or combined linearly.
This document provides an overview of probability densities in data mining. It begins by explaining why understanding probability densities is important for working with real-valued data in data mining applications. It then covers basic notation and properties of continuous probability density functions (PDFs), including their meaning and how to calculate probabilities. It also discusses multivariate continuous PDFs, expectations, variance, standard deviation, and independence between random variables. The overall summary is that the document serves as an introduction to probability densities and their applications in data mining.
This document provides an overview of maximum likelihood estimation. It explains that maximum likelihood estimation finds the parameters of a probability distribution that make the observed data most probable. It gives the example of using maximum likelihood estimation to find the values of μ and σ that result in a normal distribution that best fits a data set. The goal of maximum likelihood is to find the parameter values that give the distribution with the highest probability of observing the actual data. It also discusses the concept of likelihood and compares it to probability, as well as considerations for removing constants and using the log-likelihood.
Probability is a measure of the degree of uncertainty of an event in a given random experiment.
It is reasonable to measure the uncertainty of events by assigning each of them a number between 0 and 1, called the probability of the event.
The closer the probability is to zero, the more rarely the event occurs; the closer the probability is to 1, the more frequent the event.
Inference means inducing the unknown characteristics of a population from sample information. More precisely, doing inference means:
Estimating: approximating an unknown parameter from sample data.
Testing hypotheses: using sample data to verify the statistical significance of hypotheses about the distribution of the characteristics under study, that is, about the shape of the distribution and the values that characterize it: the mean and the standard deviation.
Probability and statistics: the science of prediction by Andrea Capocci
How should probability and statistics be introduced in secondary school? This is the presentation of a teaching path that introduces statistics from a precise point of view, not exhaustive but self-contained. The presentation was prepared for the Probability and Statistics exam of the Tirocinio Formativo Attivo 2014-15 at the Università di Roma Tre.
The derivative of a function at a point. Geometric meaning of the derivative. Equation of the tangent line to the graph at a point. Differentiation rules. Continuity and differentiability. Points of non-differentiability.
2. Random Variables (r.v.)
A random variable X is a function defined on the sample space Ω that assigns a unique real number to each event E ⊂ Ω.
[Figure: the map X sends the events E1, ..., E9 of Ω to the real values x1, ..., x6.]
3. Discrete and continuous random variables
• A discrete random variable can take a discrete (finite or countable) set of real values.
• A continuous random variable can take every value in a real interval.
Discrete Ω → discrete r.v.
Continuous Ω → discrete or continuous r.v.
4. Discrete random variables
P(X = xᵢ) denotes the probability that the r.v. X takes the value xᵢ.
The probability function of a discrete random variable X assigns to each value xᵢ the corresponding probability P(X = xᵢ).
Properties: ∑ᵢ P(xᵢ) = 1 and P(xᵢ) ≥ 0.
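As a concrete illustration (not part of the original slides), here is a minimal Python sketch of a discrete probability function using a fair six-sided die; the `pmf` dictionary and its values are illustrative.

```python
# Hypothetical example: the pmf of a fair six-sided die, X in {1, ..., 6}.
pmf = {x: 1/6 for x in range(1, 7)}

# Property 1: the probabilities sum to 1.
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# Property 2: every probability is non-negative.
assert all(p >= 0 for p in pmf.values())

print(pmf[3])  # P(X = 3) = 1/6
```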
6. Cumulative distribution function
Given a discrete r.v. X, the function that maps each value x to the cumulative probability P(X ≤ x) is called the cumulative distribution function and is written
F(x) = P(X ≤ x) = ∑_{w ≤ x} P(X = w).
8. Cumulative distribution function: properties
• It is non-decreasing: x₁ < x₂ ⇒ F(x₁) ≤ F(x₂).
• Its limits are lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.
• It is right-continuous: lim_{x→x₀⁺} F(x) = F(x₀).
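Continuing the die sketch above (again illustrative, not from the slides), the CDF can be built as a cumulative sum of the pmf and its properties checked directly:

```python
# Hypothetical continuation of the die example: the CDF of a discrete r.v.
pmf = {x: 1/6 for x in range(1, 7)}

def F(x):
    """F(x) = P(X <= x) = sum of P(X = w) over all w <= x."""
    return sum(p for w, p in pmf.items() if w <= x)

assert F(0) == 0                # below the support, F is 0
assert abs(F(6) - 1.0) < 1e-12  # at the top of the support, F is 1
assert F(2) <= F(5)             # non-decreasing
print(F(3))                     # P(X <= 3) = 3/6 = 0.5, up to float rounding
```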
9. Continuous random variables
We call density function the mathematical function f(x) such that the area under the function over any given interval equals the probability that X takes a value in that interval.
10. Properties of density functions
• f(x) ≥ 0 everywhere.
• The total area under the function is 1: ∫_{−∞}^{+∞} f(x) dx = 1.
• The probability that the r.v. takes any one particular value of the interval is zero.
11. Cumulative distribution function
Given a continuous r.v. X, the function that maps each value x to the cumulative probability P(X ≤ x) is called the cumulative distribution function:
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(w) dw.
12. Continuous random variables
Example: X ~ f(x) with density f(x) = 12x(1 − x)², x ∈ [0, 1].
∫_{−∞}^{+∞} f(x) dx = ∫_{0}^{1} f(x) dx = 1
P(0.5 < X < 0.7) = ∫_{0.5}^{0.7} f(x) dx = 0.229
[Figure: plot of the density, with the area between 0.5 and 0.7 shaded.]
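The slide's numbers can be verified by numerical integration; a sketch, assuming scipy is available:

```python
# Numerical check of the example density f(x) = 12x(1-x)^2 on [0, 1].
from scipy.integrate import quad

f = lambda x: 12 * x * (1 - x) ** 2

total, _ = quad(f, 0.0, 1.0)   # total area under the density
prob, _ = quad(f, 0.5, 0.7)    # P(0.5 < X < 0.7)

print(round(total, 3))  # 1.0
print(round(prob, 3))   # 0.229, matching the slide
```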
13. Example
[Figure: two density functions f(x) and their corresponding cumulative distribution functions F(x).]
14. Expected value of a r.v.
The mean value of a r.v. X is defined as
E(X) = ∑ᵢ xᵢ P(xᵢ) if the r.v. is discrete,
E(X) = ∫_{−∞}^{+∞} x f(x) dx if the r.v. is continuous.
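Both definitions can be evaluated directly; a sketch (the die pmf and the reuse of the earlier density are illustrative choices, and scipy is assumed available):

```python
# E(X) computed under both definitions.
from scipy.integrate import quad

# Discrete case: fair die, E(X) = sum of x * P(x).
pmf = {x: 1/6 for x in range(1, 7)}
print(round(sum(x * p for x, p in pmf.items()), 6))  # 3.5

# Continuous case: the density 12x(1-x)^2 from the earlier example.
f = lambda x: 12 * x * (1 - x) ** 2
mean, _ = quad(lambda x: x * f(x), 0.0, 1.0)
print(round(mean, 3))  # 0.4
```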
16. Example: continuous r.v.
Consider the r.v. X ~ λe^{−λx}, with λ a positive constant and x ≥ 0 (the exponential distribution). Its expected value is
E(X) = ∫_{0}^{+∞} x λe^{−λx} dx = 1/λ.
[Figure: plot of the exponential density.]
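A quick numerical confirmation that E(X) = 1/λ (the value of λ is an illustrative choice; scipy and numpy are assumed available):

```python
# Expected value of the exponential density, checked by integration.
from scipy.integrate import quad
import numpy as np

lam = 2.0                              # any positive constant
f = lambda x: lam * np.exp(-lam * x)   # density for x >= 0

mean, _ = quad(lambda x: x * f(x), 0.0, np.inf)
print(round(mean, 6), 1 / lam)         # both 0.5
```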
17. The variance of a r.v.
The variance V(X) of a random variable X is defined by
V(X) = ∑ᵢ (xᵢ − E(X))² P(xᵢ) if the r.v. is discrete,
V(X) = ∫_{−∞}^{+∞} (x − E(X))² f(x) dx if the r.v. is continuous.
18. Variance of a r.v.
Alternative notation:
V(X) = E{[X − E(X)]²}
or, after a few steps,
V(X) = E(X²) − [E(X)]².
The standard deviation is defined as SD(X) = √V(X).
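The definition and the shortcut formula agree, as a small sketch on the fair die shows (illustrative, not from the slides):

```python
# V(X) computed two ways on the fair die.
pmf = {x: 1/6 for x in range(1, 7)}
mean = sum(x * p for x, p in pmf.items())

v_def = sum((x - mean) ** 2 * p for x, p in pmf.items())       # E[(X - E(X))^2]
v_short = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2  # E(X^2) - E(X)^2

print(round(v_def, 4), round(v_short, 4))  # both 2.9167
```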
19. Standardized r.v.
Standardized values express the distance between the observed values and the mean in units of the standard deviation.
If X is a r.v. with expected value E(X) and standard deviation SD(X), then
Y = (X − E(X)) / SD(X)
is a standardized r.v. with E(Y) = 0 and V(Y) = 1.
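An empirical sketch of standardization (numpy is assumed available; the exponential draws are an arbitrary illustrative choice):

```python
# Standardizing draws from an arbitrary distribution.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # E(X) = 2, SD(X) = 2

y = (x - x.mean()) / x.std()                  # standardized values
print(round(y.mean(), 6), round(y.std(), 6))  # ~0 and ~1
```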
20. Chebyshev's theorem
Let X be a random variable and k a positive real number; then the following inequality holds:
P(|X − E(X)| ≥ k · SD(X)) ≤ 1/k².
Regardless of the distribution of the r.v., the probability that X takes values more than k standard deviations away from the mean is at most 1/k².
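Because the bound holds for any distribution, it can be illustrated by simulation; a sketch (numpy assumed, exponential draws chosen arbitrarily):

```python
# Empirical check of Chebyshev's inequality.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)
mu, sd = x.mean(), x.std()

for k in (1.5, 2.0, 3.0):
    freq = np.mean(np.abs(x - mu) >= k * sd)  # observed tail frequency
    print(k, round(freq, 4), "<=", round(1 / k**2, 4))
```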
21. Probability distributions
• They are an extension of frequency distributions.
• The simplest case to treat is the probability distribution of a discrete variable.
– Discrete variables: counting phenomena, experiments with a limited set of outcomes.
– Continuous variables: measurements, experiments with outcomes on a continuum.
22. Summarizing distributions
• As with frequency distributions, for probability distributions we also need to summarize the data.
• In particular we want a summary of location:
– the expected value;
• and a summary of variability:
– the variance,
– the standard deviation,
– the coefficient of variation.
23. Probabilistic models
• By model we mean a probability law capable of measuring the uncertainty about the real phenomenon under study.
Real phenomenon: the number of children per family, or the outcomes of rolling a die.
Mathematical model: a probability distribution associated with the phenomenon of interest in order to analyze it.
24. • Depending on the real phenomenon under study, one tries to associate a suitable model to describe its variability.
• Models divide into:
– Discrete models, some of which are:
• the discrete uniform model,
• the binomial model,
• the Poisson model;
– Continuous models:
• the normal model,
• the exponential model.
25. Uniform model (1)
• This is the simplest distribution.
• It gives the probability of a phenomenon for which each of the k possible outcomes is equally likely.
• For example, the probability that a given number is drawn in the lotto is described by the uniform model, because each of the 90 numbers has probability 1/90 of being drawn.
26. Bernoulli model (1)
• This model describes the probability associated with dichotomous variables, i.e. phenomena that take only one of two possible values: x = 1 (a certain event of interest E occurs) and x = 0 (the event E does not occur).
• Examples: dead vs alive, male vs female, the presence of an event of interest as opposed to everything else.
P(X = 1) = p
P(X = 0) = 1 − p
E(X) = p
var(X) = p(1 − p)
27. Bernoulli distribution
A Bernoulli r.v. can take the value 1 with probability π and the value 0 with probability 1 − π.
Its probability function can be written as
P(X = x) = π^x (1 − π)^{1−x} for x = 0, 1.
Every trial with only two possible outcomes generates a Bernoulli r.v.: the toss of a coin, the sex of a newborn, whether or not a certain inflation level is exceeded, and so on.
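The Bernoulli probability function written out directly (a sketch; the value of π is illustrative):

```python
# Bernoulli pmf P(X = x) = pi**x * (1 - pi)**(1 - x), x in {0, 1}.
pi = 0.3

def bernoulli_pmf(x, pi):
    return pi ** x * (1 - pi) ** (1 - x)

print(bernoulli_pmf(1, pi))  # P(X = 1) = 0.3
print(bernoulli_pmf(0, pi))  # P(X = 0) = 0.7
print(pi, pi * (1 - pi))     # E(X) = 0.3, V(X) = 0.21
```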
28. Binomial model
• The binomial model describes the probability of the number x of successes obtained in a sample of n trials (observations):
Pr(X = x | p) = [n! / (x!(n − x)!)] p^x (1 − p)^{n−x} = (n choose x) p^x (1 − p)^{n−x}
E(X) = np
var(X) = np(1 − p)
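The binomial formula translates directly into code; a sketch using only the standard library (math.comb requires Python 3.8+, and the values of n and p are illustrative):

```python
# Binomial pmf via the counting formula.
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) = C(n, x) * p**x * (1 - p)**(n - x)"""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.5
print(round(binom_pmf(5, n, p), 4))                   # 0.2461
print(sum(binom_pmf(x, n, p) for x in range(n + 1)))  # ~1.0
print(n * p, n * p * (1 - p))                         # E(X) = 5.0, V(X) = 2.5
```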
29. Binomial model (2)
• The expression Pr(X = x) means that in n observations exactly x successes occurred and, consequently, n − x failures.
• Since the trials are independent and the probability of success is constant and equal to p in every trial, over n attempts we observe:
– each of the x successes with probability p,
– each of the n − x failures with probability 1 − p.
• It remains to account for the fact that the order in which the trials are observed varies, and we are interested only in the total number of successes and failures, not in their order.
30. Binomial model (3)
• Indeed, one could observe:
– all the successes first and then all the failures,
– or successes and failures alternating one by one,
– or all the failures first and then all the successes, and so on.
• To account for these possibilities, the probabilities are multiplied by the number of possible arrangements of the x successes among the n trials:
(n choose x) = n! / (x!(n − x)!), with n! = n × (n − 1) × ... × 2 × 1.
37. Poisson model (1)
• The Poisson model represents random experiments that produce a discrete number of events in a continuous interval (e.g. arrivals at an airport over one hour of time).
• The events must have the following characteristics:
– the interval (of time or space) can be divided into n sub-intervals within which the probability of one event occurring is small and the probability of more than one event tends to zero;
– the probability of events occurring in the sub-intervals is constant;
– events in different sub-intervals are stochastically independent (a memoryless process).
38. Poisson model (2)
• In mathematical terms the model is specified as
Pr(X = x | λ) = e^{−λ} λ^x / x!
• The parameter λ is the average number of events occurring in the interval (the expected number of successes).
39. Poisson distribution
A Poisson r.v. is a discrete r.v. that can take any non-negative integer value. Its probability distribution is given by
P(x) = λ^x e^{−λ} / x!, x = 0, 1, 2, …, with 0 < λ < ∞
E(X) = λ
V(X) = λ
[Figure: pmf of Poisson(1), Poisson(3), and Poisson(7).]
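The pmf formula in code, using only the standard library (the rate λ = 3 anticipates the bank example on the next slides):

```python
# Poisson pmf P(X = x) = exp(-lam) * lam**x / x!
from math import exp, factorial

def poisson_pmf(x, lam):
    return exp(-lam) * lam ** x / factorial(x)

lam = 3.0
print(round(poisson_pmf(2, lam), 3))                              # 0.224
print(round(sum(x * poisson_pmf(x, lam) for x in range(60)), 6))  # E(X) = 3.0
```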
40. Poisson model (3)
• The random experiment of interest is the number of customers arriving at a bank.
• Suppose that on average 3 customers arrive every minute.
• We want to compute:
– the probability that exactly 2 customers arrive in one such interval;
– the probability that more than 2 customers arrive in one such interval.
41. Poisson model (4)
• Using the Poisson distribution function:
Pr(X = 2) = e^{−3} 3² / 2! = 9 × (2.71828)^{−3} / (2 × 1) = 0.224
42. Poisson model (5)
• To determine
Pr(X > 2) = ∑_{i=3}^{∞} Pr(X = i)
it is more convenient to note that
Pr(X > 2) = 1 − Pr(X ≤ 2), where Pr(X ≤ 2) = Pr(X = 0) + Pr(X = 1) + Pr(X = 2),
and these three probabilities are simple to compute.
43. Poisson model (6)
Pr(X = 0) = e^{−3} 3⁰ / 0! = 1 × (2.71828)^{−3} / 1 = 0.0498
Pr(X = 1) = e^{−3} 3¹ / 1! = 3 × (2.71828)^{−3} / 1 = 0.1494
Pr(X = 2) = e^{−3} 3² / 2! = 9 × (2.71828)^{−3} / (2 × 1) = 0.224
Hence Pr(X > 2) = 1 − (0.0498 + 0.1494 + 0.224) = 0.577.
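The same worked example can be reproduced with scipy.stats (assumed available):

```python
# Pr(X = 2) and Pr(X > 2) for arrivals at rate 3 per minute.
from scipy.stats import poisson

lam = 3

print(round(poisson.pmf(2, lam), 3))      # Pr(X = 2) = 0.224
print(round(1 - poisson.cdf(2, lam), 3))  # Pr(X > 2) = 0.577
```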
44. Poisson model (7)
• Note that in the Poisson model the parameter λ is both the variance and the expected number of events.
• Hence, as the mean increases, the variance of the distribution also increases.
• For this reason, in practice one can resort to alternative formulations of the Poisson model that allow the variability term to be modeled independently.