Vaičiulytė, Ingrida; Sakalauskas, Leonidas, "Multidimensional Rare Event Probability Estimation Algorithm" („Daugiamatis retų įvykių tikimybių vertinimo algoritmas"), VU MII

Presentation in the "Data Mining and Optimization" section of the 16th conference of computer specialists,
"Computer Days – 2013" („Kompiuterininkų dienos – 2013"), Šiauliai, 2013-09-21


Transcript

  • 1. MULTIDIMENSIONAL RARE EVENT PROBABILITY ESTIMATION ALGORITHM. Ingrida Vaičiulytė, Vilnius University Institute of Mathematics and Informatics. COMPUTER DAYS – 2013, Šiauliai
  • 2. Introduction. This work describes the empirical Bayesian approach applied to the estimation of multidimensional frequencies and introduces the Monte-Carlo Markov Chain (MCMC) procedure designed for Bayesian computation. The discrete variable, the number of occurrences of a rare event, is modelled with two statistical models: a normal distribution with unknown mean and variance parameters, and a Poisson distribution.
  • 3. Introduction. Let us consider a set $\Pi_1, \Pi_2, \dots, \Pi_K$ of $K$ populations, where each population $\Pi_j$ consists of $N_j$ individuals, $j = 1, \dots, K$. Assume that some event (e.g., death due to some disease, an insured event) can occur in the populations under observation.
  • 4. The aim. Our aim is to estimate the unknown probabilities $P_{jm}$ of the events in the populations when the numbers of events $Y_{jm}$ are observed, $j = 1, \dots, K$; $m = 1, \dots, M$. Since the simple relative-risk estimate $Y_{jm}/N_j$ cannot be used in many cases due to great differences in the population sizes $N_j$, the empirical Bayesian approach is applied.
  • 5. Poisson-Gaussian model. An assumption often justified is that the numbers of cases $Y_{jm}$ follow the Poisson distribution with the parameters $\lambda_{jm} = N_j P_{jm}$, whose density is as follows:
$$f(Y_{jm}, \lambda_{jm}) = \frac{e^{-\lambda_{jm}}\,\lambda_{jm}^{Y_{jm}}}{Y_{jm}!}, \qquad j = 1, \dots, K.$$
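As a quick illustration of this density, the sketch below evaluates the Poisson probability mass with the rate $\lambda_{jm} = N_j P_{jm}$; the function name and the numbers in the example call are illustrative, not taken from the slides.

```python
from scipy.stats import poisson

def poisson_density(y_jm, N_j, P_jm):
    """f(Y_jm | lambda_jm) with lambda_jm = N_j * P_jm."""
    return poisson.pmf(y_jm, N_j * P_jm)

# e.g. the probability of observing 8 cases in a population of 200
# individuals when the event probability is 0.05 (lambda = 10)
print(poisson_density(8, 200, 0.05))
```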
  • 6. Poisson-Gaussian model. The empirical Bayesian method is a two-stage procedure, depending on the prior distribution introduced in the second stage. It is of interest to consider a model in which the logits $\tau = \ln\frac{P}{1-P}$ are normally distributed with the parameters $\mu, \Sigma$.
  • 7. Poisson-Gaussian model. Thus the density of the logit is
$$g(\tau, \mu, \Sigma) = \frac{1}{(2\pi)^{M/2}\sqrt{|\Sigma|}}\,\exp\!\left(-\frac{(\tau-\mu)^T \Sigma^{-1} (\tau-\mu)}{2}\right).$$
Then the rates $P_{jm}$ are evaluated as a posteriori means for given $\mu, \Sigma$:
$$\hat P_{jm} = \frac{1}{D_j} \int \frac{1}{1+e^{-\tau_m}} \prod_{m=1}^{M} f\!\left(Y_{jm}, \frac{N_j}{1+e^{-\tau_m}}\right) g(\tau, \mu, \Sigma)\, d\tau,$$
where
$$D_j = \int \prod_{m=1}^{M} f\!\left(Y_{jm}, \frac{N_j}{1+e^{-\tau_m}}\right) g(\tau, \mu, \Sigma)\, d\tau, \qquad j = 1, \dots, K,\ m = 1, \dots, M.$$
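A minimal sketch of how these a posteriori means can be approximated, assuming the integrals are evaluated by plain Monte-Carlo sampling of the logits from $N(\mu, \Sigma)$; the function name and the example arguments are hypothetical.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)

def posterior_rates(Y_j, N_j, mu, Sigma, n_samples=20_000):
    """Monte-Carlo approximation of D_j and of the a posteriori means P_jm.

    The logit vector tau is sampled from N(mu, Sigma); the Poisson
    likelihood of the observed counts Y_j acts as a weight.
    """
    tau = rng.multivariate_normal(mu, Sigma, size=n_samples)    # (n, M)
    P = 1.0 / (1.0 + np.exp(-tau))                              # inverse logits
    w = poisson.pmf(Y_j, N_j * P).prod(axis=1)                  # product over m
    D_j = w.mean()
    P_hat = (P * w[:, None]).mean(axis=0) / D_j
    return D_j, P_hat

# illustrative call with assumed counts and population size
D, P_hat = posterior_rates(Y_j=np.array([8, 3, 1]), N_j=200,
                           mu=np.array([-3.0, -4.0, -5.0]),
                           Sigma=0.25 * np.eye(3))
print(D, P_hat)
```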
  • 8. Maximum likelihood method. The Bayesian analysis is often related in statistics to the minimization of a certain function expressed as the integral of the a posteriori density. Thus, in the empirical Bayesian approach, the unknown parameters $\mu, \Sigma$ are estimated by the maximum likelihood method. After some manipulation we get the logarithmic likelihood function
$$L(\mu, \Sigma) = -\sum_{j=1}^{K} \ln \int \prod_{m=1}^{M} f\!\left(Y_{jm}, \frac{N_j}{1+e^{-\tau_m}}\right) g(\tau, \mu, \Sigma)\, d\tau = -\sum_{j=1}^{K} \ln D_j,$$
which has to be minimized to get estimates of the parameters.
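In code this reduces to summing the logarithms of the $D_j$ integrals, for example as in the sketch below, which builds on the previous one; the $D_j$ values passed in are placeholders.

```python
import numpy as np

def neg_log_likelihood(D):
    """L(mu, Sigma) = -sum_j ln D_j, given the integrals D_j
    (e.g. approximated as in the previous sketch)."""
    return -np.sum(np.log(D))

print(neg_log_likelihood(np.array([1e-3, 5e-4, 2e-3])))
```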
  • 9. Derivatives of the maximum likelihood function. The likelihood function is differentiable many times with respect to the parameters $\mu, \Sigma$, and the respective first derivatives of this function are as follows:
$$\frac{\partial L(\mu, \Sigma)}{\partial \mu} = -\sum_{j=1}^{K} \frac{1}{D_j} \int \Sigma^{-1}(\tau-\mu) \prod_{m=1}^{M} f\!\left(Y_{jm}, \frac{N_j}{1+e^{-\tau_m}}\right) g(\tau, \mu, \Sigma)\, d\tau,$$
$$\frac{\partial L(\mu, \Sigma)}{\partial \Sigma} = -\frac{1}{2}\sum_{j=1}^{K} \frac{1}{D_j} \int \left(\Sigma^{-1}(\tau-\mu)(\tau-\mu)^T \Sigma^{-1} - \Sigma^{-1}\right) \prod_{m=1}^{M} f\!\left(Y_{jm}, \frac{N_j}{1+e^{-\tau_m}}\right) g(\tau, \mu, \Sigma)\, d\tau.$$
  • 10. Poisson-Gaussian model estimates. The maximum likelihood estimates of the parameters $\mu, \Sigma$ of the Poisson-Gaussian model are found by solving the equations in which the first derivatives are set equal to zero:
$$\mu = \frac{1}{K}\sum_{j=1}^{K} \frac{1}{D_j} \int \tau \prod_{m=1}^{M} f\!\left(Y_{jm}, \frac{N_j}{1+e^{-\tau_m}}\right) g(\tau, \mu, \Sigma)\, d\tau,$$
$$\Sigma = \frac{1}{K}\sum_{j=1}^{K} \frac{1}{D_j} \int (\tau-\mu)(\tau-\mu)^T \prod_{m=1}^{M} f\!\left(Y_{jm}, \frac{N_j}{1+e^{-\tau_m}}\right) g(\tau, \mu, \Sigma)\, d\tau.$$
  • 11. Poisson-Gaussian model estimates. For instance, the "fixed point iteration" method is useful for solving these equations in order to get the maximum likelihood estimates of $\mu, \Sigma$:
$$\mu^{t+1} = \frac{1}{K}\sum_{j=1}^{K} \frac{1}{D_j^t} \int \tau \prod_{m=1}^{M} f\!\left(Y_{jm}, \frac{N_j}{1+e^{-\tau_m}}\right) g(\tau, \mu^t, \Sigma^t)\, d\tau,$$
$$\Sigma^{t+1} = \frac{1}{K}\sum_{j=1}^{K} \frac{1}{D_j^t} \int (\tau-\mu^t)(\tau-\mu^t)^T \prod_{m=1}^{M} f\!\left(Y_{jm}, \frac{N_j}{1+e^{-\tau_m}}\right) g(\tau, \mu^t, \Sigma^t)\, d\tau.$$
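A generic skeleton of such a fixed-point solver is sketched below; the `update` callback stands for one evaluation of the two right-hand sides (one Monte-Carlo realisation of it appears on the MCMC slides), and the simple tolerance test is only a placeholder for the statistical termination rule of slide 16.

```python
import numpy as np

def fixed_point_solve(update, mu0, Sigma0, max_iter=100, tol=1e-4):
    """Generic 'fixed point iteration' skeleton for (mu, Sigma).

    `update(mu, Sigma)` re-evaluates the two right-hand sides above
    and returns the new pair; one Monte-Carlo realisation of it is
    sketched on the MCMC slides below.
    """
    mu, Sigma = mu0, Sigma0
    for _ in range(max_iter):
        mu_new, Sigma_new = update(mu, Sigma)
        if (np.linalg.norm(mu_new - mu) < tol
                and np.linalg.norm(Sigma_new - Sigma) < tol):
            return mu_new, Sigma_new
        mu, Sigma = mu_new, Sigma_new
    return mu, Sigma
```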
  • 12. MCMC algorithm. The "fixed point iteration" method can be realized by the Monte-Carlo Markov chain approach. Let $t$ chains be generated, and in each chain we generate multivariate Gaussian vectors $\eta_{j,k} \sim N(\mu^t, \Sigma^t)$, $k = 1, \dots, N^t$, where $N^t$ is the Monte-Carlo sample size at the $t$-th step.
  • 13. MCMC algorithm. In order to avoid computational problems when the intermediate results are very small, we have introduced the auxiliary function
$$r_j(\eta) = \prod_{m=1}^{M} f_j\!\left(Y_{jm}, \frac{N_j}{1+e^{-\eta_m}}\right) \Big/ \prod_{m=1}^{M} f_j\!\left(Y_{jm}, \frac{N_j}{1+e^{-\mu_m}}\right),$$
or, equivalently,
$$r_j(\eta) = \exp\!\left(\sum_{m=1}^{M}\left[N_j\!\left(\frac{e^{-\eta_m}}{1+e^{-\eta_m}} - \frac{e^{-\mu_m}}{1+e^{-\mu_m}}\right) + Y_{jm}\ln\frac{1+e^{-\mu_m}}{1+e^{-\eta_m}}\right]\right).$$
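One way to realize this in code is to accumulate the ratio in log space and exponentiate only at the end, assuming (as in the reconstruction above) that the reference point in the denominator is the current mean $\mu$; the names below are illustrative.

```python
import numpy as np

def log_r_j(eta, Y_j, N_j, mu):
    """Logarithm of the auxiliary ratio r_j(eta): the Poisson likelihood
    at the logits eta divided by the same likelihood at the reference
    point mu.  Summing logarithms avoids the underflow of the raw
    products of very small densities."""
    lam_eta = N_j / (1.0 + np.exp(-eta))
    lam_mu = N_j / (1.0 + np.exp(-mu))
    # sum over m of:  lam_mu - lam_eta + Y_jm * ln(lam_eta / lam_mu)
    return np.sum(lam_mu - lam_eta + Y_j * (np.log(lam_eta) - np.log(lam_mu)))

# illustrative call with assumed values
r = np.exp(log_r_j(eta=np.array([-2.8, -4.1, -5.0]),
                   Y_j=np.array([8, 3, 1]), N_j=200,
                   mu=np.array([-3.0, -4.0, -5.0])))
print(r)
```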
  • 14. MCMC algorithm. We then get the estimates of the parameters
$$\mu^{t+1} = \frac{1}{K}\sum_{j=1}^{K} \frac{\tilde m_j^t}{\tilde D_j^t}, \qquad \Sigma^{t+1} = \frac{1}{K}\sum_{j=1}^{K} \frac{\tilde S_j^t}{\tilde D_j^t},$$
where the Monte-Carlo estimators are as follows:
$$\tilde D_j^t = \frac{1}{N^t}\sum_{k=1}^{N^t} r_j(\eta_{j,k}), \qquad \widetilde{D2}_j^t = \frac{1}{N^t}\sum_{k=1}^{N^t} \left(r_j(\eta_{j,k}) - \tilde D_j^t\right)^2,$$
$$\tilde m_j^t = \frac{1}{N^t}\sum_{k=1}^{N^t} \eta_{j,k}\, r_j(\eta_{j,k}), \qquad \tilde S_j^t = \frac{1}{N^t}\sum_{k=1}^{N^t} \left(\eta_{j,k} - \frac{\tilde m_j^t}{\tilde D_j^t}\right)\!\left(\eta_{j,k} - \frac{\tilde m_j^t}{\tilde D_j^t}\right)^T r_j(\eta_{j,k}),$$
$$p_{j,m}^t = \frac{1}{N^t}\sum_{k=1}^{N^t} \frac{r_j(\eta_{j,k})}{1+e^{-\eta_{j,k,m}}}.$$
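A sketch of these estimators for a single population $j$ is given below; the centring of $\tilde S_j^t$ at the weighted mean $\tilde m_j^t/\tilde D_j^t$ and the form of $\widetilde{D2}_j^t$ follow the reconstruction above and are therefore assumptions.

```python
import numpy as np

def mc_estimators(eta, log_r):
    """Monte-Carlo estimators for one population j at step t.

    eta:   (N_t, M) Gaussian vectors eta_{j,k} ~ N(mu_t, Sigma_t)
    log_r: (N_t,)   log-weights ln r_j(eta_{j,k}) (see previous sketch)
    """
    r = np.exp(log_r)
    D_j = r.mean()                                    # D~_j
    D2_j = np.mean((r - D_j) ** 2)                    # D2~_j
    m_j = (eta * r[:, None]).mean(axis=0)             # m~_j
    diff = eta - m_j / D_j                            # centred at the weighted mean
    S_j = (diff[:, :, None] * diff[:, None, :] * r[:, None, None]).mean(axis=0)
    p_j = (r[:, None] / (1.0 + np.exp(-eta))).mean(axis=0)
    return D_j, D2_j, m_j, S_j, p_j

# averaging m_j / D_j and S_j / D_j over the K populations then gives
# mu^{t+1} and Sigma^{t+1}, as in the update formulas above
```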
  • 15. MCMC algorithm. Next, the estimate of the log-likelihood function is obtained using the Monte-Carlo estimate
$$\tilde L^t = -\sum_{j=1}^{K} \ln \tilde D_j^t,$$
its sample variance estimate is
$$d^t = \sum_{j=1}^{K} \frac{\widetilde{D2}_j^t}{(N^t - 1)\,(\tilde D_j^t)^2},$$
and the estimates of the event probabilities in the populations are
$$\tilde P_{jm}^t = \frac{p_{j,m}^t}{\tilde D_j^t}.$$
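Expressed in code, these step-$t$ summaries take only a few lines; the $(N^t - 1)$ factor in the variance follows the reconstruction above and is an assumption.

```python
import numpy as np

def chain_summaries(D, D2, p, N_t):
    """Step-t summaries from the per-population Monte-Carlo estimators.

    D, D2: (K,) estimators D~_j and their sample variances D2~_j
    p:     (K, M) estimators p_{j,m}
    N_t:   Monte-Carlo sample size at step t
    """
    log_lik = -np.sum(np.log(D))                      # estimate of L
    var_log_lik = np.sum(D2 / ((N_t - 1) * D ** 2))   # its sampling variance d^t
    P = p / D[:, None]                                # event probability estimates
    return log_lik, var_log_lik, P
```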
  • 16. MCMC algorithm. The Monte-Carlo chain can be terminated at the $t$-th step if the estimates of two successive steps differ insignificantly. Thus, the hypothesis on the termination condition is rejected if the statistic $H^t$, constructed from the Monte-Carlo variances $\widetilde{D2}_j^t / (\tilde D_j^t)^2$ and the differences between the parameter estimates $(\mu, \Sigma)$ of successive steps, exceeds the Fisher quantile $F_{\gamma,\nu}$.
  • 17. MCMC algorithm. The following rule of sample size regulation is implemented so that large samples are taken only at the moment of making the decision on the termination of the Monte-Carlo Markov chain:
$$N^{t+1} = \frac{N^t \cdot F_{\gamma,\nu}}{H^t},$$
where $F_{\gamma,\nu}$ is Fisher's quantile and $\gamma$ is the significance level.
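A sketch of such a regulation rule, assuming the form $N^{t+1} = N^t F_{\gamma,\nu} / H^t$ reconstructed above and the sample-size bounds quoted later in the computer simulation; the values of $H^t$ and of the quantile in the example call are made up.

```python
import numpy as np

def next_sample_size(N_t, H_t, fisher_quantile, n_min=500, n_max=17_000):
    """Adaptive Monte-Carlo sample size for the next step: the sample stays
    small while the chain is far from termination and grows only when the
    termination statistic H_t approaches the Fisher quantile.  The bounds
    follow the limits quoted in the computer simulation."""
    N_next = int(np.ceil(N_t * fisher_quantile / H_t))
    return min(max(N_next, n_min), n_max)

# illustrative call; the values of H_t and of the quantile are made up
print(next_sample_size(N_t=500, H_t=0.35, fisher_quantile=2.6))
```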
  • 18. MCMC algorithm. Application of this rule allows a rational choice of the sample sizes in the Monte-Carlo Markov chain and ensures the convergence of the maximum likelihood function.
  • 19. Computer simulation. Next, we used simulated data to construct and estimate this statistical model. A random sample of $K = 10$ populations $\Pi_1, \Pi_2, \dots, \Pi_K$, in which $M = 3$ events can occur, has been simulated to explore the approach developed. The logits of the probabilities are normally distributed with the parameters
$$\mu = \begin{pmatrix} -3 \\ -4 \\ -5 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 0.25 & 0 & 0 \\ 0 & 0.25 & 0 \\ 0 & 0 & 0.25 \end{pmatrix}.$$
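A data set of this kind can be generated along the following lines; the population sizes $N_j$ are not reported on the slide, so the range used below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2013)

K, M = 10, 3
mu_true = np.array([-3.0, -4.0, -5.0])
Sigma_true = 0.25 * np.eye(M)

# the population sizes N_j are not given on the slide; this range is assumed
N = rng.integers(100, 2_000, size=K)

tau = rng.multivariate_normal(mu_true, Sigma_true, size=K)   # logits per population
P = 1.0 / (1.0 + np.exp(-tau))                               # event probabilities
Y = rng.poisson(N[:, None] * P)                              # observed event counts

print(N, Y, sep="\n")
```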
  • 20. Computer simulation. Next, we have computed the Monte-Carlo Markov chain of $t = 100$ estimators. To avoid very small or very large sample sizes, the following limits were applied: $500 \le N^t \le 17000$. The termination condition started to hold after $t = 6$ iterations. The estimates of the parameters obtained are given below:
  • 21. Estimates of parameters

Iteration | µ1 | µ2 | µ3 | Log-likelihood function | Confidence interval | Sample size | Statistical hypothesis
1 | -2.96 | -4.29 | -5.52 | -62.90 | 5.57 | 500 | 9.55
2 | -2.89 | -4.04 | -5.27 | -396.58 | 4.81 | 500 | 6.18
3 | -2.91 | -4.03 | -5.19 | -420.42 | 2.97 | 500 | 3.86
4 | -2.90 | -4.04 | -5.16 | -424.87 | 3.2 | 500 | 0.35
5 | -2.91 | -4.04 | -5.13 | -428.05 | 1.57 | 2963 | 1.41
6 | -2.90 | -4.04 | -5.14 | -427.57 | 1.32 | 4383 | 0.32
7 | -2.91 | -4.04 | -5.13 | -425.54 | 0.75 | 13986 | 0.40
8 | -2.91 | -4.04 | -5.14 | -425.33 | 0.75 | 14345 | 0.40
9 | -2.91 | -4.04 | -5.13 | -425.71 | 0.75 | 13525 | 0.84
10 | -2.91 | -4.04 | -5.13 | -426.47 | 0.75 | 15135 | 0.22
  • 22. Conclusions. The empirical Bayesian approach applied to the estimation of multi-dimensional frequencies has been described in this work. In this paper we:
• presented the iterative "fixed point iteration" method to compute the estimates;
• introduced the Monte-Carlo Markov Chain procedure with adaptive regulation of the sample size and statistical treatment of the simulation error;
• computed the empirical Bayesian estimates of the unknown parameters and of the probabilities of the events.
The approach developed can be applied in the analysis of social and medical data.
  • 23. COMPUTER DAYS – 2013 Šiauliai