Vaičiulytė, Ingrida; Sakalauskas, Leonidas: "Multidimensional rare event probability estimation algorithm" (VU MII)

Presentation in the section "Data Mining and Optimization" of the XVI Computer Scientists' Conference,
"Kompiuterininkų dienos – 2013" (Computer Days 2013), Šiauliai, 2013-09-21

  1. 1. MULTIDIMENSIONAL RARE EVENT PROBABILITY ESTIMATION ALGORITHM Ingrida Vaičiulytė Vilnius University Mathematics and Informatics Institute COMPUTER DAYS – 2013 Šiauliai
  2. 2. Introduction This work describes the empirical Bayesian approach applied to the estimation of multidimensional frequencies. It also introduces the Monte-Carlo Markov chain (MCMC) procedure, which is designed for Bayesian computation. The discrete variable, the number of occurrences of a rare event, is modelled with two statistical models: a normal distribution with unknown mean and variance, and the Poisson distribution.
  3. 3. Introduction Let us consider a set of $K$ populations, where each population $j$ consists of $N_j$ individuals, $j = 1, \dots, K$. Assume that some event (e.g., death due to some disease, an insured event) can occur in the populations under observation.
  4. 4. The aim Our aim is to estimate the unknown probabilities $P_j^m$ of events in the populations when the numbers of events $Y_j^m$ are observed, $j = 1, \dots, K$; $m = 1, \dots, M$. Since the simple relative risk estimate $Y_j^m / N_j$ cannot be used in many cases due to great differences in the population sizes $N_j$, the empirical Bayesian approach is applied.
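The instability of the simple estimate for small populations can be illustrated with a short simulation. This is only a sketch: the event probability, the three population sizes and the number of replications are made-up values, not data from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

p_true = 0.01                           # hypothetical event probability
sizes = np.array([50, 500, 50_000])     # hypothetical population sizes N_j
reps = 2000                             # number of simulated data sets

# Simulate Y ~ Poisson(N * P) repeatedly for every population size.
counts = rng.poisson(sizes * p_true, size=(reps, sizes.size))

# Simple relative-risk estimate Y / N and its spread across replications.
naive = counts / sizes
spread = naive.std(axis=0)

for n, s in zip(sizes, spread):
    print(f"N = {n:6d}: std of Y/N = {s:.5f}")
```

The spread of $Y/N$ shrinks roughly like $\sqrt{P/N}$, so estimates for small populations are far noisier than those for large ones, which is the motivation for borrowing strength across populations.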
  5. 5. Poisson-Gaussian model It is often justified to assume that the numbers of cases $Y_j^m$ follow the Poisson distribution with the parameters $\lambda_j^m = N_j P_j^m$, whose density is as follows:
     $$f(Y_j^m, \lambda_j^m) = e^{-\lambda_j^m} \frac{(\lambda_j^m)^{Y_j^m}}{Y_j^m!}, \quad j = 1, \dots, K.$$
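A minimal sketch of this density, evaluated on the log scale so that small probabilities do not underflow. The function name `poisson_log_pmf` and its argument names are illustrative, not taken from the talk.

```python
import math

def poisson_log_pmf(y, n, p):
    """Log of the Poisson density with rate lambda = n * p at the count y."""
    lam = n * p
    # ln f = -lambda + y * ln(lambda) - ln(y!), with ln(y!) = lgamma(y + 1).
    return -lam + y * math.log(lam) - math.lgamma(y + 1)

# Probability of observing no events in a population of 100 when P = 0.01.
print(math.exp(poisson_log_pmf(0, 100, 0.01)))
```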
  6. 6. Poisson-Gaussian model The empirical Bayesian method is a two-stage procedure that depends on the prior distribution introduced in the second stage. It is of interest to consider a model in which the logits $\eta = \ln \frac{P}{1 - P}$ are normally distributed with the parameters $\mu, \Sigma$.
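For reference, the logit transform and its inverse can be sketched as follows; the helper names `logit` and `inv_logit` are illustrative only.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Map a logit back to a probability: P = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

# A rare-event probability and its logit.
print(logit(0.003), inv_logit(logit(0.003)))
```

Working on the logit scale lets a Gaussian prior cover probabilities that are confined to the interval (0, 1).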
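  7. 7. Poisson-Gaussian model Thus the density of the logit is
     $$g(\eta, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^M |\Sigma|}} \exp\left(-\frac{(\eta - \mu)^T \Sigma^{-1} (\eta - \mu)}{2}\right).$$
     Then the rates $P_j^m$ are evaluated as a posteriori means for given $\mu, \Sigma$:
     $$P_j^m = \frac{1}{D_j(\mu, \Sigma)} \int \frac{1}{1 + e^{-\eta_m}} \prod_{i=1}^{M} f\left(Y_j^i, \frac{N_j}{1 + e^{-\eta_i}}\right) g(\eta, \mu, \Sigma)\, d\eta,$$
     where
     $$D_j(\mu, \Sigma) = \int \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu, \Sigma)\, d\eta, \quad j = 1, \dots, K, \; m = 1, \dots, M.$$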
  8. 8. Maximum likelihood method The Bayesian analysis is often related in statistics to the minimization of a certain function, expressed as the integral of the a posteriori density. Thus, in the empirical Bayesian approach, the unknown parameters $\mu, \Sigma$ are estimated by the maximum likelihood method. After some manipulation we get the logarithmic likelihood function
     $$L(\mu, \Sigma) = -\sum_{j=1}^{K} \ln\left(\int \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu, \Sigma)\, d\eta\right) = -\sum_{j=1}^{K} \ln D_j(\mu, \Sigma),$$
     which has to be minimized to get estimates of the parameters.
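  9. 9. Derivatives of the maximum likelihood function The likelihood function is differentiable many times with respect to the parameters $\mu, \Sigma$, and the respective first derivatives of this function are as follows:
     $$\frac{\partial L(\mu, \Sigma)}{\partial \mu} = -\Sigma^{-1} \sum_{j=1}^{K} \frac{1}{D_j(\mu, \Sigma)} \int (\eta - \mu) \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu, \Sigma)\, d\eta,$$
     $$\frac{\partial L(\mu, \Sigma)}{\partial \Sigma} = \frac{1}{2} \Sigma^{-1} \left( K \Sigma - \sum_{j=1}^{K} \frac{1}{D_j(\mu, \Sigma)} \int (\eta - \mu)(\eta - \mu)^T \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu, \Sigma)\, d\eta \right) \Sigma^{-1}.$$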
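  10. 10. Poisson-Gaussian model estimates The maximum likelihood estimates of the parameters $\mu, \Sigma$ of the Poisson-Gaussian model are found by solving the equations obtained by setting the first derivatives equal to zero:
     $$\mu = \frac{1}{K} \sum_{j=1}^{K} \frac{1}{D_j(\mu, \Sigma)} \int \eta \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu, \Sigma)\, d\eta,$$
     $$\Sigma = \frac{1}{K} \sum_{j=1}^{K} \frac{1}{D_j(\mu, \Sigma)} \int (\eta - \mu)(\eta - \mu)^T \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu, \Sigma)\, d\eta.$$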
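  11. 11. Poisson-Gaussian model estimates For instance, the "fixed point iteration" method is useful for solving these equations in order to get the maximum likelihood estimates of $\mu, \Sigma$:
     $$\mu^{t+1} = \frac{1}{K} \sum_{j=1}^{K} \frac{1}{D_j(\mu^t, \Sigma^t)} \int \eta \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu^t, \Sigma^t)\, d\eta,$$
     $$\Sigma^{t+1} = \frac{1}{K} \sum_{j=1}^{K} \frac{1}{D_j(\mu^t, \Sigma^t)} \int (\eta - \mu^t)(\eta - \mu^t)^T \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu^t, \Sigma^t)\, d\eta.$$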
  12. 12. MCMC algorithm The "fixed point iteration" method can be realized by the Monte-Carlo Markov chain approach. Let $t$ chains be generated; in each chain we generate multivariate Gaussian vectors $\eta_{j,k} \sim N(\mu^t, \Sigma^t)$, $k = 1, \dots, N^t$, where $N^t$ is the Monte-Carlo sample size at the $t$-th step.
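  13. 13. MCMC algorithm In order to avoid computational problems when the intermediate results are very small, we have introduced the auxiliary function
     $$r_j(\eta) = \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) \Big/ \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\mu_m}}\right),$$
     or, in logarithmic form,
     $$\ln r_j(\eta) = \sum_{m=1}^{M} \left[ N_j \frac{e^{-\eta_m} - e^{-\mu_m}}{(1 + e^{-\eta_m})(1 + e^{-\mu_m})} + Y_j^m \ln \frac{1 + e^{-\mu_m}}{1 + e^{-\eta_m}} \right].$$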
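The point of such an auxiliary function is that a ratio of Poisson likelihoods can be computed on the log scale without ever forming the tiny individual likelihoods. The sketch below assumes the reference point is the current mean vector `mu`; the slide text does not let this be confirmed, and the function name `log_weight` is illustrative.

```python
import numpy as np

def log_weight(eta, mu, y, n):
    """Log of prod_m f(y_m, n/(1+e^-eta_m)) / prod_m f(y_m, n/(1+e^-mu_m)).

    eta: logits, shape (M,) or (N, M); mu: reference logits, shape (M,);
    y: observed counts, shape (M,); n: population size.
    ASSUMPTION: the reference point of the ratio is mu.
    """
    a = np.exp(-eta)
    b = np.exp(-mu)
    # Difference of the Poisson rates: lambda(mu) - lambda(eta).
    rate_term = n * (a - b) / ((1.0 + a) * (1.0 + b))
    # y * ln(lambda(eta) / lambda(mu)); the factorial terms cancel.
    count_term = y * (np.log1p(b) - np.log1p(a))
    return np.sum(rate_term + count_term, axis=-1)

mu = np.array([-3.0, -4.0, -5.0])
y = np.array([40, 15, 7])          # hypothetical counts
print(log_weight(np.array([-2.5, -4.2, -5.1]), mu, y, 1000))
```

Exponentiating the result gives a weight of order one near the reference point, which is what keeps the Monte-Carlo sums numerically well behaved.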
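  14. 14. MCMC algorithm And then we get the estimates of the parameters
     $$\mu^{t+1} = \frac{1}{K} \sum_{j=1}^{K} \frac{\tilde m_j^t}{\tilde D_j^t}, \qquad \Sigma^{t+1} = \frac{1}{K} \sum_{j=1}^{K} \frac{\tilde S_j^t}{\tilde D_j^t},$$
     where the Monte-Carlo estimators are as follows:
     $$\tilde D_j^t = \sum_{k=1}^{N^t} r_j(\eta_{j,k}), \qquad \tilde D2_j^t = \sum_{k=1}^{N^t} r_j(\eta_{j,k})^2, \qquad \tilde m_j^t = \sum_{k=1}^{N^t} \eta_{j,k}\, r_j(\eta_{j,k}),$$
     $$\tilde S_j^t = \sum_{k=1}^{N^t} \left(\eta_{j,k} - \frac{\tilde m_j^t}{\tilde D_j^t}\right) \left(\eta_{j,k} - \frac{\tilde m_j^t}{\tilde D_j^t}\right)^T r_j(\eta_{j,k}), \qquad p_{j,m}^t = \sum_{k=1}^{N^t} \frac{r_j(\eta_{j,k})}{1 + e^{-\eta_{j,k,m}}}.$$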
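One step of such a Monte-Carlo fixed-point update can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the weights are Poisson likelihoods rescaled by their largest value (a different normalisation from the auxiliary function of the talk, which cancels in every ratio used here), the scatter is centred at the current mean $\mu^t$, and the population sizes and counts are made up. The name `mc_step` is illustrative.

```python
import numpy as np

def mc_step(mu, sigma, counts, sizes, n_samples, rng):
    """One Monte-Carlo fixed-point update of (mu, sigma) and the probabilities."""
    K, M = counts.shape
    mu_new = np.zeros(M)
    sigma_new = np.zeros((M, M))
    p_hat = np.zeros((K, M))
    for j in range(K):
        # Draw eta_{j,k} ~ N(mu^t, Sigma^t), k = 1..N^t.
        eta = rng.multivariate_normal(mu, sigma, size=n_samples)
        prob = 1.0 / (1.0 + np.exp(-eta))
        lam = sizes[j] * prob
        # Poisson log-likelihood up to a constant (the Y! term cancels in ratios).
        log_w = np.sum(counts[j] * np.log(lam) - lam, axis=1)
        # ASSUMPTION: rescale by the largest weight to avoid underflow.
        w = np.exp(log_w - log_w.max())
        d = w.sum()
        mu_new += (w @ eta) / d / K
        # ASSUMPTION: centre the scatter at the current mean mu^t.
        centred = eta - mu
        sigma_new += (centred.T * w) @ centred / d / K
        p_hat[j] = (w @ prob) / d
    return mu_new, sigma_new, p_hat

rng = np.random.default_rng(0)
sizes = np.array([2000, 5000, 8000, 12000, 20000])      # hypothetical N_j
true_p = 1.0 / (1.0 + np.exp(-np.array([-3.0, -4.0, -5.0])))
counts = rng.poisson(sizes[:, None] * true_p)           # simulated Y_j^m
mu1, sigma1, p1 = mc_step(np.array([-3.0, -4.0, -5.0]), 0.25 * np.eye(3),
                          counts, sizes, 500, rng)
print(mu1, p1[0])
```

Repeating the step with the returned `mu1` and `sigma1` as the new inputs gives the iteration; a termination rule and a sample-size rule would sit around this loop.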
  15. 15. MCMC algorithm Next, the estimate of the log-likelihood function is obtained using the Monte-Carlo estimate
     $$L^t = -\sum_{j=1}^{K} \ln \tilde D_j^t,$$
     its sample variance estimate
     $$d^t = \sum_{j=1}^{K} \left( \frac{N^t\, \tilde D2_j^t}{(\tilde D_j^t)^2} - 1 \right),$$
     and the estimate of the event probabilities in the populations
     $$\tilde P_{j,m}^t = \frac{p_{j,m}^t}{\tilde D_j^t}.$$
  16. 16. MCMC algorithm The Monte-Carlo chain can be terminated at the $t$-th step if the estimates of two successive steps differ insignificantly. Thus, the hypothesis on the termination condition is rejected if K Ht 1 K K j 1 k 1 ~ D 2tj ~ 2 D tj ln k SP k 1 k 1 k 1 k T k 1 k 1 k M F ,v
  17. 17. MCMC algorithm The following rule of sample size regulation is implemented, so that large samples are taken only at the moment of deciding on the termination of the Monte-Carlo Markov chain: t N F ,v t 1 N v F t H ,v - Fisher’s quantile, - is the significance level.
  18. 18. MCMC algorithm Applying this rule allows a rational selection of the sample sizes in the Monte-Carlo Markov chain, so as to ensure the convergence of the maximum likelihood function.
  19. 19. Computer simulation Next, we used known data to construct and estimate this statistical model. A random sample of $K = 10$ populations, in which $M = 3$ events can occur, has been simulated to explore the approach developed. The logits of the probabilities are normally distributed with the parameters
     $$\mu = \begin{pmatrix} -3 \\ -4 \\ -5 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 0.25 & 0 & 0 \\ 0 & 0.25 & 0 \\ 0 & 0 & 0.25 \end{pmatrix}.$$
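A data set of this kind could be simulated as sketched below. The values K = 10, M = 3, the mean logits (-3, -4, -5) and the variance 0.25 come from the slide; the population sizes are not given in the talk, so the range used for `sizes` is purely hypothetical, as is the random seed.

```python
import numpy as np

rng = np.random.default_rng(2013)

K, M = 10, 3
mu = np.array([-3.0, -4.0, -5.0])
sigma = 0.25 * np.eye(M)

# ASSUMPTION: population sizes are not stated in the talk; drawn arbitrarily.
sizes = rng.integers(1_000, 100_000, size=K)

# Logits of the event probabilities, one row per population.
eta = rng.multivariate_normal(mu, sigma, size=K)
probs = 1.0 / (1.0 + np.exp(-eta))

# Observed numbers of events Y_j^m ~ Poisson(N_j * P_j^m).
counts = rng.poisson(sizes[:, None] * probs)
print(counts)
```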
  20. 20. Computer simulation Next, we computed a Monte-Carlo Markov chain of $t = 100$ estimators. To avoid very small or very large sample sizes, the limits $500 \le N^k \le 17000$ were applied. The termination conditions started to be valid after $t = 6$ iterations. We obtained the following means of the parameters:
  21. 21. Estimates of parameters

      Iteration     µ1      µ2      µ3   Log-likelihood   Confidence   Sample   Statistical
                                               function     interval     size    hypothesis
              1  -2.96   -4.29   -5.52           -62.90         5.57      500          9.55
              2  -2.89   -4.04   -5.27          -396.58         4.81      500          6.18
              3  -2.91   -4.03   -5.19          -420.42         2.97      500          3.86
              4  -2.90   -4.04   -5.16          -424.87         3.20      500          0.35
              5  -2.91   -4.04   -5.13          -428.05         1.57    2 963          1.41
              6  -2.90   -4.04   -5.14          -427.57         1.32    4 383          0.32
              7  -2.91   -4.04   -5.13          -425.54         0.75   13 986          0.40
              8  -2.91   -4.04   -5.14          -425.33         0.75   14 345          0.40
              9  -2.91   -4.04   -5.13          -425.71         0.75   13 525          0.84
             10  -2.91   -4.04   -5.13          -426.47         0.75   15 135          0.22
  22. 22. Conclusions The empirical Bayesian approach applied to the estimation of multidimensional frequencies has been described in this work. In this paper we:
      • presented the iterative "fixed point iteration" method to compute the estimates;
      • introduced the Monte-Carlo Markov chain procedure with adaptive regulation of the sample size and statistical treatment of the simulation error;
      • computed the empirical Bayesian estimates of the unknown parameters and of the probabilities of the events.
      The approach developed can be applied in the analysis of social and medical data.