Dynamic modelling of document streams

Presentation for the GECCO conference

Published in: Technology

  1. A Genetic Algorithm for Dynamic Modelling and Prediction of Activity in Document Streams
       Lourdes Araujo, JJ Merelo
       lurdes@lsi.uned.es, jj@merelo.net
       Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia
       Dpto. Arquitectura y Tecnología de Computadores, Universidad de Granada
       Spain
  2. Why
       • Document metadata, such as arrival time, helps organize document streams.
       • Temporal information helps make sense of document streams such as e-mails and news items.
       • Its study combines content analysis and time series modelling.
  3. Showing interest
       • Hypothesis: explosions in interest match points in time where arrival intensity increases sharply.
       • In general, arrival time is quite irregular.
       [Figure: number of document arrivals (y-axis) against time (x-axis)]
  4. Regularizing irregularity
       • A cost function is introduced that reflects how hard it is to climb from one state to another.
       • Intervals of similar frequency should be grouped in a single state, so changes of state will be penalized. But we shouldn't overdo it.
  5. Kleinberg's model
       • The document stream is modeled as an infinite-state automaton, $A$, which emits messages with different frequencies.
       • Each state has a frequency assigned.
       • Bursts are indicated by transitions from a lower to a higher state.
       • Frequency changes are controlled by assigning costs to state changes, avoiding small explosions and making identification of real explosions easier.
  6. Infinite state automaton model
       • Generation of the time sequence is based on an exponential distribution.
       • The time interval $x$ between messages $i$ and $i + 1$ follows the exponential density function $f(x) = \alpha e^{-\alpha x}$, for $\alpha > 0$.
       • The expected value of the interval is $\alpha^{-1}$.
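The exponential gap model above can be checked numerically. The following is a minimal sketch, not code from the presentation: the function names, the rate `alpha=2.0`, and the sample size are illustrative assumptions. It draws gaps from $f(x) = \alpha e^{-\alpha x}$ and compares their mean with $\alpha^{-1}$.

```python
import random


def sample_gaps(alpha, n, seed=0):
    """Draw n inter-arrival gaps from f(x) = alpha * exp(-alpha * x)."""
    rng = random.Random(seed)
    return [rng.expovariate(alpha) for _ in range(n)]


def arrival_times(gaps):
    """Turn gaps into arrival times by accumulating them."""
    times, t = [], 0.0
    for gap in gaps:
        t += gap
        times.append(t)
    return times


# Hypothetical rate: alpha = 2.0 messages per time unit.
gaps = sample_gaps(alpha=2.0, n=100_000)
mean_gap = sum(gaps) / len(gaps)  # should be close to 1 / alpha = 0.5
times = arrival_times(gaps)
```

A higher `alpha` means shorter expected gaps, which is why higher automaton states correspond to bursts.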
  7. First things first: two-state model
       • Basic model: a 2-state probabilistic automaton $A$, with states $q_0$ (low emission rate) and $q_1$ (high emission rate).
       • $n + 1$ messages, $n$ intervals: a Bayes procedure is used to fit to a conditional probability of a state sequence $q = (q_{i_1}, \cdots, q_{i_n})$:
         $$c(q|x) = b \ln\left(\frac{1-p}{p}\right) + \left(\sum_{t=1}^{n} -\ln f_{i_t}(x_t)\right)$$
         where $b$ is the number of state transitions. The first term favours a low number of transitions; the second favours states that fit the sequence.
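The two-state cost can be evaluated directly. The sketch below is an illustration under stated assumptions: $p$ is read as the probability of changing state (as in Kleinberg's formulation), $f_i$ is the exponential density with rate `alphas[i]`, and the rates, gaps, and function names are made up for the example.

```python
import math


def neg_log_density(alpha, x):
    """-ln f(x) for the exponential density f(x) = alpha * exp(-alpha * x)."""
    return alpha * x - math.log(alpha)


def two_state_cost(states, gaps, alphas, p):
    """c(q|x) = b * ln((1 - p) / p) + sum_t -ln f_{i_t}(x_t)."""
    if len(states) != len(gaps):
        raise ValueError("one state per interval is required")
    # b counts positions where consecutive states differ.
    b = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    transition_term = b * math.log((1 - p) / p)
    fit_term = sum(neg_log_density(alphas[s], x) for s, x in zip(states, gaps))
    return transition_term + fit_term


# Hypothetical stream: calm, then a burst of six short gaps, then calm again.
gaps = [1.0] * 3 + [0.05] * 6 + [1.0] * 3
alphas = (1.0, 10.0)  # assumed rates for q0 (low) and q1 (high)
calm_cost = two_state_cost([0] * 12, gaps, alphas, p=0.1)
bursty_cost = two_state_cost([0] * 3 + [1] * 6 + [0] * 3, gaps, alphas, p=0.1)
```

With these numbers the sequence that switches to $q_1$ during the burst is cheaper than staying in $q_0$, because the better fit outweighs the two transition penalties. A shorter burst would not pay for them.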
  8. To the infinite and beyond
       • Given a sequence of intervals $x = (x_1, x_2, \cdots, x_n)$, a sequence $q = (q_{i_1}, \cdots, q_{i_n})$ must be found that minimizes
         $$c(q|x) = \sum_{t=0}^{n-1} \tau(i_t, i_{t+1}) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)$$
       • $f$ is related to the resolution of the discrete rates within the continuous emission rates, and $\tau$ to the ease of changing state.
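The slide does not spell out $\tau$ or the rates behind $f_i$. The sketch below assumes the parameterization from Kleinberg's burst model, which is not stated in the presentation: rates $\alpha_i = s^i / \hat{g}$ with $\hat{g}$ the mean gap, $\tau(i, j) = (j - i)\,\gamma \ln n$ for $j > i$ and $0$ otherwise, and a start state $i_0 = 0$. All names are illustrative.

```python
import math


def make_rates(gaps, s, k):
    """Assumed rates alpha_i = s**i / g_hat, where g_hat is the mean gap."""
    g_hat = sum(gaps) / len(gaps)
    return [(s ** i) / g_hat for i in range(k)]


def tau(i, j, gamma, n):
    """Assumed transition cost: moving up costs, moving down is free."""
    return (j - i) * gamma * math.log(n) if j > i else 0.0


def sequence_cost(states, gaps, rates, gamma):
    """c(q|x) = sum_t tau(i_t, i_{t+1}) + sum_t -ln f_{i_t}(x_t), with i_0 = 0."""
    n = len(gaps)
    cost, prev = 0.0, 0
    for state, x in zip(states, gaps):
        cost += tau(prev, state, gamma, n)                 # transition into this state
        cost += rates[state] * x - math.log(rates[state])  # -ln f_state(x)
        prev = state
    return cost


gaps = [1.0, 1.0, 0.1, 0.1, 1.0]
rates = make_rates(gaps, s=2.0, k=3)
flat_cost = sequence_cost([0] * 5, gaps, rates, gamma=1.0)
```

This function is what an evolutionary algorithm would use as its fitness, since the slides set the fitness function equal to the cost function.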
  11. Infinite is a bit too much
       • $A^*_{s,\gamma}$, which minimizes $c(q|x)$, is restricted to $A^k_{s,\gamma}$, with $k$ states.
       • We will use an evolutionary algorithm to find $A^k_{s,\gamma}$.
       • Finally!
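With only $k$ states, the minimum-cost state sequence can also be found exactly by dynamic programming, which is the Viterbi baseline the presentation later compares against. The sketch below is one possible version, not the authors' code. It assumes Kleinberg's parameterization (rates $s^i / \hat{g}$, upward transition cost $(j - i)\,\gamma \ln n$, start state 0); the function name and defaults are hypothetical.

```python
import math


def viterbi_states(gaps, k, s=2.0, gamma=1.0):
    """Return (state sequence, cost) minimizing c(q|x) over k states."""
    n = len(gaps)
    g_hat = sum(gaps) / n
    rates = [(s ** i) / g_hat for i in range(k)]  # assumed rate ladder
    log_n = math.log(n)

    def tau(i, j):
        return (j - i) * gamma * log_n if j > i else 0.0

    def fit(i, x):
        return rates[i] * x - math.log(rates[i])  # -ln f_i(x)

    # best[j]: cheapest cost of any sequence ending in state j (start state 0).
    best = [tau(0, j) + fit(j, gaps[0]) for j in range(k)]
    back = []
    for x in gaps[1:]:
        new_best, pointers = [], []
        for j in range(k):
            i_min = min(range(k), key=lambda i: best[i] + tau(i, j))
            new_best.append(best[i_min] + tau(i_min, j) + fit(j, x))
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path backwards.
    state = min(range(k), key=lambda j: best[j])
    total = best[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, total


gaps = [1.0] * 20 + [0.01] * 20 + [1.0] * 20
path, total = viterbi_states(gaps, k=3)
```

The running time grows with $n k^2$, which is the cost the evolutionary algorithm tries to avoid on long streams.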
  12. Individual representation
       • An integer sequence, $1 < q_{i_j} < E$, representing the automaton state and the id $i$ of the last document in the sequence.
       • Document $i$ arrives at $0 \le t_i \le T$ (intervals $x_i = t_i - t_{i-1}$).
         Timeline: $t_1 \; t_2 \; \cdots \; t_n$
         Chromosome: $| \; q_{t_1}, t_{k_1} \; | \; q_{t_{k_1}+1}, t_{k_2} \; | \; \cdots \; | \; q_{t_f}, t_n \; |$
       • Fitness function = cost function.
       • Initial population: documents chosen at random split the document stream into intervals, which are given random states.
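One way to encode such an individual is a list of genes, each a pair of a state and the id of the last document it covers. The sketch below is a hypothetical encoding, not the authors' implementation: it uses 0-based states `0 .. num_states - 1` for indexing convenience, and `max_genes` is an invented cap on the initial number of genes.

```python
import random


def random_individual(n, num_states, max_genes, rng):
    """Split documents 1..n at random cut points; each segment gets a random state."""
    n_cuts = rng.randint(0, min(max_genes, n) - 1)
    cuts = sorted(rng.sample(range(1, n), n_cuts))  # ids of last documents
    ends = cuts + [n]                               # the final gene ends at n
    return [(rng.randrange(num_states), end) for end in ends]


def expand(individual):
    """Expand genes into one state per interval, ready for the cost function."""
    states, start = [], 0
    for state, end in individual:
        states.extend([state] * (end - start))
        start = end
    return states


rng = random.Random(1)
individual = random_individual(n=50, num_states=5, max_genes=10, rng=rng)
per_interval_states = expand(individual)
```

Because a gene covers a whole run of documents, the chromosome stays short even for long streams, and the fitness of an individual is the cost of its expanded state sequence.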
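  13. Crossover
       • Parent 1: genes $g_{11} \cdots g_{1i} \cdots g_{1f_1}$, with $g_{11} = (q_{11}, (t_1, \cdots))$, $g_{1i} = (q_{1i}, (t - n_1, \cdots, t, \cdots, t + m_1))$ and $g_{1f_1} = (q_{1f_1}, (\cdots, t_n))$.
       • Parent 2: genes $g_{21} \cdots g_{2j} \cdots g_{2f_2}$, with $g_{21} = (q_{21}, (t_1, \cdots))$, $g_{2j} = (q_{2j}, (t - n_2, \cdots, t, \cdots, t + m_2))$ and $g_{2f_2} = (q_{2f_2}, (\cdots, t_n))$.
       • The crossover point (c.p.) is document $t$, which lies inside gene $g_{1i}$ of parent 1 and gene $g_{2j}$ of parent 2.
       • Offspring 1: $g_{11} \cdots g_{1,i-1} \;?\; g_{2,j+1} \cdots g_{2f_2}$, with states $q_{11} \cdots q_{1,i-1} \;?\; q_{2,j+1} \cdots q_{2f_2}$ over the documents $(t_1, \cdots) \cdots (\cdots, t - n_1 - 1) \;?\; (t + m_2 + 1, \cdots) \cdots (\cdots, t_n)$.
       • Offspring 2: $g_{21} \cdots g_{2,j-1} \;?\; g_{1,i+1} \cdots g_{1f_1}$, with states $q_{21} \cdots q_{2,j-1} \;?\; q_{1,i+1} \cdots q_{1f_1}$ over the documents $(t_1, \cdots) \cdots (\cdots, t - n_2 - 1) \;?\; (t + m_1 + 1, \cdots) \cdots (\cdots, t_n)$.
       • In each offspring, ? marks the gene around the crossover point.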
  14. Mutation
       • Several mutation operators:
           • Increment the state by one.
           • Merge two genes, with the state taken randomly.
           • Split a gene in two: one with the original state, another with the state ±1.
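The three operators can be sketched on the list-of-genes encoding, where a gene is a `(state, last_document_id)` pair. Several details are assumptions the slide leaves open: states are clamped to `0 .. num_states - 1`, the merged gene takes the state of one of the two merged genes at random, and after a split the first half keeps the original state.

```python
import random


def mutate_increment(ind, rng, num_states):
    """Raise the state of one random gene by one (clamped at the top state)."""
    i = rng.randrange(len(ind))
    state, end = ind[i]
    out = list(ind)
    out[i] = (min(state + 1, num_states - 1), end)
    return out


def mutate_merge(ind, rng):
    """Merge two adjacent genes; the state comes from one of them at random."""
    if len(ind) < 2:
        return list(ind)
    i = rng.randrange(len(ind) - 1)
    state = rng.choice([ind[i][0], ind[i + 1][0]])
    return ind[:i] + [(state, ind[i + 1][1])] + ind[i + 2:]


def mutate_split(ind, rng, num_states):
    """Split one gene in two: original state, then original state +/- 1."""
    starts = [0] + [end for _, end in ind[:-1]]
    candidates = [i for i, (_, end) in enumerate(ind) if end - starts[i] >= 2]
    if not candidates:
        return list(ind)
    i = rng.choice(candidates)
    state, end = ind[i]
    cut = rng.randrange(starts[i] + 1, end)  # last document of the first half
    new_state = max(0, min(num_states - 1, state + rng.choice([-1, 1])))
    return ind[:i] + [(state, cut), (new_state, end)] + ind[i + 1:]


rng = random.Random(0)
parent = [(0, 4), (2, 7), (1, 10)]
incremented = mutate_increment(parent, rng, num_states=3)
merged = mutate_merge(parent, rng)
split = mutate_split(parent, rng, num_states=3)
```

Each operator returns a new list and keeps the last gene ending at the final document, so every mutant still covers the whole stream.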
  15. Effect of crossover
       [Figure: generation number (y-axis) against crossover rate in % (x-axis), for streams a, b and c]
  16. Effect of mutation
       [Figure: generation number (y-axis) against mutation rate in % (x-axis), for streams a, b and c]
  17. Effect of population size
       [Figure: generation number (y-axis) against population size (x-axis), for streams a, b and c]
  18. Effect of number of generations
       [Figure: cost function (y-axis) against generation number (x-axis), for streams a, b and c]
  19. Time results
                      Viterbi                Evo. Alg
       State n.    Ex. time    Cost       Ex. time    Cost      (Av. Cost, Std. dev.)
       15          2319.36     277402     1678.61     277712    (279385.6, 980.11)
       20          3117.28     277306     2182.12     277528    (278980.4, 1114.91)
       25          3835.37     277260     2033.81     277270    (279472.6, 1116.03)
       [Figure: Time comparison. Time in s. (y-axis) against number of states (x-axis), for the evolutionary algorithm and Viterbi]
  20. Predicting the state of new arrivals
       • Main point of this work: to predict whether buzz is going up or down.
       • Several possible approaches: using the Viterbi algorithm over the whole sequence, and reusing evolutionary algorithms.
       • Easy approach for a single state: assume the current trend continues.
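The "easy approach" can be read as a simple extrapolation over fitted states. The sketch below is one possible reading, not the authors' method: it repeats the last change of state, clamped to the valid range, and labels the move from the old state to the new one as up, down or unchanged. Both function names are hypothetical.

```python
def trend_symbol(old_state, new_state):
    """Label the change from the old state to the new one."""
    if new_state > old_state:
        return "↑"
    if new_state < old_state:
        return "↓"
    return "→"


def predict_next_state(states, num_states):
    """Assume the current trend continues: repeat the last change of state."""
    if not states:
        raise ValueError("at least one fitted state is required")
    if len(states) == 1:
        return states[0]
    step = states[-1] - states[-2]
    return max(0, min(num_states - 1, states[-1] + step))


# Hypothetical fitted states for the most recent intervals.
recent_states = [3, 3, 4]
predicted = predict_next_state(recent_states, num_states=25)
label = trend_symbol(recent_states[-1], predicted)
```

A prediction like this needs no refitting, but it cannot anticipate a reversal, so it is only useful a few arrivals ahead.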
  21. Local approximation: results
       Previous substream         A. T.   Old s.   New s.   Trend
       ··· 38 38 39 41 49 49      52      12       0        ↓
       ··· 41 49 49 52 68 69      69      3        4        ↑
       ··· 88 89 90 90 91 92      95      0        0        →
  22. But it breaks down after a while
       date              GA              approx.
       0 (2004-04-02)    7 (0.694669)
       ···               ···
       74 (2004-06-15)   14 (0.797281)
       75 (2004-06-16)   24 (0.970706)
       76 (2004-06-17)   19 (0.87973)
       77 (2004-06-18)   19 (0.87973)    19 (0.87973)
       78 (2004-06-19)   0 (0.605263)    19 (0.87973)
       79 (2004-06-20)   0 (0.605263)    19 (0.87973)
  23. Fast GA for modelling new arrivals
       • Using the results of the previous fitting.
       • The chromosome is extended, and the mutation probability of the last gene is higher.
       [Figure: frequency (y-axis) against time (x-axis), comparing the GA fit with the approx. fit]
  24. Fast GA: Results
       Subst. len.   New Subs. len.   T. w/out seed     T. w/ seed
       219900        100              141.45 (79.09)    3895.28
       219000        1000             144.75 (81.96)
       210000        10000            166.73 (79.32)

       Subst. len.   New Subs. len.   T. w/out seed     T. w/ seed
       3032          100              54.6
       2632          500              92.247            5048.49
       2132          1000             294.97
       1132          2000             570.41
  28. Conclusions
       • The presented system dynamically detects changes in the trends of interest in a document stream.
       • An EA makes it possible to deal with very large sequences of documents in a reasonable time.
       • Extending this EA allows fitting a stream that extends a previously fitted substream in a very short time.
       • We plan to study correlations among document streams, to automatically detect the occurrence of new topics composed of multi-word concepts.
  30. The end
       • Thanks for your attention.
       • Any questions?
