Taxonomy-driven lumping for sequence mining

    Taxonomy-driven lumping for sequence mining - Presentation Transcript

    1. Taxonomy-driven lumping for sequence mining Francesco Bonchi, Carlos Castillo, Debora Donato, and Aristides Gionis Yahoo! Research, Barcelona
    2. a day in barcelona wakeup la-padrera desingual siete-puertas goto-sleep wakeup parc-guell corte-ingles cal-pep fianna goto-sleep wakeup parc-guell sagrada-familia zara camber txakoli euskal-etxea goto-sleep wakeup la-padrera casa-batllo corte-ingles camber cal-pep txakoli fianna razzmatazz bella-istanbul goto-sleep wakeup casa-batllo parc-guell sagrada-familia quatre-gats goto-sleep wakeup casa-batllo rambla desingual siete-puertas goto-sleep wakeup parc-guell la-padrera zara corte-ingles diesel euskal-etxea no-se goto-sleep ...
    3. a day in barcelona assume that we are given a hierarchy on the basic events: something → {tourism, eating}; tourism → {sightseeing, shopping}; eating → {restaurant, bar}
    4. a day in barcelona research question: what is a good model for a tourist’s day in barcelona?
    5. a day in barcelona ^ something $
    6. a day in barcelona ^ tourism eating $
    7. a day in barcelona ^ sightseeing shopping eating $
    8. a day in barcelona ^ sightseeing shopping restaurant bar $
    9. a day in barcelona observation: deciding the model is equivalent to deciding where to cut the hierarchy
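The observation on slide 9 can be made concrete by enumerating every cut of the example hierarchy. This is a minimal sketch: the parent/child structure is an assumption reconstructed from the models on slides 5 to 8, and the names `HIERARCHY` and `cuts` are hypothetical.

```python
from itertools import product

# Assumed toy hierarchy (reconstructed from the example models on the slides).
HIERARCHY = {
    "something": ["tourism", "eating"],
    "tourism": ["sightseeing", "shopping"],
    "eating": ["restaurant", "bar"],
}

def cuts(node):
    """Return every cut of the subtree rooted at `node`.

    A cut is a list of nodes such that each leaf below `node` has exactly
    one ancestor-or-self in the list.
    """
    result = [[node]]  # option 1: cut at the node itself
    children = HIERARCHY.get(node, [])
    if children:
        # option 2: combine one cut from each child subtree
        for combo in product(*(cuts(c) for c in children)):
            result.append([n for part in combo for n in part])
    return result

all_cuts = cuts("something")
for cut in all_cuts:
    print("^", " ".join(cut), "$")
```

Each printed line is one candidate model in the `^ ... $` notation of the slides. The number of cuts grows quickly with the size of the hierarchy, which is why exhaustive enumeration is only practical for toy examples like this one.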
    10. a day in barcelona [figure: hierarchy something, tourism, eating, sightseeing, shopping, restaurant, bar, with the cut highlighted] model: ^ something $
    11. a day in barcelona [figure: hierarchy with the cut highlighted] model: ^ tourism eating $
    12. a day in barcelona [figure: hierarchy with the cut highlighted] model: ^ sightseeing shopping eating $
    13. a day in barcelona [figure: hierarchy with the cut highlighted] model: ^ sightseeing shopping restaurant bar $
    14. the problem we are solving input: sequences (wakeup la-padrera desingual siete-puertas goto-sleep wakeup parc-guell corte-ingles cal-pep goto-sleep ...) + hierarchy (something, tourism, eating, sightseeing, shopping, restaurant, bar) output: a model given by a cut of the hierarchy, here ^ sightseeing shopping eating $
    15. motivating scenarios: access logs at a corporate website; access logs of interaction with a software system; query-log mining; trajectory data. in all the above cases a hierarchy is given
    16. contribution tons of related work: Markov models, hidden Markov models, state aggregation of Markov models, sequence clustering, and more; some references in the paper, by no means complete. however, to our knowledge we are the first to address this particular problem, argue that it is a useful problem, and propose efficient algorithms
    17. a bit more formally... an alphabet of symbols $\Sigma = \{\alpha_1, \ldots, \alpha_m\}$; a hierarchy $T$ over $\Sigma$; a set of input sequences $D = \{\sigma_1, \ldots, \sigma_r\}$ over $\Sigma$; a set of states $X = \{x_1, \ldots, x_s\}$ (Markov model); transition probabilities $p_{x,y}$; a set of symbols $A(x) \subseteq \Sigma$ associated with each state $x$, i.e., the collection $\{A(x)\}$ partitions $\Sigma$; emission probabilities $q_{x,\alpha}$; overall the model is represented by $M = (X, A, p, q)$
    18. is it a hidden Markov model? no: there are no hidden states; for each observed symbol there is a unique emission state; this avoids the computational challenges of HMMs, e.g., Baum-Welch; but it is not as fine-grained as a Markov model; it has the ability to generalize from input data, e.g., {sightseeing, shopping} → tourism
    19. problem statement given a set of input sequences $D$ and a hierarchy $T$, find the model $M$ that minimizes the likelihood score function $S_L(D \mid M)$; $S_L$ is computed using both transition and emission probabilities
    20. facts about the problem given $A(x)$ (a partition of $\Sigma$) we can compute the $M = (X, A, p, q)$ that optimizes $S_L(D \mid M)$, i.e., compute transition and emission probabilities by counting transitions and occurrences in the data (maximum-likelihood estimation); let it be $S_L^*(D \mid M)$, which depends only on the partition of the states; so, the complexity of our problem comes from computing the optimal partition
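The counting step on slide 20 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the score is taken to be the negative log-likelihood of the data, the function name `fit_and_score` is hypothetical, and the assignment of places to categories in the toy data is assumed.

```python
import math
from collections import Counter

def fit_and_score(sequences, state_of):
    """Estimate p and q by counting, then return (p, q, score).

    `state_of` maps each symbol to its state (the partition A).
    The score is assumed to be the negative log-likelihood of the data.
    """
    trans, out, emit, occ = Counter(), Counter(), Counter(), Counter()
    for seq in sequences:
        states = [state_of[a] for a in seq]
        for x, a in zip(states, seq):
            emit[(x, a)] += 1      # state x emitted symbol a
            occ[x] += 1            # state x was visited
        for x, y in zip(states, states[1:]):
            trans[(x, y)] += 1     # transition x -> y
            out[x] += 1            # state x was left
    p = {(x, y): c / out[x] for (x, y), c in trans.items()}
    q = {(x, a): c / occ[x] for (x, a), c in emit.items()}
    score = -sum(c * math.log(p[k]) for k, c in trans.items())
    score -= sum(c * math.log(q[k]) for k, c in emit.items())
    return p, q, score

# Toy data; the symbol-to-category assignments are assumed for illustration.
sequences = [
    ["wakeup", "parc-guell", "zara", "cal-pep", "goto-sleep"],
    ["wakeup", "casa-batllo", "zara", "fianna", "goto-sleep"],
    ["wakeup", "parc-guell", "cal-pep", "fianna", "goto-sleep"],
]
fine = {
    "wakeup": "wakeup", "goto-sleep": "goto-sleep",
    "parc-guell": "sightseeing", "casa-batllo": "sightseeing",
    "zara": "shopping", "cal-pep": "restaurant", "fianna": "bar",
}
lump = {"sightseeing": "tourism", "shopping": "tourism",
        "restaurant": "eating", "bar": "eating"}
coarse = {a: lump.get(x, x) for a, x in fine.items()}

p_fine, q_fine, score_fine = fit_and_score(sequences, fine)
p_coarse, q_coarse, score_coarse = fit_and_score(sequences, coarse)
print(score_fine, score_coarse)
```

Because both probability tables come directly from counts, the score depends only on the partition passed in as `state_of`. On this toy data the finer partition scores no worse than the coarser one.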
    21. facts about the problem Theorem. Consider $M_1$ and $M_2$ such that the states of $M_1$ form a sub-partition of the states of $M_2$. Then $S_L^*(D \mid M_1) \le S_L^*(D \mid M_2)$
    22. adapting the problem statement given a set of input sequences $D$ and a hierarchy $T$, find the model $M$ that minimizes the function $\mathrm{Dist}^2(M) = \left(\frac{s - s_{\min}}{s_{\max} - s_{\min}}\right)^2 + w \cdot \left(\frac{S_L^* - S_{L,\min}^*}{S_{L,\max}^* - S_{L,\min}^*}\right)^2$, where $M$ has $s$ states and achieves likelihood score $S_L^* = S_L^*(D \mid M)$
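The objective on slide 22 is a weighted sum of two normalized squared terms and is straightforward to compute. The sketch below uses hypothetical parameter names and leaves the choice of the minimum and maximum values to the caller, since the slide does not say which models they come from.

```python
def dist_sq(s, score, s_min, s_max, score_min, score_max, w=1.0):
    """Squared distance of a model from the ideal (few states, low score).

    s      -- number of states of the model
    score  -- its likelihood score S*_L(D | M)
    w      -- weight of the likelihood term
    """
    states_term = (s - s_min) / (s_max - s_min)
    score_term = (score - score_min) / (score_max - score_min)
    return states_term ** 2 + w * score_term ** 2

# Made-up numbers, for illustration only.
example = dist_sq(s=4, score=30.0, s_min=1, s_max=7,
                  score_min=20.0, score_max=40.0, w=10.0)
print(example)
```

A larger `w` penalizes loss of likelihood more heavily relative to the number of states. The labels "DistSq 1" and "DistSq 10" in the result plots presumably refer to this weight.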
    23. algorithms
    24. algorithm a generic bottom-up state-merging search algorithm with the following components: (i) an objective function g to evaluate the goodness of a model; (ii) a search policy p to compare models; (iii) a priority queue Q of candidate models to evaluate; (iv) a set E of models already evaluated
    25. algorithm (1) start with the leaf-level model and add it to the queue; (2) extract from the queue the best model M according to p and evaluate its value according to g; (3) consider all models that result from M with one merging (along the hierarchy) and add them to the queue; (4) if the maximum number of iterations has not been reached, go to step 2. it is possible to evaluate a model efficiently from its ancestor model and precomputed counters
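The four steps on slide 25 can be sketched with a binary heap as the priority queue. Everything named here (`PARENT`, `neg_log_likelihood`, `merges`, `search`) is hypothetical; the toy taxonomy treats the category names as the observed symbols; the score is assumed to be a count-based negative log-likelihood; and each model is rescored from scratch rather than incrementally from its ancestor.

```python
import heapq
import math
from collections import Counter
from itertools import count

# Assumed toy taxonomy, child -> parent; the leaves are the observed symbols.
PARENT = {
    "sightseeing": "tourism", "shopping": "tourism",
    "restaurant": "eating", "bar": "eating",
    "tourism": "something", "eating": "something",
}
CHILDREN = {}
for child, par in PARENT.items():
    CHILDREN.setdefault(par, []).append(child)

def neg_log_likelihood(sequences, cut):
    """Score a cut using count-based (maximum-likelihood) probabilities."""
    def state(a):
        while a not in cut:        # climb until a node of the cut is reached
            a = PARENT[a]
        return a
    trans, out, emit, occ = Counter(), Counter(), Counter(), Counter()
    for seq in sequences:
        xs = [state(a) for a in seq]
        for x, a in zip(xs, seq):
            emit[x, a] += 1
            occ[x] += 1
        for x, y in zip(xs, xs[1:]):
            trans[x, y] += 1
            out[x] += 1
    score = -sum(c * math.log(c / out[x]) for (x, _), c in trans.items())
    return score - sum(c * math.log(c / occ[x]) for (x, _), c in emit.items())

def merges(cut):
    """Yield each cut reachable from `cut` by one merge along the hierarchy."""
    for par, kids in CHILDREN.items():
        if all(k in cut for k in kids):
            yield (cut - frozenset(kids)) | {par}

def search(sequences, g, priority, max_iter=100):
    """Bottom-up search; returns {cut: g-value} for every evaluated cut."""
    leaf_cut = frozenset(PARENT) - frozenset(CHILDREN)
    tie = count()                  # tie-breaker so the heap never compares cuts
    def entry(cut):
        v, s = neg_log_likelihood(sequences, cut), len(cut)
        return (priority(v, s), next(tie), cut, v, s)
    queue = [entry(leaf_cut)]      # step 1: start from the leaf-level model
    seen = {leaf_cut}
    evaluated = {}                 # the set E of models already evaluated
    for _ in range(max_iter):      # step 4: bounded number of iterations
        if not queue:
            break
        _, _, cut, v, s = heapq.heappop(queue)   # step 2: best model by policy
        evaluated[cut] = g(v, s)
        for nxt in merges(cut):    # step 3: all one-merge successors
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(queue, entry(nxt))
    return evaluated

seqs = [
    ["sightseeing", "shopping", "restaurant", "bar"],
    ["sightseeing", "restaurant", "bar"],
    ["shopping", "sightseeing", "bar"],
]
# Priority: fewer states first; g is a stand-in objective for the demo.
evaluated = search(seqs, g=lambda v, s: v + s, priority=lambda v, s: (s, v))
best = min(evaluated, key=evaluated.get)
print(sorted(best), evaluated[best])
```

The smallest key is extracted first, so a policy is expressed as a function returning a sortable key. On this toy taxonomy the search visits every cut before the queue empties; on a real hierarchy the iteration limit is what bounds the work.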
    26. policies to compare models (i) ProbabilityFirst: $p(v_i, s_i) > p(v_j, s_j)$ if and only if $v_i < v_j$ or ($v_i = v_j$ and $s_i < s_j$); (ii) StatesFirst: $p(v_i, s_i) > p(v_j, s_j)$ if and only if $s_i < s_j$ or ($s_i = s_j$ and $v_i < v_j$); (iii) DistSqFirst: $p(v_i, s_i) > p(v_j, s_j)$ if and only if $g(v_i, s_i) < g(v_j, s_j)$.
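Each of the three policies amounts to a sort key under which the smallest key has the highest priority. The sketch below assumes that `v` is the likelihood score of a model and `s` its number of states, which the slide does not state explicitly; the function names and the toy objective `g` are hypothetical.

```python
def probability_first(v, s):
    """Lower score wins; ties are broken by fewer states."""
    return (v, s)

def states_first(v, s):
    """Fewer states wins; ties are broken by lower score."""
    return (s, v)

def dist_sq_first(g):
    """Build a key from an objective g(v, s); the lower value wins."""
    return lambda v, s: g(v, s)

# Three made-up candidate models as (v, s) pairs.
candidates = [(12.0, 5), (12.0, 3), (15.0, 2)]
g = lambda v, s: (s / 5) ** 2 + (v / 15) ** 2   # toy objective

by_probability = sorted(candidates, key=lambda m: probability_first(*m))
by_states = sorted(candidates, key=lambda m: states_first(*m))
by_dist = sorted(candidates, key=lambda m: dist_sq_first(g)(*m))
print(by_probability[0], by_states[0], by_dist[0])
```

Returning tuples lets Python's lexicographic comparison implement the "or (equal and ...)" clauses directly, and the same keys can be pushed onto a heap-based priority queue.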
    27. clustering EM-type algorithm to partition the input sequences into k clusters: start with a random partition; evaluate a model M for each cluster of the partition; reassign each sequence to the best model; repeat
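The loop on slide 27 can be sketched as a hard-assignment, EM-type procedure. To keep the example short, the per-cluster model here is a plain first-order Markov model over symbols with add-one smoothing, swapped in for the taxonomy-driven model inference; the smoothing, the convergence test, and all function names are assumptions.

```python
import math
import random
from collections import Counter

def fit(seqs, symbols):
    """Count-based Markov model with add-one smoothing (an assumption)."""
    trans, out = Counter(), Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            trans[a, b] += 1
            out[a] += 1
    k = len(symbols)
    return lambda a, b: (trans[a, b] + 1) / (out[a] + k)

def neg_log_lik(seq, prob):
    """Negative log-likelihood of the transitions of one sequence."""
    return -sum(math.log(prob(a, b)) for a, b in zip(seq, seq[1:]))

def cluster(seqs, k, iters=20, seed=0):
    """Partition `seqs` into k clusters; returns one label per sequence."""
    rng = random.Random(seed)
    symbols = sorted({a for seq in seqs for a in seq})
    labels = [rng.randrange(k) for _ in seqs]            # random partition
    for _ in range(iters):
        # fit one model per cluster
        models = [fit([s for s, l in zip(seqs, labels) if l == c], symbols)
                  for c in range(k)]
        # reassign each sequence to the model under which it is most likely
        new = [min(range(k), key=lambda c: neg_log_lik(s, models[c]))
               for s in seqs]
        if new == labels:                                # no change: stop
            break
        labels = new
    return labels

seqs = [
    ["sightseeing", "shopping", "restaurant"],
    ["sightseeing", "shopping", "bar"],
    ["bar", "restaurant", "shopping"],
    ["bar", "restaurant", "sightseeing"],
]
labels = cluster(seqs, k=2)
print(labels)
```

Like other hard-assignment schemes, the result depends on the random initial partition, so in practice one would likely run it with several seeds and keep the partition with the best total score.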
    28. experiments
    29. taxomo (taxonomy-driven markovian modeler) software written in Java; provides the following features: model inference given a taxonomy and a set of sequences; EM-based clustering; computation of the log-likelihood of a given set of sequences from a given model; generation of a set of sequences given a model
    30. dataset I: synthetic; binary hierarchy with 64 leaves; generative model given by a random cut; generate 10 000 random sequences
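A random cut of a binary hierarchy with 64 leaves could be drawn as in the sketch below. The slide does not say how the random cut was chosen, so the stopping rule (`p_stop`) is purely an assumption, as are the heap-style node numbering and the function name.

```python
import random

DEPTH = 6  # a complete binary tree of depth 6 has 2**6 = 64 leaves

def random_cut(node=1, depth=0, rng=random, p_stop=0.3):
    """Return a random cut of a complete binary tree in heap numbering.

    Node n has children 2n and 2n + 1; the root is 1, the leaves are 64..127.
    At each internal node the recursion stops with probability p_stop
    (an assumed rule), making that node a state of the generative model.
    """
    if depth == DEPTH or rng.random() < p_stop:
        return [node]
    return (random_cut(2 * node, depth + 1, rng, p_stop)
            + random_cut(2 * node + 1, depth + 1, rng, p_stop))

rng = random.Random(0)
cut = random_cut(rng=rng)
print(len(cut), "states in the generative model")
```

Every leaf has exactly one ancestor (or itself) in the returned list, so the cut defines a valid partition of the 64 symbols into states.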
    31. dataset II: query log; sample of 85 000 queries from Yahoo!; queries are assigned by a classifier to the leaves of a hierarchy, e.g., “ferrari photos” and “motorcycles” are assigned to “recreation/automotive/autos” and “recreation/automotive/motorcycle”, respectively; 100 leaves and max depth 6
    32. dataset III: trajectories; Brinkhoff’s network-based synthetic generator of moving objects; 500 000 spatio-temporal points for 80 000 trajectories; the hierarchy is a quad-tree formed by a 32 × 32 grid (e.g., the whole city is the root and the area is recursively divided into quadrants)
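A quad-tree over a 32 × 32 grid has five levels below the root, since 2^5 = 32. The sketch below maps a grid cell to its path of quadrants from the root; the quadrant numbering and the function name are assumptions made for illustration.

```python
def quadtree_path(col, row, depth=5):
    """Return the quadrant labels from the root down to a grid cell.

    With depth=5 the leaves form a 32 x 32 grid. At each level the label is
    2 * (row bit) + (column bit), an assumed numbering of the four quadrants.
    """
    path = []
    for level in range(depth - 1, -1, -1):
        col_bit = (col >> level) & 1   # left (0) or right (1) half
        row_bit = (row >> level) & 1   # top (0) or bottom (1) half
        path.append(2 * row_bit + col_bit)
    return path

a = quadtree_path(0, 0)
b = quadtree_path(16, 0)
c = quadtree_path(31, 31)
print(a, b, c)
```

The first `d` labels of a path identify the ancestor of the cell at depth `d`, so lumping cells into a coarser state corresponds to truncating their paths to a common prefix.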
    33. results: synthetic dataset [figure: log-likelihood ratio vs. number of iterations (0 to 10 000) on the synthetic data set, for the policies States, DistSq 1, and DistSq 10] approximation error vs number of iterations
    34. results: different policies [figure, two panels: Trajectories dataset and Query-log dataset; minus log likelihood vs. number of states, for the policies States, Probability, DistSq 1, and DistSq 10] search space profile: probability of the best model found by the search algorithms, at each number of states, within 10,000 iterations
    35. results: convergence [figure, two panels: Trajectories dataset and Query-log dataset; minus log likelihood vs. number of states, for DistSq 10 after 100, 1K, and 10K iterations] profile of the search space explored by the DistSqFirst-10 policy for different numbers of iterations
    36. trajectories
    37. trajectories — clustering
    38. future directions: incorporate time; no hierarchy; more applications
    39. thank you!

    Carlos Castillo, 2 months ago

    Presented by Aris Gionis at ECML/PKDD 2009 in Bled

    CC Attribution License
