
# SAX-VSM

Published in: Education, Technology

### SAX-VSM

1. 1. SAX-VSM: interpretable time series classification with SAX, TF*IDF, and Vector Space Model. Pavel Senin, senin@hawaii.edu, University of Hawaii at Manoa, Department of Information and Computer Sciences, Collaborative Software Development Laboratory, http://csdl.ics.hawaii.edu
2. 2. Temporal data • Probably most of the collected data is temporal • Smarter technology (monitoring, on-line adjustment): – smarter house, power grid, water supply – smarter traffic – smarter cooking of your favorite food • Health, personal and global: – blood pressure, heartbeat, sugar level, weight – epidemiology • Safety and sustainability: – fraud detection, unusual activity mining – civil infrastructure: bridges, buildings, roads – weather, seismography, astronomy – smarter agriculture • Economy: money, stocks, markets, shopping • Social networks, media, entertainment http://www.imdb.com/title/tt0192618/
3. 3. Problem definition • Given sequences of points, $x^1_1, x^1_2, x^1_3, \ldots, x^1_{k_1}$ through $x^m_1, x^m_2, x^m_3, \ldots, x^m_{k_m}$, or given a live stream of points • Find: – patterns, outliers (motifs, discords) • Perform: – classification, clustering, forecasting • Gain domain-specific knowledge, infer a generative process • Real-life data: – not equidistant – compressed/stretched – congested – noisy – with lost points
4. 4. Similarity? Yes, you know it when you see it! But one needs to teach machines to see that difference too, and that turns out to be quite a difficult task. All solutions are based on similarity in Time, Shape, or Change. How many metrics are out there? Pseudoquasimetrics, anyone? http://blog.sfgate.com/pets/2009/03/18/pet-look-alike-photo-contest/, 02-13-2013
5. 5. State of the art • “…Euclidean distance or Dynamic Time Warping (DTW) distance does not significantly outperform random guessing. The reason for the poor performance of these otherwise very competitive classifiers seems to be due to the fact that the data is somewhat noisy (i.e. insect bites, and different stem lengths)…” “Time Series Shapelets: A New Primitive for Data Mining”, L. Ye, E. Keogh. • “…However, it is clear that one-nearest-neighbor with Dynamic Time Warping (DTW) distance is exceptionally difficult to beat. This approach has one weakness, however; it is computationally too demanding for many realtime applications…” “Fast Time Series Classification Using Numerosity Reduction”, Xi, Keogh, Shelton, Wei. • By far, the most ubiquitous distance measure for time series is the Euclidean distance. The 1-NN Euclidean classifier is fast and accurate. Everyone benchmarks against it because it is really hard to beat. • “…our basic message: transforming the data is the simplest way of achieving improvement in problems where the discriminating features are based on similarity in change and similarity in shape...” “Transformation Based Ensembles for Time Series Classification”, Bagnall, A., Davis, L., Hills, J., Lines, J.
6. 6. Features: what are they? • Can we ignore the time? • Can we step aside from mean, variance, kurtosis, and skewness? • Can we transform the temporal data into some feature space? • Can we then actively learn from these features? • SAX-VSM does it all. It doesn’t ignore time, though: ordering is “loosely” kept. (It keeps ordering within a sliding window, but not globally.) “Experiencing SAX: a Novel Symbolic Representation of Time Series”, J. Lin, E. Keogh, L. Wei, S. Lonardi
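The sliding-window idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the jMotif implementation: the function names (`znorm`, `paa`, `breakpoints`, `sax_word`, `sax_bag`) are made up for the example, the window length is assumed to be divisible by the number of PAA segments, and refinements such as numerosity reduction are left out. Each window is z-normalized, averaged into segments, and mapped to letters using cut points that split the standard normal distribution into equiprobable regions; the resulting words are counted into a bag, which is why ordering survives inside a window but not globally.

```python
from statistics import NormalDist, mean, pstdev


def znorm(values, eps=1e-8):
    """Z-normalize a window; (near-)constant windows become all zeros."""
    mu, sigma = mean(values), pstdev(values)
    if sigma < eps:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def paa(values, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-sized chunk.

    Assumption for brevity: len(values) is divisible by `segments`.
    """
    step = len(values) // segments
    return [mean(values[i * step:(i + 1) * step]) for i in range(segments)]


def breakpoints(alphabet_size):
    """Cut points that split N(0, 1) into `alphabet_size` equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def sax_word(window, segments, alphabet_size):
    """Turn one window of raw values into a SAX word such as 'abc'."""
    cuts = breakpoints(alphabet_size)
    letters = []
    for v in paa(znorm(window), segments):
        idx = sum(1 for c in cuts if v > c)  # how many cut points lie below v
        letters.append(chr(ord("a") + idx))
    return "".join(letters)


def sax_bag(series, window, segments, alphabet_size):
    """Slide a window over the series and count the SAX words it produces."""
    bag = {}
    for start in range(len(series) - window + 1):
        word = sax_word(series[start:start + window], segments, alphabet_size)
        bag[word] = bag.get(word, 0) + 1
    return bag
```

For example, `sax_word([1, 2, 3, 4, 5, 6], 3, 3)` yields `"abc"`, and a steadily rising series yields that same word in every window, so its bag holds a single entry.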
7. 7. Features versus Distance • Dataset: electrical devices – kettle, immersion heater, fridge/freezer, oven/cooker, computer/television, and dishwasher – measured every 15 minutes, which means that ~15 minutes of information are lost – “Classification of Household Devices by Electricity Usage Profiles”, Lines et al., http://www.uea.ac.uk/cmp • Jmotif take on the ED dataset: https://code.google.com/p/jmotif/wiki/ElectricalDevices • Error: Euclidean 1NN 46%; DTW 1NN 33%; Shapelet Tree 45%; Shapelet SVM 75%; SAX-VSM 32%
8. 8. Implementation and reproducibility • Sometimes there is a large difference in precision. • This aligns with the no free lunch theorems – http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization • My code is online here: https://github.com/jMotif/sax-vsm_classic; the old location, with more wiki pages and docs, is http://code.google.com/p/jmotif/ • Feel free to use it for your needs. Please contribute your changes. It is GPL. • Everything is reproducible. Most of the data is available at the UCR homepage; other datasets are online too. • Due to active development, things might change a bit.
9. 9. Where I am coming from: yet another application problem, behaviors • I work on software processes for my thesis, specifically on recurrent behaviors – Given a live stream (or a Git log) of telemetry from hundreds of software project developers (the Linux kernel as an example), find: • What process do they perform, what is their goal? What are their habits? • Outliers? Clusters? I.e., roles and groups? – Given a dozen software project trails, are they similar or different in their software process? – What about the people who generate these artifacts? – Here I mean NO periodicity, TONS of lost values, PLUS congestion, compression, you name it: the data is corrupted. • How I arrived at this method – I realized that the behaviors of every single individual must be counted in, since they are all knitting the software together. So, when I looked at all the trails in SAX space, the choice of TF*IDF was obvious. • TF*IDF takes away similarities and highlights the behaviors which “stand out of the crowd”. Moreover, it weights their importance by counting their re-occurrence. So you will see that someone is changing little things here or there. • Vector Space Model, in turn, takes care of carefully counting these “selected behaviors” in unknown temporal containers, pointing to the class they should be assigned to, with a score!
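The claim that TF*IDF “takes away similarities” can be made concrete with a small Python sketch. It assumes one bag of SAX words per class and uses one common weighting variant, logarithmic term frequency `log(1 + count)` times inverse document frequency `log10(N / df)`; the function name `tfidf` and the toy bags are hypothetical, and the actual jMotif code may differ in details. A word that occurs in every class gets weight zero, while class-specific words are weighted by how often they re-occur.

```python
import math


def tfidf(class_bags):
    """Weight the words of each class bag.

    class_bags: {class_label: {word: raw_count}}
    returns:    {class_label: {word: tf*idf weight}}
    """
    n_classes = len(class_bags)

    # Document frequency: in how many class bags does each word occur?
    df = {}
    for bag in class_bags.values():
        for word in bag:
            df[word] = df.get(word, 0) + 1

    weights = {}
    for label, bag in class_bags.items():
        weights[label] = {
            # log-scaled term frequency times inverse document frequency
            word: math.log(1 + count) * math.log10(n_classes / df[word])
            for word, count in bag.items()
        }
    return weights


# Toy example: "abc" occurs in both classes, so its weight vanishes.
bags = {"A": {"abc": 5, "bca": 2}, "B": {"abc": 3, "cab": 4}}
weights = tfidf(bags)
```

In the toy example only `bca` (class A) and `cab` (class B) keep a non-zero weight, which is the “stand out of the crowd” effect described above.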
10. 10. Why behaviors? Yet another reason. “…The current standard method for validating a user’s identity for authentication on an information system requires humans to do something that is inherently unnatural: create, remember, and manage long, complex passwords. Moreover, as long as the session remains active, typical systems incorporate no mechanisms to verify that the user originally authenticated is the user still in control of the keyboard…” http://www.darpa.mil/Our_Work/I2O/Programs/Active_Authentication.aspx, 02-13-2013. And many other things can be made “smart”.
11. 11. SAX-VSM classification at large: features • All this has been well known since 1972; I wasn’t born yet. Thank you, Gerard Salton! • All this has been known since 2002; I wasn’t in grad school yet. Thank you, Jessica and Eamonn!
12. 12. The only formulas: cosine similarity and TF*IDF. All other stuff is counting, hashing, threading, and other Java magic. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html#sec:querydocweighting http://nlp.stanford.edu/IR-book/newslides.html
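As a rough illustration of the cosine similarity step, the sketch below compares the bag of words of an unlabeled series against per-class weight vectors and returns the best-matching class with its score. Cosine similarity is the dot product of two vectors divided by the product of their lengths. The names `cosine` and `classify`, and the sparse-dictionary representation, are assumptions made for this example rather than the jMotif API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {word: weight}."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty or all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def classify(unknown_bag, class_weights):
    """Return (label, score) of the class vector closest to the unknown bag.

    unknown_bag:   {word: raw_count} for the series to classify
    class_weights: {class_label: {word: weight}}, e.g. TF*IDF weights
    """
    scores = {
        label: cosine(unknown_bag, weights)
        for label, weights in class_weights.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

A call such as `classify({"abc": 3, "cab": 1}, {"A": {"bca": 1.0}, "B": {"cab": 2.0}})` picks class `"B"`, because `cab` is the only word the unknown bag shares with any class vector.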
13. 13. IT IS NOT COMPLETELY NEW. Later I found that the idea had been exercised quite a few times before, with some success... But I would argue that I came to it and built it all on my own, and made it work just fine. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2390719/
14. 14. This is that dataset: Synthetic Control
15. 15. UCR Synthetic Control: JMotif error is 0.7%, on par with 1-NN DTW. Error rate surface: one parameter is fixed, the two others vary. Were they somewhere here?
16. 16. Clustering: UCR Synthetic Control, No labels = No problem k-Means works too (if you know the K)
17. 17. Another toy example: just three classes, CBF
18. 18. CBF, Classification accuracy is 99.9% http://jmotif.googlecode.com/svn/trunk/RCode/cbf/par_surface.gif
19. 19. CBF: accuracy, runtime, features. Classifying 30K series (10K of each class). Plot notes: Slow TFIDF stat but slow growing! Euclidean 1NN. Fast classification. Rate of absorbing (learning) new features: each class is ~1/3 of the whole, as expected…
20. 20. CBF, SAX-VSM features importance distribution
21. 21. Another well-studied example: Gun/Point dataset Slide created by Eamonn Keogh, eamonn@cs.ucr.edu
22. 22. Looking for the best discriminating subseries, Shapelet-style. Slide by Lexiang Ye and Eamonn Keogh, www.cs.ucr.edu/~lexiangy/shapelet.html Euclidean 91.3%, DTW 90.7%, Shapelet 93.3%, SAX-VSM 99.44%
23. 23. Looking for the best discriminating subseries, SAX-VSM-style
24. 24. Gun/Point, SAX-VSM features importance distribution
25. 25. Leaves classification http://oregonstate.edu/dept/ldplants/Plant%20ID-Leaves.htm • Leaf attachment: The pattern by which leaves are attached to a stem or twig is also a useful characteristic in plant identification. There are two large groups, alternate and opposite patterns, and a third less common pattern, whorled. • Leaf lobes: Leaves may be lobed or not lobed. A lobe may be defined as a curved or rounded projection. With leaves there is no clear distinction between shallow lobes and deep teeth. A main vein is often visible in a lobe; this may not occur in teeth. • Leaf margin: Another important leaf characteristic for plant identification is the edge or margin of a leaf or leaflet. Leaves have either smooth edges, called entire, or small notches or “teeth” along the margin.
26. 26. Leaves classification with SAX-VSM: Euclidean 51.7%, DTW 59.1%, Shapelet ?, SAX-VSM 92.2%. Moreover, SAX-VSM highlights the same places as human experts do.
27. 27. Acer circinatum, Acer glabrum, Acer macrophyllum, Acer negundo, Quercus kelloggii, Quercus garryana
28. 28. Coffee spectrograms classification • Arabica: caffeine 1.2%, chlorogenic acid 5.5-8.0% • Robusta: caffeine 2.2%, chlorogenic acid 7.0-10.0% • Euclidean 75.0%, DTW 82.1%, Shapelet ?, SAX-VSM 100% http://code.google.com/p/jmotif/wiki/ArabicaRobusta http://www.coffeechemistry.com/index.php/General/Agriculture/differences-between-arabica-and-robusta-coffee.html
29. 29. SAX-VSM classification accuracy study http://code.google.com/p/jmotif/wiki/Precision
30. 30. • Shapelets can differ in length, and that works better. • What if words differ in length? – I can use different SAX parameters and strategies, aiming to pick up class specificities. – But yes, the search space grows too… • It can be better – TF*IDF and VSM do not care about the length of words or their number! • Yoga dataset + Jmotif + {set of SAX params}*2 = -5% of error! Figure from “Semi-Supervised Time Series Classification”, Li Wei & Eamonn Keogh: Male or Female??? Euclidean 83%, DTW 83.6%, Shapelet ?, SAX-VSM 82% -> 87% (still running though)
31. 31. Best SAX parameters search: sampling the whole space with DIRECT. Slice of the space with Sliding Window=42. Down from 432 points to 42 = 10X speedup.
32. 32. Results • SAX-VSM is, statistically speaking, at the level of the current state of the art. • It is fast, if (“and”?) you can learn offline. • Parameters search is the show stopper for now. • But the best thing: it allows one to find unique temporal patterns or discriminating features and weight them by importance among hundreds of thousands of candidates. Open questions • Still, efficient parameters search. • Evaluation methodology. What is the proper way? • TF*IDF implementations: how to choose a good one?
33. 33. Where it goes… roadmap • Ngrams – http://books.google.com/ngrams/graph?content=Sherlock+Holmes%2CAlbert+Einstein%2CSputnik%2Ctime+series%2CANOVA%2Cfeature+extraction%2Ctfidf%2BTFIDF%2CGoogle%2CINRA%2CNgram&year_start=1900&year_end=2008&corpus=15&smoothing=4&share= • Sequitur – Extracting a grammar from the time series and using it as input to the Vector Space part. I hope it will solve the parameters search problem. • Prediction!!! – Grammars can tell what the next word is. Ngrams can help with that too. • Multidimensional series – all the dimensions into prefixed bags – Ngrams spanning across dimensions Applications? • I would be happy to try it on real data and for a real problem. • Right now I am looking at an application to DNA contigs clustering -> next slides • Discords, motifs -> next slides
34. 34. Some back-up slides
35. 35. Space Shuttle Marotta Valve Series (plot, horizontal axis from 0 to 5000). Annotations: poppet pulled significantly out of the solenoid before energizing; the de-energizing phase is normal. http://www.marotta.com/files/Brochures_WP_CS/marotta_spacebro_final_lr.pdf Plot and annotation by Eamonn Keogh, www.cs.ucr.edu/~eamonn/discords/ICDM05_discords.ppt
36. 36. Raw data. 5 seconds later… all discords found (Intel Atom CPU). Jmotif implements this following the papers by Keogh & Lin, with small differences. Log output: `instances2Discords [FINE|main|3:23:45] data size: 5000; max: 7.06; min: -3.1; mean: 1.0547679999999706; NaNs: 0` `instances2Discords [FINE|main|3:23:45] window size: 128, alphabet size: 3, SAX Trie size: 4872` `getDiscordsAlgorithm [FINE|main|3:23:45] starting motif finding routines` `getDiscordsAlgorithm [FINE|main|3:23:50] new discord: bca, position 4297, distance 20.551788243362186, elapsed time: 0m 5s 20ms` `getDiscordsAlgorithm [FINE|main|3:23:55] new discord: acc, position 4071, distance 10.507368842864514, elapsed time: 0m 5s 278ms`
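To show what “discord” means in the log above, here is a brute-force Python sketch: the discord is the subsequence whose nearest non-overlapping neighbor is farthest away. This is deliberately not the SAX-trie-based search that Jmotif runs (brute force is quadratic in the series length and would be far slower on 5000 points); the function names and the toy series are invented for illustration.

```python
import math


def znorm(values, eps=1e-8):
    """Z-normalize a subsequence; (near-)constant ones become all zeros."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    if sigma < eps:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_discord(series, window):
    """Return (position, distance) of the most unusual subsequence.

    For every subsequence, find the distance to its nearest neighbor that
    does not overlap it (a "non-self match"); the discord maximizes it.
    """
    n = len(series) - window + 1
    subs = [znorm(series[i:i + window]) for i in range(n)]
    best_pos, best_dist = -1, -1.0
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            if abs(i - j) >= window:  # skip trivial, overlapping matches
                nearest = min(nearest, euclidean(subs[i], subs[j]))
        if nearest != math.inf and nearest > best_dist:
            best_pos, best_dist = i, nearest
    return best_pos, best_dist


# Toy data: a repeating pattern with a single spike at index 21.
series = [0, 1, 0, -1] * 12
series[21] = 5
position, distance = top_discord(series, window=4)
```

On the toy series the reported position falls on a window that covers the spike, since every other window has an exact copy elsewhere in the series.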
37. 37. TEK-17, a bit more difficult, probably: 5.5 seconds… (smaller window, larger trie structure). Here it is annotated by Jmotif: Space Shuttle Marotta Valve Series (plot, horizontal axis from 0 to 5000). Annotations: poppet pulled significantly out of the solenoid before energizing; zoom into discord; other energizing fragments; it is an outlier! Plot and annotation by Eamonn Keogh, www.cs.ucr.edu/~eamonn/discords/ICDM05_discords.ppt http://code.google.com/p/jmotif/wiki/Discords
38. 38. Clustering DNA contigs/reads with TF*IDF • There is a successful way of converting DNA to time series: http://www.cs.ucr.edu/~eamonn/UCRsuite.html • There have been multiple attempts to treat DNA strings with information theory techniques. • All those complexity things… you probably know the Kullback–Leibler divergence. • kMer == Ngram • And TF*IDF has been applied too. And, in fact, it works. And it is fast. • A set of 76,326 contigs was clustered in 2 minutes (!) • But I don’t know how good that clustering is. It seems to be OK, but more work is needed.
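The “kMer == Ngram” remark can be illustrated with a short Python sketch that turns a DNA string into a bag of overlapping k-mers, the same kind of bag that TF*IDF weighting takes as input. The function name `kmer_bag` and the example sequence are made up; real contigs would also need handling of ambiguous bases and reverse complements, which is omitted here.

```python
from collections import Counter


def kmer_bag(sequence, k):
    """Count all overlapping substrings of length k in a DNA sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy contig: "ACG" occurs twice, every other 3-mer once.
bag = kmer_bag("ACGTACG", 3)
```

Each contig then becomes one “document” of k-mer “words”, so the weighting and similarity machinery used for SAX words applies unchanged.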
39. 39. The tree of these 70K+ contigs seems to be partitioned well by TFIDF
40. 40. Just a random subset… might be a problem? ClustalW, by percentage of identity
41. 41. ClustalW, by BLOSUM62. In fact… Hard to tell…