Text and text stream mining tutorial
  • Applet at http://www.math.le.ac.uk/people/ag153/homepage/KmeansKmedoids/Kmeans_Kmedoids.html
  • Vegas77 Entertainment SE: spam normally sent on weekends, lines drawn at Fridays (exceptions 28.3. and 28.4.); price on Monday higher in many cases
  • http://www.bbc.co.uk/news/world-us-canada-15051554
  • Taken from http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-2
  • Transcript

    • 1. Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928), http://project-first.eu. Text Mining and Text Stream Mining Tutorial. Miha Grčar, miha.grcar@ijs.si, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, http://kt.ijs.si
    • 2. Text and text stream mining tutorial
        • Part I: Text mining
        • Part II: Text stream mining
        Lucca, Oct 2012. Miha Grčar: Text and text stream mining
    • 3. Part I: Text mining
    • 4. What is text mining?
        • Text mining provides a set of methodologies and tools for discovering, presenting, and evaluating knowledge from large collections of textual documents
        • Text mining adopts and adapts methodologies and tools from …
            – Data mining (DM)
            – Machine learning (ML)
            – Information retrieval (IR)
            – Natural language processing (NLP)
            – Visualization
            – Social network analysis and graph mining
            – Knowledge management
            – …
    • 5. Typical text mining process
        [Figure: pipeline Data acquisition → Text preprocessing → Modeling → Application, with Evaluation / validation and a feedback loop.]
        • Data acquisition and text preprocessing: acquisition, cleaning, transformation
        • Modeling: discover, extract, organize knowledge
        • Application: presentation, interaction
        • Evaluation / validation: performance and utility assessment, feedback loop
    • 6. What do we cover in Part I?
        [Figure: the same pipeline, annotated with the topics of Part I.]
        • Text preprocessing: vector space model (bags-of-words)
        • Modeling: machine learning, classification, clustering
        • Evaluation / validation: cross validation, precision, recall, …
        • Application: search & browse, categorization, recommendation, advertising, spam detection, summarization, visualization, …
    • 7. Bags of words
        • Tokenize
        • Remove stop words
        Example: “The quick brown dog jumps over the lazy dog.” Tokens: the | quick | brown | dog | jumps | over | the | lazy | dog
    • 8. Bags of words
        • Tokenize
        • Remove stop words
        • Lemmatize
        • Compute weights
        Example (same sentence): jumps → jump; resulting weights: quick 1, brown 1, dog 2, jump 1, lazy 1
    • 9. Bags of words: Tokenization & stop word removal
        Original text: After ripping 14% higher from June until the first week of October, stocks ran headfirst into a wall of worry seemingly too large to climb. Europe, China, the fiscal cliff, etc aren't new concerns but that doesn't mean they aren't real. Investors suddenly care and are behaving accordingly, selling some of their more aggressive names and rotating into defensives.
        Simple tokenizer (alphanumeric strings only): After | ripping | 14 | higher | from | June | until | the | first | week | of | October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren | t | new | concerns | but | that | doesn | t | mean | they | aren | t | real | Investors | suddenly | care | and | are | behaving | accordingly | selling | some | of | their | more | aggressive | names | and | rotating | into | defensives
    • 10. Bags of words: Tokenization & stop word removal
        Original text: the same passage as on slide 9.
        Regex tokenizer ([\p{L}']+): After | ripping | higher | from | June | until | the | first | week | of | October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren't | new | concerns | but | that | doesn't | mean | they | aren't | real | Investors | suddenly | care | and | are | behaving | accordingly | selling | some | of | their | more | aggressive | names | and | rotating | into | defensives
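The two tokenizers on slides 9 and 10 can be approximated in a few lines. This is a minimal sketch, not the tutorial's implementation: Python's built-in `re` module has no `\p{L}` class, so `[^\W\d_]` stands in for "any letter" as an approximation, and the function names and the tiny stop word list are hypothetical.

```python
import re

# Alphanumeric runs only: keeps "14" and splits "aren't" into "aren" and "t".
ALNUM = re.compile(r"[A-Za-z0-9]+")

# Letters plus apostrophe: drops "14" and keeps "aren't" in one piece.
# [^\W\d_] approximates \p{L}, which the standard re module does not support.
LETTERS = re.compile(r"(?:[^\W\d_]|')+")

# Hypothetical stop word list, just large enough for the example sentence.
STOP_WORDS = {"the", "over"}


def simple_tokens(text):
    """Tokenize into alphanumeric strings."""
    return ALNUM.findall(text)


def letter_tokens(text):
    """Tokenize into runs of letters and apostrophes."""
    return LETTERS.findall(text)


def bag_of_tokens(text):
    """Lowercase the letter tokens and drop stop words."""
    return [t.lower() for t in letter_tokens(text) if t.lower() not in STOP_WORDS]
```

Lemmatization is left out on purpose; it needs a language-specific lexicon or model rather than a regular expression.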
    • 11. Bags of words: Lemmatization
        Original text: the same passage as on slide 9.
        Lemmatized: After | rip | high | from | June | until | the | first | week | of | October | stock | run | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren't | new | concern | but | that | doesn't | mean | they | aren't | real | Investor | suddenly | care | and | are | behave | accordingly | sell | some | of | their | more | aggressive | name | and | rotate | into | defensive
    • 12. Bags of words: Lemmatization
        Original text: È uno dei punti più contestati della legge di Stabilità approvata da poco dal governo: il taglio alle detrazioni fiscali, ossia gli "sconti" che ogni contribuente può vantare sulla propria dichiarazione dei redditi. Secondo una bozza aggiornata del disegno di legge, il taglio si applicherebbe a decorrere dal periodo di imposta al 31 dicembre 2012. Un dettaglio che aveva creato, nei giorni scorsi, non poche polemiche.
        Lemmatized: E | uno | dei | puntare | più | contestato | della | legge | di | Stabilità | approvare | da | poco | dal | governo | il | tagliare | alle | detrazione | fiscale | ossia | gli | scontare | che | ogni | contribuire | può | vantare | sulla | proprio | dichiarazione | dei | reddito | Secondo | una | bozzare | aggiornare | del | disegnare | di | legge | il | tagliare | si | applicare | a | decorrere | dal | periodare | di | impostare | al | dicembre | Un | dettagliare | che | aveva | creare | nei | giorno | scorrere | non | poca | polemico
    • 13. Computing weights
        • TF (Term Frequency): the number of times a lemma (stem) occurs in a document
        • DF (Document Frequency): the number of documents in which a lemma (stem) occurs at least once
        • TFIDF
        • Higher TF means higher TFIDF
        • Higher DF means lower TFIDF
    • 14. Computing weights
        Corpus: “The quick brown dog jumps over the lazy dog.”
        Lemma   TF   DF   IDF    TFIDF
        quick   1    1    0      0
        brown   1    1    0      0
        dog     2    1    0      0
        jump    1    1    0      0
        lazy    1    1    0      0
    • 15. Computing weights
        Corpus: “The quick brown dog jumps over the lazy dog.” and a second document, “jump”. Weights for the first document:
        Lemma   TF   DF   IDF    TFIDF
        quick   1    1    0.69   0.69
        brown   1    1    0.69   0.69
        dog     2    1    0.69   1.39
        jump    1    2    0      0
        lazy    1    1    0.69   0.69
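The numbers on slide 15 are consistent with the weighting TFIDF = TF × ln(N / DF), where N is the number of documents. The slide itself does not show the formula, so treat it as an assumption inferred from the values. A small sketch under that assumption, with documents given as lists of lemmas (the function name is illustrative):

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {lemma: weight} dict per document, using TF * ln(N / DF)."""
    n_docs = len(documents)
    # DF: in how many documents does each lemma occur at least once?
    df = Counter()
    for lemmas in documents:
        df.update(set(lemmas))
    vectors = []
    for lemmas in documents:
        tf = Counter(lemmas)  # TF: occurrences within this document
        vectors.append({t: f * math.log(n_docs / df[t]) for t, f in tf.items()})
    return vectors
```

With the two-document corpus from the slide, "dog" (TF 2, DF 1) gets 2 × ln 2 and "jump" (DF 2) gets 0, because a lemma that occurs in every document carries no discriminating information.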
    • 16. Cosine similarity
        [Figure: document vectors d1 and d2 drawn from the origin 0.]
    • 17. Cosine similarity
        [Figure: d1 and d2 from the origin 0, with the value 1 marked.]
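Cosine similarity is the dot product of two document vectors divided by the product of their lengths, so it depends only on the angle between them. A minimal sketch for sparse vectors stored as `{term: weight}` dicts (the representation is an assumption; the slides only show the geometry):

```python
import math


def cosine(a, b):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

If the vectors are normalized to unit length beforehand, the denominator is 1 and the similarity reduces to a plain dot product.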
    • 18. Centroids
        • Determine characteristic words in a cluster
        • Nearest centroid classifier
        • k-means clustering
        • …
    • 19. Where are we?
        [Figure: the Part I roadmap from slide 6, repeated.]
    • 20. Machine learning
        • Machine learning is concerned with the development of algorithms that allow computer programs to learn from past experience [Mitchell]
        • Machine learning refers to a collection of algorithms that take as input empirical data (e.g., from databases or sensors) and try to discover some characteristics (rules, constraints, patterns, features) of the process that generated the data [Wikipedia]
        • Learning from past experience = learning from past examples
        • Examples (instances) = document vectors (normalized sparse vectors)
    • 21. Machine learning
        • We will look at two commonly used machine learning techniques
            – Classification
                • Assigning instances (documents) to two or more predefined (discrete) classes
                • Supervised learning method
            – Clustering
                • Arranging instances (documents) into groups (clusters) so that instances in the same group are more similar to each other than to those in other groups
                • Unsupervised learning method
    • 22. Classification
        • Labeled documents
            – Mergers & Acquisitions: Ingram Wraps Up Brightpoint Buyout
            – Mergers & Acquisitions: State Street completes acquisition of Goldman Sachs Administration Services
            – Economy & Government: Gasoline fuels inflation, but Fed policy seen steady
            – Economy & Government: Euro Leads Majors Higher as Spanish Bailout Looks Increasingly Likely
            – Investing Picks: Smith & Wesson Holding Corp. Enters Oversold Territory
            – Investing Picks: The Fresh Market: A Strong Buy
            – ...
        • Learn to classify: labeled dataset → training algorithm → classification model
        • Classify unlabeled documents: unlabeled dataset → classification algorithm (using the classification model) → predictions (labels)
            – Example: “Fresh Del Monte Produce Inc. Enters Oversold Territory” → Investing Picks
    • 23. Classification with k-Nearest Neighbors
        [Figure: an unlabeled document among labeled documents from the classes Investing Picks, Mergers & Acquisitions, and Economy & Government.]
        Votes among the nearest neighbors: Investing Picks: 4, Mergers & Acquisitions: 1, Economy & Government: 0
    • 24. Classification with Nearest Centroid Classifier
        [Figure: similarities s1, s2, s3 between a document and the centroids of the three classes.]
        Similarity s2 > s1 > s3, where s2: Mergers & Acquisitions, s1: Investing Picks, s3: Economy & Government
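Both classifiers from slides 23 and 24 fit in a short sketch. It assumes sparse `{term: weight}` vectors and cosine similarity as the closeness measure; the function names and the training-data layout (a list of `(vector, label)` pairs) are illustrative, not taken from the tutorial.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(query, training, k=5):
    """k-NN: let the k most similar training documents vote on the label."""
    ranked = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def class_centroids(training):
    """Average the vectors of each class into one centroid per class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for vec, label in training:
        counts[label] += 1
        for term, w in vec.items():
            sums[label][term] += w
    return {lab: {t: w / counts[lab] for t, w in vec.items()} for lab, vec in sums.items()}


def ncc_classify(query, centroids):
    """Nearest centroid: pick the class whose centroid is most similar."""
    return max(centroids, key=lambda lab: cosine(query, centroids[lab]))
```

The sketch also shows why k-NN has no training step but a large model (it keeps every training vector), while the nearest centroid classifier stores only one vector per class.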
    • 25. Classification with Support Vector Machine (SVM)
        [Figure: the classes Investing Picks and Mergers & Acquisitions separated by a margin labeled w.]
        • Maximize w
        • Minimize tradeoff
    • 26. Classification algorithms
                                k-NN    Nearest centroid    SVM (linear kernel)
        Multiclass?             yes     yes                 no
        Explains decisions?     no      yes                 yes
        Explains model?         no      yes                 yes
        Number of parameters    1       0                   1
        Model size              big     small               small
        Training speed          0       fast                slow
        Classification speed    slow    fast                fast
        Accuracy (on texts)     low     medium              high
    • 27. Clustering
    • 28. Clustering
        • k-means clustering
        • Agglomerative hierarchical clustering
    • 29. k-means clustering
        Input: k
        Output: k clusters (and their centroids)
        1. Randomly select k instances for initial centroids
        2. Assign step: assign each instance to the nearest centroid
        3. If the assignments did not change, end the algorithm
        4. Update step: recompute (update) centroids
        5. Repeat from Step 2
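The five steps on slide 29 translate almost line by line into code. The sketch below makes two simplifying assumptions that are not in the slides: instances are dense lists of numbers, and "nearest" means smallest squared Euclidean distance (for normalized document vectors, cosine similarity would be the natural substitute). The optional `init` parameter is hypothetical and exists only to make runs reproducible.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two dense vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, init=None, seed=0):
    """Return (assignment, centroids) following the assign/update loop."""
    # Step 1: pick k instances as initial centroids (random unless given).
    start = init if init is not None else random.Random(seed).sample(points, k)
    centroids = [list(p) for p in start]
    assignment = None
    while True:
        # Step 2 (assign): each instance goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Step 3: stop once the assignments no longer change.
        if new_assignment == assignment:
            return assignment, centroids
        assignment = new_assignment
        # Step 4 (update): recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Step 5: loop back to the assign step.
```

Passing the previous centroids as `init` when the data changes slightly is one way to get a warm start, which typically needs fewer iterations than a fresh random initialization.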
    • 30. k-means clustering
        This video is available at http://first.ijs.si/tutorial/video/kmeans.html
    • 31. Agglomerative hierarchical clustering
        1. Find the two most similar instances
        2. Connect them
        3. Replace them with their centroid
        4. Repeat …
        The result is a “dendrogram”.
    • 32. Where are we?
        [Figure: the Part I roadmap from slide 6, repeated.]
    • 33. Evaluation
        • Cross validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29)
            – 10-fold cross validation
            – Stratified
        • Accuracy
        • Precision, recall, F1 (http://en.wikipedia.org/wiki/Precision_and_recall | http://en.wikipedia.org/wiki/F1_Score)
        • Micro and macro-averaging (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-text-classification-1.html | http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization)
        • Statistical tests (http://en.wikipedia.org/wiki/Statistical_hypothesis_testing)
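For a single class, the measures listed on slide 33 follow directly from the counts of true positives, false positives, and false negatives. A minimal sketch for binary labels (1 = in the class, 0 = not); the function name and the convention of returning 0.0 for an undefined ratio are choices made here, not part of the tutorial:

```python
def binary_scores(gold, predicted):
    """Return (accuracy, precision, recall, f1) for 0/1 label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    correct = sum(1 for g, p in pairs if g == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

Macro-averaging computes these scores per class and averages them, while micro-averaging pools the counts over all classes before dividing.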
    • 34. Where are we?
        [Figure: the Part I roadmap from slide 6, repeated.]
    • 35. Applications
        • Enhanced Web search (SearchPoint)
        • Social browsing (LiveNetLife)
        • Content categorization
        • Content-based recommender systems
        • Advertising
        • Blogging assistance (Zemanta)
        • Spam detection
        • Visualization / summarization of large corpora
        • Text summarization: Leskovec et al. (2005): Extracting Summary Sentences Based on the Document Semantic Graph. Microsoft Research Technical Report MSR-TR-2005-07.
        • Sentiment analysis (demo later)
        • News aggregation: http://emm.newsexplorer.eu
        • Knowledge engineering: http://ontogen.ijs.si
        • …
    • 36. Enhanced Web search (http://www.searchpoint.com)
    • 37. Social browsing (http://www.livenetlife.com) @ http://videolectures.net
    • 38. Content categorization @ http://videolectures.net
    • 39. Recommender system @ http://videolectures.net
    • 40. Contextualized advertising
    • 41. Blogging assistant (http://www.zemanta.com)
    • 42. Pump & dump: Siering, Muntermann, Grčar (2012)
    • 43. Visualizations
        • Document space visualization
        • Canyon flows
        • Tag clouds: http://www.jasondavies.com/wordcloud/
    • 44. Recap
        • Basics
            – What is text mining?
            – TF-IDF bag-of-words vectors
            – Cosine similarity
            – Centroids
        • Machine learning
            – k-NN
            – Nearest centroid classifier
            – SVM
            – k-means
            – Agglomerative clustering
        • Applications
            – Enhanced Web search (SearchPoint)
            – Social browsing (LiveNetLife)
            – Content categorization
            – Content-based recommender systems
            – Advertising
            – Writing assistance (Zemanta)
            – Spam detection
            – Visualization / summarization of large corpora
            – …
    • 45. Part II: Text stream mining
    • 46. What is text stream mining?
        Same as text mining but on streams: text stream mining provides a set of methodologies and tools for discovering, presenting, and evaluating knowledge from streams of textual documents
    • 47. Remember: Typical text mining process
        [Figure: the text mining pipeline from slide 5, repeated.]
    • 48. Typical text stream mining process
        [Figure: pipeline Stream data acquisition → Text preprocessing → Modeling → Application, with Evaluation / validation and a feedback loop.]
        • Stream data acquisition and text preprocessing: acquisition, cleaning, transformation
        • Modeling: discover, extract, organize knowledge
        • Application: presentation, interaction
        • Evaluation / validation: performance and utility assessment, obtaining new labels, feedback loop
    • 49. Text stream mining pipelines
        • Pipelining and parallelization
            – Enables concurrent processing
            – Increases throughput
            – Enables distributed execution (cluster)
        • Near-realtime online systems
            – Stream cannot be paused or slowed down (e.g., newsfeeds)
            – [Near-realtime] Time between reception and utilization of data should be as short as possible
            – [Online] Stream is infinite and (sooner or later) outdated data needs to be deleted
    • 50. What do we cover in Part II?
        [Figure: the text stream mining pipeline, annotated with the topics of Part II.]
        • Stream data acquisition: RSS feeds, boilerplate remover, language detection
        • Text preprocessing: online BOW
        • Modeling: online ML, incremental NCC, incremental k-means, incremental SVM
        • Application: online document space visualization, online Twitter sentiment classification
    • 51. Text stream acquisition and preprocessing
        [Figure: several parallel preprocessing pipelines, each consisting of RSS reader → boilerplate remover → language detector. Other labels: load balancing, sync, online BOW.]
    • 52. RSS (Really Simple Syndication)
    • 53. RSS (Really Simple Syndication)
        <rss version="2.0">
          <channel>
            <generator>NFE/1.0</generator>
            <title>Top Stories - Google News</title>
            <link>http://news.google.com/news?pz=1&amp;ned=us&amp;hl=en</link>
            <language>en</language>
            <webMaster>news-feedback@google.com</webMaster>
            <copyright>&amp;copy;2011 Google</copyright>
            <item>
              <title>Egypt Analysts Comment on Next Steps After Mubarak’s Ouster - Bloomberg</title>
              <link>http://news.google.com/news/url?sa=t&amp;fd=R&amp;usg=AFQjCNEF9B7Q8C7_TBDKPEMFjb83fcuNfQ&amp;url=http://www.bloomberg.com/news/2011-02-11/egypt-analysts-comment-on-next-steps-after-mubarak-s-ouster.html</link>
              <category>Top Stories</category>
              <pubDate>Fri, 11 Feb 2011 20:15:40 GMT+00:00</pubDate>
              <description>The ouster of Hosni Mubarak from Egypt’s presidency today, after protests that started Jan. 25, prompted the following comments from analysts: “The army needs to move quickly to remove obstacles to ...</description>
            </item>
            ...
          </channel>
        </rss>
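An RSS reader component essentially extracts the `<item>` elements from a feed like the one above. The following sketch uses only the Python standard library; the shortened `FEED` string and the `read_items` helper are illustrative, and a production reader would also need to fetch the feed over HTTP and remember which items it has already seen.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the feed on the slide (hypothetical test input).
FEED = """<rss version="2.0">
  <channel>
    <title>Top Stories - Google News</title>
    <item>
      <title>Egypt Analysts Comment on Next Steps After Mubarak's Ouster - Bloomberg</title>
      <link>http://www.bloomberg.com/news/2011-02-11/egypt-analysts-comment-on-next-steps-after-mubarak-s-ouster.html</link>
      <category>Top Stories</category>
      <pubDate>Fri, 11 Feb 2011 20:15:40 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def read_items(xml_text):
    """Return one dict per <item>, with the fields a text pipeline needs."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        })
    return items
```

The `link` is what the next pipeline stage would download before boilerplate removal and language detection.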
    • 54. Text stream acquisition and preprocessing
        [Figure: the preprocessing pipelines from slide 51, repeated.]
    • 55. Boilerplate removal
        Example page: http://www.bbc.co.uk/news/world-us-canada-15051554
    • 56. Boilerplate removal: URL tree
        URL structure: protocol :// domain / path / file ? query
        Example: http:// kt.ijs.si /a/b/ c.html ?pg=0
        Tree branch: # → si → ijs → kt → a → b (root, then domain, then path)
        Example: http://www.bbc.co.uk/news/world-us-canada-15051554 gives # → uk → co → bbc → www → news
    • 57. Boilerplate removal: URL tree
        How many times did I see “About Us” in this part of the tree?
        [Figure: the stream feeding a tree with root (#), domain, and path levels.]
        This method is …
        • Unsupervised
        • Online
        • Incremental (consumes one document at a time)
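The mapping from a URL to its tree branch, and the "how many times did I see this text here?" bookkeeping, can be sketched as follows. The branch construction follows the two examples on slide 56; everything about the counting (one counter per tree node, one increment per document and text block) is an assumption about how such a tree might be maintained, and the class and method names are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def url_branch(url):
    """Root '#', then reversed domain labels, then path directories (no file)."""
    parts = urlparse(url)
    host = parts.hostname.split(".")[::-1]
    segments = [s for s in parts.path.split("/") if s]
    # The last segment is treated as the file unless the path ends with "/".
    dirs = segments if parts.path.endswith("/") else segments[:-1]
    return ["#"] + host + dirs


class UrlTreeCounter:
    """Counts, per tree node, in how many documents each text block was seen."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # node (tuple of labels) -> Counter

    def add(self, url, blocks):
        """Consume one document: update every node along its branch."""
        branch = url_branch(url)
        for depth in range(1, len(branch) + 1):
            self.counts[tuple(branch[:depth])].update(set(blocks))

    def seen(self, url, block):
        """Counts for one text block at each node from the root to the leaf."""
        branch = url_branch(url)
        return [self.counts[tuple(branch[:d])][block] for d in range(1, len(branch) + 1)]
```

A block such as "About Us" that keeps recurring under the same domain or path is a boilerplate candidate, whereas article text is normally seen once.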
    • 58. Text stream acquisition and preprocessing
        [Figure: the preprocessing pipelines from slide 51, repeated.]
    • 59. Language detection
        • Motivation: language-specific text analysis components and applications
        • Solutions based on word lists and word or character sequences (n-grams)
        • Character n-gram model
            – Build character n-gram histograms for many languages (language models)
            – Compare text document histogram to language models
    • 60. Language detection
        Character n-grams ranked by frequency (“_” marks a word boundary):
        • English: E 1, T 2, O 3, A 4, N 5, I 6, H 7, S 8, R 9, D 10, E_ 11, L 12, _T 13, TH 14, HE 15, U 16, W 17, C 18, M 19, ...
        • German: E 1, N 2, R 3, I 4, T 5, S 6, A 7, D 8, U 9, EN 10, G 11, ER 12, H 13, L 14, N_ 15, O 16, M 17, _D 18, C 19, ...
        Callouts: THE (English); DER, DEN (German)
    • 61. Language detection
        Article “Egypt rejoices at Mubarak departure”
        [Figure: two scatter plots of the English article's n-gram ranks, one against the English language model (n-gram rank) and one against the German language model (n-gram rank).]
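One common way to compare a document's n-gram ranking with a language model is the "out-of-place" measure: sum, over the document's n-grams, how far each one's rank is from its rank in the model. The slides do not say which distance the tutorial uses, so the measure below is an assumption, as are the n-gram length limit, the profile size, and the two toy training sentences.

```python
import re
from collections import Counter

# Toy training texts (hypothetical); real language models need far more data.
EN_SAMPLE = "The quick brown dog jumps over the lazy dog and the stocks ran into a wall of worry."
DE_SAMPLE = "Der schnelle braune Hund springt über den faulen Hund und die Aktien fallen."


def ngram_profile(text, max_n=3, top=300):
    """Map each of the most frequent character n-grams to its rank (1 = top)."""
    words = re.findall(r"[^\W\d_]+", text.upper())
    cleaned = "_" + "_".join(words) + "_"  # "_" marks word boundaries
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(cleaned) - n + 1):
            gram = cleaned[i:i + n]
            if gram != "_":
                counts[gram] += 1
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top), start=1)}


def out_of_place(doc_profile, model_profile):
    """Sum of rank differences; n-grams missing from the model get a max penalty."""
    penalty = len(model_profile)
    return sum(
        abs(rank - model_profile[gram]) if gram in model_profile else penalty
        for gram, rank in doc_profile.items()
    )


def detect(text, models):
    """Return the language whose model is closest to the text's profile."""
    profile = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(profile, models[lang]))
```

In the scatter plots on slide 61, a small out-of-place distance corresponds to points lying close to the diagonal.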
    • 62. Text stream acquisition and preprocessing
        [Figure: the preprocessing pipelines from slide 51, repeated.]
    • 63. Online BOW
        [Figure: a queue of TF vectors. Documents arriving from the stream are added and outdated ones removed; the DF values are updated on both add and remove.]
    • 64. Online BOW
        [Figure: the same queue of TF vectors and DF values; TF and DF are combined into TF-IDF.]
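The queue-plus-DF design from slides 63 and 64 can be sketched as a small class. The assumptions here are a fixed window measured in documents (the slides only say "outdated", which could equally be time-based) and the TF × ln(N / DF) weighting; the class and method names are illustrative.

```python
import math
from collections import Counter, deque


class OnlineBow:
    """Keeps TF vectors of recent documents and DF values in sync with them."""

    def __init__(self, max_docs):
        self.max_docs = max_docs
        self.queue = deque()   # TF vectors, oldest first
        self.df = Counter()    # document frequencies over the queue

    def add(self, lemmas):
        """Add one document, evict the oldest if needed, return its TF-IDF vector."""
        tf = Counter(lemmas)
        self.queue.append(tf)
        self.df.update(tf.keys())          # each lemma counts once per document
        if len(self.queue) > self.max_docs:
            outdated = self.queue.popleft()
            for term in outdated:          # undo the outdated document's DF share
                self.df[term] -= 1
                if self.df[term] == 0:
                    del self.df[term]
        return self.tfidf(tf)

    def tfidf(self, tf):
        """Weight a TF vector with the current DF values."""
        n_docs = len(self.queue)
        return {t: f * math.log(n_docs / self.df[t]) for t, f in tf.items() if self.df[t] > 0}
```

Because only TF vectors are stored, the TF-IDF weights of any document still in the queue can be recomputed at any time from the current DF values.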
    • 65. Where are we?
        [Figure: the Part II roadmap from slide 50, repeated.]
    • 66. Batch, incremental, offline, online
        • Batch learning: consuming all training examples at once
        • Incremental learning: consuming one example at a time
        • Mini-batch learning: consuming several examples at a time
        • Offline learning (for datasets/finite streams): all data is stored and can be accessed repeatedly
        • Online learning (for infinite streams): each example is discarded after being processed
    • 67. Incremental nearest centroid classifier
        [Figure: a class centroid shifting as an outdated instance leaves and a new instance arrives.]
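A centroid can be kept up to date without revisiting old data if the running sum of the vectors and their count are stored instead of the mean itself. The sketch below shows this for one class; it is an illustration of the idea under that assumption, not the tutorial's code, and the class name is hypothetical.

```python
from collections import defaultdict


class IncrementalCentroid:
    """Centroid of a changing set of sparse vectors, updated one at a time."""

    def __init__(self):
        self.total = defaultdict(float)  # component-wise sum of all vectors
        self.count = 0

    def add(self, vec):
        """A new instance arrives."""
        for term, w in vec.items():
            self.total[term] += w
        self.count += 1

    def remove(self, vec):
        """An outdated instance leaves; it must have been added before."""
        for term, w in vec.items():
            self.total[term] -= w
            if self.total[term] == 0.0:
                del self.total[term]
        self.count -= 1

    def centroid(self):
        """Current mean vector (empty if no instances are left)."""
        if self.count == 0:
            return {}
        return {term: s / self.count for term, s in self.total.items()}
```

Each update costs time proportional to the number of non-zero terms in the single vector involved, independent of how many instances the class currently holds.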
    • 68. Incremental k-means clustering
        Converges in only a few iterations (warm start)
    • 69. Other incremental methods
        • Incremental SVM: A. Bordes, S. Ertekin, J. Weston, and L. Bottou (2005): Fast Kernel Classifiers with Online and Active Learning, Journal of Machine Learning Research, vol. 6, pp. 1579–1619
        • Incremental perceptron: www.cs.columbia.edu/~jebara/4771/tutorials/perceptron.pdf
        • Incremental winnow: http://en.wikipedia.org/wiki/Winnow_%28algorithm%29
    • 70. Where are we?
        [Figure: the Part II roadmap from slide 50, repeated.]
    • 71. Document space visualization
        [Figure: documents from a space of several thousand dimensions laid out in 2D.]
    • 72. Document space visualization
        [Figure: layout pipeline. Components: document corpus, corpus preprocessing, k-means clustering, stress majorization, neighborhoods computation, least-squares interpolation, layout.]
    • 73. Document space visualization
    • 74. Document space visualization
        [Figure: the layout pipeline from slide 72 with online BOW added. Annotations: maintaining sorted lists, warm start (three places), parallelization, pipelining.]
    • 75. Document space visualization
        This video is available at http://first.ijs.si/tutorial/video/ameba.html
    • 76. Twitter
        • Platform for sending short messages (similar to SMS)
        • Est. 225 million users
        • 100 million accounts added in 2010
        • 65 million tweets per day
    • 77. Financial tweets
        • Informal $ sign convention
        • Some examples (March 19):
            – User#1: $AAPL is making an announcement at 9am on what it plans to do with its 97 billion in cash. We expect a dividend announcement
            – User#2: $AAPL over 600.00 a share in the pre-market on news of a dividend.
            – User#3: Will there be any other news besides $AAPL dividend?
        • We acquire ~13,000 tweets per weekday, for ~1,800 NASDAQ/NYSE stocks ($GOOG, $MSFT…)
        • We analyze tweets to determine whether they contain positive or negative vocabulary
    • 78. Sentiment classification
        • Labeled documents
            – POS: Financial markets are now officially open :)
            – POS: market intelligence GMI Interactive and Mintel Win ARF Great Minds Award for Quality in Research
            – POS: $AAPL : trust me -- AAPL will soar tomorrow
            – NEG: Oh how I miss the days with GBP was at least 2 times the AUD. Sterling forecast to hit all-time lows soon
            – NEG: omg! did you know BORDERS closed?! they went bankrupt last month and closed!! awww, too bad! i love borders!!
            – NEG: @aekins thats just too bad
            – ...
        • Learn to classify: labeled dataset → training algorithm → classification model
        • Classify unlabeled documents: unlabeled dataset → classification algorithm (using the classification model) → predictions (labels)
            – Example: “So Nickelodeon filed for bankruptcy and announced that the next Kids Choice Awards will be its last.” → NEG
    • 79. Sentiment classification
        Outline: Emoticons & SVM classifier • Neutral zone • Explanations • Accuracy
        Example tweets with emoticons:
            – Goodnight everyoneeee :) Love yall
            – I have a good feeling about today ;)
            – ooo the ice cream van is here... yaaaaaay :D
            – in the garden in the sun! Just about to fill the pool! happy days! :D
            – Finally got JSON in #processing to work. More playing around coming :)
            – @oanhLove I hate when that happens... :-/
            – No jobs, no money. how in the hell is min wage here 4 fn clams an hour? :(
            – I hate when I have to call and wake people up :(
            – I dont have any chalk! :-/ MY CHALKBOARD IS USELESS
            – UGHHHHHHHHHHHHHHH.. life is NOT good all the time!!!!!! ;(
    • 80. Sentiment classification: Emoticons & SVM classifier
        [Figure: tweets marked + and –, separated into two groups by the SVM classifier.]
    • 81. Sentiment classification: Neutral zone
        [Figure: the same tweets; those near the boundary between the + and – groups are marked 0.]
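One simple way to obtain a neutral zone like the one on slide 81 is to treat a tweet as neutral when the classifier's signed score is too close to zero. The slides do not give the rule the tutorial uses, so the symmetric threshold, the linear scoring function, and all names below are assumptions for illustration.

```python
def linear_score(features, weights, bias=0.0):
    """Signed score of a linear model; the sign gives the side of the boundary."""
    return sum(w * weights.get(term, 0.0) for term, w in features.items()) + bias


def label_with_neutral_zone(score, zone=0.5):
    """Return '+', '-' or '0' depending on where the score falls."""
    if score > zone:
        return "+"
    if score < -zone:
        return "-"
    return "0"  # too close to the decision boundary to call
```

Widening `zone` trades coverage for reliability: fewer tweets receive a sentiment label, but those that do lie farther from the boundary.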
    • 82. Sentiment classification: Explanations
        Example: “Sovereign debt and unemployment are big issues in EU.”
        Highlighted terms, in two groups: unemployed, issues, debt, eu | sovereign, big
    • 83. Sentiment classification: Accuracy
        Preprocessing options compared: replace usernames with a token; replace URLs with a token; remove letter repetition; replace negations with a token; replace exclamation marks with a token; replace question marks with a token.
        Results (each row enables a different subset of the options above):
        Accuracy   Precision/recall   Average accuracy (10-fold cross validation)
        81.06%     81.32%/81.32%      76.98%
        80.22%     82.08%/78.02%      77.43%
        79.94%     77.78%/84.62%      77.10%
        79.94%     76.70%/86.81%      77.53%
        79.67%     80.79%/78.57%      76.85%
        78.83%     77.60%/81.87%      77.29%
        78.55%     75.86%/84.62%      76.91%
        78.55%     77.78%/80.77%      76.93%
        78.27%     80.23%/75.82%      76.93%
        78.27%     76.53%/82.42%      77.04%
        77.44%     75.12%/82.97%      76.86%
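The six preprocessing options compared on slide 83 are all simple text rewrites. The sketch below applies them in a fixed order with regular expressions; the token names (`USER`, `URL`, `NEG`, `EXCL`, `QUEST`), the short negation list, and the rule of squeezing three or more repeated letters down to two are hypothetical choices, since the slide names the options but not their exact form.

```python
import re


def normalize_tweet(text):
    """Apply all six rewrites and collapse the resulting whitespace."""
    text = re.sub(r"https?://\S+", " URL ", text)           # URLs -> token
    text = re.sub(r"@\w+", " USER ", text)                  # usernames -> token
    text = re.sub(r"([^\W\d_])\1{2,}", r"\1\1", text)       # yaaaaay -> yaay
    text = re.sub(r"\b(?:not|no|never|\w+n't)\b", " NEG ", text, flags=re.IGNORECASE)
    text = re.sub(r"!+", " EXCL ", text)                    # exclamation marks -> token
    text = re.sub(r"\?+", " QUEST ", text)                  # question marks -> token
    return " ".join(text.split())
```

In an experiment like the one on the slide, each rewrite would be switched on or off independently so that its effect on accuracy can be measured.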
    • 84. [Figure legend] Grey: Netflix stock closing price. Blue: the number of positive tweets. Red: the number of negative tweets. Yellow: the difference between the positive and negative tweets. Green dots: relevant events concerning Netflix.
    • 85. Volume peaks likely represent important events
        • First-quarter earnings release
        • Plans to launch in 43 countries in Latin America and the Caribbean
        • Netflix loses TV shows and films, Netflix loses the Starz deal
    • 86. Sentiment cross-over happens before price plunge
    • 87. Presidential elections: http://predsedniskevolitve.si
    • 88. Recap
        • Basics
            – What is text stream mining?
            – Pipelining, parallelization
            – Web data acquisition
            – Online BOWs
        • Machine learning
            – Batch, incremental, offline, online
            – Incremental nearest centroid classifier
            – Incremental k-means
            – Warm start
        • Applications
            – Online document space visualization
            – Online Twitter sentiment classifier
                • Stock sentiment monitoring
                • Presidential elections
