Text and text stream mining tutorial

Published in: Technology
Text and text stream mining tutorial

  1. Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928), http://project-first.eu
     Text Mining and Text Stream Mining Tutorial
     Miha Grčar, miha.grcar@ijs.si
     Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, http://kt.ijs.si
  2. Text and text stream mining tutorial
     • Part I: Text mining
     • Part II: Text stream mining
     Lucca, Oct 2012. Miha Grčar: Text and text stream mining
  3. Part I: Text mining
  4. What is text mining?
     • Text mining provides a set of methodologies and tools for discovering, presenting, and evaluating knowledge from large collections of textual documents
     • Text mining adopts and adapts methodologies and tools from …
       – Data mining (DM)
       – Machine learning (ML)
       – Information retrieval (IR)
       – Natural language processing (NLP)
       – Visualization
       – Social network analysis and graph mining
       – Knowledge management
       – …
  5. Typical text mining process
     [Diagram: four stages in sequence plus evaluation, connected by feedback loops]
     • Data acquisition: acquisition, cleaning
     • Text preprocessing: transformation, extract
     • Modeling: discover, organize knowledge
     • Application: presentation, interaction
     • Evaluation / validation: performance and utility assessment
  6. What do we cover in Part I?
     [Process diagram from slide 5, annotated with the topics covered]
     • Text preprocessing: vector space model (bags-of-words)
     • Modeling: machine learning (classification, clustering)
     • Evaluation / validation: cross validation, precision, recall, …
     • Application: search & browse, categorization, recommendation, advertising, spam detection, summarization, visualization, …
  7. Bags of words
     • Tokenize
     • Remove stop words
     Example: “The quick brown dog jumps over the lazy dog.” → the | quick | brown | dog | jumps | over | the | lazy | dog
  8. Bags of words
     • Tokenize
     • Remove stop words
     • Lemmatize (jumps → jump)
     • Compute weights
     Example: “The quick brown dog jumps over the lazy dog.” → quick: 1, brown: 1, dog: 2, jump: 1, lazy: 1
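The four steps on this slide can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the one-entry lemma dictionary are hypothetical stand-ins for real language resources, and the weights are plain term counts.

```python
import re
from collections import Counter

# Hypothetical resources, just large enough for the slide's example.
STOP_WORDS = {"the", "over"}
LEMMAS = {"jumps": "jump"}

def bag_of_words(text):
    """Tokenize, drop stop words, lemmatize, and count term occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenize
    kept = [t for t in tokens if t not in STOP_WORDS]     # remove stop words
    lemmas = [LEMMAS.get(t, t) for t in kept]             # lemmatize
    return Counter(lemmas)                                # compute weights (counts)

bow = bag_of_words("The quick brown dog jumps over the lazy dog.")
print(dict(bow))
```

The resulting counts match the weights shown on the slide (dog: 2, all other lemmas: 1).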
  9. Bags of words: tokenization & stop word removal
     Original text: After ripping 14% higher from June until the first week of October, stocks ran headfirst into a wall of worry seemingly too large to climb. Europe, China, the fiscal cliff, etc aren't new concerns but that doesn't mean they aren't real. Investors suddenly care and are behaving accordingly, selling some of their more aggressive names and rotating into defensives.
     Simple tokenizer (alphanumeric strings only): After | ripping | 14 | higher | from | June | until | the | first | week | of | October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren | t | new | concerns | but | that | doesn | t | mean | they | aren | t | real | Investors | suddenly | care | and | are | behaving | accordingly | selling | some | of | their | more | aggressive | names | and | rotating | into | defensives
  10. Bags of words: tokenization & stop word removal
     Original text: same passage as on slide 9
     Regex tokenizer ([\p{L}']+): After | ripping | higher | from | June | until | the | first | week | of | October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren't | new | concerns | but | that | doesn't | mean | they | aren't | real | Investors | suddenly | care | and | are | behaving | accordingly | selling | some | of | their | more | aggressive | names | and | rotating | into | defensives
  11. Bags of words: lemmatization
     Original text: same passage as on slide 9
     Lemmatized: After | rip | high | from | June | until | the | first | week | of | October | stock | run | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren't | new | concern | but | that | doesn't | mean | they | aren't | real | Investor | suddenly | care | and | are | behave | accordingly | sell | some | of | their | more | aggressive | name | and | rotate | into | defensive
  12. Bags of words: lemmatization (Italian example)
     Original text: È uno dei punti più contestati della legge di Stabilità approvata da poco dal governo: il taglio alle detrazioni fiscali, ossia gli "sconti" che ogni contribuente può vantare sulla propria dichiarazione dei redditi. Secondo una bozza aggiornata del disegno di legge, il taglio si applicherebbe a decorrere dal periodo di imposta al 31 dicembre 2012. Un dettaglio che aveva creato, nei giorni scorsi, non poche polemiche.
     Lemmatized: E | uno | dei | puntare | più | contestato | della | legge | di | Stabilità | approvare | da | poco | dal | governo | il | tagliare | alle | detrazione | fiscale | ossia | gli | scontare | che | ogni | contribuire | può | vantare | sulla | proprio | dichiarazione | dei | reddito | Secondo | una | bozzare | aggiornare | del | disegnare | di | legge | il | tagliare | si | applicare | a | decorrere | dal | periodare | di | impostare | al | dicembre | Un | dettagliare | che | aveva | creare | nei | giorno | scorrere | non | poca | polemico
  13. Computing weights
     • TF (term frequency): the number of times a lemma (stem) occurs in a document
     • DF (document frequency): the number of documents in which a lemma (stem) occurs at least once
     • TFIDF
     • Higher TF means higher TFIDF
     • Higher DF means lower TFIDF
  14. Computing weights
     Document: “The quick brown dog jumps over the lazy dog.”
             TF  DF  IDF  TFIDF
     quick    1   1    0      0
     brown    1   1    0      0
     dog      2   1    0      0
     jump     1   1    0      0
     lazy     1   1    0      0
  15. Computing weights
     Documents: “The quick brown dog jumps over the lazy dog.” and a second document, “jump”
             TF  DF   IDF  TFIDF
     quick    1   1  0.69   0.69
     brown    1   1  0.69   0.69
     dog      2   1  0.69   1.39
     jump     1   2     0      0
     lazy     1   1  0.69   0.69
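The numbers on slides 14 and 15 are consistent with the weighting TFIDF = TF · ln(N / DF), where N is the number of documents (ln 2 ≈ 0.69). The slides do not state the formula, so the sketch below treats it as an assumption; the function name and the second document containing only the lemma "jump" are also assumptions made to reproduce the table.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of lemma lists. Returns one {lemma: weight} dict per document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))              # each lemma counts once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: TF * ln(N / DF)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

docs = [["quick", "brown", "dog", "jump", "lazy", "dog"], ["jump"]]
vectors = tfidf_vectors(docs)
for lemma, weight in vectors[0].items():
    print(f"{lemma:6s} {weight:.2f}")
```

With a single document every DF equals N, so all weights are zero, which is what slide 14 shows.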
  16. Cosine similarity [Figure: document vectors d1 and d2 drawn from the origin 0]
  17. Cosine similarity [Figure: vectors d1 and d2 from the origin 0, together with their counterparts at length 1]
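Cosine similarity is the cosine of the angle between two document vectors: the dot product divided by the product of the vector lengths. A minimal sketch over sparse vectors stored as dictionaries follows; the example vectors and their terms are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0                      # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

d1 = {"stock": 1.0, "market": 1.0}
d2 = {"stock": 1.0, "dividend": 1.0}
d3 = {"weather": 2.0}
print(cosine_similarity(d1, d2), cosine_similarity(d1, d3))
```

If the vectors are normalized to unit length beforehand, the similarity reduces to the dot product alone.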
  18. Centroids
     • Determine characteristic words in a cluster
     • Nearest centroid classifier
     • k-means clustering
     • …
  19. Where are we? [Process overview repeated from slide 6]
  20. Machine learning
     • Machine learning is concerned with the development of algorithms that allow computer programs to learn from past experience [Mitchell]
     • Machine learning refers to a collection of algorithms that take as input empirical data (e.g., from databases or sensors) and try to discover some characteristics (rules, constraints, patterns, features) of the process that generated the data [Wikipedia]
     • Learning from past experience = learning from past examples
     • Examples (instances) = document vectors (normalized sparse vectors)
  21. Machine learning
     • We will look at two commonly used machine learning techniques
       – Classification
         • Assigning instances (documents) to two or more predefined (discrete) classes
         • Supervised learning method
       – Clustering
         • Arranging instances (documents) into groups (clusters) so that instances in the same group are more similar to each other than to those in other groups
         • Unsupervised learning method
  22. Classification
     • Labeled documents
       – Mergers & Acquisitions: Ingram Wraps Up Brightpoint Buyout
       – Mergers & Acquisitions: State Street completes acquisition of Goldman Sachs Administration Services
       – Economy & Government: Gasoline fuels inflation, but Fed policy seen steady
       – Economy & Government: Euro Leads Majors Higher as Spanish Bailout Looks Increasingly Likely
       – ...
       – Investing Picks: Smith & Wesson Holding Corp. Enters Oversold Territory
       – Investing Picks: The Fresh Market: A Strong Buy
     • Learn to classify: labeled dataset → training algorithm → classification model
     • Classify unlabeled documents: unlabeled dataset + classification model → classification algorithm → predictions (labels)
       – Example: “Fresh Del Monte Produce Inc. Enters Oversold Territory” → Investing Picks
  23. Classification with k-Nearest Neighbors
     [Figure: documents from three classes (Investing Picks, Mergers & Acquisitions, Economy & Government) around an unlabeled document]
     Votes among the nearest neighbors: Investing Picks: 4, Mergers & Acquisitions: 1, Economy & Government: 0
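The voting shown on this slide can be sketched as follows: rank the labeled documents by similarity to the unlabeled one, keep the k most similar, and let them vote. The toy vectors, the choice of k, and the function names are assumptions for illustration; the vectors are taken to be unit length so that the dot product equals the cosine similarity.

```python
from collections import Counter

def dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def knn_classify(query, labeled, k=5):
    """labeled: list of (vector, label) pairs. Returns (winning label, vote counts)."""
    ranked = sorted(labeled, key=lambda ex: dot(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0], votes

labeled = [
    ({"oversold": 1.0}, "Investing Picks"),
    ({"buy": 1.0}, "Investing Picks"),
    ({"acquisition": 1.0}, "Mergers & Acquisitions"),
    ({"buyout": 1.0}, "Mergers & Acquisitions"),
    ({"inflation": 1.0}, "Economy & Government"),
]
query = {"oversold": 0.8, "buy": 0.6}   # unit-length toy query
label, votes = knn_classify(query, labeled, k=3)
print(label, dict(votes))
```

Note that k-NN has no training step: all labeled vectors are kept and compared at classification time, which is why the model is large and classification is slow.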
  24. Classification with Nearest Centroid Classifier
     [Figure: class centroids for Investing Picks, Mergers & Acquisitions, and Economy & Government, with similarities s1, s2, s3 to an unlabeled document]
     Similarity: s2 > s1 > s3
     • s2: Mergers & Acquisitions
     • s1: Investing Picks
     • s3: Economy & Government
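A nearest centroid classifier keeps one centroid per class and assigns a document to the class whose centroid is most similar. The sketch below is one possible formulation, assuming cosine similarity and centroids computed as the normalized sum of the class's vectors; the toy training data and function names are hypothetical.

```python
import math
from collections import defaultdict

def normalize(v):
    """Scale a sparse vector to unit length (empty vectors are returned as-is)."""
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()} if norm else dict(v)

def train_centroids(labeled):
    """labeled: list of (vector, label). Returns {label: unit-length centroid}."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in labeled:
        for t, w in vec.items():
            sums[label][t] += w
    return {label: normalize(s) for label, s in sums.items()}

def classify(vec, centroids):
    """Returns (most similar class, {class: cosine similarity})."""
    q = normalize(vec)
    sims = {label: sum(w * c.get(t, 0.0) for t, w in q.items())
            for label, c in centroids.items()}
    return max(sims, key=sims.get), sims

labeled = [
    ({"acquisition": 1.0, "deal": 1.0}, "Mergers & Acquisitions"),
    ({"buyout": 1.0, "deal": 1.0}, "Mergers & Acquisitions"),
    ({"oversold": 1.0, "buy": 1.0}, "Investing Picks"),
]
centroids = train_centroids(labeled)
label, sims = classify({"deal": 1.0, "buyout": 1.0}, centroids)
print(label, sims)
```

The largest centroid weights also indicate the characteristic words of a class, which is one way the model can be inspected.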
  25. Classification with Support Vector Machine (SVM)
     [Figure: two classes, Investing Picks and Mergers & Acquisitions, with a separating boundary and w marked]
     • Maximize w
     • Minimize tradeoff
  26. Classification algorithms
                             k-NN   Nearest centroid   SVM (linear kernel)
     Multiclass?             yes    yes                no
     Explains decisions?     no     yes                yes
     Explains model?         no     yes                yes
     Number of parameters    1      0                  1
     Model size              big    small              small
     Training speed          0      fast               slow
     Classification speed    slow   fast               fast
     Accuracy (on texts)     low    medium             high
  27. Clustering [Figure]
  28. Clustering
     • k-means clustering
     • Agglomerative hierarchical clustering
  29. k-means clustering
     Input: k
     Output: k clusters (and their centroids)
     1. Randomly select k instances for initial centroids
     2. Assign step: assign each instance to the nearest centroid
     3. If the assignments did not change, end the algorithm
     4. Update step: recompute (update) centroids
     5. Repeat at Step 2
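The five steps above translate almost directly into code. The sketch below uses dense vectors and squared Euclidean distance for brevity; that distance, the seeded random generator, and the handling of empty clusters are assumptions, since the slide only says "nearest centroid".

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, seed=0):
    """Returns (cluster index per point, list of k centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                    # step 1: random initial centroids
    assignment = None
    while True:
        new_assignment = [                               # step 2: assign step
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:                 # step 3: stop when nothing changed
            return assignment, centroids
        assignment = new_assignment
        for c in range(k):                               # step 4: update step
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                                  # an empty cluster keeps its old centroid
                centroids[c] = mean(members)
        # step 5: loop back to the assign step

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centroids = kmeans(points, k=2)
print(assignment, centroids)
```

Because the initial centroids are chosen at random, different seeds can give different cluster numbering and, on harder data, different clusterings.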
  30. k-means clustering. This video is available at http://first.ijs.si/tutorial/video/kmeans.html
  31. Agglomerative hierarchical clustering
     1. Find the two most similar instances
     2. Connect them
     3. Replace them with their centroid
     4. Repeat
     … [Figure: “Dendrogram”]
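A small sketch of these steps: repeatedly merge the two closest items and replace them with their centroid until one tree remains. Squared Euclidean distance and the size-weighted centroid are assumptions made for the illustration; the dendrogram is returned as nested tuples of the original instance indices.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(points):
    """Returns a dendrogram as nested tuples of point indices."""
    # Each cluster is (tree, centroid, number of instances).
    clusters = [(i, list(p), 1) for i, p in enumerate(points)]
    while len(clusters) > 1:
        # 1. Find the two most similar (closest) clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: sq_dist(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        (ta, ca, na), (tb, cb, nb) = clusters[i], clusters[j]
        # 2. Connect them; 3. replace them with their (size-weighted) centroid.
        centroid = [(x * na + y * nb) / (na + nb) for x, y in zip(ca, cb)]
        merged = ((ta, tb), centroid, na + nb)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters[0][0]

tree = agglomerate([[0.0], [1.0], [10.0], [11.0]])
print(tree)
```

Unlike k-means, no k is needed up front; clusters are obtained by cutting the dendrogram at a chosen level.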
  32. Where are we? [Process overview repeated from slide 6]
  33. Evaluation
     • Cross validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29)
       – 10-fold cross validation
       – Stratified
     • Accuracy
     • Precision, recall, F1 (http://en.wikipedia.org/wiki/Precision_and_recall | http://en.wikipedia.org/wiki/F1_Score)
     • Micro and macro-averaging (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-text-classification-1.html | http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization)
     • Statistical tests (http://en.wikipedia.org/wiki/Statistical_hypothesis_testing)
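As a reminder of how the measures listed above are computed, here is a short sketch of per-class precision, recall, and F1, plus macro-averaged F1 (the unweighted mean of the per-class F1 scores). The gold and predicted labels are made-up example data.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold))
    return sum(precision_recall_f1(gold, predicted, l)[2] for l in labels) / len(labels)

gold = ["POS", "POS", "POS", "NEG", "NEG"]
predicted = ["POS", "POS", "NEG", "POS", "NEG"]
print(precision_recall_f1(gold, predicted, "POS"))
print(macro_f1(gold, predicted))
```

Micro-averaging would instead pool the true positive, false positive, and false negative counts over all classes before computing the ratios.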
  34. Where are we? [Process overview repeated from slide 6]
  35. Applications
     • Enhanced Web search (SearchPoint)
     • Social browsing (LiveNetLife)
     • Content categorization
     • Content-based recommender systems
     • Advertising
     • Blogging assistance (Zemanta)
     • Spam detection
     • Visualization / summarization of large corpora
     • Text summarization: Leskovec et al. (2005): Extracting Summary Sentences Based on the Document Semantic Graph. Microsoft Research Technical Report MSR-TR-2005-07.
     • Sentiment analysis (demo later)
     • News aggregation: http://emm.newsexplorer.eu
     • Knowledge engineering: http://ontogen.ijs.si
     • …
  36. Enhanced Web search (http://www.searchpoint.com)
  37. Social browsing (http://www.livenetlife.com) @ http://videolectures.net [Screenshot with chat bubbles: “Hi!”, “Hello”]
  38. Content categorization @ http://videolectures.net
  39. Recommender system @ http://videolectures.net
  40. Contextualized advertising
  41. Blogging assistant (http://www.zemanta.com)
  42. Pump & dump: Siering, Muntermann, Grčar (2012)
  43. Visualizations
     • Document space visualization
     • Canyon flows
     • Tag clouds (http://www.jasondavies.com/wordcloud/)
  44. Recap
     • Basics
       – What is text mining?
       – TF-IDF bag-of-words vectors
       – Cosine similarity
       – Centroids
     • Machine learning
       – k-NN
       – Nearest centroid classifier
       – SVM
       – k-means
       – Agglomerative clustering
     • Applications
       – Enhanced Web search (SearchPoint)
       – Social browsing (LiveNetLife)
       – Content categorization
       – Content-based recommender systems
       – Advertising
       – Writing assistance (Zemanta)
       – Spam detection
       – Visualization / summarization of large corpora
       – …
  45. Part II: Text stream mining
  46. What is text stream mining?
     • Same as text mining but on streams
     • Text stream mining provides a set of methodologies and tools for discovering, presenting, and evaluating knowledge from streams of textual documents
  47. Remember: typical text mining process [Diagram repeated from slide 5]
  48. Typical text stream mining process
     [Diagram: the process from slide 5 adapted to streams, connected by feedback loops]
     • Stream data acquisition: acquisition, cleaning
     • Text preprocessing: transformation, extract
     • Modeling: discover, organize knowledge
     • Application: presentation, interaction
     • Evaluation / validation: performance and utility assessment, obtaining new labels
  49. Text stream mining pipelines
     • Pipelining and parallelization [Figure labels: Parallelization, Pipelining]
       – Enables concurrent processing
       – Increases throughput
       – Enables distributed execution (cluster)
     • Near-realtime online systems
       – Stream cannot be paused or slowed down (e.g., newsfeeds)
       – [Near-realtime] Time between reception and utilization of data should be as short as possible
       – [Online] Stream is infinite and (sooner or later) outdated data needs to be deleted
  50. What do we cover in Part II?
     [Process diagram from slide 48, annotated with the topics covered]
     • Stream data acquisition: RSS feeds, boilerplate remover, language detection
     • Text preprocessing: online BOW
     • Modeling: online ML (incr. NCC, incr. k-means, incr. SVM)
     • Application: online document space visualization, online Twitter sentiment classification
     • Evaluation / validation
  51. Text stream acquisition and preprocessing
     [Diagram: several parallel preprocessing pipelines, each consisting of RSS reader → Boilerplate remover → Language detector, with Load balancing and Sync stages, feeding Online BOW]
  52. RSS (Really Simple Syndication) [Figure]
  53. RSS (Really Simple Syndication)
     <rss version="2.0">
       <channel>
         <generator>NFE/1.0</generator>
         <title>Top Stories - Google News</title>
         <link>http://news.google.com/news?pz=1&amp;ned=us&amp;hl=en</link>
         <language>en</language>
         <webMaster>news-feedback@google.com</webMaster>
         <copyright>&amp;copy;2011 Google</copyright>
         <item>
           <title>Egypt Analysts Comment on Next Steps After Mubarak’s Ouster - Bloomberg</title>
           <link>http://news.google.com/news/url?sa=t&amp;fd=R&amp;usg=AFQjCNEF9B7Q8C7_TBDKPEMFjb83fcuNfQ&amp;url=http://www.bloomberg.com/news/2011-02-11/egypt-analysts-comment-on-next-steps-after-mubarak-s-ouster.html</link>
           <category>Top Stories</category>
           <pubDate>Fri, 11 Feb 2011 20:15:40 GMT+00:00</pubDate>
           <description>The ouster of Hosni Mubarak from Egypt’s presidency today, after protests that started Jan. 25, prompted the following comments from analysts: “The army needs to move quickly to remove obstacles to ...</description>
         </item>
         ...
       </channel>
     </rss>
  54. Text stream acquisition and preprocessing [Pipeline diagram repeated from slide 51]
  55. Boilerplate removal [Screenshot of http://www.bbc.co.uk/news/world-us-canada-15051554]
  56. Boilerplate removal: URL tree
     URL parts: protocol :// domain / path / file ? query
     Example: http:// | kt.ijs.si | /a/b/ | c.html | ?pg=0
     Tree branch: # → si → ijs → kt → a → b (root, then domain, then path)
     http://www.bbc.co.uk/news/world-us-canada-15051554: # → uk → co → bbc → www → news
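The mapping from a URL to its tree branch can be sketched with the standard library: start at the root "#", append the domain labels in reverse order, then the path directories, and drop the file name and query. The function name is an assumption, as is treating the last path segment as the file.

```python
from urllib.parse import urlsplit

def url_branch(url):
    """Root '#', reversed domain labels, then path directories (file and query dropped)."""
    parts = urlsplit(url)
    domain = list(reversed(parts.hostname.split(".")))
    segments = [s for s in parts.path.split("/") if s]
    directories = segments[:-1]          # assumption: the last path segment is the file
    return ["#"] + domain + directories

print(url_branch("http://kt.ijs.si/a/b/c.html?pg=0"))
print(url_branch("http://www.bbc.co.uk/news/world-us-canada-15051554"))
```

Both results agree with the branches shown on the slide.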
  57. Boilerplate removal: URL tree
     [Figure: documents from the stream passing through the tree levels Root (#), Domain, Path]
     “How many times did I see “About Us” in this part of the tree?”
     This method is …
     • Unsupervised
     • Online
     • Incremental (consumes one document at a time)
  58. Text stream acquisition and preprocessing [Pipeline diagram repeated from slide 51]
  59. Language detection
     • Motivation: language-specific text analysis components and applications
     • Solutions based on word lists and word or character sequences (n-grams)
     • Character n-gram model
       – Build character n-gram histograms for many languages (language models)
       – Compare text document histogram to language models
  60. Language detection
     Character n-grams by rank (callouts in the figure: English THE; German DER, DEN)
     Rank  English  German
        1  E        E
        2  T        N
        3  O        R
        4  A        I
        5  N        T
        6  I        S
        7  H        A
        8  S        D
        9  R        U
       10  D        EN
       11  E_       G
       12  L        ER
       13  _T       H
       14  TH       L
       15  HE       N_
       16  U        O
       17  W        M
       18  C        _D
       19  M        C
      ...  ...      ...
  61. Language detection
     Article: “Egypt rejoices at Mubarak departure”
     [Two scatter plots: English article (n-gram rank) against English language model (n-gram rank), and English article (n-gram rank) against German language model (n-gram rank)]
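One common way to compare a document's n-gram ranking with language models is the "out-of-place" distance: sum, over the document's n-grams, how far each rank is from its rank in the model. The slides do not say which comparison is used, so the sketch below is an assumption, as are the function names, the penalty for unseen n-grams, and the tiny sample texts standing in for real language models.

```python
from collections import Counter

def ngram_ranks(text, max_n=3, top=300):
    """Rank character n-grams (n = 1..max_n) by frequency; '_' stands for a space."""
    padded = "_" + "_".join(text.upper().split()) + "_"
    counts = Counter(
        padded[i:i + n] for n in range(1, max_n + 1) for i in range(len(padded) - n + 1)
    )
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top), start=1)}

def out_of_place(doc_ranks, model_ranks):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    penalty = len(model_ranks) + 1
    return sum(
        abs(rank - model_ranks[g]) if g in model_ranks else penalty
        for g, rank in doc_ranks.items()
    )

def detect(text, models):
    """Return the language whose model is closest to the text's n-gram ranking."""
    doc = ngram_ranks(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))

# Toy "language models" built from one sentence each (real models need far more text).
models = {
    "en": ngram_ranks("the quick brown dog jumps over the lazy dog and the cat"),
    "de": ngram_ranks("der schnelle braune hund springt über den faulen hund und die katze"),
}
print(detect("the dog and the cat", models))
```

A document written in the model's language yields points close to the diagonal in a rank-versus-rank plot, which corresponds to a small out-of-place distance.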
  62. Text stream acquisition and preprocessing [Pipeline diagram repeated from slide 51]
  63. Online BOW [Diagram: documents from the stream enter a queue of TF vectors and outdated ones leave it; DF values are updated on Add and on Remove]
  64. Online BOW [Diagram: stream, queue of TF vectors, outdated documents, DF values; TF and DF are combined into TF-IDF]
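The diagram suggests a sliding window: a queue holds the TF vectors of recent documents, and the DF values are raised when a document is added and lowered when an outdated one is removed. A minimal sketch under those assumptions follows; the class name, the count-based window (rather than a time-based one), and the ln(N / DF) weighting are not taken from the slides.

```python
import math
from collections import Counter, deque

class OnlineBow:
    def __init__(self, max_docs):
        self.max_docs = max_docs
        self.queue = deque()      # TF vectors of the documents currently in the window
        self.df = Counter()       # document frequencies over the window

    def add(self, lemmas):
        """Consume one document and return its TF-IDF vector."""
        tf = Counter(lemmas)
        self.queue.append(tf)
        self.df.update(tf.keys())                  # Add: raise DF values
        if len(self.queue) > self.max_docs:
            outdated = self.queue.popleft()
            self.df.subtract(outdated.keys())      # Remove: lower DF values
            self.df += Counter()                   # drop lemmas whose DF fell to zero
        return self.tfidf(tf)

    def tfidf(self, tf):
        n = len(self.queue)
        return {w: c * math.log(n / self.df[w]) for w, c in tf.items()}

bow = OnlineBow(max_docs=2)
bow.add(["dog", "dog", "quick"])
bow.add(["jump"])
v3 = bow.add(["dog"])     # pushes the first document out of the window
print(v3, dict(bow.df))
```

Only the TF vectors need to be stored; TF-IDF weights are computed on demand from the current DF values, so they always reflect the present window.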
  65. Where are we? [Process overview repeated from slide 50]
  66. Batch, incremental, offline, online
     • Batch learning: consuming all training examples at once
     • Incremental learning: consuming one example at a time
     • Mini-batch learning: consuming several examples at a time
     • Offline learning (for datasets/finite streams): all data is stored and can be accessed repeatedly
     • Online learning (for infinite streams): each example is discarded after being processed
  67. Incremental nearest centroid classifier [Figure: a class centroid updated as a new instance arrives and an outdated instance is removed]
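A centroid can be maintained incrementally by keeping, per class, the running sum of the vectors and their count: a new instance is added to the sum and an outdated one is subtracted, with no need to revisit the other instances. The sketch below illustrates this idea; the class and method names are hypothetical.

```python
from collections import defaultdict

class IncrementalCentroids:
    def __init__(self):
        self.sums = defaultdict(lambda: defaultdict(float))   # per-class vector sums
        self.counts = defaultdict(int)                        # per-class instance counts

    def add(self, vec, label):
        """New instance: add its weights to the class sum."""
        for t, w in vec.items():
            self.sums[label][t] += w
        self.counts[label] += 1

    def remove(self, vec, label):
        """Outdated instance: subtract its weights from the class sum."""
        for t, w in vec.items():
            self.sums[label][t] -= w
        self.counts[label] -= 1

    def centroid(self, label):
        n = self.counts[label]
        if n == 0:
            return {}
        return {t: s / n for t, s in self.sums[label].items() if s != 0.0}

model = IncrementalCentroids()
model.add({"a": 1.0}, "POS")
model.add({"a": 3.0, "b": 2.0}, "POS")
before = model.centroid("POS")
model.remove({"a": 1.0}, "POS")
after = model.centroid("POS")
print(before, after)
```

Each update costs time proportional to the number of non-zero terms in the affected instance, independent of how many instances the class contains.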
  68. Incremental k-means clustering: converges in only a few iterations (warm start)
  69. Other incremental methods
     • Incremental SVM: A. Bordes, S. Ertekin, J. Weston, and L. Bottou (2005): Fast Kernel Classifiers with Online and Active Learning, Journal of Machine Learning Research, vol. 6, pp. 1579–1619
     • Incremental perceptron: www.cs.columbia.edu/~jebara/4771/tutorials/perceptron.pdf
     • Incremental winnow: http://en.wikipedia.org/wiki/Winnow_%28algorithm%29
  70. Where are we? [Process overview repeated from slide 50]
  71. Document space visualization [Figure: from several 1000 dimensions to 2D]
  72. Document space visualization [Pipeline diagram with the components: Document corpus, Corpus preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation, Layout]
  73. Document space visualization [Figure]
  74. Document space visualization [Pipeline diagram from slide 72 with Online BOW added, annotated with: Maintaining sorted lists, Warm start (three components), Parallelization, Pipelining]
  75. Document space visualization. This video is available at http://first.ijs.si/tutorial/video/ameba.html
  76. Twitter
     • Platform for sending short messages (similar to SMS)
     • Est. 225 million users
     • 100 million accounts added in 2010
     • 65 million tweets per day
  77. Financial tweets
     • Informal $ sign convention
     • Some examples (March 19):
       – User#1: $AAPL is making an announcement at 9am on what it plans to do with its 97 billion in cash. We expect a dividend announcement
       – User#2: $AAPL over 600.00 a share in the pre-market on news of a dividend.
       – User#3: Will there be any other news besides $AAPL dividend?
     • We acquire ~13,000 tweets per weekday, for ~1,800 NASDAQ/NYSE stocks ($GOOG, $MSFT…)
     • We analyze tweets to determine whether they contain positive or negative vocabulary
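The informal $ sign convention makes it easy to find which stocks a tweet mentions. A small sketch with a regular expression follows; the pattern (one to five uppercase letters after the dollar sign) is an assumption about what ticker symbols look like, not something stated on the slide.

```python
import re

# Assumed pattern: '$' followed by 1 to 5 uppercase letters, ending at a word boundary.
CASHTAG = re.compile(r"\$([A-Z]{1,5})\b")

def cashtags(tweet):
    """Return the ticker symbols mentioned in a tweet, in order of appearance."""
    return CASHTAG.findall(tweet)

print(cashtags("$AAPL over 600.00 a share in the pre-market on news of a dividend."))
print(cashtags("It costs $5 to compare $GOOG and $MSFT"))
```

Amounts such as "$5" are skipped because the pattern requires letters directly after the dollar sign.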
  78. Sentiment classification
     • Labeled documents
       – POS: Financial markets are now officially open :)
       – POS: market intelligence GMI Interactive and Mintel Win ARF Great Minds Award for Quality in Research
       – POS: $AAPL : trust me -- AAPL will soar tomorrow
       – NEG: Oh how I miss the days with GBP was at least 2 times the AUD. Sterling forecast to hit all-time lows soon
       – NEG: omg! did you know BORDERS closed?! they went bankrupt last month and closed!! awww, too bad! i love borders!!
       – NEG: @aekins thats just too bad
       – ...
     • Learn to classify: labeled dataset → training algorithm → classification model
     • Classify unlabeled documents: unlabeled dataset + classification model → classification algorithm → predictions (labels)
       – Example: “So Nickelodeon filed for bankruptcy and announced that the next Kids Choice Awards will be its last.” → NEG
  79. Sentiment classification
     • Emoticons & SVM classifier
     • Neutral zone
     • Explanations
     • Accuracy
     Example tweets with emoticons:
       – Goodnight everyoneeee :) Love yall
       – I have a good feeling about today ;)
       – ooo the ice cream van is here... yaaaaaay :D
       – in the garden in the sun! Just about to fill the pool! happy days! :D
       – Finally got JSON in #processing to work. More playing around coming :)
       – @oanhLove I hate when that happens... :-/
       – No jobs, no money. how in the hell is min wage here 4 fn clams an hour? :(
       – I hate when I have to call and wake people up :(
       – I dont have any chalk! :-/ MY CHALKBOARD IS USELESS
       – UGHHHHHHHHHHHHHHH.. life is NOT good all the time!!!!!! ;(
  80. Sentiment classification [Figure: instances labeled + and – on the two sides of the SVM boundary]
  81. Sentiment classification: neutral zone [Figure: the same instances, with some near the boundary labeled 0]
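One plausible reading of the neutral zone is a band around the decision boundary: if the classifier's score for a tweet is too close to zero, the tweet is labeled 0 instead of + or –. The sketch below illustrates that reading with a linear model; the threshold, the weights, and the function names are hypothetical and not taken from the slides.

```python
def linear_score(weights, bias, vec):
    """Signed score of a linear model for a sparse {term: weight} vector."""
    return bias + sum(w * weights.get(t, 0.0) for t, w in vec.items())

def classify_with_neutral_zone(score, threshold=0.5):
    """'+' or '-' when the score is far enough from the boundary, otherwise '0'."""
    if score > threshold:
        return "+"
    if score < -threshold:
        return "-"
    return "0"

# Hypothetical model weights for three terms.
weights = {"soar": 1.2, "bankrupt": -1.5, "market": 0.1}
for tweet_vec in ({"soar": 1.0}, {"bankrupt": 1.0}, {"market": 1.0}):
    score = linear_score(weights, 0.0, tweet_vec)
    print(tweet_vec, classify_with_neutral_zone(score))
```

Widening the threshold trades coverage for confidence: fewer tweets receive a polarity label, but those that do lie further from the boundary.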
  82. Sentiment classification: explanations
     “Sovereign debt and unemployment are big issues in EU.”
     [Word groups shown: unemployed, issues, debt, eu | sovereign, big]
  83. Sentiment classification: accuracy
     Preprocessing options (one table column each, switched on where marked X; the column positions of the X marks are not preserved in this transcript): replace usernames with a token, replace URLs with a token, remove letter repetition, replace negations with a token, replace exclamation marks with a token, replace question marks with a token
     Options on      Accuracy   Precision/recall   Average accuracy (10-fold cross validation)
     X X             81.06%     81.32%/81.32%      76.98%
     X X X X X X     80.22%     82.08%/78.02%      77.43%
     X X X           79.94%     77.78%/84.62%      77.10%
     X X X           79.94%     76.70%/86.81%      77.53%
     X X X           79.67%     80.79%/78.57%      76.85%
     X               78.83%     77.60%/81.87%      77.29%
     X X (this row and the next combined)
                     78.55%     75.86%/84.62%      76.91%
                     78.55%     77.78%/80.77%      76.93%
     X X X X         78.27%     80.23%/75.82%      76.93%
     X X X           78.27%     76.53%/82.42%      77.04%
     X X X X X       77.44%     75.12%/82.97%      76.86%
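The six preprocessing options named in the table header can be sketched as a chain of regular-expression substitutions. This is an illustration only: the token names, the exact patterns, and the small negation word list are assumptions, and here all six options are always applied.

```python
import re

def preprocess(tweet):
    """Apply the six preprocessing options to a tweet (all tokens are hypothetical)."""
    text = re.sub(r"@\w+", "USERNAME", tweet)                   # replace usernames with a token
    text = re.sub(r"https?://\S+", "URL", text)                 # replace URLs with a token
    text = re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)           # remove letter repetition (3+ -> 2)
    text = re.sub(r"\b(?:not|no|never|\w+n't)\b", "NEGATION",   # replace negations with a token
                  text, flags=re.IGNORECASE)
    text = re.sub(r"!+", " EXCLAMATION ", text)                 # replace exclamation marks
    text = re.sub(r"\?+", " QUESTION ", text)                   # replace question marks
    return " ".join(text.split())                               # tidy whitespace

print(preprocess("@aekins thats just too bad!!! see http://t.co/x"))
print(preprocess("yaaaaay, is it not good?"))
```

In an experiment like the one in the table, each substitution would be switched on or off independently to measure its effect on accuracy.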
  84. [Chart legend] Grey: Netflix stock closing price. Blue: the number of positive tweets. Red: the number of negative tweets. Yellow: the difference between the positive and negative tweets. Green dots: relevant events concerning Netflix.
  85. [Chart annotations] Volume peaks likely represent important events: first-quarter earnings release; plans to launch in 43 countries in Latin America and the Caribbean; Netflix loses TV shows and films; Netflix loses the Starz deal.
  86. Sentiment cross-over happens before price plunge [Chart annotation: Sentiment cross-over]
  87. Presidential elections (http://predsedniskevolitve.si)
  88. Recap
     • Basics
       – What is text stream mining?
       – Pipelining, parallelization
       – Web data acquisition
       – Online BOWs
     • Machine learning
       – Batch, incremental, offline, online
       – Incremental nearest centroid classifier
       – Incremental k-means
       – Warm start
     • Applications
       – Online document space visualization
       – Online Twitter sentiment classifier
         • Stock sentiment monitoring
         • Presidential elections
