From NLP to text mining


  1. 1. From NLP to Text Mining Yi-Shin Chen Institute of Information Systems and Applications Department of Computer Science National Tsing Hua University yishin@gmail.com
  2. 2. About Speaker 陳宜欣 Yi-Shin Chen ▷ Currently • Associate Professor, Department of Computer Science, National Tsing Hua University • Director of the Intelligent Data Engineering and Applications Lab (IDEA Lab) ▷ Current Research • Music therapy + artificial intelligence • Emotion analysis and psychological analysis of text data 2 Planting blessings in the soil of education, nurturing them with love @ Yi-Shin Chen, NLP to Text Mining
  3. 3. Natural Language How is it formed? @ Yi-Shin Chen, NLP to Text Mining 3
  4. 4. Language ▷The method of human communication (Oxford dictionary) • Either spoken or written • Consisting of the use of words • In a structured and conventional way ▷The fundamental problem of communication is: • Reproducing at one point either exactly or approximately a message selected at another point • By Claude Shannon (1916–2001) @ Yi-Shin Chen, NLP to Text Mining 4
  5. 5. Communication @ Yi-Shin Chen, NLP to Text Mining 5 This is the best thing happened in my life Decode Semantic +Inference Produce EncodeThis == Best thing (happened (in my life)) He is happy Something to show we are buddy “I’m so happy for you”
  6. 6. Multilingual Communication @ Yi-Shin Chen, NLP to Text Mining 6 Decode Semantic +Inference Encode Decode Semantic +Inference Produce Encode
  7. 7. The Father of Modern Linguistics ▷Noam Chomsky • “a language to be a set (finite or infinite) of sentences, each finite in length and constructed out of a finite set of elements” • “the structure of language is biologically determined” • “that humans are born with an innate linguistic ability that constitutes a Universal Grammar will also be examined” → Syntactic Structures 句法結構 7 Noam Chomsky (1928 - current) @ Yi-Shin Chen, NLP to Text Mining
  8. 8. Basic Concepts in NLP 8 This is the best thing happened in my life. Det. Det. NN PNPre.Verb VerbAdj 辭彙分析 Lexical analysis (Part-of Speech Tagging 詞性標註) 句法分析 Syntactic analysis (Parsing) Noun Phrase Prep Phrase Prep Phrase Noun Phrase Sentence @ Yi-Shin Chen, NLP to Text Mining
  9. 9. Parsing ▷Parsing is the process of determining whether a string of tokens can be generated by a grammar @ Yi-Shin Chen, NLP to Text Mining 9 Lexical Analyzer Parser Input token getNext Token Symbol table parse tree Rest of Front End Intermediate representation Output of the encoder should be equivalent to the input
  10. 10. Parsing Example (for Compiler) ▷Grammar: • E :: = E op E | - E | (E) | id • op :: = + | - | * | / @ Yi-Shin Chen, NLP to Text Mining 10 a * - ( b + c ) E id op E ( )E id op - E id
  11. 11. Basic Concepts in NLP 11 This is the best thing happened in my life. Det. Det. NN PNPre.Verb VerbAdj 辭彙分析 Lexical analysis (Part-of Speech Tagging 詞性標註) 句法分析 Syntactic analysis (Parsing) This? (t1) Best thing (t2) My (m1) Happened (t1, t2, m1) 語意分析 Semantic Analysis Happy (x) if Happened (t1, ‘Best’, m1) Happy 推理 Inference Noun Phrase Prep Phrase Prep Phrase Noun Phrase Sentence @ Yi-Shin Chen, NLP to Text Mining
  12. 12. NLP to Natural Language Understanding (NLU) 12 https://nlp.stanford.edu/~wcmac/papers/20140716-UNLU.pdf NLP NLU Named Entity Recognition (NER) Part-Of-Speech Tagging (POS) Text Categorization Co-Reference Resolution Machine Translation Syntactic Parsing Question Answering (QA) Relation Extraction Semantic Parsing Paraphrase & Natural Language Inference Sentiment Analysis Dialogue Agents Summarization Automatic Speech Recognition (ASR) Text-To-Speech (TTS) @ Yi-Shin Chen, NLP to Text Mining
  13. 13. Challenges ▷Tagging/Parsing incorrect • 下午天留客天留我不留 → 下雨天留客 天留我不留 → 下雨天 留客天 留我不 留 → 下雨天 留客天 留我不留 ▷Inference incorrect • 玻璃杯碎了一地  玻璃杯不能用了 • 專家眼鏡碎滿地  專家眼鏡不能用了? ▷Language evolution • 安史之亂(唐) @ Yi-Shin Chen, NLP to Text Mining 13  安屎之亂(2018)
  14. 14. NLP Techniques For representation @ Yi-Shin Chen, NLP to Text Mining 14
  15. 15. NLP Techniques ▷Word segmentation* ▷Part of speech tagging (POS) ▷Stemming* ▷Syntactic Parsing ▷Named entity extraction ▷Co-reference resolution ▷Text categorization @ Yi-Shin Chen, NLP to Text Mining 15
  16. 16. Word Segmentation ▷In some languages, there is no explicit word boundary • 這地面積還真不小 • 人体内存在很多微生物 • うふふふふ 楽しむ ありがとうございます ▷Need segmentation tools • Chinese → Jieba: https://github.com/fxsjy/jieba → CKIP (Sinica): http://ckipsvr.iis.sinica.edu.tw/ → Or via any statistics analysis @ Yi-Shin Chen, NLP to Text Mining 16
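As an illustration of the tools listed above, a minimal segmentation sketch with Jieba (assuming the jieba package is installed; the sentence is the simplified-Chinese example from the slide, and the exact split may vary with the dictionary version):

```python
# Word segmentation with Jieba (pip install jieba).
import jieba

sentence = "人体内存在很多微生物"
print("/".join(jieba.cut(sentence)))   # e.g. 人体/内/存在/很多/微生物
```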
  17. 17. Part-of-speech (POS) Tagging ▷Processing text and assigning parts of speech to each word ▷Tags may vary - by different tagging sets • Noun (N), Adjective (A), Verb (V), URL (U)… @ Yi-Shin Chen, NLP to Text Mining 17 Happy Easter! I went to work and came home to an empty house now im going for a quick run Happy_A Easter_N !_, I_O went_V to_P work_N and_& came_V home_N to_P an_D empty_A house_N now_R im_L going_V for_P a_D quick_A run_N (JJ)Happy (NNP) Easter! (PRP) I (VBD) went (TO) to (VB) work (CC) and (VBD) came (NN) home (TO) to (DT) an (JJ) empty (NN) house (RB) now (VBP) im (VBG) going (IN) for (DT) a (JJ) quick (NN) run
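A minimal POS-tagging sketch with NLTK's default Penn Treebank tagger (assuming nltk and its punkt / averaged_perceptron_tagger resources are installed); it roughly reproduces the third tagging shown above:

```python
# Part-of-speech tagging with NLTK's default (Penn Treebank) tagger.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "Happy Easter! I went to work and came home to an empty house"
print(nltk.pos_tag(nltk.word_tokenize(text)))
# e.g. [('Happy', 'JJ'), ('Easter', 'NNP'), ('!', '.'), ('I', 'PRP'), ('went', 'VBD'), ...]
```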
  18. 18. Stemmer ▷ Reduce inflected/derived words to their base/root form • E.g., Porter Stemmer http://textanalysisonline.com/nltk-porter-stemmer ▷ To get more accurate statistics of words @ Yi-Shin Chen, NLP to Text Mining 18 Now, AI is poised to start an equally large transformation on many industries Now , AI is pois to start an equal larg transform on mani industry Porter Stemmer
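The reduction above can be reproduced with NLTK's PorterStemmer; a minimal sketch (stems may differ slightly across NLTK versions, e.g. "industries" may come out as "industri"):

```python
# Porter stemming with NLTK, applied to the slide's example sentence.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
sentence = "Now, AI is poised to start an equally large transformation on many industries"
print(" ".join(stemmer.stem(w) for w in sentence.split()))
```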
  19. 19. NLP Techniques ▷Word segmentation* ▷Part of speech tagging (POS) ▷Stemming* ▷Syntactic Parsing ▷Named entity extraction ▷Co-reference resolution ▷Text categorization @ Yi-Shin Chen, NLP to Text Mining 19 Obtain representations
  20. 20. Vector Space Model (Bag of Words) ▷Represent the keywords of objects using a term vector • Term: basic concept, e.g., keywords to describe an object • Each term represents one dimension in a vector • Values of each term in a vector corresponds to the importance of that term • Words are used as features without the order → 我欠你一個人情 = 你欠我一個人情 • Usually working with N-gram features 20@ Yi-Shin Chen, NLP to Text Mining
  21. 21. Term Frequency and Inverse Document Frequency (TFIDF) ▷Since not all objects in the vector space are equally important, we can weight each term using its occurrence probability in the object description • Term frequency: TF(d,t) → number of times t occurs in the object description d • Inverse document frequency: IDF(t) → to scale down the terms that occur in many descriptions 21@ Yi-Shin Chen, NLP to Text Mining
  22. 22. Normalizing Term Frequency ▷nij represents the number of times a term ti occurs in a description dj. tfij can be normalized using the total number of terms in the document • tf_ij = n_ij / NormalizedValue ▷Normalized value could be: • Sum of all frequencies of terms • Max frequency value • Any other value that keeps tf_ij between 0 and 1 • BM25*: tf_ij = n_ij × (k + 1) / (n_ij + k) 22@ Yi-Shin Chen, NLP to Text Mining
  23. 23. Inverse Document Frequency ▷ IDF seeks to scale down the coordinates of terms that occur in many object descriptions • For example, some stop words (the, a, of, to, and…) may occur many times in a description. However, they should be considered as non-important in many cases • idf_i = log(N / df_i) + 1 → where df_i (document frequency of term ti) is the number of descriptions in which ti occurs ▷ IDF can be replaced with ICF (inverse class frequency) and many other concepts based on applications 23@ Yi-Shin Chen, NLP to Text Mining
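A small sketch tying the last two slides together: max-normalized TF combined with the log-based IDF above, computed over a toy collection of three descriptions (the toy data is illustrative, not from the slides):

```python
# Toy TF-IDF: tf normalized by the maximum term frequency in each description,
# idf_i = log(N / df_i) + 1 as defined above.
import math
from collections import Counter

docs = [["dog", "eats", "fish"],
        ["cat", "eats", "fish", "and", "meat"],
        ["dog", "barks", "at", "the", "cat"]]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))   # document frequency of each term

def tfidf(doc):
    counts = Counter(doc)
    max_n = max(counts.values())
    return {t: (n / max_n) * (math.log(N / df[t]) + 1) for t, n in counts.items()}

for doc in docs:
    print(tfidf(doc))
```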
  24. 24. BOW Vectors ▷TFIDF+BOW is still close to the state of the art ▷Good baseline to start with @ Yi-Shin Chen, NLP to Text Mining 24 [Figure: term-document count matrix for Document 1–3 over the terms season, timeout, lost, win, game, score, ball, play, coach, team]
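In practice this TFIDF+BOW baseline is a single library call; a sketch with scikit-learn (assuming it is installed; the three sentences are made up to mimic the terms in the figure):

```python
# TFIDF bag-of-words baseline with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the game was lost after a timeout this season",
        "the team and the coach won the game",
        "the ball play decided the final score"]

vectors = TfidfVectorizer().fit_transform(docs)   # sparse document-term matrix
print(vectors.shape)
print(cosine_similarity(vectors))                 # pairwise document similarities
```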
  25. 25. Word2Vec With Machine Learning Flavor @ Yi-Shin Chen, NLP to Text Mining 25
  26. 26. Limitation of Bag of Words ▷No semantical relationship between words • Not designed to model linguistic knowledge ▷Sparsity • Due to high number of dimensions ▷Curse of dimensionality • When dimensionality increases, the distance between points becomes less meaningful @ Yi-Shin Chen, NLP to Text Mining 26
  27. 27. Word Context ▷Intuition: the context represents the semantics ▷Hypothesis: simpler models trained on more data will result in better word representations Work done by Mikolov et al. in 2013 @ Yi-Shin Chen, NLP to Text Mining 27 A medical doctor is a person who uses medicine to treat illness and injuries Some medical doctors only work on certain diseases or injuries Medical doctors examine, diagnose and treat patients These words represent doctor/doctors
  28. 28. Example ▷“king” – “man” + “woman” = • “queen” @ Yi-Shin Chen, NLP to Text Mining 28
  29. 29. Main Ideas of Word2Vec ▷Two models: • Continuous bag-of-words model • Continuous skip-gram model ▷Utilize a neural network to learn the weights of the word vectors @ Yi-Shin Chen, NLP to Text Mining 29
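A minimal Word2Vec sketch with Gensim (assuming gensim is installed); the sg flag selects CBOW (0) or skip-gram (1), and on a large corpus the commented analogy query approximates the king + woman - man example from the earlier slide:

```python
# Word2Vec with Gensim: sg=0 trains CBOW, sg=1 trains the skip-gram model.
from gensim.models import Word2Vec

sentences = [["i", "am", "eating", "good", "pizza", "now"],
             ["the", "dog", "eats", "meat"],
             ["the", "cat", "eats", "fish"]]          # toy corpus for illustration

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["pizza"][:5])          # first 5 values of the learned word vector
# With a real corpus:
# skipgram.wv.most_similar(positive=["king", "woman"], negative=["man"])
```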
  30. 30. Continuous Bag-of-Words Model ▷Continuous Bag-of-Words Model • Predict target word by the context words • Eg: Given a sentence and window size 2 @ Yi-Shin Chen, NLP to Text Mining 30 Ex: ([features], label) ([I, am, good, pizza], eating), ([am, eating, pizza, now], good) and so on and so forth. I am eating good pizza now Target word Context word Context word
  31. 31. Continuous Bag-of-Words Model (Contd.) @ Yi-Shin Chen, NLP to Text Mining 31 ▷ We train a neural network with a single hidden layer of width 3. [Diagram: each 1×6 one-hot context vector (vocabulary: I, am, eating, good, pizza, now) feeds the input layer; weight matrix W (6×3) maps it to the hidden layer; weight matrix W′ (3×6) maps the hidden layer to a softmax output (e.g. 0.1, 0.2, 0.4, 0.1, 0.1, 0.1), which is compared against the one-hot actual label; backward propagation updates the weights]
  32. 32. Continuous Bag-of-Words Model (Contd.) @ Yi-Shin Chen, NLP to Text Mining 32 ▷ The goal is to extract the word representation vector rather than the probability of the predicted target word, so we look inside the network. [Same network diagram as the previous slide]
  33. 33. Continuous Bag-of-Words Model (Contd.) @ Yi-Shin Chen, NLP to Text Mining 33 ▷ (Slide repeated.) The goal is to extract the word representation vector rather than the probability of the predicted target word. [Same network diagram as the previous slide]
  34. 34. Continuous Bag-of-Words Model (Contd.) ▷The output of the hidden layer is just the “word vector” for the input word @ Yi-Shin Chen, NLP to Text Mining 34 [Diagram: the 1×6 one-hot input (0, 0, 0, 0, 0, 1) multiplied by the 6×3 weight matrix W = [.2 .7 .2; .2 .3 .8; .19 .28 .22; .22 .23 .21; .1 .5 .3; .2 .13 .23] selects the last row (.2, .13, .23), which is the hidden-layer output, for the vocabulary I, am, eating, good, pizza]
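This can be checked in a few lines of NumPy: a one-hot input times W simply selects one row of W, which is the word's vector (the weights below are the illustrative numbers from the slide, not trained values):

```python
# A 1x6 one-hot vector multiplied by the 6x3 weight matrix W picks out one row of W.
import numpy as np

W = np.array([[0.20, 0.70, 0.20],
              [0.20, 0.30, 0.80],
              [0.19, 0.28, 0.22],
              [0.22, 0.23, 0.21],
              [0.10, 0.50, 0.30],
              [0.20, 0.13, 0.23]])

x = np.array([0, 0, 0, 0, 0, 1])      # one-hot encoding of the 6th vocabulary word
print(x @ W)                          # [0.2  0.13 0.23] == W[5], i.e. that word's vector
```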
  35. 35. Continuous Bag-of-Words Model (Contd.) ▷Disadvantage: • It cannot capture rare words well, which motivates the skip-gram algorithm. @ Yi-Shin Chen, NLP to Text Mining 35 [Example: “This is a good movie theater” with its target word and context words; a rare word such as “Marvelous” is hard for CBOW to capture]
  36. 36. Continuous Skip-gram Model ▷ Reverse of CBOW ▷ Predicts the representations of the words around the target word ▷ The context is specified by the window length @ Yi-Shin Chen, NLP to Text Mining 36
  37. 37. Continuous Skip-gram Model (Contd.) ▷Advantage: • It can capture rare words. • It captures semantic similarity between words. → synonyms like “intelligent” and “smart” would have very similar contexts @ Yi-Shin Chen, NLP to Text Mining 37
  38. 38. NLP Techniques For relationships @ Yi-Shin Chen, NLP to Text Mining 38
  39. 39. NLP Techniques ▷Word segmentation* ▷Part of speech tagging (POS) ▷Stemming* ▷Syntactic Parsing ▷Named entity extraction ▷Co-reference resolution ▷Text categorization @ Yi-Shin Chen, NLP to Text Mining 39
  40. 40. Named Entity Recognition (NER) ▷Find and classify all the named entities in a text ▷What’s a named entity? • A mention of an entity using its name. → Kansas Jayhawks • This is a subset of the possible mentions... → Kansas, Jayhawks, the team, it, they ▷Find means identify the exact span of the mention ▷Classify means determine the category of the entity being referred to 40@ Yi-Shin Chen, NLP to Text Mining
  41. 41. Named Entity Type 41@ Yi-Shin Chen, NLP to Text Mining
  42. 42. Ambiguity 42@ Yi-Shin Chen, NLP to Text Mining
  43. 43. Named Entity Recognition Approaches ▷As with partial parsing and chunking there are two basic approaches (and hybrids) • Rule-based (regular expressions) → Lists of names → Patterns to match things that look like names → Patterns to match the environments that classes of names tend to occur in. • Machine Learning-based approaches → Get annotated training data → Extract features → Train systems to replicate the annotation 43@ Yi-Shin Chen, NLP to Text Mining
  44. 44. Rule-Based Approaches ▷Employ regular expressions to extract data ▷Examples: • Telephone number: (\d{3}[-. ()]){1,2}[\dA-Z]{4} → 800-865-1125 → 800.865.1125 → (800)865-CARE • Software name extraction: ([A-Z][a-z]*\s*)+ → Installation Designer v1.1 44@ Yi-Shin Chen, NLP to Text Mining
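With the backslashes restored, the telephone pattern can be tested directly; a sketch (the pattern is deliberately loose, exactly as on the slide):

```python
# Rule-based extraction with the slide's telephone-number regular expression.
import re

phone = re.compile(r"(\d{3}[-. ()]){1,2}[\dA-Z]{4}")
for text in ["Call 800-865-1125 today", "Dial 800.865.1125", "Try (800)865-CARE"]:
    match = phone.search(text)
    print(match.group(0) if match else None)
```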
  45. 45. Machine Learning-Based Approaches 45@ Yi-Shin Chen, NLP to Text Mining
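As one concrete instance of the machine-learning route, an off-the-shelf statistical NER model can be applied in a few lines with spaCy (assuming spacy and its en_core_web_sm model are installed); this is a generic illustration, not the specific system on the slide:

```python
# Off-the-shelf statistical NER with spaCy's small English model.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Kansas Jayhawks played Duke in Lawrence on Saturday.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Kansas Jayhawks ORG, Duke ORG, Lawrence GPE, Saturday DATE
```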
  46. 46. Co-reference Resolution ▷Find all expressions that refer to certain entity in a text ▷Common approaches for relation extractors • Hand-written patterns • Supervised machine learning • Semi-supervised and unsupervised → Bootstrapping (using seeds) → Distant supervision → Unsupervised learning from the web @ Yi-Shin Chen, NLP to Text Mining 46
  47. 47. Hearst's Patterns for IS-A Relations ▷"Y such as X ((, X)* (, and|or) X)" ▷"such Y as X" ▷"X or other Y" ▷"X and other Y" ▷"Y including X" ▷"Y, especially X" @ Yi-Shin Chen, NLP to Text Mining 47 (Hearst, 1992): Automatic Acquisition of Hyponyms
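A minimal sketch of applying the first pattern ("Y such as X") with a regular expression over raw text; real systems match over parsed noun phrases, so this is only a rough approximation:

```python
# Extract (hyponym, IS-A, hypernym) candidates with the "Y such as X" Hearst pattern.
import re

pattern = re.compile(r"(\w+) such as (\w+(?:, \w+)*(?:,? (?:and|or) \w+)?)")

text = "She likes fruits such as apples, oranges and bananas."
for hypernym, xs in pattern.findall(text):
    for hyponym in re.split(r",\s*|\s+(?:and|or)\s+", xs):
        print(hyponym, "IS-A", hypernym)   # apples IS-A fruits, oranges IS-A fruits, ...
```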
  48. 48. Extracting Richer Relations Using Rules ▷ Intuition: relations often hold between specific entities • located-in (ORGANIZATION, LOCATION) • founded (PERSON, ORGANIZATION) • cures (DRUG, DISEASE) ▷ Start with Named Entity tags to help extract relation 48 Content Slides by Prof. Dan Jurafsky
  49. 49. Relation Types ▷As with named entities, the list of relations is application specific. For generic news texts... 49@ Yi-Shin Chen, NLP to Text Mining
  50. 50. Relation Bootstrapping (Hearst 1992) ▷Gather a set of seed pairs that have relation R ▷Iterate: 1. Find sentences with these pairs 2. Look at the context between or around the pair and generalize the context to create patterns 3. Use the patterns for grep for more pairs 50Content Slides by Prof. Dan Jurafsky
  51. 51. Bootstrapping ▷<Mark Twain, Elmira> Seed tuple • Grep (google) for the environments of the seed tuple “Mark Twain is buried in Elmira, NY.” X is buried in Y “The grave of Mark Twain is in Elmira” The grave of X is in Y “Elmira is Mark Twain’s final resting place” Y is X’s final resting place ▷Use those patterns to grep for new tuples ▷Iterate 51Content Slides by Prof. Dan Jurafsky
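A toy sketch of this loop (my own simplification: regex patterns over a handful of hard-coded sentences instead of real web search):

```python
# Toy relation bootstrapping: grow (person, burial place) pairs from one seed tuple.
import re

sentences = [
    "Mark Twain is buried in Elmira.",
    "The grave of Mark Twain is in Elmira",
    "Charles Dickens is buried in Westminster Abbey.",
]
pairs = {("Mark Twain", "Elmira")}        # seed tuple
patterns = set()

for _ in range(2):                        # iterate
    for x, y in list(pairs):              # 1) find sentences containing a known pair
        for s in sentences:
            if x in s and y in s:         # 2) generalize the context into a pattern
                patterns.add(re.escape(s)
                             .replace(re.escape(x), "(?P<X>.+?)")
                             .replace(re.escape(y), "(?P<Y>.+?)"))
    for p in patterns:                    # 3) use the patterns to "grep" for new pairs
        for s in sentences:
            m = re.fullmatch(p, s)
            if m:
                pairs.add((m.group("X"), m.group("Y")))

print(pairs)   # the Dickens pair is found via the "X is buried in Y." pattern
```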
  52. 52. Possible Seed: Wikipedia Infobox ▷ Infoboxes are kept in a namespace separate from articles • Namespace example: Special:SpecialPages; Wikipedia:List of infoboxes • Example: {{Infobox person |name = Casanova |image = Casanova_self_portrait.jpg |caption = A self portrait of Casanova ... |website = }} @ Yi-Shin Chen, NLP to Text Mining 52
  53. 53. Concept-based Model ▷ESA (Egozi, Markovitch, 2011) • Every Wikipedia article represents a concept • TF-IDF is used to infer concepts from a document • Relies on a manually-maintained knowledge base 53@ Yi-Shin Chen, NLP to Text Mining
  54. 54. Yago ▷ YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW 2007 ▷ Unification of Wikipedia & WordNet ▷ Make use of rich structures and information • Infoboxes, Category Pages, etc. 54@ Yi-Shin Chen, NLP to Text Mining
  55. 55. NLP Tools @ Yi-Shin Chen, NLP to Text Mining 55
  56. 56. Parsing Tools - English ▷Berkeley Parser: • http://tomato.banatao.berkeley.edu:8080/parser/parser.html ▷Stanford CoreNLP • http://nlp.stanford.edu:8080/corenlp/ ▷Stanford Parser • Support: English, Simplified Chinese, Arabic, French, Spanish • http://nlp.stanford.edu:8080/parser/index.jsp @ Yi-Shin Chen, NLP to Text Mining 56
  57. 57. Parsing Tools - Chinese ▷Stanford Parser • Support: English, Simplified Chinese, Arabic, French, Spanish • http://nlp.stanford.edu:8080/parser/index.jsp ▷語言雲 (LTP-Cloud) • https://www.ltp-cloud.com/intro/ ▷CKIP (Traditional Chinese) • http://ckipsvr.iis.sinica.edu.tw/ @ Yi-Shin Chen, NLP to Text Mining 57
  58. 58. More Tools ▷NLTK (Python): tokenize, tag, NE extraction, show parsing tree • Porter stemmer • n-grams ▷spaCy: industrial-strength NLP in Python @ Yi-Shin Chen, NLP to Text Mining 58
  59. 59. Semantic Resources ▷Wordnet ▷Google Knowledge Graph API ▷Hownet (Simplified Chinese) ▷E-hownet (Traditional Chinese) ▷Yago ▷DBPedia @ Yi-Shin Chen, NLP to Text Mining 59
  60. 60. Text Mining @ Yi-Shin Chen, NLP to Text Mining 60
  61. 61. Data (Text vs. Non-Text) 61 World Sensor Data Interpret by Report Weather Thermometer, Hygrometer 24°C, 55% Location GPS 37°N, 123°E Body Sphygmomanometer, MRI, etc. 126/70 mmHg World To be or not to be.. human Subjective Objective @ Yi-Shin Chen, NLP to Text Mining
  62. 62. Data Mining vs. Text Mining 62 Non-text data • Numerical Categorical Relational Text data • Text Data Mining • Clustering • Classification • Association Rules • … Text Processing (including NLP) @ Yi-Shin Chen, NLP to Text Mining Preprocessing
  63. 63. Preprocessing in Reality 63@ Yi-Shin Chen, NLP to Text Mining
  64. 64. @ Yi-Shin Chen, NLP to Text Mining 64 一般資料:(General Data) ──── 職業: 無 種族: 客家 婚姻: married 旅遊史:No recent travel history in three months 接觸史:無 群聚:無 職業病史:Nil 資料來源:Patient herself and her daughter 主訴:(Chief Complaint) ── Sudden onest short of breath with cold sweating noted five days ago. ( since 06/09) 現在病症:(Present Illness) ──── This 60 year-old female had hypertension for 10 years and diabetes mellitus for 5 years that had regular medical control. As her mentioned, she got similar episode attacked three months ago with initail presentations of short of breat, DOE, orthopnea. She went to LMD for help with CAD, 3-V-D, s/p PTCA stenting x 3 for one vessels in 2012/03. She got regular CV OPD f/u at there and the medications was tooked. Since after, she had increatment of the oral water intake amounts. The urine output seems to be adequate and no body weight change or legs edema noted in recent three months. This time, acute onset severe dyspnea with orthopnea, DOE, heavy sweating and oral thirsty noted on 06/09. He had no fever, chills, nausea, vomiting, palpitation, cough, chest tightness, chest pain, palpitation, abdominal discomfort noticed. For the symptoms intolerable, he came to our ED for help with chest x film revealed cardiomegaly and elevations of cardiac markers noted. The cardiologist was consulted and the heparinization was applied. The CPK level had no elevation at regular f/u examinations, and her symptoms got much improved after. The cardiosonogram reported impaired LV systolic function and moderate MR. She was admitted for further suvery and managements to the acute ischemic heart disease. Align /Classify the attributes correctly
  65. 65. Language Detection ▷To detect the language (or possible languages) in which the specified text is written ▷Difficulties • Short messages • Different languages in one statement • Noisy text 65 職業: 無 種族: 客家 婚姻: married 旅遊史:No recent travel history in three months 接觸史:無 群聚:無 職業病史:Nil 資料來源:Patient herself and her daughter @ Yi-Shin Chen, NLP to Text Mining
  66. 66. Data Cleaning ▷Special characters ▷Utilize regular expressions to clean data 66 Unicode emotions ☺, ♥… Symbol icons ☏, ✉… Currency symbols €, £, $... Tweet URLs Filter out non-(letters, space, punctuation, digit) ◕‿◕ Friendship is everything ♥ ✉ xxxx@gmail.com I added a video to a @YouTube playlist http://t.co/ceYX62StGO Jamie Riepe (^|\s*)http(\S+)?(\s*|$) (\p{L}+)|(\p{Z}+)|(\p{Punct}+)|(\p{Digit}+) @ Yi-Shin Chen, NLP to Text Mining
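A cleaning sketch following the two expressions above; it uses the third-party regex module because Python's built-in re has no \p{...} Unicode classes (the general categories \p{L}/\p{Z}/\p{P}/\p{N} stand in for the slide's Letter/Space/Punct/Digit classes):

```python
# Tweet cleaning: drop URLs, then keep only letters, separators, punctuation and digits.
import regex   # pip install regex; supports \p{...} Unicode property classes

def clean(tweet):
    tweet = regex.sub(r"(^|\s*)http(\S+)?(\s*|$)", " ", tweet)        # remove URLs
    kept = regex.findall(r"\p{L}+|\p{Z}+|\p{P}+|\p{N}+", tweet)       # keep allowed classes
    return "".join(kept).strip()

print(clean("◕‿◕ Friendship is everything ♥ ✉ xxxx@gmail.com"))
print(clean("I added a video to a @YouTube playlist http://t.co/ceYX62StGO Jamie Riepe"))
```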
  67. 67. Part-of-speech (POS) Tagging ▷Processing text and assigning parts of speech to each word ▷Twitter POS tagging • Noun (N), Adjective (A), Verb (V), URL (U)… 67 This 60 year-old female had hypertension for 10 years and diabetes mellitus for 5 years that had regular medical control. This(D) 60(Num) year-old(Adj) female(N) had(V) hypertension (N) for(Pre) 10(Num) years(N) and(Con) diabetes(N) mellitus(N) for(pre) 5(Num) years(N) that(Det) had(V) regular(Adj) medical(Adj) control(N). @ Yi-Shin Chen, NLP to Text Mining
  68. 68. Stemming 68 ▷Problem: • Diabetes -> diabete This 60 year-old female had hypertension for 10 years and diabetes mellitus for 5 years that had regular medical control. have have year year @ Yi-Shin Chen, NLP to Text Mining
  69. 69. More Preprocesses for Different Languages ▷Chinese Simplified/Traditional Conversion ▷Word segmentation 69@ Yi-Shin Chen, NLP to Text Mining
  70. 70. Wrong Chinese Word Segmentation ▷Wrong segmentation • 這(Nep) 地面(Nc) 積(VJ) 還(D) 真(D) 不(D) 小(VH) http://t.co/QlUbiaz2Iz ▷Wrong word • @iamzeke 實驗(Na) 室友(Na) 多(Dfa) 危險(VH) 你(Nh) 不(D) 知道(VK) 嗎 (T) ? ▷Wrong order • 人體(Na) 存(VC) 內在(Na) 很多(Neqa) 微生物(Na) ▷Unknown word • 半夜(Nd) 逛團(Na) 購(VC) 看到(VE) 太(Dfa) 吸引人(VH) !! 70 地面|面積 實驗室|室友 存在|內在 未知詞: 團購 @ Yi-Shin Chen, NLP to Text Mining
  71. 71. Back to Mining Let’s come back to Mining 71@ Yi-Shin Chen, NLP to Text Mining
  72. 72. Data Mining vs. Text Mining 72 Non-text data • Numerical Categorical Relational • Precise • Objective Text data • Text • Ambiguous • Subjective Data Mining • Clustering • Classification • Association Rules • … Text Processing (including NLP) Text Mining @ Yi-Shin Chen, NLP to Text Mining Preprocessing
  73. 73. Overview of Data Mining Understand the objectivities 73@ Yi-Shin Chen, NLP to Text Mining
  74. 74. Tasks in Data Mining ▷Problems should be well defined at the beginning ▷Two categories of tasks [Fayyad et al., 1996] 74 Predictive Tasks • Predict unknown values • e.g., potential customers Descriptive Tasks • Find patterns to describe data • e.g., Friendship finding VIP Cheap Potential @ Yi-Shin Chen, NLP to Text Mining
  75. 75. Select Techniques ▷Problems could be further decomposed 75 Predictive Tasks • Classification • Ranking • Regression • … Descriptive Tasks • Clustering • Association rules • Summarization • … Supervised Learning Unsupervised Learning @ Yi-Shin Chen, NLP to Text Mining
  76. 76. Supervised vs. Unsupervised Learning ▷ Supervised learning • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on the training set ▷ Unsupervised learning • The class labels of training data is unknown • Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data 76@ Yi-Shin Chen, NLP to Text Mining
  77. 77. Classification ▷ Given a collection of records (training set ) • Each record contains a set of attributes • One of the attributes is the class ▷ Find a model for class attribute: • The model forms a function of the values of other attributes ▷ Goal: previously unseen records should be assigned a class as accurately as possible. • A test set is needed → To determine the accuracy of the model ▷Usually, the given data set is divided into training & test • With training set used to build the model • With test set used to validate it 77@ Yi-Shin Chen, NLP to Text Mining
  78. 78. Ranking ▷Produce a permutation to items in a new list • Items ranked in higher positions should be more important • E.g., Rank webpages in a search engine Webpages in higher positions are more relevant. 78@ Yi-Shin Chen, NLP to Text Mining
  79. 79. Regression ▷Find a function which models the data with the least error • The output might be a numerical value • E.g.: Predict the stock value 79@ Yi-Shin Chen, NLP to Text Mining
  80. 80. Clustering ▷Group data into clusters • Similar to the objects within the same cluster • Dissimilar to the objects in other clusters • No predefined classes (unsupervised classification) 80@ Yi-Shin Chen, NLP to Text Mining
  81. 81. Association Rule Mining ▷Basic concept • Given a set of transactions • Find rules that will predict the occurrence of an item • Based on the occurrences of other items in the transaction 81@ Yi-Shin Chen, NLP to Text Mining
  82. 82. Summarization ▷Provide a more compact representation of the data • Data: Visualization • Text – Document Summarization → E.g.: Snippet 82@ Yi-Shin Chen, NLP to Text Mining
  83. 83. Back to Text Mining Let’s come back to Text Mining 83@ Yi-Shin Chen, NLP to Text Mining
  84. 84. Landscape of Text Mining 84 World Sensor Data Interpret by Report World devices 24°C, 55% World To be or not to be.. human Non-text (Context) Text Subjective Objective Perceived by Express Mining knowledge about languages: Natural Language Processing & Text Representation; Word Association Mining @ Yi-Shin Chen, NLP to Text Mining
  85. 85. Landscape of Text Mining 85 World Sensor Data Interpret by Report World devices 24°C, 55% World To be or not to be.. human Non-text (Context) Text Subjective Objective Perceived by Express Mining content about the observers: Opinion Mining and Sentiment Analysis @ Yi-Shin Chen, NLP to Text Mining
  86. 86. Landscape of Text Mining 86 World Sensor Data Interpret by Report World devices 24°C, 55% World To be or not to be.. human Non-text (Context) Text Subjective Objective Perceived by Express Mining content about the World: Topic Mining, Contextual Text Mining @ Yi-Shin Chen, NLP to Text Mining
  87. 87. From NLP to Text Mining 87 This is the best thing happened in my life. Det. Det. NN PNPre.Verb VerbAdj String of Characters This? Happy This is the best thing happened in my life. String of Words POS Tags Best thing Happened My life Entity Period Entities Relationships Emotion The writer loves his new born baby Understanding (Logic predicates) Entity Deeper NLP Less accurate Closer to knowledge @ Yi-Shin Chen, NLP to Text Mining
  88. 88. NLP vs. Text Mining ▷Text Mining objectives • Overview • Know the trends • Accept noise @ Yi-Shin Chen, NLP to Text Mining 88 ▷NLP objectives • Understanding • Ability to answer • Immaculate
  89. 89. Word Relations Back to text 89@ Yi-Shin Chen, NLP to Text Mining
  90. 90. Word Relations ▷Paradigmatic: can be substituted for each other (similar) • E.g., cat & dog, run & walk ▷Syntagmatic: can be combined with each other (correlated) • E.g., cat and fights, dog and barks →These two basic and complementary relations can be generalized to describe relations of any items in a language 90 Animals Act Animals Act @ Yi-Shin Chen, NLP to Text Mining
  91. 91. Mining Word Associations ▷Paradigmatic • Represent each word by its context • Compute context similarity • Words with high context similarity ▷Syntagmatic • Count the number of times two words occur together in a context • Compare the co-occurrences with the corresponding individual occurrences • Words with high co-occurrences but relatively low individual occurrence 91@ Yi-Shin Chen, NLP to Text Mining
  92. 92. Paradigmatic Word Associations John’s cat eats fish in Saturday Mary’s dog eats meat in Sunday John’s cat drinks milk in Sunday Mary’s dog drinks beer in Tuesday 92 Act FoodTime Human Animals Own John Cat John’s cat Eat FishIn Saturday John’s --- eats fish in Saturday Mary’s --- eats meat in Sunday John’s --- drinks milk in Sunday Mary’s --- drinks beer in Tuesday Similar left content Similar right content Similar general content How similar are context (“cat”) and context (“dog”)? How similar are context (“cat”) and context (“John”)?  Expected Overlap of Words in Context (EOWC) Overlap (“cat”, “dog”) Overlap (“cat”, “John”) @ Yi-Shin Chen, NLP to Text Mining
  93. 93. Common Approach for EOWC: Cosine Similarity ▷ If d1 and d2 are two document vectors, then cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d. ▷ Example: d1 = 3 2 0 5 0 0 0 2 0 0 d2 = 1 0 0 0 0 0 0 1 0 2 d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5 ||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481 ||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449 cos(d1, d2) = 0.3150 → Overlap (“John”, “Cat”) = 0.3150 93@ Yi-Shin Chen, NLP to Text Mining
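The same numbers in a few lines of NumPy (a sketch to verify the arithmetic above):

```python
# Cosine similarity for the two example vectors above.
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # ~0.315
```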
  94. 94. Quality of EOWC? ▷ The more overlap the two context documents have, the higher the similarity would be ▷However: • It favors matching one frequent term very well over matching more distinct terms • It treats every word equally (overlap on “the” should not be as meaningful as overlap on “eats”) ▷Solution? • TFIDF 94@ Yi-Shin Chen, NLP to Text Mining
  95. 95. Mining Word Associations ▷Paradigmatic • Represent each word by its context • Compute context similarity • Words with high context similarity ▷Syntagmatic • Count the number of times two words occur together in a context • Compare the co-occurrences with the corresponding individual occurrences • Words with high co-occurrences but relatively low individual occurrence 95@ Yi-Shin Chen, NLP to Text Mining
  96. 96. Syntagmatic Word Associations John’s cat eats fish in Saturday Mary’s dog eats meat in Sunday John’s cat drinks milk in Sunday Mary’s dog drinks beer in Tuesday 96 Act FoodTime Human Animals Own John Cat John’s cat Eat FishIn Saturday John’s *** eats *** in Saturday Mary’s *** eats *** in Sunday John’s --- drinks --- in Sunday Mary’s --- drinks --- in Tuesday What words tend to occur to the left of “eats” What words to the right? Whenever “eats” occurs, what other words also tend to occur? Correlated occurrences P(dog | eats) = ? ; P(cats | eats) = ? @ Yi-Shin Chen, NLP to Text Mining
  97. 97. Word Prediction Prediction Question: Is word W present (or absent) in this segment? 97 Text Segment (any unit, e.g., sentence, paragraph, document) Predict the occurrence of word W1 = ‘meat’ W2 = ‘a’ W3 = ‘unicorn’ @ Yi-Shin Chen, NLP to Text Mining
  98. 98. Word Prediction: Formal Definition ▷Binary random variable {0,1} • x_w = 1 if w is present, 0 if w is absent • P(x_w = 1) + P(x_w = 0) = 1 ▷The more random x_w is, the more difficult the prediction is ▷How do we quantitatively measure the randomness? 98@ Yi-Shin Chen, NLP to Text Mining
  99. 99. Entropy ▷Entropy measures the amount of randomness or surprise or uncertainty ▷Entropy is defined as: H(p_1, …, p_n) = −Σ_{i=1..n} p_i log p_i, where Σ_{i=1..n} p_i = 1 (n = number of choices) • entropy = 0 → easy to predict • entropy = 1 (the maximum for a binary variable) → difficult to predict [Plot: Entropy(p, 1−p) as a function of p, peaking at p = 0.5] @ Yi-Shin Chen, NLP to Text Mining
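A small sketch of the binary case H(p, 1−p), showing how the entropy moves from 0 (easy to predict) to 1 bit (hardest, at p = 0.5):

```python
# Binary entropy H(p, 1-p) in bits.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(entropy([p, 1 - p]), 3))   # 0.0, 0.469, 1.0, 0.469, 0.0
```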
  100. 100. Conditional Entropy 100 Know nothing about the segment: p(x_meat = 1), p(x_meat = 0); H(x_meat) = −p(x_meat = 0) log2 p(x_meat = 0) − p(x_meat = 1) log2 p(x_meat = 1). Know “eats” is present (x_eats = 1): p(x_meat = 1 | x_eats = 1), p(x_meat = 0 | x_eats = 1); H(x_meat | x_eats = 1) = −p(x_meat = 0 | x_eats = 1) log2 p(x_meat = 0 | x_eats = 1) − p(x_meat = 1 | x_eats = 1) log2 p(x_meat = 1 | x_eats = 1). In general: H(Y | X) = Σ_{x∈X} p(x) H(Y | X = x) @ Yi-Shin Chen, NLP to Text Mining
  101. 101. Mining Syntagmatic Relations ▷For each word W1 • For every word W2, compute the conditional entropy H(x_W1 | x_W2) • Sort all the candidate words in ascending order of H(x_W1 | x_W2) • Take the top-ranked candidate words with some given threshold ▷However • H(x_W1 | x_W2) and H(x_W1 | x_W3) are comparable • H(x_W1 | x_W2) and H(x_W3 | x_W2) are not → Because the upper bounds are different ▷Conditional entropy is not symmetric • H(x_W1 | x_W2) ≠ H(x_W2 | x_W1) 101@ Yi-Shin Chen, NLP to Text Mining
  102. 102. Mutual Information ▷I(x;y) = H(x) − H(x|y) = H(y) − H(y|x) ▷Properties: • Symmetric • Non-negative • I(x;y) = 0 iff x and y are independent ▷Allows us to compare different (x,y) pairs 102 [Venn diagram: H(x), H(y), H(x|y), H(y|x), I(x;y), H(x,y)] @ Yi-Shin Chen, NLP to Text Mining
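A small sketch computing I(x;y) = H(x) + H(y) − H(x,y), which is equivalent to the definition above, from a made-up joint distribution of two binary word-occurrence variables:

```python
# Mutual information I(x;y) = H(x) + H(y) - H(x,y) from a toy joint probability table.
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy joint distribution for x = "eats occurs" and y = "meat occurs" in a segment.
p_xy = {(1, 1): 0.20, (1, 0): 0.05, (0, 1): 0.05, (0, 0): 0.70}
p_x = [sum(v for (x, _), v in p_xy.items() if x == b) for b in (0, 1)]
p_y = [sum(v for (_, y), v in p_xy.items() if y == b) for b in (0, 1)]

mi = H(p_x) + H(p_y) - H(list(p_xy.values()))
print(round(mi, 4))   # > 0: "eats" and "meat" co-occur more often than independence predicts
```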
  103. 103. Topic Mining Assume we already know the word relationships 103@ Yi-Shin Chen, NLP to Text Mining
  104. 104. Landscape of Text Mining 104 World Sensor Data Interpret by Report World devices 24°C, 55% World To be or not to be.. human Non-text (Context) Text Subjective Objective Perceived by Express Mining knowledge about languages: Natural Language Processing & Text Representation; Word Association Mining @ Yi-Shin Chen, NLP to Text Mining
  105. 105. Topic Mining: Motivation ▷Topic: key idea in text data • Theme/subject • Different granularities (e.g., sentence, article) ▷Motivated applications, e.g.: • Hot topics during the debates in 2016 presidential election • What do people like about Windows 10 • What are Facebook users talking about today? • What are the most watched news? 105 @ Yi-Shin Chen, NLP to Text Mining
  106. 106. Tasks of Topic Mining 106 Text Data Topic 1 Topic 2 Topic 3 Topic 4 Topic n Doc1 Doc2 @ Yi-Shin Chen, NLP to Text Mining
  107. 107. Formal Definition of Topic Mining ▷Input • A collection of N text documents S = {d_1, d_2, d_3, …, d_N} • Number of topics: k ▷Output • k topics: θ_1, θ_2, θ_3, …, θ_k • Coverage of topics in each d_i: μ_i1, μ_i2, μ_i3, …, μ_ik ▷How to define topic θ_i? • Topic = term (word)? • Topic = classes? 107@ Yi-Shin Chen, NLP to Text Mining
  108. 108. Tasks of Topic Mining (Terms as Topics) 108 Text Data Politics Weather Sports Travel Technology Doc1 Doc2 @ Yi-Shin Chen, NLP to Text Mining
  109. 109. Problems with “Terms as Topics” ▷Not generic • Can only represent simple/general topic • Cannot represent complicated topics → E.g., “uber issue”: political or transportation related? ▷Incompleteness in coverage • Cannot capture variation of vocabulary ▷Word sense ambiguity • E.g., Hollywood star vs. stars in the sky; apple watch vs. apple recipes 109@ Yi-Shin Chen, NLP to Text Mining
  110. 110. Improved Ideas ▷Idea 1 (Probabilistic topic models): topic = word distribution • E.g.: Sports = {(Sports, 0.2), (Game, 0.01), (basketball, 0.005), (play, 0.003), (NBA, 0.01)…} • Pros: generic, easy to implement ▷Idea 2 (Concept topic models): topic = concept • Maintain concepts (manually or automatically) → E.g., ConceptNet 110@ Yi-Shin Chen, NLP to Text Mining
  111. 111. Possible Approaches for Probabilistic Topic Models ▷Bag-of-words approach: • Mixture of unigram language model • Expectation-maximization algorithm • Probabilistic latent semantic analysis • Latent Dirichlet allocation (LDA) model ▷Graph-based approach: • TextRank (Mihalcea and Tarau, 2004) • Reinforcement Approach (Xiaojun et al., 2007) • CollabRank (Xiaojun et al., 2008) 111@ Yi-Shin Chen, NLP to Text Mining Generative Model
  112. 112. Bag-of-words Assumption ▷Word order is ignored ▷“bag-of-words” – exchangeability ▷Theorem (De Finetti, 1935) – if x_1, x_2, …, x_N are infinitely exchangeable, then the joint probability p(x_1, x_2, …, x_N) has a representation as a mixture: ▷p(x_1, x_2, …, x_N) = ∫ p(θ) Π_{i=1..N} p(x_i | θ) dθ for some random variable θ @ Yi-Shin Chen, NLP to Text Mining 112
  113. 113. Latent Dirichlet Allocation ▷ Latent Dirichlet Allocation (D. M. Blei, A. Y. Ng, M. I. Jordan, 2003) • For one document: p(w | α, β) = ∫ p(θ | α) ( Π_{n=1..N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ • For a corpus D of M documents: p(D | α, β) = Π_{d=1..M} ∫ p(θ_d | α) ( Π_{n=1..N_d} Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) ) dθ_d @ Yi-Shin Chen, NLP to Text Mining 113
  114. 114. LDA Assumption ▷Assume: • When writing a document, you 1. Decide how many words 2. Decide distribution(P = Dir(𝛼) P = Dir(𝛽)) 3. Choose topics (Dirichlet) 4. Choose words for topics (Dirichlet) 5. Repeat 3 • Example 1. 5 words in document 2. 50% food & 50% cute animals 3. 1st word - food topic, gives you the word “bread”. 4. 2nd word - cute animals topic, “adorable”. 5. 3rd word - cute animals topic, “dog”. 6. 4th word - food topic, “eating”. 7. 5th word - food topic, “banana”. 114 “bread adorable dog eating banana” Document Choice of topics and words @ Yi-Shin Chen, NLP to Text Mining
  115. 115. LDA Learning (Gibbs) ▷ How many topics you think there are ? ▷ Randomly assign words to topics ▷ Check and update topic assignments (Iterative) • p(topic t | document d) • p(word w | topic t) • Reassign w a new topic, p(topic t | document d) * p(word w | topic t) 115 I eat fish and vegetables. Dog and fish are pets. My kitten eats fish. #Topic: 2I eat fish and vegetables. Dog and fish are pets. My kitten eats fish. I eat fish and vegetables. Dog and fish are pets. My kitten eats fish. p(red|1)=0.67; p(purple|1)=0.33; p(red|2)=0.67; p(purple|2)=0.33; p(red|3)=0.67; p(purple|3)=0.33; p(eat|red)=0.17; p(eat|purple)=0.33; p(fish|red)=0.33; p(fish|purple)=0.33; p(vegetable|red)=0.17; p(dog|purple)=0.33; p(pet|red)=0.17; p(kitten|red)=0.17; p(purple|2)*p(fish|purple)=0.5*0.33=0.165; p(red|2)*p(fish|red)=0.5*0.2=0.1;  fish p(red|1)=0.67; p(purple|1)=0.33; p(red|2)=0.50; p(purple|2)=0.50; p(red|3)=0.67; p(purple|3)=0.33; p(eat|red)=0.20; p(eat|purple)=0.33; p(fish|red)=0.20; p(fish|purple)=0.33; p(vegetable|red)=0.20; p(dog|purple)=0.33; p(pet|red)=0.20; p(kitten|red)=0.20; I eat fish and vegetables. Dog and fish are pets. My kitten eats fish. @ Yi-Shin Chen, NLP to Text Mining
  116. 116. Related Work – Topic Model (LDA) 116 ▷I eat fish and vegetables. ▷Dog and fish are pets. ▷My kitten eats fish. Sentence 1: 14.67% Topic 1, 85.33% Topic 2 Sentence 2: 85.44% Topic 1, 14.56% Topic 2 Sentence 3: 19.95% Topic 1, 80.05% Topic 2 LDA Topic 1 0.268 fish 0.210 pet 0.210 dog 0.147 kitten Topic 2 0.296 eat 0.265 fish 0.189 vegetable 0.121 kitten @ Yi-Shin Chen, NLP to Text Mining
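A minimal LDA sketch with Gensim on the same three toy sentences (assuming gensim is installed; with such a tiny corpus the learned topics will only loosely resemble the ones shown above):

```python
# LDA with Gensim on the three toy sentences.
from gensim import corpora, models

texts = [["eat", "fish", "vegetables"],
         ["dog", "fish", "pets"],
         ["kitten", "eats", "fish"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, random_state=1)
for topic_id in range(2):
    print(lda.print_topic(topic_id))            # word distribution of each topic
print(lda.get_document_topics(corpus[0]))       # topic coverage of the first sentence
```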
  117. 117. Possible Approaches for Probabilistic Topic Models ▷Bag-of-words approach: • Mixture of unigram language model • Expectation-maximization algorithm • Probabilistic latent semantic analysis • Latent Dirichlet allocation (LDA) model ▷Graph-based approach: • TextRank (Mihalcea and Tarau, 2004) • Reinforcement Approach (Xiaojun et al., 2007) • CollabRank (Xiaojun et al., 2008) 117@ Yi-Shin Chen, NLP to Text Mining
  118. 118. Construct Graph ▷ Directed graph ▷ Elements in the graph → Terms → Phrases → Sentences 118@ Yi-Shin Chen, NLP to Text Mining
  119. 119. Connect Term Nodes ▷Connect terms based on its slop. 119 I love the new ipod shuffle. It is the smallest ipod. Slop=1 I love the new ipod shuffle itis smallest Slop=0 @ Yi-Shin Chen, NLP to Text Mining
  120. 120. Connect Phrase Nodes ▷Connect phrases to • Compound words • Neighbor words 120 I love the new ipod shuffle. It is the smallest ipod. I love the new ipod shuffle itis smallest ipod shuffle @ Yi-Shin Chen, NLP to Text Mining
  121. 121. Connect Sentence Nodes ▷Connect to • Neighbor sentences • Compound terms • Compound phrase @ Yi-Shin Chen, NLP to Text Mining 121 I love the new ipod shuffle. It is the smallest ipod. ipod new shuffle it smallest ipod shuffle I love the new ipod shuffle. It is the smallest ipod.
  122. 122. Edge Weight Types 122 I love the new ipod shuffle. It is the smallest ipod. I love the new ipod shuffleitis smallest ipod shuffle @ Yi-Shin Chen, NLP to Text Mining
  123. 123. Graph-based Ranking ▷Scores for each node (TextRank 2004): WS(V_i) = (1 − d) + d × Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] × WS(V_j), where d is the damping factor 123 Example for the node “the”, with parent nodes new, love, is (scores 0.1, 0.2, 0.3 and edge weights 0.5, 0.6, 0.7): y_the = (1 − 0.85) + 0.85 × (0.1×0.5 + 0.2×0.6 + 0.3×0.7) @ Yi-Shin Chen, NLP to Text Mining
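The single update above in a few lines (a sketch, assuming the 0.5/0.6/0.7 values are the already-normalized weights on the edges from the parent nodes):

```python
# One weighted-TextRank update for the node "the".
d = 0.85                                  # damping factor
parents = [("new", 0.1, 0.5),             # (parent, WS(parent), normalized edge weight)
           ("love", 0.2, 0.6),
           ("is", 0.3, 0.7)]

ws_the = (1 - d) + d * sum(score * weight for _, score, weight in parents)
print(round(ws_the, 3))   # 0.473
```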
  124. 124. Result 124 the 1.45 ipod 1.21 new 1.02 is 1.00 shuffle 0.99 it 0.98 smallest 0.77 love 0.57 @ Yi-Shin Chen, NLP to Text Mining
  125. 125. Graph-based Extraction • Pros → Structure and syntax information → Mutual influence • Cons → Common words get higher scores 125@ Yi-Shin Chen, NLP to Text Mining
  126. 126. Summary: Probabilistic Topic Models ▷Probabilistic topic models: topic = word distribution • E.g.: Sports = {(Sports, 0.2), (Game, 0.01), (basketball, 0.005), (play, 0.003), (NBA, 0.01)…} • Pros: generic, easy to implement • Cons: not easy to understand/communicate • Cons: not easy to construct semantic relationships between topics 126 Topic? Topic = {(Crooked, 0.02), (dishonest, 0.001), (News, 0.0008), (totally, 0.0009), (total, 0.000009), (failed, 0.0006), (bad, 0.0015), (failing, 0.00001), (presidential, 0.0000008), (States, 0.0000004), (terrible, 0.0000085), (failed, 0.000021), (lightweight, 0.00001), (weak, 0.0000075), ……} Most commonly used words in Trump insults @ Yi-Shin Chen, NLP to Text Mining
  127. 127. Improved Ideas ▷Idea 1 (Probabilistic topic models): topic = word distribution • E.g.: Sports = {(Sports, 0.2), (Game, 0.01), (basketball, 0.005), (play, 0.003), (NBA, 0.01)…} • Pros: generic, easy to implement ▷Idea 2 (Concept topic models): topic = concept • Maintain concepts (manually or automatically) → E.g., ConceptNet 127@ Yi-Shin Chen, NLP to Text Mining
  128. 128. NLP Related Approach: Named Entity Recognition ▷Find and classify all the named entities in a text. ▷What’s a named entity? • A mention of an entity using its name. → Kansas Jayhawks • This is a subset of the possible mentions... → Kansas, Jayhawks, the team, it, they ▷Find means identify the exact span of the mention ▷Classify means determine the category of the entity being referred to 128@ Yi-Shin Chen, NLP to Text Mining
  129. 129. Named Entity Recognition Approaches ▷As with partial parsing and chunking there are two basic approaches (and hybrids) • Rule-based (regular expressions) → Lists of names → Patterns to match things that look like names → Patterns to match the environments that classes of names tend to occur in. • Machine Learning-based approaches → Get annotated training data → Extract features → Train systems to replicate the annotation 129@ Yi-Shin Chen, NLP to Text Mining
  130. 130. Opinion Mining How do people feel? 130@ Yi-Shin Chen, NLP to Text Mining
  131. 131. Landscape of Text Mining 131 World Sensor Data Interpret by Report World devices 24°C, 55% World To be or not to be.. human Non-text (Context) Text Subjective Objective Perceived by Express Mining content about the observers: Opinion Mining and Sentiment Analysis @ Yi-Shin Chen, NLP to Text Mining
  132. 132. Opinion ▷a subjective statement describing a person's perspective about something 132 Objective statement or Factual statement: can be proved to be right or wrong Opinion holder: Personalized / customized Depends on background, culture, context Target @ Yi-Shin Chen, NLP to Text Mining
  133. 133. Opinion Representation ▷Opinion holder: user ▷Opinion target: object ▷Opinion content: keywords? ▷Opinion context: time, location, others? ▷Opinion sentiment (emotion): positive/negative, happy or sad 133@ Yi-Shin Chen, NLP to Text Mining
  134. 134. Sentiment Analysis ▷Input: An opinionated text object ▷Output: Sentiment tag/Emotion label • Polarity analysis: {positive, negative, neutral} • Emotion analysis: happy, sad, anger ▷Naive approach: • Apply classification, clustering for extracted text features 134@ Yi-Shin Chen, NLP to Text Mining
  135. 135. Sentiment Representation ▷Categorical • Sentiment → Positive, neutral, negative • Stars • Emotions → Joy, anger, fear, sadness ▷Dimensional • Valence and arousal @ Yi-Shin Chen, NLP to Text Mining 135
  136. 136. Text Features ▷Character n-grams • Usually robust against spelling/recognition errors • Less meaningful ▷Word n-grams • n should be bigger than 1 for sentiment analysis ▷POS tag n-grams • Can be mixed with words and POS tags → E.g., “adj noun”, “sad noun” 136@ Yi-Shin Chen, NLP to Text Mining
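A small sketch of extracting character and word n-gram features (word bigrams use NLTK's ngrams helper; the sentence is made up):

```python
# Character and word n-gram features for a toy sentence.
from nltk import ngrams

text = "i am so sad about this terrible weather"
tokens = text.split()

word_bigrams = list(ngrams(tokens, 2))                         # word n-grams with n = 2
char_trigrams = [text[i:i + 3] for i in range(len(text) - 2)]  # character 3-grams

print(word_bigrams[:3])    # [('i', 'am'), ('am', 'so'), ('so', 'sad')]
print(char_trigrams[:5])   # ['i a', ' am', 'am ', 'm s', ' so']
```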
  137. 137. More Text Features ▷Word classes • Thesaurus: LIWC • Ontology: WordNet, Yago, DBPedia, SentiWordNet • Recognized entities: DBPedia, Yago ▷Frequent patterns in text • Could utilize pattern discovery algorithms • Optimizing the tradeoff between coverage and specificity is essential 137@ Yi-Shin Chen, NLP to Text Mining
  138. 138. LIWC ▷Linguistic Inquiry and word count • LIWC2015 ▷Home page: http://liwc.wpengine.com/ ▷>70 classes ▷ Developed by researchers with interests in social, clinical, health, and cognitive psychology ▷Cost: US$89.95 138@ Yi-Shin Chen, NLP to Text Mining
  139. 139. Chinese Corpus and Resource ▷Provided by NLP Lab of Academia Sinica • http://academiasinicanlplab.github.io/ • Corpus → NTCIR MOAT (Multilingual opinion analysis task) Corpus → EmotionLines: An Emotion Corpus of Multi-Party Conversations • Resource → 中文意見詞典NTUSD & ANTUSD說明 → NTUSD - NTU Sentiment Dictionary → ANTUSD - Augmented NTU Sentiment Dictionary → ACBiMA - Advanced Chinese Bi-Character Word Morphological Analyzer → CSentiPackage 1.0 (click here to request the password) → ACBiMA - Advanced Chinese Bi-Character Word Morphological Analyzer @ Yi-Shin Chen, NLP to Text Mining 139
  140. 140. Chinese Treebank ▷https://catalog.ldc.upenn.edu/ldc2013t21 @ Yi-Shin Chen, NLP to Text Mining 140 POS Tagged Syntactically Bracketed
  141. 141. Emotion Analysis: NLP Approach ▷Aggregation • Summing up opinion scores of characters/words ▷Weighted by structures • Morphological structures → Intra-word structures • Sentence syntactic structures → Inter-word structures ▷Composition • With deep neural networks @ Yi-Shin Chen, NLP to Text Mining 141
  142. 142. Emotion Analysis: Text Mining Approach ▷ Graph and Pattern Apporach • Carlos Argueta, Fernando Calderon, and Yi-Shin Chen, Multilingual Emotion Classifier using Unsupervised Pattern Extraction from Microblog Data, Intelligent Data Analysis - An International Journal, 2016 142@ Yi-Shin Chen, NLP to Text Mining
  143. 143. Collect Emotion Data 143@ Yi-Shin Chen, NLP to Text Mining
  144. 144. Collect Emotion Data 144@ Yi-Shin Chen, NLP to Text Mining
  145. 145. Collect Emotion Data Wait! Need Control Group 145@ Yi-Shin Chen, NLP to Text Mining
  146. 146. Not-Emotion Data 146@ Yi-Shin Chen, NLP to Text Mining
  147. 147. Not-Emotion Data 147@ Yi-Shin Chen, NLP to Text Mining
  148. 148. Not-Emotion Data 148@ Yi-Shin Chen, NLP to Text Mining
  149. 149. Preprocessing Steps ▷Hints: Remove troublesome tweets o Too short → Too short to extract important features o Contain too many hashtags → Too much information to process o Are retweets → Increase the complexity o Have URLs → Too troublesome to collect the page data o Convert user mentions to <usermention> and hashtags to <hashtag> → Remove the identification. We should not peek at the answers! 149@ Yi-Shin Chen, NLP to Text Mining
  150. 150. Basic Guidelines ▷ Identify the common and differences between the experimental and control groups • Analyze the frequency of words → TF•IDF (Term frequency, inverse document frequency) • Analyze the co-occurrence between words/patterns → Co-occurrence • Analyze the importance between words → Centrality Graph 150@ Yi-Shin Chen, NLP to Text Mining
  151. 151. Graph Construction ▷Construct two graphs • E.g. → Emotion one: I love the World of Warcraft new game  → Not-emotion one: 3,000 killed in the world by ebola I of Warcraft new game WorldLove the 0.9 0.84 0.65 0.12 0.12 0.53 0.67  0.45 3,000 world b y ebola the killed in 0.49 0.87 0.93 0.83 0.55 0.25 151@ Yi-Shin Chen, NLP to Text Mining
  152. 152. Graph Processes ▷Remove the common ones between the two graphs • Leave only the significant ones that appear in the emotion graph ▷Analyze the centrality of words • Betweenness, Closeness, Eigenvector, Degree, Katz → Can use free/open software, e.g., Gephi, GraphDB ▷Analyze the cluster degrees • Clustering Coefficient Graph Key patterns 152@ Yi-Shin Chen, NLP to Text Mining
  153. 153. Essence Only Only key phrases →emotion patterns 153@ Yi-Shin Chen, NLP to Text Mining
  154. 154. Emotion Patterns Extraction o The goal: o Language independent extraction – not based on grammar or manual templates o More representative set of features - balance between generality and specificity o High recall/coverage – adapt to unseen words o Requiring only a relatively small number – high reliability o Efficient— fast extraction and utilization o Meaningful - even if there are no recognizable emotion words in it 154@ Yi-Shin Chen, NLP to Text Mining
  155. 155. Patterns Definition oConstructed from two types of elements: o Surface tokens: hello, , lol, house, … o Wildcards: * (matches every word) oContains at least 2 elements oContains at least one of each type of element Examples: 155 Pattern Matches * this * “Hate this weather”, “love this drawing” * *  “so happy ”, “to me ” luv my * “luv my gift”, “luv my price” * that “want that”, “love that”, “hate that” @ Yi-Shin Chen, NLP to Text Mining
  156. 156. Patterns Construction o Constructed from instances o An instance is a sequence of 2 or more words from CW and SW o Contains at least one CW and one SW Examples 156 SubjectWords love hate gift weather … Connector Words this luv my  … Instances “hate this weather” “so happy ” “luv my gift” “love this drawing” “luv my price” “to me  “kill this idiot” “finish this task” @ Yi-Shin Chen, NLP to Text Mining
  157. 157. Patterns Construction (2) oFind all instances in a corpus with their frequency oAggregate counts by grouping them based on length and position of matching CW 157 Instances Count “hate this weather” 5 “so happy ” 4 “luv my gift” 7 “love this drawing” 2 “luv my price” 1 “to me ” 3 “kill this idiot” 1 “finish this task” 4 Connector Words this luv my  … Groups Count “Hate this weather”, “love this drawing”, “kill this idiot”, “finish this task” 12 “so happy ”, “to me ” 7 “luv my gift”, “luv my price” 8 … … @ Yi-Shin Chen, NLP to Text Mining
  158. 158. Patterns Construction (3) oReplace all the SWs by a wildcard * and keep the CWs to convert all instances into the representing pattern oThe wildcard matches any word and is used for term generalization oInfrequent patterns are filtered out 158 Connector Words this got my pain … Pattern Groups Count * this * “Hate this weather”, “love this drawing”, “kill this idiot”, “finish this task” 12 * *  “so happy ”, “to me ” 7 luv my * “luv my gift”, “luv my price” 8 … … … @ Yi-Shin Chen, NLP to Text Mining
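A toy sketch of this construction (my own simplification, with "<3" standing in for the heart emoji on the slides): instances are grouped by their connector words, subject words are replaced by the wildcard, and infrequent patterns are dropped:

```python
# Turn instances into patterns: keep connector words (CW), replace subject words by "*".
from collections import Counter

connector_words = {"this", "luv", "my", "<3"}

instances = Counter({
    ("hate", "this", "weather"): 5, ("so", "happy", "<3"): 4,
    ("luv", "my", "gift"): 7, ("love", "this", "drawing"): 2,
    ("luv", "my", "price"): 1, ("to", "me", "<3"): 3,
    ("kill", "this", "idiot"): 1, ("finish", "this", "task"): 4,
})

patterns = Counter()
for instance, count in instances.items():
    pattern = tuple(w if w in connector_words else "*" for w in instance)
    patterns[pattern] += count

min_support = 5                                  # infrequent patterns are filtered out
for pattern, count in patterns.items():
    if count >= min_support:
        print(" ".join(pattern), count)          # "* this * 12", "* * <3 7", "luv my * 8"
```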
  159. 159. Ranking Emotion Patterns ▷ Ranking the emotion patterns for each emotion • Frequency, exclusiveness, diversity • One ranked list for each emotion SadJoy Anger 159@ Yi-Shin Chen, NLP to Text Mining
  160. 160. Samples of Emotion Patterns SadJoy Anger finally * my tomorrow !!! * <hashtag> birthday .+ * yay ! :) * ! princess * * hehe prom dress * memories * * without my sucks * <hashtag> * tonight :( * anymore .. felt so * . :( * * :(( my * always shut the * teachers * people say * -.- * understand why * why are * with these * 160@ Yi-Shin Chen, NLP to Text Mining
  161. 161. Accuracy 161 [Classification accuracy by language: English: Naïve Bayes 81.90%, SVM 76.60%, NRCWE 35.40%, Our Approach 81.20%; Spanish: Naïve Bayes 70.00%, SVM 52.00%, NRCWE 0.00%, Our Approach 80.00%; French: Naïve Bayes 72.00%, SVM 61.00%, NRCWE 0.00%, Our Approach 84.00%. Results reported with LIWC and without LIWC.] @ Yi-Shin Chen, NLP to Text Mining
  162. 162. Chinese Dataset ▷ Facebook Graph API • Fan-page posts • Post comments ▷ Find comments that received emoji reactions 162@ Yi-Shin Chen, NLP to Text Mining
  163. 163. Chinese Emotion Elements 163 Subject words (社區字): 道歉、浪費、腦殘、可憐的、太誇張、太扯了、欺負弱小、腦袋有洞、無恥之徒 Connector words (中心字): 真是、好了、不如、你們這、給你們、有本事、哈哈哈、有報應的、真她媽的 @ Yi-Shin Chen, NLP to Text Mining
  164. 164. Chinese Emotion Features 164 悲傷 (Sadness) HA HA 生氣 (Anger) * * 哈哈哈 國民黨 * * 黨產 * * * * 政府 * * 一樣 * * * @u 這 有梗 * * 柯P * * 小白 * * 心疼 * * 希望 * * 天使 * * 早日康復 * * * 下輩子 * * 無辜 * * * *,願 台灣 * * 法官 * * 女的 * * 憑什麼 * * * * *道歉 * * 民進黨 * * 這位 到底 * * * * 房東 * * 真的 還是 * * * * @u 你看 好可愛 @u * * 肥貓 * * 子彈 * * * * 太強 WOW 2016/6 @ Yi-Shin Chen, NLP to Text Mining
  165. 165. Contextual Text Mining Basic Concepts 165@ Yi-Shin Chen, NLP to Text Mining
  166. 166. Context ▷Text usually has rich context information • Direct context (meta-data): time, location, author • Indirect context: social networks of authors, other text related to the same source • Any other related text ▷Context could be used for: • Partition the data • Provide extra features 166@ Yi-Shin Chen, NLP to Text Mining
  167. 167. Contextual Text Mining ▷Query log + User = Personalized search ▷Tweet + Time = Event identification ▷Tweet + Location-related patterns = Location identification ▷Tweet + Sentiment = Opinion mining ▷Text Mining +Context  Contextual Text Mining 167@ Yi-Shin Chen, NLP to Text Mining
  168. 168. Partition Text 168 User y User 2 User n User k User x User 1 Users above age 65 Users under age 12 1998 1999 2000 2001 2002 2003 2004 2005 2006 Data within year 2000 Posts containing #sad @ Yi-Shin Chen, NLP to Text Mining
  169. 169. Generative Model of Text 169 I eat fish and vegetables. Dog and fish are pets. My kitten eats fish. eat fish vegetables Dog pets are kitten My and I P(word | Model) Generation Analyze Model Topic 1 0.268 fish 0.210 pet 0.210 dog 0.147 kitten Topic 2 0.296 eat 0.265 fish 0.189 vegetable 0.121 kitten P(word | Topic) P(Topic | Document) @ Yi-Shin Chen, NLP to Text Mining
  170. 170. Contextualized Models of Text 170 I eat fish and vegetables. Dog and fish are pets. My kitten eats fish. eat fish vegetables Dog pets are kitten My and I Generation Analyze Model Year=2008 Location=Taiwan Source=FB emotion=happy Gender=Man P(word | Model, Context) @ Yi-Shin Chen, NLP to Text Mining
  171. 171. Naïve Contextual Topic Model 171 I eat fish and vegetables. Dog and fish are pets. My kitten eats fish. eat fish vegetables Dog pets are kitten My and I Generation Year=2008 Year=2007 P(w) = Σ_{j=1..C} P(c_j) Σ_{i=1..K} P(z = i | Context_j) P(w | Topic_i, Context_j) Topic 1 0.268 fish 0.210 pet 0.210 dog 0.147 kitten Topic 2 0.296 eat 0.265 fish 0.189 vegetable 0.121 kitten Topic 1 0.268 fish 0.210 pet 0.210 dog 0.147 kitten Topic 2 0.296 eat 0.265 fish 0.189 vegetable 0.121 kitten How do we estimate it? → Different approaches for different contextual data and problems@ Yi-Shin Chen, NLP to Text Mining
  172. 172. Contextual Probabilistic Latent Semantic Analysis (CPLSA) (Mei, Zhai, KDD2006) ▷An extension of PLSA model ([Hofmann 99]) by • Introducing context variables • Modeling views of topics • Modeling coverage variations of topics ▷Process of contextual text mining • Instantiation of CPLSA (context, views, coverage) • Fit the model to text data (EM algorithm) • Compare a topic from different views • Compute strength dynamics of topics from coverages • Compute other probabilistic topic patterns @ Yi-Shin Chen, NLP to Text Mining 172
  173. 173. The Probabilistic Model 173 • A probabilistic model explaining the generation of a document D and its context features C: if an author wants to write such a document, he will – Choose a view v_i according to the view distribution p(v_i | D, C) – Choose a coverage κ_j according to the coverage distribution p(κ_j | D, C) – Choose a theme θ_il according to the coverage κ_j – Generate a word using θ_il – The likelihood of the document collection is: log p(D) = Σ_{(D,C)∈D} Σ_{w∈V} c(w, D) log ( Σ_{i=1..n} p(v_i | D, C) Σ_{j=1..m} p(κ_j | D, C) Σ_{l=1..k} p(l | κ_j) p(w | θ_il) ) @ Yi-Shin Chen, NLP to Text Mining
  174. 174. Contextual Text Mining Example Contextual hate speech code words @ Yi-Shin Chen, NLP to Text Mining 174 https://arxiv.org/pdf/1711.10093.pdf
  175. 175. Culture Effect ▷Emotion expressions might be affected by our culture 175 [Bar chart of emotion-expression counts across groups: 182,413; 51,866; 5,719; 6,435; 112,944; 13,085; 22,660; 116,421] @ Yi-Shin Chen, NLP to Text Mining
  176. 176. Code Word ▷Code Word: “a word or phrase that has a secret meaning or that is used instead of another word or phrase to avoid speaking directly” (Merriam-Webster) ▷Example: 176 Anyone who isn’t white doesn’t deserve to live here. Those foreign ________ should be deported. niggers: a known hate word, easy to recognize. animals: a non-provocative substitute whose intent can still be inferred. skypes: isn’t Skype the name of a messaging app? From the context, something seems off. @ Yi-Shin Chen, NLP to Text Mining
  177. 177. Detecting Code Word ▷Detecting the words not in the dictionary ▷Expand the word set with context 177@ Yi-Shin Chen, NLP to Text Mining
  178. 178. Hate Dataset 178@ Yi-Shin Chen, NLP to Text Mining
  179. 179. Non-Hate Dataset -Twitter 179 Filter hate words @ Yi-Shin Chen, NLP to Text Mining
  180. 180. Context ▷Relatedness 關聯性 • with word2vec ▷Similarity 相似性 • with dependency-based word embedding 180 the man jumps boy plays talks Relatedness: word collocation Similarity:behavior@ Yi-Shin Chen, NLP to Text Mining
  181. 181. Dependency-based Word Embedding ▷Use dependency-based contexts instead of linear BoW [Levy and Goldberg, 2014] @ Yi-Shin Chen, NLP to Text Mining 181
  182. 182. Why Does Word Context Matter? ▷Input: Skypes ▷ Create different embeddings to capture different word usage. ▷ Build similarity and relatedness embeddings for hate data & clean data 182 SimilarityRelatedness Hate Data Clean Data skyped facetime Skype-ing phone whatsapp line snapchat imessage chat dropbox kike Line cockroaches negroes facebook animals @ Yi-Shin Chen, NLP to Text Mining
  183. 183. Finding Word Neighbours with Page Rank ▷Ranking the words with differences between data sets 183 niggersfaggots monkeys cunts animals 40.92 asshole negroe @ Yi-Shin Chen, NLP to Text Mining
  184. 184. Code Word ranking 184@ Yi-Shin Chen, NLP to Text Mining
  185. 185. Evaluation: Control results 185 Most participants were able to correctly answer the control across all 3 experiments [Charts: majority percentage per rating (Very Likely … Very Unlikely) for the control word “niggers” (positive for hate speech) and the control word “water” (negative for hate speech)] @ Yi-Shin Chen, NLP to Text Mining
  186. 186. Evaluation: Aggregate Classification 186 Hate Community: Hate Speech (Precision 0.88, Recall 1.00, F1 0.93), Not Hate Speech (Precision 1.00, Recall 0.67, F1 0.80). Clean Texts: Hate Speech (Precision 1.00, Recall 0.75, F1 0.86), Not Hate Speech (Precision 0.86, Recall 1.00, F1 0.92). Hate Texts: Hate Speech (Precision 0.75, Recall 0.75, F1 0.75), Not Hate Speech (Precision 0.83, Recall 0.83, F1 0.83). @ Yi-Shin Chen, NLP to Text Mining
  187. 187. Approach Comparison 187 TFIDF Our Approach @ Yi-Shin Chen, NLP to Text Mining
  188. 188. Some Remarks Some but not all 188@ Yi-Shin Chen, NLP to Text Mining
  189. 189. External Resources or NOT? 189 ▷Yes for resources • Steady • Emphasizes on certain terms • Few mistakes • Conservative ▷No • Keep changing • Relationships between words are more important • Many Mistakes • Experimental @ Yi-Shin Chen, NLP to Text Mining
  190. 190. Some Algorithms Do Not Work? ▷What are the motivations for these algorithms? • E.g., TFIDF is designed to find representatives ▷What are the assumptions for these algorithms? • E.g., LDA ▷Can our current data fit our problem? • How to modify? 190@ Yi-Shin Chen, NLP to Text Mining
  191. 191. How to Find Relationships? ▷Rely on machines • Machine learning • Association rules • Regression ▷Human first • E.g., the graph approach in emotion analysis → Find the centrality relationships and patterns in the graph 191@ Yi-Shin Chen, NLP to Text Mining
  192. 192. Always Remember ▷Have a good and solid objective • No goal no gold • Know the relationships between them 192@ Yi-Shin Chen, NLP to Text Mining
