
[Event Series] Ubiquitous Natural Language Processing: Basic Concepts, Techniques, and Tools

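Natural language processing (NLP) deals specifically with unstructured data that has not been organized into field values, which is one of the hot potatoes of data analysis. A machine that can understand meaning and communicate in natural language could pass the ultimate test of artificial intelligence, the Turing Test, which shows how great a challenge this problem is. Even the basic techniques raise hard questions: how do we extract the important information from an article? How do we analyze the structure of a sentence? How do we find the meaning expressed in language? How do we handle different languages? These skills are essential in data analysis, and doing them well takes know-how. This course starts from the basics. We hope it introduces you to NLP, a technique that is very useful in data analysis and data mining, and sparks your interest in and enthusiasm for it.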

Published in: Data & Analytics

  1. 1. Lun-Wei Ku, NLPSA, Academia Sinica. Ubiquitous Natural Language Processing: Basic Concepts, Techniques, and Tools
  2. 2. Speaker • Lecturer: Lun-Wei Ku • Currently: Assistant Research Fellow, IIS, Academia Sinica; Adjunct Assistant Professor, NCTU • Working on NLP and sentiment analysis • Running the NLPSA Lab: http://www.lunweiku.com/ and http://academiasinicanlplab.github.io/ • Ongoing projects: – Graph Embedding, Emotion Enabled Dialog System, Cross-lingual Article Suggestion, Proactive Dialog Generation from Images and Texts
  3. 3. Outline • 9:30 - 10:30 What is natural language processing? • 10:30 - 10:50 Tea break • 10:50 - 12:30 Tools and resources for Chinese and English text processing • 12:30 - 13:20 Lunch • 13:20 - 15:00 Challenges of NLP on the Web and social media • 15:00 - 15:20 Tea break • 15:20 - 17:00 NLP trends and industry applications
  4. 4. Section 1: What Is Natural Language Processing?
  5. 5. Natural language • As opposed to machine language • The language humans use to communicate
  6. 6. What is Natural Language Processing? • Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, dialog systems, or some combination thereof. (Wikipedia)
  7. 7. What is natural language processing? • Natural language processing (NLP) is a branch of artificial intelligence and linguistics. The field studies how to process and use natural language; natural language understanding means getting computers to "understand" human language. • Natural language generation systems convert computer data into natural language. Natural language understanding systems convert natural language into a form that computer programs can process more easily. (Wikipedia)
  8. 8. Natural language processing • It is an AI-complete problem: "Natural language understanding has not been fully solved. A system can handle only one domain at a time, and it does not really understand what is being said. You ask what is fun to do in Hainan and then go on about travel and so on; that is how people talk, but machines do not understand it. So natural language understanding is still a very long way from platform-level use. If you have invested in projects of this kind, think it over carefully." (Kai-Fu Lee)
  9. 9. Domains (1) • Biomedical • Cognitive Modeling and Psycholinguistics • Dialogue and Interactive Systems • Discourse and Pragmatics • Generation and Summarization • Information Extraction, Retrieval, Question Answering, Document Analysis and NLP Applications • Machine Learning • Machine Translation • Multidisciplinary • Multilinguality • Phonology, Morphology and Word Segmentation • Resources and Evaluation • Semantics • Sentiment Analysis and Opinion Mining • Social Media • Speech • Tagging, Chunking, Syntax and Parsing • Vision, Robotics and Grounding
  10. 10. Domains (2) • Text to speech / speech synthesis • Speech recognition • Chinese word segmentation • Part-of-speech tagging • Parsing • Natural language generation • Text categorization • Information retrieval • Information extraction • Text proofing • Question answering • Machine translation • Automatic summarization • Textual entailment
  11. 11. Applications (1) • IBM Watson: Jeopardy https://www.youtube.com/watch?v=WFR3lOm_xhE • Google Translate / "Miss Google" (Google小姐)
  12. 12. Applications (2) • Spam filtering <-> Ad pushing – Google AdSense and many others • Spelling and grammar correction – Grammarly, a free grammar checker: https://www.grammarly.com/ – Duolingo https://www.duolingo.com/ – Pigai (批改網) https://www.pigai.org/ – …
  13. 13. Applications (3) • Paper Generator – Mathgen http://thatsmathematics.com/mathgen/ • Poem Generator, "The Sound of Autumn Insects" (《秋蟲的聲音》) – When luck is about to come running to your door – There is not even the sound of autumn insects – The lure of your eyes – Flies across the sky – As if someone had shut the door for a few days – I, a charming face – Sometimes there is no need for another sun – To light the earth into a planet
  14. 14. Applications (4) • Problem Solver – Math solver: https://www.cymath.com/ Step by step, NLP + others (graph, formula, …)
  15. 15. Applications (5) • AI doctor – IBM Watson Health • Optimize performance • Engaged consumers • Enable effective care • Manage population health – Why is an AI doctor related to NLP? • MedNLP: medical records, communication…
  16. 16. Applications (6) • Summarization – Best demonstration: 谷阿莫 *blog.investis.com • Sentiment/Opinion/Review • Social Media/Network application – Full of texts! *techxb.com
  17. 17. Application: Multi-modal NLP • Captioning
  18. 18. Application: Multi-modal NLP • Story Telling
  19. 19. Other Close Disciplines • Artificial Intelligence (AI) • Information Retrieval (IR) • Machine Learning (ML) • Human Computer Interaction (HCI)
  20. 20. NLP and AI • NLP takes care of the input/output of unstructured information for AI applications. • AI applications are expected to write/speak like people. • NLP is becoming more and more important in AI. • However, NLP is challenging.
  21. 21. NLP and IR • NLP borrows some concepts from IR, especially word weighting schemes. • For IR, efficiency is very important. Some time-limited NLP tasks also incorporate ideas from IR to save time, e.g., clustering or offline preprocessing.
  22. 22. NLP and ML • In the past, NLP techniques relied heavily on linguistic knowledge in the form of rules or probabilities. • Nowadays NLP uses a lot of ML/DL techniques.
  23. 23. NLP and HCI • Language (written or spoken) is a way for computers to communicate with people. • Representing information and using it appropriately can mitigate the errors people may perceive. • NLP + HCI may lead to killer apps.
  24. 24. Everywhere? • Humans are social animals, and language is the tool humans use to communicate • The input and output of information for the brain • We use language every day and depend on it to live • Cannot speak? Cannot hear? All the time, everywhere!
  25. 25. Sample Text (Chinese) • 下雨天留客天留我不留 (one unpunctuated string, several possible segmentations) – 下雨天留客 天留我不留 – 下雨天 留客天 留我不 留 – 下雨天 留客天 留我不留 • 紅鯉魚與綠鯉魚與驢與鯉魚與驢與紅鯉魚與驢與綠鯉魚 (a tongue twister about red carp, green carp, and donkeys)
  26. 26. Typical Challenges • NLU: Natural Language Understanding • Inference – 玻璃杯碎了一地 → 玻璃杯不能用了 (the glass shattered all over the floor → the glass can no longer be used) • Language change: new words, phrases, and concepts keep emerging. – Domain: 跆拳道的品勢 (poomsae in taekwondo) – Social Media: 多多變套套
  27. 27. http://nlp.stanford.edu/~wcmac/papers/20140716-UNLU.pdf
  28. 28. Wrap Up -1 • What is NLP? • What applications are related to NLP? • NLP and NLU • What are the current challenges? • NLP: shallow and deep • Next, let’s go ahead to NLPing! – introducing the concepts and trying the tools online (if available)
  29. 29. Section 2: Tools and Resources for Chinese and English Text Processing
  30. 30. First, make your corpus/datasets/materials ready! 11 December 2016 30
  31. 31. Sources of Data • Existing open datasets for developing techniques • Crawling data from the Web • Logs from your deployed systems • Annotating your own data • Buying data
  32. 32. Natural Language Processing • Basic Functions – (Word Segmentation) – Part of Speech Tagging – (Stemming) – Named Entity Extraction – (Syntactic) Parsing – Coreference resolution – Text Categorization
  33. 33. Word Segmentation • Some written languages, such as Chinese or Japanese, have no explicit word boundary markers. • If words are to be the basic units for text processing, we need to know the boundaries. • 下雨天留客天留我不留 • 私は自然言語処理を好む • أنا أفضل معالجة اللغة الطبيعية
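To make the boundary problem concrete, here is a minimal sketch of greedy forward maximum matching, a classic dictionary-based segmentation heuristic. It is not the method of any tool named in these slides; `max_match` and both toy dictionaries are hypothetical, chosen only to show that the dictionary decides where the boundaries fall.

```python
def max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching over a word list."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words


sentence = "下雨天留客天留我不留"
vocab_a = {"下雨天", "留客", "不留"}    # hypothetical dictionary A
vocab_b = {"下雨天", "留客天", "不留"}  # hypothetical dictionary B

print(max_match(sentence, vocab_a))
print(max_match(sentence, vocab_b))
```

The two dictionaries differ by a single entry, yet they yield different segmentations of the same string, which is why downstream steps inherit whatever the segmenter decides.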
  34. 34. Stemmer (English) • The process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form (Wikipedia) • Example: "I love natural language processing." → stemming → "I love natur languag process ."
  35. 35. TF‧IDF (1) • Widely used in IR • term frequency × inverse document frequency • Calculates the weight of each term (usually a word) in a dataset • An example of how to represent documents:
        Term      | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
        Antony    | 5.25                 | 3.18          | 0           | 0      | 0       | 0.35
        Brutus    | 1.21                 | 6.1           | 0           | 1      | 0       | 0
        Caesar    | 8.59                 | 2.54          | 0           | 1.51   | 0.25    | 0
        Calpurnia | 0                    | 1.54          | 0           | 0      | 0       | 0
        Cleopatra | 2.85                 | 0             | 0           | 0      | 0       | 0
        mercy     | 1.51                 | 0             | 1.9         | 0.12   | 5.25    | 0.88
        worser    | 1.37                 | 0             | 0.11        | 4.15   | 0.25    | 1.95
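  36. 36. TF‧IDF (2) • A very old but effective formula • The importance of a term is determined by two factors: – Within one document, the more often the term occurs, the more important it is – The fewer documents the term occurs in, the more important it is • $w_{t,d} = (1 + \log \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$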
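The weighting on this slide can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the code behind the table on the previous slide: the function name `tfidf_weights` and the three toy documents are made up, and base-10 logarithms are assumed for both factors.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    weight = (1 + log10(tf)) * log10(N / df), for terms with tf > 0.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (1 + math.log10(count)) * math.log10(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Toy, pre-tokenized documents (hypothetical data).
docs = [
    ["caesar", "caesar", "brutus", "the"],
    ["caesar", "mercy", "the"],
    ["mercy", "mercy", "worser", "the"],
]
weights = tfidf_weights(docs)
for w in weights:
    print(w)
```

A term that occurs in every document ("the" here) gets weight 0, because log10(N/df) is 0; a term confined to one document keeps the full log10(N) factor.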
  37. 37. Bag of Words Model • Often abbreviated as “BOW” • Words are used as features WITHOUT their order. • 我給你一百萬 = 你給我一百萬 ("I give you one million" and "you give me one million" are the same bag) • Usually combined with N-gram features – Unigrams: 我 給 你 一 百 萬 – Bigrams: 我給 給你 你一 一百 百萬 – Trigrams: 我給你 給你一 你一百 一百萬
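The n-gram features listed on this slide can be reproduced with a short sketch. The helper name `char_ngrams` is hypothetical, and characters are used as the basic unit, as on the slide.

```python
from collections import Counter


def char_ngrams(text, n):
    """All contiguous character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


a = "我給你一百萬"
b = "你給我一百萬"

print(char_ngrams(a, 2))  # ['我給', '給你', '你一', '一百', '百萬']
print(char_ngrams(a, 3))  # ['我給你', '給你一', '你一百', '一百萬']

# As unigram bags the two sentences are identical...
same_unigram_bag = Counter(char_ngrams(a, 1)) == Counter(char_ngrams(b, 1))
# ...but bigrams recover some of the word order.
same_bigram_bag = Counter(char_ngrams(a, 2)) == Counter(char_ngrams(b, 2))
print(same_unigram_bag, same_bigram_bag)  # True False
```

This is why bag-of-words models are usually paired with n-grams: the higher-order features distinguish sentences that a pure unigram bag cannot.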
  38. 38. TF-IDF + BOW (uni- & bi-grams) is still the state of the art for some tasks today, or at least comes very close to state-of-the-art performance; it is a very strong baseline.
  39. 39. • So far we have word-level information. • Next, we start to add more information to words and then to larger segments.
  40. 40. Part of Speech (POS) Tagging • I love natural language processing. • (PRP I) (VBP love) (JJ natural) (NN language) (NN processing) (. .) • PRP: personal pronoun • VBP: verb, non-3rd person singular present • JJ: adjective • NN: noun, singular or mass • Tags may vary with the tag set used; these come from the Penn Treebank tag set
  41. 41. Parsing (English) • Constituent Parse Tree (ROOT (S (NP (PRP I)) (VP (VBP love) (NP (JJ natural) (NN language)) (NP (NN processing))) (. .)))
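A bracketed parse like the one on this slide is easy to load into a nested structure. The sketch below is a small hand-written reader, not the API of any parser mentioned in the slides; `parse_tree` and `tagged_words` are hypothetical helper names.

```python
import re


def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    position = 0

    def read_node():
        nonlocal position
        position += 1                      # skip "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])  # a word
                position += 1
        position += 1                      # skip ")"
        return (label, children)

    return read_node()


def tagged_words(node):
    """Collect (word, POS) pairs from the leaves, left to right."""
    label, children = node
    pairs = []
    for child in children:
        if isinstance(child, str):
            pairs.append((child, label))
        else:
            pairs.extend(tagged_words(child))
    return pairs


tree = parse_tree(
    "(ROOT (S (NP (PRP I)) (VP (VBP love) (NP (JJ natural) (NN language)) "
    "(NP (NN processing))) (. .)))"
)
print(tagged_words(tree))
```

Reading the leaves back out gives the same word/tag pairs as the POS tagging slide, which shows that the constituent tree carries the tagging plus the phrase structure above it.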
  42. 42. Parsing (English) • POS • Dependency Tree • Dependency Parser
  43. 43. Semantic Role Labeling • Labels the semantic role each word plays in a sentence. https://www.slideshare.net/marinasantini1/semantic-role-labeling
  44. 44. Parsing (Chinese) • Stanford Parser (Simplified Chinese) • LTP-Cloud (语言云, the Language Technology Platform cloud) (Simplified Chinese) – HIT-iFLYTEK Language Cloud (哈工大-讯飞语言云, 2014) – Results are obtained via HTTP requests • CKIP Parser
  45. 45. Parsing (Chinese) • 我爱 自然 语言 处理 • 我爱/VV 自然/NN 语言/NN 处理/NN • root(ROOT-0, 我爱-1) • compound:nn(处理-4, 自然-2) • compound:nn(处理-4, 语言-3) • dobj(我爱-1, 处理-4) • Error propagation! (我 "I" and 爱 "love" were segmented as one token, so the tagging and the parse built on it are wrong too)
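The dependency lines on this slide follow a `relation(head-index, dependent-index)` pattern, so they can be checked mechanically. The sketch below is a hand-written reader for that textual pattern only (the regular expression and the name `parse_dependency` are assumptions, not a parser API). It confirms that the output contains no subject relation, which one would expect (for example `nsubj`) had 我 and 爱 been separate tokens.

```python
import re

# relation(head-i, dependent-j), e.g. "dobj(我爱-1, 处理-4)"
DEP_PATTERN = re.compile(r"^([\w:]+)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dependency(line):
    """Turn 'rel(head-i, dep-j)' into (rel, (head, i), (dep, j))."""
    match = DEP_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a dependency triple: {line!r}")
    rel, head, head_idx, dep, dep_idx = match.groups()
    return rel, (head, int(head_idx)), (dep, int(dep_idx))


lines = [
    "root(ROOT-0, 我爱-1)",
    "compound:nn(处理-4, 自然-2)",
    "compound:nn(处理-4, 语言-3)",
    "dobj(我爱-1, 处理-4)",
]
triples = [parse_dependency(line) for line in lines]
relations = {rel for rel, _, _ in triples}

print(triples[0])
print("nsubj" in relations)  # False: the merged token left no subject
```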
  46. 46. Parsing (Chinese) • #1:1.[0] S(experiencer:NP(Head:Nhaa:我)|Head:VL1:愛|reason:NP(property:Na:自然|Head:Nac:語言)|goal:VP(Head:VC2:處理))#
  47. 47. Using Tools (1) • Stanford Parser • Stanford CoreNLP (Demo) • Berkeley Parser • SRL (Demo)
  48. 48. The English tools are so complete. What about Chinese?
  49. 49. Using Tools (2) • Jieba (segmentation, Python code) – HMM/Viterbi algorithm • CKIP – Chinese Segmentor/POS Tagger – Parser
  50. 50. NLP Tools Comparison • For the traditional Chinese text environment… • Tools compared: Stanford CoreNLP, Jieba, CKIP • Criteria: language support, ease of use, domain adaptation, performance, price
  51. 51. Using Tools (3) • NLTK (Python): tokenize, tag, NE extraction, show parsing trees – Porter stemmer – n-grams • TF-IDF is not in NLTK; use scikit-learn (machine learning in Python). • spaCy: industrial-strength NLP in Python
  52. 52. Semantic Resources (1) • WordNet (English) online demo • Freebase (English): API shut down on Aug 31, 2016 => Google’s Knowledge Graph • Google Knowledge Graph API
  53. 53. Semantic Resources (2) • HowNet (Simplified Chinese) • E-HowNet (Traditional Chinese)
  54. 54. Word Embeddings (1) • How word embeddings differ from the word vectors used in the past: they support semantic arithmetic: king + woman – man = queen
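The arithmetic on this slide can be illustrated with hand-made vectors. The 2-D values below are invented so that one axis loosely stands for "royalty" and the other for "gender"; they are not trained embeddings, and real models only satisfy such analogies approximately.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical): [royalty, gender]
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}


def analogy(a, b, c, vectors):
    """Word closest to vec(a) + vec(b) - vec(c), excluding the inputs."""
    target = [x + y - z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "woman", "man", toy_vectors))  # queen
```

With pre-trained embeddings the same nearest-neighbour search runs over the whole vocabulary instead of five words.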
  55. 55. Word Embeddings (2) • Use pre-trained vectors or train your own! • word2vec (w2v) • GloVe • What if I don't know deep learning? You can find a variety of pre-trained embeddings on the Web.
  56. 56. I know these kinds of information and processing methods. What next? • You can do – Information extraction: if you know no learning methods at all, you can write rules to extract the information you need! – Machine learning: if you know a little machine learning, the text processing methods just introduced give you plenty of information to use as features for learning a model, e.g., word frequency, importance (weight), linguistic features (POS), sentence structure (parsing tree), semantics (semantic ontology, word embedding), and so on
  57. 57. NLP Tasks • Most of them can be transformed into – Classification problem – Clustering problem – (Sequential) Labeling problem
  58. 58. You can now carry out basic natural language processing tasks!
  59. 59. Wrap Up – Part II • For the English and Chinese languages • Pre-processing tools • Syntactic analysis tools • Semantic analysis tools
  60. 60. Part III: Challenges of NLP on the Web and Social Media
  61. 61. 1. WWW/Social Media NLP 2. Sentiment Analysis Tool
  62. 62. Not only texts… • Figure labels: Money, Network, Sentiment, User (icons created by Freepik)
  63. 63. Differences • Web and social texts are a written form of spoken language. – New words – Typos – Urban language – Cyber language – Abbreviations – A lot of (homo)phonic/semantic puns (homophones, double entendres) – Foreign languages (激安殿堂, 牛逼)
  64. 64. If we just treat them as pure texts…
        Source text:
        • 八百屋的健太和大蔥女部分幫整個劇情超加分
        • 而且兩位演技都很好呀!最喜歡一幕是
        • 健太知道大蔥女的真面目後,在大蔥女再來買蔥完要離開時
        • 健太衝出去要追問的樣子,一副欲言又止的臉
        • 大蔥女也一副等待著健太說出來整個很曖昧的畫面
        Segmentation and POS tagging output:
        • 八百(Neu) 屋(Na) 的(DE) 健(VH) 太(Dfa) 和(P) 大蔥女(Na) 部分(Neqa) 幫(P) 整個(Neqa) 劇情(Na) 超(VJ) 加分(VB) 而且(Cbb) 兩(Neu) 位(Nf) 演技(Na) 都(D) 很(Dfa) 好(VH) 呀(T) !(EXCLAMATIONCATEGORY)
        • 最(Dfa) 喜歡(VK) 一(Neu) 幕(Nf) 是(SHI) 健(VH) 太(Dfa) 知道(VK) 大蔥女(Na) 的(DE) 真面目(Na) 後(Ng) ,(COMMACATEGORY)
        • 在(P) 大蔥女(Na) 再(D) 來(D) 買蔥完(VC) 要(D) 離開(VC) 時(Ng) 健(VH) 太(Dfa) 衝出去(VA) 要(D) 追問(VE) 的(DE) 樣子(Na) ,(COMMACATEGORY)
        • 一(Neu) 副(Nf) 欲言又止(VH) 的(DE) 臉(Na) 大蔥女(Na) 也(D) 一(Neu) 副(Nf) 等待(VK) 著(Di) 健(VH) 太(Dfa) 說出來(VB) 整個(Neqa) 很(Dfa) 曖昧(VH) 的(DE) 畫面(Na)
  65. 65. Stanford vs. CKIP
        • CKIP: 八百(Neu) 屋(Na) 的(DE) 健(VH) 太(Dfa) 和(P) 大蔥女(Na) 部分(Neqa) 幫(P) 整個(Neqa) 劇情(Na) 超(VJ) 加分(VB)
        • Stanford: 八百/CD 屋/NN 的/DEG 健太/NR 和/CC 大葱/NR 女/JJ 部分/NN 帮/VV 整个/DT 剧情/NN 超加分/NN
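To compare the two outputs programmatically, both notations can be read into (word, tag) pairs. The sketch below handles only the two textual formats shown on these slides, `word(Tag)` and `word/TAG`; the function names are hypothetical, and the inputs are shortened copies of the slide's first sentence.

```python
import re


def parse_paren_tags(line):
    """'詞(Tag) 詞(Tag)' -> [(word, tag), ...]"""
    return re.findall(r"(\S+?)\((\w+)\)", line)


def parse_slash_tags(line):
    """'word/TAG word/TAG' -> [(word, tag), ...]"""
    return [tuple(token.rsplit("/", 1)) for token in line.split()]


ckip_line = "八百(Neu) 屋(Na) 的(DE) 健(VH) 太(Dfa) 和(P) 大蔥女(Na)"
stanford_line = "八百/CD 屋/NN 的/DEG 健太/NR 和/CC 大葱/NR 女/JJ"

ckip_words = [word for word, _ in parse_paren_tags(ckip_line)]
stanford_words = [word for word, _ in parse_slash_tags(stanford_line)]

print(ckip_words)
print(stanford_words)
```

Listing the words side by side makes the disagreements easy to spot: the name 健太 survives as one token in one output and is split into 健 and 太 in the other, while 大蔥女 is kept whole in one and split in the other.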
  66. 66. More Preprocessing Needed • We need to filter out dirty text and find the main content. – Advertising text – Formatting text • We need to split the text into sentences before sending them to the parser.
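The sentence-cutting step can be sketched with a regular expression. This is a deliberately simple splitter that assumes sentences end at 。！？ or their ASCII counterparts (or at a line break); the pattern and the name `split_sentences` are assumptions, and real web text usually needs more rules.

```python
import re

# A sentence is a run of non-terminator characters plus any closing marks.
SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]*")


def split_sentences(text):
    """Split text at sentence-final punctuation, keeping the punctuation."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


text = "而且兩位演技都很好呀!最喜歡一幕是健太知道大蔥女的真面目後。"
for sentence in split_sentences(text):
    print(sentence)
```

Commas are intentionally not treated as boundaries here; whether to cut at commas is a design choice that depends on how long an input the parser tolerates.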
  67. 67. Skills We Might Need • Text Normalization • Multimedia multimodal • User and Text Networking • Social Network
  68. 68. Skills We Might Need • Text Normalization • Multimedia multimodal • User and Text Networking • Social Network
  69. 69. Text Normalization • Normalization converts text written in web language into formal language before further processing. • 私心喜翻的日式簡約風 → 私心喜歡的日式簡約風 (the Japanese minimalist style I personally like) • 想一起去ㄉ水水們 → 想一起去的漂亮女生們 (pretty girls who want to go together) • 漂漂是今年才成為麻麻 → 漂漂是今年才成為媽媽 (Piaopiao only became a mom this year)
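A lookup table is the simplest way to implement the rewrites on this slide. The sketch below is only that: a tiny hypothetical table built from the slide's three examples, applied longest entry first. It ignores context, so a real normalizer would need disambiguation (for example, ㄉ is not always 的).

```python
# Informal form -> formal form, taken from the slide's examples.
SLANG_TABLE = {
    "喜翻": "喜歡",
    "ㄉ": "的",
    "水水": "漂亮女生",
    "麻麻": "媽媽",
}


def normalize(text, table=SLANG_TABLE):
    """Replace informal forms with formal ones, longest entries first."""
    for informal in sorted(table, key=len, reverse=True):
        text = text.replace(informal, table[informal])
    return text


print(normalize("私心喜翻的日式簡約風"))
print(normalize("想一起去ㄉ水水們"))
print(normalize("漂漂是今年才成為麻麻"))
```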
  70. 70. Top 10 netizen buzzwords of 2017 • #1 低能卡 • #2 垃圾不分藍綠 • #3 我難過 • #4 這我一定吉 • #5 發錢 • #6 8+9 • #7 銅鋰鋅 • #8 下去領500 • #9 海水退潮就知道誰沒穿褲子 • #10 少時不讀書,長大當記者 • Bonus: 廠廠
  71. 71. Processing Web Text: do we need normalization?
  72. 72. Or, A Parser for Web Text • Tweet POS tagger/parser such as ARK • Trained on web texts to capture their characteristics • Example: "ikr smh he asked fir yo last name so he can add u on fb lololol" → "I know right shake my head he asked for your last name so he can add you on facebook lololol" • Unfortunately, so far we don’t have one for the Chinese language.
  73. 73. Skills We Might Need • Text Normalization • Multimedia multimodal • User and Text Networking • Social Network
  74. 74. Open any web article at random • http://linshibi.com/
  75. 75. Skills We Might Need • Text Normalization • Multimedia multimodal • User and Text Networking • Social Network
  76. 76. User and Text Network (1) • We can observe this kind of network in all forum-style social media.
  77. 77. User and Text Network (2)
  78. 78. User and Text Network (2)
  79. 79. • We will explain how to use the concept of the user and text network with the UTCNN model in the sentiment package that follows.
  80. 80. Sentiment Analysis
  81. 81. Sentiment Analysis Is… • The study of opinions, sentiments, subjectivity, affect, emotions, views, etc. in text such as news, blogs, reviews, comments, dialogs, or other kinds of documents. • An important research question: – Sentiment information is global and powerful. – Sentiment information is valuable for companies, customers, and personal communication.
  82. 82. Sentiment Representation • Categorical – Sentiment, non-sentiment – Positive, neutral, negative – Stars – Emotion categories such as joy, anger, sadness… • Dimensional – Valence, arousal
  83. 83. CSentiPackage @NLPSA
  84. 84. CSentiPackage • Datasets – Chinese Morphological Dataset Cmorph (former version of ACBiMA)* – Chinese Opinion Treebank • Resources – NTUSD/ANTUSD • Tools – CopeOpi + Tag Mapping File – UTCNN *https://github.com/windx0303/ACBiMA
  85. 85. Statistics • NTUSD: sentiment dictionary (10,371 words): free for research, 400+ applications • ANTUSD: Augmented NTUSD (27,221 words, now being integrated with E-HowNet) • Cmorph (8,000+ words) -> ACBiMA (11,000+ words) • Chinese Opinion Treebank: labels on Chinese Treebank 5.1
  86. 86. Materials: From Words to Sentences • NTUSD: words (binary sentiment) • ANTUSD: words (annotation features) • Chinese Morphological Dataset: words (morphological structures) • Chinese Opinion Treebank: phrases (sentence structure) • Chinese Opinion Treebank: sentences (binary sentiment)
  87. 87. Tools: From Words to Sentences, Documents, and Beyond • CopeOpi Sentiment Scoring Tool: words, sentences, documents, documents+ (text) • UTCNN: posts and users (text and social media)
  88. 88. NTUSD • Simplified Chinese and traditional Chinese versions • A positive word collection of 2,812 words • A negative word collection of 8,276 words • No degrees, estimated scores, or other information.
  89. 89. ANTUSD • 6 fields – CopeOpi score – Number of positive annotations – Number of neutral annotations – Number of negative annotations – Number of non-sentiment annotations – Number of not-a-word annotations • Not-a-word: useful because the entries are collected from real segmented data • Examples:
        Word | CopeOpi score | Positive | Neutral | Negative | Non-sentiment | Not-a-word
        開心 | 0.434168      | 1        | 0       | 0        | 0             | 0
        酣聲 | 0             | 0        | 0       | 1        | 3             | 0
        憤怒 | -0.80011      | 0        | 0       | 5        | 0             | 0
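The three sample records on this slide can be read into typed fields with a few lines of code. The sketch assumes one whitespace-separated record per line (the word followed by the six fields in the order listed above); the actual file layout of the released dictionary may differ, and `AntusdEntry` and `parse_entry` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class AntusdEntry:
    word: str
    score: float          # CopeOpi score
    positive: int         # number of positive annotations
    neutral: int
    negative: int
    non_sentiment: int
    not_a_word: int


def parse_entry(line):
    """Parse 'word score pos neu neg nonsent notword' (assumed layout)."""
    word, score, *counts = line.split()
    if len(counts) != 5:
        raise ValueError(f"expected 5 annotation counts: {line!r}")
    return AntusdEntry(word, float(score), *map(int, counts))


records = [
    "開心 0.434168 1 0 0 0 0",
    "酣聲 0 0 0 1 3 0",
    "憤怒 -0.80011 0 0 5 0 0",
]
entries = [parse_entry(record) for record in records]
for entry in entries:
    print(entry)
```

Keeping the annotation counts next to the score lets a downstream model tell a confidently negative word (憤怒, five negative votes) from one that annotators mostly judged to carry no sentiment (酣聲).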
  90. 90. ANTUSD • Also contains short phrases such as 一昧要求, 一路過關斬將, 備受外界期待…
  91. 91. ANTUSD and E-HOWNET • An integration of two resources that may help us work with sentiment and semantics together. • Related English resource: SentiWordNet – Refers to WordNet – With PosScore and NegScore added – ObjScore = 1 - (PosScore + NegScore) • E-HowNet – A frame-based entity-relation model extended from HowNet – Defines lexical senses (concepts) in a hierarchical manner – Now integrated with ANTUSD, covering 47.7% of the words in ANTUSD
  92. 92. ANTUSD in E-HOWNET
  94. 94. Chinese Morphological Structure • Parallel type: 財富 (rich wealth) • Substantive-Modifier type: 痛哭 (bitterly cry) • Subjective-Predicate type: 山崩 (land slip; landslide) • Verb-Object type: 避暑 (escape from summer) • Verb-Complement type: 提高 (increase: raise up) • Negation type: 無情 (no feelings) • Confirmation type: 有心 (have heart) • Others
  95. 95. Chinese Opinion Treebank • Based on Chinese Treebank 5.1. • Includes the opinion label of each sentence. • Includes the word pairs and their composition types in opinionated sentences. • To avoid copyright issues, you need your own copy of Chinese Treebank 5.1 in order to use the Chinese Opinion Treebank!
  96. 96. Chinese Opinion Treebank • S ID=230: 黄河“金三角”成为新的投资热点
        .node file (fields: node ID, POS, node content, node depth): 0,,,0 1,IP-HLN,,1 2,NP-SBJ,,2 3,NP-PN,,3 4,NR,黄河,4 5,NP,,3 6,PU,“,4 7,NN,金三角,4 8,PU,”,4 9,VP,,2 10,VV,成为,3 11,NP-OBJ,,3 12,CP,,4 13,WHNP-1,,5 14,-NONE-,*OP*,6 15,CP,,5 16,IP,,6 17,NP-SBJ,,7 18,-NONE-,*T*-1,8 19,VP,,7 20,VA,新,8 21,DEC,的,6 22,NP,,4 23,NN,投资,5 24,NN,热点,5
        .tree file (fields: node ID: children): 0:1, 1:2,9, 2:3,5, 3:4, 4: 5:6,7,8, 6: 7: 8: 9:10,11, 10: 11:12,22, 12:13,15, 13:14, 14: 15:16,21, 16:17,19, 17:18, 18: 19:20, 20: 21: 22:23,24, 23: 24:
        .trio file (fields: trio ID, trio head, trio left node, trio right node, trio type): 2,1,2,9,3 3,22,23,24,2
        Opinion labels of three annotators (filename, SID, opinion, polarity, opinion type): chtb_020.raw,230,N,, | chtb_020.raw,230,Y,POS,STATUS | chtb_020.raw,230,Y,POS,STATUS
        Opinion gold standard: chtb_020.raw,230,Y,POS,STATUS
  97. 97. Notation (Parsing Tree) • T: the parsing tree of a sentence S • O = {o1, o2, …}: in-ordered set of tree nodes • tri: an opinion trio • Rpt ∈ {Substantive-Modifier, Subjective-Predicate, Verb-Object, Verb-Complement, Other}: a syntactic inter-word relation • Tri(S) = – 1, IP, 活动, VP, Subjective-Predicate – 2, VP, 取得, NP-OBJ, Verb-Object – 3, NP-OBJ, 圆满, 成功, Substantive-Modifier
  98. 98. Chinese Opinion Treebank • Opinion labels of sentences are aligned to Chinese Treebank 5.1 by sentence ID. • Opinion trios are aligned to Chinese Treebank 5.1 by node ID. • Can be used for opinion cause analysis.
  99. 99. CopeOpi • A statistical sentiment analysis tool • Can be used without any training • Users can update character weights or add sentiment words • It runs fast.
  100. 100. The First Idea • Chinese characters are mostly morphemes, and they bear sentiment, too. • Simple example: some characters are preferred in names, and some are not. • For example, 德 (ethic), 勝 (win), and 高 (high) are good for names; 笨 (stupid), 悲 (sorrow), and 惨 (terrible) are not good choices. • There are exceptions, such as 徐悲鴻, but the approach is still quite reliable if character sentiment is acquired statistically from a large naming corpus (or simply from sentiment dictionaries).
  101. 101. Bag of Unit • [仇 (-1.0) + 視 (0.0)] / 2 = -1/2 = -0.5 (NEG) • [富 (1.0) + 貴 (0.936)] / 2 = 0.968 (POS) • Example words: 好人、美麗、憤怒、弱小… • $P_{c_i} = \dfrac{fp_{c_i} / \sum_{j=1}^{n} fp_{c_j}}{fp_{c_i} / \sum_{j=1}^{n} fp_{c_j} + fn_{c_i} / \sum_{j=1}^{m} fn_{c_j}}$ • $N_{c_i} = \dfrac{fn_{c_i} / \sum_{j=1}^{m} fn_{c_j}}{fp_{c_i} / \sum_{j=1}^{n} fp_{c_j} + fn_{c_i} / \sum_{j=1}^{m} fn_{c_j}}$ • $S_{c_i} = P_{c_i} - N_{c_i}$ • $S_w = \dfrac{1}{p} \sum_{j=1}^{p} S_{c_j}$
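The character-level scoring on this slide can be sketched as follows. This is an illustration, not the CopeOpi implementation: it assumes fp and fn are counts of a character in positive and negative dictionary words, that p is the number of characters in a word, and that characters never seen in the dictionary score 0. The toy counts and function names are made up.

```python
def char_scores(fp, fn):
    """S_c = P_c - N_c from per-character positive/negative counts."""
    total_p = sum(fp.values())
    total_n = sum(fn.values())
    scores = {}
    for c in set(fp) | set(fn):
        p_rate = fp.get(c, 0) / total_p   # normalized positive frequency
        n_rate = fn.get(c, 0) / total_n   # normalized negative frequency
        P = p_rate / (p_rate + n_rate)
        N = n_rate / (p_rate + n_rate)
        scores[c] = P - N
    return scores


def word_score(word, scores):
    """S_w: average of the character scores (unknown characters count 0)."""
    return sum(scores.get(c, 0.0) for c in word) / len(word)


# Toy counts (hypothetical): occurrences in positive / negative words.
fp = {"富": 3, "好": 7}
fn = {"仇": 4, "壞": 6}
scores = char_scores(fp, fn)

print(scores["富"], scores["仇"])
print(word_score("仇視", scores))
print(word_score("富貴", {"富": 1.0, "貴": 0.936}))
```

With the slide's own character scores, the averages come out as -0.5 for 仇視 and 0.968 for 富貴, matching the worked examples above.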
  102. 102. Aggregation • Word sentiment – Summing up the opinion scores of characters • Sentence sentiment – Summing up the opinion scores of words • So is there any way we can give them weights?
  103. 103. Weighted by Structures • Linguistic information: – Morphological structures • Intra-word structures – Sentence syntactic structures • Inter-word structures
  104. 104. Morphological Structure • Linguistic morphological types (with examples): 1. Parallel: 財富、打罵 2. Substantive-Modifier: 低級、痛哭 3. Subjective-Predicate: 心疼、氣虛 4. Verb-Object: 失控、免職 5. Verb-Complement: 看清、擊潰 • Opinion morphological types (with examples): 6. Negation: 無法、不慎 7. Confirmation: 有賴、有愧 8. Others: 姪子、薄荷 • Types can be obtained by SVM, CRF, handcrafted rules…
  105. 105. Example of Sentiment Trios in Chinese Opinion Treebank • Linguistic morphological types (with examples): Parallel (skipped): 美麗而聰慧 1. Substantive-Modifier: 高大的樓房 2. Subjective-Predicate: 學習認真 3. Verb-Object: 恢復疲勞 4. Verb-Complement: 收拾乾淨 • Opinion morphological types (with examples): n. Others: 為…/以…
  106. 106. Compositional Chinese Sentiment Analysis • Example: 氣虛 – Subjective-Predicate type – 氣: 0.5195 – 虛: -0.8178 – Score(氣虛) = -0.8178 • Example: 看清、看壞 – Verb-Complement type – 看: 0.1 – 清: 0.8032 – 壞: -0.9 – Score(看清) = 0.8072 – Score(看壞) = -0.9
  107. 107. Example of Using Sentiment Trios • Figure: composition rules for the Substantive-Modifier type and the Verb-Object type; each computes S(C1C2) from S(C1) and S(C2) by cases that compare the two scores with 0, and the Verb-Object rule also uses SIGN(S(C1)) and SIGN(S(C2)) • Node scores shown in the example: 0.3018, 0.6736, 0.4109, 0.6736 • Score: 0.6736
  108. 108. Preprocessing • Tokenization (segmentation) – Jieba – CKIP – Stanford parser • Part-of-speech tagging – CKIP – Stanford parser • Tokenization is mandatory; a version in which it is optional will be released in the future.
  109. 109. CopeOpi – example • $ ./run_trad.sh – Runs CopeOpi on the files listed in “file.lst” • Check the results in out/0001.txt • Screenshot labels: test_trad.txt, 0001
  110. 110. CopeOpi – example • Result summary in ./out.csv
  111. 111. Deep Neural Network Example: Word • Morphological structure for a better word representation. • Same idea, but for Chinese sentiment analysis • Luong, Thang, Richard Socher, and Christopher D. Manning. "Better Word Representations with Recursive Neural Networks for Morphology." CoNLL. 2013.
  112. 112. Deep Neural Network Example: Sentence • Learned composition function (of semantics): Richard Socher (RNN, a series of work from 2011)
  113. 113. Learning by Neural Network • Word Sentiment • Sentence Sentiment • Document Sentiment • Social Media Post Sentiment
  114. 114. Learning by Deep Neural Network • Word Sentiment: CNN + ANTUSD • Sentence Sentiment • Document Sentiment • Social Media Post Sentiment: Text + User Context – Structures are not yet considered!
  115. 115. CSentiPackage: UTCNN • Learning by Deep Neural Network • Word Sentiment: CNN + ANTUSD • Sentence Sentiment • Document Sentiment • Social Media Post Sentiment: Text + User Context
  116. 116. User Topic Comment Neural Network (UTCNN) • A deep learning model for stance classification on social media text • Inputs to the deep learning model (figure labels): authors, likers, post content, comment content, commenters, topics
  117. 117. UTCNN • Stance tendency – Author – Liker – Topic – Commenter • Semantic preference – Author – Liker – Topic – Commenter • Example: (post) "We should reject the re-construction of the Nuclear power plant." (comments) "Great!" "NO!" ……
  118. 118. If you don’t know anything about deep learning (again) … – I won’t talk too much about it. No worries. – You can take the courses organized by the Taiwan Data Science Association (臺灣資料科學協會) – For now, it is enough to know that this is a DNN model for Chinese sentiment.
  119. 119. Social Media Dataset Released in CSentiPackage • Facebook fan groups (Chinese) – Author/liker/comment/commenter – Single topic (latent topics learned by LDA) – Unbalanced – Chinese • CreateDebate (English) – Author – Four topics – Balanced – English
  120. 120. Environment • Software – OS: Linux – Programming language • Java 6 or higher • Python 2.7 – Theano 0.8.2 – Keras 1.0.3 – sklearn • Hardware – Graphics cards (deep learning)
  121. 121. Demo Environment • CPU – Intel Xeon E5-2630 v3 ×2 • RAM – 64 GB • OS – Ubuntu 14.04 LTS • Graphics cards – Nvidia Tesla K40 ×2
  122. 122. UTCNN - data • 3 46 57 … 573 49 61 4 -1 <sssss>福 島 核電廠 的 熔 毀 核 燃料棒 到底 有沒有 掉到 地下水層 …..<sssss>詳 見 俄國 時報 電視 專訪 <sssss> 544 490 565 … 428 危機 ,如果 安全 你 家 借放 ,事實 是 沒有 人 知道 真相 這 些 都 只是 推論 就 看 誰 的 推論 有 根據 合理 奇怪 的 是 擁核 五 毛 只 根據 東京 電力 的 說法 而 東京 電力 是 最 有 利益 關係 最 有 企圖 掩藏 事實 的 事主 貼 此 文 是 提 供 大家 獨立 沒有 核電 利益 纏身 的 核工 專家 與 小出裕 章 的 推論 僅 供 參考
  123. 123. UTCNN - demo http://doraemon.iis.sinica.edu.tw/wordforce/
  124. 124. UTCNN - demo http://doraemon.iis.sinica.edu.tw/wordforce/
  125. 125. Something Important About CSentiPackage • The CSentiPackage you obtained is for your group's research use only. • It has been officially released, so it can be downloaded at any time. • Download it or check what’s new @ http://academiasinicanlplab.github.io/ • Find the tutorial materials for CSentiPackage @ http://www.lunweiku.com/
  126. 126. Skills We Might Need • Text Normalization • Multimedia multimodal • User and Text Networking • Social Network
  127. 127. NLP and Social Network • NLP sometimes serves as a pre-processing step for social network research, to deal with unstructured data. • NLP in social media is sometimes referred to as Social Media Analytics. • NLP models can help find information such as events, sentiment, and named entities for social network analysis. • Network analysis algorithms can benefit NLP research by bringing in heterogeneous features.
  128. 128. Challenges • Integrating features is not easy • Integrating knowledge is not easy, either • Data are big. Performance and efficiency are trade-offs. • Social media are always changing and differ across generations. • Visualizing both texts and the network is challenging.
  129. 129. Wrap Up – Part III • More context, more to know • More context, better for guessing • Inner context, outer context, inter context • Pay more attention to the relations
  130. 130. Part IV: NLP Trends and Industry Applications
  131. 131. 1. Industrial Needs and Apps 2. Future Trend
  132. 132. Industrial Needs • Techniques can make money • Techniques can provide better services (and then make money) • Techniques can keep users engaged (and then make money)
  133. 133. Applications • Ads • Recommendation • QA • Interface: Chatbot
  134. 134. Advertisement • The most direct way to make a profit
  135. 135. Ads (1) • Google AdSense – How AdSense works: website owners can use Google AdSense to earn money from their online content. AdSense serves suitable text and multimedia ads based on your site's content and visitors. The ads are created and paid for by advertisers who want to promote their products; because the amount advertisers pay differs from ad to ad, the amount you earn varies as well. • Advertising market share: Google + FB account for 90% • But there is very little you can do (with NLP).
  136. 136. Other common forms of website advertising • Content sites: recommended sponsored articles
  137. 137. Recommendation (product recommendation) • Content-based • Collaborative filtering • User behavior • NLP techniques are needed mostly for content-based recommendation (items on e-commerce websites).
  138. 138. • User behavior can be related to unstructured data.
  140. 140. Mobile: Apps Recommendation (1) • Figure labels: Descriptions, Review, Users, Images, Others
  141. 141. Mobile: Apps Recommendation (2) • Grouping apps by similarity (like communication) or by events (like travel).
  142. 142. Chatbot: Where is my Dr. Know? • A new interface connected to understanding and text generation.
  144. 144. Two major purposes of chatbots • Chit-chat • Task-oriented • The most natural kind is somehow a mix of the two.
  145. 145. Four major types of functions • Assistant (MS Cortana) • Companion (MS XiaoIce, 小冰) • Customer service (JD JIMI, 京東JIMI) • Question answering (IBM Watson)
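  146. 146. Chatbot • Retrieval based – Principle: reply with what people usually say next, i.e., the reply that followed the most similar utterance – Advantage: every sentence was actually said by a person, so replies rarely have grammatical problems • Generation based – Principle: most current generation-based models are implemented with deep learning; they learn the encoding-decoding relation between the previous sentence and the current one in order to generate the best reply – Advantage: can produce new replies that never appeared in the corpus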
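The retrieval-based principle can be sketched with a bag-of-words cosine match. This is a toy illustration, assuming a tiny made-up corpus of (utterance, reply) pairs in English and whitespace tokenization; production systems use far larger corpora and better similarity models.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_reply(query, pairs):
    """Return the stored reply whose utterance is most similar to query."""
    query_bag = Counter(query.lower().split())
    best = max(pairs, key=lambda pair: cosine(query_bag, Counter(pair[0].lower().split())))
    return best[1]


# Hypothetical (utterance, reply) pairs harvested from past conversations.
PAIRS = [
    ("what time do you open", "We open at nine."),
    ("where are you located", "We are near the main station."),
]

print(retrieve_reply("when do you open", PAIRS))
```

Because the reply is copied from the corpus, it is always well formed, which is the advantage named on the slide; the cost is that the bot can never say anything it has not already seen.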
  147. 147. Chatbot • Slot filling: – Sequential tagging – templates
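Template-based slot filling can be sketched with a regular expression whose named groups are the slots. The template, the slot names, and the example utterance below are all made up for illustration; a sequential tagger would instead label each token with a slot type.

```python
import re

# One hand-written template; each named group is a slot (hypothetical).
TEMPLATE = re.compile(
    r"book a (?P<item>\w+) from (?P<origin>\w+) to (?P<destination>\w+)",
    re.IGNORECASE,
)


def fill_slots(utterance):
    """Return the slots found in the utterance, or {} if no match."""
    match = TEMPLATE.search(utterance)
    return match.groupdict() if match else {}


print(fill_slots("Please book a flight from Taipei to Boston"))
print(fill_slots("What is the weather today?"))
```

Templates are precise and need no training data, but every new phrasing needs a new pattern, which is one reason such bots work best in small, limited domains.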
  148. 148. Chatbot • Api.ai: template/rule-based
  149. 149. Chatbot Challenges • It is difficult to cross domains. • Very big data are needed. • It is challenging to connect to background knowledge. • However, a chatbot performs satisfactorily as a small, limited bot. Many Facebook stores use this kind of chatbot to sell things and provide services.
  150. 150. NLP in Fintech: Inclusive Financing • Credit (personal or company) – Search social media/social network – Verify personal data – (online question generation) – Link e-commerce records • For online loans
  151. 151. Future Trend • Application oriented NLP – (character-based, no more segmentation/parsing…) • Semantic oriented NLP • Language independent NLP • Multi-modal NLP • Multi-sourced/featured NLP • Knowledge empowered NLP
  152. 152. Final Wrap Up • You now know what NLP is • You have looked at the major NLP tools • You have heard about the cool things NLP can do • Start NLP today!
  153. 153. A word from our sponsors
  154. 154. The 5th International Workshop on Natural Language Processing for Social Media • In conjunction with IEEE BigData 2017, December 11-14, 2017, Boston, MA, USA • Submission deadline: October 10, 2017 • Author notification: November 1, 2017 • Organizers: Lun-Wei Ku, Academia Sinica, Taiwan & Cheng-Te Li, National Cheng Kung • https://sites.google.com/site/socialnlp2017/ • Selected papers will be recommended to international journals. • IEEE Big Data 2017: IEEE International Conference on Big Dat, 11-14 December 2017 | Boston
  155. 155. Nov 27~Dec 1
  156. 156. Thank You • Q&A
