
Introduction to natural language processing

An overview of natural language processing


  1. Introduction to Natural Language Processing. Pham Quang Nhat Minh, FPT Technology Research Institute (FTRI), minhpqn@fpt.edu.vn
  2. IBM Watson won the Jeopardy! game
  3. Lecture Outline ● What is Natural Language Processing? ● Why is NLP hard? ● Brief history of NLP ● Fundamental tasks in NLP ● Some applications of NLP
  4. What is Natural Language Processing? ● A field of computer science, artificial intelligence, and computational linguistics ● The goal is to get computers to perform useful tasks involving human languages − Human-machine communication − Improving human-human communication, e.g., Machine Translation − Extracting information from texts
  5. Why is NLP interesting? ● Languages are involved in many human activities − Reading, writing, speaking, listening ● Voice can be used as a user interface in many applications − Remote controls, virtual assistants like Siri, ... ● NLP is used to acquire insights from massive amounts of textual data − E.g., hypotheses from medical and health reports ● NLP has many applications ● NLP is hard!
  6. Why is NLP hard? ● Natural language is highly ambiguous ● The sentence "I made her duck" may have several different meanings (from the Jurafsky and Martin book): − I cooked waterfowl for her. − I cooked waterfowl belonging to her. − I created the (plaster?) duck she owns. − I caused her to quickly lower her head or body. − I waved my magic wand and turned her into undifferentiated waterfowl.
  7. Why is NLP hard? I shot an elephant in my pajamas.
  8. Why is NLP hard? ● Natural languages are highly ambiguous at all levels − Lexical (word meaning) − Syntactic − Semantic − Discourse ● Natural languages are fuzzy ● Natural languages involve reasoning about the world − E.g., it is unlikely that an elephant wears pajamas
  9. Brief history of NLP ● Foundational insights: 1940s and 1950s − Two foundational paradigms ● Automaton ● Probabilistic/information-theoretic models ● The two camps: 1957-1970 − Symbolic paradigm: the work of Chomsky and others on formal language theory and generative syntax (1950s ~ mid 1960s) − Stochastic paradigm ● In departments of statistics
  10. Brief history of NLP ● Four paradigms: 1970-1983, an explosion of research in speech and language processing − Stochastic paradigm − Logic-based paradigm − Natural language understanding − Discourse modeling paradigm ● Empiricism and finite-state models redux: 1983-1993
  11. Brief history of NLP ● The field comes together: 1994-1999 − Probabilistic and data-driven models had become quite standard ● The rise of machine learning: 2000-now − Large amounts of spoken and textual data became available − Widespread availability of high-performance computing systems
  12. Fundamental Tasks in NLP ● Word Segmentation ● Part-of-speech (POS) tagging ● Syntactic Analysis ● Semantic Analysis
  13. Word Segmentation ● In some languages there is no space between words, or a word may consist of smaller syllables − 毎年うちの研究室の学生が1-2名国語研でアルバイトさせてもらっているので、今日は新しくアルバイトする B4 学生の紹介である。 − Nhật Bản luôn là thị trường thương mại quan trọng của Việt Nam (Nhật_Bản luôn là thị_trường thương_mại quan_trọng của Việt_Nam) ● In such languages, word segmentation is the first step of NLP systems.
  14. Word Segmentation ● A possible solution is maximum matching − Start at the beginning of the string, then repeatedly choose the longest word in the dictionary that matches the input at the current position − Nhật_Bản luôn là thị trường thương mại quan trọng của Việt Nam ● Nhật_Bản is a word in the dictionary, but "Nhật Bản luôn" is not ● Problems: − Maximum matching cannot deal with unknown words − Dependencies between words in the same sentence are not exploited
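The maximum-matching procedure above can be sketched in a few lines. This is a minimal illustration, not a production segmenter: the dictionary below is a hypothetical toy covering only the example sentence, and unknown words are emitted as single syllables.

```python
# A minimal sketch of dictionary-based maximum matching.
# DICTIONARY is a toy example made up for illustration.
DICTIONARY = {"Nhật_Bản", "luôn", "là", "thị_trường", "thương_mại",
              "quan_trọng", "của", "Việt_Nam"}

def max_match(syllables, dictionary, max_len=4):
    """Greedily choose the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(syllables):
        match = None
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(syllables), i + max_len), i, -1):
            candidate = "_".join(syllables[i:j])
            if candidate in dictionary:
                match, i = candidate, j
                break
        if match is None:          # unknown word: emit a single syllable
            match = syllables[i]
            i += 1
        words.append(match)
    return words

sentence = "Nhật Bản luôn là thị trường thương mại quan trọng của Việt Nam"
print(max_match(sentence.split(), DICTIONARY))
```

The greedy choice is what makes unknown words and cross-word dependencies problematic, as the slide notes: the algorithm never reconsiders an earlier match.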
  15. Word Segmentation ● The most successful word segmentation tools are based on machine-learning techniques. ● Word segmentation tools achieve high accuracy − vn.vitk (https://github.com/phuonglh/vn.vitk) obtained 97% accuracy on test data
  16. POS Tagging ● Each word in a sentence can be classified into classes such as verbs, adjectives, nouns, etc. ● POS tagging is the process of assigning each word in a sentence a particular part-of-speech tag, based on: − Its definition − Its context in the sentence ● The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
  17. Sequence Labeling ● Many NLP problems can be viewed as sequence labeling ● Each token in a sequence is assigned a label. ● Labels of tokens depend on the labels of other tokens in the sequence, particularly their neighbors. John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN.
  18. Sequence Labeling as Classification ● Classify each token independently ● Use information about the surrounding tokens as features (sliding window). E.g., a classifier labels "John" in "John saw the saw and decided to take it to the table." as NNP using the window of tokens around it.
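The sliding-window idea can be sketched as a feature extractor: for each token, collect the token itself plus its neighbors within a fixed window, padded with sentence-boundary symbols. The feature names (`prev1`, `next1`, ...) and the `<s>`/`</s>` padding symbols are illustrative conventions, not part of any particular library.

```python
def window_features(tokens, i, size=2):
    """Features for token i: the token itself plus neighbors in a window."""
    feats = {"word": tokens[i].lower()}
    for d in range(1, size + 1):
        # Pad with sentence-boundary symbols at the edges.
        feats[f"prev{d}"] = tokens[i - d].lower() if i - d >= 0 else "<s>"
        feats[f"next{d}"] = tokens[i + d].lower() if i + d < len(tokens) else "</s>"
    return feats

tokens = "John saw the saw".split()
print(window_features(tokens, 3))
```

A dictionary of features like this would then be fed to any standard classifier (logistic regression, SVM, etc.), one prediction per token.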
  19. Probabilistic Sequence Models • Model probabilities of pairs (token sequence, tag sequence) from an annotated data set • Exploit dependencies between tokens • Typical sequence models: • Hidden Markov Models (HMMs) • Conditional Random Fields (CRFs)
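For an HMM, the most probable tag sequence is found with Viterbi decoding. The sketch below uses a two-state tagset and start, transition, and emission probabilities that are entirely made up for illustration; a real tagger would estimate them from annotated data.

```python
# A toy Viterbi decoder for an HMM; all probabilities are invented.
import math

states = ["NN", "VB"]
start = {"NN": 0.6, "VB": 0.4}
trans = {("NN", "NN"): 0.3, ("NN", "VB"): 0.7,
         ("VB", "NN"): 0.8, ("VB", "VB"): 0.2}
emit = {("NN", "saw"): 0.2, ("VB", "saw"): 0.3,
        ("NN", "john"): 0.8, ("VB", "john"): 0.1}

def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    # table[i][s] = (log prob of best path ending in state s, backpointer)
    table = [{s: (math.log(start[s] * emit.get((s, words[0]), 1e-6)), None)
              for s in states}]
    for i in range(1, len(words)):
        row = {}
        for s in states:
            best = max(states, key=lambda p: table[i - 1][p][0]
                       + math.log(trans[(p, s)]))
            score = (table[i - 1][best][0] + math.log(trans[(best, s)])
                     + math.log(emit.get((s, words[i]), 1e-6)))
            row[s] = (score, best)
        table.append(row)
    # Backtrace from the best final state.
    tags = [max(states, key=lambda s: table[-1][s][0])]
    for i in range(len(words) - 1, 0, -1):
        tags.append(table[i][tags[-1]][1])
    return list(reversed(tags))

print(viterbi(["john", "saw"]))
```

The dynamic-programming table is what lets the model exploit the dependency between adjacent tags without enumerating all tag sequences.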
  20. Fundamental Tasks in NLP ● Word Segmentation ● Part-of-speech (POS) tagging ● Syntactic Analysis ● Semantic Analysis
  21. Syntax Analysis ● The task of recognizing a sentence and assigning a syntactic structure to it. My dog also likes eating sausage. (ROOT (S (NP (PRP$ My) (NN dog)) (ADVP (RB also)) (VP (VBZ likes) (S (VP (VBG eating) (NP (NN sausage))))) (. .)))
  22. Syntax analysis ● An important task in NLP with many applications − An intermediate representation for semantic analysis − Plays an important role in applications like question answering and information extraction − E.g., What books were written by British women authors before 1800?
  23. Syntax analysis ● A challenging task in NLP − Ambiguity: one sentence may have many possible parse trees ● Vietnamese language processing (VNLP) still lacks accurate syntax parsers (to my knowledge) − Accuracy is about 78~84%
  24. Approaches to Syntax analysis ● Top-down parsing ● Bottom-up parsing ● Dynamic programming methods − CYK algorithm − Earley algorithm − Chart parsing ● Probabilistic Context-Free Grammars (PCFGs) − Assign probabilities to derivations
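The CYK dynamic-programming idea can be sketched as a recognizer over a grammar in Chomsky Normal Form. The grammar and lexicon below are a hypothetical toy, just large enough to accept one sentence.

```python
# A minimal CKY recognizer; GRAMMAR and LEXICON are toy examples.
from itertools import product

GRAMMAR = {                       # (B, C) -> set of A with rule A -> B C
    ("NP", "VP"): {"S"},
    ("DT", "NN"): {"NP"},
    ("VB", "NP"): {"VP"},
}
LEXICON = {"the": {"DT"}, "dog": {"NN"}, "bites": {"VB"}, "man": {"NN"}}

def cky_recognize(words):
    """Return True if the toy grammar derives the sentence from S."""
    n = len(words)
    # chart[i][j] = set of nonterminals spanning words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= GRAMMAR.get((b, c), set())
    return "S" in chart[0][n]

print(cky_recognize("the dog bites the man".split()))
```

Extending each chart cell to store probabilities instead of bare nonterminals turns this recognizer into a PCFG parser that picks the most probable derivation.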
  25. Fundamental Tasks in NLP ● Word Segmentation ● Part-of-speech (POS) tagging ● Syntactic Analysis ● Semantic Analysis
  26. Semantic Analysis ● Two levels: ● Lexical semantics − Representing the meaning of words − Word sense disambiguation (e.g., the word bank) ● Compositional semantics − How words combine to form a larger meaning.
  27. Meaning representations • First-order predicate calculus − E.g., Maharani serves vegetarian food. => Serves(Maharani, VegetarianFood) − E.g., I only have five dollars and I don't have a lot of time => Have(Speaker, FiveDollars) ∧ ¬Have(Speaker, LotOfTime)
  28. Syntax-driven semantic analysis
  29. Lecture Outline ● What is Natural Language Processing? ● Why is NLP hard? ● Brief history of NLP ● Fundamental tasks in NLP ● Some NLP applications
  30. Some applications ● Information Retrieval ● Information Extraction ● Question Answering ● Text Summarization ● Machine Translation
  31. Information Retrieval ● Query: “list of good sushi restaurants in kyoto?”
  32. Architecture of an ad hoc IR system: the query goes through query processing, then search (vector space model or probabilistic) against the index to produce ranked documents; the document collection is prepared offline by indexing.
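The vector-space search step can be sketched with term-frequency vectors and cosine similarity. The documents and query below are toy examples, and real systems would add IDF weighting, stemming, and an inverted index.

```python
# A minimal sketch of vector-space retrieval with raw term frequencies.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = ["sushi restaurant in kyoto",
        "kyoto temple guide",
        "best sushi in tokyo"]
query = "good sushi restaurants in kyoto"

vectors = [Counter(d.split()) for d in docs]
qvec = Counter(query.split())
# Rank document indices by similarity to the query, best first.
ranked = sorted(range(len(docs)),
                key=lambda i: cosine(qvec, vectors[i]), reverse=True)
print(ranked[0])
```

Note that "restaurants" and "restaurant" do not match here; normalization steps like stemming belong to the query-processing stage in the architecture above.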
  33. Information Extraction ● To extract from unstructured text information that is pre-specified or pre-defined in templates − Fill a number of slots/attributes ● Example: use the template [PERSON, go, LOCATION, TIME] to extract information about where an individual went − “President Obama went to Hanoi yesterday.” − [PERSON = “President Obama”, go, LOCATION = “Hanoi”, TIME = “yesterday”]
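Filling the template for the example sentence can be sketched with a single hand-written pattern. The regular expression below is a hypothetical illustration of slot filling; real IE systems use learned sequence models rather than one brittle pattern.

```python
# A hypothetical regex-based extractor for [PERSON, go, LOCATION, TIME].
import re

PATTERN = re.compile(
    r"(?P<person>(?:President |Mr\. |Ms\. )?[A-Z][a-z]+(?: [A-Z][a-z]+)*) "
    r"went to (?P<location>[A-Z][a-z]+) "
    r"(?P<time>yesterday|today|last week)"
)

def extract(text):
    """Fill the template's slots from text, or return None if no match."""
    m = PATTERN.search(text)
    if not m:
        return None
    return {"PERSON": m.group("person"),
            "LOCATION": m.group("location"),
            "TIME": m.group("time")}

print(extract("President Obama went to Hanoi yesterday."))
```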
  34. Question Answering ● A system that automatically returns answers to a user's question by retrieving information from a collection of documents. ● Differences from an information retrieval system: − A QA system's goal is to return exact answers instead of documents related to the user's question. ● Q: Who invented the Internet? A: Robert E. Kahn and Vint Cerf. − A QA system requires more complicated semantic analysis.
  35. Question Answering ● Factoid question answering: − Who/What/Where/When − Answers are often phrases. ● Non-factoid question answering: − Definition questions − How/Why − Answers may span multiple sentences (a paragraph)
  36. Figure credit: Dr. Ngo Xuan Bach, http://tinyurl.com/jk2dv33
  37. Text Summarization ● Text summarization is the process of distilling the most important information from a text to produce an abridged version for a particular task or user. ● Useful in the era of information explosion ● Categories of text summarization: − Single-document/multi-document summarization − Extractive/abstractive summarization − Query-focused text summarization
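A minimal extractive summarizer can be sketched by scoring each sentence by the average frequency of its words and keeping the top-scoring sentences. The three sentences below are toy data, and the frequency heuristic is a deliberately simple stand-in for real sentence-scoring models.

```python
# A minimal sketch of frequency-based extractive summarization.
from collections import Counter

def summarize(sentences, n=1):
    """Rank sentences by average word frequency and keep the top n."""
    norm = lambda w: w.strip(".,").lower()   # crude token normalization
    freq = Counter(norm(w) for s in sentences for w in s.split())
    def score(s):
        toks = s.split()
        return sum(freq[norm(w)] for w in toks) / len(toks)
    return sorted(sentences, key=score, reverse=True)[:n]

sents = ["NLP is used to process text.",
         "Text summarization distills important text.",
         "The weather is nice."]
print(summarize(sents, n=1))
```

Because it only selects existing sentences, this is extractive summarization; abstractive systems instead generate new sentences.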
  38. Example of text summarization • https://www.bloomberg.com/view/articles/2016-08-23/china-s-super-bus-exposes-dark-side-of-p2p-lending • It looked like the future: a wide, elevated Chinese bus that would speed atop tracks straddling the road while multiple lanes of traffic flowed below. And the future looked surprisingly near. In early August, a prototype of the Transit Elevated Bus -- or TEB -- was tested in northern China. • Demand for such loans has exploded in recent years, growing in volume from $4.3 billion in 2013 to $71 billion in 2015. The appeal is twofold. First, China's big state-owned banks have traditionally focused their attention on other companies in the state sector, at the expense of consumers and small businesses. • Meanwhile, cash-rich Chinese are anxious to find yields higher than the anemic rates paid by China's state banks, which typically fall below 3 percent. China's dodgy stock markets aren't a terribly appealing alternative, while the attractiveness of Chinese real estate varies by region. (Output by Skype's summarization chatbot)
  39. Machine Translation ● The use of computers to automate some or all of the process of translating from one language to another. ● Fully automatic machine translation is one of the most challenging and hottest topics in NLP. ● Recent advances in Deep Learning have driven the rise of Neural Machine Translation.
  40. Example (Google Translate) Input: It looked like the future: a wide, elevated Chinese bus that would speed atop tracks straddling the road while multiple lanes of traffic flowed below. And the future looked surprisingly near. Output: Nó trông giống như tương lai: một rộng, xe buýt cao Trung Quốc sẽ tăng tốc trên đường ray trải dài đường trong khi nhiều tuyến đường giao thông chảy bên dưới. Và tương lai có vẻ ngạc nhiên gần.
  41. Approaches in Machine Translation • Rule-based methods − Transfer-based MT − Interlingual MT − Dictionary-based MT • Statistical MT • Example-based MT • Hybrid MT
  42. Bernard Vauquois' pyramid showing comparative depths of intermediary representation: interlingual machine translation at the peak, followed by transfer-based, then direct translation.
  43. How to learn NLP? • Have background knowledge in: − Probability and statistics − Basic math (linear algebra, calculus) − Machine learning − Programming • Read textbooks or attend online NLP courses: − Speech and Language Processing, by Daniel Jurafsky and James H. Martin − YouTube playlist (Dan Jurafsky & Chris Manning: Natural Language Processing): http://tinyurl.com/lb57fxf
  44. How to learn NLP? • Practice with programming exercises: − 100 NLP drill exercises: https://github.com/minhpqn/nlp_100_drill_exercises − NLP Programming Tutorial, by Graham Neubig: http://www.phontron.com/teaching.php • Compete in Kaggle data science challenges (kaggle.com)
  45. Try some NLP applications • Try the Stanford CoreNLP and Stanford Parser demos − http://nlp.stanford.edu:8080/corenlp − http://nlp.stanford.edu:8080/parser • Solve SAT-style math questions − http://euclid.allenai.org
  46. References 1. Speech and Language Processing, by Daniel Jurafsky and James H. Martin 2. An Introduction to Natural Language Processing series (http://tinyurl.com/hdg58wx)
  47. References • An Introduction to Natural Language Processing - Section 1 (http://tinyurl.com/ztkwb2b) • An Introduction to Natural Language Processing - Section 2: Some Brief History (http://tinyurl.com/j48or27) • An Introduction to Natural Language Processing - Section 3: Fundamental Tasks in NLP (http://tinyurl.com/zk7dgzv) • An Introduction to Natural Language Processing - Section 4: Some Applications (http://tinyurl.com/jk2dv33) • An Introduction to Natural Language Processing (http://tinyurl.com/hdg58wx)
