A Panorama of Natural Language Processing

Presented by Ted Xiao at RobotXSpace on 4/18/2017. This workshop covers the fundamentals of Natural Language Processing, crucial NLP approaches, and an overview of NLP in industry.


  1. 1. A Panorama of Natural Language Processing Ted Xiao
  2. 2. Overview • Background • Grammars • Word Representation • Modern NLP • Future Directions • NLP in Industry • Demos
  3. 3. What is Natural Language Processing? NLP! Artificial Languages: Java, C++, Binary…
  4. 4. What is Natural Language Processing? NLP! Artificial Languages: Java, C++, Binary… Natural Language: Language spoken by people.
  5. 5. What is Natural Language Processing? NLP! Artificial Languages: Java, C++, Binary… Natural Language: Language spoken by people. Motivation: Sophisticated linguistic analysis that approaches human-level understanding, for a range of tasks and applications.
  6. 6. What is Natural Language Processing? NLP! Goal: have computers understand natural language in order to perform useful tasks Artificial Languages: Java, C++, Binary… Natural Language: Language spoken by people. Motivation: Sophisticated linguistic analysis that approaches human-level understanding, for a range of tasks and applications.
  7. 7. Task Types ● Syntax ○ Parsing ○ Stemming ○ Part of speech tagging ● Discourse ○ Parsing ○ Stemming ○ Part of speech tagging
  8. 8. Task Types ● Syntax ○ Parsing ○ Stemming ○ Part of speech tagging ● Semantics ○ Machine Translation ○ Natural Language Understanding, Generation ○ OCR ○ QA, Sentiment Analysis ○ Coreference ● Discourse ○ Parsing ○ Stemming ○ Part of speech tagging ● Speech ○ Speech Recognition ○ Text-to-Speech ○ Speech-to-Text
  9. 9. Task Examples
  10. 10. Task Examples
  11. 11. NLP in Industry
  12. 12. What Makes NLP Difficult? • We don’t understand language ourselves • Language encodes meaning • Language is learned intuitively - easy for children, hard for computers
  13. 13. What Makes NLP Difficult? • We don’t understand language ourselves • Language encodes meaning • Language is learned intuitively - easy for children, hard for computers • Ambiguity • Language is symbolic • Subtleties: sarcasm, wordplay, idioms...
  14. 14. NLP vs. PLP Programming Language Processing is easier than Natural Language Processing
  15. 15. Examples of ambiguity: news headlines ● The Pope’s baby steps on gays ● Scientists study whales from space ● Juvenile court to try shooting defendant ● Boy paralyzed after tumor fights back to gain black belt
  16. 16. An NLP Disaster
  17. 17. An NLP Disaster: Microsoft Tay (March 2016)
  18. 18. ...again?! Microsoft Zo (December 2016)
  19. 19. Overview • Background • Linguistics • Word Representation • Modern NLP • Future Directions • NLP in Industry • Demos
  20. 20. Grammars ● Grammars are the formal description of the structure of a language ● Skeleton of any language
  21. 21. Basic Linguistics: Context, Form, Meaning, Structure, Audio
  22. 22. Levels of NLP
  23. 23. Overview • Background • Grammars • Word Representation • Modern NLP • Future Directions • NLP in Industry • Demos
  24. 24. Digitizing Natural Language • Need to have some measure of similarity and differences between words • Vectors can do this!
  25. 25. Digitizing Natural Language • Need to have some measure of similarity and differences between words • Vectors can do this! • We can use vector operations to gauge similarity between words
  26. 26. Word Vectors • 13 million tokens in the English language • Many words are similar (cat and feline, man and woman, etc…) • A nice idea: encode word tokens into a vector that is a point in some word space with dimension << 13 million
  27. 27. One-Hot Vectors • Express each word as an |V| dimensional vector with one 1 and the rest 0s, where |V| is the size of our vocabulary • One-hot vectors for a dictionary would look like:
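To make this concrete, here is a minimal Python/NumPy sketch of one-hot vectors over a hypothetical five-word vocabulary (the vocabulary and word order are made up for illustration):

```python
import numpy as np

# Hypothetical toy vocabulary; a real |V| would be in the millions.
vocab = ["cat", "feline", "man", "woman", "king"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the |V|-dimensional one-hot vector for a word."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("cat"))     # [1. 0. 0. 0. 0.]
print(one_hot("feline"))  # [0. 1. 0. 0. 0.]
print(np.dot(one_hot("cat"), one_hot("feline")))  # 0.0 -- orthogonal, no notion of similarity
```

Note that the dot product between any two distinct one-hot vectors is zero, which is exactly the problem the next slide raises.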
  28. 28. What’s Wrong? • One-hot vectors are independent (orthogonal) • But some words are similar! • A nicer idea: reduce the size of the space from |V| to a smaller-dimensional subspace that encodes relationships between words
  29. 29. Quick Aside: Singular Value Decomposition (SVD) • X = USV^T, where U is an (m×m) matrix of left singular vectors, V is an (n×n) matrix of right singular vectors, and S is an (m×n) matrix with the singular values of X on its diagonal • Take-away point: truncating the SVD at rank k gives the best rank-k approximation of X
  30. 30. Illustration of the SVD as a Rank-k Approximation
  31. 31. SVD-Based Methods: Window-Based Co-occurrence Matrix • Only count the number of times a word appears inside a window of a particular size around the word of interest • Consider the following three documents (window size = 1): • I enjoy flying • I like NLP • I like deep learning
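As a sketch of what that looks like, the following Python snippet builds the window-1 co-occurrence matrix for the three example sentences (whitespace tokenization and alphabetical vocabulary ordering are simplifying assumptions made here):

```python
import numpy as np

docs = ["I enjoy flying", "I like NLP", "I like deep learning"]
tokens = [d.split() for d in docs]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in tokens:
    for i, w in enumerate(sent):
        # Count every word that falls inside the window around position i.
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

print(vocab)
print(X)  # symmetric |V| x |V| matrix of co-occurrence counts
```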
  32. 32. Applying the SVD to the Co-occurrence Matrix • X = USV^T • Truncate S at some index k based on the amount of variance captured • Take the sub-matrix U[1:|V|, 1:k] to be our word embeddings • We now have a k-dimensional representation of every word in our vocabulary!
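A minimal NumPy sketch of this step (the 3×3 matrix below is just a stand-in for a real |V|×|V| co-occurrence matrix, and k = 2 is an arbitrary choice for illustration):

```python
import numpy as np

# Stand-in co-occurrence matrix; in practice this is the |V| x |V| matrix built above.
X = np.array([[0, 2, 1],
              [2, 0, 1],
              [1, 1, 0]], dtype=float)

U, S, Vt = np.linalg.svd(X)  # X = U S V^T

k = 2                        # truncate based on how much variance S[:k] captures
embeddings = U[:, :k]        # each row is a k-dimensional word embedding
print(S)                     # singular values, largest first
print(embeddings)
```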
  33. 33. Downfalls of SVD-based Methods • Co-occurrence matrix is high dimensional and sparse • SVDs are computationally expensive (quadratic cost) • Dimensions of the co-occurrence matrix are constantly changing as new words are added
  34. 34. Downfalls of SVD-based Methods • Co-occurrence matrix is high dimensional and sparse • SVDs are computationally expensive (quadratic cost) • Dimensions of the co-occurrence matrix are constantly changing as new words are added • Solution: iteration-based methods!
  35. 35. Iteration-based Methods • Word vectors and word embeddings are used to find similarity and to compute and store representative information about a huge dataset • Iteration-based methods: create a model that learns one iteration at a time and eventually encodes the probability of a word given its context • These include basic language models as well as the more advanced word2vec
  36. 36. Basic Language Models • Bag of Words • Just count the frequencies of words. • Issues: High dimension, and order and relations are lost
  37. 37. Basic Language Models • Bag of Words • Just count the frequencies of words. • Issues: High dimension, and order and relations are lost • Term Frequency-inverse Document Frequency • AKA TF-IDF • How important a word is in a document • Used in search engines!
  38. 38. Basic Language Models • Question: Are there ways we can maintain information about word order and meaning? • Bag of Words • Just count the frequencies of words. • Issues: High dimension, and order and relations are lost • Term Frequency-inverse Document Frequency • AKA TF-IDF • How important a word is in a document • Used in search engines!
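As a hedged illustration of both ideas (using scikit-learn, which is not mentioned in the slides but implements each; get_feature_names_out assumes scikit-learn 1.0+):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The dog wagged his tail",
        "The dog chased the cat",
        "I like deep learning"]

# Bag of words: one column per vocabulary word, raw counts, word order is lost.
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(counts.toarray())

# TF-IDF: re-weights counts so words that appear in many documents matter less.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))
```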
  39. 39. Language Models • Goal: assign a probability to a sequence of tokens • Consider the two sentences: • “The dog wagged his tail” • “Puffer fish bank ladder” • Which should have a higher probability? • If we assume that word occurrences are independent, the probability of any given sequence of words is: (Unigram model)
  40. 40. What if Word Occurrences Are Not Independent? • Assume the probability of a sequence depends on the pairwise probability of a word in the sequence and a word next to it (bigrams) • A bigram model is of the form: • The general N-gram model is given by:
  41. 41. Approaches So Far ● Simple models trained on huge amounts of data outperform complex models trained on small amounts of data ● Unigrams: ● Bigrams: ● N-grams:
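The formulas these slides refer to, written out in standard notation:

```latex
% Unigram model: word occurrences assumed independent
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i)

% Bigram model: each word conditioned on the previous word
P(w_1, w_2, \dots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1})

% General N-gram model: each word conditioned on the previous N-1 words
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
```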
  42. 42. Continuous Bag of Words (CBOW) • Consider part of the sequence of words as context, and try to predict the center word • Sentence: “The dog wagged his tail” • Context: {“The”, “dog”, “his”, “tail”} • Center word: “wagged” • Our known parameters are the sentence in question, represented by one-hot vectors • Let x(c) denote the context words • Let y(c) denote the target word (output)
  43. 43. Continuous Bag of Words
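A minimal plain-Python sketch of how CBOW training examples (context words → center word) are extracted from the example sentence (window size 2 is an arbitrary choice here):

```python
sentence = "The dog wagged his tail".split()
window = 2  # number of context words taken on each side

# CBOW: predict the center word from its surrounding context words.
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    print(context, "->", center)
# e.g. ['The', 'dog', 'his', 'tail'] -> wagged
```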
  44. 44. Skip-grams • Now, consider the center word as context and try to predict surrounding words • Sentence: “The dog wagged his tail” • Context: “wagged” • Surrounding words: {“The”, “dog”, “his”, “tail”} • Nearly identical set-up to CBOW, except we switch our x and y • Input: one-hot vector (context word) • Output: vectors describing the surrounding words
  45. 45. Skip-grams
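And the mirror-image sketch for skip-grams, where each (center word → one surrounding word) pair becomes a training example:

```python
sentence = "The dog wagged his tail".split()
window = 2  # arbitrary window size, as in the CBOW sketch above

# Skip-gram: predict each surrounding word from the center word.
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            print(center, "->", sentence[j])
# e.g. wagged -> The, wagged -> dog, wagged -> his, wagged -> tail
```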
  46. 46. Recap ● We first tried condensing language into word vectors ○ We want to keep meaning in a lower dimension ○ One-hot Vectors, SVD… These are expensive!
  47. 47. Recap ● We first tried condensing language into word vectors ○ We want to keep meaning in a lower dimension ○ One-hot Vectors, SVD… These are expensive! ● We then tried iteration based methods ○ Language Models: Bag of Words, TF IDF… no context!
  48. 48. Recap ● We first tried condensing language into word vectors ○ We want to keep meaning in a lower dimension ○ One-hot Vectors, SVD… These are expensive! ● We then tried iteration based methods ○ Language Models: Bag of Words, TF IDF… no context! ● We add in context with N-gram models
  49. 49. Recap ● We first tried condensing language into word vectors ○ We want to keep meaning in a lower dimension ○ One-hot Vectors, SVD… These are expensive! ● We then tried iteration based methods ○ Language Models: Bag of Words, TF IDF… no context! ● We add in context with N-gram models ● We extended these with CBOW and Skip-grams ○ CBOW: Predict center word ○ Skip-grams: Predict surrounding words
  50. 50. Learning Word-Sequence Probabilities • So far, we have an expression for the chance that a sequence of words appears, as a product of conditional probabilities • Now, we’d like models that can learn the probabilities of word sequences • Solution: word2vec (Mikolov et al., 2013)
  51. 51. Word2vec • A neural network implementation that learns distributed representations for words • 2 algorithms • Continuous bag of words • Skip-grams • 2 training methods • Negative sampling • Hierarchical softmax • Best part: DOES NOT NEED LABELED DATA!
  52. 52. Word2vec • Many of those steps are complicated... • Luckily, someone made software that does this for us • Gensim is a Python package that can do all of this complicated word2vec stuff in a few lines of code • Results: Training high-dimensional word vectors on a large amount of data captures “subtle semantic relationships between words”
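A minimal gensim sketch of those “few lines of code” (parameter names follow gensim 4.x, where the older size argument became vector_size; the toy corpus here is far too small to learn anything meaningful):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences. Real training wants millions of tokens.
sentences = [
    ["the", "dog", "wagged", "his", "tail"],
    ["the", "cat", "chased", "the", "dog"],
    ["i", "like", "deep", "learning"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # negative sampling; hs=1 would switch to hierarchical softmax
)

print(model.wv["dog"][:5])                   # first few dimensions of the learned vector
print(model.wv.most_similar("dog", topn=3))  # nearest neighbours in the embedding space
```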
  53. 53. Reflecting on word2vec • Words with similar meanings occur in clusters • Clusters are spaced such that some word relationships (such as analogies) can be reproduced with vector math • Famous example (with highly trained word vectors) • “king” - “man” + “woman” = “queen” • Useful feature: word2vec does not require labeled data • Most data in the world is unlabeled! • Word embeddings are very useful for prediction and translation tasks, as well as sentiment analysis
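With properly trained vectors the analogy arithmetic can be checked directly; a hedged sketch using one of gensim's downloadable pretrained embedding sets (the dataset name is one of gensim-data's hosted options, and the first call downloads it):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # pretrained vectors; downloaded on first use

# vector("king") - vector("man") + vector("woman") is closest to vector("queen")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```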
  54. 54. Overview • Background • Grammars • Word Representation • Modern NLP • Future Directions • NLP in Industry • Demos
  55. 55. Modern NLP • Advances in NLP were largely driven by • A vast increase in computing power • A better understanding of human language • Development of successful ML algorithms • Big data • Much of current work involves: • Machine translation • Spoken dialogue and conversational agents • Machine reading • Mining social media • Analysis and generation of speaker state
  56. 56. Forms of NLP Data • User data • Corpora • Dictionaries • Ontologies and databases
  57. 57. NLP Data Sources • Wikimedia • APIs: Twitter, Wordnik, … • Common crawl • Wordnet • Linguistic data consortium (www.ldc.upenn.edu) • University sites and the academic community • Stanford, Oxford, CMU • Create your own! • Web-scrape, crowd-source, linguists
  58. 58. Deep Learning vs. Non-Deep-Learning Methods • Bag of words may outperform deep learning models on modest-sized datasets • Word2vec sees a drastic improvement with a LOT of text • In the literature, distributed word vector techniques outperform bag of words models • Deep learning tries to capture the recursive nature of natural language
  59. 59. Deep Learning for NLP • Deep learning attempts to learn multiple levels of representation of increasing complexity and abstraction • Want computers to be able to understand the recursive nature of human language • Recursive/recurrent neural networks! • DL models can be fast ways to solve NLP tasks
  60. 60. Recurrent Neural Network ● Recurrent Neural Network ○ Connections between units form directed cycles ○ Internal state of the network allows it to exhibit dynamic temporal behavior ○ Success in speech recognition, natural language, translation, etc. ● Long Short-Term Memory: LSTM
  61. 61. Recurrent Neural Network
  62. 62. seq2seq ● Applied RNNs to Sequences ○ Generate a response based on meaningful input ○ For example, translate from English to French ● Two RNNs: an encoder that processes the input and a decoder that generates the output.
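A minimal PyTorch sketch of the two-RNN encoder/decoder idea (an illustrative skeleton rather than the architecture of any particular paper; the use of GRUs, the vocabulary sizes, and the hidden size are all arbitrary choices):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sequence and summarizes it in its final hidden state."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len) token ids
        _, hidden = self.rnn(self.embed(src))
        return hidden                            # (1, batch, hidden_size)

class Decoder(nn.Module):
    """Generates the target sequence conditioned on the encoder's hidden state."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, hidden):              # tgt: (batch, tgt_len) token ids
        output, hidden = self.rnn(self.embed(tgt), hidden)
        return self.out(output), hidden          # logits over the target vocabulary

# Toy forward pass: a 5-token "English" input and a 6-token "French" target.
enc = Encoder(vocab_size=1000, hidden_size=64)
dec = Decoder(vocab_size=1200, hidden_size=64)
src = torch.randint(0, 1000, (1, 5))
tgt = torch.randint(0, 1200, (1, 6))
logits, _ = dec(tgt, enc(src))
print(logits.shape)  # torch.Size([1, 6, 1200])
```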
  63. 63. Recursive Deep Learning • Compositional vector grammars (parsing) • Recursive autoencoders (paraphrase detection) • Matrix-vector RNNs (relation classification) • Recursive neural tensor networks (sentiment analysis)
  64. 64. What’s at UC Berkeley? • Berkeley NLP Research - Dan Klein • Computer Vision - Alexei Efros, Jitendra Malik • CV + NLP: Visual Question Answering
  65. 65. Overview • Background • Grammars • Word Representation • Modern NLP • Future Directions • NLP in Industry • Demos
  66. 66. What the (near) future holds Bots! Think Siri, but actually functional instead of a toy
  67. 67. What the (near) future holds Supporting invisible UI! The concept of invisible or zero user interaction between user and machine
  68. 68. What the (near) future holds Smarter search! The same capabilities that allow a chatbot to understand a customer’s request can enable “search like you talk” functionality
  69. 69. What the (near) future holds Intelligence from unstructured information! Analysis that accurately understands the subtleties of natural language (choice of words, tone, etc.) can provide useful knowledge and insight from that information
  70. 70. Overview • Background • Grammars • Word Representation • Modern NLP • Future Directions • NLP in Industry • Demos
  71. 71. NLP in Industry
  72. 72. NLP Architectures • Layered Model • Preprocessing • Low-level analysis • Semantic Analysis • Conversion to end products • Input/Output as API Structure
  73. 73. NLP at Scale • Systems come before algorithms • Objective functions are messy • Everything is changing • Understanding-optimization trade-off
  74. 74. NLP at Scale • Systems come before algorithms • Objective functions are messy • Everything is changing • Understanding-optimization trade-off
  75. 75. Developing an NLP system 1. Exploration a. Translate real-world requirements into a measurable goal b. Find an appropriate level and representation c. Find data for experiments
  76. 76. Developing an NLP system 1. Exploration a. Translate real-world requirements into a measurable goal b. Find an appropriate level and representation c. Find data for experiments 2. Development a. Find and utilize existing tools and frameworks b. Set up and perform a series of experiments
  77. 77. Developing an NLP system 1. Exploration a. Translate real-world requirements into a measurable goal b. Find an appropriate level and representation c. Find data for experiments 2. Development a. Find and utilize existing tools and frameworks b. Set up and perform a series of experiments 3. Production a. CPU/GPU intensive b. Most NLP frameworks are not production-ready c. Pre- and post- processing is invaluable d. Collect user feedback
  78. 78. I Have the Model… Now What? 1. Specify Performance Requirements 2. Separate Prediction Algorithm From Model Coefficients a. Select or Implement The Prediction Algorithm b. Serialize Your Model Coefficients 3. Develop Automated Tests For Your Model 4. Develop Back-Testing and Now-Testing Infrastructure 5. Challenge Then Trial Model Updates
  79. 79. Tips for NLP • Proper preprocessing is VERY important • Know your domain! • Validate your models! • Human judges • Cross-validation
  80. 80. Overview • Background • Grammars • Word Representation • Modern NLP • Future Directions • NLP in Industry • Demo
  81. 81. Programming Language Identification • Exploring code on GitHub • Goal: figure out what language a file uses • Potential methods? Filename, keywords, comments; whitespace, syntax • Must be scalable to handle a large number of constantly updating repos
  82. 82. Existing Model • Linguist: heuristics + Naive Bayes • Heuristics can be accurate but require updating and fine-tuning • Naive Bayes depends on word frequencies: prediction cost scales linearly with vocabulary size • Hard-coded rules do most of the work, leaving Naive Bayes as a last resort • Heavily dependent on file-extension classification • Selective classification: with file extensions, only 87% of files are classified
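This is not Linguist's actual code; the following is a hedged scikit-learn sketch of the "Naive Bayes over token frequencies" idea, with a made-up four-file training set and a hypothetical token pattern:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data: (file contents, language label).
snippets = [
    ("def main():\n    print('hello')", "Python"),
    ("import numpy as np\nx = np.zeros(3)", "Python"),
    ("public static void main(String[] args) { System.out.println(\"hi\"); }", "Java"),
    ("function main() { console.log('hi'); }", "JavaScript"),
]
texts, labels = zip(*snippets)

# Token frequencies -> multinomial Naive Bayes, the kind of classifier used as a
# last resort after cheap heuristics (file extension, shebang line) have been tried.
clf = make_pipeline(CountVectorizer(token_pattern=r"[A-Za-z_]+"), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["console.log('still debugging')"]))  # most likely ['JavaScript']
```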
  83. 83. Thank you! ted@ml.berkeley.edu Special thanks to Jordan Prosky
  84. 84. Appendix
  85. 85. Appendix • Language is meant to convey meaning, which we have a natural way of encoding • Children learn this very fast! • Hard for computers to learn… • Language is a symbolic signaling system • Example: “pen” (a writing pen, or an animal pen?) • Other subtleties: sarcasm, expressive signaling, …
  86. 86. What makes NLP difficult? • Language is meant to convey meaning, which we have a natural way of encoding • Children learn this very fast! • Hard for computers to learn… • Language is a symbolic signaling system • Example: “pen” (a writing pen, or an animal pen?) • Other subtleties: sarcasm, expressive signaling, …
  87. 87. Basics of NLP data preprocessing • Domain specific! • Tokenization • Example: “This is a test that isn’t so simple” • Tokens: “This”, “is”, “a”, “test”, “that”, “is”, “n’t”, “so”, “simple” • Regular expressions • Stemming • Lower-casing • Removing/adding punctuation • Other…
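A minimal sketch of a couple of these steps in plain Python plus NLTK (the regex is a simplification; NLTK's word_tokenize produces the contraction-aware split shown on the slide, and PorterStemmer is one common stemmer choice):

```python
import re
from nltk.stem import PorterStemmer  # pip install nltk

text = "This is a test that isn't so simple"

# Lower-casing + regex tokenization: keep contractions together, drop other punctuation.
tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
print(tokens)  # ['this', 'is', 'a', 'test', 'that', "isn't", 'so', 'simple']

# Stemming: reduce each token to a normalized stem.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
```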
  88. 88. SVD-Based Methods • Loop over a massive dataset to accumulate word co-occurrence counts in some matrix X • Perform the SVD on X to get U, S, and V • Use the rows of U as the word embeddings for all words in your dictionary X = ?
  89. 89. SVD-Based Methods: Word-Document Matrix • Assumption: related words often appear in the same document • Loop over many documents and every time word i appears in document j, add one to entry Xij • Very high dimensional - let’s try something better
  90. 90. We are not quite done… • Need to find suitable U and V matrices! • Two algorithms help us get what we want: • Hierarchical softmax • Negative sampling • These are complicated! • Luckily, someone made software that does this for us • Gensim is a Python package that can do all of this complicated word2vec stuff in a few lines of code
