50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro Panebianco

50 Shades of Text - Leveraging Natural Language Processing (NLP) to validate, improve, and expand the functionalities of a product

Nowadays, every company either stores or produces text data, from web logs and user queries to translations and support tickets, yet not everyone knows how to extract valuable insights from it. In this session, we present a practical case of moving from raw text data to a valuable business application, leveraging some of the major NLP methodologies (word embeddings, word2vec, doc2vec, fastText, etc.).

Bio: Alessandro is a data veteran. He holds two Master's degrees in computer engineering, one from Politecnico di Milano and the other from the University of Illinois at Chicago (UIC).

He started his career in data consultancy, where he mastered Apache Spark for machine learning projects, and subsequently joined WW Grainger, one of the largest MRO e-commerce companies in the United States. In September 2017, after more than five years in the USA, Alessandro returned to his native Italy, where he now leads a team of data scientists. His current work focuses on achieving energy efficiency through the automation of energy management processes for commercial customers.


  1. 50 Shades of Text - Leveraging Natural Language Processing, Alessandro Panebianco
  2. Agenda: About Me • Natural Language Processing • Vectorization Techniques • Word Embeddings • Sentence Embeddings • Demo • Lessons Learned
  3. About Me: Computer Engineering • Data Science Consultancy • E-commerce • Energy & Utilities. Email: ale.panebianco@me.com
  4. Natural Language Processing. Language is the method of human communication, either spoken or written, consisting of the use of words in a structured and conventional way.
  5. Natural Language Processing. The goal of Natural Language Processing is for computers to achieve human-like comprehension of text and language.
  6. Natural Language Processing: Why? https://youtu.be/lXUQ-DdSDoE?t=81
  7. Natural Language Processing Applications: machine translation (Google Translate) • natural language generation (Reddit bots) • sentiment analysis (Cambridge Analytica) • lexical semantics (thesauri) • web and application search (Amazon) • question answering (chatbots) … and many others.
  8. How do we enable machines to interpret language? By transforming raw text into numerical features: Bag of Words, Hashing Trick, TF-IDF, Word2Vec, GloVe, FastText.
  9. Vectorization Techniques: Bag of Words. How to go from words to vectors?
     S1: Without music life would be a mistake
     S2: Radiohead are a great music band

          without  music  life  would  be  a  mistake  Radiohead  are  great  band
     S1      1       1     1      1     1   1     1        0        0     0     0
     S2      0       1     0      0     0   1     0        1        1     1     1

     ✓ Easy to implement ✓ Fast | ๏ Dictionary size ๏ Sparsity ๏ No word order
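A minimal sketch of the bag-of-words vectorization above, using scikit-learn's CountVectorizer (the library choice is an assumption; the slides don't name one):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Without music life would be a mistake",
    "Radiohead are a great music band",
]

# The default token pattern drops one-character tokens such as "a";
# this pattern keeps them so the output matches the table above.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the learned dictionary (alphabetical)
print(X.toarray())                         # one row per sentence, counts per word
```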
  10. Vectorization Techniques (II): Hashing Trick. ✓ Same input -> same output ✓ Range is always fixed (vector size) | ๏ Hashing is one-way ๏ Different inputs can produce the same output (collisions)
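A sketch of the hashing trick with scikit-learn's HashingVectorizer; n_features=16 is an arbitrary illustrative size:

```python
from sklearn.feature_extraction.text import HashingVectorizer

sentences = [
    "Without music life would be a mistake",
    "Radiohead are a great music band",
]

# No fit step and no stored dictionary: each token is hashed straight
# into one of n_features columns, so the output size is always fixed.
# The cost is that distinct words can collide into the same column.
vectorizer = HashingVectorizer(n_features=16, norm=None, alternate_sign=False)
print(vectorizer.transform(sentences).toarray())
```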
  11. Vectorization Techniques (III): TF-IDF. Term Frequency - Inverse Document Frequency weights rare words higher than common words. Words appearing in every sentence ("music", "a") get weight 0 here because their inverse document frequency is zero.

          without  music  life  would  be   a   mistake  Radiohead  are  great  band
     S1     0.3      0    0.3    0.3   0.3   0     0.3       0        0     0     0
     S2      0       0     0      0     0    0      0       0.3      0.3   0.3   0.3

     ✓ Easy to implement ✓ Fast ✓ Weights words | ๏ Dictionary size ๏ Sparsity ๏ No word order
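The same two sentences through scikit-learn's TfidfVectorizer. Note that scikit-learn's smoothed IDF gives words shared by both sentences a small nonzero weight rather than the exact zeros shown in the table:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Without music life would be a mistake",
    "Radiohead are a great music band",
]

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(sentences)

# Words unique to one sentence ("mistake", "radiohead", ...) get high
# weights; words in both sentences ("music", "a") are down-weighted.
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0].round(2))))
```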
  12. Word Embeddings: Word2Vec. The goal of word embeddings is to generate vectors that encode semantics. Word2Vec does this by maximizing the cosine similarity between the (randomly initialized) vectors of words that appear within the same context window, e.g. a window sliding over "Without music life would be a mistake".
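A minimal gensim sketch of training Word2Vec with a context window (gensim 4 API assumed; a real model needs a far larger corpus than two sentences):

```python
from gensim.models import Word2Vec

corpus = [
    "without music life would be a mistake".split(),
    "radiohead are a great music band".split(),
]

# window is the context window from the slide; sg=1 selects skip-gram.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv["music"][:5])                  # first dimensions of the embedding
print(model.wv.most_similar("music", topn=3)) # nearest neighbors by cosine similarity
```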
  13. Word Embeddings: Word2Vec (II). King - Man + Woman = Queen. Applications: analogies • synonyms • syntactic-semantic vectors • part-of-speech tagging • named entity recognition.
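The king - man + woman analogy can be reproduced with a pretrained model from gensim's downloader (the model name below is one of the bundled options; the first call downloads it, roughly 1.6 GB):

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # downloaded and cached on first use

# "king" is to "man" as ? is to "woman"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected top answer: 'queen'
```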
  14. Word Embeddings: GloVe. It differs from word2vec by being a count-based rather than a predictive model: it performs dimensionality reduction on the word co-occurrence count matrix. Similarity is still measured with cosine similarity.
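Since pretrained GloVe vectors ship as plain text (see the demo link on slide 20), they can be loaded into a dict and compared with cosine similarity; the file name below assumes the glove.6B download:

```python
import numpy as np

embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:                      # each line: word v1 v2 ... v100
        word, *values = line.split()
        embeddings[word] = np.asarray(values, dtype="float32")

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["music"], embeddings["band"]))
```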
  15. Word Embeddings: FastText. FastText : Word Embeddings = XGBoost : Random Forest. (FastText extends word2vec-style embeddings with character n-gram information, which helps with rare and out-of-vocabulary words.)
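A gensim FastText sketch showing the point of the analogy: character n-grams let the model compose a vector even for a word it never saw (gensim 4 API assumed):

```python
from gensim.models import FastText

corpus = [
    "without music life would be a mistake".split(),
    "radiohead are a great music band".split(),
]

# min_n/max_n are the character n-gram lengths used to build word vectors.
model = FastText(sentences=corpus, vector_size=50, window=3,
                 min_count=1, min_n=3, max_n=5)

print("musik" in model.wv.key_to_index)  # False: never seen in training...
print(model.wv["musik"][:5])             # ...but a vector is still composed
```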
  16. Sentence Embeddings. What if we want to represent more than a single word? Many techniques have been used: common aggregation operations (average, sum, concatenation, etc.) • Doc2Vec • neural networks (CNN, LSTM, etc.). A minimal averaging sketch follows.
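The simplest of the aggregation options above, averaging word vectors into a sentence vector (reusing a toy Word2Vec model like the one from slide 12):

```python
import numpy as np
from gensim.models import Word2Vec

corpus = [
    "without music life would be a mistake".split(),
    "radiohead are a great music band".split(),
]
model = Word2Vec(sentences=corpus, vector_size=50, min_count=1)

def sentence_vector(tokens, wv):
    """Average the vectors of the tokens the model knows; zeros otherwise."""
    known = [wv[t] for t in tokens if t in wv]
    return np.mean(known, axis=0) if known else np.zeros(wv.vector_size)

print(sentence_vector("life without music".split(), model.wv)[:5])
```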
  17. Sentence Embeddings (II): Doc2Vec. Every paragraph is mapped to a unique vector. The paragraph token can be thought of as another word: it acts as a memory that remembers what is missing from the current context, i.e. the topic of the paragraph.
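A minimal gensim Doc2Vec sketch; each TaggedDocument tag plays the role of the paragraph token described above:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument("without music life would be a mistake".split(), [0]),
    TaggedDocument("radiohead are a great music band".split(), [1]),
]

model = Doc2Vec(documents=docs, vector_size=50, min_count=1, epochs=40)

print(model.dv[0][:5])                                        # learned paragraph vector
print(model.infer_vector("life without music".split())[:5])   # vector for unseen text
```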
  18. Sentence Embeddings (III): CNN. Stacking word vectors creates a matrix (an "image") • filters act like word scans (e.g. catching misspellings) • max pooling highlights the most important words (e.g. which item a query is about) • an LSTM layer can preserve the word order.
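A minimal Keras sketch of that convolutional setup; vocab_size, embed_dim, and max_len are illustrative assumptions, and the inputs are padded sequences of word indices:

```python
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 20000, 100, 50

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, embed_dim),               # word matrix, the "image"
    layers.Conv1D(128, kernel_size=3, activation="relu"),  # filters scan 3-word windows
    layers.GlobalMaxPooling1D(),                           # keep the strongest feature per filter
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```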
  19. Sentence Embeddings (IV): LSTM. RNNs resemble how we process language (e.g. Google searches) • the LSTM layer generates a new encoding of the original input that respects word order (return_sequences=True) • the convolution layer then filters the most important local features (e.g. which item a query is about).
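The corresponding LSTM-then-convolution sketch, with return_sequences=True as on the slide so the convolution still sees one encoding per word (same illustrative sizes as above):

```python
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 20000, 100, 50

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, embed_dim),
    # return_sequences=True emits one encoding per time step, preserving word order
    layers.LSTM(64, return_sequences=True),
    layers.Conv1D(64, kernel_size=3, activation="relu"),  # filter local features
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```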
  20. Demo. Training data: https://www.kaggle.com/c/home-depot-product-search-relevance/data • GloVe vectors: http://nlp.stanford.edu/data/glove.6B.zip
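A quick look at the Kaggle training data from the link above (the column names and latin-1 encoding match that competition's published data, but treat this as a sketch):

```python
import pandas as pd

# train.csv from the Home Depot search-relevance competition above.
train = pd.read_csv("train.csv", encoding="ISO-8859-1")
print(train[["search_term", "product_title", "relevance"]].head())
```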
  21. Lessons Learned. NLP is one of the most mature research fields in the AI space • make your own word embeddings using an ad-hoc vocabulary • with a large corpus, try FastText • with short texts (e.g. user queries), experiment with higher text granularity (n-grams, characters; see the sketch below) • explore sentence embeddings through neural networks.
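One way to try the higher-granularity advice above: character n-grams via scikit-learn, which keep short, typo-prone queries comparable:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# char_wb builds character 3- to 5-grams inside word boundaries, so a
# misspelled query still shares most of its features with the correct one.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vec.fit_transform(["angle grinder", "angel grinder"])  # note the typo
print(cosine_similarity(X[0], X[1]))  # high similarity despite the misspelling
```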
  22. Questions?
