PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

PyParis 2017
http://pyparis.org

Published in: Technology

  1. How to prepare data for NLP. Loryfel Nunez, @lorynyc
  2. California Gold Rush
  3. “Extracting actionable information from modern big data sets requires the equivalent processing infrastructure of extracting a nugget of GOLD from a mountain of DIRT.” Nikolas Markou (via LinkedIn)
  4. (1) Have an intuition on how things work. (2) Breaking data down. (3) Keep it simple ... if possible.
  5. How does it work, anyway? (1)
  6. The General NLP Problem: dog: 3, 2, 1; red coat: 0, 0, 1; 😋 😭
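
One way to read the numbers on this slide: each term (a word such as "dog" or a phrase such as "red coat") becomes a row of counts, one count per document. The following is a minimal sketch under that reading. The three documents are made up so that the counts match the slide, and `ngrams` and `count_vector` are illustrative names, not from the talk.

```python
import re


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_vector(term, docs):
    """Count how often a (possibly multi-word) term occurs in each document."""
    n = len(term.split())
    vector = []
    for doc in docs:
        tokens = re.findall(r"[a-z]+", doc.lower())  # crude lowercase word tokenizer
        vector.append(ngrams(tokens, n).count(term))
    return vector


# Toy documents, invented for illustration only.
docs = [
    "The dog chased a dog that chased another dog.",
    "My dog likes your dog.",
    "A dog in a red coat.",
]

dog_counts = count_vector("dog", docs)
red_coat_counts = count_vector("red coat", docs)
```

A model never sees the sentences themselves, only vectors like these, which is why the preparation steps that follow matter.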
  7. Controlling the input: document unit; representation of text
  8. Inside the Machine: Smith acquires shares of Novak and Kline for $10.99 per share . Smith acquires shares of Novak and Kline for $10.99 per share . Smith acquires shares of Novak and Kline for $10.99 per share . Smith acquires shares of Novak and Kline for $10.99 per share .
  9. BREAK IT DOWN (2)
  10. Let’s Break it Down: á; Novák; Novák and Kline; Smith acquires shares of Novak and Kline for $10.99 per share. Smith acquires shares of Novak and Kline for $10.99 per share. Smith Inc. acquires shares of Novak and Kline for $10.99 per share. Smith acquires common shares of N & K for $10.99/share.
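
The slide moves from a single character (á) up through a token (Novák) to whole sentences. A rough sketch of the two lowest levels, using only the standard library: folding accents so that "Novák" and "Novak" compare equal, and splitting a sentence into tokens. The regular expression and the function names are assumptions made for illustration. A real project would more likely rely on a tokenizer from spaCy or NLTK.

```python
import re
import unicodedata

# Price-like numbers first, then words, then any single punctuation mark.
TOKEN = re.compile(r"\$?\d+(?:\.\d+)?|\w+|[^\w\s]")


def strip_accents(text):
    """Decompose characters and drop the combining marks (á becomes a)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN.findall(text)


folded = strip_accents("Novák")
tokens = tokenize("Smith acquires common shares of N & K for $10.99/share.")
```

Whether accent folding is appropriate depends on the data: it merges "Novák" with "Novak", but it also merges words that differ only by an accent.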
  11. In the real world: <p><b>Smith Buys Novak</b></p> <p></p> <p>by Anna Smith<p> <p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for $10.99/share.</p> <table style="width:100%"> <tr><th>Col1</th><th>Col2</th> </tr> <tr><td>data1</td><td>data2</td></tr> </table>
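
Markup like this has to be separated into running text and table data before any language processing. Below is a minimal sketch using Python's built-in `html.parser`. It tolerates the unclosed `<p>` on the slide by treating every new `<p>` as the end of the previous paragraph. The class and function names are illustrative, and a production pipeline would probably use a dedicated HTML library.

```python
from html.parser import HTMLParser


class TextAndTableParser(HTMLParser):
    """Collect paragraph text and table cells separately."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True: &amp; arrives as &
        self.paragraphs = []
        self.rows = []
        self._buffer = []
        self._cell = None
        self._in_table = False

    def _flush(self):
        # Join buffered text, collapse whitespace, keep non-empty paragraphs.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> also closes an unclosed one
        elif tag == "table":
            self._flush()
            self._in_table = True
        elif tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:
                self.rows.append([])  # cell without an enclosing <tr>
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
        elif tag == "table":
            self._in_table = False
        elif tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)
        elif not self._in_table:
            self._buffer.append(data)


def extract(html):
    """Return (paragraphs, table_rows) found in an HTML fragment."""
    parser = TextAndTableParser()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.paragraphs, parser.rows


SLIDE_HTML = (
    '<p><b>Smith Buys Novak</b></p> <p></p> <p>by Anna Smith<p> '
    '<p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. '
    'for $10.99/share.</p> <table style="width:100%"> '
    '<tr><th>Col1</th><th>Col2</th> </tr> '
    '<tr><td>data1</td><td>data2</td></tr> </table>'
)
paragraphs, rows = extract(SLIDE_HTML)
```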
  12. … if possible (2)
  13. Character: á, &amp;. Do you know the encoding of your input data? ◉ User tells you ◉ Metadata ◉ Figure it out (using chardet, or similar) ◉ Have your own heuristics
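
The last bullet, having your own heuristics, can be as simple as trying a short list of likely encodings in order. A sketch under that assumption follows. The candidate list is a guess that suits Western European text, not a recommendation from the talk, and the function name is illustrative. The sketch does not call chardet, which the slide mentions as the detection-based alternative. `html.unescape` handles entities such as `&amp;`.

```python
import html


def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding in turn and report which one worked."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never fails,
    # but the result may be wrong for other encodings.
    return raw.decode("latin-1"), "latin-1"


utf8_text, utf8_name = decode_with_fallback("Novák".encode("utf-8"))
legacy_text, legacy_name = decode_with_fallback("Novák".encode("cp1252"))
unescaped = html.unescape("Novak &amp; Kline")
```

Trying UTF-8 first is a reasonable default because bytes from most legacy single-byte encodings rarely form valid UTF-8 by accident.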
  14. Tokens: Forty-two, 42; Post-colonial, postcolonial; eBay, Ebay, EBAY, ebay; Fed, FED, fed; C.A.T., CAT. Heuristics, mappings, transformations. numToWord, POS (from spaCy or NLTK)
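
The variant pairs on this slide can be handled with the two simplest tools it names, a mapping and a heuristic. The sketch below is a hypothetical illustration: the `MAPPINGS` table and the acronym rule are assumptions, not the speaker's code. "Fed", "FED" and "fed" are deliberately left untouched, because telling the central bank from the verb needs context such as part-of-speech tags.

```python
import re

# Hypothetical lookup table from lowercased variant to canonical form.
MAPPINGS = {
    "forty-two": "42",
    "post-colonial": "postcolonial",
    "ebay": "eBay",
}

# Heuristic: two or more single letters, each followed by a period.
DOTTED_ACRONYM = re.compile(r"(?:[A-Za-z]\.){2,}")


def normalize_token(token):
    """Map known variants to one canonical form; leave other tokens as-is."""
    if DOTTED_ACRONYM.fullmatch(token):
        return token.replace(".", "").upper()
    return MAPPINGS.get(token.lower(), token)


normalized = [normalize_token(t) for t in ["Forty-two", "Post-colonial", "EBAY", "C.A.T.", "Fed", "fed"]]
```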
  15. Tokens: STEMMING vs LEMMATIZATION (spaCy, NLTK)

          import spacy
          from nltk.stem.porter import PorterStemmer

          # 'en' is the model shortcut used in spaCy 1.x/2.x;
          # newer releases load a named model such as 'en_core_web_sm'.
          nlp = spacy.load('en')
          stemmer = PorterStemmer()

          doc = nlp(u'She is an intelligence operative.')
          for word in doc:
              # Compare spaCy's dictionary-based lemma with NLTK's rule-based stem.
              stemmed = stemmer.stem(word.text)
              print(word.text, " LEMMA => ", word.lemma_, " STEM => ", stemmed)

      Output:

          She LEMMA => -PRON- STEM => she
          is LEMMA => be STEM => is
          an LEMMA => an STEM => an
          intelligence LEMMA => intelligence STEM => intellig
          operative LEMMA => operative STEM => oper
          . LEMMA => . STEM => .
  16. Entities: Novak and Kline, NK, NYSE:NK, Test Company; June 30, 2017, 06/30/2017, 30/6/2017. Smith acquires shares of Novak and Kline for $10.99 per share . Smith acquires shares of NK for $10.99 per share . ORG acquires shares of ORG for $10.99 per share .
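
Two normalizations are implied here: different surface forms of the same entity collapse to one label, and different date layouts collapse to one format. A minimal sketch, assuming a hand-written alias list stands in for a real entity recognizer and assuming US-style month/day order is tried before day/month. Both choices are illustrative, and a date such as 06/07/2017 stays ambiguous.

```python
import re
from datetime import datetime

# Hypothetical alias list; a real system would use an NER model or a gazetteer.
ORG_ALIASES = ["Novak and Kline", "NYSE:NK", "NK", "Smith"]


def mask_orgs(text, aliases=ORG_ALIASES, label="ORG"):
    """Replace every known alias with a generic entity label."""
    # Longest alias first, so "NYSE:NK" is not split into "NYSE:" + "NK".
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = "|".join(re.escape(alias) for alias in ordered)
    return re.sub(rf"(?<!\w)(?:{pattern})(?!\w)", label, text)


def to_iso_date(text, formats=("%B %d, %Y", "%m/%d/%Y", "%d/%m/%Y")):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


masked_full = mask_orgs("Smith acquires shares of Novak and Kline for $10.99 per share .")
masked_short = mask_orgs("Smith acquires shares of NK for $10.99 per share .")
iso_dates = [to_iso_date(d) for d in ["June 30, 2017", "06/30/2017", "30/6/2017"]]
```

After masking, both input sentences become the same string, which is the effect shown in the last line of the slide.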
  17. Hot or Not. REMOVING: words (emails, dates, URLs, stop words); more than words (tables). HIGHLIGHTING: words (hotwords); more than words (hot patterns). textacy
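
Removing and highlighting can be sketched with a few regular expressions and word sets. The slide points to textacy for this kind of preprocessing. The version below uses only the standard library, and its patterns, stop-word list, hot-word list and bracket marker are all illustrative assumptions.

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL = re.compile(r"https?://\S+")
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "at"}  # tiny sample list
HOT_WORDS = {"acquires", "acquired"}  # hypothetical domain terms


def clean_and_highlight(text):
    """Drop URLs, emails and stop words; wrap hot words in brackets."""
    text = URL.sub(" ", text)
    text = EMAIL.sub(" ", text)
    kept = []
    for token in text.split():
        bare = token.strip(".,;:!?").lower()
        if bare in STOP_WORDS:
            continue
        kept.append(f"[{token}]" if bare in HOT_WORDS else token)
    return " ".join(kept)


cleaned = clean_and_highlight(
    "Smith acquires shares of Novak and Kline for $10.99 per share. "
    "Contact ir@example.com or see http://example.com/news"
)
```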
  18. In the real world: <p><b>Smith Buys Novak</b></p> <p></p> <p>by Anna Smith<p> <p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for $10.99/share.</p> <table style="width:100%"> <tr><th>Col1</th><th>Col2</th> </tr> <tr><td>data1</td><td>data2</td></tr> </table>
  19. IRL: {'title': 'Smith Buys …', 'original_text': 'LONDON --- Smith..', 'transformed_text': {'text_with_entities': 'LOCATION – ORG acquired ….', 'lemmatized': 'Smith Inc acquire share..', 'has_acquired': true}, 'table': '<table>….. </table>'}
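
The record on this slide keeps the original text next to each transformed view, with the table stored separately. A hedged sketch of assembling such a record follows. The entity lookup is a hypothetical stand-in for a real recognizer, `build_record` is an illustrative name, and the `lemmatized` field is left out because it would need a lemmatizer such as spaCy's.

```python
import json
import re

# Hypothetical surface-form-to-label table for this one example.
ENTITY_LABELS = {
    "LONDON": "LOCATION",
    "Smith Inc.": "ORG",
    "Novak & Kline": "ORG",
}


def build_record(title, text, table_html):
    """Bundle the original text with its transformed views and metadata."""
    masked = text
    # Replace longer surface forms first to avoid partial replacements.
    for surface in sorted(ENTITY_LABELS, key=len, reverse=True):
        masked = masked.replace(surface, ENTITY_LABELS[surface])
    return {
        "title": title,
        "original_text": text,
        "transformed_text": {
            "text_with_entities": masked,
            "has_acquired": bool(re.search(r"\bacquire[sd]?\b", text, re.IGNORECASE)),
        },
        "table": table_html,
    }


record = build_record(
    "Smith Buys Novak",
    "LONDON --- Smith Inc. acquires shares for Novak & Kline. for $10.99/share.",
    "<table>...</table>",
)
as_json = json.dumps(record, ensure_ascii=False)
```

Keeping `original_text` untouched means any transformation can be redone later without going back to the raw source.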
  20. The General NLP Problem: dog: 3, 2, 1; red coat: 0, 0, 1; 😋 😭
  21. (1) Have an intuition on how things work: how algorithms see text. (2) Breaking data down: from bytes to documents. (3) Keep it simple ... if possible: patterns, normalization, metadata, actions (replace, remove, highlight).
  22. References and other stuff: ◉ Stanford NLP Group ◉ spaCy documentation ◉ scikit-learn documentation ◉ The hard knocks of NLP projects
  23. Any questions? You can find me at ◉ @lorynyc ◉ loryn808@gmail.com. Thanks!
