PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

PyParis 2017
http://pyparis.org

Published in: Technology

  1. How to prepare data for NLP. Loryfel Nunez, @lorynyc
  2. California Gold Rush
  3. “Extracting actionable information from modern big data sets requires the equivalent processing infrastructure of extracting a nugget of GOLD from a mountain of DIRT.” Nikolas Markou (via LinkedIn)
  4. (1) Have an intuition on how things work; (2) Breaking data down; (3) Keep it simple ... if possible
  5. How does it work, anyway? (1)
  6. The General NLP Problem. dog: 3, 2, 1; red coat: 0, 0, 1; 😋 😭
  7. Controlling the input: Document Unit; Representation of text
  8. Inside the Machine: Smith acquires shares of Novak and Kline for $10.99 per share . (the sentence appears four times on the slide)
  9. BREAK IT DOWN (2)
  10. Let’s Break it Down: á; Novák; Novák and Kline; Smith acquires shares of Novak and Kline for $10.99 per share. Smith acquires shares of Novak and Kline for $10.99 per share. Smith Inc. acquires shares of Novak and Kline for $10.99 per share. Smith acquires common shares of N & K for $10.99/share.
  11. In the real world: <p><b>Smith Buys Novak</b></p> <p></p> <p>by Anna Smith<p> <p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for $10.99/share.</p> <table style="width:100%"> <tr><th>Col1</th><th>Col2</th> </tr> <tr><td>data1</td><td>data2</td></tr> </table>
  12. … if possible (2)
  13. Character: á, &amp;. Do you know the encoding of your input data? ◉ User tells you ◉ Metadata ◉ Figure it out (using chardet, or similar) ◉ Have your own heuristics
  14. Tokens: Forty-two, 42; Post-colonial, postcolonial; eBay, Ebay, EBAY, ebay; Fed, FED, fed; C.A.T., CAT. Heuristics; Mappings; Transformations; numToWord, POS (from spaCy or NLTK)
  15. Tokens: STEMMING vs LEMMATIZATION (spaCy, NLTK)

        import spacy
        from nltk.stem.porter import PorterStemmer

        # 'en' is the model shortcut used on the slide
        nlp = spacy.load('en')
        stemmer = PorterStemmer()
        doc = nlp(u'She is an intelligence operative.')

        # Print the spaCy lemma and the Porter stem side by side for each token
        for word in doc:
            stemmed = stemmer.stem(word.text)
            print(word.text, " LEMMA => ", word.lemma_, " STEM => ", stemmed)

      Output:

        She LEMMA => -PRON- STEM => she
        is LEMMA => be STEM => is
        an LEMMA => an STEM => an
        intelligence LEMMA => intelligence STEM => intellig
        operative LEMMA => operative STEM => oper
        . LEMMA => . STEM => .
  16. Entities: Novak and Kline, NK, NYSE:NK, Test Company; June 30, 2017; 06/30/2017; 30/6/2017. Smith acquires shares of Novak and Kline for $10.99 per share . Smith acquires shares of NK for $10.99 per share . ORG acquires shares of ORG for $10.99 per share .
  17. Hot or Not. REMOVING: words (emails, dates, URLs, stop words); more than words (tables). HIGHLIGHTING: words (hotwords); more than words (hot patterns). textacy
  18. In the real world: <p><b>Smith Buys Novak</b></p> <p></p> <p>by Anna Smith<p> <p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for $10.99/share.</p> <table style="width:100%"> <tr><th>Col1</th><th>Col2</th> </tr> <tr><td>data1</td><td>data2</td></tr> </table>
  19. IRL

        {
          'title': 'Smith Buys …',
          'original_text': 'LONDON --- Smith..',
          'transformed_text': {
            'text_with_entities': 'LOCATION – ORG acquired …. ',
            'lemmatized': 'Smith Inc acquire share..',
            'has_acquired': true
          },
          'table': '<table>….. </table>'
        }
  20. The General NLP Problem. dog: 3, 2, 1; red coat: 0, 0, 1; 😋 😭
  21. (1) Have an intuition on how things work: how algorithms see text; (2) Breaking data down: from bytes to documents; (3) Keep it simple ... if possible: patterns, normalization, metadata, actions (replace, remove, highlight)
  22. References and other stuff: ◉ Stanford NLP Group ◉ spaCy Documentation ◉ scikit-learn Documentation ◉ The hard knocks of NLP projects
  23. Any questions? You can find me at ◉ @lorynyc ◉ loryn808@gmail.com Thanks!
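
Slides 6 and 20 show per-document counts for two terms (dog: 3, 2, 1 and red coat: 0, 0, 1). One way to read this is as a bag-of-words style count table, which is roughly how many algorithms see text. The sketch below only illustrates that reading: the three documents and the `term_counts` helper are hypothetical, chosen so that the counts match the slide.

```python
# Hypothetical documents, invented so the counts match slides 6 and 20.
documents = [
    "Dog meets dog, then another dog.",
    "A dog barked at the other dog.",
    "The dog wore a red coat.",
]


def term_counts(docs, terms):
    """Count each term in each document (case-insensitive).

    Uses plain substring counting, so "dog" would also match inside
    "dogs"; a real pipeline would count tokens instead.
    """
    return {term: [doc.lower().count(term) for doc in docs] for term in terms}


counts = term_counts(documents, ["dog", "red coat"])
for term, per_document in counts.items():
    print(term, per_document)
```

Each term becomes a row of numbers, one per document; the original wording and word order are gone, which is why the input has to be controlled before this step.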
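
Slide 13 lists four ways to learn the encoding of the input, ending with "have your own heuristics", and shows an HTML entity (&amp;) next to an accented character. A minimal sketch of the heuristic route follows, using only the standard library rather than chardet. The candidate list, the fallback, and the names `decode_bytes` and `clean_characters` are assumptions made for illustration.

```python
import html

# Assumed candidate encodings, tried in order after any declared encoding.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def decode_bytes(raw, declared=None):
    """Return (text, encoding_used) for a byte string.

    `declared` is whatever the user or the metadata claims; it is tried
    first but not trusted blindly.
    """
    candidates = ([declared] if declared else []) + list(CANDIDATE_ENCODINGS)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # latin-1 maps every byte to a character, so it never fails.
    return raw.decode("latin-1"), "latin-1"


def clean_characters(raw, declared=None):
    """Decode bytes, then turn HTML entities such as &amp; into characters."""
    text, encoding = decode_bytes(raw, declared)
    return html.unescape(text), encoding


print(clean_characters("Novák &amp; Kline".encode("utf-8")))
print(clean_characters("Novák &amp; Kline".encode("latin-1"), declared="utf-8"))
```

The second call shows the case where the declared encoding is wrong: decoding as UTF-8 fails, so the next candidate is tried.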
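
Slide 14 groups token variants that probably should (or should not) collapse to one form, and names heuristics, mappings, and transformations as the tools. The sketch below is one possible combination, not the speaker's implementation: the `MAPPINGS` table, the `PROTECTED` set, and `normalize_token` are hypothetical, and treating capitalized "Fed" as distinct from the verb "fed" is an assumed heuristic.

```python
# Hypothetical lookup table: lowercase variant -> canonical form.
MAPPINGS = {
    "forty-two": "42",
    "post-colonial": "postcolonial",
    "c.a.t.": "cat",
}

# Assumed heuristic: for these tokens, case carries meaning
# (the institution "Fed" versus the verb "fed").
PROTECTED = {"Fed", "FED"}


def normalize_token(token):
    """Map a token to a canonical form using the rules above."""
    if token in PROTECTED:
        return "FED"
    lowered = token.lower()  # eBay, Ebay, EBAY -> ebay
    return MAPPINGS.get(lowered, lowered)


for variant in ["Forty-two", "42", "Post-colonial", "eBay", "EBAY",
                "Fed", "fed", "C.A.T.", "CAT"]:
    print(variant, "->", normalize_token(variant))
```

Whether "C.A.T." and "CAT" should really merge with "cat" depends on the domain, so the table is the part to review with someone who knows the data.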
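
Slide 16 shows several names for the same organization, three spellings of the same date, and a sentence in which company names are replaced by ORG. A rough sketch of both normalizations follows, using a fixed alias list and regular expressions instead of a trained entity recognizer. The alias list (including "Smith"), the date-format order, and the function names are assumptions for illustration.

```python
import re
from datetime import datetime

# Aliases from slide 16, plus "Smith" so the output matches the masked sentence.
ORG_ALIASES = ["Novak and Kline", "NYSE:NK", "NK", "Test Company", "Smith"]

# Longest alias first, so "Novak and Kline" wins over any shorter alias.
ORG_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(a) for a in sorted(ORG_ALIASES, key=len, reverse=True))
    + r")\b"
)

# Assumed order: month-first is tried before day-first, so an ambiguous
# date such as 06/07/2017 is read as June 7.
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%d/%m/%Y")


def mask_orgs(text):
    """Replace every known organization alias with the placeholder ORG."""
    return ORG_PATTERN.sub("ORG", text)


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(mask_orgs("Smith acquires shares of Novak and Kline for $10.99 per share ."))
print([normalize_date(d) for d in ["June 30, 2017", "06/30/2017", "30/6/2017"]])
```

Parsing month names with `%B` depends on the process locale; the sketch assumes an English or C locale.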
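
Slide 17 separates what to remove (emails, dates, URLs, stop words) from what to highlight (hotwords) and points to textacy. The sketch below swaps in plain regular expressions from the standard library, so it is not textacy's API. The patterns are deliberately simple, and the stop-word list, the hotword list, and the function names are hypothetical.

```python
import re

# Simplified patterns; real emails and URLs are messier than this.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL_RE = re.compile(r"https?://\S+")

# Tiny illustrative lists, not taken from any library.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "or", "to"}
HOT_WORDS = {"acquires", "acquired"}


def remove_noise(text):
    """Drop URLs, email addresses, and stop words."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    kept = [word for word in text.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)


def highlight(text):
    """Upper-case hotwords so they stand out to a later step."""
    return " ".join(
        word.upper() if word.lower() in HOT_WORDS else word
        for word in text.split()
    )


print(remove_noise("Contact loryn@example.com or see http://pyparis.org for the slides"))
print(highlight("Smith acquires shares of Novak"))
```

Removal and highlighting are kept as separate functions because the slide treats them as separate decisions: something removed for one model may be exactly what another model needs to see.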
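
Slides 18 and 19 go from messy HTML to a structured record with a title, the original text, and the table kept apart. A minimal sketch of that step with the standard library's `html.parser` follows. The `ArticleParser` class and `to_record` function are hypothetical, the first bold run is assumed to be the title, and the table is stored as rows of cells rather than as the raw `<table>` markup shown on slide 19.

```python
from html.parser import HTMLParser

SAMPLE = (
    "<p><b>Smith Buys Novak</b></p> <p></p> <p>by Anna Smith<p> "
    "<p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. "
    "for $10.99/share.</p> "
    '<table style="width:100%"> <tr><th>Col1</th><th>Col2</th> </tr> '
    "<tr><td>data1</td><td>data2</td></tr> </table>"
)


class ArticleParser(HTMLParser):
    """Split a document into title, paragraph text, and table rows."""

    def __init__(self):
        super().__init__()  # entities such as &amp; are decoded by default
        self.in_bold = False
        self.in_table = False
        self.title = None
        self.paragraphs = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_bold = True
        elif tag == "table":
            self.in_table = True
        elif tag == "tr":
            self.rows.append([])

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False
        elif tag == "table":
            self.in_table = False

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return
        if self.in_table:
            if self.rows:
                self.rows[-1].append(text)
        elif self.in_bold and self.title is None:
            self.title = text  # assumption: first bold run is the title
        else:
            self.paragraphs.append(text)


def to_record(markup):
    parser = ArticleParser()
    parser.feed(markup)
    parser.close()
    return {
        "title": parser.title,
        "original_text": " ".join(parser.paragraphs),
        "table": parser.rows,
    }


record = to_record(SAMPLE)
print(record)
```

The byline ("by Anna Smith") ends up in `original_text` here; pulling it into its own field would need another heuristic. The `transformed_text` entries from slide 19 would be filled in by later steps such as entity masking and lemmatization.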
