NLP in 10 lines of code

At Cytora, our production system works 24/7 to transform billions of pieces of unstructured web data into structured data sets. This is a huge job, and we use spaCy to help us on a daily basis.

spaCy is an easy-to-use open source Python NLP library that excels at large-scale information extraction. It supports tokenization, sentence segmentation, named entity recognition, part-of-speech tagging and dependency parsing.

During this talk, we are going to demonstrate some of spaCy's core functionalities by performing a simple NLP analysis on Jane Austen's Pride and Prejudice.

Here's what we will achieve during this analysis:

- Extract the character names from the book (e.g. Elizabeth, Darcy, Bingley)

- Visualise character occurrences with regard to their relative position in the book (e.g. are some characters mentioned more at the beginning of the book and others more towards the end?)

- Describe Mr Darcy's character using syntactic dependencies
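
The first two goals come down to grouping entity offsets by name and then counting how many fall into each slice of the book. The following is a minimal sketch of that bookkeeping in plain Python, not the code from the talk: the `(text, label, start)` tuples stand in for spaCy entities, the function names `collect_person_offsets` and `bucket_counts` are hypothetical, and the document length and the `Pride` offset are assumed for illustration (the Elizabeth and Darcy offsets are taken from the slide output below).

```python
from collections import defaultdict


def collect_person_offsets(entities):
    """Group token offsets by name, keeping only PERSON entities.

    `entities` is an iterable of (text, label, start) tuples.
    """
    offsets = defaultdict(list)
    for text, label, start in entities:
        if label == "PERSON":
            offsets[text].append(start)
    return offsets


def bucket_counts(positions, doc_length, n_buckets=10):
    """Count mentions in each of n_buckets equal-width slices of the book."""
    counts = [0] * n_buckets
    for position in positions:
        # Clamp so a position equal to doc_length lands in the last bucket.
        index = min(position * n_buckets // doc_length, n_buckets - 1)
        counts[index] += 1
    return counts


# Hypothetical sample data: a few offsets from the slides plus one non-person entity.
entities = [
    ("Elizabeth", "PERSON", 1422),
    ("Darcy", "PERSON", 3005),
    ("Darcy", "PERSON", 3229),
    ("Pride", "ORG", 40),
    ("Elizabeth", "PERSON", 3670),
]
DOC_LENGTH = 5000  # assumed token count, for illustration only

offsets = collect_person_offsets(entities)
for name, positions in offsets.items():
    counts = bucket_counts(positions, DOC_LENGTH, n_buckets=5)
    # One column per slice of the book: '#' per mention, '.' for none.
    print(f"{name:<10}", " ".join("#" * c or "." for c in counts))
```

With spaCy, the input tuples could be produced from a parsed document with `((ent.text, ent.label_, ent.start) for ent in processed_text.ents)`, and the bucket counts could be handed to any plotting library instead of being printed.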

---

Published in: Technology

  1. NLP in 10 lines of code
     Andraž Hribernik
  2. Agenda
     1. NLP analysis of Pride & Prejudice
        ○ Introduction to the spaCy API
        ○ Extract characters and visualize them relative to their position in the book
        ○ Extract adjectives that describe a character in the book
     2. How we use spaCy at Cytora
  3. Pride & Prejudice by Jane Austen
     What is the book about?
     ○ 5 unmarried Bennet daughters
     ○ 2 young, wealthy gentlemen (Mr Bingley & Mr Darcy) move into their neighbourhood
     ○ The oldest Bennet daughters (Jane & Elizabeth) become involved with said gentlemen
  4. Recreate the plot in 10 lines of code!
     1. Parse text
     2. Extract named entities
     3. Keep only personal named entities
     4. Get offset for every extracted entity
     5. Plot the graph
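  5. 1. Parse text

         import spacy

         # 'en' is a shortcut link from older spaCy releases; spaCy 3+ needs a
         # full package name such as 'en_core_web_sm'.
         nlp = spacy.load('en')
         text = open('pride_and_prejudice.txt').read()
         processed_text = nlp(text)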
  6. 2. Extract named entities

         import spacy

         nlp = spacy.load('en')
         text = open('pride_and_prejudice.txt').read()
         processed_text = nlp(text)

         # Print the first seven entities with their labels.
         for ent in processed_text.ents[:7]:
             print(ent.text, ent.label_)

     Output:

         The Project Gutenberg EBook of ORG
         Jane Austen PERSON
         the Project Gutenberg License ORG
         www.gutenberg.org FAC
         Pride ORG
         Jane Austen PERSON
         August 26, 2008 DATE
  7. 3. Keep only personal named entities

         import spacy

         nlp = spacy.load('en')
         text = open('pride_and_prejudice.txt').read()
         processed_text = nlp(text)

         # Look at a slice of ten entities and keep only the people.
         for ent in processed_text.ents[300:310]:
             if ent.label_ == 'PERSON':
                 print(ent.text, ent.label_)

     Output:

         Bingley PERSON
         Elizabeth PERSON
         Darcy PERSON
         William Lucas PERSON
         Darcy PERSON
  8. 4. Get offset for every extracted entity

         ...
         processed_text = nlp(text)

         # Map each PERSON entity's text to the token offsets where it starts.
         character_offsets = defaultdict(list)
         for ent in processed_text.ents:
             if ent.label_ == 'PERSON':
                 character_offsets[ent.text].append(ent.start)

         print(character_offsets['Elizabeth'][:5])
         print(character_offsets['Darcy'][:5])
         # Indexing the document by an offset returns the token at that position.
         print(processed_text[1422])
         print(processed_text[3229])

     Output:

         [1422, 3670, 3759, 3867, 4532]
         [3005, 3229, 3367, 3410, 3754]
         Elizabeth
         Darcy
  9. 5. Plot the graph

         from collections import defaultdict
         import spacy

         nlp = spacy.load('en')
         text = open('pride_and_prejudice.txt').read()
         processed_text = nlp(text)

         # Keys are entity lemmas, which are lower-cased here (hence 'darcy').
         character_offsets = defaultdict(list)
         for ent in processed_text.ents:
             if ent.label_ == 'PERSON':
                 character_offsets[ent.lemma_].append(ent.start)

         # plot_character_timeseries is a helper that is not shown on the slide.
         plot_character_timeseries(character_offsets, ['darcy', 'bingley'])
  10. Demo
  11. Describe Mr Darcy
  12. Describe Mr Darcy
      ● Automatically describe Mr Darcy (e.g. silent, tall, young, etc.)
      ● We can solve this problem using the syntactic dependencies that are part of the spaCy API
      ● Syntactic dependencies can be visualized very nicely with displaCy
  13. Describe Mr Darcy: adjective modifier
  14. Extract all ‘amod’ dependencies in the entity’s subtree

          darcy_adjectives = []
          darcy_ents = [ent for ent in processed_text.ents if ent.lemma_ == 'darcy']

          # Collect every adjectival modifier found below a 'darcy' entity.
          for ent in darcy_ents:
              for token in ent.subtree:
                  if token.dep_ == 'amod':
                      darcy_adjectives.append(token.lemma_)

          print(set(darcy_adjectives))

      Output:

          {'handsome', 'last', 'grave', 'silent', 'particular', 'young', 'poor', 'abominable', 'disappointing', 'disagreeable', 'confidential', 'late', 'little', 'charming', 'present', 'intimate'}
  15. Describe Mr Darcy: adjective complement, noun subject
  16. Extract all ‘acomp’ dependencies from the entity’s root subtree

          for ent in darcy_ents:
              # Only consider mentions where Darcy is the subject of a verb.
              if ent.root.dep_ == 'nsubj':
                  # The adjectival complement hangs off the same verb (the root's head).
                  for child in ent.root.head.children:
                      if child.dep_ == 'acomp':
                          darcy_adjectives.append(child.lemma_)

      Output:

          {'kind', 'ashamed', 'impatient', 'answerable', 'sorry', 'unworthy', 'grow', 'fond', 'proud', 'engaged', 'little', 'clever', 'worth', 'tall', 'studious', 'punctual'}
  17. Pros & Cons of the syntactic dependencies approach
      ● No training dataset is needed
      ● Intuitive
      ● In our experience, you can achieve decent extraction precision
      ● Our approach achieved very poor recall
      ● spaCy's dependency parsing works only within a single sentence
  18. What is our mission at Cytora?
  19. spaCy at Cytora
      ● We process 2M documents every day with spaCy
      ● Named entity recognition (geolocations, actors)
      ● Dependency parsing (impact metric extraction)
      ● Integrated word embeddings (preprocessing for DL models)
  20. Cytora is hiring!
      ● Data Engineer
      ● Data Science Analyst
      ● Risk Modeler
      All open positions
  21. Thank you!
      https://github.com/cytora/pycon-nlp-in-10-lines
      https://spacy.io/
      https://demos.explosion.ai/displacy/
      http://www.cytora.com/
      andraz@cytora.com
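
The two dependency rules from slides 14 and 16 can also be traced by hand on a single sentence. The sketch below is a simplified illustration in plain Python, not the talk's code: `Token`, `children`, `subtree` and `describe` are hypothetical names, each entity is assumed to be a single token, and the parse of the example sentence is hand-annotated rather than produced by a parser. The labels `amod`, `nsubj` and `acomp` are the ones used on the slides.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed token. `head` is the index of the
# syntactic parent within the sentence; the root points at itself.
Token = namedtuple("Token", "text lemma dep head")


def children(sentence, index):
    """Indices of the tokens whose head is the token at `index`."""
    return [i for i, tok in enumerate(sentence) if tok.head == index and i != index]


def subtree(sentence, index):
    """Indices of the token at `index` and all of its descendants."""
    result = [index]
    for child in children(sentence, index):
        result.extend(subtree(sentence, child))
    return result


def describe(sentence, target):
    """Collect adjectives attached to every token whose lemma is `target`."""
    adjectives = set()
    for i, tok in enumerate(sentence):
        if tok.lemma != target:
            continue
        # Rule 1: adjectival modifiers inside the name's own subtree.
        for j in subtree(sentence, i):
            if sentence[j].dep == "amod":
                adjectives.add(sentence[j].lemma)
        # Rule 2: adjectival complements of the verb whose subject is the name.
        if tok.dep == "nsubj":
            for j in children(sentence, tok.head):
                if sentence[j].dep == "acomp":
                    adjectives.add(sentence[j].lemma)
    return adjectives


# Hand-annotated parse of "The silent Darcy was tall." (an assumption, not parser output).
sentence = [
    Token("The", "the", "det", 2),
    Token("silent", "silent", "amod", 2),
    Token("Darcy", "darcy", "nsubj", 3),
    Token("was", "be", "ROOT", 3),
    Token("tall", "tall", "acomp", 3),
    Token(".", ".", "punct", 3),
]

print(sorted(describe(sentence, "darcy")))  # → ['silent', 'tall']
```

Both rules only look at tokens connected to the name inside one sentence, which is consistent with the recall limitation listed on slide 17: a description in a neighbouring sentence is never reached.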