At Cytora, our production system works 24/7 to transform billions of pieces of unstructured web data into structured data sets. This is a huge job, and we use spaCy to help us on a daily basis.
spaCy is an easy-to-use, open-source Python NLP library that excels at large-scale information extraction. It supports tokenization, sentence segmentation, named entity recognition, part-of-speech tagging and dependency parsing.
During this talk, we will demonstrate some of spaCy's core features by performing a simple NLP analysis of Jane Austen's Pride and Prejudice.
Here's what we will achieve during this analysis:
- Extract the character names from the book (e.g. Elizabeth, Darcy, Bingley)
- Visualise character occurrences with regard to their relative position in the book (e.g. are some characters mentioned more at the beginning of the book and others more towards the end?)
- Describe Mr Darcy's character using syntactic dependencies
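
The three steps above could be sketched roughly as follows. This is a minimal illustration, not the code used in the talk: the helper names (`character_names`, `relative_positions`, `describing_adjectives`, `analyse`), the `en_core_web_sm` model and the file path are all assumptions, and the `amod`/`acomp` dependency labels are the ones produced by spaCy's English models.

```python
from collections import Counter


def character_names(doc, top_n=10):
    """Most frequent PERSON entities as (name, count) pairs."""
    counts = Counter(ent.text for ent in doc.ents if ent.label_ == "PERSON")
    return counts.most_common(top_n)


def relative_positions(doc, name):
    """Positions (0.0 = start of the book, 1.0 = end) where `name` is mentioned."""
    total = len(doc)  # number of tokens in the document
    return [ent.start / total for ent in doc.ents
            if ent.label_ == "PERSON" and ent.text == name]


def describing_adjectives(doc, name):
    """Adjectives attached to `name` through the dependency parse."""
    adjectives = []
    for token in doc:
        if token.text != name:
            continue
        # "proud Darcy": the adjective directly modifies the name (amod).
        adjectives += [c.text for c in token.children if c.dep_ == "amod"]
        # "Darcy was proud": the name is the subject of a verb whose
        # adjectival complement (acomp) describes it.
        if token.dep_ == "nsubj":
            adjectives += [c.text for c in token.head.children
                           if c.dep_ == "acomp"]
    return adjectives


def analyse(path, model="en_core_web_sm"):
    """Run the three steps on a plain-text copy of the book (hypothetical path)."""
    import spacy  # lazy import: only needed when a real pipeline is loaded

    nlp = spacy.load(model)
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Raise the length limit if needed so the whole novel fits in one Doc.
    nlp.max_length = max(nlp.max_length, len(text) + 1)
    doc = nlp(text)
    return {
        "names": character_names(doc),
        "darcy_positions": relative_positions(doc, "Darcy"),
        "darcy_adjectives": describing_adjectives(doc, "Darcy"),
    }
```

Calling `analyse("pride_and_prejudice.txt")` requires spaCy and the chosen model to be installed. The returned positions can be passed to a histogram to see where in the book a character appears. Entity text is compared verbatim here, so "Mr. Darcy" and "Darcy" would count as different names unless they are normalised first.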