Introduction to spaCy
2017-10-19
@auzen_
Goal
Feel motivated to use spaCy
Be able to start trying spaCy right away
What's spaCy?
"Industrial-Strength" NLP Library
Fastest in the world
  written in Cython
Get things done
  easy to install
  simple API
Deep learning
  interoperates seamlessly with TensorFlow, Keras, Scikit-Learn, Gensim
Fastest in the world
Syntactic parsing
(Choi et al., IJCNLP, 2015)
https://spacy.io/docs/api/
Fastest in the world
Detailed speed comparison
https://spacy.io/docs/api/
Get things done
Installation:
$ pip install spacy
$ python -m spacy download en
Load model and process text:
import spacy
nlp = spacy.load('en')
doc = nlp('Can you process this text?')
Get things done
POS tagging:
for token in doc:
    print(token, token.pos_)
Can VERB
you PRON
process VERB
this DET
text NOUN
? PUNCT
Get things done
Dependency parsing:
for token in doc:
    print('{} -({})-> {}'.format(token.head, token.dep_, token))
process -(aux)-> Can
process -(nsubj)-> you
process -(ROOT)-> process
text -(det)-> this
process -(dobj)-> text
process -(punct)-> ?
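The arcs printed above describe a tree: every token points at its head, and the ROOT token points at itself. As a rough sketch of how to read that output, the snippet below rebuilds the tree from the printed triples using only the standard library. The names `arcs`, `children`, and `render` are illustrative and not part of spaCy; with a real `doc` you would walk `token.children` instead.

```python
from collections import defaultdict

# (head, relation, child) triples copied from the output above.
arcs = [
    ("process", "aux", "Can"),
    ("process", "nsubj", "you"),
    ("process", "ROOT", "process"),
    ("text", "det", "this"),
    ("process", "dobj", "text"),
    ("process", "punct", "?"),
]

# Group children under their head; the ROOT arc points at itself.
# Keying by word text is a shortcut that only works because no word repeats here.
children = defaultdict(list)
root = None
for head, dep, child in arcs:
    if dep == "ROOT":
        root = child
    else:
        children[head].append((dep, child))


def render(word, label="ROOT", depth=0):
    """Return indented lines for the subtree rooted at `word`."""
    lines = ["{}{}: {}".format("  " * depth, label, word)]
    for dep, child in children[word]:
        lines.extend(render(child, dep, depth + 1))
    return lines


tree_lines = render(root)
print("\n".join(tree_lines))
```

The indentation of each printed line reflects how far the token sits below the root, so "this" appears nested under "text".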
Get things done
Named entity recognition:
doc = nlp('The current capital of Japan is Tokyo.')
print(doc.ents)
(Japan, Tokyo)
What's next?
More about spaCy
  Natural Language Processing in 10 Lines of Code
  How spaCy Works
Integrate with deep learning libraries
  Deep Learning with custom pipelines and Keras
  Sense2vec with spaCy and Gensim
Try spaCy on the website
Dependency parsing
Named entity recognition
Sentence similarity
textacy
higher-level NLP built on spaCy
Documentation / GitHub / API Reference
textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance spaCy library.
textacy focuses on tasks facilitated by the ready availability of tokenized, POS-tagged, and parsed text.
Features
https://github.com/chartbeat-labs/textacy
Features that look useful for our lab
textacy.preprocess.preprocess_text
  fix "broken" unicode
  replace all URL strings with '*URL*'
  replace all email strings with '*EMAIL*'
  replace all phone number strings with '*PHONE*'
  replace all number-like strings with '*NUMBER*'
  ...
Features that look useful for our lab
textacy.preprocess.preprocess_text
from textacy.preprocess import preprocess_text
# The sample text reads "This lab is amazing!!!" followed by the lab's URL.
text = 'ここの研究室すごいよ!!! http://www.cl.ecei.tohoku.ac.jp'
preprocess_text(text, no_urls=True)
'ここの研究室すごいよ!!! *URL*'
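To make the idea behind this kind of preprocessing concrete, here is a minimal standard-library sketch of placeholder replacement. It is not textacy's implementation: the regular expressions and the function name `replace_placeholders` are simplified assumptions chosen for illustration, and textacy's own patterns are more thorough.

```python
import re

# Simplified patterns (assumptions for illustration only).
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def replace_placeholders(text):
    """Replace URLs, email addresses, and numbers with fixed placeholder tokens."""
    # URLs go first so that digits inside a URL are not rewritten by the number pass.
    text = URL_RE.sub("*URL*", text)
    text = EMAIL_RE.sub("*EMAIL*", text)
    text = NUMBER_RE.sub("*NUMBER*", text)
    return text


sample = "Mail me at someone@example.com or see http://example.com/page2 by 2017"
cleaned = replace_placeholders(sample)
print(cleaned)
```

The order of the substitutions matters: running the number pass before the URL pass would damage URLs that contain digits.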
Features that look useful for our lab
textacy.extract.pos_regex_matches
from textacy.extract import pos_regex_matches
from textacy.constants import POS_REGEX_PATTERNS
list(pos_regex_matches(nlp('Can you process this text?'),
                       POS_REGEX_PATTERNS['en']['NP']))
[this text]
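The idea of matching a regular expression over part-of-speech tags can be sketched without any NLP library. The snippet below encodes the tag sequence as a string, runs an ordinary regex over it, and maps the match back to tokens. The pattern `NP_PATTERN` and the helper `pos_regex_spans` are simplified assumptions for illustration, not textacy's actual pattern or code; the tagged tokens are copied from the POS tagging slide.

```python
import re

# (token, coarse POS) pairs copied from the POS tagging output earlier in the deck.
tagged = [("Can", "VERB"), ("you", "PRON"), ("process", "VERB"),
          ("this", "DET"), ("text", "NOUN"), ("?", "PUNCT")]

# Simplified noun-phrase pattern (an assumption, not textacy's pattern):
# optional determiner, any number of adjectives, then one or more nouns.
NP_PATTERN = r"(<DET>)?(<ADJ>)*(<NOUN>|<PROPN>)+"


def pos_regex_spans(tagged_tokens, pattern):
    """Yield lists of tokens whose POS tag sequence matches `pattern`."""
    # Encode the tag sequence as one string, e.g. "<VERB><PRON>...".
    tags = "".join("<{}>".format(pos) for _, pos in tagged_tokens)
    # Record the character offset at which each token's tag starts.
    starts = []
    offset = 0
    for _, pos in tagged_tokens:
        starts.append(offset)
        offset += len(pos) + 2  # +2 for the angle brackets
    for m in re.finditer(pattern, tags):
        first = starts.index(m.start())
        # A match ends either at the start of the next tag or at the end of the string.
        last = starts.index(m.end()) if m.end() in starts else len(tagged_tokens)
        yield [tok for tok, _ in tagged_tokens[first:last]]


matches = [" ".join(span) for span in pos_regex_spans(tagged, NP_PATTERN)]
print(matches)
```

For this sentence the sketch finds the same single noun phrase as the slide, "this text".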
... and more!
