This presentation was provided by William Mattingly of the Smithsonian Institution, for the sixth session of NISO's 2023 Training Series on Text and Data Mining. Session six, "Text Mining Techniques" was held on Thursday, November 16, 2023.
5. NLTK
Natural Language Toolkit
● Created in 2001 by Steven Bird and Edward Loper
● Natural Language Processing with Python (2009)
● Benefits
○ Many features and tools
○ Books
○ Hosts a wide array of algorithms
● Limitations
○ Scalability
○ Customization
○ Other languages
6. spaCy
ExplosionAI
● Created in 2015 by Matthew Honnibal and Ines Montani
● https://spacy.pythonhumanities.com
● Benefits
○ Scalability
○ Customization
○ Community
■ LatinCy
■ calamanCy
○ Annotation tool - Prodigy
● Limitations
○ Low-resource languages
○ Resource intensive
○ Challenging Config system
8. Preparing Texts
Tokenizing
● Split a text into its individual tokens
● Token => word, punctuation, part of a contraction, etc.
● Benefits: fast
● Limitation: large number of variant forms of words (especially in inflected languages)
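A minimal sketch of tokenization using only Python's standard library; the regex below is a simplified stand-in for the more sophisticated tokenizers in NLTK or spaCy:

```python
import re

def tokenize(text):
    # Match runs of word characters, or any single non-space,
    # non-word character (punctuation). Note how the contraction
    # "didn't" is split into multiple tokens.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Dr. Smith didn't arrive.")
print(tokens)
# → ['Dr', '.', 'Smith', 'didn', "'", 't', 'arrive', '.']
```

Note that "arrive", "arrives", and "arriving" would each yield a distinct token, which is the variant-forms limitation mentioned above.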
9. Preparing Texts
Stemming
● Reduce words to their core stem
● Benefits: fast and rules-based
● Limitation: sometimes stems are not real words
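To illustrate the rules-based idea, here is a toy suffix-stripping stemmer; real stemmers such as NLTK's PorterStemmer apply much more carefully ordered rule sets, but the principle is the same:

```python
def simple_stem(word, suffixes=("ing", "ed", "es", "s")):
    # Toy rules-based stemmer: strip the first matching suffix,
    # keeping at least three characters of stem.
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

for w in ["running", "jumped", "studies", "cats"]:
    print(w, "->", simple_stem(w))
# running -> runn
# jumped -> jump
# studies -> studi
# cats -> cat
```

"runn" and "studi" show the stated limitation: the stems produced are not always real words.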
17. Topic Modeling
LDA
● Presumes the presence of hidden (latent) topics. You specify the number of topics, and the model identifies how words cluster into them. It works from a matrix of word counts per document (bag-of-words).
● Advantages:
○ Works well with large datasets
○ Works well when the number of topics is known
● Disadvantages:
○ Challenging for short texts
○ Results can be hard to interpret
○ Topic quantity must be guessed if not known
18. Topic Modeling
Transformer-Based
● Leverages transformer-generated document embeddings to capture semantic meaning, then applies other algorithms for dimensionality reduction and clustering.
● Advantages:
○ Captures broader meaning of documents
○ No need to know the number of topics in advance
○ A lot of flexibility
○ Works with multilingual datasets
○ Works very well on large datasets
● Disadvantages:
○ Requires more resources to create embeddings (though this is only done once)
○ Fine-tuning the hyperparameters of the dimensionality reduction and clustering algorithms can be challenging
○ Results can be challenging to reproduce even with a seed (controlled randomness)
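A sketch of the pipeline shape described above. Random vectors stand in for real document embeddings (which would normally come from a transformer model), PCA stands in for the dimensionality-reduction step, and KMeans for the clustering step; the dimensions and cluster count are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Step 1 (stand-in): 100 "document embeddings" of 384 dimensions.
# In practice these come from a transformer model, computed once.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 384))

# Step 2: reduce the high-dimensional embeddings to a small space
reduced = PCA(n_components=5, random_state=42).fit_transform(embeddings)

# Step 3: cluster the reduced vectors; each cluster is a candidate topic
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(reduced)

print(reduced.shape)  # (100, 5)
print(set(labels))    # at most 4 cluster ids
```

Tools like BERTopic wrap this same embed-reduce-cluster pattern with stronger defaults for each stage.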
21. NER
Overview
● Classify individual spans, or sequences of tokens, in a text
● Types of Classification
○ Hard Classification
○ Soft Classification
● Types of Methods
○ Machine Learning
○ Rules-Based
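A toy sketch of the rules-based approach: match spans against small, hypothetical gazetteers and tag each match with its label (production tools such as spaCy's EntityRuler apply the same pattern-matching idea at scale):

```python
import re

# Hypothetical gazetteers mapping labels to known entity strings
GAZETTEERS = {
    "PERSON": ["Ada Lovelace", "Alan Turing"],
    "GPE": ["France", "United States"],
}

def find_entities(text):
    # Return (span, label, start, end) for every gazetteer match,
    # sorted by position in the text
    entities = []
    for label, names in GAZETTEERS.items():
        for name in names:
            for m in re.finditer(re.escape(name), text):
                entities.append((m.group(), label, m.start(), m.end()))
    return sorted(entities, key=lambda e: e[2])

for ent in find_entities("Ada Lovelace wrote before Alan Turing visited France."):
    print(ent)
```

This is hard classification: each span receives exactly one label with no confidence score, and unseen names are simply missed, which is where machine-learning methods take over.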
22. NER
Labels
● Locations
○ LOC - Location
○ GPE - Geopolitical Entity
● PERSON
● NORP - Nationalities, religious, or political groups
● TIME
● DATE
● EVENT
● PRODUCT
● FAC - Buildings, airports, highways, bridges, etc.