This presentation was provided by William Mattingly of the Smithsonian Institution during the fifth segment of the NISO training series "AI & Prompt Design." Session Five, "Named Entity Recognition with LLMs," was held on May 2, 2024.
2. 1. Named Entity Recognition (NER) as a Concept
2. Rules-Based Approaches to NER
3. Supervised Learning NER
4. Unsupervised Learning NER
5. Transformer-Based NER
6. GLiNER
7. Large Language Models NER
Goals
10. Rules-Based
● List of Entities
Concentration Camps:
Auschwitz
Bergen-Belsen
Buchenwald
…
Gazetteer
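A gazetteer lookup can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `find_gazetteer_entities` and the sample sentence are assumptions, and the `CONC_CAMP` label is borrowed from the annotation example later in the deck.

```python
import re

# Hypothetical gazetteer built from the list on the slide.
GAZETTEER = {"CONC_CAMP": ["Auschwitz", "Bergen-Belsen", "Buchenwald"]}


def find_gazetteer_entities(text, gazetteer):
    """Return (value, label, start, end) for every gazetteer name found in text."""
    entities = []
    for label, names in gazetteer.items():
        # Longest names first so a longer name wins over a shorter prefix.
        ordered = sorted(names, key=len, reverse=True)
        pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b")
        for match in pattern.finditer(text):
            entities.append((match.group(), label, match.start(), match.end()))
    return sorted(entities, key=lambda entity: entity[2])


# Invented sample sentence, for illustration only.
sample = "She was moved from Auschwitz to Bergen-Belsen in 1944."
found = find_gazetteer_entities(sample, GAZETTEER)
for value, label, start, end in found:
    print(value, label, start, end)
```

A lookup like this only finds exact spellings; variant spellings or inflected forms would each need their own entry in the list.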
11. Rules-Based
● Leverages the linguistic data of
a text to assign an entity.
● Use an NLP framework, like
spaCy or NLTK
Nearly two hundred of them were
taken to Berlin.
Verb of movement followed by a
preposition(s) [to, towards, away to]
and a location.
Linguistic Rules
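The rule above can be approximated in plain Python. This sketch is a stand-in, not the framework-based approach the slide recommends: it swaps the part-of-speech tags a library like spaCy or NLTK would supply for two small hand-written word lists (`MOVEMENT_VERBS` and `PREPOSITIONS`, both assumptions) and treats a following run of capitalized words as the location.

```python
import re

# Hypothetical word lists standing in for real linguistic annotations.
MOVEMENT_VERBS = ("taken", "sent", "moved", "deported", "marched")
PREPOSITIONS = ("to", "towards", "away to")

# Movement verb + preposition + one or more capitalized words.
RULE = re.compile(
    r"\b(?:{verbs})\s+(?:{preps})\s+([A-Z][\w-]*(?:\s+[A-Z][\w-]*)*)".format(
        verbs="|".join(MOVEMENT_VERBS),
        preps="|".join(PREPOSITIONS),
    )
)


def find_locations(sentence):
    """Return the capitalized spans that follow a movement verb and a preposition."""
    return RULE.findall(sentence)


print(find_locations("Nearly two hundred of them were taken to Berlin."))
```

Because the rule requires a capitalized word directly after the preposition, a phrase such as "taken to the Warsaw Ghetto" is not matched by this sketch; handling articles is one of the refinements a real linguistic pipeline would add.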
12. Rules-Based
● Find conditions in which things
occur to then assign a label.
We were taken to the Warsaw
Ghetto.
If an entity is a LOCATION and the
word “ghetto” appears within a
context of 5 tokens, change entity
to GHETTO.
Nested Conditions
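A minimal sketch of the nested condition above, assuming an earlier step has already tokenized the sentence and labeled "Warsaw" as a LOCATION. The function name `relabel_ghettos` and the entity dictionary layout (token indices, end exclusive) are illustrative assumptions.

```python
def relabel_ghettos(tokens, entities, window=5):
    """Change LOCATION to GHETTO when 'ghetto' occurs within `window` tokens."""
    relabeled = []
    for entity in entities:
        # Context runs from `window` tokens before the entity to `window` after it.
        low = max(0, entity["start"] - window)
        high = min(len(tokens), entity["end"] + window)
        context = [token.lower() for token in tokens[low:high]]
        label = entity["label"]
        if label == "LOCATION" and "ghetto" in context:
            label = "GHETTO"
        relabeled.append({**entity, "label": label})
    return relabeled


tokens = ["We", "were", "taken", "to", "the", "Warsaw", "Ghetto", "."]
# Assumed output of an earlier NER step: "Warsaw" is token 5.
entities = [{"start": 5, "end": 6, "label": "LOCATION"}]
result = relabel_ghettos(tokens, entities)
print(result)
```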
13. Rules-Based
● Regular expressions are a
complex way of doing fuzzy
string matching.
Hic pagus unus, cum domo exisset,
patrum nostrorum memoria L.
Cassium consulem interfecerat et
eius exercitum sub iugum miserat.
Lucius Cassius
(?:[A-Z]\.\s)?Cassi(?:us|um|i|o|orum|is)
RegEx
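As a sketch, the pattern can be run against the Latin passage with Python's `re` module, with the backslashes written out (`\.` for the literal period and `\s` for whitespace). The variable names are illustrative.

```python
import re

# Optional abbreviated praenomen ("L. ") followed by an inflected form of Cassius.
CASSIUS = re.compile(r"(?:[A-Z]\.\s)?Cassi(?:us|um|i|o|orum|is)")

passage = (
    "Hic pagus unus, cum domo exisset, patrum nostrorum memoria L. "
    "Cassium consulem interfecerat et eius exercitum sub iugum miserat."
)

cassius_matches = [match.group() for match in CASSIUS.finditer(passage)]
print(cassius_matches)
```

One caveat about the alternation: endings are tried left to right, so a form like "Cassiorum" would be captured only as "Cassio". Listing longer endings first avoids that kind of partial match.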
15. Machine
Learning
{
  "text": "John Doe was a prisoner at Auschwitz during World War II.",
  "entities": [
    {
      "type": "PERSON",
      "value": "John Doe",
      "start_pos": 0,
      "end_pos": 8
    },
    {
      "type": "CONC_CAMP",
      "value": "Auschwitz",
      "start_pos": 27,
      "end_pos": 36
    }
  ]
}
Supervised Learning
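Training records like this are usually built so that slicing the text with an entity's start and end offsets returns exactly its value. The sketch below computes the offsets rather than typing them by hand; the helper name `annotate` is an assumption, and it only locates the first occurrence of each value.

```python
import json


def annotate(text, spans):
    """Build a training record from (value, label) pairs using character offsets."""
    entities = []
    for value, label in spans:
        start = text.find(value)  # first occurrence only, for simplicity
        if start == -1:
            raise ValueError(f"{value!r} does not occur in the text")
        entities.append(
            {
                "type": label,
                "value": value,
                "start_pos": start,
                "end_pos": start + len(value),  # end offset is exclusive
            }
        )
    return {"text": text, "entities": entities}


record = annotate(
    "John Doe was a prisoner at Auschwitz during World War II.",
    [("John Doe", "PERSON"), ("Auschwitz", "CONC_CAMP")],
)
print(json.dumps(record, indent=2))
```

Checking that `text[start_pos:end_pos] == value` for every entity is a cheap way to catch misaligned annotations before training.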
16. Machine
Learning
● Vectorize all multi-word tokens
● Plot them to identify patterns
Exercise:
https://wjbmattingly.com/unsupervised-ner/
Unsupervised Learning
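To make the idea concrete without any external libraries, the sketch below swaps in a much simpler technique than the linked exercise may use: each capitalized multi-word span is represented by counts of its neighboring words, and cosine similarity stands in for the visual step of plotting. The sentences, the span pattern, and the function names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Invented sentences, for illustration only.
SENTENCES = [
    "We were deported to the Warsaw Ghetto in winter.",
    "They were deported to the Lodz Ghetto in spring.",
    "My friend John Doe wrote a letter.",
    "My friend Jane Roe wrote a diary.",
]

# Two or more capitalized words in a row count as a multi-word token here.
SPAN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+")


def context_vectors(sentences, window=3):
    """Map each multi-word span to a bag of the words around it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for match in SPAN.finditer(sentence):
            left = sentence[: match.start()].lower().split()[-window:]
            right = sentence[match.end():].lower().split()[:window]
            vectors[match.group()].update(word.strip(".,") for word in left + right)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


vectors = context_vectors(SENTENCES)
print(cosine(vectors["Warsaw Ghetto"], vectors["Lodz Ghetto"]))
print(cosine(vectors["Warsaw Ghetto"], vectors["John Doe"]))
```

Spans that appear in similar contexts score close together, which is the pattern a scatter plot of real embeddings is meant to reveal.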
17. Machine
Learning
● GLiNER => A transformer
architecture that lets you
pass a text and your own
labels to a model without any
training.
Example:
https://huggingface.co/spaces/tomaarsen/gliner_medium-v2.1
Zero-Shot NER
20. LLMs
● Resource Intensity (and Cost)
● Data Privacy Concerns
● Black Box Models
● Training Data Bias
● Generalization Challenges
● Latency Issues
● Hallucinations
● Consistency
Limitations
21. LLMs
● Thinking through your
methodology for NER
● Assisting in certain steps of
NER (RegEx)
● Zero-Shot NER
● Few-Shot NER
How to use LLMs
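Zero-shot and few-shot NER differ only in whether worked examples are included in the prompt. The sketch below builds such a prompt and filters a reply; it makes no API call, and the function names, prompt wording, and the hand-written `fake_reply` are all assumptions rather than output from any real model.

```python
import json


def build_ner_prompt(text, labels, examples=()):
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    lines = [
        "Extract named entities from the text.",
        "Allowed labels: " + ", ".join(labels) + ".",
        'Answer with a JSON list of objects with keys "text" and "label".',
        "",
    ]
    for example_text, example_entities in examples:
        lines += ["Text: " + example_text, "Entities: " + json.dumps(example_entities), ""]
    lines += ["Text: " + text, "Entities:"]
    return "\n".join(lines)


def parse_entities(reply, source_text, labels):
    """Keep only well-formed entities that literally occur in the source text."""
    try:
        items = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        value, label = item.get("text"), item.get("label")
        if isinstance(value, str) and label in labels and value in source_text:
            kept.append({"text": value, "label": label})
    return kept


labels = ["PERSON", "CONC_CAMP"]
source = "John Doe was a prisoner at Auschwitz during World War II."
prompt = build_ner_prompt(source, labels)
# Hand-written stand-in for a model reply; the last entity is not in the text.
fake_reply = json.dumps([
    {"text": "John Doe", "label": "PERSON"},
    {"text": "Auschwitz", "label": "CONC_CAMP"},
    {"text": "Jane Roe", "label": "PERSON"},
])
print(parse_entities(fake_reply, source, labels))
```

Dropping entities that do not appear in the source text is one simple guard against the hallucination and consistency problems listed among the limitations.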
22. Exercise 1: Use an LLM to help develop one or
more solutions for identifying people of a specific
gender in a text. Discuss the options as a group and
judge their merits. Consider the ethical
implications of the proposed solutions.
23. Mrs. Jessica Monica Kapitan works at the
office. Mrs. Kapitan is a lawyer. She is also
friends with Mrs. Thompson and Miss. Smith.
Sometimes Miss. Smith will miss her train.
24. Exercise 2: Capture all examples of Miss. and
Mrs. in the text with their corresponding
names using an LLM to generate RegEx
https://regex101.com/r/TLfbGE/1
25. Exercise 2: One Solution
\b(Mrs\.|Miss\.)\s+([A-Z][a-z]*(?:\s+[A-Z][a-z]*)*)
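A sketch of how this solution might be checked in Python against the sample text, with the backslashes written out (`\b`, `\s`) and the periods escaped. The variable names are illustrative.

```python
import re

# Group 1 is the honorific, group 2 the run of capitalized names that follows.
MRS_MISS = re.compile(r"\b(Mrs\.|Miss\.)\s+([A-Z][a-z]*(?:\s+[A-Z][a-z]*)*)")

sample_text = (
    "Mrs. Jessica Monica Kapitan works at the office. "
    "Mrs. Kapitan is a lawyer. She is also friends with "
    "Mrs. Thompson and Miss. Smith. Sometimes Miss. Smith will miss her train."
)

mrs_miss_pairs = MRS_MISS.findall(sample_text)
for honorific, name in mrs_miss_pairs:
    print(honorific, name)
```

The lowercase verb "miss" in the last sentence is not captured, because the pattern is case-sensitive and requires the trailing period.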
26. Mr. Thomas and Dr. Jessica Davis went to the
store. They met Mrs. Stevens who works at a
nearby office. They are all friends with Colonel
Jackson. Col. Jackson is known to her friends
by her first name, Terry. They all know Mr.
and Mrs. Kapitan.
27. Exercise 3: Capture all examples of [Honorific
Entity] in the text with their corresponding
names using an LLM to generate RegEx
https://regex101.com/r/FYcO8C/1
28. Exercise 3: One Solution
\b(Mr\.|Mrs\.|Miss\.|Dr\.|Colonel|Col\.)\s+([A-Z][a-z]*(?:\s+[A-Z][a-z]*)*)
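The same kind of check can be sketched for this broader pattern, again with the backslashes and escaped periods written out. The variable names are illustrative.

```python
import re

# Group 1 is the honorific, group 2 the run of capitalized names that follows.
HONORIFIC = re.compile(
    r"\b(Mr\.|Mrs\.|Miss\.|Dr\.|Colonel|Col\.)\s+([A-Z][a-z]*(?:\s+[A-Z][a-z]*)*)"
)

honorific_text = (
    "Mr. Thomas and Dr. Jessica Davis went to the store. "
    "They met Mrs. Stevens who works at a nearby office. "
    "They are all friends with Colonel Jackson. "
    "Col. Jackson is known to her friends by her first name, Terry. "
    "They all know Mr. and Mrs. Kapitan."
)

honorific_pairs = HONORIFIC.findall(honorific_text)
for honorific, name in honorific_pairs:
    print(honorific, name)
```

Two gaps are worth discussing: in "Mr. and Mrs. Kapitan" only "Mrs." is paired with the surname, and "Terry" is missed entirely because no honorific precedes it.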
29. Exercise 4: Use an LLM to identify the people
in the following text. Think through an ethical
way to use an LLM to assign potential gender
in these contexts.
Dr. Tracey Jordan works at the Smithsonian where he develops methods to identify named entities. Mrs. Alex Jackson leads the team.
She was trained in machine learning at Stanford. While Tracey functions as the domain expert, Alex Jackson designs the experiments.
They have another colleague, Leslie Peters.