This presentation was provided by William Mattingly of the Smithsonian Institution, for the sixth session of NISO's 2023 Training Series on Text and Data Mining. Session six, "Text Mining Techniques" was held on Thursday, November 16, 2023.
5. NLTK
Natural Language Toolkit
● Created in 2001 by Steven Bird and Edward Loper
● Natural Language Processing with Python (2009)
● Benefits
○ Many features and tools
○ Books
○ Hosts a wide array of algorithms
● Limitations
○ Scalability
○ Customization
○ Other languages
6. spaCy
ExplosionAI
● Created in 2015 by Matthew Honnibal and Ines Montani
● https://spacy.pythonhumanities.com
● Benefits
○ Scalability
○ Customization
○ Community
■ LatinCy
■ calamanCy
○ Annotation tool - Prodigy
● Limitations
○ Low-resource languages
○ Resource intensive
○ Challenging Config system
8. Preparing Texts
Tokenizing
● Split a text into its individual tokens
● Token => word, punctuation, part of a contraction, etc.
● Benefits: fast
● Limitation: large number of variant forms of words (especially in inflected languages)
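A minimal sketch of tokenization using only Python's standard library; the regex below is a simplified stand-in for the more sophisticated tokenizers in NLTK or spaCy:

```python
import re

def tokenize(text):
    # Match runs of word characters, or any single non-space,
    # non-word character (punctuation). Note how the contraction
    # "didn't" is split into multiple tokens.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Dr. Smith didn't arrive.")
print(tokens)
# → ['Dr', '.', 'Smith', 'didn', "'", 't', 'arrive', '.']
```

Note that "arrive", "arrives", and "arriving" would each yield a distinct token, which is the variant-forms limitation mentioned above.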
9. Preparing Texts
Stemming
● Reduce words to their core stem
● Benefits: fast and rules-based
● Limitation: sometimes stems are not real words
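To illustrate the rules-based idea, here is a toy suffix-stripping stemmer; real stemmers such as NLTK's PorterStemmer apply much more carefully ordered rule sets, but the principle is the same:

```python
def simple_stem(word, suffixes=("ing", "ed", "es", "s")):
    # Toy rules-based stemmer: strip the first matching suffix,
    # keeping at least three characters of stem.
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

for w in ["running", "jumped", "studies", "cats"]:
    print(w, "->", simple_stem(w))
# running -> runn
# jumped -> jump
# studies -> studi
# cats -> cat
```

"runn" and "studi" show the stated limitation: the stems produced are not always real words.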
17. Topic Modeling
LDA
● Presumes the presence of hidden (latent) topics. You specify the number of topics, and the model identifies how words cluster into them. It works from a matrix of word counts per document (bag-of-words).
● Advantages:
○ Works well with large datasets
○ Works well when the number of topics is known
● Disadvantages:
○ Challenging for short texts
○ Results can be hard to interpret
○ Topic quantity must be guessed if not known
18. Topic Modeling
Transformer-Based
● Leverages transformer-generated document embeddings to capture semantic meaning, then applies other algorithms for dimensionality reduction and clustering.
● Advantages:
○ Captures broader meaning of documents
○ No need to know the number of topics in advance
○ A lot of flexibility
○ Works with multilingual datasets
○ Works very well on large datasets
● Disadvantages:
○ Requires more resources to create embeddings (though this is only done once)
○ Fine-tuning the hyperparameters of the dimensionality reduction and clustering algorithms can be challenging
○ Results can be challenging to reproduce even with a seed (controlled randomness)
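A sketch of the pipeline shape described above. Random vectors stand in for real document embeddings (which would normally come from a transformer model), PCA stands in for the dimensionality-reduction step, and KMeans for the clustering step; the dimensions and cluster count are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Step 1 (stand-in): 100 "document embeddings" of 384 dimensions.
# In practice these come from a transformer model, computed once.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 384))

# Step 2: reduce the high-dimensional embeddings to a small space
reduced = PCA(n_components=5, random_state=42).fit_transform(embeddings)

# Step 3: cluster the reduced vectors; each cluster is a candidate topic
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(reduced)

print(reduced.shape)  # (100, 5)
print(set(labels))    # at most 4 cluster ids
```

Tools like BERTopic wrap this same embed-reduce-cluster pattern with stronger defaults for each stage.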
21. NER
Overview
● Classify individual spans, or sequences of tokens, in a text
● Types of Classification
○ Hard Classification
○ Soft Classification
● Types of Methods
○ Machine Learning
○ Rules-Based
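A toy sketch of the rules-based approach: match spans against small, hypothetical gazetteers and tag each match with its label (production tools such as spaCy's EntityRuler apply the same pattern-matching idea at scale):

```python
import re

# Hypothetical gazetteers mapping labels to known entity strings
GAZETTEERS = {
    "PERSON": ["Ada Lovelace", "Alan Turing"],
    "GPE": ["France", "United States"],
}

def find_entities(text):
    # Return (span, label, start, end) for every gazetteer match,
    # sorted by position in the text
    entities = []
    for label, names in GAZETTEERS.items():
        for name in names:
            for m in re.finditer(re.escape(name), text):
                entities.append((m.group(), label, m.start(), m.end()))
    return sorted(entities, key=lambda e: e[2])

for ent in find_entities("Ada Lovelace wrote before Alan Turing visited France."):
    print(ent)
```

This is hard classification: each span receives exactly one label with no confidence score, and unseen names are simply missed, which is where machine-learning methods take over.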
22. NER
Labels
● Locations
○ LOC - Location
○ GPE - Geopolitical Entity
● PERSON
● NORP - Nationalities, religious, or political groups
● TIME
● DATE
● EVENT
● PRODUCT
● FAC - Buildings, airports, highways, bridges, etc.