Natural Language Processing for Beginners
Colleen M. Farrelly
About Me
• Data science lead/entrepreneur
• Geometry and NLP researcher
• Author
• The Shape of Data (NSP, 2023)
• Network science book (Packt, 2024)
• Artist and Calligrapher
Generative AI
A gentle introduction
What is generative AI?
• Set of algorithms that generate:
• Images
• Text samples
• Videos
• Audio content
• Guided by:
• Training sample
• User specifications
• Deep learning architectures
GPT
• Generative Pre-trained Transformer 4
• Decoder-only transformer network
• Acts as a sequence-to-sequence decoder with long-range memory
• Already blurring the lines between human and AI composition
Tools that should work anywhere
Speech generation:
• https://play.ht/text-to-speech-voices
Text generation (OpenAI alternative, GPT-2):
• https://huggingface.co/tasks/text-generation
Image generation:
• https://creator.nightcafe.studio/create
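For the text generation tool listed above, a minimal sketch of calling GPT-2 from Python is shown here. It assumes the Hugging Face `transformers` library (plus a backend such as PyTorch) is installed and that the public `gpt2` checkpoint can be downloaded; the function name `generate_text` and its defaults are illustrative choices, not part of the original slides.

```python
def generate_text(prompt, max_new_tokens=30):
    """Continue `prompt` with GPT-2 (a sketch; assumes `transformers` is installed)."""
    # Imported lazily so that defining this function does not require the library.
    from transformers import pipeline

    # "gpt2" is the small public GPT-2 checkpoint; it is downloaded on first use.
    generator = pipeline("text-generation", model="gpt2")
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        num_return_sequences=1,
    )
    # Each output is a dict; "generated_text" contains the prompt plus the continuation.
    return outputs[0]["generated_text"]


# Example call (downloads the model the first time it runs):
# print(generate_text("Natural language processing is"))
```

Output varies from run to run because the continuation is sampled, so the same prompt rarely gives the same text twice.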
Sentiment Analysis and Text Classifiers
Theory and Practice
Sentiment Analysis
• Understand positive/negative/neutral tone of text data
• Customer feedback
• Chatbot emotion regulation
• Predicting patient outcomes or physician bias
• Expansion to other emotions:
• Anger
• Sadness
• Surprise
• Some packages exist for some languages and applications.
• Other languages or emotions require custom code and dictionaries.
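As a rough illustration of the "custom code and dictionaries" route, the sketch below scores text against a small word list. The lexicon, the negation words, and the function name `sentiment` are all hypothetical examples; a real project would need a much larger, validated dictionary for its language and domain.

```python
import re

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {
    "great": 1, "love": 1, "helpful": 1,
    "slow": -1, "broken": -1, "terrible": -1,
}
# Words that flip the polarity of the sentiment word right after them.
NEGATORS = {"not", "never", "no"}


def sentiment(text):
    """Return (label, score) using simple dictionary lookups."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        # Flip the sign when the previous word is a negator ("not great").
        if value and i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        score += value
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, score


print(sentiment("The support team was great and very helpful"))
print(sentiment("The service was not great"))
```

Extending this to anger, sadness, or surprise would mean keeping one word list per emotion and reporting a score for each.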
Classification
• Surgical outcomes based on physician notes
• Types of customer complaints from automated feedback form
• Tasks for chatbot to route from sales chatbot conversations
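To make the customer-complaint use case concrete, here is a minimal sketch of a text classifier. It uses a multinomial Naive Bayes model with add-one smoothing, which is one common baseline and not necessarily the method used in the talk; the training sentences and the labels "billing", "shipping", and "technical" are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count documents per label and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        # Add-one smoothing keeps unseen words from zeroing out a label.
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data for illustration only.
EXAMPLES = [
    ("I was charged twice for my order", "billing"),
    ("Refund has not arrived on my card", "billing"),
    ("The package arrived late and damaged", "shipping"),
    ("My delivery never arrived", "shipping"),
    ("The app crashes when I log in", "technical"),
    ("Cannot reset my password on the website", "technical"),
]

model = train(EXAMPLES)
print(predict(model, "I need a refund, I was charged twice"))
```

With only six training sentences this is a toy; the same structure applies to physician notes or chatbot transcripts once labeled examples are available.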
Wrangling Text to Numeric Matrix
• Document word counts/frequencies
• Binary, count, or weighted (term frequency/inverse document frequency)
• Sparse numeric matrix
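The three weighting options above can be sketched in plain Python as follows. The function names and the example documents are illustrative, and the weighted option uses the plain formula count × log(N / document frequency); libraries such as scikit-learn apply smoothed variants, so their numbers will differ.

```python
import math
import re


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(docs, weighting="count"):
    """Return (vocab, matrix) with one row per document and one column per word."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocab}
    matrix = []
    for tokens in tokenized:
        row = []
        for term in vocab:
            count = tokens.count(term)
            if weighting == "binary":
                value = 1 if count else 0
            elif weighting == "count":
                value = count
            elif weighting == "tfidf":
                value = count * math.log(n_docs / doc_freq[term])
            else:
                raise ValueError(f"unknown weighting: {weighting}")
            row.append(value)
        matrix.append(row)
    return vocab, matrix


def to_sparse(matrix):
    """Keep only nonzero cells as {column_index: value} per row."""
    return [{j: v for j, v in enumerate(row) if v} for row in matrix]


docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]
vocab, counts = document_term_matrix(docs, weighting="count")
print(vocab)
print(counts)
print(to_sparse(counts))
```

Most cells are zero because each document uses only a small part of the vocabulary, which is why the sparse form stores only the nonzero entries.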
Embeddings: High dimension to low dimension
Context Matters: Pretrained encoder/decoder neural networks
BERT Models
• Many BERT and RoBERTa models on HuggingFace
• Pre-trained neural networks
• Good context for English
• Fairly complex sentences
• Fantasy or trademark words
• Some domain-specific versions
• Easy to use in Python
• Requires computer storage space
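One way to see a pretrained BERT model use context is to ask it to fill in a masked word. The sketch below assumes the Hugging Face `transformers` library (plus a backend such as PyTorch) is installed and uses the public `bert-base-uncased` checkpoint as an example; the function name `fill_blank` is a hypothetical helper, and the model download takes several hundred megabytes of disk space.

```python
def fill_blank(sentence_with_blank, top_k=3):
    """Suggest words for the '___' placeholder using a pretrained BERT model.

    A sketch only: assumes `transformers` is installed and the
    `bert-base-uncased` checkpoint can be downloaded.
    """
    # Imported lazily so that defining this function does not require the library.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    # Swap the placeholder for the model's own mask token (e.g. "[MASK]").
    text = sentence_with_blank.replace("___", unmasker.tokenizer.mask_token)
    predictions = unmasker(text, top_k=top_k)
    # Each prediction is a dict with the proposed token and its probability.
    return [(p["token_str"], p["score"]) for p in predictions]


# Example call (downloads the model the first time it runs):
# print(fill_blank("The surgeon reviewed the patient's ___ before the operation."))
```

Changing the words around the blank changes the suggestions, which is the practical meaning of "context" on this slide.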
Fun with Python
Ethical Considerations
Some Ethical Considerations
• Training data bias (pretrained embeddings or own data)
• Misuse of generative AI (fake news)
• Representation in models
• Language accuracy biases in multilingual models
• Misclassification biases
• Plagiarism bias against non-native English speakers
