Working in NLP in the Age of Large Language Models
1. Zachary Brown Charlottesville Data Science Meetup, 2023-09-18
Working in NLP in the Age of Large Language Models
An Historical Perspective
2. Outline
Introduction
Setting the Scene
Who’s who in NLP
The AlexNet Moment
Context is King
Sequences Abound
Attention is Everything
Multi-mode, multi-task, multi-billion
A New World
An Historical Perspective
4. Setting the Scene
Advances in the technology have opened up a variety of novel use cases, and the hype has caused a massive shift in both expectations and allowances for model performance
Generative AI has taken the world by storm over the past year, driven largely by the groundbreaking performance of a small number of closed-source models
These recent advances have their roots in over a decade of accumulating foundational research
9. Novelty (Across Many Dimensions)
The good: Fun with content generation, trying to take over the world, fun generative agents. Also, does better on tests than you do…
The harm: Hallucinations, disinformation, workforce impacts, misguided usage, environmental impacts
The legal cases…
13. Foundational Roots of LLMs
2001: First Neural Language Model
2010: RNNs for Language Modeling
2013: Contextual Word Embeddings
2015: Attention Mechanism
2017: Transformer Architecture
14. Questions to Cover
What previous advances have led us here?
How does the impact of previous advances compare to what’s happening right now?
What have these changes meant for those in this field, and in the periphery of this field?
15. Who’s Who in NLP
Engineers and Architects
Who: Software engineers, architects, devops and data engineers
What: Core contributors required to mature technology beyond specialized startups to enterprise grade / scale
Researchers and Practitioners
Who: Machine learning researchers and engineers, data scientists, computational linguists
What: Driving advances in tech and/or knowledgeable enough to immediately leverage them
Business Interests and Specialists
Who: C-Suite members, enterprise technical leaders, founders and investors, and domain specialists (research)
What: Recognizing maturity and leveraging advances in tech for relevant use cases
17. The AlexNet Moment
(+ history, 2001 - 2012)
I’d argue that one of the most important moments leading to
NLP having broader impacts outside of academic/specialist
communities wasn’t an advancement in NLP at all…
But first, a bit of history on NLP research…
18. One of the foundational tasks in NLP is language modeling: predicting the next word in a sequence given the words that precede it
● In 2001, Bengio et al. introduced an early neural model for next-token prediction
● In 2010, Mikolov et al. explored the application of RNNs for LM*
Early Neural Methods
for Language Modeling
* Extensions such as the LSTM gained massive popularity in subsequent years
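To make the language modeling objective above concrete, here is a minimal PyTorch sketch of next-token prediction with a tiny RNN; the vocabulary size, layer sizes, and random data are illustrative assumptions and are not drawn from the Bengio or Mikolov papers.

```python
# Minimal sketch of the language-modeling objective: predict token t+1 from
# tokens <= t. Toy sizes and random "data" are for illustration only.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128

class TinyRNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)      # (batch, seq, embed_dim)
        out, _ = self.rnn(x)           # (batch, seq, hidden_dim)
        return self.head(out)          # logits over the vocabulary

model = TinyRNNLM()
tokens = torch.randint(0, vocab_size, (2, 10))   # a fake batch of token ids
logits = model(tokens[:, :-1])                   # predict each next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
```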
19. Impact?
Engineers and Architects
Researchers and Practitioners
Exploration of new techniques is really pushing the boundaries of what’s possible, but…
…practitioners often need to understand how to implement models from scratch, with few standard libraries
Business Interests and Specialists
20. The AlexNet Moment
(2012)
In 2012, the AlexNet architecture won the ImageNet
competition with a 10.5% reduction in top-5 error.
This concretely demonstrated the promise of neural networks in solving problems that could provide tangible business value
The resurgent popularity of neural networks had broad
implications beyond just the researchers and practitioners
working in this space
21. Impact?
Engineers and Architects
Huge opportunity to build out standard frameworks to support deep learning research and development
Researchers and Practitioners
Purpose-built neural net architectures have the potential to substantially outperform prior methods and should be more thoroughly explored for NLP use cases
Business Interests and Specialists
Early signals that neural nets can provide substantial business and research value; this is an important area for early investment
22. Context Is King
(2013)
In 2013, Mikolov et al. demonstrated word2vec, a technique that efficiently produced contextual word embeddings at scale from a large, unlabeled corpus
This work was followed in 2014 by the GloVe embedding method (Pennington et al.), which leverages global co-occurrence statistics to generate contextual embeddings
Both research groups made these sets of pre-trained word
embeddings publicly available under the Apache 2.0 license
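As a small illustration of working with these publicly released embeddings, the sketch below uses the gensim library (an assumption: gensim is installed and its downloader can fetch the published "glove-wiki-gigaword-100" vectors; the slides do not mention this tooling) to query pre-trained GloVe vectors.

```python
# Sketch of querying pre-trained GloVe vectors via gensim's downloader.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")    # downloads on first use
print(vectors.most_similar("language", topn=5))  # nearest neighbors in embedding space
print(vectors.similarity("paris", "france"))     # cosine similarity between two words
```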
23. Impact?
Engineers and Architects
Interesting new opportunities and challenges for large-scale unstructured data sets. New packages emerging for generating and using publicly available model artifacts
Researchers and Practitioners
Unsupervised pre-training has the potential to capture interesting semantic relationships without the need for expensive (and error-prone) supervised, human-labeled data.
Business Interests and Specialists
My unstructured data has inherent value that can be extracted in an automated way.
There’s a new ecosystem emerging of privately funded research efforts producing and releasing valuable IP
24. In 2014, Sutskever et al. from the Google Brain team introduced a novel approach for leveraging neural nets to map sequences to sequences. This had major implications for neural machine translation, among other tasks
The following year, Bahdanau et al. published a novel approach to neural machine translation introducing the attention mechanism.
Sequences Abound
(2014 - 2015)
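As a rough sketch of the attention mechanism introduced in the Bahdanau et al. work, the snippet below scores each encoder state against the current decoder state and forms a weighted context vector; the shapes and randomly initialized layers are illustrative, not a trained translation model.

```python
# Additive (Bahdanau-style) attention: score encoder states against the current
# decoder state, softmax the scores, and take a weighted sum as the context.
import torch
import torch.nn as nn

hidden = 128
W_enc, W_dec = nn.Linear(hidden, hidden), nn.Linear(hidden, hidden)
v = nn.Linear(hidden, 1)

encoder_states = torch.randn(1, 12, hidden)  # (batch, source_len, hidden)
decoder_state = torch.randn(1, hidden)       # current decoder hidden state

scores = v(torch.tanh(W_enc(encoder_states) + W_dec(decoder_state).unsqueeze(1)))
weights = torch.softmax(scores, dim=1)           # (batch, source_len, 1)
context = (weights * encoder_states).sum(dim=1)  # weighted sum of encoder states
```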
25. Impact?
Engineers and Architects
Neural nets are starting to pop up in a variety of use cases. I should check out this new TensorFlow thing
Researchers and Practitioners
We can now tackle seq2seq problems, and the attention mechanism shows great promise in letting a model architecture learn which context is relevant for word prediction
Business Interests and Specialists
Large commercial R&D investments are producing truly novel tech that’s driving a step change in capabilities. New business opportunities for startups, new investments for enterprises
26. In 2017, new work from Vaswani et al. demonstrated that “Attention is All You Need,” extending the attention mechanism proposed several years earlier to construct stacked blocks of multi-headed attention, or transformers.
In the next year, the promise of transformers was firmly
established with the release of both the original BERT paper
from the Google AI Language team as well as the original GPT
paper from a team at OpenAI
Attention is Everything
(2017 - 2018)
27. Multi-head self-attention is the key component of transformer encoder and decoder blocks, allowing the model to learn deep contextual representations of the input tokens (typically bidirectional)
Attention is All You Need
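For concreteness, here is a single-head, scaled dot-product self-attention sketch; the dimensions and random weights are illustrative, and real transformer blocks add multiple heads, output projections, residual connections, and layer normalization.

```python
# Scaled dot-product self-attention (single head, no masking).
import math
import torch

def self_attention(x, W_q, W_k, W_v):
    q, k, v = x @ W_q, x @ W_k, x @ W_v          # project tokens to queries/keys/values
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
    weights = torch.softmax(scores, dim=-1)      # each token attends to every token
    return weights @ v                           # contextualized representations

d_model = 64
x = torch.randn(10, d_model)                     # 10 input token vectors
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)
```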
28. Bidirectional Encoder Representations from Transformers
(BERT) demonstrated that transformer encoder-only models
can be efficiently trained through a two-step process of
unsupervised pre-training followed by task-specific
fine-tuning
The masked-language-model (MLM) pretraining paradigm
opened the door for leveraging massive textual corpora to
produce extremely performant models, while fine-tuning
allowed practitioners to directly benefit from the substantial
investments of large industry research groups
BERT and the Encoders
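A quick sketch of reusing a pre-trained BERT checkpoint through the Hugging Face transformers library (assuming the library is installed and can download the public bert-base-uncased weights), exercising the masked-language-model head directly:

```python
# Fill-mask with a pre-trained BERT checkpoint.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("Language modeling predicts the next [MASK] in a sequence."))
```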
29. The original GPT paper followed a similar “pretrain then
fine-tune” approach using a decoder only transformer
architecture.
Pretraining was carried out with an autoregressive language modeling objective, with a variety of tasks for subsequent fine-tuning.
GPT
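And the decoder-only counterpart, sketched with the publicly available GPT-2 checkpoint via transformers (an illustrative assumption; this is a successor checkpoint, not the weights from the original GPT paper):

```python
# Autoregressive generation with a small decoder-only model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Large language models have their roots in", max_new_tokens=30))
```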
30. Impact?
Engineers and Architects
Deep learning toolkits are becoming more broadly available, and non-specialists can now experiment and build useful machine learning systems
Researchers and Practitioners
Transformers are an incredibly robust tool for learning deep contextual representations for language tasks. We should explore everything we can with transformers and pre-training paradigms
Business Interests and Specialists
The trend in open-sourcing valuable IP is only accelerating, and an ecosystem is rapidly developing for AI. What tooling needs to be built? What NLP use cases does my organization have?
31. Multi-mode, Multi-task,
Multi-billion (2019 - 2021)
In the wake of the success of models such as BERT and GPT,
research interests shifted to exploring, extending, and
augmenting various paradigms introduced in these recent
works, such as:
● Extensions to the attention mechanism
● Encoder-decoder architectures
● Pre-training and fine-tuning paradigms
● Larger and smaller (more efficient) models
● Multi-modal applications
32. The original attention mechanism is robust, but
computationally expensive as sequence lengths grow. Models
such as Longformer, Reformer, Performer, etc. explored
various methods for extending the attention mechanism to
longer sequence lengths.
Extend Your Attention
33. While encoder and decoder-only models such as BERT and
GPT demonstrated great promise in their own right,
extensive research efforts were focused on leveraging full
encoder-decoder transformer architectures for sequence to
sequence tasks such as translation, reading comprehension,
summarization, etc.
Models like BART, T5, Pegasus, and ULM all leveraged
encoder-decoder models along with novel training paradigms
to produce performant models across a variety of tasks
Encode and Decode
34. Extending the fine-tuning paradigms introduced in previous work, many groups shifted focus to using a single architecture to perform a variety of disparate tasks
Models like T5 probed the limits of multi-task transfer learning, while the GPT family of models evolved (GPT-2 and GPT-3), demonstrating the benefits of multi-task learning and highlighting emergent capabilities of larger models such as few-shot learning and generalization.
“Fine-Tuning Language Models from Human Preferences”
demonstrated that human feedback can play a key role in
aligning generative model outputs with human expectations
Task Variety, Instructions,
and Alignment
35. While some work sought to make transformer models smaller
and more efficient (pruning, distillation, quantization), other
works focused on scaling up to much larger models (and
larger training datasets).
Novel works explored new scaling laws for the era of large
language models
Bigger (and Smaller)
Models
36. A huge body of research also emerged around multi-modal applications of transformers (see Xu et al. for a recent survey)
One particularly visible application of transformers to audio was the Whisper model released by OpenAI, which leveraged a relatively straightforward transformer and a huge volume of data to produce a performant speech-to-text model
Multi-Modal Models
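A sketch of how Whisper is commonly used through OpenAI's open-source whisper package (assumptions: the openai-whisper package and ffmpeg are installed, and "meeting.mp3" is a placeholder path to a local audio file):

```python
# Speech-to-text with the open-source Whisper package.
import whisper

model = whisper.load_model("base")        # one of the smaller released checkpoints
result = model.transcribe("meeting.mp3")  # placeholder audio file
print(result["text"])
```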
37. Impact?
Researchers and Practitioners
Wow… Also, everything is a transformer.
Engineers and Architects
With models getting so big, there are novel challenges for ML training and inference.
Access to public models allows us to incorporate ML into nearly any application with zero training
Business Interests and Specialists
I can build a new business on top of open source machine learning easier than ever before, and so can everyone else. How can I leverage our data and talent to apply this new tech to our business?
38. In November 2022, OpenAI announced the public release of
ChatGPT, a large (and notably unspecified) generative model
fine-tuned through reinforcement learning with human
feedback (RLHF).
For many, this system was the first glimpse into the
technological advances of the past decade and the tangible
utility these models can provide.
The model represented a step change over previously demonstrated generative language modeling capabilities, causing a noticeable shift in research interest and commercial applications of NLP
A New World
(2022 - Present)
39. In early 2023, there were seemingly no competitive systems
available to the public that could match the performance of
this new model.
Aside from the novelty of the system in information recall or
content-generation use cases, the few-shot and zero-shot
performance of the model is remarkable
Coupled with restrictive terms of service for competitive
commercial uses, a sort of moat had been established around
this novel technology
The Dominance of GPT
40. As 2023 has progressed, competitive open-source models
have been released monthly, if not weekly. Some key aspects
can determine the utility of these models for business use
cases:
● Commercial permissibility (and data usage ToS!)
● Training / tuning paradigm
● Code generation
Competition Heats Up
41. To support this new emerging ecosystem of LLMs, a whole
host of new tools have emerged, and some existing solutions
have found new life. Some prominent areas of novel tooling
include:
● Chain creation and management
● Vector data stores (see the sketch below)
● LLM training and inference platforms
● LLM testing and monitoring suites
● Labeling platforms
A New Tooling
Ecosystem
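To illustrate the core operation behind the vector data stores listed above, here is a small sketch of embedding-based retrieval using sentence-transformers and numpy; the model name and documents are illustrative assumptions, and production vector stores add indexing, persistence, and filtering on top of this idea.

```python
# Embed documents, then return the nearest neighbor to a query by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["reset your password", "update billing info", "contact support"]
doc_vecs = model.encode(docs, normalize_embeddings=True)

query_vec = model.encode(["I forgot my login"], normalize_embeddings=True)
scores = doc_vecs @ query_vec.T              # cosine similarity (vectors are normalized)
print(docs[int(np.argmax(scores))])
```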
42. Along with the proliferation of new tools, new techniques
that aid in working with these new models have gained in
popularity.
Parameter-efficient fine-tuning (PEFT) techniques including
LoRA, p-tuning, etc. have emerged as a viable path for
domain/task adaptation on consumer hardware (along with
quantization)
New research has also emerged this year focusing on more flexible positional encoding techniques (ALiBi, RoPE) that allow models to be trained (and adapted) for longer context windows
Trends in Methodology,
Both New…
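A minimal LoRA sketch using the Hugging Face peft library (assumptions: peft and transformers are installed, gpt2 stands in as the base model, and the c_attn target module name is specific to GPT-2; other architectures use different module names):

```python
# Parameter-efficient fine-tuning: wrap a base model with small LoRA adapters.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```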
43. Along with the proliferation of new tools, many familiar tools
and paradigms are being revisited and revitalized, including
components of more traditional chatbots, information
retrieval systems, and enterprise labeling platforms
Trends in Methodology, Both New… and Familiar
44. A New Hammer
(But Not for Every Nail)
LLMs excel in a variety of generative use cases, such as
conversational assistance, content generation (code,
templates, etc.), and retrieval augmented generation. They
also enable novel opportunities via autonomous agents.
Although LLMs are capable across a wide variety of tasks, for
discriminative use cases, when data is available at scale,
and/or when factual accuracy is critical (without a human in
the loop), smaller, more efficient models are often a better
option
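As a contrast with LLM-based approaches to such discriminative tasks, here is a sketch of the smaller-model option: a compact fine-tuned classifier served through the transformers pipeline (assuming the library is installed and the public DistilBERT SST-2 checkpoint fits the task):

```python
# A small, task-specific classifier instead of a general-purpose LLM.
from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("The new release fixed every issue we reported."))
```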
45. Impact?
Engineers and Architects
We can build generative AI directly into our platforms (at a cost) without the typical ML R&D lifecycle, but it’s difficult to get traction beyond the PoC stage
Researchers and Practitioners
The bulk of generative NLP use cases will likely be handled by LLMs, and we need to understand how to best utilize, maintain, and constrain these systems. BUT, genAI is not always the answer!
Business Interests and Specialists
Generative AI is in full-on disruption mode, and I need to figure out how to integrate it into our business as quickly as possible