This document provides an overview of core natural language processing techniques: language modeling, tokenization, embeddings, and semantic similarity. It covers the basics of each concept and how they relate to one another, such as how tokenization serves as a preprocessing step and how embeddings capture semantic meaning and relationships, enabling measurement of text similarity. It also presents an example project: a document retrieval system that finds similar texts using embeddings and a vector database.
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, Vector Databases
1. Image generated by DALL-E 3 with the prompt:
"A detailed graphic that visualizes a vector embedding space"
The Basics of:
• Language Models
• Tokenization
• Embeddings
• Semantic Similarity
• Vector Databases
• Example Project & Demo
Robert McDermott (he/him)
Director: IT - Solutions, Engineering & Architecture
rmcdermo@fredhutch.org
2. Fred Hutchinson Cancer Center
"In the shadow of the digital dawn, AI and LLMs are not merely
tools; they are the architects of a new epoch, reshaping the
contours of knowledge, thought, and human potential."
— Aeliana Lyra, Futurist and Tech Philosopher, 2027*.
"AI and LLMs haven't just changed the way we compute; they've
altered the very fabric of our consciousness, inviting us to merge
horizons between human intuition and digital foresight."
— Lysandra Marek, Cognitive Evolutionist, 2027*.
*Prompt: Create some fictional quotes that someone in 2027 might say about AI and LLMs.
4. LLM Model Types and Evolution
▪ Encoder-Only
▪ Encoder-Decoder
▪ Decoder-Only
Encoder-Decoder
This architecture has two main parts. The encoder processes the input (like a sentence in English) and compresses this information into a 'context'. The decoder then takes this context and produces an output (like a translated sentence in French).

Encoder-Only
This architecture has only the encoder part. It processes the input and produces a direct output without the need for a decoder. It's useful for tasks where you don't need a transformation into another 'type' or 'form' (e.g., classification tasks).

Decoder-Only
This architecture uses only the decoder. Instead of translating or compressing information, it starts with some initial input and expands or generates output based on it. This is what models like GPT (which can generate stories, answers, etc.) use. A minimal sketch contrasting the three architectures follows below.
https://github.com/Mooler0410/LLMsPracticalGuide
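As a rough illustration of the three architectures, here is a minimal sketch using Hugging Face transformers pipelines. The specific model checkpoints (distilbert, t5-small, gpt2) are illustrative choices, not ones named in this deck.

# pip install transformers torch sentencepiece
from transformers import pipeline

# Encoder-only (BERT-style): classify the input directly, no generation needed.
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("The new movie is awesome"))

# Encoder-decoder (T5-style): encode English, decode French.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("The cat sits outside"))

# Decoder-only (GPT-style): start from a prompt and generate a continuation.
generator = pipeline("text-generation", model="gpt2")
print(generator("In the shadow of the digital dawn,", max_new_tokens=20))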
5. Language Model Sizes – March 2023
• LLMs are measured by the number of parameters in the model and the number of tokens they were trained on.
• Parameters are the weights and biases connecting nodes in a neural network (like synapses connecting neurons in the brain).
• Tokens are how the LLM breaks sentences into pieces; they can be whole words or fractions of a word.
• Generally, the more parameters a model has and the more tokens it was trained on, the better it performs. (A minimal sketch of counting parameters follows below.)
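To make "parameters" concrete, here is a minimal PyTorch sketch (not from the deck; the tiny network is an illustrative assumption) that counts the weights and biases in a model:

import torch.nn as nn

# A tiny two-layer network: 10 inputs -> 5 hidden units -> 1 output.
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))

# Each Linear layer contributes weights (in*out) plus biases (out):
# (10*5 + 5) + (5*1 + 1) = 61 parameters.
total = sum(p.numel() for p in model.parameters())
print(total)  # 61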
6. Language Models – Sept 2023+
• Models are going to keep getting larger and more capable.
• The newly released Falcon 180B/3.5T (180 billion parameters, 3.5 trillion tokens) is larger than GPT-3 and is free/open source.
7. Achievements Unlocked by LLMs – April 2023
Unpredictable abilities that have been observed in large language models but that were not present in simpler models (and that were not explicitly designed into the model) are usually called "emergent abilities". Researchers note that such abilities "cannot be predicted simply by extrapolating the performance of smaller models" (Wikipedia).
https://lifearchitect.ai/models/
9. Text Tokenization
Tokenization is a foundational step in the preprocessing of text for many natural language processing (NLP) tasks, including for language models like GPT-4 and Llama-2. Tokenization involves breaking text into smaller chunks, or "tokens", which can be as short as one character or as long as one word (or even longer in some cases). These tokens can then be processed, analyzed, and used as input for machine learning models.
https://platform.openai.com/tokenizer
[Figure: tokenization visualized, with the resulting token IDs]
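A minimal sketch of tokenization in practice, using OpenAI's tiktoken library (the same tokenization shown by the tokenizer page above); the example sentence is arbitrary:

# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Tokenization breaks text into pieces.")
print(tokens)                             # the list of integer token IDs
print([enc.decode([t]) for t in tokens])  # the text of each token
print(len(tokens))                        # number of tokens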
10. Vector Embeddings
Definition
• Representations of text in numerical form.
• Convert variable-length text into fixed-size vectors in high-dimensional space.
Purpose
• Capture semantic meaning and relationships between words, phrases, or longer text.
• Enable mathematical operations on text (e.g., similarity measurement, arithmetic operations).
Characteristics
• Words with similar meanings are close in vector space.
• Allows for operations like "king" - "man" + "woman" ≈ "queen".
Applications
• Natural language processing tasks: sentiment analysis, named entity recognition, etc.
• Information retrieval: search engines, recommendation systems.
• Visualization: using dimensionality reduction to visualize semantic relationships.
Embedding leaderboard: https://huggingface.co/spaces/mteb/leaderboard
"A fat tuxedo cat" = [0.0542, 0.0421, -0.0242, 0.1118, -0.0933, ...]*
* The "all-MiniLM-L6-v2" embedding model has 384 dimensions
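A minimal sketch of producing the embedding above with the sentence-transformers library and the same all-MiniLM-L6-v2 model:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Load the same 384-dimensional model referenced above.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode a sentence into a fixed-size vector (a NumPy array).
embedding = model.encode("A fat tuxedo cat")
print(embedding.shape)  # (384,)
print(embedding[:4])    # first few components of the vector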
11. Vector Embeddings
• There are several dozen embedding models available.
• They range in dimensionality from 384 to 1536 dimensions.
• They range in maximum sequence length from 512 to 8191 tokens.
• Embedding models are generally not compatible with each other.
Interactive embedding explorer: https://blog.echen.me/embedding-explorer/
12. Semantic Text Similarity
Sentence 1                | Sentence 2                   | Cosine Similarity
The cat sits outside      | The dog plays in the garden  |  0.2838
A man is playing guitar   | A woman watches TV           | -0.0327
The new movie is awesome  | The new movie is so great    |  0.8939
Jim can run very fast     | James is the fastest runner  |  0.6844
My goldfish is hungry     | Pluto is a planet!           |  0.0454
Source Code Available Here
• Cosine similarity measures the cosine of the angle between two vectors.
• The value is between -1 and 1, where 1 means the vectors are identical, 0 means orthogonal, and -1 means diametrically opposite (rare in text embeddings).
(Note: these clearly used different embedding models.)
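A minimal sketch reproducing scores like those in the table with sentence-transformers; the exact values depend on the embedding model used:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("The cat sits outside", "The dog plays in the garden"),
    ("The new movie is awesome", "The new movie is so great"),
]

for s1, s2 in pairs:
    # cos_sim(a, b) = (a . b) / (|a| * |b|)
    e1, e2 = model.encode(s1), model.encode(s2)
    score = util.cos_sim(e1, e2).item()
    print(f"{s1!r} vs {s2!r}: {score:.4f}")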
13. Vector Databases
Vector databases, sometimes referred to as similarity search engines or approximate nearest neighbor (ANN) search engines, are specialized databases designed to handle high-dimensional vectors efficiently. They provide the capability to quickly retrieve the most similar vectors to a given query vector, which is particularly useful for tasks that involve large-scale embeddings.
[Figures: vector space illustration; landscape of dedicated vector databases and types of databases]
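As a concrete sketch, here is a minimal example using Chroma, one of the dedicated vector databases; the choice of Chroma and the sample documents are illustrative, not from the deck:

# pip install chromadb
import chromadb

client = chromadb.Client()  # in-memory instance
collection = client.create_collection("animals")

# Chroma embeds the documents with its default embedding model,
# stores the vectors, and indexes them for similarity search.
collection.add(
    ids=["1", "2", "3"],
    documents=[
        "A fat tuxedo cat",
        "A dog playing in the garden",
        "A recipe for sourdough bread",
    ],
)

# The query text is embedded too; the nearest stored vectors come back first.
results = collection.query(query_texts=["a black and white cat"], n_results=2)
print(results["documents"])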
14. Example Retrieval Augmentation Project
[Screenshots: uploading documents and asking questions in the DocTelliGen web UI]
DocTelliGen: https://github.com/robert-mcdermott/DoctelliGen
Example document: "The Library of Babel" by Jorge Luis Borges, 1941:
https://sites.evergreen.edu/politicalshakespeares/wp-content/uploads/sites/226/2015/12/Borges-The-Library-of-Babel.pdf
15. Example Retrieval Augmentation Project – How It Works
[Diagram: the retrieval augmentation pipeline, summarized below]
Acquiring information:
1. Documents are split into text chunks.
2. Each chunk is passed through the embedding model to produce vector embeddings.
3. The embeddings are stored in the vector database.
Answering questions:
1. The user's question is embedded with the same embedding model.
2. A cosine similarity search in the vector database retrieves the most relevant chunk embeddings.
3. The retrieved chunks and the question are provided to the LLM together.
4. The LLM generates the answer.
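A minimal end-to-end sketch of this pipeline, using sentence-transformers for embeddings and Chroma as the vector database; the naive chunking, the sample question, and the ask_llm placeholder are illustrative assumptions, not DocTelliGen's actual implementation:

# pip install chromadb sentence-transformers
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
db = chromadb.Client().create_collection("docs")

# --- Acquiring information: chunk, embed, store ---
document = "..."  # full text of the uploaded document goes here
chunks = [document[i:i + 500] for i in range(0, len(document), 500)]  # naive fixed-size chunking
db.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
)

# --- Answering questions: embed the question, retrieve, ask the LLM ---
question = "What is the shape of the library?"
hits = db.query(
    query_embeddings=embedder.encode([question]).tolist(),
    n_results=3,  # the 3 most similar chunks
)
context = "\n\n".join(hits["documents"][0])

prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = ask_llm(prompt)  # hypothetical call to your LLM of choice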