Cloud Native Night, December 2020, talk by Jörg Viechtbauer (Senior Software Architect, QAware)
Abstract:
Neural networks like BERT have revolutionized the processing of natural language and achieve state-of-the-art performance in many NLP tasks. One of them is semantic search, where documents are found by query intent and not only by exact match.
This talk takes us through the history of information retrieval and shows how keyword search has evolved into the term vector model. The desire for better search led to the development of the first semantic models like LSI and PLSA. We will see how this culminates today in the use of sophisticated deep neural networks that perform nonlinear dimensionality reduction and master long-range dependencies.
Semantic search has never been as good and easy to implement as it is today.
About Jörg:
Jörg is a search expert at QAware and uses neural networks for semantic search and text comprehension. He has spent almost 20 years developing search engines based on both proprietary and open source software for enterprise search, eDiscovery and local search - always hunting for the perfect ranking formula.
3. A cloud-native's questions
How many requests were handled yesterday?
Where is this value set?
Why doesn't the microservice start?
It did start? How is that even possible?
We have all used grep to get that important information!
And now I just can't think of this stupid song name…
THAT TYPICAL MOTIVATIONAL SLIDE
4. wget -r -q -O- http://the-internet | grep "austrian rap song about mozart"
SO WHY NOT USE OUR TRUSTY GREP…?
5. wget -r -q -O- http://the-internet | grep "austrian rap song about mozart"
…COMBINE IT WITH WGET -r
6. wget -r -q -O- http://the-internet | grep "austrian rap song about mozart"
Problem solved!
= SEMANTIC INTERNET SEARCH IN ONE LINE
8. wget -r -q -O- http://the-internet | grep "austrian rap song about mozart"
Problem solved!
Thank you!
Questions?
= SEMANTIC INTERNET SEARCH IN ONE LINE
13. wget -r -q -O- http://the-internet | grep "austrian rap song about mozart"
How to make that work fast? How to make that work at all?
LARGE SCALE SEMANTIC SEARCH
14. wget -r -q -O- http://the-internet | grep "austrian rap song about mozart"
How to make that work fast? How to make that work at all?
TODAY'S TOPIC…
15. DISCLAIMER – MINOR SIMPLIFICATIONS AHEAD
Recipe for Red Thai Curry
Cut stuff into small pieces
Throw everything together
Heat
Bon appétit!
Takeaway (https://commons.wikimedia.org/wiki/File:Kaeng_phet_mu.jpg), "Kaeng phet mu", https://creativecommons.org/licenses/by-sa/3.0/legalcode
17. Document
https://en.wikipedia.org/wiki/Cat
Text
A cat can either be a house cat, a farm cat or a feral cat.
Normalization = lowercase + stemming/lemmatization (optional)
a cat can either be a house cat a farm cat or a feral cat
Document vector
{"a":4, "be":1, "can":1, "cat":4, "either":1, "farm":1, "feral":1, "house":1, "or":1}
DOCUMENT VECTOR
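As a minimal sketch, this counting step might look like the following in Python (the tokenizer and function name are mine, not from the talk; stemming is omitted):

import re
from collections import Counter

def document_vector(text: str) -> Counter:
    # normalization: lowercase + naive tokenization
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

print(document_vector("A cat can either be a house cat, a farm cat or a feral cat."))
# Counter({'a': 4, 'cat': 4, 'can': 1, 'either': 1, 'be': 1, 'house': 1, 'farm': 1, 'or': 1, 'feral': 1})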
18. Document vector
{"a":4, "be":1, "can":1, "cat":4, "either":1, "farm":1, "feral":1, "house":1, "or":1}
Document matrix
DOC     |  a  be  can  cat  curry  dog  either  farm  feral  house  or
--------+-------------------------------------------------------------
wikicat |  4   1    1    4      .    .       1     1      1      1   1
wikidog |  1   .    .    .      .    1       .     .      .      .   .
curry   |  .   .    .    .     10    .       .     .      .      .   .
DOCUMENT VECTOR
19. Cosine similarity
Transform document $d_i$ into document vector $\vec{d_i}$
Create query vector $\vec{q}$
Score for document $d_i$ and query $q$:
$\mathrm{sim}(d_i, q) = \frac{\vec{d_i}}{\lVert \vec{d_i} \rVert} \cdot \frac{\vec{q}}{\lVert \vec{q} \rVert}$
VECTOR SPACE MODEL SEARCH
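A direct sketch of this score in Python with NumPy (the toy vectors are mine, not from the talk):

import numpy as np

def cosine_similarity(d: np.ndarray, q: np.ndarray) -> float:
    # normalize both vectors to unit length, then take the dot product
    return float(np.dot(d / np.linalg.norm(d), q / np.linalg.norm(q)))

doc = np.array([4.0, 1.0, 1.0, 4.0])    # counts for a, be, can, cat
query = np.array([0.0, 0.0, 0.0, 1.0])  # query: "cat"
print(cosine_similarity(doc, query))    # ~0.69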
21. Better formula for vector values
$d_i(w) = \log(1 + \mathrm{tf}_i(w)) + \log\frac{|D|}{\mathrm{df}(w)}$
($\mathrm{tf}_i(w)$ = frequency of $w$ in document $i$, $\mathrm{df}(w)$ = number of documents containing $w$, $|D|$ = total number of documents; same formula for $q$)
Super simple
Very fast implementation possible
Pretty good search quality! (baseline for all retrieval systems)
TF*IDF VECTORS
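A minimal sketch of this weighting, assuming the additive combination shown on the slide (function and parameter names are mine; many systems multiply the two parts instead):

import math

def tfidf_weight(tf: int, df: int, num_docs: int) -> float:
    # log-dampened term frequency plus inverse document frequency
    return math.log(1 + tf) + math.log(num_docs / df)

# "cat" occurs 4x in the document and appears in 1,000 of 1,000,000 documents
print(tfidf_weight(tf=4, df=1000, num_docs=1_000_000))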
22. Find information not by matching keywords but by matching intent
Woah! Let's start with synonyms!
Implementations
Manually crafted thesauri (= synonyms, associations, hierarchies)
Vector space dimensionality reduction
Find a good approximation of the term vectors in a low-dimensional space (100,000 -> 300)
The condensed representation captures the essence of the data (=> is that meaning…?)
SEMANTIC SEARCH
24. Decompose the document-term matrix A into smaller matrices
n = topic space (much smaller than t), typically n = 100-500
[Diagram: A (t × d) = U (t × n) × Σ (n × n) × V (n × d), where U = topics and V = document embeddings]
MATRIX FACTORIZATION
26. Uses singular value decomposition
All of you did this in the first semester (=> "Linear algebra"). (Do you remember?)
Unfortunately (tested on TREC7-SDR):
topic 0 = world news london boston eddie mair lisa mullins radio public international cnn pri npr washington back next ahead bbc coming …
Uhhh.
And it is slow.
And uhhh…
LSI - LATENT SEMANTIC INDEXING
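A sketch of LSI on a toy matrix using NumPy's SVD (the data and names are mine, not from the talk):

import numpy as np

# toy document-term matrix: rows = terms, columns = documents
A = np.array([[4.0, 0.0, 0.0],   # cat
              [0.0, 1.0, 0.0],   # dog
              [0.0, 0.0, 10.0],  # curry
              [1.0, 1.0, 0.0]])  # a

# full SVD, then keep only the n strongest topics (truncated SVD)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
n = 2
topics = U[:, :n]                                 # term-to-topic mapping
doc_embeddings = (np.diag(s[:n]) @ Vt[:n, :]).T   # documents in topic space
print(doc_embeddings)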
27. Same idea: matrix decomposition
But a different approach: the expectation-maximization (EM) algorithm
Learns co-occurrences of words in documents
All values are positive
Easy to implement (next slide)
Fast
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
28.
for step in range(20):  # <-- this is way way way too simple to work in "reality(tm)"
    #-- RESET ACCUMULATORS ----------------------------------------------------
    padn = [[0] * aspects for j in range(docs)]
    pawn = [[0] * aspects for j in range(words)]
    pan = [0] * aspects
    #-- PROCESS ALL DOCUMENTS -------------------------------------------------
    for docId in range(docs):
        #-- ITERATE OVER ALL WORDS ----------------------------------------------
        for wordId in range(words):
            #-- E-STEP (THIS CAN BE DONE MUCH MORE EFFICIENTLY) ---------------------
            pzdw = [pad[docId][a] * paw[wordId][a] / pa[a] for a in range(aspects)]
            norm(pzdw)
            scale(pzdw, arrDoc[docId][wordId])
            #-- M-STEP ----------------------------------------------------------
            add(padn[docId], pzdw)
            add(pawn[wordId], pzdw)
            add(pan, pzdw)
    #-- SAVE ACCUMULATORS -------------------------------------------------------
    pad = padn
    paw = pawn
    pa = pan
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
This actually works and is not very far away from a practical implementation!
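The slide leaves out the tiny vector helpers and the initialization. A minimal reconstruction of what they would have to do (my assumption, not from the talk; docs, words, aspects and the count matrix arrDoc are assumed to exist):

import random

def norm(v):
    # normalize v in place so its entries sum to 1
    s = sum(v)
    if s > 0:
        for i in range(len(v)):
            v[i] /= s

def scale(v, factor):
    # multiply every entry of v in place by the observed word count
    for i in range(len(v)):
        v[i] *= factor

def add(acc, v):
    # element-wise accumulation of v into acc
    for i in range(len(v)):
        acc[i] += v[i]

# random positive start values; EM then iterates towards a local optimum
pad = [[random.random() for _ in range(aspects)] for _ in range(docs)]
paw = [[random.random() for _ in range(aspects)] for _ in range(words)]
pa = [sum(pad[d][a] for d in range(docs)) for a in range(aspects)]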
29. topic0
0.01928 kaczynski
0.01246 israel
0.00936 arafat
0.00910 israeli
0.00800 palestinian
0.00613 netanyahu
0.00607 minister
0.00584 peace
0.00536 judge
0.00506 suicide
0.00500 albright
0.00479 himself
0.00478 unabomber
0.00475 prime
0.00463 trial
0.00462 theodore
0.00450 said
0.00447 even
...
topic11
0.02723 hong
0.02691 kong
0.00811 human
0.00640 health
0.00614 flu
0.00611 those
0.00572 virus
0.00537 said
0.00524 any
0.00523 genetic
0.00487 government
0.00424 right
0.00420 them
0.00414 may
0.00414 don
0.00409 want
0.00405 china
0.00401 million
...
topic36
0.03553 space
0.02133 mir
0.01431 station
0.01076 mission
0.00927 crew
0.00926 russian
0.00924 nasa
0.00904 mars
0.00782 shuttle
0.00573 craft
0.00536 launch
0.00516 foale
0.00516 astronaut
0.00515 earth
0.00488 its
0.00479 tomorrow
0.00467 pathfinder
0.00466 off
...
PLSA ON TREC7-SDR (STANDARD CORPUS), n=50
30. Project the document into the latent semantic space (= matrix multiplication)
Project the query into the latent semantic space (= matrix multiplication)
Calculate the cosine similarity
Result
Works well (10-15% better search quality*)
*mean average precision (a measure of search quality)
SEMANTIC SEARCH USING PLSA
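A rough sketch of these two steps (function names and setup are mine, not from the talk):

import numpy as np

def project(term_vector: np.ndarray, topics: np.ndarray) -> np.ndarray:
    # topics: term x aspect matrix learned by PLSA (or LSI);
    # projecting into the latent space is just a matrix multiplication
    return term_vector @ topics

def score(doc_vec: np.ndarray, query_vec: np.ndarray, topics: np.ndarray) -> float:
    # cosine similarity in the latent semantic space
    d = project(doc_vec, topics)
    q = project(query_vec, topics)
    return float(np.dot(d, q) / (np.linalg.norm(d) * np.linalg.norm(q)))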
42. NEURAL NETWORKS
Motivation
Brains are large networks of neurons connected by axons (and somewhat successful)
Can approximate any input-output data function (universal approximation theorem)
Potentially massively parallel execution (= fast, if you are Google, Microsoft, Amazon)
Very successful with many highly complex problems
Paradigm shift
You do not try to find an algorithm that solves the problem
You only need to provide enough examples (training data)
43. NEURAL NETWORKS - COMPONENTS
Inputs/Outputs
Numbers
$in_i$: temperature, count, color value…
$out$: value (~ log-probability)
Neurons
Apply an activation function to the sum of the inputs
Bias (fixed input set to 1)
Activation function $f$ (non-linear, monotonic, smooth, differentiable, $f(0) = 0$, $f'(0) = 1$)
Connections
Between neurons
Each connection has a weight ($w_i$, $b$)
$out = f\left(b + \sum_i w_i \cdot in_i\right)$
[Diagram: inputs $in_1, in_2, in_3$ with weights $w_1, w_2, w_3$, a bias input fixed to 1 with weight $b$, activation $f$, output $out$]
<= The weights define the "algorithm"
"Magically" trained from examples!
49. EXAMPLE
[Diagram: a single neuron with inputs x and y, both weights set to 1, activation f, output out]
x  y  weighted sum  f(w-sum) = out
0  0  0             0
1  0  1             1
0  1  1             1
1  1  2             1
=> That's the Boolean OR
(at least pretend to be impressed)
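A tiny sketch of this neuron in Python; the slide does not pin down f exactly, so I substitute clipping to [0, 1] as a stand-in, which reproduces the table:

def f(x: float) -> float:
    # stand-in activation: clip to [0, 1]
    return max(0.0, min(1.0, x))

def neuron(x: float, y: float, wx: float = 1.0, wy: float = 1.0, b: float = 0.0) -> float:
    # out = f(b + sum of weighted inputs)
    return f(b + wx * x + wy * y)

for x, y in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x, y, neuron(x, y))  # reproduces the Boolean OR table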
52. AND MANY HIDDEN LAYERS (DEEP LEARNING)
[Diagram: stacked hidden layers detect simple features first, then increasingly complex features]
53. NEURAL NETWORKS FOR IMAGE RECOGNITION
[Diagram: an image is fed through the network; the outputs are the probability for cat, for dog, and for thai curry]
Told you this would not be gentle!
54. AND NOW WHAT?
[Diagram: the same network, with outputs probability for cat, for dog, for thai curry]
57. …AND UPDATE THE WEIGHTS
Target: probability for cat = 1, probability for dog = 0, probability for thai curry = 0
And that is the beauty of neural networks…
Weights are learned automagically!
58. UNDER THE HOOD
CIFAR-10 - Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009
Initialization
Set all weights to random values
Training
Show a training example
Adjust the weights a bit in the direction of the correct answer (=> gradient descent)
Repeat (until "happy")
59. TRAIN A NEURAL NETWORK
CIFAR-10 - Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009
Python (Keras)
model.fit(images,
          expectedClasses,
          epochs=50,
          batch_size=32)
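The slide shows only the fit call; for context, a minimal model that this could train on CIFAR-10 might look like the following sketch (the architecture is my assumption, not from the talk):

import tensorflow as tf

# a deliberately small convolutional network for 32x32 RGB images, 10 classes
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# then, as on the slide:
# model.fit(images, expectedClasses, epochs=50, batch_size=32)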
63. In theory (but only there)
The one-hidden-all-dense-layer model approach can handle every problem
In practice
Training such a model can take ages (and probably will not be good)
Much better: a configuration specifically tailored to the problem
Very difficult to find if you have to start from scratch (research)
Creating good training data can be hard
Good news
Many well-proven configurations
Many pre-trained and ready-to-use models
Adapt a pre-trained model to your problem (=> transfer learning)
IT IS NOT QUITE THAT SIMPLE
65. HOW TO HANDLE TEXT WITH A NN?
Text
The king and the queen live in the castle.
One-hot encoding
One input for each word in the fixed vocabulary.
[Diagram: one input node per vocabulary word: queen, the, live, and, castle, king]
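A minimal sketch of one-hot encoding over such a vocabulary (names are mine, not from the talk):

vocabulary = ["and", "castle", "king", "live", "queen", "the"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word: str) -> list[int]:
    # a vector with a single 1 at the position of the word
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1
    return vec

print(one_hot("king"))  # [0, 0, 1, 0, 0, 0]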
66. INFORMATION FUNNEL
Input format = output format, with a neural network in between
Why? (And why that model?) We'll see!
[Diagram: a one-hot input layer and a one-hot output layer over the same vocabulary, connected through a narrow hidden layer]
67. INFORMATION FUNNEL
Text
The king and the queen live in the castle.
Training
For all sentences in Wikipedia…
Input: one word of the sentence
Output: all words in the sentence
71. INFORMATION FUNNEL
We forced the neural network to pass the information through a funnel.
In order to reconstruct the input it needs to learn relations between words.
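A minimal Keras-style sketch of such a funnel, under the assumption of an autoencoder-like setup over one-hot vectors (sizes and names are mine, not from the talk):

import tensorflow as tf

vocab_size = 50000   # one input and one output per vocabulary word
funnel_size = 300    # the narrow middle layer = the embedding dimension

model = tf.keras.Sequential([
    tf.keras.layers.Dense(funnel_size, input_shape=(vocab_size,)),  # the funnel
    tf.keras.layers.Dense(vocab_size, activation="softmax"),        # reconstruct the words
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# training pairs: input = one-hot word, target = bag of words of its sentence;
# afterwards, the first layer's weight matrix holds the word embeddings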
73. Word embeddings
Trained on a large number of input sentences
Not all use a neural network to generate the embedding
Freely available, ready to use (http://nlp.stanford.edu/data/glove.840B.300d.zip)
Search with word embeddings
Instead of the PLSA embeddings, we can use the GloVe embeddings
As document/query vector, use the average of the embeddings of its words
Cosine similarity
GloVe/Word2Vec/fastText
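A minimal sketch of this averaging, assuming the embeddings have already been loaded into a dict (names are mine, not from the talk):

import numpy as np

def text_embedding(text: str, embeddings: dict) -> np.ndarray:
    # average the embedding vectors of all known words in the text
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0)

# 'embeddings' maps word -> 300-dimensional vector, e.g. loaded from the
# GloVe file linked above; documents and queries are embedded the same way,
# then compared with cosine similarity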
76. Word embeddings are context-free
Embedding of a sentence from word embeddings
sentence embedding = average of term embeddings
Each term always has the same embedding
But the meaning of a word depends on the context
mouse (rodent, trap, computer, eye, garlic …)
cell (phone, prison, blood/skin, solar, = some people, hermitage…)
Sentence embeddings
Embedding of a term depends on the context
BERT – WHY SENTENCE EMBEDDINGS?
77. Use word embeddings for every position in a sentence (word => sentence)
Take a gigantic neural network of a special type (Transformer)
Input: the sentence where one word has been blanked out
Output: the complete sentence
The king and the queen live in the castle.
The ____ and the queen live in the castle.
The king and the _____ live in the castle.
The king and the queen live in the ______.
Finally, let a gazillion TensorFlow units burn on absurd amounts of data
BERT – THE ROUGH IDEA
78.
!pip install -U sentence-transformers
!pip install scipy

import scipy.spatial  # plain "import scipy" is not enough for scipy.spatial.distance
from sentence_transformers import SentenceTransformer

sentences = ["the sun shines",
             "the sky is blue",
             "we have good weather",
             "bert is amazing",
             "sentence embeddings rock",
             "it is raining",
             "uhh i need a rain coat",
             "that's pretty bad weather"]

model = SentenceTransformer("roberta-large-nli-mean-tokens")  # <== plenty to choose from
sentence_embeddings = model.encode(sentences)
distances = scipy.spatial.distance.cdist(sentence_embeddings, sentence_embeddings, "cosine")
print(distances)
BERT – CODE
82.
                           |  0   1   2   3   4   5   6   7
---------------------------+--------------------------------
the sun shines           0 |  .   .   .   .   .   .   .   .
the sky is blue          1 | 76   .   .   .   .   .   .   .
we have good weather     2 | 81  75   .   .   .   .   .   .
bert is amazing          3 | 52  43  61   .   .   .   .   .
sentence embeddings rock 4 | 46  40  51  69   .   .   .   .
BERT – EXAMPLES
83. Semantic search summary
                 | Latent semantic indexing     | Probabilistic latent semantic indexing | Word2vec, GloVe, fastText… | BERT + variations
-----------------+------------------------------+----------------------------------------+----------------------------+-------------------
Approach         | Matrix decomposition via SVD | Matrix decomposition via EM algorithm  | Neural network             | Neural network
Interpretability | ?                            | very good                              | good                       | okay
Level            | word                         | word                                   | word                       | sentence
Ready-to-use?    | no, and difficult            | nope, feasible                         | yes, easy                  | yes, very easy
Type             | linear                       | linear                                 | non-linear                 | non-linear
Quality          | meh                          | good                                   | good                       | yihaaa!