Cloud Native Night, December 2020, talk by Jörg Viechtbauer (Senior Software Architect, QAware)
Abstract:
Neural networks like BERT have revolutionized the processing of natural language and achieve state-of-the-art performance in many NLP tasks. One of them is semantic search, where documents are found by query intent and not only by exact match.
This talk takes us through the history of information retrieval and shows how keyword search has evolved into the term vector model. The desire for better search led to the development of the first semantic models like LSI and PLSA. We will see how this culminates today in the use of sophisticated deep neural networks that perform nonlinear dimensionality reduction and master long-range dependencies.
Semantic search has never been as good and easy to implement as it is today.
About Jörg:
Jörg is a search expert at QAware and uses neural networks for semantic search and text comprehension. He has spent almost 20 years developing search engines based on both proprietary and open source software for enterprise search, eDiscovery and local search - always hunting for the perfect ranking formula.
3. A cloud-native's questions
How many requests were handled yesterday?
Where is this value set?
Why doesn't the microservice start?
It did start? How is that even possible?
We have all used grep to get that important information!
And now I just can't think of this stupid song name…
THAT TYPICAL MOTIVATIONAL SLIDE
4. wget -r -q -O- http://the-internet | grep "austrian rap song about mozart"
SO WHY NOT USE OUR TRUSTY GREP…?
5. wget -r -q -O- http://the-internet | grep "austrian rap song about mozart"
…COMBINE IT WITH WGET -r
6. wget -r -q -O- http://the-internet | grep "austrian rap song about mozart"
Problem solved!
= SEMANTIC INTERNET SEARCH IN ONE LINE
8. wget -r -q -O- http://the-internet | grep "austrian rap song about mozart"
Problem solved!
Thank you!
Questions?
= SEMANTIC INTERNET SEARCH IN ONE LINE
13. wget -r -q -O- http://the-internet | grep "austrian rap song about mozart"
How to make that work fast? How to make that work at all?
LARGE SCALE SEMANTIC SEARCH
14. wget -r -q -O- http://the-internet | grep "austrian rap song about mozart"
How to make that work fast? How to make that work at all?
TODAY'S TOPIC…
15. DISCLAIMER – MINOR SIMPLIFICATIONS AHEAD
Recipe for Red Thai Curry
Cut stuff into small pieces
Throw everything together
Heat
Bon appétit!
Takeaway (https://commons.wikimedia.org/wiki/File:Kaeng_phet_mu.jpg), "Kaeng phet mu", https://creativecommons.org/licenses/by-sa/3.0/legalcode
17. Document
https://en.wikipedia.org/wiki/Cat
Text
A cat can either be a house cat, a farm cat or a feral cat.
Normalization = lowercase + stemming/lemmatization (optional)
a cat can either be a house cat a farm cat or a feral cat
Document vector
{"a":4, "be":1, "can":1, "cat":4, "either":1, "farm":1, "feral":1, "house":1, "or":1}
DOCUMENT VECTOR
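As a minimal sketch, this counting step might look like the following in Python (the tokenizer and function name are mine, not from the talk; stemming is omitted):

import re
from collections import Counter

def document_vector(text: str) -> Counter:
    # normalization: lowercase + naive tokenization
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

print(document_vector("A cat can either be a house cat, a farm cat or a feral cat."))
# Counter({'a': 4, 'cat': 4, 'can': 1, 'either': 1, 'be': 1, 'house': 1, 'farm': 1, 'or': 1, 'feral': 1})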
18. Document vector
{"a":4, "be":1, "can":1, "cat":4, "either":1, "farm":1, "feral":1, "house":1, "or":1}
Document matrix
DOC     |  a  be  can  cat  curry  dog  either  farm  feral  house  or
--------+-------------------------------------------------------------
wikicat |  4   1    1    4      .    .       1     1      1      1   1
wikidog |  1   .    .    .      .    1       .     .      .      .   .
curry   |  .   .    .    .     10    .       .     .      .      .   .
DOCUMENT VECTOR
19. Cosine similarity
Transform document $d_i$ into document vector $\vec{d_i}$
Create query vector $\vec{q}$
Score for document $d_i$ and query $q$:
$\mathrm{sim}(d_i, q) = \frac{\vec{d_i}}{\lVert \vec{d_i} \rVert} \cdot \frac{\vec{q}}{\lVert \vec{q} \rVert}$
VECTOR SPACE MODEL SEARCH
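A direct sketch of this score in Python with NumPy (the toy vectors are mine, not from the talk):

import numpy as np

def cosine_similarity(d: np.ndarray, q: np.ndarray) -> float:
    # normalize both vectors to unit length, then take the dot product
    return float(np.dot(d / np.linalg.norm(d), q / np.linalg.norm(q)))

doc = np.array([4.0, 1.0, 1.0, 4.0])    # counts for a, be, can, cat
query = np.array([0.0, 0.0, 0.0, 1.0])  # query: "cat"
print(cosine_similarity(doc, query))    # ~0.69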
21. Better formula for vector values
$d_i(w) = \log(1 + \mathrm{tf}_i(w)) + \log\frac{|D|}{\mathrm{df}(w)}$
($\mathrm{tf}_i(w)$ = frequency of $w$ in document $i$, $\mathrm{df}(w)$ = number of documents containing $w$, $|D|$ = total number of documents; same formula for $q$)
Super simple
Very fast implementation possible
Pretty good search quality! (baseline for all retrieval systems)
TF*IDF VECTORS
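A minimal sketch of this weighting, assuming the additive combination shown on the slide (function and parameter names are mine; many systems multiply the two parts instead):

import math

def tfidf_weight(tf: int, df: int, num_docs: int) -> float:
    # log-dampened term frequency plus inverse document frequency
    return math.log(1 + tf) + math.log(num_docs / df)

# "cat" occurs 4x in the document and appears in 1,000 of 1,000,000 documents
print(tfidf_weight(tf=4, df=1000, num_docs=1_000_000))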
22. Find information not by matching keywords but by matching intent
Woah! Let's start with synonyms!
Implementations
Manually crafted thesauri (= synonyms, associations, hierarchies)
Vector space dimensionality reduction
Find a good approximation of the term vectors in a low-dimensional space (100,000 -> 300)
The condensed representation captures the essence of the data (=> is that meaning…?)
SEMANTIC SEARCH
24. Decompose the document-term matrix A into smaller matrices
n = topic space (much smaller than t), typically n = 100-500
[Diagram: A (t × d) = U (t × n) × Σ (n × n) × V (n × d), where U = topics and V = document embeddings]
MATRIX FACTORIZATION
26. Uses singular value decomposition
All of you did this in the first semester (=> "Linear algebra"). (Do you remember?)
Unfortunately (tested on TREC7-SDR):
topic 0 = world news london boston eddie mair lisa mullins radio public international cnn pri npr washington back next ahead bbc coming …
Uhhh.
And it is slow.
And uhhh…
LSI - LATENT SEMANTIC INDEXING
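A sketch of LSI on a toy matrix using NumPy's SVD (the data and names are mine, not from the talk):

import numpy as np

# toy document-term matrix: rows = terms, columns = documents
A = np.array([[4.0, 0.0, 0.0],   # cat
              [0.0, 1.0, 0.0],   # dog
              [0.0, 0.0, 10.0],  # curry
              [1.0, 1.0, 0.0]])  # a

# full SVD, then keep only the n strongest topics (truncated SVD)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
n = 2
topics = U[:, :n]                                 # term-to-topic mapping
doc_embeddings = (np.diag(s[:n]) @ Vt[:n, :]).T   # documents in topic space
print(doc_embeddings)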
27. Same idea: matrix decomposition
But a different approach: the expectation-maximization (EM) algorithm
Learns co-occurrences of words in documents
All values are positive
Easy to implement (next slide)
Fast
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
28.
for step in range(20):  # <-- this is way way way too simple to work in "reality(tm)"
    #-- RESET ACCUMULATORS ----------------------------------------------------
    padn = [[0] * aspects for j in range(docs)]
    pawn = [[0] * aspects for j in range(words)]
    pan = [0] * aspects
    #-- PROCESS ALL DOCUMENTS -------------------------------------------------
    for docId in range(docs):
        #-- ITERATE OVER ALL WORDS ----------------------------------------------
        for wordId in range(words):
            #-- E-STEP (THIS CAN BE DONE MUCH MORE EFFICIENTLY) ---------------------
            pzdw = [pad[docId][a] * paw[wordId][a] / pa[a] for a in range(aspects)]
            norm(pzdw)
            scale(pzdw, arrDoc[docId][wordId])
            #-- M-STEP ----------------------------------------------------------
            add(padn[docId], pzdw)
            add(pawn[wordId], pzdw)
            add(pan, pzdw)
    #-- SAVE ACCUMULATORS -------------------------------------------------------
    pad = padn
    paw = pawn
    pa = pan
PLSA - PROBABILISTIC LATENT SEMANTIC ANALYSIS
This actually works and is not very far away from a practical implementation!
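The slide leaves out the tiny vector helpers and the initialization. A minimal reconstruction of what they would have to do (my assumption, not from the talk; docs, words, aspects and the count matrix arrDoc are assumed to exist):

import random

def norm(v):
    # normalize v in place so its entries sum to 1
    s = sum(v)
    if s > 0:
        for i in range(len(v)):
            v[i] /= s

def scale(v, factor):
    # multiply every entry of v in place by the observed word count
    for i in range(len(v)):
        v[i] *= factor

def add(acc, v):
    # element-wise accumulation of v into acc
    for i in range(len(v)):
        acc[i] += v[i]

# random positive start values; EM then iterates towards a local optimum
pad = [[random.random() for _ in range(aspects)] for _ in range(docs)]
paw = [[random.random() for _ in range(aspects)] for _ in range(words)]
pa = [sum(pad[d][a] for d in range(docs)) for a in range(aspects)]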
29. topic0
0.01928 kaczynski
0.01246 israel
0.00936 arafat
0.00910 israeli
0.00800 palestinian
0.00613 netanyahu
0.00607 minister
0.00584 peace
0.00536 judge
0.00506 suicide
0.00500 albright
0.00479 himself
0.00478 unabomber
0.00475 prime
0.00463 trial
0.00462 theodore
0.00450 said
0.00447 even
...
topic11
0.02723 hong
0.02691 kong
0.00811 human
0.00640 health
0.00614 flu
0.00611 those
0.00572 virus
0.00537 said
0.00524 any
0.00523 genetic
0.00487 government
0.00424 right
0.00420 them
0.00414 may
0.00414 don
0.00409 want
0.00405 china
0.00401 million
...
topic36
0.03553 space
0.02133 mir
0.01431 station
0.01076 mission
0.00927 crew
0.00926 russian
0.00924 nasa
0.00904 mars
0.00782 shuttle
0.00573 craft
0.00536 launch
0.00516 foale
0.00516 astronaut
0.00515 earth
0.00488 its
0.00479 tomorrow
0.00467 pathfinder
0.00466 off
...
PLSA ON TREC7-SDR (STANDARD CORPUS), n=50
30. Project the document into the latent semantic space (= matrix multiplication)
Project the query into the latent semantic space (= matrix multiplication)
Calculate the cosine similarity
Result
Works well (10-15% better search quality*)
*mean average precision (a measure of search quality)
SEMANTIC SEARCH USING PLSA
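A rough sketch of these two steps (function names and setup are mine, not from the talk):

import numpy as np

def project(term_vector: np.ndarray, topics: np.ndarray) -> np.ndarray:
    # topics: term x aspect matrix learned by PLSA (or LSI);
    # projecting into the latent space is just a matrix multiplication
    return term_vector @ topics

def score(doc_vec: np.ndarray, query_vec: np.ndarray, topics: np.ndarray) -> float:
    # cosine similarity in the latent semantic space
    d = project(doc_vec, topics)
    q = project(query_vec, topics)
    return float(np.dot(d, q) / (np.linalg.norm(d) * np.linalg.norm(q)))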
42. NEURAL NETWORKS
Motivation
Brains are large networks of neurons connected by axons (and somewhat successful)
Can approximate any input-output data function (universal approximation theorem)
Potentially massively parallel execution (= fast, if you are Google, Microsoft, Amazon)
Very successful with many highly complex problems
Paradigm shift
You do not try to find an algorithm that solves the problem
You only need to provide enough examples (training data)
43. NEURAL NETWORKS - COMPONENTS
Inputs/Outputs
Numbers
$in_i$: temperature, count, color value…
$out$: value (~ log-probability)
Neurons
Apply an activation function to the sum of the inputs
Bias (fixed input set to 1)
Activation function $f$ (non-linear, monotonic, smooth, differentiable, $f(0) = 0$, $f'(0) = 1$)
Connections
Between neurons
Each connection has a weight ($w_i$, $b$)
$out = f\left(b + \sum_i w_i \cdot in_i\right)$
[Diagram: inputs $in_1, in_2, in_3$ with weights $w_1, w_2, w_3$, a bias input fixed to 1 with weight $b$, activation $f$, output $out$]
<= The weights define the "algorithm"
"Magically" trained from examples!
49. EXAMPLE
[Diagram: a single neuron with inputs x and y, both weights set to 1, activation f, output out]
x  y  weighted sum  f(w-sum) = out
0  0  0             0
1  0  1             1
0  1  1             1
1  1  2             1
=> That's the Boolean OR
(at least pretend to be impressed)
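A tiny sketch of this neuron in Python; the slide does not pin down f exactly, so I substitute clipping to [0, 1] as a stand-in, which reproduces the table:

def f(x: float) -> float:
    # stand-in activation: clip to [0, 1]
    return max(0.0, min(1.0, x))

def neuron(x: float, y: float, wx: float = 1.0, wy: float = 1.0, b: float = 0.0) -> float:
    # out = f(b + sum of weighted inputs)
    return f(b + wx * x + wy * y)

for x, y in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x, y, neuron(x, y))  # reproduces the Boolean OR table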
52. AND MANY HIDDEN LAYERS (DEEP LEARNING)
[Diagram: stacked hidden layers detect simple features first, then increasingly complex features]
53. NEURAL NETWORKS FOR IMAGE RECOGNITION
[Diagram: an image is fed through the network; the outputs are the probability for cat, for dog, and for thai curry]
Told you this would not be gentle!
54. AND NOW WHAT?
[Diagram: the same network, with outputs probability for cat, for dog, for thai curry]
57. …AND UPDATE THE WEIGHTS
Target: probability for cat = 1, probability for dog = 0, probability for thai curry = 0
And that is the beauty of neural networks…
Weights are learned automagically!
58. UNDER THE HOOD
CIFAR-10 - Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009
Initialization
Set all weights to random values
Training
Show a training example
Adjust the weights a bit in the direction of the correct answer (=> gradient descent)
Repeat (until "happy")
59. TRAIN A NEURAL NETWORK
CIFAR-10 - Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009
Python (Keras)
model.fit(images,
          expectedClasses,
          epochs=50,
          batch_size=32)
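The slide shows only the fit call; for context, a minimal model that this could train on CIFAR-10 might look like the following sketch (the architecture is my assumption, not from the talk):

import tensorflow as tf

# a deliberately small convolutional network for 32x32 RGB images, 10 classes
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# then, as on the slide:
# model.fit(images, expectedClasses, epochs=50, batch_size=32)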
63. In theory (but only there)
The one-hidden-all-dense-layer model approach can handle every problem
In practice
Training such a model can take ages (and probably will not be good)
Much better: a configuration specifically tailored to the problem
Very difficult to find if you have to start from scratch (research)
Creating good training data can be hard
Good news
Many well-proven configurations
Many pre-trained and ready-to-use models
Adapt a pre-trained model to your problem (=> transfer learning)
IT IS NOT QUITE THAT SIMPLE
65. HOW TO HANDLE TEXT WITH A NN?
Text
The king and the queen live in the castle.
One-hot encoding
One input for each word in the fixed vocabulary.
[Diagram: one input node per vocabulary word: queen, the, live, and, castle, king]
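A minimal sketch of one-hot encoding over such a vocabulary (names are mine, not from the talk):

vocabulary = ["and", "castle", "king", "live", "queen", "the"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word: str) -> list[int]:
    # a vector with a single 1 at the position of the word
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1
    return vec

print(one_hot("king"))  # [0, 0, 1, 0, 0, 0]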
66. INFORMATION FUNNEL
Input format = output format, with a neural network in between
Why? (And why that model?) We'll see!
[Diagram: a one-hot input layer and a one-hot output layer over the same vocabulary, connected through a narrow hidden layer]
67. INFORMATION FUNNEL
Text
The king and the queen live in the castle.
Training
For all sentences in Wikipedia…
Input: one word of the sentence
Output: all words in the sentence
71. INFORMATION FUNNEL
We forced the neural network to pass the information through a funnel.
In order to reconstruct the input it needs to learn relations between words.
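A minimal Keras-style sketch of such a funnel, under the assumption of an autoencoder-like setup over one-hot vectors (sizes and names are mine, not from the talk):

import tensorflow as tf

vocab_size = 50000   # one input and one output per vocabulary word
funnel_size = 300    # the narrow middle layer = the embedding dimension

model = tf.keras.Sequential([
    tf.keras.layers.Dense(funnel_size, input_shape=(vocab_size,)),  # the funnel
    tf.keras.layers.Dense(vocab_size, activation="softmax"),        # reconstruct the words
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# training pairs: input = one-hot word, target = bag of words of its sentence;
# afterwards, the first layer's weight matrix holds the word embeddings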
73. Word embeddings
Trained on a large number of input sentences
Not all use a neural network to generate the embedding
Freely available, ready to use (http://nlp.stanford.edu/data/glove.840B.300d.zip)
Search with word embeddings
Instead of the PLSA embeddings, we can use the GloVe embeddings
As document/query vector, use the average of the embeddings of its words
Cosine similarity
GloVe/Word2Vec/fastText
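A minimal sketch of this averaging, assuming the embeddings have already been loaded into a dict (names are mine, not from the talk):

import numpy as np

def text_embedding(text: str, embeddings: dict) -> np.ndarray:
    # average the embedding vectors of all known words in the text
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0)

# 'embeddings' maps word -> 300-dimensional vector, e.g. loaded from the
# GloVe file linked above; documents and queries are embedded the same way,
# then compared with cosine similarity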
76. Word embeddings are context-free
Embedding of a sentence from word embeddings
sentence embedding = average of term embeddings
Each term always has the same embedding
But the meaning of a word depends on the context
mouse (rodent, trap, computer, eye, garlic …)
cell (phone, prison, blood/skin, solar, = some people, hermitage…)
Sentence embeddings
Embedding of a term depends on the context
BERT – WHY SENTENCE EMBEDDINGS?
77. Use word embeddings for every position in a sentence (word => sentence)
Take a gigantic neural network of a special type (Transformer)
Input: the sentence where one word has been blanked out
Output: the complete sentence
The king and the queen live in the castle.
The ____ and the queen live in the castle.
The king and the _____ live in the castle.
The king and the queen live in the ______.
Finally, let a gazillion TensorFlow units burn on absurd amounts of data
BERT – THE ROUGH IDEA
78.
!pip install -U sentence-transformers
!pip install scipy

import scipy.spatial  # plain "import scipy" is not enough for scipy.spatial.distance
from sentence_transformers import SentenceTransformer

sentences = ["the sun shines",
             "the sky is blue",
             "we have good weather",
             "bert is amazing",
             "sentence embeddings rock",
             "it is raining",
             "uhh i need a rain coat",
             "that's pretty bad weather"]

model = SentenceTransformer("roberta-large-nli-mean-tokens")  # <== plenty to choose from
sentence_embeddings = model.encode(sentences)
distances = scipy.spatial.distance.cdist(sentence_embeddings, sentence_embeddings, "cosine")
print(distances)
BERT – CODE
82.
                           |  0   1   2   3   4   5   6   7
---------------------------+--------------------------------
the sun shines           0 |  .   .   .   .   .   .   .   .
the sky is blue          1 | 76   .   .   .   .   .   .   .
we have good weather     2 | 81  75   .   .   .   .   .   .
bert is amazing          3 | 52  43  61   .   .   .   .   .
sentence embeddings rock 4 | 46  40  51  69   .   .   .   .
BERT – EXAMPLES
83. Semantic search summary
                 | Latent semantic indexing     | Probabilistic latent semantic indexing | Word2vec, GloVe, fastText… | BERT + variations
-----------------+------------------------------+----------------------------------------+----------------------------+-------------------
Approach         | Matrix decomposition via SVD | Matrix decomposition via EM algorithm  | Neural network             | Neural network
Interpretability | ?                            | very good                              | good                       | okay
Level            | word                         | word                                   | word                       | sentence
Ready-to-use?    | no, and difficult            | nope, feasible                         | yes, easy                  | yes, very easy
Type             | linear                       | linear                                 | non-linear                 | non-linear
Quality          | meh                          | good                                   | good                       | yihaaa!