Deep Learning for Unified
Personalized Search
Recommendations
(and Fuzzy Tokenization)
Jake Mannix
Chief Data Engineer, Lucidworks
@pbrane | in/jakemannix
#Activate18 #ActivateSearch
$whoami
• Now: Chief Data Engineer, Lucidworks
• Applied ML / relevance / RecSys
• data engineering
• Previously:
• Allen Institute for AI: semantic search over research publications
• Twitter: account search, user interest modeling, RecSys
• LinkedIn: profile search, generic entity-to-entity RecSys
• Prehistory:
• Other software dev.
• Algebraic topology, particle cosmology
Agenda
• Personalized Search and the Clickstream
• Deep Learning To Rank
• Deep Tokenization for Lucene
Search Relevance Feature Types
• static document priors
• query intent class labels
• query entities
• query / doc text similarity
• personalization (p18n)
• clickstream
• (example Solr query which demonstrates all of these omitted
because it doesn’t fit on this slide)
Agenda: getting down to business
• Personalized Search and the Clickstream
• Deep Learning To Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Objective functions
• Distributed vs Local training
• Query time inference
• Deep Tokenization for Lucene
DL4IR: How I learned to stop worrying and
love deep neural networks
• Non-reasons:
• Always the best ranking results
• C++/CUDA under the hood => superfast inference
• “default” model works OOTB
• My reasons, as a data engineer:
• Extremely modular, unified framework
• Easily updatable models
• GPU => fewer distributed systems
• Domain Knowledge + Feature Engineering => Naive Vectorization +
Network Architecture Engineering
DL4IR: Why?
• Extremely modular, unified framework. DL models are:
• dissectible: reusable sub-modules
• composable: inputs to other models
• Easily updatable models
• ok, maybe not “easy”
• (because transfer learning is hard)
• GPU => fewer distributed systems
• GPU=supercomputer, CUDA already written
• Feature Engineering is not repeatable:
• Architecture Engineering is (more or less)
• in DL, features aren’t free, but are learned
Agenda: Deep LTR
• Deep Learning to Rank
• Embeddings:
• pre-trained
• from scratch
• fine tuned
• Text encoding
• P18n: userId embeddings
• clickstream: docId embeddings
• Objective functions
• Distributed vs Local training
• Query-time inference
Embeddings
• Pre-trained text embeddings:
• GloVe (https://nlp.stanford.edu/projects/glove/)
• NNLM on Google news (https://tfhub.dev/google/nnlm-en-dim128/1)
• fastText (https://fasttext.cc)
• ELMo (https://tfhub.dev/google/elmo/2)
• From scratch
• Many parameters -> lots of training data
• Can be unsupervised first, then treated as above
• Fine-tuned
• Start w/ pre-trained, w/ trainable=False
• Train as usual, but not to convergence
• Re-start training with trainable=True + a lower learning rate
Embeddings: keras code
Load pre-trained embeddings as a numpy array of dense vectors (indexed
by token-id), then just start building your model like so:
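The code images didn't survive the export; a minimal Keras sketch of what this slide likely showed (the file name and layer name are illustrative):

import numpy as np
from tensorflow.keras.layers import Input, Embedding

# row i holds the pre-trained vector for token-id i (hypothetical file)
embedding_matrix = np.load('pretrained_vectors.npy')
vocab_size, embed_dim = embedding_matrix.shape

token_ids = Input(shape=(None,), dtype='int32')
embedded = Embedding(vocab_size, embed_dim,
                     weights=[embedding_matrix],
                     trainable=False,   # flip to True when fine-tuning
                     name='embedding')(token_ids)
# ... stack your encoder / ranking layers on top of `embedded` ...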
After training, the embedding will be saved with your model, and
you can also extract it out:
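A sketch of the extraction half, assuming `model` is the trained model built on the snippet above and the layer kept the name 'embedding':

learned = model.get_layer('embedding').get_weights()[0]   # (vocab_size, embed_dim)
np.save('learned_embedding.npy', learned)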
Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding:
• chars vs words
• CNNs vs LSTMs
• P18n: userId embeddings
• clickstream: docId embeddings
• Objective functions
• Distributed vs Local training
• Query-time inference
Text encoding
• Characters vs Words:
• word embeddings require lots of data
• Millions of parameters => many GB of training data
• needs good tokenization + preprocessing
• (and the same preprocessing is needed in the data science pipeline and at query time!)
• Try char sequences instead!
• sometimes works for “old” ML
• works on small data
• on raw byte streams (no tokenizers)
• not my clever trick (cf. Zhang, Zhao, LeCun ’15)
1d-CNNs vs LSTMs: both operate on sequences
CNN: Convolutional Neural Network: 2d for images, 1d for text
LSTM: Long Short-Term Memory: updates its state as it reads; can emit the
sequence of states (one per position) as input for another LSTM:
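A hedged Keras sketch of the two encoders side by side (vocabulary and filter sizes are illustrative):

from tensorflow.keras.layers import Input, Embedding, Conv1D, LSTM

tokens = Input(shape=(None,), dtype='int32')
embedded = Embedding(50000, 128)(tokens)    # a (batch, n, 128) sequence

# both consume a (batch, n, k) sequence and emit per-position features:
cnn_seq  = Conv1D(filters=128, kernel_size=3, padding='same', activation='relu')(embedded)
lstm_seq = LSTM(128, return_sequences=True)(embedded)  # one state per position...
lstm_top = LSTM(128)(lstm_seq)                         # ...consumable by another LSTM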
LSTMs are “better”, but I ♥ CNNs
• LSTMs for text:
• A little harder to understand (boo!)
• (black box)-ish, not much to dissect (yay/boo?)
• Many parameters, needs big data (boo!)
• Not GPU-friendly -> slow to train (boo!)
• Often works OOTB w/ no tuning (yay!)
• Typically SOTA quality after significant tuning (yay!)
• CNNs for text:
• Fairly simple to understand (yay!)
• Easily dissectible (yay!)
• Few parameters, requires less training data (yay!)
• GPU-friendly -> super fast to train (yay!)
• Many many hyperparameters -> hard to tune (boo!)
• Currently not SOTA (boo!) but aren’t far off (yay!)
• Typically requires more code (boo!)
1D CNN text encoder: keras code
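This slide's code isn't preserved either; a minimal sketch of a char-level 1D CNN encoder (filter counts, window sizes, and input length are illustrative):

from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     GlobalMaxPooling1D)
from tensorflow.keras.models import Model

MAX_LEN, NUM_CHARS = 128, 96            # char-level input, hypothetical sizes

chars = Input(shape=(MAX_LEN,), dtype='int32')
x = Embedding(NUM_CHARS, 16)(chars)     # small char embedding
x = Conv1D(64, 3, padding='same', activation='relu')(x)
x = MaxPooling1D(pool_size=2)(x)        # halves the sequence length
x = Conv1D(128, 3, padding='same', activation='relu')(x)
x = MaxPooling1D(pool_size=2)(x)
x = Conv1D(256, 3, padding='same', activation='relu')(x)
encoded = GlobalMaxPooling1D()(x)       # fixed-size text vector

encoder = Model(inputs=chars, outputs=encoded, name='char_cnn_encoder')
encoder.summary()    # prints the layer shapes and sizes (next slide)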
1D CNN text encoder: layer shapes and sizes
p18n features
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• pre-trained RecSys (ALS) model
• from scratch w/ hashing trick
• clickstream: docId embeddings
• objective functions
• Distributed vs Local training
• Query-time inference
p18n: pre-trained “embeddings” vs hashing trick
ALS matrix decomposition from collaborative filtering as a
“pre-trained embedding”:
Or: just hash userIds into O(1k) buckets (~4x oversized, to avoid
total collisions) and learn an O(1k) x O(100) embedding for them,
like so:
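A minimal sketch of the hashing-trick half (bucket count, dimensions, and hash choice are illustrative):

import zlib
from tensorflow.keras.layers import Input, Embedding, Flatten

NUM_BUCKETS = 4096    # O(1k) buckets, ~4x oversized to avoid total collisions
USER_DIM = 128        # O(100)-dim learned vectors

def bucket(user_id: str) -> int:
    # stable hash (unlike Python's per-process salted hash()) so train/serve agree
    return zlib.crc32(user_id.encode('utf-8')) % NUM_BUCKETS

user_in = Input(shape=(1,), dtype='int32', name='user_bucket')
user_vec = Flatten()(Embedding(NUM_BUCKETS, USER_DIM)(user_in))   # learned from scratch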
Clickstream features
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• clickstream: docId embeddings
• same as for userId!
• can overfit easily
• “memorizing” query/doc history
• (which is sometimes ok…)
• Objective functions
• Distributed vs Local training
• Query-time inference
All together now: p18n query/doc CNN ranker
Picture > 1k words
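The architecture diagram is lost in the export; a hedged sketch of how the pieces might compose, reusing encoder, user_in, user_vec, MAX_LEN, NUM_BUCKETS, and USER_DIM from the sketches above:

from tensorflow.keras.layers import Input, Embedding, Flatten, Concatenate, Dense
from tensorflow.keras.models import Model

# clickstream docId embedding, built the same way as the userId one:
doc_id_in = Input(shape=(1,), dtype='int32', name='doc_bucket')
doc_id_vec = Flatten()(Embedding(NUM_BUCKETS, USER_DIM)(doc_id_in))

query_chars = Input(shape=(MAX_LEN,), dtype='int32', name='query_chars')
doc_chars = Input(shape=(MAX_LEN,), dtype='int32', name='doc_chars')
q_vec = encoder(query_chars)    # shared char-CNN text encoder
d_vec = encoder(doc_chars)

features = Concatenate()([q_vec, d_vec, user_vec, doc_id_vec])
score = Dense(1, activation='sigmoid')(Dense(256, activation='relu')(features))

ranker = Model(inputs=[query_chars, doc_chars, user_in, doc_id_in], outputs=score)
ranker.compile(optimizer='adam', loss='binary_crossentropy')   # click=1 / no-click=0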
Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• clickstream: docId embeddings
• Objective functions:
• Sentiment
• Text classification
• Text generation
• Identity function
• Ranking
• Distributed vs Local training
• Query-time inference
non-classification objectives
• Text generation: Neural Network Language Models (NNLM)
• Predict the next character/word from text
• Identity function: Autoencoder
• Predict the input as output
• Search Ranking: score(query, doc)
• query -click-> doc => score = 1
• query -no-click-> doc => score = 0
• better w/ triplets + “curriculum learning”:
• Start with random “no-click” pairs
• Later, pick docs Solr returns for query
• (but got no clicks!)
• eventually: docs w/ fewer clicks than expected
• (known as “hard negative mining”; loss sketch below)
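A hedged sketch of the triplet-style objective (margin value and score-packing convention are illustrative, not the talk's exact loss):

import tensorflow as tf

def pairwise_hinge(margin=0.5):
    # clicked doc should outscore the unclicked one by `margin`;
    # assumes y_pred packs [pos_score, neg_score] along the last axis
    def loss(y_true, y_pred):
        pos, neg = y_pred[:, 0], y_pred[:, 1]
        return tf.reduce_mean(tf.maximum(0.0, margin - pos + neg))
    return loss

# curriculum: train on random negatives first, then Solr-returned-but-unclicked
# docs, and eventually docs with fewer clicks than expected (hard negatives)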
Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Distributed vs Local training
• Query-time inference
Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Distributed vs Local training
• Query-time inference
• Ideally: minimal pre/post-processing
• beware of finicky tensor mappings!
• jvm: MLeap TF support
want: simple model serving config:
MLeap source: TF integration
http://mleap-docs.combust.ml/
(also supports SparkML, sklearn,
xgboost, etc)
(…and now for something completely different)
Agenda
• Personalized Search and the Clickstream
• Deep Learning to Rank
• Deep Tokens for Lucene
• char-CNN internals
• LSH for discretization
• Hierarchical semantic tokenization
Deep Tokens
• What does a 1d-CNN consume/emit?
• Consumes a sequence (length n) of k-dim vectors
• Emits a sequence (length n) of f-dim vectors
• (assuming sequences are pre+post-padded)
• If a CNN layer’s windows are w-wide, it requires:
• w*k*f parameters (plus biases; quick check below)
• Activations are often ReLU: >= 0 w/lots of 0’s
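A quick sanity check of that count in Keras (sizes are illustrative):

from tensorflow.keras.layers import Conv1D, Input
from tensorflow.keras.models import Model

inp = Input(shape=(128, 96))     # n=128 positions of k=96-dim vectors
out = Conv1D(filters=256, kernel_size=3, padding='same')(inp)   # f=256, w=3
Model(inp, out).summary()        # Conv1D params: w*k*f + f = 3*96*256 + 256 = 73,984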
Deep Tokens: intermediate layers
• 1d-CNN feature-vectors
• Consumes a sequence (length n) of k-dim vectors
• Emits a sequence (length n) of f-dim vectors
• (assuming sequences are pre+post-padded)
• If a CNN layer’s windows are w-wide, require:
• w*k*f parameters (plus biases)
• Activations are often ReLU: >= 0 w/lots of 0’s
• How to get this data?
• activs = [enc.layers[3].output, enc.layers[5].output]
• extractor = Model(inputs=enc.inputs, outputs=activs)
1d-char CNN feature vectors by layer
• layer 0:
• Learns simple features like word suffixes, simple morphology, spacing, etc
• layer 1:
• slightly more features like word roots, articles, pronouns, etc
• layer 2:
• complex features: words + common misspellings, hyphenations/concatenations
• layer n:
• Every time you pool + stride over the previous layer, the effective window grows by a factor of pool_size (worked example below)
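A tiny worked example of that growth, assuming w=3 convolutions and size-2/stride-2 pooling:

# effective receptive field (in input chars) as layers stack
rf, jump = 1, 1
for kind, k, s in [('conv', 3, 1), ('pool', 2, 2), ('conv', 3, 1),
                   ('pool', 2, 2), ('conv', 3, 1)]:
    rf += (k - 1) * jump    # each layer widens the window by (k-1) * current stride
    jump *= s               # strided pooling multiplies the stride between positions
    print(kind, 'sees', rf, 'chars')
# conv sees 3, pool sees 4, conv sees 8, pool sees 10, conv sees 18 chars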
How deep can a char-CNN go?!?
• “Very Deep Convolutional Networks for Text Classification”,
Conneau, Schwenk, LeCun, Barrault; ’17
• very small (3-char) windows, low filter count (64) early on
• “temporal version” of VGG architecture
• 29 layers, input as long as 1k chars
• Trained on 100k-3M docs
• 2.5 days on single GPU
• (I don’t know if this works for ranking)
What can we do with these vectors?
• Locality Sensitive Hash to int codes
• each dense vector becomes a 16-24 bit int
• text => List[Int] at each layer
• Layer 0: same length as input
• Layer N+1 after k-pooling: len(layer_n.output)/k
• Indexing List[Int] is easy!
• “makes sense” to an inverted index
• Query time
• Query => List[Int] per layer
• search as usual (with sparsity!)
LSH in 30 seconds:
• Random projections preserve distances, thanks to the Johnson-Lindenstrauss lemma
• Can pick totally random vectors
• Or: take a random sample of 2K vectors from your dataset and project via pᵢ = v₂ᵢ - v₂ᵢ₊₁ (sketch below)
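A minimal numpy sketch of turning one layer's f-dim activations into 16-bit codes (the purely-random-hyperplane variant):

import numpy as np

BITS, F_DIM = 16, 256
rng = np.random.default_rng(42)
planes = rng.standard_normal((BITS, F_DIM))   # random hyperplanes (or paired-sample diffs)

def lsh_code(vec):
    # sign-of-projection LSH: each hyperplane contributes one bit
    bits = (planes @ vec) > 0
    return int(bits.dot(1 << np.arange(BITS)))

layer_activations = rng.standard_normal((32, F_DIM))    # (positions, f) stand-in
deep_tokens = [lsh_code(v) for v in layer_activations]  # one int per position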
Deep Tokens: sample similar char-ngrams
• Trained 7-layer char-CNN ranker on 3M BestBuy ecommerce clicks (from Kaggle)
• 64-256 feature maps
• quasi-“hard” negative mining by taking docs returned by Solr but with no clicks
• Example ngrams similar at layer 3-ish or so:
• similar: “ rin”, “e ri”, “rinf”
• From: “lord of the ring”, “LOTR extended edition dvd”, “lord of the rinfs extended”
• and:
• “0 in”, “0in “, “ nch”, “inch”
• From: “70 inch lcd”, “55 nch tv”, “90in sony tv”
• and:
• “s z 8”, “ zs8 ”, “ sz8 ”, “lumix”
• From: “panasonic lumix s z 8”, “lumix zs8”, “panasonic dmc-zs8s”
• longer strings are similar at layers ~2 levels deeper:
• “10.1inches”, “lnch”, “inchplasma”, “inch”
• Still to do: full measurement of full DL ranking vs. approximate multilayer search on these
tokens, while sweeping the hyperparameter space and hashing strategies
Deep tokens: challenges
• Stability:
• Once model + LSH family is chosen, this is like “choosing an Analyzer” - changing requires
full reindex
• Hash functions which are “optimal” for one data set may be bad after indexing much more
data
• Similarity on differing scales with same semantics
• e.g. “55in” and “fifty five inch”
• (“shortcut” CNN connections needed?)
• Stop words
• want: no hash bucket (i.e. posting list) at any level to hold > 10% of the corpus
• Noisy tokens at earlier levels (maybe never “index” the first 3 layers?)
• More generally
• precision vs. recall tradeoff tuning
Related work: Xu, et al, CNNs for Text Hashing (IJCAI ’15)
and many more (but none with as fun an acronym)
Deep Tokens: TL;DR
• Configure a model w/ a deep char-CNN-based ranker + a search-relevance loss
• Train it as usual
• Configure a convolutional feature extractor (CFE)
• From documents:
• Extract convolutional activations
• (learned textual features!)
• LSH -> discrete buckets (“abstract tokens”)
• Index these tokens
• At query time, use this CFE for:
• posting-list-friendly, deeply fuzzy search! (tokenizer sketch below)
• (because really, you just have a very fancy tokenizer)
• N.B. char-CNN models are small (O(100-300k) params)
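A sketch of what that "very fancy tokenizer" might emit; the layer-prefixed term scheme is an assumption, not the talk's exact format:

import numpy as np

def deep_tokenize(char_ids, extractor, hashers):
    # char_ids: (1, MAX_LEN) array; extractor: the Keras CFE above;
    # hashers[i]: the LSH function for tapped layer i
    per_layer = extractor.predict(char_ids)   # one (1, positions, f) array per layer
    tokens = []
    for layer_idx, (acts, code) in enumerate(zip(per_layer, hashers)):
        tokens += ['L%d_%d' % (layer_idx, code(v)) for v in acts[0]]
    return tokens   # e.g. ['L3_48213', 'L3_9917', ...]; index and search as usual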
Thank you!
Jake Mannix
Chief Data Engineer, Lucidworks
@pbrane
#Activate18 #ActivateSearch
References:
• Coming soon
