Deep Learning for Unified
Personalized Search
Recommendations
(and Fuzzy Tokenization)
Jake Mannix
Chief Data Engineer, Lucidworks
@pbrane | in/jakemannix
#Activate18 #ActivateSearch
$whoami
• Now: Chief Data Engineer, Lucidworks
• Applied ML / relevance / RecSys
• data engineering
• Previously:
• Allen Institute for AI: semantic search over research publications
• Twitter: account search, user interest modeling, RecSys
• LinkedIn: profile search, generic entity-to-entity RecSys
• Prehistory:
• Other software dev.
• Algebraic topology, particle cosmology
Agenda
• Personalized Search and the Clickstream
• Deep Learning To Rank
• Deep Tokenization for Lucene
Search Relevance Feature Types
• static document priors
• query intent class labels
• query entities
• query / doc text similarity
• personalization (p18n)
• clickstream
• (example Solr query which demonstrates all of these omitted
because it doesn’t fit on this slide)
Agenda: getting down to business
• Personalized Search and the Clickstream
• Deep Learning To Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Objective functions
• Distributed vs Local training
• Query time inference
• Deep Tokenization for Lucene
DL4IR: How I learned to stop worrying and
love deep neural networks
• Non-reasons:
• Always the best ranking results
• C++/CUDA under the hood => superfast inference
• “default” model works OOTB
• My reasons, as a data engineer:
• Extremely modular, unified framework
• Easily updatable models
• GPU => fewer distributed systems
• Domain Knowledge + Feature Engineering => Naive Vectorization +
Network Architecture Engineering
DL4IR: Why?
• Extremely modular, unified framework. DL models are:
• dissectible: reusable sub-modules
• composable: inputs to other models
• Easily updatable models
• ok, maybe not “easy”
• (because transfer learning is hard)
• GPU => fewer distributed systems
• GPU=supercomputer, CUDA already written
• Feature Engineering is not repeatable:
• Architecture Engineering is (more or less)
• in DL, features aren’t free, but are learned
Agenda: Deep LTR
• Deep Learning to Rank
• Embeddings:
• pre-trained
• from scratch
• fine tuned
• Text encoding
• P18n: userId embeddings
• clickstream: docId embeddings
• Objective functions
• Distributed vs Local training
• Query-time inference
Embeddings
• Pre-trained text embeddings:
• GloVe (https://nlp.stanford.edu/projects/glove/)
• NNLM on Google news (https://tfhub.dev/google/nnlm-en-dim128/1)
• fastText (https://fasttext.cc)
• ELMo (https://tfhub.dev/google/elmo/2)
• From scratch
• Many parameters -> lots of training data
• Can be unsupervised first, then treated as above
• Fine-tuned
• Start w/ pre-trained, w/ trainable=False
• Train as usual, but not to convergence
• Re-start training with trainable=True + a lower learning rate
Embeddings: keras code
Load pre-trained embeddings as a numpy array of dense vectors (indexed
by token-id), then just start building your model like so:
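The code images didn't survive the export; a minimal Keras sketch of what this slide likely showed (the file name and layer name are illustrative):

import numpy as np
from tensorflow.keras.layers import Input, Embedding

# row i holds the pre-trained vector for token-id i (hypothetical file)
embedding_matrix = np.load('pretrained_vectors.npy')
vocab_size, embed_dim = embedding_matrix.shape

token_ids = Input(shape=(None,), dtype='int32')
embedded = Embedding(vocab_size, embed_dim,
                     weights=[embedding_matrix],
                     trainable=False,   # flip to True when fine-tuning
                     name='embedding')(token_ids)
# ... stack your encoder / ranking layers on top of `embedded` ...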
After training, the embedding will be saved with your model, and
you can also extract it out:
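A sketch of the extraction half, assuming `model` is the trained model built on the snippet above and the layer kept the name 'embedding':

learned = model.get_layer('embedding').get_weights()[0]   # (vocab_size, embed_dim)
np.save('learned_embedding.npy', learned)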
Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding:
• chars vs words
• CNNs vs LSTMs
• P18n: userId embeddings
• clickstream: docId embeddings
• Objective functions
• Distributed vs Local training
• Query-time inference
Text encoding
• Characters vs Words:
• word embeddings require lots of data
• Millions of parameters => many GB of training data
• needs good tokenization + preprocessing
• (and the same preprocessing is needed in the data science pipeline and at query time!)
• Try char sequences instead!
• sometimes works for “old” ML
• works on small data
• on raw byte streams (no tokenizers)
• not my clever trick (cf. Zhang, Zhao, LeCun ’15)
1d-CNNs vs LSTMs: both operate on sequences
CNN: Convolutional Neural Network: 2d for images, 1d for text
LSTM: Long Short-Term Memory: updates its state as it reads; can emit the
sequence of states (one per position) as input for another LSTM:
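A hedged Keras sketch of the two encoders side by side (vocabulary and filter sizes are illustrative):

from tensorflow.keras.layers import Input, Embedding, Conv1D, LSTM

tokens = Input(shape=(None,), dtype='int32')
embedded = Embedding(50000, 128)(tokens)    # a (batch, n, 128) sequence

# both consume a (batch, n, k) sequence and emit per-position features:
cnn_seq  = Conv1D(filters=128, kernel_size=3, padding='same', activation='relu')(embedded)
lstm_seq = LSTM(128, return_sequences=True)(embedded)  # one state per position...
lstm_top = LSTM(128)(lstm_seq)                         # ...consumable by another LSTM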
LSTMs are “better”, but I ♥ CNNs
• LSTMs for text:
• A little harder to understand (boo!)
• (black box)-ish, not much to dissect (yay/boo?)
• Many parameters, needs big data (boo!)
• Not GPU-friendly -> slow to train (boo!)
• Often works OOTB w/ no tuning (yay!)
• Typically SOTA quality after significant tuning (yay!)
• CNNs for text:
• Fairly simple to understand (yay!)
• Easily dissectible (yay!)
• Few parameters, requires less training data (yay!)
• GPU-friendly -> super fast to train (yay!)
• Many many hyperparameters -> hard to tune (boo!)
• Currently not SOTA (boo!) but aren’t far off (yay!)
• Typically requires more code (boo!)
1D CNN text encoder: keras code
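This slide's code isn't preserved either; a minimal sketch of a char-level 1D CNN encoder (filter counts, window sizes, and input length are illustrative):

from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     GlobalMaxPooling1D)
from tensorflow.keras.models import Model

MAX_LEN, NUM_CHARS = 128, 96            # char-level input, hypothetical sizes

chars = Input(shape=(MAX_LEN,), dtype='int32')
x = Embedding(NUM_CHARS, 16)(chars)     # small char embedding
x = Conv1D(64, 3, padding='same', activation='relu')(x)
x = MaxPooling1D(pool_size=2)(x)        # halves the sequence length
x = Conv1D(128, 3, padding='same', activation='relu')(x)
x = MaxPooling1D(pool_size=2)(x)
x = Conv1D(256, 3, padding='same', activation='relu')(x)
encoded = GlobalMaxPooling1D()(x)       # fixed-size text vector

encoder = Model(inputs=chars, outputs=encoded, name='char_cnn_encoder')
encoder.summary()    # prints the layer shapes and sizes (next slide)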
1D CNN text encoder: layer shapes and sizes
p18n features
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• pre-trained RecSys (ALS) model
• from scratch w/ hashing trick
• clickstream: docId embeddings
• objective functions
• Distributed vs Local training
• Query-time inference
p18n: pre-trained “embeddings” vs hashing trick
ALS matrix decomposition from collaborative filtering as a
“pre-trained embedding”:
Or: just hash userIds into O(1k) buckets (~4x oversized, to avoid
total collisions) and learn an O(1k) x O(100) embedding for them,
like so:
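A minimal sketch of the hashing-trick half (bucket count, dimensions, and hash choice are illustrative):

import zlib
from tensorflow.keras.layers import Input, Embedding, Flatten

NUM_BUCKETS = 4096    # O(1k) buckets, ~4x oversized to avoid total collisions
USER_DIM = 128        # O(100)-dim learned vectors

def bucket(user_id: str) -> int:
    # stable hash (unlike Python's per-process salted hash()) so train/serve agree
    return zlib.crc32(user_id.encode('utf-8')) % NUM_BUCKETS

user_in = Input(shape=(1,), dtype='int32', name='user_bucket')
user_vec = Flatten()(Embedding(NUM_BUCKETS, USER_DIM)(user_in))   # learned from scratch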
Clickstream features
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• clickstream: docId embeddings
• same as for userId!
• can overfit easily
• “memorizing” query/doc history
• (which is sometimes ok…)
• Objective functions
• Distributed vs Local training
• Query-time inference
All together now: p18n query/doc CNN ranker
Picture > 1k words
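The architecture diagram is lost in the export; a hedged sketch of how the pieces might compose, reusing encoder, user_in, user_vec, MAX_LEN, NUM_BUCKETS, and USER_DIM from the sketches above:

from tensorflow.keras.layers import Input, Embedding, Flatten, Concatenate, Dense
from tensorflow.keras.models import Model

# clickstream docId embedding, built the same way as the userId one:
doc_id_in = Input(shape=(1,), dtype='int32', name='doc_bucket')
doc_id_vec = Flatten()(Embedding(NUM_BUCKETS, USER_DIM)(doc_id_in))

query_chars = Input(shape=(MAX_LEN,), dtype='int32', name='query_chars')
doc_chars = Input(shape=(MAX_LEN,), dtype='int32', name='doc_chars')
q_vec = encoder(query_chars)    # shared char-CNN text encoder
d_vec = encoder(doc_chars)

features = Concatenate()([q_vec, d_vec, user_vec, doc_id_vec])
score = Dense(1, activation='sigmoid')(Dense(256, activation='relu')(features))

ranker = Model(inputs=[query_chars, doc_chars, user_in, doc_id_in], outputs=score)
ranker.compile(optimizer='adam', loss='binary_crossentropy')   # click=1 / no-click=0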
Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n: userId embeddings
• clickstream: docId embeddings
• Objective functions:
• Sentiment
• Text classification
• Text generation
• Identity function
• Ranking
• Distributed vs Local training
• Query-time inference
non-classification objectives
• Text generation: Neural Network Language Models (NNLM)
• Predict the next character/word from text
• Identity function: Autoencoder
• Predict the input as output
• Search Ranking: score(query, doc)
• query -click-> doc => score = 1
• query -no-click-> doc => score = 0
• better w/ triplets + “curriculum learning”:
• Start with random “no-click” pairs
• Later, pick docs Solr returns for query
• (but got no clicks!)
• eventually: docs w/ fewer clicks than expected
• (known as “hard negative mining”; loss sketch below)
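A hedged sketch of the triplet-style objective (margin value and score-packing convention are illustrative, not the talk's exact loss):

import tensorflow as tf

def pairwise_hinge(margin=0.5):
    # clicked doc should outscore the unclicked one by `margin`;
    # assumes y_pred packs [pos_score, neg_score] along the last axis
    def loss(y_true, y_pred):
        pos, neg = y_pred[:, 0], y_pred[:, 1]
        return tf.reduce_mean(tf.maximum(0.0, margin - pos + neg))
    return loss

# curriculum: train on random negatives first, then Solr-returned-but-unclicked
# docs, and eventually docs with fewer clicks than expected (hard negatives)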
Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Distributed vs Local training
• Query-time inference
Agenda
• Deep Learning to Rank
• Embeddings
• Text encoding
• p18n
• clickstream
• Distributed vs Local training
• Query-time inference
• Ideally: minimal pre/post-processing
• beware of finicky tensor mappings!
• jvm: MLeap TF support
want: simple model serving config:
MLeap source: TF integration
http://mleap-docs.combust.ml/
(also supports SparkML, sklearn,
xgboost, etc)
(…and now for something completely different)
Agenda
• Personalized Search and the Clickstream
• Deep Learning to Rank
• Deep Tokens for Lucene
• char-CNN internals
• LSH for discretization
• Hierarchical semantic tokenization
Deep Tokens
• What does a 1d-CNN consume/emit?
• Consumes a sequence (length n) of k-dim vectors
• Emits a sequence (length n) of f-dim vectors
• (assuming sequences are pre+post-padded)
• If a CNN layer’s windows are w-wide, it requires:
• w*k*f parameters (plus biases; quick check below)
• Activations are often ReLU: >= 0 w/lots of 0’s
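A quick sanity check of that count in Keras (sizes are illustrative):

from tensorflow.keras.layers import Conv1D, Input
from tensorflow.keras.models import Model

inp = Input(shape=(128, 96))     # n=128 positions of k=96-dim vectors
out = Conv1D(filters=256, kernel_size=3, padding='same')(inp)   # f=256, w=3
Model(inp, out).summary()        # Conv1D params: w*k*f + f = 3*96*256 + 256 = 73,984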
Deep Tokens: intermediate layers
• 1d-CNN feature-vectors
• Consumes a sequence (length n) of k-dim vectors
• Emits a sequence (length n) of f-dim vectors
• (assuming sequences are pre+post-padded)
• If a CNN layer’s windows are w-wide, require:
• w*k*f parameters (plus biases)
• Activations are often ReLU: >= 0 w/lots of 0’s
• How to get this data?
• activs = [enc.layers[3].output, enc.layers[5].output]
• extractor = Model(inputs=enc.inputs, outputs=activs)
1d-char CNN feature vectors by layer
• layer 0:
• Learns simple features like word suffixes, simple morphology, spacing, etc
• layer 1:
• slightly more features like word roots, articles, pronouns, etc
• layer 2:
• complex features: words + common misspellings, hyphenations/concatenations
• layer n:
• Every time you pool + stride over the previous layer, the effective window grows by a factor of pool_size (worked example below)
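A tiny worked example of that growth, assuming w=3 convolutions and size-2/stride-2 pooling:

# effective receptive field (in input chars) as layers stack
rf, jump = 1, 1
for kind, k, s in [('conv', 3, 1), ('pool', 2, 2), ('conv', 3, 1),
                   ('pool', 2, 2), ('conv', 3, 1)]:
    rf += (k - 1) * jump    # each layer widens the window by (k-1) * current stride
    jump *= s               # strided pooling multiplies the stride between positions
    print(kind, 'sees', rf, 'chars')
# conv sees 3, pool sees 4, conv sees 8, pool sees 10, conv sees 18 chars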
How deep can a char-CNN go?!?
• “Very Deep Convolutional Networks for Text Classification”,
Conneau, Schwenk, LeCun, Barrault; ’17
• very small (3-char) windows, low filter count (64) early on
• “temporal version” of VGG architecture
• 29 layers, input as long as 1k chars
• Trained on 100k-3M docs
• 2.5 days on single GPU
• (I don’t know if this works for ranking)
What can we do with these vectors?
• Locality Sensitive Hash to int codes
• each dense vector becomes a 16-24 bit int
• text => List[Int] at each layer
• Layer 0: same length as input
• Layer N+1 after k-pooling: len(layer_n.output)/k
• Indexing List[Int] is easy!
• “makes sense” to an inverted index
• Query time
• Query => List[Int] per layer
• search as usual (with sparsity!)
LSH in 30 seconds:
• Random projections preserve distances, thanks to the Johnson-Lindenstrauss lemma
• Can pick totally random vectors
• Or: take a random sample of 2K vectors from your dataset and project via pᵢ = v₂ᵢ - v₂ᵢ₊₁ (sketch below)
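A minimal numpy sketch of turning one layer's f-dim activations into 16-bit codes (the purely-random-hyperplane variant):

import numpy as np

BITS, F_DIM = 16, 256
rng = np.random.default_rng(42)
planes = rng.standard_normal((BITS, F_DIM))   # random hyperplanes (or paired-sample diffs)

def lsh_code(vec):
    # sign-of-projection LSH: each hyperplane contributes one bit
    bits = (planes @ vec) > 0
    return int(bits.dot(1 << np.arange(BITS)))

layer_activations = rng.standard_normal((32, F_DIM))    # (positions, f) stand-in
deep_tokens = [lsh_code(v) for v in layer_activations]  # one int per position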
Deep Tokens: sample similar char-ngrams
• Trained 7-layer char-CNN ranker on 3M BestBuy ecommerce clicks (from Kaggle)
• 64-256 feature maps
• quasi-“hard” negative mining by taking docs returned by Solr but with no clicks
• Example ngrams similar at layer 3-ish or so:
• similar: “ rin”, “e ri”, “rinf”
• From: “lord of the ring”, “LOTR extended edition dvd”, “lord of the rinfs extended”
• and:
• “0 in”, “0in “, “ nch”, “inch”
• From: “70 inch lcd”, “55 nch tv”, “90in sony tv”
• and:
• “s z 8”, “ zs8 ”, “ sz8 ”, “lumix”
• From: “panasonic lumix s z 8”, “lumix zs8”, “panasonic dmc-zs8s”
• longer strings are similar at layers ~2 levels deeper:
• “10.1inches”, “lnch”, “inchplasma”, “inch”
• Still to do: full measurement of full DL ranking vs. approximate multilayer search on these
tokens, while sweeping the hyperparameter space and hashing strategies
Deep tokens: challenges
• Stability:
• Once model + LSH family is chosen, this is like “choosing an Analyzer” - changing requires
full reindex
• Hash functions which are “optimal” for one data set may be bad after indexing much more
data
• Similarity on differing scales with same semantics
• e.g. “55in” and “fifty five inch”
• (“shortcut” CNN connections needed?)
• Stop words
• want: no hash bucket (i.e. posting list) at any level to hold > 10% of the corpus
• Noisy tokens at earlier levels (maybe never “index” the first 3 layers?)
• More generally
• precision vs. recall tradeoff tuning
Related work: Xu, et al, CNNs for Text Hashing (IJCAI ’15)
and many more (but none with as fun an acronym)
Deep Tokens: TL;DR
• Configure a model w/ a deep char-CNN-based ranker + a search-relevance loss
• Train it as usual
• Configure a convolutional feature extractor (CFE)
• From documents:
• Extract convolutional activations
• (learned textual features!)
• LSH -> discrete buckets (“abstract tokens”)
• Index these tokens
• At query time, use this CFE for:
• posting-list-friendly, deeply fuzzy search! (tokenizer sketch below)
• (because really, you just have a very fancy tokenizer)
• N.B. char-CNN models are small (O(100-300k) params)
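A sketch of what that "very fancy tokenizer" might emit; the layer-prefixed term scheme is an assumption, not the talk's exact format:

import numpy as np

def deep_tokenize(char_ids, extractor, hashers):
    # char_ids: (1, MAX_LEN) array; extractor: the Keras CFE above;
    # hashers[i]: the LSH function for tapped layer i
    per_layer = extractor.predict(char_ids)   # one (1, positions, f) array per layer
    tokens = []
    for layer_idx, (acts, code) in enumerate(zip(per_layer, hashers)):
        tokens += ['L%d_%d' % (layer_idx, code(v)) for v in acts[0]]
    return tokens   # e.g. ['L3_48213', 'L3_9917', ...]; index and search as usual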
Thank you!
Jake Mannix
Chief Data Engineer, Lucidworks
@pbrane
#Activate18 #ActivateSearch
References:
• Coming soon
