Deep Learning for Personalized Search and Recommender Systems

Slide deck presented as a tutorial at KDD 2017.
https://engineering.linkedin.com/data/publications/kdd-2017/deep-learning-tutorial

  1. 1. Deep Learning for Personalized Search and Recommender Systems Ganesh Venkataraman Airbnb Nadia Fawaz, Saurabh Kataria, Benjamin Le, Liang Zhang LinkedIn 1
  2. 2. Tutorial Outline • Part I (45min) Deep Learning Key concepts • Part II (45min) Deep learning for Search and Recommendations at Scale • Coffee break (30 min) • Deep Learning Case Studies • Part III (45min) Jobs You May Be Interested In (JYMBII) at LinkedIn • Part IV (45min) Job Search at LinkedIn Q&A at the end of each part 2
  3. 3. Motivation – Why Recommender Systems? • Recommendation systems are everywhere. Some examples of impact: • “Netflix values recommendations at half a billion dollars to the company” [netflix recsys] • “LinkedIn job matching algorithms improve performance by 50%” [San Jose Mercury News] • “Instagram switches to using algorithmic feed” [Instagram blog] 3
  4. 4. Motivation – Why Search? 4 PERSONALIZED SEARCH 4 Query = “things to do in halifax” Search view – this is a classic IR problem Recommendations view – For this query, what are the recommended results?
  5. 5. Why Deep Learning? Why now? • Many of the fundamental algorithmic techniques have existed since the 80s or before • 2.5 Exabytes of data produced per day, or 530,000,000 songs, or 150,000,000 iPhones 5
  6. 6. Why Deep Learning? Image classification eCommerce fraud Search Recommendations NLP Deep learning is eating the world 6
  7. 7. Why Deep Learning and Recommender Systems? • Features • Semantic understanding of words/sentences possible with embeddings • Better classification of images (identifying cats in YouTube videos) • Modeling • Can we cast matching problems into a deep (and possibly wide) net and learn a family of functions? 7
  8. 8. Part I – Representation Learning and Deep Learning: Key Concepts 8
  9. 9. Deep Learning and AI http://www.deeplearningbook.org/contents/intro.html 9
  10. 10. Part I Outline • Shallow Models for Embedding Learning • Word2Vec • Deep Architectures • FF, CNN, RNN • Training Deep Neural Networks • SGD, Backpropagation, Learning Rate Schedule, Regularization, Pre-Training 10
  11. 11. Learning Embeddings 11
  12. 12. Representation learning for automated feature generation • Natural Language Processing • Word embedding: word2vec, GloVe • Sequence modeling using RNNs and LSTMs • Graph Inputs • DeepWalk • Multiple hierarchies of features at varying granularities of semantic meaning with deep networks 12
  13. 13. Example Application of Representation Learning - Understanding Text • One of the keys to any content based recommender system is understanding text • What does “understanding” mean? • How similar/dissimilar are any two words? • What does the word represent? (Named Entity Recognition) • “Abraham Lincoln, the 16th President ...” • “My cousin drives a Lincoln” 13
  14. 14. How to represent a word? • Vocabulary – run, jog, math • Simple representation: • [1, 0, 0], [0, 1, 0], [0, 0, 1] • No representation of meaning • Cooccurrence in a word/document matrix 14
  15. 15. How to represent a word? • Trouble with cooccurrence matrix • Large dimension, lots of memory • Dimensionality reduction using SVD • High computational time nxm matrix => O(mn^2) • Adding new word => redo everything 15
  16. 16. Word embeddings taking context • Key Conjecture • Context matters. • Words that convey a certain context occur together • “Abraham Lincoln was the 16th President of the United States” • Bigram model • P (“Lincoln”|”Abraham”) • Skip Gram Model • Consider all words within context and ignore position • P(Context|Word) 16
  17. 17. Word2vec 17
  18. 18. Word2Vec: Skip Gram Model • Basic notations: • w represents a word, C(w) represents all the context around a word • θ represents the parameter space • D represents all the (w, c) pairs • p(c|w; θ) represents the probability of context c given word w, parametrized by θ • The probability of all the context appearing given a word is: ∏_{c∈C(w)} p(c|w; θ) • The training objective then becomes: argmax_θ ∏_{(w,c)∈D} p(c|w; θ) 18
  19. 19. Word2vec details • Let v_w and v_c represent the current word and context. Note that v_c and v_w are parameters we want to learn • p(c|w; θ) = exp(v_c · v_w) / Σ_{c'∈C} exp(v_{c'} · v_w) • C represents the set of all available contexts 19
  20. 20. Negative Sampling – basic intuition • p(c|w; θ) = exp(v_c · v_w) / Σ_{c'∈C} exp(v_{c'} · v_w): the denominator over all contexts is expensive to compute • Sample from the unigram distribution instead of taking all contexts into account • Word2vec itself is a shallow model and can be used to initialize a deep model 20
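As an illustration of the skip-gram objective and the negative-sampling trick above, here is a minimal NumPy sketch of one training update; the vocabulary size, embedding dimension, learning rate, and the uniform stand-in for the unigram distribution are toy assumptions, not the tutorial's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_word, W_ctx, w, c, neg_ids, lr=0.025):
    """One skip-gram negative-sampling update for a (word w, context c) pair.

    W_word, W_ctx: (vocab_size, dim) embedding matrices (the v_w and v_c parameters).
    neg_ids: indices of k negative contexts sampled from the unigram distribution.
    """
    v_w, v_c = W_word[w], W_ctx[c]
    # positive pair: push sigmoid(v_c . v_w) toward 1
    g_pos = sigmoid(v_c @ v_w) - 1.0
    grad_w = g_pos * v_c
    W_ctx[c] -= lr * g_pos * v_w
    # negative samples: push sigmoid(v_n . v_w) toward 0
    for n in neg_ids:
        v_n = W_ctx[n]
        g_neg = sigmoid(v_n @ v_w)
        grad_w += g_neg * v_n
        W_ctx[n] -= lr * g_neg * v_w
    W_word[w] -= lr * grad_w

# toy usage: vocabulary of 10 words, 8-dimensional embeddings, 5 negatives per pair
vocab_size, dim, k = 10, 8, 5
W_word = 0.1 * rng.standard_normal((vocab_size, dim))
W_ctx = 0.1 * rng.standard_normal((vocab_size, dim))
unigram = np.full(vocab_size, 1.0 / vocab_size)   # stand-in for real corpus unigram frequencies
neg = rng.choice(vocab_size, size=k, p=unigram)
sgns_step(W_word, W_ctx, w=3, c=7, neg_ids=neg)
```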
  21. 21. Deep Architectures FF, CNN, RNN 21
  22. 22. Neuron: Computational Unit • Input vector: x = [x1, x2 ,…,xn] • Neuron • Weight vector: W • Bias: b • Activation function: f • Output a = f(WT x + b) x1 x2 x3 x4 W b f a = f(WTx + b) Input x Neuron Output a 22
  23. 23. Activation Functions • Tanh: ℝ → (-1,1), tanh(x) = (e^x − e^-x) / (e^x + e^-x) • Sigmoid: ℝ → (0,1), σ(x) = 1 / (1 + e^-x) • ReLU: ℝ → [0, +∞), f(x) = max(0, x) = x⁺ http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/ 23
  24. 24. Layer • Layer l: nl neurons • weight matrix: W = [W1,…, Wnl] • bias vector: b = [b1,…, bnl] • activation function: f • output vector • a = f(WT x + b) x1 x2 x3 x4 W1 b1 f a1 = f(W1 T x + b1) W2 b2 f a2= f(W2 T x + b2) Input x Layer Output a W3 b3 f a3= f(W3 T x + b3) 24
  25. 25. Layer: Matrix Notation • Layer l: nl neurons • weight matrix: W • bias vector: b • activation function: f • output vector • a = f(WT x + b) • more compact notation • fast-linear algebra routines for quick computations in network x1 x2 x3 x4 Input x Layer Output a a= f(WT a + b) W , b , f 25
  26. 26. Feed Forward Network • Depth L layers • Activation at layer l+1 a(l+1) = f(W(l)T a(l) + b(l) ) • Output: prediction in supervised learning • goal: approximate y = F(x) x1 x2 x3 x4 Input Layer 1 Hidden Layer 3 a(3) Hidden Layer 2 W(1) , b(1) , f(1) W(2) , b(2) , f(2) a(2) Depth L = 4 a(L) W(3) , b(3) , f(3) 26Output Layer 4: Prediction layer
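A small NumPy sketch of the forward pass a(l+1) = f(W(l)ᵀ a(l) + b(l)) described above, using the ReLU and sigmoid activations from the earlier slide; the layer sizes are arbitrary toy choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    """Forward pass a(l+1) = f(W(l).T @ a(l) + b(l)) through a feed-forward network.

    params: list of (W, b, f) per layer, with W of shape (n_in, n_out).
    Returns the activations of every layer (useful later for backpropagation).
    """
    activations = [x]
    for W, b, f in params:
        activations.append(f(W.T @ activations[-1] + b))
    return activations

# toy network: 4 inputs -> 5 hidden (ReLU) -> 3 hidden (ReLU) -> 1 output (sigmoid)
rng = np.random.default_rng(1)
sizes = [4, 5, 3, 1]
funcs = [relu, relu, sigmoid]
params = [(0.1 * rng.standard_normal((n_in, n_out)), np.zeros(n_out), f)
          for n_in, n_out, f in zip(sizes[:-1], sizes[1:], funcs)]
x = rng.standard_normal(4)
prediction = forward(x, params)[-1]   # output of the prediction layer
```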
  27. 27. Why CNN: Convolutional Neural Networks? • Large size grid structured data • 1D: time series • 2D: image • Convolution to extract features from image (e.g. edges, texture) • Local connectivity • Parameter sharing • Equivariance to translation: small translations in input do not affect output
  28. 28. Convolution example https://docs.gimp.org/en/plug-in-convmatrix.html Edge detect kernel Sharpen kernel
  29. 29. 2D convolution http://ufldl.stanford.edu/tutorial/supervised/FeatureExtractionUsingConvolution/ 2D kernel (3x3) W1 W2 W3 W4 input matrix Kernel matrix (2x2) 29
  30. 30. • Fully connected • hidden unit connected to all input units • computationally expensive • Large image NxN pixels and Hidden layer K features • Number of parameters: ~KN² • Locally connected • hidden unit connected to some contiguous input units • no parameter sharing • Convolution • locally connected • kernel: parameter sharing • 1D Kernel vector [W1, W2] • 1D Toeplitz weight matrix W • Scaling to large input, images • Equivariance to translation (figure: fully connected weight matrix vs. locally connected weights vs. the Toeplitz-structured convolution matrix built from the kernel [W1, W2]) 30
  31. 31. Pooling • Summary statistics • Aggregate over region • Reduce size • Less overfitting • Translation invariance • Max, mean http://ufldl.stanford.edu/tutorial/supervised/Pooling/ 31
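To make the convolution and pooling slides concrete, here is a minimal NumPy sketch of a "valid" 2D convolution (implemented as cross-correlation, as most CNN libraries do) followed by non-overlapping max pooling; the image size, kernel values, and pooling window are toy assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D convolution: slide the shared kernel over the image and take
    elementwise products + sum (local connectivity + parameter sharing)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.empty((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: summarize each size x size region by its maximum."""
    H, W = feature_map.shape
    H, W = H - H % size, W - W % size                 # drop any ragged border
    pooled = feature_map[:H, :W].reshape(H // size, size, W // size, size)
    return pooled.max(axis=(1, 3))

# toy example: 6x6 "image", 3x3 edge-detect-style kernel, then 2x2 max pooling
rng = np.random.default_rng(2)
image = rng.random((6, 6))
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)
feature_map = conv2d_valid(image, edge_kernel)   # shape (4, 4)
pooled = max_pool(feature_map, size=2)           # shape (2, 2): smaller, more translation invariant
```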
  32. 32. CNN: Convolutional Neural Network Combination • Convolutional layers • Pooling layers • Fully connected layers http://colah.github.io/posts/2014-07-Conv-Nets-Modular/ 32 [LeCun et al., 1998]
  33. 33. CNN example for image recognition: ImageNet [Krizhevsky et al., 2012] Pictures courtesy of [Krizhevsky et al., 2012], http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf 33 1st GPU 2nd GPU filters learned by first CNN layer
  34. 34. Why RNN: Recurrent Neural Network? • Sequential data processing • ex: predict next word in sentence: “I was born in France. I can speak…” • RNN • Persist information through feedback loop • loop passes information from one step to the next • Parameter sharing across time indexes • output unit depends on previous output units through same update rule. xt ht ht-1
  35. 35. Unfolded RNN • Copies of NN passing feedback to one another http://colah.github.io/posts/2015-08-Understanding-LSTMs/ 35
  36. 36. LSTM: Long Short Term Memory [Hochreiter et al., 1997] • Avoid vanishing or exploding gradient • Cell state updates regulated by gates • Forget: how much info from cell state to let through • Input: which cell state components to update • Tanh: values to add to cell state • Output: select component values to output picture courtesy of http://colah.github.io/posts/2015-08-Understanding-LSTMs/ Cell state • Long term dependencies • large gap between relevant information and where it is needed • Cell state: long-term memory • Can remember relevant information over long period of time 36
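A minimal NumPy sketch of a single LSTM step with the forget, input, tanh (candidate), and output gates listed above; the gate parametrization follows the standard cell, and the dimensions and initialization are toy assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: gates regulate how the cell state c is updated and exposed.

    params holds weight matrices W_* (input->gate), U_* (hidden->gate) and biases b_*
    for the forget (f), input (i), candidate (g) and output (o) gates.
    """
    Wf, Uf, bf, Wi, Ui, bi, Wg, Ug, bg, Wo, Uo, bo = params
    f = sigmoid(Wf @ x_t + Uf @ h_prev + bf)   # forget gate: how much old cell state to keep
    i = sigmoid(Wi @ x_t + Ui @ h_prev + bi)   # input gate: which components to update
    g = np.tanh(Wg @ x_t + Ug @ h_prev + bg)   # candidate values to add to the cell state
    o = sigmoid(Wo @ x_t + Uo @ h_prev + bo)   # output gate: what to expose as hidden state
    c_t = f * c_prev + i * g                   # long-term memory update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# toy usage: input dim 3, hidden dim 4, unfolded over 5 time steps with shared parameters
rng = np.random.default_rng(3)
d_in, d_h = 3, 4
params = []
for _ in range(4):  # one (W, U, b) triple per gate
    params += [0.1 * rng.standard_normal((d_h, d_in)),
               0.1 * rng.standard_normal((d_h, d_h)),
               np.zeros(d_h)]
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((5, d_in)):
    h, c = lstm_step(x_t, h, c, tuple(params))
```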
  37. 37. Examples of RNN application • Speech recognition [Graves et al., 2013] • Language modeling [Mikolov, 2012] • Machine translation [Kalchbrenner et al., 2013] [Sutskever et al., 2014] • Image captioning [Vinyals et al., 2014] 37
  38. 38. Training a Deep Neural Network 38
  39. 39. Cost Function • m training samples (feature vector, label): (x(1), y(1)), …, (x(m), y(m)) • Per sample cost: error between label and output from prediction layer, J(W, b; x(i), y(i)) = ||a(L)(x(i)) − y(i)||² • Minimize cost function over parameters: weights W and biases b, J(W, b) = (1/m) Σ_{i=1..m} J(W, b; x(i), y(i)) [average error] + (λ/2) Σ_{l=1..L} ||W(l)||²_F [regularization] 39
  40. 40. Gradient Descent • Random parameter initialization: symmetry breaking • Gradient descent step: update for every parameter Wij(l) and bi(l), θ = θ − α ∇_θ E[J(θ)] • Gradient computed by backpropagation • High cost of backpropagation over full training set 40
  41. 41. Stochastic Gradient Descent (SGD) • SGD: follow the negative gradient after • a single sample: θ = θ − α ∇_θ J(θ; x(i), y(i)) • a few samples: mini-batch (e.g. 256) • Epoch: full pass through the training set • Randomly shuffle the data prior to each training epoch 41
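A minimal sketch of the mini-batch SGD loop from the last two slides, assuming a generic grad(theta, X, y) callback and adding the L2 (weight decay) term from the cost-function slide; the linear-regression example at the bottom is only for illustration.

```python
import numpy as np

def sgd(theta, grad, X, y, lr=0.01, batch_size=256, epochs=5, weight_decay=0.0):
    """Mini-batch SGD: shuffle the data each epoch, then follow the negative gradient
    of the (optionally L2-regularized) cost on each mini-batch.

    grad(theta, X_batch, y_batch) is assumed to return dJ/dtheta averaged over the batch.
    """
    m = X.shape[0]
    rng = np.random.default_rng(0)
    for epoch in range(epochs):
        order = rng.permutation(m)                  # random shuffle prior to each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            g = grad(theta, X[idx], y[idx])
            theta = theta - lr * (g + weight_decay * theta)   # L2 term from the cost function
    return theta

# toy usage: linear regression on synthetic data, squared-error gradient
rng = np.random.default_rng(4)
X = rng.standard_normal((1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.standard_normal(1000)

def squared_error_grad(theta, Xb, yb):
    return 2.0 * Xb.T @ (Xb @ theta - yb) / len(yb)

theta_hat = sgd(np.zeros(3), squared_error_grad, X, y, lr=0.05, epochs=20)
```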
  42. 42. Backpropagation [Rumelhart et al., 1986] Goal: compute the gradient of the cost with respect to every parameter efficiently. Recursively apply the chain rule for the derivative of a composition of functions: let y = g(x) and z = f(y) = f(g(x)), then dz/dx = (dz/dy)(dy/dx) = f′(g(x)) g′(x). Backpropagation steps: 1. Feedforward pass: compute all activations 2. Output error: measures each node's contribution to the output error 3. Backpropagate the error through all layers 4. Compute partial derivatives 42
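A NumPy sketch of the four backpropagation steps for a tiny one-hidden-layer network (tanh hidden layer, linear output, squared-error cost), plus a finite-difference check of one gradient entry; the architecture and cost are illustrative assumptions, not the networks used later in the tutorial.

```python
import numpy as np

def backprop_two_layer(x, y, W1, b1, W2, b2):
    """Backpropagation for a tiny network: tanh hidden layer, linear output,
    squared-error cost J = 0.5 * ||a2 - y||^2. Returns gradients for all parameters."""
    # 1. feedforward pass: compute all activations
    z1 = W1.T @ x + b1
    a1 = np.tanh(z1)
    a2 = W2.T @ a1 + b2                        # linear output layer
    # 2. output error
    delta2 = a2 - y                            # dJ/dz2
    # 3. backpropagate the error through the hidden layer (chain rule)
    delta1 = (W2 @ delta2) * (1.0 - a1 ** 2)   # tanh'(z1) = 1 - tanh(z1)^2
    # 4. partial derivatives
    return {"W2": np.outer(a1, delta2), "b2": delta2,
            "W1": np.outer(x, delta1), "b1": delta1}

# quick numerical sanity check of one gradient entry (finite differences)
rng = np.random.default_rng(5)
x, y = rng.standard_normal(4), rng.standard_normal(2)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((3, 2)), rng.standard_normal(2)

def cost(W1_):
    a1 = np.tanh(W1_.T @ x + b1)
    return 0.5 * np.sum((W2.T @ a1 + b2 - y) ** 2)

eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
numeric = (cost(W1p) - cost(W1)) / eps
analytic = backprop_two_layer(x, y, W1, b1, W2, b2)["W1"][0, 0]
assert abs(numeric - analytic) < 1e-4
```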
  43. 43. Training optimization • Learning Rate Schedule • Changing learning rate as learning progresses • Pre-training • Goal: training simple model on simple task before training desired model to perform desired task • Greedy supervised pre-training: pre-train for task on subset of layers as initialization for final network • Regularization to curb overfitting • Goal: reduce generalization error • Penalize parameter norm: L2, L1 • Augment dataset: train on more data • Early stopping: return parameter set at point in time with lowest validation error • Dropout [Srivastava, 2013]: train ensemble of all subnetworks formed by removing non-output units • Gradient clipping to avoid exploding gradient • norm clipping • element-wise clipping 43
  44. 44. Part II – Deep Learning for Personalized Recommender Systems at Scale 44
  45. 45. Examples of Personalized Recommender Systems 45
  46. 46. Examples of Personalized Recommender Systems Job Search 46
  47. 47. Examples of Personalized Recommender Systems 47
  48. 48. Personalized Recommender Systems (diagram: User i with <user features, query (optional)> (e.g., industry, behavioral features, demographic features, …) visits; the algorithm selects item j from a set of candidates; (i, j): response y_ij (action or not, e.g. click, like, share, apply…)) Which item(s) should we recommend to the user? • The item(s) with the best expected utility • Utility examples: • CTR, Revenue, Job Apply rates, Ads conversion rates, … • Can be a combination of the above for trade-offs 48
  49. 49. An Example Architecture of Personalized Recommender Systems 49
  50. 50. User Interaction Logs Offline Modeling Workflow + User / Item derived features User User Feature Store Item Store + Features Recommendation Ranking Ranking Model Store Additional Re- ranking Steps 1 2 4 5 Offline System Online System 3 An example of Recommender System Architecture Item derived features 50
  51. 51. User Interaction Logs Offline Modeling Workflow + User / Item derived features User Search-based Candidate Selection & Retrieval Query Construction User Feature Store Search Index of Items Recommendation Ranking Ranking Model Store Additional Re- ranking Steps 1 2 3 4 5 6 7 Offline System Online System Item derived features An example of Personalized Search System Architecture 51
  52. 52. Key Components – Offline Modeling • Train the model offline (e.g. Hadoop) • Push model to online ranking model store • Pre-generate user / item derived features for online systems to consume • E.g. user / item embeddings from word2vec / DNNs based on the raw features 52
  53. 53. Key Components – Candidate Selection • Personalized Search (With user query): • Form a query to the index based on user query annotation [Arya et al., 2016] • Example: Panda Express Sunnyvale → +restaurant:panda express +location:sunnyvale • Recommender system (Optional): • Can help dramatically reduce the number of items to score in ranking steps [Cheng, et al., 2016, Borisyuk et al. 2016] • Form a query based on the user features • Goal: Fetch only the items with at least some match with the user's features • Example: a user with title software engineer -> +title:software engineer for job recommendations 53
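A small sketch of the candidate-selection query construction described above; the index field names (restaurant, location, title) and the dictionary-shaped inputs are hypothetical, chosen to mirror the slide's examples rather than any production schema.

```python
def build_retrieval_query(user, tagged_query=None):
    """Sketch of candidate-selection query construction (field names are hypothetical).

    Personalized search: use the tagged user query, e.g. "Panda Express Sunnyvale" ->
    +restaurant:panda express +location:sunnyvale.
    Recommendations (no query): fall back to user-profile features so the index only
    returns items with at least some match, e.g. +title:software engineer.
    """
    clauses = []
    if tagged_query:                                   # personalized search path
        for field, value in tagged_query.items():
            clauses.append(f"+{field}:{value}")
    else:                                              # recommender path: query from user features
        if user.get("title"):
            clauses.append(f"+title:{user['title']}")
        if user.get("location"):
            clauses.append(f"location:{user['location']}")   # optional clause, not required
    return " ".join(clauses)

# examples mirroring the slide
print(build_retrieval_query({}, {"restaurant": "panda express", "location": "sunnyvale"}))
print(build_retrieval_query({"title": "software engineer", "location": "sunnyvale"}))
```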
  54. 54. Key Components - Ranking • Recommendation Ranking • The main ML model that ranks items retrieved by candidate selection based on the expected utility • Additional Re-ranking Steps • Often for user experience optimization related to business rules, e.g. • Diversification of the ranking results • Recency boost • Impression discounting • … 54
  55. 55. Integration of Deep Learning Models into Personalized Recommender Systems at Scale 55
  56. 56. Literature: Deep Learning for Recommendation Systems • RBM for Collaborative Filtering [Salakhutdinov et al., 2007] • Deep Belief Networks [Hinton et al., 2006] • Neural Autoregressive Distribution Estimator (NADE) [Zheng, 2016] • Neural Collaborative Filtering [He, et al., 2017] • Siamese networks for user item matching [Huang et al., 2013] • Deep Belief Networks with Pre-training [Hinton et al., 2006] • Collaborative Deep Learning [Wang et al., 2015] 56
  57. 57. User Interaction Logs Offline Modeling Workflow + User / Item derived features User Search-based Candidate Selection & Retrieval Query Construction User Feature Store Search Index of Items Recommendation Ranking Ranking Model Store Additional Re- ranking Steps 1 2 3 4 5 6 7 Offline System Online System Item derived features 57
  58. 58. Offline Modeling + User / Item Embeddings User Features Item Features User Embedding Vector Item Embedding Vector Sim(U,I) User Feature Store Item Store / Index with Features 58
  59. 59. Query Formulation & Candidate Selection • Issues of using raw text: Noisy or incorrect query tagging due to • Failure to capture semantic meaning • Ex. Query: Apple watch -> +food:apple +product:watch or +product:apple watch? • Multilingual text • Query: 熊猫快餐 -> +restaurant:panda express • Cross-domain understanding • People search vs job search 59
  60. 60. Query Formulation & Candidate Selection • Represent Query as an embedding • Expand query to similar queries in a semantic space • KNN search in dense feature space with Inverted Index [Cheng, et al., 2016] Q = “Apple Watch” D = “iphone” D = “Orange Swatch” D = “ipad” 60
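A brute-force sketch of the "expand the query in a semantic space" idea above: embed the query, then take its cosine-similarity nearest neighbors among item embeddings. The embeddings and document names are made up; as the slide notes, a production system would pair KNN in the dense space with an inverted index or approximate nearest-neighbor search.

```python
import numpy as np

def knn_in_embedding_space(query_vec, item_matrix, item_ids, k=3):
    """Exact cosine-similarity KNN over item embeddings (brute force, for illustration)."""
    q = query_vec / np.linalg.norm(query_vec)
    items = item_matrix / np.linalg.norm(item_matrix, axis=1, keepdims=True)
    scores = items @ q
    top = np.argsort(-scores)[:k]
    return [(item_ids[i], float(scores[i])) for i in top]

# toy usage with made-up 4-d embeddings for a handful of documents
rng = np.random.default_rng(6)
docs = ["iphone", "ipad", "orange swatch", "panda express"]
doc_vecs = rng.standard_normal((len(docs), 4))
query_vec = doc_vecs[0] + 0.1 * rng.standard_normal(4)   # a query embedded close to "iphone"
print(knn_in_embedding_space(query_vec, doc_vecs, docs, k=2))
```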
  61. 61. Recommendation Ranking Models • Wide and Deep Models to capture all possible signals [Cheng, et al., 2016] https://arxiv.org/pdf/1606.07792.pdf 61
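A minimal tf.keras sketch of a wide-and-deep ranking model in the spirit of [Cheng et al., 2016]; it is not the paper's reference implementation. It assumes the wide cross features arrive already multi-hot encoded and the deep features are dense or pre-embedded; the sizes and layer widths are arbitrary.

```python
import tensorflow as tf

n_wide, n_deep = 10_000, 256   # assumed sizes: multi-hot cross features, dense/embedded features

# wide part: memorization via a linear model over sparse cross features
wide_in = tf.keras.Input(shape=(n_wide,), name="wide_cross_features")

# deep part: generalization via a feed-forward tower over dense / embedded features
deep_in = tf.keras.Input(shape=(n_deep,), name="deep_dense_features")
h = tf.keras.layers.Dense(128, activation="relu")(deep_in)
h = tf.keras.layers.Dense(64, activation="relu")(h)

# joint logistic output over the concatenated wide and deep parts
merged = tf.keras.layers.concatenate([wide_in, h])
out = tf.keras.layers.Dense(1, activation="sigmoid", name="p_response")(merged)

model = tf.keras.Model(inputs=[wide_in, deep_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
# model.fit([X_wide, X_deep], y, ...)   # X_wide / X_deep / y are placeholders for real data
```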
  62. 62. Challenges & Open Problems for Deep Learning in Recommender Systems • Distributed training on very large data • TensorFlow on Spark (https://github.com/yahoo/TensorFlowOnSpark) • CNTK (https://github.com/Microsoft/CNTK) • MXNet (http://mxnet.io/) • Caffe (http://caffe.berkeleyvision.org/) • … • Latency Issues from Online Scoring • Pre-generation of user / item embeddings • Multi-layer scoring (simple models => complex) • Batch vs online training 62
  63. 63. Part III – Case Study: Jobs You May Be Interested In (JYMBII) 63
  64. 64. Outline • Introduction • Generating Embeddings via Word2vec • Generating Embeddings via Deep Networks • Tree Feature Transforms in Deep + Wide Framework 64
  65. 65. Introduction: JYMBII 65
  66. 66. Introduction: Problem Formulation • Rank jobs by P(User u applies to Job j | u, j) • Model the response given: career history, skills, education, connections (user side) and job title, description, location, company (job side) 66
  67. 67. Introduction: JYMBII Modeling- Generalization Recommend • Model should learn general rules to predict which jobs to recommend to a member. • Learn generalizations based on similarity in title, skill, location, etc between profile and job posting 67
  68. 68. Introduction: JYMBII Modeling - Memorization Applies to 68 • Model should memorize exceptions to the rules • Learn exceptions based on frequent co- occurrence of features
  69. 69. Introduction: Baseline Features • Dense BoW Similarity Features for Generalization • i.e: Similarity in title text good predictor of response • Sparse Two-Depth Cross Features for Memorization • i.e: Memorize that computer science students will transition to entry engineering roles Vector BoW Similarity Feature Sim(User Title BoW, Job Title BoW) Sparse Cross Feature AND(user = Comp Sci. Student, job = Software Engineer) Sparse Cross Feature AND(user = In Silicon Valley, job = In Austin, TX) Sparse Cross Feature AND(user = ML Engineer, job = UX Designer) 69
  70. 70. Introduction: Issues • BoW Features don’t capture semantic similarity between user/job • Cosine Similarity between Application Developer and Software Engineer is 0 • Generating three-depth, four-depth cross features won’t scale • i.e. Memorizing that Factory Workers from Detroit are applying to Fracking jobs in Pennsylvania • Hand-engineered features time consuming and will have low coverage • Permutations of three-depth, four-depth cross features grows exponentially 70
  71. 71. Introduction: Deep + Wide for JYMBII • BoW Features don’t capture semantic similarity between user/job • Generate embeddings to capture Generalization through semantic similarity • Deep + Wide model for JYMBII [Cheng et al., 2016] Semantic Similarity Feature Sim(User Embedding, Job Embedding) Global Model Cross Feature AND(user = Comp Sci. Student, job = Software Engineer) User Model Cross Feature AND(user = User 2, job = Job Latent Feature 1 ) Job Model Cross Feature AND(user = User Latent Feature, job = Job 1) 71 Sparse Cross Feature AND(user = Comp Sci. Student, job = Software Engineer) Sparse Cross Feature AND(user = In Silicon Valley, job = In Austin, TX) Sparse Cross Feature AND(user = ML Engineer, job = UX Designer) Vector BoW Similarity Feature Sim(User Title BoW, Job Title BoW)
  72. 72. Generating Embeddings via Word2vec: Training Word Vectors • Key Ideas • Same users (context) apply to similar jobs (target) • Similar users (target) will apply to the same jobs (context) Application Developer => Software Engineer • Train word vectors via word2vec skip-gram architecture • Concatenate user’s current title and the applied job’s title as input User Title Applied Job Title 72
  73. 73. Generating Embeddings via Word2vec: Model Structure Application, Developer Software, EngineerTokenized Titles Word Embedding Lookup Pre-trained Word Vectors Entity Embeddings Via Average Pooling Word Vectors Response Prediction (Logistic Regression) Cosine Similarity User Job 73
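A NumPy sketch of the model structure above: look up pre-trained word vectors for the tokenized titles, average-pool them into user and job entity embeddings, and feed their cosine similarity to the response model. The vocabulary and random "pre-trained" vectors here are stand-ins for the word2vec vectors trained on concatenated user-title / applied-job-title sequences.

```python
import numpy as np

def title_embedding(title_tokens, word_vectors, dim=100):
    """Average-pool pre-trained word vectors into a single entity embedding.
    Unknown tokens are skipped; an all-zero vector is returned if none are known."""
    vecs = [word_vectors[t] for t in title_tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v, eps=1e-8):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

# toy "pre-trained" vectors standing in for the word2vec skip-gram vectors
rng = np.random.default_rng(7)
vocab = ["application", "developer", "software", "engineer"]
word_vectors = {w: rng.standard_normal(100) for w in vocab}

user_emb = title_embedding(["application", "developer"], word_vectors)
job_emb = title_embedding(["software", "engineer"], word_vectors)
semantic_similarity_feature = cosine(user_emb, job_emb)   # fed into the response prediction model
```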
  74. 74. Generating Embeddings via Word2vec: Results and Next Steps • Receiver Operating Characteristic – Area Under Curve for evaluation • Response prediction is binary classification: Apply or don’t Apply • Highly skewed data: Low CTR for Apply Action • Good metric for ranking quality: Focus on discriminatory ability of model • Marginal 0.87% ROC AUC Gain • How to improve quality of embeddings? • Optimize embeddings for prediction task with supervised training • Leverage richer context about user and job 74
  75. 75. Generating Embeddings via Deep Networks: Model Structure User Job Response Prediction (Logistic Regression) Sparse Features (Title, Skill, Company) Embedding Layer Hidden Layer Entity Embedding Hadamard Product (Elementwise Product) 75
  76. 76. Generating Embeddings via Deep Networks: Hyper Parameters, Lots of Knobs! • Optimizer Used • SGD w/ Momentum and exponential decay vs. Adam [Kingma et al., 2015] (Adam) • Learning Rate • swept over several orders of magnitude (10^-4 chosen) • Embedding Layer Size • 50 to 200 (100) • Dropout • 0% to 50% dropout (0% dropout) • Sharing Parameter Space for both user/job embeddings • Assumes commutative property of recommendations (a + b = b + a) (No shared parameter space) • Hidden Layer Sizes • 0 to 2 Hidden Layers (200 -> 200 Hidden Layer Size) • Activation Function • ReLU vs. Tanh (ReLU) 76
  77. 77. Generating Embeddings via Deep Networks: Training Challenges • Millions of rows of training data: impossible to store all of it in memory • Stream data incrementally directly from files into a fixed-size example pool • Add shuffling by randomly sampling from the example pool for training batches • Extreme dimensionality of the company sparse feature • Reduce dimensionality of the company feature from millions -> tens of thousands • Perform feature selection by frequency in the training set • Hyper-parameter tuning • Distribute grid search through parallel modeling in single-driver Spark jobs 77
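A sketch of the fixed-size example pool described in the first three bullets: stream parsed lines from files into a bounded pool and sample mini-batches from it at random, so the data never has to fit in memory and consecutive batches are still approximately shuffled. Pool and batch sizes are illustrative, and the parse function is supplied by the caller.

```python
import random

def shuffled_minibatches(line_iter, parse_fn, pool_size=100_000, batch_size=256, seed=0):
    """Stream examples into a fixed-size pool and yield randomly sampled mini-batches."""
    rng = random.Random(seed)
    pool = []
    for line in line_iter:
        pool.append(parse_fn(line))
        if len(pool) < pool_size:
            continue                                   # keep filling until the pool is warm
        # pop batch_size random elements from the pool as one shuffled mini-batch
        yield [pool.pop(rng.randrange(len(pool))) for _ in range(batch_size)]
    # drain whatever is left at the end of the stream
    rng.shuffle(pool)
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]

# usage sketch (file name and parser are placeholders):
# batches = shuffled_minibatches(open("train.tsv"), parse_fn=str.split)
```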
  78. 78. Generating Embeddings via Deep Networks: Results • Baseline Model: ROC AUC 0.753 • Deep + Wide Model: ROC AUC 0.790 (+4.91%***) *** For reference, a previous major JYMBII modeling improvement with a 20% lift in ROC AUC resulted in a 30% lift in Job Applications 78
  79. 79. Response Prediction (Logistic Regression) The Current Deep + Wide Model Deep Embedding Features (Feed Forward NN) • Generating three-depth, four-depth cross features won’t scale • Smart feature selection required Wide Sparse Cross Features (Two-Depth) 79
  80. 80. Tree Feature Transforms: Feature Selection via Gradient Boosted Decision Trees Each tree outputs a path from root to leaf encoding a combination of feature crosses [He et al., 2014]. GBDTs select the most useful combinations of feature crosses for memorization. (diagram: example trees splitting on Member Seniority: Vice President, Member Industry: Banking, Member Location: Silicon Valley, Member Skill: Statistics, Job Seniority: CXO, Job Title: ML Engineer) 80
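One way to reproduce this [He et al., 2014]-style transform with scikit-learn (a sketch, not necessarily the production pipeline): fit a GBDT, read off the leaf each sample reaches in every tree via apply(), and one-hot encode those leaf indices as sparse cross features for the linear layer. The dataset below is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

# synthetic data standing in for (user, job) feature vectors with apply/no-apply labels
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# each tree partitions the feature space; the leaf a sample falls into encodes
# the conjunction ("cross") of feature tests along the root-to-leaf path
gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=4, random_state=0)
gbdt.fit(X, y)

# apply() returns per-tree leaf indices; for binary classification the trailing class axis has size 1
leaf_ids = gbdt.apply(X)[:, :, 0]                 # shape (n_samples, n_trees)
encoder = OneHotEncoder(handle_unknown="ignore")
sparse_cross_features = encoder.fit_transform(leaf_ids)

# these sparse indicators are then fed, alongside the deep embedding features,
# into the logistic-regression response layer
```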
  81. 81. Response Prediction (Logistic Regression) Tree Feature Transforms: The Full Picture How to train both the NN model and GBDT model jointly with each other? Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT) 81
  82. 82. Tree Feature Transforms: Joint Training via Block-wise Cyclic Coordinate Descent • Treat NN model and GBDT model as separate block-wise coordinates • Implemented by 1. Training the NN until convergence 2. Training GBDT w/ fixed NN embeddings 3. Training the regression layer weights w/ generated cross features from GBDT 4. Training the NN until convergence w/ fixed cross features 5. Cycle step 2-4 until global convergence criteria 82
  83. 83. Response Prediction (Logistic Regression) Tree Feature Transforms: Train NN Until Convergence Initially no trees are in our forest Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT) 83
  84. 84. Response Prediction (Logistic Regression) Tree Feature Transforms: Train GBDT w/ NN Section as Initial Margin Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT) 84
  85. 85. Response Prediction (Logistic Regression) Tree Feature Transforms: Train GBDT w/ NN Section as Initial Margin Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT) 85
  86. 86. Response Prediction (Logistic Regression) Tree Feature Transforms: Train Regression Layer Weights Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT) 86
  87. 87. Response Prediction (Logistic Regression) Tree Feature Transforms: Train NN w/ GBDT Section as Initial Margin Deep Embedding Features (Feed Forward NN) Wide Sparse Cross Features (GBDT) 87
  88. 88. Tree Feature Transforms: Block-wise Coordinate Descent Results (ROC AUC) • Baseline Model: 0.753 • Deep + Wide Model: 0.790 (+4.91%) • Deep + Wide Model w/ GBDT Iteration 1: 0.792 (+5.18%) • Iteration 2: 0.794 (+5.44%) • Iteration 3: 0.795 (+5.57%) • Iteration 4: 0.796 (+5.71%) 88
  89. 89. JYMBII Deep + Wide: Future Direction • Generating Embeddings w/ LSTM Networks • Leverage sequential career history data • Promising results in NEMO: Next Career Move Prediction with Contextual Embedding [Li et al., 2017] • Semi-Supervised Training • Leverage pre-trained title, skill, and company embeddings on profile data • Replace Hadamard Product for entity embedding similarity function • Deep Crossing [Shan et al., 2016] • Add even richer context • i.e. Location, Education, and Network features 89
  90. 90. Part IV – Case Study: Deep Learning Networks for Job Search 90
  91. 91. Outline • Introduction • Representations via Word2vec • Robust Representations via DSSM 91
  92. 92. Introduction: Job Search 92
  93. 93. Introduction: Search Architecture (diagram components: User Query, Query Understanding, Top-K retrieval, Index, Indexer, Offline Training / Model, Result Ranking, Results) 93
  94. 94. Introduction: Query Understanding - Segmentation and Tagging • First divide the search query into segments • Tag query segments based on recognized entity tags Oracle Java Application Developer Oracle Java Application Developer Query Segmentations COMPANY = Oracle SKILL = Java TITLE = Application Developer COMPANY = Oracle TITLE = Java Application Developer Query Tagging 94
  95. 95. Introduction: Query Understanding – Expansion • Task of adding additional synonyms/related entities to the query to improve recall • Current Approach: Curated dictionary for common synonyms and related entities COMPANY = Oracle OR NetSuite OR Taleo OR Sun Microsystems OR … SKILL = Java OR Java EE OR J2EE OR JVM OR JRE OR JDK … TITLE = Application Developer OR Software Engineer OR Software Developer OR Programmer … Green – Synonyms Blue – Related Entities 95
  96. 96. Introduction: Query Understanding - Retrieval and Ranking COMPANY = Oracle OR NetSuite OR Taleo OR Sun Microsystems OR … SKILL = Java OR Java EE OR J2EE OR JVM OR JRE OR JDK … TITLE = Application Developer OR Software Engineer OR Software Developer OR Programmer … Title Title Skills Company 96
  97. 97. Introduction: Issues – Retrieval and Ranking • Term retrieval has limitations • Cross language retrieval • Softwareentwickler ⇔ Software developer • Word Inflections • Engineering Management ⇔ Engineering Manager • Query expansion via curated dictionary of synonyms is not scalable • Expensive to refresh and store synonyms for all possible entities • Heavy reliance on query tagging is not robust enough • Novel title, skill, and company entities will not be tagged correctly • Errors upstream propagate to poor retrieval and ranking 97
  98. 98. Introduction: Solution – Deep Learning for Query and Document Representations • Query and document representations • Map queries and document text to vectors in semantic space • Robust handling of out-of-vocabulary words • Term retrieval has limitations • Query expansion via curated dictionary of synonyms is not scalable • Map synonyms, translations and inflections to similar vectors in semantic space • Term retrieval on cluster id or KNN based retrieval • Heavy reliance on query tagging is not robust enough • Complement structured query representations with semantic representations 98
  99. 99. Representations via Word2vec: Leverage JYMBII Work • Key Ideas • Similar users (context) apply to the same job (target) • The same user (target) will apply to similar jobs (context) Application Developer => Software Engineer • Train word vectors via word2vec skip-gram architecture • Concatenate user’s current title and the applied job’s title as input User Title Applied Job Title 99
  100. 100. Representations via Word2vec: Word2vec in Ranking Application, Developer Software, EngineerTokenized Text Word Embedding Lookup Pre-trained Word Vectors Entity Embeddings Via Average Pooling Word Vectors Learning to Rank Model (NDCG Loss) Cosine Similarity JobQuery 100
  101. 101. Representations via Word2vec: Ranking Model Results • Baseline Model: NDCG@5 0.582, CTR@5 lift +0.0% • Baseline Model + Word2Vec Feature: NDCG@5 0.595 (+2.2%), CTR@5 lift +1.6% (NDCG = Normalized Discounted Cumulative Gain) 101
  102. 102. Representations via Word2vec: Optimize Embeddings for Job Search Use Case • Leverage apply and click feedback to guide learning of embeddings • Fine tune embeddings for task using supervised feedback • Handle out of vocabulary words and scale to query vocabulary size • Compared to JYMBII, query vocabulary is much larger and less well-formed • Misspellings • Word Inflections • Free text search • Need to make representations more robust for these free text queries 102
  103. 103. Robust Representations via DSSM: Deep Structured Semantic Model [Huang et al., 2013] Query Applied Job (Positive) Application Developer Software EngineerRaw Text #Ap, App, ppl… #So, Sof, oft…Tri-letter Hashing #Ha, Hai, air… Hairdresser Randomly Sampled Applied Job (Negative) Hidden Layer 3 Hidden Layer 2 Hidden Layer 1 Cosine Similarity Softmax w/ Cross Entropy Loss 103
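A NumPy sketch of the DSSM training objective shown in the figure: cosine similarity between the query tower output and the applied job (positive) plus randomly sampled jobs (negatives), pushed through a smoothed softmax with cross-entropy loss, as in [Huang et al., 2013]. The 128-dimensional vectors and the smoothing factor value are illustrative.

```python
import numpy as np

def dssm_softmax_loss(q_vec, pos_vec, neg_vecs, gamma=10.0):
    """DSSM-style loss: softmax over (smoothed) cosine similarities of the query against the
    applied job (positive) and sampled jobs (negatives), cross entropy on the positive.
    The vectors are assumed to be the towers' top-layer outputs."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    sims = np.array([cos(q_vec, pos_vec)] + [cos(q_vec, n) for n in neg_vecs])
    logits = gamma * sims                   # gamma is the smoothing factor
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                # the positive document should win the softmax

# toy usage: query, applied job, and 4 randomly sampled negative jobs
rng = np.random.default_rng(8)
q = rng.standard_normal(128)
pos = q + 0.1 * rng.standard_normal(128)    # a job semantically close to the query
negs = rng.standard_normal((4, 128))
loss = dssm_softmax_loss(q, pos, negs)
```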
  104. 104. Robust Representations via DSSM: Tri-letter Hashing • Tri-letter Hashing Example • Engineer -> #en, eng, ngi, gin, ine, nee, eer, er# • Benefits of Tri-letter Hashing • More compact Bag of Tri-letters vs. Bag of Words representation • 700K Word Vocabulary -> 75K Tri-letters • Can generalize for out of vocabulary words • Tri-letter hashing robust to minor misspellings and inflections of words • Engneer -> #en, eng, ngn, gne, nee, eer, er# 104
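The tri-letter hashing from this slide is easy to state in code; a minimal sketch:

```python
def triletter_hash(word):
    """Tri-letter (letter trigram) hashing: pad the word with '#' boundary markers
    and slide a window of size 3."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

assert triletter_hash("Engineer") == ["#en", "eng", "ngi", "gin", "ine", "nee", "eer", "er#"]
# the misspelling "Engneer" still shares most trigrams with "Engineer":
print(triletter_hash("Engneer"))   # ['#en', 'eng', 'ngn', 'gne', 'nee', 'eer', 'er#']
```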
  105. 105. Robust Representations via DSSM: Training Details 105 • Parameter Sharing Helps • Better and faster convergence • Model size is reduced • Regularization • L2 performs better than dropout • Toolkit Comparisons (CNTK vs TensorFlow) • CNTK: Faster convergence and better model quality • TensorFlow: Easy to implement and better community support. Comparative model quality Training performance with/o parameter sharing
  106. 106. Robust Representations via DSSM: Lessons in Production Environment 106 + 100% + 70% + 40% • Bottlenecks in Production Environment • Latency due to extra computation • Latency due to GC activity • Fat Jars in JVM environment • Practical Lessons • Avoid JVM Heap while serving the model • Caching most accessed entities’ embedding
  107. 107. Robust Representations via DSSM: DSSM Qualitative Results Software Engineer Data Mining LinkedIn Softwareentwickler Engineer Software Data Miner Google Software Software Engineers Machine Learning Engineer Software Engineers Software Engineer Software Engineering Microsoft Research Software Engineer Engineer Software For qualitative results, only top head queries are taken to analyze similarity to each other 107
  108. 108. Robust Representations via DSSM: DSSM Metric Results • Baseline Model: NDCG@5 0.582, CTR@5 lift +0.0% • Baseline Model + Word2Vec Feature: NDCG@5 0.595 (+2.2%), CTR@5 lift +1.6% • Baseline Model + DSSM Feature: NDCG@5 0.602 (+3.4%), CTR@5 lift +3.2% 108
  109. 109. Robust Representations via DSSM: DSSM Future Direction • Leverage Current Query Understanding Into DSSM Model • Query tag entity information for richer context embeddings • Query segmentation structure can be considered into the network design • Deep Crossing for Similarity Layer [Shan et al., 2016] • Convolutional DSSM [Shen et al., 2014] 109
  110. 110. Conclusion • Recommender Systems and personalized search are very similar problems • Deep Learning is here to stay and can have significant impact on both • Understanding and constructing queries • Ranking • Deep learning and more traditional techniques are *not* mutually exclusive (hint: Deep + Wide) 110
  111. 111. References • [Rumelhart et al., 1986] Learning representations by back-propagating errors, Nature 1986 • [Hochreiter et al., 1997] Long short-term memory, Neural Computation 1997 • [LeCun et al., 1998] Gradient-based learning applied to document recognition, Proceedings of the IEEE 1998 • [Krizhevsky et al., 2012] ImageNet classification with deep convolutional neural networks, NIPS 2012 • [Graves et al., 2013] Speech recognition with deep recurrent neural networks, ICASSP 2013 • [Mikolov, 2012] Statistical language models based on neural networks, PhD Thesis, Brno University of Technology, 2012 • [Kalchbrenner et al., 2013] Recurrent continuous translation models, EMNLP 2013 • [Srivastava, 2013] Improving neural networks with dropout, PhD Thesis, University of Toronto, 2013 • [Sutskever et al., 2014] Sequence to sequence learning with neural networks, NIPS 2014 • [Vinyals et al., 2014] Show and tell: a neural image caption generator, arXiv 2014 • [Zaremba et al., 2015] Recurrent Neural Network Regularization, ICLR 2015 111
  112. 112. References (continued) • [Arya et al., 2016] Personalized Federated Search at LinkedIn, CIKM 2016 • [Cheng et al., 2016] Wide & Deep Learning for Recommender Systems, DLRS 2016 • [He et al., 2014] Practical Lessons from Predicting Clicks on Ads at Facebook, ADKDD 2014 • [Kingma et al., 2015] Adam: A Method for Stochastic Optimization, ICLR 2015 • [Huang et al., 2013] Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, CIKM 2013 • [Li et al., 2017] NEMO: Next Career Move Prediction with Contextual Embedding, WWW 2017 • [Shan et al., 2016] Deep Crossing: Web-scale modeling without manually crafted combinatorial features, KDD 2016 • [Zhang et al., 2016] GLMix: Generalized Linear Mixed Models For Large-Scale Response Prediction, KDD 2016 • [Salakhutdinov et al., 2007] Restricted Boltzmann Machines for Collaborative Filtering, ICML 2007 • [Zheng, 2016] http://tech.hulu.com/blog/2016/08/01/cfnade.html • [Hinton et al., 2006] A fast learning algorithm for deep belief nets, Neural Computation 2006 • [Wang et al., 2015] Collaborative Deep Learning for Recommender Systems, KDD 2015 • [He et al., 2017] Neural Collaborative Filtering, WWW 2017 • [Borisyuk et al., 2016] CaSMoS: A Framework for Learning Candidate Selection Models over Structured Queries and Documents, KDD 2016 112
  113. 113. References (continued) • [netflix recsys] http://nordic.businessinsider.com/netflix-recommendation-engine-worth-1-billion-per-year- 2016-6/ • [San Jose Mercury News] http://www.mercurynews.com/2017/01/06/at-linkedin-artificial-intelligence-is- like-oxygen/ • [Instagram blog] http://blog.instagram.com/post/145322772067/160602-news 113