Taboola’s Road to Scale
The Data Perspective
	
Tal	Sliwowicz
Copyright©2016 The Nielsen Company. Confidential and Proprietary.
Who am I?
Tal Sliwowicz
Director, R&D
tal@taboola.com
You’ve Seen Us Before!
Enabling people to discover information at that moment when they’re likely to engage
Our Clients are All Around the Globe
Entertainment | Lifestyle | Tech
Taboola in Numbers
•  750M monthly unique users
•  100K+ requests/sec
•  10B+ recommendations/day
•  5TB+ daily data
•  52% mobile traffic, 48% desktop traffic

US desktop users reached, 12/2015:
REACH    PROPERTY
95.5%    Google Ad Network
87.8%    Taboola
86.2%    Google Sites
61.5%    Facebook
60.3%    Yahoo Sites
56.6%    Outbrain
The Recommendation Engine
Signals feeding the content recommendation engine:
•  Context: metadata
•  Location: region-based recommendations
•  User behavior: cookie data
•  Collaborative filtering: bucketed consumption groups
•  Social: Facebook / Twitter API
Taboola’s Discovery Platform
Traffic Acquisition
Business Dev.
Sponsored Content
Editorial
Newsroom
Sales
Native Ads
Audience Dev.
Product
Personalization
Data & Insights
Taboola 2007
•  Events and logs (raw data) written directly to the DB
•  Recs are read from the DB
•  Crashed when CNN launched
Diagram: Frontend (FE Server)
Taboola 2007.5
•  Same as before, but without direct writes to the DB
•  Switched to bulk load
•  But: very basic reporting, not scalable
Diagram: Frontend (FE Server), Bulk Load
Taboola 2008
•  Introduced semi-real-time event parsing services: Session Parser and Session Analyzer
•  Divided analysis work by unit (session)
•  Files were pushed from the RecServer(s) to backend processing
•  Files were gzipped textual INSERT statements
•  But: not real-time enough
Diagram: Frontend (FE Server), NFS, Backend (SessionParser, SessionAnalyzer). Flows: write rawdata, read rawdata, write session files, read session files, write summarized data.
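The file hand-off on this slide, raw events serialized as gzipped textual INSERT statements and picked up by a backend process, might look roughly like the sketch below. The table name `rawdata`, the column names, and the file name are assumptions made for illustration; the slides do not show the real schema or file layout.

```python
import gzip
import os
import tempfile


def sql_literal(value):
    """Render a Python value as a SQL literal (strings quoted, quotes doubled)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def write_insert_file(path, table, rows):
    """Front-end side: dump events as gzipped textual INSERT statements."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for row in rows:
            cols = ", ".join(row)
            vals = ", ".join(sql_literal(v) for v in row.values())
            fh.write(f"INSERT INTO {table} ({cols}) VALUES ({vals});\n")


def read_insert_file(path):
    """Backend side: read the statements back straight from the gzip stream."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]


# Hypothetical events; real field names are not given in the slides.
events = [
    {"session_id": "s1", "view_id": 1, "event": "click"},
    {"session_id": "s1", "view_id": 2, "event": "it's read"},
]
path = os.path.join(tempfile.mkdtemp(), "rawdata-0001.sql.gz")
write_insert_file(path, "rawdata", events)
statements = read_insert_file(path)
```

Text INSERT files are easy to inspect and replay, but every statement has to be parsed again downstream, which is consistent with the slide's remark that this stage was not real-time enough.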
Taboola 2010
•  Made a leap towards real-time stream processing
•  Unified the Session Parser and Session Analyzer into a single in-memory service (without going through disk)
•  Dramatically optimized memory allocation and data models
•  Failure-safe architecture: can endure data delays and front-end server malfunctions
•  No direct DB access, which is key for performance; hourly data is loaded only via bulk loading
Diagram: Frontend (FE Server), NFS, Backend (Session Parser + Analyzer). Flows: write rawdata, read rawdata, write hourly data (bulk loading).
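As a rough illustration of merging parsing and analysis into one in-memory pass, the sketch below parses each raw line and updates hourly aggregates immediately, with no intermediate session files. The `ts,session_id,type` line format, the `SessionService` class, and the shape of the hourly rows are all assumptions; the slides do not describe the real service's interfaces.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def hour_bucket(ts):
    """Truncate a Unix timestamp to its UTC hour, as a printable label."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:00")


class SessionService:
    """Parses and analyzes events in memory; nothing is written between steps."""

    def __init__(self):
        self.sessions = defaultdict(list)  # session_id -> parsed events
        self.hourly = Counter()            # (hour, event type) -> count

    def ingest(self, line):
        # Parsing step: hypothetical "ts,session_id,type" text format.
        ts, session_id, kind = line.strip().split(",")
        event = {"ts": int(ts), "session_id": session_id, "type": kind}
        # Analysis step runs right away on the parsed object.
        self.sessions[session_id].append(event)
        self.hourly[(hour_bucket(event["ts"]), kind)] += 1

    def hourly_rows(self):
        """Rows ready for an hourly bulk load, sorted for stable output."""
        return sorted((hour, kind, n) for (hour, kind), n in self.hourly.items())


service = SessionService()
for raw in ["0,s1,view", "10,s1,click", "3600,s2,view"]:
    service.ingest(raw)
rows = service.hourly_rows()
```

Only the aggregated `rows` would go to the database, once per hour and in bulk, which matches the "no direct DB access" point above.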
Taboola 2011-2013
•  Multi-DC
•  Roughly the same architecture
•  Backend growth handled by scaling up (monster machines)
•  Introduced real-time analyzers
•  Introduced sharding
•  Moved to lsync-based file sync
•  Introduced Top Reports capabilities
Diagram: Frontend (FE Server), Lsync, Backend (Session Parser + Analyzer). Flows: write rawdata, read rawdata, write hourly data (bulk loading).
Taboola 2014 -
Our Data Requirements
•  Lots of incoming traffic (100K requests/sec)
•  Data (5+ TB/day):
    •  Personalized served recommendations, per user, per page view
    •  Events: what the user actually read and what they did
•  The data needs to be joined and processed in real time for:
    •  Campaign management
    •  Recommendations
    •  Billing
    •  Reports
    •  Etc.
•  The data needs to be available for offline research
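To make the join requirement concrete, here is a minimal sketch that groups served recommendations and user events under the same (user, page view) key and then derives which served items were clicked. The record shapes and field names (`user_id`, `view_id`, `items`, `type`, `item`) are assumptions for illustration, not the real message format.

```python
def join_by_view(served, events):
    """Group served recommendations and events under (user_id, view_id)."""
    joined = {}
    for rec in served:
        key = (rec["user_id"], rec["view_id"])
        entry = joined.setdefault(key, {"items": [], "events": []})
        entry["items"].extend(rec["items"])
    for ev in events:
        key = (ev["user_id"], ev["view_id"])
        # An event can show up for a view whose recommendations are not
        # known (yet); keep it so nothing is lost.
        entry = joined.setdefault(key, {"items": [], "events": []})
        entry["events"].append(ev)
    return joined


def clicked_recommendations(joined):
    """For each page view, list the served items the user clicked."""
    result = {}
    for key, entry in joined.items():
        clicks = [ev["item"] for ev in entry["events"]
                  if ev["type"] == "click" and ev["item"] in entry["items"]]
        if clicks:
            result[key] = clicks
    return result


# Hypothetical sample data.
served = [
    {"user_id": "u1", "view_id": "v1", "items": ["a", "b", "c"]},
    {"user_id": "u2", "view_id": "v9", "items": ["d", "e"]},
]
events = [
    {"user_id": "u1", "view_id": "v1", "type": "click", "item": "b"},
    {"user_id": "u3", "view_id": "v5", "type": "click", "item": "z"},
]
joined = join_by_view(served, events)
clicks = clicked_recommendations(joined)
```

A joined view like this is the kind of input that billing, reporting and campaign management would each consume, which is why the join has to happen once and in real time.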
Data Model
Users
Sessions
Views
Requests
Items
Events
Challenges
•  We care about sessions: a chain of page views and events for a specific user
    •  Length can be hours or even days
•  We care about users: a chain of sessions across sites
    •  Length can be days or even months
•  Stateless application: a single user's data is sent from multiple data centers and multiple servers
    •  No deterministic affinity to a server or DC
    •  Order isn't guaranteed
•  Must be robust and automatically deal with late arrivals
•  “Exactly once” semantics
Challenges Cont.
•  Many streams of data that need to be joined (user, session, page view, widgets, recommendations, events, actions)
•  5+ TB of daily data
•  Research purposes require looking at full user activity across time
Data Flow
FE Servers -> Kafka -> FE Consumer (Spark) -> C* Sessions
Table Model in C*
•  Partition key: session start hour + user bucket (0-9,999)
•  Clustering key: publisher_id, user_id, session_id, view_id, data_type, data_hash
•  Data type: MULTI_REQUEST, USER_EVENT, ACTION_CONVERSION, …
•  Data: protobuf blobs
•  Results:
    •  All the data of a single session is in one place, regardless of time of arrival
    •  Idempotent process: if the same message is received twice, it overwrites the previous arrival because it has the same hash
    •  Sampling is built into the model
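The key layout above can be simulated in a few lines to show why late and duplicate messages are harmless and why sampling comes for free. This is a sketch under several assumptions: the table is an in-memory dictionary rather than Cassandra, users are bucketed with MD5, `data_hash` is a SHA-1 of the blob, and each message carries its session's start time in a hypothetical `session_start_ts` field. None of these details are stated in the slides.

```python
import hashlib

NUM_BUCKETS = 10_000  # user buckets 0-9,999, as on the slide
HOUR = 3600


def user_bucket(user_id):
    """Stable bucket for a user. The hash choice (MD5) is an assumption."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def row_keys(msg):
    """Return (partition_key, clustering_key) for one message."""
    start_hour = msg["session_start_ts"] // HOUR * HOUR
    partition = (start_hour, user_bucket(msg["user_id"]))
    data_hash = hashlib.sha1(msg["data"]).hexdigest()
    clustering = (msg["publisher_id"], msg["user_id"], msg["session_id"],
                  msg["view_id"], msg["data_type"], data_hash)
    return partition, clustering


class SessionsTable:
    """In-memory stand-in for the C* table: partition -> {clustering: blob}."""

    def __init__(self):
        self.partitions = {}

    def write(self, msg):
        partition, clustering = row_keys(msg)
        # Upsert: a redelivered message lands on the same key and overwrites.
        self.partitions.setdefault(partition, {})[clustering] = msg["data"]

    def sample(self, percent):
        """Keep only partitions whose user bucket is in the first percent%."""
        limit = NUM_BUCKETS * percent // 100
        return {p: rows for p, rows in self.partitions.items() if p[1] < limit}


table = SessionsTable()
request = {"session_start_ts": 1_450_000_000, "user_id": "u1",
           "publisher_id": "pub1", "session_id": "s1", "view_id": "v1",
           "data_type": "MULTI_REQUEST", "data": b"serialized-request"}
late_event = dict(request, data_type="USER_EVENT", data=b"serialized-event")

table.write(request)
table.write(late_event)  # arrives later, but keyed by the same session start hour
table.write(request)     # duplicate delivery overwrites the identical row
```

Because the partition key depends on the session's start hour rather than the arrival time, the late event joins the same partition as the original request, and reading buckets 0-99 alone yields a 1% user sample.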
Data Flow Cont.
Diagram components: C* Sessions, Traffic Processor (Spark), Manual runner, Next Gen. Reports, Next Gen. Counters (Spark), Zeppelin, BigQuery, Hadoop, Vertica
Before vs. After
•  Raw data: real-time full access to the raw data, not just aggregated data
    •  A week of data (~35TB): 2 hours to analyze and report
    •  10 physical nodes, 320 cores, 2.5TB memory, SSDs
    •  Analyzing a 1% sample of the users reduces this linearly (partition key)
    •  Analyzing a single publisher that accounts for 1% of the data reduces this almost linearly (clustering key)
•  Reporting: full reporting is available within minutes vs. hours
•  Supporting our growth: Spark as a distributed computing engine is very strong, easy to scale and extend
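The figures on this slide imply a scan rate that is easy to work out. The short calculation below assumes decimal terabytes and perfectly linear scaling for the 1% sample; both are simplifying assumptions, since the slide says only "~35TB" and "reduces this linearly".

```python
TB = 10**12  # decimal terabyte; an assumption, the slide does not specify

week_bytes = 35 * TB
full_scan_seconds = 2 * 3600
nodes, cores = 10, 320

throughput = week_bytes / full_scan_seconds  # bytes/sec across the cluster
per_node = throughput / nodes
per_core = throughput / cores

# If run time scales linearly with the data read, a 1% user sample takes:
sample_seconds = full_scan_seconds / 100

print(f"cluster:   {throughput / 1e9:.2f} GB/s")
print(f"per node:  {per_node / 1e6:.0f} MB/s")
print(f"per core:  {per_core / 1e6:.1f} MB/s")
print(f"1% sample: {sample_seconds:.0f} s")
```

Under these assumptions the cluster scans a little under 5 GB/s, and a 1% sample drops the two-hour job to roughly a minute.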
Before vs. After - Cont.
•  Long-term data access: Hadoop, Cassandra and BigQuery provide a solution we did not have before
•  Analytics engine: the move from MySQL to Vertica (as an MPP engine) allows us to support complex queries over very large data sets
•  Algorithmic research and modeling: we are now capable of in-depth analysis on multiple dimensions across long time periods
Thank You!
tal@taboola.com

BDX 2016 - Tal Sliwowicz @ Taboola