BDX 2016 - Tal sliwowicz @ taboola

Taboola’s Road to Scale
The Data Perspective

Tal Sliwowicz

Copyright ©2016 The Nielsen Company. Confidential and Proprietary.
Tal Sliwowicz
Director, R&D
tal@taboola.com

Who am I?

You’ve Seen Us Before!
Enabling people to discover
information at that moment when
they’re likely to engage

Entertainment | Lifestyle
Tech
Our Clients
are All Around
the Globe

750M monthly unique users
100K+ requests/sec
10B+ recommendations/day
5TB+ daily data
REACH   PROPERTY
95.5%   Google Ad Network
87.8%   Taboola
86.2%   Google Sites
61.5%   Facebook
60.3%   Yahoo Sites
56.6%   Outbrain
52% mobile traffic
48% desktop traffic
US desktop users reached, 12/2015
Taboola in Numbers

Context
Metadata Region-based
Location
Recommendations
User Behavior
Cookie Data
Collaborative Filtering
Bucketed
Consumption Groups
CONTENT RECOMMENDATION ENGINE
Social
Facebook /
Twitter API
The Recommendation Engine

Taboola’s Discovery Platform
Traffic Acquisition

Business Dev.
Sponsored Content
Editorial
Newsroom
Sales
Native Ads
Audience Dev.
Product
Personalization
Data & Insights

•  Events and logs (rawdata) written directly to DB
•  Recs are read from DB
•  Crashed when CNN launched
Taboola 2007
Frontend
FE Server

•  Same as before, but without direct writes to DB
•  Switched to bulk load
•  But – very basic reporting, not scalable
Taboola 2007.5
Frontend
Bulk Load
FE Server

•  Introduced semi-real-time event parsing services: Session Parser and Session Analyzer
•  Divided analysis work by unit (session)
•  Files were pushed from RecServer(s) to backend processing
•  Files are gzipped textual INSERT statements
•  But – not real-time enough
Taboola 2008
Frontend
NFS
Backend
FE Server SessionParser SessionAnalyzer
Write Summarized Data
Write rawdata
Read session files
Read rawdata
Write session files
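
The "gzipped textual INSERT statements" file format can be sketched roughly as follows. This is only an illustration, not the original implementation: the table name `rawdata`, the column names, and the helper functions are hypothetical.

```python
import gzip
import io


def sql_literal(value):
    """Render a Python value as a SQL literal (numbers bare, strings quoted)."""
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def write_insert_batch(rows, table="rawdata"):
    """Serialize rows as one INSERT statement per line, gzip-compressed in memory.

    `table` and the row columns are assumptions made for this sketch.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for row in rows:
            cols = ", ".join(row.keys())
            vals = ", ".join(sql_literal(v) for v in row.values())
            line = f"INSERT INTO {table} ({cols}) VALUES ({vals});\n"
            gz.write(line.encode("utf-8"))
    return buf.getvalue()


def read_insert_batch(payload):
    """Decompress a batch and return its INSERT statements, one per list item."""
    return gzip.decompress(payload).decode("utf-8").splitlines()
```

A backend process could then replay each decompressed line against the database, which is simple but, as the slide notes, not real-time.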

•  Made a leap towards real-time stream processing
•  Unified Session Parser and Session Analyzer into an in-memory service (without going through disk)
•  Made dramatic optimizations to memory allocation and data models
•  Failure-safe architecture - can endure data delays and front-end server malfunctions
•  No direct DB access - key for performance; bulk loading is used only for loading hourly data
Taboola 2010
Frontend
NFS
Backend
FE Server Session Parser + Analyzer
Write Hourly Data (Bulk Loading)
Write rawdata
Read rawdata
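
A minimal sketch of the idea behind this stage, assuming raw events carry a timestamp, a publisher and an event type (all field names here are hypothetical): events are summarized in memory into hourly rows, and only those summary rows are bulk loaded.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(epoch_seconds):
    """Truncate a UNIX timestamp to its UTC hour, e.g. '1970-01-01 00:00'."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:00")


def summarize_hourly(events):
    """Count events per (hour, publisher, type) entirely in memory.

    The returned rows are what a bulk loader would write once per hour,
    instead of issuing one database write per raw event.
    """
    counts = Counter()
    for event in events:
        key = (hour_bucket(event["ts"]), event["publisher"], event["type"])
        counts[key] += 1
    return sorted((h, p, t, n) for (h, p, t), n in counts.items())
```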

•  Multi DC
•  Roughly the same architecture
•  Grew the backend by scaling up (monster machines)
•  Introduced real-time analyzers
•  Introduced sharding
•  Moved to lsync-based file sync
•  Introduced Top Reports capabilities
Taboola 2011-2013
Frontend
Lsync
Backend
FE Server Session Parser + Analyzer
Write Hourly Data (Bulk Loading)
Write rawdata
Read rawdata

Taboola 2014 -

•  Lots of incoming traffic (100K requests/sec)
•  Data (5+ TB / day):
    •  Personalized served recommendations – per user, per page view
    •  Events - what the user actually read and did
•  The data needs to be joined and processed in real time
    •  Campaigns Management
    •  Recommendations
    •  Billing
    •  Reports
    •  Etc.
•  The data needs to be available for offline research
Our Data Requirements

Data Model
Users
Sessions
Views
Requests
Items
Events

•  We care about sessions - a chain of page views and events for a specific user
    •  Length can be hours or even days
•  We care about users – a chain of sessions across sites
    •  Length can be days or even months
•  Stateless application – a single user's data is sent from multiple data centers and multiple servers
    •  No deterministic affinity to a server or DC
    •  Order isn’t guaranteed
•  Must be robust and automatically deal with late arrivals
•  “Exactly once” semantics
Challenges
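
One way to picture these requirements is a session assembler whose result does not depend on arrival order or on duplicate delivery. The sketch below is only an illustration under assumed field names (`user_id`, `session_id`, `msg_id`, `ts`), none of which come from the slides.

```python
from collections import defaultdict


def assemble_sessions(messages):
    """Group messages into sessions, independent of arrival order.

    Messages are keyed by a unique id inside each session, so a message
    delivered twice replaces itself rather than being counted twice.
    Ordering is restored from the message timestamp, not from arrival.
    """
    sessions = defaultdict(dict)
    for msg in messages:
        session_key = (msg["user_id"], msg["session_id"])
        sessions[session_key][msg["msg_id"]] = msg
    return {
        key: sorted(by_id.values(), key=lambda m: (m["ts"], m["msg_id"]))
        for key, by_id in sessions.items()
    }
```

Feeding the same messages in a different order, or with repeats, yields the same sessions, which is the effect that "exactly once" semantics is after.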

• Many streams of data that need to be joined (user, session, page view, widgets, recommendations, events, actions)
• 5+ TB of daily data
• Research purposes require looking at full user activity across time
Challenges Cont.

Data Flow
FE Servers
Kafka
FE Consumer
(Spark)
C* Sessions

•  Partition key - session start hour + user bucket (0-9,999)
•  Clustering key - publisher_id, user_id, session_id, view_id, data_type, data_hash
•  Data Type - MULTI_REQUEST, USER_EVENT, ACTION_CONVERSION, …
•  Data - protobuf blobs
•  Results:
    •  All the data of a single session is in one place, regardless of time of arrival
    •  Idempotent process - if the same message is received twice, it overwrites the previous arrival because it has the same hash ID
    •  Sampling is built into the model
Table Model in C*
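
The key layout above can be modeled in a few lines to show why writes are idempotent and why sampling falls out of the partition key. This is a sketch under assumptions, not the actual schema code: the slides do not say how the user bucket or `data_hash` is computed, so the MD5 and SHA-1 choices, the `SessionsTable` class, and the in-memory dict standing in for Cassandra are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone

NUM_BUCKETS = 10_000  # user buckets 0-9,999, as on the slide


def user_bucket(user_id):
    """Map a user id to a stable bucket (hash choice is an assumption)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def partition_key(session_start_ts, user_id):
    """Partition key: session start hour plus user bucket."""
    start = datetime.fromtimestamp(session_start_ts, tz=timezone.utc)
    return (start.strftime("%Y-%m-%dT%H"), user_bucket(user_id))


def clustering_key(publisher_id, user_id, session_id, view_id, data_type, payload):
    """Clustering key ending in a hash of the payload bytes."""
    data_hash = hashlib.sha1(payload).hexdigest()
    return (publisher_id, user_id, session_id, view_id, data_type, data_hash)


class SessionsTable:
    """In-memory stand-in for the sessions table; a dict keyed like the real rows."""

    def __init__(self):
        self.rows = {}

    def upsert(self, session_start_ts, publisher_id, user_id, session_id,
               view_id, data_type, payload):
        key = (
            partition_key(session_start_ts, user_id),
            clustering_key(publisher_id, user_id, session_id, view_id,
                           data_type, payload),
        )
        # The same message always maps to the same key, so a redelivery
        # overwrites the earlier copy instead of adding a row.
        self.rows[key] = payload

    def sample(self, max_bucket):
        """Select only rows in user buckets below max_bucket (100 of 10,000 = 1%)."""
        return {k: v for k, v in self.rows.items() if k[0][1] < max_bucket}
```

Because the session start hour, rather than the arrival time, is part of the partition key, a late message still lands next to the rest of its session.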

Traffic Processor
(Spark)
Manual runner
Next Gen. Reports
Next Gen.
Counters (Spark)
Zeppelin BigQuery
Data Flow Cont.
C* Sessions
Hadoop Vertica

•  Raw data – real-time full access to the raw data, not just aggregated data
•  Week of data (~35TB) - 2 hours to analyze and report
    •  10 physical nodes, 320 cores, 2.5TB memory, SSDs
    •  Analyzing a 1% sample of the users reduces this linearly (partition key)
    •  Analyzing a single publisher which is 1% of the data reduces this almost linearly (clustering key)
•  Reporting – full reporting is available within minutes vs. hours
•  Supporting our growth – Spark as a distributed computing engine is very strong, easy to scale and extend
Before vs. After
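
The figures on this slide imply a rough scan throughput, which can be worked out as below. This is back-of-the-envelope arithmetic only: it assumes decimal units (1 TB = 1,000,000 MB) and perfectly linear scaling for the sample estimate, which the slide claims only approximately for the publisher case.

```python
WEEK_TB = 35   # ~35 TB for a week of data (5 TB/day x 7)
HOURS = 2      # time to analyze and report
NODES = 10
CORES = 320


def throughput_mb_per_sec(total_tb, hours):
    """Average scan rate in MB/s, using 1 TB = 1,000,000 MB."""
    return total_tb * 1_000_000 / (hours * 3600)


def estimated_minutes(fraction, full_hours=HOURS):
    """Run time for a fraction of the data, assuming ideal linear scaling."""
    return full_hours * fraction * 60


cluster_rate = throughput_mb_per_sec(WEEK_TB, HOURS)
print(f"cluster:  {cluster_rate:,.0f} MB/s")
print(f"per node: {cluster_rate / NODES:,.0f} MB/s")
print(f"per core: {cluster_rate / CORES:,.1f} MB/s")
print(f"1% sample: about {estimated_minutes(0.01):.1f} minutes")
```

Under these assumptions the cluster scans a little under 5 GB/s in aggregate, and a 1% user sample would take on the order of a minute rather than two hours.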

•  Long-term data access – Hadoop, Cassandra and BigQuery provide a solution we did not have before
•  Analytics engine – the move from MySQL to Vertica (as an MPP engine) allows us to support complex queries over very large data sets
•  Algorithmic research and modeling – we are now capable of in-depth analysis on multiple dimensions across long time periods
Before vs. After - Cont.
