Big data and machine learning / Gil Chamiel

Big Data and Machine Learning
These Lessons were Written in Clicks
Gil Chamiel
Director of Data Science and Algorithms Engineering

You’ve Seen Us Before
Enabling people to discover information at that moment when they’re likely to engage

Taboola in Numbers
• 750M monthly unique users
• 500K+ requests/sec
• 15B+ recommendations/day
• 17TB+ daily data
• 52% mobile traffic, 48% desktop traffic
• A typical US user sees a Taboola widget at least twice a day

US desktop users reached, 12/2015
Reach    Property
95.5%    Google Ad Network
87.8%    Taboola
86.2%    Google Sites
61.5%    Facebook
60.3%    Yahoo Sites
56.6%    Outbrain

Taboola’s Discovery Platform
Traffic Acquisition
Business Dev.
Sponsored Content
Editorial
Newsroom
Sales
Native Ads
Audience Dev. Product
Personalization
Data & Insights

The Recommendation Engine
• Context Metadata
• Region-based Location Information
• User Behavior Data
• User Consumption Groups
• Social (Facebook / Twitter API)

The Taboola Data Culture
A one-stop shop for all data needs, supporting our constant offensive battle.
Data for Machine Learning
User Behavior Analysis
System Behavior Analysis
Business Analysis
Data Driven OPS
Sea of Data

Machine Learning: The Basics
Predict User Engagement with Recommended Content
Offline Online
• Bayesian Inference
• Linear Models
• Gradient Boosted Trees
• Factorization Machines
• Deep Neural Networks

Machine Learning: Circular Data Pipeline
“Regular” program: Input -> Program -> Output
Machine learning: Input + Output -> Train -> Model -> Predict
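The contrast on this slide can be sketched in a few lines of Python. This is only an illustration: the category names, the prior values, and the function names `regular_program` and `train` are assumptions, not anything taken from Taboola's system. The learned model is a smoothed click rate per category, which is about the simplest form of the Bayesian inference mentioned earlier.

```python
def regular_program(category):
    """Hand-written rule: the programmer supplies the logic."""
    return 0.05 if category == "sports" else 0.01


def train(examples, prior_clicks=1.0, prior_views=20.0):
    """Learn a click rate per category from (category, clicked) pairs.

    The prior acts as pseudo-counts, so rarely seen categories fall
    back towards prior_clicks / prior_views.
    """
    clicks, views = {}, {}
    for category, clicked in examples:
        views[category] = views.get(category, 0) + 1
        clicks[category] = clicks.get(category, 0) + (1 if clicked else 0)

    def predict(category):
        return (clicks.get(category, 0) + prior_clicks) / (
            views.get(category, 0) + prior_views
        )

    return predict


# Logged (input, output) pairs train the model; the model then predicts.
logged = [("sports", True), ("sports", False), ("news", False)]
model = train(logged)
sports_rate = model("sports")   # (1 + 1) / (2 + 20)
unseen_rate = model("finance")  # falls back to the prior, 1 / 20
```

The pipeline is circular because the recommendations served from `model` produce the next batch of logged pairs, which feed the next call to `train`.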

Offline vs. Online
• Efficient research can only be done offline
• Real effect can only be validated online (and we A/B test like crazy)
• Flexibility and ease of use => fast validation of new ideas
"Deep Neural Networks for YouTube Recommendations", RecSys ’16
"Wide & Deep Learning for Recommender Systems", CoRR abs/1606.07792 (2016)

Maintaining Data for Online Predictions

• Cookies
– Easy and super distributed
– Difficult to maintain (sustainability)
– Updates are online only (and bootstrapping is hard)
– Cannot be reached offline
– Limited storage
– Increases network latency and costs
– Handles out-of-order events poorly
• Server-Side Data Counters
– Requires high performance NoSQL database technology (Cassandra, HBase, ScyllaDB, etc.)
– Easy to bootstrap data calculated offline or upload data from other sources
– Storage is limited mainly by budget ($$$)
– Cheap online reads (usually not a lot of data)
– Requires read before write (counter implementations are dodgy)
– Fixed set of counters and aggregations (early commitment)
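A minimal sketch of the counter approach, assuming an in-memory dict in place of a real NoSQL store. The class `CounterStore`, its method names, and the counter names are hypothetical and chosen only for illustration. It shows the three properties listed above: the counter set is fixed up front, offline totals can be bootstrapped, and every update is a read followed by a write.

```python
class CounterStore:
    """In-memory stand-in for a key-value store of per-user counters."""

    COUNTERS = ("views", "clicks")  # fixed up front: the early commitment

    def __init__(self):
        self._rows = {}

    def _empty_row(self):
        return dict.fromkeys(self.COUNTERS, 0)

    def bootstrap(self, offline_totals):
        """Load totals computed offline, e.g. {"u1": {"views": 10}}."""
        for user_id, totals in offline_totals.items():
            self._rows[user_id] = {
                name: totals.get(name, 0) for name in self.COUNTERS
            }

    def increment(self, user_id, name, amount=1):
        if name not in self.COUNTERS:
            raise KeyError(f"unknown counter: {name}")
        row = self._rows.get(user_id, self._empty_row())  # read
        updated = {**row, name: row[name] + amount}       # modify
        self._rows[user_id] = updated                     # write

    def read(self, user_id):
        return dict(self._rows.get(user_id, self._empty_row()))


store = CounterStore()
store.bootstrap({"u1": {"views": 10, "clicks": 2}})
store.increment("u1", "views")
store.increment("u2", "clicks")
u1_counters = store.read("u1")
u2_counters = store.read("u2")
```

In this sketch the read and the write are separate steps, so two concurrent writers could each read the same row and one update would be lost. That is the kind of problem the "read before write" bullet points at.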

• Saving Individual Events
– Let the “future you” decide on how to aggregate
– De-normalize to your liking (tradeoff between computation time and latency/storage)
– No read before write (and non-blocking)
– Reads are extremely expensive
• Time Series Data Modeling
– Control over read latency
– Useful for time-dependent modeling (e.g. decay counters)
– May still be a challenge (mastering DB internals is a must)
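To make the event-based alternative concrete, here is a minimal sketch that appends raw events and computes a decay counter only at read time. The `EventLog` class, the half-life weighting, and the numeric timestamps are assumptions made for illustration; a production version would sit on a time-series table rather than a Python list.

```python
from collections import defaultdict


class EventLog:
    """Append-only, in-memory stand-in for a per-user event table."""

    def __init__(self):
        self._events = defaultdict(list)

    def append(self, user_id, event_type, timestamp):
        # Pure write: no read is needed, and arrival order does not matter.
        self._events[user_id].append((timestamp, event_type))

    def decayed_count(self, user_id, event_type, now, half_life):
        """Count events, weighting each by 0.5 ** (age / half_life)."""
        return sum(
            0.5 ** ((now - ts) / half_life)
            for ts, kind in self._events.get(user_id, [])
            if kind == event_type and ts <= now
        )


log = EventLog()
log.append("u1", "click", 10.0)
log.append("u1", "click", 0.0)  # arrives late, still weighted by its own time
log.append("u1", "view", 5.0)
recent_clicks = log.decayed_count("u1", "click", now=10.0, half_life=10.0)
```

The aggregation (here a half-life decay) is chosen at read time, so it can change later without rewriting stored data. The cost is that every read scans the user's events, which is why read latency needs deliberate data modeling.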
Is this enough for offline analysis and research?

Offline: Data for Machine Learning
Pipelining and Research

Data for ML Pipelining and Research: The Challenge
• Objective: A complete picture of the user and context on every impression!
• Challenges:
– Events occur at different times
– Historic user data must be true to the time of impression
– Fast querying by hundreds of analysts and engineers
– Machine learning programs like their data flat
• What is the real issue?
– Joins between various events to form a logical entity (user, session, page view)
– Joins between historic user data and current impression data
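The second join, between historic user data and the current impression, is a point-in-time lookup. A minimal sketch, assuming user history is stored as timestamp-sorted snapshots and that "true to the time of impression" means strictly earlier than the impression. The field names and the helpers `snapshot_before` and `build_training_rows` are hypothetical.

```python
from bisect import bisect_left


def snapshot_before(snapshots, ts):
    """Return the latest snapshot taken strictly before ts, or None.

    snapshots is a list of (timestamp, features) sorted by timestamp.
    """
    times = [t for t, _ in snapshots]
    i = bisect_left(times, ts)
    return snapshots[i - 1][1] if i else None


def build_training_rows(impressions, user_history):
    """Attach to each impression the user features known at that time."""
    rows = []
    for imp in impressions:
        feats = snapshot_before(user_history.get(imp["user_id"], []), imp["ts"])
        user_cols = {f"user_{k}": v for k, v in (feats or {}).items()}
        rows.append({**imp, **user_cols})  # one flat row per impression
    return rows


user_history = {"u1": [(100, {"clicks": 1}), (200, {"clicks": 5})]}
impressions = [
    {"user_id": "u1", "ts": 150, "clicked": 0},
    {"user_id": "u1", "ts": 50, "clicked": 1},
]
rows = build_training_rows(impressions, user_history)
```

Using only earlier snapshots keeps later clicks out of the features, so the model trains on what was actually known when the recommendation was served.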
Maintain a Dedicated Data Store

How did we go about solving these challenges?
• Starting point: pre-aggregate counters over raw data
– Every query requires a rerun (parsing and joins over the raw data)
– Many additional disadvantages
• When in trouble: de-normalize!
– Use an efficient and extendable serialization schema (e.g. Protobuf)
– De-normalize until you run out of space (or money)
– Useful for pipelining historic user data
• Join multiple events at write time (short term)
– Maintain a shared key (user id, session id, page view id)
– Use a strong and scalable key-value database (e.g. Cassandra)
• Use columnar storage (long term)
– Drives machine learning and research
– Many tools out there (Parquet, BigQuery, etc.)
– Use a scalable and rich query mechanism (Spark SQL, BigQuery, Impala, etc.)
– Machine learning programs like flat data (easy with FLATTEN, explode, user defined functions, etc.)
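As an illustration of why flattening matters, the sketch below turns one de-normalized page-view record into flat rows, one per recommended item. It is plain Python standing in for what FLATTEN or explode would do inside a query engine. The record layout and the function name `explode_page_view` are assumptions, not the schema used in the talk.

```python
def explode_page_view(page_view):
    """Return one flat row per recommended item in a nested page view.

    Page-view level fields are repeated on every row; item fields get
    an "item_" prefix so the two levels cannot collide.
    """
    parent = {k: v for k, v in page_view.items() if k != "items"}
    return [
        {**parent, **{f"item_{k}": v for k, v in item.items()}}
        for item in page_view["items"]
    ]


page_view = {
    "page_view_id": "pv1",
    "user_id": "u1",
    "items": [
        {"id": "a", "clicked": True},
        {"id": "b", "clicked": False},
    ],
}
flat_rows = explode_page_view(page_view)
```

Each resulting row has the same columns, which is the shape most training code expects.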
Users
Sessions
Views
Clicks
History
Post-click events

Because We Recommend…
Data is king!
Online and offline pose different challenges -> different solutions
Storage is cheap: rewrite your data for convenience
Still worried about storage? You don’t have to keep everything for every user:
Sub-sampling is a requirement when learning models
Be extremely verbose for small parts of the data
For fast research: save it again for a sample of the users, views, etc.
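One common way to keep a consistent sample of users is to hash the user id into a bucket, so the same users are selected on every run and across every event type. This is a sketch of that general technique, not necessarily how the sampling was done at Taboola. The function name `in_sample` and the salt string are hypothetical.

```python
import hashlib


def in_sample(user_id, rate, salt="research-sample"):
    """Deterministically keep roughly `rate` of all users (0 <= rate <= 1)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


user_ids = [f"user-{i}" for i in range(10000)]
sampled = [u for u in user_ids if in_sample(u, rate=0.1)]
sample_fraction = len(sampled) / len(user_ids)
```

Because selection depends only on the user id and the salt, a sampled user's views, clicks, and post-click events all land in the sample together. Changing the salt draws a different, independent sample.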
