Big Data and Machine Learning
These Lessons were Written in Clicks
Gil Chamiel
Director of Data Science and Algorithms Engineering
You’ve Seen Us Before
Enabling people to discover information at that moment when they’re likely to engage
750M monthly unique users
500K+ requests/sec
15B+ recommendations/day
17TB+ daily data
REACH PROPERTY
95.5% Google Ad Network
87.8% Taboola
86.2% Google Sites
61.5% Facebook
60.3% Yahoo Sites
56.6% Outbrain
52% mobile traffic
48% desktop traffic
US desktop users reached, 12/2015
Taboola in Numbers
A typical US user sees a Taboola widget at least twice a day
Taboola’s Discovery Platform
Traffic Acquisition
Business Dev.
Sponsored Content
Editorial
Newsroom
Sales
Native Ads
Audience Dev. Product
Personalization
Data & Insights
Context
Metadata Region-based
Location
Information
User Behavior
Data
User
Consumption Groups
Social
Facebook /
Twitter API
The Recommendation Engine
Tools We Recommend
The Taboola Data Culture
One stop shop for all data needs to support our constant offensive battle.
Data for Machine Learning
User Behavior Analysis
System Behavior Analysis
Business Analysis
Data Driven OPS
Sea of Data
Machine Learning: The Basics
Predict User Engagement with Recommended Content
Offline Online
Bayesian
Inference
Linear Models
Gradient
Boosted Trees
Factorization
Machines
Deep Neural
Networks
Machine Learning: Circular Data Pipeline
Input “Regular” Program
Output
Input Train Output
Model
Predict
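One way to read the diagram above: a "regular" program is written by hand and maps input to output, while training takes inputs together with observed outputs and produces the model that is later used to predict. The sketch below illustrates that with a Beta-smoothed click-through rate per content category, a very small instance of Bayesian inference. The names `train`, `predict`, `prior_clicks` and `prior_views`, and the prior values, are assumptions made for illustration and do not come from the talk.

```python
from collections import defaultdict


def train(examples, prior_clicks=1.0, prior_views=20.0):
    """Learn a smoothed click-through rate per category.

    examples: iterable of (category, clicked) pairs, i.e. inputs with
    their observed outputs. Returns a predict function (the "model").
    The prior values are illustrative assumptions.
    """
    clicks = defaultdict(float)
    views = defaultdict(float)
    for category, clicked in examples:
        views[category] += 1.0
        if clicked:
            clicks[category] += 1.0

    def predict(category):
        # Posterior mean of a Beta prior updated with the observed counts.
        # Unseen categories fall back to the prior rate.
        c = clicks.get(category, 0.0)
        v = views.get(category, 0.0)
        return (c + prior_clicks) / (v + prior_views)

    return predict


model = train([("sports", True), ("sports", False), ("news", False)])
print(model("sports"), model("unseen"))
```

The prior keeps predictions for rarely seen categories close to a baseline rate instead of jumping to 0 or 1 after a handful of impressions.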
Offline vs. Online
• Efficient research can only be done offline
• Real effect can only be validated online (and we a/b test like crazy)
• Flexibility and ease of use => fast validation of new ideas
"Deep Neural Networks for YouTube Recommendations", RecSys ’16
"Wide & Deep Learning for Recommender Systems", CoRR abs/1606.07792 (2016)
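The slide says online effects are validated with A/B tests but does not say how the results are compared. As a hedged illustration, the sketch below uses a standard two-proportion z-test on click-through rates. The function name and the example numbers are assumptions, and this is not presented as the test Taboola actually runs.

```python
import math


def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference in click-through rate.

    Returns (z, p_value). Assumes both variants have views and that the
    pooled rate is strictly between 0 and 1 (otherwise the standard
    error is zero and the division fails).
    """
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / views_a + 1.0 / views_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical numbers: variant B has a higher observed CTR than variant A.
z, p = two_proportion_z(100, 1000, 150, 1000)
print(round(z, 2), p < 0.01)
```

Running many tests in parallel raises the chance of false positives, so in practice the significance threshold usually needs a correction for multiple comparisons.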
Maintaining Data for Online Predictions
• Cookies
• Easy and super distributed
• Difficult to maintain (sustainability)
• Updates are online only (and bootstrapping is hard)
• Cannot be reached offline
• Limited storage
• Increases network latency and costs
• Not so great with out-of-order events
• Server-Side Data Counters
• Requires high performance NoSQL database technology (Cassandra, HBase, ScyllaDB, etc.)
• Easy to bootstrap data calculated offline or upload data from other sources
• Less limited on storage (up to $$$)
• Easy on read online (usually not a lot of data)
• Read before write (counter implementations are dodgy)
• Fixed set of counters and aggregations (early commitment)
• Saving Individual Events
– Let the “future you” decide on how to aggregate
– De-normalize to your liking (tradeoff between computation time and latency/storage)
– No read before write (and non-blocking)
– Reads are extremely expensive
• Time Series Data Modeling
– Control over read latency
– Useful for time dependent modeling (e.g. decay counters)
– May still be a challenge (mastering DB internals is a must)
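The "decay counters" mentioned above can be sketched as an exponentially decayed count: each event's contribution halves every fixed period, so recent behavior weighs more than old behavior. The class below is a minimal in-memory illustration. The name `DecayCounter`, the half-life parameter and the handling of late events are assumptions, not the storage design described in the talk.

```python
import math


class DecayCounter:
    """Exponentially decayed event counter (illustrative sketch)."""

    def __init__(self, half_life_seconds):
        self.decay_rate = math.log(2.0) / half_life_seconds
        self.value = 0.0
        self.last_ts = None  # timestamp the stored value refers to

    def add(self, ts, weight=1.0):
        if self.last_ts is None:
            self.value = weight
            self.last_ts = ts
        elif ts >= self.last_ts:
            # Decay the stored value forward to ts, then add the new event.
            elapsed = ts - self.last_ts
            self.value = self.value * math.exp(-self.decay_rate * elapsed) + weight
            self.last_ts = ts
        else:
            # Out-of-order event: decay its weight up to last_ts instead.
            age = self.last_ts - ts
            self.value += weight * math.exp(-self.decay_rate * age)

    def read(self, now):
        if self.last_ts is None:
            return 0.0
        elapsed = max(0.0, now - self.last_ts)
        return self.value * math.exp(-self.decay_rate * elapsed)


counter = DecayCounter(half_life_seconds=10.0)
counter.add(ts=0.0)
print(counter.read(now=10.0))  # one half-life later
```

Because only a value and a timestamp are stored, an update is a single small write, and a late event can still be folded in without replaying history.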
Is this enough for offline analysis and research?
Offline: Data for Machine Learning
Pipelining and Research
Data for ML Pipelining and Research: The Challenge
• Objective: A complete picture of the user and context on every impression!
• Challenges:
– Events occur at different times
– Historic user data must be true to the time of impression
– Fast querying by hundreds of analysts and engineers
– Machine learning programs like their data flat
• What is the real issue?
– Joins between various events to form a logical entity (user, session, page view)
– Joins between historic user data and current impression data
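The requirement that historic user data be "true to the time of impression" amounts to a point-in-time join: each impression is paired with the latest user snapshot taken at or before the impression, never a later one. The sketch below shows the idea on in-memory lists. The field names (`user_id`, `ts`) and the snapshot layout are assumptions for illustration.

```python
from bisect import bisect_right


def point_in_time_join(impressions, snapshots_by_user):
    """Attach to each impression the latest user snapshot not after it.

    impressions: list of dicts with "user_id" and "ts".
    snapshots_by_user: user_id -> list of (ts, features) sorted by ts.
    Returns flat rows; user features are prefixed with "user_".
    """
    rows = []
    for imp in impressions:
        history = snapshots_by_user.get(imp["user_id"], [])
        timestamps = [ts for ts, _ in history]
        # Index of the last snapshot with ts <= impression ts, or -1.
        idx = bisect_right(timestamps, imp["ts"]) - 1
        features = history[idx][1] if idx >= 0 else {}
        row = dict(imp)
        row.update({"user_" + key: value for key, value in features.items()})
        rows.append(row)
    return rows


snapshots = {"u1": [(10, {"clicks": 1}), (20, {"clicks": 5})]}
impressions = [{"user_id": "u1", "ts": 15}, {"user_id": "u1", "ts": 25}]
for row in point_in_time_join(impressions, snapshots):
    print(row)
```

Using a snapshot taken after the impression would leak future behavior into the training data, which is why the lookup stops at the impression timestamp.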
Maintain a Dedicated Data Store
How did we go about solving these challenges?
• Starting point: pre-aggregate counters over raw data
• Every query requires rerun (parsing and joins over the raw data)
• Many additional disadvantages
• When in trouble: de-normalize!
• Use efficient and extendable serialization schema (e.g. Protobuf)
• De-normalize until you run out of space (or money)
• Useful for pipelining historic user data
• Join multiple events at write time (short term)
• Maintain a mutual key (user id, session id, page view id)
• Use a strong and scalable key-value database (e.g. C*)
• Use Columnar Storage (long term)
• Drives Machine Learning and research
• Many tools out there (Parquet, BigQuery, etc.)
• Use scalable and rich query mechanism (Spark SQL, BigQuery, Impala, etc.)
• Machine Learning programs like flat data (easy with FLATTEN, explode, user defined functions etc.)
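To make the last bullet concrete, the sketch below shows in plain Python what an explode-style flattening does to a de-normalized record: a nested page view holding a list of recommended items becomes one flat row per item, with the parent fields repeated. The record layout and field names are assumptions for illustration. In a real pipeline this step would typically be expressed in the query engine rather than in hand-written code.

```python
def explode(page_view, nested_key="items"):
    """Yield one flat row per nested item, repeating the parent fields.

    Item fields are prefixed with "item_" to avoid clashing with
    parent field names.
    """
    parent = {k: v for k, v in page_view.items() if k != nested_key}
    for item in page_view.get(nested_key, []):
        row = dict(parent)
        row.update({"item_" + k: v for k, v in item.items()})
        yield row


view = {
    "view_id": "v1",
    "user_id": "u1",
    "items": [
        {"id": "a", "clicked": True},
        {"id": "b", "clicked": False},
    ],
}
for row in explode(view):
    print(row)
```

Each output row is a self-contained training example, which is the flat shape most learning libraries expect.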
Users
Sessions
Views
Clicks
History
Post-click events
Because We Recommend…
Data is king!
Online and offline pose different challenges -> different solutions
Storage is cheap: rewrite your data for convenience
Still worried about storage? You don’t have to keep everything for every user:
Sub-sampling is a requirement when learning models
Be extremely verbose for small parts of the data
For fast research: save it again for sample of the users, views, etc.
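One common way to keep a verbose copy of the data for only a sample of users is to hash the user id into a bucket, so the same users are selected on every run and across every event type. The sketch below assumes SHA-256 and a hypothetical `salt` value. The talk does not say how the sampling is implemented.

```python
import hashlib


def in_sample(user_id, rate, salt="research-sample"):
    """Return True if user_id falls into a stable sample of size `rate`.

    The decision depends only on (salt, user_id), so every event of a
    sampled user is kept, whichever job or day processes it.
    """
    digest = hashlib.sha256((salt + ":" + user_id).encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


users = ["user-%d" % i for i in range(10000)]
kept = [u for u in users if in_sample(u, rate=0.1)]
print(len(kept) / len(users))  # close to 0.1
```

Changing the salt draws a different, independent sample, which is useful when one sample is reserved for research and another for evaluation.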
Thank You!
Questions?

Essentials of Automations: Optimizing FME Workflows with Parameters
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 

Big data and machine learning / Gil Chamiel

  • 1. Big Data and Machine Learning: These Lessons were Written in Clicks. Gil Chamiel, Director of Data Science and Algorithms Engineering
  • 2. You’ve Seen Us Before: Enabling people to discover information at that moment when they’re likely to engage
  • 3. Taboola in Numbers
      – 750M monthly unique users
      – 500K+ requests/sec
      – 15B+ recommendations/day
      – 17TB+ daily data
      – 52% mobile traffic, 48% desktop traffic
      – US desktop users reached, 12/2015 (reach by property): Google Ad Network 95.5%, Taboola 87.8%, Google Sites 86.2%, Facebook 61.5%, Yahoo Sites 60.3%, Outbrain 56.6%
      – A typical US user sees a Taboola widget at least twice a day
  • 4. Taboola’s Discovery Platform: Traffic Acquisition, Business Dev., Sponsored Content, Editorial, Newsroom, Sales, Native Ads, Audience Dev., Product, Personalization, Data & Insights
  • 5. The Recommendation Engine: Context Metadata; Region-based Location Information; User Behavior Data; User Consumption Groups; Social (Facebook / Twitter API)
  • 7. The Taboola Data Culture: One-stop shop for all data needs to support our constant offensive battle. Sea of Data: Data for Machine Learning, User Behavior Analysis, System Behavior Analysis, Business Analysis, Data Driven OPS
  • 8. Machine Learning: The Basics. Predict User Engagement with Recommended Content (Offline, Online). Bayesian Inference, Linear Models, Gradient Boosted Trees, Factorization Machines, Deep Neural Networks
  • 9. Machine Learning: Circular Data Pipeline. Diagram labels: Input, “Regular” Program, Output; Input, Train, Output, Model, Predict
  • 10. Offline vs. Online
      • Efficient research can only be done offline
      • Real effect can only be validated online (and we A/B test like crazy)
      • Flexibility and ease of use => fast validation of new ideas
      • "Deep Neural Networks for YouTube Recommendations", RecSys ’16
      • "Wide & Deep Learning for Recommender Systems", CoRR abs/1606.07792 (2016)
  • 11. Maintaining Data for Online Predictions
  • 12. Maintaining Data for Online Predictions
      • Cookies
          – Easy and super distributed
          – Difficult to maintain (sustainability)
          – Updates are online only (and bootstrapping is hard)
          – Cannot be reached offline
          – Limited storage
          – Increases network latency and costs
          – Not so great with out-of-order events
      • Server-Side Data Counters
          – Requires high-performance NoSQL database technology (Cassandra, HBase, ScyllaDB, etc.)
          – Easy to bootstrap data calculated offline or upload data from other sources
          – Less limited on storage (up to $$$)
          – Easy on online reads (usually not a lot of data)
          – Read before write (counter implementations are dodgy)
          – Fixed set of counters and aggregations (early commitment)
  • 13. Maintaining Data for Online Predictions
      • Saving Individual Events
          – Let the “future you” decide on how to aggregate
          – De-normalize to your liking (tradeoff between computation time and latency/storage)
          – No read before write (and non-blocking)
          – Reads are extremely expensive
      • Time Series Data Modeling
          – Control over read latency
          – Useful for time dependent modeling (e.g. decay counters)
          – May still be a challenge (mastering DB internals is a must)
      • Is this enough for offline analysis and research?
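To make the "decay counters" idea concrete, here is a minimal Python sketch. It assumes events are saved individually as timestamps and that an exponential half-life decay is wanted; the function name `decayed_count` and its parameters are hypothetical and do not come from the talk.

```python
def decayed_count(event_times, now, half_life):
    """Count events, weighting each by 0.5 ** (age / half_life).

    event_times: iterable of event timestamps (any numeric unit).
    now: the time at which the counter is read.
    half_life: age at which an event counts as half an event (same unit).
    Events with a timestamp after `now` are ignored.
    """
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return sum(0.5 ** ((now - t) / half_life) for t in event_times if t <= now)


# The same raw events can be aggregated later with different half-lives,
# which is the "let the future you decide" property of saving raw events.
clicks = [0, 10]
slow = decayed_count(clicks, now=10, half_life=10)
fast = decayed_count(clicks, now=10, half_life=5)
```

Because the aggregation runs at read time over every stored event, this also illustrates why reads become expensive when only raw events are kept.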
  • 14. Offline: Data for Machine Learning Pipelining and Research
  • 15. Data for ML Pipelining and Research: The Challenge
      • Objective: A complete picture of the user and context on every impression!
      • Challenges:
          – Events occur at different times
          – Historic user data must be true to the time of impression
          – Fast querying by hundreds of analysts and engineers
          – Machine learning programs like their data flat
      • What is the real issue?
          – Joins between various events to form a logical entity (user, session, page view)
          – Joins between historic user data and current impression data
      • Maintain a Dedicated Data Store
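The requirement that historic user data be "true to the time of impression" amounts to an as-of join. The following is a minimal Python sketch, assuming each user's history is a list of `(timestamp, profile)` snapshots sorted by timestamp; the names `snapshot_as_of` and `flatten_impressions` and the record fields are hypothetical.

```python
from bisect import bisect_right


def snapshot_as_of(snapshots, ts):
    """Return the latest profile whose timestamp is <= ts, or None.

    snapshots: list of (timestamp, profile_dict), sorted by timestamp.
    """
    times = [t for t, _ in snapshots]
    i = bisect_right(times, ts)
    return snapshots[i - 1][1] if i else None


def flatten_impressions(impressions, history):
    """Attach to each impression the user profile as it was at that time.

    Produces flat rows, with profile fields prefixed by "user_".
    """
    rows = []
    for imp in impressions:
        profile = snapshot_as_of(history.get(imp["user_id"], []), imp["ts"]) or {}
        rows.append({**imp, **{f"user_{k}": v for k, v in profile.items()}})
    return rows


history = {"u1": [(100, {"clicks": 1}), (200, {"clicks": 5})]}
impressions = [
    {"user_id": "u1", "ts": 150},
    {"user_id": "u1", "ts": 250},
    {"user_id": "u2", "ts": 50},
]
rows = flatten_impressions(impressions, history)
```

Using the snapshot at or before the impression time avoids leaking later behavior into training rows. Whether a snapshot stamped exactly at the impression time should count is a design choice; this sketch includes it.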
  • 16. How did we go about solving these challenges?
      • Starting point: pre-aggregate counters over raw data
          – Every query requires a rerun (parsing and joins over the raw data)
          – Many additional disadvantages
      • When in trouble: de-normalize!
          – Use an efficient and extendable serialization schema (e.g. Protobuf)
          – De-normalize until you run out of space (or money)
          – Useful for pipelining historic user data
      • Join multiple events at write time (short term)
          – Maintain a mutual key (user id, session id, page view id)
          – Use a strong and scalable key-value database (e.g. C*)
      • Use Columnar Storage (long term)
          – Drives Machine Learning and research
          – Many tools out there (Parquet, BigQuery, etc.)
          – Use a scalable and rich query mechanism (Spark SQL, BigQuery, Impala, etc.)
          – Machine Learning programs like flat data (easy with FLATTEN, explode, user defined functions etc.)
      • Diagram labels: Users, Sessions, Views, Clicks, History, Post-click events
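As a rough sketch of joining events at write time under a mutual key and then flattening them for a learner: the Python below uses an in-memory dict as a stand-in for a key-value database. The class `PageViewStore`, its methods, and the event fields are hypothetical, and a real store would behave differently with respect to durability and concurrency.

```python
from collections import defaultdict


class PageViewStore:
    """In-memory stand-in for a key-value store keyed by page view id."""

    KINDS = ("impressions", "clicks")

    def __init__(self):
        self._rows = defaultdict(lambda: {kind: [] for kind in self.KINDS})

    def write(self, page_view_id, kind, event):
        """Append an event under its page view id; arrival order is free."""
        if kind not in self.KINDS:
            raise ValueError(f"unknown event kind: {kind}")
        self._rows[page_view_id][kind].append(event)

    def flat_rows(self):
        """Yield one flat row per impression, labeled with click / no click."""
        for page_view_id, row in self._rows.items():
            clicked = {c["item_id"] for c in row["clicks"]}
            for imp in row["impressions"]:
                yield {
                    "page_view_id": page_view_id,
                    "item_id": imp["item_id"],
                    "clicked": imp["item_id"] in clicked,
                }


store = PageViewStore()
store.write("pv1", "clicks", {"item_id": "a"})       # click arrives first
store.write("pv1", "impressions", {"item_id": "a"})
store.write("pv1", "impressions", {"item_id": "b"})
flat = list(store.flat_rows())
```

Because all events for a page view land under the same key, no join over raw logs is needed at query time. `flat_rows` plays the role that FLATTEN or explode would play in a SQL engine.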
  • 17. Because We Recommend…
      • Data is king!
      • Online and offline pose different challenges -> different solutions
      • Storage is cheap: rewrite your data for convenience
      • Still worried about storage? You don’t have to keep everything for every user:
          – Sub-sampling is a requirement when learning models
          – Be extremely verbose for small parts of the data
          – For fast research: save it again for a sample of the users, views, etc.
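One common way to keep data "for a sample of the users" is deterministic hash-based sampling, so that a given user is either always in or always out of the sample. This is a minimal Python sketch of that technique, not necessarily what was used in the talk; the function name `in_sample` and the percent parameter are hypothetical.

```python
import hashlib


def in_sample(user_id, percent):
    """Return True if this user falls into a stable `percent`% sample.

    The decision depends only on the user id, so every event of a sampled
    user is kept and the sample is the same across jobs and machines.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


# Keep verbose data only for roughly 10% of users.
users = [f"user-{i}" for i in range(10000)]
sampled = [u for u in users if in_sample(u, 10)]
```

Sampling by user rather than by event keeps each sampled user's history complete, which matters when models depend on per-user behavior over time.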