Taboola's Road to Scale
A Focus on Data &
Apache Spark
Engine Focused on Maximizing CTR & Post Click Engagement
• Collaborative Filtering – Bucketed Consumption Groups
• Geo – Region-based Recommendations
• Context – Metadata
• Social – Facebook/Twitter API
• User Behavior – Cookie Data
Largest Content Discovery and Monetization Network
• 60M monthly unique users
• 130B monthly recommendations
• 1M+ sourced content providers
• 1M+ sourced content items
What Does it Mean?
• Zero downtime allowed
– Every single component is fault tolerant
• 5 Data Centers across the globe
• Terabytes of data / day (many billions of events)
• Data must be processed and analyzed in real time, for example:
– Real-time, per user content recommendations
– Real-time expenditure reports
– Automated campaign management
– Automated recommendation algorithms calibration
– Real-time analytics
Taboola 2007
• Events and logs (rawdata) written directly to DB
• Recs are read from DB
• Crashed when CNN launched
[Diagram: Frontend, RecServer]
Taboola 2007.5
• Same as before, but without direct write to DB
• Switching to bulk load
• But – very basic reporting, not scalable
[Diagram: Frontend, RecServer, Bulk Load]
Taboola 2008
• Introduced semi-real-time event parsing services: Session Parser and Session Analyzer
• Divided analysis work by unit (session)
• Files were pushed from RecServer(s) to Backend processing
• Files are gzipped textual INSERT statements
• But – not real time enough
[Diagram: Frontend, NFS, Backend; RecServer, SessionParser, SessionAnalyzer; flows labeled Write rawdata, Read rawdata, Write session files, Read session files, Write Summarized Data]
Taboola 2010
• Made a leap towards real-time stream processing
• Unified Session Parser and Session Analyzer into an in-memory service (without going through disk)
• Made dramatic optimizations to memory allocation and data models
• Failure-safe architecture - can endure data delays and front-end server malfunctions
• No direct DB access - key for performance; bulk loading is used only for loading hourly data
[Diagram: Frontend, NFS, Backend; RecServer, Session Parser + Analyzer; flows labeled Write rawdata, Read rawdata, Write Hourly Data (Bulk Loading)]
Taboola 2011-2013
• Roughly same architecture
• Increasing backend growth by scaling in (monster machines)
• Introduced real-time analyzers
• Introduced sharding
• Moved to lsync based file sync
• Introduced Top Reports capabilities
[Diagram: Frontend, Lsync, Backend; RecServer, Session Parser + Analyzer; flows labeled Write rawdata, Read rawdata, Write Hourly Data (Bulk Loading)]
Taboola 2014
• Spark as the distributed engine for data analysis
(and distributed computing in general)
• All critical data path already moved to Spark
• New data modelling based on ProtoStuff(Buf)
• Easily scalable
• Easy ad hoc analysis/research
About Spark
• Open Sourced
• Apache top level project (since Feb. 19th)
• Databricks - a commercial company that supports it
• Hadoop-compatible computing engine
• Can run side-by-side with Hadoop/Hive on the same
data
• Drastically faster than Hadoop through in-memory
computing
• Multiple H/A options - standalone cluster, Apache Mesos and ZooKeeper, or YARN
• With over 100 developers and 25 companies, one of
the most active communities in big data
Spark Development Community
Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)
Past 6 months: more active devs than Hadoop MapReduce!
The Spark Community
Spark Performance
[Figure: three benchmark charts. SQL: response time (s) for Hive, Impala (disk), Impala (mem), Shark (disk), Shark (mem). Graph: response time (min) for Hadoop, Giraph, GraphX. Streaming: throughput (MB/s/node) for Storm, Spark.]
Spark API
• Simple to write through easy APIs in
Java, Scala and Python
• The same analytics code can be used for both
streaming data and offline data processing
Spark Key Concepts
Resilient Distributed
Datasets
• Collections of objects spread
across a cluster, stored in RAM
or on Disk
• Built through parallel
transformations
• Automatically rebuilt on failure
• Immutable
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Write programs in terms of
transformations on distributed
datasets
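To make the transformation/action split concrete, here is a toy, single-machine sketch in plain Python. It is not Spark: the `ToyRDD` class is a hypothetical stand-in whose method names merely mirror the Spark ones. Transformations only record work in a lineage, and nothing runs until an action replays that lineage, which is also roughly how a lost partition can be rebuilt.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable and lazily evaluated."""

    def __init__(self, source, steps=()):
        self._source = source  # callable returning a fresh iterator
        self._steps = steps    # recorded transformations (the lineage)

    # Transformations: return a new ToyRDD, run nothing.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    # Actions: replay the lineage from the source and produce a result.
    def _compute(self):
        data = self._source()
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


numbers = ToyRDD(lambda: iter(range(10)))
squares = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
print(squares.collect())  # [0, 4, 16, 36, 64]
print(squares.count())    # 5
```

Because each transformation returns a new object and leaves the old one untouched, `numbers` can still be reused after `squares` is defined.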
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")  # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])  # third tab-separated field
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
. . .
Full-text search of Wikipedia
• 60 GB on 20 EC2 machines
• 0.5 s vs. 20 s for on-disk
Task Scheduler
• General task graphs
• Automatically pipelines functions
• Data locality aware
• Partitioning aware to avoid shuffles
[Figure: example task graph of RDDs A-F split into Stages 1-3, with map, filter, groupBy and reduceByKey operations; the legend marks RDDs and cached partitions.]
Software Components
• Spark runs as a library in your program (1 instance per app)
• Runs tasks locally or on cluster
– Mesos, YARN or standalone mode
• Accesses storage systems via Hadoop InputFormat API
– Can use HBase, HDFS, S3, …
[Figure: your application holds a SparkContext, which uses local threads or a cluster manager; each worker runs a Spark executor; HDFS or other storage sits underneath.]
System Architecture & Data Flow @ Taboola
[Diagram: FE Servers, Driver + Consumers, Spark Cluster, MySQL Cluster, C* Cluster]
Execution Graph @ Taboola
• Data start point (dates, etc.)
    rdd1 = Context.parallelize([data])
• Loading data from external sources (Cassandra, MySQL, etc.)
    rdd2 = rdd1.mapPartitions(loadfunc())
• Aggregating the data and storing results
    rdd3 = rdd2.reduceByKey(reduceFunc())
• Saving the results to a DB
    rdd4 = rdd3.mapPartitions(saverfunc())
• Executing the above graph by forcing an output operation
    rdd4.count()
Cassandra as a Distributed Storage
• Event Log Files saved as blobs to a dedicated keyspace
• C* tables holding the Event Log Files are partitioned by day, with a new table per day. This makes maintenance easier and loading into Spark simpler
• Using Astyanax driver + CQL3
– Recipe to load all keys of a table very fast (hundreds of thousands / sec)
– Split by keys and then load data by key in batches – in parallel partitions
• Wrote a Hadoop InputFormat that supports loading this into a lines RDD<String>
– The DataStax InputFormat had issues and at the time was not formally supported
• Worked well, but ended up not using it – instead using mapPartitions
– Very simple, no overhead, no need to be tied to Hadoop
– Will probably use the InputFormat when we deploy a Shark solution
• Plans to open source all this
Table userevent_2014-02-19
Key (String) | Data (blob)
GUID (originally log file name) | Gzipped file
GUID | Gzipped file
… | …

Table userevent_2014-02-20
Key (String) | Data (blob)
GUID (originally log file name) | Gzipped file
GUID | Gzipped file
… | …
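The "table per day, split keys into batches, load in parallel partitions" recipe can be sketched in plain Python. This is only an illustration of the bookkeeping, not the Astyanax/CQL3 code the slides refer to: the function names, the batch size and the round-robin assignment are all assumptions, and the actual Cassandra reads are left out.

```python
from datetime import date, timedelta


def daily_table_names(start, end, prefix="userevent_"):
    """Table names for each day in [start, end], e.g. userevent_2014-02-19."""
    days = (end - start).days
    return [f"{prefix}{start + timedelta(days=n)}" for n in range(days + 1)]


def split_into_batches(keys, batch_size):
    """Chunk row keys into fixed-size batches (the last one may be shorter)."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]


def assign_to_partitions(batches, num_partitions):
    """Round-robin the batches across partitions so they can load in parallel."""
    partitions = [[] for _ in range(num_partitions)]
    for index, batch in enumerate(batches):
        partitions[index % num_partitions].append(batch)
    return partitions


tables = daily_table_names(date(2014, 2, 19), date(2014, 2, 20))
print(tables)  # ['userevent_2014-02-19', 'userevent_2014-02-20']

keys = [f"guid-{n}" for n in range(10)]  # hypothetical row keys
batches = split_into_batches(keys, 4)    # batch sizes: 4, 4, 2
partitions = assign_to_partitions(batches, 2)
```

In a Spark job, each entry of `partitions` would be the work handed to one `mapPartitions` call, which then fetches its keys from Cassandra batch by batch.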
Sample – Click Counting for Campaign Stopping
1. mapPartitions – mapping from strings to objects with a predesigned click key
2. reduceByKey – removing duplicate clicks (see next slide)
3. map – switch keys to a campaign+day key
4. reduceByKey – aggregate the data by campaign+day
Campaign Stopping – Removing Dup Clicks
• When more than one click is found from the same user on the same item, keep only the oldest
• Using accumulators to track the number of duplicates
• Note – not Spark specific
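Since the deduplication logic is not Spark specific, the four click-counting steps can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the tab-separated log layout, the `Click` fields, the `(user, item)` click key, and the local `reduce_by_key` helper that stands in for Spark's `reduceByKey`. A plain counter plays the role of the accumulator.

```python
from collections import namedtuple

# Hypothetical click record; the real log format is not given in the slides.
Click = namedtuple("Click", "user item campaign day timestamp")


def parse_click(line):
    """Step 1: turn a tab-separated log line into (click_key, Click)."""
    user, item, campaign, day, timestamp = line.split("\t")
    return (user, item), Click(user, item, campaign, day, int(timestamp))


def reduce_by_key(pairs, reducer):
    """Local stand-in for reduceByKey: fold values that share a key."""
    result = {}
    for key, value in pairs:
        result[key] = reducer(result[key], value) if key in result else value
    return result


def count_clicks(lines):
    duplicates = 0  # plays the role of a Spark accumulator

    def keep_oldest(a, b):
        nonlocal duplicates
        duplicates += 1
        return a if a.timestamp <= b.timestamp else b

    # Steps 1-2: key by (user, item) and drop duplicate clicks.
    unique = reduce_by_key(map(parse_click, lines), keep_oldest)
    # Step 3: re-key each surviving click by (campaign, day).
    rekeyed = (((c.campaign, c.day), 1) for c in unique.values())
    # Step 4: sum the clicks per (campaign, day).
    totals = reduce_by_key(rekeyed, lambda x, y: x + y)
    return totals, duplicates


lines = [
    "u1\ti1\tc1\t2014-02-19\t100",
    "u1\ti1\tc1\t2014-02-19\t50",  # duplicate: same user and item
    "u2\ti1\tc1\t2014-02-19\t70",
    "u1\ti2\tc2\t2014-02-19\t80",
]
totals, duplicates = count_clicks(lines)
print(totals)      # {('c1', '2014-02-19'): 2, ('c2', '2014-02-19'): 1}
print(duplicates)  # 1
```

Keeping `keep_oldest` as an ordinary function is also what makes it testable without a Spark cluster.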
Our Deployment
• 16 nodes, each:
– 24 cores
– 256 GB RAM
– 6 × 1 TB SSD disks – JBOD configuration
– 10G Ethernet
• Total Cluster Power
– 4096 GB RAM
– 384 CPU cores
– 96 TB storage (effective space is less; Cassandra keyspaces are defined with replication factor 3)
• Symmetric Deployment
– Mesos + Spark
– Cassandra
• More
– Rabbit MQ on 3 nodes
– ZooKeeper on 3 nodes
– MySQL cluster outside this cluster
• Loads & processes ~1 terabyte (unzipped data) in ~3 minutes
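The cluster totals follow directly from the per-node figures, and the same arithmetic gives a rough per-node throughput. The sketch below assumes 1 TB = 1,000,000 MB and treats the replication-factor division as an upper bound on unique data, since the disks also hold non-Cassandra files.

```python
NODES = 16
CORES_PER_NODE = 24
RAM_GB_PER_NODE = 256
DISKS_PER_NODE = 6
DISK_TB = 1
REPLICATION_FACTOR = 3

total_cores = NODES * CORES_PER_NODE               # 384
total_ram_gb = NODES * RAM_GB_PER_NODE             # 4096
raw_storage_tb = NODES * DISKS_PER_NODE * DISK_TB  # 96

# Upper bound on unique data if all storage held RF=3 Cassandra data.
effective_storage_tb = raw_storage_tb / REPLICATION_FACTOR  # 32.0

# ~1 TB of unzipped data in ~3 minutes.
cluster_mb_per_s = 1_000_000 / (3 * 60)   # about 5556 MB/s across the cluster
node_mb_per_s = cluster_mb_per_s / NODES  # about 347 MB/s per node

print(total_cores, total_ram_gb, raw_storage_tb, effective_storage_tb)
print(round(cluster_mb_per_s), round(node_mb_per_s))
```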
Things that work well with Spark
(from our experience)
• Very easy to code complex jobs
– Harder than SQL, but better than other Map Reduce options
– Simple concepts, “small” API
• Easy to Unit Test
– Runs in local mode, so ideal for micro E2E tests
– Each mapper/reducer can be unit tested without Spark – if you
do not use anonymous classes
• Very resilient
• Can read/write to/from any data source, including
RDBMS, Cassandra, HDFS, local files, etc.
• Great monitoring
• Easy to deploy & upgrade
• Blazing fast
Things that do not work that well
(from our experience)
• Long (endless) running tasks require some workarounds
– Temp files - Spark creates a lot of files in
spark.local.dir, requires periodic cleanup
– Use spark.cleaner.ttl for long running tasks
• Spark Streaming – not fully mature when we tested
– Some edge cases can cause loss of data
– Sliding window / batch model does not fit our needs
• We always load some history to deal with late arriving data
• State management left to the user and not trivial
– BUT – we were able to easily implement a bulletproof, home-grown, near-real-time streaming solution with a minimal amount of code
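One possible reading of "we always load some history" is that every run reprocesses the current period plus a few earlier ones, so events that arrive late are picked up on the next run. The sketch below assumes hourly granularity and a hypothetical `hours_to_load` helper; the slides do not say how much history is reloaded or how the periods are defined.

```python
from datetime import datetime, timedelta


def hours_to_load(now, history_hours):
    """Return the current hour plus `history_hours` earlier ones, oldest first."""
    current = now.replace(minute=0, second=0, microsecond=0)
    return [current - timedelta(hours=h) for h in range(history_hours, -1, -1)]


window = hours_to_load(datetime(2014, 2, 20, 10, 37), history_hours=2)
print([h.strftime("%Y-%m-%d %H:%M") for h in window])
# ['2014-02-20 08:00', '2014-02-20 09:00', '2014-02-20 10:00']
```

Recomputing a fixed window this way trades some repeated work for not having to keep streaming state between runs.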
General / Optimization Tips
• Use Spark Accumulators to collect and report
operational data
• 10G Ethernet
• Multiple SSD disks per node, JBOD configuration
• A lot of memory for the cluster
Technologies Taboola Uses for Spark
• Spark – computing cluster
• Mesos – cluster resource manager
– Better resource allocation (coarse grained) for Spark
• ZooKeeper – distributed coordination
– Enables multi-master for Mesos & Spark
• Cassandra
– Distributed Data Store
• Monitoring – http://metrics.codahale.com/
Attributions
Many of the general Spark slides were taken from the Databricks Spark Summit 2013 slides.
There are great materials at:
https://spark.apache.org/
http://spark-summit.org/summit-2013/
Thank You!
tal.s@taboola.com

办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 

Taboola's Transition to Apache Spark for Real-Time Analytics

  • 1. Taboola's Road to Scale A Focus on Data & Apache Spark
  • 2. Collaborative Filtering Bucketed Consumption Groups Geo Region-based Recommendations Context Metadata Social Facebook/Twitter API User Behavior Cookie Data Engine Focused on Maximizing CTR & Post Click Engagement
  • 3. Largest Content Discovery and Monetization Network 60M monthly unique users 130B Monthly recommendations 1M+ sourced content providers 1M+ sourced content items
  • 4. What Does it Mean? • Zero downtime allowed – Every single component is fault tolerant • 5 Data Centers across the globe • Terabytes of data / day (many billion events) • Data must be processed and analyzed in real time, for example: – Real-time, per user content recommendations – Real-time expenditure reports – Automated campaign management – Automated recommendation algorithms calibration – Real-time analytics
  • 5. Taboola 2007 • Events and logs (rawdata) written directly to DB • Recs are read from DB • Crashed when CNN launched [Diagram: Frontend, RecServer]
  • 6. Taboola 2007.5 • Same as before, but without direct write to DB • Switching to bulk load • But – Very Basic Reporting, not scalable Frontend Bulk Load RecServer
  • 7. Taboola 2008 • Introduced semi-real-time event parsing services: Session Parser and Session Analyzer • Divided analysis work by unit (session) • Files were pushed from RecServer(s) to Backend processing • Files are gzipped textual INSERT statements • But – not real time enough [Diagram: Frontend (RecServer) writes rawdata to NFS; Backend SessionParser reads rawdata and writes session files; SessionAnalyzer reads session files and writes summarized data]
  • 8. Taboola 2010 • Made a leap towards real-time stream processing • Unified Session Parser and Session Analyzer to an in-memory service (without going through disk) • Made dramatic optimization to memory allocation and data models • Failure safe architecture - can endure data delays, front-end servers’ malfunction • No direct DB access - key for performance, only using bulk loading for loading hourly data [Diagram: Frontend (RecServer) writes rawdata to NFS; Backend Session Parser + Analyzer reads rawdata and writes hourly data (bulk loading)]
  • 9. Taboola 2011-2013 • Roughly same architecture • Increasing backend growth by scaling in (monster machines) • Introduced real-time analyzers • Introduced sharding • Moved to lsync based file sync • Introduced Top Reports capabilities Frontend Lsync Backend RecServer Session Parser + Analyzer Write Hourly Data (Bulk Loading) Write rawdata Read rawdata
  • 10. Taboola 2014 • Spark as the distributed engine for data analysis (and distributed computing in general) • All critical data path already moved to Spark • New data modelling based on ProtoStuff(Buf) • Easily scalable • Easy ad hoc analysis/research
  • 11. About Spark • Open Sourced • Apache top level project (since Feb. 19th) • DataBricks - A commercial company that supports it • Hadoop-compatible computing engine • Can run side-by-side with Hadoop/Hive on the same data • Drastically faster than Hadoop through in-memory computing • Multiple H/A options - standalone cluster, Apache Mesos and ZooKeeper or YARN
  • 12. Spark Development Community • With over 100 developers and 25 companies, one of the most active communities in big data • Comparison: Storm (48), Giraph (52), Drill (18), Tez (12) • Past 6 months: more active devs than Hadoop MapReduce!
  • 15. Spark API • Simple to write through easy APIs in Java, Scala and Python • The same analytics code can be used for both streaming data and offline data processing
  • 16. Spark Key Concepts • Write programs in terms of transformations on distributed datasets • Resilient Distributed Datasets: – Collections of objects spread across a cluster, stored in RAM or on Disk – Built through parallel transformations – Automatically rebuilt on failure – Immutable • Operations: – Transformations (e.g. map, filter, groupBy) – Actions (e.g. count, collect, save)
  • 17. Example: Log Mining • Load error messages from a log into memory, then interactively search for various patterns • lines = spark.textFile("hdfs://..."); errors = lines.filter(lambda s: s.startswith("ERROR")); messages = errors.map(lambda s: s.split("\t")[2]); messages.cache(); messages.filter(lambda s: "mysql" in s).count(); messages.filter(lambda s: "php" in s).count(); . . . (lines is the base RDD; errors is a transformed RDD) • Full-text search of Wikipedia: 60 GB on 20 EC2 machines, 0.5 sec vs. 20 s for on-disk
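The log-mining steps can be mirrored in plain Python to show what each stage computes. This is only a sketch without Spark: the sample log lines, the three-field tab-separated layout, and the function name `error_messages` are assumptions made for illustration. In Spark the same steps would run as RDD transformations spread across the cluster.

```python
# Plain-Python stand-in for the slide's filter -> map -> cache -> count steps.
# The sample lines and the (level, timestamp, message) layout are assumed.
SAMPLE_LINES = [
    "ERROR\t2014-02-19 10:00:01\tmysql connection refused",
    "INFO\t2014-02-19 10:00:02\trequest served",
    "ERROR\t2014-02-19 10:00:03\tphp fatal error in handler",
    "ERROR\t2014-02-19 10:00:04\tmysql query timeout",
]


def error_messages(lines):
    """Keep ERROR lines and return their third tab-separated field."""
    errors = [s for s in lines if s.startswith("ERROR")]
    return [s.split("\t")[2] for s in errors]


# Materialising the list once plays the role of messages.cache():
# both searches below reuse it instead of re-reading the log.
messages = error_messages(SAMPLE_LINES)
mysql_count = sum(1 for m in messages if "mysql" in m)
php_count = sum(1 for m in messages if "php" in m)
print(mysql_count, php_count)
```

Note that the split uses a real tab character (`"\t"`); splitting on the letter `t` would cut the message text itself.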
  • 18. Task Scheduler • General task graphs • Automatically pipelines functions • Data locality aware • Partitioning aware to avoid shuffles [Figure: task graph over RDDs A to F, divided into Stages 1 to 3, with groupBy, map, filter and reduceByKey operations; the legend marks RDDs and cached partitions]
  • 19. Software Components • Spark runs as a library in your program (1 instance per app) • Runs tasks locally or on cluster – Mesos, YARN or standalone mode • Accesses storage systems via Hadoop InputFormat API – Can use HBase, HDFS, S3, … [Diagram: your application holds a SparkContext, which uses local threads or a cluster manager; workers run Spark executors; HDFS or other storage underneath]
  • 20. System Architecture & Data Flow @ Taboola [Diagram: FE Servers, C* Cluster, Spark Cluster, Driver + Consumers, MySQL Cluster]
  • 21. Execution Graph @ Taboola • Data start point (dates, etc.): rdd1 = Context.parallelize([data]) • Loading data from external sources (Cassandra, MySQL, etc.): rdd2 = rdd1.mapPartitions(loadfunc()) • Aggregating the data and storing results: rdd3 = rdd2.reduceByKey(reduceFunc()) • Saving the results to a DB: rdd4 = rdd3.mapPartitions(saverfunc()) • Executing the above graph by forcing an output operation: rdd4.count()
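The final `count()` is needed because transformations are lazy: nothing is loaded or saved until an action forces evaluation. The sketch below imitates that behaviour with Python generators rather than Spark. The names `load_events`, `reduce_by_key`, `save_results`, the sample data, and the in-memory `SAVED` list are hypothetical stand-ins for the load, reduce, and saver functions named on the slide.

```python
from collections import defaultdict

SAVED = []  # hypothetical stand-in for the results database


def load_events(days):
    """Lazily yield (key, value) pairs for each day (assumed sample data)."""
    for day in days:
        yield (day, 1)
        yield (day, 2)


def reduce_by_key(pairs):
    """Sum values per key, yielding one (key, total) pair per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    yield from totals.items()


def save_results(pairs):
    """Append each pair to SAVED as a side effect and pass it through."""
    for pair in pairs:
        SAVED.append(pair)
        yield pair


# Building the pipeline runs nothing yet, like chaining RDD transformations.
pipeline = save_results(reduce_by_key(load_events(["2014-02-19", "2014-02-20"])))
saved_before = len(SAVED)

# Consuming the pipeline plays the role of rdd4.count().
count = sum(1 for _ in pipeline)
print(saved_before, count, SAVED)
```

Until the pipeline is consumed, `SAVED` stays empty, which is the same reason a Spark job whose saving happens inside `mapPartitions` does nothing until an action is invoked.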
  • 22. Cassandra as a Distributed Storage • Event Log Files saved as blobs to a dedicated keyspace • C* Tables holding the Event Log Files are partitioned by day – new Table per day. This way, it is easier for maintenance and simpler to load into Spark • Using Astyanax driver + CQL3 – Recipe to load all keys of a table very fast (hundreds of thousands / sec) – Split by keys and then load data by key in batches – in parallel partitions • Wrote Hadoop InputFormat that supports loading this into a lines RDD<String> – The DataStax InputFormat had issues and at the time was not formally supported • Worked well, but ended up not using it – instead using mapPartitions – Very simple, no overhead, no need to be tied to Hadoop – Will probably use the InputFormat when we deploy a Shark solution • Plans to open source all this • Table layout (one table per day, e.g. userevent_2014-02-19, userevent_2014-02-20): Key (String) = GUID (originally log file name); Data (blob) = gzipped file
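The table-per-day naming and the "split by keys, then load in batches" recipe can be sketched in a few lines. This is an illustration only and does not touch Cassandra or Astyanax: `table_for_day`, `split_keys`, the batch size, and the `guid-N` keys are hypothetical, and only the `userevent_YYYY-MM-DD` naming pattern comes from the slide.

```python
from datetime import date


def table_for_day(day):
    """Return the per-day table name, following the userevent_YYYY-MM-DD pattern."""
    return "userevent_" + day.isoformat()


def split_keys(keys, batch_size):
    """Split a list of row keys into consecutive batches of at most batch_size."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]


keys = ["guid-%d" % i for i in range(5)]  # hypothetical row keys
batches = split_keys(keys, 2)
print(table_for_day(date(2014, 2, 19)), batches)
```

Each batch could then be handed to a separate partition, so that the blobs are fetched in parallel inside `mapPartitions`.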
  • 23. Sample – Click Counting for Campaign Stopping 1. mapPartitions – mapping from strings to objects with a pre designed click key 2. reduceByKey – removing duplicate clicks (see next slide) 3. Map – switch keys to a campaign+day key 4. reduceByKey – aggregate the data by campaign+day
  • 24. Campaign Stopping – Removing Dup Clicks • When more than 1 click found from the same user on the same item, leave only the oldest • Using accumulators to track duplicate numbers • Notice – not Spark specific
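The click-counting steps and the de-duplication rule can be sketched in plain Python. This is not the deck's code: the record layout, the choice of (user, item) as the click key, and the function names are assumptions, and a plain counter stands in for the Spark accumulator that tracks duplicates.

```python
from collections import defaultdict

# Assumed record layout: (user, item, campaign, day, timestamp).
CLICKS = [
    ("u1", "item-a", "camp-1", "2014-02-19", 100),
    ("u1", "item-a", "camp-1", "2014-02-19", 90),   # duplicate, older
    ("u2", "item-a", "camp-1", "2014-02-19", 120),
    ("u1", "item-b", "camp-2", "2014-02-19", 130),
]


def dedup_clicks(clicks):
    """Keep the oldest click per (user, item) and count dropped duplicates."""
    oldest = {}
    duplicates = 0  # plays the role of a Spark accumulator
    for click in clicks:
        key = (click[0], click[1])
        if key in oldest:
            duplicates += 1
            if click[4] < oldest[key][4]:
                oldest[key] = click
        else:
            oldest[key] = click
    return list(oldest.values()), duplicates


def clicks_per_campaign_day(clicks):
    """Re-key by (campaign, day) and count clicks per key."""
    counts = defaultdict(int)
    for _user, _item, campaign, day, _ts in clicks:
        counts[(campaign, day)] += 1
    return dict(counts)


unique_clicks, duplicate_count = dedup_clicks(CLICKS)
counts = clicks_per_campaign_day(unique_clicks)
print(duplicate_count, counts)
```

Keeping the de-duplication in an ordinary function is in line with the slide's remark that the logic is not Spark specific, and it is also what makes such a reducer unit-testable without a cluster.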
  • 25. Our Deployment • 16 nodes, each: – 24 cores – 256 GB RAM – 6 × 1 TB SSD disks – JBOD configuration – 10G Ethernet • Total Cluster Power – 4096 GB RAM – 384 CPUs – 96 TB storage – (effective space is less, Cassandra keyspaces defined with replication factor 3) • Symmetric Deployment – Mesos + Spark – Cassandra • More – RabbitMQ on 3 nodes – ZooKeeper on 3 nodes – MySQL cluster outside this cluster • Loads & processes ~1 terabyte (unzipped data) in ~3 minutes
  • 26. Things that work well with Spark (from our experience) • Very easy to code complex jobs – Harder than SQL, but better than other Map Reduce options – Simple concepts, “small” API • Easy to Unit Test – Runs in local mode, so ideal for micro E2E tests – Each mapper/reducer can be unit tested without Spark – if you do not use anonymous classes • Very resilient • Can read/write to/from any data source, including RDBMS, Cassandra, HDFS, local files, etc. • Great monitoring • Easy to deploy & upgrade • Blazing fast
  • 27. Things that do not work that well (from our experience) • Long (endless) running tasks require some workarounds – Temp files - Spark creates a lot of files in spark.local.dir, requires periodic cleanup – Use spark.cleaner.ttl for long running tasks • Spark Streaming – not fully mature when we tested – Some edge cases can cause loss of data – Sliding window / batch model does not fit our needs • We always load some history to deal with late arriving data • State management left to the user and not trivial – BUT – we were able to easily implement a bulletproof, home-grown, near-real-time streaming solution with a minimal amount of code
  • 28. General / Optimization Tips • Use Spark Accumulators to collect and report operational data • 10G Ethernet • Multiple SSD disks per node, JBOD configuration • A lot of memory for the cluster
  • 29. Technologies Taboola Uses for Spark • Spark – computing cluster • Mesos – cluster resource manager – Better resource allocation (coarse grained) for Spark • ZooKeeper – distributed coordination – Enables multi master for mesos & spark • Cassandra – Distributed Data Store • Monitoring – http://metrics.codahale.com/
  • 30. Attributions Many of the general Spark slides were taken from the DataBricks Spark Summit 2013 slides. There are great materials at: https://spark.apache.org/ http://spark-summit.org/summit-2013/

Editor's Notes

  1. One of the most exciting things you’ll find. Growing all the time. NASCAR slide. Including several sponsors of this event; are just starting to get involved… If your logo is not up here, forgive us – it’s hard to keep up!
  2. RDD: colloquially referred to as RDDs (e.g. caching in RAM). Lazy operations to build RDDs from other RDDs. Return a result or write it to storage.
  3. Add “variables” to the “functions” in functional programming
  4. NOT a modified version of Hadoop