3. Collaborative Filtering
• Bucketed Consumption Groups
• Geo: Region-based Recommendations
• Context: Metadata
• Social: Facebook/Twitter API
• User Behavior: Cookie Data
• Engine Focused on Maximizing CTR & Post-Click Engagement
4. Largest Content Discovery and Monetization Network
• 500M Monthly Unique Users
• 220B Monthly Recommendations
• 10B+ Daily User Events
• 5TB+ Incoming Daily Data
5. What Does it Mean?
• Using Spark since 1983 (not really, but since 0.7)
• 6 Data Centers across the globe
• Dedicated Spark & Cassandra (for Spark) cluster consists of
– 2,700 cores with 18.5TB of RAM and 576TB of local SSD
storage, across 2 Data Centers
• Data must be processed and analyzed in real time, for example:
– Real-time, per user content recommendations
– Real-time expenditure reports
– Automated campaign management
– Automated recommendation algorithms calibration
– Real-time analytics
6. About “Newsroom”
• Newsroom is a real-time analytics product for editors
of news and content sites
• MVP Requirements:
– Clicks & Impressions, per position & whole page
– Performance against live baseline
– AB testing of multiple titles and thumbnails
• The mission: design, develop and deploy a full-blown
production system within 4 months of the alpha
8. Spark WHAAAT??!
• Assembled an ad-hoc task force to design, develop & deploy
• We already had a very good experience with Spark at that point,
so we decided to build the new product around Spark
• We now have many live production publishers using Newsroom
exclusively (weather.com, theblaze, tribune, college humor and
many others) and usage is growing
• Newsroom is mission critical
– Clients call immediately if there is any downtime
– Without it, they are "flying blind"
13. System Architecture & Data Flow
(diagram: Driver + Consumers, Spark Cluster, C* Cluster, FE Servers, Backstage)
14. Design Concepts
• Requirements:
– Semi real time (a few seconds latency)
– Idempotent processing / exactly once counting
– Support late and out of order data
• Implementation:
– GUID per data packet / time-based
– 1-minute batches in C* (latest batch is partial)
– Re-process each time unit repeatedly
– Run over the data in Cassandra, without using counters
– Data aggregation: Events → Minute → Hour → Baseline
• Spark Streaming was still alpha, too early to use (January 2014)
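The idempotent, exactly-once counting scheme above can be sketched in plain Java (this is an illustrative reduction, not Taboola's actual code): each packet carries a GUID, events are bucketed into 1-minute batches, and a per-batch GUID set makes re-processing the same time unit a no-op.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the design: GUID-per-packet plus 1-minute batches gives
// idempotent processing — a batch (including the latest, partial one) can be
// re-processed any number of times without double counting.
class MinuteBatchCounter {
    // minute bucket (epoch millis truncated to the minute) -> GUIDs seen in it
    private final Map<Long, Set<String>> batches = new HashMap<>();

    static long minuteBucket(long epochMillis) {
        return epochMillis - (epochMillis % 60_000L);
    }

    /** Record an event; returns true only the first time a GUID is seen in its minute. */
    public boolean record(String guid, long epochMillis) {
        return batches
                .computeIfAbsent(minuteBucket(epochMillis), k -> new HashSet<>())
                .add(guid);
    }

    /** Exactly-once count for a minute, stable under re-processing and late data. */
    public int count(long epochMillis) {
        Set<String> seen = batches.get(minuteBucket(epochMillis));
        return seen == null ? 0 : seen.size();
    }
}
```

Replaying the same packets through record() leaves every count unchanged, which is why late and out-of-order data can simply be handled by re-processing the affected minute.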
15. Spark Consumers
Multiple spark jobs using algorithmic and statistical
analysis in real time:
• Clicks and Impressions Aggregator
• Performance Analyzer
• AB Tests Manager
• Baseline Calculator
• Homepage Crawler
• More
18. Challenges
• Performance Optimizations
– DAG profiling
• Using .count() to force execution of the lazy DAG (turned on/off using a
live configuration)
– Code Profiling
• YourKit, etc.
• Debugging Errors in Production
– Local debugging on small datasets
– Remote debugging
– Extensive usage of logfiles (ELK)
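The DAG-profiling trick above relies on Spark's laziness: transformations do nothing until an action such as .count() runs, so inserting an extra .count() materializes a stage just to time it. Here is a minimal pure-Java sketch of the same idea using java.util.stream, which is lazy in the same way (the PROFILE_DAG flag is a hypothetical stand-in for the live configuration mentioned on the slide):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Illustration of profiling a lazy pipeline by forcing an intermediate
// count(), analogous to adding .count() between Spark stages.
class DagProfilingSketch {
    static final boolean PROFILE_DAG = true; // hypothetical live-config flag
    static final AtomicInteger mapCalls = new AtomicInteger();

    static long run(List<Integer> input) {
        Stream<Integer> mapped = input.stream().map(x -> {
            mapCalls.incrementAndGet(); // side effect lets us observe execution
            return x * 2;
        });
        if (PROFILE_DAG) {
            // Materialize the intermediate stage just to time it. A Java stream
            // cannot be reused after a terminal op, so we rebuild it; a Spark
            // RDD could instead be cached before the extra count().
            long t0 = System.nanoTime();
            long n = mapped.count();
            System.out.println("stage produced " + n + " records in "
                    + (System.nanoTime() - t0) + " ns");
            mapped = input.stream().map(x -> x * 2);
        }
        return mapped.filter(x -> x > 2).count();
    }
}
```

Gating the extra count() behind a flag matters because forcing a stage costs a full extra pass over the data; in production the flag stays off.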
19. Hash code pitfall
• JavaPairRDD<Key, Value>
• The Spark partitioner was the hash partitioner
• The Key was an object with an enum as a member
• Enum.hashCode() is final and returns the object's memory-based
identity hash, which is JVM dependent, so the Key's hash was JVM dependent
• Objects with the same key ended up in multiple partitions, and
reduceByKey() produced inconsistent results
• Solution: either avoid using enums in keys, or override the
hashCode method of the key object to use the
numeric or string value of the enum
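The fix can be sketched as follows (Country and the Key fields are hypothetical stand-ins; the real key class is not shown in the slides):

```java
import java.util.Objects;

// Stand-in enum for the pitfall described on the slide.
enum Country { US, UK }

class Key {
    final Country country;
    final String site;

    Key(Country country, String site) {
        this.country = country;
        this.site = site;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Key)) return false;
        Key k = (Key) o;
        return country == k.country && site.equals(k.site);
    }

    // Enum.hashCode() is final and identity-based, so hashing `country`
    // directly gives different values on different JVMs, scattering equal
    // keys across partitions. Hashing the enum's name (or ordinal) is
    // stable on every JVM.
    @Override
    public int hashCode() {
        return Objects.hash(country.name(), site);
    }
}
```

With a JVM-stable hashCode, Spark's hash partitioner sends all records with an equal Key to the same partition, and reduceByKey() becomes deterministic.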
20. Spark Usages @ Taboola
• Newsroom
• Automatic campaigns stopper / reviver
• Legacy Spark
• Spark SQL for reporting
• Algo team research
– MLlib