3. Collaborative Filtering
• Bucketed Consumption Groups
• Geo: Region-based Recommendations
• Context: Metadata
• Social: Facebook/Twitter API
• User Behavior: Cookie Data
• Engine Focused on Maximizing CTR & Post-Click Engagement
4. Largest Content Discovery and Monetization Network
• 500M Monthly Unique Users
• 220B Monthly Recommendations
• 10B+ Daily User Events
• 5TB+ Incoming Daily Data
5. What Does it Mean?
• Using Spark since 1983 (not really, but since 0.7)
• 6 Data Centers across the globe
• Dedicated Spark & Cassandra (for Spark) cluster consists of
– 2,700 cores with 18.5TB of RAM and 576TB of local SSD
storage, across 2 Data Centers
• Data must be processed and analyzed in real time, for example:
– Real-time, per user content recommendations
– Real-time expenditure reports
– Automated campaign management
– Automated recommendation algorithms calibration
– Real-time analytics
6. About “Newsroom”
• Newsroom is a real-time analytics product for editors
of news and content sites
• MVP Requirements:
– Clicks & Impressions, per position & whole page
– Performance against live baseline
– AB testing of multiple titles and thumbnails
• The mission: design, develop and deploy a full-blown
production system within 4 months of the alpha
8. Spark WHAAAT??!
• Assembled an ad-hoc task force to design, develop & deploy
• We already had a very good experience with Spark at that point,
so we decided to build the new product around Spark
• We now have many live production publishers using Newsroom
exclusively (weather.com, theblaze, tribune, college humor and
many others) and usage is growing
• Newsroom is mission critical
– Clients call immediately if there is any downtime
– Without it, they are "flying blind"
13. System Architecture & Data Flow
(diagram: Driver + Consumers, Spark Cluster, C* Cluster, FE Servers, Backstage)
14. Design Concepts
• Requirements:
– Semi real time (a few seconds latency)
– Idempotent processing / exactly once counting
– Support late and out of order data
• Implementation:
– GUID per data packet / time-based
– 1-minute batches in C* (latest batch is partial)
– Re-process each time unit repeatedly
– Run over the data in Cassandra, without using counters
– Data aggregation: Events → Minute → Hour → Baseline
• Spark Streaming was still alpha, too early to use (January 2014)
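The idempotent, exactly-once counting scheme above can be sketched in plain Java (this is an illustrative reduction, not Taboola's actual code): each packet carries a GUID, events are bucketed into 1-minute batches, and a per-batch GUID set makes re-processing the same time unit a no-op.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the design: GUID-per-packet plus 1-minute batches gives
// idempotent processing — a batch (including the latest, partial one) can be
// re-processed any number of times without double counting.
class MinuteBatchCounter {
    // minute bucket (epoch millis truncated to the minute) -> GUIDs seen in it
    private final Map<Long, Set<String>> batches = new HashMap<>();

    static long minuteBucket(long epochMillis) {
        return epochMillis - (epochMillis % 60_000L);
    }

    /** Record an event; returns true only the first time a GUID is seen in its minute. */
    public boolean record(String guid, long epochMillis) {
        return batches
                .computeIfAbsent(minuteBucket(epochMillis), k -> new HashSet<>())
                .add(guid);
    }

    /** Exactly-once count for a minute, stable under re-processing and late data. */
    public int count(long epochMillis) {
        Set<String> seen = batches.get(minuteBucket(epochMillis));
        return seen == null ? 0 : seen.size();
    }
}
```

Replaying the same packets through record() leaves every count unchanged, which is why late and out-of-order data can simply be handled by re-processing the affected minute.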
15. Spark Consumers
Multiple spark jobs using algorithmic and statistical
analysis in real time:
• Clicks and Impressions Aggregator
• Performance Analyzer
• AB Tests Manager
• Baseline Calculator
• Homepage Crawler
• More
18. Challenges
• Performance Optimizations
– DAG profiling
• Using .count() to force execution of the lazy DAG (turned on/off using a
live configuration)
– Code Profiling
• YourKit, etc.
• Debugging Errors in Production
– Local debugging on small datasets
– Remote debugging
– Extensive usage of logfiles (ELK)
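The DAG-profiling trick above relies on Spark's laziness: transformations do nothing until an action such as .count() runs, so inserting an extra .count() materializes a stage just to time it. Here is a minimal pure-Java sketch of the same idea using java.util.stream, which is lazy in the same way (the PROFILE_DAG flag is a hypothetical stand-in for the live configuration mentioned on the slide):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Illustration of profiling a lazy pipeline by forcing an intermediate
// count(), analogous to adding .count() between Spark stages.
class DagProfilingSketch {
    static final boolean PROFILE_DAG = true; // hypothetical live-config flag
    static final AtomicInteger mapCalls = new AtomicInteger();

    static long run(List<Integer> input) {
        Stream<Integer> mapped = input.stream().map(x -> {
            mapCalls.incrementAndGet(); // side effect lets us observe execution
            return x * 2;
        });
        if (PROFILE_DAG) {
            // Materialize the intermediate stage just to time it. A Java stream
            // cannot be reused after a terminal op, so we rebuild it; a Spark
            // RDD could instead be cached before the extra count().
            long t0 = System.nanoTime();
            long n = mapped.count();
            System.out.println("stage produced " + n + " records in "
                    + (System.nanoTime() - t0) + " ns");
            mapped = input.stream().map(x -> x * 2);
        }
        return mapped.filter(x -> x > 2).count();
    }
}
```

Gating the extra count() behind a flag matters because forcing a stage costs a full extra pass over the data; in production the flag stays off.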
19. Hash code pitfall
• JavaPairRDD<Key, Value>
• The Spark partitioner was the hash partitioner
• The Key was an object with an enum as a member
• Enum.hashCode() is final and returns the object's memory-based
identity hash, which is JVM dependent, so the Key's hash was JVM dependent
• Objects with the same key ended up in multiple partitions, and
reduceByKey() produced inconsistent results
• Solution: either avoid using enums in keys, or override the
hashCode method of the key object to use the
numeric or string value of the enum
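The fix can be sketched as follows (Country and the Key fields are hypothetical stand-ins; the real key class is not shown in the slides):

```java
import java.util.Objects;

// Stand-in enum for the pitfall described on the slide.
enum Country { US, UK }

class Key {
    final Country country;
    final String site;

    Key(Country country, String site) {
        this.country = country;
        this.site = site;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Key)) return false;
        Key k = (Key) o;
        return country == k.country && site.equals(k.site);
    }

    // Enum.hashCode() is final and identity-based, so hashing `country`
    // directly gives different values on different JVMs, scattering equal
    // keys across partitions. Hashing the enum's name (or ordinal) is
    // stable on every JVM.
    @Override
    public int hashCode() {
        return Objects.hash(country.name(), site);
    }
}
```

With a JVM-stable hashCode, Spark's hash partitioner sends all records with an equal Key to the same partition, and reduceByKey() becomes deterministic.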
20. Spark Usages @ Taboola
• Newsroom
• Automatic campaigns stopper / reviver
• Legacy Spark
• Spark SQL for reporting
• Algo team research
– MLlib