Real Time Analytics with Druid, Apache Spark and Kafka

This document summarizes Daria Litvinov's presentation on using Druid, Apache Spark, and Kafka for real-time analytics. The presentation covers setting up real-time dashboards using these technologies, addressing issues like data loss on job restarts, and the solution of committing Kafka offsets manually and storing them synchronously.

Daria Litvinov | SW Engineer, Outbrain
November 2019
Real Time Analytics
with Druid, Apache
Spark and Kafka

SW Engineer with more than 15 years of
experience
Dealing with Big Data technologies
Druid, Spark, Kafka
Daria Litvinov

That wonderful moment when
you discover something you
never knew existed.
We are Discovery

1.8 Billion
Pages a Day**
290 Billion
Monthly Discoveries**
820 Million
Monthly Consumers*
*Comscore, 2019, **Outbrain internal data, March 2019
We are Global

• The Motivation
• Architecture
• The Problem
• The Way to the Solution
Agenda

The Motivation
Serving
(Clicks, Impressions)

The Motivation
Serving
(Clicks, Listing)
4 Hour
Cycle

Real Time Dashboard
Based on Kafka Events

Apache Druid
Druid is a column-oriented,
open-source, distributed data store.
Imply is a high-performance analytics solution
to store, query, and visualize streaming data.

Design & Architecture
● Kafka topic format

● Kafka topic format
● Kafka topic content
[Click_id, publisher_id, marketer_id,
countryCode,...]
Design & Architecture

input_events
Avro,
raw data
output_events
JSON,
enriched
Design & Architecture

Let’s Get Started!
Spark
Kafka
Spark Streaming Kafka Integration Guide
Deploy Spark Job to Production
Load Data Into Druid

Spark Streaming Jobs Run Forever?
Not Really!

Data Loss on
Job Restart
Spark Streaming Jobs Run Forever?
Not Really!

Apache Kafka
● Distributed Streaming Platform
● Producer API
● Consumer API

Apache Kafka –
Consuming a Topic
● latest
● earliest
● from committed offsets
[Diagram: topic offsets 19 to 29 along a time axis]
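
The three starting positions on this slide can be sketched as a small function. This is a toy Python model, not the Kafka client: `resolve_start_offset` and its parameters are hypothetical names. It assumes the usual Kafka rule that a committed offset takes priority and that the `auto.offset.reset` policy (`latest` or `earliest`) applies only when the consumer group has no committed offset. Out-of-range committed offsets are not modelled.

```python
from typing import Optional


def resolve_start_offset(
    committed: Optional[int],
    log_start: int,
    log_end: int,
    auto_offset_reset: str = "latest",
) -> int:
    """Return the offset at which a consumer starts reading one partition.

    committed  -- next offset to read as stored for the group, or None
    log_start  -- oldest offset still retained in the partition
    log_end    -- offset that the next produced record will receive
    """
    if committed is not None:
        # "from committed offsets": resume where the group left off.
        return committed
    if auto_offset_reset == "earliest":
        return log_start
    if auto_offset_reset == "latest":
        return log_end
    raise ValueError(f"unsupported policy: {auto_offset_reset}")
```

With the offsets drawn on the slide (19 to 29, so the next record would get 30), a job with no committed offset and the `latest` policy starts at 30 and never sees records 19 to 29, which is one way a restart can lose data.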

Data Loss on Job Restart

Auto Commit Offsets
enable.auto.commit
auto.commit.interval.ms
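
A minimal sketch of why time-based auto commit may not be enough on its own. The two setting names come from the slide; the values are illustrative, and `skipped_after_restart` is a hypothetical helper. The assumption here is that the background commit stores the position reached by polling, regardless of whether those records have been processed yet.

```python
# Kafka consumer settings named on the slide; the values are illustrative.
auto_commit_params = {
    "enable.auto.commit": True,       # commit polled positions in the background
    "auto.commit.interval.ms": 5000,  # how often the background commit runs
}


def skipped_after_restart(committed: int, processed: int) -> list:
    """Offsets that a restarted job never processes.

    committed -- next offset to read, as last stored by the background commit
    processed -- next offset the job had actually finished processing
    """
    # A restart resumes from `committed`, so anything that was polled and
    # committed but not yet processed is skipped.
    return list(range(processed, committed))
```

For example, if the background commit stored 27 while the job had only finished up to 24, a restart skips offsets 24, 25 and 26. If the commit lags behind processing instead, nothing is skipped but some records are processed twice.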

Micro-batches of 2 minutes
[Diagram: one RDD per micro-batch at times t0, t1, t2, ..., ti, ti+1]
Spark Streaming –
Micro Batches

Call commitAsync() at the end of each micro-batch
Commit Kafka Offsets Manually
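
The process-then-commit order can be sketched with a toy loop. This is plain Python rather than Spark code: `run_micro_batches`, `process` and `commit` are hypothetical names, and fixed-size slices of a list stand in for the 2-minute micro-batches. In the Spark Kafka integration the commit step would be the `commitAsync()` call named on the slide.

```python
from typing import Callable, List


def run_micro_batches(
    log: List[str],
    start: int,
    batch_size: int,
    process: Callable[[List[str]], None],
    commit: Callable[[int], None],
) -> None:
    """Process `log` from `start` in fixed-size batches, committing after each."""
    offset = start
    while offset < len(log):
        until = min(offset + batch_size, len(log))
        process(log[offset:until])  # may raise; then nothing is committed
        commit(until)               # runs only after the batch succeeded
        offset = until


# Usage: the committed offset only ever covers fully processed batches.
committed = {"offset": 0}
seen: List[str] = []
run_micro_batches(
    list("abcdefg"),
    committed["offset"],
    3,
    seen.extend,
    lambda o: committed.update(offset=o),
)
```

Because the commit happens after processing, a failure in the middle of a batch leaves the committed offset at the end of the previous batch, so the failed batch is read again after a restart instead of being skipped.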

● Go to the Spark UI and kill the application
● Use the API to kill the application
● Graceful Shutdown
○ Create the context
spark.streaming.stopGracefullyOnShutdown=true
○ Stop the context
ssc.stop(stopSparkContext=true, stopGracefully=true)
How to Shut Down a Spark Application

• Stop reading from Kafka
• Process queued events
• Stop the job’s execution
Spark Streaming - Graceful Shutdown

Check for
HDFS file
Graceful Shutdown – Check for Shutdown
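
The slide only says to check for an HDFS file, so the following is a sketch under assumptions: it polls for a marker file and, once the file appears, calls a stop function exactly once. `wait_for_shutdown_marker`, the marker path and the `exists` parameter are hypothetical; `os.path.exists` stands in for a real HDFS existence check, and in the actual job `stop` would be the graceful `ssc.stop(...)` call.

```python
import os
import time
from typing import Callable


def wait_for_shutdown_marker(
    marker_path: str,
    stop: Callable[[], None],
    poll_seconds: float = 10.0,
    exists: Callable[[str], bool] = os.path.exists,
) -> None:
    """Block until `marker_path` exists, then call `stop` once."""
    while not exists(marker_path):
        time.sleep(poll_seconds)  # wait before checking again
    stop()
```

An operator would then request a shutdown by creating the marker file, rather than by killing the application, which gives the job a chance to finish the batches it has already queued.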

/**
Queue up offset ranges for commit to Kafka at a future
time. Threadsafe.
*/
Spark CommitAsync API

Step 4:
Store Kafka Offsets Synchronously

Graceful
Shutdown
Store Kafka Offsets
Synchronously
The Final Solution
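
A toy comparison of a queued commit and a synchronous store, to illustrate why the last batch matters at shutdown. Both classes are hypothetical and do not reproduce Spark or Kafka code. The model assumes that a queued commit is only sent when the next batch starts, which is one reading of "at a future time" in the API comment; the slides do not say where the offsets were stored synchronously.

```python
from collections import deque


class QueuedCommitter:
    """Toy model of a commit that is queued now and sent later."""

    def __init__(self) -> None:
        self.pending = deque()
        self.committed = 0

    def commit_async(self, offset: int) -> None:
        self.pending.append(offset)  # nothing is durable yet

    def on_next_batch(self) -> None:
        while self.pending:          # queued commits are sent here
            self.committed = self.pending.popleft()


class SyncOffsetStore:
    """Toy model of a store that is durable when `save` returns."""

    def __init__(self) -> None:
        self.committed = 0

    def save(self, offset: int) -> None:
        self.committed = offset


# Three batches ending at offsets 10, 20 and 30, then the job stops.
queued, sync = QueuedCommitter(), SyncOffsetStore()
for end in (10, 20, 30):
    queued.on_next_batch()
    queued.commit_async(end)
    sync.save(end)
```

After the loop, the queued committer has only recorded offset 20 because no further batch ran to send the last commit, while the synchronous store holds 30. In this model a restart from the queued value would reread the last batch.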

Real Time Analytics with Druid, Apache Spark and Kafka

1. Daria Litvinov | SW Engineer, Outbrain November 2019 Real Time Analytics with Druid, Apache Spark and Kafka

2. SW Engineer with more than 15 years of experience Dealing with Big Data technologies Druid, Spark, Kafka Daria Litvinov

3. What is

4. That wonderful moment when you discover something you never knew existed. We are Discovery

5. 1.8 Billion Pages a Day** 290 Billion Monthly Discoveries** 820 Million Monthly Consumers* *Comscore, 2019, **Outbrain internal data, March 2019 We are Global

6. • The Motivation • Architecture • The Problem • The Way to the Solution Agenda

7. The Motivation

8. The Motivation Serving (Clicks, Impressions)

9. The Motivation Serving (Clicks, Listing) 4 Hour Cycle

10. Real Time Dashboard Based on Kafka Events

11. Apache Druid Druid is a column-oriented, open-source, distributed data store. Imply is a high-performance analytics solution to store, query, and visualize streaming data.

12. + = Druid & Kafka

13. Design & Architecture ● Kafka topic format

14. ● Kafka topic format ● Kafka topic content [Click_id, publisher_id, marketer_id, countryCode,...] Design & Architecture

15. input_events Avro, raw data output_events JSON, enriched Design & Architecture

16. Let’s Get Started!

17. Let’s Get Started! Spark Kafka Spark Streaming Kafka Integration Guide Deploy Spark Job to Production Load Data Into Druid

18. Data in Druid

19. Spark Streaming Jobs Run Forever? Not Really!

20. Data Loss on Job Restart Spark Streaming Jobs Run Forever? Not Really!

21. input_events Avro, raw data output_events JSON, enriched Design & Architecture

22. It’s All About Offsets!

23. Apache Kafka ● Distributed Streaming Platform ● Producer API ● Consumer API

24. Apache Kafka – Consuming a Topic ● latest ● earliest ● from committed offsets 19 20 21 22 23 24 25 26 27 28 Time 29

25. Data Loss on Job Restart Data Loss on Job Restart

26. Step 1: Auto Commit Offsets

27. Auto Commit Offsets enable.auto.commit auto.commit.interval.ms

28. Step 2: Commit Kafka Offsets Manually

29. Micro-batches of 2 minutes RDD RDD RDD RDD t0 t1 t2 ti ti+1 Spark Streaming – Micro Batches

30. Call commitAsync() at the end of each micro-batch Commit Kafka Offsets Manually

31. Commit Kafka Offsets Manually

32. Step 3: Graceful Shutdown

33. ● Go to the Spark UI and kill the application ● Use the API to kill the application ● Graceful Shutdown ○ Create the context spark.streaming.stopGracefullyOnShutdown=true ○ Stop the context ssc.stop(stopSparkContext=true, stopGracefully=true) How to Shut Down a Spark Application

34. ● Go to the Spark UI and kill the application ● Use the API to kill the application ● Graceful Shutdown ○ Create the context spark.streaming.stopGracefullyOnShutdown=true ○ Stop the context ssc.stop(stopSparkContext=true, stopGracefully=true) How to Shut Down a Spark Application

35. • Stop reading from Kafka • Process queued events • Stop the job’s execution Spark Streaming - Graceful Shutdown

36. Graceful Shutdown – Check for Shutdown

37. Check for HDFS file Graceful Shutdown – Check for Shutdown

38. /** Queue up offset ranges for commit to Kafka at a future time. Threadsafe. */ Spark CommitAsync API

39. Step 4: Store Kafka Offsets Synchronously

40. The Final Solution

41. Graceful Shutdown Store Kafka Offsets Synchronously The Final Solution

42. It Works! Job Restarted Job Restarted

43. Now We Have Reliable Data in Druid

44. Journey