This document summarizes Daria Litvinov's presentation on using Apache Druid, Apache Spark, and Apache Kafka for real-time analytics. The presentation covers building real-time dashboards with these technologies, the data loss that occurred on job restarts, and the fix: committing Kafka offsets manually and storing them synchronously.
4. That wonderful moment when
you discover something you
never knew existed.
We are Discovery
5. 1.8 Billion
Pages a Day**
290 Billion
Monthly Discoveries**
820 Million
Monthly Consumers*
*Comscore, 2019, **Outbrain internal data, March 2019
We are Global
6. • The Motivation
• Architecture
• The Problem
• The Way to the Solution
Agenda
11. Apache Druid
Druid is a column-oriented,
open-source, distributed data store.
Imply is a high-performance analytics solution
to store, query, and visualize streaming data.
33. ● Go to the Spark UI and kill the application
● Use the API to kill the application
● Graceful Shutdown
○ Create the context with
spark.streaming.stopGracefullyOnShutdown=true
○ Stop the context with
ssc.stop(stopSparkContext=true, stopGracefully=true)
How To Shut Down a Spark Application
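The two graceful-shutdown options on this slide might be wired together roughly as follows. This is a minimal sketch, not the presenter's code: the application name, the `local[2]` master, the one-second batch interval, the five-second timeout, and the in-memory `queueStream` source (standing in for the Kafka stream) are all assumptions made so the example is self-contained.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object GracefulShutdownSketch {
  def main(args: Array[String]): Unit = {
    // Option 1: ask Spark to stop the StreamingContext gracefully
    // when the JVM shuts down (for example on SIGTERM).
    val conf = new SparkConf()
      .setAppName("graceful-shutdown-sketch") // hypothetical name
      .setMaster("local[2]")                  // assumption: local test run
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumption: an in-memory queue of RDDs replaces the Kafka source,
    // because a context needs at least one output operation to start.
    val batches = mutable.Queue[RDD[String]](
      ssc.sparkContext.parallelize(Seq("event-1", "event-2")))
    ssc.queueStream(batches).print()

    ssc.start()
    // Let the job run briefly; a real job would wait for a stop signal.
    ssc.awaitTerminationOrTimeout(5000L)

    // Option 2: stop explicitly, letting received data finish processing.
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```

In practice one of the two options is usually enough: the configuration flag covers shutdowns triggered from outside the JVM, while the explicit `ssc.stop` call suits applications that decide for themselves when to stop.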
35. • Stop reading from Kafka
• Process queued events
• Stop the job’s execution
Spark Streaming - Graceful Shutdown
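The order of the three steps can be illustrated with a toy model. This is only a sketch of the sequence, not Spark's implementation: `ToyStream`, `receive`, and `stopGracefully` are hypothetical names, and events are plain strings rather than Kafka records.

```scala
import scala.collection.mutable

// Toy model of a streaming job with an internal queue of received events.
final class ToyStream(process: String => Unit) {
  private val queued = mutable.Queue.empty[String]
  private var accepting = true
  private var running = true

  // Simulates reading one event from Kafka; returns false once reading has stopped.
  def receive(event: String): Boolean =
    if (accepting) { queued.enqueue(event); true } else false

  def isRunning: Boolean = running

  def stopGracefully(): Unit = {
    accepting = false                                  // 1. stop reading from Kafka
    while (queued.nonEmpty) process(queued.dequeue())  // 2. process queued events
    running = false                                    // 3. stop the job's execution
  }
}

object ToyStreamDemo {
  def main(args: Array[String]): Unit = {
    val processed = mutable.ArrayBuffer.empty[String]
    val stream = new ToyStream(e => processed += e)

    stream.receive("a")
    stream.receive("b")
    stream.stopGracefully()

    println(processed.mkString(","))  // prints "a,b": queued events were not lost
    println(stream.receive("c"))      // prints "false": nothing is read after the stop
    println(stream.isRunning)         // prints "false"
  }
}
```

The point of the ordering is that intake closes before the queue is drained, so the queue cannot grow while it is being emptied and nothing already received is dropped.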