Kafka pours and Spark resolves

Alexey Zinoviev
Lead Training and Development Specialist (Java, BigData) in EPAM Systems; researcher in Computer Science
Alexey Zinovyev, Java/BigData Trainer in EPAM
About
With IT since 2007
With Java since 2009
With Hadoop since 2012
With Spark since 2014
With EPAM since 2015
Spark Streaming from Zinoviev Alexey
Contacts
E-mail : Alexey_Zinovyev@epam.com
Twitter : @zaleslaw @BigDataRussia
Facebook: https://www.facebook.com/zaleslaw
vk.com/big_data_russia Big Data Russia
vk.com/java_jvm Java & JVM langs
Spark Family
Pre-summary
• Before RealTime
• Spark + Cassandra
• Sending messages with Kafka
• DStream Kafka Consumer
• Structured Streaming in Spark 2.1
• Kafka Writer in Spark 2.2
< REAL-TIME
Modern Java in 2016
Big Data in 2014
Batch jobs produce reports. More and more...
But the customer can wait forever (OK, 1h)
Big Data in 2017
Machine Learning EVERYWHERE
Data Lake in promotional brochure
Data Lake in production
Simple Flow in Reporting/BI systems
Let’s use Spark. It’s fast!
MapReduce vs Spark
Simple Flow in Reporting/BI systems with Spark
Spark handles last year’s logs with ease
Where can we store events?
Let’s use Cassandra to store events!
Let’s use Cassandra to read events!
Cassandra
-- Create a keyspace and make it the current one
CREATE KEYSPACE mySpace WITH replication = {'class':
'SimpleStrategy', 'replication_factor': 1 };
USE mySpace;
-- One partition per application, rows ordered by time
CREATE TABLE logs
( application TEXT,
time TIMESTAMP,
message TEXT,
PRIMARY KEY (application, time));
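Cassandra to Spark
// Load the Cassandra table as a DataFrame through the Spark Cassandra connector.
// CQL folds unquoted names to lower case, so mySpace is stored as "myspace".
val dataSet = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "logs", "keyspace" -> "myspace"))
  .load()

// Keep only the matching rows and print them
dataSet
  .filter("message = 'Log message'")
  .show()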
Simple Flow in Pre-Real-Time systems
Spark cluster over Cassandra Cluster
More events every second!
SENDING MESSAGES
Your Father’s Messaging System
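import javax.jms.QueueConnection;
import javax.jms.QueueConnectionFactory;
import javax.naming.InitialContext;

// Look up the connection factory in JNDI, then open and start a connection
InitialContext ctx = new InitialContext();
QueueConnectionFactory f =
    (QueueConnectionFactory) ctx.lookup("qCFactory");
QueueConnection con = f.createQueueConnection();
con.start();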
KAFKA
Kafka
• messaging system
• distributed
• supports Publish-Subscribe model
• persists messages on disk
• replicates within the cluster (integrated with Zookeeper)
The main benefits of Kafka
• Scalability with zero downtime
• Zero data loss due to replication
Kafka Cluster consists of …
• brokers (leader or follower)
• topics ( >= 1 partition)
• partitions
• partition offsets
• replicas of partition
• producers/consumers
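How topics, partitions and offsets fit together can be sketched in plain Scala. This is a toy model under stated assumptions, not Kafka's implementation: the names `ToyTopic`, `partitionFor`, `append` and `read` are made up for illustration, and the JVM `hashCode` stands in for the hash that Kafka's default partitioner really uses. It only shows that a key always lands in the same partition and that every partition keeps its own increasing offsets.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model of a topic: a fixed number of partitions, each an append-only log.
// All names here are hypothetical; this is not the Kafka client API.
final class ToyTopic(val numPartitions: Int) {
  require(numPartitions >= 1, "a topic has at least one partition")

  private val logs = Array.fill(numPartitions)(ArrayBuffer.empty[String])

  // Map a key to a partition. Math.floorMod keeps the result non-negative
  // even when hashCode is negative.
  def partitionFor(key: String): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Append a message and return (partition, offset) of the new record.
  def append(key: String, value: String): (Int, Long) = {
    val p = partitionFor(key)
    logs(p) += value
    (p, (logs(p).size - 1).toLong)
  }

  // Read everything in a partition starting from the given offset.
  def read(partition: Int, fromOffset: Long): Seq[String] =
    logs(partition).drop(fromOffset.toInt).toList
}

object ToyTopicDemo {
  def main(args: Array[String]): Unit = {
    val topic = new ToyTopic(numPartitions = 2)
    val first = topic.append("app-1", "started")
    val second = topic.append("app-1", "stopped")
    println(s"first=$first second=$second")
    println(topic.read(first._1, fromOffset = 0L))
  }
}
```

Because a consumer only needs to remember one offset per partition, it can resume reading from where it stopped.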
Kafka Components with topic “messages”
[Diagram #1: producer threads #1, #2 and #3 send data to the topic "Messages"; Zookeeper sits alongside]
[Diagram #2: the topic is split into Part #1 and Part #2; Broker #1 hosts both partitions, each marked as leader]
[Diagram #3: a second broker is added; Broker #1 and Broker #2 each hold Part #1 and Part #2]
Why do we need Zookeeper?
Kafka Demo
REAL TIME WITH DSTREAMS
RDD Factory
From socket to console with DStreams
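DStream
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Two local threads: one for the socket receiver, one for processing
val conf = new SparkConf().setMaster("local[2]")
  .setAppName("NetworkWordCount")
// Micro-batches of one second
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
// Counts are per batch, not cumulative
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
// Nothing runs until the context is started
ssc.start()
ssc.awaitTermination()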
Kafka as a main entry point for Spark
DStreams Demo
How can we avoid DStreams and their RDD-like API?
SPARK 2.2 DISCUSSION
Continuous Applications
Continuous Applications cases
• Updating data that will be served in real time
• Extract, transform and load (ETL)
• Creating a real-time version of an existing batch job
• Online machine learning
The main concept of Structured Streaming
You can express your streaming computation the
same way you would express a batch computation
on static data.
Batch (Spark 2.2)
// Read JSON once from S3
val logsDF = spark.read.json("s3://logs")
// Transform with the DataFrame API and save
logsDF.select("user", "url", "date")
  .write.parquet("s3://out")
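Real Time (Spark 2.2)
// Read JSON continuously from S3. Streaming file sources need an explicit schema;
// logSchema is assumed to be a StructType defined elsewhere.
val logsDF = spark.readStream.schema(logSchema).json("s3://logs")
// Transform with the DataFrame API and save. The file sink needs a checkpoint
// location; the path below is a placeholder.
logsDF.select("user", "url", "date")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "s3://checkpoints")
  .start("s3://out")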
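WordCount from Socket
import spark.implicits._  // needed for .as[String]
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
// Don’t forget to start streaming: nothing runs until start() is called
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()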
Unlimited Table
WordCount with Structured Streaming [Complete Mode]
Kafka -> Structured Streaming -> Console
Kafka To Console Demo
OPERATIONS
You can …
• filter
• sort
• aggregate
• join
• foreach
• explain
Operators Demo
How does it work?
Deep Diving in Spark Internals
[Diagram, built up over several slides: Dataset -> Logical Plan -> Optimized Logical Plan -> several candidate Physical Plans -> Selected Physical Plan (chosen by the Planner) -> Incremental #1, Incremental #2, Incremental #3]
Incremental Execution: Planner polls
[Diagram: the Planner polls for new data]
Incremental Execution: Planner runs
[Diagram: the Planner runs Incremental #1 over offsets 1 - 5. Count: 5]
Incremental Execution: Planner runs #2
[Diagram: Incremental #1 covers offsets 1 - 5 (Count: 5); Incremental #2 covers offsets 6 - 9 (Count: 4)]
Aggregation with State
[Diagram: Incremental #1 covers offsets 1 - 5 (Count: 5) and hands the state 5 to Incremental #2, which covers offsets 6 - 9 (Count: 5+4=9)]
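The difference between the per-batch counts (5, then 4) and the stateful count (5, then 5+4=9) can be reproduced in a few lines of plain Scala. This is only a sketch of the idea: `Batch`, `perBatchCounts` and `runningCounts` are hypothetical names, not Spark internals, and a batch is reduced to the offset range it covers.

```scala
// Toy model of incremental execution with state. Names are hypothetical;
// this is not Spark's internal API.
object IncrementalCount {
  // One micro-batch: the inclusive offset range it covers.
  final case class Batch(fromOffset: Long, toOffset: Long) {
    def size: Long = toOffset - fromOffset + 1
  }

  // Stateless: each batch only knows about its own records.
  def perBatchCounts(batches: Seq[Batch]): Seq[Long] =
    batches.map(_.size)

  // Stateful: the running total is carried from one batch to the next.
  def runningCounts(batches: Seq[Batch]): Seq[Long] =
    batches.scanLeft(0L)((state, batch) => state + batch.size).tail

  def main(args: Array[String]): Unit = {
    val batches = Seq(Batch(1, 5), Batch(6, 9))
    println(perBatchCounts(batches)) // List(5, 4)
    println(runningCounts(batches))  // List(5, 9)
  }
}
```

The only thing the second batch needs from the first is the number 5, which is why the state can stay small even when the stream is long.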
DataSet.explain()
== Physical Plan ==
Project [avg(price)#43,carat#45]
+- SortMergeJoin [color#21], [color#47]
   :- Sort [color#21 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(color#21,200), None
   :     +- Project [avg(price)#43,color#21]
   :        +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
   :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
   :              +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])
   :                 +- Scan CsvRelation(-----)
   +- Sort [color#47 ASC], false, 0
      +- TungstenExchange hashpartitioning(color#47,200), None
         +- ConvertToUnsafe
            +- Scan CsvRelation(----)
What’s the difference between Complete and Append output modes?
COMPLETE, APPEND & UPDATE
There are two main modes and one planned for the future
• append (default)
• complete
• update [in dreams]
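A plain-Scala sketch can make the modes concrete for a word count. It is a toy model, not the Spark API: `update`, `completeOutput` and `updateOutput` are hypothetical names, and the result table is just a `Map`. Complete mode rewrites the whole result table on every trigger, while update mode writes only the rows that changed. Append mode writes only rows that will never change again, so it is left out here: a running count can always change.

```scala
// Toy model of output modes for a streaming word count.
// Names are hypothetical; this is not the Spark API.
object OutputModes {
  type Table = Map[String, Long]

  // Fold one micro-batch of words into the running result table.
  def update(table: Table, batch: Seq[String]): Table =
    batch.foldLeft(table)((t, w) => t.updated(w, t.getOrElse(w, 0L) + 1L))

  // Complete mode: the whole result table is written on every trigger.
  def completeOutput(before: Table, after: Table): Table = after

  // Update mode: only rows that are new or changed in this trigger are written.
  def updateOutput(before: Table, after: Table): Table =
    after.filter { case (word, count) => !before.get(word).contains(count) }

  def main(args: Array[String]): Unit = {
    val t1 = update(Map.empty, Seq("kafka", "spark", "kafka"))
    val t2 = update(t1, Seq("spark"))
    println(completeOutput(t1, t2))
    println(updateOutput(t1, t2))
  }
}
```

In the second trigger only "spark" changes, so update mode would write one row where complete mode writes two.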
Aggregation with watermarks
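The idea behind a watermark can be sketched in plain Scala: remember the largest event time seen so far, subtract an allowed delay, and drop anything older than that. This is a simplified model with hypothetical names (`Event`, `acceptOnTime`, `countByKey`); in particular it moves the watermark after every event, whereas Spark advances it between micro-batches. In Spark itself the delay is declared with `withWatermark` on the Dataset.

```scala
// Toy model of event-time watermarking. Names are hypothetical and the
// per-event watermark update is a simplification.
object WatermarkSketch {
  final case class Event(key: String, eventTimeSec: Long)

  // Returns the events that were accepted, in arrival order.
  // An event is dropped when it is older than (max event time seen - delay).
  def acceptOnTime(events: Seq[Event], delaySec: Long): Seq[Event] = {
    var maxSeen = Long.MinValue
    val accepted = Seq.newBuilder[Event]
    for (e <- events) {
      val watermark =
        if (maxSeen == Long.MinValue) Long.MinValue else maxSeen - delaySec
      if (e.eventTimeSec >= watermark) {
        accepted += e
      }
      maxSeen = math.max(maxSeen, e.eventTimeSec)
    }
    accepted.result()
  }

  // Count accepted events per key.
  def countByKey(events: Seq[Event]): Map[String, Int] =
    events.groupBy(_.key).map { case (k, es) => k -> es.size }

  def main(args: Array[String]): Unit = {
    // Arrival order differs from event-time order: 125 is late but tolerated,
    // 90 is too late for a 10-second delay.
    val arrivals =
      Seq(Event("a", 100), Event("a", 130), Event("a", 125), Event("a", 90))
    println(countByKey(acceptOnTime(arrivals, delaySec = 10)))
  }
}
```

Dropping data that is too late is what lets the engine discard old aggregation state instead of keeping it forever.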
SOURCES & SINKS
Spark Streaming is a brick in the Big Data Wall
Let's save to Parquet files
Can we write to Kafka?
Nightly Build
Kafka-to-Kafka
J-K-S-K-S-C
Pipeline Demo
I didn’t find sink/source for XXX…
Console Foreach Sink
import org.apache.spark.sql.ForeachWriter

// stream is assumed to be a streaming Dataset[String]
val customWriter = new ForeachWriter[String] {
  // Return true to process this partition of the current batch
  override def open(partitionId: Long, version: Long): Boolean = true
  override def process(value: String): Unit = println(value)
  override def close(errorOrNull: Throwable): Unit = {}
}
stream.writeStream
  .queryName("ForeachOnConsole")
  .foreach(customWriter)
  .start()
Pinch of wisdom
• check checkpointLocation
• don’t use MemoryStream
• think about GC pauses
• be careful with nightly builds
• use .groupBy.count() instead of count()
• use the console sink instead of the .show() function
We have no ability to …
• join two streams
• work with update mode
• make a full outer join
• take the first N rows
• sort without pre-aggregation
Roadmap 2.2
• Support other data sources (not only S3 + HDFS)
• Transactional updates
• Dataset is one DSL for all operations
• GraphFrames + Structured MLlib
• KafkaWriter
• TensorFrames
IN CONCLUSION
Final point
A scalable, fault-tolerant, real-time pipeline with Spark & Kafka is ready for use
A few papers about Spark Streaming and Kafka
Introduction in Spark + Kafka
http://bit.ly/2mJjE4i
Source code from Demo
Spark Tutorial
https://github.com/zaleslaw/Spark-Tutorial
Any questions?
Alexey Zinoviev337 views
Android Geo Apps in Soviet Russia: Latitude and longitude find you by Alexey Zinoviev
Android Geo Apps in Soviet Russia: Latitude and longitude find youAndroid Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find you
Alexey Zinoviev808 views
Keynote on JavaDay Omsk 2014 about new features in Java 8 by Alexey Zinoviev
Keynote on JavaDay Omsk 2014 about new features in Java 8Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8
Alexey Zinoviev655 views
Big data algorithms and data structures for large scale graphs by Alexey Zinoviev
Big data algorithms and data structures for large scale graphsBig data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphs
Alexey Zinoviev999 views
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись" by Alexey Zinoviev
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Alexey Zinoviev6.4K views
Алгоритмы и структуры данных BigData для графов большой размерности by Alexey Zinoviev
Алгоритмы и структуры данных BigData для графов большой размерностиАлгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерности
Alexey Zinoviev1.7K views
ALMADA 2013 (computer science school by Yandex and Microsoft Research) by Alexey Zinoviev
ALMADA 2013 (computer science school by Yandex and Microsoft Research)ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
Alexey Zinoviev514 views
GDG Devfest Omsk 2013. Year of events! by Alexey Zinoviev
GDG Devfest Omsk 2013. Year of events!GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!
Alexey Zinoviev413 views

Recently uploaded

Data Journeys Hard Talk workshop final.pptx by
Data Journeys Hard Talk workshop final.pptxData Journeys Hard Talk workshop final.pptx
Data Journeys Hard Talk workshop final.pptxinfo828217
10 views18 slides
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation by
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented GenerationDataScienceConferenc1
15 views29 slides
Ukraine Infographic_22NOV2023_v2.pdf by
Ukraine Infographic_22NOV2023_v2.pdfUkraine Infographic_22NOV2023_v2.pdf
Ukraine Infographic_22NOV2023_v2.pdfAnastosiyaGurin
1.4K views3 slides
Organic Shopping in Google Analytics 4.pdf by
Organic Shopping in Google Analytics 4.pdfOrganic Shopping in Google Analytics 4.pdf
Organic Shopping in Google Analytics 4.pdfGA4 Tutorials
16 views13 slides
Chapter 3b- Process Communication (1) (1)(1) (1).pptx by
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptxayeshabaig2004
7 views30 slides
UNEP FI CRS Climate Risk Results.pptx by
UNEP FI CRS Climate Risk Results.pptxUNEP FI CRS Climate Risk Results.pptx
UNEP FI CRS Climate Risk Results.pptxpekka28
11 views51 slides

Recently uploaded(20)

Data Journeys Hard Talk workshop final.pptx by info828217
Data Journeys Hard Talk workshop final.pptxData Journeys Hard Talk workshop final.pptx
Data Journeys Hard Talk workshop final.pptx
info82821710 views
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation by DataScienceConferenc1
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
Ukraine Infographic_22NOV2023_v2.pdf by AnastosiyaGurin
Ukraine Infographic_22NOV2023_v2.pdfUkraine Infographic_22NOV2023_v2.pdf
Ukraine Infographic_22NOV2023_v2.pdf
AnastosiyaGurin1.4K views
Organic Shopping in Google Analytics 4.pdf by GA4 Tutorials
Organic Shopping in Google Analytics 4.pdfOrganic Shopping in Google Analytics 4.pdf
Organic Shopping in Google Analytics 4.pdf
GA4 Tutorials16 views
Chapter 3b- Process Communication (1) (1)(1) (1).pptx by ayeshabaig2004
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptx
ayeshabaig20047 views
UNEP FI CRS Climate Risk Results.pptx by pekka28
UNEP FI CRS Climate Risk Results.pptxUNEP FI CRS Climate Risk Results.pptx
UNEP FI CRS Climate Risk Results.pptx
pekka2811 views
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P... by DataScienceConferenc1
[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P...
SUPER STORE SQL PROJECT.pptx by khan888620
SUPER STORE SQL PROJECT.pptxSUPER STORE SQL PROJECT.pptx
SUPER STORE SQL PROJECT.pptx
khan88862013 views
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx by DataScienceConferenc1
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
Survey on Factuality in LLM's.pptx by NeethaSherra1
Survey on Factuality in LLM's.pptxSurvey on Factuality in LLM's.pptx
Survey on Factuality in LLM's.pptx
NeethaSherra17 views
Data about the sector workshop by info828217
Data about the sector workshopData about the sector workshop
Data about the sector workshop
info82821715 views
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M... by DataScienceConferenc1
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init... by DataScienceConferenc1
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ... by DataScienceConferenc1
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx by DataScienceConferenc1
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx

Kafka pours and Spark resolves

  • 1. Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM
  • 2. About: with IT since 2007; with Java since 2009; with Hadoop since 2012; with Spark since 2014; with EPAM since 2015
  • 3. 3Spark Streaming from Zinoviev Alexey Contacts E-mail : Alexey_Zinovyev@epam.com Twitter : @zaleslaw @BigDataRussia Facebook: https://www.facebook.com/zaleslaw vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs
  • 4. Spark Family
  • 5. Spark Family
  • 6. Pre-summary: Before Real-Time; Spark + Cassandra; Sending messages with Kafka; DStream Kafka Consumer; Structured Streaming in Spark 2.1; Kafka Writer in Spark 2.2
  • 7. < REAL-TIME
  • 8. Modern Java in 2016 / Big Data in 2014
  • 9. Batch jobs produce reports. More and more...
  • 10. But the customer can wait forever (ok, 1h)
  • 11. Big Data in 2017
  • 12. Machine Learning EVERYWHERE
  • 13. Data Lake in promotional brochure
  • 14. Data Lake in production
  • 15. Simple Flow in Reporting/BI systems
  • 16. Let’s use Spark. It’s fast!
  • 17. MapReduce vs Spark
  • 18. MapReduce vs Spark
  • 19. Simple Flow in Reporting/BI systems with Spark
  • 20. Spark handles last year’s logs with ease
  • 21. Where can we store events?
  • 22. Let’s use Cassandra to store events!
  • 23. Let’s use Cassandra to read events!
  • 24. Cassandra
      CREATE KEYSPACE mySpace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
      USE mySpace;
      CREATE TABLE logs (
        application TEXT,
        time TIMESTAMP,
        message TEXT,
        PRIMARY KEY (application, time));
  • 25. Cassandra to Spark
      val dataSet = sqlContext
        .read
        .format("org.apache.spark.sql.cassandra")
        .options(Map("table" -> "logs", "keyspace" -> "mySpace"))
        .load()

      dataSet
        .filter("message = 'Log message'")
        .show()
  • 26. Simple Flow in Pre-Real-Time systems
  • 27. Spark cluster over Cassandra Cluster
  • 28. More events every second!
  • 29. SENDING MESSAGES
  • 30. Your Father’s Messaging System
  • 31. Your Father’s Messaging System
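  • 32. Your Father’s Messaging System
      import javax.jms.QueueConnection;
      import javax.jms.QueueConnectionFactory;
      import javax.naming.InitialContext;

      // Look up the JMS connection factory in JNDI and open a connection
      InitialContext ctx = new InitialContext();
      QueueConnectionFactory f = (QueueConnectionFactory) ctx.lookup("qCFactory");
      QueueConnection con = f.createQueueConnection();
      con.start();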
  • 33. KAFKA
  • 34-38. Kafka: messaging system; distributed; supports the Publish-Subscribe model; persists messages on disk; replicates within the cluster (integrated with Zookeeper)
  • 39. The main benefits of Kafka: scalability with zero downtime; zero data loss due to replication
  • 40. Kafka Cluster consists of: brokers (leader or follower); topics (>= 1 partition); partitions; partition offsets; replicas of partitions; producers/consumers
  • 41. Kafka Components with topic “messages” #1 [diagram: Producer Threads #1, #2 and #3 send data to Topic: Messages; Zookeeper]
  • 42. Kafka Components with topic “messages” #2 [diagram: Producer Threads #1, #2 and #3 send data to Topic: Messages, split into Part #1 and Part #2; Broker #1 holds Part #1 and Part #2, both as leader; Zookeeper]
  • 43. Kafka Components with topic “messages” #3 [diagram: Producer Threads #1, #2 and #3 send data to Topic: Messages, split into Part #1 and Part #2; Broker #1 and Broker #2 each hold Part #1 and Part #2; Zookeeper]
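The relationship between topics, partitions and offsets listed on the slides above can be sketched with a toy in-memory model. This is only an illustration, not the Kafka client API: `ToyTopic`, `partitionFor`, `send` and `read` are hypothetical names, and plain `hashCode` stands in for the hash that Kafka's partitioner applies to the serialized key, so real partition numbers would differ.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy in-memory model of a topic; hypothetical names, NOT the Kafka client API.
final class ToyTopic(val name: String, numPartitions: Int) {
  require(numPartitions >= 1, "a topic has at least one partition")

  // Each partition is an append-only log; a record's offset is its index in that log.
  private val partitions = Vector.fill(numPartitions)(ArrayBuffer.empty[String])

  // Assumption: String.hashCode stands in for the partitioner's real key hash.
  def partitionFor(key: String): Int = Math.floorMod(key.hashCode, numPartitions)

  // Appends the value to the partition chosen by the key and returns (partition, offset).
  def send(key: String, value: String): (Int, Long) = {
    val p = partitionFor(key)
    partitions(p) += value
    (p, partitions(p).size - 1L)
  }

  // Reads one partition from the given offset to the end of its log.
  def read(partition: Int, fromOffset: Long): Seq[String] =
    partitions(partition).drop(fromOffset.toInt).toList
}

object ToyTopicDemo {
  def main(args: Array[String]): Unit = {
    val topic = new ToyTopic("messages", numPartitions = 2)
    val (p1, o1) = topic.send("app-1", "started")
    val (p2, o2) = topic.send("app-1", "stopped")
    // Records with the same key land in the same partition, in send order.
    println(s"first: partition=$p1 offset=$o1, second: partition=$p2 offset=$o2")
    println(topic.read(p1, fromOffset = 0))
  }
}
```

Because ordering is tracked per partition, a consumer that remembers its last offset in each partition can resume reading from exactly where it stopped.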
  • 44. Why do we need Zookeeper?
  • 45. Kafka Demo
  • 46. REAL TIME WITH DSTREAMS
  • 47. RDD Factory
  • 48. From socket to console with DStreams
  • 49-53. DStream
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      // Two local threads: one for the socket receiver, one for processing
      val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
      val ssc = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches

      val lines = ssc.socketTextStream("localhost", 9999)
      val words = lines.flatMap(_.split(" "))
      val pairs = words.map(word => (word, 1))
      val wordCounts = pairs.reduceByKey(_ + _)          // counts within each batch
      wordCounts.print()

      ssc.start()
      ssc.awaitTermination()
  • 54. Kafka as a main entry point for Spark
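A minimal sketch of a DStream Kafka consumer, assuming the `spark-streaming-kafka-0-10` integration module is on the classpath and a broker is reachable at `localhost:9092`. The broker address, the `messages` topic name and the `dstream-demo` group id are placeholders for illustration, not values from the talk's demo.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaDStreamWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaDStreamWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Plain Kafka consumer settings; the address and group id are assumptions.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "dstream-demo",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct stream: each Kafka partition maps to one partition of every batch RDD.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("messages"), kafkaParams)
    )

    // Same word count as the socket example, applied to the record values.
    stream.map(_.value)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```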
  • 55. DStreams Demo
  • 56. How can we avoid DStreams and their RDD-like API?
  • 57. SPARK 2.2 DISCUSSION
  • 58. Continuous Applications
  • 59. Continuous Applications cases: updating data that will be served in real time; extract, transform and load (ETL); creating a real-time version of an existing batch job; online machine learning
  • 60. The main concept of Structured Streaming: you can express your streaming computation the same way you would express a batch computation on static data.
  • 61. Batch Spark 2.2
      // Read JSON once from S3
      val logsDF = spark.read.json("s3://logs")
      // Transform with DataFrame API and save
      logsDF.select("user", "url", "date")
        .write.parquet("s3://out")
  • 62. Real Time Spark 2.2
      // Read JSON continuously from S3
      // (by default a streaming file source also needs an explicit .schema(...), omitted on the slide)
      val logsDF = spark.readStream.json("s3://logs")
      // Transform with DataFrame API and save
      // (the file sink also needs a checkpointLocation option, omitted on the slide)
      logsDF.select("user", "url", "date")
        .writeStream.format("parquet")
        .start("s3://out")
  • 63-66. WordCount from Socket
      val lines = spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load()
      val words = lines.as[String].flatMap(_.split(" "))
      val wordCounts = words.groupBy("value").count()
    Slide 66 adds: Don’t forget to start Streaming
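The WordCount listing defines the computation but never starts it. A minimal sketch of the complete program, assuming Spark SQL is on the classpath and something like `nc -lk 9999` is feeding the socket; the object name and the `local[2]` master are illustrative choices.

```scala
import org.apache.spark.sql.SparkSession

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SocketWordCount")
      .getOrCreate()
    import spark.implicits._ // provides the encoder used by .as[String]

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val words = lines.as[String].flatMap(_.split(" "))
    val wordCounts = words.groupBy("value").count()

    // Nothing runs until start() is called on the stream writer.
    val query = wordCounts.writeStream
      .outputMode("complete") // re-emit the whole result table after every trigger
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

Complete mode is used here because the query is an aggregation without a watermark; the console sink then prints the full table of counts on each trigger.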
  • 67. Unlimited Table
  • 68. WordCount with Structured Streaming [Complete Mode]
  • 69. Kafka -> Structured Streaming -> Console
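A sketch of the Kafka -> Structured Streaming -> Console flow, assuming the `spark-sql-kafka-0-10` package is on the classpath. The broker address and the `messages` topic name are placeholders, not values taken from the demo.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToConsole {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("KafkaToConsole")
      .getOrCreate()

    // The Kafka source exposes key and value as binary columns,
    // alongside topic, partition, offset and timestamp.
    val records = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "messages")                     // assumed topic name
      .load()

    val messages = records.selectExpr(
      "CAST(key AS STRING) AS key",
      "CAST(value AS STRING) AS value",
      "partition",
      "offset"
    )

    messages.writeStream
      .outputMode("append") // no aggregation, so each new row is emitted once
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}
```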
  • 70. Kafka To Console Demo
  • 71. OPERATIONS
  • 72. You can: filter; sort; aggregate; join; foreach; explain
  • 73. Operators Demo
  • 74. How does it work?
  • 75-81. Deep Diving in Spark Internals [diagram, built up step by step: Dataset -> Logical Plan -> Optimized Logical Plan -> Physical Plans -> Selected Physical Plan (chosen by the Planner) -> Incremental #1, Incremental #2, Incremental #3]
  • 82. Incremental Execution: Planner polls
  • 83. Incremental Execution: Planner runs [diagram: Incremental #1 runs on offsets 1-5; Count: 5]
  • 84. Incremental Execution: Planner runs #2 [diagram: Incremental #1 runs on offsets 1-5, Count: 5; Incremental #2 runs on offsets 6-9, Count: 4]
  • 85. Aggregation with State [diagram: Incremental #1 runs on offsets 1-5, Count: 5; the state 5 is carried into Incremental #2 on offsets 6-9, Count: 5+4=9]
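The difference between the last two slides can be sketched as a toy model in plain Scala. It is not Spark's internal API: `OffsetRange`, `perBatch` and `running` are hypothetical names, and the offset ranges are the ones shown on the slides.

```scala
// Toy model of incremental execution; hypothetical names, not Spark internals.
final case class OffsetRange(from: Long, to: Long) { // both ends inclusive
  def size: Long = to - from + 1
}

object IncrementalCount {
  // Without state: every incremental run only sees its own offsets.
  def perBatch(batches: Seq[OffsetRange]): Seq[Long] = batches.map(_.size)

  // With state: each run adds its batch to the total carried over from the previous run.
  def running(batches: Seq[OffsetRange]): Seq[Long] =
    batches.scanLeft(0L)(_ + _.size).tail

  def main(args: Array[String]): Unit = {
    val batches = Seq(OffsetRange(1, 5), OffsetRange(6, 9))
    println(perBatch(batches)) // List(5, 4)
    println(running(batches))  // List(5, 9)
  }
}
```

The second run never re-reads offsets 1-5; it only needs the stored count 5, which is why the state must survive between runs.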
  • 86. DataSet.explain()
      == Physical Plan ==
      Project [avg(price)#43,carat#45]
      +- SortMergeJoin [color#21], [color#47]
         :- Sort [color#21 ASC], false, 0
         :  +- TungstenExchange hashpartitioning(color#21,200), None
         :     +- Project [avg(price)#43,color#21]
         :        +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
         :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
         :              +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])
         :                 +- Scan CsvRelation(-----)
         +- Sort [color#47 ASC], false, 0
            +- TungstenExchange hashpartitioning(color#47,200), None
               +- ConvertToUnsafe
                  +- Scan CsvRelation(----)
  • 87. What’s the difference between Complete and Append output modes?
  • 88. COMPLETE, APPEND & UPDATE
  • 89-91. There are two main modes and one in the future: append (default); complete; update [in dreams]
  • 92. Aggregation with watermarks
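A minimal sketch of an aggregation with a watermark, assuming a socket source with the `includeTimestamp` option so that every line carries an arrival timestamp. The 5-minute window and 10-minute lateness threshold are arbitrary illustrative values.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WatermarkedWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("WatermarkedWordCount")
      .getOrCreate()
    import spark.implicits._

    // With includeTimestamp the socket source yields (value, timestamp) rows.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .option("includeTimestamp", true)
      .load()

    val words = lines.as[(String, Timestamp)]
      .flatMap { case (line, ts) => line.split(" ").map(word => (word, ts)) }
      .toDF("word", "timestamp")

    // Events more than 10 minutes behind the latest event time seen may be dropped,
    // which lets Spark discard the state of old windows.
    val counts = words
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"), $"word")
      .count()

    counts.writeStream
      .outputMode("append") // a window is emitted once, after the watermark passes its end
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}
```

In append mode nothing is printed for a window until the watermark has moved past its end, so on a quiet socket the first output can take a while to appear.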
  • 93. SOURCES & SINKS
  • 94. Spark Streaming is a brick in the Big Data Wall
  • 95. Let's save to Parquet files
  • 96. Let's save to Parquet files
  • 97. Let's save to Parquet files
  • 98. Can we write to Kafka?
  • 99. Nightly Build
  • 100. Kafka-to-Kafka
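A sketch of a Kafka-to-Kafka pipeline using the Kafka sink available since Spark 2.2, assuming the `spark-sql-kafka-0-10` package is on the classpath. The broker address, both topic names, the checkpoint path and the upper-casing step are placeholders chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("KafkaToKafka")
      .getOrCreate()

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "messages")                     // assumed input topic
      .load()

    // The Kafka sink expects a "value" column and optionally a "key" column.
    val output = input.selectExpr(
      "CAST(key AS STRING) AS key",
      "upper(CAST(value AS STRING)) AS value"
    )

    output.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "messages-upper")                              // assumed output topic
      .option("checkpointLocation", "/tmp/kafka-to-kafka-checkpoint") // required by this sink
      .start()
      .awaitTermination()
  }
}
```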
  • 101. J-K-S-K-S-C
  • 102. Pipeline Demo
  • 103. I didn’t find sink/source for XXX…
  • 104. Console Foreach Sink
      import org.apache.spark.sql.ForeachWriter

      val customWriter = new ForeachWriter[String] {
        // Called once per partition and trigger; returning true accepts the rows
        override def open(partitionId: Long, version: Long) = true
        // Called for every row of the partition
        override def process(value: String) = println(value)
        override def close(errorOrNull: Throwable) = {}
      }

      stream.writeStream
        .queryName("ForeachOnConsole")
        .foreach(customWriter)
        .start
  • 105. Pinch of wisdom: check checkpointLocation; don’t use MemoryStream; think about GC pauses; be careful with nightly builds; use .groupBy.count() instead of count(); use the console sink instead of the .show() function
  • 106. We have no ability to: join two streams; work with update mode; make a full outer join; take the first N rows; sort without pre-aggregation
  • 107. Roadmap 2.2: support other data sources (not only S3 + HDFS); transactional updates; Dataset is one DSL for all operations; GraphFrames + Structured MLlib; KafkaWriter; TensorFrames
  • 108. IN CONCLUSION
  • 109. Final point: a scalable, fault-tolerant, real-time pipeline with Spark & Kafka is ready for use
  • 110. A few papers about Spark Streaming and Kafka: Introduction in Spark + Kafka http://bit.ly/2mJjE4i
  • 111. Source code from Demo: Spark Tutorial https://github.com/zaleslaw/Spark-Tutorial
  • 112. Contacts: E-mail: Alexey_Zinovyev@epam.com; Twitter: @zaleslaw, @BigDataRussia; Facebook: https://www.facebook.com/zaleslaw; vk.com/big_data_russia (Big Data Russia); vk.com/java_jvm (Java & JVM langs)
  • 113. Any questions?