Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Kafka pours and Spark
resolves!
Alexey Zinovyev, Java/BigData Trainer in EPAM
About
With IT since 2007
With Java since 2009
With Hadoop since 2012
With Spark since 2014
With EPAM since 2015
3Spark Streaming from Zinoviev Alexey
Contacts
E-mail : Alexey_Zinovyev@epam.com
Twitter : @zaleslaw @BigDataRussia
Facebo...
4Spark Streaming from Zinoviev Alexey
Spark
Family
5Spark Streaming from Zinoviev Alexey
Spark
Family
6Spark Streaming from Zinoviev Alexey
Pre-summary
• Before RealTime
• Spark + Cassandra
• Sending messages with Kafka
• DS...
7Spark Streaming from Zinoviev Alexey
< REAL-TIME
8Spark Streaming from Zinoviev Alexey
Modern Java in 2016Big Data in 2014
9Spark Streaming from Zinoviev Alexey
Batch jobs produce reports. More and more..
10Spark Streaming from Zinoviev Alexey
But customer can wait forever (ok, 1h)
11Spark Streaming from Zinoviev Alexey
Big Data in 2017
12Spark Streaming from Zinoviev Alexey
Machine Learning EVERYWHERE
13Spark Streaming from Zinoviev Alexey
Data Lake in promotional brochure
14Spark Streaming from Zinoviev Alexey
Data Lake in production
15Spark Streaming from Zinoviev Alexey
Simple Flow in Reporting/BI systems
16Spark Streaming from Zinoviev Alexey
Let’s use Spark. It’s fast!
17Spark Streaming from Zinoviev Alexey
MapReduce vs Spark
18Spark Streaming from Zinoviev Alexey
MapReduce vs Spark
19Spark Streaming from Zinoviev Alexey
Simple Flow in Reporting/BI systems with Spark
20Spark Streaming from Zinoviev Alexey
Spark handles last year logs with ease
21Spark Streaming from Zinoviev Alexey
Where can we store events?
22Spark Streaming from Zinoviev Alexey
Let’s use Cassandra to store events!
23Spark Streaming from Zinoviev Alexey
Let’s use Cassandra to read events!
24Spark Streaming from Zinoviev Alexey
Cassandra
CREATE KEYSPACE mySpace WITH replication = {'class':
'SimpleStrategy', 'r...
25Spark Streaming from Zinoviev Alexey
Cassandra
to Spark
val dataSet = sqlContext
.read
.format("org.apache.spark.sql.cas...
26Spark Streaming from Zinoviev Alexey
Simple Flow in Pre-Real-Time systems
27Spark Streaming from Zinoviev Alexey
Spark cluster over Cassandra Cluster
28Spark Streaming from Zinoviev Alexey
More events every second!
29Spark Streaming from Zinoviev Alexey
SENDING MESSAGES
30Spark Streaming from Zinoviev Alexey
Your Father’s Messaging System
31Spark Streaming from Zinoviev Alexey
Your Father’s Messaging System
32Spark Streaming from Zinoviev Alexey
Your Father’s Messaging System
InitialContext ctx = new InitialContext();
QueueConn...
33Spark Streaming from Zinoviev Alexey
KAFKA
34Spark Streaming from Zinoviev Alexey
Kafka
• messaging system
35Spark Streaming from Zinoviev Alexey
Kafka
• messaging system
• distributed
36Spark Streaming from Zinoviev Alexey
Kafka
• messaging system
• distributed
• supports Publish-Subscribe model
37Spark Streaming from Zinoviev Alexey
Kafka
• messaging system
• distributed
• supports Publish-Subscribe model
• persist...
38Spark Streaming from Zinoviev Alexey
Kafka
• messaging system
• distributed
• supports Publish-Subscribe model
• persist...
39Spark Streaming from Zinoviev Alexey
The main benefits of Kafka
Scalability with zero down time
Zero data loss due to re...
40Spark Streaming from Zinoviev Alexey
Kafka Cluster consists of …
• brokers (leader or follower)
• topics ( >= 1 partitio...
41Spark Streaming from Zinoviev Alexey
Kafka Components with topic “messages” #1
Producer
Thread #1
Producer
Thread #2
Pro...
42Spark Streaming from Zinoviev Alexey
Kafka Components with topic “messages” #2
Producer
Thread #1
Producer
Thread #2
Pro...
43Spark Streaming from Zinoviev Alexey
Kafka Components with topic “messages” #3
Producer
Thread #1
Producer
Thread #2
Pro...
44Spark Streaming from Zinoviev Alexey
Why do we need Zookeeper?
45Spark Streaming from Zinoviev Alexey
Kafka
Demo
46Spark Streaming from Zinoviev Alexey
REAL TIME WITH
DSTREAMS
47Spark Streaming from Zinoviev Alexey
RDD Factory
48Spark Streaming from Zinoviev Alexey
From socket to console with DStreams
49Spark Streaming from Zinoviev Alexey
DStream
val conf = new SparkConf().setMaster("local[2]")
.setAppName("NetworkWordCo...
50Spark Streaming from Zinoviev Alexey
DStream
val conf = new SparkConf().setMaster("local[2]")
.setAppName("NetworkWordCo...
51Spark Streaming from Zinoviev Alexey
DStream
val conf = new SparkConf().setMaster("local[2]")
.setAppName("NetworkWordCo...
52Spark Streaming from Zinoviev Alexey
DStream
val conf = new SparkConf().setMaster("local[2]")
.setAppName("NetworkWordCo...
53Spark Streaming from Zinoviev Alexey
DStream
val conf = new SparkConf().setMaster("local[2]")
.setAppName("NetworkWordCo...
54Spark Streaming from Zinoviev Alexey
Kafka as a main entry point for Spark
55Spark Streaming from Zinoviev Alexey
DStreams
Demo
56Spark Streaming from Zinoviev Alexey
How to avoid DStreams with RDD-like API?
57Spark Streaming from Zinoviev Alexey
SPARK 2.2 DISCUSSION
58Spark Streaming from Zinoviev Alexey
Continuous Applications
59Spark Streaming from Zinoviev Alexey
Continuous Applications cases
• Updating data that will be served in real time
• Ex...
60Spark Streaming from Zinoviev Alexey
The main concept of Structured Streaming
You can express your streaming computation...
61Spark Streaming from Zinoviev Alexey
Batch
Spark 2.2
// Read JSON once from S3
logsDF = spark.read.json("s3://logs")
// ...
62Spark Streaming from Zinoviev Alexey
Real
Time
Spark 2.2
// Read JSON continuously from S3
logsDF = spark.readStream.jso...
63Spark Streaming from Zinoviev Alexey
WordCount
from
Socket
val lines = spark.readStream
.format("socket")
.option("host"...
64Spark Streaming from Zinoviev Alexey
WordCount
from
Socket
val lines = spark.readStream
.format("socket")
.option("host"...
65Spark Streaming from Zinoviev Alexey
WordCount
from
Socket
val lines = spark.readStream
.format("socket")
.option("host"...
66Spark Streaming from Zinoviev Alexey
WordCount
from
Socket
val lines = spark.readStream
.format("socket")
.option("host"...
67Spark Streaming from Zinoviev Alexey
Unlimited Table
68Spark Streaming from Zinoviev Alexey
WordCount with Structured Streaming [Complete Mode]
69Spark Streaming from Zinoviev Alexey
Kafka -> Structured Streaming -> Console
70Spark Streaming from Zinoviev Alexey
Kafka To
Console
Demo
71Spark Streaming from Zinoviev Alexey
OPERATIONS
72Spark Streaming from Zinoviev Alexey
You can …
• filter
• sort
• aggregate
• join
• foreach
• explain
73Spark Streaming from Zinoviev Alexey
Operators
Demo
74Spark Streaming from Zinoviev Alexey
How it works?
75Spark Streaming from Zinoviev Alexey
Deep Diving in Spark Internals
Dataset
76Spark Streaming from Zinoviev Alexey
Deep Diving in Spark Internals
Dataset Logical Plan
77Spark Streaming from Zinoviev Alexey
Deep Diving in Spark Internals
Dataset Logical Plan
Optimized
Logical Plan
78Spark Streaming from Zinoviev Alexey
Deep Diving in Spark Internals
Dataset Logical Plan
Optimized
Logical Plan
Logical ...
79Spark Streaming from Zinoviev Alexey
Deep Diving in Spark Internals
Dataset Logical Plan
Optimized
Logical Plan
Logical ...
80Spark Streaming from Zinoviev Alexey
Deep Diving in Spark Internals
Dataset Logical Plan
Optimized
Logical Plan
Logical ...
81Spark Streaming from Zinoviev Alexey
Deep Diving in Spark Internals
Dataset Logical Plan
Optimized
Logical Plan
Logical ...
82Spark Streaming from Zinoviev Alexey
Incremental Execution: Planner polls
Planner
83Spark Streaming from Zinoviev Alexey
Incremental Execution: Planner runs
Planner
Offset: 1 - 5
Incremental
#1
R
U
N
Coun...
84Spark Streaming from Zinoviev Alexey
Incremental Execution: Planner runs #2
Planner
Offset: 1 - 5
Incremental
#1
Increme...
85Spark Streaming from Zinoviev Alexey
Aggregation with State
Planner
Offset: 1 - 5
Incremental
#1
Incremental
#2
Offset: ...
86Spark Streaming from Zinoviev Alexey
DataSet.explain()
== Physical Plan ==
Project [avg(price)#43,carat#45]
+- SortMerge...
87Spark Streaming from Zinoviev Alexey
What’s the difference between
Complete and Append
output modes?
88Spark Streaming from Zinoviev Alexey
COMPLETE, APPEND &
UPDATE
89Spark Streaming from Zinoviev Alexey
There are two main modes and one in future
• append (default)
90Spark Streaming from Zinoviev Alexey
There are two main modes and one in future
• append (default)
• complete
91Spark Streaming from Zinoviev Alexey
There are two main modes and one in future
• append (default)
• complete
• update [...
92Spark Streaming from Zinoviev Alexey
Aggregation with watermarks
93Spark Streaming from Zinoviev Alexey
SOURCES & SINKS
94Spark Streaming from Zinoviev Alexey
Spark Streaming is a brick in the Big Data Wall
95Spark Streaming from Zinoviev Alexey
Let's save to Parquet files
96Spark Streaming from Zinoviev Alexey
Let's save to Parquet files
97Spark Streaming from Zinoviev Alexey
Let's save to Parquet files
98Spark Streaming from Zinoviev Alexey
Can we write to Kafka?
99Spark Streaming from Zinoviev Alexey
Nightly Build
100Spark Streaming from Zinoviev Alexey
Kafka-to-Kafka
101Spark Streaming from Zinoviev Alexey
J-K-S-K-S-C
102Spark Streaming from Zinoviev Alexey
Pipeline
Demo
103Spark Streaming from Zinoviev Alexey
I didn’t find sink/source for XXX…
104Spark Streaming from Zinoviev Alexey
Console
Foreach
Sink
import org.apache.spark.sql.ForeachWriter
val customWriter = ...
105Spark Streaming from Zinoviev Alexey
Pinch of wisdom
• check checkpointLocation
• don’t use MemoryStream
• think about ...
106Spark Streaming from Zinoviev Alexey
We have no ability…
• join two streams
• work with update mode
• make full outer j...
107Spark Streaming from Zinoviev Alexey
Roadmap 2.2
• Support other data sources (not only S3 + HDFS)
• Transactional upda...
108Spark Streaming from Zinoviev Alexey
IN CONCLUSION
109Spark Streaming from Zinoviev Alexey
Final point
Scalable Fault-Tolerant Real-Time Pipeline with
Spark & Kafka
is ready...
110Spark Streaming from Zinoviev Alexey
A few papers about Spark Streaming and Kafka
Introduction in Spark + Kafka
http://...
111Spark Streaming from Zinoviev Alexey
Source code from Demo
Spark Tutorial
https://github.com/zaleslaw/Spark-Tutorial
112Spark Streaming from Zinoviev Alexey
Contacts
E-mail : Alexey_Zinovyev@epam.com
Twitter : @zaleslaw @BigDataRussia
Face...
113Spark Streaming from Zinoviev Alexey
Any questions?
Upcoming SlideShare
Loading in …5
×

Kafka pours and Spark resolves

755 views

Published on

This is second part of Spark 2 new features overview
This topic covers API changes; Structured Streaming; Output Modes, Apache Kafka, Kafka Direct Streams, Kafka source/sink in Spark 2.2)

Published in: Data & Analytics
  • Be the first to comment

Kafka pours and Spark resolves

  1. 1. Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM
  2. 2. About With IT since 2007 With Java since 2009 With Hadoop since 2012 With Spark since 2014 With EPAM since 2015
  3. 3. 3Spark Streaming from Zinoviev Alexey Contacts E-mail : Alexey_Zinovyev@epam.com Twitter : @zaleslaw @BigDataRussia Facebook: https://www.facebook.com/zaleslaw vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs
  4. 4. 4Spark Streaming from Zinoviev Alexey Spark Family
  5. 5. 5Spark Streaming from Zinoviev Alexey Spark Family
  6. 6. 6Spark Streaming from Zinoviev Alexey Pre-summary • Before RealTime • Spark + Cassandra • Sending messages with Kafka • DStream Kafka Consumer • Structured Streaming in Spark 2.1 • Kafka Writer in Spark 2.2
  7. 7. 7Spark Streaming from Zinoviev Alexey < REAL-TIME
  8. 8. 8Spark Streaming from Zinoviev Alexey Modern Java in 2016Big Data in 2014
  9. 9. 9Spark Streaming from Zinoviev Alexey Batch jobs produce reports. More and more..
  10. 10. 10Spark Streaming from Zinoviev Alexey But customer can wait forever (ok, 1h)
  11. 11. 11Spark Streaming from Zinoviev Alexey Big Data in 2017
  12. 12. 12Spark Streaming from Zinoviev Alexey Machine Learning EVERYWHERE
  13. 13. 13Spark Streaming from Zinoviev Alexey Data Lake in promotional brochure
  14. 14. 14Spark Streaming from Zinoviev Alexey Data Lake in production
  15. 15. 15Spark Streaming from Zinoviev Alexey Simple Flow in Reporting/BI systems
  16. 16. 16Spark Streaming from Zinoviev Alexey Let’s use Spark. It’s fast!
  17. 17. 17Spark Streaming from Zinoviev Alexey MapReduce vs Spark
  18. 18. 18Spark Streaming from Zinoviev Alexey MapReduce vs Spark
  19. 19. 19Spark Streaming from Zinoviev Alexey Simple Flow in Reporting/BI systems with Spark
  20. 20. 20Spark Streaming from Zinoviev Alexey Spark handles last year logs with ease
  21. 21. 21Spark Streaming from Zinoviev Alexey Where can we store events?
  22. 22. 22Spark Streaming from Zinoviev Alexey Let’s use Cassandra to store events!
  23. 23. 23Spark Streaming from Zinoviev Alexey Let’s use Cassandra to read events!
  24. 24. 24Spark Streaming from Zinoviev Alexey Cassandra CREATE KEYSPACE mySpace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; USE test; CREATE TABLE logs ( application TEXT, time TIMESTAMP, message TEXT, PRIMARY KEY (application, time));
  25. 25. 25Spark Streaming from Zinoviev Alexey Cassandra to Spark val dataSet = sqlContext .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "logs", "keyspace" -> "mySpace" )) .load() dataSet .filter("message = 'Log message'") .show()
  26. 26. 26Spark Streaming from Zinoviev Alexey Simple Flow in Pre-Real-Time systems
  27. 27. 27Spark Streaming from Zinoviev Alexey Spark cluster over Cassandra Cluster
  28. 28. 28Spark Streaming from Zinoviev Alexey More events every second!
  29. 29. 29Spark Streaming from Zinoviev Alexey SENDING MESSAGES
  30. 30. 30Spark Streaming from Zinoviev Alexey Your Father’s Messaging System
  31. 31. 31Spark Streaming from Zinoviev Alexey Your Father’s Messaging System
  32. 32. 32Spark Streaming from Zinoviev Alexey Your Father’s Messaging System InitialContext ctx = new InitialContext(); QueueConnectionFactory f = (QueueConnectionFactory)ctx.lookup(“qCFactory"); QueueConnection con = f.createQueueConnection(); con.start();
  33. 33. 33Spark Streaming from Zinoviev Alexey KAFKA
  34. 34. 34Spark Streaming from Zinoviev Alexey Kafka • messaging system
  35. 35. 35Spark Streaming from Zinoviev Alexey Kafka • messaging system • distributed
  36. 36. 36Spark Streaming from Zinoviev Alexey Kafka • messaging system • distributed • supports Publish-Subscribe model
  37. 37. 37Spark Streaming from Zinoviev Alexey Kafka • messaging system • distributed • supports Publish-Subscribe model • persists messages on disk
  38. 38. 38Spark Streaming from Zinoviev Alexey Kafka • messaging system • distributed • supports Publish-Subscribe model • persists messages on disk • replicates within the cluster (integrated with Zookeeper)
  39. 39. 39Spark Streaming from Zinoviev Alexey The main benefits of Kafka Scalability with zero down time Zero data loss due to replication
  40. 40. 40Spark Streaming from Zinoviev Alexey Kafka Cluster consists of … • brokers (leader or follower) • topics ( >= 1 partition) • partitions • partition offsets • replicas of partition • producers/consumers
  41. 41. 41Spark Streaming from Zinoviev Alexey Kafka Components with topic “messages” #1 Producer Thread #1 Producer Thread #2 Producer Thread #3 Topic: Messages Data Data Data Zookeeper
  42. 42. 42Spark Streaming from Zinoviev Alexey Kafka Components with topic “messages” #2 Producer Thread #1 Producer Thread #2 Producer Thread #3 Topic: Messages Part #1 Part #2 Data Data Data Broker #1 Zookeeper Part #1 Part #2 Leader Leader
  43. 43. 43Spark Streaming from Zinoviev Alexey Kafka Components with topic “messages” #3 Producer Thread #1 Producer Thread #2 Producer Thread #3 Topic: Messages Part #1 Part #2 Data Data Data Broker #1 Broker #2 Zookeeper Part #1 Part #1 Part #2 Part #2
  44. 44. 44Spark Streaming from Zinoviev Alexey Why do we need Zookeeper?
  45. 45. 45Spark Streaming from Zinoviev Alexey Kafka Demo
  46. 46. 46Spark Streaming from Zinoviev Alexey REAL TIME WITH DSTREAMS
  47. 47. 47Spark Streaming from Zinoviev Alexey RDD Factory
  48. 48. 48Spark Streaming from Zinoviev Alexey From socket to console with DStreams
  49. 49. 49Spark Streaming from Zinoviev Alexey DStream val conf = new SparkConf().setMaster("local[2]") .setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination()
  50. 50. 50Spark Streaming from Zinoviev Alexey DStream val conf = new SparkConf().setMaster("local[2]") .setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination()
  51. 51. 51Spark Streaming from Zinoviev Alexey DStream val conf = new SparkConf().setMaster("local[2]") .setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination()
  52. 52. 52Spark Streaming from Zinoviev Alexey DStream val conf = new SparkConf().setMaster("local[2]") .setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination()
  53. 53. 53Spark Streaming from Zinoviev Alexey DStream val conf = new SparkConf().setMaster("local[2]") .setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination()
  54. 54. 54Spark Streaming from Zinoviev Alexey Kafka as a main entry point for Spark
  55. 55. 55Spark Streaming from Zinoviev Alexey DStreams Demo
  56. 56. 56Spark Streaming from Zinoviev Alexey How to avoid DStreams with RDD-like API?
  57. 57. 57Spark Streaming from Zinoviev Alexey SPARK 2.2 DISCUSSION
  58. 58. 58Spark Streaming from Zinoviev Alexey Continuous Applications
  59. 59. 59Spark Streaming from Zinoviev Alexey Continuous Applications cases • Updating data that will be served in real time • Extract, transform and load (ETL) • Creating a real-time version of an existing batch job • Online machine learning
  60. 60. 60Spark Streaming from Zinoviev Alexey The main concept of Structured Streaming You can express your streaming computation the same way you would express a batch computation on static data.
  61. 61. 61Spark Streaming from Zinoviev Alexey Batch Spark 2.2 // Read JSON once from S3 logsDF = spark.read.json("s3://logs") // Transform with DataFrame API and save logsDF.select("user", "url", "date") .write.parquet("s3://out")
  62. 62. 62Spark Streaming from Zinoviev Alexey Real Time Spark 2.2 // Read JSON continuously from S3 logsDF = spark.readStream.json("s3://logs") // Transform with DataFrame API and save logsDF.select("user", "url", "date") .writeStream.parquet("s3://out") .start()
  63. 63. 63Spark Streaming from Zinoviev Alexey WordCount from Socket val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() val words = lines.as[String].flatMap(_.split(" ")) val wordCounts = words.groupBy("value").count()
  64. 64. 64Spark Streaming from Zinoviev Alexey WordCount from Socket val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() val words = lines.as[String].flatMap(_.split(" ")) val wordCounts = words.groupBy("value").count()
  65. 65. 65Spark Streaming from Zinoviev Alexey WordCount from Socket val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() val words = lines.as[String].flatMap(_.split(" ")) val wordCounts = words.groupBy("value").count()
  66. 66. 66Spark Streaming from Zinoviev Alexey WordCount from Socket val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() val words = lines.as[String].flatMap(_.split(" ")) val wordCounts = words.groupBy("value").count() Don’t forget to start Streaming
  67. 67. 67Spark Streaming from Zinoviev Alexey Unlimited Table
  68. 68. 68Spark Streaming from Zinoviev Alexey WordCount with Structured Streaming [Complete Mode]
  69. 69. 69Spark Streaming from Zinoviev Alexey Kafka -> Structured Streaming -> Console
  70. 70. 70Spark Streaming from Zinoviev Alexey Kafka To Console Demo
  71. 71. 71Spark Streaming from Zinoviev Alexey OPERATIONS
  72. 72. 72Spark Streaming from Zinoviev Alexey You can … • filter • sort • aggregate • join • foreach • explain
  73. 73. 73Spark Streaming from Zinoviev Alexey Operators Demo
  74. 74. 74Spark Streaming from Zinoviev Alexey How it works?
  75. 75. 75Spark Streaming from Zinoviev Alexey Deep Diving in Spark Internals Dataset
  76. 76. 76Spark Streaming from Zinoviev Alexey Deep Diving in Spark Internals Dataset Logical Plan
  77. 77. 77Spark Streaming from Zinoviev Alexey Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan
  78. 78. 78Spark Streaming from Zinoviev Alexey Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan Logical Plan Logical Plan Logical PlanPhysical Plan
  79. 79. 79Spark Streaming from Zinoviev Alexey Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan Logical Plan Logical Plan Logical PlanPhysical Plan Selected Physical Plan Planner
  80. 80. 80Spark Streaming from Zinoviev Alexey Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan Logical Plan Logical Plan Logical PlanPhysical Plan Selected Physical Plan
  81. 81. 81Spark Streaming from Zinoviev Alexey Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan Logical Plan Logical Plan Logical PlanPhysical Plan Selected Physical Plan Incremental #1 Incremental #2 Incremental #3
  82. 82. 82Spark Streaming from Zinoviev Alexey Incremental Execution: Planner polls Planner
  83. 83. 83Spark Streaming from Zinoviev Alexey Incremental Execution: Planner runs Planner Offset: 1 - 5 Incremental #1 R U N Count: 5
  84. 84. 84Spark Streaming from Zinoviev Alexey Incremental Execution: Planner runs #2 Planner Offset: 1 - 5 Incremental #1 Incremental #2 Offset: 6 - 9 R U N Count: 5 Count: 4
  85. 85. 85Spark Streaming from Zinoviev Alexey Aggregation with State Planner Offset: 1 - 5 Incremental #1 Incremental #2 Offset: 6 - 9 R U N Count: 5 Count: 5+4=9 5
  86. 86. 86Spark Streaming from Zinoviev Alexey DataSet.explain() == Physical Plan == Project [avg(price)#43,carat#45] +- SortMergeJoin [color#21], [color#47] :- Sort [color#21 ASC], false, 0 : +- TungstenExchange hashpartitioning(color#21,200), None : +- Project [avg(price)#43,color#21] : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43]) : +- TungstenExchange hashpartitioning(cut#20,color#21,200), None : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L]) : +- Scan CsvRelation(-----) +- Sort [color#47 ASC], false, 0 +- TungstenExchange hashpartitioning(color#47,200), None +- ConvertToUnsafe +- Scan CsvRelation(----)
  87. 87. 87Spark Streaming from Zinoviev Alexey What’s the difference between Complete and Append output modes?
  88. 88. 88Spark Streaming from Zinoviev Alexey COMPLETE, APPEND & UPDATE
  89. 89. 89Spark Streaming from Zinoviev Alexey There are two main modes and one in future • append (default)
  90. 90. 90Spark Streaming from Zinoviev Alexey There are two main modes and one in future • append (default) • complete
  91. 91. 91Spark Streaming from Zinoviev Alexey There are two main modes and one in future • append (default) • complete • update [in dreams]
  92. 92. 92Spark Streaming from Zinoviev Alexey Aggregation with watermarks
  93. 93. 93Spark Streaming from Zinoviev Alexey SOURCES & SINKS
  94. 94. 94Spark Streaming from Zinoviev Alexey Spark Streaming is a brick in the Big Data Wall
  95. 95. 95Spark Streaming from Zinoviev Alexey Let's save to Parquet files
  96. 96. 96Spark Streaming from Zinoviev Alexey Let's save to Parquet files
  97. 97. 97Spark Streaming from Zinoviev Alexey Let's save to Parquet files
  98. 98. 98Spark Streaming from Zinoviev Alexey Can we write to Kafka?
  99. 99. 99Spark Streaming from Zinoviev Alexey Nightly Build
  100. 100. 100Spark Streaming from Zinoviev Alexey Kafka-to-Kafka
  101. 101. 101Spark Streaming from Zinoviev Alexey J-K-S-K-S-C
  102. 102. 102Spark Streaming from Zinoviev Alexey Pipeline Demo
  103. 103. 103Spark Streaming from Zinoviev Alexey I didn’t find sink/source for XXX…
  104. 104. 104Spark Streaming from Zinoviev Alexey Console Foreach Sink import org.apache.spark.sql.ForeachWriter val customWriter = new ForeachWriter[String] { override def open(partitionId: Long, version: Long) = true override def process(value: String) = println(value) override def close(errorOrNull: Throwable) = {} } stream.writeStream .queryName(“ForeachOnConsole") .foreach(customWriter) .start
  105. 105. 105Spark Streaming from Zinoviev Alexey Pinch of wisdom • check checkpointLocation • don’t use MemoryStream • think about GC pauses • be careful about nighty builds • use .groupBy.count() instead count() • use console sink instead .show() function
  106. 106. 106Spark Streaming from Zinoviev Alexey We have no ability… • join two streams • work with update mode • make full outer join • take first N rows • sort without pre-aggregation
  107. 107. 107Spark Streaming from Zinoviev Alexey Roadmap 2.2 • Support other data sources (not only S3 + HDFS) • Transactional updates • Dataset is one DSL for all operations • GraphFrames + Structured MLLib • KafkaWriter • TensorFrames
  108. 108. 108Spark Streaming from Zinoviev Alexey IN CONCLUSION
  109. 109. 109Spark Streaming from Zinoviev Alexey Final point Scalable Fault-Tolerant Real-Time Pipeline with Spark & Kafka is ready for usage
  110. 110. 110Spark Streaming from Zinoviev Alexey A few papers about Spark Streaming and Kafka Introduction in Spark + Kafka http://bit.ly/2mJjE4i
  111. 111. 111Spark Streaming from Zinoviev Alexey Source code from Demo Spark Tutorial https://github.com/zaleslaw/Spark-Tutorial
  112. 112. 112Spark Streaming from Zinoviev Alexey Contacts E-mail : Alexey_Zinovyev@epam.com Twitter : @zaleslaw @BigDataRussia Facebook: https://www.facebook.com/zaleslaw vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs
  113. 113. 113Spark Streaming from Zinoviev Alexey Any questions?

×