Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
®
© 2016 MapR Technologies 1®
© 2016 MapR Technologies 1© 2016 MapR Technologies
®
Exploring Data Pipelines for Spark Stre...
®
© 2016 MapR Technologies 2®
© 2016 MapR Technologies 2
What is Streaming Data? Got Some Examples?
Data Collection
Device...
®
© 2016 MapR Technologies 3®
© 2016 MapR Technologies 3
It was hot
at 6:05
yesterday
!
Why Stream Processing?
Analyze
6:0...
®
© 2016 MapR Technologies 4®
© 2016 MapR Technologies 4
Why Stream Processing?
6:05 P.M.: 90°
To
pic
Stream
Temperature
T...
®
© 2016 MapR Technologies 5®
© 2016 MapR Technologies 5
Key to Real Time: Event-based Data Flows
web events
etc…
machine ...
®
© 2016 MapR Technologies 6®
© 2016 MapR Technologies 6
What if BP had detected problems before the oil hit
the water ?
•...
®
© 2016 MapR Technologies 7®
© 2016 MapR Technologies 7
Use Case: Time Series Data
Data for
real-time monitoring
read
Sen...
®
© 2016 MapR Technologies 8®
© 2016 MapR Technologies 8
Schema
•  All events stored, CF data could be set to expire data
...
®
© 2016 MapR Technologies 9®
© 2016 MapR Technologies 9
Schema
•  All events stored, CF data could be set to expire data
...
®
© 2016 MapR Technologies 10®
© 2016 MapR Technologies 10
Schema
•  All events stored, CF data could be set to expire dat...
®
© 2016 MapR Technologies 11®
© 2016 MapR Technologies 11
Serve DataStore DataCollect Data
What Do We Need to Do ?
Proces...
®
© 2016 MapR Technologies 12®
© 2016 MapR Technologies 12
How do we do this with High Performance at Scale?
•  Parallel o...
®
© 2016 MapR Technologies 13®
© 2016 MapR Technologies 13
Collect the Data
Data Ingest
MapR-FS
Source
Stream
Topic
•  Dat...
®
© 2016 MapR Technologies 14®
© 2016 MapR Technologies 14
MapR Streams Publish Subscribe Messaging
Topics Organize Events...
®
© 2016 MapR Technologies 15®
© 2016 MapR Technologies 15
Scalable Messaging with MapR Streams
Topics are partitioned for...
®
© 2016 MapR Technologies 16®
© 2016 MapR Technologies 16
How do we do this with High Performance at Scale?
•  Parallel ,...
®
© 2016 MapR Technologies 17®
© 2016 MapR Technologies 17
Collect Data
Process the Data with Spark Streaming
MapR-FS
Proc...
®
© 2016 MapR Technologies 18®
© 2016 MapR Technologies 18
Processing Spark DStreams
Data stream divided into batches of X...
®
© 2016 MapR Technologies 19®
© 2016 MapR Technologies 19
Spark Resilient Distributed Datasets
RDD
W
Executor
P4
W
Execut...
®
© 2016 MapR Technologies 20®
© 2016 MapR Technologies 20
Spark Resilient Distributed Datasets
Spark revolves around RDDs...
®
© 2016 MapR Technologies 21®
© 2016 MapR Technologies 21
How do we do this with High Performance at Scale?
•  Parallel ,...
®
© 2016 MapR Technologies 22®
© 2016 MapR Technologies 22
Processing Spark DStreams
transformations à create new RDDs
Tw...
®
© 2016 MapR Technologies 23®
© 2016 MapR Technologies 23
Two types of operations on DStreams
•  Transformations
•  Outpu...
®
© 2016 MapR Technologies 24®
© 2016 MapR Technologies 24
Serve DataStore DataCollect Data
What Do We Need to Do ?
MapR-F...
®
© 2016 MapR Technologies 25®
© 2016 MapR Technologies 25
MapR-DB (HBase API) is Designed to Scale
Key
Range
xxxx
xxxx
Ke...
®
© 2016 MapR Technologies 26®
© 2016 MapR Technologies 26
Store Lots of Data with NoSQL MapR-DB
bottleneck
Key colB col
C...
®
© 2016 MapR Technologies 27®
© 2016 MapR Technologies 27
Key to Real Time: Event-based Data Flows
Key to Scale = Paralle...
®
© 2016 MapR Technologies 28®
© 2016 MapR Technologies 28
Serve DataStore DataCollect Data
What Do We Need to Do ?
MapR-F...
®
© 2016 MapR Technologies 29®
© 2016 MapR Technologies 29
Use Case Example Code
Data for
real-time monitoring
read
Sensor...
®
© 2016 MapR Technologies 30®
© 2016 MapR Technologies 30
Use Case Example Code
Data for
real-time monitoring
read
Sensor...
®
© 2016 MapR Technologies 31®
© 2016 MapR Technologies 31
KafkaProducer
String topic=“/streams/pump:warning”;
public stat...
®
© 2016 MapR Technologies 32®
© 2016 MapR Technologies 32
Use Case Example Code
Data for
real-time monitoring
read
Sensor...
®
© 2016 MapR Technologies 33®
© 2016 MapR Technologies 33
Create a DStream
DStream: a sequence of RDDs
representing a str...
®
© 2016 MapR Technologies 34®
© 2016 MapR Technologies 34
Process DStream
val sensorDStream = dStream.map(_._2).map(parse...
®
© 2016 MapR Technologies 35®
© 2016 MapR Technologies 35
Message Data to Sensor Object
case class Sensor(resid: String, ...
®
© 2016 MapR Technologies 36®
© 2016 MapR Technologies 36
DataFrame and SQL Operations
// for Each RDD
sensorDStream.fore...
®
© 2016 MapR Technologies 37®
© 2016 MapR Technologies 37
Streaming Application Output
®
© 2016 MapR Technologies 38®
© 2016 MapR Technologies 38
Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(...
®
© 2016 MapR Technologies 39®
© 2016 MapR Technologies 39
Start Receiving Data
sensorDStream.foreachRDD { rdd =>
. . .
}
...
®
© 2016 MapR Technologies 40®
© 2016 MapR Technologies 40
Stream Processing
Building a Complete Data Architecture
MapR Fi...
®
© 2016 MapR Technologies 41®
© 2016 MapR Technologies 41
To Learn More:
•  Read explanation of and Download code
–  http...
®
© 2016 MapR Technologies 42®
© 2016 MapR Technologies 42
To Learn More:
•  http://learn.mapr.com/
®
© 2016 MapR Technologies 43®
© 2016 MapR Technologies 43
Q&A
@mapr
@caroljmcdonald
https://www.mapr.com/blog/author/caro...
Upcoming SlideShare
Loading in …5
×

Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API and the HBase API

409 views

Published on

How to Build Fast, Scalable, Streaming Applications with Spark Streaming, MapR Steams using the Kafka API and MapR-DB using the HBase API

Published in: Software
  • Be the first to comment

Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API and the HBase API

  1. 1. ® © 2016 MapR Technologies 1® © 2016 MapR Technologies 1© 2016 MapR Technologies ® Exploring Data Pipelines for Spark Streaming Applications Carol McDonald, Industry Solutions Architect 2016
  2. 2. ® © 2016 MapR Technologies 2® © 2016 MapR Technologies 2 What is Streaming Data? Got Some Examples? Data Collection Devices Smart Machinery Phones and Tablets Home Automation RFID Systems Digital Signage Security Systems Medical Devices
  3. 3. ® © 2016 MapR Technologies 3® © 2016 MapR Technologies 3 It was hot at 6:05 yesterday ! Why Stream Processing? Analyze 6:01 P.M.: 72° 6:02 P.M.: 75° 6:03 P.M.: 77° 6:04 P.M.: 85° 6:05 P.M.: 90° 6:06 P.M.: 85° 6:07 P.M.: 77° 6:08 P.M.: 75° 90°90° 6:01 P.M.: 72° 6:02 P.M.: 75° 6:03 P.M.: 77° 6:04 P.M.: 85° 6:05 P.M.: 90° 6:06 P.M.: 85° 6:07 P.M.: 77° 6:08 P.M.: 75° Batch processing may be too late for some events
  4. 4. ® © 2016 MapR Technologies 4® © 2016 MapR Technologies 4 Why Stream Processing? 6:05 P.M.: 90° To pic Stream Temperature Turn on the air conditioning! It’s becoming important to process events as they arrive
  5. 5. ® © 2016 MapR Technologies 5® © 2016 MapR Technologies 5 Key to Real Time: Event-based Data Flows web events etc… machine sensors Biometrics Mobile events
  6. 6. ® © 2016 MapR Technologies 6® © 2016 MapR Technologies 6 What if BP had detected problems before the oil hit the water ? •  1M samples/sec •  High performance at scale is necessary!
  7. 7. ® © 2016 MapR Technologies 7® © 2016 MapR Technologies 7 Use Case: Time Series Data Data for real-time monitoring read Sensor time-stamped data Spark processing Spark Streaming Stream Topic
  8. 8. ® © 2016 MapR Technologies 8® © 2016 MapR Technologies 8 Schema •  All events stored, CF data could be set to expire data •  Filtered alerts put in CF alerts •  Daily summaries put in CF stats Row key CF data CF alerts CF stats hz … psi psi … hz_avg … psi_min COHUTTA_3/10/14_1:01 10.37 84 0 COHUTTA_3/10/14 10 0 Row Key contains oil pump name, date, and a time stamp
  9. 9. ® © 2016 MapR Technologies 9® © 2016 MapR Technologies 9 Schema •  All events stored, CF data could be set to expire data •  Filtered alerts put in CF alerts •  Daily summaries put in CF stats Row key CF data CF alerts CF stats hz … psi psi … hz_avg … psi_min COHUTTA_3/10/14_1:01 10.37 84 0 COHUTTA_3/10/14 10 0
  10. 10. ® © 2016 MapR Technologies 10® © 2016 MapR Technologies 10 Schema •  All events stored, CF data could be set to expire data •  Filtered alerts put in CF alerts •  Daily summaries put in CF stats Row key CF data CF alerts CF stats hz … psi psi … hz_avg … psi_min COHUTTA_3/10/14_1:01 10.37 84 0 COHUTTA_3/10/14 10 0
  11. 11. ® © 2016 MapR Technologies 11® © 2016 MapR Technologies 11 Serve DataStore DataCollect Data What Do We Need to Do ? Process DataData Sources ? ? ? ?
  12. 12. ® © 2016 MapR Technologies 12® © 2016 MapR Technologies 12 How do we do this with High Performance at Scale? •  Parallel operations and minimize disk read/write time
  13. 13. ® © 2016 MapR Technologies 13® © 2016 MapR Technologies 13 Collect the Data Data Ingest MapR-FS Source Stream Topic •  Data Ingest: –  File Based: NFS with MapR-FS, HDFS –  Network Based: MapR Streams, Kafka, Kinesis, Twitter, Sockets...
  14. 14. ® © 2016 MapR Technologies 14® © 2016 MapR Technologies 14 MapR Streams Publish Subscribe Messaging Topics Organize Events into Categories and decouple Producers from Consumers
  15. 15. ® © 2016 MapR Technologies 15® © 2016 MapR Technologies 15 Scalable Messaging with MapR Streams Topics are partitioned for throughput and scalability
  16. 16. ® © 2016 MapR Technologies 16® © 2016 MapR Technologies 16 How do we do this with High Performance at Scale? •  Parallel , Partitioned = fast , scalable –  Messaging with MapR Streams
  17. 17. ® © 2016 MapR Technologies 17® © 2016 MapR Technologies 17 Collect Data Process the Data with Spark Streaming MapR-FS Process Data Stream Topic •  Extension of the core Spark AP •  Enables scalable, high-throughput, fault-tolerant stream processing of live data
  18. 18. ® © 2016 MapR Technologies 18® © 2016 MapR Technologies 18 Processing Spark DStreams Data stream divided into batches of X milliseconds = DStreams
  19. 19. ® © 2016 MapR Technologies 19® © 2016 MapR Technologies 19 Spark Resilient Distributed Datasets RDD W Executor P4 W Executor P1 P3 W Executor P2 partitioned Partition 1 8213034705, 95, 2.927373, jake7870, 0…… Partition 2 8213034705, 115, 2.943484, Davidbresler2, 1…. Partition 3 8213034705, 100, 2.951285, gladimacowgirl, 58… Partition 4 8213034705, 117, 2.998947, daysrus, 95…. Spark revolves around RDDs •  Read only collection of elements •  Partitioned across a cluster •  Operated on in parallel •  Cached in memory
  20. 20. ® © 2016 MapR Technologies 20® © 2016 MapR Technologies 20 Spark Resilient Distributed Datasets Spark revolves around RDDs •  Read only collection of elements •  Partitioned across a cluster •  Operated on in parallel •  Cached in memory
  21. 21. ® © 2016 MapR Technologies 21® © 2016 MapR Technologies 21 How do we do this with High Performance at Scale? •  Parallel , Partitioned = fast , scalable –  Processing with Spark
  22. 22. ® © 2016 MapR Technologies 22® © 2016 MapR Technologies 22 Processing Spark DStreams transformations à create new RDDs Two types of operations on DStreams: •  Transformations: –  Create new DStreams –  map, filter, reduceByKey, SQL. . . •  Output Operations DStream RDDs DStream RDDs transform  transform   data from time 0 to 1 RDD @ time 1 data from time 1 to 2 RDD @ time 2 data from time 2 to 3 RDD @ time 3 RDD @ time 3 transform   RDD @ time 1 RDD @ time 2
  23. 23. ® © 2016 MapR Technologies 23® © 2016 MapR Technologies 23 Two types of operations on DStreams •  Transformations •  Output Operations: trigger Computation –  Save to File, HBase.. •  saveAsHadoopFiles •  saveAsHadoopDataset •  saveAsTextFiles Processing Spark DStreams Output operations à trigger computation MapR-FS MapR-DB DStream RDDs data from time 0 to 1 data from time 1 to 2 data from time 2 to 3 RDD @ time 3RDD @ time 1 RDD @ time 2 mapmap map savesave save
  24. 24. ® © 2016 MapR Technologies 24® © 2016 MapR Technologies 24 Serve DataStore DataCollect Data What Do We Need to Do ? MapR-FS Process DataData Sources MapR-FS Stream Topic
  25. 25. ® © 2016 MapR Technologies 25® © 2016 MapR Technologies 25 MapR-DB (HBase API) is Designed to Scale Key Range xxxx xxxx Key Range xxxx xxxx Key Range xxxx xxxx Key colB col C val val val xxx val val Key colB col C val val val xxx val val Key colB col C val val val xxx val val Fast Reads and Writes by Key! Data is automatically partitioned by Key Range!
  26. 26. ® © 2016 MapR Technologies 26® © 2016 MapR Technologies 26 Store Lots of Data with NoSQL MapR-DB bottleneck Key colB col C val val val xxx val val Key colB col C val val val xxx val val Key colB col C val val val xxx val val Storage ModelRDBMS MapR-DB Normalized schema à Joins for queries can cause bottleneck De-Normalized schema à Data that is read together is stored together
  27. 27. ® © 2016 MapR Technologies 27® © 2016 MapR Technologies 27 Key to Real Time: Event-based Data Flows Key to Scale = Parallel Partitioned: •  Messaging •  Processing •  Storage
  28. 28. ® © 2016 MapR Technologies 28® © 2016 MapR Technologies 28 Serve DataStore DataCollect Data What Do We Need to Do ? MapR-FS Process DataData Sources MapR-FS Stream Topic
  29. 29. ® © 2016 MapR Technologies 29® © 2016 MapR Technologies 29 Use Case Example Code Data for real-time monitoring read Sensor time-stamped data Spark processing Spark Streaming Stream Topic
  30. 30. ® © 2016 MapR Technologies 30® © 2016 MapR Technologies 30 Use Case Example Code Data for real-time monitoring read Sensor time-stamped data Spark processing Spark Streaming Stream Topic
  31. 31. ® © 2016 MapR Technologies 31® © 2016 MapR Technologies 31 KafkaProducer String topic=“/streams/pump:warning”; public static KafkaProducer producer; Properties properties = new Properties(); properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer"); // Instantiate KafkaProducer with properties producer = new KafkaProducer<String, String>(properties); String txt = “msg text”; ProducerRecord<String, String> rec = new ProducerRecord<String, String>(topic, txt); producer.send(rec);
  32. 32. ® © 2016 MapR Technologies 32® © 2016 MapR Technologies 32 Use Case Example Code Data for real-time monitoring read Sensor time-stamped data Spark processing Spark Streaming Stream Topic
  33. 33. ® © 2016 MapR Technologies 33® © 2016 MapR Technologies 33 Create a DStream DStream: a sequence of RDDs representing a stream of data val ssc = new StreamingContext(sparkConf, Seconds(5)) val dStream = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topicsSet) batch time 0 to 1 batch time 1 to 2 batch time 2 to 3 dStream Stored in memory as an RDD
  34. 34. ® © 2016 MapR Technologies 34® © 2016 MapR Technologies 34 Process DStream val sensorDStream = dStream.map(_._2).map(parseSensor) dStream RDDs batch time 2 to 3 batch time 1 to 2 batch time 0 to 1 sensorDStream RDDs New RDDs created for every batch map map map
  35. 35. ® © 2016 MapR Technologies 35® © 2016 MapR Technologies 35 Message Data to Sensor Object case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double, flo: Double, sedPPM: Double, psi: Double, chlPPM: Double) def parseSensor(str: String): Sensor = { val p = str.split(",") Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble, p(6).toDouble, p(7).toDouble, p(8).toDouble) }
  36. 36. ® © 2016 MapR Technologies 36® © 2016 MapR Technologies 36 DataFrame and SQL Operations // for Each RDD sensorDStream.foreachRDD { rdd => val sqlContext = SQLContext.getOrCreate(rdd.sparkContext) rdd.toDF().registerTempTable("sensor") val res = sqlContext.sql( "SELECT resid, date, max(hz) as maxhz, min(hz) as minhz, avg(hz) as avghz, max(disp) as maxdisp, min(disp) as mindisp, avg(disp) as avgdisp, max(flo) as maxflo, min(flo) as minflo, avg(flo) as avgflo, max(psi) as maxpsi, min(psi) as minpsi, avg(psi) as avgpsi FROM sensor GROUP BY resid,date") res.show() }
  37. 37. ® © 2016 MapR Technologies 37® © 2016 MapR Technologies 37 Streaming Application Output
  38. 38. ® © 2016 MapR Technologies 38® © 2016 MapR Technologies 38 Save to HBase rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig) linesRDD DStream sensorRDD DStream output operation: persist data to external storage Put objects written to HBase batch time 2-3 batch time 1 to 2 batch time 0 to 1 mapmap map savesave save
  39. 39. ® © 2016 MapR Technologies 39® © 2016 MapR Technologies 39 Start Receiving Data sensorDStream.foreachRDD { rdd => . . . } // Start the computation ssc.start() // Wait for the computation to terminate ssc.awaitTermination()
  40. 40. ® © 2016 MapR Technologies 40® © 2016 MapR Technologies 40 Stream Processing Building a Complete Data Architecture MapR File System (MapR-FS) MapR Converged Data Platform MapR Database (MapR-DB) MapR Streams Sources/Apps Bulk Processing
  41. 41. ® © 2016 MapR Technologies 41® © 2016 MapR Technologies 41 To Learn More: •  Read explanation of and Download code –  https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams- spark-streaming-and-mapr-db –  https://www.mapr.com/blog/spark-streaming-hbase
  42. 42. ® © 2016 MapR Technologies 42® © 2016 MapR Technologies 42 To Learn More: •  http://learn.mapr.com/
  43. 43. ® © 2016 MapR Technologies 43® © 2016 MapR Technologies 43 Q&A @mapr @caroljmcdonald https://www.mapr.com/blog/author/carol-mcdonald Engage with us! mapr-technologies

×