
Free Code Friday - Spark Streaming with HBase

Spark Streaming is an extension of the core Spark API that enables continuous data stream processing. It is particularly useful when data needs to be processed in real time. Carol McDonald, HBase Hadoop Instructor at MapR, will cover:

+ What is Spark Streaming and what is it used for?
+ How does Spark Streaming work?
+ Example code to read, process, and write the processed data



  1. Overview of Apache Spark Streaming
  2. Agenda
     • Why Apache Spark Streaming?
     • What is Apache Spark Streaming? – Key Concepts and Architecture
     • How It Works, by Example
  3. Why Spark Streaming?
     • Process time series data, with results in near real time
     • Use cases: social network trends; website statistics and monitoring; fraud detection; advertising click monetization
     • Time-stamped data: sensors, system metrics, events, log files, stock tickers, user activity
     • High-volume, high-velocity data for real-time monitoring
  4. What is time series data?
     • Stuff with timestamps: sensor data, log files, phones, …
     • Examples: credit card transactions, web user behaviour, social media, log files, geodata, sensors
  5. Why Spark Streaming? What if you want to analyze data as it arrives? For example, time series data: sensors, clicks, logs, stats.
  6. Batch Processing
     Readings arrive over time: "It's 6:01 and 72 degrees", "It's 6:02 and 75 degrees", …, "It's 6:05 and 90 degrees", …, "It's 6:08 and 75 degrees". Only later does a batch job conclude: "It was hot at 6:05 yesterday!"
     Batch processing may be too late for some events.
  7. Event Processing
     "It's 6:05 and 90 degrees." "Someone should open a window!"
     With streaming, the alert can fire as the event arrives. It's becoming important to process events as they arrive.
  8. What is Spark Streaming?
     • An extension of the core Spark API
     • Enables scalable, high-throughput, fault-tolerant stream processing of live data, from data sources to data sinks
  9. Stream Processing Architecture
     Streaming sources/apps → data ingest (MapR-FS topics) → stream processing (Spark Streaming) → data storage (MapR-DB, MapR-FS) → apps
  10. Key Concepts
     • Data sources
       – File based: HDFS
       – Network based: TCP sockets, Twitter, Kafka, Flume, ZeroMQ, Akka Actor
     • Transformations
     • Output operations
  11. Spark Streaming Architecture
     • Divide the input data stream into batches of X seconds – the batch interval
     • The result is a DStream: a sequence of RDDs, one per batch (data from time 0 to 1 becomes the RDD @ time 1, data from time 1 to 2 becomes the RDD @ time 2, and so on)
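     A minimal sketch of these two concepts, assuming a local run and a socket source on port 9999 (both illustrative):

       import org.apache.spark.SparkConf
       import org.apache.spark.streaming.{Seconds, StreamingContext}

       // Batch interval of 2 seconds: the stream is cut into one RDD per 2 seconds
       val sparkConf = new SparkConf().setAppName("BatchIntervalSketch").setMaster("local[2]")
       val ssc = new StreamingContext(sparkConf, Seconds(2))

       // Each 2-second batch of received lines becomes one RDD in the DStream
       val lines = ssc.socketTextStream("localhost", 9999)
       lines.count().print() // prints the number of lines in each batch

       ssc.start()
       ssc.awaitTermination()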
  12. Resilient Distributed Datasets (RDD)
     Spark revolves around RDDs: a read-only collection of elements
  13. Resilient Distributed Datasets (RDD)
     Spark revolves around RDDs: a read-only collection of elements that is
     • operated on in parallel
     • cached in memory (or on disk)
     • fault tolerant
  14. Working With RDDs
     Transformations create new RDDs from existing ones; actions return a value to the driver:

       linesRDD = sc.textFile("SomeFile.txt")
       linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
       linesWithErrorRDD.count()   # 6
       linesWithErrorRDD.first()   # first error line
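     The same flow in Scala, the language used for the rest of this deck's examples (sc is an existing SparkContext; "SomeFile.txt" is a placeholder path):

       // Transformations: lazily define new RDDs
       val linesRDD = sc.textFile("SomeFile.txt")
       val linesWithErrorRDD = linesRDD.filter(line => line.contains("ERROR"))

       // Actions: trigger the computation and return values to the driver
       linesWithErrorRDD.count()  // e.g. 6
       linesWithErrorRDD.first()  // the first line containing "ERROR"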
  15. Process DStream
     • Process a DStream using transformations, e.g. map, reduceByKey, count
     • Each transformation applies to every RDD in the DStream, creating a new RDD per batch: the RDD @ time 1 is transformed into a new RDD @ time 1, and so on, yielding a new DStream
  16. Key Concepts
     • Data sources
     • Transformations: create a new DStream
       – Standard RDD operations: map, filter, union, reduce, join, …
       – Stateful operations: updateStateByKey(function), countByValueAndWindow, … (see the sketch below)
     • Output operations
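     A minimal sketch of a stateful operation, assuming pairDStream is a DStream[(String, Int)] (for example, sensor IDs mapped to 1); stateful operations also require a checkpoint directory, and the path here is illustrative:

       // Stateful operations need checkpointing enabled
       ssc.checkpoint("/tmp/checkpoints")

       // Running count per key across all batches seen so far
       val updateCount = (newValues: Seq[Int], state: Option[Int]) =>
         Some(newValues.sum + state.getOrElse(0))

       val runningCounts = pairDStream.updateStateByKey[Int](updateCount)
       runningCounts.print()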
  17. Spark Streaming Architecture
     • Spark Streaming divides the input data stream into a DStream of RDD batches
     • Spark processes each batch, and the processed results are pushed out in batches
  18. Key Concepts
     • Data sources
     • Transformations
     • Output operations: trigger computation (see the sketch below)
       – saveAsHadoopFiles – save to HDFS
       – saveAsHadoopDataset – save to HBase
       – saveAsTextFiles
       – foreachRDD – do anything with each batch of RDDs
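     A sketch of two of these output operations, assuming dstream is a DStream[String] and the output path is illustrative:

       // Save each batch as text files: one directory per batch under the prefix
       dstream.saveAsTextFiles("/mapr/stream/out/batch", "txt")

       // foreachRDD: arbitrary per-batch processing, e.g. log the batch size
       dstream.foreachRDD { rdd =>
         println(s"batch size: ${rdd.count()}")
       }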
  19. Learning Goals
     • How it works, by example
  20. Use Case: Time Series Data
     Oil pump sensor data is read by Spark Streaming and processed by Spark, producing data for real-time monitoring
  21. Convert a Line of CSV Data to a Sensor Object

       case class Sensor(resid: String, date: String, time: String, hz: Double,
         disp: Double, flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

       def parseSensor(str: String): Sensor = {
         val p = str.split(",")
         Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
           p(6).toDouble, p(7).toDouble, p(8).toDouble)
       }
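     parseSensor assumes every line is well formed; with live data it can be worth guarding against malformed records. A hedged variant (the helper name is illustrative) that drops bad lines instead of throwing:

       import scala.util.Try

       // Returns None for lines that fail to split into nine fields or to convert
       def parseSensorSafe(str: String): Option[Sensor] =
         Try(parseSensor(str)).toOption

       // Usage: keep only the lines that parsed successfully
       // val sensorDStream = linesDStream.flatMap(line => parseSensorSafe(line))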
  22. Schema
     • All events are stored in the data CF; the data CF could be set to expire old data
     • Filtered alerts are put in the alerts CF
     • Daily summaries are put in the stats CF

       Row key               | data:hz | data:psi | alerts:psi | stats:hz_avg | stats:psi_min
       COHUTTA_3/10/14_1:01  | 10.37   | 84       | 0          |              |
       COHUTTA_3/10/14       |         |          |            | 10           | 0
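     A sketch of creating such a table with the HBase 1.x admin API; the table path, the family name "alert", and the 30-day TTL on the data CF are illustrative assumptions:

       import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
       import org.apache.hadoop.hbase.client.ConnectionFactory

       val conf = HBaseConfiguration.create()
       val connection = ConnectionFactory.createConnection(conf)
       val admin = connection.getAdmin

       // Three column families; a TTL on data makes raw events expire automatically
       val desc = new HTableDescriptor(TableName.valueOf("/user/sensor")) // illustrative path
       desc.addFamily(new HColumnDescriptor("data").setTimeToLive(60 * 60 * 24 * 30)) // 30 days, in seconds
       desc.addFamily(new HColumnDescriptor("alert"))
       desc.addFamily(new HColumnDescriptor("stats"))
       admin.createTable(desc)
       connection.close()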
  23. Basic Steps for Spark Streaming Code
     These are the basic steps for Spark Streaming code:
     1. Create a DStream, then:
        a. Apply transformations
        b. Apply output operations
     2. Start receiving data and processing it – using streamingContext.start()
     3. Wait for the processing to be stopped – using streamingContext.awaitTermination()
  24. Create a DStream

       val ssc = new StreamingContext(sparkConf, Seconds(2))
       val linesDStream = ssc.textFileStream("/mapr/stream")

     A DStream is a sequence of RDDs representing a stream of data; each batch (time 0-1, time 1-2, …) is stored in memory as an RDD
  25. Process DStream

       val linesDStream = ssc.textFileStream("directory path")
       val sensorDStream = linesDStream.map(parseSensor)

     map creates new RDDs for every batch: each batch's linesDStream RDD is mapped to a sensorDStream RDD
  26. Process DStream

       // for each RDD, filter the sensor data for low psi
       sensorDStream.foreachRDD { rdd =>
         val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
         . . .
       }
  27. DataFrame and SQL Operations

       // for each RDD: parse into sensor objects, filter, then query with SQL
       sensorDStream.foreachRDD { rdd =>
         . . .
         alertRDD.toDF().registerTempTable("alert")
         // join alert data with pump maintenance info
         val alertViewDF = sqlContext.sql(
           "select s.resid, s.psi, p.pumpType from alert s " +
           "join pump p on s.resid = p.resid " +
           "join maint m on p.resid = m.resid")
         . . .
       }
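     The snippet assumes a SQLContext and its implicits are already in scope, and that pump and maint were registered as tables elsewhere; a sketch of that setup (Spark 1.x):

       import org.apache.spark.sql.SQLContext

       // Inside foreachRDD: obtain a SQLContext and enable rdd.toDF()
       val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
       import sqlContext.implicits._

       // Reference tables assumed registered once at startup, e.g.:
       // pumpDF.registerTempTable("pump")
       // maintDF.registerTempTable("maint")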
  28. Save to HBase

       // for each RDD: parse into sensor objects, filter, then write to HBase
       sensorDStream.foreachRDD { rdd =>
         . . .
         // convert each alert to a Put object and write to the HBase alerts CF
         rdd.map(Sensor.convertToPutAlert)
            .saveAsHadoopDataset(jobConfig)
       }
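     convertToPutAlert is the deck's helper; a hedged sketch of what such a conversion might look like with the HBase client API. The row-key format and column names are assumptions based on the schema slide:

       import org.apache.hadoop.hbase.client.Put
       import org.apache.hadoop.hbase.io.ImmutableBytesWritable
       import org.apache.hadoop.hbase.util.Bytes

       // Build a (key, Put) pair writing psi into the alert CF, keyed by resid_date_time
       def convertToPutAlert(sensor: Sensor): (ImmutableBytesWritable, Put) = {
         val rowkey = sensor.resid + "_" + sensor.date + "_" + sensor.time
         val put = new Put(Bytes.toBytes(rowkey))
         put.addColumn(Bytes.toBytes("alert"), Bytes.toBytes("psi"), Bytes.toBytes(sensor.psi))
         (new ImmutableBytesWritable(Bytes.toBytes(rowkey)), put)
       }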
  29. Save to HBase

       rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)

     map converts each record to an HBase Put object; saveAsHadoopDataset is an output operation that persists the data to external storage – here, writing the Puts to HBase
  30. Start Receiving Data

       sensorDStream.foreachRDD { rdd =>
         . . .
       }

       // Start the computation
       ssc.start()
       // Wait for the computation to terminate
       ssc.awaitTermination()
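     awaitTermination blocks until the job stops or fails. If the application needs to shut itself down (from another thread or a shutdown hook), a graceful stop lets in-flight batches finish first; a minimal sketch:

       // Stop the StreamingContext (and the SparkContext), draining queued batches
       ssc.stop(stopSparkContext = true, stopGracefully = true)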
  31. Using HBase as a Source and Sink
     The Spark application reads from and writes to the HBase database. Example: calculate and store summaries as a pre-computed, materialized view
  32. HBase Read and Write
     Read – newAPIHadoopRDD scans the HBase table, producing (row key, Result) pairs:

       val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
         classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
         classOf[org.apache.hadoop.hbase.client.Result])

     Write – saveAsHadoopDataset writes (key, Put) pairs back to the table:

       keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
  33. Read HBase

       // load an RDD of (row key, Result) tuples from the HBase table
       val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
         classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
         classOf[org.apache.hadoop.hbase.client.Result])

       // get the Result from each tuple
       val resultRDD = hBaseRDD.map(tuple => tuple._2)

       // transform into an RDD of (rowKey, columnValue) pairs
       val keyValueRDD = resultRDD.map(result =>
         (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toDouble(result.value)))

       // group by row key and compute statistics for the column values
       val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list => StatCounter(list))
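     The conf passed to newAPIHadoopRDD tells TableInputFormat what to scan; a sketch of that setup, with an illustrative table path:

       import org.apache.hadoop.hbase.HBaseConfiguration
       import org.apache.hadoop.hbase.mapreduce.TableInputFormat

       val conf = HBaseConfiguration.create()
       conf.set(TableInputFormat.INPUT_TABLE, "/user/sensor")   // table to read
       conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "data")    // limit the scan to the data CF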
  34. Write HBase

       // configure the job to save to the HBase table
       val jobConfig: JobConf = new JobConf(conf, this.getClass)
       jobConfig.setOutputFormat(classOf[TableOutputFormat])
       jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)

       // convert psi stats to Put objects and write to the HBase stats column family
       keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
  35. MapR Blog: Spark Streaming with HBase
     • https://www.mapr.com/blog/spark-streaming-hbase
  36. Free HBase On Demand Training (includes Hive and MapReduce with HBase)
     • https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
  37. Soon to Come
     • Spark On Demand Training – https://www.mapr.com/services/mapr-academy/
  38. References
     • Spark web site: http://spark.apache.org/
     • https://databricks.com/
     • Spark on MapR: http://www.mapr.com/products/apache-spark
     • Spark SQL and DataFrame Guide
     • Apache Spark vs. MapReduce – Whiteboard Walkthrough
     • Learning Spark – O'Reilly Book
     • Apache Spark
  40. Q&A
     Engage with us: @mapr, maprtech, MapR, mapr-technologies
