
Apache Spark Streaming and HBase

Overview of Apache Spark Streaming with HBase


  1. Overview of Apache Spark Streaming - Carol McDonald (© 2015 MapR Technologies)
  2. Agenda
     •  Why Apache Spark Streaming?
     •  What is Apache Spark Streaming? Key concepts and architecture
     •  How it works, by example
  3. Why Spark Streaming?
     •  Process time series data, with results in near-real-time
     •  Use cases:
        –  Social network trends
        –  Website statistics, monitoring
        –  Fraud detection
        –  Advertising click monetization
     •  Time-stamped data: sensor readings, system metrics, events, log files, stock tickers, user activity
     •  High-volume, high-velocity data for real-time monitoring
  4. What is time series data?
     •  Anything with timestamps: sensor data, log files, phone records, credit card transactions, web user behavior, social media, geodata
  5. Why Spark Streaming? What if you want to analyze data as it arrives? For example, time series data: sensors, clicks, logs, stats.
  6. Batch Processing
     It's 6:01 and 72 degrees. It's 6:02 and 75 degrees. It's 6:03 and 77 degrees. It's 6:04 and 85 degrees. It's 6:05 and 90 degrees. It's 6:06 and 85 degrees. It's 6:07 and 77 degrees. It's 6:08 and 75 degrees.
     "It was hot at 6:05 yesterday!"
     Batch processing may be too late for some events.
  7. Event Processing
     It's 6:05 and 90 degrees. "Someone should open a window!"
     With streaming, it's becoming important to process events as they arrive.
  8. What is Spark Streaming?
     •  An extension of the core Spark API
     •  Enables scalable, high-throughput, fault-tolerant stream processing of live data
     (Diagram: data sources → Spark Streaming → data sinks)
  9. Stream Processing Architecture
     (Diagram: streaming sources/apps → data ingest via MapR-FS topics → stream processing → data storage in MapR-DB / MapR-FS → apps)
  10. Key Concepts
     •  Data sources (see the sketch below):
        –  File-based: HDFS
        –  Network-based: TCP sockets, Twitter, Kafka, Flume, ZeroMQ, Akka Actor
     •  Transformations
     •  Output operations
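     Both kinds of source are created from a StreamingContext. A minimal Scala sketch, assuming local mode; the directory path, host, and port are hypothetical placeholders, not from the deck:

     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}

     val conf = new SparkConf().setAppName("SourcesSketch").setMaster("local[2]")
     val ssc = new StreamingContext(conf, Seconds(2)) // 2-second batch interval

     // file-based source: watches an HDFS/MapR-FS directory for new files
     val fileLines = ssc.textFileStream("/tmp/stream-input")

     // network-based source: reads lines from a TCP socket
     val socketLines = ssc.socketTextStream("localhost", 9999)

     Kafka, Flume, and the other network sources come from separate connector modules (e.g. spark-streaming-kafka) rather than StreamingContext methods.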
  11. Spark Streaming Architecture
     •  Divide the data stream into batches of X seconds, called a DStream: a sequence of RDDs
     (Diagram: input data stream → Spark Streaming → DStream batches: data from time 0 to 1 becomes RDD @ time 1, data from time 1 to 2 becomes RDD @ time 2, and so on)
  12. Resilient Distributed Datasets (RDD)
     Spark revolves around RDDs: read-only collections of elements.
  13. Resilient Distributed Datasets (RDD)
     Spark revolves around RDDs: read-only collections of elements that are
     •  operated on in parallel
     •  cached in memory, or on disk
     •  fault tolerant
  14. Working With RDDs
     Transformations (e.g. filter) build new RDDs; actions (e.g. count, first) return values.

     linesRDD = sc.textFile("SomeFile.txt")
     linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
     linesWithErrorRDD.count()   # 6
     linesWithErrorRDD.first()   # Error line
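     Since the rest of the deck uses Scala, the same example as a Scala sketch (file name and contents are hypothetical):

     val linesRDD = sc.textFile("SomeFile.txt")
     // transformation: lazily defines a new RDD
     val linesWithErrorRDD = linesRDD.filter(line => line.contains("ERROR"))
     // actions: trigger computation and return values to the driver
     val numErrors = linesWithErrorRDD.count()
     val firstError = linesWithErrorRDD.first()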
  15. Process DStream
     •  Process using transformations, which create new RDDs
     (Diagram: each batch RDD of the input DStream - RDD @ time 1, time 2, time 3 - is transformed, e.g. by map, reduceByKey, or count, into the corresponding RDD of a new DStream)
  16. Key Concepts
     •  Data sources
     •  Transformations: create a new DStream (see the sketch below)
        –  Standard RDD operations: map, filter, union, reduce, join, …
        –  Stateful operations: updateStateByKey(function), countByValueAndWindow, …
     •  Output operations
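     A minimal sketch of a stateful operation; wordDStream is an assumed DStream[(String, Int)] of (word, 1) pairs, not from the deck. updateStateByKey keeps a running count per key across batches and requires a checkpoint directory:

     ssc.checkpoint("/tmp/checkpoints") // required for stateful operations

     // merge this batch's values with the previous state for each key
     val updateCount = (newValues: Seq[Int], state: Option[Int]) =>
       Some(newValues.sum + state.getOrElse(0))

     val runningCounts = wordDStream.updateStateByKey(updateCount)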
  17. Spark Streaming Architecture
     •  Processed results are pushed out in batches
     (Diagram: input data stream → Spark Streaming → DStream batches of RDDs → Spark → batches of processed results)
  18. Key Concepts
     •  Data sources
     •  Transformations
     •  Output operations: trigger computation (see the sketch below)
        –  saveAsHadoopFiles: save to HDFS (MapR-FS)
        –  saveAsHadoopDataset: save to HBase (MapR-DB)
        –  saveAsTextFiles
        –  foreachRDD: do anything with each batch of RDDs
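     A brief sketch of two of these output operations applied to the deck's sensorDStream; the output path is a hypothetical placeholder:

     // writes each batch to a directory named /tmp/sensor-out-<batch time>.txt
     sensorDStream.map(_.toString).saveAsTextFiles("/tmp/sensor-out", "txt")

     // foreachRDD gives full control over each batch RDD
     sensorDStream.foreachRDD { rdd =>
       println(s"batch contained ${rdd.count()} sensor readings")
     }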
  19. Learning Goals
     •  How it works, by example
  20. Use Case: Time Series Data
     (Diagram: oil pump sensor data → read by Spark Streaming → Spark processing → data for real-time monitoring)
  21. Convert a Line of CSV Data to a Sensor Object

     case class Sensor(resid: String, date: String, time: String, hz: Double,
       disp: Double, flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

     def parseSensor(str: String): Sensor = {
       val p = str.split(",")
       Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
         p(6).toDouble, p(7).toDouble, p(8).toDouble)
     }
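     For illustration, parsing one sample line (the values are hypothetical) shows how the nine CSV fields map onto the Sensor fields:

     val line = "COHUTTA,3/10/14,1:01,10.37,1.82,15.04,12.5,84.0,0.5"
     val sensor = parseSensor(line)
     println(sensor.resid) // COHUTTA
     println(sensor.psi)   // 84.0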
  22. Schema
     •  All events are stored in the data column family, which could be set to expire old data
     •  Filtered alerts are put in the alerts column family
     •  Daily summaries are put in the stats column family
     Row keys are composites of sensor fields (a sketch follows the table).

     Row key               | CF data: hz … psi | CF alerts: psi | CF stats: hz_avg … psi_min
     COHUTTA_3/10/14_1:01  | 10.37 … 84        | 0              |
     COHUTTA_3/10/14       |                   |                | 10 … 0
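     A hypothetical sketch of how these composite row keys might be built from the Sensor fields; the deck does not show this code:

     // event rows: resid + date + time, e.g. COHUTTA_3/10/14_1:01
     val eventRowKey = s"${sensor.resid}_${sensor.date}_${sensor.time}"
     // daily summary rows: resid + date, e.g. COHUTTA_3/10/14
     val dailyRowKey = s"${sensor.resid}_${sensor.date}"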
  23. Basic Steps for Spark Streaming Code
     These are the basic steps for Spark Streaming code (a minimal skeleton follows):
     1.  Create a DStream
         1.  Apply transformations
         2.  Apply output operations
     2.  Start receiving data and processing it, using streamingContext.start()
     3.  Wait for the processing to be stopped, using streamingContext.awaitTermination()
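     Putting the steps together, a minimal runnable skeleton; the app name and input directory are placeholders:

     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}

     object StreamingSkeleton {
       def main(args: Array[String]): Unit = {
         // 1. create a DStream
         val ssc = new StreamingContext(new SparkConf().setAppName("Skeleton"), Seconds(2))
         val lines = ssc.textFileStream("/tmp/stream-input")
         // 1.1 apply transformations, 1.2 apply output operations
         lines.filter(_.contains("ERROR")).print()
         // 2. start receiving data and processing it
         ssc.start()
         // 3. wait for the processing to be stopped
         ssc.awaitTermination()
       }
     }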
  24. Create a DStream

     val ssc = new StreamingContext(sparkConf, Seconds(2))
     val linesDStream = ssc.textFileStream("/mapr/stream")

     A DStream is a sequence of RDDs representing a stream of data; each batch interval (time 0-1, time 1-2, …) is stored in memory as an RDD.
  25. Process DStream

     val linesDStream = ssc.textFileStream("directory path")
     val sensorDStream = linesDStream.map(parseSensor)

     The map transformation creates a new RDD in sensorDStream for every batch of linesDStream (time 0-1, time 1-2, …).
  26. Process DStream

     // for each RDD in the DStream
     sensorDStream.foreachRDD { rdd =>
       // filter sensor data for low psi
       val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
       . . .
     }
  27. DataFrame and SQL Operations

     // for each RDD: parse into a sensor object, filter
     sensorDStream.foreachRDD { rdd =>
       . . .
       alertRDD.toDF().registerTempTable("alert")
       // join alert data with pump maintenance info
       val res = sqlContext.sql(
         "select s.resid, s.psi, p.pumpType from alert s join pump p on s.resid = p.resid join maint m on p.resid = m.resid")
       . . .
     }
  28. Save to HBase

     // for each RDD: parse into a sensor object, filter
     sensorDStream.foreachRDD { rdd =>
       . . .
       // convert each alert to a Put object and write to the HBase alerts column family
       alertRDD.map(Sensor.convertToPutAlert)
         .saveAsHadoopDataset(jobConfig)
     }
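     Sensor.convertToPutAlert is referenced but not shown in the deck; a minimal sketch under the schema above (the column family and qualifier names are assumptions) could look like:

     import org.apache.hadoop.hbase.client.Put
     import org.apache.hadoop.hbase.io.ImmutableBytesWritable
     import org.apache.hadoop.hbase.util.Bytes

     // build the (key, Put) pair that saveAsHadoopDataset expects for TableOutputFormat
     def convertToPutAlert(sensor: Sensor): (ImmutableBytesWritable, Put) = {
       val rowKey = Bytes.toBytes(s"${sensor.resid}_${sensor.date}_${sensor.time}")
       val put = new Put(rowKey)
       // assumed: alerts column family with a psi column, per the schema slide
       put.addColumn(Bytes.toBytes("alerts"), Bytes.toBytes("psi"), Bytes.toBytes(sensor.psi))
       (new ImmutableBytesWritable(rowKey), put)
     }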
  29. Save to HBase

     rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)

     For every batch RDD, map converts the records to Put objects, and saveAsHadoopDataset, an output operation, persists the data to external storage: the Put objects are written to HBase.
  30. Start Receiving Data

     sensorDStream.foreachRDD { rdd =>
       . . .
     }
     // Start the computation
     ssc.start()
     // Wait for the computation to terminate
     ssc.awaitTermination()
  31. Using HBase as a Source and Sink
     A Spark application both reads from and writes to the HBase database. Example: calculate summaries and store them in HBase as a pre-computed, materialized view.
  32. HBase Read and Write

     // read: newAPIHadoopRDD scans the HBase table into (row key, Result) pairs
     val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
       classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
       classOf[org.apache.hadoop.hbase.client.Result])

     // write: saveAsHadoopDataset writes (key, Put) pairs to the HBase table
     keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
  33. Read HBase

     // load an RDD of (row key, Result) tuples from the HBase table
     val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
       classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
       classOf[org.apache.hadoop.hbase.client.Result])

     // get the Result
     val resultRDD = hBaseRDD.map(tuple => tuple._2)

     // transform into an RDD of (row key, column value) pairs
     val keyValueRDD = resultRDD.map(result =>
       (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toDouble(result.value)))

     // group by row key, get statistics for the column values
     val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list => StatCounter(list))
  34. Write HBase

     // configure the job to save to the HBase table
     val jobConfig: JobConf = new JobConf(conf, this.getClass)
     jobConfig.setOutputFormat(classOf[TableOutputFormat])
     jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)

     // convert psi stats to Put objects and write to the HBase table stats column family
     keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
  35. MapR Blog: Spark Streaming with HBase
     •  https://www.mapr.com/blog/spark-streaming-hbase
  36. Free HBase On-Demand Training (includes Hive and MapReduce with HBase)
     •  https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
  37. Soon to Come
     •  Spark On-Demand Training: https://www.mapr.com/services/mapr-academy/
  38. References
     •  Spark web site: http://spark.apache.org/
     •  Databricks: https://databricks.com/
     •  Spark on MapR: http://www.mapr.com/products/apache-spark
     •  Spark SQL and DataFrame Guide
     •  Apache Spark vs. MapReduce - Whiteboard Walkthrough
     •  Learning Spark - O'Reilly book
  39. Q&A - Engage with us!
     @mapr, maprtech, MapR, mapr-technologies
