
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs



Date: 16th November 2017
Location: Fast Data Theatre
Time: 12:30 - 13:00
Speaker: Gerard Maas
Organisation: Lightbend



  1. 1. Processing Fast Data with Apache Spark: The Tale of Two APIs Big Data LDN Gerard Maas Señor SW Engineer 16/Nov/2017
  2. 2. Gerard Maas, Señor SW Engineer
     Computer Engineer, Scala Programmer
     Early Spark Adopter (v0.9), Spark Notebook Contributor
     Cassandra MVP (2015, 2016)
     Stack Overflow Top Contributor (Spark, Spark Streaming, Scala)
     Wannabe { IoT Maker, Drone crasher/tinkerer }
     @maasg
     https://github.com/maasg
     https://www.linkedin.com/in/gerardmaas/
     https://stackoverflow.com/users/764040/maasg
  3. 3. Streams Everywhere
  4. 4. Once upon a time...
  5. 5. Apache Spark: Core, Spark SQL, Spark MLlib, Spark Streaming, Structured Streaming, DataFrames, Datasets, GraphFrames, Data Sources
  7. 7. Structured Streaming
  8. 8. Structured Streaming: sources (Kafka, Sockets, HDFS/S3, Custom) feed a streaming DataFrame
  10. 10. Structured Streaming: sources (Kafka, Sockets, HDFS/S3, Custom) -> streaming DataFrame -> query -> sinks (Kafka, Files, foreach sink, console, memory), controlled by an Output Mode
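  A minimal sketch of that source -> query -> sink -> output mode flow, using Spark's built-in rate source and the console sink so it runs without external services; the app name, window size, and settings here are illustrative, not from the deck.

    // Sketch: source -> streaming DataFrame -> query -> sink, with an explicit output mode.
    // Uses the built-in "rate" source and "console" sink; names and settings are illustrative.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder
      .appName("output-mode-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val rate   = spark.readStream.format("rate").option("rowsPerSecond", 10).load()  // source
    val counts = rate.groupBy(window($"timestamp", "10 seconds")).count()            // query: windowed count

    counts.writeStream
      .format("console")        // sink
      .outputMode("complete")   // output mode: emit the full updated result each trigger
      .start()
      .awaitTermination()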
  11. 11. val rawData = sparkSession.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", kafkaBootstrapServer)
            .option("subscribe", topic)
            .option("group.id", "iot-data-consumer")
            .option("startingOffsets", "earliest")
            .load()

          case class SensorData(sensorId: Int, timestamp: Long, reading: Double)
          def parse(str: String): Option[SensorData] = ...

          val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record))
          val validData = parsedData.where($"reading" > 0)

          val query = validData.writeStream
            .outputMode("append")
            .format("parquet")
            .option("path", "/dest/sensor-data")
            .option("checkpointLocation", "/tmp/checkpoint")
            .start()
  18. 18. val rawData = sparkSession.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", kafkaBootstrapServer)
            .option("subscribe", topic)
            .option("group.id", "iot-data-consumer")
            .option("startingOffsets", "earliest")
            .load()

          case class SensorData(sensorId: Int, timestamp: Long, value: Double)
          def parse(str: String): Option[SensorData] = ...

          val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record))
          val validData = parsedData.where($"value" > 0)

          val query = validData.writeStream
            .outputMode("append")
            .format("parquet")
            .option("path", "/dest/sensor-data")
            .option("checkpointLocation", "/tmp/checkpoint")
            .start()
  19. 19. Structured Streaming - Usecases
          ● ETL
          ● Stream aggregations, windows
          ● Event-time oriented analytics
          ● Join Streams with Fixed Datasets
          ● Apply Machine Learning Models
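  To illustrate the windowing and event-time points, a sketch of a windowed average over the parsedData stream from the earlier slides; it assumes the timestamp is epoch milliseconds and that sparkSession.implicits._ is in scope, and the watermark and window sizes are illustrative.

    // Sketch: event-time windowed aggregation with a watermark, building on the
    // parsedData Dataset[SensorData] from the previous slides. The epoch-millisecond
    // timestamp, watermark, and window sizes are assumptions for illustration.
    import org.apache.spark.sql.functions._

    val avgBySensor = parsedData
      .withColumn("eventTime", ($"timestamp" / 1000).cast("timestamp")) // event time from the record itself
      .withWatermark("eventTime", "5 minutes")                          // tolerate up to 5 minutes of late data
      .groupBy(window($"eventTime", "1 minute"), $"sensorId")           // 1-minute tumbling windows per sensor
      .agg(avg($"reading").as("avgReading"))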
  20. 20. Structured Streaming
  21. 21. Spark Streaming: sources (Kafka, Flume, Kinesis, Twitter, Sockets, HDFS/S3, Custom) -> Apache Spark (Spark SQL, Spark ML, ...) -> outputs (Databases, HDFS, API Server, Streams)
  22. 22. DStream[T]: a sequence of RDD[T], one per batch interval t0, t1, t2, t3, ..., ti, ti+1
  23. 23. A transformation T -> U on a DStream[T] turns each interval's RDD[T] into an RDD[U]
  24. 24. Transformations (T -> U) produce new DStreams; actions produce output for each interval
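  A minimal sketch of the model in these diagrams: one RDD per batch interval, transformations applied batch by batch, and an action producing output every interval. The socket source, host, port, and batch size are illustrative.

    // Sketch of the DStream model: each 2-second interval yields one RDD;
    // flatMap/map/reduceByKey are applied per RDD; print() is the action that
    // triggers output every interval. Host, port, and batch size are illustrative.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(2))

    val lines  = ssc.socketTextStream("localhost", 9999)                        // DStream[String]
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)     // transformations: T -> U
    counts.print()                                                              // action, runs each interval

    ssc.start()
    ssc.awaitTermination()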
  25. 25. API: Transformations
          map, flatMap, filter
          count, reduce, countByValue, reduceByKey
          union, join, cogroup
  26. 26. API: Transformations mapWithState … …
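  A sketch of the mapWithState pattern named on this slide: a per-key running sum kept in managed state. The input DStream `events`, its types, the StreamingContext `ssc`, and the checkpoint path are assumptions for illustration.

    // Sketch: mapWithState keeps per-key state across batches. Assumes an input
    // DStream[(String, Int)] named `events` and a StreamingContext `ssc`;
    // types and the checkpoint path are illustrative.
    import org.apache.spark.streaming.{State, StateSpec}

    ssc.checkpoint("/tmp/checkpoint")   // stateful transformations require checkpointing

    def updateTotal(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
      val total = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(total)               // persist the new running total for this key
      (key, total)
    }

    val totals = events.mapWithState(StateSpec.function(updateTotal _))  // DStream[(String, Int)]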
  27. 27. API: Transformations
          transform

          val iotDstream = MQTTUtils.createStream(...)
          val devicePriority = sparkContext.cassandraTable(...)
          val prioritizedDStream = iotDstream.transform { rdd =>
            rdd.map(d => (d.id, d)).join(devicePriority)
          }
  28. 28. Actions
          print
          -------------------------------------------
          Time: 1459875469000 ms
          -------------------------------------------
          data1
          data2

          saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles
          foreachRDD *
  29. 29. Actions
          print
          -------------------------------------------
          Time: 1459875469000 ms
          -------------------------------------------
          data1
          data2

          saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles
          foreachRDD *: Spark SQL, DataFrames, GraphFrames, any API
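  One way to read the foreachRDD point on this slide: inside foreachRDD each micro-batch is a plain RDD, so any other API can be applied to it. A sketch, assuming a DStream[SensorData] named `sensorStream`, the `sparkSession` from the earlier slides, and an illustrative output path.

    // Sketch: foreachRDD exposes each micro-batch as an RDD, so other APIs
    // (here Spark SQL / DataFrames) can be used on it. `sensorStream` and the
    // output path are assumptions for illustration.
    import sparkSession.implicits._

    sensorStream.foreachRDD { rdd =>
      val batch = sparkSession.createDataset(rdd)      // RDD -> Dataset for this interval
      batch.where($"reading" > 0)
           .write
           .mode("append")
           .parquet("/dest/sensor-data-batches")       // write each micro-batch with the batch API
    }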
  30. 30. Spark Streaming - Usecases
          ● Stream-stream joins
          ● Complex state management (local + cluster state)
          ● Streaming Machine Learning
            ○ Learn
            ○ Score
          ● Join Streams with Updatable Datasets
          ● Event-time oriented analytics
          ● Continuous processing
  31. 31. Structured Streaming +
  32. 32. Spark Streaming + Structured Streaming

          val parse: Dataset[String] => Dataset[Record] = ???
          val process: Dataset[Record] => Dataset[Result] = ???
          val serialize: Dataset[Result] => Dataset[String] = ???

          Structured Streaming:

          val kafkaStream = spark.readStream…
          val f = parse andThen process andThen serialize
          val result = f(kafkaStream)
          result.writeStream
            .format("kafka")
            .option("kafka.bootstrap.servers", bootstrapServers)
            .option("topic", writeTopic)
            .option("checkpointLocation", checkpointLocation)
            .start()

          Spark Streaming:

          val dstream = KafkaUtils.createDirectStream(...)
          dstream.foreachRDD { rdd =>
            val ds = sparkSession.createDataset(rdd)
            val f = parse andThen process andThen serialize
            val result = f(ds)
            result.write.format("kafka")
              .option("kafka.bootstrap.servers", bootstrapServers)
              .option("topic", writeTopic)
              .option("checkpointLocation", checkpointLocation)
              .save()
          }
  33. 33. Streaming Pipelines (Structured Streaming): Keyword Extraction -> Keyword Relevance -> Similarity -> DB Storage
  34. 34. Structured Streaming
  35. 35. lightbend.com/fast-data-platform
  36. 36. Fast Data Platform Manager, for Managing Running Clusters
          Features:
          1. One-click component installations
          2. Automatic dependency checks
          3. One-click access to install logs
          4. Real-time cluster visualization
          5. Access to consolidated production logs
          Benefits:
          1. Easy to get started
          2. Ready access to all components
          3. Increased developer velocity
  37. 37. Fast Data Platform Visit us at the Booth!
  38. 38. Thank You Gerard Maas @maasg
