
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visualization


Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar

Code examples available at https://github.com/br4nd0n/spark-streaming and https://github.com/br4nd0n/spark-viz

A demo and explanation of building a streaming application with Spark Streaming, Node.js, and Redis, with a real-time visualization. Includes a discussion of Spark and Spark Streaming internals: RDD partitioning, code and data distribution, and cluster resource allocation.


  1. Real Time Data Processing with Spark Streaming. Brandon O'Brien, Oct 26th, 2016
  2. Spark Streaming: Intro
     1. Intro
     2. Demo + high-level walkthrough
     3. Spark in detail
     4. Detailed demo walkthrough and/or workshop
  3. Spark Streaming: Spark experience level? Select one:
     ☐ Beginner
     ☐ Intermediate
     ☐ Expert
  4. Spark Streaming: Demo
  5. Spark Streaming: Demo Info
     • Data Source:
       • Data Producer Thread
       • Redis
     • Data Consumer:
       • Spark as Stream Consumer
       • Redis Publish (Spark-side sketch below)
     • Dashboard:
       • Node.js/Redis Integration
       • Socket.io Publish
       • AngularJS + JavaScript
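
     For illustration, here is a minimal Spark-side sketch of that pipeline, assuming a local Redis at localhost:6379 and a hypothetical "dashboard-updates" channel that the Node.js dashboard subscribes to. The real wiring lives in the spark-streaming and spark-viz repos and may differ; a TCP socket stands in for the custom Redis receiver here.

     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import redis.clients.jedis.Jedis

     object DemoPipelineSketch {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf().setAppName("spark-viz-demo").setMaster("local[2]")
         val ssc = new StreamingContext(conf, Seconds(1))

         // Assumed source: a TCP socket stands in for the repo's custom Redis receiver
         val events = ssc.socketTextStream("localhost", 9999)

         // Count occurrences of each event per 1-second batch
         val counts = events.map(event => (event, 1)).reduceByKey(_ + _)

         counts.foreachRDD { rdd =>
           rdd.foreachPartition { partition =>
             // One connection per partition, created on the executor (not the driver)
             val jedis = new Jedis("localhost", 6379)
             partition.foreach { case (key, count) =>
               jedis.publish("dashboard-updates", s"$key:$count")
             }
             jedis.close()
           }
         }

         ssc.start()
         ssc.awaitTermination()
       }
     }
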
  6. Spark Streaming: Spark in Detail
  7. Spark Streaming: Concepts
     Application:
     • Driver program
     • RDD
       • Partition
       • Elements
     • DStream
       • Input Receiver
     • 1 JVM for the driver program
     • 1 JVM per executor
     Cluster:
     • Master
     • Executors
     • Resources (config sketch below):
       • Cores
       • Gigs RAM
     • Cluster Types:
       • Standalone
       • Mesos
       • YARN
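
     As a concrete illustration of the resource vocabulary above, per-executor cores and memory are set through SparkConf. This is a sketch; the master URL and all values are assumptions, not figures from the talk.

     import org.apache.spark.SparkConf

     val conf = new SparkConf()
       .setAppName("spark-streaming-demo")
       .setMaster("spark://master-host:7077")  // Standalone master (assumed host)
       .set("spark.executor.cores", "2")       // cores per executor JVM
       .set("spark.executor.memory", "4g")     // gigs of RAM per executor JVM
       .set("spark.cores.max", "8")            // cap on total cores for this application
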
  8. Spark Streaming: Lazy Execution

     // Allocate resources on the cluster
     val conf = new SparkConf().setAppName(appName).setMaster(master)
     val sc = new SparkContext(conf)

     // Lazy definition of logical processing (transformations)
     val textFile = sc.textFile("README.md")
       .filter(line => line.length > 10)

     // foreachPartition() triggers execution (an action)
     textFile.foreachPartition(partition => {
       partition.foreach(line => println(line))
     })

     • Use rdd.persist() when multiple actions are called on the same RDD (sketch below)
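
     To make the persist() bullet concrete, a minimal sketch: without persist(), each action below would recompute the whole lineage from the file.

     // Cache an RDD that feeds two actions so it is not recomputed twice
     val longLines = sc.textFile("README.md")
       .filter(line => line.length > 10)
       .persist()                     // materialized on first action, reused afterwards

     val count = longLines.count()    // first action: triggers execution and caches
     val sample = longLines.take(5)   // second action: served from the cache
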
  9. Spark Streaming: Execution Env
     • Distributed data, distributed code
     • RDD partitions are distributed across executors
     • Actions trigger execution and return results to the driver program
     • Code is executed on either the driver or the executors
     • Be careful with function closures! (pitfall sketch below)

     // Function arguments to transformations are executed on executors
     val textFile = sc.textFile("README.md")
       .filter(line => line.length > 10)

     // collect() triggers execution (an action) and returns the results to the
     // driver, so this foreach runs on the driver; by contrast,
     // textFile.foreachPartition() would run on the executors
     textFile.collect().foreach(line => println(line))
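
     The classic closure pitfall, as a sketch (not from the slides): each executor receives its own serialized copy of captured variables, so driver-side mutation is lost. In Spark 2.x, accumulators are the supported way to aggregate back to the driver.

     // Broken: the closure captures a driver-side variable; executors mutate copies
     var counter = 0
     val lines = sc.textFile("README.md")
     lines.foreach(line => counter += 1)  // updates executor-local copies only
     println(counter)                     // still 0 on the driver

     // Working alternative: an accumulator, which Spark merges back to the driver
     val lineCount = sc.longAccumulator("lineCount")
     lines.foreach(_ => lineCount.add(1))
     println(lineCount.value)
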
  10. Spark Streaming: Execution Env
  11. Spark Streaming: Parallelism
     • RDD partitions are processed in parallel
     • Elements within a single partition are processed serially
     • You control the number of partitions in an RDD
     • If you need to guarantee any particular processing order, use groupByKey() to force all elements with the same key onto the same partition (sketch below)
     • Be careful with shuffles

     val textFile = sc.textFile("README.md")
     val singlePartitionRDD = textFile.repartition(1)

     val linesByKey = shopResultsEnriched
       .map(line => (getPartitionKey(line), line))
       .groupByKey()
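
     Building on the snippet above, a hedged sketch of per-key processing (events.log and the key-extraction logic are hypothetical): after groupByKey(), all values for a key sit in one partition and are handled serially there.

     val events = sc.textFile("events.log")
     val byKey = events
       .map(line => (line.split(",")(0), line))  // assumed: key is the first CSV field
       .groupByKey(numPartitions = 8)            // all values for a key land on one partition

     byKey.foreachPartition { partition =>
       partition.foreach { case (key, lines) =>
         // Lines for this key are processed serially within this partition
         println(s"$key -> ${lines.size} events")
       }
     }
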
  12. Spark Streaming: DStreams
     • Receiver Types:
       • Kafka (Receiver-based + Direct)
       • Flume
       • Kinesis
       • TCP Socket
       • Custom (e.g. redis.receiver.RedisReceiver.scala; sketch below)
     • Note: the Kafka receiver will consume an entire core (no context switch)
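
     For the custom receiver bullet, a simplified sketch in the spirit of the repo's RedisReceiver (the actual class may differ): it blocks on an assumed Redis list named "events" and hands each element to Spark via store().

     import org.apache.spark.storage.StorageLevel
     import org.apache.spark.streaming.receiver.Receiver
     import redis.clients.jedis.Jedis

     class SimpleRedisReceiver(host: String, port: Int)
       extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

       override def onStart(): Unit = {
         // Receivers run on executors; do the blocking poll on a separate thread
         new Thread("redis-receiver") {
           override def run(): Unit = {
             val jedis = new Jedis(host, port)
             while (!isStopped()) {
               // BLPOP waits up to 1 second for a new element on the "events" list
               val item = jedis.blpop(1, "events")
               if (item != null && !item.isEmpty) store(item.get(1))  // item = [key, value]
             }
             jedis.close()
           }
         }.start()
       }

       override def onStop(): Unit = {}  // the polling thread exits via isStopped()
     }

     // Usage: val stream = ssc.receiverStream(new SimpleRedisReceiver("localhost", 6379))
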
  13. Real Time Data Processing with Spark Streaming. Brandon O'Brien, Oct 26th, 2016
