
AWS Webcast - Amazon Kinesis and Apache Storm



Join us for an Amazon Kinesis tutorial webinar. In this session we provide a reference architecture and instructions for building a system that performs real-time sliding-window analysis over streaming clickstream data. We use Amazon Kinesis for managed ingestion of streaming data at scale, with the ability to replay past data, and run the sliding-window computation using Apache Storm. We'll demonstrate how to build the system, deploy it on AWS, and walk through all the steps, from ingestion and processing to storing and visualizing the data in real time.


  1. 1. © 2015, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of, Inc. CLICKSTREAM ANALYTICS – AMAZON KINESIS AND APACHE STORM
  2. 2. Agenda  Clickstream Analytics  Data Ingestion  Amazon Kinesis  Data Processing  Apache Storm  Amazon EMR  Q & A
  3. 3. Clickstream Analytics in Real-time
  4. 4. Clickstream Analytics  From Wikipedia  “… clicks anywhere in the webpage or application, the action is logged on …”  “… useful for web activity analysis, software testing, market research …” It’s all about People & Products !!!
  5. 5. Clickstream Analytics in Real-time Ingestion Files to Events Processing Batch to Continuous Consumption Reports to Alerts
  6. 6. Real-Time Analytics Real-time Ingest • Highly Scalable • Durable • Elastic • Replay-able Reads Continuous Processing FX • Load-balancing incoming streams • Fault-tolerance, Checkpoint / Replay • Elastic • Enable multiple apps to process in parallel Continuous data flow Low end-to-end latency Continuous, real-time workloads +
  7. 7. Data Ingestion
  8. 8. Global top-10 Starting simple...
  9. 9. Global top-10 Elastic Beanstalk Distributing the workload…
  10. 10. Global top-10 Elastic Beanstalk Local top-10 Local top-10 Local top-10 Or using an Elastic Data Broker…
  11. 11. Global top-10 Elastic Beanstalk K I N E S I S Data Record Stream Shard Partition Key Worker My top-10 Data Record Sequence Number 14 17 18 21 23 Amazon Kinesis – Managed Stream
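The Stream / Shard / Partition Key terms on this slide fit together as follows: Kinesis hashes each record's partition key (MD5, into a 128-bit integer) and routes the record to the shard that owns that portion of the hash-key space. A minimal Python sketch of the routing idea, assuming evenly split shard ranges (the real service tracks explicit per-shard hash-key ranges):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Hash the partition key with MD5 into a 128-bit integer and
    map it onto one of `num_shards` equal-width hash-key ranges."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_width = 2 ** 128 // num_shards
    return min(h // range_width, num_shards - 1)

# Records with the same partition key always land on the same shard,
# which is what preserves per-key ordering (sequence numbers).
print(shard_for_key("user-42", 4))
```

Because the mapping is deterministic, all clicks from one user (one partition key) stay ordered within a single shard.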
  12. 12. AWS Endpoint S3 DynamoDB Redshift Data Sources Availability Zone Availability Zone Data Sources Data Sources Data Sources Data Sources Availability Zone Shard 1 Shard 2 Shard N [Data Archive] [Metric Extraction] [Sliding Window Analysis] [Machine Learning] App. 1 App. 2 App. 3 App. 4 EMR Amazon Kinesis – Common Data Broker
  13. 13. Amazon Kinesis – Distributed Streams  From batch to continuous processing  Scale UP or DOWN without losing sequencing  Workers can replay records for up to 24 hours  Scale up to GB/sec without losing durability  Records stored across multiple availability zones  Multiple parallel Kinesis Apps  RDBMS, S3, Data Warehouse
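The "replay records" bullet above can be pictured with a toy model: because every record keeps its sequence number, a worker can re-read from any checkpoint within the retention window. A hypothetical in-memory sketch (this is not the Kinesis API; `MiniStream` and its methods are invented for illustration):

```python
class MiniStream:
    """Toy model of a replayable shard: records keep their sequence
    numbers, so a worker can re-read from any checkpoint."""

    def __init__(self):
        self.records = []

    def put(self, data):
        # Assign the next sequence number on ingest.
        self.records.append((len(self.records), data))

    def read_from(self, checkpoint_seq):
        # Replay everything at or after the given checkpoint.
        return [r for r in self.records if r[0] >= checkpoint_seq]

s = MiniStream()
for event in ["click-1", "click-2", "click-3"]:
    s.put(event)

# A worker that checkpointed at sequence 1 resumes from there.
print(s.read_from(1))  # [(1, 'click-2'), (2, 'click-3')]
```

In real Kinesis the checkpoint is a sequence number stored by the consumer (e.g. the KCL keeps it in DynamoDB), and records are replayable for the retention period.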
  14. 14. Data Processing
  15. 15. Batch Real Time Clickstream – Real-time and Batch Batch Analysis DW Hadoop Notifications & Alerts Dashboards/ visualizations APIs Streaming Analytics Clickstream Deep Learning Dashboards/ visualizations Spark Storm KCL Data Archive
  16. 16. Processing Stream in real-time
  17. 17. Storm Concepts  Streams  Unbounded sequence of tuples  Spout  Source of Stream e.g. Read from Twitter streaming API  Bolts  Processes input streams and produces new streams e.g. Functions, Filters, Aggregation, Joins  Topologies  Network of spouts and bolts
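The spout / bolt / topology vocabulary above can be sketched with plain Python generators, purely as a mental model (real Storm bolts run in parallel across workers and exchange tuples over the network):

```python
def spout(lines):
    # Spout: the source of a stream (here an in-memory list stands
    # in for e.g. the Twitter streaming API).
    for line in lines:
        yield line

def split_bolt(stream):
    # Bolt: consumes an input stream and produces a new one
    # (a function/transformation).
    for line in stream:
        for word in line.split():
            yield word

def count_bolt(stream):
    # Bolt: an aggregation over the incoming stream.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Topology: a network of spouts and bolts wired together.
result = count_bolt(split_bolt(spout(["a b", "a c"])))
print(result)  # {'a': 2, 'b': 1, 'c': 1}
```

The wiring step is what `TopologyBuilder.setSpout` / `setBolt` does in Storm, with groupings deciding how tuples are partitioned between parallel bolt instances.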
  18. 18. Storm Architecture Master Node Cluster Coordination Worker Processes Worker Nimbus Zookeeper Zookeeper Zookeeper Supervisor Supervisor Supervisor Supervisor Worker Worker Worker Launches Workers
  19. 19. Apache Storm  Guaranteed data processing  Horizontal scalability  Fault-tolerance  Integration with queuing system  Higher level abstractions
  20. 20. Demo: Real time stream processing
  21. 21. Real-time: Event-based processing Kinesis Storm Spout Producer Amazon Kinesis Apache Storm ElastiCache (Redis) Node.js Client (D3) time-Sliding-Window-Application-Using-Amazon-Kinesis-and-Apache
  22. 22. Creating a Storm Topology
      KinesisSpoutConfig config = new KinesisSpoutConfig(streamName, zookeeperEndpoint)
          .withZookeeperPrefix(zookeeperPrefix)
          .withInitialPositionInStream(initialPositionInStream)
          .withRegion(Regions.fromName(regionName));
      …
      builder.setSpout("Kinesis", spout, 2);
      builder.setBolt("Parse", new ParseReferrerBolt(), 6).shuffleGrouping("Kinesis");
      builder.setBolt("Count", new RollingCountBolt(5, 2, elasticCacheRedisEndpoint), 6)
          .fieldsGrouping("Parse", new Fields("referrer"));
      …
      StormSubmitter.submitTopology(topologyName, topoConf, builder.createTopology());
      Kinesis Storm Spout
  23. 23. Sliding window using Tick Tuple
      …
      public void execute(Tuple tuple) {
          if (TupleHelpers.isTickTuple(tuple)) {
              LOG.debug("Received tick tuple, triggering emit of current window counts");
              emitCurrentWindowCounts();
          } else {
              countObjAndAck(tuple);
          }
      }
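The tick-tuple pattern above pairs with a rolling counter such as `RollingCountBolt(5, 2, …)`: ordinary tuples are counted into the current slot, and each tick emits the window totals and slides the window forward. A simplified Python sketch of that slot-based bookkeeping (names are illustrative, not Storm's implementation):

```python
from collections import defaultdict, deque

class SlidingWindowCounter:
    """Counts objects over the last `num_slots` slots; each tick
    emits the window totals and advances the window by one slot."""

    def __init__(self, num_slots):
        self.slots = deque(
            [defaultdict(int) for _ in range(num_slots)], maxlen=num_slots
        )

    def count(self, obj):
        self.slots[-1][obj] += 1  # count into the current slot

    def tick(self):
        # Emit totals for the whole window, then slide: appending a
        # fresh slot evicts the oldest one (deque maxlen).
        totals = defaultdict(int)
        for slot in self.slots:
            for obj, n in slot.items():
                totals[obj] += n
        self.slots.append(defaultdict(int))
        return dict(totals)

w = SlidingWindowCounter(num_slots=3)
w.count("ref-a"); w.count("ref-a"); w.count("ref-b")
print(w.tick())  # {'ref-a': 2, 'ref-b': 1}
```

With `num_slots = window / emit-frequency` (5 and 2 in the slide's bolt), old counts age out automatically after the window has fully passed.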
  24. 24. Using Redis as an Event relay
      for (Entry<Object, Long> entry : counts.entrySet()) {
          …
          msg.put("name", referrer);
          msg.put("time", currentEPOCH);
          msg.put("count", count);
          …
          jedis.publish("pubsubCounters", msg.toString());
      }
      ElastiCache (Redis)
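The fields published to the `pubsubCounters` channel form a small JSON document per counter. A Python sketch of that message shape (field names are taken from the slide code; `make_counter_message` is a hypothetical helper, and no Redis connection is involved here):

```python
import json
import time

def make_counter_message(referrer, count, epoch=None):
    """Serialize one counter update in the shape the topology
    publishes to the Redis "pubsubCounters" channel."""
    return json.dumps({
        "name": referrer,
        "time": epoch if epoch is not None else int(time.time()),
        "count": count,
    })

msg = make_counter_message("", 7, epoch=1700000000)
print(msg)
```

Publishing a self-describing JSON payload keeps the Node.js consumer decoupled: it can forward the string to browsers without knowing anything about the Storm topology.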
  25. 25. NodeJs – PubSub to Server-Side Events
      function ticker(req, res) {
          …
          subscriber.subscribe("pubsubCounters");
          subscriber.on("message", function(channel, message) {
              res.json(message);
          …
          res.json = function(obj) {
              res.write("data: " + obj + "\n\n");
          }
      }
      connect() {
          ...
          if (req.url == '/eventCounters') {
              ticker(req, res);
          }
      Node.js
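The `"data: " + obj + "\n\n"` write above is the Server-Sent Events wire framing: each message is a `data:` line terminated by a blank line. A one-function Python sketch of the frame format:

```python
def sse_frame(payload: str) -> str:
    # Server-Sent Events framing: a "data:" line followed by a
    # blank line marks the end of one event on the stream.
    return "data: " + payload + "\n\n"

print(repr(sse_frame('{"count": 7}')))
```

The browser's `EventSource` (used on the next slide) parses exactly this framing and fires one `message` event per frame.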
  26. 26. Visualizing the events in Client
      var source = new EventSource('/ticker');
      source.addEventListener('message', tick);
      function tick(e) {
          if (e) {
              var eventData = JSON.parse(;
              window[].push([{ time: eventData.time, y: eventData.count }]);
      Client (D3)
  27. 27. Amazon EMR Processing Streams with Hadoop
  28. 28. Introduction to Amazon EMR  Amazon EMR? Hadoop-as-a-service  Map-Reduce engine  Massively parallel  Cost effective  AWS wrapper, integrated with tools and AWS services
  29. 29. Amazon EMR - Architecture Master instance group Core instance group Task instance group HDFS HDFS Amazon S3  Master instance  Controls the cluster  Core instance  Life of cluster  DataNode and TaskTracker daemons  Task instances  Added or subtracted to perform work (SPOT)  S3 as underlying ‘file system’
  30. 30. Analyzing Kinesis using Amazon EMR Offline Analysis Ad-hoc Analysis Producer Amazon Kinesis Kinesis Application EMR S3 Amazon Kinesis EMR Hive Pig Spark MapReduce
  31. 31. Demo: Stream processing with Spark
  32. 32. Spark Streaming and Kinesis  Launch an EMR cluster with Spark  0RV/Installing-Apache-Spark-on-an-Amazon-EMR-Cluster  Spark Streaming  programming-guide.html  Spark Streaming Kinesis integration  integration.html
  33. 33. Kinesis Word Count Example
      private object KinesisWordCountASL extends Logging {
          …
          val sparkConfig = new SparkConf().setAppName("KinesisWordCount")
          val ssc = new StreamingContext(sparkConfig, batchInterval)
          val unionStreams = ssc.union(kinesisStreams)
          /* Convert each line of Array[Byte] to String, split into words, and count them */
          val words = unionStreams.flatMap(byteArray => new String(byteArray).split(" "))
          /* Map each word to a (word, 1) tuple so we can reduce/aggregate by key. */
          val wordCounts = => (word, 1)).reduceByKey(_ + _)
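The flatMap / map / reduceByKey pipeline above can be traced in plain Python on a few fake Kinesis payloads, just to make the data flow concrete (the sample byte records are invented; Spark would apply the same steps per micro-batch, distributed across executors):

```python
# Byte payloads as they might arrive from Kinesis shards (invented).
records = [b"apache storm", b"amazon kinesis", b"apache spark"]

# flatMap: decode each Array[Byte] record and split into words.
words = [w for r in records for w in r.decode("utf-8").split(" ")]

# map to (word, 1) pairs, then reduceByKey(_ + _): sum per key.
pairs = [(w, 1) for w in words]
word_counts = {}
for w, n in pairs:
    word_counts[w] = word_counts.get(w, 0) + n

print(word_counts)  # {'apache': 2, 'storm': 1, 'amazon': 1, 'kinesis': 1, 'spark': 1}
```

The only conceptual difference in Spark Streaming is that each step is a distributed transformation on a DStream of RDDs rather than a local list comprehension.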
  34. 34. Amazon Kinesis with Apache Storm: analysis-of-clickstream-data-kinesis.pdf  Amazon Kinesis with Amazon EMR: uide/emr-kinesis.html  Amazon Kinesis with Apache Spark: integration.html Q & A
  35. 35. THANK YOU !!!