Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce



Hadoop was originally used as a batch analytics tool, but this is changing rapidly as applications move toward real-time processing and streaming. Amazon Elastic MapReduce has made running Hadoop in the cloud easier and more accessible than ever. Each day, tens of thousands of Hadoop clusters are run on the Amazon Elastic MapReduce infrastructure by users of every size, from university students to Fortune 50 companies. We recently launched Amazon Kinesis, a managed service for real-time processing of high-volume streaming data. Amazon Kinesis enables a new class of big data applications that can continuously analyze data at any volume and throughput, in real time. Adi will discuss each service, dive into how customers are adopting the services for different use cases, and share emerging best practices. Learn how you can architect Amazon Kinesis and Amazon Elastic MapReduce together to create a highly scalable real-time analytics solution that can ingest and process terabytes of data per hour from hundreds of thousands of concurrent sources. Forever change how you process website clickstreams, marketing and financial transactions, social media feeds, logs and metering data, and location-tracking events.

Published in: Technology, Business


  1. 1. © 2014, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of, Inc. Amazon Kinesis & Big Data Meld real-time streaming with EMR (Hadoop), & Redshift (Data Warehousing) Adi Krishnan, AWS Product Management, @adityak Daniel Mintz, Director of BI, Upworthy, @danielmintz July 10, 2014
  2. 2. Amazon Kinesis & Big Data
     • Motivations for Stream Processing
       – Origins: Internal metering capability
       – Expanding the big data processing landscape
     • Customer view on streaming data
     • Amazon Kinesis Overview
       – Amazon Kinesis Architecture
       – Kinesis concepts & Demo
     • Amazon Elastic MapReduce and Kinesis
       – EMR connector morphs Kinesis streamed data into Hadoop framework
       – Applying Hadoop frameworks to streaming data
     • Amazon Kinesis and Redshift
       – Upworthy presents “Shrinking Redshift data load times from 24 hours to 10 minutes”
       – Presented by Daniel Mintz, Director of Business Intelligence, Upworthy
  3. 3. The Motivation for Continuous Processing
  4. 4. Origins: Internal AWS Metering Capability
     Workload
     • 10s of millions of records/sec
     • Multiple TB per hour
     • 100,000s of sources
     Pain points
     • Doesn’t scale elastically
     • Customers want real-time alerts
     • Expensive to operate
     • Relies on eventually consistent storage
  5. 5. Expanding the Big Data Processing Landscape
     Traditional Data Warehousing
     • Query Engine Approach
     • Pre-computations such as indices and dimensional views improve performance
     • Historical, structured data
     Hadoop Style Processing
     • HIVE/SQL-on-Hadoop/M-R/Spark
     • Batch programs, or other abstractions breaking down into MR style computations
     • Historical, semi-structured data
     Stream Processing
     • Custom computations of relatively simple complexity
     • Continuous Processing (filters, sliding windows, aggregates) on infinite data streams
     • Semi/Structured data, generated continuously in real time
  6. 6. A Generalized Data Flow Many different technologies, at different stages of evolution Client/Sensor Aggregator Continuous Processing Storage Analytics + Reporting
  7. 7. Our Big Data Transition Old Posture • Capture huge amounts of data and process it in hourly or daily batches New Requirements • Make decisions faster, sometimes in real-time • Scale entire system elastically • Make it easy to “keep everything” • Multiple applications can process data in parallel
  8. 8. Foundation for Data Streams Ingestion, Continuous Processing
     Right Toolset for the Right Job
     Real-time Ingest
     • Highly Scalable
     • Durable
     • Elastic
     • Replay-able Reads
     Continuous Processing FX
     • Load-balancing incoming streams
     • Fault-tolerance, Checkpoint / Replay
     • Elastic
     • Enable multiple apps to process in parallel
     Enable data movement into Stores/Processing Engines
     Managed Service
     Low end-to-end latency
     Continuous, real-time workloads
  9. 9. Customer View
  10. 10. Customer Scenarios across Industry Segments
     Data Types: IT infrastructure, Applications logs, Social media, Fin. Market data, Web Clickstreams, Sensors, Geo/Location data
     Scenarios: (1) Accelerated Ingest-Transform-Load, (2) Continual Metrics/KPI Extraction, (3) Responsive Data Analysis
     • Software/Technology: (1) IT server, App logs ingestion; (2) IT operational metrics dashboards; (3) Devices/Sensor Operational Intelligence
     • Digital Ad Tech./Marketing: (1) Advertising Data aggregation; (2) Advertising metrics like coverage, yield, conversion; (3) Analytics on User engagement with Ads, Optimized bid/buy engines
     • Financial Services: (1) Market/Financial Transaction order data collection; (2) Financial market data metrics; (3) Fraud monitoring, and Value-at-Risk assessment, Auditing of market order data
     • Consumer Online/E-Commerce: (1) Online customer engagement data aggregation; (2) Consumer engagement metrics like page views, CTR; (3) Customer clickstream analytics, Recommendation engines
  11. 11. Big streaming data comes from the small
     • Metering Record: { "payerId": "Joe", "productCode": "AmazonS3", "clientProductCode": "AmazonS3", "usageType": "Bandwidth", "operation": "PUT", "value": "22490", "timestamp": "1216674828" }
     • Common Log Entry: user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
     • Syslog Entry: <165>1 2003-10-11T22:14:15.003Z evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"][examplePriority@32473 class="high"]
     • MQTT Record: “SeattlePublicWater/Kinesis/123/Realtime” – 412309129140
     • NASDAQ OMX Record: <R,AMZN ,T,G,R1>
  12. 12. What Biz. Problem needs to be solved?
     Mobile/Social Gaming
     • Continuous/real-time delivery of game insight data by 100s of game servers
     • Custom-built solutions operationally complex to manage, & not scalable
     • Delay with critical business data delivery
     • Developer burden in building reliable, scalable platform for real-time data ingestion/processing
     • Slow-down of real-time customer insights
     • Accelerate time to market of elastic, real-time applications while minimizing operational overhead
     Digital Advertising Tech.
     • Generate real-time metrics, KPIs for online ad performance for advertisers/publishers
     • Store + Forward fleet of log servers, and Hadoop based processing pipeline
     • Lost data with Store/Forward layer
     • Operational burden in managing reliable, scalable platform for real-time data ingestion/processing
     • Batch-driven real-time customer insights
     • Generate freshest analytics on advertiser performance to optimize marketing spend, and increase responsiveness to clients
  13. 13. Amazon Kinesis Managed Service for streaming data ingestion, and processing
  14. 14. Amazon Kinesis Architecture Amazon Web Services AZ AZ AZ Durable, highly consistent storage replicates data across three data centers (availability zones) Aggregate and archive to S3 Millions of sources producing 100s of terabytes per hour Front End Authentication Authorization Ordered stream of events supports multiple readers Real-time dashboards and alarms Machine learning algorithms or sliding window analytics Aggregate analysis in Hadoop or a data warehouse Inexpensive: $0.028 per million puts
  15. 15. Kinesis Stream: Managed ability to capture and store data • Streams are made of Shards • Each Shard ingests data up to 1MB/sec, and up to 1000 TPS • Each Shard emits up to 2 MB/sec • All data is stored for 24 hours • Scale Kinesis streams by splitting or merging Shards • Replay data inside of 24Hr. Window
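The per-shard limits on the slide above make stream sizing a matter of simple arithmetic. The following is a minimal sketch, assuming the three limits quoted there (1 MB/sec and 1000 records/sec in, 2 MB/sec out); the function name and the sample workload figures are illustrative and not part of any AWS API.

```python
import math

# Per-shard limits as quoted on the slide.
INGEST_MB_PER_SEC = 1.0
INGEST_RECORDS_PER_SEC = 1000
EGRESS_MB_PER_SEC = 2.0


def shards_needed(ingest_mb_per_sec, records_per_sec, egress_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,
        math.ceil(ingest_mb_per_sec / INGEST_MB_PER_SEC),
        math.ceil(records_per_sec / INGEST_RECORDS_PER_SEC),
        math.ceil(egress_mb_per_sec / EGRESS_MB_PER_SEC),
    )


# Hypothetical workload: 5 MB/sec in, 2500 records/sec, and three consumers
# each reading the full stream (3 x 5 = 15 MB/sec out).
print(shards_needed(5, 2500, 15))
```

In this hypothetical case the egress limit dominates, which is typical once several applications read the same stream in parallel.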
  16. 16. Putting Data into Kinesis Simple Put interface to store data in Kinesis • Producers use a PUT call to store data in a Stream • PutRecord {Data, PartitionKey, StreamName} • A Partition Key is supplied by producer and used to distribute the PUTs across Shards • Kinesis MD5 hashes supplied partition key over the hash key range of a Shard • A unique Sequence # is returned to the Producer upon a successful PUT call
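To make the partition-key routing above concrete, here is a rough sketch of how an MD5 hash of the key can be matched against per-shard hash key ranges. It assumes the shards split the 128-bit hash space evenly, which holds only until shards are split or merged unevenly; the helper names are made up for illustration, and this is not the service's own code.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def even_shard_ranges(shard_count):
    """Split the 128-bit hash key space into contiguous, near-equal ranges."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last range absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def hash_key(partition_key):
    """Map a partition key to an integer in the 128-bit hash key space."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash key outside all shard ranges")


ranges = even_shard_ranges(4)
print(shard_for("Joe", ranges))
```

Because the mapping depends only on the key, all records with the same partition key land on the same shard, so a low-cardinality key concentrates traffic on few shards.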
  17. 17. Building Kinesis Processing Apps: Kinesis Client Library Open Source library for fault-tolerant, continuous processing apps • Java client library, source available on Github • Build app with KCL on your EC2 instance(s) • KCL is intermediary b/w your application & stream • Automatically starts a Kinesis Worker for each shard • Simplifies reading by abstracting individual shards • Increase / Decrease Workers as # of shards changes • Checkpoints to keep track of a Worker’s location in the stream, Restarts Workers if they fail • Deploy app on your EC2 instances • Integrates with AutoScaling groups to redistribute workers to new instances
  18. 18. Amazon Kinesis Connector Library
     Open Source code to connect Kinesis with S3, Redshift, DynamoDB
     • ITransformer: Defines the transformation of records from the Amazon Kinesis stream to suit the user-defined data model
     • IFilter: Excludes irrelevant records from processing
     • IBuffer: Buffers the set of records to be processed by specifying a size limit (# of records) and total byte count
     • IEmitter: Makes client calls to other AWS services and persists the records stored in the buffer
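The four roles on the slide above chain into a transform, filter, buffer, emit pipeline. The sketch below shows that flow in Python purely for illustration: the real Connector Library is Java, the `Pipeline` class and its parameters are invented here, and the buffer flushes on record count only (the byte-count limit is left out for brevity).

```python
import json


class Pipeline:
    """Toy transform -> filter -> buffer -> emit chain (hypothetical API)."""

    def __init__(self, transform, keep, buffer_limit, emit):
        self.transform = transform        # role of ITransformer
        self.keep = keep                  # role of IFilter
        self.buffer_limit = buffer_limit  # role of IBuffer (record count only)
        self.emit = emit                  # role of IEmitter
        self.buffer = []

    def process(self, raw_records):
        for raw in raw_records:
            item = self.transform(raw)
            if not self.keep(item):
                continue
            self.buffer.append(item)
            if len(self.buffer) >= self.buffer_limit:
                self.flush()

    def flush(self):
        """Hand the buffered records to the emitter and start a new buffer."""
        if self.buffer:
            self.emit(list(self.buffer))
            self.buffer.clear()


# Usage: keep only PUT operations and emit them in batches of two.
emitted = []
pipeline = Pipeline(
    transform=json.loads,
    keep=lambda record: record["operation"] == "PUT",
    buffer_limit=2,
    emit=emitted.append,  # stand-in for a write to S3, Redshift or DynamoDB
)
pipeline.process([
    '{"operation": "PUT", "value": 1}',
    '{"operation": "GET", "value": 2}',
    '{"operation": "PUT", "value": 3}',
    '{"operation": "PUT", "value": 4}',
])
pipeline.flush()  # push out the final partial batch
```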
  19. 19. Sending & Reading Data from Kinesis Streams
     Sending
     • HTTP Post
     • AWS SDK
     • AWS Mobile SDK
     • LOG4J
     • Flume
     • Fluentd
     Consuming
     • Get* APIs
     • Kinesis Client Library + Connector Library
     • Apache Storm
     • Amazon Elastic MapReduce
  20. 20. Amazon Kinesis & Elastic MapReduce
  21. 21. Amazon Elastic MapReduce (EMR) Managed Service for Hadoop based data processing • Managed service • Easy to tune clusters and trim costs • Support for multiple data stores • Unique features that ensure customer success on AWS
  22. 22. Applying batch processing to streamed data Client/ Sensor Recording Service Aggregator/ Sequencer Continuous processor for dashboard Storage Analytics and Reporting Amazon Kinesis Amazon EMR Streaming Data Ingestion
  23. 23. What would this look like? Processing Input • User • Dev My Website Kinesis Log4J Appender push to Kinesis EMR Hive Pig Cascading MapReduce pull from
  24. 24. Features and Functionality
     • Features offered starting EMR AMI 3.0.4
       – Simply spin up the EMR cluster like normal
     • Logical names
       – Labels that define units of work (Job A vs Job B)
     • Iterations
       – Provide idempotency (pessimistic locking of the Logical name)
     • Checkpoints
       – Create input start and end points to allow batch processing
  25. 25. Iterations – the run of a Job Iteration 1 Iteration 2 Iteration 3 Iteration 4 Trim Horizon seqID 1:00 – 7:00 7:00 – 13:00 13:00 – 19:00 19:00 – 1:00 -24 hours Logical Name Stream NOW Latest seqID Next
  26. 26. Logical Names & Checkpointing – allows efficient batching Kinesis Stream NOW Latest seqID Trim Horizon seqID -24 hours Logical Name Stream
  27. 27. • Dynamo DB Metadata Storage Logical Name A Mapper 1 Mapper 2 Mapper 3 Mapper 4 Logical Name B Mapper 1 Mapper 2 Mapper 3 Mapper 4
  28. 28. Each Kinesis shard maps 1:1 to a Hadoop map task 1:00 – 7:00 7:00 – 13:00 13:00 – 19:00 19:00 – 1:00 Mapper 2 Kinesis Hadoop Next Logical Name Mapper 1 Shard 2 Shard 1 Mapper 2 Mapper 1 Mapper 2 Mapper 1 Mapper 2 Mapper 1 -24 hours Start seq ID End seq ID NOW Latest seqID
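The slides on logical names, iterations and checkpoints describe each run as reading a fixed sequence range per shard, with the boundaries kept in DynamoDB. A minimal sketch of that bookkeeping follows. It is not the EMR connector's code: the in-memory `CheckpointStore` stands in for the DynamoDB table, sequence IDs are simplified to small integers, and every name is hypothetical.

```python
class CheckpointStore:
    """In-memory stand-in for the DynamoDB metadata table."""

    def __init__(self):
        # (logical_name, iteration, shard_id) -> (start_seq, end_seq)
        self._ranges = {}

    def get(self, logical_name, iteration, shard_id):
        return self._ranges.get((logical_name, iteration, shard_id))

    def put(self, logical_name, iteration, shard_id, boundaries):
        self._ranges[(logical_name, iteration, shard_id)] = boundaries


def plan_iteration(store, logical_name, iteration, shard_id, trim_horizon, latest):
    """Return the (start, end) range one map task should read for a shard."""
    existing = store.get(logical_name, iteration, shard_id)
    if existing is not None:
        # Rerun of a known iteration: reuse its boundaries (idempotent input).
        return existing
    if iteration == 0:
        start = trim_horizon
    else:
        previous = store.get(logical_name, iteration - 1, shard_id)
        if previous is None:
            raise ValueError("previous iteration has not been recorded")
        start = previous[1]  # continue where the last iteration ended
    boundaries = (start, latest)
    store.put(logical_name, iteration, shard_id, boundaries)
    return boundaries


store = CheckpointStore()
first = plan_iteration(store, "JobA", 0, "shard-1", trim_horizon=100, latest=700)
rerun = plan_iteration(store, "JobA", 0, "shard-1", trim_horizon=100, latest=900)
second = plan_iteration(store, "JobA", 1, "shard-1", trim_horizon=100, latest=1300)
other = plan_iteration(store, "JobB", 0, "shard-1", trim_horizon=100, latest=1300)
```

The rerun returns the same range even though the stream has advanced, and a second logical name tracks its own position independently of the first.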
  29. 29. Handling stream scaling events Trim Horizon seqID 1:00 – 7:00 7:00 – 13:00 13:00 – 19:00 19:00 – 1:00 Mapper 2 Kinesis Hadoop Logical Name Mapper 1 Shard 2 Shard 1 Mapper 2 Mapper 1 Mapper 2 Mapper 1 3 Mapper 3 1 2 4 5 Split Shard 2 Shard 1 Shard 2 Shard 1 Shard 3 Shard 2 Shard 1 Shard 3 Split Merge -24 hours Latest seqID NOW Next
  30. 30. Handling errors
     • InputFormat handles service errors
       – Throttling: 400
       – Service unavailable errors: 503
       – Internal server errors: 500
       – HTTP client exceptions: socket connection timeout
     • Hadoop handles retry of failed map tasks
     • Iterations allow retries
       – Fixed input boundaries on a stream (idempotency for reruns)
       – Enable multiple queries on the same input boundaries
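One common way to handle the transient errors listed above is retry with exponential backoff. The sketch below shows that pattern as an illustration only: the slide does not say how the InputFormat retries, and `ServiceError`, the retry count and the delays are all assumptions.

```python
import random
import time

# Status codes the slide lists as handled. In practice only throttling-type
# 400 responses are worth retrying; other 400s indicate a bad request.
RETRYABLE_STATUS = {400, 500, 503}


class ServiceError(Exception):
    """Hypothetical exception carrying the HTTP status of a failed call."""

    def __init__(self, status):
        super().__init__(f"service returned {status}")
        self.status = status


def call_with_retries(operation, max_retries=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying transient failures with jittered backoff."""
    attempt = 0
    while True:
        try:
            return operation()
        except ServiceError as err:
            if err.status not in RETRYABLE_STATUS or attempt == max_retries:
                raise
        except (TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise
        # Double the delay on each attempt and add jitter to avoid retry storms.
        delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)
        sleep(delay)
        attempt += 1
```

Retries at this level cover short-lived failures; a map task that still fails is retried by Hadoop, and the fixed iteration boundaries keep such reruns reading the same input.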
  31. 31. Hadoop Ecosystem Implementation
     Implementations
     • Hadoop Input format
     • Hive Storage Handler
     • Pig Load Function
     • Cascading Scheme and Tap
     Use Cases
     • Join multiple data sources for analysis
     • Filter and preprocess streams
     • Export and archive streaming data
  32. 32. Writing to Kinesis using Log4J
     Options (default in parentheses):
     • log4j.appender.KINESIS.streamName (AccessLogStream): Stream name to which data is to be published.
     • log4j.appender.KINESIS.encoding (UTF-8): Encoding used to convert log message strings into bytes before sending to Amazon Kinesis.
     • log4j.appender.KINESIS.maxRetries (3): Maximum number of retries when calling Kinesis APIs to publish a log message.
     • log4j.appender.KINESIS.backoffInterval (100ms): Milliseconds to wait before a retry attempt.
     • log4j.appender.KINESIS.threadCount (20): Number of parallel threads for publishing logs to configured Kinesis stream.
     • log4j.appender.KINESIS.bufferSize (2000): Maximum number of outstanding log messages to keep in memory.
     • log4j.appender.KINESIS.shutdownTimeout (30): Seconds to send buffered messages before application JVM quits normally.
     Example log call: .error("Cannot find resource XYX… go do something about it!");
  33. 33. Run the Ad-hoc Hive Query
  34. 34. Run the Ad-hoc Hive Query
  35. 35. Amazon Kinesis & Redshift
  36. 36. 24 Hours to 10 Minutes: How Upworthy’s Data Pipeline uses Kinesis. Daniel Mintz, Director Business Intelligence, Upworthy, @danielmintz
  37. 37. What’s Upworthy • We’ve been called – “Social media with a mission” by our About Page – “The fastest growing media site of all time” by Fast Company – “The Fastest Rising Startup” by The Crunchies – “That thing that’s all over my newsfeed” by my annoyed friends – “The most data-driven media company in history” by me, optimistically
  38. 38. What We Do • We aim to drive massive amounts of attention to things that really matter. • We do that by finding, packaging, and distributing great, meaningful content.
  39. 39. Our Use Case
  40. 40. When We Started • Had built a data warehouse from scratch • Hadoop-based batch workflow • Nightly ETL cycle • 2.5 Engineers • Wanted to do all three: – Comprehensive – Ad Hoc – Real-Time
  41. 41. The Decision • Speed up our current system, rather than building a parallel one • Had looked at alternative stream processors – Cost – Maintenance • Comfortable with concept of application log stream
  42. 42. How It Works • Log Drain receives, formats, batches and zips • PUTs 50k GZIP batches on Kinesis stream • Three types of Kinesis consumers: 1. Archiver – Batch and write permanent record 2. Stats – Filter, sample and count; Report to StatHat 3. Transformer – Filter, batch, validate; writes temporary BSVs to S3 • Database Importer handles manifest files. • S3 handles garbage collection.
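The log drain step above (format, batch, zip, then PUT) can be sketched as a generator that packs log lines into gzip-compressed batches under a size cap. This is a guess at the shape of such a component, not Upworthy's code: it assumes "50k" means roughly 50 KB per record, and the function names are invented.

```python
import gzip

MAX_BATCH_BYTES = 50 * 1024  # assumption: "50k" read as about 50 KB


def _compress(lines):
    """Gzip a list of lines joined by newlines."""
    return gzip.compress("\n".join(lines).encode("utf-8"))


def gzip_batches(lines, max_bytes=MAX_BATCH_BYTES):
    """Yield gzip batches of lines, each compressed to at most max_bytes.

    A single line that compresses to more than max_bytes is still yielded
    on its own. Recompressing on every append is simple but quadratic, so
    a production drain would track the size incrementally.
    """
    batch = []
    for line in lines:
        candidate = batch + [line]
        if batch and len(_compress(candidate)) > max_bytes:
            yield _compress(batch)
            batch = [line]
        else:
            batch = candidate
    if batch:
        yield _compress(batch)
```

Each yielded blob would become the data of one record, and a consumer reverses the step with `gzip.decompress(...)` followed by a split on newlines.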
  43. 43. Our system now • Stats: – Average: ~1085 events/second – Peak: ~2500 events/second • Data is available in Redshift < 10 min • Kinesis has been cheap, stable, and gives us redundancy and resiliency. • Computation model that’s easy to reason about
  44. 44. Resiliency • When something goes wrong, you have 24 hours. • Timestamp at outset. Track lag at each step. • Bigger workers (more CPU, RAM, deeper queues) can catch us up very fast.
  45. 45. What We’ve Learned
  46. 46. Some Lessons • You can use one pipeline for everything. • High-cardinality fact data belongs in Kinesis. • EDN works well with Kinesis. • We prefer explicit checkpointing. (Your mileage may vary.) • Languages that run on the JVM can take advantage of AWS Client Libraries.
  47. 47. Kinesis Pricing
     Simple, Pay-as-you-go, & no up-front costs
     Pricing Dimension: Value
     • Hourly Shard Rate: $0.015
     • Per 1,000,000 PUT transactions: $0.028
     Notes
     • Customers specify throughput requirements in shards, which they control
     • Each Shard delivers 1 MB/s on ingest, and 2 MB/s on egress
     • Inbound data transfer is free
     • EC2 instance charges apply for Kinesis processing applications
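As a worked example of the two pricing dimensions above, the sketch below estimates stream cost from shard count, PUT rate and duration. It uses only the rates quoted on the slide (as of this 2014 talk) and leaves out EC2 charges for processing applications; the function name and the sample workload are illustrative.

```python
SHARD_HOUR_USD = 0.015        # hourly shard rate from the slide
PER_MILLION_PUTS_USD = 0.028  # per 1,000,000 PUT transactions


def kinesis_cost_usd(shards, puts_per_second, hours):
    """Estimate stream cost: shard-hours plus PUT transactions."""
    shard_cost = shards * hours * SHARD_HOUR_USD
    put_count = puts_per_second * hours * 3600
    put_cost = put_count / 1_000_000 * PER_MILLION_PUTS_USD
    return shard_cost + put_cost


# Hypothetical stream: 2 shards at a steady 1000 PUTs/sec for one day.
# Shards: 2 * 24 * 0.015 = 0.72; PUTs: 86.4 million * 0.028 = 2.4192.
print(round(kinesis_cost_usd(2, 1000, 24), 4))
```

Batching many events into one PUT, as the Upworthy pipeline does, lowers the PUT count and therefore the second term.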
  48. 48. Canonical Data flows with Amazon Kinesis Continuous Metric Extraction Incremental Stats Computation Record Archiving Live Dashboard
  49. 49. Try out Amazon Kinesis • Try out Amazon Kinesis – • Thumb through the Developer Guide – • Test drive the sample app – • Kinesis Connector Framework – • Read EMR-Kinesis FAQs – • Visit, and Post on Kinesis Forum –
  50. 50. Thank You! Adi Krishnan, Product Management, AWS