Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311) | AWS re:Invent 2013

"This presentation will introduce Kinesis, the new AWS service for real-time streaming big data ingestion and processing.
We’ll provide an overview of the key scenarios and business use cases suited to real-time processing, and how Kinesis can help customers shift from traditional batch-oriented data processing to a continual real-time processing model. We’ll explore the key concepts, attributes, APIs and features of the service, and discuss building a Kinesis-enabled application for real-time processing. We’ll walk through a candidate use case in detail: creating an appropriate Kinesis stream for the use case, configuring data producers to push data into Kinesis, and creating the application that reads from Kinesis and performs real-time processing. This talk will also cover key lessons learnt, architectural tips, and design considerations for working with Kinesis and building real-time processing applications."



  1. Supercharge Your Big Data Infrastructure with Amazon Kinesis: Learn to Build Real-time Streaming Big Data Processing Applications. Marvin Theimer, AWS. November 15, 2013. © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. Amazon Kinesis: Managed Service for Real-Time Processing of Big Data [architecture diagram: data sources in multiple Availability Zones push records through an AWS endpoint into a stream of shards 1…N; applications App.1–App.4 (aggregate & de-duplicate, metric extraction, sliding-window analysis, machine learning) consume the shards and deliver results to S3, DynamoDB, and Redshift]
  3. Talk outline
     • A tour of Kinesis concepts in the context of a Twitter Trends service
     • Implementing the Twitter Trends service
     • Kinesis in the broader context of a Big Data ecosystem
  4. A tour of Kinesis concepts in the context of a Twitter Trends service
  5. The twitter-trends.com website [diagram: the twitter-trends.com site hosted on Elastic Beanstalk]
  6. Too big to handle on one box [diagram: a single twitter-trends.com host overwhelmed by the full tweet stream]
  7. The solution: streaming map/reduce [diagram: each of several twitter-trends.com workers computes its own “My top-10”; the per-worker lists are merged into a global top-10]
  8. Core concepts [diagram: data records with partition keys flow through a stream; a shard holds records with sequence numbers 14, 17, 18, 21, 23; a worker consumes the shard]
  9. How this relates to Kinesis [diagram: twitter-trends.com data flows into Kinesis and is consumed by a Kinesis application]
  10. Core concepts recapped
     • Data record ~ a tweet
     • Stream ~ all tweets (the Twitter Firehose)
     • Partition key ~ a Twitter topic (every tweet record belongs to exactly one of these)
     • Shard ~ all the data records belonging to a set of Twitter topics that get grouped together
     • Sequence number ~ assigned to each data record when it is first ingested
     • Worker ~ processes the records of a shard in sequence-number order
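Concretely, the partition-key-to-shard mapping is a hash partition: Kinesis takes an MD5 hash of the partition key and places the record in the shard owning that point of the 128-bit hash-key space. The sketch below is our own illustration of the idea, not service code: the class and method names are hypothetical, and real shards carry explicit hash-key ranges rather than the equal split assumed here.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ShardMapper {
    // Map a partition key to a shard index, assuming the 128-bit MD5 key
    // space is split into numShards equal, contiguous hash-key ranges.
    static int shardFor(String partitionKey, int numShards) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            BigInteger hash = new BigInteger(1, md5);          // unsigned 128-bit value
            BigInteger rangeSize = BigInteger.ONE.shiftLeft(128)
                    .divide(BigInteger.valueOf(numShards));    // width of each shard's range
            int shard = hash.divide(rangeSize).intValue();
            return Math.min(shard, numShards - 1);             // guard the top edge
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);                     // MD5 is always available
        }
    }
}
```

Because the mapping is deterministic, every record for a given Twitter topic lands in the same shard, which is what lets a single worker maintain that topic's counts.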
  11. Using the Kinesis API directly

      // Per-shard reader loop
      iterator = getShardIterator(shardId, LATEST);
      while (true) {
        [records, iterator] = getNextRecords(iterator, maxRecsToReturn);
        process(records);
      }

      // Record processing
      process(records): {
        for (record in records) {
          updateLocalTop10(record);
        }
        if (timeToDoOutput()) {
          writeLocalTop10ToDDB();
        }
      }

      // Global top-10 aggregation loop
      while (true) {
        localTop10Lists = scanDDBTable();
        updateGlobalTop10List(localTop10Lists);
        sleep(10);
      }
  12. Challenges with using the Kinesis API directly: manual creation of workers and manual assignment of workers to shards. How many workers per EC2 instance? How many EC2 instances? [diagram: twitter-trends.com feeds Kinesis; the Kinesis application must wire its own workers to shards]
  13. Using the Kinesis Client Library [diagram: twitter-trends.com feeds Kinesis; KCL workers in the Kinesis application coordinate via a shard mgmt table]
  14. Elasticity and load balancing using Auto Scaling groups [diagram: the Kinesis application workers run in an Auto Scaling group, coordinating via the shard mgmt table]
  15. Dynamic growth and hot-spot management through stream resharding [diagram: as the stream is resharded, workers in the Auto Scaling group rebalance via the shard mgmt table]
  16. The challenges of fault tolerance [diagram: failed workers, marked X, leave their shards unprocessed]
  17. Fault tolerance support in KCL [diagram: when a worker in Availability Zone 1 fails, a worker in Availability Zone 3 takes over its shard via the shard mgmt table]
  18. Checkpoint, replay design pattern [diagram: Host A processes shard-i, maintaining a local top-10 (e.g. {#Obama: 10235, #Congress: 9910, #Seahawks: 9835, …}) and recording its latest checkpointed sequence number; when Host A fails, Host B acquires the lock for shard-i in the shard mgmt table and replays the shard from the last checkpoint]
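The checkpoint/replay pattern can be sketched without any AWS dependencies: a worker taking over a shard re-reads everything after the last checkpointed sequence number, which means records between the checkpoint and the failure are processed twice and downstream output must tolerate that. A minimal illustration (class and method names are our own, with integer sequence numbers standing in for Kinesis's opaque ones):

```java
import java.util.ArrayList;
import java.util.List;

class CheckpointReplay {
    // Given the shard's records in sequence-number order and the last
    // checkpointed sequence number, return the records a worker taking
    // over the shard must (re)process.
    static List<Integer> recordsToReplay(List<Integer> shardSeqNums, int lastCheckpoint) {
        List<Integer> replay = new ArrayList<>();
        for (int seq : shardSeqNums) {
            if (seq > lastCheckpoint) replay.add(seq);  // everything after the checkpoint
        }
        return replay;
    }
}
```

This is why the slides checkpoint only after the local top-10 has been durably written: replayed records then at worst overwrite output that was already emitted.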
  19. Catching up from delayed processing [diagram: Host B takes over shard-i from failed Host A and reads through the backlog] How long until we’ve caught up? Shard throughput SLA: 1 MBps in, 2 MBps out => catch-up time ~ down time. Unless your application can’t process 2 MBps on a single host => provision more shards.
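The catch-up arithmetic behind "catch-up time ~ down time" is worth making explicit: with ingest at 1 MBps and read capacity at 2 MBps per shard, the backlog drains at the 1 MBps difference, so clearing it takes about as long as the outage that created it. A small sketch of the formula (our own helper, not AWS code):

```java
class CatchUp {
    // With each shard ingesting at inMBps and a reader able to drain at
    // outMBps, a backlog accumulated over downtimeSec seconds clears in
    // downtime * in / (out - in) seconds; reading must outpace ingest.
    static double catchUpSeconds(double downtimeSec, double inMBps, double outMBps) {
        if (outMBps <= inMBps) return Double.POSITIVE_INFINITY; // never catches up
        return downtimeSec * inMBps / (outMBps - inMBps);
    }
}
```

With the slide's 1 MBps in / 2 MBps out figures, a 10-minute outage takes roughly 10 minutes to recover from; if the application can only drain at ingest rate or below, it never recovers, which is the cue to provision more shards.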
  20. Concurrent processing applications [diagram: twitter-trends.com and spot-the-revolution.org each run their own workers over the same shard-i, each tracking its own position in the stream]
  21. Why not combine applications? [diagram: twitter-trends.com outputs state and checkpoints every 10 secs, while twitter-stats.com outputs state and checkpoints every hour; combining them would couple those two cadences]
  22. Implementing twitter-trends.com
  23. Creating a Kinesis stream
  24. How many shards? Additional considerations:
     • How much processing is needed per data record/data byte?
     • How quickly do you need to catch up if one or more of your workers stalls or fails?
     You can always dynamically reshard to adjust to changing workloads.
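As a starting point before those considerations, the shard count falls out of the per-shard limits quoted in this era of the service: 1 MBps in, 2 MBps out, and roughly 1,000 PUT records per second (treat these constants as assumptions to re-check against current limits). A back-of-the-envelope sketch, in our own naming:

```java
class ShardSizing {
    // Take the most demanding of the three per-shard limits; the constants
    // 1.0 MBps in, 2.0 MBps out, and 1000 records/s are 2013-era figures.
    static int shardsNeeded(double ingestMBps, double readMBps, double putRecsPerSec) {
        int byIngest = (int) Math.ceil(ingestMBps / 1.0);
        int byRead   = (int) Math.ceil(readMBps / 2.0);
        int byRecs   = (int) Math.ceil(putRecsPerSec / 1000.0);
        return Math.max(1, Math.max(byIngest, Math.max(byRead, byRecs)));
    }
}
```

For example, a workload ingesting 5 MBps, fanning out 8 MBps of reads, and writing 3,000 records/s is ingest-bound and needs 5 shards; dynamic resharding then adjusts the count as the workload changes.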
  25. Twitter Trends Shard Processing Code

      class TwitterTrendsShardProcessor implements IRecordProcessor {
          public TwitterTrendsShardProcessor() { … }

          @Override
          public void initialize(String shardId) { … }

          @Override
          public void processRecords(List<Record> records,
                  IRecordProcessorCheckpointer checkpointer) { … }

          @Override
          public void shutdown(IRecordProcessorCheckpointer checkpointer,
                  ShutdownReason reason) { … }
      }
  26. Twitter Trends Shard Processing Code

      class TwitterTrendsShardProcessor implements IRecordProcessor {
          private Map<String, AtomicInteger> hashCount = new HashMap<>();
          private long tweetsProcessed = 0;

          @Override
          public void processRecords(List<Record> records,
                  IRecordProcessorCheckpointer checkpointer) {
              computeLocalTop10(records);
              tweetsProcessed += records.size(); // count tweets, not batches
              if (tweetsProcessed >= 2000) {
                  emitToDynamoDB(checkpointer);
              }
          }
      }
  27. Twitter Trends Shard Processing Code

      private void computeLocalTop10(List<Record> records) {
          for (Record r : records) {
              String tweet = new String(r.getData().array());
              String[] words = tweet.split("\\s+"); // split on whitespace
              for (String word : words) {
                  if (word.startsWith("#")) {
                      if (!hashCount.containsKey(word))
                          hashCount.put(word, new AtomicInteger(0));
                      hashCount.get(word).incrementAndGet();
                  }
              }
          }
      }

      private void emitToDynamoDB(IRecordProcessorCheckpointer checkpointer) {
          persistMapToDynamoDB(hashCount);
          try {
              checkpointer.checkpoint();
          } catch (KinesisClientLibDependencyException | InvalidStateException
                  | ThrottlingException | ShutdownException e) {
              // Error handling
          }
          hashCount = new HashMap<>();
          tweetsProcessed = 0;
      }
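The slides build the hashCount map but leave the actual top-10 extraction offstage. A hypothetical companion helper in the same spirit (names are our own, and plain Integer counts are used instead of AtomicInteger for brevity):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TopN {
    // Extract the n most frequent hashtags from a count map, highest first.
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

In the twitter-trends.com design each worker would run this over its shard-local counts before writing the result to DynamoDB for the global merge.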
  28. Kinesis as a gateway into the Big Data ecosystem
  29. Leveraging the Big Data ecosystem [diagram: the twitter-trends.com Twitter trends processing application also feeds a Twitter trends anonymized archive and Twitter trends statistics]
  30. Connector framework

      public interface IKinesisConnectorPipeline<T> {
          IEmitter<T> getEmitter(KinesisConnectorConfiguration configuration);
          IBuffer<T> getBuffer(KinesisConnectorConfiguration configuration);
          ITransformer<T> getTransformer(KinesisConnectorConfiguration configuration);
          IFilter<T> getFilter(KinesisConnectorConfiguration configuration);
      }
  31. Transformer & Filter interfaces

      public interface ITransformer<T> {
          public T toClass(Record record) throws IOException;
          public byte[] fromClass(T record) throws IOException;
      }

      public interface IFilter<T> {
          public boolean keepRecord(T record);
      }
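As a concrete example of the IFilter shape, here is a trivial filter that keeps only tweets containing a hashtag. The interface is restated locally so the sketch compiles standalone, and the filter itself is our own invention rather than connector-library code:

```java
class HashtagFilter {
    // Local copy of the one-method IFilter shape shown on the slide.
    interface IFilter<T> {
        boolean keepRecord(T record);
    }

    // Keep only records whose text contains at least one hashtag; records
    // dropped here never reach the buffer or the emitter.
    static final IFilter<String> KEEP_HASHTAGS =
            tweet -> tweet != null && tweet.contains("#");
}
```

In the real pipeline the type parameter would be the transformed record class (e.g. TwitterRecord) rather than a raw String.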
  32. Tweet transformer for Amazon S3 archival

      public class TweetTransformer implements ITransformer<TwitterRecord> {
          private final JsonFactory fact =
              JsonActivityFeedProcessor.getObjectMapper().getJsonFactory();

          @Override
          public TwitterRecord toClass(Record record) {
              try {
                  return fact.createJsonParser(
                      record.getData().array()).readValueAs(TwitterRecord.class);
              } catch (IOException e) {
                  throw new RuntimeException(e);
              }
          }
          // (class continues on the next slide)
  33. Tweet transformer (cont.)

          @Override
          public byte[] fromClass(TwitterRecord r) {
              SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
              StringBuilder b = new StringBuilder();
              b.append(r.getActor().getId()).append("|")
               .append(sanitize(r.getActor().getDisplayName())).append("|")
               .append(r.getActor().getFollowersCount()).append("|")
               .append(r.getActor().getFriendsCount()).append("|")
               ...
               .append('\n');
              return b.toString().replaceAll("\ufffd\ufffd\ufffd", "").getBytes();
          }
      }
  34. Buffer interface

      public interface IBuffer<T> {
          public long getBytesToBuffer();
          public long getNumRecordsToBuffer();
          public boolean shouldFlush();
          public void consumeRecord(T record, int recordBytes, String sequenceNumber);
          public void clear();
          public String getFirstSequenceNumber();
          public String getLastSequenceNumber();
          public List<T> getRecords();
      }
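A minimal in-memory implementation in the shape of this interface might look like the following. This is our own sketch, with the flush thresholds fixed at construction, whereas the real connector library reads them from KinesisConnectorConfiguration:

```java
import java.util.ArrayList;
import java.util.List;

// Buffer records until either the byte or the record threshold is reached,
// tracking the first and last sequence numbers for checkpointing.
class BasicBuffer<T> {
    private final long bytesToBuffer;
    private final long numRecordsToBuffer;
    private final List<T> records = new ArrayList<>();
    private long byteCount = 0;
    private String firstSeq, lastSeq;

    BasicBuffer(long bytesToBuffer, long numRecordsToBuffer) {
        this.bytesToBuffer = bytesToBuffer;
        this.numRecordsToBuffer = numRecordsToBuffer;
    }

    void consumeRecord(T record, int recordBytes, String sequenceNumber) {
        if (records.isEmpty()) firstSeq = sequenceNumber;
        lastSeq = sequenceNumber;
        records.add(record);
        byteCount += recordBytes;
    }

    boolean shouldFlush() {
        return byteCount >= bytesToBuffer || records.size() >= numRecordsToBuffer;
    }

    void clear() {
        records.clear();
        byteCount = 0;
        firstSeq = lastSeq = null;
    }

    String getFirstSequenceNumber() { return firstSeq; }
    String getLastSequenceNumber()  { return lastSeq; }
    List<T> getRecords()            { return records; }
}
```

When shouldFlush() turns true, the pipeline hands the buffered records to the emitter and checkpoints at the last sequence number, then clears the buffer.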
  35. Tweet aggregation buffer for Amazon Redshift

      ...
      @Override
      public void consumeRecord(TwitterAggRecord record, int recordSize,
              String sequenceNumber) {
          if (buffer.isEmpty()) {
              firstSequenceNumber = sequenceNumber;
          }
          lastSequenceNumber = sequenceNumber;

          TwitterAggRecord agg = null;
          if (!buffer.containsKey(record.getHashTag())) {
              agg = new TwitterAggRecord();
              buffer.put(record.getHashTag(), agg); // keyed by hashtag
              byteCount.addAndGet(agg.getRecordSize());
          }
          agg = buffer.get(record.getHashTag());
          agg.aggregate(record);
      }
  36. Tweet aggregation transformer for Amazon Redshift

      public class TweetTransformer implements ITransformer<TwitterAggRecord> {
          private final JsonFactory fact =
              JsonActivityFeedProcessor.getObjectMapper().getJsonFactory();

          @Override
          public TwitterAggRecord toClass(Record record) {
              try {
                  return new TwitterAggRecord(
                      fact.createJsonParser(
                          record.getData().array()).readValueAs(TwitterRecord.class));
              } catch (IOException e) {
                  throw new RuntimeException(e);
              }
          }
          // (class continues on the next slide)
  37. Tweet aggregation transformer (cont.)

          @Override
          public byte[] fromClass(TwitterAggRecord r) {
              SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
              StringBuilder b = new StringBuilder();
              b.append(r.getActor().getId()).append("|")
               .append(sanitize(r.getActor().getDisplayName())).append("|")
               .append(r.getActor().getFollowersCount()).append("|")
               .append(r.getActor().getFriendsCount()).append("|")
               ...
               .append('\n');
              return b.toString().replaceAll("\ufffd\ufffd\ufffd", "").getBytes();
          }
      }
  38. Emitter interface

      public interface IEmitter<T> {
          List<T> emit(UnmodifiableBuffer<T> buffer) throws IOException;
          void fail(List<T> records);
          void shutdown();
      }
  39. Example Amazon Redshift connector [diagram: records from Kinesis are staged into Amazon S3 and then loaded into Amazon Redshift]
  40. Example Amazon Redshift connector – v2 [diagram: a second Kinesis stream is added to the v1 pipeline of Kinesis, Amazon S3, and Amazon Redshift]
  41. Amazon Kinesis: Key Developer Benefits
     • Easy Administration: Managed service for real-time streaming data collection, processing, and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.
     • Real-time Performance: Perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.
     • Elastic: Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.
     • S3, Redshift, & DynamoDB Integration: Reliably collect, process, and transform all of your data in real time and deliver it to the AWS data stores of your choice, with connectors for Amazon S3, Amazon Redshift, and Amazon DynamoDB.
     • Build Real-time Applications: Client libraries that enable developers to design and operate real-time streaming data processing applications.
     • Low Cost: Cost-efficient for workloads of any scale. You can get started by provisioning a small stream, and pay low hourly rates only for what you use.
  42. Please give us your feedback on this presentation (BDT311). As a thank you, we will select prize winners daily for completed surveys! You can follow us at http://aws.amazon.com/kinesis, @awscloud, #kinesis
  43. Implementing spot-the-revolution.org
  44. spot-the-revolution.org – version 1 [diagram: the Twitter Firehose feeds a self-managed (and costly: $$) Kafka cluster coordinated by ZooKeeper, consumed by the spot-the-revolution Storm application]
  45. spot-the-revolution.org – Kinesis version [diagram: the spot-the-revolution Storm application on EC2 reads from Kinesis via the Storm connector and produces an anonymized archive and 1-minute statistics]
  46. Generalizing the design pattern
  47. Clickstream analytics example [diagram: a clickstream processing application produces a clickstream archive, aggregate clickstream statistics, and clickstream trend analysis]
  48. Simple metering & billing example [diagram: metering records flow to a metering record archive, incremental bill computation, a billing mgmt service, and billing auditors]
