Structure Data 2014: BIG DATA ANALYTICS RE-INVENTED, Ryan Waite
Presentation from Ryan Waite, General Manager, Data Services, Amazon Web Services
#gigaomlive
More at http://events.gigaom.com/structuredata-2014/

Published in: Technology
  1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
     Data Services at AWS, or Why We Built Amazon Kinesis
     Ryan Waite, General Manager, AWS Data Services
  2. Some statistics about what AWS Data Services does
     •  Metering service
        •  10s of millions of records per second
        •  Terabytes per hour
        •  Hundreds of thousands of sources
        •  Auditors guarantee 100% accuracy at month end
     •  Data Warehouse
        •  100s of extract-transform-load (ETL) jobs every day
        •  Hundreds of thousands of files per load cycle
        •  Hundreds of daily users
        •  Hundreds of queries per hour
     •  Fraud
        •  95% of new customers able to use AWS in minutes
        •  Random forest model to detect fraudulent behavior
     •  Tagging Service
        •  Hundreds of millions of tags on 10s of millions of resources
        •  99.9% of read/write operations handled in < 250ms
  3. Metering service
  4. Internal AWS Metering Service
     Workload
     •  10s of millions of records/sec
     •  Multiple TB per hour
     •  100,000s of sources
     Pain points
     •  Doesn’t scale elastically
     •  Customers want real-time alerts
     •  Expensive to operate
     •  Relies on eventually consistent storage
     Diagram labels: Clients Submitting Data; Process Submissions; Store Batches; S3; Process Hourly w/ Hadoop; Data Warehouse
  5. Data Warehouse
  6. Internal AWS Data Warehouse
     Workload
     •  Daily ingestion of hundreds of data sources
     •  > 3 hours to load and audit data
     •  Hundreds of customers
     •  Hundreds of queries per hour
     Pain points
     •  Data volumes keep growing
     •  Missing our SLAs at month-end
     •  ETL solution is operationally expensive
     •  24 hour old data isn’t fresh enough
     Diagram labels: Hundreds of internal data sources; Legacy ETL solution; Data staged in Amazon S3; Primary and secondary data warehouses
  7. Our big data transition
     Old requirements
     •  Capture huge amounts of data and process it in hourly or daily batches
     New requirements
     •  Make decisions faster, sometimes in real-time
     •  Make it easy to “keep everything”
     •  Multiple applications can process data in parallel
  8. Big data comes from the small
     Metering Record:
       {
         "payerId": "Joe",
         "productCode": "AmazonS3",
         "clientProductCode": "AmazonS3",
         "usageType": "Bandwidth",
         "operation": "PUT",
         "value": "22490",
         "timestamp": "1216674828"
       }
     Common Log Entry:
       127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
     Syslog Entry:
       <165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"] [examplePriority@32473 class="high"]
     MQTT Record:
       “SeattlePublicWater/Kinesis/123/Realtime” – 412309129140
     NASDAQ OMX Record:
       <R,AMZN ,T,G,R1>
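A minimal sketch of how a consumer might decode the metering record shown on this slide. The slide gives only the raw JSON; the function name `parse_metering_record`, the choice to treat `value` as an integer, and the reading of `timestamp` as Unix epoch seconds in UTC are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone

# The metering record from the slide, with every field still a string.
RAW_RECORD = """{
  "payerId": "Joe",
  "productCode": "AmazonS3",
  "clientProductCode": "AmazonS3",
  "usageType": "Bandwidth",
  "operation": "PUT",
  "value": "22490",
  "timestamp": "1216674828"
}"""


def parse_metering_record(raw):
    """Decode one record and convert the numeric fields (hypothetical helper)."""
    fields = json.loads(raw)
    parsed = dict(fields)
    # Assumption: "value" is an integer usage quantity.
    parsed["value"] = int(fields["value"])
    # Assumption: "timestamp" is Unix epoch seconds, interpreted as UTC.
    parsed["timestamp"] = datetime.fromtimestamp(
        int(fields["timestamp"]), tz=timezone.utc
    )
    return parsed


record = parse_metering_record(RAW_RECORD)
print(record["payerId"], record["usageType"], record["value"], record["timestamp"])
```

Each record is tiny; the volume on the earlier slides comes from the number of sources emitting them.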
  9. Kinesis
     Movement or activity in response to a stimulus.
     A fully managed service for real-time processing of high-volume, streaming data. Kinesis can store and process terabytes of data an hour from hundreds of thousands of sources. Data is replicated across multiple Availability Zones to ensure high durability and availability.
  10. Kinesis architecture
      Diagram labels: Amazon Web Services; AZ, AZ, AZ; Front End; Authentication; Authorization
      •  Durable, highly consistent storage replicates data across three data centers (availability zones)
      •  Aggregate and archive to S3
      •  Millions of sources producing 100s of terabytes per hour
      •  Ordered stream of events supports multiple readers
      •  Real-time dashboards and alarms
      •  Machine learning algorithms or sliding window analytics
      •  Aggregate analysis in Hadoop or a data warehouse
      •  Inexpensive: $0.028 per million puts
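To put the quoted put price in perspective, here is a small worked calculation. It uses only the $0.028 per million puts figure from the slide; the function names and the example rate of 1,000 puts per second are illustrative assumptions, and any charges other than puts are not covered by the slide and are ignored here.

```python
# Price quoted on the slide: $0.028 per million puts.
PRICE_PER_MILLION_PUTS_USD = 0.028
SECONDS_PER_DAY = 86_400


def put_cost_usd(num_puts):
    """Put charge in USD for a given number of put calls."""
    return num_puts / 1_000_000 * PRICE_PER_MILLION_PUTS_USD


def daily_put_cost_usd(puts_per_second):
    """Put charge in USD for one day at a steady put rate."""
    return put_cost_usd(puts_per_second * SECONDS_PER_DAY)


# Hypothetical producer fleet issuing 1,000 puts per second all day:
# 86.4 million puts in total.
example_daily_cost = daily_put_cost_usd(1_000)
print(f"${example_daily_cost:.4f} per day")
```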
  11. Make it easy to “capture everything”
      •  Simple Put interface to store data in Kinesis
         •  Producers use a put call to store data in a Stream
         •  A Partition Key is used to distribute the puts across Shards
         •  A unique Sequence # is returned to the Producer upon a successful put call
      •  Streams are made of Shards
         •  A Kinesis Stream is composed of multiple Shards
         •  Each Shard ingests up to 1MB/sec of data and up to 1000 TPS
         •  All data is stored for 24 hours
         •  Scale Kinesis streams by adding or removing Shards
      Diagram labels: Producers; Kinesis; Shard 1, Shard 2, Shard 3, Shard 4, ... Shard n
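The per-shard limits and the role of the partition key can be sketched in a few lines. The limits (1MB/sec and 1000 TPS per shard) come from the slide. The slide does not say how a key is mapped to a shard, so the mapping below is an assumption: it hashes the key with MD5 into a 128-bit space split evenly across shards. The function names and the reading of 1MB as 10**6 bytes are also assumptions.

```python
import hashlib
import math

# Per-shard ingest limits from the slide.
SHARD_MAX_BYTES_PER_SEC = 1_000_000  # "1MB/sec", taken here as 10**6 bytes (assumption)
SHARD_MAX_RECORDS_PER_SEC = 1_000    # "1000 TPS"


def shards_needed(bytes_per_sec, records_per_sec):
    """Smallest shard count that satisfies both the byte and the record limit."""
    by_bytes = math.ceil(bytes_per_sec / SHARD_MAX_BYTES_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_MAX_RECORDS_PER_SEC)
    return max(1, by_bytes, by_records)


def shard_for_key(partition_key, num_shards):
    """Map a partition key to a shard index (assumed scheme: MD5, even ranges)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")  # 128-bit integer
    range_size = 2**128 // num_shards
    # Clamp so hashes in the leftover tail of the space land in the last shard.
    return min(hash_value // range_size, num_shards - 1)


print(shards_needed(bytes_per_sec=5_000_000, records_per_sec=2_000))
print(shard_for_key("Joe", num_shards=4))
```

Because the shard is chosen from the key alone, all puts that share a partition key land on the same shard, and whichever limit is hit first (bytes or records) decides how many shards a stream needs.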
  12. Making it easier to process data in parallel
      •  In order to keep up with the stream, an application must:
         •  Be distributed, to handle multiple shards and scaling up/down
         •  Be fault tolerant, to handle failures in hardware or software
         •  Scale up and down as the number of shards increases or decreases
      •  Kinesis Client Library (KCL) helps with distributed processing:
         •  Abstracts code from knowing about individual shards
         •  Automatically starts a Worker for each shard
         •  Increases and decreases Workers as number of shards changes
         •  Uses checkpoints to keep track of a Worker’s position in the stream
         •  Restarts Workers if they fail
         •  Also works with EC2 Auto Scaling
      Diagram labels: Kinesis; Shard 1, Shard 2, Shard 3, Shard 4, ... Shard n; KCL Worker 1, KCL Worker 2, KCL Worker 3, KCL Worker 4, ... KCL Worker n; EC2 Instances
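The worker-per-shard and checkpoint ideas on this slide can be illustrated with a toy, in-memory simulation. This is not the Kinesis Client Library API: the `ShardWorker` class, the `start_workers` helper, and the dict-based checkpoint store are all hypothetical stand-ins, and the shards are plain Python lists.

```python
class ShardWorker:
    """Toy worker that reads one shard and checkpoints after each record."""

    def __init__(self, shard_id, records, checkpoints):
        self.shard_id = shard_id
        self.records = records
        self.checkpoints = checkpoints  # shared store: shard_id -> next index to read

    def run(self, process, max_records=None):
        # Resume from the last checkpoint, or from the start of the shard.
        start = self.checkpoints.get(self.shard_id, 0)
        end = len(self.records)
        if max_records is not None:
            end = min(end, start + max_records)
        for index in range(start, end):
            process(self.shard_id, self.records[index])
            self.checkpoints[self.shard_id] = index + 1


def start_workers(shards, checkpoints):
    """Start one worker per shard, as the slide describes."""
    return [ShardWorker(sid, recs, checkpoints) for sid, recs in shards.items()]


shards = {"shard-1": ["a", "b", "c"], "shard-2": ["d", "e"]}
checkpoints = {}
seen = []


def collect(shard_id, record):
    seen.append((shard_id, record))


# The first worker for shard-1 handles two records and then "fails".
workers = start_workers(shards, checkpoints)
workers[0].run(collect, max_records=2)

# Replacement workers resume from the checkpoints instead of starting over.
for worker in start_workers(shards, checkpoints):
    worker.run(collect)

print(seen)
print(checkpoints)
```

In this sketch the checkpoint is written after every record, so a restart never reprocesses anything. A real application would typically checkpoint less often and would then need to tolerate some reprocessing after a failure.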
  13. Sending & Reading data from Kinesis Streams
      Sending
      •  HTTP Post
      •  AWS SDK
      •  LOG4J
      •  Flume
      •  Fluentd
      Reading
      •  Get* APIs
      •  Kinesis Client Library + Connector Library
      •  Apache Storm
      •  Amazon Elastic MapReduce
  14. New Internal Metering Service
      Workload
      •  Tens of millions of records/sec
      •  Multiple TB per hour
      •  100,000s of sources
      New features
      •  Scale with the business
      •  Provide real-time alerting
      •  Inexpensive
      •  Improved auditing
      Diagram labels: Clients Submitting Data; Capture Submissions; Process in Realtime; Store in Redshift
  15. New Internal AWS Data Warehouse
      Workload
      •  Daily load of billions of records from millions of files from hundreds of sources
      •  3 hour SLA to load and audit data
      •  Hundreds of customers
      •  Hundreds of queries per hour
      New features
      •  Our data is fresh, we ingest every 6 hours
      •  Ingesting new data sets for the business
      •  “Hammerstone” ETL solution
      •  Built on AWS Data Pipeline
      •  Build business specific marts
      •  Build workload specific clusters
      •  Now processing triple the volume in less than 25% of the time
      •  Supports a variety of analytics tools: Tableau, R, Toad, SQL Developer, etc.
      Diagram labels: Over 200 internal data sources; Data staged in Amazon S3; “Hammerstone”: Custom ETL using AWS Data Pipeline; Data processing Redshift cluster; Batch reporting Redshift cluster; Ad hoc query Redshift cluster
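The claim "triple the volume in less than 25% of the time" implies a throughput gain that a short calculation makes explicit. The function name is an illustrative assumption; the two input figures come from the slide.

```python
def throughput_speedup(volume_factor, time_fraction):
    """Ratio of new to old throughput, where throughput = volume / time."""
    return volume_factor / time_fraction


# Three times the volume in (at most) a quarter of the time.
speedup = throughput_speedup(volume_factor=3, time_fraction=0.25)
print(f"at least {speedup:.0f}x the old throughput")
```

Since the slide says "less than 25%", 12x is a lower bound on the improvement rather than an exact figure.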
  16. Common use cases
      •  “Our services and applications emit more than 1.5 million events per second during peak hours, or around 80 billion events per day. The events could be log messages, user activity records, system operational data, or any arbitrary data that our systems need to collect for business, product, and operational analysis.”
      •  “As the number of clients that utilize targeted advertising grows, access to on-demand compute and storage resources becomes a requirement.”
      •  Customers use Market Replay on the trade support desk to validate client questions; compliance officers use it to validate execution requirements and rate National Market System (NMS) compliance; and traders and brokers use it to look at certain points in time to view missed opportunities or, potentially, unforeseen events
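The two figures in the first quote (1.5 million events per second at peak, about 80 billion events per day) can be compared with a quick calculation. Both numbers come from the quote and are treated as exact here, although the quote gives them as approximate; the variable names are illustrative.

```python
PEAK_EVENTS_PER_SEC = 1_500_000
EVENTS_PER_DAY = 80_000_000_000
SECONDS_PER_DAY = 86_400

# Average rate implied by the daily total.
average_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY

# How much higher the peak is than the daily average.
peak_to_average = PEAK_EVENTS_PER_SEC / average_events_per_sec

print(f"average: {average_events_per_sec:,.0f} events/sec")
print(f"peak is {peak_to_average:.2f}x the average")
```

The gap between average and peak is the kind of variation that makes adding and removing shards useful for a workload like this.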
  17. Bizo: Digital Ad. Tech Metering with Amazon Kinesis
      Diagram labels: Continuous Ad Metrics Extraction; Incremental Ad. Statistics Computation; Metering Record Archive; Ad Analytics Dashboard
  18. Supercell: Gaming Analytics with Amazon Kinesis
      Diagram labels: Real-time Clickstream Processing App; Aggregate Statistics; Clickstream Archive; In-Game Engagement Trends Dashboard
  19. Thank You