(BDT307) Zero Infrastructure, Real-Time Data Collection, and Analytics

Any fast-growing organization needs a way to manage the ever-increasing volume of data being generated across the globe and the need for real-time analysis. In this session, we walk through a real-life architecture and demonstration of how to leverage Amazon Kinesis, AWS Lambda, Amazon S3, and Amazon Redshift/Aurora for near real-time access to data being collected around the globe. We dive deep into performance, cost, and system resiliency and give you practical tools you can use today to manage your own global data ingestion pipeline and produce quality analytics in real-time without building infrastructure.

Code used for the demo in this session is available for download here: http://abrstevepermalink.s3.amazonaws.com/Demo.zip

  1. 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Steve Abraham, Solutions Architect - AWS Brian Filppu, Director of Business Intelligence - Zillow October 2015 BDT307 Zero Infrastructure, Real-Time Data Collection, and Analytics
  2. 2. Who am I? • Steve Abraham • Solutions Architect – AWS • Previous life • T-Mobile • U.S. State Department • Hasbro • Software company
  3. 3. What we’ll cover • Data ingestion pipeline • Collect 1,000,000,000 data points per month • Varied clients • Near real-time access to data • High performance / high availability • Low cost / low maintenance • Case study – Zillow • Brian Filppu – Director of Business Intelligence
  4. 4. End State: Amazon Redshift
  5. 5. End State: Amazon Aurora
  6. 6. Amazon API Gateway
  7. 7. Amazon API Gateway
  8. 8. Amazon API Gateway • Create REST-based endpoints • Fully-managed • Scales automatically • Enables rapid development • Flexible security controls
  9. 9. Amazon API Gateway • Integration types • Lambda • Proxy AWS service • Proxy existing service • Mock
  10. 10. Amazon API Gateway • Deploy to stages • Cross-origin resource sharing (CORS) support • Automatically generates SDK • Android • iOS • JavaScript
  11. 11. Amazon API Gateway • $3.50 per 1,000,000 calls • Data transfer in - Free • Data transfer out - $0.05 -> $0.09 per GB • 1,000,000,000 calls • $3,500.00 – Gateway • $0.00 – Data transfer out • Total price - $3,500.00
  12. 12. AWS Lambda
  13. 13. AWS Lambda
  14. 14. AWS Lambda • Fully-managed serverless compute • Event-driven • Platform • Amazon Linux • Node.js / Java • Configure memory / CPU • Timeout
  15. 15. AWS Lambda – Direct Invocation Model • Respond to invocation • Services • Amazon API Gateway • Custom code
  16. 16. AWS Lambda – Pull Model • Polls the event source • Services • Amazon Kinesis • Amazon DynamoDB Streams
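A minimal sketch of what a function sees in the pull model: Lambda polls the stream and passes the function a batch of records whose payloads are base64-encoded under `kinesis.data`. The handler name and the assumption that every payload is a UTF-8 JSON document are illustrative and are not taken from the session's demo code. Python is used for brevity; the slides list Node.js and Java as the supported runtimes.

```python
import base64
import json


def handler(event, context):
    """Decode every Kinesis record in the batch that Lambda delivers."""
    decoded = []
    for record in event.get("Records", []):
        # Kinesis payloads arrive base64-encoded under record["kinesis"]["data"].
        raw = base64.b64decode(record["kinesis"]["data"])
        # Assumption: producers send UTF-8 encoded JSON documents.
        decoded.append(json.loads(raw.decode("utf-8")))
    return decoded
```

A real function would write the decoded records to a downstream store instead of returning them; returning the list simply keeps the sketch easy to test locally.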
  17. 17. AWS Lambda – Push Model • Respond to a specific event • Services • Amazon S3 • Amazon SNS • Amazon Cognito • Amazon Echo
  18. 18. AWS Lambda & Amazon API Gateway • Amazon API Gateway / AWS Lambda • Fast & easy to deploy • Automatic scaling • 100% utilization • 100% managed • Amazon EC2 • Existing infrastructure • High utilization (> 90%)
  19. 19. AWS Lambda • $0.20 per 1,000,000 requests • First 1,000,000 requests / month – Free • 1,000,000,000 executions -> $199.80 • $0.00001667 per GB-second • 400,000 GB-seconds – Free • 1,000,000,000 executions • 0.5 seconds / 128 MB -> $1,035.21 • Total price -> $1,235.01 • Proxy price -> $0.00
  20. 20. Amazon Kinesis
  21. 21. Amazon Kinesis
  22. 22. Amazon Kinesis • Fully-managed data aggregator • Terabytes of data per hour • Stream • Replicated across 3 facilities • 24-hour retention • Shard • 1 MB (1,000 PUT) / second – writes • 2 MB (5 operations) / second – reads • One thread
  23. 23. Amazon Kinesis
  24. 24. Amazon Kinesis
  25. 25. Amazon Kinesis Shard Management • Split shard • Add capacity to stream • Merge shard • Reduce cost • Amazon Kinesis scaling utility • Allows for scaling automatically • https://github.com/awslabs/amazon-kinesis-scaling-utils
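Splitting a shard through the Kinesis API requires a new starting hash key for the second child shard. The sketch below computes the midpoint of the parent shard's hash key range so the split is even. The function name `split_point` is hypothetical; the inputs are the decimal strings that a stream description reports for a shard's starting and ending hash keys.

```python
# Kinesis hashes partition keys with MD5 into a 128-bit integer space.
MAX_HASH_KEY = 2 ** 128 - 1


def split_point(starting_hash_key: str, ending_hash_key: str) -> str:
    """Return the starting hash key for the second child of an even split."""
    start = int(starting_hash_key)
    end = int(ending_hash_key)
    if end <= start:
        raise ValueError("hash key range is too small to split")
    # First child covers [start, mid - 1]; second child covers [mid, end].
    mid = start + (end - start + 1) // 2
    return str(mid)


# A stream with a single shard covers the whole key space.
print(split_point("0", str(MAX_HASH_KEY)))
```

An even split is only one choice; if a few partition keys are hot, a split point closer to those keys may balance load better.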
  26. 26. Amazon Kinesis • Amazon API Gateway • REST interface / proxy • Most expensive • Direct to Amazon Kinesis • Amazon Kinesis API • Least expensive
  27. 27. Amazon Kinesis • $0.015 per shard hour / $11.16 per month • 1,000,000,000 / 31 / 86,400 = 373 avg. requests/second • 3 shards * $11.16 = $33.48 • $0.014 per 1,000,000 PUT payloads (25 KB) • 1,000,000,000 / 1,000,000 * $0.014 = $14.00 • Total cost -> $47.48
  28. 28. Amazon S3 & Amazon SQS
  29. 29. Amazon S3 & Amazon SQS
  30. 30. Amazon Simple Storage Service • Secure • Encryption in flight - HTTPS • Encryption at rest (Amazon S3 key, client key, AWS KMS) • Durable • Designed for 11 9’s of durability • Scalable • Millions of requests per second • Trillions of objects
  31. 31. AWS Key Management Service • Manage encryption keys • Encrypt / decrypt data directly • Directly Integrates with • Amazon S3 • Amazon RDS • Amazon Redshift • AWS Lambda integration • Access via API
  32. 32. Amazon Simple Storage Service • Key name distribution • Random values • Lifecycle policy • Delete objects • Move objects to Amazon Glacier • Amazon Glacier • Infrequently accessed data (cold storage) • Low-cost starting at $0.007 per GB • Secure / durable
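One way to get the key name distribution the slide recommends is to put a short, evenly distributed prefix in front of each object key. The sketch below derives the prefix from an MD5 hash of the key rather than from a truly random value, so the full key can be recomputed later; that substitution, the function name, and the four-character prefix length are all assumptions for illustration.

```python
import hashlib


def distributed_key(natural_key: str, prefix_len: int = 4) -> str:
    """Prefix an object key with hex characters derived from its own hash."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    # Sequential names such as date-based paths now spread across prefixes.
    return f"{digest[:prefix_len]}/{natural_key}"


print(distributed_key("2015/10/08/batch-000001.json.gz"))
```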
  33. 33. Amazon Simple Queue Service • Simple • Easy to set up • Secure • Encryption in flight - HTTPS • Durable • Multiple servers / data centers • Scalable • Automatically scales
  34. 34. Amazon S3 Pricing • $0.0275 - $0.0408 per GB • Tiered pricing • Varies by region • $0.005 - $0.007 per 1,000 PUT requests • Varies by region • $0.004 - $0.0056 per 10,000 GET requests • Varies by region • Total cost -> $3.87
  35. 35. Amazon SQS Pricing • $0.50 per 1,000,000 requests • First 1,000,000 requests free • Total cost -> $0.00
  36. 36. Amazon Redshift
  37. 37. Amazon Redshift
  38. 38. Amazon Redshift • Fully-managed, petabyte scale data warehouse • Fast • Columnar storage / data compression • Scalable • Scale up or down • Fault tolerant • Data replicated across nodes / Backed up to Amazon S3 • Familiar • Connect via ODBC / JDBC
  39. 39. Amazon Redshift ODBC / JDBC Amazon Redshift cluster
  40. 40. Amazon Redshift • COPY command • Amazon Redshift parallelizes the load • Single transaction • Encrypt credentials using AWS KMS • Supports delimited, fixed width, JSON, AVRO • Supports GZIP & LZOP
  41. 41. Amazon Redshift • Micro-batch loading • Number of files = multiple of virtual cores • Define compression type for each column in table definition • Load data in sort key order • Use SSD node type (dc1 instance types)
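The "number of files = multiple of virtual cores" guideline can be sketched as a helper that splits a micro-batch into evenly sized chunks before each chunk is written to its own file for COPY. The function name and the `slices` and `multiple` parameters are hypothetical; `slices` stands for the cluster's total number of virtual cores.

```python
def split_for_copy(records, slices, multiple=1):
    """Split records into slices * multiple chunks of near-equal size."""
    n_files = slices * multiple
    if n_files <= 0:
        raise ValueError("slices and multiple must be positive")
    base, extra = divmod(len(records), n_files)
    chunks = []
    start = 0
    for i in range(n_files):
        # The first `extra` chunks take one additional record each.
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


chunks = split_for_copy(list(range(10)), slices=4)
print([len(c) for c in chunks])
```

Keeping the chunks close to the same size lets each core finish its share of the load at about the same time.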
  42. 42. Amazon Redshift • Infinite loop • Create 1 Amazon Kinesis stream with 1 shard • Attach Lambda function to Amazon Kinesis stream • Execute workload • Put record into stream • Create multiple shards for multiple threads
  43. 43. Amazon Redshift
  44. 44. Amazon Redshift • Spin up / spin down • 2 TB data warehouse • On Demand - $632.40 / month • 1 Year No Upfront - $496.00 / month (20% savings) • 1 Year Partial - $2,500.00, $157 / month (41% savings) • Total cost -> $365.33
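The total on this slide matches the 1 Year Partial option once the upfront payment is spread over the 12-month term. A short sketch of that arithmetic, with a hypothetical helper name:

```python
def effective_monthly(upfront: float, monthly: float, term_months: int = 12) -> float:
    """Spread an upfront payment across the term and add the monthly fee."""
    return round(upfront / term_months + monthly, 2)


# 1 Year Partial figures from the slide: $2,500.00 upfront plus $157 per month.
print(effective_monthly(2500.00, 157.00))
```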
  45. 45. Amazon Aurora
  46. 46. Amazon Aurora
  47. 47. Amazon Aurora • Fully-managed relational database • MySQL 5.6 • Wire compatible • InnoDB storage engine • Up to five times better performance than MySQL • Over 500,000 SELECTs per second • 100,000 updates per second • Multi-AZ • Data replicated 6 ways across 3 zones
  48. 48. Amazon Aurora or Amazon Redshift? • Amazon Redshift • Data warehouse workload • Data > 64 TB • 50 concurrent queries • Amazon Aurora • OLTP workload • Data < 64 TB • 500,000 SELECT / 100,000 UPDATES per second
  49. 49. Amazon Aurora Pricing - Compute • db.r3.xlarge • On Demand - $431.52 / month • 1 Year No Upfront - $277.40 / month (34% savings) • 1 Year Partial - $1,250.00, $131.40 / month (45% savings) • Total compute cost -> $235.47
  50. 50. Amazon Aurora Pricing - Storage • Storage • $0.10 per GB/month • $0.20 per 1,000,000 I/O requests • 1,000,000,000 records • Compute - $235.47 • 93 GB - $9.30 • 2,000,000,000 / 1,000,000 * $0.20 = $400.00 • Total cost -> $644.77
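The Aurora total can be reproduced with a few lines of arithmetic. This is a sketch using only the figures on the slide; the function and constant names are hypothetical, and the compute figure is taken as given from the previous slide.

```python
PRICE_PER_GB_MONTH = 0.10
PRICE_PER_MILLION_IO = 0.20


def aurora_monthly_cost(compute, storage_gb, io_requests):
    """Add storage and I/O charges to a fixed monthly compute cost."""
    storage_cost = storage_gb * PRICE_PER_GB_MONTH
    io_cost = io_requests / 1_000_000 * PRICE_PER_MILLION_IO
    return compute + storage_cost + io_cost


total = aurora_monthly_cost(compute=235.47, storage_gb=93, io_requests=2_000_000_000)
print(f"${total:,.2f}")
```

At this volume the I/O charge dominates the bill, so the number of I/O requests per record is the figure worth measuring first.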
  51. 51. Zillow Case Study
  52. 52. Zillow • What is Zillow? • Zillow is the leading real estate and home-related information marketplace. Zillow is dedicated to empowering consumers with data, inspiration and knowledge around the place they call home. • Who am I? • Brian Filppu • Director, Business Intelligence, Zillow • I have been at Zillow close to 8 years • Previous life – Spent about 6 years consulting throughout North America
  53. 53. Zillow – Use Case • Needed to collect a subset of mobile app metrics • Solution needed to be delivered in under 3 weeks • Requirement to aggregate and report metrics back to business owners several times during the day • We already have a number of data warehouse processes in AWS, so we reached out to Steve, our AWS solutions architect, for assistance
  54. 54. Zillow – What Did We Create? • Custom URL endpoint in Amazon API Gateway • 16,000,000+ POSTs per day – to start • Data sent from API Gateway to Amazon Kinesis using AWS Lambda • Storing data encrypted with AWS KMS in Amazon S3 using Lambda • Analyze our data with Spark on Amazon EMR • Run Spark jobs throughout the day with AWS Data Pipeline • Have the ability to consume/analyze data in real time with Spark on Amazon EMR and Amazon Kinesis if the use case arises
  55. 55. Zillow – Architecture Diagram
  56. 56. Zillow – Data Collection Costs • Using 3 Amazon Kinesis shards costing around $1.30 a day, which includes hourly + put costs. • On AWS Lambda, we allocated 128 MB of memory per function call. Lambda runs for under $6 a day. • Lambda and Amazon Kinesis gave us a cost-effective solution for storing data with little development time.
  57. 57. Zillow – Data Analysis • Use Spark to perform ETL, clean up, and analysis throughout the day. ETL includes Parquet conversion, data partitioning, etc. • Use Presto on Amazon EMR for long-term querying/analysis of data. • Data is stored on Amazon S3. For all Amazon EMR jobs, we use Amazon S3 as our HDFS. • Currently running jobs 4+ times a day using AWS Data Pipeline, which launches Spark jobs.
  58. 58. Zillow – What Else Does My Team Run in AWS? • Use Amazon Redshift for fast access to data • Big users of Spark and Presto on Amazon EMR, which includes ETL and ad hoc querying; other use cases involve long-term historical data not kept in Amazon Redshift • Amazon SQS, AWS Data Pipeline, Amazon SNS, Amazon S3, AWS KMS, Amazon API Gateway, Amazon EC2
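The Kinesis figure on this slide can be checked against the rates on the Amazon Kinesis pricing slide. The sketch assumes that each of the 16,000,000 daily POSTs becomes exactly one PUT payload unit (25 KB or less); that mapping is an assumption and is not stated in the slides.

```python
PRICE_PER_SHARD_HOUR = 0.015
PRICE_PER_MILLION_PUT_UNITS = 0.014


def kinesis_daily_cost(shards, put_payload_units_per_day):
    """Daily cost: 24 shard hours per shard plus PUT payload units."""
    shard_cost = shards * PRICE_PER_SHARD_HOUR * 24
    put_cost = put_payload_units_per_day / 1_000_000 * PRICE_PER_MILLION_PUT_UNITS
    return shard_cost + put_cost


print(f"${kinesis_daily_cost(3, 16_000_000):.2f} per day")
```

Under that assumption the shard hours account for most of the daily cost, and the PUT charge adds roughly twenty cents.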
  59. 59. Zillow – We are Hiring • My team is hiring ETL data engineers and software developers • All open positions at Zillow can be found at http://www.zillow.com/jobs/
  60. 60. Demo
  61. 61. Recap
  62. 62. Related Sessions • BDT302 - Real-World Smart Applications with Amazon Machine Learning • BDT309 - Data Science & Best Practices for Apache Spark on Amazon EMR • BDT310 - Big Data Architectural Patterns and Best Practices on AWS
  63. 63. Remember to complete your evaluations!
  64. 64. Thank you!
  65. 65. Code used for the demo in this session is available for download here: http://abrstevepermalink.s3.amazonaws.com/Demo.zip
  66. 66. Amazon API Gateway Pricing • $3.50 per 1,000,000 calls • Data Transfer In - Free • Data Transfer Out • $0.09/GB for the first 10 TB • $0.085/GB for the next 40 TB • $0.07/GB for the next 100 TB • $0.05/GB for the next 350 TB • 1,000,000,000 calls / 1KB payload • $3,500.00 – Gateway • $85.83 – Data Transfer Out
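The tiered data transfer charge is easier to follow as code. The sketch below uses the rates on the slide and assumes that 1 TB means 1,024 GB and that the 1 KB payload is the data sent out per call; the function names are hypothetical.

```python
PRICE_PER_MILLION_CALLS = 3.50
# (tier size in GB, price per GB), in the order the tiers are consumed.
TRANSFER_OUT_TIERS = [
    (10 * 1024, 0.09),
    (40 * 1024, 0.085),
    (100 * 1024, 0.07),
    (350 * 1024, 0.05),
]


def call_cost(calls):
    return calls / 1_000_000 * PRICE_PER_MILLION_CALLS


def transfer_out_cost(gigabytes):
    """Walk the tiers, charging each tier's rate for the volume it absorbs."""
    remaining = gigabytes
    cost = 0.0
    for tier_size_gb, rate in TRANSFER_OUT_TIERS:
        used = min(remaining, tier_size_gb)
        cost += used * rate
        remaining -= used
        if remaining <= 0:
            return cost
    raise ValueError("volume exceeds the tiers listed on the slide")


calls = 1_000_000_000
payload_kb = 1
gigabytes_out = calls * payload_kb / 1024 ** 2  # KB to GB

print(f"Gateway: ${call_cost(calls):,.2f}")
print(f"Data transfer out: ${transfer_out_cost(gigabytes_out):,.2f}")
```

At roughly 954 GB the whole volume stays inside the first tier, which is why the slide's transfer figure is a single multiplication.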
  67. 67. AWS Lambda Pricing • $0.20 per 1,000,000 requests • First 1,000,000 requests / month – Free • 1,000,000,000 executions • (1,000,000,000 – 1,000,000) / 1,000,000 * $0.20 = $199.80 • $0.00001667 per GB-second • 400,000 GB-seconds – Free • 1,000,000,000 executions / 0.5 seconds / 128 MB • 1,000,000,000 * 0.5 * 128 / 1024 = 62,500,000 GB-seconds • 62,500,000 – 400,000 = 62,100,000 • 62,100,000 * $0.00001667 = $1,035.21
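The Lambda arithmetic on this slide has two parts, a per-request charge and a per-GB-second charge, each with a free tier. A sketch with hypothetical names, using only the rates shown on the slide:

```python
PRICE_PER_MILLION_REQUESTS = 0.20
FREE_REQUESTS = 1_000_000
PRICE_PER_GB_SECOND = 0.00001667
FREE_GB_SECONDS = 400_000


def lambda_monthly_cost(executions, seconds_per_execution, memory_mb):
    """Return (request cost, compute cost) after the free tiers."""
    billable_requests = max(executions - FREE_REQUESTS, 0)
    request_cost = billable_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    # GB-seconds = executions * duration * memory expressed in GB.
    gb_seconds = executions * seconds_per_execution * memory_mb / 1024
    billable_gb_seconds = max(gb_seconds - FREE_GB_SECONDS, 0)
    compute_cost = billable_gb_seconds * PRICE_PER_GB_SECOND
    return request_cost, compute_cost


requests_cost, compute_cost = lambda_monthly_cost(1_000_000_000, 0.5, 128)
print(f"Requests: ${requests_cost:,.2f}")
print(f"Compute:  ${compute_cost:,.2f}")
print(f"Total:    ${requests_cost + compute_cost:,.2f}")
```

Because the compute charge scales with both duration and memory, halving either one roughly halves the larger part of the bill.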
  68. 68. Amazon Kinesis Pricing • $0.015 per shard hour / $11.16 per month • 1,000,000,000 / 31 / 86,400 = 373 avg. requests/second • 3 shards * $11.16 = $33.48 • $0.014 per 1,000,000 PUT payloads (25 KB) • 1,000,000,000 / 1,000,000 * $0.014 = $14.00
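A sketch of the Kinesis calculation for a 31-day month, using the slide's rates and its choice of 3 shards. The function names are hypothetical, and the sketch treats every request as one PUT payload unit, as the slide's arithmetic does.

```python
PRICE_PER_SHARD_HOUR = 0.015
PRICE_PER_MILLION_PUT_UNITS = 0.014


def average_requests_per_second(requests, days=31):
    return requests / days / 86_400


def kinesis_monthly_cost(shards, put_payload_units, days=31):
    """Shard hours for the month plus the PUT payload unit charge."""
    shard_cost = shards * PRICE_PER_SHARD_HOUR * 24 * days
    put_cost = put_payload_units / 1_000_000 * PRICE_PER_MILLION_PUT_UNITS
    return shard_cost + put_cost


print(f"{average_requests_per_second(1_000_000_000):.0f} avg. requests/second")
print(f"${kinesis_monthly_cost(3, 1_000_000_000):,.2f} per month")
```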
  69. 69. Amazon S3 Pricing • $0.03 per GB (1st TB) • 1,000,000,000 * 100 bytes = 93.13 GB = $2.79 • $0.005 per 1,000 PUT requests • 1,000,000,000 / 5,000 records / 1,000 * $0.005 = $1.00 • $0.004 per 10,000 GET requests • 1,000,000,000 / 5,000 records / 10,000 * $0.004 = $0.08
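The S3 figures follow from batching 5,000 records of 100 bytes each into every object. A sketch of that calculation with hypothetical names; it assumes 1 GB = 1,024^3 bytes, which is what reproduces the slide's 93.13 GB.

```python
PRICE_PER_GB = 0.03
PRICE_PER_1000_PUTS = 0.005
PRICE_PER_10000_GETS = 0.004


def s3_monthly_cost(records, bytes_per_record, records_per_object):
    """Return (storage, PUT, GET) costs when each object is read once."""
    storage_gb = records * bytes_per_record / 1024 ** 3
    objects = records / records_per_object
    storage_cost = storage_gb * PRICE_PER_GB
    put_cost = objects / 1_000 * PRICE_PER_1000_PUTS
    get_cost = objects / 10_000 * PRICE_PER_10000_GETS
    return storage_cost, put_cost, get_cost


storage, puts, gets = s3_monthly_cost(1_000_000_000, 100, 5_000)
print(f"Storage ${storage:.2f}, PUT ${puts:.2f}, GET ${gets:.2f}")
print(f"Total ${storage + puts + gets:.2f}")
```

Batching is what keeps the request charges small: writing one object per record would multiply the PUT cost by 5,000.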
  70. 70. Amazon SQS Pricing • $0.50 per 1,000,000 requests • First 1,000,000 requests free • 1,000,000,000 / 5,000 records = 200,000 messages • SendMessage -> 200,000 • ReceiveMessage -> 20,000 • DeleteMessageBatch -> 20,000 • Total -> 240,000 = $0.00
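The SQS request count stays inside the free tier because messages are received and deleted in batches. The sketch below assumes batches of 10, which is inferred from the slide's 200,000 to 20,000 ratio; the function name is hypothetical.

```python
import math

PRICE_PER_MILLION_REQUESTS = 0.50
FREE_REQUESTS = 1_000_000


def sqs_monthly_cost(messages, batch_size=10):
    """Return (total requests, cost) for send, receive, and batch delete."""
    send_requests = messages
    receive_requests = math.ceil(messages / batch_size)
    delete_requests = math.ceil(messages / batch_size)
    total = send_requests + receive_requests + delete_requests
    billable = max(total - FREE_REQUESTS, 0)
    return total, billable / 1_000_000 * PRICE_PER_MILLION_REQUESTS


messages = 1_000_000_000 // 5_000  # one message per 5,000-record object
total_requests, cost = sqs_monthly_cost(messages)
print(f"{total_requests:,} requests -> ${cost:.2f}")
```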