
Analytics on AWS - IP Expo 2013



Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Redshift, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.

Published in: Technology


  1. 1. Analytics on AWS IP Expo 2013
  2. 2. Big Data: When innovation is required to collect, store, analyze, and manage your data
  4. 4. Customer Needs • Store Any Amount of Data – Without Capacity Planning • Perform Complex Analysis on Any Data – Scale on Demand • Store Data Securely • Decrease Time to Market – Build Environments Quickly • Reduce Costs – Reduce Capital Expenditure • Enable Global Reach
  5. 5. Ingestion | Integration
  6. 6. Elastic Block Store: High performance block storage device • 1GB to 1TB in size • Mount as drives to instances with snapshot/cloning functionalities • Simple Storage Service: Highly scalable object storage for the internet • 1 byte to 5TB in size • 99.999999999% durability • Is a web store, not a file system • No single points of failure • Eventually consistent • Paradigm: Object store • Performance: Very fast • Redundancy: Across Availability Zones • Security: Public key / private key • Pricing: $0.095/GB/month (DUB) • Typical use case: Write once, read many • Limits: 100 buckets, unlimited storage, 5TB objects • Availability: 99.99% • Durability: 99.999999999%
  7. 7. Objects in S3 (billions) • Q4 2007: 14 • Q4 2008: 40 • Q4 2009: 102 • Q4 2010: 262 • Q4 2011: 762 • Q4 2012: 1300 • Today: 2100 • Peak requests: 1.2 million / second
  8. 8. Performance & Scalability • Amazon S3 provides near-linear scalability • S3 streaming performance (GB/second against reader connections): 100 VMs, 9.6GB/s, $26/hr; 350 VMs, 28.7GB/s, $90/hr • 34 secs per terabyte
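The "near linear" claim on this slide can be checked against the two quoted measurements. The following is a minimal sketch, not part of the original deck: the function names are illustrative, and one terabyte is assumed to mean 1000 GB.

```python
# Figures quoted on the slide: (reader VMs, aggregate GB/s, cost in $/hr).
SMALL_RUN = (100, 9.6, 26.0)
LARGE_RUN = (350, 28.7, 90.0)


def per_vm_throughput(vms, gb_per_s):
    """Average streaming throughput of one reader VM, in GB/s."""
    return gb_per_s / vms


def seconds_per_terabyte(gb_per_s, gb_per_tb=1000):
    """Time to stream one terabyte (assumed here to be 1000 GB)."""
    return gb_per_tb / gb_per_s


def scaling_efficiency(small, large):
    """Per-VM throughput of the large run relative to the small run.

    A value of 1.0 would mean perfectly linear scaling.
    """
    return per_vm_throughput(large[0], large[1]) / per_vm_throughput(small[0], small[1])


print(f"per-VM GB/s at 100 VMs: {per_vm_throughput(*SMALL_RUN[:2]):.3f}")
print(f"per-VM GB/s at 350 VMs: {per_vm_throughput(*LARGE_RUN[:2]):.3f}")
print(f"scaling efficiency:     {scaling_efficiency(SMALL_RUN, LARGE_RUN):.2f}")
print(f"seconds per TB at 350:  {seconds_per_terabyte(LARGE_RUN[1]):.1f}")
```

Under these assumptions each VM keeps roughly 85% of its throughput when the fleet grows 3.5 times, and the 350-VM run streams a terabyte in about 35 seconds, close to the 34 seconds quoted.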
  9. 9. Spotify uses Amazon S3 for Music Storage • "Amazon S3 gives us confidence in our ability to expand storage quickly while also providing high data durability" - Emil Fredriksson, Operations Director for Spotify • Spotify is an online music service offering instant access to over 16 million licensed songs • Over 15 million active users and 4 million paying subscribers • Spotify adds over 20,000 tracks a day to its catalogue
  10. 10. Elastic Block Store: High performance block storage device • 1GB to 1TB in size • Mount as drives to instances with snapshot/cloning functionalities • Amazon Glacier: Long term object archive • Extremely low cost per gigabyte • 99.999999999% durability • Designed for archival • Not a file system • Vaults & archives • 3-5 hour retrieval time • Paradigm: Archive store • Performance: Configurable - low • Redundancy: Across Availability Zones • Security: Public key / private key • Pricing: $0.011/GB/month • Typical use case: Write once, read infrequently (< 10% / month) • Durability: 99.999999999%
  11. 11. Storage Lifecycle Integration • Simple Storage Service: Highly scalable object storage; 1 byte to 5TB in size; 99.999999999% durability • Glacier: Long term object archive; extremely low cost per gigabyte; 99.999999999% durability
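To see why moving data from S3 to Glacier matters, the per-gigabyte prices quoted on the S3 and Glacier slides ($0.095 and $0.011 per GB per month) can be combined in a small calculation. This is a rough sketch only: the function name and the `archived_fraction` parameter are illustrative, and retrieval, request, and transfer charges are deliberately ignored.

```python
# Prices as quoted on the storage slides, in $ per GB per month.
S3_PRICE_PER_GB = 0.095
GLACIER_PRICE_PER_GB = 0.011


def monthly_storage_cost(total_gb, archived_fraction):
    """Monthly storage cost when a fraction of the data lives in Glacier.

    Only storage is counted; retrieval and request fees are out of scope.
    """
    if not 0.0 <= archived_fraction <= 1.0:
        raise ValueError("archived_fraction must be between 0 and 1")
    archived_gb = total_gb * archived_fraction
    active_gb = total_gb - archived_gb
    return active_gb * S3_PRICE_PER_GB + archived_gb * GLACIER_PRICE_PER_GB


for fraction in (0.0, 0.5, 0.8, 1.0):
    cost = monthly_storage_cost(1000, fraction)
    print(f"{fraction:.0%} archived: ${cost:.2f} per month for 1000 GB")
```

With these prices, archiving 80% of a 1000 GB data set brings the monthly storage bill from about $95 down to about $28.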
  12. 12. NoSQL Data Capture • DynamoDB: Provisioned throughput NoSQL database • Fast, predictable, configurable performance • Fully distributed, fault tolerant HA architecture • Integration with EMR & Hive • (AWS stack diagram: Deployment & Administration, App Services, Compute, Storage, Database, Networking, AWS Global Infrastructure; database services shown: RDS, DynamoDB, Redshift)
  13. 13. DynamoDB Consistency • Writes: acknowledged (committed) once they exist in at least two physical data centers; persisted to SSD • Reads: no reduction in durability or consistency in order to achieve throughput; tunable for application requirements • Eventually Consistent Read: stale value reads possible; highest throughput • Strongly Consistent Read: no stale values read; lower potential throughput
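The difference between the two read modes on this slide can be illustrated with a toy in-memory model. This is not how DynamoDB is implemented; the class, its three replicas, and the explicit `propagate` step are all assumptions made purely to show why an eventually consistent read may return a stale value while a strongly consistent read does not.

```python
class ToyReplicatedStore:
    """Toy model: a write is acknowledged once two of three replicas hold it."""

    def __init__(self, replica_count=3, ack_after=2):
        self.replicas = [dict() for _ in range(replica_count)]
        self.ack_after = ack_after
        self.pending = []  # (replica_index, key, value) updates not yet applied

    def write(self, key, value):
        """Apply to the first `ack_after` replicas, queue the rest, then acknowledge."""
        for index, replica in enumerate(self.replicas):
            if index < self.ack_after:
                replica[key] = value
            else:
                self.pending.append((index, key, value))
        return True

    def propagate(self):
        """Deliver queued updates, in order, to the lagging replicas."""
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()

    def read(self, key, consistent=False):
        """Strong reads use a replica that has every acknowledged write.

        Eventually consistent reads are served by the lagging replica here,
        so they can return an older value until `propagate` has run.
        """
        source = self.replicas[0] if consistent else self.replicas[-1]
        return source.get(key)


store = ToyReplicatedStore()
store.write("track", "v1")
store.propagate()
store.write("track", "v2")
print(store.read("track"))                   # stale read is possible
print(store.read("track", consistent=True))  # always the acknowledged value
```

In this model the eventually consistent read catches up as soon as `propagate` runs, which mirrors the trade-off described on the slide: cheaper reads in exchange for a window in which stale values may be seen.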
  14. 14. Shazam scaled DynamoDB to 500,000 IOPS for a Super Bowl Ad • "AWS gave us the ability to bring a massive amount of capacity online in a short period of time" - Jason Titus, Shazam CTO • Shazam connects more than 200 million people, in more than 200 countries and 33 languages, to the music, TV shows and brands they love • When customers hear a song or see a TV program or ad they like, they simply activate the app to “tag” it • Shazam realized it could support over 500,000 writes per second with DynamoDB • Also using Amazon EMR for large-scale data analysis that can require more than 1 million writes per second
  15. 15. Complex Data Analysis … Parallel ETL
  16. 16. Application Services • Elastic MapReduce: Managed, elastic Hadoop cluster • Integrates with S3 & DynamoDB • Automated installation of Hive & Pig • Support for Spot Instances • Integrated HBase NoSQL database • (AWS stack diagram)
  17. 17. EMR Data Sources
  18. 18. Reducing Costs with Spot Instances • Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption • Scenario #1 job flow, cost without Spot: 4 instances * 14 hrs * $0.50 = $28; duration: 14 hours • Scenario #2 job flow, cost with Spot: 4 instances * 7 hrs * $0.50 = $14, plus 5 instances * 7 hrs * $0.25 = $8.75; total = $22.75; duration: 7 hours • Time savings: 50% • Cost savings: ~20% • Other EMR + Spot use cases: run entire cluster on Spot for biggest cost savings; reduce the cost of application testing
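The arithmetic behind the two scenarios can be reproduced in a few lines. This is a minimal sketch using only the instance counts, durations, and hourly prices quoted on the slide; the `job_cost` helper is an illustrative name, not an AWS API.

```python
def job_cost(instance_groups):
    """Total cost of a job flow.

    `instance_groups` is a list of (instance_count, hours, hourly_price) tuples.
    """
    return sum(count * hours * price for count, hours, price in instance_groups)


# Scenario #1: four On-Demand instances for 14 hours.
on_demand_only = job_cost([(4, 14, 0.50)])

# Scenario #2: the same four On-Demand instances plus five Spot instances,
# finishing in 7 hours.
with_spot = job_cost([(4, 7, 0.50), (5, 7, 0.25)])

cost_savings = 1 - with_spot / on_demand_only
time_savings = 1 - 7 / 14

print(f"without Spot: ${on_demand_only:.2f}")
print(f"with Spot:    ${with_spot:.2f}")
print(f"cost savings: {cost_savings:.2%}")
print(f"time savings: {time_savings:.0%}")
```

The exact cost saving works out to 18.75%, which the slide rounds to roughly 20%.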
  19. 19. Compute • Vertical Scaling • From $0.02/hr • Elastic Compute Cloud (EC2): Basic unit of compute capacity • Range of CPU, memory & local disk options • 13 instance types available, from micro to cluster compute • Feature / Details • Flexible: Run Windows or Linux distributions • Scalable: Wide range of instance types from micro to cluster compute • Machine Images: Configurations can be saved as machine images (AMIs) from which new instances can be created • Full control: Full root or administrator rights • Secure: Full firewall control via Security Groups • Monitoring: Publishes metrics to CloudWatch • Inexpensive: On-demand, Reserved and Spot instance types • VM Import/Export: Import and export VM images to transfer configurations in and out of EC2 • (AWS stack diagram)
  20. 20. Cluster Compute • 1: EC2 instance • 2nd generation cluster compute instance • Cluster Compute instances implement HVM process execution • Intel® Xeon® E5-2670 processors • 10 Gigabit Ethernet • 80 EC2 Compute Units • 60GB RAM • 3TB local disk
  21. 21. Cluster Compute • 2: Network placement groups • Cluster instances deployed in a ‘Placement Group’ enjoy low latency, full bisection 10 Gbps bandwidth
  22. 22. CC2 Instance Cluster • 240 TFLOPS • Making it the 72nd fastest supercomputer in the world (#42 when announced at SC’11) • (Test performed Nov 2011, benchmark published June 2012)
  23. 23. Cluster GPU • 1: EC2 instance • GPU compute instances: Intel® Xeon® X5570 processors • 2 x NVIDIA Tesla “Fermi” M2050 GPUs • I/O performance: Very High (10 Gigabit Ethernet) • 33.5 EC2 Compute Units • 20GB RAM • 2x NVIDIA GPU @ >400 cores each
  24. 24. S&P Capital IQ Uses AWS for Big Data Processing • Provides data to 4200+ top global investment firms • Launched Hadoop faster, learned Hadoop faster • (Diagram labels: S3, Hadoop Cluster)
  25. 25. Structured Data Management
  26. 26. Structured Data Analysis • Relational Database Service: Managed Oracle, MySQL & SQL Server • DynamoDB: Managed NoSQL database • Amazon Redshift: Massively parallel, petabyte scale data warehouse • (AWS stack diagram)
  27. 27. Structured Data Analysis • Relational Database Service: Database-as-a-Service • No need to install or manage database instances • Scalable and fault tolerant configurations • Integration with Data Pipeline • (AWS stack diagram)
  28. 28. Structured Data Analysis • Redshift: Managed, massively parallel, petabyte scale data warehouse • Streaming backup/restore to S3 • Extensive security • 2 TB -> 1.6 PB • (AWS stack diagram)
  29. 29. Redshift parallelizes and distributes everything • Query • Load • Backup • Restore • Resize • (Diagram: common BI tools, JDBC/ODBC, leader node, 10GigE mesh, three compute nodes)
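The leader node and compute node layout on this slide follows a scatter and gather pattern, which can be sketched in plain Python. This is a toy illustration under stated assumptions, not Redshift's engine: rows are spread round-robin, threads stand in for compute nodes, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, node_count):
    """Spread rows round-robin into one slice per compute node."""
    return [rows[i::node_count] for i in range(node_count)]


def partial_sum_and_count(rows):
    """Work done independently on each compute node."""
    return sum(rows), len(rows)


def leader_average(rows, node_count=3):
    """Leader: scatter slices, gather partial results, combine into an average."""
    slices = distribute(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(partial_sum_and_count, slices))
    total = sum(partial_total for partial_total, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count if count else None


print(leader_average([1, 2, 3, 4, 5, 6]))
```

Each node only ever sees its own slice, and the leader combines small partial results rather than raw rows, which is the property that lets this style of aggregate scale with the number of nodes.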
  30. 30. Redshift lets you start small and grow big • Extra Large Node (XL): 3 spindles, 2TB, 15GiB RAM, 2 virtual cores, 10GigE; single node (2TB) or cluster of 2-32 nodes (4TB to 64TB) • 8 Extra Large Node (8XL): 24 spindles, 16TB, 120GiB RAM, 16 virtual cores, 10GigE; cluster of 2-100 nodes (32TB to 1.6PB)
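The node sizes and cluster limits on this slide are enough to estimate how many nodes a given data set needs. The sketch below is a back-of-the-envelope helper, not an AWS sizing tool: it uses only the raw per-node capacities quoted above and ignores compression, replication, and free-space headroom. The `node_options` name and the dictionary layout are illustrative.

```python
import math

# Per-node storage and cluster limits as quoted on the slide.
NODE_TYPES = {
    "XL": {"tb_per_node": 2, "min_cluster": 2, "max_cluster": 32, "single_node": True},
    "8XL": {"tb_per_node": 16, "min_cluster": 2, "max_cluster": 100, "single_node": False},
}


def node_options(data_tb):
    """Return (node_type, node_count, raw_capacity_tb) for each type that fits."""
    if data_tb <= 0:
        raise ValueError("data_tb must be positive")
    options = []
    for name, spec in NODE_TYPES.items():
        nodes = math.ceil(data_tb / spec["tb_per_node"])
        if nodes == 1 and spec["single_node"]:
            pass  # a single-node cluster is allowed for this type
        else:
            nodes = max(nodes, spec["min_cluster"])
        if nodes <= spec["max_cluster"]:
            options.append((name, nodes, nodes * spec["tb_per_node"]))
    return options


for size_tb in (1, 50, 100, 1600):
    print(size_tb, "TB ->", node_options(size_tb))
```

For example, 50 TB of raw data fits on 25 XL nodes or 4 8XL nodes, while 100 TB exceeds the 32-node XL limit and needs 7 8XL nodes.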
  31. 31. Important Redshift Features • No Downtime Resize • Streaming Backup/Restore to S3 • Automated Point In Time Snapshotting • Workload Management • Support for VPC • Support for Encrypted Data Loads • Cluster SSL Only Communications
  32. 32. Application Services • Data Pipeline • Input Datanode: This could be an S3 bucket, RDS table, EMR Hive table, etc. • Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule. • Output Datanode: This supports all the same data sources as the input datanode, but they don’t have to be the same type. • Automatically provision EC2 & EMR resources • Manage dependencies & scheduling • Automatically retry and notify of success & failure • (AWS stack diagram)
  33. 33. Sample Use Case • Input: RDS Table; Table: User-Demographics; SQL Precondition: “Select last_update from table” > #{YY-MM-DD} • Input: DynamoDB Table; Table: User-Event-Data-#{year-month} • Activity: EMR Transform; Hive Query: user-metrics.hql; Frequency: Daily • Output: S3 file; Path: s3://trend-data/#{year-month-day}.csv • Success Notification: • Failure Notification: • Delay Notification:
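The date placeholders in this sample can be made concrete with a small sketch that resolves them for one scheduled run. This is a hypothetical illustration, not the AWS Data Pipeline definition format or expression syntax: the rendered date formats, the `expand` and `daily_run` helpers, and the dictionary layout are all assumptions, and only the table names, Hive script name, and S3 path come from the slide.

```python
from datetime import date

# Assumed rendering of each placeholder; the slide does not define the format.
PLACEHOLDER_FORMATS = {
    "#{year-month-day}": "%Y-%m-%d",
    "#{year-month}": "%Y-%m",
}


def expand(template, run_date):
    """Replace the slide's date placeholders with values for `run_date`."""
    for placeholder, date_format in PLACEHOLDER_FORMATS.items():
        template = template.replace(placeholder, run_date.strftime(date_format))
    return template


def daily_run(run_date):
    """Describe one daily run: two inputs, one EMR activity, one S3 output."""
    return {
        "inputs": [
            {"type": "RDS table", "table": "User-Demographics"},
            {
                "type": "DynamoDB table",
                "table": expand("User-Event-Data-#{year-month}", run_date),
            },
        ],
        "activity": {"type": "EMR transform", "hive_query": "user-metrics.hql"},
        "output": {
            "type": "S3 file",
            "path": expand("s3://trend-data/#{year-month-day}.csv", run_date),
        },
    }


run = daily_run(date(2013, 10, 16))
print(run["inputs"][1]["table"])
print(run["output"]["path"])
```

Because the inputs and the output are described separately from the activity, the same transform can read from RDS and DynamoDB and write to S3, which matches the slide's point that input and output datanodes need not be the same type.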
  34. 34. Integrated Analytics
  35. 35. Integrated Analytics
  36. 36. End User Reporting
  37. 37. End User Reporting EMR Redshift RDS