Analytics on AWS - IP Expo 2013

Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Redshift, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.

Presentation Transcript

  • Analytics on AWS IP Expo 2013
  • BIG DATA: when innovation is required to collect, store, analyze, and manage your data
  • VOLUME VELOCITY VARIETY
  • Customer Needs
    • Store any amount of data, without capacity planning
    • Perform complex analysis on any data, scaling on demand
    • Store data securely
    • Decrease time to market by building environments quickly
    • Reduce costs by reducing capital expenditure
    • Enable global reach
  • Ingestion | Integration
  • Simple Storage Service (S3)
    Highly scalable object storage for the internet; objects from 1 byte to 5TB in size; 99.999999999% durability.
    (For comparison, Elastic Block Store offers high performance block storage: devices 1GB to 1TB in size, mounted as drives to instances, with snapshot/cloning functionality.)
      Paradigm         Object store (a web store, not a file system)
      Availability     99.99%
      Durability       99.999999999%
      Consistency      Eventually consistent; no single points of failure
      Performance      Very fast
      Redundancy       Across Availability Zones
      Security         Public key / private key
      Pricing          $0.095/GB/month (DUB)
      Typical use case Write once, read many
      Limits           100 buckets, unlimited storage, 5TB objects
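To make the write-once, read-many object model concrete, here is a minimal sketch using boto3, the AWS SDK for Python; the bucket and key names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Write once: upload a local file as an object (bucket/key names are hypothetical)
s3.upload_file("clickstream.csv", "my-analytics-bucket", "raw/2013-10/clickstream.csv")

# Read many: any number of consumers can fetch the same object
obj = s3.get_object(Bucket="my-analytics-bucket", Key="raw/2013-10/clickstream.csv")
print(obj["Body"].read()[:100])
```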
  • Objects in S3 (chart, in billions): 14 in Q4 2007, 40 in Q4 2008, 102 in Q4 2009, 262 in Q4 2010, 762 in Q4 2011, 1,300 in Q4 2012, and 2,100 today. Peak requests: 1.2 million/second.
  • Performance & Scalability
    Amazon S3 provides near-linear scalability. S3 streaming performance (reader connections vs GB/second): 100 VMs sustain 9.6GB/s at $26/hr; 350 VMs sustain 28.7GB/s at $90/hr, i.e. 34 seconds per terabyte.
  • Spotify uses Amazon S3 for music storage
    "Amazon S3 gives us confidence in our ability to expand storage quickly while also providing high data durability." – Emil Fredriksson, Operations Director for Spotify
    • Spotify is an online music service offering instant access to over 16 million licensed songs
    • Over 15 million active users and 4 million paying subscribers
    • Spotify adds over 20,000 tracks a day to its catalogue
  • Amazon Glacier
    Long term object archive; extremely low cost per gigabyte; 99.999999999% durability.
    Designed for archival, not a file system; data is organised into vaults & archives, with a 3-5 hour retrieval time.
      Paradigm         Archive store
      Performance      Configurable (low)
      Redundancy       Across Availability Zones
      Security         Public key / private key
      Pricing          $0.011/GB/month
      Typical use case Write once, read infrequently (< 10% / month)
  • Storage Lifecycle Integration
    Simple Storage Service: highly scalable object storage; 1 byte to 5TB in size; 99.999999999% durability.
    Glacier: long term object archive; extremely low cost per gigabyte; 99.999999999% durability.
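One way to wire the two together is an S3 lifecycle rule that transitions objects to Glacier after a set age; a sketch with boto3, where the bucket name, prefix and 90-day threshold are all assumptions:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: archive objects under logs/ to Glacier once they are 90 days old
s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)
```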
  • NoSQL Data Capture: DynamoDB
    Provisioned throughput NoSQL database with fast, predictable, configurable performance. Fully distributed, fault tolerant HA architecture. Integration with EMR & Hive.
  • DynamoDB Consistency
    • Writes: acknowledged (committed) once they exist in at least two physical data centers, and persisted to SSD.
    • Reads: tunable for application requirements, with no reduction in durability or consistency in order to achieve throughput.
      • Eventually consistent read: stale reads possible; highest throughput.
      • Strongly consistent read: no stale values; lower potential throughput.
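That read-consistency trade-off is exposed as a per-request flag; a minimal boto3 sketch (the table and key names are hypothetical):

```python
import boto3

table = boto3.resource("dynamodb").Table("user-events")  # hypothetical table

# Eventually consistent read (the default): highest throughput, stale values possible
item = table.get_item(Key={"user_id": "123"})

# Strongly consistent read: no stale values, lower potential throughput
item = table.get_item(Key={"user_id": "123"}, ConsistentRead=True)
```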
  • Shazam scaled DynamoDB to 500,000 IOPS for a Super Bowl ad
    "AWS gave us the ability to bring a massive amount of capacity online in a short period of time." – Jason Titus, Shazam CTO
    • Shazam connects more than 200 million people, in more than 200 countries and 33 languages, to the music, TV shows and brands they love
    • When customers hear a song or see a TV program or ad they like, they simply activate the app to "tag" it
    • Shazam realized it could support over 500,000 writes per second with DynamoDB
    • Also using Amazon EMR for large-scale data analysis that can require more than 1 million writes per second
  • Complex Data Analysis … Parallel ETL
  • Elastic MapReduce
    Managed, elastic Hadoop cluster. Integrates with S3 & DynamoDB. Automated installation of Hive & Pig. Support for Spot Instances. Integrated HBase NoSQL database.
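Launching such a cluster, with Hive and Pig installed automatically, is a single API call; a boto3 sketch in which the cluster name, instance types and sizes, release label and S3 log path are all assumptions:

```python
import boto3

emr = boto3.client("emr")

# Hypothetical 4-node cluster with Hive & Pig installed automatically
emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hive"}, {"Name": "Pig"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-analytics-bucket/emr-logs/",  # hypothetical bucket
)
```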
  • EMR Data Sources
  • Reducing Costs with Spot Instances
    Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption.
    Scenario #1 (On-Demand only): 4 instances × 14 hrs × $0.50 = $28. Duration: 14 hours.
    Scenario #2 (with Spot): 4 instances × 7 hrs × $0.50 = $14, plus 5 Spot instances × 7 hrs × $0.25 = $8.75, for a total of $22.75. Duration: 7 hours.
    Time savings: 50%. Cost savings: ~20%.
    Other EMR + Spot use cases: run the entire cluster on Spot for the biggest cost savings; reduce the cost of application testing.
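Scenario #2 maps directly onto EMR instance groups: protected On-Demand master/core nodes plus interruptible Spot task nodes. A sketch of that layout (the instance counts and $0.25 bid come from the example above; the instance types and group names are assumptions):

```python
# Hypothetical instance-group layout for scenario #2: protected On-Demand
# master/core nodes plus interruptible Spot task nodes at a $0.25 bid
instance_groups = [
    {"Name": "master", "Market": "ON_DEMAND", "InstanceRole": "MASTER",
     "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "core", "Market": "ON_DEMAND", "InstanceRole": "CORE",
     "InstanceType": "m5.xlarge", "InstanceCount": 4},
    {"Name": "task-spot", "Market": "SPOT", "BidPrice": "0.25",
     "InstanceRole": "TASK", "InstanceType": "m5.xlarge", "InstanceCount": 5},
]
# Passed as Instances={"InstanceGroups": instance_groups} to emr.run_job_flow(...)
```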
  • Compute: Elastic Compute Cloud (EC2)
    Basic unit of compute capacity; vertical scaling from $0.02/hr. Range of CPU, memory & local disk options, with 13 instance types available, from micro to cluster compute.
      Feature          Details
      Flexible         Run Windows or Linux distributions
      Scalable         Wide range of instance types, from micro to cluster compute
      Machine images   Configurations can be saved as machine images (AMIs) from which new instances can be created
      Full control     Full root or administrator rights
      Secure           Full firewall control via Security Groups
      Monitoring       Publishes metrics to CloudWatch
      Inexpensive      On-Demand, Reserved and Spot instance types
      VM Import/Export Import and export VM images to transfer configurations in and out of EC2
  • Cluster Compute EC2 instance
    2nd generation cluster compute instance; implements HVM process execution. Intel® Xeon® E5-2670 processors; 80 EC2 Compute Units; 60GB RAM; 3TB local disk; 10 Gigabit Ethernet.
  • Network placement groups
    Cluster instances deployed in a 'Placement Group' enjoy low latency, full bisection 10Gbps bandwidth.
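Placement groups are created explicitly and then referenced at launch; a boto3 sketch in which the group name, AMI ID and instance type are hypothetical:

```python
import boto3

ec2 = boto3.client("ec2")

# Create a cluster placement group for low-latency, full-bisection bandwidth
ec2.create_placement_group(GroupName="hpc-cluster", Strategy="cluster")

# Launch instances into the group (AMI ID and instance type are hypothetical)
ec2.run_instances(
    ImageId="ami-12345678",
    InstanceType="c5n.18xlarge",
    MinCount=4,
    MaxCount=4,
    Placement={"GroupName": "hpc-cluster"},
)
```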
  • A CC2 instance cluster reached 240 TFLOPS, making it the 72nd fastest supercomputer in the world (#42 when announced at SC'11). (Test performed Nov 2011; benchmark published June 2012.)
  • Cluster GPU EC2 instance
    GPU compute instances: Intel® Xeon® X5570 processors; 2 × NVIDIA Tesla "Fermi" M2050 GPUs with >400 cores each; 33.5 EC2 Compute Units; 20GB RAM; very high I/O performance (10 Gigabit Ethernet).
  • S&P Capital IQ uses AWS for Big Data processing
    Provides data to 4,200+ top global investment firms. By feeding a Hadoop cluster from S3, they launched Hadoop faster and learned Hadoop faster.
  • Structured Data Management
  • Structured Data Analysis
    Relational Database Service (RDS): managed Oracle, MySQL & SQL Server.
    DynamoDB: managed NoSQL database.
    Amazon Redshift: massively parallel, petabyte-scale data warehouse.
  • Relational Database Service
    Database-as-a-Service: no need to install or manage database instances. Scalable and fault tolerant configurations. Integration with Data Pipeline.
  • Redshift
    Managed, massively parallel, petabyte-scale data warehouse. Streaming backup/restore to S3. Extensive security. Scales from 2TB to 1.6PB.
  • Redshift parallelizes and distributes everything: query, load, backup/restore and resize.
    Common BI tools connect over JDBC/ODBC to a leader node, which distributes work across the compute nodes over a 10GigE mesh.
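Because Redshift speaks the PostgreSQL wire protocol, any JDBC/ODBC client can drive it, and a parallel load from S3 is a single COPY statement that the leader node fans out across compute node slices. A sketch using psycopg2, where the endpoint, credentials, table, bucket and IAM role are all hypothetical:

```python
import psycopg2

# Hypothetical cluster endpoint and credentials
conn = psycopg2.connect(
    host="analytics.abc123xyz.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)

with conn.cursor() as cur:
    # The leader node splits the S3 prefix across compute node slices in parallel
    cur.execute("""
        COPY user_events
        FROM 's3://my-analytics-bucket/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV;
    """)
conn.commit()
```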
  • Redshift lets you start small and grow big
    Extra Large node (XL): 3 spindles, 2TB, 15GiB RAM, 2 virtual cores, 10GigE.
    8 Extra Large node (8XL): 24 spindles, 16TB, 120GiB RAM, 16 virtual cores, 10GigE.
    Configurations: a single XL node (2TB), an XL cluster of 2-32 nodes (4TB - 64TB), or an 8XL cluster of 2-100 nodes (32TB - 1.6PB).
  • Important Redshift Features
    No-downtime resize. Streaming backup/restore to S3. Automated point-in-time snapshotting. Workload management. Support for VPC. Support for encrypted data loads. Cluster SSL-only communications.
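The no-downtime resize is likewise a single API call, after which the cluster stays available while data is redistributed onto the new nodes; a boto3 sketch with a hypothetical cluster identifier, node type and count:

```python
import boto3

redshift = boto3.client("redshift")

# Grow a hypothetical cluster; it remains readable while the resize runs
redshift.modify_cluster(
    ClusterIdentifier="analytics-cluster",
    NodeType="dc2.large",
    NumberOfNodes=8,
)
```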
  • Data Pipeline
    Automatically provisions EC2 & EMR resources, manages dependencies & scheduling, and automatically retries and notifies of success & failure.
    Input datanode: an S3 bucket, RDS table, EMR Hive table, etc.
    Activity: a data aggregation, manipulation, or copy that runs on a user-configured schedule.
    Output datanode: supports all the same data sources as the input datanode, but they don't have to be the same type.
  • Sample Use Case
    Input: RDS table User-Demographics, guarded by the SQL precondition "select last_update from table" > #{YY-MM-DD}.
    Input: DynamoDB table User-Event-Data-#{year-month}.
    Activity: EMR transform running the Hive query user-metrics.hql, daily.
    Output: S3 file s3://trend-data/#{year-month-day}.csv.
    Success notification: metrics@example.com. Failure and delay notifications: emr-admin@example.com.
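In the Data Pipeline API, each datanode, activity and schedule becomes a pipeline object built from key/value fields. A heavily trimmed boto3 sketch of the use case above, reduced to the schedule and the S3 output node (all IDs and field values are illustrative, not a complete definition):

```python
import boto3

dp = boto3.client("datapipeline")

pipeline = dp.create_pipeline(name="user-metrics", uniqueId="user-metrics-daily")

# Trimmed to a schedule plus the S3 output datanode; the RDS/DynamoDB inputs
# and the EMR Hive activity would be further objects in the same list
dp.put_pipeline_definition(
    pipelineId=pipeline["pipelineId"],
    pipelineObjects=[
        {"id": "DailySchedule", "name": "DailySchedule", "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 day"},
            {"key": "startDateTime", "stringValue": "2013-10-17T00:00:00"},
        ]},
        {"id": "TrendOutput", "name": "TrendOutput", "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "schedule", "refValue": "DailySchedule"},
            {"key": "filePath",
             "stringValue": "s3://trend-data/#{format(@scheduledStartTime,'YYYY-MM-dd')}.csv"},
        ]},
    ],
)
dp.activate_pipeline(pipelineId=pipeline["pipelineId"])
```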
  • Integrated Analytics
  • End User Reporting (EMR, Redshift, RDS)