Analytics on AWS
IP Expo 2013

Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Redshift, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.

Published in: Technology

  1. Analytics on AWS - IP Expo 2013
  2. BIG DATA: When innovation is required to collect, store, analyze, and manage your data
  3. VOLUME, VELOCITY, VARIETY
  4. Customer Needs
      • Store Any Amount of Data
        – Without Capacity Planning
      • Perform Complex Analysis on Any Data
        – Scale on Demand
      • Store Data Securely
      • Decrease Time to Market
        – Build Environments Quickly
      • Reduce Costs
        – Reduce Capital Expenditure
      • Enable Global Reach
  5. Ingestion | Integration
  6. Elastic Block Store: High performance block storage device
      • 1GB to 1TB in size
      • Mount as drives to instances with snapshot/cloning functionalities
      Simple Storage Service: Highly scalable object storage for the internet
      • 1 byte to 5TB in size
      • 99.999999999% durability
      • Is a Web Store, Not a file system
      • No Single Points of Failure
      • Eventually consistent
      Availability: 99.99%
      Durability: 99.999999999%
      Paradigm: Object store
      Performance: Very Fast
      Redundancy: Across Availability Zones
      Security: Public Key / Private Key
      Pricing: $0.095/GB/month (DUB)
      Typical use case: Write once, read many
      Limits: 100 Buckets, Unlimited Storage, 5TB Objects
  7. Objects in S3
      [Chart: objects stored, in billions, for Q4 2007, Q4 2008, Q4 2009, Q4 2010, Q4 2011, Q4 2012 and Today; labels run from 14 up to 2100]
      Peak Requests: 1.2 Million / Second
  8. Performance & Scalability
      Amazon S3 provides near linear scalability
      S3 Streaming Performance [chart: GB/Second against Reader Connections]
      • 100 VMs; 9.6GB/s; $26/hr
      • 350 VMs; 28.7GB/s; $90/hr
      • 34 secs per terabyte
  9. Spotify uses Amazon S3 for Music Storage
      "AMAZON S3 GIVES US CONFIDENCE IN OUR ABILITY TO EXPAND STORAGE QUICKLY WHILE ALSO PROVIDING HIGH DATA DURABILITY"
      -Emil Fredriksson, Operations Director for Spotify
      • Spotify is an online music service offering instant access to over 16 million licensed songs
      • Over 15 million active users and 4 million paying subscribers
      • Spotify adds over 20,000 tracks a day to its catalogue
  10. Elastic Block Store: High performance block storage device
      • 1GB to 1TB in size
      • Mount as drives to instances with snapshot/cloning functionalities
      Amazon Glacier: Long term object archive
      • Extremely low cost per gigabyte
      • 99.999999999% durability
      • Designed for Archival, Not a file system
      • Vaults & Archives
      • 3-5 Hour Retrieval Time
      Durability: 99.999999999%
      Paradigm: Archive Store
      Performance: Configurable - Low
      Redundancy: Across Availability Zones
      Security: Public Key / Private Key
      Pricing: $0.011/GB/month
      Typical use case: Write once, read infrequently (< 10% / Month)
  11. Storage Lifecycle Integration
      Simple Storage Service: Highly scalable object storage
      • 1 byte to 5TB in size
      • 99.999999999% durability
      Glacier: Long term object archive
      • Extremely low cost per gigabyte
      • 99.999999999% durability
  12. NOSQL Data Capture
      [Diagram: AWS stack showing Deployment & Administration, App Services, Compute, Storage, Database (RDS, DynamoDB, Redshift), Networking, AWS Global Infrastructure]
      DynamoDB: Provisioned throughput NoSQL database
      • Fast, predictable, configurable performance
      • Fully distributed, fault tolerant HA architecture
      • Integration with EMR & Hive
  13. Dynamo Consistency
      • Writes
        – Writes are acknowledged (committed) once they exist in at least two physical data centers
        – Writes are persisted to SSD
      • Reads
        – No reduction in durability or consistency in order to achieve throughput
        – Tunable for Application Requirements
      Strongly Consistent Read: No Stale Values read; Lower Potential Throughput
      Eventually Consistent Read: Stale Values reads possible; Highest Throughput
  14. Shazam scaled DynamoDB to 500,000 IOPS for a Super Bowl Ad
      "AWS GAVE US THE ABILITY TO BRING A MASSIVE AMOUNT OF CAPACITY ONLINE IN A SHORT PERIOD OF TIME"
      -Jason Titus, Shazam CTO
      • Shazam connects more than 200 million people, in more than 200 countries and 33 languages, to the music, TV shows and brands they love
      • When customers hear a song or see a TV program or ad they like, they simply activate the app to “tag” it
      • Shazam realized it could support over 500,000 writes per second with DynamoDB
      • Also using Amazon EMR for large-scale data analysis that can require more than 1 million writes per second
  15. Complex Data Analysis … Parallel ETL
  16. Application Services: Elastic MapReduce
      [Diagram: AWS stack showing Deployment & Administration, App Services (Elastic MapReduce), Compute, Storage, Database, Networking, AWS Global Infrastructure]
      • Managed, elastic Hadoop cluster
      • Integrates with S3 & DynamoDB
      • Automated installation of Hive & Pig
      • Support for Spot Instances
      • Integrated HBase NOSQL Database
  17. EMR Data Sources
  18. Reducing Costs with Spot Instances
      Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption
      Scenario #1 Job Flow: Cost without Spot
      • 4 instances * 14 hrs * $0.50 = $28
      • Duration: 14 Hours
      Scenario #2 Job Flow: Cost with Spot
      • 4 instances * 7 hrs * $0.50 = $14
      • + 5 instances * 7 hrs * $0.25 = $8.75
      • Total = $22.75
      • Duration: 7 Hours
      Time Savings: 50%
      Cost Savings: ~20%
      Other EMR + Spot Use Cases
      • Run entire cluster on Spot for biggest cost savings
      • Reduce the cost of application testing
  19. Compute
      Elastic Compute Cloud (EC2): Basic unit of compute capacity
      • Range of CPU, memory & local disk options
      • 13 Instance types available, from micro to cluster compute
      • Vertical Scaling: From $0.02/hr
      [Diagram: AWS stack showing Deployment & Administration, App Services, Compute, Storage, Database, Networking, AWS Global Infrastructure]
      Feature: Details
      Flexible: Run Windows or Linux distributions
      Scalable: Wide range of instance types from micro to cluster compute
      Machine Images: Configurations can be saved as machine images (AMIs) from which new instances can be created
      Full control: Full root or administrator rights
      Secure: Full firewall control via Security Groups
      Monitoring: Publishes metrics to CloudWatch
      Inexpensive: On-demand, Reserved and Spot instance types
      VM Import/Export: Import and export VM images to transfer configurations in and out of EC2
  20. Cluster Compute (1): EC2 Instance
      2nd Generation cluster compute instance
      Cluster Compute instances implement HVM process execution
      • Intel® Xeon® E5-2670 processors
      • 10 Gigabit Ethernet
      • 80 EC2 Compute Units
      • 60GB RAM
      • 3TB Local Disk
  21. Cluster Compute (2): Network placement groups
      Cluster instances deployed in a ‘Placement Group’ enjoy low latency, full bisection 10 Gbps bandwidth
  22. CC2 Instance Cluster
      240 TFLOPS
      Making it the 72nd fastest supercomputer in the world (#42 when announced at SC’11)
      (Test performed Nov 2011, benchmark published June 2012)
  23. Cluster GPU (1): EC2 instance
      GPU compute instances:
      • Intel® Xeon® X5570 processors
      • 2 x NVIDIA Tesla “Fermi” M2050 GPUs (>400 Cores Each)
      • I/O Performance: Very High (10 Gigabit Ethernet)
      • 33.5 EC2 Compute Units
      • 20GB RAM
  24. S&P Capital IQ Uses AWS for Big Data Processing
      [Diagram: S3 and Hadoop Cluster]
      • Provides data to 4200+ top global investment firms
      • Launched Hadoop faster, Learned Hadoop faster
  25. Structured Data Management
  26. Structured Data Analysis
      [Diagram: AWS stack showing Deployment & Administration, App Services, Compute, Storage, Database (RDS, DynamoDB, Redshift), Networking, AWS Global Infrastructure]
      Relational Database Service: Managed Oracle, MySQL & SQL Server
      DynamoDB: Managed NOSQL Database
      Amazon Redshift: Massively Parallel Petabyte Scale Data Warehouse
  27. Structured Data Analysis
      Relational Database Service: Database-as-a-Service
      • No need to install or manage database instances
      • Scalable and fault tolerant configurations
      • Integration with Data Pipeline
  28. Structured Data Analysis
      Redshift: Managed Massively Parallel Petabyte Scale Data Warehouse
      • Streaming Backup/Restore to S3
      • Extensive Security
      • 2 TB -> 1.6 PB
  29. Redshift parallelizes and distributes everything
      [Architecture diagram: Common BI Tools, JDBC/ODBC, Leader Node, 10GigE Mesh, three Compute Nodes]
      Operations labelled on the diagram: Query, Load, Backup, Restore, Resize
  30. Redshift lets you start small and grow big
      Extra Large Node (XL)
      • 3 spindles, 2TB, 15GiB RAM
      • 2 virtual cores, 10GigE
      • Single Node (2TB)
      • Cluster 2-32 Nodes (4TB – 64TB)
      8 Extra Large Node (8XL)
      • 24 spindles, 16TB, 120GiB RAM
      • 16 virtual cores, 10GigE
      • Cluster 2-100 Nodes (32TB – 1.6PB)
  31. Important Redshift Features
      • No Downtime Resize
      • Streaming Backup/Restore to S3
      • Automated Point In Time Snapshotting
      • Workload Management
      • Support for VPC
      • Support for Encrypted Data Loads
      • Cluster SSL Only Communications
  32. Application Services: Data Pipeline
      • Input Datanode: This could be an S3 bucket, RDS table, EMR Hive table, etc.
      • Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule.
      • Output Datanode: This supports all the same data sources as the input datanode, but they don’t have to be the same type.
      Data Pipeline
      • Automatically Provision EC2 & EMR Resources
      • Manage Dependencies & Scheduling
      • Automatically Retry and Notify of Success & Failure
      [Diagram: AWS stack showing Deployment & Administration, App Services, Compute, Storage, Database, Networking, AWS Global Infrastructure]
  33. Sample Use Case
      Input: RDS Table
      • Table: User-Demographics
      • SQL Precondition: “Select last_update from table” > #{YY-MM-DD}
      Input: DynamoDB Table
      • Table: User-Event-Data-#{year-month}
      Activity: EMR Transform
      • Hive Query: user-metrics.hql
      • Frequency: Daily
      Output: S3 file
      • Path: s3://trend-data/#{year-month-day}.csv
      Success Notification: metrics@example.com
      Failure Notification: emr-admin@example.com
      Delay Notification: emr-admin@example.com
  34. Integrated Analytics
  35. Integrated Analytics
  36. End User Reporting
  37. End User Reporting
      [Diagram: EMR, Redshift, RDS]
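
The S3 streaming figures on slide 8 can be sanity-checked with a little arithmetic. The sketch below is not from the deck: it assumes decimal terabytes (1 TB = 1000 GB), takes the throughput and hourly prices exactly as quoted on the slide, and uses made-up function names.

```python
# Figures quoted on slide 8; nothing here is measured independently.
RUNS = [
    {"vms": 100, "gb_per_s": 9.6, "usd_per_hr": 26.0},
    {"vms": 350, "gb_per_s": 28.7, "usd_per_hr": 90.0},
]

GB_PER_TB = 1000.0  # assumption: decimal terabytes


def seconds_per_terabyte(gb_per_s: float) -> float:
    """Time to stream one terabyte at the given aggregate throughput."""
    return GB_PER_TB / gb_per_s


def usd_per_terabyte(gb_per_s: float, usd_per_hr: float) -> float:
    """Compute cost of streaming one terabyte at the quoted hourly price."""
    return usd_per_hr / 3600.0 * seconds_per_terabyte(gb_per_s)


def per_vm_throughput(run: dict) -> float:
    """Aggregate throughput divided by the number of reader VMs."""
    return run["gb_per_s"] / run["vms"]


if __name__ == "__main__":
    for run in RUNS:
        print(
            f'{run["vms"]} VMs: '
            f'{seconds_per_terabyte(run["gb_per_s"]):.1f} s/TB, '
            f'{per_vm_throughput(run):.3f} GB/s per VM, '
            f'${usd_per_terabyte(run["gb_per_s"], run["usd_per_hr"]):.2f} per TB'
        )
```

Under the decimal-terabyte assumption, 28.7 GB/s works out to just under 35 seconds per terabyte, in line with the "34 secs per terabyte" on the slide. Per-VM throughput drops only slightly between 100 and 350 VMs, which is what "near linear" suggests.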
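
The Spot comparison on slide 18 is simple enough to reproduce. This is a minimal sketch using only the instance counts, durations and hourly rates shown on the slide; the helper name and the tuple layout are illustrative assumptions.

```python
from typing import Iterable, Tuple

# (instance count, hours, USD per instance-hour), as quoted on slide 18
InstanceGroup = Tuple[int, float, float]


def job_flow_cost(groups: Iterable[InstanceGroup]) -> float:
    """Sum count * hours * hourly rate over every instance group."""
    return sum(count * hours * rate for count, hours, rate in groups)


# Scenario #1: four On-Demand instances for 14 hours.
on_demand_only = job_flow_cost([(4, 14, 0.50)])

# Scenario #2: the same four On-Demand instances plus five Spot
# instances, finishing in 7 hours.
with_spot = job_flow_cost([(4, 7, 0.50), (5, 7, 0.25)])

cost_savings = (on_demand_only - with_spot) / on_demand_only
time_savings = (14 - 7) / 14

if __name__ == "__main__":
    print(f"Without Spot: ${on_demand_only:.2f}")
    print(f"With Spot:    ${with_spot:.2f}")
    print(f"Cost savings: {cost_savings:.2%}")
    print(f"Time savings: {time_savings:.0%}")
```

The exact cost saving comes out at 18.75%, which the slide rounds to "~20%". The $0.25 Spot rate is the slide's illustrative figure; real Spot prices vary over time.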
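
The Redshift cluster ranges on slide 30 follow from per-node storage times node count. The sketch below encodes just the slide's figures (2TB per XL node, 16TB per 8XL node, and the multi-node ranges). The dictionary layout and function names are assumptions, and the single-node 2TB XL option is left out for brevity.

```python
import math

# Per-node storage and multi-node cluster limits as listed on slide 30.
NODE_TYPES = {
    "XL": {"tb_per_node": 2, "min_nodes": 2, "max_nodes": 32},
    "8XL": {"tb_per_node": 16, "min_nodes": 2, "max_nodes": 100},
}


def capacity_range_tb(node_type: str) -> tuple:
    """Smallest and largest multi-node cluster capacity, in TB."""
    spec = NODE_TYPES[node_type]
    return (
        spec["min_nodes"] * spec["tb_per_node"],
        spec["max_nodes"] * spec["tb_per_node"],
    )


def nodes_needed(node_type: str, data_tb: float) -> int:
    """Fewest nodes of this type whose raw storage holds data_tb."""
    spec = NODE_TYPES[node_type]
    nodes = max(spec["min_nodes"], math.ceil(data_tb / spec["tb_per_node"]))
    if nodes > spec["max_nodes"]:
        raise ValueError(f"{data_tb} TB exceeds the {node_type} cluster limit")
    return nodes


if __name__ == "__main__":
    for name in NODE_TYPES:
        low, high = capacity_range_tb(name)
        print(f"{name}: {low} TB to {high} TB")
    print("Nodes for 10 TB on XL:", nodes_needed("XL", 10))
```

This reproduces the 4TB to 64TB range for XL clusters and 32TB to 1600TB (1.6PB) for 8XL clusters. It counts raw storage only and ignores compression, replication and working space, so treat the node counts as a lower bound.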