DW on AWS
Gaurav Agrawal
Data Platforms, AOL Inc.
AOL Data Platforms Architecture 2014
Data Stats & Insights
Cluster Size: 2 PB
In-House Cluster: 100 Nodes
Raw Data/Day: 2-3 TB
Data Retention: 13-24 Months
Challenges with In-House Infrastructure
Fixed Cost
Slow Deployment Cycle
Always On
Self Serve
Static: Not Scalable
Outages Impact
Production Upgrade
Storage and Compute
AOL Data Platforms Architecture 2015
Migration
• Web Console vs. CLI
Web Console and CLI
Web Console for Training
Setup IAM for users
AWS Services Options
S3 Data upload
EMR Creation & Steps
Try & Test multiple approaches
CLI is your friend..!!!
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
Copy Existing Data to S3
Environment Level Buckets
Dev, QA, Production, Analyst
bucket-dev bucket-qa bucket-prod bucket-analyst
Project Level Buckets
Code, Data, Log, Extract and Control
bucket-prod-code bucket-prod-data bucket-prod-log bucket-prod-extract bucket-prod-control
Compressed Snappy Data to GZIP
Multi Platforms Support
Best Compression
Lowest storage cost
Low cost for Data OUT
76% Less Storage
70K Saving/Year
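The bucket layout on this slide can be sketched as a small naming helper. This is an illustrative Python sketch, not the team's actual tooling: the function names `env_bucket` and `project_bucket` are hypothetical, and only the `bucket-<env>` and `bucket-<env>-<kind>` patterns are taken from the bucket names shown on the slide.

```python
# Environment and bucket kinds as listed on the slide.
ENVIRONMENTS = ("dev", "qa", "prod", "analyst")
KINDS = ("code", "data", "log", "extract", "control")


def env_bucket(env):
    """Return the environment-level bucket name, e.g. bucket-prod."""
    if env not in ENVIRONMENTS:
        raise ValueError("unknown environment: %r" % (env,))
    return "bucket-" + env


def project_bucket(env, kind):
    """Return the project-level bucket name, e.g. bucket-prod-log."""
    if kind not in KINDS:
        raise ValueError("unknown bucket kind: %r" % (kind,))
    return env_bucket(env) + "-" + kind


if __name__ == "__main__":
    # List the project-level buckets for the production environment.
    for kind in KINDS:
        print(project_bucket("prod", kind))
```

Keeping the names in one helper means jobs never hard-code a bucket, so the same job definition can run against dev, qa, or prod.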
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
EMR Design Options
Transient vs. Persistent Cluster
Amazon S3 vs. local HDFS
Elastic Cluster vs. Static Cluster
On-Demand vs. Reserved vs. Spot
Core Nodes vs. Task Nodes
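One possible combination of these options (on-demand master and core nodes, Spot task nodes) can be sketched as an instance-groups definition. This is a hedged Python sketch: the slide does not say which combination was chosen, and the function name, instance type, counts, and bid price are placeholders. The key names (`Name`, `InstanceGroupType`, `InstanceType`, `InstanceCount`, `BidPrice`) follow the JSON form accepted by `aws emr create-cluster --instance-groups`.

```python
import json


def instance_groups(core_count, task_count, instance_type="m3.xlarge",
                    spot_bid=None):
    """Build an EMR instance-groups list: on-demand master and core groups,
    plus an optional task group that uses Spot when a bid is given."""
    groups = [
        {"Name": "master", "InstanceGroupType": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceGroupType": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        task = {"Name": "task", "InstanceGroupType": "TASK",
                "InstanceType": instance_type, "InstanceCount": task_count}
        if spot_bid is not None:
            # A BidPrice entry requests Spot capacity for this group.
            task["BidPrice"] = str(spot_bid)
        groups.append(task)
    return groups


if __name__ == "__main__":
    # Placeholder sizing: 2 core nodes, 8 Spot task nodes.
    print(json.dumps(instance_groups(2, 8, spot_bid="0.10"), indent=2))
```

Core nodes hold HDFS blocks, so the sketch keeps them on-demand and puts only the task group on Spot, where losing an instance costs compute but not data.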
AOL Data Platforms Architecture 2015
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission - CLI
EMR Jobs Submission - CLI
In-house scheduler
Common Utilities
Provision EMR
Push/Pull Data to S3
Job submission to Scheduler
Database Load
JSON Files
Applications, Steps, Bootstrap, EC2 attributes, Instance Groups
Future : Event Driven Design – Lambda, SQS
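A common utility that writes one of these JSON files might look like the following. This is a minimal Python sketch under stated assumptions: the helper names and the jar path are hypothetical, and the deck does not show the contents of its own steps file. The keys (`Name`, `Type`, `ActionOnFailure`, `Jar`, `Args`) follow the JSON form accepted by `aws emr create-cluster --steps`.

```python
import json
import os
import tempfile


def custom_jar_step(name, jar, args, on_failure="TERMINATE_CLUSTER"):
    """Return one step entry in the JSON shape used with --steps file://..."""
    allowed = ("TERMINATE_CLUSTER", "CANCEL_AND_WAIT", "CONTINUE")
    if on_failure not in allowed:
        raise ValueError("unsupported ActionOnFailure: %r" % (on_failure,))
    return {"Name": name, "Type": "CUSTOM_JAR", "ActionOnFailure": on_failure,
            "Jar": jar, "Args": list(args)}


def write_steps(path, steps):
    """Write the steps list as JSON and return a file:// reference to it."""
    with open(path, "w") as handle:
        json.dump(steps, handle, indent=2)
    return "file://" + path


if __name__ == "__main__":
    # Hypothetical jar location and arguments, for illustration only.
    step = custom_jar_step(
        "load-subdataset",
        "s3://bucket-prod-code/jobs/example.jar",
        ["--date", "2015-10-08"],
    )
    target = os.path.join(tempfile.mkdtemp(), "steps.json")
    print(write_steps(target, [step]))
```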
EMR Jobs Submission - CLI
aws emr create-cluster --name "prod_dataset_subdataset_2015-10-08" \
  --tags "Env=prod" "Project=Omniture" "Dataset=DATASET" "Owner=gaurav" \
         "Subdataset=SUBDATASET" "Date=2015-10-08" "Region=us-east-1" \
  --visible-to-all-users \
  --ec2-attributes file://omni_awssot.generic.ec2_attributes.json \
  --ami-version "3.7.0" \
  --log-uri s3://bucket-prod-log/DATASET_NAME/SUBDATASET_NAME/ \
  --enable-debugging \
  --instance-groups file://omni_awssot.generic.instance_groups.json \
  --auto-terminate \
  --applications file://omni_awssot.generic.applications.json \
  --bootstrap-actions file://omni_awssot.generic.bootstrap_actions.json \
  --steps file://omni_awssot.generic.steps.json
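A scheduler-side wrapper could assemble this command from a few parameters. The Python sketch below only builds and prints the argument list; it does not call AWS. The function name and its parameters are hypothetical, and the naming patterns for the cluster name and log URI are inferred from the single example on the slide, so treat them as assumptions.

```python
import shlex


def create_cluster_argv(env, dataset, subdataset, run_date, owner, project,
                        region="us-east-1",
                        config_prefix="omni_awssot.generic",
                        ami_version="3.7.0"):
    """Assemble the `aws emr create-cluster` argument list for one run."""
    name = "%s_%s_%s_%s" % (env, dataset.lower(), subdataset.lower(), run_date)
    tags = [
        "Env=" + env, "Project=" + project, "Dataset=" + dataset,
        "Owner=" + owner, "Subdataset=" + subdataset,
        "Date=" + run_date, "Region=" + region,
    ]

    def cfg(kind):
        # JSON config files share one prefix, as in the slide's example.
        return "file://%s.%s.json" % (config_prefix, kind)

    return ["aws", "emr", "create-cluster", "--name", name, "--tags"] + tags + [
        "--visible-to-all-users",
        "--ec2-attributes", cfg("ec2_attributes"),
        "--ami-version", ami_version,
        "--log-uri", "s3://bucket-%s-log/%s/%s/" % (env, dataset, subdataset),
        "--enable-debugging",
        "--instance-groups", cfg("instance_groups"),
        "--auto-terminate",
        "--applications", cfg("applications"),
        "--bootstrap-actions", cfg("bootstrap_actions"),
        "--steps", cfg("steps"),
    ]


if __name__ == "__main__":
    argv = create_cluster_argv("prod", "DATASET", "SUBDATASET",
                               "2015-10-08", "gaurav", "Omniture")
    # Print a shell-safe rendering instead of executing the command.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Building a list rather than a string avoids quoting mistakes when the list is later handed to a process launcher.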
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
Monitoring
EMR WatchDog : Node.js
Duplicate Clusters
Failed Clusters
Long-running Clusters
Long-provisioning Clusters
CloudWatch Alarms
Monthly Billing
S3 Bucket Size
SNS Email Notifications
Amazon CloudWatch
Amazon SNS
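The four WatchDog checks can be sketched as a single pass over cluster records. The deck says the real WatchDog is written in Node.js and does not show its code, so this Python version is only an illustration: the record shape (`name`, `state`, `created`), the thresholds, and the reading of "duplicate" as two active clusters with the same name are all assumptions. The state names are the standard EMR cluster states.

```python
from collections import Counter
from datetime import datetime, timedelta

PROVISIONING = ("STARTING", "BOOTSTRAPPING")
WORKING = ("RUNNING", "WAITING")


def check_clusters(clusters, now,
                   max_run=timedelta(hours=6),
                   max_provision=timedelta(minutes=30)):
    """Return a dict mapping each alert category to a list of cluster names."""
    alerts = {"duplicate": [], "failed": [],
              "long_running": [], "long_provisioning": []}

    # Duplicate: more than one active cluster sharing a name (assumed meaning).
    active = Counter(c["name"] for c in clusters
                     if c["state"] in PROVISIONING + WORKING)
    alerts["duplicate"] = sorted(n for n, count in active.items() if count > 1)

    for cluster in clusters:
        age = now - cluster["created"]
        if cluster["state"] == "TERMINATED_WITH_ERRORS":
            alerts["failed"].append(cluster["name"])
        elif cluster["state"] in PROVISIONING and age > max_provision:
            alerts["long_provisioning"].append(cluster["name"])
        elif cluster["state"] in WORKING and age > max_run:
            alerts["long_running"].append(cluster["name"])
    return alerts


if __name__ == "__main__":
    now = datetime(2015, 10, 8, 12, 0)
    sample = [
        {"name": "a", "state": "RUNNING", "created": datetime(2015, 10, 8, 2, 0)},
        {"name": "a", "state": "STARTING", "created": datetime(2015, 10, 8, 11, 50)},
    ]
    print(check_clusters(sample, now))
```

Each non-empty category would then be turned into a notification, for example through an SNS topic.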
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
Elasticity
Why be Elastic?
[Chart: Core Nodes Demand - 09/05/2015 (x-axis 1-24): Daily Processes]
[Chart: Core Nodes Demand - 09/20/2015 (x-axis 1-24): No Clusters, Spike in Demand]
[Chart: Core Nodes Demand - 06/01/2015 (x-axis 1-24): Major Restatement, Demand > 10K EC2]
Elasticity
Why be Elastic?
True Cloud Architecture
Spot is an Open Market
Scale Horizontally
Our Limit : 3,000 EC2/Region
Multiple Regions
Multiple Instance Types
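The arithmetic behind "Multiple Regions" is simple: with a limit of 3,000 EC2 instances per region, a demand above 10K instances needs at least four regions. The Python sketch below spreads a demand across regions under that cap. The function name is hypothetical, the fill-in-order strategy is an assumption, and every region name other than us-east-1 is a placeholder not taken from the deck.

```python
def spread_demand(demand, regions, limit_per_region=3000):
    """Split an instance demand across regions without exceeding the cap.

    Regions are filled in the order given; raises ValueError when the
    combined limit of all regions cannot cover the demand.
    """
    capacity = limit_per_region * len(regions)
    if demand > capacity:
        raise ValueError("demand %d exceeds total capacity %d"
                         % (demand, capacity))
    plan = {}
    remaining = demand
    for region in regions:
        take = min(remaining, limit_per_region)
        if take > 0:
            plan[region] = take
        remaining -= take
    return plan


if __name__ == "__main__":
    # Placeholder region list; only us-east-1 appears in the deck.
    regions = ["us-east-1", "us-west-2", "eu-west-1", "ap-southeast-1"]
    print(spread_demand(10000, regions))
```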
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
• Optimization
Optimization
Data Management
Partition Data on S3
S3 Versioning/Lifecycle
Hadoop Run-time Params
Memory Tuning
Compress M & R Output
Combine Splits Input format
Security
Roles
Security Groups
AOL VPC
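The "Compress M & R Output" item can be expressed as Hadoop job properties. The Python sketch below renders them as `-D key=value` arguments. The property names are the standard Hadoop 2 MapReduce names and the codec is Hadoop's built-in `GzipCodec`; which values the team actually set is not stated in the deck, so the defaults here are assumptions, as is the function name.

```python
GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"


def hadoop_d_args(compress_map=True, compress_output=True, codec=GZIP_CODEC):
    """Return -D arguments that enable map and/or job output compression."""
    props = []
    if compress_map:
        # Compress intermediate map output to cut shuffle traffic.
        props.append(("mapreduce.map.output.compress", "true"))
        props.append(("mapreduce.map.output.compress.codec", codec))
    if compress_output:
        # Compress the final job output written to storage.
        props.append(("mapreduce.output.fileoutputformat.compress", "true"))
        props.append(("mapreduce.output.fileoutputformat.compress.codec", codec))
    args = []
    for key, value in props:
        args.extend(["-D", "%s=%s" % (key, value)])
    return args


if __name__ == "__main__":
    print(" ".join(hadoop_d_args()))
```

The resulting list can be placed in a step's argument list ahead of the job's own arguments.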
Score Card
Feature AWS
Pay for what you use ✔
Decouple Storage and Compute ✔
True Cloud Architecture ✔
Self Service Model ✔
Elastic & Scalable ✔
Global Infrastructure. BCDR. ✔
Quick & Easy Deployments ✔
Redshift External Tables on S3 ?
More languages for Lambda ?
AWS vs. In-House Cost
[Chart: Service Cost Comparison, AWS vs. In-House (axis 0-6)]
In-House: 4x / Month
AWS: 1x / Month
Source: AOL & AWS Billing Tool
** In-House cluster cost includes Storage, Power and Network.
AWS vs. In-House Cost
10/8/2015
Amazon Web Services
1/4th Cost of In-House Hadoop Infrastructure
Data Platforms. AOL Inc.
Restatement Use Case
• Restate historical data going back 6 months
Availability Zones: 10
EMR Clusters: 550
Spot EC2 Instances: 24,000
[Chart: Core Nodes Demand - 06/01/2015 (x-axis 1-24)]
[Chart: Timing Comparison, In-House vs. AWS (axis 0-70)]
Best Practices & Suggestions
Tag All Resources
Infrastructure as Code
Command Line Interface
JSON as configuration files
IAM Roles and Policies
Use of Application ID
Enable CloudTrail
S3 Lifecycle Management
S3 Versioning
Separate Code/Data/Logs buckets
Keyless EMR Clusters
Hybrid Model
Enable Debugging
Create Multiple CLI Profiles
Multi-Factor Authentication
CloudWatch Billing Alarms
Spot EC2 Instances
SNS notifications for failures
Loosely coupled Apps
Scale Horizontally
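"Tag All Resources" is easier to enforce when a helper refuses to launch anything with missing tags. The Python sketch below is one way to do that. The required tag keys mirror the `--tags` list in the deck's create-cluster example; the function name and the choice to treat all seven as mandatory are assumptions.

```python
# Tag keys taken from the create-cluster example in this deck.
REQUIRED_TAGS = ("Env", "Project", "Dataset", "Subdataset",
                 "Owner", "Date", "Region")


def format_tags(tags, required=REQUIRED_TAGS):
    """Validate a tag dict and return sorted "Key=Value" strings for --tags."""
    missing = [key for key in required if not tags.get(key)]
    if missing:
        raise ValueError("missing required tags: " + ", ".join(missing))
    return ["%s=%s" % (key, tags[key]) for key in sorted(tags)]


if __name__ == "__main__":
    print(format_tags({
        "Env": "prod", "Project": "Omniture", "Dataset": "DATASET",
        "Subdataset": "SUBDATASET", "Owner": "gaurav",
        "Date": "2015-10-08", "Region": "us-east-1",
    }))
```

Consistent tags are what make per-project billing reports and cluster monitoring possible later.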
Thank you
