A real-world use case: migrating an in-house 2 PB Hadoop cluster to AWS within a few months. AWS is easy to use, cost-effective, flexible, scalable and very reliable. Technologies involved are Hive, Presto, Python and Autosys, using AWS EMR, AWS Lambda, AWS S3, AWS DynamoDB and AWS SNS.
7. Web Console and CLI
Web Console for Training
Setup IAM for users
AWS Services Options
S3 Data upload
EMR Creation & Steps
Try & Test multiple approaches
CLI is your friend!
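As a rough illustration of the "S3 Data upload" and "EMR Creation & Steps" bullets, the sketch below assembles the corresponding AWS CLI invocations as argument lists. It only builds and prints the commands; nothing is sent to AWS. The profile name, release label, instance type, instance count and step name are assumptions chosen for the example, not values from the slides.

```python
import shlex


def s3_upload_cmd(local_dir, bucket, prefix, profile="dev"):
    """Build an `aws s3 cp` command that uploads a directory tree to S3."""
    dest = f"s3://{bucket}/{prefix.strip('/')}/"
    return ["aws", "s3", "cp", local_dir, dest, "--recursive", "--profile", profile]


def emr_create_cmd(name, log_bucket, hive_script, profile="dev",
                   release_label="emr-4.1.0",   # assumed release
                   instance_type="m3.xlarge",   # assumed instance type
                   instance_count=3):           # assumed cluster size
    """Build an `aws emr create-cluster` command with one Hive step.

    The cluster terminates itself once the step finishes (--auto-terminate).
    """
    step = ("Type=HIVE,Name=run-hive,ActionOnFailure=TERMINATE_CLUSTER,"
            f"Args=[-f,{hive_script}]")
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", "Name=Hive",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
        "--log-uri", f"s3://{log_bucket}/emr/",
        "--steps", step,
        "--auto-terminate",
        "--profile", profile,
    ]


def as_shell(cmd):
    """Render an argument list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    print(as_shell(s3_upload_cmd("./data", "bucket-dev-data", "raw")))
    print(as_shell(emr_create_cmd("test-cluster", "bucket-dev-log",
                                  "s3://bucket-dev-code/query.hql")))
```

Keeping the commands as argument lists makes them easy to pass to `subprocess.run` later, and switching `profile` is one way to use multiple CLI profiles per environment.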
9. Copy Existing Data to S3
Environment Level Buckets
Dev, QA, Production, Analyst
bucket-dev, bucket-qa, bucket-prod, bucket-analyst
Project Level Buckets
Code, Data, Log, Extract and Control
bucket-prod-code, bucket-prod-data, bucket-prod-log, bucket-prod-extract, bucket-prod-control
Compressed Snappy Data to GZIP
Multi Platforms Support
Best Compression
Lowest storage cost
Low cost for Data OUT
76% Less Storage
70K Saving/Year
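The bucket layout on this slide follows a regular naming pattern, which a small helper can capture. The sketch below is a minimal illustration, assuming the `bucket-<env>-<area>` pattern holds for every environment (the slide only shows the project-level names for `prod`). The `gzip_savings` helper is an extra, hypothetical utility that measures how much smaller a payload becomes under GZIP relative to the raw bytes; it does not reproduce the Snappy-to-GZIP comparison behind the 76% figure.

```python
import gzip

ENVIRONMENTS = ("dev", "qa", "prod", "analyst")
PROJECT_AREAS = ("code", "data", "log", "extract", "control")


def bucket_name(env, area=None):
    """Return the environment-level or project-level bucket name."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    if area is None:
        return f"bucket-{env}"
    if area not in PROJECT_AREAS:
        raise ValueError(f"unknown project area: {area}")
    return f"bucket-{env}-{area}"


def project_buckets(env):
    """List every project-level bucket for one environment."""
    return [bucket_name(env, area) for area in PROJECT_AREAS]


def gzip_savings(data):
    """Fraction of bytes saved by GZIP-compressing `data` (raw bytes)."""
    if not data:
        return 0.0
    return 1 - len(gzip.compress(data)) / len(data)


if __name__ == "__main__":
    print(project_buckets("prod"))
    sample = b"2015-10-08,user,click\n" * 1000
    print(f"gzip saves {gzip_savings(sample):.0%} on the sample")
```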
11. EMR Design Options
Amazon EMR
Transient vs. Persistent Cluster
Amazon S3 vs. local HDFS
Elastic Cluster vs. Static Cluster
On-Demand vs. Reserved vs. Spot
Core Nodes vs. Task Nodes
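To make these design choices concrete, the sketch below builds a request dictionary in the shape accepted by boto3's EMR `run_job_flow` call: a transient cluster (it shuts down when its steps finish), logs on S3, On-Demand master and core groups, and a Spot task group. The function only constructs the dictionary and does not call AWS. Instance types, counts, the bid price and the release label are assumptions for illustration, and which combination the original migration actually chose is not stated on the slide.

```python
import json


def transient_cluster_request(name, log_uri, core_count=2, task_count=4,
                              task_bid_price="0.10",      # assumed USD/hour bid
                              instance_type="m3.xlarge",  # assumed type
                              release_label="emr-4.1.0"): # assumed release
    """Build a run_job_flow-style request for a transient, partly Spot cluster."""
    groups = [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes hold HDFS blocks, so they stay On-Demand in this sketch.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
        # Task nodes only add compute, so losing a Spot instance loses no data.
        {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
         "BidPrice": task_bid_price,
         "InstanceType": instance_type, "InstanceCount": task_count},
    ]
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Applications": [{"Name": "Hive"}],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Instances": {
            "InstanceGroups": groups,
            # False makes the cluster transient: it terminates after its steps.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    }


if __name__ == "__main__":
    request = transient_cluster_request("nightly-etl", "s3://bucket-prod-log/emr/")
    print(json.dumps(request, indent=2))
```

Setting `KeepJobFlowAliveWhenNoSteps` to `True` would turn the same request into a persistent cluster, which is the other side of the first trade-off listed above.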
20. Elasticity
Why be Elastic?
True Cloud Architecture
Spot is an Open Market
Scale Horizontally
Our Limit: 3,000 EC2 instances/Region
Multiple Regions
Multiple Instance Types
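One way to read the elasticity bullets together: pick whichever instance type is currently cheapest per vCPU on the Spot market, scale out horizontally to cover the work, and stay under the per-region instance limit. The sketch below illustrates that reasoning only. The Spot prices are made-up inputs, and the function name and signature are hypothetical rather than part of any AWS API.

```python
import math

REGION_INSTANCE_LIMIT = 3000  # per-region EC2 limit quoted on the slide


def plan_task_nodes(vcpus_needed, spot_offers, running_instances=0,
                    limit=REGION_INSTANCE_LIMIT):
    """Choose an instance type and count for a Spot task group.

    spot_offers maps instance type -> (vcpus, price_per_hour).
    Returns (instance_type, count), capped by the remaining regional headroom.
    """
    # Cheapest offer per vCPU wins.
    best = min(spot_offers, key=lambda t: spot_offers[t][1] / spot_offers[t][0])
    vcpus_per_node, _price = spot_offers[best]
    wanted = math.ceil(vcpus_needed / vcpus_per_node)
    headroom = max(limit - running_instances, 0)
    return best, min(wanted, headroom)


if __name__ == "__main__":
    # Hypothetical Spot prices in USD/hour; real prices change continuously.
    offers = {"m3.xlarge": (4, 0.08), "c3.2xlarge": (8, 0.12)}
    print(plan_task_nodes(100, offers))
    print(plan_task_nodes(100, offers, running_instances=2995))
```

When one region runs out of headroom or Spot capacity, the same calculation can be repeated with another region's offers, which is where the "Multiple Regions" bullet comes in.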
21. Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
• Optimization
22. Optimization
Data Management
Partition Data on S3
S3 Versioning/Lifecycle
Hadoop Run-time Params
Memory Tuning
Compress Map & Reduce Output
Combine Splits Input format
Security
Roles
Security Groups
AOL VPC
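The data-management and run-time bullets above can be sketched as configuration. The example below builds a Hive-style partitioned S3 key prefix and an EMR configurations list (the JSON structure of `Classification` and `Properties` entries that EMR accepts at cluster creation) covering memory tuning, compressed map and reduce output, and combined input splits. The property names are standard Hadoop and Hive settings; the specific values, the year/month/day partition scheme and the 80% heap ratio are assumptions, not figures from the slides.

```python
import json
from datetime import date


def partition_prefix(table, day):
    """Hive-style partitioned key prefix for one day of a table on S3."""
    return f"{table}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def emr_configurations(map_memory_mb=2048):
    """Build an EMR configurations list for the tuning items on the slide."""
    heap_mb = int(map_memory_mb * 0.8)  # assumed: JVM heap at 80% of container
    return [
        {
            "Classification": "mapred-site",
            "Properties": {
                # Memory tuning
                "mapreduce.map.memory.mb": str(map_memory_mb),
                "mapreduce.map.java.opts": f"-Xmx{heap_mb}m",
                # Compress intermediate (map) and final (reduce) output
                "mapreduce.map.output.compress": "true",
                "mapreduce.output.fileoutputformat.compress": "true",
                "mapreduce.output.fileoutputformat.compress.codec":
                    "org.apache.hadoop.io.compress.GzipCodec",
            },
        },
        {
            "Classification": "hive-site",
            "Properties": {
                # Combine many small files into fewer input splits
                "hive.input.format":
                    "org.apache.hadoop.hive.ql.io.CombineHiveInputFormat",
            },
        },
    ]


if __name__ == "__main__":
    print(partition_prefix("events", date(2015, 10, 8)))
    print(json.dumps(emr_configurations(), indent=2))
```

Partitioned prefixes let a query read only the days it needs from S3, while the JSON can be saved to a file and supplied when the cluster is created.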
23. Score Card
Feature AWS
Pay for what you use ✔
Decouple Storage and Compute ✔
True Cloud Architecture ✔
Self Service Model ✔
Elastic & Scalable ✔
Global Infrastructure. BCDR. ✔
Quick & Easy Deployments ✔
Redshift External Tables on S3 ?
More languages for Lambda ?
24. AWS vs. In-House Cost
[Chart: Cost Comparison by Service, AWS vs. In-House]
In-House: 4x / Month
AWS: 1x / Month
Source: AOL & AWS Billing Tool
** In-House cluster includes Storage, Power and Network cost.
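The 4x versus 1x comparison translates directly into a savings percentage. The short sketch below does that arithmetic using relative cost units only, since the slide gives no absolute dollar figures.

```python
def savings_fraction(in_house_cost, aws_cost):
    """Fraction of the in-house cost saved by running on AWS."""
    return 1 - aws_cost / in_house_cost


if __name__ == "__main__":
    # Relative monthly costs from the slide: in-house = 4 units, AWS = 1 unit.
    print(f"AWS costs {1 / 4:.0%} of in-house, "
          f"a saving of {savings_fraction(4, 1):.0%}")
```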
25. AWS vs. In-House Cost
10/8/2015
Amazon Web Services
1/4th the Cost of In-House Hadoop Infrastructure
Data Platforms. AOL Inc.
27. Best Practices & Suggestions
Tag All Resources
Infrastructure as Code
Command Line Interface
JSON as configuration files
IAM Roles and Policies
Use of Application ID
Enable CloudTrail
S3 Lifecycle Management
S3 Versioning
Separate Code/Data/Logs buckets
Keyless EMR Clusters
Hybrid Model
Enable Debugging
Create Multiple CLI Profiles
Multi-Factor Authentication
CloudWatch Billing Alarms
Spot EC2 Instances
SNS notifications for failures
Loosely coupled Apps
Scale Horizontally
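Two of these practices, tagging every resource and managing S3 lifecycle through JSON configuration, might look roughly like the sketch below. It builds a tag list in the Key/Value form used by EMR and EC2, and a lifecycle configuration in the shape accepted by S3's bucket lifecycle API. The tag keys, the `logs/` prefix and all day counts are assumptions for illustration; the slides do not give the actual values.

```python
import json


def resource_tags(application_id, environment, owner):
    """Key/Value tag list to attach to every cluster and instance."""
    return [
        {"Key": "ApplicationID", "Value": application_id},  # assumed tag key
        {"Key": "Environment", "Value": environment},       # assumed tag key
        {"Key": "Owner", "Value": owner},                   # assumed tag key
    ]


def log_lifecycle_config(prefix="logs/", glacier_after_days=30,
                         expire_after_days=365, noncurrent_days=30):
    """S3 lifecycle configuration: archive, then expire, objects under a prefix."""
    return {
        "Rules": [{
            "ID": f"expire-{prefix.strip('/')}",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Transitions": [
                {"Days": glacier_after_days, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": expire_after_days},
            # With versioning enabled, old versions are cleaned up as well.
            "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
        }]
    }


if __name__ == "__main__":
    print(json.dumps(resource_tags("app-123", "prod", "data-platforms"), indent=2))
    print(json.dumps(log_lifecycle_config(), indent=2))
```

Keeping both structures as JSON files under version control fits the "Infrastructure as Code" and "JSON as configuration files" items in the list.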