SlideShare a Scribd company logo
1 of 28
Download to read offline
DW on AWS
Gaurav Agrawal
Data Platforms, AOL Inc.
AOL Data Platforms Architecture 2014
Data Stats & Insights
Cluster Size
2 PB
In-House
Cluster
100 Nodes
Raw
Data/Day
2-3 TB
Data
Retention
13-24 Months
Challenges with In-House Infrastructure
Fixed Cost
Slow Deployment
Cycle
Always On Self Serve
Static : Not Scalable Outages Impact Production Upgrade
Storage Compute
AOL Data Platforms Architecture 2015
1
2
2
3
4
56
Migration
• Web Console vs. CLI
Web Console and CLI
Web Console for Training
Setup IAM for users
AWS Services Options
S3 Data upload
EMR Creation & Steps
Try & Test multiple approaches
CLI is your friend..!!!
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
bucket-prod-control
Environment Level Buckets
Dev, QA, Production, Analyst
Project Level Buckets
Code, Data, Log, Extract and Control
Compressed Snappy Data to GZIP
Multi Platforms Support
Best Compression
Lowest storage cost
Low cost for Data OUT
bucket-dev bucket-qa
bucket-prod bucket-analyst
bucket-prod-code
bucket-prod-log
bucket-prod-data
bucket-prod-extract
76%
Less Storage
70K
Saving/Year
Copy Existing Data to S3
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
EMR Design Options
Transient
Amazon S3
Elastic Cluster
On-Demand vs. Reserved vs.
Core NodesAmazon EMR
vs. Persistent Cluster
vs. local HDFS
vs. Static Cluster
Spot
vs. Task Nodes
AOL Data Platforms Architecture 2015
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission - CLI
EMR Jobs Submission - CLI
In-house scheduler
Common Utilities
Provision EMR
Push/Pull Data to S3
Job submission to Scheduler
Database Load
JSON Files
Applications, Steps, Bootstrap,EC2 attributes, Instance Groups
Future : Event Driven Design – Lambda, SQS
EMR Jobs Submission - CLI
aws emr create-cluster –name "prod_dataset_subdataset_2015-10-08" 
--tags "Env=prod" "Project=Omniture" "Dataset=DATASET" "Owner=gaurav"
"Subdataset=SUBDATASET" "Date=2015-10-08" "Region=us-east-1" 
--visible-to-all-users 
--ec2-attributes file://omni_awssot.generic.ec2_attributes.json 
--ami-version "3.7.0" 
--log-uri s3://bucket-prod-log/DATASET_NAME/SUBDATASET_NAME/ 
--enable-debugging 
--instance-groups file://omni_awssot.generic.instance_groups.json 
--auto-terminate 
--applications file://omni_awssot.generic.applications.json 
--bootstrap-actions file://omni_awssot.generic.bootstrap_actions.json 
--steps file://omni_awssot.generic.steps.json
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
Monitoring
EMR WatchDog : Node.js
Duplicate Clusters
Failed Clusters
Long-running Clusters
Long-provisioning Clusters
CloudWatch Alarms
Monthly Billing
S3 Bucket Size
SNS Email Notifications
Amazon CloudWatch
Amazon SNS
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
Elasticity
Why be Elastic?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Cores Nodes Demand - 09/05/2015 Cores Nodes
Daily Processes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Core Nodes Demand - 09/20/2015 Core Nodes
No Clusters
Spike in Demand
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Cores Nodes Demand - 06/01/2015 Cores Nodes
Major Restatement
Demand > 10K EC2
Elasticity
Why be Elastic?
True Cloud Architecture
Spot is an Open Market
Scale Horizontally
Our Limit : 3,000 EC2/Region
Multiple Regions
Multiple Instance Types
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
• Optimization
Optimization
Data Management
Partition Data on S3
S3 Versioning/Lifecycle
Hadoop Run-time Params
Memory Tuning
Compress M & R Output
Combine Splits Input format
Security
Roles
Security Groups
AOL VPC
Score Card
Feature AWS
Pay for what you use ✔
Decouple Storage and Compute ✔
True Cloud Architecture ✔
Self Service Model ✔
Elastic & Scalable ✔
Global Infrastructure. BCDR. ✔
Quick & Easy Deployments ✔
Redshift External Tables on S3 ?
More languages for Lambda ?
AWS vs. In-House Cost
0 2 4 6
Service
Cost Comparison
AWS
In-House
Source : AOL & AWS Billing Tool
4xIn-House / Month
1xAWS / Month
** In-House cluster includes Storage, Power and Network cost.
AWS vs. In-House Cost
10/8/2015
Amazon Web Services
1/4th Cost of In-House Hadoop Infrastructure
1/4th Cost
Data Platforms. AOL Inc.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Cores Nodes Demand - 06/01/2015 Core…
Restatement Use Case
• Restate historical data going back 6 months
Availability Zones
10
550
EMR Clusters
24,000
Spot EC2 Instances
0
10
20
30
40
50
60
70
Timing Comparison
In-House
AWS
Tag All Resources
Infrastructure as CodeCommand Line Interface
JSON as configuration files
IAM Roles and Policies
Use of Application ID
Enable CloudTrail
S3 Lifecycle ManagementS3 Versioning
Separate Code/Data/Logs buckets
Keyless EMR Clusters
Hybrid Model
Enable Debugging
Create Multiple CLI Profiles
Multi-Factor Authentication
CloudWatch Billing Alarms
Spot EC2 Instances
SNS notifications for failures
Loosely coupled Apps
Scale Horizontally
Best Practices & Suggestions
Thank you

More Related Content

What's hot

Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderDatabricks
 
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSISMicrosoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSISMark Kromer
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!Chris Taylor
 
Benchmarking Aerospike on the Google Cloud - NoSQL Speed with Ease
Benchmarking Aerospike on the Google Cloud - NoSQL Speed with EaseBenchmarking Aerospike on the Google Cloud - NoSQL Speed with Ease
Benchmarking Aerospike on the Google Cloud - NoSQL Speed with EaseLynn Langit
 
AWS for the Data Professional
AWS for the Data ProfessionalAWS for the Data Professional
AWS for the Data ProfessionalLynn Langit
 
Change Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHChange Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHParis Data Engineers !
 
Microsoft Build 2018 Analytic Solutions with Azure Data Factory and Azure SQL...
Microsoft Build 2018 Analytic Solutions with Azure Data Factory and Azure SQL...Microsoft Build 2018 Analytic Solutions with Azure Data Factory and Azure SQL...
Microsoft Build 2018 Analytic Solutions with Azure Data Factory and Azure SQL...Mark Kromer
 
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMicrosoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMark Kromer
 
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...Amazon Web Services
 
Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering RolesAdam Doyle
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...Amazon Web Services
 
Mining public datasets using opensource tools: Zeppelin, Spark and Juju
Mining public datasets using opensource tools: Zeppelin, Spark and JujuMining public datasets using opensource tools: Zeppelin, Spark and Juju
Mining public datasets using opensource tools: Zeppelin, Spark and Jujuseoul_engineer
 
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...Amazon Web Services
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Amazon Web Services
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & DataductAmazon Web Services
 
Microsoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the CloudMicrosoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the CloudMark Kromer
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsAmazon Web Services
 

What's hot (20)

Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
 
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSISMicrosoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!
 
Benchmarking Aerospike on the Google Cloud - NoSQL Speed with Ease
Benchmarking Aerospike on the Google Cloud - NoSQL Speed with EaseBenchmarking Aerospike on the Google Cloud - NoSQL Speed with Ease
Benchmarking Aerospike on the Google Cloud - NoSQL Speed with Ease
 
AWS for the Data Professional
AWS for the Data ProfessionalAWS for the Data Professional
AWS for the Data Professional
 
AWS_Data_Pipeline
AWS_Data_PipelineAWS_Data_Pipeline
AWS_Data_Pipeline
 
Change Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHChange Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVH
 
Microsoft Build 2018 Analytic Solutions with Azure Data Factory and Azure SQL...
Microsoft Build 2018 Analytic Solutions with Azure Data Factory and Azure SQL...Microsoft Build 2018 Analytic Solutions with Azure Data Factory and Azure SQL...
Microsoft Build 2018 Analytic Solutions with Azure Data Factory and Azure SQL...
 
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMicrosoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview Slides
 
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering Roles
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
 
Mining public datasets using opensource tools: Zeppelin, Spark and Juju
Mining public datasets using opensource tools: Zeppelin, Spark and JujuMining public datasets using opensource tools: Zeppelin, Spark and Juju
Mining public datasets using opensource tools: Zeppelin, Spark and Juju
 
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
 
Microsoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the CloudMicrosoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the Cloud
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 

Similar to DW on AWS

AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWSAWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWSAmazon Web Services
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon RedshiftAmazon Web Services
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data PlatformShu-Jeng Hsieh
 
Loading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SFLoading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SFAmazon Web Services
 
Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyershuguk
 
Loading Data into Redshift with Lab
Loading Data into Redshift with LabLoading Data into Redshift with Lab
Loading Data into Redshift with LabAmazon Web Services
 
Solving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalSolving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalAvere Systems
 
Well-Architected for Security: Advanced Session
Well-Architected for Security: Advanced SessionWell-Architected for Security: Advanced Session
Well-Architected for Security: Advanced SessionAmazon Web Services
 
Loading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF LoftLoading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF LoftAmazon Web Services
 
Who's in your Cloud? Cloud State Monitoring
Who's in your Cloud? Cloud State MonitoringWho's in your Cloud? Cloud State Monitoring
Who's in your Cloud? Cloud State MonitoringKevin Hakanson
 
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...Amazon Web Services
 
Building a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for AnalystsBuilding a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for AnalystsAvere Systems
 
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...Rustem Feyzkhanov
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
 

Similar to DW on AWS (20)

AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWSAWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon Redshift
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
Loading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SFLoading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SF
 
Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyers
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Loading Data into Redshift with Lab
Loading Data into Redshift with LabLoading Data into Redshift with Lab
Loading Data into Redshift with Lab
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Solving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalSolving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute final
 
Well-Architected for Security: Advanced Session
Well-Architected for Security: Advanced SessionWell-Architected for Security: Advanced Session
Well-Architected for Security: Advanced Session
 
Loading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF LoftLoading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF Loft
 
Who's in your Cloud? Cloud State Monitoring
Who's in your Cloud? Cloud State MonitoringWho's in your Cloud? Cloud State Monitoring
Who's in your Cloud? Cloud State Monitoring
 
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
 
Building a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for AnalystsBuilding a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for Analysts
 
The Best of re:invent 2016
The Best of re:invent 2016The Best of re:invent 2016
The Best of re:invent 2016
 
Serverless SQL
Serverless SQLServerless SQL
Serverless SQL
 
Aws meetup 20190427
Aws meetup 20190427Aws meetup 20190427
Aws meetup 20190427
 
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 

Recently uploaded

Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 

Recently uploaded (20)

Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 

DW on AWS

  • 1. DW on AWS Gaurav Agrawal Data Platforms, AOL Inc.
  • 2. AOL Data Platforms Architecture 2014
  • 3. Data Stats & Insights Cluster Size 2 PB In-House Cluster 100 Nodes Raw Data/Day 2-3 TB Data Retention 13-24 Months
  • 4. Challenges with In-House Infrastructure Fixed Cost Slow Deployment Cycle Always On Self Serve Static : Not Scalable Outages Impact Production Upgrade Storage Compute
  • 5. AOL Data Platforms Architecture 2015 1 2 2 3 4 56
  • 7. Web Console and CLI Web Console for Training Setup IAM for users AWS Services Options S3 Data upload EMR Creation & Steps Try & Test multiple approaches CLI is your friend..!!!
  • 8. Migration • Web Console vs. CLI • Copy Existing Data to S3
  • 9. bucket-prod-control Environment Level Buckets Dev, QA, Production, Analyst Project Level Buckets Code, Data, Log, Extract and Control Compressed Snappy Data to GZIP Multi Platforms Support Best Compression Lowest storage cost Low cost for Data OUT bucket-dev bucket-qa bucket-prod bucket-analyst bucket-prod-code bucket-prod-log bucket-prod-data bucket-prod-extract 76% Less Storage 70K Saving/Year Copy Existing Data to S3
  • 10. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options
  • 11. EMR Design Options Transient Amazon S3 Elastic Cluster On-Demand vs. Reserved vs. Core NodesAmazon EMR vs. Persistent Cluster vs. local HDFS vs. Static Cluster Spot vs. Task Nodes
  • 12. AOL Data Platforms Architecture 2015
  • 13. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission - CLI
  • 14. EMR Jobs Submission - CLI In-house scheduler Common Utilities Provision EMR Push/Pull Data to S3 Job submission to Scheduler Database Load JSON Files Applications, Steps, Bootstrap,EC2 attributes, Instance Groups Future : Event Driven Design – Lambda, SQS
  • 15. EMR Jobs Submission - CLI aws emr create-cluster –name "prod_dataset_subdataset_2015-10-08" --tags "Env=prod" "Project=Omniture" "Dataset=DATASET" "Owner=gaurav" "Subdataset=SUBDATASET" "Date=2015-10-08" "Region=us-east-1" --visible-to-all-users --ec2-attributes file://omni_awssot.generic.ec2_attributes.json --ami-version "3.7.0" --log-uri s3://bucket-prod-log/DATASET_NAME/SUBDATASET_NAME/ --enable-debugging --instance-groups file://omni_awssot.generic.instance_groups.json --auto-terminate --applications file://omni_awssot.generic.applications.json --bootstrap-actions file://omni_awssot.generic.bootstrap_actions.json --steps file://omni_awssot.generic.steps.json
  • 16. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission – CLI • Monitoring
  • 17. Monitoring EMR WatchDog : Node.js Duplicate Clusters Failed Clusters Long-running Clusters Long-provisioning Clusters CloudWatch Alarms Monthly Billing S3 Bucket Size SNS Email Notifications Amazon CloudWatch Amazon SNS
  • 18. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission – CLI • Monitoring • Elasticity
  • 19. Elasticity Why be Elastic? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Cores Nodes Demand - 09/05/2015 Cores Nodes Daily Processes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Core Nodes Demand - 09/20/2015 Core Nodes No Clusters Spike in Demand 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Cores Nodes Demand - 06/01/2015 Cores Nodes Major Restatement Demand > 10K EC2
  • 20. Elasticity Why be Elastic? True Cloud Architecture Spot is an Open Market Scale Horizontally Our Limit : 3,000 EC2/Region Multiple Regions Multiple Instance Types
  • 21. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission – CLI • Monitoring • Elasticity • Optimization
  • 22. Optimization Data Management Partition Data on S3 S3 Versioning/Lifecycle Hadoop Run-time Params Memory Tuning Compress M & R Output Combine Splits Input format Security Roles Security Groups AOL VPC
  • 23. Score Card Feature AWS Pay for what you use ✔ Decouple Storage and Compute ✔ True Cloud Architecture ✔ Self Service Model ✔ Elastic & Scalable ✔ Global Infrastructure. BCDR. ✔ Quick & Easy Deployments ✔ Redshift External Tables on S3 ? More languages for Lambda ?
  • 24. AWS vs. In-House Cost 0 2 4 6 Service Cost Comparison AWS In-House Source : AOL & AWS Billing Tool 4xIn-House / Month 1xAWS / Month ** In-House cluster includes Storage, Power and Network cost.
  • 25. AWS vs. In-House Cost 10/8/2015 Amazon Web Services 1/4th Cost of In-House Hadoop Infrastructure 1/4th Cost Data Platforms. AOL Inc.
  • 26. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Cores Nodes Demand - 06/01/2015 Core… Restatement Use Case • Restate historical data going back 6 months Availability Zones 10 550 EMR Clusters 24,000 Spot EC2 Instances 0 10 20 30 40 50 60 70 Timing Comparison In-House AWS
  • 27. Tag All Resources Infrastructure as CodeCommand Line Interface JSON as configuration files IAM Roles and Policies Use of Application ID Enable CloudTrail S3 Lifecycle ManagementS3 Versioning Separate Code/Data/Logs buckets Keyless EMR Clusters Hybrid Model Enable Debugging Create Multiple CLI Profiles Multi-Factor Authentication CloudWatch Billing Alarms Spot EC2 Instances SNS notifications for failures Loosely coupled Apps Scale Horizontally Best Practices & Suggestions