Building a Bigdata Architecture on AWS
ARUN SIRIMALLA
Cloud Computing
• Cloud computing is the on-demand delivery of
compute power, database storage, applications and
other IT resources
• It is delivered through a cloud services platform via the internet with pay-as-you-go pricing
Why Cloud?
• No upfront investment in data centers and
servers
• Stop guessing capacity
• Stop spending money on running and
maintaining data centers
• Go global in minutes
• Disaster recovery
Overview of Amazon Web Services
Regions and Availability Zones
• Amazon EC2 is hosted in multiple locations world-wide
• Each region is a separate geographic area
• Each region has multiple, isolated locations known as Availability Zones
VPC
• Virtual datacenter in the cloud
• You can create your own public-facing subnet for your web servers and place your backend
systems such as databases or application servers in a private subnet
• You can create a hardware virtual private network connection between your corporate datacenter
and AWS
• Assign custom IP address range in each subnet
• Create internet gateways
• Leverage multiple layers of security
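The subnet layout described above can be sketched with Python's standard `ipaddress` module. This is only an illustration: the 10.0.0.0/16 range and the /24 subnet size are assumptions, not values from the slides.

```python
import ipaddress

# Hypothetical VPC address range; any private CIDR block would do.
vpc_cidr = ipaddress.ip_network("10.0.0.0/16")

# Split the VPC range into /24 subnets and take the first two.
subnets = list(vpc_cidr.subnets(new_prefix=24))
public_subnet = subnets[0]   # e.g. web servers, reachable through an internet gateway
private_subnet = subnets[1]  # e.g. databases and application servers

# AWS reserves 5 addresses in every subnet, so fewer are usable than the raw count.
usable_per_subnet = public_subnet.num_addresses - 5

print("public :", public_subnet)
print("private:", private_subnet)
print("usable addresses per subnet:", usable_per_subnet)
```

Keeping the public and private ranges disjoint like this is what lets route tables and security rules treat the two tiers differently.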
EC2
• Web service that provides secure, resizable compute capacity in the cloud
✓ On-Demand Instances
Applications with spiky or unpredictable workloads, or applications being developed or tested on Amazon EC2
✓ Reserved Instances
Steady-state or predictable usage, with the ability to make an upfront payment
✓ Spot Instances
Applications that have flexible start and end times
EBS
Create storage volumes and attach them to Amazon EC2 instances
General Purpose SSD
Designed for 99.999% availability
3 IOPS/GB, up to 10K IOPS
Provisioned IOPS SSD
Designed for I/O intensive applications such as large relational or NoSQL databases
Magnetic (standard)
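The General Purpose SSD figure above (3 IOPS/GB, up to 10K IOPS) is simple arithmetic. A minimal sketch, using only the two numbers from the slide and ignoring any minimum floor or burst behavior; the function name is hypothetical:

```python
def gp_ssd_baseline_iops(size_gb: int, iops_per_gb: int = 3, cap: int = 10_000) -> int:
    """Baseline IOPS from the slide's figures: 3 IOPS per GB, capped at 10K."""
    return min(size_gb * iops_per_gb, cap)


for size_gb in (100, 500, 4000):
    print(f"{size_gb} GB volume -> {gp_ssd_baseline_iops(size_gb)} baseline IOPS")
```

With these figures, volumes larger than about 3,334 GB gain capacity but no additional baseline IOPS, which is one reason to consider Provisioned IOPS SSD for I/O-intensive workloads.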
S3
• S3 is object-based storage that allows you to upload files
• Files can be 1 Byte to 5 TB
• Bucket names share a global namespace, so each name must be unique across all regions
• Designed for 99.99% availability
• Designed for 99.999999999% (11 nines) durability
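To get a feel for what 11 nines of durability means, the figure can be read as an average annual loss rate per object. That reading is an assumption made for illustration; the sketch below uses exact fractions to avoid floating-point rounding.

```python
from fractions import Fraction

# 99.999999999% durability, read as an average annual per-object loss rate of 10^-11.
durability = Fraction(99_999_999_999, 100_000_000_000)
annual_loss_rate = 1 - durability


def expected_annual_losses(object_count: int) -> Fraction:
    """Average number of objects lost per year for a given object count."""
    return object_count * annual_loss_rate


stored_objects = 10_000_000  # hypothetical bucket size
losses = expected_annual_losses(stored_objects)
years_per_loss = 1 / losses

print("expected losses per year:", losses)
print("average years per single lost object:", years_per_loss)
```

For ten million stored objects this works out to one lost object every 10,000 years on average.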
RDS
• Allows you to create and scale relational databases
• You cannot SSH or RDP to an RDS instance
• AWS does not give you a public or private IP address for the instance; instead, it gives you an endpoint to connect to
• Available and Durable, Secure, Inexpensive
Route 53
• Highly available and scalable cloud DNS web service
• You can create
✓ Both public and private DNS records
✓ A records, which resolve names to IP addresses
✓ CNAME records, which resolve one name to another
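How A and CNAME records work together can be shown with a toy resolver. This is a sketch of the record semantics only, not of Route 53 itself; the names and the address are made up.

```python
# Hypothetical records; names and addresses are illustrative only.
RECORDS = {
    "www.example.com": ("CNAME", "web.example.com"),
    "web.example.com": ("A", "192.0.2.10"),
}


def resolve(name: str, max_hops: int = 8) -> str:
    """Follow CNAME records until an A record yields an IP address."""
    for _ in range(max_hops):
        record_type, value = RECORDS[name]
        if record_type == "A":
            return value
        name = value  # CNAME: continue with the target name
    raise RuntimeError("too many CNAME hops")


print(resolve("www.example.com"))
```

The CNAME points one name at another name, and only the final A record carries an IP address.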
Direct Connect
• AWS Direct Connect makes it easy to establish a dedicated network connection from your
premises to AWS
• Private connectivity to your Amazon VPC
• Provides 1 Gbps and 10 Gbps connections
IAM
• Enables you to securely control access to AWS services and resources for your users
• Offered at no additional charge
• Use permissions to allow and deny user/group access to AWS resources
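The allow/deny behavior can be sketched as a small evaluator: a request is denied by default, a matching Allow permits it, and a matching Deny always wins. This is a simplified model, not the full IAM evaluation logic; the statements and bucket name are hypothetical, and wildcard matching is approximated with `fnmatchcase`.

```python
from fnmatch import fnmatchcase

# Hypothetical policy statements attached to one user or group.
STATEMENTS = [
    {"Effect": "Allow", "Action": "s3:*",
     "Resource": "arn:aws:s3:::example-bucket/*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject",
     "Resource": "arn:aws:s3:::example-bucket/*"},
]


def is_allowed(action: str, resource: str, statements=STATEMENTS) -> bool:
    """Default deny; an explicit Deny overrides any Allow."""
    allowed = False
    for statement in statements:
        matches = (fnmatchcase(action, statement["Action"])
                   and fnmatchcase(resource, statement["Resource"]))
        if not matches:
            continue
        if statement["Effect"] == "Deny":
            return False  # explicit deny always wins
        allowed = True
    return allowed  # no matching Allow means implicit deny


print(is_allowed("s3:GetObject", "arn:aws:s3:::example-bucket/data.csv"))
print(is_allowed("s3:DeleteObject", "arn:aws:s3:::example-bucket/data.csv"))
```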
EMR
• Managed Hadoop framework
• Fast and cost-effective way to process vast amounts of data across dynamically scalable Amazon EC2
instances
• Supported Applications
✓ Hadoop, Hive, HUE, Pig, HBase, Zookeeper, Spark and more
• Cost optimization using Spot fleet
CloudWatch
• Monitoring service for AWS resources and the applications you run on AWS
• Collect and track metrics, monitor log files, and set alarms
• View metrics for CPU utilization, data transfer, and disk usage activity from Amazon EC2 instances
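The idea behind a metric alarm can be sketched as a threshold check over the most recent datapoints. This is a simplified model: real CloudWatch alarms have more options, and the 80% CPU threshold and three-period window are assumptions.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return ALARM when every datapoint in the evaluation window breaches the threshold."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# Hypothetical CPU utilization samples (percent), one per period.
cpu_utilization = [40, 85, 90, 95]
print(alarm_state(cpu_utilization, threshold=80, evaluation_periods=3))
```

In the architecture described later, an alarm entering the ALARM state is what would trigger an SNS notification.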
CloudFormation
• Gives developers and systems admins an easy way to create and manage AWS resources,
provision and update them in an orderly and predictable fashion
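A CloudFormation template is a JSON (or YAML) document that declares resources. A minimal sketch that builds one in Python follows; the logical ID `BackupBucket` and the description are hypothetical, and the template declares a single versioned S3 bucket.

```python
import json

# Minimal template: one S3 bucket with versioning enabled.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Sketch: one S3 bucket for cluster backups",
    "Resources": {
        "BackupBucket": {  # hypothetical logical ID
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {"Status": "Enabled"},
            },
        }
    },
}

template_body = json.dumps(template, indent=2)
print(template_body)
```

Because the template is plain data, it can be kept in version control and applied repeatedly to get the same resources each time.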
SNS
• Fully managed messaging service
• Allows you to push messages directly to AWS resources
Bigdata Use Cases
• On-Demand Big Data Analytics
• Clickstream Analysis
• Data Warehousing
[Architecture diagram: US East Region, VPG, On-Prem]
Persistent clusters
Choosing right instance
• Memory Optimized – R3 and R4
• Compute Optimized – C3 and C4
• Storage Optimized – I2 and D2
Amazon Machine Image (AMI)
• Choosing the base AMI (Redhat, CentOS)
• Create your own AMI
• AMI creation using Packer
Bigdata Distribution
• Apache, Cloudera and Hortonworks
RDS
For storing data related to Hive, HUE, Sentry and Cloudera Manager
S3
Creating buckets for backups
Kerberos
KDC for authentication of HDFS services
Route 53 Records
Cloudera Manager, HUE, YARN and other applications
Placement Groups
• Logical grouping of instances within a single Availability zone
• Recommended for applications that benefit from low network latency
and high network throughput
• There is no charge for creating a placement group
Options to create Long-running clusters
• Puppet/CloudFormation
• Cloudera Director
✓ Deploy and manage the lifecycle of Cloudera Enterprise in the cloud
✓ Integrated with AWS, Microsoft Azure and Google Cloud Platform
Raw Data – 100 TB
Data after Replication – 300 TB
25 % Reserve for M/R Jobs
Raw Data – 15 TB
Data after Replication – 45 TB
25 % Reserve for M/R Jobs
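The two sizing examples above follow the same arithmetic: raw data multiplied by a replication factor of 3, plus a reserve for M/R jobs. A minimal sketch follows; it assumes the 25% reserve is added on top of the replicated data, which the slides do not state explicitly. If the reserve is instead meant as 25% of total capacity, divide the replicated size by 0.75.

```python
def cluster_storage_tb(raw_tb, replication_factor=3, reserve_fraction=0.25):
    """Return (replicated_tb, total_tb) for a given raw data size.

    Assumes the reserve is an extra fraction on top of the replicated data.
    """
    replicated = raw_tb * replication_factor
    reserve = replicated * reserve_fraction
    return replicated, replicated + reserve


for raw in (100, 15):
    replicated, total = cluster_storage_tb(raw)
    print(f"raw {raw} TB -> replicated {replicated} TB -> with reserve {total} TB")
```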
Security
Best practices for persistent/long-running cluster
• IAM policies for IAM users
• Integrate with AWS Directory Service for Microsoft AD so that the users are in
sync across the corporate network and AWS
• All of the instances should reside in private subnets, except for the NAT instance,
which resides in a public subnet and occasionally connects to the internet, either to
set up local repositories or to download packages
System Monitoring
• AWS CloudWatch and AWS SNS for infrastructure monitoring
• Cloudera Manager for monitoring CDH instances, services and clusters
• Nagios and Ganglia for cluster monitoring
Transient Clusters
EMR is a PaaS provided by Amazon to execute Hadoop and other Big Data workloads such as Hive, Pig,
Spark, Presto, Mahout, Oozie and others.
Amazon S3 as your cluster persistent data store
Advantages of using S3 as data repository:
• S3 is designed for 99.99% data availability and 99.999999999% data durability
• Serves as a backup of the data
• Serves as a Disaster Recovery strategy with an RTO (Recovery Time Objective) measured in hours
• Data residing in S3 can be leveraged for transient workloads
• S3 supports cross region replication for redundancy
• S3 supports versioning of data, which protects against accidental overwrites; versioning with
MFA Delete also helps prevent accidental deletion of objects
• Using lifecycle configuration, archiving can be enabled for older data which is moved to Glacier
• Data movement between S3 and EMR instances would be faster
when compared to general EC2 instances
• No separate licensing costs
• Lesser time to bootstrap clusters
• Cost optimization using Spot instances
• Transient clusters are a better approach for transient workloads
• No single point of failure. The data is always consistently backed up
in S3
• If the cluster performing the job goes down, another cluster could
be instantiated to get it done.
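The lifecycle bullet in the S3 list above (archiving older data to Glacier) could be expressed as a lifecycle configuration like the one below. The rule name, the `raw/` prefix and the 90-day threshold are assumptions chosen for illustration; the sketch only builds the configuration and does not call AWS.

```python
import json

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-older-data",    # hypothetical rule name
            "Filter": {"Prefix": "raw/"},  # hypothetical key prefix
            "Status": "Enabled",
            "Transitions": [
                # Move objects to Glacier after an assumed 90 days.
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

A dictionary of this shape is what boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument, together with a `Bucket` name.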
Advantages
• Source of truth data in AWS S3
• No resource contention: high-priority production jobs don't have to wait
for resources to be freed up by upstream production jobs
• Each job could get its own cluster
• Easily meet business SLAs
With flexibility of providing capacity based on the use cases, SLAs for
various LOBs could be easily met
• Using Spot instances can give a discount of approx. 70% in most common
use cases
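The effect of the approx. 70% Spot discount on a transient cluster can be sketched with simple arithmetic. The node count, run time and hourly rate below are made-up values, not real AWS prices.

```python
def job_cost(node_count, hours, hourly_rate, spot_discount=0.0):
    """Cost of one transient cluster run; spot_discount is a fraction (0.70 = 70%)."""
    return node_count * hours * hourly_rate * (1 - spot_discount)


# Hypothetical job: 20 nodes for 4 hours at an assumed $0.50 per node-hour.
on_demand_cost = job_cost(20, 4, 0.50)
spot_cost = job_cost(20, 4, 0.50, spot_discount=0.70)

print(f"on-demand: ${on_demand_cost:.2f}")
print(f"spot     : ${spot_cost:.2f}")
```

Because each job gets its own cluster, the saving applies per run; the trade-off is that Spot capacity can be reclaimed, so the job should tolerate being restarted on a new cluster.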
Security
Best practices for securely accessing data in S3 by transient cluster(s)
• Define S3 bucket policies, which allow or deny
specific actions
• Define S3 ACLs, which grant permissions
• Use IAM roles for instances to access data in S3 without exposing AWS access
and secret keys
• Define VPC endpoint for S3 bucket so that the data is transported over
private connection between S3 and instances in VPC (Virtual Private
Cloud)
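The bucket policy and VPC endpoint practices above can be combined: a bucket policy can deny any request that does not arrive through a given VPC endpoint. A minimal sketch follows; the bucket name and endpoint ID are hypothetical, and the code only builds the policy document.

```python
import json

bucket = "example-emr-data"                 # hypothetical bucket name
vpc_endpoint_id = "vpce-0123456789abcdef0"  # hypothetical VPC endpoint ID

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAccessOutsideVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
            # Deny every request that did not come through the named endpoint.
            "Condition": {"StringNotEquals": {"aws:SourceVpce": vpc_endpoint_id}},
        }
    ],
}

print(json.dumps(bucket_policy, indent=2))
```

A blanket deny like this also blocks console access from outside the VPC, so it should be tested on a non-production bucket first.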
Thank you!
Upcoming Sessions
Amazon EC2, S3 and EMR – Sep 26
Cost Optimization with Spot instances (EMR) – Oct 3
Deep Dive on EC2 and S3 – Oct 10

PHP Frameworks: I want to break free (IPC Berlin 2024)
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 

Building a Bigdata Architecture on AWS

  • 1. Building a Bigdata Architecture on AWS ARUN SIRIMALLA
  • 2. Cloud Computing • Cloud computing is the on-demand delivery of compute power, database storage, applications and other IT resources • It's a cloud services platform delivered via the internet with pay-as-you-go pricing
  • 3. Why Cloud? • No upfront investment in data centers and servers • Stop guessing capacity • Stop spending money on running and maintaining data centers • Go global in minutes • Disaster recovery
  • 5. Regions and Availability Zones • Amazon EC2 is hosted in multiple locations worldwide • Each region is a separate geographic area • Each region has multiple, isolated locations known as Availability Zones VPC • Virtual datacenter in the cloud • You can create your own public-facing subnet for your web servers and place your backend systems, such as databases or application servers, in a private subnet • You can create a hardware virtual private network connection between your corporate datacenter and AWS • Assign a custom IP address range in each subnet • Create internet gateways • Leverage multiple layers of security EC2 • Web service that provides secure, resizable compute capacity in the cloud ✓ On-Demand Instances: applications with spiky or unpredictable workloads, or applications being developed or tested on Amazon EC2 ✓ Reserved Instances: steady-state or predictable usage, for users able to make an upfront payment ✓ Spot Instances: applications that have flexible start and end times
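
The public/private subnet layout from the VPC slide can be sketched with Python's standard `ipaddress` module. This is only an illustration: the `10.0.0.0/16` VPC range and the `/24` subnet size are assumptions, not values from the slides.

```python
import ipaddress

# Hypothetical VPC range; any private CIDR block would work the same way.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Carve the VPC range into /24 blocks and take the first two as subnets.
subnets = list(vpc.subnets(new_prefix=24))
public_subnet = subnets[0]   # web servers, routed through an internet gateway
private_subnet = subnets[1]  # databases and application servers

for name, net in [("public", public_subnet), ("private", private_subnet)]:
    print(f"{name}: {net} ({net.num_addresses} addresses)")
```

The two ranges never overlap because both come from the same `subnets()` split of the parent block.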
  • 6. EBS • Create storage volumes and attach them to Amazon EC2 instances • General Purpose SSD: designed for 99.999% availability; 3 IOPS/GB up to 10K IOPS • Provisioned IOPS SSD: designed for I/O-intensive applications such as large relational or NoSQL databases • Magnetic (standard) S3 • S3 is object-based storage that lets you upload files • Files can be 1 byte to 5 TB • Bucket names share a single global namespace, although each bucket is created in a region • Designed for 99.99% availability • Designed for durability of 99.999999999% RDS • Lets you create and scale relational databases • You cannot SSH or RDP to an RDS instance • AWS does not give you a public or private IP address; instead it gives you an endpoint to connect to • Available and durable, secure, inexpensive
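
The General Purpose SSD figure on the EBS slide (3 IOPS/GB, up to 10K IOPS) works out as in the small sketch below. It uses only the two numbers quoted on the slide and ignores any minimum baseline or burst behaviour; the function name is made up for illustration.

```python
def gp_ssd_baseline_iops(size_gb, iops_per_gb=3, max_iops=10_000):
    """Baseline IOPS for a volume, using the slide's ratio and cap."""
    return min(size_gb * iops_per_gb, max_iops)

# Example volume sizes (hypothetical).
for size_gb in (100, 500, 5000):
    print(f"{size_gb} GB -> {gp_ssd_baseline_iops(size_gb)} IOPS")
```

With these numbers, volumes larger than about 3,334 GB gain no further baseline IOPS, which is one reason to consider Provisioned IOPS SSD for I/O-intensive databases.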
  • 7. Route 53 • Highly available and scalable cloud DNS web service • You can create ✓ Both public and private DNS records ✓ A records, which resolve names to IP addresses ✓ CNAME records, which resolve one name to another Direct Connect • AWS Direct Connect makes it easy to establish a dedicated network connection from your premises to AWS • Private connectivity to your Amazon VPC • Provides 1 Gbps and 10 Gbps connections IAM • Enables you to securely control access to AWS services and resources for your users • Offered at no additional charge • Use permissions to allow and deny user/group access to AWS resources
  • 8. EMR • Managed Hadoop framework • Fast and cost-effective way to process vast amounts of data across dynamically scalable Amazon EC2 instances • Supported applications ✓ Hadoop, Hive, HUE, Pig, HBase, ZooKeeper, Spark and more • Cost optimization using Spot Fleet CloudWatch • Monitoring service for AWS resources and the applications you run on AWS • Collect and track metrics, monitor log files and set alarms • View metrics for CPU utilization, data transfer, and disk usage activity from Amazon EC2 instances CloudFormation • Gives developers and systems admins an easy way to create and manage AWS resources, and to provision and update them in an orderly and predictable fashion SNS • Fully managed messaging service • Allows you to push messages directly to AWS resources
  • 9. Bigdata Use cases • On-Demand Big Data Analytics • Clickstream Analysis • Data Warehousing
  • 10.
  • 11. Architecture (diagram labels: US East Region, VPG, VPG, On-Prem)
  • 12. Persistent clusters Choosing the right instance • Memory Optimized: R3 and R4 • CPU Optimized: C3 and C4 • Storage Optimized: I2 and D2 Amazon Machine Image (AMI) • Choosing the base AMI (Red Hat, CentOS) • Create your own AMI • AMI creation using Packer Bigdata Distribution • Apache, Cloudera and Hortonworks
  • 13. RDS: for storing data related to Hive, HUE, Sentry and Cloudera Manager • S3: creating buckets for backups • Kerberos: KDC for authentication of HDFS services • Route 53: records for Cloudera Manager, HUE, YARN and other applications
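
As a rough sketch of the Route 53 records mentioned on this slide, the snippet below builds a change batch containing one A record and one CNAME. The host names and IP address are hypothetical. The dictionary follows the shape expected by the `ChangeBatch` argument of boto3's `change_resource_record_sets` call, but nothing is sent to AWS here.

```python
import json

def a_record(name, ip, ttl=300):
    """UPSERT change for an A record (name -> IP address)."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        },
    }

def cname_record(name, target, ttl=300):
    """UPSERT change for a CNAME record (name -> another name)."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        },
    }

# Hypothetical private zone names and address, for illustration only.
change_batch = {
    "Comment": "Friendly names for cluster services",
    "Changes": [
        a_record("cm.bigdata.example.internal.", "10.0.1.10"),
        cname_record("hue.bigdata.example.internal.", "cm.bigdata.example.internal."),
    ],
}

print(json.dumps(change_batch, indent=2))
```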
  • 14. Placement Groups • Logical grouping of instances within a single Availability Zone • Recommended for applications that benefit from low network latency and high network throughput • There is no charge for creating a placement group Options to create long-running clusters • Puppet/CloudFormation • Cloudera Director ✓ Deploy and manage the lifecycle of Cloudera Enterprise in the cloud ✓ Integrated with AWS, Microsoft Azure and Google Cloud Platform
  • 15. Raw Data: 100 TB; Data after Replication: 300 TB; 25% Reserve for M/R Jobs | Raw Data: 15 TB; Data after Replication: 45 TB; 25% Reserve for M/R Jobs
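
The sizing figures on this slide follow from a replication factor of 3 (100 TB becomes 300 TB, 15 TB becomes 45 TB). The sketch below reproduces that arithmetic. How the 25% reserve is applied is not stated on the slide, so the code assumes the reserve is 25% of total cluster capacity; treat that part as an assumption.

```python
def required_capacity_tb(raw_tb, replication=3, reserve_fraction=0.25):
    """Return (replicated_tb, total_tb) for a given raw data size.

    Assumption: the reserve for M/R jobs is a fraction of TOTAL capacity,
    so the replicated data must fit in the remaining (1 - reserve) share.
    """
    replicated_tb = raw_tb * replication
    total_tb = replicated_tb / (1 - reserve_fraction)
    return replicated_tb, total_tb

for raw_tb in (100, 15):
    replicated_tb, total_tb = required_capacity_tb(raw_tb)
    print(f"raw {raw_tb} TB -> replicated {replicated_tb} TB -> total {total_tb:.0f} TB")
```

If the reserve were instead meant as 25% on top of the replicated data, the last step would become `replicated_tb * 1.25`.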
  • 16. Security Best practices for persistent/long-running clusters • IAM policies for IAM users • Integrate with AWS Directory Service for Microsoft AD so that users are in sync across the corporate network and AWS • All of the instances should reside in private subnets, except for the NAT instance, which resides in a public subnet to occasionally connect to the internet, either to set up local repositories or to download packages System Monitoring • AWS CloudWatch and AWS SNS for infrastructure monitoring • Cloudera Manager for monitoring CDH instances, services and clusters • Nagios and Ganglia for cluster monitoring
  • 17. Transient Clusters EMR is a PaaS provided by Amazon to execute Hadoop and other Big Data workloads such as Hive, Pig, Spark, Presto, Mahout, Oozie and others. Amazon S3 as your cluster's persistent data store. Advantages of using S3 as a data repository: • S3 is designed for 99.99% data availability and 99.999999999% data durability • Serves as a backup of the data • Serves as a Disaster Recovery strategy with RTO (Recovery Time Objective) ranging in hours • Data residing in S3 can be leveraged for transient workloads • S3 supports cross-region replication for redundancy • S3 supports versioning of data, which protects against accidental overwrites and, when combined with MFA Delete, helps prevent accidental deletion of objects • Using lifecycle configuration, archiving can be enabled so that older data is moved to Glacier
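
The lifecycle-based archiving mentioned in the last bullet can be sketched as a configuration document. The prefix, rule ID and 90-day threshold below are hypothetical. The dictionary follows the shape accepted by the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`, but the snippet only builds and prints it.

```python
import json

def glacier_rule(prefix, days, rule_id="archive-old-data"):
    """Lifecycle rule that transitions objects under `prefix` to Glacier
    once they are `days` days old."""
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
    }

# Hypothetical choice: archive everything under raw/ after 90 days.
lifecycle_configuration = {"Rules": [glacier_rule("raw/", 90)]}

print(json.dumps(lifecycle_configuration, indent=2))
```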
  • 18. Advantages • Data movement between S3 and EMR instances is faster than with general EC2 instances • No separate licensing costs • Less time to bootstrap clusters • Cost optimization using Spot instances • Transient clusters are a better approach for transient workloads • No single point of failure: the data is always consistently backed up in S3 • If the cluster performing the job goes down, another cluster can be instantiated to get it done
  • 19. • Source-of-truth data in AWS S3 • No resource contention: high-priority production jobs don't have to wait for resources to be freed up by upstream production jobs • Each job can get its own cluster • Easily meet business SLAs: with the flexibility to provide capacity per use case, SLAs for the various lines of business (LOBs) can be met easily • Using Spot instances gives a discount of approx. 70% in most common use cases
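
To make the approx. 70% Spot discount concrete, here is a small cost comparison for a transient cluster. The node count, run time and hourly rate are invented for illustration. Only the 70% figure comes from the slide, and actual Spot pricing varies.

```python
def cluster_cost(nodes, hours, on_demand_rate, spot_discount=0.0):
    """Cost of running `nodes` instances for `hours` at the given hourly
    rate, optionally reduced by a Spot discount (0.70 means 70% off)."""
    return nodes * hours * on_demand_rate * (1 - spot_discount)

# Hypothetical job: 10 nodes for 4 hours at $1.00 per node-hour.
on_demand = cluster_cost(10, 4, 1.00)
spot = cluster_cost(10, 4, 1.00, spot_discount=0.70)

print(f"on-demand: ${on_demand:.2f}")
print(f"spot:      ${spot:.2f}")
print(f"saved:     ${on_demand - spot:.2f}")
```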
  • 20. Security Best practices for securely accessing data in S3 from transient cluster(s) • Define S3 bucket policies, which allow or deny specific actions • Define S3 ACLs, which grant permissions • Use IAM roles so that instances can access data in S3 without exposing AWS access and secret keys • Define a VPC endpoint for S3 so that data travels over a private connection between S3 and the instances in the VPC (Virtual Private Cloud)
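
One way to combine the bucket-policy and VPC-endpoint bullets is a policy that denies S3 access unless the request arrives through a specific endpoint, using the `aws:SourceVpce` condition key. The sketch below only builds the policy document; the bucket name and endpoint ID are placeholders.

```python
import json

def vpce_only_bucket_policy(bucket, vpce_id):
    """Bucket policy that denies all S3 actions on `bucket` unless the
    request comes through the VPC endpoint `vpce_id`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }

# Placeholder bucket name and endpoint ID, not real resources.
policy = vpce_only_bucket_policy("example-emr-data", "vpce-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

A blanket deny like this also blocks requests made from outside the VPC, such as from the console, so it is usually scoped carefully before being applied.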
  • 22. Upcoming Sessions • Amazon EC2, S3 and EMR: Sep 26 • Cost Optimization with Spot instances (EMR): Oct 3 • Deep Dive on EC2 and S3: Oct 10