Building a Bigdata
Architecture on AWS
ARUN SIRIMALLA
Cloud Computing
• Cloud computing is the on-demand delivery of
compute power, database storage, applications and
other IT resources
• It is delivered through a cloud services platform over the internet with
pay-as-you-go pricing
Why Cloud?
• No upfront investment in data centers and
servers
• Stop guessing capacity
• Stop spending money on running and
maintaining data centers
• Go global in minutes
• Disaster recovery
Overview of Amazon Web Services
Regions and Availability Zones
• Amazon EC2 is hosted in multiple locations world-wide
• Each region is a separate geographic area
• Each region has multiple, isolated locations known as Availability Zones
VPC
• Virtual datacenter in the cloud
• You can create a public-facing subnet for your web servers and place backend
systems such as databases or application servers in a private subnet
• You can create a hardware virtual private network (VPN) connection between your corporate
data center and AWS
• Assign custom IP address range in each subnet
• Create internet gateways
• Leverage multiple layers of security
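The subnet layout described above can be sketched with Python's standard ipaddress module. This is only an illustration: the 10.0.0.0/16 range, the /24 subnet size and the names public_subnet and private_subnet are assumptions, not values from the talk.

```python
import ipaddress

# Hypothetical VPC address range; any private (RFC 1918) block would do.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Split the /16 into /24 subnets and take the first two for illustration.
subnets = list(vpc.subnets(new_prefix=24))
public_subnet = subnets[0]   # would hold the web servers
private_subnet = subnets[1]  # would hold databases and application servers


def subnet_for(address: str) -> str:
    """Return which illustrative subnet an IP address falls in."""
    ip = ipaddress.ip_address(address)
    if ip in public_subnet:
        return "public"
    if ip in private_subnet:
        return "private"
    return "outside"


print(public_subnet, private_subnet)
print(subnet_for("10.0.1.25"))
```

Keeping the two ranges non-overlapping is what lets routing and security rules treat web-facing and backend hosts differently.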
EC2
• Web service that provides secure, resizable compute capacity in the cloud
✓ On-Demand Instances
Applications with spiky or unpredictable workloads, or applications being developed or tested on Amazon EC2
✓ Reserved Instances
Steady-state or predictable usage, where you are able to make an upfront payment
✓ Spot Instances
Applications that have flexible start and end times
EBS
Create storage volumes and attach them to Amazon EC2 instances
General Purpose SSD
Designed for 99.999% availability
3 IOPS per GB, up to 10,000 IOPS
Provisioned IOPS SSD
Designed for I/O intensive applications such as large relational or NoSQL databases
Magnetic (standard)
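The General Purpose SSD figures above amount to a small formula. The sketch below uses the 3 IOPS per GB rate and the 10,000 IOPS ceiling from the slide. The 100 IOPS floor for small volumes is not on the slide and is included as an assumption based on AWS documentation for gp2 volumes. AWS has revised these limits over time, so treat the numbers as illustrative.

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Estimate baseline IOPS for a General Purpose SSD volume.

    Rate and ceiling come from the slide; the floor of 100 is an assumption.
    """
    per_gib_rate = 3       # IOPS per GiB (from the slide)
    ceiling = 10_000       # maximum baseline IOPS (from the slide)
    floor = 100            # minimum baseline IOPS (assumption, not on the slide)
    return max(floor, min(per_gib_rate * size_gib, ceiling))


for size in (10, 100, 5000):
    print(size, "GiB ->", gp2_baseline_iops(size), "IOPS")
```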
S3
• S3 is object-based storage that lets you upload files
• Objects can range from 0 bytes to 5 TB
• Bucket names share a global namespace, so each name must be unique across all regions
• Designed for 99.99% availability
• Designed for 99.999999999% (11 nines) durability
RDS
• Allows you to create and scale relational databases
• You cannot SSH or RDP into an RDS instance
• You connect through a DNS endpoint that AWS provides, rather than through a fixed public or private IP address
• Available and Durable, Secure, Inexpensive
Route 53
• Highly available and scalable cloud DNS web service
• You can create
✓ Both public and private DNS records
✓ A records, which resolve names to IP addresses
✓ CNAME records, which resolve one name to another
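The difference between the two record types can be shown with a toy resolver. This is a sketch in plain Python, not Route 53 itself. The host names and the IP address are hypothetical.

```python
# Hypothetical private zone: one A record and one CNAME that points at it.
records = {
    "cm.example.internal": ("A", "10.0.1.10"),
    "hue.example.internal": ("CNAME", "cm.example.internal"),
}


def resolve(name, max_hops=8):
    """Follow CNAME aliases until an A record is found, or give up."""
    for _ in range(max_hops):
        if name not in records:
            return None
        record_type, value = records[name]
        if record_type == "A":
            return value   # A record: the name maps straight to an IP address
        name = value       # CNAME: the name maps to another name, so look again
    return None


print(resolve("hue.example.internal"))
```

The hop limit is only there so that a misconfigured alias loop cannot run forever.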
Direct Connect
• AWS Direct Connect makes it easy to establish a dedicated network connection from your
premises to AWS
• Private connectivity to your Amazon VPC
• Provides 1 Gbps and 10 Gbps dedicated connections
IAM
• Enables you to securely control access to AWS services and resources for your users
• Offered at no additional charge
• Use permissions to allow and deny user/group access to AWS resources
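How allow and deny permissions combine can be sketched as follows. This is a heavily simplified model of policy evaluation, assuming only that an explicit Deny overrides any Allow and that anything not allowed is denied by default. The statements and the bucket name example-backups are hypothetical, and real IAM evaluation involves more policy types and conditions.

```python
from fnmatch import fnmatchcase

# Hypothetical statements: allow all S3 actions on a bucket, but deny deletes.
statements = [
    {"Effect": "Allow", "Action": "s3:*",
     "Resource": "arn:aws:s3:::example-backups/*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject",
     "Resource": "arn:aws:s3:::example-backups/*"},
]


def is_allowed(statements, action, resource):
    """Explicit Deny wins; otherwise any matching Allow; otherwise denied."""
    allowed = False
    for statement in statements:
        matches = (fnmatchcase(action, statement["Action"])
                   and fnmatchcase(resource, statement["Resource"]))
        if not matches:
            continue
        if statement["Effect"] == "Deny":
            return False   # an explicit deny overrides every allow
        allowed = True
    return allowed         # stays False when nothing matched (implicit deny)


print(is_allowed(statements, "s3:GetObject", "arn:aws:s3:::example-backups/a.csv"))
print(is_allowed(statements, "s3:DeleteObject", "arn:aws:s3:::example-backups/a.csv"))
```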
EMR
• Managed Hadoop framework
• Fast and cost-effective way to process vast amounts of data across dynamically scalable Amazon EC2
instances
• Supported applications
✓ Hadoop, Hive, HUE, Pig, HBase, ZooKeeper, Spark and more
• Cost optimization using Spot Fleet
CloudWatch
• Monitoring service for AWS resources and the applications you run on AWS
• Collect, track metrics, monitor log files and set alarms
• View metrics for CPU utilization, data transfer, and disk usage activity from Amazon EC2 instances
CloudFormation
• Gives developers and systems admins an easy way to create and manage AWS resources,
provision and update them in an orderly and predictable fashion
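As a rough illustration of what a template looks like, the sketch below builds a minimal one as a Python dictionary and serializes it to JSON. The logical name BackupBucket and the description text are assumptions chosen for this example. Nothing here calls AWS.

```python
import json

# Minimal illustrative template: a single versioned S3 bucket.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Illustrative stack: one S3 bucket for cluster backups",
    "Resources": {
        "BackupBucket": {                      # hypothetical logical name
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {"Status": "Enabled"},
            },
        }
    },
}

# The JSON text is what would be handed to CloudFormation as the template body.
template_body = json.dumps(template, indent=2)
print(template_body)
```

Because the whole stack is described as data, the same file can be reviewed, versioned and re-applied, which is where the "orderly and predictable" part comes from.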
SNS
• Fully managed messaging service
• Allows you to push messages directly to AWS resources
Bigdata Use cases
• On-Demand Big Data Analytics
• Clickstream Analysis
• Data Warehousing
[Architecture diagram: US East Region, VPG, On-Prem]
Persistent clusters
Choosing the right instance
• Memory Optimized – R3 and R4
• Compute Optimized – C3 and C4
• Storage Optimized – I2 and D2
Amazon Machine Image (AMI)
• Choosing the base AMI (Red Hat, CentOS)
• Create your own AMI
• AMI creation using Packer
Bigdata Distribution
• Apache, Cloudera and Hortonworks
RDS
Stores the metadata databases for Hive, HUE, Sentry and Cloudera
Manager
S3
Creating buckets for backups
Kerberos
KDC for authentication of HDFS services
Route 53 Records
Cloudera Manager, HUE, YARN and other applications
Placement Groups
• Logical grouping of instances within a single Availability Zone
• Recommended for applications that benefit from low network latency
and high network throughput
• There is no charge for creating a placement group
Options to create Long-running clusters
• Puppet/CloudFormation
• Cloudera Director
✓ Deploy and manage the lifecycle of Cloudera Enterprise in the cloud
✓ Integrated with AWS, Microsoft Azure and Google Cloud Platform
Raw Data – 100 TB
Data after Replication – 300 TB
25 % Reserve for M/R Jobs
Raw Data – 15 TB
Data after Replication – 45 TB
25 % Reserve for M/R Jobs
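The sizing figures above follow from simple arithmetic, sketched below. The replication factor of 3 is implied by the slide (100 TB becomes 300 TB). How the 25% reserve is applied is not stated, so this sketch assumes it is added on top of the replicated data. Reserving 25% of the total capacity instead would give a larger number.

```python
def required_capacity_tb(raw_tb, replication_factor=3, reserve_fraction=0.25):
    """Return (replicated size, size including the reserve) in TB.

    Assumption: the reserve for M/R jobs is added on top of replicated data.
    """
    replicated = raw_tb * replication_factor
    with_reserve = replicated * (1 + reserve_fraction)
    return replicated, with_reserve


for raw in (100, 15):
    replicated, total = required_capacity_tb(raw)
    print(f"raw {raw} TB -> replicated {replicated} TB -> with reserve {total} TB")
```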
Security
Best practices for persistent/long-running cluster
• IAM policies for IAM users
• Integrate with AWS Directory Service for Microsoft Active Directory so that users stay in
sync between the corporate network and AWS
• All instances should reside in private subnets, except for the NAT instance,
which resides in a public subnet and occasionally connects to the internet to
set up local repositories or download packages
System Monitoring
• Amazon CloudWatch and Amazon SNS for infrastructure monitoring
• Cloudera Manager for monitoring CDH instances, services and clusters
• Nagios and Ganglia for cluster monitoring
Transient Clusters
EMR is a PaaS provided by Amazon to execute Hadoop and other Big Data workloads such as Hive, Pig,
Spark, Presto, Mahout, Oozie and others.
Amazon S3 as your cluster persistent data store
Advantages of using S3 as data repository:
• S3 is designed for 99.99% availability and 99.999999999% durability
• Serves as a backup of the data
• Serves as a Disaster Recovery strategy with RTO (Recovery Time Objective) ranging in hours
• Data residing in S3 can be reused by transient workloads
• S3 supports cross-region replication for redundancy
• S3 supports versioning, which protects against accidental overwrites, and MFA Delete, which
helps prevent accidental deletion of objects
• Using lifecycle configuration, older data can be archived to Glacier
• Data movement between S3 and EMR instances is expected to be faster
than between S3 and general EC2 instances
• No separate licensing costs
• Less time to bootstrap clusters
• Cost optimization using Spot Instances
• Transient clusters are a better approach for transient workloads
• No single point of failure: the data is always backed up
in S3
• If the cluster performing the job goes down, another cluster could
be instantiated to get it done.
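The lifecycle archiving mentioned above is driven by a small configuration document. The sketch below builds one as a Python dictionary in the shape that the S3 lifecycle API expects. The rule ID, the logs/ prefix and the 90-day threshold are assumptions for illustration, and nothing here calls AWS.

```python
import json

# Illustrative rule: move objects under a prefix to Glacier after 90 days.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-older-data",          # hypothetical rule name
            "Filter": {"Prefix": "logs/"},       # hypothetical key prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"},  # 90 days is an assumption
            ],
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

A dictionary of this shape is what a client library would send when setting the lifecycle configuration on a bucket.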
Advantages
• Source of truth data in AWS S3
• No resource contention: high-priority production jobs do not have to wait
for resources to be freed up by upstream production jobs
• Each job could get its own cluster
• Easily meet business SLAs
With the flexibility to provision capacity per use case, SLAs for
various lines of business (LOBs) can be met more easily
• Spot Instances can offer a discount of approximately 70% in most common
use cases
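The effect of the Spot discount on a transient cluster can be estimated with a short calculation. The 70% figure is the approximate discount quoted above. The node count, run time and hourly rate are hypothetical, and actual Spot prices vary by instance type and over time.

```python
def cluster_cost(nodes, hours, on_demand_hourly, spot_discount=0.70):
    """Return (on-demand cost, estimated Spot cost) for one cluster run."""
    on_demand = nodes * hours * on_demand_hourly
    spot = on_demand * (1 - spot_discount)
    # Round to cents so floating-point noise does not show up in the result.
    return round(on_demand, 2), round(spot, 2)


# Hypothetical job: 10 nodes for 4 hours at 0.50 USD per node-hour.
on_demand, spot = cluster_cost(nodes=10, hours=4, on_demand_hourly=0.50)
print(f"on-demand: {on_demand} USD, spot estimate: {spot} USD")
```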
Security
Best practices for securely accessing data in S3 by transient cluster(s)
• Define S3 bucket policies, which allow or deny
specific actions
• Define S3 ACLs, which grant permissions on buckets and objects
• Use IAM roles so that instances can access data in S3 without exposing AWS access
and secret keys
• Define a VPC endpoint for S3 so that data travels over a
private connection between S3 and the instances in the VPC (Virtual Private
Cloud)
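Bucket policies and VPC endpoints can be combined. The sketch below builds a bucket policy that denies requests not arriving through a given VPC endpoint, using the aws:sourceVpce condition key. The bucket name and endpoint ID are hypothetical. A policy like this also blocks access from outside the VPC, including the console, so it is only a starting point.

```python
import json

BUCKET = "example-emr-data"         # hypothetical bucket name
VPC_ENDPOINT_ID = "vpce-1a2b3c4d"   # hypothetical VPC endpoint ID

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAccessOutsideVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            # Deny any request whose source VPC endpoint is not the one above.
            "Condition": {
                "StringNotEquals": {"aws:sourceVpce": VPC_ENDPOINT_ID}
            },
        }
    ],
}

print(json.dumps(bucket_policy, indent=2))
```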
Thank you!
Upcoming Sessions
Amazon EC2, S3 and EMR - Sep 26
Cost Optimization with Spot Instances (EMR) - Oct 3
Deep Dive on EC2 and S3 - Oct 10

More Related Content

What's hot

Optimizing Storage for Big Data/Analytics Workloads
Optimizing Storage for Big Data/Analytics WorkloadsOptimizing Storage for Big Data/Analytics Workloads
Optimizing Storage for Big Data/Analytics WorkloadsAmazon Web Services
 
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)Amazon Web Services
 
Strategic Uses for Cost Efficient Long-Term Cloud Storage
Strategic Uses for Cost Efficient Long-Term Cloud StorageStrategic Uses for Cost Efficient Long-Term Cloud Storage
Strategic Uses for Cost Efficient Long-Term Cloud StorageAmazon Web Services
 
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using AlluxioBursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using AlluxioAlluxio, Inc.
 
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupChallenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupAndrei Savu
 
Simple, Scalable and Highly Durable NAS in the Cloud - Amazon EFS
Simple, Scalable and Highly Durable NAS in the Cloud - Amazon EFSSimple, Scalable and Highly Durable NAS in the Cloud - Amazon EFS
Simple, Scalable and Highly Durable NAS in the Cloud - Amazon EFSAmazon Web Services
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsAmazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Demystifying Storage on AWS | AWS Public Sector Summit 2017
Demystifying Storage on AWS | AWS Public Sector Summit 2017Demystifying Storage on AWS | AWS Public Sector Summit 2017
Demystifying Storage on AWS | AWS Public Sector Summit 2017Amazon Web Services
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
 
SRV404 Deep Dive on Amazon DynamoDB
SRV404 Deep Dive on Amazon DynamoDBSRV404 Deep Dive on Amazon DynamoDB
SRV404 Deep Dive on Amazon DynamoDBAmazon Web Services
 
AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...
AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...
AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...Amazon Web Services
 
Optimizing Data Management Using AWS Storage and Data Migration Products | AW...
Optimizing Data Management Using AWS Storage and Data Migration Products | AW...Optimizing Data Management Using AWS Storage and Data Migration Products | AW...
Optimizing Data Management Using AWS Storage and Data Migration Products | AW...Amazon Web Services
 
Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018
Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018
Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018Amazon Web Services
 
AWS re:Invent 2016: Bring Microsoft Applications to AWS to Save Money and Sta...
AWS re:Invent 2016: Bring Microsoft Applications to AWS to Save Money and Sta...AWS re:Invent 2016: Bring Microsoft Applications to AWS to Save Money and Sta...
AWS re:Invent 2016: Bring Microsoft Applications to AWS to Save Money and Sta...Amazon Web Services
 
AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)
AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)
AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)Amazon Web Services
 
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...Amazon Web Services
 
Co 4, session 2, aws analytics services
Co 4, session 2, aws analytics servicesCo 4, session 2, aws analytics services
Co 4, session 2, aws analytics servicesm vaishnavi
 

What's hot (20)

Optimizing Storage for Big Data/Analytics Workloads
Optimizing Storage for Big Data/Analytics WorkloadsOptimizing Storage for Big Data/Analytics Workloads
Optimizing Storage for Big Data/Analytics Workloads
 
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
 
Strategic Uses for Cost Efficient Long-Term Cloud Storage
Strategic Uses for Cost Efficient Long-Term Cloud StorageStrategic Uses for Cost Efficient Long-Term Cloud Storage
Strategic Uses for Cost Efficient Long-Term Cloud Storage
 
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using AlluxioBursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
 
REDSHIFT - Amazon
REDSHIFT - AmazonREDSHIFT - Amazon
REDSHIFT - Amazon
 
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupChallenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
 
Simple, Scalable and Highly Durable NAS in the Cloud - Amazon EFS
Simple, Scalable and Highly Durable NAS in the Cloud - Amazon EFSSimple, Scalable and Highly Durable NAS in the Cloud - Amazon EFS
Simple, Scalable and Highly Durable NAS in the Cloud - Amazon EFS
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Demystifying Storage on AWS | AWS Public Sector Summit 2017
Demystifying Storage on AWS | AWS Public Sector Summit 2017Demystifying Storage on AWS | AWS Public Sector Summit 2017
Demystifying Storage on AWS | AWS Public Sector Summit 2017
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
SRV404 Deep Dive on Amazon DynamoDB
SRV404 Deep Dive on Amazon DynamoDBSRV404 Deep Dive on Amazon DynamoDB
SRV404 Deep Dive on Amazon DynamoDB
 
AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...
AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...
AWS re:Invent 2016: FINRA: Building a Secure Data Science Platform on AWS (BD...
 
Optimizing Data Management Using AWS Storage and Data Migration Products | AW...
Optimizing Data Management Using AWS Storage and Data Migration Products | AW...Optimizing Data Management Using AWS Storage and Data Migration Products | AW...
Optimizing Data Management Using AWS Storage and Data Migration Products | AW...
 
Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018
Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018
Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018
 
AWS re:Invent 2016: Bring Microsoft Applications to AWS to Save Money and Sta...
AWS re:Invent 2016: Bring Microsoft Applications to AWS to Save Money and Sta...AWS re:Invent 2016: Bring Microsoft Applications to AWS to Save Money and Sta...
AWS re:Invent 2016: Bring Microsoft Applications to AWS to Save Money and Sta...
 
AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)
AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)
AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)
 
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
 
Co 4, session 2, aws analytics services
Co 4, session 2, aws analytics servicesCo 4, session 2, aws analytics services
Co 4, session 2, aws analytics services
 

Similar to Building a Bigdata Architecture on AWS

AWS Webcast - Website Hosting in the Cloud
AWS Webcast - Website Hosting in the CloudAWS Webcast - Website Hosting in the Cloud
AWS Webcast - Website Hosting in the CloudAmazon Web Services
 
Building a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for AnalystsBuilding a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for AnalystsAvere Systems
 
What is Cloud computing?
What is Cloud computing?What is Cloud computing?
What is Cloud computing?Richard Harvey
 
AWS Webcast - AWS Webinar Series for Education #2 - Getting Started with AWS
AWS Webcast - AWS Webinar Series for Education #2 - Getting Started with AWSAWS Webcast - AWS Webinar Series for Education #2 - Getting Started with AWS
AWS Webcast - AWS Webinar Series for Education #2 - Getting Started with AWSAmazon Web Services
 
AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...
AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...
AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...Amazon Web Services
 
Cloud computing aws -key services
Cloud computing  aws -key servicesCloud computing  aws -key services
Cloud computing aws -key servicesSelvaraj Kesavan
 
Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Web Services
 
Uses, considerations, and recommendations for AWS
Uses, considerations, and recommendations for AWSUses, considerations, and recommendations for AWS
Uses, considerations, and recommendations for AWSScalar Decisions
 
Introduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceIntroduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceAmazon Web Services
 
AWS Webcast - Explore the AWS Cloud for Government
AWS Webcast - Explore the AWS Cloud for GovernmentAWS Webcast - Explore the AWS Cloud for Government
AWS Webcast - Explore the AWS Cloud for GovernmentAmazon Web Services
 
AWS Webcast - Webinar Series for State and Local Government #2: Discover the ...
AWS Webcast - Webinar Series for State and Local Government #2: Discover the ...AWS Webcast - Webinar Series for State and Local Government #2: Discover the ...
AWS Webcast - Webinar Series for State and Local Government #2: Discover the ...Amazon Web Services
 
Backup and archiving in the aws cloud
Backup and archiving in the aws cloudBackup and archiving in the aws cloud
Backup and archiving in the aws cloudAmazon Web Services
 
AWS Webcast - Build Agile Applications in AWS Cloud
AWS Webcast - Build Agile Applications in AWS CloudAWS Webcast - Build Agile Applications in AWS Cloud
AWS Webcast - Build Agile Applications in AWS CloudAmazon Web Services
 
Cloud & Native Cloud for Managers
Cloud & Native Cloud for ManagersCloud & Native Cloud for Managers
Cloud & Native Cloud for ManagersEitan Sela
 

Similar to Building a Bigdata Architecture on AWS (20)

AWS Webcast - Website Hosting in the Cloud
AWS Webcast - Website Hosting in the CloudAWS Webcast - Website Hosting in the Cloud
AWS Webcast - Website Hosting in the Cloud
 
Débuter sur le cloud AWS
Débuter sur le cloud AWSDébuter sur le cloud AWS
Débuter sur le cloud AWS
 
Cloud Service.pptx
Cloud Service.pptxCloud Service.pptx
Cloud Service.pptx
 
Building a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for AnalystsBuilding a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for Analysts
 
What is Cloud computing?
What is Cloud computing?What is Cloud computing?
What is Cloud computing?
 
AWS Webcast - Website Hosting
AWS Webcast - Website HostingAWS Webcast - Website Hosting
AWS Webcast - Website Hosting
 
AWS EC2 JSP.pptx
AWS EC2 JSP.pptxAWS EC2 JSP.pptx
AWS EC2 JSP.pptx
 
AWS Webcast - AWS Webinar Series for Education #2 - Getting Started with AWS
AWS Webcast - AWS Webinar Series for Education #2 - Getting Started with AWSAWS Webcast - AWS Webinar Series for Education #2 - Getting Started with AWS
AWS Webcast - AWS Webinar Series for Education #2 - Getting Started with AWS
 
AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...
AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...
AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...
 
Cloud computing aws -key services
Cloud computing  aws -key servicesCloud computing  aws -key services
Cloud computing aws -key services
 
Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)
 
Uses, considerations, and recommendations for AWS
Uses, considerations, and recommendations for AWSUses, considerations, and recommendations for AWS
Uses, considerations, and recommendations for AWS
 
Introduction to Amazon Relational Database Service
Introduction to Amazon Relational Database ServiceIntroduction to Amazon Relational Database Service
Introduction to Amazon Relational Database Service
 
AWS Webcast - Explore the AWS Cloud for Government
AWS Webcast - Explore the AWS Cloud for GovernmentAWS Webcast - Explore the AWS Cloud for Government
AWS Webcast - Explore the AWS Cloud for Government
 
AWS Webcast - Webinar Series for State and Local Government #2: Discover the ...
AWS Webcast - Webinar Series for State and Local Government #2: Discover the ...AWS Webcast - Webinar Series for State and Local Government #2: Discover the ...
AWS Webcast - Webinar Series for State and Local Government #2: Discover the ...
 
Backup and archiving in the aws cloud
Backup and archiving in the aws cloudBackup and archiving in the aws cloud
Backup and archiving in the aws cloud
 
Cloud computing benefits
Cloud computing benefitsCloud computing benefits
Cloud computing benefits
 
AWS Webcast - Build Agile Applications in AWS Cloud
AWS Webcast - Build Agile Applications in AWS CloudAWS Webcast - Build Agile Applications in AWS Cloud
AWS Webcast - Build Agile Applications in AWS Cloud
 
CMS on AWS Deep Dive
CMS on AWS Deep DiveCMS on AWS Deep Dive
CMS on AWS Deep Dive
 
Cloud & Native Cloud for Managers
Cloud & Native Cloud for ManagersCloud & Native Cloud for Managers
Cloud & Native Cloud for Managers
 

Recently uploaded

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 

Recently uploaded (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024

Building a Bigdata Architecture on AWS

  • 1. Building a Bigdata Architecture on AWS ARUN SIRIMALLA
  • 2. Cloud Computing • Cloud computing is the on-demand delivery of compute power, database storage, applications and other IT resources • It is delivered through a cloud services platform over the internet with pay-as-you-go pricing
  • 3. Why Cloud? • No upfront investment in data centers and servers • Stop guessing capacity • Stop spending money on running and maintaining data centers • Go global in minutes • Disaster recovery
  • 5. Regions and Availability Zones • Amazon EC2 is hosted in multiple locations worldwide • Each region is a separate geographic area • Each region has multiple, isolated locations known as Availability Zones VPC • Virtual datacenter in the cloud • You can create your own public-facing subnet for your web servers and place your backend systems, such as databases or application servers, in a private subnet • You can create a hardware virtual private network connection between your corporate datacenter and AWS • Assign a custom IP address range in each subnet • Create internet gateways • Leverage multiple layers of security EC2 • Web service that provides secure, resizable compute capacity in the cloud ✓ On-Demand Instances: applications with spiky or unpredictable workloads, or applications being developed or tested on Amazon EC2 ✓ Reserved Instances: steady-state or predictable usage, where an upfront payment is possible ✓ Spot Instances: applications that have flexible start and end times
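The idea of assigning a custom IP address range to each subnet can be sketched with Python's standard `ipaddress` module. This is only an illustration: the `10.0.0.0/16` VPC range and the `/24` subnet size are assumptions, not values from the slides.

```python
import ipaddress

# Hypothetical VPC address range; any private CIDR block would do.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Split the VPC range into /24 subnets and take the first two.
subnets = list(vpc.subnets(new_prefix=24))
public_subnet = subnets[0]   # e.g. web servers reachable through an internet gateway
private_subnet = subnets[1]  # e.g. databases and application servers

print("public: ", public_subnet)
print("private:", private_subnet)
print("private subnet inside VPC:", private_subnet.subnet_of(vpc))
```

Carving both subnets from one parent block keeps them non-overlapping by construction, which is the property a VPC layout needs.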
  • 6. EBS • Create storage volumes and attach them to Amazon EC2 instances • General Purpose SSD: designed for 99.999% availability; 3 IOPS/GB up to 10K IOPS • Provisioned IOPS SSD: designed for I/O-intensive applications such as large relational or NoSQL databases • Magnetic (standard) S3 • S3 is object-based storage that lets you upload files • Files can be 1 byte to 5 TB • Bucket names share a single global namespace, although each bucket is created in a specific region • Designed for 99.99% availability • Designed for 99.999999999% durability RDS • Lets you create and scale relational databases • You cannot SSH or RDP to an RDS instance • AWS does not give you a public or private IP address; instead it gives you an endpoint to connect to • Available and durable, secure, inexpensive
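The General Purpose SSD figure on this slide (3 IOPS per GB, up to 10K IOPS) is simple arithmetic. The sketch below uses only those two numbers; the function name is hypothetical, and burst behaviour and any minimum baseline are deliberately ignored.

```python
def gp_ssd_baseline_iops(size_gb: int) -> int:
    """Baseline IOPS from the slide's figures: 3 IOPS per GB, capped at 10,000."""
    iops_per_gb = 3
    max_iops = 10_000
    return min(size_gb * iops_per_gb, max_iops)

# Volume sizes chosen only for illustration.
for size_gb in (100, 1000, 5000):
    print(f"{size_gb} GB -> {gp_ssd_baseline_iops(size_gb)} IOPS")
```

With these figures, the cap is reached at roughly 3,334 GB; larger volumes gain capacity but no additional baseline IOPS.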
  • 7. Route 53 • Highly available and scalable cloud DNS web service • You can create ✓ Both public and private DNS records ✓ A records, which resolve names to IP addresses ✓ CNAME records, which resolve one name to another Direct Connect • AWS Direct Connect makes it easy to establish a dedicated network connection from your premises to AWS • Private connectivity to your Amazon VPC • Provides 1 Gbps and 10 Gbps connections IAM • Enables you to securely control access to AWS services and resources for your users • Offered at no additional charge • Use permissions to allow and deny user/group access to AWS resources
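The difference between A and CNAME records can be shown with a toy resolver. This is not how Route 53 is implemented; the record names and the IP address are made up for illustration.

```python
# Hypothetical records; the names and the address are invented for this sketch.
RECORDS = {
    "cm.example.internal": ("A", "10.0.1.15"),
    "manager.example.internal": ("CNAME", "cm.example.internal"),
}

def resolve(name: str, max_hops: int = 5) -> str:
    """Follow CNAME records until an A record yields an IP address."""
    for _ in range(max_hops):
        record_type, value = RECORDS[name]
        if record_type == "A":
            return value
        name = value  # CNAME: continue the lookup with the target name
    raise RuntimeError("too many CNAME hops")

print(resolve("manager.example.internal"))
```

The CNAME lookup ends at the same address as the A record it points to, so moving the service only requires updating the A record.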
  • 8. EMR • Managed Hadoop framework • Fast and cost-effective way to process vast amounts of data across dynamically scalable Amazon EC2 instances • Supported applications ✓ Hadoop, Hive, HUE, Pig, HBase, ZooKeeper, Spark and more • Cost optimization using Spot fleet CloudWatch • Monitoring service for AWS resources and the applications you run on AWS • Collect and track metrics, monitor log files and set alarms • View metrics for CPU utilization, data transfer, and disk usage activity from Amazon EC2 instances CloudFormation • Gives developers and systems admins an easy way to create and manage AWS resources, and to provision and update them in an orderly and predictable fashion SNS • Fully managed messaging service • Allows you to push messages directly to AWS resources
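To make the CloudFormation point concrete, here is a minimal sketch of a template built as a Python dictionary and serialized to JSON. The logical name `BackupBucket` and the description are hypothetical; the slides do not show any template.

```python
import json

# Minimal template declaring one S3 bucket.
# "BackupBucket" is a hypothetical logical resource name.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Sketch: a single S3 bucket for cluster backups",
    "Resources": {
        "BackupBucket": {"Type": "AWS::S3::Bucket"},
    },
}

# The JSON text is what would be submitted to CloudFormation as the template body.
template_body = json.dumps(template, indent=2)
print(template_body)
```

Because the resources are declared rather than scripted, the same template can be applied repeatedly to create or update a stack in a predictable way.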
  • 9. Bigdata Use cases • On-Demand Big Data Analytics • Clickstream Analysis • Data Warehousing
  • 10.
  • 11. Architecture [diagram; labels: US East Region, VPG, VPG, On-Prem]
  • 12. Persistent clusters Choosing the right instance • Memory Optimized – R3 and R4 • CPU Optimized – C3 and C4 • Storage Optimized – I2 and D2 Amazon Machine Image (AMI) • Choosing the base AMI (Red Hat, CentOS) • Create your own AMI • AMI creation using Packer Bigdata Distribution • Apache, Cloudera and Hortonworks
  • 13. RDS: for storing data related to Hive, HUE, Sentry and Cloudera Manager • S3: creating buckets for backups • Kerberos: KDC for authentication of HDFS services • Route 53: records for Cloudera Manager, HUE, YARN and other applications
  • 14. Placement Groups • Logical grouping of instances within a single Availability Zone • Recommended for applications that benefit from low network latency and high network throughput • There is no charge for creating a placement group Options to create long-running clusters • Puppet/CloudFormation • Cloudera Director ✓ Deploy and manage the lifecycle of Cloudera Enterprise in the cloud ✓ Integrated with AWS, Microsoft Azure and Google Cloud Platform
  • 15. Raw Data – 100 TB; Data after Replication – 300 TB; 25% Reserve for M/R Jobs. Raw Data – 15 TB; Data after Replication – 45 TB; 25% Reserve for M/R Jobs.
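The sizing figures on this slide imply a replication factor of 3 (100 TB becomes 300 TB, 15 TB becomes 45 TB). The sketch below reproduces that step and then adds the 25% reserve. The slide does not state a final total, so treating the reserve as 25% added on top of the replicated data is an assumption; the function name is hypothetical.

```python
def required_capacity_tb(raw_tb: float, replication: int = 3, reserve: float = 0.25) -> float:
    """Estimate cluster storage: replicate the raw data, then add a reserve.

    Assumption: the reserve for M/R jobs is added on top of the replicated
    data. Reserving 25% of total capacity instead would give a larger number.
    """
    replicated_tb = raw_tb * replication
    return replicated_tb * (1 + reserve)

for raw_tb in (100, 15):
    print(f"raw {raw_tb} TB -> replicated {raw_tb * 3} TB -> "
          f"with reserve {required_capacity_tb(raw_tb)} TB")
```

Under this assumption the two scenarios on the slide come to 375 TB and 56.25 TB respectively.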
  • 16. Security Best practices for persistent/long-running clusters • IAM policies for IAM users • Integrate with AWS Directory Service for Microsoft AD so that users stay in sync across the corporate network and AWS • All instances should reside in private subnets, except for the NAT instance, which resides in a public subnet and occasionally connects to the internet to set up local repositories or download packages System Monitoring • AWS CloudWatch and AWS SNS for infrastructure monitoring • Cloudera Manager for monitoring CDH instances, services and clusters • Nagios and Ganglia for cluster monitoring
  • 17. Transient Clusters • EMR is a PaaS provided by Amazon to execute Hadoop and other Big Data workloads such as Hive, Pig, Spark, Presto, Mahout, Oozie and others • Amazon S3 serves as the cluster's persistent data store Advantages of using S3 as data repository: • S3 is designed for 99.99% data availability and 99.999999999% data durability • Serves as a backup of the data • Serves as a Disaster Recovery strategy with an RTO (Recovery Time Objective) measured in hours • Data residing in S3 can be leveraged for transient workloads • S3 supports cross-region replication for redundancy • S3 supports versioning of data, which protects against accidental overwrites; versioning with MFA Delete also prevents accidental deletion of objects • Using lifecycle configuration, older data can be archived by moving it to Glacier
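The lifecycle point can be illustrated with a rule that transitions objects to Glacier after a number of days. The dictionary below follows the shape the S3 API uses for lifecycle configurations; the rule ID and the 90-day threshold are hypothetical, and nothing here is taken from the presenter's actual setup.

```python
# Lifecycle configuration in the dictionary shape used by the S3 API.
# The rule ID and the 90-day threshold are hypothetical values.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-older-data",
            "Filter": {"Prefix": ""},  # empty prefix: rule applies to all objects
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }
    ]
}

for rule in lifecycle_configuration["Rules"]:
    for transition in rule["Transitions"]:
        print(f"{rule['ID']}: move to {transition['StorageClass']} "
              f"after {transition['Days']} days")
```

With boto3, a dictionary like this would typically be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`, along with the target bucket name.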
  • 18. Advantages • Data movement between S3 and EMR instances is faster than with general EC2 instances • No separate licensing costs • Less time to bootstrap clusters • Cost optimization using Spot instances • Transient clusters are a better approach for transient workloads • No single point of failure: the data is always consistently backed up in S3 • If the cluster performing the job goes down, another cluster can be instantiated to finish it
  • 19. • Source-of-truth data lives in AWS S3 • No resource contention: high-priority production jobs don't have to wait for resources to be freed up by upstream production jobs • Each job can get its own cluster • Easily meet business SLAs: with the flexibility to provide capacity based on the use case, SLAs for various LOBs can be met easily • Using Spot instances gives a discount of approximately 70% in the most common use cases
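The effect of the approximately 70% Spot discount on a transient cluster's cost is easy to work out. In the sketch below, the node count, run time and hourly rate are invented for illustration, and actual Spot prices vary over time.

```python
def cluster_cost(nodes: int, hours: float, hourly_rate: float, discount: float = 0.0) -> float:
    """Cost of running `nodes` instances for `hours` at `hourly_rate`, less a discount."""
    return nodes * hours * hourly_rate * (1 - discount)

# Hypothetical job: 10 nodes for 4 hours at 0.50 per node-hour.
on_demand = cluster_cost(nodes=10, hours=4, hourly_rate=0.50)
spot = cluster_cost(nodes=10, hours=4, hourly_rate=0.50, discount=0.70)

print(f"on-demand: {on_demand:.2f}")
print(f"spot:      {spot:.2f}")
print(f"saved:     {on_demand - spot:.2f}")
```

Because a transient cluster exists only for the duration of the job, the discount applies to the whole bill rather than to spare capacity.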
  • 20. Security Best practices for securely accessing data in S3 from transient cluster(s) • Define S3 bucket policies, which allow or deny specific actions • Define S3 ACLs, which grant permissions • Use IAM roles for instances to access data in S3 without exposing AWS access and secret keys • Define a VPC endpoint for S3 so that data travels over a private connection between S3 and instances in the VPC (Virtual Private Cloud)
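A bucket policy and a VPC endpoint can be combined so that the bucket is reachable only through that endpoint. The sketch below builds such a policy as a Python dictionary; the bucket name and the endpoint ID are hypothetical placeholders, and this is one possible policy rather than the one used in the talk.

```python
import json

BUCKET = "example-emr-data"     # hypothetical bucket name
VPC_ENDPOINT = "vpce-1a2b3c4d"  # hypothetical VPC endpoint ID

# Deny every S3 action on the bucket unless the request arrives
# through the named VPC endpoint.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAccessOutsideVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            "Condition": {"StringNotEquals": {"aws:SourceVpce": VPC_ENDPOINT}},
        }
    ],
}

print(json.dumps(bucket_policy, indent=2))
```

A blanket deny like this also blocks access from outside the VPC, including the console, so it is usually worth testing on a non-production bucket first.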
  • 22. Upcoming Sessions • Amazon EC2, S3 and EMR – Sep 26 • Cost Optimization with Spot Instances (EMR) – Oct 3 • Deep Dive on EC2 and S3 – Oct 10