Building a Bigdata Architecture on AWS
ARUN SIRIMALLA
Cloud Computing
• Cloud computing is the on-demand delivery of
compute power, database storage, applications and
other IT resources
• It is a cloud services platform delivered via the internet with pay-as-you-go pricing
Why Cloud?
• No upfront investment in data centers and
servers
• Stop guessing capacity
• Stop spending money on running and
maintaining data centers
• Go global in minutes
• Disaster recovery
Overview of Amazon Web Services
Regions and Availability Zones
• Amazon EC2 is hosted in multiple locations world-wide
• Each region is a separate geographic area
• Each region has multiple, isolated locations known as Availability Zones
VPC
• Virtual datacenter in the cloud
• You can create a public-facing subnet for your web servers and place backend systems such as databases or application servers in a private subnet
• You can create a hardware virtual private network connection between your corporate datacenter and AWS
• Assign custom IP address range in each subnet
• Create internet gateways
• Leverage multiple layers of security
EC2
• Web service that provides secure, resizable compute capacity in the cloud
ü On-demand Instances
Applications with spiky or unpredictable workloads, or applications being developed or tested on Amazon EC2
ü Reserved Instances
Steady-state or predictable usage, where you are able to make an upfront payment
ü Spot Instances
Applications that have flexible start and end times
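A minimal sketch of how these purchasing options differ at the API level, assuming boto3; the AMI ID, instance type, and Spot max price below are placeholders, not real resources or recommendations.

```python
# Hypothetical parameters for On-Demand vs Spot capacity with boto3's
# EC2 client. All IDs and prices are placeholders.
on_demand_params = {
    "ImageId": "ami-0123456789abcdef0",  # placeholder AMI
    "InstanceType": "m4.large",
    "MinCount": 1,
    "MaxCount": 1,
}

spot_params = {
    **on_demand_params,
    "InstanceMarketOptions": {               # requests Spot capacity instead
        "MarketType": "spot",
        "SpotOptions": {"MaxPrice": "0.05"},  # bid ceiling in USD/hour
    },
}

# With credentials configured, the actual call would be:
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# ec2.run_instances(**spot_params)
```

The only structural difference between the two requests is the `InstanceMarketOptions` block, which is why Spot works well for the flexible-start-time workloads above.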
EBS
Create storage volumes and attach them to Amazon EC2 instances
General Purpose SSD
Designed for 99.999% availability
3 IOPS/GB, up to 10,000 IOPS
Provisioned IOPS SSD
Designed for I/O intensive applications such as large relational or NoSQL databases
Magnetic (standard)
S3
• S3 is object-based storage that allows you to upload files
• Files can be from 1 byte to 5 TB
• Bucket names are globally unique across all AWS accounts
• Amazon guarantees 99.99% availability
• Guarantees durability of 99.999999999% (11 nines)
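The object size limits above translate directly into code; a hedged boto3 sketch with hypothetical bucket and key names (the actual call is commented out since it needs credentials):

```python
# Hypothetical bucket/key names; the 5 TB limit applies per object.
# boto3's upload_file switches to multipart upload automatically for
# large files.
bucket = "my-backup-bucket"                # bucket names are globally unique
key = "backups/2017-09-26/data.csv.gz"

# With credentials configured:
# import boto3
# s3 = boto3.client("s3")
# s3.upload_file("data.csv.gz", bucket, key)

# Per-object size bounds mentioned on the slide:
MIN_OBJECT_BYTES = 1
MAX_OBJECT_BYTES = 5 * 1024 ** 4           # 5 TB
```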
RDS
• Allows you to create and scale relational databases
• You cannot SSH or RDP into an RDS instance
• AWS does not give you a public or private IP address; instead it gives you an endpoint to connect to
• Available and Durable, Secure, Inexpensive
Route 53
• Highly available and scalable cloud DNS web service
• You can create
ü Both public and private DNS records
ü A records, which resolve names to IP addresses
ü CNAME records, which resolve one name to another
Direct Connect
• AWS Direct Connect makes it easy to establish a dedicated network connection from your premises to AWS
• Private connectivity to your Amazon VPC
• Provides 1 Gbps and 10 Gbps connections
IAM
• Enables you to securely control access to AWS services and resources for your users
• Offered at no additional charge
• Use permissions to allow and deny user/group access to AWS resources
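As an illustration of "use permissions to allow and deny access", here is a minimal read-only S3 policy document; the bucket name is hypothetical:

```python
import json

# A minimal IAM policy document (hypothetical bucket name) granting
# read-only access to one S3 bucket. Attach it to a user or group via
# the IAM console or API.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-data-bucket",    # bucket itself (ListBucket)
                "arn:aws:s3:::example-data-bucket/*",  # objects (GetObject)
            ],
        }
    ],
}

policy_json = json.dumps(read_only_policy)
```

Anything not explicitly allowed is implicitly denied, which is why this single Allow statement is enough for a least-privilege read-only user.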
EMR
• Managed Hadoop framework
• A fast and cost-effective way to process vast amounts of data across dynamically scalable Amazon EC2 instances
• Supported Applications
ü Hadoop, Hive, HUE, Pig, HBase, Zookeeper, Spark and more
• Cost optimization using Spot fleet
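A sketch of how Spot-based cost optimization shows up in an EMR cluster request, assuming boto3's `run_job_flow`; cluster name, release label, instance types and bid price are placeholders:

```python
# Hypothetical request shape for boto3's emr.run_job_flow(**cluster_config):
# On-Demand master/core groups plus a Spot task group for cheap burst capacity.
cluster_config = {
    "Name": "transient-etl-cluster",
    "ReleaseLabel": "emr-5.8.0",          # placeholder release
    "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m4.large", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "r4.xlarge", "InstanceCount": 4},
            {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
             "BidPrice": "0.10",          # placeholder USD/hour ceiling
             "InstanceType": "r4.xlarge", "InstanceCount": 8},
        ],
        # Transient cluster: terminate automatically when steps finish
        "KeepJobFlowAliveWhenNoSteps": False,
    },
}
```

Keeping HDFS-bearing core nodes On-Demand while putting stateless task nodes on Spot is the usual way to capture the discount without risking data loss on reclamation.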
CloudWatch
• Monitoring service for AWS resources and the applications you run on AWS
• Collect and track metrics, monitor log files, and set alarms
• View metrics for CPU utilization, data transfer, and disk usage activity from Amazon EC2 instances
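As an example of "set alarms", the parameters a CloudWatch alarm takes, assuming boto3's `put_metric_alarm`; the instance ID and SNS topic ARN are placeholders:

```python
# Hypothetical alarm definition for boto3's cloudwatch.put_metric_alarm(**alarm):
# notify an SNS topic when average EC2 CPU stays above 80% for 10 minutes.
alarm = {
    "AlarmName": "high-cpu-datanode",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-0abc123"}],  # placeholder
    "Statistic": "Average",
    "Period": 300,                 # seconds per datapoint
    "EvaluationPeriods": 2,        # 2 x 300 s = 10 minutes of breach
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
}
```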
CloudFormation
• Gives developers and system administrators an easy way to create and manage AWS resources, provisioning and updating them in an orderly and predictable fashion
SNS
• Fully managed messaging service
• Allows you to push messages directly to AWS resources
Bigdata Use Cases
• On-Demand Big Data Analytics
• Clickstream Analysis
• Data Warehousing
Architecture: US East Region
(Diagram: on-premises datacenter connected to the VPC through Virtual Private Gateways)
Persistent clusters
Choosing the right instance
• Memory Optimized – R3 and R4
• CPU Optimized - C3 and C4
• Storage Optimized – I2 and D2
Amazon Machine Image (AMI)
• Choosing the base AMI (Red Hat, CentOS)
• Create your own AMI
• AMI creation using Packer
Bigdata Distribution
• Apache, Cloudera and Hortonworks
RDS
For storing data related to Hive, HUE, Sentry and Cloudera Manager
S3
Creating buckets for backups
Kerberos
KDC for authentication of HDFS services
Route 53 Records
Cloudera Manager, HUE, YARN and other applications
Placement Groups
• Logical grouping of instances within a single Availability Zone
• Recommended for applications that benefit from low network latency
and high network throughput
• There is no charge for creating a placement group
Options to create Long-running clusters
• Puppet/CloudFormation
• Cloudera Director
üDeploy and manage the lifecycle of Cloudera Enterprise in the cloud
üIntegrated with AWS, Microsoft Azure and Google Cloud Platform
Cluster sizing example 1:
Raw Data – 100 TB
Data after Replication – 300 TB
25% Reserve for M/R Jobs
Cluster sizing example 2:
Raw Data – 15 TB
Data after Replication – 45 TB
25% Reserve for M/R Jobs
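The sizing figures above follow a simple rule of thumb, sketched below: HDFS triple-replicates raw data, and roughly 25% of total capacity is held back for MapReduce intermediate output.

```python
# Capacity rule of thumb: provisioned = raw * replication / (1 - reserve)
REPLICATION_FACTOR = 3      # HDFS default block replication
MR_RESERVE = 0.25           # fraction reserved for M/R intermediate data

def provisioned_tb(raw_tb: float) -> float:
    """Total storage to provision for a given amount of raw data."""
    replicated = raw_tb * REPLICATION_FACTOR
    return replicated / (1 - MR_RESERVE)

# 100 TB raw -> 300 TB replicated -> 400 TB provisioned
# 15 TB raw  ->  45 TB replicated ->  60 TB provisioned
```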
Security
Best practices for persistent/long-running cluster
• IAM policies for IAM users
• Integrate with AWS Directory Service for Microsoft AD so that users are in sync between the corporate network and AWS
• All instances should reside in private subnets, except for a NAT instance, which resides in a public subnet to occasionally connect to the internet, for example to set up local repositories or download packages
System Monitoring
• AWS CloudWatch and AWS SNS for infrastructure monitoring
• Cloudera Manager for monitoring CDH instances, services and clusters
• Nagios and Ganglia for cluster monitoring
Transient Clusters
EMR is a PaaS offering from Amazon for executing Hadoop and other big data workloads such as Hive, Pig, Spark, Presto, Mahout, Oozie and others.
Amazon S3 as your cluster persistent data store
Advantages of using S3 as data repository:
• S3 guarantees 99.99% data availability and 99.999999999% data durability
• Serves as a backup of the data
• Serves as a disaster recovery strategy, with an RTO (Recovery Time Objective) measured in hours
• Data residing in S3 can be leveraged for transient workloads
• S3 supports cross region replication for redundancy
• S3 supports versioning of data, which protects against accidental overwrites, and versioning with MFA prevents accidental deletion of objects
• Using lifecycle configuration, older data can be archived to Glacier
Advantages
• Data movement between S3 and EMR instances is faster than to general EC2 instances
• No separate licensing costs
• Less time to bootstrap clusters
• Cost optimization using Spot Instances
• Transient clusters are a better approach for transient workloads
• No single point of failure: the data is always consistently backed up in S3
• If the cluster performing a job goes down, another cluster can be instantiated to finish it
• Source of truth data in AWS S3
• No resource contention: high-priority production jobs don't have to wait for resources to be freed up by upstream production jobs
• Each job can get its own cluster
• Easily meet business SLAs
With the flexibility to provision capacity per use case, SLAs for various LOBs can be easily met
• Using Spot Instances gives approximately a 70% discount in most common use cases
Security
Best practices for securely accessing data in S3 by transient cluster(s)
• Define S3 bucket policies, which allow or deny specific actions
• Define S3 ACLs, which grant permissions at the bucket and object level
• Use IAM roles for instances to access data in S3 without exposing AWS access and secret keys
• Define a VPC endpoint for the S3 bucket so that data travels over a private connection between S3 and instances in the VPC (Virtual Private Cloud)
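Combining the last two points, a sketch of a bucket policy that denies access from outside a specific VPC endpoint, keeping cluster-to-S3 traffic on the private path; bucket name and endpoint ID are hypothetical:

```python
import json

# Hypothetical bucket policy: deny all S3 actions unless the request
# arrives through one specific VPC endpoint (aws:sourceVpce condition key).
vpce_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::example-lake",      # placeholder bucket
                "arn:aws:s3:::example-lake/*",
            ],
            "Condition": {
                "StringNotEquals": {
                    "aws:sourceVpce": "vpce-0fedcba9876543210"  # placeholder
                }
            },
        }
    ],
}

policy_json = json.dumps(vpce_only_policy)
```

Because an explicit Deny always overrides any Allow, this blocks access from the public internet even for principals that otherwise have S3 permissions.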
Thank you!
Upcoming Sessions
Amazon EC2, S3 and EMR – Sep 26
Cost Optimization with Spot Instances (EMR) – Oct 3
Deep Dive on EC2 and S3 – Oct 10
