Building a Bigdata Architecture on AWS
ARUN SIRIMALLA
Cloud Computing
• Cloud computing is the on-demand delivery of
compute power, database storage, applications and
other IT resources
• It is a cloud services platform delivered via the internet with pay-as-you-go pricing
Why Cloud?
• No upfront investment in data centers and
servers
• Stop guessing capacity
• Stop spending money on running and
maintaining data centers
• Go global in minutes
• Disaster recovery
Overview of Amazon Web Services
Regions and Availability Zones
• Amazon EC2 is hosted in multiple locations world-wide
• Each region is a separate geographic area
• Each region has multiple, isolated locations known as Availability Zones
VPC
• Virtual datacenter in the cloud
• You can create a public-facing subnet for your web servers and place backend systems such as databases or application servers in a private subnet
• You can create a hardware virtual private network connection between your corporate datacenter and AWS
• Assign custom IP address range in each subnet
• Create internet gateways
• Leverage multiple layers of security
EC2
• Web service that provides secure, resizable compute capacity in the cloud
ü On-demand Instances
Applications with spiky or unpredictable workloads, or applications being developed or tested on Amazon EC2
ü Reserved Instances
Steady-state or predictable usage, where you are able to make an upfront payment
ü Spot Instances
Applications that have flexible start and end times
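A minimal sketch of how these purchasing options differ at the API level, assuming boto3; the AMI ID, instance type, and Spot max price below are placeholders, not real resources or recommendations.

```python
# Hypothetical parameters for On-Demand vs Spot capacity with boto3's
# EC2 client. All IDs and prices are placeholders.
on_demand_params = {
    "ImageId": "ami-0123456789abcdef0",  # placeholder AMI
    "InstanceType": "m4.large",
    "MinCount": 1,
    "MaxCount": 1,
}

spot_params = {
    **on_demand_params,
    "InstanceMarketOptions": {               # requests Spot capacity instead
        "MarketType": "spot",
        "SpotOptions": {"MaxPrice": "0.05"},  # bid ceiling in USD/hour
    },
}

# With credentials configured, the actual call would be:
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# ec2.run_instances(**spot_params)
```

The only structural difference between the two requests is the `InstanceMarketOptions` block, which is why Spot works well for the flexible-start-time workloads above.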
EBS
Create storage volumes and attach them to Amazon EC2 instances
General Purpose SSD
Designed for 99.999% availability
3 IOPS/GB, up to 10,000 IOPS
Provisioned IOPS SSD
Designed for I/O intensive applications such as large relational or NoSQL databases
Magnetic (standard)
S3
• S3 is object-based storage that allows you to upload files
• Files can be from 1 byte to 5 TB
• Bucket names are globally unique across all AWS accounts
• Amazon guarantees 99.99% availability
• Guarantees durability of 99.999999999% (11 nines)
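The object size limits above translate directly into code; a hedged boto3 sketch with hypothetical bucket and key names (the actual call is commented out since it needs credentials):

```python
# Hypothetical bucket/key names; the 5 TB limit applies per object.
# boto3's upload_file switches to multipart upload automatically for
# large files.
bucket = "my-backup-bucket"                # bucket names are globally unique
key = "backups/2017-09-26/data.csv.gz"

# With credentials configured:
# import boto3
# s3 = boto3.client("s3")
# s3.upload_file("data.csv.gz", bucket, key)

# Per-object size bounds mentioned on the slide:
MIN_OBJECT_BYTES = 1
MAX_OBJECT_BYTES = 5 * 1024 ** 4           # 5 TB
```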
RDS
• Allows you to create and scale relational databases
• You cannot SSH or RDP into an RDS instance
• AWS does not give you a public or private IP address; instead it gives you an endpoint to connect to
• Available and Durable, Secure, Inexpensive
Route 53
• Highly available and scalable cloud DNS web service
• You can create
ü Both public and private DNS records
ü A records, which resolve names to IP addresses
ü CNAME records, which resolve one name to another
Direct Connect
• AWS Direct Connect makes it easy to establish a dedicated network connection from your premises to AWS
• Private connectivity to your Amazon VPC
• Provides 1 Gbps and 10 Gbps connections
IAM
• Enables you to securely control access to AWS services and resources for your users
• Offered at no additional charge
• Use permissions to allow and deny user/group access to AWS resources
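As an illustration of "use permissions to allow and deny access", here is a minimal read-only S3 policy document; the bucket name is hypothetical:

```python
import json

# A minimal IAM policy document (hypothetical bucket name) granting
# read-only access to one S3 bucket. Attach it to a user or group via
# the IAM console or API.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-data-bucket",    # bucket itself (ListBucket)
                "arn:aws:s3:::example-data-bucket/*",  # objects (GetObject)
            ],
        }
    ],
}

policy_json = json.dumps(read_only_policy)
```

Anything not explicitly allowed is implicitly denied, which is why this single Allow statement is enough for a least-privilege read-only user.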
EMR
• Managed Hadoop framework
• A fast and cost-effective way to process vast amounts of data across dynamically scalable Amazon EC2 instances
• Supported Applications
ü Hadoop, Hive, HUE, Pig, HBase, Zookeeper, Spark and more
• Cost optimization using Spot fleet
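A sketch of how Spot-based cost optimization shows up in an EMR cluster request, assuming boto3's `run_job_flow`; cluster name, release label, instance types and bid price are placeholders:

```python
# Hypothetical request shape for boto3's emr.run_job_flow(**cluster_config):
# On-Demand master/core groups plus a Spot task group for cheap burst capacity.
cluster_config = {
    "Name": "transient-etl-cluster",
    "ReleaseLabel": "emr-5.8.0",          # placeholder release
    "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m4.large", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "r4.xlarge", "InstanceCount": 4},
            {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
             "BidPrice": "0.10",          # placeholder USD/hour ceiling
             "InstanceType": "r4.xlarge", "InstanceCount": 8},
        ],
        # Transient cluster: terminate automatically when steps finish
        "KeepJobFlowAliveWhenNoSteps": False,
    },
}
```

Keeping HDFS-bearing core nodes On-Demand while putting stateless task nodes on Spot is the usual way to capture the discount without risking data loss on reclamation.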
CloudWatch
• Monitoring service for AWS resources and the applications you run on AWS
• Collect and track metrics, monitor log files, and set alarms
• View metrics for CPU utilization, data transfer, and disk usage activity from Amazon EC2 instances
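As an example of "set alarms", the parameters a CloudWatch alarm takes, assuming boto3's `put_metric_alarm`; the instance ID and SNS topic ARN are placeholders:

```python
# Hypothetical alarm definition for boto3's cloudwatch.put_metric_alarm(**alarm):
# notify an SNS topic when average EC2 CPU stays above 80% for 10 minutes.
alarm = {
    "AlarmName": "high-cpu-datanode",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-0abc123"}],  # placeholder
    "Statistic": "Average",
    "Period": 300,                 # seconds per datapoint
    "EvaluationPeriods": 2,        # 2 x 300 s = 10 minutes of breach
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
}
```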
CloudFormation
• Gives developers and system administrators an easy way to create and manage AWS resources, provisioning and updating them in an orderly and predictable fashion
SNS
• Fully managed messaging service
• Allows you to push messages directly to AWS resources
Bigdata Use Cases
• On-Demand Big Data Analytics
• Clickstream Analysis
• Data Warehousing
Architecture: US East Region
(Diagram: on-premises datacenter connected to the VPC through Virtual Private Gateways)
Persistent clusters
Choosing the right instance
• Memory Optimized – R3 and R4
• CPU Optimized - C3 and C4
• Storage Optimized – I2 and D2
Amazon Machine Image (AMI)
• Choosing the base AMI (Red Hat, CentOS)
• Create your own AMI
• AMI creation using Packer
Bigdata Distribution
• Apache, Cloudera and Hortonworks
RDS
For storing data related to Hive, HUE, Sentry and Cloudera Manager
S3
Creating buckets for backups
Kerberos
KDC for authentication of HDFS services
Route 53 Records
Cloudera Manager, HUE, YARN and other applications
Placement Groups
• Logical grouping of instances within a single Availability Zone
• Recommended for applications that benefit from low network latency
and high network throughput
• There is no charge for creating a placement group
Options to create Long-running clusters
• Puppet/CloudFormation
• Cloudera Director
üDeploy and manage the lifecycle of Cloudera Enterprise in the cloud
üIntegrated with AWS, Microsoft Azure and Google Cloud Platform
Cluster sizing example 1:
Raw Data – 100 TB
Data after Replication – 300 TB
25% Reserve for M/R Jobs
Cluster sizing example 2:
Raw Data – 15 TB
Data after Replication – 45 TB
25% Reserve for M/R Jobs
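The sizing figures above follow a simple rule of thumb, sketched below: HDFS triple-replicates raw data, and roughly 25% of total capacity is held back for MapReduce intermediate output.

```python
# Capacity rule of thumb: provisioned = raw * replication / (1 - reserve)
REPLICATION_FACTOR = 3      # HDFS default block replication
MR_RESERVE = 0.25           # fraction reserved for M/R intermediate data

def provisioned_tb(raw_tb: float) -> float:
    """Total storage to provision for a given amount of raw data."""
    replicated = raw_tb * REPLICATION_FACTOR
    return replicated / (1 - MR_RESERVE)

# 100 TB raw -> 300 TB replicated -> 400 TB provisioned
# 15 TB raw  ->  45 TB replicated ->  60 TB provisioned
```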
Security
Best practices for persistent/long-running cluster
• IAM policies for IAM users
• Integrate with AWS Directory Service for Microsoft AD so that users are in sync between the corporate network and AWS
• All instances should reside in private subnets, except for a NAT instance, which resides in a public subnet to occasionally connect to the internet, for example to set up local repositories or download packages
System Monitoring
• AWS CloudWatch and AWS SNS for infrastructure monitoring
• Cloudera Manager for monitoring CDH instances, services and clusters
• Nagios and Ganglia for cluster monitoring
Transient Clusters
EMR is a PaaS offering from Amazon for executing Hadoop and other big data workloads such as Hive, Pig, Spark, Presto, Mahout, Oozie and others.
Amazon S3 as your cluster persistent data store
Advantages of using S3 as data repository:
• S3 guarantees 99.99% data availability and 99.999999999% data durability
• Serves as a backup of the data
• Serves as a disaster recovery strategy, with an RTO (Recovery Time Objective) measured in hours
• Data residing in S3 can be leveraged for transient workloads
• S3 supports cross region replication for redundancy
• S3 supports versioning of data, which protects against accidental overwrites, and versioning with MFA prevents accidental deletion of objects
• Using lifecycle configuration, older data can be archived to Glacier
Advantages
• Data movement between S3 and EMR instances is faster than to general EC2 instances
• No separate licensing costs
• Less time to bootstrap clusters
• Cost optimization using Spot Instances
• Transient clusters are a better approach for transient workloads
• No single point of failure: the data is always consistently backed up in S3
• If the cluster performing a job goes down, another cluster can be instantiated to finish it
• Source of truth data in AWS S3
• No resource contention: high-priority production jobs don't have to wait for resources to be freed up by upstream production jobs
• Each job can get its own cluster
• Easily meet business SLAs
With the flexibility to provision capacity per use case, SLAs for various LOBs can be easily met
• Using Spot Instances gives approximately a 70% discount in most common use cases
Security
Best practices for securely accessing data in S3 by transient cluster(s)
• Define S3 bucket policies, which allow or deny specific actions
• Define S3 ACLs, which grant permissions at the bucket and object level
• Use IAM roles for instances to access data in S3 without exposing AWS access and secret keys
• Define a VPC endpoint for the S3 bucket so that data travels over a private connection between S3 and instances in the VPC (Virtual Private Cloud)
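Combining the last two points, a sketch of a bucket policy that denies access from outside a specific VPC endpoint, keeping cluster-to-S3 traffic on the private path; bucket name and endpoint ID are hypothetical:

```python
import json

# Hypothetical bucket policy: deny all S3 actions unless the request
# arrives through one specific VPC endpoint (aws:sourceVpce condition key).
vpce_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::example-lake",      # placeholder bucket
                "arn:aws:s3:::example-lake/*",
            ],
            "Condition": {
                "StringNotEquals": {
                    "aws:sourceVpce": "vpce-0fedcba9876543210"  # placeholder
                }
            },
        }
    ],
}

policy_json = json.dumps(vpce_only_policy)
```

Because an explicit Deny always overrides any Allow, this blocks access from the public internet even for principals that otherwise have S3 permissions.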
Thank you!
Upcoming Sessions
Amazon EC2, S3 and EMR – Sep 26
Cost Optimization with Spot Instances (EMR) – Oct 3
Deep Dive on EC2 and S3 – Oct 10
