2. Yelp was able to save $55,000 in upfront hardware costs.
Unilever processes genetic sequences 20 times faster.
Swipely generates insights from millions of credit card transactions.
Expedia processes clickstream data from its global network of websites.
6. What is Cloud Computing?
• Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
- NIST definition
• This cloud model is composed of five essential characteristics, three service models, and four deployment models.
8. Service Models:
IaaS providers: AWS, HP Cloud, Rackspace
PaaS providers: Google App Engine, Heroku, Red Hat OpenShift
SaaS providers: Salesforce, LinkedIn, Taleo
9. Delivery Models:
• Public Cloud
• Private Cloud
• Hybrid Cloud
• Community Cloud*
* NIST defines a community cloud as cloud infrastructure provisioned for exclusive use by a specific community of consumers from organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations).
10. Now, a few questions:
1. Which service model does AWS fall into?
2. What are the advantages of using a cloud platform for Big Data?
3. How does AWS leverage those advantages to provide Big Data analytics?
11. Advantages of a Cloud Platform
• Ability to scale the infrastructure
• OPEX instead of CAPEX
• Custom solutions as per need
• Easier, faster deployment
• Helps focus on core business solutions/analytics
So it can safely be said that the cloud platform acts as an enabler of Big Data technology.
15. Hadoop as a Service
• Amazon Elastic MapReduce (EMR) supports the Hadoop software ecosystem (Hadoop 1.x, Hadoop 2.x).
• The Amazon EMR control software is responsible for the automated arrangement, coordination, and management of the Hadoop cluster.
• Amazon Elastic MapReduce also supports MapR, an Apache Hadoop-derived distribution.
16. Integrated with Tools
Amazon EMR gives you root access to the cluster.
Additional software can be installed and configured on the cluster before Hadoop starts by creating a bootstrap action.
* Spark is installed using bootstrapping.
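A bootstrap action is just a script stored in S3 that EMR runs on each node before Hadoop starts. A minimal sketch, assuming a hypothetical bucket and package names (the exact software you install will differ):

```shell
#!/bin/bash
# Hypothetical bootstrap action: EMR runs this on every node before Hadoop starts.
set -e

# Install an extra package needed by the jobs (package name is an example).
sudo yum install -y htop

# Copy a custom configuration file from S3 (bucket and path are placeholders).
aws s3 cp s3://my-bucket/conf/extra-site.xml /home/hadoop/conf/
```

The script would be uploaded to S3 and registered at cluster creation, e.g. with `--bootstrap-actions Path=s3://my-bucket/scripts/install.sh` on the AWS CLI; the bucket path here is illustrative, not from the deck.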
17. MapReduce Engine
• Job/Task
• Roles of servers:
  a) Master Node
  b) Core Node
  c) Task Node
• Step: unit of work
The MapReduce engine implements the distributed processing framework of Hadoop.
18. MapReduce Engine (cont.)
Hadoop → AWS
Name Node → Master Node
Data Node → Core Node
Additional concepts of Task Node and Steps:
Task Node: Task nodes are optional. You can add task nodes when you start the cluster, or add task groups to a running cluster. Because they do not store data and can be added to and removed from a cluster, you can use task nodes to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later.
Steps: A step contains one or more Hadoop jobs; it is an instruction to manipulate data using Hadoop jobs.
The maximum number of pending and active steps allowed in a cluster is 256.
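Steps can be submitted to a running cluster from the AWS CLI. A sketch, assuming a placeholder cluster ID and a hypothetical job JAR in S3:

```shell
# Submit a step (a unit of work wrapping a Hadoop job) to a running cluster.
# The cluster ID, JAR location, and arguments below are placeholders.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name=WordCount,ActionOnFailure=CONTINUE,Jar=s3://my-bucket/jars/wordcount.jar,Args=[s3://my-bucket/input,s3://my-bucket/output]
```

`ActionOnFailure=CONTINUE` lets later steps run even if this one fails; other options include terminating the cluster on failure.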
19. Massively Parallel
• Virtual instances: much easier to scale
• Quick and cost-effective scaling
• Dynamic resizing while the job is running
• A distributed Hadoop system in the true sense
• Multiple clusters can access the same data
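The dynamic resizing mentioned above can be done from the AWS CLI by changing the instance count of an instance group (typically the task group, since task nodes hold no data). A sketch with a placeholder instance-group ID:

```shell
# Resize a running cluster by changing the instance count of a task group.
# The instance-group ID is a placeholder; it can be looked up with
# `aws emr describe-cluster --cluster-id <id>`.
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXXX,InstanceCount=8
```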
20. Cost-Effective AWS Wrapper
• Spot Instances
• Pay as you go
• Automatic cluster termination after job completion
• Licensed software bundled with infrastructure
• Economy of scale
21. Integrated with AWS Services
• Amazon EMR is integrated with other Amazon Web Services such as Amazon EC2, Amazon S3, DynamoDB, Amazon RDS, CloudWatch, and AWS Data Pipeline.
• Easily access data stored in AWS from the EMR cluster, and make use of the functionality offered by other Amazon Web Services to manage your cluster and store its output.
Compute
• EC2
Networking
• VPC
• ELB
• Route 53
Storage
• EBS
• S3
• Glacier
Data Services
• RDS
• DynamoDB
• Redshift
Deployment and Management
• AWS Management Console
• AWS Command Line Interface
• AWS IAM
• CloudWatch
26. • Provide a cluster name for easier identification.
• Set Termination Protection to 'Yes' to prevent accidental termination of the cluster.
• Enable logging, which leads to automatic logging of cluster activity.
• Provide an S3 folder location for the logs.
• Enable debugging so that any troubleshooting of cluster activity can be done.
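The same options shown in the console can be set when creating a cluster from the AWS CLI. A sketch, where the cluster name, S3 log bucket, release label, instance types, and key pair are all placeholder values:

```shell
# Create a cluster with the options discussed above: a name, termination
# protection, an S3 log folder, and debugging. All values are placeholders.
aws emr create-cluster \
  --name "analytics-cluster" \
  --termination-protected \
  --log-uri s3://my-bucket/emr-logs/ \
  --enable-debugging \
  --release-label emr-4.0.0 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair
```

Note that `--enable-debugging` requires `--log-uri` to be set, since the debugging tool reads the logs written to that S3 location.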
27. • Tags are an optional feature, but using them is always encouraged.
• A tag is a key/value pair that gets associated with every resource in the cluster.
• Tags help in monitoring and managing cluster resources easily.
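Tags can also be attached to an existing cluster from the AWS CLI. A sketch, where the cluster ID and the tag keys/values are examples, not prescribed names:

```shell
# Attach key/value tags to a running cluster; EMR propagates them to the
# cluster's EC2 instances. Cluster ID and tag values are placeholders.
aws emr add-tags \
  --resource-id j-XXXXXXXXXXXXX \
  --tags team=analytics env=dev
```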