SlideShare a Scribd company logo
1Page:
Effectively Deploying
Hadoop to the Cloud
By Avinash Ramineni
2Page:
Agenda
• Hadoop - 101
• Understanding the needs
• Amazon Web Services - 101
• Compute and Storage Types
• HA and DR options
• Run your own cluster ?
• Questions
3Page
Clairvoyant
4Page:
Clairvoyant Services
5Page:
Hadoop - Key Concepts to
understand
● Unlike CPU, RAM improvements the disk speed has not changed much in the last
10 years
● Hardware fails! Redundancy
● Data locality
● Commodity hardware: Hadoop prefers more average nodes than few supernodes
● Hadoop is a giant I/O platform: To counter disk slowness > more spindles per node
for parallelism > more network traffic > more data per node > longer re-replication
times
6Page:
Understanding your needs
Technical Use Case Hive (MR2 or Spark) Spark Hbase Impala
Data Processing o x x
Machine Learning x o
Operational Database (Fast read write
lookups)
o
Analytic Database (Ad-hoc queries) x x o
Most Common: o Common: x
7Page:
Understanding your needs
Processing Frameworks CPU Memory IO-Throughput IO-Latency
Hive / MR2 Heavy Light/Medium Most important but tends to
be cpu bound
Spark Medium Heavy ML - Low, ETL - High Streaming use cases
HBase Low Medium Medium Most important factor
Impala Medium (concurrency
will need more cores)
Medium Full scans Throughput matters over
latency for almost all use
cases
8Page:
Common Architectural Patterns
● Long-Running Clusters Vs Transient Clusters
● Patterns based on Storage Types
○ S3
○ EBS
○ Ephemeral
9Page:
Transient and Long-Running Clusters
10Page:
AWS 101
- AWS Compute Services
- EC2
- AWS Storage Services
- EBS, S3, EFS
- Instance/Ephemeral Storage
- AWS Relational Database services
- AWS Networking
- VPC , DirectConnect
11Page:
Storage Types - AWS, Azure, GCP
12Page:
Cloud Storage: Cost vs Latency vs Throughput
Image: Courtesy Cloudera
13Page:
Hadoop Clusters on AWS - Storage - Instance Storage
● Instance Storage / Ephemeral Storage
○ Temporary Storage added to certain EC2 instances (d2,i2..)
○ Storage physically attached to the host computer
○ Attached at instance launch time
○ Data is reset when instance is halted, stopped, terminated
○ Faster than EBS Volumes (IOPS and throughput)
○ No specific storage cost
● Best suited for fault-tolerant architectures like Hadoop ( Long running / transient clusters -
with S3 backed)
14Page:
Hadoop Clusters on AWS - Storage - EBS Volumes
● Elastic Block Storage
○ Storage persists independently from the life of EC2 instance
○ Network attached disks
○ Multiple types (IOPs are dependent upon size of the disk)
■ General Purpose SSD (gp2)
■ Provisioned IOPS SSD (io1)
■ Throughput Optimized (st1)
■ Cold HDD (sc1)
■ Magnetic (standard)
○ Support for snapshots.
○ Encrypted to protect the data in-transit and at-rest
15Page:
Hadoop Clusters on AWS - Storage - EBS Volumes
● EBS- Optimized instances
○ Dedicated bandwidth to EBS
● Minimum dedicated EBS Bandwidth - 1000 Mbps (125 MB / s)
○ 1 TB st1 / 3.2 TB sc1 : 40 MB/s
○ 4 - 1 TB st1 volumes for instance ?
● No more than 25 EBS volumes per instance
● Increase read-ahead for high throughput (sudo blockdev --settra 2048 /dev/<device>)
● EBS Volumes Baseline , Burst performance and Burst Credit bucket
16Page:
Hadoop Clusters on AWS - Instance Types
● Attach Ephemeral disks if they are available for the instances
● Pick instance types with enhanced networking like C4,C3,R3,R4,M4 instances
● CPU
○ 2 vCPU for OS ,1 vCPU for each master service
● Placement groups
● Support for different types of instances for worker nodes ?
17Page:
Sizing the cluster
● Sizing the cluster based on IOPs
○ data to be scanned at a time for the workloads
● Sizing the cluster based on CPU Usage
○ Executors/Mappers/reducers/parallelism required for the processing
● Sizing based on data
○ compression ratio (Snappy, LZOP, etc)
○ Replication factor
○ Initial size of data
○ Intermediate data factor, for temporary storage
○ Rate of data increase
● Sizing based on Memory
18Page:
Hadoop Clusters on AWS - Recommendations
● Master
○ Use gp2 volumes for master nodes
○ 100 GB volume for each HDFS metadata , Zookeeper
○ 1 vCPU for each master service
● Worker
○ 1 vCPU for each worker service
○ Use st1 volumes for data storage
● Recommended to use HVM AMIs over PV (paravirtualization)
● Cloud Watch Monitoring for disk usage
○ burst balance, volume read/write bytes, volume read/write ops, volume queue length
19Page:
Hadoop Clusters on AWS - HA , DR
● HA
○ HDFS HA, YARN HA, HBase HA, Hive HA
● AWS Regions
○ Hadoop nodes across Regions ?
● AWS Availability Zones
○ Hadoop nodes across Availability zones ?
● Recovery Point Objective and Recovery Time Objective
○ Backup data / metadata to S3
○ Separate DR cluster
■ Parallel Ingest
■ Parallel Compute
■ Distcp/ Cloudera BDR
20Page:
Hadoop Clusters on AWS - Security
● Hadoop Security
○ Kerberos
○ Sentry / Ranger / Navigator / Atlas
● S3
○ IAM Roles
○ Server side encryption
● EBS volume encryption
● HDFS transparent encryption
○ encryption zones
○ navigator encrypt
21Page:
Typical Deployment Layout on AWS
22Page:
Tiny / Small / Medium / Large?
Tiny: 1-9 Small: 10-99 Medium: 100-999 Large: 1000+
MINIMAL HA CLUSTER
Cloudera Manager - (1) Master Nodes A - (2) Master Node B - (1) Worker Nodes -
(4)
Security Node A - (1) Security Node B - (1)
Alert Publisher
Event Server
Host Monitor
Navigator Audit Server
Navigator Metadata
Server
Reports Manager
Service Monitor
CM Database
ZooKeeper
HDFS JournalNode
HDFS Failover Controller
HDFS NameNode
YARN Resource
Manager
Hive Metastore
HiveServer2
Hue
Oozie
HBase Master
HDFS Gateway
YARN Gateway
Spark Gateway
Hive Gateway
Sqoop Gateway
ZooKeeper
HDFS JournalNode
HDFS Balancer
Impala Catalog Server
Impala StateStore
YARN History Server
Spark History Server
HBase/SOLR Key Value
Indexer
Sentry
Key Trustee KMS
HDFS Gateway
YARN Gateway
Spark Gateway
Hive Gateway
Sqoop Gateway
HDFS DataNode
YARN
NodeManager
Impala Impalad
HBase
RegionServer
SOLR Server
Kafka Broker
Flume Agent
KDC
LDAP
Private Certificate Authority
Active Key Trustee Server
Key Trustee Server Active
Database
KDC
LDAP
Passive Key Trustee
Server
Key Trustee Server
Passive Database
Cloudera licensed services / Orange is for Client Gateway roles / Blue is for Ingest services /Purple is for RDBMS
database.
Hadoop Cluster Security Cluster
23Page:
Pricing
24Page:
Run your own cluster?
● Amazon S3, Kinesis, Lambda, Athena, Databricks
● GCP -- Google Data flow
● Azure
● Cloudera Altus
● Databricks
Image: Courtesy Amazon
25Page:
AWS Big Data
Image: Courtesy Amazon
26Page:
Azure Big Data
Image: Courtesy Microsoft
27Page:
Google Big Data
Image: Courtesy Google
28Page:
It’s a tricky job
29Page:
Feedback & Questions
Effectively Deploying Hadoop to the Cloud
30Page:
Thank You!
Avinash Ramineni
Principal, Clairvoyant
avinash@clairvoyantsoft.com
https://www.linkedin.com/in/avinashramineni/

More Related Content

What's hot

Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDB
MongoDB
 
Ceph Research at UCSC
Ceph Research at UCSCCeph Research at UCSC
Ceph Research at UCSC
Ceph Community
 
Ceph Object Storage at Spreadshirt
Ceph Object Storage at SpreadshirtCeph Object Storage at Spreadshirt
Ceph Object Storage at Spreadshirt
Jens Hadlich
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News Aggregator
Mário Almeida
 
Propelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive AnalyticsPropelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive Analytics
SingleStore
 
Devopsconf 2015 sebamontini
Devopsconf 2015 sebamontiniDevopsconf 2015 sebamontini
Devopsconf 2015 sebamontini
Sebastian Montini
 
Using Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider DataUsing Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider Data
Rob Gardner
 
Hadoop over rgw
Hadoop over rgwHadoop over rgw
Hadoop over rgw
zhouyuan
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
An intro to Ceph and big data - CERN Big Data Workshop
An intro to Ceph and big data - CERN Big Data WorkshopAn intro to Ceph and big data - CERN Big Data Workshop
An intro to Ceph and big data - CERN Big Data Workshop
Patrick McGarry
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series World
MapR Technologies
 
Kafka website activity architecture
Kafka website activity architectureKafka website activity architecture
Kafka website activity architecture
Omid Vahdaty
 
Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...
Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...
Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...
TomBarron
 
RubiX
RubiXRubiX
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
wang xing
 
Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...
Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...
Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...
Lviv Startup Club
 
Ceph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud worldCeph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud world
Sage Weil
 
New Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference ArchitecturesNew Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference Architectures
Kamesh Pemmaraju
 

What's hot (18)

Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDB
 
Ceph Research at UCSC
Ceph Research at UCSCCeph Research at UCSC
Ceph Research at UCSC
 
Ceph Object Storage at Spreadshirt
Ceph Object Storage at SpreadshirtCeph Object Storage at Spreadshirt
Ceph Object Storage at Spreadshirt
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News Aggregator
 
Propelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive AnalyticsPropelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive Analytics
 
Devopsconf 2015 sebamontini
Devopsconf 2015 sebamontiniDevopsconf 2015 sebamontini
Devopsconf 2015 sebamontini
 
Using Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider DataUsing Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider Data
 
Hadoop over rgw
Hadoop over rgwHadoop over rgw
Hadoop over rgw
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
An intro to Ceph and big data - CERN Big Data Workshop
An intro to Ceph and big data - CERN Big Data WorkshopAn intro to Ceph and big data - CERN Big Data Workshop
An intro to Ceph and big data - CERN Big Data Workshop
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series World
 
Kafka website activity architecture
Kafka website activity architectureKafka website activity architecture
Kafka website activity architecture
 
Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...
Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...
Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...
 
RubiX
RubiXRubiX
RubiX
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...
Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...
Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...
 
Ceph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud worldCeph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud world
 
New Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference ArchitecturesNew Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference Architectures
 

Similar to Effectively deploying hadoop to the cloud

AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
Omid Vahdaty
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Omid Vahdaty
 
In-memory database
In-memory databaseIn-memory database
In-memory database
Chien Nguyen Dang
 
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
Rick Hwang
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
Narayana B
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
Amazon Web Services
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
Edureka!
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
Junping Du
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
Aerospike Architecture
Aerospike ArchitectureAerospike Architecture
Aerospike Architecture
Peter Milne
 
Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
PostgreSQL Experts, Inc.
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
Adam Doyle
 
Backup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Backup management with Ceph Storage - Camilo Echevarne, Félix BarbeiraBackup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Backup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Ceph Community
 
OpenStack Cinder, Implementation Today and New Trends for Tomorrow
OpenStack Cinder, Implementation Today and New Trends for TomorrowOpenStack Cinder, Implementation Today and New Trends for Tomorrow
OpenStack Cinder, Implementation Today and New Trends for Tomorrow
Ed Balduf
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data Processing
Sam Ng
 
Red Hat Gluster Storage Performance
Red Hat Gluster Storage PerformanceRed Hat Gluster Storage Performance
Red Hat Gluster Storage Performance
Red_Hat_Storage
 
OSDC 2015: John Spray | The Ceph Storage System
OSDC 2015: John Spray | The Ceph Storage SystemOSDC 2015: John Spray | The Ceph Storage System
OSDC 2015: John Spray | The Ceph Storage System
NETWAYS
 

Similar to Effectively deploying hadoop to the cloud (20)

AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
In-memory database
In-memory databaseIn-memory database
In-memory database
 
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Aerospike Architecture
Aerospike ArchitectureAerospike Architecture
Aerospike Architecture
 
Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices Workshop
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
Backup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Backup management with Ceph Storage - Camilo Echevarne, Félix BarbeiraBackup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Backup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
 
OpenStack Cinder, Implementation Today and New Trends for Tomorrow
OpenStack Cinder, Implementation Today and New Trends for TomorrowOpenStack Cinder, Implementation Today and New Trends for Tomorrow
OpenStack Cinder, Implementation Today and New Trends for Tomorrow
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data Processing
 
Red Hat Gluster Storage Performance
Red Hat Gluster Storage PerformanceRed Hat Gluster Storage Performance
Red Hat Gluster Storage Performance
 
OSDC 2015: John Spray | The Ceph Storage System
OSDC 2015: John Spray | The Ceph Storage SystemOSDC 2015: John Spray | The Ceph Storage System
OSDC 2015: John Spray | The Ceph Storage System
 

More from Avinash Ramineni

Simplifying the data privacy governance quagmire building automated privacy ...
Simplifying the data privacy governance quagmire  building automated privacy ...Simplifying the data privacy governance quagmire  building automated privacy ...
Simplifying the data privacy governance quagmire building automated privacy ...
Avinash Ramineni
 
Winning the war on data breaches in a changing data landscape
Winning the war on data breaches in a changing data landscapeWinning the war on data breaches in a changing data landscape
Winning the war on data breaches in a changing data landscape
Avinash Ramineni
 
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Avinash Ramineni
 
Building zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafkaBuilding zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafka
Avinash Ramineni
 
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Avinash Ramineni
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
Avinash Ramineni
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015
Avinash Ramineni
 
Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash Ramineni
Avinash Ramineni
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
Avinash Ramineni
 
Event Driven Architectures
Event Driven ArchitecturesEvent Driven Architectures
Event Driven Architectures
Avinash Ramineni
 

More from Avinash Ramineni (10)

Simplifying the data privacy governance quagmire building automated privacy ...
Simplifying the data privacy governance quagmire  building automated privacy ...Simplifying the data privacy governance quagmire  building automated privacy ...
Simplifying the data privacy governance quagmire building automated privacy ...
 
Winning the war on data breaches in a changing data landscape
Winning the war on data breaches in a changing data landscapeWinning the war on data breaches in a changing data landscape
Winning the war on data breaches in a changing data landscape
 
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
 
Building zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafkaBuilding zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafka
 
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015
 
Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash Ramineni
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
 
Event Driven Architectures
Event Driven ArchitecturesEvent Driven Architectures
Event Driven Architectures
 

Recently uploaded

GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 

Recently uploaded (20)

GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 

Effectively deploying hadoop to the cloud

  • 1. 1Page: Effectively Deploying Hadoop to the Cloud By Avinash Ramineni
  • 2. 2Page: Agenda • Hadoop - 101 • Understanding the needs • Amazon Web Services - 101 • Compute and Storage Types • HA and DR options • Run your own cluster ? • Questions
  • 5. 5Page: Hadoop - Key Concepts to understand ● Unlike CPU, RAM improvements the disk speed has not changed much in the last 10 years ● Hardware fails! Redundancy ● Data locality ● Commodity hardware: Hadoop prefers more average nodes than few supernodes ● Hadoop is a giant I/O platform: To counter disk slowness > more spindles per node for parallelism > more network traffic > more data per node > longer re-replication times
  • 6. 6Page: Understanding your needs Technical Use Case Hive (MR2 or Spark) Spark Hbase Impala Data Processing o x x Machine Learning x o Operational Database (Fast read write lookups) o Analytic Database (Ad-hoc queries) x x o Most Common: o Common: x
  • 7. 7Page: Understanding your needs Processing Frameworks CPU Memory IO-Throughput IO-Latency Hive / MR2 Heavy Light/Medium Most important but tends to be cpu bound Spark Medium Heavy ML - Low, ETL - High Streaming use cases HBase Low Medium Medium Most important factor Impala Medium (concurrency will need more cores) Medium Full scans Throughput matters over latency for almost all use cases
  • 8. 8Page: Common Architectural Patterns ● Long-Running Clusters Vs Transient Clusters ● Patterns based on Storage Types ○ S3 ○ EBS ○ Ephemeral
  • 10. 10Page: AWS 101 - AWS Compute Services - EC2 - AWS Storage Services - EBS, S3, EFS - Instance/Ephemeral Storage - AWS Relational Database services - AWS Networking - VPC , DirectConnect
  • 11. 11Page: Storage Types - AWS, Azure, GCP
  • 12. 12Page: Cloud Storage: Cost vs Latency vs Throughput Image: Courtesy Cloudera
  • 13. 13Page: Hadoop Clusters on AWS - Storage - Instance Storage ● Instance Storage / Ephemeral Storage ○ Temporary Storage added to certain EC2 instances (d2,i2..) ○ Storage physically attached to the host computer ○ Attached at instance launch time ○ Data is reset when instance is halted, stopped, terminated ○ Faster than EBS Volumes (IOPS and throughput) ○ No specific storage cost ● Best suited for fault-tolerant architectures like Hadoop ( Long running / transient clusters - with S3 backed)
  • 14. 14Page: Hadoop Clusters on AWS - Storage - EBS Volumes ● Elastic Block Storage ○ Storage persists independently from the life of EC2 instance ○ Network attached disks ○ Multiple types (IOPs are dependent upon size of the disk) ■ General Purpose SSD (gp2) ■ Provisioned IOPS SSD (io1) ■ Throughput Optimized (st1) ■ Cold HDD (sc1) ■ Magnetic (standard) ○ Support for snapshots. ○ Encrypted to protect the data in-transit and at-rest
  • 15. 15Page: Hadoop Clusters on AWS - Storage - EBS Volumes ● EBS- Optimized instances ○ Dedicated bandwidth to EBS ● Minimum dedicated EBS Bandwidth - 1000 Mbps (125 MB / s) ○ 1 TB st1 / 3.2 TB sc1 : 40 MB/s ○ 4 - 1 TB st1 volumes for instance ? ● No more than 25 EBS volumes per instance ● Increase read-ahead for high throughput (sudo blockdev --settra 2048 /dev/<device>) ● EBS Volumes Baseline , Burst performance and Burst Credit bucket
  • 16. 16Page: Hadoop Clusters on AWS - Instance Types ● Attach Ephemeral disks if they are available for the instances ● Pick instance types with enhanced networking like C4,C3,R3,R4,M4 instances ● CPU ○ 2 vCPU for OS ,1 vCPU for each master service ● Placement groups ● Support for different types of instances for worker nodes ?
  • 17. 17Page: Sizing the cluster ● Sizing the cluster based on IOPs ○ data to be scanned at a time for the workloads ● Sizing the cluster based on CPU Usage ○ Executors/Mappers/reducers/parallelism required for the processing ● Sizing based on data ○ compression ratio (Snappy, LZOP, etc) ○ Replication factor ○ Initial size of data ○ Intermediate data factor, for temporary storage ○ Rate of data increase ● Sizing based on Memory
  • 18. 18Page: Hadoop Clusters on AWS - Recommendations ● Master ○ Use gp2 volumes for master nodes ○ 100 GB volume for each HDFS metadata , Zookeeper ○ 1 vCPU for each master service ● Worker ○ 1 vCPU for each worker service ○ Use st1 volumes for data storage ● Recommended to use HVM AMIs over PV (paravirtualization) ● Cloud Watch Monitoring for disk usage ○ burst balance, volume read/write bytes, volume read/write ops, volume queue length
  • 19. 19Page: Hadoop Clusters on AWS - HA , DR ● HA ○ HDFS HA, YARN HA, HBase HA, Hive HA ● AWS Regions ○ Hadoop nodes across Regions ? ● AWS Availability Zones ○ Hadoop nodes across Availability zones ? ● Recovery Point Objective and Recovery Time Objective ○ Backup data / metadata to S3 ○ Separate DR cluster ■ Parallel Ingest ■ Parallel Compute ■ Distcp/ Cloudera BDR
  • 20. 20Page: Hadoop Clusters on AWS - Security ● Hadoop Security ○ Kerberos ○ Sentry / Ranger / Navigator / Atlas ● S3 ○ IAM Roles ○ Server side encryption ● EBS volume encryption ● HDFS transparent encryption ○ encryption zones ○ navigator encrypt
  • 22. 22Page: Tiny / Small / Medium / Large? Tiny: 1-9 Small: 10-99 Medium: 100-999 Large: 1000+ MINIMAL HA CLUSTER Cloudera Manager - (1) Master Nodes A - (2) Master Node B - (1) Worker Nodes - (4) Security Node A - (1) Security Node B - (1) Alert Publisher Event Server Host Monitor Navigator Audit Server Navigator Metadata Server Reports Manager Service Monitor CM Database ZooKeeper HDFS JournalNode HDFS Failover Controller HDFS NameNode YARN Resource Manager Hive Metastore HiveServer2 Hue Oozie HBase Master HDFS Gateway YARN Gateway Spark Gateway Hive Gateway Sqoop Gateway ZooKeeper HDFS JournalNode HDFS Balancer Impala Catalog Server Impala StateStore YARN History Server Spark History Server HBase/SOLR Key Value Indexer Sentry Key Trustee KMS HDFS Gateway YARN Gateway Spark Gateway Hive Gateway Sqoop Gateway HDFS DataNode YARN NodeManager Impala Impalad HBase RegionServer SOLR Server Kafka Broker Flume Agent KDC LDAP Private Certificate Authority Active Key Trustee Server Key Trustee Server Active Database KDC LDAP Passive Key Trustee Server Key Trustee Server Passive Database Cloudera licensed services / Orange is for Client Gateway roles / Blue is for Ingest services /Purple is for RDBMS database. Hadoop Cluster Security Cluster
  • 24. 24Page: Run your own cluster? ● Amazon S3, Kinesis, Lambda, Athena, Databricks ● GCP -- Google Data flow ● Azure ● Cloudera Altus ● Databricks Image: Courtesy Amazon
  • 25. 25Page: AWS Big Data Image: Courtesy Amazon
  • 26. 26Page: Azure Big Data Image: Courtesy Microsoft
  • 29. 29Page: Feedback & Questions Effectively Deploying Hadoop to the Cloud
  • 30. 30Page: Thank You! Avinash Ramineni Principal, Clairvoyant avinash@clairvoyantsoft.com https://www.linkedin.com/in/avinashramineni/