Capacity Planning
Big Data Solution
Hello!
I am Riyaz A Shaikh
Full Stack Architect
You can find me at:
@jf @rizAShaikh
Riyaz A Shaikh
www.riyazshaikh.com
Requirement
Need to set up an analytical and alerting system on data
produced by 10,000 servers. Assuming 10 million events are
generated per day across all servers, roughly 50 GB of data
per day.
Big Data Cluster
Considering the Hortonworks Hadoop distribution for the
cluster setup, with the following components:
• HDFS for data backup in compressed format
• Spark for data computation and transformation
• Apache Kafka as the messaging service for data completeness
• Flume for data capture
• Elasticsearch for analytical data storage and search
• Kibana for data visualization
Kafka cluster capacity

Assumption | Size in GB | Rationale
Daily average raw data ingest rate | 50 |
Kafka retention period of 2 days | 100 | Raw data * retention period
Kafka replication factor of 3 | 300 | Raw data * retention period * replication factor
Storage per day | 300 GB |
Storage per month | n/a | This is a staging layer; a monthly calculation is not required because data is auto-purged after the retention period.

Table 1
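As a quick cross-check of Table 1, a minimal Python sketch of the same arithmetic (the figures are the assumptions stated above; the function name is illustrative only):

    # Kafka broker storage = daily raw ingest * retention period * replication factor (Table 1)
    def kafka_storage_gb(daily_ingest_gb, retention_days, replication_factor):
        return daily_ingest_gb * retention_days * replication_factor

    print(kafka_storage_gb(50, 2, 3))  # -> 300 (GB held on the brokers at any time)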
Elasticsearch cluster capacity

Assumption | Size in GB | Rationale | Remarks
Daily average raw data ingest rate | 50 | |
Elasticsearch 3 shards | 50 | | Shards are index splits; no extra space required
Elasticsearch 3 replicas | 150 | Raw data * replicas | Each shard is replicated 3 times
Storage per day | 150 GB | |
Storage per month | 4500 GB | Per day * 30 | 4.5 TB per month

Table 2
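The same Elasticsearch sizing as a sketch, using the Table 2 assumptions (3 replicated copies of the raw data, 30 days per month); the names are illustrative:

    # Elasticsearch storage: sharding adds no space, replicas multiply the raw data (Table 2)
    def es_storage_gb(daily_ingest_gb, replicas, days_per_month=30):
        per_day = daily_ingest_gb * replicas   # 50 GB * 3 = 150 GB/day
        per_month = per_day * days_per_month   # 150 GB * 30 = 4500 GB/month
        return per_day, per_month

    print(es_storage_gb(50, 3))  # -> (150, 4500)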
HDFS to back up Elasticsearch data

Assumption | Size in GB | Rationale | Remarks
Daily average raw data ingest rate | 50 | |
HDFS replication factor of 3 | 150 | Raw data * replication factor |
70% compression | 45 | 150 - (150 * 70 / 100) | LZO compression
Storage per day | 45 GB | |
Storage per month | 1350 GB | Per day * 30 | 1.35 TB per month

Table 3
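And the HDFS backup sizing from Table 3 as a sketch (3x replication, then the 70% LZO compression saving applied; function and parameter names are illustrative):

    # HDFS backup storage: replicate the raw data, then apply the compression saving (Table 3)
    def hdfs_storage_gb(daily_ingest_gb, replication_factor, compression_pct, days_per_month=30):
        replicated = daily_ingest_gb * replication_factor             # 50 * 3 = 150 GB
        per_day = replicated - (replicated * compression_pct / 100)   # 150 - 105 = 45 GB
        return per_day, per_day * days_per_month                      # (45, 1350)

    print(hdfs_storage_gb(50, 3, 70))  # -> (45.0, 1350.0)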
Typical node structure
Table 4

Node structure | Size | Remarks
Typical per data node storage capacity | 4 TB | 2 x 2 TB HDD
Temp space for processing by Spark, MapReduce, etc. | 1 TB | 25% of the data node's raw storage
Data node usable storage | 3 TB | Raw storage minus Spark reserve
Considering the storage capacity from the three tables above
(Table 1, Table 2 and Table 3), the total storage required per
month is 300 GB + 4500 GB + 1350 GB = 6150 GB (approx. 6.15 TB).
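Pulling the three monthly figures together, as a sketch (Kafka is counted as its fixed 300 GB staging footprint rather than a monthly-growing figure, per Table 1):

    # Total monthly storage = Kafka staging footprint + Elasticsearch per month + HDFS backup per month
    kafka_gb = 300        # Table 1: fixed footprint, auto-purged after the 2-day retention
    es_month_gb = 4500    # Table 2
    hdfs_month_gb = 1350  # Table 3
    total_gb = kafka_gb + es_month_gb + hdfs_month_gb
    print(total_gb, total_gb / 1000)  # -> 6150 GB, approx. 6.15 TB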
Assuming 10% data growth per quarter and, further, 15%
year-on-year growth in data volume, Table 5 below indicates the
capacity required as data grows year-on-year.
Capacity growth year-on-year
Table 5

10% data growth quarterly (data in TB)

Quarter | Year 1 | Year 2 | Year 3 | Year 4 | Year 5
Q1 | 6.15 | 9.4 | 12.5 | 16.7 | 22.2
Q2 | 6.8 | 9.9 | 13.2 | 17.5 | 23.3
Q3 | 7.4 | 10.4 | 13.8 | 18.4 | 24.5
Q4 | 8.2 | 10.9 | 14.5 | 19.3 | 25.7
Yearly storage | 28.5 | 40.6 | 54.0 | 71.9 | 95.7
Data nodes required (yearly storage / data node usable storage) | 10 | 14 | 18 | 24 | 32
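The data-node row can be cross-checked with a short sketch, dividing each year's storage by the 3 TB usable capacity per node from Table 4 and rounding up:

    import math

    # Data nodes required = ceil(yearly storage / usable storage per data node), per Table 5
    usable_tb_per_node = 3                               # Table 4
    yearly_storage_tb = [28.5, 40.6, 54.0, 71.9, 95.7]   # Table 5
    print([math.ceil(tb / usable_tb_per_node) for tb in yearly_storage_tb])  # -> [10, 14, 18, 24, 32]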
Hardware Specs
Considering one year of storage on ten data nodes, with one
NameNode and one standby NameNode.
Tables 6 and 7 show the hardware configuration of each machine.
Typical worker node hardware configuration
Table 6

Midline configuration (Data Node)
CPU | 2 × 8-core, 2.9 GHz
Memory | 64 GB DDR3-1600 ECC
Disk controller | SAS 6 Gb/s
Disks | 5 × 1 TB LFF SATA II 7200 RPM (1 TB for the OS)
Network controller | 2 × 1 Gb Ethernet

Notes
CPU features such as Intel's Hyper-Threading and QPI are desirable.
Allocate memory to take advantage of triple- or quad-channel
memory configurations.
Typical NameNode hardware configuration
Table 7

NameNode configuration
CPU | 2 × 8-core, 2.9 GHz
Memory | 128 GB
Disk controller | RAID 1
Disks | 4 × 1 TB (1 for the OS, 2 for the FS image, and 1 for the JournalNode)
Network controller | 2 × 1 Gb Ethernet

Notes
CPU features such as Intel's Hyper-Threading and QPI are desirable.
Allocate memory to take advantage of triple- or quad-channel
memory configurations.
Thanks!
Questions and feedback are welcome.
Write to me at:
@rizAShaikh
Shaikh.r.a@gmail.com
