4. Bounded Storage and Compute Resources on Hadoop* Nodes Bring Challenges
Source: 451 Research, Voice of the Enterprise: Storage Q4 2015
Survey-reported issues: data/capacity; inadequate performance; space, spend, power, utilization; multiple storage silos; upgrade cost
Typical challenges of scaling Hadoop* storage:
• Costs
• Provisioning and configuration
• Performance & efficiency
• Data capacity silos
*Other names and brands may be claimed as the property of others.
5. Options to Address the Challenges

Large Cluster
• Lacks isolation - noisy neighbors hinder SLAs
• Lacks elasticity - rigid cluster size
• Can't scale compute/storage costs separately

More Clusters
• Cost of duplicating datasets across clusters
• Lacks on-demand provisioning
• Can't scale compute/storage costs separately

Compute and Storage Disaggregation
• Isolation of high-priority workloads
• Shared big datasets
• On-demand provisioning
• Compute/storage costs scale separately

Compute and storage disaggregation provides simplicity, elasticity, and isolation.
6. Unified Hadoop* File System and API for Cloud Storage
Cloud storage connectors introduced over 2006-2016: s3://, s3n://, s3a://, wasb://, gs://, adl://, oss://
The Hadoop Compatible File System (HCFS) abstraction layer provides a unified storage API interface, e.g.:
hadoop fs -ls s3a://job/
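The HCFS idea above, one client API over many backends selected by URI scheme, can be sketched as follows. This is an illustrative model only, not Hadoop's actual `FileSystem.get()` dispatch; the connector table simply mirrors the schemes listed on this slide.

```python
from urllib.parse import urlparse

# One client API, many storage backends, selected by URI scheme (the HCFS idea).
# Illustrative sketch; connector descriptions mirror the schemes on this slide.
CONNECTORS = {
    "s3": "S3 block store (legacy)",
    "s3n": "S3 native",
    "s3a": "S3A",
    "wasb": "Azure Blob Storage",
    "gs": "Google Cloud Storage",
    "adl": "Azure Data Lake",
    "oss": "Aliyun OSS",
}

def resolve_connector(uri: str) -> str:
    """Pick the filesystem connector for a Hadoop path by its scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in CONNECTORS:
        raise ValueError(f"no connector registered for scheme '{scheme}'")
    return CONNECTORS[scheme]
```

This is why `hadoop fs -ls s3a://job/` works unchanged against any supported store: only the scheme changes, not the client API.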
7. Disaggregated Object Storage Cluster
Proposal: Apache Hadoop* with disaggregated object storage
[Diagram: SQL and other Hadoop services run on Compute 1 … Compute N, which access Object Storage 1 … Object Storage N through HCFS]

Hadoop Services
• Virtual machine
• Container
• Bare metal

Object Storage Services
• Co-located with gateway
• Dynamic DNS or load balancer
• Data protection via storage replication or erasure coding
• Storage tiering

*Other names and brands may be claimed as the property of others.
9. Workloads
• Simple Read/Write
• DFSIO: TestDFSIO is the canonical benchmark for measuring the storage's capacity for reading and writing bulk data.
• Terasort: a popular benchmark that measures the time to sort one terabyte of randomly distributed data on a given computer system.
• Data Transformation
• ETL: taking data as it is originally generated and transforming it to a format (Parquet, ORC) that is better tuned for analytical workloads.
• Batch Analytics
• Consistently executing analytical processes over large datasets.
• Leveraging 54 queries derived from TPC-DS*, with intensive reads across objects in different buckets.
14. Background
• The S3A filesystem client supports the notion of input policies, similar to the POSIX fadvise() API call. This tunes the behavior of the S3A client to optimize HTTP GET requests for various use cases. To optimize HTTP GET requests, you can take advantage of the experimental S3A input policy fs.s3a.experimental.input.fadvise.
• Ticket: https://issues.apache.org/jira/browse/HADOOP-13203
Optimizing HTTP Requests -- S3A input policy
A seek in an IO stream computes diff = targetPos - pos:
• diff = 0: sequential read
• diff > 0: skip forward within the stream
• diff < 0: seek backward; close the stream and open a connection again
Solution
Enable the random read policy in core-site.xml:
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
<property>
  <name>fs.s3a.readahead.range</name>
  <value>64K</value>
</property>
• By reducing the cost of closing existing HTTP requests, this is highly efficient for IO that accesses a binary file through a series of `PositionedReadable.read()` and `PositionedReadable.readFully()` calls.
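The seek decision described above can be modeled in a few lines of Python. This is a simplified sketch, not the actual S3AInputStream code; the readahead threshold used for the forward-skip case is an assumption (the slide configures fs.s3a.readahead.range to 64K).

```python
# Simplified model of the S3A seek decision: diff = targetPos - pos.
# Not the actual S3AInputStream code; the readahead threshold is illustrative.

def plan_seek(target_pos: int, pos: int, readahead: int = 64 * 1024) -> str:
    """Decide how the input stream satisfies a seek to target_pos."""
    diff = target_pos - pos
    if diff == 0:
        return "sequential read"        # already positioned, keep streaming
    if 0 < diff <= readahead:
        return "skip forward"           # cheap: drain bytes from the open stream
    # Backward seek (diff < 0), or a forward jump past the readahead window:
    # the current HTTP GET is abandoned and a new ranged GET is opened.
    return "close stream & open connection again"
```

The random policy pays off exactly because the last branch (tearing down and reopening an HTTP connection) is the expensive one, and columnar readers trigger it constantly under the default sequential policy.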
15. Optimizing HTTP Requests -- Performance
[Chart: 1TB Batch Analytics Query on Parquet, seconds: Hadoop 2.7.3/Spark 2.1.1: 8305; Hadoop 2.8.1/Spark 2.2.0 (untuned): 8267; Hadoop 2.8.1/Spark 2.2.0 (tuned): 2659]
• The readahead feature is supported from Hadoop 2.8.1, but not enabled by default. After applying the random read policy, the 500 issue is gone and performance improved 3x.
[Chart: 1TB Dataset Batch Analytics and Interactive Query, Hadoop 2.8.1/Spark 2.2.0 comparison (lower is better), seconds, Parquet/ORC: SSD: 2659 / 2120; HDD: 4810 / 3346]
• The all-flash storage architecture also shows great performance benefit and lower TCO compared with HDD storage.
16. Optimizing HTTP Requests -- Resource Utilization Comparison
[Charts, before vs. after optimization: Load Balancer Network IO (sum of rxkB/s and txkB/s), Hadoop Shuffle Disk Bandwidth (sum of rkB/s and wkB/s), and OSD Disk Bandwidth (sum of rkB/s and wkB/s)]
17. Resolving RGW BW Bottleneck -- The bottlenecks
[Charts: Network IO and Single Query Network IO (sum of rxkB/s and txkB/s)]
• LB bandwidth has become the bottleneck
• Observed many messages blocked at the load balancer server (sending to the s3a driver), but not much blocking on the s3a driver's receiving side
18. Hardware Configuration -- No LB, with more RGWs and round-robin DNS
5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 64GB mem
• 2x 82599 10Gb NIC
• 2x SSDs
• 3x data storage (can be eliminated)
Software:
• Hadoop 2.7.3
• Spark 2.1.1
• Hive 2.2.1
• Presto 0.177
• RHEL 7.3
5x Storage Node, 2 RGW nodes, 1 LB node
• Intel® Xeon® CPU E5-2699 v4 @ 2.2GHz
• 128GB memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as WAL and RocksDB
• 4x 1.6TB Intel® SSD DC S3510 as data drives
• 2x 400G S3700 SSDs
• 1 OSD instance on each S3510 SSD
• RHEL 7.3
• RHCS 2.3
[Diagram: 5x compute nodes (Hadoop/Hive/Spark/Presto, 1x10Gb NIC) → head/DNS server → RGW1 and RGW2 (2x10Gb NIC) → MON and OSD1…OSD5]
*Other names and brands may be claimed as the property of others.
19. Performance Evaluation -- No LB, with more RGWs and round-robin DNS
• 18% performance improvement with more RGWs and round-robin DNS
• Query 42 (which has less shuffle) is 1.64x faster in the new architecture
[Chart: 1TB Batch Analytics Query (Parquet), seconds: dedicated single load balancer: 2659; DNS: 2244]
[Chart: Query 42 query time (10TB dataset with Parquet), seconds: dedicated load balancer: 129; DNS RR: 78]
20. Key Resource Utilization Comparison
• The compute side (Hadoop s3a driver) can read more data from the OSDs, showing that the DNS deployment really gains more network throughput than a single gateway with bonding/teaming technology
[Charts: S3A Driver Network IO, DNS vs. dedicated load balancer server (sum of rxkB/s and txkB/s)]
21. Hardware Configuration -- RGW and OSD Collocated
5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 64GB mem
• 2x 82599 10Gb NIC
• 2x SSDs
• 3x data storage (can be eliminated)
Software:
• Hadoop 2.7.3
• Spark 2.1.1
• Hive 2.2.1
• Presto 0.177
• RHEL 7.3
5x Storage Node (RGW and OSD collocated)
• Intel® Xeon® CPU E5-2699 v4 @ 2.2GHz
• 128GB memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as WAL and RocksDB
• 4x 1.6TB Intel® SSD DC S3510 as data drives
• 2x 400G S3700 SSDs
• 1 OSD instance on each S3510 SSD
• RHEL 7.3
• RHCS 2.3
[Diagram: 5x compute nodes (Hadoop/Hive/Spark/Presto, 1x10Gb NIC) → head/DNS server → RGW1…RGW5 collocated with OSD1…OSD5 and MON (3x10Gb NIC)]
22. Performance Evaluation with RGW & OSD Collocated
[Chart: 1TB Batch Analytics Query on Parquet, seconds: dedicated RGWs: 2242; RGW+OSD co-located: 2162]
[Chart: other workloads (10TB Parquet ETL, 1TB dfsio_write), seconds, RGW+OSD co-located vs. dedicated RGWs]
• No need for extra dedicated RGW servers; RGW instances and OSDs go through different network interfaces by enabling ECMP
• No performance degradation, but much lower TCO
23. RGW & OSD Collocated -- RGW scaling
• Scaling out RGWs improves performance until the OSDs (storage) saturate
• So the number of RGWs that yields the best performance should be decided by the bandwidth of each RGW server and the throughput of the OSDs
[Chart: 1TB DFSIO write, MB/s by # of RGWs: 1: 950; 2: 1730; 3: 2119; 4: 2233; 5: 2254]
[Chart: 10TB ETL (lower is better), seconds by # of RGWs: 6: 337; 9: 252; 18: 250]
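The sizing rule above lends itself to a back-of-the-envelope estimate: RGWs stop helping once their aggregate bandwidth covers what the OSDs can deliver. The function below is an illustrative sketch; the example figures in the test are assumptions loosely matched to the DFSIO chart (roughly 950 MB/s per added RGW, flattening near 2250 MB/s), not measured constants.

```python
import math

# Back-of-the-envelope RGW sizing: once aggregate RGW bandwidth covers the
# OSDs' deliverable throughput, additional RGWs add little. Illustrative only.

def useful_rgw_count(osd_throughput_mb_s: float, rgw_bw_mb_s: float) -> int:
    """Smallest RGW count whose aggregate bandwidth saturates the OSDs."""
    return math.ceil(osd_throughput_mb_s / rgw_bw_mb_s)
```

For instance, a storage ceiling around 2250 MB/s behind gateways delivering roughly 950 MB/s each suggests about three RGWs, consistent with the DFSIO write curve flattening after three gateways.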
24. Using Ceph RBD as Shuffle Storage -- Eliminate the physical drive on the compute node!
• Mount two RBDs remotely on each compute node instead of a physical shuffle device
• The latency of a remote RBD is larger than that of a local physical device, but its bandwidth is not smaller
• So the final performance of a TPC-DS query set on RBDs can be close to, or even better than, on a physical device, even when there are heavy shuffles
[Chart: 1TB Batch Analytics Query on Parquet, seconds: RBD for shuffle: 2262; physical device for shuffle: 2244]
[Chart: UC11 1TB dataset, physical shuffle storage vs. RBD shuffle storage query times]
26. Hardware Configuration -- Remote HDFS
5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 64GB mem
• 2x 82599 10Gb NIC
• 2x S3700 as shuffle storage
Software:
• Hadoop 2.7.3
• Spark 2.1.1
• Hive 2.2.1
• Presto 0.177
• RHEL 7.3
5x Data Node
• Intel® Xeon® CPU E5-2699 v4 @ 2.2GHz
• 128GB memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as WAL and RocksDB
• 7x 400G S3700 SSDs as data store
• RHEL 7.3
• RHCS 2.3
[Diagram: 5x compute nodes (Node Manager + Spark, Name Node on one node, 1x10Gb NIC) → Data Node 1…5 (1x10Gb NIC)]
27. Bigdata on Cloud vs. Remote HDFS -- Batch Analytics
• On-par performance compared with remote HDFS
• With optimizations, bigdata analytics on object storage is on par with remote HDFS, especially on Parquet-format data
• The performance of the s3a driver is close to the native DFS client, demonstrating that the separated compute/storage solution has considerable performance compared with the combined solution
[Chart: 1TB Dataset Batch Analytics and Interactive Query comparison with remote HDFS (lower is better), query time in seconds, over object store vs. remote HDFS: Parquet: 2244 vs. 1921; ORC: 5060 vs. 2839]
HDD: (740 HDD OSDs, 24x compute nodes)
SSD: (20 SSDs, 5 compute nodes)
28. Bigdata on Cloud vs. Remote HDFS -- DFSIO
• DFSIO write performance on Ceph is better than remote HDFS (43%), but read performance is 34% lower
• Writes to Ceph hit a disk bottleneck
• Shuffle storage blocks while consuming data from the data lake
[Chart: disk bandwidth, sum of rkB/s and wkB/s]
[Chart: DFSIO on 1TB dataset (throughput, MB/s), dfsio write and dfsio read for remote HDFS (replication 1), remote HDFS (replication 3), and Ceph over s3a; reported values include 4257, 1644, 2354, 3845, 2863 MB/s]
29. Bigdata on Cloud vs. Remote HDFS -- Terasort
[Charts: OSD data drive disk bandwidth (sum of rkB/s and wkB/s) and IO latencies (sum of await and svctm): reads and writes happen concurrently]
• Time cost at the Reduce stage is a big part
[Chart: 1TB Terasort total throughput, MB/s: remote HDFS: 887; Ceph over s3a: 442]
30. Bigdata on Cloud vs. Remote HDFS -- Ongoing rename optimizations

The rename operation can be improved!

• DirectOutputCommitter: an implementation in Spark 1.6 that returns the destination address as the working directory, so there is no need to rename/move task output; not robust against failures; removed in Spark 2.0
• IBM's "Stocator" committer: targets OpenStack Swift; good robustness, but it is another file system alongside s3a
• Staging committer: one of the new s3a committers*; needs large disk capacity for staged data
• Magic committer: one of the new s3a committers*; if you know your object store is consistent, or you use S3Guard, this committer has higher performance

* The new s3a committers have been merged into trunk and will be released in Hadoop 3.1 or later
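Once the new committers ship (Hadoop 3.1 or later), selecting one is expected to be a core-site.xml setting. A hedged sketch using the property names from the s3a committer work; verify the exact names and values against the release documentation:

```xml
<property>
  <name>fs.s3a.committer.name</name>
  <value>magic</value>
</property>
<property>
  <name>fs.s3a.committer.magic.enabled</name>
  <value>true</value>
</property>
```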
32. Summary and Next Step
Summary
• Bigdata on a Ceph data lake is functionally ready, validated by the industry-standard decision-support workload TPC-DS
• Bigdata on the cloud delivers on-par performance with remote HDFS for batch analytics; write-intensive operations still need further optimization
• All-flash solutions demonstrated significant TCO benefit compared with HDD solutions
Next
• Expand the scope of analytic workloads
• Optimize rename operations to improve performance
• Accelerate performance with a speed-up layer for shuffle