Unlock Bigdata Analytic Efficiency With Ceph
Data Lake
Jian Zhang, jian.zhang@intel.com, Intel
Yong Fu, yong.fu@intel.com, Intel
March, 2018
Agenda
• Background & Motivations
• The Workloads, Reference Architecture Evolution and Performance Optimization
• Performance Comparison with Remote HDFS
• Summary & Next Step
3
Challenges of scaling Hadoop* Storage
BOUNDED storage and compute resources on Hadoop nodes bring challenges.
Source: 451 Research, Voice of the Enterprise: Storage Q4 2015
Pain points: data/capacity, inadequate performance, space, spend, power, utilization, multiple storage silos, upgrade cost.
Typical challenges: costs, provisioning and configuration, performance & efficiency, data capacity silos.
4
*Other names and brands may be claimed as the property of others.
Options To Address The Challenges
5
Large Cluster
• Lacks isolation - noisy neighbors hinder SLAs
• Lacks elasticity - rigid cluster size
• Can’t scale compute/storage costs separately

More Clusters
• Cost of duplicating datasets across clusters
• Lacks on-demand provisioning
• Can’t scale compute/storage costs separately

Compute and Storage Disaggregation
• Isolation of high-priority workloads
• Shared big datasets
• On-demand provisioning
• Compute/storage costs scale separately
Compute and Storage disaggregation provides Simplicity, Elasticity, Isolation
Unified Hadoop* File System and API for cloud storage
6
Timeline of Hadoop-compatible cloud storage connectors (2006-2016): s3://, s3n://, s3a://, wasb://, gs://, adl://, oss://
Hadoop Compatible File System abstraction layer: a unified storage API interface, e.g. hadoop fs -ls s3a://job/
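To make the unified API concrete, here is a minimal PySpark sketch of reading a dataset through the s3a:// connector; the RGW endpoint, credentials, and object path are illustrative placeholders, not values from this deck.

```python
from pyspark.sql import SparkSession

# Point the S3A connector at a Ceph RGW endpoint (placeholder values).
spark = (
    SparkSession.builder
    .appName("s3a-read-example")
    .config("spark.hadoop.fs.s3a.endpoint", "http://rgw.example.com:80")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# The same DataFrame API works whether the path is hdfs:// or s3a://.
df = spark.read.parquet("s3a://job/store_sales/")
df.show(5)
```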
Disaggregated Object Storage Cluster
Proposal: Apache Hadoop* with disaggregated Object Storage
7
SQL and other workloads run on Compute 1 … Compute N, accessing Object Storage 1 … Object Storage N through HCFS.
Hadoop Services
• Virtual Machine
• Container
• Bare Metal
Object Storage Services
• Co-located with gateway
• Dynamic DNS or load balancer
• Data protection via storage replication or erasure code
• Storage tiering
*Other names and brands may be claimed as the property of others.
8
Workloads
• Simple Read/Write
  • DFSIO: TestDFSIO is the canonical benchmark that measures the storage's capacity for reading and writing bulk data.
  • Terasort: a popular benchmark that measures the time to sort one terabyte of randomly distributed data on a given system.
• Data Transformation
  • ETL: taking data as it is originally generated and transforming it into a format (Parquet, ORC) that is better tuned for analytical workloads (see the sketch below).
• Batch Analytics
  • Consistently executing analytical processes over large datasets.
  • Leveraging 54 queries derived from TPC-DS*, with intensive reads across objects in different buckets.
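As a concrete illustration of the ETL workload above, a minimal PySpark sketch that converts raw CSV into Parquet on the object store; the input schema, bucket names, and partition count are assumptions for illustration only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-to-parquet").getOrCreate()

# Hypothetical raw input and curated output locations on the Ceph object store.
raw = spark.read.option("header", "true").csv("s3a://raw/store_sales_csv/")

# Columnar formats (Parquet/ORC) are far better suited to the scan-heavy
# TPC-DS style queries used in the batch-analytics workload.
(raw.repartition(200)                 # controls the number/size of output objects
    .write.mode("overwrite")
    .parquet("s3a://warehouse/store_sales/"))
```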
9
Bigdata on Object Storage Performance overview
--Batch analytics
• Significant performance improvement from Hadoop 2.7.3/Spark 2.1.1 to Hadoop 2.8.1/Spark 2.2.0
(improvement in s3a)
• Batch analytics performance of 10-node Intel AFA is almost on-par with 60-node HDD cluster
10
[Chart] 1TB Dataset Batch Analytics and Interactive Query, Hadoop/Spark/Presto comparison, query time in seconds (lower is better), for Parquet (SSD), ORC (SSD), Parquet (HDD), ORC (HDD); series: Hadoop 2.8.1 / Spark 2.2.0, Presto 0.170, Presto 0.177.
HDD: 740 HDD OSDs, 40x compute nodes, 20x storage nodes. SSD: 20 SSDs, 5 compute nodes, 5 storage nodes.
11
Hardware Configuration & Topology
--Dedicated LB
5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @
2.2GHz, 64GB mem
• 2x10G 82599 10Gb NIC
• 2x SSDs
• 3x data drives (can be eliminated)
Software:
• Hadoop 2.7.3
• Spark 2.1.1
• Hive 2.2.1
• Presto 0.177
• RHEL7.3
5x Storage Node, 2 RGW nodes, 1 LB node
• Intel(R) Xeon(R) CPU E5-2699v4 2.20GHz
• 128GB Memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as WAL and
rocksdb
• 4x 1.6TB Intel® SSD DC S3510 as data
drive
• 2x 400GB S3700 SSDs
• 1 OSD instance on each S3510 SSD
• RHEL 7.3
• RHCS 2.3
*Other names and brands may be
claimed as the property of others.
[Topology: 5 compute nodes (Hadoop/Hive/Spark/Presto) and a head node connect through a dedicated load balancer (LB, 4x10Gb NIC) to RGW1/RGW2, which front 5 storage nodes (OSD1-OSD5, MON, 2x10Gb NIC).]
Improve Query Success Ratio with troubleshooting
[Chart] 1TB Query Success % (54 TPC-DS queries): hive-parquet, spark-parquet, presto-parquet (untuned), presto-parquet (tuned).
[Chart] 1TB & 10TB Query Success % (54 TPC-DS queries): spark-parquet, spark-orc, presto-parquet (untuned), presto-parquet (tuned).
[Chart] Count of issues by type: Ceph issue, compatibility issue, deployment issue, improper default configuration, middleware issue, runtime issue, s3a driver issue.
• 100% of the selected TPC-DS queries passed with tunings
• Improper default configurations included: small capacity size, wrong middleware configuration, improper Hadoop/Spark configuration for different data sizes and formats
12
ESTAB 0 0 ::ffff:10.0.2.36:44446 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44454 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44374 ::ffff:10.0.2.254:80
ESTAB 159724 0 ::ffff:10.0.2.36:44436 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44448 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44338 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44438 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44414 ::ffff:10.0.2.254:80
ESTAB 0 480 ::ffff:10.0.2.36:44450 ::ffff:10.0.2.254:80
timer:(on,170ms,0)
ESTAB 0 0 ::ffff:10.0.2.36:44442 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44390 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44326 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44452 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44394 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44444 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44456 ::ffff:10.0.2.254:80
====================== 2 seconds interval ======================
ESTAB 0 0 ::ffff:10.0.2.36:44508 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44476 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44524 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44374 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44500 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44504 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44512 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44506 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44464 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44518 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44510 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44442 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44526 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44472 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44466 ::ffff:10.0.2.254:80
2017-07-18 14:53:52.259976 7fddd67fc700 1 ====== starting new request req=0x7fddd67f6710 =====
2017-07-18 14:53:52.271829 7fddd5ffb700 1 ====== starting new request req=0x7fddd5ff5710 =====
2017-07-18 14:53:52.273940 7fddd7fff700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -5
2017-07-18 14:53:52.274223 7fddd7fff700 0 WARNING: set_req_state_err err_no=5 resorting to 500
2017-07-18 14:53:52.274253 7fddd7fff700 0 ERROR: s->cio->send_content_length() returned err=-5
2017-07-18 14:53:52.274257 7fddd7fff700 0 ERROR: s->cio->print() returned err=-5
2017-07-18 14:53:52.274258 7fddd7fff700 0 ERROR: STREAM_IO(s)->print() returned err=-5
2017-07-18 14:53:52.274267 7fddd7fff700 0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
Optimizing HTTP Requests
-- The bottlenecks
• Compute time takes the biggest part (compute time = read data + sort)
• New connections are opened every time; connections are not reused
• 500 error codes are returned in the RadosGW log
Unreused connections cause high read time and block performance
13
Background
• The S3A filesystem client supports the notion of input policies, similar to that
of the POSIX fadvise() API call. This tunes the behavior of the S3A client to
optimize HTTP GET requests for various use cases. To optimize HTTP GET
requests, you can take advantage of the S3A experimental input
policy fs.s3a.experimental.input.fadvise.
• Ticket: https://issues.apache.org/jira/browse/HADOOP-13203
Optimizing HTTP Requests
-- S3a input policy
Seek in an IO stream: diff = targetPos - pos
• diff = 0: sequential read
• diff > 0: skip forward
• diff < 0: close the stream and open the connection again
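A simplified Python sketch of that seek decision (an illustrative model, not the actual S3AInputStream code), assuming a readahead window such as the 64K configured below:

```python
def on_seek(pos, target_pos, readahead_range, skip, reopen):
    """Illustrative model of the S3A input-stream seek decision."""
    diff = target_pos - pos
    if diff == 0:
        return pos                 # sequential read: keep the current HTTP connection
    if 0 < diff <= readahead_range:
        skip(diff)                 # short forward seek: read-and-discard within readahead
        return target_pos
    # Backward seek, or forward seek beyond the readahead window:
    reopen(target_pos)             # close the stream and open a new HTTP connection
    return target_pos
```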
Solution
Enable random read policy in core-site.xml:
<property>
<name>fs.s3a.experimental.input.fadvise</name>
<value>random</value>
</property>
<property>
<name>fs.s3a.readahead.range</name>
<value>64K</value>
</property>
• By reducing the cost of closing existing HTTP
requests, this is highly efficient for file IO accessing a
binary file through a series of
`PositionedReadable.read()` and
`PositionedReadable.readFully()` calls.
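The same tuning can also be passed per Spark application instead of editing core-site.xml; a minimal sketch:

```python
from pyspark.sql import SparkSession

# Equivalent of the core-site.xml snippet above, set through Spark's Hadoop configuration.
spark = (
    SparkSession.builder
    .appName("s3a-random-read")
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
    .config("spark.hadoop.fs.s3a.readahead.range", "64K")
    .getOrCreate()
)
```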
14
Optimizing HTTP Requests
-- Performance
[Chart] 1TB Batch Analytics Query on Parquet, seconds: Hadoop 2.7.3/Spark 2.1.1: 8305; Hadoop 2.8.1/Spark 2.2.0 (untuned): 8267; Hadoop 2.8.1/Spark 2.2.0 (tuned): 2659.
• Readahead is supported from Hadoop 2.8.1 but is not enabled by default. After applying the random read policy, the 500 errors disappeared and performance improved about 3x.
[Chart] 1TB Dataset Batch Analytics and Interactive Query Hadoop/Spark comparison, seconds (lower is better): Parquet 2659 (SSD) / 2120 (HDD); ORC 4810 (SSD) / 3346 (HDD), on Hadoop 2.8.1/Spark 2.2.0.
• The all-flash storage architecture also shows a great performance benefit and lower TCO compared with the HDD-based storage.
15
Optimizing HTTP Requests
--Resource Utilization Comparison
[Charts: load balancer network IO, Hadoop shuffle disk bandwidth, and OSD disk bandwidth (sums of rxkB/s, txkB/s, rkB/s, wkB/s), before vs. after the optimization.]
16
Resolving RGW BW bottleneck
--The bottlenecks
[Charts: Network IO and Single Query Network IO (sum of rxkB/s, txkB/s).]
• The LB bandwidth has become the bottleneck
• Many messages were observed blocked at the load balancer server (sending to the s3a driver), but not much blocking on the receiving side of the s3a driver
17
18
Hardware Configuration
--No LB with more RGWs and round-robin DNS
5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @
2.2GHz, 64GB mem
• 2x10G 82599 10Gb NIC
• 2x SSDs
• 3x data drives (can be eliminated)
Software:
• Hadoop 2.7.3
• Spark 2.1.1
• Hive 2.2.1
• Presto 0.177
• RHEL7.3
5x Storage Node, 2 RGW nodes, 1 LB node
• Intel(R) Xeon(R) CPU E5-2699v4 2.20GHz
• 128GB Memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as WAL and
rocksdb
• 4x 1.6TB Intel® SSD DC S3510 as data
drive
• 2x 400GB S3700 SSDs
• 1 OSD instance on each S3510 SSD
• RHEL 7.3
• RHCS 2.3
*Other names and brands may be
claimed as the property of others.
[Topology: 5 compute nodes (Hadoop/Hive/Spark/Presto) and a head node; a DNS server round-robins requests across the RGW nodes (RGW1, RGW2, …), which front 5 storage nodes (OSD1-OSD5, MON).]
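For illustration, a small Python sketch of how a round-robin DNS name spreads clients across the RGW nodes; the hostname is a placeholder, and the client-side pick is only a naive model of what an S3A HTTP client effectively ends up doing.

```python
import random
import socket

# Hypothetical round-robin DNS name with one A record per RGW node.
RGW_ENDPOINT = "rgw.ceph.example.com"

# getaddrinfo returns every A record; different clients (and reconnects)
# end up spread across the RGW instances without a dedicated load balancer.
records = socket.getaddrinfo(RGW_ENDPOINT, 80, proto=socket.IPPROTO_TCP)
addrs = sorted({info[4][0] for info in records})
print(addrs)
print("connect to:", random.choice(addrs))
```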
Performance evaluation
--No LB with more RGWs and round-robin DNS
• 18% performance improvement with more RGWs and round-robin DNS
• Query 42 (which has less shuffle) is 1.64x faster in the new architecture
[Chart] 1TB Batch Analytics Query (Parquet), seconds: dedicated single load balancer 2659 vs. DNS round-robin 2244.
[Chart] Query 42 query time (10TB dataset, Parquet), seconds: dedicated load balancer 129 vs. DNS round-robin 78.
Dataset with parquet)
Key Resource Utilization Comparison
• The compute side (Hadoop s3a driver) can read more data from the OSDs, which shows that the DNS deployment really gains more network throughput than a single gateway with bonding/teaming technology
[Charts: S3A driver network IO (sum of rxkB/s, txkB/s), round-robin DNS vs. a dedicated load balance server.]
Hardware Configuration
--RGW and OSD Collocated
5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @
2.2GHz, 64GB mem
• 2x10G 82599 10Gb NIC
• 2x SSDs
• 3x data drives (can be eliminated)
Software:
• Hadoop 2.7.3
• Spark 2.1.1
• Hive 2.2.1
• Presto 0.177
• RHEL7.3
5x Storage Node, 2 RGW nodes, 1 LB node
• Intel(R) Xeon(R) CPU E5-2699v4 2.20GHz
• 128GB Memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as WAL and
rocksdb
• 4x 1.6TB Intel® SSD DC S3510 as data
drive
• 2x 400GB S3700 SSDs
• 1 OSD instance on each S3510 SSD
• RHEL 7.3
• RHCS 2.3
[Topology: 5 compute nodes (Hadoop/Hive/Spark/Presto), a head node, and a DNS server; RGW1-RGW5 are collocated with OSD1-OSD5 on the 5 storage nodes (plus MON), each with 3x10Gb NIC.]
21
Performance evaluation with RGW & OSD
collocated
[Chart] 1TB Batch Analytics Query on Parquet, seconds: dedicated RGWs 2242 vs. RGW + OSD co-located 2162.
[Chart] Other workloads (10TB Parquet ETL, 1TB DFSIO write): dedicated RGWs vs. RGW + OSD co-located.
• No extra dedicated RGW servers are needed; RGW instances and OSDs go through different network interfaces by enabling ECMP
• No performance degradation, and lower TCO
22
RGW & OSD collocated – RGW scaling
• Scaling out RGWs can improve performance until the OSDs (storage) saturate
• The number of RGWs that yields the best performance should therefore be decided by the bandwidth of each RGW server and the throughput of the OSDs (see the sizing sketch after the charts below)
23
[Chart] 1TB DFSIO write throughput (MB/s) vs. number of RGWs: 1 RGW 950, 2 RGWs 1730, 3 RGWs 2119, 4 RGWs 2233, 5 RGWs 2254.
[Chart] 10TB ETL time (seconds, lower is better) vs. number of RGWs: 6 RGWs 337, 9 RGWs 252, 18 RGWs 250.
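A rough back-of-envelope sketch of that trade-off, assuming each RGW front end sits on a 10GbE link (~1.25 GB/s) and taking the DFSIO write plateau above as the storage-side ceiling; the numbers are illustrative, not a sizing rule.

```python
import math

# Assumed per-RGW front-end bandwidth: one 10GbE link.
rgw_nic_mb_s = 10 * 1000 / 8          # ~1250 MB/s

# Observed storage-side ceiling from the DFSIO write chart (plateau at 5 RGWs).
osd_ceiling_mb_s = 2254

rgws_to_saturate = math.ceil(osd_ceiling_mb_s / rgw_nic_mb_s)
print(rgws_to_saturate)               # -> 2; in practice ~3 RGWs before the OSDs become the limit
```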
Using Ceph RBD as shuffle Storage
-- Eliminate the physical drives on the compute nodes!
• Mount two RBDs on each compute node remotely instead of using physical shuffle devices
• Ideally, the latency of remote RBDs is higher than that of a local physical device, but the bandwidth of a remote RBD is not lower than the local device either
• So the final performance of a TPC-DS query set on RBDs may be close to, or even better than, that on physical devices when there are heavy shuffles
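A hedged configuration sketch of this setup: the RBD images are assumed to be already created, mapped, and mounted on every compute node (the mount points below are hypothetical), and Spark's local/shuffle directories are pointed at them.

```python
from pyspark.sql import SparkSession

# Assumes two RBD images are mapped and mounted on each compute node,
# e.g. at /mnt/rbd0 and /mnt/rbd1 (hypothetical paths).
spark = (
    SparkSession.builder
    .appName("rbd-shuffle")
    # Shuffle spill and scratch data go to the RBD-backed mounts.
    # (On YARN, yarn.nodemanager.local-dirs plays the equivalent role.)
    .config("spark.local.dir", "/mnt/rbd0,/mnt/rbd1")
    .getOrCreate()
)
```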
24
[Chart] 1TB Batch Analytics Query on Parquet, seconds: RBD for shuffle 2262 vs. physical device for shuffle 2244.
[Chart] UC11 1TB dataset per-query times: physical shuffle storage (physical-query_time) vs. RBD shuffle storage (rbd-query_time).
25
Hardware Configuration
--Remote HDFS
5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @
2.2GHz, 64GB mem
• 2x10G 82599 10Gb NIC
• 2x S3700 as shuffle storage
Software:
• Hadoop 2.7.3
• Spark 2.1.1
• Hive 2.2.1
• Presto 0.177
• RHEL7.3
5x Data Node
• Intel(R) Xeon(R) CPU E5-2699v4 2.20GHz
• 128GB Memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as WAL and rocksdb
• 7x 400GB S3700 SSDs as data store
• RHEL 7.3
• RHCS 2.3
[Topology: 5 compute nodes running Node Manager and Spark (one also hosts the Name Node) connect over 1x10Gb NICs to 5 HDFS data nodes (Data Node 1-5).]
26
Bigdata on Cloud vs. Remote HDFS
--Batch Analytics
• On-par performance compared with remote HDFS
• With optimizations, bigdata analytics on object storage is on par with remote HDFS, especially on Parquet-format data
• The performance of the s3a driver is close to the native DFS client, demonstrating that the compute/storage-separated solution performs comparably with the co-located solution
[Chart] 1TB Dataset Batch Analytics and Interactive Query comparison with remote HDFS, query time in seconds (lower is better): Parquet 2244 (over object store) vs. 1921 (remote HDFS); ORC 5060 (over object store) vs. 2839 (remote HDFS).
HDD: 740 HDD OSDs, 24x compute nodes. SSD: 20 SSDs, 5 compute nodes.
27
Bigdata on Cloud vs. Remote HDFS
--DFSIO
• DFSIO write performance on Ceph is 43% better than remote HDFS, but read performance is 34% lower
• Writes to Ceph hit the disk bottleneck
• Shuffle storage blocks while consuming data from the data lake
[Chart] Disk bandwidth during DFSIO (sum of rkB/s, wkB/s).
[Chart] DFSIO throughput on 1TB dataset (MB/s): dfsio write - remote HDFS (replication 1) 4257, remote HDFS (replication 3) 1644, Ceph over s3a 2354; dfsio read - remote HDFS 3845, Ceph over s3a 2863.
28
Bigdata on Cloud vs. Remote HDFS
--Terasort
• Time cost at the Reduce stage is the biggest part
• Reads and writes happen concurrently
[Charts] OSD data drive disk bandwidth (sum of rkB/s, wkB/s) and OSD data drive IO latencies (sum of await, svctm) during Terasort.
[Chart] 1TB Terasort total throughput (MB/s): remote HDFS 887 vs. Ceph over s3a 442.
29
Bigdata on Cloud vs. Remote HDFS
--Ongoing rename optimizations
• DirectOutputCommitter: an implementation in Spark 1.6 that returns the destination address as the working directory, so task output does not need to be renamed/moved; not robust against failures; removed in Spark 2.0
• IBM's "Stocator" committer: targets OpenStack Swift with good robustness, but it is another file system alongside s3a
• Staging committer: one choice of the new s3a committers*; needs a large amount of local disk for staged data
• Magic committer: one choice of the new s3a committers*; if you know your object store is consistent, or you use S3Guard, this committer has higher performance
* The new s3a committers have been merged into trunk and will be released in Hadoop 3.1 or later
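As a hedged illustration only (these properties belong to the new Hadoop 3.1+ s3a committers referenced above, not to the Hadoop 2.8.1 stack benchmarked here), enabling the magic committer from Spark might look like:

```python
from pyspark.sql import SparkSession

# Sketch of selecting the "magic" s3a committer; property names assume the
# Hadoop 3.1+ committer framework and may differ across releases.
spark = (
    SparkSession.builder
    .appName("s3a-magic-committer")
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    .getOrCreate()
)
```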
30
Rename operation can be improved!
31
32
Summary and Next Step
Summary
• Bigdata on a Ceph data lake is functionally ready, validated by the industry-standard decision-support workload TPC-DS
• Bigdata on the cloud delivers on-par performance with remote HDFS for batch analytics; write-intensive operations still need further optimization
• All-flash solutions demonstrated a significant TCO benefit compared with HDD solutions
Next
• Expand the scope of analytic workloads
• Optimize rename operations to improve performance
• Accelerate performance with a speed-up layer for shuffle
33
34
Experiment environment
Cluster: Hadoop head, Hadoop slave, Load balancer, OSD, RGW
Roles:
• Hadoop head (1 node): Hadoop name node, secondary name node, resource manager, data node, node manager, Hive metastore service, YARN history server, Spark history server, Presto server
• Hadoop slave (5 nodes): data node, node manager, Presto server
• Load balancer (1 node): HAProxy
• OSD (5 nodes): Ceph OSD
• RGW (5 nodes): Ceph RADOS gateway
Processor: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 44 cores, HT enabled (Hadoop, load balancer and OSD nodes); Intel(R) Xeon(R) CPU E3-1280 @ 3.50GHz, 4 cores, HT enabled (RGW nodes)
Memory: 128GB (head), 64GB (slave), 128GB (OSD), 32GB (RGW)
Storage: 4x 1TB HDD + 2x Intel S3510 480GB SSD (vs. S3700 metrics) on the Hadoop nodes; 1x Intel S3510 480GB SSD (load balancer); 1x Intel® P3700 1.6TB as journal, 4x 1.6TB Intel® SSD DC S3510 and 2x 400GB S3700 as data store (OSD); 1x Intel S3510 480GB SSD (RGW)
Network: 10Gb (Hadoop nodes), 40Gb (load balancer), 10Gb+10Gb (OSD), 10Gb (RGW)
SW Configuration
Hadoop version 2.7.3/2.8.1
Spark version 2.1.1/2.2.0
Hive version 2.2.1
Presto version 0.177
Executor memory 22GB
Executor cores 5
# of executor 24
JDK version 1.8.0_131
Memory.overhead 5GB
S3A Key Performance Configuration
fs.s3a.connection.maximum 10
fs.s3a.threads.max 30
fs.s3a.socket.send.buffer 8192
fs.s3a.socket.recv.buffer 8192
fs.s3a.threads.keepalivetime 60
fs.s3a.max.total.tasks 1000
fs.s3a.multipart.size 100M
fs.s3a.block.size 32M
fs.s3a.readahead.range 64K
fs.s3a.fast.upload true
fs.s3a.fast.upload.buffer array
fs.s3a.fast.upload.active.blocks 4
fs.s3a.experimental.input.fadvise random
Legal Disclaimer & Optimization Notice
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those
factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products. For more complete information visit
www.intel.com/benchmarks.
• INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED
WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
• Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel
Corporation in the U.S. and other countries.
37
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel
microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the
availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent
optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are
reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.
Notice revision #20110804
37
More Related Content

What's hot

Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuCeph Community
 
Ceph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph Community
 
Ceph Performance Profiling and Reporting
Ceph Performance Profiling and ReportingCeph Performance Profiling and Reporting
Ceph Performance Profiling and ReportingCeph Community
 
Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data Ceph Community
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Walk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoCWalk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoCCeph Community
 
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu ChaiRADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu ChaiCeph Community
 
Automation of Hadoop cluster operations in Arm Treasure Data
Automation of Hadoop cluster operations in Arm Treasure DataAutomation of Hadoop cluster operations in Arm Treasure Data
Automation of Hadoop cluster operations in Arm Treasure DataYan Wang
 
Ceph: Low Fail Go Scale
Ceph: Low Fail Go Scale Ceph: Low Fail Go Scale
Ceph: Low Fail Go Scale Ceph Community
 
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...inwin stack
 
Red Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference ArchitecturesRed Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference ArchitecturesRed_Hat_Storage
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightColleen Corrice
 
Ceph Day Melabourne - Community Update
Ceph Day Melabourne - Community UpdateCeph Day Melabourne - Community Update
Ceph Day Melabourne - Community UpdateCeph Community
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraDataWorks Summit
 
Ceph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Day Shanghai - Recovery Erasure Coding and Cache TieringCeph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Day Shanghai - Recovery Erasure Coding and Cache TieringCeph Community
 
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA ArchitectureCeph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA ArchitectureDanielle Womboldt
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsCeph Community
 
Red Hat Ceph Storage Acceleration Utilizing Flash Technology
Red Hat Ceph Storage Acceleration Utilizing Flash Technology Red Hat Ceph Storage Acceleration Utilizing Flash Technology
Red Hat Ceph Storage Acceleration Utilizing Flash Technology Red_Hat_Storage
 
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...Odinot Stanislas
 

What's hot (20)

Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
 
Ceph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der Ster
 
Ceph Performance Profiling and Reporting
Ceph Performance Profiling and ReportingCeph Performance Profiling and Reporting
Ceph Performance Profiling and Reporting
 
Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Walk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoCWalk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoC
 
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu ChaiRADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
 
Automation of Hadoop cluster operations in Arm Treasure Data
Automation of Hadoop cluster operations in Arm Treasure DataAutomation of Hadoop cluster operations in Arm Treasure Data
Automation of Hadoop cluster operations in Arm Treasure Data
 
Ceph: Low Fail Go Scale
Ceph: Low Fail Go Scale Ceph: Low Fail Go Scale
Ceph: Low Fail Go Scale
 
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...
 
Red Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference ArchitecturesRed Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference Architectures
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer Spotlight
 
MySQL Head-to-Head
MySQL Head-to-HeadMySQL Head-to-Head
MySQL Head-to-Head
 
Ceph Day Melabourne - Community Update
Ceph Day Melabourne - Community UpdateCeph Day Melabourne - Community Update
Ceph Day Melabourne - Community Update
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native Era
 
Ceph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Day Shanghai - Recovery Erasure Coding and Cache TieringCeph Day Shanghai - Recovery Erasure Coding and Cache Tiering
Ceph Day Shanghai - Recovery Erasure Coding and Cache Tiering
 
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA ArchitectureCeph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah Watkins
 
Red Hat Ceph Storage Acceleration Utilizing Flash Technology
Red Hat Ceph Storage Acceleration Utilizing Flash Technology Red Hat Ceph Storage Acceleration Utilizing Flash Technology
Red Hat Ceph Storage Acceleration Utilizing Flash Technology
 
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
 

Similar to Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong

Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAlluxio, Inc.
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Strata + Hadoop 2015 Slides
Strata + Hadoop 2015 SlidesStrata + Hadoop 2015 Slides
Strata + Hadoop 2015 SlidesJun Liu
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with HadoopJayant Shekhar
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheAlluxio, Inc.
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014Eli Singer
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weitingWei Ting Chen
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jugGerald Muecke
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Not Your Father’s Data Warehouse: Breaking Tradition with Innovation
Not Your Father’s Data Warehouse: Breaking Tradition with InnovationNot Your Father’s Data Warehouse: Breaking Tradition with Innovation
Not Your Father’s Data Warehouse: Breaking Tradition with InnovationInside Analysis
 
Intel_Swarm64 Solution Brief
Intel_Swarm64 Solution BriefIntel_Swarm64 Solution Brief
Intel_Swarm64 Solution BriefPaul McCullugh
 
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Databricks
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkDatabricks
 

Similar to Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong (20)

Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Strata + Hadoop 2015 Slides
Strata + Hadoop 2015 SlidesStrata + Hadoop 2015 Slides
Strata + Hadoop 2015 Slides
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with Hadoop
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cache
 
Splice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakesSplice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakes
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jug
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Not Your Father’s Data Warehouse: Breaking Tradition with Innovation
Not Your Father’s Data Warehouse: Breaking Tradition with InnovationNot Your Father’s Data Warehouse: Breaking Tradition with Innovation
Not Your Father’s Data Warehouse: Breaking Tradition with Innovation
 
Intel_Swarm64 Solution Brief
Intel_Swarm64 Solution BriefIntel_Swarm64 Solution Brief
Intel_Swarm64 Solution Brief
 
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 

Recently uploaded

Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Recently uploaded (20)

Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong

  • 1. Unlock Bigdata Analytic Efficiency With Ceph Data Lake Jian Zhang, jian.zhang@intel.com, Intel Yong Fu, yong.fu@intel.com, Intel March, 2018
  • 2. Agenda  Background & Motivations  The Workloads, Reference Architecture Evolution and Performance Optimization  Performance Comparison with Remote HDFS  Summary & Next Step
  • 3. 3
  • 4. BOUNDED Storage and Compute resources on Hadoop Nodes brings challenges Source: 451 Research, Voice of the Enterprise: Storage Q4 2015 Data/Capacity Inadequate Performance Space, Spent, Power, Utilization Multiple Storage Silos Upgrade Cost Typical Challenges Costs Provisioning And Configuration Performance & efficiency Data Capacity Silos Challenges of scaling Hadoop* Storage 4 *Other names and brands may be claimed as the property of others.
  • 5. Options To Address The Challenges 5 Large Cluster • Lacks isolation - noisy neighbors hinder SLAs • Lacks elasticity - rigid cluster size • Can’t scale compute/storage costs separately More Clusters • Cost of duplicating datasets across clusters • Lacks on-demand provisioning • Can’t scale compute/storage costs separately Compute and Storage Disaggregation • Isolation of high- priority workloads • Shared big datasets • On-demand provisioning • compute/storage costs scale separately Compute and Storage disaggregation provides Simplicity, Elasticity, Isolation
  • 6. Unified Hadoop* File System and API for cloud storage 6 20162006 201520142008 oss://adl:// wasb://gs://s3:// s3n:// s3a:// Hadoop Compatible File System abstraction layer: Unified storage API interface Hadoop fs –ls s3a://job/
  • 7. Disaggregated Object Storage Cluster Proposal: Apache Hadoop* with disagreed Object Storage 7 SQL …… Compute 1 Compute 2 Compute 3 Compute N… Object Storage 1 Object Storage 2 Object Storage 3 Object Storage N … Hadoop Services • Virtual Machine • Container • Bare Metal Object Storage Services • Co-located with gateway • Dynamic DNS or load balancer • Data protection via storage replication or erasure code • Storage tiering HCFS *Other names and brands may be claimed as the property of others.
  • 8. 8
  • 9. Workloads • Simple Read/Write • DFSIO: TestDFSIO is the canonical example of a benchmark that attempts to measure the Storage's capacity for reading and writing bulk data. • Terasort: a popular benchmark that measures the amount of time to sort one terabyte of randomly distributed data on a given computer system. • Data Transformation • ETL: Taking data as it is originally generated and transforming it to a format (Parquet, ORC) that more tuned for analytical workloads. • Batch Analytics • To consistently executing analytical process to process large set of data. • Leveraging 54 derived from TPC-DS * queries with intensive reads across objects in different buckets 9
  • 10. Bigdata on Object Storage Performance overview --Batch analytics • Significant performance improvement from Hadoop 2.7.3/Spark 2.1.1 to Hadoop 2.8.1/Spark 2.2.0 (improvement in s3a) • Batch analytics performance of 10-node Intel AFA is almost on-par with 60-node HDD cluster 10 2244 5060 2120 34463968 3852 6573 7719 0 2000 4000 6000 8000 10000 Parquet (SSD) ORC (SSD) Parquet (HDD) ORC HDD) QueryTime(s) 1TB Dataset Batch Analytics and Interactive Query Hadoop/SPark/Presto Comparison (lower the better) HDD: (740 HDDD OSDs, 40x Compute nodes, 20x Storage nodes) SSD:(20 SSDs, 5 Compute nodes, 5 storage nodes) Hadoop 2.8.1 / Spark 2.2.0 Presto 0.170 Presto 0.177
  • 11. 11 Hardware Configuration & Topology --Dedicate LB 5x Compute Node • Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 64GB mem • 2x10G 82599 10Gb NIC • 2x SSDs • 3x Data storage (can be emliminated) Software: • Hadoop 2.7.3 • Spark 2.1.1 • Hive 2.2.1 • Presto 0.177 • RHEL7.3 5x Storage Node, 2 RGW nodes, 1 LB nodes • Intel(R) Xeon(R) CPU E5-2699v4 2.20GHz • 128GB Memory • 2x 82599 10Gb NIC • 1x Intel® P3700 1.0TB SSD as WAL and rocksdb • 4x 1.6TB Intel® SSD DC S3510 as data drive • 2x 400G S3700 SSDs • 1 OSD instances one each S3510 SSD • RHEl7.3 • RHCS 2.3 *Other names and brands may be claimed as the property of others. OSD1 MON OSD1 OSD4… Hadoop Hive Spark Presto 1x10Gb NIC OSD2 OSD3 OSD4 OSD5 2x10Gb NIC Hadoop Hive Spark Presto Hadoop Hive Spark Presto Hadoop Hive Spark Presto Hadoop Hive Spark Presto RGW1 RGW2 LB Head 4x10Gb NIC 2x10Gb NIC
  • 12. 0% 20% 40% 60% 80% 100% 120% hive-parquet spark-parquet presto-parquet (untuned) presto-parquet (tuned) QuerySuccess% 1TB Query Success % (54 TPC-DS Queries) 0% 20% 40% 60% 80% 100% 120% spark-parquet spark-orc presto-parquet presto-parquet 1TB&10TB Query Success %(54 TPC-DS Queries) Improve Query Success Ratio with trouble-shootings 0 5 10 15 Ceph issue Compatible issue Deployment issue Improper default configuration Middleware issue Runtime issue S3a driver issue Count of Issue Type • 100% selected TPC-DS query passed with tunings • Improper Default configuration • small capacity size, • wrong middleware configuration • improper Hadoop/Spark configuration for different size and format data issues tuned 12
  • 13. ESTAB 0 0 ::ffff:10.0.2.36:44446 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44454 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44374 ::ffff:10.0.2.254:80 ESTAB 159724 0 ::ffff:10.0.2.36:44436 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44448 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44338 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44438 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44414 ::ffff:10.0.2.254:80 ESTAB 0 480 ::ffff:10.0.2.36:44450 ::ffff:10.0.2.254:80 timer:(on,170ms,0) ESTAB 0 0 ::ffff:10.0.2.36:44442 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44390 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44326 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44452 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44394 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44444 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44456 ::ffff:10.0.2.254:80 2 seconds interval ====================== ESTAB 0 0 ::ffff:10.0.2.36:44508 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44476 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44524 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44374 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44500 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44504 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44512 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44506 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44464 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44518 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44510 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44442 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44526 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44472 ::ffff:10.0.2.254:80 ESTAB 0 0 ::ffff:10.0.2.36:44466 ::ffff:10.0.2.254:80 2017-07-18 14:53:52.259976 7fddd67fc700 1 ====== starting new request req=0x7fddd67f6710 ===== 2017-07-18 14:53:52.271829 7fddd5ffb700 1 ====== starting new request req=0x7fddd5ff5710 ===== 2017-07-18 14:53:52.273940 7fddd7fff700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned - 5 2017-07-18 14:53:52.274223 7fddd7fff700 0 WARNING: set_req_state_err err_no=5 resorting to 500 2017-07-18 14:53:52.274253 7fddd7fff700 0 ERROR: s->cio->send_content_length() returned err=-5 2017-07-18 14:53:52.274257 7fddd7fff700 0 ERROR: s->cio->print() returned err=-5 2017-07-18 14:53:52.274258 7fddd7fff700 0 ERROR: STREAM_IO(s)->print() returned err=-5 2017-07-18 14:53:52.274267 7fddd7fff700 0 ERROR: STREAM_IO(s)->complete_header() returned err=-5 Optimizing HTTP Requests -- The bottlenecks Compute time take the big part. (compute time = read data +sort ) New connections out every time, Connection not reused Return 500 code in RadosGW log Unresued connection cause high read time and block performance 13
  • 14. Optimizing HTTP Requests -- S3A input policy
Background
• The S3A filesystem client supports the notion of input policies, similar to the POSIX fadvise() API call. This tunes the behavior of the S3A client to optimize HTTP GET requests for various use cases, via the experimental input policy fs.s3a.experimental.input.fadvise.
• Ticket: https://issues.apache.org/jira/browse/HADOOP-13203
Seek in an IO stream: diff = targetPos – pos; diff > 0 skips forward; diff < 0 seeks backward, which closes the stream and opens a new connection; diff = 0 is a sequential read.
Solution: enable the random read policy in core-site.xml:
<property> <name>fs.s3a.experimental.input.fadvise</name> <value>random</value> </property>
<property> <name>fs.s3a.readahead.range</name> <value>64K</value> </property>
• By reducing the cost of closing and reopening HTTP requests, this is highly efficient for file IO that accesses a binary file through a series of `PositionedReadable.read()` and `PositionedReadable.readFully()` calls.
14
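Besides core-site.xml, the same policy can also be applied per job. A minimal sketch, assuming Spark on YARN and the spark.hadoop.* pass-through (the rest of the submission is elided):

```bash
# Illustrative per-job override of the S3A input policy (Hadoop 2.8+).
spark-submit \
  --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random \
  --conf spark.hadoop.fs.s3a.readahead.range=65536 \
  ...   # remaining job options and application jar as usual
```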
  • 15. Optimizing HTTP Requests -- Performance
1TB Batch Analytics query time on Parquet: 8305 s on Hadoop 2.7.3/Spark 2.1.1, 8267 s on Hadoop 2.8.1/Spark 2.2.0 (untuned), 2659 s on Hadoop 2.8.1/Spark 2.2.0 (tuned).
• The readahead feature is supported from Hadoop 2.8.1 on, but is not enabled by default. With the random read policy applied, the 500 errors disappeared and performance improved about 3x.
1TB Batch Analytics and Interactive Query, Hadoop 2.8.1/Spark 2.2.0, SSD vs HDD cluster (lower is better): Parquet 2659 s (SSD) vs 2120 s (HDD); ORC 4810 s (SSD) vs 3346 s (HDD).
• The all-flash storage architecture also shows a strong performance and TCO benefit compared with the HDD-based configuration.
15
  • 16. Optimizing HTTP Requests -- Resource Utilization Comparison
Charts (before vs. after optimization): load-balancer network IO (rx/tx kB/s), Hadoop shuffle disk bandwidth (read/write kB/s), and OSD disk bandwidth (read/write kB/s).
16
  • 17. Resolving RGW BW bottleneck -- The bottlenecks
Charts: load-balancer network IO and single-query network IO (rx/tx kB/s).
• The load-balancer bandwidth has become the bottleneck.
• Many messages were observed queuing at the load-balancer server on the send path toward the s3a driver, while little was backed up on the s3a driver's receive side (see the monitoring sketch below).
17
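One way such a bottleneck can be spotted is plain NIC monitoring on the load-balancer node; a sketch, with placeholder interface names:

```bash
# Watch per-interface throughput on the LB node. rx/tx pinned near the 10GbE
# line rate (~1.2 GB/s per port) while the OSD disks stay well below their
# limits points at the load balancer as the bottleneck.
sar -n DEV 2 | grep -E "ens786f0|ens786f1"   # interface names are placeholders
```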
  • 18. 18 Hardware Configuration -- No LB, more RGWs and round-robin DNS
5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 64GB mem
• 2x 82599 10Gb NIC
• 2x SSDs
• 3x data drives (can be eliminated)
Software: Hadoop 2.7.3 • Spark 2.1.1 • Hive 2.2.1 • Presto 0.177 • RHEL 7.3
5x Storage Nodes, multiple RGW nodes fronted by round-robin DNS (no dedicated load balancer)
• Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz, 128GB memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as WAL and RocksDB
• 4x 1.6TB Intel® SSD DC S3510 as data drives
• 2x 400GB S3700 SSDs
• 1 OSD instance on each S3510 SSD
• RHEL 7.3 • RHCS 2.3
*Other names and brands may be claimed as the property of others.
Topology: 5x compute nodes (Hadoop/Hive/Spark/Presto) resolve the gateway endpoint through a DNS server on the head node and talk directly to the RGW nodes, which front the 5x storage nodes (OSD1–OSD5, MON).
  • 19. Performance evaluation -- No LB, more RGWs and round-robin DNS
• 18% performance improvement with more RGWs and round-robin DNS: the 1TB batch analytics query set (Parquet) drops from 2659 s with a dedicated single load balancer to 2244 s with DNS round-robin.
• Query 42 (which involves little shuffle) is 1.64x faster in the new architecture: 129 s vs 78 s on the 10TB Parquet dataset (a sketch of the client-side wiring follows).
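A sketch of how the round-robin endpoint is consumed on the client side; the hostname and addresses are hypothetical, and a non-default RGW port would need to be appended to the endpoint:

```bash
# The round-robin record should return one A entry per RGW, rotated per lookup.
dig +short rgw.bigdata.local
# 10.0.2.101
# 10.0.2.102
# ...

# Point the S3A endpoint at the DNS name instead of a single LB address.
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=http://rgw.bigdata.local \
  ...
```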
  • 20. Key Resource Utilization Comparison
Charts: S3A driver network IO (rx/tx kB/s) with round-robin DNS vs. with a dedicated load-balancer server.
• The compute side (Hadoop s3a driver) reads more data from the OSDs with the DNS deployment, showing that round-robin DNS delivers higher network throughput than a single gateway using bonding/teaming.
  • 21. Hardware Configuration -- RGW and OSD Collocated
5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 64GB mem
• 2x 82599 10Gb NIC
• 2x SSDs
• 3x data drives (can be eliminated)
Software: Hadoop 2.7.3 • Spark 2.1.1 • Hive 2.2.1 • Presto 0.177 • RHEL 7.3
5x Storage Nodes (one RGW instance collocated with the OSDs on each node)
• Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz, 128GB memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as WAL and RocksDB
• 4x 1.6TB Intel® SSD DC S3510 as data drives
• 2x 400GB S3700 SSDs
• 1 OSD instance on each S3510 SSD
• RHEL 7.3 • RHCS 2.3
Topology: 5x compute nodes (Hadoop/Hive/Spark/Presto, 1x10Gb NIC) resolve the endpoints through a DNS server on the head node and talk to RGW1–RGW5, each collocated with OSD1–OSD5 (3x10Gb NIC per storage node).
21
  • 22. Performance evaluation with RGW & OSD collocated
• 1TB batch analytics query set on Parquet: 2242 s with dedicated RGWs vs. 2162 s with RGW + OSD collocated.
• Other workloads (10TB Parquet ETL, 1TB DFSIO write) show the same picture for collocated vs. dedicated RGWs.
• No extra dedicated RGW servers are needed; RGW instances and OSDs are served over different network interfaces by enabling ECMP (see the routing sketch below).
• No performance degradation, and noticeably lower TCO.
22
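One way the two uplinks on a storage node can be balanced is an equal-cost multipath default route; a sketch with placeholder gateways and interface names, not necessarily the exact mechanism used in this setup:

```bash
# ECMP across the two 10GbE uplinks on a collocated RGW/OSD node.
# Gateways and interface names are placeholders.
ip route replace default \
    nexthop via 10.0.2.1 dev ens786f0 weight 1 \
    nexthop via 10.0.3.1 dev ens786f1 weight 1
ip route show default   # verify that both nexthops are installed
```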
  • 23. RGW & OSD collocated – RGW scaling
• Scaling out RGWs improves performance until the OSDs (storage) saturate.
• The number of RGWs that yields the best performance is therefore determined by the bandwidth of each RGW server and the aggregate OSD throughput.
1TB DFSIO write throughput vs. number of RGWs: 950, 1730, 2119, 2233, 2254 MB/s for 1–5 RGWs. 10TB ETL (lower is better): 337, 252, 250 s for 6, 9, 18 RGWs.
23
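For reference, a DFSIO write run against the object store can be driven roughly as sketched below; the bucket name is hypothetical, the jar path depends on the Hadoop layout, and older releases may require the file size in plain MB:

```bash
# Illustrative TestDFSIO write against s3a; the benchmark base directory is
# steered to the object store via test.build.data (bucket name is hypothetical).
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
    TestDFSIO \
    -Dtest.build.data=s3a://dfsio-bench/TestDFSIO \
    -write -nrFiles 64 -fileSize 1GB
```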
  • 24. Using Ceph RBD as shuffle storage -- Eliminate the physical drives on the compute nodes!
• Mount two RBD volumes on each compute node instead of using physical shuffle devices.
• Latency on a remote RBD is naturally higher than on a local physical device, but its bandwidth is not lower either.
• As a result, the total time of a TPC-DS query set on RBD shuffle storage can be close to, or even better than, on physical devices when shuffles are heavy: 2262 s (RBD for shuffle) vs. 2244 s (physical device for shuffle) for the 1TB batch analytics query set on Parquet; a per-query comparison on the 1TB dataset (physical vs. RBD shuffle storage) is shown alongside.
24
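A sketch of how a remote RBD can stand in for a local shuffle drive; pool, image and mount names are placeholders, and Spark on YARN would take its shuffle directories from yarn.nodemanager.local-dirs rather than spark.local.dir:

```bash
# Create and map an RBD image on the compute node, then use it for shuffle.
rbd create shuffle0 --pool rbd --size 409600      # size in MB (~400 GB)
rbd map rbd/shuffle0                              # e.g. appears as /dev/rbd0
mkfs.xfs /dev/rbd0
mkdir -p /mnt/rbd-shuffle0
mount /dev/rbd0 /mnt/rbd-shuffle0

# Point the local/shuffle directories at the mounted RBDs (standalone Spark shown).
spark-submit --conf spark.local.dir=/mnt/rbd-shuffle0,/mnt/rbd-shuffle1 ...
```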
  • 26. Hardware Configuration -- Remote HDFS
5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 64GB mem
• 2x 82599 10Gb NIC
• 2x S3700 SSDs as shuffle storage
Software: Hadoop 2.7.3 • Spark 2.1.1 • Hive 2.2.1 • Presto 0.177 • RHEL 7.3
5x Data Node
• Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz, 128GB memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as WAL and RocksDB
• 7x 400GB S3700 SSDs as data store
• RHEL 7.3 • RHCS 2.3
Topology: 5x compute nodes (Node Manager, Spark; the head node also runs the Name Node) connect over 10GbE to Data Node 1–5.
26
  • 27. Bigdata on Cloud vs. Remote HDFS -- Batch Analytics
• On-par performance compared with remote HDFS.
• With the optimizations, big-data analytics on object storage is on par with remote HDFS, especially on Parquet-format data.
• The performance of the s3a driver is close to the native DFS client, demonstrating that the disaggregated compute/storage solution delivers performance comparable to the co-located solution.
1TB Dataset Batch Analytics and Interactive Query comparison with remote HDFS (lower is better): Parquet 2244 s over the object store vs. 1921 s on remote HDFS; ORC 5060 s vs. 2839 s. (HDD: 740 HDD OSDs, 24x compute nodes; SSD: 20 SSDs, 5 compute nodes.)
27
  • 28. Bigdata on Cloud vs. Remote HDFS -- DFSIO
Charts: shuffle-device disk bandwidth (read/write kB/s) and DFSIO throughput on the 1TB dataset (MB/s; values 4257, 1644, 2354, 3845, 2863 across remote HDFS with replication 1, remote HDFS with replication 3, and Ceph over s3a, for write and read).
• DFSIO write performance on Ceph is better than remote HDFS (by 43%), but read performance is 34% lower.
• Writes to Ceph hit the disk bottleneck.
• The shuffle storage blocks while consuming data from the data lake.
28
  • 29. Bigdata on Cloud vs. Remote HDFS -- Terasort
Charts: OSD data-drive disk bandwidth (read/write kB/s) and OSD data-drive IO latencies (await, svctm).
• The reduce stage accounts for the largest share of the time.
• The OSD data drives are read and written concurrently.
• 1TB Terasort total throughput: 887 MB/s on remote HDFS vs. 442 MB/s on Ceph over s3a.
29
  • 30. Bigdata on Cloud vs. Remote HDFS -- Ongoing rename optimizations
The rename operation can be improved!
• DirectOutputCommitter: an implementation in Spark 1.6 that returns the destination path as the working directory, so task output never has to be renamed/moved; poor robustness against failures; removed in Spark 2.0.
• IBM's "Stocator" committer: targets OpenStack Swift; good robustness, but it is a separate file system rather than s3a.
• Staging committer: one of the new s3a committers*; needs large local disk capacity for the staged data.
• Magic committer: one of the new s3a committers*; if the object store is consistent, or S3Guard is used, this committer offers higher performance.
* The new s3a committers have been merged into trunk and will be released in Hadoop 3.1 or later.
30
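Once the new committers land in Hadoop 3.1+, selecting one is expected to look roughly like the sketch below. The property names come from the s3a committer work referenced above, but treat the exact wiring, especially on the Spark SQL side, as an assumption to verify against the release documentation:

```bash
# Illustrative selection of an s3a committer (Hadoop 3.1+); values are examples.
spark-submit \
  --conf spark.hadoop.fs.s3a.committer.name=directory \
  --conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory \
  ...
# The "magic" committer additionally assumes a consistent store (or S3Guard)
# and fs.s3a.committer.magic.enabled=true.
```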
  • 32. 32 Summary and Next Step
Summary
• Bigdata on a Ceph data lake is functionally ready, as validated with the industry-standard decision-support workload TPC-DS.
• Bigdata on the cloud delivers on-par performance with remote HDFS for batch analytics; write-intensive operations still need further optimization.
• All-flash solutions demonstrated a significant TCO benefit compared with HDD solutions.
Next
• Expand the scope of analytic workloads.
• Optimize rename operations to improve performance.
• Accelerate performance with a speed-up layer for shuffle.
  • 35. Experiment environment
Roles: Hadoop head: Hadoop name node, secondary name node, resource manager, data node, node manager, Hive metastore service, YARN history server, Spark history server, Presto server. Hadoop slave: data node, node manager, Presto server. Load balancer: HAProxy. OSD: Ceph OSD. RGW: Ceph RADOS gateway.
# of nodes: 1 head, 5 slaves, 1 load balancer, 5 OSD, 5 RGW.
Processor: Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz, 44 cores, HT enabled (load balancer: Intel® Xeon® CPU E31280 @ 3.50GHz, 4 cores, HT enabled).
Memory: 128GB / 64GB / 128GB / 32GB.
Storage: 4x 1TB HDD; 2x Intel S3510 480GB SSD (vs S3700 metrics); 1x Intel S3510 480GB SSD; OSD nodes: 1x Intel® P3700 1.6TB as journal, 4x 1.6TB Intel® SSD DC S3510 and 2x 400GB S3700 as data store; RGW nodes: 1x Intel S3510 480GB SSD.
Network: 10Gb / 40Gb / 10Gb+10Gb / 10Gb.
  • 36. SW Configuration
Hadoop version 2.7.3/2.8.1 • Spark version 2.1.1/2.2.0 • Hive version 2.2.1 • Presto version 0.177 • Executor memory 22GB • Executor cores 5 • # of executors 24 • JDK version 1.8.0_131 • memory.overhead 5GB
S3A Key Performance Configuration:
fs.s3a.connection.maximum 10
fs.s3a.threads.max 30
fs.s3a.socket.send.buffer 8192
fs.s3a.socket.recv.buffer 8192
fs.s3a.threads.keepalivetime 60
fs.s3a.max.total.tasks 1000
fs.s3a.multipart.size 100M
fs.s3a.block.size 32M
fs.s3a.readahead.range 64K
fs.s3a.fast.upload true
fs.s3a.fast.upload.buffer array
fs.s3a.fast.upload.active.blocks 4
fs.s3a.experimental.input.fadvise random
  • 37. Legal Disclaimer & Optimization Notice • Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. • INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. • Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. 37 Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 37