Accelerating Data Computation on Ceph Objects

Leonardo Militano
milt@zhaw.ch
Alluxio Online Meetup - 10.11.2020

Agenda
● Introduction to Cloud Storage
● Solutions for data analytics based on data locality
● Alluxio based solution for data analytics
● Performance evaluation
● Conclusions

Service Engineering group
● The SE group at InIT, Zurich University of Applied Sciences
(ZHAW), Switzerland
● Core expertise: IaaS, PaaS, SaaS, virtualization
● Focus is on scalable and reliable implementation of
IT-based services
● Research Initiatives:
○ Cloud (infrastructure, platform, CI/CD, DevOps, CNA)
○ Robotics (cloud robotics, ROS)
● Blog: https://blog.zhaw.ch/icclab/

ICCLab’s research approach
strategic research agenda
core research expertise

Storage in the Cloud
● The global storage market is growing 25.8% annually and is
predicted to reach a value of $74.94 billion in 2021
● Increasing demand for data storage:
○ IDC expects data to grow 61% to 163 ZB by 2025
○ By 2025, 49 percent of data will be stored in public cloud
environments
● At the same time, there is a paradigm shift, with more data created,
stored and processed at the edge
● Data is the new oil!

Data analytics
● If data is the new oil, it needs to be processed into higher-order
products to benefit from its value
● Disaggregation of storage and compute for cost efficiency and
manageability is the common approach
○ Data is remote to the compute nodes
● Bringing the code to the data (e.g., computational storage) or
bringing the data close to the code (e.g., in-memory
computation)?
○ Data locality for bandwidth, power consumption, cost, latency, and security

Ceph storage
● Ceph is a unified, distributed storage system
with self-management and self-healing features
for Object Storage, Block Storage and File
Storage
● Our earlier experiments on Ceph object classes
for active storage showed great time savings
when using object classes

Alluxio for Memory Speed Computation
● Alluxio on the compute nodes allows for in-memory computation and fast data
analysis
Source: alluxio.io
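
The idea of keeping data in memory on the compute side can be illustrated with a toy read-through cache. This is a minimal sketch of the general caching principle only, not Alluxio's actual design or API; the class `ReadThroughCache` and the function `fake_object_store` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy in-memory read-through cache (illustrative only, not Alluxio's design)."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote   # function that performs the remote read
        self._store: Dict[str, bytes] = {}  # in-memory copies of objects already read
        self.remote_reads = 0               # how many times remote storage was hit

    def read(self, key: str) -> bytes:
        # Only the first access to a key goes to remote storage;
        # later accesses are served from local memory.
        if key not in self._store:
            self.remote_reads += 1
            self._store[key] = self._fetch_remote(key)
        return self._store[key]


def fake_object_store(key: str) -> bytes:
    # Hypothetical stand-in for a remote object store such as Ceph.
    return f"contents of {key}\n".encode()


cache = ReadThroughCache(fake_object_store)
first = cache.read("bucket/data.txt")   # remote read
second = cache.read("bucket/data.txt")  # served from memory
print(cache.remote_reads)
```

The second `read` never reaches the remote store, which is the effect the measurements in this talk quantify for repeated accesses.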

The framework used for testing
● Ceph (version mimic) storage cluster
○ 6 OpenStack VMs: 1 Ceph monitor, 3 OSDs,
1 RGW, 1 Admin node
● Total storage size of 420GiB over 7
OSD volumes
● Alluxio cluster (v2.3 and v2.4)
● Spark (v3.0.0)
● Scala application on Spark
● More details can be found in our blog post

Two compute cluster configurations
● Single-node:
○ One VM (16vCPUs) for Alluxio and Spark with 40GB of
memory for the worker node
● Cluster-mode:
○ Two Spark/Alluxio worker nodes (16vCPUs, 40GB memory)
● Scala application over Spark
○ repeated access to a text file
○ count operation over the lines in the file
● A comparison was performed in terms of overall
execution time for different file sizes:
○ Alluxio-based vs. direct Ceph access
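
The measurement described on this slide (count the lines of a file, repeat the access, compare execution times) can be sketched as follows. The original application was written in Scala on Spark; this is a plain-Python stand-in for the timing logic only, run here against a small local temporary file. The function names `count_lines_local` and `time_repeated_count` are hypothetical.

```python
import os
import tempfile
import time
from typing import Callable, List, Tuple


def count_lines_local(path: str) -> int:
    """Count lines in a local file (stand-in for the Spark count operation)."""
    with open(path, "rb") as fh:
        return sum(1 for _ in fh)


def time_repeated_count(
    count_fn: Callable[[str], int], path: str, repeats: int = 2
) -> List[Tuple[int, float]]:
    """Run count_fn(path) `repeats` times; return (line count, seconds) per access."""
    results = []
    for _ in range(repeats):
        start = time.perf_counter()
        n = count_fn(path)
        results.append((n, time.perf_counter() - start))
    return results


# Demo on a small temporary file, not on the 1GB/5GB/10GB files used in the talk.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("line\n" * 1000)
    demo_path = tmp.name

runs = time_repeated_count(count_lines_local, demo_path)
os.remove(demo_path)

for i, (n, secs) in enumerate(runs, start=1):
    print(f"access {i}: {n} lines in {secs:.6f} s")
```

With PySpark available, `count_fn` could instead be something like `lambda p: sc.textFile(p).count()`, called once with an Alluxio path and once with a direct Ceph path. The exact URIs and cluster configuration are not given in the slides, so they would have to be filled in for a real setup.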

Summary of results
● Single-node setup:
○ On the second access, reading the file directly from Ceph takes 75
times longer than reading it through Alluxio for the 1GB file, and
111 and 107 times longer for the 5GB and 10GB files
● Cluster-mode setup:
○ On the second access, reading the file directly from Ceph takes 35
times longer than reading it through Alluxio for the 1GB file, and
57 and 65 times longer for the 5GB and 10GB files
● NB! Results were obtained using Java version 8 (a prerequisite of
Alluxio v2.3)
○ Direct Ceph file access with Spark performs much better with Java 11
than with Java 8!
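
The factors reported on this slide are ratios of execution times on the second access (direct Ceph time divided by Alluxio time). The sketch below collects the reported Java 8 factors in one table and shows how such a factor would be computed from two measured times. The helper names are hypothetical, and no raw timings are included because the slides give only the ratios.

```python
from typing import Dict

# Factors reported in the talk (second access, Java 8, Alluxio v2.3):
# how many times longer direct Ceph access takes than access through Alluxio.
REPORTED_SLOWDOWN: Dict[str, Dict[str, int]] = {
    "single-node": {"1GB": 75, "5GB": 111, "10GB": 107},
    "cluster-mode": {"1GB": 35, "5GB": 57, "10GB": 65},
}


def slowdown_factor(direct_seconds: float, alluxio_seconds: float) -> float:
    """Ratio of direct-access time to Alluxio-access time for the same job."""
    if alluxio_seconds <= 0:
        raise ValueError("alluxio_seconds must be positive")
    return direct_seconds / alluxio_seconds


def format_table(factors: Dict[str, Dict[str, int]]) -> str:
    """Render the factors as a fixed-width text table."""
    sizes = ["1GB", "5GB", "10GB"]
    rows = [f"{'setup':<14}" + "".join(f"{s:>8}" for s in sizes)]
    for setup, by_size in factors.items():
        rows.append(f"{setup:<14}" + "".join(f"{by_size[s]:>7}x" for s in sizes))
    return "\n".join(rows)


print(format_table(REPORTED_SLOWDOWN))
```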

Testing Alluxio 2.4
● With Java 11, execution times are lower overall, so the benefit of Alluxio is smaller
● Still, the second access to a 10GB file is 6 times faster through Alluxio than with
direct Ceph access
● Alluxio 2.4 thus removes an important limitation of previous versions

Conclusions
● Alluxio enables memory-speed data access by eliminating
remote data reads for repeated accesses
● Our results show that both single-node and cluster-mode
setups yield large speedups on repeated access (35 to 111
times with Java 8)
● Alluxio 2.3 had Java version 8 as a prerequisite (default
Java version is Java 11), which was a limiting factor
● Java 11 support in Alluxio 2.4 is essential to retain the
performance improvements over direct backend storage
access

Q&A
