Accelerating Data
Computation on Ceph Objects
Leonardo Militano
milt@zhaw.ch
Alluxio Online Meetup - 10.11.2020
Agenda
● Introduction to Cloud Storage
● Solutions for data analytics based on data locality
● Alluxio based solution for data analytics
● Performance evaluation
● Conclusions
Service Engineering group
● The SE group at InIT, Zurich University of Applied Sciences
(ZHAW), Switzerland
● Core expertise: IaaS, PaaS, SaaS, virtualization
● Focus is on scalable and reliable implementation of
IT-based services
● Research Initiatives:
○ Cloud (infrastructure, platform, CI/CD, DevOps, CNA)
○ Robotics (cloud robotics, ROS)
● Blog: https://blog.zhaw.ch/icclab/
ICCLab’s research approach
strategic research agenda
core research expertise
Storage in the Cloud
● The global storage market is growing 25.8% annually and is
predicted to reach $74.94 billion in 2021
● Increasing demand for data storage:
○ IDC expects data to grow 61% to 163 ZB by 2025
○ By 2025, 49 percent of data will be stored in public cloud
environments
● At the same time, there is a paradigm shift with more data created,
stored and processed at the edge
● Data is the new oil!
Data analytics
● If data is the new oil, it needs to be processed into higher-order
products to benefit from its value
● Disaggregation of storage and compute for cost efficiency and
manageability is the common approach
○ Data is remote to the compute nodes
● Bringing the code to the data (e.g., computational storage) or
bringing the data close to the code (e.g., in-memory
computation)?
○ Data locality for bandwidth, power consumption, cost, latency, and security
Ceph storage
● Ceph is a unified, distributed storage system
with self-management and healing features
for: Object Storage, Block Storage and File
Storage
● Our earlier "Experiments on Ceph Object Classes for Active
Storage" showed great time savings from using object classes
Alluxio for Memory Speed Computation
● Alluxio on the compute nodes allows for in-memory computation and fast data
analysis
Source: alluxio.io
The framework used for testing
● Ceph (version mimic) storage cluster
○ 6 OpenStack VMs: 1 Ceph monitor, 3 OSDs,
1 RGW, 1 Admin node
● Total storage size of 420GiB over 7
OSD volumes
● Alluxio cluster (v2.3 and v2.4)
● Spark (v3.0.0)
● Scala application on Spark
● Find more details on our blog post
Two compute cluster configurations
● Single-node:
○ One VM (16vCPUs) for Alluxio and Spark with 40GB of
memory for the worker node
● Cluster-mode:
○ Two Spark/Alluxio worker nodes (16vCPUs, 40GB memory)
● Scala application over Spark
○ repeated access to a text file
○ count operation over the lines in the file
● A comparison was performed in terms of overall
execution time for different file sizes:
○ Alluxio-based vs. direct Ceph access
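The measurement described above (repeated access to a text file, a line count, and the execution time of each access) can be sketched as follows. This is a minimal stand-in, not the Scala application used in the tests: it is plain Python, it reads a small local temporary file instead of a Ceph or Alluxio path, and the names `count_lines`, `time_repeated_access` and `make_text_file` are hypothetical helpers introduced only for illustration.

```python
import os
import tempfile
import time
from typing import Callable, List


def count_lines(path: str) -> int:
    """Count the lines of a text file by streaming over it."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def time_repeated_access(read: Callable[[], int], repeats: int = 2) -> List[float]:
    """Call `read` `repeats` times; return the wall-clock seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        read()
        timings.append(time.perf_counter() - start)
    return timings


def make_text_file(lines: int) -> str:
    """Write a temporary text file with `lines` lines (local stand-in for the test file)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as handle:
        for i in range(lines):
            handle.write(f"line {i}\n")
    return path


# Assumption: a tiny local file replaces the 1GB/5GB/10GB files of the tests.
sample = make_text_file(1000)
line_total = count_lines(sample)
timings = time_repeated_access(lambda: count_lines(sample), repeats=2)
os.remove(sample)
```

In a setup like the one on these slides, `read` would instead wrap the Spark count over the file, once through Alluxio and once directly against Ceph, and the two lists of timings would be compared per access.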
Single VM setup results
Cluster setup results
Summary of results
● Single-node setup:
○ On the second access, reading the file directly from Ceph takes 75
times longer than reading it through Alluxio for the 1GB file, and
111 and 107 times longer for the 5GB and 10GB files
● Cluster-mode setup:
○ On the second access, reading the file directly from Ceph takes 35
times longer than reading it through Alluxio for the 1GB file, and
57 and 65 times longer for the 5GB and 10GB files
● NB! Results were obtained using Java version 8 (prerequisite of
Alluxio v2.3)
○ Direct Ceph file access with Spark performs much better with Java 11
than with Java 8!
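The factors quoted above are ratios of second-access execution times. A minimal sketch of that calculation, assuming made-up timings: the 150 s and 2 s below are illustrative values chosen to give a factor of 75, not the measured times, and `slowdown_factor` is a hypothetical helper.

```python
def slowdown_factor(direct_seconds: float, cached_seconds: float) -> float:
    """Return how many times longer the direct access takes than the cached one."""
    if cached_seconds <= 0:
        raise ValueError("cached_seconds must be positive")
    return direct_seconds / cached_seconds


# Illustrative numbers only (assumption): 150 s direct vs. 2 s through the cache.
example_factor = slowdown_factor(direct_seconds=150.0, cached_seconds=2.0)
```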
Testing Alluxio 2.4
● The gains are smaller because Java 11 reduces execution time across the board
● Even so, the second access to a 10GB file is still 6 times faster than direct
Ceph access
● By supporting Java 11, Alluxio 2.4 removes an important limitation of previous versions
Conclusions
● Alluxio enables memory-speed data access by eliminating
remote data reads for repeated accesses
● Our results show that both single-node and cluster-mode
setups speed up repeated access by one to two orders of
magnitude (35 to 111 times) with Alluxio 2.3 on Java 8
● Alluxio 2.3 required Java version 8 (default Java version
is Java 11), which was a limiting factor
● Alluxio 2.4 supports Java 11, which is essential to keeping
the performance advantage over direct backend storage
access
Q&A
