Sahara is an OpenStack service that provides a platform for big data and data analytics workflows. It allows users to easily provision Hadoop clusters on OpenStack. Sahara supports various data processing engines and provides features such as cluster scaling, storage integration, and security. Benchmark testing showed significant performance overhead from using virtual machines compared to bare metal deployment, especially for I/O intensive workloads. Future work may include improved support for container-based and bare metal clusters as well as enhanced data processing and cluster management capabilities.
MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS (Mats Kindahl)
This presentation from MySQL Connect gives a brief introduction to Big Data and the tooling used to gain insights into your data. It also introduces an experimental prototype of the MySQL Applier for Hadoop, which can be used to incorporate changes from MySQL into HDFS using the replication protocol.
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... (Databricks)
This talk is about sharing experience and lessons learned on setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change, with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming, and database services. The talk is aimed at developers, DBAs, service managers, and members of the Spark community who are using and/or investigating "Big Data" solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy-to-use APIs that unify, under one large umbrella, many different types of data processing workloads, from ETL to SQL reporting to ML.
Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale up data pipelines and workloads from laptops to large clusters of commodity hardware or in the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related "Big Data" tools for a community of developers and DBAs at CERN with a background in relational database operations.
Introduction to Apache Spark Workshop at Lambda World 2015, held in Cádiz on October 23rd and 24th, 2015. Speakers: @fperezp and @juanpedromoreno
Github Repo: https://github.com/47deg/spark-workshop
Integrating Existing C++ Libraries into PySpark with Esther Kundin (Databricks)
Bloomberg’s Machine Learning/Text Analysis team has developed many machine learning libraries for fast real-time sentiment analysis of incoming news stories. These models were developed using smaller training sets, implemented in C++ for minimal latency, and are currently running in production. To facilitate backtesting our production models across our full data set, we needed to be able to parallelize our workloads, while using the actual production code.
We also wanted to integrate the C++ code with PySpark and use it to run our models. In this talk, I will discuss some of the challenges we faced, decisions we made, and other options when dealing with integrating existing C++ code into a Spark system. The techniques we developed have been used successfully by our team multiple times and I am sure others will benefit from the gotchas that we were able to identify.
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ... (Spark Summit)
One of the key challenges in working with real-time and streaming data is that the format used for capturing data is not necessarily the optimal format for ad hoc analytic queries. For example, Avro is a convenient and popular serialization system that is great for initially bringing data into HDFS. Avro has native integration with Flume and other tools, which makes it a good choice for landing data in Hadoop. But columnar file formats, such as Parquet and ORC, are much better optimized for ad hoc queries that aggregate over large numbers of similar rows.
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla (Spark Summit)
Spark has deservedly become the leading massively parallel processing framework, and HDFS is one of the most popular Big Data storage technologies; their combination is therefore one of the most common Big Data use cases. But what happens with security? Can these two technologies coexist in a secure environment? Furthermore, with the proliferation of BI technologies adapted to Big Data environments, which demand that several users interact with the same cluster concurrently, can we continue to ensure that our Big Data environments remain secure? In this lecture, Abel and Jorge will explain which adaptations of Spark's core they had to perform in order to guarantee the security of multiple concurrent users sharing a single Spark cluster, with any of its cluster managers, without degrading Spark's outstanding performance.
Unlocking Your Hadoop Data with Apache Spark and CDH5 (SAP Concur)
The Spark/Mesos Seattle Meetup group shares the latest presentation from their recent meetup event, showcasing real-world implementations of working with Spark within the context of your Big Data infrastructure.
Sessions are demo-heavy and slide-light, focusing on getting your development environment up and running, covering configuration issues, Spark SQL vs. Hive, and more.
To learn more about the Seattle meetup: http://www.meetup.com/Seattle-Spark-Meetup/members/21698691/
Support Iran's representative in the WSIS Prizes 2016 ... thank you ... (Leila Esmaeili)
The World Summit on the Information Society (WSIS) is held annually by the International Telecommunication Union (ITU), with the support of the United Nations, on the theme of information and communication technology and the information society. This global summit, attended by the telecommunications and ICT ministers of countries around the world, pursues eighteen main action lines, encouraging countries to follow them and to develop and advance their national ICT.
Among the most important of these action lines is developing ICT applications in e-learning and e-science. The integrated virtual education solution (ELIS) and the Kowsar-Net network have been developed along these two lines respectively, and are now competing against foreign rivals as Iran's representatives for the WSIS Prizes. Having passed the international evaluation and judging stage, these two projects are nominated for the best-project award in the e-learning and e-science categories and are now in the public voting stage.
This message is therefore sent to support our country's representatives in this international arena. Given the mission of these two projects and their scientific, cultural, and Islamic nature, you are invited to vote for and declare your support for the named projects (categories 9 and 14), and to invite other friends, colleagues, and specialists to support the integrated virtual education solution (ELIS) and the Kowsar-Net network. Your vote for projects in the other categories is also valuable.
Registration and voting site:
http://groups.itu.int/stocktaking/WSISPrizes/WSISPrizes2016/Voting.aspx
Voting deadline: 20 Esfand 1394
For more information, follow our Telegram channel: @wsisprizeswhc
Hadoop is often viewed as needing racks of dedicated boxes, despite the fact that in sheer number terms, the majority of Hadoop clusters ever created have been brought up on public cloud infrastructures, particularly Amazon's. Yet the rest of datacenter computing is moving towards virtualization, be it in-cloud startups or in-enterprise IT departments. Some organizations are standing up private clouds: a rack or two of servers with an API for VM creation. Hadoop can live there; it just needs to integrate better. At the same time, OpenStack is emerging as the de facto standard open source cloud platform for private use, and is available publicly from a number of cloud infrastructure service providers. This talk looks at what we've done, and are doing, to integrate Hadoop with OpenStack. This takes it beyond Hadoop's current support for Amazon's infrastructure, making a combined Hadoop + OpenStack cluster something to consider in-house and in-cloud. This work is being done in collaboration with members of the OpenStack community, showing how cloud and big data projects can not only co-exist but co-develop their platforms.
Slides from the hands-on workshop on data security and risk management in cloud computing - 20 Aban - Sharif University
Including plain-language examples of cloud computing and the results of a case study on risk analysis of cloud fax.
The Evolution of OpenStack – From Infancy to Enterprise (Rackspace)
As OpenStack turns 5 this year, we thought it would be a good time to take a look back at the evolution of OpenStack. We start with a quick overview of what OpenStack is, how OpenStack came to be, and describe the OpenStack Foundation. Next we describe the problem that OpenStack helps to solve, the components of OpenStack, and the timeline for when these components came to be. Last, we outline the current features and benefits that make OpenStack ready for the enterprise, with supporting enterprise use case examples. The blog post can be found at https://developer.rackspace.com/blog/evolution-of-openstack-from-infancy-to-enterprise/ and the webinar at https://www.brighttalk.com/webcast/11427/138613
The massive computing and storage resources that are needed to support big data applications make cloud environments an ideal fit. In this session, you'll learn how to build your big data "database on-demand" using MongoDB, Cassandra, Solr, MySQL, or any other big data solution, as well as manage your big data application using a new open source framework called “Cloudify.” All this, on top of the OpenStack cloud.
Our Multi-Year Journey to a 10x Faster Confluent Cloud (HostedbyConfluent)
Confluent Cloud is a cloud-native service based on Apache Kafka. We run tens of thousands of clusters across all major cloud service providers (AWS, GCP and Azure). In this talk, we will go over our journey to make Confluent Cloud 10x faster than Apache Kafka.
We will talk about how we designed our various workloads, the complexities involved in our cloud-native service, the challenges we faced, and the various pitfalls we ran into. We will also cover the interesting learnings, which in hindsight, are first principles from this multi-year journey.
By attending this talk, attendees will be able to take our learnings from making Confluent Cloud latencies 10x better and possibly apply similar principles to their cloud native data streaming systems.
sudoers: Benchmarking Hadoop with ALOJA (Nicolas Poggi)
Presentation for the sudoers Barcelona group, Oct 06 2015, on benchmarking Hadoop with the ALOJA open source benchmarking platform. The presentation was mostly a live demo; posting some slides for the people who could not attend.
http://lanyrd.com/2015/sudoers-barcelona-october/
July 2018 talk to SW Data Meetup by Rob Vesse, Software Engineer, Cray Inc, discussing open source technologies for data science on high performance systems (Spark, Hadoop, PyData ecosystem, containers, etc), focusing on some of the implementation and scaling challenges they face.
Lookout on Scaling Security to 100 Million Devices (ScyllaDB)
The massive increase of security-related data requires companies to respond with new approaches to ingestion. Learn how Lookout has changed its approach for ingesting telemetry to meet their goal of growing from 1.5 million devices to 100 million devices and beyond, using Kafka Connect and switching from AWS DynamoDB to Scylla.
Improve performance and gain room to grow by easily migrating to a modern Ope... (Principled Technologies)
We deployed this modern environment, then migrated database VMs from legacy servers and saw performance improvements that support consolidation.
Conclusion
If your organization’s transactional databases are running on gear that is several years old, you have much to gain by upgrading to modern servers with new processors and networking components and an OpenShift environment. In our testing, a modern OpenShift environment with a cluster of three Dell PowerEdge R7615 servers with 4th Generation AMD EPYC processors and high-speed 100Gb Broadcom NICs outperformed a legacy environment with MySQL VMs running on a cluster of three Dell PowerEdge R7515 servers with 3rd Generation AMD EPYC processors and 25Gb Broadcom NICs. We also easily migrated a VM from the legacy environment to the modern environment, with only a few steps required to set up and less than ten minutes of hands-on time. The performance advantage of the modern servers would allow a company to reduce the number of servers necessary to perform a given amount of database work, thus lowering operational expenditures such as power and cooling and IT staff time for maintenance. The high-speed 100Gb Broadcom NICs in this solution also give companies better network performance and networking capacity to grow as they embrace emerging technologies such as AI that put great demands on networks.
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence (inside-BigData.com)
In this deck, Johann Lombardi from Intel presents: DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence.
"Intel has been building an entirely open source software ecosystem for data-centric computing, fully optimized for Intel® architecture and non-volatile memory (NVM) technologies, including Intel Optane DC persistent memory and Intel Optane DC SSDs. Distributed Asynchronous Object Storage (DAOS) is the foundation of the Intel exascale storage stack. DAOS is an open source software-defined scale-out object store that provides high bandwidth, low latency, and high I/O operations per second (IOPS) storage containers to HPC applications. It enables next-generation data-centric workflows that combine simulation, data analytics, and AI."
Unlike traditional storage stacks that were primarily designed for rotating media, DAOS is architected from the ground up to make use of new NVM technologies, and it is extremely lightweight because it operates end-to-end in user space with full operating system bypass. DAOS offers a shift away from an I/O model designed for block-based, high-latency storage to one that inherently supports fine-grained data access and unlocks the performance of next-generation storage technologies.
Watch the video: https://youtu.be/wnGBW31yhLM
Learn more: https://www.intel.com/content/www/us/en/high-performance-computing/daos-high-performance-storage-brief.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16 (MLconf)
Scaling Spark – Vertically: The mantra of Spark technology is divide and conquer, especially for problems too big for a single computer. The more you divide a problem across worker nodes, the more total memory and processing parallelism you can exploit. This comes with a trade-off. Splitting applications and data across multiple nodes is nontrivial, and more distribution results in more network traffic which becomes a bottleneck. Can you achieve scale and parallelism without those costs?
We’ll show results of a variety of Spark application domains including structured data, graph processing and common machine learning in a single, high-capacity scaled-up system versus a more distributed approach and discuss how virtualization can be used to define node size flexibly, achieving the best balance for Spark performance.
Enterprise data centers house numerous workloads. With Hadoop growing in these data centers, IT departments need tools to avoid creating silos while maintaining SLAs, reporting, and charge-back requirements. We present a completely open source reference architecture including Apache Hadoop, Linux cgroups and namespace isolation, Gluster, and HTCondor. Topics to be covered:
- Augmenting existing HDFS and MapReduce infrastructure with dynamically provisioned resources
- On-demand creating, growing, and shrinking of MapReduce infrastructure for user workloads
- Isolating workloads to enable multi-tenant access to resources
- Publishing of resource utilization and accounting information for ingest into charge-back systems
Similar to Benchmarking sahara based big data as a service solutions (20)
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability, which can then be measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD within UiPath
End-to-end overview of the CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! (SOFTTECHHUB)
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by Rik Marselis and me at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We closed with a lovely workshop in which participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
2. Agenda
o Why Sahara
o Sahara introduction
o Deployment considerations
o Performance testing and results
o Future envisioning
o Summary and Call to Action
3. Why Sahara: Cloud features
o You or someone at your company is using AWS, Azure, or Google
o You're probably doing it for easy access to OS instances, but also the modern application features, e.g. AWS' EMR or RDS or Storage
o Migrating to, or even using, OpenStack infrastructure for workloads means having application features, e.g. Sahara & Trove
o Writing applications is complex enough without having to manage supporting (non-value-add) infrastructure
4. Why Sahara: Data analysis
o Writing data analysis applications is especially hard
o Complexity in acquiring data
o Complexity in organizing (ETL) data
o Complexity in analyzing data
o Complexity in integrating analysis into applications (bonus!)
o This compounded complexity does not even include the necessary tooling and infrastructure
5. Agenda
o Why Sahara
o Sahara introduction
o Deployment considerations
o Performance testing and results
o Future envisioning
o Summary and Call to Action
6. Sahara features
o Repeatable cluster provisioning and management operations
o Data processing workflows (EDP)
o Cluster scaling (elasticity), storage integration (Swift, Cinder, HCFS)
o Network and security group (firewall) integration
o Service anti-affinity (fault domains & efficiency)
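Sahara's "repeatable cluster provisioning" is template-driven: a cluster template names a plugin, a version, and a set of node groups, and the service instantiates clusters from it. The sketch below composes such a request body in plain Python; the field names are modeled on Sahara's REST API but should be treated as illustrative assumptions rather than a verified schema.

```python
# Illustrative sketch of the payload a Sahara cluster-template request
# carries. Field names (plugin_name, hadoop_version, node_groups, ...)
# are modeled on Sahara's REST API and are assumptions, not a verified schema.

def make_cluster_template(name, plugin, version, node_groups):
    """Compose a cluster-template dict and sanity-check the node groups."""
    total = sum(ng["count"] for ng in node_groups)
    if total < 1:
        raise ValueError("cluster needs at least one instance")
    return {
        "name": name,
        "plugin_name": plugin,       # e.g. "vanilla", "spark"
        "hadoop_version": version,
        "node_groups": node_groups,
        "instance_count": total,     # derived, for convenience
    }

template = make_cluster_template(
    "demo-cluster", "vanilla", "2.7.1",
    [
        {"name": "master", "flavor_id": "m1.large", "count": 1,
         "node_processes": ["namenode", "resourcemanager"]},
        {"name": "worker", "flavor_id": "m1.medium", "count": 3,
         "node_processes": ["datanode", "nodemanager"]},
    ],
)
print(template["instance_count"])  # 4
```

In a real deployment this dict would be submitted through python-saharaclient or the REST endpoint; the point here is only the shape of the template, which is what makes provisioning repeatable.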
8. Sahara plugins
o Users get a choice of integrated data processing engines
o Vendors get a way to integrate with OpenStack and access users
o Upstream - Apache Hadoop (Vanilla), Hortonworks, Cloudera, MapR, Apache Spark, Apache Storm
o Downstream - depends on your OpenStack vendor
9. o Why Sahara
o Sahara introduction
o Deployment considerations
o Performance testing and results
o Future envisioning
o Summary and Call to Action
9
Agenda
10. 10
Storage Architecture
[Diagram: four storage layouts. #1: HDFS inside the computing VMs. #2: HDFS on the host. #3: HDFS in a separate VM. #4: remote storage such as legacy NFS, GlusterFS, Ceph, external HDFS, or Swift.]
Scenario #1: computing and data service collocate in the VMs
Scenario #2: data service is located in the host world
Scenario #3: data service is located in a separate VM world
Scenario #4: data service is located in the remote network
o Tenant provisioned (in VM)
o HDFS in the same VMs as the computing tasks vs. in
different VMs
o Ephemeral disk vs. Cinder volume
o Admin provided
o Logically disaggregated from computing tasks
o Physical collocation is a matter of deployment
o For network-remote storage, Neutron DVR is a very
useful feature
o A disaggregated (and centralized) storage system has
significant value
o No data silos, more business opportunities
o Could leverage the Manila service
o Allows creating advanced solutions (e.g. an in-memory
overlay)
o More vendor-specific optimization opportunities
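To a tenant, the backend choice above mostly surfaces as the URL scheme of the data source handed to a job. A small sketch, in which all host names, ports, paths, and container names are hypothetical:

```python
# Illustrative data-source URLs for the storage options on this slide.
# Every host, port, path, and container name here is a placeholder.
sources = {
    "internal_hdfs": "hdfs://cluster-master:8020/user/demo/input",  # scenario #1
    "external_hdfs": "hdfs://external-nn:8020/shared/input",        # scenario #4
    "swift":         "swift://demo-container/input",                # scenario #4
}

# The job logic is unchanged; only the URL (and credentials) differ.
schemes = sorted({url.split("://", 1)[0] for url in sources.values()})
print(schemes)  # ['hdfs', 'swift']
```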
11. o Containers seem promising but still need better support
o Determining the appropriate cluster size is always a challenge for tenants
o e.g. a small flavor with more nodes or a large flavor with fewer nodes
11
Compute Engine
o VM
Pros: best support in OpenStack; strong security
Cons: slow to provision; relatively high runtime performance overhead
o Container
Pros: light-weight, fast provisioning; better runtime performance than VM
Cons: Nova-docker readiness; Cinder volume support is not ready yet; weaker security than VM; not the ideal way to use containers
o Bare-Metal
Pros: best performance and QoS; best security isolation
Cons: Ironic readiness; worst efficiency (e.g. consolidation of workloads with different behaviors); worst flexibility (e.g. migration); worst elasticity due to slow provisioning
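The flavor-vs-node-count question above is partly plain capacity arithmetic: two shapes can offer the same aggregate resources while differing in parallelism and blast radius. A sketch with made-up flavor sizes:

```python
# Two hypothetical cluster shapes with equal aggregate capacity:
# many small nodes vs. few large nodes (all numbers are made up).
small = {"vcpus": 2, "ram_gb": 4,  "count": 8}   # 8 small nodes
large = {"vcpus": 8, "ram_gb": 16, "count": 2}   # 2 large nodes

def totals(shape):
    """Return (total vCPUs, total RAM in GB) for a cluster shape."""
    return shape["vcpus"] * shape["count"], shape["ram_gb"] * shape["count"]

print(totals(small))  # (16, 32)
print(totals(large))  # (16, 32)
# Same capacity; they differ in scheduling granularity, per-node failure
# impact, and how much HDFS replication traffic crosses the network.
```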
12. o Direct Cluster Operations
o Sahara is used as a provisioning engine
o Tenants expect to have direct access to the virtual cluster
o e.g. directly SSH into the VMs
o May use whatever APIs come with the distro
o e.g. Oozie
o EDP approach
o Sahara’s EDP is designed to be an abstraction layer for tenants to consume the services
o Ideally should be vendor neutral and plugin agnostic
o Limited job types are supported at present
o 3rd party abstraction APIs
o Not supported yet
o e.g. Cask CDAP
Data Processing API
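As a rough illustration of the EDP abstraction, a job-execution request body might look like the following sketch; the structure follows the general shape of Sahara's EDP job-execution API, but every ID is a placeholder and the config key is only an example:

```python
# Sketch of an EDP job-execution request body. All IDs are placeholders.
job_execution = {
    "cluster_id": "<cluster-uuid>",
    "input_id": "<data-source-uuid>",    # e.g. a swift:// or hdfs:// data source
    "output_id": "<data-source-uuid>",
    "job_configs": {
        "configs": {"mapred.reduce.tasks": "4"},  # example engine-specific tuning
        "args": [],
    },
}

# The point of EDP: the same payload shape is submitted whether the
# cluster runs vanilla Hadoop, CDH, HDP, or Spark, so tenants are not
# tied to one distro's native job API (such as Oozie).
print(sorted(job_execution))
```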
13. Deployment Considerations Matrix
o Storage: tenant- vs. admin-provisioned; disaggregated vs. collocated; HDFS vs. other options
o Compute: VM, Container, Bare-metal
o Distro/Plugin: Vanilla, CDH, HDP, MapR, Spark, Storm
o Data Processing API: traditional EDP (Sahara native) vs. 3rd party APIs
Performance results in the next section
14. o Why Sahara
o Sahara introduction
o Deployment considerations
o Performance testing and results
o Future envisioning
o Summary and Call to Action
14
Agenda
16. Ephemeral Disk Performance
[Diagram: HDFS on the host with RAID vs. HDFS in VMs on the Nova instance store]
● 1.3x read overhead, 2.1x overhead
○ disk access pattern change: 10%
○ virtualization overhead: 90%
■ 60% due to I/O overhead
■ 30% due to memory inefficiency in virtualization
Heavy tuning is required.
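The breakdown above is simple arithmetic over the extra run time implied by the 2.1x result; a sketch:

```python
# Decompose the 2.1x ephemeral-disk result: the extra run time beyond
# the 1.0x baseline splits per the percentages quoted on the slide.
baseline = 1.0
measured = 2.1
extra = measured - baseline                 # ~1.1x extra run time

pattern_change = 0.10 * extra               # disk access pattern change
virtualization = 0.90 * extra               # virtualization overhead, total
io_overhead    = 0.60 * extra               # ... of which I/O
memory_ineff   = 0.30 * extra               # ... of which memory inefficiency

# The two virtualization components account for the full 90% share.
assert abs((io_overhead + memory_ineff) - virtualization) < 1e-9
```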
17. o The performance difference comes from several factors:
o Free of I/O virtualization overhead
o DVR brings a huge performance enhancement
o Location awareness (HVE) is not enabled
External HDFS Performance
[Diagram: external HDFS on hosts with RAID vs. HDFS in VMs on the Nova instance store; 2x improvement, 1.3x overhead]
18. Swift Performance
o Similar to external HDFS but even worse
o Without location awareness enabled, all the traffic goes through the Swift
proxy node
[Diagram: Swift object storage accessed from computing VMs vs. HDFS in VMs on the Nova instance store; 1.35x overhead]
19. ● For an I/O intensive workload, the 2.19x overhead is big but consistent with previous
results.
● Containers demonstrate fair performance compared to KVM
○ considering that OpenStack services also consume resources, 0.46x is not that bad.
Bare-metal vs. Container vs. VM
[Diagram: HDFS on a bare-metal host (1x) vs. in containers (1.46x) vs. in VMs (2.19x)]
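Normalizing the three measurements against bare metal makes the comparison explicit; minimal arithmetic sketch:

```python
# Run time relative to bare metal (1x), from the slide's measurements.
bare_metal, container, vm = 1.00, 1.46, 2.19

container_extra = container - bare_metal    # extra cost of containers
vm_extra        = vm - bare_metal           # extra cost of KVM VMs

# Containers sit much closer to bare metal than VMs do for this workload,
# which is the 0.46x figure quoted above.
assert container_extra < vm_extra
print(round(container_extra, 2))  # 0.46
```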
20. o Why Sahara
o Sahara introduction
o Deployment considerations
o Performance testing and results
o Future envisioning
o Summary and Call to Action
20
Agenda
21. o Architected for disaggregated computing and storage
o Supporting more storage backends
o Integration with Manila
o Better support for container and bare-metal (Nova-docker and Ironic)
o Magnum as an alternative?
o EDP as a PaaS like layer for Sahara core provisioning engine
o Data connector abstraction
o Workflow management
o Policy engine for resource and SLA management
o Auto-scale, auto-tune
o Sahara needs to offer broader vendor integration opportunities (not just engines)
o A complete big data stack may have many options at each layer
o e.g. acceleration libraries, analytics, developer oriented application framework (e.g.
CDAP)
o Requires a more generic plugin/driver framework to support it
21
Future of Sahara? (NOT a roadmap)
22. o Upgrade HDP plugin to HDP 2.2
o Hadoop HA for CDH and HDP
o Enhancements to provisioning through Heat
o Bring Sahara to python-openstackclient
o Bare-metal clusters with Ironic
o Security enhancements
o EDP enhancements
o Job scheduling
o Coordination
o Log retrieval
o Improved parameter specification
22
Liberty Roadmap Highlights
23. o Great improvements in the Sahara Kilo release. Production ready with
real customer deployments.
o A complete Big-Data-as-a-Service solution requires more
considerations than simply adding a Sahara service to the
existing OpenStack deployment
o Preliminary benchmark results show the performance gap with
bare-metal is still huge. Tuning and optimizations are required.
o Many features could be added to enhance Sahara. Opportunities
exist for various types of vendors.
23
Summary and Call-to-Action
Join the Sahara community and make it even more vibrant!