Webinar Session - https://youtu.be/_5MfGMf8PG4
In this webinar, we share how the Container Attached Storage pattern makes performance tuning more tractable: giving each workload its own storage system reduces the number of variables you need to understand and tune.
We then introduce Mayastor, a breakthrough in the use of containers and Kubernetes as a data plane. Mayastor is the first containerized data engine that delivers near the theoretical maximum performance of the underlying system. Its performance scales with the underlying hardware and has been shown, for example, to deliver in excess of 10 million IOPS in a particular environment.
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
1. • And more than a little about Mayastor
• And a demo!
• Built on Kubernetes for Kubernetes
May 2020
2. Who is MayaData
- Founded by a team with years of experience in what is now called Container Attached Storage
- A top 6 contributor to CNCF projects
- Including OpenEBS and other projects such as Kubernetes
- Also donating LitmusChaos to the CNCF - a leading chaos engineering project
- Open sourced OpenEBS almost 3.5 years ago, now the most widely deployed CAS
- OpenEBS had the most trial users of any “cloud native storage” asked about in a recent CNCF survey
- Well funded with Insight Partners, Nexus, EightRoads / Fidelity, DataCore and others
- Prominent users include the CNCF, Bloomberg, Arista and Comcast
3. So what is Container Attached Storage?
[Diagram: microservices (UI, REST API, cache service, microservice 1, microservice 2) each paired with its own data store (db1, db2, Redis, … db n)]
Every workload & team its own system
Different engines for different workloads
Built on Kubernetes for Kubernetes
Delivers the benefits of Kubernetes for data
- No lock-in
- Open source
- Runs consistently everywhere
- Any underlying cloud or disk or SAN
4. Steven Bower at Bloomberg
○ Moved to Kubernetes in order to simplify and standardize their environments
○ CNCF end user of the year 18/19
○ Running dozens of different stateful workloads at scale
○ Believes in open source
○ Not about cost savings - about agility
○ Everything loosely coupled
○ Teams are autonomous and full stack
○ Does not use shared storage
○ Uses OpenEBS - different flavors
https://www.youtube.com/watch?v=0CEHN6ECaPs
https://www.youtube.com/watch?v=z_LbRfDKPvE
5. [Diagram: the same microservices and databases as before, now attaching storage through CSI]
Can’t I just use CSI?
- 10^4 acceleration of storage media
- 10^x increase in workloads & dynamism
- loose coupling of workloads -> for freedom
- loose coupling of teams -> for care and feeding
Of course you can. And you do. However, you lose so many benefits of moving to Kubernetes. Most workloads just use Direct Attached Storage instead.
6. [Diagram: a UI / middleware / DB monolith and the microservices all funneling IO into one shared storage system]
A shared storage system is a complex monolithic distributed system built before Kubernetes
These systems have DBs for metadata
They have provisioning systems
They have retry & other logic
They take all the IO, mix it together, and do their best
How can we configure them to achieve optimal performance?
Designed when storage media was slow and apps were NOT resilient
7. Performance configuration
Mileage may vary
Workloads might include: [montage of workload logos] = 32 workloads
Let’s try the top 10 settings, set to 4 levels each
Test runs take 10 minutes; let’s do 10 each
...but a loaded system behaves differently
& how about network issues?
= 61 to 6,750 years
https://storagetarget.com/2017/07/07/four-decades-of-tangled-concerns/
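Spelling out the arithmetic behind that range (assuming the slide's figures: 10 settings at 4 levels each, 10-minute test runs, 10 runs per configuration, roughly 32 workload types):

```latex
\[
4^{10} = 1{,}048{,}576 \ \text{configurations per workload}
\]
\[
1{,}048{,}576 \times 10 \ \text{runs} \times 10 \ \text{min} \approx 1.05 \times 10^{8} \ \text{min} \approx 199 \ \text{years per workload}
\]
\[
199 \ \text{years} \times 32 \ \text{workloads} \approx 6{,}400 \ \text{years}
\]
```

That lands near the top of the slide's 61 to 6,750 year range; the low end presumably drops the repeated runs and most of the workloads. Either way, exhaustive tuning is hopeless, hence the takeaway below.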
Simplify
Partition the problem
Pick an engine per workload
Adjust dynamically
8. Mayastor & OpenEBS data engines
https://docs.openebs.io/docs/next/architecture.html
9. Mayastor is built around CSI -- a recap
● Container Storage Interface (CSI) is a set of gRPC methods defined by the Kubernetes storage Special Interest Group (SIG)
● Consists of 3 sets of RPCs (services), sketched below:
○ Controller service
○ Node service
○ Identity service
● Where and how you implement each service is not prescribed
● Different plugins may exist on the same node(s)
● CSI: the beginning of an answer for cloud native storage, not the answer
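To make the three-service split concrete, here is an illustrative Rust sketch. The method names are representative ones from the CSI spec, but the traits themselves are hypothetical, not the real protobuf-generated gRPC bindings:

```rust
// Illustrative only: the real CSI API is protobuf-defined and served over
// gRPC; these traits merely mirror its split into three services.

/// Identity service: lets the orchestrator discover and health-check a plugin.
trait Identity {
    fn get_plugin_info(&self) -> (String, String); // (name, vendor version)
    fn probe(&self) -> bool;                       // is the plugin ready?
}

/// Controller service: cluster-wide volume lifecycle (create, delete, attach).
trait Controller {
    fn create_volume(&self, name: &str, capacity_bytes: u64) -> Result<String, String>;
    fn delete_volume(&self, volume_id: &str) -> Result<(), String>;
}

/// Node service: per-node steps that make a volume usable by a pod.
trait Node {
    fn node_stage_volume(&self, volume_id: &str, staging_path: &str) -> Result<(), String>;
    fn node_publish_volume(&self, volume_id: &str, target_path: &str) -> Result<(), String>;
}

// A plugin may implement all three services in one process or split them;
// CSI does not prescribe where each one runs.
struct DemoPlugin;

impl Identity for DemoPlugin {
    fn get_plugin_info(&self) -> (String, String) {
        ("demo.csi.example.org".into(), "0.1.0".into()) // hypothetical plugin name
    }
    fn probe(&self) -> bool {
        true
    }
}

fn main() {
    let plugin = DemoPlugin;
    let (name, version) = plugin.get_plugin_info();
    println!("plugin {} v{} ready: {}", name, version, plugin.probe());
}
```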
12. All done -- stateful workloads solved?
● How does a developer configure the PV so that it has exactly the features required for that particular workload?
○ Number of replicas, snapshots, etc. (see the sketch after this list)
● How do we abstract away differences between storage vendors when moving to and from private or public cloud?
○ Provide a cloud native storage “look and feel” on premises?
○ Don't throw away our million dollar existing storage infra
● Data gravity -- the tendency of data to pull applications towards it
○ All the hard work to obtain loosely coupled systems is instantly lost
● Applications have changed, and someone forgot to tell storage.
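One common pattern for the first question is to express per-workload policy as StorageClass parameters that the provisioner reads back at volume creation time. A minimal Rust sketch of such a parameter set; the keys (repl, protocol, snapshots) are assumptions for illustration, not a documented API:

```rust
use std::collections::BTreeMap;

// Hypothetical per-workload storage policy of the kind a developer would put
// in a Kubernetes StorageClass `parameters:` stanza and a CAS provisioner
// would interpret when creating the PV. Keys are illustrative, not a real API.
fn storage_class_parameters() -> BTreeMap<String, String> {
    BTreeMap::from([
        ("repl".to_string(), "3".to_string()),            // assumed: number of replicas
        ("protocol".to_string(), "nvmf".to_string()),     // assumed: export protocol
        ("snapshots".to_string(), "enabled".to_string()), // assumed: snapshot policy
    ])
}

fn main() {
    // A provisioner would consume these when CreateVolume is called.
    for (key, value) in storage_class_parameters() {
        println!("{key}: {value}");
    }
}
```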
13. 3 top level observations
● The people
○ Infrastructure as code
○ K8s as the unified control plane for SW deployments
● The languages and abstractions we use to write the software
○ Go, Rust, and meta languages
○ Micro VMs (Firecracker, Kata, Cloud Hypervisor)
○ Applications are distributed systems themselves
● HW changes (and bugs) force a change in the way we do things
○ io_uring, hugepages
15. Impacts of HW changes on the stack
● Packets come in at a very high rate; a single CPU pegs at 100%, so how do we scale?
○ The CPU has ~67ns per packet @3GHz
● Solution: spread across multiple cores, which requires locking
○ Locks are expensive, and locks live in memory, which is 40-70ns away
● Amdahl's law starts to dominate the performance envelope
● Context switches and system calls have gotten far more expensive post Spectre/Meltdown
● What we seem to need are lockless queues that scale per core (a toy sketch follows below)
○ Poll mode drivers
● Partial rewrites are inevitable, the rewards are high
○ ScyllaDB, VPP, Open vSwitch
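As a toy illustration of the lockless-queue idea (not Mayastor's actual implementation), a minimal Rust sketch using the crossbeam_queue crate's lock-free bounded queue; producer and consumer coordinate through atomics alone, never a mutex:

```rust
use crossbeam_queue::ArrayQueue;
use std::sync::Arc;
use std::thread;

fn main() {
    // Bounded lock-free queue: push/pop synchronize with atomic operations
    // only, so cores never park on a mutex.
    let queue = Arc::new(ArrayQueue::<u64>::new(1024));

    let producer = {
        let queue = Arc::clone(&queue);
        thread::spawn(move || {
            for item in 0..10_000u64 {
                // Spin until there is room, mimicking a poll mode driver
                // that never blocks in the kernel.
                while queue.push(item).is_err() {}
            }
        })
    };

    let consumer = {
        let queue = Arc::clone(&queue);
        thread::spawn(move || {
            let mut received = 0u64;
            while received < 10_000 {
                if queue.pop().is_some() {
                    received += 1;
                }
            }
            received
        })
    };

    producer.join().unwrap();
    println!("consumed {} items without taking a lock", consumer.join().unwrap());
}
```

In a real per-core design each core owns its own queue pair, so even the atomic contention seen here disappears.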
16. Ring based communication channels
● One queue is used for submitting new requests
● A separate queue is used to store requests that have been completed
● Sometimes a third queue is used to submit admin commands
● io_uring: a new interface added to the kernel to “catch up” with devices that are no longer slow; poll mode FTW (see the sketch below)
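A minimal sketch of that ring pairing using the Rust io-uring crate (Linux only; this mirrors the crate's basic read example, with error handling trimmed): the request goes into the submission queue, the kernel fills the completion queue.

```rust
use io_uring::{opcode, types, IoUring};
use std::os::unix::io::AsRawFd;
use std::{fs, io};

fn main() -> io::Result<()> {
    // One pair of rings: a submission queue (SQ) and a completion queue (CQ).
    let mut ring = IoUring::new(8)?;

    let file = fs::File::open("/etc/hostname")?;
    let mut buf = vec![0u8; 1024];

    // Build a read request and push it onto the submission queue.
    let read_e = opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as _)
        .build()
        .user_data(0x42);
    unsafe {
        ring.submission().push(&read_e).expect("submission queue is full");
    }

    // A single syscall submits and waits; completions are then drained from
    // the CQ without further syscalls.
    ring.submit_and_wait(1)?;
    let cqe = ring.completion().next().expect("completion queue is empty");
    println!("read {} bytes (user_data {:#x})", cqe.result(), cqe.user_data());
    Ok(())
}
```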
17. Hardware bugs and their impact
● A high number of system calls has a huge impact on performance
● Two solutions to mitigate this:
○ Making use of huge pages (sketched below)
○ Trying to do as much as possible in user space
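To show what “making use of huge pages” looks like in practice, a minimal Rust sketch (via the libc crate) that maps a single 2 MiB huge page; it assumes huge pages have already been reserved on the host, e.g. through /proc/sys/vm/nr_hugepages:

```rust
use libc::{mmap, munmap, MAP_ANONYMOUS, MAP_FAILED, MAP_HUGETLB, MAP_PRIVATE, PROT_READ, PROT_WRITE};
use std::ptr;

fn main() {
    // One 2 MiB huge page: a single TLB entry covers what would otherwise
    // take 512 entries for 4 KiB pages, cutting TLB misses on the hot IO path.
    let len = 2 * 1024 * 1024;
    let addr = unsafe {
        mmap(
            ptr::null_mut(),
            len,
            PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
            -1,
            0,
        )
    };
    if addr == MAP_FAILED {
        // Typically means no huge pages are reserved on this host.
        eprintln!("mmap(MAP_HUGETLB) failed: {}", std::io::Error::last_os_error());
        return;
    }
    unsafe {
        // Touch the page, then release it.
        ptr::write_bytes(addr as *mut u8, 0, len);
        munmap(addr, len);
    }
    println!("mapped, touched, and released one 2 MiB huge page");
}
```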
18. NVMe
● NVMe is a protocol that dictates how bits are moved between the CPU and a device, but also between devices
○ Its origin can be found in InfiniBand, used in HPC for many years (since 1999)
● NVMe over Fabrics extends the protocol over TCP, RDMA, FC, virtio
● A complete replacement of the SCSI protocol, which goes back all the way to 1978
[Diagram: traditional IO path (App -> block layer -> SCSI -> SAS -> device) vs. NVMe with kernel bypass (App -> NVMe -> device)]
19. Mayastor 0.1.0 - ALPHA
● 100% user space implementation
○ Crucially important to avoid cloud dependencies; ubuntu-GKE != ubuntu-AWS
○ Leverages poll mode drivers and auto detects io_uring support
● The Nexus supports several storage protocols and can be used with existing iSCSI and NVMe-oF targets and local storage
○ Can do n-way mirrors, e.g. iscsi://<host>/iqn + nvmf://<host>/nqn + file:///dev/sdb (see the sketch below)
● Mayastor persistence layer that allows you to export local storage over NVMe-oF
○ Thin provisioning, snapshots and clones
● New control plane (MOAC) that schedules workloads based on real-time data
○ Currently still bound to nodes running Mayastor itself
● Also API driven, i.e. write to NVMe directly, bypassing the kernel
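To make the n-way mirror idea concrete: each child of the Nexus is just a URI, and the scheme selects the transport. A hypothetical Rust sketch of that dispatch (the real Nexus lives in the Mayastor source; this is illustrative only, including the example IQN/NQN strings):

```rust
/// The transports a Nexus child might use, keyed off its URI scheme.
#[derive(Debug)]
enum Child {
    Iscsi(String), // iscsi://<host>/<iqn>
    Nvmf(String),  // nvmf://<host>/<nqn>
    File(String),  // file:///dev/sdb or a plain file
}

fn parse_child(uri: &str) -> Option<Child> {
    let (scheme, rest) = uri.split_once("://")?;
    match scheme {
        "iscsi" => Some(Child::Iscsi(rest.to_string())),
        "nvmf" => Some(Child::Nvmf(rest.to_string())),
        "file" => Some(Child::File(rest.to_string())),
        _ => None, // unknown transport
    }
}

fn main() {
    // A 3-way mirror mixing transports, as on the slide (illustrative names).
    let children = [
        "iscsi://host1/iqn.2020-05.example:demo",
        "nvmf://host2/nqn.2020-05.example:demo",
        "file:///dev/sdb",
    ];
    for uri in children {
        println!("{:?}", parse_child(uri).expect("unsupported scheme"));
    }
}
```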
23. Roadmap
● Decouple CSI and Mayastor
○ Connect everywhere, but run Mayastor only on selected nodes
● Rebuilding is work in progress; we hope to have that available with the next release snapshot
● Async event notification to further assist scheduling decisions (e.g. auto replace)
● Community input (and contributions) are welcome and needed
● CSI currently only exports the Nexus through iSCSI; good progress has been made toward making it NVMe-oF back to back
○ Requires a recent 5.x kernel, hence we were in no rush
● Use NVMe as read/write cache to leverage (slower) object stores as targets?
● Enhance pool capabilities
24. Mayastor - world’s best engine (alpha)
What is it?
● Low latency, high throughput data engine based on NVMe-oF technology
○ micro-VM ready (secure containers)
● Lockless, shared nothing design that scales per CPU core, written in Rust for additional safety guarantees
● In-flight data integrity leveraging DIF/DIX, crucial for multi-cloud and data mobility
● 100% in user space, with the help of the DPDK framework
● Cloud independent encryption
● Open source - part of the OpenEBS project (CNCF Sandbox -> Incubation)
● Brought to you by MayaData
So what?
● The first engine of its kind; radically decentralized (via Kubernetes) and near theoretical max performance
● Built for today’s hardware realities... and for today’s small teams... and for today’s workloads
● More latency sensitive workloads will move onto Kubernetes and onto OpenEBS
○ Today those workloads can use OpenEBS LocalPV - which is straight to disk
○ By using LocalPV they gain performance but lose mobility and resilience - with Mayastor they get it all
25. Next steps - Try it out!
- Today - via Quay, Docker Hub or GitHub
- OpenEBS Director will support it as well
- One click deploy, simple upgrades, backups & more
- Register for free: Mayadata.io
- Active communities on K8S and OpenEBS slack
26. Q&A
• And more than a little about Mayastor
• Built on Kubernetes for Kubernetes
May 2020