Container Attached Storage with OpenEBS
@JeffryMolanus

Date: 26/6/2019

https://openebs.io
About me
MayaData and the OpenEBS project
[Diagram: the MayaData platform — DMaaS, analytics, alerting, compliance, policies and an advisory chatbot around a declarative data plane API, running on premises, on Google Cloud or on packet.net]
Resistance Is Futile
• K8s based on the original Google Borg paper 

• Containers are the “unit” of management 

• Mostly web based applications 

• Typically the apps were stateless — if you agree there is such a thing

• In its most simplistic form k8s is a control loop

• Converge to the desired state based on declarative intent provided by the DevOps
persona

• Abstract away underlying compute cluster details and decouple apps from
infrastructure: avoid lock-in

• Have developers focus on application deployment and not worry about the
environment it runs in

• HW independent (commodity)
Borg Schematic
Persistence in Volatile Environments
• Container storage is ephemeral; data is only stored during the lifetime of
the container(s)

• This either means that temporary data has no value or it can be regenerated

• Sharing data between containers is also a challenge — need to persist

• In the case of serverless — the intermediate state between tasks is ephemeral

• The problem then: containers need persistent volumes in order to run stateful
workloads

• While doing so: abstract away the underlying storage details and decouple
the data from the underlying infra: avoid lock-in

• The “bar” has been set in terms of expectations by the cloud providers, e.g. GCE PD, EBS

• Volumes available across multiple DCs and/or regions, and replicated
Data Loss Is Almost Guaranteed
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
  - image: k8s.gcr.io/test-webserver
    name: test-container
    volumeMounts:
    - mountPath: /test-pd
      name: test-volume
  volumes:
  - name: test-volume
    hostPath:
      # directory location on host
      path: /data
Unless…
Use a “Cloud” Disk
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
  - image: k8s.gcr.io/test-webserver
    name: test-container
    volumeMounts:
    - mountPath: /test-pd
      name: test-volume
  volumes:
  - name: test-volume
    # This GCE PD must already exist!
    gcePersistentDisk:
      pdName: my-data-disk
      fsType: ext4
Evaluation and Progress
• In both cases we tie ourselves to a particular node — that defeats the agility
found natively in k8s and fails to abstract away the details
• We are cherry-picking pets from our herd
• An anti-pattern — easy to say and hard to avoid in some cases

• The second example allows us to mount (who?) the PV to different nodes
but requires volumes to be created prior to launching the workload

• Good — not great

• More abstraction through community efforts around Persistent Volumes
(PV) and Persistent Volume Claims (PVC) and CSI

• Container Storage Interface (CSI) to handle vendor-specific needs before, for
example, mounting the volume

• Avoid a wildfire of “volume plugins” or “drivers” in the k8s main repo
The PV and PVC
kind: PersistentVolume
apiVersion: v1
metadata:
  name: task-pv-volume
spec:
  storageClassName: manual
  capacity:
    storage: 3Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: "/mnt/data"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: task-pv-claim
spec:
  storageClassName: manual
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
---
kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
  - name: myfrontend
    image: nginx
    volumeMounts:
    - mountPath: "/var/www/html"
      name: mypd
  volumes:
  - name: mypd
    persistentVolumeClaim:
      claimName: task-pv-claim
Summary So Far
• Register a set of “mountable” things to the cluster (PV)

• Take ownership of a “mountable” thing in the cluster (PVC)

• Refer to the PVC in the application

• Dynamic provisioning: create PVs on the fly when a claim asks for something
that does not exist yet (see the sketch after this list)

• Remove the need to preallocate them (is that a good thing?)

• The attaching and detaching of volumes to nodes is standardised by means
of CSI, a gRPC interface that handles the details of creating,
attaching, staging, destroying, etc.

• Vendor-specific implementations are hidden from the users
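Dynamic provisioning and CSI come together in a StorageClass that names a provisioner plus a PVC that references the class. The sketch below is illustrative only — the driver name csi.example.com and the class/claim names are assumptions, not any particular vendor's manifest:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast-csi                     # illustrative class name
provisioner: csi.example.com         # hypothetical CSI driver
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dynamic-claim
spec:
  storageClassName: fast-csi         # no pre-created PV; one is provisioned on demand
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi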
The Basics — Follow the Workload
[Diagram: a pod and its PVC moving between nodes — the volume follows the workload]
Problem Solved?
• How does a developer configure the PV such that it has exactly the features
that are required for that particular workload? (see the sketch after this list)
• Number of replicas, compression, snapshots and clones (opt in/out)
• How do we abstract away differences between storage vendors when
moving to/from private or public cloud?

• Differences in replication approaches — usually not interchangeable 

• Abstract away access protocol and feature mismatch

• Provide a cloud-native-storage-like “look and feel” on premises?

• Don't throw away our existing million-dollar storage infra

• GKE On-Prem, AWS Outposts — if you are not going to the cloud it will come to
you; resistance is futile 

• Make data as agile as the applications it serves
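With CAS, that per-workload intent can live in the StorageClass itself. A minimal sketch assuming the OpenEBS cStor configuration style of the time — the annotation keys, pool name and provisioner shown here are assumptions, so check the docs of your release:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: openebs-cstor-3replica
  annotations:
    openebs.io/cas-type: cstor
    cas.openebs.io/config: |
      - name: StoragePoolClaim
        value: "cstor-disk-pool"     # hypothetical pool of local disks
      - name: ReplicaCount
        value: "3"                   # replication declared per class/workload
provisioner: openebs.io/provisioner-iscsi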
Data Gravity
• As data grows — it has the tendency to pull applications towards it (gravity)

• Everything will revolve around the sun, and it dominates the planets

• Latency, throughput, IO blender 

• If the sun goes supernova — all your apps circling it will be gone instantly

• Some solutions involve replicating the sun towards some other location in
the “space time continuum”

• It works — but it exacerbates the problem
Picard Knows the Borg Like no Other
What if…
Storage for containers was itself container native?
Cloud Native Architecture?
• Applications have changed, and somebody forgot to tell storage
• Cloud native applications are distributed systems themselves

• May use a variety of protocols to achieve consensus (Paxos, Gossip, etc)

• Is a distributed storage system still needed? 

• Designed to fail and expected to fail

• Across racks, DCs, regions and providers, physical or virtual

• Scalability batteries included

• HAProxy, Envoy, NGINX

• Datasets of individual containers are relatively small in terms of IO and size
• Prefer having a collection of small stars over a big sun?

• The rise of cloud native languages such as Ballerina, Metaparticle, etc.
HW / Storage Trends
• Hardware trends enforce a change in the way we do things
• 40GbE and 100GbE are ramping up, RDMA capable

• NVMe and NVMe-oF (a transport — works on any device)

• Increasing core counts — concurrency primitives built into languages

• Storage limitations bubble up in SW design (infra as code)

• “don’t do this because of that” — “don’t run X while I run my backup”

• Friction between teams creates “shadow IT” — the (storage) problems start when
we move from the dark side of the moon back into the sun
• “We simply use DAS — as there is nothing faster than that”

• small stars, that would work — no “enterprise features”?

• “they have to figure that out for themselves”

• Seems like storage is an agility anti-pattern?
HW Trends
The Persona Changed
• Deliver fast and frequently

• Infrastructure as code, declarative
intent, GitOps, ChatOps

• K8s as the unified cross-cloud
control plane (control loop)

• So what about storage? It has not
changed at all
The Idea
Manifests express intent
[Diagram: stateless application containers paired with stateful data containers, all running as ordinary containers on any server, any cloud]
Design Constraints
• Built on top of the substrate of Kubernetes

• That was a bet we made ~2 years ago that turned out to be right

• Not yet another distributed storage system; small is the new big
• Not to be confused with not scalable
• One on top of the other, an operational nightmare?

• Per workload: using declarative intent defined by the persona (see the sketch after this list)

• Runs in containers for containers — so it needs to run in user space
• Make volumes omnipresent — compute follows the storage?

• Where is the value? Compute or the data that feeds the compute?

• Not a clustered storage instance, but rather a cluster of storage instances
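Putting it together, each workload claims its own declaratively configured volume — one claim per application rather than carving volumes out of a shared array. The names below continue the illustrative ones from the earlier sketch:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mongo-data                            # hypothetical; one claim per workload
spec:
  storageClassName: openebs-cstor-3replica    # the illustrative class sketched earlier
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
kind: Pod
apiVersion: v1
metadata:
  name: mongo
spec:
  containers:
  - name: mongo
    image: mongo:4
    volumeMounts:
    - mountPath: /data/db
      name: data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: mongo-data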
Decompose the Data
SAN/NAS vs. DAS/CAS
Container Attached Storage
How Does That Look?
Topology Visualisation
Route Your Data Where You Need It To Be
[Diagram: a PV served by CAS, routed across TheBox 1, TheBox 2 and TheBox 3]
Composable
[Diagram: a composable PV pipeline — ingress, local and remote transformation stages T(x), and egress, e.g. compress, encrypt, mirror]
User Space and Performance
• NVMe as a transport is a game changer not just for its speed potential, but
also due to its relentless break away from the SCSI layer (1978)
• A lot of similarities with the InfiniBand technology found in HPC for many years
(1999, as the result of a merger)
Less Is More
HW Changes Enforce A Change
• With these low-latency devices, CPUs are becoming the
bottleneck

• Post Spectre/Meltdown, syscalls have become more expensive
than ever
Hugepages
PMD User Space IO
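A user-space, poll-mode IO engine (SPDK/DPDK style) wants pinned hugepage memory. A minimal sketch of how such a data-plane pod could request it from k8s — the image name is hypothetical, and hugepage requests must equal their limits:

apiVersion: v1
kind: Pod
metadata:
  name: userspace-io-engine
spec:
  containers:
  - name: engine
    image: example/user-space-io:latest   # hypothetical image
    resources:
      requests:
        memory: 1Gi
        hugepages-2Mi: 1Gi
      limits:
        memory: 1Gi
        hugepages-2Mi: 1Gi                # hugepages: request must equal limit
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages                   # backed by pre-allocated hugepages on the node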
Testing It DevOps Style
K8S as a Control Loop
[Diagram: the primary k8s control loop — YAML intent submitted to the API servers; the scheduler and kubelets reconcile observed state (-) with desired state (+)]
Extending the K8S Control Loop
[Diagram: the primary k8s loop extended with a secondary control loop (MOAC) that adapts storage-specific intent from the same YAML]
Raising the Bar — Automated Error Correction
[Diagram: FIO pods replay the block-IO patterns of various apps against CAS volumes while kubectl scales them up and down; logs and telemetry feed a regression DB and AI/ML to learn which failures impact which apps, and how — a sketch of such an FIO Job follows]
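A minimal sketch of one such FIO run as a k8s Job — the image and claim names are assumptions, and a real test would replay captured IO patterns rather than this synthetic random read/write profile:

apiVersion: batch/v1
kind: Job
metadata:
  name: fio-replay
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: fio
        image: example/fio:latest        # hypothetical image that ships fio
        args:
        - --name=randrw
        - --filename=/data/fio.bin
        - --rw=randrw                    # mixed random read/write
        - --bs=4k
        - --size=1g
        - --runtime=60
        - --time_based
        - --direct=1
        volumeMounts:
        - mountPath: /data
          name: test-vol
      volumes:
      - name: test-vol
        persistentVolumeClaim:
          claimName: cas-test-claim      # hypothetical CAS-backed claim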
Declarative Data Plane API
Storage just fades away as a concern
Questions?!
