Copyright©2017 NTT Corp. All Rights Reserved.
Akihiro Suda ( @_AkihiroSuda_ )
NTT Software Innovation Center
FILEgrain: Transport-Agnostic,
Fine-Grained Content-Addressable
Container Image Layout
github.com/AkihiroSuda/filegrain
Open Source Summit North America (September 11, 2017) Last	update:	September	11,	2017
2
Copyright©2017 NTT Corp. All Rights Reserved.
• Software Engineer at NTT
• Largest telco in Japan
• Docker Moby Core maintainer
• BuildKit initial maintainer (github.com/moby/buildkit)
• Next-generation `docker build`
• Executes DAG vertexes of Dockerfile-equivalent concurrently
In April, Docker [ as a project ] transited into Moby.
Now Docker [ as a product ] has been developed as one of downstream of Moby.
: ≒ :
RHEL Fedora
Who am I
3
Copyright©2017 NTT Corp. All Rights Reserved.
• New image format that allows doing `docker run` before
`docker pull` 100% finishes
• More benefits: deduplication, etc.
• OCI compatible
/bin/sh
Minimal JRE
Minimal JDK
Optional stuff
You just need to pull 20% for
running the official OpenJDK 8 image!
(The rest 80% can be pulled on demand)
672 MB (total)
137 MB
535 MB
Summary
4
Copyright©2017 NTT Corp. All Rights Reserved.
The current status of Docker / OCI format
5
Copyright©2017 NTT Corp. All Rights Reserved.
• In July 2017, Open Containers Initiative (OCI) announced the first
release of its industry-standard container image spec
• CoreOS's appc/ACI spec maintainers has also joined OCI
• The data structure is almost identical to Docker Image Manifest V2.2,
but made separate from Docker Registry HTTP API
• Agnostic to distribution protocol; Any protocol with directory-like
semantic will work. e.g. HTTP, NFS, IPFS
• Some extra coordinator is needed though (e.g. lock for the image index)
• I'm +1 for IPFS (P2P filesystem)
• Note: the spec is unrelated to Dockerfile syntax / instructions
OCI Image Spec
6
Copyright©2017 NTT Corp. All Rights Reserved.
• Composed of TAR archives of AUFS layers
• AUFS-style tar format is used as the lingua franca across different snapshot drivers
(e.g. OverlayFS, Devicemapper, BtrFS, ZFS)
Structure of Docker / OCI image format
TAR
TAR
FROM ubuntu:17.04
RUN apt-get install foobar
COPY foobar.conf /etc
mount –t overlay –o lowerdir=0,upperdir=1 ..
mount –t overlay –o lowerdir=1,upperdir=2 ..
base layer • added & modified files
• file deletion info ("whiteout")TAR
/bin/bash
/bin/ls ...
/usr/bin/foobar
/usr/lib/libfoobar.so ...
/var/lib/.wh.baz
/etc/foobar.conf
7
Copyright©2017 NTT Corp. All Rights Reserved.
• Blobs such as TARs are stored with a Merkle tree structure
• Generally, distributed via Docker Registry, with (e.g.) S3/Swift backend
Structure of Docker / OCI image format
{
"schemaVersion": 2,
"manifests": [
{
"mediaType": "application/vnd.oci.image.manifest.v1+json",
"size": 7143,
"digest":
"sha256:e692418e4cbaf90ca69d05a66403747baa33ee08806650b51fab815ad7fc331f"
}
}
/index.json
/blobs/sha256/e692418e...
... (Next slide)
JSON
JSON
8
Copyright©2017 NTT Corp. All Rights Reserved.
/blobs/sha256/e692418e...
{
"schemaVersion": 2,
"config": {
"mediaType": "application/vnd.oci.image.config.v1+json",
"size": 7023,
"digest": "sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a537bc7"
},
"layers": [
{
"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"size": 33554432,
"digest": "sha256:61be55a8e2f6b4e172338bddf184d6dbee29c98853e0a0485ecee7f27b9af0b4"
},
{
"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"size": 1073741824,
"digest": "sha256:3c3a4604a545cdc127456d94e421cd355bca5b528f4a9c1905b15da2eb4a4c6b"
},
{
"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"size": 73109,
"digest": "sha256:ec4b8955958665577945c89419d1af06b5f7636b4ac3da7f12184802ad867736"
}
]
}
/blobs/sha256/b5b2b2c5...
Environment variables and so on
/blobs/sha256/61be55a8...
Base layer
/blobs/sha256/3c3a4604...
Diff layer
/blobs/sha256/ec4b8955...
Diff layer
TAR
TAR
TAR
JSON
JSON
9
Copyright©2017 NTT Corp. All Rights Reserved.
• Merkle tree ensures reproducibility of an environment
(`docker pull foobar@sha256:e692418e..`)
Structure of Docker / OCI image format
/index.json
/blobs/sha256/e692418e...
/blobs/sha256/b5b2b2c5...
/blobs/sha256/61be55a8...
/blobs/sha256/3c3a4604...
/blobs/sha256/ec4b8955...
Index
Manifest JSON
Config JSON
(env vars and so on) AUFS layer TARs
Merkle tree
10
Copyright©2017 NTT Corp. All Rights Reserved.
• TAR = Tape ARchiver, appeared in circa. 1979 (UNIX 7th Edition)
• TAR was originally designed for magnetic tapes
à Not suitable for Docker/OCI workloads in 2017
Problems of Docker / OCI image format
PDP-11, the target architecture of UNIX in 1970s,
with TU56 DECtape drives
https://en.wikipedia.org/wiki/PDP-11
11
Copyright©2017 NTT Corp. All Rights Reserved.
Without scanning the whole "tape"...
• Files cannot be listed up
àCan't be mounted as a filesystem
• File offsets cannot be detected
à Files can't be accessed
• We could create a separate index, but it is useless
when a TAR is gzipped, as it can't be seeked
Problem 1: TAR requires scanning the whole "tape"
Metadata 0
File 0
Metadata 1
File 1
Metadata {n-1}
File {n-1}
Terminal zero bytes
...
File name, permission, ...
Content
12
Copyright©2017 NTT Corp. All Rights Reserved.
Problem 1: TAR requires scanning the whole "tape"
• If we could solve this problem, we can mount(2) an image and start a
container without pulling the whole image
• Only metadata are needed for mount(2)
• Files are lazily pulled on demand
• à Shorter start-up time & Less network traffic
New containers can start
immediately on newly added hosts
Faster testing and
deployment cycle
13
Copyright©2017 NTT Corp. All Rights Reserved.
Problem 1: TAR requires scanning the whole "tape"
Detailed usecases:
• Web apps with huge number of HTML files and graphic files
• Jupyter Notebook with various big data samples
• Academic papers will be immediately reproducible by just running
`docker run some-single-huge-image@sha256:deadbeef..`
• Full GNOME/KDE/Windows(potentially) Desktop
• Java / dotNET runtimes
• Integration testing environment...
14
Copyright©2017 NTT Corp. All Rights Reserved.
• A registry might contain very similar images
• Different versions
• Different architectures
• Different configurations
• TARs of these images are likely to contain identical files, but waste
storage without any data deduplication
Problem 2: No deduplication
FROM ubuntu:17.04
RUN apt-get install foo
FROM ubuntu:17.04
RUN apt-get install foo bar
FROM debian:9
RUN apt-get install foo
FROM ubuntu:17.04
RUN echo … > /etc/apt/source.list
RUN apt-get install foo
15
Copyright©2017 NTT Corp. All Rights Reserved.
• It might be good to pull a large TAR layer with multiple connections
• Especially when multiple servers are available
• But not all protocol allows that
• RFC7233 says HTTP/1.1 Range Requests are OPTIONAL
Problem 3: No concurrency
Range	0-1023 Range	1024-2047
16
Copyright©2017 NTT Corp. All Rights Reserved.
1. TAR requires scanning the whole "tape"
2. No deduplication
3. No concurrency
Problems of Docker / OCI image format
Image: https://en.wikipedia.org/wiki/Magnetic_tape
17
Copyright©2017 NTT Corp. All Rights Reserved.
New image format: FILEgrain
https://github.com/AkihiroSuda/filegrain
18
Copyright©2017 NTT Corp. All Rights Reserved.
• Single large TAR blob à Many small blobs
• Metadata blob with content-addressability
• Only metadata blob is needed for mounting the image
• File blobs can be lazily pulled on demand
FILEgrain overview
Metadata 0
File 0
Metadata 1
File 1
Metadata {n-1}
File {n-1}
Terminal zero bytes
...
Metadata 0 File 0
Metadata 1
File 1
Metadata {n-1} File {n-1}
...
TAR
continuity File name, permission bits,
SHA256 digest...
content-addressable
19
Copyright©2017 NTT Corp. All Rights Reserved.
• continuity: filesystem metadata manifest system used in
containerd / Moby community (github.com/containerd/continuity)
• Serializes file names, permission bits, XAttrs, and digest values as a
ProtocolBuffers message structure
• Similar format: mtree(8)
Metadata format: continuity
message Resource {
repeated string path;
int64 uid;
int64 gid;
uint32 mode;
uint64 size;
repeated string digest;
repeated XAttr xattr;
...
}
20
Copyright©2017 NTT Corp. All Rights Reserved.
• A single OCI image can contain both FILEgrain manifests and
traditional OCI manifests
Highly compatible with traditional OCI spec
/index.json
/blobs/sha256/e692...
/blobs/sha256/b5b2... /blobs/sha256/61be...
/blobs/sha256/3c3a...
/blobs/sha256/ec4b...
Index
Traditional OCI Manifest
Common config,
such as env vars
AUFS layer TARs
/blobs/sha256/a8e3...
FILEgrain Manifest
/blobs/sha256/de81...
continuity
/blobs/sha256/583f...
/blobs/sha256/3af1...
/blobs/sha256/5c2a...
/blobs/sha256/39c1...
/blobs/sha256/12ea... Bunch of files
`docker pull foo:v1-filegrain` `docker pull foo:v1`
{"v1-filegrain": "sha256:a8e3..", "v1":"sha256:e692.."}
21
Copyright©2017 NTT Corp. All Rights Reserved.
• FILEgrain layers can be put on top of traditional OCI layers
• Traditional OCI layers might be still suitable for frequently used files or large
number of small files
Highly compatible with traditional OCI spec
{
...
"layers": [
{
"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"size": 33554432,
"digest": "sha256:e692..."
},
{
"mediaType": "application/vnd.continuity.layer.v0+pb",
"size": 16724,
"digest": "sha256:a18b..."
}
]
}
TAR
continuity
Traditional OCI layer
(Needs to be completely
pulled before starting a container)
FILEgrain layer
(Can be lazy-pulled on demand)
22
Copyright©2017 NTT Corp. All Rights Reserved.
1. TAR requires scanning the whole "tape"
à Only metadata (continuity) blob is needed for mounting.
Other stuff can be lazily pulled on demand.
2. No deduplication
à Deduplication in the granularity of files
3. No concurrency
à Concurrency in the granularity of files
The problems are solved!
23
Copyright©2017 NTT Corp. All Rights Reserved.
1. Larger number of blobs
• /blobs/sha256 directory will contain huge number of files when laid out on
filesystem, and may make readdir(3) slow
• Generally, this issue could be solved by "sharding" /blobs/sha256/deadbeef.. like
/blobs/sha256/de/deadbeef.., but not in this case, for compatibility with OCI.
2. More RPC overhead
• Single traditional TAR is still suitable for small files so as to reduce number of RPCs
3. Less compression rate
• Single traditional TAR is still suitable when the layer contains similar files
But anyway, these cons can be easily mitigated by composing
FILEgrain layers on top of traditional OCI layers
Cons
24
Copyright©2017 NTT Corp. All Rights Reserved.
Alternative 1: Use some external blob store?
à No, because it depends on certain protocols. Also, it is unlikely to
work with OCI Merkle tree.
Alternative ideas?
Related work:
Harter, Tyler, et al. "Slacker: Fast Distribution with Lazy Docker Containers."
FAST. 2016.
• Loopback-mount an ext4 image located on NFS
• Support lazy-pulling and deduplication in granularity of blocks
25
Copyright©2017 NTT Corp. All Rights Reserved.
Alternative 1: Use some external blob store?
à No, because it depends on certain protocols. Also, it is unlikely to
work with OCI Merkle tree.
Alternative ideas?
Related work:
Lestaris, George. "Alternatives to layer-based image distribution: using CERN filesystem for
images." Container Camp UK. 2016.
Blomer, Jakob, et al. "A Novel Approach for Distributing and Managing Container Images:
Integrating CernVM File System and Mesos." MesosCon NA. 2016.
• CernVM FS
• CernVM FS has its own Merkle tree
• Support lazy-pulling and deduplication in granularity of files (as in FILEgrain)
26
Copyright©2017 NTT Corp. All Rights Reserved.
Alternative 1: Use some external blob store?
à No, because it depends on certain protocols. Also, it is unlikely to
work with OCI Merkle tree.
Alternative ideas?
• IPFS is also attractive protocol
• P2P, content-addressable
• FILEgrain is made agnostic to transportation protocol as in OCI image spec;
it works with HTTP / NFS / CernVM FS / IPFS / whatever
27
Copyright©2017 NTT Corp. All Rights Reserved.
Alternative 2: Seek TAR?
à No
• Compressed TAR is not seekable
• Even uncompressed TAR is sometimes not seekable, depending on transportation
protocol
• Larger number of request for fetching the whole metadata
Alternative ideas?
Metadata 0
File 0
Metadata 1
File 1
Metadata {n-1}
File {n-1}
Terminal zero bytes
...
Metadata can be pre-loaded
without reading file contents? (No)
TAR
28
Copyright©2017 NTT Corp. All Rights Reserved.
Alternative 3: Use (e.g.) ZIP instead of TAR?
à No
• still unseekable depending on transporation protocol
• poor metadata support
Alternative ideas?
ZIP
Extra metadata 0
Compressed file 0
Extra metadata 1
Compressed file 1
Extra metadata {n-1}
Compressed file {n-1}
...
Footer
Metadata 0
Metadata 1
...
Metadata {n-1}
Metadata can be pulled at once firstly? (No)
29
Copyright©2017 NTT Corp. All Rights Reserved.
Evaluation
30
Copyright©2017 NTT Corp. All Rights Reserved.
• Implemented as a read-only FUSE filesystem
• No support for write operations
• "Cattle" containers should be immutable; it is anti-pattern to do any write operation
against rootfs
• /tmp and /run should be mounted as tmpfs
• persistent data should be written to bind-mounted Ext4/XFS
• Actually, even Docker built-in storage drivers (overlayfs, AUFS) don't fully support
write operations (github.com/AkihiroSuda/issues-docker)
• e.g. Yum is known not to work on overlayfs, without installing yum-plugin-ovl
(neither a bug of overlayfs nor yum!)
Implementation
31
Copyright©2017 NTT Corp. All Rights Reserved.
• Docker doesn't support running a container with a rootfs that is not
managed by the Docker daemon
• So current FILEgrain POC is evaluated using runc
• TODO: reimplement as a containerd plugin
Implementation
32
Copyright©2017 NTT Corp. All Rights Reserved.
See https://github.com/AkihiroSuda/filegrain/issues/17 for detailed information
Images for evaluation
Image Description rootfs size
(after tar+gzip expansion)
openjdk:8
sha256:5da842d59f76009fa27ffde06888ebd
560c7ad17607d7cd1e52fc0757c9a45fb
Debian 9.1, OpenJDK
8u141
671.7MB
kdeneon/all
sha256:e3e7f216a5f8f1fdcff4eab8807d7af
cd291c050099ab3e8a8355b7b28a19247
Ubuntu 16.04, KDE Plasma
5.10, Firefox 54..
4.8GB
kaggle/python
sha256:335103c998aea22a5608c2eeca7dcf1
09e0828ed233b75f5098182c5b058fe98
Debian 8.5, Various
machine learning
frameworks, NLTK (natural
language toolkit) dataset..
8.3GB
33
Copyright©2017 NTT Corp. All Rights Reserved.
• openjdk:8 (blobs in total = 671.7MB + meta 5.4MB)
• mount: 5.4MB ( 2 blobs)
• only metadata are needed (manifest and continuity)
• `sh`: 7.3MB ( 8 blobs) in total
• `java –version`: 88.2MB (30 blobs) in total
• `javac HelloWorld.java`: 137.3MB (50 blobs) in total
• kdeneon/all (4.8GB + 34.5MB)
• mount: 34.5MB ( 2 blobs)
• `sh`: 36.7MB ( 8 blobs)
• `startkde`: 742.7MB (4,267 blobs)
• `firefox`: 866.6MB (4,506 blobs)
note: commands are executed sequentially
Evaluation: blobs needed to start a container (uncompressed)
1/5
less	than
1/5
The evaluation data in the abstract text (633MB Java image) is an old data
34
Copyright©2017 NTT Corp. All Rights Reserved.
• kaggle/python (8.3GB + 38.2MB)
• mount: 38.2MB ( 2 blobs)
• `sh`: 40.1MB ( 8 blobs)
• `ipython –c ‘echo(“hello”)’`: 75.4MB (1033 blobs)
• `ipython –c ‘import nltk’`: 352.0MB (2799 blobs)
Evaluation: blobs needed to start a container (uncompressed)
less	than
1/24
35
Copyright©2017 NTT Corp. All Rights Reserved.
Evaluation: compression
Image rootfs TAR at once +
gzip -9
FILEgrain +
gzip -9 against
each of blobs
openjdk:8 671.7MB 261.3MB 260.7MB
(-645,604B)
kdeneon/all 4.8GB 2.1GB 2.1GB
(-489,361B)
kaggle/python 8.3GB 3.6GB 3.6GB
(+4,345,701B)
No more than a rounding error in these cases
(If an image contained similar files, its compression rate would get worse)
36
Copyright©2017 NTT Corp. All Rights Reserved.
Evaluation: deduplication
kdeneon/all
(4.8GB)
kaggle/python
(8.3GB)
75.4MB deduplication
(KDE and Kaggle are mutually unrelated,
but have some common Debian files)
37
Copyright©2017 NTT Corp. All Rights Reserved.
Evaluation: FUSE overhead
0.1
1
10
100
1 2 3 4 5 6 7 8 9 10
Time required for archiving /usr of openjdk
(X: n-th experiment, Y:seconds)
Docker (overlay2) FILEgrain FUSE
Docker 17.06 / Fedora 26 / 2 vCPUs, 2GB RAM, 2GB swap (VMware Fusion, MacBook Pro 2016)
38
Copyright©2017 NTT Corp. All Rights Reserved.
Evaluation: FUSE overhead
0.1
1
10
100
1 2 3 4 5 6 7 8 9 10
Time required for archiving /usr of openjdk
(X: n-th experiment, Y:seconds)
Docker (overlay2) FILEgrain FUSE
For FILEgrain, the first data contains the time for pulling some blobs
needed to start a container
(But no network overhead, as localhost is the remote in this evaluation)
But the Docker data doesn't contain the time for pulling,
as a container can be started only after pulling the whole image.
39
Copyright©2017 NTT Corp. All Rights Reserved.
0.1
1
10
100
1 2 3 4 5 6 7 8 9 10
Time required for archiving /usr of openjdk
(X: n-th experiment, Y:seconds)
Docker (overlay2) FILEgrain FUSE
Evaluation: FUSE overhead
Getting faster due to in-kernel caching
For some implementation reason,
in-kernel cache doesn't seem working..
But this cache should be easily enabled,
as the filesystem is always read-only by design
40
Copyright©2017 NTT Corp. All Rights Reserved.
• Overhead of larger number of RPC for pulling blobs
• Depends on the transportation protocol
• TODO: implement Docker Registry API client and evaluate the overhead
• Current POC just uses a local directory as a mock registry
Evaluation: others
41
Copyright©2017 NTT Corp. All Rights Reserved.
Future work
42
Copyright©2017 NTT Corp. All Rights Reserved.
• Even finer granularity (CHUNKgrain?)
• For large files, it might be good compute SHA256 digests against each of chunks,
and serialize the digests in some format
• Probably, this serialization would be separate from continuity manifest itself.
continuity#85
• Use traditional OCI TAR for files that are very likely to be needed
immediately to start a container
• We can easily detect such files by starting a container and capture FUSE calls
before pushing the image
Future work
43
Copyright©2017 NTT Corp. All Rights Reserved.
• containerd is the next standard runtime in the container industry
Integration to the ecosystem
Docker
(Kubernetes)
containerd v0.2
runc
Moby Core
(successor to Docker daemon)
Kubernetes
(via CRI-containerd)
containerd v1.0
runc
44
Copyright©2017 NTT Corp. All Rights Reserved.
• containerd plugin system allows new feature to be added without
modifying the containerd upstream
Integration to the ecosystem
containerd v1.0
runtime service snapshot service differ service content service
overlayfs plugin
btrfs plugin
runc plugin generic plugin generic plugin
45
Copyright©2017 NTT Corp. All Rights Reserved.
• FILEgrain will be reimplemented as set of containerd plugins
• Can be easily integrated to the higher-level engines without
modification (Moby/Docker, Kubernetes, ..)
Integration to the ecosystem
containerd v1.0
runtime service snapshot service differ service content service
overlayfs plugin
btrfs plugin
runc plugin generic plugin generic plugin
FILEgrain plugins
46
Copyright©2017 NTT Corp. All Rights Reserved.
Recap
47
Copyright©2017 NTT Corp. All Rights Reserved.
• New image format that allows doing `docker run` before
`docker pull` 100% finishes
• More benefits: deduplication, etc.
• OCI compatible
/bin/sh
Minimal JRE
Minimal JDK
Optional stuff
You just need to pull 20% for
running the official OpenJDK 8 image!
(The rest 80% can be pulled on demand)
672 MB (total)
137 MB
535 MB
Recap
48
Copyright©2017 NTT Corp. All Rights Reserved.
Demo

FILEgrain: Transport-Agnostic, Fine-Grained Content-Addressable Container Image Layout

  • 1.
    Copyright©2017 NTT Corp.All Rights Reserved. Akihiro Suda ( @_AkihiroSuda_ ) NTT Software Innovation Center FILEgrain: Transport-Agnostic, Fine-Grained Content-Addressable Container Image Layout github.com/AkihiroSuda/filegrain Open Source Summit North America (September 11, 2017) Last update: September 11, 2017
  • 2.
    2 Copyright©2017 NTT Corp.All Rights Reserved. • Software Engineer at NTT • Largest telco in Japan • Docker Moby Core maintainer • BuildKit initial maintainer (github.com/moby/buildkit) • Next-generation `docker build` • Executes DAG vertexes of Dockerfile-equivalent concurrently In April, Docker [ as a project ] transited into Moby. Now Docker [ as a product ] has been developed as one of downstream of Moby. : ≒ : RHEL Fedora Who am I
  • 3.
    3 Copyright©2017 NTT Corp.All Rights Reserved. • New image format that allows doing `docker run` before `docker pull` 100% finishes • More benefits: deduplication, etc. • OCI compatible /bin/sh Minimal JRE Minimal JDK Optional stuff You just need to pull 20% for running the official OpenJDK 8 image! (The rest 80% can be pulled on demand) 672 MB (total) 137 MB 535 MB Summary
  • 4.
    4 Copyright©2017 NTT Corp.All Rights Reserved. The current status of Docker / OCI format
  • 5.
    5 Copyright©2017 NTT Corp.All Rights Reserved. • In July 2017, Open Containers Initiative (OCI) announced the first release of its industry-standard container image spec • CoreOS's appc/ACI spec maintainers has also joined OCI • The data structure is almost identical to Docker Image Manifest V2.2, but made separate from Docker Registry HTTP API • Agnostic to distribution protocol; Any protocol with directory-like semantic will work. e.g. HTTP, NFS, IPFS • Some extra coordinator is needed though (e.g. lock for the image index) • I'm +1 for IPFS (P2P filesystem) • Note: the spec is unrelated to Dockerfile syntax / instructions OCI Image Spec
  • 6.
    6 Copyright©2017 NTT Corp.All Rights Reserved. • Composed of TAR archives of AUFS layers • AUFS-style tar format is used as the lingua franca across different snapshot drivers (e.g. OverlayFS, Devicemapper, BtrFS, ZFS) Structure of Docker / OCI image format TAR TAR FROM ubuntu:17.04 RUN apt-get install foobar COPY foobar.conf /etc mount –t overlay –o lowerdir=0,upperdir=1 .. mount –t overlay –o lowerdir=1,upperdir=2 .. base layer • added & modified files • file deletion info ("whiteout")TAR /bin/bash /bin/ls ... /usr/bin/foobar /usr/lib/libfoobar.so ... /var/lib/.wh.baz /etc/foobar.conf
  • 7.
    7 Copyright©2017 NTT Corp.All Rights Reserved. • Blobs such as TARs are stored with a Merkle tree structure • Generally, distributed via Docker Registry, with (e.g.) S3/Swift backend Structure of Docker / OCI image format { "schemaVersion": 2, "manifests": [ { "mediaType": "application/vnd.oci.image.manifest.v1+json", "size": 7143, "digest": "sha256:e692418e4cbaf90ca69d05a66403747baa33ee08806650b51fab815ad7fc331f" } } /index.json /blobs/sha256/e692418e... ... (Next slide) JSON JSON
  • 8.
    8 Copyright©2017 NTT Corp.All Rights Reserved. /blobs/sha256/e692418e... { "schemaVersion": 2, "config": { "mediaType": "application/vnd.oci.image.config.v1+json", "size": 7023, "digest": "sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a537bc7" }, "layers": [ { "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "size": 33554432, "digest": "sha256:61be55a8e2f6b4e172338bddf184d6dbee29c98853e0a0485ecee7f27b9af0b4" }, { "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "size": 1073741824, "digest": "sha256:3c3a4604a545cdc127456d94e421cd355bca5b528f4a9c1905b15da2eb4a4c6b" }, { "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "size": 73109, "digest": "sha256:ec4b8955958665577945c89419d1af06b5f7636b4ac3da7f12184802ad867736" } ] } /blobs/sha256/b5b2b2c5... Environment variables and so on /blobs/sha256/61be55a8... Base layer /blobs/sha256/3c3a4604... Diff layer /blobs/sha256/ec4b8955... Diff layer TAR TAR TAR JSON JSON
  • 9.
    9 Copyright©2017 NTT Corp.All Rights Reserved. • Merkle tree ensures reproducibility of an environment (`docker pull foobar@sha256:e692418e..`) Structure of Docker / OCI image format /index.json /blobs/sha256/e692418e... /blobs/sha256/b5b2b2c5... /blobs/sha256/61be55a8... /blobs/sha256/3c3a4604... /blobs/sha256/ec4b8955... Index Manifest JSON Config JSON (env vars and so on) AUFS layer TARs Merkle tree
  • 10.
    10 Copyright©2017 NTT Corp.All Rights Reserved. • TAR = Tape ARchiver, appeared in circa. 1979 (UNIX 7th Edition) • TAR was originally designed for magnetic tapes à Not suitable for Docker/OCI workloads in 2017 Problems of Docker / OCI image format PDP-11, the target architecture of UNIX in 1970s, with TU56 DECtape drives https://en.wikipedia.org/wiki/PDP-11
  • 11.
    11 Copyright©2017 NTT Corp.All Rights Reserved. Without scanning the whole "tape"... • Files cannot be listed up àCan't be mounted as a filesystem • File offsets cannot be detected à Files can't be accessed • We could create a separate index, but it is useless when a TAR is gzipped, as it can't be seeked Problem 1: TAR requires scanning the whole "tape" Metadata 0 File 0 Metadata 1 File 1 Metadata {n-1} File {n-1} Terminal zero bytes ... File name, permission, ... Content
  • 12.
    12 Copyright©2017 NTT Corp.All Rights Reserved. Problem 1: TAR requires scanning the whole "tape" • If we could solve this problem, we can mount(2) an image and start a container without pulling the whole image • Only metadata are needed for mount(2) • Files are lazily pulled on demand • à Shorter start-up time & Less network traffic New containers can start immediately on newly added hosts Faster testing and deployment cycle
  • 13.
    13 Copyright©2017 NTT Corp.All Rights Reserved. Problem 1: TAR requires scanning the whole "tape" Detailed usecases: • Web apps with huge number of HTML files and graphic files • Jupyter Notebook with various big data samples • Academic papers will be immediately reproducible by just running `docker run some-single-huge-image@sha256:deadbeef..` • Full GNOME/KDE/Windows(potentially) Desktop • Java / dotNET runtimes • Integration testing environment...
  • 14.
    14 Copyright©2017 NTT Corp.All Rights Reserved. • A registry might contain very similar images • Different versions • Different architectures • Different configurations • TARs of these images are likely to contain identical files, but waste storage without any data deduplication Problem 2: No deduplication FROM ubuntu:17.04 RUN apt-get install foo FROM ubuntu:17.04 RUN apt-get install foo bar FROM debian:9 RUN apt-get install foo FROM ubuntu:17.04 RUN echo … > /etc/apt/source.list RUN apt-get install foo
  • 15.
    15 Copyright©2017 NTT Corp.All Rights Reserved. • It might be good to pull a large TAR layer with multiple connections • Especially when multiple servers are available • But not all protocol allows that • RFC7233 says HTTP/1.1 Range Requests are OPTIONAL Problem 3: No concurrency Range 0-1023 Range 1024-2047
  • 16.
    16 Copyright©2017 NTT Corp.All Rights Reserved. 1. TAR requires scanning the whole "tape" 2. No deduplication 3. No concurrency Problems of Docker / OCI image format Image: https://en.wikipedia.org/wiki/Magnetic_tape
  • 17.
    17 Copyright©2017 NTT Corp.All Rights Reserved. New image format: FILEgrain https://github.com/AkihiroSuda/filegrain
  • 18.
    18 Copyright©2017 NTT Corp.All Rights Reserved. • Single large TAR blob à Many small blobs • Metadata blob with content-addressability • Only metadata blob is needed for mounting the image • File blobs can be lazily pulled on demand FILEgrain overview Metadata 0 File 0 Metadata 1 File 1 Metadata {n-1} File {n-1} Terminal zero bytes ... Metadata 0 File 0 Metadata 1 File 1 Metadata {n-1} File {n-1} ... TAR continuity File name, permission bits, SHA256 digest... content-addressable
  • 19.
    19 Copyright©2017 NTT Corp.All Rights Reserved. • continuity: filesystem metadata manifest system used in containerd / Moby community (github.com/containerd/continuity) • Serializes file names, permission bits, XAttrs, and digest values as a ProtocolBuffers message structure • Similar format: mtree(8) Metadata format: continuity message Resource { repeated string path; int64 uid; int64 gid; uint32 mode; uint64 size; repeated string digest; repeated XAttr xattr; ... }
  • 20.
    20 Copyright©2017 NTT Corp.All Rights Reserved. • A single OCI image can contain both FILEgrain manifests and traditional OCI manifests Highly compatible with traditional OCI spec /index.json /blobs/sha256/e692... /blobs/sha256/b5b2... /blobs/sha256/61be... /blobs/sha256/3c3a... /blobs/sha256/ec4b... Index Traditional OCI Manifest Common config, such as env vars AUFS layer TARs /blobs/sha256/a8e3... FILEgrain Manifest /blobs/sha256/de81... continuity /blobs/sha256/583f... /blobs/sha256/3af1... /blobs/sha256/5c2a... /blobs/sha256/39c1... /blobs/sha256/12ea... Bunch of files `docker pull foo:v1-filegrain` `docker pull foo:v1` {"v1-filegrain": "sha256:a8e3..", "v1":"sha256:e692.."}
  • 21.
    21 Copyright©2017 NTT Corp.All Rights Reserved. • FILEgrain layers can be put on top of traditional OCI layers • Traditional OCI layers might be still suitable for frequently used files or large number of small files Highly compatible with traditional OCI spec { ... "layers": [ { "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "size": 33554432, "digest": "sha256:e692..." }, { "mediaType": "application/vnd.continuity.layer.v0+pb", "size": 16724, "digest": "sha256:a18b..." } ] } TAR continuity Traditional OCI layer (Needs to be completely pulled before starting a container) FILEgrain layer (Can be lazy-pulled on demand)
  • 22.
    22 Copyright©2017 NTT Corp.All Rights Reserved. 1. TAR requires scanning the whole "tape" à Only metadata (continuity) blob is needed for mounting. Other stuff can be lazily pulled on demand. 2. No deduplication à Deduplication in the granularity of files 3. No concurrency à Concurrency in the granularity of files The problems are solved!
  • 23.
    23 Copyright©2017 NTT Corp.All Rights Reserved. 1. Larger number of blobs • /blobs/sha256 directory will contain huge number of files when laid out on filesystem, and may make readdir(3) slow • Generally, this issue could be solved by "sharding" /blobs/sha256/deadbeef.. like /blobs/sha256/de/deadbeef.., but not in this case, for compatibility with OCI. 2. More RPC overhead • Single traditional TAR is still suitable for small files so as to reduce number of RPCs 3. Less compression rate • Single traditional TAR is still suitable when the layer contains similar files But anyway, these cons can be easily mitigated by composing FILEgrain layers on top of traditional OCI layers Cons
  • 24.
    24 Copyright©2017 NTT Corp.All Rights Reserved. Alternative 1: Use some external blob store? à No, because it depends on certain protocols. Also, it is unlikely to work with OCI Merkle tree. Alternative ideas? Related work: Harter, Tyler, et al. "Slacker: Fast Distribution with Lazy Docker Containers." FAST. 2016. • Loopback-mount an ext4 image located on NFS • Support lazy-pulling and deduplication in granularity of blocks
  • 25.
    25 Copyright©2017 NTT Corp.All Rights Reserved. Alternative 1: Use some external blob store? à No, because it depends on certain protocols. Also, it is unlikely to work with OCI Merkle tree. Alternative ideas? Related work: Lestaris, George. "Alternatives to layer-based image distribution: using CERN filesystem for images." Container Camp UK. 2016. Blomer, Jakob, et al. "A Novel Approach for Distributing and Managing Container Images: Integrating CernVM File System and Mesos." MesosCon NA. 2016. • CernVM FS • CernVM FS has its own Merkle tree • Support lazy-pulling and deduplication in granularity of files (as in FILEgrain)
  • 26.
    26 Copyright©2017 NTT Corp.All Rights Reserved. Alternative 1: Use some external blob store? à No, because it depends on certain protocols. Also, it is unlikely to work with OCI Merkle tree. Alternative ideas? • IPFS is also attractive protocol • P2P, content-addressable • FILEgrain is made agnostic to transportation protocol as in OCI image spec; it works with HTTP / NFS / CernVM FS / IPFS / whatever
  • 27.
    27 Copyright©2017 NTT Corp.All Rights Reserved. Alternative 2: Seek TAR? à No • Compressed TAR is not seekable • Even uncompressed TAR is sometimes not seekable, depending on transportation protocol • Larger number of request for fetching the whole metadata Alternative ideas? Metadata 0 File 0 Metadata 1 File 1 Metadata {n-1} File {n-1} Terminal zero bytes ... Metadata can be pre-loaded without reading file contents? (No) TAR
  • 28.
    28 Copyright©2017 NTT Corp.All Rights Reserved. Alternative 3: Use (e.g.) ZIP instead of TAR? à No • still unseekable depending on transporation protocol • poor metadata support Alternative ideas? ZIP Extra metadata 0 Compressed file 0 Extra metadata 1 Compressed file 1 Extra metadata {n-1} Compressed file {n-1} ... Footer Metadata 0 Metadata 1 ... Metadata {n-1} Metadata can be pulled at once firstly? (No)
  • 29.
    29 Copyright©2017 NTT Corp.All Rights Reserved. Evaluation
  • 30.
    30 Copyright©2017 NTT Corp.All Rights Reserved. • Implemented as a read-only FUSE filesystem • No support for write operations • "Cattle" containers should be immutable; it is anti-pattern to do any write operation against rootfs • /tmp and /run should be mounted as tmpfs • persistent data should be written to bind-mounted Ext4/XFS • Actually, even Docker built-in storage drivers (overlayfs, AUFS) don't fully support write operations (github.com/AkihiroSuda/issues-docker) • e.g. Yum is known not to work on overlayfs, without installing yum-plugin-ovl (neither a bug of overlayfs nor yum!) Implementation
  • 31.
    31 Copyright©2017 NTT Corp.All Rights Reserved. • Docker doesn't support running a container with a rootfs that is not managed by the Docker daemon • So current FILEgrain POC is evaluated using runc • TODO: reimplement as a containerd plugin Implementation
  • 32.
    32 Copyright©2017 NTT Corp.All Rights Reserved. See https://github.com/AkihiroSuda/filegrain/issues/17 for detailed information Images for evaluation Image Description rootfs size (after tar+gzip expansion) openjdk:8 sha256:5da842d59f76009fa27ffde06888ebd 560c7ad17607d7cd1e52fc0757c9a45fb Debian 9.1, OpenJDK 8u141 671.7MB kdeneon/all sha256:e3e7f216a5f8f1fdcff4eab8807d7af cd291c050099ab3e8a8355b7b28a19247 Ubuntu 16.04, KDE Plasma 5.10, Firefox 54.. 4.8GB kaggle/python sha256:335103c998aea22a5608c2eeca7dcf1 09e0828ed233b75f5098182c5b058fe98 Debian 8.5, Various machine learning frameworks, NLTK (natural language toolkit) dataset.. 8.3GB
  • 33.
    33 Copyright©2017 NTT Corp.All Rights Reserved. • openjdk:8 (blobs in total = 671.7MB + meta 5.4MB) • mount: 5.4MB ( 2 blobs) • only metadata are needed (manifest and continuity) • `sh`: 7.3MB ( 8 blobs) in total • `java –version`: 88.2MB (30 blobs) in total • `javac HelloWorld.java`: 137.3MB (50 blobs) in total • kdeneon/all (4.8GB + 34.5MB) • mount: 34.5MB ( 2 blobs) • `sh`: 36.7MB ( 8 blobs) • `startkde`: 742.7MB (4,267 blobs) • `firefox`: 866.6MB (4,506 blobs) note: commands are executed sequentially Evaluation: blobs needed to start a container (uncompressed) 1/5 less than 1/5 The evaluation data in the abstract text (633MB Java image) is an old data
  • 34.
    34 Copyright©2017 NTT Corp.All Rights Reserved. • kaggle/python (8.3GB + 38.2MB) • mount: 38.2MB ( 2 blobs) • `sh`: 40.1MB ( 8 blobs) • `ipython –c ‘echo(“hello”)’`: 75.4MB (1033 blobs) • `ipython –c ‘import nltk’`: 352.0MB (2799 blobs) Evaluation: blobs needed to start a container (uncompressed) less than 1/24
  • 35.
    35 Copyright©2017 NTT Corp.All Rights Reserved. Evaluation: compression Image rootfs TAR at once + gzip -9 FILEgrain + gzip -9 against each of blobs openjdk:8 671.7MB 261.3MB 260.7MB (-645,604B) kdeneon/all 4.8GB 2.1GB 2.1GB (-489,361B) kaggle/python 8.3GB 3.6GB 3.6GB (+4,345,701B) No more than a rounding error in these cases (If an image contained similar files, its compression rate would get worse)
  • 36.
    36 Copyright©2017 NTT Corp.All Rights Reserved. Evaluation: deduplication kdeneon/all (4.8GB) kaggle/python (8.3GB) 75.4MB deduplication (KDE and Kaggle are mutually unrelated, but have some common Debian files)
  • 37.
    37 Copyright©2017 NTT Corp.All Rights Reserved. Evaluation: FUSE overhead 0.1 1 10 100 1 2 3 4 5 6 7 8 9 10 Time required for archiving /usr of openjdk (X: n-th experiment, Y:seconds) Docker (overlay2) FILEgrain FUSE Docker 17.06 / Fedora 26 / 2 vCPUs, 2GB RAM, 2GB swap (VMware Fusion, MacBook Pro 2016)
  • 38.
    38 Copyright©2017 NTT Corp.All Rights Reserved. Evaluation: FUSE overhead 0.1 1 10 100 1 2 3 4 5 6 7 8 9 10 Time required for archiving /usr of openjdk (X: n-th experiment, Y:seconds) Docker (overlay2) FILEgrain FUSE For FILEgrain, the first data contains the time for pulling some blobs needed to start a container (But no network overhead, as localhost is the remote in this evaluation) But the Docker data doesn't contain the time for pulling, as a container can be started only after pulling the whole image.
  • 39.
    39 Copyright©2017 NTT Corp.All Rights Reserved. 0.1 1 10 100 1 2 3 4 5 6 7 8 9 10 Time required for archiving /usr of openjdk (X: n-th experiment, Y:seconds) Docker (overlay2) FILEgrain FUSE Evaluation: FUSE overhead Getting faster due to in-kernel caching For some implementation reason, in-kernel cache doesn't seem working.. But this cache should be easily enabled, as the filesystem is always read-only by design
  • 40.
    40 Copyright©2017 NTT Corp.All Rights Reserved. • Overhead of larger number of RPC for pulling blobs • Depends on the transportation protocol • TODO: implement Docker Registry API client and evaluate the overhead • Current POC just uses a local directory as a mock registry Evaluation: others
  • 41.
    41 Copyright©2017 NTT Corp.All Rights Reserved. Future work
  • 42.
    42 Copyright©2017 NTT Corp.All Rights Reserved. • Even finer granularity (CHUNKgrain?) • For large files, it might be good compute SHA256 digests against each of chunks, and serialize the digests in some format • Probably, this serialization would be separate from continuity manifest itself. continuity#85 • Use traditional OCI TAR for files that are very likely to be needed immediately to start a container • We can easily detect such files by starting a container and capture FUSE calls before pushing the image Future work
  • 43.
    43 Copyright©2017 NTT Corp.All Rights Reserved. • containerd is the next standard runtime in the container industry Integration to the ecosystem Docker (Kubernetes) containerd v0.2 runc Moby Core (successor to Docker daemon) Kubernetes (via CRI-containerd) containerd v1.0 runc
  • 44.
    44 Copyright©2017 NTT Corp.All Rights Reserved. • containerd plugin system allows new feature to be added without modifying the containerd upstream Integration to the ecosystem containerd v1.0 runtime service snapshot service differ service content service overlayfs plugin btrfs plugin runc plugin generic plugin generic plugin
  • 45.
    45 Copyright©2017 NTT Corp.All Rights Reserved. • FILEgrain will be reimplemented as set of containerd plugins • Can be easily integrated to the higher-level engines without modification (Moby/Docker, Kubernetes, ..) Integration to the ecosystem containerd v1.0 runtime service snapshot service differ service content service overlayfs plugin btrfs plugin runc plugin generic plugin generic plugin FILEgrain plugins
  • 46.
    46 Copyright©2017 NTT Corp.All Rights Reserved. Recap
  • 47.
    47 Copyright©2017 NTT Corp.All Rights Reserved. • New image format that allows doing `docker run` before `docker pull` 100% finishes • More benefits: deduplication, etc. • OCI compatible /bin/sh Minimal JRE Minimal JDK Optional stuff You just need to pull 20% for running the official OpenJDK 8 image! (The rest 80% can be pulled on demand) 672 MB (total) 137 MB 535 MB Recap
  • 48.
    48 Copyright©2017 NTT Corp.All Rights Reserved. Demo

Editor's Notes

  • #2 Monday, September 11 • 2:50pm - 3:30pm https://ossna2017.sched.com/event/BDpM/filegrain-transport-agnostic-fine-grained-content-addressable-container-image-layout-akihiro-suda-ntt