ConHub: A Metadata Management System for Docker Containers
Chris Xing Tian
National University of Singapore
tianxing@comp.nus.edu.sg
Aditya Pan
Amity University
aditya.pan@student.amity.edu
Y.C. Tay
National University of Singapore
dcstayyc@nus.edu.sg
ABSTRACT
For many years now, enterprises and cloud providers have been
using virtualization to run their workloads. Until recently, this
meant running an application in a virtual machine (hardware
virtualization). However, virtual machines are increasingly
replaced by containers (operating system virtualization), as
evidenced by the rapid rise of Docker. A containerized software
environment can generate a large amount of metadata. If properly
managed, these metadata can greatly facilitate the management of
containers themselves.
This demonstration introduces ConHub, a PostgreSQL-based
container metadata management system. Visitors will see that
(1) ConHub has a language CQL that supports Docker commands;
(2) it has a user-friendly interface for querying and visualizing
container relationships; and (3) they can use CQL to formulate
sophisticated queries to facilitate container management.
A video of the demonstration can be found at
https://youtu.be/aWPgbeo_79g
Keywords
OS Virtualization; Container Metadata; Relational Database
1. INTRODUCTION
Datacenters and cloud providers largely rely on virtualization to
run enterprise workloads and user applications. Until recently,
this meant hardware virtualization, where execution happens in
virtual machines (VMs).
Previously, a user process would run on a bare-metal machine that
consists of a physical machine and an operating system (OS).
Now, the user process and the guest OS that it calls are contained
in a VM (another process) that runs on a hypervisor that takes on
the role of a host OS. The hardware is thus virtualized.
The guest OS becomes largely redundant if it shares the same
kernel as the host OS. In this case, the VM can be replaced by a
container that consists of the application, and the files, libraries
and binaries that it needs. Two containers on the same bare-metal
machine may have different OS versions, distributions (libraries, tools,
window system, etc.) or namespaces (file system, process
identifiers, etc.). The OS is thus virtualized.
Interest in container-based virtualization is rising rapidly, as
evidenced by the viral adoption of the Docker engine¹ for
containers, replacing the hypervisor for VMs. In fact, even before
the rise of Docker, Google [1] and Facebook² have both been
using containerized infrastructures for some years. There are two
main reasons for the surging interest in containerization:
C1: A container image is generated by scripts that specify
dependencies, thus facilitating code development, debugging and
deployment; the container is a running instance of the image (like
a process is a running instance of its code).
C2: The removal of the guest OS makes these images much
smaller, so a physical machine can host many more containers
than VMs, and spawning containers (in response to a flash crowd,
say) is also much faster than booting VMs.
A containerized system has a lot of metadata that can facilitate
code development and debugging (C1): which OS version does
image X use? who is the developer for X? when was X last
modified? which containers are running X? Etc.
There are also a lot of metadata that can facilitate container
deployment (C2): how much free memory is there on a particular
node? which hosts are running replicas of a container? which
containers can be collocated without performance interference?
which containers are using a particular port? Etc.
While there are already systems for managing containers, there is
none so far for managing metadata for images and containers.
This demonstration introduces such a system, namely ConHub.
1.1 Related Work
Examples of OS virtualization include HP-UX³, BSD jails [2],
Solaris Zones [3] and Linux containers (LXC⁴), which were
recently extended by Docker. Queries on the Docker metadata for
images and containers can only be expressed as keyword search.
Docker Datacenter⁵ and Kubernetes⁶ are systems for managing
Docker containers in a cluster environment; other container
management systems include Huawei CCE⁷ and Netease Hive⁸.
¹ http://www.docker.com/
² http://www.slideshare.net/Docker/aravindnarayanan-facebook140613153626phpapp02-37588997
³ http://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1985-10.pdf
⁴ https://linuxcontainers.org
⁵ https://www.docker.com/products/docker-datacenter
⁶ http://kubernetes.io/
⁷ http://console.hwclouds.com/cce
⁸ http://c.163.com
Permission to make digital or hard copies of part or all of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights
for third-party components of this work must be honored. For all other
uses, contact the Owner/Author.
Copyright is held by the owner/author(s).
CIKM’16, October 24–28, 2016, Indianapolis, IN, USA
ACM 978-1-4503-4073-1/16/10.
DOI: http://dx.doi.org/10.1145/2983323.2983331
These systems dynamically generate a lot of metadata for
container deployment (failures, replication, resource allocation
etc.). The dynamic setting in a datacenter calls for a powerful
metadata management system to support container management.
We designed ConHub to fill this need. ConHub is built on
PostgreSQL⁹, so it brings to bear well-developed, industrial-strength
relational database technology on the management and
querying of metadata for images and containers.
2. CONHUB ARCHITECTURE
ConHub has 3 key components: (1) ConSQL, a metadata
management system. (2) A query processor that supports CQL, a
language for querying and generating metadata for images and
containers; it includes a set of APIs for developers to implement
third-party applications that suit their purpose. (3) An ecosystem
of applications, built on the APIs, for queries and visualization.
These components are illustrated in Figure 1, and described below:
Figure 1. ConHub Architecture
Figure 2. ConSchema has 9 tables for entities and 5 tables for
relationships (arrows point from primary key to foreign key).
⁹ http://www.postgresql.org/
(1) ConSQL
ConSQL is a database system implemented with PostgreSQL.
We have designed a relational schema called ConSchema,
shown in Figure 2, to model the metadata underlying a
container system. ConSchema has 9 tables for entities
(Images, Containers, Users, Dockerfiles, etc.) and 5
relationship tables (ConToImage, Labels, etc.).
Many of these metadata are extracted from the JSON files
generated by Docker. Users can also generate metadata, via
Docker labels or CQL tags. If integrated with a container
management system like Docker Datacenter or Kubernetes,
ConSQL can also manage metadata from that system.
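To make the idea concrete, the sketch below builds a tiny ConSchema-like subset in SQLite (via Python's sqlite3) and answers one of the C1-style metadata questions from the Introduction in plain SQL. The table names, columns and data are illustrative assumptions for this sketch, not the actual ConHub schema.

```python
import sqlite3

# Minimal sketch of a ConSchema-like subset (names and data are
# illustrative assumptions, not the actual ConHub schema).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Images     (imageId TEXT PRIMARY KEY, maintainer TEXT, os TEXT);
CREATE TABLE Containers (conId   TEXT PRIMARY KEY, imageId TEXT REFERENCES Images);
""")
db.executemany("INSERT INTO Images VALUES (?, ?, ?)",
               [("img1", "alice", "ubuntu:14.04"), ("img2", "bob", "alpine:3.4")])
db.executemany("INSERT INTO Containers VALUES (?, ?)",
               [("c1", "img1"), ("c2", "img1"), ("c3", "img2")])

# "Which containers are running image img1?" -- a C1-style question
# from the Introduction, expressed as an ordinary SQL query:
running = [r[0] for r in db.execute(
    "SELECT conId FROM Containers WHERE imageId = 'img1' ORDER BY conId")]
print(running)  # ['c1', 'c2']
```

In ConHub itself these rows would be populated from the JSON files generated by Docker rather than inserted by hand.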
(2) CQL
CQL is an extension of SQL, the standard language for
managing relational data. CQL thus inherits the power of
SQL in the declarative formulation of semantically rich
queries, like those illustrated in the Introduction for
provenance (C1) and management (C2). This is a
tremendous improvement over the simple keyword search
provided by current container systems.
The CQL extension of SQL consists of the following APIs:
• TAG (Set objects, String label): Tags a set of objects
with the specified label.
• INTERSECTION (id1, id2): Returns the lowest
common ancestor that two images or containers share.
• CHILD (imageId): Returns, as a Set, all the child images
derived from the specified image.
• IMAGE(conId): Returns the id of the image that
generated the specified container.
• CONTAINER(imageId): Returns the containers
generated from the specified image.
• DISTANCE(id1, id2): Returns the distance between
two images in the version chain; returns –1 if they are
unrelated.
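The ancestry-based APIs above can be sketched over a toy image version chain. The parent map, image names, and the exact semantics chosen here (e.g. DISTANCE returning -1 when neither image is an ancestor of the other) are assumptions for illustration, not ConHub's implementation.

```python
# Toy version chain, modelled as a child -> parent map (illustrative).
PARENT = {"v3a": "v2", "v3b": "v2", "v2": "v1", "v1": None}

def ancestors(image_id):
    """Chain from image_id up to the root, inclusive."""
    chain = []
    while image_id is not None:
        chain.append(image_id)
        image_id = PARENT[image_id]
    return chain

def intersection(id1, id2):
    """Lowest common ancestor of two images, or None if unrelated."""
    a1, a2 = ancestors(id1), set(ancestors(id2))
    return next((x for x in a1 if x in a2), None)

def distance(id1, id2):
    """Steps between two images on one chain; -1 if neither is an ancestor of the other."""
    a1, a2 = ancestors(id1), ancestors(id2)
    if id2 in a1: return a1.index(id2)
    if id1 in a2: return a2.index(id1)
    return -1

def child(image_id):
    """All images derived, directly or transitively, from image_id."""
    return {c for c in PARENT if c != image_id and image_id in ancestors(c)}

print(intersection("v3a", "v3b"))  # v2
print(distance("v3a", "v1"))       # 2
print(sorted(child("v1")))         # ['v2', 'v3a', 'v3b']
```

In ConHub these APIs are evaluated against the provenance metadata stored in ConSchema rather than an in-memory map.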
(3) Application Ecosystem
The APIs accessible from CQL can also be used to build
applications for image and container management. So far,
we have implemented the following applications:
ConQ: A tool for formulating CQL queries by manipulating
ConSchema tables using a graphical user interface. We
designed this tool to help the Docker user who is unfamiliar
with SQL syntax and semantics.
ConViz: A tool for visualizing the provenance relationship
among images and containers. For example, a user can
specify two containers, and ConViz will display the image
paths that lead from the containers to their common ancestor
image (if any).
ConRecovery: A tool to facilitate recovery from a container
failure (using a reported incident¹⁰ as a guide). Suppose
there is a code change to an image X, and the container CX
spawned from X crashes a service. A user can use
ConRecovery to find a previous, safe image version Y, spawn
a new container CY to replace CX, and notify the developer
that made the change to X.
¹⁰ http://blog.flux7.com/blogs/docker/docker-saves-the-day-at-flux7
3. DEMONSTRATION SCENARIOS
Our demonstration of ConHub will proceed in 3 stages:
Stage I: Docker
Visitors at the conference who are familiar with Docker can
verify that ConHub is an extension of Docker: they can query
the ConHub repository (preloaded), create new images, and
label, spawn or shut down containers, like they can with a
Docker client. For visitors who are unfamiliar with Docker,
we will demonstrate how these can be done with Docker
commands.
Stage II: ConQ and ConViz
In Stage II, the visitors can use ConQ’s table manipulation
interface (see Fig.2) to formulate queries of the metadata
stored in ConSchema, including those for any new images or
containers created in Stage I. They can also use ConViz to
visualize the version tree for an image, or the provenance
among images and containers (see Fig.3).
Stage III: ConRecovery and CQL
Finally, we use ConRecovery to demonstrate the scenario
described above for recovering from a container failure (see
Fig.4).
We will also demonstrate how the power of CQL can help a
development team deal with bugs. Suppose the team
identifies two containers, conIdX and conIdY, with similar
faulty behavior; they believe the fault is inherited from some
common image Z that both are based on, and want to find all
containers and images derived from Z and label them
“hazard”. This can be done with a CQL statement:
TAG ((SELECT C.conId, I.imageId
      FROM Containers C, Images I
      WHERE I.imageId IN CHILD(INTERSECTION(conIdX, conIdY))
        AND C.imageId = I.imageId), "hazard")
The team suspects that Z has a virus that came from an
infected IDE downloaded by a developer when creating Z.
They decide to identify all developers of “hazard” images
and label the images and containers they produced as
“potential hazard”, using the CQL statement:
TAG ((SELECT C.conId, I2.imageId
      FROM Containers C, Images I1, Images I2, Labels L
      WHERE L.key = "hazard" AND I1.imageId = L.imageId
        AND I1.maintainer = I2.maintainer
        AND C.imageId = I2.imageId), "potential hazard")
Note the use of APIs TAG, CHILD and INTERSECTION.
Also, the second CQL statement joins four tables; it would be
hard to do the same with just keyword search.
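Since CQL extends SQL, the SELECT inside the second TAG statement runs on any relational engine. The sketch below emulates TAG as an INSERT into a Labels table in SQLite; the schema, data, and the imageId column of Labels are invented for this toy example, not ConHub's actual tables.

```python
import sqlite3

# Toy schema and data (illustrative assumptions, not ConHub's tables).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Images     (imageId TEXT, maintainer TEXT);
CREATE TABLE Containers (conId TEXT, imageId TEXT);
CREATE TABLE Labels     (imageId TEXT, key TEXT);
""")
db.executemany("INSERT INTO Images VALUES (?, ?)",
               [("Z", "alice"), ("W", "alice"), ("V", "bob")])
db.executemany("INSERT INTO Containers VALUES (?, ?)",
               [("c1", "Z"), ("c2", "W"), ("c3", "V")])
db.execute("INSERT INTO Labels VALUES ('Z', 'hazard')")

# Emulate TAG(..., "potential hazard"): label every image by the same
# maintainer as a "hazard" image (here only the imageId side of the
# (conId, imageId) pairs is recorded, to keep the sketch small).
db.execute("""
INSERT INTO Labels
SELECT I2.imageId, 'potential hazard'
FROM Containers C, Images I1, Images I2, Labels L
WHERE L.key = 'hazard' AND I1.imageId = L.imageId
  AND I1.maintainer = I2.maintainer
  AND C.imageId = I2.imageId
""")
tagged = sorted(r[0] for r in db.execute(
    "SELECT imageId FROM Labels WHERE key = 'potential hazard'"))
print(tagged)  # ['W', 'Z']
```

Both of alice's images (Z and the sibling W, reachable through the maintainer join) end up labelled, while bob's image V does not.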
We will encourage visitors to suggest queries so we can
demonstrate how they can be formulated with CQL.
4. FUTURE WORK
The metadata in this demonstration are generated by the user and
a Docker client. We plan to integrate ConHub with a container
management system, like Docker Datacenter or Kubernetes. That
will require expansion of ConSchema and addition of APIs, so the
application ecosystem can host more tools for facilitating
container deployment in a datacenter.
5. REFERENCES
[1] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E.
Tune, and J. Wilkes. Large-scale cluster management at
Google with Borg. In Proc. EuroSys, page 18, April 2015.
[2] P.-H. Kamp and R. N. Watson. Jails: Confining the
omnipotent root. In Proc. SANE, vol. 43, 2000.
[3] J. Beck, D. Comay, Ozgur L., D. Price, Andy T., Andrew G.,
and Blaise S. Virtualization and namespace isolation in the
Solaris operating system (psarc/2002/174), 2006.
Figure 2. ConQ: graphical interface for query formulation.
Figure 3. ConViz: visualizing provenance.
Figure 4. ConRecovery: a tool for recovering from container
failure.