Measuring Resources & Workload Skew
In Micro-Service MPP Analytic Query Engine
Nikunj Parekh
Aster Engine
Teradata, Inc., Santa Clara, CA, USA
Nikunj.Parekh@Teradata.com
Alan Beck
Aster Engine
Teradata, Inc., Santa Clara, CA, USA
Alan.Beck@Teradata.com
Abstract
Big Data analytics is common in many business domains, for
example in the financial sector for savings portfolio analysis, in
government agencies, in scientific research and among insurance
providers, to name a few. The uses of Big Data range from
generating simple reports to executing complex analytical
workloads. The increase in the amount of data being stored
and processed in these domains exposes many challenges with
respect to scalable processing of analytical queries.
Massively Parallel Processing (MPP) databases address these
challenges by distributing storage and query processing
across multiple compute nodes and distributed processes in
parallel, usually in a shared-nothing architecture. Today, a new
technology is shaping the way platforms for the Internet of
Services are designed and managed: containers, such as
Docker[1,2] and LXC[3,4,5].
The use of containers as a base technology for large-scale
distributed systems opens many challenges in the area of
run-time resource management, for example auto-scaling,
optimal deployment and monitoring.
Monitoring orchestrated distributed systems is at the heart of
many cloud resource management solutions. Measuring
performance, and analyzing that data to take the guesswork out of
resource utilization, is key to engineering budget
management.
Measuring Workload Skew and resource utilization of MPP
databases and analytic engines is the focus of our work. This
paper explores the tools available to measure the
performance of MPP Docker and Kubernetes environments
from the perspective of a Database Administrator using such
systems in a virtualized environment. Our approach provides
a detailed characterization of CPU, memory and network IO
while complex, long-duration analytic SQL queries load
compute resources in containerized systems.
Keywords: Docker, LXC, Massively Parallel Processing
(MPP), MapReduce, Kubernetes, Pods and Workload Skew
1. Introduction
The enabling force behind large-scale distributed systems
moving from dedicated servers to cloud computing is
containerization technology. The drivers for
containerization are the following:
• Portability – agnostic of the environment
• Quick Installation and Deployment – should be able to
install and redeploy quickly on a given environment and
also deploy host-agnostically
• Elasticity – In the cloud environment, compute resources
are available on demand. We are able to start clusters on
the fly and expand/shrink them based on demand in
many applications
• Accurate Resource Management – a pay-as-you-go
model, wherein software products are no longer licensed
but metered, requires detailed resource monitoring and
management
• Multi-tenancy – should be able to start multiple instances
of the same distributed system on the same physical
hardware; containerization is one of the key
mechanisms for hosting multi-tenant apps in the cloud
• Isolation – without introducing unworkable interface
conflicts, such as for network, namespaces and other
sharable resources
• Quicker releases – CI / CD friendly product releases
• Quicker development and testing – because
containerization facilitates micro-services architecture, it
promotes parallelizable product development and
quicker feature testing
The interest of cloud providers and Internet Service Providers
(ISPs)[6] in Docker as the preferred product offering model has
gradually grown. A container is a software program
that performs OS-level virtualization by using the resource
isolation features of the Linux kernel[7]. This allows
applications or application components (the so-called micro-
services) to be installed with all the library dependencies,
binaries, and configuration needed to run as if they
were running on independent Virtual Machines, except that there is
low overhead in starting and running them.
There are many management tools for Linux containers:
LXC, systemd-nspawn, lmctfy, Warden, Docker and rkt for
CoreOS[8,9]. The latter is a minimal operating system that
supports popular container systems out of the box. The
operating system is designed to be operated in clusters and
can run directly on bare metal or on virtual machines.
CoreOS supports hybrid architectures (for example, virtual
machines plus bare metal). This approach enables the
Container-as-a-Service solutions that are becoming widely
available.
Momentum keeps building around Kubernetes. Our recent
survey on server-less dynamic cloud computing and DevOps
revealed that by mid-2018, more than 60% of respondents
will be evaluating, prototyping, or deploying to production
solutions involving container orchestration[10,11]. We see
Kubernetes emerging as the de-facto standard for container
orchestration as well as for large-scale distributed system
deployments.
With a rapidly increasing number of large-scale systems
deployed in Kubernetes environments, we wanted to explore
strategies for instrumenting resource utilization
measurements for applications running inside these
orchestrated clusters. This paper lays out two different
resource measurement solutions with a focus on monitoring
resource usage in MPP analytic systems, such as Teradata’s
Aster Engine.
The first part of the paper discusses resource monitoring at a
broad orchestrated application level, and enumerates
solutions to collect application monitoring data for exploring
performance issues, troubleshooting application errors and
analyzing performance trends to root cause performance
bottlenecks in the application. The question that always needs
to be answered is: Is the problem the input or the underlying
infrastructure?
While analyzing the broader cluster-level resource utilization
profile, it is often necessary in an MPP analytic engine to
understand utilization at a more granular level. For example, to
answer the above question fully, or to estimate future compute
resource needs, it is very valuable to profile resource utilization
specific to each query. It is therefore necessary to know in detail
the resource usage of long-running SQL queries, or even of
specific phases thereof. The second part of the paper details the
Query Resource Monitoring solution implemented in Aster
Engine at Teradata.
The rest of this paper is organized as follows. Section 1
continues with further introduction to Massively Parallel
Processing and container orchestration. Section 2 briefly
introduces the Aster Engine in a Kubernetes environment.
Section 3 details our Cluster Resource Monitoring
approaches and the solution we picked, and Section 4 details our
Query Resource Monitoring approach. It also highlights
commonly applicable sub-problems, such as retrieval, storage
solutions and the policies for eviction of the time-series
resource utilization data.
1.1 Massively Parallel Processing
Modern scale-out database engines are usually based on one
of two design principles: sharded and massively parallel
processing (MPP) databases. Both are shared-nothing
architectures, where each node manages its own storage and
memory, and they are typically based on horizontal
partitioning[12,13]. Sharded systems optimize for executing
queries on small subsets of the shards, and communication
between shards is relatively limited[14].
MPP databases optimize for parallel execution of each
query[15]. The nodes are usually collocated within the same
data center, and each query can access data across all the
nodes. A query optimizer generates an execution plan that
includes explicit data movement directives, and the cost of
moving data is taken into account during optimization. A
query executing in an MPP database can include several
pipelined execution stages, with explicit communication
between nodes at each stage. For example, a multi-stage
aggregation can be used to compute an aggregate over the
entire dataset using all the nodes. MPP databases can use a
unique mechanism to solve sub-problems using independent
queries for each of them. This mechanism of issuing "child
queries" reuses the same JDBC connection from a client to
connect to the MPP system. While measuring resource
utilization, the resource usage of these child sessions needs to
be accounted toward the main query that invokes the
child queries to solve sub-problems.
Many robust, highly scalable, high performance distributed
systems and products, such as Teradata’s Aster Engine
analytic system are built on top of Docker and Kubernetes
technology, and these large-scale systems require monitoring
solutions similar to the ones outlined below.
2. Teradata Aster Engine On Kubernetes
Kubernetes is an open-source system for automating
deployment, scaling and management of containerized
applications that was originally designed by Google and is now
maintained by the Cloud Native Computing Foundation
(CNCF). It aims to provide a "platform for automating
deployment, scaling, and operations of application containers
across clusters of hosts". The idea with Kubernetes is to
manage containers as a group called a pod[40]. Pods can be
collocated and can share network, volumes and namespace.
Kubernetes thus provides a good abstraction for distributed
applications that are not completely decoupled, such as Aster
Engine. The networking and sharing of resources between
containers is taken care of, and the end user does not need to
set it up. Kubernetes also manages failure of pods and restarts
a container if it fails.
Teradata’s new Analytic Platform uses Aster Engine for
large-scale SQL and User Defined Analytic Functions (UDF)
query processing on massive data. Aster Engine is an MPP
analytics database whose typical deployments include
tens or hundreds of nodes. Storage and processing of large
amounts of data are handled by distributing the load across
several servers or hosts to create an array of individual
databases, working together to present a single database
image. The Master node is the entry point, where clients
connect and submit SQL statements. The Master coordinates
work with other database instances, called workers. Aster
Engine Master is a pod that runs several Linux Docker
containers, each responsible for specific tasks that coordinate
the execution of an Aster query across workers. Database clients
connect to the Master through a JDBC connection to issue
queries. The Master is responsible for global optimization of
the query: parsing, serialization, query planning, query
optimization and execution of phases. Each Aster worker is a
separate pod and runs several Docker containers of its own.
Each worker container is responsible for a specific task during
query execution, under the Master's command as per the query
plan. One of the containers on each worker pod encapsulates
the Postgres database (the pgdb container). The pgdb container
is a standalone Postgres DB itself and receives the subset of
the table that the query is executing on. This subset is called a
partition. The union of the partitions across all workers is the
full table, and their intersection is empty.
During execution of a query, large amounts of data can be
distributed and redistributed in multiple ways across the
workers, due to repartitioning, complex JOINs, etc. An entire
distributed table, with its partitions stored on different workers,
can also be gathered from those worker pods, operated upon
with various analytic SQLMR UDFs run as queries by the
client, and the resulting table can optionally be stored or
returned to the client as output. SQLMR stands for
SQL-MapReduce, a UDF framework that is inherently parallel,
designed to facilitate parallel computation of procedural
functions across hundreds of servers working together as a
single relational database.
The pgdb container executes the Plain-SQL part of queries,
and a “runner” container deployed in each worker pod
executes the SQLMR UDF part of the queries. During a query,
these are the two main containers that perform the most
compute-intensive work. The participation of these and the
other containers is discussed in more detail in Section 4.
3. Cluster Resource Monitoring In MPP Engine
We present a novel approach to measure resource usage of a
cluster of Kubernetes pods. Our approach is implemented in
Teradata's Aster Engine. The monitoring mechanisms
discussed enable system administrators and analytics users
to visualize, plot, generate alerts and perform live and
historical analytics on the cluster usage statistics. The
solution is usable with many third-party visualizers and data
miners.
There is a plethora of solutions and ideas available to
analyze Kubernetes pod performance, both paid and free.
Some are open source while others are commercial.
Some are easy to deploy while others require manual
configuration. Some are general purpose while others are
aimed specifically at certain container environments. Some
are hosted in the cloud while others require installation on
one's own cluster or hosts.
3.1 Comparison of Solutions
The following are the alternatives we evaluated and found
noteworthy. Quick comparisons of over a dozen third-party
solutions generated ideas. We considered the following
criteria for the offerings: affinity to a data collection
framework / data store, built-in aggregation engines, pre-
packaged filtering, visualization integration, alerting and
logging, and, most importantly, ease of integration with Aster
Engine and Teradata's ViewPoint API. Readers should note
that the comparison table below is kept brief; the assessment
does not attempt to do full justice to the technological
complexity of these products or to the technical evaluations we
performed, let alone the engineering effort that has gone into
them. Our goal was to incorporate something with relative
ease and extensibility. The following is a list of third-party
solutions that we considered incorporating into Aster Engine
to measure cluster-level resource utilization.
1. Docker Stats API[16,17,18]: The Docker Stats API was
applicable for us, but the approach is quite basic in terms of
the utilization data it can fetch via the API at this time; we
can get only per-container utilization. The benefit of the
Stats API is that its output is JSON friendly – something we
need to simplify integration with Teradata ViewPoint. We
would need to implement the plumbing and aggregation
ourselves.
2. cAdvisor[19,20,21]: cAdvisor (Container Advisor) provides
a running daemon on each Kubernetes node that collects,
aggregates, processes and exports detailed resource usage
information for the containers running on that node.
However, it has no orchestration purview beyond a single
node, which translates to more complex aggregation
implementation and testing for us.
3. SysDig[22,23,24,25]: The API provides deep visibility into
containers, and it has robust alerting. But this was of no
immediate value, because our goal was to interface with
Teradata's ViewPoint user interface, which manages alerts
within its own framework. It is also a paid solution.
4. SupervisorD[26]: SupervisorD offers a client/server
system to monitor and control a number of processes on
UNIX-like operating systems. But it is not the most
suitable solution in a Kubernetes environment.
5. Scout[27]: Scout is a hosted app and a comprehensive
solution for monitoring Docker containers. We did not
analyze it deeply because of some of the applicability
caveats mentioned above.
6. New Relic[28]: This is a neat application performance
monitoring (APM) solution offered purely as SaaS.
Currently it is mainly usable for Docker, which would
require a newly implemented resource utilization
aggregation solution. In addition to being somewhat hard
to incorporate, its being a paid solution also factored
into our decision.
7. Librato[29,30]: Librato is a generic cloud monitoring,
alerting and analytics solution for resource-utilization-style
time-series data. It is well suited to collecting resource
utilization or other metrics from AWS or Heroku and
post-processing them, for example to find correlations and
generate alerts. It is also usable as a Heapster sink, at which
point it looked to us more like a consumer of resource
utilization data than a source of it.
8. Heapster[31,32,33]: This is a robust, go-to solution for basic
resource utilization metrics and events (read and exposed
by Eventer) on any Kubernetes cluster. Heapster is a
cluster-wide aggregator of monitoring and event data. It
supports Kubernetes natively and works on all
Kubernetes setups. We decided to use Heapster because
of its ease of availability, deployment and auto-configuration,
and the convenience of building our solution on top of its
RESTful API.
9. Other solutions[34,35]: A ubiquitous solution we noticed
is setting up the Prometheus – cAdvisor – InfluxDB –
Grafana stack. This stack is geared toward administrators,
end users, cloud-hosted systems, or data scientists. It did
not seem like a solution we could adopt for Aster Engine
and easily convert its resource utilization output for
ViewPoint[36].
3.2 cAdvisor Introduction
cAdvisor is an open source container resource usage and
performance analysis agent. It is built for containers and
supports Docker containers natively. In Kubernetes, cAdvisor
is integrated into the Kubelet binary. cAdvisor auto-discovers
all containers in the machine and collects CPU, memory,
filesystem, and network usage statistics. cAdvisor also
provides the overall machine usage by analyzing the ‘root’
container on the machine.
The Kubelet acts as a bridge between the Kubernetes master
and the nodes. It manages the pods and containers running on
a machine. Kubelet translates each pod into its constituent
containers and fetches individual container usage statistics
from cAdvisor. It then exposes the aggregated pod resource
usage statistics via a REST API. More on how Heapster
works can be found in the references.
Alternative | Benefits | Concerns
Docker Stats API | Direct JSON output | Hard to integrate
cAdvisor | Node-level option | Hard to integrate
SysDig | Detailed; alerts | Paid solution
SupervisorD | Detailed; Unix-like | Not pod aware
Scout | Comprehensive | Hosted app
New Relic | Elegant solution | Paid; not pod aware
Librato | Analytics; alerts | Hard to integrate
cAdvisor, InfluxDB, Grafana | Well-adopted, generic | Prometheus- or plotting-friendly
Heapster | Robust; pod-aware; free; pluggable API; JSON output | Needs trivial client code to transcode and aggregate
3.3 Heapster Introduction
Heapster is a cluster-wide aggregator of monitoring and
event data. Currently it natively supports only Kubernetes[37,38]
and works on all Kubernetes setups.
Heapster runs as a pod in the cluster, similar to how any
Kubernetes application would run. The Heapster pod
discovers all nodes in the cluster and queries usage
information from the nodes’ Kubelets, the on-machine
Kubernetes agent. The Kubelet itself fetches the data from
cAdvisor.
Heapster groups the information by pod along with the
relevant labels. This data is then pushed to a configurable
backend for storage and visualization.
Currently supported backends include InfluxDB (with
Grafana for visualization), Google Cloud Monitoring and
many others described in more detail in the Heapster
documentation. A Grafana setup with InfluxDB is a very
popular combination for monitoring in the open source world.
InfluxDB exposes an easy-to-use API to write and fetch time-series
data. Heapster is set up to use this storage backend by default on
most Kubernetes clusters.
3.4 Using Heapster To Measure MPP Database
Performance
3.4.1 Heapster Metric Model
The Heapster Model is a structured representation of metrics
for Kubernetes clusters, which is exposed through a set of
REST API endpoints. It allows the extraction of historical
data for any Container, Pod, Node or Namespace in the
cluster, as well as the cluster itself (depending on the metric).
The model API does not conform to the standards of a
Kubernetes API: it cannot be easily aggregated, does not
have auto-generated serialization and clients for its types, and
has a number of corner cases in its design that cause it
to fail to display metrics in certain cases.
Within Kubernetes, its use has been replaced by the resource
metrics API and the custom metrics API, found in the
k8s.io/metrics repository; metrics-server and custom metrics
adapters provide these, respectively. New applications that need
metrics are encouraged to use these APIs instead.
3.4.2 Kubernetes Resource Metrics API[39]
The goal of this Kubernetes community effort is to provide
resource usage metrics for pods and nodes through the API
server itself, as a stable, versioned API that core Kubernetes
components can rely on. The Kubernetes apiserver persists all
Kubernetes resources in its key-value store, etcd, which cannot
handle the load of frequently changing metrics. Metrics, on the
other hand, change frequently and are temporary: if they are
lost, they can simply be collected again during the next
housekeeping pass. Kubernetes therefore stores them in memory,
and rather than reusing the main apiserver it introduced a new
one, the metrics server. This is the API we used to collect
resource usage at the entire-cluster level.
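For illustration, the sketch below shows one way a client could read pod-level usage from the resource metrics API from inside the cluster; this is not our production implementation, and the namespace name and in-cluster token/CA paths are assumptions based on Kubernetes conventions.

```python
# Sketch: query the Kubernetes resource metrics API (metrics.k8s.io/v1beta1)
# from inside a pod. API server address, token and CA paths are the
# conventional in-cluster defaults; the namespace is a hypothetical example.
import requests

API_SERVER = "https://kubernetes.default.svc"
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
NAMESPACE = "aster"  # hypothetical namespace for the Aster Engine pods

def pod_metrics(namespace=NAMESPACE):
    """Return the latest CPU/memory usage reported by metrics-server, per pod."""
    with open(TOKEN_PATH) as f:
        token = f.read().strip()
    url = f"{API_SERVER}/apis/metrics.k8s.io/v1beta1/namespaces/{namespace}/pods"
    resp = requests.get(url, headers={"Authorization": f"Bearer {token}"},
                        verify=CA_PATH)
    resp.raise_for_status()
    usage = {}
    for item in resp.json().get("items", []):
        pod = item["metadata"]["name"]
        # Per-container usage arrives as quantities such as "25m" CPU or "64Mi" memory.
        usage[pod] = [c["usage"] for c in item.get("containers", [])]
    return usage
```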
3.5 Cluster Resource Measurement System
Figure 1 below is a simplified diagram that shows our CRM
subsystem implemented in Aster Engine.
	
Figure 1: Cluster Resource Monitoring in Aster Engine
At a high level, Aster is deployed in its own namespace and
interacts with Teradata ViewPoint (on the left) over HTTP or
HTTPS. A new container called "webservices" implements
the logic that queries pod-level resource utilization statistics
(CPU %, memory used, network IO, disk IO, etc.) from the
Heapster pod. Heapster interacts with all containers of Aster
and fetches utilization statistics using cAdvisor via the Kubelets
on the pods, as detailed earlier. The CRM Client module
implements a data format transcoder as well as aggregation
logic, and serves the data through an Apache webserver running
within the same container. ViewPoint is the client that
requests this data over a RESTful HTTP API and plots and
displays it in a user-friendly manner.
3.6 Measurement Results
We implemented a Python client that pulls this information
from the Heapster metrics server's RESTful API and stores it
per pod. Our cluster consisted of one Master pod and many
distributed Worker pods that together create an MPP analytic
system.
The stored data contained CPU utilization (%), memory used
in bytes, and network bytes sent and received. Heapster can
also return disk reads and disk writes per pod, per node or per
container if needed. Heapster returns time-series data for the
resource usage. Sample data for one time point is shown below;
the entire series can be retrieved for up to the last 15 minutes
and recorded, plotted and analyzed. A sample of the result data
is shown in the table below. The table shows real-time resource
usage queried from Heapster, aggregated at the pod level. The
data shows CPU, memory and network IO on the Aster
Engine's Master pod and two worker pods.
The client program we implemented queries data from Heapster
using its RESTful API endpoints at the pod level and presents it
to the ViewPoint user interface. The program also evicts the
data from Heapster based on our eviction scheme.
Name | Metric | Value | Epoch
Master | CPU | 33.5 % | 1515494040
Master | Memory | 64180615 Bytes | 1515494040
Master | Nw Sent | 40671 Bytes | 1515494040
Master | Nw Recv | 53732 Bytes | 1515494040
Worker1 | CPU | 45 % | 1515494040
Worker1 | Memory | 81601882 Bytes | 1515494040
Worker1 | Nw Sent | 81621 Bytes | 1515494040
Worker1 | Nw Recv | 397232 Bytes | 1515494040
Worker2 | CPU | 40 % | 1515494040
Worker2 | Memory | 81604771 Bytes | 1515494040
Worker2 | Nw Sent | 81621 Bytes | 1515494040
Worker2 | Nw Recv | 207232 Bytes | 1515494040
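A minimal sketch of a client along the lines described above is given below; it assumes Heapster's model API endpoint layout and metric names (cpu/usage_rate, memory/usage, network/tx, network/rx), and the Heapster service address, namespace and pod names are illustrative assumptions, not our production configuration.

```python
# Sketch of a Heapster model-API client, similar in spirit to the CRM client
# described above. Service URL, namespace and pod names are assumptions;
# metric names follow Heapster's model API naming.
import requests

HEAPSTER = "http://heapster.kube-system.svc:8082"   # assumed in-cluster address
NAMESPACE = "aster"                                  # assumed Aster namespace
METRICS = ["cpu/usage_rate", "memory/usage", "network/tx", "network/rx"]

def pod_timeseries(pod, metric, namespace=NAMESPACE):
    """Fetch the recent time series (up to ~15 minutes) for one pod and metric."""
    url = f"{HEAPSTER}/api/v1/model/namespaces/{namespace}/pods/{pod}/metrics/{metric}"
    resp = requests.get(url)
    resp.raise_for_status()
    # Each point is a dict like {"timestamp": "...", "value": <number>}.
    return resp.json().get("metrics", [])

def snapshot(pods):
    """Latest value of every metric for every pod, keyed by (pod, metric)."""
    return {(p, m): (pod_timeseries(p, m) or [{}])[-1].get("value")
            for p in pods for m in METRICS}

# Example: snapshot(["master-0", "worker-1", "worker-2"])
```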
4. Query Resource Monitoring In MPP Engine
Query Resource Monitoring (QRM) models resource utilization
on a per-analytic-query basis. When a query takes a long time,
QRM provides insight, which is useful for identifying expensive
phases of a query that tax certain resources more or skew the
work distribution.
4.1 Data Transfers Load During Query Execution
In Teradata’s Aster Engine, complex analytic computations
can result in large data movements and compute intensive
“hot spots” on specific containers, the bulk of which is
explained in Section 2. Data movement is done using Aster’s
proprietary data format known as “ICE”, which stands for
“Intra Cluster Express”. ICE moves large “tuples” of data
across worker pods and the Master pod during query
execution, partitioning, repartitioning or complex JOINs. In
addition to the pgdb and runner containers, the ice
container can also exhibit sporadic spikes in CPU, network
IO and memory utilization because of intermittent large-scale
movements of partitions.
4.2 Highlights of Query Execution In Aster Engine
Broadly, there are two types of queries –
1. Plain-SQL query execution: For this we deploy a
dedicated container that runs an instance of a tailored
version of the proven Postgres database (the pgdb
container). The pgdb container deployed on each of the
workers in parallel provides an MPP infrastructure for
running the plain-SQL parts of the user's queries in a
distributed manner.
2. SQLMR query execution: The runner container
provides an isolated environment for running SQLMR UDFs.
The runner container deployed on each of the workers in
parallel provides an MPP infrastructure for running the
SQLMR UDF parts of the user's queries in a distributed
manner.
An analytics user, such as a database administrator (DBA) or
a data scientist, simply writes a query or a UDF[43] as they
traditionally would, and Aster Engine distributes it across
the MPP system. The system can run arbitrarily complex
UDFs in this distributed environment.
Each runner runs a single JVM, which solves its part
of the sub-problem for the UDF. At the end of the
execution, the outcomes of all the runners are combined to
form the final output, which is the result of running the
SQLMR UDF in-database on the given table.
We would like to measure resource utilization within the pgdb and
the runner containers on each worker pod, and then aggregate
these values across workers to report the compute resource
utilization caused by a specific user query on the Aster Engine
cluster.
At the time of this publication, the CPU % utilization and
Memory utilization are measured and reported by the
solution. We are working to incorporate disk IO and
network IO as well; support for these additional metrics should
fall into place easily, since a robust framework has already
been built.
Note that the uniqueness of Aster Engine as a distributed
system is in the fact that we are slicing the SQLMR analytic
problems across worker containers within the MPP systems.
The pgdb and runner containers are solving the sub-problems
on disjoint partitions of the database tables, in parallel, in
order to achieve performance. As such, since the system does
not merely slice the work by containers or pods, measuring
resource utilization is non-trivial and needs to be done in a
bottom-up manner.
Note also that we cannot afford to spin up and shut down
containers on demand if Aster Engine is to remain a high-
performance MPP execution engine. The overhead of on-demand
container spin-up would easily reduce the performance of Aster
Engine on complex analytic queries due to latency.
In addition to accurately measuring plain-SQL and SQLMR
computations across the cluster, the impact of driver
functions is also measured. Driver functions use JDBC to
connect back to Aster Engine and issue additional queries. The
resource usage of these child sessions, and of all the queries
executed as part of them, is also accounted toward the main
query that invoked the driver function.
4.3 Query Resource Measurement System
Figure 2 below shows a simplified view of the QRM system.
The light blue boxes represent the existing pods that are part
of the Aster Engine MPP system. The MPP Execution
Master creates and executes the query plan and controls
execution of the query across the Aster cluster. The pgdb and
runner are the Docker containers, one or more of which may
exist on each worker pod, as shown below. The diagram also
shows a trivial sample table of the result from the QRM
system as stored in the DATABASE. The other actors in
QRM are the following –
• QM Emitter: retrieves memory and CPU usage
information by reading the Linux procfs filesystem[42]
for the specific processes pertaining to Aster Engine
inside the Docker containers (a sketch follows this list)
• QM Collector: a container in each pod
• QM Master: a lightweight container in the Master pod
that has the main role in QRM; it polls or subscribes to
the worker pods' QM Collectors for utilization data
• Utilization in pgdb: renders plain-SQL resource usage by
reusing the QM Emitter
• Utilization in runner: renders SQLMR resource usage by
reusing the QM Emitter
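The sketch below illustrates the kind of per-process reads an emitter can perform against procfs; field positions follow proc(5), and which process IDs belong to pgdb or runner is an assumption left to the caller.

```python
# Sketch of a QM Emitter-style probe reading one process from procfs.
# Field positions follow proc(5); selecting the pgdb/runner pids is assumed
# to be done elsewhere.
import os

CLK_TCK = os.sysconf("SC_CLK_TCK")  # clock ticks per second

def cpu_seconds(pid):
    """User + system CPU time consumed by a process, in seconds."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # Split after the closing ')' of the comm field, which may contain spaces.
    fields = data[data.rfind(")") + 2:].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 of proc(5)
    return (utime + stime) / CLK_TCK

def rss_bytes(pid):
    """Resident set size of a process, in bytes."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024  # value is reported in kB
    return 0
```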
4.3.1 Query Resource Measurement Steps
1. When the QRM sub-system is on, it looks for the start of a
new session. Utilization collection starts when a new
session starts.
2. The QM Emitter sends utilization data from pgdb and runner
to the QM Collector local to each worker pod for the entire
duration that the Aster Engine session is active.
3. Data collection stops when the end of the query session is
detected.
4. Data from the QM Collectors on all workers is requested by
the QM Master on the Master pod and relayed to a generic
destination DATABASE where measurements are
stored. This is done periodically (minutes or
hours) and at the end of each session.
5. Data is stored in a new table in the DATABASE, indexed
by session id.
Data can be pulled from the generic destination DATABASE
into other reporting, post-processing, historical analysis
systems and plotting tools.
The QRM design is a best-effort architecture due to the nature
of its purpose.
If parts of the sub-system do not perform, reporting may be
incomplete, i.e., yield underestimated (but not unusable)
utilization. Specifically –
Single point of failure: If the QM Master goes down, whether
due to load, container failure or a degenerate system state, the
measurements may lag or utilization may be under-estimated.
- In this case, Kubernetes may restart the QM Master
container.
	
Figure 2: Simplified Query Resource Monitoring subsystem in Aster Engine, keyed by session id
Service Protection (do-no-harm principle): Local pod-level
query utilization storage is designed in a FIFO / LRU /
ring-buffer / round-robin way.
• QRM is designed to save accumulated data in a round-
robin fashion. If data is not collected, either from the QM
Collector on the workers or from the DATABASE, after a
period of time, it may be overwritten if needed.
• This is to protect the active analytic engine and the QM
Emitter from running out of memory.
• The QM Collector can cap in-memory resource utilization
data with a per-session memory limit.
• The QM Collector's in-memory data can also have an
overall limit across sessions.
• The QM Collector can also have a Data Retiring Manager
to retire data in the DATABASE on a per-table basis (time
based or size based; long-term expiry of data).
4.4 Measuring SQL Resource Utilization
Compute resource usage by plain-SQL queries is measured
by measuring system load inside the pgdb containers running
in each of the worker pods, and aggregating the results
collected in the Master. This involves measuring resource
usage by each of the Postgres processes within the pgdb
containers on the worker pods.
4.5 Measuring SQLMR Resource Utilization
Compute resource usage by SQLMR queries is measured by
measuring system load inside the runner containers running
in each of the worker pods, and aggregating the results
collected in the Master. This involves measuring resource
usage by each of the JVMs running inside the runner
containers.
4.6 Solution Scalability
A system whose performance improves by adding hardware,
proportionally to the capacity added, is said to be a scalable
(or horizontally scalable) system. The scalability of QRM is
tied to the measurement load imposed on it, so the scalability
of QRM is somewhat inversely proportional to the scalability
of the MPP system being measured: adding workers loads the
QM Master linearly. In an MPP analytic engine with N worker
pods and m pgdb or runner instances per pod, there are
(m + 1) * N total resource utilization measurement points.
• Every measurement point is sampled at a sampling
frequency f, or sampling period T = 1/f, for every resource
type.
• Let R be the total number of resources measured,
o where resource-type ∈ {CPU %, Memory, Network
Recv, Network Sent, Disk Reads, Disk Writes}
o currently, R = 2: CPU % and Memory.
• Every sample is a time-series item, which is a pair of
doubles: <time> and <utilization value>, i.e., 16 bytes.
• So, at sampling frequency f, the QRM subsystem collects
((m + 1) * N) * R * 16 * f bytes of data per unit time.
• Periodically, this data is fetched or pushed into long-term
storage, such as the DATABASE or a third-party
reporting system.
• For convenience, we estimate scale on a "per fetch by the
third-party reporting system" basis; the fetch period of such
a reporting system could also be modeled here and would
proportionately affect scale. Measurement collection is
done for each query, so for Q queries running
simultaneously the total rate is
(m + 1) * N * R * Q * 16 * f bytes per unit time.
The following are a few sample scenarios:
1. For a 2000-pod Aster Engine MPP system with 2
pgdb and/or runner containers per pod, sampling the
resource utilizations at period P = 10 sec (f = 6 samples
per minute), the proposed solution collects data
at 1,152,000 bytes/min, or about 19.2 KB/s, per query.
• Each additional query monitored in parallel multiplies
the size of the data, modulo the fetch period of
the third-party reporting system.
• If, for this scenario, the QRM data is offloaded fully
into a third-party reporting system every 5 minutes,
it amounts to about 5.8 MB of memory if one query is
running, and about 17.4 MB for 3 queries running in
parallel, every 5 minutes.
2. For a small system with 2 pods and one pgdb per
pod, at P = 30 sec (f = 2 samples per minute), we collect
256 bytes of data per running query per minute.
QRM is most applicable for long-running queries. Note
that in the absence of eviction, the number of queries imposes
a linear penalty on the resources needed by QRM, as the
derivation shows.
The calculations above highlight the need for LRU-based
local eviction and destructive read-out heuristics. Note
that QRM is off by default.
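As a rough cross-check of the scenarios above, the data rate can be computed directly from the derivation; the small sketch below uses the same symbols, with the 16 bytes per sample corresponding to one 8-byte timestamp and one 8-byte value.

```python
# Sketch: estimated QRM raw data rate, following the derivation above.
# 16 bytes per sample = one 8-byte timestamp + one 8-byte utilization value.

def qrm_bytes_per_minute(n_pods, m_containers, resources, samples_per_minute, queries=1):
    """(m + 1) * N * R * Q * 16 * f, with f expressed in samples per minute."""
    return (m_containers + 1) * n_pods * resources * queries * 16 * samples_per_minute

# Scenario 1: 2000 pods, 2 pgdb/runner containers per pod, R = 2, 10 s sampling period.
print(qrm_bytes_per_minute(2000, 2, 2, 6))   # 1,152,000 bytes/min (~19.2 KB/s)

# Scenario 2: 2 pods, 1 container per pod, R = 2, 30 s sampling period.
print(qrm_bytes_per_minute(2, 1, 2, 2))      # 256 bytes/min
```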
4.7 Measuring Workload Skew
Skew is one of the key characteristics of an MPP execution
engine that helps determine throughput for a specific
database session. In an MPP system with high cardinality,
skew is a condition in which the compute work to execute a
long-running database query is unevenly balanced among
partitions or workers in the cluster. In any practical scenario
of execution on any system, a small amount of skew is
inevitable and harmless.
There are four main skew parameters (disk IO, network IO,
CPU and memory) for judging the throughput and performance
of an orchestrated MPP execution engine of a given cluster
size:
1. Pod IO skew: comparison of the highest IO usage
watermark on the busiest pod to the average use on the
other pods. This can include network IO, disk IO or
both.
2. Pod CPU skew: comparison of the highest CPU usage
watermark on the busiest pod to the average use on the
other pods.
3. Pod memory skew: comparison of the highest memory
watermark for a query on the busiest pod to the average
use on the other pods.
Let
podCount be the total number of pods,
sum_m be the sum of metric m over all pods,
max_m be the value of metric m on the busiest pod.
For IO metrics, read and written bytes are added together for
disk IO; similarly, sent and received bytes are added together
for network IO, and the max and sum are computed over these
totals. The Workload Skew for a specific utilization metric m
across all Aster Engine worker pods is computed as

podSkew_m = (1 - (sum_m / podCount) / max_m) * 100

where the subscript m stands for the specific resource the skew
is computed for, such as memory, CPU %, etc.
This podSkew formula helps a DBA judge the workload
spread across workers. The expectation is that the overall
throughput of the MPP system on a long-running query is
maximal when the workload is perfectly balanced, that is,
when all pods have an identical workload and the podSkew is 0.
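The skew calculation for one metric is small enough to show directly; the sketch below follows the formula above, and the per-pod values in the examples are purely illustrative.

```python
# Sketch: podSkew for one metric across worker pods, per the formula above.

def pod_skew(per_pod_values):
    """podSkew_m = (1 - (sum_m / podCount) / max_m) * 100, in percent."""
    pod_count = len(per_pod_values)
    total = sum(per_pod_values)
    busiest = max(per_pod_values)
    return (1 - (total / pod_count) / busiest) * 100

# Perfectly balanced workload -> 0% skew; one hot pod -> skew approaches 100%.
print(pod_skew([40, 40, 40]))   # 0.0
print(pod_skew([90, 10, 10]))   # ~59.3
```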
4.8 QRM Data Aggregation
Data aggregation for resource utilization in QRM requires a
distributed-system design of its own, one that scales linearly,
or preferably better, with the size of the MPP system; bigger
systems do not reduce the challenge for QRM performance but
increase it.
The data aggregated in the QM Collector processes, in memory
on each worker pod, is primarily resource utilization
time-series data.
Aggregation heuristics depend on the resource type of the
metrics being aggregated. At a high level –
• The CPU utilization % should be stored per pod and not
added up, so that podSkew for CPU % in the system can
be computed.
• Memory utilization in bytes is kept per pod; each pod can
return (m + 1) time series, where m is the number of pgdb
and/or runner Docker instances per pod. Storing these
values separately allows computing podSkew for memory.
• This storage is local to the QM Collector in each pod.
• It is possible to measure skew for SQL and SQLMR
queries separately by storing resource utilization metrics
for pgdb and runner separately, instead of aggregating
them in the QM Collector.
• Newer metrics, such as network bytes sent, network bytes
received, disk bytes read and disk bytes written, can also
be reported in the same manner.
There are two ways to collect this data centrally in the
DATABASE attached to the Master pod –
• The QM Master can serially fetch (pull) the data from the
QM Collectors of all workers.
• The QM Collectors on all workers can asynchronously
message (push) the data to the QM Master.
The push mechanism can employ asynchronous messaging
queues, wherein the QM Master is subscribed to all QM
Collectors in the pods (across pod recreations). More detail on
these two approaches follows.
4.8.1 Serial Solution Formulation
In this method of fetching QRM resource utilization, the QM
Master pulls the collected resource utilization from the QM
Collectors on the worker pods on its own schedule.
1. The QM Emitter retrieves resource utilization for the Linux
processes in the container (pgdb or runner) from
procfs.
2. This utilization info (either one point or a time-series
chunk) is sent to the local QM Collector container on the
same pod. This data transfer reuses a preexisting RPC
mechanism, but could also use a RESTful API or a speedy
asynchronous message queue such as ZeroMQ[41]. The
data rests in the process memory of the collector process
inside the QM Collector until it is either evicted (expired
by time or by size limit) or collected by the QM Master.
3. The QM Master periodically polls the QM Collectors on all
worker pods via a similar data transfer mechanism and
fetches and erases all of the time-series data from each
pod, one at a time.
4. The QM Master saves the data into the DATABASE after it
is fetched from each pod.
This is shown in the flow diagram in Figure 3 below.
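A minimal sketch of this serial pull loop is shown below; the collector HTTP endpoint, port and the store_in_database() helper are hypothetical placeholders for illustration, since the production system reuses an existing RPC mechanism rather than plain HTTP.

```python
# Sketch of the QM Master's serial pull loop. The collector endpoint, port and
# store_in_database() are hypothetical placeholders.
import time
import requests

WORKER_PODS = ["worker-1", "worker-2"]   # assumed pod host names
POLL_PERIOD_SEC = 60

def store_in_database(session_id, pod, series):
    """Placeholder: insert time-series rows into the QRM table, keyed by session id."""
    pass

def poll_once():
    for pod in WORKER_PODS:
        # Fetch-and-erase: the collector returns buffered samples and clears them.
        resp = requests.get(f"http://{pod}:9091/qrm/series", params={"evict": "true"})
        resp.raise_for_status()
        for session_id, series in resp.json().items():
            store_in_database(session_id, pod, series)

while True:
    poll_once()
    time.sleep(POLL_PERIOD_SEC)
```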
4.8.2 Parallel Solution Formulation
Another design choice was to use a Proactor design to
retrieve QRM data from the QM Collectors on the worker pods.
Proactor is a software design pattern for event handling in
which long-running data sources subscribe to a central part of
the system, either for commands and control or to offload data.
A completion handler is called asynchronously when an
arbitrary condition is met at the data source – the
QM Collector in our case. Such designs follow the Hollywood
Principle ("Don't call us, we'll call you."). Here is the
flow, with a sketch after the list –
1. As the worker pods come up, every QM Collector
container subscribes to the central QM Master container.
2. On each worker pod, the Proactor subscriber in the QM
Collector container waits for a certain condition to be met.
This condition can be a periodic timer interrupt or a
more complex condition to be checked, such as an event
triggered by the size of the time series accumulated so far
on that pod.
3. The Proactor subscriber in the QM Collector reads the
time series.
4. The Proactor subscriber in the QM Collector dispatches an
event to the handler in the QM Master.
5. Either the QM Collector sends the time series as the
payload of the event itself, or the QM Master handshakes
subsequently to request the time-series data.
6. The QM Master can process, aggregate or filter the data
and then write it to the QRM tables in the DATABASE.
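The sketch below shows one way the push variant could look using an asynchronous message queue; ZeroMQ is mentioned above as an option, but the socket types, port, payload format and flush threshold here are assumptions for illustration, not the shipped design.

```python
# Sketch of the push (Proactor-style) variant with ZeroMQ (pyzmq).
# Port, payload format and the size threshold triggering a push are assumptions.
import time
import zmq

MY_POD = "worker-1"   # assumed pod identity

def read_next_sample():
    """Placeholder for the local QM Emitter read (see the procfs sketch above)."""
    return (time.time(), 0.0)

def store_in_database(pod, series):
    """Placeholder: write to the QRM tables in DATABASE."""
    pass

def run_collector(master_addr="tcp://qm-master:5555", flush_threshold=1000):
    """QM Collector side: push accumulated samples once the condition is met."""
    sock = zmq.Context().socket(zmq.PUSH)
    sock.connect(master_addr)                 # attach to the central QM Master
    buffered = []
    while True:
        buffered.append(read_next_sample())
        if len(buffered) >= flush_threshold:  # the 'completion' condition
            sock.send_json({"pod": MY_POD, "series": buffered})
            buffered.clear()

def run_master(bind_addr="tcp://*:5555"):
    """QM Master side: receive pushed series and persist them."""
    sock = zmq.Context().socket(zmq.PULL)
    sock.bind(bind_addr)
    while True:
        msg = sock.recv_json()
        store_in_database(msg["pod"], msg["series"])
```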
Figure 3: QRM Data Collection to QM Master: Serial Solution
The benefit of a pull solution is that data transfer rates can be
throttled by constraining the resources, such as the thread
pool, that implement the pull in the QM Master. A parallel or
push approach may flood the network or the QRM subsystem
precisely when queries are resource-consuming – which is a
bad time to further load the network.
4.9 Measurement Results
The measurement results specific to the queries in our MPP
execution engine are not available at the time of this writing.
These measurements will be taken in March of 2018, well
ahead of the conference presentation deadline, and will be
conveyed as soon as they are available and shared during the
presentation.
5. Security
5.1 Network Access
1. The CRM feature, in our case, exposes one port to retrieve
data from Heapster, and another port to send the
aggregated per-pod resource utilization data to
Teradata's ViewPoint UI.
2. The QRM feature exposes no new ports. The QM Master
needs no network access other than internal connectivity,
via the reused RPC mechanism, to the QM Collectors on
the worker pods to send and receive data for aggregation
and storage.
3. If more clients or data sinks are connected, the ports need
to be secured and not exposed to the outside world; at the
same time, all resource utilization data needs to be
encrypted to prevent malicious access, for both CRM and
QRM.
5.2 User Access Control To Data
1. Consider running any newly added QRM and CRM
containers as non-privileged Docker containers and as
non-privileged users by default.
2. Protect access to the newly added tables and views in
external databases and systems to which the resource
utilization data gets saved.
6. Summary
In this paper we have presented two comprehensive
mechanisms, and our experiences with them, for designing and
incorporating resource usage monitoring in a large-scale,
Kubernetes-orchestrated MPP analytic engine. Our work details
the methods and addresses multiple challenges pertinent to
measuring resource utilization in MPP systems, both at the
system level and with per-query precision, while one or more
distributed SQL and SQLMR analytic queries are processed.
Our benchmarking results for Cluster Resource Monitoring
show that the solution, built by implementing a Heapster client
and channeling the time-series data to other subsystems and
user interfaces, does not significantly load the system being
measured.
7. References
[1] www.docker.com/what-container
[2] Vivek Ratan (February 8, 2017). "Docker: A Favourite in the DevOps World". Open Source Forum, June 14, 2017.
[3] en.wikipedia.org/wiki/LXC
[4] www.linuxcontainers.org
[5] www.upguard.com/articles/docker-vs-lxc
[6] www.goto.docker.com/rs/929-FJL-178/images/Docker-Survey-2016.pdf
[7] O'Gara, Maureen (26 July 2013). "Ben Golub, Who Sold Gluster to Red Hat, Now Running dotCloud". SYS-CON Media, 2013-08-09.
[8] www.domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf
[9] www.arxiv.org/pdf/1709.10140.pdf
[10] www.blog.newrelic.com/2017/11/27/monitoring-application-performance-in-kubernetes
[11] www.newrelic.com/serverless-dynamic-cloud-survey
[12] www.vldb.org/pvldb/vol9/p660-trummer.pdf
[13] www.dcs.bbk.ac.uk/~ap/teaching/ADM2018/notes5.pdf
[14] static.googleusercontent.com/media/research.google.com/en//pubs/archive/41344.pdf
[15] citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478.9491&rep=rep1&type=pdf
[16] docs.docker.com/engine/reference/commandline/stats
[17] docs.docker.com/engine
[18] docs.docker.com/config/containers/runmetrics
[19] www.github.com/google/cadvisor
[20] hub.docker.com/r/google/cadvisor
[21] blog.codeship.com/monitoring-docker-containers
[22] www.sysdig.com/product/monitor
[23] www.tecmint.com/sysdig-system-monitoring-and-troubleshooting-tool-for-linux
[24] www.sysdig.com/blog/monitoring-kubernetes-with-sysdig-cloud
[25] www.sysdig.com/blog/alerting-kubernetes
[26] www.supervisord.org/introduction.html
[27] www.monitorscout.com
[28] en.wikipedia.org/wiki/New_Relic
[29] www.librato.com
[30] www.github.com/librato/librato-metrics
[31] www.github.com/kubernetes/heapster/blob/master/docs/overview.md
[32] www.github.com/kubernetes/heapster/blob/master/docs/storage-schema.md
[33] www.github.com/DataDog/the-monitor/blob/master/kubernetes/how-to-collect-and-graph-kubernetes-metrics.md
[34] www.stackoverflow.com/questions/33749911/a-combination-for-monitoring-system-for-container-grafanaheapsterinfluxdbcad
[35] blog.couchbase.com/wp-content/original-assets/december-2016/kubernetes-monitoring-with-heapster-influxdb-and-grafana/kubernetes-logging-1024x407.png
[36] info.teradata.com/HTMLPubs/DB_TTU_16_00/index.html#page/General_Reference/B035-1091-160K/muq1472241426243.html
[37] www.kubernetes.io
[38] www.kubernetes.io/docs/concepts/overview/what-is-kubernetes
[39] www.github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/resource-metrics-api.md
[40] Large Scale Cluster Management At Google With Borg: static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf
[41] www.zguide.zeromq.org/page:all
[42] ProcFs: en.wikipedia.org/wiki/Procfs
[43] dl.acm.org/citation.cfm?id=1687567
Introduction to containers, k8s, Microservices & Cloud Native
 
Embracing Containers and Microservices for Future Proof Application Moderniza...
Embracing Containers and Microservices for Future Proof Application Moderniza...Embracing Containers and Microservices for Future Proof Application Moderniza...
Embracing Containers and Microservices for Future Proof Application Moderniza...
 
As34269277
As34269277As34269277
As34269277
 
Multicloud Deployment of Computing Clusters for Loosely Coupled Multi Task C...
Multicloud Deployment of Computing Clusters for Loosely  Coupled Multi Task C...Multicloud Deployment of Computing Clusters for Loosely  Coupled Multi Task C...
Multicloud Deployment of Computing Clusters for Loosely Coupled Multi Task C...
 
Data management in cloud study of existing systems and future opportunities
Data management in cloud study of existing systems and future opportunitiesData management in cloud study of existing systems and future opportunities
Data management in cloud study of existing systems and future opportunities
 
Build cloud native solution using open source
Build cloud native solution using open source Build cloud native solution using open source
Build cloud native solution using open source
 
Kubernetes Basics - ICP Workshop Batch II
Kubernetes Basics - ICP Workshop Batch IIKubernetes Basics - ICP Workshop Batch II
Kubernetes Basics - ICP Workshop Batch II
 
Kubernetes is a ppt of explanation of kubernet topics
Kubernetes is a ppt of explanation of kubernet topicsKubernetes is a ppt of explanation of kubernet topics
Kubernetes is a ppt of explanation of kubernet topics
 
Containerized Hadoop beyond Kubernetes
Containerized Hadoop beyond KubernetesContainerized Hadoop beyond Kubernetes
Containerized Hadoop beyond Kubernetes
 
Adoption of Cloud Computing in Healthcare to Improves Patient Care Coordination
Adoption of Cloud Computing in Healthcare to Improves Patient Care CoordinationAdoption of Cloud Computing in Healthcare to Improves Patient Care Coordination
Adoption of Cloud Computing in Healthcare to Improves Patient Care Coordination
 
The NoSQL Movement
The NoSQL MovementThe NoSQL Movement
The NoSQL Movement
 
Dynamic Resource Provisioning with Authentication in Distributed Database
Dynamic Resource Provisioning with Authentication in Distributed DatabaseDynamic Resource Provisioning with Authentication in Distributed Database
Dynamic Resource Provisioning with Authentication in Distributed Database
 

Recently uploaded

Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 

Recently uploaded (20)

Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 

Measuring Resources & Workload Skew In Micro-Service MPP Analytic Query Engine

(ISP)[6] has gradually grown in Docker as the preferred product offering model. A container is a software program that performs OS-level virtualization by using the resource isolation features of the Linux kernel[7]. This allows applications or application components (the so-called micro-services) to be installed with all the library dependencies, the binaries, and all the configuration needed to run as if they were running on independent Virtual Machines, except that there is low overhead in starting and running them.

There are many management tools for Linux containers: LXC, systemd-nspawn, lmctfy, Warden, Docker and rkt for CoreOS[8,9]. The latter is a minimal operating system that supports popular container systems out of the box. The operating system is designed to be operated in clusters and can run directly on bare metal or on virtual machines. CoreOS supports hybrid architectures (for example, virtual machines plus bare metal). This approach enables the Container-as-a-Service solutions that are becoming widely available.
Momentum keeps building around Kubernetes. Our recent survey on server-less dynamic cloud computing and DevOps revealed that by mid-2018, more than 60% of respondents will be evaluating, prototyping, or deploying to production solutions involving container orchestration[10,11]. We see Kubernetes emerging as the de-facto standard for container orchestration as well as for large-scale distributed systems deployments. With a rapidly increasing number of large-scale systems deployed in Kubernetes environments, we wanted to explore strategies for instrumenting resource utilization measurements for applications running inside these orchestrated clusters.

This paper lays out two different resource measurement solutions with a focus on monitoring resource usage in MPP analytic systems, such as Teradata's Aster Engine. The first part of the paper discusses resource monitoring at a broad orchestrated-application level, and enumerates solutions to collect application monitoring data for exploring performance issues, troubleshooting application errors and analyzing performance trends to root-cause performance bottlenecks in the application. The question that always needs to be answered is: is the problem the input or the underlying infrastructure?

While analyzing the broader cluster-level resource utilization profile, it is often necessary in an MPP analytic engine to understand utilization at a more granular level. For example, to answer the above question fully, or to estimate future compute resource needs, it is very valuable to profile resource utilization specific to each query. It is therefore necessary to know in detail the resource usage of long-running SQL queries, or even of specific phases thereof. The second part of the paper details the Query Resource Monitoring solution implemented in Aster Engine at Teradata.

The rest of this paper is organized as follows. Section 1 continues with a further introduction to Massively Parallel Processing and container orchestration. Section 2 briefly introduces the Aster Engine in a Kubernetes environment. Section 3 details our Cluster Resource Monitoring approaches and the solution we picked, and Section 4 details our Query Resource Monitoring approach. It also highlights the commonly applicable sub-problems, such as retrieval, storage solutions and the policies for eviction of the time-series resource utilization data.

1.1 Massively Parallel Processing

Modern scale-out database engines are usually based on one of two design principles: sharded and massively parallel processing (MPP) databases. Both are shared-nothing architectures, where each node manages its own storage and memory, and they are typically based on horizontal partitioning[12,13]. Sharded systems optimize for executing queries on small subsets of the shards, and communication between shards is relatively limited[14]. MPP databases optimize for parallel execution of each query[15]. The nodes are usually collocated within the same data center, and each query can access data across all the nodes. A query optimizer generates an execution plan that includes explicit data movement directives, and the cost of moving data is taken into account during optimization. A query executing in an MPP database can include several pipelined execution stages, with explicit communication between nodes at each stage. For example, a multi-stage aggregation can be used to compute an aggregate over the entire dataset using all the nodes.
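As a minimal, purely illustrative sketch of the multi-stage idea (the worker names and values below are made up and are not Aster Engine internals), each node first aggregates only its own partition, and the partial results are then combined into the global aggregate:

    # Stage 1: each node aggregates its local partition, independently and in parallel.
    partitions = {
        "worker1": [3, 5, 8],
        "worker2": [1, 9],
        "worker3": [4, 4, 2],
    }
    partials = {node: (sum(rows), len(rows)) for node, rows in partitions.items()}

    # Stage 2: the partial (sum, count) pairs are moved to one place and combined,
    # giving the same AVG as a single-node scan of the whole table would.
    total = sum(s for s, _ in partials.values())
    count = sum(c for _, c in partials.values())
    print("global AVG =", total / count)   # 36 / 8 = 4.5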
MPP databases can use a unique mechanism to solve sub-problems using independent queries for each of them. This mechanism of issuing "child queries" reuses the same JDBC connection from a client to connect to the MPP system. While measuring resource utilization, the resource usage of these child sessions needs to be accounted toward the main query that invokes the child queries to solve the sub-problems.

Many robust, highly scalable, high-performance distributed systems and products, such as Teradata's Aster Engine analytic system, are built on top of Docker and Kubernetes technology, and these large-scale systems require monitoring solutions similar to the ones outlined below.

2. Teradata Aster Engine On Kubernetes

Kubernetes is an open-source system for automating deployment, scaling and management of containerized applications that was originally designed by Google and is now maintained by the Cloud Native Computing Foundation (CNCF). It aims to provide a "platform for automating deployment, scaling, and operations of application containers across clusters of hosts". The idea with Kubernetes is to manage containers as a group called a pod[40]. Pods can be collocated and can share network, volumes and namespace. Kubernetes thus provides a good abstraction for distributed applications that are not completely decoupled, such as Aster Engine. The networking and sharing of resources between containers is taken care of, and the end user does not need to set it up. Kubernetes also manages failure of pods and restarts a container if the container fails.

Teradata's new Analytic Platform uses Aster Engine for large-scale SQL and User Defined Analytic Function (UDF) query processing on massive data. Aster Engine is an MPP analytics database whose typical deployments include tens or hundreds of nodes. Storage and processing of large amounts of data are handled by distributing the load across several servers or hosts to create an array of individual databases, working together to present a single database image. The Master node is the entry point, where clients connect and submit SQL statements. The Master coordinates work with other database instances, called workers.
The Aster Engine Master is a pod that runs several Linux Docker containers, each responsible for specific tasks to coordinate the execution of an Aster query across workers. Database clients connect to the Master through a JDBC connection to issue queries. The Master is responsible for global optimization of the query: for parsing, serialization, query planning, query optimization and execution of phases.

Each Aster worker is a separate pod and runs several Docker containers of its own. Each worker container is responsible for a specific task during query execution under the Master's command, as per the query plan. One of the containers on each worker pod encapsulates the Postgres database (the pgdb container). The pgdb container is a standalone Postgres DB itself and receives a subset of the table that the query is executing on. This subset is called a partition. The union of the partitions across all workers is the full table, and their intersection is empty.

During execution of a query, large amounts of data can be distributed and redistributed in multiple ways across the workers, due to repartitioning, complex JOINs, etc. An entire distributed table, with its partitions stored on different workers, can also be gathered from those worker pods, operated upon with various analytic SQLMR UDFs run as queries by the client, and the resulting table can optionally be stored or returned to the client as output. SQLMR stands for SQL-MapReduce, a UDF framework that is inherently parallel, designed to facilitate parallel computation of procedural functions across hundreds of servers working together as a single relational database. The pgdb container executes the plain-SQL part of queries, and a "runner" container deployed in each worker pod executes the SQLMR UDF part of the queries. As such, during a query, these are the two main containers that do the most compute-resource-intensive tasks. The participation of these and the other containers is discussed in more detail in Section 4.
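As a minimal sketch of how such a pod and container layout can be inspected from outside the engine, the snippet below lists pods and their containers with the official Kubernetes Python client. The namespace name "aster" and the container names in the sample output comment are assumptions for illustration only, not the product's actual deployment details.

    # List the pods in an (assumed) Aster namespace and the containers inside each pod.
    from kubernetes import client, config

    def list_aster_pods(namespace="aster"):            # namespace name is an assumption
        config.load_kube_config()                      # use load_incluster_config() inside a pod
        v1 = client.CoreV1Api()
        for pod in v1.list_namespaced_pod(namespace).items:
            containers = [c.name for c in pod.spec.containers]
            print(pod.metadata.name, "->", containers)
            # e.g. a worker pod might show containers such as ['pgdb', 'runner', 'ice']

    if __name__ == "__main__":
        list_aster_pods()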
3. Cluster Resource Monitoring In MPP Engine

We present a novel approach to measure resource usage of a cluster of Kubernetes pods. Our approach is implemented in Teradata's Aster Engine. The monitoring mechanisms discussed enable system administrators and analytics users to visualize, plot, generate alerts and perform live and historical analytics on the cluster usage statistics. The solution is usable with many third-party visualizers and data miners.

There is a plethora of solutions and ideas available to analyze Kubernetes pod performance, both at cost and for free. Some are open source while others are commercial. Some are easy to deploy while others require manual configuration. Some are general purpose while others are aimed specifically at certain container environments. Some are hosted in the cloud while others require installation on one's own cluster or hosts.

3.1 Comparison of Solutions

Following are the alternatives we evaluated and found noteworthy. Quick comparisons of over a dozen third-party solutions generated ideas. We evaluated the offerings against the following criteria: affinity to data collection framework / data stores, built-in aggregation engines, pre-packaged filtering, visualization integration, alerting and logging, and, most importantly, ease of integration with Aster Engine and Teradata's Viewpoint API.

Readers should note that the comparison table below is kept brief; the assessment does not attempt to do full justice to the technological complexity of these offerings, nor to the technical evaluations we performed, let alone the effort that has gone into innovating for them. However, we wanted to incorporate something with relative ease and extensibility. Following is the list of third-party solutions that we looked at to incorporate into Aster Engine to measure cluster-level resource utilization.

1. Docker Stats API[16,17,18]: The Docker Stats API was readily applicable for us (a small usage sketch appears after the comparison table below). The approach is quite basic in terms of the utilizations it can fetch via the API at this time; we can get only per-container utilization. The benefit of using the Stats API is that its output is JSON friendly – something we need to simplify integration with Teradata ViewPoint. We would need to implement the plumbing and aggregation ourselves.

2. cAdvisor[19,20,21]: cAdvisor (Container Advisor) provides a running daemon on each Kubernetes node that collects, aggregates, processes and exports detailed resource usage information for the running containers on that node. But it does not have an orchestration purview beyond a single node, which translates to more complex aggregation implementation and testing for us.

3. SysDig[22,23,24,25]: The API provides deep visibility into containers and it has robust alerting. But this was of no immediate value, because our goal was to interface with Teradata's ViewPoint user interface, which manages alerts within its own framework. It is also a paid solution.

4. SupervisorD[26]: SupervisorD offers a client/server system to monitor and control a number of processes on UNIX-like operating systems. But this is not the most suitable solution in a Kubernetes environment.

5. Scout[27]: A hosted app and a comprehensive solution for monitoring Docker containers. We did not analyze it deeply because of some of the applicability caveats mentioned before.

6. New Relic[28]: This is a neat application performance monitoring (APM) solution offered purely as SaaS. Currently it is mainly usable for Docker, which would warrant a newly implemented resource utilization aggregation solution. In addition to being somewhat hard to incorporate, its being an at-cost solution also factored into our decision.

7. Librato[29,30]: A generic cloud monitoring, alerting and analytics solution for the resource-utilization
type of time-series data. It is well suited for collecting resource utilization or other metrics from AWS or Heroku and post-processing them, for example to find correlations and generate alerts. It is also usable as a Heapster sink, at which point it looked to us more like a consumer of resource-utilization data than a source of it.

8. Heapster[31,32,33]: This is a robust go-to solution for basic resource utilization metrics and events (read and exposed by Eventer) on any Kubernetes cluster. Heapster is a cluster-wide aggregator of monitoring and event data. It supports Kubernetes natively and works on all Kubernetes setups. We decided to use Heapster because of its ease of availability, deployment and auto-configurability, and the convenience of building our solution on top of its RESTful API.

9. Other solutions[34,35]: A ubiquitous solution we noticed is to set up the Prometheus – cAdvisor – InfluxDB and Grafana stack. This solution is geared toward administrators, end users, cloud-hosted systems, or data scientists. It did not seem to us like a solution that we could adopt for Aster Engine and easily convert the resource utilization for ViewPoint[36].

3.2 cAdvisor Introduction

cAdvisor is an open source container resource usage and performance analysis agent. It is built for containers and supports Docker containers natively. In Kubernetes, cAdvisor is integrated into the Kubelet binary. cAdvisor auto-discovers all containers on the machine and collects CPU, memory, filesystem, and network usage statistics. cAdvisor also provides the overall machine usage by analyzing the 'root' container on the machine. The Kubelet acts as a bridge between the Kubernetes master and the nodes. It manages the pods and containers running on a machine. The Kubelet translates each pod into its constituent containers and fetches individual container usage statistics from cAdvisor. It then exposes the aggregated pod resource usage statistics via a REST API. More on how Heapster works can be found in the reference section.

Alternative                 | Benefits                                              | Concerns
Docker Stats API            | Direct JSON output                                    | Hard to integrate
cAdvisor                    | Node-level option                                     | Hard to integrate
sysDig                      | Detailed; alerts                                      | Paid solution
SupervisorD                 | Detailed, Unix-like                                   | Not pod-aware
Scout                       | Comprehensive                                         | Hosted app
New Relic                   | Elegant solution                                      | Paid; not pod-aware
Librato                     | Analytics; alerts                                     | Hard to integrate
cAdvisor, InfluxDB, Grafana | Well-adopted, generic                                 | Prometheus- or plotting-friendly
Heapster                    | Robust; pod-aware; free; pluggable API, JSON output   | Needs trivial client code to transcode, aggregate
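To make item 1 of the comparison concrete, here is a minimal sketch of pulling one non-streaming stats sample per container through the Docker stats API, via the docker Python SDK. It is illustrative only; the exact JSON layout of the stats payload can vary with the Docker version, and any aggregation across containers or pods would still have to be written by hand, as noted above.

    # One-shot, per-container utilization snapshot via the Docker stats API.
    import docker

    def container_stats_snapshot():
        client = docker.from_env()
        snapshot = {}
        for container in client.containers.list():
            s = container.stats(stream=False)                 # a single JSON stats sample
            mem_bytes = s.get("memory_stats", {}).get("usage", 0)
            cpu_ns = s.get("cpu_stats", {}).get("cpu_usage", {}).get("total_usage", 0)
            snapshot[container.name] = {"memory_bytes": mem_bytes, "cpu_total_ns": cpu_ns}
        return snapshot

    if __name__ == "__main__":
        for name, metrics in container_stats_snapshot().items():
            print(name, metrics)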
3.3 Heapster Introduction

Heapster is a cluster-wide aggregator of monitoring and event data. Currently it natively supports Kubernetes[37,38] only and works on all Kubernetes setups. Heapster runs as a pod in the cluster, similar to how any Kubernetes application would run. The Heapster pod discovers all nodes in the cluster and queries usage information from the nodes' Kubelets, the on-machine Kubernetes agents. The Kubelet itself fetches the data from cAdvisor. Heapster groups the information by pod along with the relevant labels. This data is then pushed to a configurable backend for storage and visualization. Currently supported backends include InfluxDB (with Grafana for visualization), Google Cloud Monitoring and many others described in more detail in the Heapster documentation[31]. The overall architecture of the service can be seen below. A Grafana setup with InfluxDB is a very popular combination for monitoring in the open source world. InfluxDB exposes an easy-to-use API to write and fetch time-series data. Heapster is set up to use this storage backend by default on most Kubernetes clusters.

3.4 Using Heapster To Measure MPP Database Performance

3.4.1 Heapster Metric Model

The Heapster Model is a structured representation of metrics for Kubernetes clusters, which is exposed through a set of REST API endpoints. It allows the extraction of historical data for any Container, Pod, Node or Namespace in the cluster, as well as the cluster itself (depending on the metric). The model API does not conform to the standards of a Kubernetes API: it cannot be easily aggregated, does not have auto-generated serialization and clients for its types, and has a number of corner cases in its design that cause it to fail to display metrics in certain cases. Within Kubernetes, its use has been replaced by the resource metrics API and the custom metrics API, found in the k8s.io/metrics repository. New applications that need metrics are encouraged to use these APIs instead; metrics-server and custom metrics adapters provide them, respectively.

3.4.2 Kubernetes Resource Metrics API[39]

The goal of this Kubernetes community effort is to provide resource usage metrics for pods and nodes through
the API server itself. This is a stable, versioned API which core Kubernetes components can rely on. The Kubernetes apiserver persists all Kubernetes resources in its key-value store, etcd, which is not able to handle the load of frequently changing metrics. Metrics, on the other hand, tend to change frequently, are temporary, and, if lost, can simply be collected again during the next housekeeping operation. Kubernetes therefore stores them in memory, and rather than reusing the main apiserver it introduces a new one: the metrics server. This is the API that we used to collect resource usage at the entire-cluster level.

3.5 Cluster Resource Measurement System

Figure 1 below is a simplified diagram that shows our CRM subsystem implemented in Aster Engine.

Figure 1: Cluster Resource Monitoring In Aster Engine

At a high level, Aster is deployed in its own namespace and interacts with Teradata ViewPoint (on the left) over HTTP or HTTPS. A new container called "webservices" implements the logic that queries pod-level resource utilization statistics (CPU %, memory used, network IO, disk IO, etc.) from the Heapster pod. Heapster interacts with all containers of Aster and fetches utilization statistics using cAdvisor via the Kubelets on the pods, as detailed earlier. The CRM Client module implements a data format transcoder as well as aggregation logic, and serves the data through an Apache webserver running within the same container. ViewPoint is the client that requests this data using a RESTful HTTP API interface; ViewPoint plots and displays the data in a user-friendly manner.

3.6 Measurement Results

We implemented a Python client that pulls this information from the Heapster Metrics Server's RESTful API and stores it by pod. Our cluster comprised one Master pod and many distributed Worker pods that together create an MPP analytic system. The stored data contained CPU utilization in %, memory used in bytes, and network bytes sent and received. Heapster may also be able to return disk reads and disk writes per pod, per node or per container, if needed. Heapster returns time-series data related to the resource usage. Sample data for one time-point is shown below, but the entire series can be retrieved for up to the last 15 minutes, and recorded, plotted and analyzed.

A sample result is shown in the table below. The table shows real-time resource usage queried from Heapster, aggregated at the pod level. The data shows CPU, memory and network IO on the Aster Engine's Master pod and 2 worker pods. The client program we implemented queries data from Heapster using its RESTful API endpoints at the pod level and presents it to the ViewPoint user interface. The program also evicts the data from Heapster based on our eviction scheme.

Name    | Metric  | Value          | Epoch
Master  | CPU     | 33.5 %         | 1515494040
Master  | Memory  | 64180615 Bytes | 1515494040
Master  | Nw Sent | 40671 Bytes    | 1515494040
Master  | Nw Recv | 53732 Bytes    | 1515494040
Worker1 | CPU     | 45 %           | 1515494040
Worker1 | Memory  | 81601882 Bytes | 1515494040
Worker1 | Nw Sent | 81621 Bytes    | 1515494040
Worker1 | Nw Recv | 397232 Bytes   | 1515494040
Worker2 | CPU     | 40 %           | 1515494040
Worker2 | Memory  | 81604771 Bytes | 1515494040
Worker2 | Nw Sent | 81621 Bytes    | 1515494040
Worker2 | Nw Recv | 207232 Bytes   | 1515494040
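The following is a minimal sketch of the kind of Heapster client described above, written against Heapster's model API. The service address, namespace and metric names used here are assumptions for illustration (they depend on the Heapster deployment and version); the real CRM client additionally transcodes and aggregates the data for ViewPoint.

    # Pull the recent time series for a few pod-level metrics from Heapster's model API.
    import requests

    HEAPSTER = "http://heapster.kube-system.svc:8082"   # assumed in-cluster service address
    NAMESPACE = "aster"                                  # assumed Aster Engine namespace
    METRICS = ["cpu/usage_rate", "memory/usage", "network/tx", "network/rx"]  # assumed names

    def pod_metric(pod, metric):
        """Return the list of {timestamp, value} samples for one metric of one pod."""
        url = "{}/api/v1/model/namespaces/{}/pods/{}/metrics/{}".format(
            HEAPSTER, NAMESPACE, pod, metric)
        resp = requests.get(url, timeout=5)
        resp.raise_for_status()
        return resp.json().get("metrics", [])

    def latest_snapshot(pods):
        """Latest sample of every metric for every pod, keyed by (pod, metric)."""
        snapshot = {}
        for pod in pods:
            for metric in METRICS:
                series = pod_metric(pod, metric)
                if series:
                    snapshot[(pod, metric)] = series[-1]   # most recent time-point
        return snapshot

    if __name__ == "__main__":
        print(latest_snapshot(["aster-master-0", "aster-worker-0", "aster-worker-1"]))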
4. Query Resource Monitoring In MPP Engine

Resource utilization is modeled in Query Resource Monitoring (QRM) on a per-analytic-query basis. When a query takes a long time, QRM provides insights. This is useful to identify expensive phases of a query that tax certain resources more, or that skew the work distribution.

4.1 Data Transfer Load During Query Execution

In Teradata's Aster Engine, complex analytic computations can result in large data movements and compute-intensive "hot spots" on specific containers, the bulk of which is explained in Section 2. Data movement is done using Aster's proprietary data format known as "ICE", which stands for "Intra Cluster Express". ICE moves large "tuples" of data across worker pods and the Master pod during query execution, partitioning, repartitioning or complex JOINs. In addition to the pgdb and the runner containers, the ice container can also exhibit sporadic spikes in CPU, network IO and memory utilization, because of intermittently large-scale movements of partitions.

4.2 Highlights of Query Execution In Aster Engine
Broadly, there are two types of queries:

1. Plain-SQL query execution: For this we deploy a dedicated container that runs an instance of a tailored version of the proven Postgres database, the pgdb container. The pgdb container deployed on each of the workers provides, in parallel, an MPP infrastructure for running the plain-SQL parts of the user's queries in a distributed manner.

2. SQLMR query execution: The runner container provides an isolated environment to run SQLMR UDFs. The runner container deployed on each of the workers provides, in parallel, an MPP infrastructure for running the SQLMR UDF parts of the user's queries in a distributed manner.

An analytic user, such as a database administrator (DBA) or a data scientist, simply writes a query or a UDF[43] like they traditionally do, and Aster Engine distributes it on the MPP system. The system can run arbitrarily complex UDFs in this distributed environment. Each runner runs a single JVM, which is used to solve a part of the sub-problem to run the UDF. At the end of the execution, the outcomes of all of the runners are combined to form the final output, which is the result of running this SQLMR UDF in the database on the given table.

We would like to measure resource utilization within the pgdb and the runner containers on each worker pod, and then aggregate these values across workers to report the compute resource utilization caused by a specific user query on the Aster Engine cluster. At the time of this publication, CPU % utilization and memory utilization are measured and reported by the solution. We are working to incorporate disk IO and network IO; the solution for these additional metrics should fall into place since a robust framework has been built first.

Note that the uniqueness of Aster Engine as a distributed system is in the fact that we are slicing the SQLMR analytic problems across worker containers within the MPP system. The pgdb and runner containers solve the sub-problems on disjoint partitions of the database tables, in parallel, in order to achieve performance. As such, since the system does not merely slice the work by containers or pods, measuring resource utilization is non-trivial and needs to be done in a bottom-up manner. Note also that we cannot afford to spin up and shut down containers on demand if Aster Engine is to remain a high-performance MPP execution engine; the overhead of on-demand spin-up of containers would easily reduce the performance of Aster Engine on complex analytic queries due to the added latency.

In addition to accurately measuring plain-SQL and SQLMR computations across the cluster, the impact of driver functions is also measured. Driver functions use JDBC to connect back to Aster Engine and issue additional queries. The resource usage of these child sessions, and of all the queries executed as part of them, is also accounted toward the main query that invoked the driver function.

4.3 Query Resource Measurement System

Figure 2 below shows a simplified view of the QRM system. The light blue boxes represent the existing pods that are part of the Aster Engine MPP system. The MPP Execution Master creates and executes the query plan and controls execution of the query across the Aster cluster. The pgdb and runner are the Docker containers, one or more of which may exist on each worker pod, as shown below. The diagram also shows a trivial sample table of the result from the QRM system as stored in the DATABASE.
The other actors in QRM are the following:

• QM Emitter – helps retrieve memory and CPU usage information by reading it from the Linux procfs filesystem[42] for the specific processes pertaining to Aster Engine inside the Docker containers (a minimal procfs-reading sketch follows the measurement steps below)
• QM Collector – a container in each pod
• QM Master – a lightweight container in the Master pod that has the main role in QRM; it polls or subscribes to the worker pods' QM Collectors for utilization data
• Utilization in pgdb – plain-SQL resource usage is rendered by reusing the QM Emitter
• Utilization in runner – SQLMR resource usage is rendered by reusing the QM Emitter

4.3.1 Query Resource Measurement Steps

1. When the QRM sub-system is on, it looks for the start of a new session. Utilization collection starts when a new session starts.
2. The QM Emitter sends utilization data from pgdb and runner to the QM Collector local to each worker pod for the entire duration that the Aster Engine session is active.
3. Data collection stops when the end of the query session is detected.
4. Data from the QM Collectors on all workers is requested by the QM Master on the Master pod and relayed to a generic destination DATABASE where measurements are stored. This is done on a periodic basis (minutes or hours) and at the ends of sessions.
5. Data is stored in a new table in the DATABASE and is indexed by session id. Data can be pulled from the generic destination DATABASE into other reporting, post-processing and historical-analysis systems and plotting tools.
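The QM Emitter itself is not published, but the following hedged sketch shows the kind of procfs reading it is described as doing: CPU time from /proc/<pid>/stat and resident memory from /proc/<pid>/status for a set of Aster-related process ids. The function and variable names are illustrative only.

    # Sample CPU % and resident memory for a set of processes by reading procfs.
    import os
    import time

    CLK_TCK = os.sysconf("SC_CLK_TCK")                   # kernel clock ticks per second

    def cpu_seconds(pid):
        """Total user + system CPU time of a process, in seconds."""
        with open("/proc/%d/stat" % pid) as f:
            data = f.read()
        fields = data[data.rfind(")") + 2:].split()      # skip pid and (comm), which may contain spaces
        utime, stime = int(fields[11]), int(fields[12])  # stat fields 14 and 15
        return (utime + stime) / CLK_TCK

    def rss_bytes(pid):
        """Resident set size of a process, in bytes."""
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) * 1024   # VmRSS is reported in kB
        return 0

    def sample(pids, interval=1.0):
        """One (epoch, cpu_percent, memory_bytes) sample across the given pids."""
        before = sum(cpu_seconds(p) for p in pids)
        time.sleep(interval)
        after = sum(cpu_seconds(p) for p in pids)
        cpu_percent = 100.0 * (after - before) / interval
        memory = sum(rss_bytes(p) for p in pids)
        return (int(time.time()), cpu_percent, memory)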
The QRM design is a best-effort architecture, due to the nature of its purpose. If parts of the sub-system do not perform, reporting may be incomplete, i.e., yield underestimated (but not inapplicable) utilization. Specifically:

Single point of failure: If the QM Master goes down, whether due to load, container failure or a degenerate system state, the measurements may lag or utilizations may be under-estimated. In this case, Kubernetes may restart the QM Master container.

Figure 2: Simplified Query Resource Monitoring Subsystem In Aster Engine, keyed by Session id

Service protection (do-no-harm principle): Local, pod-level query utilization storage is designed in a FIFO / LRU / ring-buffer / round-robin way.

• QRM is designed to save accumulated data in a round-robin fashion. If data is not collected, either from the QM Collectors on workers or from the DATABASE, after a period of time, it may be overwritten if needed.
• This is to protect the active analytic engine and the QM Emitter from running out of memory.
• The QM Collector can cap the in-memory resource utilization data with a per-session memory limit.
• The QM Collector's in-memory data can also have an overall limit across sessions.
• The QM Collector can also have a Data Retiring Manager to retire data in the DATABASE on a per-table basis (time based or size based; long-term expiry of data).

4.4 Measuring SQL Resource Utilization

Compute resource usage by plain SQL queries is measured by measuring system load inside the pgdb containers running in each of the worker pods, and aggregating the results collected at the Master. This involves measuring resource usage by each of the Postgres processes within the pgdb containers on the worker pods.

4.5 Measuring SQLMR Resource Utilization

Compute resource usage by SQLMR queries is measured by measuring system load inside the runner containers running in each of the worker pods, and aggregating the results collected at the Master. This involves measuring resource usage by each of the JVMs running inside the runner containers.
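A purely illustrative sketch of this bottom-up aggregation is given below: per-process readings (such as those produced by a QM Emitter) are summed per container, then per worker, and finally across workers at the Master. All names and numbers are hypothetical.

    # Bottom-up roll-up of per-process (cpu %, memory bytes) readings taken at one instant.
    def aggregate_bottom_up(readings):
        """readings: {worker: {container: [(cpu_pct, mem_bytes), ...] per process}}."""
        per_worker = {}
        for worker, containers in readings.items():
            cpu = sum(c for procs in containers.values() for c, _ in procs)
            mem = sum(m for procs in containers.values() for _, m in procs)
            per_worker[worker] = (cpu, mem)
        cluster_cpu = sum(c for c, _ in per_worker.values())
        cluster_mem = sum(m for _, m in per_worker.values())
        return per_worker, (cluster_cpu, cluster_mem)

    readings = {
        "worker1": {"pgdb": [(12.0, 3000000)], "runner": [(33.0, 9000000)]},
        "worker2": {"pgdb": [(11.0, 2500000)], "runner": [(30.0, 8500000)]},
    }
    print(aggregate_bottom_up(readings))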
4.6 Solution Scalability

A system whose performance improves by adding hardware, proportionally to the capacity added, is said to be a scalable, or horizontally scalable, system. Scalability of QRM ties to the measurement load imposed on it, so scalability of QRM is somewhat inversely proportional to the scalability of the MPP system being measured: adding workers loads the QM Master in a linear manner. In an MPP analytic engine with N worker pods and m pgdb or runner instances per pod, there are (m + 1) * N total resource utilization measurements to be taken.

• Every measurement is collected at a sampling frequency, say f, or equivalently a sampling period T = 1/f, for every resource type.
• Let R be the total number of resources measured, where resource type ∈ {CPU %, Memory, Network Recv, Network Sent, Disk Reads, Disk Writes}. Currently R = 2: CPU % and Memory.
• Every sample is a time-series item, which is a pair of doubles: <time> and <utilization value>.
• So, with sampling period T, the QRM subsystem collects ((m + 1) * N * R * 2 * 8) / T bytes of data per unit time.
• Periodically, this data is fetched or pushed into long-term storage, such as the DATABASE or a third-party reporting system.
• For convenience, we estimate scale on a "per fetch by the third-party reporting system" basis; alternatively, the sampling period of such a reporting system can also be modeled here, and if modeled it proportionately affects scale.

Measurement collection is done for each query, so for Q queries running simultaneously the subsystem collects (m + 1) * N * R * Q * 16 / T bytes of data in total per unit time.

Following are a few sample scenarios:

1. For a 2000-pod Aster Engine MPP system with 2 pgdb and/or runner containers per pod, sampling the resource utilizations at a period of T = 10 sec (1/6 minute), the proposed solution will collect data at 1,152,000 bytes/min, or 19.2 KB/s, per query.
• Each additional query monitored in parallel multiplies the size of the data, modulo the fetch period of the third-party reporting system.
• If, for this scenario, the QRM data is offloaded fully into a third-party reporting system every 5 minutes, it amounts to about 5.8 MB of memory if one query is running, and 17.4 MB for 3 queries running in parallel, every 5 minutes.

2. For a small system with 2 pods, one pgdb per pod, and T = 30 sec (0.5 minute), we will collect 256 bytes of data per running Promethium query per minute.
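The arithmetic in these scenarios can be checked with a few lines of Python; the helper below simply encodes the derivation above (N pods, m containers per pod, R resources, 16 bytes per sample, sampling period T in seconds) and reproduces the 19.2 KB/s figure of scenario 1. The function name is illustrative only.

    # Back-of-the-envelope QRM data volume: bytes of time-series data generated per second.
    def qrm_bytes_per_second(n_pods, m_containers, n_resources, period_seconds, n_queries=1):
        bytes_per_round = (m_containers + 1) * n_pods * n_resources * 16   # 2 doubles per sample
        return bytes_per_round * n_queries / period_seconds

    rate = qrm_bytes_per_second(n_pods=2000, m_containers=2, n_resources=2, period_seconds=10.0)
    print(rate, "B/s")          # 19200.0 B/s, i.e. ~19.2 KB/s per query
    print(rate * 60, "B/min")   # 1,152,000 B/min, as in scenario 1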
QRM is most applicable for long-running queries. Note that in the absence of eviction, the number of queries has a linear penalty on the resources needed by QRM, as the derivation shows. The above calculations highlight the need for LRU-based local eviction and destructive read-out heuristics. Note also that QRM is off by default.

4.7 Measuring Workload Skew

Skew is one of the key characteristics of an MPP execution engine that helps determine throughput for a specific database session. In an MPP system with high cardinality, skew is a condition in which the compute work to execute a long-running database query is unevenly balanced among partitions or workers in the cluster. In any practical scenario of execution on any system, a small amount of skew is inevitable and harmless. There are mainly four skew parameters to judge the throughput and performance of an orchestrated MPP execution engine of a given cluster size:

1. Pod IO skew: comparison of the highest IO usage watermark on the busiest pod to the average use on the other pods. This can include network IO, disk IO or both.
2. Pod CPU skew: comparison of the highest CPU usage watermark on the busiest pod to the average use on the other pods.
3. Pod Memory skew: comparison of the highest memory watermark for a query on the busiest pod to the average use on the other pods.

Let podCount be the total number of pods, sum_m be the sum of metric m over all pods, and max_m be the value of metric m on the busiest pod. For IO metrics, the two components are added together: read and written bytes for disk IO, and sent and received bytes for network IO; the max and sum are computed over these totals. The Workload Skew for a specific utilization metric m across all Aster Engine worker pods is then computed as

    podSkew_m = (1 - (sum_m / podCount) / max_m) * 100

where the subscript m stands for the specific resource the skew is computed for, such as memory, CPU %, etc. This formula for podSkew helps a DBA judge the workload spread across workers. The expectation is that the overall throughput of the MPP system on a long-running query is maximum if the workload is perfectly balanced, that is, all pods have an identical workload and thus podSkew is 0.
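The podSkew formula translates directly into a few lines of Python; the sketch below is illustrative (the helper name and inputs are not part of the Aster implementation), and uses the per-worker memory values from the Section 3.6 sample as an example.

    # Workload skew (in %) for one metric, given that metric's per-pod values.
    def pod_skew(pod_values):
        """0 means perfectly balanced; values near 100 mean one pod does nearly all the work."""
        pod_count = len(pod_values)
        busiest = max(pod_values)
        if busiest == 0:
            return 0.0                                 # no usage at all: treat as balanced
        return (1.0 - (sum(pod_values) / pod_count) / busiest) * 100.0

    # Example: worker memory bytes from the Section 3.6 table (two workers)
    print(pod_skew([81601882, 81604771]))              # ~0.002 %, essentially balanced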
4.8 QRM Data Aggregation

The data aggregation in QRM for resource utilization requires a distributed-system design of its own that must be able to scale linearly, or preferably super-linearly, with respect to the size of the MPP system, because bigger systems do not reduce but rather grow the challenge for QRM performance. The data aggregated into the QM Collector processes, in memory, per worker pod, is primarily resource utilization time-series data. The aggregation heuristic depends on the resource type of the metrics being aggregated. At a high level:

• The CPU utilization % should be stored uniquely per pod, and not added up, so that podSkew for CPU % can be computed for the system.
• The memory utilization in bytes is stored per pod, with each pod potentially returning (m + 1) time series, where m is the number of pgdb and/or runner Docker instances per pod. Storing these values separately allows computing podSkew for memory.
• This storage is local to the QM Collector in each pod.
• It is possible to measure skew for SQL and SQLMR queries separately by storing the resource utilization metrics for pgdb and runner separately, instead of aggregating them in the QM Collector.
• Newer metrics such as network bytes sent, network bytes received, disk bytes read and disk bytes written can also be reported in the same manner.

There are two ways to collect this data centrally in the DATABASE that is attached to the Master pod:

• The QM Master can serially fetch, or pull, the data from the QM Collectors of all workers.
• The QM Collectors on all workers can asynchronously message, or push, the data to the QM Master.

The push mechanism can employ asynchronous-messaging queues wherein the QM Master is subscribed to all QM Collectors in the pods (across pod recreations). Following are more details on these two approaches.
4.8.1 Serial Solution Formulation

In this method of fetching QRM resource utilization, the QM Master pulls the collected resource utilizations from the QM Collectors on the worker pods on its own schedule.

1. The QM Emitter retrieves resource utilization for the Linux processes in the container (pgdb or runner) from procfs.
2. This utilization info (either one point or a time-series chunk) is sent to the local QM Collector container on the same pod. This data transfer reuses a preexisting RPC mechanism, but could also use a RESTful API or a speedy asynchronous-message queue such as ZeroMQ[41]. The data rests in the process memory of the collector process inside the QM Collector until it is either evicted (expired by time or by size limit) or collected by the QM Master.
3. The QM Master periodically polls the QM Collectors on all worker pods via a similar data transfer mechanism and fetches and erases all of the time-series data from each pod, one at a time.
4. The QM Master saves the data into the DATABASE after it is fetched from each pod.

This is shown in the flow diagram in Figure 3 below.

Figure 3: QRM Data Collection To QM Master: Serial Solution

4.8.2 Parallel Solution Formulation

Another design choice was to use a Proactor design to retrieve QRM data from the QM Collectors on the worker pods. Proactor is a software design pattern for event handling in which long-running data sources subscribe to a central part of the system, either for commands and control or to offload data. A completion handler is called asynchronously at the data source – the QM Collector in our case – when an arbitrary condition is met. Such designs follow the Hollywood principle ("Don't call us, we'll call you."). Here is the flow:

1. As the worker pods come up, every QM Collector container subscribes to the central QM Master container.
2. On each worker pod, the Proactor subscriber in the QM Collector container waits for a certain condition to be met. This condition can be a periodic timer interrupt or a more complex condition that needs to be checked, such as an event triggered by the size of the time series accumulated so far on that pod.
3. The Proactor subscriber in the QM Collector reads the time series.
4. The Proactor subscriber in the QM Collector dispatches an event to the handler in the QM Master.
5. Either the QM Collector can send the time series as the payload of the event itself, or the QM Master can subsequently handshake to request the time-series data.
6. The QM Master can process, aggregate or filter the data and then write it to the QRM tables in the DATABASE.

The benefit of a pull solution is that data transfer rates can be throttled by constraining the resources, like the thread pool, that implement the pull in the QM Master. A parallel, or push, approach may flood the network or the QRM subsystem when queries are resource consuming – which is a bad time to further load the network.
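As one possible shape of the push approach, the hedged sketch below uses ZeroMQ[41] (via the pyzmq bindings) to let each QM Collector push samples to a socket that the QM Master binds. The endpoint name, message layout and the store_in_database stand-in are assumptions for illustration, not the Aster Engine implementation.

    # Push-style QRM data collection: collectors PUSH samples, the Master PULLs and stores them.
    import zmq

    MASTER_ENDPOINT = "tcp://qm-master:5556"            # assumed service name and port

    def run_collector(pod_name):
        """QM Collector side: push one utilization sample to the QM Master."""
        ctx = zmq.Context.instance()
        push = ctx.socket(zmq.PUSH)
        push.connect(MASTER_ENDPOINT)
        sample = {"pod": pod_name, "session": 42, "metric": "memory",
                  "value": 81601882, "epoch": 1515494040}   # illustrative values
        push.send_json(sample)

    def run_master(bind_endpoint="tcp://*:5556"):
        """QM Master side: receive samples from all collectors and hand them to storage."""
        ctx = zmq.Context.instance()
        pull = ctx.socket(zmq.PULL)
        pull.bind(bind_endpoint)
        while True:
            store_in_database(pull.recv_json())          # blocks until a collector pushes data

    def store_in_database(sample):
        print(sample)                                    # stand-in for the real DATABASE insert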
4.9 Measurement Results

The measurement results specific to queries in our MPP execution engine are not available at the time of this writing. These measurements will be taken in March of 2018, well ahead of the conference presentation deadline, and will be shared during the presentation as soon as they are available.

5. Security

5.1 Network Access

1. The CRM feature, in our case, exposes one port to retrieve data from Heapster, and another port to send the aggregated per-pod resource utilization data to Teradata's ViewPoint UI.
2. The QRM feature exposes no new ports. The QM Master does not need network access other than internal connectivity, via the reused RPC, to the QM Collectors on the worker pods to receive and send data for aggregation and storage.
3. If more clients or sinks of data are connected, the ports need to be secured and not exposed to the outside world; at the same time, all resource utilization data needs to be encrypted to prevent malicious access to it, for both CRM and QRM.

5.2 User Access Control To Data

1. Consider running any newly added QRM and CRM containers as non-privileged Docker containers and as non-privileged users by default.
2. Protect access to the newly added tables and views in the external databases and systems to which the resource utilization data gets saved.
6. Summary

In this paper we have presented two comprehensive mechanisms, and our experiences, for designing and incorporating resource usage monitoring in a large-scale Kubernetes-orchestrated MPP analytic engine. Our work details the methods and addresses multiple challenges pertinent to measuring resource utilization in MPP systems, at a system level and with precision, during the processing of one or multiple distributed SQL and SQLMR analytic queries. Our benchmarking results for Cluster Resource Monitoring have shown that the solution built by implementing a Heapster client and channeling the time-series data to other subsystems and user interfaces does not significantly load the actual system being measured.

7. References

[1] www.docker.com/what-container
[2] Vivek Ratan (February 8, 2017). "Docker: A Favourite in the DevOps World". Open Source Forum, June 14, 2017.
[3] en.wikipedia.org/wiki/LXC
[4] www.linuxcontainers.org
[5] www.upguard.com/articles/docker-vs-lxc
[6] www.goto.docker.com/rs/929-FJL-178/images/Docker-Survey-2016.pdf
[7] O'Gara, Maureen (26 July 2013). "Ben Golub, Who Sold Gluster to Red Hat, Now Running dotCloud". SYS-CON Media, 2013-08-09.
[8] www.domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf
[9] www.arxiv.org/pdf/1709.10140.pdf
[10] www.blog.newrelic.com/2017/11/27/monitoring-application-performance-in-kubernetes
[11] www.newrelic.com/serverless-dynamic-cloud-survey
[12] www.vldb.org/pvldb/vol9/p660-trummer.pdf
[13] www.dcs.bbk.ac.uk/~ap/teaching/ADM2018/notes5.pdf
[14] static.googleusercontent.com/media/research.google.com/en//pubs/archive/41344.pdf
[15] citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478.9491&rep=rep1&type=pdf
[16] docs.docker.com/engine/reference/commandline/stats
[17] docs.docker.com/engine
[18] docs.docker.com/config/containers/runmetrics
[19] www.github.com/google/cadvisor
[20] hub.docker.com/r/google/cadvisor
[21] blog.codeship.com/monitoring-docker-containers
[22] www.sysdig.com/product/monitor
[23] www.tecmint.com/sysdig-system-monitoring-and-troubleshooting-tool-for-linux
[24] www.sysdig.com/blog/monitoring-kubernetes-with-sysdig-cloud
[25] www.sysdig.com/blog/alerting-kubernetes
[26] www.supervisord.org/introduction.html
[27] www.monitorscout.com
[28] en.wikipedia.org/wiki/New_Relic
[29] www.librato.com
[30] www.github.com/librato/librato-metrics
[31] www.github.com/kubernetes/heapster/blob/master/docs/overview.md
[32] www.github.com/kubernetes/heapster/blob/master/docs/storage-schema.md
[33] www.github.com/DataDog/the-monitor/blob/master/kubernetes/how-to-collect-and-graph-kubernetes-metrics.md
[34] www.stackoverflow.com/questions/33749911/a-combination-for-monitoring-system-for-container-grafanaheapsterinfluxdbcad
[35] blog.couchbase.com/wp-content/original-assets/december-2016/kubernetes-monitoring-with-heapster-influxdb-and-grafana/kubernetes-logging-1024x407.png
[36] info.teradata.com/HTMLPubs/DB_TTU_16_00/index.html#page/General_Reference/B035-1091-160K/muq1472241426243.html
[37] www.kubernetes.io
[38] www.kubernetes.io/docs/concepts/overview/what-is-kubernetes
[39] www.github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/resource-metrics-api.md
[40] Large Scale Cluster Management At Google With Borg: static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf
[41] www.zguide.zeromq.org/page:all
[42] ProcFs: en.wikipedia.org/wiki/Procfs
[43] dl.acm.org/citation.cfm?id=1687567