Measuring Resources & Workload Skew
In Micro-Service MPP Analytic Query Engine
Nikunj Parekh
Aster Engine
Teradata, Inc., Santa Clara, CA, USA
Nikunj.Parekh@Teradata.com
Alan Beck
Aster Engine
Teradata, Inc., Santa Clara, CA, USA
Alan.Beck@Teradata.com
Abstract
Big Data analytics is common in many business domains, for
example in the financial sector for savings portfolio analysis, in
government agencies, in scientific research and among insurance
providers, to name a few. The uses of Big Data range from
generating simple reports to executing complex analytical
workloads. The increase in the amount of data being stored
and processed in these domains exposes many challenges with
respect to scalable processing of analytical queries.
Massively Parallel Processing (MPP) databases address these
challenges by distributing storage and query processing
across multiple compute nodes and distributed processes in
parallel, usually in a shared-nothing architecture. Today, a new
technology is shaping the way platforms for the Internet of
Services are designed and managed: containers, such as
Docker[1,2] and LXC[3,4,5].
The use of containers as a base technology for large-scale
distributed systems opens many challenges in the area of
run-time resource management, for example auto-scaling,
optimal deployment and monitoring.
Monitoring orchestrated distributed systems is at the heart of
many cloud resource management solutions. Measuring
performance, and analyzing that data to take the guesswork out of
resource utilization, is key to engineering budget
management.
Measuring Workload Skew and resource utilization of MPP
databases and analytic engines is the focus of our work. This
paper explores the tools available to measure the
performance of MPP Docker and Kubernetes environments
from the perspective of a Database Administrator using such
systems in a virtualized environment. Our approach provides
a detailed characterization of CPU, memory and network IO
while complex, long-duration analytic SQL queries load
compute resources in containerized systems.
Keywords: Docker, LXC, Massively Parallel Processing
(MPP), MapReduce, Kubernetes, Pods and Workload Skew
1. Introduction
The enabling force behind large-scale distributed systems
moving from dedicated servers to cloud computing is
containerization technology. The drivers for
containerization are the following:
• Portability – agnostic of the environment
• Quick Installation and Deployment – should be able to
install and redeploy quickly on a given environment and
also deploy host-agnostically
• Elasticity – In the cloud environment, compute resources
are available on demand. We are able to start clusters on
the fly and expand/shrink them based on demand in
many applications
• Accurate Resource Management – a pay-as-you-go
model, wherein software products are no longer licensed
but metered, requires detailed resource monitoring and
management
• Multi-tenancy – should be able to start multiple instances
of the same distributed system on the same physical
hardware; containerization is one of the key
mechanisms for hosting multi-tenant apps in the cloud
• Isolation – without introducing unworkable interface
conflicts, such as for network, namespaces and other
sharable resources
• Quicker releases – CI / CD friendly product releases
• Quicker development and testing – because
containerization facilitates micro-services architecture, it
promotes parallelizable product development and
quicker feature testing
The interest of cloud providers and Internet Service Providers
(ISPs)[6] in Docker as the preferred product offering model has
gradually grown. A container is a software program
that performs OS-level virtualization by using the resource
isolation features of the Linux kernel[7]. This allows
applications or application components (the so-called micro-
services) to be installed with all the library dependencies,
binaries, and configuration needed to run as if they
were running on independent Virtual Machines, except that there is
low overhead in starting and running them.
There are many management tools for Linux containers:
LXC, systemd-nspawn, lmctfy, Warden, Docker and rkt for
CoreOS[8,9]. The latter is a minimal operating system that
supports popular container systems out of the box. The
operating system is designed to be operated in clusters and
can run directly on bare metal or on virtual machines.
CoreOS supports hybrid architectures (for example, virtual
machines plus bare metal). This approach enables the
Container-as-a-Service solutions that are becoming widely
available.
Momentum keeps building around Kubernetes. Our recent
survey on server-less dynamic cloud computing and DevOps
revealed that by mid-2018, more than 60% of respondents
will be evaluating, prototyping, or deploying to production
solutions involving container orchestration[10,11]. We see
Kubernetes emerging as the de-facto standard for container
orchestration as well as for large-scale distributed system
deployments.
With a rapidly increasing number of large-scale systems
deployed in Kubernetes environments, we wanted to explore
strategies for instrumenting resource utilization
measurements for applications running inside these
orchestrated clusters. This paper lays out two different
resource measurement solutions with a focus on monitoring
resource usage in MPP analytic systems, such as Teradata’s
Aster Engine.
The first part of the paper discusses resource monitoring at a
broad orchestrated application level, and enumerates
solutions to collect application monitoring data for exploring
performance issues, troubleshooting application errors and
analyzing performance trends to root cause performance
bottlenecks in the application. The question that always needs
to be answered is: Is the problem the input or the underlying
infrastructure?
While analyzing the broader cluster-level resource utilization
profile, it is often necessary in an MPP analytic engine to
understand utilization at a more granular level. For example, to
answer the above question fully, or to estimate future compute
resource needs, it is very valuable to profile resource utilization
specific to each query. It is therefore necessary to know in detail
the resource usage of long-running SQL queries, or even of
specific phases thereof. The second part of the paper details the
Query Resource Monitoring solution implemented in Aster
Engine at Teradata.
The rest of this paper is organized as follows. Section 1
continues with further introduction to Massively Parallel
Processing and container orchestration. Section 2 briefly
introduces the Aster Engine in a Kubernetes environment.
Section 3 details our Cluster Resource Monitoring
approaches and the solution we picked, and Section 4 details our
Query Resource Monitoring approach. It also highlights
commonly applicable sub-problems, such as retrieval, storage
solutions and the policies for eviction of the time-series
resource utilization data.
1.1 Massively Parallel Processing
Modern scale-out database engines are usually based on one
of two design principles: sharded and massively parallel
processing (MPP) databases. Both are shared-nothing
architectures, where each node manages its own storage and
memory, and they are typically based on horizontal
partitioning[12,13]. Sharded systems optimize for executing
queries on small subsets of the shards, and communication
between shards is relatively limited[14].
MPP databases optimize for parallel execution of each
query[15]. The nodes are usually collocated within the same
data center, and each query can access data across all the
nodes. A query optimizer generates an execution plan that
includes explicit data movement directives, and the cost of
moving data is taken into account during optimization. A
query executing in an MPP database can include several
pipelined execution stages, with explicit communication
between nodes at each stage. For example, a multi-stage
aggregation can be used to compute an aggregate over the
entire dataset using all the nodes. MPP databases can use a
unique mechanism to solve sub-problems using independent
queries for each of them. This mechanism of issuing "child
queries" reuses the same JDBC connection from a client to
connect to the MPP system. While measuring resource
utilization, the resource usage of these child sessions needs to
be accounted toward the main query that invokes the
child queries to solve sub-problems.
Many robust, highly scalable, high performance distributed
systems and products, such as Teradata’s Aster Engine
analytic system are built on top of Docker and Kubernetes
technology, and these large-scale systems require monitoring
solutions similar to the ones outlined below.
2. Teradata Aster Engine On Kubernetes
Kubernetes is an open-source system for automating
deployment, scaling and management of containerized
applications that was originally designed by Google and is now
maintained by the Cloud Native Computing Foundation
(CNCF). It aims to provide a "platform for automating
deployment, scaling, and operations of application containers
across clusters of hosts". The idea with Kubernetes is to
manage containers as a group called a pod[40]. Pods can be
collocated and can share network, volumes and namespace.
Kubernetes thus provides a good abstraction for distributed
applications that are not completely decoupled, such as Aster
Engine. The networking and sharing of resources between
containers is taken care of, and the end user does not need to
set it up. Kubernetes also manages failure of pods and restarts
a container if it fails.
Teradata’s new Analytic Platform uses Aster Engine for
large-scale SQL and User Defined Analytic Functions (UDF)
query processing on massive data. Aster Engine is an MPP
analytics database whose typical deployments include
tens or hundreds of nodes. Storage and processing of large
amounts of data are handled by distributing the load across
several servers or hosts to create an array of individual
databases, working together to present a single database
image. The Master node is the entry point, where clients
connect and submit SQL statements. The Master coordinates
work with other database instances, called workers. Aster
Engine Master is a pod that runs several Linux Docker
containers, each responsible for specific tasks that coordinate
the execution of an Aster query across workers. Database clients
connect to the Master through a JDBC connection to issue
queries. The Master is responsible for global optimization of
the query: parsing, serialization, query planning, query
optimization and execution of phases. Each Aster worker is a
separate pod and runs several Docker containers of its own.
Each worker container is responsible for a specific task during
query execution, under the Master's command as per the query
plan. One of the containers on each worker pod encapsulates
the Postgres database (the pgdb container). The pgdb container
is a standalone Postgres DB itself and receives the subset of
the table that the query is executing on. This subset is called a
partition. The union of the partitions across all workers is the
full table, and their intersection is empty.
During execution of a query, large amounts of data can be
distributed and redistributed in multiple ways across the
workers, due to repartitioning, complex JOINs, etc. An entire
distributed table, with its partitions stored on different workers,
can also be gathered from those worker pods, operated upon
with various analytic SQLMR UDFs run as queries by the
client, and the resulting table can optionally be stored or
returned to the client as output. SQLMR stands for
SQL-MapReduce, a UDF framework that is inherently parallel,
designed to facilitate parallel computation of procedural
functions across hundreds of servers working together as a
single relational database.
The pgdb container executes the Plain-SQL part of queries,
and a “runner” container deployed in each worker pod
executes the SQLMR UDF part of the queries. During a query,
these are the two main containers that perform the most
compute-intensive work. The participation of these and the
other containers is discussed in more detail in Section 4.
3. Cluster Resource Monitoring In MPP Engine
We present a novel approach to measure resource usage of a
cluster of Kubernetes pods. Our approach is implemented in
Teradata's Aster Engine. The monitoring mechanisms
discussed enable system administrators and analytics users
to visualize, plot, generate alerts and perform live and
historical analytics on the cluster usage statistics. The
solution is usable with many third-party visualizers and data
miners.
There is a plethora of solutions and ideas available to
analyze Kubernetes pod performance, both paid and free.
Some are open source while others are commercial.
Some are easy to deploy while others require manual
configuration. Some are general purpose while others are
aimed specifically at certain container environments. Some
are hosted in the cloud while others require installation on
one's own cluster or hosts.
3.1 Comparison of Solutions
The following are the alternatives we evaluated and found
noteworthy. Quick comparisons of over a dozen third-party
solutions generated ideas. We considered the following
criteria for the offerings: affinity to a data collection
framework / data store, built-in aggregation engines, pre-
packaged filtering, visualization integration, alerting and
logging, and, most importantly, ease of integration with Aster
Engine and Teradata's ViewPoint API. Readers should note
that the comparison table below is kept brief; the assessment
does not attempt to do full justice to the technological
complexity of these products or to the technical evaluations we
performed, let alone the engineering effort that has gone into
them. Our goal was to incorporate something with relative
ease and extensibility. The following is a list of third-party
solutions that we considered incorporating into Aster Engine
to measure cluster-level resource utilization.
1. Docker Stats API[16,17,18]: The Docker Stats API was
applicable for us, but the approach is quite basic in terms of
the utilization data it can fetch via the API at this time; we
can get only per-container utilization. The benefit of the
Stats API is that its output is JSON friendly – something we
need to simplify integration with Teradata ViewPoint. We
would need to implement the plumbing and aggregation
ourselves.
2. cAdvisor[19,20,21]: cAdvisor (Container Advisor) provides
a running daemon on each Kubernetes node that collects,
aggregates, processes and exports detailed resource usage
information for the containers running on that node.
However, it has no orchestration purview beyond a single
node, which translates to more complex aggregation
implementation and testing for us.
3. SysDig[22,23,24,25]: The API provides deep visibility into
containers, and it has robust alerting. But this was of no
immediate value, because our goal was to interface with
Teradata's ViewPoint user interface, which manages alerts
within its own framework. It is also a paid solution.
4. SupervisorD[26]: SupervisorD offers a client/server
system to monitor and control a number of processes on
UNIX-like operating systems. But it is not the most
suitable solution in a Kubernetes environment.
5. Scout[27]: Scout is a hosted app and a comprehensive
solution for monitoring Docker containers. We did not
analyze it deeply because of some of the applicability
caveats mentioned above.
6. New Relic[28]: This is a neat application performance
monitoring (APM) solution offered purely as SaaS.
Currently it is mainly usable for Docker, which would
require a newly implemented resource utilization
aggregation solution. In addition to being somewhat hard
to incorporate, its being a paid solution also factored
into our decision.
7. Librato[29,30]: Librato is a generic cloud monitoring,
alerting and analytics solution for resource-utilization-style
time-series data. It is well suited to collecting resource
utilization or other metrics from AWS or Heroku and
post-processing them, for example to find correlations and
generate alerts. It is also usable as a Heapster sink, at which
point it looked to us more like a consumer of resource
utilization data than a source of it.
8. Heapster[31,32,33]: This is a robust, go-to solution for basic
resource utilization metrics and events (read and exposed
by Eventer) on any Kubernetes cluster. Heapster is a
cluster-wide aggregator of monitoring and event data. It
supports Kubernetes natively and works on all
Kubernetes setups. We decided to use Heapster because
of its ease of availability, deployment and auto-configuration,
and the convenience of building our solution on top of its
RESTful API.
9. Other solutions[34,35]: A ubiquitous solution we noticed
is setting up the Prometheus – cAdvisor – InfluxDB –
Grafana stack. This stack is geared toward administrators,
end users, cloud-hosted systems, or data scientists. It did
not seem like a solution we could adopt for Aster Engine
and easily convert its resource utilization output for
ViewPoint[36].
3.2 cAdvisor Introduction
cAdvisor is an open source container resource usage and
performance analysis agent. It is built for containers and
supports Docker containers natively. In Kubernetes, cAdvisor
is integrated into the Kubelet binary. cAdvisor auto-discovers
all containers in the machine and collects CPU, memory,
filesystem, and network usage statistics. cAdvisor also
provides the overall machine usage by analyzing the ‘root’
container on the machine.
The Kubelet acts as a bridge between the Kubernetes master
and the nodes. It manages the pods and containers running on
a machine. Kubelet translates each pod into its constituent
containers and fetches individual container usage statistics
from cAdvisor. It then exposes the aggregated pod resource
usage statistics via a REST API. More on how Heapster
works can be found in the references.
Alternative | Benefits | Concerns
Docker Stats API | Direct JSON output | Hard to integrate
cAdvisor | Node-level option | Hard to integrate
SysDig | Detailed; alerts | Paid solution
SupervisorD | Detailed; Unix-like | Not pod aware
Scout | Comprehensive | Hosted app
New Relic | Elegant solution | Paid; not pod aware
Librato | Analytics; alerts | Hard to integrate
cAdvisor, InfluxDB, Grafana | Well-adopted, generic | Prometheus- or plotting-friendly
Heapster | Robust; pod-aware; free; pluggable API; JSON output | Needs trivial client code to transcode and aggregate
3.3 Heapster Introduction
Heapster is a cluster-wide aggregator of monitoring and
event data. Currently it natively supports only Kubernetes[37,38]
and works on all Kubernetes setups.
Heapster runs as a pod in the cluster, similar to how any
Kubernetes application would run. The Heapster pod
discovers all nodes in the cluster and queries usage
information from the nodes’ Kubelets, the on-machine
Kubernetes agent. The Kubelet itself fetches the data from
cAdvisor.
Heapster groups the information by pod along with the
relevant labels. This data is then pushed to a configurable
backend for storage and visualization.
Currently supported backends include InfluxDB (with
Grafana for visualization), Google Cloud Monitoring and
many others described in more detail in the Heapster
documentation. A Grafana setup with InfluxDB is a very
popular combination for monitoring in the open source world.
InfluxDB exposes an easy-to-use API to write and fetch time-series
data. Heapster is set up to use this storage backend by default on
most Kubernetes clusters.
3.4 Using Heapster To Measure MPP Database
Performance
3.4.1 Heapster Metric Model
The Heapster Model is a structured representation of metrics
for Kubernetes clusters, which is exposed through a set of
REST API endpoints. It allows the extraction of historical
data for any Container, Pod, Node or Namespace in the
cluster, as well as the cluster itself (depending on the metric).
The model API does not conform to the standards of a
Kubernetes API: it cannot be easily aggregated, does not
have auto-generated serialization and clients for its types, and
has a number of corner cases in its design that cause it
to fail to display metrics in certain cases.
Within Kubernetes, its use has been replaced by the resource
metrics API and the custom metrics API, found in the
k8s.io/metrics repository; metrics-server and custom metrics
adapters provide these, respectively. New applications that need
metrics are encouraged to use these APIs instead.
3.4.2 Kubernetes Resource Metrics API[39]
The goal of this Kubernetes community effort is to provide
resource usage metrics for pods and nodes through the API
server itself, as a stable, versioned API that core Kubernetes
components can rely on. The Kubernetes apiserver persists all
Kubernetes resources in its key-value store, etcd, which cannot
handle the load of frequently changing metrics. Metrics, on the
other hand, change frequently and are temporary: if they are
lost, they can simply be collected again during the next
housekeeping pass. Kubernetes therefore stores them in memory,
and rather than reusing the main apiserver it introduced a new
one, the metrics server. This is the API we used to collect
resource usage at the entire-cluster level.
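For illustration, the sketch below shows one way a client could read pod-level usage from the resource metrics API from inside the cluster; this is not our production implementation, and the namespace name and in-cluster token/CA paths are assumptions based on Kubernetes conventions.

```python
# Sketch: query the Kubernetes resource metrics API (metrics.k8s.io/v1beta1)
# from inside a pod. API server address, token and CA paths are the
# conventional in-cluster defaults; the namespace is a hypothetical example.
import requests

API_SERVER = "https://kubernetes.default.svc"
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
NAMESPACE = "aster"  # hypothetical namespace for the Aster Engine pods

def pod_metrics(namespace=NAMESPACE):
    """Return the latest CPU/memory usage reported by metrics-server, per pod."""
    with open(TOKEN_PATH) as f:
        token = f.read().strip()
    url = f"{API_SERVER}/apis/metrics.k8s.io/v1beta1/namespaces/{namespace}/pods"
    resp = requests.get(url, headers={"Authorization": f"Bearer {token}"},
                        verify=CA_PATH)
    resp.raise_for_status()
    usage = {}
    for item in resp.json().get("items", []):
        pod = item["metadata"]["name"]
        # Per-container usage arrives as quantities such as "25m" CPU or "64Mi" memory.
        usage[pod] = [c["usage"] for c in item.get("containers", [])]
    return usage
```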
3.5 Cluster Resource Measurement System
Figure 1 below is a simplified diagram that shows our CRM
subsystem implemented in Aster Engine.
	
Figure 1: Cluster Resource Monitoring in Aster Engine
At a high level, Aster is deployed in its own namespace and
interacts with Teradata ViewPoint (on the left) over HTTP or
HTTPS. A new container called "webservices" implements
the logic that queries pod-level resource utilization statistics
(CPU %, memory used, network IO, disk IO, etc.) from the
Heapster pod. Heapster interacts with all containers of Aster
and fetches utilization statistics using cAdvisor via the Kubelets
on the pods, as detailed earlier. The CRM Client module
implements a data format transcoder as well as aggregation
logic, and serves the data through an Apache webserver running
within the same container. ViewPoint is the client that
requests this data over a RESTful HTTP API and plots and
displays it in a user-friendly manner.
3.6 Measurement Results
We implemented a Python client that pulls this information
from the Heapster metrics server's RESTful API and stores it
per pod. Our cluster consisted of one Master pod and many
distributed Worker pods that together create an MPP analytic
system.
The stored data contained CPU utilization (%), memory used
in bytes, and network bytes sent and received. Heapster can
also return disk reads and disk writes per pod, per node or per
container if needed. Heapster returns time-series data for the
resource usage. Sample data for one time point is shown below;
the entire series can be retrieved for up to the last 15 minutes
and recorded, plotted and analyzed. A sample of the result data
is shown in the table below. The table shows real-time resource
usage queried from Heapster, aggregated at the pod level. The
data shows CPU, memory and network IO on the Aster
Engine's Master pod and two worker pods.
The client program we implemented queries data from Heapster
using its RESTful API endpoints at the pod level and presents it
to the ViewPoint user interface. The program also evicts the
data from Heapster based on our eviction scheme.
Name | Metric | Value | Epoch
Master | CPU | 33.5 % | 1515494040
Master | Memory | 64180615 Bytes | 1515494040
Master | Nw Sent | 40671 Bytes | 1515494040
Master | Nw Recv | 53732 Bytes | 1515494040
Worker1 | CPU | 45 % | 1515494040
Worker1 | Memory | 81601882 Bytes | 1515494040
Worker1 | Nw Sent | 81621 Bytes | 1515494040
Worker1 | Nw Recv | 397232 Bytes | 1515494040
Worker2 | CPU | 40 % | 1515494040
Worker2 | Memory | 81604771 Bytes | 1515494040
Worker2 | Nw Sent | 81621 Bytes | 1515494040
Worker2 | Nw Recv | 207232 Bytes | 1515494040
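A minimal sketch of a client along the lines described above is given below; it assumes Heapster's model API endpoint layout and metric names (cpu/usage_rate, memory/usage, network/tx, network/rx), and the Heapster service address, namespace and pod names are illustrative assumptions, not our production configuration.

```python
# Sketch of a Heapster model-API client, similar in spirit to the CRM client
# described above. Service URL, namespace and pod names are assumptions;
# metric names follow Heapster's model API naming.
import requests

HEAPSTER = "http://heapster.kube-system.svc:8082"   # assumed in-cluster address
NAMESPACE = "aster"                                  # assumed Aster namespace
METRICS = ["cpu/usage_rate", "memory/usage", "network/tx", "network/rx"]

def pod_timeseries(pod, metric, namespace=NAMESPACE):
    """Fetch the recent time series (up to ~15 minutes) for one pod and metric."""
    url = f"{HEAPSTER}/api/v1/model/namespaces/{namespace}/pods/{pod}/metrics/{metric}"
    resp = requests.get(url)
    resp.raise_for_status()
    # Each point is a dict like {"timestamp": "...", "value": <number>}.
    return resp.json().get("metrics", [])

def snapshot(pods):
    """Latest value of every metric for every pod, keyed by (pod, metric)."""
    return {(p, m): (pod_timeseries(p, m) or [{}])[-1].get("value")
            for p in pods for m in METRICS}

# Example: snapshot(["master-0", "worker-1", "worker-2"])
```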
4. Query Resource Monitoring In MPP Engine
Query Resource Monitoring (QRM) models resource utilization
on a per-analytic-query basis. When a query takes a long time,
QRM provides insight, which is useful for identifying expensive
phases of a query that tax certain resources more or skew the
work distribution.
4.1 Data Transfers Load During Query Execution
In Teradata’s Aster Engine, complex analytic computations
can result in large data movements and compute intensive
“hot spots” on specific containers, the bulk of which is
explained in Section 2. Data movement is done using Aster’s
proprietary data format known as “ICE”, which stands for
“Intra Cluster Express”. ICE moves large “tuples” of data
across worker pods and the Master pod during query
execution, partitioning, repartitioning or complex JOINs. In
addition to the pgdb and runner containers, the ice
container can also exhibit sporadic spikes in CPU, network
IO and memory utilization because of intermittent large-scale
movements of partitions.
4.2 Highlights of Query Execution In Aster Engine
Broadly, there are two types of queries –
1. Plain-SQL query execution: For this we deploy a
dedicated container that runs an instance of a tailored
version of the proven Postgres database (the pgdb
container). The pgdb container deployed on each of the
workers in parallel provides an MPP infrastructure for
running the plain-SQL parts of the user's queries in a
distributed manner.
2. SQLMR query execution: The runner container
provides an isolated environment for running SQLMR UDFs.
The runner container deployed on each of the workers in
parallel provides an MPP infrastructure for running the
SQLMR UDF parts of the user's queries in a distributed
manner.
An analytics user, such as a database administrator (DBA) or
a data scientist, simply writes a query or a UDF[43] as they
traditionally would, and Aster Engine distributes it across
the MPP system. The system can run arbitrarily complex
UDFs in this distributed environment.
Each runner runs a single JVM, which solves its part
of the sub-problem for the UDF. At the end of the
execution, the outcomes of all the runners are combined to
form the final output, which is the result of running the
SQLMR UDF in-database on the given table.
We would like to measure resource utilization within the pgdb and
the runner containers on each worker pod, and then aggregate
these values across workers to report the compute resource
utilization caused by a specific user query on the Aster Engine
cluster.
At the time of this publication, the CPU % utilization and
Memory utilization are measured and reported by the
solution. We are working to incorporate disk IO and
network IO as well; support for these additional metrics should
fall into place easily, since a robust framework has already
been built.
Note that the uniqueness of Aster Engine as a distributed
system is in the fact that we are slicing the SQLMR analytic
problems across worker containers within the MPP systems.
The pgdb and runner containers are solving the sub-problems
on disjoint partitions of the database tables, in parallel, in
order to achieve performance. As such, since the system does
not merely slice the work by containers or pods, measuring
resource utilization is non-trivial and needs to be done in a
bottom-up manner.
Note also that we cannot afford to spin up and shut down
containers on demand if Aster Engine is to remain a high-
performance MPP execution engine. The overhead of on-demand
container spin-up would easily reduce the performance of Aster
Engine on complex analytic queries due to latency.
In addition to accurately measuring plain-SQL and SQLMR
computations across the cluster, the impact of driver
functions is also measured. Driver functions use JDBC to
connect back to Aster Engine and issue additional queries. The
resource usage of these child sessions, and of all the queries
executed as part of them, is also accounted toward the main
query that invoked the driver function.
4.3 Query Resource Measurement System
Figure 2 below shows a simplified view of the QRM system.
The light blue boxes represent the existing pods that are part
of the Aster Engine MPP system. The MPP Execution
Master creates and executes the query plan and controls
execution of the query across the Aster cluster. The pgdb and
runner are the Docker containers, one or more of which may
exist on each worker pod, as shown below. The diagram also
shows a trivial sample table of the result from the QRM
system as stored in the DATABASE. The other actors in
QRM are the following –
• QM Emitter: retrieves memory and CPU usage
information by reading the Linux procfs filesystem[42]
for the specific processes pertaining to Aster Engine
inside the Docker containers (a sketch follows this list)
• QM Collector: a container in each pod
• QM Master: a lightweight container in the Master pod
that has the main role in QRM; it polls or subscribes to
the worker pods' QM Collectors for utilization data
• Utilization in pgdb: renders plain-SQL resource usage by
reusing the QM Emitter
• Utilization in runner: renders SQLMR resource usage by
reusing the QM Emitter
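The sketch below illustrates the kind of per-process reads an emitter can perform against procfs; field positions follow proc(5), and which process IDs belong to pgdb or runner is an assumption left to the caller.

```python
# Sketch of a QM Emitter-style probe reading one process from procfs.
# Field positions follow proc(5); selecting the pgdb/runner pids is assumed
# to be done elsewhere.
import os

CLK_TCK = os.sysconf("SC_CLK_TCK")  # clock ticks per second

def cpu_seconds(pid):
    """User + system CPU time consumed by a process, in seconds."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # Split after the closing ')' of the comm field, which may contain spaces.
    fields = data[data.rfind(")") + 2:].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 of proc(5)
    return (utime + stime) / CLK_TCK

def rss_bytes(pid):
    """Resident set size of a process, in bytes."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024  # value is reported in kB
    return 0
```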
4.3.1 Query Resource Measurement Steps
1. When the QRM sub-system is on, it looks for the start of a
new session. Utilization collection starts when a new
session starts.
2. The QM Emitter sends utilization data from pgdb and runner
to the QM Collector local to each worker pod for the entire
duration that the Aster Engine session is active.
3. Data collection stops when the end of the query session is
detected.
4. Data from the QM Collectors on all workers is requested by
the QM Master on the Master pod and relayed to a generic
destination DATABASE where measurements are
stored. This is done periodically (minutes or
hours) and at the end of each session.
5. Data is stored in a new table in the DATABASE, indexed
by session id.
Data can be pulled from the generic destination DATABASE
into other reporting, post-processing, historical analysis
systems and plotting tools.
The QRM design is a best-effort architecture due to the nature
of its purpose.
If parts of the sub-system do not perform, reporting may be
incomplete, i.e., yield underestimated (but not unusable)
utilization. Specifically –
Single point of failure: If the QM Master goes down, whether
due to load, container failure or a degenerate system state, the
measurements may lag or utilization may be under-estimated.
- In this case, Kubernetes may restart the QM Master
container.
	
Figure 2: Simplified Query Resource Monitoring subsystem in Aster Engine, keyed by session id
Service Protection (do-no-harm principle): Local pod-level
query utilization storage is designed in a FIFO / LRU /
ring-buffer / round-robin way.
• QRM is designed to save accumulated data in a round-
robin fashion. If data is not collected, either from the QM
Collector on the workers or from the DATABASE, after a
period of time, it may be overwritten if needed.
• This is to protect the active analytic engine and the QM
Emitter from running out of memory.
• The QM Collector can cap in-memory resource utilization
data with a per-session memory limit.
• The QM Collector's in-memory data can also have an
overall limit across sessions.
• The QM Collector can also have a Data Retiring Manager
to retire data in the DATABASE on a per-table basis (time
based or size based; long-term expiry of data).
4.4 Measuring SQL Resource Utilization
Compute resource usage by plain-SQL queries is measured
by measuring system load inside the pgdb containers running
in each of the worker pods, and aggregating the results
collected in the Master. This involves measuring resource
usage by each of the Postgres processes within the pgdb
containers on the worker pods.
4.5 Measuring SQLMR Resource Utilization
Compute resource usage by SQLMR queries is measured by
measuring system load inside the runner containers running
in each of the worker pods, and aggregating the results
collected in the Master. This involves measuring resource
usage by each of the JVMs running inside the runner
containers.
4.6 Solution Scalability
A system whose performance improves by adding hardware,
proportionally to the capacity added, is said to be a scalable
(or horizontally scalable) system. The scalability of QRM is
tied to the measurement load imposed on it, so the scalability
of QRM is somewhat inversely proportional to the scalability
of the MPP system being measured: adding workers loads the
QM Master linearly. In an MPP analytic engine with N worker
pods and m pgdb or runner instances per pod, there are
(m + 1) * N total resource utilization measurement points.
• Every measurement point is sampled at a sampling
frequency f, or sampling period T = 1/f, for every resource
type.
• Let R be the total number of resources measured,
o where resource-type ∈ {CPU %, Memory, Network
Recv, Network Sent, Disk Reads, Disk Writes}
o currently, R = 2: CPU % and Memory.
• Every sample is a time-series item, which is a pair of
doubles: <time> and <utilization value>, i.e., 16 bytes.
• So, at sampling frequency f, the QRM subsystem collects
((m + 1) * N) * R * 16 * f bytes of data per unit time.
• Periodically, this data is fetched or pushed into long-term
storage, such as the DATABASE or a third-party
reporting system.
• For convenience, we estimate scale on a "per fetch by the
third-party reporting system" basis; the fetch period of such
a reporting system could also be modeled here and would
proportionately affect scale. Measurement collection is
done for each query, so for Q queries running
simultaneously the total rate is
(m + 1) * N * R * Q * 16 * f bytes per unit time.
The following are a few sample scenarios:
1. For a 2000-pod Aster Engine MPP system with 2
pgdb and/or runner containers per pod, sampling the
resource utilizations at period P = 10 sec (f = 6 samples
per minute), the proposed solution collects data
at 1,152,000 bytes/min, or about 19.2 KB/s, per query.
• Each additional query monitored in parallel multiplies
the size of the data, modulo the fetch period of
the third-party reporting system.
• If, for this scenario, the QRM data is offloaded fully
into a third-party reporting system every 5 minutes,
it amounts to about 5.8 MB of memory if one query is
running, and about 17.4 MB for 3 queries running in
parallel, every 5 minutes.
2. For a small system with 2 pods and one pgdb per
pod, at P = 30 sec (f = 2 samples per minute), we collect
256 bytes of data per running query per minute.
QRM is most applicable for long-running queries. Note
that in the absence of eviction, the number of queries imposes
a linear penalty on the resources needed by QRM, as the
derivation shows.
The calculations above highlight the need for LRU-based
local eviction and destructive read-out heuristics. Note
that QRM is off by default.
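As a rough cross-check of the scenarios above, the data rate can be computed directly from the derivation; the small sketch below uses the same symbols, with the 16 bytes per sample corresponding to one 8-byte timestamp and one 8-byte value.

```python
# Sketch: estimated QRM raw data rate, following the derivation above.
# 16 bytes per sample = one 8-byte timestamp + one 8-byte utilization value.

def qrm_bytes_per_minute(n_pods, m_containers, resources, samples_per_minute, queries=1):
    """(m + 1) * N * R * Q * 16 * f, with f expressed in samples per minute."""
    return (m_containers + 1) * n_pods * resources * queries * 16 * samples_per_minute

# Scenario 1: 2000 pods, 2 pgdb/runner containers per pod, R = 2, 10 s sampling period.
print(qrm_bytes_per_minute(2000, 2, 2, 6))   # 1,152,000 bytes/min (~19.2 KB/s)

# Scenario 2: 2 pods, 1 container per pod, R = 2, 30 s sampling period.
print(qrm_bytes_per_minute(2, 1, 2, 2))      # 256 bytes/min
```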
4.7 Measuring Workload Skew
Skew is one of the key characteristics of an MPP execution
engine that helps determine throughput for a specific
database session. In an MPP system with high cardinality,
skew is a condition in which the compute work to execute a
long-running database query is unevenly balanced among
partitions or workers in the cluster. In any practical scenario
of execution on any system, a small amount of skew is
inevitable and harmless.
There are four main skew parameters (disk IO, network IO,
CPU and memory) for judging the throughput and performance
of an orchestrated MPP execution engine of a given cluster
size:
1. Pod IO skew: comparison of the highest IO usage
watermark on the busiest pod to the average use on the
other pods. This can include network IO, disk IO or
both.
2. Pod CPU skew: comparison of the highest CPU usage
watermark on the busiest pod to the average use on the
other pods.
3. Pod memory skew: comparison of the highest memory
watermark for a query on the busiest pod to the average
use on the other pods.
Let
podCount be the total number of pods,
sum_m be the sum of metric m over all pods,
max_m be the value of metric m on the busiest pod.
For IO metrics, read and written bytes are added together for
disk IO; similarly, sent and received bytes are added together
for network IO, and the max and sum are computed over these
totals. The Workload Skew for a specific utilization metric m
across all Aster Engine worker pods is computed as

podSkew_m = (1 - (sum_m / podCount) / max_m) * 100

where the subscript m stands for the specific resource the skew
is computed for, such as memory, CPU %, etc.
This podSkew formula helps a DBA judge the workload
spread across workers. The expectation is that the overall
throughput of the MPP system on a long-running query is
maximal when the workload is perfectly balanced, that is,
when all pods have an identical workload and the podSkew is 0.
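The skew calculation for one metric is small enough to show directly; the sketch below follows the formula above, and the per-pod values in the examples are purely illustrative.

```python
# Sketch: podSkew for one metric across worker pods, per the formula above.

def pod_skew(per_pod_values):
    """podSkew_m = (1 - (sum_m / podCount) / max_m) * 100, in percent."""
    pod_count = len(per_pod_values)
    total = sum(per_pod_values)
    busiest = max(per_pod_values)
    return (1 - (total / pod_count) / busiest) * 100

# Perfectly balanced workload -> 0% skew; one hot pod -> skew approaches 100%.
print(pod_skew([40, 40, 40]))   # 0.0
print(pod_skew([90, 10, 10]))   # ~59.3
```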
4.8 QRM Data Aggregation
Data aggregation for resource utilization in QRM requires a
distributed-system design of its own, one that scales linearly,
or preferably better, with the size of the MPP system; bigger
systems do not reduce the challenge for QRM performance but
increase it.
The data aggregated in the QM Collector processes, in memory
on each worker pod, is primarily resource utilization
time-series data.
Aggregation heuristics depend on the resource type of the
metrics being aggregated. At a high level –
• The CPU utilization % should be stored per pod and not
added up, so that podSkew for CPU % in the system can
be computed.
• Memory utilization in bytes is kept per pod; each pod can
return (m + 1) time series, where m is the number of pgdb
and/or runner Docker instances per pod. Storing these
values separately allows computing podSkew for memory.
• This storage is local to the QM Collector in each pod.
• It is possible to measure skew for SQL and SQLMR
queries separately by storing resource utilization metrics
for pgdb and runner separately, instead of aggregating
them in the QM Collector.
• Newer metrics, such as network bytes sent, network bytes
received, disk bytes read and disk bytes written, can also
be reported in the same manner.
There are two ways to collect this data centrally in the
DATABASE attached to the Master pod –
• The QM Master can serially fetch (pull) the data from the
QM Collectors of all workers.
• The QM Collectors on all workers can asynchronously
message (push) the data to the QM Master.
The push mechanism can employ asynchronous messaging
queues, wherein the QM Master is subscribed to all QM
Collectors in the pods (across pod recreations). More detail on
these two approaches follows.
4.8.1 Serial Solution Formulation
In this method of fetching QRM resource utilization, the QM
Master pulls the collected resource utilization from the QM
Collectors on the worker pods on its own schedule.
1. The QM Emitter retrieves resource utilization for the Linux
processes in the container (pgdb or runner) from
procfs.
2. This utilization info (either one point or a time-series
chunk) is sent to the local QM Collector container on the
same pod. This data transfer reuses a preexisting RPC
mechanism, but could also use a RESTful API or a speedy
asynchronous message queue such as ZeroMQ[41]. The
data rests in the process memory of the collector process
inside the QM Collector until it is either evicted (expired
by time or by size limit) or collected by the QM Master.
3. The QM Master periodically polls the QM Collectors on all
worker pods via a similar data transfer mechanism and
fetches and erases all of the time-series data from each
pod, one at a time.
4. The QM Master saves the data into the DATABASE after it
is fetched from each pod.
This is shown in the flow diagram in Figure 3 below.
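A minimal sketch of this serial pull loop is shown below; the collector HTTP endpoint, port and the store_in_database() helper are hypothetical placeholders for illustration, since the production system reuses an existing RPC mechanism rather than plain HTTP.

```python
# Sketch of the QM Master's serial pull loop. The collector endpoint, port and
# store_in_database() are hypothetical placeholders.
import time
import requests

WORKER_PODS = ["worker-1", "worker-2"]   # assumed pod host names
POLL_PERIOD_SEC = 60

def store_in_database(session_id, pod, series):
    """Placeholder: insert time-series rows into the QRM table, keyed by session id."""
    pass

def poll_once():
    for pod in WORKER_PODS:
        # Fetch-and-erase: the collector returns buffered samples and clears them.
        resp = requests.get(f"http://{pod}:9091/qrm/series", params={"evict": "true"})
        resp.raise_for_status()
        for session_id, series in resp.json().items():
            store_in_database(session_id, pod, series)

while True:
    poll_once()
    time.sleep(POLL_PERIOD_SEC)
```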
4.8.2 Parallel Solution Formulation
Another design choice was to use a Proactor design to
retrieve QRM data from the QM Collectors on the worker pods.
Proactor is a software design pattern for event handling in
which long-running data sources subscribe to a central part of
the system, either for commands and control or to offload data.
A completion handler is called asynchronously when an
arbitrary condition is met at the data source – the
QM Collector in our case. Such designs follow the Hollywood
Principle ("Don't call us, we'll call you."). Here is the
flow, with a sketch after the list –
1. As the worker pods come up, every QM Collector
container subscribes to the central QM Master container.
2. On each worker pod, the Proactor subscriber in the QM
Collector container waits for a certain condition to be met.
This condition can be a periodic timer interrupt or a
more complex condition to be checked, such as an event
triggered by the size of the time series accumulated so far
on that pod.
3. The Proactor subscriber in the QM Collector reads the
time series.
4. The Proactor subscriber in the QM Collector dispatches an
event to the handler in the QM Master.
5. Either the QM Collector sends the time series as the
payload of the event itself, or the QM Master handshakes
subsequently to request the time-series data.
6. The QM Master can process, aggregate or filter the data
and then write it to the QRM tables in the DATABASE.
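The sketch below shows one way the push variant could look using an asynchronous message queue; ZeroMQ is mentioned above as an option, but the socket types, port, payload format and flush threshold here are assumptions for illustration, not the shipped design.

```python
# Sketch of the push (Proactor-style) variant with ZeroMQ (pyzmq).
# Port, payload format and the size threshold triggering a push are assumptions.
import time
import zmq

MY_POD = "worker-1"   # assumed pod identity

def read_next_sample():
    """Placeholder for the local QM Emitter read (see the procfs sketch above)."""
    return (time.time(), 0.0)

def store_in_database(pod, series):
    """Placeholder: write to the QRM tables in DATABASE."""
    pass

def run_collector(master_addr="tcp://qm-master:5555", flush_threshold=1000):
    """QM Collector side: push accumulated samples once the condition is met."""
    sock = zmq.Context().socket(zmq.PUSH)
    sock.connect(master_addr)                 # attach to the central QM Master
    buffered = []
    while True:
        buffered.append(read_next_sample())
        if len(buffered) >= flush_threshold:  # the 'completion' condition
            sock.send_json({"pod": MY_POD, "series": buffered})
            buffered.clear()

def run_master(bind_addr="tcp://*:5555"):
    """QM Master side: receive pushed series and persist them."""
    sock = zmq.Context().socket(zmq.PULL)
    sock.bind(bind_addr)
    while True:
        msg = sock.recv_json()
        store_in_database(msg["pod"], msg["series"])
```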
Figure 3: QRM Data Collection to QM Master: Serial Solution
The benefit of a pull solution is that data transfer rates can be
throttled by constraining the resources, such as the thread
pool, that implement the pull in the QM Master. A parallel or
push approach may flood the network or the QRM subsystem
precisely when queries are resource-consuming – which is a
bad time to further load the network.
4.9 Measurement Results
The measurement results specific to the queries in our MPP
execution engine are not available at the time of this writing.
These measurements will be taken in March of 2018, well
ahead of the conference presentation deadline, and will be
conveyed as soon as they are available and shared during the
presentation.
5. Security
5.1 Network Access
1. The CRM feature, in our case, exposes one port to retrieve
data from Heapster, and another port to send the
aggregated per-pod resource utilization data to
Teradata's ViewPoint UI.
2. The QRM feature exposes no new ports. The QM Master
needs no network access other than internal connectivity,
via the reused RPC mechanism, to the QM Collectors on
the worker pods to send and receive data for aggregation
and storage.
3. If more clients or data sinks are connected, the ports need
to be secured and not exposed to the outside world; at the
same time, all resource utilization data needs to be
encrypted to prevent malicious access, for both CRM and
QRM.
5.2 User Access Control To Data
1. Consider running any newly added QRM and CRM
containers as non-privileged Docker containers and as
non-privileged users by default.
2. Protect access to the newly added tables and views in
external databases and systems to which the resource
utilization data gets saved.
6. Summary
In this paper we have presented two comprehensive
mechanisms, and our experiences with them, for designing and
incorporating resource usage monitoring in a large-scale,
Kubernetes-orchestrated MPP analytic engine. Our work details
the methods and addresses multiple challenges pertinent to
measuring resource utilization in MPP systems, both at the
system level and with per-query precision, while one or more
distributed SQL and SQLMR analytic queries are processed.
Our benchmarking results for Cluster Resource Monitoring
show that the solution, built by implementing a Heapster client
and channeling the time-series data to other subsystems and
user interfaces, does not significantly load the system being
measured.
7. References
[1] www.docker.com/what-container
[2] Vivek Ratan (February 8, 2017). "Docker: A Favourite in the DevOps World". Open Source Forum, June 14, 2017.
[3] en.wikipedia.org/wiki/LXC
[4] www.linuxcontainers.org
[5] www.upguard.com/articles/docker-vs-lxc
[6] www.goto.docker.com/rs/929-FJL-178/images/Docker-Survey-2016.pdf
[7] O'Gara, Maureen (26 July 2013). "Ben Golub, Who Sold Gluster to Red Hat, Now Running dotCloud". SYS-CON Media, 2013-08-09.
[8] www.domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf
[9] www.arxiv.org/pdf/1709.10140.pdf
[10] www.blog.newrelic.com/2017/11/27/monitoring-application-performance-in-kubernetes
[11] www.newrelic.com/serverless-dynamic-cloud-survey
[12] www.vldb.org/pvldb/vol9/p660-trummer.pdf
[13] www.dcs.bbk.ac.uk/~ap/teaching/ADM2018/notes5.pdf
[14] static.googleusercontent.com/media/research.google.com/en//pubs/archive/41344.pdf
[15] citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478.9491&rep=rep1&type=pdf
[16] docs.docker.com/engine/reference/commandline/stats
[17] docs.docker.com/engine
[18] docs.docker.com/config/containers/runmetrics
[19] www.github.com/google/cadvisor
[20] hub.docker.com/r/google/cadvisor
[21] blog.codeship.com/monitoring-docker-containers
[22] www.sysdig.com/product/monitor
[23] www.tecmint.com/sysdig-system-monitoring-and-troubleshooting-tool-for-linux
[24] www.sysdig.com/blog/monitoring-kubernetes-with-sysdig-cloud
[25] www.sysdig.com/blog/alerting-kubernetes
[26] www.supervisord.org/introduction.html
[27] www.monitorscout.com
[28] en.wikipedia.org/wiki/New_Relic
[29] www.librato.com
[30] www.github.com/librato/librato-metrics
[31] www.github.com/kubernetes/heapster/blob/master/docs/overview.md
[32] www.github.com/kubernetes/heapster/blob/master/docs/storage-schema.md
[33] www.github.com/DataDog/the-monitor/blob/master/kubernetes/how-to-collect-and-graph-kubernetes-metrics.md
[34] www.stackoverflow.com/questions/33749911/a-combination-for-monitoring-system-for-container-grafanaheapsterinfluxdbcad
[35] blog.couchbase.com/wp-content/original-assets/december-2016/kubernetes-monitoring-with-heapster-influxdb-and-grafana/kubernetes-logging-1024x407.png
[36] info.teradata.com/HTMLPubs/DB_TTU_16_00/index.html#page/General_Reference/B035-1091-160K/muq1472241426243.html
[37] www.kubernetes.io
[38] www.kubernetes.io/docs/concepts/overview/what-is-kubernetes
[39] www.github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/resource-metrics-api.md
[40] Large Scale Cluster Management At Google With Borg: static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf
[41] www.zguide.zeromq.org/page:all
[42] ProcFs: en.wikipedia.org/wiki/Procfs
[43] dl.acm.org/citation.cfm?id=1687567
Introduction to containers, k8s, Microservices & Cloud Native
 
Embracing Containers and Microservices for Future Proof Application Moderniza...
Embracing Containers and Microservices for Future Proof Application Moderniza...Embracing Containers and Microservices for Future Proof Application Moderniza...
Embracing Containers and Microservices for Future Proof Application Moderniza...
 
As34269277
As34269277As34269277
As34269277
 
Multicloud Deployment of Computing Clusters for Loosely Coupled Multi Task C...
Multicloud Deployment of Computing Clusters for Loosely  Coupled Multi Task C...Multicloud Deployment of Computing Clusters for Loosely  Coupled Multi Task C...
Multicloud Deployment of Computing Clusters for Loosely Coupled Multi Task C...
 
Data management in cloud study of existing systems and future opportunities
Data management in cloud study of existing systems and future opportunitiesData management in cloud study of existing systems and future opportunities
Data management in cloud study of existing systems and future opportunities
 
Build cloud native solution using open source
Build cloud native solution using open source Build cloud native solution using open source
Build cloud native solution using open source
 
Kubernetes Basics - ICP Workshop Batch II
Kubernetes Basics - ICP Workshop Batch IIKubernetes Basics - ICP Workshop Batch II
Kubernetes Basics - ICP Workshop Batch II
 
Kubernetes is a ppt of explanation of kubernet topics
Kubernetes is a ppt of explanation of kubernet topicsKubernetes is a ppt of explanation of kubernet topics
Kubernetes is a ppt of explanation of kubernet topics
 
Containerized Hadoop beyond Kubernetes
Containerized Hadoop beyond KubernetesContainerized Hadoop beyond Kubernetes
Containerized Hadoop beyond Kubernetes
 
Adoption of Cloud Computing in Healthcare to Improves Patient Care Coordination
Adoption of Cloud Computing in Healthcare to Improves Patient Care CoordinationAdoption of Cloud Computing in Healthcare to Improves Patient Care Coordination
Adoption of Cloud Computing in Healthcare to Improves Patient Care Coordination
 
The NoSQL Movement
The NoSQL MovementThe NoSQL Movement
The NoSQL Movement
 
Dynamic Resource Provisioning with Authentication in Distributed Database
Dynamic Resource Provisioning with Authentication in Distributed DatabaseDynamic Resource Provisioning with Authentication in Distributed Database
Dynamic Resource Provisioning with Authentication in Distributed Database
 

Recently uploaded

Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 

Recently uploaded (20)

Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 

Measuring Resources & Workload Skew In Micro-Service MPP Analytic Query Engine

(ISP)[6] has gradually grown in Docker as the preferred product offering model. A container is a software program that performs OS-level virtualization by using the resource isolation features of the Linux kernel[7]. This allows applications or application components (the so-called micro-services) to be installed with all the library dependencies, the binaries, and all the configuration needed to run as if they were running on independent Virtual Machines, except that there is low overhead in starting and running them.

There are many management tools for Linux containers: LXC, systemd-nspawn, lmctfy, Warden, Docker and rkt for CoreOS[8,9]. The latter is a minimal operating system that supports popular container systems out of the box. The operating system is designed to be operated in clusters and can run directly on bare metal or on virtual machines. CoreOS supports hybrid architectures (for example, virtual machines plus bare metal). This approach enables the Container-as-a-Service solutions that are becoming widely available.
Momentum keeps building around Kubernetes. Our recent survey on server-less dynamic cloud computing and DevOps revealed that by mid-2018, more than 60% of respondents will be evaluating, prototyping, or deploying to production solutions involving container orchestration[10,11]. We see Kubernetes emerging as the de-facto standard for container orchestration as well as for large-scale distributed systems deployments. With a rapidly increasing number of large-scale systems deployed in Kubernetes environments, we wanted to explore strategies for instrumenting resource utilization measurements for applications running inside these orchestrated clusters.

This paper lays out two different resource measurement solutions with a focus on monitoring resource usage in MPP analytic systems, such as Teradata's Aster Engine. The first part of the paper discusses resource monitoring at a broad orchestrated-application level, and enumerates solutions to collect application monitoring data for exploring performance issues, troubleshooting application errors and analyzing performance trends to root-cause performance bottlenecks in the application. The question that always needs to be answered is: is the problem the input or the underlying infrastructure?

While analyzing the broader cluster-level resource utilization profile, it is often necessary in an MPP analytic engine to understand utilization at a more granular level. For example, to answer the above question fully, or to estimate future compute resource needs, it is very valuable to profile resource utilization specific to each query. It is therefore necessary to know in detail the resource usage of long-running SQL queries, or even of specific phases thereof. The second part of the paper details the Query Resource Monitoring solution implemented in Aster Engine at Teradata.

The rest of this paper is organized as follows. Section 1 continues with a further introduction to Massively Parallel Processing and container orchestration. Section 2 briefly introduces the Aster Engine in a Kubernetes environment. Section 3 details our Cluster Resource Monitoring approaches and the solution we picked, and Section 4 details our Query Resource Monitoring approach. It also highlights the commonly applicable sub-problems, such as retrieval, storage solutions and the policies for eviction of the time-series resource utilization data.

1.1 Massively Parallel Processing

Modern scale-out database engines are usually based on one of two design principles: sharded and massively parallel processing (MPP) databases. Both are shared-nothing architectures, where each node manages its own storage and memory, and they are typically based on horizontal partitioning[12,13]. Sharded systems optimize for executing queries on small subsets of the shards, and communication between shards is relatively limited[14]. MPP databases optimize for parallel execution of each query[15]. The nodes are usually collocated within the same data center, and each query can access data across all the nodes. A query optimizer generates an execution plan that includes explicit data movement directives, and the cost of moving data is taken into account during optimization. A query executing in an MPP database can include several pipelined execution stages, with explicit communication between nodes at each stage. For example, a multi-stage aggregation can be used to compute an aggregate over the entire dataset using all the nodes.
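As a minimal, purely illustrative sketch of the multi-stage idea (the worker names and values below are made up and are not Aster Engine internals), each node first aggregates only its own partition, and the partial results are then combined into the global aggregate:

    # Stage 1: each node aggregates its local partition, independently and in parallel.
    partitions = {
        "worker1": [3, 5, 8],
        "worker2": [1, 9],
        "worker3": [4, 4, 2],
    }
    partials = {node: (sum(rows), len(rows)) for node, rows in partitions.items()}

    # Stage 2: the partial (sum, count) pairs are moved to one place and combined,
    # giving the same AVG as a single-node scan of the whole table would.
    total = sum(s for s, _ in partials.values())
    count = sum(c for _, c in partials.values())
    print("global AVG =", total / count)   # 36 / 8 = 4.5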
MPP databases can use a unique mechanism to solve sub-problems using independent queries for each of them. This mechanism of issuing "child queries" reuses the same JDBC connection from a client to connect to the MPP system. While measuring resource utilization, the resource usage of these child sessions needs to be accounted toward the main query that invokes the child queries to solve the sub-problems.

Many robust, highly scalable, high-performance distributed systems and products, such as Teradata's Aster Engine analytic system, are built on top of Docker and Kubernetes technology, and these large-scale systems require monitoring solutions similar to the ones outlined below.

2. Teradata Aster Engine On Kubernetes

Kubernetes is an open-source system for automating deployment, scaling and management of containerized applications that was originally designed by Google and is now maintained by the Cloud Native Computing Foundation (CNCF). It aims to provide a "platform for automating deployment, scaling, and operations of application containers across clusters of hosts". The idea with Kubernetes is to manage containers as a group called a pod[40]. Pods can be collocated and can share network, volumes and namespace. Kubernetes thus provides a good abstraction for distributed applications that are not completely decoupled, such as Aster Engine. The networking and sharing of resources between containers is taken care of, and the end user does not need to set it up. Kubernetes also manages failure of pods and restarts a container if the container fails.

Teradata's new Analytic Platform uses Aster Engine for large-scale SQL and User Defined Analytic Function (UDF) query processing on massive data. Aster Engine is an MPP analytics database whose typical deployments include tens or hundreds of nodes. Storage and processing of large amounts of data are handled by distributing the load across several servers or hosts to create an array of individual databases, working together to present a single database image. The Master node is the entry point, where clients connect and submit SQL statements. The Master coordinates work with other database instances, called workers.
The Aster Engine Master is a pod that runs several Linux Docker containers, each responsible for specific tasks to coordinate the execution of an Aster query across workers. Database clients connect to the Master through a JDBC connection to issue queries. The Master is responsible for global optimization of the query: for parsing, serialization, query planning, query optimization and execution of phases.

Each Aster worker is a separate pod and runs several Docker containers of its own. Each worker container is responsible for a specific task during query execution under the Master's command, as per the query plan. One of the containers on each worker pod encapsulates the Postgres database (the pgdb container). The pgdb container is a standalone Postgres DB itself and receives a subset of the table that the query is executing on. This subset is called a partition. The union of the partitions across all workers is the full table, and their intersection is empty.

During execution of a query, large amounts of data can be distributed and redistributed in multiple ways across the workers, due to repartitioning, complex JOINs, etc. An entire distributed table, with its partitions stored on different workers, can also be gathered from those worker pods, operated upon with various analytic SQLMR UDFs run as queries by the client, and the resulting table can optionally be stored or returned to the client as output. SQLMR stands for SQL-MapReduce, a UDF framework that is inherently parallel, designed to facilitate parallel computation of procedural functions across hundreds of servers working together as a single relational database. The pgdb container executes the plain-SQL part of queries, and a "runner" container deployed in each worker pod executes the SQLMR UDF part of the queries. As such, during a query, these are the two main containers that do the most compute-resource-intensive tasks. The participation of these and the other containers is discussed in more detail in Section 4.
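As a minimal sketch of how such a pod and container layout can be inspected from outside the engine, the snippet below lists pods and their containers with the official Kubernetes Python client. The namespace name "aster" and the container names in the sample output comment are assumptions for illustration only, not the product's actual deployment details.

    # List the pods in an (assumed) Aster namespace and the containers inside each pod.
    from kubernetes import client, config

    def list_aster_pods(namespace="aster"):            # namespace name is an assumption
        config.load_kube_config()                      # use load_incluster_config() inside a pod
        v1 = client.CoreV1Api()
        for pod in v1.list_namespaced_pod(namespace).items:
            containers = [c.name for c in pod.spec.containers]
            print(pod.metadata.name, "->", containers)
            # e.g. a worker pod might show containers such as ['pgdb', 'runner', 'ice']

    if __name__ == "__main__":
        list_aster_pods()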
3. Cluster Resource Monitoring In MPP Engine

We present a novel approach to measure resource usage of a cluster of Kubernetes pods. Our approach is implemented in Teradata's Aster Engine. The monitoring mechanisms discussed enable system administrators and analytics users to visualize, plot, generate alerts and perform live and historical analytics on the cluster usage statistics. The solution is usable with many third-party visualizers and data miners.

There is a plethora of solutions and ideas available to analyze Kubernetes pod performance, both at cost and for free. Some are open source while others are commercial. Some are easy to deploy while others require manual configuration. Some are general purpose while others are aimed specifically at certain container environments. Some are hosted in the cloud while others require installation on one's own cluster or hosts.

3.1 Comparison of Solutions

Following are the alternatives we evaluated and found noteworthy. Quick comparisons of over a dozen third-party solutions generated ideas. We evaluated the offerings against the following criteria: affinity to data collection framework / data stores, built-in aggregation engines, pre-packaged filtering, visualization integration, alerting and logging, and, most importantly, ease of integration with Aster Engine and Teradata's Viewpoint API.

Readers should note that the comparison table below is kept brief; the assessment does not attempt to do full justice to the technological complexity of these offerings, nor to the technical evaluations we performed, let alone the effort that has gone into innovating for them. However, we wanted to incorporate something with relative ease and extensibility. Following is the list of third-party solutions that we looked at to incorporate into Aster Engine to measure cluster-level resource utilization.

1. Docker Stats API[16,17,18]: The Docker Stats API was readily applicable for us (a small usage sketch appears after the comparison table below). The approach is quite basic in terms of the utilizations it can fetch via the API at this time; we can get only per-container utilization. The benefit of using the Stats API is that its output is JSON friendly – something we need to simplify integration with Teradata ViewPoint. We would need to implement the plumbing and aggregation ourselves.

2. cAdvisor[19,20,21]: cAdvisor (Container Advisor) provides a running daemon on each Kubernetes node that collects, aggregates, processes and exports detailed resource usage information for the running containers on that node. But it does not have an orchestration purview beyond a single node, which translates to more complex aggregation implementation and testing for us.

3. SysDig[22,23,24,25]: The API provides deep visibility into containers and it has robust alerting. But this was of no immediate value, because our goal was to interface with Teradata's ViewPoint user interface, which manages alerts within its own framework. It is also a paid solution.

4. SupervisorD[26]: SupervisorD offers a client/server system to monitor and control a number of processes on UNIX-like operating systems. But this is not the most suitable solution in a Kubernetes environment.

5. Scout[27]: A hosted app and a comprehensive solution for monitoring Docker containers. We did not analyze it deeply because of some of the applicability caveats mentioned before.

6. New Relic[28]: This is a neat application performance monitoring (APM) solution offered purely as SaaS. Currently it is mainly usable for Docker, which would warrant a newly implemented resource utilization aggregation solution. In addition to being somewhat hard to incorporate, its being an at-cost solution also factored into our decision.

7. Librato[29,30]: A generic cloud monitoring, alerting and analytics solution for the resource-utilization
type of time-series data. It is well suited for collecting resource utilization or other metrics from AWS or Heroku and post-processing them, for example to find correlations and generate alerts. It is also usable as a Heapster sink, at which point it looked to us more like a consumer of resource-utilization data than a source of it.

8. Heapster[31,32,33]: This is a robust go-to solution for basic resource utilization metrics and events (read and exposed by Eventer) on any Kubernetes cluster. Heapster is a cluster-wide aggregator of monitoring and event data. It supports Kubernetes natively and works on all Kubernetes setups. We decided to use Heapster because of its ease of availability, deployment and auto-configurability, and the convenience of building our solution on top of its RESTful API.

9. Other solutions[34,35]: A ubiquitous solution we noticed is to set up the Prometheus – cAdvisor – InfluxDB and Grafana stack. This solution is geared toward administrators, end users, cloud-hosted systems, or data scientists. It did not seem to us like a solution that we could adopt for Aster Engine and easily convert the resource utilization for ViewPoint[36].

3.2 cAdvisor Introduction

cAdvisor is an open source container resource usage and performance analysis agent. It is built for containers and supports Docker containers natively. In Kubernetes, cAdvisor is integrated into the Kubelet binary. cAdvisor auto-discovers all containers on the machine and collects CPU, memory, filesystem, and network usage statistics. cAdvisor also provides the overall machine usage by analyzing the 'root' container on the machine. The Kubelet acts as a bridge between the Kubernetes master and the nodes. It manages the pods and containers running on a machine. The Kubelet translates each pod into its constituent containers and fetches individual container usage statistics from cAdvisor. It then exposes the aggregated pod resource usage statistics via a REST API. More on how Heapster works can be found in the reference section.

Alternative                 | Benefits                                              | Concerns
Docker Stats API            | Direct JSON output                                    | Hard to integrate
cAdvisor                    | Node-level option                                     | Hard to integrate
sysDig                      | Detailed; alerts                                      | Paid solution
SupervisorD                 | Detailed, Unix-like                                   | Not pod-aware
Scout                       | Comprehensive                                         | Hosted app
New Relic                   | Elegant solution                                      | Paid; not pod-aware
Librato                     | Analytics; alerts                                     | Hard to integrate
cAdvisor, InfluxDB, Grafana | Well-adopted, generic                                 | Prometheus- or plotting-friendly
Heapster                    | Robust; pod-aware; free; pluggable API, JSON output   | Needs trivial client code to transcode, aggregate
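To make item 1 of the comparison concrete, here is a minimal sketch of pulling one non-streaming stats sample per container through the Docker stats API, via the docker Python SDK. It is illustrative only; the exact JSON layout of the stats payload can vary with the Docker version, and any aggregation across containers or pods would still have to be written by hand, as noted above.

    # One-shot, per-container utilization snapshot via the Docker stats API.
    import docker

    def container_stats_snapshot():
        client = docker.from_env()
        snapshot = {}
        for container in client.containers.list():
            s = container.stats(stream=False)                 # a single JSON stats sample
            mem_bytes = s.get("memory_stats", {}).get("usage", 0)
            cpu_ns = s.get("cpu_stats", {}).get("cpu_usage", {}).get("total_usage", 0)
            snapshot[container.name] = {"memory_bytes": mem_bytes, "cpu_total_ns": cpu_ns}
        return snapshot

    if __name__ == "__main__":
        for name, metrics in container_stats_snapshot().items():
            print(name, metrics)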
3.3 Heapster Introduction

Heapster is a cluster-wide aggregator of monitoring and event data. Currently it natively supports Kubernetes[37,38] only and works on all Kubernetes setups. Heapster runs as a pod in the cluster, similar to how any Kubernetes application would run. The Heapster pod discovers all nodes in the cluster and queries usage information from the nodes' Kubelets, the on-machine Kubernetes agents. The Kubelet itself fetches the data from cAdvisor. Heapster groups the information by pod along with the relevant labels. This data is then pushed to a configurable backend for storage and visualization. Currently supported backends include InfluxDB (with Grafana for visualization), Google Cloud Monitoring and many others described in more detail in the Heapster documentation[31]. The overall architecture of the service can be seen below. A Grafana setup with InfluxDB is a very popular combination for monitoring in the open source world. InfluxDB exposes an easy-to-use API to write and fetch time-series data. Heapster is set up to use this storage backend by default on most Kubernetes clusters.

3.4 Using Heapster To Measure MPP Database Performance

3.4.1 Heapster Metric Model

The Heapster Model is a structured representation of metrics for Kubernetes clusters, which is exposed through a set of REST API endpoints. It allows the extraction of historical data for any Container, Pod, Node or Namespace in the cluster, as well as the cluster itself (depending on the metric). The model API does not conform to the standards of a Kubernetes API: it cannot be easily aggregated, does not have auto-generated serialization and clients for its types, and has a number of corner cases in its design that cause it to fail to display metrics in certain cases. Within Kubernetes, its use has been replaced by the resource metrics API and the custom metrics API, found in the k8s.io/metrics repository. New applications that need metrics are encouraged to use these APIs instead; metrics-server and custom metrics adapters provide them, respectively.

3.4.2 Kubernetes Resource Metrics API[39]

The goal of this Kubernetes community effort is to provide resource usage metrics for pods and nodes through
the API server itself. This is a stable, versioned API which core Kubernetes components can rely on. The Kubernetes apiserver persists all Kubernetes resources in its key-value store, etcd, which is not able to handle the load of frequently changing metrics. Metrics, on the other hand, tend to change frequently, are temporary, and, if lost, can simply be collected again during the next housekeeping operation. Kubernetes therefore stores them in memory, and rather than reusing the main apiserver it introduces a new one: the metrics server. This is the API that we used to collect resource usage at the entire-cluster level.

3.5 Cluster Resource Measurement System

Figure 1 below is a simplified diagram that shows our CRM subsystem implemented in Aster Engine.

Figure 1: Cluster Resource Monitoring In Aster Engine

At a high level, Aster is deployed in its own namespace and interacts with Teradata ViewPoint (on the left) over HTTP or HTTPS. A new container called "webservices" implements the logic that queries pod-level resource utilization statistics (CPU %, memory used, network IO, disk IO, etc.) from the Heapster pod. Heapster interacts with all containers of Aster and fetches utilization statistics using cAdvisor via the Kubelets on the pods, as detailed earlier. The CRM Client module implements a data format transcoder as well as aggregation logic, and serves the data through an Apache webserver running within the same container. ViewPoint is the client that requests this data using a RESTful HTTP API interface; ViewPoint plots and displays the data in a user-friendly manner.

3.6 Measurement Results

We implemented a Python client that pulls this information from the Heapster Metrics Server's RESTful API and stores it by pod. Our cluster comprised one Master pod and many distributed Worker pods that together create an MPP analytic system. The stored data contained CPU utilization in %, memory used in bytes, and network bytes sent and received. Heapster may also be able to return disk reads and disk writes per pod, per node or per container, if needed. Heapster returns time-series data related to the resource usage. Sample data for one time-point is shown below, but the entire series can be retrieved for up to the last 15 minutes, and recorded, plotted and analyzed.

A sample result is shown in the table below. The table shows real-time resource usage queried from Heapster, aggregated at the pod level. The data shows CPU, memory and network IO on the Aster Engine's Master pod and 2 worker pods. The client program we implemented queries data from Heapster using its RESTful API endpoints at the pod level and presents it to the ViewPoint user interface. The program also evicts the data from Heapster based on our eviction scheme.

Name    | Metric  | Value          | Epoch
Master  | CPU     | 33.5 %         | 1515494040
Master  | Memory  | 64180615 Bytes | 1515494040
Master  | Nw Sent | 40671 Bytes    | 1515494040
Master  | Nw Recv | 53732 Bytes    | 1515494040
Worker1 | CPU     | 45 %           | 1515494040
Worker1 | Memory  | 81601882 Bytes | 1515494040
Worker1 | Nw Sent | 81621 Bytes    | 1515494040
Worker1 | Nw Recv | 397232 Bytes   | 1515494040
Worker2 | CPU     | 40 %           | 1515494040
Worker2 | Memory  | 81604771 Bytes | 1515494040
Worker2 | Nw Sent | 81621 Bytes    | 1515494040
Worker2 | Nw Recv | 207232 Bytes   | 1515494040
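The following is a minimal sketch of the kind of Heapster client described above, written against Heapster's model API. The service address, namespace and metric names used here are assumptions for illustration (they depend on the Heapster deployment and version); the real CRM client additionally transcodes and aggregates the data for ViewPoint.

    # Pull the recent time series for a few pod-level metrics from Heapster's model API.
    import requests

    HEAPSTER = "http://heapster.kube-system.svc:8082"   # assumed in-cluster service address
    NAMESPACE = "aster"                                  # assumed Aster Engine namespace
    METRICS = ["cpu/usage_rate", "memory/usage", "network/tx", "network/rx"]  # assumed names

    def pod_metric(pod, metric):
        """Return the list of {timestamp, value} samples for one metric of one pod."""
        url = "{}/api/v1/model/namespaces/{}/pods/{}/metrics/{}".format(
            HEAPSTER, NAMESPACE, pod, metric)
        resp = requests.get(url, timeout=5)
        resp.raise_for_status()
        return resp.json().get("metrics", [])

    def latest_snapshot(pods):
        """Latest sample of every metric for every pod, keyed by (pod, metric)."""
        snapshot = {}
        for pod in pods:
            for metric in METRICS:
                series = pod_metric(pod, metric)
                if series:
                    snapshot[(pod, metric)] = series[-1]   # most recent time-point
        return snapshot

    if __name__ == "__main__":
        print(latest_snapshot(["aster-master-0", "aster-worker-0", "aster-worker-1"]))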
4. Query Resource Monitoring In MPP Engine

Resource utilization is modeled in Query Resource Monitoring (QRM) on a per-analytic-query basis. When a query takes a long time, QRM provides insights. This is useful to identify expensive phases of a query that tax certain resources more, or that skew the work distribution.

4.1 Data Transfer Load During Query Execution

In Teradata's Aster Engine, complex analytic computations can result in large data movements and compute-intensive "hot spots" on specific containers, the bulk of which is explained in Section 2. Data movement is done using Aster's proprietary data format known as "ICE", which stands for "Intra Cluster Express". ICE moves large "tuples" of data across worker pods and the Master pod during query execution, partitioning, repartitioning or complex JOINs. In addition to the pgdb and the runner containers, the ice container can also exhibit sporadic spikes in CPU, network IO and memory utilization, because of intermittently large-scale movements of partitions.

4.2 Highlights of Query Execution In Aster Engine
Broadly, there are two types of queries:

1. Plain-SQL query execution: For this we deploy a dedicated container that runs an instance of a tailored version of the proven Postgres database, the pgdb container. The pgdb container deployed on each of the workers provides, in parallel, an MPP infrastructure for running the plain-SQL parts of the user's queries in a distributed manner.

2. SQLMR query execution: The runner container provides an isolated environment to run SQLMR UDFs. The runner container deployed on each of the workers provides, in parallel, an MPP infrastructure for running the SQLMR UDF parts of the user's queries in a distributed manner.

An analytic user, such as a database administrator (DBA) or a data scientist, simply writes a query or a UDF[43] like they traditionally do, and Aster Engine distributes it on the MPP system. The system can run arbitrarily complex UDFs in this distributed environment. Each runner runs a single JVM, which is used to solve a part of the sub-problem to run the UDF. At the end of the execution, the outcomes of all of the runners are combined to form the final output, which is the result of running this SQLMR UDF in the database on the given table.

We would like to measure resource utilization within the pgdb and the runner containers on each worker pod, and then aggregate these values across workers to report the compute resource utilization caused by a specific user query on the Aster Engine cluster. At the time of this publication, CPU % utilization and memory utilization are measured and reported by the solution. We are working to incorporate disk IO and network IO; the solution for these additional metrics should fall into place since a robust framework has been built first.

Note that the uniqueness of Aster Engine as a distributed system is in the fact that we are slicing the SQLMR analytic problems across worker containers within the MPP system. The pgdb and runner containers solve the sub-problems on disjoint partitions of the database tables, in parallel, in order to achieve performance. As such, since the system does not merely slice the work by containers or pods, measuring resource utilization is non-trivial and needs to be done in a bottom-up manner. Note also that we cannot afford to spin up and shut down containers on demand if Aster Engine is to remain a high-performance MPP execution engine; the overhead of on-demand spin-up of containers would easily reduce the performance of Aster Engine on complex analytic queries due to the added latency.

In addition to accurately measuring plain-SQL and SQLMR computations across the cluster, the impact of driver functions is also measured. Driver functions use JDBC to connect back to Aster Engine and issue additional queries. The resource usage of these child sessions, and of all the queries executed as part of them, is also accounted toward the main query that invoked the driver function.

4.3 Query Resource Measurement System

Figure 2 below shows a simplified view of the QRM system. The light blue boxes represent the existing pods that are part of the Aster Engine MPP system. The MPP Execution Master creates and executes the query plan and controls execution of the query across the Aster cluster. The pgdb and runner are the Docker containers, one or more of which may exist on each worker pod, as shown below. The diagram also shows a trivial sample table of the result from the QRM system as stored in the DATABASE.
The other actors in QRM are the following:

• QM Emitter – helps retrieve memory and CPU usage information by reading it from the Linux procfs filesystem[42] for the specific processes pertaining to Aster Engine inside the Docker containers (a minimal procfs-reading sketch follows the measurement steps below)
• QM Collector – a container in each pod
• QM Master – a lightweight container in the Master pod that has the main role in QRM; it polls or subscribes to the worker pods' QM Collectors for utilization data
• Utilization in pgdb – plain-SQL resource usage is rendered by reusing the QM Emitter
• Utilization in runner – SQLMR resource usage is rendered by reusing the QM Emitter

4.3.1 Query Resource Measurement Steps

1. When the QRM sub-system is on, it looks for the start of a new session. Utilization collection starts when a new session starts.
2. The QM Emitter sends utilization data from pgdb and runner to the QM Collector local to each worker pod for the entire duration that the Aster Engine session is active.
3. Data collection stops when the end of the query session is detected.
4. Data from the QM Collectors on all workers is requested by the QM Master on the Master pod and relayed to a generic destination DATABASE where measurements are stored. This is done on a periodic basis (minutes or hours) and at the ends of sessions.
5. Data is stored in a new table in the DATABASE and is indexed by session id. Data can be pulled from the generic destination DATABASE into other reporting, post-processing and historical-analysis systems and plotting tools.
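The QM Emitter itself is not published, but the following hedged sketch shows the kind of procfs reading it is described as doing: CPU time from /proc/<pid>/stat and resident memory from /proc/<pid>/status for a set of Aster-related process ids. The function and variable names are illustrative only.

    # Sample CPU % and resident memory for a set of processes by reading procfs.
    import os
    import time

    CLK_TCK = os.sysconf("SC_CLK_TCK")                   # kernel clock ticks per second

    def cpu_seconds(pid):
        """Total user + system CPU time of a process, in seconds."""
        with open("/proc/%d/stat" % pid) as f:
            data = f.read()
        fields = data[data.rfind(")") + 2:].split()      # skip pid and (comm), which may contain spaces
        utime, stime = int(fields[11]), int(fields[12])  # stat fields 14 and 15
        return (utime + stime) / CLK_TCK

    def rss_bytes(pid):
        """Resident set size of a process, in bytes."""
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) * 1024   # VmRSS is reported in kB
        return 0

    def sample(pids, interval=1.0):
        """One (epoch, cpu_percent, memory_bytes) sample across the given pids."""
        before = sum(cpu_seconds(p) for p in pids)
        time.sleep(interval)
        after = sum(cpu_seconds(p) for p in pids)
        cpu_percent = 100.0 * (after - before) / interval
        memory = sum(rss_bytes(p) for p in pids)
        return (int(time.time()), cpu_percent, memory)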
The QRM design is a best-effort architecture, due to the nature of its purpose. If parts of the sub-system do not perform, reporting may be incomplete, i.e., yield underestimated (but not inapplicable) utilization. Specifically:

Single point of failure: If the QM Master goes down, whether due to load, container failure or a degenerate system state, the measurements may lag or utilizations may be under-estimated. In this case, Kubernetes may restart the QM Master container.

Figure 2: Simplified Query Resource Monitoring Subsystem In Aster Engine, keyed by Session id

Service protection (do-no-harm principle): Local, pod-level query utilization storage is designed in a FIFO / LRU / ring-buffer / round-robin way.

• QRM is designed to save accumulated data in a round-robin fashion. If data is not collected, either from the QM Collectors on workers or from the DATABASE, after a period of time, it may be overwritten if needed.
• This is to protect the active analytic engine and the QM Emitter from running out of memory.
• The QM Collector can cap the in-memory resource utilization data with a per-session memory limit.
• The QM Collector's in-memory data can also have an overall limit across sessions.
• The QM Collector can also have a Data Retiring Manager to retire data in the DATABASE on a per-table basis (time based or size based; long-term expiry of data).

4.4 Measuring SQL Resource Utilization

Compute resource usage by plain SQL queries is measured by measuring system load inside the pgdb containers running in each of the worker pods, and aggregating the results collected at the Master. This involves measuring resource usage by each of the Postgres processes within the pgdb containers on the worker pods.

4.5 Measuring SQLMR Resource Utilization

Compute resource usage by SQLMR queries is measured by measuring system load inside the runner containers running in each of the worker pods, and aggregating the results collected at the Master. This involves measuring resource usage by each of the JVMs running inside the runner containers.
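A purely illustrative sketch of this bottom-up aggregation is given below: per-process readings (such as those produced by a QM Emitter) are summed per container, then per worker, and finally across workers at the Master. All names and numbers are hypothetical.

    # Bottom-up roll-up of per-process (cpu %, memory bytes) readings taken at one instant.
    def aggregate_bottom_up(readings):
        """readings: {worker: {container: [(cpu_pct, mem_bytes), ...] per process}}."""
        per_worker = {}
        for worker, containers in readings.items():
            cpu = sum(c for procs in containers.values() for c, _ in procs)
            mem = sum(m for procs in containers.values() for _, m in procs)
            per_worker[worker] = (cpu, mem)
        cluster_cpu = sum(c for c, _ in per_worker.values())
        cluster_mem = sum(m for _, m in per_worker.values())
        return per_worker, (cluster_cpu, cluster_mem)

    readings = {
        "worker1": {"pgdb": [(12.0, 3000000)], "runner": [(33.0, 9000000)]},
        "worker2": {"pgdb": [(11.0, 2500000)], "runner": [(30.0, 8500000)]},
    }
    print(aggregate_bottom_up(readings))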
4.6 Solution Scalability

A system whose performance improves by adding hardware, proportionally to the capacity added, is said to be a scalable, or horizontally scalable, system. Scalability of QRM ties to the measurement load imposed on it, so scalability of QRM is somewhat inversely proportional to the scalability of the MPP system being measured: adding workers loads the QM Master in a linear manner. In an MPP analytic engine with N worker pods and m pgdb or runner instances per pod, there are (m + 1) * N total resource utilization measurements to be taken.

• Every measurement is collected at a sampling frequency, say f, or equivalently a sampling period T = 1/f, for every resource type.
• Let R be the total number of resources measured, where resource type ∈ {CPU %, Memory, Network Recv, Network Sent, Disk Reads, Disk Writes}. Currently R = 2: CPU % and Memory.
• Every sample is a time-series item, which is a pair of doubles: <time> and <utilization value>.
• So, with sampling period T, the QRM subsystem collects ((m + 1) * N * R * 2 * 8) / T bytes of data per unit time.
• Periodically, this data is fetched or pushed into long-term storage, such as the DATABASE or a third-party reporting system.
• For convenience, we estimate scale on a "per fetch by the third-party reporting system" basis; alternatively, the sampling period of such a reporting system can also be modeled here, and if modeled it proportionately affects scale.

Measurement collection is done for each query, so for Q queries running simultaneously the subsystem collects (m + 1) * N * R * Q * 16 / T bytes of data in total per unit time.

Following are a few sample scenarios:

1. For a 2000-pod Aster Engine MPP system with 2 pgdb and/or runner containers per pod, sampling the resource utilizations at a period of T = 10 sec (1/6 minute), the proposed solution will collect data at 1,152,000 bytes/min, or 19.2 KB/s, per query.
• Each additional query monitored in parallel multiplies the size of the data, modulo the fetch period of the third-party reporting system.
• If, for this scenario, the QRM data is offloaded fully into a third-party reporting system every 5 minutes, it amounts to about 5.8 MB of memory if one query is running, and 17.4 MB for 3 queries running in parallel, every 5 minutes.

2. For a small system with 2 pods, one pgdb per pod, and T = 30 sec (0.5 minute), we will collect 256 bytes of data per running Promethium query per minute.
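The arithmetic in these scenarios can be checked with a few lines of Python; the helper below simply encodes the derivation above (N pods, m containers per pod, R resources, 16 bytes per sample, sampling period T in seconds) and reproduces the 19.2 KB/s figure of scenario 1. The function name is illustrative only.

    # Back-of-the-envelope QRM data volume: bytes of time-series data generated per second.
    def qrm_bytes_per_second(n_pods, m_containers, n_resources, period_seconds, n_queries=1):
        bytes_per_round = (m_containers + 1) * n_pods * n_resources * 16   # 2 doubles per sample
        return bytes_per_round * n_queries / period_seconds

    rate = qrm_bytes_per_second(n_pods=2000, m_containers=2, n_resources=2, period_seconds=10.0)
    print(rate, "B/s")          # 19200.0 B/s, i.e. ~19.2 KB/s per query
    print(rate * 60, "B/min")   # 1,152,000 B/min, as in scenario 1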
QRM is most applicable for long-running queries. Note that in the absence of eviction, the number of queries has a linear penalty on the resources needed by QRM, as the derivation shows. The above calculations highlight the need for LRU-based local eviction and destructive read-out heuristics. Note also that QRM is off by default.

4.7 Measuring Workload Skew

Skew is one of the key characteristics of an MPP execution engine that helps determine throughput for a specific database session. In an MPP system with high cardinality, skew is a condition in which the compute work to execute a long-running database query is unevenly balanced among partitions or workers in the cluster. In any practical scenario of execution on any system, a small amount of skew is inevitable and harmless. There are mainly four skew parameters to judge the throughput and performance of an orchestrated MPP execution engine of a given cluster size:

1. Pod IO skew: comparison of the highest IO usage watermark on the busiest pod to the average use on the other pods. This can include network IO, disk IO or both.
2. Pod CPU skew: comparison of the highest CPU usage watermark on the busiest pod to the average use on the other pods.
3. Pod Memory skew: comparison of the highest memory watermark for a query on the busiest pod to the average use on the other pods.

Let podCount be the total number of pods, sum_m be the sum of metric m over all pods, and max_m be the value of metric m on the busiest pod. For IO metrics, the two components are added together: read and written bytes for disk IO, and sent and received bytes for network IO; the max and sum are computed over these totals. The Workload Skew for a specific utilization metric m across all Aster Engine worker pods is then computed as

    podSkew_m = (1 - (sum_m / podCount) / max_m) * 100

where the subscript m stands for the specific resource the skew is computed for, such as memory, CPU %, etc. This formula for podSkew helps a DBA judge the workload spread across workers. The expectation is that the overall throughput of the MPP system on a long-running query is maximum if the workload is perfectly balanced, that is, all pods have an identical workload and thus podSkew is 0.
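The podSkew formula translates directly into a few lines of Python; the sketch below is illustrative (the helper name and inputs are not part of the Aster implementation), and uses the per-worker memory values from the Section 3.6 sample as an example.

    # Workload skew (in %) for one metric, given that metric's per-pod values.
    def pod_skew(pod_values):
        """0 means perfectly balanced; values near 100 mean one pod does nearly all the work."""
        pod_count = len(pod_values)
        busiest = max(pod_values)
        if busiest == 0:
            return 0.0                                 # no usage at all: treat as balanced
        return (1.0 - (sum(pod_values) / pod_count) / busiest) * 100.0

    # Example: worker memory bytes from the Section 3.6 table (two workers)
    print(pod_skew([81601882, 81604771]))              # ~0.002 %, essentially balanced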
4.8 QRM Data Aggregation

The data aggregation in QRM for resource utilization requires a distributed-system design of its own that must be able to scale linearly, or preferably super-linearly, with respect to the size of the MPP system, because bigger systems do not reduce but rather grow the challenge for QRM performance. The data aggregated into the QM Collector processes, in memory, per worker pod, is primarily resource utilization time-series data. The aggregation heuristic depends on the resource type of the metrics being aggregated. At a high level:

• The CPU utilization % should be stored uniquely per pod, and not added up, so that podSkew for CPU % can be computed for the system.
• The memory utilization in bytes is stored per pod, with each pod potentially returning (m + 1) time series, where m is the number of pgdb and/or runner Docker instances per pod. Storing these values separately allows computing podSkew for memory.
• This storage is local to the QM Collector in each pod.
• It is possible to measure skew for SQL and SQLMR queries separately by storing the resource utilization metrics for pgdb and runner separately, instead of aggregating them in the QM Collector.
• Newer metrics such as network bytes sent, network bytes received, disk bytes read and disk bytes written can also be reported in the same manner.

There are two ways to collect this data centrally in the DATABASE that is attached to the Master pod:

• The QM Master can serially fetch, or pull, the data from the QM Collectors of all workers.
• The QM Collectors on all workers can asynchronously message, or push, the data to the QM Master.

The push mechanism can employ asynchronous-messaging queues wherein the QM Master is subscribed to all QM Collectors in the pods (across pod recreations). Following are more details on these two approaches.
4.8.1 Serial Solution Formulation

In this method of fetching QRM resource utilization, the QM Master pulls the collected resource utilizations from the QM Collectors on the worker pods on its own schedule.

1. The QM Emitter retrieves resource utilization for the Linux processes in the container (pgdb or runner) from procfs.
2. This utilization info (either one point or a time-series chunk) is sent to the local QM Collector container on the same pod. This data transfer reuses a preexisting RPC mechanism, but could also use a RESTful API or a speedy asynchronous-message queue such as ZeroMQ[41]. The data rests in the process memory of the collector process inside the QM Collector until it is either evicted (expired by time or by size limit) or collected by the QM Master.
3. The QM Master periodically polls the QM Collectors on all worker pods via a similar data transfer mechanism and fetches and erases all of the time-series data from each pod, one at a time.
4. The QM Master saves the data into the DATABASE after it is fetched from each pod.

This is shown in the flow diagram in Figure 3 below.

Figure 3: QRM Data Collection To QM Master: Serial Solution

4.8.2 Parallel Solution Formulation

Another design choice was to use a Proactor design to retrieve QRM data from the QM Collectors on the worker pods. Proactor is a software design pattern for event handling in which long-running data sources subscribe to a central part of the system, either for commands and control or to offload data. A completion handler is called asynchronously at the data source – the QM Collector in our case – when an arbitrary condition is met. Such designs follow the Hollywood principle ("Don't call us, we'll call you."). Here is the flow:

1. As the worker pods come up, every QM Collector container subscribes to the central QM Master container.
2. On each worker pod, the Proactor subscriber in the QM Collector container waits for a certain condition to be met. This condition can be a periodic timer interrupt or a more complex condition that needs to be checked, such as an event triggered by the size of the time series accumulated so far on that pod.
3. The Proactor subscriber in the QM Collector reads the time series.
4. The Proactor subscriber in the QM Collector dispatches an event to the handler in the QM Master.
5. Either the QM Collector can send the time series as the payload of the event itself, or the QM Master can subsequently handshake to request the time-series data.
6. The QM Master can process, aggregate or filter the data and then write it to the QRM tables in the DATABASE.

The benefit of a pull solution is that data transfer rates can be throttled by constraining the resources, like the thread pool, that implement the pull in the QM Master. A parallel, or push, approach may flood the network or the QRM subsystem when queries are resource consuming – which is a bad time to further load the network.
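As one possible shape of the push approach, the hedged sketch below uses ZeroMQ[41] (via the pyzmq bindings) to let each QM Collector push samples to a socket that the QM Master binds. The endpoint name, message layout and the store_in_database stand-in are assumptions for illustration, not the Aster Engine implementation.

    # Push-style QRM data collection: collectors PUSH samples, the Master PULLs and stores them.
    import zmq

    MASTER_ENDPOINT = "tcp://qm-master:5556"            # assumed service name and port

    def run_collector(pod_name):
        """QM Collector side: push one utilization sample to the QM Master."""
        ctx = zmq.Context.instance()
        push = ctx.socket(zmq.PUSH)
        push.connect(MASTER_ENDPOINT)
        sample = {"pod": pod_name, "session": 42, "metric": "memory",
                  "value": 81601882, "epoch": 1515494040}   # illustrative values
        push.send_json(sample)

    def run_master(bind_endpoint="tcp://*:5556"):
        """QM Master side: receive samples from all collectors and hand them to storage."""
        ctx = zmq.Context.instance()
        pull = ctx.socket(zmq.PULL)
        pull.bind(bind_endpoint)
        while True:
            store_in_database(pull.recv_json())          # blocks until a collector pushes data

    def store_in_database(sample):
        print(sample)                                    # stand-in for the real DATABASE insert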
4.9 Measurement Results

The measurement results specific to queries in our MPP execution engine are not available at the time of this writing. These measurements will be taken in March of 2018, well ahead of the conference presentation deadline, and will be shared during the presentation as soon as they are available.

5. Security

5.1 Network Access

1. The CRM feature, in our case, exposes one port to retrieve data from Heapster, and another port to send the aggregated per-pod resource utilization data to Teradata's ViewPoint UI.
2. The QRM feature exposes no new ports. The QM Master does not need network access other than internal connectivity, via the reused RPC, to the QM Collectors on the worker pods to receive and send data for aggregation and storage.
3. If more clients or sinks of data are connected, the ports need to be secured and not exposed to the outside world; at the same time, all resource utilization data needs to be encrypted to prevent malicious access to it, for both CRM and QRM.

5.2 User Access Control To Data

1. Consider running any newly added QRM and CRM containers as non-privileged Docker containers and as non-privileged users by default.
2. Protect access to the newly added tables and views in the external databases and systems to which the resource utilization data gets saved.
6. Summary

In this paper we have presented two comprehensive mechanisms, and our experiences, for designing and incorporating resource usage monitoring in a large-scale Kubernetes-orchestrated MPP analytic engine. Our work details the methods and addresses multiple challenges pertinent to measuring resource utilization in MPP systems, at a system level and with precision, during the processing of one or multiple distributed SQL and SQLMR analytic queries. Our benchmarking results for Cluster Resource Monitoring have shown that the solution built by implementing a Heapster client and channeling the time-series data to other subsystems and user interfaces does not significantly load the actual system being measured.

7. References

[1] www.docker.com/what-container
[2] Vivek Ratan (February 8, 2017). "Docker: A Favourite in the DevOps World". Open Source Forum, June 14, 2017.
[3] en.wikipedia.org/wiki/LXC
[4] www.linuxcontainers.org
[5] www.upguard.com/articles/docker-vs-lxc
[6] www.goto.docker.com/rs/929-FJL-178/images/Docker-Survey-2016.pdf
[7] O'Gara, Maureen (26 July 2013). "Ben Golub, Who Sold Gluster to Red Hat, Now Running dotCloud". SYS-CON Media, 2013-08-09.
[8] www.domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf
[9] www.arxiv.org/pdf/1709.10140.pdf
[10] www.blog.newrelic.com/2017/11/27/monitoring-application-performance-in-kubernetes
[11] www.newrelic.com/serverless-dynamic-cloud-survey
[12] www.vldb.org/pvldb/vol9/p660-trummer.pdf
[13] www.dcs.bbk.ac.uk/~ap/teaching/ADM2018/notes5.pdf
[14] static.googleusercontent.com/media/research.google.com/en//pubs/archive/41344.pdf
[15] citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478.9491&rep=rep1&type=pdf
[16] docs.docker.com/engine/reference/commandline/stats
[17] docs.docker.com/engine
[18] docs.docker.com/config/containers/runmetrics
[19] www.github.com/google/cadvisor
[20] hub.docker.com/r/google/cadvisor
[21] blog.codeship.com/monitoring-docker-containers
[22] www.sysdig.com/product/monitor
[23] www.tecmint.com/sysdig-system-monitoring-and-troubleshooting-tool-for-linux
[24] www.sysdig.com/blog/monitoring-kubernetes-with-sysdig-cloud
[25] www.sysdig.com/blog/alerting-kubernetes
[26] www.supervisord.org/introduction.html
[27] www.monitorscout.com
[28] en.wikipedia.org/wiki/New_Relic
[29] www.librato.com
[30] www.github.com/librato/librato-metrics
[31] www.github.com/kubernetes/heapster/blob/master/docs/overview.md
[32] www.github.com/kubernetes/heapster/blob/master/docs/storage-schema.md
[33] www.github.com/DataDog/the-monitor/blob/master/kubernetes/how-to-collect-and-graph-kubernetes-metrics.md
[34] www.stackoverflow.com/questions/33749911/a-combination-for-monitoring-system-for-container-grafanaheapsterinfluxdbcad
[35] blog.couchbase.com/wp-content/original-assets/december-2016/kubernetes-monitoring-with-heapster-influxdb-and-grafana/kubernetes-logging-1024x407.png
[36] info.teradata.com/HTMLPubs/DB_TTU_16_00/index.html#page/General_Reference/B035-1091-160K/muq1472241426243.html
[37] www.kubernetes.io
[38] www.kubernetes.io/docs/concepts/overview/what-is-kubernetes
[39] www.github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/resource-metrics-api.md
[40] Large Scale Cluster Management At Google With Borg: static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf
[41] www.zguide.zeromq.org/page:all
[42] ProcFs: en.wikipedia.org/wiki/Procfs
[43] dl.acm.org/citation.cfm?id=1687567