Data processing in the enterprise quickly shifts from “good enough” to “we need more and faster” as
expectations grow. The Zeta Architecture is an enterprise architecture that simplifies business
processes and defines a scalable way to increase the speed of integrating data into the business.
There is no successful path to the future without understanding and appreciating history, which is
why it is of the utmost importance to understand how the current state of enterprise architectures has
come about.
While building out a data center, resources are often thought of as pools of servers, where each pool
meets the needs of a specific use case. Lines are drawn between the pools of servers, resulting in
static partitions. They are static in the sense that the resources cannot grow dynamically; growth in
any particular partition over time has no direct effect on the other partitions. This partitioning model
simplifies troubleshooting, because a failure can be traced to one of those static partitions.
Static partitioning also enables a simple way to calculate the theoretical maximum throughput of the
software running in a partition, which makes capacity planning straightforward. Engineering
teams are usually concerned with understanding the capacity of the software, while the IT operations
team needs to understand where to add capacity for future growth. This information gives you a
maximum for your volume, your compute, and your memory. Most use cases will never realize
complete utilization of resources in all given pools, due in part to the workload imbalance
created by static partitioning.
Zeta Architecture: Introduction
MapR Technologies, Inc. White Paper, March 2015
A Brief History of Enterprise Architectures

Resource isolation is a big deal, and as nearly every engineer will attest, fast troubleshooting is very
important. Production or IT operations, development, and QA all need mechanisms to isolate issues so
they can understand where a problem originates, and whether they are dealing with one issue or
several. Their goal is to quickly track down and identify an issue, deploy a fix, and ensure that the
problem has been resolved.

Isolated Workloads Come at a Cost
Business continuity encompasses everything that keeps your business in business. We’ve got to
make sure that we don’t forget about things like backups and the schedules that come along with them.
Disaster recovery plans should be in place not only for peace of mind, but to ensure that the business
can continue in the face of the unexpected. Backup plans are generally defined against each of the
static partitions, and tend to cover everything from recovering a single lost server to recovering the
entire data center. Most of these plans, including the recovery plans, will have different levels of outage
preparedness, ranging from hours to days of downtime. Clearly every business can benefit from having
rock-solid plans and processes for outages.
One of the most notorious issues with isolated workloads is wasted capacity and wasted energy. Think
about a common use case like web servers. Take an instance where a business runs about 10 web
servers at 5% utilization (very normal for web servers), delivering web content with a load balancer in
front so it can handle traffic spikes. If utilization is nearly always below 10%, that leaves at least 90% of
the capacity constantly wasted. Not only is there a capital cost for the 10 web servers, but when
factoring in the energy costs, it becomes an even bigger deal. It sure would be nice to get better
utilization of capital for the business.
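The waste described above is easy to quantify. The sketch below is purely illustrative (the function name and figures are our own, taken from the scenario in the text), and it computes how much of a statically partitioned pool sits idle:

```python
def wasted_capacity(servers: int, utilization: float) -> dict:
    """Estimate idle capacity in a statically partitioned server pool.

    `servers` is the pool size and `utilization` is the average fraction
    of each server's capacity actually in use (0.05 means 5%).
    """
    idle_fraction = 1.0 - utilization
    return {
        "idle_fraction": idle_fraction,
        # Idle capacity expressed as whole-server equivalents.
        "idle_server_equivalents": servers * idle_fraction,
    }

# The scenario from the text: 10 web servers running at about 5% utilization.
# Roughly nine and a half servers' worth of capacity is paid for but never used.
print(wasted_capacity(servers=10, utilization=0.05))
```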
Isolation of resources isn’t free, because every server needs to be monitored. Underutilized hardware
consumes more than just energy; it also consumes the time needed to manage it, keeping it secure and
up-to-date.
Processes that move data from the servers generating information to the servers processing it tend to
be rather complicated to set up and manage. Beyond the processes themselves, they normally require
people to monitor them around the clock. These jobs typically sit in a very high-profile position in a
business workflow; if they are tied to revenue generation and any of them fails, you may have to answer
to your customers.
In any good agile deployment process, there is a desire to promote software between any number of
environments to support the business. Promoting software between environments is tricky, because
environments tend to come in different shapes and sizes, and a development environment usually does
not contain the same number of servers per pool as a production environment. Given 3 servers in QA
and 100 servers in production, is there a guarantee that the code that was tested is going to act the
same way in production? Most assuredly not. Most people have probably lived through this scenario
when going into a production environment. This is perhaps one of the most difficult and least fun
things to troubleshoot.
The Model of This New Architecture
Goals with a New Approach
The first goal of a new enterprise architectural approach should be the ability to leverage all existing
hardware in the data center. This would enable resources to be put on any business problem at any time.
There is still a need to maintain some form of isolation that meets the needs discussed in the current model.
The requirements for moving software between environments need to be understood, and the processes
need to be able to accommodate the new architecture and deliver more than what already exists.
Backing up data for point-in-time recovery, whether to tape or any other medium, needs to improve on
what exists today. Too many architectures deliver no real added benefit for disaster recovery, and the
restoration processes after a serious disaster could take weeks. The goal of this new approach should
be to support real-time business continuity. This means that in the face of a disaster, any recovery
should be accomplished within a time frame in line with high availability expectations (e.g. 99.9% or
better). That is to say, this architecture will deliver the ability, but the onus is still on the implementer
to know how many nines are necessary for the business.
A cohesive security and compliance model including authorization and authentication should be consid-
ered to make management of systems easier and less prone to error. All the components have to be able
to work with the same security controls. Users, jobs, and data need to be secured. We must ensure that
even the most stringent regulatory environments are able to use this architecture.
The high-level component view of this architecture is intended to support the goals defined for this new
architecture. It is not intended to dictate which specific software or project, open source or otherwise,
must be used. There are seven pluggable components in this new architecture, and all of the
components must work together:
• Distributed File System. Utilizing a shared distributed file system, all applications will be able to read
and write to a common location which enables simplification of the rest of the architecture.
• Real-time Data Storage. This supports the need for high-speed business applications through the use
of real-time databases.
• Pluggable Compute Model / Execution Engine. Different groups within a business have different
needs and requirements at any given time, which requires support for potentially different engines
and models to meet the demands of the business.
• Deployment / Container Management System. A standardized approach for deploying software is
important; all resource consumers should be able to be isolated and deployed in a standard way.
• Solution Architecture. This focuses on solving a particular business problem. There may be one or
more applications built to deliver the complete solution. These solution architectures generally
encompass a higher-level interaction among common algorithms or libraries, software components
and business workflows. All too often, solution architectures are folded into enterprise architectures,
but there is a clear separation with the Zeta Architecture.
• Enterprise Applications. In the past, these applications would drive the rest of the architecture.
However, in this new model there is a shift. The rest of the architecture now simplifies these
applications by delivering the components necessary to realize all of the business goals we are defining
for this architecture.
• Dynamic and Global Resource Management. Allows dynamic allocation of resources to enable the
business to easily accommodate whatever task is the most important that day.
As we look at what technologies can fit in here, we’re basically going to start right in the middle. Mesos
is a data center-wide resource manager; YARN is a resource manager for functionality that lives in the
Hadoop ecosystem. Used alone, they create silos of clusters. To get around this, Project Myriad can be
utilized: Myriad enables Apache Mesos to manage YARN. Combined, these resource management
tools bring all the resources into a single cluster.
There is flexibility within the area of the distributed file system. When running on a cloud provider
like Amazon, there is S3. Within a private data center, there is MapR-FS or HDFS. What is important
to understand is that these capabilities are the foundation of the rest of this architecture. While
MapR-FS implements all of the APIs supported by HDFS, it delivers functionality that is not available
within HDFS.
Real-time applications require guarantees on data retrieval and storage. This includes technologies
like HBase and MapR-DB, which fully implements the HBase APIs. While this is the area into which
Cassandra and MongoDB would fall, they are not referenced in this architecture because they do not
support running on distributed file systems like MapR-FS or HDFS. While it is conceivable that they
could be adapted to run here, that self-limitation is what prevents them from participating in this
architecture.
The compute model / execution engine is where the biggest opportunity shows up from an analytics
and streaming perspective. In general, more than one engine will be used at a time to cover multiple
use cases, and they all need to support the distributed file system in order to leverage all of the
compute power. This enables problem solving with multiple technologies, including Hadoop
MapReduce, Apache Drill, Apache Spark or any others that can work with this distributed file system.
The other benefit is that, when combined with the global resource management and full access to all
the data, those who perform analytics work can have access at all hours of the day, with resources
constrained or expanded based on production utilization.

Example technologies that fit into the Zeta Architecture
The containers portion of this architecture delivers a type of isolation that is important in certain use
cases. The isolation provided by containers makes it easier to move software from development to QA
to production. Mesos ships with its own container system, but it also supports Docker and Kubernetes.
This provides a better process model, which helps ensure consistent software between environments.
In the solution architecture space there are concepts like machine learning, recommendation engines
or even the Lambda architecture. These are solution architectures that are going to leverage this
platform, and you need to be able to describe them in a way that is more specific than the enterprise
architecture itself.
The simplest example of an enterprise application that could be used here is a web server. Take an
Apache web server deployed in a container that is configured to write its logs straight through to the
distributed file system. This bypasses log shipping and allows for the data to be processed or analyzed
immediately, without delay.
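The pattern just described can be sketched in a few lines. This is a hypothetical illustration, not MapR or Apache code: the helper appends an Apache-style access-log line directly to a directory that, in a Zeta deployment, would sit on the distributed file system (any local directory can stand in for the mount):

```python
import os
from datetime import datetime, timezone

def append_access_log(log_dir: str, host: str, request: str, status: int) -> str:
    """Append one Apache-style access-log line directly under `log_dir`.

    In a Zeta deployment, `log_dir` would live on the shared distributed
    file system (for example via an NFS mount of MapR-FS); any local
    directory can stand in for that mount in this sketch.
    """
    os.makedirs(log_dir, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%d/%b/%Y:%H:%M:%S +0000")
    line = f'{host} - - [{stamp}] "{request}" {status} -\n'
    path = os.path.join(log_dir, "access.log")
    # No log shipping step: the write itself is the delivery.
    with open(path, "a") as f:
        f.write(line)
    return path
```

A call like `append_access_log("/mapr/my.cluster/web/logs", "10.0.0.1", "GET / HTTP/1.1", 200)` (the mount path is hypothetical) leaves the line where any execution engine can read it immediately.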
Implementations

Google’s Example
This architecture will allow anyone who implements it to run at Google scale. As a point of reference,
here is a mapping of Google onto this architecture.

Technologies Google leverages laid over the Zeta Architecture
Let’s take a look at a few interesting points regarding Google’s technologies in this diagram. Borg is
sometimes referred to as “the unnamed project” within Google, but outside of Google it’s called Borg.
Omega is their scheduler, and they describe it as the crux of the entire distributed processing platform,
as it figures out where and when to place jobs. From a solution architecture perspective, Gmail
conceptually operates on top of a recommendation engine. Machine learning concepts in general are
delivered in many of their product offerings.
Take a step back for a moment to understand all of the components that comprise a familiar application.
It is probably implemented with many of these same concepts. The question is, “Does the application
leverage all of these in a heterogeneous way?”
Ad Serving (Recommendation Engine) Example
Web servers and advertising make good implementation examples, as they are a cornerstone of the
internet. The high-level architecture for such applications is not overly complicated.
At almost every tier of this application architecture, logs are emitted and collected. Collecting those
logs is important to advertising, as they are used to generate revenue calculations as well as analytics
on the performance of advertisements. This creates a feedback loop to optimally tune the advertising
engine. In general, this diagram is not overly complicated, and it should make sense. When the
application architecture is laid on top of the new Zeta Architecture, a number of simplifications occur.
Now the web server, advertising engine, analytics execution engine, distributed file system and real-time
data store are all running on each server or in any combination necessary based on load requirements
and how many instances are dynamically started. The first benefit is that the logs generated by the web
server and advertising engine land directly on the distributed file system. Since the data is landing where
it is processed, the execution engine doesn’t have to wait for Flume processes to move the data. This
also means there are no people monitoring the Flume processes to ensure data makes it to the analytics
cluster in a timely fashion.
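Because the logs land where they are processed, an analytics job can aggregate them in place as soon as they are written. The sketch below assumes a simple CSV impression log with `ad_id` and `price` columns (our own illustrative format, not a standard one) and rolls up revenue per advertisement:

```python
import csv
from collections import defaultdict

def revenue_by_ad(rows):
    """Roll up total revenue per ad id from impression records."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["ad_id"]] += float(row["price"])
    return dict(totals)

def revenue_from_log(path: str) -> dict:
    """Aggregate a CSV impression log with an `ad_id,price` header.

    In the Zeta layout, `path` would point straight at the file the
    advertising engine wrote on the distributed file system, so the
    rollup can run the moment the data lands.
    """
    with open(path, newline="") as f:
        return revenue_by_ad(csv.DictReader(f))
```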
Generalized Digital Advertising Platform Architecture
The user records that advertisements are generated for come straight out of the real-time data store
and go right back in after modifications are made. This puts the data much closer to the advertising
engines where it is used.
Notice that the billing system is located in a relational database (RDBMS) outside of the core distributed
file system. That isn’t a requirement; it is just the most common scenario.
All processes running in the data center should be broken into two groups. The first group consists of
those that offer resources: the global resource manager (CPU and memory) and the distributed file
system (disk space and I/O). The second group consists of those that consume resources: web servers,
Apache Drill, and Apache Spark, among others. Resource consumers should be containerized, whereas
those offering resources should never be containerized.
Integration into the Zeta Architecture
Integrating business applications into this architecture requires plugging into standard APIs. Many
custom adapters have been written to work with the HDFS API; however, most integrations require
some sort of custom plugin to fully utilize HDFS. MapR-FS, by contrast, has native NFS support; in
this case, any application that can read and write to an NFS mount can plug into this architecture. The
added benefit of this approach is that when an application plugs in through these standards, its data is
automatically replicated by the distributed file system.
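In practice this means an application needs nothing more than ordinary POSIX file I/O. The sketch below is hypothetical (the mount path and helper names are our own): it writes a record through what would be the NFS mount of the distributed file system and reads it back, with replication handled entirely by the file system underneath:

```python
import json
import os

def export_record(mount_dir: str, name: str, record: dict) -> str:
    """Persist an application record using only standard file I/O.

    `mount_dir` stands in for an NFS mount of the distributed file
    system (on MapR-FS, something like /mapr/<cluster>/apps/...).
    Because only POSIX file operations are used, no custom HDFS plugin
    is needed, and replication is handled by the file system itself.
    """
    os.makedirs(mount_dir, exist_ok=True)
    path = os.path.join(mount_dir, f"{name}.json")
    with open(path, "w") as f:
        json.dump(record, f)
    return path

def import_record(path: str) -> dict:
    """Any other consumer (analytics job, loader) reads it back the same way."""
    with open(path) as f:
        return json.load(f)
```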
A pluggable security model is required: applications come in many varieties, and expecting them all to
implement the same security model is unrealistic. Linux pluggable authentication modules (PAM) are
very convenient in most cases, as they offer a tremendous amount of flexibility. Kerberos is an option
here, but it is not a perfect solution for long-running jobs.
While many RDBMSes could potentially work in this model, most do not openly support these
distributed file systems. Some have their own, but those are explicitly for that product’s use. Some will
work just fine over a native NFS adapter, while others may not. If the RDBMS of choice supports this,
there is a great opportunity, provided the data format can be read by the analytics execution engine.
Digital Advertising Platform on the Zeta Architecture
Historically, data analytics teams get the short straw when it comes to resources. They don’t typically
get access to production systems; generally they have to work from data dumps with less than
adequate compute resources. This model enables that part of the business by allowing them to
participate in this new type of isolation and have dynamic access to the globally managed resources.
Zeta Simplifies Application Architectures
Nearly every application architecture needs to concern itself with many different things, including data
protection schemes, how to back up data, recovery from failures, and running multiple instances of
software. The Zeta Architecture simplifies application architectures because it delivers many of those
pieces, which means there is less to go wrong. Fewer moving parts means fewer potential failure
points. Better hardware utilization means less to operate and lower operational costs. The business is
then capable of leveraging a global set of resources to solve any problem based on what is most
important right now. Priority number one can change quickly in any business.
Resilience is extremely important in an application architecture. The Hadoop ecosystem components
help protect against disk and server failure; however, they don’t protect against people making
mistakes. With a statically partitioned model, full backups are usually performed only once per week,
with partials performed nightly, and recovering from those takes significant time. In this new model,
recovery is easier to plan for and more resilience is available in the system, due primarily to near
real-time backups and the features of the distributed file system.
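The recovery model can be illustrated with a toy snapshot scheme. A production distributed file system (MapR-FS snapshots, for example) takes point-in-time snapshots with copy-on-write metadata rather than full copies, so the sketch below mimics only the behavior, not the mechanism, and every name in it is our own:

```python
import os
import shutil
from datetime import datetime, timezone

def take_snapshot(data_dir: str, snap_root: str) -> str:
    """Create a timestamped point-in-time copy of `data_dir`.

    A real distributed file system does this with copy-on-write
    metadata, making snapshots near-instant; the full copy here only
    illustrates the recovery model.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S.%f")
    snap_dir = os.path.join(snap_root, stamp)
    shutil.copytree(data_dir, snap_dir)
    return snap_dir

def restore_snapshot(snap_dir: str, data_dir: str) -> None:
    """Roll live data back to the chosen point in time."""
    shutil.rmtree(data_dir)
    shutil.copytree(snap_dir, data_dir)
```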
Streaming Applications
Occasionally there is a need to stream data in, as opposed to waiting for some periodic interval of time
before processing it. If an application architecture calls for acting in real time on each and every event
that may occur in a log file, then streaming should be considered. This falls into the pluggable compute
model / execution engine portion of the Zeta Architecture, and it may or may not be considered
“analytics” based.
For this use case, there are a few options available. The first is to set up the stream processing engine
with a source that tails the log file from the distributed file system. The second approach is for the
application generating the logs to write the log information to some type of agent that can persist the
log to disk and send it to the stream processing engine simultaneously. The final approach is to skip
the disk altogether and send the data directly to the stream processing engine or to a queue sitting in
front of it. Each of these approaches has its own benefits and tradeoffs, all of which should be
considered before making a selection.
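The first option, tailing the log file, can be sketched as a polling source. The function below is illustrative and not part of any streaming engine’s API: it returns every complete line appended since a byte offset, so a loop can feed new events to the stream processor as they arrive:

```python
def tail_since(path: str, offset: int):
    """Return (lines, new_offset) for data appended since byte `offset`.

    A streaming source can call this in a polling loop against a log
    file on the distributed file system and hand each complete line to
    the stream processing engine as soon as it appears.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    # Hold back a trailing partial line until the writer finishes it.
    end = chunk.rfind(b"\n") + 1
    complete = chunk[:end]
    lines = [raw.decode("utf-8") for raw in complete.splitlines()]
    return lines, offset + end
```

Tracking the offset externally lets the source resume after a restart without re-emitting events, which is the usual design trade-off against the agent-based and direct-send options described above.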
Summary
The benefits of the Zeta Architecture are plentiful. Google pioneered this architecture, relies on it for
their entire company, and it has served them very well; Google also performs over two billion container
deployments per week. Containers help deliver the isolation needed to move into the future. The Zeta
Architecture gives any company that uses it a competitive advantage, and it will become the
conventional way to build and deploy software in the data center, whether on-premise or hosted. This
is the model for creating an as-it-happens business: one that can sense and respond in real time to its
environment.

To read a summary of the business benefits of utilizing this architecture, or for a summary document
to share with people who don’t want as many technical details, download the Building the Data Centric
Enterprise white paper.