Realizing a shared, multi-tenant infrastructure for Big Data and Analytic applications using IBM® InfoSphere® BigInsights and IBM Platform Computing™
Last revised: April 19, 2014
By: Gord Sissons
Steven Sit
Eric Fiala
Michael Feiman

Contents
Document History
Introduction
Disclaimers and limitations
About the customer described in this use case
Industry Challenges
Impact on Information Technology
The Big Data Environment
Hardware Infrastructure
The Software Environment
Customer Requirements
Installing InfoSphere BigInsights for Multi-tenant services
Installation steps
Accessing the Platform Symphony Management Console
Accessing the Platform Symphony knowledge center
Platform Symphony Concepts
An example of configuring a cluster for multi-tenancy
Adding users to run MapReduce applications
Provide access to the BigInsights / Platform Computing cluster
Understanding Platform Symphony Impersonation
Configuring OS groups for the multitenant environment
Submitting a test job as a user to verify the configuration
Associating BigInsights with a Symphony Application
Enabling Symphony Repository Services
Adding a new Application / Tenant
Configuring application properties
Associating applications with consumers
Accessing Consumer Definitions
Manually editing Consumer Tree definitions
Controlling access to applications and consumers
Determining the execution user for a consumer
Configuring Sharing Policies
Summary

Document History
Date of this revision is Saturday April 19, 2014
Revision  Date            Summary of changes
0.9       March 23, 2014  Initial draft
0.95      April 19, 2014  Incorporated many valuable comments from Steven Sit based on his direct client experience (thank you, Steven)
Introduction
This document is written for IBM and partner architects. It is intended to be a guide for those working
with customers deploying IBM InfoSphere BigInsights and other Hadoop offerings together with IBM
Platform Symphony. While this paper describes the details of one customer implementation, we believe
that this use case is relevant to others as well. Challenges related to Hadoop multitenancy are faced by
customers across multiple industries.
The target audience for this document includes:
• Architects responsible for deploying big data or analytic workloads
• Technical users looking for ways to deploy Hadoop on shared clusters
• IBM architects, ISVs or business partners interested in building multitenant Big Data
environments to help customers reduce infrastructure requirements and save cost
This paper does not delve into YARN. YARN is another important (but less mature) technology that
delivers some of the capabilities described herein. It is important for IBM customers to understand that
IBM InfoSphere BigInsights is a safer choice in the sense that it supports open source technologies like YARN while
simultaneously offering more advanced capabilities. IBM’s view is that clients can best determine what
capabilities they need, and IBM InfoSphere BigInsights gives them that flexibility: the best of a
100% open source distribution along with significant value-added capability.
In the customer example documented here, the business advantage of using proprietary capabilities
(IBM Platform Symphony) dramatically outweighed the benefits of being “pure” from an open source
standpoint. The client was able to consolidate roughly 30 applications onto a shared infrastructure and
avoid the significant incremental capital expense that would have been required to set up separate clusters
had the client decided to proceed with open source YARN only.
Disclaimers and limitations
The details of the customer implementation are proprietary and confidential. As such, while we can
describe what was done technically, we cannot share details of how this customer used particular
applications. As a result, the examples provided herein are meant to explain qualitatively what was
achieved by the customer without betraying confidential information. The details and screenshots in this
document are not from the customer environment. They have been reproduced on a small test cluster
to explain particular capabilities that the client chose to take advantage of.
About the customer described in this use case
The customer described in this paper is a full-service financial services provider. They offer a broad range
of products to their clients including insurance, banking, investing, real estate, retirement planning,
wealth management and health insurance. Like many in the financial services sector, this customer is
increasingly deploying Hadoop-based applications to augment their data warehouse. They are motivated
by the following imperatives:
• The need to leverage big data analytics to make better business decisions, improve customer
relations and develop innovative new products and services
• The need to contain or reduce costs (the cost of storing and processing data on a Hadoop cluster
is an order of magnitude less than persisting the same data in their data warehouse)
• The desire to architect their environment as a shared service to avoid each line of business
building their own discrete analytic environments on premise or in the cloud
Industry Challenges
Like many industries, the sector represented by this client is going through significant change. As a
full-spectrum provider, the client is disproportionately impacted by regulation. As a bank, not only are they
subject to various provisions in legislation like Dodd-Frank, but they are also impacted by insurance
industry requirements such as the NAIC’s Risk Management and Own Risk and Solvency Assessment Model Act (RMORSA) and
other Enterprise Risk Management initiatives that arose in response to the financial
crisis of 2008.
Of particular consequence is the Volcker Rule, a provision of the Dodd-Frank Act that gives regulators the ability to
limit or prohibit certain types of proprietary trading activities. While the rule is directed at retail
banks, this client will be impacted across their insurance and wealth management businesses, where
proprietary trading is important to maximizing investment gains.
As if this tsunami of new regulation were not enough, fundamental changes are also taking place in the
insurance industry, driven by external factors. Among these factors are new disruptive
technologies. Big data, social and mobile technologies are prominent drivers of change. Some specific
challenges to the business are:
• Driven by high-profile events and the increased frequency of natural catastrophes, contingent
business interruption (CBI) modeling is emerging as a priority for insurance firms.
• Dramatic changes driven by technology are promising to fundamentally change auto insurance.
Among these factors are collision avoidance technologies that promise to shift liability from
drivers to manufacturers, social media technologies enabling insurers to seek out and market to
lower risk consumer pools, and advances in GPS and vehicle telematics that promise to provide
insurers with more granular data on which to base risk assessments.
• Technological advances are leading to an explosion in available information, and in firms that
aggregate such information to help insurers better qualify risk.
• Widespread consumer use of mobile and social technologies is causing firms to
rethink how they promote their brand and provide services to both their customers and
agents/advisors.
• Advances in analytic techniques are making it easier for insurers to collect, process and visualize
information. This is extending beyond core actuarial techniques to include approaches like
predictive analytics, natural language processing, social network analysis and simulation-based
analytics.
• Additionally, new technologies are changing how information is stored and processed.
Distributed file systems and clustered technologies like Hadoop can provide a significant
per-terabyte cost advantage over traditional warehouses. Because of these cost advantages, and
because the framework is well suited to storing and processing unstructured or semi-structured
data, this customer and similar firms are embracing Hadoop as a platform for many new
applications.
We point this out because risk management, which relies heavily on Monte Carlo simulation
and actuarial modeling, and big data analytics are converging. Both depend on scaled-out
infrastructure. Firms that understand this convergence can obtain a cost advantage relative to their
competitors.
Impact on Information Technology
Both the regulatory challenges described above and the technological shifts and business
pressures are driving the need for greater data processing and analytic capacity.
• Traditional data warehouses cannot scale cost-efficiently to manage the vast amounts of data
being collected and processed, nor can they handle the raw volumes of unstructured data involved.
• Organizations need more agile application development methodologies and toolsets that allow
them to evolve data schemas and applications on the fly as they continuously incorporate new
sources of data into their models.
A one-to-one mapping between applications and infrastructure is no longer practical. Many applications
(Hadoop, scenario generation, Monte Carlo simulation and ETL processing) rely on distributed
infrastructure that scales horizontally. Replicating this clustered infrastructure for each line of business
and each application would be cost prohibitive.

The Big Data Environment
Hardware Infrastructure
The physical infrastructure deployed by this client is shown pictorially in Figure 1. While there are
actually four identical 16-node clusters, only the production environment is shown here. The server
infrastructure is based on an IBM System x reference architecture for InfoSphere BigInsights. Each
cluster node has 12 CPUs, over 60 GB of memory and 12 locally connected physical disks. The
production cluster has 192 TB of disk and approximately 1 TB of memory.
A unique feature of this environment is that the cluster is shared by approximately 30 different user
groups spread across several lines of business.
Figure 1: Physical infrastructure for shared Hadoop Platform
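The capacity figures quoted for the production cluster can be cross-checked with a little arithmetic. The sketch below uses the node count, disks per node and total disk capacity stated in the text. The 64 GB per-node memory figure is an assumption (the text says only "over 60 GB"), chosen because it is consistent with the stated total of approximately 1 TB.

```shell
#!/bin/sh
# Sanity-check the production cluster figures quoted in the text.
nodes=16              # nodes in the production cluster (from the text)
disks_per_node=12     # locally connected disks per node (from the text)
total_disk_tb=192     # total disk capacity in TB (from the text)
mem_per_node_gb=64    # ASSUMPTION: the text says only "over 60 GB"

total_disks=$((nodes * disks_per_node))
disk_tb_per_node=$((total_disk_tb / nodes))
tb_per_disk=$((total_disk_tb / total_disks))
total_mem_gb=$((nodes * mem_per_node_gb))

echo "disks in cluster:   $total_disks"
echo "disk per node (TB): $disk_tb_per_node"
echo "size per disk (TB): $tb_per_disk"
echo "total memory (GB):  $total_mem_gb"
```

Under these assumptions the 192 TB total works out to 12 TB per node, or 1 TB per disk, and 16 nodes at 64 GB each give 1,024 GB of memory.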
The Software Environment
The Linux-based infrastructure supports multiple big data and analytic applications.
Among these applications are:
• IBM InfoSphere BigInsights (providing core Hadoop services)
• Datameer (for data visualization)
• IBM Tealeaf (customer experience analytics platform)
• Open source Sqoop 1.2.4, used to perform bulk data transfers to and from various data sources
including an operational data warehouse and the production Hadoop cluster
• Various MapReduce streaming applications, where for convenience of development Map and
Reduce logic is expressed as Perl scripts
• Many in-house developed Java applications
• Various ETL scripts running in and out of the Hadoop MapReduce framework
The IBM-furnished software environment comprises the following major components:
• IBM InfoSphere BigInsights Enterprise Edition
• IBM Platform Symphony Advanced Edition (the software is bundled with BigInsights Enterprise
Edition for a single tenant, and this client has purchased a production license)
• IBM GPFS FPO (providing a POSIX compliant file system that fully preserves HDFS semantics)
Customer Requirements
This customer requires a multi-tenant environment for the business reasons listed below.
• They wish to share infrastructure between multiple departments and lines of business both to
boost capacity (by allowing departments to tap capacity not being used by others) and to reduce
costs by avoiding the need for separate physical environments.
• They need the ability to guarantee service levels to different tenants to ensure that business
critical applications can run in a predictable fashion. For example, ETL or specific database load
operations must run within an overnight batch window.
• Because many services are long-running, agile pre-emption is required to make sharing practical,
so that urgent jobs do not need to wait behind long-running jobs on the cluster.
• The client needs to ensure that data is segmented between different tenants on the shared
environment for security and privacy reasons.
• Finally, the client requires multi-tenancy for technical reasons that are sometimes overlooked.
As the environment evolves, they need the flexibility to deploy different versions of software
components that may have specific dependencies. A specific example is this client’s requirement
to use a more recent version of open-source Sqoop, distinct from the version included in
BigInsights 2.1.0.1, the version deployed at the time of this writing.

Different Hadoop vendors have different definitions of what they mean by multi-tenancy, so it is
important not to confuse the multitenant capabilities offered by IBM in Platform Symphony with those of
open source offerings like YARN, which is much less capable. While YARN is an important technology
that IBM supports, its capabilities are well behind those described here.
Installing InfoSphere BigInsights for Multi-tenant services
Realizing a multitenant environment for BigInsights or other applications requires the use of IBM
Platform Symphony Advanced Edition. A run-time version of IBM Platform Symphony Advanced Edition
that enables a single tenant is included with IBM InfoSphere BigInsights Enterprise Edition 2.1 or later.
For historical reasons, the Platform Symphony resource manager and workload manager are referred to
in the BigInsights documentation as Adaptive MapReduce. Clients wanting the multitenant
capabilities described in this document will need to license a full version of Platform Symphony Advanced
Edition.
Note that licensing is not enforced by the software directly. Customers can pilot these multitenant
capabilities using only the software included in the BigInsights 2.1 Enterprise Edition or later release
along with appropriate patches.
Installation steps
Fortunately, it is getting steadily easier to make these products work together. While manual
configuration was required in prior releases, as of BigInsights 2.1 EE a simple patch can be applied to
unlock all of the features of Platform Symphony Advanced Edition and have it work with BigInsights. For
future releases starting in the spring of 2014, full functionality of Platform Symphony will be provided
“out of the box” with BigInsights with no requirement for a patch. (Please note that customers will still
need to license the software before using it in production.)
The high-level steps to implement InfoSphere BigInsights 2.1 (or later) with IBM Platform Symphony
Advanced Edition are as follows:
• Install IBM InfoSphere BigInsights Enterprise Edition by following the installation instructions.
When installing BigInsights it is important to install Adaptive MapReduce. This is the choice that
causes the Platform Symphony software to be installed and configured with BigInsights.
• To do this, edit the install.properties file in the installation directory before
starting the BigInsights installation process, as shown below:
# set AdaptiveMR.Enable to true if you want to install AdaptiveMR
# instead of Apache MapReduce
AdaptiveMR.Enable=true
# set AdaptiveMR.HA.Enable to true if you want to install AdaptiveMR
# High Availability, this will also install AdaptiveMR instead of Apache
# MapReduce
AdaptiveMR.HA.Enable=true
• For multitenant environments, GPFS FPO is recommended; however, Symphony can be
configured to support multiple tenants regardless of whether HDFS or GPFS FPO is chosen as the
cluster file system.
• BigInsights can be installed by using a web-based installation process. This process generates
an XML file that governs the installation, whether it is run from the GUI or optionally via the
install.sh shell script. The name of this file varies depending on how the software is installed,
but as of release 2.1 the file is called either simple-fullinstall.xml or fullinstall.xml.
• The reason we mention this is that an apparent bug in BigInsights 2.1 caused the XML tag
<apache-mapred> to be set to true even when Adaptive MapReduce was requested in the
install.properties file above. It is worth validating that this setting is correct (false) in the
simple-fullinstall.xml or fullinstall.xml file:
[biadmin@biginsights]$ grep "apache-mapred" simple-fullinstall.xml
<apache-mapred>false</apache-mapred>
[biadmin@biginsights]$
• As you proceed with the installation, you should see the BigInsights installation script install the
“HAManager” software components. This is where the Platform Symphony software that supports
HA functionality and Adaptive MapReduce functionality is located. You can watch for this either
through the web installation GUI or by checking the installation log file.
• If you are installing BigInsights 2.1 Enterprise Edition you will need to install a patch by following
the procedure documented in the publication “Enabling the full functionality of IBM Platform
Symphony in your BigInsights 2.1 cluster”, which can be obtained from:
https://www.ibm.com/developerworks/community/wikis/form/api/wiki/ee59a95e-5867-4deb-90af-6bed6b0759b8/page/91903357-0a7d-4a96-bb70-520fb2acdc1b/attachment/52d79fbe-dc37-42f0-be3f-5f4b75f14a05/media/Enable%20the%20full%20functionality%20of%20IBM%20Platform%20Symphony%20in%20BigInsight%202.1%20Cluster.pdf
This document is freely downloadable for users with an IBM developerWorks ID.
• You can download a small patch for Platform Symphony 6.1.0.1 (the Symphony version included
in BigInsights 2.1) from https://www.ibm.com/support/fixcentral/ following instructions in the
document referenced above. At the time of this writing you can find and download the needed
package from Fix Central by searching for “Platform Symphony” and downloading the package
named “sym-6.1.0.1-build225866”. This package applies to both 64-bit Linux on Intel and
IBM PowerLinux machines. Later versions of BigInsights will not require this patch.
• Follow the instructions in the README file. If you are installing the patch as user “root” on the
BigInsights cluster, source the BigInsights environment before attempting to install the patch,
since the patch procedure assumes the environment variables are already set, as shown below:
[biadmin@biginsights opt]$ cd /opt/ibm/biginsights/conf
[biadmin@biginsights conf]$ . biginsights-env.sh
[biadmin@biginsights conf]$ echo $EGO_TOP
/opt/ibm/biginsights/HAManager/data
[biadmin@biginsights conf]$
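The two settings discussed in the installation steps (the AdaptiveMR flags in install.properties and the <apache-mapred> tag in the generated XML file) can be checked together before starting an installation. The following is a minimal sketch rather than anything shipped with BigInsights: the function names prop_value and check_adaptive_mr are hypothetical, and the sketch assumes the XML tag appears on a single line exactly as in the grep output shown earlier.

```shell
#!/bin/sh
# Sketch: confirm that the installer inputs agree on Adaptive MapReduce.
# Function names are illustrative; they are not part of BigInsights.

# Print the value assigned to KEY in a Java-style properties file.
prop_value() {  # usage: prop_value KEY FILE
    key=$(printf '%s' "$1" | sed 's/[.]/\\./g')   # match dots literally
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# Succeed only if Adaptive MapReduce is requested in the properties file
# and <apache-mapred> is false in the generated installation XML.
# ASSUMPTION: the tag is written on one line with no inner whitespace.
check_adaptive_mr() {  # usage: check_adaptive_mr PROPERTIES_FILE XML_FILE
    if [ "$(prop_value AdaptiveMR.Enable "$1")" != "true" ] &&
       [ "$(prop_value AdaptiveMR.HA.Enable "$1")" != "true" ]; then
        echo "Adaptive MapReduce is not enabled in $1" >&2
        return 1
    fi
    if ! grep -q '<apache-mapred>false</apache-mapred>' "$2"; then
        echo "<apache-mapred> is not set to false in $2" >&2
        return 1
    fi
    echo "Adaptive MapReduce is selected in both files"
}
```

A typical call would be `check_adaptive_mr install.properties simple-fullinstall.xml`, run from the directory that holds both files.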
When this patch is applied, the multitenant capabilities of IBM Platform Symphony will become
functional and will be accessible through the Platform Symphony graphical user interface.
When BigInsights is installed, the BigInsights web console by default is available on port 8080 on the
BigInsights management host (as long as BigInsights services are started).
Check the status of the cluster using this command:
$ /opt/ibm/biginsights/bin/status.sh
If necessary, start BigInsights (which will also start Platform Symphony services):
$ /opt/ibm/biginsights/bin/start-all.sh
While logged in as the BigInsights administrator, if Symphony is properly installed with BigInsights you
should be able to run Symphony specific commands. As an example, the user biadmin should be able to
run the following command:
$ egosh service list
This command will list various software services associated with Symphony and show their status.
When the Platform Computing components are installed (Adaptive MapReduce), the Platform
Computing resource manager (EGO) is used to persist BigInsights services. You will notice that
Symphony services are associated with a consumer called “/Management”. If you are running HDFS,
HDFS services like the DataNode and Secondary NameNode are associated with an “/HDFS” consumer.
The MapReduce shuffle service is started on compute hosts in the cluster.
[biadmin@biginsights ~]$ egosh service list
SERVICE  STATE   ALLOC CONSUMER RGROUP RESOURCE SLOTS SEQ_NO INST_STATE ACTI
derbydb  DEFINED       /Manage* Manag*
purger   DEFINED       /Manage* Manag*
plc      DEFINED       /Manage* Manag*
WEBGUI   STARTED 54    /Manage* Manag* biginsi* 1     1      RUN        121
RS       DEFINED       /Manage* Manag*
Seconda* DEFINED       /HDFS/S*
MRSS     STARTED 55    /Comput* MapRe* biginsi* 1     1      RUN        120
DataNode DEFINED       /HDFS/D*
SD       STARTED 56    /Manage* Manag* biginsi* 1     1      RUN        119
Service* DEFINED       /Manage* Manag*
WebServ* DEFINED       /Manage* Manag*
NameNode DEFINED       /HDFS/N*
[biadmin@biginsights ~]$
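On a busy cluster it can be convenient to filter this listing by state. The sketch below is one way to do that with awk; the function name services_in_state is hypothetical, and it assumes the column layout shown above (service name in the first column, state in the second, and a header row beginning with SERVICE). Note that egosh truncates long names with an asterisk, so the names printed are the truncated ones.

```shell
#!/bin/sh
# Sketch: list the services that EGO reports in a given state.
# Reads `egosh service list` output on standard input.
# ASSUMPTION: column 1 is the service name and column 2 is the state,
# as in the sample listing in this paper.
services_in_state() {  # usage: egosh service list | services_in_state STARTED
    awk -v want="$1" '$1 != "SERVICE" && $2 == want { print $1 }'
}
```

For example, `egosh service list | services_in_state DEFINED` would print the services that are defined but not currently started.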

Accessing the Platform Symphony Management Console
The Platform Symphony console will usually be on the same host if you follow the installation
recommendations above, but will be on a different port. Port 18080 is the default. You should be able to
log into the Platform Symphony management console at http://<master-host>:18080/platform. The
default administrator login for Platform Symphony is “Admin / Admin”.
In production clusters there will normally be multiple Platform Symphony management hosts. Setting
this up is beyond the scope of this paper and is covered in the Platform Symphony documentation.
Figure 2: Logging into the Platform Symphony Management Console
If you are having trouble connecting to the Symphony web console you can use the command “egosh
service view WEBGUI” to see details about the web service.
The WEBGUI service should be started automatically by EGO, but if it becomes necessary to start or
stop the service, you can use the following commands:
$ egosh logon
(Enter Admin / Admin as the username and the password when prompted.)
$ egosh service start WEBGUI
$ egosh service stop WEBGUI
The WEBGUI service is implemented using Apache Tomcat.
If there are problems with the WEBGUI, you can inspect the log at ${EGO_TOP}/gui/logs/catalina.out
for information about what might be wrong with the service.
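When the log is long, it helps to pull out only the lines that look like errors. The sketch below is an illustration under assumptions: the function name recent_gui_errors is hypothetical, and it assumes that problems show up as lines containing SEVERE (the level Tomcat's default logging uses for errors) or the name of a Java exception.

```shell
#!/bin/sh
# Sketch: show the most recent error-looking lines from the WEBGUI log.
# ASSUMPTION: errors appear as lines containing "SEVERE" or "Exception".
recent_gui_errors() {  # usage: recent_gui_errors LOGFILE [MAX_LINES]
    # -n prefixes each match with its line number in the log file
    grep -n -E 'SEVERE|Exception' "$1" | tail -n "${2:-20}"
}
```

With the BigInsights environment sourced, this could be called as `recent_gui_errors "${EGO_TOP}/gui/logs/catalina.out"`.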

If you cannot connect to the Symphony console, the connection may be blocked by your firewall
configuration. You can disable your firewall temporarily to see if this is the cause.
# service iptables stop
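Before and after changing the firewall, it is useful to confirm from the client machine whether the console port actually accepts connections. The following sketch is one way to do so; the function name port_open is hypothetical, and it assumes that bash (with its /dev/tcp pseudo-device) and the coreutils timeout command are available on the machine where it runs.

```shell
#!/bin/bash
# Sketch: test whether a TCP port accepts connections.
# ASSUMPTION: bash with /dev/tcp support and the `timeout` command exist.
port_open() {  # usage: port_open HOST PORT
    # Inside the child shell, $0 is HOST and $1 is PORT.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

The function returns success only if the connection is established within three seconds, so it can be used as `port_open <master-host> 18080 && echo reachable`.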
If you are not sure what port or host the Platform Symphony GUI was installed on, you should be able to
find it in the XML file that governs the BigInsights installation process (described earlier).
This XML file is generated by the web-based installation process. Platform Symphony related setup
details are found under the “high-availability” section of this file.
<high-availability>
  <configure>false</configure>
  <master-nodes/>
  <baseport>7869</baseport>
  <web-port>18080</web-port>
  <log-directory>var/ibm/biginsights/ps-mapred/logs</log-directory>
  <preferred-ip-mask/>
  ..
  <max-retries>3</max-retries>
  <failover>failover</failover>
</high-availability>
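The web port can also be read from this file with standard tools and turned into the console URL given earlier. The sketch below is illustrative only: the function names xml_tag_value and console_url are hypothetical, and it assumes each element sits on a single line as in the fragment above. If no <web-port> element is found, it falls back to the default port 18080 mentioned in this paper.

```shell
#!/bin/sh
# Sketch: read a simple element value from the installation XML file.
# ASSUMPTION: the opening tag, value and closing tag are on one line.
xml_tag_value() {  # usage: xml_tag_value TAG FILE
    sed -n "s/.*<$1>\([^<]*\)<\/$1>.*/\1/p" "$2" | head -n 1
}

# Build the Platform Symphony console URL for a given master host.
console_url() {  # usage: console_url HOST XML_FILE
    port=$(xml_tag_value web-port "$2")
    echo "http://$1:${port:-18080}/platform"
}
```

For example, `console_url <master-host> fullinstall.xml` prints the URL to open in a browser.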
Once a user logs in to the Platform Symphony console on port 18080, they will see the main Platform
Symphony dashboard. This view is mostly used to monitor the high level status of the various
applications and tenants on a Platform Symphony cluster.
For BigInsights users, most of the action will center around the “MapReduce Workload” screen
accessible under “Quick Links”.

Figure 3: View of the Platform Symphony console when logged in as an Administrator
Accessing the Platform Symphony knowledge center
Once you are able to access the Platform Symphony console above, you may want to access the
Platform Symphony Knowledge Center and bookmark it in your browser. The knowledge center is
accessible in a pull down menu under the question mark in the top bar on the Platform Symphony web
interface.
The knowledge center aggregates all of the various Platform Symphony documentation into a
searchable interface. This will prove handy as you learn about Platform Symphony.
A direct link to the knowledge center can be found at this URL (depending on the hostname where the
web interface is running).
http://<masterhost-name>:18080/doc/symphony/6.1/index.html
The command egosh service list shown earlier will show the name of the host running the web
interface (listed as the WEBGUI) if you are running on a cluster with multiple master hosts.
The Platform Symphony knowledge center, in particular the documentation dealing with the Platform
Symphony MapReduce framework, will be useful to BigInsights administrators since if you are using
Adaptive MapReduce you are in fact using the Platform Symphony MapReduce framework.

Figure 4: Platform Symphony Knowledge Center
Platform Symphony Concepts
While the reader of this document is likely to be familiar with Hadoop and various commercial
distributions, they may be less familiar with IBM Platform Symphony. IBM Platform Symphony is a
commercial grid workload and resource management solution that has been used to share resources
among diverse applications in multitenant environments for over a decade. Platform Symphony is
widely deployed as a shared services infrastructure in some of the world’s largest investment banks.
As a quick primer on the terminology used in this document, some definitions are offered
below. We recommend that interested readers review a document called “IBM
Platform Symphony Foundations” available at http://publibfp.dhe.ibm.com/epubs/pdf/c2750652.pdf .
• Session Manager – Service-oriented applications in Platform Symphony are managed by a
session manager. The session manager is responsible for dispatching tasks to service instances,
and collecting and assembling results. The Symphony session manager provides a function
similar in concept to a Hadoop application manager, although it has considerably more
capabilities. Platform Symphony implements job tracker functionality using the session
manager. In this paper the terms job tracker, application manager and session manager are used
interchangeably. While the concept of multiple concurrent application managers in Hadoop is
new with YARN, Platform Symphony has always featured a multitenant design.

Page 16
 Resource Groups – Unlike Hadoop clusters, Platform Symphony does not make assumptions
about the capabilities of hosts that participate in the cluster. While Hadoop generally assumes
that member nodes are 64-bit Linux hosts running Java, Platform Symphony supports a variety
of hardware platforms and operating environments. Platform Symphony allows hosts to be
grouped in flexible ways into different resource groups, and different types of applications can
share these underlying resource groups in flexible ways.
 Applications – The term application can be a little bit confusing as it is applied to Platform
Symphony. Symphony views an application as the combination of the client-side and service-
side code that comprise a distributed application. This is a more expansive definition than most
people are used to. By this definition an instance of BigInsights might be viewed as a single
application. Examples of Platform Symphony applications are custom applications written in
C++, a commercial ISV application like IBM Algorithmics, Calypso or Murex or a commercial or
Open Source Hadoop application like Cloudera, BigInsights or open source Hadoop. Platform
Symphony views applications as being an instance of middleware. Various client side tools
associated with a particular version of Hadoop (Pig, Hive, Sqoop etc) can all run against a single
Hadoop application definition. An important concept for those not familiar with Symphony is
that Symphony provisions service instances associated with different applications dynamically.
As a result, there is nothing technically stopping a Platform Symphony cluster from supporting
multiple instances of Hadoop and non-Hadoop environments concurrently.
 Application profiles – As explained above, applications in Symphony are flexible and highly
configurable constructs. An Application Profile in Symphony defines the characteristics of an
application and various behaviors at runtime.
 Consumers – From the viewpoint of a resource manager, an application or tenant on the cluster
is defined as something that needs particular types of resources at runtime. Platform Symphony
uses the term “consumer” to define these consumers of resources and provides capabilities to
define hierarchical consumer trees and express business rules about how consumers share
various types of resources collected into resource groups. The leaf nodes in consumer trees map
to a Symphony application.
 Services – Services are the portions of applications that run on cluster nodes. In a Hadoop
context, administrators likely think of services as equating to a task tracker that runs Map and
Reduce logic. Here again, Symphony takes a broader view. Symphony services are generic. A
service may be a task-tracker associated with a particular version of Hadoop or it may be
something else entirely. When the MapReduce framework is used in Platform Symphony, the
Hadoop service-side code that implements that Task Tracker logic is dynamically provisioned by
Symphony. Symphony owes its name to this ability to orchestrate a variety of services quickly
and dynamically according to sophisticated sharing policies.
 Sessions – A session in Symphony equates to the notion of a job in Hadoop. A client application
in Symphony normally opens a connection to the cluster, selects an application and opens a
session. Behind the scenes Symphony will provision a Symphony Session Manager to manage
the lifecycle of the job. A single Symphony Session Manager may support multiple sessions
(Hadoop jobs) concurrently. A Hadoop job is a special case of a Symphony job. The Hadoop
client will start a session manager that provides JobTracker functionality. Platform Symphony
actually uses the job tracker and task tracker code provided in a Hadoop distribution, however it
uses its own low-latency middleware to more efficiently orchestrate these services on a shared
cluster.
 Repositories – As explained previously, Platform Symphony dynamically orchestrates service-
side code in response to application demand. The binary code that comprises an application
service is stored in a Symphony repository. Normally for Symphony applications, Symphony
services are distributed to compute nodes from a repository service. For Hadoop applications,
code can be distributed either via the repository service, or it can be distributed via the HDFS /
GPFS FPO file system.
 Tasks – Symphony jobs are collections of tasks. Symphony jobs are managed by a session
manager that runs on a management host. The session manager makes sure that instances of
the needed service are running on compute nodes / data nodes on the cluster. Service
instances run under the control of a Symphony Service Instance Manager (SIM). MapReduce
jobs in Symphony work the same way, but in this case the Symphony service is essentially
the Hadoop task tracker logic. On Hadoop clusters, slots are normally designated as running
either map logic or reduce logic. In Symphony, this is fluid. Because services are
orchestrated dynamically, service instances can run either map or reduce tasks. This is an
advantage because it allows full utilization of the cluster as the job progresses. At the start of a
job the majority of slots can be allocated to map tasks while towards the end of the job the
function of slots can be shifted to perform the reduce function.
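The benefit of fluid slots can be illustrated with a minimal Python sketch. This is a simplified model, not Symphony code: it assumes every task takes one time unit and that reduce tasks start only after all map tasks have finished. The function names and slot counts are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division: number of waves needed to run a tasks on b slots."""
    return -(-a // b)


def static_makespan(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Time units needed when each slot is permanently typed as map or reduce."""
    return ceil_div(map_tasks, map_slots) + ceil_div(reduce_tasks, reduce_slots)


def fluid_makespan(map_tasks, reduce_tasks, total_slots):
    """Time units needed when any slot can run either a map or a reduce task."""
    return ceil_div(map_tasks, total_slots) + ceil_div(reduce_tasks, total_slots)


# Hypothetical job: 40 map tasks and 10 reduce tasks on a 10-slot cluster.
static_time = static_makespan(40, 10, map_slots=6, reduce_slots=4)
fluid_time = fluid_makespan(40, 10, total_slots=10)
```

Under these assumptions the statically partitioned cluster needs ten time units while the cluster with interchangeable slots needs five, because no slot sits idle during either phase.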

An example of configuring a cluster for multi-tenancy
In this section we describe the step-by-step procedure to set up multiple tenants in the BigInsights
environment. To provide a realistic multitenant scenario, the diagram below roughly models our
actual customer environment, with names changed to protect client confidentiality.
The actual environment is more complex with hundreds of users, dozens of groups and approximately
thirty different applications planned, but the application sharing is similar to the diagram below. This
diagram maps to the “Consumer Tree” in Platform Symphony. Consumer is a term used from the
resource manager’s perspective. The resource manager views an application as a consumer of
resources, and the resource manager is responsible for allocating requested resources according to
policies that will be described shortly.
Figure 5 - an example consumer hierarchy for applications and departments
By default, BigInsights (which is just a single application on the cluster) maps to a single application and
an associated consumer called “MapReduce61” (the name corresponds to the version of Platform
Symphony used to support MapReduce processing in BigInsights – in this case 6.1.0.1). This is done so
that Symphony can accommodate the versions of MapReduce provided in future versions
of BigInsights and allow versions to co-exist. This is the first consumer in the consumer tree above.
In the production environment the customer has specific needs:
 They wish to structure “sub-consumers” under the BigInsights consumer definition
(MapReduce61). This gives the cluster administrator the ability to have different run-time
characteristics for different BigInsights applications. It also allows us to set up configurable
sharing policies between our different applications and groups, control which users are allowed
to access which applications, and ensure security between tenants by having different
applications run under different user IDs if desired.
 In this example, under the BigInsights tenant (MapReduce61) we have several different
applications. We’ve arbitrarily called them “MR_AppA” through “MR_AppN” although in the real
environment these are the names of the client’s business applications. Note that we need to
configure each application (tenant) so that it runs under a different operating system level user-
id for security isolation. We also want to control in a granular way which users and groups have
access to these various applications.
 Also, as shown in figure 4, the client has additional applications used by particular lines of
business that they would also like to deploy on the same cluster. Examples include some Sqoop
workloads, DataMeer, IBM Tealeaf, various in-house developed streaming applications and
others. In this particular customer implementation all of these applications happen to
share the BigInsights MapReduce infrastructure; however, it is important to understand that this need
not be the case. As we’ll see shortly, these applications can be totally different and still be
configured to share infrastructure.
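The consumer hierarchy described above can be pictured as a simple tree whose leaves correspond to applications. The following Python sketch is only an illustrative data structure, not a Symphony API; the Consumer class and the three sample leaf names are assumptions chosen to mirror the names used in this example.

```python
from dataclasses import dataclass, field


@dataclass
class Consumer:
    """A node in a hypothetical consumer tree."""
    name: str
    children: list = field(default_factory=list)

    def leaf_paths(self, prefix=""):
        """Return the full path of every leaf consumer below this node."""
        path = f"{prefix}/{self.name}"
        if not self.children:
            return [path]
        paths = []
        for child in self.children:
            paths.extend(child.leaf_paths(path))
        return paths


# Sub-consumers structured under the default BigInsights consumer.
tree = Consumer("MapReduceConsumer", [
    Consumer("MapReduce61", [
        Consumer("MR_AppA"),
        Consumer("MR_AppB"),
        Consumer("MR_AppC"),
    ]),
])

leaves = tree.leaf_paths()
```

Each entry in leaves is a path such as /MapReduceConsumer/MapReduce61/MR_AppA, which is the level at which an application would be attached in this model.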
Adding users to run MapReduce applications
In our example we want to show how multiple users, grouped arbitrarily into one or more groups for
security management, can access tenant applications subject to access controls.
We create some sample cluster users for our illustration. These names represent individual cluster
users. For some lines of business, application administrators may choose to create a shared login like
“fraud” for a group authorized to use a particular fraud analytics application.
InfoSphere BigInsights has a recommended procedure for adding users. When using Platform Symphony
together with BigInsights, it is recommended that users follow the procedures covered in the BigInsights
documentation and use the tool createosusers.sh included in the BigInsights distribution to automate the
creation of OS level users. Doing this ensures that users can access the BigInsights console to run
applications deployed using the BigInsights application framework.
For convenience, the BigInsights infocenter is available on the public internet. For information on adding
users in BigInsights, you can learn more here:
http://www-01.ibm.com/support/knowledgecenter/SSPT3X_2.1.1/com.ibm.swg.im.infosphere.biginsights.admin.doc/doc/bi_admin_add_users.html?lang=en
The specific procedures will depend on whether you are authenticating access via flat files, LDAP, PAM
or PAM+LDAP. In the example below we are using flat files for simplicity.
To create users known to BigInsights, edit the following file:
$BIGINSIGHTS_HOME/console/conf/security/biginsights_users.xml

Add users as shown below.
<?xml version="1.0" encoding="UTF-8"?>
<server>
<featureManager/>
<basicRegistry id="basic" realm="Auth">
<user name="hadoop" password="passw0rd"/>
<user name="biadmin" password="temp4now"/>
<user name="sysadmin2" password="passw0rd"/>
<user name="appadmin2" password="passw0rd"/>
<user name="sysadmin1" password="passw0rd"/>
<user name="appadmin1" password="passw0rd"/>
<user name="dataadmin2" password="passw0rd"/>
<user name="dataadmin1" password="passw0rd"/>
<user name="user3" password="passw0rd"/>
<user name="vivian" password="temp4now"/>
<user name="gord" password="temp4now"/>
<user name="eric" password="temp4now"/>
<user name="michael" password="temp4now"/>
<user name="vince" password="temp4now"/>
<user name="steven" password="temp4now"/>
<user name="tiffany" password="temp4now"/>
<user name="appA" password="temp4now"/>
<user name="appB" password="temp4now"/>
<user name="appC" password="temp4now"/>
</basicRegistry>
</server>
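Before creating OS accounts from a users file like this, it can be useful to confirm which user names the file actually defines. The following is a minimal Python sketch using only the standard library; it assumes the element and attribute names shown in the listing above, and the embedded sample (with placeholder passwords) is hypothetical rather than a copy of a real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the same shape as biginsights_users.xml.
SAMPLE_USERS_XML = """<?xml version="1.0" encoding="UTF-8"?>
<server>
  <featureManager/>
  <basicRegistry id="basic" realm="Auth">
    <user name="biadmin" password="example-only"/>
    <user name="tiffany" password="example-only"/>
    <user name="appA" password="example-only"/>
  </basicRegistry>
</server>
"""


def list_users(xml_text):
    """Return the user names defined in a users file, in document order."""
    # Encode to bytes so the parser honors the declared encoding.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [user.get("name") for user in root.iter("user")]


user_names = list_users(SAMPLE_USERS_XML)
```

To check a real file, read its contents and pass the text to list_users.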
The next step is to define groups and associate users with groups in the file biginsights_groups.xml
in the same directory. This is an example only. The specifics
will depend on how you wish to structure your own users and groups.
<?xml version="1.0" encoding="UTF-8"?>
<server>
<featureManager/>
<basicRegistry id="basic" realm="Auth">
<group name="supergroup" gid="4000">
<member name="hadoop" uid="4000"/>
<member name="biadmin" uid="200"/>
</group>
<group name="appAdmins" gid="4100">
<member name="appA" uid="4100"/>
<member name="appB" uid="4101"/>
<member name="appC" uid="4102"/>
</group>
<group name="sysAdmins" gid="4200">
<member name="sysadmin1" uid="4200"/>
<member name="sysadmin2" uid="4201"/>
</group>

<group name="dataAdmins" gid="4300">
<member name="dataadmin1" uid="4300"/>
<member name="dataadmin2" uid="4301"/>
</group>
<group name="users" gid="4400">
<member name="vivian" uid="6001"/>
<member name="gord" uid="6002"/>
<member name="eric" uid="6003"/>
<member name="michael" uid="6004"/>
<member name="vince" uid="6005"/>
<member name="steven" uid="6006"/>
<member name="tiffany" uid="6007"/>
</group>
<group name="groupA" gid="5000">
</group>
<group name="groupB" gid="5001">
</group>
<group name="groupC" gid="5002">
</group>
</basicRegistry>
</server>
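Because each application is intended to run under its own OS-level user ID, it is worth checking a groups file for numeric UIDs that have been assigned to more than one name before creating the accounts. The sketch below is a hedged illustration in Python; it assumes the member/uid layout shown above, and the sample names svcA, svcB and svcC are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample in the same shape as biginsights_groups.xml.
SAMPLE_GROUPS_XML = """<?xml version="1.0" encoding="UTF-8"?>
<server>
  <basicRegistry id="basic" realm="Auth">
    <group name="appAdmins" gid="4100">
      <member name="svcA" uid="4100"/>
      <member name="svcB" uid="4101"/>
      <member name="svcC" uid="4101"/>
    </group>
  </basicRegistry>
</server>
"""


def duplicate_uids(xml_text):
    """Map each uid used by more than one member name to the sorted names."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    names_by_uid = defaultdict(set)
    for member in root.iter("member"):
        names_by_uid[member.get("uid")].add(member.get("name"))
    return {
        uid: sorted(names)
        for uid, names in names_by_uid.items()
        if len(names) > 1
    }


conflicts = duplicate_uids(SAMPLE_GROUPS_XML)
```

A user who appears in several groups with the same UID is not reported, since only one name maps to that UID.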
In addition to having user IDs that map to individuals, I may want particular applications to execute on the
cluster under a specific user ID. For example, if my application is called “appA” I may want to have it
execute under a Linux user ID with the same name for simplicity. To accommodate this, notice that
we’ve added application-specific users to the biginsights_users.xml file in the example above.
You can add users using operating system facilities, but if you do, these users will not be recognized as
having credentials within the BigInsights web interface. They will still work with Symphony and the
BigInsights Hadoop framework however.

The example below shows how additional users can be added at the OS level. These users will be unable
to log in to the BigInsights console.
# useradd fred
# useradd george
# useradd frank
Once you have edited the BigInsights XML files to define users and groups as shown above, you are
ready to run the createosusers.sh script to create these accounts and groups at the operating system
level as well.
Run the createosusers.sh script as user “biadmin”.
$ createosusers.sh \
    $BIGINSIGHTS_HOME/console/conf/security/biginsights_groups.xml \
    $BIGINSIGHTS_HOME/console/conf/security/biginsights_users.xml \
    <biadmin's password>
By following the procedure above to create users and groups, you will be able to run and monitor jobs
from both the BigInsights console and the Platform Symphony console.
Figure 6 - user Tiffany known as a BigInsights user is known to the Platform Symphony GUI

Figure 7 - user Tiffany and others can also run jobs via the BigInsights console.
Provide access to the BigInsights / Platform Computing cluster
For each operating system user who will be submitting jobs, make sure that their .bashrc file (or
equivalent depending on your shell) in the user’s home directory is configured to source the BigInsights
environment as shown below. If you have followed the procedures above, this should be done for you
automatically. We include these details because you may have additional users not known to BigInsights
that require access to Platform Symphony.
Sourcing the BigInsights environment will ensure that various shell variables like $PATH and
$CLASSPATH as well as environment variables specific to BigInsights and Platform Symphony are in the
environment when the user logs on. This will allow them to immediately run both BigInsights and
Symphony commands. If you are adding many users outside the procedure recommended above to add
BigInsights users, and you want them all to have access to the cluster, it will be faster to adjust the
system-wide template for .bashrc file (in /etc/skel) or adjust the common /etc/bashrc depending on
your preference.
If you have followed the instructions above, this step may not be necessary, but it is a good idea to
check that when users login they are inheriting an environment appropriate for running BigInsights jobs
and that they have access to the Platform Symphony environment.
In our case we want both our named users and the user IDs that our applications will run under in
Symphony (see the concept of impersonation explained later) to source the environment and be able to
run commands.
[root@biginsights gord]# cat .bashrc
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi

# User specific aliases and functions
# source the environment for BigInsights and Platform Symphony
source /opt/ibm/biginsights/conf/biginsights-env.sh
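When many accounts are involved, checking each .bashrc by eye becomes tedious. The following Python sketch shows one way to test whether the text of a .bashrc sources the environment script. It is an illustration only: the script path is taken from the listing above, while the function name and the simple line-matching rule are assumptions.

```python
ENV_SCRIPT = "/opt/ibm/biginsights/conf/biginsights-env.sh"


def sources_env(bashrc_text, script=ENV_SCRIPT):
    """Return True if a non-comment line sources the given script."""
    for raw_line in bashrc_text.splitlines():
        # Drop trailing comments, then split into words.
        words = raw_line.split("#", 1)[0].split()
        if len(words) == 2 and words[0] in ("source", ".") and words[1] == script:
            return True
    return False


SAMPLE_BASHRC = """# .bashrc
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
# source the environment for BigInsights and Platform Symphony
source /opt/ibm/biginsights/conf/biginsights-env.sh
"""

configured = sources_env(SAMPLE_BASHRC)
```

This check only looks at the file text; it does not detect an environment that is sourced indirectly, for example through /etc/bashrc.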
You should be able to su to your created user ID after this and run Symphony or BigInsights commands.
Below we see that I can run a Symphony command confirming that my environment is set up correctly.
Note that with the installation of BigInsights we are entitled to use Platform Symphony Advanced
Edition, which is the version of Symphony that supports the Hadoop MapReduce framework. We are not
entitled to use some of the other add-on products listed.
[root@biginsights /]# su - gord
[gord@biginsights ~]$ egosh entitlement info
Symphony Edition : Advanced
Desktop Harvesting : Not Entitled
Server Harvesting : Not Entitled
Virtual Server Harvesting : Not Entitled
GPU : Not Entitled
[gord@biginsights ~]$
After following the procedure above, it is a good idea to make sure that our /etc/group file reflects the
setup we’ve configured in the BigInsights XML files.
In /etc/group, define the users that will be allowed to submit workloads on behalf of each group.
This is a very simple example. In reality, different users would belong to different groups and these
group names would be meaningful in the context of how the customer organizes their business.
groupA:x:5000:vivian,gord,eric,michael,vince,steven,biadmin
groupB:x:5001:vivian,gord,eric,michael,vince,steven,biadmin
groupC:x:5002:vivian,gord,eric,michael,vince,steven,biadmin
groupD:x:5003:vivian,gord,eric,michael,vince,steven,biadmin
groupF:x:5004:vivian,gord,eric,michael,vince,steven,biadmin
groupG:x:5005:vivian,gord,eric,michael,vince,steven,biadmin
groupH:x:5006:vivian,gord,eric,michael,vince,steven,biadmin
groupI:x:5007:vivian,gord,eric,michael,vince,steven,biadmin
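To verify group membership after editing, the /etc/group format (name, password placeholder, GID and a comma-separated member list, separated by colons) is easy to parse. The following is a small Python sketch; the function names are hypothetical, and it works on lines passed in as text rather than reading the live system file.

```python
def parse_group_line(line):
    """Split one /etc/group line into (name, gid, members)."""
    name, _password, gid, members = line.strip().split(":", 3)
    return name, int(gid), [m for m in members.split(",") if m]


def groups_for_user(lines, user):
    """Return the names of the groups that list user as a member."""
    result = []
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        name, _gid, members = parse_group_line(line)
        if user in members:
            result.append(name)
    return result


SAMPLE_LINES = [
    "groupA:x:5000:vivian,gord,eric,michael,vince,steven,biadmin",
    "groupB:x:5001:vivian,gord,eric,michael,vince,steven,biadmin",
    "groupC:x:5002:",
]

gord_groups = groups_for_user(SAMPLE_LINES, "gord")
```

Note that this only reports supplementary memberships listed in the file; a user's primary group is recorded in /etc/passwd instead.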
Understanding Platform Symphony Impersonation
Now is a good time to explain the concept of “impersonation” in Platform Symphony. Symphony has
two different workload execution modes:
 Simple Workload Execution Mode
 Advanced Workload Execution Mode
This is normally an installation option with Platform Symphony. The BigInsights Enterprise Edition installation
automatically installs Platform Symphony in Advanced Workload Execution Mode. Workload execution mode is
frequently abbreviated as WEM in the Symphony documentation. In advanced workload execution
mode, core Symphony services run as root, and application administrators are able to control the
user ID that clustered applications run under.

Our approach to security hinges on this concept of impersonation in Symphony and we will see shortly
how we configure our applications to run under specific user credentials and control what users have
access to what applications and resources. The section called “Security within the MapReduce
framework” in the MapReduce user guide in the Platform Symphony documentation discusses this in
detail.
The customer that this paper is modeled after employs Kerberos authentication for their MapReduce
jobs to ensure security and to ensure that a service that supports impersonation cannot be spoofed.
Configuring Kerberos is beyond the scope of this short document, but customers will be pleased that
this capability exists. Symphony is frequently deployed in secure environments where these capabilities
are important.
Configuring OS groups for the multitenant environment
For users making use of Platform Symphony (both named users and the user IDs that applications will
run under via impersonation) these IDs need to be part of the OS group that owns the BigInsights (and
by extension the Symphony) installation.
In our installation, BigInsights was installed as part of the “biadmin” group, so we adjust the group
membership so that each application ID that Symphony jobs will run under is a part of the BigInsights
group.
biadmin:x:0:root,biadmin,gord,eric,vivian,appA,appB,appC,appD,appE,appF,appG
bin:x:1:root,bin,daemon
daemon:x:2:root,bin,daemon
..
If you are unsure what group BigInsights was installed under, issue a command like
$ ls -al ${EGO_TOP}
You will see the user and group that own each file. This will vary depending on how you installed
BigInsights but the default group is biadmin.
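The same ownership check can be scripted. The sketch below is a minimal Python equivalent for a single path, using only standard library calls available on Linux and other Unix-like systems; the function name is hypothetical, and it falls back to numeric IDs when a name cannot be resolved.

```python
import grp
import os
import pwd


def owner_and_group(path):
    """Return (user name, group name) owning path, or numeric IDs as strings."""
    info = os.stat(path)
    try:
        user = pwd.getpwuid(info.st_uid).pw_name
    except KeyError:
        user = str(info.st_uid)  # no passwd entry for this UID
    try:
        group = grp.getgrgid(info.st_gid).gr_name
    except KeyError:
        group = str(info.st_gid)  # no group entry for this GID
    return user, group


# Example: inspect the current directory. On a cluster host you might pass
# the installation directory named by the EGO_TOP environment variable.
current_owner = owner_and_group(".")
```

If the group reported for the installation directory is not the one you expected, that is the group to which the application user IDs need to be added.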
Submitting a test job as a user to verify the configuration
As we mentioned before, by default BigInsights is configured to use an application called MapReduce6.1
which maps to the consumer called /MapReduceConsumer/MapReduce61.
I should be able to log in to any of the accounts created and run a sample Hadoop job. The sleep
command included with the BigInsights examples is a convenient Hadoop application for testing the
MapReduce framework. This command submits variable numbers of Map and Reduce tasks that simply
sleep for variable amounts of time. The example below submits two mappers that each sleep for 2
seconds (2,000 msec) followed by ten reducers that also sleep for 2 seconds.
Besides being a useful validation that everything is working, this test illustrates the performance
advantage of using Platform Symphony as the MapReduce framework over open-source Hadoop.

Platform Symphony can run tests like this one, with short-running map and reduce tasks, dramatically faster than
open source Hadoop – often more than ten times faster, even when a competing cluster is configured
with a short polling interval.
Note that as the test Hadoop job runs, everything is identical to open source Hadoop (it is actually the
BigInsights supplied Hadoop classes that are running) except that the JobTracker logic in
Hadoop is running inside a Symphony Session Manager.
Note also that the running job is given a Platform Symphony job ID (job_ssm_0401 in this example).
Because Platform Symphony is managing the job execution, it is able to manage this job as well as other
jobs on the cluster including non-Hadoop jobs.
[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -m 2 -r 10 -mt 2000 -rt 2000
14/03/15 13:14:25 INFO internal.MRJobSubmitter: Connected to JobTracker(SSM)
14/03/15 13:14:26 INFO internal.MRJobSubmitter: Job <Sleep job> submitted, job id <401>
14/03/15 13:14:26 INFO internal.MRJobSubmitter: Job will not verify intermediate data integrity using checksum.
14/03/15 13:14:26 INFO mapred.JobClient: Running job: job_ssm_0401
14/03/15 13:14:27 INFO mapred.JobClient: map 0% reduce 0%
14/03/15 13:14:59 INFO mapred.JobClient: Job complete: job_ssm_0401
14/03/15 13:15:00 INFO mapred.JobClient: Counters: 18
14/03/15 13:15:00 INFO mapred.JobClient: Shuffle Errors
14/03/15 13:15:00 INFO mapred.JobClient: WRONG_PATH=0
14/03/15 13:15:00 INFO mapred.JobClient: CONNECTION=0
14/03/15 13:15:00 INFO mapred.JobClient: IO_ERROR=0
14/03/15 13:15:00 INFO mapred.JobClient: FileSystemCounters
14/03/15 13:15:00 INFO mapred.JobClient: FILE_BYTES_WRITTEN=5146
14/03/15 13:15:00 INFO mapred.JobClient: Map-Reduce Framework
14/03/15 13:15:00 INFO mapred.JobClient: Reduce input groups=400
14/03/15 13:15:00 INFO mapred.JobClient: Combine output records=0
14/03/15 13:15:00 INFO mapred.JobClient: Map output records=400
14/03/15 13:15:00 INFO mapred.JobClient: SHUFFLED_MAPS=20
14/03/15 13:15:00 INFO mapred.JobClient: Reduce shuffle bytes=2440
14/03/15 13:15:00 INFO mapred.JobClient: Combine input records=0
14/03/15 13:15:00 INFO mapred.JobClient: Spilled Records=800
14/03/15 13:15:00 INFO mapred.JobClient: SPLIT_RAW_BYTES=0
14/03/15 13:15:00 INFO mapred.JobClient: Map output bytes=1600
14/03/15 13:15:00 INFO mapred.JobClient: Reduce input records=400
14/03/15 13:15:00 INFO mapred.JobClient: GC_TIME_MILLIS=0
14/03/15 13:15:00 INFO mapred.JobClient: FAILED_SHUFFLE=0
14/03/15 13:15:00 INFO mapred.JobClient: MERGED_MAP_OUTPUTS=20
14/03/15 13:15:00 INFO mapred.JobClient: Reduce output records=0
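When running validation jobs like this repeatedly, it can help to pull the job ID and counters out of the client output automatically. The following Python sketch is one possible approach, assuming the log line layout shown in the listing above; the regular expressions and function name are illustrative and are not part of Hadoop or Symphony.

```python
import re

JOB_ID_RE = re.compile(r"Running job: (job_\w+)")
COUNTER_RE = re.compile(r"mapred\.JobClient:\s+(.+?)=(\d+)\s*$")


def parse_job_log(text):
    """Return (job_id, counters) extracted from JobClient log output."""
    job_id = None
    counters = {}
    for line in text.splitlines():
        match = JOB_ID_RE.search(line)
        if match:
            job_id = match.group(1)
            continue
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return job_id, counters


SAMPLE_LOG = """14/03/15 13:14:26 INFO mapred.JobClient: Running job: job_ssm_0401
14/03/15 13:15:00 INFO mapred.JobClient: Counters: 18
14/03/15 13:15:00 INFO mapred.JobClient: FILE_BYTES_WRITTEN=5146
14/03/15 13:15:00 INFO mapred.JobClient: Map output records=400
14/03/15 13:15:00 INFO mapred.JobClient: FAILED_SHUFFLE=0
"""

job_id, counters = parse_job_log(SAMPLE_LOG)
```

A check such as counters.get("FAILED_SHUFFLE") == 0 could then be used as a simple pass/fail signal for the test job.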

As this job runs, we can monitor it in the Symphony GUI by using the QuickLinks menu and
accessing “MapReduce Workload” to open the MapReduce workload screen shown below. As the
MapReduce job runs, you will see a view like the one shown in figure 8.
Figure 8 - monitoring our job using the Platform Symphony web interface
Note that the submitted job is associated with the application MapReduce6.1 (this is the application
that BigInsights submits jobs to by default).
You can also launch jobs via the standard BigInsights Web GUI and watch them run either from within
the BigInsights console or from within the Platform Symphony Web interface.
Figure 9: Launching a terasort job from BigInsights
The Terasort example in BigInsights uses Oozie to manage the sequence of running the teragen
application to generate the dataset to be sorted, followed by Terasort itself.
As jobs run in the BigInsights context, we see them running in Platform Symphony associated with
the MapReduce6.1 application that BigInsights is bound to.
Any BigInsights application that exercises the MapReduce framework including services like Hive, Pig,
Big SQL, Bigsheets and others will work with Symphony in this same way.
Figure 10 - Platform Symphony monitoring Terasort job run from BigInsights
Associating BigInsights with a Symphony Application
We’ve mentioned a few times that BigInsights is associated with the Symphony MapReduce6.1
application and customers frequently ask where this association is made.
[biadmin@biginsights ~]$ cd $HADOOP_CONF_DIR
[biadmin@biginsights hadoop-conf]$ cat pmr-site.xml
<?xml version="1.0"?>



<configuration>
<property>
<name>mapreduce.application.name</name>
<value>MapReduce6.1</value>
<description>The mapreduce application name.</description>
</property>
<property>
<name>mapreduce.map.skip.commit.task</name>
<value>false</value>
</property>
By changing to the BigInsights directory $HADOOP_CONF_DIR you can modify the Symphony application
name that BigInsights will submit jobs to in the file pmr-site.xml. It is important to have this flexibility,
because over time customers may end up with different versions of BigInsights along with other
applications co-existing on the same cluster.
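To confirm which application a given configuration points at without opening the file by hand, the property can be read programmatically. The following Python sketch assumes the Hadoop-style configuration/property/name/value layout shown in the listing above; the function name is hypothetical and the embedded sample is a complete, minimal stand-in for pmr-site.xml rather than the actual file.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in with the same layout as the pmr-site.xml excerpt.
SAMPLE_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.application.name</name>
    <value>MapReduce6.1</value>
    <description>The mapreduce application name.</description>
  </property>
  <property>
    <name>mapreduce.map.skip.commit.task</name>
    <value>false</value>
  </property>
</configuration>
"""


def get_property(xml_text, wanted):
    """Return the value of the named property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == wanted:
            return prop.findtext("value")
    return None


app_name = get_property(SAMPLE_SITE_XML, "mapreduce.application.name")
```

For the sample above, app_name is the string MapReduce6.1.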

Enabling Symphony Repository Services
By default, when Platform Symphony is installed the repository service in Symphony is disabled. The
function of the repository service is to store the application services and distribute the code that
implements services dynamically to service instances on the cluster.
The MapReduce framework in Platform Symphony by default distributes the application service code
(specifically the application logic that implements the task tracker functionality and Jar files that
implement map and reduce logic) by copying them to HDFS with a high block replication factor so that
the files will be accessible on all nodes.
If you are planning to add and remove application profiles or consumers in Symphony, you will need to start
the Symphony repository service. Otherwise you will encounter errors, as some of these operations assume
that the repository service in Symphony is running.
This can be done through the web interface by following these steps:
 From the QuickLinks menu select system services
 For the service abbreviated as RS, select “Start” from the Actions pull-down menu
 After you refresh the GUI view you should see the service has started on a master host

Figure 11 - Managing system services in Platform Symphony
The system services view is useful. This shows a list of system services that EGO is managing. Note that
EGO is managing not only native Platform Symphony services, but BigInsights services as well.
Adding a new Application / Tenant
Fundamental to the design of BigInsights 2.1 (and Open Source Hadoop) is the idea that there is only a
single instance of a Hadoop cluster.
Platform Symphony, however, supports multiple applications sharing the same cluster. It is also flexible
enough to support multiple instances of an application environment like BigInsights, although
configuring this is outside the scope of this paper.
Examples of tenants we may want to add might be:
 A native Symphony application written to the Platform Symphony APIs
 A batch-oriented workload (when Platform LSF is installed as an add-on to Platform Symphony)
 A distinct Hadoop MapReduce environment
 Third party applications like SAS, MATLAB or Revolution R
 A separate Hadoop MapReduce application instance that shares resources with other applications
and uses the same Hadoop binaries and file system instance.
In this example we are showing the last case where multiple Hadoop applications share resources.
From the Platform Symphony Dashboard:
 Use the QuickLinks menu and select Resources
 Select Workload / MapReduce / Application profiles from the pull down menu
An application profile will already be defined for MapReduce6.1. This is installed
automatically with Symphony and is the application profile that is used by BigInsights by default.
To add a new application profile to support a new tenant, click the “Add” button. The screen shown in
figure 12 will appear.
Figure 12 - Adding a new Application definition
We supply the following parameters:
 Our application name (Sqoop) – We require this tenant to use a different version of Sqoop
than the version included with BigInsights, as mentioned earlier.
 We define the user ID that starts the job tracker and runs jobs – This is the impersonation
feature described earlier. This particular application will run under the OS ID appB.
 Symphony has 10,000 priority levels. By default we are going to submit Sqoop jobs as having a
low priority.
 We configure user accounts that have access to this application. Note that we’ve provided all
users in groupA access to the application along with named operating system and Platform
Symphony users.
Based on this information, Platform Symphony adds an application named Sqoop with a set of
reasonable defaults for a Hadoop MapReduce job. To make sure that our new application is working, as
a user entitled to use the application I can submit a test job as I did before.
Note that in this case I am specifying that I want to have the job handled by a different MapReduce
application definition, so I specify Sqoop as the application name on the command line.
Test the new application consumer by submitting a job as before.
[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep \
    -Dmapreduce.application.name=Sqoop -m 2 -r 10 -mt 2000 -rt 2000
job id <1>
14/03/13 12:33:09 INFO mapred.JobClient: Counters: 18
..
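If test jobs are submitted from a script, the only difference between targeting the default application and a tenant application is the -Dmapreduce.application.name option. The Python sketch below builds the argument list for the sleep job used above. The helper name and its parameters are hypothetical; the option names come from the commands shown in this paper.

```python
def sleep_job_argv(jar, application=None, maps=2, reduces=10,
                   map_ms=2000, reduce_ms=2000):
    """Build the argument list for the sample sleep job.

    When application is None, no -D option is added and the job goes to
    whichever application the client configuration names by default.
    """
    argv = ["hadoop", "jar", jar, "sleep"]
    if application is not None:
        argv.append(f"-Dmapreduce.application.name={application}")
    argv += ["-m", str(maps), "-r", str(reduces),
             "-mt", str(map_ms), "-rt", str(reduce_ms)]
    return argv


# Hypothetical jar location used only for illustration.
default_argv = sleep_job_argv("/path/to/hadoop-example.jar")
sqoop_argv = sleep_job_argv("/path/to/hadoop-example.jar", application="Sqoop")
```

On a host with the Hadoop client on the PATH, a list like sqoop_argv could be passed to subprocess.run to submit the job.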
What has changed is that in figure 13 we see that our job is now running under our separate application
definition called Sqoop.
This shows the basic process of adding a new application profile for a MapReduce job to Symphony to
support our additional tenants. The next step of course is to edit the configuration of the tenant as
necessary to suit the unique needs of the application. For example, my requirement may be as simple as
re-pointing some environment variables to different installation and configuration
directories for Sqoop for jobs submitted to this application.
[biadmin@biginsights hadoop-conf]$ set | grep SQOOP
SQOOP_CONF_DIR=/opt/ibm/biginsights/sqoop/conf
SQOOP_HOME=/opt/ibm/biginsights/sqoop
[biadmin@biginsights hadoop-conf]$

Note that below my Job ID has reset to “1” since this is the first job associated with this particular
application tenant.
Figure 13 - Sleep job running under newly created application definition
Under “Workload” / “MapReduce” / “Application Profiles” we can define as many separate
applications as we’d like. The view below shows additional applications added using the same process detailed
for the Sqoop application.
Figure 14 - Available MapReduce Application Profiles
Only MapReduce applications appear because “Application Profiles” has been selected from the
MapReduce submenu. Figure 15 shows a similar view of “Applications” accessible from the same
workload dropdown menu, except that instead of looking at Application Profiles I’m looking at a dashboard of
the applications themselves with job-related status.

Figure 15 - Dashboard of MapReduce applications
Configuring application properties
When a new application profile is created for a new application, a default template is used that
represents reasonable settings for a MapReduce workload. The next step is to configure application
profiles to meet the unique requirements of each application workload.
In the Platform Symphony reference manual accessible from the knowledge center, application profiles
are covered in detail. Some of the more commonly configured settings are shown below.
To configure application properties for Sqoop, modify the application profile by selecting “Workload” /
“MapReduce” / “Application Profiles” from the top menu on the MapReduce applications screen. Select
the application profile definition for Sqoop created earlier and select Modify.
A new window will appear that allows detailed settings for the application to be changed. Changes made in
this web interface affect the application service profile definitions (discussed shortly) that are stored in the
directory $EGO_TOP/data/soam/profiles on the Platform Symphony master host. Enabled profiles
reside in a subdirectory called “enabled” and disabled profiles reside in a subdirectory called “disabled”.
The first tab in the interface, called Application Profile, allows application profile settings to be adjusted. The
second tab, labeled Users, provides an opportunity to modify the users and groups that will have access
to the application profile.

Figure 16 - Application Profile
Some important tips about Application Profiles:
 Application Profile names must be unique
 An Application Profile can be associated with only a single consumer
 In the consumer tree, MapReduce applications are by default placed under the
MapReduceConsumer tree
 You can find templates for various application profiles in the directory
$SOAM_HOME/6.1/Samples/Templates. The term SOAM in Symphony refers to the service-
oriented application middleware on which the MapReduce service is implemented
The application profile can be viewed in an Advanced Configuration, a Basic Configuration or in a
Dynamic Configuration Update mode. The Dynamic Configuration Update mode is not covered here, but
essentially it allows an administrator to register a profile fragment (part of an application profile)
modifying either the session types or services sections of the profile.
The General settings area contains settings such as where metadata associated with jobs and job history are
stored, the default service definition to be used (MapReduce for MapReduce applications) and resource
requirements.

Resource requirements are an important concept in Symphony. In this simple example by using the
syntax “select(!mg)” we are essentially saying run this service on any host that is not tagged as a
member of the management group.
Resource requirement selections in Symphony are flexible and are covered in the Symphony
documentation. I can use SQL-like resource requirement strings to specify the types of resources I
would like to use in a granular way. If for example I know that a particular application runs best on a
large memory PowerLinux machine, I can express a requirement (or preference) for this application with an
appropriate resource requirement string.
select(!mg) && select(PowerResourceGroup) && select(maxmem > 8000 && maxswp >= 16000)
The example above indicates that this service requires hosts that are part of a Power-based
resource group, are not management hosts, and have at least 8GB of physical memory and 16GB of
swap space.
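To make the selection logic concrete, the sketch below models the example string as a plain Python predicate over host attributes. It only illustrates what the requirement expresses; it is not Symphony's parser or API. The host records, the `groups` key and the host names are hypothetical, and the assumption that `maxmem` and `maxswp` are expressed in megabytes is ours.

```python
# Illustrative only: models the example requirement as a Python predicate.
# Host records and their keys are hypothetical; this is not Symphony code.

def matches_requirement(host):
    """Return True if the host satisfies the example requirement string."""
    not_management = "mg" not in host["groups"]              # select(!mg)
    in_power_group = "PowerResourceGroup" in host["groups"]  # select(PowerResourceGroup)
    enough_memory = host["maxmem"] > 8000                    # maxmem > 8000
    enough_swap = host["maxswp"] >= 16000                    # maxswp >= 16000
    return not_management and in_power_group and enough_memory and enough_swap


hosts = [
    {"name": "power01", "groups": {"PowerResourceGroup"}, "maxmem": 65536, "maxswp": 32000},
    {"name": "power02", "groups": {"PowerResourceGroup", "mg"}, "maxmem": 65536, "maxswp": 32000},
    {"name": "x86-01", "groups": {"ComputeHosts"}, "maxmem": 16384, "maxswp": 16000},
    {"name": "power03", "groups": {"PowerResourceGroup"}, "maxmem": 8000, "maxswp": 16000},
]

eligible = [h["name"] for h in hosts if matches_requirement(h)]
print(eligible)
```

In this made-up host list only power01 qualifies: power02 is a management host, x86-01 is outside the Power resource group, and power03 does not exceed the memory threshold.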
Pre-starting application services is a useful feature in Symphony. Application services refer to the
Symphony session manager (SSM) as well as the service instance managers and service instances
associated with the application. As a reminder, with MapReduce workloads the SSM can be viewed as an
Application Manager. This is the component that implements the JobTracker logic. Service instances
will load TaskTracker logic appropriate to the version of Hadoop and will start map or reduce tasks
appropriate to the application.
If you have many applications and are frequently sharing slots, pre-starting applications may not be
useful. By default Symphony will start SSMs automatically as clients connect and request services from
the middleware. As resources are assigned to applications, Symphony will dynamically provision the
needed service code and start the appropriate services.
Pre-starting applications is useful for applications that need to respond quickly. You can control the
number of slots (each slot can support a map or reduce task) that are pre-started by default.
Figure 17 - Optionally have an application pre-allocate services
A key thing to understand about the Platform Symphony session manager is that it is fully
multithreaded and can accommodate multiple sessions at the same time. A session equates to a
job submitted by a MapReduce user. Each job maps to a session, and each session may have large
numbers of tasks.

When multiple users are concurrently submitting jobs to the same application, the scheduling policy
controls how resources are shared. The R_Proportion policy specifies that resources are shared in
proportion to the priority of each job, which is often the most sensible choice.
As an example, if I had 5000 slots allocated to this application consumer definition and JobA was
submitted to the application with priority 4000 and JobB was submitted with priority 1000, Symphony
would run both workloads concurrently under the same application definition giving 80% of available
resources to JobA. Unlike standard Hadoop where resource assignments are static while the job is
executing, Symphony can respond quickly at run-time to re-balance resource allocations between jobs.
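The arithmetic behind this example can be sketched in a few lines of Python. This is only the proportional split described above, not Symphony's scheduler, which also reacts to actual demand at run-time; the function name and the third job (JobC) are hypothetical additions for illustration.

```python
# Illustrative arithmetic only; not Symphony's scheduling implementation.

def proportional_shares(total_slots, priorities):
    """Split total_slots between jobs in proportion to their priorities.

    Uses integer division, so a few slots may remain unassigned when the
    split is not exact.
    """
    total_priority = sum(priorities.values())
    return {job: total_slots * priority // total_priority
            for job, priority in priorities.items()}


# The example from the text: 5000 slots, JobA at priority 4000, JobB at 1000.
shares = proportional_shares(5000, {"JobA": 4000, "JobB": 1000})
print(shares)  # {'JobA': 4000, 'JobB': 1000}

# If a hypothetical JobC with priority 5000 arrives, the split is recomputed.
rebalanced = proportional_shares(5000, {"JobA": 4000, "JobB": 1000, "JobC": 5000})
print(rebalanced)  # {'JobA': 2000, 'JobB': 500, 'JobC': 2500}
```

The second call shows why run-time rebalancing matters: the same two jobs keep their 4:1 ratio relative to each other, but give up half of their slots once a higher-priority job joins the application.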
Note that since each SSM maps to an application (a MapReduce application in this case), this scheduling
policy controls how multiple jobs running in the same application context share resources. A separate
resource sharing plan, discussed shortly, controls how sharing is implemented more broadly between
applications and tenants.
The term application can be confusing to users not familiar with Symphony. Symphony is referring to an
application in the context of the Hadoop services themselves: the binary code that comprises
BigInsights services like the JobTracker and the TaskTracker. It is not referring to the actual application
code written by users that runs on the Hadoop framework. A single Symphony application can run
different user applications within the same Hadoop MapReduce context.
Figure 18 - controlling how multiple jobs associated with an application share resources
The Symphony application profile definition provides precise control over how MapReduce workloads
run, and this is useful to advanced users (in our experience most sites running Hadoop are already quite
advanced and will appreciate this).
A nice feature of Symphony is that, because the execution logic is provisioned dynamically, slots are
interchangeable between mappers and reducers. The settings in Figure 19 allow this to be configured,
along with preferences for default ratios between mappers and reducers and precise configuration on a
per resource group basis.

Figure 19 - MapReduce Settings associated with an Application
Symphony can allow multiple service definitions to exist for each application, and the service definition
section provides granular control over this capability. This is useful for applications written to Platform
Symphony’s native APIs and may be useful for Hadoop developers. For BigInsights it is not necessary to
change this setting, because Platform has already implemented a service called “RunMapReduce” that is
started by service instance managers to handle MapReduce workloads. This service is started
automatically for MapReduce workloads. The service itself can be found in the directory
${EGO_TOP}/soam/mapreduce/6.1/linux2.6-glibc2.3-x86_64/etc. Note that the Start Command in
Figure 20 allows for operating system specific implementations of a service definition for an application.
Figure 20 - configuring service definitions for the application
In the application profile definition, an administrator can control environment variables associated with
the application. This is an important capability for ensuring multitenancy. By using environment variables
I can control what applications run in granular ways. If I choose, I could have an application profile that
associates itself with a separate Hadoop instance by defining application-specific variables such as
$HADOOP_HOME and $HADOOP_CONF_DIR that reference different software versions and different
configuration files.
I can also resolve the technical issues that often occur when particular applications depend on
particular versions or distributions of the Java run-time environment by defining $JAVA_HOME to point
to the version of Java needed by a specific application.
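The effect of per-application environment variables can be illustrated outside Symphony with a short Python sketch. It starts a child process for each of two hypothetical applications and shows that each one sees its own $JAVA_HOME. The application names and every path below are placeholders we made up; in Symphony the variables are set in the application profile rather than in code like this.

```python
import os
import subprocess
import sys

# Hypothetical per-application settings; names and paths are placeholders.
APP_ENVIRONMENTS = {
    "AppA": {
        "HADOOP_HOME": "/opt/hadoop-a",
        "HADOOP_CONF_DIR": "/opt/hadoop-a/conf",
        "JAVA_HOME": "/opt/java-a",
    },
    "AppB": {
        "HADOOP_HOME": "/opt/hadoop-b",
        "HADOOP_CONF_DIR": "/opt/hadoop-b/conf",
        "JAVA_HOME": "/opt/java-b",
    },
}


def environment_for(app_name):
    """Copy the current environment and overlay the application's variables."""
    env = dict(os.environ)
    env.update(APP_ENVIRONMENTS[app_name])
    return env


def java_home_seen_by(app_name):
    """Start a child process with the application's environment and
    return the JAVA_HOME value that the child observes."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ['JAVA_HOME'])"],
        env=environment_for(app_name),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


seen = {app: java_home_seen_by(app) for app in APP_ENVIRONMENTS}
print(seen)
```

Each child inherits the same base environment but resolves a different Java location, which is the isolation property the application profile relies on.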
Figure 21 - configuring the environment for the application
This is a good time to mention that while much of the discussion in Hadoop centers on Java because
Hadoop itself is written in Java, Symphony supports heterogeneous applications. It does not matter
whether application clients or services are written in C/C++, Java, scripting languages or even C# in
Microsoft .NET environments. The versatility to handle all types of workloads is what makes Symphony
powerful as a multitenant environment.
Another unique capability that Symphony brings to Hadoop is the notion of “Recoverable sessions”. This
concept does not exist in open source Hadoop, where the JobTracker is implemented in a simplistic
way. If the JobTracker fails at run-time, in standard Hadoop the job needs to be re-started.
The Symphony SOAM middleware however has long supported the notion of journaling transactions so
that Hadoop MapReduce jobs become inherently recoverable. If the software service running the
JobTracker logic fails (and re-starts on the same host or a different host) the Symphony job can recover
from where it left off. This is a major advantage for customers that have long-running Hadoop jobs that
need to complete within specific batch windows.

This and other points of configurability are very important for specific workloads. As another example, if
I have execution logic where the reducer is multi-threaded, I can control the ratio of reducer services to
slots, thereby giving a reducer multiple slots if it can take advantage of them.
Figure 22 - configuring session behaviors in an SSM / Application Manager
Associating applications with consumers
The last section provided some details on how application profiles are used in Symphony to customize
applications to support multi-tenancy. In the Symphony architecture, resources are not actually
allocated to applications directly. They are allocated to Consumer definitions, which in turn map to
applications.
This is an important distinction because, while the application space is essentially “flat” (I have multiple
applications and flavors of applications of different types), the structure of consumers is usually
hierarchical. This is because most organizational structures are hierarchical.
• A bank may have several lines of business, each with various departments or application groups
• A service provider may have multiple tenant customers, and may provide different application
services for each tenant
• A government agency may have different divisions, each running different applications with a
particular need to segment data access

Symphony allows consumer trees to be set up in flexible ways to accommodate the needs of almost any
organization. A key concept to understand is that the leaf nodes of consumer trees are linked to the
application definitions we looked at in the previous section.
Accessing Consumer Definitions
To view consumer definitions, from the MapReduce screen in Symphony select “Resources / Resource
Planning / Consumers”. This is the interface that is used to manage the Consumer Tree.
Setting up the consumer tree is reasonably straightforward. The left side panel is used to control where
you are on the tree, and the right side of the interface allows you to perform operations relative to that
segment of the tree.
Recall from our scenario earlier that we had multiple groups running Datameer workloads for which
we wanted to enforce sharing policies. Datameer workloads also have specific setup dependencies
that are different from those of BigInsights workloads, so the Datameer workloads require their
own application profile. In addition, we wanted to provide isolation between the work done by different
Datameer application user groups. To achieve this, we have defined sub-consumers under
Datameer with a consumer appropriate for each group. We can also control which users have access to
each consumer. Note the hierarchical notion of consumers in Symphony.
Figure 23 - A populated consumer tree in Symphony
The leaf nodes of the consumer tree under Datameer each link to a specific application profile. The
association between an application and its position in the consumer tree is made in the application
profile.

Figure 24 - MapReduce applications
Manually editing Consumer Tree definitions
Advanced users may find it easier to manually edit the consumer tree.
Platform Symphony stores consumer tree definitions in $EGO_TOP/kernel/conf in the file
ConsumerTrees.xml.
If you hand edit this file, you will need to restart EGO services to bring the web-based view into
synchronization with the actual contents of the XML files where these settings are persisted.

After editing the ConsumerTrees.xml file as shown above, log in as the cluster administrator
(biadmin), then stop and restart EGO services using the BigInsights scripts below to make sure that
changes are reflected in the Platform Symphony console.
$ stop.sh HAManager
$ start.sh HAManager
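Because a hand-edited file is easy to break, it may be worth confirming that ConsumerTrees.xml is still well-formed XML before restarting services. The Python sketch below does only that: it checks well-formedness with the standard library and knows nothing about the consumer tree schema. The element names in the sample strings are placeholders, not the real ConsumerTrees.xml layout.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise.

    This checks syntax only (matching tags, quoting, a single root);
    it does not validate against any Symphony schema.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def file_is_well_formed(path):
    """Read a file such as ConsumerTrees.xml and check that it parses."""
    with open(path, "r", encoding="utf-8") as handle:
        return is_well_formed(handle.read())


# Placeholder documents; the element names are not the real schema.
good = "<root><child name='a'/><child name='b'/></root>"
bad = "<root><child name='a'></root>"  # <child> is never closed

print(is_well_formed(good), is_well_formed(bad))
```

To check the real file, `file_is_well_formed` could be pointed at ConsumerTrees.xml under $EGO_TOP/kernel/conf before running the stop and start scripts.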
Controlling access to applications and consumers
In the Sqoop consumer definition above, the built-in Symphony user “Admin” has administrative
responsibility for the consumer. Several other users are listed as being able to access the
application associated with the consumer. The user eric is not a member of the list of permitted users. If
an unauthorized user attempts to submit a job against the application definition (Sqoop) associated with
this Sqoop consumer, they will see an error like the one shown below, as expected.
[eric@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -
java.io.IOException: interrupted
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:1068)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:1032)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1575)
at org.apache.hadoop.examples.SleepJob.run(SleepJob.java:174)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
..
Caused by: java.lang.InterruptedException: Domain <VEM>: Security error: User: eric is not authorized to perform this operation.
If an authorized user (gord) submits the same workload, note that it runs successfully.
[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -
job id <102>

Determining the execution user for a consumer
Earlier we explained that by using impersonation, Symphony can control the user IDs that different
application services run under. In the case of the Sqoop application defined earlier, we had set the
application user to appB and this is reflected in the ConsumerTrees.xml definition.
We can verify that impersonation is taking place and that processes are running under the expected
user ID by monitoring the process tree while executing MapReduce jobs like the one above.
To monitor the process tree, use a command like:
$ watch 'ps -ef | grep appB'
As you run the job, you will see the SSM start up unless it is pre-started or the SSM is lingering on a
management host waiting for another job. In this example the services are running on the same node as
the master host, so we see the service instance managers and service instances starting locally to
manage the job. On a larger cluster you would need to watch the compute hosts to validate that the
services are starting as expected and running under the correct user ID.
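The same check can be scripted. The Python sketch below filters process listings by the owning user column rather than by a text match, so the monitoring command itself is not counted. It assumes output in the shape produced by `ps -eo user,pid,args`; the sample listing and its command lines are invented for illustration and are not captured from a real cluster.

```python
import subprocess


def processes_for_user(ps_output, user):
    """Return (pid, command) pairs owned by `user`.

    Expects text shaped like the output of `ps -eo user,pid,args`:
    a header line, then one process per line.
    """
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)          # user, pid, rest of the command
        if len(fields) == 3 and fields[0] == user:
            matches.append((int(fields[1]), fields[2]))
    return matches


def snapshot(user):
    """Run ps on the local host and return the processes owned by `user`."""
    output = subprocess.run(
        ["ps", "-eo", "user,pid,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return processes_for_user(output, user)


# Invented sample listing; the command lines are hypothetical.
SAMPLE_PS_OUTPUT = "\n".join([
    "USER       PID COMMAND",
    "root         1 /sbin/init",
    "appB      4210 sim (hypothetical command line)",
    "appB      4215 java RunMapReduceService (hypothetical command line)",
    "gord      4300 grep appB",
])

owned = processes_for_user(SAMPLE_PS_OUTPUT, "appB")
print(owned)
```

On a compute host, calling `snapshot("appB")` while a job runs would return the live equivalent of the sample. Note that the `grep appB` line owned by gord is excluded, because matching is on the user column only.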
Figure 25 - verify that services are running under the expected user IDs
We can use the pstree command on the management host to understand the process tree.

Figure 26 - pstree can be used to show the process hierarchy
On compute hosts, services are managed by the pem process.
In response to a workload requirement, pem launches a sim process (service instance manager), which
in turn runs a service instance. In this case the service instance is the RunMapReduceService, since this
is a Symphony MapReduce workload.
Figure 27 - process hierarchy on the execution host
When configuring several consumers and applications as we have shown here, it can also be faster to
hand-edit the XML-based application profile files.

To access XML application profiles, check the directory $EGO_TOP/data/soam/profiles. The associated
XML profiles will exist in subdirectories with names corresponding to their state. For example Sqoop.xml
can be found in an “enabled” subdirectory since the application is enabled and accepting workload.
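As a rough sketch of how this layout can be inspected, the Python snippet below groups XML profile files by the state subdirectory they sit in. To stay runnable anywhere, it builds a throwaway directory that mimics the layout described above. Apart from Sqoop.xml, the file names are hypothetical, and on a real cluster the function would be pointed at $EGO_TOP/data/soam/profiles instead.

```python
import os
import tempfile


def profiles_by_state(profiles_dir):
    """Map each state subdirectory (for example 'enabled' or 'disabled')
    to the sorted list of XML profile files it contains."""
    states = {}
    for state in sorted(os.listdir(profiles_dir)):
        state_dir = os.path.join(profiles_dir, state)
        if os.path.isdir(state_dir):
            states[state] = sorted(
                name for name in os.listdir(state_dir) if name.endswith(".xml")
            )
    return states


# Build a temporary stand-in for the profiles directory.
with tempfile.TemporaryDirectory() as demo_dir:
    for state, names in {"enabled": ["Sqoop.xml", "ExampleApp.xml"],
                         "disabled": ["OldApp.xml"]}.items():
        os.makedirs(os.path.join(demo_dir, state))
        for name in names:
            open(os.path.join(demo_dir, state, name), "w").close()
    demo = profiles_by_state(demo_dir)

print(demo)
```

A listing like this gives a quick view of which applications are currently accepting workload before any profile is edited by hand.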
Configuring Sharing Policies

Summary
In this document we’ve described a customer use case involving a multitenant implementation of
InfoSphere BigInsights that permits the following:
• Concurrent execution of different Hadoop applications (including different versions of code) on
the same physical cluster
• Dynamic sharing of resources between tenants in a fashion that maximizes performance and
resource utilization while respecting individual SLAs
• Support for applications other than Hadoop MapReduce to maximize flexibility and allow
capital investments to be re-purposed for multiple requirements
• Security isolation between tenants, removing a major barrier to sharing in many commercial
organizations
These advances in our view are significant. While Hadoop is advancing, competing open source and
commercial distributions are many years away from offering true multitenancy and practical solutions
for supporting multiple workloads on a shared infrastructure. The economic arguments in favor of
resource sharing are compelling. Analytic applications are increasingly comprised of multiple software
components that rely on distributed services. Rather than deploying separate “silos” of application
infrastructure, Platform Symphony provides the option to consolidate these different application
instances on a common foundation thus increasing infrastructure utilization, boosting service levels and
helping significantly reduce costs.
