Realizing a shared, multi-tenant infrastructure for Big Data and Analytic applications using IBM® InfoSphere® BigInsights and IBM Platform Computing™
Last revised: April 19, 2014
By: Gord Sissons
Steven Sit
Eric Fiala
Michael Feiman
Contents
Document History
Introduction
Disclaimers and limitations
About the customer described in this use case
Industry Challenges
  Impact on Information Technology
The Big Data Environment
  Hardware Infrastructure
  The Software Environment
  Customer Requirements
Installing InfoSphere BigInsights for Multi-tenant services
  Installation steps
  Accessing the Platform Symphony Management Console
  Accessing the Platform Symphony knowledge center
Platform Symphony Concepts
An example of configuring a cluster for multi-tenancy
  Adding users to run MapReduce applications
  Provide access to the BigInsights / Platform Computing cluster
  Understanding Platform Symphony Impersonation
  Configuring OS groups for the multitenant environment
  Submitting a test job as a user to verify the configuration
  Associating BigInsights with a Symphony Application
  Enabling Symphony Repository Services
  Adding a new Application / Tenant
  Configuring application properties
  Associating applications with consumers
  Accessing Consumer Definitions
  Manually editing Consumer Tree definitions
  Controlling access to applications and consumers
  Determining the execution user for a consumer
Configuring Sharing Policies
Summary
Document History
Date of this revision is Saturday April 19, 2014
Revision  Date            Summary of changes
0.9       March 23, 2014  Initial draft
0.95      April 19, 2014  Incorporated many valuable comments from Steven Sit, based on
                          his direct client experience. Thank you, Steven.
Introduction
This document is written for IBM and partner architects. It is intended to be a guide for those working
with customers deploying IBM InfoSphere BigInsights and other Hadoop offerings together with IBM
Platform Symphony. While this paper describes the details of one customer implementation, we believe
that this use case is relevant to others as well. Challenges related to Hadoop multitenancy are faced by
customers across multiple industries.
The target audience for this document includes:
• Architects responsible for deploying big data or analytic workloads
• Technical users looking for ways to deploy Hadoop on shared clusters
• IBM architects, ISVs or business partners interested in building multitenant Big Data
environments to help customers reduce infrastructure requirements and save cost
This paper does not delve into YARN. YARN is another important (but less mature) technology that
delivers some of the capabilities described herein. It is important for IBM customers to understand that
IBM BigInsights is a safer choice in the sense that it supports open source technologies like YARN while
simultaneously offering more advanced capabilities. IBM’s view is that clients can best determine what
capabilities they need, and IBM InfoSphere BigInsights gives them the flexibility to choose: the best of a
100% open source distribution along with significant value-added capability.
In the customer example documented here, the business advantage of using proprietary capabilities
(IBM Platform Symphony) dramatically outweighed the benefits of being “pure” from an open source
standpoint. The client was able to consolidate roughly 30 applications onto a shared infrastructure and
avoid the significant incremental capital expense that would have been required to set up separate clusters
had the client decided to proceed with open source YARN only.
Disclaimers and limitations
The details of the customer implementation are proprietary and confidential. As such, while we can
describe what was done technically, we cannot share details of how this customer used particular
applications. As a result, the examples provided herein are meant to explain qualitatively what was
achieved by the customer without betraying confidential information. The details and screenshots in this
document are not from the customer environment. They have been reproduced on a small test cluster
to explain particular capabilities that the client chose to take advantage of.
About the customer described in this use case
The customer described in this paper is a full-service financial service provider. They offer a broad range
of products to their clients including insurance, banking, investing, real estate, retirement planning,
wealth management and health insurance. Like many in the financial services sector, this customer is
increasingly deploying Hadoop based applications to augment their data warehouse. They are motivated
by the following imperatives:
• The need to leverage big data analytics to make better business decisions, improve customer
relations and develop innovative new products and services
• The need to contain or reduce costs (the cost of storing and processing data on a Hadoop cluster
is an order of magnitude less than persisting the same data in their data warehouse)
• The desire to architect their environment as a shared service to avoid each line of business
building its own discrete analytic environment on premises or in the cloud
Industry Challenges
Like many industries, the sector represented by this client is going through significant change. As a
full-spectrum provider, the client is disproportionately impacted by regulation. As a bank, not only are they
subject to various provisions in legislation like Dodd-Frank, but they are also impacted by insurance
industry requirements such as the NAIC’s Risk Management and Own Risk and Solvency Assessment Model
Act (RMORSA) and other initiatives around Enterprise Risk Management that have emerged in response to
the financial crisis of 2008.
Of particular consequence is the Volcker Rule, a provision of the Dodd-Frank Act that limits or
prohibits certain types of proprietary trading activities. While the legislation is directed at retail
banks, this client will be impacted across their insurance and wealth management businesses, where
proprietary trading is important to maximizing investment gains.
As if this tsunami of new regulation were not enough, fundamental changes are also taking place in the
insurance industry, driven by external factors. Among these factors are new disruptive
technologies: big data, social and mobile technologies are prominent drivers of change. Some specific
challenges to the business are:
• Driven by high-profile events, and the increased frequency of natural catastrophes, contingent
business interruption (CBI) modeling is emerging as a priority for insurance firms
• Dramatic changes driven by technology are promising to fundamentally change auto insurance.
Among these factors are collision avoidance technologies that promise to shift liability from
drivers to manufacturers, social media technologies enabling insurers to seek out and market to
lower risk consumer pools, and advances in GPS and vehicle telematics that promise to provide
insurers with more granular data on which to base risk assessments
• Technological advances are leading to an explosion in available information and in firms that
aggregate such information to help insurers better quantify risk
• Widespread consumer use of mobile and social technologies is causing firms to
rethink how they promote their brand and provide services to both their customers and
agents/advisors
• Advances in analytic techniques are making it easier for insurers to collect, process and visualize
information. This is extending beyond core actuarial techniques to include approaches like
predictive analytics, natural language processing, social network analysis and simulation-based
analytics.
• Additionally, new technologies are changing how information is stored and processed.
Distributed file systems and clustered technologies like Hadoop can provide a significant
per-terabyte cost advantage over traditional warehouses. Because of these cost advantages, and
because the framework is well suited to storing and processing unstructured or semi-structured
data, this customer and similar firms are embracing Hadoop as a platform for many new
applications.
We point this out because risk management, which relies heavily on Monte Carlo simulation
and actuarial modeling, and big data analytics are converging. Both depend on scaled-out
infrastructure. Firms that understand this convergence can obtain a cost advantage relative to their
competitors.
Impact on Information Technology
Both the regulatory challenges described above and the technological shifts and business
pressures are driving the need for greater data processing and analytic capacity.
• Traditional data warehouses cannot scale cost-efficiently to manage the vast amounts of data
being collected and processed, nor can they handle the raw volumes of unstructured data involved.
• Organizations need more agile application development methodologies and toolsets that allow
them to evolve data schemas and applications on the fly as they continuously incorporate new
sources of data into their models.
A one-to-one mapping between applications and infrastructure is no longer practical. Many applications
(Hadoop, scenario generation, Monte Carlo simulation and ETL processing) rely on distributed
infrastructure that scales horizontally. Replicating this clustered infrastructure for each line of business
and each application would be cost prohibitive.
The Big Data Environment
Hardware Infrastructure
The physical infrastructure deployed by this client is shown in Figure 1. While there are
actually four identical 16-node clusters, only the production environment is shown here. The server
infrastructure is based on an IBM System x reference architecture for InfoSphere BigInsights. Each
cluster node has 12 CPUs, over 60 GB of memory and 12 locally attached physical disks. The
production cluster has 192 TB of disk and approximately 1 TB of memory.
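The cluster totals quoted above can be cross-checked with a little arithmetic. The sketch below is only a back-of-envelope illustration: the node count, disks per node and total disk capacity come from the text, while the 64 GB per-node memory figure is an assumption (the text says only "over 60 GB").

```python
# Back-of-envelope check of the production cluster totals quoted above.
NODES = 16               # nodes in the production cluster (from the text)
DISKS_PER_NODE = 12      # locally attached disks per node (from the text)
TOTAL_DISK_TB = 192      # total disk capacity in TB (from the text)
MEM_PER_NODE_GB = 64     # ASSUMPTION: the text only says "over 60 GB"

disks = NODES * DISKS_PER_NODE                 # total number of disks
tb_per_disk = TOTAL_DISK_TB / disks            # implied size of each disk
total_mem_tb = NODES * MEM_PER_NODE_GB / 1024  # total memory in TB

print(f"{disks} disks, {tb_per_disk:.1f} TB per disk")
print(f"{total_mem_tb:.2f} TB of memory in total")
```

Under these assumptions the figures are self-consistent: 192 disks of 1 TB each, and about 1 TB of memory across the cluster.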
A unique feature of this environment is that the cluster is shared by approximately 30 different
user groups across several lines of business.
Figure 1: Physical infrastructure for shared Hadoop Platform
The Software Environment
The Linux-based infrastructure supports multiple big data and analytic applications.
Among these applications are:
• IBM InfoSphere BigInsights (providing core Hadoop services)
• Datameer (for data visualization)
• IBM Tealeaf (customer experience analytics platform)
• Open source Sqoop 1.2.4, used to perform bulk data transfers to and from various data sources
including an operational data warehouse and the production Hadoop cluster
• Various MapReduce streaming applications, where for convenience of development Map and
Reduce logic is expressed as Perl scripts
• Many in-house developed Java applications
• Various ETL scripts running in and out of the Hadoop MapReduce framework
The IBM-furnished software environment comprises the following major components:
• IBM InfoSphere BigInsights Enterprise Edition
• IBM Platform Symphony Advanced Edition (the software is bundled with BigInsights Enterprise
Edition for a single tenant, and this client has purchased a production license)
• IBM GPFS FPO (providing a POSIX compliant file system that fully preserves HDFS semantics)
Customer Requirements
This customer requires a multi-tenant environment for several business reasons listed below.
• They wish to share infrastructure between multiple departments and lines of business both to
boost capacity (by allowing departments to tap capacity not being used by others) and to reduce
costs by avoiding the need for separate physical environments.
• They need the ability to guarantee service levels to different tenants to ensure that business
critical applications can run in a predictable fashion. For example, ETL or specific database load
operations must run within an overnight batch window.
• Because many services are long-running, to make sharing practical, agile pre-emption is required
to make sure that urgent jobs do not need to wait behind long-running jobs on the cluster.
• The client needs to ensure that data is segmented between different tenants on the shared
environment for security and privacy reasons.
• Finally, the client requires multi-tenancy for technical reasons that are sometimes overlooked.
As the environment evolves, they need the flexibility to deploy different versions of software
components that may have specific dependencies. A specific example is this client’s requirement
to use a more recent version of open-source Sqoop, distinct from the version included in
BigInsights 2.1.0.1, the version deployed at the time of this writing.
Different Hadoop vendors have different definitions of multi-tenancy, so it is
important not to confuse the multitenant capabilities offered by IBM in Platform Symphony with
those of open source offerings like YARN. While YARN is an important technology
supported by IBM, its capabilities are well behind those described here.
Installing InfoSphere BigInsights for Multi-tenant services
Realizing a multitenant environment for BigInsights or other applications requires the use of IBM
Platform Symphony Advanced Edition. A run-time version of IBM Platform Symphony Advanced Edition
that enables a single tenant is included with IBM InfoSphere BigInsights Enterprise Edition 2.1 or later.
The Platform Symphony resource manager and workload manager are referred to in the BigInsights
documentation as Adaptive MapReduce for historical reasons. Clients wanting the multitenant
capabilities described in this document will need to license a full version of Platform Symphony Advanced
Edition.
Note that licensing is not enforced by the software directly. Customers can pilot these multitenant
capabilities using only the software included in the BigInsights 2.1 Enterprise Edition or later release
along with appropriate patches.
Installation steps
It is getting steadily easier to have these products work together. While manual
configuration was required in prior releases, as of BigInsights 2.1 EE a simple patch can be applied to
unlock all of the features of Platform Symphony Advanced Edition and have it work with BigInsights. For
future releases starting in the spring of 2014, full functionality of Platform Symphony will be provided
“out of the box” with BigInsights with no requirement for a patch. (Please note that customers will still
need to license the software before using it in production.)
The high-level steps to implement InfoSphere BigInsights 2.1 (or later) with IBM Platform Symphony
Advanced Edition are as follows:
• Install IBM InfoSphere BigInsights Enterprise Edition by following the installation instructions.
When installing BigInsights it is important to install Adaptive MapReduce. This is the choice that
causes the Platform Symphony software to be installed and configured with BigInsights.
• To do this, you will need to edit a file in the installation directory called install.properties before
starting the BigInsights installation process as shown below:
# set AdaptiveMR.Enable to true if you want to install AdaptiveMR
# instead of Apache MapReduce
AdaptiveMR.Enable=true
# set AdaptiveMR.HA.Enable to true if you want to install AdaptiveMR
# High Availability, this will also install AdaptiveMR instead of Apache
# MapReduce
AdaptiveMR.HA.Enable=true
• For multitenant environments, GPFS FPO is recommended; however, Symphony can be
configured to support multiple tenants regardless of whether HDFS or GPFS FPO is chosen as the
cluster file system.
• BigInsights can be installed by using a web-based installation process. The web-based install
process generates an XML file that governs the installation, whether it is run
from the GUI or optionally via the install.sh shell script. The name of this file will vary depending
on how the software is installed, but as of release 2.1 the file is called either
simple-fullinstall.xml or fullinstall.xml.
• The reason we mention this is that an apparent bug in BigInsights 2.1 caused the XML tag
<apache-mapred> to be set to true even when Adaptive MapReduce was requested in the
install.properties file above. It is worth validating that this setting is false in the
simple-fullinstall.xml or fullinstall.xml file.
[biadmin@biginsights]$ grep "apache-mapred" simple-fullinstall.xml
<apache-mapred>false</apache-mapred>
[biadmin@biginsights]$
• As you proceed with the installation, you should see the BigInsights installation script install the
“HAManager” software components. This is where the Platform Symphony software that supports
HA functionality and Adaptive MapReduce functionality is located. You can watch for this either
through the web installation GUI or by checking the installation log file.
• If you are installing BigInsights 2.1 Enterprise Edition you will need to install a patch by following
the procedure documented in the publication “Enabling the full functionality of IBM Platform
Symphony in your BigInsights 2.1 cluster” (see footnote 1). This document is freely downloadable
for users with an IBM developerWorks ID.
• You can download a small patch for Platform Symphony 6.1.0.1 (the Symphony version included
in BigInsights 2.1) from https://www.ibm.com/support/fixcentral/ following instructions in the
document referenced above. At the time of this writing you can find and download the needed
package from Fix Central by searching for “Platform Symphony” and downloading the package
named “sym-6.1.0.1-build225866”. This package applies to both 64-bit Linux on Intel and
IBM PowerLinux machines. Later versions of BigInsights will not require this patch.
• Follow the instructions in the README file. If you are installing the patch as user “root” on the
BigInsights cluster, source the BigInsights environment before attempting to install the patch,
since the patch procedure assumes the environment variables are already set.
1. This documentation can be obtained from: https://www.ibm.com/developerworks/community/wikis/form/api/wiki/ee59a95e-5867-4deb-90af-6bed6b0759b8/page/91903357-0a7d-4a96-bb70-520fb2acdc1b/attachment/52d79fbe-dc37-42f0-be3f-5f4b75f14a05/media/Enable%20the%20full%20functionality%20of%20IBM%20Platform%20Symphony%20in%20BigInsight%202.1%20Cluster.pdf
[biadmin@biginsights opt]$ cd /opt/ibm/biginsights/conf
[biadmin@biginsights conf]$ . biginsights-env.sh
[biadmin@biginsights conf]$ echo $EGO_TOP
/opt/ibm/biginsights/HAManager/data
[biadmin@biginsights conf]$
When this patch is applied, the multitenant capabilities of IBM Platform Symphony will become
functional and will be accessible through the Platform Symphony graphical user interface.
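The two pre-installation checks described in the steps above (Adaptive MapReduce enabled in install.properties, and <apache-mapred> set to false in the generated XML file) could be automated along the following lines. This is a minimal sketch rather than an IBM-provided tool: the function names are hypothetical, the properties file is assumed to contain plain key=value lines, and the <install> wrapper element in the sample XML is invented purely for illustration because the real file's layout is not shown in this paper.

```python
import xml.etree.ElementTree as ET

def read_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props

def adaptive_mr_requested(props):
    """True if either AdaptiveMR switch from install.properties is enabled."""
    return any(props.get(key, "").lower() == "true"
               for key in ("AdaptiveMR.Enable", "AdaptiveMR.HA.Enable"))

def apache_mapred_enabled(xml_text):
    """True if the first <apache-mapred> element found is set to 'true'."""
    root = ET.fromstring(xml_text)
    node = root if root.tag == "apache-mapred" else root.find(".//apache-mapred")
    return node is not None and (node.text or "").strip().lower() == "true"

def settings_consistent(props_text, xml_text):
    """If Adaptive MapReduce is requested, <apache-mapred> must not be true."""
    if adaptive_mr_requested(read_properties(props_text)):
        return not apache_mapred_enabled(xml_text)
    return True

# Illustrative inputs only; the <install> wrapper element is hypothetical.
sample_props = "AdaptiveMR.Enable=true\nAdaptiveMR.HA.Enable=true\n"
sample_xml = "<install><apache-mapred>false</apache-mapred></install>"
print(settings_consistent(sample_props, sample_xml))
```

In practice the two strings would be read from install.properties and from simple-fullinstall.xml or fullinstall.xml before the installation is started.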
When BigInsights is installed, the BigInsights web console by default is available on port 8080 on the
BigInsights management host (as long as BigInsights services are started).
Check the status of the cluster using this command:
$ /opt/ibm/biginsights/bin/status.sh
If necessary, start BigInsights (which will also start Platform Symphony services):
$ /opt/ibm/biginsights/bin/start-all.sh
While logged in as the BigInsights administrator, if Symphony is properly installed with BigInsights you
should be able to run Symphony specific commands. As an example, the user biadmin should be able to
run the following command:
$ egosh service list
This command will list various software services associated with Symphony and show their status.
When the Platform Computing components (Adaptive MapReduce) are installed, the Platform
Computing resource manager (EGO) is used to persist BigInsights services. You will notice that
Symphony services are associated with a consumer called “/Management”. If you are running HDFS,
HDFS services like the DataNode and Secondary NameNode are associated with an “/HDFS” consumer.
The MapReduce shuffle service is started on Compute hosts in the cluster.
[biadmin@biginsights ~]$ egosh service list
SERVICE  STATE    ALLOC CONSUMER RGROUP RESOURCE SLOTS SEQ_NO INST_STATE ACTI
derbydb  DEFINED        /Manage* Manag*
purger   DEFINED        /Manage* Manag*
plc      DEFINED        /Manage* Manag*
WEBGUI   STARTED  54    /Manage* Manag* biginsi* 1     1      RUN        121
RS       DEFINED        /Manage* Manag*
Seconda* DEFINED        /HDFS/S*
MRSS     STARTED  55    /Comput* MapRe* biginsi* 1     1      RUN        120
DataNode DEFINED        /HDFS/D*
SD       STARTED  56    /Manage* Manag* biginsi* 1     1      RUN        119
Service* DEFINED        /Manage* Manag*
WebServ* DEFINED        /Manage* Manag*
NameNode DEFINED        /HDFS/N*
[biadmin@biginsights ~]$
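When checking many hosts, it can be convenient to summarize this listing programmatically. The sketch below assumes only what the sample output above shows, namely that the first two whitespace-separated fields of each row are the service name and its state. The function name is hypothetical, and the sample text is an abbreviated copy of the listing above rather than live output.

```python
def parse_service_states(output):
    """Map service name -> state from `egosh service list` style text."""
    states = {}
    for line in output.splitlines():
        tokens = line.split()
        # Skip blank lines, the header row and shell prompt lines.
        if len(tokens) < 2 or tokens[0] == "SERVICE" or tokens[0].startswith("["):
            continue
        states[tokens[0]] = tokens[1]
    return states

# Abbreviated copy of the listing shown above (not live output).
sample = """\
SERVICE STATE ALLOC CONSUMER RGROUP RESOURCE SLOTS SEQ_NO INST_STATE ACTI
derbydb DEFINED /Manage* Manag*
WEBGUI STARTED 54 /Manage* Manag* biginsi* 1 1 RUN 121
MRSS STARTED 55 /Comput* MapRe* biginsi* 1 1 RUN 120
SD STARTED 56 /Manage* Manag* biginsi* 1 1 RUN 119
NameNode DEFINED /HDFS/N*
"""

states = parse_service_states(sample)
started = sorted(name for name, state in states.items() if state == "STARTED")
print(started)  # ['MRSS', 'SD', 'WEBGUI']
```

Reading only the first two fields keeps the parser independent of the optional columns, which are blank for services in the DEFINED state.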
Accessing the Platform Symphony Management Console
The Platform Symphony console will usually be on the same host if you follow the installation
recommendations above, but will be on a different port. Port 18080 is the default. You should be able to
log into the Platform Symphony management console at http://<master-host>:18080/platform. The
default administrator login for Platform Symphony is “Admin / Admin”.
In production clusters there will normally be multiple Platform Symphony management hosts. Setting
this up is beyond the scope of this paper and is covered in the Platform Symphony documentation.
Figure 2 - Logging into the Platform Symphony Management Console
If you are having trouble connecting to the Symphony web console you can use the command “egosh
service view WEBGUI” to see details about the web service.
The WEBGUI service should be started automatically by EGO, but if it becomes necessary to start or
stop the service, you can use the following commands:
$ egosh logon
Enter Admin / Admin as the username and the password when prompted
$ egosh service start WEBGUI
$ egosh service stop WEBGUI
The WEBGUI service is implemented using Apache Tomcat.
If there are problems with the WEBGUI, you can inspect the log at ${EGO_TOP}/gui/logs/catalina.out
for information about what might be wrong with the service.
If you cannot connect to the Symphony console, the connection may be blocked by your firewall
configuration. You can disable your firewall temporarily to see if this is the cause.
# service iptables stop
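Before disabling a firewall, it may be worth confirming from the client machine whether the console ports accept TCP connections at all. The following is a minimal sketch using only the Python standard library; the helper name is hypothetical, "localhost" is a placeholder for your own master host, and the ports are the defaults mentioned in this paper (8080 for the BigInsights console, 18080 for the Platform Symphony console).

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False

if __name__ == "__main__":
    # "localhost" is a placeholder; substitute your Symphony master host.
    for port in (8080, 18080):
        status = "reachable" if port_open("localhost", port) else "not reachable"
        print(f"port {port}: {status}")
```

A port that is reachable on the master host itself but not from a remote client points toward a firewall or network rule rather than a stopped WEBGUI service.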
If you are not sure what port or host the Platform Symphony GUI was installed on, you should be able to
find it in the XML file that governs the BigInsights installation process (described earlier).
This XML file is generated by the web-based installation process. Platform Symphony related setup
details are found under the “high-availability” section of the file.
<high-availability>
  <configure>false</configure>
  <master-nodes/>
  <baseport>7869</baseport>
  <web-port>18080</web-port>
  <log-directory>var/ibm/biginsights/ps-mapred/logs</log-directory>
  <preferred-ip-mask/>
  ..
  <max-retries>3</max-retries>
  <failover>failover</failover>
</high-availability>
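If the port has been changed from the default, the console address can be derived from this section of the file. The sketch below is illustrative only: the function name and the "master-host" host name are hypothetical, the sample XML is a trimmed copy of the fragment above, and the URL paths are the ones quoted elsewhere in this paper.

```python
import xml.etree.ElementTree as ET

def symphony_urls(xml_text, host):
    """Read <web-port> under <high-availability> and build console URLs."""
    root = ET.fromstring(xml_text)
    if root.tag == "high-availability":
        ha = root
    else:
        ha = root.find(".//high-availability")
    if ha is None:
        raise ValueError("no <high-availability> section found")
    # Fall back to the documented default port if the element is absent.
    port = int(ha.findtext("web-port", default="18080"))
    base = f"http://{host}:{port}"
    return {
        "console": f"{base}/platform",
        "knowledge_center": f"{base}/doc/symphony/6.1/index.html",
    }

# Trimmed copy of the fragment shown above.
sample = """
<high-availability>
  <configure>false</configure>
  <baseport>7869</baseport>
  <web-port>18080</web-port>
</high-availability>
"""

urls = symphony_urls(sample, "master-host")  # hypothetical host name
print(urls["console"])
```

The same function accepts the complete installation file, since it searches for the <high-availability> element wherever it is nested.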
Once a user logs in to the Platform Symphony console on port 18080, they will see the main Platform
Symphony dashboard. This view is mostly used to monitor the high level status of the various
applications and tenants on a Platform Symphony cluster.
For BigInsights users, most of the action will center around the “MapReduce Workload” screen
accessible under “Quick Links”.
Figure 3 - View of Platform Symphony console when logged in as an Administrator
Accessing the Platform Symphony knowledge center
Once you are able to access the Platform Symphony console above, you may want to access the
Platform Symphony Knowledge Center and bookmark it in your browser. The knowledge center is
accessible in a pull down menu under the question mark in the top bar on the Platform Symphony web
interface.
The knowledge center aggregates all of the various Platform Symphony documentation into a
searchable interface. This will prove handy as you learn about Platform Symphony.
A direct link to the knowledge center can be found at this URL (depending on the hostname where the
web interface is running).
http://<masterhost-name>:18080/doc/symphony/6.1/index.html
The command egosh service list shown earlier will show the name of the host running the web
interface (listed as WEBGUI) if you are running on a cluster with multiple master hosts.
The Platform Symphony knowledge center, in particular the documentation dealing with the Platform
Symphony MapReduce framework, will be useful to BigInsights administrators since if you are using
Adaptive MapReduce you are in fact using the Platform Symphony MapReduce framework.
Figure 4 - Platform Symphony Knowledge Center
Platform Symphony Concepts
While the reader of this document is likely to be familiar with Hadoop and various commercial
distributions, they may be less familiar with IBM Platform Symphony. IBM Platform Symphony is a
commercial grid workload and resource management solution that has been used to share resources
among diverse applications in multitenant environments for over a decade. Platform Symphony is
widely deployed as a shared services infrastructure in some of the world’s largest investment banks.
As a quick primer, some of the terminology used in this document is defined
below. Interested readers should also review a document called “IBM
Platform Symphony Foundations” available at http://publibfp.dhe.ibm.com/epubs/pdf/c2750652.pdf .
• Session Manager – service-oriented applications in Platform Symphony are managed by a
session manager. The session manager is responsible for dispatching tasks to service instances,
and collecting and assembling results. The Symphony session manager provides a function
similar in concept to a Hadoop application manager, although it has considerably more
capabilities. Platform Symphony implements job tracker functionality using the session
manager. In this paper the terms job tracker, application manager and session manager are used
interchangeably. While the concept of multiple concurrent application managers in Hadoop is
new with YARN, Platform Symphony has always featured a multitenant design.
• Resource Groups – Unlike Hadoop clusters, Platform Symphony does not make assumptions
about the capabilities of hosts that participate in the cluster. While Hadoop generally assumes
that member nodes are 64-bit Linux hosts running Java, Platform Symphony supports a variety
of hardware platforms and operating environments. Platform Symphony allows hosts to be
grouped into different resource groups, and different types of applications can
share these underlying resource groups in flexible ways.
• Applications – The term application can be a little bit confusing as it is applied to Platform
Symphony. Symphony views an application as the combination of the client-side and service-side
code that comprise a distributed application. This is a more expansive definition than most
people are used to. By this definition an instance of BigInsights might be viewed as a single
application. Examples of Platform Symphony applications are custom applications written in
C++, a commercial ISV application like IBM Algorithmics, Calypso or Murex, or a commercial or
open source Hadoop distribution like Cloudera, BigInsights or open source Hadoop. Platform
Symphony views applications as being an instance of middleware. Various client-side tools
associated with a particular version of Hadoop (Pig, Hive, Sqoop, etc.) can all run against a single
Hadoop application definition. An important concept for those not familiar with Symphony is
that Symphony provisions service instances associated with different applications dynamically.
As a result, there is nothing technically stopping a Platform Symphony cluster from supporting
multiple instances of Hadoop and non-Hadoop environments concurrently.
• Application profiles – As explained above, applications in Symphony are flexible and highly
configurable constructs. An Application Profile in Symphony defines the characteristics of an
application and various behaviors at runtime.
• Consumers – From the viewpoint of a resource manager, an application or tenant on the cluster
is defined as something that needs particular types of resources at runtime. Platform Symphony
uses the term “consumer” to define these consumers of resources and provides capabilities to
define hierarchical consumer trees and express business rules about how consumers share
various types of resources collected into resource groups. The leaf nodes in consumer trees map
to a Symphony application.
 Services – Services are the portions of applications that run on cluster nodes. In a Hadoop
context, administrators likely think of services as equating to a task tracker that runs Map and
Reduce logic. Here again, Symphony takes a broader view. Symphony services are generic. A
service may be a task-tracker associated with a particular version of Hadoop or it may be
something else entirely. When the MapReduce framework is used in Platform Symphony, the
Hadoop service-side code that implements the Task Tracker logic is dynamically provisioned by
Symphony. Symphony owes its name to this ability to orchestrate a variety of services quickly
and dynamically according to sophisticated sharing policies.
 Sessions – A session in Symphony equates to the notion of a job in Hadoop. A client application
in Symphony normally opens a connection to the cluster, selects an application and opens a
session. Behind the scenes Symphony will provision a Symphony Session Manager to manage
the lifecycle of the job. A single Symphony Session Manager may support multiple sessions
(Hadoop jobs) concurrently. A Hadoop job is a special case of a Symphony job. The Hadoop
client will start a session manager that provides JobTracker functionality. Platform Symphony
actually uses the job tracker and task tracker code provided in a Hadoop distribution, however it
uses its own low-latency middleware to more efficiently orchestrate these services on a shared
cluster.
 Repositories – As explained previously, Platform Symphony dynamically orchestrates service-
side code in response to application demand. The binary code that comprises an application
service is stored in a Symphony repository. Normally for Symphony applications, Symphony
services are distributed to compute nodes from a repository service. For Hadoop applications,
code can be distributed either via the repository service, or it can be distributed via the HDFS /
GPFS FPO file system.
 Tasks – Symphony jobs are collections of tasks. Symphony jobs are managed by a session
manager that runs on a management host. The session manager makes sure that instances of
the needed service are running on compute nodes / data nodes on the cluster. Service
instances run under the control of a Symphony Service Instance Manager (SIM). MapReduce
jobs in Symphony work the same way, but in this case the Symphony service is essentially
the Hadoop task tracker logic. On Hadoop clusters, slots are normally designated as running
either map logic or reduce logic. In Symphony, this is again fluid. Because services are
orchestrated dynamically, service instances can run either Map or Reduce tasks. This is an
advantage because it allows full utilization of the cluster as the job progresses. At the start of a
job the majority of slots can be allocated to map tasks while towards the end of the job the
function of slots can be shifted to perform the reduce function.
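The utilization benefit of fluid slots described under Tasks can be illustrated with a toy calculation. The sketch below is not Symphony code. It assumes every task takes one time unit, that reduce tasks start only after all map tasks have finished, and that the cluster has a fixed number of slots; the function names and the task and slot counts are invented for this illustration.

```python
import math

def waves_static(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Time units needed when each slot is permanently typed as map or reduce."""
    return math.ceil(map_tasks / map_slots) + math.ceil(reduce_tasks / reduce_slots)

def waves_fluid(map_tasks, reduce_tasks, total_slots):
    """Time units needed when any slot can run either kind of task."""
    return math.ceil(map_tasks / total_slots) + math.ceil(reduce_tasks / total_slots)

# Hypothetical job: 40 map tasks and 10 reduce tasks on a 10-slot cluster.
static = waves_static(map_tasks=40, reduce_tasks=10, map_slots=6, reduce_slots=4)
fluid = waves_fluid(map_tasks=40, reduce_tasks=10, total_slots=10)
print(static, fluid)
```

Under these simplifying assumptions the fluid cluster finishes in fewer time units because no slot sits idle waiting for a task of "its" type. Real jobs overlap the shuffle with the map phase and have uneven task durations, so actual gains will differ.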
An example of configuring a cluster for multi-tenancy
In this section we describe the step-by-step procedure to set up multiple tenants in the BigInsights
environment. In order to provide a realistic multitenant scenario, the diagram roughly models our
actual customer environment, with names changed to protect client confidentiality.
The actual environment is more complex with hundreds of users, dozens of groups and approximately
thirty different applications planned, but the application sharing is similar to the diagram below. This
diagram maps to the “Consumer Tree” in Platform Symphony. Consumer is a term used from the
resource manager’s perspective. The resource manager views an application as a consumer of
resources, and the resource manager is responsible for allocating requested resources according to
policies that will be described shortly.
Figure 5 - an example consumer hierarchy for applications and departments
By default, BigInsights (which is just a single application on the cluster) maps to a single application and
an associated consumer called “MapReduce61” (the name corresponds to the version of Platform
Symphony used to support MapReduce processing in BigInsights – in this case 6.1.0.1). This is done so
that Symphony can accommodate the versions of MapReduce provided in future versions of BigInsights
and allow those versions to co-exist. This is the first consumer in the consumer tree above.
In the production environment the customer has specific needs:
 They wish to structure “sub-consumers” under the BigInsights consumer definition
(MapReduce61). This gives the cluster administrator the ability to have different run-time
characteristics for different BigInsights applications. It also allows us to set up configurable
sharing policies between our different applications and groups, control what users are allowed
to access what applications, and ensure security between tenants by having different
applications run under different user-IDs if desired.
 In this example, under the BigInsights tenant (MapReduce61) we have several different
applications. We’ve arbitrarily called them “MR_AppA” through “MR_AppN” although in the real
environment these are the names of the client’s business applications. Note that we need to
configure each application (tenant) so that it runs under a different operating system level user-
id for security isolation. We also want to control in a granular way which users and groups have
access to these various applications.
 Also, as shown in figure 5, the client has additional applications used by particular lines of
business that they would also like to deploy on the same cluster. Examples include some Sqoop
workloads, DataMeer, IBM Tealeaf, various in-house developed streaming applications and
others. In this particular customer implementation all of these applications happen to
share the BigInsights MapReduce infrastructure; however, it is important to understand that this need
not be the case. As we’ll see shortly, these applications can be totally different and still be
configured to share infrastructure.
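To make the idea of a consumer tree with sub-consumers more concrete, the following sketch models one as a nested dictionary and lists the leaf consumers that applications would map to. It is only an illustration: the names MapReduceConsumer, MapReduce61 and MR_AppA through MR_AppN come from this paper, but the exact nesting and the helper function are assumptions, and Symphony itself is configured through its console rather than through code like this.

```python
# Hypothetical consumer hierarchy. The nesting shown here is an assumption
# made for illustration, not an export of a real Symphony configuration.
consumer_tree = {
    "MapReduceConsumer": {
        "MapReduce61": {
            "MR_AppA": {},
            "MR_AppB": {},
            "MR_AppN": {},
        },
    },
}

def leaf_paths(tree, prefix=""):
    """Return the full path of every leaf consumer in the tree."""
    paths = []
    for name, children in tree.items():
        path = f"{prefix}/{name}"
        if children:
            paths.extend(leaf_paths(children, path))
        else:
            paths.append(path)
    return paths

for path in leaf_paths(consumer_tree):
    print(path)
```

Each printed path identifies one leaf consumer, which is the level at which sharing policies and run-time characteristics can differ between tenants.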
Adding users to run MapReduce applications
In our example we want to show how multiple users, grouped arbitrarily into one or more groups for
security management, can access tenant applications subject to access controls.
We create some sample cluster users for our illustration. These names represent individual cluster
users. For some lines of business, application administrators may choose to create a shared login like
“fraud” for a group authorized to use a particular fraud analytics application.
InfoSphere BigInsights has a recommended procedure for adding users. When using Platform Symphony
together with BigInsights, it is recommended that users follow the procedures covered in the BigInsights
documentation and use the tool createosusers.sh included in the BigInsights distribution to automate the
creation of OS level users. Doing this ensures that users can access the BigInsights console to run
applications deployed using the BigInsights application framework.
For convenience, the BigInsights infocenter is available on the public internet. For information on adding
users in BigInsights, you can learn more here:
http://www-01.ibm.com/support/knowledgecenter/SSPT3X_2.1.1/com.ibm.swg.im.infosphere.biginsights.admin.doc/doc/bi_admin_add_users.html?lang=en
The specific procedures will depend on whether you are authenticating access via flat files, LDAP, PAM
or PAM+LDAP. In the example below we are using flat files for simplicity.
To create users known to BigInsights, edit the following file:
$BIGINSIGHTS_HOME/console/conf/security/biginsights_users.xml
Add users as shown below.
<?xml version="1.0" encoding="UTF-8"?>
<server>
<featureManager/>
<basicRegistry id="basic" realm="Auth">
<user name="hadoop" password="passw0rd"/>
<user name="biadmin" password="temp4now"/>
<user name="sysadmin2" password="passw0rd"/>
<user name="appadmin2" password="passw0rd"/>
<user name="sysadmin1" password="passw0rd"/>
<user name="appadmin1" password="passw0rd"/>
<user name="dataadmin2" password="passw0rd"/>
<user name="dataadmin1" password="passw0rd"/>
<user name="user3" password="passw0rd"/>
<user name="user2" password="passw0rd"/>
<user name="user1" password="passw0rd"/>
<user name="vivian" password="temp4now"/>
<user name="gord" password="temp4now"/>
<user name="eric" password="temp4now"/>
<user name="michael" password="temp4now"/>
<user name="vince" password="temp4now"/>
<user name="steven" password="temp4now"/>
<user name="tiffany" password="temp4now"/>
<user name="appA" password="temp4now"/>
<user name="appB" password="temp4now"/>
<user name="appC" password="temp4now"/>
</basicRegistry>
</server>
The next step is to define groups and associate users with groups in the companion file
biginsights_groups.xml. This is an example only. The specifics
will depend on how you wish to structure your own users and groups.
<?xml version="1.0" encoding="UTF-8"?>
<server>
<featureManager/>
<basicRegistry id="basic" realm="Auth">
<group name="supergroup" gid="4000">
<member name="hadoop" uid="4000"/>
<member name="biadmin" uid="200"/>
</group>
<group name="appAdmins" gid="4100">
<member name="appA" uid="4100"/>
<member name="appB" uid="4101"/>
<member name="appC" uid="4102"/>
</group>
<group name="sysAdmins" gid="4200">
<member name="sysadmin1" uid="4200"/>
<member name="sysadmin2" uid="4201"/>
</group>
<group name="dataAdmins" gid="4300">
<member name="dataadmin1" uid="4300"/>
<member name="dataadmin2" uid="4301"/>
</group>
<group name="users" gid="4400">
<member name="vivian" uid="6001"/>
<member name="gord" uid="6002"/>
<member name="eric" uid="6003"/>
<member name="michael" uid="6004"/>
<member name="vince" uid="6005"/>
<member name="steven" uid="6006"/>
<member name="tiffany" uid="6007"/>
</group>
<group name="groupA" gid="5000">
<member name="vivian" uid="6001"/>
<member name="gord" uid="6002"/>
<member name="eric" uid="6003"/>
<member name="michael" uid="6004"/>
<member name="vince" uid="6005"/>
<member name="steven" uid="6006"/>
<member name="tiffany" uid="6007"/>
</group>
<group name="groupB" gid="5001">
<member name="vivian" uid="6001"/>
<member name="gord" uid="6002"/>
<member name="eric" uid="6003"/>
<member name="michael" uid="6004"/>
<member name="vince" uid="6005"/>
<member name="steven" uid="6006"/>
<member name="tiffany" uid="6007"/>
</group>
<group name="groupC" gid="5002">
<member name="vivian" uid="6001"/>
<member name="gord" uid="6002"/>
<member name="eric" uid="6003"/>
<member name="michael" uid="6004"/>
<member name="vince" uid="6005"/>
<member name="steven" uid="6006"/>
<member name="tiffany" uid="6007"/>
</group>
</basicRegistry>
</server>
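Because each user is repeated in several groups, a uid is easy to mistype, and two application accounts that share a uid would not be isolated from one another at the OS level. A small check such as the sketch below can flag a uid that is assigned to more than one name before the OS accounts are created. The sample data and the function name are hypothetical; only the element and attribute names (group, member, name, uid) are taken from the file format shown above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample in the same shape as the groups file above.
SAMPLE_GROUPS_XML = """<server>
  <basicRegistry id="basic" realm="Auth">
    <group name="appAdmins" gid="4100">
      <member name="appX" uid="4100"/>
      <member name="appY" uid="4101"/>
      <member name="appZ" uid="4101"/>
    </group>
    <group name="groupA" gid="5000">
      <member name="appX" uid="4100"/>
    </group>
  </basicRegistry>
</server>
"""

def conflicting_uids(xml_text):
    """Map each uid that is assigned to more than one user name to those names."""
    names_by_uid = defaultdict(set)
    for member in ET.fromstring(xml_text).iter("member"):
        names_by_uid[member.get("uid")].add(member.get("name"))
    return {uid: sorted(names) for uid, names in names_by_uid.items() if len(names) > 1}

print(conflicting_uids(SAMPLE_GROUPS_XML))
```

The same user appearing with the same uid in several groups is expected and is not reported; only a uid shared by different names is.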
In addition to having user IDs that map to individuals, I may want particular applications to execute on the
cluster under a specific user ID. For example, if my application is called “appA” I may want to have it
execute under a Linux user ID with the same name for simplicity. To accommodate this, notice that
we’ve added application-specific users to the biginsights_users.xml file in the example above.
You can add users using operating system facilities, but if you do, these users will not be recognized as
having credentials within the BigInsights web interface. They will still work with Symphony and the
BigInsights Hadoop framework however.
The example below shows how additional users can be added at the OS level. Users added this way
will be unable to log in to the BigInsights console.
# useradd fred
# useradd george
# useradd frank
Once you have edited the BigInsights XML files to define users and groups as shown above, you are
ready to run the createosusers.sh script to create these accounts and groups at the operating system
level as well.
Run the createosusers.sh script as user “biadmin”.
# createosusers.sh \
  $BIGINSIGHTS_HOME/console/conf/security/biginsights_groups.xml \
  $BIGINSIGHTS_HOME/console/conf/security/biginsights_users.xml \
  <biadmin's password>
By following the procedure above to create users and groups, you will be able to run and monitor jobs
from both the BigInsights console and the Platform Symphony console.
Figure 6 - user Tiffany known as a BigInsights user is known to the Platform Symphony GUI
Figure 7 - user Tiffany and others can also run jobs via the BigInsights console.
Provide access to the BigInsights / Platform Computing cluster
For each operating system user who will be submitting jobs, make sure that their .bashrc file (or
equivalent depending on your shell) in the user’s home directory is configured to source the BigInsights
environment as shown below. If you have followed the procedures above, this should be done for you
automatically. We include these details because you may have additional users not known to BigInsights
that require access to Platform Symphony.
Sourcing the BigInsights environment will ensure that various shell variables like $PATH and
$CLASSPATH as well as environment variables specific to BigInsights and Platform Symphony are in the
environment when the user logs on. This will allow them to immediately run both BigInsights and
Symphony commands. If you are adding many users outside the procedure recommended above to add
BigInsights users, and you want them all to have access to the cluster, it will be faster to adjust the
system-wide template for .bashrc file (in /etc/skel) or adjust the common /etc/bashrc depending on
your preference.
If you have followed the instructions above, this step may not be necessary, but it is a good idea to
check that when users login they are inheriting an environment appropriate for running BigInsights jobs
and that they have access to the Platform Symphony environment.
In our case we want both our named users and the user IDs that our applications will run under in
Symphony (see the concept of impersonation explained later) to source the environment and be able to
run commands.
[root@biginsights gord]# cat .bashrc
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# User specific aliases and functions
# source the environment for BigInsights and Platform Symphony
source /opt/ibm/biginsights/conf/biginsights-env.sh
You should be able to su to your created user ID after this and run Symphony or BigInsights commands.
Below we see that I can run a Symphony command confirming that my environment is set up correctly.
Note that with the installation of BigInsights we are entitled to use Platform Symphony Advanced
Edition, which is the version of Symphony that supports the Hadoop MapReduce framework. We are not
entitled to use the other add-on products listed.
[root@biginsights /]# su - gord
[gord@biginsights ~]$ egosh entitlement info
Symphony Edition : Advanced
Desktop Harvesting : Not Entitled
Server Harvesting : Not Entitled
Virtual Server Harvesting : Not Entitled
GPU : Not Entitled
[gord@biginsights ~]$
After following the procedure above, it is a good idea to make sure that our /etc/group file reflects the
setup we’ve configured in the BigInsights XML files.
In /etc/group, define the users that will be allowed to submit workloads on behalf of each group.
This is a very simple example. In reality, different users would belong to different groups and these
group names would be meaningful in the context of how the customer organizes their business.
groupA:x:5000:vivian,gord,eric,michael,vince,steven,biadmin
groupB:x:5001:vivian,gord,eric,michael,vince,steven,biadmin
groupC:x:5002:vivian,gord,eric,michael,vince,steven,biadmin
groupD:x:5003:vivian,gord,eric,michael,vince,steven,biadmin
groupF:x:5004:vivian,gord,eric,michael,vince,steven,biadmin
groupG:x:5005:vivian,gord,eric,michael,vince,steven,biadmin
groupH:x:5006:vivian,gord,eric,michael,vince,steven,biadmin
groupI:x:5007:vivian,gord,eric,michael,vince,steven,biadmin
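When many groups are defined, it can help to verify membership programmatically rather than by eye. The sketch below parses entries in the standard /etc/group layout (name:password:gid:member list) and reports which groups a user belongs to. The function names are invented for this illustration, and the sample lines are copied from the example above rather than read from a live system.

```python
def parse_group_line(line):
    """Split one /etc/group entry into (name, gid, members)."""
    name, _password, gid, members = line.strip().split(":")
    return name, int(gid), [m for m in members.split(",") if m]

def groups_for_user(user, group_lines):
    """Return the names of the groups whose member list includes the user."""
    return [
        name
        for name, _gid, members in map(parse_group_line, group_lines)
        if user in members
    ]

lines = [
    "groupA:x:5000:vivian,gord,eric,michael,vince,steven,biadmin",
    "groupB:x:5001:vivian,gord,eric,michael,vince,steven,biadmin",
]
print(groups_for_user("gord", lines))
print(groups_for_user("fred", lines))
```

On a real host the same function could be fed the lines of /etc/group, keeping in mind that a user's primary group is recorded in /etc/passwd and will not appear in these member lists.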
Understanding Platform Symphony Impersonation
Now is a good time to explain the concept of “impersonation” in Platform Symphony. Symphony has
two different workload execution modes:
 Simple Workload Execution Mode
 Advanced Workload Execution Mode
This is normally an installation option with Platform Symphony. The BigInsights Enterprise Edition installation
automatically installs Platform Symphony in Advanced Workload Execution Mode. The term workload
execution mode is frequently abbreviated as WEM in the Symphony documentation. In advanced workload execution
mode, core Symphony services run as root, and application administrators are able to control the
user ID that clustered applications run under.
Our approach to security hinges on this concept of impersonation in Symphony and we will see shortly
how we configure our applications to run under specific user credentials and control what users have
access to what applications and resources. The section called “Security within the MapReduce
framework” in the MapReduce user guide in the Platform Symphony documentation discusses this in
detail.
The customer that this paper is modeled after employs Kerberos authentication for their MapReduce
jobs to ensure security and to ensure that a service that supports impersonation cannot be spoofed.
Configuring Kerberos is beyond the scope of this short document, but customers will be pleased that
this capability exists. Symphony is frequently deployed in secure environments where these capabilities
are important.
Configuring OS groups for the multitenant environment
For users making use of Platform Symphony (both named users and the user IDs that applications will
run under via impersonation) these IDs need to be part of the OS group that owns the BigInsights (and
by extension the Symphony) installation.
In our installation, BigInsights was installed as part of the “biadmin” group, so we adjust the group
membership so that each application ID that Symphony jobs will run under is a part of the BigInsights
group.
biadmin:x:0:root,biadmin,gord,eric,vivian,appA,appB,appC,appD,appE,appF,appG
bin:x:1:root,bin,daemon
daemon:x:2:root,bin,daemon
..
If you are unsure what group BigInsights was installed under, issue a command like
$ ls -al ${EGO_TOP}
You will see the user and group that own each file. This will vary depending on how you installed
BigInsights but the default group is biadmin.
Submitting a test job as a user to verify the configuration
As we mentioned before, by default BigInsights is configured to use an application called MapReduce6.1,
which maps to the consumer called /MapReduceConsumer/MapReduce61.
I should be able to log in to any of the accounts created and run a sample Hadoop job. The sleep
command included with the BigInsights examples is a convenient Hadoop application for testing the
MapReduce framework. This command submits variable numbers of Map and Reduce tasks that simply
sleep for variable amounts of time. The example below submits two mappers that each sleep for 2
seconds (-mt 2000, in milliseconds) followed by ten reducers that also sleep for 2 seconds (-rt 2000).
Besides being a useful validation that everything is working, this test illustrates the performance
advantage of using Platform Symphony as the MapReduce framework over open-source Hadoop.
Platform Symphony can run tests like this, made up of short-running map and reduce tasks, dramatically faster than
open source Hadoop – often more than ten times faster, even when a competing cluster is configured
with a short polling interval.
Note that as the test Hadoop job runs, everything is identical to open source Hadoop (it is actually the
BigInsights supplied Hadoop classes that are running) except that the JobTracker logic in
Hadoop is running inside a Symphony Session Manager.
Note also that the running job is given a Platform Symphony job ID (job_ssm_0401 in this example).
Because Platform Symphony is managing the job execution, it is able to manage this job as well as other
jobs on the cluster including non-Hadoop jobs.
[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -m 2 -r 10 -mt 2000 -rt 2000
14/03/15 13:14:25 INFO internal.MRJobSubmitter: Connected to JobTracker(SSM)
14/03/15 13:14:26 INFO internal.MRJobSubmitter: Job <Sleep job> submitted, job id <401>
14/03/15 13:14:26 INFO internal.MRJobSubmitter: Job will not verify intermediate data integrity using checksum.
14/03/15 13:14:26 INFO mapred.JobClient: Running job: job_ssm_0401
14/03/15 13:14:27 INFO mapred.JobClient: map 0% reduce 0%
14/03/15 13:14:36 INFO mapred.JobClient: map 100% reduce 0%
14/03/15 13:14:46 INFO mapred.JobClient: map 100% reduce 20%
14/03/15 13:14:50 INFO mapred.JobClient: map 100% reduce 40%
14/03/15 13:14:54 INFO mapred.JobClient: map 100% reduce 60%
14/03/15 13:14:58 INFO mapred.JobClient: map 100% reduce 80%
14/03/15 13:14:59 INFO mapred.JobClient: map 100% reduce 100%
14/03/15 13:14:59 INFO mapred.JobClient: Job complete: job_ssm_0401
14/03/15 13:15:00 INFO mapred.JobClient: Counters: 18
14/03/15 13:15:00 INFO mapred.JobClient: Shuffle Errors
14/03/15 13:15:00 INFO mapred.JobClient: WRONG_PATH=0
14/03/15 13:15:00 INFO mapred.JobClient: CONNECTION=0
14/03/15 13:15:00 INFO mapred.JobClient: IO_ERROR=0
14/03/15 13:15:00 INFO mapred.JobClient: FileSystemCounters
14/03/15 13:15:00 INFO mapred.JobClient: FILE_BYTES_WRITTEN=5146
14/03/15 13:15:00 INFO mapred.JobClient: Map-Reduce Framework
14/03/15 13:15:00 INFO mapred.JobClient: Reduce input groups=400
14/03/15 13:15:00 INFO mapred.JobClient: Combine output records=0
14/03/15 13:15:00 INFO mapred.JobClient: Map output records=400
14/03/15 13:15:00 INFO mapred.JobClient: SHUFFLED_MAPS=20
14/03/15 13:15:00 INFO mapred.JobClient: Reduce shuffle bytes=2440
14/03/15 13:15:00 INFO mapred.JobClient: Combine input records=0
14/03/15 13:15:00 INFO mapred.JobClient: Spilled Records=800
14/03/15 13:15:00 INFO mapred.JobClient: SPLIT_RAW_BYTES=0
14/03/15 13:15:00 INFO mapred.JobClient: Map output bytes=1600
14/03/15 13:15:00 INFO mapred.JobClient: Reduce input records=400
14/03/15 13:15:00 INFO mapred.JobClient: GC_TIME_MILLIS=0
14/03/15 13:15:00 INFO mapred.JobClient: FAILED_SHUFFLE=0
14/03/15 13:15:00 INFO mapred.JobClient: MERGED_MAP_OUTPUTS=20
14/03/15 13:15:00 INFO mapred.JobClient: Reduce output records=0
[gord@biginsights ~]$
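When comparing runs, it is convenient to pull the elapsed time out of the client output instead of reading timestamps by hand. The sketch below does this for a few lines copied from the transcript above. It assumes the leading timestamp is in yy/mm/dd HH:MM:SS form and that the "Running job:" and "Job complete:" lines mark the start and end of the job; the function names are invented for this illustration.

```python
from datetime import datetime

# A few lines copied from the job output shown above.
LOG = """14/03/15 13:14:26 INFO mapred.JobClient: Running job: job_ssm_0401
14/03/15 13:14:36 INFO mapred.JobClient: map 100% reduce 0%
14/03/15 13:14:59 INFO mapred.JobClient: Job complete: job_ssm_0401
"""

def timestamp(line):
    """Parse the leading 17-character 'yy/mm/dd HH:MM:SS' timestamp of a log line."""
    return datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")

def job_elapsed_seconds(log_text):
    """Seconds between the 'Running job' line and the 'Job complete' line."""
    start = end = None
    for line in log_text.splitlines():
        if "Running job:" in line:
            start = timestamp(line)
        elif "Job complete:" in line:
            end = timestamp(line)
    if start is None or end is None:
        raise ValueError("log does not contain both a start and a completion line")
    return (end - start).total_seconds()

print(job_elapsed_seconds(LOG))
```

This measures wall-clock time as seen by the client, so it includes scheduling and start-up overhead as well as the time the tasks spend sleeping.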
As this job runs, we can monitor it in the Symphony GUI by using the QuickLinks menu and
selecting “MapReduce Workload” to open the MapReduce workload screen shown below. As the
MapReduce job runs, you will see a view like the one shown in figure 8.
Figure 8 - monitoring our job using the Platform Symphony web interface
Note that the submitted job is associated with the application MapReduce6.1 (this is the application
that BigInsights submits jobs to by default).
You can also launch jobs via the standard BigInsights Web GUI and watch them run either from within
the BigInsights console or from within the Platform Symphony Web interface.
Figure 9: Launching a terasort job from BigInsights
The Terasort example in BigInsights uses Oozie to manage the sequence of running the teragen
application to generate the dataset to be sorted, followed by Terasort itself.
As jobs run in the BigInsights context, we see them running in Platform Symphony associated with
the MapReduce6.1 application that BigInsights is bound to.
Any BigInsights application that exercises the MapReduce framework including services like Hive, Pig,
Big SQL, Bigsheets and others will work with Symphony in this same way.
Figure 10 - Platform Symphony monitoring Terasort job run from BigInsights
Associating BigInsights with a Symphony Application
We’ve mentioned a few times that BigInsights is associated with the Symphony MapReduce6.1
application and customers frequently ask where this association is made.
[biadmin@biginsights ~]$ cd $HADOOP_CONF_DIR
[biadmin@biginsights hadoop-conf]$ cat pmr-site.xml
<?xml version="1.0"?>
<!-- This is a PMR configuration file. -->
<!-- It is intended for PMR internal parameters. Do not define -->
<!-- hadoop parameters here. -->
<configuration>
<property>
<name>mapreduce.application.name</name>
<value>MapReduce6.1</value>
<description>The mapreduce application name.</description>
</property>
<property>
<name>mapreduce.map.skip.commit.task</name>
<value>false</value>
</property>
The association is made in the file pmr-site.xml in the BigInsights directory $HADOOP_CONF_DIR, where
you can modify the Symphony application name that BigInsights will submit jobs to. It is important to
have this flexibility because over time customers may end up with different versions of BigInsights along
with other applications co-existing on the same cluster.
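To confirm which Symphony application a given configuration points at, the property can be read with a few lines of code. The sketch below looks up mapreduce.application.name in a Hadoop-style configuration document. The embedded sample is a shortened, self-contained stand-in for pmr-site.xml and the function name is hypothetical; on a real system the text would be read from the file in $HADOOP_CONF_DIR.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for pmr-site.xml, closed so that it parses on its own.
SAMPLE_PMR_SITE = """<configuration>
  <property>
    <name>mapreduce.application.name</name>
    <value>MapReduce6.1</value>
  </property>
  <property>
    <name>mapreduce.map.skip.commit.task</name>
    <value>false</value>
  </property>
</configuration>
"""

def get_property(xml_text, key):
    """Return the value of a Hadoop-style <property> entry, or None if absent."""
    for prop in ET.fromstring(xml_text).iter("property"):
        if prop.findtext("name") == key:
            return prop.findtext("value")
    return None

print(get_property(SAMPLE_PMR_SITE, "mapreduce.application.name"))
```

A value supplied with -D on the hadoop command line, as in the Sqoop example later in this section, applies to that single job, so the file reflects only the default.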
Enabling Symphony Repository Services
By default, when Platform Symphony is installed the repository service in Symphony is disabled. The
function of the repository service is to store the application services and distribute the code that
implements services dynamically to service instances on the cluster.
The MapReduce framework in Platform Symphony by default distributes the application service code
(specifically the application logic that implements the task tracker functionality and Jar files that
implement map and reduce logic) by copying them to HDFS with a high block replication factor so that
the files will be accessible on all nodes.
If you are planning to add and remove application profiles or consumers in Symphony, you will need to start
the Symphony repository service. Otherwise you will encounter errors, as some of these services assume
that the repository service in Symphony is running.
This can be done through the web interface by following these steps:
 From the QuickLinks menu select system services
 For the service abbreviated as RS, select “Start” from the Actions pull-down menu
 After you refresh the GUI view you should see the service has started on a master host
Figure 11 - Managing system services in Platform Symphony
The system services view is useful. This shows a list of system services that EGO is managing. Note that
EGO is managing not only native Platform Symphony services, but BigInsights services as well.
Adding a new Application / Tenant
Fundamental to the design of BigInsights 2.1 (and Open Source Hadoop) is the idea that there is only a
single instance of a Hadoop cluster.
Platform Symphony, however, supports multiple applications sharing the same cluster. It is also flexible
enough to support multiple instances of an application environment like BigInsights, although
configuring this is outside the scope of this paper.
Examples of tenants we may want to add might be:
 A native Symphony application written to the Platform Symphony APIs
 A batch-oriented workload (when Platform LSF is installed as an add-on to Platform Symphony)
 A distinct Hadoop MapReduce environment
 Third party applications like SAS, MATLAB or Revolution R
 A separate Hadoop MapReduce application instance that shares resources between applications
but that shares the same Hadoop binaries and file system instance.
In this example we are showing the last case where multiple Hadoop applications share resources.
From the Platform Symphony Dashboard:
 Use the QuickLinks menu and select Resources
 Select Workload / MapReduce / Application profiles from the pull down menu
There will already be an application profile defined for MapReduce6.1. This is installed
automatically with Symphony and is the application profile that is used by BigInsights by default.
To add a new application profile to support a new tenant, click the “Add” button. The screen shown in
figure 12 will appear.
Figure 12 - Adding a new Application definition
We supply the following parameters:
 Our application name (SQOOP) – We require this tenant to use a different version of SQOOP
than the version included with BigInsights, as mentioned earlier
 We define the user-ID that starts the job tracker and runs jobs – This is the impersonation
feature described earlier. This particular application will run under the OS id AppB.
• The default priority – Symphony has 10,000 priority levels. By default we are going to submit Sqoop
jobs with a low priority.
• The user accounts that have access to this application – Note that we’ve provided all
users in GroupA access to the application along with named operating system and Platform
Symphony users.
Based on this information, Platform Symphony adds an application named Sqoop with a set of
reasonable defaults for a Hadoop MapReduce job. To make sure that our new application is working, a
user entitled to use the application can submit a test job as before.
Note that in this case I specify Sqoop as the application name on the command line so that the job is
handled by the new MapReduce application definition rather than the default one.
[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -Dmapreduce.application.name=Sqoop -m 2 -r 10 -mt 2000 -rt 2000
14/03/13 12:32:07 INFO internal.MRJobSubmitter: Connected to JobTracker(SSM)
14/03/13 12:32:08 INFO internal.MRJobSubmitter: Job <Sleep job> submitted, job id <1>
14/03/13 12:32:08 INFO internal.MRJobSubmitter: Job will not verify intermediate data integrity using checksum.
14/03/13 12:32:08 INFO mapred.JobClient: Running job: job_ssm_0001
14/03/13 12:32:09 INFO mapred.JobClient: map 0% reduce 0%
14/03/13 12:32:37 INFO mapred.JobClient: map 100% reduce 0%
14/03/13 12:32:52 INFO mapred.JobClient: map 100% reduce 20%
14/03/13 12:32:56 INFO mapred.JobClient: map 100% reduce 40%
14/03/13 12:33:00 INFO mapred.JobClient: map 100% reduce 60%
14/03/13 12:33:05 INFO mapred.JobClient: map 100% reduce 80%
14/03/13 12:33:07 INFO mapred.JobClient: map 100% reduce 100%
14/03/13 12:33:07 INFO mapred.JobClient: Job complete: job_ssm_0001
14/03/13 12:33:09 INFO mapred.JobClient: Counters: 18
..
What has changed is that in Figure 13 we see that our job is now running under our separate application
definition called Sqoop.
This shows the basic process of adding a new application profile for a MapReduce job to Symphony to
support our additional tenants. The next step is to edit the configuration of the tenant as
necessary to suit the unique needs of the application. For example, my requirement may be as simple as
re-pointing some environment variables to different installation and configuration
directories for Sqoop for jobs submitted to this application.
[biadmin@biginsights hadoop-conf]$ set | grep SQOOP
SQOOP_CONF_DIR=/opt/ibm/biginsights/sqoop/conf
SQOOP_HOME=/opt/ibm/biginsights/sqoop
[biadmin@biginsights hadoop-conf]$
Note that below my Job ID has reset to “1” since this is the first job associated with this particular
application tenant.
Figure 13 - Sleep job running under newly created application definition
Under “Workload” / “MapReduce” / “Application Profiles” we can define as many separate
applications as we’d like. The view below shows additional applications added using the same process detailed
for the Sqoop application.
Figure 14 - Available MapReduce Application Profiles
Only MapReduce applications appear because “Application Profiles” has been selected from the
MapReduce submenu. Figure 15 shows a similar view of “Applications” accessible from the same
workload dropdown menu, except that instead of looking at Application Profiles I’m looking at a dashboard of
the applications themselves with job-related status.
Figure 15 - Dashboard of MapReduce applications
Configuring application properties
When a new application profile is created for a new application, a default template is used that
represents reasonable settings for a MapReduce workload. The next step is to configure the application
profile to meet the unique requirements of each application workload.
In the Platform Symphony reference manual accessible from the knowledge center, application profiles
are covered in detail. Some of the more commonly configured settings are shown below.
To configure application properties for Sqoop, modify the application profile by selecting “Workload” /
“MapReduce” / “Application Profiles” from the top menu on the MapReduce applications screen. Select
the application profile definition for Sqoop created earlier and select Modify.
A new window will appear that allows detailed settings for the application to be changed. This web
interface modifies the application profile definitions (discussed shortly) that are stored in the
directory $EGO_TOP/data/soam/profiles on the Platform Symphony master host. Enabled profiles
reside in a subdirectory called “enabled” and disabled profiles reside in a subdirectory called “disabled”.
The first tab in the interface, called Application Profile, allows application profile settings to be adjusted. The
second tab, labeled Users, provides an opportunity to modify the users and groups that will have access
to the application profile.
Figure 16 - Application Profile
Some important tips about Application Profiles:
• Application Profile names must be unique
• An Application Profile can be associated with only a single consumer
• In the consumer tree, MapReduce applications are by default placed under the
MapReduceConsumer tree
• You can find templates for various application profiles in the directory
$SOAM_HOME/6.1/Samples/Templates. The term SOAM in Symphony refers to the service-oriented
application middleware on which the MapReduce service is implemented
The application profile can be viewed in an Advanced Configuration, a Basic Configuration or in a
Dynamic Configuration Update mode. The Dynamic Configuration Update mode is not covered here, but
essentially it allows an administrator to register a profile fragment (part of an application profile)
modifying either the session types or services sections of the profile.
The General settings area contains settings such as where metadata associated with jobs and job history is
stored, the default service definition to be used (MapReduce for MapReduce applications) and resource
requirements.
Resource requirements are an important concept in Symphony. In this simple example by using the
syntax “select(!mg)” we are essentially saying run this service on any host that is not tagged as a
member of the management group.
Resource requirement selections in Symphony are flexible and are covered in the Symphony
documentation. I can use SQL-like resource requirement strings to specify the types of resources I
would like to use in a granular way. If, for example, I know that a particular application runs best on a
large-memory PowerLinux machine, I can express a requirement (or preference) for this application with an
appropriate resource requirement string.
select(!mg) && select(PowerResourceGroup) && select(maxmem > 8000 && maxswp >= 16000)
The example above indicates that this service requires hosts that are part of a Power-based
resource group, that are not management hosts, and that have more than 8GB of physical memory and at
least 16GB of swap space.
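To make the selection logic concrete, the following Python sketch applies the same three conditions to a list of hosts. It only illustrates the meaning of the requirement string and is not Symphony's evaluator; the Host class, its field names and the sample hosts are hypothetical, and memory values are assumed to be in MB.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    # Hypothetical host record; not a Symphony data structure.
    name: str
    maxmem: int                                 # physical memory, assumed MB
    maxswp: int                                 # swap space, assumed MB
    groups: set = field(default_factory=set)    # resource groups / tags


def matches(host: Host) -> bool:
    """Mirror select(!mg) && select(PowerResourceGroup)
    && select(maxmem > 8000 && maxswp >= 16000)."""
    return (
        "mg" not in host.groups
        and "PowerResourceGroup" in host.groups
        and host.maxmem > 8000
        and host.maxswp >= 16000
    )


# Illustrative hosts only.
hosts = [
    Host("mgmt01", 32000, 32000, {"mg", "PowerResourceGroup"}),
    Host("power01", 64000, 16000, {"PowerResourceGroup"}),
    Host("power02", 8000, 16000, {"PowerResourceGroup"}),
    Host("x86n01", 64000, 32000, {"ComputeHosts"}),
]
eligible = [h.name for h in hosts if matches(h)]
print(eligible)
```

Only power01 passes: mgmt01 is a management host, power02 does not exceed the memory threshold, and x86n01 is outside the Power resource group.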
Pre-starting application services is a useful feature in Symphony. Application services refer to the
Symphony session manager (SSM) as well as service instance managers and service instances associated
with the application. As a reminder, with MapReduce workloads the SSM can be viewed as an
Application Manager. This is the component that implements the JobTracker logic. Service instances
will load TaskTracker logic appropriate to the version of Hadoop and will start map or reduce tasks
appropriate to the application.
If you have many applications and are frequently sharing slots, pre-starting applications may not be
useful. By default Symphony will start SSMs automatically as clients connect and request services from
the middleware. As resources are assigned to applications, Symphony will dynamically provision needed
service code and start services as appropriate.
Pre-starting applications is useful for applications that need to respond quickly. You can control the
number of slots (each slot can support a map or reduce task) that are pre-started by default.
Figure 17 - Optionally have an application pre-allocate services
A key thing to understand about the Platform Symphony session manager is that it is fully
multithreaded and can accommodate multiple sessions at the same time. A session equates to a
job submitted by a MapReduce user. Each job maps to a session, and each session may have a large
number of tasks.
When multiple users are concurrently submitting jobs to the same application, the scheduling policy
controls how resources are shared. The R_Proportion policy specifies that resources are shared in
proportion to the priority of each job, which is often the most sensible choice.
As an example, if I had 5000 slots allocated to this application consumer definition and JobA was
submitted to the application with priority 4000 and JobB was submitted with priority 1000, Symphony
would run both workloads concurrently under the same application definition giving 80% of available
resources to JobA. Unlike standard Hadoop where resource assignments are static while the job is
executing, Symphony can respond quickly at run-time to re-balance resource allocations between jobs.
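The arithmetic behind this example can be sketched in a few lines of Python. This is only a worked illustration of proportional sharing under the assumption that every job can use all the slots it is offered; the function name is hypothetical and the real scheduler takes other factors, such as the number of pending tasks, into account.

```python
def proportional_shares(total_slots: int, priorities: dict) -> dict:
    """Split total_slots among jobs in proportion to their priorities.

    Integer division is used, so a few slots may be left unassigned
    when the priorities do not divide the total evenly.
    """
    total_priority = sum(priorities.values())
    return {
        job: total_slots * priority // total_priority
        for job, priority in priorities.items()
    }


shares = proportional_shares(5000, {"JobA": 4000, "JobB": 1000})
print(shares)  # {'JobA': 4000, 'JobB': 1000}
```

With 5000 slots, JobA receives 4000 slots (80%) and JobB receives 1000 slots (20%), matching the split described above.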
Note that since each SSM maps to an application (a MapReduce application in this case) this scheduling
policy controls how multiple jobs running in the same application context share resources. A separate
resource sharing plan discussed shortly controls how sharing is implemented more broadly between
applications and tenants.
The term application can be confusing to users not familiar with Symphony. Symphony is referring to an
application in the context of the Hadoop services themselves – the binary code that comprises
BigInsights services like the JobTracker and the TaskTracker. It is not referring to the actual application
code written by users that runs on the Hadoop framework. In this case, a single Symphony application can run
different user applications within the same Hadoop MapReduce context.
Figure 18 - controlling how multiple jobs associated with an application share resources
The Symphony application profile definition provides precise control over how MapReduce workloads
run, and this is useful to advanced users (in our experience most sites running Hadoop are already quite
advanced and will appreciate this).
A nice feature of Symphony is that because the execution logic is provisioned dynamically, slots are
interchangeable between mappers and reducers. The settings in Figure 19 allow this to be configured
along with preferences for default ratios between mappers and reducers and precise configuration on a
per resource group basis.
Figure 19 - MapReduce Settings associated with an Application
Symphony can allow multiple service definitions to exist for each application, and the service definition
section provides granular control over this capability. This is useful for applications written to Platform
Symphony’s native APIs and may be useful for Hadoop developers. For BigInsights it is not necessary to
change this setting because Platform has already implemented a service called “RunMapReduce” that is
started by service instance managers to handle MapReduce workloads. The process of starting this
service is automatic for the MapReduce service. The service itself can be found in the directory
${EGO_TOP}/soam/mapreduce/6.1/linux2.6-glibc2.3-x86_64/etc. Note that the Start Command in
Figure 20 allows for operating system specific implementations of a service definition for an application.
Figure 20 - configuring service definitions for the application
In the application profile definition, an administrator can control environment variables associated with the
application. This is an important capability for ensuring multitenancy. By using environment variables I
can control what applications run in granular ways. If I choose, I could have an application profile that
associates itself with a separate Hadoop instance by defining application specific variables such as
$HADOOP_HOME, $HADOOP_CONF_DIR that reference different software versions and different
configuration files.
I can also resolve technical issues that often occur where particular applications depend on
particular versions or distributions of the Java run-time environment by defining $JAVA_HOME to point
to the version of Java needed by a specific application.
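The idea of layering application-specific variables over a common base environment can be sketched as follows. This is a minimal illustration rather than how Symphony applies profile settings internally; the function name and the /opt/tenantB paths are hypothetical.

```python
import os


def build_app_env(base_env, overrides: dict) -> dict:
    """Return a copy of base_env with application-specific variables applied.

    The base environment is left untouched, so each application profile
    can carry its own view of HADOOP_HOME, JAVA_HOME and so on.
    """
    env = dict(base_env)
    env.update(overrides)
    return env


# Hypothetical locations for a tenant that needs its own Hadoop and Java.
tenant_overrides = {
    "HADOOP_HOME": "/opt/tenantB/hadoop",
    "HADOOP_CONF_DIR": "/opt/tenantB/hadoop/conf",
    "JAVA_HOME": "/opt/tenantB/jre",
}
tenant_env = build_app_env(os.environ, tenant_overrides)
print(tenant_env["JAVA_HOME"])
```

A dictionary built this way could be passed as the env argument of subprocess.run to start a process that sees only the tenant's settings, which is the same isolation effect the application profile provides for its services.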
Figure 21 - configuring the environment for the application
This is a good time to mention that while much of the discussion in Hadoop centers on Java because
Hadoop itself is written in Java, Symphony supports heterogeneous applications. It does not matter
whether application clients or services are written in C/C++, Java, scripting languages or even C# in
Microsoft .NET environments. The versatility to handle all types of workloads is what makes Symphony
powerful as a multitenant environment.
Another unique capability that Symphony brings to Hadoop is the notion of “Recoverable sessions”. This
concept does not exist in open source Hadoop, where the job tracker is implemented in a simplistic
way. If the JobTracker fails at run-time in standard Hadoop, the job needs to be re-started.
The Symphony SOAM middleware however has long supported the notion of journaling transactions so
that Hadoop MapReduce jobs become inherently recoverable. If the software service running the
JobTracker logic fails (and re-starts on the same host or a different host) the Symphony job can recover
from where it left off. This is a major advantage for customers that have long-running Hadoop jobs that
need to complete within specific batch windows.
This and other points of configurability are very important for specific workloads. As another example, if
I have execution logic where the reducer is multi-threaded, I can control the ratio of reducer services to
slots, thereby giving a reducer multiple slots if it can take advantage of them.
Figure 22 - configuring session behaviors in an SSM / Application Manager
Associating applications with consumers
The last section provided some details on how application profiles are used in Symphony to customize
applications to support multi-tenancy. In the Symphony architecture, resources are not actually
allocated to applications directly. They are allocated to consumer definitions, which in turn map to
applications.
This is an important distinction: while the application space is essentially “flat” (I have multiple
applications and flavors of applications of different types), the structure of consumers is usually
hierarchical. This is because most organizational structures are hierarchical.
• A bank may have several lines of business, each with various departments or application groups
• A service provider may have multiple tenant customers, and may provide different application
services for each tenant
• A government agency may have different divisions, each running different applications with a
particular need to segment data access
Symphony allows consumer trees to be set up in flexible ways to accommodate the needs of almost any
organization. A key concept to understand is that the leaf nodes of consumer trees are linked to the
application definitions we looked at in the previous section.
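The relationship between a hierarchical consumer tree and a flat set of applications can be pictured with a small data structure. The sketch below is only a conceptual model, not the format Symphony uses to store consumers; the Consumer class and the group and profile names under Datameer are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Consumer:
    # Hypothetical model of one node in a consumer tree.
    name: str
    application: Optional[str] = None           # set only on leaf consumers
    children: List["Consumer"] = field(default_factory=list)


def leaf_applications(node: Consumer, path: str = "") -> dict:
    """Map each leaf consumer's full path to its application profile name."""
    full_path = f"{path}/{node.name}"
    if not node.children:
        return {full_path: node.application}
    result = {}
    for child in node.children:
        result.update(leaf_applications(child, full_path))
    return result


# Illustrative tree: one consumer per tenant, sub-consumers per user group.
tree = Consumer("MapReduceConsumer", children=[
    Consumer("Sqoop", application="Sqoop"),
    Consumer("Datameer", children=[
        Consumer("GroupA", application="DatameerGroupA"),
        Consumer("GroupB", application="DatameerGroupB"),
    ]),
])
for consumer_path, app in leaf_applications(tree).items():
    print(consumer_path, "->", app)
```

Resources flow down the tree according to the sharing policy at each level, and only the leaves name an application, which is why the application list itself can stay flat.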
Accessing Consumer Definitions
To view consumer definitions, from the MapReduce screen in Symphony select “Resources / Resource
Planning / Consumers”. This is the interface that is used to manage the consumer tree.
Setting up the consumer tree is reasonably straightforward. The left side panel is used to control where
you are in the tree and the right side of the interface allows one to perform operations relative to that
segment of the tree.
Recall from our scenario earlier that we had multiple groups running Datameer
workloads, among which we wanted to enforce sharing policies. Datameer workloads also have specific setup
dependencies that are different from those of BigInsights workloads, so the Datameer workloads require their
own application profile. In addition, we wanted to provide isolation between the work done by different
Datameer application user groups. To achieve this, we have defined sub-consumers under
Datameer with a consumer appropriate for each group. We can also control which users have access to
each consumer. Note the hierarchical notion of consumers in Symphony.
Figure 23 - A populated consumer tree in Symphony
Each leaf node of the consumer tree under Datameer links to a specific application profile. The
association between an application and its position in the consumer tree is made in the application
profile.
Figure 24 - MapReduce applications
Manually editing Consumer Tree definitions
Advanced users may find it easier to manually edit the consumer tree.
Platform Symphony stores consumer tree definitions in $EGO_TOP/kernel/conf in the file
ConsumerTrees.xml.
If you hand edit this file, you will need to restart EGO services to bring the web-based view into
synchronization with the actual contents of the XML files where these settings are persisted.
After editing the ConsumerTrees.xml file as shown above, while logged in as the cluster administrator
(biadmin) please stop and restart EGO services using the BigInsights scripts below to make sure that
changes are reflected in the Platform Symphony console.
$ stop.sh HAManager
$ start.sh HAManager
Controlling access to applications and consumers
In the Sqoop consumer definition above, the built-in Symphony user “Admin” has administrative
responsibility for the consumer. Several other users are listed as being able to access the
application associated with the consumer. The user eric is not a member of the list of permitted users. If
an unauthorized user attempts to submit a job against the application definition (Sqoop) associated with
this Sqoop consumer, they see an error as shown below, as expected.
[eric@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -Dmapreduce.application.name=Sqoop -m 2 -r 10 -mt 2000 -rt 2000
java.io.IOException: interrupted
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:1068)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:1032)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1575)
at org.apache.hadoop.examples.SleepJob.run(SleepJob.java:174)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
..
Caused by: java.lang.InterruptedException: Domain <VEM>: Security error: User: eric is not authorized to perform this operation.
If an authorized user (gord) submits the same workload, note that it runs successfully.
[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -Dmapreduce.application.name=Sqoop -m 2 -r 10 -mt 2000 -rt 2000
14/03/14 08:56:45 INFO internal.MRJobSubmitter: Connected to JobTracker(SSM)
14/03/14 08:56:45 INFO internal.MRJobSubmitter: Job <Sleep job> submitted, job id <102>
14/03/14 08:56:45 INFO internal.MRJobSubmitter: Job will not verify intermediate data integrity using checksum.
14/03/14 08:56:45 INFO mapred.JobClient: Running job: job_ssm_0102
14/03/14 08:56:46 INFO mapred.JobClient: map 0% reduce 0%
14/03/14 08:57:02 INFO mapred.JobClient: map 100% reduce 0%
14/03/14 08:57:11 INFO mapred.JobClient: map 100% reduce 20%
14/03/14 08:57:15 INFO mapred.JobClient: map 100% reduce 40%
14/03/14 08:57:19 INFO mapred.JobClient: map 100% reduce 60%
14/03/14 08:57:23 INFO mapred.JobClient: map 100% reduce 80%
14/03/14 08:57:24 INFO mapred.JobClient: map 100% reduce 100%
14/03/14 08:57:24 INFO mapred.JobClient: Job complete: job_ssm_0102
[gord@biginsights ~]$
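The behavior seen in the two transcripts above amounts to a membership test against the users and groups configured for the application. The following sketch shows that rule in isolation; it is not Symphony's security implementation, and the function name, the access lists and the group assigned to eric are hypothetical.

```python
def is_authorized(user: str, user_groups: set,
                  allowed_users: set, allowed_groups: set) -> bool:
    """A user may submit work if named directly or if a member of an allowed group."""
    return user in allowed_users or bool(user_groups & allowed_groups)


# Hypothetical access list for the Sqoop application.
allowed_users = {"gord"}
allowed_groups = {"GroupA"}

print(is_authorized("gord", set(), allowed_users, allowed_groups))        # True
print(is_authorized("eric", {"GroupC"}, allowed_users, allowed_groups))   # False
```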
Determining the execution user for a consumer
Earlier we explained that by using impersonation, Symphony can control the user IDs that different
application services run under. In the case of the Sqoop application defined earlier, we had set the
application user to appB and this is reflected in the ConsumerTrees.xml definition.
We can verify that impersonation is taking place and that processes are running under the expected
user ID by monitoring the process tree while executing MapReduce jobs like the one above.
To monitor the process tree, use a command like:
$ watch 'ps -ef | grep appB'
As you run the job, you will see the SSM start up unless it is pre-started or the SSM is lingering on a
management host waiting for another job. In this example our services are running on the same node as
the master host, so we see the service instance managers and service instances starting locally to
manage the job. On a larger cluster you would need to watch the compute hosts to validate that the services
are starting as expected and running under the correct user ID.
Figure 25 - verify that services are running under the expected user IDs
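For a scripted version of this check, the owner column of ps output can be matched exactly instead of searching for the user name anywhere on the line. The sketch below parses text in the format produced by ps -eo user,pid,args; the function name is hypothetical and the sample listing is invented for illustration, not captured from the cluster described here.

```python
def processes_for_user(ps_output: str, user: str) -> list:
    """Return (pid, command) pairs for processes owned by user.

    Expects the text produced by `ps -eo user,pid,args`, header included.
    """
    matches = []
    for line in ps_output.splitlines()[1:]:      # skip the header row
        fields = line.split(None, 2)             # owner, pid, full command line
        if len(fields) == 3 and fields[0] == user:
            matches.append((int(fields[1]), fields[2]))
    return matches


# Hypothetical sample. On a live host the text could be captured with
# subprocess.run(["ps", "-eo", "user,pid,args"], capture_output=True, text=True).stdout
sample = """USER       PID COMMAND
root      4101 pem
appB      4188 sim
appB      4203 RunMapReduceService
gord      4250 hadoop jar hadoop-example.jar sleep
"""
print(processes_for_user(sample, "appB"))
```

Matching on the owner field avoids false positives such as the grep process itself, or another user's command line that happens to contain the string appB.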
We can use the pstree command on the management host to understand the process tree.
Figure 26 - pstree can be used to show the process hierarchy
On compute hosts, services are managed by the pem process.
In response to a workload requirement, pem launches a sim process (service instance manager), which
in turn runs a service instance. In this case the service instance is the RunMapReduceService, since this is a
Symphony MapReduce workload.
Figure 27 - process hierarchy on the execution host
When configuring several consumers and applications as we have shown here, it can also be faster to hand
edit the XML-based application profile files.
To access XML application profiles, check the directory $EGO_TOP/data/soam/profiles. The associated
XML profiles will exist in subdirectories with names corresponding to their state. For example Sqoop.xml
can be found in an “enabled” subdirectory since the application is enabled and accepting workload.
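A quick way to see which profiles are in which state is to walk this directory. The sketch below assumes only the layout described above (one subdirectory per state, one XML file per profile); the function name is hypothetical and nothing is printed unless EGO_TOP points at a real installation.

```python
import os
from pathlib import Path


def profiles_by_state(profiles_dir: Path) -> dict:
    """Group application profile names by the state subdirectory holding them."""
    states = {}
    for state_dir in sorted(p for p in profiles_dir.iterdir() if p.is_dir()):
        # The profile name is taken to be the file name without ".xml".
        states[state_dir.name] = sorted(f.stem for f in state_dir.glob("*.xml"))
    return states


# Only runs against a real installation when EGO_TOP is set and the directory exists.
root = Path(os.environ.get("EGO_TOP", "/nonexistent")) / "data" / "soam" / "profiles"
if root.is_dir():
    print(profiles_by_state(root))
```

On the cluster described here, the Sqoop profile would be expected to appear under the enabled state.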
Configuring Sharing Policies
Summary
In this document we’ve described a customer use case involving a multitenant implementation of
InfoSphere BigInsights that permits the following:
• Concurrent execution of different Hadoop applications (including different versions of code) on
the same physical cluster
• Dynamic sharing of resources between tenants in a fashion that maximizes performance and
resource utilization while respecting individual SLAs
• Support for applications other than Hadoop MapReduce to maximize flexibility and allow
capital investments to be re-purposed for multiple requirements
• Security isolation between tenants, removing a major barrier to sharing in many commercial
organizations
These advances in our view are significant. While Hadoop is advancing, competing open source and
commercial distributions are many years away from offering true multitenancy and practical solutions
for supporting multiple workloads on a shared infrastructure. The economic arguments in favor of
resource sharing are compelling. Analytic applications are increasingly comprised of multiple software
components that rely on distributed services. Rather than deploying separate “silos” of application
infrastructure, Platform Symphony provides the option to consolidate these different application
instances on a common foundation thus increasing infrastructure utilization, boosting service levels and
helping significantly reduce costs.

 
IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...
IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...
IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...Hitachi Vantara
 
BLU Acceleration on the Cloud – 101
BLU Acceleration on the Cloud – 101BLU Acceleration on the Cloud – 101
BLU Acceleration on the Cloud – 101IBM Analytics
 
IBM Industry Models and Data Lake
IBM Industry Models and Data Lake IBM Industry Models and Data Lake
IBM Industry Models and Data Lake Pat O'Sullivan
 
The Small and Medium Enterprises utilize some important criteria when acquiri...
The Small and Medium Enterprises utilize some important criteria when acquiri...The Small and Medium Enterprises utilize some important criteria when acquiri...
The Small and Medium Enterprises utilize some important criteria when acquiri...IBM India Smarter Computing
 
Calculating the true value of industry specific clouds linthicum
Calculating the true value of industry specific clouds linthicumCalculating the true value of industry specific clouds linthicum
Calculating the true value of industry specific clouds linthicumDavid Linthicum
 
Consumption-based public cloud (CBPC) model
Consumption-based public cloud (CBPC) modelConsumption-based public cloud (CBPC) model
Consumption-based public cloud (CBPC) modelWerner Feld
 
On-premises, consumption-based private cloud creates opportunity for enterpri...
On-premises, consumption-based private cloud creates opportunity for enterpri...On-premises, consumption-based private cloud creates opportunity for enterpri...
On-premises, consumption-based private cloud creates opportunity for enterpri...Stanton Jones
 
Hybrid Hosting: Evolving the Cloud in 2011
Hybrid Hosting: Evolving the Cloud in 2011Hybrid Hosting: Evolving the Cloud in 2011
Hybrid Hosting: Evolving the Cloud in 2011Rackspace
 
Cloud computing-overview
Cloud computing-overviewCloud computing-overview
Cloud computing-overviewshraddhaudage
 
Cloud Computing - A collection of predictions, principles and providers - Feb...
Cloud Computing - A collection of predictions, principles and providers - Feb...Cloud Computing - A collection of predictions, principles and providers - Feb...
Cloud Computing - A collection of predictions, principles and providers - Feb...William Santiago
 
GigaOm-sector-roadmap-cloud-analytic-databases-2017
GigaOm-sector-roadmap-cloud-analytic-databases-2017GigaOm-sector-roadmap-cloud-analytic-databases-2017
GigaOm-sector-roadmap-cloud-analytic-databases-2017Jeremy Maranitch
 
Why Infrastructure matters?!
Why Infrastructure matters?!Why Infrastructure matters?!
Why Infrastructure matters?!Gabi Bauer
 

Similar to Realizing a multitenant big data infrastructure 3 (20)

Integrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured DataIntegrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured Data
 
Make from your it department a competitive differentiator for your business
Make from your it department a competitive differentiator for your businessMake from your it department a competitive differentiator for your business
Make from your it department a competitive differentiator for your business
 
20110514 PMI San Diego Keynote
20110514 PMI San Diego Keynote20110514 PMI San Diego Keynote
20110514 PMI San Diego Keynote
 
Cloud in the sky of Business Intelligence
Cloud in the sky of Business IntelligenceCloud in the sky of Business Intelligence
Cloud in the sky of Business Intelligence
 
IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...
IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...
IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...
 
BLU Acceleration on the Cloud – 101
BLU Acceleration on the Cloud – 101BLU Acceleration on the Cloud – 101
BLU Acceleration on the Cloud – 101
 
IBM Industry Models and Data Lake
IBM Industry Models and Data Lake IBM Industry Models and Data Lake
IBM Industry Models and Data Lake
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
The Small and Medium Enterprises utilize some important criteria when acquiri...
The Small and Medium Enterprises utilize some important criteria when acquiri...The Small and Medium Enterprises utilize some important criteria when acquiri...
The Small and Medium Enterprises utilize some important criteria when acquiri...
 
Azure Biz
Azure BizAzure Biz
Azure Biz
 
Calculating the true value of industry specific clouds linthicum
Calculating the true value of industry specific clouds linthicumCalculating the true value of industry specific clouds linthicum
Calculating the true value of industry specific clouds linthicum
 
Hadoop in the Cloud
Hadoop in the CloudHadoop in the Cloud
Hadoop in the Cloud
 
Consumption-based public cloud (CBPC) model
Consumption-based public cloud (CBPC) modelConsumption-based public cloud (CBPC) model
Consumption-based public cloud (CBPC) model
 
On-premises, consumption-based private cloud creates opportunity for enterpri...
On-premises, consumption-based private cloud creates opportunity for enterpri...On-premises, consumption-based private cloud creates opportunity for enterpri...
On-premises, consumption-based private cloud creates opportunity for enterpri...
 
Hybrid Hosting: Evolving the Cloud in 2011
Hybrid Hosting: Evolving the Cloud in 2011Hybrid Hosting: Evolving the Cloud in 2011
Hybrid Hosting: Evolving the Cloud in 2011
 
Cloud computing-overview
Cloud computing-overviewCloud computing-overview
Cloud computing-overview
 
Cloud Computing Overview | Torry Harris Whitepaper
Cloud Computing Overview | Torry Harris WhitepaperCloud Computing Overview | Torry Harris Whitepaper
Cloud Computing Overview | Torry Harris Whitepaper
 
Cloud Computing - A collection of predictions, principles and providers - Feb...
Cloud Computing - A collection of predictions, principles and providers - Feb...Cloud Computing - A collection of predictions, principles and providers - Feb...
Cloud Computing - A collection of predictions, principles and providers - Feb...
 
GigaOm-sector-roadmap-cloud-analytic-databases-2017
GigaOm-sector-roadmap-cloud-analytic-databases-2017GigaOm-sector-roadmap-cloud-analytic-databases-2017
GigaOm-sector-roadmap-cloud-analytic-databases-2017
 
Why Infrastructure matters?!
Why Infrastructure matters?!Why Infrastructure matters?!
Why Infrastructure matters?!
 

Recently uploaded

Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 

Recently uploaded (20)

Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 

Realizing a multitenant big data infrastructure 3

  • 1. Realizing a shared, multi-tenant infrastructure for Big Data and Analytic applications using IBM® InfoSphere® BigInsights and IBM Platform Computing™. Last revised: April 19, 2014. By: Gord Sissons, Steven Sit, Eric Fiala, Michael Feiman
  • 2. Contents
      Document History, p. 4
      Introduction, p. 4
      Disclaimers and limitations, p. 4
      About the customer described in this use case, p. 5
      Industry Challenges, p. 5
      Impact on Information Technology, p. 6
      The Big Data Environment, p. 7
      Hardware Infrastructure, p. 7
      The Software Environment, p. 7
      Customer Requirements, p. 8
      Installing InfoSphere BigInsights for Multi-tenant services, p. 9
      Installation steps, p. 9
      Accessing the Platform Symphony Management Console, p. 12
      Accessing the Platform Symphony knowledge center, p. 14
      Platform Symphony Concepts, p. 15
      An example of configuring a cluster for multi-tenancy, p. 18
      Adding users to run MapReduce applications, p. 19
      Provide access to the BigInsights / Platform Computing cluster, p. 23
      Understanding Platform Symphony Impersonation, p. 24
      Configuring OS groups for the multitenant environment, p. 25
      Submitting a test job as a user to verify the configuration, p. 25
      Associating BigInsights with a Symphony Application, p. 28
      Enabling Symphony Repository Services, p. 29
      Adding a new Application / Tenant, p. 30
      Configuring application properties, p. 34
      Associating applications with consumers, p. 40
      Accessing Consumer Definitions, p. 41
      Manually editing Consumer Tree definitions, p. 42
  • 3. Controlling access to applications and consumers, p. 43
      Determining the execution user for a consumer, p. 44
      Configuring Sharing Policies, p. 46
      Summary, p. 48
  • 4. Document History
    Date of this revision: Saturday, April 19, 2014

    Revision  Date            Summary of changes
    0.9       March 23, 2014  Initial draft
    0.95      April 19, 2014  Incorporate many valuable comments from Steven Sit based on his direct client experience – thank you Steven.

    Introduction
    This document is written for IBM and partner architects. It is intended as a guide for those working with customers deploying IBM InfoSphere BigInsights and other Hadoop offerings together with IBM Platform Symphony. While this paper describes the details of one customer implementation, we believe that the use case is relevant to others as well. Challenges related to Hadoop multitenancy are faced by customers across multiple industries. The target audience for this document includes:
    - Architects responsible for deploying big data or analytic workloads
    - Technical users looking for ways to deploy Hadoop on shared clusters
    - IBM architects, ISVs or business partners interested in building multitenant Big Data environments to help customers reduce infrastructure requirements and save cost
    This paper does not delve into YARN. YARN is another important (but less mature) technology that delivers some of the capabilities described herein. It is important for IBM customers to understand that IBM BigInsights is a safer choice in the sense that it supports open source technologies like YARN while simultaneously offering more advanced capabilities. IBM's view is that clients can best determine what capabilities they need, and IBM InfoSphere BigInsights gives them that flexibility: the best of a 100% open source distribution along with significant value-added capability.
    In the customer example documented here, the business advantage of using proprietary capabilities (IBM Platform Symphony) dramatically outweighed the benefits of being "pure" from an open source standpoint. The client was able to consolidate roughly 30 applications onto a shared infrastructure and avoid the significant incremental capital expense that would have been required to set up separate clusters had the client decided to proceed with open source YARN only.

    Disclaimers and limitations
    The details of the customer implementation are proprietary and confidential. As such, while we can describe what was done technically, we cannot share details of how this customer used particular applications. As a result, the examples provided herein are meant to explain qualitatively what was achieved by the customer without betraying confidential information. The details and screenshots in this document are not from the customer environment. They have been reproduced on a small test cluster to explain particular capabilities that the client chose to take advantage of.
  • 5. About the customer described in this use case
    The customer described in this paper is a full-service financial services provider. They offer a broad range of products to their clients including insurance, banking, investing, real estate, retirement planning, wealth management and health insurance. Like many in the financial services sector, this customer is increasingly deploying Hadoop-based applications to augment their data warehouse. They are motivated by the following imperatives:
    - The need to leverage big data analytics to make better business decisions, improve customer relations and develop innovative new products and services
    - The need to contain or reduce costs (the cost of storing and processing data on a Hadoop cluster is an order of magnitude less than persisting the same data in their data warehouse)
    - The desire to architect their environment as a shared service to avoid each line of business building its own discrete analytic environment on premises or in the cloud

    Industry Challenges
    Like many industries, the sector represented by this client is going through significant change. As a full-spectrum provider, the client is disproportionately impacted by regulation. As a bank, not only are they subject to various provisions in legislation like Dodd-Frank, but they are also impacted by insurance industry requirements such as the NAIC's Risk Management and Own Risk and Solvency Assessment Model Act (RMORSA) and other initiatives around Enterprise Risk Management that have arisen in response to the financial crisis of 2008.
    Of particular consequence is the Volcker rule, a provision of the Dodd-Frank Act that limits or prohibits certain types of proprietary trading activities. While the legislation is directed at retail banks, this client will be impacted across their insurance and wealth management businesses, where proprietary trading is important to maximizing investment gains. As if this tsunami of new regulation were not enough, fundamental changes driven by external factors are taking place in the insurance industry as well. Among these factors are new disruptive technologies. Big data, social and mobile technologies are prominent drivers of change. Some specific challenges to the business are:
    - Driven by high-profile events and the increased frequency of natural catastrophes, contingent business interruption (CBI) modeling is emerging as a priority for insurance firms
    - Dramatic changes driven by technology are promising to fundamentally change auto insurance. Among these factors are collision avoidance technologies that promise to shift liability from drivers to manufacturers, social media technologies enabling insurers to seek out and market to lower-risk consumer pools, and advances in GPS and vehicle telematics that promise to provide insurers with more granular data on which to base risk assessments
  • 6. - Technological advances are leading to an explosion in available information and in firms that aggregate such information to help insurers better quantify risk
    - Widespread consumer use of mobile and social technologies is causing firms to rethink how they promote their brand and provide services to both their customers and agents/advisors
    - Advances in analytic techniques are making it easier for insurers to collect, process and visualize information. This is extending beyond core actuarial techniques to include approaches like predictive analytics, natural language processing, social network analysis and simulation-based analytics.
    - Additionally, new technologies are changing how information is stored and processed. Distributed file systems and clustered technologies like Hadoop can provide a significant per-terabyte cost advantage over traditional warehouses.
    Because of these cost advantages, and because the framework is well suited to storing and processing unstructured or semi-structured data, this customer and similar firms are embracing Hadoop as a platform for many new applications. We point this out because risk management, which relies heavily on Monte Carlo simulation and actuarial modeling, and big data analytics are converging. Both depend on scale-out infrastructure. Firms that understand this convergence can obtain a cost advantage relative to their competitors.

    Impact on Information Technology
    The regulatory challenges described above, the technological shifts and the business pressures are all driving the need for greater data processing and analytic capacity.
    - Traditional data warehouses cannot scale cost-efficiently to manage the vast amounts of data being collected and processed, nor can they handle the raw volumes of unstructured data involved.
    - Organizations need more agile application development methodologies and toolsets that allow them to evolve data schemas and applications on the fly as they continuously incorporate new sources of data into their models.
    A one-to-one mapping between applications and infrastructure is no longer practical. Many applications (Hadoop, scenario generation, Monte Carlo simulation and ETL processing) rely on distributed infrastructure that scales horizontally. Replicating this clustered infrastructure for each line of business and each application would be cost prohibitive.
  • 7. The Big Data Environment

    Hardware Infrastructure
    The physical infrastructure deployed by this client is shown pictorially in Figure 1. While there are actually four identical 16-node clusters, only the production environment is shown here. The server infrastructure is based on an IBM System x reference architecture for InfoSphere BigInsights. Each cluster node has 12 CPUs, over 60 GB of memory and 12 locally connected physical disks. The production cluster has 192 TB of disk and approximately 1 TB of memory. A unique feature of this environment is that the cluster is shared by approximately 30 different user groups across several lines of business.

    Figure 1: Physical infrastructure for shared Hadoop Platform

    The Software Environment
    The Linux-based infrastructure supports multiple big data and analytic applications. Among these applications are:
• IBM InfoSphere BigInsights (providing core Hadoop services)
• Datameer (for data visualization)
• IBM Tealeaf – customer experience analytics platform
• Open source Sqoop 1.2.4 – used to perform bulk data transfers to and from various data sources including an operational data warehouse and the production Hadoop cluster
• Various MapReduce streaming applications where, for convenience of development, Map and Reduce logic is expressed as Perl scripts
• Many in-house developed Java applications
• Various ETL scripts running in and out of the Hadoop MapReduce framework

The IBM-furnished software environment comprises the following major components:

• IBM InfoSphere BigInsights Enterprise Edition
• IBM Platform Symphony Advanced Edition (the software is bundled with BigInsights Enterprise Edition for a single tenant, and this client has purchased a production license)
• IBM GPFS FPO (providing a POSIX-compliant file system that fully preserves HDFS semantics)

Customer Requirements

This customer requires a multi-tenant environment for the business reasons listed below.

• They wish to share infrastructure between multiple departments and lines of business, both to boost capacity (by allowing departments to tap capacity not being used by others) and to reduce costs by avoiding the need for separate physical environments.
• They need the ability to guarantee service levels to different tenants to ensure that business-critical applications can run in a predictable fashion. For example, ETL or specific database load operations must run within an overnight batch window.
• Because many services are long-running, agile pre-emption is required to make sharing practical, so that urgent jobs do not need to wait behind long-running jobs on the cluster.
• The client needs to ensure that data is segmented between different tenants on the shared environment for security and privacy reasons.
• Finally, the client requires multi-tenancy for technical reasons that are sometimes overlooked. As the environment evolves, they need the flexibility to deploy different versions of software components that may have specific dependencies. A specific example is this client’s requirement to use a more recent version of open-source Sqoop, distinct from the version included in BigInsights 2.1.0.1, the version deployed at the time of this writing.
Different Hadoop vendors have different definitions of what they mean by multi-tenancy, so it is important not to confuse the multitenant capabilities offered by IBM in Platform Symphony with open source offerings like YARN, which is much less capable. While YARN is an important technology that IBM supports, its capabilities are well behind those described here.

Installing InfoSphere BigInsights for Multi-tenant services

Realizing a multitenant environment for BigInsights or other applications requires the use of IBM Platform Symphony Advanced Edition. A run-time version of IBM Platform Symphony Advanced Edition that enables a single tenant is included with IBM InfoSphere BigInsights Enterprise Edition 2.1 or later. For historical reasons, the Platform Symphony resource manager and workload manager are referred to in the BigInsights documentation as Adaptive MapReduce.

Clients wanting the multitenant capabilities described in this document will need to license a full version of Platform Symphony Advanced Edition. Note that licensing is not enforced by the software directly. Customers can pilot these multitenant capabilities using only the software included in the BigInsights 2.1 Enterprise Edition or later release, along with appropriate patches.

Installation steps

Fortunately, it is getting steadily easier to have these products work together. While manual configuration was required in prior releases, as of BigInsights 2.1 EE a simple patch can be applied to unlock all of the features of Platform Symphony Advanced Edition and have it work with BigInsights. For future releases starting in the spring of 2014, full functionality of Platform Symphony will be provided “out of the box” with BigInsights, with no requirement for a patch.
(Please note that customers will still need to license the software before using it in production.)

The high-level steps to implement InfoSphere BigInsights 2.1 (or later) with IBM Platform Symphony Advanced Edition are as follows:

• Install IBM InfoSphere BigInsights Enterprise Edition by following the installation instructions. When installing BigInsights it is important to install Adaptive MapReduce. This is the choice that causes the Platform Symphony software to be installed and configured with BigInsights.
• To do this, edit the file called install.properties in the installation directory before starting the BigInsights installation process, as shown below:

    # set AdaptiveMR.Enable to true if you want to install AdaptiveMR instead of Apache MapReduce
    AdaptiveMR.Enable=true
    # set AdaptiveMR.HA.Enable to true if you want to install AdaptiveMR High Availability, this will also install AdaptiveMR instead of Apache MapReduce
    AdaptiveMR.HA.Enable=true
• For multitenant environments GPFS FPO is recommended; however, Symphony can be configured to support multiple tenants regardless of whether HDFS or GPFS FPO is chosen as the cluster file system.
• BigInsights can be installed by using a web-based installation process. The web-based process generates an XML file that governs the installation, whether it is run via the GUI or optionally via the install.sh shell script. The name of this file will vary depending on how the software is installed, but as of release 2.1 the file is called either simple-fullinstall.xml or fullinstall.xml.
• The reason we mention this is that an apparent bug in BigInsights 2.1 caused the XML tag <apache-mapred> to be set to true when Adaptive MapReduce was requested in the install.properties file above. It is worth validating that this setting is false in the simple-fullinstall.xml or fullinstall.xml file:

    [biadmin@biginsights]$ grep "apache-mapred" simple-fullinstall.xml
    <apache-mapred>false</apache-mapred>
    [biadmin@biginsights]$

• As you proceed with the installation, you should see the BigInsights installation script install the “HAManager” software components. This is where the Platform Symphony software that supports HA functionality and Adaptive MapReduce functionality is located. You can watch for this either through the web installation GUI or by checking the installation log file.
• If you are installing BigInsights 2.1 Enterprise Edition you will need to install a patch by following the procedure documented in the publication “Enabling the full functionality of IBM Platform Symphony in your BigInsights 2.1 cluster” (footnote 1). This document is freely downloadable for users with an IBM developerWorks ID.
• You can download a small patch for Platform Symphony 6.1.0.1 (the Symphony version included in BigInsights 2.1) from https://www.ibm.com/support/fixcentral/ by following the instructions in the document referenced above. At the time of this writing you can find the needed package on Fix Central by searching for “Platform Symphony” and downloading the package named “sym-6.1.0.1-build225866”. This package applies to both 64-bit Linux on Intel and IBM PowerLinux machines. Later versions of BigInsights will not require this patch.
• Follow the instructions in the README file. If you are installing the patch as user “root” on the BigInsights cluster, source the BigInsights environment before attempting to install the patch, since the patch procedure assumes the environment variables are already set.

Footnote 1: This documentation can be obtained from: https://www.ibm.com/developerworks/community/wikis/form/api/wiki/ee59a95e-5867-4deb-90af-6bed6b0759b8/page/91903357-0a7d-4a96-bb70-520fb2acdc1b/attachment/52d79fbe-dc37-42f0-be3f-5f4b75f14a05/media/Enable%20the%20full%20functionality%20of%20IBM%20Platform%20Symphony%20in%20BigInsight%202.1%20Cluster.pdf
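The two configuration checks in the installation steps above (the Adaptive MapReduce property and the <apache-mapred> tag) can be combined into one small pre-flight check. The following is a minimal sketch, not an IBM-supplied tool: the function name `check_adaptive_mr_config` and its messages are hypothetical, and it assumes the property and tag appear in the same textual form as in the examples in this paper.

```shell
#!/usr/bin/env bash
# Sketch: confirm that Adaptive MapReduce was requested and that the generated
# installation XML has not re-enabled Apache MapReduce.
# check_adaptive_mr_config is a hypothetical helper, not part of BigInsights.

check_adaptive_mr_config() {
    local props_file="$1"   # e.g. install.properties
    local xml_file="$2"     # e.g. simple-fullinstall.xml or fullinstall.xml
    local status=0

    # Either AdaptiveMR.Enable or AdaptiveMR.HA.Enable set to true selects
    # Adaptive MapReduce; commented-out lines do not match this pattern.
    if ! grep -Eq '^[[:space:]]*AdaptiveMR\.(HA\.)?Enable[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$props_file"; then
        echo "WARN: Adaptive MapReduce is not enabled in $props_file"
        status=1
    fi

    # The tag should read false when Adaptive MapReduce is requested.
    if ! grep -q '<apache-mapred>false</apache-mapred>' "$xml_file"; then
        echo "WARN: <apache-mapred> is not false in $xml_file"
        status=1
    fi

    [ "$status" -eq 0 ] && echo "OK: Adaptive MapReduce settings look consistent"
    return "$status"
}
```

It could be called as `check_adaptive_mr_config install.properties simple-fullinstall.xml` from the installation directory before starting the installer.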
For example, the environment can be sourced and checked as follows:

    [biadmin@biginsights opt]$ cd /opt/ibm/biginsights/conf
    [biadmin@biginsights conf]$ . biginsights-env.sh
    [biadmin@biginsights conf]$ echo $EGO_TOP
    /opt/ibm/biginsights/HAManager/data
    [biadmin@biginsights conf]$

When this patch is applied, the multitenant capabilities of IBM Platform Symphony will become functional and will be accessible through the Platform Symphony graphical user interface.

When BigInsights is installed, the BigInsights web console is available by default on port 8080 on the BigInsights management host (as long as BigInsights services are started).

Check the status of the cluster using this command:

    $ /opt/ibm/biginsights/bin/status.sh

If necessary, start BigInsights (which will also start Platform Symphony services):

    $ /opt/ibm/biginsights/bin/start-all.sh

If Symphony is properly installed with BigInsights, you should be able to run Symphony-specific commands while logged in as the BigInsights administrator. As an example, the user biadmin should be able to run the following command:

    $ egosh service list

This command lists the software services associated with Symphony and shows their status. When the Platform Computing components (Adaptive MapReduce) are installed, the Platform Computing resource manager (EGO) is used to persist BigInsights services. You will notice that Symphony services are associated with a consumer called “/Management”. If you are running HDFS, HDFS services like the DataNode and Secondary NameNode are associated with an “/HDFS” consumer. The MapReduce shuffle service is started on Compute hosts in the cluster.
    [biadmin@biginsights ~]$ egosh service list
    SERVICE  STATE    ALLOC CONSUMER RGROUP RESOURCE SLOTS SEQ_NO INST_STATE ACTI
    derbydb  DEFINED        /Manage* Manag*
    purger   DEFINED        /Manage* Manag*
    plc      DEFINED        /Manage* Manag*
    WEBGUI   STARTED  54    /Manage* Manag* biginsi* 1     1      RUN        121
    RS       DEFINED        /Manage* Manag*
    Seconda* DEFINED        /HDFS/S*
    MRSS     STARTED  55    /Comput* MapRe* biginsi* 1     1      RUN        120
    DataNode DEFINED        /HDFS/D*
    SD       STARTED  56    /Manage* Manag* biginsi* 1     1      RUN        119
    Service* DEFINED        /Manage* Manag*
    WebServ* DEFINED        /Manage* Manag*
    NameNode DEFINED        /HDFS/N*
    [biadmin@biginsights ~]$
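On a busy cluster it can be convenient to reduce the service listing to just the names of running services, or to look up the state of one service. The sketch below only post-processes the text that `egosh service list` prints; it assumes the column layout shown in this paper (service name in the first column, state in the second, one header row), and the helper names `started_services` and `service_state` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: filter the text output of `egosh service list`.
# Both helpers read the listing on standard input and assume the layout shown
# in this paper: a header row, then NAME STATE ... on each line.

# Print the names of all services whose state is STARTED.
started_services() {
    awk 'NR > 1 && $2 == "STARTED" { print $1 }'
}

# Print the state of a single named service (for example WEBGUI).
service_state() {
    awk -v name="$1" 'NR > 1 && $1 == name { print $2 }'
}
```

Typical usage would be `egosh service list | started_services` or `egosh service list | service_state WEBGUI`. Because long names are truncated with an asterisk in the listing, a truncated name such as `Seconda*` has to be passed in exactly that form.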
Accessing the Platform Symphony Management Console

If you follow the installation recommendations above, the Platform Symphony console will usually be on the same host as the BigInsights console but on a different port. Port 18080 is the default. You should be able to log into the Platform Symphony management console at http://<master-host>:18080/platform. The default administrator login for Platform Symphony is “Admin / Admin”.

In production clusters there will normally be multiple Platform Symphony management hosts. Setting this up is beyond the scope of this paper and is covered in the Platform Symphony documentation.

Figure 2 - Logging into the Platform Symphony Management Console

If you are having trouble connecting to the Symphony web console, you can use the command “egosh service view WEBGUI” to see details about the web service. The WEBGUI service should be started automatically by EGO, but if it becomes necessary to start or stop the service, you can use the following commands:

    $ egosh logon

Enter Admin / Admin as the username and the password when prompted.

    $ egosh service start WEBGUI
    $ egosh service stop WEBGUI

The WEBGUI service is implemented using Apache Tomcat. If there are problems with the WEBGUI you can inspect the logs at ${EGO_TOP}/gui/logs/catalina.out for information about what might be wrong with the service.
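As a rough aid when inspecting the WEBGUI log, the sketch below prints the most recent lines that look like errors. It is an assumption that problems show up as lines containing “SEVERE” or “Exception”, which is common for Tomcat logs but not guaranteed; the helper name `recent_webgui_errors` is hypothetical, and the default path is the one given above.

```shell
#!/usr/bin/env bash
# Sketch: show the last few error-looking lines from the WEBGUI Tomcat log.
# Assumes errors are logged with the words SEVERE or Exception.

recent_webgui_errors() {
    # First argument: log file (defaults to the catalina.out path under EGO_TOP).
    local log_file="${1:-${EGO_TOP}/gui/logs/catalina.out}"
    # Second argument: how many matching lines to show (defaults to 20).
    local max_lines="${2:-20}"
    grep -E 'SEVERE|Exception' "$log_file" | tail -n "$max_lines"
}
```

After sourcing the BigInsights environment, running `recent_webgui_errors` with no arguments reads the default log; an explicit file and line count can be passed instead.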
If you cannot connect to the Symphony console, the connection may be blocked by your firewall configuration. You can disable your firewall temporarily to see if this is the cause.

    # service iptables stop

If you are not sure what port or host the Platform Symphony GUI was installed on, you should be able to find it in the XML file that governs the BigInsights installation process (described earlier). This XML file is generated by the web-based installation process. Platform Symphony related setup details are found under the “high-availability” section of the file.

    <high-availability>
      <configure>false</configure>
      <master-nodes/>
      <baseport>7869</baseport>
      <web-port>18080</web-port>
      <log-directory>var/ibm/biginsights/ps-mapred/logs</log-directory>
      <preferred-ip-mask/>
      ..
      <max-retries>3</max-retries>
      <failover>failover</failover>
    </high-availability>

Once a user logs in to the Platform Symphony console on port 18080, they will see the main Platform Symphony dashboard. This view is mostly used to monitor the high-level status of the various applications and tenants on a Platform Symphony cluster. For BigInsights users, most of the action will center around the “MapReduce Workload” screen accessible under “Quick Links”.
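When the console port is not known, the <web-port> lookup described above can be scripted. The following is a minimal sketch that pulls the port out of the installation XML with sed and prints the console URL; it assumes the tag appears as `<web-port>NNNN</web-port>` as in the excerpt above. The function name `console_url_from_install_xml` and the host name `master01` used in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: derive the Platform Symphony console URL from the installation XML.
# Assumes a <web-port>NNNN</web-port> element like the one shown in this paper.

console_url_from_install_xml() {
    local xml_file="$1"   # e.g. fullinstall.xml
    local host="$2"       # management host name
    local port

    # Extract the first numeric value found inside a <web-port> element.
    port=$(sed -n 's:.*<web-port>\([0-9][0-9]*\)</web-port>.*:\1:p' "$xml_file" | head -n 1)

    if [ -z "$port" ]; then
        echo "web-port not found in $xml_file" >&2
        return 1
    fi
    echo "http://${host}:${port}/platform"
}
```

For example, `console_url_from_install_xml fullinstall.xml master01` would print the URL to try in a browser for a management host named master01.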
Figure 3 - view of Platform Symphony console when logged in as an Administrator

Accessing the Platform Symphony knowledge center

Once you are able to access the Platform Symphony console above, you may want to access the Platform Symphony Knowledge Center and bookmark it in your browser. The knowledge center is accessible from a pull-down menu under the question mark in the top bar of the Platform Symphony web interface.

The knowledge center aggregates all of the Platform Symphony documentation into a searchable interface. This will prove handy as you learn about Platform Symphony. A direct link to the knowledge center can be found at this URL (depending on the hostname where the web interface is running):

http://<masterhost-name>:18080/doc/symphony/6.1/index.html

If you are running on a cluster with multiple master hosts, the command egosh service list shown earlier will show the name of the host running the web interface (listed as WEBGUI).

The Platform Symphony knowledge center, in particular the documentation dealing with the Platform Symphony MapReduce framework, will be useful to BigInsights administrators, because if you are using Adaptive MapReduce you are in fact using the Platform Symphony MapReduce framework.
Figure 4 - Platform Symphony Knowledge Center

Platform Symphony Concepts

While the reader of this document is likely to be familiar with Hadoop and various commercial distributions, they may be less familiar with IBM Platform Symphony. IBM Platform Symphony is a commercial grid workload and resource management solution that has been used to share resources among diverse applications in multitenant environments for over a decade. Platform Symphony is widely deployed as a shared services infrastructure in some of the world’s largest investment banks.

As a quick primer on the terminology used in this document, some definitions are offered below. We recommend that the interested reader review a document called “IBM Platform Symphony Foundations”, available at http://publibfp.dhe.ibm.com/epubs/pdf/c2750652.pdf.

• Session Manager – Service-oriented applications in Platform Symphony are managed by a session manager. The session manager is responsible for dispatching tasks to service instances, and collecting and assembling results. The Symphony session manager provides a function similar in concept to a Hadoop application manager, although it has considerably more capabilities. Platform Symphony implements job tracker functionality using the session manager. In this paper the terms job tracker, application manager and session manager are used interchangeably. While the concept of multiple concurrent application managers in Hadoop is new with YARN, Platform Symphony has always featured a multitenant design.
• Resource Groups – Unlike Hadoop, Platform Symphony does not make assumptions about the capabilities of hosts that participate in the cluster. While Hadoop generally assumes that member nodes are 64-bit Linux hosts running Java, Platform Symphony supports a variety of hardware platforms and operating environments. Platform Symphony allows hosts to be grouped in flexible ways into different resource groups, and different types of applications can share these underlying resource groups.
• Applications – The term application can be a little confusing as it is applied to Platform Symphony. Symphony views an application as the combination of the client-side and service-side code that comprise a distributed application. This is a more expansive definition than most people are used to. By this definition an instance of BigInsights might be viewed as a single application. Examples of Platform Symphony applications are custom applications written in C++, commercial ISV applications like IBM Algorithmics, Calypso or Murex, and commercial or open source Hadoop applications like Cloudera, BigInsights or open source Hadoop. Platform Symphony views an application as an instance of middleware. Various client-side tools associated with a particular version of Hadoop (Pig, Hive, Sqoop etc.) can all run against a single Hadoop application definition. An important concept for those not familiar with Symphony is that Symphony provisions service instances associated with different applications dynamically. As a result, there is nothing technically stopping a Platform Symphony cluster from supporting multiple instances of Hadoop and non-Hadoop environments concurrently.
• Application profiles – As explained above, applications in Symphony are flexible and highly configurable constructs.
An Application Profile in Symphony defines the characteristics of an application and its various behaviors at runtime.
• Consumers – From the viewpoint of a resource manager, an application or tenant on the cluster is defined as something that needs particular types of resources at runtime. Platform Symphony uses the term “consumer” to define these consumers of resources, and provides capabilities to define hierarchical consumer trees and express business rules about how consumers share various types of resources collected into resource groups. The leaf nodes in consumer trees map to a Symphony application.
• Services – Services are the portions of applications that run on cluster nodes. In a Hadoop context, administrators likely think of services as equating to a task tracker that runs Map and Reduce logic. Here again, Symphony takes a broader view. Symphony services are generic. A service may be a task tracker associated with a particular version of Hadoop or it may be something else entirely. When the MapReduce framework is used in Platform Symphony, the Hadoop service-side code that implements the task tracker logic is dynamically provisioned by Symphony. Symphony owes its name to this ability to orchestrate a variety of services quickly and dynamically according to sophisticated sharing policies.
• Sessions – A session in Symphony equates to the notion of a job in Hadoop. A client application in Symphony normally opens a connection to the cluster, selects an application and opens a session. Behind the scenes Symphony will provision a Symphony Session Manager to manage the lifecycle of the job. A single Symphony Session Manager may support multiple sessions (Hadoop jobs) concurrently. A Hadoop job is a special case of a Symphony job. The Hadoop client will start a session manager that provides JobTracker functionality. Platform Symphony actually uses the job tracker and task tracker code provided in a Hadoop distribution; however, it uses its own low-latency middleware to orchestrate these services more efficiently on a shared cluster.
• Repositories – As explained previously, Platform Symphony dynamically orchestrates service-side code in response to application demand. The binary code that comprises an application service is stored in a Symphony repository. Normally for Symphony applications, Symphony services are distributed to compute nodes from a repository service. For Hadoop applications, code can be distributed either via the repository service or via the HDFS / GPFS FPO file system.
• Tasks – Symphony jobs are collections of tasks. Symphony jobs are managed by a session manager that runs on a management host. The session manager makes sure that instances of the needed service are running on compute nodes / data nodes on the cluster. Service instances run under the control of a Symphony Service Instance Manager (SIM). MapReduce jobs in Symphony work the same way, but in this case the Symphony service is essentially the Hadoop task tracker logic. On Hadoop clusters, slots are normally designated as running either map logic or reduce logic. Again, in Symphony this is fluid. Because services are orchestrated dynamically, service instances can run either Map or Reduce tasks.
This is an advantage because it allows full utilization of the cluster as the job progresses. At the start of a job the majority of slots can be allocated to map tasks while towards the end of the job the function of slots can be shifted to perform the reduce function.
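The difference between fixed and fluid slots can be illustrated with a toy calculation. This is only a sketch of the idea and is not Symphony's scheduling algorithm: the function names are hypothetical, the slot counts in the usage note are invented for illustration, and the “dynamic” model simply assumes pending map tasks are served first and any remaining slots go to reduce tasks.

```shell
#!/usr/bin/env bash
# Toy model only: this is NOT how Symphony or Hadoop actually schedule work.
# It contrasts slots fixed as map-only / reduce-only with slots that can run either.

# static_allocation MAP_SLOTS REDUCE_SLOTS PENDING_MAP PENDING_REDUCE
static_allocation() {
    local map_slots=$1 reduce_slots=$2 pending_map=$3 pending_reduce=$4
    # Each task type can only use the slots reserved for it.
    local map=$(( pending_map < map_slots ? pending_map : map_slots ))
    local reduce=$(( pending_reduce < reduce_slots ? pending_reduce : reduce_slots ))
    echo "map=$map reduce=$reduce idle=$(( map_slots + reduce_slots - map - reduce ))"
}

# dynamic_allocation TOTAL_SLOTS PENDING_MAP PENDING_REDUCE
dynamic_allocation() {
    local total=$1 pending_map=$2 pending_reduce=$3
    # Assumption: map tasks are served first, reduce tasks take whatever is left.
    local map=$(( pending_map < total ? pending_map : total ))
    local free=$(( total - map ))
    local reduce=$(( pending_reduce < free ? pending_reduce : free ))
    echo "map=$map reduce=$reduce idle=$(( free - reduce ))"
}
```

With 12 slots and only map work pending, `static_allocation 8 4 40 0` leaves the four reduce-only slots idle, whereas `dynamic_allocation 12 40 0` keeps every slot busy; late in a job the situation reverses in favor of reduce tasks.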
An example of configuring a cluster for multi-tenancy

In this section we describe the step-by-step procedure to set up multiple tenants on the BigInsights environment. In order to provide a realistic multitenant scenario, the diagram roughly models our actual customer environment, with names changed to protect client confidentiality. The actual environment is more complex, with hundreds of users, dozens of groups and approximately thirty different applications planned, but the application sharing is similar to the diagram below.

This diagram maps to the “Consumer Tree” in Platform Symphony. Consumer is a term used from the resource manager’s perspective. The resource manager views an application as a consumer of resources, and is responsible for allocating requested resources according to policies that will be described shortly.

Figure 5 - an example consumer hierarchy for applications and departments

By default, BigInsights (which is just a single application on the cluster) maps to a single application and an associated consumer called “MapReduce61” (the name corresponds to the version of Platform Symphony used to support MapReduce processing in BigInsights, in this case 6.1.0.1). This is done so that Symphony can accommodate the versions of MapReduce provided in future versions of BigInsights and allow versions to co-exist. This is the first consumer in the consumer tree above.

In the production environment the customer has specific needs:

• They wish to structure “sub-consumers” under the BigInsights consumer definition (MapReduce61). This gives the cluster administrator the ability to have different run-time characteristics for different BigInsights applications.
It also allows us to set up configurable sharing policies between our different applications and groups, control which users are allowed to access which applications, and ensure security between tenants by having different applications run under different user IDs if desired.
• In this example, under the BigInsights tenant (MapReduce61) we have several different applications. We’ve arbitrarily called them “MR_AppA” through “MR_AppN”, although in the real environment these are the names of the client’s business applications. Note that we need to configure each application (tenant) so that it runs under a different operating system level user ID for security isolation. We also want to control in a granular way which users and groups have access to these various applications.
• Also, as shown in Figure 5, the client has additional applications used by particular lines of business that they would like to deploy on the same cluster. Examples include some Sqoop workloads, Datameer, IBM Tealeaf, various in-house developed streaming applications and others. In this particular customer implementation all of these applications happen to share the BigInsights MapReduce infrastructure; however, it is important to understand that this need not be the case. As we’ll see shortly, these applications can be totally different and still be configured to share infrastructure.

Adding users to run MapReduce applications

In our example we want to show how multiple users, grouped arbitrarily into one or more groups for security management, can access tenant applications subject to access controls. We create some sample cluster users for our illustration. These names represent individual cluster users. For some lines of business, application administrators may choose to create a shared login like “fraud” for a group authorized to use a particular fraud analytics application.

InfoSphere BigInsights has a recommended procedure for adding users.
When using Platform Symphony together with BigInsights, it is recommended that users follow the procedures covered in the BigInsights documentation and use the tool createosusers.sh included in the BigInsights distribution to automate the creation of OS-level users. Doing this ensures that users can access the BigInsights console to run applications deployed using the BigInsights application framework.

For convenience, the BigInsights infocenter is available on the public internet. For information on adding users in BigInsights, you can learn more here:

http://www-01.ibm.com/support/knowledgecenter/SSPT3X_2.1.1/com.ibm.swg.im.infosphere.biginsights.admin.doc/doc/bi_admin_add_users.html?lang=en

The specific procedures will depend on whether you are authenticating access via flat files, LDAP, PAM or PAM+LDAP. In the example below we are using flat files for simplicity.

To create users known to BigInsights, edit the following file:

$BIGINSIGHTS_HOME/console/conf/security/biginsights_users.xml
Add users as shown below.

    <?xml version="1.0" encoding="UTF-8"?>
    <server>
      <featureManager/>
      <basicRegistry id="basic" realm="Auth">
        <user name="hadoop" password="passw0rd"/>
        <user name="biadmin" password="temp4now"/>
        <user name="sysadmin2" password="passw0rd"/>
        <user name="appadmin2" password="passw0rd"/>
        <user name="sysadmin1" password="passw0rd"/>
        <user name="appadmin1" password="passw0rd"/>
        <user name="dataadmin2" password="passw0rd"/>
        <user name="dataadmin1" password="passw0rd"/>
        <user name="user3" password="passw0rd"/>
        <user name="user2" password="passw0rd"/>
        <user name="user1" password="passw0rd"/>
        <user name="vivian" password="temp4now"/>
        <user name="gord" password="temp4now"/>
        <user name="eric" password="temp4now"/>
        <user name="michael" password="temp4now"/>
        <user name="vince" password="temp4now"/>
        <user name="steven" password="temp4now"/>
        <user name="tiffany" password="temp4now"/>
        <user name="appA" password="temp4now"/>
        <user name="appB" password="temp4now"/>
        <user name="appC" password="temp4now"/>
      </basicRegistry>
    </server>

The next step is to define groups and associate users with groups. This is an example only. The specifics will depend on how you wish to structure your own users and groups.

    <?xml version="1.0" encoding="UTF-8"?>
    <server>
      <featureManager/>
      <basicRegistry id="basic" realm="Auth">
        <group name="supergroup" gid="4000">
          <member name="hadoop" uid="4000"/>
          <member name="biadmin" uid="200"/>
        </group>
        <group name="appAdmins" gid="4100">
          <member name="appA" uid="4100"/>
          <member name="appB" uid="4101"/>
          <member name="appC" uid="4102"/>
        </group>
        <group name="sysAdmins" gid="4200">
          <member name="sysadmin1" uid="4200"/>
          <member name="sysadmin2" uid="4201"/>
        </group>
        <group name="dataAdmins" gid="4300">
          <member name="dataadmin1" uid="4300"/>
          <member name="dataadmin2" uid="4301"/>
        </group>
        <group name="users" gid="4400">
          <member name="vivian" uid="6001"/>
          <member name="gord" uid="6002"/>
          <member name="eric" uid="6003"/>
          <member name="michael" uid="6004"/>
          <member name="vince" uid="6005"/>
          <member name="steven" uid="6006"/>
          <member name="tiffany" uid="6007"/>
        </group>
        <group name="groupA" gid="5000">
          <member name="vivian" uid="6001"/>
          <member name="gord" uid="6002"/>
          <member name="eric" uid="6003"/>
          <member name="michael" uid="6004"/>
          <member name="vince" uid="6005"/>
          <member name="steven" uid="6006"/>
          <member name="tiffany" uid="6007"/>
        </group>
        <group name="groupB" gid="5001">
          <member name="vivian" uid="6001"/>
          <member name="gord" uid="6002"/>
          <member name="eric" uid="6003"/>
          <member name="michael" uid="6004"/>
          <member name="vince" uid="6005"/>
          <member name="steven" uid="6006"/>
          <member name="tiffany" uid="6007"/>
        </group>
        <group name="groupC" gid="5002">
          <member name="vivian" uid="6001"/>
          <member name="gord" uid="6002"/>
          <member name="eric" uid="6003"/>
          <member name="michael" uid="6004"/>
          <member name="vince" uid="6005"/>
          <member name="steven" uid="6006"/>
          <member name="tiffany" uid="6007"/>
        </group>
      </basicRegistry>
    </server>

In addition to having user IDs that map to individuals, we may want particular applications to execute on the cluster under a specific user ID. For example, if an application is called “appA” we may want it to execute under a Linux user ID with the same name for simplicity. To accommodate this, notice that we’ve added application-specific users to the biginsights_users.xml file in the example above.
You can add users using operating system facilities, but if you do, these users will not be recognized as having credentials within the BigInsights web interface. They will still work with Symphony and the BigInsights Hadoop framework however.
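Because each tenant application is meant to run under its own operating system user ID, it can be worth checking the groups file for a uid that has been assigned to two different names before creating the accounts. The following is a minimal sketch using grep, sed and awk rather than a real XML parser; it assumes `<member name="..." uid="..."/>` elements written as in the example above, and the function name `find_uid_conflicts` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report any uid in the groups XML that is mapped to more than one name.
# Text-based matching only; assumes member elements are written as
# <member name="NAME" uid="UID"/> with the attributes in that order.

find_uid_conflicts() {
    local groups_xml="$1"
    grep -o '<member name="[^"]*" uid="[^"]*"' "$groups_xml" |
        # Rewrite each match as "UID NAME".
        sed 's/<member name="\([^"]*\)" uid="\([^"]*\)"/\2 \1/' |
        # A user listed in several groups with the same uid counts only once.
        sort -u |
        awk '{
                 names[$1] = (names[$1] == "" ? "" : names[$1] ",") $2
                 count[$1]++
             }
             END {
                 for (uid in count)
                     if (count[uid] > 1) print uid ": " names[uid]
             }'
}
```

Running `find_uid_conflicts $BIGINSIGHTS_HOME/console/conf/security/biginsights_groups.xml` prints one line per shared uid and prints nothing when every uid belongs to a single name.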
The example below shows how additional users can be added at the OS level. Users added this way will be unable to log in to the BigInsights console.

    # useradd fred
    # useradd george
    # useradd frank

Once you have edited the BigInsights XML files to define users and groups as shown above, you are ready to run the createosusers.sh script to create these accounts and groups at the operating system level as well. Run the createosusers.sh script as user “biadmin”.

    $ createosusers.sh $BIGINSIGHTS_HOME/console/conf/security/biginsights_groups.xml \
          $BIGINSIGHTS_HOME/console/conf/security/biginsights_users.xml <biadmin's password>

By following the procedure above to create users and groups, you will be able to run and monitor jobs from both the BigInsights console and the Platform Symphony console.

Figure 6 - user Tiffany, known as a BigInsights user, is also known to the Platform Symphony GUI
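After the accounts have been created, a quick cross-check can confirm that every user defined in biginsights_users.xml also exists at the operating system level. The sketch below is an illustration only: `missing_os_users` is a hypothetical helper, it uses text matching rather than an XML parser, and it reads a passwd-format file, which is appropriate for the flat-file setup used in this example but not for LDAP-backed accounts (where the output of `getent passwd` saved to a file could be supplied instead).

```shell
#!/usr/bin/env bash
# Sketch: list users defined in the BigInsights users XML that have no entry
# in a passwd-format file. Assumes <user name="NAME" .../> elements as shown
# in this paper and user names without regular-expression metacharacters.

missing_os_users() {
    local users_xml="$1"
    local passwd_file="${2:-/etc/passwd}"

    grep -o '<user name="[^"]*"' "$users_xml" |
        sed 's/<user name="\([^"]*\)"/\1/' |
        while read -r name; do
            # passwd entries start with "name:".
            grep -q "^${name}:" "$passwd_file" || echo "$name"
        done
}
```

An empty result from `missing_os_users $BIGINSIGHTS_HOME/console/conf/security/biginsights_users.xml` suggests that all of the XML-defined users have OS accounts.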
  • 23. Figure 7 - user Tiffany and others can also run jobs via the BigInsights console.

Provide access to the BigInsights / Platform Computing cluster

For each operating system user who will be submitting jobs, make sure that the .bashrc file (or equivalent, depending on your shell) in the user’s home directory is configured to source the BigInsights environment as shown below. If you have followed the procedures above, this should be done for you automatically. We include these details because you may have additional users not known to BigInsights that require access to Platform Symphony.

Sourcing the BigInsights environment ensures that shell variables like $PATH and $CLASSPATH, as well as environment variables specific to BigInsights and Platform Symphony, are in the environment when the user logs on. This allows users to immediately run both BigInsights and Symphony commands. If you are adding many users outside the procedure recommended above for adding BigInsights users, and you want them all to have access to the cluster, it will be faster to adjust the system-wide template for the .bashrc file (in /etc/skel) or to adjust the common /etc/bashrc, depending on your preference.

If you have followed the instructions above, this step may not be necessary, but it is a good idea to check that when users log in they inherit an environment appropriate for running BigInsights jobs and that they have access to the Platform Symphony environment. In our case we want both our named users and the user IDs that our applications will run under in Symphony (see the concept of impersonation explained later) to source the environment and be able to run commands.

[root@biginsights gord]# cat .bashrc
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
  • 24.
# User specific aliases and functions
# source the environment for BigInsights and Platform Symphony
source /opt/ibm/biginsights/conf/biginsights-env.sh

You should be able to su to your created user ID after this and run Symphony or BigInsights commands. Below we see that I can run a Symphony command confirming that my environment is set up correctly. Note that with the installation of BigInsights we are entitled to use Platform Symphony Advanced Edition, which is the version of Symphony that supports the Hadoop MapReduce framework. We are not entitled to use some of the other add-on products listed.

[root@biginsights /]# su - gord
[gord@biginsights ~]$ egosh entitlement info
Symphony Edition : Advanced
Desktop Harvesting : Not Entitled
Server Harvesting : Not Entitled
Virtual Server Harvesting : Not Entitled
GPU : Not Entitled
[gord@biginsights ~]$

After following the procedure above, it is a good idea to make sure that our /etc/group file reflects the setup we’ve configured in the BigInsights XML files. In /etc/group, define the users that will be allowed to submit workloads on behalf of each group. This is a very simple example. In reality, different users would belong to different groups, and these group names would be meaningful in the context of how the customer organizes their business.
groupA:x:5000:vivian,gord,eric,michael,vince,steven,biadmin
groupB:x:5001:vivian,gord,eric,michael,vince,steven,biadmin
groupC:x:5002:vivian,gord,eric,michael,vince,steven,biadmin
groupD:x:5003:vivian,gord,eric,michael,vince,steven,biadmin
groupF:x:5004:vivian,gord,eric,michael,vince,steven,biadmin
groupG:x:5005:vivian,gord,eric,michael,vince,steven,biadmin
groupH:x:5006:vivian,gord,eric,michael,vince,steven,biadmin
groupI:x:5007:vivian,gord,eric,michael,vince,steven,biadmin

Understanding Platform Symphony Impersonation

Now is a good time to explain the concept of “impersonation” in Platform Symphony. Symphony has two different workload execution modes:
    • Simple Workload Execution Mode
    • Advanced Workload Execution Mode
This is normally an installation option with Platform Symphony. The BigInsights Enterprise Edition installation automatically installs Platform Symphony in Advanced Workload Execution Mode. The term workload execution mode is frequently abbreviated as WEM in the Symphony documentation. In advanced workload execution mode, core Symphony services run as root, which allows application administrators to control the user ID that clustered applications run under.
  • 25. Our approach to security hinges on this concept of impersonation in Symphony, and we will see shortly how we configure our applications to run under specific user credentials and control which users have access to which applications and resources. The section called “Security within the MapReduce framework” in the MapReduce user guide in the Platform Symphony documentation discusses this in detail. The customer that this paper is modeled after employs Kerberos authentication for their MapReduce jobs to ensure security and to ensure that a service supporting impersonation cannot be spoofed. Configuring Kerberos is beyond the scope of this short document, but customers will be pleased that this capability exists. Symphony is frequently deployed in secure environments where these capabilities are important.

Configuring OS groups for the multitenant environment

For users making use of Platform Symphony (both named users and the user IDs that applications will run under via impersonation), these IDs need to be part of the OS group that owns the BigInsights (and by extension the Symphony) installation. In our installation, BigInsights was installed as part of the “biadmin” group, so we adjust the group membership so that each application ID that Symphony jobs will run under is part of the BigInsights group.

biadmin:x:0:root,biadmin,gord,eric,vivian,appA,appB,appC,appD,appE,appF,appG
bin:x:1:root,bin,daemon
daemon:x:2:root,bin,daemon
..

If you are unsure what group BigInsights was installed under, issue a command like

$ ls -al ${EGO_TOP}

You will see the user and group that own each file. This will vary depending on how you installed BigInsights, but the default group is biadmin.
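After editing the group file, it is easy to miss one of the application IDs. The following is a minimal sketch of a membership check, assuming the standard colon-separated /etc/group layout shown above. The function name in_group is hypothetical, and the check only looks at the explicit member list (fourth field), not at users whose primary group is the one being tested.

```shell
# in_group USER GROUP [FILE]
# Succeed (exit status 0) if USER is listed as a member of GROUP.
# FILE defaults to /etc/group. Hypothetical helper for illustration.
in_group() {
  awk -F: -v u="$1" -v g="$2" '
    $1 == g {
      n = split($4, m, ",")            # fourth field is the member list
      for (i = 1; i <= n; i++) if (m[i] == u) found = 1
    }
    END { exit !found }                # 0 when found, 1 otherwise
  ' "${3:-/etc/group}"
}
```

A loop such as `for id in appA appB appC; do in_group "$id" biadmin || echo "$id is not in biadmin"; done` then reports any application ID that still needs to be added.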
Submitting a test job as a user to verify the configuration

As we mentioned before, by default BigInsights is configured to use an application called MapReduce6.1, which maps to the consumer called /MapReduceConsumer/MapReduce61. I should be able to log in to any of the accounts created and run a sample Hadoop job. The sleep command included with the BigInsights examples is a convenient Hadoop application for testing the MapReduce framework. This command submits variable numbers of map and reduce tasks that simply sleep for variable amounts of time. The example below submits two mappers that will each sleep for 2 seconds (2,000 msec), followed by ten reducers that, as specified by the -rt 2000 argument, will also sleep for 2 seconds each. Besides being a useful validation that everything is working, this test illustrates the performance advantage of using Platform Symphony as the MapReduce framework over open source Hadoop.
  • 26. Platform Symphony can run tests like this one, made up of short-running map and reduce tasks, dramatically faster than open source Hadoop, often more than ten times faster, even when a competing cluster is configured with a short polling interval. Note that as the test Hadoop job runs, everything is identical to open source Hadoop (it is actually the BigInsights-supplied Hadoop classes that are running), except that our JobTracker logic in Hadoop is running inside a Symphony Session Manager. Note also that the running job is given a Platform Symphony job ID (job_ssm_0401 in this example). Because Platform Symphony is managing the job execution, it is able to manage this job as well as other jobs on the cluster, including non-Hadoop jobs.

[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -m 2 -r 10 -mt 2000 -rt 2000
14/03/15 13:14:25 INFO internal.MRJobSubmitter: Connected to JobTracker(SSM)
14/03/15 13:14:26 INFO internal.MRJobSubmitter: Job <Sleep job> submitted, job id <401>
14/03/15 13:14:26 INFO internal.MRJobSubmitter: Job will not verify intermediate data integrity using checksum.
14/03/15 13:14:26 INFO mapred.JobClient: Running job: job_ssm_0401
14/03/15 13:14:27 INFO mapred.JobClient: map 0% reduce 0%
14/03/15 13:14:36 INFO mapred.JobClient: map 100% reduce 0%
14/03/15 13:14:46 INFO mapred.JobClient: map 100% reduce 20%
14/03/15 13:14:50 INFO mapred.JobClient: map 100% reduce 40%
14/03/15 13:14:54 INFO mapred.JobClient: map 100% reduce 60%
14/03/15 13:14:58 INFO mapred.JobClient: map 100% reduce 80%
14/03/15 13:14:59 INFO mapred.JobClient: map 100% reduce 100%
14/03/15 13:14:59 INFO mapred.JobClient: Job complete: job_ssm_0401
14/03/15 13:15:00 INFO mapred.JobClient: Counters: 18
14/03/15 13:15:00 INFO mapred.JobClient:   Shuffle Errors
14/03/15 13:15:00 INFO mapred.JobClient:     WRONG_PATH=0
14/03/15 13:15:00 INFO mapred.JobClient:     CONNECTION=0
14/03/15 13:15:00 INFO mapred.JobClient:     IO_ERROR=0
14/03/15 13:15:00 INFO mapred.JobClient:   FileSystemCounters
14/03/15 13:15:00 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=5146
14/03/15 13:15:00 INFO mapred.JobClient:   Map-Reduce Framework
14/03/15 13:15:00 INFO mapred.JobClient:     Reduce input groups=400
14/03/15 13:15:00 INFO mapred.JobClient:     Combine output records=0
14/03/15 13:15:00 INFO mapred.JobClient:     Map output records=400
14/03/15 13:15:00 INFO mapred.JobClient:     SHUFFLED_MAPS=20
14/03/15 13:15:00 INFO mapred.JobClient:     Reduce shuffle bytes=2440
14/03/15 13:15:00 INFO mapred.JobClient:     Combine input records=0
14/03/15 13:15:00 INFO mapred.JobClient:     Spilled Records=800
14/03/15 13:15:00 INFO mapred.JobClient:     SPLIT_RAW_BYTES=0
14/03/15 13:15:00 INFO mapred.JobClient:     Map output bytes=1600
14/03/15 13:15:00 INFO mapred.JobClient:     Reduce input records=400
14/03/15 13:15:00 INFO mapred.JobClient:     GC_TIME_MILLIS=0
14/03/15 13:15:00 INFO mapred.JobClient:     FAILED_SHUFFLE=0
14/03/15 13:15:00 INFO mapred.JobClient:     MERGED_MAP_OUTPUTS=20
14/03/15 13:15:00 INFO mapred.JobClient:     Reduce output records=0
[gord@biginsights ~]$
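When comparing runs of the test job, the wall-clock time between the “Running job” and “Job complete” messages can be read from a saved copy of the client output. The sketch below assumes the log layout shown above (time of day in the second whitespace-separated field) and a job that does not run past midnight. The function name job_elapsed is hypothetical.

```shell
# job_elapsed LOGFILE
# Print the seconds between the "Running job:" and "Job complete:" lines.
# Hypothetical helper; prints nothing if either line is missing.
job_elapsed() {
  awk '
    function secs(t,  p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
    /Running job:/  { start  = secs($2) }
    /Job complete:/ { finish = secs($2) }
    END { if (start != "" && finish != "") print finish - start }
  ' "$1"
}
```

For the run shown above, the two messages are stamped 13:14:26 and 13:14:59, so the function reports 33.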
  • 27. As this job runs, we can monitor it in the Symphony GUI by using the QuickLinks menu and selecting “MapReduce Workload” to access the MapReduce workload screen shown below. As the MapReduce job runs, you will see a view like the one shown in figure 8.

Figure 8 - monitoring our job using the Platform Symphony web interface

Note that the submitted job is associated with the application MapReduce6.1 (this is the application that BigInsights submits jobs to by default). You can also launch jobs via the standard BigInsights Web GUI and watch them run either from within the BigInsights console or from within the Platform Symphony Web interface.

Figure 9: Launching a terasort job from BigInsights

The Terasort example in BigInsights uses Oozie to manage the sequence of running the teragen application to generate the dataset to be sorted, followed by Terasort itself.
  • 28. As jobs run in the BigInsights context, we see them running in Platform Symphony, associated with the MapReduce6.1 application that BigInsights is bound to. Any BigInsights application that exercises the MapReduce framework, including services like Hive, Pig, Big SQL, BigSheets and others, will work with Symphony in this same way.

Figure 10 - Platform Symphony monitoring Terasort job run from BigInsights

Associating BigInsights with a Symphony Application

We’ve mentioned a few times that BigInsights is associated with the Symphony MapReduce6.1 application, and customers frequently ask where this association is made.

[biadmin@biginsights ~]$ cd $HADOOP_CONF_DIR
[biadmin@biginsights hadoop-conf]$ cat pmr-site.xml
<?xml version="1.0"?>
<!-- This is a PMR configuration file. -->
<!-- It is intended for PMR internal parameters. Do not define -->
<!-- hadoop parameters here. -->
<configuration>
<property>
<name>mapreduce.application.name</name>
<value>MapReduce6.1</value>
<description>The mapreduce application name.</description>
</property>
<property>
<name>mapreduce.map.skip.commit.task</name>
<value>false</value>
</property>

In the file pmr-site.xml, located in the BigInsights directory $HADOOP_CONF_DIR, you can modify the Symphony application name that BigInsights will submit jobs to. It is important to have this flexibility, because over time customers may end up with different versions of BigInsights along with other applications co-existing on the same cluster.
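To confirm which Symphony application a given configuration directory points at, the property can be read back from the file. This is a minimal sketch rather than a supported tool: the function name pmr_app_name is hypothetical, and it assumes the name and value elements each sit on their own line as in the listing above (or share a line with no other property).

```shell
# pmr_app_name FILE
# Print the value of mapreduce.application.name from a pmr-site.xml style file.
# Hypothetical helper; a line-oriented scan, not a real XML parser.
pmr_app_name() {
  awk '
    /<name>mapreduce\.application\.name<\/name>/ { found = 1 }
    found && /<value>/ {
      gsub(/.*<value>|<\/value>.*/, "")   # strip everything around the value
      print
      exit
    }
  ' "$1"
}
```

Running `pmr_app_name "$HADOOP_CONF_DIR/pmr-site.xml"` against the file shown above would print MapReduce6.1.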
  • 29. Enabling Symphony Repository Services

By default, when Platform Symphony is installed, the repository service in Symphony is disabled. The function of the repository service is to store the application services and distribute the code that implements services dynamically to service instances on the cluster. The MapReduce framework in Platform Symphony by default distributes the application service code (specifically the application logic that implements the task tracker functionality and the Jar files that implement map and reduce logic) by copying them to HDFS with a high block replication factor so that the files will be accessible on all nodes.

If you are planning to add and remove application profiles or consumers in Symphony, you will need to start the Symphony repository service. Otherwise you will encounter errors, as some of these services assume that the repository service in Symphony is running. This can be done through the web interface by following these steps:
    • From the QuickLinks menu, select System Services
    • For the service abbreviated as RS, select “Start” from the Actions pull-down menu
    • After you refresh the GUI view, you should see that the service has started on a master host
  • 30. Figure 11 - Managing system services in Platform Symphony

The system services view is useful. It shows a list of the system services that EGO is managing. Note that EGO is managing not only native Platform Symphony services, but BigInsights services as well.

Adding a new Application / Tenant

Fundamental to the design of BigInsights 2.1 (and open source Hadoop) is the idea that there is only a single instance of a Hadoop cluster. Platform Symphony, however, supports multiple applications sharing the same cluster. It is also flexible enough to support multiple instances of an application environment like BigInsights, although configuring this is out of the scope of this paper. Examples of tenants we may want to add might be:
    • A native Symphony application written to the Platform Symphony APIs
    • A batch-oriented workload (when Platform LSF is installed as an add-on to Platform Symphony)
    • A distinct Hadoop MapReduce environment
    • Third party applications like SAS, MATLAB or Revolution R
  • 31.
    • A separate Hadoop MapReduce application instance that shares resources between applications but uses the same Hadoop binaries and file system instance.
In this example we are showing the last case, where multiple Hadoop applications share resources. From the Platform Symphony Dashboard:
    • Use the QuickLinks menu and select Resources
    • Select Workload / MapReduce / Application profiles from the pull-down menu
There will already be an application profile defined for MapReduce6.1. This is installed automatically with Symphony and is the application profile that is used by BigInsights by default. To add a new application profile to support a new tenant, click the “Add” button. The screen shown in figure 12 will appear.

Figure 12 - Adding a new Application definition

We supply the following parameters:
    • Our application name (SQOOP): we require this tenant to use a different version of Sqoop than the version included with BigInsights, as mentioned earlier
    • The user ID that starts the job tracker and runs jobs: this is the impersonation feature described earlier. This particular application will run under the OS ID AppB.
  • 32.
    • The job priority: Symphony has 10,000 priority levels. By default we are going to submit Sqoop jobs with a low priority.
    • The user accounts that have access to this application: note that we’ve provided all users in GroupA access to the application, along with named operating system and Platform Symphony users.
Based on this information, Platform Symphony adds an application named Sqoop with a set of reasonable defaults for a Hadoop MapReduce job. To make sure that our new application is working, as a user entitled to use the application I can submit a test job as I did before. Note that in this case I want the job handled by a different MapReduce application definition, so I specify Sqoop as the application name on the command line. Test the new application consumer by submitting a job as before.

[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -Dmapreduce.application.name=Sqoop -m 2 -r 10 -mt 2000 -rt 2000
14/03/13 12:32:07 INFO internal.MRJobSubmitter: Connected to JobTracker(SSM)
14/03/13 12:32:08 INFO internal.MRJobSubmitter: Job <Sleep job> submitted, job id <1>
14/03/13 12:32:08 INFO internal.MRJobSubmitter: Job will not verify intermediate data integrity using checksum.
14/03/13 12:32:08 INFO mapred.JobClient: Running job: job_ssm_0001
14/03/13 12:32:09 INFO mapred.JobClient: map 0% reduce 0%
14/03/13 12:32:37 INFO mapred.JobClient: map 100% reduce 0%
14/03/13 12:32:52 INFO mapred.JobClient: map 100% reduce 20%
14/03/13 12:32:56 INFO mapred.JobClient: map 100% reduce 40%
14/03/13 12:33:00 INFO mapred.JobClient: map 100% reduce 60%
14/03/13 12:33:05 INFO mapred.JobClient: map 100% reduce 80%
14/03/13 12:33:07 INFO mapred.JobClient: map 100% reduce 100%
14/03/13 12:33:07 INFO mapred.JobClient: Job complete: job_ssm_0001
14/03/13 12:33:09 INFO mapred.JobClient: Counters: 18
..

What has changed is that in figure 13 we see that our job is now running under our separate application definition called Sqoop. This shows the basic process of adding a new application profile for a MapReduce job to Symphony to support our additional tenants. The next step, of course, is to edit the configuration of the tenant as necessary to suit the unique needs of the application. For example, my requirement may be as simple as re-pointing some environment variables to different installation and configuration directories for Sqoop for jobs submitted to this application.

[biadmin@biginsights hadoop-conf]$ set | grep SQOOP
SQOOP_CONF_DIR=/opt/ibm/biginsights/sqoop/conf
SQOOP_HOME=/opt/ibm/biginsights/sqoop
[biadmin@biginsights hadoop-conf]$
  • 33. Note that below my Job ID has reset to “1”, since this is the first job associated with this particular application tenant.

Figure 13 - Sleep job running under newly created application definition

Under “Workload” / “MapReduce” / “Application Profiles” we can define as many separate applications as we’d like. The view below shows additional applications added using the same process detailed for the Sqoop application.

Figure 14 - Available MapReduce Application Profiles

Only MapReduce applications appear because “Application Profiles” has been selected from the MapReduce submenu. Figure 15 shows a similar view of “Applications”, accessible from the same workload dropdown menu, except that instead of looking at Application Profiles I’m looking at a dashboard of the applications themselves with job-related status.
  • 34. Figure 15 - Dashboard of MapReduce applications

Configuring application properties

When a new application profile is created for a new application, a default template is used that represents reasonable settings for a MapReduce workload. The next step is to configure application profiles to meet the unique requirements of each application workload. Application profiles are covered in detail in the Platform Symphony reference manual, accessible from the knowledge center. Some of the more commonly configured settings are shown below.

To configure application properties for Sqoop, modify the application profile by selecting “Workload” / “MapReduce” / “Application Profiles” from the top menu on the MapReduce applications screen. Select the application profile definition for Sqoop created earlier and select Modify. A new window will appear that allows detailed settings for the application to be changed.

This web interface modifies the application service profile definitions (discussed shortly) that are stored in the directory $EGO_TOP/data/soam/profiles on the Platform Symphony master host. Enabled profiles reside in a subdirectory called “enabled” and disabled profiles reside in a subdirectory called “disabled”.

The first tab in the interface, called Application Profile, allows application profile settings to be adjusted. The second tab, labeled Users, provides an opportunity to modify the users and groups that will have access to the application profile.
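Because enabled and disabled profiles live in separate subdirectories, a quick listing from the master host shows the state of each profile without opening the GUI. The sketch below is illustrative only: the function name list_profiles is hypothetical, and since the naming of the files inside each subdirectory is not described here, it simply reports whatever file names it finds.

```shell
# list_profiles DIR
# Print "enabled<TAB>name" or "disabled<TAB>name" for every entry found in
# DIR/enabled and DIR/disabled. Hypothetical helper for illustration.
list_profiles() {
  for lp_state in enabled disabled; do
    for lp_file in "$1/$lp_state"/*; do
      [ -e "$lp_file" ] || continue     # skip the unexpanded glob of an empty directory
      printf '%s\t%s\n' "$lp_state" "$(basename "$lp_file")"
    done
  done
}
```

It would be called as `list_profiles "$EGO_TOP/data/soam/profiles"`.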
  • 35. Figure 16 - Application Profile

Some important tips about Application Profiles:
    • Application Profile names must be unique
    • An Application Profile can be associated with only a single consumer
    • In the consumer tree, MapReduce applications are by default placed under the MapReduceConsumer tree
    • You can find templates for various application profiles in the directory $SOAM_HOME/6.1/Samples/Templates. The term SOAM in Symphony refers to the service-oriented application middleware on which the MapReduce service is implemented
The application profile can be viewed in an Advanced Configuration, a Basic Configuration or a Dynamic Configuration Update mode. The Dynamic Configuration Update mode is not covered here, but essentially it allows an administrator to register a profile fragment (part of an application profile) that modifies either the session types or the services section of the profile.

The General settings area contains settings such as where metadata associated with jobs and job history is stored, the default service definition to be used (MapReduce for MapReduce applications) and resource requirements.
  • 36. Resource requirements are an important concept in Symphony. In this simple example, by using the syntax “select(!mg)” we are essentially saying: run this service on any host that is not tagged as a member of the management group. Resource requirement selections in Symphony are flexible and are covered in the Symphony documentation. I can use SQL-like resource requirement strings to specify the types of resources I would like to use in a granular way. If, for example, I know that a particular application runs best on a large-memory PowerLinux machine, I can express a requirement (or preference) for this application with an appropriate resource requirement string.

select(!mg) && select(PowerResourceGroup) && select(maxmem > 8000 && maxswp >=16000)

The example above would indicate that this service requires resources that are part of a Power-based resource group, that are not management hosts, and where at least 8GB of physical memory and 16GB of swap space are available.

Pre-starting application services is a useful feature in Symphony. Application services refer to the Symphony session manager (SSM) as well as the service instance managers and service instances associated with the application. As a reminder, with MapReduce workloads the SSM can be viewed as an Application Manager. This is the component that implements the JobTracker logic. Service instances will load TaskTracker logic appropriate to the version of Hadoop and will start map or reduce tasks appropriate to the application. If you have many applications and are frequently sharing slots, pre-starting applications may not be useful. By default Symphony will start SSMs automatically as clients connect and request services from the middleware.
As resources are assigned to applications, Symphony will dynamically provision the needed service code and start services as appropriate. Pre-starting applications is useful for applications that need to respond quickly. You can control the number of slots (each slot can support a map or reduce task) that are pre-started by default.

Figure 17 - Optionally have an application pre-allocate services

A key thing to understand about the Platform Symphony session manager is that it is fully multithreaded and can accommodate multiple sessions at the same time. A session equates to a job submitted by a MapReduce user. Each job maps to a session, and each session may have large numbers of tasks.
  • 37. When multiple users are concurrently submitting jobs to the same application, the scheduling policy controls how resources are shared. The R_Proportion policy specifies that resources are shared in proportion to the priority of the job, which is often the most sensible choice. As an example, if I had 5000 slots allocated to this application consumer definition and JobA was submitted to the application with priority 4000 and JobB was submitted with priority 1000, Symphony would run both workloads concurrently under the same application definition, giving 80% of available resources to JobA. Unlike standard Hadoop, where resource assignments are static while the job is executing, Symphony can respond quickly at run-time to re-balance resource allocations between jobs.

Note that since each SSM maps to an application (a MapReduce application in this case), this scheduling policy controls how multiple jobs running in the same application context share resources. A separate resource sharing plan, discussed shortly, controls how sharing is implemented more broadly between applications and tenants.

The term application can be confusing to users not familiar with Symphony. Symphony is referring to an application in the context of the Hadoop services themselves: the binary code that comprises BigInsights services like the JobTracker and the TaskTracker. It is not referring to the actual application code written by users that runs on the Hadoop framework. In this case, a single Symphony application can run different user applications within the same Hadoop MapReduce context.
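The arithmetic behind the JobA and JobB example can be sketched as follows. This only illustrates a strictly proportional division using the numbers from the text. It is not Symphony's scheduler, which also re-balances at run-time, and slot_share is a hypothetical helper name.

```shell
# slot_share TOTAL_SLOTS JOB_PRIORITY SUM_OF_PRIORITIES
# Integer number of slots a job would receive if slots were divided strictly
# in proportion to priority. Hypothetical helper for illustration.
slot_share() {
  echo $(( $1 * $2 / $3 ))
}

total=5000
prio_a=4000
prio_b=1000
sum=$(( prio_a + prio_b ))

echo "JobA: $(slot_share "$total" "$prio_a" "$sum") slots"   # JobA: 4000 slots
echo "JobB: $(slot_share "$total" "$prio_b" "$sum") slots"   # JobB: 1000 slots
```

JobA's 4000 of 5000 slots is the 80% share quoted above, and JobB receives the remaining 20%.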
Figure 18 - controlling how multiple jobs associated with an application share resources

The Symphony application profile definition provides precise control over how MapReduce workloads run, and this is useful to advanced users (in our experience most sites running Hadoop are already quite advanced and will appreciate this). A nice feature of Symphony is that, because the execution logic is provisioned dynamically, slots are interchangeable between mappers and reducers. The settings in figure 19 allow this to be configured, along with preferences for default ratios between mappers and reducers and precise configuration on a per resource group basis.
  • 38. Figure 19 - MapReduce Settings associated with an Application

Symphony can allow multiple service definitions to exist for each application, and the service definition section provides granular control over this capability. This is useful for applications written to Platform Symphony’s native APIs and may be useful for Hadoop developers. For BigInsights it is not necessary to change this setting, because Platform has already implemented a service called “RunMapReduce” that is started by service instance managers to handle MapReduce workloads. The process of starting this service is automatic for the MapReduce service. The service itself can be found in the directory ${EGO_TOP}/soam/mapreduce/6.1/linux2.6-glibc2.3-x86_64/etc. Note that the Start Command in figure 20 allows for operating system specific implementations of a service definition for an application.

Figure 20 - configuring service definitions for the application

In the application profile definition, an administrator can control environment variables associated with the application. This is an important capability for ensuring multitenancy. By using environment variables I can control what applications run in granular ways. If I choose, I could have an application profile that
  • 39. associates itself with a separate Hadoop instance by defining application-specific variables such as $HADOOP_HOME and $HADOOP_CONF_DIR that reference different software versions and different configuration files. I can also resolve the technical issues that often occur where particular applications depend on particular versions or distributions of the Java run-time environment by defining $JAVA_HOME to point to the version of Java needed by a specific application.

Figure 21 - configuring the environment for the application

This is a good time to mention that while much of the discussion in Hadoop centers on Java, because Hadoop itself is written in Java, Symphony supports heterogeneous applications. It does not matter whether application clients or services are written in C/C++, Java, scripting languages or even C# in Microsoft .NET environments. The versatility to handle all types of workloads is what makes Symphony powerful as a multitenant environment.

Another unique capability that Symphony brings to Hadoop is the notion of “Recoverable sessions”. This concept does not exist in open source Hadoop, where the job tracker is implemented in a simplistic way. If the JobTracker fails at run-time in standard Hadoop, the job needs to be re-started. The Symphony SOAM middleware, however, has long supported the notion of journaling transactions so that Hadoop MapReduce jobs become inherently recoverable. If the software service running the JobTracker logic fails (and re-starts on the same host or a different host), the Symphony job can recover from where it left off. This is a major advantage for customers that have long-running Hadoop jobs that need to complete within specific batch windows.
  • 40. This and other points of configurability are very important for specific workloads. As another example, if I have execution logic where the reducer is multi-threaded, I can control the ratio of reducer services to slots, thereby giving a reducer multiple slots if it can take advantage of them.

Figure 22 - configuring session behaviors in an SSM / Application Manager

Associating applications with consumers

The last section provided some details on how application profiles are used in Symphony to customize applications to support multi-tenancy. In the Symphony architecture, resources are not actually allocated to applications directly. They are allocated to consumer definitions, which in turn map to applications. This is an important distinction because while the application space is essentially “flat” (I have multiple applications and flavors of applications of different types), the structure of consumers is usually hierarchical. This is because most organizational structures are hierarchical.
    - A bank may have several lines of business, each with various departments or application groups
    - A service provider may have multiple tenant customers, and may provide different application services for each tenant
    - A government agency may have different divisions, each running different applications with a particular need to segment data access
  • 41. Symphony allows consumer trees to be set up in flexible ways to accommodate the needs of almost any organization. A key concept to understand is that the leaf nodes of consumer trees are linked to the application definitions we looked at in the previous section.

Accessing Consumer Definitions

To view consumer definitions, from the MapReduce screen in Symphony select “Resources / Resource Planning / Consumers”. This is the interface that is used to manage the consumer tree. Setting up the consumer tree is reasonably straightforward. The left side panel is used to control where you are in the tree, and the right side of the interface allows one to perform operations relative to that segment of the tree.

Recall from our scenario earlier that we had multiple groups running Datameer workloads for which we wanted to enforce sharing policies. Datameer workloads also have specific setup dependencies that are different from those of BigInsights workloads, so the Datameer workloads require their own application profile. We also wanted to provide isolation between the work done by different Datameer application user groups. To implement this policy, we have defined sub-consumers under Datameer with a consumer appropriate for each group. We can also control which users have access to each consumer. Note the hierarchical notion of consumers in Symphony.

Figure 23 - A populated consumer tree in Symphony

The leaf nodes of the consumer tree under Datameer each link to a specific application profile. The association between an application and its position in the consumer tree is made in the application profile.
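The relationship between a hierarchical consumer tree and a flat set of applications can be pictured with a short shell sketch. It is only a model of the idea, not Symphony's storage format or API: the table lists leaf consumers by path and names the application linked to each. Datameer and Sqoop appear in this document, but the group names, the application profile names and the overall arrangement of the tree are assumptions made for illustration.

```shell
#!/bin/sh
# Model of leaf consumers mapped to applications (illustrative only).
# Consumer paths and application names are hypothetical placeholders.

# One leaf consumer per line: <consumer path> <application profile>
consumer_map() {
  cat <<'EOF'
/Datameer/GroupA DatameerGroupA
/Datameer/GroupB DatameerGroupB
/BigInsights BigInsightsMR
/Sqoop Sqoop
EOF
}

# Print the application linked to one leaf consumer.
# Returns non-zero if the consumer is not a known leaf.
app_for_consumer() {
  consumer_map | awk -v c="$1" '
    $1 == c { print $2; found = 1 }
    END { exit !found }
  '
}

# List the leaf consumers that sit beneath a parent consumer.
leaves_under() {
  consumer_map | awk -v p="$1/" 'index($1, p) == 1 { print $1 }'
}
```

With this table, `leaves_under /Datameer` lists the two Datameer group consumers, and `app_for_consumer /Datameer/GroupA` prints the single application tied to that leaf. Resources would be planned along the tree, while each leaf resolves to exactly one entry in the flat application list.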
  • 42. Figure 24 - MapReduce applications

Manually editing Consumer Tree definitions

Advanced users may find it easier to manually edit the consumer tree. Platform Symphony stores consumer tree definitions in the file ConsumerTrees.xml in $EGO_TOP/kernel/conf. If you hand edit this file, you will need to restart EGO services to bring the web-based view into synchronization with the actual contents of the XML file where these settings are persisted.
  • 43. After editing the ConsumerTrees.xml file as shown above, while logged in as the cluster administrator (biadmin), stop and restart EGO services using the BigInsights scripts below to make sure that the changes are reflected in the Platform Symphony console.

    $ stop.sh HAManager
    $ start.sh HAManager

Controlling access to applications and consumers

In the Sqoop consumer definition above, the built-in Symphony user “Admin” has administrative responsibility for the consumer. Several other users are listed as being able to access the application associated with the consumer. The user eric is not a member of the list of permitted users. If an unauthorized user attempts to submit a job against the application definition (Sqoop) associated with this Sqoop consumer, they see an error like the one shown below, as expected.

    [eric@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -Dmapreduce.application.name=Sqoop -m 2 -r 10 -mt 2000 -rt 2000
    java.io.IOException: interrupted
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:1068)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:1032)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1575)
        at org.apache.hadoop.examples.SleepJob.run(SleepJob.java:174)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
    ..
    Caused by: java.lang.InterruptedException: Domain <VEM>: Security error: User: eric is not authorized to perform this operation.

If an authorized user (gord) submits the same workload, note that it runs successfully.
    [gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -Dmapreduce.application.name=Sqoop -m 2 -r 10 -mt 2000 -rt 2000
    14/03/14 08:56:45 INFO internal.MRJobSubmitter: Connected to JobTracker(SSM)
    14/03/14 08:56:45 INFO internal.MRJobSubmitter: Job <Sleep job> submitted, job id <102>
    14/03/14 08:56:45 INFO internal.MRJobSubmitter: Job will not verify intermediate data integrity using checksum.
    14/03/14 08:56:45 INFO mapred.JobClient: Running job: job_ssm_0102
    14/03/14 08:56:46 INFO mapred.JobClient: map 0% reduce 0%
    14/03/14 08:57:02 INFO mapred.JobClient: map 100% reduce 0%
    14/03/14 08:57:11 INFO mapred.JobClient: map 100% reduce 20%
    14/03/14 08:57:15 INFO mapred.JobClient: map 100% reduce 40%
    14/03/14 08:57:19 INFO mapred.JobClient: map 100% reduce 60%
    14/03/14 08:57:23 INFO mapred.JobClient: map 100% reduce 80%
    14/03/14 08:57:24 INFO mapred.JobClient: map 100% reduce 100%
    14/03/14 08:57:24 INFO mapred.JobClient: Job complete: job_ssm_0102
    [gord@biginsights ~]$
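When job submission is scripted, the two outcomes above can be told apart by looking for the security error text in the client output. The sketch below is one possible way to do that. It matches only the message wording shown in this document, and the function names are hypothetical. Whether the hadoop client also returns a non-zero exit status in the rejected case is not shown here, so the sketch inspects the text instead of relying on it.

```shell
#!/bin/sh
# Classify job-client output captured from a submission (illustrative only).
# The matched text is the Symphony security error quoted in this document.

# Read client output on stdin. Succeeds unless the output contains the
# "not authorized" security error.
submission_authorized() {
  ! grep -q 'is not authorized to perform this operation'
}

# Read client output on stdin and print the rejected user name, if any.
denied_user() {
  sed -n 's/.*User: \([^ ]*\) is not authorized.*/\1/p'
}
```

A wrapper could capture the client output with `2>&1`, pipe it to `submission_authorized`, and on failure pipe the same text to `denied_user` to report which user needs to be added to the consumer.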
  • 44. Determining the execution user for a consumer

Earlier we explained that by using impersonation, Symphony can control the user IDs that different application services run under. In the case of the Sqoop application defined earlier, we had set the application user to appB, and this is reflected in the ConsumerTrees.xml definition. We can verify that impersonation is taking place and that processes are running under the expected user ID by monitoring the process tree while executing MapReduce jobs like the one above. To monitor the process tree, use a command like:

    $ watch 'ps -ef | grep appB'

As you run the job, you will see the SSM start up unless it is pre-started or is lingering on a management host waiting for another job. In this example the services are running on the same node as the master host, so we see the service instance managers and service instances starting locally to manage the job. On a larger cluster you would need to watch the compute hosts to validate that the services are starting as expected and running under the correct user ID.

Figure 25 - verify that services are running under the expected user IDs

We can use the pstree command on the management host to understand the process tree.
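The ownership check described above can also be scripted instead of watched by eye. The following sketch reads `ps -ef` output and reports any matching service process that is not owned by the expected user. It assumes the usual Linux `ps -ef` column layout (UID PID PPID C STIME TTY TIME CMD); the function name is hypothetical, and the service name pattern is passed in by the caller.

```shell
#!/bin/sh
# Report service processes running under an unexpected user (illustrative).
# Assumes `ps -ef` columns: UID PID PPID C STIME TTY TIME CMD.

# usage: ps -ef | unexpected_owners EXPECTED_USER PATTERN
# Prints each line whose executable (field 8) contains PATTERN but whose
# UID column differs from EXPECTED_USER. Exits 0 only if no such line exists.
# Matching on field 8 alone keeps the filter from matching its own command
# line, which contains PATTERN as an argument.
unexpected_owners() {
  awk -v user="$1" -v pat="$2" '
    NR == 1 && $1 == "UID" { next }                    # skip the header row
    index($8, pat) && $1 != user { print; bad = 1 }
    END { exit (bad ? 1 : 0) }
  '
}
```

On a host running the job, `ps -ef | unexpected_owners appB RunMapReduce` prints nothing and succeeds when every MapReduce service instance runs as appB. On a larger cluster the same check would need to be run on each compute host.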
  • 45. Figure 26 - pstree can be used to show the process hierarchy

On compute hosts, services are managed by the pem process. In response to a workload requirement, pem launches a sim process (service instance manager), which in turn runs a service instance. In this case the service instance is the RunMapReduceService, since this is a Symphony MapReduce workload.

Figure 27 - process hierarchy on the execution host

When configuring several consumers and applications as we have shown here, it will also be faster to hand edit the XML-based application profile files.
  • 46. To access XML application profiles, check the directory $EGO_TOP/data/soam/profiles. The XML profiles are stored in subdirectories with names corresponding to their state. For example, Sqoop.xml can be found in an “enabled” subdirectory since the application is enabled and accepting workload.

Configuring Sharing Policies
  • 48. Summary

In this document we’ve described a customer use case involving a multitenant implementation of InfoSphere BigInsights that permits the following:
    - Concurrent execution of different Hadoop applications (including different versions of code) on the same physical cluster
    - Dynamic sharing of resources between tenants in a fashion that maximizes performance and resource utilization while respecting individual SLAs
    - Support for applications other than Hadoop MapReduce to maximize flexibility and allow capital investments to be re-purposed for multiple requirements
    - Security isolation between tenants, removing a major barrier to sharing in many commercial organizations

In our view these advances are significant. While Hadoop is advancing, competing open source and commercial distributions are many years away from offering true multitenancy and practical solutions for supporting multiple workloads on a shared infrastructure. The economic arguments in favor of resource sharing are compelling. Analytic applications are increasingly composed of multiple software components that rely on distributed services. Rather than deploying separate “silos” of application infrastructure, Platform Symphony provides the option to consolidate these different application instances on a common foundation, thus increasing infrastructure utilization, boosting service levels and helping significantly reduce costs.