Guide for those working with customers deploying IBM InfoSphere BigInsights and other Hadoop offerings together with IBM Platform Symphony. While this paper describes the details of one customer implementation, we believe that this use case is relevant to others as well. Challenges related to Hadoop multitenancy are faced by customers across multiple industries.
Recent IDC case study features IBM SoftLayer German customer pixx.io and explains how pixx.io (a solution provider for the media industry) has leveraged IBM Cloud to create a hybrid media sharing service.
Pixx.io talked to a number of suppliers including Amazon Web Services (AWS), Hosteurope and Rackspace, before it met an IBM SoftLayer representative at an exhibition.
Back then, IBM SoftLayer was in the process of opening up its datacentre facility in Frankfurt – which was going to be operational in December 2014.
"After a comparison of capabilities versus the other providers, it was the data location aspect - combined with the strong customer engagement from the start - that played a deciding role in the initial supplier selection. [...] Interestingly, IBM SoftLayer was not the cheapest of the options that pixx.io had evaluated, but when comparing the overall value of the solution, especially the support in setting up the services, the offering presented definitely the best value for money, according to pixx.io."
Learn whether cloud-based storage or dedicated storage is best for your business IT infrastructure, depending on your organization's requirements. Check Netmagic's outlook.
The document describes a software called LetterGen that allows users to dynamically generate documents from multiple data sources in different formats. It has modules for storing templates and business logic, designing documents, and generating documents securely for users. The software aims to reduce costs and errors in document creation processes.
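The template-plus-data pattern that LetterGen is described as using can be sketched in a few lines of Python; the template text and field names below are invented for illustration and are not taken from LetterGen itself.

```python
from string import Template

# Hypothetical letter template; a system like LetterGen would load this
# from its template store rather than define it inline.
letter = Template("Dear $name,\n\nYour order $order_id shipped on $date.")

# The data could come from any source: a database row, a CSV record, an API response.
record = {"name": "A. Customer", "order_id": "42-17", "date": "2024-05-01"}

document = letter.substitute(record)
print(document)
```

Real document-generation systems layer per-format renderers (PDF, HTML) and validation on top of this merge step, which is where the claimed reduction in manual errors comes from.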
Today the telco industry is at the vortex of change due to developments such as network functions virtualization and big data analytics. By allying with IT to embrace and transcend the disruptions characterized by these developments, telecom providers stand to benefit from reduced costs and new revenue streams, and see their profits grow.
Modernization of storage infrastructure with technologies like all-flash arrays is helping organizations manage large amounts of structured and unstructured data to power digital transformation initiatives. All-flash arrays provide significantly higher performance than traditional spinning disk storage and enable consolidation of workloads. They also reduce data center space and energy usage. Selection criteria for all-flash arrays include performance, data services, cloud integration, seamless upgrade ability, and management capabilities. Leveraging data assets is key to digital transformation success by enabling insights for improved customer experiences, new revenue streams, and operational efficiencies. CIOs must address growing business demands against flat IT budgets by reducing operational expenses through predictive analytics and infrastructure optimization.
The document summarizes and compares IBM and EMC's strategies for information infrastructure. It finds that IBM takes a more holistic, solution-oriented approach to address all customer needs, while EMC maintains a stronger product focus through its disk, security, and content management business units. The document also notes that IBM can provide a more complete set of hardware, software, services and financing to support customers' information infrastructure transformations.
White Paper: Rethink Storage: Transform the Data Center with EMC ViPR Softwar... - EMC
This white paper discusses the software-defined data center (SDDC) and the challenge that heterogeneous storage silos pose in making the SDDC a reality. It introduces EMC ViPR software-defined storage, which enables enterprise IT departments and service providers to transform physical storage arrays into a simple, extensible, open virtual storage platform.
IBM InfoSphere Data Architect 9.1 - Francis Arnaudiès (IBMInfoSphereUGFR)
The document discusses IBM InfoSphere Data Architect, a tool for modeling, relating, and standardizing diverse data assets. It can design and manage enterprise data models, enforce standards, leverage industry data models, and optimize existing investments. The tool is based on the Eclipse platform and allows various users like data architects, database developers, and administrators to be more productive. It provides logical, physical, and dimensional modeling capabilities as well as tools to define and enforce standards to increase quality and governance.
Build the Optimal Mainframe Storage Architecture - Hitachi Vantara
This document discusses the benefits of using a switched FICON architecture with Hitachi Virtual Storage Platform storage connected to IBM mainframes through a Brocade Gen5 DCX 8510 director, over a direct-attached storage configuration. Some key advantages of the switched FICON approach are that it overcomes buffer credit limitations on FICON channels, allows fan-in and fan-out connectivity for better resource utilization, helps localize failures for improved availability, and provides greater scalability. The Hitachi VSP provides high performance, large capacity, and data services for mainframe environments, while the Brocade director offers reliability, scalability, and high bandwidth. Together they provide an optimal solution for mainframe storage.
Business analytics can drive real-time performance when using SAP HANA. Hitachi provides a unified compute platform that solves challenges of SAP HANA with fast query performance, scalability without complexity, and mission-critical operations based on over 50 years of engineering experience. The platform reduces operational expenses in testing/development, deployment time, asset utilization, environmental costs, and improves staff productivity.
Explains how backup-free storage reduces cost and complexity; provides benefits of Hitachi Content Platform; includes brief HDS backup use cases.
For more information on our Unstructured Data Management Solutions please check: http://www.hds.com/go/hitachi-abc-ebook-managing-data/
Meeting Mobile and BYOD Security Challenges - Symantec
This white paper is written for enterprise executives who wish to understand what digital certificates are and why they are invaluable for mobile and Bring Your Own Device (BYOD) security on wired and wireless networks. The paper also illustrates the benefits of adopting Symantec Managed PKI Service and provides real-world use cases.
Denodo as the Core Pillar of your API Strategy - Denodo
Watch full webinar here: https://buff.ly/2KTz2IB
Most people associate data virtualization with BI and analytics. However, one of the core ideas behind data virtualization is the decoupling of the consumption method from the data model. Why should the need for data requests in JSON over HTTP require extra development? Denodo provides immediate access to its datasets via REST, OData 4, GeoJSON and other protocols, with no coding involved. Easy to scale, cloud friendly and ready to integrate with API management tools, Denodo can be the perfect tool to fulfill your API strategy!
Attend this session to learn:
- What’s the role of Denodo in an API strategy
- Integration between Denodo and other elements of the API stack, like API management tools
- How easy it is to access Denodo as a RESTful endpoint
- Advanced options of Denodo web services: OAuth, OpenAPI, geographical capabilities, etc.
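As a rough sketch of what consuming such a REST endpoint looks like from the client side, the snippet below builds an OData-style request URL and parses a JSON payload; the server URL, view name, and the exact response shape are assumptions made for this example, not Denodo's documented format.

```python
import json
from urllib.parse import urlencode

# Hypothetical Denodo-style REST endpoint for a published view.
base = "https://denodo.example.com/server/sales/customer_api/views/customer"
query = urlencode({"$filter": "country eq 'DE'", "$format": "json"})
url = f"{base}?{query}"  # an HTTP GET on this URL would return JSON

# Simulated response body, in an assumed name/elements shape.
response_body = '{"name": "customer", "elements": [{"id": 1, "country": "DE"}]}'
payload = json.loads(response_body)
rows = payload["elements"]
print(url)
print(rows)
```

The point of the "no coding involved" claim is that the server side of this exchange is generated from the data model, so the consumer only ever writes the client-side request.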
This document provides a summary of Gartner's Magic Quadrant report on enterprise content management vendors. It assesses 22 vendors and places them in four categories based on their completeness of vision and ability to execute. The summary analyzes the strengths and cautions of several leading vendors, including Alfresco, EMC, Ever Team, Fabasoft, HP, Hyland, and IBM. It describes their product portfolios, target markets, growth strategies, and areas for improvement.
Rethink Storage: Transform the Data Center with EMC ViPR Software-Defined Sto... - EMC
This white paper discusses the evolution of the Software-Defined Data Center and the challenges of heterogeneous storage silos in making the SDDC a reality.
IDC Spotlight: PBBAs Tap into Key Data Protection Trends to Drive Strong Mar... - Symantec
Purpose-built backup appliances (PBBAs) have grown rapidly since their introduction around 2006, and they are expected to generate $3.38 billion in revenue in 2014. PBBAs are turnkey data protection solutions, providing hardware/software bundles targeted at helping organizations protect and recover their data in the highly dynamic 3rd Platform computing era. They are excellent options for enterprises looking to deploy their first backup solution or expand their existing data protection infrastructure.
PBBAs provide ready access to the latest disk-based data protection technologies to help organizations deal with the high-growth, highly agile, and extremely heterogeneous computing infrastructure that is quickly becoming a reality in today's datacenters. This Technology Spotlight examines the PBBA market, discussing the drivers of market development and the key benefits these appliances offer enterprises. It also looks at the role of Symantec in this strategically important market.
Netmagic stresses how switching to the cloud allows organizations to meet their changing needs and goals without large capital or time investments. Read more here!
The document discusses content centric applications, which require new storage architectures optimized for processing and analyzing vast amounts of content. These applications are driving major increases in storage capacity needs. Content centric storage systems must provide high performance, scalability, simplicity of access, and cost effectiveness without compromising functionality. The NetApp E-Series storage system is designed to meet the unique demands of content centric applications through its performance, efficiency, reliability, and ability to integrate with operational environments.
The document provides an overview of IBM's BigInsights product. It discusses how BigInsights can help businesses gain insights from large, complex datasets through features like built-in text analytics, SQL support, spreadsheet-style analysis, and accelerators for domain-specific analytics like social media. The document also summarizes capabilities of BigInsights like Big SQL, Big Sheets, Big R, and its embedded text analytics engine.
The Next Evolution in Storage Virtualization Management White Paper - Hitachi Vantara
Hitachi's global storage virtualization solution combines advanced storage virtualization technology with integrated management software. This allows enterprises to pool, abstract, and mobilize storage resources across physical storage platforms, enabling more efficient management of large, complex storage environments. Hitachi Command Suite provides centralized management of Hitachi and third-party storage systems. When used with Hitachi's Virtual Storage Platform and Storage Virtualization Operating System, it can manage global storage virtualization environments at enterprise scale with lower costs.
The 2019 Storage brand leader surveys cover 14 storage products. This report includes the results of IT Pro voting for six categories of brand leadership for each service: Market, Price, Performance, Reliability, Innovation, and Service & Support.
This document provides a summary of ESG Lab's validation of the Hitachi Content Platform portfolio, including Hitachi Content Platform (HCP), HCP Anywhere, and Hitachi Data Ingestor Remote Server (HDI). ESG Lab tested how these products can be integrated to provide scalable, secure storage and sharing of unstructured data across distributed environments. Key findings include:
1) HCP provides a massively scalable object storage system for private cloud storage, content distribution, and compliance. HCP Anywhere enables secure file sharing and HDI acts as a cache at remote sites, providing seamless access to HCP storage.
2) ESG Lab tested a simulated multi-tenant environment and found HCP's management
Presentation from Chesapeake Regional Tech Council's TechFocus Seminar on Cloud Security; presented by Scott C Sadler, Business Development Executive - Cloud Computing, IBM US East Mid-Market & Channels, on Thursday, October 27, 2011. http://www.chesapeaketech.org
MBA II U-V: Enterprise Application Integration - Rai University
Enterprise application integration (EAI) uses software and systems to integrate enterprise computer applications that typically cannot communicate to share data or business rules, such as supply chain, customer relationship, and business intelligence applications. EAI provides a common infrastructure and methodology to connect existing and new applications while ensuring data consistency, business rule independence from specific vendors, ongoing support, and security and privacy requirements. Common EAI standards include XML, SOAP, WSDL, and UDDI, and it uses techniques like object-oriented programming, message brokers, and middleware. EAI allows enterprises to modernize legacy systems while adopting new technologies.
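A message broker, one of the EAI techniques listed above, can be illustrated with a minimal in-process publish/subscribe hub; the topic name and the subscribing "applications" below are placeholders for illustration, not any vendor's API.

```python
from collections import defaultdict
from typing import Callable, Dict, List

class MessageBroker:
    """Tiny in-process pub/sub hub illustrating the EAI broker pattern."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # Deliver the message to every application subscribed to this topic.
        for handler in self._subscribers[topic]:
            handler(message)

received = []
broker = MessageBroker()
# A CRM and a BI application both consume order events without knowing about each other.
broker.subscribe("order.created", lambda msg: received.append(("crm", msg["id"])))
broker.subscribe("order.created", lambda msg: received.append(("bi", msg["id"])))
broker.publish("order.created", {"id": 1001})
print(received)
```

The value for EAI is the decoupling: the publishing application needs no knowledge of which supply chain, CRM, or BI systems consume its events, so consumers can be added or replaced without touching the producer.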
Presented at the New Zealand Computer Society 50th Anniversary Conference. The conference theme was about ICT Innovation.
This presentation, delivered at the conference by Phil Patton of IBM NZ, focuses on answering in simple terms the key questions many are asking in their quest to understand why there is so much hype around Cloud: what are the key ingredients of Cloud Computing? What's different about it, what are the deployment types, and what workloads are suitable for Cloud deployment?
Phil will also cover the Enterprise Roadmap for Cloud adoption, the integration and connectivity between Cloud and legacy applications and address the significant security concerns related to the uptake of Cloud.
Presentation about BigData from a German Webcast: http://business-services.heise.de/it-management/big-data/beitrag/big-data-technologie-einsatzgebiete-datenschutz-160.html?source=IBM_12_2013_IT_Conn
Integrating Structure and Analytics with Unstructured Data - DATAVERSITY
How can you make sense of messy data? How do you wrap structure around non-relational, flexibly structured data? With the growth in cloud technologies, how do you balance the need for flexibility and scale with the need for structure and analytics? Join us for an overview of the marketplace today and a review of the tools needed to get the job done.
During this hour, we'll cover:
- How big data is challenging the limits of traditional data management tools
- How to recognize when tools like MongoDB, Hadoop, IBM Cloudant, R Studio, IBM dashDB, CouchDB, and others are the right tools for the job.
Make Your IT Department a Competitive Differentiator for Your Business - Marcos Quezada
IBM Systems combines the strengths of IBM middleware and IBM hardware to create a resilient, modern enterprise infrastructure that makes your IT department a competitive differentiator for your business. Infrastructure Matters #ITMatters
International Journal of Modern Engineering Research (IJMER) is a peer-reviewed, online journal. It serves as an international archival forum of scholarly research related to engineering and science education.
Four major trends - mobile, social media, big data, and cloud computing - are driving significant changes in the IT industry according to IDC. These trends are creating demand for more agile and flexible IT infrastructure that can dynamically allocate resources and provide continuous access. This will require new infrastructure capabilities around automation, virtualization, and software-defined storage to accommodate massive data growth, meet performance requirements, and reduce costs. Storage infrastructure will need to support agility, massive scaling, high availability, and improved storage efficiency through technologies like flash storage and data deduplication to deliver value in this new environment.
The document discusses IBM's BLU Acceleration for data warehousing and business intelligence in the cloud. It provides an overview of IBM's cloud offerings including DB2 with BLU Acceleration, Cognos BI, InfoSphere Data Architect, and security features. Examples are given showing how the cloud-based solution could be used for development and testing applications as well as creating agile analytic marts. Deployment options on AWS or SoftLayer are outlined along with trial and purchase options.
This document discusses IBM's industry data models and how they can be used with IBM's data lake architecture. It provides an overview of the data lake components and how the models integrate by being deployed to the data lake catalog and repositories. The models include predefined business vocabularies, data warehouse designs, and other reference materials that can accelerate analytics projects and provide governance.
This report helps the reader understand the cloud computing segment, the different technology and service vendors in this space, and their positioning in this market.
This document discusses criteria that small and medium enterprises should consider when building efficient storage environments. It recommends that in addition to factors like cost, reliability, and support, SMEs should also evaluate ease of implementation, scalability, flexible purchase options, and storage management tools that improve productivity. Major vendors like EMC, HP, and IBM offer packaged solutions that make storage simpler for SMEs to deploy while providing enterprise-level features. It analyzes each vendors' storage strategy and sales approach, noting differences in how they integrate and deliver storage solutions.
This document discusses the business drivers and opportunities for cloud computing. It notes that CIOs are looking to cut costs and IT budgets while still driving business value. Cloud computing allows companies to leverage past IT investments and maintain security while increasing access. The cloud is seen as a way to reduce costs and gain a competitive edge. The document also summarizes analyst predictions of strong growth in cloud computing spending and adoption between now and 2012. It outlines opportunities for partners in building applications for the Windows Azure platform and migrating existing applications and customers to the cloud.
Calculating the true value of industry specific clouds linthicumDavid Linthicum
IDC sees a $65 billion market for industry-specific clouds in 2013, rising to $100 billion by 2016. Industry-specific clouds are PaaS, IaaS, and SaaS services tailored for specific industries that allow businesses within a vertical to connect to predefined applications, processes, and databases. The value is that businesses can leverage existing industry data and processes rather than defining everything within a generic cloud. The session will help determine whether industry-specific clouds make sense for a particular business by examining costs, value of agility, potential for process and service reuse, and how to calculate return on investment.
Realizing a shared, multi-tenant infrastructure for Big Data and Analytic applications using IBM® InfoSphere® BigInsights and IBM Platform Computing™
Last revised: April 19, 2014
By: Gord Sissons
Steven Sit
Eric Fiala
Michael Feiman
Contents
Document History.........................................................................................................................................4
Introduction ..............................................................................................................................................4
Disclaimers and limitations.......................................................................................................................4
About the customer described in this use case........................................................................................5
Industry Challenges...................................................................................................................................5
Impact on Information Technology ......................................................................................................6
The Big Data Environment ........................................................................................................................7
Hardware Infrastructure.......................................................................................................................7
The Software Environment...................................................................................................................7
Customer Requirements.......................................................................................................................8
Installing InfoSphere BigInsights for Multi-tenant services......................................................................9
Installation steps...................................................................................................................................9
Accessing the Platform Symphony Management Console .................................................................12
Accessing the Platform Symphony knowledge center........................................................................14
Platform Symphony Concepts.................................................................................................................15
An example of configuring a cluster for multi-tenancy ..........................................................................18
Adding users to run MapReduce applications....................................................................................19
Provide access to the BigInsights / Platform Computing cluster........................................................23
Understanding Platform Symphony Impersonation...........................................................................24
Configuring OS groups for the multitenant environment...................................................................25
Submitting a test job as a user to verify the configuration ................................................................25
Associating BigInsights with a Symphony Application........................................................................28
Enabling Symphony Repository Services ............................................................................................29
Adding a new Application / Tenant ....................................................................................................30
Configuring application properties .....................................................................................................34
Associating applications with consumers ...........................................................................................40
Accessing Consumer Definitions.........................................................................................................41
Manually editing Consumer Tree definitions......................................................................................42
Controlling access to applications and consumers.............................................................................43
Determining the execution user for a consumer................................................................................44
Configuring Sharing Policies....................................................................................................................46
Summary.................................................................................................................................................48
Document History
Date of this revision: Saturday, April 19, 2014

Revision  Date             Summary of changes
0.9       March 23, 2014   Initial draft
0.95      April 19, 2014   Incorporated many valuable comments from Steven Sit based on his direct client experience – thank you, Steven.
Introduction
This document is written for IBM and partner architects. It is intended to be a guide for those working
with customers deploying IBM InfoSphere BigInsights and other Hadoop offerings together with IBM
Platform Symphony. While this paper describes the details of one customer implementation, we believe
that this use case is relevant to others as well. Challenges related to Hadoop multitenancy are faced by
customers across multiple industries.
The target audience for this document includes:
- Architects responsible for deploying big data or analytic workloads
- Technical users looking for ways to deploy Hadoop on shared clusters
- IBM architects, ISVs or business partners interested in building multitenant Big Data environments to help customers reduce infrastructure requirements and save costs
This paper does not delve into YARN. YARN is another important (but less mature) technology that delivers some of the capabilities described herein. It is important for IBM customers to understand that IBM BigInsights is a safer choice in the sense that it supports open source technologies like YARN while simultaneously offering more advanced capabilities. IBM's view is that clients can best determine what capabilities they need, and IBM InfoSphere BigInsights provides them with flexibility: the best of a 100% open source distribution along with significant value-added capability.
In the customer example documented here, the business advantage of using proprietary capabilities (IBM Platform Symphony) dramatically outweighed the benefits of being “pure” from an open source standpoint. The client was able to consolidate roughly 30 applications onto a shared infrastructure and avoid the significant incremental capital expense that would have been required to set up separate clusters had the client decided to proceed with open source YARN alone.
Disclaimers and limitations
The details of the customer implementation are proprietary and confidential. As such, while we can
describe what was done technically, we cannot share details of how this customer used particular
applications. As a result, the examples provided herein are meant to explain qualitatively what was
achieved by the customer without betraying confidential information. The details and screenshots in this
document are not from the customer environment. They have been reproduced on a small test cluster
to explain particular capabilities that the client chose to take advantage of.
About the customer described in this use case
The customer described in this paper is a full-service financial services provider. They offer a broad range of products to their clients including insurance, banking, investing, real estate, retirement planning, wealth management and health insurance. Like many in the financial services sector, this customer is increasingly deploying Hadoop-based applications to augment their data warehouse. They are motivated by the following imperatives:
- The need to leverage big data analytics to make better business decisions, improve customer relations and develop innovative new products and services
- The need to contain or reduce costs (the cost of storing and processing data on a Hadoop cluster is an order of magnitude less than persisting the same data in their data warehouse)
- The desire to architect their environment as a shared service, so that each line of business does not build its own discrete analytic environment on premise or in the cloud
Industry Challenges
Like many industries, the sector represented by this client is going through significant change. As a full-spectrum provider, the client is disproportionately impacted by regulation. As a bank, not only are they subject to various provisions in legislation like Dodd-Frank, but they are also impacted by insurance industry requirements such as the NAIC’s Risk Management and Own Risk and Solvency Assessment (RMORSA) Model Act and other Enterprise Risk Management initiatives that have emerged in response to the financial crisis of 2008.
Of particular consequence is the Volcker rule, a provision of US financial reform legislation that gives regulators the ability to limit or prohibit certain types of proprietary trading activities. While the rule is directed at retail banks, this client will be impacted across their insurance and wealth management businesses, where proprietary trading is important to maximizing investment gains.
As if this tsunami of new regulation were not enough, fundamental changes driven by external factors are also taking place in the insurance industry. Among these factors are new disruptive technologies: big data, social and mobile technologies are prominent drivers of change. Some specific challenges to the business are:
- Driven by high-profile events and the increased frequency of natural catastrophes, contingent business interruption (CBI) modeling is emerging as a priority for insurance firms
- Dramatic changes driven by technology promise to fundamentally change auto insurance. Among these factors are collision avoidance technologies that promise to shift liability from drivers to manufacturers, social media technologies enabling insurers to seek out and market to lower-risk consumer pools, and advances in GPS and vehicle telematics that promise to provide insurers with more granular data on which to base risk assessments
- Technological advances are leading to an explosion in available information, and to firms that aggregate such information to help insurers better qualify risk
- Widespread consumer use of mobile and social technologies is causing firms to rethink how they promote their brand and provide services to both their customers and agents/advisors
- Advances in analytic techniques are making it easier for insurers to collect, process and visualize information. This is extending beyond core actuarial techniques to include approaches like predictive analytics, natural language processing, social network analysis and simulation-based analytics
- New technologies are changing how information is stored and processed. Distributed file systems and clustered technologies like Hadoop can provide a significant per-terabyte cost advantage over traditional warehouses. Because of these cost advantages, and because the framework is well suited to storing and processing unstructured or semi-structured data, this customer and similar firms are embracing Hadoop as a platform for many new applications
We point this out because risk management, which relies heavily on Monte Carlo simulation for scenario and actuarial modeling, is converging with big data analytics. Both depend on scaled-out infrastructure. Firms that understand this convergence can gain a cost advantage over their competitors.
Impact on Information Technology
Both the regulatory challenges described above and the technological shifts and business pressures are driving the need for greater data processing and analytic capacity.
- Traditional data warehouses cannot scale cost-efficiently to manage the vast amounts of data being collected and processed, nor can they handle the raw volumes of unstructured data involved.
- Organizations need more agile application development methodologies and toolsets that allow them to evolve data schemas and applications on the fly as they continuously incorporate new sources of data into their models.
A one-to-one mapping between applications and infrastructure is no longer practical. Many applications
(Hadoop, scenario generation, Monte Carlo simulation and ETL processing) rely on distributed
infrastructure that scales horizontally. Replicating this clustered infrastructure for each line of business
and each application would be cost prohibitive.
The Big Data Environment
Hardware Infrastructure
The physical infrastructure deployed by this client is shown in Figure 1. While there are actually four identical 16-node clusters, only the production environment is shown here. The server infrastructure is based on an IBM System x reference architecture for InfoSphere BigInsights. Each cluster node has 12 CPUs, over 60 GB of memory and 12 locally connected physical disks. The production cluster has 192 TB of disk and approximately 1 TB of memory.
A unique feature of this environment is that the cluster is shared by approximately 30 different user groups across several lines of business.
Figure 1: Physical infrastructure for shared Hadoop Platform
The Software Environment
The Linux-based infrastructure supports multiple big data and analytic applications. Among these applications are:
- IBM InfoSphere BigInsights (providing core Hadoop services)
- Datameer (for data visualization)
- IBM TeaLeaf – a customer experience analytics platform
- Open source Sqoop 1.2.4 – used to perform bulk data transfers to and from various data sources, including an operational data warehouse and the production Hadoop cluster
- Various MapReduce streaming applications, where for convenience of development the Map and Reduce logic is expressed as Perl scripts
- Many in-house developed Java applications
- Various ETL scripts running in and out of the Hadoop MapReduce framework
The IBM-furnished software environment comprises the following major components:
IBM InfoSphere BigInsights Enterprise Edition
IBM Platform Symphony Advanced Edition (the software is bundled with BigInsights Enterprise
Edition for a single tenant, and this client has purchased production licenses)
IBM GPFS FPO (providing a POSIX compliant file system that fully preserves HDFS semantics)
Customer Requirements
This customer requires a multi-tenant environment for several business reasons listed below.
They wish to share infrastructure between multiple departments and lines of business both to
boost capacity (by allowing departments to tap capacity not being used by others) and to reduce
costs by avoiding the need for separate physical environments.
They need the ability to guarantee service levels to different tenants to ensure that business
critical applications can run in a predictable fashion. For example, ETL or specific database load
operations must complete within an overnight batch window.
Because many services are long-running, to make sharing practical, agile pre-emption is required
to make sure that urgent jobs do not need to wait behind long-running jobs on the cluster.
The client needs to ensure that data is segmented between different tenants on the shared
environment for security and privacy reasons.
Finally, the client requires multi-tenancy for technical reasons that are sometimes overlooked.
As the environment evolves, they need the flexibility to deploy different versions of software
components that may have specific dependencies. A specific example is this client’s requirement
to use a more recent version of open-source Sqoop, distinct from the version included in
BigInsights 2.1.0.1, the version deployed at the time of this writing.
Different Hadoop vendors have different definitions of multi-tenancy, so it is important not to
confuse the multitenant capabilities offered by IBM in Platform Symphony with open-source
offerings such as YARN, which is much less capable. While YARN is an important technology
supported by IBM, its capabilities are well behind those described here.
Installing InfoSphere BigInsights for Multi-tenant services
Realizing a multitenant environment for BigInsights or other applications requires the use of IBM
Platform Symphony Advanced Edition. A run-time version of IBM Platform Symphony Advanced Edition
that enables a single tenant is included with IBM InfoSphere BigInsights Enterprise Edition 2.1 or later.
The Platform Symphony resource manager and workload manager is referred to in the BigInsights
documentation as Adaptive MapReduce for historical reasons. Clients wanting the multitenant
capabilities described in this document will need to license a full version of Platform Symphony
Advanced Edition.
Note that licensing is not enforced by the software directly. Customers can pilot these multitenant
capabilities using only the software included in the BigInsights 2.1 Enterprise Edition or later release
along with appropriate patches.
Installation steps
Fortunately, it is getting much easier to have these products work together. While manual
configuration was required in prior releases, as of BigInsights 2.1 EE a simple patch can be applied to
unlock all of the features of Platform Symphony Advanced Edition and have it work with BigInsights. For
future releases starting in the spring of 2014, full functionality of Platform Symphony will be provided
“out of the box” with BigInsights, with no requirement for a patch. (Please note that customers will still
need to license the software before using it in production.)
The high-level steps to implement InfoSphere BigInsights 2.1 (or later) with IBM Platform Symphony
Advanced Edition are as follows:
Install IBM InfoSphere BigInsights Enterprise Edition by following the installation instructions.
When installing BigInsights it is important to install Adaptive MapReduce. This is the choice that
causes the Platform Symphony software to be installed and configured with BigInsights.
To do this, you will need to edit a file in the installation directory called install.properties before
starting the BigInsights installation process as shown below:
# set AdaptiveMR.Enable to true if you want to install AdaptiveMR instead of Apache MapReduce
AdaptiveMR.Enable=true
# set AdaptiveMR.HA.Enable to true if you want to install AdaptiveMR High Availability, this will also install AdaptiveMR instead of Apache MapReduce
AdaptiveMR.HA.Enable=true
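As a quick sanity check before launching the installer, you might confirm that both Adaptive MapReduce flags are set. The sketch below copies the two settings into a throwaway file purely for illustration; in practice you would point grep at your real install.properties.

```shell
# Sketch: confirm both Adaptive MapReduce flags are enabled before
# starting the BigInsights installer. A throwaway copy of the two
# settings above is used here; substitute your real install.properties.
cat > /tmp/install.properties <<'EOF'
AdaptiveMR.Enable=true
AdaptiveMR.HA.Enable=true
EOF
count=$(grep -c '^AdaptiveMR.*=true' /tmp/install.properties)
echo "Adaptive MapReduce flags set: $count of 2"
```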
For multitenant environments, GPFS FPO is recommended; however, Symphony can be
configured to support multiple tenants regardless of whether HDFS or GPFS FPO is chosen as the
cluster file system.
BigInsights can be installed by using a web-based installation process. This process generates an
XML file that governs the installation, whether it is performed via the GUI or via the install.sh
shell script. The name of this file will vary depending on how the software is installed, but as of
release 2.1 the file is called either simple-fullinstall.xml or fullinstall.xml.
The reason we mention this is that an apparent bug in BigInsights 2.1 caused the XML tag
<apache-mapred> to be set to true when Adaptive MapReduce was requested in the
install.properties file above. It might be worth validating that this setting is correct in the
simple-fullinstall.xml or fullinstall.xml file.
[biadmin@biginsights]$ grep "apache-mapred" simple-fullinstall.xml
<apache-mapred>false</apache-mapred>
[biadmin@biginsights]$
As you proceed with the installation, you should see the BigInsights installation script install the
“HAManager” software components. These components contain the Platform Symphony software
that supports the HA and Adaptive MapReduce functionality. You can watch for this either
through the web installation GUI or by checking the installation log file.
If you are installing BigInsights 2.1 Enterprise Edition you will need to install a patch by following
the procedure documented in the publication “Enabling the full functionality of IBM Platform
Symphony in your BigInsights 2.1 cluster”1. This document is freely downloadable for users with
an IBM developerWorks ID.
You can download a small patch for Platform Symphony 6.1.0.1 (the Symphony version included
in BigInsights 2.1) from https://www.ibm.com/support/fixcentral/ following instructions in the
document referenced above. At the time of this writing you can find and download the needed
package from Fix Central by searching for “Platform Symphony” and downloading the package
named “sym-6.1.0.1-build225866”. This package applies to both 64-bit Linux on Intel and
IBM PowerLinux machines. Later versions of BigInsights will not require this patch.
Follow the instructions in the README file. If you are installing the patch as user “root” on the
BigInsights cluster, it would be a good idea to source the BigInsights environment before
attempting to install the patch since the patch procedure assumes the environment variables are
already set.
1 This documentation can be obtained from: https://www.ibm.com/developerworks/community/wikis/form/api/wiki/ee59a95e-5867-4deb-90af-6bed6b0759b8/page/91903357-0a7d-4a96-bb70-520fb2acdc1b/attachment/52d79fbe-dc37-42f0-be3f-5f4b75f14a05/media/Enable%20the%20full%20functionality%20of%20IBM%20Platform%20Symphony%20in%20BigInsight%202.1%20Cluster.pdf
[biadmin@biginsights opt]$ cd /opt/ibm/biginsights/conf
[biadmin@biginsights conf]$ . biginsights-env.sh
[biadmin@biginsights conf]$ echo $EGO_TOP
/opt/ibm/biginsights/HAManager/data
[biadmin@biginsights conf]$
When this patch is applied, the multitenant capabilities of IBM Platform Symphony will become
functional and will be accessible through the Platform Symphony graphical user interface.
When BigInsights is installed, the BigInsights web console by default is available on port 8080 on the
BigInsights management host (as long as BigInsights services are started).
Check the status of the cluster using this command:
$ /opt/ibm/biginsights/bin/status.sh
If necessary, start BigInsights (which will also start Platform Symphony services):
$ /opt/ibm/biginsights/bin/start-all.sh
While logged in as the BigInsights administrator, if Symphony is properly installed with BigInsights you
should be able to run Symphony specific commands. As an example, the user biadmin should be able to
run the following command:
$ egosh service list
This command will list various software services associated with Symphony and show their status.
When the Platform Computing components are installed (Adaptive MapReduce), the Platform
Computing resource manager (EGO) is used to persist BigInsights services. You will notice that
Symphony services are associated with a consumer called “/Management”. If you are running HDFS,
HDFS services like the DataNode and Secondary NameNode are associated with an “/HDFS” consumer.
The MapReduce shuffle service is started on compute hosts in the cluster.
[biadmin@biginsights ~]$ egosh service list
SERVICE STATE ALLOC CONSUMER RGROUP RESOURCE SLOTS SEQ_NO INST_STATE ACTI
derbydb DEFINED /Manage* Manag*
purger DEFINED /Manage* Manag*
plc DEFINED /Manage* Manag*
WEBGUI STARTED 54 /Manage* Manag* biginsi* 1 1 RUN 121
RS DEFINED /Manage* Manag*
Seconda* DEFINED /HDFS/S*
MRSS STARTED 55 /Comput* MapRe* biginsi* 1 1 RUN 120
DataNode DEFINED /HDFS/D*
SD STARTED 56 /Manage* Manag* biginsi* 1 1 RUN 119
Service* DEFINED /Manage* Manag*
WebServ* DEFINED /Manage* Manag*
NameNode DEFINED /HDFS/N*
[biadmin@biginsights ~]$
Accessing the Platform Symphony Management Console
The Platform Symphony console will usually be on the same host if you follow the installation
recommendations above, but will be on a different port. Port 18080 is the default. You should be able to
log into the Platform Symphony management console at http://<master-host>:18080/platform. The
default administrator login for Platform Symphony is “Admin / Admin”.
In production clusters there will normally be multiple Platform Symphony management hosts. Setting
this up is beyond the scope of this paper and is covered in the Platform Symphony documentation.
Figure 2- Logging into the Platform Symphony Management Console
If you are having trouble connecting to the Symphony web console you can use the command “egosh
service view WEBGUI” to see details about the web service.
The WEBGUI services should be started automatically by EGO, but if it becomes necessary to start or
stop the service, you can use the following commands:
$ egosh logon
Enter Admin as the username and Admin as the password when prompted
$ egosh service start WEBGUI
$ egosh service stop WEBGUI
The WEBGUI service is implemented using Apache Tomcat.
If there are problems with the WEBGUI you can inspect the logs at ${EGO_TOP}/gui/logs/catalina.out
for information about what might be wrong with the service.
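A couple of shell commands suffice to check the tail of that log. This is a sketch; the fallback path mirrors the $EGO_TOP value shown earlier, and the log will only exist on a host where the WEBGUI service has run.

```shell
# Sketch: look at the most recent console (Tomcat) log entries.
# The fallback path mirrors the $EGO_TOP value reported earlier.
LOG="${EGO_TOP:-/opt/ibm/biginsights/HAManager/data}/gui/logs/catalina.out"
if [ -f "$LOG" ]; then
    tail -n 50 "$LOG"
else
    echo "log not found: $LOG"
fi
```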
If you cannot connect to the Symphony console, the connection may be blocked by your firewall
configuration. You can disable your firewall temporarily to see if this is the cause.
# service iptables stop
If you are not sure what port or host the Platform Symphony GUI was installed on, you should be able to
find it in the XML file that governs the BigInsights installation process (described earlier).
This XML file is generated by the web-based installation process. Platform Symphony related setup
details are found under the “high-availability” section of the file.
<high-availability>
<configure>false</configure>
<master-nodes/>
<baseport>7869</baseport>
<web-port>18080</web-port>
<log-directory>var/ibm/biginsights/ps-mapred/logs</log-directory>
<preferred-ip-mask/>
..
<max-retries>3</max-retries>
<failover>failover</failover>
</high-availability>
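If you prefer to script this lookup, the port can be pulled out with sed. The sketch below writes a throwaway copy of the <web-port> element for illustration; in practice you would run sed against your actual fullinstall.xml or simple-fullinstall.xml.

```shell
# Sketch: extract the Symphony console port from the install XML.
# A copy of the <web-port> element above is used for illustration;
# point sed at your real fullinstall.xml / simple-fullinstall.xml.
cat > /tmp/ha-fragment.xml <<'EOF'
<high-availability>
  <baseport>7869</baseport>
  <web-port>18080</web-port>
</high-availability>
EOF
port=$(sed -n 's:.*<web-port>\([0-9]*\)</web-port>.*:\1:p' /tmp/ha-fragment.xml)
echo "Symphony console port: $port"
```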
Once a user logs in to the Platform Symphony console on port 18080, they will see the main Platform
Symphony dashboard. This view is mostly used to monitor the high level status of the various
applications and tenants on a Platform Symphony cluster.
For BigInsights users, most of the action will center around the “MapReduce Workload” screen
accessible under “Quick Links”.
Figure 3 - view of Platform Symphony console when logged in as an Administrator
Accessing the Platform Symphony knowledge center
Once you are able to access the Platform Symphony console above, you may want to access the
Platform Symphony Knowledge Center and bookmark it in your browser. The knowledge center is
accessible in a pull down menu under the question mark in the top bar on the Platform Symphony web
interface.
The knowledge center aggregates all of the various Platform Symphony documentation into a
searchable interface. This will prove handy as you learn about Platform Symphony.
A direct link to the knowledge center can be found at this URL (depending on the hostname where the
web interface is running).
http://<masterhost-name>:18080/doc/symphony/6.1/index.html
The command egosh service list shown earlier will show the name of the host running the web
interface (listed as the WEBGUI) if you are running on a cluster with multiple master hosts.
The Platform Symphony knowledge center, in particular the documentation dealing with the Platform
Symphony MapReduce framework, will be useful to BigInsights administrators since if you are using
Adaptive MapReduce you are in fact using the Platform Symphony MapReduce framework.
Figure 4 - Platform Symphony Knowledge Center
Platform Symphony Concepts
While the reader of this document is likely to be familiar with Hadoop and various commercial
distributions, they may be less familiar with IBM Platform Symphony. IBM Platform Symphony is a
commercial grid workload and resource management solution that has been used to share resources
among diverse applications in multitenant environments for over a decade. Platform Symphony is
widely deployed as a shared services infrastructure in some of the world’s largest investment banks.
As a quick primer on some of the terminology referenced in this document, some definitions are
offered below. We recommend that the interested reader review the document “IBM Platform
Symphony Foundations”, available at http://publibfp.dhe.ibm.com/epubs/pdf/c2750652.pdf .
Session Manager – service-oriented applications in Platform Symphony are managed by a
session manager. The session manager is responsible for dispatching tasks to service instances,
and for collecting and assembling results. The Symphony session manager provides a function
similar in concept to a Hadoop application manager, although it has considerably more
capabilities. Platform Symphony implements job tracker functionality using the session
manager. In this paper the terms job tracker, application manager and session manager are used
interchangeably. While the concept of multiple concurrent application managers is new to
Hadoop with YARN, Platform Symphony has always featured a multitenant design.
Resource Groups – Unlike Hadoop clusters, Platform Symphony does not make assumptions
about the capabilities of hosts that participate in the cluster. While Hadoop generally assumes
that member nodes are 64-bit Linux hosts running Java, Platform Symphony supports a variety
of hardware platforms and operating environments. Platform Symphony allows hosts to be
grouped in flexible ways into different resource groups, and different types of applications can
share these underlying resource groups in flexible ways.
Applications – The term application can be a little bit confusing as it is applied to Platform
Symphony. Symphony views an application as the combination of the client-side and service-
side code that comprise a distributed application. This is a more expansive definition than most
people are used to. By this definition an instance of BigInsights might be viewed as a single
application. Examples of Platform Symphony applications are custom applications written in
C++, commercial ISV applications such as IBM Algorithmics, Calypso or Murex, and commercial
or open-source Hadoop distributions such as Cloudera or BigInsights. Platform
Symphony views applications as being an instance of middleware. Various client side tools
associated with a particular version of Hadoop (Pig, Hive, Sqoop etc) can all run against a single
Hadoop application definition. An important concept for those not familiar with Symphony is
that Symphony provisions service instances associated with different applications dynamically.
As a result, there is nothing technically stopping a Platform Symphony cluster from supporting
multiple instances of Hadoop and non-Hadoop environments concurrently.
Application profiles – As explained above, applications in Symphony are flexible and highly
configurable constructs. An Application Profile in Symphony defines the characteristics of an
application and various behaviors at runtime.
Consumers – From the viewpoint of a resource manager, an application or tenant on the cluster
is defined as something that needs particular types of resources at runtime. Platform Symphony
uses the term “consumer” to define these consumers of resources and provides capabilities to
define hierarchical consumer trees and express business rules about how consumers share
various types of resources collected into resource groups. The leaf nodes in consumer trees map
to a Symphony application.
Services – Services are the portions of applications that run on cluster nodes. In a Hadoop
context, administrators likely think of services as equating to a task tracker that runs Map and
Reduce logic. Here again, Symphony takes a broader view. Symphony services are generic. A
service may be a task-tracker associated with a particular version of Hadoop or it may be
something else entirely. When the MapReduce framework is used in Platform Symphony, the
Hadoop service-side code that implements the Task Tracker logic is dynamically provisioned by
Symphony. Symphony owes its name to this ability to orchestrate a variety of services quickly
and dynamically according to sophisticated sharing policies.
Sessions – A session in Symphony equates to the notion of a job in Hadoop. A client application
in Symphony normally opens a connection to the cluster, selects an application and opens a
session. Behind the scenes Symphony will provision a Symphony Session Manager to manage
the lifecycle of the job. A single Symphony Session Manager may support multiple sessions
(Hadoop jobs) concurrently. A Hadoop job is a special case of a Symphony job. The Hadoop
client will start a session manager that provides JobTracker functionality. Platform Symphony
actually uses the job tracker and task tracker code provided in a Hadoop distribution; however, it
uses its own low-latency middleware to more efficiently orchestrate these services on a shared
cluster.
Repositories – As explained previously, Platform Symphony dynamically orchestrates service-
side code in response to application demand. The binary code that comprises an application
service is stored in a Symphony repository. Normally for Symphony applications, Symphony
services are distributed to compute nodes from a repository service. For Hadoop applications,
code can be distributed either via the repository service, or it can be distributed via the HDFS /
GPFS FPO file system.
Tasks – Symphony jobs are collections of tasks. Symphony jobs are managed by a session
manager that runs on a management host. The session manager makes sure that instances of
the needed service are running on compute nodes / data nodes on the cluster. Service
instances run under the control of a Symphony Service Instance Manager (SIM). MapReduce
jobs in Symphony work the same way, but in this case the Symphony service is essentially
the Hadoop task tracker logic. On Hadoop clusters, slots are normally designated as running
either map logic or reduce logic. Again, in Symphony this is fluid. Because services are
orchestrated dynamically, service instances can be either Map or Reduce tasks. This is an
advantage because it allows full utilization of the cluster as the job progresses. At the start of a
job the majority of slots can be allocated to map tasks while towards the end of the job the
function of slots can be shifted to perform the reduce function.
An example of configuring a cluster for multi-tenancy
In this section we describe the step-by-step procedure to set up multiple tenants in the BigInsights
environment. In order to provide a realistic multitenant scenario, the diagram roughly models our
actual customer environment, with names changed, of course, to protect client confidentiality.
The actual environment is more complex with hundreds of users, dozens of groups and approximately
thirty different applications planned, but the application sharing is similar to the diagram below. This
diagram maps to the “Consumer Tree” in Platform Symphony. Consumer is a term used from the
resource manager’s perspective. The resource manager views an application as a consumer of
resources, and the resource manager is responsible for allocating requested resources according to
policies that will be described shortly.
Figure 5 - an example consumer hierarchy for applications and departments
By default, BigInsights (which is just a single application on the cluster) maps to a single application and
an associated consumer called “MapReduce61” (the name corresponds to the version of Platform
Symphony used to support MapReduce processing in BigInsights – in this case 6.1.0.1). This is done so
that Symphony can accommodate future versions of MapReduce that will be provided in future versions
of BigInsights, and will allow versions to co-exist. This is the first consumer in the consumer tree above.
In the production environment the customer has specific needs:
They wish to structure “sub-consumers” under the BigInsights consumer definition
(MapReduce61). This gives the cluster administrator the ability to have different run-time
characteristics for different BigInsights applications. It also allows us to set up configurable
sharing policies between our different applications and groups, control what users are allowed
to access what applications, and ensure security between tenants by having different
applications run under different user-IDs if desired.
In this example, under the BigInsights tenant (MapReduce61) we have several different
applications. We’ve arbitrarily called them “MR_AppA” through “MR_AppN” although in the real
environment these are the names of the client’s business applications. Note that we need to
configure each application (tenant) so that it runs under a different operating system level user-
id for security isolation. We also want to control in a granular way which users and groups have
access to these various applications.
Also, as shown in Figure 5, the client has additional applications used by particular lines of
business that they would also like to deploy on the same cluster. Examples include some Sqoop
workloads, Datameer, IBM Tealeaf, various in-house developed streaming applications and
others. In this particular customer implementation all of these applications happen to
share the BigInsights MapReduce infrastructure; however, it is important to understand that this need
not be the case. As we’ll see shortly, these applications can be totally different and still be
configured to share infrastructure.
Adding users to run MapReduce applications
In our example we want to show how multiple users, grouped arbitrarily into one or more groups for
security management, can access tenant applications subject to access controls.
We create some sample cluster users for our illustration. These names represent individual cluster
users. For some lines of business, application administrators may choose to create a shared login like
“fraud” for a group authorized to use a particular fraud analytics application.
InfoSphere BigInsights has a recommended procedure for adding users. When using Platform Symphony
together with BigInsights, it is recommended that users follow the procedures covered in the BigInsights
documentation and use the createosusers.sh tool included in the BigInsights distribution to automate
the creation of OS-level users. Doing this ensures that users can access the BigInsights console to run
applications deployed using the BigInsights application framework.
For convenience, the BigInsights infocenter is available on the public internet. For information on adding
users in BigInsights, you can learn more here: http://www-01.ibm.com/support/knowledgecenter/SSPT3X_2.1.1/com.ibm.swg.im.infosphere.biginsights.admin.doc/doc/bi_admin_add_users.html?lang=en
The specific procedures will depend on whether you are authenticating access via flat files, LDAP, PAM
or PAM+LDAP. In the example below we are using flat files for simplicity.
To create users known to BigInsights, edit the following file:
$BIGINSIGHTS_HOME/console/conf/security/biginsights_users.xml
Add users as shown below.
<?xml version="1.0" encoding="UTF-8"?>
<server>
<featureManager/>
<basicRegistry id="basic" realm="Auth">
<user name="hadoop" password="passw0rd"/>
<user name="biadmin" password="temp4now"/>
<user name="sysadmin2" password="passw0rd"/>
<user name="appadmin2" password="passw0rd"/>
<user name="sysadmin1" password="passw0rd"/>
<user name="appadmin1" password="passw0rd"/>
<user name="dataadmin2" password="passw0rd"/>
<user name="dataadmin1" password="passw0rd"/>
<user name="user3" password="passw0rd"/>
<user name="user2" password="passw0rd"/>
<user name="user1" password="passw0rd"/>
<user name="vivian" password="temp4now"/>
<user name="gord" password="temp4now"/>
<user name="eric" password="temp4now"/>
<user name="michael" password="temp4now"/>
<user name="vince" password="temp4now"/>
<user name="steven" password="temp4now"/>
<user name="tiffany" password="temp4now"/>
<user name="appA" password="temp4now"/>
<user name="appB" password="temp4now"/>
<user name="appC" password="temp4now"/>
</basicRegistry>
</server>
The next step is to define groups and associate users with groups. This is an example only; the specifics
will depend on how you wish to structure your own users and groups.
<?xml version="1.0" encoding="UTF-8"?>
<server>
<featureManager/>
<basicRegistry id="basic" realm="Auth">
<group name="supergroup" gid="4000">
<member name="hadoop" uid="4000"/>
<member name="biadmin" uid="200"/>
</group>
<group name="appAdmins" gid="4100">
<member name="appA" uid="4100"/>
<member name="appB" uid="4101"/>
<member name="appC" uid="4102"/>
</group>
<group name="sysAdmins" gid="4200">
<member name="sysadmin1" uid="4200"/>
<member name="sysadmin2" uid="4201"/>
</group>
<group name="dataAdmins" gid="4300">
<member name="dataadmin1" uid="4300"/>
<member name="dataadmin2" uid="4301"/>
</group>
<group name="users" gid="4400">
<member name="vivian" uid="6001"/>
<member name="gord" uid="6002"/>
<member name="eric" uid="6003"/>
<member name="michael" uid="6004"/>
<member name="vince" uid="6005"/>
<member name="steven" uid="6006"/>
<member name="tiffany" uid="6007"/>
</group>
<group name="groupA" gid="5000">
<member name="vivian" uid="6001"/>
<member name="gord" uid="6002"/>
<member name="eric" uid="6003"/>
<member name="michael" uid="6004"/>
<member name="vince" uid="6005"/>
<member name="steven" uid="6006"/>
<member name="tiffany" uid="6007"/>
</group>
<group name="groupB" gid="5001">
<member name="vivian" uid="6001"/>
<member name="gord" uid="6002"/>
<member name="eric" uid="6003"/>
<member name="michael" uid="6004"/>
<member name="vince" uid="6005"/>
<member name="steven" uid="6006"/>
<member name="tiffany" uid="6007"/>
</group>
<group name="groupC" gid="5002">
<member name="vivian" uid="6001"/>
<member name="gord" uid="6002"/>
<member name="eric" uid="6003"/>
<member name="michael" uid="6004"/>
<member name="vince" uid="6005"/>
<member name="steven" uid="6006"/>
<member name="tiffany" uid="6007"/>
</group>
</basicRegistry>
</server>
In addition to having user IDs that map to individuals, you may want particular applications to execute on the
cluster under a specific user ID. For example, if your application is called “appA” you may want to have it
execute under a Linux user ID with the same name for simplicity. To accommodate this, notice that
we’ve added application-specific users to the biginsights_users.xml file in the example above.
You can add users using operating system facilities, but if you do, these users will not be recognized as
having credentials within the BigInsights web interface. They will still work with Symphony and the
BigInsights Hadoop framework however.
The example below shows how additional users can be added at the OS level, but be unable to login to
the BigInsights console.
# useradd fred
# useradd george
# useradd frank
Once you have edited the BigInsights XML files to define users and groups as shown above, you are
ready to run the createosusers.sh script to create these accounts and groups at the operating system
level as well.
Run the createosusers.sh script as user “biadmin”.
# createosusers.sh $BIGINSIGHTS_HOME/console/conf/security/biginsights_groups.xml $BIGINSIGHTS_HOME/console/conf/security/biginsights_users.xml <biadmin's password>
By following the procedure above to create users and groups, you will be able to run and monitor jobs
from both the BigInsights console and the Platform Symphony console.
Figure 6 - user Tiffany, defined as a BigInsights user, is also known to the Platform Symphony GUI
Figure 7 - user Tiffany and others can also run jobs via the BigInsights console.
Provide access to the BigInsights / Platform Computing cluster
For each operating system user who will be submitting jobs, make sure that their .bashrc file (or
equivalent depending on your shell) in the user’s home directory is configured to source the BigInsights
environment as shown below. If you have followed the procedures above, this should be done for you
automatically. We include these details because you may have additional users not known to BigInsights
that require access to Platform Symphony.
Sourcing the BigInsights environment will ensure that various shell variables like $PATH and
$CLASSPATH as well as environment variables specific to BigInsights and Platform Symphony are in the
environment when the user logs on. This will allow them to immediately run both BigInsights and
Symphony commands. If you are adding many users outside the recommended procedure above for adding
BigInsights users, and you want them all to have access to the cluster, it is faster to adjust the
system-wide .bashrc template (in /etc/skel) or the common /etc/bashrc, depending on
your preference.
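If you prefer to script this, the sketch below appends the source line to a skeleton .bashrc idempotently, so re-running it does not duplicate the entry. The helper name add_bi_env is hypothetical; the biginsights-env.sh path is the one used elsewhere in this paper.

```shell
# add_bi_env SKEL_DIR ENV_SCRIPT: append a "source" line for the BigInsights
# environment script to SKEL_DIR/.bashrc, once (idempotent).
add_bi_env() {
    skel="$1"; bi_env="$2"
    touch "$skel/.bashrc"
    # Only append if the source line is not already present.
    if ! grep -qF "$bi_env" "$skel/.bashrc"; then
        {
            echo ''
            echo '# source the environment for BigInsights and Platform Symphony'
            echo "source $bi_env"
        } >> "$skel/.bashrc"
    fi
}

# On a real cluster (run as root):
#   add_bi_env /etc/skel /opt/ibm/biginsights/conf/biginsights-env.sh
```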
If you have followed the instructions above, this step may not be necessary, but it is a good idea to
check that when users login they are inheriting an environment appropriate for running BigInsights jobs
and that they have access to the Platform Symphony environment.
In our case we want both our named users and the user IDs that our applications will run under in
Symphony (see the concept of impersonation explained later) to source the environment and be able to
run commands.
[root@biginsights gord]# cat .bashrc
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
# User specific aliases and functions
# source the environment for BigInsights and Platform Symphony
source /opt/ibm/biginsights/conf/biginsights-env.sh
You should be able to su to your created user ID after this and run Symphony or BigInsights commands.
Below we see that I can run a Symphony command, confirming that my environment is set up correctly.
Note that with the installation of BigInsights we are entitled to use Platform Symphony Advanced
Edition, which is the version of Symphony that supports the Hadoop MapReduce framework. We are not
entitled to use some of the other add-on products listed.
[root@biginsights /]# su - gord
[gord@biginsights ~]$ egosh entitlement info
Symphony Edition : Advanced
Desktop Harvesting : Not Entitled
Server Harvesting : Not Entitled
Virtual Server Harvesting : Not Entitled
GPU : Not Entitled
[gord@biginsights ~]$
After following the procedure above, it is a good idea to make sure that our /etc/group file reflects the
setup we’ve configured in the BigInsights XML files.
In /etc/group, define the users that will be allowed to submit workloads on behalf of each group.
This is a very simple example. In reality, different users would belong to different groups and these
group names would be meaningful in the context of how the customer organizes their business.
groupA:x:5000:vivian,gord,eric,michael,vince,steven,biadmin
groupB:x:5001:vivian,gord,eric,michael,vince,steven,biadmin
groupC:x:5002:vivian,gord,eric,michael,vince,steven,biadmin
groupD:x:5003:vivian,gord,eric,michael,vince,steven,biadmin
groupF:x:5004:vivian,gord,eric,michael,vince,steven,biadmin
groupG:x:5005:vivian,gord,eric,michael,vince,steven,biadmin
groupH:x:5006:vivian,gord,eric,michael,vince,steven,biadmin
groupI:x:5007:vivian,gord,eric,michael,vince,steven,biadmin
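To verify the memberships without eyeballing /etc/group, a small check script can be used. check_group is a hypothetical helper that reads an /etc/group-style file and reports any users missing from a group:

```shell
# check_group GROUP_FILE GROUP USER...: verify each USER is listed as a
# member of GROUP in an /etc/group-style file; print any that are missing.
check_group() {
    file="$1"; group="$2"; shift 2
    # Fourth colon-separated field of the matching line is the member list.
    members=$(awk -F: -v g="$group" '$1 == g { print $4 }' "$file")
    missing=0
    for u in "$@"; do
        case ",$members," in
            *",$u,"*) ;;                      # user found in the member list
            *) echo "missing: $u"; missing=1 ;;
        esac
    done
    return $missing
}

# On a real cluster:
#   check_group /etc/group groupA vivian gord eric michael vince steven biadmin
```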
Understanding Platform Symphony Impersonation
Now is a good time to explain the concept of “impersonation” in Platform Symphony. Symphony has
two different workload execution modes:
Simple Workload Execution Mode
Advanced Workload Execution Mode
This is normally an installation option with Platform Symphony. The BigInsights Enterprise Edition installation
automatically installs Platform Symphony in Advanced Workload Execution Mode. This term is
frequently abbreviated as WEM in the Symphony documentation. In advanced workload execution
mode, core Symphony services run as root, and application administrators are able to control the
user ID that clustered applications run under.
Our approach to security hinges on this concept of impersonation in Symphony; we will see shortly
how we configure our applications to run under specific user credentials and control which users have
access to which applications and resources. The section called “Security within the MapReduce
framework” in the MapReduce user guide in the Platform Symphony documentation discusses this in
detail.
The customer that this paper is modeled after employs Kerberos authentication for their MapReduce
jobs to ensure security and to guarantee that a service supporting impersonation cannot be spoofed. Details
on configuring Kerberos are beyond the scope of this short document, but customers will be pleased to know that
this capability exists. Symphony is frequently deployed in secure environments where these capabilities
are important.
Configuring OS groups for the multitenant environment
For users making use of Platform Symphony (both named users and the user IDs that applications will
run under via impersonation), these IDs need to be part of the OS group that owns the BigInsights (and
by extension the Symphony) installation.
In our installation, BigInsights was installed as part of the “biadmin” group, so we adjust the group
membership so that each application ID that Symphony jobs will run under is a part of the BigInsights
group.
biadmin:x:0:root,biadmin,gord,eric,vivian,appA,appB,appC,appD,appE,appF,appG
bin:x:1:root,bin,daemon
daemon:x:2:root,bin,daemon
..
If you are unsure what group BigInsights was installed under, issue a command like
$ ls -al ${EGO_TOP}
You will see the user and group that own each file. This will vary depending on how you installed
BigInsights but the default group is biadmin.
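As a sketch, the owning group can also be read directly; this assumes GNU coreutils stat and a wrapper name (owning_group) of our own invention:

```shell
# owning_group DIR: print the group that owns DIR (GNU coreutils stat).
owning_group() {
    stat -c '%G' "$1"
}

# On a real cluster:
#   owning_group "${EGO_TOP}"   # prints the installation group, e.g. biadmin
```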
Submitting a test job as a user to verify the configuration
As we mentioned before, by default BigInsights is configured to use an Application called MapReduce61
which maps to the consumer called /MapReduceConsumer/MapReduce61.
I should be able to log in to any of the accounts created and run a sample Hadoop job. The sleep
command included with the BigInsights examples is a convenient Hadoop application for testing the
MapReduce framework. This command submits variable numbers of map and reduce tasks that simply
sleep for configurable amounts of time. The example below submits two mappers followed by ten
reducers, each sleeping for 2 seconds (2,000 msec).
Besides being a useful validation that everything is working, this test illustrates the performance
advantage of using Platform Symphony as the MapReduce framework over open-source Hadoop.
Platform Symphony can run tests like this, with short-running map and reduce tasks, dramatically faster than
open source Hadoop, often more than ten times faster, even when a competing cluster is configured
with a short polling interval.
Note that as the test Hadoop job runs, everything is identical to open source Hadoop (it is actually the
BigInsights-supplied Hadoop classes that are running) except that our JobTracker logic in
Hadoop is running inside a Symphony Session Manager.
Note also that the running job is given a Platform Symphony job ID (job_ssm_0401 in this example).
Because Platform Symphony is managing the job execution, it is able to manage this job as well as other
jobs on the cluster including non-Hadoop jobs.
[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -m 2
-r 10 -mt 2000 -rt 2000
14/03/15 13:14:25 INFO internal.MRJobSubmitter: Connected to JobTracker(SSM)
14/03/15 13:14:26 INFO internal.MRJobSubmitter: Job <Sleep job> submitted,
job id <401>
14/03/15 13:14:26 INFO internal.MRJobSubmitter: Job will not verify
intermediate data integrity using checksum.
14/03/15 13:14:26 INFO mapred.JobClient: Running job: job_ssm_0401
14/03/15 13:14:27 INFO mapred.JobClient: map 0% reduce 0%
14/03/15 13:14:36 INFO mapred.JobClient: map 100% reduce 0%
14/03/15 13:14:46 INFO mapred.JobClient: map 100% reduce 20%
14/03/15 13:14:50 INFO mapred.JobClient: map 100% reduce 40%
14/03/15 13:14:54 INFO mapred.JobClient: map 100% reduce 60%
14/03/15 13:14:58 INFO mapred.JobClient: map 100% reduce 80%
14/03/15 13:14:59 INFO mapred.JobClient: map 100% reduce 100%
14/03/15 13:14:59 INFO mapred.JobClient: Job complete: job_ssm_0401
14/03/15 13:15:00 INFO mapred.JobClient: Counters: 18
14/03/15 13:15:00 INFO mapred.JobClient: Shuffle Errors
14/03/15 13:15:00 INFO mapred.JobClient: WRONG_PATH=0
14/03/15 13:15:00 INFO mapred.JobClient: CONNECTION=0
14/03/15 13:15:00 INFO mapred.JobClient: IO_ERROR=0
14/03/15 13:15:00 INFO mapred.JobClient: FileSystemCounters
14/03/15 13:15:00 INFO mapred.JobClient: FILE_BYTES_WRITTEN=5146
14/03/15 13:15:00 INFO mapred.JobClient: Map-Reduce Framework
14/03/15 13:15:00 INFO mapred.JobClient: Reduce input groups=400
14/03/15 13:15:00 INFO mapred.JobClient: Combine output records=0
14/03/15 13:15:00 INFO mapred.JobClient: Map output records=400
14/03/15 13:15:00 INFO mapred.JobClient: SHUFFLED_MAPS=20
14/03/15 13:15:00 INFO mapred.JobClient: Reduce shuffle bytes=2440
14/03/15 13:15:00 INFO mapred.JobClient: Combine input records=0
14/03/15 13:15:00 INFO mapred.JobClient: Spilled Records=800
14/03/15 13:15:00 INFO mapred.JobClient: SPLIT_RAW_BYTES=0
14/03/15 13:15:00 INFO mapred.JobClient: Map output bytes=1600
14/03/15 13:15:00 INFO mapred.JobClient: Reduce input records=400
14/03/15 13:15:00 INFO mapred.JobClient: GC_TIME_MILLIS=0
14/03/15 13:15:00 INFO mapred.JobClient: FAILED_SHUFFLE=0
14/03/15 13:15:00 INFO mapred.JobClient: MERGED_MAP_OUTPUTS=20
14/03/15 13:15:00 INFO mapred.JobClient: Reduce output records=0
[gord@biginsights ~]$
As this job runs, we can monitor it in the Symphony GUI by using the QuickLinks menu and
accessing “MapReduce Workload” to reach the MapReduce workload screen shown below. As the
MapReduce job runs, you will see a view like the one shown in figure 8.
Figure 8 - monitoring our job using the Platform Symphony web interface
Note that the submitted job is associated with the application MapReduce 6.1 (this is the application
that BigInsights submits jobs to by default).
You can also launch jobs via the standard BigInsights Web GUI and watch them run either from within
the BigInsights console or from within the Platform Symphony Web interface.
Figure 9: Launching a terasort job from BigInsights
The Terasort example in BigInsights uses Oozie to manage the sequence: first running the teragen
application to generate the dataset to be sorted, followed by Terasort itself.
As the job runs in the BigInsights context, we see it running in Platform Symphony associated with
the MapReduce6.1 application that BigInsights is bound to.
Any BigInsights application that exercises the MapReduce framework, including services like Hive, Pig,
Big SQL, BigSheets and others, will work with Symphony in this same way.
Figure 10 - Platform Symphony monitoring Terasort job run from BigInsights
Associating BigInsights with a Symphony Application
We’ve mentioned a few times that BigInsights is associated with the Symphony MapReduce6.1
application and customers frequently ask where this association is made.
[biadmin@biginsights ~]$ cd $HADOOP_CONF_DIR
[biadmin@biginsights hadoop-conf]$ cat pmr-site.xml
<?xml version="1.0"?>
<!-- This is a PMR configuration file. -->
<!-- It is intended for PMR internal parameters. Do not define -->
<!-- hadoop parameters here. -->
<configuration>
  <property>
    <name>mapreduce.application.name</name>
    <value>MapReduce6.1</value>
    <description>The mapreduce application name.</description>
  </property>
  <property>
    <name>mapreduce.map.skip.commit.task</name>
    <value>false</value>
  </property>
In the file pmr-site.xml in the BigInsights directory $HADOOP_CONF_DIR, you can modify the Symphony
application name that BigInsights will submit jobs to. It is important to have this flexibility,
because over time customers may end up with different versions of BigInsights co-existing with other
applications on the same cluster.
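If you script this change, something like the hypothetical helper below can rewrite the property value. It is a simple GNU sed sketch that assumes the <value> element sits on the line directly after the <name> element, as in the file above:

```shell
# set_mr_app PMR_SITE NEW_NAME: point the mapreduce.application.name
# property in pmr-site.xml at a different Symphony application.
set_mr_app() {
    pmr_site="$1"; new_name="$2"
    # GNU sed: on the line after the matching <name> line, replace the value.
    sed -i "/<name>mapreduce\.application\.name<\/name>/{n;s|<value>.*</value>|<value>$new_name</value>|;}" "$pmr_site"
}

# On a real cluster:
#   set_mr_app "$HADOOP_CONF_DIR/pmr-site.xml" Sqoop
```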
Enabling Symphony Repository Services
By default, when Platform Symphony is installed the repository service in Symphony is disabled. The
function of the repository service is to store the application services and distribute the code that
implements services dynamically to service instances on the cluster.
The MapReduce framework in Platform Symphony by default distributes the application service code
(specifically the application logic that implements the task tracker functionality and Jar files that
implement map and reduce logic) by copying them to HDFS with a high block replication factor so that
the files will be accessible on all nodes.
If you are planning to add and remove application profiles or consumers in Symphony, you will need to start
the Symphony repository service. Otherwise you will encounter errors, as some of these operations assume
that the repository service in Symphony is running.
This can be done through the web interface by following these steps:
From the QuickLinks menu select system services
For the service abbreviated as RS, select “Start” from the Actions pull-down menu
After you refresh the GUI view you should see the service has started on a master host
Figure 11 - Managing system services in Platform Symphony
The system services view is useful: it shows the list of system services that EGO is managing. Note that
EGO is managing not only native Platform Symphony services, but BigInsights services as well.
Adding a new Application / Tenant
Fundamental to the design of BigInsights 2.1 (and Open Source Hadoop) is the idea that there is only a
single instance of a Hadoop cluster.
Platform Symphony, however, supports multiple applications sharing the same cluster. It is also flexible
enough to support multiple instances of an application environment like BigInsights, although
configuring this is out of the scope of this paper.
Examples of tenants we may want to add might be:
A native Symphony application written to the Platform Symphony APIs
A batch-oriented workload (when Platform LSF is installed as an add-on to Platform Symphony)
A distinct Hadoop MapReduce environment
Third party applications like SAS, MatLab or Revolution R
A separate Hadoop MapReduce application instance that shares resources with other applications
as well as the same Hadoop binaries and file system instance.
In this example we show the last case, where multiple Hadoop applications share resources.
From the Platform Symphony Dashboard:
Use the QuickLinks menu and select Resources
Select Workload / MapReduce / Application profiles from the pull down menu
There will already be an application profile defined for MapReduce6.1. This is installed
automatically with Symphony and is the application profile used by BigInsights by default.
To add a new application profile to support a new tenant, click the “Add” button. The screen shown in
figure 12 will appear.
Figure 12 - Adding a new Application definition
We supply the following parameters:
Our application name (SQOOP) – We require this tenant to use a different version of Sqoop
than the version included with BigInsights, as mentioned earlier
We define the user ID that starts the job tracker and runs jobs – This is the impersonation
feature described earlier. This particular application will run under the OS ID appB.
Symphony has 10,000 priority levels. By default we will submit Sqoop jobs with a
low priority.
We configure user accounts that have access to this application. Note that we’ve provided all
users in GroupA access to the application along with named operating system and Platform
Symphony users.
Based on this information, Platform Symphony adds an application named Sqoop with a set of
reasonable defaults for a Hadoop MapReduce job. To make sure that our new application is working, as
a user entitled to use the application I can submit a test job as I did before.
Note that in this case I want the job handled by a different MapReduce
application definition, so I specify Sqoop as the application name on the command line.
Test the new application consumer by submitting a job as before.
[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -
Dmapreduce.application.name=Sqoop -m 2 -r 10 -mt 2000 -rt 2000
14/03/13 12:32:07 INFO internal.MRJobSubmitter: Connected to JobTracker(SSM)
14/03/13 12:32:08 INFO internal.MRJobSubmitter: Job <Sleep job> submitted,
job id <1>
14/03/13 12:32:08 INFO internal.MRJobSubmitter: Job will not verify
intermediate data integrity using checksum.
14/03/13 12:32:08 INFO mapred.JobClient: Running job: job_ssm_0001
14/03/13 12:32:09 INFO mapred.JobClient: map 0% reduce 0%
14/03/13 12:32:37 INFO mapred.JobClient: map 100% reduce 0%
14/03/13 12:32:52 INFO mapred.JobClient: map 100% reduce 20%
14/03/13 12:32:56 INFO mapred.JobClient: map 100% reduce 40%
14/03/13 12:33:00 INFO mapred.JobClient: map 100% reduce 60%
14/03/13 12:33:05 INFO mapred.JobClient: map 100% reduce 80%
14/03/13 12:33:07 INFO mapred.JobClient: map 100% reduce 100%
14/03/13 12:33:07 INFO mapred.JobClient: Job complete: job_ssm_0001
14/03/13 12:33:09 INFO mapred.JobClient: Counters: 18
..
What has changed is that in figure 13 we see that our job is now running under our separate application
definition called Sqoop.
This shows the basic process of adding a new application profile for a MapReduce job to Symphony to
support our additional tenants. The next step, of course, is to edit the configuration of the tenant as
necessary to suit the unique needs of the application. For example, my requirement may be as simple as
re-pointing some environment variables to different installation and configuration
directories for Sqoop for jobs submitted to this application.
[biadmin@biginsights hadoop-conf]$ set | grep SQOOP
SQOOP_CONF_DIR=/opt/ibm/biginsights/sqoop/conf
SQOOP_HOME=/opt/ibm/biginsights/sqoop
[biadmin@biginsights hadoop-conf]$
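As a sketch of such an override, the tenant's application profile could point the Sqoop variables at an alternate installation. These settings would be entered in the profile's environment-variable section rather than a login script, and the /opt/sqoop-1.4 path below is hypothetical:

```shell
# Hypothetical alternate Sqoop installation for the "Sqoop" tenant.
# In practice these overrides go into the application profile's
# environment settings so they apply only to jobs for this tenant.
export SQOOP_HOME=/opt/sqoop-1.4
export SQOOP_CONF_DIR=$SQOOP_HOME/conf
```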
Note that in the figure below my job ID has reset to “1”, since this is the first job associated with this
particular application tenant.
Figure 13 - Sleep job running under newly created application definition
Under “Workload” / “MapReduce” / “Application Profiles” we can define as many separate
applications as we’d like. The view below shows additional applications added using the same process detailed
for the Sqoop application.
Figure 14 - Available MapReduce Application Profiles
Only MapReduce applications appear because “Application Profiles” has been selected from the
MapReduce submenu. Figure 15 shows a similar view of “Applications”, accessible from the same
workload drop-down menu, except that instead of Application Profiles I’m looking at a dashboard of
the applications themselves with job-related status.
Figure 15- Dashboard of MapReduce applications
Configuring application properties
When a new application profile is created for an application, a default template is used that
represents reasonable settings for a MapReduce workload. The next step is to configure application
profiles to meet the unique requirements of each application workload.
In the Platform Symphony reference manual accessible from the knowledge center, application profiles
are covered in detail. Some of the more commonly configured settings are shown below.
To configure application properties for Sqoop, modify the application profile by selecting “Workload” /
“MapReduce” / “Application Profiles” from the top menu on the MapReduce applications screen. Select
the application profile definition for Sqoop created earlier and select Modify.
A new window will appear that allows detailed settings for the application to be changed. This web
interface modifies the application profile definitions (discussed shortly) that are stored in the
directory $EGO_TOP/data/soam/profiles on the Platform Symphony master host. Enabled profiles
reside in a subdirectory called “enabled” and disabled profiles in a directory called “disabled”.
The first tab in the interface, called Application Profile, allows application profile settings to be adjusted. The
second tab, labeled Users, provides an opportunity to modify the users and groups that will have access
to the application profile.
Figure 16 - Application Profile
Some important tips about Application Profiles:
Application Profile names must be unique
An Application Profile can be associated with only a single consumer
In the consumer tree, MapReduce applications are by default placed under the
MapReduceConsumer tree
You can find templates for various application profiles in the directory
$SOAM_HOME/6.1/Samples/Templates. The term SOAM in Symphony refers to the service-
oriented application middleware on which the MapReduce service is implemented
The application profile can be viewed in an Advanced Configuration, a Basic Configuration or in a
Dynamic Configuration Update mode. The Dynamic Configuration Update mode is not covered here, but
essentially it allows an administrator to register a profile fragment (part of an application profile)
modifying either the session types or services sections of the profile.
In the General settings area, you can configure settings such as where metadata associated with jobs and
job history is stored, the default service definition to be used (MapReduce for MapReduce applications)
and resource requirements.
Resource requirements are an important concept in Symphony. In this simple example, by using the
syntax “select(!mg)”, we are essentially saying: run this service on any host that is not tagged as a
member of the management group.
Resource requirement selections in Symphony are flexible and are covered in the Symphony
documentation. I can use SQL-like resource requirement strings to specify the types of resources I
would like to use in a granular way. If, for example, I know that a particular application runs best on a
large-memory PowerLinux machine, I can express a requirement (or preference) for this application with an
appropriate resource requirement string.
select(!mg) && select(PowerResourceGroup) && select(maxmem > 8000 && maxswp
>=16000)
The example above would indicate that this service requires resources that are part of a Power-based
resource group that are not management hosts where at least 8GB of physical memory and 16GB of
swap space are available.
Pre-starting application services is a useful feature in Symphony. Application services refer to the
Symphony session manager (SSM) as well as service instance managers and service instances associated
with the application. As a reminder, with MapReduce workloads the SSM can be viewed as an
Application Manager. This is the component that implements the JobTracker logic. Service instances
will load TaskTracker logic appropriate to the version of Hadoop and will start map or reduce tasks
appropriate to the application.
If you have many applications and are frequently sharing slots, pre-starting applications may not be
useful. By default Symphony will start SSMs automatically as clients connect and request services from
the middleware. As resources are assigned to applications, Symphony will dynamically provision the needed
service code and start services as appropriate.
Pre-starting applications is useful for applications that need to respond quickly. You can control the
number of slots (each slot can support a map or reduce task) that are pre-started by default.
Figure 17 - Optionally have an application pre-allocate services
A key thing to understand about the Platform Symphony session manager is that it is fully
multithreaded and can accommodate multiple sessions at the same time. A session equates to a
job submitted by a MapReduce user. Each job maps to a session, and each session may have large
numbers of tasks.
When multiple users are concurrently submitting jobs to the same application, the scheduling policy
controls how resources are shared. The R_Proportion policy specifies that resources are shared in
proportion to the priority of each job, which is often the most sensible choice.
As an example, if I had 5000 slots allocated to this application consumer definition and JobA was
submitted to the application with priority 4000 and JobB was submitted with priority 1000, Symphony
would run both workloads concurrently under the same application definition giving 80% of available
resources to JobA. Unlike standard Hadoop where resource assignments are static while the job is
executing, Symphony can respond quickly at run-time to re-balance resource allocations between jobs.
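The arithmetic behind this two-job example can be sketched as follows. This is a simplification for two concurrent jobs; Symphony's scheduler also honors other plan settings when rebalancing at run-time:

```shell
# share SLOTS PRIO_A PRIO_B: slots granted to job A under proportional
# sharing: SLOTS * PRIO_A / (PRIO_A + PRIO_B).
share() {
    echo $(( $1 * $2 / ($2 + $3) ))
}

share 5000 4000 1000   # JobA receives 4000 of 5000 slots (80%)
share 5000 1000 4000   # JobB receives 1000 of 5000 slots (20%)
```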
Note that since each SSM maps to an application (a MapReduce application in this case) this scheduling
policy controls how multiple jobs running in the same application context share resources. A separate
resource sharing plan discussed shortly controls how sharing is implemented more broadly between
applications and tenants.
The term application can be confusing to users not familiar with Symphony. Symphony is referring to an
application in the context of the Hadoop services themselves – the binary code that comprises
BigInsights services like the JobTracker and the TaskTracker. It is not referring to the actual application
code written by users that runs on the Hadoop framework. A single Symphony application can run
different user applications within the same Hadoop MapReduce context.
Figure 18 - controlling how multiple jobs associated with an application share resources
The Symphony application profile definition provides precise control over how MapReduce workloads
run, and this is useful to advanced users (in our experience, most sites running Hadoop are already quite
advanced and will appreciate this).
A nice feature of Symphony is that because the execution logic is provisioned dynamically, slots are
interchangeable between mappers and reducers. The settings in figure 19 allow this to be configured,
along with preferences for default ratios between mappers and reducers and precise configuration on a
per-resource-group basis.
Figure 19 - MapReduce Settings associated with an Application
Symphony allows multiple service definitions to exist for each application, and the service definition
section provides granular control over this capability. This is useful for applications written to Platform
Symphony’s native APIs and may be useful for Hadoop developers. For BigInsights it is not necessary to
change this setting, because Platform has already implemented a service called “RunMapReduce” that is
started by service instance managers to handle MapReduce workloads. The process of starting this
service is automatic for the MapReduce service. The service itself can be found in the directory
${EGO_TOP}/soam/mapreduce/6.1/linux2.6-glibc2.3-x86_64/etc. Note that the Start Command in
figure 20 allows for operating-system-specific implementations of a service definition for an application.
Figure 20 - configuring service definitions for the application
In the application profile definition, administrators can control environment variables associated with the
application. This is an important capability for ensuring multitenancy. By using environment variables I
can control how applications run in granular ways. If I choose, I could have an application profile that
associates itself with a separate Hadoop instance by defining application specific variables such as
$HADOOP_HOME, $HADOOP_CONF_DIR that reference different software versions and different
configuration files.
I can resolve technical issues that often occur when particular applications depend on
particular versions or distributions of the Java runtime environment by defining $JAVA_HOME to point
to the version of Java needed by a specific application.
Figure 21 - configuring the environment for the application
This is a good time to mention that while much of the discussion in Hadoop centers on Java because
Hadoop itself is written in Java, Symphony supports heterogeneous applications. It does not matter
whether application clients or services are written in C/C++, Java, scripting languages or even C# in
Microsoft .NET environments. The versatility to handle all types of workloads is what makes Symphony
powerful as a multitenant environment.
Another unique capability that Symphony brings to Hadoop is the notion of “recoverable sessions”. This
concept does not exist in open source Hadoop, where the job tracker is implemented in a simplistic
way: if the JobTracker fails at run-time, in standard Hadoop the job needs to be restarted.
The Symphony SOAM middleware, however, has long supported journaling of transactions so
that Hadoop MapReduce jobs become inherently recoverable. If the software service running the
JobTracker logic fails (and restarts on the same host or a different host), the Symphony job can recover
from where it left off. This is a major advantage for customers with long-running Hadoop jobs that
need to complete within specific batch windows.
This and other points of configurability are very important for specific workloads. As another example, if
I have execution logic where the reducer is multi-threaded, I can control the ratio of reducer services to
slots, thereby giving a reducer multiple slots if it can take advantage of them.
Figure 22 - configuring session behaviors in an SSM / Application Manager
Associating applications with consumers
The previous section provided some details on how application profiles are used in Symphony to customize
applications to support multi-tenancy. In the Symphony architecture, resources are not actually
allocated to applications directly. They are allocated to Consumer definitions, which in turn map to
applications.
This is an important distinction: while the application space is essentially “flat” (I have multiple
applications and flavors of applications of different types), the structure of consumers is usually
hierarchical. This is because most organizational structures are hierarchical.
- A bank may have several lines of business, each with various departments or application groups
- A service provider may have multiple tenant customers, and may provide different application services for each tenant
- A government agency may have different divisions, each running different applications with a particular need to segment data access
Symphony allows consumer trees to be set up in flexible ways to accommodate the needs of almost any
organization. A key concept to understand is that the leaf nodes of consumer trees are linked to the
application definitions we looked at in the previous section.
Accessing Consumer Definitions
To view consumer definitions, from the MapReduce screen in Symphony select “Resources / Resource
Planning / Consumers”. This is the interface used to manage the consumer tree.
Setting up the consumer tree is reasonably straightforward. The left-side panel is used to control where
you are on the tree, and the right side of the interface allows one to perform operations relative to that
segment of the tree.
Recall from our scenario earlier that we had multiple groups running Datameer
workloads for which we wanted to enforce sharing policies. Datameer workloads also have specific setup
dependencies that are different from BigInsights workloads, so the Datameer workloads require their
own application profile. In addition, we wanted to provide isolation between the work done by different
Datameer application user groups. To achieve this policy, we have defined sub-consumers under
Datameer, with a consumer appropriate for each group. We can also control which users have access to
each consumer. Note the hierarchical nature of consumers in Symphony.
Figure 23 - A populated consumer tree in Symphony
The leaf nodes of the consumer tree under Datameer each link to a specific application profile. The
association between an application and its position in the consumer tree is made in the application
profile.
Figure 24 - MapReduce applications
Manually editing Consumer Tree definitions
Advanced users may find it easier to manually edit the consumer tree.
Platform Symphony stores consumer tree definitions in the file ConsumerTrees.xml under
$EGO_TOP/kernel/conf.
If you hand-edit this file, you will need to restart EGO services to bring the web-based view into
synchronization with the actual contents of the XML file where these settings are persisted.
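To make the structure concrete, a hand-edited consumer tree might look roughly like the sketch below. This is an illustrative outline only: the element names, attributes, and consumer names shown are assumptions based on the hierarchy described above, not the exact schema used by Platform Symphony, so check an existing ConsumerTrees.xml on your cluster before editing.

```xml
<!-- Illustrative sketch of a hierarchical consumer tree.
     Element and attribute names are assumptions, not the exact
     Symphony schema; consumer names follow the scenario above. -->
<ConsumerTree>
  <Consumer name="Datameer">
    <!-- leaf nodes: each is linked to an application profile -->
    <Consumer name="DatameerGroupA"/>
    <Consumer name="DatameerGroupB"/>
  </Consumer>
  <Consumer name="Sqoop"/>
</ConsumerTree>
```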
After editing the ConsumerTrees.xml file as shown above, while logged in as the cluster administrator
(biadmin), stop and restart EGO services using the BigInsights scripts below to make sure that the
changes are reflected in the Platform Symphony console.
$ stop.sh HAManager
$ start.sh HAManager
Controlling access to applications and consumers
In the Sqoop consumer definition above, the built-in Symphony user “Admin” has administrative
responsibility for the consumer. Several other users are listed as being able to access the
application associated with the consumer. The user eric is not a member of the list of permitted users. If
an unauthorized user attempts to submit a job against the application definition (Sqoop) associated with
this Sqoop consumer, they see an error as shown below, as expected.
[eric@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -Dmapreduce.application.name=Sqoop -m 2 -r 10 -mt 2000 -rt 2000
java.io.IOException: interrupted
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:1068)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:1032)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1575)
at org.apache.hadoop.examples.SleepJob.run(SleepJob.java:174)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
..
Caused by: java.lang.InterruptedException: Domain <VEM>: Security error: User: eric is not authorized to perform this operation.
If an authorized user (gord) submits the same workload, note that it runs successfully.
[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -Dmapreduce.application.name=Sqoop -m 2 -r 10 -mt 2000 -rt 2000
14/03/14 08:56:45 INFO internal.MRJobSubmitter: Connected to JobTracker(SSM)
14/03/14 08:56:45 INFO internal.MRJobSubmitter: Job <Sleep job> submitted,
job id <102>
14/03/14 08:56:45 INFO internal.MRJobSubmitter: Job will not verify
intermediate data integrity using checksum.
14/03/14 08:56:45 INFO mapred.JobClient: Running job: job_ssm_0102
14/03/14 08:56:46 INFO mapred.JobClient: map 0% reduce 0%
14/03/14 08:57:02 INFO mapred.JobClient: map 100% reduce 0%
14/03/14 08:57:11 INFO mapred.JobClient: map 100% reduce 20%
14/03/14 08:57:15 INFO mapred.JobClient: map 100% reduce 40%
14/03/14 08:57:19 INFO mapred.JobClient: map 100% reduce 60%
14/03/14 08:57:23 INFO mapred.JobClient: map 100% reduce 80%
14/03/14 08:57:24 INFO mapred.JobClient: map 100% reduce 100%
14/03/14 08:57:24 INFO mapred.JobClient: Job complete: job_ssm_0102
[gord@biginsights ~]$
Determining the execution user for a consumer
Earlier we explained that by using impersonation, Symphony can control the user IDs that different
application services run under. In the case of the Sqoop application defined earlier, we set the
application user to appB, and this is reflected in the ConsumerTrees.xml definition.
We can verify that impersonation is taking place and that processes are running under the expected
user ID by monitoring the process tree while executing MapReduce jobs like the one above.
To monitor the process tree, use a command like:
$ watch 'ps -ef | grep appB'
As you run the job, you will see the SSM start up, unless it is pre-started or is lingering on a
management host waiting for another job. In this example all services are running on the same node as
the master host, so we see the service instance managers and service instances starting locally to
manage the job. On a larger cluster you would need to watch the compute hosts to validate that the services
are starting as expected and running under the correct user ID.
Figure 25 - verify that services are running under the expected user IDs
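The ad hoc watch command above can also be wrapped in a small helper. This is our own sketch, not a Symphony utility; substitute the execution user configured for your application (appB in this scenario) when running it on a cluster host.

```shell
# Sketch of a helper for spot-checking which processes run under a
# given execution user. Not part of Symphony; appB is the user from
# our scenario, and we demo with the current user below.
check_user_procs() {
  user="$1"
  # ps -u selects processes owned by that user; pid= and comm=
  # suppress the column headers for clean output
  ps -u "$user" -o pid=,comm=
}

# On a Symphony host you would run: check_user_procs appB
# Here we demonstrate with the current user, which always owns
# at least the shell running this script:
check_user_procs "$(id -un)"
```

Wrapping this in `watch` (for example `watch "check_user_procs appB"` after sourcing the function) gives a live view while jobs execute.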
We can use the pstree command on the management host to understand the process tree.
Figure 26 - pstree can be used to show the process hierarchy
On compute hosts, services are managed by the pem process.
In response to a workload requirement, pem launches a sim process (service instance manager), which
in turn runs a service instance, in this case the RunMapReduceService, since this is a Symphony
MapReduce workload.
Figure 27 - process hierarchy on the execution host
When configuring several consumers and applications as we have shown here, it can also be faster to
hand-edit the XML-based application profile files.
To access the XML application profiles, check the directory $EGO_TOP/data/soam/profiles. The associated
XML profiles exist in subdirectories with names corresponding to their state. For example, Sqoop.xml
can be found in the “enabled” subdirectory, since the application is enabled and accepting workload.
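A quick way to survey the profiles and their states is to list each XML file together with its parent directory. The snippet below is a sketch of that idea; because $EGO_TOP is only defined on a Symphony host, it falls back to a mock layout (the /tmp path, file names, and states are illustrative only) so the listing logic can be exercised anywhere.

```shell
# Sketch: list application profiles with their state (the parent
# directory name). The fallback path and sample files are mock data
# for illustration; on a real host $EGO_TOP points at the install.
PROFILE_ROOT="${EGO_TOP:-/tmp/ego_demo}/data/soam/profiles"
mkdir -p "$PROFILE_ROOT/enabled" "$PROFILE_ROOT/disabled"
touch "$PROFILE_ROOT/enabled/Sqoop.xml" "$PROFILE_ROOT/disabled/OldApp.xml"

list_profiles() {
  # emit "<state>\t<profile>" for every profile XML found
  find "$PROFILE_ROOT" -name '*.xml' | while read -r f; do
    printf '%s\t%s\n' "$(basename "$(dirname "$f")")" "$(basename "$f")"
  done
}
list_profiles
```

On a live cluster, dropping the mock setup lines and running `list_profiles` shows at a glance which applications are enabled and accepting workload.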
Configuring Sharing Policies
Summary
In this document we’ve described a customer use case involving a multitenant implementation of
InfoSphere BigInsights that permits the following:
- Concurrent execution of different Hadoop applications (including different versions of code) on the same physical cluster
- Dynamic sharing of resources between tenants in a fashion that maximizes performance and resource utilization while respecting individual SLAs
- Support for applications other than Hadoop MapReduce, to maximize flexibility and allow capital investments to be re-purposed for multiple requirements
- Security isolation between tenants, removing a major barrier to sharing in many commercial organizations
In our view, these advances are significant. While Hadoop is advancing, competing open-source and
commercial distributions are many years away from offering true multitenancy and practical solutions
for supporting multiple workloads on a shared infrastructure. The economic arguments in favor of
resource sharing are compelling. Analytic applications increasingly comprise multiple software
components that rely on distributed services. Rather than deploying separate “silos” of application
infrastructure, Platform Symphony provides the option to consolidate these different application
instances on a common foundation, thus increasing infrastructure utilization, boosting service levels and
helping significantly reduce costs.