The top 10 reasons major vehicle manufacturers should use IBM BigInsights as their Hadoop Platform
This paper draws on our extensive experience of Hadoop applications in the automotive industry, 
where advice from early adopters shows the relevance of the IBM BigInsights platform approach 
and its technical capabilities. 
Contents 

Introduction 
Drivers for change in the vehicle industry 
The market opportunity for Connected Cars 
The business need for Hadoop capabilities 
Big data in the vehicle industry enables radical new use cases 
The top 10 reasons to use IBM BigInsights as your Big Data Hadoop system 
1. IBM as your partner for driving Data warehouse modernisation with Hadoop 
2. Connected vehicle IBM solution elements 
3. BigInsights product depth completes Hadoop 
   3.1 Overview of BigInsights 
   3.2 Users supported 
   3.3 Application lifecycle management 
   3.4 What IBM BigInsights adds to open source Hadoop 
   3.5 BigInsights for Hadoop: Technical capabilities which lend themselves to the requirements of the vehicle industry 
   3.6 GPFS 
   3.7 BigSQL 
   3.8 BigSheets spreadsheet interface 
   3.9 Text Analytics 
   3.10 Adaptive MapReduce 
4. Big Data platform technology supplier commitment and stability 
   4.1 Scale and commitment 
   4.2 IBM Hadoop deployments 
   4.3 Commitment to open source 
   4.4 The key differences between IBM, Cloudera, Pivotal and Hortonworks 
5. BigInsights Performance and stability 
   5.1 Overview 
   5.2 Adaptive MapReduce for job acceleration 
   5.3 Comprehensive SQL Performance and support 
   5.4 Federated data access 
   5.5 Performance architecture 
   5.6 Real time analytics 
   5.7 Resilience delivered from Platform Symphony 
   5.8 Platform scheduling 
   5.9 High Availability 
6. Real time streaming analytics 
7. Data security 
   7.1 Granular data security 
   7.2 User security 
   7.3 Audit and integration with Guardium, the leading security platform 
   7.4 Masking confidential information and test data with IBM Optim 
8. Big Data Platform integration 
   8.1 Connectors 
   8.2 Data warehouse integration 
   8.3 IBM Symphony 
   8.4 Big Match & MDM 
   8.5 Information Server: DataStage & QualityStage 
   8.6 InfoSphere Streams 
   8.7 Cognos Business Intelligence 
   8.8 Search Indexing and integration with enterprise wide search 
   8.9 Spreadsheet visualisation 
   8.10 BigR 
   8.11 SPSS 
   8.12 Text Analytics with AQL 
   8.13 Data Lineage and governance 
   8.14 SAS integration 
   8.15 BigSQL 
9. Speed of deployment 
10. IBM Big Data delivery is proven with case studies from around the world 
   10.1 General Motors 
   10.2 Big Data Pioneers: Volvo Case Study 
   10.3 PSA 
   10.4 Science & Technology Facilities Council 
   10.5 Octotelematics 
   10.6 BMW 
11. IBM Company information 
Introduction 
We are at a tipping point of mass awareness and mass adoption of internet-connected devices, 
taking us beyond the smartphone into pervasive technology in our homes and workplaces. 
Domestic appliances, industrial equipment and, significantly, the connected vehicle will become part 
of the new normal. 
This paper explains why IBM’s Big Data Platform, and particularly BigInsights for Hadoop, provides a 
scalable, strategic and economic solution to the data challenge facing the vehicle industry. 
Drivers for change in the vehicle industry 
All this means data...big data. When we consider the drivers for change in the vehicle industry, they 
all create and depend upon data that will be high in volume, variety, and velocity.
The market opportunity for Connected Cars 
The connected car market is summarised in a table from the Connected Car Forecast report, 
“Global Connected Car Market to Grow Threefold within Five Years” (Feb 2013). 
Note that the largest segment is in service areas often supplied by 3rd-party aftermarket businesses; this 
is a major opportunity for OEMs. 
In summary, this commercial opportunity has such scale and momentum that the data challenge 
facing the IT departments of major OEMs will be significant. An industry with great skills in 
technology such as ERP and Computer Aided Design has of course had to create data warehouses for 
customer and business operations, but never at the scale and scope required to support connected 
vehicles. The amount of data generated by vehicles is significant, both in terms of the amount kept 
on the vehicle and that generated from vehicles but held in data centres. 
The model that determines types and quantities of vehicle data is yet to be standardised, so 
efficiencies are yet to be created. This means multi-petabyte data centre stores will proliferate 
amongst the OEMs until on-vehicle processing reduces the dependence on data centre 
stored data. One can foresee a point where the autonomous vehicle will not be dependent upon stored 
data, but this is over the horizon; between us and that point lies a mountain of data.
This data explosion within the automotive manufacturing industry will create skills shortages. As 
other industries are also seeing growth in connected devices, the skills required may have to be 
grown from within the company by cross-training IT staff from the (usually small) Data Warehouse 
team, as well as from ERP and CAD system teams. 
The business need for Hadoop capabilities 
The fundamental needs are to: 
• Lower the costs of managing data as a business asset. IBM’s Hadoop can be delivered for €800 
per Terabyte, and IBM’s high performance warehouse appliances (Netezza) for 
€10,000 per Terabyte. This is a fraction of traditional data warehouse systems, which cost 
€24,000 per Terabyte (e.g. Oracle/Teradata) 
• Provide a landing zone for all data: existing but disparate structured data silos, 
unstructured data such as video files, and external social media data. Currently most IT 
departments could not easily support a business executive who asks IT to take vehicle 
movement data from a telco, combine it with video clips from vehicle cameras, and 
integrate social media information about vehicle problems, simply because IT investment has 
been focussed on ERP and CAD. Hadoop, as part of a big data platform, enables this business 
data as a service 
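The per-terabyte figures above make the economics concrete when scaled to connected-vehicle volumes. A quick worked comparison (indicative list figures from this paper, not a price quotation; the 2 PB store size is a hypothetical example):

```python
# Storage-tier cost comparison using the per-terabyte figures quoted above:
# Hadoop at EUR 800/TB, a Netezza appliance at EUR 10,000/TB, and a
# traditional data warehouse at EUR 24,000/TB (indicative figures only).
COST_PER_TB = {"hadoop": 800, "netezza_appliance": 10_000, "traditional_dw": 24_000}

def store_cost(terabytes: int, tier: str) -> int:
    """Total storage cost in EUR for a store of the given size and tier."""
    return terabytes * COST_PER_TB[tier]

# A hypothetical 2 PB (2,048 TB) connected-vehicle landing zone:
for tier in COST_PER_TB:
    print(f"{tier}: EUR {store_cost(2048, tier):,}")
# hadoop: EUR 1,638,400 versus traditional_dw: EUR 49,152,000, a 30x gap
```

At petabyte scale the tier choice dominates the budget, which is why the landing-zone pattern places bulk data on Hadoop and reserves appliance capacity for high-performance analytics.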
Hadoop directly integrates with the existing IT landscape and standards while complementing 
existing technologies like MPP (Massively Parallel Processing) warehousing and predictive analytics. 
Uniquely, BigInsights extends open source Hadoop and HDFS capabilities with additional integrated 
enterprise capabilities. This provides clients with additional benefits, through a solution which: 
• Reduces cost and improves flexibility: by simultaneously allowing a single cluster to be used for 
multiple groups of users, running a mix of Hadoop MapReduce jobs and other workloads, under 
the control of the advanced and proven scheduler IBM Platform Symphony 
• Supports complex workloads: by providing an architecture which can perform a mix of 
MapReduce and other computing tasks within a single job 
• Ensures data resilience and reliability: by enabling the use of an enterprise-class alternative to 
open source HDFS through IBM’s General Parallel File System (GPFS) – a proven, robust, high-performance 
file system that allows the use of both centralised disks and locally-attached 
storage simultaneously, while also providing compatibility with Hadoop and providing additional 
features and reliability. 
This allows BigInsights to provide greater resiliency, performance and manageability than 
alternatives, while also saving costs and without compromising fully open accessibility to 
information within the platform. 
BigInsights is also pre-integrated with our Platform Computing grid management technology, which 
we believe is a critical extension to Hadoop in order to increase data processing performance, 
reduce server idleness, and improve deployment efficiency and cluster flexibility. IBM is committed 
to the accessibility and security of data within our solution to ensure its full exploitation for 
retention, exploration, integration, advanced analytics and visualisation. 
Also, we recommend the use of our bundled IBM InfoSphere Streams component to achieve low-latency 
analytical processing of data-in-motion. Streams is also pre-integrated with BigInsights and is 
one of the only commercially supported capabilities of its kind.
Big data in the vehicle industry enables radical new use cases 
The business benefits derived from Connected Vehicles are diverse, so universal and reusable data 
systems are needed or else data silos will emerge, severely limiting the efficiency and cost/benefit 
balance of this market leap. 
In summary this collection of use cases results in a justification for integrated data capabilities and 
that’s where Hadoop comes in. 
(Key: each benefit represents either revenue generation or cost reduction.) 

Use case: 360° view of vehicle/driver/fleet 
Description: Indexing data assets to create a range of search-based applications on federated data. 
Benefits: Shifts the business model from product to customer centricity. The driver can reduce fuel used; the fleet manager can control costs; the OEM has an ongoing relationship with owners, so can increase re-purchase loyalty. 

Use case: Social Media Lead Generation 
Description: By identifying Twitter discussions which contain a propensity to buy a specific product, we can create a business alert to exploit the information and create a lead. 
Benefits: Sales leads; targeted campaigns against specific competitors; brand and product sentiment analysis to target marketing communications more accurately. 

Use case: Warranty Claim Predictions 
Description: Accurately identifying which vehicles are likely to have warranty claims well in advance supports predictive maintenance. 
Benefits: Reduction in recalls; increased quality of customer service; warranty provision at lower cost than the current business model. 

Use case: 3rd Party Data Sales 
Description: Selling granular weather and traffic congestion data. 
Benefits: The value to meteorological organisations of granular weather data is higher than they are able to harvest from fixed-site weather monitors, and it allows the OEM to enter the traffic data business to generate service revenue. 

Use case: User Based Insurance 
Description: Enabling offers to customers to reward safe driving through detailed usage analytics. 
Benefits: Sales of pay-as-you-drive insurance and pay-as-you-drive pricing bundles. 

Use case: Personalisation and location services 
Description: Provision of end user refinements as an extension to the infotainment personalisation experience, e.g. seat, temperature and music settings follow the owner; locating the vehicle and identifying a commute partner, for example. 
Benefits: The customer saves fuel costs, which drives sales into new segments; retains customers whose cars are closely integrated into their lifestyle through their smartphone. 

Use case: Product Usage for R&D 
Description: Deliver granular data to product designers such that they can create cars better suited to the actual usage pattern. 
Benefits: Lower warranty costs, lower running costs, increased customer loyalty revenue. 

Use case: User Based Vehicle Upgrade Sales 
Description: Generating leads based on identifying the ideal replacement vehicle for each owner, by analysing driver profiles. 
Benefits: Supporting finance buy-back and upgrade campaigns to increase revenue and customer satisfaction. 

Use case: Online Software Upgrades to Vehicle 
Description: Identifying issues where an online upgrade to the in-vehicle software improves performance, addresses safety issues and reduces the need for physical recalls. 
Benefits: Reduced cost of warranty claims and recalls; increased customer satisfaction. 

Use case: Sales Forecasting 
Description: Harvesting leading indicators for sales from search engines, dealer systems, websites and driver data to support the shift from physical dealer visits to an online relationship with prospects. 
Benefits: Accurate sales forecasting reduces stock levels and increases manufacturing efficiency.
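The Social Media Lead Generation use case hinges on detecting purchase intent in free text. As a purely illustrative sketch (this is not IBM’s AQL text analytics; the phrase and product lists are invented for the example), a naive keyword filter might look like this:

```python
# Toy propensity-to-buy filter over social posts: flag a post as a lead when a
# purchase-intent phrase co-occurs with a product mention. Substring matching
# is deliberately naive; production systems use trained text extractors.
INTENT_PHRASES = ("looking to buy", "thinking of buying", "shopping for")
PRODUCT_TERMS = ("suv", "hatchback", "estate", "electric car")

def is_lead(post: str) -> bool:
    text = post.lower()
    return any(p in text for p in INTENT_PHRASES) and any(
        t in text for t in PRODUCT_TERMS
    )

posts = [
    "Thinking of buying an electric car this spring, any recommendations?",
    "Traffic was terrible on the commute this morning",
]
leads = [p for p in posts if is_lead(p)]
print(leads)  # only the first post qualifies as a lead
```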
1. IBM as your partner for driving Data warehouse modernisation with Hadoop 
The cost of high performance data warehousing is significantly reduced using open source software 
Despite technology maturity, the use of relational databases for data 
warehousing has not addressed the need to load all types of data 
(particularly unstructured data such as text), nor have its costs fallen in line 
with other advances. 
Moore’s Law predicts that processing capability doubles every 18 months 
or so, and the costs of processing have fallen at comparable rates. 
Hadoop is an open source disruptive technology which reduces the cost 
per Terabyte from a traditional €100k to a fraction of this - €8k in live 
examples. 
IBM’s Hadoop distribution, BigInsights, delivers better technology built on 
the open source distribution; however, enterprise requirements for 
resiliency, performance and manageability require something more than this 
kernel provides. BigInsights is designed to address this requirement. 
For example, Hadoop performance is increased by BigInsights’ capability 
to support Hadoop clusters for multiple groups of users by running a 
mixture of Hadoop MapReduce and other scheduled workloads on one 
cluster under the control of IBM Platform Symphony. 
The business value of a company data asset is increasing 
As line-of-business executives become increasingly technology-savvy and 
data-dependent, their expectation grows that IT should provide robust, low-cost, 
flexible yet secure platforms for their information assets. 
Traditional data warehouses have not delivered so Hadoop is succeeding 
due to a willing new demand. Associated with this is the need for a 
trusted long term stable partner to deliver such innovations, and that’s 
where IBM’s history provides welcome reassurance. 
Data volumes will grow significantly (but with limited predictability) so systems need to be able to land new data at higher rates of velocity, volume and variety 
The shift in the vehicle industry from ERP and CAD data usage to 
customer centricity, where complex connected vehicles are sold and 
serviced, will drive a significant data evolution, and one whose capability 
requirements are addressed perfectly by IBM enterprise data platform 
systems. 
IBM clients have proved that these technologies scale well and cater for 
any kind of data, with resilience and high performance at higher levels 
than non-IBM alternatives. 
IBM’s Streams product is a vital component to achieve low latency 
analytical processing of data-in-motion. As it is integrated with 
BigInsights for Hadoop the combination addresses the data flows from 
moving vehicles as well as the analytics required for historic data. 
Hadoop delivers the enterprise data landing zone 
IBM BigInsights for Hadoop gives enterprise data architects the ability 
to land, store, query and analyse data of any type. 
It integrates with existing IT landscapes, so it is the ideal place to correlate 
data with statistical tools, identify issues that span departments, and give 
the business the tooling required to catch the “fast ball” of vehicle-generated 
data and extract value from it. These uses are classified as 
operational efficiency, advanced analytics, and exploration & discovery. 
BigInsights delivers Operational Efficiency 
To more effectively handle the performance and economic impact of 
growing data volumes, architectures with different operational 
characteristics can be used together. For example, large amounts of cold data 
in the data warehouse can be archived to an analytics environment rather 
than to a passive store. 
InfoSphere BigInsights helps improve operational efficiency by 
modernizing — not replacing — the data warehouse environment. It can 
be used as a query-able archive, enabling organizations to store and 
analyze large volumes of poly-structured data without straining the data 
warehouse. As a pre-processing hub — also referred to as a “landing 
zone” for data — InfoSphere BigInsights helps organizations explore their 
data, determine the high-value assets and extract that data cost-effectively. 
It also supports ad hoc analysis of large amounts of data for 
exploration, discovery and analysis. 
BigInsights delivers Advanced Analytics 
In addition to increasing operational efficiency, some organizations are 
looking to perform new, advanced analytics but lack the proper tools. 
With InfoSphere BigInsights, analytics is not a separate step performed 
after data is stored; instead, InfoSphere BigInsights, in combination with 
InfoSphere Streams, enables real-time analytics that can leverage historic 
models derived from data being analyzed at rest. InfoSphere BigInsights 
includes advanced text-analytic capabilities and pre-packaged 
accelerators. Organizations can use these pre-built analytic capabilities to 
understand the context of text in unstructured documents, perform 
sentiment analysis on social data or derive insight from a wide variety of 
data sources. 
BigInsights delivers Exploration & Discovery 
The explosive growth of big data may overwhelm organizations, making it 
difficult to uncover nuggets of high-value information. InfoSphere 
BigInsights helps build an environment well suited to exploring and 
discovering data relationships and correlations that can lead to new 
insights and improved business results. Data scientists can analyze raw 
data from big data sources alongside data from the enterprise warehouse 
and several other sources in a sandbox-like environment. Subsequently, 
they can combine any newly discovered high-value information with other 
data to help improve operational and strategic insights and decision 
making. 
The bottom line: with InfoSphere BigInsights, enterprises can finally get 
their arms around massive amounts of untapped data and mine it for 
valuable insights in an efficient, optimized and scalable way.
2. Connected vehicle IBM solution elements 
IBM delivers an integrated solution for Connected Vehicle IT architectures, already proven to scale 
for Tier 1 OEMs, as per the following diagram: 
This is summarised as a series of connected capabilities, and the following describes IBM’s 
components to deliver this as a complete solution: 
Efficient data protocols 
IBM’s messaging appliance, MessageSight, uses the open source MQTT 
protocol and is 4-6 times more efficient than HTTP. 
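The efficiency claim is driven largely by framing overhead: an MQTT PUBLISH adds a few bytes per message, whereas an HTTP request repeats text headers every time. A rough back-of-envelope sketch (simplified packet layouts, not full protocol implementations; the topic and payload are invented examples):

```python
# Compare per-message framing overhead of MQTT vs HTTP for a small telemetry
# reading. Layouts are simplified sketches of MQTT 3.1.1 and HTTP/1.1.

def mqtt_publish_size(topic: str, payload: bytes) -> int:
    # QoS 0 PUBLISH: 1-byte fixed header, variable-length "remaining length"
    # field (7 bits per byte), 2-byte topic length prefix, topic, payload.
    remaining = 2 + len(topic.encode()) + len(payload)
    length_bytes = 1
    while remaining >= 128 ** length_bytes:
        length_bytes += 1
    return 1 + length_bytes + remaining

def http_post_size(path: str, host: str, payload: bytes) -> int:
    # A minimal HTTP/1.1 POST carrying only the headers a server requires.
    headers = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Type: application/octet-stream\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode()
    return len(headers) + len(payload)

payload = b'{"speed":72,"rpm":2100}'  # a 23-byte telemetry sample
mqtt = mqtt_publish_size("car/12345/telemetry", payload)
http = http_post_size("/telemetry/12345", "example.com", payload)
print(mqtt, http)  # 46 vs 136 bytes on the wire for the same payload
```

The exact ratio depends on header choices and QoS level; the quoted 4-6x figure will also reflect connection reuse and keep-alive behaviour in real deployments.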
Capture all data onto the centralised landing zone 
The combination of Streams’ capability to handle large high-speed data 
feeds such as MQTT and BigInsights’ strength in storing and processing 
large datasets for long-term storage creates the landing zone platform 
missing from current data warehouses. 
Real time analytics 
IBM Streams contains the filters, complex queries, statistical treatments 
and data management instructions to support real time analytics. It can 
handle streaming data such as video files and large message volumes, and 
its integration with BigInsights Hadoop storage is particularly useful for 
connected vehicle systems. 
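Windowed operators are the heart of this kind of data-in-motion analysis. A minimal sketch of one such filter in plain Python (illustrative only; Streams applications are written in its own language, and the threshold and window size here are invented):

```python
# Sliding-window anomaly filter over a stream of vehicle speed readings:
# flag a reading that drops sharply below the average of the preceding window.
from collections import deque

def harsh_braking_events(speeds, window=3, drop_kmh=30):
    """Yield indices where speed falls more than drop_kmh below the
    average of the preceding `window` readings."""
    recent = deque(maxlen=window)
    for i, speed in enumerate(speeds):
        if len(recent) == window and sum(recent) / window - speed > drop_kmh:
            yield i
        recent.append(speed)

stream = [90, 92, 91, 55, 56, 57, 58]  # sudden drop at index 3
print(list(harsh_braking_events(stream)))  # [3]
```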
High performance analytics platform 
Streams pushes data as it arrives to a high performance data 
warehousing platform comprising the Netezza (PureSystems) FPGA-based 
hardware accelerator appliance for high speed analytics, as well as to the 
long-term Hadoop data store in BigInsights. 
FPGAs are used in Blu-ray players to remove the CPU bottleneck that 
high resolution video would otherwise create; the same patented 
technology gives IBM its ability to process large amounts of data 
and deliver high speed analytics. 
Application platform 
BigInsights’ use of Eclipse creates a layered system where data assets 
can be explored, and applications designed and delivered to address 
evolving business needs. Management of this process is enabled 
through IBM Rational, which is commonly used in the vehicle industry to 
manage CAD development and ERP systems, as well as 
bespoke applications. 
Data governance and integration 
Data housekeeping, such as securing and maintaining the availability of 
these new and sensitive data assets, is delivered through the enterprise data 
security applications Guardium and Optim, and a range of product 
connectors and accelerators which take the burden of complex system 
integration away from IBM customers by delivering inter-product 
integration capabilities. 
These capability areas break down into 21 subset technology components, each as defined and 
recognised by the independent analysts Gartner and Forrester, who publish vendor assessments of them 
as comparison tables. 
Mapping IBM’s capabilities into these comparisons shows IBM has leadership in all areas: in 13 
of the 21 it leads other vendors, and in the remaining 8 it holds a leader-quadrant 
position behind a point-solution vendor. 
Enterprise architects assembling these systems can consider IBM the leading vendor of technology 
and services, potentially as a single-source supplier of an end-to-end system with Hadoop as a key 
component. 
3. BigInsights product depth completes Hadoop 
3.1 Overview of BigInsights 
Hadoop is a key part of a Big Data platform of integrated technology to 
provide the capability to store and analyse vast amounts of data – any kind of 
data. 
IBM’s primary focus with its InfoSphere BigInsights Hadoop distribution is to 
fully embrace open source, while integrating it into the wider enterprise IT 
landscape. IBM is in a unique position to accomplish this given our breadth of 
enterprise capability. 
In our BigInsights distribution, we have specifically focused on: 
• Out of the box integration and optimisation with existing IT capabilities 
such as data integration, data privacy, data security, and business 
intelligence components – all aligned to existing standards within the 
IBM enterprise data management system (DataStage, Optim, Guardium, 
and Cognos) 
• Exploiting the re-use of existing skills in accessing and using the wealth 
of data BigInsights holds, by providing multiple intuitive interfaces over 
the same raw data. For example, we provide standard ODBC and JDBC 
drivers, ANSI standard SQL, a spreadsheet-style user interface that runs 
in a web browser, and pluggable modules to enable self-service 
advanced analytics 
• Ensuring data resilience and efficient operational management through 
deep integration with robust High Performance Computing technologies. 
These leverage IBM’s decades of experience and expertise in the HPC 
field and applying this robustness to a relatively new and emerging 
technology (Hadoop) 
IBM InfoSphere BigInsights was first made generally available in May 2011 
with its 1.1 release, primarily containing a distribution of Apache Hadoop and 
other open source projects along with security, workload management, and 
administration enhancements. It quickly evolved through 1.2 and 1.3 
releases, also delivered in 2011, which added more extensive developer 
tooling, web-based user interfaces and a variety of enhancements to the 
original features. 
The product offering has continued to quickly evolve with a 1.4 and 2.0 
release both made available in 2012, and the current generally available 
release from September of 2014 is 3.0. 
While the specific release schedule depends on development requirements, 
BigInsights generally has a significant product release approximately every 6 
months.
3.2 Users supported 
InfoSphere BigInsights provides capabilities for a wide range of users. Tools 
are included that are specific to the goals of each user, such as installing 
components, developing applications, deploying applications, and running 
applications to analyse data. 
System Administrator 
The System Administrator installs, configures, and backs up InfoSphere 
BigInsights components on the system. This user also monitors the cluster to 
ensure that the InfoSphere BigInsights environment is healthy and running at 
optimum capacity. 
Application Developer 
The Application Developer develops, publishes, and tests applications for 
InfoSphere BigInsights. This user works with Data Scientists to understand 
the function of each application, and the business problem that the 
application helps to solve. 
Application Administrator 
The Application Administrator publishes applications in the system, deploys 
applications to the cluster, and assigns permissions to applications. This user 
works with the Application Developer to ensure that applications are 
functioning properly before being published and deployed. 
Data Scientist 
The Data Scientist collects data, completes analysis, and visualises insights to 
provide answers to specific business questions. This user determines which 
applications and data sources to aggregate information from, and how to 
present the results to the intended audience. 
3.3 Application lifecycle management 
Developers can develop and test InfoSphere BigInsights programs from 
within the Eclipse environment and publish applications that contain 
workflows, text analytics modules, BigSheets readers and functions, and Jaql 
modules to the cluster. After deploying applications to the cluster, the 
applications can be run from the InfoSphere BigInsights console. 
The following capabilities are supported by the Eclipse tooling, organised by sub-component of BigInsights:
• Create text analytics modules that contain text extractors by using an 
extraction task wizard and editor. Developers can then test the extractor 
by running it locally against sample data. Visualise the results of the text 
extraction and improve the quality of the extractor by analysing how 
results were obtained 
• Create Jaql scripts or modules by using a wizard, and edit scripts with an 
editor that provides content assistance and syntax highlighting. Run Jaql 
explain statements in scripts, and run the scripts locally or against the 
InfoSphere BigInsights server. Developers can open the Jaql shell from 
within Eclipse to run Jaql statements against the cluster 
• Create Pig scripts by using a wizard and edit the scripts with an editor 
that provides content assistance and syntax highlighting. Run Pig explain 
statements and illustrate statements for aliases in scripts, and then run 
the Pig scripts locally or against the InfoSphere BigInsights server. 
Developers can open the Pig shell from within Eclipse to run Pig 
statements against the cluster 
• Connect to the Hive server by using the Hive JDBC driver and run Hive 
SQL scripts and explore the results. Browse the navigation tree to explore 
the structure and content of the tables in the Hive server 
• Use the Java editor to write programs that use MapReduce, and then run 
these programs locally or against the InfoSphere BigInsights server. Open 
the InfoSphere BigInsights console to monitor jobs that are created by 
MapReduce 
• Create templates for BigSheets readers or functions and then use the 
Java editor to implement the classes 
• Write Java programs that use the HBase APIs and run them against the 
InfoSphere BigInsights server. Open the HBase shell from your Eclipse 
environment to run HBase statements against the cluster. 
• InfoSphere BigInsights also includes application linking and pre-built accelerators.
Application linking using BigInsights: 
• A graphical, web-based means through which to define Oozie workflows 
• Compose and invoke new applications by combining together existing 
applications, including integration with BigSheets. 
Pre-built applications provide enhanced data import capability:
• REST Data Source App that enables users to load data from any data 
source supporting REST APIs into BigInsights, including popular social 
media services 
• Sampling App that enables users to sample data for analysis 
• Subsetting App that enables users to subset data for data analysis 
• Accelerators to provide packaged application components to address 
social data analytics, machine data analytics and call detail records 
streaming analytics, as examples. 
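Several of the tooling capabilities above build on the MapReduce programming model. The model itself can be illustrated with a minimal, framework-free Python sketch (the function names are illustrative, not BigInsights APIs):

```python
from collections import defaultdict

def map_phase(records):
    # Map step: emit (key, value) pairs -- here one (word, 1) per word.
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Reduce step: group values by key and aggregate -- here, summing.
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

counts = reduce_phase(map_phase(["big data", "Big Insights"]))
# counts -> {"big": 2, "data": 1, "insights": 1}
```

In a real cluster the framework shards the map input across data nodes and shuffles intermediate pairs to reducers; the Eclipse tooling lets developers run such jobs locally before submitting them to the BigInsights server.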
3.1 What IBM BigInsights adds to open source Hadoop
The blue areas of the product architecture diagram illustrate the categories of functionality BigInsights adds to native Hadoop. 
This emphasises the strategic importance of vendor “clout” behind the selected distribution, as enterprise-scale non-functional requirements must scale out to the demands of the vehicle data industry.
3.2 BigInsights for Hadoop: Technical capabilities which lend themselves to the requirements of the vehicle industry
BigInsights is a complex software product with many capabilities. In engagements with vehicle industry clients, several unique capabilities have resulted in its selection over alternatives and are fundamental to the benefits delivered. 
These include the following key areas, each expanded upon in the remainder of this section:
1. IBM’s file system, GPFS, as an option to the open source Hadoop Distributed File System (HDFS)
2. The way IBM opens up this data store to SQL using IBM 
BigSQL, which means that you can keep the data in one place 
– very important given the data volumes connected vehicles 
are generating, and the limitations of the memory cache 
approach taken by other branded Hadoop distributions 
3. The spreadsheet visualisation front end tool BigSheets which 
allows business users to explore the data being landed into 
Hadoop 
4. The ability to analyse text using Annotation Query Language (and IBM applications which add a business analysis front end) such that sentiment on brand and products can be surfaced from call centre, warranty and service datasets.
5. Adaptive MapReduce - IBM’s pre-integration with Platform 
Symphony’s near real-time, low latency scheduler for more 
quickly carrying out any MapReduce data processing routines.
3.3 GPFS
Advanced features supported within IBM’s General Parallel File System 
As vehicles generate large data sets, and the vehicle industry moves to a more customer-centric data model, it will require data warehouses of enormous size. GPFS is the proven data system for this requirement.
GPFS is a mature, enterprise-class file system that adds a number of important resiliency and maintainability characteristics to Hadoop and can be used as an alternative to the Hadoop Distributed File System (HDFS).
• GPFS is scalable: 400 GBytes/second has been achieved for a single 
filesystem, and due to the parallel architecture of all GPFS filesystem 
functions performance can be increased as required by adding more 
hardware resources 
• GPFS is reliable: it is in use for some of the largest and fastest 
filesystems in the world, supporting batch workloads where each job 
can run for months, and GPFS has been proven in the field for over 15 
years 
• Supports failover clustering built-in to the filesystem 
• Active-active clustering across sites to provide a “24x7” filesystem 
• Remote asynchronous caching designed to work across very large 
distances 
• Information Lifecycle Management (ILM) which provides for data to be 
moved between different storage pools of disk or even tape 
• Rolling upgrades of GPFS software to minimise downtime 
• Online addition/removal of server nodes or storage resources 
• Built-in replication of data under filesystem control, specified down to 
the file level if required 
• Metadata scan operations at up to millions of files per minute (10 billion files in 43 minutes), which can be used to produce lists of files to backup, move, migrate to other physical storage tiers, or perform other operations on
• Extended attributes which are stored along with the file, and can be 
used as “tags”, for example project IDs or other information: these can 
also be searched on using the parallel scan engine 
• Fully supported by IBM: the people who write the code support the 
code, using the same IBM support and problem escalation processes 
available for mainframe software 
• GPFS is in use with clients as: 
− A “standard” POSIX filesystem 
− A supported data storage layer for databases such as DB2, Oracle, and Informix 
− A drop-in replacement for HDFS, by presenting an HDFS-compatible API.
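As an illustration of the tag-scan idea, partitions of file metadata can be filtered in parallel on an extended-attribute tag. This is a toy Python model, not the GPFS parallel scan engine, and the file names and tags are invented:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model: each file record carries extended attributes ("tags").
FILES = [
    {"path": "/data/a.parquet", "tags": {"project": "telematics"}},
    {"path": "/data/b.parquet", "tags": {"project": "warranty"}},
    {"path": "/data/c.parquet", "tags": {"project": "telematics"}},
]

def scan_partition(partition, key, value):
    # Filter one partition of metadata on a tag -- a parallel scan
    # engine applies the same predicate across partitions concurrently.
    return [f["path"] for f in partition if f["tags"].get(key) == value]

halves = [FILES[:2], FILES[2:]]  # partitions scanned in parallel
with ThreadPoolExecutor() as pool:
    results = list(pool.map(scan_partition, halves,
                            ["project"] * 2, ["telematics"] * 2))
matches = [p for part in results for p in part]
# matches -> ["/data/a.parquet", "/data/c.parquet"]
```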
3.4 BigSQL 
BigSQL is an ANSI standard SQL interface to data across the distributed filesystem, Hive and HBase. It re-uses Hive’s metadata, provides standard JDBC and ODBC connectivity, and applies query optimisations to address both small and large queries.
3.5 BigSheets spreadsheet interface
BigSheets is a browser-based, spreadsheet-style user interface allowing users to directly interact with data. As vehicle data is relatively new, its use and structure are not yet mature or well documented, so business exploration of the new information assets and of new use cases requires visualisation tooling such as BigSheets, which is included within BigInsights.
3.6 Text Analytics 
Text Analytics: BigInsights provides AQL, an analytical environment for extracting structured information from unstructured and semi-structured textual data, including batch and real-time runtimes and an integrated development environment. This is useful, for example, to extract meaning from text fields in CRM databases and social media sites in order to understand customer sentiment and generate leads from comments about vehicle comparisons.
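AQL itself is a declarative language; purely as an analogy, a rule-based sentiment extractor over free text might look like the following Python sketch (the dictionaries and rules are invented for illustration):

```python
import re

# Illustrative sentiment dictionaries -- a real AQL extractor would use
# curated dictionaries and declarative extraction rules instead.
POSITIVE = {"reliable", "comfortable", "excellent"}
NEGATIVE = {"noisy", "faulty", "recall"}

def extract_sentiment(text):
    # Tokenise, then compare against each dictionary.
    words = set(re.findall(r"[a-z]+", text.lower()))
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

label = extract_sentiment("The brakes were faulty and the cabin noisy.")
# label -> "negative"
```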
3.7 Adaptive MapReduce
Adaptive MapReduce is a near real-time, low-latency scheduler that can be 
transparently used as an alternative to Apache MapReduce. This is actually 
a “single-tenant” version of the IBM Platform Symphony scheduler that has 
been pre-integrated with BigInsights. 
Effective scheduling of data management tasks will allow the company to manage the evolution of its vehicle and customer data lifecycle, in both storage and processing, as it moves to customer centricity. Connected vehicle architectures will place stress on data warehouse systems if workloads are not managed effectively.
4. Big Data platform technology supplier commitment and stability
4.1 Scale and commitment
IBM has completed more than 30,000 analytics client engagements and 
projects $20 billion in business analytics and big data revenue by 2015. 
IBM has established the world's deepest portfolio of analytics solutions; 
deploys 9,000 business analytics consultants and 400 researchers, and 
has acquired more than 30 companies since 2005 to build targeted 
expertise in this area. 
IBM secures hundreds of patents a year in big data and analytics, and 
converts this deep intellectual capital into breakthrough capabilities, 
including Watson-like cognitive systems. The company has established a 
global network of nine analytics solutions centres and goes to market 
with more than 27,000 IBM business partners 
With 434,000 employees and $100BN revenues, IBM’s 100 year 
momentum continues. The company is renowned for its ability to 
reinvent itself around business and technology shifts summarised in the 
IBM strategy statement: 
• We are making markets by transforming industries and 
professions with data 
• We are remaking enterprise IT for the era of the cloud 
• We are enabling systems of engagement for enterprise and 
leading by example. 
4.2 IBM Hadoop deployments
Hadoop adoption is a long-term strategic platform decision, so the client company should regard its choice of supplier as a long-term engagement.
IBM has over 100 production installations and thousands of users of the free download evaluation system. It has thousands of users of the online Bluemix cloud development platform, where BigInsights is a service. Many thousands of individuals inside IBM and in its customer base use the online education environment called IBM Big Data University.
4.3 Commitment to open source
We distribute 100% open source Apache Hadoop components. This is 
not proprietary. On top of the open source code we provide analytical 
tools to help get value from the data. 
IBM is committed to supporting the open source movement. IBM helped 
open platforms such as Linux, Eclipse and Apache become standards 
with vital industry ecosystems, and then we developed high-value 
businesses on top of them. Today IBM collaborates broadly to support 
open platforms such as OpenStack and Hadoop. 
Because of this commitment, IBM avoids creating any independent fork 
of Apache project code, and merely selects the open source versions
that we feel are the best in achieving most current and most stable 
capabilities together in the overall Hadoop operating environment. The 
inner core of BigInsights is Apache Hadoop, and we do inter-version 
testing of the projects included so our enterprise customers are ensured 
that they have a blue-washed and interoperable codebase across the 
projects. As “most current” and “stable” are often conflicting, 
BigInsights does not always use the most current version of projects, but 
rather the most stable. Where we identify issues in the open source projects, we have a number of committers within our IBM development labs who submit fixes back to the open source community.
IBM’s goal with this approach is to protect the corporate IT organisation 
from version management across the various open source projects by 
providing this pre-tested, interoperable set in InfoSphere BigInsights. 
An example of this commitment is that IBM contributed 25% of the fixes for a recent release of Hadoop. 
The most widely deployed version of BigInsights for Hadoop is v2.1 (the current release is v3), which is supported by IBM until 05-Jul-21.
4.4 The key differences between IBM, Cloudera, Pivotal and Hortonworks
Cloudera 
IBM has a comparable number of significant-sized deployments to Cloudera, a Hadoop distributor. However, the company is quite different to IBM. Cloudera is venture capital funded, with $160m invested in its sixth funding round, completed in March 2014. Sales revenue was reported as $73m in 2013, its fourth year trading. It has 500 employees.
Our opinion on the long term destiny for companies like this – niche 
technology players – is of a business exit plan based on acquisition by the 
industry giants to fill a technology gap in the enterprise platforms they 
provide. At this point it’s not clear if any of the enterprise technology 
mega vendors have such a gap so the future of Cloudera is unclear. 
Functionality which will be useful to the vehicle data industry such as real 
time streaming data analytics, text analytics, analytics accelerator tools, 
visualisation, enterprise wide search, indexing, data integration software, 
connected analytics appliances, relational data marts, governance audit 
and compliance is all available in IBM BigInsights but not in this 
alternative. 
Documented limitations in the Cloudera query engine explain why results 
in data joins can fail to complete. Referring to the Cloudera user manual 
highlights the cause as insufficient memory – this is caused by the need to 
load ALL data into memory. As raw data sets can be very large this 
limitation can easily exceed the total memory available. Vehicle data 
volumes being generated are currently very large, and customer datasets 
are also very large so this limitation constrains vehicle industry 
applications. 
By comparison, IBM Big SQL has no requirement that joined tables fit in the aggregated memory of the data nodes, a limitation which causes queries to run out of memory and fail.
IBM Hadoop is up to 41x faster than Hive 0.12 (Cloudera) on a TPC-H-like benchmark. 
IBM Hadoop is over 2x faster than (Cloudera) Impala on a TPC-H-like benchmark.
Pivotal/EMC/Greenplum 
Greenplum has changed hands several times and is now part of Pivotal, an EMC spinoff; their Hadoop offering is now called Pivotal HD. IBM BigInsights has many advantages over Pivotal HD:
IBM BigInsights adds significant functions beyond IBM’s 100% open source Hadoop components, including analytic accelerators such as Big SQL, BigSheets, BigMatch, BigR and text analytics. Pivotal, by contrast, includes proprietary components and lacks added-value software applications such as those listed above.
IBM has already achieved broader marketplace presence and analyst 
rating (e.g. Forrester Wave) 
IBM BigInsights offers greater flexibility and a lower-cost solution, with availability as software only, on the cloud, or on the flexible IBM System x reference architecture. By comparison, Pivotal is now recommending expensive Isilon storage, which uses a proprietary OneFS file system. IBM has made significant investments to ensure its enterprise architecture leverages low-cost open Hadoop elements rather than creating lock-in solutions.
Significantly, Pivotal does not support HDFS. BigInsights offers HDFS and 
GPFS support. Where the new SQL HAWQ component of Pivotal HD is 
offered as a license cost option, the powerful IBM Big SQL is included with 
BigInsights. 
IBM offers a complete Big Data platform Solution as an integrated 
architecture that offers more than just Hadoop - including BigInsights, 
Streams, MPP Database, Information Integration. 
Real time analytics – not just batch – is provided by IBM, whereas Pivotal 
has an in-memory grid, which is not a real time streaming solution. 
Data security at an enterprise granular level is provided by IBM’s Integration and Governance offerings (Information Server, Guardium and Optim), which are integrated with Hadoop; achieving the same level of data management elsewhere would require local development or third-party integration projects.
As the vehicle data will include personal data, its governance is mandated. 
Delivering the appropriate systems is easier and lower cost with IBM. 
Pivotal HAWQ adds the entire RDBMS structure (query engine, storage 
layer, metadata) to Hadoop. This adds proprietary layers and database 
complexity to the Hadoop solution. By comparison, IBM Big SQL integrates just the query engine with Hadoop. This allows the query engine to be collocated with the Hadoop cluster, executing against native metadata and HDFS files, which is how IBM won the performance benchmark tests cited above. IBM Big SQL also offers elastic scalability, where nodes can be added or removed online.
Hortonworks 
BigInsights and Hortonworks have similar Hadoop components and both 
are committed to open source Apache Hadoop with Committers and 
contributors to open source Apache Hadoop. However, BigInsights 
extends value beyond Hortonworks for analytics with its Social Media 
Accelerator, Machine Data Accelerator, BigSheets spreadsheet and 
visualization, Advanced Text Analytics. Also BigInsights includes Data 
Explorer for Search and Indexing in Hadoop and beyond to all enterprise 
data; a vital function to make the data accessible to potential users inside 
the company and through applications to its customers. 
BigInsights Big SQL has advantages over Hortonworks’ HiveQL, as Big SQL provides richer SQL, better HBase performance, and better short-query performance.
For many large companies already using GPFS, IBM BigInsights 2.1 
uniquely offers GPFS as a Hadoop file system providing enterprise data life 
cycle management. BigInsights also has Adaptive MapReduce (Platform 
Symphony) for faster Map Reduce processing and BigInsights integrates 
with InfoSphere Streams, while Hortonworks does not, limiting its use to 
batch processing. 
5. BigInsights Performance and stability 
5.1 Overview 
IBM InfoSphere BigInsights has been independently benchmarked and proven to be between 4 and 11 times faster than open source alternatives running on identical infrastructure.
InfoSphere BigInsights provides several features that help increase performance, as 
well as enhance its adaptability and compatibility within an enterprise environment. 
5.2 Adaptive MapReduce for job acceleration
Jobs running on Hadoop can end up creating multiple small tasks that consume a 
disproportionately large amount of system resources. To combat this, IBM invented 
a technique called Adaptive MapReduce that is designed to speed up small jobs by 
changing how MapReduce tasks are handled without altering how jobs are created. 
Adaptive MapReduce is transparent to MapReduce operations and Hadoop 
application programming interface (API) operations. 
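The principle, though not IBM's actual implementation, can be illustrated by coalescing many small tasks into fewer batches, so per-task startup overhead is paid once per batch. The function and parameter names below are invented:

```python
def batch_tasks(task_sizes, target_batch_size):
    # Coalesce small tasks until each batch reaches the target size.
    batches, current, size = [], [], 0
    for task_size in task_sizes:
        current.append(task_size)
        size += task_size
        if size >= target_batch_size:
            batches.append(current)
            current, size = [], 0
    if current:
        batches.append(current)
    return batches

# Ten tiny tasks become two batches, i.e. two launches instead of ten.
batches = batch_tasks([1] * 10, target_batch_size=5)
# batches -> [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]
```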
5.3 Comprehensive SQL performance and support
Let’s start with the number one reason why this new release of Big SQL sets a new bar: performance. Benchmark tests indicate that Big SQL executes queries 20 times faster, on average, than Apache Hive 0.12, with performance improvements ranging up to 70 times faster.
This performance improvement was achieved by replacing the earlier Map- 
Reduce (MR) implementation with a massively parallel processing (MPP) 
SQL engine. The MPP engine deploys directly on the physical Hadoop 
Distributed File System (HDFS) cluster. A fundamental difference from other 
MPP offerings on Hadoop is that this engine actually pushes processing 
down to the same nodes that hold the data. Because it natively operates in 
a shared-nothing environment, it does not suffer from limitations common 
to shared-disk architectures (for example: poor scalability and networking 
caused by the need to move shared data around). 
IBM’s unique ANSI standard SQL interface in BigInsights automatically 
optimises queries so that smaller queries run in-memory and bypass 
MapReduce. For larger queries that still rely on MapReduce, BigSQL can also 
still leverage the performance benefits of Adaptive MapReduce. 
IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution 
to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without 
modification. To contrast, Apache Hive 0.12 executes only 43 of the 99 queries without modification. In a Jan 2013 blog post, Cloudera describes how its benchmark tests were completed by modifying the TPC-DS queries to SQL-92 syntax and selectively including only 20 of the 99 TPC-DS queries.
IBM Big SQL has many advantages over Impala, including richer SQL support 
such as SQL-92 sub-queries, SQL 99 aggregate functions, and SQL 2003 
windowing aggregate functions. 
Impala is an immature, feature-poor, back-level SQL offering, and SQL tools may not work with Impala due to ODBC and JDBC driver limitations.
Big SQL enables row and column access control, or “fine-grained control” 
consistent with functionality found in an RDBMS. 
The comprehensive SQL support in Big SQL 3.0 enables an organization to make full use of its existing SQL skills, reducing the need to augment its analytic applications with Hadoop-specific functions.
Now here’s the real value: Big SQL 3.0 can access data from more sources than BigInsights alone. It can query and combine data from many data sources, including (but not limited to) DB2 for Linux, UNIX and Windows database
software, IBM PureData System for Analytics, IBM PureData System for 
Operational Analytics, Teradata and Oracle. Organizations can choose to 
leave data where it currently exists and use BigInsights to augment 
where it makes the most sense. 
Note that this approach, minimizing the need to move data, is part of 
IBM’s overall big data and analytics strategy. SPSS and Cognos Business 
Intelligence also support querying and joining data across disparate data 
sources, addressing the need to analyze all data, wherever it is located. 
IBM InfoSphere BigInsights v3.0, with the MPP-based performance and 
SQL support of Big SQL 3.0, provides an enterprise-ready Hadoop 
distribution that minimizes the impact on users while enabling IT to 
adopt this new technology into its data architecture strategy. 
5.4 Federated data access
Big SQL can access data from more sources than BigInsights alone. Its federated access allows users to send distributed requests to multiple data sources within a single SQL statement.
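As a sketch of what this looks like to an application, the statement below joins a table landed in Hadoop with one left in a source RDBMS. The schema and table names are invented, and sqlite3 is used purely as a stand-in for a standard JDBC/ODBC driver connection:

```python
import sqlite3

# Illustrative federated statement: one query spanning two "sources".
FEDERATED_QUERY = """
SELECT v.vin, c.customer_id
FROM vehicle_telemetry v           -- data landed in BigInsights
JOIN customers c ON v.vin = c.vin  -- data left in a source RDBMS
WHERE v.fault_code IS NOT NULL
"""

def submit(connection, sql):
    # Any DB-API style connection obtained through a standard
    # ODBC/JDBC bridge could execute the statement unchanged.
    cur = connection.cursor()
    cur.execute(sql)
    return cur.fetchall()

# Demo data; in a real deployment the tables live in different systems.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vehicle_telemetry (vin TEXT, fault_code TEXT);
CREATE TABLE customers (vin TEXT, customer_id INTEGER);
INSERT INTO vehicle_telemetry VALUES ('VIN1', 'P0420'), ('VIN2', NULL);
INSERT INTO customers VALUES ('VIN1', 101), ('VIN2', 102);
""")
rows = submit(conn, FEDERATED_QUERY)
# rows -> [('VIN1', 101)]
```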
Administrators start with a GUI-driven installation tool that guides them to specify 
which optional components to install and how to configure the platform. Installation 
progress is reported in real time, and a built-in health check is designed to 
automatically verify the success of the installation. These advanced installation 
features minimize the amount of time needed for installation and tuning, freeing 
administrators to work on other critical projects 
Once the Hadoop cluster is in place, robust job management features give 
organizations control of InfoSphere BigInsights jobs, user roles, security and key 
performance indicator (KPI) monitoring. Technical staff can easily direct job 
creation, submission and cancellation; they can also stay informed of workload 
progress through integrated job status dashboards, logs and monitors that provide 
details on configuration, tasks, attempts and other critical information. In addition, 
InfoSphere BigInsights provides administration features for Hadoop Distributed File 
System (HDFS), IBM GPFS™ File Placement Optimizer (FPO), big data applications and 
MapReduce jobs, and cluster management. 
5.5 Performance architecture
The architecture of IBM InfoSphere BigInsights essentially comprises three layers and a management/administration tier. All data is stored in
our distributed file system, which can be either open source Apache HDFS or 
IBM’s more enterprise-class General Parallel File System (GPFS). This forms 
the underlying data persistence layer on which all other components rely, 
and hence is typically considered the bottom-most layer. 
In the middle, there are a number of data processing components that all 
leverage the MapReduce capabilities of Hadoop in order to parallelise their 
work. These include: 
• Data processing languages like Pig and Jaql 
• Query mechanisms like Hive 
• Indexing mechanisms like Lucene 
• Data load mechanisms like Sqoop and Flume 
• Analytical capabilities like Text Analytics and Probabilistic Matching and 
• Data repositories built on top of the distributed file system like HBase. 
MapReduce itself could be considered the backbone of this layer, and IBM provides a high performance optimisation of open source MapReduce called “Adaptive MapReduce”, delivering greater performance to all data processing in this layer through integration with Platform Symphony.
• IBM is also unique in providing one of the only commercially supported 
data-in-motion analytics capability, InfoSphere Streams, which we 
originally developed through our unique Research division, in 
cooperation with various US governmental agencies. Streams can 
directly leverage data in Hadoop, in-memory data grids like Redis, and 
also integrates directly with DataStage.
5.6 Real time analytics
In addition, to address true real-time (data in motion) requirements, we 
integrated the use of IBM InfoSphere Streams. This is a unique capability 
that IBM initially developed in partnership with its Research division and 
various US governmental agencies to process large quantities of both 
structured and unstructured data with both high throughput and low 
latency. 
Streams uses its own in-memory processing and node coordination facilities 
to achieve microsecond latencies, but can use InfoSphere BigInsights and a 
number of relational databases as both a source of historical information 
and a target to which to store information for retention purposes. In 
addition, Streams can integrate with in-memory data grids in order to 
support low-latency lookup of information, typically reference data. 
5.7 Resilience delivered from Platform Symphony
5.8 Platform scheduling
IBM’s Platform software is IBM’s cluster management and scheduling 
system, which can support diverse compute and data intensive applications. 
Platform is a mature and well-established product, used across many 
industries for grid centric workloads. 
The major benefits that Platform brings to BigInsights are as follows: 
• Recovery and reliability: 
− Hadoop jobs, and job tasks, are recoverable in the event of node 
failure 
− Platform infrastructure has no single point of failure 
− All services are highly available, and will be restarted automatically 
on alternative servers in the event of a management server failure 
• Resource sharing and flexibility: 
− Platform can manage both Hadoop and non-Hadoop workloads 
within the same cluster, including provisioning through the use of 
the optional Cluster Manager 
− Multiple IBM and third party analytic applications can be supported 
on a shared infrastructure, e.g.: InfoSphere Streams, InfoSphere
DataStage, SPSS, SAS, R, etc. 
− Infrastructure can be shared across development, test, and 
production environments; across different user groups, clusters and 
workload types: which will drive greater efficiencies and utilisation, 
while reducing costs. 
• Scheduling agility: 
− Agile scheduling ensures that time critical workloads start and finish 
fast 
− Optionally give priority to interactive jobs (e.g. BigSheets, Big SQL) 
− Resource allocations shift instantly based on priority adjustments 
and proportional allocations at run-time 
− Platform’s highly effective scheduling ensures that the cluster can be 
kept at high average levels of utilisation: 80-90% average utilisation 
is not uncommon for Platform clusters. 
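A rough sketch of priority-driven dispatch (not the Platform Symphony API) shows how interactive work can jump ahead of batch jobs:

```python
import heapq
import itertools

class PriorityScheduler:
    # Toy scheduler: lower priority number dispatches first,
    # FIFO among equal priorities. Names are illustrative only.
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def submit(self, job, priority):
        heapq.heappush(self._heap, (priority, next(self._seq), job))

    def next_job(self):
        return heapq.heappop(self._heap)[2]

sched = PriorityScheduler()
sched.submit("nightly-etl", priority=5)
sched.submit("bigsheets-interactive", priority=1)  # interactive favoured
first = sched.next_job()
# first -> "bigsheets-interactive"
```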
One important addition IBM has made beyond open source capability in 
terms of scalability, however, is in the area of resource scheduling. When 
using the Adaptive MapReduce framework built into BigInsights, a more 
advanced scheduling mechanism is used. This mechanism improves 
scalability by leveraging Platform Symphony’s ability to support a number of 
scheduling agents running in parallel rather than open source’s historically 
singular JobTracker service. This is similar to the scalability improvements in 
the recently released YARN capability of open source Hadoop, but uses the 
proven robustness of Platform Symphony rather than months-old and 
relatively unproven open source technology. 
For example, one of our largest deployments of BigInsights started with an 
initial volume of approximately 2.5 PB, and has grown over the last couple 
of years to approximately 5 PB. It is expected to grow to 20 PB of data 
within the next several years. In combination, GPFS and Platform bring 
significant operational benefits to the running of Hadoop workloads, and 
provide the flexibility to support other types of workloads in a common 
infrastructure. Coupled with Platform’s resource allocation and prioritisation 
capabilities, we believe this will drive higher utilisation and efficiency, and 
will lower operational support costs by virtue of having a single architecture 
to support diverse workloads and application types. 
The robust availability and recoverability characteristics provided by 
Platform and GPFS will provide clients with a Hadoop solution ready for 
enterprise deployment into what is becoming an increasingly time-sensitive 
business environment.
5.9 High Availability
BigInsights provides a highly robust Hadoop solution that automatically 
handles the failure of management and data nodes without losing data and 
without any interruption to processing. 
The recommended configuration of BigInsights is to use the Platform Symphony MapReduce scheduler instead of the Apache MapReduce scheduler, and GPFS-FPO as the High Availability file system for data storage instead of HDFS. Note that GPFS can provide a highly available HDFS filesystem across sites. GPFS is also used to provide a highly available (and optionally cross-site) filesystem which is used by the Symphony scheduler to support High Availability configurations; the use of a shared NAS facility is also an option.
When BigInsights is configured in this way, and combined with the 
appropriate server, network, and environmental infrastructure (e.g. power, 
cooling), it provides a highly available solution. Platform Symphony requires 
a shared file system to be accessible between management nodes. In a 
production environment two or more management nodes will be 
configured. The number of management nodes will be dictated by two 
factors. The first factor is the level of redundancy required; additional 
management nodes mean that the cluster can tolerate more failure at the 
management level. The second factor is the cluster size/load. Platform 
Symphony can use multiple instances of the Symphony scheduler (similar to JobTracker), one for each logical application. As the number of applications increases, Symphony can be scaled out to multiple Symphony schedulers, providing load-balanced scheduler instances across the available management nodes. Platform Symphony will do this automatically.
Therefore as the cluster size/load increases additional management nodes 
can be added. 
All management nodes are active. 
The shared file system is used to store component state data. For example 
an instance of a Symphony scheduler will store metadata about all in flight 
workload currently being processed. If a management node on which a 
component is running fails, the component will be restarted on another 
available node, during start-up the component will recover state written to 
the shared file system. 
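The checkpoint-and-recover pattern described above can be sketched with plain files on a shared mount; the directory, component name and state fields below are illustrative:

```python
import json
import os
import tempfile

def checkpoint(state_dir, component, state):
    # Persist component state to the shared filesystem.
    path = os.path.join(state_dir, component + ".json")
    with open(path, "w") as f:
        json.dump(state, f)

def recover(state_dir, component):
    # A restarted instance reads back the state written before failure.
    path = os.path.join(state_dir, component + ".json")
    with open(path) as f:
        return json.load(f)

shared = tempfile.mkdtemp()  # stands in for the shared GPFS/NAS mount
checkpoint(shared, "scheduler", {"in_flight": ["job-42"]})
restored = recover(shared, "scheduler")
# restored -> {"in_flight": ["job-42"]}
```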
The shared file system can be implemented using a variety of technologies 
and solutions, for example with Network Attached Storage (NAS) appliances 
as HA features (e.g. dual controllers, RAID) are normally built in. GPFS can 
also be used as it is a clustered file system and can provide the required 
redundancy and high availability. The Symphony shared file system can be 
implemented using GPFS through a number of different hardware 
configurations. One example is shown below. 
Platform Symphony High Availability with GPFS
In the above diagram three management nodes are shown. Each node has a 
number of solid state disks, each divided into two partitions: one for the 
operating system and one for GPFS. All disks are GPFS Network 
Shared Disks (NSDs), and all three servers are GPFS quorum nodes. There are 
three GPFS failure groups, one per management node, each created from 
that node's direct-attached SSDs. A replication factor of 3 is used for both 
data and metadata, so each block of a replicated file is present in all three 
failure groups. 
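The effect of three failure groups with a replication factor of 3 can be illustrated with a small placement sketch (hypothetical logic for illustration only, not GPFS's actual block allocator):

```python
def place_replicas(block_id, failure_groups, copies=3):
    """Place each copy of a block in a distinct failure group, so the
    loss of any one management node (one failure group) still leaves
    at least two readable copies of every block."""
    if copies > len(failure_groups):
        raise ValueError("need at least one failure group per copy")
    # Rotate the starting group per block to spread load evenly.
    start = block_id % len(failure_groups)
    return [failure_groups[(start + i) % len(failure_groups)]
            for i in range(copies)]

groups = ["fg-node1-ssd", "fg-node2-ssd", "fg-node3-ssd"]
placement = place_replicas(block_id=7, failure_groups=groups)
# Every copy lands in a different failure group.
```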
Use of SSDs is not a requirement, but they may provide performance 
advantages for more latency-sensitive, near-real-time applications or for 
file system metadata. 
IBM BigInsights uses an HA manager from Platform Symphony, known 
internally as the service controller, to manage management components or 
services. The service controller is responsible for starting one or more 
instances of registered service types. The BigInsights and Platform 
Symphony management components are registered as service types within 
Platform Symphony. The service controller monitors all service instances 
and restarts a service instance if it exits unexpectedly. There is always one 
instance of the service controller running. If the management node on 
which it is running becomes unavailable a new instance is restarted on 
another management node. 
Hardware component failures which cause the node to go offline are 
handled as follows: 
If the failed node runs GPFS management functions, or is a Symphony 
(Adaptive MapReduce) management node, then the failover clustering within 
GPFS or Symphony automatically moves any management functions to 
another designated node. 
If the failed node “owns” replicated data (i.e. it is a data node), then GPFS 
marks the node and its associated storage as unavailable and “stale”; I/O 
requests for that data are fulfilled from another copy. If a disk fails, GPFS 
similarly marks the disk as stale and redirects I/Os to other copies of the 
data. 
Symphony 
An HA Manager service runs on all management nodes. This service 
monitors critical services and manages the failover steps (for instance, 
terminating the failed process, binding the floating IP to the standby server, 
and starting the required process on the standby server). These steps are 
not required for Symphony and GPFS themselves, which have their own 
built-in failover clustering, allowing simplified failover with zero or minimal 
disruption to ongoing work. 
23
NameNode failure 
With GPFS as the file system for BigInsights, no active-passive NameNode 
failover is required: file system management functions are failover-clustered 
within GPFS, and metadata is distributed across the file system. There is no 
“master” NameNode to fail; all GPFS services are designed to be mobile 
around the cluster and fail over as required. 
GPFS provides automated failover of file system management functions 
from any failed node running them. Access to data or metadata owned by 
that node is maintained transparently by redirecting I/O to a replica of the 
data. I/O is briefly suspended (transparently to users), typically for one to 
two minutes, while recovery steps proceed within GPFS, after which I/O 
resumes. Note that this transparent failover also extends to the Clustered 
NFS facility within GPFS, which could be used to provide edge/gateway 
services to move data in and out of the cluster. 
Recovery from loss of nodes within the cluster 
If a data/compute node fails, access to all data is maintained. Data which 
was “owned” by that node and stored in the GPFS filesystem remains 
available, even if one of the copies of the data is no longer accessible on the 
failed node. This is done transparently to the other nodes in the cluster. 
GPFS has robust, built-in clustering within the filesystem, and does not 
require additional hardware or failover software to operate. 
In a Hadoop cluster it is expected that nodes will become unavailable for a 
variety of reasons. The nodes themselves do not generally have any 
redundancy built into them. The IBM solution has a service controller that is 
responsible for starting one or more instances of registered service types. 
The service controller finds an available node on which to start a particular 
service instance. It is also responsible for monitoring all service instances. 
Therefore, if a node becomes unavailable the service controller restarts 
an instance on another available node. There is always one instance of the 
service controller running; if the node on which the service controller is 
running is lost, the service controller is automatically restarted on another 
node. 
HA supported at the job level 
IBM Platform Symphony replaces the JobTracker component of open source 
Hadoop with its own scheduler. The Task Tracker component is also 
replaced with another Platform Symphony component that runs on each 
data node for managing each map or reduce task. A number of Platform 
Symphony schedulers will be running within the cluster, one for each 
configured application. A Symphony scheduler itself is responsible for 
managing all jobs submitted on behalf of a particular application. 
In terms of job-level HA, failures at both the data nodes and the management 
nodes are relevant. Data nodes run all map and reduce tasks 
associated with any MapReduce jobs submitted to the cluster. Management 
nodes run a number of management daemons and components. 
If a data node becomes unavailable, the scheduler detects the loss of 
communication with the runtime components (the Task Tracker equivalent) 
on that node. It then automatically re-queues any map or reduce tasks that 
were running on the failed data node and belong to jobs it manages, and 
reschedules those tasks on the available data nodes. 
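The re-queue behaviour can be sketched as follows (illustrative Python only; the real scheduler is far more sophisticated):

```python
from collections import deque

def requeue_failed_node(running, queue, failed_node):
    """When the scheduler loses contact with a data node, move every
    map/reduce task that was running there back onto the pending
    queue so it can be rescheduled on a surviving node. No client
    resubmission is required."""
    still_running = {}
    for task_id, node in running.items():
        if node == failed_node:
            queue.append(task_id)        # re-queue for rescheduling
        else:
            still_running[task_id] = node
    return still_running

pending = deque()
running = {"map-1": "dn01", "map-2": "dn02", "reduce-1": "dn01"}
running = requeue_failed_node(running, pending, failed_node="dn01")
```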
When a management node becomes unavailable, all of its components, 
including any schedulers, are automatically restarted on other management 
nodes (Platform Symphony load-balances across management nodes). Each 
scheduler writes state information to disk, including all in-flight map and 
reduce tasks. When the scheduler is restarted, this data is read from disk as 
part of scheduler recovery, so the scheduler can recover all necessary 
information about jobs previously submitted by client applications and then 
continue processing the workload. 
As previously described, these HA operations assume a robust, available 
file system that can be accessed from any management node (typically a 
GPFS file system, or a NAS in a high-availability configuration). With GPFS, 
Symphony uses the shared file system to support HA operations at the job 
level. 
While the data nodes do not require access to the shared Symphony file 
system for the jobs to be scheduled, the use of GPFS as a separate highly 
available file system accessible across the cluster provides the reliable data 
storage facility for the data used by the running workloads. 
Therefore all workload can be recovered in the event of a management 
node or scheduler failure. There is no requirement to resubmit workload 
from the client perspective: recovery is automatic. The client application 
that submits any new work will automatically connect to the new instance 
of the scheduler. 
Cluster backups 
Because all of the Hadoop data can be held in a POSIX-compatible GPFS 
file system, where it is also accessible using normal operating system 
commands, data can be backed up easily whether it is physically held inside 
the data/compute nodes or in a shared disk storage subsystem. In either 
case, GPFS allows Hadoop jobs to see the data as if it were “HDFS”, while 
backup software sees the same data as normal files in a POSIX file system. 
This also reduces the requirement for multiple copies of the data: data can 
be transparently moved from inside the Hadoop cluster to less expensive 
shared storage (RAID carries a 20% to 40% overhead, versus the 200% 
overhead of the three copies HDFS keeps by default). Data can even be 
transparently moved to tape, and automatically and transparently recalled. 
GPFS also supports snapshots, providing a measure of recovery in the case 
of accidental deletion, or other need to restore to a previous version of 
data. 
Backups need to be done of management nodes, GPFS nodes, and other 
non-data nodes. Information relevant to the re-creation of the cluster 
should also be backed up (e.g. filesystem configuration information, image 
directory for the OS provisioning manager, operating system installation 
backups for the provisioning manager, network switch configurations, etc). 
The general and disciplined use of a provisioning manager such as PCMAE 
reduces the number of individual items that must be separately considered 
for “bare metal” backups. If most systems are deployed through the 
provisioning manager, those servers can be rebuilt easily, reliably and 
exactly by using the provisioning manager once again. 
For the cluster, user data should be backed up if it cannot be easily 
recreated in the case of a total data centre loss, or a catastrophic problem 
which destroys all of the online data. Offline backups represent a safer 
alternative or adjunct to online backups (snapshots, or backups-to-disk). 
This is because a structured and ordered sequence of events needs to occur 
to access offline data – the backup system has to request a tape mount of a 
valid tape volume ID, the tape is mounted, and only then is access provided. 
Multiple structured steps to access offline storage minimise the chances of 
data loss due to malicious or accidental damage. 
Disaster Recovery / Data Corruption 
Disaster Recovery (DR) planning is an involved process, which must balance 
the Recovery Point Objective (RPO, broadly “data loss”), Recovery Time 
Objective (RTO, broadly “time until service is restored”), budgets, and 
operational and physical resources and constraints. 
DR is supported by the proposed BigInsights component for scenarios 
ranging from cold, to warm, to hot, to active-active sites. To support some 
scenarios, the Symphony and GPFS licenses included with BigInsights may 
need to be extended to full product licenses for some or all cluster nodes. 
GPFS is a clustered file system and supports multi-site “24x7” high 
availability configurations with data replicated between sites, allowing a site 
to fail without interrupting data access in the file system. Thus two sites 
(plus a single “tiebreaker” server at a third site) could form the basis of a 
highly available cluster configuration. Recovery of jobs and other services 
would also need to be considered. 
Alternatively, you may choose to rely on existing backup and recovery 
products and procedures for DR. The use of GPFS, which presents all data 
(including MapReduce data) as a POSIX filesystem, enables enterprise 
backup software to be used to back up and restore data. Note that 
additional steps and considerations may be required: for example to back 
up and restore ACLs, as well as to re-establish the filesystem at the DR site. 
Remote caching of data using GPFS-AFM may also be considered. From a 
workload management perspective, two primary DR scenarios are 
supported: 
• Active–active 
• Active–passive 
In the first scenario, management and data nodes are located in at least two 
data centres (plus a management node at a tiebreaker site; see the diagram 
below). All management and data nodes are active, and both data centres 
run jobs. In the event of a DR incident that removes one data centre, the 
remaining data centre continues to process running jobs and to receive new 
jobs from clients. To support this scenario the Platform Symphony shared 
file system must be available to management nodes in both data centres, 
and Platform Symphony metadata (for example Hadoop job metadata) must 
be replicated between the data centres. GPFS enables us to create a 
clustered file system with this type of HA configuration. 
In the second scenario, management and data nodes are again located in 
both data centres, but there is no requirement for a tiebreaker site with 
respect to managing Platform Symphony. Only the management nodes 
in the primary data centre are active. Data nodes in both sites may be active 
if application data is replicated between the data centres. In the event of a 
DR event the management nodes in the secondary data centre are started. 
They will connect to a shared file system that is available in the secondary 
data centre. All remaining data nodes will automatically connect to the 
management nodes in the secondary data centre. Workload that was being 
processed before the DR event will be lost and will need to be resubmitted. 
There is no requirement in this scenario to synchronously replicate Platform 
Symphony metadata between the sites. Platform Symphony configuration 
data (grid and application configuration) should be asynchronously 
replicated at an appropriate interval, for example every 30 minutes. Again, 
GPFS can serve as the basis for the Platform Symphony shared file system. 
If the proposed solution is correctly implemented and maintained with an 
appropriate level of environmental, systems and network redundancy, then 
individual node failures at a site are handled by the built-in clustering of 
Symphony and the failover clustering within the GPFS file system. A 
non-exhaustive list of scenarios in which DR would need to be invoked 
includes complete failure of the network at a site, a power outage at a site, 
or any situation in which multiple simultaneous failures take several key 
components offline at once; for example, a power spike destroying both 
redundant network switches. If the DR site is not in an “active-active” 
configuration for Symphony and GPFS, a decision would have to be made 
weighing repair time against the time and cost of invoking DR and the 
subsequent failback. 
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) of 
disaster recovery configuration 
For the GPFS filesystem, the RPO for a dual site active-active solution would 
be zero, as writes happen synchronously between sites under the control of 
GPFS. Other options such as asynchronous replication by GPFS (GPFS-AFM) 
are also possible, which would present an RPO of minutes to hours. 
In an active–active configuration the RTO would be a few minutes. Jobs 
would continue to run (with reduced throughput, as capacity would be 
reduced) once any Platform Symphony schedulers were restarted at the 
second data centre; this would take a few minutes, including the time to 
detect the failure at the first data centre. 
In the active – passive configuration all running workload will be lost and will 
need to be resubmitted. The RTO is the time taken to start the management 
nodes in the secondary site, plus the time taken for data nodes to connect 
to these management nodes. This ignores the cost of having to start data 
nodes and load application data from backup locations if this is required. 
An RTO of close to 0 may be achievable using the fully integrated solution 
including Platform Symphony and GPFS, in a multi-site configuration with 
synchronous data replication between the sites. Note that there may be a 
reduction in compute resources (unless policy keeps a complete separate 
idle copy of production at DR), and that jobs which were in process at the 
failed site would need to be restarted or abandoned. During the actual 
failure of a site, I/O is temporarily suspended while the filesystem cluster 
reconfigures, after which work continues as normal. This time typically 
ranges from tens of seconds to a few minutes. 
Jobs executing at the surviving site continue to operate, apart from the 
temporary suspension of I/O mentioned above. While the Symphony 
scheduler cluster reconfigures itself, new jobs can be submitted, though 
there may be some loss of service until all cluster components are started 
at the surviving data centre. Workload continues to execute on the cluster 
while Platform Symphony components are being restarted there. Any 
Platform schedulers already running at the surviving data centre are 
unaffected by the DR event, apart from the loss of service and the need to 
reschedule any tasks that were running at the data centre that was lost. 
GPFS can provide a dual site configuration (plus quorum/tiebreaker site), 
which supports active/active replication (see diagram below). This is done 
using GPFS-controlled synchronous replication over TCP/IP between the 
sites. This can provide a highly available data store, but may require 
additional software in the middleware or application layers of the solution 
stack to achieve “no transactions lost”. 
For example, IBM MQ can also be used with GPFS across multiple sites with 
GPFS providing a tested and supported high availability data store for MQ 
messages. This provides an environment where transactions can be 
retained, even in the case of the loss of a single site. 
6. Real time streaming analytics 
IBM InfoSphere Streams is an advanced analytic platform that allows user-developed applications to 
quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources. 
The solution can handle very high data throughput rates, up to millions of events or messages per 
second. The graphic illustrates an application containing statistical rules that process telematics 
data from cars: one car has triggered a slippery-road-surface alert by reporting that three of its 
wheels are rotating at different speeds, and approaching cars are alerted instantly. 
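The slippery-road rule above reduces to a simple comparison over streaming wheel-speed readings; a minimal Python sketch of the detection logic (illustrative only, not Streams SPL):

```python
def slippery_road_alert(reading, tolerance=0.05):
    """Flag a telematics reading in which wheel speeds diverge by more
    than `tolerance` (relative), e.g. some wheels slipping on ice.
    `reading` maps wheel names to rotation speeds (rev/s)."""
    speeds = list(reading.values())
    lo, hi = min(speeds), max(speeds)
    return (hi - lo) / hi > tolerance if hi > 0 else False

# One wheel much slower than the others: alert approaching cars.
slipping = slippery_road_alert(
    {"fl": 12.1, "fr": 12.0, "rl": 10.4, "rr": 12.1})
# All four wheels closely matched: no alert.
normal = slippery_road_alert(
    {"fl": 12.0, "fr": 12.1, "rl": 12.0, "rr": 11.9})
```

In a Streams deployment this rule would run continuously over the incoming telemetry stream rather than on individual dictionaries, but the per-reading decision is the same.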
InfoSphere Streams helps you analyze data in motion, providing 
sub-millisecond response times and allowing you to view information and 
events as they unfold from moving vehicles. 
You can analyze data in motion with Streams, which: 
• Supports analysis of continuous data including text, images, audio, 
voice, video, web traffic, email, GPS data, financial transactions, satellite 
data and sensor logs. 
• Includes toolkits and accelerators for advanced analytics, including a 
telco event data accelerator that analyzes large volumes of streaming 
data from telecommunications systems in near real time, and a social 
data accelerator for analyzing social media data. 
• Distributes portions of programs over one or more nodes of the runtime 
computing cluster to help achieve volumes in the millions of messages 
per second at velocities of under a millisecond. 
• Allows you to filter and extract only the relevant data from large 
volumes of information, helping to reduce data storage costs. 
• Scales from a single server to thousands of compute nodes based on 
data volumes or analytics complexity. 
• Provides security features and confidentiality for shared information. 
Simplify development of streaming applications with an Eclipse-based 
integrated development environment (IDE) 
Streams: 
• Allows you to build applications with drag-and-drop operators, and to 
dynamically add new views to running applications using data 
visualization capabilities such as charts and graphs. 
• Enables you to create, edit, visualize, test, debug and run Streams 
Processing Language (SPL) applications. 
• Provides a composites capability to increase application modularity and 
support large or distributed application development teams. 
• Allows you to nest and aggregate data types within a single stream 
definition. 
• Enables applications to be built on a development cluster and moved 
into production without recompiling. 
Extend the value of existing systems by integrating with your applications 
and supporting both structured and unstructured data sources. 
Streams: 
• Adapts to rapidly changing data forms and types. 
• Allows you to quickly develop new applications that can be mapped to a 
variety of hardware configurations. 
• Supports reuse of existing Java or C++ code, as well as Predictive Model 
Markup Language (PMML) models. 
• Includes a limited license for IBM InfoSphere BigInsights, a Hadoop-based 
offering for analyzing large volumes of unstructured data at rest. 
• Integrates with IBM DB2, IBM Informix, IBM Netezza, IBM solidDB, IBM 
InfoSphere Warehouse, IBM Smart Analytics System, Oracle, Microsoft 
SQL Server, MySQL and more. 
7. Data security 
7.1 Granular data 
security 
BigInsights integrates with Active Directory for user authentication. 
A secure connection uses Secure Sockets Layer (SSL) encryption to make 
data unreadable to third parties while it travels over the network between 
the Directory Server and the clients binding to its secure port. 
InfoSphere BigInsights also supports the Kerberos service-to-service 
authentication protocol, increasing security strength to prevent 
man-in-the-middle attacks. 
BigInsights supports integration with LDAP and single sign-on across clusters 
(e.g. Development and Test) through the use of the Name Service Switch 
(NSS) package and Lightweight Third Party Authentication (LTPA) tokens. The 
development, deployment and execution of analytic applications or other 
data processing jobs are controlled through role-based security, and 
information access is controlled through granular access control lists (ACLs). 
BigInsights uses POSIX compliant ACLs to control access to the data itself, 
down to the data stored in each individual node in the cluster, by using IBM’s 
General Parallel File System (GPFS). GPFS is a kernel-level, POSIX compliant 
file system that provides the same level of control for file access security and 
auditing capabilities as other POSIX file systems, allowing standard “owner”, 
“group” and “other” permissions to be assigned and changed using standard 
operating system commands like “chmod”. In fact, the ACLs within GPFS 
allow even greater flexibility by allowing additional users and groups to be 
defined, as well as a “control” level that determines who can change the ACL 
itself. 
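Because GPFS is POSIX compliant, standard operating system permission calls apply to Hadoop data directly; a minimal Python sketch of the equivalent of “chmod 600”:

```python
import os
import stat
import tempfile

# Restrict a file to owner read/write only, exactly as "chmod 600 file"
# would on any POSIX file system (GPFS included).
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)   # owner: rw-, group/other: ---

# Verify the permission bits with standard POSIX metadata calls.
mode = stat.S_IMODE(os.stat(path).st_mode)
owner_only = mode == (stat.S_IRUSR | stat.S_IWUSR)
os.remove(path)
```

The extended GPFS ACLs (additional named users and groups, the “control” entry) go beyond these standard bits, but are managed with the same kind of file-level operations.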
Perhaps most importantly, as a kernel-level file system the data stored in 
GPFS can be shared across BigInsights and any other application, without 
moving or replicating the data. This immediately improves flexibility of using 
the data in BigInsights, removing delays in the movement of data between 
systems and tools, and minimising the need to reproduce access control 
definitions at multiple levels. 
In addition to standard application authentication via LDAP or Kerberos, Big 
SQL enables row- and column-level access control, sometimes described as 
fine-grained control, consistent with the functionality found in an RDBMS. 
This supports compliance with regulations and policies related to data 
privacy, such as those covering patient health records or securities data, 
and so suits the compliance challenges of vehicle data, for example rules 
limiting the use of eCall data to the specific purpose of emergency response 
to vehicle problems. 
To monitor and validate data access, BigInsights’ built-in auditing can track 
changes to access privileges or data objects, as well as SQL statement 
execution and the retrieval of security information.
7.2 User security 
Administrators have the option to choose flat file, Lightweight Directory 
Access Protocol (LDAP) or Pluggable Authentication Modules (PAM) for the 
InfoSphere BigInsights web console. With LDAP authentication, the 
InfoSphere BigInsights installation program will communicate with an LDAP 
credentials store for authentication. Administrators can then provide access 
to the InfoSphere BigInsights console based on role membership, making it 
easy to set access rights for groups of users. 
InfoSphere BigInsights provides four levels of user roles: 
BigInsights System Administrator. Performs all system administration tasks; 
for example, monitoring the cluster’s health and adding, removing, starting 
and stopping nodes. 
BigInsights Data Administrator. Performs all data administration tasks; for 
example, creating directories, running Hadoop file system commands, and 
uploading, deleting, downloading and viewing files. 
BigInsights Application Administrator. Performs all application administration 
tasks; for example, publishing and un-publishing (deleting) an application, 
deploying an application to the cluster and removing it, configuring 
application icons, applying application descriptions, changing the runtime 
libraries and categories of an application, and assigning permissions on an 
application to a group. 
BigInsights User. Runs applications that the user has permission to run 
and views the results, data, and cluster health. This is typically the most 
commonly granted role for cluster users who perform non-administrative 
tasks. 
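The role separation above can be pictured as a simple allow-list (a hypothetical sketch; the action names are illustrative, not BigInsights identifiers):

```python
ROLE_ACTIONS = {
    # Simplified mapping of the four BigInsights roles to example actions.
    "system_admin": {"monitor_cluster", "add_node", "stop_node"},
    "data_admin": {"create_directory", "upload_file", "delete_file"},
    "app_admin": {"publish_app", "deploy_app", "assign_app_permissions"},
    "user": {"run_app", "view_results"},
}

def is_allowed(role, action):
    """Return True if the given role may perform the action."""
    return action in ROLE_ACTIONS.get(role, set())

can_deploy = is_allowed("app_admin", "deploy_app")
cannot_stop = is_allowed("user", "stop_node")
```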
MapReduce jobs can be run under designated account IDs, which helps 
tighten security, access control and auditing. Integration of InfoSphere 
BigInsights with IBM InfoSphere Guardium® data security software helps 
organizations to manage the security and auditing needs of Hadoop the same 
way they manage traditional structured data sources. 
7.3 Audit and 
integration 
with Guardium, 
the leading 
security 
platform 
BigInsights can be configured to collect a range of audit information. 
BigInsights stores security audit information as audit events in its own audit 
log files for general security tracking. The log files are written to the file 
system in directories using a date based naming protocol, which can only be 
accessed by administrators. 
You can also configure InfoSphere BigInsights to send audit log events to 
InfoSphere Guardium for security analysis and reporting via Guardium Proxy. 
An audit message contains three critical pieces of information derived from 
an audit event: audit event timestamp, component that generated the audit 
event, and the audit message itself. 
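The three pieces of an audit message can be sketched as a simple structure (illustrative Python; the actual BigInsights/Guardium wire format differs):

```python
from datetime import datetime, timezone

def make_audit_message(component, message, when=None):
    """Assemble the three critical pieces of an audit message:
    the event timestamp, the component that generated the event,
    and the audit message text itself."""
    when = when or datetime.now(timezone.utc)
    return {
        "timestamp": when.isoformat(),
        "component": component,
        "message": message,
    }

# Hypothetical example event for a denied file-permission change.
event = make_audit_message(
    "hdfs", "PERMISSION DENIED: user=analyst cmd=chmod",
    when=datetime(2014, 6, 1, 12, 0, tzinfo=timezone.utc))
```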
After BigInsights events exist in the InfoSphere Guardium repository, other 
InfoSphere Guardium features such as workflow (to email and track report 
signoff), alerting, and reporting are available. 
InfoSphere Guardium has a secure, tamper-proof repository. All audit 
information is stored there and cannot be modified, even by privileged 
users. Once data is collected and written to Guardium there is no way to 
alter it, which guarantees non-repudiation of the data. This secure 
repository supports separation of duties and absolves database 
administrators of any suggestion that they might have changed audit data 
to “cover their tracks,” even in a legal setting. 
In addition, Guardium has a hardened operating system and database 
kernel: there is no way for users to directly access the underlying operating 
system, file system, or database. As an added precaution, all unused 
software components of the operating system and the embedded database 
have been removed or disabled. 
The security audit information that InfoSphere BigInsights generates depends 
on your environment. The following list is representative of the type of 
information that InfoSphere BigInsights generates: 
• Hadoop Remote Procedure Calls (RPC) authentication and 
authorisation successes and failures 
• Hadoop Distributed File System (HDFS) file and permission-related 
commands such as cat, tail, chmod, chown, and expunge 
• Hadoop MapReduce information about jobs, operations, targets, and 
permissions 
• Oozie information about jobs 
• HBase operation authorisation for data access and administrative 
operations, such as global privilege authorisation, table and column 
family privilege authorisation, grant permission, and revoke 
permission.
7.4 Masking 
confidential 
information 
and test data 
with IBM 
Optim 
InfoSphere Optim data masking on demand is the only masking service 
available for Hadoop-based systems. You can decide when and where to 
mask: for example, in relational data sources, in reports or inside 
applications. 
InfoSphere Optim Data Masking de-identifies or obfuscates sensitive data 
such as PII, business data (revenues, HR, etc.) and corporate secrets in big 
data environments. Using InfoSphere Optim protects data against theft and 
misuse in accordance with compliance mandates. InfoSphere Optim Data 
Masking ensures data privacy, enables compliance and helps manage risk. 
InfoSphere Optim is first to market with data masking on demand for 
Hadoop-based systems. InfoSphere Optim de-identifies data to ensure 
privacy while keeping the original context to facilitate business processes. 
Flexible masking services allow you to create customised masking routines 
for specific data types or leverage out of the box support. 
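One common masking technique, deterministic tokenisation, can be sketched as follows (an illustrative Python example; this is not Optim's actual algorithm, and the salt handling is simplified):

```python
import hashlib

def mask_value(value, salt="tenant-secret"):
    """De-identify a sensitive field deterministically: the same input
    always yields the same token, so joins and analytics across data
    sets still work, but the original value is not recoverable without
    the salt and a brute-force search."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()
    return "MASKED-" + digest[:12]

# A hypothetical vehicle registration masked consistently.
a = mask_value("DE-1234567")
b = mask_value("DE-1234567")   # same input -> same token
c = mask_value("DE-7654321")   # different input -> different token
```

Keeping masking deterministic is what "keeping the original context to facilitate business processes" requires: referential integrity across masked data sets is preserved.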
IBM InfoSphere Optim Test Data Management optimises and automates the 
test data management process. Prebuilt workflows and services facilitate 
continuous testing and Agile software development. IBM InfoSphere Optim 
Test Data Management helps development and testing teams use realistic, 
right-sized test databases or data warehouses to accelerate application 
development. 
InfoSphere Optim Test Data Management helps organisations: 
• Streamline test data management processes to help reduce costs 
and speed application delivery 
• Analyse and refresh test data on demand for developers and testers 
• Create production-like environments to shorten iterative testing 
cycles, support continuous testing and accelerate time to market 
• Protect sensitive data based on business polices and help reduce risk 
in testing, training and development environments 
• Use a single, scalable enterprise solution across applications, 
databases and operating systems 
• Provide a comprehensive continuous testing solution through 
Rational Test Workbench for functional, regression, integration 
(service virtualisation) and load testing.
8. Big Data Platform integration 
Integrated business 
functionality is delivered 
through the breadth of 
the IBM Big Data 
Platform which includes 
the following 
capabilities:
8.1 Connectors 
InfoSphere BigInsights provides connectors to IBM DB2® database 
software, the IBM PureData™ Systems family of data warehouse 
appliances, IBM Netezza appliances, IBM InfoSphere Warehouse and the 
IBM Smart Analytics System. These high-speed connectors help simplify 
and accelerate data manipulation tasks. 
Standard Java Database Connectivity (JDBC) connectors make it possible 
for organizations to quickly integrate with a wide variety of data and 
information systems including Oracle, Microsoft® SQL Server, MySQL 
and Teradata. 
This connectivity encourages a platform approach to big data projects: 
for example, selections of data can be drawn from BigInsights to 
support high-performance analytics on a Netezza appliance and the 
results posted back to the all-encompassing data store. Queries from 
SQL tools can concurrently access large data stores in Hadoop for 
long-term history and high-performance systems for current operational 
reporting. This platform approach replaces the stove-pipe model in which 
each data store has its own reporting tools, support team, and 
labour-intensive maintenance overheads. 
8.2 Data warehouse 
integration 
IBM’s approach of combining Hadoop with in-memory database 
processing lets applications see a single warehouse while providing 
more agile, faster response times for ROLAP, all at lower cost. 
In addition, the IBM solution also provides a cheaper and larger-scale 
engine for data integration and transformation: both extract-transform-load 
(ETL) and extract-load-transform (ELT). 
DataStage can push down transformations into the IBM Hadoop solution 
to perform transformations within Hadoop. DataStage in this case 
automatically creates MapReduce jobs that perform the transformation 
work using an ELT approach. 
IBM therefore does not advocate moving the entirety of the data mart 
and data warehouse landscape into a Hadoop solution. Rather, IBM 
recommends taking a measured approach to constructing a logical data 
warehouse that is comprised of fit-for-purpose components that achieve 
the best result for particular workloads while minimising cost.
8.3 IBM Symphony 
By using the IBM Symphony scheduler, the solution provides high 
availability, as well as higher performance for many MapReduce 
workloads due to the faster scheduling of workloads. In addition, the 
Symphony system can be deployed to support multi-tenancy, so that the 
same cluster can simultaneously support Hadoop/MapReduce 
workloads as well as other work. 
8.4 Big Match & MDM 
For users performing customer analytics, InfoSphere BigInsights 
leverages the probabilistic matching engine of InfoSphere Master Data 
Management to match and link customer information directly in 
Hadoop, at high speeds. A unique identifier for each customer ensures 
analytics are performed on more accurate and complete information. 
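A much-simplified sketch of what probabilistic matching means in practice: weighted per-field similarity scores combined against a threshold. The field weights, threshold and similarity measure below are invented for illustration and are not Big Match's actual model.

```python
from difflib import SequenceMatcher

# Illustrative field weights: how much each attribute contributes
# to the overall match score (these are assumptions, not Big Match's).
WEIGHTS = {"name": 0.6, "postcode": 0.4}

def field_sim(a, b):
    # Character-level similarity in [0.0, 1.0].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b):
    # Weighted sum of per-field similarities.
    return sum(w * field_sim(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())

a = {"name": "Jon Smith",  "postcode": "SW1A 1AA"}
b = {"name": "John Smith", "postcode": "SW1A 1AA"}
score = match_score(a, b)
is_match = score > 0.85   # illustrative threshold
```

The point of the probabilistic approach is exactly this tolerance: "Jon Smith" and "John Smith" at the same postcode link to one customer identifier despite the spelling difference.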
8.5 Information Server: DataStage & QualityStage 
IBM InfoSphere DataStage® includes a connector that enables InfoSphere 
BigInsights data to be leveraged within an InfoSphere DataStage 
extract/transform/load (ETL) or extract/load/transform (ELT) job. 
The Balanced Optimiser functionality places each workload where it will 
run most efficiently: within the ETL engine, pushed down into a DB2 
database, or as a MapReduce process in Hadoop. 
Quality rules and actions can be run through the integration with 
Information Server QualityStage. This is particularly useful given the 
disparate data types typically ingested into Hadoop, helping to avoid 
the creation of a large but useless data store. 
8.6 InfoSphere Streams 
InfoSphere BigInsights includes a limited-use license of InfoSphere 
Streams, which enables real-time, continuous analysis of data on the fly. 
InfoSphere Streams is an enterprise-class stream-processing system that 
can extract actionable insights from data in motion while transforming 
data and transferring it to InfoSphere BigInsights at high speeds. 
This enables organizations to capture and act on business data in real 
time — rapidly ingesting, analyzing and correlating information as it 
arrives — and fundamentally enhance processing performance. 
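The essence of analysing data in motion is operating on a bounded window over an unbounded stream, acting on each event as it arrives rather than after it lands in storage. The sketch below is a minimal Python illustration of that idea with an invented engine-temperature feed and threshold; Streams itself is programmed with SPL operators, not Python.

```python
from collections import deque

def rolling_alerts(stream, window=3, threshold=100.0):
    # Fixed-size sliding window over an unbounded stream: deque(maxlen=n)
    # discards the oldest reading automatically as each new one arrives.
    win, alerts = deque(maxlen=window), []
    for reading in stream:
        win.append(reading)
        avg = sum(win) / len(win)
        if avg > threshold:          # act while the data is still in flight
            alerts.append(round(avg, 1))
    return alerts

engine_temps = [90.0, 95.0, 110.0, 120.0, 130.0]  # illustrative feed
alerts = rolling_alerts(engine_temps)
# alerts == [108.3, 120.0]
```

Per-event processing of this shape is what lets an overheating engine be flagged within seconds of the trend emerging, instead of in the next batch load.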
8.1 Cognos Business 
Intelligence 
InfoSphere BigInsights includes a limited-use license for Cognos Business 
Intelligence, which enables business users to access and analyze the 
information they need to improve decision making, gain better insight 
and manage performance. Cognos Business Intelligence includes 
software for query, reporting, analysis and dashboards, as well as 
software to gather and organize information from multiple sources.
  • 1. The top 10 reasons manufacturers should use BigInsights as major vehicle IBM their Hadoop Platform
  • 2. This paper draws from our extensive experience of Hadoop applications in the automotive industry, where advice from those with the early adoption experiences shows the relevance of the IBM BigInsights platform approach and technical capabilities. ii Introduction iv Drivers for change in the vehicle industry iv The market opportunity for Connected Cars v The business need for Hadoop capabilities vii Big data in the vehicle industry enables radical new use cases viii The top 10 reasons to use IBM BigInsights as your Big Data Hadoop system 1. IBM as your partner for driving Data warehouse modernisation with Hadoop ................ 1 2. Connected vehicle IBM solution elements ........................................................................... 3 3. BigInsights product depth completes Hadoop .................................................................... 6 3.1 Overview of BigInsights ................................................................................................. 6 3.1 Users supported ............................................................................................................ 7 3.2 Application lifecycle management ................................................................................. 7 3.1 What IBM BigInsights adds to open source Hadoop .................................................... 9 3.2 BigInsights for Hadoop: Technical capabilities which lend themselves to the requirements of the vehicle industry ........................................................................................ 10 3.3 GPFS ........................................................................................................................... 11 3.4 BigSQL ........................................................................................................................ 12 3.5 BigSheets spread-sheet interface ............................................................................... 
12 3.6 Text Analytics .............................................................................................................. 12 3.7 Adaptive Map-Reduce ................................................................................................. 12 4. Big Data platform technology supplier commitment and stability .................................. 13 4.1 Scale and commitment ................................................................................................ 13 4.2 IBM Hadoop deployments ........................................................................................... 13 4.3 Commitment to open source ....................................................................................... 13 4.4 The key differences between IBM, Cloudera, Pivotal and Hortonworks .................... 15 5. BigInsights Performance and stability ................................................................................ 17 5.1 Overview ...................................................................................................................... 17 5.2 Adaptive MapReduce for job acceleration .................................................................. 17 5.3 Comprehensive SQL Performance and support ......................................................... 17 5.4 Federated data access ................................................................................................ 19 5.5 Performance architecture ............................................................................................ 19 5.6 Real time analytics ...................................................................................................... 20 5.7 Resilience delivered from Platform Symphony ........................................................... 20 5.8 Platform scheduling ..................................................................................................... 
20 5.9 High Availability ........................................................................................................... 22 6. Real time streaming analytics .............................................................................................. 29 7. Data security .......................................................................................................................... 31 7.1 Granular data security ................................................................................................. 31 7.2 User security................................................................................................................ 32
  • 3. 7.3 Audit and integration with Guardium, the leading security platform ........................... 32 7.4 Masking confidential information and test data with IBM Optim ................................. 34 8. Big Data Platform integration ............................................................................................... 35 8.1 Connectors .................................................................................................................. 36 8.2 Data warehouse integration ........................................................................................ 36 8.3 IBM Symphony ............................................................................................................ 37 8.4 Big Match & MDM ....................................................................................................... 37 8.5 Information Server: DataStage & QualityStage .......................................................... 37 8.6 InfoSphere Streams .................................................................................................... 37 8.1 Cognos Business Intelligence ..................................................................................... 37 8.2 Search Indexing and integration with enterprise wide search .................................... 38 8.3 Spreadsheet visualisation ........................................................................................... 38 8.4 BigR ............................................................................................................................. 40 8.5 SPSS ........................................................................................................................... 40 8.6 Text Analytics with AQL .............................................................................................. 40 8.7 Data Lineage and governance .................................................................................... 
40 8.8 SAS integration ........................................................................................................... 41 8.9 BigSQL ........................................................................................................................ 41 9. Speed of deployment ............................................................................................................ 42 10. IBM Big Data delivery is proven with case studies from around the world .................... 44 10.1 General Motors ............................................................................................................ 44 10.2 Big Data Pioneers: Volvo Case Study ........................................................................ 47 10.3 PSA.............................................................................................................................. 48 10.4 Science & Technology Facilities Council .................................................................... 48 10.5 Octotelematics ............................................................................................................. 48 10.6 BMW ............................................................................................................................ 49 11. IBM Company information .................................................................................................... 50 iii
  • 4. iv Introduction We are at a tipping point of mass awareness of, and mass adoption of internet connected devices taking us beyond the smart phone into pervasive technology in our homes and workplaces. Domestic appliances, industrial equipment and significantly, the connected vehicle will become part of the new normal. This paper explains why IBM’s Big Data Platform, and particularly BigInsights for Hadoop provides a scalable, strategic and economic solution to the data challenge facing the vehicle industry. Drivers for change in the vehicle industry All this means data...big data. When we consider the drivers for change in the vehicle industry, they all create and depend upon data that will be high in volume, variety, and velocity.
  • 5. v The market opportunity for Connected Cars The connected car market is summarised in this table from Connected Car Forecast, “Global Connected Car Market to Grow Threefold within Five Years report, Feb 2013” Note the largest segment is in service areas often supplied by 3rd party aftermarket businesses; this is a major opportunity for OEM’S. In summary this commercial opportunity has such scale and momentum that the data challenge facing the IT departments of major OEM’s will be significant. An industry with great skills in technology such as ERP and Computer Aided Design has of course had to create data warehouses for customer and business operations, but never to the scale and scope required to support connected vehicles. The amount of data generated by vehicles is significant, both in terms of the amount kept on the vehicle, and that generated from vehicles but held in data centres The model that determines types and quantities for vehicle data is yet to be standardised, so efficiencies are yet to be created. This means multi-petabyte data centre stores will proliferate amongst the OEM’s until on-vehicle processing reduces the amount of dependence on data centre stored data. One can see a point where the autonomous vehicle will not be dependent upon stored data but this is over the horizon....in between us and this point in time is a mountain of data.
  • 6. This data explosion within the automotive manufacturing industry will create skills shortages. As other industries are also seeing growth in connected devices, so the skills required may have to be grown from within the company by cross training IT staff from the (usually small) Data Warehouse, as well as from ERP and CAD system teams. vi
  • 7. vii The business need for Hadoop capabilities The fundamental needs are to • Lower the costs of managing data as a business asset; IBM’s Hadoop can be delivered for €800 per Terabyte. IBM high performance warehouse appliances (Netezza) can be delivered for €10,000 per Terabyte. This is a fraction of traditional data warehouse systems which cost €24,000 per Terabyte (e.g. Oracle/Teradata) • Provide a landing zone for all data – existing but disparate structured data silo’s, and unstructured data such as video files, and external social media data. Currently most IT departments would not easily support a business executive who asks IT to take data from a telco offering vehicle movement data, and to combine it with video clips from vehicle cameras, and integrate social media site information about vehicle problems simply because IT investment has been focussed on ERP and CAD. Hadoop as part of an big data platform enables this business data as a service Hadoop directly integrates with the existing IT landscape and standards while complementing existing technologies like MPP (Massively Parallel Processing) warehousing and predictive analytics. Uniquely, BigInsights extends open source Hadoop and HDFS capabilities with additional integrated enterprise capabilities. 
This provides clients with additional benefits, through a solution which: • Reduces cost and improves flexibility: by simultaneously allowing a single cluster to be used for multiple groups of users, running a mix of Hadoop MapReduce jobs and other workloads, under the control of the advanced and proven scheduler IBM Platform Symphony • Supports complex workloads: by providing an architecture which can perform a mix of MapReduce and other computing tasks within a single job • Ensures data resilience and reliability: by enabling the use of an enterprise-class alternative to open source HDFS through IBM’s General Parallel File System (GPFS) – a proven, robust, high-performance file system that allows the use of both centralised disks and locally-attached storage simultaneously, while also providing compatibility with Hadoop and providing additional features and reliability. This allows BigInsights to provide greater resiliency, performance and manageability over alternatives, while also saving costs and without compromising fully open accessibility to information within the platform. BigInsights is also pre-integrated with our Platform Computing grid management technology, which we believe is a critical extension to Hadoop in order to increase data processing performance, reduce server idleness, and improve deployment efficiency and cluster flexibility. IBM is committed to the accessibility and security of data within our solution to ensure its full exploitation for retention, exploration, integration, advanced analytics and visualisation. Also, we recommend the use of our bundled IBM InfoSphere Streams component to achieve low-latency analytical processing of data-in-motion. Streams is also pre-integrated with BigInsights and is one of the only commercially supported capability of its kind.
Big data in the vehicle industry enables radical new use cases

The business benefits derived from Connected Vehicles are diverse, so universal and reusable data systems are needed, or else data silos will emerge, severely limiting the efficiency and cost/benefit balance of this market leap. In summary, this collection of use cases justifies integrated data capabilities, and that's where Hadoop comes in. Each use case below drives revenue growth, cost reduction, or both:

• 360° view of vehicle/driver/fleet – Indexing data assets to create a range of search-based applications on federated data. Benefits: shifts the business model from product to customer centricity; the driver can reduce fuel used; the fleet manager can control costs; the OEM has an ongoing relationship with owners, so can increase re-purchase loyalty.
• Social Media Lead Generation – By identifying Twitter discussions which contain a propensity to buy a specific product, we can create a business alert to exploit the information and create a lead. Benefits: sales leads; targeted campaigns against specific competitors; brand and product sentiment analysis to target marketing communications more accurately.
• Warranty Claim Predictions – Accurately identifying which vehicles are likely to have warranty claims well in advance supports predictive maintenance. Benefits: reduction in recalls; increased quality of customer service; warranty provision at lower cost than the current business model.
• 3rd Party Data Sales – Selling granular weather and traffic congestion data. Benefits: the value to meteorological organisations of granular weather data is higher than they are able to harvest from fixed-site weather monitors, and it allows the OEM to enter the traffic data business to generate service revenue.
• User Based Insurance – Enabling offers to customers to reward safe driving through detailed usage analytics. Benefits: sales of pay-as-you-drive insurance, and pay-as-you-drive pricing bundles.
• Personalisation and location services – Provision of end-user refinements as an extension to the infotainment personalisation experience, e.g. seat, temperature and music settings follow the owner; locating the vehicle and identifying a commute partner, for example. Benefits: the customer saves fuel costs, driving sales into new segments; retains customers whose cars are closely integrated into their lifestyle through their smartphone.
• Product Usage for R&D – Deliver granular data to product designers such that they can create cars closer suited to actual usage patterns. Benefits: lower warranty costs, lower running costs, increased customer loyalty revenue.
• User Based Vehicle Upgrade Sales – Generating leads based on identifying the ideal replacement vehicle for each owner, by analysing driver profiles. Benefits: supports finance buy-back and upgrade campaigns to increase revenue and customer satisfaction.
• Online Software Upgrades to Vehicle – Identifying issues where an online upgrade to the in-vehicle software improves performance, addresses safety issues and reduces the need for physical recalls. Benefits: reduced cost of warranty claims and recalls; increased customer satisfaction.
• Sales Forecasting – Harvesting leading indicators for sales from search engines, dealer systems, websites and driver data to support the shift from physical dealer visits to an online relationship with prospects. Benefits: accurate sales forecasting reduces stock levels and increases manufacturing efficiency.
1. IBM as your partner for driving Data warehouse modernisation with Hadoop

The cost of high performance data warehousing is significantly reduced using open source software

Despite technology maturity, the use of relational databases for data warehousing has not addressed the need to load all types of data (particularly unstructured data such as text), nor have its costs fallen in line with other advances. Moore's Law predicts that processing capability doubles every 18 months or so, and the costs of processing have fallen at comparable rates. Hadoop is a disruptive open source technology which reduces the cost per Terabyte from a traditional €100k to a fraction of this – €8k in live examples. IBM's Hadoop distribution, BigInsights, delivers this technology within the Open Source distribution; however, enterprise requirements for resiliency, performance and manageability demand something more than this kernel alone, and BigInsights is designed to address this requirement. For example, Hadoop performance is increased by BigInsights' capability to support Hadoop clusters for multiple groups of users, running a mixture of Hadoop MapReduce and other scheduled workloads on one cluster under the control of IBM Platform Symphony.

The business value of a company data asset is increasing

As line-of-business executives become increasingly technology-savvy and data-dependent, they increasingly expect IT to provide robust, low-cost, flexible yet secure access to their information assets. Traditional data warehouses have not delivered this, so Hadoop is succeeding on the back of willing new demand. Associated with this is the need for a trusted, stable long-term partner to deliver such innovations, and that's where IBM's history provides welcome reassurance.
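The per-terabyte economics quoted above can be made concrete with a little arithmetic. The figures (€800/TB for Hadoop, €10,000/TB for a warehouse appliance such as Netezza, €24,000/TB for a traditional warehouse) are the paper's own; the 80/20 cold/hot split below is an assumed example, not a claim about any particular client:

```python
# Illustrative tiered-storage cost comparison using the per-terabyte
# figures quoted in this paper. The 80/20 split is an assumption.
COST_PER_TB = {"hadoop": 800, "appliance": 10_000, "traditional": 24_000}

def platform_cost(tb_by_tier):
    """Total cost in EUR for a given allocation of terabytes to tiers."""
    return sum(COST_PER_TB[tier] * tb for tier, tb in tb_by_tier.items())

# 100 TB kept entirely in a traditional warehouse ...
all_traditional = platform_cost({"traditional": 100})
# ... versus 80 TB of cold data on Hadoop plus 20 TB hot on an appliance.
tiered = platform_cost({"hadoop": 80, "appliance": 20})

print(all_traditional, tiered)  # 2400000 264000
```

On these assumptions the tiered architecture costs roughly a tenth of the all-traditional one, which is the modernisation argument in a nutshell.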
Data volumes will grow significantly (but with limited predictability), so systems need to be able to land new data at higher rates of velocity, volume and variety

The shift from ERP and CAD data usage in the vehicle industry to one of customer centricity, where complex connected vehicles are sold and serviced, will drive a significant data evolution; one whose capability requirements are addressed perfectly by IBM enterprise data platform systems. IBM clients have proved that these technologies scale well and cater for any kind of data, with resilience and high performance at higher levels than non-IBM alternatives. IBM's Streams product is a vital component to achieve low latency analytical processing of data-in-motion. As it is integrated with BigInsights for Hadoop, the combination addresses the data flows from moving vehicles as well as the analytics required for historic data.

Hadoop delivers the enterprise data landing zone

IBM BigInsights for Hadoop gives enterprise data architects the ability to land, store, query and analyse data of any type. This integrates with existing IT landscapes, so is the ideal place to correlate
data with statistical tools, identify issues that span departments, and provide the tooling business needs to catch the "fast ball" of vehicle-generated data and extract value from it. These capabilities are classified as operational efficiency, advanced analytics, and exploration & discovery.

BigInsights delivers Operational Efficiency

To more effectively handle the performance and economic impact of growing data volumes, architectures incorporating different operational characteristics can be used together. For example, large amounts of cold data in the data warehouse can be archived to an analytics environment rather than to a passive store. InfoSphere BigInsights helps improve operational efficiency by modernizing – not replacing – the data warehouse environment. It can be used as a query-able archive, enabling organizations to store and analyze large volumes of poly-structured data without straining the data warehouse. As a pre-processing hub – also referred to as a "landing zone" for data – InfoSphere BigInsights helps organizations explore their data, determine the high-value assets and extract that data cost-effectively. It also supports ad hoc analysis of large amounts of data for exploration, discovery and analysis.

BigInsights delivers Advanced Analytics

In addition to increasing operational efficiency, some organizations are looking to perform new, advanced analytics but lack the proper tools. With InfoSphere BigInsights, analytics is not a separate step performed after data is stored; instead, InfoSphere BigInsights, in combination with InfoSphere Streams, enables real-time analytics that can leverage historic models derived from data analyzed at rest. InfoSphere BigInsights includes advanced text-analytic capabilities and pre-packaged accelerators.
Organizations can use these pre-built analytic capabilities to understand the context of text in unstructured documents, perform sentiment analysis on social data or derive insight from a wide variety of data sources. BigInsights delivers Exploration & Discovery The explosive growth of big data may overwhelm organizations, making it difficult to uncover nuggets of high-value information. InfoSphere BigInsights helps build an environment well suited to exploring and discovering data relationships and correlations that can lead to new insights and improved business results. Data scientists can analyze raw data from big data sources alongside data from the enterprise warehouse and several other sources in a sandbox-like environment. Subsequently, they can combine any newly discovered high-value information with other data to help improve operational and strategic insights and decision making. The bottom line: with InfoSphere BigInsights, enterprises can finally get their arms around massive amounts of untapped data and mine it for valuable insights in an efficient, optimized and scalable way.
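The sentiment analysis described above can be sketched in miniature. This is a toy stand-in, not IBM's text-analytic engine: real extractors (such as BigInsights' AQL modules) are declarative and far richer. The word lists and example texts here are invented for illustration:

```python
# Toy keyword-based sentiment scorer, illustrating the shape of the task:
# scan unstructured text and score sentiment words near a product mention.
# Word lists are invented; a production extractor would be far richer.
POSITIVE = {"love", "great", "reliable", "smooth"}
NEGATIVE = {"rattle", "recall", "broken", "stalls"}

def sentiment(text: str) -> int:
    """Positive hits minus negative hits over a simple token split."""
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    return len(tokens & POSITIVE) - len(tokens & NEGATIVE)

print(sentiment("Love the new model, so smooth"))            # 2
print(sentiment("Engine stalls again, dealer says recall"))  # -2
```

Applied over call-centre notes or social feeds, even a score this crude can be aggregated per model or per dealer; the point of the platform is to run richer versions of this at scale over the landed data.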
2. Connected vehicle IBM solution elements

IBM delivers an integrated solution for Connected Vehicle IT architectures, already proven to scale for Tier 1 OEMs, as per the following diagram. This is summarised as a series of connected capabilities, and the following describes IBM's components to deliver this as a complete solution:

Efficient data protocols
IBM's messaging appliance, MessageSight, uses the open MQTT protocol and is 4-6 times more efficient than HTTP.

Capture all data onto the centralised landing zone
The combination of Streams' capability to handle large high speed data feeds such as MQTT and BigInsights' strength in storing and processing large datasets for long term storage creates the landing zone platform missing from current data warehouses.

Real Time analytics
IBM Streams contains the filters, complex queries, statistical treatments and data management instructions to support real time analytics. It has the capability to handle streaming data such as video files and large message volumes, and its integration with BigInsights Hadoop storage is particularly useful for connected vehicle systems.

High performance analytics platform
Streams pushes data as it arrives to a high performance data warehousing platform comprising the Netezza (PureSystems) FPGA-based hardware accelerator appliance for high speed analytics, as well as to the long term Hadoop data store in BigInsights.
FPGAs are used in Blu-ray players to remove the CPU bottleneck that high resolution TV would otherwise create; it is the same patented technology that gives IBM its ability to process large amounts of data and deliver high speed analytics.

Application platform
BigInsights' use of Eclipse creates a layered system where data assets can be explored, and applications designed and delivered to address evolving business needs. Management of this process is enabled through IBM Rational, which is commonly used in the vehicle industry to manage CAD development and ERP systems as well as bespoke applications.

Data governance and integration
Data housekeeping, such as securing and maintaining availability of these new and sensitive data assets, is delivered through the enterprise data security applications Guardium and Optim, plus a range of product connectors and accelerators which take the burden of complex system integration away from IBM customers by delivering inter-product integration capabilities.

These capability areas break down into 21 subset technology components, each as defined and recognised by the independent analysts Gartner and Forrester, who publish vendor assessments of them as comparison tables. Mapping IBM's capabilities into the comparison above shows IBM has leadership in all areas: in 13 of the 21 it is the outright leader, and in the remaining 8 IBM holds a leader-quadrant position behind a point-solution vendor.
Enterprise architects assembling these systems can consider IBM the leading vendor of technology and services, potentially as a single-source supplier of an end-to-end system with Hadoop as a key component.
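The efficient-data-protocols point above can be illustrated with back-of-envelope arithmetic. The header sizes below are assumptions for illustration only: MQTT's fixed header is 2 bytes plus a short topic name, while a minimal HTTP POST request line plus headers typically runs to a few hundred bytes:

```python
# Why a lightweight publish protocol such as MQTT beats HTTP for small
# telemetry messages. Overheads below are illustrative assumptions.
MQTT_OVERHEAD = 2 + 20   # 2-byte fixed header + assumed 20-byte topic name
HTTP_OVERHEAD = 300      # assumed request line + typical headers

def bytes_on_wire(payload_len: int, overhead: int) -> int:
    return payload_len + overhead

payload = 40             # e.g. one GPS/speed/fuel sample
ratio = bytes_on_wire(payload, HTTP_OVERHEAD) / bytes_on_wire(payload, MQTT_OVERHEAD)
print(round(ratio, 1))   # falls in the 4-6x efficiency range quoted above
```

The smaller the payload, the larger the ratio, which is why fixed-header protocols pay off for frequent, tiny vehicle telemetry messages.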
3. BigInsights product depth completes Hadoop

3.1 Overview of BigInsights

Hadoop is a key part of a Big Data platform of integrated technology providing the capability to store and analyse vast amounts of data – any kind of data. IBM's primary focus with its InfoSphere BigInsights Hadoop distribution is to fully embrace open source while integrating it into the wider enterprise IT landscape. IBM is in a unique position to accomplish this given our breadth of enterprise capability. In our BigInsights distribution, we have specifically focused on:
• Out of the box integration and optimisation with existing IT capabilities such as data integration, data privacy, data security, and business intelligence components – all aligned to existing standards within the IBM enterprise data management system (DataStage, Optim, Guardium, and Cognos)
• Exploiting the re-use of existing skills in accessing and using the wealth of data BigInsights holds, by providing multiple intuitive interfaces over the same raw data. For example, we provide standard ODBC and JDBC drivers, ANSI standard SQL, a spreadsheet-style user interface that runs in a web browser, and pluggable modules to enable self-service advanced analytics
• Ensuring data resilience and efficient operational management through deep integration with robust High Performance Computing technologies, leveraging IBM's decades of experience and expertise in the HPC field and applying this robustness to a relatively new and emerging technology (Hadoop)

IBM InfoSphere BigInsights was first made generally available in May 2011 with its 1.1 release, primarily containing a distribution of Apache Hadoop and other open source projects along with security, workload management, and administration enhancements. It quickly evolved through 1.2 and 1.3 releases, also delivered in 2011, which added more extensive developer tooling, web-based user interfaces and a variety of enhancements to the original features.
The product offering has continued to evolve quickly, with 1.4 and 2.0 releases both made available in 2012; the current generally available release, from September 2014, is 3.0. While the specific release schedule depends on development requirements, BigInsights generally has a significant product release approximately every 6 months.
3.1 Users supported

InfoSphere BigInsights provides capabilities for a wide range of users. Tools are included that are specific to the goals of each user, such as installing components, developing applications, deploying applications, and running applications to analyse data.

System Administrator
The System Administrator installs, configures, and backs up InfoSphere BigInsights components on the system. This user also monitors the cluster to ensure that the InfoSphere BigInsights environment is healthy and running at optimum capacity.

Application Developer
The Application Developer develops, publishes, and tests applications for InfoSphere BigInsights. This user works with Data Scientists to understand the function of each application, and the business problem that the application helps to solve.

Application Administrator
The Application Administrator publishes applications in the system, deploys applications to the cluster, and assigns permissions to applications. This user works with the Application Developer to ensure that applications are functioning properly before being published and deployed.

Data Scientist
The Data Scientist collects data, completes analysis, and visualises insights to provide answers to specific business questions. This user determines which applications and data sources to aggregate information from, and how to present the results to the intended audience.

3.2 Application lifecycle management

Developers can develop and test InfoSphere BigInsights programs from within the Eclipse environment and publish applications that contain workflows, text analytics modules, BigSheets readers and functions, and Jaql modules to the cluster. After deploying applications to the cluster, the applications can be run from the InfoSphere BigInsights console. The following capabilities are supported by the Eclipse tooling, organised by sub-component of BigInsights:
• Create text analytics modules that contain text extractors by using an extraction task wizard and editor. Developers can then test the extractor by running it locally against sample data, visualise the results of the text extraction, and improve the quality of the extractor by analysing how results were obtained
• Create Jaql scripts or modules by using a wizard, and edit scripts with an editor that provides content assistance and syntax highlighting. Run Jaql explain statements in scripts, and run the scripts locally or against the InfoSphere BigInsights server. Developers can open the Jaql shell from within Eclipse to run Jaql statements against the cluster
• Create Pig scripts by using a wizard and edit the scripts with an editor that provides content assistance and syntax highlighting. Run Pig explain statements and illustrate statements for aliases in scripts, and then run the Pig scripts locally or against the InfoSphere BigInsights server. Developers can open the Pig shell from within Eclipse to run Pig statements against the cluster
• Connect to the Hive server by using the Hive JDBC driver, run Hive SQL scripts and explore the results. Browse the navigation tree to explore the structure and content of the tables in the Hive server
• Use the Java editor to write programs that use MapReduce, and then run these programs locally or against the InfoSphere BigInsights server. Open the InfoSphere BigInsights console to monitor jobs that are created by MapReduce
• Create templates for BigSheets readers or functions and then use the Java editor to implement the classes
• Write Java programs that use the HBase APIs and run them against the InfoSphere BigInsights server. Open the HBase shell from your Eclipse environment to run HBase statements against the cluster

Additional InfoSphere BigInsights capabilities include application linking and pre-built accelerators.
Application linking using BigInsights provides:
• A graphical, web-based means through which to define Oozie workflows
• The ability to compose and invoke new applications by combining existing applications, including integration with BigSheets.
Pre-built applications provide enhanced data import capability:
• A REST Data Source App that enables users to load data from any data source supporting REST APIs into BigInsights, including popular social media services
• A Sampling App that enables users to sample data for analysis
• A Subsetting App that enables users to subset data for data analysis
• Accelerators providing packaged application components to address, for example, social data analytics, machine data analytics and call detail record streaming analytics.

3.1 What IBM BigInsights adds to open source Hadoop

The blue areas illustrate the categories of functionality BigInsights adds to native Hadoop. This emphasises the strategic importance of vendor "clout" behind the selected distribution, as enterprise-scale non-functional requirements must be met to scale out to the demands of the vehicle data industry.
3.2 BigInsights for Hadoop: Technical capabilities which lend themselves to the requirements of the vehicle industry

BigInsights is a complex software product with many capabilities. In engagements with vehicle industry clients there are several unique capabilities which have resulted in its selection over alternatives, and which are fundamental to the benefits delivered. These include the following key areas, each expanded upon in the remainder of this section:
1. IBM's file system – GPFS as an option to the open source HDFS (Hadoop Distributed File System)
2. The way IBM opens up this data store to SQL using IBM BigSQL, which means that you can keep the data in one place – very important given the data volumes connected vehicles are generating, and the limitations of the memory cache approach taken by other branded Hadoop distributions
3. The spreadsheet visualisation front end tool BigSheets, which allows business users to explore the data being landed into Hadoop
4. The ability to analyse text using Annotation Query Language (and IBM applications which add a business analysis front end) such that sentiment on brand and products can be surfaced from call centre, warranty and service datasets
5. Adaptive MapReduce – IBM's pre-integration with Platform Symphony's near real-time, low latency scheduler for more quickly carrying out any MapReduce data processing routines.
3.3 GPFS

Advanced features supported within IBM's General Parallel File System

As vehicles generate large data sets, and the vehicle industry moves to a more customer centric data model, it will find it has requirements for data warehouses of enormous size. GPFS is the proven data system for this requirement. GPFS is a mature, enterprise-class file system that adds a number of important resiliency and maintainability characteristics to Hadoop and can be used as an alternative to HDFS, the Hadoop Distributed File System:
• GPFS is scalable: 400 GBytes/second has been achieved for a single filesystem, and because all GPFS filesystem functions have a parallel architecture, performance can be increased as required by adding more hardware resources
• GPFS is reliable: it is in use for some of the largest and fastest filesystems in the world, supporting batch workloads where each job can run for months, and GPFS has been proven in the field for over 15 years
• Failover clustering is built in to the filesystem
• Active-active clustering across sites provides a "24x7" filesystem
• Remote asynchronous caching is designed to work across very large distances
• Information Lifecycle Management (ILM) provides for data to be moved between different storage pools of disk or even tape
• Rolling upgrades of GPFS software minimise downtime
• Online addition/removal of server nodes or storage resources
• Built-in replication of data under filesystem control, specified down to the file level if required
• Metadata scan operations run at up to millions of files per minute (10 billion files in 43 minutes), which can be used to produce lists of files to back up, move, migrate to other physical storage tiers, or otherwise operate on
• Extended attributes are stored along with the file and can be used as "tags", for example project IDs or other information; these can also be searched using the parallel scan engine
• Fully supported by IBM: the people who write the code support the code, using the same IBM support and problem escalation processes available for mainframe software

GPFS is in use with clients as:
• A "standard" POSIX filesystem
• A supported data storage layer for databases such as DB2, Oracle, and Informix
• A drop-in replacement for HDFS, presenting an HDFS-compatible API.
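As a quick sanity check on the metadata scan figure quoted above, 10 billion files in 43 minutes works out to roughly 233 million files per minute, comfortably in the "millions of files per minute" range:

```python
# Arithmetic check on the GPFS metadata scan rate quoted in this section.
files = 10_000_000_000   # 10 billion files
minutes = 43

rate = files / minutes
print(f"{rate / 1e6:.0f} million files/minute")  # 233 million files/minute
```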
3.4 BigSQL

BigSQL is an ANSI standard SQL interface to data across the distributed filesystem, Hive and HBase, re-using Hive's metadata, providing standard JDBC and ODBC connectivity, and query optimisations to address both small and large queries.

3.5 BigSheets spreadsheet interface

BigSheets is a browser-based, spreadsheet-style user interface allowing users to directly interact with data. As vehicle data is relatively new, its use and structure are not mature or well documented, so business use of the new information assets to support exploration of new use cases requires visualisation tooling such as BigSheets, which is included within BigInsights.

3.6 Text Analytics

BigInsights provides AQL – an analytical environment for extracting structured information from unstructured and semi-structured textual data, including batch and real-time runtimes and an integrated development environment. This is useful, for example, to extract meaning from text fields in CRM databases and social media sites to understand customer sentiment and generate leads from comments about vehicle comparisons.

3.7 Adaptive MapReduce

Adaptive MapReduce is a near real-time, low-latency scheduler that can be transparently used as an alternative to Apache MapReduce. It is in fact a "single-tenant" version of the IBM Platform Symphony scheduler that has been pre-integrated with BigInsights. Scheduling data management tasks allows a company to manage the evolution of its vehicle and customer data lifecycle, scaling storage and processing to support the move to customer centricity. Connected vehicle architectures will place stress on data warehouse systems if not managed effectively.
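The MapReduce pattern that schedulers such as Adaptive MapReduce/Platform Symphony distribute across a cluster can be sketched in a single process. This is only a conceptual illustration (word counting stands in for any map + shuffle + reduce job); the data is invented:

```python
# Minimal single-process sketch of the MapReduce pattern: map emits
# key/value pairs, shuffle groups them by key, reduce aggregates each group.
from collections import defaultdict

def map_phase(records):
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

data = ["engine fault engine", "fault report"]
print(reduce_phase(shuffle(map_phase(data))))
# {'engine': 2, 'fault': 2, 'report': 1}
```

On a cluster, the map and reduce calls run in parallel on many nodes; the scheduler's job, and where Adaptive MapReduce claims its speed-up for small tasks, is deciding when and where each of those pieces executes.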
4. Big Data platform technology supplier commitment and stability

4.1 Scale and commitment

IBM has completed more than 30,000 analytics client engagements and projects $20 billion in business analytics and big data revenue by 2015. IBM has established the world's deepest portfolio of analytics solutions; deploys 9,000 business analytics consultants and 400 researchers; and has acquired more than 30 companies since 2005 to build targeted expertise in this area. IBM secures hundreds of patents a year in big data and analytics, and converts this deep intellectual capital into breakthrough capabilities, including Watson-like cognitive systems. The company has established a global network of nine analytics solutions centres and goes to market with more than 27,000 IBM business partners.

With 434,000 employees and $100BN in revenue, IBM's 100-year momentum continues. The company is renowned for its ability to reinvent itself around business and technology shifts, summarised in the IBM strategy statement:
• We are making markets by transforming industries and professions with data
• We are remaking enterprise IT for the era of the cloud
• We are enabling systems of engagement for enterprise and leading by example.

4.2 IBM Hadoop deployments

Hadoop adoption is a long term strategic platform decision, so warrants consideration of whether the supplier can support the client company as a long term engagement. IBM has over 100 production installations and thousands of users of the free download evaluation system. It has thousands of users of the online Bluemix cloud development platform, where BigInsights is offered as a service. Many thousands of individuals inside IBM and in its customer base use the online education environment called IBM Big Data University.

4.3 Commitment to open source

We distribute 100% open source Apache Hadoop components. This is not proprietary. On top of the open source code we provide analytical tools to help get value from the data.
IBM is committed to supporting the open source movement. IBM helped open platforms such as Linux, Eclipse and Apache become standards with vital industry ecosystems, and then we developed high-value businesses on top of them. Today IBM collaborates broadly to support open platforms such as OpenStack and Hadoop. Because of this commitment, IBM avoids creating any independent fork of Apache project code, and merely selects the open source versions
that we feel best combine the most current and most stable capabilities in the overall Hadoop operating environment. The inner core of BigInsights is Apache Hadoop, and we do inter-version testing of the projects included, so our enterprise customers are assured of a blue-washed and interoperable codebase across the projects. As "most current" and "stable" often conflict, BigInsights does not always use the most current version of projects, but rather the most stable. Where we identify issues in the open source projects, we have a number of committers within our IBM development labs who submit fixes back to the open source community. IBM's goal with this approach is to protect the corporate IT organisation from version management across the various open source projects by providing this pre-tested, interoperable set in InfoSphere BigInsights. An example of this commitment is that IBM contributed 25% of the fixes for a recent release of Hadoop.

The most widely deployed version of BigInsights for Hadoop, v2.1 (the current release is v3), is supported by IBM until 05-Jul-2021.
4.4 The key differences between IBM, Cloudera, Pivotal and Hortonworks

Cloudera

IBM has a comparable number of significant-sized deployments to Cloudera, a Hadoop distributor. However, the company is quite different to IBM. Cloudera is venture capital funded, with $160m invested in its sixth funding round, completed in March 2014. Sales revenue was reported as $73m in 2013, its fourth year of trading. It has 500 employees. In our opinion, the long term destiny for companies like this – niche technology players – is a business exit based on acquisition by an industry giant seeking to fill a technology gap in the enterprise platforms it provides. At this point it is not clear whether any of the enterprise technology mega-vendors have such a gap, so the future of Cloudera is unclear.

Functionality which will be useful to the vehicle data industry – such as real time streaming data analytics, text analytics, analytics accelerator tools, visualisation, enterprise-wide search, indexing, data integration software, connected analytics appliances, relational data marts, and governance, audit and compliance – is all available in IBM BigInsights but not in this alternative.

Documented limitations in the Cloudera query engine explain why data joins can fail to complete. The Cloudera user manual highlights the cause as insufficient memory: all data must be loaded into memory, and as raw data sets can be very large, this can easily exceed the total memory available. Vehicle data volumes being generated are currently very large, and customer datasets are also very large, so this limitation constrains vehicle industry applications. By comparison, IBM BigSQL has no requirement that joined tables fit in the aggregated memory of the data nodes, a limitation which causes queries to run out of memory and fail.
IBM Hadoop is up to 41x faster than Hive 0.12 (Cloudera) on a TPC-H-like benchmark. IBM Hadoop is over 2x faster than (Cloudera) Impala on a TPC-H-like benchmark.

Pivotal/EMC/Greenplum

Greenplum has changed hands several times and is now part of Pivotal, an EMC spin-off, and its Hadoop offering is now called Pivotal HD. IBM BigInsights has many advantages over Pivotal HD:
• IBM BigInsights adds significant functions beyond IBM's 100% open source Hadoop components – it includes analytic accelerators such as Big SQL, BigSheets, BigMatch, BigR and text analytics, unlike Pivotal, which includes proprietary components and lacks added-value software applications such as those listed above
• IBM has already achieved broader marketplace presence and analyst ratings (e.g. Forrester Wave)
• IBM BigInsights offers greater flexibility and a lower cost solution, with availability as software only, on the cloud, or on a flexible IBM System x reference architecture. By comparison, Pivotal is now recommending expensive Isilon storage, which uses the proprietary OneFS file system. IBM
has made significant investments to ensure enterprise open architectures leverage the low cost Hadoop elements rather than creating lock-in solutions. Significantly, Pivotal does not support HDFS; BigInsights offers both HDFS and GPFS support. Where the new SQL HAWQ component of Pivotal HD is offered as a licence-cost option, the powerful IBM Big SQL is included with BigInsights.

IBM offers a complete Big Data platform solution as an integrated architecture that offers more than just Hadoop, including BigInsights, Streams, an MPP database, and Information Integration. Real time analytics – not just batch – is provided by IBM, whereas Pivotal has an in-memory grid, which is not a real time streaming solution. Data security at an enterprise granular level is provided by IBM's Integration and Governance offerings (Information Server, Guardium and Optim), which are integrated with Hadoop; achieving the same level of data management elsewhere would require local development or 3rd party integration projects. As vehicle data will include personal data, its governance is mandated, and delivering the appropriate systems is easier and lower cost with IBM.

Pivotal HAWQ adds the entire RDBMS structure (query engine, storage layer, metadata) to Hadoop. This adds proprietary layers and database complexity to the Hadoop solution. By comparison, IBM Big SQL integrates just the query engine with Hadoop. This allows the query engine to be collocated with the Hadoop cluster, executing against native metadata and HDFS files – which is how IBM won the performance benchmark tests cited above – and IBM Big SQL offers elastic scalability, where nodes can be added or removed online.

Hortonworks

BigInsights and Hortonworks have similar Hadoop components, and both are committed to open source Apache Hadoop, with committers and contributors to the open source Apache Hadoop projects.
However, BigInsights extends value beyond Hortonworks for analytics with its Social Media Accelerator, Machine Data Accelerator, BigSheets spreadsheet and visualization tool, and advanced text analytics. BigInsights also includes Data Explorer for search and indexing in Hadoop and beyond, across all enterprise data: a vital function for making the data accessible to potential users inside the company and, through applications, to its customers. BigInsights Big SQL has advantages over Hortonworks' HiveQL, providing richer SQL, better HBase performance and better short-query performance. For the many large companies already using GPFS, IBM BigInsights 2.1 uniquely offers GPFS as a Hadoop file system, providing enterprise data life-cycle management. BigInsights also includes Adaptive MapReduce (Platform Symphony) for faster MapReduce processing, and it integrates with InfoSphere Streams, which Hortonworks does not, limiting Hortonworks to batch processing.
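The collocated, shared-nothing processing model credited above for Big SQL's benchmark results can be illustrated with a small sketch. This is plain Python, not IBM code: each "node" aggregates only its local partition, and only the small partial results travel to the coordinator to be merged, so raw rows never cross the network.

```python
# Illustrative sketch of shared-nothing, push-down aggregation (not Big SQL
# code): each data node computes a partial aggregate over its local partition,
# and only the small partials are merged centrally.

def local_aggregate(partition):
    """Runs on each data node: sum and count over local rows only."""
    return (sum(partition), len(partition))

def merge_partials(partials):
    """Runs on the coordinator: combine the per-node partial results."""
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

# Three "nodes", each holding one partition of a column of values.
partitions = [[10, 20], [30], [40, 50, 60]]
partials = [local_aggregate(p) for p in partitions]
average = merge_partials(partials)
print(average)  # global average computed without moving raw rows
```

The point of the sketch is the shape of the data flow: the per-node work scales with local data, while the coordinator only ever sees one small tuple per node.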
5. BigInsights Performance and stability

5.1 Overview

IBM InfoSphere BigInsights out-performs open source alternatives on identical hardware: it has been independently benchmarked and proven to be between 4 and 11 times faster than open source alternatives running on identical infrastructure. InfoSphere BigInsights provides several features that help increase performance, as well as enhance its adaptability and compatibility within an enterprise environment.

5.2 Adaptive MapReduce for job acceleration

Jobs running on Hadoop can end up creating many small tasks that consume a disproportionately large share of system resources. To combat this, IBM invented a technique called Adaptive MapReduce, designed to speed up small jobs by changing how MapReduce tasks are handled without altering how jobs are created. Adaptive MapReduce is transparent to MapReduce operations and to the Hadoop application programming interface (API).

5.3 Comprehensive SQL Performance and support

Let's start with the number one reason this release of Big SQL sets a new bar: performance. Benchmark tests indicate that Big SQL executes queries 20 times faster, on average, than Apache Hive 0.12, with improvements of up to 70 times for individual queries. This was achieved by replacing the earlier MapReduce (MR) implementation with a massively parallel processing (MPP) SQL engine that deploys directly on the physical Hadoop Distributed File System (HDFS) cluster. A fundamental difference from other MPP offerings on Hadoop is that this engine actually pushes processing down to the same nodes that hold the data. Because it natively operates in a shared-nothing environment, it does not suffer from the limitations common to shared-disk architectures (for example, the poor scalability and network load caused by the need to move shared data around).
IBM's unique ANSI-standard SQL interface in BigInsights automatically optimises queries so that smaller queries run in memory and bypass MapReduce; larger queries that still rely on MapReduce can also leverage the performance benefits of Adaptive MapReduce. IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run all 99 TPC-DS queries and all 22 TPC-H queries without modification. By contrast, Apache Hive 0.12 executes only 43 of the 99 TPC-DS queries without modification, and in a January 2013 blog post Cloudera describes how its benchmark tests were completed by modifying the TPC-DS queries to SQL-92 syntax and selectively including only 20 of the 99 TPC-DS queries.

IBM Big SQL has many advantages over Impala, including richer SQL support: SQL-92 sub-queries, SQL:1999 aggregate functions and SQL:2003 windowing aggregate functions. Impala is an immature, feature-poor offering with back-level SQL, and SQL tools may not work with Impala due to ODBC and JDBC driver limitations. Big SQL also enables row and column access control, or "fine-grained control", consistent with functionality found in an RDBMS. The comprehensive SQL support of Big SQL 3.0 enables an organization to make full use of its existing SQL skills, reducing the need to augment its analytic applications with Hadoop-specific functions.

Now here's the real value: Big SQL 3.0 can access data from more than BigInsights. It can query and combine data from many data sources, including (but not limited to) DB2 for Linux, UNIX and Windows database software, IBM PureData System for Analytics, IBM PureData System for Operational Analytics, Teradata and Oracle. Organizations can choose to leave data where it currently exists and use BigInsights to augment it where that makes the most sense; note that this approach, minimizing the need to move data, is part of IBM's overall big data and analytics strategy. SPSS and Cognos Business Intelligence also support querying and joining data across disparate data sources, addressing the need to analyze all data, wherever it is located. IBM InfoSphere BigInsights v3.0, with the MPP-based performance and SQL support of Big SQL 3.0, provides an enterprise-ready Hadoop distribution that minimizes the impact on users while enabling IT to adopt this new technology into its data architecture strategy.
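The SQL:2003 windowing aggregates mentioned above can be shown in a short, hedged example. Big SQL itself is not needed to demonstrate the SQL shape: the sketch below runs the same style of windowed aggregate against Python's built-in sqlite3 module (requires SQLite 3.25 or later) as a stand-in engine, with an invented telematics table and column names chosen purely for illustration.

```python
import sqlite3

# Stand-in demonstration of a SQL:2003 windowing aggregate, using SQLite
# (3.25+) rather than Big SQL itself; table and column names are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (vehicle TEXT, ts INTEGER, speed REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("car1", 1, 50.0), ("car1", 2, 60.0), ("car1", 3, 70.0),
     ("car2", 1, 30.0), ("car2", 2, 40.0)],
)
# Rolling average speed per vehicle over the current and previous reading.
rows = conn.execute("""
    SELECT vehicle, ts,
           AVG(speed) OVER (PARTITION BY vehicle
                            ORDER BY ts
                            ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
               AS rolling_avg
    FROM readings
    ORDER BY vehicle, ts
""").fetchall()
print(rows[2])  # ('car1', 3, 65.0) -- average of the last two car1 readings
```

The window clause (`PARTITION BY ... ORDER BY ... ROWS BETWEEN ...`) is the SQL:2003 feature at issue; an engine limited to SQL-92, as the text says of Impala at the time, cannot express this query directly.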
5.4 Federated data access

Big SQL can access data from more than BigInsights. Its federated access allows users to send distributed requests to multiple data sources within a single SQL statement.

Administrators start with a GUI-driven installation tool that guides them to specify which optional components to install and how to configure the platform. Installation progress is reported in real time, and a built-in health check is designed to automatically verify the success of the installation. These advanced installation features minimize the amount of time needed for installation and tuning, freeing administrators to work on other critical projects.

Once the Hadoop cluster is in place, robust job management features give organizations control of InfoSphere BigInsights jobs, user roles, security and key performance indicator (KPI) monitoring. Technical staff can easily direct job creation, submission and cancellation; they can also stay informed of workload progress through integrated job status dashboards, logs and monitors that provide details on configuration, tasks, attempts and other critical information. In addition, InfoSphere BigInsights provides administration features for the Hadoop Distributed File System (HDFS), IBM GPFS™ File Placement Optimizer (FPO), big data applications and MapReduce jobs, and cluster management.

5.5 Performance architecture

The architecture of IBM InfoSphere BigInsights essentially comprises three layers and a management/administration tier. All data is stored in a distributed file system, which can be either open source Apache HDFS or IBM's more enterprise-class General Parallel File System (GPFS). This forms the underlying data persistence layer on which all other components rely, and hence is typically considered the bottom-most layer. In the middle, there are a number of data processing components that all leverage the MapReduce capabilities of Hadoop in order to parallelise their work.
These include:

• Data processing languages such as Pig and Jaql
• Query mechanisms such as Hive
• Indexing mechanisms such as Lucene
• Data load mechanisms such as Sqoop and Flume
• Analytical capabilities such as Text Analytics and Probabilistic Matching
• Data repositories built on top of the distributed file system, such as HBase

MapReduce itself can be considered the backbone of this layer, and IBM provides a high-performance optimisation of open source MapReduce, called Adaptive MapReduce, that accelerates all data processing in this layer through integration with Platform Symphony. IBM is also unique in providing one of the only commercially supported data-in-motion analytics capabilities, InfoSphere Streams, which we originally developed through our Research division, in cooperation with various US governmental agencies. Streams can directly leverage data in Hadoop and in in-memory data grids such as Redis, and also integrates directly with DataStage.
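The MapReduce pattern that forms the backbone of this layer can be sketched in a single process. This is plain Python and purely illustrative: Hadoop (and Adaptive MapReduce) distribute the same three phases — map, shuffle, reduce — across a cluster rather than running them in one interpreter.

```python
from collections import defaultdict

# Single-process sketch of the MapReduce pattern (illustration only; a real
# Hadoop job runs these phases in parallel across the cluster).

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
```

In Hadoop terms, each call to `map_phase` over a split of the input corresponds to one map task; the performance problem Adaptive MapReduce targets arises when a job fragments into very many such small tasks.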
5.6 Real time analytics

In addition, to address true real-time (data-in-motion) requirements, we integrated IBM InfoSphere Streams. This is a unique capability that IBM initially developed in partnership with its Research division and various US governmental agencies, to process large quantities of both structured and unstructured data with both high throughput and low latency. Streams uses its own in-memory processing and node coordination facilities to achieve microsecond latencies, but can use InfoSphere BigInsights and a number of relational databases both as a source of historical information and as a target in which to store information for retention purposes. In addition, Streams can integrate with in-memory data grids to support low-latency lookup of information, typically reference data.

5.7 Resilience delivered by Platform Symphony

5.8 Platform scheduling

Platform is IBM's cluster management and scheduling system, which can support diverse compute- and data-intensive applications. Platform is a mature and well-established product, used across many industries for grid-centric workloads. The major benefits that Platform brings to BigInsights are as follows:

• Recovery and reliability:
− Hadoop jobs, and job tasks, are recoverable in the event of node failure
− The Platform infrastructure has no single point of failure
− All services are highly available, and will be restarted automatically on alternative servers in the event of a management server failure
• Resource sharing and flexibility:
− Platform can manage both Hadoop and non-Hadoop workloads within the same cluster, including provisioning through the use of the optional Cluster Manager
− Multiple IBM and third-party analytic applications can be supported on a shared infrastructure, e.g. InfoSphere Streams, InfoSphere DataStage, SPSS, SAS, R, etc.
− Infrastructure can be shared across development, test and production environments, and across different user groups, clusters and workload types, driving greater efficiency and utilisation while reducing costs
• Scheduling agility:
− Agile scheduling ensures that time-critical workloads start and finish fast
− Priority can optionally be given to interactive jobs (i.e. BigSheets, Big SQL)
− Resource allocations shift instantly based on priority adjustments and proportional allocations at run time
− Platform's highly effective scheduling ensures that the cluster can be kept at high average levels of utilisation: 80-90% average utilisation is not uncommon for Platform clusters

One important addition IBM has made beyond open source capability in terms of scalability is in the area of resource scheduling. When the Adaptive MapReduce framework built into BigInsights is used, a more advanced scheduling mechanism improves scalability by leveraging Platform Symphony's ability to run a number of scheduling agents in parallel, rather than open source's historically singular JobTracker service. This is similar to the scalability improvements in the recently released YARN capability of open source Hadoop, but uses the proven robustness of Platform Symphony rather than months-old and relatively unproven open source technology. For example, one of our largest deployments of BigInsights started with an initial volume of approximately 2.5 PB, has grown over the last couple of years to approximately 5 PB, and is expected to reach 20 PB of data within the next several years.

In combination, GPFS and Platform bring significant operational benefits to the running of Hadoop workloads, and provide the flexibility to support other types of workloads on a common infrastructure.
Coupled with Platform’s resource allocation and prioritisation capabilities, we believe this will drive higher utilisation and efficiency, and will lower operational support costs by virtue of having a single architecture to support diverse workloads and application types. The robust availability and recoverability characteristics provided by Platform and GPFS will provide clients with a Hadoop solution ready for enterprise deployment into what is becoming an increasingly time-sensitive business environment.
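The priority-driven scheduling described above can be pictured with a toy model. This is plain Python, not Platform Symphony's API, and the task names and priority values are invented for illustration: pending tasks sit in a priority queue, and whenever a slot frees up the highest-priority task runs first, which is how an interactive BigSheets or Big SQL job can jump ahead of batch work.

```python
import heapq

# Toy priority scheduler (illustration only; Platform Symphony's real
# policies and APIs are far richer than this sketch).
class Scheduler:
    def __init__(self):
        self._queue = []
        self._seq = 0  # tie-breaker so equal-priority tasks run FIFO

    def submit(self, name, priority):
        # heapq is a min-heap, so negate priority: higher number runs sooner.
        heapq.heappush(self._queue, (-priority, self._seq, name))
        self._seq += 1

    def next_task(self):
        """Called when a slot frees up: return the highest-priority task."""
        return heapq.heappop(self._queue)[2] if self._queue else None

sched = Scheduler()
sched.submit("batch-etl", priority=1)
sched.submit("bigsheets-interactive", priority=10)  # interactive jumps queue
sched.submit("nightly-report", priority=1)
print(sched.next_task())  # bigsheets-interactive
```

A production scheduler additionally handles proportional share, preemption and run-time priority changes, as the bullet list above describes; the sketch shows only the core queueing idea.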
5.9 High Availability

BigInsights provides a highly robust Hadoop solution that automatically handles the failure of management and data nodes without losing data and without any interruption to processing. The recommended configuration of BigInsights uses the Platform Symphony MapReduce scheduler instead of the Apache MapReduce scheduler, and GPFS-FPO as the highly available file system for data storage instead of HDFS (note that GPFS can provide a highly available HDFS file system across sites). GPFS is also used to provide a highly available (and optionally cross-site) file system used by the Symphony scheduler to support high-availability configurations; the use of a shared NAS facility is also an option. When BigInsights is configured in this way, and combined with the appropriate server, network and environmental infrastructure (e.g. power, cooling), it provides a highly available solution.

Platform Symphony requires a shared file system to be accessible between management nodes. In a production environment, two or more management nodes will be configured; the number of management nodes is dictated by two factors. The first is the level of redundancy required: additional management nodes mean the cluster can tolerate more failures at the management level. The second is cluster size and load. Platform Symphony can use multiple instances of the Symphony scheduler (similar to the JobTracker), one for each logical application. As the number of applications increases, Symphony can be scaled out to multiple scheduler instances, load-balanced across the available management nodes; Platform Symphony does this automatically. Therefore, as cluster size or load increases, additional management nodes can be added. All management nodes are active, and the shared file system is used to store component state data.
For example, an instance of a Symphony scheduler stores metadata about all in-flight workload currently being processed. If a management node on which a component is running fails, the component is restarted on another available node; during start-up, the component recovers the state written to the shared file system. The shared file system can be implemented using a variety of technologies and solutions, for example Network Attached Storage (NAS) appliances, where HA features (e.g. dual controllers, RAID) are normally built in. GPFS can also be used: as a clustered file system it provides the required redundancy and high availability, and the Symphony shared file system can be implemented using GPFS through a number of different hardware configurations. One example is shown below.

Figure: Platform Symphony High Availability with GPFS
In the above diagram, three management nodes are shown. Each node has a number of solid-state disks (SSDs); each SSD is divided into two partitions, one for the operating system and the second for GPFS. All disks are GPFS Network Shared Disks (NSDs), and all three servers are GPFS quorum nodes. There are three GPFS failure groups, each created using the direct-attached SSDs on one management node. A replication factor of 3 is used for both data and metadata, so each block of a replicated file is present in all three failure groups. Use of SSDs is not a requirement, but may provide performance advantages for more real-time or latency-sensitive applications, or for file-system metadata.

IBM BigInsights uses an HA manager from Platform Symphony, known internally as the service controller, to manage management components and services. The service controller is responsible for starting one or more instances of each registered service type; the BigInsights and Platform Symphony management components are registered as service types within Platform Symphony. The service controller monitors all service instances and restarts a service instance if it exits unexpectedly. There is always one instance of the service controller running; if the management node on which it is running becomes unavailable, a new instance is restarted on another management node.

Hardware component failures that cause a node to go offline are handled as follows. If the node runs GPFS management functions, or is a Symphony (Adaptive MapReduce) management node, the failover clustering for GPFS or Symphony automatically moves any management functions to another designated node. If the failed node "owns" replicated data (i.e. it is a data node), GPFS marks the node and associated storage as unavailable and "stale", and I/O requests for that data are fulfilled from another copy. If a disk fails, GPFS similarly marks that disk as stale and redirects I/Os to other copies of the data.
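The storage arithmetic behind a replication factor of 3 is worth making explicit. The sketch below is back-of-envelope arithmetic only, comparing three-way replication with the parity-RAID overhead figures quoted later in the backup discussion (20% to 40%); the disk counts are illustrative examples, not a recommendation.

```python
# Back-of-envelope storage overhead: replication vs. parity RAID
# (illustrative arithmetic only; disk counts are invented examples).

def replication_overhead(copies):
    """Extra raw storage beyond one logical copy, as a percentage."""
    return (copies - 1) * 100

def raid_parity_overhead(data_disks, parity_disks):
    """Parity capacity as a percentage of data capacity."""
    return parity_disks / data_disks * 100

print(replication_overhead(3))        # 200 -- three copies, as in default HDFS
print(raid_parity_overhead(10, 2))    # 20.0 -- e.g. a 10+2 RAID 6 group
print(raid_parity_overhead(5, 2))     # 40.0 -- e.g. a 5+2 RAID 6 group
```

This is the gap the text draws on when it argues for transparently moving colder data out of the triple-replicated tier onto RAID-protected shared storage.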
An HA Manager service runs on all management nodes. This service handles the monitoring of critical services and manages the failover steps (for instance, terminating the failed process, binding the floating IP to the standby server, and starting the required process on the standby server). This is not required for Symphony and GPFS, which have their own built-in failover clustering, allowing simplified failover with zero or minimal disruption to ongoing work.
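The checkpoint-and-restart pattern that the service controller and schedulers rely on can be sketched as follows. This is plain Python, not Platform Symphony code, and a local temporary file stands in for the shared GPFS or NAS path a real cluster would use: a component persists its in-flight state to shared storage, and a restarted instance on another node reads it back and carries on.

```python
import json
import os
import tempfile

# Sketch of checkpoint/recovery against a shared filesystem (illustration
# only; a temp file stands in for the shared GPFS/NAS path).
STATE_PATH = os.path.join(tempfile.gettempdir(), "scheduler_state.json")

def checkpoint(in_flight_tasks):
    """Scheduler persists its in-flight workload to the shared filesystem."""
    with open(STATE_PATH, "w") as f:
        json.dump({"in_flight": in_flight_tasks}, f)

def recover():
    """A restarted scheduler instance reloads the state and carries on."""
    with open(STATE_PATH) as f:
        return json.load(f)["in_flight"]

checkpoint(["map-0041", "map-0042", "reduce-0007"])
# ...the management node fails; the service controller restarts the
# scheduler on another node, which recovers the same in-flight workload...
recovered = recover()
print(recovered)  # ['map-0041', 'map-0042', 'reduce-0007']
```

The essential property is that the state lives on storage every management node can reach, which is exactly why the text insists on a highly available shared file system (GPFS or HA NAS) for the management tier.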
NameNode failure

With our use of GPFS as the file system for BigInsights, no active-passive NameNode failover is required: file system management functions are failover-clustered within GPFS itself, and metadata is distributed across the file system. There is no concept of a "master" NameNode to fail; all GPFS services are designed to be mobile around the cluster and to fail over as required. GPFS provides automated failover of file system management functions from any failed node running these functions. Access to data or metadata owned by that node is maintained transparently by redirecting I/O to a replica of the data. I/O is briefly suspended (transparently to users), typically for 1 to 2 minutes, while recovery steps run within GPFS, and then resumes. Note that this transparent failover also extends to the Clustered NFS facility within GPFS, which can be used to provide edge/gateway services for moving data in and out of the cluster.

Recovery from loss of nodes within the cluster

If a data/compute node fails, access to all data is maintained. Data that was "owned" by that node and stored in the GPFS file system remains available, even though one of the copies of the data is no longer accessible on the failed node; this is transparent to the other nodes in the cluster. GPFS has robust clustering built into the file system, and does not require additional hardware or failover software to operate. In a Hadoop cluster it is expected that nodes will become unavailable for a variety of reasons, and the nodes themselves do not generally have any redundancy built into them. The IBM solution's service controller is responsible for starting one or more instances of each registered service type, finding an available node on which to start each service instance, and monitoring all service instances.
Therefore, if a node becomes unavailable, the service controller restarts an instance on another available node. There is always one instance of the service controller running; if the node running the service controller is lost, the service controller itself is automatically restarted on another node.

HA supported at the job level

IBM Platform Symphony replaces the JobTracker component of open source Hadoop with its own scheduler; the TaskTracker component is likewise replaced with a Platform Symphony component that runs on each data node to manage each map or reduce task. A number of Platform Symphony schedulers run within the cluster, one for each configured application, and each scheduler is responsible for managing all jobs submitted on behalf of its application. In terms of job-level HA, failures at both the data node and the management node are relevant: data nodes run all map and reduce tasks associated with any MapReduce jobs submitted to the cluster, while management nodes run a number of management daemons and components. If a data node becomes unavailable, the scheduler detects the loss of communication with the runtime components (the TaskTracker equivalent) on that node and automatically re-queues any map or reduce tasks, belonging to jobs it manages, that were running on the failed data node. The scheduler will then reschedule the map
and reduce tasks on available data nodes. When a management node becomes unavailable, all of its components, including any schedulers, are automatically restarted on other management nodes (Platform Symphony load-balances across management nodes). Each scheduler writes state information to disk, including all in-flight map and reduce tasks; when a scheduler is restarted, this data is read back from disk as part of recovery. The scheduler can therefore recover all necessary information about jobs previously submitted by client applications, and continue processing the workload. As previously described, these HA operations assume a robust, available file system that can be accessed from any management node (typically a GPFS file system, or a NAS in a high-availability configuration). With GPFS, Symphony uses the GPFS shared file system to support HA operations at the job level. While the data nodes do not require access to the shared Symphony file system for jobs to be scheduled, the use of GPFS as a separate highly available file system accessible across the cluster provides reliable storage for the data used by the running workloads. All workload can therefore be recovered in the event of a management node or scheduler failure. There is no requirement to resubmit workload from the client perspective: recovery is automatic, and a client application submitting new work automatically connects to the new instance of the scheduler.

Cluster backups

The fact that all of the Hadoop data can be held in a POSIX-compatible GPFS file system, where it is also accessible using normal operating system commands, allows data to be easily backed up, whether it is physically held inside the data/compute nodes or in a shared disk storage subsystem.
In either case, GPFS allows Hadoop jobs to see the data as if it were "HDFS", and backup software to see the same data as normal files in a POSIX file system. This also reduces the requirement for multiple copies of the data, and data can be transparently moved from inside the Hadoop cluster to less expensive shared storage (using RAID, which carries a 20% to 40% overhead, rather than the three copies used by default by HDFS, with a 200% overhead). Data can even be transparently moved to tape, and automatically and transparently recalled. GPFS also supports snapshots, providing a measure of recovery in the case of accidental deletion or any other need to restore a previous version of data.

Backups need to be taken of management nodes, GPFS nodes and other non-data nodes. Information relevant to re-creating the cluster should also be backed up (e.g. file system configuration information, the image directory for the OS provisioning manager, operating system installation backups for the provisioning manager, network switch configurations, and so on). The general and disciplined use of a provisioning manager such as PCMAE reduces the number of individual items that have to be separately considered for "bare metal" backups. If most systems are deployed through the provisioning manager, then those servers can be easily and reliably
rebuilt exactly, by once again using the provisioning manager. For the cluster, user data should be backed up if it cannot be easily re-created in the case of a total data centre loss, or a catastrophic problem that destroys all of the online data. Offline backups represent a safer alternative, or adjunct, to online backups (snapshots, or backups to disk). This is because a structured and ordered sequence of events must occur to access offline data: the backup system has to request a tape mount of a valid tape volume ID, the tape is mounted, and only then is access provided. These multiple structured steps to access offline storage minimise the chances of data loss through malicious or accidental damage.

Disaster Recovery / Data Corruption

Disaster recovery (DR) planning is an involved process that must balance the Recovery Point Objective (RPO, broadly "data loss"), the Recovery Time Objective (RTO, broadly "time until service is restored"), budgets, and operational and physical resources and constraints. DR is supported by the proposed BigInsights component for scenarios ranging from cold, to warm, to hot, to active-active sites. To support some scenarios, the Symphony and GPFS licenses included with BigInsights may need to be extended to full product licenses for some or all cluster nodes. GPFS is a clustered file system and supports multi-site "24x7" high-availability configurations with data replicated between sites, allowing a site to fail without interrupting access to data in the file system: two sites (plus a single "tiebreaker" server at a third site) can form the basis of a highly available cluster configuration. Recovery of jobs and other services also needs to be considered. Alternatively, you may choose to rely on existing backup and recovery products and procedures for DR. The use of GPFS, which presents all data (including MapReduce data) as a POSIX file system, enables enterprise backup software to be used to back up and restore data.
Note that additional steps and considerations may be required: for example, to back up and restore ACLs, and to re-establish the file system at the DR site. Remote caching of data using GPFS-AFM may also be considered. From a workload management perspective, two primary DR scenarios are supported:

• Active – active
• Active – passive

In the first scenario, management and data nodes are located in at least two data centres (plus a management node at a tiebreaker site; see the diagram below). All management and data nodes are active, and both data centres are used for running jobs. In the event of a DR event that removes one data centre, the remaining data centre continues to process running jobs and to receive new jobs from clients. To support this scenario, the Platform Symphony shared file system must be available to management nodes in both data centres, and Platform Symphony metadata (for example, Hadoop job metadata) must be replicated between the data centres. GPFS enables us to create a clustered file system with this type of HA configuration.

In the second scenario, management and data nodes are again located in all data centres. In this scenario there is no requirement for a tiebreaker site
with respect to managing Platform Symphony. Only the management nodes in the primary data centre are active; data nodes in both sites may be active if application data is replicated between the data centres. In the event of a DR event, the management nodes in the secondary data centre are started. They connect to a shared file system available in the secondary data centre, and all remaining data nodes automatically connect to these management nodes. Workload that was being processed before the DR event is lost and must be resubmitted. There is no requirement in this scenario to synchronously replicate Platform Symphony metadata between the sites; Platform Symphony configuration data (grid and application configuration) should be asynchronously replicated at an appropriate interval, for example every 30 minutes. Again, GPFS can be used as the basis for the Platform Symphony shared file system.

If the proposed solution is correctly implemented and maintained with an appropriate level of redundancy for environmental, systems and network infrastructure, then individual node failures at a site are handled by the built-in clustering of Symphony and by GPFS failover clustering within the file system. A non-exhaustive list of scenarios in which DR would need to be invoked includes the complete failure of the network at a site, a power outage at a site, or any situation in which multiple simultaneous failures take multiple key components offline at the same time: for example, a power spike destroying both redundant network switches. If the DR site is not in an "active-active" configuration for Symphony and GPFS, then a decision would have to be made weighing repair time against the time and cost of invoking DR and the subsequent failback.
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) of the disaster recovery configuration

For the GPFS file system, the RPO of a dual-site active-active solution is zero, as writes happen synchronously between sites under the control of GPFS. Other options, such as asynchronous replication by GPFS (GPFS-AFM), are also possible and would present an RPO of minutes to hours. In an active-active configuration the RTO would be a few minutes: jobs continue to run (with reduced throughput, as capacity is reduced) once any Platform Symphony schedulers are restarted in the second data centre, which takes a few minutes, including the time to detect the failure at the first data centre. In the active-passive configuration, all running workload is lost and must be resubmitted; the RTO is the time taken to start the management nodes in the secondary site, plus the time taken for data nodes to connect to these management nodes (this ignores the cost of having to start data nodes and load application data from backup locations, if required). An RTO of close to zero may be achievable using the fully integrated solution, including Platform Symphony and GPFS, in a multi-site configuration with synchronous data replication between the sites. Note that there may be a reduction in compute resources (unless policy keeps a complete, separate, idle copy of production at the DR site), and that jobs in process at the failed site would need to be restarted or abandoned. During the actual failure of a site, I/O is temporarily suspended while the file system cluster reconfigures, after which work continues as normal; this typically takes from tens of seconds to a few minutes.
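The RPO figures above follow directly from the replication mode, which can be made concrete with a minimal sketch (plain Python, illustrative values only; the 30-minute interval echoes the asynchronous-replication example given for the active-passive scenario):

```python
# Minimal characterization of RPO by replication mode (illustrative only).

def rpo_seconds(replication, interval_s=0):
    """Worst-case data-loss window: zero for synchronous replication,
    up to the replication interval for asynchronous replication."""
    if replication == "synchronous":
        return 0
    if replication == "asynchronous":
        return interval_s
    raise ValueError(f"unknown replication mode: {replication}")

print(rpo_seconds("synchronous"))          # 0 -- active-active GPFS
print(rpo_seconds("asynchronous", 1800))   # 1800 -- e.g. a 30-minute cycle
```

The RTO side is dominated by operational steps (detecting the failure, restarting schedulers or management nodes) rather than by replication mode, which is why the text quotes it in minutes for both scenarios.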
Jobs executing at the surviving site continue to operate (though with the temporary suspension of I/O mentioned above). While the Symphony scheduler cluster reconfigures itself, new jobs can be submitted, though there may be some loss of service until all cluster components are started at the surviving data centre. Workload continues to execute on the cluster while Platform Symphony components are being restarted at the surviving data centre, and all Platform schedulers that were running there are unaffected by the DR event, apart from the loss of service and the requirement to reschedule any tasks that were running at the lost data centre. GPFS can provide a dual-site configuration (plus a quorum/tiebreaker site) that supports active/active replication (see the diagram below), using GPFS-controlled synchronous replication over TCP/IP between the sites. This provides a highly available data store, but may require additional software in the middleware or application layers of the solution stack to achieve "no transactions lost". For example, IBM MQ can be used with GPFS across multiple sites, with GPFS providing a tested and supported highly available data store for MQ messages; this provides an environment in which transactions can be retained even in the case of the loss of a single site.
6. Real time streaming analytics

IBM InfoSphere Streams is an advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources. The solution handles very high data throughput rates: up to millions of events or messages per second.

This graphic illustrates an application which contains statistical rules to process telematics data from cars. One car has triggered an alert for a slippery road surface by sending data showing that three wheels are rotating at different speeds. The approaching cars are instantly alerted.

InfoSphere Streams helps you analyze data in motion, providing sub-millisecond response times and allowing you to view information and events as they unfold, including from moving vehicles. Streams: 
• Supports analysis of continuous data including text, images, audio, voice, video, web traffic, email, GPS data, financial transactions, satellite data and sensor logs. 
• Includes toolkits and accelerators for advanced analytics, including a telco event data accelerator that analyzes large volumes of streaming data from telecommunications systems in near real time, and a social data accelerator for analyzing social media data. 
• Distributes portions of programs over one or more nodes of the runtime computing cluster to help achieve volumes in the millions of messages per second with latencies of under a millisecond. 
• Allows you to filter and extract only the relevant data from large volumes of information, helping to reduce data storage costs.
• Scales from a single server to thousands of compute nodes based on data volumes or analytics complexity. 
• Provides security features and confidentiality for shared information.

Streams simplifies the development of streaming applications through an Eclipse-based integrated development environment (IDE). Streams: 
• Allows you to build applications with drag-and-drop operators, and to dynamically add new views to running applications using data visualization capabilities such as charts and graphs. 
• Enables you to create, edit, visualize, test, debug and run Streams Processing Language (SPL) applications. 
• Provides a composites capability to increase application modularity and support large or distributed application development teams. 
• Allows you to nest and aggregate data types within a single stream definition. 
• Enables applications to be built on a development cluster and moved into production without recompiling.

Streams also extends the value of existing systems by integrating with your applications and supporting both structured and unstructured data sources. Streams: 
• Adapts to rapidly changing data forms and types. 
• Allows you to quickly develop new applications that can be mapped to a variety of hardware configurations. 
• Supports reuse of existing Java or C++ code, as well as Predictive Model Markup Language (PMML) models. 
• Includes a limited license for IBM InfoSphere BigInsights, a Hadoop-based offering for analyzing large volumes of unstructured data at rest. 
• Integrates with IBM DB2, IBM Informix, IBM Netezza, IBM solidDB, IBM InfoSphere Warehouse, IBM Smart Analytics System, Oracle, Microsoft SQL Server, MySQL and more.
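The slippery-road rule illustrated earlier (three wheels rotating at different speeds) can be prototyped outside Streams as follows. In production this logic would be expressed in SPL over the live telematics feed; the divergence threshold here is an assumed value for illustration:

```python
def slippery_road_alert(wheel_rpm, tolerance=0.05):
    """Flag a slippery surface when individual wheel speeds diverge from
    the mean by more than `tolerance` (as a fraction of the mean), as in
    the telematics example above. `tolerance` is an assumed threshold."""
    mean = sum(wheel_rpm) / len(wheel_rpm)
    if mean == 0:
        return False  # vehicle stationary: nothing to infer
    return any(abs(rpm - mean) / mean > tolerance for rpm in wheel_rpm)

print(slippery_road_alert([800, 800, 805, 798]))  # False: normal grip
print(slippery_road_alert([800, 920, 700, 805]))  # True: wheels slipping
```

A Streams application would apply a rule like this per message at millions of events per second, emitting alerts that can be pushed to approaching vehicles.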
7. Data security

7.1 Granular data security

BigInsights integrates with Active Directory for user authentication. A secure connection uses encryption to make data unreadable to third parties while it is sent over the network between Directory Server and clients binding to the secure port using the Secure Sockets Layer (SSL). InfoSphere BigInsights also supports the Kerberos service-to-service authentication protocol, increasing security strength to prevent man-in-the-middle attacks. BigInsights supports integration with LDAP and single sign-on across clusters (e.g. development and test) through the use of the Name Service Switch (NSS) package and Lightweight Third Party Authentication (LTPA) tokens.

The development, deployment and execution of analytic applications or other data processing jobs are controlled through role-based security, and information access is controlled through granular access control lists (ACLs). BigInsights uses POSIX-compliant ACLs to control access to the data itself, down to the data stored in each individual node in the cluster, by using IBM's General Parallel File System (GPFS). GPFS is a kernel-level, POSIX-compliant file system that provides the same level of control for file access security and auditing as other POSIX file systems, allowing standard "owner", "group" and "other" permissions to be assigned and changed using standard operating system commands such as "chmod". In fact, the ACLs within GPFS allow even greater flexibility by allowing additional users and groups to be defined, as well as a "control" level that determines who can change the ACL itself. Perhaps most importantly, as a kernel-level file system, the data stored in GPFS can be shared between BigInsights and any other application without moving or replicating the data.
This immediately improves the flexibility of using the data in BigInsights, removing delays in the movement of data between systems and tools, and minimising the need to reproduce access control definitions at multiple levels.

In addition to standard application authentication via LDAP or Kerberos, Big SQL enables row and column access control, sometimes described as fine-grained control, consistent with the functionality found in an RDBMS. This supports compliance with regulations and policies related to data privacy, such as those covering patient health records or securities data, and so suits the compliance challenges of vehicle data, such as the rules limiting the use of eCall data to the specific purpose of emergency response to vehicle problems. To monitor and validate data access, BigInsights' built-in auditing can track changes to access privileges or data objects, and track SQL statement execution and the retrieval of security information.
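The effect of row and column access control can be mimicked in miniature. This Python sketch only simulates the concept; in Big SQL the rules would be declared as row permissions and column masks in SQL, not in application code, and the records, roles and masking format here are invented for illustration:

```python
# Hypothetical telematics records and roles, for illustration only.
TRIPS = [
    {"vin": "VIN001", "location": "52.5,13.4", "purpose": "ecall"},
    {"vin": "VIN002", "location": "48.1,11.6", "purpose": "marketing"},
]

def visible_rows(role):
    """Row permission: the emergency-response role sees only eCall records,
    mirroring the compliance rule limiting the use of eCall data."""
    if role == "emergency_response":
        return [r for r in TRIPS if r["purpose"] == "ecall"]
    return list(TRIPS)

def mask_columns(row, role):
    """Column mask: roles other than emergency response see a redacted
    location value instead of the precise coordinates."""
    if role != "emergency_response":
        return dict(row, location="XX.X,XX.X")
    return dict(row)

analyst_view = [mask_columns(r, "analyst") for r in visible_rows("analyst")]
print(analyst_view[0]["location"])  # XX.X,XX.X — masked for the analyst role
```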
7.2 User security

Administrators can choose flat file, Lightweight Directory Access Protocol (LDAP) or Pluggable Authentication Modules (PAM) authentication for the InfoSphere BigInsights web console. With LDAP authentication, the InfoSphere BigInsights installation program communicates with an LDAP credentials store for authentication. Administrators can then provide access to the InfoSphere BigInsights console based on role membership, making it easy to set access rights for groups of users. InfoSphere BigInsights provides four levels of user roles: 
• BigInsights System Administrator: performs all system administration tasks. For example, a user in this role can monitor the cluster's health and add, remove, start and stop nodes. 
• BigInsights Data Administrator: performs all data administration tasks. For example, these users create directories, run Hadoop file system commands, and upload, delete, download and view files. 
• BigInsights Application Administrator: performs all application administration tasks, for example publishing and un-publishing (deleting) an application, deploying an application to the cluster and removing it, configuring application icons, applying application descriptions, changing the runtime libraries and categories of an application, and assigning permissions for an application to a group. 
• BigInsights User: runs applications that the user has been given permission to run, and views the results, data and cluster health. This is the role most commonly granted to cluster users who perform non-administrative tasks.

MapReduce jobs can be run under designated account IDs, which helps tighten security, access control and auditing. Integration of InfoSphere BigInsights with IBM InfoSphere Guardium® data security software helps organizations manage the security and auditing needs of Hadoop in the same way they manage traditional structured data sources.
7.3 Audit and integration with Guardium, the leading security platform

BigInsights can be configured to collect a range of audit information. BigInsights stores security audit information as audit events in its own audit log files for general security tracking. The log files are written to the file system in directories using a date-based naming protocol, and can only be accessed by administrators. You can also configure InfoSphere BigInsights to send audit log events to InfoSphere Guardium for security analysis and reporting via the Guardium Proxy. An audit message contains three critical pieces of information derived from an audit event: the audit event timestamp, the component that generated the audit event, and the audit message itself. Once BigInsights events exist in the InfoSphere Guardium repository, other InfoSphere Guardium features such as workflow (to email and track report sign-off), alerting and reporting become available.

InfoSphere Guardium has a secure, tamper-proof repository. All audit
information is stored in a secure repository where it cannot be modified, even by privileged users. Once data is collected and written to Guardium, there is no way for it to be modified, which guarantees the non-repudiation of the data. This secure repository supports separation of duties and absolves database administrators of any suggestion that they might have changed audit data to "cover their tracks", even in a legal setting. In addition, Guardium has a hardened operating system and database kernel: there is no way for users to directly access the underlying operating system, file system or database. As an added precaution, all unused software components of the operating system and the embedded database have been removed or disabled.

The security audit information that InfoSphere BigInsights generates depends on your environment. The following list is representative of the type of information generated: 
• Hadoop Remote Procedure Call (RPC) authentication and authorisation successes and failures 
• Hadoop Distributed File System (HDFS) file and permission-related commands such as cat, tail, chmod, chown and expunge 
• Hadoop MapReduce information about jobs, operations, targets and permissions 
• Oozie information about jobs 
• HBase operation authorisation for data access and administrative operations, such as global privilege authorisation, table and column family privilege authorisation, grant permission, and revoke permission.
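An audit message carries the three pieces described above: timestamp, generating component, and the message text. A minimal parser for such events might look like this; the pipe-delimited line format is an assumption for illustration, not the exact BigInsights log layout:

```python
def parse_audit_event(line):
    """Split an audit log line into its three critical pieces:
    the event timestamp, the component that generated it, and
    the audit message itself. The '|' delimiter is assumed."""
    timestamp, component, message = line.split("|", 2)
    return {"timestamp": timestamp, "component": component, "message": message}

event = parse_audit_event(
    "2014-06-01T12:00:00Z|hdfs|chmod /data/telematics denied for user jdoe"
)
print(event["component"])  # hdfs
print(event["message"])    # chmod /data/telematics denied for user jdoe
```

Events in this shape can then be forwarded (via the Guardium Proxy, in the integrated setup) into the tamper-proof Guardium repository for alerting and reporting.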
7.4 Masking confidential information and test data with IBM Optim

InfoSphere Optim data masking on demand is first to market and the only masking service available for Hadoop-based systems. You can decide when and where to mask: for example, in relational data sources, in reports, or inside applications. InfoSphere Optim Data Masking de-identifies or obfuscates sensitive data, such as PII, business data (revenues, HR and so on) and corporate secrets, in big data environments. Using InfoSphere Optim protects data against theft and misuse in accordance with compliance mandates: it ensures data privacy, enables compliance and helps manage risk. InfoSphere Optim de-identifies data while keeping the original context, to facilitate business processes. Flexible masking services allow you to create customised masking routines for specific data types, or to use the out-of-the-box support.

IBM InfoSphere Optim Test Data Management optimises and automates the test data management process. Prebuilt workflows and services facilitate continuous testing and Agile software development, helping development and testing teams use realistic, right-sized test databases or data warehouses to accelerate application development.
InfoSphere Optim Test Data Management helps organisations: 
• Streamline test data management processes to help reduce costs and speed application delivery 
• Analyse and refresh test data on demand for developers and testers 
• Create production-like environments to shorten iterative testing cycles, support continuous testing and accelerate time to market 
• Protect sensitive data based on business policies and help reduce risk in testing, training and development environments 
• Use a single, scalable enterprise solution across applications, databases and operating systems 
• Provide a comprehensive continuous testing solution through Rational Test Workbench for functional, regression, integration (service virtualisation) and load testing.
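As a conceptual sketch of de-identification for test environments (Optim itself provides prebuilt, format-aware masking routines; the hash-based approach and VIN example below are illustrations only, not Optim's algorithm):

```python
import hashlib

def mask_vin(vin, salt="test-env"):
    """Replace a real VIN with a repeatable pseudonym of the same length.
    Because the mapping is deterministic for a given salt, joins across
    masked test tables still line up, yet the real identifier never
    leaves production. The salt and 'TST' prefix are assumed conventions."""
    digest = hashlib.sha256((salt + vin).encode()).hexdigest()
    return ("TST" + digest.upper())[: len(vin)]

masked = mask_vin("WVWZZZ1JZ3W386752")
print(len(masked))  # 17: same length as the original VIN
print(masked != "WVWZZZ1JZ3W386752")  # True: the real VIN is gone
```

Preserving length and repeatability is what keeps the masked data "realistic and right-sized" for testing, as described above.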
8. Big Data Platform integration

Integrated business functionality is delivered through the breadth of the IBM Big Data Platform, which includes the following capabilities:
8.1 Connectors

InfoSphere BigInsights provides connectors to IBM DB2® database software, the IBM PureData™ Systems family of data warehouse appliances, IBM Netezza appliances, IBM InfoSphere Warehouse and the IBM Smart Analytics System. These high-speed connectors help simplify and accelerate data manipulation tasks. Standard Java Database Connectivity (JDBC) connectors make it possible for organizations to quickly integrate with a wide variety of data and information systems, including Oracle, Microsoft® SQL Server, MySQL and Teradata. This connectivity encourages a platform approach to big data projects: for example, selections of data can be drawn from BigInsights to support high-performance analytics on a Netezza appliance, and the results posted back to the all-encompassing data store. Queries from SQL tools can concurrently access large data stores in Hadoop for long-term history and high-performance systems for current operational reporting. This replaces the stove-pipe arrangement, where each data store has its own reporting tools, support team and labour-intensive maintenance overheads, with a platform.

8.2 Data warehouse integration

IBM's approach of combining Hadoop with in-memory database processing means the application world sees one warehouse, yet gains a more agile and faster response time for ROLAP, all at lower cost. In addition, the IBM solution provides a cheaper and larger-scale engine for data integration and transformation: both extract-transform-load (ETL) and extract-load-transform (ELT). DataStage can push transformations down into the IBM Hadoop solution, automatically creating MapReduce jobs that perform the transformation work using an ELT approach. IBM therefore does not advocate moving the entirety of the data mart and data warehouse landscape into a Hadoop solution.
Rather, IBM recommends taking a measured approach to constructing a logical data warehouse composed of fit-for-purpose components that achieve the best result for particular workloads while minimising cost.
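The fit-for-purpose routing behind this logical data warehouse — long-term history in Hadoop, current operational data in the warehouse appliance, behind one SQL interface — can be caricatured as a simple router. The store names and the cutoff date below are assumptions for illustration:

```python
def route_store(query_date, cutoff="2014-01-01"):
    """Fit-for-purpose routing: queries against recent operational data
    go to the warehouse appliance; older history is answered from the
    cheaper, larger Hadoop store. The cutoff is an assumed retention
    boundary; ISO date strings compare correctly as text."""
    return "warehouse" if query_date >= cutoff else "hadoop"

print(route_store("2014-06-15"))  # warehouse: current operational reporting
print(route_store("2009-03-02"))  # hadoop: long-term history
```

In the IBM platform this routing is hidden behind tools such as Big SQL, so reporting tools see one warehouse regardless of where each partition physically lives.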
8.3 IBM Symphony

By using the IBM Symphony scheduler, the solution provides high availability, as well as higher performance for many MapReduce workloads thanks to faster scheduling. In addition, Symphony can be deployed to support multi-tenancy, so the same cluster can simultaneously run Hadoop/MapReduce workloads alongside other work.

8.4 Big Match & MDM

For users performing customer analytics, InfoSphere BigInsights leverages the probabilistic matching engine of InfoSphere Master Data Management to match and link customer information directly in Hadoop, at high speed. A unique identifier for each customer ensures analytics are performed on more accurate information.

8.5 Information Server: DataStage & QualityStage

IBM InfoSphere DataStage® includes a connector that enables InfoSphere BigInsights data to be leveraged within an InfoSphere DataStage extract/transform/load (ETL) or extract/load/transform (ELT) job. The Balanced Optimiser functionality in InfoSphere DataStage places the workload where it will run most efficiently: during an ETL process, in a DB2 database, or as a MapReduce process. Quality rules and actions can be run using the integration with Information Server QualityStage; this is particularly useful given the disparate data types typically ingested into Hadoop, helping to avoid creating a large but useless data store.

8.6 InfoSphere Streams

InfoSphere BigInsights includes a limited-use license of InfoSphere Streams, which enables real-time, continuous analysis of data on the fly. InfoSphere Streams is an enterprise-class stream-processing system that can extract actionable insights from data in motion while transforming data and transferring it to InfoSphere BigInsights at high speed. This enables organizations to capture and act on business data in real time, rapidly ingesting, analyzing and correlating information as it arrives, and fundamentally enhances processing performance.
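The probabilistic matching mentioned in 8.4 scores field-level agreement between records rather than requiring exact equality on every field. A toy version of the idea (the weights, fields and threshold below are invented for illustration; a real engine derives weights from the data itself):

```python
# Hypothetical field weights; a production engine would calibrate these
# from value frequencies in the actual customer data.
WEIGHTS = {"name": 4.0, "postcode": 3.0, "vin": 8.0}

def match_score(a, b):
    """Sum the weights of agreeing fields; a higher score means the two
    records are more likely the same customer. Missing or differing
    fields simply contribute nothing."""
    return sum(w for f, w in WEIGHTS.items()
               if a.get(f) and a.get(f) == b.get(f))

rec1 = {"name": "J. Smith", "postcode": "SW1A", "vin": "VIN001"}
rec2 = {"name": "John Smith", "postcode": "SW1A", "vin": "VIN001"}

# Names differ, but postcode and VIN agree: 3.0 + 8.0 = 11.0.
print(match_score(rec1, rec2) >= 10.0)  # True: treat as the same customer
```

Records scoring above the threshold are linked under one customer identifier, which is what makes the downstream analytics described above more accurate.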
8.7 Cognos Business Intelligence

InfoSphere BigInsights includes a limited-use license for Cognos Business Intelligence, which enables business users to access and analyze the information they need to improve decision making, gain better insight and manage performance. Cognos Business Intelligence includes software for query, reporting, analysis and dashboards, as well as software to gather and organize information from multiple sources.