Top reasons to use IBM BigInsights as your Big Data Hadoop system (9 October 2014)
The top 10 reasons major vehicle manufacturers should use IBM BigInsights as their Hadoop Platform
This paper draws from our extensive experience of Hadoop applications in the automotive industry, where advice from early adopters shows the relevance of the IBM BigInsights platform approach and its technical capabilities.
Introduction
Drivers for change in the vehicle industry
The market opportunity for Connected Cars
The business need for Hadoop capabilities
Big data in the vehicle industry enables radical new use cases
The top 10 reasons to use IBM BigInsights as your Big Data Hadoop system
1. IBM as your partner for driving Data warehouse modernisation with Hadoop
2. Connected vehicle IBM solution elements
3. BigInsights product depth completes Hadoop
3.1 Overview of BigInsights
3.2 Users supported
3.3 Application lifecycle management
3.4 What IBM BigInsights adds to open source Hadoop
3.5 BigInsights for Hadoop: Technical capabilities which lend themselves to the requirements of the vehicle industry
3.6 GPFS
3.7 BigSQL
3.8 BigSheets spreadsheet interface
3.9 Text Analytics
3.10 Adaptive MapReduce
4. Big Data platform technology supplier commitment and stability
4.1 Scale and commitment
4.2 IBM Hadoop deployments
4.3 Commitment to open source
4.4 The key differences between IBM, Cloudera, Pivotal and Hortonworks
5. BigInsights Performance and stability
5.1 Overview
5.2 Adaptive MapReduce for job acceleration
5.3 Comprehensive SQL Performance and support
5.4 Federated data access
5.5 Performance architecture
5.6 Real time analytics
5.7 Resilience delivered from Platform Symphony
5.8 Platform scheduling
5.9 High Availability
6. Real time streaming analytics
7. Data security
7.1 Granular data security
7.2 User security
7.3 Audit and integration with Guardium, the leading security platform
7.4 Masking confidential information and test data with IBM Optim
8. Big Data Platform integration
8.1 Connectors
8.2 Data warehouse integration
8.3 IBM Symphony
8.4 Big Match & MDM
8.5 Information Server: DataStage & QualityStage
8.6 InfoSphere Streams
8.7 Cognos Business Intelligence
8.8 Search Indexing and integration with enterprise wide search
8.9 Spreadsheet visualisation
8.10 BigR
8.11 SPSS
8.12 Text Analytics with AQL
8.13 Data Lineage and governance
8.14 SAS integration
8.15 BigSQL
9. Speed of deployment
10. IBM Big Data delivery is proven with case studies from around the world
10.1 General Motors
10.2 Big Data Pioneers: Volvo Case Study
10.3 PSA
10.4 Science & Technology Facilities Council
10.5 Octotelematics
10.6 BMW
11. IBM Company information
Introduction
We are at a tipping point of mass awareness, and mass adoption, of internet-connected devices, taking us beyond the smartphone into pervasive technology in our homes and workplaces. Domestic appliances, industrial equipment and, significantly, the connected vehicle will become part of the new normal.
This paper explains why IBM’s Big Data Platform, and particularly BigInsights for Hadoop provides a
scalable, strategic and economic solution to the data challenge facing the vehicle industry.
Drivers for change in the vehicle industry
All this means data...big data. When we consider the drivers for change in the vehicle industry, they
all create and depend upon data that will be high in volume, variety, and velocity.
The market opportunity for Connected Cars
The connected car market is summarised in this table from the Connected Car Forecast report “Global Connected Car Market to Grow Threefold within Five Years”, February 2013.
Note that the largest segment is in service areas often supplied by 3rd party aftermarket businesses; this is a major opportunity for OEMs.
In summary, this commercial opportunity has such scale and momentum that the data challenge facing the IT departments of major OEMs will be significant. An industry with great skills in technology such as ERP and Computer Aided Design has of course had to create data warehouses for customer and business operations, but never to the scale and scope required to support connected vehicles. The amount of data generated by vehicles is significant, both in terms of the amount kept on the vehicle and that generated from vehicles but held in data centres.
The model that determines types and quantities for vehicle data is yet to be standardised, so efficiencies are yet to be created. This means multi-petabyte data centre stores will proliferate among OEMs until on-vehicle processing reduces the dependence on data centre stored data. One can see a point where the autonomous vehicle will not be dependent upon stored data, but this is over the horizon; between us and that point in time is a mountain of data.
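The scale of that data mountain can be sketched with simple arithmetic. The fleet size, per-vehicle data rate and retention period below are illustrative assumptions, not figures from this paper:

```python
# Illustrative estimate of data-centre storage for a connected fleet.
# All inputs are assumptions made for the sake of the arithmetic.
fleet_size = 2_000_000          # connected vehicles (assumed)
mb_per_vehicle_per_day = 100    # telemetry retained per vehicle (assumed)
retention_days = 365            # one year of history (assumed)

total_mb = fleet_size * mb_per_vehicle_per_day * retention_days
total_pb = total_mb / 1024**3   # MB -> GB -> TB -> PB
print(f"{total_pb:.1f} PB")     # roughly 68 PB under these assumptions
```

Even with modest per-vehicle rates, a large fleet pushes storage into the multi-petabyte range the text describes.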
This data explosion within the automotive manufacturing industry will create skills shortages. As other industries are also seeing growth in connected devices, the skills required may have to be grown from within the company by cross-training IT staff from the (usually small) Data Warehouse team, as well as from ERP and CAD system teams.
The business need for Hadoop capabilities
The fundamental needs are to:
• Lower the costs of managing data as a business asset. IBM’s Hadoop can be delivered for €800 per Terabyte, and IBM high performance warehouse appliances (Netezza) for €10,000 per Terabyte. This is a fraction of the cost of traditional data warehouse systems, which cost €24,000 per Terabyte (e.g. Oracle, Teradata)
• Provide a landing zone for all data – existing but disparate structured data silos, and unstructured data such as video files and external social media data. Currently most IT departments could not easily support a business executive who asks IT to take data from a telco offering vehicle movement data, combine it with video clips from vehicle cameras, and integrate social media information about vehicle problems, simply because IT investment has been focussed on ERP and CAD. Hadoop, as part of a big data platform, enables this business data as a service
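Using the per-terabyte prices quoted above, the cost gap can be made concrete. The 500 TB landing-zone size is an assumed figure for illustration only:

```python
# Cost comparison using the per-terabyte figures quoted in the text.
store_tb = 500                      # assumed landing-zone size in TB
cost_per_tb = {
    "IBM Hadoop (BigInsights)": 800,
    "IBM Netezza appliance": 10_000,
    "Traditional warehouse": 24_000,
}
for platform, eur_per_tb in cost_per_tb.items():
    print(f"{platform}: EUR {store_tb * eur_per_tb:,}")
# At these rates Hadoop is 30x cheaper per TB than the traditional warehouse.
```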
Hadoop directly integrates with the existing IT landscape and standards while complementing
existing technologies like MPP (Massively Parallel Processing) warehousing and predictive analytics.
Uniquely, BigInsights extends open source Hadoop and HDFS capabilities with additional integrated
enterprise capabilities. This provides clients with additional benefits, through a solution which:
• Reduces cost and improves flexibility: by simultaneously allowing a single cluster to be used for
multiple groups of users, running a mix of Hadoop MapReduce jobs and other workloads, under
the control of the advanced and proven scheduler IBM Platform Symphony
• Supports complex workloads: by providing an architecture which can perform a mix of
MapReduce and other computing tasks within a single job
• Ensures data resilience and reliability: by enabling the use of an enterprise-class alternative to
open source HDFS through IBM’s General Parallel File System (GPFS) – a proven, robust, high-performance
file system that allows the use of both centralised disks and locally-attached
storage simultaneously, while also providing compatibility with Hadoop and providing additional
features and reliability.
This allows BigInsights to provide greater resiliency, performance and manageability over
alternatives, while also saving costs and without compromising fully open accessibility to
information within the platform.
BigInsights is also pre-integrated with our Platform Computing grid management technology, which
we believe is a critical extension to Hadoop in order to increase data processing performance,
reduce server idleness, and improve deployment efficiency and cluster flexibility. IBM is committed
to the accessibility and security of data within our solution to ensure its full exploitation for
retention, exploration, integration, advanced analytics and visualisation.
Also, we recommend the use of our bundled IBM InfoSphere Streams component to achieve low-latency analytical processing of data-in-motion. Streams is also pre-integrated with BigInsights and is one of the only commercially supported capabilities of its kind.
Big data in the vehicle industry enables radical new use cases
The business benefits derived from Connected Vehicles are diverse, so universal and reusable data systems are needed, or else data silos will emerge, severely limiting the efficiency and cost/benefit balance of this market leap. In summary, this collection of use cases justifies integrated data capabilities, and that’s where Hadoop comes in.
In the original table, each benefit was keyed as either a revenue or a cost-reduction opportunity.

Use case: 360° view of vehicle/driver/fleet
Description: Indexing data assets to create a range of search-based applications on federated data. Shifts the business model from product to customer centricity.
Benefits: Driver can reduce fuel used. Fleet manager can control costs. OEM has an ongoing relationship with owners, so can increase re-purchase loyalty.

Use case: Social Media Lead Generation
Description: By identifying Twitter discussions which contain a propensity to buy a specific product, we can create a business alert to exploit the information and create a lead.
Benefits: Sales leads. Targeted campaigns against specific competitors. Brand and product sentiment analysis to target marketing communications more accurately.

Use case: Warranty Claim Predictions
Description: Accurately identifying which vehicles are likely to have warranty claims well in advance supports predictive maintenance.
Benefits: Reduction in recalls. Increased quality of customer service. Warranty provision at lower cost than the current business model.

Use case: 3rd Party Data Sales
Description: Selling granular weather and traffic congestion data.
Benefits: The value to meteorological organisations of granular weather data is higher than they are able to harvest from fixed-site weather monitors, and it allows the OEM to enter the traffic data business to generate service revenue.

Use case: User Based Insurance
Description: Enabling offers to customers to reward safe driving through detailed usage analytics.
Benefits: Sales of pay-as-you-drive insurance, and pay-as-you-drive pricing bundles.

Use case: Personalisation and location services
Description: Provision of end user refinements as an extension to the infotainment personalisation experience, e.g. seat, temperature and music settings follow the owner; locating the vehicle and identifying a commute partner, for example.
Benefits: Customer saves fuel costs, which drives sales into new segments. Retains customers whose cars are closely integrated into their lifestyle through their smartphone.

Use case: Product Usage for R&D
Description: Deliver granular data to product designers so that they can create cars closer suited to the actual usage pattern.
Benefits: Lower warranty costs, lower running costs, increased customer loyalty revenue.

Use case: User Based Vehicle Upgrade Sales
Description: Generating leads based on identifying the ideal replacement vehicle for each owner, by analysing driver profiles.
Benefits: Supporting finance buy-back and upgrade campaigns to increase revenue and customer satisfaction.

Use case: Online Software upgrades to vehicle
Description: Identifying issues where an online upgrade to the in-vehicle software improves performance, addresses safety issues and reduces the need for physical recalls.
Benefits: Reduced cost of warranty claims and recalls. Increased customer satisfaction.

Use case: Sales Forecasting
Description: Harvesting leading indicators for sales from search engines, dealer systems, websites and driver data to support the shift from physical dealer visits to an online relationship with prospects.
Benefits: Accurate sales forecasting reduces stock levels and increases manufacturing efficiency.
1. IBM as your partner for driving Data warehouse modernisation with Hadoop
The cost of high performance data warehousing is significantly reduced using open source software
Despite technology maturity, the use of relational databases for data warehousing has not addressed the need to load all types of data (particularly unstructured data such as text), nor have its costs fallen in line with other advances. Moore’s Law predicts that processing capability doubles every 18 months or so, and the costs of processing have fallen at comparable rates. Hadoop is an open source disruptive technology which reduces the cost per Terabyte from a traditional €100k to a fraction of this – €8k in live examples.
IBM’s Hadoop distribution, BigInsights, delivers the technology of the open source distribution, but enterprise requirements for resiliency, performance and manageability of this kernel require something more. BigInsights is designed to address this requirement. For example, Hadoop performance is increased by BigInsights’ capability to support Hadoop clusters for multiple groups of users, running a mixture of Hadoop MapReduce and other scheduled workloads on one cluster under the control of IBM Platform Symphony.
The business value of a company data asset is increasing
As line of business executives become increasingly technology savvy and data dependent, their expectation increases that IT should provide robust, low cost, flexible yet secure access to their information assets. Traditional data warehouses have not delivered, so Hadoop is succeeding on the back of willing new demand. Associated with this is the need for a trusted, long-term, stable partner to deliver such innovations, and that’s where IBM’s history provides welcome reassurance.
Data volumes will grow significantly (but with limited predictability), so systems need to be able to land new data at higher rates of velocity, volume and variety
The shift from ERP and CAD data usage in the vehicle industry to one of customer centricity, where complex connected vehicles are sold and serviced, will drive a significant data evolution; one whose capability requirements are addressed perfectly by IBM enterprise data platform systems.
IBM clients have proved that these technologies scale well and cater for
any kind of data, with resilience and high performance at higher levels
than non-IBM alternatives.
IBM’s Streams product is a vital component to achieve low latency
analytical processing of data-in-motion. As it is integrated with
BigInsights for Hadoop the combination addresses the data flows from
moving vehicles as well as the analytics required for historic data.
Hadoop delivers the enterprise data landing zone
IBM BigInsights for Hadoop gives enterprise data architects the ability to land, store, query and analyse data of any type. It integrates with existing IT landscapes, so it is the ideal place to correlate data with statistical tools, identify issues that span departments, and give the business the tooling required to catch the “fast ball” of vehicle-generated data and extract value from it. These uses are classified as operational efficiency, advanced analytics, and exploration and discovery.
BigInsights delivers Operational Efficiency
To more effectively handle the performance and economic impact of growing data volumes, architectures incorporating different operational characteristics can be used together. For example, large amounts of cold data in the data warehouse can be archived to an analytics environment rather than to a passive store.
InfoSphere BigInsights helps improve operational efficiency by
modernizing — not replacing — the data warehouse environment. It can
be used as a query-able archive, enabling organizations to store and
analyze large volumes of poly-structured data without straining the data
warehouse. As a pre-processing hub — also referred to as a “landing
zone” for data — InfoSphere BigInsights helps organizations explore their
data, determine the high-value assets and extract that data cost-effectively.
It also supports ad hoc analysis of large amounts of data for
exploration, discovery and analysis.
BigInsights delivers Advanced Analytics
In addition to increasing operational efficiency, some organizations are
looking to perform new, advanced analytics but lack the proper tools.
With InfoSphere BigInsights, analytics is not a separate step performed
after data is stored; instead, InfoSphere BigInsights, in combination with
InfoSphere Streams, enables real-time analytics that can leverage historic
models derived from data being analyzed at rest. InfoSphere BigInsights
includes advanced text-analytic capabilities and pre-packaged
accelerators. Organizations can use these pre-built analytic capabilities to
understand the context of text in unstructured documents, perform
sentiment analysis on social data or derive insight from a wide variety of
data sources.
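BigInsights’ production extractors are written in AQL, but the shape of a rule-based sentiment pass can be sketched in plain Python. The word lists here are illustrative stand-ins, not part of the product:

```python
# Minimal rule-based sentiment scorer, sketching the kind of analysis the
# text describes. BigInsights' real extractors are written in AQL; this
# Python stand-in and its word lists are illustrative only.
POSITIVE = {"love", "great", "reliable", "smooth"}
NEGATIVE = {"recall", "fault", "noise", "stalls"}

def sentiment(text: str) -> str:
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Love the new model, very smooth ride"))   # positive
```

A production extractor would add tokenisation, negation handling and dictionaries tuned per domain; the point is only that raw social text can be turned into structured signals.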
BigInsights delivers Exploration & Discovery
The explosive growth of big data may overwhelm organizations, making it
difficult to uncover nuggets of high-value information. InfoSphere
BigInsights helps build an environment well suited to exploring and
discovering data relationships and correlations that can lead to new
insights and improved business results. Data scientists can analyze raw
data from big data sources alongside data from the enterprise warehouse
and several other sources in a sandbox-like environment. Subsequently,
they can combine any newly discovered high-value information with other
data to help improve operational and strategic insights and decision
making.
The bottom line: with InfoSphere BigInsights, enterprises can finally get
their arms around massive amounts of untapped data and mine it for
valuable insights in an efficient, optimized and scalable way.
2. Connected vehicle IBM solution elements
IBM delivers an integrated solution for Connected Vehicle IT architectures, already proven to scale for Tier 1 OEMs, as per the following diagram:
This is summarised as a series of connected capabilities, and the following describes IBM’s
components to deliver this as a complete solution:
Efficient data protocols IBM’s messaging appliance, MessageSight, uses the Open Source MQTT
protocol and is 4-6 times more efficient than HTTP.
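Much of MQTT’s efficiency comes from its minimal framing. As a sketch, the bytes of a QoS 0 MQTT 3.1.1 PUBLISH packet can be compared with a minimal HTTP POST carrying the same payload; the topic name and payload are assumptions, and real deployments would use an MQTT client library (such as Eclipse Paho) rather than hand-built packets:

```python
# Sketch of why MQTT framing is lighter than HTTP for small telemetry
# messages. Topic and payload are illustrative; production code would use
# an MQTT client library (e.g. Eclipse Paho) rather than raw bytes.
topic = b"vehicle/1234/telemetry"   # assumed topic name
payload = b'{"speed_kph":72}'       # assumed 16-byte reading

# MQTT 3.1.1 PUBLISH, QoS 0: packet-type byte + remaining-length byte +
# 2-byte topic length + topic + payload (no packet identifier at QoS 0).
remaining = 2 + len(topic) + len(payload)
mqtt_packet = bytes([0x30, remaining]) + len(topic).to_bytes(2, "big") + topic + payload

# A minimal equivalent HTTP/1.1 POST carrying the same payload.
http_request = (
    b"POST /vehicle/1234/telemetry HTTP/1.1\r\n"
    b"Host: telemetry.example.com\r\n"
    b"Content-Type: application/json\r\n"
    b"Content-Length: 16\r\n\r\n" + payload
)
print(len(mqtt_packet), len(http_request))  # MQTT is several times smaller
```

Real HTTP requests typically carry more headers than this minimal example, which is one reason the gap widens further in practice.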
Capture all data onto the centralised landing zone
The combination of Streams’ capability to handle large high speed data feeds such as MQTT and BigInsights’ strength in storing and processing large datasets for long term storage creates the landing zone platform missing from current data warehouses.
Real Time analytics IBM Streams contains the filters, complex queries, statistical treatments
and data management instructions to support real time analytics. It can
handle streaming data such as video files and large message volumes, and
its integration with BigInsights Hadoop storage is particularly useful for
connected vehicle systems.
High performance
analytics platform
Streams pushes data as it arrives to a high performance data
warehousing platform comprising Netezza (PureSystems) FPGA based
hardware accelerator appliance for high speed analytics as well as to the
long term Hadoop data store in BigInsights.
FPGAs are used in Blu-ray players to remove the CPU bottleneck that high resolution TV would otherwise create; the same patented technology gives IBM its ability to process large amounts of data and deliver high speed analytics.
Application platform BigInsights’ use of Eclipse creates a layered system where data assets
can be explored, and applications designed and delivered to address
evolving business needs. Management of this process is enabled
through IBM Rational which is commonly used in the vehicle industry to
manage CAD development, ERP systems management, as well as
bespoke applications.
Data governance and
integration
Data Housekeeping such as securing and maintaining availability of
these new and sensitive data assets is delivered through enterprise data
security applications Guardium and Optim, and a range of product
connectors and accelerators which take the burden of complex system
integration away from IBM customers by delivering inter-product
integration capabilities.
These capability areas break down into 21 subset technology components, each as defined and recognised by the independent analysts Gartner and Forrester, who publish vendor assessments of them as comparison tables.
Mapping IBM’s capabilities onto these comparisons shows IBM has leadership in all areas: in 13 of the 21 it is the leader above other vendors, and in the remaining 8 IBM holds a leader-quadrant position behind a point-solution vendor.
Enterprise architects assembling these systems can consider IBM the leading vendor of technology and services, potentially as a single-source supplier of an end-to-end system with Hadoop as a key component.
3. BigInsights product depth completes Hadoop

3.1 Overview of BigInsights
Hadoop is a key part of a Big Data platform of integrated technology to
provide the capability to store and analyse vast amounts of data – any kind of
data.
IBM’s primary focus with its InfoSphere BigInsights Hadoop distribution is to
fully embrace open source, while integrating it into the wider enterprise IT
landscape. IBM is in a unique position to accomplish this given our breadth of
enterprise capability.
In our BigInsights distribution, we have specifically focused on:
• Out of the box integration and optimisation with existing IT capabilities
such as data integration, data privacy, data security, and business
intelligence components – all aligned to existing standards within the
IBM enterprise data management system (DataStage, Optim, Guardium,
and Cognos)
• Exploiting the re-use of existing skills in accessing and using the wealth
of data BigInsights holds, by providing multiple intuitive interfaces over
the same raw data. For example, we provide standard ODBC and JDBC
drivers, ANSI standard SQL, a spreadsheet-style user interface that runs
in a web browser, and pluggable modules to enable self-service
advanced analytics
• Ensuring data resilience and efficient operational management through
deep integration with robust High Performance Computing technologies.
These leverage IBM’s decades of experience and expertise in the HPC
field and applying this robustness to a relatively new and emerging
technology (Hadoop)
IBM InfoSphere BigInsights was first made generally available in May 2011
with its 1.1 release, primarily containing a distribution of Apache Hadoop and
other open source projects along with security, workload management, and
administration enhancements. It quickly evolved through 1.2 and 1.3
releases, also delivered in 2011, which added more extensive developer
tooling, web-based user interfaces and a variety of enhancements to the
original features.
The product offering has continued to quickly evolve with a 1.4 and 2.0
release both made available in 2012, and the current generally available
release from September of 2014 is 3.0.
While the specific release schedule depends on development requirements,
BigInsights generally has a significant product release approximately every 6
months.
3.2 Users supported
InfoSphere BigInsights provides capabilities for a wide range of users. Tools
are included that are specific to the goals of each user, such as installing
components, developing applications, deploying applications, and running
applications to analyse data.
System Administrator
The System Administrator installs, configures, and backs up InfoSphere
BigInsights components on the system. This user also monitors the cluster to
ensure that the InfoSphere BigInsights environment is healthy and running at
optimum capacity.
Application Developer
The Application Developer develops, publishes, and tests applications for
InfoSphere BigInsights. This user works with Data Scientists to understand
the function of each application, and the business problem that the
application helps to solve.
Application Administrator
The Application Administrator publishes applications in the system, deploys
applications to the cluster, and assigns permissions to applications. This user
works with the Application Developer to ensure that applications are
functioning properly before being published and deployed.
Data Scientist
The Data Scientist collects data, completes analysis, and visualises insights to
provide answers to specific business questions. This user determines which
applications and data sources to aggregate information from, and how to
present the results to the intended audience.
3.3 Application lifecycle management
Developers can develop and test InfoSphere BigInsights programs from
within the Eclipse environment and publish applications that contain
workflows, text analytics modules, BigSheets readers and functions, and Jaql
modules to the cluster. After deploying applications to the cluster, the
applications can be run from the InfoSphere BigInsights console.
The following capabilities are supported by the Eclipse tooling, organised by sub-component of BigInsights:
17. • Create text analytics modules that contain text extractors by using an
extraction task wizard and editor. Developers can then test the extractor
by running it locally against sample data. Visualise the results of the text
extraction and improve the quality of the extractor by analysing how
results were obtained
• Create Jaql scripts or modules by using a wizard, and edit scripts with an
editor that provides content assistance and syntax highlighting. Run Jaql
explain statements in scripts, and run the scripts locally or against the
InfoSphere BigInsights server. Developers can open the Jaql shell from
within Eclipse to run Jaql statements against the cluster
• Create Pig scripts by using a wizard and edit the scripts with an editor
that provides content assistance and syntax highlighting. Run Pig explain
statements and illustrate statements for aliases in scripts, and then run
the Pig scripts locally or against the InfoSphere BigInsights server.
Developers can open the Pig shell from within Eclipse to run Pig
statements against the cluster
• Connect to the Hive server by using the Hive JDBC driver and run Hive
SQL scripts and explore the results. Browse the navigation tree to explore
the structure and content of the tables in the Hive server
• Use the Java editor to write programs that use MapReduce, and then run
these programs locally or against the InfoSphere BigInsights server. Open
the InfoSphere BigInsights console to monitor jobs that are created by
MapReduce
• Create templates for BigSheets readers or functions and then use the
Java editor to implement the classes
• Write Java programs that use the HBase APIs and run them against the
InfoSphere BigInsights server. Open the HBase shell from your Eclipse
environment to run HBase statements against the cluster.
• Additional capabilities included in InfoSphere BigInsights include
application linking and pre-built accelerators.
Application linking using BigInsights:
• A graphical, web-based means through which to define Oozie workflows
• Compose and invoke new applications by combining together existing
applications, including integration with BigSheets.
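The MapReduce programs this tooling targets all follow the same map / shuffle / reduce pattern. As a minimal sketch of that model (pure Python, not IBM's or Hadoop's actual API), a word count might look like:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record
    for record in records:
        for word in record.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key
    return {key: sum(values) for key, values in groups.items()}

def word_count(records):
    return reduce_phase(shuffle_phase(map_phase(records)))

counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
```

In a real cluster the map and reduce phases run in parallel across data nodes and the shuffle moves data over the network; the logic per phase is the same.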
Pre-built applications
provide enhanced data
import capability:
• REST Data Source App that enables users to load data from any data
source supporting REST APIs into BigInsights, including popular social
media services
• Sampling App that enables users to sample data for analysis
• Subsetting App that enables users to subset data for data analysis
• Accelerators to provide packaged application components to address
social data analytics, machine data analytics and call detail records
streaming analytics, as examples.
3.1 What IBM
BigInsights adds
to open source
Hadoop
The blue areas in the accompanying diagram illustrate the categories of
functionality BigInsights adds to native Hadoop.
This underlines the strategic importance of vendor “clout” behind the
selected distribution, as enterprise-scale non-functional requirements
must be met to scale out to the demands of the vehicle data industry.
3.2 BigInsights for
Hadoop:
Technical
capabilities
which lend
themselves to
the
requirements of
the vehicle
industry
BigInsights is a complex software product with many capabilities. In
engagements with vehicle industry clients, several unique capabilities
have resulted in its selection over alternatives and are fundamental to
the benefits delivered.
These include the following key areas, each expanded upon in the remainder
of this section:
1. IBM’s file system – GPFS as an option to the open source
Hadoop Distributed File System (HDFS)
2. The way IBM opens up this data store to SQL using IBM
BigSQL, which means that you can keep the data in one place
– very important given the data volumes connected vehicles
are generating, and the limitations of the memory cache
approach taken by other branded Hadoop distributions
3. The spreadsheet visualisation front end tool BigSheets which
allows business users to explore the data being landed into
Hadoop
4. The ability to analyse text using Annotative Query Language
(and IBM applications which add business analysis front end)
such that sentiment on brand and products can be surfaced
from call centre, warranty and service datasets.
5. Adaptive MapReduce - IBM’s pre-integration with Platform
Symphony’s near real-time, low latency scheduler for more
quickly carrying out any MapReduce data processing routines.
3.3 GPFS
Advanced features
supported within
IBM’s General
Parallel File System
As vehicles generate
large data sets, and
the vehicle industry
moves to a more
customer centric data
model it will find it
has requirements for
data warehouses of
enormous size. GPFS
is the proven data
system for this
requirement.
GPFS is a mature, enterprise-class file system that adds a number of
important resiliency and maintainability characteristics to Hadoop and can
be used as an alternative to HDFS, the Hadoop Distributed File System.
• GPFS is scalable: 400 GBytes/second has been achieved for a single
filesystem, and due to the parallel architecture of all GPFS filesystem
functions performance can be increased as required by adding more
hardware resources
• GPFS is reliable: it is in use for some of the largest and fastest
filesystems in the world, supporting batch workloads where each job
can run for months, and GPFS has been proven in the field for over 15
years
• Supports failover clustering built-in to the filesystem
• Active-active clustering across sites to provide a “24x7” filesystem
• Remote asynchronous caching designed to work across very large
distances
• Information Lifecycle Management (ILM) which provides for data to be
moved between different storage pools of disk or even tape
• Rolling upgrades of GPFS software to minimise downtime
• Online addition/removal of server nodes or storage resources
• Built-in replication of data under filesystem control, specified down to
the file level if required
• Metadata scan operations at up to millions of files per minute (10 billion
files in 43 minutes), which can be used to produce lists of files to back
up, move, migrate to other physical storage tiers, or perform other
operations on
• Extended attributes which are stored along with the file, and can be
used as “tags”, for example project IDs or other information: these can
also be searched on using the parallel scan engine
• Fully supported by IBM: the people who write the code support the
code, using the same IBM support and problem escalation processes
available for mainframe software
• GPFS is in use with clients as a:
− “Standard” POSIX filesystem
− Supported data storage layer for databases such as DB2, Oracle,
and Informix
− A drop-in replacement for HDFS, presenting an HDFS-compatible API.
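The replication and failure-group behaviour described above can be modelled very simply. The sketch below is an illustrative placement policy in Python, not GPFS internals: each replica of a block lands in a distinct failure group, so the loss of any one group leaves every block readable.

```python
def place_replicas(block_id, failure_groups, replication=3):
    # Place each replica of a block in a distinct failure group,
    # rotating the starting group so load spreads across groups.
    if replication > len(failure_groups):
        raise ValueError("need at least one failure group per replica")
    start = block_id % len(failure_groups)
    order = failure_groups[start:] + failure_groups[:start]
    return order[:replication]

groups = ["fg1", "fg2", "fg3"]
placements = {b: place_replicas(b, groups) for b in range(6)}
```

With replication factor 3 and three failure groups, every block has a copy in every group, which mirrors the "replication factor of 3 ... each block of a replicated file is in all 3 failure groups" configuration discussed later in this paper.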
3.4 BigSQL
BigSQL is an ANSI-standard SQL interface to data across the distributed
filesystem, Hive and HBase. It re-uses Hive’s metadata, provides standard
JDBC and ODBC connectivity, and applies query optimisations to address
both small and large queries.
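Because Big SQL speaks standard SQL over JDBC/ODBC, existing ANSI SQL skills carry over directly. As a stand-in sketch, the query below runs against SQLite from Python, but the SQL itself is the kind of plain ANSI statement a Big SQL client could submit; the table and column names are invented for illustration.

```python
import sqlite3

# SQLite stands in for the SQL engine; the SQL is ordinary ANSI-style SQL.
# Table and column names (vehicle_events, vin, ...) are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicle_events (vin TEXT, event_type TEXT, reading REAL)")
conn.executemany(
    "INSERT INTO vehicle_events VALUES (?, ?, ?)",
    [("V1", "engine_temp", 90.0), ("V1", "engine_temp", 110.0),
     ("V2", "engine_temp", 95.0), ("V2", "speed", 70.0)],
)
rows = conn.execute(
    """SELECT vin, COUNT(*) AS n, AVG(reading) AS avg_reading
       FROM vehicle_events
       WHERE event_type = 'engine_temp'
       GROUP BY vin
       ORDER BY vin"""
).fetchall()
```

The point is that no Hadoop-specific query language is needed: the data stays in one place and standard SQL reaches it.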
3.5 BigSheets
spread-sheet
interface
BigSheets is a browser-based, spreadsheet-style user interface allowing
users to directly interact with data. As vehicle data is relatively new,
its use and structure are not mature or well documented, so business use
of the new information assets to explore new use cases requires
visualisation tooling such as BigSheets, which is included within BigInsights.
3.6 Text Analytics
Text Analytics: BigInsights provides AQL – an analytical environment for
extracting structured information from unstructured and semi-structured
textual data, including batch and real-time runtimes and an integrated
development environment. This is useful, for example, to extract meaning
from text fields in CRM databases and social media sites to understand
customer sentiment and generate leads from comments about vehicle
comparisons.
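To give a flavour of the kind of rule such an environment expresses, here is a deliberately tiny dictionary-based sentiment labeller in Python. It is not AQL and not IBM code; real AQL extractors combine dictionaries, regular expressions and span operations, and the word lists below are invented.

```python
# Toy dictionary-based sentiment extractor (word lists are invented).
POSITIVE = {"reliable", "comfortable", "excellent", "smooth"}
NEGATIVE = {"noisy", "faulty", "recall", "breakdown"}

def extract_sentiment(text):
    # Tokenise crudely, strip punctuation, and count dictionary hits
    words = {w.strip(".,!?").lower() for w in text.split()}
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

reviews = [
    "The ride is smooth and the seats are comfortable.",
    "Second breakdown this year, very noisy engine.",
]
labels = [extract_sentiment(r) for r in reviews]
```

Applied at scale across call centre, warranty and service datasets, rules of this general shape are what surface sentiment on brand and products.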
3.7 Adaptive Map-
Reduce
Adaptive MapReduce is a near real-time, low-latency scheduler that can be
transparently used as an alternative to Apache MapReduce. This is actually
a “single-tenant” version of the IBM Platform Symphony scheduler that has
been pre-integrated with BigInsights.
Scheduling data management tasks allows a company to manage the evolution
of its vehicle and customer data lifecycle, and the storage and processing
that support the move to customer centricity. Connected vehicle
architectures will place stress on data warehouse systems if not managed
effectively.
4. Big Data platform technology supplier
commitment and stability
4.1 Scale and
commitment
IBM has completed more than 30,000 analytics client engagements and
projects $20 billion in business analytics and big data revenue by 2015.
IBM has established the world's deepest portfolio of analytics solutions;
deploys 9,000 business analytics consultants and 400 researchers, and
has acquired more than 30 companies since 2005 to build targeted
expertise in this area.
IBM secures hundreds of patents a year in big data and analytics, and
converts this deep intellectual capital into breakthrough capabilities,
including Watson-like cognitive systems. The company has established a
global network of nine analytics solutions centres and goes to market
with more than 27,000 IBM business partners
With 434,000 employees and $100BN revenues, IBM’s 100 year
momentum continues. The company is renowned for its ability to
reinvent itself around business and technology shifts summarised in the
IBM strategy statement:
• We are making markets by transforming industries and
professions with data
• We are remaking enterprise IT for the era of the cloud
• We are enabling systems of engagement for enterprise and
leading by example.
4.2 IBM Hadoop
deployments
Hadoop adoption is a long-term strategic platform decision, so it warrants
the client company treating the choice of supplier as a long-term
engagement.
IBM has over 100 production installations and thousands of users of the
free download evaluation system. It has thousands of users of the
online Bluemix cloud development platform, where BigInsights is available
as a service. Many thousands of individuals inside IBM and in its customer
base use the online education environment called IBM Big Data University.
4.3 Commitment to
open source
We distribute 100% open source Apache Hadoop components. This is
not proprietary. On top of the open source code we provide analytical
tools to help get value from the data.
IBM is committed to supporting the open source movement. IBM helped
open platforms such as Linux, Eclipse and Apache become standards
with vital industry ecosystems, and then we developed high-value
businesses on top of them. Today IBM collaborates broadly to support
open platforms such as OpenStack and Hadoop.
Because of this commitment, IBM avoids creating any independent fork of
Apache project code, and instead selects the open source versions that we
feel best achieve the most current and most stable capabilities together
in the overall Hadoop operating environment. The
inner core of BigInsights is Apache Hadoop, and we do inter-version
testing of the projects included so our enterprise customers are ensured
that they have a blue-washed and interoperable codebase across the
projects. As “most current” and “stable” are often conflicting,
BigInsights does not always use the most current version of projects, but
rather the most stable. Where we identify issues in the open source
projects, we have a number of committers with our IBM development
labs that submit fixes back to the open source community.
IBM’s goal with this approach is to protect the corporate IT organisation
from version management across the various open source projects by
providing this pre-tested, interoperable set in InfoSphere BigInsights.
An example of this commitment is that IBM contributed 25% of the fixes
for a recent release of Hadoop.
The most widely deployed version of BigInsights for Hadoop, v2.1 (the
current release is v3), is supported by IBM until 05-Jul-21.
4.4 The key
differences
between IBM,
Cloudera, Pivotal
and Hortonworks
Cloudera
IBM has a comparable number of significant-sized deployments to Cloudera,
a Hadoop distributor. However, the company is quite different to IBM.
Cloudera is venture-capital funded, with $160m invested in its 6th funding
round, completed in March 2014. Sales revenue was reported as $73m in 2013,
its fourth year of trading. It has 500 employees.
Our opinion on the long term destiny for companies like this – niche
technology players – is of a business exit plan based on acquisition by the
industry giants to fill a technology gap in the enterprise platforms they
provide. At this point it’s not clear if any of the enterprise technology
mega vendors have such a gap so the future of Cloudera is unclear.
Functionality which will be useful to the vehicle data industry – such as
real-time streaming data analytics, text analytics, analytics accelerator
tools, visualisation, enterprise-wide search, indexing, data integration
software, connected analytics appliances, relational data marts, and
governance, audit and compliance – is all available in IBM BigInsights but
not in this alternative.
Documented limitations in the Cloudera query engine explain why results
in data joins can fail to complete. Referring to the Cloudera user manual
highlights the cause as insufficient memory – this is caused by the need to
load ALL data into memory. As raw data sets can be very large this
limitation can easily exceed the total memory available. Vehicle data
volumes being generated are currently very large, and customer datasets
are also very large so this limitation constrains vehicle industry
applications.
By comparison, IBM Big SQL has no limitation requiring joined tables to
fit in the aggregated memory of the data nodes – the limitation which
causes such queries to run out of memory and fail.
IBM Hadoop is up to 41x faster than Hive 0.12 (Cloudera) on a TPC-H-like
benchmark, and over 2x faster than (Cloudera) Impala on the same
TPC-H-like benchmark.
Pivotal /EMC/Greenplum
Greenplum has changed hands several times and is now part of Pivotal, an
EMC spinoff, and its Hadoop offering is now called Pivotal HD. IBM
BigInsights has many advantages over Pivotal HD.
IBM BigInsights adds significant functions beyond IBM’s 100% open source
Hadoop components – it includes analytic accelerators such as Big SQL,
BigSheets, BigMatch, BigR and text analytics, unlike Pivotal, which
includes proprietary components and lacks added-value software
applications such as those listed above.
IBM has already achieved broader marketplace presence and analyst ratings
(e.g. the Forrester Wave).
IBM BigInsights offers greater flexibility and a lower-cost solution, with
availability as software only, on the cloud, or on a flexible IBM System x
reference architecture. By comparison, Pivotal is now recommending
expensive Isilon storage, which uses a proprietary OneFS file system. IBM
has made significant investments to ensure its enterprise open
architecture leverages the low-cost Hadoop elements rather than creating
lock-in solutions.
Significantly, Pivotal does not support HDFS. BigInsights offers HDFS and
GPFS support. Where the new SQL HAWQ component of Pivotal HD is
offered as a license cost option, the powerful IBM Big SQL is included with
BigInsights.
IBM offers a complete Big Data platform Solution as an integrated
architecture that offers more than just Hadoop - including BigInsights,
Streams, MPP Database, Information Integration.
Real time analytics – not just batch – is provided by IBM, whereas Pivotal
has an in-memory grid, which is not a real time streaming solution.
Data security at an enterprise granular level is provided by IBM’s
Integration and Governance offerings (Information Server, Guardium and
Optim), which are integrated with Hadoop; achieving the same level of
data management elsewhere would require local development or 3rd-party
integration projects.
As the vehicle data will include personal data, its governance is mandated.
Delivering the appropriate systems is easier and lower cost with IBM.
Pivotal HAWQ adds the entire RDBMS structure (query engine, storage
layer, metadata) to Hadoop. This adds proprietary layers and database
complexity to the Hadoop solution. By comparison IBM Big SQL integrates
just the query engine with Hadoop. This allows the query engine to be
collocated with the Hadoop cluster and executes using native meta data
and HDFS files, which is how IBM won the performance benchmark tests
cited above, and IBM Big SQL offers elastic scalability where nodes can be
added / removed online.
Hortonworks
BigInsights and Hortonworks have similar Hadoop components, and both are
committed to open source Apache Hadoop, with committers and contributors
to the Apache projects. However, BigInsights extends value beyond
Hortonworks for analytics with its Social Media Accelerator, Machine Data
Accelerator, BigSheets spreadsheet and visualisation, and Advanced Text
Analytics. Also, BigInsights includes Data
Explorer for Search and Indexing in Hadoop and beyond to all enterprise
data; a vital function to make the data accessible to potential users inside
the company and through applications to its customers.
BigInsights Big SQL has advantages over Hortonworks’ HiveQL, as Big SQL
provides richer SQL, better HBase performance, and better short-query
performance.
For many large companies already using GPFS, IBM BigInsights 2.1
uniquely offers GPFS as a Hadoop file system providing enterprise data life
cycle management. BigInsights also has Adaptive MapReduce (Platform
Symphony) for faster MapReduce processing, and BigInsights integrates
with InfoSphere Streams while Hortonworks does not, limiting the latter
to batch processing.
5. BigInsights Performance and stability
5.1 Overview
IBM InfoSphere BigInsights has been independently benchmarked and proven
to be between 4 and 11 times faster than open source alternatives running
on identical infrastructure.
InfoSphere BigInsights provides several features that help increase performance, as
well as enhance its adaptability and compatibility within an enterprise environment.
5.2 Adaptive
MapReduce for
job
acceleration
Jobs running on Hadoop can end up creating multiple small tasks that consume a
disproportionately large amount of system resources. To combat this, IBM invented
a technique called Adaptive MapReduce that is designed to speed up small jobs by
changing how MapReduce tasks are handled without altering how jobs are created.
Adaptive MapReduce is transparent to MapReduce operations and Hadoop
application programming interface (API) operations.
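The cost of many small tasks can be made concrete with a back-of-envelope model: each task pays a fixed scheduling overhead on top of its real work, so consolidating small tasks amortises that overhead. The figures below are illustrative assumptions, not IBM measurements, and the model is ours, not Adaptive MapReduce's actual mechanism.

```python
def total_time(work_units, units_per_task, per_task_overhead=0.5, unit_cost=0.01):
    # Model: each task pays a fixed scheduling overhead plus a cost per
    # unit of real work. Overhead and unit cost are invented figures.
    tasks = -(-work_units // units_per_task)  # ceiling division
    return tasks * per_task_overhead + work_units * unit_cost

# 10,000 small work units: one unit per task vs. batches of 500 units
naive = total_time(10_000, 1)      # 10,000 tasks, overhead dominates
batched = total_time(10_000, 500)  # 20 tasks, overhead amortised
```

The work done is identical in both cases; only the per-task overhead changes, which is exactly the inefficiency a small-job-aware scheduler targets.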
5.3 Comprehensive
SQL
Performance
and support
Let’s start with the number one reason why this new release of Big SQL
sets a new bar: performance. Benchmark tests indicate that Big SQL
executes queries 20 times faster, on average, than Apache Hive 0.12, with
performance improvements ranging up to 70 times faster.
This performance improvement was achieved by replacing the earlier Map-
Reduce (MR) implementation with a massively parallel processing (MPP)
SQL engine. The MPP engine deploys directly on the physical Hadoop
Distributed File System (HDFS) cluster. A fundamental difference from other
MPP offerings on Hadoop is that this engine actually pushes processing
down to the same nodes that hold the data. Because it natively operates in
a shared-nothing environment, it does not suffer from limitations common
to shared-disk architectures (for example: poor scalability and networking
caused by the need to move shared data around).
IBM’s unique ANSI standard SQL interface in BigInsights automatically
optimises queries so that smaller queries run in-memory and bypass
MapReduce. For larger queries that still rely on MapReduce, BigSQL can also
still leverage the performance benefits of Adaptive MapReduce.
IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution
to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without
modification. By contrast, Apache Hive 0.12 executes only 43 of the 99
queries without modification. In a Jan 2013 blog post, Cloudera describes
how its benchmark tests were completed by modifying the TPC-DS queries to
SQL-92 syntax and selectively including only 20 of the 99 TPC-DS queries.
IBM Big SQL has many advantages over Impala, including richer SQL support
such as SQL-92 sub-queries, SQL 99 aggregate functions, and SQL 2003
windowing aggregate functions.
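SQL:2003 windowing aggregates compute a value over a partition of rows without collapsing them, e.g. `SUM(units) OVER (PARTITION BY dealer ORDER BY month)`. As a sketch of those semantics in plain Python (the data and field names are invented), a running total per partition looks like:

```python
from itertools import groupby
from operator import itemgetter

def running_sum_over_partition(rows, partition_key, order_key, value_key):
    # Equivalent of: SUM(value) OVER (PARTITION BY pk ORDER BY ok)
    out = []
    rows = sorted(rows, key=itemgetter(partition_key, order_key))
    for _, part in groupby(rows, key=itemgetter(partition_key)):
        total = 0.0
        for row in part:
            total += row[value_key]
            out.append({**row, "running_total": total})
    return out

sales = [
    {"dealer": "A", "month": 1, "units": 10},
    {"dealer": "A", "month": 2, "units": 5},
    {"dealer": "B", "month": 1, "units": 7},
]
result = running_sum_over_partition(sales, "dealer", "month", "units")
```

Unlike `GROUP BY`, every input row survives and carries its window value, which is why these functions matter for analytic queries.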
Impala is an immature, feature-poor offering with back-level SQL support,
and some SQL tools may not work with Impala because of ODBC and JDBC
driver limitations.
Big SQL enables row and column access control, or “fine-grained control”
consistent with functionality found in an RDBMS.
The comprehensive SQL support in Big SQL 3.0 enables an organization to
make full use of its existing SQL skills, reducing the need to augment
its analytic applications with Hadoop-specific functions.
Now here’s the real value: Big SQL 3.0 can access data from more than
just BigInsights. It can query and combine data from many data sources,
including (but not limited to) DB2 for Linux, UNIX and Windows database
software, IBM PureData System for Analytics, IBM PureData System for
Operational Analytics, Teradata and Oracle. Organizations can choose to
leave data where it currently exists and use BigInsights to augment
where it makes the most sense.
Note that this approach, minimizing the need to move data, is part of
IBM’s overall big data and analytics strategy. SPSS and Cognos Business
Intelligence also support querying and joining data across disparate data
sources, addressing the need to analyze all data, wherever it is located.
IBM InfoSphere BigInsights v3.0, with the MPP-based performance and
SQL support of Big SQL 3.0, provides an enterprise-ready Hadoop
distribution that minimizes the impact on users while enabling IT to
adopt this new technology into its data architecture strategy.
5.4 Federated
data access
Big SQL can access data from more than just BigInsights. Its federated
access allows users to send distributed requests to multiple data sources
within a single SQL statement.
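The idea of one statement spanning two sources can be sketched with SQLite's `ATTACH`, which joins across independent databases in a single query. This is only an analogy for federation, not Big SQL's mechanism, and the source and table names are invented.

```python
import sqlite3

# Two independent databases stand in for two federated sources
# (e.g. a Hadoop-resident table and a warehouse); names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
conn.execute("CREATE TABLE telemetry (vin TEXT, fault_code TEXT)")
conn.execute("CREATE TABLE warehouse.customers (vin TEXT, region TEXT)")
conn.executemany("INSERT INTO telemetry VALUES (?, ?)",
                 [("V1", "P0300"), ("V2", "P0420")])
conn.executemany("INSERT INTO warehouse.customers VALUES (?, ?)",
                 [("V1", "EMEA"), ("V2", "APAC")])
# A single SQL statement joining across both "sources"
rows = conn.execute(
    """SELECT c.region, t.fault_code
       FROM telemetry t JOIN warehouse.customers c ON t.vin = c.vin
       ORDER BY c.region"""
).fetchall()
```

The benefit claimed for federation is exactly this shape: the data stays where it lives, and the join is expressed once, in one statement.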
Administrators start with a GUI-driven installation tool that guides them to specify
which optional components to install and how to configure the platform. Installation
progress is reported in real time, and a built-in health check is designed to
automatically verify the success of the installation. These advanced installation
features minimize the amount of time needed for installation and tuning,
freeing administrators to work on other critical projects.
Once the Hadoop cluster is in place, robust job management features give
organizations control of InfoSphere BigInsights jobs, user roles, security and key
performance indicator (KPI) monitoring. Technical staff can easily direct job
creation, submission and cancellation; they can also stay informed of workload
progress through integrated job status dashboards, logs and monitors that provide
details on configuration, tasks, attempts and other critical information. In addition,
InfoSphere BigInsights provides administration features for Hadoop Distributed File
System (HDFS), IBM GPFS™ File Placement Optimizer (FPO), big data applications and
MapReduce jobs, and cluster management.
5.5 Performance
architecture
The architecture of IBM InfoSphere BigInsights is essentially comprised of
three layers and a management / administration tier. All data is stored in
our distributed file system, which can be either open source Apache HDFS or
IBM’s more enterprise-class General Parallel File System (GPFS). This forms
the underlying data persistence layer on which all other components rely,
and hence is typically considered the bottom-most layer.
In the middle, there are a number of data processing components that all
leverage the MapReduce capabilities of Hadoop in order to parallelise their
work. These include:
• Data processing languages like Pig and Jaql
• Query mechanisms like Hive
• Indexing mechanisms like Lucene
• Data load mechanisms like Sqoop and Flume
• Analytical capabilities like Text Analytics and Probabilistic Matching and
• Data repositories built on top of the distributed file system like HBase.
MapReduce itself could be considered the backbone of this layer, and IBM
provides a high-performance optimisation of open source MapReduce called
“Adaptive MapReduce” to give greater performance to all data processing
in this layer through integration with Platform Symphony.
• IBM is also unique in providing one of the only commercially supported
data-in-motion analytics capabilities, InfoSphere Streams, which we
originally developed through our unique Research division, in
cooperation with various US governmental agencies. Streams can
directly leverage data in Hadoop and in-memory data grids like Redis, and
also integrates directly with DataStage.
5.6 Real time
analytics
In addition, to address true real-time (data in motion) requirements, we
integrated the use of IBM InfoSphere Streams. This is a unique capability
that IBM initially developed in partnership with its Research division and
various US governmental agencies to process large quantities of both
structured and unstructured data with both high throughput and low
latency.
Streams uses its own in-memory processing and node coordination facilities
to achieve microsecond latencies, but can use InfoSphere BigInsights and a
number of relational databases as both a source of historical information
and a target to which to store information for retention purposes. In
addition, Streams can integrate with in-memory data grids in order to
support low-latency lookup of information, typically reference data.
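The core pattern of data-in-motion analytics is the windowed operator: results are emitted as each event arrives, over a bounded window of recent events, rather than after a batch completes. A minimal sketch in Python (not Streams' SPL language or API) of a sliding-window average:

```python
from collections import deque

def windowed_average(stream, window_size):
    # Emit the mean of the last `window_size` readings as each new one
    # arrives, mimicking a sliding-window operator over data in motion.
    window = deque(maxlen=window_size)
    for reading in stream:
        window.append(reading)
        yield sum(window) / len(window)

readings = [10.0, 20.0, 30.0, 40.0]
averages = list(windowed_average(readings, window_size=2))
```

A real streaming engine distributes such operators across nodes and keeps state in memory to reach the microsecond latencies described above; the per-operator logic is of this shape.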
5.7 Resilience
delivered from
Platform
Symphony
5.8 Platform
scheduling
Platform is IBM’s cluster management and scheduling system, which can
support diverse compute- and data-intensive applications. Platform is a
mature and well-established product, used across many industries for
grid-centric workloads.
The major benefits that Platform brings to BigInsights are as follows:
• Recovery and reliability:
− Hadoop jobs, and job tasks, are recoverable in the event of node
failure
− Platform infrastructure has no single point of failure
− All services are highly available, and will be restarted automatically
on alternative servers in the event of a management server failure
• Resource sharing and flexibility:
− Platform can manage both Hadoop and non-Hadoop workloads
within the same cluster, including provisioning through the use of
the optional Cluster Manager
− Multiple IBM and third party analytic applications can be supported
on a shared infrastructure, e.g. InfoSphere Streams, InfoSphere
DataStage, SPSS, SAS, R, etc.
− Infrastructure can be shared across development, test, and
production environments; across different user groups, clusters and
workload types: which will drive greater efficiencies and utilisation,
while reducing costs.
• Scheduling agility:
− Agile scheduling ensures that time critical workloads start and finish
fast
− Optionally give priority to interactive jobs (e.g. BigSheets, Big SQL)
− Resource allocations shift instantly based on priority adjustments
and proportional allocations at run-time
− Platform’s highly effective scheduling ensures that the cluster can be
kept at high average levels of utilisation: 80-90% average utilisation
is not uncommon for Platform clusters.
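Proportional allocation with demand caps can be sketched as a simple "water-filling" loop: capacity is handed out in proportion to weight, each consumer is capped at its demand, and the surplus is recycled to the others. This is our simplified illustration of fair-share scheduling, not Platform Symphony's actual algorithm, and the weights and demands are invented.

```python
def fair_share(total, weights, demands):
    # Water-filling: repeatedly hand out capacity in proportion to weight,
    # capping each consumer at its demand and recycling the surplus.
    alloc = {n: 0.0 for n in weights}
    remaining = float(total)
    active = set(weights)
    while remaining > 1e-9 and active:
        wsum = sum(weights[n] for n in active)
        handed = 0.0
        for n in list(active):
            give = min(remaining * weights[n] / wsum, demands[n] - alloc[n])
            alloc[n] += give
            handed += give
            if demands[n] - alloc[n] < 1e-9:
                active.discard(n)  # demand satisfied; free its share
        remaining -= handed
        if handed < 1e-9:
            break
    return alloc

# Interactive jobs weighted 3x over batch; batch demand exceeds supply
alloc = fair_share(100, {"interactive": 3, "batch": 1},
                   {"interactive": 40, "batch": 200})
```

Here the interactive workload is fully satisfied (its 3x weight entitles it to more than it asks for), and the batch workload absorbs the remaining capacity, which is how a cluster sustains high utilisation while honouring priorities.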
One important addition IBM has made beyond open source capability in
terms of scalability, however, is in the area of resource scheduling. When
using the Adaptive MapReduce framework built into BigInsights, a more
advanced scheduling mechanism is used. This mechanism improves
scalability by leveraging Platform Symphony’s ability to support a number of
scheduling agents running in parallel rather than open source’s historically
singular JobTracker service. This is similar to the scalability improvements in
the recently released YARN capability of open source Hadoop, but uses the
proven robustness of Platform Symphony rather than months-old and
relatively unproven open source technology.
For example, one of our largest deployments of BigInsights started with an
initial volume of approximately 2.5 PB, and has grown over the last couple
of years to approximately 5 PB. It is expected to grow to 20 PB of data
within the next several years. In combination, GPFS and Platform bring
significant operational benefits to the running of Hadoop workloads, and
provide the flexibility to support other types of workloads in a common
infrastructure. Coupled with Platform’s resource allocation and prioritisation
capabilities, we believe this will drive higher utilisation and efficiency, and
will lower operational support costs by virtue of having a single architecture
to support diverse workloads and application types.
The robust availability and recoverability characteristics provided by
Platform and GPFS will provide clients with a Hadoop solution ready for
enterprise deployment into what is becoming an increasingly time-sensitive
business environment.
5.9 High
Availability
BigInsights provides a highly robust Hadoop solution that automatically
handles the failure of management and data nodes without losing data and
without any interruption to processing.
The recommended configuration of BigInsights is to use the Platform
Symphony MapReduce scheduler instead of the Apache MapReduce
scheduler, and GPFS-FPO as the High Availability file system for data
storage instead of HDFS – note that GPFS can provide a highly available
HDFS filesystem across sites. GPFS is also used to provide a highly
available (and optionally cross-site) filesystem which is used by the
Symphony scheduler to support High Availability configurations – the use
of a shared NAS facility is also an option.
When BigInsights is configured in this way, and combined with the
appropriate server, network, and environmental infrastructure (e.g. power,
cooling), it provides a highly available solution. Platform Symphony requires
a shared file system to be accessible between management nodes. In a
production environment two or more management nodes will be
configured. The number of management nodes will be dictated by two
factors. The first factor is the level of redundancy required; additional
management nodes mean that the cluster can tolerate more failure at the
management level. The second factor is the cluster size/load. Platform
Symphony can use multiple instances of the Symphony scheduler (similar to
JobTracker), one for each logical application. As the number of applications
increases, Symphony can be scaled out to multiple Symphony schedulers,
providing load-balanced scheduler instances across available
management nodes. Platform Symphony will do this automatically.
Therefore as the cluster size/load increases additional management nodes
can be added.
All management nodes are active.
The shared file system is used to store component state data. For example,
an instance of a Symphony scheduler will store metadata about all in-flight
workload currently being processed. If a management node on which a
component is running fails, the component will be restarted on another
available node; during start-up, the component recovers the state written
to the shared file system.
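The checkpoint-and-recover cycle just described can be sketched in a few lines. This is an illustrative model, not Symphony's implementation: the class, file format and field names are all invented.

```python
import json
import os
import tempfile

class Scheduler:
    """Toy scheduler that checkpoints in-flight work to a shared path."""

    def __init__(self, shared_path):
        self.shared_path = shared_path
        self.in_flight = {}

    def submit(self, job_id, tasks):
        # Record the job and immediately checkpoint to the shared filesystem
        self.in_flight[job_id] = {"tasks": tasks, "done": 0}
        self._checkpoint()

    def _checkpoint(self):
        with open(self.shared_path, "w") as f:
            json.dump(self.in_flight, f)

    @classmethod
    def recover(cls, shared_path):
        # A fresh instance on another node reloads the persisted state
        node = cls(shared_path)
        with open(shared_path) as f:
            node.in_flight = json.load(f)
        return node

shared = os.path.join(tempfile.mkdtemp(), "scheduler_state.json")
primary = Scheduler(shared)
primary.submit("job-42", tasks=8)
standby = Scheduler.recover(shared)   # primary node assumed to have failed
```

Because the state lives on the shared filesystem rather than on the failed node, the standby instance resumes with full knowledge of the in-flight workload.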
The shared file system can be implemented using a variety of technologies
and solutions, for example with Network Attached Storage (NAS) appliances
as HA features (e.g. dual controllers, RAID) are normally built in. GPFS can
also be used as it is a clustered file system and can provide the required
redundancy and high availability. The Symphony shared file system can be
implemented using GPFS through a number of different hardware
configurations. One example is shown below.
Platform Symphony High Availability with GPFS
In this example configuration, three management nodes are shown. Each node
has a number of solid state disks. Each SSD is divided into two partitions,
one for the operating system and the second for GPFS. All disks are GPFS
Network Shared Disks (NSDs). All three servers are GPFS quorum nodes.
There are three
GPFS failure groups; each is created using direct-attached SSDs on each
management node. A replication factor of 3 is used for both data and
metadata - each block of a replicated file is in all 3 failure groups.
Use of SSDs is not a requirement but may provide performance advantages
for more real time/latency sensitive applications or for filesystem metadata.
IBM BigInsights uses an HA manager from Platform Symphony, known
internally as the service controller, to manage management components or
services. The service controller is responsible for starting one or more
instances of registered service types. The BigInsights and Platform
Symphony management components are registered as service types within
Platform Symphony. The service controller monitors all service instances
and restarts a service instance if it exits unexpectedly. There is always one
instance of the service controller running. If the management node on
which it is running becomes unavailable a new instance is restarted on
another management node.
Hardware component failures which cause the node to go offline are
handled as follows:
If the failed node runs GPFS management functions, or is a Symphony
(Adaptive MapReduce) management node, then the failover clustering of
GPFS or Symphony automatically moves any management functions to
another designated node.
If the failed node “owns” replicated data (i.e. it is a data node), then GPFS
marks the node and associated storage as unavailable and “stale”. I/O
requests for data will be fulfilled by another copy of the data. If a disk fails,
then similarly, GPFS will mark that disk as stale, and redirect I/O's to other
copies of the data.
An HA Manager Service runs on all management nodes. This service handles
the monitoring of critical services and manages the failover steps (for
instance, the termination of the failed process, binding the floating IP to the
standby server, and starting the required process on the standby server).
This is not required for Symphony and GPFS, which have their own built-in
failover clustering, which allows for simplified failover with zero or minimal
disruption to ongoing work.
Namenode failure
With our use of GPFS as the file system for BigInsights, there is no
active-passive NameNode failover required: file system management
functions are failover clustered within GPFS itself, and metadata is
distributed across the file system. There is no concept of a “master” NameNode to fail: all
GPFS services are designed to be mobile around the cluster, and fail over as
required.
GPFS provides automated failover of file system management functions
from any failed node running these functions. Access to data or metadata
which was owned by that node is maintained transparently, by redirecting
I/O to a replica of the data. I/O is briefly suspended (transparently to users)
– typically for 1 to 2 minutes – while recovery steps are in process within
GPFS, and the I/O is resumed. Note that this transparent failover also
extends to the Clustered NFS facility within GPFS, which could be used to
provide edge/gateway services to move data in and out of the cluster.
Recovery from loss of nodes within the cluster
If a data/compute node fails, access to all data is maintained. Data which
was “owned” by that node and stored in the GPFS filesystem remains
available, even if one of the copies of the data is no longer accessible on the
failed node. This is done transparently to the other nodes in the cluster.
GPFS has robust, built-in clustering within the filesystem, and does not
require additional hardware or failover software to operate.
In a Hadoop cluster it is expected that nodes will become unavailable for a
variety of reasons. The nodes themselves do not generally have any
redundancy built into them. The IBM solution has a service controller that is
responsible for starting one or more instances of registered service types.
The service controller finds an available node on which to start a particular
service instance. It is also responsible for monitoring all service instances.
Therefore if a node becomes unavailable the service controller will restart
an instance on another available node. There is always one instance of the
service controller running. If the node on which the service controller is
running is lost, the service controller is automatically restarted on another node.
HA supported at the job level
IBM Platform Symphony replaces the JobTracker component of open source
Hadoop with its own scheduler. The Task Tracker component is also
replaced with another Platform Symphony component that runs on each
data node for managing each map or reduce task. A number of Platform
Symphony schedulers will be running within the cluster, one for each
configured application. A Symphony scheduler itself is responsible for
managing all jobs submitted on behalf of a particular application.
In terms of job-level HA, failures at either a data node or a management
node are relevant. Data nodes run all map and reduce tasks associated
with any MapReduce jobs submitted to the cluster. Management nodes run
a number of management daemons and components.
If a data node becomes unavailable, the scheduler will detect the loss of
communication with the runtime components (Task Tracker equivalent) on
the data node. The scheduler will then automatically re-queue any map or
reduce tasks that belong to jobs it manages and were running on the failed
data node, and reschedule those map and reduce tasks on available data nodes.
When a management node becomes unavailable all components, including
any schedulers, will be automatically restarted on other management nodes
(Platform Symphony will load balance across management nodes). Each
scheduler writes state information to disk, including all in-flight map and
reduce tasks. When the scheduler is restarted, this data is read back from
disk as part of scheduler recovery, so the scheduler can recover all
necessary information about jobs previously submitted by client
applications and then continue processing the workload.
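The checkpoint-and-recover behaviour described above can be modelled with a simplified sketch; the function names and JSON state format below are hypothetical, not the Platform Symphony API:

```python
import json
import os
import tempfile

# Hypothetical sketch of job-level HA (not the Platform Symphony API):
# the scheduler checkpoints in-flight tasks to a file on the shared
# (e.g. GPFS) filesystem; a restarted scheduler instance on another
# management node reads the checkpoint and re-queues the recovered tasks.

def checkpoint_tasks(state_path, in_flight_tasks):
    """Write in-flight map/reduce task state to the shared filesystem."""
    with open(state_path, "w") as f:
        json.dump({"in_flight": in_flight_tasks}, f)

def recover_tasks(state_path):
    """On scheduler restart, reload the in-flight tasks for re-queueing."""
    with open(state_path) as f:
        return json.load(f)["in_flight"]

# A temporary directory stands in for the shared file system here.
state_file = os.path.join(tempfile.mkdtemp(), "scheduler.state")
checkpoint_tasks(state_file, ["map-001", "map-002", "reduce-001"])
# A new scheduler instance recovers the work without client resubmission:
assert recover_tasks(state_file) == ["map-001", "map-002", "reduce-001"]
```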
As previously described, the HA operations assume that there is a robust and
available file system that can be accessed from any management node
(typically a GPFS filesystem, or a NAS with high availability configuration).
With the use of GPFS, Symphony uses the GPFS shared file system to
support HA operations at the job level.
While the data nodes do not require access to the shared Symphony file
system for the jobs to be scheduled, the use of GPFS as a separate highly
available file system accessible across the cluster provides the reliable data
storage facility for the data used by the running workloads.
Therefore all workload can be recovered in the event of a management
node or scheduler failure. There is no requirement to resubmit workload
from the client perspective: recovery is automatic. The client application
that submits any new work will automatically connect to the new instance
of the scheduler.
Cluster backups
Because all of the Hadoop data can be held in a POSIX-compatible GPFS
filesystem, where it is also accessible using normal operating system
commands, data can be easily backed up, whether it is physically held
inside the data/compute nodes or in a shared disk storage subsystem. In
either case, GPFS allows Hadoop jobs to see the data as if it were “HDFS”,
and backup software to see the same data as a normal file in a POSIX
filesystem.
This also means that there is a reduced requirement for multiple copies of
the data, and that data can be transparently moved from inside the Hadoop
cluster to less expensive shared storage (using RAID, which carries 20% to
40% overhead, rather than the 3 copies used by default by HDFS, which
carry 200% overhead). Data can even be transparently moved to tape, and
automatically and transparently recalled.
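The overhead figures quoted above can be sanity-checked with simple arithmetic; the helper below is an illustrative sketch, not output from any sizing tool:

```python
# Sanity check on the overhead figures quoted above: 3-way replication
# (the HDFS default) stores 3 bytes per user byte (200% overhead), while
# RAID schemes typically add only 20% to 40%.

def raw_capacity_needed(user_data_tb, overhead_fraction):
    """Raw storage required to hold user_data_tb with the given overhead."""
    return user_data_tb * (1 + overhead_fraction)

assert raw_capacity_needed(100, 2.00) == 300.0   # 3 copies: 200% overhead
assert raw_capacity_needed(100, 0.25) == 125.0   # RAID-style ~25% overhead
```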
GPFS also supports snapshots, providing a measure of recovery in the case
of accidental deletion, or other need to restore to a previous version of
data.
Backups need to be done of management nodes, GPFS nodes, and other
non-data nodes. Information relevant to the re-creation of the cluster
should also be backed up (e.g. filesystem configuration information, image
directory for the OS provisioning manager, operating system installation
backups for the provisioning manager, network switch configurations, etc).
The general and disciplined use of a provisioning manager such as PCMAE
reduces the number of individual items that have to be separately
considered for “bare metal” backups. If most systems are deployed through
the provisioning manager, then those servers can be easily and reliably
rebuilt exactly, by once again using the provisioning manager.
For the cluster, user data should be backed up if it cannot be easily
recreated in the case of a total data centre loss, or a catastrophic problem
which destroys all of the online data. Offline backups represent a safer
alternative or adjunct to online backups (snapshots, or backups-to-disk).
This is because a structured and ordered sequence of events needs to occur
to access offline data – the backup system has to request a tape mount of a
valid tape volume ID, the tape is mounted, and only then is access provided.
Multiple structured steps to access offline storage minimise the chances of
data loss due to malicious or accidental damage.
Disaster Recovery / Data Corruption
Disaster Recovery (DR) planning is an involved process, which must balance
the Recovery Point Objective (RPO, broadly “data loss”), Recovery Time
Objective (RTO, broadly “time until service is restored”), budgets, and
operational and physical resources and constraints.
DR is supported by the proposed BigInsights component for scenarios
ranging from cold, to warm, to hot, to active-active sites. To support some
scenarios, the Symphony and GPFS licenses included with BigInsights may
need to be extended to full product licenses for some or all cluster nodes.
GPFS is a clustered filesystem and supports multi-site “24x7” high
availability configurations with data replicated between sites, allowing a site
to fail without interruption to data access in the filesystem. So, 2 sites (plus
a single “tiebreaker” server at a third site) could form the basis of a highly
available cluster configuration. Recovery of jobs and other services would
also need to be considered.
Alternatively, you may choose to rely on existing backup and recovery
products and procedures for DR. The use of GPFS, which presents all data
(including MapReduce data) as a POSIX filesystem, enables enterprise
backup software to be used to back up and restore data. Note that
additional steps and considerations may be required: for example to back
up and restore ACLs, as well as to re-establish the filesystem at the DR site.
Remote caching of data using GPFS-AFM may also be considered. From a
workload management perspective two primary DR scenarios are supported:
• Active-active
• Active-passive
In the first scenario management and data nodes are located in at least two
data centres (plus a management node at a tiebreaker site – see diagram
below). All management and data nodes are active. Both data centres are
used for running jobs. In the event of a DR event which removes one data
centre the remaining data centre continues to process running jobs and
receive new jobs from clients. In order to support this scenario the Platform
Symphony shared file system must be available to management nodes in
both data centres. In addition Platform Symphony metadata (for example
Hadoop job metadata) must be replicated between the data centres. GPFS
enables us to create a clustered file system with this type of HA
configuration.
In the second scenario management and data nodes are again located in all
data centres. In this scenario there is no requirement for a tiebreaker site
26
36. with respect to managing Platform Symphony. Only the management nodes
in the primary data centre are active. Data nodes in both sites may be active
if application data is replicated between the data centres. In the event of a
DR event the management nodes in the secondary data centre are started.
They will connect to a shared file system that is available in the secondary
data centre. All remaining data nodes will automatically connect to the
management nodes in the secondary data centre. Workload that was being
processed before the DR event will be lost and will need to be resubmitted.
There is no requirement in this scenario to synchronously replicate Platform
Symphony metadata between the sites. Platform Symphony
configuration data (grid and application configuration data) should be
asynchronously replicated at an appropriate interval – for example every 30
minutes. Again GPFS can be used as the basis for the Platform Symphony
shared file system.
If the proposed solution is correctly implemented and maintained with an
appropriate level of redundancy for environmental, systems, and networks,
then individual node failures at a site are handled with the built-in clustering
of Symphony and GPFS failover clustering within the filesystem. A non-exhaustive
list of scenarios in which DR would need to be invoked includes
the complete failure of the network at a site, a power outage at a site, or
other situation where there are multiple simultaneous failures that result in
multiple key components being offline at the same time. For example, a
power spike destroys both redundant network switches. If the DR site is not
an “active-active” configuration for Symphony and GPFS, then a decision
would have to be made regarding repair time versus time and cost to invoke
DR and the subsequent failback.
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) of
disaster recovery configuration
For the GPFS filesystem, the RPO for a dual site active-active solution would
be zero, as writes happen synchronously between sites under the control of
GPFS. Other options such as asynchronous replication by GPFS (GPFS-AFM)
are also possible, which would present an RPO of minutes to hours.
In the case of an active – active configuration the RTO would be a few
minutes. Jobs would continue to run (with reduced throughput as capacity
would be reduced) once any Platform Symphony schedulers were restarted
on the second data centre. This would take a few minutes – including time
to detect failure at first data centre.
In the active – passive configuration all running workload will be lost and will
need to be resubmitted. The RTO is the time taken to start the management
nodes in the secondary site, plus the time taken for data nodes to connect
to these management nodes. This ignores the cost of having to start data
nodes and load application data from backup locations if this is required.
An RTO of close to 0 may be achievable using the fully integrated solution
including Platform Symphony and GPFS, in a multi-site configuration with
synchronous data replication between the sites. Note that there may be a
reduction in compute resources (unless policy keeps a complete separate
idle copy of production at DR), and that jobs which were in process at the
failed site would need to be restarted or abandoned. During the actual
failure of a site, I/O is temporarily suspended while the filesystem cluster
reconfigures, after which work continues as normal. This time typically
ranges from tens of seconds to a few minutes.
Jobs which are executing at the surviving site continue to operate (though
with the temporary suspension of I/O mentioned above).
While the Symphony scheduler cluster reconfigures itself, new jobs can be
submitted, though there may be some loss of service until all cluster
components are started on the surviving data centre. Workload will
continue to execute on the cluster whilst Platform Symphony components
are in the process of being restarted on the surviving data centre. All
Platform schedulers that were running on the surviving data centre will be
unaffected by the DR event, apart from loss of service and a requirement to
reschedule any tasks running on the data centre that was lost.
GPFS can provide a dual site configuration (plus quorum/tiebreaker site),
which supports active/active replication (see diagram below). This is done
using GPFS-controlled synchronous replication over TCP/IP between the
sites. This can provide a highly available data store, but may require
additional software in the middleware or application layers of the solution
stack to achieve “no transactions lost”.
For example, IBM MQ can also be used with GPFS across multiple sites with
GPFS providing a tested and supported high availability data store for MQ
messages. This provides an environment where transactions can be
retained, even in the case of the loss of a single site.
6. Real-time streaming analytics
IBM InfoSphere Streams is an advanced analytic platform that allows user-developed applications to
quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources.
The solution can handle very high data throughput rates, up to millions of events or messages per
second. This graphic illustrates an application that applies statistical rules to
telematics data from cars: one car has triggered an alert for a slippery road
surface by reporting that three of its wheels are rotating at different speeds,
and the approaching cars are instantly alerted.
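The wheel-speed rule in this example can be sketched in plain Python; a real Streams application would be written in SPL, and the 15 rpm tolerance below is an assumed value:

```python
# Illustrative Python sketch of the statistical rule described above; a
# production Streams application would express this in SPL, not Python,
# and the 15 rpm tolerance is an assumed value.

def distinct_speeds(wheel_rpm, tol=15):
    """Count groups of wheel speeds that differ by more than tol rpm."""
    groups = []
    for rpm in sorted(wheel_rpm):
        if not groups or rpm - groups[-1][-1] > tol:
            groups.append([rpm])
        else:
            groups[-1].append(rpm)
    return len(groups)

def slippery_road_alert(wheel_rpm, tol=15):
    """Alert when 3 or more wheels rotate at materially different speeds."""
    return distinct_speeds(wheel_rpm, tol) >= 3

# Four wheel-speed readings (rpm) from one car:
assert slippery_road_alert([500, 540, 580, 500]) is True    # 3 distinct speeds
assert slippery_road_alert([500, 501, 499, 500]) is False   # normal driving
```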
InfoSphere Streams helps you analyze data in motion, providing
sub-millisecond response times and allowing you to view information and
events as they unfold, even from moving vehicles.
You can analyze data in motion with Streams, which:
• Supports analysis of continuous data including text, images, audio,
voice, video, web traffic, email, GPS data, financial transactions, satellite
data and sensor logs.
• Includes toolkits and accelerators for advanced analytics, including a
telco event data accelerator that analyzes large volumes of streaming
data from telecommunications systems in near real time, and a social
data accelerator for analyzing social media data.
• Distributes portions of programs over one or more nodes of the runtime
computing cluster to help achieve volumes in the millions of messages
per second with latencies of under a millisecond.
• Filters and extracts only the relevant data from large volumes of
information, helping to reduce data storage costs.
• Scales from a single server to thousands of compute nodes based on
data volumes or analytics complexity.
• Provides security features and confidentiality for shared information.
Simplify development of streaming applications with an Eclipse-based
integrated development environment (IDE)
Streams:
• Allows you to build applications with drag-and-drop operators, and
dynamically add new views to running applications using data
visualization capabilities such as charts and graphs.
• Enables you to create, edit, visualize, test, debug and run Streams
Processing Language (SPL) applications.
• Provides a composites capability to increase application modularity and
support large or distributed application development teams.
• Allows you to nest and aggregate data types within a single stream
definition.
• Enables applications to be built on a development cluster and moved
into production without recompiling.
Extend the value of existing systems by integrating with your applications,
supporting both structured and unstructured data sources
Streams:
• Adapts to rapidly changing data forms and types.
• Allows you to quickly develop new applications that can be mapped to a
variety of hardware configurations.
• Supports reuse of existing Java or C++ code, as well as Predictive Model
Markup Language (PMML) models.
• Includes a limited license for IBM InfoSphere BigInsights, a Hadoop-based
offering for analyzing large volumes of unstructured data at rest.
• Integrates with IBM DB2, IBM Informix, IBM Netezza, IBM solidDB, IBM
InfoSphere Warehouse, IBM Smart Analytics System, Oracle, Microsoft
SQL Server, MySQL and more.
7. Data security
7.1 Granular data security
BigInsights integrates with Active Directory for user authentication.
A secure connection uses encryption to make data unreadable to third
parties while the data is sent over the network between the Directory
Server and clients binding to the secure port using the Secure Sockets
Layer (SSL). InfoSphere BigInsights also supports the Kerberos
service-to-service authentication protocol, increasing security strength
and helping to prevent man-in-the-middle attacks.
BigInsights supports integration with LDAP and single sign-on across clusters
(e.g. Development and Test) through the use of the Name Service Switch
(NSS) package and Lightweight Third Party Authentication (LTPA) tokens. The
development, deployment and execution of analytic applications or other
data processing jobs are controlled through role-based security, and
information access is controlled through granular access control lists (ACLs).
BigInsights uses POSIX compliant ACLs to control access to the data itself,
down to the data stored in each individual node in the cluster, by using IBM’s
General Parallel File System (GPFS). GPFS is a kernel-level, POSIX compliant
file system that provides the same level of control for file access security and
auditing capabilities as other POSIX file systems, allowing standard “owner”,
“group” and “other” permissions to be assigned and changed using standard
operating system commands like “chmod”. In fact, the ACLs within GPFS
allow even greater flexibility by allowing additional users and groups to be
defined, as well as a “control” level that determines who can change the ACL
itself.
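Because the permissions model is standard POSIX, this behaviour can be demonstrated with ordinary operating system calls; the sketch below uses a temporary file on a local filesystem, but the same calls apply to files on a GPFS mount (the filename is made up):

```python
import os
import stat
import tempfile

# Because GPFS is POSIX compliant, the familiar owner/group/other
# permission model applies to data in the cluster just as it does on any
# local filesystem. A sketch using a temporary file:
path = os.path.join(tempfile.mkdtemp(), "telematics.csv")
open(path, "w").close()

os.chmod(path, 0o640)  # owner: read/write, group: read, other: none
mode = stat.S_IMODE(os.stat(path).st_mode)
assert mode == 0o640

# Equivalent to the standard command:  chmod 640 telematics.csv
```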
Perhaps most importantly, as a kernel-level file system the data stored in
GPFS can be shared across BigInsights and any other application, without
moving or replicating the data. This immediately improves flexibility of using
the data in BigInsights, removing delays in the movement of data between
systems and tools, and minimising the need to reproduce access control
definitions at multiple levels.
In addition to standard application authentication via LDAP or Kerberos, Big
SQL enables row and column access control, sometimes described as
fine-grained access control, consistent with functionality found in an RDBMS.
This functionality supports compliance with regulations and policies related
to data privacy, such as those covering patient health records or securities
data, and so is suited to the compliance challenges of vehicle data, such as
rules limiting the use of eCall data to the specific purpose of emergency
response to vehicle problems.
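To illustrate the effect of row-level control, the sketch below emulates it with a SQLite view restricting an emergency-response role to eCall rows; this shows the concept only, not Big SQL's actual DDL (which follows DB2-style row permissions), and the table and column names are hypothetical:

```python
import sqlite3

# Illustration of the *effect* of row-level access control, using a
# SQLite view as a stand-in; Big SQL implements this natively with
# DB2-style row permissions, not views. Table/column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicle_events (vin TEXT, event TEXT, purpose TEXT)")
conn.executemany("INSERT INTO vehicle_events VALUES (?, ?, ?)", [
    ("VIN001", "airbag_deployed", "ecall"),
    ("VIN002", "hard_braking", "telemetry"),
])
# An emergency-response role is permitted to see only eCall rows:
conn.execute("""CREATE VIEW ecall_events AS
                SELECT vin, event FROM vehicle_events
                WHERE purpose = 'ecall'""")
rows = conn.execute("SELECT vin FROM ecall_events").fetchall()
assert rows == [("VIN001",)]  # telemetry rows are filtered out
```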
To monitor and validate data access, BigInsights’ built-in auditing can track
changes to access privileges or data objects, and track SQL statement
execution and the retrieval of security information.
7.2 User security
Administrators have the option to choose flat file, Lightweight Directory
Access Protocol (LDAP) or Pluggable Authentication Modules (PAM) for the
InfoSphere BigInsights web console. With LDAP authentication, the
InfoSphere BigInsights installation program will communicate with an LDAP
credentials store for authentication. Administrators can then provide access
to the InfoSphere BigInsights console based on role membership, making it
easy to set access rights for groups of users.
InfoSphere BigInsights provides four levels of user roles:
• BigInsights System Administrator. Performs all system administration
tasks. For example, a user in this role can monitor the cluster’s health
and add, remove, start, and stop nodes.
• BigInsights Data Administrator. Performs all data administration tasks.
For example, these users create directories, run Hadoop file system
commands, and upload, delete, download, and view files.
• BigInsights Application Administrator. Performs all application
administration tasks, for example publishing and un-publishing (deleting)
an application, deploying an application to the cluster and removing it,
configuring application icons, applying application descriptions, changing
the runtime libraries and categories of an application, and assigning
permissions of an application to a group.
• BigInsights User. Runs applications that the user has permission to run,
and views the results, data, and cluster health. This is typically the role
most commonly granted to cluster users who perform non-administrative
tasks.
MapReduce jobs can be run under designated account IDs, which helps
tighten security, access control and auditing. Integration of InfoSphere
BigInsights with IBM InfoSphere Guardium® data security software helps
organizations to manage the security and auditing needs of Hadoop the same
way they manage traditional structured data sources.
7.3 Audit and integration with Guardium, the leading security platform
BigInsights can be configured to collect a range of audit information.
BigInsights stores security audit information as audit events in its own audit
log files for general security tracking. The log files are written to the file
system in directories that use a date-based naming protocol and can only
be accessed by administrators.
You can also configure InfoSphere BigInsights to send audit log events to
InfoSphere Guardium for security analysis and reporting via Guardium Proxy.
An audit message contains three critical pieces of information derived from
an audit event: audit event timestamp, component that generated the audit
event, and the audit message itself.
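The three-field audit message described above can be modelled as follows; this is an illustrative structure only, not the actual BigInsights or Guardium wire format:

```python
import json
import time

# Illustrative model of the three fields an audit message carries
# (timestamp, originating component, message text); the real
# BigInsights-to-Guardium format may differ.

def make_audit_message(component, message, ts=None):
    """Build one audit message from an audit event."""
    return json.dumps({
        "timestamp": ts if ts is not None else int(time.time()),
        "component": component,
        "message": message,
    })

msg = json.loads(make_audit_message("HDFS", "chmod /data/cars denied",
                                    ts=1412812800))
assert sorted(msg) == ["component", "message", "timestamp"]
```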
After BigInsights events exist in the InfoSphere Guardium repository, other
InfoSphere Guardium features such as workflow (to email and track report
signoff), alerting, and reporting are available.
InfoSphere Guardium has a secure, tamper-proof repository. All audit
information is stored in this repository, where it cannot be modified, even
by privileged users. Once data is collected and written to Guardium, there
is no way for it to be modified, which guarantees the non-repudiation of
the data. This secure repository supports separation of duties and absolves
database administrators of any suggestion that they might have changed
audit data to “cover their tracks,” even in a legal setting.
In addition, Guardium has a hardened operating system and database
kernel: there is no way for users to directly access the underlying operating
system, file system, or database. As an added precaution, all unused
software components of the operating system and the embedded database
have been removed or disabled.
The security audit information that InfoSphere BigInsights generates depends
on your environment. The following list is representative of the type of
information that InfoSphere BigInsights generates:
• Hadoop Remote Procedure Calls (RPC) authentication and
authorisation successes and failures
• Hadoop Distributed File System (HDFS) file and permission-related
commands such as cat, tail, chmod, chown, and expunge
• Hadoop MapReduce information about jobs, operations, targets, and
permissions
• Oozie information about jobs
• HBase operation authorisation for data access and administrative
operations, such as global privilege authorisation, table and column
family privilege authorisation, grant permission, and revoke
permission.
7.4 Masking confidential information and test data with IBM Optim
InfoSphere Optim data masking on demand is the only masking service
available for Hadoop-based systems. You can decide when and where to
mask: for example, in relational data sources, in reports or inside
applications.
InfoSphere Optim Data Masking de-identifies or obfuscates sensitive data
such as PII, business data (revenues, HR, etc.) and corporate secrets in big
data environments. Using InfoSphere Optim protects data against theft and
misuse in accordance with compliance mandates. InfoSphere Optim Data
Masking ensures data privacy, enables compliance and helps manage risk.
InfoSphere Optim is first to market with data masking on demand for
Hadoop-based systems. InfoSphere Optim de-identifies data to ensure
privacy while keeping the original context to facilitate business processes.
Flexible masking services allow you to create customised masking routines
for specific data types or leverage out of the box support.
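As an illustration of what a customised masking routine might look like, the sketch below uses a deterministic hash so a masked VIN stays consistent across datasets (preserving join keys) while hiding the original value; this is a hedged sketch, not the InfoSphere Optim implementation, and the secret and VIN values are made up:

```python
import hashlib

# Illustrative masking routine (not the InfoSphere Optim implementation):
# deterministically pseudonymise a VIN so it stays consistent across
# datasets while the original value remains hidden. The secret is an
# assumed input that would be held securely in practice.

def mask_vin(vin, secret="example-secret"):
    digest = hashlib.sha256((secret + vin).encode()).hexdigest().upper()
    return "MASKED-" + digest[:10]  # keep a 17-character, VIN-like width

masked = mask_vin("1HGCM82633A004352")
assert masked == mask_vin("1HGCM82633A004352")   # deterministic: joins survive
assert masked != "1HGCM82633A004352"             # original value hidden
```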
IBM InfoSphere Optim Test Data Management optimises and automates the
test data management process. Prebuilt workflows and services facilitate
continuous testing and Agile software development. IBM InfoSphere Optim
Test Data Management helps development and testing teams use realistic,
right-sized test databases or data warehouses to accelerate application
development.
InfoSphere Optim Test Data Management helps organisations:
• Streamline test data management processes to help reduce costs
and speed application delivery
• Analyse and refresh test data on demand for developers and testers
• Create production-like environments to shorten iterative testing
cycles, support continuous testing and accelerate time to market
• Protect sensitive data based on business policies and help reduce risk
in testing, training and development environments
• Use a single, scalable enterprise solution across applications,
databases and operating systems
• Provide a comprehensive continuous testing solution through
Rational Test Workbench for functional, regression, integration
(service virtualisation) and load testing.
8. Big Data Platform integration
Integrated business functionality is delivered through the breadth of the
IBM Big Data Platform, which includes the following capabilities:
8.1 Connectors
InfoSphere BigInsights provides connectors to IBM DB2® database
software, the IBM PureData™ Systems family of data warehouse
appliances, IBM Netezza appliances, IBM InfoSphere Warehouse and the
IBM Smart Analytics System. These high-speed connectors help simplify
and accelerate data manipulation tasks.
Standard Java Database Connectivity (JDBC) connectors make it possible
for organizations to quickly integrate with a wide variety of data and
information systems including Oracle, Microsoft® SQL Server, MySQL
and Teradata.
This connectivity encourages a platform approach to big data projects:
for example, selections of data can be drawn from BigInsights to support
high-performance analytics on a Netezza appliance and the results posted
back to the all-encompassing data store. Queries from SQL tools can
concurrently access large data stores in Hadoop for long-term history and
high-performance systems for current operational reporting. This platform
approach replaces stove-pipe systems in which each data store has its own
reporting tools, support team, and labour-intensive maintenance
overheads.
8.2 Data warehouse integration
IBM’s approach of combining Hadoop with in-memory database
processing means that applications see one warehouse, yet gain a more
agile and faster response time for ROLAP, all at lower cost.
In addition, the IBM solution also provides a cheaper and larger-scale
engine for data integration and transformation: both extract-transform-load
(ETL) and extract-load-transform (ELT).
DataStage can push down transformations into the IBM Hadoop solution
to perform transformations within Hadoop. DataStage in this case
automatically creates MapReduce jobs that perform the transformation
work using an ELT approach.
IBM therefore does not advocate moving the entirety of the data mart
and data warehouse landscape into a Hadoop solution. Rather, IBM
recommends taking a measured approach to constructing a logical data
warehouse that is comprised of fit-for-purpose components that achieve
the best result for particular workloads while minimising cost.
8.3 IBM Symphony
By using the IBM Symphony scheduler, the solution provides high
availability, as well as higher performance for many MapReduce
workloads due to the faster scheduling of workloads. In addition, the
Symphony system can be deployed to support multi-tenancy, so that the
same cluster can simultaneously support Hadoop/MapReduce
workloads as well as other work.
8.4 Big Match & MDM
For users performing customer analytics, InfoSphere BigInsights
leverages the probabilistic matching engine of InfoSphere Master Data
Management to match and link customer information directly in
Hadoop, at high speeds. A unique identifier for each customer ensures
analytics are performed on more accurate and complete information.
8.5 Information Server: DataStage & QualityStage
IBM InfoSphere DataStage® includes a connector that enables
InfoSphere BigInsights data to be leveraged within an InfoSphere
DataStage extract/transform/load (ETL) or extract/load/transform (ELT) job.
The Balanced Optimiser functionality in BigInsights places the workload
where it will run most efficiently: during an ETL process, in a DB2
database, or as a MapReduce process.
Quality rules and actions can be run using the integration with
Information Server QualityStage; this is particularly useful, given the
disparate data types typically ingested into Hadoop, for avoiding the
creation of a large but useless data store.
8.6 InfoSphere Streams
InfoSphere BigInsights includes a limited-use license of InfoSphere
Streams, which enables real-time, continuous analysis of data on the fly.
InfoSphere Streams is an enterprise-class stream-processing system that
can extract actionable insights from data in motion while transforming
data and transferring it to InfoSphere BigInsights at high speeds.
This enables organizations to capture and act on business data in real
time — rapidly ingesting, analyzing and correlating information as it
arrives — and fundamentally enhance processing performance.
8.7 Cognos Business Intelligence
InfoSphere BigInsights includes a limited-use license for Cognos Business
Intelligence, which enables business users to access and analyze the
information they need to improve decision making, gain better insight
and manage performance. Cognos Business Intelligence includes
software for query, reporting, analysis and dashboards, as well as
software to gather and organize information from multiple sources.