InfoSphere BigInsights is IBM's distribution of Hadoop that:
- Improves ease of use for both technical and non-technical users.
- Includes additional tools, technologies, and accelerators to simplify developing and running analytics on Hadoop.
- Aims to help users gain business insights from their data more quickly through an integrated platform.
The document discusses different models for distributed systems including physical, architectural and fundamental models. It describes the physical model which captures the hardware composition and different generations of distributed systems. The architectural model specifies the components and relationships in a system. Key architectural elements discussed include communicating entities like processes and objects, communication paradigms like remote invocation and indirect communication, roles and responsibilities of entities, and their physical placement. Common architectures like client-server, layered and tiered are also summarized.
Inductive analytical approaches to learning - swapnac12
This document discusses two inductive-analytical approaches to learning from data: 1) minimizing a weighted combination of the hypothesis's errors on the training examples and its errors with respect to the domain theory, with the weights determining the importance of each, and 2) using Bayes' theorem to calculate the posterior probability of a hypothesis given the data and prior knowledge. It also describes three ways prior knowledge can alter a hypothesis space search: deriving the initial hypothesis from prior knowledge, altering the search objective to fit both the data and the theory, and altering the available search steps.
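To make the Bayesian route concrete, here is a minimal sketch of scoring hypotheses with Bayes' theorem; the hypothesis names, priors, and likelihoods are invented for illustration and do not come from the document:

```python
# Minimal illustration of Bayes' theorem for hypothesis selection.
# Priors and likelihoods are made-up numbers standing in for the
# domain theory (prior knowledge) and the fit to training data.

priors = {"h1": 0.7, "h2": 0.3}          # P(h): encodes prior/domain knowledge
likelihoods = {"h1": 0.2, "h2": 0.9}     # P(D|h): how well each hypothesis fits the data

# P(D) = sum over hypotheses of P(D|h) * P(h)
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# P(h|D) = P(D|h) * P(h) / P(D)
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}

print(posteriors)  # {'h1': 0.341..., 'h2': 0.658...}
```

Here the domain theory enters only through the prior; the error-minimization route instead mixes it directly into the objective function.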
The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and an ecosystem of related projects like Hive, HBase, Pig and Zookeeper that provide additional functions. Advantages include effectively unlimited, scalable data storage and high-speed parallel processing; disadvantages include lower speeds on small datasets and constraints imposed by its block-based storage model.
Distributed deadlock detection algorithms allow sites in a distributed system to collectively detect deadlocks by maintaining and analyzing wait-for graphs (WFGs) that model process-resource dependencies. There are several approaches:
1. Centralized algorithms have a single control site that maintains the global WFG but are inefficient due to congestion.
2. Ho-Ramamoorthy algorithms improve this by having each site send periodic status reports to detect differences indicative of deadlocks.
3. Distributed algorithms avoid a single point of failure by having sites detect cycles in parallel through techniques like path-pushing, edge-chasing, and diffusion-based computations across the distributed WFG.
Deadlock occurs when two or more competing processes are each waiting for resources held by the others, leaving all of them blocked indefinitely. Four conditions are required for deadlock: mutual exclusion, hold and wait, no preemption, and circular wait. Deadlock can be prevented by attacking each condition in turn: allowing some resources to be shared, requiring that processes request all their resources at the start, allowing resources to be preempted, and imposing a global ordering on resource requests.
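Both of these summaries reduce deadlock detection to finding a cycle in a wait-for graph. As a rough illustration (the graph and process names below are invented), a depth-first search over WFG edges exposes the circular-wait condition:

```python
def has_deadlock(wfg):
    """Detect a cycle (circular wait) in a wait-for graph given as
    {process: set of processes it is waiting on}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {p: WHITE for p in wfg}

    def visit(p):
        color[p] = GRAY                      # p is on the current DFS path
        for q in wfg.get(p, ()):
            if color.get(q, WHITE) == GRAY:  # back edge: cycle found
                return True
            if color.get(q, WHITE) == WHITE and visit(q):
                return True
        color[p] = BLACK
        return False

    return any(color[p] == WHITE and visit(p) for p in wfg)

# P1 waits on P2, P2 waits on P3, P3 waits on P1: deadlock.
print(has_deadlock({"P1": {"P2"}, "P2": {"P3"}, "P3": {"P1"}}))  # True
```

A distributed edge-chasing algorithm does essentially this, but by forwarding probe messages along the edges instead of walking a globally known graph.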
The document discusses cache coherence in multiprocessor systems. It describes the cache coherence problem that can arise when multiple processors have caches and access shared memory. It then summarizes two primary hardware solutions: directory protocols, which maintain information about which caches hold which memory lines, and snoopy cache protocols, in which cache controllers monitor bus traffic to maintain coherence without a directory. Finally, it mentions a software-based solution relying on compiler analysis and operating system support.
This document discusses cache coherence in single and multiprocessor systems. It provides techniques to avoid inconsistencies between cache and main memory including write-through, write-back, and instruction caching. For multiprocessors, it discusses issues with sharing writable data, process migration, and I/O activity. Software solutions involve compiler and OS management while hardware uses coherence protocols like snoopy and directory protocols.
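As a toy model of the snoopy approach, here is a deliberately simplified, MSI-flavored sketch (single cache line, no data transfer, invalidate-on-write policy; not the exact protocols either document covers) showing caches invalidating their copies when they observe another cache's write on the bus:

```python
# Toy MSI-style snoopy coherence: each cache snoops bus writes and
# invalidates its copy when another cache writes the same line.
INVALID, SHARED, MODIFIED = "I", "S", "M"

class Bus:
    def __init__(self):
        self.caches = []
    def broadcast_write(self, writer, line):
        for c in self.caches:
            if c is not writer:
                c.snoop_write(line)

class Cache:
    def __init__(self, name, bus):
        self.name, self.state, self.bus = name, INVALID, bus
        bus.caches.append(self)

    def read(self, line):
        if self.state == INVALID:      # miss: fetch and move to SHARED
            self.state = SHARED

    def write(self, line):
        self.bus.broadcast_write(self, line)  # others must invalidate
        self.state = MODIFIED

    def snoop_write(self, line):
        self.state = INVALID           # another cache wrote this line

bus = Bus()
a, b = Cache("A", bus), Cache("B", bus)
a.read("x"); b.read("x")   # both SHARED
b.write("x")               # B -> MODIFIED, A snoops the write -> INVALID
print(a.state, b.state)    # I M
```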
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights - Cynthia Saracco
Introduces BigSheets, a spreadsheet-style tool for business users working with Big Data. BigSheets is part of IBM's InfoSphere BigInsights platform, which is based on open source technologies (e.g., Apache Hadoop) and IBM-specific technologies.
The document discusses concurrency in operating systems. It notes that operating systems must manage multiple concurrent processes through techniques like multiprogramming and multiprocessing. This introduces challenges around sharing resources and non-deterministic execution orders. It provides examples of race conditions that can occur without proper synchronization and discusses requirements for implementing mutual exclusion, such as critical sections, to avoid issues like deadlock and starvation.
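The race condition described here is easy to reproduce. In the minimal Python sketch below (illustrative, not from the document), the unsynchronized increment performs a non-atomic read-modify-write, so concurrent updates can be lost, while the lock turns the update into a critical section:

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n):
    global counter
    for _ in range(n):
        counter += 1          # read-modify-write: not atomic, updates can be lost

def safe_increment(n):
    global counter
    for _ in range(n):
        with lock:            # critical section: mutual exclusion
            counter += 1

threads = [threading.Thread(target=safe_increment, args=(100_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # always 400000 with the lock; with unsafe_increment, possibly less
```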
This document provides an overview of pattern recognition techniques. It begins with an introduction to pattern recognition and its applications. It then outlines the syllabus, which includes topics like design principles, statistical pattern recognition, parameter estimation methods, principal component analysis, linear discriminant analysis, and classification techniques. Under each topic, it provides further details and explanations.
Chapter 12 discusses mass storage systems and their role in operating systems. It describes the physical structure of disks and tapes and how they are accessed. Disks are organized into logical blocks that are mapped to physical sectors. Disks connect to computers via I/O buses and controllers. RAID systems improve reliability through redundancy across multiple disks. Operating systems provide services for disk scheduling, management, and swap space. Tertiary storage uses tape drives and removable disks to archive less frequently used data in large installations.
The document discusses operating systems, their components, functions, and history. It provides an overview of:
1) What an operating system is and its main goals of executing programs, making the computer convenient to use, and efficiently managing hardware resources.
2) The typical components of a computer system including hardware, operating system, application programs, and users.
3) The functions of an operating system which include providing a user environment, resource management, and error detection.
Max flow problem and push relabel algorithm - 8neutron8
The document summarizes the push-relabel algorithm for solving maximum flow problems. The algorithm was described by Andrew Goldberg and Robert Tarjan and achieves better running times than earlier network-flow algorithms by exploiting the fact that multiple augmentations may partially share paths. Push-relabel is considered among the fastest maximum-flow algorithms in practice and is not difficult to code.
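For reference, here is a compact sketch of the generic push-relabel scheme (plain discharge order, none of the FIFO or highest-label heuristics that give the best bounds; the adjacency-matrix input format and the example graph are invented):

```python
def max_flow_push_relabel(capacity, s, t):
    """Generic push-relabel: maintain a preflow, push excess along
    admissible edges (height difference exactly 1), relabel when stuck."""
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    height = [0] * n
    excess = [0] * n
    height[s] = n                              # source starts at height n

    for v in range(n):                         # saturate all edges out of s
        flow[s][v] = capacity[s][v]
        flow[v][s] = -capacity[s][v]
        excess[v] = capacity[s][v]

    def push(u, v):
        d = min(excess[u], capacity[u][v] - flow[u][v])
        flow[u][v] += d
        flow[v][u] -= d
        excess[u] -= d
        excess[v] += d

    def relabel(u):
        # Lift u just above its lowest neighbor in the residual graph.
        height[u] = 1 + min(height[v] for v in range(n)
                            if capacity[u][v] - flow[u][v] > 0)

    active = [u for u in range(n) if u not in (s, t) and excess[u] > 0]
    while active:
        u = active.pop()
        while excess[u] > 0:
            pushed = False
            for v in range(n):
                if capacity[u][v] - flow[u][v] > 0 and height[u] == height[v] + 1:
                    push(u, v)
                    pushed = True
                    if v not in (s, t) and excess[v] > 0 and v not in active:
                        active.append(v)
                    if excess[u] == 0:
                        break
            if not pushed:
                relabel(u)
    return sum(flow[s][v] for v in range(n))

cap = [[0, 3, 2, 0],   # invented 4-node example: source 0, sink 3
       [0, 0, 1, 2],
       [0, 0, 0, 3],
       [0, 0, 0, 0]]
print(max_flow_push_relabel(cap, 0, 3))  # 5
```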
A PowerPoint presentation on distributed operating systems: reasons for choosing distributed systems over centralized systems, types of distributed systems, and process migration and its advantages.
The document discusses various algorithms for achieving distributed mutual exclusion and process synchronization in distributed systems. It covers centralized, token ring, Ricart-Agrawala, Lamport, and decentralized algorithms. It also discusses election algorithms for selecting a coordinator process, including the Bully algorithm. The key techniques discussed are using logical clocks, message passing, and quorums to achieve mutual exclusion without a single point of failure.
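As one small, concrete piece of this toolbox, here is a sketch of Lamport logical clocks, the timestamping rule that lets Lamport's mutual-exclusion algorithm impose a total order on requests (the two-process scenario is invented for illustration):

```python
class LamportClock:
    """Lamport logical clock: tick on local events, and on receive take
    max(local, received) + 1 so causally later events get larger stamps."""
    def __init__(self):
        self.time = 0

    def tick(self):                # local event or message send
        self.time += 1
        return self.time

    def receive(self, sent_time):  # message receipt
        self.time = max(self.time, sent_time) + 1
        return self.time

p1, p2 = LamportClock(), LamportClock()
t_send = p1.tick()        # P1 sends a request stamped 1
p2.receive(t_send)        # P2's clock jumps past the sender's stamp
print(p1.time, p2.time)   # 1 2
```

Ties between equal timestamps are typically broken by process ID, which is what makes the resulting request order total.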
The document discusses key components and concepts related to operating system structures. It describes common system components like process management, memory management, file management, I/O management, and more. It then provides more details on specific topics like the role of processes, main memory management, file systems, I/O systems, secondary storage, networking, protection systems, and command interpreters in operating systems. Finally, it discusses operating system services, system calls, and how parameters are passed between programs and the operating system.
The document discusses multidimensional databases and data warehousing. It describes multidimensional databases as optimized for data warehousing and online analytical processing to enable interactive analysis of large amounts of data for decision making. It discusses key concepts like data cubes, dimensions, measures, and common data warehouse schemas including star schema, snowflake schema, and fact constellations.
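To ground the star-schema idea, here is a small sketch with invented tables: a central fact table holding the measures and foreign keys into dimension tables, plus a roll-up of one measure along two dimensions, which corresponds to one slice of a data cube:

```python
# Invented example data: one fact table referencing two dimension tables.
dim_product = {1: {"name": "widget", "category": "hardware"},
               2: {"name": "gizmo",  "category": "hardware"},
               3: {"name": "ebook",  "category": "digital"}}
dim_date = {10: {"year": 2024, "quarter": "Q1"},
            11: {"year": 2024, "quarter": "Q2"}}
fact_sales = [  # (product_id, date_id, revenue) -- the measure lives here
    (1, 10, 500.0), (2, 10, 300.0), (1, 11, 700.0), (3, 11, 200.0)]

# Roll up revenue by product category and quarter (a 2-D cube cell).
cube = {}
for product_id, date_id, revenue in fact_sales:
    key = (dim_product[product_id]["category"], dim_date[date_id]["quarter"])
    cube[key] = cube.get(key, 0.0) + revenue

print(cube)
# {('hardware', 'Q1'): 800.0, ('hardware', 'Q2'): 700.0, ('digital', 'Q2'): 200.0}
```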
With the expansion of big data and analytics, organizations are looking to incorporate data streaming into their business processes to make real-time decisions.
Join this webinar as we guide you through the buzz around data streams:
- Market trends in stream processing
- What is stream processing
- How does stream processing compare to traditional batch processing
- High and low volume streams
- The possibilities of working with data streaming and the benefits it provides to organizations
- The importance of spatial data in streams
Data-Intensive Technologies for Cloud Computing - huda2018
This document provides an overview of data-intensive computing technologies for cloud computing. It discusses key concepts like data-parallelism and MapReduce architectures. It also summarizes several data-intensive computing systems including Google MapReduce, Hadoop, and LexisNexis HPCC. Hadoop is an open source implementation of MapReduce while HPCC provides distinct processing environments for batch and online query processing using its proprietary ECL programming language.
This document discusses memory management techniques in operating systems. It covers logical versus physical address spaces, swapping, contiguous allocation, paging, segmentation, and segmentation with paging. Specific techniques discussed include dynamic loading, dynamic linking, overlays, the role of the memory management unit in address translation, and issues like fragmentation that can occur with contiguous allocation.
Memory management is the act of managing computer memory. Its essential requirement is to provide ways to dynamically allocate portions of memory to programs at their request and free them for reuse when no longer needed. This is critical in any advanced computer system where more than a single process may be underway at any time.
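To make the paging mechanics concrete, here is a toy sketch of the address translation an MMU performs; the page size, page-table contents, and addresses are invented:

```python
PAGE_SIZE = 4096  # 4 KiB pages -> the low 12 bits of an address are the offset

# Toy page table: logical page number -> physical frame number.
page_table = {0: 5, 1: 9, 2: 2}

def translate(logical_addr):
    page = logical_addr // PAGE_SIZE    # high bits select the page
    offset = logical_addr % PAGE_SIZE   # low bits pass through unchanged
    if page not in page_table:
        raise MemoryError(f"page fault: page {page} not resident")
    return page_table[page] * PAGE_SIZE + offset

print(hex(translate(0x1234)))  # page 1 -> frame 9: 0x9234
```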
This document provides an overview of MapReduce, a programming model developed by Google for processing and generating large datasets in a distributed computing environment. It describes how MapReduce abstracts away the complexities of parallelization, fault tolerance, and load balancing to allow developers to focus on the problem logic. Examples are given showing how MapReduce can be used for tasks like word counting in documents and joining datasets. Implementation details and usage statistics from Google demonstrate how MapReduce has scaled to process exabytes of data across thousands of machines.
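Below is a minimal, single-machine sketch of the word-count example; the map and reduce functions follow the MapReduce signatures, but the shuffle is simulated in-process rather than by the Hadoop/Google runtime:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # map(k1, v1) -> list of intermediate (word, 1) pairs
    return [(word, 1) for word in text.split()]

def reduce_phase(word, counts):
    # reduce(k2, list of v2) -> aggregated count for that word
    return word, sum(counts)

docs = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}

# Shuffle: group intermediate pairs by key (done by the framework in Hadoop).
groups = defaultdict(list)
for doc_id, text in docs.items():
    for word, one in map_phase(doc_id, text):
        groups[word].append(one)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result["the"])  # 3
```

The appeal of the model is visible even at this scale: the two user-supplied functions contain no parallelism, distribution, or fault-tolerance logic.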
The document discusses various arithmetic operations in computer architecture including the arithmetic logic unit (ALU), addition, subtraction, multiplication using Booth's algorithm, division using restoring and non-restoring algorithms, floating point operations represented in scientific notation, and subword parallelism to perform simultaneous operations on multiple data elements packed within registers. It provides details on the hardware implementation and algorithms for each arithmetic operation.
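As an illustration of the multiplication step, here is a sketch of Booth's algorithm over fixed-width two's-complement values; the function name, bit width, and test operands are arbitrary choices for the example:

```python
def booth_multiply(m, r, bits=8):
    """Booth's algorithm for signed multiplication on two's-complement
    values of the given width, using the A/S/P register formulation."""
    mask = (1 << bits) - 1
    width = 2 * bits + 1                  # P register is m_bits + r_bits + 1 wide
    A = (m & mask) << (bits + 1)          # multiplicand in the upper bits
    S = ((-m) & mask) << (bits + 1)       # its two's-complement negation
    P = (r & mask) << 1                   # multiplier plus an extra low bit

    for _ in range(bits):
        last_two = P & 0b11
        if last_two == 0b01:              # 01: add the multiplicand
            P = (P + A) & ((1 << width) - 1)
        elif last_two == 0b10:            # 10: subtract the multiplicand
            P = (P + S) & ((1 << width) - 1)
        sign = P >> (width - 1)           # arithmetic right shift by one
        P = (P >> 1) | (sign << (width - 1))

    result = P >> 1                       # drop the extra low bit
    if result >= 1 << (2 * bits - 1):     # reinterpret as a signed value
        result -= 1 << (2 * bits)
    return result

print(booth_multiply(3, -4))   # -12
print(booth_multiply(-7, -5))  # 35
```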
Parallel computing and its applications - Burhan Ahmed
Parallel computing is a type of computing architecture in which several processors execute or process an application or computation simultaneously. Parallel computing helps in performing large computations by dividing the workload between more than one processor, all of which work through the computation at the same time. Most supercomputers employ parallel computing principles to operate. Parallel computing is also known as parallel processing.
The document summarizes the CURE clustering algorithm, which uses a hierarchical approach that selects a constant number of representative points from each cluster to address limitations of centroid-based and all-points clustering methods. It employs random sampling and partitioning to speed up processing of large datasets. Experimental results show CURE detects non-spherical and variably-sized clusters better than compared methods, and it has faster execution times on large databases due to its sampling approach.
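The distinctive step in CURE is choosing a constant number of well-scattered representative points per cluster and shrinking them toward the centroid. A minimal sketch of just that step (random sampling, partitioning, and the hierarchical merge loop are omitted; the data and parameters are invented):

```python
def cure_representatives(points, c=4, alpha=0.5):
    """Pick c well-scattered representatives for one cluster and shrink
    them toward the centroid by factor alpha, as in CURE."""
    dim = len(points[0])
    centroid = [sum(p[i] for p in points) / len(points) for i in range(dim)]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    reps = [max(points, key=lambda p: dist(p, centroid))]  # farthest from centroid
    while len(reps) < min(c, len(points)):
        # Next representative: the point farthest from all chosen so far.
        reps.append(max(points, key=lambda p: min(dist(p, r) for r in reps)))

    # Shrinking damps the influence of outliers on inter-cluster distances.
    return [tuple(r[i] + alpha * (centroid[i] - r[i]) for i in range(dim))
            for r in reps]

cluster = [(0, 0), (1, 0), (0, 1), (5, 5), (1, 1)]
print(cure_representatives(cluster, c=2))  # [(3.2, 3.2), (0.7, 0.7)]
```

Measuring cluster distance between such representative sets, rather than between centroids or all points, is what lets CURE find non-spherical clusters.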
The network layer provides two main services: connectionless and connection-oriented. Connectionless service routes packets independently through routers using destination addresses and routing tables. Connection-oriented service establishes a virtual circuit between source and destination, routing all related traffic along the pre-determined path. The document also discusses store-and-forward packet switching, where packets are stored until fully received before being forwarded, and services provided to the transport layer like uniform addressing.
Parallel computing is a type of computation in which many calculations or the execution of processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately due to the physical constraints preventing frequency scaling. As power consumption (and consequently heat generation) by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.
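As a small data-parallelism example in the spirit of this description, the sketch below splits one large computation across several worker processes; the workload and worker count are arbitrary:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Work done independently on one slice of the data (data parallelism)."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    step = len(data) // n_workers
    chunks = [data[i * step:(i + 1) * step] for i in range(n_workers)]

    with Pool(n_workers) as pool:
        # Each worker processes its chunk simultaneously on a separate core.
        total = sum(pool.map(partial_sum, chunks))

    print(total == sum(x * x for x in data))  # True
```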
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p... - Cynthia Saracco
This document provides an overview of IBM's BigInsights product for analyzing big data. It discusses how BigInsights uses the open source Apache Hadoop and Spark platforms as its core with additional IBM technologies and features added on. BigInsights allows users to analyze both structured and unstructured data at large volumes and in real-time. It also integrates with other IBM analytics and data management products to provide a full big data analytics solution.
Gain New Insights by Analyzing Machine Logs using Machine Data Analytics and BigInsights.
Half of Fortune 500 companies experience more than 80 hours of system downtime annually. Spread evenly over a year, that amounts to approximately 13 minutes every day. As a consumer, the thought of online bank operations being inaccessible so frequently is disturbing. As a business owner, when systems go down, all processes come to a stop. Work in progress is destroyed, and failure to meet SLAs and contractual obligations can result in expensive fees, adverse publicity, and loss of current and potential future customers. Ultimately, the inability to provide a reliable and stable system results in lost revenue. While the failure of these systems is inevitable, the ability to predict failures in time and intercept them before they occur is now a requirement.
A possible solution to the problem lies in the huge volumes of diagnostic big data generated at the hardware, firmware, middleware, application, storage and management layers indicating failures or errors. Machine analysis and understanding of this data is becoming an important part of debugging, performance analysis, root cause analysis and business analysis. In addition to preventing outages, machine data analysis can also provide insights for fraud detection, customer retention and other important use cases.
This document discusses the Eclipse Modeling Framework (EMF) and its relationship to the Model Driven Architecture (MDA) standards. It covers how EMF implements aspects of the MOF standard like Ecore aligning with MOF 2.0, EMF XMI mapping to MOF, and EMF Java mapping not fully aligning with JMI. It also discusses how EMF does not currently implement CMI for CORBA mappings. Finally, it outlines several related technologies that EMF and MDA could explore further like aspect-oriented modeling, product line practices, and generative programming.
InfoSphere Streams Technical Overview - Use Cases Big Data - Jerome CHAILLOUX - IBMInfoSphereUGFR
IBM InfoSphere Streams is a platform for processing streaming data in real-time. It allows for the construction of application graphs where data continuously flows between operators. The platform can handle high data volumes and varieties, providing low-latency analysis. It includes various pre-built operators and toolkits for integration, analytics, text processing, and more. Streams supports the development of applications across multiple nodes in a cluster and can automatically distribute and parallelize processing.
InfoSphere BigInsights - Analytics power for Hadoop - field experience - Wilfried Hoge
This document provides an overview and summary of InfoSphere BigInsights, an analytics platform for Hadoop. It discusses key features such as real-time analytics, storage integration, search, data exploration, predictive modeling, and application tooling. Case studies are presented on analyzing binary data and developing applications for transformation and analysis. Partnerships and certifications with other vendors are also mentioned. The document aims to demonstrate how BigInsights brings enterprise-grade features to Apache Hadoop and provides analytics capabilities for business users.
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe... - Romeo Kienzler
The document discusses reference architectures for enterprise big data use cases. It begins by providing background on how databases have scaled over time and the evolution of large-scale data processing. It then discusses the basic idea behind big data use cases, which is to use all available data regardless of structure or source. The document outlines some key requirements like fault tolerance, dynamic scaling, and processing all data types. It proposes an architectural approach using NoSQL databases and cloud computing alongside traditional data warehousing. Finally, it shares two reference architectures - the current IBM approach and a transitional approach.
Value proposition for big data ISV partners 0714 - Niu Bai
This document discusses IBM's Big Data value proposition for ISV partners. It highlights that IBM's Watson Foundations platform provides a complete set of tools to help organizations harness big data and analytics. The platform includes capabilities for data management, analytics, security, and governance. It also notes that IBM InfoSphere BigInsights provides an enterprise-grade Hadoop distribution with additional features for workload optimization, connectors, accelerators, and administration.
Big Data, Big Thinking: Simplified Architecture Webinar Fact Sheet - SAP Technology
This webinar discusses how to simplify IT architecture for handling big data. It explains that SAP's HANA platform allows consolidating transactional and analytical systems onto one platform to process and deliver data in real-time. The webinar also outlines the benefits of Cloudera's Hadoop working with SAP HANA, including keeping historical or unstructured IoT data in Hadoop without duplicating it, and enhancing security and performance through Intel partnerships.
MSP Best Practice: Using Service Blueprints and Strategic IT Roadmaps to Get ... - Kaseya
MSP service delivery expert John Kilian of AntFarm will show you how to use service blueprints and implement strategic IT planning that fully aligns with your managed service offering to help you win more MSP business. New MSP service delivery best practice tips you'll learn:
- Help your clients see through the fog to the road ahead, a smoothly paved road where IT is aligned to meet the needs of their business
- Collaborate on your client's business goals and objectives and develop the IT strategies to support them
- Create the roadmap that will serve as the foundation for your client's IT planning and budgeting
- Deliver seamless integration and program management for the Strategic IT Plan that maps directly into your managed services offering(s)
- Become the ongoing program manager, a trusted advisor, for implementing new solutions that support the strategic IT plan
- Protect your managed services revenue from poachers and wannabes
The document provides an overview of IBM's big data and analytics capabilities. It discusses what big data is, the characteristics of big data including volume, velocity, variety and veracity. It then covers IBM's big data platform which includes products like InfoSphere Data Explorer, InfoSphere BigInsights, IBM PureData Systems and InfoSphere Streams. Example use cases of big data are also presented.
This document discusses different architectures for big data systems, including traditional, streaming, lambda, kappa, and unified architectures. The traditional architecture focuses on batch processing stored data using Hadoop. Streaming architectures enable low-latency analysis of real-time data streams. Lambda architecture combines batch and streaming for flexibility. Kappa architecture avoids duplicating processing logic. Finally, a unified architecture trains models on batch data and applies them to real-time streams. Choosing the right architecture depends on use cases and available components.
Planning, implementation, monitoring and evaluation of health education progr... - Jimma University
The document discusses the planning, implementation, monitoring and evaluation of health education programs. It describes the PRECEDE-PROCEED model, which is a widely used framework for designing, implementing and evaluating health promotion programs. The PRECEDE-PROCEED model involves 5 planning phases (PRECEDE) to identify problems and their causes, followed by 4 implementation phases (PROCEED) which include carrying out the program, and process, impact and outcome evaluation. The document provides an overview of each phase of the model and the steps involved in planning, implementing and evaluating health education programs according to the PRECEDE-PROCEED approach.
The document discusses the five steps of an effective Joint Application Development (JAD) session for gathering requirements: 1) Planning ahead with the project team and executive sponsor, 2) Assembling the right team with defined roles, 3) Ensuring all team members are committed, 4) Staying on course during sessions, and 5) Following through by producing deliverables and evaluating the process. JAD sessions bring together key stakeholders to jointly discuss needs, develop solutions, and gain consensus in a structured workshop format.
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa... - DataStax Academy
Speaker: Mohammed Guller, Application Architect & Lead Developer at Glassbeam.
Learn how Cassandra can be used to build a multi-tenant solution for analyzing operational data from Internet of Complex Things (IoCT). IoCT includes complex systems such as computing, storage, networking and medical devices. In this session, we will discuss why Glassbeam migrated from a traditional RDBMS-based architecture to a Cassandra-based architecture. We will discuss the challenges with our first-generation architecture and how Cassandra helped us overcome those challenges. In addition, we will share our next-gen architecture and lessons learned.
Simplifying Real-Time Architectures for IoT with Apache Kudu - Cloudera, Inc.
3 Things to Learn About:
- Building scalable real-time architectures for managing data from IoT
- Processing data in real time with components such as Kudu & Spark
- Customer case studies highlighting real-time IoT use cases
Enabling Next Gen Analytics with Azure Data Lake and StreamSets - Streamsets Inc.
This document discusses enabling next generation analytics with Azure Data Lake. It provides definitions of big data and discusses how big data is a cornerstone of Cortana Intelligence. It also discusses challenges with big data like obtaining skills and determining value. The document then discusses Azure HDInsight and how it provides a cloud Spark and Hadoop service. It also discusses StreamSets and how it can be used for data movement and deployment on Azure VM or local machine. Finally, it discusses a use case of StreamSets at a major bank to move data from on-premise to Azure Data Lake and consolidate migration tools.
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines - DATAVERSITY
With the aid of any number of data management and processing tools, data flows through multiple on-prem and cloud storage locations before it’s delivered to business users. As a result, IT teams — including IT Ops, DataOps, and DevOps — are often overwhelmed by the complexity of creating a reliable data pipeline that includes the automation and observability they require.
The answer to this widespread problem is a centralized data pipeline orchestration solution.
Join Stonebranch’s Scott Davis, Global Vice President, and Ravi Murugesan, Sr. Solution Engineer, to learn how DataOps teams orchestrate their end-to-end data pipelines with a platform approach to managing automation.
Key Learnings:
- Discover how to orchestrate data pipelines across a hybrid IT environment (on-prem and cloud)
- Find out how DataOps teams are empowered with event-based triggers for real-time data flow
- See examples of reports, dashboards, and proactive alerts designed to help you reliably keep data flowing through your business — with the observability you require
- Discover how to replace clunky legacy approaches to streaming data in a multi-cloud environment
- See what’s possible with the Stonebranch Universal Automation Center (UAC)
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera - Cloudera, Inc.
Transitioning to a Big Data architecture is a big step, and the complexity of moving existing analytical services onto modern platforms like Cloudera can seem overwhelming.
Capgemini Leap Data Transformation Framework with Cloudera - Capgemini
https://www.capgemini.com/insights-data/data/leap-data-transformation-framework
The complexity of moving existing analytical services onto modern platforms like Cloudera can seem overwhelming. Capgemini’s Leap Data Transformation Framework helps clients by industrializing the entire process of bringing existing BI assets and capabilities to next-generation big data management platforms.
During this webinar, you will learn:
• The key drivers for industrializing your transformation to big data at all stages of the lifecycle – estimation, design, implementation, and testing
• How one of our largest clients reduced the transition to modern data architecture by over 30%
• How an end-to-end, fact-based transformation framework can deliver IT rationalization on top of big data architectures
Data & Analytics with CIS & Microsoft Platforms - Sonata Software
Sonata Software provides data and analytics services using Microsoft platforms and technologies. They help customers leverage data to drive intelligent actions and personalization at scale. Sonata has expertise in data warehousing, business analytics, AI, machine learning, and developing industry-specific analytics solutions and AI accelerators on the Microsoft stack. They assist customers with data strategy, analytics, visualization, and migrating to Azure-based platforms.
Watch this webinar in full here: https://buff.ly/2MVTKqL
Self-Service BI promises to remove the bottleneck that exists between IT and business users. The truth is, if data is handed over to a wide range of data consumers without proper guardrails in place, it can result in data anarchy.
Attend this session to learn why data virtualization:
• Is a must for implementing the right self-service BI
• Makes self-service BI useful for every business user
• Accelerates any self-service BI initiative
The document discusses the Common Data Model (CDM) and how to use it. It describes CDM as an open-sourced definition of standard business entities that provides a common data model that can be shared across applications. It outlines how CDM allows building applications faster by composing analytics, user experiences, and automation using integrated Microsoft services. It also discusses moving data into CDM using the Data Integrator and building applications with CDM using PowerApps, the CDS SDK, Microsoft Flow, and Power BI.
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En... - MapR Technologies
In this webinar, Carl W. Olofson, Research Vice President, Application Development and Deployment for IDC, and Dale Kim, Director of Industry Solutions for MapR, will provide an insightful outlook for Hadoop in 2015, and will outline why enterprises should consider using Hadoop as a "Decision Data Platform" and how it can function as a single platform for both online transaction processing (OLTP) and real-time analytics.
SoftWatch provides advanced application usage analytics solutions to support cloud migrations and IT optimization initiatives. It has over 300 enterprise customers and a proven track record. Its SaaS solutions help customers analyze actual application usage, monitor user behavior, and optimize resources to plan and manage cloud migrations and reduce costs. SoftWatch's unique analytics provide deeper insights than competitors by classifying real usage rather than just whether applications are open or closed. This helps customers address challenges in transforming IT environments to the cloud and optimizing software licensing and resources.
There are many useful Data Mining tools available.
The following is a compiled collection of top handpicked Data Mining tools with their prominent features. The reference list includes both open source and commercial resources.
https://www.datatobiz.com/blog/data-mining-tools/
Webinar: Faster Big Data Analytics with MongoDB - MongoDB
Learn how to leverage MongoDB and Big Data technologies to derive rich business insight and build high performance business intelligence platforms. This presentation includes:
- Uncovering Opportunities with Big Data analytics
- Challenges of real-time data processing
- Best practices for performance optimization
- Real world case study
This presentation was given in partnership with CIGNEX Datamatics.
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics - Cynthia Saracco
Learn how to get started with Big Data using a platform based on Apache Hadoop, Apache Spark, and IBM BigInsights technologies. The emphasis here is on free or low-cost options that require modest technical skills.
CSC - Presentation at Hortonworks Booth - Strata 2014 - Hortonworks
Come hear how companies are kick-starting their big data projects without having to find good people, hire them, and get IT to prioritize the work before the project can get off the ground. Remove risk from your project, ensure scalability, and pay for just the nodes you use in a monthly utility pricing model. Worried about data governance or security? Want it in the cloud, or can't have it in the cloud? Eliminate the hurdles with a fully managed service backed by CSC. Get your modern data architecture up and running in as little as 30 days with the Big Data Platform as a Service offering from CSC. Computer Sciences Corporation is a Certified Technology Partner of Hortonworks and a global system integrator with over 80,000 employees worldwide.
zData BI & Advanced Analytics Platform + 8 Week Pilot Programs - zData Inc.
This document describes zData's BI/Advanced Analytics Platform and Pilot Programs. The platform provides tools for storing, collaborating on, analyzing, and visualizing large amounts of data. It offers machine learning and predictive analytics. The platform can be deployed on-premise or in the cloud. zData also offers an 8-week pilot program that provides up to 1TB of data storage and full access to the platform's tools and services to test out the Big Data solution.
Using Visualization to Succeed with Big Data - Pactera_US
The document summarizes a webinar on big data visualization. It discusses drivers for the big data visualization market and new tools emerging. It then profiles several major vendors that offer big data visualization solutions, including Microsoft, QlikView, TIBCO, Tableau, Platfora, Datameer, Splunk, Jaspersoft, and Alpine Data. It concludes with an overview of how Pactera can help clients build advanced analytics solutions.
Pivotal is introducing a new unified data platform to enable companies to modernize using big data analytics, cloud computing, and agile development. The Pivotal data fabric provides a single solution for all data and analytics needs, from batch processing to real-time analytics. It is built on open source technologies like Hadoop and integrates products like Greenplum, Gemfire, and HAWQ to offer both SQL and NoSQL capabilities. The goal is to reduce complexity and costs while providing the scalability, portability, and data agility needed for modern consumer-grade applications.
Cloud Data Services - from prototyping to scalable analytics on cloud - Wilfried Hoge
Presentation from the German customer conference of IBM's Technical Expert Council. It shows how IBM's cloud data services could be used to explore data for new insights or business models.
Is it harder to find a taxi when it is raining? - Wilfried Hoge
Using open data to answer the question of whether it is harder to find a taxi when it is raining. Live demo of analyzing taxi data with dashDB, R, and Bluemix.
Presented at the data2day conference.
innovations born in the cloud - cloud data services from IBM to prototype you... - Wilfried Hoge
To bring your ideas for gaining insights from new data sources to life, you must be able to prototype, fail fast if the ideas don't work, and move easily to production if they are successful. See how IBM's cloud data services can help you start testing your ideas with data.
- The document discusses IBM's Watson cognitive computing platform, which understands natural language, learns from interactions, and generates hypotheses.
- Watson Analytics allows users to analyze data using natural language and includes features like predictive analytics, data visualization, and self-service analytics.
- The document outlines IBM's Watson services like personality insights and describes the process for building cognitive apps using the Watson Developer Cloud.
Analyze Twitter data completely in Bluemix: collect data, add sentiment, copy to an in-memory database, and analyze with R or Watson Analytics. All in the cloud.
Presentation about Big Data from a German webcast: http://business-services.heise.de/it-management/big-data/beitrag/big-data-technologie-einsatzgebiete-datenschutz-160.html?source=IBM_12_2013_IT_Conn
2012.04.26 big insights streams im forum2 - Wilfried Hoge
This document summarizes IBM's Big Data platform called InfoSphere BigInsights and InfoSphere Streams. It discusses how the platform can integrate and manage large volumes, varieties and velocities of data, apply advanced analytics to data in its native form, and enable visualization and development of new analytic applications. It also describes the key components of the BigInsights platform including Hadoop, data integration, governance and various accelerators.
UiPath Test Automation using UiPath Test Suite series, part 6 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, as a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack - shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
How to Get CNIC Information System with Paksim Ga.pptx - danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
TrustArc Webinar - 2024 Global Privacy Survey - TrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence (IndexBug)
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Driving Business Innovation: Latest Generative AI Advancements & Success Story (Safe Software)
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed on the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft Azure OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has used Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos illustrating the full capabilities of FME in AI-driven processes (a minimal Ollama call is sketched after this list).
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
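As a rough illustration of the kind of local-model call an FME workflow can wrap, here is a minimal sketch that posts a prompt to a locally running Ollama server over its REST API. The model name and prompt are assumptions for the example, and all FME-specific wiring (readers, writers, transformers) is omitted.

```python
# Minimal sketch: query a local Ollama model over its REST API.
# Assumes Ollama is running locally and a model (here "llama3",
# an illustrative choice) has been pulled with `ollama pull llama3`.
import json
import urllib.request

payload = {
    "model": "llama3",
    "prompt": "Summarize: unit 42 is a 3-room office suite with river views.",
    "stream": False,  # ask for a single JSON response instead of a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```

Because the call never leaves localhost, no document text is sent to an external service, which is the security angle behind populating models with local data.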
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your costs through an optimized configuration and keep them low going forward.
These topics will be covered:
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
Infrastructure Challenges in Scaling RAG with Custom AI Models (Zilliz)
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
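To make the retrieval step concrete, here is a minimal sketch of the retrieval half of a RAG pipeline, built on the open-source sentence-transformers package with a toy in-memory corpus. The model name, corpus, and helper functions are illustrative assumptions, and the BentoML serving and scaling layer the talk focuses on is omitted.

```python
# Minimal RAG retrieval sketch: embed a corpus, find the nearest
# passages to a query, and assemble them into a grounded prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # open-source text embedding model

corpus = [
    "Milvus is an open-source vector database.",
    "BentoML packages and serves machine learning models.",
    "RAG augments a language model with retrieved context.",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the query (cosine similarity)."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus_emb @ q  # embeddings are normalized, so dot product = cosine
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query: str) -> str:
    """Assemble retrieved passages into a prompt for the generator model."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What does RAG do?"))
```

Swapping the toy corpus for a vector database and serving each component as its own endpoint is where orchestration tools such as BentoML come in.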
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe (Paige Cruz)
Monitoring and observability aren’t traditionally taught in software curriculums, so many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share foundational concepts to build on.
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Threats to mobile devices are increasingly prevalent and growing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features trade security for convenience and capability. This best practices guide outlines steps users can take to better protect personal devices and information.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Building Production Ready Search Pipelines with Spark and Milvus (Zilliz)
Spark is a widely used ETL tool for processing, indexing, and ingesting data into a serving stack for search. Milvus is a production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data, extract vector representations, and push the vectors to the Milvus vector database for search serving.
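As a rough sketch of the ingest-and-serve path, the snippet below uses pymilvus with Milvus Lite (a local, file-backed Milvus) and hard-coded four-dimensional vectors standing in for the embeddings a Spark job would extract from unstructured data; the collection name, IDs, and vectors are all hypothetical.

```python
# Minimal sketch: insert vectors into Milvus and run a similarity search.
# Uses Milvus Lite (a local file) so it runs without a Milvus cluster.
from pymilvus import MilvusClient

client = MilvusClient("search_demo.db")  # local, file-backed Milvus Lite
client.create_collection(collection_name="docs", dimension=4)

# In production these rows would come out of a Spark ETL job that parses
# documents and runs an embedding model; here they are hard-coded.
rows = [
    {"id": 0, "vector": [0.1, 0.9, 0.0, 0.0], "text": "press release"},
    {"id": 1, "vector": [0.8, 0.1, 0.1, 0.0], "text": "support ticket"},
]
client.insert(collection_name="docs", data=rows)

# Serving side: nearest-neighbour search against the ingested vectors.
hits = client.search(
    collection_name="docs",
    data=[[0.7, 0.2, 0.1, 0.0]],  # a query embedding
    limit=1,
    output_fields=["text"],
)
print(hits[0][0]["entity"]["text"])  # -> "support ticket"
```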