Agile Big Data Analytics Development: An Architecture-Centric Approach (SoftServe)
Presented at The Hawaii International Conference on System Sciences by Hong-Mei Chen and Rick Kazman (University of Hawaii), Serge Haziyev (SoftServe).
The Modern Data Architecture for Predictive Analytics with Hortonworks and Re... (Revolution Analytics)
Hortonworks and Revolution Analytics have teamed up to bring the predictive analytics power of R to Hortonworks Data Platform.
Hadoop, a disruptive data processing framework, has made a large impact on today's data ecosystems. Enabling business users to translate existing skills to Hadoop is necessary to encourage adoption and allow businesses to get value out of their Hadoop investment quickly. R, a prolific and rapidly growing data analysis language, now has a place in the Hadoop ecosystem.
This presentation covers:
- Trends and business drivers for Hadoop
- How Hortonworks and Revolution Analytics play a role in the modern data architecture
- How you can run R natively in Hortonworks Data Platform to easily move your R-powered analytics to Hadoop
Presentation replay at:
http://www.revolutionanalytics.com/news-events/free-webinars/2013/modern-data-architecture-revolution-hortonworks/
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra... (Data Con LA)
Why and how has the Big Data-based Enterprise Data Lake solution, built on both NoSQL and SQL technologies, become significantly more effective at solving enterprise data challenges than its predecessor, the EDW, which tried and failed to solve the same problem using SQL databases alone?
Designing Fast Data Architecture for Big Data using Logical Data Warehouse a... (Denodo)
Companies such as Autodesk are fast replacing the once tried-and-true physical data warehouses with logical data warehouses/data lakes. Why? Because they are able to accomplish the same results in one-sixth of the time and with one-quarter of the resources.
In this webinar, Autodesk's Platform Lead, Kurt Jackson, will describe how they designed a modern fast data architecture as a single unified logical data warehouse/data lake using data virtualization and contemporary big data analytics technologies like Spark.
A logical data warehouse / data lake is a virtual abstraction layer over the physical data warehouse, big data repositories, cloud, and other enterprise applications. It unifies both structured and unstructured data in real time to power analytical and operational use cases.
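The abstraction-layer idea described above can be sketched in a few lines of plain Python: queries are answered by federating live sources at request time, with no physical copy. This is a minimal illustration only; the class, source names, and fields are all hypothetical, not the API of Denodo or any real product.

```python
# Minimal sketch of a "logical data warehouse": one virtual query layer
# federates several physical stores at request time (no data is copied).
class LogicalDataWarehouse:
    def __init__(self):
        self.sources = {}  # name -> callable returning rows (list of dicts)

    def register(self, name, fetch):
        self.sources[name] = fetch

    def query(self, field, value):
        # Federate: pull matching rows from every registered source live.
        results = []
        for name, fetch in self.sources.items():
            for row in fetch():
                if row.get(field) == value:
                    results.append({**row, "_source": name})
        return results

# Two simulated physical stores: a relational warehouse and a data lake.
warehouse_rows = [{"customer": "acme", "revenue": 100}]
lake_rows = [{"customer": "acme", "clickstream_events": 42}]

ldw = LogicalDataWarehouse()
ldw.register("edw", lambda: warehouse_rows)
ldw.register("lake", lambda: lake_rows)

# One logical query returns rows from both stores in a single unified view.
unified = ldw.query("customer", "acme")
print(len(unified))
```

The design point is that consumers see one query interface while each source keeps its own storage and freshness, which is what lets structured and unstructured data be combined in real time.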
This presentation will help you understand the basic building blocks of Business Intelligence. Learn how decisions are triggered, the complete decision process and who makes decisions in the corporate world.
More importantly, understand the core components of a Business Intelligence architecture, such as a data warehouse, data mining, OLAP (Online Analytical Processing), OLTP (Online Transaction Processing) and data reporting. Each component plays an integral part in enabling today's managers and decision makers to collect, analyze and interpret data and make it actionable for decision making.
Business intelligence has become an integral capability for ensuring business survival. It is a tool that helps you analyze historical data and forecast the future so that you are always one step ahead in your business.
Please feel free to like, share, and comment!
Data Lakehouse, Data Mesh, and Data Fabric (r1) (James Serra)
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
In this presentation at DAMA New York, Joe started by asking a key question: why are we doing this? Why analyze and share all these massive amounts of data? Basically, it comes down to the belief that in any organization, in any situation, if we can get the data and make it correct and timely, insights from it will become instantly actionable for companies to function more nimbly and successfully. Enabling the use of data can be a world-changing, world-improving activity, and this session presents the steps necessary to get you there. Joe explained the concept of the "data lake" and also emphasized the role of a strong data governance strategy that incorporates seven components needed for a successful program.
For more information on this presentation or Caserta Concepts, visit our website at http://casertaconcepts.com/.
Data Lakes are early in the Gartner hype cycle, but companies are getting value from their cloud-based data lake deployments. Break through the confusion between data lakes and data warehouses and seek out the most appropriate use cases for your big data lakes.
Big Data, IoT, data lake, unstructured data, Hadoop, cloud, and massively parallel processing (MPP) are all just fancy words unless you can find use cases for all this technology. Join me as I talk about the many use cases I have seen, from streaming data to advanced analytics, broken down by industry. I'll show you how all this technology fits together by discussing various architectures and the most common approaches to solving data problems, and hopefully set off light bulbs in your head on how big data can help your organization make better business decisions.
Creating a Next-Generation Big Data Architecture (Perficient, Inc.)
If you've spent time investigating Big Data, you quickly realize that the issues surrounding Big Data are often complex to analyze and solve. The sheer volume, velocity and variety change the way we think about data, including how enterprises approach data architecture.
Significant reduction in costs for processing, managing, and storing data, combined with the need for business agility and analytics, requires CIOs and enterprise architects to rethink their enterprise data architecture and develop a next-generation approach to solve the complexities of Big Data.
Creating the data architecture while integrating Big Data into the heart of the enterprise data architecture is a challenge. This webinar covered:
-Why Big Data capabilities must be strategically integrated into an enterprise’s data architecture
-How a next-generation architecture can be conceptualized
-The key components to a robust next generation architecture
-How to incrementally transition to a next generation data architecture
Data Summit Connect Fall 2020 - Rise of DataOps (Ryan Gross)
Data governance teams attempt to apply manual control at various points for consistency and quality of the data. By thinking of our machine learning data pipelines as compilers that convert data into executable functions and leveraging data version control, data governance and engineering teams can engineer the data together, filing bugs against data versions, applying quality control checks to the data compilers, and other activities. This talk illustrates how innovations are poised to drive process and cultural changes to data governance, leading to order-of-magnitude improvements.
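The "pipelines as compilers" idea above can be made concrete with a small sketch: a versioned dataset is "compiled" through quality checks, and any failure is recorded against that data version, much like a bug filed against a code commit. Everything here (function names, check names, record shapes) is an illustrative assumption, not the speaker's actual tooling.

```python
import hashlib
import json

def version_id(rows):
    # Content-addressed data version: identical data always gets the same id,
    # so bugs and check results can be filed against a stable version.
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

def check_no_nulls(rows, field):
    return all(row.get(field) is not None for row in rows)

def compile_dataset(rows, checks):
    # The "data compiler": input is a versioned dataset, output is either a
    # usable artifact or a failure report tied to the exact data version.
    vid = version_id(rows)
    failures = [name for name, check in checks.items() if not check(rows)]
    if failures:
        # "File a bug" against this data version instead of loading bad data.
        return {"version": vid, "status": "failed", "bugs": failures}
    return {"version": vid, "status": "ok", "rows": rows}

checks = {"no_null_amounts": lambda rows: check_no_nulls(rows, "amount")}
good = compile_dataset([{"amount": 10}, {"amount": 20}], checks)
bad = compile_dataset([{"amount": 10}, {"amount": None}], checks)
print(good["status"], bad["status"])
```

The key design choice is that quality control attaches to an immutable data version rather than to a point in a manual review process, which is what lets governance and engineering teams collaborate the way they already do on code.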
Building an Enterprise Advanced Analytics Platform (Haoran Du)
By Raymond Fu - Practice Architect
This lecture talks about the best practices in building an advanced analytics platform to help companies apply machine learning, deep learning and data science to their structured and unstructured data.
At Southern California Data Science Conference Sept.25.2016 at USC
http://socaldatascience.org/
http://www.datalaus.com/en/
The importance of efficient data management for Digital Transformation (MongoDB)
Digital Transformation has developed from hype into a "standard" tool for businesses that need to modernise and compete. Experiencing pressure from new market entrants, incumbents are challenged on a daily basis to redefine their ways of doing business. This doesn't only include people and processes, but of course also the underlying technology. With data being the force behind the most successful transformation stories of the past years, we explore some of the challenges of legacy Information Management Systems and look at new ways of managing Data in Motion, Data at Rest, and Data in Use to drive a successful Digital Transformation programme and gain a competitive advantage.
BI is the “Gathering of data from multiple sources to present it in a way that allows executives to make better business decisions”. I will describe in more detail exactly what BI is, what encompasses the Microsoft BI stack, why it is so popular, and why a BI career pays so much. I will review specific examples from previous projects of mine that show the benefits of BI and its huge return-on-investment. I'll go into detail on the components of a BI solution, and I will discuss key concepts for successfully implementing BI in your organization.
Use cases for Hadoop and Big Data Analytics - InfoSphere BigInsights (Gord Sissons)
This presentation is from TDWI's event in Boston during the summer of 2014. IBM InfoSphere BigInsights is IBM's enterprise-grade Hadoop offering. It combines the best of open-source Hadoop with advanced capabilities, including Big SQL, that clients can optionally deploy to get to market faster with a variety of big data and analytic applications.
Big data architectures and the data lake (James Serra)
With so many new technologies, it can get confusing to find the best approach to building a big data architecture. The data lake is a great new concept, usually built in Hadoop, but what exactly is it and how does it fit in? In this presentation I'll discuss the four most common patterns in big data production implementations, the top-down vs. bottom-up approach to analytics, and how you can use a data lake and an RDBMS data warehouse together. We will go into detail on the characteristics of a data lake and its benefits, and how you still need to perform the same data governance tasks in a data lake as you do in a data warehouse. Come to this presentation to make sure your data lake does not turn into a data swamp!
Incorporating the Data Lake into Your Analytic Architecture (Caserta)
Joe Caserta, President at Caserta Concepts presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented Incorporating the Data Lake into Your Analytics Architecture.
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
Shows some advanced REXX techniques to make your programs more efficient and more readable for easier debugging. Also describes some tips for creating file and program structures not discussed in a typical REXX class.
Transform your DBMS to drive engagement innovation with Big Data (Ashnikbiz)
Erik Baardse and Ajit Gadge from EDB Postgres presented on how to transform your DBMS in order to drive digital business: how Postgres enables you to support a wider range of workloads with your relational database, which opens the Big Data doors. They also cover EnterpriseDB's strategy around Big Data, which focuses on three areas, and finally, last but not least, how to find money in IT with Big Data and digital transformation.
Customer value analysis of big data products (Vikas Sardana)
Business value analysis through Customer Value Model for software technology choices with a case study from Mobile Advertising industry for Big Data use case.
The Common BI/Big Data Challenges and Solutions presented by seasoned experts, Andriy Zabavskyy (BI Architect) and Serhiy Haziyev (Director of Software Architecture).
This was a complimentary workshop where attendees had the opportunity to learn, network and share knowledge during the lunch and education session.
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios (kcmallu)
What's the origin of Big Data? What are the real life usage scenarios where Hadoop has been successfully adopted? How do you get started within your organizations?
The next-generation user experience should move to customer engagement zones along customers' preferred channels, with desired action-to-outcome approaches. With a wealth of information at its disposal, ranging from inventory to inquiry, weather to warehouse alerts, and product to promotion data, enterprise digitization can create value at every customer touch point. Attendees witnessed the manifestation of TCS' thought leadership in the Game of Retail.
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization (Denodo)
Watch here: https://bit.ly/2NGQD7R
In an era increasingly dominated by advancements in cloud computing, AI and advanced analytics it may come as a shock that many organizations still rely on data architectures built before the turn of the century. But that scenario is rapidly changing with the increasing adoption of real-time data virtualization - a paradigm shift in the approach that organizations take towards accessing, integrating, and provisioning data required to meet business goals.
As data analytics and data-driven intelligence take centre stage in today's digital economy, logical data integration across the widest variety of data sources, with a proper security and governance structure in place, has become mission-critical.
Attend this session to learn:
- How you can meet cloud and data science challenges with data virtualization
- Why data virtualization is increasingly finding enterprise-wide adoption
- How customers are reducing costs and improving ROI with data virtualization
Horses for Courses: Database Roundtable (Eric Kavanagh)
The blessing and curse of today's database market? So many choices! While relational databases still dominate the day-to-day business, a host of alternatives has evolved around very specific use cases: graph, document, NoSQL, hybrid (HTAP), column store, the list goes on. And the database tools market is teeming with activity as well. Register for this special Research Webcast to hear Dr. Robin Bloor share his early findings about the evolving database market. He'll be joined by Steve Sarsfield of HPE Vertica, and Robert Reeves of Datical in a roundtable discussion with Bloor Group CEO Eric Kavanagh. Send any questions to info@insideanalysis.com, or tweet with #DBSurvival.
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En... (MapR Technologies)
In this webinar, Carl W. Olofson, Research Vice President, Application Development and Deployment for IDC, and Dale Kim, Director of Industry Solutions for MapR, will provide an insightful outlook for Hadoop in 2015, and will outline why enterprises should consider using Hadoop as a "Decision Data Platform" and how it can function as a single platform for both online transaction processing (OLTP) and real-time analytics.
So you've got a handle on what Big Data is and how you can use it to find business value in your data. Now you need an understanding of the Microsoft products that can be used to create a Big Data solution. Microsoft has many pieces of the puzzle, and in this presentation I will show how they fit together. How does Microsoft enhance and add value to Big Data? From collecting data, transforming it, and storing it, to visualizing it, I will show you Microsoft's solutions for every step of the way.
Big Data Made Easy: A Simple, Scalable Solution for Getting Started with Hadoop (Precisely)
With so many new, evolving frameworks, tools, and languages, a new big data project can lead to confusion and unwarranted risk.
Many organizations have found Data Warehouse Optimization with Hadoop to be a good starting point on their Big Data journey. Offloading ETL workloads from the enterprise data warehouse (EDW) into Hadoop is a well-defined use case that produces tangible results for driving more insights while lowering costs. You gain significant business agility, avoid costly EDW upgrades, and free up EDW capacity for faster queries. This quick win builds credibility and generates savings to reinvest in more Big Data projects.
A proven reference architecture that includes everything you need in a turnkey solution, including the Hadoop distribution, data integration software, servers, networking and services, makes it even easier to get started.
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ... (Precisely)
Tackling the challenge of designing a machine learning model and putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust production data pipelines has its own set of challenges. Syncsort software helps the data engineer every step of the way.
Building on the process of finding and matching duplicates to resolve entities, the next step is to set up a continuous streaming flow of data from data sources so that as the sources change, new data automatically gets pushed through the same transformation and cleansing data flow – into the arms of machine learning models.
Some of your sources may already be streaming, but the rest are sitting in transactional databases that change hundreds or thousands of times a day. The challenge is that you can't affect the performance of data sources that run key applications, so putting something like database triggers in place is not the best idea. Using Apache Kafka or similar technologies as the backbone for moving data around doesn't by itself solve the problem of grabbing changes from the source, pushing them into Kafka, and consuming the data from Kafka for processing. If something unexpected happens, like connectivity being lost on either the source or the target side, you don't want to have to fix it or start over because the data is out of sync.
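Why a durable log with consumer offsets survives that failure mode can be sketched in plain Python (no real Kafka involved): if the consumer dies, it resumes from its last committed offset instead of starting over or losing sync. The list-backed log, the `produce`/`consume` helpers, and the change records are all illustrative stand-ins, not Kafka's actual API.

```python
# Stand-ins for a Kafka topic partition and a consumer group's stored offset.
log = []
committed = {"offset": 0}

def produce(change):
    # The CDC layer appends each source change to the durable log.
    log.append(change)

def consume(process):
    # Read from the last committed offset; advance only after processing
    # succeeds, so a crash mid-stream never skips or re-applies a change.
    while committed["offset"] < len(log):
        process(log[committed["offset"]])
        committed["offset"] += 1

applied = []
produce({"op": "insert", "id": 1})
produce({"op": "update", "id": 1})
consume(applied.append)

# Simulated outage: the consumer is down while new changes accumulate...
produce({"op": "delete", "id": 1})
# ...and on restart it picks up exactly where it left off.
consume(applied.append)
print(len(applied), committed["offset"])
```

The design point is that durability plus committed offsets decouples source and target availability: either side can disconnect and recover without manual resynchronization.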
View this 15-minute webcast on-demand to learn how to tackle these challenges in large scale production implementations.
Choosing technologies for a big data solution in the cloud (James Serra)
Has your company been building data warehouses for years using SQL Server? And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support "Big Data"? What technologies and tools should you use? That is what this presentation will help you answer. First we will cover what questions to ask concerning data (type, size, frequency), reporting, performance needs, on-prem vs. cloud, staff technology skills, OSS requirements, cost, and MDM needs. Then we will show you common big data architecture solutions and help you to answer questions such as: Where do I store the data? Should I use a data lake? Do I still need a cube? What about Hadoop/NoSQL? Do I need the power of MPP? Should I build a "logical data warehouse"? What is this lambda architecture? Can I use Hadoop for my DW? Finally, we'll show some architectures of real-world customer big data solutions. Come to this session to get started down the path to making the proper technology choices in moving to the cloud.
Hadoop and the Data Warehouse: When to Use Which (DataWorks Summit)
In recent years, Apache™ Hadoop® has emerged from humble beginnings to disrupt the traditional disciplines of information management. As with all technology innovation, hype is rampant, and data professionals are easily overwhelmed by diverse opinions and confusing messages.
Even seasoned practitioners sometimes miss the point, claiming for example that Hadoop replaces relational databases and is becoming the new data warehouse. It is easy to see where these claims originate since both Hadoop and Teradata® systems run in parallel, scale up to enormous data volumes and have shared-nothing architectures. At a conceptual level, it is easy to think they are interchangeable, but the differences overwhelm the similarities. This session will shed light on the differences and help architects, engineering executives, and data scientists identify when to deploy Hadoop and when it is best to use MPP relational database in a data warehouse, discovery platform, or other workload-specific applications.
Two of the most trusted experts in their fields, Steve Wooledge, VP of Product Marketing from Teradata and Jim Walker of Hortonworks will examine how big data technologies are being used today by practical big data practitioners.
Health care, or healthcare, is the maintenance or improvement of health via the diagnosis, treatment, and prevention of disease, illness, injury, and other physical and mental impairments in human beings.
Manufacturing is the production of merchandise for use or sale using labour and machines, tools, chemical and biological processing, or formulation. The term may refer to a range of human activity, from handicraft to high tech, but is most commonly applied to industrial production, in which raw materials are transformed into finished goods on a large scale.
Logistics is the function of making goods and other resources physically available for use as and when required. This generally includes two basic activities: moving or transporting these resources, and storing them at different locations until required for use or further transportation.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
3. Big Data is a Hot Topic Because Technology Makes it Possible to Analyze ALL Available Data
Cost-effectively manage and analyze all available data in its native form: unstructured, structured, and streaming.
Typical sources: ERP, CRM, RFID, websites, network switches, social media, billing.
4. BIG DATA is not just HADOOP
• Manage and store huge volumes of any data: Hadoop File System, MapReduce
• Manage streaming data: Stream Computing
• Analyze unstructured data: Text Analytics Engine
• Structure and control data: Data Warehousing
• Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM
• Understand and navigate federated big data sources: Federated Discovery and Navigation
5. Business-Centric Big Data Enables You to Start With a Critical Business Pain and Expand the Foundation for Future Requirements
“Big data” isn’t just a technology; it’s a business strategy for capitalizing on information resources.
• Getting started is crucial
• Success at each entry point is accelerated by products within the Big Data platform
• Build the foundation for future requirements by expanding further into the big data platform
6. 1 – Unlock Big Data
Customer need
• Understand existing data sources
• Search and navigate data within existing systems
• No copying of data
Value statement
• Get up and running quickly
• Discover and retrieve big data
• Business users can work directly even with big data sources
Solution
• Vivisimo Velocity, renamed to IBM InfoSphere DataDiscovery
7. 2 – Analyze Raw Data
Customer need
• Ingest data as-is into Hadoop
• Combine it with data from the DWH
• Process very large volumes of data
Value statement
• Gain new insight
• Overcome the high cost of converting data from unstructured to structured format
• Experiment with analysis on different data sets and combine them with other sources
Solution
• IBM InfoSphere BigInsights
8. Merging the Traditional and Big Data Approaches
Traditional approach: structured and repeatable analysis
• Business users determine what question to ask
• IT structures the data to answer that question
• Examples: monthly sales reports, profitability analysis, customer surveys
Big data approach: iterative and exploratory analysis
• IT delivers a platform to enable creative discovery
• Business explores what questions could be asked
• Examples: brand sentiment, product strategy, maximum asset utilization
9. InfoSphere BigInsights is more than just HADOOP
• IBM InfoSphere BigInsights is much more than Hadoop
• The IBM Big Data platform includes much more than IBM InfoSphere BigInsights
10. Hadoop
Open-source software framework from Apache, inspired by Google MapReduce and GFS (the Google File System).
Core components:
• HDFS
• MapReduce
11. InfoSphere BigInsights: Platform for volume, variety, velocity
Enhanced Hadoop foundation:
• Analytics: text analytics and tooling, application accelerators
• Usability: web console, spreadsheet-style tool, ready-made “apps”
• Enterprise class: storage, security, cluster management
• Integration: connectivity to Netezza, DB2, JDBC databases, etc.
Editions (both can also run on top of Apache Hadoop):
• Basic Edition: free download, integrated install, online InfoCenter, BigData University
• Enterprise Edition (licensed): adds breadth of capabilities and enterprise-class features, including application accelerators, pre-built applications, text analytics, the spreadsheet-style tool, RDBMS and warehouse connectivity, administrative tools and security, Eclipse development tools, and performance enhancements
12. Spreadsheet-style Analysis
• Web-based analysis and visualization
• Spreadsheet-like interface
• Define and manage long-running data collection jobs
• Analyze the text content of the pages that have been retrieved
13. Build a Big Data Program – MapReduce example
Eclipse tools for Jaql, Hive, Pig, Java MapReduce, BigSheets plug-ins, text analytics, etc.
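Real Hadoop MapReduce jobs are written in Java (or generated from higher-level languages such as Jaql, Hive, and Pig). Purely to illustrate the programming model, here is a minimal sketch of the canonical MapReduce example, word count, with the map, shuffle/sort, and reduce stages simulated in plain Python; input splits, task distribution, and everything else a real cluster does are deliberately elided.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle/sort pairs by key, then reduce: sum the counts per word."""
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

docs = ["big data is a hot topic", "big data is not just hadoop"]
counts = dict(reduce_phase(map_phase(docs)))
print(counts["big"], counts["data"], counts["hadoop"])  # 2 2 1
```

The shape is the same in a real job: the framework sorts and groups the mapper output by key before handing each group to a reducer.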
14. Jaql – IBM’s programming language in the Hadoop world
Jaql is a complete solutions environment supporting all other BigInsights components.
Integration point for various analytics:
– Text analytics (BigInsights Text Analytics)
– Statistical analysis (R module)
– Machine learning (SystemML)
– Ad-hoc analysis (BigSheets)
Integration point for various data sources:
– Local and distributed file systems
– NoSQL databases
– Content repositories
– Relational sources (warehouses, operational databases; integration with DB2, Netezza, Streams, …)
Architecture: Jaql modules and the Jaql core operators sit on top of the Jaql I/O layer, which connects to DFS, NoSQL, RDBMS, and local file systems.
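Jaql queries JSON data through a pipe syntax (read, filter, transform). As a rough illustration of that idea only (this is Python, not Jaql syntax, and the records are invented):

```python
import json

# JSON-like records, of the kind Jaql would read from HDFS or a local file
employees = [
    {"name": "Jon",  "dept": "sales",   "income": 20000},
    {"name": "Jana", "dept": "finance", "income": 45000},
    {"name": "Petr", "dept": "sales",   "income": 52000},
]

# The Jaql-style pipeline: filter records, then transform (project) fields
high_earners = [
    {"name": e["name"], "dept": e["dept"]}
    for e in employees
    if e["income"] > 30000
]

print(json.dumps(high_earners))
```

In Jaql itself, the same filter/transform steps would be chained with the pipe operator and pushed down to MapReduce over the cluster.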
16. 3 – Simplify your warehouse
Customer need
• SIGNIFICANTLY improve DWH performance
• SIGNIFICANTLY reduce DWH administration costs
Value statement
• Speed: 10–100x better performance
• Simplicity: administration costs reduced by 75%–90%
• Scalability
• Smart system
• In-database analytics
• Out-of-the-box integration with SPSS
Solution
• IBM Netezza, renamed to PureData System for Analytics
17. Analyst: I need to evaluate the possible relationship between client salary and overdrafts.
IT: OK. We have to evaluate a lot of statistics and set the correct DB indexes and partitioning. It will take us 5 days.
18. IT: Done. You can run your analytical query.
Analyst: Great, thanks a lot. I’m going to check the results.
19. Analyst: Great, I can see some nice correlations here. Now I need to look at it from a different perspective.
IT: Ohhh, welcome, dear friend. Understood. So, it’s … another 5 days of our work.
Analyst: Noooo!!! It’s not possible to work here!
21. Analyst: I need to evaluate the possible relationship between client salary and overdrafts. I will use Netezza.
22. Analyst: Great, I can see some nice correlations here. Now I need to look at it from a different perspective. With Netezza I can run the query immediately and get the response just as fast.
Meanwhile, IT can do something else that is much more useful.
24. Built-In Expertise Makes This as Simple as an Appliance
• Dedicated device
• Optimized for purpose
• Complete solution
• Fast installation
• Very easy operation
• Standard interfaces
• Low cost
25. IBM Netezza was renamed to IBM PureData System for Analytics in October 2012.
26. Netezza Genesis in T-Mobile CZ
Proof-of-concept project:
– New Enterprise Data Warehouse platform selection
– Comparison of the existing platform with other platforms
– Selection criteria:
• Performance
• Operational savings
… and the winner was: Netezza
27. Netezza Genesis in T-Mobile CZ
Expectations:
• Significant response improvement: a faster platform means better report response times
• Direct data availability: higher trust in data, one version of the truth
• Aggregation reduction: any attribute available
• Operational benefits: storage savings (no data replicas), administration (DBA) cost reduction
• Infrastructure simplification: lower environment complexity
29. Netezza Genesis in T-Mobile CZ
Actual status:
• All relevant ETL processing redesigned
• Parallel run of the original and Netezza platforms finished
• Netezza is now the only primary platform
30. Real Netezza experience from T-Mobile Czech Republic: RESPONSE TIME MASSIVELY IMPROVED

Report                                          | Original platform | Netezza
Workflow reporting                              | 2 hours           | 1 minute
Invoicing and payments reporting:
  Payment discipline of current-month invoices  | 33 minutes        | 17 seconds
  Overdue debt of invoices in current month     | 10 hours          | 23 seconds
  Average monthly invoice figures               | 50 minutes        | 38 seconds
31. 4 – Reduce costs with Hadoop
Customer need: data is SIGNIFICANTLY too expensive
• Too much data: too expensive to store and maintain
• A big portion is kept “just in case”
• Data volume keeps growing, making it ever more expensive
• Keeping all data in a standard DWH is therefore too expensive
Value statement
• Leverage Hadoop’s parallel-processing architecture
• Hadoop uses cheap commodity hardware
• Business users can keep working in the same or a similar way
Solution
• IBM InfoSphere BigInsights
32. BigInsights and the data warehouse
• BigInsights serves as a query-ready archive for “cold” warehouse data
• Big data analytic applications run directly against BigInsights
• Traditional analytic tools (e.g. from Cognos BI) reach it via Hive JDBC
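The “cold archive” pattern amounts to partitioning warehouse data by age: recent rows stay in the DWH, old rows move to cheap Hadoop storage but remain queryable through Hive. A hedged sketch of the selection logic only; the field names, dates, and the 90-day threshold are invented for illustration:

```python
from datetime import date, timedelta

def split_hot_cold(rows, today, keep_days=90):
    """Partition warehouse rows by age: rows newer than the cutoff stay
    in the DWH ("hot"); older rows are candidates for the Hadoop-based
    cold archive."""
    cutoff = today - timedelta(days=keep_days)
    hot = [r for r in rows if r["sale_date"] >= cutoff]
    cold = [r for r in rows if r["sale_date"] < cutoff]
    return hot, cold

rows = [
    {"id": 1, "sale_date": date(2013, 1, 5)},   # old  -> archive candidate
    {"id": 2, "sale_date": date(2013, 5, 20)},  # recent -> keep in DWH
]
hot, cold = split_hot_cold(rows, today=date(2013, 6, 1))
print([r["id"] for r in hot], [r["id"] for r in cold])  # [2] [1]
```

In practice the split is usually expressed as date-partitioned tables, with old partitions exported to HDFS and registered as Hive external tables.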
33. Future: the SQL interface
Architecture: an application issues SQL through a JDBC/ODBC driver to a JDBC/ODBC server; the SQL engine inside InfoSphere BigInsights then queries the underlying data sources (Hive tables, HBase tables, CSV files).
• Rich SQL query capabilities: SQL ’92 and 2011 features, correlated subqueries, windowed aggregates
• SQL access to all data stored in InfoSphere BigInsights
• Robust JDBC/ODBC support
• Takes advantage of key features of each data source
• Leverages MapReduce parallelism OR achieves low latency
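The two SQL features named above, windowed aggregates and correlated subqueries, can be demonstrated with any SQL engine. A minimal sketch using Python’s built-in sqlite3 in place of the BigInsights JDBC driver (the table and data are invented; window functions require SQLite 3.25+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("east", 1, 100), ("east", 2, 150), ("west", 1, 80), ("west", 2, 60),
])

# Windowed aggregate: running total of sales per region
running = con.execute("""
    SELECT region, month,
           SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales ORDER BY region, month
""").fetchall()

# Correlated subquery: regions whose month-1 sales exceed their own average
above_avg = con.execute("""
    SELECT region FROM sales s
    WHERE month = 1
      AND amount > (SELECT AVG(amount) FROM sales WHERE region = s.region)
""").fetchall()

print(running)
print(above_avg)  # [('west',)]
```

Against BigInsights the same statements would go through the JDBC/ODBC driver, with the engine deciding between MapReduce parallelism and a low-latency path.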
34. 5 – Analyze Streaming Data
Customer need
• Process and leverage streaming data
• Select valuable data from the stream for future processing
• Quickly process data that becomes useless if not processed immediately
Value statement
• React in real time to seize an opportunity before it expires
• Periodically adjust streaming models based on analysis of data at rest
Solution
• IBM InfoSphere Streams: streams computing that turns streaming data sources into ACTION
35. Why and when to use InfoSphere Streams?
Typical data sources:
• Sensors: environmental, industrial, GPS, …
• Images, videos, …
• Data exhaust: network data, system logs (web server, app server), …
• High-rate transaction data: financial transactions, CDRs
At least 2 criteria from the list below should be fulfilled:
• Processing in isolation, or in limited windows (time / number of records)
• Non-traditional formats included: spatial data, images, text, voice, …
• Integration challenges: different connection methods, different data rates, different processing requirements
• Multiple processing nodes: volume or rate so high that scalability is required
• Sub-millisecond latency: immediate analysis and response
• The store-and-mine approach doesn’t work because of the very high volume of data (and its rates)
Streams suits applications needing on-the-fly processing, filtering, and analysis of streaming data.
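The “limited windows (time / number of records)” criterion is the heart of stream processing: state is bounded, so the store-and-mine problem never arises. A hedged sketch of a count-based sliding window computing a moving average, with plain Python standing in for Streams’ windowed operators (the readings are invented):

```python
from collections import deque

def windowed_average(stream, size):
    """Process a stream in a count-based sliding window: as each tuple
    arrives, emit the average of the last `size` values."""
    window = deque(maxlen=size)   # oldest tuples fall out automatically
    for value in stream:
        window.append(value)
        if len(window) == size:   # emit only once the window is full
            yield sum(window) / size

# e.g. sensor readings arriving one by one (50 is a spike)
readings = [10, 12, 11, 50, 13, 12]
averages = list(windowed_average(readings, size=3))
print(averages)  # four window averages; the first is 11.0
```

Because the deque is bounded, memory stays constant no matter how long the stream runs, which is exactly why windowing makes sub-millisecond, always-on analysis feasible.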
36. Streams and BigInsights: Integrated Analytics on Data in Motion and Data at Rest
1. Data ingest: InfoSphere Streams handles data ingest, preparation, online analysis, and model validation.
2. Bootstrap/enrich: InfoSphere BigInsights, databases, and warehouses handle data integration, data mining, machine learning, and statistical modeling.
3. Adaptive analytics model: a control flow feeds the updated model back to Streams, with visualization of real-time and historical insights.
37. The Platform Advantage
The IBM Big Data Platform underpins a range of analytic applications: BI/reporting, exploration/visualization, functional apps, industry apps, predictive analytics, and content analytics. It provides systems management, application development, visualization & discovery, and accelerators on top of three engines (Hadoop system, stream computing, data warehouse), with information integration & governance across all of them.
BENEFITS IN DETAIL
• Increase value over time by moving from an entry point to a 2nd and 3rd project
• Lower deployment costs through shared components and integration
• Points of leverage: shared text analytics for Streams and BigInsights; HDFS connectors (data integration/ETL, Streams); accelerators built across multiple engines
engines