IBM PureData System
for Analytics
Powered by Netezza
Hossein Sarshar
Agenda
• What is PureData and Netezza
o History
o Characteristics
o Product chain
• PureData Hardware Architecture
o Introduction
o Hardware architecture
o Parallel structures
• Analytics with PureData
o Introduction
o In-database analytics tools
• Demo
IBM® PureData™ for Analytics 2
What is PureData and
Netezza
PureSystems: PureFlex, PureApplication, PureData
In 2010, IBM acquired Netezza, an analytics platform company
founded in 2000 in Marlborough, MA.
IBM later rebranded the product as PureData.
PureSystems Product
Family
PureFlex:
o Combines and optimizes compute, storage, networking and virtualization
capabilities under a single, unified management console into an
infrastructure system.
PureApplication:
o Is a platform system designed and tuned specifically for transactional web
and database applications.
PureData:
o Based on Netezza technology, PureData is all data experts need in a
single, well-tuned appliance.
o PureData variants: Transactions, Operational Analytics, Analytics
PureSystems
Characteristics
• Built-in Expertise:
o No indexing, tuning, or partitioning.
o Fully parallel, optimized in-database analytics.
o No storage administration.
o No software installation.
• Integration by Design:
o Server, storage, and database in one easy-to-use package.
o Automatic parallelization and resource optimization to scale economically.
o Enterprise-class security and platform management.
• Simplified Experience:
o Up and running in hours.
o Minimal ongoing administration.
o Standard interfaces to best-of-breed analytics, BI, and data integration tools.
o Built-in analytics capabilities allow users to derive insight from data quickly.
o Easy connectivity to other Big Data Platform components.
Each of these comes as an appliance, comparable to a
simplified yet powerful private cloud with
minimal administration.
PureData Introduction
• It is a data warehousing and data analytics
appliance that is fast enough to process terabytes
of data in seconds. It is a fully parallel machine.
• Netezza’s core technology uses FPGAs (Field-
Programmable Gate Arrays) to filter out
unnecessary data in parallel.
• PureData uses Netezza technology to perform
deep analytics on huge amounts of data in a
reasonable time.
• It is purpose-built for high-performance analytics.
• It supports common schema designs (3NF, star,
denormalized tables).
PureData Architecture
• Disk storage
o RAID 1 disks
o High-speed data streams
• SMP Host
o Red Hat Linux servers
o Optimizer and compiler
o A gateway to the system
• Snippet-Blades
o Query accelerator using FPGAs
S-Blades (SPU)
S-Blades
Components shown in the diagram:
• IBM BladeCenter Server (Intel Quad-Core)
• Netezza DB Accelerator (Dual-Core FPGA)
• DRAM
• Two SAS Expander Modules
S-Blades Overview
• Each S-Blade has 8 Intel cores on the IBM BladeCenter server
and 8 FPGA cores on the Netezza DB Accelerator.
o An FPGA is about the same size as a CPU, consumes about 5 times less
power, and runs at a clock speed about 5 times lower.
o More caching capability
o Low latency and high throughput
• Each S-Blade takes ownership of 6-8 disks.
• Queries are divided into subqueries (snippets) that are
processed by the S-Blades.
PureData AMPP (Shared-Nothing) Architecture
• Applications: Advanced Analytics, Loader, ETL, BI
• Netezza Appliance:
o Hosts: SMP Host
o Network Fabric
o S-Blades™: each with its own FPGA, memory, and CPU
o Disk Enclosures
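The shared-nothing idea in this diagram can be sketched in a few lines of Python. This is an illustrative sketch only, not Netezza code: the function names (`distribute`, `run_snippet`, `host_query`), the three-blade setup, the CRC-based row distribution, and the sample rows are all assumptions made for the example. Each blade scans only its own slice and returns a partial aggregate, and the host merges the partials.

```python
import zlib
from collections import defaultdict

NUM_BLADES = 3  # hypothetical; the diagram happens to show three S-Blades


def distribute(rows, key, num_blades=NUM_BLADES):
    """Assign every row to exactly one blade (hash distribution is an assumption)."""
    slices = [[] for _ in range(num_blades)]
    for row in rows:
        blade = zlib.crc32(str(row[key]).encode("utf-8")) % num_blades
        slices[blade].append(row)
    return slices


def run_snippet(data_slice, market):
    """Runs on one blade: scan only the local slice, return a partial aggregate."""
    partial = defaultdict(int)
    for row in data_slice:
        if row["MARKET"] == market:
            partial[row["DISTRICT"]] += row["NRX"]
    return dict(partial)


def host_query(slices, market):
    """Runs on the host: send the same snippet to every blade, merge the partials."""
    total = defaultdict(int)
    for data_slice in slices:  # conceptually parallel; sequential in this sketch
        for district, nrx in run_snippet(data_slice, market).items():
            total[district] += nrx
    return dict(total)


# Made-up sample rows, loosely modelled on the column names used in this deck.
rows = [
    {"DISTRICT": "North", "MARKET": 509123, "NRX": 10},
    {"DISTRICT": "South", "MARKET": 509123, "NRX": 7},
    {"DISTRICT": "North", "MARKET": 509123, "NRX": 5},
    {"DISTRICT": "North", "MARKET": 111111, "NRX": 99},
]
slices = distribute(rows, key="DISTRICT")
result = host_query(slices, market=509123)
print(result)
```

Because no blade ever reads another blade's slice, the merged result does not depend on how the rows were distributed; only the balance of work per blade does.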
FPGA Secret Sauce
Example query:
select DISTRICT,
PRODUCTGRP,
sum(NRX)
from MTHLY_RX_TERR_DATA
where MONTH = '20091201'
and MARKET = 509123
and SPECIALTY = 'GASTRO'
Processing pipeline for each slice of table MTHLY_RX_TERR_DATA (compressed):
• FPGA Core
o Uncompress
o Project: select DISTRICT, PRODUCTGRP, sum(NRX)
o Restrict, Visibility: where MONTH = '20091201' and MARKET = 509123 and SPECIALTY = 'GASTRO'
• CPU Core
o Complex ∑, Group by, …: sum(NRX)
Using FPGAs eliminates a tremendous amount of
unnecessary data movement.
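The division of work on this slide can be mimicked with a small Python pipeline. This is a rough sketch under stated assumptions, not how the hardware is programmed: `store_slice`, `fpga_stage`, and `cpu_stage` are hypothetical names, zlib and JSON stand in for the on-disk compression format, the sample rows are invented, the visibility check is not modelled, and grouping by DISTRICT and PRODUCTGRP is an assumption (the slide elides the group-by clause).

```python
import json
import zlib
from collections import defaultdict


def store_slice(rows):
    """Stand-in for a compressed table slice on disk."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def fpga_stage(compressed_slice, predicate, columns):
    """Uncompress, restrict (WHERE clause), and project (needed columns only)."""
    rows = json.loads(zlib.decompress(compressed_slice).decode("utf-8"))
    for row in rows:
        if predicate(row):
            yield {c: row[c] for c in columns}


def cpu_stage(filtered_rows):
    """Aggregate sum(NRX) per (DISTRICT, PRODUCTGRP) on the reduced stream."""
    totals = defaultdict(int)
    for row in filtered_rows:
        totals[(row["DISTRICT"], row["PRODUCTGRP"])] += row["NRX"]
    return dict(totals)


def make_row(district, group, nrx, month="20091201", specialty="GASTRO"):
    """Build one invented sample row with the columns the query touches."""
    return {"DISTRICT": district, "PRODUCTGRP": group, "NRX": nrx,
            "MONTH": month, "MARKET": 509123, "SPECIALTY": specialty}


table_slice = store_slice([
    make_row("D1", "P1", 4),
    make_row("D1", "P1", 6),
    make_row("D2", "P1", 3),
    make_row("D1", "P1", 50, month="20091101"),    # wrong month, filtered out
    make_row("D2", "P2", 8, specialty="CARDIO"),   # wrong specialty, filtered out
])

filtered = list(fpga_stage(
    table_slice,
    predicate=lambda r: (r["MONTH"] == "20091201"
                         and r["MARKET"] == 509123
                         and r["SPECIALTY"] == "GASTRO"),
    columns=("DISTRICT", "PRODUCTGRP", "NRX"),
))
totals = cpu_stage(filtered)
print(len(filtered), totals)
```

The point of the split is that `cpu_stage` only ever sees rows and columns that survived the first stage, so the expensive aggregation runs on a much smaller stream.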
PureData System
Configuration
                 Single Rack System                Multi Rack System
Specs            N3001-002  N3001-005  N3001-010   N3001-020  N3001-040  N3001-080
Racks            1          1          1           2          4          8
Active S-Blades  2          4          7           14         28         56
CPU Cores        40         80         140         280        560        1120
FPGA Cores       32         64         112         224        448        896
User Data in TB  32         98         192         384        768        1536
N3001 is the newest IBM PureData model.
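A quick arithmetic check on the configuration figures shows how regular the scaling is. The short Python sketch below uses only the blade and core counts from the table; the dictionary layout and variable names are assumptions made for the example.

```python
# Blade and core counts copied from the N3001 configuration table.
configs = {
    "N3001-002": {"racks": 1, "s_blades": 2, "cpu_cores": 40, "fpga_cores": 32},
    "N3001-005": {"racks": 1, "s_blades": 4, "cpu_cores": 80, "fpga_cores": 64},
    "N3001-010": {"racks": 1, "s_blades": 7, "cpu_cores": 140, "fpga_cores": 112},
    "N3001-020": {"racks": 2, "s_blades": 14, "cpu_cores": 280, "fpga_cores": 224},
    "N3001-040": {"racks": 4, "s_blades": 28, "cpu_cores": 560, "fpga_cores": 448},
    "N3001-080": {"racks": 8, "s_blades": 56, "cpu_cores": 1120, "fpga_cores": 896},
}

# Cores per active S-Blade for every model: (CPU cores, FPGA cores).
per_blade = {
    model: (c["cpu_cores"] // c["s_blades"], c["fpga_cores"] // c["s_blades"])
    for model, c in configs.items()
}

# Active S-Blades per rack for the full-rack and multi-rack models.
blades_per_rack = {
    model: c["s_blades"] // c["racks"]
    for model, c in configs.items()
    if c["s_blades"] >= 7
}

for model in configs:
    print(model, per_blade[model], blades_per_rack.get(model))
```

Every model works out to 20 CPU cores and 16 FPGA cores per active S-Blade, and every full rack holds 7 active S-Blades, so adding racks adds compute in fixed increments. These per-blade figures differ from the 8 + 8 cores quoted on the S-Blades Overview slide, presumably because that slide describes an earlier hardware generation.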
What is Achievable
• An agile analytics platform.
• No administration effort to install or manage.
• Scalability to the petabyte level.
• Linear speedup by adding
racks.
• Big Data meets deep analytics => no need to
sample.
High Performance
Analytics Architecture
PureData Analytics
Modules
Netezza In-Database
Analytics Options
• Classification
• Time Series
• Clustering
• Association Rules
• Simulation and Monte Carlo Analysis
• Geospatial
Demo
• Installation
• Client Tool Exploration
• Command Execution
Summary
• A system for analytics
• An out-of-the-box solution
• It uses FPGA technology to speed up query execution.
• It uses a shared-nothing approach.
• PureData uses open standards to communicate with the
outside world.
• It offers many Netezza (NZ) and third-party in-database
options to enrich analytics.

Editor's Notes

  • #6 PureFlex: Platform as a Service (PaaS). PureApplication and PureData: SaaS.
  • #7 PureFlex: Platform as a Service (PaaS). PureApplication and PureData: SaaS.
  • #9 SMP: symmetric multiprocessor system
  • #11 Linux server installed on BladeCenter Server
  • #13 Based on the divide-and-conquer method. PureData handles the parallelization without requiring any knowledge from the user. Netezza’s proprietary AMPP (Asymmetric Massively Parallel Processing) architecture is a two-tiered system designed to quickly handle very large queries from multiple users. The first tier is a high-performance Linux SMP host that compiles query tasks received from business intelligence applications and generates query execution plans. It then divides a query into a sequence of sub-tasks, or snippets, that can be executed in parallel, and distributes the snippets to the second tier for execution. The second tier consists of one to hundreds of snippet processing blades, or S-Blades, where all the primary processing work of the appliance is executed. The S-Blades are intelligent processing nodes that make up the massively parallel processing (MPP) engine of the appliance. Each S-Blade is an independent server that contains multi-core Intel-based CPUs and Netezza’s proprietary multi-engine, high-throughput FPGAs. The S-Blade is composed of a standard blade server combined with a special Netezza Database Accelerator card that snaps alongside the blade. Each S-Blade is, in turn, connected to multiple disk drives, processing multiple data streams in parallel in TwinFin or Skimmer.
  • #14 The FPGAs filter out 90-95 percent of the data as irrelevant and pass the rest to the CPU cores. This pipelined processing approach boosts performance.
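To put the 90-95 percent figure from note #14 in concrete terms, the small Python calculation below estimates how much data would still reach the CPU cores. It is a back-of-the-envelope sketch: the 1 TB scan size, the decimal definition of a terabyte, and the helper name `bytes_reaching_cpu` are assumptions chosen for illustration.

```python
TB = 10**12  # decimal terabyte, an assumption for this example


def bytes_reaching_cpu(scanned_bytes, filtered_percent):
    """Bytes left after the FPGA stage drops `filtered_percent` of the scanned data."""
    return scanned_bytes * (100 - filtered_percent) // 100


best_case = bytes_reaching_cpu(TB, 95)   # 95 percent filtered out
worst_case = bytes_reaching_cpu(TB, 90)  # 90 percent filtered out

print(best_case // 10**9, "GB to", worst_case // 10**9, "GB reach the CPU cores")
```

Under these assumptions, a 1 TB scan hands only 50 to 100 GB to the CPU cores, a 10x to 20x reduction in the data the CPUs have to process.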