IBM Stream au Hadoop User Group
- 1. Big Data
Jerome Chailloux, Big Data Specialist
jerome.chailloux@fr.ibm.com
© 2011 IBM Corporation
- 2. Imagine the Possibilities of Analyzing All Available Data
Faster, More Comprehensive, Less Expensive
– Real-time traffic flow optimization
– Understand and act on customer sentiment
– Fraud & risk detection
– Accurate and timely threat detection
– Predict and act on intent to purchase
– Low-latency network analysis
- 3. Where is this data coming from?
Every day, the New York Stock Exchange captures 1 TB of trade information.
12 TB of tweets are created each day.
5 billion mobile phones were in use in 2010; only 12% were smartphones.
Every second of HD video generates more than 2,000 times as many bytes as required to store a single page of text.
More than 30 million networked sensors, growing at a rate of more than 30% per year.
What is your business doing with it?
Source: McKinsey & Company, May 2011
- 4. Why is Big Data important ?
The gap between the data AVAILABLE to an organization and the data an organization can PROCESS is a missed opportunity.
– Organizations are able to process less and less of the available data.
– Enterprises are “more blind” to new opportunities.
- 5. What does a Big Data platform do ?
Analyze a Variety of Information
– Novel analytics on a broad set of mixed information that could not be analyzed before
Analyze Information in Motion
– Streaming data analysis
– Large volume data bursts & ad-hoc analysis
Analyze Extreme Volumes of Information
– Cost-efficiently process and analyze petabytes of information
– Manage & analyze high volumes of structured, relational data
Discover & Experiment
– Ad-hoc analytics, data discovery & experimentation
Manage & Plan
– Enforce data structure, integrity and control to ensure consistency for repeatable queries
- 6. Complementary Approaches for Different Use Cases
Traditional Approach (structured, analytical, logical)
– Data Warehouse
– Traditional sources: transaction data, internal app data, mainframe data, OLTP system data, ERP data
– Structured, repeatable, linear
– Examples: monthly sales reports, profitability analysis, customer surveys
New Approach (creative, holistic thought, intuition)
– Hadoop, Streams
– New sources: web logs, social data, text data (emails), sensor data (images), RFID
– Structured and unstructured, exploratory, iterative
– Examples: brand sentiment, product strategy, maximum asset utilization
Enterprise integration bridges the two.
- 7. IBM Big Data Strategy: Move the Analytics Closer to the Data
New analytic applications drive the requirements for a big data platform:
• Integrate and manage the full variety, velocity and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad-hoc analysis
• Development environment for building new analytic applications
• Workload optimization and scheduling
• Security and governance
Platform stack (top to bottom):
– Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
– IBM Big Data Platform: Visualization & Discovery, Application Development, Systems Management, Accelerators; engines: Hadoop System, Stream Computing, Data Warehouse
– Information Integration & Governance
- 8. Most Client Use Cases Combine Multiple Technologies
Pre-processing
– Ingest and analyze unstructured data types and convert to structured data
Combine structured and unstructured analysis
– Augment data warehouse with additional external sources, such as social media
Combine high velocity and historical analysis
– Analyze and react to data in motion; adjust models with deep historical analysis
Reuse structured data for exploratory analysis
– Experimentation and ad-hoc analysis with structured data
- 9. IBM is in a lead position to exploit the Big Data opportunity
February 2012: “The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012”
IBM Differentiation
– Embracing open source
– Data in Motion (Streams) and Data at Rest (Hadoop / BigInsights)
– Tight integration with other Information Management products
– Bundled, scalable analytics technology
– Hardened Apache Hadoop for enterprise readiness
- 10. IBM’s unique strengths in Big Data
Big Data in Real-Time
– Ingest, analyze and act on massive volumes of streaming data
– Faster AND more cost-effective for specific use cases (10x volume of data on the same hardware)
Fit-for-purpose analytics
– Analyzes a variety of data types, in their native format: text, geospatial, time series, video, audio & more
Enterprise Class
– Open source enhanced for reliability, performance and security
– High performance warehouse software and appliances
– Ease of use with end-user, admin and development UIs
Integration
– Integration into your IM architecture
– Pre-integrated analytic applications
- 11. Stream Computing: What is it good for?
Analyze all your data, all the time, just in time
– What if you could get IMMEDIATE insight?
– What if you could analyze MORE kinds of data?
– What if you could do it with exceptional performance?
(Diagram: traditional data, sensor events and signals flow into Streams; out come analytic results, alerts, active responses, more context for threat prevention systems, and logging to storage and warehousing.)
- 12. What is Stream Processing?
Relational databases and warehouses find information stored on disk
Stream computing analyzes data before you store it
Databases find the needle in the haystack
Streams finds the needle as it’s blowing by
- 13. Without Streams / With Streams
Without Streams, developers must handle by hand:
• Intensive scripting
• Embedded SQL
• File / storage management
• Record management embedded in application code
• Data buffering, locality
• Security
• Dynamic application composition
• High availability
• Application management (checkpointing, performance optimization, monitoring, workload management, error and event handling)
• Application tied to specific hardware, infrastructure
• Multithreading, multiprocessing
• Debugging
• Migration from development to production
• Integration of best-of-breed commercial tools
• Code reusability
• Source / target interfaces
With Streams:
• Streams provides a productive and reusable development environment
• The Streams runtime provides your application infrastructure
“TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language.” – Alex Philp, TerraEchos
- 15. How Streams Works
Continuous ingestion, continuous analysis
Infrastructure provides services for:
– Scheduling analytics across hardware hosts
– Establishing streaming connectivity
Typical operations: filter / sample, transform, annotate, correlate, classify
Achieve scale:
– By partitioning applications into software components
– By distributing across stream-connected hardware hosts
Where appropriate, elements can be fused together for lower communication latency
- 16. Scalable Stream Processing
Streams programming model: construct a graph
– A mathematical concept
  • Not a line, bar, or pie chart!
  • Also called a network
  • Familiar: for example, a tree structure is a graph
– Consisting of operators and the streams that connect them
  • The vertices (or nodes) and edges of the mathematical graph
  • A directed graph: the edges have a direction (arrows)
Streams runtime model: distributed processes
– Single or multiple operators form a Processing Element (PE)
– Compiler and runtime services make it easy to deploy PEs
  • On one machine
  • Across multiple hosts in a cluster when scaled-up processing is required
– All links and data transport are handled by runtime services
  • Automatically
  • With manual placement directives where required
- 17. InfoSphere Streams Objects: Runtime View
Instance
– Runtime instantiation of InfoSphere Streams executing across one or more hosts
– Collection of components and services
Processing Element (PE)
– Fundamental execution unit that is run by the Streams instance
– Can encapsulate a single operator or many “fused” operators
Job
– A deployed Streams application executing in an instance
– Consists of one or more PEs
(Diagram: a job on a node contains several PEs; streams connect operators within and between the PEs.)
- 18. InfoSphere Streams Objects: Development View
Operator
– The fundamental building block of the Streams Processing Language
– Operators process data from streams and may produce new streams
Stream
– An infinite sequence of structured tuples
– Can be consumed by operators on a tuple-by-tuple basis or through the definition of a window
Tuple
– A structured list of attributes and their types
– Each tuple on a stream has the form dictated by its stream type
Stream type
– Specification of the name and data type of each attribute in the tuple
Window
– A finite, sequential group of tuples
– Based on count, time, attribute value, or punctuation marks
(Diagram: tuples such as {height: 640, width: 480, data: …} and {directory: "/img", filename: "farm"} flowing through the operators of a Streams application.)
- 19. What is Streams Processing Language?
Designed for stream computing
– Define a streaming-data flow graph
– Rich set of data types to define tuple attributes
Declarative
– Operator invocations name the input and output streams
– Referring to streams by name is enough to connect the graph
Procedural support
– Full-featured C++/Java-like language
– Custom logic in operator invocations
– Expressions in attribute assignments and parameter definitions
Extensible
– User-defined data types
– Custom functions written in SPL or a native language (C++ or Java)
– Custom operators written in SPL
– User-defined operators written in C++ or Java
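As a concrete sketch of these ideas (not taken from the deck), a minimal SPL application can wire three standard-toolkit operators together simply by naming streams; the operator names (Beacon, Functor, FileSink) come from the toolkit list later in the deck, while the stream and attribute names are invented for illustration:

```spl
// Minimal SPL application sketch: declaring streams connects the graph.
composite Main {
  graph
    // Source adapter: no input port; emits one tuple per second
    stream<int32 n> Numbers = Beacon() {
      param period : 1.0;
      output Numbers : n = (int32) IterationCount();
    }

    // Declarative wiring: naming Numbers as input connects the edge
    stream<int32 n, int32 square> Squares = Functor(Numbers) {
      param filter : n % 2 == 0;          // keep even values only
      output Squares : square = n * n;
    }

    // Sink adapter: no output port
    () as Sink = FileSink(Squares) {
      param
        file   : "squares.csv";
        format : csv;
    }
}
```

Note how referring to `Numbers` by name in the Functor invocation is all it takes to add the edge, exactly as the “Declarative” bullet describes.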
- 20. Some SPL Terms port
An operator represents a class of manipulations
– of tuples from one or more input streams
– to produce tuples on one or more output streams
A stream connects to an operator on a port
– an operator defines input and output ports
An operator invocation
– is a specific use of an operator
– with specific assigned input and output streams
– with locally specified parameters, logic, etc.
Many operators have one input port and one output port; others have
– zero input ports: source adapters, e.g., TCPSource
– zero output ports: sink adapters, e.g., FileSink
– multiple output ports, e.g., Split
– multiple input ports, e.g., Join
A composite operator is a collection of operators
– An encapsulation of a subgraph of
  • Primitive operators (non-composite)
  • Composite operators (nested)
– Similar to a macro in a procedural language
(Diagram: an Aggregate invocation turning an Employee Info stream into Salary Statistics, with labeled input and output ports.)
- 21. Composite Operators
Every graph is encoded as a composite
– A composite is a graph of one or more operators
– A composite may have input and output ports
– Source code construct only
  • Nothing to do with operator fusion (PEs)
Each stream declaration in the composite
– Invokes a primitive operator or
– another composite operator
An application is a main composite
– No input or output ports
– Data flows in and out, but not on streams within a graph
– Streams may be exported to and imported from other applications running in the same instance

composite Main {
  graph
    stream … {
    }
    stream … {
    }
    . . .
}
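To make the composite/port idea concrete, here is a small illustrative sketch (not from the deck): a composite with one input and one output port wraps a Functor, and a main composite invokes it exactly like a primitive operator. All names are hypothetical.

```spl
// A composite operator: an encapsulated subgraph with declared ports
composite Doubler(output Out; input In) {
  graph
    // The stream named Out becomes the composite's output port
    stream<int32 x> Out = Functor(In) {
      output Out : x = x * 2;
    }
}

composite Main {
  graph
    stream<int32 x> Src = Beacon() {
      output Src : x = (int32) IterationCount();
    }
    // Invoked like any other operator; the subgraph is expanded at compile time,
    // like a macro in a procedural language
    stream<int32 x> Doubled = Doubler(Src) {}
    () as Sink = FileSink(Doubled) {
      param file : "out.csv"; format : csv;
    }
}
```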
- 22. Anatomy of an Operator Invocation
Operators share a common structure (the <> parts are sections to fill in):

stream<stream-type> stream-name = MyOperator(input-stream; …) {
  logic  logic;
  param  parameters;
  output output;
  window windowspec;
  config configuration;
}

Reading an operator invocation:
– Declare a stream stream-name with attributes from stream-type
– that is produced by MyOperator from the input(s) input-stream
– MyOperator behavior is defined by logic, parameters, windowspec, and configuration; output attribute assignments are specified in output

Example:

stream<rstring item> Sale = Join(Bid; Ask) {
  window Bid: sliding, time(30);
         Ask: sliding, count(50);
  param  match: Bid.item == Ask.item
             && Bid.price >= Ask.price;
  output Sale: item = Bid.item;
}

For the example:
– Declare the stream Sale with the attribute item, which is a raw string
– Join the Bid and Ask streams with sliding windows of 30 seconds on Bid and 50 tuples on Ask
– When items are equal, and the Bid price is greater than or equal to the Ask price, output the item value on the Sale stream
- 23. Streams V2.0 Data Types
(any)
├─ (primitive)
│  ├─ boolean
│  ├─ enum
│  ├─ (numeric)
│  │  ├─ (integral): (signed) int8, int16, int32, int64; (unsigned) uint8, uint16, uint32, uint64
│  │  ├─ (floatingpoint): (float) float32, float64, float128; (decimal) decimal32, decimal64, decimal128
│  │  └─ (complex): complex32, complex64, complex128
│  ├─ timestamp
│  ├─ (string): rstring, ustring
│  └─ blob
└─ (composite)
   ├─ (collection): list, set, map
   └─ tuple
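As an illustrative sketch combining several branches of this type tree in one declaration (all attribute names invented, not from the deck):

```spl
// Hypothetical types built from primitive, collection, and nested tuple types
composite TypesDemo {
  type
    GpsFix  = tuple<float64 lat, float64 lon>;
    Reading = tuple<rstring sensorId,            // string
                    timestamp ts,                // timestamp
                    decimal64 value,             // decimal floating point
                    list<uint16> rawSamples,     // collection of unsigned integers
                    map<rstring, int32> flags,   // unordered key-to-value mapping
                    GpsFix position>;            // nested tuple
  graph
    stream<Reading> Readings = Beacon() {}
}
```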
- 24. Stream and Tuple Types
Stream type (often called “schema”)
– Definition of the structure of the data flowing through the stream
Tuple type definition
– tuple<sequence of attributes> tuple<uint16 id, rstring name>
• Attribute: a type and a name
• Nesting: any attribute may be another tuple type
Stream type is a tuple type
– stream<sequence of attributes> stream<uint16 id, rstring name>
Indirect stream type definitions
– Fully defined within the output stream declaration
stream<uint32 callerNum, … rstring endTime, list<uint32> mastIDs> Calls = Op(…)…
– Reference a tuple type
CallInfo = tuple<uint32 callerNum, … rstring endTime, list<uint32> mastIDs>;
stream<CallInfo> InternationalCalls = Op(…) {…}
– Reference another stream
stream<uint32 callerNum, … rstring endTime, list<uint32> mastIDs> Calls = Op(…)…
stream<Calls> RoamingCalls = Op(…) {…}
- 25. Collection Types
list: array with bounds-checking [0, 17, age-1, 99]
– Random access: can access any element at any time
Ordered, base-zero indexed: first element is someList[0]
set: unordered collection {"cats", "yeasts", "plankton"}
– No duplicate element values
map: key-to-value mappings {"Mon":0, "Sat":99, "Sun":-1}
– Unordered
Use type constructors to specify element type
– list<type>, set<type> list<uint16>, set<rstring>
– map<key-type,value-type> map<rstring[3],int8>
Can be nested to any number of levels
– map<int32, list<tuple<ustring name, int64 value>>>
– {1 : [{"Joe",117885}, {"Fred",923416}], 2 : [{"Max",117885}], -1 : []}
Bounded collections optimize performance
– list<int32>[5]: at most 5 (32-bit) integer elements
– Bounds also apply to strings: rstring[3] has at most 3 (8-bit) characters
- 26. The Functor Operator
Transforms input tuples into output tuples
– One input port
– One or more output ports
May filter tuples
– Parameter filter: a boolean expression
– If true, emit output tuple; if false, do not
Arbitrary attribute assignments
– Full-blown expressions, including function calls
– Drop, add, transform attributes
– Omitted attributes auto-assigned
Custom logic supported
– logic clause; may include state
– Applies to filter and assignments

Example:

stream<rstring name, uint32 age, uint64 salary> Person = Op(…) {}

stream<rstring name, uint32 age, rstring login,
       tuple<boolean young, boolean rich> info> Adult = Functor(Person) {
  param
    filter : age >= 21u;
  output Adult :
    login = lower(name),
    info  = {young = (age < 30u), rich = (salary > 100000ul)};
}

Attribute mapping: Person (name, age, salary) → Adult (name, age, login, info)
- 27. The FileSink Operator
Writes tuples to a file
Has a single input port
– No output port: data goes to a file, not a Streams stream
Selected parameters
– file
  • Mandatory
  • Base for relative paths is the data subdirectory
  • Directories must already exist
– flush
  • Flush the output buffer after a given number of tuples
– format
  • csv: comma-separated values
  • txt, line, binary, block

() as Sink = FileSink(StreamIn) {
  param
    file   : "/tmp/people.dat";
    format : csv;
    flush  : 20u;
}
- 28. Communication Between Streams Applications
Streams jobs exchange data with the outside world
– Source- and sink-type operators
– Can also be used between Streams jobs (e.g., TCPSource / TCPSink)
Streams jobs can exchange data with each other
– Within one Streams instance
Supports Dynamic Application Composition
– By name or based on properties (tags)
– One job exports a stream; another imports it
Implemented using two new pseudo-operators: Export and Import
(Diagram: Job 1 runs source → operator → Export; the exported stream is picked up by Job 2 via Import → operator → sink.)
- 29. Application Design – Dynamic Stream Properties
API available for toolkit development
Can add / modify / delete:
– Exported stream properties
– Imported stream subscription expression
Dynamic Job Flow Control Bus pattern
– Operators within jobs interpret control stream tuples
– Rewire the flow of data from job to job
(Diagram: an exported control stream carries flow-control tuples such as the routing list [A,B,C]; the data stream passes through Jobs A, B, C, D.)
- 30. Application Design – Dynamic Stream Properties (continued)
Changing the exported routing list from [A,B,C] to [A,C,D] rewires the data flow at runtime, bypassing Job B.
- 31. Application Design – Multi-job Design
Streams Instance: stream1
Job imagefeeder: DirectoryScan → ImageSource
– Exports a stream of file metadata with properties: name = "Feed", type = "Image", write = "ok"
Job imagewriter: Functor (adds timestamp + filename) → ImageSink / FileSink
– Imports with subscription: type == "Image" && write == "ok"
Application / Job Decomposition
– Dynamic Job Submission + Stream Import / Export
- 32. Application Design – Multi-job Design (continued)
Same design as the previous slide, with the exported feed now carrying Image + File metadata, so downstream jobs can process pixels as well as names.
- 33. Application Design – Multi-job Design (continued)
A greyscaler job is added dynamically:
– Imports the feed with subscription: name == "Feed"
– Runs a Greyscale operator
– Exports its result with properties: name = "Grey", type = "Image", write = "ok"
imagewriter picks up the new stream automatically, since its subscription type == "Image" && write == "ok" also matches it.
Application / Job Decomposition
– Dynamic Job Submission + Stream Import / Export
- 34. Application Design – Multi-job Design (continued)
Further jobs attach the same way: resizer, facial scan, and an Alerter, each importing and exporting streams by property.
Application / Job Decomposition
– Dynamic Job Submission + Stream Import / Export
- 35. Application Design – Multi-job Design (continued)
The complete topology: imagefeeder (DirectoryScan → ImageSource), imagewriter (Functor → ImageSink / FileSink), plus the dynamically submitted greyscaler, resizer, facial scan and Alerter jobs, all connected by property-based stream export / import.
Application / Job Decomposition
– Dynamic Job Submission + Stream Import / Export
- 36. Two Styles of Export/Import
Publish and subscribe (recommended approach):
– The exporting application publishes a stream with certain properties
– The importing stream subscribes to an exported stream with properties satisfying a specified condition
Point to point:
– The importing application names a specific stream of a specific exporting application
Dynamic publish and subscribe:
– Export properties and Import expressions can be altered during the execution of a job
– Allows dynamic data flows
– Alter the flow of data based on the data (history, trends, etc.)

() as ImageStream = Export(ImagesIn) {
  param properties : {
    streamName = "ImageFeed",
    dataType   = "IplImage",
    writeImage = "true" };
}

stream<IplImage image, rstring filename,
       rstring directory> ImagesIn = Import() {
  param subscription :
    dataType == "IplImage" &&
    writeImage == "true";
}
- 37. Parallelization Patterns – Introduction
Problem Statement
– Series of operations to be performed on a piece of data (a tuple)
– How to improve performance of these operations?
Key Question
– Reduce latency?
• For a single piece of data
– Increase throughput?
• For the entire data flow
Three possible design patterns
– Serial Path
– Parallel Operators (Task Parallelization)
– Parallel Paths (Data Parallelization)
- 38. Parallelization Patterns – Pipeline, Task
Pipeline (serial path): A → B → C → D
– Base pattern: inherent in the graph paradigm
– Results arrive at D in time T(A) + T(B) + T(C)
Parallel operators (task parallelization): A, B, C in parallel → M → D
– Process the tuple in operators A, B, and C at the same time
– Requires a merger (e.g., Barrier) before operator D
– Results arrive at D in time Max(T(A), T(B), T(C)) + T(M)
– Use when the tuple latency requirement < T(A) + T(B) + T(C)
– Complexity of the merger depends on the behavior of operators A, B, and C
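A sketch of the task-parallel pattern in SPL (not from the deck): one stream fans out to three operators, and the standard-toolkit Barrier plays the role of the merger M, waiting for one tuple on each input port before emitting a combined tuple. All stream and attribute names are invented, and the Functor bodies stand in for real analytics.

```spl
composite TaskParallel {
  graph
    stream<int32 id, float64 v> In = Beacon() {
      output In : id = (int32) IterationCount(), v = 1.0;
    }
    // The same stream feeds three operators at once: task parallelism
    stream<int32 id, float64 a> A = Functor(In) { output A : a = v + 1.0; }
    stream<int32 id, float64 b> B = Functor(In) { output B : b = v * 2.0; }
    stream<int32 id, float64 c> C = Functor(In) { output C : c = v - 1.0; }
    // Barrier: the merger M; pairs up one tuple from each port,
    // filling output attributes by name from the matching input ports
    stream<int32 id, float64 a, float64 b, float64 c> Merged = Barrier(A; B; C) {}
    () as Sink = FileSink(Merged) { param file : "merged.csv"; format : csv; }
}
```

Since A, B, and C here emit exactly one tuple per input tuple, a simple Barrier suffices; operators that filter or multiply tuples would need a more careful merger, as the slide notes.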
- 39. Parallelization Patterns – Parallel Pipelines
Parallel pipelines (data parallelization): several copies of the A → B → C pipeline between the source and D
– Migration step from the pipeline pattern
– Can improve throughput
  • Especially good for variable-size data / processing time
Design decisions
– Are there latency and/or throughput requirements?
– Do the operators perform filtering, feature extraction, transformation?
– Is there an execution order requirement?
– Is there a tuple order requirement?
Recommendation: prefer Parallel Pipelines over a plain Pipeline when possible
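A sketch of the data-parallel pattern (not from the deck), using the standard-toolkit ThreadedSplit to distribute tuples across two pipeline replicas and Union to merge them again; names and parameter values are invented for illustration.

```spl
composite DataParallel {
  graph
    stream<int32 id> In = Beacon() {
      output In : id = (int32) IterationCount();
    }
    // ThreadedSplit hands each tuple to whichever replica is free
    (stream<int32 id> P0; stream<int32 id> P1) = ThreadedSplit(In) {
      param bufferSize : 100u;
    }
    // Two replicas of the same (here trivial) pipeline stage
    stream<int32 id, int32 r> R0 = Functor(P0) { output R0 : r = id * id; }
    stream<int32 id, int32 r> R1 = Functor(P1) { output R1 : r = id * id; }
    // Union merges the replicas; note tuple order across replicas
    // is NOT preserved, one of the slide's design decisions
    stream<int32 id, int32 r> Out = Union(R0; R1) {}
    () as Sink = FileSink(Out) { param file : "out.csv"; format : csv; }
}
```

Carrying the id attribute through each replica is what lets a downstream consumer restore or check ordering if the application requires it.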
- 40. Application Design – Multi-tier Design
Example tiers: Ingestion → Transport Adaptation → Processing / Analytics (Reduction, Transformation) → Transport Adaptation
N-tier design
– Number and purpose of tiers is a result of application design
Create well-defined interfaces between the tiers
Supports several overarching concepts
– Incremental development / testing
– Application / Job / Operator reuse
– Modular programming practices
Each tier in these examples may be made up of one or more jobs (programs)
- 41. Application Design – High Availability
HA application design pattern
– Source job exports stream, enriched with tuple ID
– Jobs 1 & 2 process in parallel, and export final streams
– Sink job imports streams, discards duplicates, alerts on missing tuples
(Diagram: redundant copies of Job 1 and Job 2 spread across host pools of x86 hosts, fed by a redundant Source and merged by a redundant Sink.)
- 43. IBM InfoSphere Streams
Agile Development Environment
– Eclipse IDE
– Streams Live Graph
– Streams Debugger
– Over 50 samples
Distributed Runtime Environment
– Clustered runtime for massive scalability
– RHEL v5.x and v6.x, CentOS v6.x
– x86 & Power multicore hardware
– Ethernet & InfiniBand
Sophisticated Analytics with Toolkits & Adapters
– Front Office 3.0
– Toolkits: Database, Mining, Financial, Standard, Internet, Big Data (HDFS, DataExplorer), Advanced Text, Geospatial, Timeseries, Messaging, User-defined, …
- 44. Toolkits and Operators to Speed and Simplify Development
Standard Toolkit
– Relational operators: Filter, Functor, Punctor, Sort, Join, Aggregate
– Adapter operators: FileSource, FileSink, DirectoryScan, TCPSource, TCPSink, UDPSource, UDPSink, Export, Import, MetricsSink
– Utility operators: Custom, Beacon, Throttle, Delay, Barrier, Pair, JavaOp, Split, DeDuplicate, Union, ThreadedSplit, DynamicFilter, Gate, …
– Contains the default operators shipped with the product
Internet Toolkit
– InetSource: HTTP, FTP, HTTPS, FTPS, RSS, file
Database Toolkit
– ODBCAppend, ODBCEnrich, ODBCSource, SolidDBEnrich, DB2SplitDB, DB2PartitionedAppend
– Supports: DB2 LUW, IDS, solidDB, Netezza, Oracle, SQL Server, MySQL
Financial Toolkit, Data Mining Toolkit, Big Data Toolkit, Text Toolkit
User-Defined Toolkits
– Extend the language by adding user-defined operators and functions
- 45. User Defined Toolkits
Streams supports toolkits
– Reusable sets of operators and functions
– What can be included in a toolkit?
• Primitive and composite operators
• Native and SPL functions
• Types
• Tools/documentation/samples/data, etc.
– Versioning is supported
– Define dependencies on other versioned assets (toolkits, Streams)
– Create cross-domain and domain-specific accelerators
- 47. A quick peek inside …
InfoSphere Streams Instance – Single Host
Management services and applications run on the same host:
– Streams Web Service (SWS)
– Streams Application Manager (SAM)
– Streams Resource Manager (SRM)
– Authorization and Authentication Service (AAS)
– Scheduler, Recovery DB, Name Server
– Host Controller, Processing Element Container
– File System
- 48. A quick peek inside …
InfoSphere Streams Instance – Multi-host, Management Services on a separate node
Management node:
– Streams Web Service (SWS)
– Streams Application Manager (SAM)
– Streams Resource Manager (SRM)
– Authorization and Authentication Service (AAS)
– Scheduler, Recovery DB, Name Server
Shared file system
Application hosts (×3), each running:
– Host Controller
– Processing Element Container
- 49. A quick peek inside …
InfoSphere Streams Instance – Multi-host, Management Services on multiple hosts
Management services distributed across several management hosts:
– Streams Web Service (SWS), AAS, Recovery DB
– Streams Application Manager (SAM), Scheduler
– Streams Resource Manager (SRM), Name Server
One of these hosts also serves as an application host (Host Controller + Processing Element Container).
Shared file system
Application hosts (×4), each running:
– Host Controller
– Processing Element Container