Big Data

Jerome Chailloux, Big Data Specialist
jerome.chailloux@fr.ibm.com
Imagine the Possibilities of Analyzing All Available Data
Faster, more comprehensive, less expensive:
 Real-time traffic flow optimization
 Fraud and risk detection
 Understanding and acting on customer sentiment
 Accurate and timely threat detection
 Predicting and acting on intent to purchase
 Low-latency network analysis
Where is this data coming from?
 Every day, the New York Stock Exchange captures 1 TB of trade information.
 12 TB of tweets are created each day.
 5 billion mobile phones were in use in 2010; only 12% were smartphones.
 Every second of HD video generates more than 2,000 times as many bytes as are required to store a single page of text.
 More than 30 million networked sensors are deployed, growing at a rate of more than 30% per year.

What is your business doing with it?
Source: McKinsey & Company, May 2011
Why is Big Data important?
 The gap between the data AVAILABLE to an organization and the data it can PROCESS keeps widening; that gap is missed opportunity.
 Organizations are able to process less and less of the available data.
 Enterprises become "more blind" to new opportunities.
What does a Big Data platform do?
 Analyze a variety of information: novel analytics on a broad set of mixed information that could not be analyzed before.
 Analyze information in motion: streaming data analysis; large-volume data bursts and ad hoc analysis.
 Analyze extreme volumes of information: cost-efficiently process and analyze petabytes of information; manage and analyze high volumes of structured, relational data.
 Discover and experiment: ad hoc analytics, data discovery, and experimentation.
 Manage and plan: enforce data structure, integrity, and control to ensure consistency for repeatable queries.
Complementary Approaches for Different Use Cases

Traditional approach: structured, analytical, logical
 A data warehouse fed by traditional sources: transaction data, internal application data, mainframe data, OLTP system data, ERP data.
 Structured, repeatable, linear workloads: monthly sales reports, profitability analysis, customer surveys.

New approach: creative, holistic thought, intuition
 Hadoop and Streams fed by new sources: web logs, social data, text data (emails), sensor data (images), RFID.
 Unstructured, exploratory, iterative workloads: brand sentiment, product strategy, maximum asset utilization.

Enterprise integration ties the two approaches together.
IBM Big Data Strategy: Move the Analytics Closer to the Data

New analytic applications drive the requirements for a big data platform: BI/reporting, exploration/visualization, functional apps, industry apps, predictive analytics, and content analytics.

The IBM Big Data Platform provides:
   • Integration and management of the full variety, velocity and volume of data
   • Advanced analytics applied to information in its native form
   • Visualization of all available data for ad-hoc analysis
   • A development environment for building new analytic applications
   • Workload optimization and scheduling
   • Security and governance

Platform layers: Visualization & Discovery, Application Development, and Systems Management on top; Accelerators; the Hadoop System, Stream Computing, and Data Warehouse engines; all on a foundation of Information Integration & Governance.
Most Client Use Cases Combine Multiple Technologies

 Pre-processing: ingest and analyze unstructured data types and convert them to structured data.
 Combine structured and unstructured analysis: augment the data warehouse with additional external sources, such as social media.
 Combine high-velocity and historical analysis: analyze and react to data in motion; adjust models with deep historical analysis.
 Reuse structured data for exploratory analysis: experimentation and ad hoc analysis with structured data.
IBM is in a leading position to exploit the Big Data opportunity

February 2012: "The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012"

IBM Differentiation
 Embracing open source
 Data in motion (Streams) and data at rest (Hadoop/BigInsights)
 Tight integration with other Information Management products
 Bundled, scalable analytics technology
 Hardened Apache Hadoop for enterprise readiness
IBM's unique strengths in Big Data

 Big Data in real time: ingest, analyze and act on massive volumes of streaming data. Faster AND more cost-effective for specific use cases (10x the volume of data on the same hardware).
 Fit-for-purpose analytics: analyzes a variety of data types in their native format – text, geospatial, time series, video, audio and more.
 Enterprise class: open source enhanced for reliability, performance and security; high-performance warehouse software and appliances; ease of use with end-user, administration and development UIs.
 Integration: integration into your Information Management architecture; pre-integrated analytic applications.
Stream Computing: What is it good for?
Analyze all your data, all the time, just in time.
 What if you could get IMMEDIATE insight?
 What if you could analyze MORE kinds of data?
 What if you could do it with exceptional performance?

(Diagram: traditional data, sensor events and signals flow in; analytic results flow out as alerts, input to threat-prevention systems, logging, active responses, and feeds to storage and warehousing, which in turn provide more context.)
What is Stream Processing?

 Relational databases and warehouses find information stored on disk.
 Stream computing analyzes data before you store it.

 Databases find the needle in the haystack.
 Streams finds the needle as it's blowing by.
Without Streams, applications must handle by hand:
 • Intensive scripting
 • Embedded SQL
 • File / storage management
 • Record management embedded in application code
 • Data buffering and locality
 • Security
 • Dynamic application composition
 • High availability
 • Application management (checkpointing, performance optimization, monitoring, workload management, error and event handling)
 • Applications tied to specific hardware and infrastructure
 • Multithreading and multiprocessing
 • Debugging
 • Migration from development to production
 • Integration of best-of-breed commercial tools
 • Code reusability
 • Source / target interfaces

With Streams:
 • Streams provides a productive and reusable development environment.
 • The Streams runtime provides your application infrastructure.

"TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language."
   – Alex Philp, TerraEchos
Streams
How Streams Works
 Continuous ingestion, continuous analysis.
 The infrastructure provides services for scheduling analytics across hardware hosts and establishing streaming connectivity.
 Typical in-flight analytics: filter/sample, transform, annotate, correlate, classify.

 Achieve scale:
   – By partitioning applications into software components
   – By distributing across stream-connected hardware hosts
 Where appropriate, elements can be fused together for lower communication latency.
Scalable Stream Processing

 Streams programming model: construct a graph
   – A mathematical concept
      • Not a line, bar, or pie chart!
      • Also called a network
      • Familiar: for example, a tree structure is a graph
   – Consisting of operators and the streams that connect them
      • The vertices (nodes) and edges of the mathematical graph
      • A directed graph: the edges have a direction (arrows)
 Streams runtime model: distributed processes
   – Single or multiple operators form a Processing Element (PE)
   – Compiler and runtime services make it easy to deploy PEs
      • On one machine
      • Across multiple hosts in a cluster when scaled-up processing is required
   – All links and data transport are handled by runtime services
      • Automatically
      • With manual placement directives where required

A minimal sketch of such a flow graph in SPL follows.
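As a rough illustration of the graph model (not taken from the deck; the operator choices, attribute names and file name are assumptions), a three-operator SPL application wires a source, a transform, and a sink:

// Minimal sketch: a three-operator SPL flow graph (illustrative only).
composite Main {
    graph
        // Source: emit one tuple per second with an increasing id
        stream<uint64 id> Numbers = Beacon() {
            param  period : 1.0;
            output Numbers : id = IterationCount();
        }
        // Transform: keep only even ids
        stream<uint64 id> Evens = Filter(Numbers) {
            param filter : id % 2ul == 0ul;
        }
        // Sink: append results to a CSV file (path is made up)
        () as Writer = FileSink(Evens) {
            param file   : "evens.csv";
                  format : csv;
        }
}

Referring to Numbers and Evens by name is all it takes to connect the three operators into a directed graph.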
InfoSphere Streams Objects: Runtime View

 Instance
   – Runtime instantiation of InfoSphere Streams executing across one or more hosts
   – Collection of components and services
 Processing Element (PE)
   – Fundamental execution unit that is run by the Streams instance
   – Can encapsulate a single operator or many "fused" operators
 Job
   – A deployed Streams application executing in an instance
   – Consists of one or more PEs

(Diagram: an instance contains jobs; each job's PEs run on nodes and are connected by streams.)
InfoSphere Streams Objects: Development View

 Operator
   – The fundamental building block of the Streams Processing Language
   – Operators process data from streams and may produce new streams
 Stream
   – An infinite sequence of structured tuples
   – Can be consumed by operators on a tuple-by-tuple basis or through the definition of a window
 Tuple
   – A structured list of attributes and their types; each tuple on a stream has the form dictated by its stream type
 Stream type
   – Specification of the name and data type of each attribute in the tuple
 Window
   – A finite, sequential group of tuples
   – Based on count, time, attribute value, or punctuation marks

(Diagram: a stream of tuples such as {directory: "/img", filename: "farm"} or {height: 640, width: 480, data: …} flowing between operators.)
What is Streams Processing Language?

 Designed for stream computing
   – Define a streaming-data flow graph
   – Rich set of data types to define tuple attributes
 Declarative
   – Operator invocations name the input and output streams
   – Referring to streams by name is enough to connect the graph
 Procedural support
   – Full-featured C++/Java-like language
   – Custom logic in operator invocations
   – Expressions in attribute assignments and parameter definitions
 Extensible
   – User-defined data types
   – Custom functions written in SPL or a native language (C++ or Java)
   – Custom operators written in SPL
   – User-defined operators written in C++ or Java

A small example of user-defined SPL functions follows.
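To illustrate the procedural support (the function names and logic below are invented, not from the deck), SPL functions look much like their C++/Java counterparts and can be called from any expression, such as a Functor output assignment:

// Illustrative user-defined SPL functions.
rstring makeLogin(rstring first, rstring last) {
    // lower() is an SPL standard library function
    return lower(first) + "." + lower(last);
}

boolean isAdult(uint32 age) {
    return age >= 21u;
}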
Some SPL Terms

 An operator represents a class of manipulations
   – of tuples from one or more input streams
   – to produce tuples on one or more output streams
 A stream connects to an operator on a port
   – an operator defines input and output ports
   – example: an Aggregate operator reading an EmployeeInfo stream on its input port and producing SalaryStatistics on its output port
 An operator invocation
   – is a specific use of an operator
   – with specific assigned input and output streams
   – with locally specified parameters, logic, etc.
 Many operators have one input port and one output port; others have
   – zero input ports: source adapters, e.g., TCPSource
   – zero output ports: sink adapters, e.g., FileSink
   – multiple output ports, e.g., Split
   – multiple input ports, e.g., Join
 A composite operator is a collection of operators
   – An encapsulation of a subgraph of
      • Primitive operators (non-composite)
      • Composite operators (nested)
   – Similar to a macro in a procedural language
Composite Operators

 Every graph is encoded as a composite
   – A composite is a graph of one or more operators
   – A composite may have input and output ports
   – Source code construct only
      • Nothing to do with operator fusion (PEs)
 Each stream declaration in the composite
   – Invokes a primitive operator or
   – another composite operator
 An application is a main composite
   – No input or output ports
   – Data flows in and out, but not on streams within a graph
   – Streams may be exported to and imported from other applications running in the same instance

Skeleton:

composite Main {
    graph
        stream … {
        }
        stream … {
        }
        . . .
}

A sketch of a reusable composite with ports follows.
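To illustrate a non-main composite with ports (not from the deck; the name, schema and logic are invented), a reusable subgraph binds its output port to a stream of the same name declared inside:

// Illustrative composite operator with one input and one output port.
composite Doubler(input stream<int32 x> In; output stream<int32 x> Out) {
    graph
        // The stream named Out becomes the composite's output port
        stream<int32 x> Out = Functor(In) {
            output Out : x = In.x * 2;
        }
}

// Invoked like any other operator:
//   stream<int32 x> Doubled = Doubler(Numbers) {}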
Anatomy of an Operator Invocation

 Operators share a common structure; in the syntax below, <> marks sections to fill in.

Syntax:

stream<stream-type> stream-name = MyOperator(input-stream; …)
{
    logic  logic ;
    param  parameters ;
    output output ;
    window windowspec ;
    config configuration ;
}

 Reading an operator invocation
   – Declare a stream stream-name
   – with attributes from stream-type
   – that is produced by MyOperator
   – from the input(s) input-stream
   – MyOperator behavior is defined by logic, parameters, windowspec, and configuration; output attribute assignments are specified in output

Example:

stream<rstring item> Sale = Join(Bid; Ask)
{
    window Bid:   sliding, time(30);
           Ask:   sliding, count(50);
    param  match: Bid.item == Ask.item
                  && Bid.price >= Ask.price;
    output Sale:  item = Bid.item;
}

 For the example:
   – Declare the stream Sale with the attribute item, which is a raw string
   – Join the Bid and Ask streams with
   – sliding windows of 30 seconds on Bid and 50 tuples on Ask
   – When items are equal and the Bid price is greater than or equal to the Ask price,
   – output the item value on the Sale stream
Streams V2.0 Data Types

 (any)
   – (primitive): boolean, enum, (numeric), timestamp, (string), blob
      • (numeric): (integral), (floatingpoint), (complex)
         – (integral): signed int8, int16, int32, int64; unsigned uint8, uint16, uint32, uint64
         – (floatingpoint): float32, float64, float128 (float); decimal32, decimal64, decimal128 (decimal)
         – (complex): complex32, complex64, complex128
      • (string): rstring, ustring
   – (composite): (collection), tuple
      • (collection): list, set, map

An illustrative tuple type combining several of these follows.
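As a quick sketch (the attribute names are invented), a single tuple type can mix primitive and composite types; such definitions live in a composite's type clause:

// Inside a composite's type clause:
type
    SensorReading = tuple<int32 sensorId,            // integral
                          float64 value,             // floating point
                          timestamp at,              // time of measurement
                          list<uint16> history,      // collection: ordered values
                          map<rstring, int32> tags>; // collection: key-to-value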
Stream and Tuple Types

 Stream type (often called "schema")
   – Definition of the structure of the data flowing through the stream
 Tuple type definition
   – tuple<sequence of attributes>, e.g., tuple<uint16 id, rstring name>
      • Attribute: a type and a name
      • Nesting: any attribute may be another tuple type
 Stream type is a tuple type
   – stream<sequence of attributes>, e.g., stream<uint16 id, rstring name>

 Indirect stream type definitions
   – Fully defined within the output stream declaration:

stream<uint32 callerNum, … rstring endTime, list<uint32> mastIDs> Calls = Op(…) …

   – Reference a tuple type:

CallInfo = tuple<uint32 callerNum, … rstring endTime, list<uint32> mastIDs>;
stream<CallInfo> InternationalCalls = Op(…) {…}

   – Reference another stream:

stream<uint32 callerNum, … rstring endTime, list<uint32> mastIDs> Calls = Op(…) …
stream<Calls> RoamingCalls = Op(…) {…}
Collection Types

 list: array with bounds checking, e.g., [0, 17, age-1, 99]
   – Random access: can access any element at any time
   – Ordered, zero-indexed: the first element is someList[0]
 set: unordered collection, e.g., {"cats", "yeasts", "plankton"}
   – No duplicate element values
 map: key-to-value mappings, e.g., {"Mon":0, "Sat":99, "Sun":-1}
   – Unordered
 Use type constructors to specify element type
   – list<type>, set<type>, e.g., list<uint16>, set<rstring>
   – map<key-type,value-type>, e.g., map<rstring[3],int8>
 Can be nested to any number of levels
   – map<int32, list<tuple<ustring name, int64 value>>>
   – {1 : [{"Joe",117885}, {"Fred",923416}], 2 : [{"Max",117885}], -1 : []}
 Bounded collections optimize performance
   – list<int32>[5]: at most 5 (32-bit) integer elements
   – Bounds also apply to strings: rstring[3] has at most 3 (8-bit) characters
The Functor Operator

 Transforms input tuples into output tuples
   – One input port
   – One or more output ports
 May filter tuples
   – Parameter filter: a boolean expression
   – If true, emit the output tuple; if false, do not
 Arbitrary attribute assignments
   – Full-blown expressions, including function calls
   – Drop, add, transform attributes
   – Omitted attributes are auto-assigned
 Custom logic supported
   – logic clause; may include state
   – Applies to filter and assignments

Example: from a Person stream (name, age, salary), produce an Adult stream (name, age, login, info):

stream<rstring name, uint32 age, uint64 salary> Person = Op(…) {}

stream<rstring name,
       uint32 age,
       rstring login,
       tuple<boolean young, boolean rich> info>
    Adult = Functor(Person) {
    param
        filter : age >= 21u;
    output Adult :
        login = lower(name),
        info  = {young = (age < 30u),
                 rich  = (salary > 100000ul)};
}
The FileSink Operator

 Writes tuples to a file
 Has a single input port
   – No output port: data goes to a file, not a Streams stream
 Selected parameters
   – file
      • Mandatory
      • Base for relative paths is the data subdirectory
      • Directories must already exist
   – flush
      • Flush the output buffer after a given number of tuples
   – format
      • csv: comma-separated values
      • txt, line, binary, block

Example:

() as Sink = FileSink(StreamIn) {
    param
        file   : "/tmp/people.dat";
        format : csv;
        flush  : 20u;
}
Communication Between Streams Applications

 Streams jobs exchange data with the outside world
   – Source- and Sink-type operators
   – These can also be used between Streams jobs (e.g., TCPSource/TCPSink)
 Streams jobs can exchange data with each other
   – Within one Streams instance
 Supports dynamic application composition
   – By name or based on properties (tags)
   – One job exports a stream; another imports it
 Implemented using two pseudo-operators: Export and Import

(Diagram: Job 1 runs source → operator → Export; Job 2 runs Import → operators → sink; the stream exported by Job 1 is imported by Job 2.)
Application Design – Dynamic Stream Properties

 API available for toolkit development
 Can add/modify/delete
   – Exported stream properties
   – Imported stream subscription expressions
 Dynamic job flow control bus pattern
   – Operators within jobs interpret control-stream tuples
   – Rewire the flow of data from job to job

(Diagram: an exported control stream carries flow-control tuples to Jobs A–D; changing the tuple from [A,B,C] to [A,C,D] reroutes the data stream from the path A→B→C to A→C→D while the jobs keep running.)
Application Design – Multi-job Design

 Application / job decomposition
   – Dynamic job submission + stream import/export

Example (Streams instance stream1): an imagefeeder job (DirectoryScan → ImageSource) exports a stream of images plus file metadata with the properties name = "Feed", type = "Image", write = "ok". An imagewriter job (ImageSink → a Functor that adds a timestamp to the filename → FileSink) imports every exported stream matching the subscription type == "Image" && write == "ok". Further jobs can then be submitted dynamically and wired in by the same mechanism: a greyscaler job subscribes to name == "Feed" and re-exports greyscale images with the properties name = "Grey", type = "Image", write = "ok"; resizer, facial-scan and alerter jobs join the mesh the same way, with no change to the jobs already running. A sketch of the imagefeeder job follows.
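A rough SPL sketch of the imagefeeder job (the operator and property names come from the slide; the directory path and schema are assumptions, and the image-reading step is elided):

// Sketch of the imagefeeder job: scan a directory and publish the stream
// under publish/subscribe properties (image decoding omitted).
composite ImageFeeder {
    graph
        stream<rstring filename> Files = DirectoryScan() {
            param directory : "/img";   // assumed path
        }
        // The real job would read each file here (the slide's ImageSource).
        () as Feed = Export(Files) {
            param properties : {name = "Feed", type = "Image", write = "ok"};
        }
}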
Two Styles of Export/Import

 Publish and subscribe (recommended approach):
   – The exporting application publishes a stream with certain properties
   – The importing application subscribes to any exported stream with properties satisfying a specified condition
 Point to point:
   – The importing application names a specific stream of a specific exporting application
 Dynamic publish and subscribe:
   – Export properties and Import expressions can be altered during the execution of a job
   – Allows dynamic data flows
   – Alter the flow of data based on the data (history, trends, etc.)

Export side:

() as ImageStream = Export(ImagesIn) {
    param properties : {
        streamName = "ImageFeed",
        dataType   = "IplImage",
        writeImage = "true"};
}

Import side:

stream<IplImage image, rstring filename,
       rstring directory> ImagesIn = Import() {
    param subscription :
        dataType == "IplImage" &&
        writeImage == "true";
}
Parallelization Patterns – Introduction

 Problem statement
   – A series of operations is to be performed on a piece of data (a tuple)
   – How to improve the performance of these operations?
 Key question
   – Reduce latency?
      • For a single piece of data
   – Increase throughput?
      • For the entire data flow
 Three possible design patterns
   – Serial path (pipeline)
   – Parallel operators (task parallelization)
   – Parallel paths (data parallelization)
Parallelization Patterns – Pipeline, Task

 Pipeline (serial path): A → B → C → D
   – Base pattern: inherent in the graph paradigm
   – Results arrive at D in time T(A) + T(B) + T(C)
 Parallel operators (task parallelization): A, B and C run side by side, feeding a merger M before D
   – Process the tuple in operators A, B, and C at the same time
   – Requires a merger (e.g., Barrier) before operator D
   – Results arrive at D in time Max(T(A), T(B), T(C)) + T(M)
   – Use when the tuple latency requirement < T(A) + T(B) + T(C)
   – Complexity of the merger depends on the behavior of operators A, B, and C

A sketch of the task-parallel pattern follows.
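A hedged SPL sketch of the task-parallel pattern (the operator bodies and attribute names are invented; Barrier is the merger the slide names):

// Sketch: A, B and C process the same tuple in parallel; Barrier waits for
// one tuple from each branch, then emits a single merged tuple (role of M).
composite TaskParallel {
    graph
        stream<int32 x> In = Beacon() {
            param  iterations : 100u;
            output In : x = (int32) IterationCount();
        }
        stream<int32 a> A = Functor(In) { output A : a = In.x + 1; }
        stream<int32 b> B = Functor(In) { output B : b = In.x * 2; }
        stream<int32 c> C = Functor(In) { output C : c = In.x * In.x; }
        // Output attributes are auto-assigned from the matching input ports
        stream<int32 a, int32 b, int32 c> Merged = Barrier(A; B; C) {}
}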
Parallelization Patterns – Parallel Pipelines

 Parallel pipelines (data parallelization): several copies of the pipeline A → B → C run side by side, all feeding D
   – Migration step from the pipeline pattern
   – Can improve throughput
      • Especially good for variable-size data / processing time
 Design decisions
   – Are there latency and/or throughput requirements?
   – Do the operators perform filtering, feature extraction, transformation?
   – Is there an execution-order requirement?
   – Is there a tuple-order requirement?
 Recommendation: move from Pipeline to Parallel Pipelines when possible (see the sketch below)
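One way to fan tuples out across pipeline copies and back in is with the standard toolkit's ThreadedSplit and Union operators. This is a sketch under assumptions (two copies, a trivial stage, invented names); a real design would replicate the full A → B → C stages on each branch:

// Sketch: data parallelism with a two-way fan-out and fan-in.
composite ParallelPipelines {
    graph
        stream<int32 x> In = Beacon() {
            param  iterations : 1000u;
            output In : x = (int32) IterationCount();
        }
        // Send each tuple to whichever branch has buffer space
        (stream<int32 x> P0; stream<int32 x> P1) = ThreadedSplit(In) {
            param bufferSize : 100u;
        }
        // Each branch stands in for a full pipeline of stages
        stream<int32 x> R0 = Functor(P0) { output R0 : x = P0.x + 1; }
        stream<int32 x> R1 = Functor(P1) { output R1 : x = P1.x + 1; }
        // Merge the branches; note that tuple order is not preserved
        stream<int32 x> Out = Union(R0; R1) {}
}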
Application Design – Multi-tier Design

 N-tier design
   – The number and purpose of tiers is a result of application design
   – Example tiers: transport adaptation → ingestion → reduction → processing/analytics → transformation → transport adaptation
 Create well-defined interfaces between the tiers
 Supports several overarching concepts
   – Incremental development / testing
   – Application / job / operator reuse
   – Modular programming practices
 Each tier in these examples may be made up of one or more jobs (programs)
Application Design – High Availability

 HA application design pattern
   – The source job exports a stream, enriched with a tuple ID
   – Jobs 1 & 2 process in parallel, and export their final streams
   – The sink job imports both streams, discards duplicates, and alerts on missing tuples

(Diagram: a Source job feeds two identical processing chains, Job 1 and Job 2, placed on different host pools of x86 hosts; a Sink job consumes both. If a host in one chain fails, the other chain keeps results flowing.)
IBM InfoSphere Streams

 Agile development environment
   – Eclipse IDE
   – Streams Live Graph
   – Streams Debugger
 Distributed runtime environment
   – Clustered runtime for massive scalability
   – RHEL v5.x and v6.x, CentOS v6.x
   – x86 & Power multicore hardware
   – Ethernet & InfiniBand
 Sophisticated analytics with toolkits & adapters
   – Toolkits: Database, Mining, Financial, Standard, Internet, BigData (HDFS, DataExplorer), Advanced Text, Geospatial, Timeseries, Messaging, user-defined, and more
   – Over 50 samples
   – Front Office 3.0
Toolkits and Operators to Speed and Simplify Development

Standard Toolkit (the default operators shipped with the product)
 Relational operators: Filter, Sort, Functor, Join, Punctor, Aggregate
 Adapter operators: FileSource, FileSink, DirectoryScan, TCPSource, TCPSink, UDPSource, UDPSink, Export, Import, MetricsSink
 Utility operators: Custom, Beacon, Throttle, Delay, Barrier, Pair, JavaOp, Split, DeDuplicate, Union, ThreadedSplit, DynamicFilter, Gate

Internet Toolkit
 InetSource: HTTP, FTP, HTTPS, FTPS, RSS, file

Database Toolkit
 ODBCAppend, ODBCEnrich, ODBCSource, SolidDBEnrich, DB2SplitDB, DB2PartitionedAppend
 Supports: DB2 LUW, IDS, solidDB, Netezza, Oracle, SQL Server, MySQL

Other toolkits
 Financial Toolkit, Data Mining Toolkit, Big Data Toolkit, Text Toolkit, …

User-Defined Toolkits
 Extend the language by adding user-defined operators and functions
User-Defined Toolkits

 Streams supports toolkits
   – Reusable sets of operators and functions
   – What can be included in a toolkit?
      • Primitive and composite operators
      • Native and SPL functions
      • Types
      • Tools/documentation/samples/data, etc.
   – Versioning is supported
   – Define dependencies on other versioned assets (toolkits, Streams)
   – Create cross-domain and domain-specific accelerators

A sketch of a toolkit layout follows.
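As a rough sketch (the directory and file names are invented; spl-make-toolkit is the indexer that ships with Streams), a toolkit is a directory tree that is indexed before use:

my.toolkit/                      # toolkit root (name is illustrative)
    info.xml                     # toolkit name, version, dependencies
    com.example.util/            # an SPL namespace directory
        Doubler.spl              # a composite operator
        makeLogin.spl            # SPL functions

# Index the toolkit so the compiler can find its operators and functions:
#   spl-make-toolkit -i my.toolkit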
A quick peek inside …
InfoSphere Streams Instance – Single Host

Management services and applications all run on one host:
 Streams Web Service (SWS)
 Streams Application Manager (SAM)
 Streams Resource Manager (SRM)
 Authorization and Authentication Service (AAS)
 Scheduler, Recovery DB, Name Server
 Host Controller and Processing Element Container
All share the local file system.
A quick peek inside …
InfoSphere Streams Instance – Multi-host, Management Services on a separate node

 One management node runs the Streams Web Service (SWS), Streams Application Manager (SAM), Streams Resource Manager (SRM), Authorization and Authentication Service (AAS), Scheduler, Recovery DB, and Name Server.
 Each application host runs a Host Controller and a Processing Element Container.
 All hosts share a file system.
A quick peek inside …
InfoSphere Streams Instance – Multi-host, Management Services on multiple hosts

 For larger installations, the management services can themselves be spread across hosts: SWS, SAM, SRM, AAS, Scheduler, Name Server, and the Recovery DB each run on their own management host.
 Application hosts run a Host Controller and a Processing Element Container, as before, over a shared file system.
IBM Stream au Hadoop User Group

  • 1. Big Data Jerome Chailloux, Big Data Specialist jerome.chailloux@fr.ibm.com © 2011 IBM Corporation
  • 2. Imagine the Possibilities of Analyzing All Available Data: Faster, More Comprehensive, Less Expensive. Examples: real-time traffic flow optimization; fraud and risk detection; understanding and acting on customer sentiment; accurate and timely threat detection; predicting and acting on intent to purchase; low-latency network analysis. © 2011 IBM Corporation
  • 3. Where is this data coming from? Every day, the New York Stock Exchange captures 1 TB of trade information. 12 TB of tweets are created each day. 5 billion mobile phones were in use in 2010, and only 12% were smartphones. Every second of HD video generates more than 2,000 times as many bytes as are required to store a single page of text. More than 30M networked sensors are deployed, growing at a rate of more than 30% per year. What is your business doing with it? (Source: McKinsey & Company, May 2011) © 2011 IBM Corporation
  • 4. Why is Big Data important? The gap between the data AVAILABLE to an organization and the data an organization can PROCESS keeps widening, and that gap is missed opportunity. Organizations are able to process less and less of the available data, so enterprises become "more blind" to new opportunities. © 2011 IBM Corporation
  • 5. What does a Big Data platform do? Analyze a Variety of Information: novel analytics on a broad set of mixed information that could not be analyzed before. Analyze Information in Motion: streaming data analysis; large-volume data bursts and ad-hoc analysis. Analyze Extreme Volumes of Information: cost-efficiently process and analyze petabytes of information; manage and analyze high volumes of structured, relational data. Discover & Experiment: ad-hoc analytics, data discovery and experimentation. Manage & Plan: enforce data structure, integrity and control to ensure consistency for repeatable queries. © 2011 IBM Corporation
  • 6. Complementary Approaches for Different Use Cases. Traditional Approach: structured, analytical, logical; a Data Warehouse fed from traditional sources (transaction data, internal app data, mainframe data, OLTP system data, ERP data); structured, repeatable, linear work such as monthly sales reports, profitability analysis, and customer surveys. New Approach: creative, holistic thought and intuition; Hadoop and Streams fed from new sources (web logs, social data, text data such as emails, sensor data, images, RFID); unstructured, exploratory, iterative work such as brand sentiment, product strategy, and maximum asset utilization. Enterprise integration connects the two. © 2011 IBM Corporation
  • 7. IBM Big Data Strategy: Move the Analytics Closer to the Data. New analytic applications (BI/reporting, exploration/visualization, functional apps, industry apps, predictive analytics, content analytics) drive the requirements for a big data platform: integrate and manage the full variety, velocity and volume of data; apply advanced analytics to information in its native form; visualize all available data for ad-hoc analysis; provide a development environment for building new analytic applications; handle workload optimization and scheduling; provide security and governance. The IBM Big Data Platform spans visualization & discovery, application development, systems management, accelerators, Hadoop, stream computing, and data warehouse, all on top of information integration & governance. © 2011 IBM Corporation
  • 8. Most Client Use Cases Combine Multiple Technologies. Pre-processing: ingest and analyze unstructured data types and convert them to structured data. Combine structured and unstructured analysis: augment the data warehouse with additional external sources, such as social media. Combine high-velocity and historical analysis: analyze and react to data in motion; adjust models with deep historical analysis. Reuse structured data for exploratory analysis: experimentation and ad-hoc analysis with structured data. © 2011 IBM Corporation
  • 9. IBM is in a lead position to exploit the Big Data opportunity. February 2012: "The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012". IBM differentiation: embracing open source; data in motion (Streams) and data at rest (Hadoop/BigInsights); tight integration with other Information Management products; bundled, scalable analytics technology; Apache Hadoop hardened for enterprise readiness. © 2011 IBM Corporation
  • 10. IBM’s unique strengths in Big Data. Big Data in Real-Time: ingest, analyze and act on massive volumes of streaming data; faster AND more cost-effective for specific use cases (10x the volume of data on the same hardware). Fit-for-purpose analytics: analyzes a variety of data types in their native format: text, geospatial, time series, video, audio and more. Enterprise Class: open source enhanced for reliability, performance and security; high-performance warehouse software and appliances; ease of use with end-user, admin and development UIs. Integration: integration into your IM architecture; pre-integrated analytic applications. © 2011 IBM Corporation
  • 11. Stream Computing: What is it good for? Analyze all your data, all the time, just in time. What if you could get IMMEDIATE insight? What if you could analyze MORE kinds of data? What if you could do it with exceptional performance? Traditional data, sensor events, and signals flow in; analytic results flow out to alerts, threat prevention systems, active response, and logging to storage and warehousing for more context. © 2011 IBM Corporation
  • 12. What is Stream Processing? Relational databases and warehouses find information stored on disk; stream computing analyzes data before you store it. Databases find the needle in the haystack; Streams finds the needle as it’s blowing by. © 2011 IBM Corporation
  • 13. Without Streams vs. With Streams. Without Streams, you build everything by hand: intensive scripting; embedded SQL; file/storage management; record management embedded in application code; data buffering and locality; security; dynamic application composition; high availability; application management (checkpointing, performance optimization, monitoring, workload management, error and event handling); applications tied to specific hardware and infrastructure; multithreading and multiprocessing; debugging; migration from development to production; integration of best-of-breed commercial tools; code reusability; source/target interfaces. With Streams, the platform provides a productive and reusable development environment, and the Streams runtime provides your application infrastructure. "TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language." – Alex Philp, TerraEchos. © 2011 IBM Corporation
  • 14. Streams © 2011 IBM Corporation
  • 15. How Streams Works. Continuous ingestion and continuous analysis: filter/sample, transform, annotate, correlate, classify. The infrastructure provides services for scheduling analytics across hardware hosts and for establishing streaming connectivity. Achieve scale by partitioning applications into software components and distributing them across stream-connected hardware hosts; where appropriate, elements can be fused together for lower communication latency. © 2011 IBM Corporation
  • 16. Scalable Stream Processing. Streams programming model: construct a graph, in the mathematical sense (not a line, bar, or pie chart), also called a network; a familiar example of a graph is a tree structure. The graph consists of operators and the streams that connect them: the vertices (nodes) and edges of the mathematical graph. It is a directed graph: the edges have a direction (arrows). Streams runtime model: distributed processes. A single operator or multiple operators form a Processing Element (PE); the compiler and runtime services make it easy to deploy PEs on one machine or across multiple hosts in a cluster when scaled-up processing is required. All links and data transport are handled by runtime services, automatically, with manual placement directives where required. © 2011 IBM Corporation
  • 17. InfoSphere Streams Objects: Runtime View. Instance: the runtime instantiation of InfoSphere Streams executing across one or more hosts; a collection of components and services. Processing Element (PE): the fundamental execution unit that is run by the Streams instance; it can encapsulate a single operator or many "fused" operators. Job: a deployed Streams application executing in an instance; it consists of one or more PEs. © 2011 IBM Corporation
  • 18. InfoSphere Streams Objects: Development View. Operator: the fundamental building block of the Streams Processing Language; operators process data from streams and may produce new streams. Stream: an infinite sequence of structured tuples; it can be consumed by operators on a tuple-by-tuple basis or through the definition of a window. Tuple: a structured list of attributes and their types; each tuple on a stream has the form dictated by its stream type (e.g., height/width/data for image tuples, or directory/filename for file references). Stream type: the specification of the name and data type of each attribute in the tuple. Window: a finite, sequential group of tuples, based on count, time, attribute value, or punctuation marks. © 2011 IBM Corporation
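To make the window concept concrete, here is a small hedged sketch (the stream and attribute names are illustrative, not from the deck): an Aggregate operator from the standard toolkit consumes an Employees stream, assumed to be stream<rstring dept, float64 salary>, through a tumbling window of 100 tuples and emits one average per department.

    stream<rstring dept, float64 avgSalary> Stats = Aggregate(Employees) {
        // Collect 100 tuples, emit the aggregates, then start over.
        window Employees : tumbling, count(100);
        param  groupBy   : dept;
        output Stats     : avgSalary = Average(salary);
    }

Swapping tumbling for sliding, or count(100) for a time-based specification, changes how tuples are grouped without touching the rest of the invocation.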
  • 19. What is Streams Processing Language? Designed for stream computing: define a streaming-data flow graph, with a rich set of data types to define tuple attributes. Declarative: operator invocations name the input and output streams; referring to streams by name is enough to connect the graph. Procedural support: a full-featured C++/Java-like language; custom logic in operator invocations; expressions in attribute assignments and parameter definitions. Extensible: user-defined data types; custom functions written in SPL or a native language (C++ or Java); custom operators written in SPL; user-defined operators written in C++ or Java. A minimal end-to-end sketch follows. © 2011 IBM Corporation
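As a concrete illustration of these points, here is a minimal hedged sketch of a complete SPL application, using only standard-toolkit operators that appear on slide 44 (Beacon, Functor, FileSink); the stream names, attribute names, and file name are illustrative:

    composite Main {
        graph
            // Source: emit ten tuples numbered 0..9.
            stream<uint64 id> Numbers = Beacon() {
                param  iterations : 10u;
                output Numbers    : id = IterationCount();
            }

            // Transform: keep even ids and add a derived attribute.
            stream<uint64 id, rstring label> Evens = Functor(Numbers) {
                param  filter : id % 2ul == 0ul;
                output Evens  : label = "even-" + (rstring) id;
            }

            // Sink: write comma-separated values to the data directory.
            () as Writer = FileSink(Evens) {
                param
                    file   : "evens.csv";
                    format : csv;
            }
    }

Referring to Numbers and Evens by name is all it takes to connect the graph; there is no separate wiring step.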
  • 20. Some SPL Terms. An operator represents a class of manipulations of tuples from one or more input streams to produce tuples on one or more output streams (e.g., an Aggregate turning employee salary info into statistics). A stream connects to an operator on a port; an operator defines input and output ports. An operator invocation is a specific use of an operator, with specific assigned input and output streams and with locally specified parameters, logic, etc. Many operators have one input port and one output port; others have zero input ports (source adapters, e.g., TCPSource, FileSource), zero output ports (sink adapters, e.g., FileSink), multiple output ports (e.g., Split), or multiple input ports (e.g., Join). A composite operator is a collection of operators: an encapsulation of a subgraph of primitive (non-composite) operators and composite operators (nested), similar to a macro in a procedural language. © 2011 IBM Corporation
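For a feel of multiple output ports, here is a hedged sketch of a Split invocation (the In stream and the threshold are illustrative); the index expression selects the output port for each tuple:

    (stream<uint64 id, int64 value> Low;
     stream<uint64 id, int64 value> High) = Split(In) {
        // Port 0 receives small values, port 1 large ones.
        param index : value > 100l ? 1l : 0l;
    }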
  • 21. Composite Operators. Every graph is encoded as a composite: a composite is a graph of one or more operators and may have input and output ports. It is a source-code construct only and has nothing to do with operator fusion (PEs). Each stream declaration in the composite invokes a primitive operator or another composite operator. An application is a main composite: it has no input or output ports; data flows in and out, but not on streams within a graph. Streams may be exported to and imported from other applications running in the same instance. © 2011 IBM Corporation
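A hedged sketch of a non-main composite with one input and one output port (the names are illustrative); invoking it looks exactly like invoking a primitive operator:

    composite Uppercase(input In0; output Out0) {
        graph
            // The stream named after the output port defines what the
            // composite emits; In0 is whatever stream the caller passes in.
            stream<rstring line> Out0 = Functor(In0) {
                output Out0 : line = upper(line);
            }
    }

    // Elsewhere in a graph:
    //     stream<rstring line> Loud = Uppercase(Lines) {}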
  • 22. Anatomy of an Operator Invocation. Operators share a common structure, where <…> marks the sections to fill in:

    stream<stream-type> stream-name = MyOperator(input-stream; …) {
        logic   <logic>;
        window  <windowspec>;
        param   <parameters>;
        output  <output>;
        config  <configuration>;
    }

Reading an operator invocation: declare a stream stream-name, with attributes from stream-type, that is produced by MyOperator from the input(s) input-stream; the behavior of MyOperator is defined by logic, windowspec, parameters, and configuration, and output attribute assignments are specified in output. Example:

    stream<rstring item> Sale = Join(Bid; Ask) {
        window
            Bid : sliding, time(30);
            Ask : sliding, count(50);
        param
            match : Bid.item == Ask.item && Bid.price >= Ask.price;
        output
            Sale : item = Bid.item;
    }

For the example: declare the stream Sale with the attribute item, which is a raw string; join the Bid and Ask streams over a sliding window of 30 seconds on Bid and 50 tuples on Ask; when the items are equal and the Bid price is greater than or equal to the Ask price, output the item value on the Sale stream. © 2011 IBM Corporation
  • 23. Streams V2.0 Data Types. The type system is a tree: (any) divides into (primitive) and (composite). Primitive types: boolean, enum, timestamp, blob; (string): rstring, ustring; (numeric): (integral), (floatingpoint), (decimal), (complex). Integral types are (signed) int8, int16, int32, int64 and (unsigned) uint8, uint16, uint32, uint64; floating-point types are float32, float64, float128; decimal types are decimal32, decimal64, decimal128; complex types are complex32, complex64, complex128. Composite types: (collection) list, set, map; and tuple. © 2011 IBM Corporation
  • 24. Stream and Tuple Types. Stream type (often called "schema"): the definition of the structure of the data flowing through the stream. Tuple type definition: tuple<sequence of attributes>, e.g., tuple<uint16 id, rstring name>; an attribute is a type and a name, and any attribute may itself be another tuple type (nesting). A stream type is a tuple type: stream<sequence of attributes>, e.g., stream<uint16 id, rstring name>. Indirect stream type definitions: fully defined within the output stream declaration (stream<uint32 callerNum, … rstring endTime, list<uint32> mastIDs> Calls = Op(…)…); by reference to a tuple type (CallInfo = tuple<uint32 callerNum, … rstring endTime, list<uint32> mastIDs>; stream<CallInfo> InternationalCalls = Op(…) {…}); or by reference to another stream (stream<Calls> RoamingCalls = Op(…) {…}). © 2011 IBM Corporation
  • 25. Collection Types. list: an array with bounds-checking, e.g., [0, 17, age-1, 99]; random access (any element at any time); ordered and base-zero indexed: the first element is someList[0]. set: an unordered collection, e.g., {"cats", "yeasts", "plankton"}; no duplicate element values. map: key-to-value mappings, e.g., {"Mon":0, "Sat":99, "Sun":-1}; unordered. Use type constructors to specify the element type: list<type>, set<type> (e.g., list<uint16>, set<rstring>), and map<key-type,value-type> (e.g., map<rstring[3],int8>). Collections can be nested to any number of levels, e.g., map<int32, list<tuple<ustring name, int64 value>>> with a value such as {1 : [{"Joe",117885}, {"Fred",923416}], 2 : [{"Max",117885}], -1 : []}. Bounded collections optimize performance: list<int32>[5] holds at most 5 (32-bit) integer elements; bounds also apply to strings: rstring[3] has at most 3 (8-bit) characters. © 2011 IBM Corporation
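Pulling these constructors together, here is a hedged sketch of a schema that mixes them, following the Op(…) placeholder convention of slide 24 (the attribute names are illustrative):

    stream<uint32 callerNum,
           list<uint32>[8] mastIDs,          // bounded: at most 8 mast IDs
           map<rstring, int32> counters,     // unordered key-to-value pairs
           set<rstring> tags                 // unordered, no duplicates
          > Calls = Op(…) {}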
  • 26. The Functor Operator. Transforms input tuples into output tuples: one input port, one or more output ports. It may filter tuples via the filter parameter, a boolean expression: if true, emit the output tuple; if false, do not. Arbitrary attribute assignments are supported: full-blown expressions, including function calls; attributes can be dropped, added, or transformed, and omitted attributes are auto-assigned. Custom logic is supported through the logic clause, which may include state and applies to both the filter and the assignments. Example: given

    stream<rstring name, uint32 age, uint64 salary> Person = Op(…) {}

a Functor can filter and reshape it (keeping name and age by auto-assignment, adding login and info, dropping salary):

    stream<rstring name, uint32 age, rstring login,
           tuple<boolean young, boolean rich> info> Adult = Functor(Person) {
        param
            filter : age >= 21u;
        output Adult :
            login = lower(name),
            info  = {young = (age < 30u), rich = (salary > 100000ul)};
    }

© 2011 IBM Corporation
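The slide shows the filter and output clauses but not the logic clause it mentions, so here is a hedged sketch of how state might look (names are illustrative): a Functor that tags each passing tuple with a running count of all tuples that have arrived.

    stream<rstring name, uint64 seen> Counted = Functor(Person) {
        logic
            state   : { mutable uint64 count = 0ul; }
            onTuple Person : { count++; }      // runs for every arriving tuple
        param  filter  : age >= 21u;
        output Counted : seen = count;         // name is auto-assigned
    }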
  • 27. The FileSink Operator. Writes tuples to a file. It has a single input port and no output port: data goes to a file, not to a Streams stream. Selected parameters: file (mandatory; the base for relative paths is the data subdirectory; directories must already exist), flush (flush the output buffer after a given number of tuples), and format (csv for comma-separated values; also txt, line, binary, block). Example:

    () as Sink = FileSink(StreamIn) {
        param
            file   : "/tmp/people.dat";
            format : csv;
            flush  : 20u;
    }

© 2011 IBM Corporation
  • 28. Communication Between Streams Applications. Streams jobs exchange data with the outside world through source- and sink-type operators, which can also be used between Streams jobs (e.g., TCPSource/TCPSink). Streams jobs can also exchange data with each other directly, within one Streams instance. This supports dynamic application composition, by name or based on properties (tags): one job exports a stream; another imports it. It is implemented using two pseudo-operators: Export and Import. © 2011 IBM Corporation
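Slide 36 below shows the property-based (publish and subscribe) style in SPL; for contrast, here is a hedged sketch of the by-name (point-to-point) variant, with an illustrative namespace, composite name, and stream ID:

    // In the exporting job:
    () as Out = Export(Results) {
        param streamId : "results";
    }

    // In the importing job, naming the exporter explicitly:
    stream<uint64 id, rstring payload> Results = Import() {
        param
            applicationName : "sample::ExporterMain";
            streamId        : "results";
    }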
  • 29. Application Design – Dynamic Stream Properties. An API is available for toolkit development; it can add, modify, and delete exported stream properties and imported stream subscription expressions. This enables a dynamic job flow control bus pattern: operators within jobs interpret control-stream tuples and rewire the flow of data from job to job. Here an exported control stream carrying [A,B,C] routes the data stream through jobs A, B, and C. © 2011 IBM Corporation
  • 30. (Animation step of the previous slide.) The control stream now carries [A,C,D], rewiring the same data stream through jobs A, C, and D. © 2011 IBM Corporation
  • 31. Application Design – Multi-job Design. In a Streams instance (stream1), the imagefeeder job (DirectoryScan → ImageSource → Functor) exports a stream carrying filename, timestamp, and file metadata with the properties name = "Feed", type = "Image", write = "ok"; the imagewriter job (ImageSink → FileSink) imports it with the subscription type == "Image" && write == "ok". The theme: application/job decomposition through dynamic job submission plus stream import/export. © 2011 IBM Corporation
  • 32–35. (Progressive build of the same design.) The exported stream is extended to carry the image itself plus file metadata; a greyscaler job joins, subscribing to name == "Feed" and re-exporting its output with name = "Grey", type = "Image", write = "ok"; then resizer, facial scan, and alerter jobs attach the same way, each subscribing to the streams it needs while the already-running jobs continue unchanged. © 2011 IBM Corporation
  • 36. Two Styles of Export/Import. Publish and subscribe (the recommended approach): the exporting application publishes a stream with certain properties, and the importing stream subscribes to an exported stream whose properties satisfy a specified condition. Point to point: the importing application names a specific stream of a specific exporting application. Publish and subscribe is dynamic: export properties and import expressions can be altered during the execution of a job, which allows dynamic data flows, altering the flow of data based on the data itself (history, trends, etc.). Example:

    () as ImageStream = Export(ImagesIn) {
        param properties : {
            streamName = "ImageFeed",
            dataType   = "IplImage",
            writeImage = "true" };
    }

    stream<IplImage image, rstring filename, rstring directory> ImagesIn = Import() {
        param subscription :
            dataType == "IplImage" && writeImage == "true";
    }

© 2011 IBM Corporation
  • 37. Parallelization Patterns – Introduction. Problem statement: a series of operations is to be performed on a piece of data (a tuple); how can the performance of these operations be improved? Key question: reduce latency (for a single piece of data) or increase throughput (for the entire data flow)? Three possible design patterns: serial path (pipeline), parallel operators (task parallelization), and parallel paths (data parallelization). © 2011 IBM Corporation
  • 38. Parallelization Patterns – Pipeline, Task. Pipeline (serial path) A → B → C → D: the base pattern, inherent in the graph paradigm; results arrive at D in time T(A) + T(B) + T(C). Parallel operators (task parallelization): process the tuple in operators A, B, and C at the same time; this requires a merger M (e.g., Barrier) before operator D, and results arrive at D in time Max(T(A), T(B), T(C)) + T(M). Use it when the tuple latency requirement is less than T(A) + T(B) + T(C); the complexity of the merger depends on the behavior of operators A, B, and C. A sketch of this wiring follows. © 2011 IBM Corporation
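A hedged sketch of the task-parallel wiring (ScoreA, ScoreB, and ScoreC are hypothetical composites; Barrier is in the standard toolkit listed on slide 44), assuming each analytic emits exactly one output tuple per input, in order:

    // The same In stream fans out to three independent analytics.
    stream<uint64 id, float64 a> A = ScoreA(In) {}
    stream<uint64 id, float64 b> B = ScoreB(In) {}
    stream<uint64 id, float64 c> C = ScoreC(In) {}

    // Barrier waits for one tuple from each port, then emits one merged tuple.
    stream<uint64 id, float64 a, float64 b, float64 c> Scores = Barrier(A; B; C) {
        output Scores : id = A.id;   // a, b, c are auto-assigned from the ports
    }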
  • 39. Parallelization Patterns – Parallel Pipelines. Parallel pipelines (data parallelization) run several copies of the A → B → C pipeline side by side before D: a migration step from the pipeline pattern that can improve throughput, especially for variable-size data or variable processing time. Design decisions: are there latency and/or throughput requirements? Do the operators perform filtering, feature extraction, transformation? Is there an execution-order requirement? Is there a tuple-order requirement? Recommendation: pipeline first, then parallel pipelines when possible. © 2011 IBM Corporation
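A hedged sketch of the data-parallel variant with standard-toolkit operators (Work is a hypothetical composite standing in for the A → B → C pipeline; the schema and buffer size are illustrative). ThreadedSplit load-balances tuples across the lanes, and Union merges the results; Union does not restore arrival order, hence the sequence ID for downstream logic that needs ordering:

    (stream<uint64 seq, rstring payload> Lane0;
     stream<uint64 seq, rstring payload> Lane1) = ThreadedSplit(In) {
        param bufferSize : 1000u;   // per-lane buffer before back-pressure
    }
    stream<uint64 seq, rstring payload> Done0 = Work(Lane0) {}
    stream<uint64 seq, rstring payload> Done1 = Work(Lane1) {}
    stream<uint64 seq, rstring payload> Merged = Union(Done0; Done1) {}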
  • 40. Application Design – Multi-tier Design. An N-tier design: the number and purpose of the tiers is a result of application design, for example ingestion → transport adaptation → reduction/transformation → processing/analytics → transport adaptation. Create well-defined interfaces between the tiers. This supports several overarching concepts: incremental development and testing; application, job, and operator reuse; modular programming practices. Each tier in these examples may be made up of one or more jobs (programs). © 2011 IBM Corporation
  • 41–42. Application Design – High Availability. The HA application design pattern: the source job exports its stream, enriched with a tuple ID; jobs 1 and 2 process the stream in parallel on separate host pools and export their final streams; the sink job imports both, discards duplicates, and alerts on missing tuples. The second slide repeats the diagram with the jobs placed across x86 hosts in four host pools (an animation step). A sketch of the sink-side de-duplication follows. © 2011 IBM Corporation
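A hedged sketch of the sink side of this pattern (the names are illustrative; Union and DeDuplicate are in the standard toolkit on slide 44): both replicas export identical result streams tagged with the tuple ID added by the source job, and the first copy of each ID wins. Alerting on missing IDs would take a stateful Custom operator and is omitted here.

    stream<uint64 tupleId, rstring payload> Both = Union(FromJob1; FromJob2) {}

    stream<uint64 tupleId, rstring payload> Unique = DeDuplicate(Both) {
        param
            key     : tupleId;   // the ID the source job stamped on each tuple
            timeOut : 60.0;      // stop remembering an ID after 60 seconds
    }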
  • 43. IBM InfoSphere Streams: an agile development environment, a distributed runtime environment, and a sophisticated analytics environment with toolkits and adapters (Front Office 3.0). Development: Eclipse IDE, Streams Live Graph, Streams Debugger, over 50 samples. Runtime: a clustered runtime for massive scalability; RHEL v5.x and v6.x, CentOS v6.x; x86 & Power multicore hardware; Ethernet & InfiniBand. Toolkits: Database, Mining, Financial, Standard, Internet, BigData (HDFS, DataExplorer), Advanced Text, Geospatial, Timeseries, Messaging, user-defined, and more. © 2011 IBM Corporation
  • 44. Toolkits and Operators to Speed and Simplify Development. Standard Toolkit: relational operators (Filter, Sort, Functor, Join, Punctor, Aggregate); adapter operators (FileSource, FileSink, DirectoryScan, TCPSource, TCPSink, UDPSource, UDPSink, Export, Import, MetricsSink); utility operators (Custom, Beacon, Throttle, Delay, Barrier, Pair, Split, DeDuplicate, Union, ThreadedSplit, DynamicFilter, Gate, JavaOp). Internet Toolkit: InetSource (HTTP, HTTPS, FTP, FTPS, RSS, file). Database Toolkit: ODBCAppend, ODBCEnrich, ODBCSource, SolidDBEnrich, DB2SplitDB, DB2PartitionedAppend; supports DB2 LUW, IDS, solidDB, Netezza, Oracle, SQL Server, MySQL. Also: Financial Toolkit, Data Mining Toolkit, Big Data Toolkit, Text Toolkit, and more. The standard toolkit contains the default operators shipped with the product; user-defined toolkits extend the language by adding user-defined operators and functions. © 2011 IBM Corporation
  • 45. User-Defined Toolkits. Streams supports toolkits: reusable sets of operators and functions. What can be included in a toolkit? Primitive and composite operators; native and SPL functions; types; tools, documentation, samples, data, etc. Versioning is supported, with dependencies on other versioned assets (toolkits, Streams releases), enabling cross-domain and domain-specific accelerators. © 2011 IBM Corporation
  • 46. © 2011 IBM Corporation 46
  • 47. A quick peek inside… InfoSphere Streams Instance – Single Host. Management services and applications all run on one host: Streams Web Service (SWS), Streams Application Manager (SAM), Streams Resource Manager (SRM), Authorization and Authentication Service (AAS), Scheduler, Recovery DB, and Name Server, alongside the Host Controller and a Processing Element Container, on top of the file system. © 2011 IBM Corporation
  • 48. A quick peek inside… InfoSphere Streams Instance – Multi-host, Management Services on a separate node. One management host runs the Streams Web Service (SWS), Streams Application Manager (SAM), Streams Resource Manager (SRM), Authorization and Authentication Service (AAS), Scheduler, Recovery DB, and Name Server; three application hosts each run a Host Controller and a Processing Element Container; all hosts share a file system. © 2011 IBM Corporation
  • 49. A quick peek inside… InfoSphere Streams Instance – Multi-host, Management Services on multiple hosts. The management services (Streams Web Service, AAS, Recovery DB, Streams App Manager, Scheduler, Streams Resource Mgr, Name Server) are spread across several management hosts, one of which also serves as an application host with its own Host Controller and Processing Element Container; four further application hosts each run a Host Controller and a Processing Element Container; all hosts share a file system. © 2011 IBM Corporation