IBM's Big Data Platform
InfoSphere BigInsights and InfoSphere Streams



Wilfried Hoge – Leading Technical Sales Professional
hoge@de.ibm.com
twitter.com/wilfriedhoge
IBM Big Data Strategy: Move the Analytics Closer to the Data

New analytic applications drive the requirements for a big data platform

•  Integrate and manage the full variety, velocity and volume of data
•  Apply advanced analytics to information in its native form
•  Visualize all available data for ad-hoc analysis
•  Development environment for building new analytic applications
•  Workload optimization and scheduling
•  Security and Governance

[Diagram: IBM Big Data Platform]
•  Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
•  IBM Big Data Platform: Visualization & Discovery, Application Development, Systems Management
•  Accelerators
•  Hadoop System, Stream Computing, Data Warehouse
•  Information Integration & Governance
Volume and Velocity – two dimensions for Big Data
[Chart: example workloads plotted by Data Scale (y-axis: Kilo, Mega, Giga, Tera, Peta, Exa) against Decision Frequency (x-axis: yr, mo, wk, day, hr, min, sec, …, ms, µs; grouped as Occasional, Frequent, Real-time)]

Regions
•  Traditional Data Warehouse and Business Intelligence
•  Data at Rest: up to 10,000 times larger
•  Data in Motion: up to 10,000 times faster

Examples
•  Wind Turbine Placement & Operation: PBs of data; analysis time cut from 3 weeks to 3 days; 1220 IBM iDataPlex nodes
•  DeepQA: 100s GB for Deep Analytics; 3 sec/decision; Power7, 15TB memory
•  Telco Promotions: 100,000 records/sec, 6B/day; 10 ms/decision; 270TB for Deep Analytics
•  Security: 600,000 records/sec, 50B/day; 1-2 ms/decision; 320TB for Deep Analytics

26.04.2012  © Copyright IBM Corporation 2012
BigInsights – analytical platform for persistent “Big Data”
Based on open source & IBM technologies

Distinguishing characteristics
•  Built-in analytics . . . enhances business knowledge
•  Enterprise software integration . . . complements and extends existing capabilities
•  Production-ready platform with tooling for analysts, developers, and administrators . . . speeds time-to-value and simplifies development/maintenance

IBM advantage
•  Combination of software, hardware, services and advanced research

[Diagram: IBM Big Data Platform stack, repeated from the strategy slide]
About the BigInsights Platform
Flexible, enterprise-class support for processing large volumes of data
•  Based on Google’s MapReduce technology
•  Inspired by Apache Hadoop; compatible with its ecosystem and distribution
•  Well-suited to batch-oriented, read-intensive applications
•  Supports a wide variety of data


Enables applications to work with thousands of nodes and petabytes of
data in a highly parallel, cost-effective manner
•  CPU + disks = “node”
•  Nodes can be combined into clusters
•  New nodes can be added as needed without changing
  •  Data formats
  •  How data is loaded
  •  How jobs are written
Hadoop Explained – Map Reduce
Hadoop computation model
•  Data stored in a distributed file system spanning many inexpensive computers
•  Bring function to the data
•  Distribute application to the compute resources where the data is stored
Scalable to thousands of nodes and petabytes of data

MapReduce Application (word count, excerpt)

    public static class TokenizerMapper
        extends Mapper<Object,Text,Text,IntWritable> {
      // Reused output value: every token counts once
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(Object key, Text val, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr =
            new StringTokenizer(val.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

    public static class IntSumReducer
        extends Reducer<Text,IntWritable,Text,IntWritable> {
      private IntWritable result = new IntWritable();

      public void reduce(Text key,
          Iterable<IntWritable> val, Context context){
        int sum = 0;
        for (IntWritable v : val) {
          sum += v.get();
    . . .

Processing on the Hadoop Data Nodes
1.  Map Phase (break job into small parts): distribute map tasks to cluster
2.  Shuffle (transfer interim output for final processing)
3.  Reduce Phase (boil all output down to a single result set): return a single result set
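
The three phases can be sketched without a cluster. The following is a minimal single-process illustration in plain Java, not Hadoop code: the class name MapReduceSketch and its methods are hypothetical, and the "shuffle" is just an in-memory grouping step standing in for the network transfer between nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone sketch of map, shuffle and reduce for word count.
class MapReduceSketch {
    // Map phase: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(Map.entry(itr.nextToken(), 1));
        }
        return out;
    }

    // Shuffle: group all emitted values by key (kept sorted by key here).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs all three phases over the input lines and returns word -> count.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data big insights", "data in motion")));
    }
}
```

On a real cluster each call to map would run on the node that holds the corresponding block of input, which is the "bring function to the data" idea from the slide.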
BigInsights – Value Beyond Open Source
Technical differentiators
•  Built-in analytics
  •  Text processing engine, annotators, Eclipse tooling
  •  Statistical and predictive analysis
  •  Interface to project R (statistical platform)
•  Enterprise software integration (DBMS, warehouse)
•  Spreadsheet-style analytical tool for analysts
•  Ready-made business process accelerators
•  Integrated installation of supported open source and IBM components
•  Web Console for administration and application access
•  Platform enrichment: additional security, performance features, . . .
•  Standard IBM licensing agreement and world-class support
Business benefits
•  Quicker time-to-value due to IBM technology and support
•  Reduced operational risk
•  Enhanced business knowledge with flexible analytical platform
•  Leverages and complements existing software assets
Web Installation Tool
Seamless process for single
node and cluster environments


Integrated installation of all
selected components


Post-install validation of IBM and
open source components




       No need to iteratively download, configure, and test multiple open source
       projects and their pre-requisite software.
Web Console
Manage BigInsights
•  Inspect system health
•  Add / drop nodes
•  Start / stop services
•  Run / monitor jobs (applications)
•  Explore / modify file system


Launch applications
•  Spreadsheet-like analysis tool
•  Pre-built applications (IBM supplied
   or user developed)


Publish applications
Leverage community resources
BigSheets
BigSheets is a visual tool for data manipulation and prototyping
•  Allows more users to do more work, more quickly
•  Simply stated, growing an army of MapReduce developers is not cost effective
•  In your BI environments you have a ratio of 30+ report users for every complex SQL
   developer. We need to support the same ratios with BigInsights

Sample Uses
•  Data exploration and visualization
•  Visual job creation
BigSheets – Spreadsheet-style Data Analysis and Discovery
BigSheets – Visualization
Quick start applications or “apps”
Reusable software assets based on customer engagements
•  Useful as a starting point for various applications
•  Can be customized by BigInsights application developers as needed
•  Accessible through Web console



Available assets
•  Data export (to relational DBMS, files, HBase)
•  Data import (from relational DBMS, files)
•  Web crawler, Twitter crawler
•  Boardreader.com support (Web forum search engine)
•  Ad hoc queries for Jaql, Hive, Pig
•  TeraGen-TeraSort, WordCount sample applications
Running Applications from the Web Console
Develop Hive with the SQL Editor and view results
Build a Big Data Program – Map Reduce example

                              Eclipse based development tools
                                  For JAQL, Hive, Java MapReduce, Text Analytics
Text Analytics in BigInsights
Text analytics – Distill structured information from unstructured data
•  Rich annotator library supports multiple languages
•  Declarative Information Extraction (IE) system based on an algebraic framework
•  Richer, cleaner rule semantics
•  Better performance through optimization
•  Compose operators to build complex annotators

Developed at IBM Research since 2004

Embedded in several IBM products
•  Lotus Notes
•  Cognos Consumer Insights
•  InfoSphere Streams
Turns disparate words into measurable insights
Pre-configured text annotators ready for distributed processing on Big Data
•  City, County, Zipcode, Address, Maplocation, StateOrProvince, Country, Continent,
   EmailAddress, Person, Organization, DateTime, URL, Company Names, Merger,
   Acquisition, Alliance, etc.
Support for native languages including double-byte




[Diagram: five processing stages]
1.  Physically assemble data, standardize formats, address auto-identify language, process punctuation and non-grammatical characters, standardize spelling.
2.  Part-of-speech identification, standard and customized extraction dictionaries, proper noun identification, concept categorization, synonyms, exclusions, multi-terms, regular expressions, fuzzy-matching
3.  Identify positive or negative sentiment, NLP-based analytics, define variables, macros and rules.
4.  Iterative classification using automated and manual techniques. Concept derivation & inclusion, semantic networks and co-occurrence rules
5.  Reporting/Monitoring social commentary, combination w/ structured data, clustering, associated concepts, correlated concepts, auto-classification of documents, sites, posts.
Text Analytics – highly accurate analysis of textual content
How it works
•  Parses text and detects meaning with annotators
•  Understands the context in which the text is analyzed
•  Hundreds of pre-built annotators for names, addresses, phone numbers, among others
Accuracy
•  Highly accurate in deriving meaning from complex text
Performance
•  AQL language optimized for MapReduce

[Diagram: Unstructured text (document, email, etc) → Classification and Insight]
Sample text: "Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands' striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win."
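
To make the idea of an annotator concrete, here is a minimal sketch in plain Java that tags spans of text with a type. It is not AQL and not the BigInsights annotator library: the class AnnotatorSketch, the "Score" annotation type and both regular expressions are illustrative assumptions, far simpler than production rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rule-based annotator: each rule turns matching text into a typed span.
class AnnotatorSketch {
    // One extracted span: annotation type, matched text, and character offsets.
    static final class Annotation {
        final String type;
        final String text;
        final int begin;
        final int end;

        Annotation(String type, String text, int begin, int end) {
            this.type = type;
            this.text = text;
            this.begin = begin;
            this.end = end;
        }
    }

    // Deliberately simple rules (assumptions, for illustration only).
    private static final Pattern SCORE = Pattern.compile("\\b(\\d+)-(\\d+)\\b");
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Applies every rule to the text; results are ordered rule by rule.
    static List<Annotation> annotate(String text) {
        List<Annotation> out = new ArrayList<>();
        collect("Score", SCORE, text, out);
        collect("EmailAddress", EMAIL, text, out);
        return out;
    }

    private static void collect(String type, Pattern p, String text, List<Annotation> out) {
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(new Annotation(type, m.group(), m.start(), m.end()));
        }
    }

    public static void main(String[] args) {
        String text = "losing to the eventual champions 1-0 in the Final";
        for (Annotation a : annotate(text)) {
            System.out.println(a.type + ": " + a.text + " [" + a.begin + "," + a.end + ")");
        }
    }
}
```

The structured output (type, text, offsets) is what downstream analysis consumes; a declarative system expresses such rules at a higher level and lets an optimizer decide how to execute them.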
BigInsights Text Analytics Development – AQL
Text Analytics Tooling
          AQL Editor     Result Viewer




Runtime Explain
Statistical and Predictive Analysis
Framework for machine learning (ML) implementations on Big Data
•  Large, sparse data sets, e.g. 5B non-zero values
•  Runs on large BigInsights clusters with 1000s of nodes
Productivity
•  Build and enhance predictive models directly on Big Data
•  High-level language – Declarative Machine Learning Language (DML)
  •  E.g. 1500 lines of Java code boil down to 15 lines of DML code
•  Parallel SPSS data mining algorithms implementable in DML
Optimization
•  Compile algorithms into optimized parallel code
•  For different clusters and different data characteristics
•  E.g. 1 hr. execution (hand-coded) down to 10 mins

[Chart: Execution Time (sec), 0 to 4500, against # non zeros (million), 0 to 2000; series: Java Map-Reduce, SystemML, Single node R]
Workload Optimization
Optimized performance for big data analytic workloads

Adaptive MapReduce
§  Algorithm to optimize execution time of multiple small jobs
§  Performance gains of 30% reduce overhead of task startup

Hadoop System Scheduler
§  Identifies small and large jobs from prior experience
§  Sequences work to reduce overhead

[Diagram: Task → Map (break task into small parts) → Adaptive Map (optimization: order small units of work) → Reduce (many results to a single result set)]
InfoSphere BigInsights – Embrace and Extend Hadoop
[Diagram: BigInsights component stack; the legend distinguishes IBM from Open Source components]

Analytics
•  ML Analytics, Text Analytics, BigSheets

Application
•  Pig, Hive, Jaql
•  MapReduce
•  AdaptiveMR, FLEX, BigIndex
•  Oozie, Lucene
•  Shown alongside: Zookeeper, Avro, IBM LZO Compression

Storage
•  HBase
•  HDFS, GPFS-SNC

Data Sources / Connectors
•  Streams, Netezza, BoardReader, R
•  Data Stage, DB2, CSV / XML / JSON, SPSS
•  Flume, JDBC, Web Crawler

Interface
Web console
•  Monitor cluster health
•  Add / remove nodes
•  Start / stop services
•  Inspect job status
•  Inspect workflow status
•  Deploy apps
•  Launch apps / jobs
•  Work with distrib. file system
•  Work with spreadsheet interface
•  Support REST-based API
•  . . .
Eclipse plug-ins
•  Text analytics
•  MapReduce programming
•  Jaql development
•  Hive query development
Ways to get started with BigInsights
In the Cloud
•  Via RightScale, or directly on Amazon, Rackspace, IBM
   Smart Enterprise Cloud, or on private clouds.
•  Pay only for the resources used.

In the Virtual Classroom
•  Free Hadoop Fundamentals training course
   www.bigdatauniversity.com
  •  e.g. BD105EN - Text Analytics Essentials

On Your Cluster
•  Download Basic Edition from ibm.com.
In the Classroom
•  Enroll in the InfoSphere BigInsights Essentials course.
Visit the BigInsights technical portal . . . .
Free links to papers, demos, discussion forum, and more
http://www.ibm.com/developerworks/wiki/biginsights/
Streams – analytical platform for in-motion “Big Data”
Built to analyze data in motion
•  Multiple concurrent input streams
•  Massive scalability

Process and analyze a variety of data
•  Structured, unstructured content, video, audio
•  Advanced analytic operators

[Diagram: IBM Big Data Platform stack, repeated from the strategy slide]
Stream Computing – Analyze Data in Motion

        Traditional Computing                            Stream Computing




Historical fact finding                        Current fact finding

Find and analyze information stored on disk    Analyze data in motion – before it is stored

Batch paradigm, pull model                     Low latency paradigm, push model

Query-driven: submits queries to static data   Data driven – bring the data to the query

[Diagrams: Traditional Computing: Query → Data → Results; Stream Computing: Data → Query → Results]
Why InfoSphere Streams?
Applications that require on-the-fly processing, filtering and analysis of
streaming data
•  Sensors: environmental, industrial, surveillance video, GPS, …
•  “Data exhaust”: network/system/web server/app server log files
•  High-rate transaction data: financial transactions, call detail records


Criteria: two or more of the following
•  Messages are processed in isolation or in limited data windows
•  Sources include non-traditional data (spatial, imagery, text, …)
•  Sources vary in connection methods, data rates, and processing requirements,
   presenting integration challenges
•  Data rates/volumes require the resources of multiple processing nodes
•  Analysis and response are needed with sub-millisecond latency
•  Data rates and volumes are too great for store-and-mine approaches
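
The criterion "processed in isolation or in limited data windows" can be illustrated with a count-based sliding window. This is a minimal sketch in plain Java, not SPL and not the Streams API; the class SlidingWindowSketch and its method onTuple are hypothetical names. Each arriving value is handled immediately (push model) and only the last few values are kept in memory.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical count-based sliding window that maintains a running average.
class SlidingWindowSketch {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    SlidingWindowSketch(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
    }

    // Called once per arriving value; returns the average over the most
    // recent values (at most `size` of them). Older values are evicted,
    // so memory use stays bounded no matter how long the stream runs.
    double onTuple(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            sum -= window.removeFirst();
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        SlidingWindowSketch w = new SlidingWindowSketch(3);
        for (double v : new double[] {10, 20, 30, 40}) {
            System.out.println(w.onTuple(v));
        }
    }
}
```

Nothing is written to disk before the result is available, which is the contrast with the store-and-mine approach named in the last criterion.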
Massively Scalable Stream Analytics
Linear Scalability
§  Clustered deployments – unlimited scalability

Automated Deployment
§  Automatically optimize operator deployment across clusters

Performance Optimization
§  JVM Sharing – minimize memory use
§  Fuse operators on same cluster
§  Telco client – 25 Million messages per second

Analytics on Streaming Data
§  Analytic accelerators for a variety of data types
§  Optimized for real-time performance

[Diagram: Streaming Data Sources → Streams Runtime (Deployments: Source Adapters, Analytic Operators, Sink Adapters) → Visualization; Streams Studio IDE: Automated and Optimized Deployment]
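Streams approach illustrated

[Diagram: a stream is a sequence of tuples]
Input tuples
•  directory: "/img", filename: "farm"
•  directory: "/img", filename: "bird"
•  directory: "/opt", filename: "java"
•  directory: "/img", filename: "cat"
Output tuples (the data field is not shown in the diagram)
•  height: 640, width: 480, data:
•  height: 1280, width: 1024, data:
•  height: 640, width: 480, data: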
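
One plausible reading of the diagram is a filter followed by a transform: four file tuples enter, the one outside "/img" is dropped, and three image tuples leave. The sketch below expresses that reading in plain Java. It is an assumption about what the slide depicts, not SPL code; the names TuplePipelineSketch, FileTuple, ImageTuple and process are hypothetical, and the image sizes are hard-coded instead of decoded from files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical two-operator pipeline over tuples: filter, then transform.
class TuplePipelineSketch {
    static final class FileTuple {
        final String directory;
        final String filename;

        FileTuple(String directory, String filename) {
            this.directory = directory;
            this.filename = filename;
        }
    }

    static final class ImageTuple {
        final int height;
        final int width;

        ImageTuple(int height, int width) {
            this.height = height;
            this.width = width;
        }
    }

    // Passes each input tuple through the filter; survivors are transformed.
    static List<ImageTuple> process(List<FileTuple> input,
                                    Predicate<FileTuple> filter,
                                    Function<FileTuple, ImageTuple> transform) {
        List<ImageTuple> output = new ArrayList<>();
        for (FileTuple t : input) {
            if (filter.test(t)) {
                output.add(transform.apply(t));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<FileTuple> in = List.of(
            new FileTuple("/img", "farm"), new FileTuple("/img", "bird"),
            new FileTuple("/opt", "java"), new FileTuple("/img", "cat"));
        // Stand-in for image decoding: sizes are fixed per filename.
        List<ImageTuple> out = process(in,
            t -> t.directory.equals("/img"),
            t -> t.filename.equals("bird") ? new ImageTuple(1280, 1024)
                                           : new ImageTuple(640, 480));
        for (ImageTuple t : out) {
            System.out.println("height: " + t.height + ", width: " + t.width);
        }
    }
}
```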
InfoSphere Streams for superior real time analytic processing
Use the data that gives you a competitive advantage:
•  Can handle virtually any data type
•  Use data that is too expensive and time sensitive for traditional approaches

Streams Processing Language (SPL) built for Streaming applications:
•  Reusable operators
•  Rapid application development
•  Continuous "pipeline" processing

Compile groups of operators into single processes:
•  Efficient use of cores
•  Distributed execution
•  Very fast data exchange
•  Can be automatic or tuned
•  Scaled with push of a button

Easy to extend:
•  Built in adaptors
•  Users add capability with familiar C++ and Java

Easy to manage:
•  Automatic placement
•  Extend applications incrementally without downtime
•  Multi-user / multiple applications

Flexible and high performance transport:
•  Very low latency
•  High data rates

Dynamic analysis:
•  Programmatically change topology at runtime
•  Create new subscriptions
•  Create new port properties
Streams Studio Integrated Development Environment




Compiler Framework
Operator Fusion
•  Fine-grained operators
•  From small parts, make larger ones that fit
Code generation
•  Generates code to match the underlying runtime environment
  •  Number of cores
  •  Interconnect characteristics
  •  Architecture-specific instructions
•  Driven by automatic profiling
•  Compiler-based optimization
•  Driven by incremental learning of application characteristics

[Diagrams: Logical app view; Physical app view]
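
The effect of operator fusion can be sketched as function composition: several fine-grained operators become one callable, so a tuple moves between them by direct call instead of being handed across process boundaries. This is a minimal illustration in plain Java under that assumption; it does not reproduce the Streams compiler, and the class OperatorFusionSketch and method fuse are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: fuse a chain of small operators into one larger unit.
class OperatorFusionSketch {
    // Composes the operators in list order; an empty list yields the identity.
    static <T> Function<T, T> fuse(List<Function<T, T>> operators) {
        Function<T, T> fused = Function.identity();
        for (Function<T, T> op : operators) {
            fused = fused.andThen(op);
        }
        return fused;
    }

    public static void main(String[] args) {
        List<Function<String, String>> ops = new ArrayList<>();
        ops.add(String::trim);          // fine-grained operator 1
        ops.add(String::toUpperCase);   // fine-grained operator 2
        Function<String, String> fused = fuse(ops);
        System.out.println(fused.apply("  stream computing "));
    }
}
```

Which operators to fuse is the interesting decision; the slide attributes it to profiling and to characteristics of the target hardware, which this sketch does not model.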
Streams Data Mining Toolkit
Enables scoring of real-time data in a Streams application
•  Scoring is performed against a predefined model
•  Supports a variety of model types and scoring algorithms

Models represented in Predictive Model Markup Language (PMML)
  •  Standard for statistical and data mining models
  •  XML Representation

Toolkit provides four Streams operators to enable scoring
•  Classification
•  Clustering
•  Regression
•  Associations
The toolkit supports dynamic replacement of the PMML model used by an
operator.
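
Dynamic model replacement can be sketched as a scoring operator that holds its model behind an atomic reference, so a new model can be swapped in while tuples keep arriving. This is a minimal illustration in plain Java, not the toolkit's API: the class ScoringOperatorSketch is hypothetical, and a simple function stands in for a model that would in practice be parsed from PMML.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.DoubleUnaryOperator;

// Hypothetical scoring operator whose model can be replaced at runtime.
class ScoringOperatorSketch {
    // The currently active model; AtomicReference makes the swap visible
    // to the scoring thread without stopping the operator.
    private final AtomicReference<DoubleUnaryOperator> model;

    ScoringOperatorSketch(DoubleUnaryOperator initialModel) {
        this.model = new AtomicReference<>(initialModel);
    }

    // Scores one incoming value against whichever model is active right now.
    double score(double feature) {
        return model.get().applyAsDouble(feature);
    }

    // Replaces the model; subsequent calls to score use the new one.
    void replaceModel(DoubleUnaryOperator newModel) {
        model.set(newModel);
    }

    public static void main(String[] args) {
        // Stand-in for a regression model: y = 2x + 1
        ScoringOperatorSketch op = new ScoringOperatorSketch(x -> 2 * x + 1);
        System.out.println(op.score(3));
        op.replaceModel(x -> x * x);
        System.out.println(op.score(3));
    }
}
```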
Without a Big Data Platform
You Code…
•  Event Handling
•  Custom SQL and Scripts
•  Multithreading
•  Check Pointing
•  Application Management
•  HA
•  Connectors
•  Performance Optimization
•  Debug
•  Security

IBM Big Data Platform
•  Accelerators and Toolkits: over 100 sample applications and toolkits, including industry-focused toolkits with 300+ functions and operators
•  Streams provides development, deployment, runtime, and infrastructure services

“TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language…”
– Alex Philip, CEO and President, TerraEchos
Streams Redbook
redbooks.ibm.com/abstracts/sg247970.html


This book is intended for professionals that
require an understanding of how to process high
volumes of streaming data or need information
about how to implement systems to satisfy
those requirements.
Right-time actions are taken in the new BI/BA ecosystem
 • Three routes to analytics
 • Application and workload optimized appliances and systems
 • Fast data movement and integration

Route                   Data sources                                    Processing                                    Output
Traditional Warehouse   Traditional / Relational Data Sources           Database & Warehouse: At-Rest Data            Results
                                                                        Analytics
Streams                 Non-Traditional / Non-Relational Data Sources   In-Motion Analytics                           Ultra Low Latency Results
Internet Scale          Non-Traditional / Non-Relational and            InfoSphere BigInsights: Internet Scale Data   Results
                        Traditional / Relational Data Sources           Analytics, Data Operations & Model Building
Example of 360° customer view

[Diagram: 360° customer view]
•  Business Processes: Events and Alerts, Master Data Management, Campaign Management, Cognos Consumer Insight
•  Big Data Platform
  •  Internet Scale Analytics (Website Logs, Social Media): Web Traffic and Social Media Insight
  •  Streaming Analytics (Call Detail Records): Call Behavior and Experience Insight
•  Information Integration and Data Warehouse

2012.04.26 big insights streams im forum2

  • 1. IBM's Big Data Platform: InfoSphere BigInsights and InfoSphere Streams
  • 2. IBM's Big Data Platform: InfoSphere BigInsights and InfoSphere Streams. Wilfried Hoge – Leading Technical Sales Professional, hoge@de.ibm.com, twitter.com/wilfriedhoge
  • 3. IBM Big Data Strategy: Move the Analytics Closer to the Data. New analytic applications drive the requirements for a big data platform: •  Integrate and manage the full variety, velocity and volume of data •  Apply advanced analytics to information in its native form •  Visualize all available data for ad-hoc analysis •  Development environment for building new analytic applications •  Workload optimization and scheduling •  Security and Governance. [Diagram] Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics. IBM Big Data Platform: Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Information Integration & Governance.
  • 4. Volume and Velocity – two dimensions for Big Data. [Chart] Vertical axis: Data Scale (Kilo, Mega, Giga, Tera, Peta, Exa). Horizontal axis: Decision Frequency, from Occasional through Frequent to Real-time (yr, mo, wk, day, hr, min, sec, …, ms, µs). Regions: Traditional Data Warehouse and Business Intelligence; Data at Rest (up to 10,000 times larger); Data in Motion (up to 10,000 times faster). Examples: Wind Turbine Placement & Operation (PBs of data; analysis time to 3 days from 3 weeks; 1220 IBM iDataPlex nodes). DeepQA (100s GB for Deep Analytics; 3 sec/decision; Power7, 15TB memory). Telco Promotions (100,000 records/sec, 6B/day; 10 ms/decision; 270TB for Deep Analytics). Security (600,000 records/sec, 50B/day; 1-2 ms/decision; 320TB for Deep Analytics). 26.04.2012 © Copyright IBM Corporation 2012
  • 5. BigInsights – analytical platform for persistent “Big Data”. Based on open source & IBM technologies. Distinguishing characteristics: •  Built-in analytics . . . enhances business knowledge •  Enterprise software integration . . . complements and extends existing capabilities •  Production-ready platform with tooling for analysts, developers, and administrators . . . speeds time-to-value and simplifies development/maintenance. IBM advantage: •  Combination of software, hardware, services and advanced research. [Diagram] Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics. IBM Big Data Platform: Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Information Integration & Governance.
  • 6. About the BigInsights Platform Flexible, enterprise-class support for processing large volumes of data •  Based on Google’s MapReduce technology •  Inspired by Apache Hadoop; compatible with its ecosystem and distribution •  Well-suited to batch-oriented, read-intensive applications •  Supports wide variety of data Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner •  CPU + disks = “node” •  Nodes can be combined into clusters •  New nodes can be added as needed without changing •  Data formats •  How data is loaded •  How jobs are written
  • 7. Hadoop Explained – Map Reduce. Hadoop computation model: •  Data stored in a distributed file system spanning many inexpensive computers •  Bring function to the data •  Distribute application to the compute resources where the data is stored. Scalable to thousands of nodes and petabytes of data. [Diagram] MapReduce Application on the Hadoop Data Nodes: distribute map tasks to cluster; 1. Map Phase (break job into small parts); 2. Shuffle (transfer interim output for final processing); 3. Reduce Phase (boil all output down to a single result set); return a single result set. Code excerpt from the slide (cut off at the slide edge in places):
      public static class TokenizerMapper
          extends Mapper<Object,Text,Text,IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text val, Context
          StringTokenizer itr = new StringTokenizer(val.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }
      public static class IntSumReducer
          extends Reducer<Text,IntWritable,Text,IntWrita
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> val, Context context){
          int sum = 0;
          for (IntWritable v : val) {
            sum += v.get();
          . . .
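The three phases named on this slide (map, shuffle, reduce) can be traced in a single process with a small word-count sketch. This is only an in-memory illustration of the data flow, assuming hypothetical function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`). It does not use Hadoop, and it leaves out the part that makes Hadoop scale: distributing the map tasks to the nodes where the data is stored.

```python
from collections import defaultdict


def map_phase(split):
    """Map: break one input split into (word, 1) pairs."""
    for word in split.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group the interim (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: boil the values for one key down to a single result."""
    return key, sum(values)


def word_count(splits):
    """Run map, shuffle and reduce in sequence over all input splits."""
    pairs = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Each string stands in for one block of a file on a data node.
    print(word_count(["big data", "big insights"]))
```

In a real cluster each call to `map_phase` would run as a separate task next to its data block, and the shuffle would move the interim pairs across the network to the reducers.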
  • 8. BigInsights – Value Beyond Open Source Technical differentiators •  Built-in analytics •  Text processing engine, annotators, Eclipse tooling •  Statistical and predictive analysis •  Interface to project R (statistical platform) •  Enterprise software integration (DBMS, warehouse) •  Spreadsheet-style analytical tool for analysts •  Ready-made business process accelerators •  Integrated installation of supported open source and IBM components •  Web Console for administration and application access •  Platform enrichment: additional security, performance features, . . . •  Standard IBM licensing agreement and world-class support Business benefits •  Quicker time-to-value due to IBM technology and support •  Reduced operational risk •  Enhanced business knowledge with flexible analytical platform •  Leverages and complements existing software assets
  • 9. Web Installation Tool Seamless process for single node and cluster environments Integrated installation of all selected components Post-install validation of IBM and open source components No need to iteratively download, configure, and test multiple open source projects and their pre-requisite software.
  • 10. Web Console Manage BigInsights •  Inspect system health •  Add / drop nodes •  Start / stop services •  Run / monitor jobs (applications) •  Explore / modify file system Launch applications •  Spreadsheet-like analysis tool •  Pre-built applications (IBM supplied or user developed) Publish applications Leverage community resources
  • 11. BigSheets BigSheets is a visual tool for data manipulation and prototyping •  Allows more users to do more work, more quickly •  Simply stated, growing an army of MapReduce developers is not cost effective •  In your BI environments you have a ratio of 30+ report users for every complex SQL developer. We need to support the same ratios with BigInsights Sample Uses •  Data exploration and visualization •  Visual job creation
  • 12. BigSheets – Spreadsheet-style Data Analysis and Discovery
  • 14. Quick start applications or “apps” Reusable software assets based on customer engagements •  Useful as a starting point for various applications •  Can be customized by BigInsights application developers as needed •  Accessible through Web console Available assets •  Data export (to relational DBMS, files, HBase) •  Data import (from relational DBMS, files) •  Web crawler, Twitter crawler •  Boardreader.com support (Web forum search engine) •  Ad hoc queries for Jaql, Hive, Pig •  TeraGen-TeraSort, WordCount sample applications
  • 15. Running Applications from the Web Console
  • 16. Develop Hive with the SQL Editor and view results
  • 17. Build a Big Data Program – Map Reduce example Eclipse based development tools For JAQL, Hive, Java MapReduce, Text Analytics
  • 18. Text Analytics in BigInsights Text analytics – Distill structured information from unstructured data •  Rich annotator library supports multiple languages •  Declarative Information Extraction (IE) system based on an algebraic framework •  Richer, cleaner rule semantics •  Better performance through optimization •  Compose operators to build complex annotators Developed at IBM Research since 2004 Embedded in several IBM products •  Lotus Notes •  Cognos Consumer Insights •  InfoSphere Streams
  • 19. Turns disparate words into measurable insights Pre-configured text annotators ready for distributed processing on Big Data •  City, County, Zipcode, Address, Maplocation, StateOrProvince, Country, Continent, EmailAddress, Person, Organization, DateTime, URL, Company Names, Merger, Acquisition, Alliance, etc. Support for native languages including double-byte Processing steps shown on the slide: •  Physically assemble social data, standardize formats, address auto-identify language, process punctuation and non-grammatical characters, standardize spelling. •  Part-of-speech identification, standard and customized extraction dictionaries, proper noun identification, concept categorization, synonyms, exclusions, multi-terms, regular expressions, fuzzy-matching •  Identify positive or negative sentiment, NLP-based analytics, define variables, macros and rules. •  Iterative classification using automated and manual techniques. Concept derivation & inclusion, semantic networks and co-occurrence rules •  Reporting/Monitoring commentary, combination w/ structured data, clustering, associated concepts, correlated concepts, auto-classification of documents, sites, posts.
  • 20. Text Analytics – highly accurate analysis of textual content How it works •  Parses text and detects meaning with annotators •  Understands the context in which the text is analyzed •  Hundreds of pre-built annotators for names, addresses, phone numbers, among others Accuracy •  Highly accurate in deriving meaning from complex text Performance •  AQL language optimized for MapReduce Example on the slide: unstructured text (document, email, etc) is turned into Classification and Insight. Sample text: “Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.”
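To make the idea of an annotator concrete, here is a minimal sketch in plain Java that pulls typed spans out of unstructured text with regular expressions. It is not AQL and not the BigInsights text analytics engine; the class `AnnotatorSketch`, the `Annotation` type, and the two patterns are assumptions chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based annotator: turns raw text into typed, positioned spans.
final class AnnotatorSketch {

    // One extracted span: annotation type, matched text, and start offset.
    static final class Annotation {
        final String type;
        final String text;
        final int begin;

        Annotation(String type, String text, int begin) {
            this.type = type;
            this.text = text;
            this.begin = begin;
        }

        @Override
        public String toString() {
            return type + "[" + text + "@" + begin + "]";
        }
    }

    // Apply one pattern to the document and label every match with the given type.
    static List<Annotation> annotate(String type, Pattern pattern, String doc) {
        List<Annotation> out = new ArrayList<>();
        Matcher m = pattern.matcher(doc);
        while (m.find()) {
            out.add(new Annotation(type, m.group(), m.start()));
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "Winger Andres Iniesta scored for Spain; the Final ended 1-0.";
        Pattern team = Pattern.compile("\\b(Spain|Netherlands)\\b"); // tiny dictionary
        Pattern score = Pattern.compile("\\b\\d+-\\d+\\b");          // e.g. 1-0
        List<Annotation> found = new ArrayList<>(annotate("Team", team, doc));
        found.addAll(annotate("Score", score, doc));
        System.out.println(found);
    }
}
```

A declarative system as described on the slides would let such basic extractors be combined into higher-level annotators and optimized as a whole; this sketch only shows the lowest layer.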
  • 21. BigInsights Text Analytics Development – AQL
  • 22. Text Analytics Tooling AQL Editor Result Viewer Runtime Explain
  • 23. Statistical and Predictive Analysis Framework for machine learning (ML) implementations on Big Data •  Large, sparse data sets, e.g. 5B non-zero values •  Runs on large BigInsights clusters with 1000s of nodes Productivity •  Build and enhance predictive models directly on Big Data •  High-level language – Declarative Machine Learning Language (DML) •  E.g. 1500 lines of Java code boil down to 15 lines of DML code •  Parallel SPSS data mining algorithms implementable in DML Optimization •  Compile algorithms into optimized parallel code •  For different clusters and different data characteristics •  E.g. 1 hr. execution (hand-coded) down to 10 mins [Chart: Execution Time (sec) versus # non zeros (million) for Java Map-Reduce, SystemML, and Single node R]
  • 24. Workload Optimization Optimized performance for big data analytic workloads Adaptive MapReduce §  Algorithm to optimize execution time of multiple small jobs §  Performance gains of 30% by reducing the overhead of task startup Hadoop System Scheduler §  Identifies small and large jobs from prior experience §  Sequences work to reduce overhead [Diagram: Task → Map (break task into small parts) → Adaptive Map (optimization: order small units of work) → Reduce (many results to a single result set)]
  • 25. InfoSphere BigInsights – Embrace and Extend Hadoop Analytics: ML Analytics, Text Analytics, BigSheets Application: Pig, Hive, Jaql, Avro, Zookeeper, IBM LZO Compression, MapReduce, AdaptiveMR, FLEX, BigIndex, Oozie, Lucene Storage: HBase, HDFS, GPFS-SNC Data Sources / Connectors: Streams, Netezza, BoardReader, R, DataStage, DB2, CSV / XML / JSON, SPSS, Flume, JDBC, Web Crawler Interface: Web console •  Monitor cluster health •  Add / remove nodes •  Start / stop services •  Inspect job status •  Inspect workflow status •  Deploy apps •  Launch apps / jobs •  Work with distrib. file system •  Work with spreadsheet interface •  Support REST-based API •  . . . Eclipse plug-ins •  Text analytics •  MapReduce programming •  Jaql development •  Hive query development [Legend: IBM / Open Source]
  • 26. Ways to get started with BigInsights In the Cloud •  Via RightScale, or directly on Amazon, Rackspace, IBM Smart Enterprise Cloud, or on private clouds. •  Pay only for the resources used. In the Virtual Classroom •  Free Hadoop Fundamentals training course www.bigdatauniversity.com •  e.g. BD105EN - Text Analytics Essentials On Your Cluster •  Download Basic Edition from ibm.com. In the Classroom •  Enroll in the InfoSphere BigInsights Essentials course.
  • 27. Visit the BigInsights technical portal . . . . Free links to papers, demos, discussion forum, and more http://www.ibm.com/developerworks/wiki/biginsights/
  • 28. Streams – analytical platform for in-motion “Big Data” Built to analyze data in motion •  Multiple concurrent input streams •  Massive scalability Process and analyze a variety of data •  Structured, unstructured content, video, audio •  Advanced analytic operators [Diagram: IBM Big Data Platform stack. Analytic Applications (BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics); Visualization & Discovery, Application Development, Systems Management; Accelerators; Hadoop System, Stream Computing, Data Warehouse; Information Integration & Governance]
  • 29. Stream Computing – Analyze Data in Motion Traditional Computing •  Historical fact finding •  Find and analyze information stored on disk •  Batch paradigm, pull model •  Query-driven: submits queries to static data •  Query → Data → Results Stream Computing •  Current fact finding •  Analyze data in motion – before it is stored •  Low latency paradigm, push model •  Data driven – bring the data to the query •  Data → Query → Results
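The contrast between the pull model and the push model can be sketched in a few lines of plain Java. This is an illustration under simplifying assumptions (integer readings, a single threshold query) and does not use InfoSphere Streams; `PushVsPullSketch`, `queryStored`, and `StandingQuery` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical contrast of query-driven (pull) and data-driven (push) processing.
final class PushVsPullSketch {

    // Pull model: the data is stored first, then a query is run over the stored set.
    static List<Integer> queryStored(List<Integer> stored, Predicate<Integer> query) {
        return stored.stream().filter(query).collect(Collectors.toList());
    }

    // Push model: a standing query is registered once and every arriving item
    // is tested immediately, before anything is stored.
    static final class StandingQuery {
        private final Predicate<Integer> query;
        private final Consumer<Integer> onMatch;

        StandingQuery(Predicate<Integer> query, Consumer<Integer> onMatch) {
            this.query = query;
            this.onMatch = onMatch;
        }

        void onArrival(int item) {
            if (query.test(item)) {
                onMatch.accept(item);
            }
        }
    }

    public static void main(String[] args) {
        Predicate<Integer> overLimit = v -> v > 100;

        // Pull: query the data at rest.
        System.out.println(queryStored(Arrays.asList(90, 120, 101), overLimit)); // [120, 101]

        // Push: the same query reacts to each reading as it arrives.
        List<Integer> alerts = new ArrayList<>();
        StandingQuery q = new StandingQuery(overLimit, alerts::add);
        for (int reading : new int[] {90, 120, 101}) {
            q.onArrival(reading);
        }
        System.out.println(alerts); // [120, 101]
    }
}
```

Both paths give the same answer for this input; the difference is when the work happens. In the push variant a match is available as soon as the reading arrives.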
  • 30. Why InfoSphere Streams? Applications that require on-the-fly processing, filtering and analysis of streaming data •  Sensors: environmental, industrial, surveillance video, GPS, … •  “Data exhaust”: network/system/web server/app server log files •  High-rate transaction data: financial transactions, call detail records Criteria: two or more of the following •  Messages are processed in isolation or in limited data windows •  Sources include non-traditional data (spatial, imagery, text, …) •  Sources vary in connection methods, data rates, and processing requirements, presenting integration challenges •  Data rates/volumes require the resources of multiple processing nodes •  Analysis and response are needed with sub-millisecond latency •  Data rates and volumes are too great for store-and-mine approaches
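The first criterion, processing messages in limited data windows, can be illustrated with a count-based sliding window. The sketch below is plain Java, not SPL; the class name `SlidingWindowSketch` and the choice of a running mean as the aggregate are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical count-based sliding window that keeps a running mean
// over the most recent `capacity` values only.
final class SlidingWindowSketch {
    private final int capacity;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    SlidingWindowSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Add one value, evict the oldest value once the window is full,
    // and return the mean of the values currently in the window.
    double add(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > capacity) {
            sum -= window.removeFirst();
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        SlidingWindowSketch w = new SlidingWindowSketch(3);
        for (double v : new double[] {10, 20, 30, 40}) {
            System.out.println(w.add(v)); // 10.0, 15.0, 20.0, 30.0
        }
    }
}
```

Memory use stays bounded by the window size regardless of how many messages arrive, which is what makes this style workable when data volumes are too great for store-and-mine approaches.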
  • 31. Massively Scalable Stream Analytics Linear Scalability §  Clustered deployments – unlimited scalability Automated Deployment §  Automatically optimize operator deployment across clusters Performance Optimization §  JVM Sharing – minimize memory use §  Fuse operators on same cluster §  Telco client – 25 Million messages per second Analytics on Streaming Data §  Analytic accelerators for a variety of data types §  Optimized for real-time performance [Diagram labels: Streams Studio IDE; Automated and Optimized Deployment; Streaming Data Sources; Source Adapters; Analytic Operators; Sink Adapters; Deployments; Streams Runtime; Visualization]
  • 32. Streams approach illustrated Each tuple carries the attributes directory, filename, height, width and data. Values shown on the slide: directory: "/img", "/img", "/opt", "/img"; filename: "farm", "bird", "java", "cat"; height: 640, 1280, 640; width: 480, 1024, 480; data: (contents not shown)
  • 33. InfoSphere Streams for superior real time analytic processing Streams Processing Language (SPL) built for Streaming applications: •  Reusable operators •  Rapid application development •  Continuous “pipeline” processing Compile groups of operators into single processes: •  Efficient use of cores •  Distributed execution •  Very fast data exchange •  Can be automatic or tuned •  Scaled with push of a button Use the data that gives you a competitive advantage: •  Can handle virtually any data type •  Use data that is too expensive and time sensitive for traditional approaches Easy to extend: •  Built in adaptors •  Users add capability with familiar C++ and Java Dynamic analysis: •  Programmatically change topology at runtime •  Create new subscriptions •  Create new port properties Easy to manage: •  Automatic placement •  Extend applications incrementally without downtime •  Multi-user / multiple applications Flexible and high performance transport: •  Very low latency •  High data rates
  • 34. Streams Studio Integrated Development Environment
  • 35. Compiler Framework Operator Fusion •  Fine-grained operators •  From small parts, make larger ones that fit Code generation •  Generates code to match the underlying runtime environment •  Number of cores •  Interconnect characteristics •  Architecture-specific instructions Compiler-based optimization •  Driven by automatic profiling •  Driven by incremental learning of application characteristics [Diagram: Logical app view → Physical app view]
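Operator fusion, making larger units out of small operators, can be approximated in plain Java by composing functions. This is only an analogy for what the Streams compiler does at the process level; `OperatorFusionSketch` and `fuse` are hypothetical names, and the two arithmetic operators are placeholders.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of fusing fine-grained operators into one unit.
final class OperatorFusionSketch {

    // Combine a chain of operators into a single function, so that a tuple
    // passes through all of them in one call instead of one hand-off per operator.
    static <T> Function<T, T> fuse(List<Function<T, T>> operators) {
        Function<T, T> fused = Function.identity();
        for (Function<T, T> op : operators) {
            fused = fused.andThen(op);
        }
        return fused;
    }

    public static void main(String[] args) {
        Function<Integer, Integer> scale = x -> x * 2;   // fine-grained operator 1
        Function<Integer, Integer> offset = x -> x + 3;  // fine-grained operator 2
        Function<Integer, Integer> fused = fuse(Arrays.asList(scale, offset));
        System.out.println(fused.apply(5)); // 13
    }
}
```

The logical view stays a chain of small operators while the executed unit is one fused function, which mirrors the split between the logical and physical application views on the slide.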
  • 36. Streams Data Mining Toolkit Enables scoring of real-time data in a Streams application •  Scoring is performed against a predefined model •  Supports a variety of model types and scoring algorithms Models represented in Predictive Model Markup Language (PMML) •  Standard for statistical and data mining models •  XML Representation Toolkit provides four Streams operators to enable scoring •  Classification •  Clustering •  Regression •  Associations The toolkit supports dynamic replacement of the PMML model used by an operator.
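Scoring against a predefined model that can be replaced while the application runs can be sketched with an atomic reference. The example below is plain Java and does not parse PMML or use the Streams Data Mining Toolkit; `ScoringOperatorSketch`, `score`, and `replaceModel` are hypothetical names, and each "model" is simply a function from one input to one score.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.DoubleUnaryOperator;

// Hypothetical scoring operator whose model can be swapped at runtime.
final class ScoringOperatorSketch {
    // The current model; AtomicReference lets another thread replace it safely.
    private final AtomicReference<DoubleUnaryOperator> model;

    ScoringOperatorSketch(DoubleUnaryOperator initialModel) {
        this.model = new AtomicReference<>(initialModel);
    }

    // Score one incoming value against whichever model is current.
    double score(double input) {
        return model.get().applyAsDouble(input);
    }

    // Replace the model without stopping the operator.
    void replaceModel(DoubleUnaryOperator newModel) {
        model.set(newModel);
    }

    public static void main(String[] args) {
        ScoringOperatorSketch op = new ScoringOperatorSketch(x -> 2.0 * x + 1.0);
        System.out.println(op.score(3.0)); // 7.0
        op.replaceModel(x -> 0.5 * x);     // e.g. a retrained model
        System.out.println(op.score(3.0)); // 1.5
    }
}
```

Tuples scored before the swap use the old model and tuples scored afterwards use the new one; no restart is involved.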
  • 37. Without a Big Data Platform You Code… •  Event Handling •  Custom SQL and Scripts •  Multithreading •  Check Pointing •  Application Management •  HA •  Performance Optimization •  Debug •  Connectors •  Security IBM Big Data Platform •  Accelerators: Over 100 sample applications and toolkits with industry focused toolkits with 300+ functions and operators •  Toolkits •  Streams provides development, deployment and runtime, and infrastructure services “TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language…” – Alex Philip, CEO and President, TerraEchos
  • 38. Streams Redbook redbooks.ibm.com/abstracts/sg247970.html This book is intended for professionals that require an understanding of how to process high volumes of streaming data or need information about how to implement systems to satisfy those requirements.
  • 39. Right-time actions are taken in the new BI/BA ecosystem •  Three routes to analytics •  Application and workload optimized appliances and systems •  Fast data movement and integration Routes shown in the diagram: •  Traditional / Relational Data Sources → Traditional Warehouse (Database & Warehouse): At-Rest Data Analytics → Results •  Non-Traditional / Non-Relational Data Sources → Streams: In-Motion Analytics → Ultra Low Latency Results •  Non-Traditional / Non-Relational Data Sources and Traditional / Relational Data Sources → InfoSphere BigInsights: Internet Scale Data Analytics, Data Operations & Model Building → Internet Scale Results 26.04.2012 © Copyright IBM Corporation 2012
  • 40. Example of 360° customer view [Diagram labels: Business Processes; Events and Alerts; Master Data Management; Campaign Management; Cognos Consumer Insight; Big Data Platform; Website Logs and Social Media → Internet Scale Analytics → Web Traffic and Social Media Insight; Call Detail Records → Streaming Analytics → Call Behavior and Experience Insight; Information Integration; Data Warehouse]
  • 41. IBM Big Data Platform: InfoSphere BigInsights and InfoSphere Streams