2012.04.26 big insights streams im forum2



  1. 1. IBM's Big Data Platform: InfoSphere BigInsights and InfoSphere Streams
  2. 2. IBM's Big Data Platform: InfoSphere BigInsights and InfoSphere Streams
     Wilfried Hoge, Leading Technical Sales Professional
     hoge@de.ibm.com, twitter.com/wilfriedhoge
  3. 3. IBM Big Data Strategy: Move the Analytics Closer to the Data
     New analytic applications drive the requirements for a big data platform:
     •  Integrate and manage the full variety, velocity and volume of data
     •  Apply advanced analytics to information in its native form
     •  Visualize all available data for ad-hoc analysis
     •  Development environment for building new analytic applications
     •  Workload optimization and scheduling
     •  Security and Governance
     [Diagram: Analytic Applications (BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics) on top of the IBM Big Data Platform (Visualization & Discovery, Application Development, Systems Management; Accelerators; Hadoop System, Stream Computing, Data Warehouse) on top of Information Integration & Governance]
  4. 4. Volume and Velocity – two dimensions for Big Data
     [Chart: Data Scale (Kilo, Mega, Giga, Tera, Peta, Exa) against Decision Frequency (Occasional, Frequent, Real-time; yr, mo, wk, day, hr, min, sec, ms, µs)]
     •  Traditional Data Warehouse and Business Intelligence: the baseline region of the chart
     •  Data at Rest: up to 10,000 times larger
     •  Data in Motion: up to 10,000 times faster
     •  Wind Turbine Placement & Operation: PBs of data; analysis time to 3 days from 3 weeks; 1220 IBM iDataPlex nodes
     •  DeepQA: 100s GB for Deep Analytics; 3 sec/decision; Power7, 15TB memory
     •  Telco Promotions: 100,000 records/sec, 6B/day; 10 ms/decision; 270TB for Deep Analytics
     •  Security: 600,000 records/sec, 50B/day; 1-2 ms/decision; 320TB for Deep Analytics
     26.04.2012 © Copyright IBM Corporation 2012
  5. 5. BigInsights – analytical platform for persistent “Big Data”
     Based on open source & IBM technologies
     Distinguishing characteristics
     •  Built-in analytics . . . enhances business knowledge
     •  Enterprise software integration . . . complements and extends existing capabilities
     •  Production-ready platform with tooling for analysts, developers, and administrators . . . speeds time-to-value and simplifies development/maintenance
     IBM advantage
     •  Combination of software, hardware, services and advanced research
     [Diagram: IBM Big Data Platform stack]
  6. 6. About the BigInsights Platform
     Flexible, enterprise-class support for processing large volumes of data
     •  Based on Google’s MapReduce technology
     •  Inspired by Apache Hadoop; compatible with its ecosystem and distribution
     •  Well-suited to batch-oriented, read-intensive applications
     •  Supports a wide variety of data
     Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
     •  CPU + disks = “node”
     •  Nodes can be combined into clusters
     •  New nodes can be added as needed without changing
        •  Data formats
        •  How data is loaded
        •  How jobs are written
  7. 7. Hadoop Explained – Map Reduce
     Hadoop computation model
     •  Data stored in a distributed file system spanning many inexpensive computers
     •  Bring function to the data
     •  Distribute application to the compute resources where the data is stored
     Scalable to thousands of nodes and petabytes of data
     [Diagram: a MapReduce Application is distributed as map tasks to the Hadoop Data Nodes of the cluster]
     1.  Map Phase (break job into small parts)
     2.  Shuffle (transfer interim output for final processing)
     3.  Reduce Phase (boil all output down to a single result set)
     Return a single result set
     MapReduce Application (code is cut off on the slide):

         public static class TokenizerMapper
             extends Mapper<Object, Text, Text, IntWritable> {
           private final static IntWritable one = new IntWritable(1);
           private Text word = new Text();

           // Emit (word, 1) for every token of the input value.
           public void map(Object key, Text val, Context context)
               throws IOException, InterruptedException {
             StringTokenizer itr = new StringTokenizer(val.toString());
             while (itr.hasMoreTokens()) {
               word.set(itr.nextToken());
               context.write(word, one);
             }
           }
         }

         public static class IntSumReducer
             extends Reducer<Text, IntWritable, Text, IntWritable> {
           private IntWritable result = new IntWritable();

           // Sum all counts that were shuffled to this key.
           public void reduce(Text key, Iterable<IntWritable> val, Context context) {
             int sum = 0;
             for (IntWritable v : val) {
               sum += v.get();
             . . .
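     The three phases named on this slide can be tried out without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: the class `WordCountSketch` and its methods are hypothetical names, and the shuffle is simulated with an in-memory map, whereas Hadoop performs it across nodes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of map, shuffle and reduce.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every token of one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle: group all emitted values by key (done by the framework in Hadoop).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run all three phases over the input lines and return one result set.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("big data in motion");
        lines.add("big data at rest");
        System.out.println(wordCount(lines));
    }
}
```

     In a real job the map calls run in parallel on the nodes that hold the data, which is the "bring function to the data" point made above.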
  8. 8. BigInsights – Value Beyond Open Source
     Technical differentiators
     •  Built-in analytics
        •  Text processing engine, annotators, Eclipse tooling
        •  Statistical and predictive analysis
        •  Interface to the R project (statistical platform)
     •  Enterprise software integration (DBMS, warehouse)
     •  Spreadsheet-style analytical tool for analysts
     •  Ready-made business process accelerators
     •  Integrated installation of supported open source and IBM components
     •  Web Console for administration and application access
     •  Platform enrichment: additional security, performance features, . . .
     •  Standard IBM licensing agreement and world-class support
     Business benefits
     •  Quicker time-to-value due to IBM technology and support
     •  Reduced operational risk
     •  Enhanced business knowledge with flexible analytical platform
     •  Leverages and complements existing software assets
  9. 9. Web Installation Tool
     •  Seamless process for single-node and cluster environments
     •  Integrated installation of all selected components
     •  Post-install validation of IBM and open source components
     No need to iteratively download, configure, and test multiple open source projects and their prerequisite software.
  10. 10. Web Console
     Manage BigInsights
     •  Inspect system health
     •  Add / drop nodes
     •  Start / stop services
     •  Run / monitor jobs (applications)
     •  Explore / modify file system
     Launch applications
     •  Spreadsheet-like analysis tool
     •  Pre-built applications (IBM supplied or user developed)
     Publish applications
     Leverage community resources
  11. 11. BigSheets
     BigSheets is a visual tool for data manipulation and prototyping
     •  Allows more users to do more work, more quickly
     •  Simply stated, growing an army of MapReduce developers is not cost effective
     •  In your BI environments you have a ratio of 30+ report users for every complex SQL developer. We need to support the same ratios with BigInsights
     Sample Uses
     •  Data exploration and visualization
     •  Visual job creation
  12. 12. BigSheets – Spreadsheet-style Data Analysis and Discovery
  13. 13. BigSheets – Visualization
  14. 14. Quick start applications or “apps”
     Reusable software assets based on customer engagements
     •  Useful as a starting point for various applications
     •  Can be customized by BigInsights application developers as needed
     •  Accessible through the Web console
     Available assets
     •  Data export (to relational DBMS, files, HBase)
     •  Data import (from relational DBMS, files)
     •  Web crawler, Twitter crawler
     •  Boardreader.com support (Web forum search engine)
     •  Ad hoc queries for Jaql, Hive, Pig
     •  TeraGen-TeraSort, WordCount sample applications
  15. 15. Running Applications from the Web Console
  16. 16. Develop Hive with the SQL Editor and view results
  17. 17. Build a Big Data Program – MapReduce example
     Eclipse-based development tools for Jaql, Hive, Java MapReduce, and Text Analytics
  18. 18. Text Analytics in BigInsights
     Text analytics – Distill structured information from unstructured data
     •  Rich annotator library supports multiple languages
     •  Declarative Information Extraction (IE) system based on an algebraic framework
     •  Richer, cleaner rule semantics
     •  Better performance through optimization
     •  Compose operators to build complex annotators
     Developed at IBM Research since 2004
     Embedded in several IBM products
     •  Lotus Notes
     •  Cognos Consumer Insight
     •  InfoSphere Streams
  19. 19. Turns disparate words into measurable insights
     Pre-configured text annotators ready for distributed processing on Big Data
     •  City, County, Zipcode, Address, Maplocation, StateOrProvince, Country, Continent, EmailAddress, Person, Organization, DateTime, URL, Company Names, Merger, Acquisition, Alliance, etc.
     Support for native languages including double-byte
     Processing steps shown on the slide:
     •  Physically assemble social data, standardize formats, auto-identify language, process punctuation and non-grammatical characters, standardize spelling.
     •  Part-of-speech identification, standard and customized extraction dictionaries, proper noun identification, concept categorization, synonyms, exclusions, multi-terms, regular expressions, fuzzy-matching.
     •  Identify positive or negative sentiment, NLP-based analytics, define variables, macros and rules.
     •  Iterative classification using automated and manual techniques. Concept derivation & inclusion, semantic networks and co-occurrence rules.
     •  Reporting/Monitoring commentary, combination w/ structured data, clustering, associated concepts, correlated concepts, auto-classification of documents, sites, posts.
  20. 20. Text Analytics – highly accurate analysis of textual content
     How it works
     •  Parses text and detects meaning with annotators
     •  Understands the context in which the text is analyzed
     •  Hundreds of pre-built annotators for names, addresses, phone numbers, among others
     Accuracy
     •  Highly accurate in deriving meaning from complex text
     Performance
     •  AQL language optimized for MapReduce
     [Diagram: Unstructured text (document, email, etc.) is turned into Classification and Insight]
     Sample text: “Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.”
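     To make the idea of an annotator concrete, here is a minimal sketch in plain Java. It is not AQL and does not use the BigInsights text analytics engine; the class `AnnotatorSketch`, the annotation type `Score`, and both regular expressions are assumptions made for illustration. Each annotator is simply a named pattern that marks spans of the input text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: rule-based annotators expressed as named regular expressions.
class AnnotatorSketch {

    // Hypothetical annotator table: annotation type -> pattern (insertion order kept).
    static final Map<String, Pattern> ANNOTATORS = new LinkedHashMap<>();
    static {
        ANNOTATORS.put("EmailAddress", Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+"));
        ANNOTATORS.put("Score", Pattern.compile("\\b\\d+-\\d+\\b"));
    }

    // Apply every annotator to the text and return "type:matched text" entries.
    static List<String> annotate(String text) {
        List<String> annotations = new ArrayList<>();
        for (Map.Entry<String, Pattern> annotator : ANNOTATORS.entrySet()) {
            Matcher m = annotator.getValue().matcher(text);
            while (m.find()) {
                annotations.add(annotator.getKey() + ":" + m.group());
            }
        }
        return annotations;
    }

    public static void main(String[] args) {
        String text = "Spain won 1-0 in the Final. Contact: hoge@de.ibm.com.";
        for (String annotation : annotate(text)) {
            System.out.println(annotation);
        }
    }
}
```

     A production annotator library adds dictionaries, part-of-speech information and context rules on top of such patterns, and an optimizer decides how to execute them.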
  21. 21. BigInsights Text Analytics Development – AQL
  22. 22. Text Analytics Tooling
     [Screenshots: AQL Editor, Result Viewer, Runtime Explain]
  23. 23. Statistical and Predictive Analysis
     Framework for machine learning (ML) implementations on Big Data
     •  Large, sparse data sets, e.g. 5B non-zero values
     •  Runs on large BigInsights clusters with 1000s of nodes
     Productivity
     •  Build and enhance predictive models directly on Big Data
     •  High-level language – Declarative Machine Learning Language (DML)
        •  E.g. 1500 lines of Java code boil down to 15 lines of DML code
     •  Parallel SPSS data mining algorithms implementable in DML
     Optimization
     •  Compile algorithms into optimized parallel code
     •  For different clusters and different data characteristics
     •  E.g. 1 hr. execution (hand-coded) down to 10 mins
     [Chart: Execution Time (sec) against # non zeros (million); series: Java Map-Reduce, SystemML, Single node R]
  24. 24. Workload Optimization
     Optimized performance for big data analytic workloads
     Adaptive MapReduce
     •  Algorithm to optimize execution time of multiple small jobs
     •  Performance gains of 30%; reduces overhead of task startup
     Hadoop System Scheduler
     •  Identifies small and large jobs from prior experience
     •  Sequences work to reduce overhead
     [Diagram: Task, then Map (break task into small parts), then Adaptive Map (optimization: order small units of work), then Reduce (many results to a single result set)]
  25. 25. InfoSphere BigInsights – Embrace and Extend Hadoop
     [Architecture diagram; components are marked as IBM or Open Source]
     Analytics: ML Analytics, Text Analytics, BigSheets
     Application: Pig, Hive, Jaql, Avro, Zookeeper, IBM LZO Compression, MapReduce, AdaptiveMR, FLEX, BigIndex, Oozie, Lucene
     Storage: HBase, HDFS, GPFS-SNC
     Data Sources / Connectors: Netezza, BoardReader, R, Streams, DataStage, DB2, CSV / XML / JSON, SPSS, Flume, JDBC, Web Crawler
     Interface: Web console
     •  Monitor cluster health
     •  Add / remove nodes
     •  Start / stop services
     •  Inspect job status
     •  Inspect workflow status
     •  Deploy apps
     •  Launch apps / jobs
     •  Work with distrib. file system
     •  Work with spreadsheet interface
     •  Support REST-based API
     •  . . .
     Interface: Eclipse plug-ins
     •  Text analytics
     •  MapReduce programming
     •  Jaql development
     •  Hive query development
  26. 26. Ways to get started with BigInsights
     In the Cloud
     •  Via RightScale, or directly on Amazon, Rackspace, IBM Smart Enterprise Cloud, or on private clouds.
     •  Pay only for the resources used.
     In the Virtual Classroom
     •  Free Hadoop Fundamentals training course: www.bigdatauniversity.com
        •  e.g. BD105EN - Text Analytics Essentials
     On Your Cluster
     •  Download Basic Edition from ibm.com.
     In the Classroom
     •  Enroll in the InfoSphere BigInsights Essentials course.
  27. 27. Visit the BigInsights technical portal . . .
     Free links to papers, demos, discussion forum, and more
     http://www.ibm.com/developerworks/wiki/biginsights/
  28. 28. Streams – analytical platform for in-motion “Big Data”
     Built to analyze data in motion
     •  Multiple concurrent input streams
     •  Massive scalability
     Process and analyze a variety of data
     •  Structured, unstructured content, video, audio
     •  Advanced analytic operators
     [Diagram: IBM Big Data Platform stack]
  29. 29. Stream Computing – Analyze Data in Motion
     Traditional Computing
     •  Historical fact finding
     •  Find and analyze information stored on disk
     •  Batch paradigm, pull model
     •  Query-driven: submits queries to static data
     •  [Diagram: Query applied to Data yields Results]
     Stream Computing
     •  Current fact finding
     •  Analyze data in motion – before it is stored
     •  Low latency paradigm, push model
     •  Data driven – bring the data to the query
     •  [Diagram: Data flowing through Query yields Results]
  30. 30. Why InfoSphere Streams?
     Applications that require on-the-fly processing, filtering and analysis of streaming data
     •  Sensors: environmental, industrial, surveillance video, GPS, …
     •  “Data exhaust”: network/system/web server/app server log files
     •  High-rate transaction data: financial transactions, call detail records
     Criteria: two or more of the following
     •  Messages are processed in isolation or in limited data windows
     •  Sources include non-traditional data (spatial, imagery, text, …)
     •  Sources vary in connection methods, data rates, and processing requirements, presenting integration challenges
     •  Data rates/volumes require the resources of multiple processing nodes
     •  Analysis and response are needed with sub-millisecond latency
     •  Data rates and volumes are too great for store-and-mine approaches
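     The first criterion, processing in "limited data windows", can be sketched in a few lines. The example below is plain Java, not SPL or the Streams API; the class `SlidingWindowAverage` and its method `onTuple` are hypothetical names. It follows the push model: each arriving value is handed to the operator, which keeps only the last few values and never stores the full stream.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: a count-based sliding window over a pushed stream of readings.
class SlidingWindowAverage {
    private final int size;                       // maximum number of values kept
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;                     // running sum of the values in the window

    SlidingWindowAverage(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
    }

    // Called once per arriving value; evicts the oldest value when the window is full
    // and returns the average over the current window.
    double onTuple(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            sum -= window.removeFirst();
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        SlidingWindowAverage avg = new SlidingWindowAverage(3);
        double[] readings = {3.0, 6.0, 9.0, 12.0};
        for (double r : readings) {
            System.out.println(avg.onTuple(r));
        }
    }
}
```

     Memory use is bounded by the window size regardless of how long the stream runs, which is what makes this style workable when volumes are too great for store-and-mine approaches.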
  31. 31. Massively Scalable Stream Analytics
     Linear Scalability
     •  Clustered deployments – unlimited scalability
     Automated Deployment
     •  Automatically optimize operator deployment across clusters
     Performance Optimization
     •  JVM Sharing – minimize memory use
     •  Fuse operators on same cluster
     •  Telco client – 25 Million messages per second
     Analytics on Streaming Data
     •  Analytic accelerators for a variety of data types
     •  Optimized for real-time performance
     [Diagram: Streaming Data Sources feed Source Adapters, Analytic Operators and Sink Adapters in the Streams Runtime; Streams Studio IDE; Automated and Optimized Deployment; Visualization]
  32. 32. Streams approach illustrated
     [Diagram: a stream of tuples flowing through operators]
     •  Tuples with attributes directory and filename: ("/img", “farm”), ("/img", “bird”), ("/opt", “java”), ("/img", “cat”)
     •  Tuples with attributes height, width and data: 640 × 480, 1280 × 1024, 640 × 480
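     One plausible reading of this illustration is a filter operator that passes on only the tuples from the "/img" directory. The sketch below is plain Java rather than SPL, and it rests on that assumption; the class `TupleFilterSketch`, the type `FileTuple` and the method `imageNames` are hypothetical names introduced for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: tuples as small immutable objects and a filter operator over them.
class TupleFilterSketch {

    // A tuple with the two attributes named on the slide.
    static final class FileTuple {
        final String directory;
        final String filename;

        FileTuple(String directory, String filename) {
            this.directory = directory;
            this.filename = filename;
        }
    }

    // Filter operator: forward only the tuples that satisfy the condition.
    static List<FileTuple> filter(List<FileTuple> input, Predicate<FileTuple> keep) {
        List<FileTuple> output = new ArrayList<>();
        for (FileTuple t : input) {
            if (keep.test(t)) {
                output.add(t);
            }
        }
        return output;
    }

    // Keep tuples from "/img" (assumed filter condition) and project the filename.
    static List<String> imageNames(List<FileTuple> input) {
        List<String> names = new ArrayList<>();
        for (FileTuple t : filter(input, x -> x.directory.equals("/img"))) {
            names.add(t.filename);
        }
        return names;
    }

    public static void main(String[] args) {
        List<FileTuple> stream = Arrays.asList(
                new FileTuple("/img", "farm"),
                new FileTuple("/img", "bird"),
                new FileTuple("/opt", "java"),
                new FileTuple("/img", "cat"));
        System.out.println(imageNames(stream));
    }
}
```

     In a streaming runtime the operator would see one tuple at a time instead of a list; the list here only stands in for the stream.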
  33. 33. InfoSphere Streams for superior real time analytic processing
     Streams Processing Language (SPL) built for Streaming applications:
     •  Reusable operators
     •  Rapid application development
     •  Continuous “pipeline” processing
     Compile groups of operators into single processes:
     •  Efficient use of cores
     •  Distributed execution
     •  Very fast data exchange
     •  Can be automatic or tuned
     •  Scaled with push of a button
     Use the data that gives you a competitive advantage:
     •  Can handle virtually any data type
     •  Use data that is too expensive and time sensitive for traditional approaches
     Easy to extend:
     •  Built in adaptors
     •  Users add capability with familiar C++ and Java
     Dynamic analysis:
     •  Programmatically change topology at runtime
     •  Create new subscriptions
     •  Create new port properties
     Easy to manage:
     •  Automatic placement
     •  Extend applications incrementally without downtime
     •  Multi-user / multiple applications
     Flexible and high performance transport:
     •  Very low latency
     •  High data rates
  34. 34. Streams Studio Integrated Development Environment
  35. 35. Compiler Framework
     Operator Fusion
     •  Fine-grained operators
     •  From small parts, make larger ones that fit
     Code generation
     •  Generates code to match the underlying runtime environment
        •  Number of cores
        •  Interconnect characteristics
        •  Architecture-specific instructions
     •  Driven by automatic profiling
     •  Compiler-based optimization
     •  Driven by incremental learning of application characteristics
     [Diagram: Logical app view mapped to Physical app view]
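     Operator fusion can be pictured as function composition: several fine-grained operators are combined into one unit, so a tuple passes between them by a direct call instead of through a transport layer. The sketch below shows only that idea in plain Java; it is not how the Streams compiler is implemented, and the class `FusionSketch` and method `fuse` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative only: fusing a chain of small operators into one composed function.
class FusionSketch {

    // Compose the operators in order; the result applies all of them in a single call.
    static <T> Function<T, T> fuse(List<Function<T, T>> operators) {
        Function<T, T> fused = Function.identity();
        for (Function<T, T> op : operators) {
            fused = fused.andThen(op);
        }
        return fused;
    }

    public static void main(String[] args) {
        List<Function<Integer, Integer>> operators = new ArrayList<>();
        operators.add(x -> x + 1);   // first fine-grained operator
        operators.add(x -> x * 2);   // second fine-grained operator
        Function<Integer, Integer> fused = fuse(operators);
        System.out.println(fused.apply(3));
    }
}
```

     Which operators to fuse is the interesting decision; per the slide it is driven by profiling and by the characteristics of the target hardware, which this sketch does not model.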
  36. 36. Streams Data Mining Toolkit
     Enables scoring of real-time data in a Streams application
     •  Scoring is performed against a predefined model
     •  Supports a variety of model types and scoring algorithms
     Models represented in Predictive Model Markup Language (PMML)
     •  Standard for statistical and data mining models
     •  XML representation
     Toolkit provides four Streams operators to enable scoring
     •  Classification
     •  Clustering
     •  Regression
     •  Associations
     The toolkit supports dynamic replacement of the PMML model used by an operator.
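     The combination of "scoring against a predefined model" and "dynamic replacement of the model" can be sketched as an operator that holds its model behind an atomic reference. This is plain Java, not the toolkit's API, and it does not parse PMML: the class `ScoringOperatorSketch`, its methods, and the simple linear model standing in for a parsed regression model are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.DoubleUnaryOperator;

// Illustrative only: a scoring operator whose model can be swapped while it runs.
class ScoringOperatorSketch {

    // Hypothetical stand-in for a parsed regression model: score = intercept + slope * x.
    static DoubleUnaryOperator linearModel(double intercept, double slope) {
        return x -> intercept + slope * x;
    }

    // The current model; an AtomicReference lets another thread replace it safely.
    private final AtomicReference<DoubleUnaryOperator> model;

    ScoringOperatorSketch(DoubleUnaryOperator initialModel) {
        this.model = new AtomicReference<>(initialModel);
    }

    // Score one incoming value against whichever model is current.
    double score(double x) {
        return model.get().applyAsDouble(x);
    }

    // Replace the model without stopping the operator.
    void replaceModel(DoubleUnaryOperator newModel) {
        model.set(newModel);
    }

    public static void main(String[] args) {
        ScoringOperatorSketch operator = new ScoringOperatorSketch(linearModel(1.0, 2.0));
        System.out.println(operator.score(3.0));
        operator.replaceModel(linearModel(0.0, 0.5));
        System.out.println(operator.score(3.0));
    }
}
```

     Tuples scored before the swap use the old model and tuples scored after it use the new one; no restart of the application is needed.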
  37. 37. Without a Big Data Platform You Code… / IBM Big Data Platform
     Without a Big Data Platform you code:
     •  Custom SQL and Scripts
     •  Event Handling
     •  Multithreading
     •  Check Pointing
     •  Application Management
     •  HA
     •  Performance Optimization
     •  Debug
     •  Connectors
     •  Security
     IBM Big Data Platform:
     •  Accelerators and Toolkits: over 100 sample applications and toolkits with industry focused toolkits with 300+ functions and operators
     •  Streams provides development, deployment and runtime, and infrastructure services
     “TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language…” – Alex Philip, CEO and President, TerraEchos
  38. 38. Streams Redbook
     redbooks.ibm.com/abstracts/sg247970.html
     This book is intended for professionals who require an understanding of how to process high volumes of streaming data or need information about how to implement systems to satisfy those requirements.
  39. 39. Right-time actions are taken in the new BI/BA ecosystem
     •  Three routes to analytics
     •  Application and workload optimized appliances and systems
     •  Fast data movement and integration
     [Diagram: three routes from data sources to results]
     •  Traditional Warehouse: Traditional / Relational Data Sources feed Database & Warehouse for At-Rest Data Analytics, producing Results
     •  Streams: Non-Traditional / Non-Relational Data Sources feed In-Motion Analytics, producing Ultra Low Latency Results
     •  InfoSphere BigInsights: Non-Traditional / Non-Relational and Traditional / Relational Data Sources feed Internet Scale Data Analytics, Operations & Model Building, producing Internet Scale Results
  40. 40. Example of 360° customer view
     [Diagram: Big Data Platform feeding Business Processes]
     •  Business Processes: Events and Alerts, Master Data Management, Campaign Management, Cognos Consumer Insight
     •  Website Logs and Social Media: Internet Scale Analytics, yielding Web Traffic and Social Media Insight
     •  Call Detail Records: Streaming Analytics, yielding Call Behavior and Experience Insight
     •  Information Integration, Data Warehouse
  41. 41. IBM's Big Data Platform: InfoSphere BigInsights and InfoSphere Streams