IBM and Big Data
Upload Details

Uploaded as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

IBM and Big Data: Presentation Transcript

  • Why IBM for Big Data
    • Big Data includes Data in Motion and Data at Rest
      • IBM has solutions for both
    • Tools included
      • Most other vendors only offer infrastructure
    • Committed to Open Source
      • Leveraging and contributing to open source community
    • Hadoop Operational Excellence
      • Making Hadoop easily deployed in the enterprise
    • Performance Observations
      • Open Source community is committed to driving improvements
    • Customers
      • Fast adoption of InfoSphere Streams and InfoSphere BigInsights
    • Competitive Advantages
      • Built-in Analytics, Security, Workload Management, and more
  • The IBM Big Data Platform: In Motion and At Rest
    • Big Data Analytics Engines
      • BigInsights for Data at Rest
      • Streams for Data in Motion
    • Big Data User Environments
      • Developers, End Users, Administrators
    • IBM Big Data Solutions
    • Client and Partner Solutions
    • Open Source Foundational Components
      • Eclipse, Oozie, Hadoop, HBase, Pig, Lucene, Jaql
    • Bringing Big Data to the Enterprise
      • Agents
      • Integration
      • Information Server
      • Areas shown: Marketing, Warehouse Appliances, Data Warehouse, Database, Content Analytics, Business Analytics, Master Data Mgmt, Data Growth Management
      • Products shown: InfoSphere Warehouse, Netezza, InfoSphere MDM, DB2, Cognos & SPSS, Unica, InfoSphere Optim, ECM
  • BigInsights Includes BigSheets for Visualization
    • BigSheets enables Hadoop Analytics for Business Analysts
      • Don’t need to be a MapReduce programmer
      • Simple visual tool enables MapReduce productivity right away
      • In your BI environments you have a ratio of 30+ report users for every complex SQL developer. We need to support the same ratios with BigInsights
    • Sample Uses
      • Data exploration and visualization
      • Visual job creation
  • BigInsights Includes Text Analytics Toolkit
    • System T text analytics engine previously only embedded in IBM products and hidden from end users
      • Found in Lotus Notes, IBM eDiscovery Analyzer, CCI, InfoSphere Warehouse, +++
      • Almost a decade since initial release
    • With BigInsights, IBM opens up the Text Analytics Engine technology for customization and development for the first time
    • BigInsights Text Analytics Toolkit provides developer tools, an easy-to-use text analytics language, and a set of extractors for fast adoption
      • Multilingual support, including support for DBCS languages
    • BigInsights includes Annotator Query Language (AQL): SQL-like!
      • Fully declarative text analytics language
      • No “black boxes” or modules that can’t be customized.
      • Tooling for easy customization because you are abstracted from the programmatic details
      • Competing solutions rely on locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance
  • IBM is Committed to Open Source
    • Decade of lineage and contributions to the open source community
      • Apache Hadoop and Jaql
      • Apache Derby
      • Apache Geronimo
      • Apache Jakarta
      • Eclipse: founded by IBM, PMC Board of Directors (Core, DTP, WTP, SOA, +++)
      • Text analytics: Unstructured Information Management Architecture (UIMA)
        • Text analytics is a major use case around Hadoop
      • Significant Lucene contributions via IBM Lucene Extension Library (ILEL)
        • Project Gumshoe (IBM's use of BigIndex for its Intranet search engine) is heavily mentioned in Chuck Lam's Hadoop in Action
      • DRDA, XQuery, SQL, XML4J, XERCES, HTTP, Java, Linux, +++
    • IBM products built on open source
      • WebSphere: Apache
      • Rational: Eclipse and Apache
      • InfoSphere: Eclipse and Apache, +++
    • Will continue down this path, especially around Hadoop
      • Working to move Oozie to an Apache subproject
      • Plan to converge Metatracker with Oozie over time
  • Table of Contents for Hadoop Operational Excellence
    • Security
      • LDAP authentication and role based security
    • Workload Management
      • Intelligent Scheduler
    • Enterprise Integration
      • DB2 and Netezza interfaces, DataStage integration
    • Compression
      • Fully licensed LZO compression for optimal Hadoop compression
    • Adaptive Map Reduce
      • Speeds up how small jobs are handled
    • Large Scale Indexing
      • IBM BigIndex technology
    • GPFS-SNC
      • Winner of 2010 SuperComputing Storage Challenge
  • Security in BigInsights
    • BigInsights reduces the exposed surface area and, by default, secures access to administrative interfaces and key Hadoop services
    • LDAP authentication and reverse-proxy support help administrators restrict access to authorized users
      • Clients outside the cluster must use REST HTTP access
      • Apache Hadoop has multiple open ports on every node in cluster
    • BigInsights defines 4 roles not available in Hadoop for finer grained security:
      • System Admin, Data Admin, Application Admin, and User
      • BigInsights installer automatically lets you map these roles to LDAP groups and users
    • GPFS-SNC means the cluster is aware of the underlying O/S security services without added complexity
    • IBM hardens Hadoop, making it ready for enterprises with operational expectations (SLAs)
      • Better performance; Better availability; Better management
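The role-to-LDAP mapping described on the security slide can be pictured with a minimal sketch. The four role names come from the slide; the LDAP group names and the `roles_for` helper are hypothetical and are not the BigInsights installer's actual configuration format.

```python
# Hypothetical mapping of the four BigInsights roles to LDAP groups.
# The group names below are invented for illustration only.
ROLE_TO_LDAP_GROUPS = {
    "System Admin": {"cn=bi-sysadmins"},
    "Data Admin": {"cn=bi-dataadmins"},
    "Application Admin": {"cn=bi-appadmins"},
    "User": {"cn=bi-users"},
}


def roles_for(user_groups):
    """Return the sorted roles whose mapped LDAP groups include any of the user's groups."""
    groups = set(user_groups)
    return sorted(
        role for role, mapped in ROLE_TO_LDAP_GROUPS.items() if mapped & groups
    )
```

For example, a user who belongs to both `cn=bi-users` and `cn=bi-dataadmins` resolves to the Data Admin and User roles, while a user in no mapped group gets no role at all.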
  • Compression in Hadoop
    • Hadoop users care about compression because of the default replication factor of 3 and sequential reading of data
      • Need to consider splittable compression in Hadoop if you have a lot of text data
      • Need to consider speed of compression (fast and slow algorithms)
    • Jaql Applications understand the splittable compression: no worries
    • LZO compression algorithm is fast and removes the split problem
      • BUT! LZO comes with a GPL license and requires some ‘extra’ work
      • BUT! LZO requires index files, which add performance overhead
        • GNU LZO uses variable-length compression blocks
        • Needs index files to tell mapper where it can safely split a compressed file
        • Mappers have to perform index lookups during decompress and read operations
      • Hadoop 0.21 will have splittable BZIP2 compression, but this is slow
    • BigInsights comes with “Blue-washed” LZO compression
      • IBM LZO compression fully integrated with BigInsights and under the same enterprise-friendly license agreement as the rest of BigInsights
        • No worries about enterprise deployment and usage concerns with GNU GPL
      • IBM LZO compression does not create an index file while compressing because it uses fixed-length compression blocks
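The index-file problem from the compression slide can be illustrated with a small sketch. It uses Python's `zlib` as a stand-in for LZO, and the block size and function names are assumptions made for illustration; it is not the on-disk format of GNU LZO or of the IBM codec. Because each block compresses to a different length, a reader needs the offset index to know where it can safely start decompressing.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (illustrative value)


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress each block independently; return the stream and an offset index."""
    stream = bytearray()
    index = []  # byte offset in the compressed stream where each block starts
    for start in range(0, len(data), block_size):
        block = zlib.compress(data[start:start + block_size])
        index.append(len(stream))
        stream += block
    return bytes(stream), index


def read_split(stream, index, first_block, last_block):
    """Decompress blocks [first_block, last_block), using the index to find boundaries."""
    bounds = index + [len(stream)]
    out = bytearray()
    for i in range(first_block, last_block):
        out += zlib.decompress(stream[bounds[i]:bounds[i + 1]])
    return bytes(out)
```

If every compressed block had the same length, the offset of block `i` could be computed by multiplication instead of looked up, which is presumably why the slide says fixed-length blocks make the index file unnecessary.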
  • Workload Management in BigInsights
    • Open Source Hadoop ships with a number of different schedulers
      • First in First Out (FIFO), Fair Scheduler, and Capacity Scheduler
      • Hadoop schedulers don't provide adequate controls to manage a Hadoop cluster
        • FAIR is pretty good at ensuring resources are applied to workloads
        • But! It doesn't give you SLA-like granular controls on the workload
    • IBM database workload management researchers studied Hadoop workload scheduling problems and created the Intelligent Scheduler
      • Extends Fair Scheduler with capability to constantly alter the minimum number of slots assigned to jobs
      • Includes a variety of metrics you can use to optimize workloads
        • Granular choice of metrics cluster-wide or job-specific
        • Optionally weight these metrics to balance competing priorities
        • Minimize or maximize the sum of all the individual job metrics
      • Examples:
        • VP gets more Hadoop cycles than Director
        • Target average response times
        • Big jobs get higher priority
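A rough sketch of weighted minimum-slot assignment, in the spirit of the "VP gets more Hadoop cycles than Director" example. The slides do not describe the Intelligent Scheduler's actual algorithm, so the proportional split below (largest-remainder rounding) and the `assign_min_slots` name are assumptions made only to show what adjusting per-job minimum slots by weight could look like.

```python
def assign_min_slots(total_slots, job_weights):
    """Split total_slots across jobs in proportion to their weights.

    Uses largest-remainder rounding so the assigned slots always sum to
    total_slots. This is an illustrative policy, not IBM's scheduler.
    """
    total_weight = sum(job_weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    exact = {job: total_slots * w / total_weight for job, w in job_weights.items()}
    slots = {job: int(share) for job, share in exact.items()}
    leftover = total_slots - sum(slots.values())
    # Hand the remaining slots to the jobs with the largest fractional shares.
    by_remainder = sorted(exact, key=lambda job: exact[job] - slots[job], reverse=True)
    for job in by_remainder[:leftover]:
        slots[job] += 1
    return slots
```

A scheduler that "constantly alters" minimums would re-run a calculation like this whenever the weights or the measured metrics change.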
  • Adaptive MapReduce
    • IBM Research workload management experts looked at ways to optimize Hadoop performance
    • Adaptive MapReduce extends Hadoop by making individual mappers self-aware and aware of other mappers via Adaptive Mappers
      • Individual map tasks can adapt to their environment and make efficient decisions
    • Split size tradeoffs – which one to use?
      • Small split size means more mappers which balances the workload
        • Small splits increase cluster overhead because each map task incurs its own startup cost
      • Large split sizes better for workloads with high startup costs
    • Adaptive MapReduce gives BigInsights the best of both worlds
      • Database analogy: connection pooling
      • No penalty for small split size
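The split-size tradeoff and the connection-pooling analogy can be made concrete with a toy timing model. This is only a sketch under simple assumptions (identical slots, a fixed startup cost, greedy assignment); the `makespan` function and its parameters are hypothetical and do not describe how Adaptive Mappers are actually implemented.

```python
import heapq


def makespan(split_costs, num_slots, startup, reuse_mappers):
    """Finish time when slots greedily take the next split as they become free.

    reuse_mappers=False: every split starts a new map task and pays `startup`.
    reuse_mappers=True: each slot pays `startup` once, then keeps pulling
    splits, similar to reusing a pooled database connection.
    """
    initial = startup if reuse_mappers else 0.0
    free_at = [initial] * num_slots  # time at which each slot becomes free
    heapq.heapify(free_at)
    per_split_startup = 0.0 if reuse_mappers else startup
    for cost in split_costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + per_split_startup + cost)
    return max(free_at)
```

With 100 small splits of cost 1, 4 slots, and a startup cost of 2, one task per split finishes at 75, while reusing mappers finishes at 27, the same as running 4 large splits of cost 25 with one task each. In this model the small splits keep their load-balancing benefit without the startup penalty.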
  • Large Scale Indexing with BigIndex
    • BigIndex is a framework for building large-scale indexing and search solutions
      • Includes modules for indexing over Hadoop, as well as optimizing, merging, and replicating indexes
      • Includes search component modules for programmable search, faceted search, and searching an index in local and distributed deployments
    • Provides the ability to search through hundreds of TBs of data, while maintaining sub-second search response times
      • Uses specialized distribution architectures for specific searches:
        • Partitioned indexes, Distributed indexes, Real-time indexes
    • BigIndex technology used in other IBM products such as Lotus Connections, IBM Content Analyzer, and Cognos Consumer Insight
      • IBM uses BigIndex to drive its Intranet search engine: Gumshoe
        • Gumshoe is discussed at length in Chuck Lam's Hadoop in Action
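As a minimal sketch of the partitioned-index idea mentioned above: documents are spread across several small inverted indexes, and a search queries every partition and merges the results. The function names and the partition-by-document-id rule are assumptions for illustration; this is not the BigIndex API.

```python
from collections import defaultdict


def build_partitioned_index(docs, num_partitions):
    """Build one inverted index per partition; documents are assigned by doc id."""
    partitions = [defaultdict(set) for _ in range(num_partitions)]
    for doc_id, text in docs.items():
        index = partitions[doc_id % num_partitions]
        for term in text.lower().split():
            index[term].add(doc_id)
    return partitions


def search(partitions, term):
    """Query every partition for the term and merge the matching document ids."""
    hits = set()
    for index in partitions:
        hits |= index.get(term.lower(), set())
    return sorted(hits)
```

In a distributed deployment each partition would live on a different node and be queried in parallel, which is one way search latency can stay low as the total data volume grows.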
  • Big Data Must Be an Integrated Part of your Enterprise
    • Traditional Approach: Structured, analytical, logical
      • Structured, Repeatable, Linear
      • Monthly sales reports, Profitability analysis, Customer surveys
    • New Approach: Creative, holistic thought, intuition
      • Unstructured, Exploratory, Iterative
      • Brand sentiment, Product strategy, Maximum asset utilization
    • Must be easy to deploy and integrate, cannot be a silo