Client approaches to successfully navigate through the big data storm


Hadoop is not a platform for data integration. As a result, some organizations turn to hand coding for integration, or end up deploying solutions that aren't fully scalable. Review this SlideShare to learn about IBM client best practices for big data integration success.


  1. Client Approaches to Successfully Navigate through the Big Data Storm (© 2014 IBM Corporation, June 2014)
  2. Does Your Big Data Project Look Like This?
     You need cost predictability, together with a solution that can quickly take you places. Hadoop is a fascinating, exciting engine. However, it is:
     - Ungoverned
     - All custom, all the time
     - Dependent on expensive, constantly changing skills
     - Lacking any concept of quality, governance, or lineage
     And MapReduce was originally designed for fine-grained fault tolerance, which makes it slow for big data integration processing. Hadoop is just not a solution for big data integration.
  3. If So, That's Because 80% of the Development Work for a Big Data Project Goes to Big Data Integration Challenges
     "By most accounts, 80 percent of the development effort in a big data project goes into data integration and only 20 percent goes towards data analysis." (Intel Corporation: Extract, Transform, and Load Big Data With Apache Hadoop, white paper)
     Most Hadoop initiatives end up achieving garbage in, garbage out faster, against larger data volumes, because:
     - MapReduce was not designed to accommodate processing all the logic necessary for big data integration
     - Teams forget that Hadoop initiatives require collecting, moving, transforming, cleansing, integrating, exploring, and analyzing volumes of disparate data (of various types, from various sources) -- in other words, data integration
     To succeed, you need data integration capabilities that create consumable data by:
     - Collecting, moving, transforming, cleansing, governing, integrating, exploring, and analyzing volumes of disparate data
     - Providing simplicity, speed, scalability, and reduced risk
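The integration steps the slide names (collect, move, transform, cleanse, integrate) can be sketched as a minimal pipeline. This is an illustration only; the function names and sample records are hypothetical and not part of any IBM product API:

```python
# Minimal sketch of the big data integration steps named on the slide:
# collect -> transform -> cleanse -> integrate.

def collect():
    # In practice: pull from files, queues, and databases of various types.
    return [
        {"id": "1", "name": " Ada Lovelace ", "spend": "120.5"},
        {"id": "2", "name": "", "spend": "bad"},                # dirty record
        {"id": "1", "name": "Ada Lovelace", "spend": "120.5"},  # duplicate
    ]

def transform(record):
    # Normalize types and trim whitespace.
    return {"id": record["id"],
            "name": record["name"].strip(),
            "spend": float(record["spend"])}

def cleanse(records):
    # Drop records that fail basic quality rules.
    cleaned = []
    for r in records:
        try:
            t = transform(r)
        except ValueError:
            continue  # unparseable spend -> reject
        if t["name"]:
            cleaned.append(t)
    return cleaned

def integrate(records):
    # De-duplicate on the business key, keeping the first occurrence.
    seen, out = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

consumable = integrate(cleanse(collect()))
```

The point the deck makes is that this logic, trivial at toy scale, is the 80% of the effort once it must run governed and in parallel against large, disparate sources.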
  4. A Large US Bank Needed to Reduce Total Cost of Ownership ...
     Business Problem:
     - Primary: reduce Teradata total cost of ownership
     - Secondary: allow for new analytic exploration and asset optimization
     Challenges:
     - Create a Data Distribution Hub / big data platform to cut costs
     - Move front-end processing from Teradata to the Data Distribution Hub
     - Offload the ELT workload in a cost-effective, efficient way
  5. ... and Successfully Offloaded ELT Workloads to Reduce Costs
     Approach:
     - Reduce costs by offloading ELT workloads from Teradata to a big data platform
     - Leverage existing InfoSphere Information Server data integration skills and assets (jobs)
     - Avoid hand coding: the client would not consider hand coding for data integration capabilities
     Outcome:
     - Client deployed IBM PureData for Hadoop
     - Client uses InfoSphere Information Server as its single scalable and flexible big data integration solution
     - Client successfully migrated its Teradata ELT and now uses InfoSphere Information Server to exploit the lower cost of running data integration on Hadoop
  6. A Government Entity Anticipated the Need to Support a 10x Increase in Incoming Data Volumes over 3-5 Years ...
     Business Problem:
     - This Master Data Management (MDM) client compares frequently updated records to identify potential national security threats. It needed to:
       - Support a 10x increase in incoming data volumes over the next 3-5 years
       - Reduce high software and hardware costs
     Project Challenges:
     - Create a solution that could support scalable probabilistic matching for up to 10x data growth
     - Modernize ETL practices and remove bottlenecks
  7. ... and Replaced an Expensive, Failing Hand-Coding Approach with a Massively Scalable Big Data Integration Solution
     Approach:
     - Eliminate hand coding for data integration to significantly reduce software costs
     - Deploy a data integration solution that can scale fast enough to feed the MDM system
     - Reduce the high cost of ELT running in the database
     Outcome:
     - Removed hand coding and replaced it with InfoSphere Information Server for massively scalable data integration processing
     - Stopped running ELT in the database, leveraging Hadoop instead
     - Client purchased an end-to-end big data solution from IBM, spanning MDM, Hadoop, and information integration
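For context, the probabilistic matching an MDM workload like this performs can be sketched at toy scale: each field comparison contributes a weighted score, and record pairs above a threshold are flagged for review. The field weights, threshold, and sample records below are invented for illustration and do not reflect any real matching configuration:

```python
# Toy sketch of probabilistic record matching, the kind of comparison an
# MDM system runs over frequently updated records. Weights, threshold,
# and records are hypothetical.

FIELD_WEIGHTS = {"name": 0.5, "dob": 0.3, "passport": 0.2}

def field_score(a, b):
    # Crude similarity: exact (case-insensitive) match scores 1,
    # both present but unequal scores 0, a missing value contributes nothing.
    if not a or not b:
        return 0.0
    return 1.0 if a.lower() == b.lower() else 0.0

def match_score(rec_a, rec_b):
    # Weighted sum of per-field similarities.
    return sum(w * field_score(rec_a.get(f), rec_b.get(f))
               for f, w in FIELD_WEIGHTS.items())

def is_potential_match(rec_a, rec_b, threshold=0.7):
    return match_score(rec_a, rec_b) >= threshold

a = {"name": "J. Smith", "dob": "1980-01-01", "passport": "X123"}
b = {"name": "j. smith", "dob": "1980-01-01", "passport": ""}
print(is_potential_match(a, b))  # name and dob agree despite missing passport
```

Real systems use far richer similarity functions and must score every candidate pair as records arrive, which is why a 10x data-volume increase made a hand-coded pipeline untenable here.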
  8. A Large European Telco Wants to Leverage Big Data to Increase Revenue and Customer Satisfaction ...
     Business Problem:
     - Increase revenue and customer satisfaction by analyzing usage patterns of mobile devices to match user demand
     - Needed a comprehensive big data platform that could keep up with analytics requirements
     - Reduce costs by reducing inventory
     Project Challenges:
     - The client used Informatica for ETL generally and planned to extend that use to the big data effort. It asked Informatica to improve existing Netezza loading performance in support of its goals, and:
       - The ETL process broke with a small sample of jobs
       - The client switched to an ELT approach and encountered technical problems
  9. ... and Learned That ELT Alone Was Not Sufficient to Support Big Data Integration
     Approach:
     - Leverage a worldwide predictive solution to anticipate customer requirements
     - Add a Hadoop layer to enrich predictive models with unstructured social media data
     - Expand the existing IBM Netezza footprint to keep pace with new data volumes
     - Client requested a full-workload data integration POC with IBM
     Outcome:
     - Client realized ELT alone was not sufficient for big data integration (not all data integration logic can be pushed into IBM Netezza or Hadoop)
     - Client found InfoSphere Information Server can often run data integration faster than either Netezza or Hadoop
     - Client selected InfoSphere Information Server over Informatica for big data integration, and InfoSphere BigInsights over Cloudera
  10. Plan for Success! Successfully Navigate the Big Data Maze
     To move into production successfully, you need to plan ahead and make sure you have accounted for your big data integration needs:
     - Hadoop is not a data integration platform; 80% of the work is big data integration, and MapReduce is slow for it
     - Hand coding does not meet big data integration scalability, flexibility, or performance requirements
     - ELT alone is not sufficient to meet most big data integration requirements, because you cannot push all the data integration logic into the data warehouse or into Hadoop
     Get more information about big data integration requirements and key success factors.