Faster, Cheaper, Easier... and Successful! Best Practices for Big Data Integration: Presentation Transcript

  • 1. © 2014 IBM Corporation | IBM Confidential Faster, cheaper, easier… and successful! Best practices for Big Data Integration 2014-06-05
  • 2. Big Data Integration Is Critical For Success With Hadoop. "By most accounts, 80% of the development effort in a big data project goes into data integration and only 20% goes towards data analysis." (Extract, Transform, and Load Big Data With Apache Hadoop, white paper: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf) Most Hadoop initiatives involve collecting, moving, transforming, cleansing, integrating, exploring, and analyzing volumes of disparate data sources and types.
  • 3. Why is 80% of the effort in "data integration"? The inhibitors are the same for both traditional and Big Data: heterogeneity of data sources, diverse formats, data issues, optimizing performance, missing or bad requirements, complexity, and lack of understanding. To be useful, the meaning and accuracy of the data should never be in question; as such, data needs to be made fit-for-purpose so that it is used correctly and consistently.
  • 4. Isn't there any good news? Yes, but without effective Big Data Integration you won't have consumable data: most Hadoop initiatives will simply achieve "garbage in, garbage out" faster, against larger data volumes, and at much lower total cost than without Hadoop.
  • 5. Getting consumable data from the data lake. (Diagram contrasting "the data lake, unfettered" with "adding integration & governance discipline.")
  • 6. Five Best Practices for Big Data Integration: (1) No hand coding anywhere, for any purpose; (2) One data integration and governance platform for the enterprise; (3) Massively scalable data integration wherever it needs to run; (4) World-class data governance across the enterprise; (5) Robust administration and operations control across the enterprise.
  • 7. Best Practice #1: No hand coding anywhere, for any purpose. What does this mean? No hand coding for any aspect of Big Data Integration: data access and movement across the enterprise, data integration logic, assembling data integration jobs from logic objects, assembling larger workflows, data governance, and operational and administrative management.
  • 8. Cost of hand coding vs. market-leading tooling (pharmaceutical customer example). Hand coding / legacy DI: 30 man-days to write, almost 2,000 lines of code (71,000 characters), no documentation, difficult to reuse, difficult to maintain. Tooling / Information Server: 2 days to write, graphical, self-documenting, reusable, more maintainable, improved performance; an 87% saving in development costs. Our largest customers concluded years ago that they will not succeed with Big Data initiatives without Information Server.
  • 9. Best Practice #1: No hand coding anywhere, for any purpose. Lowers costs: DI tooling reduces labor costs by 90% over hand coding, and one set of skills and best practices is leveraged across all projects. Faster time to value: DI tooling reduces project timelines by 90% over hand coding, and much less time is required to add new sources and new DI processes. Higher quality data: data profiling and cleansing are very difficult to implement using hand coding. Effective data governance: requires world-class data integration tooling to support objectives like impact analysis and data lineage.
  • 10. Best Practice #2: One data integration and governance platform for the enterprise. What does this mean? Build a job once and run it anywhere on any platform in the enterprise without modification; access, move, and load data between a variety of sources and targets across the enterprise; support a variety of data integration paradigms (batch processing, federation, change data capture, SOA enablement of data integration tasks, real-time with transactional integrity, and self-service for business users); and support the establishment of world-class data governance across the enterprise.
  • 11. Self-service Big Data Integration on demand: InfoSphere Data Click. Provides a simple web-based interface for any user; moves data in batch or real time in a few clicks; policy choices are then automated, without any coding; optimized runtime; automatically captures metadata for built-in governance. "I have a feeling before long Gartner will be telling us if we're not doing this something is wrong." – an IBM customer
  • 12. Optimized for Hadoop with blazing-fast HDFS speeds. Extends the same easy drag-and-drop paradigm: simply add your Hadoop server name and port number. The Information Server engine uses streaming parallelization techniques to pipe data in and out at massive scale; a performance study ran at up to 15 TB/hr before the HDFS disks were completely saturated.
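As a vendor-neutral illustration of streaming records into HDFS (not how Information Server itself connects), here is a minimal sketch using the open-source Python `hdfs` (WebHDFS) client; the NameNode URL, user, and target path are hypothetical.

```python
# Minimal sketch: stream records into HDFS over WebHDFS.
# Assumes the open-source Python "hdfs" package (pip install hdfs);
# the NameNode URL, user, and target path below are hypothetical.
import csv
import io

from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.com:50070", user="etl")

rows = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
]

# Serialize the records as CSV, then stream the buffer into an HDFS file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(rows)

with client.write("/landing/demo/records.csv", encoding="utf-8", overwrite=True) as hdfs_file:
    hdfs_file.write(buffer.getvalue())
```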
  • 13. Make data available to Hadoop in real time. Non-invasive record capture: read data from transactional database logs to minimize impact to source systems. High-speed data replication: low-latency capture and delivery of real-time information. Consistently current Hadoop data: data is available in Hadoop moments after it was committed in source databases, accelerating analytics currency.
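To make the change-data-capture idea concrete, here is a minimal, vendor-neutral sketch of applying a stream of captured change records (inserts, updates, deletes) to a target table image; the event format and table are hypothetical and do not reflect the internals of any specific replication product.

```python
# Minimal sketch: apply captured change records (CDC events) to a target image.
# The event format below is hypothetical; real log-based capture tools emit
# richer metadata (transaction ids, commit timestamps, before/after images).
changes = [
    {"op": "INSERT", "key": 101, "row": {"key": 101, "status": "new"}},
    {"op": "UPDATE", "key": 101, "row": {"key": 101, "status": "shipped"}},
    {"op": "DELETE", "key": 100, "row": None},
]

target = {100: {"key": 100, "status": "old"}}  # current target table image

for event in changes:
    if event["op"] == "DELETE":
        target.pop(event["key"], None)
    else:  # INSERT or UPDATE: upsert the latest row image
        target[event["key"]] = event["row"]

print(target)  # {101: {'key': 101, 'status': 'shipped'}}
```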
  • 14. Best Practice #3: Massively scalable data integration wherever it needs to run. What does this mean? Design once: develop the logic in the same manner regardless of execution platform. Scale anywhere: execute the logic in any of the five patterns for scalable data integration; no single pattern is sufficient. Outside the Hadoop environment: Case 1, the Information Server parallel engine running against any traditional data source; Case 2, push processing into a parallel database. Between environments: Case 3, move and process data in parallel between environments. Within the Hadoop environment: Case 4, push processing into Hadoop MapReduce; Case 5, the Information Server parallel engine running against HDFS without MapReduce. Information Server is the only Big Data Integration platform supporting all five use cases.
  • 15. Information Server is Big Data Integration. Dynamic: instantly get better performance as hardware resources are added. Extendable: add a new server to scale out through a simple text file edit (or, in a grid configuration, automatically via integration with grid management software). Data partitioned: in true MPP fashion (like Hadoop), data is partitioned across the DI parallel engine to scale out the I/O. (Diagram: a source-to-EDW pipeline of Transform, Cleanse, and Enrich stages scaling from a sequential uniprocessor, to a 4-way parallel SMP system, to a 64-way parallel MPP clustered system.)
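As a toy illustration of the data-partitioning idea (not Information Server's actual engine), the sketch below hash-partitions rows by key and transforms each partition in a separate worker process, which is the basic pattern behind scaling out I/O and transformation work.

```python
# Toy sketch of MPP-style data partitioning: hash rows to partitions and
# transform each partition in its own worker process. Purely illustrative;
# it is not how Information Server's parallel engine is implemented.
from multiprocessing import Pool

NUM_PARTITIONS = 4

def partition(rows, num_partitions):
    """Hash-partition rows on their key so each worker gets a disjoint slice."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row["key"]) % num_partitions].append(row)
    return parts

def transform(rows):
    """Per-partition transform: cleanse and enrich each row."""
    return [{**row, "name": row["name"].strip().upper()} for row in rows]

if __name__ == "__main__":
    source = [{"key": i, "name": f"  customer {i} "} for i in range(10)]
    with Pool(NUM_PARTITIONS) as pool:
        results = pool.map(transform, partition(source, NUM_PARTITIONS))
    combined = [row for part in results for row in part]
    print(len(combined), "rows transformed")
```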
  • 16. Information Server: customer stories with Big Data, using the Information Server MPP engine (global banks, a data services company, and health care providers). 200,000 programs built in Information Server on a grid/cluster of low-cost commodity hardware; text analytics across 200 million medical documents each weekend, creating indexes to support optimal retrieval by users; desensitizing 200 TB of data one weekend each month to populate dev environments; processing 50,000 transactions per second with complex transformation and guaranteed delivery; Information Server-powered grid processing of over 40+ trillion records each month.
  • 17. Where should you run scalable data integration? Run in the database. Advantages: exploit the database MPP engine; minimize data movement; leverage the database for joins/aggregations; works best when data is already clean; frees up cycles on the ETL server; use excess capacity on the RDBMS server; the database is faster for some processes. Disadvantages: very expensive hardware and storage; can force 100% reliance on ELT; degradation of query SLAs; not all ETL logic can be pushed into the RDBMS (with an ELT tool or hand coding); can't exploit commodity hardware; usually requires hand coding; limitations on complex transformations; limited data cleansing; the database is slower for some processes; ELT can consume RDBMS capacity (capacity planning is nontrivial). Run in the DI engine. Advantages: exploit the ETL MPP engine; exploit commodity hardware and storage; exploit a grid to consolidate SMP servers; perform complex transforms (data cleansing) that can't be pushed into the RDBMS; free up capacity on the RDBMS server; process heterogeneous data sources (not stored in the database); the ETL server is faster for some processes. Disadvantages: the ETL server is slower for some processes (data already stored in relational tables); may require extra hardware (low-cost hardware). Run in Hadoop. Advantages: exploit the MapReduce MPP engine; exploit commodity hardware and storage; free up capacity on the database server; support processing of unstructured data; exploit Hadoop's capabilities for persisting data (e.g., updating and indexing); low-cost archiving of history data. Disadvantages: not all ETL logic can be pushed into MapReduce (with an ELT tool or hand coding); can require complex programming; MapReduce will usually be much slower than a parallel database or a scalable ETL tool; risk: Hadoop is still a young technology. Big Data Integration requires a balanced approach that supports all of the above.
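To illustrate the "run in the database" (ELT pushdown) option in isolation, here is a minimal sketch that pushes a join and aggregation into the database as SQL instead of pulling rows out to transform them; SQLite and the table and column names are hypothetical stand-ins for a real RDBMS.

```python
# Minimal ELT-pushdown sketch: let the database do the join/aggregation
# rather than extracting rows and transforming them in an external engine.
# Uses SQLite as a stand-in; the tables and columns are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 25.0), (2, 10, 40.0), (3, 11, 15.0);
    INSERT INTO customers VALUES (10, 'EMEA'), (11, 'APAC');
""")

# The "transform" runs inside the database engine (pushdown), so only the
# small aggregated result crosses the wire.
pushdown_sql = """
    SELECT c.region, SUM(o.amount) AS total_amount
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.region
"""
for region, total in conn.execute(pushdown_sql):
    print(region, total)
```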
  • 18. Automated MapReduce job generation: leverage the same UI and the same stages to automatically build MapReduce jobs; drag and drop stages onto the canvas to create a job, rather than having to learn MapReduce programming; push the processing to the data for patterns where you don't want to transport the data over the network.
  • 19. Automated MapReduce job generation: build integration jobs with the same data flow tool and stages, and the tooling automatically creates the MapReduce code.
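For contrast, the sketch below shows the kind of hand-written MapReduce code (here, a generic Hadoop Streaming word-count mapper and reducer in Python) that drag-and-drop job generation spares developers from writing and maintaining by hand; it is an illustrative example, not the output of any IBM tool.

```python
# Generic Hadoop Streaming example (word count), with the mapper and reducer
# in one file for brevity. Run via hadoop-streaming, e.g.:
#   -mapper "python wc.py map" -reducer "python wc.py reduce"
# Illustrative hand-written MapReduce, not generated tool output.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Hadoop sorts mapper output by key, so identical words arrive together.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if mode == "map" else reducer)(sys.stdin)
```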
  • 20. So why is pushdown of ETL into MapReduce not sufficient for Big Data Integration? Big Data Integration also requires running scalable data integration workloads outside of the Hadoop MapReduce environment, because complex data integration logic can't be pushed into a parallel database or MapReduce easily and efficiently, or at all in some cases. IBM's experiences with customers' early Hadoop initiatives have shown that much of their data integration processing logic can't be pushed into MapReduce; without Information Server, these more complex data integration processes would have to be hand coded to run in MapReduce, increasing project time, cost, and complexity. MapReduce also has significant and known performance limitations for processing large data volumes with complex transformations (including data integration), and many Big Data vendors and researchers are focusing on bypassing those limitations. DataStage will process data integration 10x-15x faster than MapReduce.
  • 21. Best Practice #4: World-class data governance across the enterprise. What does this mean? Both IT and the line of business need to have a high degree of confidence in the data, and confidence requires that data is understood to be of high quality, secure, and fit-for-purpose. Where does the data in my report come from? What is being done with it inside of Hadoop? Where was it before reaching our data lake? Oftentimes these requirements stem from regulations within the specific industry.
  • 22. Why is data governance critical for Big Data? How well do your business users understand the content of the information in your Big Data stores? Are you measuring the quality of your information in Big Data?
  • 23. All data needs to build confidence via a fully governed data lifecycle. Find: leverage terms, labels, and collections to find governed, curated data sources. Curate: add labels, terms, and custom properties to relevant assets. Collect: use collections to capture assets for a specific analysis or governance effort. Collaborate: share collections for additional curation and governance. Govern: create and reference information governance policies and rules; apply data quality, masking, archiving, cleansing, and other controls to the data. Offload: copy data in one click to HDFS for analysis or warehouse augmentation. Analyze: perform analyses on the offloaded data. Reuse & trust: understand how data is being used today via lineage for analyses and reports.
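To make the lineage question ("where does the data in my report come from?") concrete, here is a minimal, tool-agnostic sketch that records source-to-target edges as jobs run and walks them backwards from a report; the asset and job names are hypothetical.

```python
# Minimal, tool-agnostic data lineage sketch: record which job read which
# source to produce which target, then walk the edges backwards from a
# report to its origins. Asset and job names are hypothetical.
from collections import defaultdict

# (target, job, source) edges captured as jobs run.
edges = [
    ("sales_report", "build_report", "sales_mart"),
    ("sales_mart", "load_mart", "hdfs:/landing/orders"),
    ("hdfs:/landing/orders", "offload_orders", "orders_db.orders"),
]

upstream = defaultdict(list)
for target, job, source in edges:
    upstream[target].append((job, source))

def trace(asset, depth=0):
    """Print the upstream lineage of an asset, one level per line."""
    for job, source in upstream.get(asset, []):
        print("  " * depth + f"{asset} <- [{job}] <- {source}")
        trace(source, depth + 1)

trace("sales_report")
```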
  • 24. Best Practice #5: Robust administration and operations control across the enterprise. What does this mean? Operations management for Big Data Integration: provides quick answers for operators, developers, and other stakeholders as they monitor the run-time environment; workload management allocates resource priority in a shared-services environment and queues workload on a busy system; performance analysis provides insight into resource consumption to help understand when the system may need more resources; and workflows can include Hadoop activities defined via Oozie directly alongside other data integration activities. Administrative management for Big Data Integration: a web-based installer for all integration and governance capabilities; highly available configurations for meeting 24/7 requirements; instant provisioning/deployment of a new project instance; centralized authentication, authorization, and session management; and audit logging of security-related events to promote SOX compliance.
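As a purely conceptual sketch of the workload-management idea (not Information Server's actual workload manager), the example below queues submitted jobs by priority and caps how many run concurrently on a busy shared system; the job names and priorities are hypothetical.

```python
# Toy sketch of workload management: queue jobs by priority and cap how many
# run concurrently on a busy shared system. Conceptual only; it is not
# Information Server's actual workload manager.
import heapq

MAX_CONCURRENT = 2

# (priority, name) pairs; lower number = higher priority.
submitted = [(1, "finance_etl"), (3, "adhoc_extract"), (2, "cdc_apply"), (3, "dev_test")]

queue = []
for priority, name in submitted:
    heapq.heappush(queue, (priority, name))

running = []
while queue or running:
    # Admit the highest-priority waiting jobs while capacity remains.
    while queue and len(running) < MAX_CONCURRENT:
        priority, name = heapq.heappop(queue)
        running.append(name)
        print(f"started {name} (priority {priority})")
    # Simulate the oldest running job finishing, freeing a slot.
    finished = running.pop(0)
    print(f"finished {finished}")
```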
  • 25. Combined workflows for Big Data: a simple design paradigm for workflows (the same as for job design); mix any Oozie activity right alongside other data integration activities; let users have data sourcing, ETL, analytics, and delivery of information all controlled through a single coordinating process; and monitor all stages through the Operations Console's web-based interface.
  • 26. An automotive manufacturer uses Big Data Integration to build out a global data warehouse. Challenges: the need to manage massive amounts of vehicle data (about 5 TB per day); the need to understand, incorporate, and correlate a variety of data sources to better understand problems and product quality issues; and the need to share information across functional teams for improved decision making. Business benefits: doubled the number of models winning the JD Power award for initial quality study in one year; improved and streamlined decision making and system efficiency; lowered warranty costs. IT benefits: a single infrastructure to consolidate structured, semi-structured, and unstructured data, with simplified management; optimized the existing Teradata environment (size, performance, and TCO); high-performance ETL for in-database transformations.
  • 27. Other examples of the proven value of IBM Big Data Integration. European telco: ELT pushdown into the database and Hadoop was not sufficient for Big Data Integration; InfoSphere DataStage runs some DI processes faster than the parallel database and MapReduce. Wireless carrier: InfoSphere Information Server can transform a dirty Hadoop lake into a clean Hadoop lake; IBM met requirements for processing 25 terabytes in 24 hours; the capabilities of InfoSphere Information Server and InfoSphere Optim data masking all helped to produce a clean Hadoop lake. Insurance company: IBM could shred complex XML claims messages and flatten them for Hadoop reporting and analysis; IBM could meet all requirements for large-scale batch processing; and Information Server could adjust for changes in XML structure while tolerating future unanticipated changes.
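To illustrate in general terms what "shredding and flattening" XML for Hadoop means, here is a minimal Python sketch that flattens a nested claims message into tabular rows; the XML layout and field names are hypothetical and unrelated to any customer's actual schema.

```python
# Minimal sketch of "shredding" a nested XML claims message into flat,
# tabular rows suitable for loading into Hadoop. The XML layout and field
# names are hypothetical, not any customer's real schema.
import csv
import io
import xml.etree.ElementTree as ET

claims_xml = """
<claims>
  <claim id="C-1001">
    <insured>Jane Doe</insured>
    <line code="A10" amount="120.50"/>
    <line code="B22" amount="75.00"/>
  </claim>
</claims>
"""

rows = []
for claim in ET.fromstring(claims_xml).findall("claim"):
    for line in claim.findall("line"):  # one flat row per claim line
        rows.append({
            "claim_id": claim.get("id"),
            "insured": claim.findtext("insured"),
            "line_code": line.get("code"),
            "amount": line.get("amount"),
        })

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["claim_id", "insured", "line_code", "amount"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```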
  • 28. Summing it up: increase ROI for Hadoop via the five best practices for Big Data Integration. Speed productivity: graphical design is easier to use than hand coding. Simplify heterogeneity: a common method for diverse data sources. Shorten project cycles: pre-built components reduce cost and timelines. Promote object reuse: build once, share, and run anywhere (ETL/ELT/real time). Reduce operational cost: a robust framework to manage data integration. Protect from changes: isolation from underlying technology changes as they continue to evolve.
  • 29. Thank you