Dealing with Changed Data in Hadoop

Speaker notes
  • As data volumes and business complexity grow, traditional scale-up and scale-out architectures become too costly. Therefore most organizations are unable to keep all the data they would like to query directly in the data warehouse. They have to archive the data to more affordable offline systems, such as a storage grid or tape backup. A typical strategy is to define a “time window” for data retention beyond which data is archived. Of course, this data is not in the warehouse, so business users cannot benefit from it.
    For traditional grid computing, the network was becoming the bottleneck as large data volumes were pushed to the CPU workloads. This placed a limit on how much data could be processed in a reasonable amount of time to meet business SLAs.
    Does not handle new types of multi-structured data
    Changes to schemas cause delays in project delivery

  • The requirements for an optimized DW architecture include:
    Cost-effective scale out infrastructure to support unlimited data volumes
    Leverage commodity hardware and software to lower infrastructure costs
    Leverage existing skills to lower operational costs
    Must support all types of data
    Must support agile methodologies with schema-on-read, rapid prototyping, metadata-driven visual IDEs, and collaboration tools
    Integrates with existing and new types of infrastructure

  • First start by identifying what data and processing to offload from the DW to Hadoop
    Inactive or infrequently used data can be moved to a Hadoop-based environment
    Transformations that are consuming too much CPU capacity in the DW can be moved
    Unstructured and multi-structured (e.g., non-relational) data should be staged in Hadoop, not the DW
    You can also offload data from relational and mainframe systems to the Hadoop-based environment

    For lower-latency data originating in relational databases, data can be replicated in real time from relational sources to the Hadoop-based environment
    Use change data capture (CDC) to capture changes as they occur in your operational transaction systems and propagate those changes to Hadoop.
    Because HDFS doesn’t impose schema requirements on data, unstructured data that was previously not available to the warehouse can also be loaded
    Collect real-time machine and sensor data at the source as it is created and stream it directly into Hadoop instead of staging it in a temporary file system or, worse yet, staging it in the DW

    As data is ingested into the Hadoop-based environment, you can leverage the power of high-performance distributed grid computing to parse, extract features, integrate, normalize, standardize, and cleanse data for analysis. Data must be parsed and prepared for further analysis. For example, semi-structured data, like JSON or XML, is parsed into a tabular format for easier downstream consumption by analysis programs and tools (a minimal HiveQL sketch follows this note). Data cleansing logic can be applied to increase the data’s trustworthiness.
    The Hadoop-based environment cost-effectively and automatically scales to prepare all types of data, no matter the volume, for analysis.

    After the data has been cleansed and transformed, copy the refined, curated, high-value datasets from the Hadoop-based environment into the DW to augment existing tables and make them directly accessible to the enterprise’s existing BI reports and applications.
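    To make the parse-and-prepare step concrete, here is a minimal HiveQL sketch (not from the presentation) that flattens line-delimited JSON events into a tabular table using Hive’s built-in json_tuple function. The table names, field names, and HDFS path are hypothetical; in the deck this kind of logic is generated by Informatica mappings rather than written by hand.

      -- Hypothetical landing zone: one JSON document per line under /data/landing/events
      CREATE EXTERNAL TABLE raw_events (json STRING)
      LOCATION '/data/landing/events';

      -- Parse the JSON into a tabular, columnar table for downstream analysis tools
      CREATE TABLE events_parsed STORED AS ORC AS
      SELECT jt.event_id, jt.user_id, jt.event_type, jt.event_ts
      FROM raw_events r
      LATERAL VIEW json_tuple(r.json, 'event_id', 'user_id', 'event_type', 'event_ts') jt
        AS event_id, user_id, event_type, event_ts;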
  • Classic Data Warehouse offloading use case
  • Informatica enables you to define the data processing flow (e.g., ETL, DQ, etc.) with transformations and rules using a visual design UI. We call these mappings.
    When these data flows or mappings are deployed and run, Informatica optimizes the end-to-end flow from source to target and generates Hive-QL scripts
    Transformations that don’t map to HQL, for example name and address cleansing routines, will be run as user-defined functions (UDFs) via the Vibe™ virtual data machine libraries that reside on each of the Hadoop nodes. Because we have separated the design from the deployment, you can take existing PowerCenter mappings and run them on Hadoop
    In fact, the source and target data don’t have to reside in Hadoop. Informatica will stream the data from the source into Hadoop for processing and then deliver it to the target, whether on Hadoop or another system

    Tech notes: Currently, the Vibe™ virtual data machine library is approximately 1.3 GB of JAR and shared library files. Note that the Vibe™ virtual data machine is not a continuously executing service process (i.e., a daemon), but rather a set of libraries that are executed only within the context of MapReduce jobs.

  • One of the first challenges Hadoop developers face is accessing all the data needed for processing and getting it into Hadoop. Sure, you can build custom adapters and scripts, but several challenges come with that. To name a few:
    They require expert knowledge of the source systems, applications, data structures, and formats
    The custom code should perform and scale as data volumes grow
    Along with the need for speed, security and reliability cannot be overlooked.

    Thus, building a robust custom adapter takes time, and the adapter can be costly to maintain as software versions change. On the other hand, Informatica PowerExchange can access data from virtually any data source at any latency (e.g., batch, real time, or near real time) and deliver all your data directly to/from a Hadoop-based environment.

    Proven Path to Innovation
    5000+ customers, 500+ partners, 100,000+ trained Informatica developers
    Enterprise scalability, security, & support
  • Transcript

    • 1. Dealing With Changed Data on Hadoop: An old data warehouse problem in a new world. Kunal Jain, Big Data Solutions Architect at Informatica. June 2014
    • 2. Agenda • Challenges with Traditional Data Warehouse • Requirements for Data Warehouse Optimization • Data Warehouse Optimization Process Flow • Dealing With Changed Data on Hadoop • Demo
    • 3. Challenges With Traditional Data Warehousing • Expensive to scale as data volumes grow and new data types emerge • Staging of raw data and ELT consuming capacity of data warehouse too quickly forcing costly upgrades • Network becoming a bottleneck to performance • Does not handle new types of multi-structured data • Changes to schemas cause delays in project delivery
    • 4. Requirements for an Optimized Data Warehouse • Cost-effective scale out infrastructure to support unlimited data volumes • Leverage commodity hardware and software to lower infrastructure costs • Leverage existing skills to lower operational costs • Must support all types of data • Must support agile methodologies with schema-on-read, rapid prototyping, metadata-driven visual IDEs, and collaboration tools • Integrates with existing and new types of infrastructure
    • 5. Data Warehouse Optimization Process Flow
      1. Offload data & ELT processing to Hadoop
      2. Batch load raw data (e.g. transactions, multi-structured)
      3. Parse & prepare (e.g. ETL, data quality) data for analysis
      4. Move high value curated data into the data warehouse
      Sources: Relational, Mainframe; Documents and Emails; Social Media, Web Logs; Machine Device, Cloud. Consumers: Data Warehouse; BI Reports & Apps.
    • 6. Use Case: Updates in Traditional DW/RDBMS • Example Requirement: Historical table containing 10 Billion rows of data • Every day gets incremental data of 10 million rows (70% new inserts, 30% updates) • Traditional approach: Straightforward to insert and update in a traditional DW/RDBMS • Challenge: Traditional infrastructure cannot scale to the data size and is not cost-effective.
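      For contrast with the Hive approach on the following slides, a traditional DW/RDBMS can apply the daily delta in place with a single statement. The sketch below is generic ANSI SQL with hypothetical table and column names, not taken from the presentation.

        -- Apply ~10M staged changes (70% inserts, 30% updates) to a ~10B-row table in place
        MERGE INTO tx_target t
        USING tx_staging s
          ON (t.txid = s.txid)
        WHEN MATCHED THEN UPDATE SET
          t.description        = s.description,
          t.amount             = s.amount,
          t.last_modified_date = s.last_modified_date
        WHEN NOT MATCHED THEN INSERT
          (txid, description, transaction_date, amount, last_modified_date)
          VALUES (s.txid, s.description, s.transaction_date, s.amount, s.last_modified_date);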
    • 7. Use Case: Update/Insert in Hadoop/Hive • Requirement: Use Hive to store massive amounts of data, but need to perform inserts, deletes and updates. • Typical approach: Since Hive does not support updates, the workaround used is to perform a FULL OUTER JOIN and a FULL TABLE REFRESH to update impacted rows • Challenge: Table refresh / full outer join on historical tables (10B+ rows) would blow SLAs out of the water
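      A hedged HiveQL sketch of that workaround (hypothetical table and column names): the staging row wins wherever it exists, but the entire target is rewritten even though only ~10M of the ~10B rows changed.

        -- Rebuild the whole target by full-outer-joining it with the daily staging extract
        CREATE TABLE tx_target_refreshed AS
        SELECT
          COALESCE(s.txid,               t.txid)               AS txid,
          COALESCE(s.description,        t.description)        AS description,
          COALESCE(s.transaction_date,   t.transaction_date)   AS transaction_date,
          COALESCE(s.amount,             t.amount)             AS amount,
          COALESCE(s.last_modified_date, t.last_modified_date) AS last_modified_date
        FROM tx_target t
        FULL OUTER JOIN tx_staging s
          ON t.txid = s.txid;
        -- tx_target_refreshed then replaces tx_target (e.g. DROP TABLE + ALTER TABLE ... RENAME TO),
        -- so all ~10B rows are rewritten to apply a ~10M-row change.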
    • 8. Use Case: Update/Insert in Hadoop/Hive
      Staging Table (10 million rows, 70% inserts and 30% updates):
      TXID  Description  Transaction Date  Amount  Last Modified Date
      1     Xxxx         20-JAN-13         210     23-MAR-13           (UPDATE)
      4     Ccc          23-MAR-13         150     23-MAR-13           (INSERT)
      6     Bbb          23-MAR-13         500     23-MAR-13           (INSERT)
      Target Table before (10 billion rows):
      TXID  Description  Transaction Date  Amount  Last Modified Date
      1     Xxxx         20-JAN-13         200     20-JAN-13
      2     Yyy          21-FEB-13         300     21-FEB-13
      3     Aaa          22-MAR-13         400     22-MAR-13
      Target Table after (10 billion + 7 million rows):
      TXID  Description  Transaction Date  Amount  Last Modified Date
      1     Xxxx         20-JAN-13         210     23-MAR-13           (updated)
      2     Yyy          21-FEB-13         300     21-FEB-13
      3     Aaa          22-MAR-13         400     22-MAR-13
      4     Ccc          23-MAR-13         150     23-MAR-13           (inserted)
      6     Bbb          23-MAR-13         500     23-MAR-13           (inserted)
      Partitioning rows by date significantly reduces the total number of partitions impacted by updates.
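      A minimal HiveQL sketch of the partitioning idea above, with hypothetical names and types: the target is partitioned on the transaction date, so a daily batch only touches the few partitions whose rows actually changed.

        -- Partition the historical table on the transaction date used as the partition key above
        CREATE TABLE tx_target (
          txid               BIGINT,
          description        STRING,
          amount             DOUBLE,
          last_modified_date STRING
        )
        PARTITIONED BY (transaction_date STRING)
        STORED AS ORC;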
    • 9. Use Case: Update/Insert in Hadoop/Hive (process flow)
      1. Extract & load: pull the ~10M changed rows (70% inserts, 30% updates) from the relational source into a Staging table.
      2a. Bring the new and updated data from Staging into a Temporary table.
      2b. Bring the unchanged data from the impacted Target partitions into the same Temporary table (~13M rows in total).
      3. Delete the matching (impacted) partitions from the Target (~10B rows).
      4. Load all data from Temporary into the Target (~10B + 7M rows).
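      The same flow expressed as a hedged HiveQL sketch. The table names (tx_target, tx_staging, tx_tmp), column names, and session settings are hypothetical assumptions; in the presentation this logic is generated by Informatica mappings. Steps 2a/2b build the temporary set, and the final INSERT OVERWRITE replaces only the impacted partitions, covering steps 3 and 4 in one statement.

        -- Allow the final insert to overwrite whichever partitions appear in the data
        SET hive.exec.dynamic.partition = true;
        SET hive.exec.dynamic.partition.mode = nonstrict;

        -- Steps 2a + 2b: new/updated rows from staging, plus the unchanged rows that live in the
        -- partitions staging touches (~13M rows in total)
        CREATE TABLE tx_tmp AS
        SELECT u.txid, u.description, u.amount, u.last_modified_date, u.transaction_date
        FROM (
          SELECT s.txid, s.description, s.amount, s.last_modified_date, s.transaction_date
          FROM tx_staging s
          UNION ALL
          SELECT t.txid, t.description, t.amount, t.last_modified_date, t.transaction_date
          FROM tx_target t
          JOIN (SELECT DISTINCT transaction_date FROM tx_staging) p
            ON t.transaction_date = p.transaction_date      -- impacted partitions only
          LEFT OUTER JOIN tx_staging s
            ON t.txid = s.txid
          WHERE s.txid IS NULL                              -- keep only rows not being updated
        ) u;

        -- Steps 3 + 4: overwrite just the impacted partitions of the target with the merged data
        INSERT OVERWRITE TABLE tx_target PARTITION (transaction_date)
        SELECT txid, description, amount, last_modified_date, transaction_date
        FROM tx_tmp;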
    • 10. DEMO
    • 11. Optimize the Entire Data Pipeline: Increase Performance & Productivity on Hadoop. Pipeline capabilities: Archive, Profile, Parse, Cleanse, ETL, Match, Stream, Load, Replicate. Sources: Relational, Mainframe; Documents and Emails; Social Media, Web Logs; Machine Device, Cloud. Consumers: Data Warehouse, Mobile Apps, Analytics & Op Dashboards, Alerts, Analytics Teams.
    • 12. Informatica on Hadoop Benefits • Cost-effectively scale storage and processing (over 2x the performance) • Increase developer productivity (up to 5x over hand-coding) • Continue to leverage existing ETL skills you have today • Informatica Hive partitioning/UPSERT is a key capability for rapid implementation of CDC use-case • Ensure success with proven leader in big data and data warehouse optimization