Dealing With Changed Data on Hadoop
An old data warehouse problem in a new world
Kunal Jain, Big Data Solutions Architect at Informatica
June, 2014
Agenda
• Challenges with Traditional Data Warehouse
• Requirements for Data Warehouse Optimization
• Data Warehouse Optimization Process Flow
• Dealing With Changed Data on Hadoop
• Demo
Challenges With Traditional Data Warehousing
• Expensive to scale as data volumes grow and new data types emerge
• Staging of raw data and ELT processing consume data warehouse capacity
too quickly, forcing costly upgrades
• Network becoming a bottleneck to performance
• Does not handle new types of multi-structured data
• Changes to schemas cause delays in project delivery
Requirements for an Optimized Data Warehouse
• Cost-effective scale-out infrastructure to support unlimited data
volumes
• Leverage commodity hardware and software to lower infrastructure
costs
• Leverage existing skills to lower operational costs
• Must support all types of data
• Must support agile methodologies with schema-on-read, rapid
prototyping, metadata-driven visual IDEs, and collaboration tools
• Integrates with existing and new types of infrastructure
Data Warehouse Optimization Process Flow
BI Reports & AppsData Warehouse
1. Offload data & ELT
processing to Hadoop
3. Parse & prepare
(e.g. ETL, data quality)
data for analysis
4. Move high value
curated data into data
warehouse
2. Batch load raw
data (e.g. transactions,
multi-structured)
Relational, Mainframe
Documents and Emails
Social Media, Web Logs
Machine Device, Cloud
Use Case: Updates in Traditional DW/RDBMS
• Example requirement: a historical table containing 10 billion rows of data
• Every day it receives 10 million incremental rows (70% new inserts,
30% updates)
• Traditional approach: Straightforward to insert and update in a
traditional DW/RDBMS (see the MERGE sketch after this list)
• Challenge: Traditional infrastructure cannot scale to the data size
and is not cost-effective.
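
For reference, this is why the traditional approach is considered straightforward: in a MERGE-capable RDBMS the daily upsert is a single statement. A minimal sketch, assuming illustrative table names transactions (10B-row target) and transactions_stg (10M-row daily increment); Oracle-style syntax shown, and it varies slightly across products:

```sql
-- Hypothetical schema: transactions (target) and transactions_stg (increment).
-- A single MERGE applies both the 30% updates and the 70% inserts.
MERGE INTO transactions t
USING transactions_stg s
   ON (t.txid = s.txid)
WHEN MATCHED THEN UPDATE SET
       t.description        = s.description,
       t.transaction_date   = s.transaction_date,
       t.amount             = s.amount,
       t.last_modified_date = s.last_modified_date
WHEN NOT MATCHED THEN INSERT
       (txid, description, transaction_date, amount, last_modified_date)
       VALUES (s.txid, s.description, s.transaction_date, s.amount,
               s.last_modified_date);
```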
Use Case: Update/Insert in Hadoop/Hive
• Requirement: Use Hive to store massive amounts of data, but need
to perform inserts, deletes and updates.
• Typical approach: Since Hive does not support updates, the common
workaround is a FULL OUTER JOIN of the target with the staging data
followed by a FULL TABLE REFRESH to rewrite the impacted rows (sketched below)
• Challenge: A full table refresh / full outer join on historical tables (10B+
rows) would blow SLAs out of the water
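
A minimal HiveQL sketch of that workaround, assuming illustrative tables transactions (target), transactions_stg (daily increment), and transactions_new (rebuilt copy). The staging row wins wherever a TXID exists in both inputs, and the entire multi-billion-row table is rewritten, which is exactly the problem:

```sql
-- Full-refresh workaround for pre-ACID Hive: rebuild the whole table,
-- preferring the staging row when a TXID appears in both inputs.
-- Table and column names are illustrative.
INSERT OVERWRITE TABLE transactions_new
SELECT
  COALESCE(s.txid, t.txid) AS txid,
  CASE WHEN s.txid IS NULL THEN t.description        ELSE s.description        END AS description,
  CASE WHEN s.txid IS NULL THEN t.transaction_date   ELSE s.transaction_date   END AS transaction_date,
  CASE WHEN s.txid IS NULL THEN t.amount             ELSE s.amount             END AS amount,
  CASE WHEN s.txid IS NULL THEN t.last_modified_date ELSE s.last_modified_date END AS last_modified_date
FROM transactions t
FULL OUTER JOIN transactions_stg s
  ON t.txid = s.txid;
```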
Use Case: Update/Insert in Hadoop/Hive
Target Table (10 billion rows)
| TXID | Description | Transaction Date | Amount | Last Modified Date |
| 1    | Xxxx        | 20-JAN-13        | 200    | 20-JAN-13          |
| 2    | Yyy         | 21-FEB-13        | 300    | 21-FEB-13          |
| 3    | Aaa         | 22-MAR-13        | 400    | 22-MAR-13          |

Staging Table (10 million rows) with 70% Inserts and 30% Updates
| TXID | Description | Transaction Date | Amount | Last Modified Date |        |
| 1    | Xxxx        | 20-JAN-13        | 210    | 23-MAR-13          | UPDATE |
| 4    | Ccc         | 23-MAR-13        | 150    | 23-MAR-13          | INSERT |
| 6    | Bbb         | 23-MAR-13        | 500    | 23-MAR-13          | INSERT |

Target Table (10 billion + 7 million rows) after applying the staging data
| TXID | Description | Transaction Date | Amount | Last Modified Date |
| 1    | Xxxx        | 20-JAN-13        | 210    | 23-MAR-13          |
| 2    | Yyy         | 21-FEB-13        | 300    | 21-FEB-13          |
| 3    | Aaa         | 22-MAR-13        | 400    | 22-MAR-13          |
| 4    | Ccc         | 23-MAR-13        | 150    | 23-MAR-13          |
| 6    | Bbb         | 23-MAR-13        | 500    | 23-MAR-13          |

Partitioning rows by date significantly reduces the total number of
partitions impacted by updates (see the DDL sketch below)
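
A minimal sketch of what such a date-partitioned target might look like in Hive; the table name, column types, and ORC storage format are illustrative assumptions, not taken from the deck:

```sql
-- Illustrative target table partitioned by transaction date, so a daily
-- increment only touches the partitions whose dates appear in staging.
CREATE TABLE transactions (
  txid               BIGINT,
  description        STRING,
  amount             DECIMAL(10,2),   -- requires Hive 0.13+; use DOUBLE on older versions
  last_modified_date STRING
)
PARTITIONED BY (transaction_date STRING)
STORED AS ORC;
```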
Use Case: Update/Insert in Hadoop/Hive
Relational Data Source → Staging (~10M rows; 70% inserts, 30% updates) → Temporary (~13M rows) → Target (~10B rows before, ~10B + 7M after); a HiveQL sketch of this flow follows.
1. Extract & load the daily increment from the relational data source into the Staging table.
2a. Bring the new and updated data from Staging into a Temporary table.
2b. Bring the unchanged data from the impacted Target partitions into the same Temporary table (~13M rows in total).
3. Delete the matching (impacted) partitions from the Target.
4. Load all data from Temporary into the Target.
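
A minimal HiveQL sketch of this flow (not Informatica's generated code); table names transactions, transactions_stg, and transactions_tmp are illustrative, the schema follows the earlier DDL sketch, and dynamic partitioning is assumed to be enabled:

```sql
-- Dynamic-partition settings so a single INSERT OVERWRITE can rebuild
-- only the partitions it actually writes to.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Temporary table with the same schema and partitioning as the target.
CREATE TABLE transactions_tmp LIKE transactions;

-- Steps 2a + 2b: new/updated rows from staging, plus the unchanged rows
-- from only those target partitions that the staging data touches.
INSERT OVERWRITE TABLE transactions_tmp PARTITION (transaction_date)
SELECT u.txid, u.description, u.amount, u.last_modified_date, u.transaction_date
FROM (
  SELECT txid, description, amount, last_modified_date, transaction_date
  FROM transactions_stg                                    -- 2a. inserts + updates
  UNION ALL
  SELECT t.txid, t.description, t.amount, t.last_modified_date, t.transaction_date
  FROM transactions t
  JOIN (SELECT DISTINCT transaction_date FROM transactions_stg) d
    ON t.transaction_date = d.transaction_date             -- impacted partitions only
  LEFT OUTER JOIN transactions_stg s
    ON t.txid = s.txid
  WHERE s.txid IS NULL                                     -- 2b. rows not being updated
) u;

-- Steps 3 + 4: overwrite only the impacted partitions of the target;
-- the rest of the ~10B-row table is untouched.
INSERT OVERWRITE TABLE transactions PARTITION (transaction_date)
SELECT txid, description, amount, last_modified_date, transaction_date
FROM transactions_tmp;
```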
DEMO
Optimize the Entire Data Pipeline
Increase Performance & Productivity on Hadoop
Pipeline diagram: data from Relational and Mainframe systems, Documents and Emails, Social Media and Web Logs, and Machine Device and Cloud sources is loaded, replicated, streamed, or archived into Hadoop; there it is profiled, parsed, cleansed, transformed (ETL), and matched; the results are delivered via services, events, and topics to the Data Warehouse, Mobile Apps, Analytics & Operational Dashboards, Alerts, and Analytics Teams.
Informatica on Hadoop Benefits
• Cost-effectively scale storage and processing (over 2x the
performance)
• Increase developer productivity (up to 5x over hand-coding)
• Continue to leverage existing ETL skills you have today
• Informatica Hive partitioning/UPSERT is a key capability for rapid
implementation of the CDC use case
• Ensure success with a proven leader in big data and data warehouse
optimization

Editor's Notes

  • #4 As data volumes and business complexity grow, traditional scale-up and scale-out architectures become too costly. Therefore most organizations are unable to keep all the data they would like to query directly in the data warehouse. They have to archive the data to more affordable offline systems, such as a storage grid or tape backup. A typical strategy is to define a “time window” for data retention beyond which data is archived. Of course, this data is not in the warehouse, so business users cannot benefit from it. For traditional grid computing, the network was becoming the bottleneck as large data volumes were pushed to the CPU workloads, which placed a limit on how much data could be processed in a reasonable amount of time to meet business SLAs. Traditional warehouses also do not handle new types of multi-structured data, and changes to schemas cause delays in project delivery.
  • #5 The requirements for an optimized DW architecture include: cost-effective scale-out infrastructure to support unlimited data volumes; commodity hardware and software to lower infrastructure costs; the ability to leverage existing skills to lower operational costs; support for all types of data; support for agile methodologies with schema-on-read, rapid prototyping, metadata-driven visual IDEs, and collaboration tools; and integration with existing and new types of infrastructure.
  • #6 First, start by identifying what data and processing to offload from the DW to Hadoop. Inactive or infrequently used data can be moved to a Hadoop-based environment, and transformations that are consuming too much CPU capacity in the DW can be moved as well. Unstructured and multi-structured (i.e. non-relational) data should be staged in Hadoop and not the DW. You can also offload data from relational and mainframe systems to the Hadoop-based environment. For lower-latency data originating in relational databases, data can be replicated, in real time, from relational sources to the Hadoop-based environment; use change data capture (CDC) to capture changes as they occur in your operational transaction systems and propagate these changes to Hadoop. Also, because HDFS doesn’t impose schema requirements on data, unstructured data that was previously not available to the warehouse can be loaded as well. Collect real-time machine and sensor data at the source as it is created and stream it directly into Hadoop instead of staging it in a temporary file system, or worse yet, in the DW. As data is ingested into the Hadoop-based environment, you can leverage the power of high-performance distributed grid computing to parse, extract features, integrate, normalize, standardize, and cleanse data for analysis. Data must be parsed and prepared for further analysis; for example, semi-structured data like JSON or XML is parsed into a tabular format for easier downstream consumption by analysis programs and tools, and data cleansing logic can be applied to increase the data’s trustworthiness. The Hadoop-based environment cost-effectively and automatically scales to prepare all types of data, no matter the volume, for analysis. After the data has been cleansed and transformed, copy the high-value, refined, curated datasets from the Hadoop-based environment into the DW to augment existing tables and make them directly accessible to the enterprise’s existing BI reports and applications.
  • #7 Classic Data Warehouse offloading use case
  • #12 Informatica enables you to define the data processing flow (e.g. ETL, DQ, etc.) with transformations and rules using a visual design UI; we call these mappings. When these data flows or mappings are deployed and run, Informatica optimizes the end-to-end flow from source to target to generate Hive-QL scripts. Transformations that don’t map to HQL, for example name and address cleansing routines, will be run as User Defined Functions (UDFs) via the Vibe virtual data machine libraries that reside on each of the Hadoop nodes. Because we have separated the design from the deployment, you can take existing PowerCenter mappings and run them on Hadoop. In fact, the source and target data don’t have to reside in Hadoop: Informatica will stream the data from the source into Hadoop for processing and then deliver it to the target, whether on Hadoop or another system. Tech notes: Currently, the Vibe virtual data machine library is approximately 1.3 GB of jar and shared library files. Note that the Vibe virtual data machine is not a continuously executing service process (i.e. a daemon), but rather a set of libraries that are executed only within the context of map-reduce jobs.
  • #14 One of the first challenges Hadoop developers face is accessing all the data needed for processing and getting it into Hadoop. Sure, you can build custom adapters and scripts, but several challenges come with that approach. To name a few: custom adapters require expert knowledge of the source systems, applications, data structures, and formats; the custom code must perform and scale as data volumes grow; and, along with the need for speed, security and reliability cannot be overlooked. Thus, building a robust custom adapter takes time and can be costly to maintain as software versions change. On the other hand, Informatica PowerExchange can access data from virtually any data source at any latency (e.g. batch, real time, or near real time) and deliver all your data directly to/from a Hadoop-based environment. Proven path to innovation: 5,000+ customers, 500+ partners, 100,000+ trained Informatica developers; enterprise scalability, security, and support.