Dealing With Changed Data on Hadoop
An old data warehouse problem in a new world
Kunal Jain, Big Data Solutions Architect at Informatica
June, 2014
Agenda
• Challenges with Traditional Data Warehouse
• Requirements for Data Warehouse Optimization
• Data Warehouse Optimization Process Flow
• Dealing With Changed Data on Hadoop
• Demo
Challenges With Traditional Data Warehousing
• Expensive to scale as data volumes grow and new data types emerge
• Staging of raw data and ELT processing consume data warehouse capacity
too quickly, forcing costly upgrades
• Network becoming a bottleneck to performance
• Does not handle new types of multi-structured data
• Changes to schemas cause delays in project delivery
Requirements for an Optimized Data Warehouse
• Cost-effective scale-out infrastructure to support unlimited data
volumes
• Leverage commodity hardware and software to lower infrastructure
costs
• Leverage existing skills to lower operational costs
• Must support all types of data
• Must support agile methodologies with schema-on-read, rapid
prototyping, metadata-driven visual IDEs, and collaboration tools
• Integrates with existing and new types of infrastructure
Data Warehouse Optimization Process Flow
BI Reports & AppsData Warehouse
1. Offload data & ELT
processing to Hadoop
3. Parse & prepare
(e.g. ETL, data quality)
data for analysis
4. Move high value
curated data into data
warehouse
2. Batch load raw
data (e.g. transactions,
multi-structured)
Relational, Mainframe
Documents and Emails
Social Media, Web Logs
Machine Device, Cloud
Use Case: Updates in Traditional DW/RDBMS
• Example requirement: a historical table containing 10 billion rows of data
• Every day it receives 10 million incremental rows (70% new inserts,
30% updates)
• Traditional approach: Straightforward to insert and update in a
traditional DW/RDBMS (see the MERGE sketch after this list)
• Challenge: Traditional infrastructure cannot scale to the data size
and is not cost-effective.
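
For reference, this is why the traditional approach is considered straightforward: in a MERGE-capable RDBMS the daily upsert is a single statement. A minimal sketch, assuming illustrative table names transactions (10B-row target) and transactions_stg (10M-row daily increment); Oracle-style syntax shown, and it varies slightly across products:

```sql
-- Hypothetical schema: transactions (target) and transactions_stg (increment).
-- A single MERGE applies both the 30% updates and the 70% inserts.
MERGE INTO transactions t
USING transactions_stg s
   ON (t.txid = s.txid)
WHEN MATCHED THEN UPDATE SET
       t.description        = s.description,
       t.transaction_date   = s.transaction_date,
       t.amount             = s.amount,
       t.last_modified_date = s.last_modified_date
WHEN NOT MATCHED THEN INSERT
       (txid, description, transaction_date, amount, last_modified_date)
       VALUES (s.txid, s.description, s.transaction_date, s.amount,
               s.last_modified_date);
```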
Use Case: Update/Insert in Hadoop/Hive
• Requirement: Use Hive to store massive amounts of data, but need
to perform inserts, deletes and updates.
• Typical approach: Since Hive does not support updates, the common
workaround is a FULL OUTER JOIN of the target with the staging data
followed by a FULL TABLE REFRESH to rewrite the impacted rows (sketched below)
• Challenge: A full table refresh / full outer join on historical tables (10B+
rows) would blow SLAs out of the water
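
A minimal HiveQL sketch of that workaround, assuming illustrative tables transactions (target), transactions_stg (daily increment), and transactions_new (rebuilt copy). The staging row wins wherever a TXID exists in both inputs, and the entire multi-billion-row table is rewritten, which is exactly the problem:

```sql
-- Full-refresh workaround for pre-ACID Hive: rebuild the whole table,
-- preferring the staging row when a TXID appears in both inputs.
-- Table and column names are illustrative.
INSERT OVERWRITE TABLE transactions_new
SELECT
  COALESCE(s.txid, t.txid) AS txid,
  CASE WHEN s.txid IS NULL THEN t.description        ELSE s.description        END AS description,
  CASE WHEN s.txid IS NULL THEN t.transaction_date   ELSE s.transaction_date   END AS transaction_date,
  CASE WHEN s.txid IS NULL THEN t.amount             ELSE s.amount             END AS amount,
  CASE WHEN s.txid IS NULL THEN t.last_modified_date ELSE s.last_modified_date END AS last_modified_date
FROM transactions t
FULL OUTER JOIN transactions_stg s
  ON t.txid = s.txid;
```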
Use Case: Update/Insert in Hadoop/Hive
Target Table (10 billion rows)
| TXID | Description | Transaction Date | Amount | Last Modified Date |
| 1    | Xxxx        | 20-JAN-13        | 200    | 20-JAN-13          |
| 2    | Yyy         | 21-FEB-13        | 300    | 21-FEB-13          |
| 3    | Aaa         | 22-MAR-13        | 400    | 22-MAR-13          |

Staging Table (10 million rows) with 70% Inserts and 30% Updates
| TXID | Description | Transaction Date | Amount | Last Modified Date |        |
| 1    | Xxxx        | 20-JAN-13        | 210    | 23-MAR-13          | UPDATE |
| 4    | Ccc         | 23-MAR-13        | 150    | 23-MAR-13          | INSERT |
| 6    | Bbb         | 23-MAR-13        | 500    | 23-MAR-13          | INSERT |

Target Table (10 billion + 7 million rows) after applying the staging data
| TXID | Description | Transaction Date | Amount | Last Modified Date |
| 1    | Xxxx        | 20-JAN-13        | 210    | 23-MAR-13          |
| 2    | Yyy         | 21-FEB-13        | 300    | 21-FEB-13          |
| 3    | Aaa         | 22-MAR-13        | 400    | 22-MAR-13          |
| 4    | Ccc         | 23-MAR-13        | 150    | 23-MAR-13          |
| 6    | Bbb         | 23-MAR-13        | 500    | 23-MAR-13          |

Partitioning rows by date significantly reduces the total number of
partitions impacted by updates (see the DDL sketch below)
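
A minimal sketch of what such a date-partitioned target might look like in Hive; the table name, column types, and ORC storage format are illustrative assumptions, not taken from the deck:

```sql
-- Illustrative target table partitioned by transaction date, so a daily
-- increment only touches the partitions whose dates appear in staging.
CREATE TABLE transactions (
  txid               BIGINT,
  description        STRING,
  amount             DECIMAL(10,2),   -- requires Hive 0.13+; use DOUBLE on older versions
  last_modified_date STRING
)
PARTITIONED BY (transaction_date STRING)
STORED AS ORC;
```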
Use Case: Update/Insert in Hadoop/Hive
Relational Data Source → Staging (~10M rows; 70% inserts, 30% updates) → Temporary (~13M rows) → Target (~10B rows before, ~10B + 7M after); a HiveQL sketch of this flow follows.
1. Extract & load the daily increment from the relational data source into the Staging table.
2a. Bring the new and updated data from Staging into a Temporary table.
2b. Bring the unchanged data from the impacted Target partitions into the same Temporary table (~13M rows in total).
3. Delete the matching (impacted) partitions from the Target.
4. Load all data from Temporary into the Target.
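
A minimal HiveQL sketch of this flow (not Informatica's generated code); table names transactions, transactions_stg, and transactions_tmp are illustrative, the schema follows the earlier DDL sketch, and dynamic partitioning is assumed to be enabled:

```sql
-- Dynamic-partition settings so a single INSERT OVERWRITE can rebuild
-- only the partitions it actually writes to.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Temporary table with the same schema and partitioning as the target.
CREATE TABLE transactions_tmp LIKE transactions;

-- Steps 2a + 2b: new/updated rows from staging, plus the unchanged rows
-- from only those target partitions that the staging data touches.
INSERT OVERWRITE TABLE transactions_tmp PARTITION (transaction_date)
SELECT u.txid, u.description, u.amount, u.last_modified_date, u.transaction_date
FROM (
  SELECT txid, description, amount, last_modified_date, transaction_date
  FROM transactions_stg                                    -- 2a. inserts + updates
  UNION ALL
  SELECT t.txid, t.description, t.amount, t.last_modified_date, t.transaction_date
  FROM transactions t
  JOIN (SELECT DISTINCT transaction_date FROM transactions_stg) d
    ON t.transaction_date = d.transaction_date             -- impacted partitions only
  LEFT OUTER JOIN transactions_stg s
    ON t.txid = s.txid
  WHERE s.txid IS NULL                                     -- 2b. rows not being updated
) u;

-- Steps 3 + 4: overwrite only the impacted partitions of the target;
-- the rest of the ~10B-row table is untouched.
INSERT OVERWRITE TABLE transactions PARTITION (transaction_date)
SELECT txid, description, amount, last_modified_date, transaction_date
FROM transactions_tmp;
```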
DEMO
Optimize the Entire Data Pipeline
Increase Performance & Productivity on Hadoop
Pipeline diagram: data from Relational and Mainframe systems, Documents and Emails, Social Media and Web Logs, and Machine Device and Cloud sources is loaded, replicated, streamed, or archived into Hadoop; there it is profiled, parsed, cleansed, transformed (ETL), and matched; the results are delivered via services, events, and topics to the Data Warehouse, Mobile Apps, Analytics & Operational Dashboards, Alerts, and Analytics Teams.
Informatica on Hadoop Benefits
• Cost-effectively scale storage and processing (over 2x the
performance)
• Increase developer productivity (up to 5x over hand-coding)
• Continue to leverage existing ETL skills you have today
• Informatica Hive partitioning/UPSERT is a key capability for rapid
implementation of the CDC use case
• Ensure success with a proven leader in big data and data warehouse
optimization

Editor's Notes

  • #4 As data volumes and business complexity grow, traditional scale-up and scale-out architectures become too costly. Therefore most organizations are unable to keep all the data they would like to query directly in the data warehouse. They have to archive the data to more affordable offline systems, such as a storage grid or tape backup. A typical strategy is to define a “time window” for data retention beyond which data is archived. Of course, this data is not in the warehouse, so business users cannot benefit from it. For traditional grid computing, the network was becoming the bottleneck as large data volumes were pushed to the CPU workloads, which placed a limit on how much data could be processed in a reasonable amount of time to meet business SLAs. Traditional warehouses also do not handle new types of multi-structured data, and changes to schemas cause delays in project delivery.
  • #5 The requirements for an optimized DW architecture include: cost-effective scale-out infrastructure to support unlimited data volumes; commodity hardware and software to lower infrastructure costs; the ability to leverage existing skills to lower operational costs; support for all types of data; support for agile methodologies with schema-on-read, rapid prototyping, metadata-driven visual IDEs, and collaboration tools; and integration with existing and new types of infrastructure.
  • #6 First, start by identifying what data and processing to offload from the DW to Hadoop. Inactive or infrequently used data can be moved to a Hadoop-based environment, and transformations that are consuming too much CPU capacity in the DW can be moved as well. Unstructured and multi-structured (i.e. non-relational) data should be staged in Hadoop and not the DW. You can also offload data from relational and mainframe systems to the Hadoop-based environment. For lower-latency data originating in relational databases, data can be replicated, in real time, from relational sources to the Hadoop-based environment; use change data capture (CDC) to capture changes as they occur in your operational transaction systems and propagate these changes to Hadoop. Also, because HDFS doesn’t impose schema requirements on data, unstructured data that was previously not available to the warehouse can be loaded as well. Collect real-time machine and sensor data at the source as it is created and stream it directly into Hadoop instead of staging it in a temporary file system, or worse yet, in the DW. As data is ingested into the Hadoop-based environment, you can leverage the power of high-performance distributed grid computing to parse, extract features, integrate, normalize, standardize, and cleanse data for analysis. Data must be parsed and prepared for further analysis; for example, semi-structured data like JSON or XML is parsed into a tabular format for easier downstream consumption by analysis programs and tools, and data cleansing logic can be applied to increase the data’s trustworthiness. The Hadoop-based environment cost-effectively and automatically scales to prepare all types of data, no matter the volume, for analysis. After the data has been cleansed and transformed, copy the high-value, refined, curated datasets from the Hadoop-based environment into the DW to augment existing tables and make them directly accessible to the enterprise’s existing BI reports and applications.
  • #7 Classic Data Warehouse offloading use case
  • #12 Informatica enables you to define the data processing flow (e.g. ETL, DQ, etc.) with transformations and rules using a visual design UI; we call these mappings. When these data flows or mappings are deployed and run, Informatica optimizes the end-to-end flow from source to target to generate Hive-QL scripts. Transformations that don’t map to HQL, for example name and address cleansing routines, will be run as User Defined Functions (UDFs) via the Vibe virtual data machine libraries that reside on each of the Hadoop nodes. Because we have separated the design from the deployment, you can take existing PowerCenter mappings and run them on Hadoop. In fact, the source and target data don’t have to reside in Hadoop: Informatica will stream the data from the source into Hadoop for processing and then deliver it to the target, whether on Hadoop or another system. Tech notes: Currently, the Vibe virtual data machine library is approximately 1.3 GB of jar and shared library files. Note that the Vibe virtual data machine is not a continuously executing service process (i.e. a daemon), but rather a set of libraries that are executed only within the context of map-reduce jobs.
  • #14 One of the first challenges Hadoop developers face is accessing all the data needed for processing and getting it into Hadoop. Sure, you can build custom adapters and scripts, but several challenges come with that approach. To name a few: custom adapters require expert knowledge of the source systems, applications, data structures, and formats; the custom code must perform and scale as data volumes grow; and, along with the need for speed, security and reliability cannot be overlooked. Thus, building a robust custom adapter takes time and can be costly to maintain as software versions change. On the other hand, Informatica PowerExchange can access data from virtually any data source at any latency (e.g. batch, real time, or near real time) and deliver all your data directly to/from a Hadoop-based environment. Proven path to innovation: 5,000+ customers, 500+ partners, 100,000+ trained Informatica developers; enterprise scalability, security, and support.