TPC-DI The First Industry Benchmark for Data Integration 
Meikel Poess, Tilmann Rabl, 
Hans-Arno Jacobsen, Brian Caufield 
VLDB 2014, Hangzhou, China, September 4
Data Integration 
•Data Integration (DI) covers a variety of scenarios 
•Data acquisition for business intelligence, analytics and data warehousing 
•Data migration between systems 
•Data conversion 
•etc. 
•All of the above require: 
1.Data Extraction from Multiple Sources 
2.Data Transformation from Multiple Formats into a Target Format 
3.Consolidating data into one or more Target Systems 
9/04/2014 
TPC-DI 
2
Why a Data Integration Benchmark 
•Vendors 
•Publish performance numbers without always providing detailed information on how performance numbers were obtained 
•Compare performance numbers that might not be comparable 
•Results are of little value for customers who would like to evaluate data integration tools across vendors 
•Situation is similar to the 1980s, when vendors compared the performance of OLTP systems using a variety of workloads and metrics, which eventually led to the creation of the TPC. 
•See “The History Of DebitCredit and the TPC” by Omri Serlin 
JasperETL can be used for both analytic decision support system tasks such as updating data warehouses or marts, as well as for operational solutions such as data consolidation, duplication, synchronization, quality, migration, and change data capture. Performance tests indicate performance up to 50% faster than other leading commercial ETL tools. 
Microsoft and Unisys announced a record for loading data into a relational database using an Extract, Transform and Load (ETL) tool. Over 1 TB of TPC-H data was loaded in under 30 minutes. SSIS 
PowerCenter 8 running across 64 CPUs loaded 1TB into Oracle in just 45 minutes, compared to 95 minutes for PowerCenter 7. Additional tests that targeted flat files took just 32 minutes, a new world record based on published benchmarks. As in past benchmarks, PowerCenter 8 exhibited near-perfect linear scalability across 16-, 32- and 64-CPU HP Integrity Superdome server configurations. Informatica
Outline 
•Scope of TPC-DI 
•General Concepts 
•Data Model 
•Source Model 
•Target Model 
•Data Set 
•Transformations 
•Metric and Execution rules 
•Experimental Results 
General Concepts 
•TPC-DI models the data integration of a fictitious retail brokerage firm: 
Main Trading System 
Internal Human Resource System 
Internal Customer Relationship Management System 
Externally acquired data 
•Operations measured use the above model, but are not limited to those of a brokerage firm 
•They capture the variety and complexity of typical DI tasks: 
Loading of large volumes of historical data 
Loading of incremental updates 
Execution of a variety of transformation types using various input types and various target types with inter-table relationships 
Assuring consistency of loaded data 
•Benchmark is technology agnostic 
Scope of TPC-DI 
•Out of scope 
•Extraction of data from operational systems 
•Transport of data into a staging area 
•Source-system data is provided by a data generator based on PDGF 
•In scope 
•Reading of data from the staging area 
•Transformation of data and its insertion into the target system 
•Storing of intermediate results 
•Verification of transformed data 
Source Schema 
•18 different source tables 
•Various formats 
•CSV (Comma Separated) 
•CDC (Change Data Capture) 
•Multi Record 
•DEL (Pipe Delimited) 
•XML 
•Some used only for the Historical Load 
•Some used only for the Incremental Load 
•Some used in both Historical and Incremental Loads 
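The mix of source formats can be illustrated with a minimal Python sketch that reads comma-separated, pipe-delimited and XML records into one common in-memory shape. The column names, sample records and helper functions (read_delimited, read_xml) are hypothetical and are not taken from the TPC-DI source schema.

```python
import csv
import io
import xml.etree.ElementTree as ET


def read_delimited(text, delimiter, columns):
    """Parse delimited text into a list of dicts keyed by the given column names."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [dict(zip(columns, row)) for row in reader if row]


def read_xml(text, record_tag, columns):
    """Parse XML text into a list of dicts, one per element named record_tag."""
    root = ET.fromstring(text)
    return [{c: rec.findtext(c) for c in columns} for rec in root.iter(record_tag)]


# Hypothetical sample data; real TPC-DI files have different layouts.
columns = ["id", "name", "status"]
csv_rows = read_delimited("1,Alice,ACTV\n2,Bob,INAC\n", ",", columns)
del_rows = read_delimited("3|Carol|ACTV\n", "|", columns)
xml_rows = read_xml(
    "<rows><row><id>4</id><name>Dan</name><status>ACTV</status></row></rows>",
    "row",
    columns,
)

# All three formats end up in the same record shape for later transformations.
staged = csv_rows + del_rows + xml_rows
```

Normalizing every input into the same record shape early keeps the later transformation steps independent of the file format.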
Target Schema 
•Dimensional schema 
•5 fact tables 
•9 dimension tables 
•5 reference tables (dimensions in the strict sense of a star schema) 
Data Set 
Data Generation (PDGF) 
•Based on Parallel Data Generation Framework (PDGF) 
Generic – can generate any schema 
Configurable – XML configuration files for schema and output format 
Extensible – plug-in mechanism for distributions and specialized data generation formats 
Efficient – utilizes all system resources to a maximum degree (if desired) 
Scalable – parallel generation for modern multi-core SMPs and clustered systems 
•Evaluation 
2 Intel E5-2450 sockets 
16 cores, 32 hardware threads 
1-42 workers (= degree of parallelism) 
Almost linear scale-up with cores 
Slowdown after 38 workers
18 Transformations 
1.Transfer XML to relational data 
2.Update DIMessage file 
3.Convert CSV to relational data 
4.Merge multiple input files of the same structure 
5.Convert missing values to NULL 
6.Standardize entries of the input files 
7.Join data from input file to dimension table 
8.Perform extensive arithmetic calculations 
9.Join data from multiple input files with separate structures 
10.Consolidate multiple change records per day and identify most current 
11.Read data from files with variable type records 
12.Check data for errors or for adherence to business rules 
13.Detect changes in fact data, and journal updates to reflect current state 
14.Detect changes in dimension data, and apply appropriate tracking mechanisms for history-keeping dimensions 
15.Filter input data according to pre-defined conditions 
16.Identify new, deleted and updated records in input data 
17.Join data of one input file to data from another input file with different structure 
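Two of the listed transformation types, converting missing values to NULL (5) and consolidating multiple change records per day (10), can be sketched in a few lines of Python. The record layout (key, ts, phone) and the function names are assumptions made for illustration; the benchmark specifies its transformations in English text, not in code.

```python
from datetime import datetime


def to_null(value):
    """Transformation 5: map empty or whitespace-only fields to None (NULL)."""
    return value if value is not None and value.strip() != "" else None


def latest_per_day(records):
    """Transformation 10: keep only the most recent change record per (key, day)."""
    latest = {}
    for rec in records:
        ts = datetime.fromisoformat(rec["ts"])
        slot = (rec["key"], ts.date())
        if slot not in latest or ts > latest[slot][0]:
            latest[slot] = (ts, rec)
    return [rec for _, rec in latest.values()]


# Hypothetical change records: two updates for C1 on the same day, one for C2.
changes = [
    {"key": "C1", "ts": "2014-09-04T09:00:00", "phone": ""},
    {"key": "C1", "ts": "2014-09-04T17:30:00", "phone": "555-0100"},
    {"key": "C2", "ts": "2014-09-04T12:00:00", "phone": "  "},
]

# Consolidate first, then normalize missing values in the surviving records.
cleaned = [{**r, "phone": to_null(r["phone"])} for r in latest_per_day(changes)]
```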
Transformations 
•No standard language exists for defining transformations 
•Transformations are specified in English text 
•Correctness of transformation implementations is guaranteed by: 
Independent audit by a certified TPC auditor 
Correctness queries run during the benchmark run and at benchmark completion 
Qualification run on a small scale factor and comparison of results with reference output 
Pseudo code example: DimAccount
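As an illustration of what an English-text transformation rule can look like once implemented, the following Python sketch shows one common way to keep history in a dimension table: when an attribute changes, the current row is closed and a new current row is appended. It is not the benchmark's DimAccount pseudo code. The field names (is_current, effective_date, end_date), the END_OF_TIME sentinel and the apply_change function are assumptions chosen for this sketch.

```python
from datetime import date

END_OF_TIME = date(9999, 12, 31)  # assumed sentinel meaning "still current"


def apply_change(dimension, key, attributes, batch_date):
    """Expire the current row for key if its attributes changed, then add a new row."""
    for row in dimension:
        if row["key"] == key and row["is_current"]:
            if row["attributes"] == attributes:
                return  # nothing changed, keep the existing current row
            row["is_current"] = False
            row["end_date"] = batch_date
    dimension.append({
        "key": key,
        "attributes": attributes,
        "is_current": True,
        "effective_date": batch_date,
        "end_date": END_OF_TIME,
    })


# Hypothetical account that is created and later changes status.
dim = []
apply_change(dim, "A1", {"status": "ACTIVE"}, date(2014, 1, 1))
apply_change(dim, "A1", {"status": "INACTIVE"}, date(2014, 9, 4))
apply_change(dim, "A1", {"status": "INACTIVE"}, date(2014, 9, 5))  # no-op
```

After these calls the table holds two rows for A1: an expired row ending on 2014-09-04 and one current row, so earlier facts can still be joined to the attribute values that were valid at the time.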
Execution Rules 
•Un-timed part prepares the system 
•Timed part measures the data integration performance: 
Historical Load: Initial load of decision support system from historical records or due to restructuring of decision support system 
Incremental Loads: Periodic incremental updates of daily feeds 
Two incremental loads to measure the effect of data structure maintenance 
•No phase may overlap 
Metric 
Metric Characteristics 
•One performance metric: makes ranking of results easy 
•Throughput metric: rows processed per second 
•Geometric mean of the throughputs during historical and incremental load phases: encourages performance improvements in all phases 
E.g. reducing the elapsed time of the historical load from 100s to 90s has the same impact on the metric as reducing the elapsed time of the shorter incremental load from 10s to 9s.
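The 100s/90s versus 10s/9s example can be checked numerically. The sketch below assumes a simplified metric, the geometric mean of two throughputs with the same hypothetical row count in both phases, and deliberately ignores the 1800s floor on incremental elapsed times. The names ROWS and metric are illustrative only.

```python
import math

ROWS = 1_000_000  # hypothetical row count, identical for both phases


def metric(historical_s, incremental_s):
    """Geometric mean of historical and incremental throughput (rows per second)."""
    return math.sqrt((ROWS / historical_s) * (ROWS / incremental_s))


baseline = metric(100, 10)

# Relative gain from a 10% shorter historical load versus a 10% shorter incremental load.
gain_historical = metric(90, 10) / baseline
gain_incremental = metric(100, 9) / baseline
```

Both gains equal sqrt(10/9), roughly a 5.4% improvement, because the geometric mean weighs relative changes in each phase equally regardless of how long the phase takes in absolute terms.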
Metric Analysis 
•Metric encourages the processing of a sufficiently large amount of data 
•Actual amount of data processed depends on the system performance 
•The higher the performance of a system, the more data it needs to process 
•While the benchmark rules allow elapsed times of less than 1800s, there is a negative performance impact due to the max function in the denominator of the incremental throughput functions (T1 and T2) 
•The above graph shows the performance of a system with load performance linear to data size. 
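The effect of the max function can be made concrete with a small Python sketch. The formula is an assumption pieced together from the statements on these slides: incremental throughput is rows divided by max(elapsed time, 1800s), the smaller of the two incremental throughputs is used, and the result is combined with the historical throughput by a geometric mean. The normative definition is in the TPC-DI specification, and the function names are hypothetical.

```python
import math

MIN_ELAPSED_S = 1800  # floor applied to incremental elapsed times, per the slide


def incremental_throughput(rows, elapsed_s):
    """Rows per second, with elapsed times below the floor counted as the floor."""
    return rows / max(elapsed_s, MIN_ELAPSED_S)


def benchmark_metric(rows_h, elapsed_h, rows_i1, elapsed_i1, rows_i2, elapsed_i2):
    """Assumed metric: geometric mean of historical and worst incremental throughput."""
    t_h = rows_h / elapsed_h
    t_i = min(incremental_throughput(rows_i1, elapsed_i1),
              incremental_throughput(rows_i2, elapsed_i2))
    return math.sqrt(t_h * t_i)


# Finishing the same incremental load in 900s instead of 1800s earns no credit.
fast = incremental_throughput(1_800_000, 900)
on_target = incremental_throughput(1_800_000, 1800)

# The faster system must process more data to show its advantage.
larger = incremental_throughput(3_600_000, 1800)
```

Under these assumptions a system that finishes an incremental load early is scored as if it had taken the full 1800s, which is why a faster system has to run a larger scale factor to report a higher result.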
Metric Analysis 
•Metric encourages constant elapsed times of consecutive incremental loads due to min(T1,T2) 
•Metric scales linearly with scale factor 
Important for measuring scale-out and scale-up solutions 
System with double the resources, e.g. CPU, memory, should show double the performance 
This is only true if a scale factor is chosen that results in an elapsed time of 1800s. 
Experimental Results 
•TPC-DI was run with 5 scale factors 
•Figure shows normalized data 
X-axis shows normalized data size 
Y-axis shows normalized elapsed time 
•Results show linear scalability 
•In another experiment it was shown that the elapsed time remained constant when the hardware resources were scaled to match the data size, i.e. double the data size with double the number of CPUs, I/O and memory 
Experimental Results 
•Figure shows: 
X-axis: three different data sizes in Gigabytes 
Y-axis: percentage of elapsed time spent in historical load, incremental load 1 and incremental load 2 
•Across data sizes, 80% of the elapsed time is spent in the historical load and 20% in the incremental update phases. 
•This can be used to extrapolate the total elapsed time of benchmark runs. 
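The extrapolation amounts to simple arithmetic, sketched below in Python. It assumes the 80/20 split holds for the run in question. The function name and the sample elapsed time of 7200s are hypothetical.

```python
HISTORICAL_SHARE = 0.80  # share of total elapsed time reported for the historical load


def extrapolate_total_elapsed(historical_elapsed_s, historical_share=HISTORICAL_SHARE):
    """Estimate total run time from the historical load time and its assumed share."""
    return historical_elapsed_s / historical_share


# A historical load of 7200s suggests a total of about 9000s,
# leaving about 1800s for the two incremental loads together.
total_s = extrapolate_total_elapsed(7200)
incremental_s = total_s - 7200
```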
Conclusion 
•New TPC Standard Benchmark for Data Integration 
Accepted in January 2014 
•Brokerage firm business model 
OLTP system, HR, CRM, external data 
18 transformations into integrated data warehouse 
•Covers 
Multiple formats 
Historical load and updates 
Complex interdependencies 
Questions? 
•Thank You! 
•TPC-DI website 
http://www.tpc.org/tpcdi/default.asp 

Partners Life - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

TPC-DI - The First Industry Benchmark for Data Integration

  • 1. TPC-DI The First Industry Benchmark for Data Integration Meikel Poess, Tilmann Rabl, Hans-Arno Jacobsen, Brian Caufield VLDB 2014, Hangzhou, China, September 4
  • 2. Data Integration
    • Data Integration (DI) covers a variety of scenarios:
      • Data acquisition for business intelligence, analytics and data warehousing
      • Data migration between systems
      • Data conversion
      • etc.
    • All of the above require:
      1. Data extraction from multiple sources
      2. Data transformation from multiple formats into a target format
      3. Consolidation of data into one or more target systems
  • 3. Why a Data Integration Benchmark
    • Vendors
      • Publish performance numbers without always providing detailed information on how those numbers were obtained
      • Compare performance numbers that might not be comparable
    • Results are of little value to customers who would like to evaluate data integration tools across vendors
    • The situation is similar to the 1980s, when vendors compared the performance of OLTP systems using a variety of workloads and metrics, which eventually led to the creation of the TPC
      • See “The History Of DebitCredit and the TPC” by Omri Serlin
    • Vendor claims quoted on the slide:
      • “JasperETL can be used for both analytic decision support system tasks such as updating data warehouses or marts, as well as for operational solutions such as data consolidation, duplication, synchronization, quality, migration, and change data capture. Performance tests indicate performance up to 50% faster than other leading commercial ETL tools.”
      • SSIS: “Microsoft and Unisys announced a record for loading data into a relational database using an Extract, Transform and Load (ETL) tool. Over 1 TB of TPC-H data was loaded in under 30 minutes.”
      • Informatica: “PowerCenter 8 running across 64 CPUs loaded 1TB into Oracle in just 45 minutes, compared to 95 minutes for PowerCenter 7. Additional tests that targeted flat files took just 32 minutes, a new world record based on published benchmarks. As in past benchmarks, PowerCenter 8 exhibited near-perfect linear scalability across 16-, 32- and 64-CPU HP Integrity Superdome server configurations.”
  • 4. Outline
    • Scope of TPC-DI
    • General Concepts
    • Data Model
    • Source Model
    • Target Model
    • Data Set
    • Transformations
    • Metric and Execution Rules
    • Experimental Results
  • 5. General Concepts
    • TPC-DI models the data integration of a fictitious retail brokerage firm:
      • Main trading system
      • Internal human resource system
      • Internal customer relationship management system
      • Externally acquired data
    • The operations measured use the above model, but are not limited to those of a brokerage firm
    • They capture the variety and complexity of typical DI tasks:
      • Loading of large volumes of historical data
      • Loading of incremental updates
      • Execution of a variety of transformation types using various input types and various target types with inter-table relationships
      • Assuring consistency of loaded data
    • The benchmark is technology agnostic
  • 6. Scope of TPC-DI
    • Out of scope
      • Extraction of data from operational systems
      • Transport of data into a staging area
      • Source system data is instead provided by a data generator based on PDGF
    • In scope
      • Reading of data from the staging area
      • Transformation of data and its insertion into the target system
      • Storing of intermediate results
      • Verification of transformed data
  • 7. Source Schema
    • 18 different source tables
    • Various formats
      • CSV (comma separated)
      • CDC (change data capture)
      • Multi-record
      • DEL (pipe delimited)
      • XML
    • Some are used only for the Historical Load
    • Some are used only for the Incremental Load
    • Some are used in both the Historical and Incremental Loads
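As a rough illustration of the format variety listed above, the following Python sketch reads comma-separated, pipe-delimited and XML input into plain row dictionaries using only the standard library. The sample records, the column names and the helper names `read_delimited` and `read_xml_rows` are hypothetical; they are not taken from the TPC-DI specification or its data generator.

```python
import csv
import io
import xml.etree.ElementTree as ET


def read_delimited(text, delimiter, columns):
    """Parse header-less delimited text into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [dict(zip(columns, row)) for row in reader]


def read_xml_rows(text, record_tag):
    """Flatten each <record_tag> element into its attributes plus child text."""
    root = ET.fromstring(text)
    rows = []
    for elem in root.iter(record_tag):
        row = dict(elem.attrib)
        for child in elem:
            row[child.tag] = child.text
        rows.append(row)
    return rows


# Hypothetical sample inputs, one per format.
csv_text = "1,Alice,NY\n2,Bob,CA\n"
del_text = "10|ACME|2014-09-04\n11|Initech|2014-09-05\n"
xml_text = "<Customers><Customer id='7'><Name>Carol</Name></Customer></Customers>"

csv_rows = read_delimited(csv_text, ",", ["id", "name", "state"])
del_rows = read_delimited(del_text, "|", ["id", "company", "date"])
xml_rows = read_xml_rows(xml_text, "Customer")
```

Mapping every format to the same row shape early keeps the later transformation steps independent of the input encoding, which is one plausible way to handle a mixed staging area.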
  • 8. Target Schema
    • Dimensional schema
      • 5 fact tables
      • 9 dimension tables
      • 5 reference tables (dimensions in the strict sense of a star schema)
  • 9. Data Set (figure only; not reproduced in this transcript)
  • 10. Data Generation (PDGF)
    • Based on the Parallel Data Generation Framework (PDGF)
      • Generic: can generate any schema
      • Configurable: XML configuration files for schema and output format
      • Extensible: plug-in mechanism for
        • Distributions
        • Specialized data generation formats
      • Efficient: utilizes all system resources to a maximum degree (if desired)
      • Scalable: parallel generation for modern multi-core SMPs and clustered systems
    • Evaluation
      • Two Intel E5-2450 sockets
      • 16 cores, 32 hardware threads
      • 1-42 workers (= degree of parallelism)
      • Almost linear scale-up with cores
      • Slowdown beyond 38 workers
  • 11. 18 Transformations
    1. Transfer XML to relational data
    2. Update DIMessage file
    3. Convert CSV to relational data
    4. Merge multiple input files of the same structure
    5. Convert missing values to NULL
    6. Standardize entries of the input files
    7. Join data from input file to dimension table
    8. Perform extensive arithmetic calculations
    9. Join data from multiple input files with separate structures
    10. Consolidate multiple change records per day and identify the most current
    11. Read data from files with variable type records
    12. Check data for errors or for adherence to business rules
    13. Detect changes in fact data and journal updates to reflect the current state
    14. Detect changes in dimension data and apply appropriate tracking mechanisms for history-keeping dimensions
    15. Filter input data according to pre-defined conditions
    16. Identify new, deleted and updated records in input data
    17. Join data of one input file to data from another input file with different structure
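To make two of the listed transformation types more concrete, here is a minimal Python sketch of item 5 (convert missing values to NULL) and item 10 (consolidate multiple change records per day and keep the most current). The record layout, the field names `key`, `changed_at` and `status`, and the function names are assumptions made for illustration; the benchmark itself specifies its transformations in English text, not in code.

```python
from datetime import datetime


def to_null(value):
    """Item 5 (sketch): treat empty or blank strings as NULL (None)."""
    if value is None or value.strip() == "":
        return None
    return value


def latest_per_day(records):
    """Item 10 (sketch): keep only the most recent change per (key, day)."""
    latest = {}
    for rec in records:
        ts = datetime.fromisoformat(rec["changed_at"])
        slot = (rec["key"], ts.date())
        if slot not in latest or ts > latest[slot][0]:
            latest[slot] = (ts, rec)
    # Return the surviving records ordered by key and day for determinism.
    return [latest[slot][1] for slot in sorted(latest)]


# Hypothetical change records: two changes to A1 on the same day.
changes = [
    {"key": "A1", "changed_at": "2014-09-04T09:00:00", "status": "open"},
    {"key": "A1", "changed_at": "2014-09-04T17:30:00", "status": ""},
    {"key": "A1", "changed_at": "2014-09-05T08:00:00", "status": "closed"},
    {"key": "B2", "changed_at": "2014-09-04T12:00:00", "status": "open"},
]

cleaned = [dict(rec, status=to_null(rec["status"])) for rec in latest_per_day(changes)]
```

In this sample, the 09:00 change to `A1` is superseded by the 17:30 change on the same day, and the blank status of the surviving record becomes `None`.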
  • 12. Transformations
    • There is no standard language for defining transformations
    • Transformations are specified in English text
    • Correctness of transformation implementations is guaranteed by:
      • Independent audit by a certified TPC auditor
      • Correctness queries run during the benchmark run and at benchmark completion
      • Qualification run on a small scale factor and comparison of results with reference output
    • Pseudo code example: DimAccount (figure not reproduced in this transcript)
  • 13. Execution Rules
    • The un-timed part prepares the system
    • The timed part measures the data integration performance:
      • Historical Load: initial load of the decision support system from historical records or due to restructuring of the decision support system
      • Incremental Loads: periodic incremental updates of daily feeds
      • Two incremental loads to measure the effect of data structure maintenance
    • No phase may overlap
  • 15. Metric Characteristics
    • One performance metric
      • Makes ranking of results easy
    • Throughput metric: rows processed per second
    • Geometric mean of the throughputs during the historical and incremental load phases
      • Encourages performance improvements in all phases
      • E.g. reducing the elapsed time of the historical load from 100 s to 90 s has the same impact on the metric as reducing the elapsed time of the shorter incremental load from 10 s to 9 s
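The 100 s to 90 s versus 10 s to 9 s example can be checked numerically. The sketch below uses a plain two-term geometric mean of rows-per-second throughputs; the row counts are hypothetical, and the function deliberately ignores any other terms the official metric may contain.

```python
import math

# Hypothetical row counts for the two phases (not from the specification).
ROWS_HISTORICAL = 1_000_000
ROWS_INCREMENTAL = 100_000


def geometric_mean(a, b):
    """Geometric mean of two positive numbers."""
    return math.sqrt(a * b)


def simple_metric(hist_seconds, incr_seconds):
    """Geometric mean of the two phase throughputs in rows per second."""
    return geometric_mean(ROWS_HISTORICAL / hist_seconds,
                          ROWS_INCREMENTAL / incr_seconds)


baseline = simple_metric(100, 10)
faster_hist = simple_metric(90, 10)   # historical load: 100 s -> 90 s
faster_incr = simple_metric(100, 9)   # incremental load: 10 s -> 9 s
```

Both improvements are a 10% reduction in elapsed time, so each multiplies the metric by the same factor, sqrt(10/9), regardless of how long the phase takes in absolute terms. An arithmetic mean would not have this property.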
  • 16. Metric Analysis
    • The metric encourages the processing of a sufficiently large amount of data
    • The actual amount of data processed depends on the system performance
    • The higher the performance of a system, the more data it needs to process
    • While the benchmark rules allow elapsed times of less than 1800 s, there is a negative performance impact due to the max function in the denominator of the incremental throughput functions (T1 and T2)
    • The graph on the slide (not reproduced in this transcript) shows the performance of a system whose load performance is linear in data size
  • 17. Metric Analysis
    • The metric encourages constant elapsed times of consecutive incremental loads due to min(T1,T2)
    • The metric scales linearly with the scale factor
      • Important for measuring scale-out and scale-up solutions
      • A system with double the resources, e.g. CPU and memory, should show double the performance
      • This is only true if a scale factor is chosen that results in an elapsed time of 1800 s
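The slide that states the metric formula is not part of this transcript, so the following Python sketch only reconstructs a form consistent with the bullets above: incremental throughput divides by max(elapsed, 1800), and the combined metric takes the geometric mean of the historical throughput and min(T1, T2). The exact form, the function names and all row counts are assumptions; the TPC-DI specification is the authoritative source.

```python
import math

FLOOR_SECONDS = 1800  # threshold mentioned in the slides


def incremental_throughput(rows, elapsed_seconds):
    """Assumed form: elapsed times below the floor count as the floor."""
    return rows / max(elapsed_seconds, FLOOR_SECONDS)


def di_metric(rows_h, secs_h, rows_1, secs_1, rows_2, secs_2):
    """Assumed form: geometric mean of T_H and the slower incremental load."""
    t_h = rows_h / secs_h
    t_1 = incremental_throughput(rows_1, secs_1)
    t_2 = incremental_throughput(rows_2, secs_2)
    return math.sqrt(t_h * min(t_1, t_2))


# Finishing an incremental load in half the time earns nothing below the floor.
at_floor = incremental_throughput(1_800_000, 1800)
below_floor = incremental_throughput(1_800_000, 900)

# Uneven incremental loads are scored by the slower one.
balanced = di_metric(9_000_000, 9000, 1_800_000, 1800, 1_800_000, 1800)
uneven = di_metric(9_000_000, 9000, 1_800_000, 1800, 1_800_000, 3600)
```

Under these assumptions a faster system can only raise its score by processing more rows per incremental load, which matches the statement that higher-performing systems need to process more data.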
  • 18. Experimental Results
    • TPC-DI was run with 5 scale factors
    • The figure shows normalized data
      • X-axis: normalized data size
      • Y-axis: normalized elapsed time
    • Results show linear scalability
    • Another experiment showed that the elapsed time remained constant when the hardware resources were scaled to match the data size, i.e. double the data size with double the number of CPUs, I/O and memory
  • 19. Experimental Results
    • The figure shows:
      • X-axis: three different data sizes in gigabytes
      • Y-axis: percentage of elapsed time spent in the historical load, incremental load 1 and incremental load 2
    • Across data sizes, 80% of the elapsed time is spent in the historical load and 20% in the incremental update phases
    • This can be used to extrapolate the total elapsed time of benchmark runs
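The extrapolation mentioned above is simple arithmetic: if the historical load accounts for 80% of the run, the total is the historical time divided by 0.8. A minimal Python sketch, with a hypothetical function name and a hypothetical measured time:

```python
HISTORICAL_SHARE = 0.80  # share of elapsed time reported in the slides


def extrapolate_total(historical_seconds, historical_share=HISTORICAL_SHARE):
    """Estimate total and incremental elapsed time from the historical load."""
    if not 0 < historical_share <= 1:
        raise ValueError("historical_share must be in (0, 1]")
    total = historical_seconds / historical_share
    return total, total - historical_seconds


# Hypothetical measurement: the historical load took two hours.
total_seconds, incremental_seconds = extrapolate_total(7200)
```

For a two-hour historical load this yields an estimated total of about 9000 s, of which about 1800 s falls in the two incremental loads combined. The estimate is only as good as the assumption that the 80/20 split holds for the system and scale factor in question.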
  • 20. Conclusion
    • New TPC standard benchmark for data integration
      • Accepted in January 2014
    • Brokerage firm business model
      • OLTP system, HR, CRM, external data
      • 18 transformations into an integrated data warehouse
    • Covers
      • Multiple formats
      • Historical load and updates
      • Complex interdependencies
  • 21. Questions?
    • Thank you!
    • TPC-DI website: http://www.tpc.org/tpcdi/default.asp