This document compares different ETL (extract, transform, load) tools. It begins with introductions to ETL tools in general and four specific tools: Pentaho Kettle, Talend, Informatica PowerCenter, and Inaplex Inaport. The document then compares the tools across various criteria like cost, ease of use, speed, and connectivity. It aims to help readers evaluate the tools for different use cases.
2. Table of Contents
• Introduction
• ETL tools
• Comparison
• Use Cases
• Conclusion
3. Table of Contents
• Introduction
• What do ETL tools do?
• Why use an ETL tool?
• ETL tools
• Comparison
• Use Cases
• Conclusion
4. What do ETL tools do?
An ETL tool is a tool that:
• Extracts data from various data sources (usually legacy systems)
• Transforms data
• from being optimized for transactions to being optimized for reporting and analysis
• synchronizes the data coming from different databases
• cleanses the data to remove errors
• Loads data into a data warehouse
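To make the three steps concrete, here is a minimal ETL sketch in Python. It is illustrative only: the source file orders.csv, its columns, and the SQLite file standing in for the warehouse are all assumptions, and a real ETL tool would generate or execute an equivalent pipeline from a visual design.

    import csv
    import sqlite3
    from datetime import datetime

    def extract(path):
        # Extract: read rows from a legacy flat-file source.
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        # Transform: cleanse errors and reshape for reporting and analysis.
        for row in rows:
            try:
                yield (int(row["id"]),
                       datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat(),
                       round(float(row["amount"]), 2))
            except (KeyError, ValueError):
                continue  # data cleansing: drop malformed rows

    def load(records, db_path="warehouse.db"):
        # Load: write the cleansed records into the warehouse table.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                    "(id INTEGER, day TEXT, amount REAL)")
        con.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", records)
        con.commit()
        con.close()

    load(transform(extract("orders.csv")))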
6. Why use an ETL tool?
ETL tools save time and money when developing a data warehouse by removing the need for "hand-coding".
Hand-coding is still the most common way of integrating data today. It requires hours and hours of development and expertise to create a business intelligence system.
It is very difficult for database administrators to connect different brands of databases without using an external tool.
If databases are altered or new databases need to be integrated, a lot of hand-coded work needs to be completely redone.
9. ETL Tools
Pentaho Kettle
• Pentaho is a commercial open-source BI suite that has a product called Kettle for data integration.
• It uses an innovative metadata-driven approach and has a strong and very easy-to-use GUI.
• The company started around 2001.
• It has a strong community of 13,500 registered users.
• It uses a stand-alone Java engine that processes the tasks for moving data between many different databases and files.
11. ETL Tools
Talend
• Talend is an open-source data integration tool.
• It uses a code-generating approach and a GUI (implemented in Eclipse RCP).
• It started around October 2006.
• It has a much smaller community than Pentaho, but is backed by two finance companies.
• It generates Java or Perl code which can later be run on a server.
13. ETL Tools
Informatica PowerCenter
• Informatica has a very good commercial data integration suite.
• It was founded in 1993.
• It is the market share leader in data integration (Gartner Dataquest).
• It has 2,600 customers, including Fortune 100 companies, companies listed on the Dow Jones and government organizations.
• The company's sole focus is data integration.
• It has quite a big package for enterprises to integrate their systems and cleanse their data, and it can connect to a vast number of current and legacy systems.
15. ETL Tools
Inaplex Inaport
• Inaplex is a small UK company.
• InaPlex is a producer of Customer Data Integration products for mid-market CRM solutions.
• Inaplex mainly focuses on providing simple solutions for its customers to integrate their data into CRM and accounting software like Sage and GoldMine.
16. Table of Contents
• Introduction
• ETL tools
• Comparison
• ETL Tools Comparison Chart
• Total Cost of Ownership
• Risk
• Ease of Use
• Support
• Deployment
• Speed
• Data Quality
• Monitoring
• Connectivity
• Use Cases
• Conclusion
18. Comparison
ETL Tool Comparison Chart
[Chart rating Talend, Pentaho Kettle, Informatica PowerCenter and Inaplex Inaport on cost, risk, ease of use, support, deployment, speed, data quality, monitoring and connectivity; the ratings were shown graphically in the original slides.]
20. Comparison
Total Cost of Ownership
Total cost of ownership means the overall cost of a certain product. This can include initial ordering, licensing, servicing, support, training, consulting, and any other additional payments that need to be made before the product is in full use.
Commercial open-source products are typically free to use, but support, training and consulting are what companies need to pay for.
22. Comparison
Risk
There are always risks with projects, especially big ones. The risks for projects failing are:
• Going over budget
• Going over schedule
• Not meeting the requirements or expectations of the customers
Open-source products have much lower risk than commercial ones since they do not restrict the use of their products with pricey licenses.
24. Comparison
Ease of Use
All of the ETL tools, apart from Inaport, have a GUI to simplify the development process. A good GUI also reduces the time needed to learn and use the tools.
Talend – Has a GUI, but as an add-on inside Eclipse RCP.
Pentaho Kettle – Has the easiest-to-use GUI of all the tools. Training can also be found online or within the community.
Informatica PowerCenter – Has an easy-to-use GUI, but requires some training to make full use of it.
Inaplex Inaport – Does not have a "drag and drop" GUI.
26. Comparison
Support
Nowadays, all software products have support, and all of the ETL tool providers offer it.
Talend – Offers support, but mainly resides in the US.
Pentaho Kettle – Offers support from the US and UK, and has a partner consultant in Hong Kong.
Informatica PowerCenter – Offers worldwide support.
Inaplex Inaport – Offers support, but mainly resides in the UK.
28. Comparison
Deployment
Talend – Creates a Java or Perl file that can be run with an external scheduler on any machine with very few resources. Recommended: one 1 GHz CPU and 512 MB RAM.
Pentaho Kettle – Is a stand-alone Java engine that can run on any machine that can run Java. Needs an external scheduler to run automatically. It can be deployed on many different machines used as "slave servers" to help with transformation processing. Recommended: one 1 GHz CPU and 512 MB RAM.
Informatica PowerCenter – Requires a server running Windows, Solaris, HP-UX, IBM AIX, Red Hat or SUSE Linux. Recommended: two CPUs with 1 GB RAM for the Standard Edition server.
Inaplex Inaport – Can run on any Windows platform that has .NET 2.0 installed. Recommended: one CPU with 50 MB RAM.
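As a sketch of what "an external scheduler" means in practice: a cron entry (or the Windows Task Scheduler) simply invokes the generated job script on a timetable. The wrapper below assumes a hypothetical generated script at /opt/etl/run_nightly.sh.

    # Both Kettle and Talend jobs are typically fired by an external
    # scheduler; a crontab entry such as
    #   0 2 * * * /opt/etl/run_nightly.sh
    # runs the generated job nightly at 02:00. The same call from Python
    # (the script path is hypothetical):
    import subprocess

    result = subprocess.run(["/opt/etl/run_nightly.sh"],
                            capture_output=True, text=True)
    if result.returncode != 0:
        print("ETL job failed:", result.stderr)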
30. Comparison
Speed
The speed of ETL tools depends largely on the data that needs to be transferred over the network and the processing power involved in transforming the data.
Talend – Is slower than Pentaho. It requires manual tweaking and prior knowledge of the specific data source to reduce network traffic and processing.
Pentaho Kettle – Is faster than Talend, but the Java connector slows it down somewhat. Also requires manual tweaking, like Talend. Can be clustered by placing it on many machines to reduce network traffic.
Informatica PowerCenter – Is the fastest tool. It has an advanced "pushdown" option that localizes transformation tasks depending on how busy each machine is.
Inaplex Inaport – Does not use any special techniques to improve speed.
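The "pushdown" idea is easy to see in miniature: instead of pulling every row across the network and aggregating inside the ETL engine, the transformation is handed to the source database and only the result travels. A sketch, using SQLite as a stand-in source with a hypothetical expenses table:

    import sqlite3

    con = sqlite3.connect("source.db")  # hypothetical source database

    # Without pushdown: every row crosses the network and the ETL
    # engine aggregates in memory.
    totals = {}
    for dept, amount in con.execute("SELECT dept, amount FROM expenses"):
        totals[dept] = totals.get(dept, 0) + amount

    # With pushdown: the aggregation runs inside the source database
    # and only one row per department is transferred.
    totals_pushed = dict(
        con.execute("SELECT dept, SUM(amount) FROM expenses GROUP BY dept"))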
32. Comparison
Data Quality
Data quality is fast becoming the most important feature in any data integration tool.
Talend – Has DQ features in its GUI, and allows for customized SQL statements and cleansing in Java.
Pentaho Kettle – Has DQ features in its GUI, and allows for customized SQL statements and cleansing with JavaScript and regular expressions. It also has some additional modules available with a subscription.
Informatica PowerCenter – Does not have that many DQ features itself, but there is another product, Informatica Data Quality, which has many.
Inaplex Inaport – Does have DQ features. Because of the very specific data that Inaport can integrate, it is relatively easy to clean that data.
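A typical data-quality step looks the same in any of these tools: standardize, validate, and route failures to a reject flow. The sketch below uses Python regular expressions purely for illustration (the tools expose the equivalent via SQL, Java or JavaScript, as noted above); the field names are hypothetical.

    import re

    EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")

    def cleanse(record):
        # Standardize whitespace and case, then validate the email field.
        email = record.get("email", "").strip().lower()
        name = re.sub(r"\s+", " ", record.get("name", "").strip()).title()
        if not EMAIL_RE.match(email):
            return None  # route to a reject/error flow instead of loading
        return {"name": name, "email": email}

    print(cleanse({"name": "  ada   LOVELACE ", "email": "Ada@Example.COM"}))
    # -> {'name': 'Ada Lovelace', 'email': 'ada@example.com'}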
34. Comparison
Monitoring
Monitoring allows you to find problems and debug them during and after the development stage.
Talend – Has practical monitoring tools and logging.
Pentaho Kettle – Has practical monitoring tools and logging.
Informatica PowerCenter – Has extensive monitoring tools and logging.
Inaplex Inaport – Has practical monitoring tools and logging.
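In a hand-coded job, "practical monitoring and logging" amounts to counting loaded versus rejected rows and recording failures, which these tools' consoles do for you. A minimal sketch of the idea:

    import logging

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("etl")

    def run_step(name, rows, step):
        # Apply one transformation step, logging failures and totals.
        ok, rejected = 0, 0
        for row in rows:
            try:
                step(row)
                ok += 1
            except Exception:
                rejected += 1
                log.exception("step %s rejected row %r", name, row)
        log.info("step %s finished: %d loaded, %d rejected", name, ok, rejected)

    run_step("load_orders", [{"id": 1}, {"id": "bad"}],
             lambda r: int(r["id"]))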
36. Comparison
Connectivity
In most cases, ETL tools transfer data from legacy systems, so their connectivity is very important to their usefulness.
Talend – Can connect to all the current databases, flat files, XML files, Excel files and web services, but relies on Java drivers to connect to those data sources.
Pentaho Kettle – Can connect to a very wide variety of databases, flat files, XML files, Excel files and web services.
Informatica PowerCenter – Can connect to a huge number of databases, mainframes, flat files, Excel files and web services. It can also export as a web service.
Inaplex Inaport – Can connect to any ODBC (Windows) connection. It usually gets its data from current databases, Outlook, ACT! and Excel files.
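Driver-based connectivity, such as Inaport's ODBC route, looks roughly like this from code. The sketch assumes the third-party pyodbc package and a pre-configured DSN named LegacyCRM with an accounts table; all of these names are hypothetical.

    import pyodbc

    con = pyodbc.connect("DSN=LegacyCRM")  # ODBC data source name
    cur = con.cursor()
    cur.execute("SELECT account_id, account_name FROM accounts")
    for account_id, account_name in cur.fetchall():
        print(account_id, account_name)
    con.close()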
38. Table of Contents
• Introduction
• ETL tools
• Comparison
• Use Cases
• MySQL
• Loma Linda University Health Care
• BNSF Logistics
• U.S. Naval Air Systems Command
• Conclusion
40. Use Cases
MySQL
"We selected Pentaho for its ease-of-use. Pentaho addressed many of our requirements -- from reporting and analysis to dashboards, OLAP and ETL, and offered our business users the Excel-based access that they wanted."
Key Challenges
• Reporting and analysis of operational expenses by department and cost center
• Multiple data sources including Microsoft Excel (for cost-center rollups)
Results
• Centralized view of spending by department
• Easy access to information from Excel
Why Pentaho
• Ease of use
• Breadth of solution
• Cost of ownership
42. Use Cases
Loma Linda University Health Care
"Pentaho Customer Support has been exceptional. This is a strategic application at LLUHC, and working with Pentaho has accelerated our deployment and improved our overall application delivery."
Key Challenges
• Providing analytics for billing and operations supporting 500,000 patients and 600 doctors
Results
• Comprehensive analysis of time periods, services provided, billing groups, physicians
• Centralized, secured, consistent information delivery (versus prior Excel-based system)
• Ability to drill and analyze down to the individual patient level
Why Pentaho
• Open standards support and ease of integration
• Cost of ownership
44. Use Cases
BNSF Logistics
"Using Pentaho for our business intelligence platform, along with the expert support and knowledge provided by OpenBI, BNSF Logistics was able to implement our initial data warehouse with web-based reporting and analytics in just six weeks. Not only did we deliver a powerful business intelligence tool set for our organization in short order, but we were able to do so at a fraction of the cost of proprietary alternatives."
Key Challenges
• Cumbersome, manual process for creation and distribution of reports
• Inconsistent data accuracy because of semi-automated preparation processes
Results
• Initial data warehouse with web-based reporting and analytics in 6 weeks
• 75% lower acquisition costs, 50% lower ongoing ownership costs compared to proprietary BI
• Ability to monitor operational business health
• Faster, better decisions in sales processes
Why Pentaho
• Open standards support and ease of integration
• Cost of ownership
46. Use Cases
U.S. Naval Air Systems Command
"[Open technologies] reduce the cost of software development and they reduce the time in which innovations in software can be incorporated in systems. 'If the project is of a sufficient scale, you cannot get there without an open-source approach,' said Dewey Houck, a senior engineer at Boeing, who spoke at a conference last month about DOD's use of open source." (Government Computer News, Jan. 2008)
Key Challenges
• Analyzing flight data to reduce operational risk and improve training (human error is a causal factor in 70% of aviation mishaps)
Results
• Ability to leverage recorded electronic sensor data to reduce risk and improve crew performance
Why Pentaho
• Breadth of capabilities
• Proven success and large-scale referenceable deployments
• Successful proof-of-concept
• Dramatically lower costs
48. Conclusion
• Informatica and Pentaho both have very good products.
• Informatica has a far more extensive range of products, but compared to Pentaho it is very expensive.
• Pentaho has proved that it can handle small- to large-scale systems.
• Pentaho is gaining momentum fast with businesses that would not have considered using open-source products before.