2. Introduction
◦ ETL Process
◦ Pentaho’s Kettle
Data Integration Challenges
Prerequisites and Recent Releases
Pentaho DI Components
JDBC
Spoon
◦ Transformations
◦ Jobs
08/13/13 2MaxQDPro: Kettle- ETL Tool
3. 4 major components:
◦ Extracting
Gathering raw data from source systems and storing it in the ETL staging
environment
Data profiling
Identifying data that changed since last load
◦ Transforming: Cleaning and Conforming
Processing data to improve its quality, format it, merge from multiple
sources, enforce conformed dimensions
Data cleansing
Recording error events
Audit dimensions
Creating and maintaining conformed dimensions and facts
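The "identifying data that changed since last load" item above can be sketched in a few lines. This is only an illustration under stated assumptions, not how Kettle does it: the row layout, the `id` key, and the helper names `row_fingerprint` and `changed_since_last_load` are all hypothetical.

```python
import hashlib

def row_fingerprint(row):
    """Hash a row's values so a change can be detected with one comparison."""
    joined = "|".join(str(row[key]) for key in sorted(row))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def changed_since_last_load(source_rows, previous_fingerprints, key="id"):
    """Return rows that are new or whose content differs from the last load."""
    changed = []
    for row in source_rows:
        if previous_fingerprints.get(row[key]) != row_fingerprint(row):
            changed.append(row)
    return changed

# Hypothetical data: fingerprints kept from the previous load, then a fresh extract.
last_load = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
previous = {row["id"]: row_fingerprint(row) for row in last_load}
extract = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Robert"},
    {"id": 3, "name": "Cy"},
]
changed = changed_since_last_load(extract, previous)
print([row["id"] for row in changed])
```

Only the changed row and the new row are passed on to the transforming stage; unchanged rows are skipped.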
4. Data filtering
◦ Is not null, greater than, less than, includes
Field manipulation
◦ Trimming, padding, upper and lowercase conversion
Data calculations
◦ +, -, ×, /, average, absolute value, arctangent, natural logarithm
Date manipulation
◦ First day of month, Last day of month, add months, week of year, day of year
Data type conversion
◦ String to number, number to string, date to number
Merging fields & splitting fields
Looking up data
◦ Look up in a database, in a text file, in an Excel sheet, …
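A few of the operations listed above (filtering, trimming, padding, case conversion, type conversion, date manipulation) can be sketched as plain functions applied to each row. This is a minimal illustration, not Kettle's own steps: the field names and the functions `is_kept` and `clean` are hypothetical.

```python
import calendar
from datetime import date

def is_kept(row):
    """Data filtering: keep rows whose amount is not null and greater than zero."""
    return row["amount"] is not None and float(row["amount"]) > 0

def clean(row):
    """Field manipulation, type conversion and date manipulation on one row."""
    day = date.fromisoformat(row["order_date"])
    last_day = calendar.monthrange(day.year, day.month)[1]
    return {
        "customer": row["customer"].strip().upper(),  # trimming, uppercase
        "code": row["code"].rjust(5, "0"),            # padding
        "amount": float(row["amount"]),               # string to number
        "month_start": day.replace(day=1),            # first day of month
        "month_end": day.replace(day=last_day),       # last day of month
        "day_of_year": day.timetuple().tm_yday,       # day of year
    }

# Hypothetical input rows; the second one is dropped by the filter.
rows = [
    {"customer": "  acme ", "code": "42", "amount": "19.50", "order_date": "2013-02-13"},
    {"customer": "globex", "code": "7", "amount": None, "order_date": "2013-03-01"},
]
cleaned = [clean(row) for row in rows if is_kept(row)]
print(cleaned)
```

In Kettle each of these operations would be configured as a step in a graphical transformation rather than written as code.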
5. ◦ Loading
Loading data into data warehouse tables
Managing hierarchies in dimensions
Managing special dimensions such as date and time,
junk, mini, shrunken, small static, and user-maintained
dimensions
Fact table loading
Building and maintaining bridge dimension tables
Handling late arriving data
Management of conformed dimensions
Administration of fact tables
Building aggregations
Building OLAP cubes
Transferring DW data to other environments for specific
purposes
7. Complexity and significant operational problems.
Exceeds the designers' expectations
Data Profiling of a source.
Data warehouses typically grow asynchronously.
Establishing the scalability of an ETL system
across its lifetime.
8. Many off-the-shelf tools exist
High-end tools may not justify value for smaller
warehouses
Proprietary ETL
◦ High upfront cost
◦ Long term maintenance
Custom Code
◦ Low upfront cost
◦ Support effort grows as business requirements change
9. Tool | Vendor
Oracle Warehouse Builder (OWB) | Oracle
Data Integrator (BODI) | Business Objects
IBM Information Server (Ascential) | IBM
SAS Data Integration Studio | SAS Institute
PowerCenter | Informatica
Oracle Data Integrator (Sunopsis) | Oracle
Data Migrator | Information Builders
Integration Services | Microsoft
Talend Open Studio | Talend
DataFlow | Group 1 Software (Sagent)
Data Integrator | Pervasive
Transformation Server | DataMirror
Transformation Manager | ETL Solutions Ltd.
Data Manager | Cognos
DT/Studio | Embarcadero Technologies
ETL4ALL | IKAN
DB2 Warehouse Edition | IBM
Jitterbit | Jitterbit
Pentaho Data Integration | Pentaho
10. Kettle: Kettle Extraction, Transformation,
Transportation & Loading tool
Open source data integration tool in the business
intelligence suite from Pentaho, which was founded in
2004.
Products of Pentaho
◦ Mondrian – OLAP server written in Java
◦ Kettle – ETL tool
◦ Weka – Machine learning and Data mining tool
11. Data is everywhere
Data is inconsistent
◦ Records are different in each system
Performance issues
◦ Running queries that summarize data over a long period
ties up the operating system for the duration of the task
◦ Puts the OS under maximum load
Data is never all in Data Warehouse
◦ Excel sheet, acquisition, new application
12. Metadata, model-driven approach
◦ What to do, and how to do it?
◦ Complex transformations with zero code
◦ Graphically design data transformations and jobs
100% Java with cross-platform support
Extensible architecture
Repository-based
Full featured ETL
Integration with Pentaho Open BI Platform
13. Prerequisites and Recent Releases
Prerequisites
◦ Java Runtime Environment 1.5 and above
◦ Compatible with almost any platform
◦ Compatible with a wide range of database technologies
Recent Releases
◦ 4/25 Data Integration 3.0.3 GA
◦ 4/18 Data Integration 3.1 Milestone
◦ 2/8 Data Integration 3.0.2 GA
◦ 12/12 Data Integration 3.0.1 GA
◦ 11/15 Data Integration 3.0 GA
◦ 10/31 Data Integration 3.0 RC2
◦ 10/24 Data Integration 2.5.2 GA
◦ 10/08 Data Integration 3.0 RC1
◦ 08/24 Data Integration 2.5.1 GA
14. Pan
◦ A program that executes transformations designed in Spoon and stored
as XML or in a database repository.
◦ Transformations are usually scheduled in batch mode to be run
automatically at regular intervals
Carte
◦ Simple web server that executes transformations and jobs remotely.
◦ Accepts XML (through a small servlet) that contains the transformation
to execute and the execution configuration.
◦ Lets you remotely monitor, start and stop the transformations and
jobs
◦ A server running Carte is called a slave server
15. Spoon
◦ GUI that allows you to design transformations and jobs that
can be run with the Kettle tools Pan and Kitchen
◦ Transformations and jobs can be described in an XML file
or stored in a Kettle database repository.
◦ Spoon is available as an executable script and a batch file, so
the tool can be used in heterogeneous environments.
◦ The latest version of Spoon is the 3.2 beta.
Kitchen
◦ Executes jobs designed in Spoon and stored as XML or in a database
repository
16. Installing
◦ Ensure JRE 1.5 or above is installed.
◦ Unzip the binary distribution in any folder
Launching
◦ spoon.bat on Windows platforms
◦ spoon.sh on Unix-like platforms
◦ Create a shortcut with spoon.ico pointing to the bat file
Supported platforms (works on most operating systems)
◦ Microsoft Windows, including Vista
◦ Linux GTK: on i386 and x86_64 processors
◦ Apple's OSX: works on both PowerPC and Intel machines
◦ Solaris: using a Motif interface
◦ AIX, HP-UX, FreeBSD
17. Latest JDBC 3.0
JDBC: the Java API for database connectivity.
Drivers come in four different types
◦ Type 1: JDBC-ODBC bridge
◦ Type 2: Native-API, partly Java driver
◦ Type 3: Middleware Java driver
◦ Type 4: Direct-to-database Java driver
Microsoft-based databases such as MS Access rely on Type 1 drivers.
Oracle and MySQL can be connected with the other types, but the
Type 4 driver is the one traditionally used.
JDBC can also operate in a distributed environment.
20. Key Improvements
◦ Execution Results Pane for logs, metrics and
performance graph
◦ Improved Database Connection dialog
◦ Snap to grid (graphical workspace)
◦ Zoom (Graphical Workspace)
◦ Easier to use left panel for the objects palette
◦ Over 30 new or improved Transformation Steps
◦ 13 new or improved Job Entries
◦ Support for four new database types - MonetDB,
KingbaseES, Vertica, and HP NeoView
◦ Improved translations
21. Repository Connection establishment
Auto login
◦ By manually setting the KETTLE_REPOSITORY,
KETTLE_USER and KETTLE_PASSWORD
environment variables.
Login
◦ By default, PDI provides admin as both the login username
and password.
◦ It is strongly advised to change the default password to avoid
any security vulnerability.
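The auto-login condition can be sketched as a check that all three environment variables are present. This is an illustration only: the function `repository_login_from_env` and the example values are hypothetical, and the sketch does not show how Spoon itself reads the variables.

```python
import os

def repository_login_from_env(env=None):
    """Return (repository, user, password) if all three variables are set, else None."""
    env = os.environ if env is None else env
    names = ("KETTLE_REPOSITORY", "KETTLE_USER", "KETTLE_PASSWORD")
    values = tuple(env.get(name) for name in names)
    return values if all(values) else None

# Hypothetical values, passed as a dict instead of the real environment.
example = {
    "KETTLE_REPOSITORY": "dev_repo",
    "KETTLE_USER": "etl",
    "KETTLE_PASSWORD": "s3cret",
}
login = repository_login_from_env(example)
print("auto login possible" if login else "fall back to the login dialog")
```

If any of the three variables is missing, the sketch returns None, which corresponds to logging in manually.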
25. Transformation
Engine capable of performing a
multitude of functions such as reading,
manipulating and writing data to and
from various data sources.
◦ Value: values are part of a row
and can contain any type of data
◦ Row: a row consists of 0 or more
values
◦ Output stream: an output stream
is a stack of rows that leaves a
step.
◦ Input stream: an input stream is
a stack of rows that enters a step.
◦ Hop: a hop is a graphical
representation of one or more data
streams between 2 steps.
◦ Note: a note is a piece of
information that can be added to a
transformation
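The terms defined on this slide (value, row, step, input stream, output stream, hop) can be illustrated with Python generators. This is a conceptual sketch only, not Kettle's engine: the step functions and the sample rows are hypothetical.

```python
def read_input(records):
    """Input step: emits rows (tuples of values) on its output stream."""
    for record in records:
        yield record

def uppercase_name(rows):
    """Step: reads rows from its input stream and writes changed rows to its output stream."""
    for name, amount in rows:
        yield (name.upper(), amount)

def filter_positive(rows):
    """Step: passes on only the rows whose amount is greater than zero."""
    for name, amount in rows:
        if amount > 0:
            yield (name, amount)

# Hops: each nested call connects one step's output stream
# to the next step's input stream.
source = [("ann", 10), ("bob", -3), ("cy", 7)]
output = list(filter_positive(uppercase_name(read_input(source))))
print(output)
```

Each row is a tuple of two values, and the rows that leave one step are exactly the rows that enter the next.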
26. Jobs
A way of calling transformations and
controlling the sequence of their
execution. Usually jobs are
scheduled in batch mode to be run
automatically at regular intervals.
◦ Job Entry: a job entry is
one part of a job and
performs a certain task.
◦ Hop: a hop is a graphical
representation of the link
between two job entries; it
defines their execution order.
◦ Note: a note is a piece of
information that can be added to a
job
37. Besides the execution order, a job hop specifies the condition for the next job entry
· “Unconditional” - the next job entry will be executed regardless of the result
of the originating job entry.
· “Follow when result is true” - the next job entry will only be executed when
the result of the originating job entry is true.
· “Follow when result is false” - the next job entry will only be executed when
the result of the originating job entry is false.
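The three hop conditions above can be sketched as a small decision function plus a loop that walks the job. This is an illustration under stated assumptions, not Kettle's job engine: the job entry names, the `hop_allows` and `run_job` helpers, and the tuple layout of a hop are all hypothetical.

```python
UNCONDITIONAL = "unconditional"
FOLLOW_TRUE = "follow when result is true"
FOLLOW_FALSE = "follow when result is false"

def hop_allows(condition, result):
    """Decide whether the next job entry runs, given the originating entry's result."""
    if condition == UNCONDITIONAL:
        return True
    if condition == FOLLOW_TRUE:
        return result is True
    if condition == FOLLOW_FALSE:
        return result is False
    raise ValueError(f"unknown hop condition: {condition}")

def run_job(entries, hops, start):
    """Run entries (name -> callable returning bool) along hops (source, target, condition)."""
    executed, queue = [], [start]
    while queue:
        name = queue.pop(0)
        result = entries[name]()
        executed.append(name)
        for source, target, condition in hops:
            if source == name and hop_allows(condition, result):
                queue.append(target)
    return executed

# Hypothetical job: the load fails, so only the error mail and the log entry follow.
entries = {
    "load_sales": lambda: False,
    "send_success_mail": lambda: True,
    "send_error_mail": lambda: True,
    "write_log": lambda: True,
}
hops = [
    ("load_sales", "send_success_mail", FOLLOW_TRUE),
    ("load_sales", "send_error_mail", FOLLOW_FALSE),
    ("load_sales", "write_log", UNCONDITIONAL),
]
executed = run_job(entries, hops, "load_sales")
print(executed)
```

Changing the result of the hypothetical `load_sales` entry to True would route execution to the success mail instead of the error mail, while the unconditional hop runs in both cases.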
41. Brief Introduction to ETL process
JDBC Repository Connection
Pentaho Data Integration Tool
◦ Components
Pan
Carte
Kitchen
Spoon
◦ Transformation with different Input Data Source
◦ Jobs
42. kettle.pentaho.org
◦ Kettle project homepage
kettle.javaforge.com
◦ Kettle community website: forum, source, documentation, tech tips, samples, …
www.pentaho.org/download/
◦ All Pentaho modules, pre-configured with sample data
◦ Developer forums, documentation
◦ Ventana Research Open Source BI Survey
www.mysql.com
◦ White paper - http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
◦ Kettle Webinar - http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php
◦ Roland Bouman blog on Pentaho Data Integration and MySQL
http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html