MaxQDPro Team
Anjan.K Harish.R
II Sem M.Tech CSE
08/13/13 MaxQDPro: Kettle- ETL Tool 1
A Pentaho Data Integration tool
 Introduction
◦ ETL Process
◦ Pentaho’s Kettle
 Data Integration Challenges
 Prerequisites and Recent Releases
 Pentah...
 4 major components:
◦ Extracting
 Gathering raw data from source systems and storing it in ETL staging
environment
 Da...
 Data filtering
◦ Is not null, greater than, less than, includes
 Field manipulation
◦ Trimming, padding, upper and lowe...
◦ Loading
 Loading data into data warehouse tables
 Managing hierarchies in dimensions
 Managing special dimensions suc...
08/13/13MaxQDPro: Kettle- ETL Tool 6
 Complexity and significant operational problems. 
 Exceeds the designers expectations
 Data Profiling of a source.
 D...
 Many off-the-shelf tools exist
 High-end tools may not justify value for smaller
warehouses
 Proprietary ETL
◦ High up...
08/13/13MaxQDPro: Kettle- ETL Tool 9
Tool Vendor
Oracle Warehouse Builder (OWB) Oracle
Data Integrator (BODI) Business Obj...
 Kettle – Kettle Extraction Transformation
Transportation & Loading tool
 Its open source business intelligence suite fo...
 Data is everywhere
 Data is inconsistent
◦ Records are different in each system
 Performance issues
◦ Running queries ...
 Meta data , model driven approach
◦ What to do? And how to do?
◦ Complex transformation with zero code
◦ Graphically des...
Prerequisites Recent Releases
 Java Runtime
Environment 1.5 and
above
 Compatible with almost
any platform
 Compatible ...
 Pan
◦ A program to execute transformations designed by Spoon in XML
or database repository.
◦ Transformations are schedu...
 Spoon
◦  GUI that allows you to design transformations and jobs that
can be run with the Kettle tools — Pan and Kitchen
...
Create Shortcut with spoon.ico
pointing to bat file
Works on most of OS
 Installing
◦ Ensure JRE 1.5 is installed.
◦ Unzi...
Latest JDBC 3.0
 JDBC -Database connectivity
Java tool.
 Comes in four different types
◦ Type1: JDBC-ODBC Bridge
◦ Type ...
08/13/13MaxQDPro: Kettle- ETL Tool 18
08/13/13MaxQDPro: Kettle- ETL Tool 19
 Key Improvement
◦ Execution Results Pane for logs, metrics and
performance graph
◦ Improved Database Connection dialog
◦...
 Repository Connection establishment
 Auto login
◦ By setting manually KETTLE_REPOSITORY,
KETTLE_USER and KETTLE_PASSWOR...
08/13/13MaxQDPro: Kettle- ETL Tool 22
08/13/13MaxQDPro: Kettle- ETL Tool 23
08/13/13MaxQDPro: Kettle- ETL Tool 24
 Transformation
◦ Value: Values are part of a row
and can contain any type of data
◦ Row: a row exists of 0 or more
value...
 Jobs
◦ Job Entry: A job entry is
one part of a job and
performs a certain
◦ Hop: A hop is a graphical
representation of ...
Input Steps
Output Steps
Lookup Steps
Transformation
Steps
Join Steps
DW Steps
Mapping Steps
Job Steps
08/13/13 27MaxQDPro...
08/13/13 28MaxQDPro: Kettle- ETL Tool
08/13/13 29MaxQDPro: Kettle- ETL Tool
08/13/13MaxQDPro: Kettle- ETL Tool 30
08/13/13 31MaxQDPro: Kettle- ETL Tool
08/13/13 32MaxQDPro: Kettle- ETL Tool
08/13/13 33MaxQDPro: Kettle- ETL Tool
08/13/13 34MaxQDPro: Kettle- ETL Tool
Table Output Step
08/13/13 35MaxQDPro: Kettle- ETL Tool
Insert / Update Output Step
08/13/13 36MaxQDPro: Kettle- ETL Tool
Besides the execution order, it specifies the condition for next job entry
· “Unconditional” - next job entry will be ex...
08/13/13 38MaxQDPro: Kettle- ETL Tool
08/13/13 39MaxQDPro: Kettle- ETL Tool
08/13/13MaxQDPro: Kettle- ETL Tool 40
 Brief Introduction to ETL process
 JDBC Repository Connection
 Pentaho Data Integration Tool
◦ Components
 Pan
 Cart...
 kettle.pentaho.org
◦ Kettle project homepage
 kettle.javaforge.com
◦ Kettle community website: forum, source, documenta...
Upcoming SlideShare
Loading in...5
×

Kettleetltool 090522005630-phpapp01

321

Published on

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
321
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
0
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Kettleetltool 090522005630-phpapp01

  1. 1. MaxQDPro Team Anjan.K Harish.R II Sem M.Tech CSE 08/13/13 MaxQDPro: Kettle- ETL Tool 1 A Pentaho Data Integration tool
  2. 2.  Introduction ◦ ETL Process ◦ Pentaho’s Kettle  Data Integration Challenges  Prerequisites and Recent Releases  Pentaho DI Components  JDBC  Spoon ◦ Transformations ◦ Jobs 08/13/13 2MaxQDPro: Kettle- ETL Tool
  3. 3.  4 major components: ◦ Extracting  Gathering raw data from source systems and storing it in ETL staging environment  Data profiling  Identifying data that changed since last load ◦ Transforming- Cleaning and Conforming  Processing data to improve its quality, format it, merge from multiple sources, enforce conformed dimensions  Data cleansing  Recording error events  Audit dimensions  Creating and maintaining conformed dimensions and facts 08/13/13MaxQDPro: Kettle- ETL Tool 3
  4. 4.  Data filtering ◦ Is not null, greater than, less than, includes  Field manipulation ◦ Trimming, padding, upper and lowercase conversion  Data calculations ◦ + - X / , average, absolute value, arctangent, natural logarithm  Date manipulation ◦ First day of month, Last day of month, add months, week of year, day of year  Data type conversion ◦ String to number, number to string, date to number  Merging fields & splitting fields  Looking up date ◦ Look up in a database, in a text file, an excel sheet, … 08/13/13 4MaxQDPro: Kettle- ETL Tool
  5. 5. ◦ Loading  Loading data into data warehouse tables  Managing hierarchies in dimensions  Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions  Fact table loading  Building and maintaining bridge dimension tables  Handling late arriving data  Management of conformed dimensions  Administration of fact tables  Building aggregations  Building OLAP cubes  Transferring DW data to other environment for specific purposes 08/13/13MaxQDPro: Kettle- ETL Tool 5
  6. 6. 08/13/13MaxQDPro: Kettle- ETL Tool 6
  7. 7.  Complexity and significant operational problems.   Exceeds the designers expectations  Data Profiling of a source.  Data warehouses typically grow asynchronously.  Establishing the scalability of an ETL system across the lifetime . 08/13/13MaxQDPro: Kettle- ETL Tool 7
  8. 8.  Many off-the-shelf tools exist  High-end tools may not justify value for smaller warehouses  Proprietary ETL ◦ High upfront cost ◦ Long term maintenance  Custom Code ◦ Low upfront cost ◦ Support grows as business requirements changes 08/13/13 8MaxQDPro: Kettle- ETL Tool
  9. 9. 08/13/13MaxQDPro: Kettle- ETL Tool 9 Tool Vendor Oracle Warehouse Builder (OWB) Oracle Data Integrator (BODI) Business Objects IBM Information Server (Ascential) IBM SAS Data Integration Studio SAS Institute PowerCenter Informatica Oracle Data Integrator (Sunopsis) Oracle Data Migrator Information Builders Integration Services Microsoft Talend Open Studio Talend DataFlow Group 1 Software (Sagent) Data Integrator Pervasive Transformation Server DataMirror Transformation Manager ETL Solutions Ltd. Data Manager Cognos DT/Studio Embarcadero Technologies ETL4ALL IKAN DB2 Warehouse Edition IBM Jitterbit Jitterbit Pentaho Data Integration Pentaho
  10. 10.  Kettle – Kettle Extraction Transformation Transportation & Loading tool  Its open source business intelligence suite for powerful data integration by Pentaho. Founded in 2004.  Products of Pentaho ◦ Mondrain – OLAP server written in Java ◦ Kettle – ETL tool ◦ Weka – Machine learning and Data mining tool 08/13/13 10MaxQDPro: Kettle- ETL Tool
  11. 11.  Data is everywhere  Data is inconsistent ◦ Records are different in each system  Performance issues ◦ Running queries to summarize data for stipulated long period takes operating system for task ◦ Brings the OS on max load  Data is never all in Data Warehouse ◦ Excel sheet, acquisition, new application 08/13/13 11MaxQDPro: Kettle- ETL Tool
  12. 12.  Meta data , model driven approach ◦ What to do? And how to do? ◦ Complex transformation with zero code ◦ Graphically design data transformation and jobs  100% Java with cross-platform support  Extensible architecture  Repository-based  Full featured ETL  Integration with Pentaho Open BI Platform 08/13/13 12MaxQDPro: Kettle- ETL Tool
  13. 13. Prerequisites Recent Releases  Java Runtime Environment 1.5 and above  Compatible with almost any platform  Compatible with wide range of Databases technologies.  4/25 Data Integration 3.0.3 GA  4/18 Data Integration 3.1 Milestone  2/8 Data Integration 3.0.2 GA  12/12 Data Integration 3.0.1 GA  11/15 Data Integration 3.0 GA  10/31 Data Integration 3.0 RC2  10/24 Data Integration 2.5.2 GA  10/08 Data Integration 3.0 RC1  08/24 Data Integration 2.5.1 GA 08/13/13MaxQDPro: Kettle- ETL Tool 13
  14. 14.  Pan ◦ A program to execute transformations designed by Spoon in XML or database repository. ◦ Transformations are scheduled in batch mode to be run automatically at regular intervals  Carte ◦ Simple web server to execute transformations and jobs remotely. ◦ Accept an XML (small servlet) that contains transformation to execute and the execution configuration.  ◦ Allows to remotely monitor, start and stop the transformations and jobs ◦ Server running in Carte is a Slave Server 08/13/13MaxQDPro: Kettle- ETL Tool 14
  15. 15.  Spoon ◦  GUI that allows you to design transformations and jobs that can be run with the Kettle tools — Pan and Kitchen ◦ Transformations and Jobs can describe themselves using an XML file or can be put in a Kettle database repository. ◦ Spoon is available as executable script and batch file to make use of tool in heterogeneous environment. ◦ Latest version of Spoon is 3.2 beta version.  Kitchen ◦ Execute jobs designed by Spoon in XML or database repository 08/13/13MaxQDPro: Kettle- ETL Tool 15
  16. 16. Create Shortcut with spoon.ico pointing to bat file Works on most of OS  Installing ◦ Ensure JRE 1.5 is installed. ◦ Unzip the binary distribution in any folder  Launching ◦ spoon.bat in windows platform ◦ spoon.sh in Unix like platform  Supported platform ◦ Microsoft Windows including Vista ◦ Linux GTK: on i386 and x86_64 processors ◦ Apple's OSX: works both on PowerPC and Intel machines ◦ Solaris: using a Motif interface  ◦ AIX, HP-UX, FreeBSD 08/13/13MaxQDPro: Kettle- ETL Tool 16
  17. 17. Latest JDBC 3.0  JDBC -Database connectivity Java tool.  Comes in four different types ◦ Type1: JDBC-ODBC Bridge ◦ Type 2 : Native API partial Java driver ◦ Type 3 : Middleware Java Drivers ◦ Type 4: Direct to DB Java Drivers  Microsoft Based DB like MS Access rely on Type 1drivers  Oracle, Mysql can be connected with other types. But traditionally used is the Type 4 driver.  JDBC can also operate in Distributed environment. 08/13/13MaxQDPro: Kettle- ETL Tool 17
  18. 18. 08/13/13MaxQDPro: Kettle- ETL Tool 18
  19. 19. 08/13/13MaxQDPro: Kettle- ETL Tool 19
  20. 20.  Key Improvement ◦ Execution Results Pane for logs, metrics and performance graph ◦ Improved Database Connection dialog ◦ Snap to grid (graphical workspace) ◦ Zoom (Graphical Workspace) ◦ Easier to use left panel for the objects palette ◦ Over 30 new or improved Transformation Steps ◦ 13 new or improved Job Entries ◦ Support for four new database types - MonetDB, KingbaseES, Vertica, and HP NeoView ◦ Improved translations 08/13/13MaxQDPro: Kettle- ETL Tool 20
  21. 21.  Repository Connection establishment  Auto login ◦ By setting manually KETTLE_REPOSITORY, KETTLE_USER and KETTLE_PASSWORD environmental variables.  Login ◦ By default PDI provides login username and password ad admin. ◦ It strictly advised to change default password to avoid any security vulnerablity. 08/13/13MaxQDPro: Kettle- ETL Tool 21
  22. 22. 08/13/13MaxQDPro: Kettle- ETL Tool 22
  23. 23. 08/13/13MaxQDPro: Kettle- ETL Tool 23
  24. 24. 08/13/13MaxQDPro: Kettle- ETL Tool 24
  25. 25.  Transformation ◦ Value: Values are part of a row and can contain any type of data ◦ Row: a row exists of 0 or more values  ◦ Output stream: an output stream is a stack of rows that leaves a step.  ◦ Input stream: an input stream is a stack of rows that enters a step.  ◦ Hop: A hop is a graphical representation of one or more data streams between 2 steps. ◦ Note: A note is a piece of information that can be added to a transformation 08/13/13MaxQDPro: Kettle- ETL Tool 25 Engine capable of performing a multitude of functions such as reading, manipulating and writing data to and from various data sources.
  26. 26.  Jobs ◦ Job Entry: A job entry is one part of a job and performs a certain ◦ Hop: A hop is a graphical representation of one or more data streams between 2 steps ◦ Note: a note is a piece of information that can be added to a job 08/13/13MaxQDPro: Kettle- ETL Tool 26 A way of calling transformations and controlling the sequence of their execution. Usually jobs are scheduled in batch mode to be run automatically at regular intervals.
  27. 27. Input Steps Output Steps Lookup Steps Transformation Steps Join Steps DW Steps Mapping Steps Job Steps 08/13/13 27MaxQDPro: Kettle- ETL Tool
  28. 28. 08/13/13 28MaxQDPro: Kettle- ETL Tool
  29. 29. 08/13/13 29MaxQDPro: Kettle- ETL Tool
  30. 30. 08/13/13MaxQDPro: Kettle- ETL Tool 30
  31. 31. 08/13/13 31MaxQDPro: Kettle- ETL Tool
  32. 32. 08/13/13 32MaxQDPro: Kettle- ETL Tool
  33. 33. 08/13/13 33MaxQDPro: Kettle- ETL Tool
  34. 34. 08/13/13 34MaxQDPro: Kettle- ETL Tool
  35. 35. Table Output Step 08/13/13 35MaxQDPro: Kettle- ETL Tool
  36. 36. Insert / Update Output Step 08/13/13 36MaxQDPro: Kettle- ETL Tool
  37. 37. Besides the execution order, it specifies the condition for next job entry · “Unconditional” - next job entry will be executed regardless of the result of the originating job entry. · “Follow when result is true” - next job entry will only be executed when the result of the originating job entry is true, · “Follow when result is false” - next job entry will only be executed when the result of the originating job entry was false 08/13/13 37MaxQDPro: Kettle- ETL Tool
  38. 38. 08/13/13 38MaxQDPro: Kettle- ETL Tool
  39. 39. 08/13/13 39MaxQDPro: Kettle- ETL Tool
  40. 40. 08/13/13MaxQDPro: Kettle- ETL Tool 40
  41. 41.  Brief Introduction to ETL process  JDBC Repository Connection  Pentaho Data Integration Tool ◦ Components  Pan  Carte  Kitchen  Spoon ◦ Transformation with different Input Data Source ◦ Jobs 08/13/13MaxQDPro: Kettle- ETL Tool 41
  42. 42.  kettle.pentaho.org ◦ Kettle project homepage  kettle.javaforge.com ◦ Kettle community website: forum, source, documentation, tech tips, samples, …  www.pentaho.org/download/ ◦ All Pentaho modules, pre-configured with sample data ◦ Developer forums, documentation ◦ Ventana Research Open Source BI Survey  www.mysql.com ◦ White paper - http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html ◦ Kettle Webinar - http://www.mysql.com/news-and-events/on-demand- webinars/pentaho-2006-09-19.php ◦ Roland Bouman blog on Pentaho Data Integration and MySQL  http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html 08/13/13 42MaxQDPro: Kettle- ETL Tool

×