Your SlideShare is downloading. ×

Pentaho etl-tool


Published on

Published in: Technology
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Kettle – ETL Tool Sreenivas K
  • 2. Agenda  Introduction − ETL Process − Pentaho's Kettle  Data Integration Challenges  Prerequisites and Recent Releases  Pentaho DI Components  Spoon − Transformations − Jobs
  • 3. Introduction – ETL Process  Major Components − Extracting  Gathering raw data from source systems and storing it in ETL staging environment  Data Profiling  Identifying data that changed since last load. − Transforming- Cleaning and Conforming  Processing data to improve its quality, format it, merge from multiple sources, enforce conformed dimensions  Data cleansing  Recording error events  Audit dimensions  Creating and maintaining conformed dimensions and facts
  • 4. Introduction – ETL Process − Loading  Loading data into data warehouse tables  Managing hierarchies in dimensions  Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions  Fact table loading  Building and maintaining bridge dimension tables  Handling late arriving data  Management of conformed dimensions  Administration of fact tables  Building aggregations  Building OLAP cubes  Transferring DW data to other environment for specific purposes
  • 5. Data Transformation and Integration Examples  Data filtering − Is not null, greater than, less than, includes  Field manipulation − Trimming, padding, upper and lowercase conversion  Data calculations − + - X / , average, absolute value, arctangent, natural logarithm  Date manipulation − First day of month, Last day of month, add months, week of year, day of year  Data type conversion − String to number, number to string, date to number  Merging fields & splitting fields  Looking up date − Look up in a database, in a text file, an excel sheet, …
  • 6. Introduction – Pentaho Kettle  Kettle – Kettle Extraction Transformation Transportation & Loading tool  Its open source business intelligence suite for powerful data integration by Pentaho. Founded in 2004.  Products of Pentaho − Mondrain – OLAP server written in Java − Kettle – ETL tool
  • 7. Data Integration - Challenges  Data is everywhere  Data is inconsistent − Records are different in each system  Performance issues − Running queries to summarize data for stipulated long period takes operating system for task  Data is never all in Data Warehouse − Excel sheet, acquisition, new application
  • 8. Prerequisites Recent Releases  Java Runtime Environment 1.5 and above  Compatible with almost any platform  Compatible with wide range of Databases technologies.  4/25 Data Integration 3.0.3 GA  4/18 Data Integration 3.1 Milestone  2/8 Data Integration 3.0.2 GA  12/12 Data Integration 3.0.1 GA  11/15 Data Integration 3.0 GA  10/31 Data Integration 3.0 RC2  10/24 Data Integration 2.5.2 GA  10/08 Data Integration 3.0 RC1  08/24 Data Integration 2.5.1 GA
  • 9. Pentaho Components  Spoon − GUI that allows you to design transformations and jobs that can be run with the Kettle tools — Pan and Kitchen − Transformations and Jobs can describe themselves using an XML file or can be put in a Kettle database repository. − Spoon is available as executable script and batch file to make use of tool in heterogeneous environment.  Pan − A program to execute transformations designed by Spoon in XML or database repository. − Transformations are scheduled in batch mode to be run automatically at regular intervals  Kitchen − Execute jobs designed by Spoon in XML or database repository
  • 10.  Repository Connection establishment  Auto login − By setting manually KETTLE_REPOSITORY, KETTLE_USER and KETTLE_PASSWORD environmental variables.  Login − By default PDI provides login username and password ad admin.
  • 11.  Transformation − Value: Values are part of a row and can contain any type of data − Row: a row exists of 0 or more values − Output stream: an output stream is a stack of rows that leaves a step. − Input stream: an input stream is a stack of rows that enters a step. − Hop: A hop is a graphical representation of one or more data streams between 2 steps. − Note: A note is a piece of information that can be added to a transformation Engine capable of performing a multitude of functions such as reading, manipulating and writing data to and from various data sources.
  • 12.  Jobs − Job Entry: A job entry is one part of a job and performs a certain − Hop: A hop is a graphical representation of one or more data streams between 2 steps − Note: a note is a piece of information that can be added to a job A way of calling transformations and controlling the sequence of their execution. Usually jobs are scheduled in batch mode to be run automatically at regular intervals.
  • 13. Input Steps Output Steps Lookup Steps Transformation Steps Join Steps DW Steps Mapping Steps Job Steps