Pentaho etl-tool


Published in: Technology

  1. Kettle – ETL Tool, by Sreenivas K
  2. Agenda
     • Introduction
       − ETL Process
       − Pentaho's Kettle
     • Data Integration Challenges
     • Prerequisites and Recent Releases
     • Pentaho DI Components
     • Spoon
       − Transformations
       − Jobs
  3. Introduction – ETL Process
     • Major Components
       − Extracting
         • Gathering raw data from source systems and storing it in the ETL staging environment
         • Data profiling
         • Identifying data that has changed since the last load
       − Transforming (cleaning and conforming)
         • Processing data to improve its quality, format it, merge it from multiple sources, and enforce conformed dimensions
         • Data cleansing
         • Recording error events
         • Audit dimensions
         • Creating and maintaining conformed dimensions and facts
  4. Introduction – ETL Process
     • Major Components (continued)
       − Loading
         • Loading data into data warehouse tables
         • Managing hierarchies in dimensions
         • Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions
         • Fact table loading
         • Building and maintaining bridge dimension tables
         • Handling late-arriving data
         • Management of conformed dimensions
         • Administration of fact tables
         • Building aggregations
         • Building OLAP cubes
         • Transferring DW data to other environments for specific purposes
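The three phases above can be sketched in a few lines of Python. This is only an illustration of the extract, transform, load sequence, not how Kettle implements it: the source rows, the `fact_sales` table, and all function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical raw rows, standing in for data gathered from a source system.
SOURCE_ROWS = [
    {"id": "1", "name": "  alice ", "amount": "10.50"},
    {"id": "2", "name": "BOB", "amount": "not-a-number"},
    {"id": "3", "name": "carol", "amount": "7"},
]


def extract(rows):
    """Extract: copy raw rows into a staging list without changing them."""
    return [dict(row) for row in rows]


def transform(staged):
    """Transform: clean and conform rows, recording an error event for bad ones."""
    clean, errors = [], []
    for row in staged:
        try:
            clean.append(
                (int(row["id"]), row["name"].strip().title(), float(row["amount"]))
            )
        except ValueError as exc:
            errors.append((row["id"], str(exc)))
    return clean, errors


def load(conn, clean):
    """Load: write the conformed rows into a (hypothetical) fact table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", clean)
    conn.commit()


def run_etl(rows):
    """Run the three phases in order and return loaded rows plus error events."""
    conn = sqlite3.connect(":memory:")
    clean, errors = transform(extract(rows))
    load(conn, clean)
    loaded = conn.execute(
        "SELECT id, name, amount FROM fact_sales ORDER BY id"
    ).fetchall()
    conn.close()
    return loaded, errors
```

Calling `run_etl(SOURCE_ROWS)` loads the two valid rows and reports one error event for the row whose amount cannot be parsed.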
  5. Data Transformation and Integration Examples
     • Data filtering
       − Is not null, greater than, less than, includes
     • Field manipulation
       − Trimming, padding, upper- and lowercase conversion
     • Data calculations
       − +, -, ×, /, average, absolute value, arctangent, natural logarithm
     • Date manipulation
       − First day of month, last day of month, add months, week of year, day of year
     • Data type conversion
       − String to number, number to string, date to number
     • Merging fields and splitting fields
     • Looking up data
       − Look up in a database, in a text file, in an Excel sheet, …
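A few of the operations listed above, written as plain Python for illustration. In Kettle these would be configured as steps rather than coded by hand; the function names and the record layout (`name`, `code`, `amount`) are assumptions made for this sketch.

```python
import calendar
import math
from datetime import date


def is_not_null(value):
    """Data filtering: treat None and the empty string as null."""
    return value is not None and value != ""


def first_day_of_month(d):
    """Date manipulation: first day of the month containing d."""
    return d.replace(day=1)


def last_day_of_month(d):
    """Date manipulation: last day of the month containing d."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


def add_months(d, months):
    """Date manipulation: shift by whole months, clamping the day if needed."""
    index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def day_of_year(d):
    """Date manipulation: 1-based day number within the year."""
    return d.timetuple().tm_yday


def clean_record(record):
    """Apply several listed operations to one hypothetical record."""
    name = record["name"].strip().upper()   # trimming, uppercase conversion
    code = record["code"].rjust(5, "0")     # padding
    amount = abs(float(record["amount"]))   # string to number, absolute value
    first, last = name.split(" ", 1)        # splitting fields
    return {
        "first": first,
        "last": last,
        "code": code,
        "amount": amount,
        "log_amount": math.log(amount),     # natural logarithm
    }
```

For example, `clean_record({"name": "  jane doe ", "code": "42", "amount": "-2.5"})` yields the first name `JANE`, the last name `DOE`, the code `00042`, and the amount `2.5`.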
  6. Introduction – Pentaho Kettle
     • Kettle: Kettle Extraction, Transformation, Transportation & Loading tool
     • It is the data integration part of Pentaho's open source business intelligence suite. Pentaho was founded in 2004.
     • Products of Pentaho
       − Mondrian: OLAP server written in Java
       − Kettle: ETL tool
  7. Data Integration – Challenges
     • Data is everywhere
     • Data is inconsistent
       − Records are different in each system
     • Performance issues
       − Running queries that summarize data over a long period puts a heavy load on the operational system
     • Data is never all in the data warehouse
       − Excel sheets, acquisitions, new applications
  8. Prerequisites and Recent Releases
     • Prerequisites
       − Java Runtime Environment 1.5 or above
       − Compatible with almost any platform
       − Compatible with a wide range of database technologies
     • Recent Releases
       − 4/25: Data Integration 3.0.3 GA
       − 4/18: Data Integration 3.1 Milestone
       − 2/8: Data Integration 3.0.2 GA
       − 12/12: Data Integration 3.0.1 GA
       − 11/15: Data Integration 3.0 GA
       − 10/31: Data Integration 3.0 RC2
       − 10/24: Data Integration 2.5.2 GA
       − 10/08: Data Integration 3.0 RC1
       − 08/24: Data Integration 2.5.1 GA
  9. Pentaho Components
     • Spoon
       − A GUI for designing transformations and jobs that can be run with the Kettle tools Pan and Kitchen
       − Transformations and jobs can be stored as XML files or in a Kettle database repository
       − Spoon is available as a shell script and as a batch file, so the tool can be used in heterogeneous environments
     • Pan
       − A program that executes transformations designed in Spoon, from an XML file or a database repository
       − Transformations are usually scheduled in batch mode to run automatically at regular intervals
     • Kitchen
       − A program that executes jobs designed in Spoon, from an XML file or a database repository
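As a rough sketch of how Pan and Kitchen might be driven from a scheduler script, the helper below builds the command line for either tool. It assumes the Unix launcher scripts `pan.sh` and `kitchen.sh` sit in the current directory and uses the `-file` and `-level` options; the helper names and the example file path are hypothetical, and the option syntax should be checked against the installed release.

```python
import subprocess


def build_command(tool, file_path, log_level="Basic"):
    """Build the argument list for running an XML transformation or job.

    tool is "pan" (transformations) or "kitchen" (jobs). The script location
    and option spelling are assumptions for this sketch.
    """
    if tool not in ("pan", "kitchen"):
        raise ValueError(f"unknown tool: {tool}")
    return [f"./{tool}.sh", f"-file={file_path}", f"-level={log_level}"]


def run_tool(tool, file_path, log_level="Basic"):
    """Run the tool and return its exit code (not called in this sketch)."""
    return subprocess.run(build_command(tool, file_path, log_level)).returncode
```

A scheduler such as cron could then call `run_tool("kitchen", "/path/to/nightly_load.kjb")` at regular intervals, where the path is a placeholder.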
  10. Establishing a Repository Connection
     • Auto login
       − Set the KETTLE_REPOSITORY, KETTLE_USER and KETTLE_PASSWORD environment variables manually
     • Login
       − By default, PDI provides "admin" as both the login username and the password
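One way to set the three auto-login variables for a single launch, rather than globally, is to pass them in the child process environment. The sketch below only prepares that environment; the function name and the repository name `dev_repo` are hypothetical.

```python
import os


def repository_env(repository, user, password, base_env=None):
    """Return a copy of the environment with the auto-login variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "KETTLE_REPOSITORY": repository,
            "KETTLE_USER": user,
            "KETTLE_PASSWORD": password,
        }
    )
    return env


# Example: default credentials from the slide, hypothetical repository name.
example_env = repository_env("dev_repo", "admin", "admin", base_env={"PATH": "/usr/bin"})
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when starting Spoon.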
  11. Transformation
     • An engine capable of performing a multitude of functions such as reading, manipulating and writing data to and from various data sources
       − Value: values are part of a row and can contain any type of data
       − Row: a row consists of 0 or more values
       − Output stream: an output stream is a stack of rows that leaves a step
       − Input stream: an input stream is a stack of rows that enters a step
       − Hop: a hop is a graphical representation of one or more data streams between 2 steps
       − Note: a note is a piece of information that can be added to a transformation
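The vocabulary above (rows, steps, streams, hops) maps loosely onto a chain of Python generators. This is a conceptual sketch only, not Kettle's engine: each function plays the role of a step, each dictionary is a row of values, and passing one generator into the next stands in for a hop. All names are hypothetical.

```python
def input_step(records):
    """Input step: emit each record as a row on the output stream."""
    for record in records:
        yield dict(record)


def filter_step(rows, field):
    """Filter step: pass on only rows whose field is not null."""
    for row in rows:
        if row.get(field) is not None:
            yield row


def uppercase_step(rows, field):
    """Field manipulation step: uppercase one value in each row."""
    for row in rows:
        yield {**row, field: row[field].upper()}


def output_step(rows):
    """Output step: collect the incoming stream (a real step would write it)."""
    return list(rows)


def run_transformation(records):
    """Wire the steps together; each assignment plays the role of a hop."""
    stream = input_step(records)
    stream = filter_step(stream, "city")
    stream = uppercase_step(stream, "city")
    return output_step(stream)
```

Running it on `[{"id": 1, "city": "paris"}, {"id": 2, "city": None}]` keeps only the first row and uppercases its city.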
  12. Jobs
     • A way of calling transformations and controlling the sequence of their execution. Jobs are usually scheduled in batch mode to run automatically at regular intervals.
       − Job entry: a job entry is one part of a job and performs a certain task
       − Hop: a hop is a graphical representation of one or more data streams between 2 steps
       − Note: a note is a piece of information that can be added to a job
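The idea of a job as an ordered sequence of entries can be sketched as follows. This is a simplification that assumes the job stops at the first failing entry; real jobs can also branch on success or failure. The function and entry names are hypothetical.

```python
def run_job(entries):
    """Run (name, action) job entries in order, stopping at the first failure."""
    log = []
    for name, action in entries:
        try:
            action()
            log.append((name, "ok"))
        except Exception as exc:
            log.append((name, f"failed: {exc}"))
            break
    return log


def failing_entry():
    """Stand-in for a transformation that fails."""
    raise RuntimeError("source unavailable")


# Hypothetical job: the third entry never runs because the second one fails.
example_log = run_job(
    [
        ("check_files", lambda: None),
        ("load_staging", failing_entry),
        ("load_warehouse", lambda: None),
    ]
)
```

Here `example_log` records `check_files` as successful and `load_staging` as failed, with no record for `load_warehouse`.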
  13. Input Steps, Output Steps, Lookup Steps, Transformation Steps, Join Steps, DW Steps, Mapping Steps, Job Steps