A Pentaho Data Integration tool




                         MaxQDPro Team
                            Anjan.K            ...
Kettle – Etl Tool

Pentaho Kettle ETL tool demonstration and gist of the ETL process

Published in: Technology
  • Kettle – Etl Tool

    1. A Pentaho Data Integration tool
       MaxQDPro Team: Anjan.K, Harish.R (II Sem M.Tech CSE)
       05/22/09 MaxQDPro: Kettle- ETL Tool
    2.
       • Introduction
         ◦ ETL Process
         ◦ Pentaho’s Kettle
       • Data Integration Challenges
       • Prerequisites and Recent Releases
       • Pentaho DI Components
       • JDBC
       • Spoon
         ◦ Transformations
         ◦ Jobs
    3.
       • 4 major components:
         ◦ Extracting
           ▪ Gathering raw data from source systems and storing it in the ETL staging environment
           ▪ Data profiling
           ▪ Identifying data that changed since the last load
         ◦ Transforming: cleaning and conforming
           ▪ Processing data to improve its quality, format it, merge it from multiple sources, and enforce conformed dimensions
           ▪ Data cleansing
           ▪ Recording error events
           ▪ Audit dimensions
           ▪ Creating and maintaining conformed dimensions and facts
    4.
       • Data filtering
         ◦ Is not null, greater than, less than, includes
       • Field manipulation
         ◦ Trimming, padding, upper and lowercase conversion
       • Data calculations
         ◦ + - X / , average, absolute value, arctangent, natural logarithm
       • Date manipulation
         ◦ First day of month, last day of month, add months, week of year, day of year
       • Data type conversion
         ◦ String to number, number to string, date to number
       • Merging fields & splitting fields
       • Looking up data
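The transformation categories listed on this slide can be illustrated with a minimal Python sketch. It is not Kettle code: the row layout and the field names (`customer`, `code`, `amount`, `order_date`) are assumptions made up for the example, and each operation only mirrors one item from the list.

```python
import calendar
import math
from datetime import date


def first_day_of_month(d: date) -> date:
    return d.replace(day=1)


def last_day_of_month(d: date) -> date:
    # calendar.monthrange returns (weekday of day 1, number of days in the month)
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


def transform(row: dict):
    """Return a cleaned copy of the row, or None when the row is filtered out."""
    # Data filtering: "is not null" and "greater than"
    if row.get("amount") is None or float(row["amount"]) <= 0:
        return None
    amount = float(row["amount"])                     # type conversion: string to number
    order_date = date.fromisoformat(row["order_date"])
    return {
        "customer": row["customer"].strip().upper(),  # trimming + uppercase conversion
        "code": row["code"].strip().rjust(6, "0"),    # padding
        "amount": amount,
        "log_amount": math.log(amount),               # calculation: natural logarithm
        "month_start": first_day_of_month(order_date),
        "month_end": last_day_of_month(order_date),
        "week_of_year": order_date.isocalendar()[1],
        "day_of_year": order_date.timetuple().tm_yday,
        "date_number": order_date.toordinal(),        # type conversion: date to number
    }


# Hypothetical input rows; the second one is dropped by the "is not null" filter.
rows = [
    {"customer": "  acme ", "code": "42", "amount": "100.0", "order_date": "2009-05-22"},
    {"customer": "nobody", "code": "7", "amount": None, "order_date": "2009-05-22"},
]
cleaned = [r for r in (transform(row) for row in rows) if r is not None]
```

In Kettle each of these operations would typically be a separate step configured in Spoon rather than hand-written code.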
    5.
       ◦ Loading
         ▪ Loading data into data warehouse tables
         ▪ Managing hierarchies in dimensions
         ▪ Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions
         ▪ Fact table loading
         ▪ Building and maintaining bridge dimension tables
         ▪ Handling late-arriving data
         ▪ Management of conformed dimensions
         ▪ Administration of fact tables
         ▪ Building aggregations
         ▪ Building OLAP cubes
    7.
       • Complexity and significant operational problems.
       • Exceeds the designers' expectations
       • Data profiling of a source.
       • Data warehouses typically grow asynchronously.
       • Establishing the scalability of an ETL system across its lifetime.
    8.
       • Many off-the-shelf tools exist
       • High-end tools may not justify value for smaller warehouses
       • Proprietary ETL
         ◦ High upfront cost
         ◦ Long-term maintenance
       • Custom code
         ◦ Low upfront cost
         ◦ Support grows as business requirements change
    9.
       Tool                                 Vendor
       Oracle Warehouse Builder (OWB)       Oracle
       Data Integrator (BODI)               Business Objects
       IBM Information Server (Ascential)   IBM
       SAS Data Integration Studio          SAS Institute
       PowerCenter                          Informatica
       Oracle Data Integrator (Sunopsis)    Oracle
       Data Migrator                        Information Builders
       Integration Services                 Microsoft
       Talend Open Studio                   Talend
       DataFlow                             Group 1 Software (Sagent)
       Data Integrator                      Pervasive
       Transformation Server                DataMirror
       Transformation Manager               ETL Solutions Ltd.
       Data Manager                         Cognos
       DT/Studio                            Embarcadero Technologies
       ETL4ALL                              IKAN
       DB2 Warehouse Edition                IBM
       Jitterbit                            Jitterbit
       Pentaho Data Integration             Pentaho
    10.
       • Kettle – Kettle Extraction Transformation Transportation & Loading tool
       • It is part of the open source business intelligence suite by Pentaho, for powerful data integration. Pentaho was founded in 2004.
       • Products of Pentaho
         ◦ Mondrian – OLAP server written in Java
         ◦ Kettle – ETL tool
         ◦ Weka – Machine learning and data mining tool
    11.
       • Data is everywhere
       • Data is inconsistent
         ◦ Records are different in each system
       • Performance issues
         ◦ Running queries that summarize data over a long period ties up the operating system
         ◦ Pushes the OS to maximum load
       • Data is never all in the data warehouse
         ◦ Excel sheets, acquisitions, new applications
    12.
       • Metadata, model-driven approach
         ◦ What to do? And how to do it?
         ◦ Complex transformations with zero code
         ◦ Graphically design data transformations and jobs
       • 100% Java with cross-platform support
       • Extensible architecture
       • Repository-based
       • Full-featured ETL
       • Integration with Pentaho Open BI Platform
    13.
       • Prerequisites
         ◦ Java Runtime Environment 1.5 and above
         ◦ Compatible with almost any platform
         ◦ Compatible with a wide range of database technologies
       • Recent Releases
         ◦ 4/25 Data Integration 3.0.3 GA
         ◦ 4/18 Data Integration 3.1 Milestone
         ◦ 2/8 Data Integration 3.0.2 GA
         ◦ 12/12 Data Integration 3.0.1 GA
         ◦ 11/15 Data Integration 3.0 GA
         ◦ 10/31 Data Integration 3.0 RC2
         ◦ 10/24 Data Integration 2.5.2 GA
    14.
       • Pan
         ◦ A program to execute transformations designed in Spoon, stored as XML or in a database repository.
         ◦ Transformations are scheduled in batch mode to be run automatically at regular intervals.
       • Carte
         ◦ Simple web server to execute transformations and jobs remotely.
         ◦ Accepts an XML document (through a small servlet) that contains the transformation to execute and the execution configuration.
         ◦ Allows you to remotely monitor, start and stop transformations and jobs.
         ◦ A server running Carte is a slave server.
    15.
       • Spoon
         ◦ GUI that allows you to design transformations and jobs that can be run with the Kettle tools Pan and Kitchen.
         ◦ Transformations and jobs can be described in an XML file or stored in a Kettle database repository.
         ◦ Spoon is available as a shell script and a batch file, so the tool can be used in heterogeneous environments.
         ◦ The latest version of Spoon is the 3.2 beta.
       • Kitchen
         ◦ Executes jobs designed in Spoon, stored in XML or in a database repository.
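As a rough illustration of how Pan and Kitchen are driven from a scheduler, the sketch below only builds the command lines and does not run them. It assumes the `pan.sh` and `kitchen.sh` launcher scripts from the Unix distribution and the `-file` and `-level` options; the file paths are hypothetical, and the options should be checked against the documentation of the installed version.

```python
def kettle_command(tool: str, path: str, log_level: str = "Basic") -> list:
    """Build an argument list for Pan (transformations) or Kitchen (jobs)."""
    if tool not in ("pan", "kitchen"):
        raise ValueError("tool must be 'pan' or 'kitchen'")
    # Assumed option names: -file for an XML file, -level for the logging level.
    return [f"./{tool}.sh", f"-file={path}", f"-level={log_level}"]


# Hypothetical file names: .ktr for a transformation, .kjb for a job.
pan_cmd = kettle_command("pan", "/opt/etl/load_customers.ktr")
kitchen_cmd = kettle_command("kitchen", "/opt/etl/nightly_load.kjb", log_level="Detailed")

print(" ".join(pan_cmd))
print(" ".join(kitchen_cmd))
```

A scheduler such as cron could then invoke the resulting command at regular intervals, which matches the batch-mode usage described for Pan.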
    16.
       • Installing
         ◦ Ensure JRE 1.5 is installed.
         ◦ Unzip the binary distribution in any folder.
       • Launching
         ◦ spoon.bat on the Windows platform
         ◦ spoon.sh on Unix-like platforms
       • Create a shortcut with spoon.ico pointing to the .bat file
       • Supported platforms
         ◦ Microsoft Windows, including Vista
         ◦ Linux GTK: on i386 and x86_64 processors
         ◦ Apple's OSX: works on both PowerPC and Intel machines
         ◦ Solaris: using a Motif interface
       • Works on most other operating systems
         ◦ AIX, HP-UX, FreeBSD
    17.
       • JDBC: Java database connectivity tool.
       • Comes in four different types
         ◦ Type 1: JDBC-ODBC bridge
         ◦ Type 2: Native API, partial Java driver
         ◦ Type 3: Middleware Java drivers
         ◦ Type 4: Direct-to-DB Java drivers
       • JDBC 3.0 is the latest version
       • Microsoft-based databases like MS Access rely on Type 1 drivers
       • Oracle and MySQL can be connected with other types, but the driver traditionally used is the Type 4 driver.
       • JDBC can also operate in a distributed environment.
    20.
       • Key improvements
         ◦ Execution Results pane for logs, metrics and performance graph
         ◦ Improved Database Connection dialog
         ◦ Snap to grid (graphical workspace)
         ◦ Zoom (graphical workspace)
         ◦ Easier-to-use left panel for the objects palette
         ◦ Over 30 new or improved transformation steps
         ◦ 13 new or improved job entries
         ◦ Support for four new database types: MonetDB, KingbaseES, Vertica, and HP NeoView
    21.
       • Repository connection establishment
       • Auto login
         ◦ By manually setting the KETTLE_REPOSITORY, KETTLE_USER and KETTLE_PASSWORD environment variables.
       • Login
         ◦ By default, PDI provides the login username and password as admin.
         ◦ It is strongly advised to change the default password to avoid any security vulnerability.
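A small pre-flight check for the auto-login variables named on this slide could look like the following Python sketch. The helper names are hypothetical and the check is not part of PDI; it only reports which of the three variables are unset and whether the password is still the default `admin`.

```python
import os

# Variable names as given on the slide.
AUTOLOGIN_VARS = ("KETTLE_REPOSITORY", "KETTLE_USER", "KETTLE_PASSWORD")


def missing_autologin_vars(env=None) -> list:
    """Return the auto-login variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in AUTOLOGIN_VARS if not env.get(name)]


def uses_default_password(env=None) -> bool:
    """True when the password is still the default 'admin' mentioned on the slide."""
    env = os.environ if env is None else env
    return env.get("KETTLE_PASSWORD") == "admin"


# Example with an explicit, made-up environment instead of the real one.
example_env = {"KETTLE_REPOSITORY": "dev_repo", "KETTLE_USER": "admin"}
print(missing_autologin_vars(example_env))
print(uses_default_password(example_env))
```

Passing the environment as a parameter keeps the check easy to test without touching the real process environment.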
    25.
       • Transformation: an engine capable of performing a multitude of functions such as reading, manipulating and writing data to and from various data sources.
         ◦ Value: values are part of a row and can contain any type of data.
         ◦ Row: a row consists of 0 or more values.
         ◦ Output stream: an output stream is a stack of rows that leaves a step.
         ◦ Input stream: an input stream is a stack of rows that enters a step.
         ◦ Hop: a hop is a graphical representation of one or more data streams between 2 steps.
         ◦ Note: a note is a piece of information that can be added to a transformation.
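The vocabulary above (values, rows, steps, streams, hops) can be modelled in a few lines of Python. This is only a conceptual sketch, not how the Kettle engine is implemented: rows are dictionaries of values, each step is a function, and a hop is represented by passing one step's output stream to the next step as its input stream. All step and field names are made up.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]  # a row consists of 0 or more named values


def input_step(records: Iterable[Row]) -> Iterator[Row]:
    """Input step: produces the output stream that leaves this step."""
    for record in records:
        yield dict(record)


def filter_step(rows: Iterable[Row], keep: Callable[[Row], bool]) -> Iterator[Row]:
    """Passes on only the rows of its input stream that satisfy the condition."""
    return (row for row in rows if keep(row))


def calculator_step(rows: Iterable[Row]) -> Iterator[Row]:
    """Adds a new value to every row."""
    for row in rows:
        out = dict(row)
        out["total"] = row["qty"] * row["price"]
        yield out


def output_step(rows: Iterable[Row]) -> List[Row]:
    """Output step: collects the rows (a real step would write them somewhere)."""
    return list(rows)


source = [
    {"item": "tea", "qty": 2, "price": 3.0},
    {"item": "mug", "qty": 0, "price": 5.0},
]

# Each nested call below plays the role of a hop between two steps.
result = output_step(calculator_step(filter_step(input_step(source), lambda r: r["qty"] > 0)))
```

Generators are used so that rows flow through the steps one at a time instead of being materialised between steps.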
    26.
       • Jobs: a way of calling transformations and controlling the sequence of their execution. Usually jobs are scheduled in batch mode to be run automatically at regular intervals.
         ◦ Job entry: a job entry is one part of a job and performs a certain task.
         ◦ Hop: a hop is a graphical representation of one or more data streams between 2 steps.
         ◦ Note: a note is a piece of information that can be added to a job.
    27.
       • Input Steps
       • Output Steps
       • Lookup Steps
       • Transformation Steps
       • Job Steps
       • DW Steps
       • Join Steps
       • Mapping Steps
    35. Table Output Step
    36. Insert / Update Output Step
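The slides only name these two output steps. As a hedged illustration of the difference, the Python sketch below uses an in-memory SQLite table: the first function always inserts rows, while the second looks each row up by key and then inserts or updates it. This is a simplified reading of the two steps, with a made-up `customer` table, not the steps' actual implementation.

```python
import sqlite3


def table_output(conn: sqlite3.Connection, rows) -> None:
    """Plain insert of every incoming row."""
    conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)", rows)


def insert_update(conn: sqlite3.Connection, rows) -> None:
    """Look up each row by key; insert when missing, update when the value differs."""
    for key, name in rows:
        found = conn.execute("SELECT name FROM customer WHERE id = ?", (key,)).fetchone()
        if found is None:
            conn.execute("INSERT INTO customer (id, name) VALUES (?, ?)", (key, name))
        elif found[0] != name:
            conn.execute("UPDATE customer SET name = ? WHERE id = ?", (name, key))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")

table_output(conn, [(1, "Acme"), (2, "Globex")])          # initial load
insert_update(conn, [(2, "Globex Corp"), (3, "Initech")])  # one update, one insert

result = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
print(result)
```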
    37. Besides the execution order, a hop specifies the condition for the next job entry:
       • “Unconditional”: the next job entry will be executed regardless of the result of the originating job entry.
       • “Follow when result is true”: the next job entry will only be executed when the result of the originating job entry is true.
       • “Follow when result is false”: the next job entry will only be executed when the result of the originating job entry is false.
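The three hop conditions can be sketched as a tiny job runner in Python. This is a conceptual model under simplifying assumptions (job entries are functions returning True or False, the job graph has no loops); the entry names and the `run_job` helper are hypothetical and not part of Kettle.

```python
from typing import Callable, Dict, List, Tuple

UNCONDITIONAL = "unconditional"
ON_TRUE = "follow when result is true"
ON_FALSE = "follow when result is false"

Hop = Tuple[str, str, str]  # (originating entry, next entry, condition)


def run_job(entries: Dict[str, Callable[[], bool]], hops: List[Hop], start: str) -> List[str]:
    """Run entries from `start`, following only hops whose condition matches."""
    executed = []
    pending = [start]
    while pending:
        name = pending.pop(0)
        result = entries[name]()
        executed.append(name)
        for source, target, condition in hops:
            if source != name:
                continue
            if (condition == UNCONDITIONAL
                    or (condition == ON_TRUE and result)
                    or (condition == ON_FALSE and not result)):
                pending.append(target)
    return executed


# Hypothetical job: the load fails, so only the "false" branch is followed.
entries = {
    "start": lambda: True,
    "load_warehouse": lambda: False,
    "archive_files": lambda: True,
    "send_alert": lambda: True,
}
hops = [
    ("start", "load_warehouse", UNCONDITIONAL),
    ("load_warehouse", "archive_files", ON_TRUE),
    ("load_warehouse", "send_alert", ON_FALSE),
]
executed = run_job(entries, hops, "start")
print(executed)
```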
    41.
       • Brief introduction to the ETL process
       • JDBC repository connection
       • Pentaho Data Integration tool
         ◦ Components
           ▪ Pan
           ▪ Carte
           ▪ Kitchen
           ▪ Spoon
         ◦ Transformation with different input data sources
         ◦ Jobs
    42.
       • kettle.pentaho.org
         ◦ Kettle project homepage
       • kettle.javaforge.com
         ◦ Kettle community website: forum, source, documentation, tech tips, samples, …
       • www.pentaho.org/download/
         ◦ All Pentaho modules, pre-configured with sample data
         ◦ Developer forums, documentation
         ◦ Ventana Research Open Source BI Survey
       • www.mysql.com
         ◦ White paper - http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
         ◦ Kettle Webinar - http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-
         ◦ Roland Bouman blog on Pentaho Data Integration and MySQL