Kettle – ETL Tool
Sreenivas K
Agenda

Introduction
− ETL Process
− Pentaho's Kettle

Data Integration Challenges

Prerequisites and Recent Releases

Pentaho DI Components

Spoon
− Transformations
− Jobs
Introduction – ETL Process

Major Components
− Extracting

Gathering raw data from source systems and storing it in ETL staging
environment

Data Profiling

Identifying data that changed since last load.
− Transforming: Cleaning and Conforming

Processing data to improve its quality, format it, merge from multiple
sources, enforce conformed dimensions

Data cleansing

Recording error events

Audit dimensions

Creating and maintaining conformed dimensions and facts
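
As a rough illustration of "identifying data that changed since last load", the sketch below filters source rows by a last-modified timestamp. The field name `updated_at`, the function `rows_changed_since` and the sample rows are hypothetical and are not part of Kettle; real sources may need other change-detection techniques.

```python
from datetime import datetime


def rows_changed_since(rows, last_load):
    """Return rows whose (hypothetical) 'updated_at' field is later than last_load."""
    return [row for row in rows if row["updated_at"] > last_load]


# Made-up source rows for illustration only.
source_rows = [
    {"id": 1, "updated_at": datetime(2008, 4, 1)},
    {"id": 2, "updated_at": datetime(2008, 4, 20)},
    {"id": 3, "updated_at": datetime(2008, 4, 26)},
]

changed = rows_changed_since(source_rows, last_load=datetime(2008, 4, 15))
changed_ids = [row["id"] for row in changed]
```

Only the rows modified after the previous load are passed on to the transforming stage.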
Introduction – ETL Process
− Loading

Loading data into data warehouse tables

Managing hierarchies in dimensions

Managing special dimensions such as date and time, junk, mini, shrunken,
small static, and user-maintained dimensions

Fact table loading

Building and maintaining bridge dimension tables

Handling late arriving data

Management of conformed dimensions

Administration of fact tables

Building aggregations

Building OLAP cubes

Transferring DW data to other environments for specific purposes
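
A minimal sketch of fact table loading, assuming the common approach of replacing source business keys with dimension surrogate keys, and routing facts whose dimension row has not arrived yet to a placeholder "unknown" member. The names `UNKNOWN_KEY`, `load_facts` and `customer_code` are hypothetical.

```python
UNKNOWN_KEY = 0  # hypothetical surrogate key reserved for a not-yet-loaded dimension row


def load_facts(sales, customer_dim):
    """Replace each sale's business key with the dimension's surrogate key.

    customer_dim maps business key -> surrogate key. Sales whose customer is
    missing from the dimension (late arriving data) get UNKNOWN_KEY.
    """
    facts = []
    for sale in sales:
        key = customer_dim.get(sale["customer_code"], UNKNOWN_KEY)
        facts.append({"customer_key": key, "amount": sale["amount"]})
    return facts


customer_dim = {"C001": 1, "C002": 2}
sales = [
    {"customer_code": "C001", "amount": 100},
    {"customer_code": "C999", "amount": 40},  # customer not in the dimension yet
]
facts = load_facts(sales, customer_dim)
```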
Data Transformation and
Integration Examples

Data filtering
− Is not null, greater than, less than, includes

Field manipulation
− Trimming, padding, upper and lowercase conversion

Data calculations
− +, -, ×, /, average, absolute value, arctangent, natural logarithm

Date manipulation
− First day of month, Last day of month, add months, week of year, day of year

Data type conversion
− String to number, number to string, date to number

Merging fields & splitting fields

Looking up data
− Look up in a database, in a text file, an excel sheet, …
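
The examples above can be sketched in plain Python to show what each kind of operation does to a single value. This is only an illustration of the operations, not of how Kettle steps implement them; all function names here are hypothetical.

```python
import calendar
from datetime import date


def is_not_null(value):
    """Data filtering: keep only values that are present."""
    return value is not None


def clean_name(raw):
    """Field manipulation: trim whitespace and convert to upper case."""
    return raw.strip().upper()


def pad_code(code, width=6):
    """Field manipulation: left-pad a code with zeros."""
    return str(code).rjust(width, "0")


def first_day_of_month(d):
    """Date manipulation: first day of the month containing d."""
    return d.replace(day=1)


def last_day_of_month(d):
    """Date manipulation: last day of the month containing d."""
    last = calendar.monthrange(d.year, d.month)[1]
    return d.replace(day=last)


def split_full_name(full_name):
    """Splitting fields: split at the first space into two fields."""
    first, _, last = full_name.partition(" ")
    return first, last


d = date(2008, 2, 14)
day_of_year = d.timetuple().tm_yday   # date manipulation: day of year
as_number = int("123")                # data type conversion: string to number
as_string = str(45.5)                 # data type conversion: number to string
```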
Introduction – Pentaho Kettle

Kettle – Kettle Extraction, Transformation, Transportation & Loading tool

It is part of Pentaho's open source business intelligence suite and provides powerful data integration. Pentaho was founded in 2004.

Products of Pentaho
− Mondrian – OLAP server written in Java
− Kettle – ETL tool
Data Integration - Challenges

Data is everywhere

Data is inconsistent
− Records are different in each system

Performance issues
− Running queries to summarize data over a long stipulated period places a heavy load on the system

Data is never all in Data Warehouse
− Excel sheet, acquisition, new application
Prerequisites

Java Runtime Environment 1.5 and above

Compatible with almost any platform

Compatible with a wide range of database technologies

Recent Releases

4/25 Data Integration 3.0.3 GA

4/18 Data Integration 3.1 Milestone

2/8 Data Integration 3.0.2 GA

12/12 Data Integration 3.0.1 GA

11/15 Data Integration 3.0 GA

10/31 Data Integration 3.0 RC2

10/24 Data Integration 2.5.2 GA

10/08 Data Integration 3.0 RC1

08/24 Data Integration 2.5.1 GA
Pentaho Components

Spoon
− GUI that allows you to design transformations and jobs that can be run with the Kettle tools: Pan and Kitchen
− Transformations and jobs can be described in an XML file or stored in a Kettle database repository.
− Spoon is available as an executable script and as a batch file, so the tool can be used in heterogeneous environments.

Pan
− A program that executes transformations designed in Spoon and stored as XML or in a database repository.
− Transformations are scheduled in batch mode to be run automatically at regular intervals

Kitchen
− Executes jobs designed in Spoon and stored as XML or in a database repository
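
As a hedged sketch of how Pan and Kitchen might be invoked from a scheduler script, the helpers below only build the command lines and do not run anything. The script names `pan.sh` and `kitchen.sh` and the `-file` and `-level` options follow the PDI command-line documentation as best understood here and should be checked against the installed version; the file paths and function names are hypothetical.

```python
def build_pan_command(transformation_file, log_level="Basic", script="pan.sh"):
    """Build an argument list for running a transformation file with Pan.

    Assumption: Pan accepts -file=<path> and -level=<log level>; verify
    against the documentation of your PDI release.
    """
    return [script, f"-file={transformation_file}", f"-level={log_level}"]


def build_kitchen_command(job_file, log_level="Basic", script="kitchen.sh"):
    """Build an argument list for running a job file with Kitchen (same assumptions)."""
    return [script, f"-file={job_file}", f"-level={log_level}"]


# Hypothetical paths, for illustration only.
pan_cmd = build_pan_command("/etl/load_customers.ktr")
kitchen_cmd = build_kitchen_command("/etl/nightly_load.kjb", log_level="Detailed")
```

On a machine where PDI is installed, such a list could be handed to `subprocess.run` by a cron-style scheduler to run the work in batch mode.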

Establishing a Repository Connection

Auto login
− By manually setting the KETTLE_REPOSITORY, KETTLE_USER and KETTLE_PASSWORD environment variables.

Login
− By default, PDI provides admin as both the login username and the password.
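
A small sketch of preparing the three auto-login variables before starting Spoon from a launcher script. The variable names come from the slide; the function `spoon_environment` and the repository name `dev_repo` are hypothetical.

```python
import os


def spoon_environment(repository, user, password, base_env=None):
    """Return a copy of the environment with the Kettle auto-login variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["KETTLE_REPOSITORY"] = repository
    env["KETTLE_USER"] = user
    env["KETTLE_PASSWORD"] = password
    return env


# base_env={} keeps the example independent of the current machine's environment.
env = spoon_environment("dev_repo", "admin", "admin", base_env={})
```

The resulting mapping could be passed as the `env` argument of `subprocess.Popen` when launching the Spoon script.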

Transformation
An engine capable of performing a multitude of functions such as reading, manipulating and writing data to and from various data sources.
− Value: a value is part of a row and can contain any type of data
− Row: a row consists of 0 or more values
− Output stream: an output stream is a stack of rows that leaves a step
− Input stream: an input stream is a stack of rows that enters a step
− Hop: a hop is a graphical representation of one or more data streams between 2 steps
− Note: a note is a piece of information that can be added to a transformation
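
The terms above can be pictured with a toy pipeline in which each step is a Python generator, a row is a dictionary of values, and passing one generator to the next plays the role of a hop. This is a conceptual sketch only, not how the Kettle engine is implemented; the step names and sample rows are made up.

```python
def input_step():
    """Input step: produces the output stream of rows."""
    yield {"name": " alice ", "amount": "10"}
    yield {"name": "bob", "amount": None}
    yield {"name": " carol", "amount": "25"}


def filter_step(rows):
    """Filter step: drops rows whose 'amount' value is null."""
    for row in rows:
        if row["amount"] is not None:
            yield row


def convert_step(rows):
    """Conversion step: trims the name and converts amount from string to number."""
    for row in rows:
        yield {"name": row["name"].strip().title(), "amount": int(row["amount"])}


def output_step(rows):
    """Output step: collects the input stream into a list (stand-in for a target table)."""
    return list(rows)


# Each nested call acts as a hop connecting the output stream of one step
# to the input stream of the next.
result = output_step(convert_step(filter_step(input_step())))
```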

Jobs
A way of calling transformations and controlling the sequence of their execution. Usually jobs are scheduled in batch mode to be run automatically at regular intervals.
− Job Entry: a job entry is one part of a job and performs a certain task
− Hop: a hop is a graphical representation of one or more data streams between 2 steps
− Note: a note is a piece of information that can be added to a job
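
To illustrate the idea of a job controlling the sequence of execution, the sketch below runs a list of entries in order and stops at the first one that fails. It assumes a simple stop-on-failure policy; the entry names and the `run_job` function are hypothetical, and real jobs can also branch on success or failure.

```python
def run_job(entries):
    """Run job entries in order; stop at the first entry that reports failure.

    entries is a list of (name, action) pairs, where action() returns True
    on success. Returns (overall_success, names_of_entries_that_ran).
    """
    executed = []
    for name, action in entries:
        executed.append(name)
        if not action():
            return False, executed
    return True, executed


# Stand-ins for transformations; the second one simulates a failure.
job = [
    ("load_staging", lambda: True),
    ("load_dimensions", lambda: False),
    ("load_facts", lambda: True),
]
success, executed = run_job(job)
```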
Input Steps
Output Steps
Lookup Steps
Transformation Steps
Join Steps
DW Steps
Mapping Steps
Job Steps