MaxQDPro Team
Anjan.K Harish.R
II Sem M.Tech CSE
08/13/13 MaxQDPro: Kettle- ETL Tool 1
A Pentaho Data Integration tool
 Introduction
◦ ETL Process
◦ Pentaho’s Kettle
 Data Integration Challenges
 Prerequisites and Recent Releases
 Pentaho DI Components
 JDBC
 Spoon
◦ Transformations
◦ Jobs
 Major components of the ETL process:
◦ Extracting
 Gathering raw data from source systems and storing it in the ETL staging environment
 Data profiling
 Identifying data that changed since last load
◦ Transforming: Cleaning and Conforming
 Processing data to improve its quality, format it, merge it from multiple sources, and enforce conformed dimensions
 Data cleansing
 Recording error events
 Audit dimensions
 Creating and maintaining conformed dimensions and facts
 Data filtering
◦ Is not null, greater than, less than, includes
 Field manipulation
◦ Trimming, padding, upper and lowercase conversion
 Data calculations
◦ +, -, ×, /, average, absolute value, arctangent, natural logarithm
 Date manipulation
◦ First day of month, Last day of month, add months, week of year, day of year
 Data type conversion
◦ String to number, number to string, date to number
 Merging fields & splitting fields
 Looking up data
◦ Look up in a database, in a text file, an Excel sheet, …
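The operations listed above can be sketched in a few lines of Python. This is only an illustration of the ideas, not how Kettle implements its steps: the field names (`customer`, `amount`, `order_date`, `country_code`) and the `COUNTRY_NAMES` lookup table are hypothetical and stand in for whatever the source system provides.

```python
import calendar
from datetime import date, datetime

# Hypothetical lookup table, standing in for a database or text-file lookup.
COUNTRY_NAMES = {"IN": "India", "BE": "Belgium"}


def last_day_of_month(d):
    """Date manipulation: return the last day of the month that contains d."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


def transform(rows):
    """Filter, clean, convert and enrich a list of row dictionaries."""
    out = []
    for row in rows:
        # Data filtering: keep only rows where amount "is not null".
        if row["amount"] is None:
            continue
        # Data type conversion: string to date.
        order_date = datetime.strptime(row["order_date"], "%Y-%m-%d").date()
        out.append({
            # Field manipulation: trimming and upper-case conversion.
            "customer": row["customer"].strip().upper(),
            # Type conversion (string to number) plus a calculation (absolute value).
            "amount": abs(float(row["amount"])),
            # Date manipulation: last day of month, week of year.
            "month_end": last_day_of_month(order_date),
            "week_of_year": order_date.isocalendar()[1],
            # Looking up data: resolve a code against the lookup table.
            "country": COUNTRY_NAMES.get(row["country_code"], "UNKNOWN"),
        })
    return out
```

In Spoon each of these operations would normally be a separate step on the canvas; the sketch folds them into one function only to keep the example short.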
◦ Loading
 Loading data into data warehouse tables
 Managing hierarchies in dimensions
 Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions
 Fact table loading
 Building and maintaining bridge dimension tables
 Handling late arriving data
 Management of conformed dimensions
 Administration of fact tables
 Building aggregations
 Building OLAP cubes
 Transferring DW data to other environments for specific purposes
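A minimal sketch of the loading stage, assuming an in-memory SQLite database as the warehouse. The table and column names (`dim_customer`, `fact_sales`, `customer_key`) are hypothetical; the point is only to show a dimension being maintained with surrogate keys, a fact table being loaded against it, and an aggregation being built on top.

```python
import sqlite3


def load(conn, rows):
    """Load rows into a tiny star schema: one dimension and one fact table."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer ("
        "customer_key INTEGER PRIMARY KEY AUTOINCREMENT, "
        "customer_name TEXT UNIQUE)"
    )
    cur.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales ("
        "customer_key INTEGER REFERENCES dim_customer(customer_key), "
        "amount REAL)"
    )
    for row in rows:
        # Dimension maintenance: add the member only if it is not there yet.
        cur.execute(
            "INSERT OR IGNORE INTO dim_customer (customer_name) VALUES (?)",
            (row["customer"],),
        )
        # Look up the surrogate key that the fact row will reference.
        cur.execute(
            "SELECT customer_key FROM dim_customer WHERE customer_name = ?",
            (row["customer"],),
        )
        key = cur.fetchone()[0]
        # Fact table loading.
        cur.execute(
            "INSERT INTO fact_sales (customer_key, amount) VALUES (?, ?)",
            (key, row["amount"]),
        )
    conn.commit()


def sales_by_customer(conn):
    """Building an aggregation: total sales per customer, ordered by name."""
    return conn.execute(
        "SELECT d.customer_name, SUM(f.amount) "
        "FROM fact_sales f JOIN dim_customer d USING (customer_key) "
        "GROUP BY d.customer_name ORDER BY d.customer_name"
    ).fetchall()
```

A call such as `load(sqlite3.connect(":memory:"), rows)` followed by `sales_by_customer(conn)` exercises the whole path. Late-arriving data, bridge tables and hierarchies are left out of this sketch.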
 Complexity and significant operational problems
 Source data that exceeds the designers' expectations
 Data profiling of a source
 Data warehouses typically grow asynchronously
 Establishing the scalability of an ETL system across its lifetime
 Many off-the-shelf tools exist
 High-end tools may not justify value for smaller warehouses
 Proprietary ETL
◦ High upfront cost
◦ Long term maintenance
 Custom Code
◦ Low upfront cost
◦ Support effort grows as business requirements change
Tool                                  Vendor
Oracle Warehouse Builder (OWB)        Oracle
Data Integrator (BODI)                Business Objects
IBM Information Server (Ascential)    IBM
SAS Data Integration Studio           SAS Institute
PowerCenter                           Informatica
Oracle Data Integrator (Sunopsis)     Oracle
Data Migrator                         Information Builders
Integration Services                  Microsoft
Talend Open Studio                    Talend
DataFlow                              Group 1 Software (Sagent)
Data Integrator                       Pervasive
Transformation Server                 DataMirror
Transformation Manager                ETL Solutions Ltd.
Data Manager                          Cognos
DT/Studio                             Embarcadero Technologies
ETL4ALL                               IKAN
DB2 Warehouse Edition                 IBM
Jitterbit                             Jitterbit
Pentaho Data Integration              Pentaho
 Kettle – Kettle Extraction, Transformation, Transportation & Loading tool
 It is the open source data integration tool in Pentaho's business intelligence suite. Pentaho was founded in 2004.
 Products of Pentaho
◦ Mondrian – OLAP server written in Java
◦ Kettle – ETL tool
◦ Weka – Machine learning and data mining tool
 Data is everywhere
 Data is inconsistent
◦ Records are different in each system
 Performance issues
◦ Running queries that summarize data over a long period ties up the system for the duration of the task
◦ Brings the OS to maximum load
 Data is never all in the data warehouse
◦ Excel sheets, acquisitions, new applications
 Metadata, model-driven approach
◦ What to do? And how to do it?
◦ Complex transformation with zero code
◦ Graphically design data transformation and jobs
 100% Java with cross-platform support
 Extensible architecture
 Repository-based
 Full featured ETL
 Integration with Pentaho Open BI Platform
Prerequisites
 Java Runtime Environment 1.5 and above
 Compatible with almost any platform
 Compatible with a wide range of database technologies
Recent Releases
 4/25 Data Integration 3.0.3 GA
 4/18 Data Integration 3.1 Milestone
 2/8 Data Integration 3.0.2 GA
 12/12 Data Integration 3.0.1 GA
 11/15 Data Integration 3.0 GA
 10/31 Data Integration 3.0 RC2
 10/24 Data Integration 2.5.2 GA
 10/08 Data Integration 3.0 RC1
 08/24 Data Integration 2.5.1 GA
 Pan
◦ A program that executes transformations designed in Spoon, stored as XML or in a database repository.
◦ Transformations are scheduled in batch mode to run automatically at regular intervals.
 Carte
◦ Simple web server that executes transformations and jobs remotely.
◦ Accepts XML (through a small servlet) that contains the transformation to execute and the execution configuration.
◦ Lets you remotely monitor, start and stop transformations and jobs.
◦ A server running Carte is called a slave server.
 Spoon
◦ GUI for designing transformations and jobs that can be run with the Kettle tools Pan and Kitchen.
◦ Transformations and jobs are described in an XML file or stored in a Kettle database repository.
◦ Spoon ships as a shell script and a batch file, so the tool can be used in heterogeneous environments.
◦ The latest version of Spoon is the 3.2 beta.
 Kitchen
◦ A program that executes jobs designed in Spoon, stored as XML or in a database repository.
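Running Pan or Kitchen in batch mode comes down to calling a shell script with the file to execute. The sketch below builds such a command line from Python. It assumes a Unix-like install where `pan.sh` and `kitchen.sh` accept `-file=` and `-level=` options; check the option names against the documentation of the installed version. The file paths and the install directory are hypothetical.

```python
import subprocess


def build_command(tool, path, level="Basic"):
    """Build the argument list for pan.sh (transformations) or kitchen.sh (jobs).

    The -file= and -level= options are assumptions to verify against the
    installed Kettle version.
    """
    if tool not in ("pan", "kitchen"):
        raise ValueError("tool must be 'pan' or 'kitchen'")
    return ["sh", f"{tool}.sh", f"-file={path}", f"-level={level}"]


def run(tool, path, kettle_dir, level="Basic"):
    """Run the tool from the Kettle install directory and return its exit code."""
    completed = subprocess.run(build_command(tool, path, level), cwd=kettle_dir)
    return completed.returncode
```

A scheduler such as cron could then call `run("kitchen", "/etl/nightly_load.kjb", "/opt/kettle")` at regular intervals, which is one way to get the scheduled batch runs described above.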
 Installing
◦ Ensure JRE 1.5 or above is installed.
◦ Unzip the binary distribution into any folder.
 Launching
◦ spoon.bat on Windows (create a shortcut with spoon.ico pointing to the .bat file)
◦ spoon.sh on Unix-like platforms
 Supported platforms (works on most operating systems)
◦ Microsoft Windows, including Vista
◦ Linux GTK: on i386 and x86_64 processors
◦ Apple's OSX: works on both PowerPC and Intel machines
◦ Solaris: using a Motif interface
◦ AIX, HP-UX, FreeBSD
 JDBC: Java Database Connectivity, the Java API for database access
 Latest: JDBC 3.0
 Comes in four different types
◦ Type 1: JDBC-ODBC bridge
◦ Type 2: Native API, partial Java driver
◦ Type 3: Middleware Java drivers
◦ Type 4: Direct-to-database Java drivers
 Microsoft-based databases such as MS Access rely on Type 1 drivers.
 Oracle and MySQL can be connected with other types, but the Type 4 driver is traditionally used.
 JDBC can also operate in a distributed environment.
 Key Improvements
◦ Execution Results pane for logs, metrics and performance graphs
◦ Improved Database Connection dialog
◦ Snap to grid (graphical workspace)
◦ Zoom (graphical workspace)
◦ Easier-to-use left panel for the objects palette
◦ Over 30 new or improved transformation steps
◦ 13 new or improved job entries
◦ Support for four new database types: MonetDB, KingbaseES, Vertica, and HP NeoView
◦ Improved translations
 Establishing a repository connection
 Auto login
◦ By manually setting the KETTLE_REPOSITORY, KETTLE_USER and KETTLE_PASSWORD environment variables.
 Login
◦ By default, PDI provides admin as both the login username and password.
◦ It is strongly advised to change the default password to avoid any security vulnerability.
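As a small sketch of the auto-login setup, the three variables named above can be placed in the environment of the process that launches Spoon. The helper name `repository_env` and the example values are hypothetical; only the variable names come from the slide.

```python
import os


def repository_env(repository, user, password, base_env=None):
    """Return a copy of the environment with the Kettle auto-login variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "KETTLE_REPOSITORY": repository,
        "KETTLE_USER": user,
        "KETTLE_PASSWORD": password,
    })
    return env
```

The result can be passed as the `env` argument of `subprocess.Popen` when starting `spoon.sh`. Since the password ends up in plain text in the environment, it is better read from a protected source than hard-coded in a script.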
 Transformation
An engine capable of performing a multitude of functions such as reading, manipulating and writing data to and from various data sources.
◦ Value: values are part of a row and can contain any type of data.
◦ Row: a row consists of 0 or more values.
◦ Output stream: a stack of rows that leaves a step.
◦ Input stream: a stack of rows that enters a step.
◦ Hop: a graphical representation of one or more data streams between 2 steps.
◦ Note: a piece of information that can be added to a transformation.
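The vocabulary above (values, rows, streams, steps, hops) can be modelled with Python generators. This is a conceptual sketch only and not Kettle's engine: each function plays the role of a step, a dictionary plays the role of a row, and passing one generator into the next plays the role of a hop. The step names and fields are hypothetical.

```python
def input_step(records):
    """Input step: emits rows (field name -> value) on its output stream."""
    for record in records:
        yield dict(record)


def uppercase_step(stream, field):
    """Transformation step: reads its input stream and changes one value per row."""
    for row in stream:
        row[field] = row[field].upper()
        yield row


def filter_step(stream, field, minimum):
    """Filter step: passes on only rows whose value is at least `minimum`."""
    for row in stream:
        if row[field] >= minimum:
            yield row


def output_step(stream):
    """Output step: consumes its input stream; here it just collects the rows."""
    return list(stream)


def run_transformation(records):
    """Wire the steps together; each assignment below acts as a hop."""
    stream = input_step(records)
    stream = uppercase_step(stream, "name")
    stream = filter_step(stream, "qty", 1)
    return output_step(stream)
```

Rows are pulled through the chain one at a time in a single thread here, which is a simplification chosen to keep the sketch short.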
 Jobs
A way of calling transformations and controlling the sequence of their execution. Usually jobs are scheduled in batch mode to be run automatically at regular intervals.
◦ Job Entry: one part of a job that performs a certain task.
◦ Hop: a graphical representation of one or more data streams between 2 steps.
◦ Note: a piece of information that can be added to a job.
Input Steps
Output Steps
Lookup Steps
Transformation Steps
Join Steps
DW Steps
Mapping Steps
Job Steps
Table Output Step
Insert / Update Output Step
Besides the execution order, a job hop specifies the condition under which the next job entry is executed:
· “Unconditional”: the next job entry is executed regardless of the result of the originating job entry.
· “Follow when result is true”: the next job entry is executed only when the result of the originating job entry is true.
· “Follow when result is false”: the next job entry is executed only when the result of the originating job entry is false.
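The three hop conditions can be illustrated with a tiny job runner. This is a sketch under stated assumptions, not Kettle's job engine: job entries are modelled as callables that return True or False, hops as `(source, target, condition)` tuples, and the job graph is assumed to have no cycles. All names are hypothetical.

```python
UNCONDITIONAL = "unconditional"
ON_TRUE = "follow when result is true"
ON_FALSE = "follow when result is false"


def run_job(entries, hops, start):
    """Run job entries in hop order and return the names of the executed entries.

    entries: dict mapping entry name -> callable returning True or False
    hops:    list of (source, target, condition) tuples
    start:   name of the first entry to run
    """
    executed = []
    queue = [start]
    while queue:
        name = queue.pop(0)
        result = bool(entries[name]())
        executed.append(name)
        for source, target, condition in hops:
            if source != name:
                continue
            follow = (
                condition == UNCONDITIONAL
                or (condition == ON_TRUE and result)
                or (condition == ON_FALSE and not result)
            )
            if follow:
                queue.append(target)
    return executed
```

For example, with a `load` entry that fails, a hop marked "follow when result is false" can lead to an error-mail entry, while an unconditional hop still leads to a cleanup entry.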
 Brief Introduction to ETL process
 JDBC Repository Connection
 Pentaho Data Integration Tool
◦ Components
 Pan
 Carte
 Kitchen
 Spoon
◦ Transformation with different Input Data Source
◦ Jobs
 kettle.pentaho.org
◦ Kettle project homepage
 kettle.javaforge.com
◦ Kettle community website: forum, source, documentation, tech tips, samples, …
 www.pentaho.org/download/
◦ All Pentaho modules, pre-configured with sample data
◦ Developer forums, documentation
◦ Ventana Research Open Source BI Survey
 www.mysql.com
◦ White paper - http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
◦ Kettle Webinar - http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php
◦ Roland Bouman blog on Pentaho Data Integration and MySQL
 http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html

Kettleetltool 090522005630-phpapp01

  • 1. MaxQDPro Team Anjan.K Harish.R II Sem M.Tech CSE 08/13/13 MaxQDPro: Kettle- ETL Tool 1 A Pentaho Data Integration tool
  • 2.
      ▪ Introduction
          ◦ ETL Process
          ◦ Pentaho’s Kettle
      ▪ Data Integration Challenges
      ▪ Prerequisites and Recent Releases
      ▪ Pentaho DI Components
      ▪ JDBC
      ▪ Spoon
          ◦ Transformations
          ◦ Jobs
  • 3.
      ▪ 4 major components:
          ◦ Extracting
              ▪ Gathering raw data from source systems and storing it in the ETL staging environment
              ▪ Data profiling
              ▪ Identifying data that changed since the last load
          ◦ Transforming: Cleaning and Conforming
              ▪ Processing data to improve its quality, format it, merge it from multiple sources, and enforce conformed dimensions
              ▪ Data cleansing
              ▪ Recording error events
              ▪ Audit dimensions
              ▪ Creating and maintaining conformed dimensions and facts
  • 4.
      ▪ Data filtering
          ◦ Is not null, greater than, less than, includes
      ▪ Field manipulation
          ◦ Trimming, padding, upper and lowercase conversion
      ▪ Data calculations
          ◦ +, -, ×, /, average, absolute value, arctangent, natural logarithm
      ▪ Date manipulation
          ◦ First day of month, last day of month, add months, week of year, day of year
      ▪ Data type conversion
          ◦ String to number, number to string, date to number
      ▪ Merging fields and splitting fields
      ▪ Looking up data
          ◦ Look up in a database, in a text file, in an Excel sheet, …
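The row-level operations listed on this slide can be illustrated outside Kettle with a few lines of plain Python. This is only a sketch of the ideas (filtering, trimming and case conversion, type conversion, first day of month); the field names `name`, `amount`, and `order_date` and the sample rows are made up for illustration and are not part of the slides or of Kettle itself.

```python
from datetime import date
from typing import Optional


def first_day_of_month(d: date) -> date:
    # Date manipulation: snap a date to the first day of its month.
    return d.replace(day=1)


def clean_row(row: dict) -> Optional[dict]:
    # Data filtering: drop rows whose amount is missing ("is not null").
    if row.get("amount") is None:
        return None
    return {
        # Field manipulation: trim whitespace, convert to upper case.
        "name": row["name"].strip().upper(),
        # Data type conversion: string to number.
        "amount": float(row["amount"]),
        "month_start": first_day_of_month(row["order_date"]),
    }


# Hypothetical input rows.
rows = [
    {"name": "  alice ", "amount": "12.50", "order_date": date(2013, 8, 13)},
    {"name": "bob", "amount": None, "order_date": date(2013, 8, 14)},
]
cleaned = [r for r in (clean_row(row) for row in rows) if r is not None]
```

In Kettle each of these operations would normally be a separate step configured graphically rather than hand-written code.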
  • 5.
          ◦ Loading
              ▪ Loading data into data warehouse tables
              ▪ Managing hierarchies in dimensions
              ▪ Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions
              ▪ Fact table loading
              ▪ Building and maintaining bridge dimension tables
              ▪ Handling late-arriving data
              ▪ Management of conformed dimensions
              ▪ Administration of fact tables
              ▪ Building aggregations
              ▪ Building OLAP cubes
              ▪ Transferring DW data to other environments for specific purposes
  • 7.
      ▪ Complexity and significant operational problems
      ▪ Exceeds the designers' expectations
      ▪ Data profiling of a source
      ▪ Data warehouses typically grow asynchronously
      ▪ Establishing the scalability of an ETL system across its lifetime
  • 8.
      ▪ Many off-the-shelf tools exist
      ▪ High-end tools may not justify their value for smaller warehouses
      ▪ Proprietary ETL
          ◦ High upfront cost
          ◦ Long-term maintenance
      ▪ Custom code
          ◦ Low upfront cost
          ◦ Support effort grows as business requirements change
  • 9. Tool (Vendor)
      ◦ Oracle Warehouse Builder (OWB): Oracle
      ◦ Data Integrator (BODI): Business Objects
      ◦ IBM Information Server (Ascential): IBM
      ◦ SAS Data Integration Studio: SAS Institute
      ◦ PowerCenter: Informatica
      ◦ Oracle Data Integrator (Sunopsis): Oracle
      ◦ Data Migrator: Information Builders
      ◦ Integration Services: Microsoft
      ◦ Talend Open Studio: Talend
      ◦ DataFlow: Group 1 Software (Sagent)
      ◦ Data Integrator: Pervasive
      ◦ Transformation Server: DataMirror
      ◦ Transformation Manager: ETL Solutions Ltd.
      ◦ Data Manager: Cognos
      ◦ DT/Studio: Embarcadero Technologies
      ◦ ETL4ALL: IKAN
      ◦ DB2 Warehouse Edition: IBM
      ◦ Jitterbit: Jitterbit
      ◦ Pentaho Data Integration: Pentaho
  • 10.
      ▪ Kettle – Kettle Extraction Transformation Transportation & Loading tool
      ▪ It is part of Pentaho's open source business intelligence suite and provides powerful data integration. Pentaho was founded in 2004.
      ▪ Products of Pentaho
          ◦ Mondrian – OLAP server written in Java
          ◦ Kettle – ETL tool
          ◦ Weka – Machine learning and data mining tool
  • 11.
      ▪ Data is everywhere
      ▪ Data is inconsistent
          ◦ Records are different in each system
      ▪ Performance issues
          ◦ Running queries that summarize data over a long period ties up the operating system
          ◦ Puts the OS under maximum load
      ▪ Data is never all in the data warehouse
          ◦ Excel sheets, acquisitions, new applications
  • 12.
      ▪ Metadata- and model-driven approach
          ◦ What to do, and how to do it
          ◦ Complex transformations with zero code
          ◦ Graphically design data transformations and jobs
      ▪ 100% Java with cross-platform support
      ▪ Extensible architecture
      ▪ Repository-based
      ▪ Full-featured ETL
      ▪ Integration with the Pentaho Open BI Platform
  • 13. Prerequisites
      ▪ Java Runtime Environment 1.5 and above
      ▪ Compatible with almost any platform
      ▪ Compatible with a wide range of database technologies
    Recent Releases
      ▪ 4/25 Data Integration 3.0.3 GA
      ▪ 4/18 Data Integration 3.1 Milestone
      ▪ 2/8 Data Integration 3.0.2 GA
      ▪ 12/12 Data Integration 3.0.1 GA
      ▪ 11/15 Data Integration 3.0 GA
      ▪ 10/31 Data Integration 3.0 RC2
      ▪ 10/24 Data Integration 2.5.2 GA
      ▪ 10/08 Data Integration 3.0 RC1
      ▪ 08/24 Data Integration 2.5.1 GA
  • 14.
      ▪ Pan
          ◦ A program that executes transformations designed in Spoon and stored as XML or in a database repository.
          ◦ Transformations are usually scheduled in batch mode to run automatically at regular intervals.
      ▪ Carte
          ◦ Simple web server that executes transformations and jobs remotely.
          ◦ Accepts an XML document (via a small servlet) that contains the transformation to execute and the execution configuration.
          ◦ Lets you remotely monitor, start, and stop transformations and jobs.
          ◦ A server running Carte is called a slave server.
  • 15.
      ▪ Spoon
          ◦ GUI that lets you design transformations and jobs that can be run with the Kettle tools Pan and Kitchen.
          ◦ Transformations and jobs can be described in an XML file or stored in a Kettle database repository.
          ◦ Spoon is available as a shell script and as a batch file so the tool can be used in heterogeneous environments.
          ◦ The latest version of Spoon is the 3.2 beta.
      ▪ Kitchen
          ◦ Executes jobs designed in Spoon and stored as XML or in a database repository.
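As a rough illustration of how Pan and Kitchen might be driven from a scheduler, the sketch below only builds the command lines; it does not run anything. The script names `pan.sh` and `kitchen.sh` and the `-file=` and `-level=` options follow the usual Pentaho Data Integration command-line conventions, but they are not shown in the slides, so check them against the documentation for your release. The file paths are hypothetical.

```python
from typing import List


def pan_command(transformation_file: str, level: str = "Basic") -> List[str]:
    # Pan runs a single transformation (assumed here to be a .ktr XML file).
    return ["pan.sh", f"-file={transformation_file}", f"-level={level}"]


def kitchen_command(job_file: str, level: str = "Basic") -> List[str]:
    # Kitchen runs a job (assumed here to be a .kjb XML file).
    return ["kitchen.sh", f"-file={job_file}", f"-level={level}"]


# Hypothetical paths; a scheduler such as cron could pass these lists
# to subprocess.run() at regular intervals.
nightly_transformation = pan_command("/etl/load_customers.ktr")
nightly_job = kitchen_command("/etl/nightly.kjb")
```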
  • 16.
      ▪ Installing
          ◦ Ensure JRE 1.5 is installed.
          ◦ Unzip the binary distribution into any folder.
      ▪ Launching
          ◦ spoon.bat on Windows (optionally create a shortcut with spoon.ico pointing to the bat file)
          ◦ spoon.sh on Unix-like platforms
      ▪ Supported platforms (works on most operating systems)
          ◦ Microsoft Windows, including Vista
          ◦ Linux GTK: on i386 and x86_64 processors
          ◦ Apple's OSX: works on both PowerPC and Intel machines
          ◦ Solaris: using a Motif interface
          ◦ AIX, HP-UX, FreeBSD
  • 17.
      ▪ JDBC: the Java database connectivity API (latest version: JDBC 3.0)
      ▪ Drivers come in four different types
          ◦ Type 1: JDBC-ODBC bridge
          ◦ Type 2: Native API, partial Java driver
          ◦ Type 3: Middleware Java drivers
          ◦ Type 4: Direct-to-DB Java drivers
      ▪ Microsoft-based databases such as MS Access rely on Type 1 drivers.
      ▪ Oracle and MySQL can be connected with the other types, but the Type 4 driver is the one traditionally used.
      ▪ JDBC can also operate in a distributed environment.
  • 20.
      ▪ Key improvements
          ◦ Execution Results pane for logs, metrics, and performance graph
          ◦ Improved Database Connection dialog
          ◦ Snap to grid (graphical workspace)
          ◦ Zoom (graphical workspace)
          ◦ Easier-to-use left panel for the objects palette
          ◦ Over 30 new or improved transformation steps
          ◦ 13 new or improved job entries
          ◦ Support for four new database types: MonetDB, KingbaseES, Vertica, and HP NeoView
          ◦ Improved translations
  • 21.
      ▪ Establishing a repository connection
      ▪ Auto login
          ◦ Set the KETTLE_REPOSITORY, KETTLE_USER, and KETTLE_PASSWORD environment variables manually.
      ▪ Login
          ◦ By default, PDI provides admin as both the login username and the password.
          ◦ It is strongly advised to change the default password to avoid a security vulnerability.
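A small pre-flight check can confirm that the three auto-login variables named on this slide are set before Spoon is launched. The sketch below is a hypothetical helper, not part of PDI; it only inspects the environment and never contacts a repository.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the slide.
AUTO_LOGIN_VARS = ("KETTLE_REPOSITORY", "KETTLE_USER", "KETTLE_PASSWORD")


def missing_auto_login_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    # Return the auto-login variables that are unset or empty.
    # `env` defaults to the real process environment; passing a dict
    # makes the check easy to exercise without touching os.environ.
    env = os.environ if env is None else env
    return [name for name in AUTO_LOGIN_VARS if not env.get(name)]


# Example with a hypothetical environment that lacks the password.
example_missing = missing_auto_login_vars(
    {"KETTLE_REPOSITORY": "dev_repo", "KETTLE_USER": "etl_user"}
)
```

Keeping the password in an environment variable is convenient for unattended runs, but it is also a reason to change the default credentials as the slide advises.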
  • 25.
      ▪ Transformation: an engine capable of performing a multitude of functions such as reading, manipulating, and writing data to and from various data sources.
          ◦ Value: values are part of a row and can contain any type of data.
          ◦ Row: a row consists of 0 or more values.
          ◦ Output stream: an output stream is a stack of rows that leaves a step.
          ◦ Input stream: an input stream is a stack of rows that enters a step.
          ◦ Hop: a hop is a graphical representation of one or more data streams between 2 steps.
          ◦ Note: a note is a piece of information that can be added to a transformation.
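The vocabulary on this slide (rows, steps, input and output streams, hops) can be modelled with Python generators. This is a conceptual sketch only: the step names and sample rows are invented, and the steps here run lazily in one thread, which is a simplification and not a description of how the Kettle engine schedules steps.

```python
from typing import Callable, Iterable, Iterator, List

Row = dict  # a row is a collection of named values
Step = Callable[[Iterable[Row]], Iterator[Row]]


def input_step(_: Iterable[Row]) -> Iterator[Row]:
    # An input step has no incoming stream; it produces rows.
    yield {"id": 1, "city": " pune "}
    yield {"id": 2, "city": "delhi"}


def trim_step(rows: Iterable[Row]) -> Iterator[Row]:
    # Reads its input stream and emits an output stream of trimmed rows.
    for row in rows:
        yield {**row, "city": row["city"].strip()}


def upper_step(rows: Iterable[Row]) -> Iterator[Row]:
    for row in rows:
        yield {**row, "city": row["city"].upper()}


def run_transformation(steps: List[Step]) -> List[Row]:
    stream: Iterable[Row] = iter(())
    for step in steps:
        # Each assignment plays the role of a hop: the output stream of
        # one step becomes the input stream of the next.
        stream = step(stream)
    return list(stream)


result = run_transformation([input_step, trim_step, upper_step])
```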
  • 26.
      ▪ Jobs: a way of calling transformations and controlling the sequence of their execution. Jobs are usually scheduled in batch mode to run automatically at regular intervals.
          ◦ Job entry: a job entry is one part of a job and performs a certain task.
          ◦ Hop: a hop is a graphical link between 2 job entries; it sets their execution order.
          ◦ Note: a note is a piece of information that can be added to a job.
  • 27.
      ▪ Input Steps
      ▪ Output Steps
      ▪ Lookup Steps
      ▪ Transformation Steps
      ▪ Join Steps
      ▪ DW Steps
      ▪ Mapping Steps
      ▪ Job Steps
  • 35. Table Output Step
  • 36. Insert / Update Output Step
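The slides only name the Table Output and Insert / Update steps, so the following is an assumption about the general pattern the second name suggests: look a row up by key, insert it when it is missing, and update it when it differs. The sketch uses Python's built-in sqlite3 module with a made-up `customer` table; it is not Kettle's implementation.

```python
import sqlite3


def insert_update(conn: sqlite3.Connection, key: int, name: str) -> str:
    # Look the row up by its key first.
    found = conn.execute(
        "SELECT name FROM customer WHERE id = ?", (key,)
    ).fetchone()
    if found is None:
        # Not present yet: insert, as a plain table output would.
        conn.execute("INSERT INTO customer (id, name) VALUES (?, ?)", (key, name))
        return "inserted"
    if found[0] != name:
        # Present but different: update in place.
        conn.execute("UPDATE customer SET name = ? WHERE id = ?", (name, key))
        return "updated"
    return "unchanged"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
actions = [
    insert_update(conn, 1, "Anjan"),
    insert_update(conn, 1, "Anjan K"),
    insert_update(conn, 1, "Anjan K"),
]
stored = conn.execute("SELECT id, name FROM customer").fetchall()
```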
  • 37. Besides the execution order, a job hop specifies the condition under which the next job entry runs:
      ◦ “Unconditional”: the next job entry is executed regardless of the result of the originating job entry.
      ◦ “Follow when result is true”: the next job entry is executed only when the result of the originating job entry is true.
      ◦ “Follow when result is false”: the next job entry is executed only when the result of the originating job entry is false.
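The three hop conditions can be sketched as a tiny job runner. The entry names, the tuple-based hop representation, and the `run_job` function are all hypothetical; the sketch assumes an acyclic job and is meant only to show how the conditions decide which entry runs next.

```python
from typing import Callable, Dict, List, Tuple

UNCONDITIONAL = "unconditional"
ON_TRUE = "follow when result is true"
ON_FALSE = "follow when result is false"

Hop = Tuple[str, str, str]  # (from entry, to entry, condition)


def run_job(entries: Dict[str, Callable[[], bool]], hops: List[Hop], start: str) -> List[str]:
    executed: List[str] = []
    pending = [start]
    while pending:
        name = pending.pop(0)
        result = entries[name]()  # each job entry reports success or failure
        executed.append(name)
        for source, target, condition in hops:
            if source != name:
                continue
            if (
                condition == UNCONDITIONAL
                or (condition == ON_TRUE and result)
                or (condition == ON_FALSE and not result)
            ):
                pending.append(target)
    return executed


# Hypothetical job: the load fails, so only the failure and
# unconditional branches run.
entries = {
    "load": lambda: False,
    "archive": lambda: True,
    "notify_failure": lambda: True,
    "write_log": lambda: True,
}
hops = [
    ("load", "archive", ON_TRUE),
    ("load", "notify_failure", ON_FALSE),
    ("load", "write_log", UNCONDITIONAL),
]
executed = run_job(entries, hops, "load")
```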
  • 41.
      ▪ Brief introduction to the ETL process
      ▪ JDBC repository connection
      ▪ Pentaho Data Integration tool
          ◦ Components
              ▪ Pan
              ▪ Carte
              ▪ Kitchen
              ▪ Spoon
          ◦ Transformation with different input data sources
          ◦ Jobs
  • 42.
      ▪ kettle.pentaho.org
          ◦ Kettle project homepage
      ▪ kettle.javaforge.com
          ◦ Kettle community website: forum, source, documentation, tech tips, samples, …
      ▪ www.pentaho.org/download/
          ◦ All Pentaho modules, pre-configured with sample data
          ◦ Developer forums, documentation
          ◦ Ventana Research Open Source BI Survey
      ▪ www.mysql.com
          ◦ White paper - http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
          ◦ Kettle Webinar - http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php
          ◦ Roland Bouman blog on Pentaho Data Integration and MySQL
              ▪ http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html