Introduction To Pentaho Kettle

Presentation by Dan Moore at the Boulder Java User's Group on August 13, 2013. See more at http://boulderjug.org

Comment: License is now Apache, not GPL!

  1. Pentaho Data Integration/Kettle
     ● Dan Moore
     ● 8z Real Estate
     ● Kettle user for two years
  2. Questions
     ● Who has used a relational database?
     ● Who has written scripts or Java code to munge data from one source and load it to another?
       – What did you use?
       – Scripts
       – Custom Java code
       – ETL tool
  3. What is Kettle?
     ● Batch data integration and processing tool written in Java
     ● Exists to retrieve, process and load data
     ● ETL
       – Extract, transform and load
     ● PDI (Pentaho Data Integration) is synonymous with Kettle
  4. What is Kettle good for?
     ● Mirroring data from master to slave
     ● Syncing two data sources
     ● Processing data retrieved from multiple sources and pushed to multiple destinations
     ● Loading data to RDBMS
     ● Datamart/data warehouse
       – Dimension lookup/update step
     ● Graphical manipulation of data
  5. Alternatives
     ● Code
       – Custom Java
       – Spring Batch
     ● Scripts
       – Perl, Python, shell, etc.
       – Possibly + db loader tool and cron
     ● Commercial ETL tools
       – Oracle Warehouse Builder
       – DataStage
       – Informatica
       – SQL Server Integration Services
     ● Open source ETL tools
       – Talend
       – KETL
       – Clover.ETL
     ● Special case tools
       – SymmetricDS
       – Db replication
  6. Why Kettle is better
     ● Higher level than code
       – Graphical interface
       – No connection pooling to worry about
       – No DDL to write
       – Validation/business rules
     ● Well-tested, full suite of components
     ● Data analysis tools
       – Preview
       – Data profiling with DataCleaner (add-on)
     ● Free (as in beer and speech)
       – Two editions
       – GPLv2
     ● Performant?
       – Developer vs computer performant
       – Depends, right?
       – Sitemailsame job 10k rows/second for 125M rows
     ● Leverage Java
       – JVM tuning skills
       – Java libraries and logic (in jars)
  7. Data sources
     ● Files
     ● Databases
     ● NoSQL
     ● REST
     ● XML
     ● Hadoop/HBase
     ● JSON
     ● Excel
     ● EDI
     ● RSS
     ● Google Analytics
  8. Kettle concepts
     ● Repository
     ● Rows/Stream
     ● Steps
     ● Job
     ● Transformation
  9. Demo 1: one-way sync
     ● Sync tables
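The slides do not show how the sync demo was built, so the following is only a minimal plain-Java sketch of the logic a one-way table sync has to perform: compare source and target rows by key and work out what to insert, update, and delete. The class `OneWaySync`, its `Plan` type, and the representation of a table as a key-to-value map are all assumptions made for illustration, not Kettle APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical helper: plans a one-way sync from a source table to a target table. */
final class OneWaySync {

    /** Keys to insert, update, and delete so the target ends up equal to the source. */
    static final class Plan {
        final List<Integer> inserts = new ArrayList<>();
        final List<Integer> updates = new ArrayList<>();
        final List<Integer> deletes = new ArrayList<>();
    }

    /** Each map stands in for a table: primary key -> row contents. */
    static Plan plan(Map<Integer, String> source, Map<Integer, String> target) {
        Plan plan = new Plan();
        for (Map.Entry<Integer, String> row : source.entrySet()) {
            if (!target.containsKey(row.getKey())) {
                plan.inserts.add(row.getKey());          // missing in target
            } else if (!Objects.equals(row.getValue(), target.get(row.getKey()))) {
                plan.updates.add(row.getKey());          // present but different
            }
        }
        for (Integer key : target.keySet()) {
            if (!source.containsKey(key)) {
                plan.deletes.add(key);                   // no longer in source
            }
        }
        return plan;
    }
}
```

Because the sync is one way, the source is never modified; only the target receives changes.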
  10. Demo 2: processing
      ● Process data from one table and replace some values, filter some values
      ● Lookup table
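For readers who were not at the demo, here is a rough plain-Java equivalent of the "replace some values, filter some values, lookup table" pattern. The class `CleanAndFilter`, the state-code data, and the normalization rule are hypothetical; in Kettle this work would be done by steps in a transformation, not hand-written code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a replace + lookup + filter transformation. */
final class CleanAndFilter {

    /** Normalizes each code, replaces it via the lookup table, and drops rows with no match. */
    static List<String> process(List<String> codes, Map<String, String> lookup) {
        List<String> out = new ArrayList<>();
        for (String code : codes) {
            String key = code.trim().toUpperCase(Locale.ROOT); // replace: normalize the value
            String name = lookup.get(key);                     // lookup: code -> display name
            if (name != null) {                                // filter: skip unknown codes
                out.add(name);
            }
        }
        return out;
    }
}
```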
  11. Demo 3: log file processing
      ● Load Apache logs for analysis
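Loading Apache logs starts with splitting each line into fields. The sketch below assumes the logs use the Apache Common Log Format; the slides do not say which format the demo used, and the class `AccessLogLine` is invented for this example. It parses one line with a regular expression and returns `null` for lines that do not match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parsed row from an Apache access log (Common Log Format assumed). */
final class AccessLogLine {
    // host ident user [timestamp] "method path protocol" status bytes
    private static final Pattern COMMON_LOG = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+|-)");

    final String host;
    final String timestamp;
    final String method;
    final String path;
    final int status;
    final long bytes;

    private AccessLogLine(String host, String timestamp, String method,
                          String path, int status, long bytes) {
        this.host = host;
        this.timestamp = timestamp;
        this.method = method;
        this.path = path;
        this.status = status;
        this.bytes = bytes;
    }

    /** Returns the parsed line, or null when the line is not in the expected format. */
    static AccessLogLine parse(String line) {
        Matcher m = COMMON_LOG.matcher(line);
        if (!m.find()) {
            return null;
        }
        // Apache writes "-" when no bytes were sent; treat that as zero.
        long bytes = "-".equals(m.group(9)) ? 0L : Long.parseLong(m.group(9));
        return new AccessLogLine(m.group(1), m.group(4), m.group(5),
                                 m.group(6), Integer.parseInt(m.group(8)), bytes);
    }
}
```

Each parsed object corresponds to one row that could then be loaded into a table for analysis.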
  12. What it is not good for
      ● User interfaces/user interaction
      ● Small data sets
        – 500 (from experience)
      ● Web applications
      ● One-off processes?
        – One-off becomes regular
  13. Who uses it
      ● Survey results
        – ~20 people
      ● Number of downloads: 110K downloads of Kettle 4.4
        – Since Nov 2012
      ● Our specific use
        – MLS data
          ● Different data source formats and types (JDBC, local CSV, FTP)
        – Public records data
          ● Fixed-width files
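Fixed-width files, as mentioned for the public records data, carry no delimiters: each field occupies a known column range. The sketch below shows the idea in plain Java. The field names and widths are made up for the example and are not the layout of any real public records file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical fixed-width record parser; the layout below is invented for illustration. */
final class FixedWidthParser {
    private static final String[] NAMES = {"parcel_id", "owner", "zip"};
    private static final int[] WIDTHS = {8, 20, 5};

    /** Slices one record into trimmed fields; fields past the end of a short record are empty. */
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        int start = 0;
        for (int i = 0; i < NAMES.length; i++) {
            int end = Math.min(start + WIDTHS[i], record.length());
            String raw = start < record.length() ? record.substring(start, end) : "";
            fields.put(NAMES[i], raw.trim());   // padding spaces are not part of the value
            start += WIDTHS[i];
        }
        return fields;
    }
}
```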
  14. Larger picture
      ● Kettle is 10 years old
        – Joined Pentaho about 7 years ago
      ● Open source, at version 4.4
        – GPLv2 license
        – Enterprise Edition (EE) available
      ● BI suite
        – Reporting
        – Analytics
        – Dashboards
        – Machine learning (Weka)
  15. Kettle tools
      ● Spoon
      ● Kitchen
      ● Pan
      ● Carte
        – Clustering tool
  16. Advanced topics
      ● Existing Java logic
        – Embedded
        – Polygon example
        – Demo 4
      ● Deployment
        – Variables
        – Config files are your friend
      ● Mapping/Parameterization
        – Subroutines of logic
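The "variables plus config files" advice comes down to keeping environment-specific values (hosts, paths, credentials) out of the job definition and resolving them at run time. Kettle references variables as `${NAME}`; the sketch below imitates that substitution in plain Java against a `java.util.Properties` object. The class `VariableSubstitution` and the `DB_HOST` key are hypothetical, and this is not Kettle's own implementation.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of ${NAME}-style variable substitution from a config file. */
final class VariableSubstitution {
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${NAME} with its configured value; unknown names are left untouched. */
    static String resolve(String template, Properties config) {
        Matcher m = VARIABLE.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = config.getProperty(m.group(1));
            String replacement = value != null ? value : m.group(0);
            // quoteReplacement keeps '$' and '\' in values from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this approach the same definition can run in development and production by swapping only the properties file.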
  17. Advanced topics, continued
      ● Testing
        – Who tests?
      ● Version control
        – Who uses version control?
      ● Error handling
        – Email
        – Log files
  18. Getting started
      ● Download
        – SourceForge
      ● Includes over 150 example transformations
        – MySQL 3.14 JDBC driver
      ● Helpful sites
        – Forums: http://forums.pentaho.com/forumdisplay.php?135-Pentaho-Data-Integration-Kettle
        – Wiki: http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+Steps
        – Testing: http://www.mooreds.com/wordpress/pentaho-kettle-testing
      ● Helpful books
        – Pentaho Kettle Solutions: Casters, Bouman, van Dongen
      ● Barely scratched the surface
      ● Don't like tools that turn me into a mechanic
