Pentaho Data Integration
January, 2014
Alex Rayón Jerez
alex.rayon@deusto.es
DeustoTech Learning – Deusto Institute of Technology – University of Deusto
Avda. Universidades 24, 48007 Bilbao, Spain
www.deusto.es
Before starting…

Who has used a relational database?
Source: http://www.agiledata.org/essays/databaseTesting.html
Before starting… (II)

Who has written scripts or Java code to move data from one source and load it to another?
Source: http://www.theguardian.com/teacher-network/2012/jan/10/how-to-teach-code
Before starting… (III)

What did you use?
1. Scripts
2. Custom Java Code
3. ETL
Table of Contents
● Pentaho at a glance
● In the academic field
● ETL
● Kettle
● Big Data
● Predictive Analytics
Pentaho at a glance
Business Intelligence
Pentaho at a glance (II)
Pentaho at a glance (III)
● Business Intelligence & Analytics
● Open Core
○ GPL v2
○ Apache 2.0
○ Enterprise and OEM licenses
● Java-based
● Web front-ends
Pentaho at a glance (IV)
● The Pentaho Stack
○ Data Integration / ETL
○ Big Data / NoSQL
○ Data Modeling
○ Reporting
○ OLAP / Analysis
○ Data Visualization
○ Dashboarding
○ Data Mining / Predictive Analysis
○ Scheduling

Source: http://helicaltech.com/blogs/hire-pentaho-consultants-hire-pentaho-developers/
Pentaho at a glance (V)
● Modules
○ Pentaho Data Integration
■ Kettle
○ Pentaho Analysis
■ Mondrian
○ Pentaho Reporting
○ Pentaho Dashboards
○ Pentaho Data Mining
■ WEKA
Pentaho at a glance (VI)
● Figures
○ More than 10,000 deployments
○ More than 185 countries
○ More than 1,200 customers
○ Since 2012, in Gartner Magic Quadrant for BI Platforms
○ 1 download every 30 seconds
Pentaho at a glance (VII)
● Open Source Leader
Pentaho at a glance (VIII)
Single Platform
Academic field
Academic field (II)
Academic field (III)
Academic field (IV)
Academic field (V)
Academic field (VI)
ETL
Definition and characteristics

● An ETL tool is a tool that
○ Extracts data from various data sources (usually legacy data)
○ Transforms data
■ from → being optimized for transactions
■ to → being optimized for reporting and analysis
■ synchronizes the data coming from different databases
■ cleanses the data to remove errors
○ Loads data into a data warehouse
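
The three stages above can be sketched in a few lines of hand-written Python. This is only an illustration of the extract / transform / load split, not how Kettle or any other ETL tool is implemented. The CSV content, the `sales` table and all function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical legacy export: stray whitespace, inconsistent casing, one bad row.
LEGACY_CSV = """id,name,amount
1, alice ,10.50
2,BOB,not-a-number
3,Carol,7.25
"""


def extract(text):
    """Extract: read rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cleanse values and drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # cleansing: skip rows whose amount cannot be parsed
        clean.append((int(row["id"]), row["name"].strip().title(), amount))
    return clean


def load(rows, conn):
    """Load: write the cleansed rows into the (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run the three stages in order and return what ended up in the table."""
    load(transform(extract(LEGACY_CSV)), conn)
    return conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
```

Calling `run_etl(sqlite3.connect(":memory:"))` loads the two valid rows and silently drops the malformed one; a real pipeline would normally log or quarantine rejected rows instead.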
ETL
Why do I need it?

● ETL tools save time and money when
developing a data warehouse by removing
the need for hand-coding
● It is very difficult for database administrators
to connect between different brands of
databases without using an external tool
● In the event that databases are altered or new
databases need to be integrated, a lot of hand-coded work needs to be completely redone
ETL
Business Intelligence

● ETL is the heart and soul of business intelligence (BI)
○ ETL processes bring together and combine data from multiple source systems into a data warehouse

Source: http://datawarehouseujap.blogspot.com.es/2010/08/data-warehouse.html
ETL
Business Intelligence (II)

Source: http://www.dwuser.com/news/tag/optimization/

According to most practitioners, ETL design and development work consumes 60 to 80 percent of an entire BI project
Source: The Data Warehousing Institute. www.dw-institute.com
ETL
Processing framework

Source: The Data Warehousing Institute. www.dw-institute.com
ETL
Tools

Source: http://www.slideshare.net/jade_22/kettleetltool-090522005630phpapp01
ETL
Open Source tools

● CloverETL
● KETL
● Kettle
● Talend
ETL
CloverETL

● Create a basic archive of functions
for mapping and transformations,
allowing companies to move large
amounts of data as quickly and
efficiently as possible
● Uses building blocks called
components to create a
transformation graph, which is a
visual depiction of the intended
data processing
ETL
CloverETL (II)

● The graphic presentation simplifies even
complex data transformations, allowing for
drag-and-drop functionality
● Limited to approximately 40 different
components to simplify graph creation
○ Yet you may configure each component to meet
specific needs

● It also features extensive debugging
capabilities to ensure all transformation
graphs work precisely as intended
ETL
KETL

● Contains a scalable, platform-independent
engine capable of supporting multiple
computers and 64-bit servers
● The program also offers performance
monitoring, extensive data source support,
XML compatibility and a scheduling engine for
time-based and event-driven job execution
ETL
Kettle
● The Pentaho company produced Kettle as an open source alternative to commercial ETL software
○ No relation to Kinetic Networks' KETL
● Kettle features a drag-and-drop graphical environment with progress feedback for all data transactions, including automatic documentation of executed jobs
● XML Input Stream handles huge XML files without a loss in performance or a spike in memory usage
● Users can also upgrade the free Kettle version for optional paid features and dedicated technical support
ETL
Talend
● Provides a graphical environment for data integration,
migration and synchronization
● Drag and drop graphic components to create the java code
required to execute the desired task, saving time and
effort
● Pre-built connectors to enable compatibility with a wide
range of business systems and databases
● Users gain real-time access to corporate data, allowing for
the monitoring and debugging of transactions to ensure
smooth data integration
ETL
Comparison

● The criteria used for the ETL tools comparison were divided into nine categories:
○ TCO
○ Risk
○ Ease of use
○ Support
○ Deployment
○ Speed
○ Data Quality
○ Monitoring
○ Connectivity
ETL
Comparison (II)
ETL
Comparison (III)
● Total Cost of Ownership
○ The overall cost for a certain
product.
○ This can mean initial ordering, licensing, servicing, support, training, consulting, and any other additional payments that need to be made before the product is in full use
○ Commercial Open Source
products are typically free to
use, but the support, training and
consulting are what companies
need to pay for
ETL
Comparison (IV)
● Risk
○ There are always risks with projects, especially big
projects.
○ The risks for projects failing are:
■ Going over budget
■ Going over schedule
■ Not completing the requirements or expectations of
the customers
○ Open Source products have much lower risk than commercial ones since they do not restrict the use of their products through pricey licenses
ETL
Comparison (V)
● Ease of use
○ All of the ETL tools, apart from Inaport, have a GUI to simplify the development process
○ Having a good GUI also reduces the time needed to learn and use the tools
○ Pentaho Kettle has the easiest-to-use GUI of all the tools
■ Training can also be found online or within the community
ETL
Comparison (VI)
● Support
○ Nowadays, all software products have support, and all of the ETL tool providers offer it
○ Pentaho Kettle: offers support from the US and the UK, and has a partner consultant in Hong Kong
● Deployment
○ Pentaho Kettle is a stand-alone Java engine that can run on any machine that can run Java. It needs an external scheduler to run automatically.
○ It can be deployed on many different machines, which are then used as “slave servers” to help with transformation processing.
○ Recommended: one 1 GHz CPU and 512 MB of RAM
ETL
Comparison (VII)
● Speed
○ The speed of ETL tools depends largely on the data that
needs to be transferred over the network and the
processing power involved in transforming the data.
○ Pentaho Kettle is faster than Talend, but the Java connector slows it down somewhat. Like Talend, it also requires manual tweaking. It can be clustered by being placed on many machines to reduce network traffic
ETL
Comparison (VIII)
● Data Quality
○ Data Quality is fast becoming the most important feature
in any data integration tool.
○ Pentaho: has DQ features in its GUI and allows customized SQL statements, JavaScript and regular expressions. Additional modules are available after subscribing.
● Monitoring
○ Pentaho Kettle – has practical monitoring tools and
logging
ETL
Comparison (IX)
● Connectivity
○ In most cases, ETL tools transfer data from legacy systems
○ Their connectivity is very important to the usefulness of
the ETL tools.
○ Kettle can connect to a very wide variety of databases, flat files, XML files, Excel files and web services.
Kettle
Introduction

Project Kettle
Powerful Extraction, Transformation and
Loading (ETL) capabilities using an
innovative, metadata-driven approach
Kettle
Introduction (II)

● What is Kettle?
○ Batch data integration
and processing tool
written in Java
○ Exists to retrieve,
process and load data
○ PDI (Pentaho Data Integration) is a synonymous term
Source: http://www.dreamstime.com/stock-photo-very-old-kettle-isolated-image16622230
Kettle
Introduction (III)

● It uses an innovative metadata-driven approach
● It has a very easy-to-use GUI
● Strong community of 13,500 registered users
● It uses a stand-alone Java engine that processes the tasks for moving data between many different databases and files
Kettle
Introduction (IV)
Kettle
Data Integration Platform

Source: http://download.101com.com/tdwi/research_report/2003ETLReport.pdf
Kettle
Architecture

Source: Pentaho Corporation
Kettle
Most common uses

● Datawarehouse and datamart loads
● Data Integration
● Data cleansing
● Data migration
● Data export
● etc.
Kettle
Data Integration

● Changing input to desired output
● Jobs
○ Synchronous workflow of job
entries (tasks)
● Transformations
○ Stepwise parallel & asynchronous
processing of a recordstream
● Distributed
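
The contrast between the two execution models can be sketched as follows. This is a simplified model under stated assumptions, not Kettle's engine or API: `run_job`, `run_transformation` and `run_step` are hypothetical names, each transformation step is given its own thread, and rows travel between steps through in-memory queues.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the record stream


def run_step(func, inbox, outbox):
    """Start one step in its own thread: read rows, process them, pass them on."""
    def worker():
        while True:
            row = inbox.get()
            if row is _DONE:
                outbox.put(_DONE)  # propagate end-of-stream to the next step
                return
            result = func(row)
            if result is not None:  # returning None filters the row out
                outbox.put(result)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def run_transformation(rows, steps):
    """Transformation: all steps run at once and stream rows to each other."""
    queues = [queue.Queue() for _ in range(len(steps) + 1)]
    threads = [run_step(f, queues[i], queues[i + 1]) for i, f in enumerate(steps)]
    for row in rows:
        queues[0].put(row)
    queues[0].put(_DONE)
    for thread in threads:
        thread.join()
    output = []
    while True:
        row = queues[-1].get()
        if row is _DONE:
            return output
        output.append(row)


def run_job(entries):
    """Job: entries run strictly one after another; stop at the first failure."""
    completed = []
    for name, entry in entries:
        if not entry():
            break
        completed.append(name)
    return completed
```

In this sketch a job entry is any callable returning True or False, so a whole transformation can be wrapped as a single entry, which mirrors how a job orchestrates transformations as tasks.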
Kettle
Data Integration challenges

● Data is everywhere
● Data is inconsistent
○ Records are different in each system
● Performance issues
○ Running queries that summarize data over a long stipulated period ties up the system for the duration of the task
○ Brings the OS to maximum load
● Data is never all in Data Warehouse
○ Excel sheet, acquisition, new application
DeustoTech-Learning 2013/2014 - January 9, 2014
Kettle
Transformations

● String and Date Manipulation
● Data Validation / Business Rules
● Lookup / Join
● Calculation, Statistics
● Cryptography
● Decisions, Flow control
● Scripting
● etc.
Kettle
What is it good for?

● Mirroring data from master to slave
● Syncing two data sources
● Processing data retrieved from multiple
sources and pushed to multiple
destinations
● Loading data to RDBMS
● Datamart / Datawarehouse
○ Dimension lookup/update step
● Graphical manipulation of data
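
The slide only names the dimension lookup/update step. As an illustration of the idea, the sketch below assumes a versioned ("Type 2") dimension in which a changed record closes the current row and opens a new one. The row layout (`sk`, `natural_key`, `attrs`, `valid_from`, `valid_to`) and the function name are hypothetical and do not reflect Kettle's actual table structure or configuration.

```python
from datetime import date


def upsert_dimension(dim, natural_key, attrs, today):
    """Return the surrogate key for natural_key, adding a new version on change.

    dim is a list of dict rows standing in for the dimension table.
    """
    current = next(
        (r for r in dim
         if r["natural_key"] == natural_key and r["valid_to"] is None),
        None,
    )
    if current is not None and current["attrs"] == attrs:
        return current["sk"]  # lookup hit and nothing changed
    if current is not None:
        current["valid_to"] = today  # close the outdated version
    new_row = {
        "sk": len(dim) + 1,  # next surrogate key
        "natural_key": natural_key,
        "attrs": dict(attrs),
        "valid_from": today,
        "valid_to": None,  # None marks the current version
    }
    dim.append(new_row)
    return new_row["sk"]
```

Fact rows loaded afterwards would store the returned surrogate key, so each fact stays linked to the version of the dimension record that was valid at load time.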
Kettle
Alternatives

● Code
○ Custom Java
○ Spring Batch
● Scripts
○ Perl, Python, shell, etc.
○ Possibly + DB loader tool and cron

● Commercial ETL tools
○ DataStage
○ Informatica
● Oracle Warehouse Builder
● SQL Server Integration Services

Kettle
Extraction

Kettle
Extraction (II)

Source: http://download.101com.com/tdwi/research_report/2003ETLReport.pdf

Kettle
Extraction (III)

● RDBMS (SQL Server, DB2, Oracle, MySQL,
PostgreSQL, Sybase IQ, etc.)
● NoSQL Data: HBase, Cassandra, MongoDB
● OLAP (Mondrian, Palo, XML/A)
● Web (REST, SOAP, XML, JSON)
● Files (CSV, Fixed, Excel, etc.)
● ERP (SAP, Salesforce, OpenERP)
● Hadoop Data: HDFS, Hive
● Web Data: Twitter, Facebook, Log Files, Web Logs
● Others: LDAP/Active Directory, Google Analytics,
etc.
Kettle
Transportation

Kettle
Transformation

Kettle
Loading

Kettle
Environment

Kettle
Comparison of Data Integration tools

Big Data
Business Intelligence
A brief (BI) history…

Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)

Big Data
WEKA

Project Weka
A comprehensive set of tools for Machine
Learning and Data Mining

Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)

Big Data
Among Pentaho’s products

Mondrian
OLAP server written in Java

Kettle
ETL tool

Weka
Machine learning and Data Mining tool
Big Data
WEKA platform

● WEKA (Waikato Environment for Knowledge
Analysis)
● Funded by the New Zealand Government (for more than 10 years)
○ Develop an open-source state-of-the-art
workbench of data mining tools
○ Explore fielded applications
○ Develop new fundamental methods
● Became part of Pentaho platform in 2006
(PDM - Pentaho Data Mining)
Big Data
Data Mining with WEKA

● One of many definitions: extraction of implicit, previously unknown, and potentially useful information from data
● Goal: improve marketing, sales, and customer
support operations, risk assessment etc.
○ Who is likely to remain a loyal customer?
○ What products should be marketed to which
prospects?
○ What determines whether a person will respond
to a certain offer?
○ How can I detect potential fraud?
Big Data
Data Mining with WEKA (II)

Central idea: historical data contains
information that will be useful in the
future (patterns → generalizations)
Data Mining employs a set of
algorithms that automatically detect
patterns and regularities in data
Big Data
Data Mining with WEKA (III)
● A bank’s case as an example
○ Problem: Prediction (Probability Score) of a
Corporate Customer Delinquency (or default) in the
next year
○ Customer historical data used include:
■ Customer footings behavior (assets & liabilities)
■ Customer delinquencies (rates and time data)
■ Business Sector behavioral data

Big Data
Data Mining with WEKA (IV)
● Variable selection using the Information Value (IV)
criterion

● Automatic binning of continuous variables was used (ChiMerge). Manual corrections were made to address particularities in the data distribution of some variables (again using IV)
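
The slides do not give the formula for Information Value. The sketch below assumes the definition commonly used in credit scoring: for each bin of a variable, the Weight of Evidence is the log of the ratio between the share of "good" and the share of "bad" customers, and IV is the sum over bins of the difference of the two shares times that weight. The function name and the input layout are hypothetical.

```python
import math


def information_value(bins):
    """Compute IV for one binned variable.

    bins is a list of (good_count, bad_count) pairs, one pair per bin.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        if good == 0 or bad == 0:
            continue  # assumption: skip empty cells instead of smoothing them
        pct_good = good / total_good
        pct_bad = bad / total_bad
        woe = math.log(pct_good / pct_bad)  # Weight of Evidence of this bin
        iv += (pct_good - pct_bad) * woe
    return iv
```

A variable whose bins all have the same good/bad mix scores 0 and carries no predictive information under this criterion; variables are then ranked by IV and the weakest ones are dropped.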
Big Data
Data Mining with WEKA (V)

Big Data
Data Mining with WEKA (VI)

Big Data
Data Mining with WEKA (VII)

● Limitations
○ Traditional algorithms need to have all data
in (main) memory
■ big datasets are an issue
● Solution
○ Incremental schemes
○ Stream algorithms
■ MOA (Massive Online Analysis)
■ http://moa.cs.waikato.ac.nz/
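
To make the idea of an incremental scheme concrete, the sketch below keeps a running mean and variance that are updated one value at a time, so the dataset never has to fit in main memory. It uses Welford's online algorithm as a stand-in example. It is not MOA or WEKA code, and the class name is hypothetical.

```python
class RunningStats:
    """Mean and sample variance updated one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the statistics in constant memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of the values seen so far (0.0 for fewer than two)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0
```

Each call to `update` touches only three numbers, so the same object can consume an unbounded stream; incremental learners follow the same pattern, updating a model per instance instead of a summary statistic.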

Big Data
Be careful with Data Mining

Predictive analytics
Unified solution for Big Data Analytics

Predictive analytics
Unified solution for Big Data Analytics (II)
Current release: Pentaho Business Analytics Suite 4.8

Instant and interactive
data discovery for iPad
● Full analytical power on
the go – unique to
Pentaho
● Mobile-optimized user
interface

Predictive analytics
Unified solution for Big Data Analytics (III)
Current release: Pentaho Business Analytics Suite 4.8

Instant and interactive data
discovery and development for
big data
● Broadens big data access to
data analysts
● Removes the need for
separate big data
visualization tools
● Further improves
productivity for big data
developers
Predictive analytics
Unified solution for Big Data Analytics (IV)
Pentaho Instaview
● Instaview is simple
○ Created for data analysts
○ Dramatically simplifies ways to access Hadoop and NoSQL data stores
● Instaview is instant & interactive
○ Time accelerator: 3 quick steps from data to analytics
○ Interact with big data sources: group, sort, aggregate & visualize
● Instaview is big data analytics
○ Marketing analysis for weblog data in Hadoop
○ Application log analysis for data in MongoDB

Predictive analytics
Comparison

Source: http://cdn.oreillystatic.com/en/assets/1/event/100/Using%20R%20and%20Hadoop%20for%20Statistical%20Computation%20at%20Scale%20Presentation.htm#/2

References
http://cdn.oreillystatic.com/en/assets/1/event/100/Big%20Data%20Architectural%20Patterns%20Presentation.pdf
http://blog.pentaho.com/tag/strata/
http://www.slideshare.net/mattcasters/pentaho-data-integration-introduction?from_search=2
http://www.slideshare.net/infoaxon/open-source-bi-7640848
http://download.101com.com/tdwi/research_report/2003ETLReport.pdf
http://www.slideshare.net/jade_22/kettleetltool-090522005630phpapp01
http://www.pentaho.com/Blend-of-the-Week?mkt_tok=3RkMMJWWfF9wsRonuKvNce%2FhmjTEU5z17%2BQoXaO2hokz2EFye%2BLIHETpodcMTcdgPbjYDBceEJhqyQJxPr3DJNAN1dt%2BRhDhCA%3D%3D#Analytics

Copyright (c) 2014 University of Deusto
This work (except the quoted images, whose rights are reserved to their owners*) is licensed under the
Creative Commons “Attribution-ShareAlike” License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/

Alex Rayón Jerez
January 2014
DeustoTech-Learning 2013/2014 - 9 de Enero del 2014

More Related Content

What's hot

Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
nzhang
 
Introduction of ssis
Introduction of ssisIntroduction of ssis
Introduction of ssis
deepakk073
 
End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
End to end Machine Learning using Kubeflow - Build, Train, Deploy and ManageEnd to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
Animesh Singh
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
t3rmin4t0r
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Top Five Cool Features in Oracle SQL Developer Data Modeler
Top Five Cool Features in Oracle SQL Developer Data ModelerTop Five Cool Features in Oracle SQL Developer Data Modeler
Top Five Cool Features in Oracle SQL Developer Data Modeler
Kent Graziano
 
Snowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for EveryoneSnowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for Everyone
Angel Abundez
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
Dustin Vannoy
 
What is ETL?
What is ETL?What is ETL?
What is ETL?
Ismail El Gayar
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
DataWorks Summit/Hadoop Summit
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
Inderaj (Raj) Bains
 
Etl techniques
Etl techniquesEtl techniques
Etl techniques
mahezabeenIlkal
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
pcherukumalla
 
Oracle GoldenGate
Oracle GoldenGate Oracle GoldenGate
Oracle GoldenGate
oracleonthebrain
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
James Serra
 
Oracle Exadata Maintenance tasks 101 - OTN Tour 2015
Oracle Exadata Maintenance tasks 101 - OTN Tour 2015Oracle Exadata Maintenance tasks 101 - OTN Tour 2015
Oracle Exadata Maintenance tasks 101 - OTN Tour 2015
Nelson Calero
 
How Big is Big Data? Big Data Overview | Edureka
How Big is Big Data? Big Data Overview | EdurekaHow Big is Big Data? Big Data Overview | Edureka
How Big is Big Data? Big Data Overview | Edureka
Edureka!
 
ETL Technologies.pptx
ETL Technologies.pptxETL Technologies.pptx
ETL Technologies.pptx
Gaurav Bhatnagar
 
Kettle – Etl Tool
Kettle – Etl ToolKettle – Etl Tool
Kettle – Etl Tool
Dr Anjan Krishnamurthy
 
Snowflake Data Loading.pptx
Snowflake Data Loading.pptxSnowflake Data Loading.pptx
Snowflake Data Loading.pptx
Parag860410
 

What's hot (20)

Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Introduction of ssis
Introduction of ssisIntroduction of ssis
Introduction of ssis
 
End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
End to end Machine Learning using Kubeflow - Build, Train, Deploy and ManageEnd to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Top Five Cool Features in Oracle SQL Developer Data Modeler
Top Five Cool Features in Oracle SQL Developer Data ModelerTop Five Cool Features in Oracle SQL Developer Data Modeler
Top Five Cool Features in Oracle SQL Developer Data Modeler
 
Snowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for EveryoneSnowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for Everyone
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
 
What is ETL?
What is ETL?What is ETL?
What is ETL?
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Etl techniques
Etl techniquesEtl techniques
Etl techniques
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
 
Oracle GoldenGate
Oracle GoldenGate Oracle GoldenGate
Oracle GoldenGate
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Oracle Exadata Maintenance tasks 101 - OTN Tour 2015
Oracle Exadata Maintenance tasks 101 - OTN Tour 2015Oracle Exadata Maintenance tasks 101 - OTN Tour 2015
Oracle Exadata Maintenance tasks 101 - OTN Tour 2015
 
How Big is Big Data? Big Data Overview | Edureka
How Big is Big Data? Big Data Overview | EdurekaHow Big is Big Data? Big Data Overview | Edureka
How Big is Big Data? Big Data Overview | Edureka
 
ETL Technologies.pptx
ETL Technologies.pptxETL Technologies.pptx
ETL Technologies.pptx
 
Kettle – Etl Tool
Kettle – Etl ToolKettle – Etl Tool
Kettle – Etl Tool
 
Snowflake Data Loading.pptx
Snowflake Data Loading.pptxSnowflake Data Loading.pptx
Snowflake Data Loading.pptx
 

Viewers also liked

Pentaho-BI
Pentaho-BIPentaho-BI
Pentaho-BI
Edureka!
 
Introduction To Pentaho Kettle
Introduction To Pentaho KettleIntroduction To Pentaho Kettle
Introduction To Pentaho Kettle
Boulder Java User's Group
 
Introduction to ETL and Data Integration
Introduction to ETL and Data IntegrationIntroduction to ETL and Data Integration
Introduction to ETL and Data Integration
CloverDX (formerly known as CloverETL)
 
What's New in Pentaho 7.0?
What's New in Pentaho 7.0?What's New in Pentaho 7.0?
What's New in Pentaho 7.0?
Xpand IT
 
Pentaho interview question and answers
Pentaho interview question and answersPentaho interview question and answers
Pentaho interview question and answers
enrollmy training
 
Elementos ETL - Kettle Pentaho
Elementos ETL - Kettle Pentaho Elementos ETL - Kettle Pentaho
Elementos ETL - Kettle Pentaho
valex_haro
 
Informatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools ComparisonInformatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools Comparison
Roberto Espinosa
 
Tic valeria haro
Tic valeria haroTic valeria haro
Tic valeria haro
valex_haro
 
Guía de compras de alimentos
Guía de compras de alimentosGuía de compras de alimentos
Guía de compras de alimentos
valex_haro
 
Webinar: Conhecendo a solução Pentaho, líder em Business Analytics
Webinar: Conhecendo a solução Pentaho, líder em Business AnalyticsWebinar: Conhecendo a solução Pentaho, líder em Business Analytics
Webinar: Conhecendo a solução Pentaho, líder em Business Analytics
Ricardo Gouvêa
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Pentaho
 
Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando m...
Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando m...Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando m...
Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando m...
Alex Rayón Jerez
 
Got Personally-Owned Devices? Manage Them with System Center
Got Personally-Owned Devices? Manage Them with System CenterGot Personally-Owned Devices? Manage Them with System Center
Got Personally-Owned Devices? Manage Them with System Center
C/D/H Technology Consultants
 
FAST Search for SharePoint
FAST Search for SharePointFAST Search for SharePoint
FAST Search for SharePoint
C/D/H Technology Consultants
 
Data Is Your Next Product Opportunity
Data Is Your Next Product Opportunity Data Is Your Next Product Opportunity
Data Is Your Next Product Opportunity
Pentaho
 
Pentaho: inteligência de negócios utilizando software livre
Pentaho: inteligência de negócios utilizando software livrePentaho: inteligência de negócios utilizando software livre
Pentaho: inteligência de negócios utilizando software livre
Caio Moreno
 
Pregel
PregelPregel
Pregel
Weiru Dai
 
No sql now2011_review_of_adhoc_architectures
No sql now2011_review_of_adhoc_architecturesNo sql now2011_review_of_adhoc_architectures
No sql now2011_review_of_adhoc_architectures
Nicholas Goodman
 
Big data&data science vfinal
Big data&data science vfinalBig data&data science vfinal
Big data&data science vfinal
Luis Joyanes
 

Viewers also liked (20)

Pentaho-BI
Pentaho-BIPentaho-BI
Pentaho-BI
 
Introduction To Pentaho Kettle
Introduction To Pentaho KettleIntroduction To Pentaho Kettle
Introduction To Pentaho Kettle
 
Introduction to ETL and Data Integration
Introduction to ETL and Data IntegrationIntroduction to ETL and Data Integration
Introduction to ETL and Data Integration
 
What's New in Pentaho 7.0?
What's New in Pentaho 7.0?What's New in Pentaho 7.0?
What's New in Pentaho 7.0?
 
Pentaho: CE versus EE
Pentaho: CE versus EEPentaho: CE versus EE
Pentaho: CE versus EE
 
Pentaho interview question and answers
Pentaho interview question and answersPentaho interview question and answers
Pentaho interview question and answers
 
Elementos ETL - Kettle Pentaho
Elementos ETL - Kettle Pentaho Elementos ETL - Kettle Pentaho
Elementos ETL - Kettle Pentaho
 
Informatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools ComparisonInformatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools Comparison
 
Tic valeria haro
Tic valeria haroTic valeria haro
Tic valeria haro
 
Guía de compras de alimentos
Guía de compras de alimentosGuía de compras de alimentos
Guía de compras de alimentos
 
Webinar: Conhecendo a solução Pentaho, líder em Business Analytics
Webinar: Conhecendo a solução Pentaho, líder em Business AnalyticsWebinar: Conhecendo a solução Pentaho, líder em Business Analytics
Webinar: Conhecendo a solução Pentaho, líder em Business Analytics
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
 
Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando m...
Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando m...Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando m...
Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando m...
 
Got Personally-Owned Devices? Manage Them with System Center
Got Personally-Owned Devices? Manage Them with System CenterGot Personally-Owned Devices? Manage Them with System Center
Got Personally-Owned Devices? Manage Them with System Center
 
FAST Search for SharePoint
FAST Search for SharePointFAST Search for SharePoint
FAST Search for SharePoint
 
Data Is Your Next Product Opportunity
Data Is Your Next Product Opportunity Data Is Your Next Product Opportunity
Data Is Your Next Product Opportunity
 
Pentaho: inteligência de negócios utilizando software livre
Pentaho: inteligência de negócios utilizando software livrePentaho: inteligência de negócios utilizando software livre
Pentaho: inteligência de negócios utilizando software livre
 
Pregel
PregelPregel
Pregel
 
No sql now2011_review_of_adhoc_architectures
No sql now2011_review_of_adhoc_architecturesNo sql now2011_review_of_adhoc_architectures
No sql now2011_review_of_adhoc_architectures
 
Big data&data science vfinal
Big data&data science vfinalBig data&data science vfinal
Big data&data science vfinal
 

Kettle: Pentaho Data Integration tool

  • 1. Pentaho Data Integration January, 2014 Alex Rayón Jerez alex.rayon@deusto.es DeustoTech Learning – Deusto Institute of Technology – University of Deusto Avda. Universidades 24, 48007 Bilbao, Spain www.deusto.es
  • 2. Before starting…. Who has used a relational database? Source: http://www.agiledata.org/essays/databaseTesting.html
  • 3. Before starting…. (II) Source: http://www.theguardian.com/teacher-network/2012/jan/10/how-to-teach-code Who has written scripts or Java code to move data from one source and load it to another?
  • 4. Before starting…. (III) What did you use? 1. Scripts 2. Custom Java Code 3. ETL
  • 5. Table of Contents ● Pentaho at a glance ● In the academic field ● ETL ● Kettle ● Big Data ● Predictive Analytics
  • 7. Pentaho at a glance Business Intelligence
  • 8. Pentaho at a glance (II)
  • 9. Pentaho at a glance (III) ● Business Intelligence & Analytics ● Open Core ○ GPL v2 ○ Apache 2.0 ○ Enterprise and OEM licenses ● Java-based ● Web front-ends
  • 10. Pentaho at a glance (IV) ● The Pentaho Stack ○ Data Integration / ETL ○ Big Data / NoSQL ○ Data Modeling ○ Reporting ○ OLAP / Analysis ○ Data Visualization ○ Dashboarding ○ Data Mining / Predictive Analysis ○ Scheduling Source: http://helicaltech.com/blogs/hire-pentaho-consultants-hire-pentaho-developers/
  • 11. Pentaho at a glance (V) ● Modules ○ Pentaho Data Integration ■ Kettle ○ Pentaho Analysis ■ Mondrian ○ Pentaho Reporting ○ Pentaho Dashboards ○ Pentaho Data Mining ■ WEKA
  • 12. Pentaho at a glance (VI) ● Figures ○ + 10,000 deployments ○ + 185 countries ○ + 1,200 customers ○ Since 2012, in Gartner Magic Quadrant for BI Platforms ○ 1 download / 30 seconds
  • 13. Pentaho at a glance (VII) ● Open Source Leader
  • 14. Pentaho at a glance (VIII) Single Platform
  • 23. ETL Definition and characteristics ● An ETL tool is a tool that ○ Extracts data from various data sources (usually legacy data) ○ Transforms data ■ from → being optimized for transactions ■ to → being optimized for reporting and analysis ■ synchronizes the data coming from different databases ■ cleanses the data to remove errors ○ Loads data into a data warehouse
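The three stages named on this slide can be sketched in a few lines of hand-written Python, the kind of script an ETL tool is meant to replace. This is a minimal illustration only, not how Kettle works internally: the CSV sample, the `sales` table, and the function names `extract`, `transform`, and `load` are all assumptions made up for the example, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical legacy export: stray spaces, inconsistent casing, one bad row.
LEGACY_CSV = """id,name,amount
1, alice ,10.50
2,BOB,not-a-number
3,Carol,7
"""

def extract(text):
    """Extract: read rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize names and drop rows whose amount is not numeric."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # cleansing step: skip rows that contain errors
        clean.append((int(row["id"]), row["name"].strip().title(), amount))
    return clean

def load(rows, conn):
    """Load: write the cleansed rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

# In-memory SQLite stands in for the data warehouse (an assumption).
warehouse = sqlite3.connect(":memory:")
load(transform(extract(LEGACY_CSV)), warehouse)
loaded = warehouse.execute(
    "SELECT id, name, amount FROM sales ORDER BY id"
).fetchall()
print(loaded)
```

Each stage is a separate function so that a change in the source format only touches `extract`, which mirrors the separation an ETL tool provides graphically.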
  • 24. ETL Why do I need it? ● ETL tools save time and money when developing a data warehouse by removing the need for hand-coding ● It is very difficult for database administrators to connect different brands of databases without using an external tool ● In the event that databases are altered or new databases need to be integrated, a lot of hand-coded work needs to be completely redone
  • 25. ETL Business Intelligence ● ETL is the heart and soul of business intelligence (BI) ○ ETL processes bring together and combine data from multiple source systems into a data warehouse Source: http://datawarehouseujap.blogspot.com.es/2010/08/data-warehouse.html
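As a rough sketch of what "bringing together data from multiple source systems" means in practice, the following example merges two sources with different schemas into one warehouse table. The two source systems (`crm` and `billing`), their table and column names, and the `customer_revenue` target table are hypothetical; in-memory SQLite databases stand in for all three systems.

```python
import sqlite3

# Two hypothetical source systems with different schemas.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (cust_id INTEGER, full_name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer INTEGER, total REAL)")
billing.executemany(
    "INSERT INTO invoices VALUES (?, ?)", [(1, 10.0), (1, 5.5), (2, 3.0)]
)

# Extract from both sources.
names = dict(crm.execute("SELECT cust_id, full_name FROM customers"))
totals = {}
for customer, total in billing.execute("SELECT customer, total FROM invoices"):
    # Transform: aggregate invoice totals per customer.
    totals[customer] = totals.get(customer, 0.0) + total

# Load the combined view into a single warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customer_revenue (cust_id INTEGER, name TEXT, revenue REAL)"
)
warehouse.executemany(
    "INSERT INTO customer_revenue VALUES (?, ?, ?)",
    [(cid, names[cid], totals.get(cid, 0.0)) for cid in sorted(names)],
)
revenue = warehouse.execute(
    "SELECT cust_id, name, revenue FROM customer_revenue ORDER BY cust_id"
).fetchall()
print(revenue)
```

The join key differs between the sources (`cust_id` versus `customer`), which is the kind of mismatch that has to be reconciled by hand in a script like this.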
  • 26. ETL Business Intelligence (II) Source: http://www.dwuser.com/news/tag/optimization/ According to most practitioners, ETL design and development work consumes 60 to 80 percent of an entire BI project Source: The Data Warehousing Institute. www.dw-institute.com
  • 27. ETL Processing framework Source: The Data Warehousing Institute. www.dw-institute.com
  • 30. ETL CloverETL ● Creates a basic archive of functions for mapping and transformations, allowing companies to move large amounts of data as quickly and efficiently as possible ● Uses building blocks called components to create a transformation graph, which is a visual depiction of the intended data processing
  • 31. ETL CloverETL (II) ● The graphic presentation simplifies even complex data transformations, allowing for drag-and-drop functionality ● Limited to approximately 40 different components to simplify graph creation ○ Yet you may configure each component to meet specific needs ● It also features extensive debugging capabilities to ensure all transformation graphs work precisely as intended
  • 32. ETL KETL ● Contains a scalable, platform-independent engine capable of supporting multiple computers and 64-bit servers ● The program also offers performance monitoring, extensive data source support, XML compatibility and a scheduling engine for time-based and event-driven job execution
  • 33. ETL Kettle ● The Pentaho company produced Kettle as an open source alternative to commercial ETL software ○ No relation to Kinetic Networks' KETL ● Kettle features a drag-and-drop graphical environment with progress feedback for all data transactions, including automatic documentation of executed jobs ● XML Input Stream to handle huge XML files without suffering a loss in performance or a spike in memory usage ○ Users can also upgrade the free Kettle version for optional paid features and dedicated technical support.
  • 34. ETL Talend ● Provides a graphical environment for data integration, migration and synchronization ● Drag and drop graphic components to create the Java code required to execute the desired task, saving time and effort ● Pre-built connectors to enable compatibility with a wide range of business systems and databases ● Users gain real-time access to corporate data, allowing for the monitoring and debugging of transactions to ensure smooth data integration
  • 35. ETL Comparison ● The criteria used for the ETL tools comparison were divided into nine categories: ○ TCO ○ Risk ○ Ease of use ○ Support ○ Deployment ○ Speed ○ Data Quality ○ Monitoring ○ Connectivity
  • 37. ETL Comparison (III) ● Total Cost of Ownership ○ The overall cost for a certain product. ○ This can mean initial ordering, licensing, servicing, support, training, consulting, and any other additional payments that need to be made before the product is in full use ○ Commercial Open Source products are typically free to use, but support, training and consulting are what companies need to pay for
  • 38. ETL Comparison (IV) ● Risk ○ There are always risks with projects, especially big projects. ○ The risks for projects failing are: ■ Going over budget ■ Going over schedule ■ Not completing the requirements or expectations of the customers ○ Open Source products have much lower risk than Commercial ones since they do not restrict the use of their products by pricey licenses
  • 39. ETL Comparison (V) ● Ease of use ○ All of the ETL tools, apart from Inaport, have a GUI to simplify the development process ○ Having a good GUI also reduces the time needed to learn and use the tools ○ Pentaho Kettle has one of the easiest-to-use GUIs of all the tools ■ Training can also be found online or within the community
  • 40. ETL Comparison (VI) ● Support ○ Nowadays, all software products have support and all of the ETL tool providers offer support ○ Pentaho Kettle – Offers support from the US and UK, and has a partner consultant in Hong Kong ● Deployment ○ Pentaho Kettle is a stand-alone Java engine that can run on any machine that can run Java. It needs an external scheduler to run automatically. ○ It can be deployed on many different machines, which are used as “slave servers” to help with transformation processing. ○ Recommended: one 1 GHz CPU and 512 MB of RAM
  • 41. ETL Comparison (VII) ● Speed ○ The speed of ETL tools depends largely on the data that needs to be transferred over the network and the processing power involved in transforming the data. ○ Pentaho Kettle is faster than Talend, but the Java connector slows it down somewhat. Like Talend, it also requires manual tweaking. It can be clustered by being placed on many machines to reduce network traffic
  • 42. ETL Comparison (VIII) ● Data Quality ○ Data Quality is fast becoming the most important feature in any data integration tool. ○ Pentaho – has DQ features in its GUI, allows for customized SQL statements, by using JavaScript and Regular Expressions. It also has some additional modules after subscribing. ● Monitoring ○ Pentaho Kettle – has practical monitoring tools and logging
  • 43. ETL Comparison (IX) ● Connectivity ○ In most cases, ETL tools transfer data from legacy systems ○ Their connectivity is very important to the usefulness of the ETL tools. ○ Kettle can connect to a very wide variety of databases, flat files, XML files, Excel files and web services.
  • 44. Table of Contents ● Pentaho at a glance ● In the academic field ● ETL ● Kettle ● Big Data ● Predictive Analytics
  • 45. Kettle Introduction Project Kettle Powerful Extraction, Transformation and Loading (ETL) capabilities using an innovative, metadata-driven approach
  • 46. Kettle Introduction (II) ● What is Kettle? ○ Batch data integration and processing tool written in Java ○ Exists to retrieve, process and load data ○ PDI is a synonymous term Source: http://www.dreamstime.com/stock-photo-very-old-kettle-isolated-image16622230
  • 47. Kettle Introduction (III) ● It uses an innovative metadata-driven approach ● It has a very easy-to-use GUI ● Strong community of 13,500 registered users ● It uses a stand-alone Java engine that processes the tasks for moving data between many different databases and files
  • 49. Kettle Data Integration Platform Source: http://download.101com.com/tdwi/research_report/2003ETLReport.pdf
  • 51. Kettle Most common uses ● Data warehouse and data mart loads ● Data integration ● Data cleansing ● Data migration ● Data export ● etc.
  • 52. Kettle Data Integration ● Changing input to desired output ● Jobs ○ Synchronous workflow of job entries (tasks) ● Transformations ○ Stepwise parallel & asynchronous processing of a record stream ● Distributed
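The contrast between jobs and transformations can be sketched as follows. This is a minimal illustration of the two execution models, not Kettle's engine: each transformation step runs in its own thread and passes rows through a bounded queue, while job entries run strictly one after another. All function names, steps and data here are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the record stream


def _run_step(func, inbox, outbox):
    """One step: take rows from inbox, apply func, pass results downstream."""
    while True:
        row = inbox.get()
        if row is _DONE:
            outbox.put(_DONE)  # propagate end-of-stream to the next step
            return
        result = func(row)
        if result is not None:  # a step drops a row by returning None
            outbox.put(result)


def _feed(rows, inbox):
    """Push the input rows into the first queue, then signal end-of-stream."""
    for row in rows:
        inbox.put(row)
    inbox.put(_DONE)


def run_transformation(rows, steps, buffer_size=100):
    """Run every step in its own thread; rows flow through bounded queues."""
    queues = [queue.Queue(maxsize=buffer_size) for _ in range(len(steps) + 1)]
    workers = [threading.Thread(target=_feed, args=(rows, queues[0]))]
    workers += [
        threading.Thread(target=_run_step, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(steps)
    ]
    for worker in workers:
        worker.start()
    output = []
    while True:
        row = queues[-1].get()
        if row is _DONE:
            break
        output.append(row)
    for worker in workers:
        worker.join()
    return output


def run_job(entries):
    """Run job entries strictly one after another; stop at the first failure."""
    for name, entry in entries:
        if not entry():
            return "failed at " + name
    return "success"


# Hypothetical steps and data, for illustration only.
def trim_name(row):
    return {**row, "name": row["name"].strip()}


def adults_only(row):
    return row if row["age"] >= 18 else None


loaded = []


def load_customers():
    source = [{"name": " Ana ", "age": 30}, {"name": "Jon", "age": 12}]
    loaded.extend(run_transformation(source, [trim_name, adults_only]))
    return True


status = run_job([("load customers", load_customers),
                  ("check result", lambda: len(loaded) == 1)])
print(status, loaded)
```

The bounded queues mean a slow step applies back-pressure to the steps before it, which is why the input is fed from its own thread rather than from the caller.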
  • 53. Kettle Data Integration challenges ● Data is everywhere ● Data is inconsistent ○ Records are different in each system ● Performance issues ○ Running queries that summarize data over a long stipulated period ties up the operating system ○ Brings the OS to maximum load ● Data is never all in the Data Warehouse ○ Excel sheets, acquisitions, new applications DeustoTech-Learning 2013/2014 - January 9, 2014
  • 54. Kettle Transformations ● String and Date Manipulation ● Data Validation / Business Rules ● Lookup / Join ● Calculation, Statistics ● Cryptography ● Decisions, Flow control ● Scripting ● etc.
  • 55. Kettle What is it good for? ● Mirroring data from master to slave ● Syncing two data sources ● Processing data retrieved from multiple sources and pushed to multiple destinations ● Loading data to RDBMS ● Datamart / Datawarehouse ○ Dimension lookup/update step ● Graphical manipulation of data
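The slide names a dimension lookup/update step without describing it. One common pattern for dimension loads is the Type 2 slowly changing dimension, where a changed record gets a new version instead of being overwritten. The sketch below illustrates that pattern under the assumption that this is the behaviour of interest; the function name, field names and data are hypothetical and are not Kettle's API.

```python
from datetime import date


def dimension_lookup_update(dimension, natural_key, attributes, today):
    """Return the surrogate key for a record, adding a new version on change."""
    # Look up the currently valid version (the one with an open end date).
    current = next(
        (row for row in dimension
         if row["natural_key"] == natural_key and row["date_to"] is None),
        None,
    )
    if current is not None and current["attributes"] == attributes:
        return current["surrogate_key"]  # lookup hit, nothing changed
    if current is not None:
        current["date_to"] = today  # close the outdated version
    new_row = {
        "surrogate_key": len(dimension) + 1,
        "natural_key": natural_key,
        "attributes": dict(attributes),
        "version": current["version"] + 1 if current else 1,
        "date_from": today,
        "date_to": None,
    }
    dimension.append(new_row)
    return new_row["surrogate_key"]


# Hypothetical customer dimension, kept in memory for illustration.
customer_dim = []
k1 = dimension_lookup_update(customer_dim, "C1", {"city": "Bilbao"}, date(2014, 1, 1))
k2 = dimension_lookup_update(customer_dim, "C1", {"city": "Bilbao"}, date(2014, 1, 5))
k3 = dimension_lookup_update(customer_dim, "C1", {"city": "Madrid"}, date(2014, 1, 9))
print(k1, k2, k3)
```

Fact rows loaded before the change keep pointing at the old surrogate key, so historical reports still show the attributes that were valid at the time.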
  • 56. Kettle Alternatives ● Code ○ Custom Java ○ Spring Batch ● Scripts ○ Perl, Python, shell, etc. ○ Possibly + db loader tool and cron ● Commercial ETL tools ○ Datastage ○ Informatica ● Oracle Warehouse Builder ● SQL Server Integration Services
  • 59. Kettle Extraction (III) ● RDBMS (SQL Server, DB2, Oracle, MySQL, PostgreSQL, Sybase IQ, etc.) ● NoSQL Data: HBase, Cassandra, MongoDB ● OLAP (Mondrian, Palo, XML/A) ● Web (REST, SOAP, XML, JSON) ● Files (CSV, Fixed, Excel, etc.) ● ERP (SAP, Salesforce, OpenERP) ● Hadoop Data: HDFS, Hive ● Web Data: Twitter, Facebook, Log Files, Web Logs ● Others: LDAP/Active Directory, Google Analytics, etc.
  • 64. Kettle Comparison of Data Integration tools
  • 65. Table of Contents ● Pentaho at a glance ● In the academic field ● ETL ● Kettle ● Big Data ● Predictive Analytics
  • 66. Big Data Business Intelligence (BI): a brief history… Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)
  • 67. Big Data WEKA Project Weka A comprehensive set of tools for Machine Learning and Data Mining Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)
  • 68. Big Data Among Pentaho’s products ● Mondrian: OLAP server written in Java ● Kettle: ETL tool ● Weka: Machine learning and Data Mining tool
  • 69. Big Data WEKA platform ● WEKA (Waikato Environment for Knowledge Analysis) ● Funded by the New Zealand Government (for more than 10 years) ○ Develop an open-source state-of-the-art workbench of data mining tools ○ Explore fielded applications ○ Develop new fundamental methods ● Became part of the Pentaho platform in 2006 (PDM - Pentaho Data Mining)
  • 70. Big Data Data Mining with WEKA ● (One-of-the-many) Definition: Extraction of implicit, previously unknown, and potentially useful information from data ● Goal: improve marketing, sales, and customer support operations, risk assessment etc. ○ Who is likely to remain a loyal customer? ○ What products should be marketed to which prospects? ○ What determines whether a person will respond to a certain offer? ○ How can I detect potential fraud?
  • 71. Big Data Data Mining with WEKA (II) ● Central idea: historical data contains information that will be useful in the future (patterns → generalizations) ● Data Mining employs a set of algorithms that automatically detect patterns and regularities in data
  • 72. Big Data Data Mining with WEKA (III) ● A bank’s case as an example ○ Problem: Prediction (Probability Score) of a Corporate Customer Delinquency (or default) in the next year ○ Customer historical data used include: ■ Customer footings behavior (assets & liabilities) ■ Customer delinquencies (rates and time data) ■ Business Sector behavioral data
  • 73. Big Data Data Mining with WEKA (IV) ● Variable selection using the Information Value (IV) criterion ● Automatic binning of continuous variables was used (ChiMerge). Manual corrections were made to address particularities in the data distribution of some variables (again using IV)
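The Information Value criterion is not defined on the slide. A minimal sketch, assuming the usual credit-scoring definition (for each bin, the difference between the share of non-delinquent and delinquent customers, multiplied by the log of their ratio, summed over bins), could look like this. The function name and the bin counts are hypothetical, and the slide does not show how the bank's analysis computed it.

```python
import math


def information_value(bins):
    """Compute IV from a list of (good_count, bad_count) pairs, one per bin."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        if good == 0 or bad == 0:
            # The log ratio is undefined for empty cells; merging such bins
            # with a neighbour first is one way to handle them.
            raise ValueError("every bin needs at least one good and one bad")
        share_good = good / total_good
        share_bad = bad / total_bad
        weight_of_evidence = math.log(share_good / share_bad)
        iv += (share_good - share_bad) * weight_of_evidence
    return iv


# Hypothetical binned variable: (non-delinquent, delinquent) counts per bin.
separating = information_value([(90, 10), (60, 40)])
uninformative = information_value([(50, 10), (100, 20)])
print(round(separating, 4), round(uninformative, 4))
```

A variable whose bins all have the same good/bad mix scores zero, so ranking variables by IV keeps those that separate delinquent from non-delinquent customers.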
  • 74. Big Data Data Mining with WEKA (V)
  • 75. Big Data Data Mining with WEKA (VI)
  • 76. Big Data Data Mining with WEKA (VII) ● Limitations ○ Traditional algorithms need to have all data in (main) memory ■ big datasets are an issue ● Solution ○ Incremental schemes ○ Stream algorithms ■ MOA (Massive Online Analysis) ■ http://moa.cs.waikato.ac.nz/
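The difference between holding all data in memory and an incremental scheme can be shown with a very small stand-in for a model: a running mean and variance updated one value at a time (Welford's method). This is only an illustration of the incremental idea and is not taken from WEKA or MOA; the class name and data are hypothetical.

```python
class RunningStats:
    """Mean and sample variance updated one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def readings():
    """Hypothetical stream; in practice this could be rows read from disk."""
    yield from [2, 4, 4, 4, 5, 5, 7, 9]


stats = RunningStats()
for value in readings():
    stats.update(value)  # each value is seen once and then discarded
print(stats.n, stats.mean, stats.variance)
```

Memory use stays constant no matter how long the stream is, which is the property that incremental learners and stream algorithms rely on.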
  • 77. Big Data Be careful with Data Mining
  • 78. Table of Contents ● Pentaho at a glance ● In the academic field ● ETL ● Kettle ● Big Data ● Predictive Analytics
  • 79. Predictive analytics Unified solution for Big Data Analytics
  • 80. Predictive analytics Unified solution for Big Data Analytics (II) Current release: Pentaho Business Analytics Suite 4.8 Instant and interactive data discovery for iPad ● Full analytical power on the go – unique to Pentaho ● Mobile-optimized user interface
  • 81. Predictive analytics Unified solution for Big Data Analytics (III) Current release: Pentaho Business Analytics Suite 4.8 Instant and interactive data discovery and development for big data ● Broadens big data access to data analysts ● Removes the need for separate big data visualization tools ● Further improves productivity for big data developers
  • 82. Predictive analytics Unified solution for Big Data Analytics (IV) Pentaho Instaview ● Instaview is simple ○ Created for data analysts ○ Dramatically simplifies ways to access Hadoop and NoSQL data stores ● Instaview is instant & interactive ○ Time accelerator – 3 quick steps from data to analytics ○ Interact with big data sources – group, sort, aggregate & visualize ● Instaview is big data analytics ○ Marketing analysis for weblog data in Hadoop ○ Application log analysis for data in MongoDB
  • 85. Copyright (c) 2014 University of Deusto This work (except the quoted images, whose rights are reserved to their owners*) is licensed under the Creative Commons “Attribution-ShareAlike” License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ Alex Rayón Jerez January 2014