Pentaho Data Integration session delivered in November 2015 as part of the Big Data and Business Intelligence Program at the University of Deusto (details here: http://bit.ly/1PhIVgJ).
Pentaho Data Integration: preparing and blending data from any source for analytics, thereby enabling data-driven decision making. Applications in education, especially academic and learning analytics.
Business Intelligence and Big Data Analytics with Pentaho - Uday Kothari
This webinar gives an overview of the Pentaho technology stack and then delves into its features: ETL, reporting, dashboards, analytics and Big Data. It also offers a cross-industry perspective on how Pentaho can be leveraged effectively for decision making, and highlights how, beyond strong technical features, low TCO is central to Pentaho's value proposition. For BI technology enthusiasts, the webinar presents one of the easiest ways to learn an end-to-end analytics tool. For those interested in developing a BI/analytics toolset for their organization, it presents an interesting low-cost option. For big data enthusiasts, it gives an overview of how Pentaho has emerged as a leader in the data integration space for Big Data.
Pentaho is one of the leading niche players in Business Intelligence and Big Data Analytics. It offers a comprehensive, end-to-end open source platform for data integration and business analytics. Pentaho's leading product, Pentaho Business Analytics, is a data integration, BI and analytics platform comprising ETL, OLAP, reporting, interactive dashboards, ad hoc analysis, data mining and predictive analytics.
Here is a case study I developed to explain the different sets of functionality in the Pentaho Suite. I focused on the functionality, features, illustrative tools and key strengths, and provide a basis for evaluating BI tools when selecting vendors. Enjoy!
Talend Open Studio Introduction - OSSCamp 2014 - OSSCube
Talend Open Studio is the most open, innovative and powerful data integration solution on the market today. Talend Open Studio for Data Integration allows you to create ETL (extract, transform, load) jobs.
Data scientists and machine learning practitioners nowadays churn out models by the dozen and continuously experiment to improve their accuracy. They also use a variety of ML and DL frameworks and languages, and a typical organization may find that this results in a heterogeneous, complicated collection of assets that require different runtimes, resources and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production"? How does an organization scale inference engines out and make them available to real-time applications without significant latency? Different techniques are needed for batch (offline) inference and for instant, online scoring. Data needs to be accessed from various sources, and cleansing and transformation of the data must happen before any predictions. In many cases, there may be no substitute for customized data handling with scripting, either.
Enterprises also require auditing and authorization built in, plus approval processes, while still supporting a "continuous delivery" paradigm whereby a data scientist can deliver insights faster. Not all models are created equal, nor are the consumers of a model, so enterprises require both metering and allocation of compute resources to meet SLAs.
In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes-based offering for the private cloud, optimized for the Hortonworks Hadoop Data Platform. DSX essentially brings typical software engineering practices to data science, organizing dev->test->production for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy models, monitor their accuracy and even roll back models and custom scorers, as well as how API-based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos.
Speaker
Piotr Mierzejewski, Program Director of Development, IBM DSX Local, IBM
Any data source becomes an SQL query with all the power of Apache Spark. Querona is a virtual database that seamlessly connects any data source with Power BI, TARGIT, Qlik, Tableau, Microsoft Excel and others. It lets you build your own universal data model and share it among reporting tools.
Querona does not create another copy of your data unless you want to accelerate your reports and use the built-in execution engine created for Big Data analytics. Just write a standard SQL query and let Querona consolidate the data on the fly, use one of its execution engines, and accelerate processing no matter what kind of sources you have, or how many.
The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading.
Title_ What are the various tools used in ETL testing.pdf - ishansharma200107
In the dynamic field of data integration, proficiency in ETL testing and mastery of relevant tools are indispensable skills. The highlighted tools cover a broad spectrum of ETL testing needs, from performance testing to data validation and transformation. Technogeeks IT Institute in Pune stands as a beacon, providing comprehensive training on these tools and preparing professionals for the real-world challenges of ETL testing. By enrolling in Technogeeks IT Institute’s ETL testing courses, individuals gain practical experience and contribute to the seamless flow of data across diverse systems.
The process of data warehousing is undergoing rapid transformation, giving rise to various new terminologies, especially due to the shift from the traditional ETL to the new ELT. For someone new to the process, these additional terminologies and abbreviations might seem overwhelming; some may even ask, "Why does it matter if the L comes before the T?"
The answer lies in the infrastructure and the setup. Here is what the fuss is all about: the sequencing of the words and, more importantly, why you should be shifting from ETL to ELT.
Airbyte @ Airflow Summit - The new modern data stack - Michel Tricot
In this talk, I'll describe how you can leverage three open-source standards - workflow management with Airflow, EL with Airbyte, and transformation with dbt - to build your next modern data stack. I'll explain how to configure your Airflow DAG to trigger Airbyte's data replication jobs and dbt's transformation jobs, with a concrete use case.
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA - cscpconf
In this paper we investigate the problem of providing scalability to the near-real-time ETL+Q (extract, transform, load and querying) process of data warehouses. In general, data loading, transformation and integration are heavy tasks that are performed only periodically during small fixed time windows. We propose an approach to enable the automatic scalability and freshness of any data warehouse and ETL+Q process for near-real-time Big Data scenarios. A general framework for testing the proposed system was implemented, supporting parallelization solutions for each part of the ETL+Q pipeline. The results show that the proposed system is capable of scaling to provide the desired processing speed.
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA - csandit
In this paper we investigate the problem of providing scalability to the near-real-time ETL+Q (extract, transform, load and querying) process of data warehouses. In general, data loading, transformation and integration are heavy tasks that are performed only periodically during small fixed time windows.
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle) - Rittman Analytics
A set of product roadmap and capabilities slides from Oracle Data Integration Product Management, plus thoughts on data integration in big data implementations by Mark Rittman (independent analyst).
Flink in Zalando's World of Microservices - ZalandoHayley
Apache Flink Meetup at Zalando Technology, May 2016, by Javier Lopez & Mihail Vieru, Zalando
In this talk we present Zalando's microservices architecture and introduce Saiki – our next generation data integration and distribution platform on AWS. We show why we chose Apache Flink to serve as our stream processing framework and describe how we employ it for our current use cases: business process monitoring and continuous ETL. We then give an outlook on future use cases.
Similar to Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos (20)
El Big Data en la dirección comercial: market(ing) intelligence - Alex Rayón Jerez
A session in which we used the case method to look at different applications of data analysis to sales management. Part of the Programa Experto en Dirección Comercial at Deusto Business School.
Herramientas y metodologías Big Data para acceder a datos no estructurados - Alex Rayón Jerez
Talk "Herramientas y metodologías Big Data para acceder a datos no estructurados" at the "Investigación para Mejorar la Adecuación Asistencial" conference, a healthcare forum interested in applying Big Data technologies and methodologies to extract knowledge from unstructured data.
Las competencias digitales como método de observación de competencias genéricas - Alex Rayón Jerez
Talk "Las competencias digitales como método de observación de competencias genéricas" delivered on 21 April 2016 at Innobasque, Zamudio, Bizkaia, as part of Innobasque's "Brunch & Learn" series: a session on professional and digital competencies, what they contribute to business, and what they really consist of. Much was said about their importance in this 21st century.
Conferencia "El BIg Data en mi empresa ¿de qué me sirve?" en el Donostia - San Sebastián el 20 de Abril de 2016. Jornadas "Big Data para PYMEs". Hablo sobre el perfil Big Data y sus competencias, así como las utilidades que tiene para las empresas.
Aplicación del Big Data a la mejora de la competitividad de la empresa - Alex Rayón Jerez
Talk "Aplicación del Big Data a la mejora de la competitividad de la empresa" held on 21 March 2016 in Palma de Mallorca, at the Universidad de las Islas Baleares. The aim was to glimpse the possibilities Big Data opens up in the context of companies and their competitiveness.
Análisis de Redes Sociales (Social Network Analysis) y Text Mining - Alex Rayón Jerez
Slides from the session "Análisis de Redes Sociales (Social Network Analysis) y Text Mining", part of the Programa Ejecutivo de Big Data y Business Intelligence held in Madrid in February 2016 at our Universidad de Deusto campus.
Marketing intelligence con estrategia omnicanal y Customer Journey - Alex Rayón Jerez
Slides from the session "Marketing intelligence con estrategia omnicanal y Customer Journey", part of the Programa Ejecutivo de Big Data y Business Intelligence held in Madrid in February 2016 at our Universidad de Deusto campus.
Presentación sobre la sesión "Modelos de propensión en la era del Big Data", dentro del Programa Ejecutivo de Big Data y Business Intelligence celebrado en Madrid en Febrero de 2016, en nuestra sede de la Universidad de Deusto.
Presentación sobre la sesión "Customer Lifetime Value Management con Big Data", dentro del Programa Ejecutivo de Big Data y Business Intelligence celebrado en Madrid en Febrero de 2016, en nuestra sede de la Universidad de Deusto.
Presentación sobre la sesión "Big Data: the Management Revolution", dentro del Programa Ejecutivo de Big Data y Business Intelligence celebrado en Madrid en Febrero de 2016, en nuestra sede de la Universidad de Deusto.
Presentación sobre la sesión "Optimización de procesos con el Big Data", dentro del Programa Ejecutivo de Big Data y Business Intelligence celebrado en Madrid en Febrero de 2016, en nuestra sede de la Universidad de Deusto.
La economía del dato: transformando sectores, generando oportunidades - Alex Rayón Jerez
Talk "La economía del dato: transformando sectores, generando oportunidades" prepared for the first Databeers Euskadi, promoted and organized by Decidata (www.decidata.es), on the challenges and opportunities this era of data has brought.
Cómo crecer, ser más eficiente y competitivo a través del Big Data - Alex Rayón Jerez
Talk "Cómo crecer, ser más eficiente y competitivo a través del Big Data" delivered at AECOC's 14th HORECA Congress (AECOC, Asociación Española de Codificación Comercial), on applying Big Data to the HORECA channel.
El poder de los datos: hacia una sociedad inteligente, pero ética - Alex Rayón Jerez
Lectio Brevis by professor Alex Rayón of the Faculty of Engineering, on the power data has acquired in this era, what has come to be known as Big Data: an area that also entails legal and ethical challenges, laid out in the text.
Búsqueda, organización y presentación de recursos de aprendizaje - Alex Rayón Jerez
Internal training course "Búsqueda, organización y presentación de recursos de aprendizaje" at the Universidad de Deusto: how to find, organize and present learning resources for later use in educational contexts.
Deusto Knowledge Hub como herramienta de publicación y descubrimiento de cono... - Alex Rayón Jerez
Internal training course at the Universidad de Deusto: how the Deusto Knowledge Hub repository serves me day to day as a tool for publishing and discovering knowledge.
Fomentando la colaboración en el aula a través de herramientas sociales - Alex Rayón Jerez
Internal training course "Fomentando la colaboración en el aula a través de herramientas sociales" at the Universidad de Deusto: social tools for fostering collaboration in the classroom between teacher and students.
Utilizando Google Drive y Google Docs en el aula para trabajar con mis estudi... - Alex Rayón Jerez
Internal training course "Utilizando Google Drive y Google Docs en el aula para trabajar con mis estudiantes" at the Universidad de Deusto: how to use Google Drive and Docs to work with my students in the classroom.
Procesamiento y visualización de datos para generar nuevo conocimiento - Alex Rayón Jerez
Internal training course "Procesamiento y visualización de datos para generar nuevo conocimiento" at the Universidad de Deusto: processing data at a small, precise scale (Smart Data) to improve my day-to-day work at the university.
El Big Data y Business Intelligence en mi empresa: ¿de qué me sirve? - Alex Rayón Jerez
Talk "El Big Data y Business Intelligence en mi empresa: ¿de qué me sirve?" delivered in Medellín, Colombia, in September 2015. A session aimed at companies, showing the possibilities Big Data opens up for their day-to-day operations.
Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos
1. WORKSHOP
Pentaho Data Integration: Extrayendo, Integrando, Normalizando y Preparando mis datos
Projects - Programa Big Data y Business Intelligence
Alex Rayón
alex.rayon@deusto.es
November 2015
3. Before starting… (II)
Who has written scripts or Java code to move data from one source and load it to another?
Source: http://www.theguardian.com/teacher-network/2012/jan/10/how-to-teach-code
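As a concrete picture of the hand-coding this slide alludes to, here is a minimal JDBC sketch of moving rows between two databases; the connection URLs, table and column names are all hypothetical:

```java
import java.sql.*;

// A minimal sketch of hand-coded data movement with plain JDBC.
// Connection URLs, table and column names are hypothetical.
public class HandCodedCopy {
    public static void main(String[] args) throws SQLException {
        try (Connection src = DriverManager.getConnection("jdbc:mysql://legacy/sales", "user", "pass");
             Connection dst = DriverManager.getConnection("jdbc:postgresql://dwh/analytics", "user", "pass");
             Statement extract = src.createStatement();
             ResultSet rows = extract.executeQuery("SELECT id, amount, created_at FROM orders");
             PreparedStatement load = dst.prepareStatement(
                     "INSERT INTO fact_orders (id, amount, created_at) VALUES (?, ?, ?)")) {
            while (rows.next()) {
                load.setLong(1, rows.getLong("id"));
                load.setBigDecimal(2, rows.getBigDecimal("amount"));
                load.setTimestamp(3, rows.getTimestamp("created_at"));
                load.addBatch();            // batch inserts to avoid one round trip per row
            }
            load.executeBatch();
        }
    }
}
```

Every new source or schema change means rewriting code like this by hand, which is exactly the maintenance burden an ETL tool such as PDI is meant to remove.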
9. Pentaho at a glance (III)
Business Intelligence & Analytics
Open Core: GPL v2 and Apache 2.0, with Enterprise and OEM licenses
Java-based
Web front-ends
10. Pentaho at a glance (IV)
The Pentaho Stack
Data Integration / ETL
Big Data / NoSQL
Data Modeling
Reporting
OLAP / Analysis
Data Visualization
Source: http://helicaltech.com/blogs/hire-pentaho-consultants-hire-pentaho-developers/
11. Pentaho at a glance (V)
Modules
Pentaho Data Integration (Kettle)
Pentaho Analysis (Mondrian)
Pentaho Reporting
Pentaho Dashboards
12. Pentaho at a glance (VI)
Figures
+10,000 deployments
+185 countries
+1,200 customers
In the Gartner Magic Quadrant for BI Platforms since 2012
1 download / 30
22. Table of Contents
Pentaho at a glance
In the academic field
ETL
Kettle
Big Data
Predictive Analytics
23. ETL
Definition and characteristics
An ETL tool is a tool that:
Extracts data from various data sources (usually legacy data)
Transforms data from being optimized for transactions to being optimized for reporting and analysis, synchronizes the data coming from different databases, and cleanses the data to remove errors
Loads data into a data warehouse
24. ETL
Why do I need it?
ETL tools save time and money when developing a data warehouse by removing the need for hand-coding
It is very difficult for database administrators to connect between different brands of databases without using an external tool
In the event that databases are altered or new databases need to be integrated, a lot of hand-coded work needs to be completely redone
25. ETL
Business Intelligence
ETL is the heart and soul of business intelligence (BI)
ETL processes bring together and combine data from multiple source systems into a data warehouse
Source: http://datawarehouseujap.blogspot.com.es/2010/08/data-warehouse.html
26. ETL
Business Intelligence (II)
According to most practitioners, ETL design and development work consumes 60 to 80 percent of an entire BI project
Source: http://www.dwuser.com/news/tag/optimization/
Source: The Data Warehousing Institute, www.dw-institute.com
30. ETL
CloverETL
Provides a basic archive of functions for mapping and transformations, allowing companies to move large amounts of data as quickly and efficiently as possible
Uses building blocks called components to create a transformation graph, which is a visual depiction of the intended data flow
31. ETL
CloverETL (II)
The graphical presentation simplifies even complex data transformations, allowing for drag-and-drop functionality
Limited to approximately 40 different components to simplify graph creation, yet each component can be configured to meet specific needs
It also features extensive debugging capabilities to ensure all transformation graphs work
32. ETL
KETL
Contains a scalable, platform-independent engine capable of supporting multiple computers and 64-bit servers
The program also offers performance monitoring, extensive data source support, XML compatibility and a scheduling engine for time-based and event-driven job execution
33. ETL
Kettle
The Pentaho company produced Kettle as an open source alternative to commercial ETL software
No relation to Kinetic Networks' KETL
Kettle features a drag-and-drop graphical environment with progress feedback for all data transactions, including automatic documentation of executed jobs
An XML input stream handles huge XML files without suffering a loss in performance or a spike in memory usage
Users can also upgrade the free Kettle version for ...
34. ETL
Talend
Provides a graphical environment for data integration, migration and synchronization
Drag-and-drop graphical components generate the Java code required to execute the desired task, saving time and effort
Pre-built connectors enable compatibility with a wide range of business systems and databases
Users gain real-time access to corporate data, allowing for the monitoring and debugging of transactions to ensure smooth data integration
35. ETL
Comparison
The set of criteria used for the ETL tools comparison was divided into seven categories:
TCO
Risk
Ease of use
Support
Deployment
Speed
37. ETL
Comparison (III)
Total Cost of Ownership
The overall cost of a given product
This can include initial ordering, licensing, servicing, support, training, consulting, and any other payments that need to be made before the product is in full use
Commercial open source products are typically free to use, but the support, training and consulting are what companies need to pay for
38. ETL
Comparison (IV)
Risk
There are always risks with projects, especially big ones. The risks of projects failing are:
Going over budget
Going over schedule
Not meeting the requirements or expectations of the customers
Open source products carry much lower risk than commercial ones, since they do not restrict the use of their products with pricey licenses
39. ETL
Comparison (V)
Ease of use
All of the ETL tools, apart from Inaport, have a GUI to simplify the development process
A good GUI also reduces the time needed to learn and use the tools
Of all the tools, Pentaho Kettle has the easiest-to-use GUI
Training can also be found online or within the community
40. ETL
Comparison (VI)
Support
Nowadays all software products have support, and all of the ETL tool providers offer it
Pentaho Kettle: offers support from the US and UK, and has a partner consultant in Hong Kong
Deployment
Pentaho Kettle is a stand-alone Java engine that can run on any machine that can run Java, but needs an external scheduler to run automatically
It can be deployed on many different machines and used as ...
41. ETL
Comparison (VII)
Speed
The speed of ETL tools depends largely on the data that needs to be transferred over the network and the processing power involved in transforming the data
Pentaho Kettle is faster than Talend, but its Java connector slows it down somewhat; like Talend, it also requires manual tweaking
It can be clustered across many machines to reduce network traffic
42. ETL
Comparison (VIII)
Data Quality
Data quality is fast becoming the most important feature in any data integration tool
Pentaho has DQ features in its GUI and allows for customized SQL statements, JavaScript and regular expressions; some additional modules are available with a subscription
Monitoring
Pentaho Kettle has practical monitoring tools and logging
43. ETL
Comparison (IX)
Connectivity
In most cases, ETL tools transfer data from legacy systems, so connectivity is very important to their usefulness
Kettle can connect to a very wide variety of databases, flat files, XML files, Excel files and web services
44. Table of Contents
Pentaho at a glance
In the academic field
ETL
Kettle
Big Data
Predictive Analytics
46. Kettle
Introduction (II)
What is Kettle?
A batch data integration and processing tool written in Java
It exists to retrieve, process and load data
PDI is a synonymous term
Source: http://www.dreamstime.com/stock-photo-very-old-kettle-isolated-image16622230
47. Kettle
Introduction (III)
It uses an innovative metadata-driven approach
It has a very easy-to-use GUI
Strong community of 13,500 registered users
It uses a stand-alone Java engine that processes the tasks for moving data between many different databases and files (a minimal embedding sketch follows below)
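Because Kettle is a stand-alone Java engine, a transformation designed in the GUI can also be run from plain Java. A minimal sketch using the PDI embedding API, with a hypothetical .ktr file path:

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

// Minimal sketch: embedding the Kettle engine to run a transformation
// designed in the GUI. The file path is hypothetical.
public class RunTransformation {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                     // initialize the engine and its plugins
        TransMeta meta = new TransMeta("/etl/load_students.ktr");
        Trans trans = new Trans(meta);
        trans.execute(null);                          // no command-line arguments
        trans.waitUntilFinished();                    // block until all steps complete
        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation finished with errors");
        }
    }
}
```

Jobs follow the same pattern through the corresponding JobMeta and Job classes.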
52. Kettle
Data Integration
Changing input into the desired output
Jobs: a synchronous workflow of job entries (tasks)
Transformations: stepwise, parallel and asynchronous processing of a record stream
53. Kettle
Data Integration challenges
Data is everywhere
Data is inconsistent: records are represented differently in each system
Performance issues: running queries that summarize data over long periods ties up the operational system and drives the OS to maximum load
54. Kettle
Transformations
String and date manipulation
Data validation / business rules
Lookup / join
Calculation, statistics
Cryptography
Decisions, flow control
(A sketch of this kind of row-level work follows below.)
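As a rough illustration of what the string/date-manipulation and validation step types above do to each row, here is a hedged plain-Java sketch; the field names and formats are hypothetical, and in PDI you would configure the built-in steps rather than code this by hand:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Sketch of row-level logic comparable to PDI's string/date and validation steps.
// Field values and formats are hypothetical.
public class RowTransform {
    static final DateTimeFormatter IN = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    // String manipulation: normalize a name field.
    static String cleanName(String raw) {
        return raw == null ? null : raw.trim().toUpperCase();
    }

    // Date manipulation plus validation: parse, or reject the row.
    static LocalDate parseEnrolmentDate(String raw) {
        try {
            return LocalDate.parse(raw.trim(), IN);
        } catch (DateTimeParseException e) {
            return null;  // a real validation step would route this row to an error stream
        }
    }

    public static void main(String[] args) {
        System.out.println(cleanName("  alex rayón "));       // -> "ALEX RAYÓN"
        System.out.println(parseEnrolmentDate("05/11/2015")); // -> 2015-11-05
    }
}
```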
55. Kettle
What is it good for?
Mirroring data from master to slave
Syncing two data sources
Processing data retrieved from multiple sources and pushing it to multiple destinations
Loading data into an RDBMS
Data marts / data warehouses
67. Big Data
WEKA
Project Weka: a comprehensive set of tools for machine learning and data mining
Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)
68. Big Data
Among Pentaho's products
Mondrian: OLAP server written in Java
Kettle: ETL tool
Weka: machine learning and data mining tool
69. Big Data
The WEKA platform
WEKA (Waikato Environment for Knowledge Analysis)
Funded by the New Zealand government for more than 10 years
Develops an open-source, state-of-the-art workbench of data mining tools, explores fielded applications, and develops new fundamental methods
Became part of the Pentaho platform in 2006 (PDM, Pentaho Data Mining)
70. Big Data
Data Mining with WEKA
(One of many definitions) The extraction of implicit, previously unknown, and potentially useful information from data
Goal: improve marketing, sales and customer support operations, risk assessment, etc.
Who is likely to remain a loyal customer?
What products should be marketed to which prospects?
What determines whether a person will respond to a certain offer?
71. Big Data
Data Mining with WEKA (II)
Central idea: historical data contains information that will be useful in the future (patterns → generalizations)
Data mining employs a set of algorithms that automatically detect patterns and regularities in data (a minimal WEKA sketch follows below)
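As a hedged sketch of this idea in WEKA's Java API, a decision tree can be induced from historical data in a few lines; the dataset file is hypothetical, and the last attribute is assumed to be the class:

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch with WEKA's API: learn a decision tree from historical data.
// The dataset path is hypothetical; the last attribute is assumed to be the class.
public class LearnPatterns {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("customers.arff");
        data.setClassIndex(data.numAttributes() - 1);  // e.g. "responded: yes/no"
        J48 tree = new J48();                          // C4.5 decision tree learner
        tree.buildClassifier(data);
        System.out.println(tree);                      // the detected patterns as readable rules
    }
}
```

Printing the classifier shows the detected patterns as human-readable rules, which is the "patterns → generalizations" step in miniature.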
72. Big Data
Data Mining with WEKA (III)
A bank's case as an example
Problem: prediction (probability score) of a corporate customer's delinquency (or default) in the next year
Customer historical data used include:
Customer footings behavior (assets & liabilities)
Customer delinquencies (rates and time data)
Business sector behavioral data
73. Big Data
Data Mining with WEKA (IV)
Variable selection using the Information Value (IV) criterion (the standard formula is given below)
Automatic binning of continuous variables was used (Chi-merge); manual corrections were made to address particularities in the data distribution of some variables (again using IV)
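For reference, the Information Value criterion named on this slide is conventionally computed from the Weight of Evidence of each bin; this is the standard credit-scoring formula, not one stated in the slides:

```latex
\mathrm{WoE}_i = \ln\frac{\%\,\mathrm{goods}_i}{\%\,\mathrm{bads}_i},
\qquad
\mathrm{IV} = \sum_{i=1}^{n} \left( \%\,\mathrm{goods}_i - \%\,\mathrm{bads}_i \right) \mathrm{WoE}_i
```

Variables with a higher IV carry more information about the good/bad outcome, which is what makes IV usable as a selection criterion.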
76. Big Data
Data Mining with WEKA (VII)
Limitations
Traditional algorithms need to have all data in (main) memory, so big datasets are an issue
Solution: incremental schemes and stream algorithms
MOA (Massive Online Analysis)
(A minimal incremental-learning sketch follows below.)
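WEKA's own answer to the all-data-in-memory limitation is its updateable classifiers, which consume one instance at a time so the full dataset never has to fit in main memory. A minimal sketch with a hypothetical ARFF file:

```java
import java.io.File;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

// Minimal sketch: incremental learning in WEKA, reading one instance at a time.
// The dataset path is hypothetical.
public class IncrementalLearning {
    public static void main(String[] args) throws Exception {
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("big_dataset.arff"));
        Instances structure = loader.getStructure();   // header only, no data rows
        structure.setClassIndex(structure.numAttributes() - 1);

        NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
        nb.buildClassifier(structure);                 // initialize with the structure
        Instance current;
        while ((current = loader.getNextInstance(structure)) != null) {
            nb.updateClassifier(current);              // learn one instance at a time
        }
        System.out.println(nb);
    }
}
```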
80. Predictive analytics
Unified solution for Big Data Analytics (II)
Current release: Pentaho Business Analytics Suite 4.8
Instant and interactive data discovery for iPad
● Full analytical power on the go, unique to Pentaho
● Mobile-optimized user interface
81. Predictive analytics
Unified solution for Big Data Analytics (III)
Current release: Pentaho Business Analytics Suite 4.8
Instant and interactive data discovery and development for big data
● Broadens big data access to data analysts
● Removes the need for separate big data visualization tools
● Further improves productivity for big data developers
82. Predictive analytics
Unified solution for Big Data Analytics (IV)
Pentaho Instaview
● Instaview is simple
○ Created for data analysts
○ Dramatically simplifies access to Hadoop and NoSQL data stores
● Instaview is instant & interactive
○ Time accelerator: 3 quick steps from data to analytics
○ Interact with big data sources: group, sort, aggregate & visualize
● Instaview is big data analytics
○ Marketing analysis of weblog data in Hadoop
○ Application log analysis of data in MongoDB
85. Copyright (c) 2015 University of Deusto
This work (except the quoted images, whose rights are reserved to their owners*) is licensed under the Creative Commons "Attribution-ShareAlike" License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/
Alex Rayón
November 2015
86. WORKSHOP
Pentaho Data Integration: Extrayendo, Integrando, Normalizando y Preparando mis datos
Projects - Programa Big Data y Business Intelligence
Alex Rayón
alex.rayon@deusto.es
November 2015