What is ETL?
Extract is the process of reading data from a source, typically a database.
Transform is the process of converting the extracted data from its previous form into the form required by the target database. Transformation occurs by applying rules, using lookup tables, or combining the data with other data.
Load is the process of writing the data into the target database.
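The three steps can be sketched with ordinary shell tools. This is only an illustration, with made-up file names and level codes; it is not part of the Talend workflow used later:

```shell
# Extract: read rows from a source file standing in for a database table
printf '1,alice,2\n2,bob,1\n' > source.csv

# Transform: map the numeric level code to a label via a lookup rule
awk -F, 'BEGIN { OFS=","; m[1]="bronze"; m[2]="gold" }
         { print $1, $2, m[$3] }' source.csv > target.csv

# Load: write the transformed rows into the target store (here, just a file)
cat target.csv
# 1,alice,gold
# 2,bob,bronze
```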
These steps are closely related and are managed end to end by ETL tools.
Different ETL tools:
• Pentaho Data Integration - Kettle Project (open source ETL)
• SAS ETL Studio
• Business Objects Data Integrator (BODI)
• Microsoft SQL Server Integration Services (SSIS)
• Talend Open Studio for Data Integration
Hortonworks Sandbox VM
Supported data input and output
What kinds of datasets can be loaded?
Talend Studio offers nearly comprehensive connectivity to:
• Packaged applications (ERP, CRM, etc.), databases, mainframes, files, Web Services, and so on, to address the growing disparity of sources.
• Data warehouses, data marts, and OLAP applications, for analysis, reporting, dashboarding, scorecarding, and so on.
• Built-in advanced components for ETL, including string manipulation, Slowly Changing Dimensions, automatic lookup handling, bulk-load support, etc.
We will do the following tasks in this assignment:
1. Load data from a DB on your local machine to HDFS
2. Write a Hive query to do the analysis
3. Push the result of the Hive query to an HBase output component
Use the tRowGenerator component to simulate the rows in the DB; it generates a table with three columns: ID, name, and level.
• Drag and drop the tHDFSOutput component onto the design surface and connect the main output of the row generator to it.
• Double-click the HDFS component in the design area and specify the NameNode address and the folder that will hold the file.
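Once the job has run, you can confirm that the file landed in HDFS from the command line. The folder name below is only a placeholder; substitute whatever folder you configured in the tHDFSOutput component:

```
hdfs dfs -ls /user/talend/customers
hdfs dfs -cat /user/talend/customers/* | head
```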
After loading the data to HDFS, we can create an external Hive table customers by logging in to the Hive shell and executing the following command. The field separator must match the one configured in tHDFSOutput (';' is Talend's default), and the LOCATION must point at the HDFS folder you wrote to; both values here are placeholders:

CREATE EXTERNAL TABLE customers (id INT, name STRING, level STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
STORED AS TEXTFILE
LOCATION '/user/talend/customers';
Create one flow to read the data in Hive: in the tHiveRow component, pick the Hive version and the Thrift server IP/port, then write a Hive query.
Click the Edit schema button and add one column with type Object; we will then parse the result and map it to our schema.
On the Advanced settings tab, enable "Parse query results", using the Object-type column we just created.
Drag the tParseRecordSet component onto the surface, connect the main output of tHiveRow to it, and click Edit schema to do the necessary mapping and match the values.
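The screenshot with the query is not reproduced here; as an illustration, an analysis query over the customers table created above might look like this (the grouping column is just an example):

```
SELECT level, COUNT(*) AS cnt
FROM customers
GROUP BY level;
```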
◦ Run this job; the console tells you whether it has connected to the Hive server.
◦ Go to the Hive server, and it will show that it has received the query and will execute it.
◦ You can see the results in the Talend Run console.
Drag the tHBaseOutput component from the right palette and configure the ZooKeeper info. Running the job produces the final output.
You can log in to the HBase shell and check that the data was inserted into HBase and that the table was also created by Talend.
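A quick check from the HBase shell; the table name depends on what you configured in tHBaseOutput, so 'customers' below is only a placeholder:

```
hbase shell
list                 # the table created by Talend should appear
scan 'customers'     # dump its rows
```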