Apache Sqoop - Import with Append Mode and Last-Modified Mode (Rupak Roy)
Get familiar with advanced Sqoop functions such as incremental import in append mode and last-modified mode.
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the enterprise data warehouse to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
2. Getting data from a MySQL database: import
$ sqoop import --connect jdbc:mysql://localhost/db_1 \
  --username root --password root --table student_details \
  --split-by ID --target-dir studentdata
$ hadoop fs -ls studentdata/
Now we can see multiple part-m files in the folder. This is because Sqoop uses multiple map tasks to process the job, and each mapper writes a subset of the rows. By default Sqoop uses 4 mappers, i.e. the output is divided among the number of mappers.
Use the cat command to view the contents of each mapper's output:
part-m-00000: row data such as abc 12 TX
part-m-00001: no data
part-m-00002: row data such as ecg 56 FL
part-m-00003: no data
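For example, the contents of the first output file can be viewed directly from HDFS (a minimal sketch, reusing the studentdata directory from the command above):
$ hadoop fs -cat studentdata/part-m-00000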
3. The reason behind this is that we are not using a primary key for splitting, which results in unbalanced tasks where some mappers process more data than others.
By default, Sqoop uses the primary key of the table as the splitting column.
Alternatively, we can address this issue by explicitly declaring which column will be used for splitting the rows among the mappers.
# explicitly declaring the split column
$ sqoop import --connect jdbc:mysql://localhost/db_1 \
  --username root --password root --table student_details \
  --target-dir studentdata --split-by ID
4. Now using a primary key
# add a primary key to our table in the database
mysql> ALTER TABLE student_details
       ADD PRIMARY KEY (ID);
# now use the same query to load the same data from the MySQL database into HDFS
$ sqoop import --connect jdbc:mysql://localhost/db_1 \
  --username root --password root --table student_details \
  --target-dir student_details1
Note: it will throw an error if the target directory student_details1 already exists.
# check the data
$ hadoop fs -ls /user/hduser/student_details1
$ hadoop fs -cat /user/hduser/student_details1/part-m-00000
# list the databases and tables
$ sqoop list-databases --connect jdbc:mysql://localhost/ --username root --password root
$ sqoop list-tables --connect jdbc:mysql://localhost/db_1 --username root --password root
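On a typical MySQL installation, list-databases prints one database name per line; the output would look something like this (illustrative, assuming db_1 exists alongside the system schemas):
information_schema
db_1
mysql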
5. Controlling parallelism
We know Sqoop by default uses map tasks to process its job.
However, Sqoop also provides the flexibility to change the number of map tasks depending on our job requirements.
Controlling the number of map tasks, i.e. the degree of parallel processing, helps to control the load on our database.
More mappers doesn't always mean faster performance. The optimal number depends on the type of database, the hardware of the nodes (systems), and the amount of job requests.
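The flag for this is -m (alias --num-mappers). A minimal sketch based on the earlier import, limiting the job to 2 map tasks (studentdata2 is an illustrative new target directory, since reusing the old one would fail):
$ sqoop import --connect jdbc:mysql://localhost/db_1 \
  --username root --password root --table student_details \
  --split-by ID --target-dir studentdata2 -m 2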
7. Now if we want to import only the updated data
This can be done by using 2 modes:
1) Append mode
2) Last-modified mode
1) Append mode:
First add new rows:
mysql> USE db_1;
mysql> INSERT INTO student_details(ID,Name,Location) VALUES (44,'albert','CA');
mysql> INSERT INTO student_details(ID,Name,Location) VALUES (55,'Zayn','MI');
# note: we need an integer data type to detect the last value in append mode
mysql> ALTER TABLE student_details
       MODIFY ID int(30);
Then import:
$ sqoop import --connect jdbc:mysql://localhost/db_1 --username root --password root \
  --table student_details --split-by ID --incremental append --check-column ID \
  --last-value 33 --target-dir Appendresults/
Here --incremental append sets the incremental import mode, --check-column ID names the column to examine, and --last-value 33 imports only rows whose ID is greater than 33.
Therefore append mode is used only when the table is populated with new rows.
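If the incremental import worked, only the rows beyond --last-value 33 should appear in the new directory; a quick check (output illustrative, in Sqoop's default comma-separated text format):
$ hadoop fs -cat Appendresults/part-m-*
44,albert,CA
55,Zayn,MI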
8. Now if we want to import only the updated data
2) Last-modified mode: is used to overcome append mode's limitations with row and column updates. Hence it is suitable when the table is populated with new rows and columns.
Each time the table gets updated, last-modified mode will use the recent timestamp attached to each update to import only the newly modified rows and columns into HDFS.
9. Add the columns
# add a timestamp column
mysql> ALTER TABLE student_details
       ADD COLUMN updated_at
       TIMESTAMP DEFAULT CURRENT_TIMESTAMP
       ON UPDATE CURRENT_TIMESTAMP;
# add a new column
mysql> ALTER TABLE student_details
       ADD COLUMN YEAR char(10)
       AFTER Location;
# add values to the new column
mysql> INSERT INTO student_details(YEAR)
       VALUES (2010);
OR
mysql> UPDATE student_details
       SET Year = 2010
       WHERE Location = 'FL';
…repeat again for the remaining 2 rows
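Before re-importing, it can help to confirm that the timestamp column is being maintained; a sketch using the columns defined above:
mysql> SELECT ID, Name, Location, YEAR, updated_at FROM student_details;
Rows touched by the UPDATE should show a fresh updated_at value.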
10. # then import
$ sqoop import --connect jdbc:mysql://localhost/db_1 --username root --password root \
  --table student_details --split-by ID --incremental lastmodified \
  --check-column updated_at --last-value "2017-01-15 13:00:28" \
  --target-dir lmresults/
Here --incremental lastmodified sets the incremental import mode, and --check-column updated_at compares the timestamp column against --last-value "2017-01-15 13:00:28".
11. Append mode vs last-modified mode
Both append and last-modified mode set themselves apart with unique advantages over each other's limitations.
In append mode you don't have to delete the existing output folder in HDFS; Sqoop will create another file and name it sequentially by itself.
But in last-modified mode Sqoop needs the existing output HDFS folder to be empty.
Also, in append mode Sqoop will import the data from the described last value onward, but in last-modified mode it will take all the newly modified rows and columns into account.
12. Next
In real life it might not be efficient or practical to remember the last value each time we run Sqoop. To overcome this issue we have another feature called sqoop job.
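As a preview of that feature, a saved job stores the incremental state so Sqoop tracks the last value by itself between runs; a minimal sketch (the job name incremental_import is illustrative):
$ sqoop job --create incremental_import -- import --connect jdbc:mysql://localhost/db_1 \
  --username root --password root --table student_details \
  --incremental append --check-column ID --last-value 0 --target-dir Appendresults/
$ sqoop job --list
$ sqoop job --exec incremental_import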