Detect Latest Data From Daily Data - Technoligent
We collect daily data into our system from many sources: streaming,
copy, put, and log data from other data sources. The question now
is: how do we detect the latest data among all of it?
The solution is MapReduce! Let's learn how to do it.
• Here, we use Apache Pig to express the MapReduce job.
• The script loads the old data and the new data and sorts them
  by collect date.
• It then keeps only the records collected today and filters out
  records already processed on previous days.
Environment:
• Java: JDK 1.7
• Cloudera version: CDH4.6.0
Step 1:
Prepare some input data files. Open a new file in a Linux
terminal:
vi file1
Type some input data in this format:
id;product;price;collectdate
1;XY milk;2000;20160730000000
2;AB candy;5000;20160730000000
3;B chair;6000;20160730000000
vi file2
Type some input data in this format:
id;product;price;collectdate
2;AB candy;3000;20160731000000
3;B chair;1000;20160731000000
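Each line in these files follows the `id;product;price;collectdate` layout, split on semicolons. As a quick sanity check, here is a minimal Python sketch of how one record breaks into its four fields (the helper name `parse_record` is ours, for illustration only):

```python
def parse_record(line):
    # Split one "id;product;price;collectdate" line into its four fields.
    id_, product, price, collectdate = line.strip().split(";")
    return (id_, product, price, collectdate)

print(parse_record("2;AB candy;5000;20160730000000"))
# -> ('2', 'AB candy', '5000', '20160730000000')
```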
Step 2:
Put the local files into the Hadoop Distributed File System
(HDFS) with these commands:
hadoop fs -mkdir -p /data/mysample/mergedData_20160730000000
hadoop fs -mkdir -p /data/mysample/mergedData_20160731000000
hadoop fs -put file1 /data/mysample/mergedData_20160730000000/
hadoop fs -put file2 /data/mysample/mergedData_20160731000000/
Step 3:
This Pig script merges the old and new data and keeps only the
latest record per id from the daily data sets:

SET job.name 'merge old and new data with map reduce by Pig script';

-- Load the old data, which was already processed yesterday
previousDayData = LOAD '/data/mysample/mergedData_20160730000000/'
    USING PigStorage(';') AS (id:chararray,
                              product:chararray,
                              price:chararray,
                              collectdate:chararray);

-- Load the new data, which was collected today
todayData = LOAD '/data/mysample/mergedData_20160731000000/'
    USING PigStorage(';') AS (id:chararray,
                              product:chararray,
                              price:chararray,
                              collectdate:chararray);

-- Combine the two data sets
unionData = UNION previousDayData, todayData;

-- Group the data by id as the key of the data set
groupData = GROUP unionData BY id;

-- Within each group, sort by collect date descending so the latest
-- record ranks first, then keep only that one record. This is the
-- de-duplication step that produces the output data.
outputData = FOREACH groupData {
    sortedData = ORDER unionData BY collectdate DESC;
    removeDuplication = LIMIT sortedData 1;
    GENERATE FLATTEN(removeDuplication);
}

-- Store outputData to HDFS
STORE outputData INTO
    '/data/mysample/mergedData_20160731000000_processed/'
    USING PigStorage(';');
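For readers without a Hadoop cluster handy, the same merge-and-deduplicate logic (union both data sets, group by id, order by collect date descending, keep one record) can be sketched in plain Python. The function and variable names here are ours, for illustration only:

```python
from itertools import groupby

def latest_per_id(*datasets):
    # Union the datasets, then keep only the most recent record per id,
    # mimicking the Pig GROUP / ORDER ... DESC / LIMIT 1 pattern.
    union = [rec for ds in datasets for rec in ds]
    union.sort(key=lambda r: r[0])  # group key: id
    result = []
    for _id, group in groupby(union, key=lambda r: r[0]):
        # collectdate is yyyyMMddHHmmss, so string comparison
        # orders correctly; max() picks the latest record.
        result.append(max(group, key=lambda r: r[3]))
    return result

previous_day = [
    ("1", "XY milk", "2000", "20160730000000"),
    ("2", "AB candy", "5000", "20160730000000"),
    ("3", "B chair", "6000", "20160730000000"),
]
today = [
    ("2", "AB candy", "3000", "20160731000000"),
    ("3", "B chair", "1000", "20160731000000"),
]

for rec in latest_per_id(previous_day, today):
    print(";".join(rec))
```

Running this prints the same three records the Pig job stores for 31/Jul.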
Verify the result
We can check the output at the HDFS location:
hadoop fs -text /data/mysample/mergedData_20160731000000_processed/* | head -n 10
The latest data in HDFS for 31/Jul will be:
1;XY milk;2000;20160730000000
2;AB candy;3000;20160731000000
3;B chair;1000;20160731000000
• Understand the steps for merging daily data in big data
  application development with MapReduce.
• Follow all the steps as discussed in this post for best
  results.
• We hope this helps!
We Target the World!
info@technoligent.com www.technoligent.com