Detect Latest Data From Daily Data - Technoligent
We collect daily data into our system from many sources: streaming,
copy, put, and log data from other data sources. The question now
is: how do we detect the latest data among all of it?
The solution is MapReduce! Let's learn how to do it.
• Here, we use Apache Pig to express the MapReduce job.
• The script loads the old data and the new data and sorts them
  by collect date.
• It then keeps only the records collected today and filters out
  records already processed on previous days.
Environment:
• Java: JDK 1.7
• Cloudera version: CDH4.6.0
Step 1:
Prepare some input data files. Open a new file in a Linux
terminal:
vi file1
Type some input data in this format:
id;product;price;collectdate
1;XY milk;2000;20160730000000
2;AB candy;5000;20160730000000
3;B chair;6000;20160730000000
vi file2
Type some input data in this format:
id;product;price;collectdate
2;AB candy;3000;20160731000000
3;B chair;1000;20160731000000
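Each line in these files follows the `id;product;price;collectdate` layout, split on semicolons. As a quick sanity check, here is a minimal Python sketch of how one record breaks into its four fields (the helper name `parse_record` is ours, for illustration only):

```python
def parse_record(line):
    # Split one "id;product;price;collectdate" line into its four fields.
    id_, product, price, collectdate = line.strip().split(";")
    return (id_, product, price, collectdate)

print(parse_record("2;AB candy;5000;20160730000000"))
# -> ('2', 'AB candy', '5000', '20160730000000')
```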
Step 2:
Put the local files into the Hadoop Distributed File System
(HDFS) with these commands:
hadoop fs -mkdir -p /data/mysample/mergedData_20160730000000
hadoop fs -mkdir -p /data/mysample/mergedData_20160731000000
hadoop fs -put file1 /data/mysample/mergedData_20160730000000/
hadoop fs -put file2 /data/mysample/mergedData_20160731000000/
Step 3:
This Pig script merges the old and new data and keeps only the
latest record per id from the daily data sets:

SET job.name 'merge old and new data with map reduce by Pig script';

-- Load the old data, which was already processed yesterday
previousDayData = LOAD '/data/mysample/mergedData_20160730000000/'
    USING PigStorage(';') AS (id:chararray,
                              product:chararray,
                              price:chararray,
                              collectdate:chararray);

-- Load the new data, which was collected today
todayData = LOAD '/data/mysample/mergedData_20160731000000/'
    USING PigStorage(';') AS (id:chararray,
                              product:chararray,
                              price:chararray,
                              collectdate:chararray);

-- Combine the two data sets
unionData = UNION previousDayData, todayData;

-- Group the data by id as the key of the data set
groupData = GROUP unionData BY id;

-- Within each group, sort by collect date descending so the latest
-- record ranks first, then keep only that one record. This is the
-- de-duplication step that produces the output data.
outputData = FOREACH groupData {
    sortedData = ORDER unionData BY collectdate DESC;
    removeDuplication = LIMIT sortedData 1;
    GENERATE FLATTEN(removeDuplication);
}

-- Store outputData to HDFS
STORE outputData INTO
    '/data/mysample/mergedData_20160731000000_processed/'
    USING PigStorage(';');
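For readers without a Hadoop cluster handy, the same merge-and-deduplicate logic (union both data sets, group by id, order by collect date descending, keep one record) can be sketched in plain Python. The function and variable names here are ours, for illustration only:

```python
from itertools import groupby

def latest_per_id(*datasets):
    # Union the datasets, then keep only the most recent record per id,
    # mimicking the Pig GROUP / ORDER ... DESC / LIMIT 1 pattern.
    union = [rec for ds in datasets for rec in ds]
    union.sort(key=lambda r: r[0])  # group key: id
    result = []
    for _id, group in groupby(union, key=lambda r: r[0]):
        # collectdate is yyyyMMddHHmmss, so string comparison
        # orders correctly; max() picks the latest record.
        result.append(max(group, key=lambda r: r[3]))
    return result

previous_day = [
    ("1", "XY milk", "2000", "20160730000000"),
    ("2", "AB candy", "5000", "20160730000000"),
    ("3", "B chair", "6000", "20160730000000"),
]
today = [
    ("2", "AB candy", "3000", "20160731000000"),
    ("3", "B chair", "1000", "20160731000000"),
]

for rec in latest_per_id(previous_day, today):
    print(";".join(rec))
```

Running this prints the same three records the Pig job stores for 31/Jul.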
Verify the result
We can check the output at the HDFS location:
hadoop fs -text /data/mysample/mergedData_20160731000000_processed/* | head -n 10
The latest data in HDFS for 31/Jul will be:
1;XY milk;2000;20160730000000
2;AB candy;3000;20160731000000
3;B chair;1000;20160731000000
• Understand the steps for merging daily data in big data
  application development with MapReduce.
• Follow all the steps as discussed in this post for best
  results.
• We hope this helps!
We Target the World!
info@technoligent.com www.technoligent.com