3. Problem Statement
Data Set: This is data for runs scored by players in different countries in different years. Let's assume some
external process is writing data into a directory in CSV format, where the columns of the data are as shown below:
Problem Statement:
Assume data is copied periodically into the "/home/cloudera/runs" directory. Write a Flume configuration to copy
this data to HDFS, and then write a Pig script to process the data and find the sum of runs
scored and balls played by each player.
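The column layout itself is not reproduced here. Based on the schema used in the Pig script later in this document, a plausible sample of the CSV feed might look like the following (the player IDs and values are illustrative assumptions, not real data):

```csv
Player_id,Year,Country,Opposition_Team,Runs_Scored,Balls_Played
100001,2011,India,Australia,85,92
100001,2012,India,England,40,55
100002,2011,Australia,India,120,110
```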
4. Solution Architecture
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large
amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and
fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a
simple extensible data model that allows for online analytic application.
The diagram above shows a high-level view of how Apache Flume interacts with the agent service and moves data
into HDFS using the Flume components: source, channel, and sink. Once the data is loaded into HDFS, we
then process and visualize it using Apache Pig. Apache Flume is a data ingestion system that is configured by
defining the endpoints of a data flow, called sources and sinks. In Flume, each individual piece of data is called
an event: sources produce events and send them through a channel, which connects the source to the sink. The sink then
writes the events out to a predefined location.
6. Solution Description
We need to set up HDFS from the Hadoop ecosystem so that other components like Flume and Pig can work on top of it. To
set it up, we download the release archives from the Apache website and install them on an Ubuntu machine. After a
successful installation, we can verify that Hadoop is installed on the machine correctly.
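The Flume agent configuration is not reproduced in the text. A minimal sketch of a spooling-directory-to-HDFS configuration for this problem might look like the following; the agent name `agent1` and the sink path `/user/cloudera/flume_sink` are assumptions based on the paths used elsewhere in this document:

```properties
# Name the components of the agent
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: watch the directory the external process writes CSV files into
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /home/cloudera/runs
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000
agent1.channels.ch1.transactionCapacity = 1000

# Sink: write the events into HDFS as plain text
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/cloudera/flume_sink
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.channel = ch1
```

The agent can then be started with `flume-ng agent --conf conf --conf-file runs.conf --name agent1`, after which files appearing in /home/cloudera/runs are ingested into HDFS.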
8. Flume storing files inside HDFS can be seen in the above screen.
Now we need to set up Pig to analyse the data stored on HDFS.
-- The data is CSV, so ',' is used as the field delimiter
A = LOAD '/user/cloudera/flume_sink/FlumeData.1526646743902' USING PigStorage(',') AS (Player_id:int, Year:chararray,
Country:chararray, Opposition_Team:chararray, Runs_Scored:int, Balls_Played:int);
B = FOREACH A GENERATE Player_id, Year, Country, Opposition_Team, Runs_Scored, Balls_Played;
C = GROUP B BY Player_id;
D = FOREACH C GENERATE group AS Player_id, SUM(B.Runs_Scored), SUM(B.Balls_Played);
DUMP D;
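As a sanity check for the Pig script's logic, the same per-player aggregation can be expressed in a few lines of Python. This is an illustrative sketch only; the sample rows below are assumptions, not the real data set:

```python
import csv
import io
from collections import defaultdict

# Illustrative CSV feed with the same columns the Pig script expects:
# Player_id, Year, Country, Opposition_Team, Runs_Scored, Balls_Played
sample = """100001,2011,India,Australia,85,92
100001,2012,India,England,40,55
100002,2011,Australia,India,120,110
"""

# Equivalent of GROUP BY Player_id with SUM(Runs_Scored) and SUM(Balls_Played)
totals = defaultdict(lambda: [0, 0])
for row in csv.reader(io.StringIO(sample)):
    player_id = int(row[0])
    totals[player_id][0] += int(row[4])  # Runs_Scored
    totals[player_id][1] += int(row[5])  # Balls_Played

for player_id, (runs, balls) in sorted(totals.items()):
    print(player_id, runs, balls)
```

Each output row corresponds to one tuple of relation D in the Pig script: the player ID followed by the two sums.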
9. Conclusion
Folder logging/spooling is a broad area for analysis. Many applications send and place their
feeds into such directories so that reporting tools can analyse that data, and the organization can benefit and grow
with it. In this project we have analysed CSV data, which keeps arriving periodically, using the Pig
language, and visualized the results.