Runs scored by Players Analysis
with Flume and Pig
Nitesh Ghosh
Contents
Problem Statement
Solution Architecture
Software and Tools Specification
Solution Description
Program Code
Conclusion
Problem Statement
Data Set: This data set contains runs scored by players in different countries across different years. Assume some
external process writes data into a directory in CSV format, with the columns Player_id, Year, Country,
Opposition_Team, Runs_Scored, and Balls_Played (the same schema used in the Pig script later in this document).
Problem Statement:
Assume data is copied periodically into the “/home/cloudera/runs” directory. Write a Flume configuration to copy
this data to HDFS, and then write a Pig script to compute the sum of runs scored and balls played by each
player.
Solution Architecture
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large
amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and
fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a
simple extensible data model that allows for online analytic applications.
At a high level, Flume moves data to HDFS through three agent components - source, channel, and sink; once the
data lands in HDFS, we analyze it using Apache Pig. Apache Flume is a data ingestion system that is configured by
defining the endpoints of a data flow, called sources and sinks. In Flume, each individual piece of data is called
an event: sources produce events and send them through a channel, which connects the source to the sink. The sink
then writes the events out to a predefined location.
Software and Tools Specification
 Oracle VirtualBox - Version 5.2.8 r121009 (Qt 5.6.2)
 Ubuntu 16.04 LTS
 Apache Hadoop - Version 2.7.6 (cluster environment)
 Apache Hive - Version 2.3.3 (set up on edge node)
 Apache Flume - Version 0.17.0
Solution Description
We need to set up HDFS from the Hadoop ecosystem so that the other components, Flume and Pig, can work on top
of it. To do this, we download the release archives from the Apache website and install them on the Ubuntu
machine. After a successful installation, we can verify that Hadoop is installed and running.
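A minimal verification, assuming the Hadoop binaries are on the PATH of the installed machine (these commands require a running Hadoop installation, so they are shown for reference only):

```shell
# Print the installed Hadoop version
hadoop version

# List the running Hadoop JVM processes (NameNode, DataNode, etc.)
jps

# List the HDFS root to confirm the filesystem is reachable
hdfs dfs -ls /
```

If `jps` shows the NameNode and DataNode processes and the `hdfs dfs -ls` call succeeds, HDFS is ready for Flume to write into.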
Program Code
Once HDFS is set up successfully, we need to configure Flume. Place the configuration file inside the
flume/conf directory. Two properties in the .conf file need to be adapted to the environment:
 agent1.sources.source1_1.spoolDir is set to the input path on the local file system.
 agent1.sinks.hdfs-sink1_1.hdfs.path is set to the output path in HDFS.
Configuration Details
# Agent components
agent1.sources = source1_1
agent1.channels = fileChannel1_1
agent1.sinks = hdfs-sink1_1

# File channel (durable buffering between source and sink)
agent1.channels.fileChannel1_1.type = file
agent1.channels.fileChannel1_1.capacity = 200000
agent1.channels.fileChannel1_1.transactionCapacity = 1000

# Spooling-directory source (watches the local input directory)
agent1.sources.source1_1.type = spooldir
agent1.sources.source1_1.spoolDir = /home/hadoopuser/Downloads/tmpload
agent1.sources.source1_1.fileHeader = false
agent1.sources.source1_1.fileSuffix = .COMPLETED

# HDFS sink (writes events into the flume_sink directory)
agent1.sinks.hdfs-sink1_1.type = hdfs
agent1.sinks.hdfs-sink1_1.hdfs.path = hdfs://localhost:9000/user/cloudera/flume_sink
agent1.sinks.hdfs-sink1_1.hdfs.batchSize = 1000
agent1.sinks.hdfs-sink1_1.hdfs.rollSize = 268435456
agent1.sinks.hdfs-sink1_1.hdfs.rollInterval = 0
agent1.sinks.hdfs-sink1_1.hdfs.rollCount = 50000000
agent1.sinks.hdfs-sink1_1.hdfs.writeFormat = Text
agent1.sinks.hdfs-sink1_1.hdfs.fileType = DataStream

# Wire the source and sink to the channel
agent1.sources.source1_1.channels = fileChannel1_1
agent1.sinks.hdfs-sink1_1.channel = fileChannel1_1
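With the configuration in place, the agent can be started with the standard flume-ng launcher. The agent name must match agent1 from the file; the configuration file name and FLUME_HOME paths below are illustrative assumptions:

```shell
# Start the Flume agent defined in the configuration above
flume-ng agent \
  --conf "$FLUME_HOME/conf" \
  --conf-file "$FLUME_HOME/conf/spool-to-hdfs.conf" \
  --name agent1 \
  -Dflume.root.logger=INFO,console
```

The agent then watches the spool directory, renames each ingested file with the .COMPLETED suffix, and streams its contents to the HDFS sink path.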
Once a file is placed in the tmpload folder, Flume picks it up and stores it inside HDFS; this can be confirmed
by listing the sink directory. Now we need to set up Pig to analyze the data stored in HDFS.
-- Load the CSV data delivered by Flume (comma delimiter, since the input is CSV)
A = LOAD '/user/cloudera/flume_sink/FlumeData.1526646743902' USING PigStorage(',')
    AS (Player_id:int, Year:chararray, Country:chararray, Opposition_Team:chararray,
        Runs_Scored:int, Balls_Played:int);
B = FOREACH A GENERATE Player_id, Year, Country, Opposition_Team, Runs_Scored, Balls_Played;
C = GROUP B BY Player_id;
D = FOREACH C GENERATE group AS Player_id, SUM(B.Runs_Scored) AS Total_Runs,
    SUM(B.Balls_Played) AS Total_Balls;
DUMP D;
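The group-and-sum logic above can be sanity-checked locally without a cluster. The sample rows below are made up for illustration and follow the (Player_id, Year, Country, Opposition_Team, Runs_Scored, Balls_Played) layout:

```shell
# Made-up sample rows in the CSV layout used by the Pig script
printf '%s\n' \
  '1,2016,India,Australia,55,60' \
  '1,2017,India,England,30,40' \
  '2,2016,Australia,India,80,75' > /tmp/runs_sample.csv

# Sum runs (column 5) and balls (column 6) per player (column 1),
# mirroring GROUP B BY Player_id followed by SUM in the Pig script
awk -F, '{runs[$1] += $5; balls[$1] += $6}
         END {for (p in runs) print p, runs[p], balls[p]}' /tmp/runs_sample.csv
```

For these rows, player 1 totals 85 runs off 100 balls and player 2 totals 80 runs off 75 balls, which is the per-player aggregate the Pig relation D produces.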
Conclusion
Directory logging/spooling is a broad area for analysis. Many applications write their feeds into such
directories so that reporting tools can analyze the data and the organization can benefit from it. In this
project we analyzed CSV data that is fed in periodically, ingesting it into HDFS with Flume and aggregating it
per player with Pig.
