Runs scored by Players Analysis
with Flume and Pig
Nitesh Ghosh
Contents
Problem Statement
Solution Architecture
Software and Tools Specification
Solution Description
Program Code
Conclusion
Problem Statement
Data Set: This data set contains runs scored by players in different countries across different years. Assume some
external process writes data into a directory in CSV format, with the columns Player_id, Year, Country,
Opposition_Team, Runs_Scored, and Balls_Played (the same schema used in the Pig script later in this document).
Problem Statement:
Assume data is copied periodically into the “/home/cloudera/runs” directory. Write a Flume configuration to copy
this data to HDFS, and then write a Pig script to compute the sum of runs scored and balls played by each
player.
Solution Architecture
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large
amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and
fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a
simple extensible data model that allows for online analytic applications.
At a high level, Flume moves data to HDFS through three agent components - source, channel, and sink; once the
data lands in HDFS, we analyze it using Apache Pig. Apache Flume is a data ingestion system that is configured by
defining the endpoints of a data flow, called sources and sinks. In Flume, each individual piece of data is called
an event: sources produce events and send them through a channel, which connects the source to the sink. The sink
then writes the events out to a predefined location.
Software and Tools Specification
 Oracle VirtualBox - Version 5.2.8 r121009 (Qt 5.6.2)
 Ubuntu 16.04 LTS
 Apache Hadoop - Version 2.7.6 (cluster environment)
 Apache Hive - Version 2.3.3 (set up on edge node)
 Apache Flume - Version 0.17.0
Solution Description
We need to set up HDFS from the Hadoop ecosystem so that the other components, Flume and Pig, can work on top
of it. To do this, we download the release archives from the Apache website and install them on the Ubuntu
machine. After a successful installation, we can verify that Hadoop is installed and running.
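A minimal verification, assuming the Hadoop binaries are on the PATH of the installed machine (these commands require a running Hadoop installation, so they are shown for reference only):

```shell
# Print the installed Hadoop version
hadoop version

# List the running Hadoop JVM processes (NameNode, DataNode, etc.)
jps

# List the HDFS root to confirm the filesystem is reachable
hdfs dfs -ls /
```

If `jps` shows the NameNode and DataNode processes and the `hdfs dfs -ls` call succeeds, HDFS is ready for Flume to write into.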
Program Code
Once HDFS is set up successfully, we need to configure Flume. Place the configuration file inside the
flume/conf directory. Two properties in the .conf file need to be adapted to the environment:
 agent1.sources.source1_1.spoolDir is set to the input path on the local file system.
 agent1.sinks.hdfs-sink1_1.hdfs.path is set to the output path in HDFS.
Configuration Details
# Agent components
agent1.sources = source1_1
agent1.channels = fileChannel1_1
agent1.sinks = hdfs-sink1_1

# File channel (durable buffering between source and sink)
agent1.channels.fileChannel1_1.type = file
agent1.channels.fileChannel1_1.capacity = 200000
agent1.channels.fileChannel1_1.transactionCapacity = 1000

# Spooling-directory source (watches the local input directory)
agent1.sources.source1_1.type = spooldir
agent1.sources.source1_1.spoolDir = /home/hadoopuser/Downloads/tmpload
agent1.sources.source1_1.fileHeader = false
agent1.sources.source1_1.fileSuffix = .COMPLETED

# HDFS sink (writes events into the flume_sink directory)
agent1.sinks.hdfs-sink1_1.type = hdfs
agent1.sinks.hdfs-sink1_1.hdfs.path = hdfs://localhost:9000/user/cloudera/flume_sink
agent1.sinks.hdfs-sink1_1.hdfs.batchSize = 1000
agent1.sinks.hdfs-sink1_1.hdfs.rollSize = 268435456
agent1.sinks.hdfs-sink1_1.hdfs.rollInterval = 0
agent1.sinks.hdfs-sink1_1.hdfs.rollCount = 50000000
agent1.sinks.hdfs-sink1_1.hdfs.writeFormat = Text
agent1.sinks.hdfs-sink1_1.hdfs.fileType = DataStream

# Wire the source and sink to the channel
agent1.sources.source1_1.channels = fileChannel1_1
agent1.sinks.hdfs-sink1_1.channel = fileChannel1_1
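With the configuration in place, the agent can be started with the standard flume-ng launcher. The agent name must match agent1 from the file; the configuration file name and FLUME_HOME paths below are illustrative assumptions:

```shell
# Start the Flume agent defined in the configuration above
flume-ng agent \
  --conf "$FLUME_HOME/conf" \
  --conf-file "$FLUME_HOME/conf/spool-to-hdfs.conf" \
  --name agent1 \
  -Dflume.root.logger=INFO,console
```

The agent then watches the spool directory, renames each ingested file with the .COMPLETED suffix, and streams its contents to the HDFS sink path.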
Once a file is placed in the tmpload folder, Flume picks it up and stores it inside HDFS; this can be confirmed
by listing the sink directory. Now we need to set up Pig to analyze the data stored in HDFS.
-- Load the CSV data delivered by Flume (comma delimiter, since the input is CSV)
A = LOAD '/user/cloudera/flume_sink/FlumeData.1526646743902' USING PigStorage(',')
    AS (Player_id:int, Year:chararray, Country:chararray, Opposition_Team:chararray,
        Runs_Scored:int, Balls_Played:int);
B = FOREACH A GENERATE Player_id, Year, Country, Opposition_Team, Runs_Scored, Balls_Played;
C = GROUP B BY Player_id;
D = FOREACH C GENERATE group AS Player_id, SUM(B.Runs_Scored) AS Total_Runs,
    SUM(B.Balls_Played) AS Total_Balls;
DUMP D;
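The group-and-sum logic above can be sanity-checked locally without a cluster. The sample rows below are made up for illustration and follow the (Player_id, Year, Country, Opposition_Team, Runs_Scored, Balls_Played) layout:

```shell
# Made-up sample rows in the CSV layout used by the Pig script
printf '%s\n' \
  '1,2016,India,Australia,55,60' \
  '1,2017,India,England,30,40' \
  '2,2016,Australia,India,80,75' > /tmp/runs_sample.csv

# Sum runs (column 5) and balls (column 6) per player (column 1),
# mirroring GROUP B BY Player_id followed by SUM in the Pig script
awk -F, '{runs[$1] += $5; balls[$1] += $6}
         END {for (p in runs) print p, runs[p], balls[p]}' /tmp/runs_sample.csv
```

For these rows, player 1 totals 85 runs off 100 balls and player 2 totals 80 runs off 75 balls, which is the per-player aggregate the Pig relation D produces.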
Conclusion
Directory logging/spooling is a broad area for analysis. Many applications write their feeds into such
directories so that reporting tools can analyze the data and the organization can benefit from it. In this
project we analyzed CSV data that is fed in periodically, ingesting it into HDFS with Flume and aggregating it
per player with Pig.
