Runs scored by Players Analysis
with Flume and Pig
Nitesh Ghosh
Contents
Problem Statement
Solution Architecture
Software and Tools Specification
Solution Description
Program Code
Conclusion
Problem Statement
Data Set: The data set records runs scored by players from different countries in different years. Assume an
external process writes the data into a directory in CSV format with the following columns: player ID, year,
country, opposition team, runs scored, and balls played.
Problem Statement:
Assume data is copied periodically into the “/home/cloudera/runs” directory. Write a Flume configuration that
copies this data to HDFS, then write a Pig script that computes the total runs scored and balls played by
each player.
Solution Architecture
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large
amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and
fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a
simple extensible data model that allows for online analytic application.
The architecture diagram shows a high-level view of how a Flume agent moves data to HDFS using its three
components: source, channel, and sink. Once the data is loaded into HDFS, we analyze it with Apache Pig.
Apache Flume is a data ingestion system that is configured by defining the endpoints of a data flow, called
sources and sinks. In Flume, each individual piece of data is called an event. Sources produce events and
send them through a channel, which connects the source to the sink. The sink then writes the events out to a
predefined location.
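The source, channel, and sink flow described above can be sketched in a few lines of Python. This is only an illustrative model, not Flume's API: the names Channel, spool_source, drain_sink, and run_pipeline are hypothetical, and the in-memory list stands in for the HDFS destination.

```python
from collections import deque


class Channel:
    """Bounded FIFO buffer standing in for a Flume channel (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        # A full channel rejects new events instead of dropping them silently.
        if len(self._events) >= self.capacity:
            raise OverflowError("channel is full")
        self._events.append(event)

    def take(self, batch_size):
        # Hand out up to batch_size events in arrival order.
        batch = []
        while self._events and len(batch) < batch_size:
            batch.append(self._events.popleft())
        return batch


def spool_source(lines, channel):
    """Turn each input line into one event and put it on the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line})


def drain_sink(channel, batch_size, destination):
    """Move events from the channel to the destination in batches."""
    while True:
        batch = channel.take(batch_size)
        if not batch:
            break
        destination.extend(event["body"] for event in batch)


def run_pipeline(lines, capacity=100, batch_size=2):
    """Wire source -> channel -> sink and return what the sink wrote."""
    channel = Channel(capacity)
    spool_source(lines, channel)
    written = []
    drain_sink(channel, batch_size, written)
    return written


if __name__ == "__main__":
    print(run_pipeline(["row 1", "row 2", "row 3"]))
```

The channel is the only piece that holds state, which is why its capacity and batch size are the parameters that matter when tuning a real agent.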
Software and Tools Specification
• Oracle VirtualBox - Version 5.2.8 r121009 (Qt5.6.2)
• Ubuntu 16.04 LTS
• Apache Hadoop - Version 2.7.6 (Cluster Environment)
• Apache Hive - Version 2.3.3 (Set up on Edge Node)
• Apache Flume - Version 0.17.0
Solution Description
We need to set up HDFS from the Hadoop ecosystem so that other components such as Flume and Pig can work on
it. To do so, we download the files from the Apache website and install them on the Ubuntu machine. After
installation, we verify that Hadoop is installed on the machine successfully.
Program Code
Once HDFS is set up successfully, we need to configure Flume and set up its configuration files.
Place the configuration file inside the flume/conf directory. We need to make two changes inside the .conf file, as follows.
• agent1.sources.source1_1.spoolDir is set to the input path on the local file system.
• agent1.sinks.hdfs-sink1_1.hdfs.path is set to the output path in HDFS.
Configuration Details
agent1.channels.fileChannel1_1.type = file
agent1.channels.fileChannel1_1.capacity = 200000
agent1.channels.fileChannel1_1.transactionCapacity = 1000
agent1.sources.source1_1.type = spooldir
agent1.sources.source1_1.spoolDir = /home/hadoopuser/Downloads/tmpload
agent1.sources.source1_1.fileHeader = false
agent1.sources.source1_1.fileSuffix = .COMPLETED
agent1.sinks.hdfs-sink1_1.type = hdfs
agent1.sinks.hdfs-sink1_1.hdfs.path = hdfs://localhost:9000/user/cloudera/flume_sink
agent1.sinks.hdfs-sink1_1.hdfs.batchSize = 1000
agent1.sinks.hdfs-sink1_1.hdfs.rollSize = 268435456
agent1.sinks.hdfs-sink1_1.hdfs.rollInterval = 0
agent1.sinks.hdfs-sink1_1.hdfs.rollCount = 50000000
agent1.sinks.hdfs-sink1_1.hdfs.writeFormat = Text
agent1.sinks.hdfs-sink1_1.hdfs.fileType = DataStream
agent1.sources.source1_1.channels = fileChannel1_1
agent1.sinks.hdfs-sink1_1.channel = fileChannel1_1
agent1.sinks = hdfs-sink1_1
agent1.sources = source1_1
agent1.channels = fileChannel1_1
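A configuration like the one above only works if every source and sink points at a channel that the agent declares. The following Python sketch is one possible way to check that wiring before starting the agent. It is not part of Flume: parse_properties and check_wiring are hypothetical helpers, and SAMPLE is a shortened copy of the configuration used in this project.

```python
SAMPLE = """
agent1.sources = source1_1
agent1.channels = fileChannel1_1
agent1.sinks = hdfs-sink1_1
agent1.sources.source1_1.type = spooldir
agent1.sources.source1_1.channels = fileChannel1_1
agent1.sinks.hdfs-sink1_1.type = hdfs
agent1.sinks.hdfs-sink1_1.channel = fileChannel1_1
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of wiring problems; an empty list means the agent is consistent."""
    problems = []
    channels = set(props.get(f"{agent}.channels", "").split())

    # A source may feed several channels (space-separated list).
    for source in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not used:
            problems.append(f"source {source} has no channel")
        for name in used:
            if name not in channels:
                problems.append(f"source {source} uses undeclared channel {name}")

    # A sink reads from exactly one channel.
    for sink in props.get(f"{agent}.sinks", "").split():
        name = props.get(f"{agent}.sinks.{sink}.channel", "")
        if name not in channels:
            problems.append(f"sink {sink} uses undeclared channel {name!r}")

    return problems


if __name__ == "__main__":
    print(check_wiring(parse_properties(SAMPLE), "agent1"))
```

Note the asymmetry the check relies on: the source key is the plural `channels`, while the sink key is the singular `channel`.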
File placed in the tmpload folder
Flume stores the file in HDFS, as we can see in the screen above.
Now we need to set up Pig to analyze the data stored on HDFS.
-- Load the file written by Flume; the fields are comma-separated (CSV).
A = LOAD '/user/cloudera/flume_sink/FlumeData.1526646743902' USING PigStorage(',') AS (Player_id:int, Year:chararray,
    Country:chararray, Opposition_Team:chararray, Runs_Scored:int, Balls_Played:int);
-- Project the columns, then group the rows by player.
B = FOREACH A GENERATE Player_id, Year, Country, Opposition_Team, Runs_Scored, Balls_Played;
C = GROUP B BY Player_id;
-- Total runs scored and balls played for each player.
D = FOREACH C GENERATE group, SUM(B.Runs_Scored), SUM(B.Balls_Played);
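To sanity-check the Pig result on a small sample, the same group-and-sum logic can be reproduced in plain Python. This is a sketch under assumptions: totals_by_player is a hypothetical helper, the column order follows the schema in the LOAD statement, and the three sample rows are invented for illustration rather than taken from the real data set.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows: Player_id, Year, Country, Opposition_Team, Runs_Scored, Balls_Played
SAMPLE = """1,2010,India,Australia,50,60
1,2011,India,England,30,40
2,2010,Australia,India,10,12
"""


def totals_by_player(csv_text):
    """Return {player_id: (total_runs, total_balls)}, mirroring GROUP BY plus SUM."""
    totals = defaultdict(lambda: [0, 0])
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        player_id = int(row[0])
        totals[player_id][0] += int(row[4])  # Runs_Scored
        totals[player_id][1] += int(row[5])  # Balls_Played
    return {player_id: tuple(sums) for player_id, sums in totals.items()}


if __name__ == "__main__":
    print(totals_by_player(SAMPLE))  # {1: (80, 100), 2: (10, 12)}
```

Running both versions on the same few rows is a quick way to confirm that the delimiter and column positions in the Pig script match the file Flume wrote.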
Conclusion
Directory spooling is a broad area for analysis. Many applications place their data feeds in a directory so
that reporting tools can analyze the data and the organization can benefit and grow from it. In this project
we analyzed CSV data that arrives periodically, using Flume to load it into HDFS and Pig to aggregate it.
