SlideShare a Scribd company logo
1 of 5
Download to read offline
International Journal of Technical Innovation in Morden
Engineering & Science (IJTIMES)
Impact Factor: 3.45 (SJIF-2015), e-ISSN: 2455-2584
Volume 2, Issue 4, April-2016
IJTIMES-2015@All rights reserved 1
Twitter Word Frequency Count using Hadoop Components
1
Ms. Pooja S. Desai, 2
Mrs. Isha Agarwal
1
M.Tech., Computer Engineering Department,CGPIT,Uka Tarsadia University
2
Asst. Professor. Computer Engineering Department,CGPIT,Uka Tarsadia University
Abstract- Analysis of twitter data can be very useful for marketing as well as for promotion strategies. This paper
implements word frequency count for 1.5+ million twitters dataset .The hadoop components such as Apache pig and
MapReduce framework are used for parallel execution of the data. This parallel execution makes the implementation
time-effective and feasible to execute on a single system.
Keywords – Twitter data, Hadoop, Apache Pig, MapReduce, Big Data Analytics
I. INTRODUCTION
With the evolution of computing technologies and data science field, it is now possible to manage immensely large
data and perform computation on it that previously could handled only by super computers or large datasets. Today
social media is a critical part of the web. Social media platforms applications have a widespread reach to users all over
the globe. According to statistica.com, the marketers use various social media as follows:
TABLE 1
SOCIAL MEDIA PROFILES FOR MARKETING[5]
Facebook Twitter LinkedIn Google+ Pinterest
B2C 97% 81% 59% 51% 51%
B2B 89% 86% 88% 59% 41%
Social media sites and a strong web presence are crucial from marketing perspective in all forms of business.
The interest in social media from all walks of life has been increasing exponentially from both application and research
perspective. A wide number of tools, open source and commercial, is now available to gain a first impression of a
particular social media channel. For popular social media channels like Twitter, Facebook, Google, YouTube many tools
or services can be found to provide an overview of that channel. The span of all social media sites is growing faster day
by day and as a consequence of that a lot of data is produced either structured or unstructured. This data can provide
deeper insights by using analytics. Analysing the twitter data can lead to focused marketing goals.
According to internetlivestats.com, every day around 500 million tweets are tweeted on twitter. By analysing
this tweets, the frequently appearing word can be derived, which can be provide useful insights for developing marketing
strategy, finding trends of sale or inventory stock patterns. Here in this paper, 1.5 million+ tweets are analysed and each
word frequency is calculated, which can further be used for many purposes.
Analysing such a massive data on a RDBMS or a single system is not feasible. The big data analytics must be
used for analysis of such datasets. Hadoop Eco-system is one of the solutions for big data analytics which can be used.
II. HADOOP ECOSYSTEM[2]
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across
clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on
top of a cluster of computers, each of which may be prone to failures. [3]
Hadoop Eco System is mainly consists of two sevices:
A. Hadoop Distributed File system
A reliable, distributed file system that is mainly used to store all the data of the hadoop cluster. The data is not
necessarily well-structured as in case of databases, but data can be in the form of structured, un-structured or semi-
structured format. The HDFS system works on a Master-Slave approach. It contains one Name node that acts as a master
server. It manages the storage of data, allocation and de-allocation of disks and replication management of the data.
There are numbers of Data nodes that acts as a slave. It manages the storage of file and handles read-write requests. It
also handles block creation, deletion or update based on instruction from the Name node. Thus, with help of name node
and data node, HDFS manages all issues like where to store the data physically? Or how many replicas should be
generated? And how the replicas should be distributed across different nodes?
B. Hadoop Mapreduce
International Journal of Technical Innovation in Morden Engineering & Science (IJTIMES)
Volume 2, Issue 4, April-2016, e-ISSN: 2455-2584, Impact Factor: 3.45 (SJIF-2015)
IJTIMES-2015@All rights reserved 2
Hadoop Map-reduce is a software-framework that allows computation to be performed in parallel manner.
Contradictory to an HDFS, Map-Reduce provides computation facilities on stored structured or unstructured data. The
number of mappers and reducers varies based on application and data sizes.
Hadoop also supports many applications such as Apache Pig, Apache Hive, Apache HBase, Apache ooozie and
many more.
TABLE 2
HADOOP ECOSYSTEM COMPONENTS[3]
Hadoop Component Description
Sqoop -Used as intermediate between SQL and Hadoop.
- It let the user import tables or entire database into HDFS system
HBASE -It is a columnar storage based on Google’s big table
-It can handle massive data tables combining billions and billions of rows and
columns
Pig -It is high-level platform for creating Map-reduce program using Hadoop
- It can be used to build much larger, much more complex applications
Hive -It is a datawarehouse that that facilitates querying and managing large dataset
residing in HDFS.
- It uses query language called Hive QL
Oozie - It is workflow schedule system that manages all the hadoop jobs.
- It is integrated with the hadoop stack and supports several different hadoop
jobs.
Zookeeper - It provides distributed configuration service and synchronization service.
- It provides operational services for hadoop cluster.
Flume - It is distributed and reliable service for efficient collection, aggregation and
moving of large amount of data.
Mahout - It focuses on distributed or scalable machine learning.
- It is currently in its growing phase.
R Connectors - It is most widely used open source statistical tool.
- It contains rich set of packages for statistical analysis and plotting.
Each of these applications has its own working models and different applications. Based on requirement of a system,
any of these applications can be used, either alone or together with other components.
In this paper Apache Pig for data extraction and Map Reduce paradigm for processing the data is used.
III. APACHE PIG OVERVIEW[4]
Apache Pig is a part of Hadoop Ecosystem which provides a high level language called Pig Latin. Pig commands are
translated into Map-Reduce jobs. Pig tool can easily work with structured as well as unstructured data. Pig is a complete
component of Hadoop Ecosystem as all the manipulation on the data can be done using Pig only. Pig can efficiently work
with many languages such as Ruby, Python or Java etc. It can also be used to translate one data formats into other
formats. Pig can ingest data from files, streams or any other data formats. A good example of Pig Application is ETL
transaction model which basically extracts, transforms and loads the data.
Fig. 1: Apache Pig data format support
IV. MAP REDUCE FRAMEWORK OVERVIEW[4]
Map reduce Framework is a programming model that was developed for parallel processing applications under
Hadoop. It splits the data into various chunks and then these chunks are processed in parallel. Map Reduce brings
computation to the data. It basically involves two phases:
A. Mapper Phase
International Journal of Technical Innovation in Morden Engineering & Science (IJTIMES)
Volume 2, Issue 4, April-2016, e-ISSN: 2455-2584, Impact Factor: 3.45 (SJIF-2015)
IJTIMES-2015@All rights reserved 3
B. Reducer phase
In Mapper phase, Map job takes a collections of data and converts it into another set of data, where individual
elements are broken into <key,value> pairs. While in Reducer phase, reduce jobs take the output of the map phase that is
<key,value> pairs and produce smaller set of <key,value> pairs as an output.
MapReduce framework works with the help of two processes provided by the hadoop eco system. A Job
Tracker that sends the mapreduce tasks to specific nodes in the cluster. And a Task Tracker that are deployes on each
machine in the cluster. It is responsible for running the map and reduce tasks as instructed by the Job tracker.
V. TWEETS ANALYSIS
Social media like twitter, facebook etc produces a lot of structured as well as unstructured data on daily basis, which
if analysed can be very useful. The system working is shown in figure 2.
Fig. 2: Twitter word frequency Count System
The Twitter database contains a huge amount of data that grows rapidly every day. From the whole database, a
dataset is selected for analysis which is basically a part of a twitter database. The task is divided into two processes
namely
A. Extraction of tweets from twitter dataset
The twitter dataset contains lot of information such as user names, tweet date and timing, tweet contents and many
more.[5] But for finding a word frequency only the tweet contents are needed. Similarly we can also count tweets per
users by extracting user names only. But here we are concerned with tweets and its words so only tweets contents are
extracted from the dataset.
Apache pig can be used for this transformation. It inputs the data from our dataset of CSV format and using Pig
Latin we can extract tweet contents efficiently. This output data is in text format and contains only tweets content. This
data can be stored in any other format also.
B. Word frequency count from tweets
After the extraction of data word frequency algorithm can directly be applied on that data. Map and reduce algorithm
is used to obtain parallelism in this process. The algorithm for the same is presented in table 2 and table 3 as shown
below.
TABLE 3
WORD FREQUENCY MAPPER ALGORITHM
Algorithm : Word Frequency Mapper
Input: Text file
Output: <word,count> pair
1. Divide the data into different map jobs
2. For each line of text
3. Count each word separated by whitespaces
4. Form <word,count> pair locally
5. Temporary store all the <word,count> pair
6. Forward list of the <word,count> pair to reducer task
International Journal of Technical Innovation in Morden Engineering & Science (IJTIMES)
Volume 2, Issue 4, April-2016, e-ISSN: 2455-2584, Impact Factor: 3.45 (SJIF-2015)
IJTIMES-2015@All rights reserved 4
TABLE 4
WORD FREQUENCY REDUCER ALGORITHM
Algorithm: Word Frequency Reducer
Input : <word,count> pair
Output: <word,total-count> pair
1. Prev_key = None
2. Current_key = word from first <word,count> pair
3. For each <word, count> pair repeat
4. If Prev_key == Current_key
5. total_count += count;
6. Else
7. total_count = count;
8. Prev_key = Current_key
9. Return all <word,total-count> pair
VI. RESULTS
In this system, a word frequency count is implemented which produces list of words appearing in
tweets with their counts. The time analysis for execution of the program is shown in a figure 3. The system can
efficiently perform in parallel and produces output file in comparatively small amount of time. With the variable
size of data, system can take variable time. The execution time also depends on configuration of Hadoop
Cluster.
From the output data generated it is analysed that some regular words appears more frequently that
others as shown below:
TABLE 5
SOME FREQUENT WORDS IN OUTPUT
Some Words Count
! 10666
But 13898
Hi 2770
I 341441
It 13244
My 14038
Thank 6262
You 3392
And many more...
Fig.3: Performance Measure of system
0
2
4
6
8
10
60 thousands 1 miilion 1.4 million 1.6 million
ExecutionTime(minutes)
Size of datasets
Performance Measure
International Journal of Technical Innovation in Morden Engineering & Science (IJTIMES)
Volume 2, Issue 4, April-2016, e-ISSN: 2455-2584, Impact Factor: 3.45 (SJIF-2015)
IJTIMES-2015@All rights reserved 5
VII. CONCLUSION AND FUTURE WORK
In this paper, a system for counting the frequency of words of 1.5+ million tweets in parallel is
implemented. It is implemented using two components of the Hadoop Eco system namely Apache Pig and Map
reduce framework. The parallel execution of the program makes it feasible and fast.
The frequent word implementation can further be extended to analyse the dataset for marketing
purpose or promotion strategy definitions. The system can be further extended to perform sentiment analysis or
prediction of trends in markets.
REFERENCES
[1] White, Tom. Hadoop: The definitive guide. " O'Reilly Media, Inc.", 2012.
[2] Chowdhury, Badrul, et al. A BigBench implementation in the Hadoop ecosystem. Advancing Big Data
Benchmarks. Springer International Publishing, 2013. 3-18.
[3] Olston, Christopher, et al. Pig latin: a not-so-foreign language for data processing. Proceedings of the
2008 ACM SIGMOD international conference on Management of data. ACM, 2008.
[4] M. Bhandarkar, MapReduce programming with apache Hadoop, Parallel & Distributed Processing
(IPDPS), 2010 IEEE International Symposium on, Atlanta, GA, 2010, pp. 1-1.
[5] https://www.statista.com/chart/2289/how-marketers-use-social-media/
[6] http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip

More Related Content

What's hot

Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Cognizant
 
10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview Questions10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview QuestionsZaranTech LLC
 
Hadoop and its role in Facebook: An Overview
Hadoop and its role in Facebook: An OverviewHadoop and its role in Facebook: An Overview
Hadoop and its role in Facebook: An Overviewrahulmonikasharma
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architectureHarikrishnan K
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesDavid Tjahjono,MD,MBA(UK)
 
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...Puneet Kansal
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPTAnand Pandey
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopA Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopIJTET Journal
 

What's hot (18)

Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop technology doc
Hadoop technology docHadoop technology doc
Hadoop technology doc
 
Big data
Big dataBig data
Big data
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
IJET-V2I6P25
IJET-V2I6P25IJET-V2I6P25
IJET-V2I6P25
 
10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview Questions10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview Questions
 
Hadoop and its role in Facebook: An Overview
Hadoop and its role in Facebook: An OverviewHadoop and its role in Facebook: An Overview
Hadoop and its role in Facebook: An Overview
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
hadoop
hadoophadoop
hadoop
 
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
 
hive lab
hive labhive lab
hive lab
 
Hadoop Presentation
Hadoop PresentationHadoop Presentation
Hadoop Presentation
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPT
 
Seminar Report Vaibhav
Seminar Report VaibhavSeminar Report Vaibhav
Seminar Report Vaibhav
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopA Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
 

Viewers also liked

Analyzing the trend of students studying abroad as a result of various param...
Analyzing the trend  of students studying abroad as a result of various param...Analyzing the trend  of students studying abroad as a result of various param...
Analyzing the trend of students studying abroad as a result of various param...Arjun Sehgal
 
Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753pradip patel
 
Apache Pig: A big data processor
Apache Pig: A big data processorApache Pig: A big data processor
Apache Pig: A big data processorTushar B Kute
 
The Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInThe Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInCarl Steinbach
 
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and ReuseVasia Kalavri
 

Viewers also liked (6)

Analyzing the trend of students studying abroad as a result of various param...
Analyzing the trend  of students studying abroad as a result of various param...Analyzing the trend  of students studying abroad as a result of various param...
Analyzing the trend of students studying abroad as a result of various param...
 
Apache pig
Apache pigApache pig
Apache pig
 
Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753
 
Apache Pig: A big data processor
Apache Pig: A big data processorApache Pig: A big data processor
Apache Pig: A big data processor
 
The Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInThe Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedIn
 
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reuse
 

Similar to Twitter word frequency analysis using Hadoop

A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introductionsaisreealekhya
 
IRJET- Sentiment Analysis on Twitter Posts using Hadoop
IRJET- Sentiment Analysis on Twitter Posts using HadoopIRJET- Sentiment Analysis on Twitter Posts using Hadoop
IRJET- Sentiment Analysis on Twitter Posts using HadoopIRJET Journal
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringIRJET Journal
 
IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopIOSR Journals
 
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...IRJET Journal
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Rio Info
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoopManoj Jangalva
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data scienceAjay Ohri
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewIRJET Journal
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBhavya Gulati
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook AhmedDoukh
 

Similar to Twitter word frequency analysis using Hadoop (20)

Big data Presentation
Big data PresentationBig data Presentation
Big data Presentation
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
Big data
Big dataBig data
Big data
 
IRJET- Sentiment Analysis on Twitter Posts using Hadoop
IRJET- Sentiment Analysis on Twitter Posts using HadoopIRJET- Sentiment Analysis on Twitter Posts using Hadoop
IRJET- Sentiment Analysis on Twitter Posts using Hadoop
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and Storing
 
IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43
 
G017143640
G017143640G017143640
G017143640
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
IJARCCE_49
IJARCCE_49IJARCCE_49
IJARCCE_49
 
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
 
paper
paperpaper
paper
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A Review
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edge
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook
 
[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...
[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...
[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...
 

Recently uploaded

Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsResearcher Researcher
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSsandhya757531
 
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书rnrncn29
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdfHafizMudaserAhmad
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewsandhya757531
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfManish Kumar
 
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmComputer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmDeepika Walanjkar
 
OOP concepts -in-Python programming language
OOP concepts -in-Python programming languageOOP concepts -in-Python programming language
OOP concepts -in-Python programming languageSmritiSharma901052
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSneha Padhiar
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.elesangwon
 
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfPaper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfNainaShrivastava14
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfisabel213075
 
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHTEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHSneha Padhiar
 
Levelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument methodLevelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument methodManicka Mamallan Andavar
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfDrew Moseley
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 

Recently uploaded (20)

Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending Actuators
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
 
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overview
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
 
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmComputer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
 
OOP concepts -in-Python programming language
OOP concepts -in-Python programming languageOOP concepts -in-Python programming language
OOP concepts -in-Python programming language
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
 
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfPaper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdf
 
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHTEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
 
Levelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument methodLevelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument method
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdf
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 

Twitter word frequency analysis using Hadoop

  • 1. International Journal of Technical Innovation in Morden Engineering & Science (IJTIMES) Impact Factor: 3.45 (SJIF-2015), e-ISSN: 2455-2584 Volume 2, Issue 4, April-2016 IJTIMES-2015@All rights reserved 1 Twitter Word Frequency Count using Hadoop Components 1 Ms. Pooja S. Desai, 2 Mrs. Isha Agarwal 1 M.Tech., Computer Engineering Department,CGPIT,Uka Tarsadia University 2 Asst. Professor. Computer Engineering Department,CGPIT,Uka Tarsadia University Abstract- Analysis of twitter data can be very useful for marketing as well as for promotion strategies. This paper implements word frequency count for 1.5+ million twitters dataset .The hadoop components such as Apache pig and MapReduce framework are used for parallel execution of the data. This parallel execution makes the implementation time-effective and feasible to execute on a single system. Keywords – Twitter data, Hadoop, Apache Pig, MapReduce, Big Data Analytics I. INTRODUCTION With the evolution of computing technologies and data science field, it is now possible to manage immensely large data and perform computation on it that previously could handled only by super computers or large datasets. Today social media is a critical part of the web. Social media platforms applications have a widespread reach to users all over the globe. According to statistica.com, the marketers use various social media as follows: TABLE 1 SOCIAL MEDIA PROFILES FOR MARKETING[5] Facebook Twitter LinkedIn Google+ Pinterest B2C 97% 81% 59% 51% 51% B2B 89% 86% 88% 59% 41% Social media sites and a strong web presence are crucial from marketing perspective in all forms of business. The interest in social media from all walks of life has been increasing exponentially from both application and research perspective. A wide number of tools, open source and commercial, is now available to gain a first impression of a particular social media channel. For popular social media channels like Twitter, Facebook, Google, YouTube many tools or services can be found to provide an overview of that channel. The span of all social media sites is growing faster day by day and as a consequence of that a lot of data is produced either structured or unstructured. This data can provide deeper insights by using analytics. Analysing the twitter data can lead to focused marketing goals. According to internetlivestats.com, every day around 500 million tweets are tweeted on twitter. By analysing this tweets, the frequently appearing word can be derived, which can be provide useful insights for developing marketing strategy, finding trends of sale or inventory stock patterns. Here in this paper, 1.5 million+ tweets are analysed and each word frequency is calculated, which can further be used for many purposes. Analysing such a massive data on a RDBMS or a single system is not feasible. The big data analytics must be used for analysis of such datasets. Hadoop Eco-system is one of the solutions for big data analytics which can be used. II. HADOOP ECOSYSTEM[2] The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. [3] Hadoop Eco System is mainly consists of two sevices: A. Hadoop Distributed File system A reliable, distributed file system that is mainly used to store all the data of the hadoop cluster. The data is not necessarily well-structured as in case of databases, but data can be in the form of structured, un-structured or semi- structured format. The HDFS system works on a Master-Slave approach. It contains one Name node that acts as a master server. It manages the storage of data, allocation and de-allocation of disks and replication management of the data. There are numbers of Data nodes that acts as a slave. It manages the storage of file and handles read-write requests. It also handles block creation, deletion or update based on instruction from the Name node. Thus, with help of name node and data node, HDFS manages all issues like where to store the data physically? Or how many replicas should be generated? And how the replicas should be distributed across different nodes? B. Hadoop Mapreduce
  • 2. International Journal of Technical Innovation in Morden Engineering & Science (IJTIMES) Volume 2, Issue 4, April-2016, e-ISSN: 2455-2584, Impact Factor: 3.45 (SJIF-2015) IJTIMES-2015@All rights reserved 2 Hadoop Map-reduce is a software-framework that allows computation to be performed in parallel manner. Contradictory to an HDFS, Map-Reduce provides computation facilities on stored structured or unstructured data. The number of mappers and reducers varies based on application and data sizes. Hadoop also supports many applications such as Apache Pig, Apache Hive, Apache HBase, Apache ooozie and many more. TABLE 2 HADOOP ECOSYSTEM COMPONENTS[3] Hadoop Component Description Sqoop -Used as intermediate between SQL and Hadoop. - It let the user import tables or entire database into HDFS system HBASE -It is a columnar storage based on Google’s big table -It can handle massive data tables combining billions and billions of rows and columns Pig -It is high-level platform for creating Map-reduce program using Hadoop - It can be used to build much larger, much more complex applications Hive -It is a datawarehouse that that facilitates querying and managing large dataset residing in HDFS. - It uses query language called Hive QL Oozie - It is workflow schedule system that manages all the hadoop jobs. - It is integrated with the hadoop stack and supports several different hadoop jobs. Zookeeper - It provides distributed configuration service and synchronization service. - It provides operational services for hadoop cluster. Flume - It is distributed and reliable service for efficient collection, aggregation and moving of large amount of data. Mahout - It focuses on distributed or scalable machine learning. - It is currently in its growing phase. R Connectors - It is most widely used open source statistical tool. - It contains rich set of packages for statistical analysis and plotting. Each of these applications has its own working models and different applications. Based on requirement of a system, any of these applications can be used, either alone or together with other components. In this paper Apache Pig for data extraction and Map Reduce paradigm for processing the data is used. III. APACHE PIG OVERVIEW[4] Apache Pig is a part of Hadoop Ecosystem which provides a high level language called Pig Latin. Pig commands are translated into Map-Reduce jobs. Pig tool can easily work with structured as well as unstructured data. Pig is a complete component of Hadoop Ecosystem as all the manipulation on the data can be done using Pig only. Pig can efficiently work with many languages such as Ruby, Python or Java etc. It can also be used to translate one data formats into other formats. Pig can ingest data from files, streams or any other data formats. A good example of Pig Application is ETL transaction model which basically extracts, transforms and loads the data. Fig. 1: Apache Pig data format support IV. MAP REDUCE FRAMEWORK OVERVIEW[4] Map reduce Framework is a programming model that was developed for parallel processing applications under Hadoop. It splits the data into various chunks and then these chunks are processed in parallel. Map Reduce brings computation to the data. It basically involves two phases: A. Mapper Phase
  • 3. International Journal of Technical Innovation in Morden Engineering & Science (IJTIMES) Volume 2, Issue 4, April-2016, e-ISSN: 2455-2584, Impact Factor: 3.45 (SJIF-2015) IJTIMES-2015@All rights reserved 3 B. Reducer phase In Mapper phase, Map job takes a collections of data and converts it into another set of data, where individual elements are broken into <key,value> pairs. While in Reducer phase, reduce jobs take the output of the map phase that is <key,value> pairs and produce smaller set of <key,value> pairs as an output. MapReduce framework works with the help of two processes provided by the hadoop eco system. A Job Tracker that sends the mapreduce tasks to specific nodes in the cluster. And a Task Tracker that are deployes on each machine in the cluster. It is responsible for running the map and reduce tasks as instructed by the Job tracker. V. TWEETS ANALYSIS Social media like twitter, facebook etc produces a lot of structured as well as unstructured data on daily basis, which if analysed can be very useful. The system working is shown in figure 2. Fig. 2: Twitter word frequency Count System The Twitter database contains a huge amount of data that grows rapidly every day. From the whole database, a dataset is selected for analysis which is basically a part of a twitter database. The task is divided into two processes namely A. Extraction of tweets from twitter dataset The twitter dataset contains lot of information such as user names, tweet date and timing, tweet contents and many more.[5] But for finding a word frequency only the tweet contents are needed. Similarly we can also count tweets per users by extracting user names only. But here we are concerned with tweets and its words so only tweets contents are extracted from the dataset. Apache pig can be used for this transformation. It inputs the data from our dataset of CSV format and using Pig Latin we can extract tweet contents efficiently. This output data is in text format and contains only tweets content. This data can be stored in any other format also. B. Word frequency count from tweets After the extraction of data word frequency algorithm can directly be applied on that data. Map and reduce algorithm is used to obtain parallelism in this process. The algorithm for the same is presented in table 2 and table 3 as shown below. TABLE 3 WORD FREQUENCY MAPPER ALGORITHM Algorithm : Word Frequency Mapper Input: Text file Output: <word,count> pair 1. Divide the data into different map jobs 2. For each line of text 3. Count each word separated by whitespaces 4. Form <word,count> pair locally 5. Temporary store all the <word,count> pair 6. Forward list of the <word,count> pair to reducer task
  • 4. International Journal of Technical Innovation in Morden Engineering & Science (IJTIMES) Volume 2, Issue 4, April-2016, e-ISSN: 2455-2584, Impact Factor: 3.45 (SJIF-2015) IJTIMES-2015@All rights reserved 4 TABLE 4 WORD FREQUENCY REDUCER ALGORITHM Algorithm: Word Frequency Reducer Input : <word,count> pair Output: <word,total-count> pair 1. Prev_key = None 2. Current_key = word from first <word,count> pair 3. For each <word, count> pair repeat 4. If Prev_key == Current_key 5. total_count += count; 6. Else 7. total_count = count; 8. Prev_key = Current_key 9. Return all <word,total-count> pair VI. RESULTS In this system, a word frequency count is implemented which produces list of words appearing in tweets with their counts. The time analysis for execution of the program is shown in a figure 3. The system can efficiently perform in parallel and produces output file in comparatively small amount of time. With the variable size of data, system can take variable time. The execution time also depends on configuration of Hadoop Cluster. From the output data generated it is analysed that some regular words appears more frequently that others as shown below: TABLE 5 SOME FREQUENT WORDS IN OUTPUT Some Words Count ! 10666 But 13898 Hi 2770 I 341441 It 13244 My 14038 Thank 6262 You 3392 And many more... Fig.3: Performance Measure of system 0 2 4 6 8 10 60 thousands 1 miilion 1.4 million 1.6 million ExecutionTime(minutes) Size of datasets Performance Measure
  • 5. International Journal of Technical Innovation in Morden Engineering & Science (IJTIMES) Volume 2, Issue 4, April-2016, e-ISSN: 2455-2584, Impact Factor: 3.45 (SJIF-2015) IJTIMES-2015@All rights reserved 5 VII. CONCLUSION AND FUTURE WORK In this paper, a system for counting the frequency of words of 1.5+ million tweets in parallel is implemented. It is implemented using two components of the Hadoop Eco system namely Apache Pig and Map reduce framework. The parallel execution of the program makes it feasible and fast. The frequent word implementation can further be extended to analyse the dataset for marketing purpose or promotion strategy definitions. The system can be further extended to perform sentiment analysis or prediction of trends in markets. REFERENCES [1] White, Tom. Hadoop: The definitive guide. " O'Reilly Media, Inc.", 2012. [2] Chowdhury, Badrul, et al. A BigBench implementation in the Hadoop ecosystem. Advancing Big Data Benchmarks. Springer International Publishing, 2013. 3-18. [3] Olston, Christopher, et al. Pig latin: a not-so-foreign language for data processing. Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008. [4] M. Bhandarkar, MapReduce programming with apache Hadoop, Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, Atlanta, GA, 2010, pp. 1-1. [5] https://www.statista.com/chart/2289/how-marketers-use-social-media/ [6] http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip