International Journal of Technical Innovation in Modern Engineering & Science (IJTIMES)
Impact Factor: 3.45 (SJIF-2015), e-ISSN: 2455-2584
Volume 2, Issue 4, April-2016
IJTIMES-2015@All rights reserved
Twitter Word Frequency Count using Hadoop Components
1 Ms. Pooja S. Desai, 2 Mrs. Isha Agarwal
1 M.Tech., Computer Engineering Department, CGPIT, Uka Tarsadia University
2 Asst. Professor, Computer Engineering Department, CGPIT, Uka Tarsadia University
Abstract- Analysis of Twitter data can be very useful for marketing as well as for promotion strategies. This paper implements a word frequency count for a dataset of more than 1.5 million tweets. Hadoop components such as Apache Pig and the MapReduce framework are used for parallel execution over the data. This parallel execution makes the implementation time-effective and feasible to execute on a single system.
Keywords – Twitter data, Hadoop, Apache Pig, MapReduce, Big Data Analytics
I. INTRODUCTION
With the evolution of computing technologies and the data science field, it is now possible to manage immensely large data and perform computations on it that previously could be handled only by supercomputers or large dedicated clusters. Today social media is a critical part of the web, and social media platforms have a widespread reach to users all over the globe. According to statista.com, marketers use the various social media profiles as follows:
TABLE 1
SOCIAL MEDIA PROFILES FOR MARKETING[5]

      Facebook  Twitter  LinkedIn  Google+  Pinterest
B2C   97%       81%      59%       51%      51%
B2B   89%       86%      88%       59%      41%
Social media sites and a strong web presence are crucial from a marketing perspective in all forms of business. Interest in social media from all walks of life has been increasing exponentially, from both an application and a research perspective. A wide number of tools, open source and commercial, are now available to gain a first impression of a particular social media channel. For popular channels like Twitter, Facebook, Google and YouTube, many tools or services can be found that provide an overview of the channel. The span of all social media sites is growing faster day by day, and as a consequence a lot of data is produced, either structured or unstructured. This data can provide deeper insights through analytics, and analysing Twitter data in particular can lead to focused marketing goals.
According to internetlivestats.com, around 500 million tweets are tweeted on Twitter every day. By analysing these tweets, the frequently appearing words can be derived, which can provide useful insights for developing a marketing strategy and for finding sales trends or inventory stock patterns. In this paper, more than 1.5 million tweets are analysed and the frequency of each word is calculated, which can further be used for many purposes.
Analysing such massive data on an RDBMS or a single system is not feasible; big data analytics must be used for the analysis of such datasets. The Hadoop ecosystem is one of the solutions for big data analytics that can be used.
II. HADOOP ECOSYSTEM[2]
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across
clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on
top of a cluster of computers, each of which may be prone to failures. [3]
The Hadoop ecosystem mainly consists of two services:
A. Hadoop Distributed File System
A reliable, distributed file system that is mainly used to store all the data of the Hadoop cluster. The data is not necessarily well-structured as in the case of databases; it can be in structured, unstructured or semi-structured format. HDFS works on a master-slave approach. It contains one Name Node that acts as the master server: it manages the storage of data, the allocation and de-allocation of disks, and the replication management of the data. There are a number of Data Nodes that act as slaves: each manages the storage of files and handles read-write requests, and also handles block creation, deletion or update based on instructions from the Name Node. Thus, with the help of the Name Node and Data Nodes, HDFS manages all storage issues: where to store the data physically, how many replicas should be generated, and how the replicas should be distributed across different nodes.
B. Hadoop MapReduce
Hadoop MapReduce is a software framework that allows computation to be performed in a parallel manner. Where HDFS provides storage, MapReduce provides computation facilities on the stored structured or unstructured data. The number of mappers and reducers varies based on the application and the data sizes.
Hadoop also supports many applications such as Apache Pig, Apache Hive, Apache HBase, Apache Oozie and many more.
TABLE 2
HADOOP ECOSYSTEM COMPONENTS[3]

Hadoop Component  Description
Sqoop         - Used as an intermediary between SQL databases and Hadoop.
              - It lets the user import tables or an entire database into HDFS.
HBase         - A columnar store based on Google's Bigtable.
              - It can handle massive tables comprising billions of rows and columns.
Pig           - A high-level platform for creating MapReduce programs on Hadoop.
              - It can be used to build much larger, much more complex applications.
Hive          - A data warehouse that facilitates querying and managing large datasets residing in HDFS.
              - It uses a query language called HiveQL.
Oozie         - A workflow scheduling system that manages Hadoop jobs.
              - It is integrated with the Hadoop stack and supports several different Hadoop job types.
ZooKeeper     - Provides a distributed configuration service and a synchronization service.
              - It provides operational services for a Hadoop cluster.
Flume         - A distributed and reliable service for the efficient collection, aggregation and movement of large amounts of data.
Mahout        - Focuses on distributed and scalable machine learning.
              - It is currently in its growing phase.
R Connectors  - R is the most widely used open source statistical tool.
              - It contains a rich set of packages for statistical analysis and plotting.
Each of these components has its own working model and its own applications. Based on the requirements of a system, any of them can be used, either alone or together with other components.
In this paper, Apache Pig is used for data extraction and the MapReduce paradigm for processing the data.
III. APACHE PIG OVERVIEW[4]
Apache Pig is a part of the Hadoop ecosystem that provides a high-level language called Pig Latin. Pig commands are translated into MapReduce jobs. Pig can easily work with structured as well as unstructured data, and it is a complete component of the Hadoop ecosystem in the sense that all manipulation of the data can be done using Pig alone. Pig works efficiently with many languages such as Ruby, Python or Java, and it can also be used to translate one data format into another. Pig can ingest data from files, streams or other data sources. A good example of a Pig application is the ETL transaction model, which extracts, transforms and loads the data.
Fig. 1: Apache Pig data format support
IV. MAP REDUCE FRAMEWORK OVERVIEW[4]
The MapReduce framework is a programming model that was developed for parallel processing applications under Hadoop. It splits the data into chunks, and these chunks are then processed in parallel. MapReduce brings the computation to the data. It basically involves two phases:
A. Mapper Phase
B. Reducer Phase
In the mapper phase, a map job takes a collection of data and converts it into another set of data, where individual elements are broken into <key,value> pairs. In the reducer phase, reduce jobs take the output of the map phase, i.e. <key,value> pairs, and produce a smaller set of <key,value> pairs as output.
The MapReduce framework works with the help of two processes provided by the Hadoop ecosystem: a Job Tracker, which sends the MapReduce tasks to specific nodes in the cluster, and Task Trackers, which are deployed on each machine in the cluster and are responsible for running the map and reduce tasks as instructed by the Job Tracker.
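The two phases described above can be sketched in miniature outside Hadoop. The following is a minimal Python sketch, assuming whitespace tokenization; the names map_phase and reduce_phase are illustrative and not part of any Hadoop API, and the sorted() call stands in for Hadoop's shuffle-and-sort step:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Mapper: break each line into <key, value> pairs, one per word
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Sorting groups equal keys together (Hadoop's shuffle/sort);
    # the reducer then sums the counts for each key
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

tweets = ["big data is big", "hadoop handles big data"]
print(dict(reduce_phase(map_phase(tweets))))
# {'big': 3, 'data': 2, 'hadoop': 1, 'handles': 1, 'is': 1}
```

On a real cluster, many map_phase instances would run in parallel over different chunks, and the framework, not sorted(), would route each key to a reducer.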
V. TWEETS ANALYSIS
Social media like Twitter and Facebook produce a lot of structured as well as unstructured data on a daily basis which, if analysed, can be very useful. The working of the system is shown in figure 2.
Fig. 2: Twitter word frequency count system
The Twitter database contains a huge amount of data that grows rapidly every day. From the whole database, a
dataset is selected for analysis which is basically a part of a twitter database. The task is divided into two processes
namely
A. Extraction of tweets from twitter dataset
The Twitter dataset contains a lot of information, such as user names, tweet dates and times, tweet contents, and
more [5]. For finding word frequency, only the tweet contents are needed. Similarly, tweets per user could be counted
by extracting user names only. Since we are concerned here with tweets and their words, only the tweet contents are
extracted from the dataset.
Apache Pig is used for this transformation. It reads the dataset in CSV format, and using Pig
Latin the tweet contents are extracted efficiently. The output is in text format and contains only the tweet contents;
this data can also be stored in other formats.
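The paper performs this step with Pig Latin; for illustration only, the same column extraction can be sketched in Python. The column index is an assumption about the dataset layout (here, hypothetically, the tweet text is the last of six CSV fields), not something the paper specifies:

```python
import csv
import io

def extract_tweets(csv_text, text_column=5):
    # Keep only the tweet-content column from each CSV record.
    # text_column=5 is an assumed position of the tweet text field.
    reader = csv.reader(io.StringIO(csv_text))
    return [row[text_column] for row in reader if len(row) > text_column]

sample = '0,"1468","Mon Apr 06","NO_QUERY","user1","hello world"\n'
tweets = extract_tweets(sample)
```

Pig performs the equivalent projection with a LOAD followed by a FOREACH ... GENERATE over the tweet-content field, distributing the work across the cluster.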
B. Word frequency count from tweets
After the extraction, the word frequency algorithm can be applied directly to the data. The map and reduce
algorithms are used to obtain parallelism in this process. The algorithms are presented in Table 3 and Table 4 as shown
below.
TABLE 3
WORD FREQUENCY MAPPER ALGORITHM
Algorithm : Word Frequency Mapper
Input: text file
Output: <word,count> pairs
1. Divide the data into different map jobs
2. For each line of text
3. Count each word separated by whitespace
4. Form <word,count> pairs locally
5. Temporarily store all the <word,count> pairs
6. Forward the list of <word,count> pairs to the reducer task
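The mapper steps above can be sketched as a runnable Python function. This simulates a single map task on its input split (local counting as in steps 2-5, then forwarding the pair list as in step 6); it is an illustrative sketch, not Hadoop mapper code:

```python
from collections import Counter

def word_frequency_mapper(split_lines):
    # Steps 2-5: count each whitespace-separated word locally,
    # forming <word, count> pairs for this map task's split.
    local = Counter()
    for line in split_lines:
        local.update(line.split())
    # Step 6: forward the list of <word, count> pairs to the reducer.
    return sorted(local.items())

pairs = word_frequency_mapper(["hi hi you", "you hi"])
```

Counting locally before forwarding (a combiner-style optimization) reduces the volume of pairs that must be shuffled to the reducers.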
TABLE 4
WORD FREQUENCY REDUCER ALGORITHM
Algorithm: Word Frequency Reducer
Input: sorted <word,count> pairs
Output: <word,total-count> pairs
1. Prev_key = None
2. For each <word,count> pair repeat
3. If Prev_key == word
4. total_count += count
5. Else
6. If Prev_key != None, output <Prev_key,total_count>
7. total_count = count
8. Prev_key = word
9. Output the final <Prev_key,total_count> pair
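The reducer above relies on its input pairs arriving sorted by word, as Hadoop's shuffle phase guarantees: each word's total can then be emitted as soon as the key changes. A runnable sketch of that logic in Python (illustrative, not Hadoop reducer code):

```python
def word_frequency_reducer(pairs):
    # Input pairs must be sorted by word, as after the shuffle phase.
    totals = []
    prev_key, total_count = None, 0
    for word, count in pairs:
        if word == prev_key:
            total_count += count                  # same key: accumulate
        else:
            if prev_key is not None:
                totals.append((prev_key, total_count))
            prev_key, total_count = word, count   # new key: reset
    if prev_key is not None:
        totals.append((prev_key, total_count))    # emit the last key
    return totals

result = word_frequency_reducer([("be", 2), ("be", 1), ("or", 1)])
```

Because keys arrive grouped, the reducer needs only constant state per key rather than a table of all words.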
VI. RESULTS
In this system, a word frequency count is implemented which produces a list of the words appearing in
tweets together with their counts. The time analysis for the execution of the program is shown in figure 3. The system
performs efficiently in parallel and produces the output file in a comparatively small amount of time. The execution
time varies with the size of the data and also depends on the configuration of the Hadoop cluster.
From the generated output it can be seen that some common words appear far more frequently than
others, as shown below:
TABLE 5
SOME FREQUENT WORDS IN OUTPUT
Word      Count
!         10666
But       13898
Hi         2770
I        341441
It        13244
My        14038
Thank      6262
You        3392
And many more...
Fig. 3: Performance measure of the system: execution time (minutes, 0-10) against dataset size (60 thousand to 1.6 million tweets)
VII. CONCLUSION AND FUTURE WORK
In this paper, a system for counting the frequency of words in 1.5+ million tweets in parallel is
implemented using two components of the Hadoop ecosystem, namely Apache Pig and the MapReduce
framework. The parallel execution of the program makes it fast and feasible to run on a single system.
The word frequency implementation can be further extended to analyse the dataset for marketing
purposes or for defining promotion strategies. The system can also be extended to perform sentiment analysis
or prediction of market trends.
REFERENCES
[1] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
[2] B. Chowdhury et al., "A BigBench implementation in the Hadoop ecosystem," in Advancing Big Data
Benchmarks. Springer International Publishing, 2013, pp. 3-18.
[3] C. Olston et al., "Pig Latin: a not-so-foreign language for data processing," in Proceedings of the
2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008.
[4] M. Bhandarkar, "MapReduce programming with Apache Hadoop," in Parallel & Distributed Processing
(IPDPS), 2010 IEEE International Symposium on, Atlanta, GA, 2010, pp. 1-1.
[5] https://www.statista.com/chart/2289/how-marketers-use-social-media/
[6] http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip