Paper ijert

(Twiche) TWITTER TREND CACHING FOR BIG-DATA
APPLICATIONS USING THE MAPREDUCE FRAMEWORK
Santosh Wayal1, Yogesh More2, Prasad Wandhekar3, Utkarsh Honey4,
Prof. Jayshree Chaudhari5
B.E. Student1,2,3,4, Assistant Professor5
Department of Computer Engineering
Dr. D.Y. Patil School of Engineering Pune, India.
Abstract— Big data refers to large-scale distributed
data processing applications that work on exceptionally
large amounts of data, such as Twitter data. Google's
MapReduce and Apache Hadoop, its open-source
implementation, are the software systems for big-data
applications. An observation about the MapReduce framework
is that it generates a large amount of intermediate data
whenever it analyzes Twitter data fetched from the Twitter
server. MapReduce is unable to reuse such data, so it is
discarded after use. We propose Twiche, a caching framework
for big-data applications. In our system, tasks submit
their intermediate results to the cache manager, and each
task queries the cache manager before executing the actual
computing work. A novel cache description scheme and a cache
request and reply protocol are designed.
Key words: MapReduce, Twiche, Twitter, Big Data,
Hadoop, Flume
I. INTRODUCTION
Social media are web-based and mobile-based internet
applications that allow the creation, access and exchange
of user-generated content that is ubiquitously accessible.
Besides social networking media like Twitter and Facebook,
the term social media encompasses really simple
syndication (RSS) feeds, blogs, wikis and news, all typically
yielding unstructured text and accessible through the web.
Social media is especially important for research into
computational social science, which investigates questions
using quantitative techniques (for example, computational
statistics, machine learning and complexity) and so-called
big data for data mining and simulation modelling. The Apache
Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of
computers using simple programming models. Google
MapReduce is a programming model and a software
framework for large-scale distributed computing on large
amounts of data, such as the data we fetch from the Twitter
server. Application developers specify the computation in
terms of a map function and a reduce function, and the
underlying MapReduce task scheduling system automatically
parallelizes the computation across the cluster of machines.
MapReduce has gained popularity for its simple
programming interface and excellent performance when
implementing a large spectrum of applications. Since most such
applications take a huge amount of input data, they are
called "big-data applications". Input data is first divided
into splits and then given to workers in the map stage.
Individual data items are called records. The
MapReduce system parses the input split given to each worker
and produces records. After the map phase, the intermediate
results generated in the map phase are shuffled and sorted by
the MapReduce system and then fed to the workers in the
reduce phase. Final results are computed by multiple
reducers and then written back to disk.
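The map, shuffle and reduce stages described above can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, run_job) and the word-count task are assumptions chosen for illustration only.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle and sort: group intermediate values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a final result."""
    return key, sum(values)

def run_job(splits):
    """Run map over every record of every split, then shuffle, then reduce."""
    intermediate = []
    for split in splits:        # in a cluster, each split goes to one map worker
        for record in split:    # records are the individual items of a split
            intermediate.extend(map_phase(record))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Two hypothetical input splits, each holding a few records (lines of text).
splits = [["big data on twitter", "twitter trends"], ["big trends"]]
result = run_job(splits)
print(result)
```

The list named intermediate plays the role of the intermediate data that the framework produces between the map and reduce phases.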
Apache Flume is a distributed, reliable, and available
service for efficiently collecting, aggregating, and moving
large amounts of streaming data into the Hadoop Distributed
File System (HDFS).
Hadoop is an open-source implementation of the Google
MapReduce programming model. Hadoop includes
Hadoop Common, which provides access to the file systems
supported by Hadoop. HDFS (Hadoop Distributed File
System) provides distributed file storage and is optimized
for large immutable blocks of data. A small Hadoop
cluster includes a single master and multiple worker
nodes, known as slaves. The master node runs multiple
processes, including a JobTracker and a NameNode. The
JobTracker is responsible for managing the jobs running in
the Hadoop cluster, while the NameNode manages the HDFS.
The JobTracker and the NameNode are usually collocated on
the same physical machine. The other servers in the cluster
run TaskTracker and DataNode processes.
A MapReduce job is divided into a number of tasks. Tasks
are managed by the TaskTrackers. The TaskTrackers and the
DataNodes are collocated on the same servers to provide data
locality. MapReduce provides a standardized framework for
implementing large-scale distributed computation,
namely big-data applications. Still, the system has a
limitation: inefficiency in incremental processing.
Incremental processing refers to applications that
incrementally grow the input data and continuously
apply computations on this input to produce output.
II. OBJECTIVES AND SCOPES
The objectives and scope of this work are as follows:
1. Requires minimum change to the original
MapReduce programming model.
2. Application code only requires slight changes in
order to utilize Twiche.
3. Implement Twiche in Hadoop by extending the
relevant components.
4. Experiments show that it can eliminate all the
duplicate tasks in incremental MapReduce jobs.
5. Minimum execution time and CPU utilization.
III. PROBLEM STATEMENT
The current Hadoop MapReduce framework generates a large
flow of intermediate data. MapReduce is unable to save such
data, so it is deleted after use. In our system we
introduce a cache memory that holds the intermediate
results. Because of this, data processing, that is, job
execution, is faster than in the old system, and
time-consuming repetition of data processing is reduced. For
experimental purposes we access the Twitter server to fetch
tweeted comments using Flume and perform data analysis
operations on them.
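The idea of querying a cache before doing the actual computation can be sketched as follows. This is a minimal in-memory illustration, not the Twiche implementation: the CacheManager class, its request and submit methods, and the use of a SHA-256 digest of the input as the cache description are all assumptions made for the example.

```python
import hashlib

class CacheManager:
    """Hypothetical in-memory cache manager: maps a description to a result."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def describe(operation, split):
        """Cache description (assumed form): operation name plus input digest."""
        digest = hashlib.sha256("\n".join(split).encode("utf-8")).hexdigest()
        return (operation, digest)

    def request(self, description):
        """Reply with the cached result, or None on a cache miss."""
        return self._store.get(description)

    def submit(self, description, result):
        """Store an intermediate result so that later jobs can reuse it."""
        self._store[description] = result

def count_words(split):
    """The actual computing work of one map task."""
    counts = {}
    for record in split:
        for word in record.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def run_map_task(cache, split, stats):
    """Query the cache first; compute and submit only on a miss."""
    description = CacheManager.describe("count_words", split)
    cached = cache.request(description)
    if cached is not None:
        stats["hits"] += 1
        return cached
    stats["computed"] += 1
    result = count_words(split)
    cache.submit(description, result)
    return result

cache = CacheManager()
stats = {"hits": 0, "computed": 0}
old_split = ["twitter trend", "big data"]
new_split = ["twitter cache"]

run_map_task(cache, old_split, stats)   # first job processes the old split
# A second, incremental job sees the old split again plus one new split.
second = [run_map_task(cache, s, stats) for s in (old_split, new_split)]
print(stats)
```

In the second job only the new split is computed; the repeated split is answered from the cache, which is the duplicate work the proposed system aims to avoid.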
III. PROPOSED SYSTEM
The above figure shows our proposed system. In this
proposed work, Apache Pig, Apache HDFS, Apache Oozie,
and Apache Hive can be used to design a direct data
pipeline that enables analysis of Twitter data. In order
to find out who is prominent in social media, one should
know the mechanism of Twitter, which works on tweets
and retweets. A retweet is a repost of an update, similar
to forwarding an email. Querying Twitter data in a
traditional RDBMS is inefficient. There are many Twitter
APIs that provide streaming of Twitter data. The
proposed work consists of the following steps.
A. Developing Twitter API
Develop a Twitter API on the Twitter side. The Twitter
API communicates directly with the source and sink
mechanism via a network-based application.
Authentication keys and tokens are established to
enable communication with the Twitter server.
B. Establishing the Connection via Source and Sink
Mechanism
After creation of the Twitter API, design the source and
sink mechanism that enables speedy downloading of data
from the Twitter server to HDFS (Hadoop
Distributed File System).
The source agent communicates with the Twitter API
and channels the data.
The streaming data is in JSON form, i.e. an event form of
data. The data is queued and channeled via the channel
mechanism.
Finally, the data is sunk into HDFS. Then the tweets
are analyzed using Pig or Hive.
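The source, channel and sink flow described above can be pictured with a small simulation. A real Flume agent is set up through a properties configuration file rather than Python code, so the following is only a sketch of the data flow: the functions source and sink, the in-process queue used as the channel, the list standing in for HDFS, and the sample tweets are all hypothetical.

```python
import json
import queue

def source(raw_events, channel):
    """Source: wrap each incoming tweet as a JSON event and queue it."""
    for tweet in raw_events:
        channel.put(json.dumps(tweet))

def sink(channel, storage):
    """Sink: drain the channel and append each event to storage.

    The storage list is a stand-in for files written to HDFS.
    """
    while True:
        try:
            event = channel.get_nowait()
        except queue.Empty:
            break
        storage.append(event)

channel = queue.Queue()   # the channel buffers events between source and sink
storage = []              # stand-in for HDFS
tweets = [
    {"id": 1, "text": "hello hadoop"},
    {"id": 2, "text": "flume to hdfs"},
]

source(tweets, channel)
sink(channel, storage)
print(len(storage), "events stored")
```

The channel decouples the two ends: the source can keep queuing events while the sink writes them out at its own pace.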
C. Analyzing the Data on HDFS, i.e. Tweets, using Pig or
Hive
The data now stored on the data nodes is analyzed using Pig
or Hive. Suppose we want to perform Twitter trend
analysis; then we just have to fire a count query
that counts the occurrences of any given
keyword. We have performed log analysis, fraud
detection and clickstream analysis by using this system.
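The count query mentioned above would in practice be written in Pig or Hive. As a language-neutral illustration of what such a query computes, the following sketch counts keyword occurrences over stored tweets. The function keyword_counts, the sample tweets and the assumption that each stored line is a JSON object with a "text" field are illustrative choices, not part of the described system.

```python
import json
from collections import Counter

def keyword_counts(json_lines, keywords):
    """Count how often each keyword occurs in the text of stored tweets."""
    wanted = {k.lower() for k in keywords}
    counts = Counter()
    for line in json_lines:
        text = json.loads(line).get("text", "")
        for word in text.lower().split():
            word = word.strip("#.,!?")   # drop hashtag marks and punctuation
            if word in wanted:
                counts[word] += 1
    return counts

# Hypothetical tweets as they might be stored, one JSON event per line.
stored = [
    '{"text": "Hadoop makes big data easy"}',
    '{"text": "Loving #hadoop and Flume!"}',
    '{"text": "No match here"}',
]

counts = keyword_counts(stored, ["hadoop", "flume"])
print(counts.most_common())
```

Sorting the counts in descending order, as most_common does, gives a simple ranking of trending keywords.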
IV. CONCLUSION
We present the design and evaluation of a Twitter
data cache framework that requires minimum change to the
original MapReduce programming model for provisioning
incremental processing for big data from Twitter using the
MapReduce model with Flume. We propose Twiche, a
Twitter trend cache description scheme, protocol, and
architecture. Our method requires only a slight modification
in the input format processing and task management of the
MapReduce framework. Hence, application code only requires
slight changes in order to utilize Twiche. We implement
Twiche in Hadoop by extending relevant components.
Experiments show that it can eliminate all the
duplicate tasks in incremental MapReduce jobs and does not
require substantial changes to the application code.
V. ACKNOWLEDGEMENT
The authors would like to thank the management and Prof.
Jayashree Chaudhari, Dr. D.Y. Patil School of Engineering,
for providing technical guidance, constant
encouragement and suggestions throughout the project.
