(Twiche) TWITTER TREND CACHING FOR BIG-DATA
APPLICATIONS USING THE MAPREDUCE FRAMEWORK
Santosh Wayal1, Yogesh More2, Prasad Wandhekar3, Utkarsh Honey4,
Prof. Jayshree Chaudhari5
B.E. Student1,2,3,4, Assistant Professor5
Department of Computer Engineering
Dr. D.Y. Patil School of Engineering, Pune, India.
Abstract: Big data refers to large-scale distributed data
processing applications that work on exceptionally large
amounts of data, such as Twitter data. Google's MapReduce
and Apache Hadoop, its open-source implementation, are the
software systems for big-data applications. An observation
of the MapReduce framework is that it generates a large
amount of intermediate data whenever it analyzes Twitter
data from the Twitter server. MapReduce is unable to reuse
such data, so it is discarded after use. We propose Twiche,
a caching framework for big-data applications. In our
system, tasks submit their intermediate results to the cache
manager and query the cache manager before executing the
actual computing work. A novel cache description scheme and
a cache request and reply protocol are designed.
Key words: MapReduce, Twiche, Twitter, Big Data, Hadoop,
Flume
I. INTRODUCTION
Social media comprises web-based and mobile-based internet
applications that allow the creation, access and exchange
of user-generated content that is ubiquitously accessible.
Besides social networking media like Twitter and Facebook,
the term social media encompasses really simple
syndication (RSS) feeds, blogs, wikis and news, all typically
yielding unstructured text and accessible through the web.
Social media is especially important for research into
computational social science, which investigates questions
using quantitative techniques (for example, computational
statistics, machine learning and complexity) and so-called
big data for data mining and simulation modelling. The Apache
Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of
computers using simple programming models. Google
MapReduce is a programming model and a software framework
for large-scale distributed computing on large amounts of
data, such as the data we fetch from the Twitter server.
Application developers specify the computation in terms of
a map function and a reduce function, and the underlying
MapReduce task scheduling system automatically
parallelizes the computation across the cluster of machines.
MapReduce has gained popularity for its simple programming
interface and excellent performance when implementing a
large spectrum of applications. Since most such
applications take a huge amount of input data, they are
called "big-data applications". Input data is first divided
into splits and then given to workers in the map stage.
Individual data items are called records. The MapReduce
system parses the input split given to each worker and
produces records. After the map phase, the intermediate
results it generated are shuffled and sorted by the
MapReduce system and then fed to the workers in the reduce
phase. Final results are computed by multiple reducers and
then written back to disk.
Apache Flume is a distributed, reliable, and available
service for efficiently collecting, aggregating, and moving
large amounts of streaming data into the Hadoop Distributed
File System (HDFS).
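The map, shuffle/sort and reduce phases described above can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (map_phase, shuffle_and_sort, reduce_phase, run_job) and the sample splits are assumptions made for illustration, and word counting stands in for whatever computation an application would supply.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_and_sort(pairs):
    """Shuffle/sort: group intermediate values by key, in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a final result."""
    return key, sum(values)


def run_job(splits):
    # In a cluster each split would go to a separate map worker;
    # in this sketch the splits are simply processed one after another.
    intermediate = chain.from_iterable(
        map_phase(record) for split in splits for record in split
    )
    return dict(reduce_phase(k, v) for k, v in shuffle_and_sort(intermediate))


# Two hypothetical input splits, each holding a few records.
splits = [["big data on hadoop", "hadoop and flume"], ["big data cache"]]
counts = run_job(splits)
```

The list of (word, 1) pairs produced by map_phase is the intermediate data that the framework normally discards once the job finishes.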
Hadoop is an open-source implementation of the Google
MapReduce programming model. Hadoop includes Hadoop
Common, which provides access to the file systems
supported by Hadoop. HDFS (Hadoop Distributed File
System) provides distributed file storage and is optimized
for large immutable blocks of data. A small Hadoop
cluster includes a single master and multiple worker
nodes, known as slaves. The master node runs multiple
processes, including a JobTracker and a NameNode. The
JobTracker is responsible for managing running jobs in the
Hadoop cluster, while the NameNode manages HDFS. The
JobTracker and the NameNode are usually collocated on the
same physical machine. The other servers in the cluster run
TaskTracker and DataNode processes.
A MapReduce job is divided into a number of tasks. Tasks
are managed by the TaskTracker. The TaskTrackers and the
DataNodes are collocated on the same servers to provide data
locality. MapReduce provides a standardized framework for
implementing large-scale distributed computation, namely
big-data applications. Still, the system has a limitation:
inefficiency in incremental processing. Incremental
processing refers to applications that incrementally grow
the input data and continuously apply computations to this
input to produce output.
II. OBJECTIVES AND SCOPE
The scope of the work can be extended to the following:
1. Require minimum change to the original
MapReduce programming model.
2. Require only slight changes to application code in
order to utilize Twiche.
3. Implement Twiche in Hadoop by extending the
relevant components.
4. Show through experiments that it can eliminate
all the duplicate tasks in incremental
MapReduce jobs.
5. Minimize execution time and CPU utilization.
III. PROBLEM STATEMENT
The current Hadoop MapReduce framework generates a large
flow of intermediate data. MapReduce is unable to save such
data, so it is deleted after use. In our system we introduce
a cache that holds the intermediate results. As a result,
data processing (that is, job execution) is faster than in
the old system, and time-consuming repetition of data
processing is reduced. For experimental purposes, we access
the Twitter server through Flume to collect tweets for data
analysis.
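The idea of querying a cache before computing, and submitting intermediate results afterwards, can be sketched as follows. This is a minimal illustration under stated assumptions, not the Twiche implementation: the CacheManager class, its describe/request/submit methods, the hit and miss counters, and the choice of a SHA-256 digest of the input split as the cache description are all hypothetical.

```python
import hashlib


class CacheManager:
    """Hypothetical in-memory cache of intermediate map results."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def describe(split, operation):
        """Cache description: operation name plus a digest of the input split."""
        digest = hashlib.sha256("\n".join(split).encode("utf-8")).hexdigest()
        return (operation, digest)

    def request(self, key):
        """Return the cached result for key, or None on a miss."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def submit(self, key, result):
        """Store an intermediate result so later jobs can reuse it."""
        self._store[key] = result


def count_words(split):
    """The actual computing work: count the words in one split."""
    counts = {}
    for line in split:
        for word in line.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def run_map_task(cache, split):
    """Query the cache first; compute and submit only on a miss."""
    key = cache.describe(split, "word_count")
    cached = cache.request(key)
    if cached is not None:
        return cached
    result = count_words(split)
    cache.submit(key, result)
    return result


cache = CacheManager()
old_split = ["hadoop cache", "twitter trend"]
new_split = ["twitter cache"]
run_map_task(cache, old_split)  # first job: miss, result is computed
run_map_task(cache, old_split)  # incremental job: hit, result is reused
run_map_task(cache, new_split)  # only the newly added split is computed
```

In an incremental job, only the splits that were added since the previous run miss the cache, which is where the saving in repeated processing would come from.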
III. PROPOSED SYSTEM
The above Fig. shows our proposed system. In this
proposed work, Apache Pig, Apache HDFS, Apache Oozie,
and Apache Hive can be used to design a direct data
pipeline that enables analysis of Twitter data. To find
out who is prominent in social media, one should
understand the mechanism of Twitter, which works on tweets
and retweets. A retweet is a repost of an update, similar
to forwarding an email. Querying Twitter data in a
traditional RDBMS is inefficient. Many Twitter APIs
provide streaming of Twitter data. The proposed work
consists of the following steps:
A. Developing Twitter API
Develop a Twitter API application on the Twitter side. The
Twitter API communicates directly with the source and sink
mechanism via a network-based application.
Authentication keys and tokens are established to enable
communication with the Twitter server.
B. Establishing the Connection via Source and Sink
Mechanism
After creating the Twitter API application, design the
source and sink mechanism that speeds up downloading data
from the Twitter server to HDFS (Hadoop Distributed File
System).
The source agent communicates with the Twitter API
and channels the data.
The streaming data is in JSON form, i.e. the event form of
data. The data is queued and passed through the channel
mechanism.
Finally, the data is sunk into HDFS. Then the tweets
are analyzed using Pig or Hive.
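The source, channel and sink flow described above can be sketched in a few lines. This is a simulation of the data flow only, not Flume itself: the source and sink functions are hypothetical, a standard-library queue plays the role of the channel, an in-memory text buffer stands in for an HDFS file, and the two sample tweets with "id" and "text" fields are made up for illustration.

```python
import io
import json
import queue


def source(tweets, channel):
    """Source: wrap each incoming tweet as a JSON event and queue it."""
    for tweet in tweets:
        channel.put(json.dumps(tweet))


def sink(channel, out):
    """Sink: drain the channel and write one JSON event per line."""
    written = 0
    while True:
        try:
            event = channel.get_nowait()
        except queue.Empty:
            break
        out.write(event + "\n")
        written += 1
    return written


channel = queue.Queue()    # stands in for the channel mechanism
hdfs_file = io.StringIO()  # stand-in for a file on HDFS (assumption)
tweets = [
    {"id": 1, "text": "Hadoop at scale"},
    {"id": 2, "text": "Flume to HDFS"},
]
source(tweets, channel)
written = sink(channel, hdfs_file)
```

Decoupling the source from the sink through a queue lets the two sides run at different speeds, which is the role the channel plays in the description above.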
C. Analyzing the Data on HDFS, i.e. Tweets, Using Pig or
Hive
The data now stored on the data nodes is analyzed using Pig
or Hive. Suppose we want to perform Twitter trend
analysis; then we just have to run a count query
that counts the occurrences of any
keyword. We have performed log analysis, fraud
detection and click-stream analysis using this system.
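The logic of such a count query can be sketched outside Pig or Hive. The example below is an illustration under assumptions, not the query used in the experiments: the count_keyword function is hypothetical, tweets are assumed to be stored one JSON object per line, and the tweet body is assumed to live in a "text" field.

```python
import json
import re


def count_keyword(json_lines, keyword):
    """Count case-insensitive whole-word occurrences of keyword in tweet text."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    total = 0
    for line in json_lines:
        tweet = json.loads(line)
        # Tweets without a "text" field contribute nothing to the count.
        total += len(pattern.findall(tweet.get("text", "")))
    return total


# Hypothetical stored tweets, one JSON event per line.
stored = [
    '{"id": 1, "text": "Hadoop makes big data easy"}',
    '{"id": 2, "text": "I love hadoop, hadoop loves me"}',
    '{"id": 3, "text": "Flume streams tweets"}',
]
hadoop_mentions = count_keyword(stored, "hadoop")
```

Running the same count for several keywords and ranking the totals would give a simple form of trend analysis.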
IV. CONCLUSION
We present the design and evaluation of a Twitter
data cache framework that requires minimum change to the
original MapReduce programming model for provisioning
incremental processing of Twitter big data using the
MapReduce model with Flume. We propose Twiche, a
Twitter Trend cache description scheme, protocol, and
architecture. Our method requires only a slight modification
in the input format processing and task management of the
MapReduce framework. Hence, application code requires only
slight changes in order to utilize Twiche. We implement
Twiche in Hadoop by extending relevant components.
Experiments show that it can eliminate all the
duplicate tasks in incremental MapReduce jobs and does not
require substantial changes to the application code.
V. ACKNOWLEDGEMENT
The authors would like to thank the management and Prof.
Jayashree Chaudhari, Dr. D.Y. Patil School of Engineering,
for providing technical guidance, constant
encouragement and suggestions throughout the project.
REFERENCES
[1]Xiuzhen Zhang, Lishan Cui, and Yan Wang, "Computing
Multi-Dimensional Trust by Mining E-Commerce Feedback
Comments " IEEE TRANSACTIONS ON KNOWLEDGE
AND DATA ENGINEERING VOL:26 NO:7 YEAR 2014
[2] P. Resnick and R. Zeckhauser “Reputation Systems:
Facilitating Trust in Internet Interactions”, Communications
of the ACM, vol. 43, pp. 45–48, 2000.
[3] P. Resnick and R. Zeckhauser, “Trust among strangers
in internet transactions: Empirical analysis of eBay’s
reputation system,” The Economics of the Internet and E-
commerce, 2002.
[4] J. O’Donovan and B. Smyth “Extracting and visualizing
trust relationships from online auction feedback comments”
in Proc. IJCAI, 2007
[6] H. Zhang, Y. Wang, and X. Zhang, “Efficient contextual
transaction trust computation in e-commerce environments,”
in Proc. 11th IEEE TrustCom, Liverpool, UK, 2012.
[7] S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina,
“The EigenTrust algorithm for reputation management in
P2P networks,” in Proc. 12th Int. Conf. WWW, Budapest,
Hungary, 2003.
G. Qiu, B. Liu, J. Bu, and C. Chen, “Opinion word
expansion and arget extraction through double propagation,"
Comput. Linguist, vol. 37, no. 1, pp. 927, 2011
[8] S. Reece and N. Jennings “Rumors’ and reputation:
Evaluating multi-dimensional trust within a decentralized
reputation system" in Proc. 6th AAMAS, Honolulu ,USA,
2007, pp. 165172.
[9] G. Qiu and B. Liu “Opinion word expansion and target
extraction through double propagation," Comput. Linguist,
vol. 37, no. 1, pp. 927, 2011.
[10] P. Thomas and D. Hawking, “Evaluation by comparing
result sets in context," in Proc. 15th ACM CIKM, Arlington,
VA, USA, 2006, pp. 94101. Department
[11] M. De Marne_e and C. Manning, “The Stanford typed
dependencies representation," in Proc. Cross Parser,
Stroudsburg,PA, USA, 2008.
[12] X. Wang, L. Liu, and J. Su, “RLM: A general model for
trust representation and aggregation," IEEE Trans. Serv.
Comput., vol. 5,
no. 1, pp. 131143, Jan-Mar, 2012.

  • 1. (Twiche) TWITTER TREND CACHING FOR BIG-DATA APPLICATIONS USING THE MAPREDUCE FRAMEWORK Santosh Wayal1 ,Yogesh More2 ,Prasad Wandhekar3 ,Utkarsh Honey4 , Prof. Jayshree Chaudhari5 B.E. Student1,2,3,4, Assistant Professor5 Department of Computer Engineering Dr. D.Y. Patil School of Engineering Pune, India. Abstract— The big-data refers to the large-scale distributed data processing applications which work on exceptionally large amounts of data like twitter data. Google’s MapReduce and Apache’s Hadoop, its open-source implementation, are the software systems for big-data applications. An observation of the MapReduce framework is that the framework generates a large amount of intermediate data whenever it analysis the twitter data from twitter server. MapReduce is unable to utilize such data so they are thrown after used. We propose a Twiche framework used for big-data applications. In our system, tasks submit their intermediate results to the cache manager and queries the cache manager before executing the actual computing work. A novel cache description system and a cache request and reply protocol are designed. Key words: MapReduce, Twiche, Twitter, Big Data, Hadoop,Flume I. INTRODUCTION Social media is a web-based and mobile-based internet application that will allow the creation, access and exchange of user-generated content that is ubiquitously accessible. Besides social networking media like twitter and face- book, the term social media to encompass really simple syndication (RSS) feeds, blogs, wikis and news, all typically yielding unstructured text and accessible through the web. Social media is especially important for research into computational social science that investigates questions using quantitative techniques for example, computational statistics, machine learning and complexity and so-called big data for data mining and simulation modelling. 
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Google MapReduce is a programming model and a software framework for large-scale distributed computing on large amounts of data, such as the data we fetch from the Twitter server. Application developers specify the computation in terms of a map function and a reduce function, and the underlying MapReduce job scheduling system automatically parallelizes the computation across a cluster of machines. MapReduce has gained popularity for its simple programming interface and excellent performance when implementing a large spectrum of applications. Since most such applications take a huge amount of input data, they are called “big-data applications”.
Input data is first split and then fed to workers in the map phase. Individual data items are called records. The MapReduce system parses the input splits for each worker and produces records. After the map phase, the intermediate results generated in the map phase are shuffled and sorted by the MapReduce system and are then fed to the workers in the reduce phase. Final results are computed by multiple reducers and then written back to disk.
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). Hadoop is an open-source implementation of the Google MapReduce programming model. Hadoop includes Hadoop Common, which provides access to the file systems supported by Hadoop. HDFS provides distributed file storage and is optimized for large immutable blocks of data. A small Hadoop cluster includes a single master and multiple worker nodes, known as slaves. The master node runs multiple processes, including a JobTracker and a NameNode.
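The map, shuffle/sort and reduce phases described in this introduction can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen only for illustration, and everything runs in one Python process rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_and_sort(pairs):
    """Shuffle/sort: group intermediate values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: combine all values that share a key into one result."""
    return key, sum(values)


def run_job(splits):
    """Map every record of every split, then shuffle and reduce."""
    intermediate = chain.from_iterable(
        map_phase(record) for split in splits for record in split
    )
    return dict(reduce_phase(k, v) for k, v in shuffle_and_sort(intermediate))


# Two input splits, each holding one or more records (hypothetical data).
splits = [["big data on hadoop"], ["hadoop runs mapreduce", "big clusters"]]
result = run_job(splits)
print(result)
```

The list produced by the map calls plays the role of the intermediate data that plain MapReduce discards once the job finishes.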
The JobTracker is responsible for managing running jobs in the Hadoop cluster, while the NameNode manages HDFS. The JobTracker and the NameNode are usually collocated on the same physical machine. The other servers in the cluster run a TaskTracker and a DataNode process. A MapReduce job is divided into a number of tasks, which are managed by the TaskTracker. The TaskTracker and the DataNode are collocated on the same servers to provide data locality. MapReduce provides a standardized framework for implementing large-scale distributed computation, namely big-data applications. Still, the system has a limitation: inefficiency in incremental processing, which refers to applications that incrementally grow the input data and continuously apply computations to this input to produce output.
II. OBJECTIVES AND SCOPES
The scope of the work can be extended to the following:
1. Require minimum change to the original MapReduce programming model.
2. Application code only requires slight changes in order to utilize Dache.
3. Implement Dache in Hadoop by extending the relevant components.
4. Tested experiments show that it can eliminate all the duplicate tasks in incremental MapReduce jobs.
5. Minimum execution time and CPU utilization.
III. PROBLEM STATEMENT
A limitation of the current Hadoop MapReduce framework is that it generates a large flow of intermediate data. MapReduce is unable to retain such data, so it is deleted after use. Our system instead introduces a cache that holds the intermediate results.
  • 2. Because of this, data processing (that is, job execution) is faster than in the old system, and time-consuming repetition of data processing is reduced. For experimental purposes we access the Twitter server for tweeted comments using Flume and run data analysis operations on them.
III. PROPOSED SYSTEM
The figure above shows our proposed system. In this proposed work, Apache Pig, Apache HDFS, Apache Oozie, and Apache Hive can be used to design a direct data pipeline that enables the analysis of Twitter data. To find out who is prominent in social media, one should understand the mechanism of Twitter, which works on tweets and retweets. A retweet is a repost of an update, similar to forwarding an email. Querying Twitter data in a traditional RDBMS is inefficient. There are many Twitter APIs that provide streaming of Twitter data. The proposed work consists of the following steps.
A. Developing Twitter API
Develop a Twitter API on the Twitter side. The Twitter API communicates directly with the source and sink mechanism via a network-based application. Authentication keys and tokens are established to enable communication with the Twitter server.
B. Establishing the Connection via Source and Sink Mechanism
After creation of the Twitter API, design the source and sink mechanism that enables speedy data download from the Twitter server to HDFS (Hadoop Distributed File System). The source agent communicates with the Twitter API and channels the data. The streaming data is in JSON form, i.e. in the form of events. The data is queued and channeled via the channel mechanism. Finally, the data is sunk into HDFS. Then the tweets are analyzed using Pig or Hive.
C. Analyzing the Data on HDFS, i.e. Tweets, Using Pig or Hive
The data now stored on the data nodes is analyzed using Pig or Hive. Suppose we want to perform Twitter trend analysis; then we just have to fire a count query that counts the occurrences of a specific keyword.
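As a rough illustration of the count query combined with the caching idea from the problem statement, the sketch below counts tweets that mention a keyword and keeps each split's intermediate count in a cache that is queried before the split is processed again. It is written in Python rather than Pig or Hive and does not reproduce the actual Twiche protocol. The `CacheManager` class, the `describe` cache description, the in-memory dictionary, substring matching, and the assumption that each JSON event carries a `text` field are all hypothetical choices made for this example.

```python
import hashlib
import json


class CacheManager:
    """Hypothetical in-memory stand-in for the cache manager."""

    def __init__(self):
        self._items = {}
        self.hits = 0

    def lookup(self, description):
        """Return the cached result for a description, or None on a miss."""
        result = self._items.get(description)
        if result is not None:
            self.hits += 1
        return result

    def store(self, description, result):
        """Save an intermediate result under its description."""
        self._items[description] = result


def describe(split, keyword):
    """Cache description: the operation applied plus a digest of its input."""
    digest = hashlib.sha256("\n".join(split).encode("utf-8")).hexdigest()
    return ("count", keyword.lower(), digest)


def count_in_split(split, keyword):
    """Map task: count tweets in one split whose text contains the keyword."""
    needle = keyword.lower()
    return sum(needle in json.loads(line)["text"].lower() for line in split)


def trend_count(splits, keyword, cache):
    """Query the cache before each map task, then reduce by summing."""
    partial_counts = []
    for split in splits:
        key = describe(split, keyword)
        count = cache.lookup(key)
        if count is None:  # cache miss: do the real work and remember it
            count = count_in_split(split, keyword)
            cache.store(key, count)
        partial_counts.append(count)
    return sum(partial_counts)


def tweet(text):
    """Build one JSON event shaped like a (simplified) streamed tweet."""
    return json.dumps({"text": text})


cache = CacheManager()
day1 = [tweet("Hadoop is trending"), tweet("good morning")]
day2 = [tweet("learning hadoop and flume")]

first = trend_count([day1], "hadoop", cache)         # processes day1
second = trend_count([day1, day2], "hadoop", cache)  # reuses day1, adds day2
print(first, second, cache.hits)
```

In the second run only the new split is actually scanned, which is the kind of duplicate work an incremental job can avoid when intermediate results are kept.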
We have performed log analysis, fraud detection and clickstream analysis using this system.
IV. CONCLUSION
We present the design and evaluation of a Twitter data cache framework that requires minimum change to the original MapReduce programming model for provisioning incremental processing of Twitter big data using the MapReduce model with Flume. We propose Twiche, a Twitter trend cache description scheme, protocol, and architecture. Our method requires only a slight modification in the input format processing and task management of the MapReduce framework. Hence, application code only requires slight changes in order to utilize Twiche. We implement Twiche in Hadoop by extending relevant components. Experiments show that it can eliminate all the duplicate tasks in incremental MapReduce jobs and does not require substantial changes to the application code.
V. ACKNOWLEDGEMENT
The authors would like to thank the management and Prof. Jayashree Chaudhari, Dr. D.Y. Patil School of Engineering, for providing technical guidance, constant encouragement and suggestions throughout the project.