This document presents a framework for detecting near-duplicate tweets on Twitter. The framework analyzes tweet pairs using three approaches: (1) comparing syntactic characteristics like word overlap, (2) measuring semantic similarity, and (3) analyzing contextual information. Machine learning is used to learn patterns that help identify duplicate tweets. The framework is integrated into a Twitter search engine called Twinder to diversify search results and improve search quality. Extensive experiments evaluate strategies for detecting duplicate tweets and analyze which features impact detection. The results show that semantic features can boost duplicate detection performance.
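As an illustration of the syntactic side of such pairwise comparison, a word-overlap (Jaccard) feature can be sketched as follows; the function names and the threshold are illustrative assumptions, not taken from the paper:

```python
import re

def tokenize(tweet):
    """Lowercase a tweet and split it into word tokens."""
    return set(re.findall(r"[a-z0-9#@]+", tweet.lower()))

def jaccard_overlap(a, b):
    """Syntactic feature: word-overlap (Jaccard) between two tweets."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def near_duplicate(a, b, threshold=0.6):
    """Flag a pair as near-duplicate when overlap exceeds a threshold."""
    return jaccard_overlap(a, b) >= threshold
```

In a full pipeline this score would be just one feature among the syntactic, semantic, and contextual ones fed to the classifier.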
Twitter: Social Network Or News Medium? - Serge Beckers
This document analyzes Twitter as a social network and news media by studying its topological characteristics and information diffusion. The authors:
1) Crawled over 41 million user profiles, 1.47 billion social connections, and 106 million tweets to analyze Twitter's structure and behavior.
2) Found that Twitter has a non-power law distribution of followers, short paths of separation between users, and low reciprocity - distinguishing it from other social networks.
3) Ranked users by followers, PageRank and retweets, finding influence inferred from followers differs from popularity of tweets.
4) Analyzed trending topics and found most are news headlines that persist for days with participation from many users.
Text mining on Twitter information based on R platform - Fayan TAO
This document summarizes an analysis of tweets related to the hashtag "#prayforparis" following the November 2015 Paris attacks. The analysis was conducted using R for text mining. It preprocessed 3200 tweets by cleaning them, stemming words, and building a term-document matrix. Frequent words and associations were identified. Tweets were clustered into 10 groups and sentiment analysis was performed, finding that most tweets expressed sadness and anger and prayed for Paris, while maintaining a broadly positive, supportive attitude. Key text mining techniques such as preprocessing, matrix building, clustering, and sentiment analysis were implemented in R to analyze Twitter information about this event.
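The preprocessing and term-document-matrix steps described above can be sketched as follows (in Python rather than R, purely for illustration; the original analysis used R's text mining facilities):

```python
import re
from collections import Counter

def clean(tweet):
    """Strip URLs and mentions, lowercase, and split into word tokens."""
    tweet = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return re.findall(r"[a-z']+", tweet)

def term_document_matrix(tweets):
    """Return (vocabulary, rows) where rows[i][j] counts term j in tweet i."""
    docs = [Counter(clean(t)) for t in tweets]
    vocab = sorted(set().union(*docs))
    return vocab, [[d[w] for w in vocab] for d in docs]
```

The resulting matrix is the input for the frequent-term, association, and clustering steps.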
This paper investigates methods for organizing and classifying a large collection of geotagged Twitter data. The author uses a spatial clustering algorithm called mean-shift to identify "hot spots" with high volumes of tweets in different regions of the US. They then analyze the most popular hashtags within each cluster to identify differences in topics between regions. The results show clear differences in hashtags used depending on location, such as hashtags about cities within each cluster. However, the author notes limitations around parameter selection and the presence of spam tweets, and suggests areas for future improvement.
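A minimal flat-kernel mean-shift, of the kind used to find tweet "hot spots", can be sketched as follows (the radius, iteration count, and merging tolerance are illustrative assumptions, not the paper's parameters):

```python
def mean_shift(points, radius=1.0, iters=30):
    """Shift each point to the mean of its neighbors within `radius`
    until convergence; points that settle together form one cluster."""
    modes = [p for p in points]
    for _ in range(iters):
        new = []
        for (x, y) in modes:
            nbrs = [(px, py) for (px, py) in points
                    if (px - x) ** 2 + (py - y) ** 2 <= radius ** 2]
            new.append((sum(p[0] for p in nbrs) / len(nbrs),
                        sum(p[1] for p in nbrs) / len(nbrs)))
        modes = new
    # Merge modes that converged to (almost) the same location.
    centers = []
    for m in modes:
        if not any((m[0]-c[0])**2 + (m[1]-c[1])**2 < 1e-6 for c in centers):
            centers.append(m)
    return centers
```

The parameter-selection limitation the author notes shows up directly here: the choice of `radius` (bandwidth) controls how many hot spots emerge.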
This document discusses mining social data from graphs by extracting the data from social networks using their APIs or by crawling websites. It describes representing the social data as graphs and using graph mining techniques like finding frequent patterns and substructures using algorithms like Apriori, pattern growth, and CL-CBI (ChunkingLess - Constraint-Based Induction). Decision trees can also be used to iteratively find patterns that branch the data. The challenges include the graph nature of the data, errors and unknowns, and vanity metrics, but graphs are useful for capturing complex social structures.
This document discusses methods for collecting reliable data on user behavior from online social networks. It reviews previous research that has collected social network data through OSN APIs/crawling, surveys/questionnaires, and custom application deployment. The authors argue that these methods can produce biased data as they often only collect publicly shared content or rely on self-reported information. The authors propose combining existing methods, specifically the Experience Sampling Method, to collect more reliable data by capturing both shared and non-shared user behaviors and contexts through in situ experiments.
This document analyzes a dataset of over 37 billion tweets spanning 2006 to 2013 to study the evolution of Twitter users and their behavior. It finds that the percentage of U.S. and Canadian users dropped from over 80% to 32% as Twitter spread globally, and that the percentage of tweets in English fell from 83% to 52%. It also observes increases in tweet deletions, inactive accounts, and suspended accounts, as well as a shift from desktop to mobile usage, with over half of tweets now coming from mobile devices. The results provide insight into how Twitter and user behavior have changed over time.
This document analyzes 653 tweets containing the words "public relations" or the acronym "PR" to understand how Twitter contributes to the development of public relations theory and practice. The tweets were categorized and the following key findings were reported:
1. The most common categories were announcements/events (28.6%) and discussions between users (18.7%), showing Twitter's role in networking and sharing information.
2. Press releases (4.3%) and jobs (15.2%) were also frequently discussed, demonstrating Twitter's use for professional purposes in PR.
3. Unexpectedly, few tweets discussed academic topics (2.3%), suggesting PR scholars do not often use Twitter.
This document analyzes aspects of broad folksonomies based on a sample of over 800,000 bookmarks from Delicious. It finds that:
1) The frequency-rank distribution of co-occurring tags follows a power law for around 80% of tags analyzed.
2) When analyzing resource-based and user-based tagging characteristics, only around 18% and 13% of tags, respectively, followed a power law distribution, indicating folksonomies have a more complex underlying structure.
3) Tags provide added value for retrieval over title alone, though precision and recall were mostly below 0.5, showing tags are not redundant but current usage may not optimize retrieval performance.
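A quick way to test the frequency-rank power law reported above is a least-squares fit of log-frequency against log-rank; a perfect Zipf-like distribution yields a slope near -1. This sketch is illustrative, not the authors' method:

```python
import math

def powerlaw_slope(freqs):
    """Fit log(freq) = a + b*log(rank) by least squares and return slope b.
    `freqs` must be sorted in descending order; rank is 1-based."""
    xs = [math.log(r + 1) for r in range(len(freqs))]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return b
```

Tag distributions whose log-log plot is not well fit by such a line are the ones the paper counts as deviating from a power law.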
Highlighted notes on Deeper Inside PageRank.
While doing research work under Prof. Kishore Kothapalli.
This is a really "deep" review of PageRank! Should be a good story for a PhD student going to be working with PageRank optimizations.
Computing Social Score of Web Artifacts - IRE Major Project Spring 2015 - Amar Budhiraja
Final Presentation on Computing Social Scores of Web Artifacts - Major Project IRE Spring 2015
Web artifacts are items like news articles, web pages, videos, Pinterest pins, tweets, or URLs.
The task is to find the social score of an artifact (something like its popularity). On a single platform this is fairly trivial, since it can be computed from features like "number of times the item has been shared" and "number of people who like or dislike the item".
The challenge is to find the social score of an item that has been shared in multiple places, for example a URL shared on both Facebook and Twitter. For this study, we mined tweets and Facebook posts from a given time frame and built an engine on top of Solr and Lucene to accomplish the desired goals.
FRAMEWORK FOR ANALYZING TWITTER TO DETECT COMMUNITY SUSPICIOUS CRIME ACTIVITY - cscpconf
This document summarizes previous research on analyzing social media like Twitter to detect criminal or suspicious activity. It discusses frameworks that have used features of Twitter like hashtags, mentions, retweets etc. to analyze user profiles and communication networks. Some key approaches discussed include using graph analysis to identify influencers and automated accounts, predicting crime locations using Twitter data and linear regression, and detecting criminal networks through node analysis. However, many of the past studies did not fully address challenges like evaluating tweet and user credibility to reduce false positives.
This paper provides a comprehensive survey of issues related to PageRank, the algorithm used by Google to rank web pages. It covers the basic PageRank model, solution methods, storage considerations, properties like existence and convergence, possible model alterations, and future research areas. The paper explores suggestions that have been made to modify the basic PageRank model, and presents some new results and extensions. It aims to give insight into the beauty and usefulness of PageRank, and inspire further improvements.
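The basic PageRank model surveyed in the paper can be sketched as a simple power iteration; the damping factor and iteration count below are conventional defaults, not prescriptions from the survey:

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over {node: [outlinks]} adjacency lists."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            out = links[u]
            if out:                      # distribute rank along outlinks
                share = d * rank[u] / len(out)
                for v in out:
                    new[v] += share
            else:                        # dangling node: spread everywhere
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank
```

The storage and convergence issues the survey covers arise when the same iteration must run over billions of pages, where the rank vector and link structure no longer fit naive data structures.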
Data Cleaning for social media knowledge extraction - Marco Brambilla
Social media platforms let users share their opinions through textual or multimedia content. In many settings, this becomes a valuable source of knowledge that can be exploited for specific business objectives. Brands and companies often ask to monitor social media in order to understand the stance, opinion, and sentiment of their customers, their audience, and their potential audience. This is crucial for them because it lets them understand trends and future commercial and marketing opportunities.
However, all this relies on a solid and reliable data collection phase, one that guarantees that all analyses, extractions, and predictions are applied to clean, solid, and focused data. Indeed, topic-based collection of social media content performed through keyword-based search typically yields very noisy results.
We recently implemented a simple study aiming at cleaning the data collected from social content, within specific domains or related to given topics of interest. We propose a basic method for data cleaning and removal of off-topic content based on supervised machine learning techniques, i.e. classification, over data collected from social media platforms based on keywords regarding a specific topic. We define a general method for this and then we validate it through an experiment of data extraction from Twitter, with respect to a set of famous cultural institutions in Italy, including theaters, museums, and other venues.
For this case, we collaborated with domain experts to label the dataset, and then we evaluated and compared the performance of classifiers that are trained with different feature extraction strategies.
Fuzzy And ANN Based Mining Approach Testing For Social Network Analysis - IJERA Editor
Fast and appropriate Social Network Analysis (SNA) tools and techniques are required to collect and classify opinion scores on social networking sites, as a group formed around a wrong opinion may create problems for a society or country. SNA is a popular means for researchers because the number of users and groups on social sites increases day by day, and a large group may influence others. In this paper, we recommend a hybrid model of opinion recommendation systems, for a single user and for a collective community respectively, based on social liking and influence network theory. By collecting data on users' social networks and preferences (likes), we designed an improved hybrid prototype to imitate social influence through liking and sharing of information among groups. The significance of this paper is to analyze the suitability of ANN and fuzzy-set methods, combined in a hybrid manner, for classifying social web sites. First, we intend to use Artificial Neural Network (ANN) techniques for social media data classification, using contemporary methods different from the conventional methods of statistics and data analysis; next, we propose the fuzzy approach as a way to overcome the uncertainty that is always present in social media analysis. We give a brief overview of the main ideas and recent results of social network analysis, and we point to relationships between social network analysis and classification approaches. This research suggests a hybrid classification model built on fuzzy sets and artificial neural networks (HFANN). Information Gain and three popular social sites are used to collect feature-describing information that is then used to train and test the proposed methods. This neoteric approach combines the advantages of ANN and fuzzy sets in classification accuracy, utilizing social data and the knowledge base available in the hate lexicons.
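The fuzzy component described above addresses graded uncertainty in opinion scores; a minimal sketch of how a numeric score could be fuzzified with triangular membership functions follows (the set names and breakpoints are illustrative assumptions, not the paper's design):

```python
def triangular(x, a, b, c):
    """Triangular fuzzy membership: 0 at a and c, peak 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(score):
    """Map an opinion score in [-1, 1] to graded fuzzy set memberships."""
    return {
        "negative": triangular(score, -1.5, -1.0, 0.0),
        "neutral":  triangular(score, -1.0,  0.0, 1.0),
        "positive": triangular(score,  0.0,  1.0, 1.5),
    }
```

A score of 0.5, for example, belongs half to "neutral" and half to "positive" rather than being forced into a single crisp class; such membership vectors could then serve as inputs to the ANN stage.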
18th home blog_twitter_English (12OCT2010) - Han Woo PARK
This study analyzed the social network structures of Korean politicians on their homepages, blogs, and Twitter accounts between 2009-2010. It found that Twitter networks had higher connectivity and more evenly distributed links than homepage and blog networks. While politicians' homepage and blog networks mainly linked to others within their own party, Twitter networks showed more cross-party linking, especially between the two largest parties. The number of followers, followings, and tweets on Twitter in 2009 and 2010 were also found to be correlated. However, having more followers in 2009 did not guarantee more in 2010, showing the importance of continued online engagement.
- The document presents a study analyzing factors that influence continued participation in Twitter group chats. It develops a 5F model examining individual initiative, group characteristics, perceived receptivity, linguistic affinity, and geographic proximity.
- The study analyzes data from 30 educational Twitter chats over two years involving over 71,000 users. It also conducted a user survey.
- The 5F model effectively predicts whether a new user who attends one session will return based on analysis of their contributions, how well they fit with the group linguistically, and other metrics.
This document discusses incentive-based tagging to maximize the quality of resource tagging in social tagging systems. It finds that most resources are under-tagged, receiving too few tags to be useful, while a few popular resources are over-tagged. It proposes rewarding users for tagging under-tagged resources to address this imbalance. The key concepts of tagging quality, tagging stability, unstable and stable points are introduced. An optimal incentive allocation algorithm is presented to decide how to distribute rewards among resources to maximize overall tagging stability.
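A naive sketch of reward allocation toward under-tagged resources might look as follows; this greedy heuristic is only illustrative and is not the optimal incentive allocation algorithm presented in the paper:

```python
def allocate_rewards(tag_counts, target, budget):
    """Greedy sketch: repeatedly give one reward unit to the resource
    farthest below the target tag count, until the budget runs out."""
    rewards = {r: 0 for r in tag_counts}
    counts = dict(tag_counts)
    for _ in range(budget):
        # Pick the most under-tagged resource still below the target.
        under = [r for r in counts if counts[r] < target]
        if not under:
            break
        r = min(under, key=lambda r: counts[r])
        rewards[r] += 1
        counts[r] += 1   # assume one reward elicits roughly one new tag
    return rewards
```

The heuristic makes the paper's core imbalance concrete: budget flows only to resources below the stability target, while already over-tagged resources receive nothing.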
This paper proposes a system to detect spammer tweets on Twitter. It extracts features from tweets such as the number of unique mentions, unsolicited mentions, and duplicate domain names. It also uses the Google Safe Browsing API to analyze URLs. The system fetches tweets for a particular hashtag using the Twitter4J API. It then performs preprocessing like removing hashtags and quotes. The features are then used to classify tweets as spam or not spam using machine learning algorithms. The system aims to investigate linguistic features for detecting spam accounts and tweets in a supervised manner by leveraging existing hashtags for training data.
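Content features of the kind listed above (mention counts, duplicate link domains) can be extracted with a few regular expressions; the feature names and exact definitions here are assumptions for illustration, not the paper's specification:

```python
import re

def spam_features(tweet):
    """Illustrative content features for spam classification:
    mention counts and repeated link domains."""
    mentions = re.findall(r"@\w+", tweet)
    domains = re.findall(r"https?://([\w.-]+)", tweet)
    return {
        "n_mentions": len(mentions),
        "n_unique_mentions": len(set(mentions)),
        "n_duplicate_domains": len(domains) - len(set(domains)),
    }
```

Feature dictionaries like these, computed per tweet, would then be vectorized and passed to whichever supervised classifier is trained on the hashtag-derived labels.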
Presentation for tutorial session 'Measuring scholarly impact: Methods and practice' at ISSI2015
Explains how to use linkpred: https://github.com/rafguns/linkpred
Social Networks analysis to characterize HIV at-risk populations - Progress a... - UC San Diego
This document describes a study that aimed to characterize HIV-vulnerable populations on Twitter by analyzing user sentiments and extracting risk-related information. Researchers collected Twitter data using different APIs and classified tweets based on predefined HIV risk words. They modeled the data as a property graph in Neo4j with nodes for users, tweets, hashtags etc. and edges to represent relationships. Queries were run to find conversations between users mentioning drug and sex-related terms, most mentioned users, topics discussed by followers of high-risk users, and proximity of drug and homosexual users in the social graph. The study demonstrated how social network analysis and graph databases can help identify at-risk groups for public health interventions.
#ICCSS2015 - Computational Human Security Analytics using "Big Data" - Pete Burnap
This document summarizes research on analyzing social media data to study computational human security and cyber hate speech. It describes the COSMOS social media observatory project, which collects and analyzes data from Twitter and other sources. Machine learning models are used to automatically classify tweets as hate speech or not with up to 89% accuracy. Experimental research analyzes how cyber hate spreads and persists on social media after national security events. The goal is to model social responses to inform policy and decision making.
Gates Toorcon X New School Information Gathering - Chris Gates
This document provides an overview of open source intelligence (OSINT) techniques for information gathering about a target domain. It discusses tools like Fierce, SEAT, Goolag for searching search engines; Google mail harvesters and Metagoofil for extracting emails and metadata; and online tools like ServerSniff and DomainTools. It emphasizes using Maltego to visualize relationships between data discovered and gain a fuller picture of the target domain, including IPs, servers, documents, and potential usernames without directly contacting the target's systems. The goal is developing an initial target list and insights that could enable further social engineering or client-side attacks.
Integrated approach to detect spam in social media networks using hybrid feat... - IJECEIAES
Online social networking sites are becoming increasingly popular among Internet users, who spend considerable time on platforms like Facebook, Twitter, and LinkedIn. Online social networks are a useful tool for society, letting people communicate, transmit information, share opinions and ideas, make new friends, and create friend groups. These sites also expose large amounts of information, which attracts cybercriminals who misuse them: such users create their own accounts and spread harmful content to genuine users, for example advertising products or sending malicious links to disturb legitimate users. Spammer detection is now a major problem on social networking sites. Previous spam detection techniques use different sets of features to classify spam and non-spam users. In this paper we propose a hybrid approach that uses content-based and user-based features to identify spam on the Twitter network. In this hybrid approach we use a decision tree induction algorithm and a Bayesian network algorithm to construct a classification model. We analysed the proposed technique on a Twitter dataset; our analysis shows that the proposed methodology outperforms several existing techniques.
Introduction to the Responsible Use of Social Media Monitoring and SOCMINT Tools - Mike Kujawski
These are my slides from a custom tool-based demonstration workshop I was asked to do where I went over various free tools that can be used to obtain valuable public data.
1. The document analyzes word-of-mouth sharing of URLs on Twitter to understand how users discover content through social media.
2. It finds that word-of-mouth on Twitter can spread a single URL to a large audience, in some cases reaching millions of users, through multiple propagation trees on the social network.
3. The analysis aims to understand questions like how far word-of-mouth can spread content, what types of content are popular on social media versus the wider web, and typical structures of word-of-mouth propagation trees.
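The reach and depth of a propagation tree like those analyzed here can be computed with a breadth-first traversal; the `{user: [recipients]}` structure below is an illustrative simplification of a retweet/sharing cascade:

```python
from collections import deque

def cascade_stats(tree, root):
    """Return (size, depth) of a word-of-mouth propagation tree given as
    {user: [users they passed the URL to]}."""
    seen, depth = {root: 0}, 0
    q = deque([root])
    while q:
        u = q.popleft()
        for v in tree.get(u, []):
            if v not in seen:
                seen[v] = seen[u] + 1
                depth = max(depth, seen[v])
                q.append(v)
    return len(seen), depth
```

Size answers "how many users did word-of-mouth reach", while depth distinguishes a broadcast-like shallow tree from a long chain of person-to-person resharing.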
To Get any Project for CSE, IT ECE, EEE Contact Me @ 09666155510, 09849539085 or mail us - ieeefinalsemprojects@gmail.com-Visit Our Website: www.finalyearprojects.org
Highlighted notes on Deeper Inside PageRank.
While doing research work under Prof. Kishore Kothapalli.
This is a really "deep" review of PageRank! Should be a good story for a PhD student going to be working with PageRank optimizations.
Computing Social Score of Web Artifacts - IRE Major Project Spring 2015Amar Budhiraja
Final Presentation on Computing Social Scores of Web Artifacts - Major Project IRE Spring 2015
Web artifacts are items like news articles, web pages, videos, pinterest pints,tweets or URLs.
The task is to find the social score of an artifact (something like popularity). This is quite trivial because it can be done using features like “number of times the item has been shared”, “number of people who like or dislike the item” and so on.
The challenge is to find the social score of an item, if it has been shared on multiple places. For example a URL shared on facebook and twitter. For the purpose of this study, we mined tweets and Facebook posts of time frame and created an engine out of SOLR and Lucene to accomplish the desired goals.
FRAMEWORK FOR ANALYZING TWITTER TO DETECT COMMUNITY SUSPICIOUS CRIME ACTIVITYcscpconf
This document summarizes previous research on analyzing social media like Twitter to detect criminal or suspicious activity. It discusses frameworks that have used features of Twitter like hashtags, mentions, retweets etc. to analyze user profiles and communication networks. Some key approaches discussed include using graph analysis to identify influencers and automated accounts, predicting crime locations using Twitter data and linear regression, and detecting criminal networks through node analysis. However, many of the past studies did not fully address challenges like evaluating tweet and user credibility to reduce false positives.
This paper provides a comprehensive survey of issues related to PageRank, the algorithm used by Google to rank web pages. It covers the basic PageRank model, solution methods, storage considerations, properties like existence and convergence, possible model alterations, and future research areas. The paper explores suggestions that have been made to modify the basic PageRank model, and presents some new results and extensions. It aims to give insight into the beauty and usefulness of PageRank, and inspire further improvements.
Data Cleaning for social media knowledge extractionMarco Brambilla
Social media platforms let users share their opinions through textual or multimedia content. In many settings, this becomes a valuable source of knowledge that can be exploited for specific business objectives. Brands and companies often ask to monitor social media as sources for understanding the stance, opinion, and sentiment of their customers, audience and potential audience. This is crucial for them because it let them understand the trends and future commercial and marketing opportunities.
However, all this relies on a solid and reliable data collection phase, that grants that all the analyses, extractions and predictions are applied on clean, solid and focused data. Indeed, the typical topic-based collection of social media content performed through keyword-based search typically entails very noisy results.
We recently implemented a simple study aiming at cleaning the data collected from social content, within specific domains or related to given topics of interest. We propose a basic method for data cleaning and removal of off-topic content based on supervised machine learning techniques, i.e. classification, over data collected from social media platforms based on keywords regarding a specific topic. We define a general method for this and then we validate it through an experiment of data extraction from Twitter, with respect to a set of famous cultural institutions in Italy, including theaters, museums, and other venues.
For this case, we collaborated with domain experts to label the dataset, and then we evaluated and compared the performance of classifiers that are trained with different feature extraction strategies.
Fuzzy AndANN Based Mining Approach Testing For Social Network AnalysisIJERA Editor
Fast and Appropriate Social Network Analysis (SNA) tools ,techniques, are required to collect and classify
opinion scores on social networksites , as a grouping on wrong opinion may create problems for a society or
country . Social Network Analysis (SNA) is popular means for researcher as the number of users and groups
increasing day by day on that social sites , and a large group may influence other.In this paper, we
recommendhybrid model of opinion recommendation systems, for single user and for collective community
respectively, formed on social liking and influence network theory. By collecting thedata of user social networks
and preferenceslike, we designed aimproved hybrid prototype to imitate the social influence by like and sharing
the information among groups.The significance of this paper to analyze the suitability of ANN and Fuzzy sets
method in a hybrid manner for social web sites classifications, First, we intend to use Artificial Neural
Network(ANN)techniques in social media data classification by using some contemporary methods different
than the conventional methods of statistics and data analysis, in next we want to propagate the fuzzy approach
as a way to overcome the uncertainity that is always present in social media analysis . We give a brief overview
of the main ideas and recent results of social networks analysis , and we point to relationships between the two
social network analysis and classification approaches .This researchsuggests a hybrid classification model build
on fuzzy and artificial neural network (HFANN). Information Gain and three popular social sites are used to
collect information depicting features that are then used to train and test the proposed methods . This neoteric
approach combines the advantages of ANN and Fuzzy sets in classification accuracy with utilizing social data
and knowledge base available in the hate lexicons.
18th home blog_twitter_English (12OCT2010) Han Woo PARK
This study analyzed the social network structures of Korean politicians on their homepages, blogs, and Twitter accounts between 2009-2010. It found that Twitter networks had higher connectivity and more evenly distributed links than homepage and blog networks. While politicians' homepage and blog networks mainly linked to others within their own party, Twitter networks showed more cross-party linking, especially between the two largest parties. The number of followers, followings, and tweets on Twitter in 2009 and 2010 were also found to be correlated. However, having more followers in 2009 did not guarantee more in 2010, showing the importance of continued online engagement.
- The document presents a study analyzing factors that influence continued participation in Twitter group chats. It develops a 5F model examining individual initiative, group characteristics, perceived receptivity, linguistic affinity, and geographic proximity.
- The study analyzes data from 30 educational Twitter chats over two years involving over 71,000 users. It also conducted a user survey.
- The 5F model effectively predicts whether a new user who attends one session will return based on analysis of their contributions, how well they fit with the group linguistically, and other metrics.
This document discusses incentive-based tagging to maximize the quality of resource tagging in social tagging systems. It finds that most resources are under-tagged, receiving too few tags to be useful, while a few popular resources are over-tagged. It proposes rewarding users for tagging under-tagged resources to address this imbalance. The key concepts of tagging quality, tagging stability, unstable and stable points are introduced. An optimal incentive allocation algorithm is presented to decide how to distribute rewards among resources to maximize overall tagging stability.
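The allocation idea above can be sketched in a few lines. This is a hypothetical simplification, not the paper's algorithm: it assumes a fixed "stable point" (minimum useful tag count) and that one reward unit attracts roughly one new tag, then greedily pushes the budget toward the most under-tagged resources.

```python
# Hypothetical sketch of incentive allocation: the stable_point threshold and
# the one-reward-one-tag assumption are simplifications for illustration.

def allocate_incentives(tag_counts, budget, stable_point=10):
    """Greedily assign one reward unit at a time to the resource that is
    furthest below the stable point, until the budget runs out."""
    rewards = {r: 0 for r in tag_counts}
    counts = dict(tag_counts)
    for _ in range(budget):
        # pick the most under-tagged resource still below the stable point
        candidates = [r for r in counts if counts[r] < stable_point]
        if not candidates:
            break
        target = min(candidates, key=lambda r: counts[r])
        rewards[target] += 1
        counts[target] += 1  # assume one reward attracts roughly one new tag
    return rewards
```

Note how the over-tagged resource receives nothing: the budget flows entirely to resources below the stability threshold, which is the imbalance the paper targets.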
This paper proposes a system to detect spammer tweets on Twitter. It extracts features from tweets such as the number of unique mentions, unsolicited mentions, and duplicate domain names. It also uses the Google Safe Browsing API to analyze URLs. The system fetches tweets for a particular hashtag using the Twitter4J API. It then performs preprocessing like removing hashtags and quotes. The features are then used to classify tweets as spam or not spam using machine learning algorithms. The system aims to investigate linguistic features for detecting spam accounts and tweets in a supervised manner by leveraging existing hashtags for training data.
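Two of the features mentioned above (unique mentions, duplicate domain names) can be extracted with simple pattern matching. A minimal sketch, assuming a per-account list of raw tweet strings; the exact feature definitions and regexes used by the system are assumptions here.

```python
import re

# Illustrative feature extraction for the spam signals described above.
# The regexes and feature names are assumptions, not the system's exact code.

MENTION_RE = re.compile(r'@(\w+)')
DOMAIN_RE = re.compile(r'https?://([^/\s]+)')

def extract_features(tweets):
    """Return per-account features: unique mentions and duplicate domains."""
    mentions, domains = [], []
    for t in tweets:
        mentions.extend(MENTION_RE.findall(t))
        domains.extend(DOMAIN_RE.findall(t))
    return {
        'num_unique_mentions': len(set(mentions)),
        'num_duplicate_domains': len(domains) - len(set(domains)),
    }
```

Vectors like these would then be fed, together with URL-safety labels, into the machine learning classifier.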
Presentation for tutorial session 'Measuring scholarly impact: Methods and practice' at ISSI2015
Explains how to use linkpred: https://github.com/rafguns/linkpred
Social Networks analysis to characterize HIV at-risk populations - Progress a...UC San Diego
This document describes a study that aimed to characterize HIV-vulnerable populations on Twitter by analyzing user sentiments and extracting risk-related information. Researchers collected Twitter data using different APIs and classified tweets based on predefined HIV risk words. They modeled the data as a property graph in Neo4j with nodes for users, tweets, hashtags etc. and edges to represent relationships. Queries were run to find conversations between users mentioning drug and sex-related terms, most mentioned users, topics discussed by followers of high-risk users, and proximity of drug and homosexual users in the social graph. The study demonstrated how social network analysis and graph databases can help identify at-risk groups for public health interventions.
#ICCSS2015 - Computational Human Security Analytics using "Big Data"Pete Burnap
This document summarizes research on analyzing social media data to study computational human security and cyber hate speech. It describes the COSMOS social media observatory project, which collects and analyzes data from Twitter and other sources. Machine learning models are used to automatically classify tweets as hate speech or not with up to 89% accuracy. Experimental research analyzes how cyber hate spreads and persists on social media after national security events. The goal is to model social responses to inform policy and decision making.
Gates Toorcon X New School Information GatheringChris Gates
This document provides an overview of open source intelligence (OSINT) techniques for information gathering about a target domain. It discusses tools like Fierce, SEAT, Goolag for searching search engines; Google mail harvesters and Metagoofil for extracting emails and metadata; and online tools like ServerSniff and DomainTools. It emphasizes using Maltego to visualize relationships between data discovered and gain a fuller picture of the target domain, including IPs, servers, documents, and potential usernames without directly contacting the target's systems. The goal is developing an initial target list and insights that could enable further social engineering or client-side attacks.
Integrated approach to detect spam in social media networks using hybrid feat...IJECEIAES
Online social networking sites are becoming more popular among Internet users, who spend a considerable amount of time on popular sites like Facebook, Twitter and LinkedIn. Online social networks are a useful tool for society, used to communicate and transmit information: to share information, opinions and ideas, make new friends, and create friend groups. Social networking sites also provide a large amount of technical information to users, and this wealth of information attracts cybercriminals who misuse these sites. Such users create their own accounts and spread harmful information to genuine users, for example by advertising products or sending malicious links to disturb normal users on social sites. Spammer detection is now a major problem on social networking sites. Previous spam detection techniques use different sets of features to classify spam and non-spam users. In this paper we propose a hybrid approach that uses content-based and user-based features to identify spam on the Twitter network. In this hybrid approach we use a decision tree induction algorithm and a Bayesian network algorithm to construct a classification model. We have analysed the proposed technique on a Twitter dataset; our analysis shows that the proposed methodology performs better than some existing techniques.
Introduction to the Responsible Use of Social Media Monitoring and SOCMINT ToolsMike Kujawski
These are my slides from a custom tool-based demonstration workshop I was asked to do where I went over various free tools that can be used to obtain valuable public data.
1. The document analyzes word-of-mouth sharing of URLs on Twitter to understand how users discover content through social media.
2. It finds that word-of-mouth on Twitter can spread a single URL to a large audience, in some cases reaching millions of users, through multiple propagation trees on the social network.
3. The analysis aims to understand questions like how far word-of-mouth can spread content, what types of content are popular on social media versus the wider web, and typical structures of word-of-mouth propagation trees.
IRJET- Identification of Prevalent News from Twitter and Traditional Media us...IRJET Journal
This document describes a study that uses community detection models to identify prevalent news topics discussed on both Twitter and traditional media like BBC. It collects tweets and news articles about sports over a one-month period. Keywords are extracted from the data and a graph is constructed to represent relationships between words. Three community detection models - Girvan-Newman clustering, CLIQUE, and Louvain - are used to cluster similar content and detect communities of keywords representing news topics. The number of unique Twitter users engaged with each topic is also calculated to rank topics by user attention. The goal is to analyze how information is distributed between social and traditional media and identify emerging topics with low coverage in traditional sources.
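The graph-construction step above can be illustrated compactly. A minimal stand-in, assuming each document is a list of extracted keywords: keywords that co-occur in a document are linked, and connected components serve as a crude substitute for the communities the study finds with Girvan-Newman, CLIQUE and Louvain.

```python
from collections import defaultdict
from itertools import combinations

# Keyword co-occurrence graph with connected components as "communities".
# Components are a deliberate simplification of the study's three detection
# models (Girvan-Newman, CLIQUE, Louvain).

def cooccurrence_graph(documents):
    graph = defaultdict(set)
    for words in documents:
        for a, b in combinations(set(words), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def communities(graph):
    seen, comps = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(graph[n] - seen)
        comps.append(comp)
    return comps
```

Each resulting keyword community would then be ranked by the number of unique Twitter users engaging with its topic, as the study describes.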
The document discusses analyzing political trends on social networks using the Hidden Markov Model. It begins by introducing how social network data can be analyzed to observe user behaviors and interests. It then discusses using NodeXL to gather Twitter data based on political keywords and applying the Hidden Markov Model to statistically analyze the data and determine what political topics people focus on most. Finally, it reviews related work where other researchers have used techniques like the Hidden Markov Model and social network analysis to gather and analyze data from social media platforms.
Applying Clustering Techniques for Efficient Text Mining in Twitter Dataijbuiiir1
Knowledge is the ultimate output of decisions on a dataset. The revolution of the Internet has brought the globe closer, within a touch on hand-held electronic devices. Usage of social media sites has increased in the past decades. One of the most popular social media microblogs is Twitter, which has millions of users around the world. In this paper the analysis of Twitter data is performed on the text contained in hashtags. After preprocessing, clustering algorithms are applied to the text data. The different clusters formed are compared through various parameters. Visualization techniques are used to portray the results, from which inferences like time series and topic flow can easily be made. The observed results show that the hierarchical clustering algorithm performs better than the other algorithms.
Analysis, modelling and protection of online private data.Silvia Puglisi
Do we have online privacy? And what is privacy anyway in an online context?
This work aims at discovering what footprints users leave online and how these represent a threat to privacy.
Topic Modeling : Clustering of Deep Webpagescsandit
The internet comprises a massive amount of information in the form of zillions of web pages. This information can be categorized into the surface web and the deep web. Existing search engines can effectively make use of surface web information, but the deep web remains unexploited. Machine learning techniques have been commonly employed to access deep web content.
Under machine learning, topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between words with multiple meanings. Clustering is one of the key solutions to organizing deep web databases. In this paper, we cluster deep web databases based on the relevance found among deep web forms by employing a generative probabilistic model called Latent Dirichlet Allocation (LDA) to model content representative of deep web databases. This is implemented after preprocessing the set of web pages to extract page contents and form contents. Further, we derive the distributions of "topics per document" and "words per topic" using the technique of Gibbs sampling. Experimental results show that the proposed method clearly outperforms the existing clustering methods.
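The Gibbs-sampling step mentioned above can be shown in miniature. This is a toy collapsed Gibbs sampler for LDA, assuming a tiny hand-made corpus; the hyperparameters and corpus are illustrative choices, not the paper's setup.

```python
import random
from collections import defaultdict

# Toy collapsed Gibbs sampler for LDA, illustrating the "topics per document"
# and "words per topic" counts mentioned above. Hyperparameters are assumed.

def lda_gibbs(docs, n_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # random initialization of topic assignments z
    z = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(n_topics)
            zs.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove the current assignment from the counts
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # resample proportional to the collapsed conditional
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word
```

In the paper's setting, the documents would be preprocessed deep-web form contents, and the resulting topic distributions would drive the clustering of databases.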
Big Data Analytics- USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHIRuchika Sharma
This report is done as a part in completion of our Big Data Analysis Course at Jindal Global Business School.
In this report, we have mainly focused on a literature review of 10 use cases in the visualization task. We have worked on use cases pertaining to the varied uses of the social media site Twitter in political, cultural and business contexts, including use by drug marketers and musicians, among others.
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionIJERA Editor
This paper presents two approaches for finding trending topics in social networks: a keyword-based approach and a link-based approach. Conventional keyword-based approaches to topic detection focus mainly on the frequencies of (textual) words. We propose a link-based approach that focuses on the mentioning behaviour reflected in the posts of hundreds of users. Anomaly detection on the Twitter dataset is carried out by sequentially retrieving trending topics from Twitter via an API, together with the corresponding users for training; the computed anomaly scores are then aggregated across users. The aggregated anomaly score is fed into change-point analysis or burst detection in order to pinpoint emerging topics. Since we use a real-time Twitter account, results vary according to current tweet trends. The experiments show that the proposed link-based approach performs even better than the keyword-based approach.
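The aggregate-then-detect idea above can be sketched briefly. This is an assumed simplification: per-user anomaly scores are averaged per time window, and a window is flagged as a burst when it exceeds the running mean of past windows by a fixed factor, standing in for the paper's change-point analysis.

```python
# Sketch of score aggregation plus burst detection. The fixed-factor
# threshold rule is an assumed stand-in for proper change-point analysis.

def detect_bursts(window_scores, factor=2.0):
    """window_scores: one list of per-user anomaly scores per time window.
    Returns indices of windows flagged as bursts."""
    aggregated = [sum(s) / len(s) for s in window_scores]
    bursts = []
    for i, score in enumerate(aggregated):
        history = aggregated[:i]
        if history and score > factor * (sum(history) / len(history)):
            bursts.append(i)
    return bursts
```

A flagged window would correspond to an emerging topic in the mentioning behaviour, which the keyword-based baseline may miss.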
In the age of social media communication, it is easy to manipulate the minds of users and, in some cases, even instigate violent actions. There is a need for a system that can analyze the threat level of tweets from influential users and rank their Twitter handles, so that dangerous tweets can be kept from going public on Twitter before fact-checking, as such tweets can hurt people's sentiments and take the shape of violence. The study aims to analyse and rank Twitter users according to their influence and the extremism of their tweets, to help prevent major protests and violent events. We scraped top trending topics and fetched tweets using those hashtags. We propose a custom ranking algorithm that considers source-based and content-based features along with a knowledge graph, generates a score, and ranks Twitter users according to those scores. Our aim with this study is to identify and rank extremist Twitter users with regard to their impact and influence, using a technique that takes both source-based and content-based features of tweets into consideration to generate a ranking of high-impact extremist Twitter users.
This document discusses challenges and opportunities in managing large volumes of Twitter data for analytics purposes. It reviews existing approaches for collecting, storing, and querying Twitter data, noting that current systems operate in silos and calling for a unified framework. The paper surveys tools for data collection, data management frameworks, and languages for querying tweets, with the goal of informing the development of an integrated Twitter data management solution.
This document summarizes an algorithm for efficiently clustering short messages like tweets into general domains or topics. The algorithm breaks the clustering task into two stages: (1) batch clustering of user-annotated data using hashtags to create dense virtual documents, and (2) online clustering of new tweets using the centroids from stage 1. Experimental results show the algorithm outperforms other clustering methods and can accurately cluster large streams of sparse, short messages efficiently.
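The two stages above can be sketched with bag-of-words centroids. Function names and the representation are assumptions: stage 1 groups hashtag-labeled tweets into dense virtual documents and computes one centroid per hashtag; stage 2 assigns each new tweet to the nearest centroid by cosine similarity.

```python
import math
from collections import Counter, defaultdict

# Minimal sketch of the two-stage clustering idea: batch centroids from
# hashtag-annotated tweets, then online nearest-centroid assignment.
# The bag-of-words representation is an assumed simplification.

def centroid(texts):
    c = Counter()
    for t in texts:
        c.update(t.lower().split())
    return c

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def batch_stage(tagged_tweets):
    """Stage 1: build one dense virtual document (centroid) per hashtag."""
    by_tag = defaultdict(list)
    for tag, text in tagged_tweets:
        by_tag[tag].append(text)
    return {tag: centroid(texts) for tag, texts in by_tag.items()}

def online_stage(tweet, centroids):
    """Stage 2: assign a new tweet to the most similar centroid."""
    vec = Counter(tweet.lower().split())
    return max(centroids, key=lambda tag: cosine(vec, centroids[tag]))
```

Because the centroids are computed once in batch, the online stage touches each incoming tweet only once, which is what makes the approach workable on large streams of short, sparse messages.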
Sampling of User Behavior Using Online Social NetworkEditor IJCATR
The popularity of online networks provides an opportunity to study the characteristics of online social network graphs, which is important both to improve current systems and to design new applications of online social networks. Although personalized search has been proposed for many years and many personalization strategies have been investigated, it is still unclear whether personalization is consistently effective across different queries, different users, and different search contexts. In this paper, we study the performance of information collection in a dynamic social network. By analyzing the results, we reveal that personalized search yields a significant improvement over common web search.
The mixing time of the sampling process strongly depends on the characteristics of the graph.
Machine Learning Approach to Classify Twitter Hate Speechijtsrd
In this modern age, social media platforms have become indispensable tools for communication and information sharing. However, this unprecedented connectivity has also given rise to a concerning proliferation of hate speech and offensive content. This research article presents a comprehensive study on the development and evaluation of machine learning (ML) models for the automatic detection of hate speech on Twitter. We leverage a diverse dataset collected from Twitter, encompassing a wide range of hate speech categories, including hate speech targeting race, gender, religion, and more. To address the multifaceted nature of hate speech, we employ a hybrid approach that combines traditional natural language processing (NLP) techniques with state-of-the-art machine learning algorithms. Our methodology involves extensive preprocessing of the text data, including tokenization, stemming, and feature extraction. We then experiment with various machine learning algorithms, including Naïve Bayes (NB), K-Nearest Neighbor (KNN), Random Forest (RF), and Support Vector Machines (SVM). The models are trained and fine-tuned on a labeled dataset and evaluated using robust metrics to assess their performance. Subrata Saha | Md. Motinur Rahman | Md. Mahbub Alam "Machine Learning Approach to Classify Twitter Hate Speech" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-7 | Issue-5, October 2023, URL: https://www.ijtsrd.com/papers/ijtsrd59873.pdf Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/59873/machine-learning-approach-to-classify-twitter-hate-speech/subrata-saha
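One of the algorithms listed above, Naïve Bayes over bag-of-words features, fits in a few lines. This is a generic multinomial NB with Laplace smoothing, not the article's implementation; the tiny training sample is illustrative only, and real hate-speech data requires careful curation plus the preprocessing steps the article describes.

```python
import math
from collections import Counter, defaultdict

# Compact multinomial Naive Bayes with Laplace smoothing, as one of the
# classifiers mentioned above. Training data here is purely illustrative.

def train_nb(samples):
    """samples: list of (tokens, label). Returns (priors, word_counts, vocab)."""
    priors, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for tokens, label in samples:
        priors[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return priors, word_counts, vocab

def predict_nb(model, tokens):
    priors, word_counts, vocab = model
    total = sum(priors.values())
    best, best_lp = None, -math.inf
    for label in priors:
        lp = math.log(priors[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            lp += math.log((word_counts[label][w] + 1) / denom)  # Laplace smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

The log-probability formulation avoids underflow on longer texts, which matters once the vocabulary grows beyond a toy example.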
Groundhog Day: Near-Duplicate Detection on Twitter Ke Tao
Presentation given by Ke Tao at 22nd International World Wide Web Conference in Rio de Janeiro Brazil, titled Groundhog Day: Near-Duplicate Detection on Twitter, during the track of Social Web Engineering
Information access can be limited in situations where traditional media outlets cannot cover events due to geographical limitations or censorship, such as civil unrest, war or natural disasters. In these situations citizen journalism replaces or complements traditional media in documenting such events. Microblogging services such as Twitter have become of great use in these scenarios due to their mobile nature and multimedia capabilities.
In this research we propose a method to create searchable, semantically annotated news articles from tweets in an automated way using the cloud of linked open data.
IRJET- Effective Countering of Communal Hatred During Disaster Events in Soci...IRJET Journal
This document summarizes a research paper that aims to effectively counter communal hatred during disaster events on social media. It uses machine learning techniques to analyze tweets and classify them based on parameters such as offensive, hateful, or neither. Tweets are collected using Twitter's API and preprocessed. A supervised machine learning algorithm (Support Vector Machine) is trained on manually labeled tweet data to classify new tweets. The results are visualized in a pie chart displaying the percentage of tweets containing offensive, hateful, or neutral words. The goal is to reduce the spread of communal hate speech on social media during disasters.
This document summarizes a research paper that proposes using a logistic regression classifier trained with stochastic gradient descent to predict Twitter users' personalities from their tweets. It begins with an abstract of the paper and an introduction to personality prediction from social media. It then provides more detail on the anatomy of the research, including defining personality prediction from Twitter, its applications, and the general process of using machine learning for the task. Next, it reviews several previous studies on personality prediction from Twitter and social networks, noting their approaches, findings and limitations. It identifies remaining research gaps, such as the need for improved linguistic analysis of tweets and more robust, scalable predictive models. Finally, it proposes using a logistic regression classifier as the personality prediction model to address these gaps.
IRJET- An Experimental Evaluation of Mechanical Properties of Bamboo Fiber Re...IRJET Journal
This document discusses tweet segmentation and its application to named entity recognition. It proposes a novel framework called Hybrid Segmentation that segments tweets into meaningful phrases by maximizing the stickiness score of candidate segments based on global context from external sources like Wikipedia and local context from linguistic features within a batch of tweets. Two methods are developed - one using local linguistic features and the other using local term dependency. Experimental results on two tweet datasets show improved segmentation quality when using both global and local contexts compared to global context alone. The segmented tweets achieve higher accuracy for named entity recognition compared to word-based methods, demonstrating the benefit of tweet segmentation for downstream natural language processing applications.
Similar to Groundhog day: near duplicate detection on twitter (20)
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELijaia
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
VARIABLE FREQUENCY DRIVE. VFDs are widely used in industrial applications for...PIMR BHOPAL
A Variable Frequency Drive (VFD) is an electronic device used to control the speed and torque of an electric motor by varying the frequency and voltage of its power supply. VFDs are widely used in industrial applications for motor control, providing significant energy savings and precise motor operation.
Supermarket Management System Project Report.pdfKamal Acharya
Supermarket Management is a stand-alone J2EE program built using Eclipse Juno. This project contains all the necessary information for maintaining a supermarket billing system.
The core idea of this project is to minimize paperwork and centralize the data. All communication is handled in a secure manner: in this application the information is stored on the client itself, and for further security the database is stored in the Oracle back-end so that no intruders can access it.
Software Engineering and Project Management - Software Testing + Agile Method...Prakhyath Rai
Software Testing: A Strategic Approach to Software Testing, Strategic Issues, Test Strategies for Conventional Software, Test Strategies for Object -Oriented Software, Validation Testing, System Testing, The Art of Debugging.
Agile Methodology: Before Agile – Waterfall, Agile Development.
Applications of artificial Intelligence in Mechanical Engineering.pdfAtif Razi
Historically, mechanical engineering has relied heavily on human expertise and empirical methods to solve complex problems. With the introduction of computer-aided design (CAD) and finite element analysis (FEA), the field took its first steps towards digitization. These tools allowed engineers to simulate and analyze mechanical systems with greater accuracy and efficiency. However, the sheer volume of data generated by modern engineering systems and the increasing complexity of these systems have necessitated more advanced analytical tools, paving the way for AI.
AI offers the capability to process vast amounts of data, identify patterns, and make predictions with a level of speed and accuracy unattainable by traditional methods. This has profound implications for mechanical engineering, enabling more efficient design processes, predictive maintenance strategies, and optimized manufacturing operations. AI-driven tools can learn from historical data, adapt to new information, and continuously improve their performance, making them invaluable in tackling the multifaceted challenges of modern mechanical engineering.
Gas agency management system project report.pdfKamal Acharya
The project entitled "Gas Agency" is done to make the manual process easier by making it a computerized system for billing and maintaining stock. The gas agencies receive order requests through phone calls or in person from their customers and deliver gas cylinders to their addresses based on demand and the previous delivery date. This process is computerized, and the customer's name, address and stock details are stored in a database. Based on this, billing for a customer is made simpler and easier, since a customer's order for gas can be accepted only after a certain period has passed since the previous delivery; this can be calculated and billed easily. There are two types of delivery, for domestic use and for commercial use; the bill rate and capacity differ for both, and this can be easily maintained and charged accordingly.
Build the Next Generation of Apps with the Einstein 1 Platform.
Join Philippe Ozil for a workshop session that walks you through the details of the Einstein 1 platform, the importance of data for building artificial-intelligence applications, and the various tools and technologies Salesforce offers to bring you all the benefits of AI.
Design and optimization of ion propulsion dronebjmsejournal
Electric propulsion technology is widely used in many kinds of vehicles in recent years, and aircrafts are no exception. Technically, UAVs are electrically propelled but tend to produce a significant amount of noise and vibrations. Ion propulsion technology for drones is a potential solution to this problem. Ion propulsion technology is proven to be feasible in the earth’s atmosphere. The study presented in this article shows the design of EHD thrusters and power supply for ion propulsion drones along with performance optimization of high-voltage power supply for endurance in earth’s atmosphere.
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
Groundhog day: near duplicate detection on twitter
1. Groundhog Day: Near-Duplicate Detection on Twitter
Ke Tao¹, Fabian Abel¹,², Claudia Hauff¹, Geert-Jan Houben¹, Ujwal Gadiraju¹
¹ TU Delft, Web Information Systems, PO Box 5031, 2600 GA Delft, the Netherlands
wis@st.ewi.tudelft.nl
² XING AG, Gänsemarkt 43, 20354 Hamburg, Germany
fabian.abel@xing.com
ABSTRACT
With more than 340 million messages that are posted on Twitter every day, the amount of duplicate content as well as the demand for appropriate duplicate detection mechanisms is increasing tremendously. Yet there exists little research that aims at detecting near-duplicate content on microblogging platforms. We investigate the problem of near-duplicate detection on Twitter and introduce a framework that analyzes the tweets by comparing (i) syntactical characteristics, (ii) semantic similarity, and (iii) contextual information. Our framework provides different duplicate detection strategies that, among others, make use of external Web resources which are referenced from microposts. Machine learning is exploited in order to learn patterns that help identifying duplicate content. We put our duplicate detection framework into practice by integrating it into Twinder, a search engine for Twitter streams. An in-depth analysis shows that it allows Twinder to diversify search results and improve the quality of Twitter search. We conduct extensive experiments in which we (1) evaluate the quality of different strategies for detecting duplicates, (2) analyze the impact of various features on duplicate detection, (3) investigate the quality of strategies that classify to what exact level two microposts can be considered as duplicates and (4) optimize the process of identifying duplicate content on Twitter. Our results prove that semantic features which are extracted by our framework can boost the performance of detecting duplicates.
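The syntactic side of the framework described in the abstract can be illustrated with one of the simplest overlap signals. A minimal sketch, assuming Jaccard similarity over normalized token sets and a 0.5 threshold: these choices are illustrative, not the paper's tuned strategy, which also combines semantic and contextual features via machine learning.

```python
import re

# Illustration of one syntactic near-duplicate signal: Jaccard overlap
# between normalized token sets. Normalization rules and the threshold
# are assumptions for this sketch, not the paper's feature set.

def tokens(tweet):
    tweet = re.sub(r'https?://\S+', '', tweet.lower())  # drop URLs
    return set(re.findall(r'[a-z0-9#@]+', tweet))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_duplicates(t1, t2, threshold=0.5):
    return jaccard(tokens(t1), tokens(t2)) >= threshold
```

In the full framework, an overlap score like this would be one feature among many; tweets that share a referenced URL but differ in wording are exactly the cases where the semantic and contextual features earn their keep.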
Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval—Information filtering; H.4.m [Information Systems]: Miscellaneous
Keywords
Duplicate Detection; Twitter; Diversification; Search
1. INTRODUCTION
On microblogging platforms such as Twitter or Sina Weibo, where the number of messages that are posted per second exceeds several thousands during big events¹, solving the problem of information overload and providing solutions that allow users to access new information efficiently are non-trivial research challenges. The majority of messages on microblogging platforms refer to news, e.g. on Twitter more than 85% of the tweets are news-related [9]. Many of the microposts convey the same information in slightly different forms which puts a burden on users of microblogging services when searching for new content. Teevan et al. [23] revealed that the search behaviour on Twitter differs considerably from the search behaviour that can be observed on regular Web search engines: Twitter users issue repetitively the same query and thus monitor whether there is new content that matches their query.

¹ http://blog.twitter.com/2012/02/post-bowl-twitter-analysis.html

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media.
WWW 2013, May 13–17, 2013, Rio de Janeiro, Brazil.
ACM 978-1-4503-2035-1/13/05.
Traditional Web search engines apply techniques for de-
tecting near-duplicate content [8, 12] and provide diversifi-
cation mechanisms to maximize the chance of meeting the
expectations of their users [19]. However, there exists lit-
tle research that focuses on techniques for detecting near-
duplicate content and diversifying search results on micro-
blogging platforms. The conditions for inferring whether
two microposts comprise highly similar information and can
thus be considered near-duplicates differ from traditional
Web settings. For example, the textual content is limited
in length, people frequently use abbreviations or informal
words instead of proper vocabulary and the amount of mes-
sages that are posted daily is at a different scale (more than
340 million tweets per day²).
In this paper, we bridge the gap and explore near-duplicate
detection as well as search result diversification in the mi-
croblogging sphere. The main contributions of this paper
can be summarized as follows³.
• We conduct an analysis of duplicate content in Twit-
ter search results and infer a model for categorizing
different levels of duplicity.
• We develop a near-duplicate detection framework for
microposts that provides functionality for analyzing (i)
syntactical characteristics, (ii) semantic similarity and
(iii) contextual information. The framework also ex-
ploits external Web content which is referenced by the
microposts.
• Given our duplicate detection framework, we perform extensive evaluations and analyses of different duplicate detection strategies on a large, standardized Twitter corpus to investigate the quality of (i) detecting duplicates and (ii) categorizing the duplicity level of two tweets.
²http://blog.twitter.com/2012/03/twitter-turns-six.html
³We make our framework and datasets publicly available on our supporting website [22].
• We integrate our duplicate detection framework into a
Twitter search engine to enable search result diversifi-
cation and analyze the impact of the diversification on
the search quality.
2. RELATED WORK
Since Twitter was launched in 2006 it has attracted a lot
of attention both from the general public and research com-
munities. Researchers managed to find patterns in user be-
haviours on Twitter, including users’ interests towards news
articles [1], users’ behaviour over time [10], and more gen-
eral habits that users have on Twitter [18]. Previous re-
search also studied the characteristics of emerging network
structures [14] and showed that Twitter is rather a news
medium than a social platform [9]. Another area of interest is
event detection in social Web streams [24], for example in
the context of natural disasters [20].
The keyword-based search functionality is a generic tool
for users to retrieve relevant information. Teevan et al. [23]
analyzed the search behaviour on Twitter and found differ-
ences with respect to normal Web search. A first benchmark
on Twitter data was introduced at TREC⁴ 2011 with a track
related to search in microblogs⁵. Among the solutions devel-
oped by participating researchers are many feature-driven
approaches that exploit topic-insensitive features such as
“does the tweet contain a URL?” to rank the tweets that
match a given keyword query, e.g. [16]. More sophisticated
search solutions also extract named entities from tweets in
order to analyze the semantic meaning of tweets [21]. Bern-
stein et al. [5] investigated alternative topic-based browsing
interfaces for Twitter while Abel et al. [2] investigated the
utility of faceted search for retrieval on Twitter. However,
none of the aforementioned research initiatives investigated
strategies for near-duplicate detection and search result di-
versification in microblogs.
Traditional Web search engines benefit from duplicate de-
tection algorithms and diversification strategies that make
use of Broder et al.’s [6] shingling algorithm or Charikar’s [7]
random projection approach. Henzinger conducted a large-
scale evaluation to compare these two methods [8] and Manku
et al. [12] proposed to use the latter one for near-duplicate
detection during Web crawling. To achieve diversification
in search results, Agrawal et al. [3] studied the problem of
search result diversification in the context of answering am-
biguous Web queries and Rafiei et al. [19] suggested a solu-
tion to maximize expectations that users have towards the
query results. However, to the best of our knowledge, the
problem of identifying near-duplicate content has not been
studied in the context of microblogs. In this paper, we thus
aim to bridge the gap and research near-duplicate detection
and search result diversification on Twitter.
3. DUPLICATE CONTENT ON TWITTER
In this section, we provide the outcomes of our study of duplicate content on the Twitter platform. We present a definition of near-duplicate tweets in 5 levels and show concrete examples. We then analyze near-duplicate content in a large Twitter corpus and investigate to what extent near-duplicate content appears in Twitter search results.
⁴http://trec.nist.gov
⁵http://sites.google.com/site/microblogtrack/
All our examples and experiments utilize the Twitter cor-
pus which is provided by TREC [13].
3.1 Different Levels of Near-Duplicate Tweets
In this paper, we define duplicate tweets as tweets that
convey the same information either syntactically or seman-
tically. We distinguish near-duplicates in 5 levels.
Exact copy The duplicates at the level of exact copy are
identical in terms of characters. An example tweet pair (t1,
t2) in our Twitter corpus is:
t1 and t2: Huge New Toyota Recall Includes 245,000 Lexus GS,
IS Sedans - http://newzfor.me/?cuye
Nearly exact copy The duplicates of nearly exact copy
are identical in terms of characters except for #hashtags,
URLs, or @mentions. Consider the following tweet:
t3: Huge New Toyota Recall Includes 245,000 Lexus GS,
IS Sedans - http://bit.ly/ibUoJs
Here, the tweet pair of (t1, t3) is a near-duplicate at a level
of nearly exact copy.
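To illustrate these first two levels, detecting exact and nearly exact copies can be sketched by normalizing away #hashtags, URLs, and @mentions before comparing the texts. This is a minimal sketch, not the exact implementation used in our framework:

```python
import re

def normalize(tweet):
    """Remove URLs, #hashtags, and @mentions, then collapse whitespace."""
    t = re.sub(r"https?://\S+", "", tweet)
    t = re.sub(r"[#@]\w+", "", t)
    return re.sub(r"\s+", " ", t).strip()

def duplicate_level(ta, tb):
    """Classify a pair as 'exact copy', 'nearly exact copy', or neither."""
    if ta == tb:
        return "exact copy"
    if normalize(ta) == normalize(tb):
        return "nearly exact copy"
    return None
```

For the Toyota tweets above, `duplicate_level(t1, t3)` yields "nearly exact copy", since the pair differs only in the shortened URL.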
Strong near-duplicate A pair of tweets is a strong near-duplicate if both tweets contain the same core message syntactically and semantically, but at least one of them contains more information in the form of new statements or hard facts. For example, the tweet pair (t4, t5) is a strong near-duplicate:
t4: Toyota recalls 1.7 million vehicles for fuel leaks:
Toyota’s latest recalls are mostly in Japan, but they
also... http://bit.ly/dH0Pmw
t5: Toyota Recalls 1.7 Million Vehicles For Fuel Leaks
http://bit.ly/flWFWU
Weak near-duplicate Two weak near-duplicate tweets
either (i) contain the same core messages syntactically and
semantically while personal opinions are also included in one
or both of them, or (ii) convey semantically the same mes-
sages with differing information nuggets. For example, the
tweet pair of (t6, t7) is a weak near-duplicate:
t6: The White Stripes broke up. Oh well.
t7: The White Stripes broke up. That’s a bummer for me.
Low-overlapping The low-overlapping pairs of tweets se-
mantically contain the same core message, but only have a
couple of common words, e.g. the tweet pair of (t8, t9):
t8: Federal Judge rules Obamacare is unconsitutional...
t9: Our man of the hour: Judge Vinson gave Obamacare its
second unconstitutional ruling. http://fb.me/zQsChak9
If a tweet pair does not match any of the above definitions, it is considered a non-duplicate.
3.2 Near-Duplicates in Twitter Search Results
In Section 3.1, the example tweets come from the Tweets
2011 corpus [13], which was used in the Microblog track of
TREC 2011. The corpus is a representative sample from
tweets posted during a period of 2 weeks (January 23rd to
February 8th, 2011, inclusive). As the corpus is designed to
be a reusable test collection for investigating Twitter search
Length difference Besides matching letters, words, hashtags, and URLs, we also calculate the difference in length between two tweets and normalize it by the maximum tweet length (Lmax = 140 characters):

    length difference = abs(|tweet_a| − |tweet_b|) / 140    (2)
Hypothesis H6: The smaller the difference in length between
two tweets, the higher the likelihood of them being duplicates
and the higher their duplicate score.
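These syntactical features are cheap to compute. A minimal Python sketch of the length difference of Eq. (2), together with a Jaccard-style term overlap (the exact tokenization we use is not restated here, so the regex-based word splitting is an assumption):

```python
import re

L_MAX = 140  # maximum tweet length at the time of the study

def length_difference(tweet_a, tweet_b):
    """Eq. (2): absolute length difference, normalized by Lmax = 140."""
    return abs(len(tweet_a) - len(tweet_b)) / L_MAX

def term_overlap(tweet_a, tweet_b):
    """Jaccard similarity between the term sets of the two tweets."""
    ta = set(re.findall(r"\w+", tweet_a.lower()))
    tb = set(re.findall(r"\w+", tweet_b.lower()))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0
```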
4.1.2 Semantic Features
Apart from syntactical features of tweet pairs, semantic
information may also be valuable for identifying duplicates,
especially when the core messages or important entities in
tweets are mentioned in different order. For this reason, we
analyze the semantics in both tweets of a pair and construct
features that may help with distinguishing duplicate tweets.
We utilize NER services like DBpedia Spotlight, OpenCalais
as well as the lexical database WordNet to extract the fol-
lowing features.
Overlap in entities Given extracted entities or concepts
by employing NER services, we can check the overlap be-
tween the sets of entities in tweet pairs. The near-duplicate
tweet pairs should contain the same core messages and there-
fore the same entities should be mentioned.
Hypothesis H7: The tweet pairs with more overlapping enti-
ties are more likely to have a high duplicate score.
Overlap in entity types For entities extracted from NER
services, the types of the entities can also be retrieved. For
example, if tb contains the entities of type person and loca-
tion, ta should also contain the same type of entities to con-
vey the core messages if they are a near-duplicate tweet pair.
Otherwise, more types of entities may indicate it contains
more information or less types may suggest only a partial
coverage of the core message in ta. Therefore, we construct
features that measure the overlap in entity types between
tweet pairs.
Hypothesis H8: The tweet pairs with more overlapping entity
types are more likely to have a high duplicate score.
In fact, we found only a slight difference in performance between using DBpedia Spotlight and OpenCalais. In practice, we construct these two features and their derivative features (introduced later) using DBpedia Spotlight because it yields slightly better results.
Overlap in topics Besides outputting entities with types,
OpenCalais can classify the input textual snippets into 18 dif-
ferent categories a.k.a. topics. In this case, each tweet may
be assigned more than one topic label or no topic at all.
Therefore, it is possible to construct a feature by checking
the overlap in topics.
Hypothesis H9: The tweet pairs that share more topics are
more likely to have a high duplicate score.
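Each of these overlap features (entities, entity types, topics) can be computed with the same set-overlap measure; the sketch below assumes a Jaccard-style overlap, which the text does not pin down, and the example entity sets are hypothetical NER output:

```python
def set_overlap(set_a, set_b):
    """Jaccard overlap between two sets of entities, entity types, or topics."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Hypothetical NER output for a tweet pair about the Toyota recall:
entities_a = {"Toyota", "Lexus GS"}
entities_b = {"Toyota", "Japan"}
```

Here `set_overlap(entities_a, entities_b)` is 1/3: one shared entity out of three distinct ones.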
Overlap in WordNet concepts We constructed this fea-
ture to compute the overlap based on lexical standards. To
achieve this, we make use of the lexical database Word-
Net [17] to identify the nouns in pairs of tweets and calculate
their overlap in these nouns. Practically, we use JWI (MIT
Java Wordnet Interface)⁷ to find the root concepts of the
nouns in the tweets.
Hypothesis H10: The more overlap in WordNet noun con-
cepts we find in a pair of tweets, the more likely they are to
be duplicates and the higher their duplicate score.
⁷http://projects.csail.mit.edu/jwi/
Algorithm 1: WordNet similarity of a tweet pair
  input : tweet pair (ta, tb)
  output: WordNet similarity of tweet pair (ta, tb)
  acc ← 0;
  if |ta| > |tb| then
      swap(ta, tb);
  foreach WordNet noun concept ca in ta do
      maximum ← 0;
      foreach WordNet noun concept cb in tb do
          if maximum < similarity_lin(ca, cb) then
              maximum ← similarity_lin(ca, cb);
      acc ← acc + maximum;
  return acc / |Wa|;
Overlap in WordNet synset concepts Making use of
merely WordNet noun concepts may not fully cover the over-
lap in information because different tweets may use differ-
ent words or synonyms to convey the same information. In
WordNet, synsets are interlinked by means of conceptual-
semantic and lexical relations. We can make use of synsets
to include all words with similar meaning for checking the
overlap between tweet pairs.
Hypothesis H11: If the concepts in synsets are included for
checking overlap between tweet pairs then the overlap fea-
ture may have a more positive correlation with the duplicate
scores.
WordNet similarity There are several existing algorithms
for calculating the semantic relatedness between WordNet
concepts, e.g. the method proposed by Lin et al. [11] can
measure the semantic relatedness between two concepts with
a value between [0, 1]. The WordNet concepts are paired in
order to get the highest relatedness. Practically, we follow
Algorithm 1 to get this feature for a tweet pair (ta, tb). In
the description of the algorithm, Wa stands for the set of
WordNet noun concepts that appear in ta.
Hypothesis H12: The higher the WordNet similarity of a
tweet pair, the higher the likelihood of the tweets being du-
plicates and the higher their duplicate score.
4.1.3 Enriched Semantic Features
Due to the length limitation of tweets, 140 characters may
not be enough to tell a complete story. Furthermore, some
tweets, e.g. those created via sharing buttons on news sites,
may even truncate the message. Thus, we
make use of the external resources that are linked from the
tweets. This step yields additional information and further
enriches the tweets’ semantics. Finally, we build a set of
so-called enriched semantic features.
We construct six enriched semantic features, built in the
same way as the semantic features introduced
in Section 4.1.2. The only difference is that the source of
semantics contains not only the content of the tweets but
also the content that we find by retrieving the content of
the Web sites that are linked from the tweets.
4.1.4 Contextual Features
Besides analyzing syntactical and semantic aspects, which
describe the characteristics of tweet pairs, we also evalu-
ate the effects of the context in which the tweets were pub-
lished on the duplicate detection. We investigate three types
of contextual features: temporal difference of the creation
times, similarity of the tweets’ authors, and the client appli-
cation that the authors used.
Temporal difference For several popular events, e.g. UK
Royal wedding, Japanese earthquake, and Super Bowl, users
have posted thousands of tweets per second. During these
events, breaking news is often retweeted not long after being posted. Therefore, it is reasonable to assume that the
time difference between duplicate tweets is rather small. We
normalize this feature by dividing the original value by the
length of the temporal range of the dataset (two weeks in
our setup).
Hypothesis H13: The smaller the difference in posting time
between a pair of tweets, the higher the likelihood of it being
a duplicate pair and the higher the duplicate score.
User similarity Similar users may publish similar content.
We measure user similarity in a lightweight fashion by com-
paring the number of followers and the number of followees.
Hence, we extract two features: the differences in #followers
and #followees to measure the similarity of the authors of
a post. As the absolute values of these two features vary in
magnitude, we normalize this feature by applying log-scale
and dividing by the largest difference in log-scale, which is
7 in our case.
Hypothesis H14: The higher the similarity of the authors of a
pair of tweets, the more likely that the tweets are duplicates.
Same client This is a boolean feature to check whether
the pair of tweets were posted via same client application.
With authorization, third-party client applications can post
tweets on behalf of users. Hence, different Twitter client
applications as well as sharing buttons on various Web sites
are being used. As the tweets that are posted from the
same applications and Web sites may share similar content,
provenance information and particularly information about
the client application may be used as evidence for duplicate
detection.
Hypothesis H15: The tweet pairs that are posted from the
same client application tend to be near-duplicates.
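The three contextual features can be sketched as follows. The log base and the +1 smoothing in the follower feature are assumptions: the text only states that a log scale is used and that the largest log-scale difference in the corpus is 7.

```python
import math

TWO_WEEKS = 14 * 24 * 3600       # temporal range of the dataset, in seconds
MAX_LOG_DIFF = 7                 # largest log-scale difference in the corpus

def temporal_difference(time_a, time_b):
    """Posting-time difference, normalized by the dataset's temporal range."""
    return abs(time_a - time_b) / TWO_WEEKS

def follower_difference(followers_a, followers_b):
    """Difference in #followers on a log scale, normalized by MAX_LOG_DIFF."""
    return abs(math.log10(followers_a + 1) - math.log10(followers_b + 1)) / MAX_LOG_DIFF

def same_client(client_a, client_b):
    """Boolean feature: were both tweets posted via the same application?"""
    return client_a == client_b
```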
4.2 Feature Analysis
As previously stated, we take the Twitter corpus released
at TREC (Tweets2011) as our Twitter stream sample for
the task of duplicate detection. Before we turn to (eval-
uating) duplicate detection strategies, we first perform an
in-depth analysis of this sample with respect to the features
that we presented in Section 4.1. We extracted these fea-
tures for the 55,362 tweet pairs with duplicity judged (see
Section 3.2). In Table 1, we list the average values and
the standard deviations of the features, and the percentage
of true instances for the boolean feature (same
client). Moreover, Table 1 shows a comparison between fea-
tures of duplicate (on all 5 levels) and non-duplicate tweet
pairs.
Unsurprisingly, the Levenshtein distances of duplicate tweet
pairs are on average 15% shorter than the ones of non-
duplicate tweet pairs. Similarly, duplicate tweet pairs share
more identical terms than non-duplicate ones: the dupli-
cates have a Jaccard Similarity of 0.2148 in terms, whereas
only 0.0571 for the non-duplicates. Hence, these two fea-
tures which compare the tweets in letters and words may
be potentially good indicators for duplicate detection. Al-
though there is a difference in common hashtags between the
duplicates and the non-duplicates, the overlap in hashtags
does not seem to be a promising feature because of the low
absolute value. This may be explained by the low usage of
hashtags. The two features that check overlap in hyperlinks
show similar characteristics but are slightly better. As ex-
pected, we discover more overlap in links by expanding the
shortened URLs.
Tweet pairs may convey the same messages with syntac-
tically different but semantically similar words. If this is the
case then the syntactical features may fail to detect the du-
plicate tweets. Therefore, the features that are formulated
as overlap in semantics are expected to be larger in absolute
values than the syntactical overlap features. Overall, the
statistics that are listed in Table 1 are in line with our ex-
pectations. We discover more overlap in the duplicates along
3 dimensions, including entities, entity types, and topics, by
exploiting semantics with NER services. More distinguish-
able differences can be found in the features constructed
from WordNet. The duplicate tweet pairs have more overlap
in WordNet noun concepts or synsets (0.38) than the non-
duplicate pairs (0.12). The feature of WordNet similarity is
also potentially a good criterion for duplicate detection: the
average similarity of duplicate pairs is 0.61 compared to 0.35
for non-duplicate pairs. The comparison of the enriched se-
mantic features shows similar findings to those we observed
for the semantic features. Again, the features that compare
WordNet-based concepts are more likely to be good indica-
tors for duplicate detection. However the WordNet similar-
ity shows less difference if we consider external resources.
Finally, we attempted to detect the duplicates based on
information about the context in which the tweets were
posted. Hypothesis H13 (see Section 4.1.4) states that du-
plicates are more likely to be posted in a short temporal
range. The result for the feature of temporal difference in
Table 1 supports this hypothesis: the average value of this
feature for the duplicate pairs is only 0.0256 (about 8 hours
before normalization, see Section 4.1.4) in contrast to 0.2134
(about 3 days) for the non-duplicate ones. With respect to
user similarity, we have not discovered an explicit differ-
ence between the two classes. Regarding the client appli-
cations from which duplicate tweets are posted, we observe
the following: 21.1% of the duplicate pairs were posted from
the same client applications whereas only 15.8% of the non-
duplicate ones show the same characteristic.
4.3 Duplicate Detection Strategies
Having all the features constructed in Section 4.1 and
preliminarily analyzed in Section 4.2, we now create differ-
ent strategies for the task of duplicate detection. In prac-
tice, as requirements and limitations may vary in processing time, real-time demands, storage, network bandwidth, et cetera, different strategies may be adopted. Given that
our models for duplicate detection are derived from logis-
tic regression, we define the following strategies by combin-
ing different sets of features, including one Baseline strategy
and six Twinder strategies: Sy (only syntactical features),
SySe (including tweet content-based features), SyCo (with-
out semantics), SySeCo (without enriched semantics), Sy-
SeEn (without contextual features), and SySeEnCo (all fea-
tures).
4.3.1 Baseline Strategy
As baseline strategy, Levenshtein distance, which com-
pares tweet pairs in letters, is used to distinguish the dupli-
cate pairs and further the duplicate levels.
Category             Feature                                Duplicate   Std. dev.   Non-duplicate   Std. dev.
syntactical          Levenshtein Distance                   0.5340      0.2151      0.6805          0.1255
                     overlap in terms                       0.2148      0.2403      0.0571          0.0606
                     overlap in hashtags                    0.0054      0.0672      0.0016          0.0337
                     overlap in URLs                        0.0315      0.1706      0.0002          0.0136
                     overlap in expanded URLs               0.0768      0.2626      0.0017          0.0406
                     length difference                      0.1937      0.1656      0.2254          0.1794
semantics            overlap in entities                    0.2291      0.3246      0.1093          0.1966
                     overlap in entity types                0.5083      0.4122      0.3504          0.3624
                     overlap in topics                      0.1872      0.3354      0.0995          0.2309
                     overlap in WordNet concepts            0.3808      0.2890      0.1257          0.1142
                     overlap in WordNet Synset concepts     0.3876      0.2897      0.1218          0.1241
                     WordNet similarity                     0.6090      0.2977      0.3511          0.2111
enriched semantics   overlap in entities                    0.1717      0.2864      0.0668          0.1230
                     overlap in entity types                0.3181      0.3814      0.1727          0.2528
                     overlap in topics                      0.2768      0.3571      0.1785          0.2800
                     overlap in WordNet concepts            0.2641      0.3249      0.0898          0.0987
                     overlap in WordNet Synset concepts     0.2712      0.3258      0.0927          0.1046
                     WordNet similarity                     0.7550      0.2457      0.5963          0.2371
contextual           temporal difference                    0.0256      0.0588      0.2134          0.2617
                     difference in #followees               0.3975      0.1295      0.4037          0.1174
                     difference in #followers               0.4350      0.1302      0.4427          0.1227
                     same client                            21.13%      40.83%      15.77%          36.45%
Table 1: The comparison of features between duplicate and non-duplicate tweets
4.3.2 Twinder Strategies
The Twinder strategies exploit the sets of features that
have been introduced in Section 4.1. In our duplicate detec-
tion framework which is integrated in the Twinder search en-
gine for Twitter streams (see Section 6), new strategies can
easily be defined by grouping together different features.
Sy The Sy strategy is the most basic strategy in Twinder.
It includes only syntactical features that compare tweets on
a term level. These features can easily be extracted from
the tweets and are expected to have a good performance on
the duplicates for the levels of Exact copy or Nearly exact
copy.
SySe This strategy makes use of the features that take
the actual content of the tweets into account. Besides the
syntactical features, this strategy makes use of NER services
and WordNet to obtain the semantic features.
SyCo The strategy of SyCo (without semantics) is formu-
lated to prevent the retrieval of external resources as well
as a large amount of semantics extractions that rely on ei-
ther external Web services or extra computation time. Only
syntactical features and contextual features are considered
by this strategy.
SySeCo Duplicate detection can be configured to apply features without relying on external Web resources. We call the strategy that uses the syntactical features, the semantics that are extracted from the content of tweets, and the contextual information SySeCo.
SySeEn The contextual features, especially the ones related to users, may require extra storage and may need to be recomputed frequently. Therefore, duplicate detection may work without contextual information by applying the so-called SySeEn strategy (without contextual features).
SySeEnCo If enough hardware resources and network band-
width are available then the strategy that integrates all the
features can be applied so that the quality of the duplicate
detection can be maximized.
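In code, the grouping behind these strategies can be sketched as a mapping from strategy name to feature set. The feature identifiers below are illustrative, not Twinder's internal names:

```python
# Illustrative feature identifiers for the four groups of Section 4.1.
SYNTACTICAL = ["levenshtein", "terms", "hashtags", "urls",
               "expanded_urls", "length_difference"]
SEMANTIC = ["entities", "entity_types", "topics",
            "wordnet_concepts", "wordnet_synsets", "wordnet_similarity"]
ENRICHED = ["enriched_" + name for name in SEMANTIC]
CONTEXTUAL = ["temporal_difference", "followee_difference",
              "follower_difference", "same_client"]

STRATEGIES = {
    "Sy":       SYNTACTICAL,
    "SySe":     SYNTACTICAL + SEMANTIC,
    "SyCo":     SYNTACTICAL + CONTEXTUAL,
    "SySeCo":   SYNTACTICAL + SEMANTIC + CONTEXTUAL,
    "SySeEn":   SYNTACTICAL + SEMANTIC + ENRICHED,
    "SySeEnCo": SYNTACTICAL + SEMANTIC + ENRICHED + CONTEXTUAL,
}
```

Note that SySeEnCo combines 6 + 6 + 6 + 4 = 22 features, matching the number of predictor variables mentioned in Section 5.1.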
5. EVALUATION OF DUPLICATE DETECTION STRATEGIES
To understand how different features and strategies influ-
ence the performance of duplicate detection, we formulated
a number of research questions, which can be summarized
as follows:
1. How accurately can the different duplicate detection
strategies identify duplicates?
2. What kind of features are of particular importance for
duplicate detection?
3. How does the accuracy vary for the different levels of
duplicates?
5.1 Experimental Setup
We employ logistic regression for both steps of the task of
duplicate detection: (i) to classify tweet pairs as duplicate or
non-duplicate and (ii) to estimate the duplicate level. Due
to the limited amount of duplicate pairs (of all 5 levels, 2,745
instances) in the manually labelled dataset (55,362 instances
in total, see Section 3.2), we use 5-fold cross-validation to
evaluate the learned classification models. At most, we used
22 features as predictor variables (see Table 1). Since the
fraction of positive instances is considerably smaller than
the negative one, we employed a cost-sensitive classification
setup to prevent all the tweet pairs from being classified as
non-duplicates. Moreover, because the precision and recall
for non-duplicate are over 90%, we use the non-duplicate
class as the reference class and focus on the performance
of the class of duplicates. We use precision, recall, and F-
measure to evaluate the results. Furthermore, since our final
objective in this paper is to reduce duplicates in search re-
sults, we also point out the fraction of false positives as the
indicator of the costs of losing information by applying our
framework.
5.2 Influence of Strategies on Duplicate Detection
Table 2 shows the performance of predicting the dupli-
cate tweet pairs by applying the strategies described in Sec-
tion 4.3. The baseline strategy, which only uses Levenshtein
distance, leads to a precision and recall of 0.5068 and 0.1913
respectively. It means, for example, if 100 relevant tweets
are returned for a certain search query and about 20 tweets
(the example ratio of 20% according to the statistics given in
Section 3.2) are duplicates that could be removed, the base-
line strategy would identify 8 tweets as duplicates. How-
ever, only 4 of them are correctly classified while 16 other
true duplicates are missed. In order to measure both precision and recall at once, the F-measure is used; for the Baseline strategy its value is 0.2777. In contrast, the Sy strategy, which is the most basic one for Twinder, leads to a much better performance in terms of all measures, e.g. an F-measure of 0.3923. By combining the contextual features, the SyCo strategy achieves a slightly better F-measure of 0.4067. It appears that the contextual features contribute relatively little to the performance.

Strategies   Precision   Recall   F-measure
Baseline     0.5068      0.1913   0.2777
Sy           0.5982      0.2918   0.3923
SyCo         0.5127      0.3370   0.4067
SySe         0.5333      0.3679   0.4354
SySeEn       0.5297      0.3767   0.4403
SySeCo       0.4816      0.4200   0.4487
SySeEnCo     0.4868      0.4299   0.4566
Table 2: Performance results of duplicate detection for different sets of features
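The arithmetic behind the worked Baseline example and its F-measure can be checked in a few lines; rounding to whole tweets is our assumption for the illustration:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.5068, 0.1913   # Baseline row of Table 2

# 100 search results, of which ~20 are removable duplicates (positives):
found = round(20 * recall)           # true duplicates actually detected
flagged = round(found / precision)   # pairs the baseline flags in total
missed = 20 - found                  # true duplicates that slip through
```

This reproduces the numbers in the text: 4 duplicates found among 8 flagged pairs, 16 missed, and an F-measure of about 0.2777.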
Subsequently, we leave out the contextual features and
check the importance of semantics in the content of the
tweets and external resources. The SySe (including tweet
content-based features) strategy considers not only the syn-
tactical features but also the semantics extracted from the
content of the tweets. We find that the semantic features can
boost the classifier’s effectiveness as the F-measure increased
to 0.4354. The enriched semantics extracted from external
resources brought little benefit to the result as the SySeEn
strategy has a performance with F-measure of 0.4403. Over-
all, we conclude that semantics play an important role as
they lead to a performance improvement with respect to F-
measure from 0.3923 to 0.4403.
The so-called SySeCo strategy excludes the features
of enriched semantics but again includes the contextual fea-
tures. Given this strategy, we observe an F-measure of
0.4487. However, if we adopt the strategy of SySeEnCo (all
features), the highest F-measure can be achieved. At the
same time, we nearly keep the same precision as with the
Baseline strategy but boost the recall from 0.1913 to 0.4299.
This means that more than an additional 20% of duplicates
can be found while we keep the accuracy high. We now further analyze the impact of the different features in detail as they are used in the SySeEnCo strategy.
In the logistic regression approach, the importance of fea-
tures can be investigated by considering the absolute value of
the coefficients assigned to them. We have listed the details
about the model derived for the SySeEnCo (all features)
strategy in Table 3. The most important features are:
• Levenshtein distance: As it is a feature of negative
coefficient in the classification model, we infer that a
shorter Levenshtein distance indicates a higher proba-
bility of being duplicate pairs. Therefore, we confirm
our Hypothesis H1 made in Section 4.1.1.
• overlap in terms: Another syntactical feature also plays
an important role as the coefficient is ranked fourth
most indicative in the model. This can be explained
by the usage of common words in duplicate tweet pairs.
This result supports Hypothesis H2.
• overlap in WordNet concepts: The coefficients of the semantic and enriched semantic features vary in the model. However, the most important feature is the overlap in WordNet concepts. It has the largest positive weight, which means that pairs of tweets with a high overlap in WordNet concepts are more likely to be duplicates, confirming Hypothesis H10 (Section 4.1.2). However, we noticed a contradiction in the feature set of enriched semantics, in which the coefficient for overlap in WordNet concepts is negative (-2.0867) whereas the coefficient for overlap in WordNet synset concepts is positive (2.5496). This can be explained by the high correlation between these two features, especially for the high coverage of possible words in external resources. For this reason, they counteract each other in the model.
• temporal difference: In line with the preliminary analysis, the shorter the temporal difference between a pair of tweets, the more likely it is a duplicate pair. The large absolute value of this coefficient is partially due to the low average absolute values of the feature itself. However, we can still conclude that Hypothesis H13 holds (see Section 4.1.4).

Performance Measure   Score
precision             0.4868
recall                0.4299
F-measure             0.4566

Category             Feature                                Coefficient
syntactical          Levenshtein distance                    -2.9387 *
                     overlap in terms                         2.6769 *
                     overlap in hashtags                      0.4450
                     overlap in URLs                          1.2648
                     overlap in expanded URLs                 0.8832
                     length difference                        1.2820
semantics            overlap in entities                     -2.1404
                     overlap in entity types                  0.9624
                     overlap in topics                        1.4686
                     overlap in WordNet concepts              4.5225 *
                     overlap in WordNet Synset concepts       0.6279
                     WordNet similarity                      -0.8208
enriched semantics   overlap in entities                     -0.8819
                     overlap in entity types                  0.9578
                     overlap in topics                       -0.1825
                     overlap in WordNet concepts             -2.0867
                     overlap in WordNet Synset concepts       2.5496 *
                     WordNet similarity                       0.7949
contextual           temporal difference                    -12.6370 *
                     difference in #followees                 0.4504
                     difference in #followers                -0.3757
                     same client                             -0.1150
Table 3: The coefficients of different features. The five features with the highest absolute coefficients are marked with an asterisk.
Overall, we noticed that the hypotheses that we made
for syntactical features can all be confirmed. Although the
same conclusion could not be made for all the features con-
structed based on semantics, some of them can be explained.
For example, the overlap of WordNet concepts in the set of
enriched semantics is negative. The reason for this may
be twofold: (i) more general terms (such as politics, sport,
news, mobile) are overlapping if we consider external re-
sources; (ii) the features in the set of enriched semantics
may mislead when we extract the features for a pair of tweets
from which no external resources can be found or only one
tweet contains a URL. The situation for other features, e.g.
WordNet similarity, can be explained by the dependencies
between some of them. More specifically, the features that
are based on WordNet similarity in the sets of semantics and
enriched semantics may have positive correlation. Therefore,
the coefficients complement each other in values. When we
consider only the contextual features, none of the features other than the temporal difference belongs to the most important ones. More sophisticated techniques for
measuring user similarity might be used to better exploit,
for example, the provenance of tweets for the duplicate de-
tection task.
5.3 Influence of Topic Characteristics on Duplicate Detection
In all reported experiments so far, we have considered the
entire Twitter sample available to us. In this section, we in-
vestigate to what extent certain topic (or query) character-
istics play a role for duplicate detection and to what extent
those differences lead to a change in the logistic regression
models.
Consider the following two topics: Taco Bell filling lawsuit (MB020) and Egyptian protesters attack museum (MB010).
While the former has a business theme and is likely to be mostly of interest to American users, the latter belongs to the category of politics and can be considered of global interest, as the entire world was watching the events in Egypt unfold. Due to these differences, we defined a number
of topic splits. A manual annotator then decided for each
split dimension into which category the topic should fall. We
investigated four topic splits, three splits with two partitions
each and one split with five partitions:
• Popular/unpopular: The topics were split into popular (interesting to many users) and unpopular (interesting to few users) topics. An example of a popular topic is 2022 FIFA soccer (MB002); in total we found 24 popular topics. In contrast, topic NIST computer security (MB005) is classified as unpopular (as one of 25 such topics).
• Global/local: In this split, we considered the inter-
est for the topic across the globe. The already men-
tioned topic MB002 is of global interest, since soccer
is a highly popular sport in many countries, whereas
topic Cuomo budget cuts (MB019) is mostly of local
interest to users living or working in New York where
Andrew Cuomo is the governor. We found 18 topics
of global and 31 topics of local interest.
• Persistent/occasional: This split is concerned with the
interestingness of the topic over time. Some topics
persist for a long time, such as MB002 (the FIFA world
cup will be played in 2022), whereas other topics are
only of short-term interest, e.g. Keith Olbermann new
job (MB030). We assigned 28 topics to the persistent
and 21 topics to the occasional topic partition.
• Topic themes: The topics were classified as belonging
to one of five themes, either business, entertainment,
sports, politics or technology. MB002 is, e.g., a sports
topic while MB019 is considered to be a political topic.
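The per-partition models discussed below can be derived along these lines; the sketch is illustrative (the partition annotation, training pairs, and data are hypothetical, not the paper's dataset):

```python
# Sketch (hypothetical data): train one logistic regression per topic
# partition, as in the popular/unpopular split described above.
from collections import defaultdict
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical annotation: topic id -> partition of the popularity split.
partition_of = {"MB002": "popular", "MB005": "unpopular"}

# Hypothetical training pairs: (topic_id, feature_vector, is_duplicate).
pairs = [("MB002", rng.random(3), rng.integers(0, 2)) for _ in range(200)]
pairs += [("MB005", rng.random(3), rng.integers(0, 2)) for _ in range(200)]

# Group the pairs by partition and fit one model per partition.
grouped = defaultdict(lambda: ([], []))
for topic, x, label in pairs:
    X, y = grouped[partition_of[topic]]
    X.append(x)
    y.append(label)

models = {part: LogisticRegression().fit(np.array(X), np.array(y))
          for part, (X, y) in grouped.items()}
print(sorted(models))  # → ['popular', 'unpopular']
```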
Our discussion of the results focuses on two aspects: (i) the difference between the models derived for each of the two partitions, and (ii) the difference between these models (denoted M_splitName) and the model derived over all topics (M_allTopics) in Table 5. The results for the three binary topic splits are shown in Table 4.
Popularity: A comparison of the most important features of M_popular and M_unpopular shows few differences, with the exception of a single feature: temporal difference. While temporal difference is the most important feature in M_popular, it is only ranked fourth in M_unpopular. (The topic identifiers correspond to the ones used in the official TREC dataset.) We hypothesize that discussions on popular topics evolve quickly on Twitter, so the duplicate tweet pairs tend to have smaller differences in posting time.
Global vs. local: The most important feature in M_global, the overlap in terms, and the second most important feature in M_local, the Levenshtein distance, do not have a similar significance in each other's model. We consider this an interesting finding, and a possible explanation lies in the sources of the information: on the one hand, duplicate tweets about local topics may share the same source and thus have low Levenshtein distances; on the other hand, different sources may report on global topics in their own styles but with the same terms.
Temporal persistence: Comparing the M_persistent and M_occasional models yields conclusions similar to those of the previous two splits: (i) the persistent topics are continuously discussed, so their duplicate pairs are more likely to have short temporal differences, while the temporal differences between tweets on occasional topics are relatively insignificant; (ii) tweets on occasionally discussed topics often use the same set of words.
Topic Themes: The partial results of the topic split
according to the theme of the topic are shown in Table 5
(full results can be found on the supporting website for this
paper [22]). Three topics did not fit into any of the five categories. Since the topic set is split into five partitions, some partitions are extremely small, making it difficult to reach conclusive results. Nevertheless, we can detect
trends such as the fact that duplicate tweet pairs in sports
topics are more likely to contain the same source links (pos-
itive coefficient of overlap in original URLs and the opposite
of overlap in expanded URLs), while duplicate pairs in en-
tertainment topics contain more shortened links (positive
coefficient of overlap in expanded URLs). The overlap in
terms has a large impact on all themes but politics. Another interesting observation is that a short temporal difference is a prominent indicator for duplicates in the entertainment and politics topics but not in the other models.
The observation that certain topic splits lead to models that emphasize certain features also offers a natural way forward: if we can determine in advance to which theme or topic characteristic each topic belongs, we can select the model that fits the topic best.
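This model-selection idea can be sketched as below; the `pick_model` helper and the placeholder model names are assumptions for illustration, not part of the framework's actual API:

```python
# Sketch of the model-selection idea: if a topic's theme is known in
# advance, route each query to the model trained on that theme,
# falling back to the all-topics model otherwise.
def pick_model(topic_theme, theme_models, all_topics_model):
    """Return the theme-specific model if one exists, else the default."""
    return theme_models.get(topic_theme, all_topics_model)

theme_models = {"sports": "M_sports", "politics": "M_politics"}  # placeholders
print(pick_model("sports", theme_models, "M_allTopics"))   # → M_sports
print(pick_model("weather", theme_models, "M_allTopics"))  # → M_allTopics
```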
5.4 Analysis of Duplicate Levels
Having estimated whether a tweet pair is a duplicate or not, we now proceed to the second step of the duplicate detection task: determining the exact duplicity level of the tweet pairs. We compare the different strategies (see Section 4.3) in the same way as we have done in Section 5.2.
To analyze the performance in general, we used weighted
measures, including precision, recall, and F-measure across
5 levels. The results are summarized in Table 6; a similar pattern of performance improvement can be observed. However, it appears that the enriched semantics are more influential than the contextual features, as the SySeEn strategy (without contextual features) performs better than the SySeCo strategy (without enriched semantics).
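The weighted measures reported in Table 6 correspond to label-frequency-weighted averaging over the five duplicity levels, e.g. as computed by scikit-learn; the labels and predictions below are illustrative, not the paper's data:

```python
# Sketch: weighted precision/recall/F-measure over the 5 duplicate
# levels, where each level is weighted by its number of true instances.
from sklearn.metrics import precision_recall_fscore_support

levels = ["exact", "nearly_exact", "strong", "weak", "low"]
y_true = ["exact", "weak", "weak", "strong", "low", "weak", "nearly_exact"]
y_pred = ["exact", "weak", "strong", "strong", "low", "weak", "weak"]

p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=levels, average="weighted", zero_division=0)
print(f"precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
# → precision=0.643 recall=0.714 F-measure=0.667
```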
In Figure 2, we plot the performance of the classification
for 5 different levels and the weighted average of them with
all the strategies that we have introduced in Section 4.3.
The weight of each level depends on the ratio of duplicate
instances. The curves for 5 different levels show the simi-
Performance Measure popular unpopular global local persistent occasional
#topics 24 25 18 31 28 21
#samples 32,635 22,727 19,862 35,500 33,474 21,888
precision 0.4480 0.6756 0.6148 0.4617 0.4826 0.6129
recall 0.4569 0.6436 0.5294 0.5059 0.5590 0.5041
F-measure 0.4524 0.6592 0.5689 0.4828 0.5180 0.5532
Category Feature popular unpopular global local persistent occasional
syntactical
Levenshtein distance -3.4919 -0.6126 0.1342 -3.5136 -3.5916 -1.2338
overlap in terms 2.8352 6.0905 6.7126 1.0498 1.3474 5.7705
overlap in hashtags 0.6234 -2.5868 -1.8751 1.2671 2.2210 -2.7187
overlap in URLs -0.1865 5.8130 0.3275 1.6342 3.4323 0.4907
overlap in expanded URLs 0.5180 2.7594 1.4933 0.6362 1.4936 1.0751
length difference 1.2459 0.5974 0.5028 1.3236 1.5043 0.3793
semantics
overlap in entities -2.3460 0.5430 -0.2263 -3.3525 -3.1071 0.5538
overlap in entity types 1.2612 -1.1651 -0.3301 1.1571 0.8802 -0.8804
overlap in topics 1.6607 0.7505 1.2147 1.2294 1.3911 0.7848
overlap in WordNet concepts 5.5288 7.1115 5.8185 2.5319 4.0427 4.6365
overlap in WordNet Synset concepts -0.7763 -2.4393 -0.7335 3.0327 1.8752 0.5750
WordNet similarity -0.6254 1.8168 1.1355 -0.3909 -0.5141 1.0457
enriched semantics
overlap in entities -1.3013 0.0583 0.0501 -0.5548 -1.9666 0.8555
overlap in entity types 0.8997 1.2098 0.1179 0.8132 1.0279 0.3228
overlap in topics -0.5470 0.6581 0.0118 -0.1282 0.0292 -0.1884
overlap in WordNet concepts -1.8633 -1.2312 -2.3160 -2.4825 -2.5058 -2.1868
overlap in WordNet Synset concepts 2.4274 1.0336 3.4218 2.6263 3.0461 2.5514
WordNet similarity 0.7328 0.4342 -1.0359 0.9406 1.0470 -1.1867
contextual
temporal difference -15.8249 -2.9890 -5.2712 -17.3894 -18.9433 -5.6780
difference in #followees 0.7026 0.2322 -1.0797 0.5489 0.0659 -1.0473
difference in #followers -0.9826 1.1960 -0.3224 0.0048 -0.1281 -0.5178
same client -0.1800 0.3851 -0.0272 -0.0692 -0.2641 0.3058
Table 4: Influence comparison of different features among different topic partitions. There are three splits
shown here: popular vs. unpopular topics, global vs. local topics and persistent vs. occasional topics. While
the performance measures are based on 5-fold cross-validation, the derived feature weights for the logistic
regression model were determined across all topics of a split. The total number of topics is 49. For each topic
split, the three features with the highest absolute coefficient are underlined.
Performance Measure business entertainment sports politics technology
#topics 6 12 5 21 2
#samples 11,445 7,678 1,722 30,037 1,622
precision 0.6865 0.6844 0.5000 0.4399 0.6383
recall 0.6615 0.7153 0.6071 0.4713 0.7143
F-measure 0.6737 0.6995 0.5484 0.4551 0.6742
Table 5: This table shows the comparison of performance when partitioning the topic set according to five
broad topic themes. The full results can be found on the supporting website for this paper [22].
Strategies Precision Recall F-measure
Baseline 0.5553 0.5208 0.5375
Sy 0.6599 0.5809 0.6179
SyCo 0.6747 0.5889 0.6289
SySe 0.6708 0.6151 0.6417
SySeEn 0.6694 0.6241 0.6460
SySeCo 0.6852 0.6198 0.6508
SySeEnCo 0.6739 0.6308 0.6516
Table 6: Performance Results of predicting dupli-
cate levels for different sets of features
lar trend as their weighted average, which indicates the overall performance. We observe that classification at the level of Weak near duplicate performs better than average, which can be attributed to its large ratio of learning instances (see Figure 1). Classification at the level of Exact copy is the best because of the decisive influence of the Levenshtein distance. However, we see a declining trend in performance as the overlap between tweets decreases. Hence, further optimization is possible.
5.5 Optimization of Duplicate Detection
To optimize the duplicate detection procedure, we exploit
the fact that duplicate pairs of level Exact copy can easily
Figure 2: The F-measure of classification for the different levels and their weighted average under the different strategies (Baseline, Syntactical, Without Semantics, Tweet Content-based, Without Contextual, Without External, All Features).
be detected by their Levenshtein distance of 0. After the re-
moval of mentions, URLs, and hashtags, we can also apply
the same rule for Nearly exact copy. Therefore, we can op-
timize the duplicate detection procedure with the following
cascade:
1. If the Levenshtein distance between a pair of tweets is zero, either directly or after removal of mentions, URLs, and hashtags from both of them, the pair is classified as Exact copy or Nearly exact copy, respectively;
Strategies Precision Recall F-measure
Baseline 0.9011 0.2856 0.4337
Sy 0.7065 0.4095 0.5185
SyCo 0.6220 0.4550 0.5256
SySe 0.6153 0.4849 0.5424
SySeEn 0.5612 0.5395 0.5501
SySeCo 0.6079 0.4914 0.5435
SySeEnCo 0.5656 0.5512 0.5583
Table 7: Performance Results of duplicate detection
using different strategies after optimization
Figure 3: Architecture of the Twinder Search Engine: duplicate detection and diversification based on feature extraction modules.
2. Otherwise, we apply the aforementioned strategies to detect duplicity.
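A minimal sketch of this two-step cascade follows; the helper names and the regular expression are assumptions, and the lambda fallback stands in for the learned feature-based classifier:

```python
# Sketch of the optimization cascade: cheap exact checks first, the
# learned classifier only as a fallback for the remaining pairs.
import re

def strip_decorations(text):
    """Remove mentions, URLs, and hashtags before comparing."""
    return re.sub(r"(@\w+|#\w+|https?://\S+)", "", text).strip()

def classify_pair(t1, t2, model_predict):
    # Step 1: identical text -> Exact copy (Levenshtein distance 0).
    if t1 == t2:
        return "exact copy"
    # Step 1 (cont.): identical after stripping decorations -> Nearly exact.
    if strip_decorations(t1) == strip_decorations(t2):
        return "nearly exact copy"
    # Step 2: fall back to the feature-based classifier (hypothetical hook).
    return model_predict(t1, t2)

pair = ("Breaking news http://t.co/abc #egypt", "Breaking news http://t.co/xyz")
print(classify_pair(*pair, lambda a, b: "other"))  # → nearly exact copy
```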
After this optimization, we get a performance improvement
from 0.45 to 0.55 with respect to the F-measure. The cor-
responding results are listed in Table 7 (the original results
are given in Table 2).
6. SEARCH RESULT DIVERSIFICATION
A core application of near-duplicate detection strategies
is the diversification of search results. Therefore, we inte-
grated our duplicate detection framework into the so-called
Twinder [21] (Twitter Finder) search engine which provides
search and ranking functionality for Twitter streams. Fig-
ure 3 depicts the architecture of Twinder and highlights the
core modules which we designed, developed and analyzed in
the context of this paper.
The duplicate detection and diversification is performed
after the relevance estimation of the tweets. Hence, given a
search query, the engine first ranks the tweets according to
their relevance and then iterates over the top k tweets of the
search result to remove near-duplicate tweets and diversify
the search results. Both, the duplicate detection and the
relevance estimation module, benefit from the features that
are extracted as part of the indexing step which is performed
iteratively as soon as new tweets are monitored.
The lightweight diversification strategy applies the near-
duplicate detection functionality as listed in Algorithm 2.
It iterates from the top to the bottom of the top k search
results. For each tweet i, it removes all tweets at rank j with
i < j (i.e. tweet i has a better rank than tweet j) that are
near-duplicates of tweet i.
In Section 3.2, we analyzed the ratios of duplicates in the
search results. After applying the lightweight diversification
strategy proposed above, we again examine the ratios. The
results are listed in Table 8 and reveal that the fraction
Algorithm 2: Diversification Strategy
input : Ranking of tweets T, k
output: Diversified top-k ranking T@k
T@k ← ∅;
i ← 0;
while i < k and i < T.length do
    j ← i + 1;
    while j < T.length do
        if T[i] and T[j] are duplicates then
            remove T[j] from T;
        else
            j++;
    add T[i] to T@k;
    i++;
return T@k;
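A compact Python rendering of Algorithm 2; the duplicate predicate passed in is a stand-in for the learned near-duplicate classifier:

```python
# Scan the ranking top-down; for each kept tweet, drop all lower-ranked
# tweets that are near-duplicates of it, then emit the top k survivors.
def diversify(tweets, k, is_duplicate):
    ranking = list(tweets)
    top_k, i = [], 0
    while i < k and i < len(ranking):
        j = i + 1
        while j < len(ranking):
            if is_duplicate(ranking[i], ranking[j]):
                del ranking[j]       # remove the lower-ranked duplicate
            else:
                j += 1
        top_k.append(ranking[i])
        i += 1
    return top_k

ranked = ["quake hits city", "Quake hits city!", "relief efforts start"]
same = lambda a, b: a.lower().rstrip("!") == b.lower().rstrip("!")
print(diversify(ranked, 2, same))  # → ['quake hits city', 'relief efforts start']
```

Because duplicates are removed before lower ranks are visited, each surviving tweet is guaranteed not to be a near-duplicate of any higher-ranked result.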
Range Top 10 Top 20 Top 50 All
Before diversification 19.4% 22.2% 22.5% 22.3%
After diversification 9.1% 10.5% 12.0% 12.1%
Improvement 53.1% 52.0% 46.7% 45.7%
Table 8: Average ratios of near-duplicates in search
results after diversification
of near-duplicate tweets within the top k search results is considerably smaller. For example, without diversification, on average 22.2% of the tweets in the top 20 search results have at least one near-duplicate among those results. The diversification strategy improves the search result quality with respect to duplicate content by more than 50%: after diversification there are, on average, fewer than 11% duplicates in the top 20 search results.
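The ratios in Table 8 measure the fraction of top-k tweets that have at least one near-duplicate elsewhere in the top k; a minimal sketch, with a toy equality predicate standing in for the near-duplicate classifier:

```python
# Fraction of top-k tweets with at least one near-duplicate among the
# other top-k tweets (the measure reported in Table 8).
def duplicate_ratio(top_k, is_duplicate):
    n = len(top_k)
    has_dup = [any(is_duplicate(top_k[i], top_k[j])
                   for j in range(n) if j != i)
               for i in range(n)]
    return sum(has_dup) / n

tweets = ["a", "a", "b", "c"]
print(duplicate_ratio(tweets, lambda x, y: x == y))  # → 0.5
```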
7. CONCLUSIONS
People are confronted with a high fraction of near-duplicate
content when searching and exploring information on mi-
croblogging platforms such as Twitter. In this paper, we
analyzed the problem of near-duplicate content on Twitter
and developed a duplicate detection and search result di-
versification framework for Twitter. Our framework is able
to identify near-duplicate tweets with a precision and recall
of 48% and 43% respectively by combining (i) syntactical
features, (ii) semantic features and (iii) contextual features
and by considering information from external Web resources
that are linked from the microposts. For certain types of top-
ics such as occasional news events, we even observe perfor-
mances of more than 61% and 50% with respect to precision
and recall. Our experiments show that semantic features
such as the overlap of WordNet concepts are of particular
importance for detecting near-duplicates. By analyzing a
large Twitter sample, we also identified five main levels of
duplicity ranging from exact copies which can easily be iden-
tified by means of syntactic features such as string similarity
to low overlapping duplicates for which an analysis of the
semantics and context is specifically important. Our frame-
work is able to classify tweet pairs into these duplicity levels with an accuracy of more than 60%. Given our near-duplicate
detection strategies, we additionally developed functionality
for the diversification of search results. We integrated this
functionality into the Twinder search engine and could show
that our duplicate detection and diversification framework
improves the quality of top k retrieval significantly since we
decrease the fraction of duplicate content that is delivered
to the users by more than 45%.
Acknowledgements. This work is co-funded by the EU
FP7 project ImREAL (http://imreal-project.eu).
8. REFERENCES
[1] F. Abel, Q. Gao, G.-J. Houben, and K. Tao.
Analyzing User Modeling on Twitter for Personalized
News Recommendations. In Proceedings of the 19th
International Conference on User Modeling, Adaption
and Personalization (UMAP), pages 1–12. Springer,
July 2011.
[2] F. Abel, C. Hauff, G.-J. Houben, R. Stronkman, and
K. Tao. Twitcident: Fighting Fire with Information
from Social Web Stream. In Proceedings of the 21st
International Conference Companion on World Wide
Web (WWW), pages 305–308. ACM, April 2012.
[3] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong.
Diversifying Search Results. In Proceedings of the 2nd
ACM International Conference on Web Search and
Data Mining (WSDM), pages 5–14. ACM, February
2009.
[4] D. Antoniades, I. Polakis, G. Kontaxis,
E. Athanasopoulos, S. Ioannidis, E. P. Markatos, and
T. Karagiannis. we.b: The web of Short URLs. In
Proceedings of the 20th International Conference on
World Wide Web (WWW), pages 715–724. ACM,
April 2011.
[5] M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam,
and E. H. Chi. Eddi: Interactive Topic-based
Browsing of Social Status Streams. In Proceedings of
the 23rd Annual ACM Symposium on User Interface
Software and Technology (UIST), pages 303–312.
ACM, October 2010.
[6] A. Z. Broder, S. C. Glassman, M. S. Manasse, and
G. Zweig. Syntactic Clustering of the Web. In Selected
Papers from the 6th International Conference on
World Wide Web (WWW), pages 1157–1166. Elsevier
Science Publishers Ltd., September 1997.
[7] M. S. Charikar. Similarity Estimation Techniques from
Rounding Algorithms. In Proceedings of the 34th
Annual ACM Symposium on Theory of Computing
(STOC), pages 380–388. ACM, May 2002.
[8] M. Henzinger. Finding Near-Duplicate Web Pages: A
Large-Scale Evaluation of Algorithms. In Proceedings
of the 29th Annual International ACM SIGIR
Conference on Research and Development in
Information Retrieval (SIGIR), pages 284–291. ACM,
August 2006.
[9] H. Kwak, C. Lee, H. Park, and S. Moon. What is
Twitter, a Social Network or a News Media? In
Proceedings of the 19th International Conference on
World Wide Web (WWW), pages 591–600. ACM,
April 2010.
[10] C. Lee, H. Kwak, H. Park, and S. Moon. Finding
Influentials Based on the Temporal Order of
Information Adoption in Twitter. In Proceedings of
the 19th International Conference on World Wide
Web (WWW), pages 1137–1138. ACM, April 2010.
[11] D. Lin. An Information-Theoretic Definition of
Similarity. In Proceedings of the 15th International
Conference on Machine Learning (ICML), pages
296–304. Morgan Kaufmann, July 1998.
[12] G. S. Manku, A. Jain, and A. Das Sarma. Detecting
Near-Duplicates for Web Crawling. In Proceedings of
the 16th International Conference on World Wide
Web (WWW), pages 141–150. ACM, May 2007.
[13] R. McCreadie, I. Soboroff, J. Lin, C. Macdonald,
I. Ounis, and D. McCullough. On Building a Reusable
Twitter Corpus. In Proceedings of the 35th
International ACM SIGIR Conference on Research
and Development in Information Retrieval (SIGIR),
pages 1113–1114. ACM, August 2012.
[14] B. Meeder, B. Karrer, A. Sayedi, R. Ravi, C. Borgs,
and J. Chayes. We Know Who You Followed Last
Summer: Inferring Social Link Creation Times In
Twitter. In Proceedings of the 20th International
Conference on World Wide Web (WWW), pages
517–526. ACM, April 2011.
[15] P. N. Mendes, M. Jakob, A. García-Silva, and
C. Bizer. DBpedia Spotlight: Shedding Light on the
Web of Documents. In Proceedings of the 7th
International Conference on Semantic Systems
(I-SEMANTICS), pages 1–8. ACM, September 2011.
[16] D. Metzler and C. Cai. USC/ISI at TREC 2011:
Microblog Track. In Working Notes, The Twentieth
Text REtrieval Conference (TREC 2011) Proceedings.
NIST, November 2011.
[17] G. Miller et al. WordNet: A Lexical Database for
English. Communications of the ACM, 38(11):39–41,
November 1995.
[18] M. Pennacchiotti and A.-M. Popescu. A Machine
Learning Approach to Twitter User Classification. In
Proceedings of the 5th International AAAI Conference
on Weblogs and Social Media (ICWSM), pages
281–288. AAAI Press, July 2011.
[19] D. Rafiei, K. Bharat, and A. Shukla. Diversifying Web
Search Results. In Proceedings of the 19th
International Conference on World Wide Web
(WWW), pages 781–790. ACM, April 2010.
[20] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake
Shakes Twitter Users: Real-time Event Detection by
Social Sensors. In Proceedings of the 19th
International Conference on World Wide Web
(WWW), pages 851–860. ACM, April 2010.
[21] K. Tao, F. Abel, C. Hauff, and G.-J. Houben.
Twinder: A Search Engine for Twitter Streams. In
Proceedings of the 12th International Conference on
Web Engineering (ICWE), pages 153–168. Springer,
July 2012.
[22] K. Tao, F. Abel, C. Hauff, G.-J. Houben, and
U. Gadiraju. Supporting Website: datasets and
additional findings., November 2012.
http://wis.ewi.tudelft.nl/duptweet/.
[23] J. Teevan, D. Ramage, and M. R. Morris.
#TwitterSearch: A Comparison of Microblog Search
and Web Search. In Proceedings of the 4th
International Conference on Web Search and Web
Data Mining (WSDM), pages 35–44. ACM, February
2011.
[24] J. Weng and B.-S. Lee. Event Detection in Twitter. In
Proceedings of the 5th International AAAI Conference
on Weblogs and Social Media (ICWSM). AAAI Press,
July 2011.