IOSR Journal of Engineering (IOSRJEN) www.iosrjen.org
ISSN (e): 2250-3021, ISSN (p): 2278-8719
Vol. 05, Issue 09 (September. 2015), ||V1|| PP 32-34
International organization of Scientific Research 32 | P a g e
Predicting spam videos using predictive analysis.
P. Sai Kiran, Dathala Irwin Emmanuel
(Computer Science and Engineering, Vidya Jyothi Institute of Technology(JNTU-H), TG, India)
(Computer Science and Engineering, Vidya Jyothi Institute of Technology(JNTU-H), TG, India)
Abstract: Social networking has become a popular way for users to meet and interact online. Users spend a
significant amount of time on popular social network platforms (such as Facebook, MySpace, or Twitter),
storing and sharing personal information. This information also attracts the interest of cybercriminals.
Spam detection has seen considerable development in recent years. This paper asks whether a video on a
social video platform can be left out of spam checking, that is, whether it can be predicted to be
spam or not from its attributes.
Keywords: YouTube, Spammers, video spam, social network, Supervised Machine Learning, Machine Learning,
SVM, video predictions, predictive analysis.
I. INTRODUCTION
Over the last few years, social networking sites have become one of the main ways for users to keep
track of and communicate with their friends online. Sites such as Facebook, MySpace, and Twitter are consistently
among the top 20 most-visited sites of the Internet. Moreover, statistics show that, on average, users spend more
time on popular social networking sites than on any other site [1]. Most social networks provide mobile
platforms that allow users to access their services from mobile phones, making the access to these sites
ubiquitous. The tremendous increase in popularity of social networking sites allows them to collect a huge
amount of personal information about the users, their friends, and their habits. Unfortunately, this amount of
information, as well as the ease with which one can reach many users, has also attracted the interest of malicious
parties. In particular, spammers are always looking for ways to reach new victims with their unsolicited
messages. A market survey on user perception of spam over social networks found that, in 2008, 83% of
social network users had received at least one unwanted friend request or
message [2].
By allowing users to publicize and share their independently generated content, social video sharing
systems may become susceptible to different types of malicious and opportunistic user actions, such as self-
promotion, video aliasing and video spamming [3]. A video response spam is defined as a video posted as a
response to an opening video, but whose content is completely unrelated to the opening video. Video spammers
are motivated to spam in order to promote specific content, advertise to generate sales, disseminate pornography
(often as an advertisement) or compromise the system reputation.
Ultimately, users cannot easily identify a video spam before watching at least a segment of it, thus
consuming system resources, in particular bandwidth, and compromising user patience and satisfaction with the
system. Thus, identifying video spam is a challenging problem in social video sharing systems.
The paper specifically addresses the question, 'Do we really need to check and analyze every video to predict
whether it is spam or not?'
The approach is to take some attributes of the videos and attach threshold values to those
attributes, which may help to answer the above question.
The rest of the paper is organized as follows. The next section gives an overview of the background,
followed by the user test collection section, which discusses how the test data was collected. Lastly, the
predictive analysis section discusses the algorithm and the results.
II. BACKGROUND
Mechanisms to detect and identify spam and spammers have been largely studied in the context of web
[4, 5] and email spamming [6]. In particular, Castillo et al. [4] proposed a framework to detect web spamming
which uses social network metrics. A framework to detect spamming in tagging systems, which is a type of
attack that aims at raising the visibility of specific objects, was proposed in [7]. Although applicable to social
media sharing systems that allow object tagging by users, such as YouTube, the proposed technique exploits a
specific object attribute, i.e., its tags. A survey of approaches to combat spamming in social web sites is
presented in [8]. Many existing approaches are based on extracting evidence from the content of a text, treating
the text corpus as a set of objects with associated attributes and using these attributes to detect spam.
These techniques, based on content classification, can be directly applied to textual information, and
thus can be used to detect spam in email, text commentaries in blogs, forums, and online social networking sites.
Complementary to this effort, the characterization of the traffic to online video sharing systems, in particular
YouTube, has also been the focus of some studies. An in-depth analysis of popularity distribution, popularity
evolution and content characteristics of YouTube and of a popular Korean video sharing service is presented in
[9]. The authors also analyze mechanisms to improve video distribution, such as caching and peer-to-peer
distribution schemes. Gill et al. [10] present a characterization of the YouTube traffic collected from the
University of Calgary campus network and compare its properties with those previously reported for web and
media streaming workloads. Both studies focus on traffic and video characterization.
III. USER TEST COLLECTION
This paper continues previous work on the same topic [11], and many results are compared
directly with it. The user collection process is the same as in that work, and its results are presented in [11].
In addition, attributes such as the number of likes and comments are collected.
IV. PREDICTIVE ANALYSIS
In general, videos whose view count exceeds a threshold of 10,000 are found to be non-spam in 85% of cases
when compared with the results from [11].
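As a minimal sketch of how such a percentage could be measured, the Python snippet below applies a view-count rule to a small hand-made sample and reports the fraction of above-threshold videos that are truly non-spam. The `Video` record, the sample values, and the function name `non_spam_rate_above` are illustrative assumptions; they are not the data set or code of [11].

```python
from dataclasses import dataclass


@dataclass
class Video:
    """Hypothetical record: a view count plus a ground-truth label."""
    views: int
    is_spam: bool


def non_spam_rate_above(videos, view_threshold=10_000):
    """Fraction of videos with views > view_threshold that are non-spam.

    Returns None when no video exceeds the threshold.
    """
    selected = [v for v in videos if v.views > view_threshold]
    if not selected:
        return None
    non_spam = sum(1 for v in selected if not v.is_spam)
    return non_spam / len(selected)


# Made-up sample for illustration only.
sample = [
    Video(views=25_000, is_spam=False),
    Video(views=40_000, is_spam=False),
    Video(views=12_000, is_spam=True),
    Video(views=90_000, is_spam=False),
    Video(views=500, is_spam=True),
    Video(views=10_000, is_spam=False),  # exactly at the threshold, so not selected
]

rate = non_spam_rate_above(sample)
print(rate)
```

On this toy sample, four videos exceed the threshold and three of them are non-spam, so the rule is right for 75% of the videos it selects.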
ALGORITHM:
input: A list of information about users.
Initialize threshold values for no_of_likes and viewcount;
foreach User U in info-list do
    if likes greater than no_of_likes and view greater than viewcount then
        do natural language processing on every comment;
        if result is positive then
            label it as not spam;
        end
        if result is negative then
            label it as spam;
        end
    end
end
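The steps above can be sketched in Python as follows. This is one reading of the pseudocode, not the authors' implementation: the dictionary keys, the threshold values, and the tiny word-list scorer `comment_sentiment` are assumptions made for illustration, and the scorer is only a stand-in for a real natural language processing step. Items that fail the like and view thresholds are left unlabeled, as in the pseudocode.

```python
# Assumed toy lexicons; a real system would use an NLP library instead.
POSITIVE_WORDS = {"great", "good", "helpful", "love", "nice"}
NEGATIVE_WORDS = {"spam", "scam", "fake", "unrelated", "bad"}


def comment_sentiment(comments):
    """Stand-in for the NLP step: positive-word hits minus negative-word hits."""
    score = 0
    for comment in comments:
        for word in comment.lower().split():
            word = word.strip(".,!?")
            if word in POSITIVE_WORDS:
                score += 1
            elif word in NEGATIVE_WORDS:
                score -= 1
    return score


def label_items(info_list, min_likes=100, min_views=10_000):
    """Label each item that passes both thresholds; skip the others.

    info_list: list of dicts with keys "id", "likes", "views", "comments".
    min_likes is a hypothetical value; the paper does not state one.
    Returns a dict mapping id -> "not spam" or "spam".
    """
    labels = {}
    for item in info_list:
        if item["likes"] > min_likes and item["views"] > min_views:
            score = comment_sentiment(item["comments"])
            if score > 0:
                labels[item["id"]] = "not spam"
            elif score < 0:
                labels[item["id"]] = "spam"
            # A score of 0 is neither positive nor negative: left unlabeled.
    return labels


info_list = [
    {"id": "a", "likes": 500, "views": 50_000,
     "comments": ["Great video, very helpful!"]},
    {"id": "b", "likes": 300, "views": 20_000,
     "comments": ["This is spam.", "Totally unrelated scam"]},
    {"id": "c", "likes": 5, "views": 200, "comments": ["nice"]},
]
labels = label_items(info_list)
print(labels)
```

Item "c" never reaches the comment analysis because it fails both thresholds, which is the cost saving the thresholds are meant to provide.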
To increase the precision, attributes such as likes and comments are considered. Restricting the rule to
videos with more than 70% likes increases the precision to 90%. Additionally, analyzing the comments with a natural language
processing library such as the Natural Language Toolkit [12], to distinguish positive from negative comments,
further increases the precision to 93.6%.
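One way to read the 70% criterion is as a like ratio, that is, likes divided by the total number of ratings. The sketch below makes that assumption explicit; the use of a dislike count, the strict comparison, and the function names are not stated in the paper.

```python
def like_ratio(likes, dislikes):
    """Share of ratings that are likes; None if the video has no ratings."""
    total = likes + dislikes
    if total == 0:
        return None
    return likes / total


def passes_like_filter(likes, dislikes, min_ratio=0.70):
    """True when strictly more than min_ratio of the ratings are likes."""
    ratio = like_ratio(likes, dislikes)
    return ratio is not None and ratio > min_ratio


print(passes_like_filter(80, 20))  # ratio 0.8
print(passes_like_filter(70, 30))  # ratio 0.7, not strictly above the cut-off
print(passes_like_filter(0, 0))    # no ratings at all
```

Treating a video with no ratings as failing the filter is a conservative choice: it is then sent to the full spam check rather than skipped.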
V. CONCLUSION
Deploying an automatic engine such as this would let the spam classifier skip
videos that are predicted to be not spam.
This would not only increase efficiency but also ensure minimal use of resources.
Issues with this method:
For smaller data sets, it has been observed that analyzing comments sometimes led to a decrease in
accuracy.
One way to address this problem is to use a larger data set for training SVMs.
This work will continue, and future work will address the above-mentioned issue.
REFERENCES
[1] Alexa top 500 global sites. http://www.alexa.com/topsites.
[2] Harris Interactive Public Relations Research. A study of social networks scams. 2008
[3] M. Cha, H. Kwak, P. Rodriguez, Y. Ahn, and S. Moon. I tube, you tube, everybody tubes: Analyzing the
world’s largest user generated content video system. In Proc. of IMC, 2007.
[4] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection
using the web topology. In Int’l ACM SIGIR, pages 423–430, 2007.
[5] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Int’l. Conf. on
Very Large Data Bases, pages 576–587, 2004.
[6] L. Gomes, F. Castro, V. Almeida, J. Almeida, R. Almeida, and L. Bettencourt. Improving spam detection
based on structural similarity. In Proc. of SRUTI, 2005.
[7] G. Koutrika, F. Effendi, Z. Gyöngyi, P. Heymann, and H. Garcia-Molina. Combating spam in tagging
systems. In Proc. of AIRWeb, 2007.
[8] P. Heymann, G. Koutrika, and H. Garcia-Molina. Fighting spam on social web sites: A survey of
approaches and future challenges. IEEE Internet Computing, 11(6):36–45, 2007.
[9] M. Cha, H. Kwak, P. Rodriguez, Y. Ahn, and S. Moon. I tube, you tube, everybody tubes: Analyzing the
world’s largest user generated content video system. In Proc. of IMC, 2007.
[10] P. Gill, M. Arlitt, Z. Li, and A. Mahanti. YouTube traffic characterization: A view from the edge. In Proc.
of IMC, 2007.
[11] P. Sai Kiran. Detecting spammers in YouTube: A study to find spam content in a video platform. ISSN
(e): 2250-3021, ISSN (p): 2278-8719 Vol. 05, Issue 07 (July. 2015), ||V4|| PP 26-30.
[12] Natural Language Toolkit. http://www.nltk.org

More Related Content

What's hot

Internet Safety Technical Task Force Final Report
Internet Safety Technical Task Force Final ReportInternet Safety Technical Task Force Final Report
Internet Safety Technical Task Force Final ReportChris White
 
Converging Communications: The Perfect Storm
Converging Communications: The Perfect StormConverging Communications: The Perfect Storm
Converging Communications: The Perfect StormJoanne Jacobs
 
Efficient and effective video sharing in online Social network using revocati...
Efficient and effective video sharing in online Social network using revocati...Efficient and effective video sharing in online Social network using revocati...
Efficient and effective video sharing in online Social network using revocati...IRJET Journal
 
Investigating Tertiary Students’ Perceptions on Internet Security
Investigating Tertiary Students’ Perceptions on Internet SecurityInvestigating Tertiary Students’ Perceptions on Internet Security
Investigating Tertiary Students’ Perceptions on Internet SecurityITIIIndustries
 
Achieving Behavioral Change, for ISSA 2011 in San Francisco Feb 2011
Achieving Behavioral Change, for ISSA 2011 in San Francisco Feb 2011Achieving Behavioral Change, for ISSA 2011 in San Francisco Feb 2011
Achieving Behavioral Change, for ISSA 2011 in San Francisco Feb 2011Jason Hong
 
Examination Of Mobile Learning
Examination Of Mobile LearningExamination Of Mobile Learning
Examination Of Mobile LearningJames Brittain
 
Context Aggregation and Analysis: A Tool for User- Generated Video Verificati...
Context Aggregation and Analysis: A Tool for User- Generated Video Verificati...Context Aggregation and Analysis: A Tool for User- Generated Video Verificati...
Context Aggregation and Analysis: A Tool for User- Generated Video Verificati...Weverify
 
Finaly! Untangling Web 2.0
Finaly! Untangling Web 2.0Finaly! Untangling Web 2.0
Finaly! Untangling Web 2.0sydneyblackmore
 
Introduction to Microblogging in Education
Introduction to Microblogging in EducationIntroduction to Microblogging in Education
Introduction to Microblogging in EducationCarmen Holotescu
 

What's hot (11)

Internet Safety Technical Task Force Final Report
Internet Safety Technical Task Force Final ReportInternet Safety Technical Task Force Final Report
Internet Safety Technical Task Force Final Report
 
Converging Communications: The Perfect Storm
Converging Communications: The Perfect StormConverging Communications: The Perfect Storm
Converging Communications: The Perfect Storm
 
Efficient and effective video sharing in online Social network using revocati...
Efficient and effective video sharing in online Social network using revocati...Efficient and effective video sharing in online Social network using revocati...
Efficient and effective video sharing in online Social network using revocati...
 
Investigating Tertiary Students’ Perceptions on Internet Security
Investigating Tertiary Students’ Perceptions on Internet SecurityInvestigating Tertiary Students’ Perceptions on Internet Security
Investigating Tertiary Students’ Perceptions on Internet Security
 
Achieving Behavioral Change, for ISSA 2011 in San Francisco Feb 2011
Achieving Behavioral Change, for ISSA 2011 in San Francisco Feb 2011Achieving Behavioral Change, for ISSA 2011 in San Francisco Feb 2011
Achieving Behavioral Change, for ISSA 2011 in San Francisco Feb 2011
 
Examination Of Mobile Learning
Examination Of Mobile LearningExamination Of Mobile Learning
Examination Of Mobile Learning
 
Context Aggregation and Analysis: A Tool for User- Generated Video Verificati...
Context Aggregation and Analysis: A Tool for User- Generated Video Verificati...Context Aggregation and Analysis: A Tool for User- Generated Video Verificati...
Context Aggregation and Analysis: A Tool for User- Generated Video Verificati...
 
Finaly! Untangling Web 2.0
Finaly! Untangling Web 2.0Finaly! Untangling Web 2.0
Finaly! Untangling Web 2.0
 
Twitter tweet presentation_2011
Twitter tweet presentation_2011Twitter tweet presentation_2011
Twitter tweet presentation_2011
 
Introduction to Microblogging in Education
Introduction to Microblogging in EducationIntroduction to Microblogging in Education
Introduction to Microblogging in Education
 
You tube video promotion by cross network
You tube video promotion by cross networkYou tube video promotion by cross network
You tube video promotion by cross network
 

Viewers also liked

Stories, defects and tasks
Stories, defects and tasksStories, defects and tasks
Stories, defects and tasksWalther Lalk
 
Gas turbine analysis
Gas turbine analysisGas turbine analysis
Gas turbine analysisdaggyf
 
Проблеми та перспективи реформування системи збору за «приватну копію» в Україні
Проблеми та перспективи реформування системи збору за «приватну копію» в УкраїніПроблеми та перспективи реформування системи збору за «приватну копію» в Україні
Проблеми та перспективи реформування системи збору за «приватну копію» в Україніnadeh
 
NDC Sydney Debugging your communication
NDC Sydney Debugging your communicationNDC Sydney Debugging your communication
NDC Sydney Debugging your communicationSabine Wojcieszak
 
Juan carlos tedesco
Juan carlos tedescoJuan carlos tedesco
Juan carlos tedescoNatDelb
 
Lifecycle of a Moodle Bug - #mootus16
Lifecycle of a Moodle Bug - #mootus16Lifecycle of a Moodle Bug - #mootus16
Lifecycle of a Moodle Bug - #mootus16Dan Poltawski
 
ゲーム業界でよく聞くAWSクラウドに対する3つの誤解を解決しよう。
ゲーム業界でよく聞くAWSクラウドに対する3つの誤解を解決しよう。ゲーム業界でよく聞くAWSクラウドに対する3つの誤解を解決しよう。
ゲーム業界でよく聞くAWSクラウドに対する3つの誤解を解決しよう。Satoshi Nakada
 
フルマネージドサービスの活用とIoTシステムのオペレーション
フルマネージドサービスの活用とIoTシステムのオペレーションフルマネージドサービスの活用とIoTシステムのオペレーション
フルマネージドサービスの活用とIoTシステムのオペレーションSatoshi Nakada
 
Swiftでの関数型プログラミングについて考えていること
Swiftでの関数型プログラミングについて考えていることSwiftでの関数型プログラミングについて考えていること
Swiftでの関数型プログラミングについて考えていることShingo Sato
 
NO SMOKING ZONE - SIFS INDIA
NO SMOKING ZONE - SIFS INDIANO SMOKING ZONE - SIFS INDIA
NO SMOKING ZONE - SIFS INDIASifs India
 
Xcode7時代のアプリ配布
Xcode7時代のアプリ配布Xcode7時代のアプリ配布
Xcode7時代のアプリ配布toyship
 
THE IMPORTANCE OF A STRATEGIC MARKETING PLAN FOR SRILANKA AS A TOURIST DESTI...
THE IMPORTANCE OF  A STRATEGIC MARKETING PLAN FOR SRILANKA AS A TOURIST DESTI...THE IMPORTANCE OF  A STRATEGIC MARKETING PLAN FOR SRILANKA AS A TOURIST DESTI...
THE IMPORTANCE OF A STRATEGIC MARKETING PLAN FOR SRILANKA AS A TOURIST DESTI...Paul Solaman Srilal 🇱🇰
 
Tips for better CI on Android
Tips for better CI on AndroidTips for better CI on Android
Tips for better CI on AndroidTomoaki Imai
 
Система захисту від недобросовісної конкуренції: ефективні рішення
Система захисту від недобросовісної конкуренції: ефективні рішенняСистема захисту від недобросовісної конкуренції: ефективні рішення
Система захисту від недобросовісної конкуренції: ефективні рішенняnadeh
 
Security Features on New ₹500 & ₹2,000 Currency Notes
Security Features on New ₹500 & ₹2,000 Currency NotesSecurity Features on New ₹500 & ₹2,000 Currency Notes
Security Features on New ₹500 & ₹2,000 Currency NotesSifs India
 
HPC Parallel Computing for CFD - Customer Examples (2 of 4)
HPC Parallel Computing for CFD - Customer Examples (2 of 4)HPC Parallel Computing for CFD - Customer Examples (2 of 4)
HPC Parallel Computing for CFD - Customer Examples (2 of 4)Ansys
 
El gran-libro-del-pendulo
El gran-libro-del-penduloEl gran-libro-del-pendulo
El gran-libro-del-penduloFatunji Aworeni
 

Viewers also liked (20)

Austria
AustriaAustria
Austria
 
Stories, defects and tasks
Stories, defects and tasksStories, defects and tasks
Stories, defects and tasks
 
Gas turbine analysis
Gas turbine analysisGas turbine analysis
Gas turbine analysis
 
Проблеми та перспективи реформування системи збору за «приватну копію» в Україні
Проблеми та перспективи реформування системи збору за «приватну копію» в УкраїніПроблеми та перспективи реформування системи збору за «приватну копію» в Україні
Проблеми та перспективи реформування системи збору за «приватну копію» в Україні
 
NDC Sydney Debugging your communication
NDC Sydney Debugging your communicationNDC Sydney Debugging your communication
NDC Sydney Debugging your communication
 
Juan carlos tedesco
Juan carlos tedescoJuan carlos tedesco
Juan carlos tedesco
 
Lifecycle of a Moodle Bug - #mootus16
Lifecycle of a Moodle Bug - #mootus16Lifecycle of a Moodle Bug - #mootus16
Lifecycle of a Moodle Bug - #mootus16
 
ゲーム業界でよく聞くAWSクラウドに対する3つの誤解を解決しよう。
ゲーム業界でよく聞くAWSクラウドに対する3つの誤解を解決しよう。ゲーム業界でよく聞くAWSクラウドに対する3つの誤解を解決しよう。
ゲーム業界でよく聞くAWSクラウドに対する3つの誤解を解決しよう。
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
フルマネージドサービスの活用とIoTシステムのオペレーション
フルマネージドサービスの活用とIoTシステムのオペレーションフルマネージドサービスの活用とIoTシステムのオペレーション
フルマネージドサービスの活用とIoTシステムのオペレーション
 
Finger Painting
Finger PaintingFinger Painting
Finger Painting
 
Swiftでの関数型プログラミングについて考えていること
Swiftでの関数型プログラミングについて考えていることSwiftでの関数型プログラミングについて考えていること
Swiftでの関数型プログラミングについて考えていること
 
NO SMOKING ZONE - SIFS INDIA
NO SMOKING ZONE - SIFS INDIANO SMOKING ZONE - SIFS INDIA
NO SMOKING ZONE - SIFS INDIA
 
Xcode7時代のアプリ配布
Xcode7時代のアプリ配布Xcode7時代のアプリ配布
Xcode7時代のアプリ配布
 
THE IMPORTANCE OF A STRATEGIC MARKETING PLAN FOR SRILANKA AS A TOURIST DESTI...
THE IMPORTANCE OF  A STRATEGIC MARKETING PLAN FOR SRILANKA AS A TOURIST DESTI...THE IMPORTANCE OF  A STRATEGIC MARKETING PLAN FOR SRILANKA AS A TOURIST DESTI...
THE IMPORTANCE OF A STRATEGIC MARKETING PLAN FOR SRILANKA AS A TOURIST DESTI...
 
Tips for better CI on Android
Tips for better CI on AndroidTips for better CI on Android
Tips for better CI on Android
 
Система захисту від недобросовісної конкуренції: ефективні рішення
Система захисту від недобросовісної конкуренції: ефективні рішенняСистема захисту від недобросовісної конкуренції: ефективні рішення
Система захисту від недобросовісної конкуренції: ефективні рішення
 
Security Features on New ₹500 & ₹2,000 Currency Notes
Security Features on New ₹500 & ₹2,000 Currency NotesSecurity Features on New ₹500 & ₹2,000 Currency Notes
Security Features on New ₹500 & ₹2,000 Currency Notes
 
HPC Parallel Computing for CFD - Customer Examples (2 of 4)
HPC Parallel Computing for CFD - Customer Examples (2 of 4)HPC Parallel Computing for CFD - Customer Examples (2 of 4)
HPC Parallel Computing for CFD - Customer Examples (2 of 4)
 
El gran-libro-del-pendulo
El gran-libro-del-penduloEl gran-libro-del-pendulo
El gran-libro-del-pendulo
 

Similar to G05913234

IRJET - YouTube Spam Comments Detection
IRJET - YouTube Spam Comments DetectionIRJET - YouTube Spam Comments Detection
IRJET - YouTube Spam Comments DetectionIRJET Journal
 
Fake News Detection Using Machine Learning
Fake News Detection Using Machine LearningFake News Detection Using Machine Learning
Fake News Detection Using Machine LearningIRJET Journal
 
A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...IRJET Journal
 
A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...IRJET Journal
 
IRJET- Secure Social Network using Text Mining
IRJET- Secure Social Network using Text MiningIRJET- Secure Social Network using Text Mining
IRJET- Secure Social Network using Text MiningIRJET Journal
 
Application Of Sentiment Lexicons On Movies Transcripts To Detect Violence In...
Application Of Sentiment Lexicons On Movies Transcripts To Detect Violence In...Application Of Sentiment Lexicons On Movies Transcripts To Detect Violence In...
Application Of Sentiment Lexicons On Movies Transcripts To Detect Violence In...Sara Alvarez
 
Word embedding for detecting cyberbullying based on recurrent neural networks
Word embedding for detecting cyberbullying based on recurrent neural networksWord embedding for detecting cyberbullying based on recurrent neural networks
Word embedding for detecting cyberbullying based on recurrent neural networksIAESIJAI
 
Automatic video censoring system using deep learning
Automatic video censoring system using deep learningAutomatic video censoring system using deep learning
Automatic video censoring system using deep learningIJECEIAES
 
IRJET- Authentic News Summarization
IRJET-  	  Authentic News SummarizationIRJET-  	  Authentic News Summarization
IRJET- Authentic News SummarizationIRJET Journal
 
BINARY TEXT CLASSIFICATION OF CYBER HARASSMENT USING DEEP LEARNING
BINARY TEXT CLASSIFICATION OF CYBER HARASSMENT USING DEEP LEARNINGBINARY TEXT CLASSIFICATION OF CYBER HARASSMENT USING DEEP LEARNING
BINARY TEXT CLASSIFICATION OF CYBER HARASSMENT USING DEEP LEARNINGIRJET Journal
 
IRJET- Fake News Detection
IRJET- Fake News DetectionIRJET- Fake News Detection
IRJET- Fake News DetectionIRJET Journal
 
Integrated approach to detect spam in social media networks using hybrid feat...
Integrated approach to detect spam in social media networks using hybrid feat...Integrated approach to detect spam in social media networks using hybrid feat...
Integrated approach to detect spam in social media networks using hybrid feat...IJECEIAES
 
IRJET - Profanity Statistical Analyzer
 IRJET -  	  Profanity Statistical Analyzer IRJET -  	  Profanity Statistical Analyzer
IRJET - Profanity Statistical AnalyzerIRJET Journal
 
Detection and Minimization Influence of Rumor in Social Network
Detection and Minimization Influence of Rumor in Social NetworkDetection and Minimization Influence of Rumor in Social Network
Detection and Minimization Influence of Rumor in Social NetworkIRJET Journal
 
SECUREWALL-A FRAMEWORK FOR FINEGRAINED PRIVACY CONTROL IN ONLINE SOCIAL NETWORKS
SECUREWALL-A FRAMEWORK FOR FINEGRAINED PRIVACY CONTROL IN ONLINE SOCIAL NETWORKSSECUREWALL-A FRAMEWORK FOR FINEGRAINED PRIVACY CONTROL IN ONLINE SOCIAL NETWORKS
SECUREWALL-A FRAMEWORK FOR FINEGRAINED PRIVACY CONTROL IN ONLINE SOCIAL NETWORKSZac Darcy
 
An Automated Model to Detect Fake Profiles and botnets in Online Social Netwo...
An Automated Model to Detect Fake Profiles and botnets in Online Social Netwo...An Automated Model to Detect Fake Profiles and botnets in Online Social Netwo...
An Automated Model to Detect Fake Profiles and botnets in Online Social Netwo...IOSR Journals
 
Visual Relation Identification Using BoFT Labels in Social Media Feeds
Visual Relation Identification Using BoFT Labels in Social Media FeedsVisual Relation Identification Using BoFT Labels in Social Media Feeds
Visual Relation Identification Using BoFT Labels in Social Media FeedsIRJET Journal
 
IRJET- An Effective Analysis of Anti Troll System using Artificial Intell...
IRJET-  	  An Effective Analysis of Anti Troll System using Artificial Intell...IRJET-  	  An Effective Analysis of Anti Troll System using Artificial Intell...
IRJET- An Effective Analysis of Anti Troll System using Artificial Intell...IRJET Journal
 

Similar to G05913234 (20)

IRJET - YouTube Spam Comments Detection
IRJET - YouTube Spam Comments DetectionIRJET - YouTube Spam Comments Detection
IRJET - YouTube Spam Comments Detection
 
Fake News Detection Using Machine Learning
Fake News Detection Using Machine LearningFake News Detection Using Machine Learning
Fake News Detection Using Machine Learning
 
A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...
 
A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...
 
IRJET- Secure Social Network using Text Mining
IRJET- Secure Social Network using Text MiningIRJET- Secure Social Network using Text Mining
IRJET- Secure Social Network using Text Mining
 
Application Of Sentiment Lexicons On Movies Transcripts To Detect Violence In...
Application Of Sentiment Lexicons On Movies Transcripts To Detect Violence In...Application Of Sentiment Lexicons On Movies Transcripts To Detect Violence In...
Application Of Sentiment Lexicons On Movies Transcripts To Detect Violence In...
 
Word embedding for detecting cyberbullying based on recurrent neural networks
Word embedding for detecting cyberbullying based on recurrent neural networksWord embedding for detecting cyberbullying based on recurrent neural networks
Word embedding for detecting cyberbullying based on recurrent neural networks
 
Automatic video censoring system using deep learning
Automatic video censoring system using deep learningAutomatic video censoring system using deep learning
Automatic video censoring system using deep learning
 
IRJET- Authentic News Summarization
IRJET-  	  Authentic News SummarizationIRJET-  	  Authentic News Summarization
IRJET- Authentic News Summarization
 
BINARY TEXT CLASSIFICATION OF CYBER HARASSMENT USING DEEP LEARNING
BINARY TEXT CLASSIFICATION OF CYBER HARASSMENT USING DEEP LEARNINGBINARY TEXT CLASSIFICATION OF CYBER HARASSMENT USING DEEP LEARNING
BINARY TEXT CLASSIFICATION OF CYBER HARASSMENT USING DEEP LEARNING
 
IRJET- Fake News Detection
IRJET- Fake News DetectionIRJET- Fake News Detection
IRJET- Fake News Detection
 
Integrated approach to detect spam in social media networks using hybrid feat...
Integrated approach to detect spam in social media networks using hybrid feat...Integrated approach to detect spam in social media networks using hybrid feat...
Integrated approach to detect spam in social media networks using hybrid feat...
 
IRJET - Profanity Statistical Analyzer
 IRJET -  	  Profanity Statistical Analyzer IRJET -  	  Profanity Statistical Analyzer
IRJET - Profanity Statistical Analyzer
 
Detection and Minimization Influence of Rumor in Social Network
Detection and Minimization Influence of Rumor in Social NetworkDetection and Minimization Influence of Rumor in Social Network
Detection and Minimization Influence of Rumor in Social Network
 
SECUREWALL-A FRAMEWORK FOR FINEGRAINED PRIVACY CONTROL IN ONLINE SOCIAL NETWORKS
SECUREWALL-A FRAMEWORK FOR FINEGRAINED PRIVACY CONTROL IN ONLINE SOCIAL NETWORKSSECUREWALL-A FRAMEWORK FOR FINEGRAINED PRIVACY CONTROL IN ONLINE SOCIAL NETWORKS
SECUREWALL-A FRAMEWORK FOR FINEGRAINED PRIVACY CONTROL IN ONLINE SOCIAL NETWORKS
 
L017146571
L017146571L017146571
L017146571
 
An Automated Model to Detect Fake Profiles and botnets in Online Social Netwo...
An Automated Model to Detect Fake Profiles and botnets in Online Social Netwo...An Automated Model to Detect Fake Profiles and botnets in Online Social Netwo...
An Automated Model to Detect Fake Profiles and botnets in Online Social Netwo...
 
Ijcatr04041017
Ijcatr04041017Ijcatr04041017
Ijcatr04041017
 
Visual Relation Identification Using BoFT Labels in Social Media Feeds
Visual Relation Identification Using BoFT Labels in Social Media FeedsVisual Relation Identification Using BoFT Labels in Social Media Feeds
Visual Relation Identification Using BoFT Labels in Social Media Feeds
 
IRJET- An Effective Analysis of Anti Troll System using Artificial Intell...
IRJET-  	  An Effective Analysis of Anti Troll System using Artificial Intell...IRJET-  	  An Effective Analysis of Anti Troll System using Artificial Intell...
IRJET- An Effective Analysis of Anti Troll System using Artificial Intell...
 

More from IOSR-JEN

More from IOSR-JEN (20)

C05921721
C05921721C05921721
C05921721
 
B05921016
B05921016B05921016
B05921016
 
A05920109
A05920109A05920109
A05920109
 
J05915457
J05915457J05915457
J05915457
 
I05914153
I05914153I05914153
I05914153
 
H05913540
H05913540H05913540
H05913540
 
F05912731
F05912731F05912731
F05912731
 
E05912226
E05912226E05912226
E05912226
 
D05911621
D05911621D05911621
D05911621
 
C05911315
C05911315C05911315
C05911315
 
B05910712
B05910712B05910712
B05910712
 
A05910106
A05910106A05910106
A05910106
 
B05840510
B05840510B05840510
B05840510
 
I05844759
I05844759I05844759
I05844759
 
H05844346
H05844346H05844346
H05844346
 
G05843942
G05843942G05843942
G05843942
 
F05843238
F05843238F05843238
F05843238
 
E05842831
E05842831E05842831
E05842831
 
D05842227
D05842227D05842227
D05842227
 
C05841121
C05841121C05841121
C05841121
 



IOSR Journal of Engineering (IOSRJEN) www.iosrjen.org
ISSN (e): 2250-3021, ISSN (p): 2278-8719
Vol. 05, Issue 09 (September. 2015), ||V1|| PP 32-34
International organization of Scientific Research 32 | Page

Predicting spam videos using predictive analysis.
P. Sai Kiran, Dathala Irwin Emmanuel
(Computer Science and Engineering, Vidya Jyothi Institute of Technology (JNTU-H), TG, India)
(Computer Science and Engineering, Vidya Jyothi Institute of Technology (JNTU-H), TG, India)

Abstract: Social networking has become a popular way for users to meet and interact online. Users spend a significant amount of time on popular social network platforms (such as Facebook, MySpace, or Twitter), storing and sharing personal information. This information also attracts the interest of cybercriminals. Spam detection has seen considerable development in recent times. This paper asks whether a video on a social video platform can be left out of the spam check altogether, that is, whether its status as spam or not can be predicted without analyzing it.

Keywords: YouTube, Spammers, video spam, social network, Supervised Machine Learning, Machine Learning, SVM, video predictions, predictive analysis.

I. INTRODUCTION
Over the last few years, social networking sites have become one of the main ways for users to keep track of and communicate with their friends online. Sites such as Facebook, MySpace, and Twitter are consistently among the top 20 most-visited sites of the Internet. Moreover, statistics show that, on average, users spend more time on popular social networking sites than on any other site [1]. Most social networks provide mobile platforms that allow users to access their services from mobile phones, making access to these sites ubiquitous. The tremendous increase in popularity of social networking sites allows them to collect a huge amount of personal information about the users, their friends, and their habits.

Unfortunately, this amount of information, as well as the ease with which one can reach many users, has also attracted the interest of malicious parties. In particular, spammers are always looking for ways to reach new victims with their unsolicited messages. A market survey of user perception of spam on social networks found that, in 2008, 83% of social network users had received at least one unwanted friend request or message [2].

By allowing users to publicize and share their independently generated content, social video sharing systems may become susceptible to different types of malicious and opportunistic user actions, such as self-promotion, video aliasing and video spamming [3]. A video response spam is defined as a video posted as a response to an opening video, but whose content is completely unrelated to the opening video. Video spammers are motivated to spam in order to promote specific content, advertise to generate sales, disseminate pornography (often as an advertisement) or compromise the system reputation. Ultimately, users cannot easily identify a video spam before watching at least a segment of it, thus consuming system resources, in particular bandwidth, and compromising user patience and satisfaction with the system. Thus, identifying video spam is a challenging problem in social video sharing systems.

This paper specifically addresses the question, 'Do we really need to check and analyze every video to predict whether it is spam or not?' My approach is to take some attributes of the videos and attach threshold values to them, which may help answer this question.

The rest of the paper is organized as follows. The next section gives an overview of the background, followed by the user test collection section, which discusses how the test data was collected. Lastly, the predictive analysis section discusses the algorithm and the results.

II. BACKGROUND
Mechanisms to detect and identify spam and spammers have been widely studied in the context of web [4, 5] and email spamming [6]. In particular, Castillo et al. [4] proposed a framework to detect web spamming which uses social network metrics. A framework to detect spamming in tagging systems, which is a type of attack that aims at raising the visibility of specific objects, was proposed in [7]. Although applicable to social media sharing systems that allow object tagging by users, such as YouTube, the proposed technique exploits a specific object attribute, i.e., its tags. A survey of approaches to combat spamming in social web sites is presented in [8]. Many existing approaches are based on extracting evidence from the content of a text, treating the text corpus as a set of objects with associated attributes and using these attributes to detect spam.
These techniques, based on content classification, can be directly applied to textual information, and thus can be used to detect spam in email, text commentaries in blogs, forums, and online social networking sites.

Complementary to my effort, the characterization of the traffic to online video sharing systems, in particular YouTube, has also been the focus of some studies. An in-depth analysis of popularity distribution, popularity evolution and content characteristics of YouTube and of a popular Korean video sharing service is presented in [9]. The authors also analyze mechanisms to improve video distribution, such as caching and peer-to-peer distribution schemes. Gill et al. [10] present a characterization of the YouTube traffic collected from the University of Calgary campus network and compare its properties with those previously reported for web and media streaming workloads. Both studies focus on traffic and video characterization.

III. USER TEST COLLECTION
This paper is a continuation of my previous work in the same area, and many results are compared directly. The user collection process is the same, and its results are presented in [11]. In addition, attributes such as the number of likes and comments were collected.

IV. PREDICTIVE ANALYSIS
In general, videos whose view count exceeds a threshold of 10,000 were found to be non-spam in 85% of cases when compared with the results from [11].

ALGORITHM:
input: A list of information about users.
    Initialize threshold values for no_of_likes and viewcount;
    foreach User U in info-list do
        if likes greater than no_of_likes and view greater than viewcount then
            do natural language processing on every comment;
            if result is positive
                label it as not spam;
            end
            if result is negative
                label it as spam;
            end
        end
    end

To increase the precision, attributes such as likes and comments are also considered. Requiring that a video has more than 70% likes increases the precision to 90%. Additionally, analyzing the comments with a natural language processing library such as the Natural Language Toolkit [12] to differentiate positive from negative comments further increased the precision to 93.6%.

V. CONCLUSION
Deploying an automatic engine such as this will help the spam classifier engine skip videos that are predicted to be not spam. This not only increases efficiency but also ensures minimal use of resources.

Issues with this method: For smaller data sets, it has been observed that analyzing comments sometimes led to a decrease in accuracy. One way to address this problem is to have a large data set for training SVMs. My work in this area will continue, and in future I will work on the above-mentioned issue.

REFERENCES
[1] Alexa top 500 global sites. http://www.alexa.com/topsites.
[2] Harris Interactive Public Relations Research. A study of social networks scams. 2008.
[3] M. Cha, H. Kwak, P. Rodriguez, Y. Ahn, and S. Moon. I tube, you tube, everybody tubes: Analyzing the world’s largest user generated content video system. In Proc. of IMC, 2007.
[4] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. In Int’l ACM SIGIR, pages 423–430, 2007.
[5] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Int’l. Conf. on Very Large Data Bases, pages 576–587, 2004.
[6] L. Gomes, F. Castro, V. Almeida, J. Almeida, R. Almeida, and L. Bettencourt. Improving spam detection based on structural similarity. In Proc. of SRUTI, 2005.
[7] G. Koutrika, F. Effendi, Z. Gyöngyi, P. Heymann, and H. Garcia-Molina. Combating spam in tagging systems. In Proc. of AIRWeb, 2007.
[8] P. Heymann, G. Koutrika, and H. Garcia-Molina. Fighting spam on social web sites: A survey of approaches and future challenges. IEEE Internet Computing, 11(6):36–45, 2007.
[9] M. Cha, H. Kwak, P. Rodriguez, Y. Ahn, and S. Moon. I tube, you tube, everybody tubes: Analyzing the world’s largest user generated content video system. In Proc. of IMC, 2007.
[10] P. Gill, M. Arlitt, Z. Li, and A. Mahanti. YouTube traffic characterization: A view from the edge. In Proc. of IMC, 2007.
[11] P. Sai Kiran. Detecting spammers in YouTube: A study to find spam content in a video platform. ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 05, Issue 07 (July. 2015), ||V4|| PP 26-30.
[12] Natural Language Toolkit. http://www.nltk.org
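
ILLUSTRATIVE SKETCH
The threshold-and-comment procedure of Section IV can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the Video record, the predict function, the tiny word lists standing in for a real sentiment model (such as one built with the Natural Language Toolkit [12]), and the "needs_check" outcome for videos that do not meet the thresholds are all hypothetical. Reading "70% likes" as likes / (likes + dislikes) is likewise an assumption.

```python
from dataclasses import dataclass, field

# Thresholds quoted in Section IV. Treating "70% likes" as
# likes / (likes + dislikes) is an assumption of this sketch.
VIEW_THRESHOLD = 10_000
LIKE_RATIO_THRESHOLD = 0.70

# Hypothetical stand-in lexicons; a real system would use a trained
# sentiment model instead of these word lists.
POSITIVE_WORDS = {"great", "good", "love", "helpful", "nice"}
NEGATIVE_WORDS = {"spam", "scam", "fake", "unrelated", "bad"}


@dataclass
class Video:
    views: int
    likes: int
    dislikes: int
    comments: list[str] = field(default_factory=list)


def comment_score(comment: str) -> int:
    """Count positive minus negative lexicon words in one comment."""
    words = comment.lower().split()
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    return positives - negatives


def predict(video: Video) -> str:
    """Return 'not_spam', 'spam', or 'needs_check' for a single video."""
    total_votes = video.likes + video.dislikes
    like_ratio = video.likes / total_votes if total_votes else 0.0
    if video.views > VIEW_THRESHOLD and like_ratio > LIKE_RATIO_THRESHOLD:
        score = sum(comment_score(c) for c in video.comments)
        if score > 0:
            return "not_spam"
        if score < 0:
            return "spam"
    # Below the thresholds, or neutral comments: defer to the full
    # spam classifier (this fallback is not specified in the paper).
    return "needs_check"
```

Only videos that clear both thresholds are labeled from their comments. Everything else is returned as "needs_check" so that a full classifier, such as the SVM-based one mentioned in the conclusion, can still examine it.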