We propose an automatic classification system of movie genres based on different features from their textual synopsis. Our system is first trained on thousands of movie synopsis from online open databases, by learning relationships between textual signatures and movie genres. Then it is tested on other movie synopsis, and its results are compared to the true genres obtained from the Wikipedia and the Open Movie Database
(OMDB) databases. The results show that our algorithm achieves a classification accuracy exceeding 75%.
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEkevig
We propose an automatic classification system of movie genres based on different features from their textual
synopsis. Our system is first trained on thousands of movie synopsis from online open databases, by learning relationships between textual signatures and movie genres. Then it is tested on other movie synopsis,
and its results are compared to the true genres obtained from the Wikipedia and the Open Movie Database
(OMDB) databases. The results show that our algorithm achieves a classification accuracy exceeding 75%.
Process the sentiments of NLP with Naive Bayes Rule, Random Forest, Support Vector Machine, and much more.
Thanks, for your time, if you enjoyed this short slide there are tons of topics in advanced analytics, data science, and machine learning available in my medium repo. https://medium.com/@bobrupakroy
Detailed documented with the definition of text mining along with challenges, implementing modeling techniques, word cloud and much more.
Thanks, for your time, if you enjoyed this short video there are tons of topics in advanced analytics, data science, and machine learning available in my medium repo. https://medium.com/@bobrupakroy
Extracts objects from the entire collection, without modifying the objects themselves. where the goal is to select whole sentences (without modifying them)
The project is developed as a part of IRE course work @IIIT-Hyderabad.
Team members:
Aishwary Gupta (201302216)
B Prabhakar (201505618)
Sahil Swami (201302071)
Links:
https://github.com/prabhakar9885/Text-Summarization
http://prabhakar9885.github.io/Text-Summarization/
https://www.youtube.com/playlist?list=PLtBx4kn8YjxJUGsszlev52fC1Jn07HkUw
http://www.slideshare.net/prabhakar9885/text-summarization-60954970
https://www.dropbox.com/sh/uaxc2cpyy3pi97z/AADkuZ_24OHVi3PJmEAziLxha?dl=0
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEkevig
We propose an automatic classification system of movie genres based on different features from their textual
synopsis. Our system is first trained on thousands of movie synopsis from online open databases, by learning relationships between textual signatures and movie genres. Then it is tested on other movie synopsis,
and its results are compared to the true genres obtained from the Wikipedia and the Open Movie Database
(OMDB) databases. The results show that our algorithm achieves a classification accuracy exceeding 75%.
Process the sentiments of NLP with Naive Bayes Rule, Random Forest, Support Vector Machine, and much more.
Thanks, for your time, if you enjoyed this short slide there are tons of topics in advanced analytics, data science, and machine learning available in my medium repo. https://medium.com/@bobrupakroy
Detailed documented with the definition of text mining along with challenges, implementing modeling techniques, word cloud and much more.
Thanks, for your time, if you enjoyed this short video there are tons of topics in advanced analytics, data science, and machine learning available in my medium repo. https://medium.com/@bobrupakroy
Extracts objects from the entire collection, without modifying the objects themselves. where the goal is to select whole sentences (without modifying them)
The project is developed as a part of IRE course work @IIIT-Hyderabad.
Team members:
Aishwary Gupta (201302216)
B Prabhakar (201505618)
Sahil Swami (201302071)
Links:
https://github.com/prabhakar9885/Text-Summarization
http://prabhakar9885.github.io/Text-Summarization/
https://www.youtube.com/playlist?list=PLtBx4kn8YjxJUGsszlev52fC1Jn07HkUw
http://www.slideshare.net/prabhakar9885/text-summarization-60954970
https://www.dropbox.com/sh/uaxc2cpyy3pi97z/AADkuZ_24OHVi3PJmEAziLxha?dl=0
Automatic text summarization is the process of reducing the text content and retaining the
important points of the document. Generally, there are two approaches for automatic text summarization:
Extractive and Abstractive. The process of extractive based text summarization can be divided into two
phases: pre-processing and processing. In this paper, we discuss some of the extractive based text
summarization approaches used by researchers. We also provide the features for extractive based text
summarization process. We also present the available linguistic preprocessing tools with their features,
which are used for automatic text summarization. The tools and parameters useful for evaluating the
generated summary are also discussed in this paper. Moreover, we explain our proposed lexical chain
analysis approach, with sample generated lexical chains, for extractive based automatic text summarization.
We also provide the evaluation results of our system generated summary. The proposed lexical chain
analysis approach can be used to solve different text mining problems like topic classification, sentiment
analysis, and summarization.
Explore detailed Topic Modeling via LDA Laten Dirichlet Allocation and their steps.
Thanks, for your time, if you enjoyed this short video there are tons of topics in advanced analytics, data science, and machine learning available in my medium repo. https://medium.com/@bobrupakroy
Extraction of Data Using Comparable Entity Miningiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Following are the questions which I tried to answer in this ppt
What is text summarization.
What is automatic text summarization?
How it has evolved over the time?
What are different methods?
How deep learning is used for text summarization?
business application
in first few slides extractive summarization is explained, with pro and cons in next section abstractive on is explained.
In the last section business application of each one is highlighted
Most of the text classification problems are associated with multiple class labels and hence automatic text
classification is one of the most challenging and prominent research area. Text classification is the
problem of categorizing text documents into different classes. In the multi-label classification scenario,
each document is associated may have more than one label. The real challenge in the multi-label
classification is the labelling of large number of text documents with a subset of class categories. The
feature extraction and classification of such text documents require an efficient machine learning algorithm
which performs automatic text classification. This paper describes the multi-label classification of product
review documents using Structured Support Vector Machine.
Nowadays Sentiment Analysis play an important Role in each field such as Stock market, product reviews, news article, political debates which help us to determining current trend in the market regarding specific product, event, issues. Here we are apply sentiment analysis on microblogging platforms such as twitter, Facebook which is used by different people to express their opinion with respect to different kind of foods in the field of home’schef. This paper explain different methods of text preprocessing and applies them with a naive Bayes classifier in a big data, distributed computing platform with the goal of creating a scalable sentiment analysis solution that can classify text into positive or negative categories. We apply negation handling, word n-grams, stemming, and feature selection to evaluate how different combinations of these pre-processing methods affect performance and efficiency.
An on-going project on Natural Language Processing (using Python and the NLTK toolkit), which focuses on the extraction of sentiment from a Question and its title on www.stackoverflow.com and determining the polarity.Based on the above findings, it is verified whether the rules and guidelines imposed by the SO community on the users are strictly followed or not.
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
Natural Language Processing (NLP) techniques are one of the most used techniques in the field of computer applications. It has become one of the vast and advanced techniques. Language is the means of communication or interaction among humans and in present scenario when everything is dependent on machine or everything is computerized, communication between computer and human has become a necessity. To fulfill this necessity NLP has been emerged as the means of interaction which narrows the gap between machines (computers) and humans. It was evolved from the study of linguistics which was passed through the Turing test to check the similarity between data but it was limited to small set of data. Later on various algorithms were developed along with the concept of AI (Artificial Intelligence) for the successful execution of NLP. In this paper, the main emphasis is on the different techniques of NLP which have been developed till now, their applications and the comparison of all those techniques on different parameters.
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITIONkevig
The aim of Named Entity Recognition (NER) is to identify references of named entities in unstructured documents, and to classify them into pre-defined semantic categories. NER often aids from added background knowledge in the form of gazetteers. However using such a collection does not deal with name variants and cannot resolve ambiguities associated in identifying the entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts with identifying named entities with a small set of training data. Using the identified named entities, the word and the context features are used to define the pattern. This pattern of each named entity category is used as a seed pattern to identify the named entities in the test set. Pattern scoring and tuple value score enables the generation of the new patterns to identify the named entity categories. We have evaluated the proposed system for English language with the dataset of tagged (IEER) and untagged (CoNLL 2003) named entity corpus and for Tamil language with the documents from the FIRE corpus and yield an average f-measure of 75% for both the languages.
Automatic text summarization is the process of reducing the text content and retaining the
important points of the document. Generally, there are two approaches for automatic text summarization:
Extractive and Abstractive. The process of extractive based text summarization can be divided into two
phases: pre-processing and processing. In this paper, we discuss some of the extractive based text
summarization approaches used by researchers. We also provide the features for extractive based text
summarization process. We also present the available linguistic preprocessing tools with their features,
which are used for automatic text summarization. The tools and parameters useful for evaluating the
generated summary are also discussed in this paper. Moreover, we explain our proposed lexical chain
analysis approach, with sample generated lexical chains, for extractive based automatic text summarization.
We also provide the evaluation results of our system generated summary. The proposed lexical chain
analysis approach can be used to solve different text mining problems like topic classification, sentiment
analysis, and summarization.
Explore detailed Topic Modeling via LDA Laten Dirichlet Allocation and their steps.
Thanks, for your time, if you enjoyed this short video there are tons of topics in advanced analytics, data science, and machine learning available in my medium repo. https://medium.com/@bobrupakroy
Extraction of Data Using Comparable Entity Miningiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Following are the questions which I tried to answer in this ppt
What is text summarization.
What is automatic text summarization?
How it has evolved over the time?
What are different methods?
How deep learning is used for text summarization?
business application
in first few slides extractive summarization is explained, with pro and cons in next section abstractive on is explained.
In the last section business application of each one is highlighted
Most of the text classification problems are associated with multiple class labels and hence automatic text
classification is one of the most challenging and prominent research area. Text classification is the
problem of categorizing text documents into different classes. In the multi-label classification scenario,
each document is associated may have more than one label. The real challenge in the multi-label
classification is the labelling of large number of text documents with a subset of class categories. The
feature extraction and classification of such text documents require an efficient machine learning algorithm
which performs automatic text classification. This paper describes the multi-label classification of product
review documents using Structured Support Vector Machine.
Nowadays Sentiment Analysis play an important Role in each field such as Stock market, product reviews, news article, political debates which help us to determining current trend in the market regarding specific product, event, issues. Here we are apply sentiment analysis on microblogging platforms such as twitter, Facebook which is used by different people to express their opinion with respect to different kind of foods in the field of home’schef. This paper explain different methods of text preprocessing and applies them with a naive Bayes classifier in a big data, distributed computing platform with the goal of creating a scalable sentiment analysis solution that can classify text into positive or negative categories. We apply negation handling, word n-grams, stemming, and feature selection to evaluate how different combinations of these pre-processing methods affect performance and efficiency.
An on-going project on Natural Language Processing (using Python and the NLTK toolkit), which focuses on the extraction of sentiment from a Question and its title on www.stackoverflow.com and determining the polarity.Based on the above findings, it is verified whether the rules and guidelines imposed by the SO community on the users are strictly followed or not.
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
Natural Language Processing (NLP) techniques are one of the most used techniques in the field of computer applications. It has become one of the vast and advanced techniques. Language is the means of communication or interaction among humans and in present scenario when everything is dependent on machine or everything is computerized, communication between computer and human has become a necessity. To fulfill this necessity NLP has been emerged as the means of interaction which narrows the gap between machines (computers) and humans. It was evolved from the study of linguistics which was passed through the Turing test to check the similarity between data but it was limited to small set of data. Later on various algorithms were developed along with the concept of AI (Artificial Intelligence) for the successful execution of NLP. In this paper, the main emphasis is on the different techniques of NLP which have been developed till now, their applications and the comparison of all those techniques on different parameters.
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITIONkevig
The aim of Named Entity Recognition (NER) is to identify references of named entities in unstructured documents, and to classify them into pre-defined semantic categories. NER often aids from added background knowledge in the form of gazetteers. However using such a collection does not deal with name variants and cannot resolve ambiguities associated in identifying the entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts with identifying named entities with a small set of training data. Using the identified named entities, the word and the context features are used to define the pattern. This pattern of each named entity category is used as a seed pattern to identify the named entities in the test set. Pattern scoring and tuple value score enables the generation of the new patterns to identify the named entity categories. We have evaluated the proposed system for English language with the dataset of tagged (IEER) and untagged (CoNLL 2003) named entity corpus and for Tamil language with the documents from the FIRE corpus and yield an average f-measure of 75% for both the languages.
This is an introduction to text analytics for advanced business users and IT professionals with limited programming expertise. The presentation will go through different areas of text analytics as well as provide some real work examples that help to make the subject matter a little more relatable. We will cover topics like search engine building, categorization (supervised and unsupervised), clustering, NLP, and social media analysis.
In this paper, I develop a custom binary classifier of search queries for the makeup category using different Machine Learning techniques and models. An extensive exploration of shallow and Deep Learning models was performed using a cross-validation framework to identify the top three models, optimize them tuning their hyperparameters, and finally creating an ensemble of models with a custom decision threshold that outperforms all other models. The final classifier achieves an accuracy of 98.83% on a test set, making it ready for production.
Abstract: Text classification or categorization is the process of automatically tagging a textual document with most relevant labels or categories. When the number of labels is restricted to one, the task becomes single-label text categorization. However, the multi-label version is challenging. For Arabic language, both tasks (especially the latter one) become more challenging in the absence of large and free Arabic rich and rational datasets. We used sub data from a large Single-labeled Arabic News Articles Dataset (SANAD) of textual data collected from three news portals. The dataset is a large one consisting of almost 200k articles distributed into seven categories that we offer to the research community on Arabic computational linguistics.
Code and data in this link:
https://github.com/hezam2022/text-categorization
Conceptual Similarity Measurement Algorithm For Domain Specific OntologyZac Darcy
This paper presents the similarity measurement algorithm for domain specific terms collected in the
ontology based data integration system. This similarity measurement algorithm can be used in ontology
mapping and query service of ontology based data integration system. In this paper, we focus on the web
query service to apply this proposed algorithm. Concepts similarity is important for web query service
because the words in user input query are not same wholly with the concepts in ontology. So, we need to
extract the possible concepts that are match or related to the input words with the help of machine readable
dictionary WordNet. Sometimes, we use the generated mapping rules in query generation procedure for
some words that cannot be confirmed the similarity of these words by WordNet. We prove the effect of this
algorithm with two degree semantic result of web mining by generating the concepts results obtained form
the input query.
Conceptual similarity measurement algorithm for domain specific ontology[Zac Darcy
This paper presents the similarity measurement algorithm for domain specific terms collected in the
ontology based data integration system. This similarity measurement algorithm can be used in ontology
mapping and query service of
ontology based data integration sy
stem. In this paper, we focus
o
n the web
query service to apply
this proposed algorithm
. Concepts similarity is important for web query service
because the words in user input query are not
same wholly with the concepts in
ontology. So, we need to
extract the possible concepts that are match or related to the input words with the help of machine readable
dictionary WordNet. Sometimes, we use the generated mapping rules in query generation procedure for
some words that canno
t be
confirmed the similarity of these words
by WordNet. We prove the effect
of this
algorithm with two degree semantic result of web minin
g by generating
the concepts results obtained form
the input query
With the rapidly increasing growth in the field of internet and web usage, it has become essential to use a certain specific powerful tool, which should be capable to analyze and rank all these available reviews/opinion on the web/Internet. In this paper we have propose a new and effective approach which uses a powerful sentiment analysis procedure which will be based on an ontological adjustment and arrangements. This study also aims to understand pos tag order to get detailed observation for any review or opinion, it also helps in identifying all present positive /Negative sentiments and suggest a proper sentence inclination. For this we have used reviews available on internet regarding Nokia and Stanford parser for the purpose or pos tagging.
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSgerogepatton
Document similarity is an important part of Natural Language Processing and is most commonly used for
plagiarism-detection and text summarization. Thus, finding the overall most effective document similarity
algorithm could have a major positive impact on the field of Natural Language Processing. This report sets
out to examine the numerous document similarity algorithms, and determine which ones are the most
useful. It addresses the most effective document similarity algorithm by categorizing them into 3 types of
document similarity algorithms: statistical algorithms, neural networks, and corpus/knowledge-based
algorithms. The most effective algorithms in each category are also compared in our work using a series of
benchmark datasets and evaluations that test every possible area that each algorithm could be used in.
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSgerogepatton
Document similarity is an important part of Natural Language Processing and is most commonly used for
plagiarism-detection and text summarization. Thus, finding the overall most effective document similarity
algorithm could have a major positive impact on the field of Natural Language Processing. This report sets
out to examine the numerous document similarity algorithms, and determine which ones are the most
useful. It addresses the most effective document similarity algorithm by categorizing them into 3 types of
document similarity algorithms: statistical algorithms, neural networks, and corpus/knowledge-based
algorithms. The most effective algorithms in each category are also compared in our work using a series of
benchmark datasets and evaluations that test every possible area that each algorithm could be used in.
Neural Network Based Context Sensitive Sentiment AnalysisEditor IJCATR
Social media communication is evolving more in these days. Social networking site is being rapidly increased in recent years, which provides platform to connect people all over the world and share their interests. The conversation and the posts available in social media are unstructured in nature. So sentiment analysis will be a challenging work in this platform. These analyses are mostly performed in machine learning techniques which are less accurate than neural network methodologies. This paper is based on sentiment classification using Competitive layer neural networks and classifies the polarity of a given text whether the expressed opinion in the text is positive or negative or neutral. It determines the overall topic of the given text. Context independent sentences and implicit meaning in the text are also considered in polarity classification.
The most integral part of our work is to extract Aspects from User Feedback and associate Sentiment and Opinion terms to them. The dataset we have at our disposal to work upon, is a set of feedback documents for various departments in a Hospital in XML format which have comments represented in tags. It contains about 65000 responses to a survey taken in a Hospital. Every response or comment is treated as a sentence or a set of them. We perform a sentence level aspect and sentiment extraction and we attempt to understand and mine User Feedback data to gather aspects from it. Further to it, we extract the sentiment mentions and evaluate them contextually for sentiment and associate those sentiment mentions with the corresponding aspects. To start with, we perform a clean up on the User Feedback data, followed by aspect extraction and sentiment polarity calculation, with the help of POS tagging and SentiWordNet filters respectively. The obtained sentiments are further classified according to a set of Linguistic rules and the scores are normalized to nullify any noise that might be present. We lay emphasis on using a rule based approach; rules being Linguistic rules that correspond to the positioning of various parts-of-speech words in a sentence.
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalMauro Dragoni
The presentation provides an overview of what an ontology is and how it can be used for representing information and for retrieving data with a particular focus on the linguistic resources available for supporting this kind of task. Overview of semantic-based retrieval approaches by highlighting the pro and cons of using semantic approaches with respect to classic ones. Use cases are presented and discussed
Sentiment classification aims to detect information such as opinions, explicit , implicit feelings expressed
in text. The most existing approaches are able to detect either explicit expressions or implicit expressions of
sentiments in the text separately. In this proposed framework it will detect both Implicit and Explicit
expressions available in the meeting transcripts. It will classify the Positive, Negative, Neutral words and
also identify the topic of the particular meeting transcripts by using fuzzy logic. This paper aims to add
some additional features for improving the classification method. The quality of the sentiment classification
is improved using proposed fuzzy logic framework .In this fuzzy logic it includes the features like Fuzzy
rules and Fuzzy C-means algorithm.The quality of the output is evaluated using the parameters such as
precision, recall, f-measure. Here Fuzzy C-means Clustering technique measured in terms of Purity and
Entropy. The data set was validated using 10-fold cross validation method and observed 95% confidence
interval between the accuracy values .Finally, the proposed fuzzy logic method produced more than 85 %
accurate results and error rate is very less compared to existing sentiment classification techniques.
Text document clustering and similarity detection is the major part of document management, where every document should be identified by its key terms and domain knowledge. Based on the similarity, the documents are grouped into clusters. For document similarity calculation there are several approaches were proposed in the existing system. But the existing system is either term based or pattern based. And those systems suffered from several problems. To make a revolution in this challenging environment, the proposed system presents an innovative model for document similarity by applying back propagation time stamp algorithm. It discovers patterns in text documents as higher level features and creates a network for fast grouping. It also detects the most appropriate patterns based on its weight and BPTT performs the document similarity measures. Using this approach, the document can be categorized easily. In order to perform the above, a new approach is used. This helps to reduce the training process problems. The above framework is named as BPTT. The BPTT has implemented and evaluated using dot net platform with different set of datasets.
Similar to A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE (20)
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
Event Management System Vb Net Project Report.pdfKamal Acharya
In present era, the scopes of information technology growing with a very fast .We do not see any are untouched from this industry. The scope of information technology has become wider includes: Business and industry. Household Business, Communication, Education, Entertainment, Science, Medicine, Engineering, Distance Learning, Weather Forecasting. Carrier Searching and so on.
My project named “Event Management System” is software that store and maintained all events coordinated in college. It also helpful to print related reports. My project will help to record the events coordinated by faculties with their Name, Event subject, date & details in an efficient & effective ways.
In my system we have to make a system by which a user can record all events coordinated by a particular faculty. In our proposed system some more featured are added which differs it from the existing system such as security.
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
Democratizing Fuzzing at Scale by Abhishek Aryaabh.arya
Presented at NUS: Fuzzing and Software Security Summer School 2024
This keynote talks about the democratization of fuzzing at scale, highlighting the collaboration between open source communities, academia, and industry to advance the field of fuzzing. It delves into the history of fuzzing, the development of scalable fuzzing platforms, and the empowerment of community-driven research. The talk will further discuss recent advancements leveraging AI/ML and offer insights into the future evolution of the fuzzing landscape.
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Final project report on grocery store management system..pdfKamal Acharya
In today’s fast-changing business environment, it’s extremely important to be able to respond to client needs in the most effective and timely manner. If your customers wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website, which retails various grocery products. This project allows viewing various products available enables registered users to purchase desired products instantly using Paytm, UPI payment processor (Instant Pay) and also can place order by using Cash on Delivery (Pay Later) option. This project provides an easy access to Administrators and Managers to view orders placed using Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of Technologies must be studied and understood. These include multi-tiered architecture, server and client-side scripting techniques, implementation technologies, programming language (such as PHP, HTML, CSS, JavaScript) and MySQL relational databases. This is a project with the objective to develop a basic website where a consumer is provided with a shopping cart website and also to know about the technologies used to develop such a website.
This document will discuss each of the underlying technologies to create and implement an e- commerce website.
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...Amil Baba Dawood bangali
Contact with Dawood Bhai Just call on +92322-6382012 and we'll help you. We'll solve all your problems within 12 to 24 hours and with 101% guarantee and with astrology systematic. If you want to take any personal or professional advice then also you can call us on +92322-6382012 , ONLINE LOVE PROBLEM & Other all types of Daily Life Problem's.Then CALL or WHATSAPP us on +92322-6382012 and Get all these problems solutions here by Amil Baba DAWOOD BANGALI
#vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore#blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #blackmagicforlove #blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #Amilbabainuk #amilbabainspain #amilbabaindubai #Amilbabainnorway #amilbabainkrachi #amilbabainlahore #amilbabaingujranwalan #amilbabainislamabad
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxR&R Consult
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: At a large natural gas-fired power plant, where they use waste heat to generate steam and energy, they were puzzled that their boiler wasn't producing as much steam as expected.
R&R and Tetra Engineering Group Inc. were asked to solve the issue with reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfKamal Acharya
The College Bus Management system is completely developed by Visual Basic .NET Version. The application is connect with most secured database language MS SQL Server. The application is develop by using best combination of front-end and back-end languages. The application is totally design like flat user interface. This flat user interface is more attractive user interface in 2017. The application is gives more important to the system functionality. The application is to manage the student’s details, driver’s details, bus details, bus route details, bus fees details and more. The application has only one unit for admin. The admin can manage the entire application. The admin can login into the application by using username and password of the admin. The application is develop for big and small colleges. It is more user friendly for non-computer person. Even they can easily learn how to manage the application within hours. The application is more secure by the admin. The system will give an effective output for the VB.Net and SQL Server given as input to the system. The compiled java program given as input to the system, after scanning the program will generate different reports. The application generates the report for users. The admin can view and download the report of the data. The application deliver the excel format reports. Because, excel formatted reports is very easy to understand the income and expense of the college bus. This application is mainly develop for windows operating system users. In 2017, 73% of people enterprises are using windows operating system. So the application will easily install for all the windows operating system users. The application-developed size is very low. The application consumes very low space in disk. Therefore, the user can allocate very minimum local disk space for this application.
Forklift Classes Overview by Intella PartsIntella Parts
Discover the different forklift classes and their specific applications. Learn how to choose the right forklift for your needs to ensure safety, efficiency, and compliance in your operations.
For more technical information, visit our website https://intellaparts.com
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL). A Govt. owned Company of Bangladesh Chemical Industries Corporation under Ministry of Industries.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
1. International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018
DOI: 10.5121/ijnlc.2018.7503 26
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON
MAJORITY VOTE
Roger Booto Tokime1 and Eric Hervet1
1Perception, Robotics, and Intelligent Machines Research Group (PRIME), Computer
Science, Universite´ de Moncton, Moncton, NB, Canada
ABSTRACT
We propose an automatic classification system of movie genres based on different features from their textual
synopsis. Our system is first trained on thousands of movie synopsis from online open databases, by learn-
ing relationships between textual signatures and movie genres. Then it is tested on other movie synopsis,
and its results are compared to the true genres obtained from the Wikipedia and the Open Movie Database
(OMDB) databases. The results show that our algorithm achieves a classification accuracy exceeding 75%.
KEYWORDS
Movie Synopsis, Text Classifier, Information Retrieval, Text Mining.
1. INTRODUCTION
Automatic classification of text is known to be a difficult task for a computer. Usually, algorithms in
the fieldtry to regroup a given set of texts into genres or categories that can contain subcategories (e.g.
News articles can be separated in sports, news, weather, etc.), or even regroup them together to form
a new one. This can be accomplished by the use of several methods and algorithms that are usually
based on text analysis, word grouping, sentence segmentation, etc. Automatic text classi- fication is
widely tackled in the field of classifying textual data from the web [6] [9] [8], and some research has
been done on classifying other textual data such as books summaries [2]. Though there are many
algorithms and strategies already in place, with some using advanced approaches like machine
learning [3], it is still difficult for algorithms to classify correctly texts into genres. The main
reason for this has been explained by Chandler in his Introduction to Genre Theory: ”There are no
rigid rules of inclusion or exclusion. Genres are not discrete systems consisting of a fixed number
of listable items”. While working on a computer vision homework, we had to implement an
algorithm for the feature extraction task. We used the Hough transform to do so, and we got to
study the purpose of this technique. Wikipedia gives a very good description of its purpose: ”The
purpose of the technique is to find imperfect instances of objects within a certain class of shapes
by a voting procedure. This voting procedure is carried out in a parameter space, from which
object candidates are obtained as local maxima in a so-called accumulator space, that is explicitly
constructed by the algorithm for computing the Hough transform.” [12]. What is important to keep
in mind is that in computer vision, we are able to find features in an image by letting multiples
algorithms vote to what they consider a good feature, and at the end keeping only the most rated.
We then thought to apply this principle to classify text into genres, which is the main matter of this
work. This paper shows how to make use of a combination of features in order to build a majority
voting system that finds automatically the category(ies) which a text belongs to. Our work focuses
on the classification of movie synopsis intogenres. First, we are going to introduce a new concept,
the movie signature. A movie signature is defined as the list of categories found by the system after
2. International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018
27
voting. Now, let’s have an overview of the struc- ture of this paper, which is organized as follows:
First, we describe how the data collection has been retrieved and processed before being presented to
the system. Second, we present the different algorithms compared to compute the similarity
between movie synopsis. Then we present the results that show how the majority voting system
reaches a precision exceed- ing 75%. Finally, we conclude with a discussion about possible
improvements and other application do- mains where our system could be used to classify other
types of documents than movie synopsis
2. RELATED WORK
This paper is about automatic classification of movie synopsis into genres. To do so, we use a
novel approach that combines the strengths of several similarity algorithms (that could be referred as
voters) from the domain of Natural Language Processing (NLP) in order to obtain a better
classification. Each of these algorithms taken individually performs great on some specific data,
but is not able to process efficiently other various kind of data. The goal of our method is to
generalize the classification of textual data into pre-labeled categories. To do so, we explore
different NLP methods that classify textual data based on specific features. The methods chosen as
eligible to vote are the ones that perform the best in their category. We have the following feature
comparison themes: term frequencies, sets, vector-based and strings. Those features are the most
used in the field of information retrieval and text classification.
2.1 Term frequencies comparison
In order to know if a textual query is a member of a category, the first thing we can think of is to
count the number of words from the textual query and then sum it up. This sum can be used as a
measure of the similarity between the textual query and the pre-labeled database.
2.1.1 Term frequency
Term frequency is at first glance a good idea because we would expect that the words representing a
target category appear more often than the others, but it is not that simple. Term frequency has a
big defect due to its definition: Indeed, by counting only the number of words from the textual
query to a category, e.g., Action, the frequent words in the English language which bring no
relevant information such as “the”, “a”, “about”, etc. are the words that come up most often. To
avoid this problem we could consider only the words repeating with a certain frequency, but then it
rises the question of the right threshold to use. To bypass this issue, we use a list of stop-words
from the English language [7]. A stop-word is defined as a word that comes up most often in
various texts of a given language and provides no relevant information to categorize the text. Now
that we are able to filer stop-words, here is the mathematical expression that we use to calculate
the similarity between two texts, based on the term frequency. Let us assume that w is a word from
the query that occurs in a target category, cat (reset at each run of an algorithm) is the list of
categories, and x is a given target category, e.g., Action. We have that:
This means that we use a binary classification for each word, according it is a stop-word or not.
Since in our case, a text query is a movie synopsis, cat represents a similarity measure between a
movie plot and all the different known genres in which we would like to know to what extent a
given synopsis applies to. This method is simple to implement and give good results. The main
3. International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018
28
.
concern about this method is that the quality of the results is really dependent on the stop-word
list, so in practice a prior knowledge and pre-processing of the genres, especially about what a
stop-word is in the field of movie synopsis, is necessary to achieve accurate classification results.
2.1.2 Term frequency inverse documentfrequency
Text classification by simply counting the frequencies of words, minus the stop-words, is then very
limited when previous knowledge of the classifying genres is not available. Since we are trying to
build an accurate voting system, we use the same basis as this type of algorithm, but we introduce
a much more robust term frequency (TF) similarity algorithm. Here, the approach is different in a
way that we consider the words of each kind of our genre list as independent entities, and their
influence during the classification is represented by some weight wt. This way of doing things is
called feature selection. Feature selection studies how to select a subset or list of attributes or
variables that are used to construct models describing data. Its purposes include dimensions
reduction, removing irrelevant and redundant features, reducing the amount of data needed for
learning, improving algorithm’s predictive accuracy, and increasing the constructed model’s
comprehensibility [14]. Our goal here is to reduce the vector dimension by automatically removing
irrelevant words. An irrelevant word is defined as a word that appears in at least 80% of the
categories. To do so, we use the inverse document frequency (IDF) measure. The IDF measure is
based on the document frequency (DF), which is simply the number of documents in which a word
w appears. The IDF can be computed as follows:
where D is the total number of document (genres), and the plus one at denominator is used in case a
word appears in none of the datasets, to avoid dividing by zero [5]. With this measurement, we can
simply remove irrelevant words that appear in at least 80% of the categories by checking if the result
of the ratio (DF/D) is less than 0.8. The calculation of the term frequency inverse document
frequency (TFIDF) is performed as follows:
TFIDF is a measure of the importance of a word w in a document or a set of documents. Let us
assume that w is a word from the query that occurs in a category, cat is the list of categories, and x
is a target category, e.g. Action. We have that
We have now a robust term frequency similarity measure that eliminates non relevant words to our
purpose from the entire vocabulary (over all documents).
2.2 Sets comparison
In this section, we present a new approach which does not compare the frequencies of words when
calculating the similarity, but simply checks the presence or the absence of a word. To do so, we
use the Jaccard coefficient.
4. International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018
29
2.2.1Jaccard similarity
The Jaccard coefficient, also known as the Tanimoto coefficient, is a measure of similarity based
on datasets. In the article ”Text Categorization using Jaccard Coefficient for Text Messages” [4], it is
shown that this kind of similarity is extremely simple and fast to compute, since it comes down to
the simple calculation of the division of the intersection by the union of two sets of data The values
of the Jaccard coefficient range from [0, 1], with 0 meaning that the two datasets are not similar,
and 1 that they are exactly the same. Let us assume that pq is the plot query, cat is the list of
categories and x is any target category, e.g., Action. We have that
J (x) measures the similarity between the set of categories and the set of words of the plot query.
Since our case, the cat sets are much larger than the pq set, the Jaccard coefficient can never reach
the value 1. But it still provides a good order of similarity between the different genres. Therefore,
the Jaccard similarity measure is one of the method we use in our combined voting system.
2.3 Comparison of the angle between vectors
At this stage, we change the way we look at our documents, and rather represent them as vectors
of words, also called bags of words. A bag of words representation allows us to calculate the
correlation between two bags, which is used as the similarity measure between the corresponding
documents. Calculating the correlation between bags or vectors of words is the same as calculating
the angle between two vectors, and is known as cosine similarity.
2.3.1 Cosine similarity
The cosine similarity measures the cosine of the angle between two vectors of words. Each vector is
an n-dimensional vector over the words W = {w1, w2, . .. , wn}, and each dimension represents a word
with its weight (usually TF) in the document. Also, the word vector similarity coefficient does not
vary if the vectors do not have the same lengths. This feature is especially convenient for us since
our datasets vocabulary is usually much bigger than the synopsis query. Because the weight of each
word is its TF, the angle of similarity between two vectors is non-negative and bounded in [0, 1]
[1]. Let us assume that v is the plot query vector, cat is the list of categories and cat(x) is any
specific category vector, e.g., Action. We have that:
This means that we consider the information as a vector of words rather than a set of words, and
the angle between the two vectors is used as a similarity metric.
2.4 String comparison
Here, we get deeper into our comparison scheme by going as low as comparing the strings from a
given plot query to the ones in the dataset. This level is the lowest we can go in order to compare
two sets of textual data. The idea here is to have an algorithm that performs well under the
following conditions: strings (words) with a small distance must be similar, and the order in which
the words are distributed in the text must not affect the similarity coefficient. The reason behind
5. International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018
30
these conditions is to give the algorithm the possibility to be used at large scale.
2.4.1 Dice similarity
Well-known algorithms, such as Edit Distance, Longest Common Substring and others, do not
perform well against these requirements. The Edit distance does respect the first requirement,
which states “words with asmall distance must be similar”, as it rates “FRANCE”and “QUEBEC” as
being similar with a distance of 6, compared to “FRANCE” and “REPUBLIC OF FRANCE” with
a distance of 12. The Longest Common Substring would give “FRANCE” and “REPUBLIC OF
FRANCE” quite a good rating of similarity with a distance of 6, but would fail at giving the same
similarity value for “FRENCH REPUBLIC”, “REPUBLIC OF FRANCE” and “REPUBLIC OF
CUBA” [10]. For these reasons, we decided to use the Dice coefficient as it does respect both
conditions. The Dice coefficient is defined as twice the number of common terms in the compared
strings, divided by the total number of terms in both strings [11]. Let us assume that pairs(x) is
function that generates the pairs of adjacent letters in the plot query, cat(x) is the same function for
the list of categories, and s1and s2 are respectively a plot query and a specific dataset genre
text, e.g., Action. We have that
Here, each plot and categories are decomposed in pairs of words, and the calculation works as
follows: it computes the percentage of pairs of words that can be found both in the plot s1and in the
category s2, then multiplies it by a factor of pairs that belongs to both s1 and s2.
3. DATASETS
The data used in this work are movie synopsis from Wikipedia [13]. The training data is a selection of
American, British, and French cinematographic works from 1920 to 2016. The collection of data is
made once and requires four steps: The first step consists in retrieving the information from the
Wikipedia website; The second step consists in filtering only the cinematographic movies, of
which at least one of the genres belongs to the following list of 19 genres: Action, Adventure,
Animation, Comedy, Crime, Drama, Fantasy, Family, Fiction, Historical, Horror, Mystery, Noir,
Romance, Science, Short, Thriller, War, and Western. The third step consists in repeating steps 1
and 2 with the data from the OMDB database, with the difference that this time the requests are
limited to the movie titles previously filtered. At the end of the third step, the number of movies
goes from 25561 to 9937, and after eliminating duplicates there is a total of 8492 movie titles
remaining. Those are the ones used to build the training dataset used in the fourth step. The fourth
step consists in the creation of 19-word banks corresponding to the 19 aforementioned genres. The
word banks are created as follows: Let T be a movie object, and cat a word bank. We have that:
The means we add the movie titles to the word bank to help the algorithms that calculate the
similarity between two documents. The same steps are performed when creating the test data. The
test data contains Canadian, Spanish, Australian, Japanese, and Chinese cinematographic works
from 2000 to 2015. A total of 793 different titles are used as test data. We would like to precise
that for this experiment, all textual information were taken from the english version of Wikipedia,
therefore the only language conflict we may have could be the translations of the synopsis into
english.
6. International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018
4. SIMILARITY MEASURES
To allow various algorithms to vote,
two objects and measure their level
characteristic features upon which
istics are dependent on the data, or
universally best for all kinds of clustering
that a combination of different features
approach is similar to the ones used
created pre-labeled databases. The
by different similarity calculation
rithms, the classification of the test
obtained by each of the similarity algorithms. More
compare similar features between
and angles between vectors of words.
the algorithms used and compared in this
5. RESULTS
Our method is a majority voting system, which means that each algorithm has the chance to
for its top candidates. The final election is conducted as follows: First, each algorithm receives the
same synopsis query, and classifies
algorithm are selected to create the
to the true genres, and the average
where MS represents the movie signature,
lenSum the true positive predictions for a
expected across the test set. We
candidates from each voter. The reason we limited our
mathematical: Indeed, we have 5 voters
case, if each voter elects different candidates, we get a movie signature length that is less than the
total number of genres, which means
reduces the precision, and enhances the probability to find a genre in which a film synopsis
belongs to. Table 1 shows the results
majority voting system on the precision
the measure of precision in the context of classification is computed as
International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018
5
EASURES
vote, a measure of similarity must be defined, in order
level of disparity or similarity. Usually, this measure depends on
which we want to make a comparison. In many cases, these
or on the problem context at hand, and there is no measure
clustering problems [1]. This is the reason we made the
features would perform a better classification of textual
used in supervised machine learning since it is based on
The percentage of belonging of a text to each category
calculation algorithms. However, unlike supervised machine learning
test data is performed by choosing the most significant
obtained by each of the similarity algorithms. More precisely, the algorithms used in this work
between two text documents based on: term frequencies, sets,
words. The following sections of this paper give some
the algorithms used and compared in this work.
Our method is a majority voting system, which means that each algorithm has the chance to
for its top candidates. The final election is conducted as follows: First, each algorithm receives the
classifies the query over all genres. Second, the top candidates
the movie signature. Third, the movie signature is compared
average accuracy is computed in the following way
represents the movie signature, GT the ground truth, N the number of test synopsis,
predictions for a given synopsis, and SumttT the total amount of genres
We conducted three elections where we selected the top 1, 2 and 3
The reason we limited our vote to the top 3 best candidates is purely
voters and 19 genres, and: 19
= 3. This means that in the
case, if each voter elects different candidates, we get a movie signature length that is less than the
means that there is a long list of elected candidates.
enhances the probability to find a genre in which a film synopsis
results ordered from top 1 to 3, while table 2 shows the
precision of the algorithm. This result is completely expected
the measure of precision in the context of classification is computed as
International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018
31
order to compare
this measure depends on
these character-
measure that is
the hypothesis
textual data. This
on previously
category is computed
learning algo-
significant categories
the algorithms used in this work
sets, bi-grams,
some details about
Our method is a majority voting system, which means that each algorithm has the chance to vote
for its top candidates. The final election is conducted as follows: First, each algorithm receives the
candidates from each
compared
the number of test synopsis,
the total amount of genres
we selected the top 1, 2 and 3
to the top 3 best candidates is purely
the worst
case, if each voter elects different candidates, we get a movie signature length that is less than the
Therefore, it
enhances the probability to find a genre in which a film synopsis
the effect of the
expected since
the measure of precision in the context of classification is computed as follows:
7. International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018
32
It is the ratio between the ground truth (truePositive), and the sum of our signature and the ground
truth. Our signature (predicted genres) length is significantly bigger than the ground truth when
considering the top choices, as shown in table 3, and because a false positive result is the set of
propositions that are in our signature, but not in the ground truth, we end up with a low precision.
This is a trade-off we had to make because finding the genre(s) which a synopsis truly belongs to is
more important in our case. So here is how we compute the value of falsePositive based on MS
(Movie Signature) and GT (Ground Truth):
Table 1 presents the best accuracies obtained when considering the top 1, 2, and 3 movie genres.
Table 2 compares our algorithm, Majority Vote, to the five other existing algorithms described
earlier, in terms of accuracy, precision, recall, and F-measure, also called F1-score. The F-measure is
defined as the harmonic mean of precision and recall :
Finally, Table 3 presents the average signature lengths for top 1, 2, and 3 movie genres based on
our algorithm.
Table 1. Top 1 to 3 best classification accuracies
Table 2.Performance of our proposed algorithm vs. the existing ones
8. International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018
33
Table 3 . Increase of the signature length with our method
6. CONCLUSION
In summary, we tackled a challenging problem in the field of text classification into multiple
genres. More specifically, we developed a new method that automatically assigns genres to a
movie synopsis. As explained in the introduction, the difficulty of this task comes from the fact
that there is no rigid boundaries between the genres. To solve this issue, we proposed a majority
voting system inspired by the Hough transform which is usually used in computer vision. Our
method is able to generalize text classification into a category with an accuracy exceeding 75%. To
obtain those results, we had to make some trade-off to the detriment of the precision of prediction.
Due to the absence of rigid rules to discriminate between genres, we put the focus on the accuracy
results, and outperformed the existing algorithms in the field. In addition, our system is language-
invariant up to the extent of the stop-word list used by the term frequency voting method. We
tested our method on movie synopsis with good results, and we believe that it could be used to
classify books, journals, and even websites.
The next step after this work is to use deep learning, especially long short-term memory (LSTM)
neural networks, to compare and classify movie synopsis.
REFERENCES
[1] ANNA, H. Similarity-measures for text document clustering.
[2] EMILY, J. Automated genre classification in literature, 2014.
[3] IKONOMAKIS, E., KOTSIANTIS, S., AND TAMPAKAS, V. Text classification using machine
learning techniques. WSEAS transactions on computers 4 (08 2005), 966–974.
[4] JADHAO, D. A. J. A. A. Text categorization using jaccard coefficient for text messages. International
Journal of Science and Research (IJSR) 5, 5 (May 2016), 2047–2050.
[5] KA-WING, H. Movies’ genres classification by synopsis.
[6] PHILIPP, P. Assessing approaches to genre classification, 2009.
[7] RANKS, N. L. Stopwords, visited 2018.
[8] S., Y., G., W., Y., Q., AND W., Z. Research and implement of classification algorithm on web text
mining. In Third International Conference on Semantics, Knowledge and Grid (SKG 2007) (Oct 2007), pp.
446–449.
[9] SANTINI, M. Automatic genre identification: Towards a flexible classification scheme.
9. International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018
34
[10] SIMON, W. How to strike a match, 1992.
[11] WAEL, H. G., AND ALY, A. F. Article: A survey of text similarity approaches. International Journal
of Computer Applications 68, 13 (April 2013), 13–18.
[12] WKIPEDIA. Hough transform, visited 2018.
[13] WKIPEDIA. https://www.wikipedia.org/, visited 2018.
[14] YANG, J. W. C. Text categorization based on a similarity approach.