Procedia - Social and Behavioral Sciences 177 (2015) 169 – 177
Available online at www.sciencedirect.com
ScienceDirect
1877-0428 © 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the Scientific Committee of GLOBE-EDU 2014.
doi:10.1016/j.sbspro.2015.02.373
Global Conference on Contemporary Issues in Education, GLOBE-EDU 2014, 12-14 July 2014,
Las Vegas, USA
Automatize Document Topic and Subtopic Detection with Support
of a Corpus
Metin Turana*, Coskun Sönmezb
a Computer Engineering Department, Yıldız Technical University, Esenler, 34220, Istanbul
b Computer Engineering Department, İstanbul Technical University, Maslak, 34469, Istanbul
Abstract
In this article, we propose a new method for automatic topic and subtopic detection in a document, called paragraph extension. In paragraph extension, a document is treated as a set of paragraphs, and a merging technique combines similar consecutive paragraphs until no similar consecutive paragraphs are left. Following this, the counts of related words in the merged paragraphs are summed to construct subtopic scores, using a corpus designed so that the words related to a subtopic can be found. Paragraph vectors are represented by subtopics instead of words. The subtopic of a paragraph is the most frequent subtopic in its paragraph vector, while the topic of the document is the most dispersive subtopic in the document. An experimental topic/subtopic corpus was constructed for the sport and education topics, and the corpus was supplemented with WordNet to obtain synonymous words. We evaluate the proposed method on a data set of 40 randomly selected documents from the education and sport topics. The experimental results show that the average topic detection success ratio is about 83% and the subtopic detection success ratio is about 68%.
Keywords: topic detection, text mining, document summarization
1. Introduction
The huge number of documents available needs to be processed, categorized and summarized to make life easier for researchers. Automatic document processing is still not capable of extracting information the way a human reader can.
* Metin Turan. Tel.:+90 05327623554.
E-mail address: turanmetin@gmail.com
Moreover, the importance of content in a document may vary from one reader to another. Automatizing the process requires both using textual properties and resolving the language structure, which is a genuinely difficult operation with current techniques.
Topic and subtopic detection is an important area of document processing. It tells us how relevant a document is to a searched subject, supports analysis of the document structure (well written or not), identifies the boundaries of the subtopics mentioned (for summarization) and captures word relatedness in a hierarchical order (events, subtopics and topics).
Different approaches to topic identification have been reported in the literature. The typical method for a single document is text segmentation, which segments text based on the similarity of adjacent sentences and detects the boundaries of subtopics (Choi, 2000; Hearst, 1997; Moens & Busser, 2001; Zhang, Guo, Gong & Xue, 2008). A popular method for topic identification in multiple documents is text clustering (Clifton, Cooley & Rennie, 2004; Radev, Jing, Stys & Tam, 2004).
A group of researchers worked on frequent word sequences (Liu, 2005; Yap, Loh, Shen & Liu, 2006), where a frequent word sequence is a sequence of words that appears in at least σ documents of a document set; frequent word pairs are used initially. Word clustering methods are used to find semantically related words: words with similar distributions are grouped together (Pingali & Varma, 2007) and form a topic (Amini & Usunier, 2007; Lin & Hovy, 2000). These models lack the semantic content information in documents and their summaries.
Concept-based classification (Blei, Ng, Jordan & Lafferty, 2011; Çelikyılmaz & Hakkani-Tur, 2011) is another approach applied to topic detection. Each sentence is associated with a multinomial distribution representing a sparse topic. A classifier separates the general and specific concepts (sLDA hidden topics), and finally sentences carrying general topics are selected for the summary.
Probabilistic topic modeling is also a research area in this context (Gong, Qu & Tian, 2010). In this approach, each subtopic is associated with a set of document words and each sentence's subtopic probability is evaluated. It goes beyond the classical TF-IDF and other textual property calculations.
An agglomerative clustering algorithm is used to establish a hierarchical topic tree (Xu, Quan, Zhang & Wang, 2011). This approach divides documents into segments and then merges similar segments up to a higher level. One of the older approaches applied to concept outlining is the Vector Space Model (VSM) (Ji, Luo, Wan & Gao, 2002). It addresses the fact that two words may have very different spellings but be close semantically ("network", "net", "system").
Latent semantic analysis (LSA) (Bellegarda, 2000; Deerwester, Dumais, Furnas, Landauer & Harshman, 1990) was proposed to extract the latent semantics of words and documents via singular value decomposition. Each document is projected into the LSA space and represented by a low-dimensional vector in which each entry acts as the weight on a specific topic (Hofmann, 1999).
A hierarchical segmentation model (HSM) (Chien & Chueh, 2012) characterizes heterogeneous topic information at the stream level and word variations at the document level. In this approach, contextual topic information is incorporated into stream-level segmentation.
New event detection is an interesting research subject in news summarization. If a news story includes novel information, it should be flagged as a recent event. For novelty detection, cosine-similarity, language-model and cover-coefficient methods have been compared to find the best result (Aksoy, Can & Kocberber, 2011).
Topic themes are an approach proposed by Harabagiu and Lacatusu (2010) to evaluate the role of correct topic detection in summarization quality. They define a topic theme as any type of knowledge representation that encodes some facet of the knowledge pertaining to a topic. In their work, they compare five known topic detection techniques with two newly proposed techniques.
The article is organized as follows: the next section summarizes the related work that inspired this study, Section 3 describes the research technique in detail, Section 4 presents the data set and the evaluation results, and the last two sections discuss the results and future work, respectively.
2. Related Work
The first attempt to use word frequency as an indicator of document properties goes back to Luhn's work (1958), which is accepted as a milestone in document summarization. He suggested weighting sentences as a function
of highly frequent words. A decade later, Edmundson (1969) suggested three methods (cue, title, location) to consider when evaluating sentence weights.
TF is the best-known and most stable algorithm for calculating word frequencies in a document. It tells us what the document is about, or in another sense what the most important issues in the document are. Summarization is a task that reduces document size, finds the most important and non-redundant information, and presents it in a readable order. The most informative sentences include the most frequent words. A question arises, however, when a word is not the most frequent one overall, but is the one seen most frequently across document units (sentences, paragraphs). This is captured by Inverse Document Frequency (IDF), accepted as an important indicator of document topic (Nakata, Ikeda, Ando & Okumura, 2002).

$TFd_{iw}$ is the count of how many times a word $d_w$ appears in $document_i$ (frequency), while IDF measures what percentage of all documents in a collection contain that word. In the IDF portion of formula (1), $N$ is the number of documents in the document set, while $Nd_w$ is the number of documents containing word $d_w$:

$$TFIDFd_{iw} = TFd_{iw} \cdot \log\frac{N}{Nd_w} \qquad (1)$$
The cut-off value for TF is another interesting issue: what threshold should be used to select the words that carry the more important information? The informative (meaningful) words in a document have been identified by some researchers (Balinsky, Balinsky & Simske, 2011; Matsuo, Ohsawa & Ishizuka, 2001).
To detect topic and event words, some researchers (M. Wang, C. Wang & Li, 2008; M. Wang, X. Wang & Li, 2009) use TF-IDF values and dispersion values together. They argue that if a word is an event word it appears in one or a few paragraphs of one document, but if it is a topic word it appears across whole documents. Formula (2) gives the dispersion value of a word w and denotes how evenly w appears across documents; m is the number of documents in the document set, and $mean_w$ is the average TF-IDF of word w over all documents:

$$Disp_w = \sqrt{\frac{\sum_{i=1}^{m}\left(TFIDFd_{iw} - mean_w\right)^{2}}{m}} \qquad (2)$$
Wang & Chan (2008) computed similarities between paragraphs and chose the longest one as a component of the final abstraction result. If similarity values are higher than a threshold, consecutive paragraphs can be merged or similar sentences can be extracted from these paragraphs. In formula (3), $Sim(P_i, P_{i+1})$ denotes the similarity between two consecutive paragraphs, and $V_i$ and $V_{i+1}$ denote the vectors used to represent them:

$$Sim(P_i, P_{i+1}) = Sim(V_i, V_{i+1}) = \frac{V_i \cdot V_{i+1}}{\sqrt{V_i \cdot V_i}\;\sqrt{V_{i+1} \cdot V_{i+1}}} \qquad (3)$$
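Since formula (3) is the standard cosine similarity, a short sketch over plain count vectors may help; the list-based vector representation is an assumption for illustration:

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity between two equal-length count vectors (formula 3)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```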
A hierarchical corpus system called L-level PAM (Gong, 2010) can be a sensible way to find topics. It has four levels: document set, super topic, sub-topic and word. The document set level holds the words of all documents. A super topic is a general topic name such as sports; a sub-topic contains the super topic's more specific words, and the word list contains the sub-topic's specific words. For example, if the super topic is sports, a sub-topic covers a sports branch such as tennis, and its word list contains tennis ball, tennis racket and tennis court.
3. Work
3.1. Corpus Design
In order to store the keywords (topics, subtopics) for the experimental subjects (education, sports), a corpus was constructed from a bulk of articles selected from the Internet. It is a tree structure in which the hierarchical relations between topics, subtopics and words are established. Fig. 1 shows the corpus design structure.
Fig. 1. Corpus Design
Initially, the corpus contains 2 topics (education and sport), 37 subtopics (16 subtopics for education, 21 for sport) and 2313 words in total. These topics have too broad a range of subtopics and words for all of them to be included in the corpus; they were deliberately chosen in order to test the system's success under insufficient corpus content.
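For illustration, the topic/subtopic/word hierarchy of Fig. 1 can be pictured as a nested mapping; the entries below are invented examples, not taken from the actual corpus:

```python
# Hypothetical fragment of the topic -> subtopic -> words hierarchy (Fig. 1).
corpus = {
    "sport": {
        "tennis":    ["racket", "court", "serve", "backhand"],
        "badminton": ["shuttlecock", "net", "smash"],
    },
    "education": {
        "e-learning": ["course", "online", "module"],
        "preschool":  ["kindergarten", "play", "teacher"],
    },
}

# Reverse index: word -> (topic, subtopic), used when scoring paragraphs.
word_index = {
    word: (topic, subtopic)
    for topic, subtopics in corpus.items()
    for subtopic, words in subtopics.items()
    for word in words
}
```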
In this project, we also use the WordNet corpus to obtain word synonyms in order to calculate term frequencies correctly. The assistance works as follows: if a word is not contained in our corpus, we use the WordNet synonym (synset) database to get the word's synonyms and match them against our corpus.
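A minimal sketch of this WordNet fallback, using the NLTK WordNet interface; NLTK itself and the word_index mapping from the sketch above are assumptions, since the paper does not state which WordNet API was used:

```python
from nltk.corpus import wordnet as wn  # requires the "wordnet" NLTK data package

def synonyms(word):
    """Collect lemma names from all WordNet synsets of a word."""
    return {
        lemma.name().replace("_", " ")
        for synset in wn.synsets(word)
        for lemma in synset.lemmas()
    }

def lookup_with_wordnet(word, word_index):
    """If the word is not in the corpus, try matching its WordNet synonyms."""
    if word in word_index:
        return word_index[word]
    for syn in synonyms(word):
        if syn in word_index:
            return word_index[syn]
    return None
```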
3.2. Topic and subtopic detection
The document is pre-processed initially (a rough sketch of these steps follows the list). Pre-processing includes:
- header selection
- sentence boundary identification
- stop-word elimination
- stemming
- paragraph boundary identification
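The sketch below assumes NLTK is used for tokenization, stop words and stemming; header selection is omitted here, and blank lines are assumed to mark paragraph boundaries:

```python
from nltk.corpus import stopwords              # requires "stopwords" NLTK data
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize  # requires "punkt" NLTK data

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Split a document into paragraphs of stemmed, stop-word-free tokens."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]  # paragraph boundaries
    processed = []
    for paragraph in paragraphs:
        tokens = []
        for sentence in sent_tokenize(paragraph):        # sentence boundaries
            for word in word_tokenize(sentence.lower()):
                if word.isalpha() and word not in stop_words:  # stop-word elimination
                    tokens.append(stemmer.stem(word))    # stemming
        processed.append(tokens)
    return processed
```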
As a special case, header words are counted twice when calculating word frequency. However, this doubling has no effect on the IDF calculation (the header weighting is only used for topic detection).
The paragraph is taken as the document unit, and a word vector representation is constructed only for paragraphs. Words in the paragraph vector that are related to the same subtopic (the corpus is checked) are then gathered together to find the subtopic counts for that paragraph. Finally, the paragraph is defined as a list of subtopics (a new, squeezed vector). The following example paragraph vector shows that paragraph P1 includes only subtopic_c and subtopic_e:

V1 = (0, 0, subtopic_c, 0, subtopic_e, 0)
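A sketch of squeezing a paragraph into a subtopic-count vector, reusing the hypothetical word_index and lookup_with_wordnet helpers from the earlier sketches:

```python
from collections import Counter

def paragraph_subtopic_vector(tokens, word_index):
    """Map each corpus word in a paragraph to its subtopic and count them."""
    counts = Counter()
    for word in tokens:
        entry = lookup_with_wordnet(word, word_index)
        if entry is not None:
            topic, subtopic = entry
            counts[subtopic] += 1
    return counts  # e.g. Counter({"tennis": 5, "e-learning": 3})

def paragraph_subtopic(tokens, word_index):
    """The subtopic of a paragraph is its most frequent subtopic."""
    counts = paragraph_subtopic_vector(tokens, word_index)
    return counts.most_common(1)[0][0] if counts else None
```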
Each paragraph is then compared with its consecutive paragraph for similarity, using the cosine function. If the similarity is above a threshold value (set to 0.7, following M. Wang, X. Wang & Li, 2009), the paragraphs are merged and a new paragraph vector is calculated for the merged paragraph. This operation continues until no merging is left.
The subtopics are obtained from the remaining paragraphs: the most frequent subtopic in a paragraph vector is assumed to be the subtopic of that paragraph.
Paragraph Merging Example: Imagine that we have found four subtopics (X, Y, Z, W) in a document. In this case,
the first paragraph P1 contains 5 words for subtopic X and 3 words for subtopic Y, while the second paragraph P2 contains 2 words for subtopic X and 1 word for subtopic Z. The vector representations for P1 and P2 are given below:

$$V_1 = (5_X, 3_Y, 0, 0) \qquad V_2 = (2_X, 0, 1_Z, 0)$$

The similarity of these paragraphs, calculated with the cosine function of formula (3), is:

$$Sim(P_1, P_2) = \frac{5 \cdot 2 + 3 \cdot 0 + 0 \cdot 1 + 0 \cdot 0}{\sqrt{5 \cdot 5 + 3 \cdot 3 + 0 + 0}\;\sqrt{2 \cdot 2 + 0 + 1 \cdot 1 + 0}} \approx 0.77$$

The paragraph similarity is over 0.7, so these paragraphs are merged.
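The merging pass itself can be sketched as a repeated left-to-right scan over the paragraph vectors, reusing cosine_similarity from above; the 0.7 threshold is the one stated in the text, and the Counter-based vectors are an assumption carried over from the earlier sketches:

```python
def merge_paragraphs(vectors, threshold=0.7):
    """Merge consecutive paragraph subtopic vectors (Counters) while any
    adjacent pair has cosine similarity at or above the threshold."""
    merged = True
    while merged:
        merged = False
        for i in range(len(vectors) - 1):
            keys = set(vectors[i]) | set(vectors[i + 1])
            v1 = [vectors[i].get(k, 0) for k in keys]
            v2 = [vectors[i + 1].get(k, 0) for k in keys]
            if cosine_similarity(v1, v2) >= threshold:
                vectors[i] = vectors[i] + vectors[i + 1]  # Counter addition sums counts
                del vectors[i + 1]
                merged = True
                break
    return vectors
```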
To calculate the dispersion value of all the subtopics found in the document, we rearranged formulas (1) and (2) given in the related work. In order to calculate TF and IDF values correctly within a single document, the formulas are adapted to paragraph level. In formula (4), P is the number of paragraphs in the document, $P_s$ is the number of paragraphs containing subtopic s, and i denotes the i-th paragraph of the document:

$$TFIDFp_{is} = TFp_{is} \cdot \log\frac{P}{P_s} \qquad (4)$$
To calculate the dispersion value of a candidate subtopic s, the dispersion formula given in (2) is likewise adapted to paragraph level. In formula (5), m is the number of paragraphs in the document and $mean_s$ is the average of subtopic s over all paragraphs of the document after subtopic detection:

$$DispP_s = \sqrt{\frac{\sum_{i=1}^{m}\left(TFIDFp_{is} - mean_s\right)^{2}}{m}} \qquad (5)$$
The topic of the document is taken to be the topic of the most dispersive subtopic in the document.
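Formulas (4) and (5) and the topic rule can be sketched together as follows; subtopic_topic, a mapping from each subtopic back to its parent topic in the corpus tree, is an assumed helper:

```python
import math

def document_topic(paragraph_vectors, subtopic_topic):
    """Pick the topic of the most dispersive subtopic (formulas 4 and 5).
    paragraph_vectors: list of subtopic-count Counters, one per merged paragraph."""
    p = len(paragraph_vectors)
    subtopics = set(s for vec in paragraph_vectors for s in vec)
    # Number of paragraphs containing each subtopic (P_s in formula 4)
    para_freq = {s: sum(1 for vec in paragraph_vectors if s in vec) for s in subtopics}

    best_subtopic, best_dispersion = None, -1.0
    for s in subtopics:
        # TFIDFp_is = TFp_is * log(P / P_s)
        tfidf = [vec.get(s, 0) * math.log(p / para_freq[s]) for vec in paragraph_vectors]
        mean_s = sum(tfidf) / p
        dispersion = math.sqrt(sum((v - mean_s) ** 2 for v in tfidf) / p)
        if dispersion > best_dispersion:
            best_subtopic, best_dispersion = s, dispersion
    return subtopic_topic.get(best_subtopic)
```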
4. Evaluation
4.1. Data set
A data set of 40 documents was randomly selected from Internet web sites (www.elearningtech.blogspot.com, www.articlesbase.com, www.badminton-information.com). The articles are about the sport and education experimental subjects. The sport document topics include tennis, badminton, archery and general sport information; the education document topics are science, preschool and e-learning. The selected documents have between 6 and 9 paragraphs, and the number of documents per subject is equal: 20 each.
A human reader extracted the subtopics and topics of the documents for comparison with the output of the developed experimental system.
4.2. Evaluation
The approach was implemented in an experimental system developed as a Computer Engineering final project (Turan, Kececi & Kesim, 2012). The system takes a document as input, separates it into paragraphs and finally outputs the subtopics of the remaining (merged) paragraphs and the topic of the document.
System performance (correct subtopic and topic detection) is evaluated under two different working conditions: with the corpus only, or with the corpus plus WordNet.
The results obtained from the experiments are presented as graphs in Figs. 2 and 3. In these graphs, the horizontal axis shows the documents and the vertical axis the accuracy percentage of subtopic detection in that document.
Fig. 2. Subtopic Detection only with the Corpus: (a) Sport Dataset; (b) Education Dataset
Fig. 3. Subtopic Detection with the Corpus plus WordNet: (a) Sport Dataset; (b) Education Dataset
Fig. 4 shows the details of the subtopic detection results in Figs. 2 and 3. In these graphs, the horizontal axis shows the subtopics and their total paragraph counts in the data set, while the vertical axis shows the accuracy of subtopic detection for those paragraphs. The results show that the average accuracy of subtopic detection is 61% for the education subject and 72% for sport.
Fig. 4. Subtopic Details of Subjects: (a) Sport Dataset; (b) Education Dataset
Fig. 5. Topic Detection: (a) Sport Dataset; (b) Education Dataset
Topics were determined for the 40 documents in the data set; Fig. 5 shows the results obtained. The average accuracy for topic detection is about 83%.
Paragraph merging is another important evaluation point in this work. The number of merged consecutive paragraphs was counted during document processing: 156 out of 271 paragraphs were merged. These numbers show that merging paragraphs decreases the number of vectors significantly.
5. Conclusion
In this study, a corpus is used to find the words that are related to the same subtopic. When word frequencies are calculated with this strategy, the word list is reduced and only a few subtopics are obtained for a document at the end of the process. The result is a small vector that is easy to handle. Moreover, the vector is constructed at the paragraph level, so the number of vectors is limited, at most, to the number of paragraphs in the document. This also decreases the time spent on vector operations.
A well-written document generally presents its subtopics in order. Merging consecutive paragraphs helps us group related parts of the text together. The experiments in this study show that more than 50% of consecutive paragraphs are merged. With this technique, the number of paragraph vectors is decreased significantly. In short, paragraph merging is important for both subtopic detection and summarization.
WordNet usage affects our results only insignificantly (Figs. 2 and 3). The accuracy of the system depends on the organization and coverage of the corpus for the subject being used. The study is being extended to extract words dynamically from the documents and select the more valuable ones into the corpus.
Much research has shown that topic models can achieve more than 80% accuracy in finding the correct topics of words in unsupervised ways (Li & McCallum, 2006). The accuracy percentages of two other studies (Nakata, 2002; Wartena & Brusse, 2008) are 77% and 75%, respectively. Compared with these results in the literature, the current system achieves higher accuracy; moreover, it offers a simple solution and fast calculation.
6. Future work
One of the aims of this work is to automatize corpus generation. The current corpus is human-made and can be extended to new subjects over time. Such an ability has been added to the system, but it currently still needs human assistance. In the future, corpus extension could be done heuristically.
Another research direction is to summarize single or multiple documents with the extracted information (topics and subtopics). Common subtopics across multiple documents can be ranked by importance to extract the more valuable sentences and produce a better extractive summary.
We observed that if a document is well written, it is organized so that paragraphs on the same topic are placed next to each other; in other words, paragraphs on the same topic are not scattered through the document. To decide on a document's writing style, this research can be extended using machine learning techniques.
References
Aksoy, C., Can, F. & Kocberber, S. (2011). Novelty detection for topic tracking, Journal of the American Society for Information Science and Technology, 777-795.
Amini, M.R. & Usunier, N. (2007). A contextual query expansion approach by term clustering for robust text summarization, in Proc. of Document Understanding Conference, 48-55.
Balinsky, H., Balinsky, A. & Simske, S. (2011). Document sentences as a small world, in Proc. SMC, 2583-2588.
Bellegarda, J. (2000). Exploiting latent semantic information in statistical language modeling, in Proc. IEEE, 88(8), 1279-1296.
Blei, D.M., Ng, A.Y., Jordan, M.I. & Lafferty, J. (2011). Latent Dirichlet allocation, Journal of Machine Learning Research, 993-1022.
Celikyilmaz, A. & Hakkani-Tur, D. (2011). Concept-based classification for multi-document summarization, Acoustics, Speech and Signal Processing (ICASSP), 5540-5543.
Chien, J.-T. & Chueh, C.-H. (2012). Topic-based hierarchical segmentation, IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 55-66.
Choi, F.Y.Y. (2000). Advances in domain independent linear text segmentation, In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics, 26-33.
Clifton, C., Cooley, R. & Rennie, J. (2004). TopCat: data mining for topic identification in a text corpus, IEEE Transactions on Knowledge and Data Engineering, 949-964.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. & Harshman, R. (1990). Indexing by latent semantic analysis, J. Amer. Soc. Inf. Sci., 41(6), 391-407.
Edmundson, H.P. (1969). New methods in automatic extracting, Journal of the ACM, 264-285.
Gong, S., Qu, Y. & Tian, S. (2010). Subtopic-based multi-documents summarization, Third International Joint Conference on Computational Science and Optimization, 382-386.
Harabagiu, S. & Lacatusu, F. (2010). Using topic themes for multi-document summarization, ACM Transactions on Information Systems, 28(3), 1-46.
Hearst, M.A. (1997). TextTiling: segmenting text into multi-paragraph subtopic passages, Computational Linguistics, 33-64.
Hofmann, T. (1999). Probabilistic latent semantic indexing, in Proc. ACM SIGIR, 50-57.
Ji, H., Luo, Z., Wan, M. & Gao, X. (2002). Summarizing based on concept counting and hierarchy analysis, IEEE SMC.
Li, W. & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations, In Proceedings of the 23rd International Conference on Machine Learning, 577-584.
Lin, Y.-C. & Hovy, E.H. (2000). The automated acquisition of topic signatures for text summarization, In COLING, 495-501.
Liu, Y. (2005). A Concept-based Text Classification System for Manufacturing Information Retrieval. (PhD thesis). National University of Singapore, Singapore.
Luhn, H.P. (1958). The automatic creation of literature abstracts, IBM Journal of Research and Development, 159-165.
Matsuo, Y., Ohsawa, Y. & Ishizuka, M. (2001). A document as a small world, Lecture Notes in Computer Science, 444-448.
Moens, M.-F. & Busser, R.D. (2001). Generic topic segmentation of document texts, In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 418-419.
Nakata, T., Ikeda, T., Ando, S. & Okumura, A. (2002). Topic detection based on dialogue history, COLING '02 Proceedings of the 19th International Conference on Computational Linguistics, 1-7.
Pingali, P., K, R. & Varma, V. (2007). IIIT Hyderabad at DUC 2007, In Proc. of Document Understanding Conference, NIST.
Radev, D.R., Jing, H., Stys, M. & Tam, D. (2004). Centroid-based summarization of multiple documents, Information Processing and Management, 193-207.
Turan, M., Kececi, O. & Kesim, A.E. (2012). Article (document) Topic and Subtopic Detection. (undergraduate thesis). İstanbul Kültür University, İstanbul.
Wang, H.-C. & Chan, Y.-C. (2008). On the abstraction and presentation of multi-source knowledge, Proceedings of the Seventh International Conference on Machine Learning and Cybernetics, 3307-3309.
Wang, M., Wang, C. & Li, Z.Z. (2008). Multi-document summarization based on word feature mining, International Conference on Computer Science and Software Engineering, 743-746.
Wang, M., Wang, X. & Li, C. (2009). Extracting multi-document summarization based on local topics, Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 238-241.
Wartena, C. & Brusse, R. (2008). Topic detection by clustering keywords, DEXA '08 Proceedings of the 2008 19th International Conference on Database and Expert Systems Application, 54-58.
Xu, Y.-D., Quan, G.-R., Zhang, T.B. & Wang, Y.-D. (2011). Used hierarchical topic to generate multi-document automatic summarization, Fourth International Conference on Intelligent Computation Technology and Automation, 295-298.
Yap, I., Loh, H.T., Shen, L. & Liu, Y. (2006). Topic detection using MFSs, Proceedings of the 19th International Conference on Industrial Engineering Applications of Artificial Intelligence & Expert Systems (IEA/AIE), Lecture Notes in Computer Science, 342-352.
Zhang, Y., Guo, J., Gong, H. & Xue, Z. (2008). Single-document automatic abstracting system based on topic partition, International Symposium on Intelligent Information Technology Application Workshops, 280-283.
More Related Content

Similar to Automatize Document Topic And Subtopic Detection With Support Of A Corpus

Web classification of Digital Libraries using GATE Machine Learning  
Web classification of Digital Libraries using GATE Machine Learning  	Web classification of Digital Libraries using GATE Machine Learning  
Web classification of Digital Libraries using GATE Machine Learning   sstose
 
NLP applicata a LIS
NLP applicata a LISNLP applicata a LIS
NLP applicata a LISnoemiricci2
 
ANALYSIS OF RHETORICAL MOVES OF JOURNAL ARTICLES AND ITS IMPLICATION TO THE T...
ANALYSIS OF RHETORICAL MOVES OF JOURNAL ARTICLES AND ITS IMPLICATION TO THE T...ANALYSIS OF RHETORICAL MOVES OF JOURNAL ARTICLES AND ITS IMPLICATION TO THE T...
ANALYSIS OF RHETORICAL MOVES OF JOURNAL ARTICLES AND ITS IMPLICATION TO THE T...April Knyff
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewINFOGAIN PUBLICATION
 
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISHDICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISHcscpconf
 
601-CriticalEssay-2-Portfolio Edition
601-CriticalEssay-2-Portfolio Edition601-CriticalEssay-2-Portfolio Edition
601-CriticalEssay-2-Portfolio EditionJordan Chapman
 
A Rhetorical Move Analysis Of TEFL Thesis Abstracts The Case Of Allameh Taba...
A Rhetorical Move Analysis Of TEFL Thesis Abstracts  The Case Of Allameh Taba...A Rhetorical Move Analysis Of TEFL Thesis Abstracts  The Case Of Allameh Taba...
A Rhetorical Move Analysis Of TEFL Thesis Abstracts The Case Of Allameh Taba...Amy Cernava
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...pathsproject
 
05 handbook summ-hovy
05 handbook summ-hovy05 handbook summ-hovy
05 handbook summ-hovySagar Dabhi
 
Keyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeKeyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeIJMTST Journal
 
Co word analysis
Co word analysisCo word analysis
Co word analysisdebolina73
 
Dictionary based concept mining an application for turkish
Dictionary based concept mining  an application for turkishDictionary based concept mining  an application for turkish
Dictionary based concept mining an application for turkishcsandit
 
A study on the approaches of developing a named entity recognition tool
A study on the approaches of developing a named entity recognition toolA study on the approaches of developing a named entity recognition tool
A study on the approaches of developing a named entity recognition tooleSAT Publishing House
 
A Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And ApplicationsA Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And ApplicationsLisa Graves
 
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION cscpconf
 
An exploration of the generic structures of problem statements in research ...
An exploration of the generic structures of problem   statements in research ...An exploration of the generic structures of problem   statements in research ...
An exploration of the generic structures of problem statements in research ...Alexander Decker
 
UNIVERSALITY IN TRANSLATION: AN ANALYSIS OF TRANSLATION INTERFERENCE IN MULTI...
UNIVERSALITY IN TRANSLATION: AN ANALYSIS OF TRANSLATION INTERFERENCE IN MULTI...UNIVERSALITY IN TRANSLATION: AN ANALYSIS OF TRANSLATION INTERFERENCE IN MULTI...
UNIVERSALITY IN TRANSLATION: AN ANALYSIS OF TRANSLATION INTERFERENCE IN MULTI...John1Lorcan
 
ONTOLOGICAL MODEL FOR CHARACTER RECOGNITION BASED ON SPATIAL RELATIONS
ONTOLOGICAL MODEL FOR CHARACTER RECOGNITION BASED ON SPATIAL RELATIONSONTOLOGICAL MODEL FOR CHARACTER RECOGNITION BASED ON SPATIAL RELATIONS
ONTOLOGICAL MODEL FOR CHARACTER RECOGNITION BASED ON SPATIAL RELATIONSsipij
 
An Approach To Assess The Existence Of A Proposed Intervention In Essay-Argum...
An Approach To Assess The Existence Of A Proposed Intervention In Essay-Argum...An Approach To Assess The Existence Of A Proposed Intervention In Essay-Argum...
An Approach To Assess The Existence Of A Proposed Intervention In Essay-Argum...Heather Strinden
 

Similar to Automatize Document Topic And Subtopic Detection With Support Of A Corpus (20)

Web classification of Digital Libraries using GATE Machine Learning  
Web classification of Digital Libraries using GATE Machine Learning  	Web classification of Digital Libraries using GATE Machine Learning  
Web classification of Digital Libraries using GATE Machine Learning  
 
NLP applicata a LIS
NLP applicata a LISNLP applicata a LIS
NLP applicata a LIS
 
Ny3424442448
Ny3424442448Ny3424442448
Ny3424442448
 
ANALYSIS OF RHETORICAL MOVES OF JOURNAL ARTICLES AND ITS IMPLICATION TO THE T...
ANALYSIS OF RHETORICAL MOVES OF JOURNAL ARTICLES AND ITS IMPLICATION TO THE T...ANALYSIS OF RHETORICAL MOVES OF JOURNAL ARTICLES AND ITS IMPLICATION TO THE T...
ANALYSIS OF RHETORICAL MOVES OF JOURNAL ARTICLES AND ITS IMPLICATION TO THE T...
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A Review
 
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISHDICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
 
601-CriticalEssay-2-Portfolio Edition
601-CriticalEssay-2-Portfolio Edition601-CriticalEssay-2-Portfolio Edition
601-CriticalEssay-2-Portfolio Edition
 
A Rhetorical Move Analysis Of TEFL Thesis Abstracts The Case Of Allameh Taba...
A Rhetorical Move Analysis Of TEFL Thesis Abstracts  The Case Of Allameh Taba...A Rhetorical Move Analysis Of TEFL Thesis Abstracts  The Case Of Allameh Taba...
A Rhetorical Move Analysis Of TEFL Thesis Abstracts The Case Of Allameh Taba...
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
 
05 handbook summ-hovy
05 handbook summ-hovy05 handbook summ-hovy
05 handbook summ-hovy
 
Keyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeKeyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood Knowledge
 
Co word analysis
Co word analysisCo word analysis
Co word analysis
 
Dictionary based concept mining an application for turkish
Dictionary based concept mining  an application for turkishDictionary based concept mining  an application for turkish
Dictionary based concept mining an application for turkish
 
A study on the approaches of developing a named entity recognition tool
A study on the approaches of developing a named entity recognition toolA study on the approaches of developing a named entity recognition tool
A study on the approaches of developing a named entity recognition tool
 
A Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And ApplicationsA Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And Applications
 
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
 
An exploration of the generic structures of problem statements in research ...
An exploration of the generic structures of problem   statements in research ...An exploration of the generic structures of problem   statements in research ...
An exploration of the generic structures of problem statements in research ...
 
UNIVERSALITY IN TRANSLATION: AN ANALYSIS OF TRANSLATION INTERFERENCE IN MULTI...
UNIVERSALITY IN TRANSLATION: AN ANALYSIS OF TRANSLATION INTERFERENCE IN MULTI...UNIVERSALITY IN TRANSLATION: AN ANALYSIS OF TRANSLATION INTERFERENCE IN MULTI...
UNIVERSALITY IN TRANSLATION: AN ANALYSIS OF TRANSLATION INTERFERENCE IN MULTI...
 
ONTOLOGICAL MODEL FOR CHARACTER RECOGNITION BASED ON SPATIAL RELATIONS
ONTOLOGICAL MODEL FOR CHARACTER RECOGNITION BASED ON SPATIAL RELATIONSONTOLOGICAL MODEL FOR CHARACTER RECOGNITION BASED ON SPATIAL RELATIONS
ONTOLOGICAL MODEL FOR CHARACTER RECOGNITION BASED ON SPATIAL RELATIONS
 
An Approach To Assess The Existence Of A Proposed Intervention In Essay-Argum...
An Approach To Assess The Existence Of A Proposed Intervention In Essay-Argum...An Approach To Assess The Existence Of A Proposed Intervention In Essay-Argum...
An Approach To Assess The Existence Of A Proposed Intervention In Essay-Argum...
 

More from Richard Hogue

Paper Mate Write Bros Ballpoint Pens, Medium P
Paper Mate Write Bros Ballpoint Pens, Medium PPaper Mate Write Bros Ballpoint Pens, Medium P
Paper Mate Write Bros Ballpoint Pens, Medium PRichard Hogue
 
Writing Phrases Best Essay Writing Service, Essay Writ
Writing Phrases Best Essay Writing Service, Essay WritWriting Phrases Best Essay Writing Service, Essay Writ
Writing Phrases Best Essay Writing Service, Essay WritRichard Hogue
 
Examples How To Write A Persuasive Essay - Acker
Examples How To Write A Persuasive Essay - AckerExamples How To Write A Persuasive Essay - Acker
Examples How To Write A Persuasive Essay - AckerRichard Hogue
 
Controversial Issue Essay. Controversial Issue Essay
Controversial Issue Essay. Controversial Issue EssayControversial Issue Essay. Controversial Issue Essay
Controversial Issue Essay. Controversial Issue EssayRichard Hogue
 
Best Tips On How To Write A Term Paper Outline, Form
Best Tips On How To Write A Term Paper Outline, FormBest Tips On How To Write A Term Paper Outline, Form
Best Tips On How To Write A Term Paper Outline, FormRichard Hogue
 
Formal Letter In English For Your Needs - Letter Templ
Formal Letter In English For Your Needs - Letter TemplFormal Letter In English For Your Needs - Letter Templ
Formal Letter In English For Your Needs - Letter TemplRichard Hogue
 
Get Essay Help You Can Get Essays Written For You By
Get Essay Help You Can Get Essays Written For You ByGet Essay Help You Can Get Essays Written For You By
Get Essay Help You Can Get Essays Written For You ByRichard Hogue
 
Sample Website Analysis Essay. Online assignment writing service.
Sample Website Analysis Essay. Online assignment writing service.Sample Website Analysis Essay. Online assignment writing service.
Sample Website Analysis Essay. Online assignment writing service.Richard Hogue
 
Pin By Cindy Campbell On GrammarEnglish Language E
Pin By Cindy Campbell On GrammarEnglish Language EPin By Cindy Campbell On GrammarEnglish Language E
Pin By Cindy Campbell On GrammarEnglish Language ERichard Hogue
 
How To Write Evaluation Paper. Self Evaluation Ess
How To Write Evaluation Paper. Self Evaluation EssHow To Write Evaluation Paper. Self Evaluation Ess
How To Write Evaluation Paper. Self Evaluation EssRichard Hogue
 
Pumpkin Writing Page (Print Practice) - Made By Teach
Pumpkin Writing Page (Print Practice) - Made By TeachPumpkin Writing Page (Print Practice) - Made By Teach
Pumpkin Writing Page (Print Practice) - Made By TeachRichard Hogue
 
What Is The Best Way To Write An Essay - HazelNe
What Is The Best Way To Write An Essay - HazelNeWhat Is The Best Way To Write An Essay - HazelNe
What Is The Best Way To Write An Essay - HazelNeRichard Hogue
 
The Importance Of Reading Books Free Essay Example
The Importance Of Reading Books Free Essay ExampleThe Importance Of Reading Books Free Essay Example
The Importance Of Reading Books Free Essay ExampleRichard Hogue
 
Narrative Essay Personal Leadership Style Essay
Narrative Essay Personal Leadership Style EssayNarrative Essay Personal Leadership Style Essay
Narrative Essay Personal Leadership Style EssayRichard Hogue
 
Thesis Introduction Examples Examples - How To Write A The
Thesis Introduction Examples Examples - How To Write A TheThesis Introduction Examples Examples - How To Write A The
Thesis Introduction Examples Examples - How To Write A TheRichard Hogue
 
Literature Review Thesis Statemen. Online assignment writing service.
Literature Review Thesis Statemen. Online assignment writing service.Literature Review Thesis Statemen. Online assignment writing service.
Literature Review Thesis Statemen. Online assignment writing service.Richard Hogue
 
008 Essay Writing Competitions In India Cust
008 Essay Writing Competitions In India Cust008 Essay Writing Competitions In India Cust
008 Essay Writing Competitions In India CustRichard Hogue
 
A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.
A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.
A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.Richard Hogue
 
Composition Writing Meaning. How To Write A D
Composition Writing Meaning. How To Write A DComposition Writing Meaning. How To Write A D
Composition Writing Meaning. How To Write A DRichard Hogue
 
Get Essay Writing Help At My Assignment Services By Our Highly
Get Essay Writing Help At My Assignment Services By Our HighlyGet Essay Writing Help At My Assignment Services By Our Highly
Get Essay Writing Help At My Assignment Services By Our HighlyRichard Hogue
 

More from Richard Hogue (20)

Paper Mate Write Bros Ballpoint Pens, Medium P
Paper Mate Write Bros Ballpoint Pens, Medium PPaper Mate Write Bros Ballpoint Pens, Medium P
Paper Mate Write Bros Ballpoint Pens, Medium P
 
Writing Phrases Best Essay Writing Service, Essay Writ
Writing Phrases Best Essay Writing Service, Essay WritWriting Phrases Best Essay Writing Service, Essay Writ
Writing Phrases Best Essay Writing Service, Essay Writ
 
Examples How To Write A Persuasive Essay - Acker
Examples How To Write A Persuasive Essay - AckerExamples How To Write A Persuasive Essay - Acker
Examples How To Write A Persuasive Essay - Acker
 
Controversial Issue Essay. Controversial Issue Essay
Controversial Issue Essay. Controversial Issue EssayControversial Issue Essay. Controversial Issue Essay
Controversial Issue Essay. Controversial Issue Essay
 
Best Tips On How To Write A Term Paper Outline, Form
Best Tips On How To Write A Term Paper Outline, FormBest Tips On How To Write A Term Paper Outline, Form
Best Tips On How To Write A Term Paper Outline, Form
 
Formal Letter In English For Your Needs - Letter Templ
Formal Letter In English For Your Needs - Letter TemplFormal Letter In English For Your Needs - Letter Templ
Formal Letter In English For Your Needs - Letter Templ
 
Get Essay Help You Can Get Essays Written For You By
Get Essay Help You Can Get Essays Written For You ByGet Essay Help You Can Get Essays Written For You By
Get Essay Help You Can Get Essays Written For You By
 
Sample Website Analysis Essay. Online assignment writing service.
Sample Website Analysis Essay. Online assignment writing service.Sample Website Analysis Essay. Online assignment writing service.
Sample Website Analysis Essay. Online assignment writing service.
 
Pin By Cindy Campbell On GrammarEnglish Language E
Pin By Cindy Campbell On GrammarEnglish Language EPin By Cindy Campbell On GrammarEnglish Language E
Pin By Cindy Campbell On GrammarEnglish Language E
 
How To Write Evaluation Paper. Self Evaluation Ess
How To Write Evaluation Paper. Self Evaluation EssHow To Write Evaluation Paper. Self Evaluation Ess
How To Write Evaluation Paper. Self Evaluation Ess
 
Pumpkin Writing Page (Print Practice) - Made By Teach
Pumpkin Writing Page (Print Practice) - Made By TeachPumpkin Writing Page (Print Practice) - Made By Teach
Pumpkin Writing Page (Print Practice) - Made By Teach
 
What Is The Best Way To Write An Essay - HazelNe
What Is The Best Way To Write An Essay - HazelNeWhat Is The Best Way To Write An Essay - HazelNe
What Is The Best Way To Write An Essay - HazelNe
 
The Importance Of Reading Books Free Essay Example
The Importance Of Reading Books Free Essay ExampleThe Importance Of Reading Books Free Essay Example
The Importance Of Reading Books Free Essay Example
 
Narrative Essay Personal Leadership Style Essay
Narrative Essay Personal Leadership Style EssayNarrative Essay Personal Leadership Style Essay
Narrative Essay Personal Leadership Style Essay
 
Thesis Introduction Examples Examples - How To Write A The
Thesis Introduction Examples Examples - How To Write A TheThesis Introduction Examples Examples - How To Write A The
Thesis Introduction Examples Examples - How To Write A The
 
Literature Review Thesis Statemen. Online assignment writing service.
Literature Review Thesis Statemen. Online assignment writing service.Literature Review Thesis Statemen. Online assignment writing service.
Literature Review Thesis Statemen. Online assignment writing service.
 
008 Essay Writing Competitions In India Cust
008 Essay Writing Competitions In India Cust008 Essay Writing Competitions In India Cust
008 Essay Writing Competitions In India Cust
 
A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.
A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.
A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.
 
Composition Writing Meaning. How To Write A D
Composition Writing Meaning. How To Write A DComposition Writing Meaning. How To Write A D
Composition Writing Meaning. How To Write A D
 
Get Essay Writing Help At My Assignment Services By Our Highly
Get Essay Writing Help At My Assignment Services By Our HighlyGet Essay Writing Help At My Assignment Services By Our Highly
Get Essay Writing Help At My Assignment Services By Our Highly
 

Recently uploaded

Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxUnboundStockton
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 

Recently uploaded (20)

Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docx
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 

Automatize Document Topic And Subtopic Detection With Support Of A Corpus

  • 1. Procedia - Social and Behavioral Sciences 177 (2015) 169 – 177 Available online at www.sciencedirect.com ScienceDirect 1877-0428 © 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Scientific Committee of GLOBE-EDU 2014. doi:10.1016/j.sbspro.2015.02.373 Global Conference on Contemporary Issues in Education, GLOBE-EDU 2014, 12-14 July 2014, Las Vegas, USA Automatize Document Topic and Subtopic Detection with Support of a Corpus Metin Turana* , Coskun Sönmezb a Computer Engineering Department,Yıldız Technical University, Esenler, 34220,Istanbul b Computer Engineering Department,İstanbul Technical University,Maslak, 34469, Istanbul Abstract In this article, we propose a new automatic topic and subtopic detection method from a document called paragraph extension. In paragraph extension, a document is considered as a set of paragraphs and a paragraph merging technique is used to merge similar consecutive paragraphs until no similar consecutive paragraphs left. Following this, similar word counts in merged paragraphs are summed up to construct subtopic scores by using a corpus which is designed so that we can find words related to a subtopic. The paragraph vectors are represented by subtopics instead of the words. The subtopic of a paragraph is the most frequent one in the paragraph vector. On the other hand, topic of the document is the most dispersive subtopic in the document. An experimental topic/subtopic corpus is constructed for sport and education topics. We also supported corpus by WordNet to obtain synonyms words. We evaluate the proposed method on a data set contains randomly selected 40 documents from the education and sport topics. The experiment results show that average of topic detection success ratio is about %83 and the subtopic detection is about %68. © 2014 The Authors. Published by Elsevier Ltd. Peer-review under responsibility of the Scientific Committee of GLOBE-EDU 2014. Keywords: topic detection, text mining, document summarization 1. Introduction The huge amount of documents needs to be processed, categorized and summarized to make life easier for researchers. Document processing age is still not capable of extracting information as a human reader. Moreover, * Metin Turan. Tel.:+90 05327623554. E-mail address: turanmetin@gmail.com © 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Scientific Committee of GLOBE-EDU 2014.
  • 2. 170 Metin Turan and Coskun Sönmez / Procedia - Social and Behavioral Sciences 177 (2015) 169 – 177 the importance of content in the document may also vary from one reader to another. In order to automatize the process needs to use both textual properties and resolve the language structure which is a really colicky operation with nowadays techniques. Topic and subtopics detection is an important working area of the document processing. It gives us the relativeness of the document for a searched subject, analysis of the document structure (well-written or not), the boundary of the subtopics mentioned (for summarization) and the word relativeness in a hierarchical (events, subtopics and topics) order. Different approaches of topic identification have been reported in the literature. The typical method in single document is text segmentation, which is to segment text based on the similarity of adjacent sentences and detect the boundary of subtopics (Choi, 2000; Hearst, 1997; Moens & Busser 2001; Zheng,Guo, Gong & Xue 2008). A popular method for topic identification in multiple documents is text clustering (Clifton, Cooley & Rennie, 2004; Radev, Jing, Stys & Tam, 2004). A group of researchers worked on frequent word sequence (Liu, 2005; Yap, Loh, Shen & Liu, 2006) which is a sequence of words that appears in at least σ documents in a document set. It uses also frequent word pairs initially. Word clustering methods are used to find semantically related words. Words with similar distributions are grouped together (Pingoli, & Varma, 2007) and form a topic (Amini & Usunier 2007; Lin, Hovy, 2000). These models lack the semantic content information in documents and their summaries. Concept-based classification (Blei, Ng, Jordan & Lafferty, 2011; Çelikyılmaz, Hakkani-Tur, 2011) is another approach applied to topic detection. Each sentence is associated with a multi-nominal distribution representing sparse-topic. A classifier separates the general and specific concepts (sLDA hidden topics). Finally, general topics sentences are selected for summary. Probabilistic topic modeling is also a research area in this context (Gong, Qu & Tian, 2010). In this approach, each subtopic is associated with a set of document words and each sentence subtopics probability is evaluated. It is more than the classical TF-IDF and other textual properties calculation. Agglomerative clustering algorithm is used to establish the hierarchical topic tree (Xu, Quan, Zhang & Wang, 2011). This approach divides documents into segments and then similar segments are merged up to a higher level. One of the old approaches applied for concept outline is Vector Space Model (VSM) (Ji, Lua, Wan & Gao, 2002). This is the fact that two words may have very different spelling but may be close semantically (“network”, “net”, “system”). The latent semantic analysis (LSA) (Bellegarda, 2000; DeerWester, Dumais, Furnas, Landauer & Harshman, 1990) is proposed to extract the latent semantics from words and documents via singular value decomposition. Each document is projected to LSA space and represented by a low-dimensional vector where each entry acted as the weight on a specific topic (Hofmann, 1999). A new hierarchical segmentation model (HSM) (Chien & Chueh, 2012) where the heterogeneous topic information in stream level and the word variations in document level are characterized. In this approach, the contextual topic information is incorporated in stream-level segmentation. 
New event detection is an interesting research subject in news summarization. If a news item includes novel information, it should be flagged as a recent event. For novelty detection, cosine-similarity, language-model and cover-coefficient methods have been compared to find the best result (Aksoy, Can & Kocberber, 2011). The topic theme is an approach proposed by Harabagiu and Lacatusu (2010) to evaluate the role of correct topic detection in summarization quality. They define a topic theme as any type of knowledge representation that encodes some facet of the knowledge pertaining to a topic. In their work, they compare five known topic detection techniques with two newly proposed techniques.

The article is organized as follows: the next chapter summarizes the related works that inspired this study. Chapter III presents the research technique and its details. Chapter IV describes the data set and the evaluation results. The last two chapters discuss the results and further work, respectively.

2. Related Work

The first attempt to use word frequency as an indicator of document content goes back to Luhn's work (1958), which is accepted as a milestone for document summarization. He suggested weighting sentences as a function of highly frequent words.
A decade later, Edmundson (1969) suggested three methods (cue, title, location) to consider when evaluating sentence weights.

TF is the best known and highly stable algorithm for calculating word frequencies in a document. It tells us what the document is about, or in another sense, what the most important issues in the document are. Summarization is the task of reducing document size, finding the most important and non-redundant information, and presenting it in a readable order. The most informative sentences include the most frequent words. However, a question arises when a word is not the most frequent one overall, but is seen most frequently across document units (sentences, paragraphs). This is captured by the Inverse Document Frequency (IDF), which is accepted as an important indicator of document topic (Nakata, Ikeda, Ando & Okumura, 2002). $TF_{d_{iw}}$ is the count of how many times a word $d_w$ appears in document $d_i$ (frequency), while IDF measures what percentage of all documents in a collection contains that word. In the IDF portion of formula (1), $N$ represents the number of documents in the document set, while $N_{d_w}$ shows the number of documents containing word $d_w$.

$$TFIDF_{d_{iw}} = TF_{d_{iw}} \cdot \log \frac{N}{N_{d_w}} \qquad (1)$$

The cut-off value for TF is another interesting issue: what is the threshold value for selecting words that carry more important information? The informative (meaningful) words in a document are identified by some researchers (Balinsky, Balinsky & Simske, 2011; Matsuo, Ohsawa & Ishizuka, 2001). To detect topic and event words, some researchers (M. Wang, C. Wang & Z.Z. Li, 2008; M. Wang, X. Wang & C. Li, 2009) use TF-IDF values and dispersion values together. They argue that if a word is an event word it appears in one or a few paragraphs of a single document, but if it is a topic word it appears throughout the documents. Formula (2) gives the dispersion value of a word w and denotes how evenly w appears across documents; m shows the number of documents in the document set, and $mean_w$ is the average TFIDF of the word w over all documents (a small illustrative sketch of formulas (1) and (2) is given at the end of this section).

$$Disp_w = \sqrt{\frac{\sum_{i=1}^{m}\left(TFIDF_{d_{iw}} - mean_w\right)^2}{m}} \qquad (2)$$

Wang and Chan (2008) found similarities between paragraphs and chose the longest one as a component of the final abstraction result. If similarity values are higher than a threshold value, consecutive paragraphs can be merged or similar sentences can be extracted from these paragraphs. In formula (3), Sim(P1, P2) denotes the similarity between two paragraphs P1 and P2, and V1 and V2 denote the vectors used to represent P1 and P2.

$$Sim(P_i, P_{i+1}) = Sim(V_i, V_{i+1}) = \frac{V_i \cdot V_{i+1}}{\sqrt{V_i \cdot V_i}\,\sqrt{V_{i+1} \cdot V_{i+1}}} \qquad (3)$$

A hierarchical corpus system called L-level PAM (Gong, Qu & Tian, 2010) can be a sensible way to find topics. It has four levels: document set, super topic, sub-topic and word. The document set covers the words of all documents; a super topic is a general topic name like sports; a sub-topic contains the super topic's specific branches; and the word list holds the sub-topics' specific words. For example, if the super topic is sports, a subtopic is a sports branch like tennis, and the word list contains tennis ball, tennis racket and tennis court.
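As a concrete illustration of formulas (1) and (2), the following is a minimal Python sketch, not part of the original paper; the whitespace tokenizer, the choice of natural logarithm, and the toy document set are assumptions made only for demonstration.

```python
import math
from collections import Counter

def tfidf_and_dispersion(documents, word):
    """TF-IDF of `word` in each document (formula 1) and its dispersion
    across the document set (formula 2). Natural log is assumed."""
    tokenized = [doc.lower().split() for doc in documents]   # naive whitespace tokenizer
    n = len(tokenized)                                       # N: documents in the set
    n_w = sum(1 for tokens in tokenized if word in tokens)   # N_dw: documents containing the word
    if n_w == 0:
        return [0.0] * n, 0.0
    tfidf = [Counter(tokens)[word] * math.log(n / n_w) for tokens in tokenized]  # formula (1)
    mean_w = sum(tfidf) / n
    disp_w = math.sqrt(sum((v - mean_w) ** 2 for v in tfidf) / n)                # formula (2)
    return tfidf, disp_w

# Toy run: "sport" is spread over the set, "racket" is concentrated in one document.
docs = ["the tennis match was long", "sport builds character",
        "a tennis racket and a ball", "sport and education policy"]
print(tfidf_and_dispersion(docs, "sport"))
print(tfidf_and_dispersion(docs, "racket"))
```

Under this sketch, words that are spread evenly over the set obtain low dispersion values, while words concentrated in a few documents obtain higher ones.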
3. Work

3.1. Corpus Design

In order to store the keywords (topics, subtopics) for the experimental subjects (education, sports), a corpus is constructed using a collection of articles selected from the Internet. It is a kind of tree structure in which the hierarchical relations between topics, subtopics and words are established. Fig. 1 shows the corpus design structure.

Fig. 1. Corpus Design

In the corpus, 2 topics (education and sport), 37 subtopics (16 subtopics for education, 21 subtopics for sport) and 2313 words in total are placed initially. The topics (education, sports) have too broad a range of subtopics and words to include all of them in the corpus; these topics were preferred precisely to test the system's success under insufficient corpus content. In this project, we also get assistance from the WordNet corpus to obtain word synonyms in order to calculate term frequencies correctly. The assistance works like this: if a word is not contained in our corpus, we use the WordNet synonym (synset) database to get the synonyms of the word and match them against our corpus.
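The following is a minimal sketch, an assumption for illustration rather than the authors' implementation, of how the topic → subtopic → word hierarchy of Fig. 1 might be stored and queried; the sample entries and the small synonym table standing in for the WordNet synset lookup are hypothetical.

```python
# Hypothetical excerpt of the topic -> subtopic -> word hierarchy (Fig. 1).
CORPUS = {
    "sport": {
        "tennis":  {"racket", "court", "serve", "net"},
        "archery": {"bow", "arrow", "target"},
    },
    "education": {
        "e-learning": {"online", "course", "module"},
        "preschool":  {"kindergarten", "play", "child"},
    },
}

# Tiny stand-in for the WordNet synset lookup described above.
SYNONYMS = {"racquet": {"racket"}, "kid": {"child"}}

def find_subtopic(word):
    """Return (topic, subtopic) for a word; if the word itself is not in the
    corpus, its synonyms are tried. Returns None when nothing matches."""
    candidates = {word} | SYNONYMS.get(word, set())
    for topic, subtopics in CORPUS.items():
        for subtopic, words in subtopics.items():
            if candidates & words:
                return topic, subtopic
    return None

print(find_subtopic("serve"))     # ('sport', 'tennis')
print(find_subtopic("racquet"))   # ('sport', 'tennis'), found via the synonym fallback
```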
3.2. Topic and subtopic detection

The document is pre-processed initially. Pre-processing includes:

• header selection
• sentence boundary identification
• stop-word elimination
• stemming
• paragraph boundary identification.

As a special case, header words are counted twice in the word frequencies. However, this doubling has no effect when IDF is calculated (IDF is only used for topic detection). The paragraph is taken as the document unit, and the word vector representation is constructed only for paragraphs. Words in the paragraph vector that are related to the same subtopic (the corpus is checked) are then grouped together to obtain subtopic counts for that paragraph. Finally, the paragraph is defined as a list of subtopics (a new, squeezed vector). The following example paragraph vector indicates that paragraph P1 includes only $subtopic_c$ and $subtopic_e$:

$$V_1 = (0, 0, subtopic_c, 0, subtopic_e, 0)$$

Afterwards, each paragraph is compared with its consecutive paragraph for similarity. The cosine function is used to evaluate similarity. If the similarity is above a threshold value (set to 0.7, following M. Wang, X. Wang & C. Li, 2009), the paragraphs are merged and a new paragraph vector is calculated for the merged paragraph. This operation continues until no merging is left. The subtopics are then obtained from the remaining paragraphs: the most frequent subtopic in a paragraph vector is assumed to be the subtopic of that paragraph.
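A minimal sketch of this merging step is given below (again an illustrative assumption, not the original code): paragraphs are represented as subtopic-count vectors, consecutive vectors are compared with the cosine function, and pairs above the 0.7 threshold are merged.

```python
import math
from collections import Counter

def subtopic_vector(paragraph_words, lookup, subtopics):
    """Turn a paragraph (list of stemmed words) into a subtopic-count vector,
    using a corpus lookup such as find_subtopic above."""
    hits = [lookup(w) for w in paragraph_words]
    counts = Counter(hit[1] for hit in hits if hit is not None)
    return [counts.get(s, 0) for s in subtopics]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_consecutive(vectors, threshold=0.7):
    """Merge consecutive paragraph vectors until no adjacent pair is similar enough."""
    vectors = [list(v) for v in vectors]
    i = 0
    while i < len(vectors) - 1:
        if cosine(vectors[i], vectors[i + 1]) >= threshold:
            vectors[i] = [a + b for a, b in zip(vectors[i], vectors[i + 1])]  # merged paragraph vector
            del vectors[i + 1]        # re-check the merged vector against the next paragraph
        else:
            i += 1
    return vectors

# Three consecutive paragraphs over subtopics (X, Y, Z, W): the first pair has
# similarity ~0.77 > 0.7, so it is merged; the third paragraph stays separate.
print(merge_consecutive([[5, 3, 0, 0], [2, 0, 1, 0], [0, 0, 0, 4]]))   # -> [[7, 3, 1, 0], [0, 0, 0, 4]]
```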
Paragraph merging example: imagine that we have found four subtopics (X, Y, Z, W) in a document. In this case, the first paragraph $P_1$ contains 5 words for subtopic X and 3 words for subtopic Y, while the second paragraph $P_2$ contains 2 words for subtopic X and 1 word for subtopic Z. The vector representations for $P_1$ and $P_2$ are given below.

$$V_1 = (5_X, 3_Y, 0, 0) \qquad V_2 = (2_X, 0, 1_Z, 0)$$

The similarity of these paragraphs is calculated with the cosine function of formula (3):

$$Sim(P_1, P_2) = \frac{5 \cdot 2 + 3 \cdot 0 + 0 \cdot 1 + 0 \cdot 0}{\sqrt{5 \cdot 5 + 3 \cdot 3 + 0 \cdot 0 + 0 \cdot 0} \cdot \sqrt{2 \cdot 2 + 0 \cdot 0 + 1 \cdot 1 + 0 \cdot 0}} = \frac{10}{\sqrt{34}\sqrt{5}} \approx 0.77$$

The paragraph similarity is over 0.7, so these paragraphs are merged.

To calculate the dispersion values of all subtopics found in the document, we rearranged formulas (1) and (2) given in the related work. In order to calculate TF and IDF values correctly for a single document, the formulas are adapted to the paragraph level. In formula (4), P is the number of paragraphs in the document, $P_s$ shows the number of paragraphs containing subtopic s, and i represents the ith paragraph in the document.

$$TFIDF_{p_{is}} = TF_{p_{is}} \cdot \log \frac{P}{P_s} \qquad (4)$$

In order to calculate the dispersion value of a candidate subtopic s, the dispersion formula given in (2) is also adapted to the paragraph level. In formula (5), m shows the number of paragraphs in the document, and $mean_s$ is the average of subtopic s over all paragraphs in the document after subtopic detection.

$$DispP_s = \sqrt{\frac{\sum_{i=1}^{m}\left(TFIDF_{p_{is}} - mean_s\right)^2}{m}} \qquad (5)$$

The topic of the document is obtained as the topic of the most dispersive subtopic in the document.
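To make this last step concrete, here is a minimal sketch (illustrative only, with invented subtopic counts) of the paragraph-level TF-IDF of formula (4), the subtopic dispersion of formula (5), and the choice of the most dispersive subtopic as the document topic.

```python
import math

def paragraph_tfidf(paragraph_counts):
    """paragraph_counts: one dict per (merged) paragraph, subtopic -> word count.
    Returns per-paragraph TFIDF values according to formula (4)."""
    p = len(paragraph_counts)                                        # P: paragraphs in the document
    subtopics = {s for counts in paragraph_counts for s in counts}
    p_s = {s: sum(1 for counts in paragraph_counts if counts.get(s, 0) > 0)
           for s in subtopics}                                       # P_s: paragraphs containing s
    return [{s: counts.get(s, 0) * math.log(p / p_s[s]) for s in subtopics}
            for counts in paragraph_counts]

def document_topic(paragraph_counts, subtopic_to_topic):
    """Pick the most dispersive subtopic (formula 5) and return its parent topic."""
    tfidf = paragraph_tfidf(paragraph_counts)
    m = len(tfidf)
    dispersion = {}
    for s in tfidf[0]:
        mean_s = sum(row[s] for row in tfidf) / m
        dispersion[s] = math.sqrt(sum((row[s] - mean_s) ** 2 for row in tfidf) / m)  # formula (5)
    best = max(dispersion, key=dispersion.get)
    return subtopic_to_topic[best], best, dispersion

# Invented example: three merged paragraphs with their subtopic word counts.
paras = [{"tennis": 4}, {"tennis": 3, "archery": 1}, {"archery": 2}]
print(document_topic(paras, {"tennis": "sport", "archery": "sport"}))
```

Note that in this sketch the subtopic of each remaining paragraph is still simply the most frequent entry of its own vector; the dispersion step is only used to pick the document-level topic, as described above.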
4. Evaluation

4.1. Data set

A data set containing 40 documents from Internet web sites (www.elearningtech.blogspot.com, www.articlesbase.com, www.badminton-information.com) is randomly selected. The articles are about the sport and education experimental subjects. Sport document topics include tennis, badminton, archery and general sport information; education document topics are science, preschool and e-learning. The selected documents have between 6 and 9 paragraphs, and the number of documents per subject is equal, 20 each. A human reader extracted the subtopics and topics of the documents to compare them with the output of the developed experimental system.

4.2. Evaluation

The theory is applied in an experimental system developed as a Computer Engineering final project (Turan, Kececi & Kesim, 2012). The system takes a document as input, separates it into paragraphs, and finally gives the subtopics of the remaining (merged) paragraphs and the topic of the document. System performance (correct subtopic and topic detection) is evaluated under two different working conditions: with the corpus only, or with the corpus plus WordNet. The results obtained from the experiments are presented as graphs in Fig. 2 and 3. In these graphs the horizontal axis shows the documents, while the vertical axis shows the accuracy percentage of subtopic detection in each document.

Fig. 2. Subtopic Detection only with the Corpus (a) Sport Dataset (b) Education Dataset

Fig. 3. Subtopic Detection with the Corpus plus WordNet (a) Sport Dataset (b) Education Dataset

Fig. 4 shows the details of the subtopic detection in Fig. 2 and 3. In these graphs the horizontal axis shows the subtopics and their total paragraph numbers in the data set, while the vertical axis shows the accuracy of subtopic detection for these paragraphs. The results show that the average accuracy of subtopic detection is %61 for the education subject and %72 for sport.
Fig. 4. Subtopic Details of Subjects (a) Sport Dataset (b) Education Dataset

Fig. 5. Topic Detection (a) Sport Dataset (b) Education Dataset

Topics are determined for the 40 documents in the data set; Fig. 5 shows the results obtained. The results show that the average accuracy of topic detection is near %83. Paragraph merging is another important evaluation point in this work. The number of merged consecutive paragraphs is counted during document processing, and we obtained 156 merges over 271 paragraphs. These numbers show that merging paragraphs decreases the number of vectors significantly.
5. Conclusion

In this study, a corpus is used to find words that are related to the same subtopic. Since the word frequencies are calculated using this strategy, the word list is reduced and only a few subtopics are obtained for the document at the end of the process. This results in a small vector that is easy to handle. Moreover, the vector is constructed at the paragraph level, so the number of vectors is at most the number of paragraphs in the document. This also decreases the time spent on vector operations.

A well-written document generally contains an ordered sequence of subtopics. Merging consecutive paragraphs helps us group related text parts together. The experiments in this study have shown that over %50 of consecutive paragraphs are merged. Applying this technique, the number of paragraph vectors decreases significantly. Briefly, paragraph merging is important for both subtopic detection and summarization.

WordNet usage affects our results only insignificantly (Fig. 2 and 3). The accuracy of the system depends on the organization and coverage of the corpus for the subject used. The study is being extended to extract words dynamically from the documents and select the more valuable ones into the corpus.

A lot of research has shown that topic models can achieve more than 80% accuracy in finding the correct topics of words in unsupervised ways (Li & McCallum, 2006). The accuracy percentages of two other studies (Nakata et al., 2002; Wartena & Brusse, 2008) are %77 and %75 respectively. Compared with these results in the literature, the current system has higher accuracy; moreover, it presents a simple solution and produces fast calculations.

6. Future work

One of the aims of this work is to automatize corpus generation. The current corpus is human-made and can be extended to new subjects over time. Such an ability has been added to the system, but it currently still needs human assistance. In the future, the corpus extension can be done heuristically.

Another research direction is to summarize single or multiple documents with the extracted information (topics and subtopics). Common subtopics in multiple documents can be ordered by importance to extract the more valuable sentences and produce a better extractive summary.

We observed that if a document is well written, it is organized so that paragraphs related to the same topic are placed next to each other; in other words, same-topic paragraphs are not scattered through the document. In order to decide on the writing style of a document, this research can be extended using machine learning techniques.

References

Aksoy, C., Can, F., & Kocberber, S. (2011). Novelty detection for topic tracking, Journal of the American Society for Information Science and Technology, 777-795.
Amini, M.R., & Usunier, N. (2007). A contextual query expansion approach by term clustering for robust text summarization, in Proc. of Document Understanding Conference, 48-55.
Balinsky, H., Balinsky, A., & Simske, S. (2011). Document sentences as a small world, in Proc. SMC, 2583-2588.
Bellegarda, J. (2000). Exploiting latent semantic information in statistical language modeling, in Proc. IEEE, 88(8), 1279–1296.
Blei, D.M., Ng, A.Y., Jordan, M.I., & Lafferty, J. (2011). Latent dirichlet allocation, Journal of Machine Learning Research, 993-1022.
Celikyilmaz, A., & Hakkani-Tur, D. (2011). Concept-based classification for multi-document summarization, Acoustics, Speech and Signal Processing (ICASSP), 5540-5543.
Chien, J.-T., & Chueh, C.-H. (2012). Topic-based hierarchical segmentation, IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 55-66.
Choi, F.Y.Y. (2000). Advances in domain independent linear text segmentation, in Proceedings of the 1st North American Chapter of the Association for Computational Linguistics, 26-33.
Clifton, C., Cooley, R., & Rennie, J. (2004). TopCat: data mining for topic identification in a text corpus, IEEE Transactions on Knowledge and Data Engineering, 949-964.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by latent semantic analysis, J. Amer. Soc. Inf. Sci., 41(6), 391–407.
Edmundson, H.P. (1969). New methods in automatic extracting, Journal of the ACM, 264-285.
Gong, S., Qu, Y., & Tian, S. (2010). Subtopic-based multi-documents summarization, Third International Joint Conference on Computational Science and Optimization, 382-386.
Harabagiu, S., & Lacatusu, F. (2010). Using topic themes for multi-document summarization, ACM Transactions on Information Systems, 28(3), 1-46.
Hearst, M.A. (1997). TextTiling: segmenting text into multi-paragraph subtopic passages, Computational Linguistics, 33-64.
Hofmann, T. (1999). Probabilistic latent semantic indexing, in Proc. ACM SIGIR, 50–57.
Ji, H., Luo, Z., Wan, M., & Gao, X. (2002). Summarizing based on concept counting and hierarchy analysis, IEEE SMC.
Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations, in Proceedings of the 23rd International Conference on Machine Learning, 577-584.
Lin, Y.-C., & Hovy, E.H. (2000). The automated acquisition of topic signatures for text summarization, in COLING, 495-501.
Liu, Y. (2005). A Concept-based Text Classification System for Manufacturing Information Retrieval. (PhD thesis). National University of Singapore, Singapore.
Luhn, H.P. (1958). The automatic creation of literature abstracts, IBM Journal of Research and Development, 159-165.
Matsuo, Y., Ohsawa, Y., & Ishizuka, M. (2001). A document as a small world, Lecture Notes in Computer Science, 444-448.
Moens, M.-F., & Busser, R.D. (2001). Generic topic segmentation of document texts, in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 418-419.
Nakata, T., Ikeda, T., Ando, S., & Okumura, A. (2002). Topic detection based on dialogue history, COLING '02 Proceedings of the 19th International Conference on Computational Linguistics, 1-7.
Pingali, P., K, R., & Varma, V. (2007). IIIT Hyderabad at DUC 2007, in Proc. of Document Understanding Conference, NIST.
Radev, D.R., Jing, H., Stys, M., & Tam, D. (2004). Centroid-based summarization of multiple documents, Information Processing and Management, 193-207.
Turan, M., Kececi, O., & Kesim, A.E. (2012). Article (document) Topic and Subtopic Detection. (undergraduate thesis). İstanbul Kültür University, İstanbul.
Wang, H.-C., & Chan, Y.-C. (2008). On the abstraction and presentation of multi-source knowledge, Proceedings of the Seventh International Conference on Machine Learning and Cybernetics, 3307-3309.
Wang, M., Wang, C., & Li, Z.Z. (2008). Multi-document summarization based on word feature mining, International Conference on Computer Science and Software Engineering, 743-746.
Wang, M., Wang, X., & Li, C. (2009). Extracting multi-document summarization based on local topics, Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 238-241.
Wartena, C., & Brusse, R. (2008). Topic detection by clustering keywords, DEXA '08 Proceedings of the 2008 19th International Conference on Database and Expert Systems Application, 54-58.
Xu, Y.-D., Quan, G.-R., Zhang, T.B., & Wang, Y.-D. (2011). Used hierarchical topic to generate multi-document automatic summarization, Fourth International Conference on Intelligent Computation Technology and Automation, 295-298.
Yap, I., Loh, H.T., Shen, L., & Liu, Y. (2006). Topic detection using MFSs, Proceedings of the 19th International Conference on Industrial Engineering Applications of Artificial Intelligence Expert Systems (IEA/AIE), Lecture Notes in Computer Science, 342-352.
Zhang, Y., Guo, J., Gong, H., & Xue, Z. (2008). Single-document automatic abstracting system based on topic partition, International Symposium on Intelligent Information Technology Application Workshops, 280-283.