Comparative Analysis of Classification Algorithms
and Methods of Web Document Representation
© Sayed Rahman
MADI (Moscow Automobile and Road Construction Institute)
Y2ksayed@gmail.com
Abstract
This article focuses on two main problems of text categorization: the choice of a classification
algorithm and the methods of text pre-processing. Based on experiments carried out within the
ROMIP'2005 seminar, a comparative analysis of the approaches was performed and solutions to the
detected problems are proposed.
1. Introduction
This work is part of research into methods of periodic thematic search, which also includes the
development of such systems. One of the main stages in the operation of such a system is text
categorization. Since the quality of categorization largely determines the quality of the final
result, the study of classification algorithms and document pre-processing methods is an
important task.
Classification algorithms work with some mathematical model of an instance (in this case, a text
document). The most common model represents a document as a set of features, and it is the model
considered below. The definition of features and the assignment of weights to them is an
essentially informal step that strongly influences the classification result. For this reason,
the document pre-processing stage is considered separately.
2. The considered algorithms
During the experiments, three representatives of the family of linear algorithms were considered:
support vector machines (SVM) [1], the PrTFIDF algorithm [2], and a modified naive Bayes
algorithm [3].
The PrTFIDF and naive Bayes algorithms are based on statistical models and have much in common.
These algorithms are of interest because they are highly scalable and offer high performance.
Their known disadvantage is relatively low classification accuracy, especially in the case of
binary classification. The PrTFIDF algorithm is taken as the baseline algorithm for a number of
reasons:
• Experiments [2] show higher classification accuracy compared to naive Bayes and TFIDF [4];
this has also been confirmed in other experiments that are beyond the scope of this article.
• The algorithm is applicable to the analysis of a large number of documents and allows a large
number of features to be used. This is important because the algorithm will be used for
processing large volumes of data.
The naive Bayes algorithm has recently been assessed as a relatively low-quality algorithm. The
main reasons are the problems associated with the feature-independence assumption and the
incorrect estimation of a priori probabilities in the case of substantially unbalanced training
samples. By proposing a series of empirical modifications to the algorithm, or by adding extra
features, for example based on the selection of phrases in the document, one can try to resolve
the existing problems and improve the quality of the algorithm while preserving its simplicity,
performance, and scalability.
The classification results of support vector machines have recently been rated [5] as the best
or among the best. However, the training speed of this algorithm is relatively low (O(|D|^a),
where a > 1.2 [5]) and it requires a large amount of memory, which reduces its scalability.
Nevertheless, this algorithm can be used as a reference point for classification quality. A
modification of the feature-weight estimation is also proposed and will be discussed later.
Thus, the requirements for a classification algorithm within the task to be solved can be
formulated as follows:
1. The classification quality must be comparable to that of support vector machines.
2. The algorithm must have low computational complexity and be highly scalable.
Below we consider the proposed modifications of the existing algorithms.
2.1 Methodology for the preliminary assessment of modifications
The proposed modifications require experimental validation to assess their impact on the
algorithms. In this work, a preliminary evaluation was performed on two test sets:
Newsgroup-20 [6] and a training collection of normative documents. For the second collection,
40% of the documents were selected at random as the training sample, and the remaining documents
were used to assess classification accuracy. The second sample is interesting because of its
strongly uneven distribution of documents over classes and its large number of classes.
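A minimal sketch of the evaluation setup described above, assuming a scikit-learn-style
classifier with fit/predict methods; the 40% split follows the text, while the function and
variable names are illustrative.

    # Preliminary evaluation: random 40% training sample, remaining 60% for testing.
    # Macro- and micro-averaged F1 correspond to the measures reported in Fig. 1.
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score

    def evaluate(classifier, documents, labels, seed=0):
        X_train, X_test, y_train, y_test = train_test_split(
            documents, labels, train_size=0.4, random_state=seed)
        classifier.fit(X_train, y_train)
        predicted = classifier.predict(X_test)
        return (f1_score(y_test, predicted, average="macro"),
                f1_score(y_test, predicted, average="micro"))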
2.2 Modification of naive Bayes
The class-selection rule for a document in the Bayesian algorithm can be represented as follows:
    c(d) = argmax_C [ log p(C) + Σ_{w∈d} f_w · log p_{Cw} ],

where f_w is the number of occurrences of token w in the document and p_{Cw} = p(w|C).
To combat the incorrect estimation of the a priori conditional probabilities of the features in
the case of unbalanced training samples, it is proposed to use the class-complement paradigm:
instead of the probability that a token belongs to class C, the probability that the token
belongs to the complement class C' is estimated (note that p(w|C) ~ 1 − p(w|C')). Using Laplace
smoothing of the parameters, we obtain the following rule:
    c(d) = argmax_C [ log p(C) − Σ_{w∈d} f_w · log( (N_{C'w} + 1) / (N_{C'} + |V|) ) ],
where N_{C'w} is the number of occurrences of token w in the complement class (all classes other
than C), N_{C'} is the total number of tokens in the complement class, and |V| is the size of
the lexeme dictionary.
It should be noted that this heuristic works only
if the number of classes N >> 2.
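A minimal sketch of the complement-class scoring rule above, assuming a plain bag-of-words
representation; the helper names are illustrative, and the additional heuristics listed below
(length normalization, inverse frequency) are omitted for brevity.

    import math
    from collections import Counter, defaultdict

    def train_complement_counts(docs, labels):
        """Count token occurrences per class; each document is a list of tokens."""
        class_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            class_counts[label].update(tokens)
        vocab = {w for counts in class_counts.values() for w in counts}
        return class_counts, vocab

    def classify(tokens, class_counts, vocab, priors):
        """Score each class through its complement, as in the modified Bayes rule."""
        totals = {c: sum(cnt.values()) for c, cnt in class_counts.items()}
        best_class, best_score = None, float("-inf")
        for c in class_counts:
            n_comp = sum(totals[c2] for c2 in class_counts if c2 != c)
            score = math.log(priors[c])
            for w, f_w in Counter(tokens).items():
                n_comp_w = sum(class_counts[c2][w] for c2 in class_counts if c2 != c)
                # Laplace-smoothed complement probability, subtracted rather than added.
                score -= f_w * math.log((n_comp_w + 1) / (n_comp + len(vocab)))
            if score > best_score:
                best_class, best_score = c, score
        return best_class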
To further improve the classification quality, the following techniques were tried:
• logarithmic smoothing of the feature frequencies;
• normalization of the feature weights in a document by its length;
• use of the inverse feature frequency (IDF and IDF' [2]);
• normalization of the logarithms of the feature weights (log p_{Cw}).
Preliminary experiments showed improved classification accuracy when all of these heuristics were
enabled except logarithmic smoothing and the inverse frequency. Accordingly, before the final run
of the algorithm (hereafter ModBayes), the quality-degrading heuristics were disabled. In
preliminary testing, the accuracy of the algorithm proved comparable to that of SVM, whereas the
accuracy of the basic Bayes algorithm was close to zero.
2.3 Modification of the SVM algorithm
The considered modification of the algorithm amounts to a simple empirical change of the feature
weights. The underlying premises are the following:
• Lexemes with a high inverse frequency are potentially significant and should therefore carry
more weight, similar to the assumptions of the TFIDF algorithm.
• If a token occurs often in documents of one class but rarely in other documents, this lexeme is
also potentially more important than a token that occurs in a small number of documents but in
many classes. Consider, for example, two situations: one token occurs in ten documents of the
same class, while another occurs two times in each of two classes. From the point of view of
inverse frequency the second lexeme would receive more weight, although the first is actually
much more important for a good separation of the two classes.
Thus, the following modifier of the token weight is proposed:
    weight(w) = max_{C'∈C} TF(w, C') · IDF'(w),

where

    IDF'(w) = |D| · TF(w, C') / ( Σ_{C'∈C} Σ_{w'∈F} TF(w', C') ),

TF(w, C') is the frequency of token w in the documents of class C', F is the feature set, and
|D| is the number of documents in the collection.
In preliminary experiments on the Newsgroup-20 test collection, the application of this heuristic
led to a slight increase in classification accuracy. In the run of the SVM algorithm on the
collection of normative documents the heuristic was enabled.
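A rough sketch of the weight modifier under the reconstruction above; it treats the class with
the maximal term frequency as the C' entering IDF', which, like the function names, is an
assumption.

    from collections import Counter, defaultdict

    def modified_weights(docs, labels):
        """Token weight = max over classes of TF(w, C') times an inverse-frequency
        factor IDF'; a sketch of the reconstructed modifier, not the exact system code."""
        tf = defaultdict(Counter)            # tf[class][token] = occurrences in that class
        for tokens, label in zip(docs, labels):
            tf[label].update(tokens)
        n_docs = len(docs)
        total_tf = sum(sum(counts.values()) for counts in tf.values())
        vocab = {w for counts in tf.values() for w in counts}
        weights = {}
        for w in vocab:
            max_tf = max(counts[w] for counts in tf.values())   # concentration in one class
            idf_prime = n_docs * max_tf / total_tf              # assumed form of IDF'
            weights[w] = max_tf * idf_prime
        return weights

The resulting weights would replace the raw term frequencies in the document vectors passed to
the SVM trainer.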
3. Pre-processing of documents
The objective of the preliminary document-processing stage is to extract the features of a
document and to map them to weights. In the simplest case of the multinomial model, the feature
set of a document is its set of tokens, and the weights are the numbers of occurrences of the
tokens in the document. The drawback of this approach is that it takes almost no account of the
properties of natural language, nor of the structure of the document and the relations between
documents in the context of Web pages.
3.1 Processing of natural language texts
The processing of a text can be divided into several stages:
1. Lexical analysis
2. Morphological analysis
3. Syntactic and post-morphological analysis
4. The selection of phrases (n-grams)
5. The elimination of stop-words
The first two stages are fairly obvious: the task of the first is to extract tokens, and the
second, based on a set of rules and an internal dictionary, maps each token to a set of possible
word forms with their grammatical characteristics. Syntactic analysis makes it possible to
resolve a significant portion of homonymy cases. It also allows more precise filtering of
stop-words and the formation of phrases from syntactically related lexical items, which
significantly reduces their number compared with a full enumeration of neighboring tokens.
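A simplified sketch of the staged pipeline listed above, assuming a generic morphological
analyzer callable and an external stop-word list; both, and the n-gram phrase step standing in
for syntactic phrase selection, are illustrative assumptions.

    import re

    STOP_WORDS = {"and", "or", "the", "of"}          # illustrative stop-word list

    def lexical_analysis(text):
        """Stage 1: split the text into lowercase tokens."""
        return re.findall(r"\w+", text.lower())

    def preprocess(text, morph_analyzer, n=2):
        tokens = lexical_analysis(text)
        # Stage 2: morphological analysis maps each token to a normalized word form.
        lemmas = [morph_analyzer(t) for t in tokens]
        # Stage 5: stop-word elimination (in the described system this happens after
        # the part of speech is known; a plain list is used here for brevity).
        lemmas = [l for l in lemmas if l not in STOP_WORDS]
        # Stage 4: phrase (n-gram) selection over the remaining lexemes.
        phrases = [" ".join(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)]
        return lemmas + phrases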
3.1.1 Syntactic analysis
The problems of existing solutions in the field of syntactic analysis (for example,
LinkParser [7] and Dialing [8]) are their rather low text-processing speed and their sensitivity
to incorrect syntactic constructions. These problems follow from the intended applications of
those solutions, namely spell checking and machine translation. In text classification the
requirements for a parser are somewhat different: high text-processing speed and the ability to
work with syntactically incomplete fragments of text, with an acceptable increase in the analysis
error.
The developed parser has much in common with the algorithm used in the Dialing system. The
differences lie in a modified list of rules and a significant simplification of the fragmentation
analysis. As a result, the algorithm correctly parses syntactically incomplete fragments, and the
text-processing speed has increased by roughly an order of magnitude.
The results of the parser are the elimination of morphological ambiguities and the construction
of a set of syntactically related phrases. Post-morphological analysis makes it possible to
determine the part of speech of a lexeme; accordingly, the stop-word filtering stage is performed
after parsing.
3.1.2 Choice of phrases
The previous section considered the selection of phrases based on syntactic analysis. There are
also algorithms that select phrases based on statistical analysis. It should be noted that when
such algorithms are used in their pure form, a very large number of phrases has to be analyzed,
which complicates their application when processing a large number of documents.
Consider two basic phrase-selection algorithms. In the first, phrases are treated as a context
for the most significant tokens within a certain topic. Thus, a phrase is considered "contextual"
if it contains at least one of the most significant terms, pre-selected by conventional
feature-selection algorithms.
The second algorithm is based on the following observation: if a given phrase is "stable", then
among the set of documents that contain all of its tokens there should also be documents that
contain the phrase itself. The selection of "stable phrases" within some topic therefore proceeds
as follows. For each phrase, the number of documents N_p in which it occurs is computed. Then the
number of documents N_t in which all tokens of the phrase occur is computed. The phrase is
considered stable if

    N_p ≥ K · N_t,

where K is a phrase-stability coefficient determined experimentally.
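A minimal sketch of the "stable phrase" filter above; the whitespace tokenization and the value
of K are illustrative assumptions.

    def stable_phrases(phrases, documents, k=0.1):
        """Keep a phrase if it occurs in at least K times as many documents
        as contain all of its individual tokens (N_p >= K * N_t)."""
        token_sets = [set(doc.split()) for doc in documents]
        stable = []
        for phrase in phrases:
            words = phrase.split()
            n_p = sum(phrase in doc for doc in documents)                 # documents containing the phrase
            n_t = sum(all(w in ts for w in words) for ts in token_sets)   # documents containing all its tokens
            if n_t and n_p >= k * n_t:
                stable.append(phrase)
        return stable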
In practice, when analyzing a large number of documents, the two algorithms have to be combined.
However, the joint use of syntactic phrase selection and phrase filtering based on the
"stability" principle appears more promising.
3.2 Processing of Web pages
The techniques used in the analysis of Web pages are quite simple. In particular, it is necessary
to detect the document encoding automatically, since it is not explicitly specified in all
documents. This task is performed in two stages:
• analysis of the occurrence frequencies of the most frequently used letters;
• if the frequency analysis does not allow a definite conclusion, the presence of some of the
extracted lexemes in the morphological dictionary is checked.
If the dictionary or frequency analysis reveals an encoding mismatch, the text of the document is
converted to another encoding.
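A rough sketch of the two-stage encoding detection described above; the candidate encodings, the
letter set, the tie threshold, and the dictionary interface are all assumptions rather than the
system's actual parameters.

    def detect_encoding(raw_bytes, dictionary, candidates=("cp1251", "koi8-r", "utf-8")):
        """Stage 1: score candidate encodings by the share of the most frequent letters.
        Stage 2: if the scores are too close, check decoded lexemes against the
        morphological dictionary."""
        frequent_letters = set("оеаинтс")        # most frequent Russian letters (assumption)
        scores = {}
        for enc in candidates:
            try:
                text = raw_bytes.decode(enc)
            except UnicodeDecodeError:
                continue
            letters = [ch for ch in text.lower() if ch.isalpha()]
            scores[enc] = sum(ch in frequent_letters for ch in letters) / max(len(letters), 1)
        ranked = sorted(scores, key=scores.get, reverse=True)
        if len(ranked) > 1 and scores[ranked[0]] - scores[ranked[1]] < 0.05:
            def dictionary_hits(enc):
                words = raw_bytes.decode(enc, errors="ignore").lower().split()
                return sum(w in dictionary for w in words)
            ranked.sort(key=dictionary_hits, reverse=True)
        return ranked[0] if ranked else None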
In addition, during document processing the weights of tokens that occur in headings, the title,
keywords, link anchor text, and so on, are increased.
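A small sketch of this zone-based weight boost, assuming the page is parsed with BeautifulSoup;
the particular tags and boost factors are illustrative, not the values used in the system.

    from collections import Counter
    from bs4 import BeautifulSoup

    # Illustrative boost factors for structurally marked parts of a Web page.
    ZONE_BOOST = {"title": 3.0, "h1": 2.0, "h2": 2.0, "a": 1.5}

    def zone_weighted_counts(html, tokenize):
        soup = BeautifulSoup(html, "html.parser")
        weights = Counter()
        for token in tokenize(soup.get_text()):
            weights[token] += 1.0                    # base weight: plain occurrence
        for tag, boost in ZONE_BOOST.items():
            for element in soup.find_all(tag):
                for token in tokenize(element.get_text()):
                    weights[token] += boost - 1.0    # extra weight for marked zones
        for meta in soup.find_all("meta", attrs={"name": "keywords"}):
            for token in tokenize(meta.get("content", "") or ""):
                weights[token] += 1.0                # keywords also receive a boost
        return weights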
4. The results of the experiments
4.1 Track classification of Web pages
One run was submitted for this track. The classification algorithm used was the modified PrTFIDF
algorithm. During text pre-processing, morphological analysis based on ISpell dictionaries and
analysis of the Web-page structure were used. The main purpose of this run was to compare the
algorithm with others on a large volume of real data.
The results obtained were worse than those of the other participants, which is largely due to the
weakness of the algorithm in the case of unbalanced training samples and shows the actual
applicability of this algorithm to such problems. This was confirmed by testing on the training
collection of normative documents.
4.2 Track: classification of legal documents
Within the track on the classification of normative documents, four runs were submitted:
Run 1: the PrTFIDF algorithm.
Run 2: the PrTFIDF algorithm with statistical phrase selection.
Run 3: the modified naive Bayes algorithm using post-morphology and partial phrase selection.
Run 4: the modified SVM algorithm using post-morphology and partial phrase selection.
The purpose of the runs was to determine the degree of influence of phrase selection on
classification quality and to compare the PrTFIDF algorithm, the modified Bayes algorithm,
and SVM.
Fig. 1. Comparison of classifier quality: F1 (macro) and F1 (micro) for PrTFIDF+phrases, PrTFIDF,
ModBayes, and SVM.
The results of the runs differ substantially from those obtained in the preliminary experiments,
and they are rather difficult to interpret, especially the F1 measure averaged over documents
(micro-averaging). In particular, one can see that the PrTFIDF algorithm outperforms SVM, which
may be explained by a poor choice of the weight modifiers. However, in the preliminary testing on
the training part of the collection of normative documents, the advantage of SVM was more than
substantial.
It should also be noted that there was no gain from using phrases, which casts doubt on the use
of statistical phrase selection in its pure form. In the preliminary experiments, statistical
(Newsgroup-20 set) and syntactic (the training collection) phrase selection improved accuracy for
all analyzed algorithms by 1 to 4%.
The final results, including those of the modified Bayesian algorithm, differ significantly from
what was expected, which demands further study.
4.3 Experiments on the training part of the collection of legal documents
The analysis of the classification results reveals a number of weaknesses in the algorithms used.
To resolve the problems found, a number of modifications to the Bayesian algorithm were proposed,
as well as the ModSimpl algorithm, which is based on constructing several separating hyperplanes
corresponding to the Fisher discriminant.
Experiments conducted on the training set of normative documents (outside the scope of the ROMIP
seminar) showed results that suggest these algorithms are promising.
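The text does not detail ModSimpl, so the following is only a hedged illustration of its stated
basis: a one-vs-rest separating direction computed from the two-class Fisher linear discriminant;
the function names and the pseudo-inverse regularization are assumptions.

    import numpy as np

    def fisher_direction(X, y, target_class):
        """One-vs-rest Fisher linear discriminant: w = Sw^-1 (m_pos - m_neg),
        i.e. a separating hyperplane of the kind ModSimpl is said to build."""
        X_pos, X_neg = X[y == target_class], X[y != target_class]
        m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
        # Within-class scatter matrix; pinv keeps the computation stable.
        Sw = np.cov(X_pos, rowvar=False) * (len(X_pos) - 1) + \
             np.cov(X_neg, rowvar=False) * (len(X_neg) - 1)
        w = np.linalg.pinv(Sw) @ (m_pos - m_neg)
        threshold = w @ (m_pos + m_neg) / 2.0        # midpoint between class mean projections
        return w, threshold

    def classify(x, hyperplanes):
        """Assign the class whose hyperplane gives the largest margin."""
        return max(hyperplanes, key=lambda c: hyperplanes[c][0] @ x - hyperplanes[c][1])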
Algorithm   NB      PrTFIDF   ModBayes   ModSimpl   SVM
Accuracy    < 10%   < 10%     45.46%     44.54%     47.83%

Table 1. Comparison of the accuracy of the algorithms
The use of syntactic phrase selection also provided a small increase in classification accuracy.
5. Further areas of work
Analyzing the results, we conclude that both the preliminary analysis of documents, especially
for Web collections, and the algorithms used require more detailed study. The following
directions of future work can be identified:
1. Revision of the probabilistic algorithms for solving the classification problem with a large
number of unbalanced classes.
2. Study and refinement of the ModSimpl algorithm. Unlike the probabilistic algorithms, it has
also shown good results in binary classification tasks.
3. Revision of the parser, and taking parts of speech and other characteristics of lexemes into
account when assigning weights and selecting features.
4. Joint use of syntactic and statistical phrase selection.
5. Analysis of the blocks of Web pages and elimination of noise elements.
6. Analysis of the context of citations of the document.
7. Use of synonym dictionaries and, possibly, an adapted probabilistic latent semantic analysis.
6. Conclusion
In this work we considered a number of classification algorithms and text pre-processing methods.
Based on the analysis of the experimental results, a number of improvements to the classifiers
and the main directions of further development were proposed.
References
[1] T. Joachims. Making large-scale SVM learning practical // Advances in Kernel Methods: Support
Vector Learning, MIT Press, 1999.
[2] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text
categorization // Proceedings of ICML-97, 14th International Conference on Machine Learning,
pages 143-151, Morgan Kaufmann Publishers, 1997.
[3] D. Lewis. Naive Bayes at forty: The independence assumption in information retrieval //
Proceedings of ECML-98, 10th European Conference on Machine Learning, pages 4-15, 1998.
[4] G. Salton. Developments in automatic text retrieval // Science, vol. 253, pages 974-979, 1991.
[5] S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data // Morgan Kaufmann
Publishers, 2004.
[6] Home Page for 20 Newsgroups Data Set. http://people.csail.mit.edu/jrennie/20Newsgroups/
[7] D. Temperley, J. Lafferty, D. Sleator. Link Grammar Parser. http://www.link.cs.cmu.edu/link
[8] A. Sokirko. Semantic dictionaries in automatic text processing (based on the DIALING
system) // PhD dissertation. http://www.aot.ru/docs/sokirko/sokirko-candid-1.html
On Comparative Analysis of Classification Algorithms
and Web Document Representation
Sayed Rahman
Two main problems of text categorization are reviewed in this article: the choice of a
classification algorithm and text preprocessing methods. Based on experiments held on the
ROMIP'2005 collections, the methods were compared, and solutions to the revealed problems were
proposed.
