Text Mining and Classification of Product Reviews Using Structured Support Vector Machine (csandit)
Text mining and text classification are two prominent and challenging tasks in the field of machine learning. Text mining refers to the process of deriving high-quality and relevant information from text, while text classification deals with the categorization of text documents into different classes. The real challenge in these areas is to address problems such as handling large text corpora, similarity of words in text documents, and association of text documents with a subset of class categories. The feature extraction and classification of such text documents require an efficient machine learning algorithm that performs automatic text classification. This paper describes the classification of product review documents as a multi-label classification scenario and addresses the problem using a Structured Support Vector Machine. The work also explains the flexibility and performance of the proposed approach for efficient text classification.
Speech to Text Conversion for Visually Impaired Persons Using µ-law Companding (iosrjce)
This paper presents the overall design and implementation of a DSP-based speech recognition and text conversion system. Speech is usually the preferred mode of operation for human beings; this paper presents voice-oriented commands converted into text. We intended to compute the entire speech processing in real time. This involves simultaneously accepting input from the user and using software filters to analyse the data. The comparison is then established using correlation and µ-law companding techniques. In this paper, voice recognition is carried out using MATLAB. The voice command is person-independent. The voice commands are stored in the database with the help of the function keys. The real-time input speech received is then processed in the speech recognition system, where the required features of the spoken words are extracted, filtered, and matched with the existing samples stored in the database. The required MATLAB processes are then carried out to convert the received data into text form.
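As a reference for the companding step mentioned above, the standard µ-law compressor maps a sample x in [-1, 1] to sgn(x)·ln(1 + µ|x|)/ln(1 + µ), with µ = 255 the common telephony value. The sketch below is illustrative and is not the paper's MATLAB implementation:

```python
import math

MU = 255.0  # standard µ value used in North American telephony

def mu_compress(x):
    """Compress a sample x in [-1, 1] with the µ-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Invert the µ-law curve, recovering the original sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

x = 0.25
y = mu_compress(x)          # small amplitudes get boosted resolution
print(round(mu_expand(y), 6))  # round-trips back to 0.25
```

Compress/expand form an exact inverse pair, which is why companding can reduce quantization noise for low-amplitude speech without losing the signal.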
Transformer models have taken over most natural language inference tasks. In recent times they have proved to beat several benchmarks. Chunking means splitting sentences into tokens and then grouping them in a meaningful way. Chunking is a task that has gradually moved from POS-tag-based statistical models to neural nets using language models such as LSTMs, bidirectional LSTMs, attention models, etc. Deep neural net models are deployed indirectly for classifying tokens with the different tags defined under named entity recognition tasks. Later these tags are used in conjunction with pointer frameworks for the final chunking task. In our paper, we propose an ensemble model that uses a fine-tuned transformer model and a recurrent neural network model together to predict tags and chunk substructures of a sentence. We analyzed the shortcomings of the transformer models in predicting different tags and then trained the BiLSTM+CNN accordingly to compensate for them.
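As a rough illustration of how such an ensemble can combine two taggers, the sketch below averages per-token tag distributions from two hypothetical models and takes the argmax; the tag set and probabilities are invented, and the paper's actual combination scheme may differ:

```python
# Hypothetical sketch: combining per-token tag probabilities from a
# transformer tagger and a BiLSTM+CNN tagger by averaging them.

def ensemble_tags(transformer_probs, bilstm_probs, tags):
    """Average the two models' per-token distributions and pick the argmax tag."""
    predictions = []
    for p_tr, p_bi in zip(transformer_probs, bilstm_probs):
        avg = [(a + b) / 2 for a, b in zip(p_tr, p_bi)]
        predictions.append(tags[avg.index(max(avg))])
    return predictions

tags = ["B-NP", "I-NP", "O"]
# Two tokens; each row is one model's distribution over the three tags.
transformer_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
bilstm_probs      = [[0.5, 0.4, 0.1], [0.1, 0.7, 0.2]]
print(ensemble_tags(transformer_probs, bilstm_probs, tags))  # ['B-NP', 'I-NP']
```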
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENCY… (cscpconf)
On-line text documents rapidly increase in size with the growth of the World Wide Web. To manage such a huge amount of text, several text mining applications came into existence. Applications such as search engines, text categorization, summarization, and topic detection are based on feature extraction. It is an extremely time-consuming and difficult task to extract keywords or features manually, so an automated process that extracts keywords or features needs to be established. This paper proposes a new domain keyword extraction technique that includes a new weighting method based on the conventional TF·IDF. Term frequency-inverse document frequency is widely used to express a document's feature weights, but it cannot reflect the distribution of terms in the documents, and hence cannot reflect the degree of significance and the differences between categories. This paper proposes a new weighting method in which a new weight is added to express the differences between domains on the basis of the original TF·IDF. The extracted features can represent the content of the text better and have a better distinguished…
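For reference, the conventional TF·IDF weight that the proposed method builds on can be sketched as follows; the toy corpus is invented, and the paper's added domain weight is not reproduced here:

```python
import math

def tf_idf(term, doc, corpus):
    """Classic TF·IDF: term frequency times log inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    return tf * math.log(len(corpus) / df)

corpus = [["search", "engine", "query"],
          ["topic", "detection", "query"],
          ["text", "categorization"]]
# "engine" appears in one of three documents, once in a three-word doc:
w = tf_idf("engine", corpus[0], corpus)
print(round(w, 3))  # 0.366
```

A term like "query" that occurs in two of the three documents receives a lower weight than "engine", which is the discriminative effect the IDF factor is meant to capture.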
Text Independent Speaker Identification System Using Average Pitch and Formant Analysis (ijitjournal)
The aim of this paper is to design a closed-set, text-independent speaker identification system using average pitch and speech features from formant analysis. The speech features represented by the speech signal are potentially characterized by formant analysis (power spectral density). In this paper we have designed two methods: one for average pitch estimation based on autocorrelation, and the other for formant analysis. The average pitches of the speech signals are calculated and employed together with formant analysis. From the performance comparison of the proposed method with some existing methods, it is evident that the speaker identification system designed with the proposed method is superior to the others.
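A minimal sketch of pitch estimation by autocorrelation, the idea behind the average-pitch method described above: the lag with the strongest self-similarity gives the period. The synthetic 100 Hz tone and the lag bounds are illustrative choices, not the paper's parameters:

```python
import math

def pitch_autocorr(signal, sample_rate, min_lag=20, max_lag=400):
    """Return the fundamental frequency at the lag with peak autocorrelation."""
    best_lag, best_val = min_lag, float("-inf")
    n = len(signal)
    for lag in range(min_lag, max_lag):
        r = sum(signal[i] * signal[i + lag] for i in range(n - lag))
        if r > best_val:
            best_val, best_lag = r, lag
    return sample_rate / best_lag

sr = 8000
tone = [math.sin(2 * math.pi * 100 * t / sr) for t in range(2048)]  # 100 Hz sine
print(round(pitch_autocorr(tone, sr)))  # ≈ 100
```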
Possible Algorithms of 2NF and 3NF for DBNorma - A Tool for Relational Database Normalization (IDES Editor)
DBNorma [1] is a semi-automated database normalization tool that uses a singly linked list to store a relation and the functional dependencies that hold on it. This paper describes possible algorithms that can be used to normalize a given relation represented as a singly linked list. These algorithms were tested on various relational schemas collected from several research papers and resources, and the output was validated. We also measured the time required to normalize a given relation into 2NF and 3NF and found that it is proportional to the number of attributes and the number of functional dependencies present in the relation. The time required, on average, is on the order of 186 ms for 2NF and 209 ms for 3NF. From these timings, one can conclude that these algorithms can be used to normalize relations in a short time; this is specifically needed when the database designer is using a universal relation.
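Normalization algorithms of this kind rest on computing attribute closures under functional dependencies. A minimal sketch of that building block follows; the relation and FDs are invented examples, and DBNorma's linked-list representation is not reproduced:

```python
def closure(attrs, fds):
    """Compute the closure of a set of attributes under functional dependencies.

    fds is a list of (lhs, rhs) pairs, each side a string of attribute letters.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left side is in the closure, add the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# R(A, B, C, D) with A -> B and B -> C:
fds = [("A", "B"), ("B", "C")]
print(sorted(closure({"A"}, fds)))  # ['A', 'B', 'C']
```

Since {A}+ = {A, B, C} does not cover D, A alone is not a key of R(A, B, C, D); checks like this drive the 2NF/3NF decomposition steps.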
Design and implementation of a Java-based virtual laboratory for data communication… (IJECEIAES)
Students in this modern age find engineering courses taught at university very abstract and difficult, and cannot relate theoretical calculations to real-life scenarios. They consequently lose interest in their coursework and perform poorly in their grades. Simulations of classroom concepts with software such as MATLAB were developed to improve the learning experience. This paper involves the development of a virtual laboratory simulation package for teaching data communication concepts such as coding schemes, modulation, and filtering. Unlike other simulation packages, no prior knowledge of computer programming is required for students to grasp these concepts.
Document Classification Using KNN with Fuzzy Bags of Word Representations (uthi)
Abstract: Text classification is used to classify documents depending on the words, phrases, and word combinations according to declared syntaxes. Many applications, such as artificial intelligence systems that maintain data by category, use text classification. Some keywords, called topics, are selected to classify a given document; using these topics, the main idea of the document can be identified. Selecting the topics is an important task in classifying a document according to its category. In the proposed system, keywords are extracted from documents using TF-IDF and WordNet. The TF-IDF algorithm is mainly used to select the important words by which a document can be classified, while WordNet is mainly used to find the similarity between these candidate words. The words having the maximum similarity are considered as topics (keywords). In this experiment we used the TF-IDF model to find similar words with which to classify the document. The decision tree algorithm gives better accuracy for text classification when compared to other algorithms. We use a fuzzy system to classify text written in natural language according to topic. A fuzzy classifier is necessary for this task because a given text can cover several topics to different degrees; in this context, traditional classifiers are inappropriate, as they attempt to sort each text into a single class in a winner-takes-all fashion. The classifier we propose automatically learns its fuzzy rules from training examples. We have applied it to classify news articles, and the results we obtained are promising. The dimensionality of a vector is very important in text classification; we can decrease this dimensionality by using clustering based on fuzzy logic. Depending on the similarity, we can classify the documents, and thus they can be formed into clusters according to their topics.
After the formation of clusters, one can easily access and store the documents. In this way we can find the similarity and summarize the words, called topics, which can be used to classify the documents.
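A minimal sketch of the KNN step over bag-of-words vectors with cosine similarity; the tiny training set and labels are invented, and the fuzzy weighting described in the abstract is not shown:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def knn_classify(doc, training, k=3):
    """Label a document by majority vote among its k most similar neighbours."""
    vec = Counter(doc.split())
    scored = sorted(training,
                    key=lambda t: cosine(vec, Counter(t[0].split())),
                    reverse=True)
    top = [label for _, label in scored[:k]]
    return Counter(top).most_common(1)[0][0]

training = [("stock market trading", "finance"),
            ("market shares rise", "finance"),
            ("football match score", "sport"),
            ("tennis match final", "sport")]
print(knn_classify("market shares trading", training, k=3))  # finance
```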
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BiLSTM) WITH CONDITIONAL RANDOM FIELDS (CRF)… (ijnlc)
This study investigates the effectiveness of knowledge named entity recognition in Online Judges (OJs). OJs lack topic classification and are limited to IDs only; therefore a lot of time is consumed in finding programming problems, and more specifically knowledge entities. A Bidirectional Long Short-Term Memory (BiLSTM) with Conditional Random Fields (CRF) model is applied for the recognition of knowledge named entities present in solution reports. For the test run, more than 2000 solution reports were crawled from the Online Judges and processed for the model output. The stability of the model is also assessed with the higher F1 value. The results obtained through the proposed BiLSTM-CRF model are more effective (F1: 98.96%) and efficient in lead time.
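The CRF layer's inference step can be illustrated with a toy Viterbi pass over per-token emission scores (as a BiLSTM would produce) and tag-transition scores; all scores and tags below are invented for illustration:

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence (scores in the log domain)."""
    n_tags = len(tags)
    score = list(emissions[0])   # scores after the first token
    back = []                    # backpointers per subsequent token
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            ptr.append(best_i)
        score, back = new_score, back + [ptr]
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(back):   # follow backpointers to recover the path
        path.append(ptr[path[-1]])
    return [tags[i] for i in reversed(path)]

tags = ["B-ENT", "I-ENT", "O"]
emissions = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]]   # two tokens
transitions = [[-1.0, 1.0, 0.0],                 # from B-ENT to {B, I, O}
               [0.0, 0.5, 0.0],                  # from I-ENT
               [0.0, -2.0, 0.5]]                 # from O: O -> I discouraged
print(viterbi(emissions, transitions, tags))  # ['B-ENT', 'I-ENT']
```

The transition matrix is what lets a CRF forbid illegal sequences such as O followed by I-ENT, which independent per-token classification cannot do.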
Sentiment Analysis in Myanmar Language Using Convolutional LSTM Neural Network (kevig)
In recent years, there has been increasing use of social media among people in Myanmar, and writing reviews on social media pages about products, movies, and trips has also become popular. Moreover, most people look for review pages about a product they want to buy before deciding whether to buy it. Extracting and receiving useful reviews of interesting products is very important and time-consuming for people. Sentiment analysis is one of the important processes for extracting useful reviews of products. In this paper, a convolutional LSTM neural network architecture is proposed to analyse the sentiment classification of cosmetic reviews written in the Myanmar language. The paper also intends to build a cosmetic reviews dataset for deep learning and a sentiment lexicon in the Myanmar language.
Chunking means splitting sentences into tokens and then grouping them in a meaningful way. When it comes to high-performance chunking systems, transformer models have proved to be the state-of-the-art benchmarks. Performing chunking as a task requires a large-scale, high-quality annotated corpus where each token is attached to a particular tag, similar to named entity recognition tasks. Later these tags are used in conjunction with pointer frameworks to find the final chunk. Solving this for a specific domain problem becomes a highly costly affair in terms of time and resources when a large, high-quality training set must be manually annotated. When the domain is specific and diverse, cold starting becomes even more difficult because of the large number of manually annotated queries needed to cover all aspects. To overcome this problem, we applied a grammar-based text generation mechanism where, instead of annotating sentences, we annotate using grammar templates. We defined various templates corresponding to different grammar rules. To create a sentence, we used these templates along with the rules, where symbol or terminal values were chosen from the domain data catalog. This helped us create a large number of annotated queries. These annotated queries were used to train the machine learning model using an ensemble transformer-based deep neural network model [24]. We found that grammar-based annotation was useful for solving domain-based chunks in input query sentences without any manual annotation, achieving a classification F1 score of 96.97% in classifying the tokens for out-of-template queries.
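The grammar-template idea can be sketched as follows: slots in a template are filled from a domain catalog, so every generated query carries its slot annotations for free. The templates and catalog entries here are invented examples, not the paper's grammar:

```python
import itertools

templates = ["show {METRIC} for {PRODUCT}",
             "compare {PRODUCT} with {PRODUCT2}"]
catalog = {"METRIC": ["sales", "returns"],
           "PRODUCT": ["laptops", "phones"],
           "PRODUCT2": ["tablets"]}

def expand(template, catalog):
    """Yield (query, annotations) pairs for every slot filling of a template."""
    slots = [s for s in catalog if "{%s}" % s in template]
    for values in itertools.product(*(catalog[s] for s in slots)):
        query, spans = template, {}
        for slot, value in zip(slots, values):
            query = query.replace("{%s}" % slot, value, 1)
            spans[value] = slot        # the fillings double as annotations
        yield query, spans

generated = [q for t in templates for q, _ in expand(t, catalog)]
print(len(generated))  # 6 annotated queries from just 2 templates
```

Each template is annotated once, yet the expansion yields a combinatorial number of labeled training queries, which is the cold-start saving the abstract describes.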
Using Class Frequency for Improving Centroid-based Text Classification (IDES Editor)
Most previous works on text classification represented the importance of terms by term occurrence frequency (tf) and inverse document frequency (idf). This paper presents ways to apply class frequency in centroid-based text categorization. Three approaches are taken into account. The first is to explore the effectiveness of inverse class frequency on the popular term weighting, i.e., TFIDF, as a replacement for idf and as an addition to TFIDF. The second approach is to evaluate some functions used to adjust the power of inverse class frequency. The other approach is to apply terms found in only one class or a few classes to improve classification performance, using two-step classification. From the results, class frequency shows its usefulness for text classification, especially in the two-step classification.
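The first approach, replacing idf with inverse class frequency (ICF), can be sketched as follows; the toy data is invented, and the paper's adjustment functions are not reproduced:

```python
import math

def tf_icf(term, doc, docs, doc_classes, n_classes):
    """TF times log inverse class frequency: terms rare across classes score high."""
    tf = doc.count(term) / len(doc)
    # cf = number of distinct classes whose documents contain the term
    cf = len({c for d, c in zip(docs, doc_classes) if term in d})
    return tf * math.log(n_classes / cf)

docs = [["goal", "match"], ["match", "score"], ["bank", "loan"]]
classes = ["sport", "sport", "finance"]
# "goal" occurs only in the "sport" class, so its icf factor is log(2/1):
w = tf_icf("goal", docs[0], docs, classes, n_classes=2)
print(round(w, 3))  # 0.347
```

A term appearing in every class would get icf = log(1) = 0, so class-spanning terms contribute nothing to the centroid, which is the intuition behind using cf instead of df.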
Near Duplicate Document Detection: Mathematical Modeling and Algorithms (Liwei Ren 任力偉)
Near-duplicate document detection is a well-known problem in the area of information retrieval. It is an important problem to be solved for many applications in the IT industry, and it has been studied in an extensive research literature. This article provides a novel solution to this classic problem. We present the problem with abstract models along with additional concepts such as text models, document fingerprints, and document similarity. With these concepts, the problem can be transformed into a keyword-like search problem with results ranked by document similarity. There are two major techniques: the first is to extract robust and unique fingerprints from a document; the second is to calculate document similarity effectively. Algorithms for both fingerprint extraction and document similarity calculation are introduced as a complete solution.
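A hedged sketch of the two building blocks named above: fingerprints, here hashed word shingles, and a document-similarity measure, here the Jaccard coefficient. The shingle size and hashing are illustrative choices, not the article's algorithms:

```python
def fingerprints(text, k=3):
    """Hash every k-word shingle of the text into a fingerprint set."""
    words = text.lower().split()
    return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

def similarity(doc_a, doc_b, k=3):
    """Jaccard similarity of the two documents' fingerprint sets."""
    fa, fb = fingerprints(doc_a, k), fingerprints(doc_b, k)
    return len(fa & fb) / len(fa | fb)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
print(round(similarity(a, b), 2))  # 0.4: one changed word breaks 3 shingles
```

Because a single edited word invalidates only the k shingles that overlap it, near-duplicates keep most fingerprints in common while unrelated documents share essentially none.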
A high-level introduction to text mining analytics, covering the building blocks, i.e., the most commonly used techniques of text mining, along with useful additional references and links for background literature, and R code to get you started.
A Document Exploring System on LDA Topic Model for Wikipedia Articles (ijma)
A large number of digital text documents are generated every day. Effectively searching, managing, and exploring this text data has become a main task. In this paper, we first present an introduction to text mining and the LDA topic model. Then we explain in depth how to apply the LDA topic model to a text corpus through experiments on Simple Wikipedia documents. The experiments include all the necessary steps: data retrieval, pre-processing, fitting the model, and an application in a document exploring system. The results of the experiments show the LDA topic model working effectively for clustering documents and finding similar documents. Furthermore, the document exploring system could be a useful research tool for students and researchers.
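The pre-processing step that precedes fitting LDA can be sketched as follows: tokenize, drop stop words, and build the bag-of-words counts a topic model consumes. The stop-word list below is illustrative, not the paper's:

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to"}

def preprocess(text):
    """Lowercase, strip trailing punctuation, and remove stop words."""
    tokens = [w.strip(".,").lower() for w in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

doc = "The topic model is an effective tool, and the tool scales."
bow = Counter(preprocess(doc))  # bag-of-words counts for one document
print(bow["tool"])  # 2
```

In practice a corpus-wide dictionary maps each remaining token to an integer id, and LDA is then fit on the resulting (token id, count) vectors.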
Most text classification problems involve multiple class labels, and hence automatic text classification is one of the most challenging and prominent research areas. Text classification is the problem of categorizing text documents into different classes. In the multi-label classification scenario, each document may be associated with more than one label. The real challenge in multi-label classification is the labelling of a large number of text documents with a subset of class categories. The feature extraction and classification of such text documents require an efficient machine learning algorithm that performs automatic text classification. This paper describes the multi-label classification of product review documents using a Structured Support Vector Machine.
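One-vs-rest is a common way to view multi-label classification: each label gets its own scorer, and a document receives every label whose score passes a threshold. In the sketch below, invented keyword scorers stand in for trained classifiers; this illustrates the label-subset output, not the paper's structured SVM:

```python
# Hypothetical label scorers: each label "fires" when enough of its
# keywords appear in the review. Labels and keywords are made up.
label_keywords = {"camera": {"lens", "photo", "zoom"},
                  "battery": {"battery", "charge", "power"},
                  "display": {"screen", "display", "pixels"}}

def predict_labels(review, threshold=1):
    """Return every label whose keyword overlap meets the threshold."""
    words = set(review.lower().split())
    return sorted(label for label, kws in label_keywords.items()
                  if len(words & kws) >= threshold)

print(predict_labels("great zoom but the battery barely holds a charge"))
# ['battery', 'camera']  -- one document, a subset of labels
```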
The Realization of Agent-Based E-mail automatic Handling System
1. The Realization of Agent-Based E-mail Automatic Handling System
CHEN Xiao-ping, LIU Gui-quan, WANG Xu-fa, ZHAO Lei
(Department of Computer Science and Technology,University of Science and Technology of China, Hefei 230027)
Abstract  Currently e-mail is an important network-based communication method. Based on agent techniques and machine learning methods, a kind of interface agent that can handle e-mails automatically for users is designed and implemented.

Keywords  Agent, machine learning, interface agent

1 Introduction

As an important means of communication, e-mail is being used by millions of network users, and the amount is increasing. Although users receive useful mails, they also receive many "garbage mails". Such mails not only waste a lot of computer resources but also make it difficult for users to access useful information. Therefore, users hope that the system can handle e-mails automatically: the system will inform the user when important mails arrive and delete the garbage mails.

For simplicity, the system was designed for English e-mails only.

2 The Basic Idea

It is reasonable to assume that users can appropriately determine the relevance of a particular mail to their interests. We model the user's determination and his/her action in handling a particular mail as a tuple:

< Document, Situation, Action >

Such tuples are called the user interest model, or interest model, where Document contains the sender of the e-mail, the sending date, the address of the sender, etc., together with a compressed representation of the mail text; Situation refers to the importance of the mail based on Document; and Action is the user's action in handling the mail, such as delete, save, print, reply, etc. In this paper, Situation is divided into 7 levels:

Situation = {Excellent, Very Good, Good, Normal, Poor, Very Bad, Terrible}

The mail agent learns her user's interest model while her user handles his/her e-mails. At the very beginning, the agent has no knowledge about her user and cannot give her user any help. But when an agent has learned to a certain degree, she can actively handle e-mails for her user.

3 The Method and Implementation

Obviously, some of the features of an e-mail have no effect on the user's interest. Thus, for the mail agent to learn her user's interest model, she must remove those useless features and represent the mail in a compressed fashion. In the vector space information retrieval paradigm, documents are represented as vectors[5]: assume some dictionary vector D, where each element di is a word. Each document then has a vector V, where element vi is the weight of word di for that document. If the document does not contain di, then vi = 0.

In the typical information retrieval setting there is a collection of documents from which an inverted index is created. However, for the application discussed here, e-mails arrive unexpectedly and the dictionary vector D is difficult to define beforehand; thus the traditional vector space representation is inappropriate for e-mails and needs to be modified, as discussed in detail below.

3.1 Representing E-mails

An e-mail consists of two components: header and body, where the header contains control information such as the sender of the e-mail, the sending date, the address of the sender, etc., and the body is the mail text.
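For illustration, the header/body split and the word filtering used by the system might be sketched as follows. This is a minimal sketch, not the paper's code: the Stop list shown is a tiny illustrative stand-in for the system's full Stop list, and a plain-text message with a blank line between header and body is assumed.

```python
import re

# Illustrative stand-in for the system's full Stop list.
STOP_LIST = {"the", "is", "a", "an", "of", "to", "and"}

def split_mail(raw: str):
    """An RFC-822-style message separates header and body with a blank line."""
    header, _, body = raw.partition("\n\n")
    return header, body

def extract_words(body: str):
    """Lower-case the body, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", body.lower())
            if w not in STOP_LIST]

header, body = split_mail(
    "From: alice@example.com\nDate: Mon, 1 Mar\n\n"
    "The meeting is moved to Friday.")
print(extract_words(body))  # → ['meeting', 'moved', 'friday']
```

The surviving words would then be stemmed and weighted as described below.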
When a new mail arrives, the agent reads in the header, then analyzes and saves the control information as a history record. Such information can be used to help further processing. Then the agent reads in the mail text and extracts individual words from it. The mail text is thus represented as a vector:

D = (d1, d2, d3, …, dn)

where di (i ∈ {1, 2, …, n}) is a word appearing in the mail body. For any di in D, if di belongs to the stop words (words so common as to be useless as discriminators, like "the" and "is"; these words were structured as a Stop list in the system), then di is removed from D.

For the remaining words in D, the agent uses the Porter suffix-stripping algorithm[1][2] to reduce them to their stems. For instance, computer, computing, and computability are all reduced to comput.

Words are then weighted using a "TFIDF" scheme: the weight vdi of a word di in an e-mail text D is derived by multiplying a term frequency ("TF") component by an inverse document (here, e-mail) frequency ("IDF") component:

vdi = (0.5 + 0.5 · tfi / tfmax) · log(n / dfi)        (Eqn. 1)

where tfi is the number of times word di appears in the e-mail text D (the term frequency), tfmax is the maximum term frequency over all words in D, n is the number of e-mails that have been handled, and dfi is the number of handled e-mails which contain di (the document frequency).

The process is illustrated in figure 1:
[Figure 1: The process of e-mail representation — the e-mail is split into header and body; the header is saved as a history record, while the body's word stream is filtered through the Stop list into keywords, reduced by suffix-stripping to stems, and weighted by the TFIDF scheme into the final weighted vector.]
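The weighting step of (Eqn. 1) can be sketched directly. This is a minimal sketch; the paper does not fix the base of the logarithm, so the natural logarithm is an assumption here.

```python
import math

def tfidf_weight(tf: int, tf_max: int, n: int, df: int) -> float:
    """Weight of a word per (Eqn. 1): a normalized term-frequency factor
    (0.5 + 0.5 * tf / tf_max) times the inverse e-mail frequency factor
    log(n / df), where n is the number of handled e-mails and df is the
    number of handled e-mails that contain the word.
    Note: log base is assumed natural here."""
    return (0.5 + 0.5 * tf / tf_max) * math.log(n / df)

# A word at the maximum term frequency, seen in 10 of 100 handled e-mails:
print(round(tfidf_weight(3, 3, 100, 10), 3))  # → 2.303
```

A word that occurs in every handled e-mail gets log(n/df) = 0 and thus weight 0, which matches the intuition that such a word does not discriminate between Situations.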
For the system discussed here, the e-mail Situation is divided into 7 levels and there are no other distinctions between e-mails that belong to the same Situation, so log(n/dfi) in (Eqn. 1) can be replaced by the number of e-mails in the current Situation that contain di.

3.2 Agent's Learning Process

The agent's learning process can be divided into 3 stages according to the agent's degree of adeptness:

1. Learning Stage

At this stage, the agent has no experience and just accumulates knowledge (about her user's interest model) according to her user's actions or evaluations. At this stage the agent cannot yet provide her user any help.
2. Growing-up Stage er, and the retrieval speed will decrease. Therefore,
After the agent has gained some experience, she mechanism to maintain the dictionary is very impor-
will be in the growing-up stage. Under the gained ex- tant. Agent uses following rules to maintain her dic-
perience, an agent can assist her user in dealing e- tionary:
mails. However, at this stage the agent has not been Rule 1: if a stem occurs very few, agent deletes it
competent enough that she needs further learning from her dictionary.
from her user’s feedback (especially in unexplored Rule 2: if a stem appears nearly the same frequen-
situations). For each e-mail, the agent presents her cy in every Situation, it is useless in classifying e-
evaluation to her user, if the user is not satisfied with mails, then agent deletes it from her dictionary and
agent’s evaluation, he/she can present his/her own stores it in her Stem-Stop list (analogous to the Stop
evaluation and the agent will update the interest mod- list introduced before). Agent uses frequency equili-
el based on this. bration (FE) to determine whether a stem should be
3. Applying Stage stored in her Stem-Stop list. The calculating method
If the agent has accumulated enough experience for a stem’s FE is given in (Eqn. 2), where E is the
with high accuracy and is permissive to handle e- FE of the stem, Si is the frequency the stem occurs in
mails for user, she is in the applying stage. As the fi- Situation i (i=1..7 refers to the 7 levels), and SA is the
nal stage of learning, the agent now can automatically mean of Si, i.e. SA = (S1 + … + S7)/7.
1
evaluate and handle e-mails for her user. For instance,
E = 2 ……
7
agent can delete a “Terrible” e-mails or break user’s (
∑S i −S A )
2
(Eqn. 2)
current work in case “Important” e-mails arrive. i=1
3.3 Agent’s Learning Method If a stem’s FE is less than a threshold (it is ad-
The e-mail agent employs statistic-based learning justable for user), this rule will be applied.
method. Firstly, the agent derives normalized vector Rule 3: user can either add some words to Stop
for each Situation based on the statistics over large list, or delete some words from Stop list.
amount of e-mails (the deriving method will be dis- 2. Learning Method
cussed below). Secondly, the agent chooses action ac- Agent’s learning module uses statistic-based
cording to the similarity between the current e-mail method. For every stem in the dictionary, the agent
vector and every normalized vector. During the pro- calculates its occurring frequency in each Situation.
cess, the agent will encounter the problem of dictio- Normalized vector for a Situation is obtained by sort-
nary construction and maintenance, which should be ing stems according to their occurring frequencies in
discussed first. the Situation, and Situation i’s normalized vector will
1. Dictionary Construction and Maintenance be denoted by Di.
Agent’s dictionary is dynamically constructed. As Suppose the current e-mail vector is denoted by D,
e-mails will be represented by stems, elements of the similarity between D and Di can be obtained by
agent’s dictionary are also stems. In addition to calculating the cosine of D and Di: SIM(D,Di) =
stems, the occurring frequency of every stem in every Cos(D,Di). (D and Di must have the same order).
Situation is also stored in agent’s dictionary. The The Situation corresponds to the maximum among
agent’s dictionary is initially empty. During the learn- the 7 similarities is the situation agent will choose.
ing process, new stems will be added to agent’s dic- The information in the e-mail header can be used
tionary and the occurring frequency of some old to revise the result. For example, a user may be inter-
stems might need recalculation. ested in e-mails from particular people. Agent learns
With the increasing of the number of handled e- such revising rules through inductive learning meth-
4. ods, which is not the topic of this paper. me threshold
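Putting these pieces together, the frequency equilibration of (Eqn. 2), the cosine similarity SIM(D, Di), and the tell-me / do-it thresholds of the Action Prediction step (defaults 0.7 and 0.95) might be sketched as follows. This is a sketch under assumed data layouts — sparse stem-to-weight dictionaries — not the paper's implementation.

```python
import math

# The 7 Situation levels named in section 2.
SITUATIONS = ["Excellent", "Very Good", "Good", "Normal",
              "Poor", "Very Bad", "Terrible"]

def frequency_equilibration(s: list) -> float:
    """E = 1 / sqrt(sum_i (S_i - S_A)^2) per (Eqn. 2).  A stem with
    identical frequency in all 7 Situations is maximally equilibrated,
    so we return infinity for that degenerate case."""
    s_a = sum(s) / len(s)
    d = math.sqrt(sum((si - s_a) ** 2 for si in s))
    return math.inf if d == 0 else 1.0 / d

def cosine(u: dict, v: dict) -> float:
    """SIM(D, Di) = Cos(D, Di) over sparse stem->weight vectors."""
    dot = sum(w * v.get(stem, 0.0) for stem, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return 0.0 if nu == 0 or nv == 0 else dot / (nu * nv)

def predict(mail_vec: dict, situation_vecs: dict,
            tell_me: float = 0.7, do_it: float = 0.95):
    """Choose the Situation with maximum similarity, then decide whether
    the agent stays quiet, suggests the action (tell-me), or takes it
    autonomously (do-it)."""
    best = max(situation_vecs,
               key=lambda s: cosine(mail_vec, situation_vecs[s]))
    sim = cosine(mail_vec, situation_vecs[best])
    if sim >= do_it:
        return best, "do-it"
    if sim >= tell_me:
        return best, "tell-me"
    return best, "no-action"
```

Under this sketch, a stem occurring equally often in every Situation gets an infinite (hence above-threshold) FE only because of the reciprocal form of (Eqn. 2); the maintenance rule itself compares FE against a user-adjustable threshold as described above.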
3. Action Prediction

Currently, the e-mail agent usually adopts the same user-defined action for e-mails in the same Situation. After the agent has determined that the current e-mail belongs to a Situation, she chooses one of the following actions: if the similarity between the e-mail and the Situation is above the tell-me threshold, the agent suggests that the user take the corresponding action; if the similarity is above the do-it threshold, the agent autonomously takes the corresponding action. The default values of the tell-me threshold and the do-it threshold are 0.7 and 0.95, respectively. The two thresholds can be set by the user, and the do-it threshold must be greater than the tell-me threshold.

4 Experimental Results

In order to test the capability of the e-mail agent, 55 e-mails were selected for the experiment. The parameters of the selected e-mails and the experimental results are as follows:

1. Compression properties: the maximum compressibility was 77.1%, the minimum compressibility was 35.7%, and the average length after compression was 113 words.

2. Correctness of prediction: number of predicted e-mails: 60 (with repetitions); number of wrongly predicted e-mails: 16.

[Figure 2: The relationship between errors and handled e-mails — the cumulative number of errors (up to 16) plotted against the number of handled e-mails (10 to 60).]
The relationship between the number of wrongly predicted e-mails and the number of handled e-mails is illustrated in figure 2. Figure 2 indicates that the error rate decreases as the number of handled e-mails increases: 11 errors occur in the first 20 e-mails, while only 5 errors occur in the last 40 e-mails.

5 Comparison with Related Works

With the fast development of the Internet, network-based services are currently a hotspot of computer applications. Some corporations (e.g., Microsoft[3]) and institutes[4] have researched how to automatically handle e-mails for the user, but for a variety of reasons those references do not concern themselves with the details of realization.

6 Conclusion and Future Directions

The experimental results show that the performance of the e-mail agent is to some extent satisfactory. Moreover, the method discussed in this paper can be applied to some other Internet-based services. As the e-mail agent uses a statistic-based learning method, problems relevant to context are inevitable. Thus, the agent will be more suitable for application if a natural language processing module is added.
References

[1] M.F. Porter. An algorithm for suffix stripping. Program, 14(3): 130-137, 1980.
[2] W.B. Frakes. Stemming algorithms. In: W.B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pp. 131-160. Prentice Hall, Englewood Cliffs, NJ, 1992.
[3] Based on the introduction of Kaifu Lee, the dean of the Chinese Institute of Microsoft, 1998.
[4] Y. Lashkari et al. Collaborative Interface Agents. MIT Media Laboratory, 1996.
[5] G. Salton and M.J. McGill. An Introduction to Modern Information Retrieval. McGraw-Hill, 1983.