The Realization of an Agent-Based Automatic E-mail Handling System

                      CHEN Xiao-ping, LIU Gui-quan, WANG Xu-fa, ZHAO Lei

(Department of Computer Science and Technology, University of Science and Technology of China, Hefei 230027)


Abstract  E-mail is currently an important network-based means of communication. Based on agent techniques and machine learning methods, this paper designs and implements an interface agent that can handle e-mails automatically for its user.

Keywords  Agent, machine learning, interface agent

1 Introduction

As an important means of communication, e-mail is used by millions of network users, and their number keeps increasing. Besides useful mail, users also receive a large amount of "garbage mail". Such mail not only wastes computer resources but also makes it harder for users to reach the information they need. Users therefore want the system to handle e-mail automatically: to notify them when important mail arrives, and to delete the garbage mail.

For simplicity, the system was designed for English e-mails only.

2 The Basic Idea

It is reasonable to assume that users can appropriately judge how relevant a particular mail is to their interests. We model the user's judgment and his/her action on a particular mail as a tuple:

    < Document, Situation, Action >

Such tuples are called the user interest model, or simply the interest model. Here Document contains the sender of the e-mail, the sending date, the sender's address, etc., together with a compressed representation of the mail text; Situation is the importance of the mail judged from Document; and Action is the user's action on the mail, such as delete, save, print, reply, etc. In this paper, Situation is divided into 7 levels:

    Situation = {Excellent, Very Good, Good, Normal, Poor, Very Bad, Terrible}

The mail agent learns her user's interest model while the user handles his/her e-mails. At the very beginning the agent knows nothing about her user and cannot offer any help, but once she has learned to a certain degree, she can actively handle e-mails for her user.
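To make the model concrete, the tuple and the seven Situation levels could be written as the following Python sketch. It is only an illustration of the data structures described above; the class and field names are ours, not the original system's.

from dataclasses import dataclass
from enum import IntEnum

class Situation(IntEnum):
    """The 7 importance levels, ordered from worst to best."""
    TERRIBLE = 1
    VERY_BAD = 2
    POOR = 3
    NORMAL = 4
    GOOD = 5
    VERY_GOOD = 6
    EXCELLENT = 7

@dataclass
class Document:
    """Header fields plus a compressed (weighted-vector) body representation."""
    sender: str
    sender_address: str
    date: str
    body_vector: dict  # stem -> TF-IDF weight

@dataclass
class InterestRecord:
    """One <Document, Situation, Action> tuple of the interest model."""
    document: Document
    situation: Situation
    action: str  # e.g. "delete", "save", "print", "reply"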
3 The Method and Implementation

Obviously, some features of an e-mail have no bearing on the user's interest. Thus, for the mail agent to learn her user's interest model, she must remove those useless features and represent the mail in a compressed fashion. In the vector space paradigm of information retrieval, documents are represented as vectors[5]: assume some dictionary vector D, where each element di is a word. Each document then has a vector V, where element vi is the weight of word di in that document; if the document does not contain di, then vi = 0.

In the typical information retrieval setting there is a fixed collection of documents from which an inverted index is created. For the application discussed here, however, e-mails arrive unpredictably and the dictionary vector D is difficult to define beforehand, so the traditional vector space representation is inappropriate for e-mails and needs to be modified, as discussed in detail below.

3.1 Representing E-mails

An e-mail consists of two components, header and body: the header contains control information such as the sender of the e-mail, the sending date, and the sender's address; the body is the mail text.

When a new mail arrives, the agent reads in the header, then analyzes and saves the control information as a history record; such information can be used in further processing. Next, the agent reads in the mail text and extracts the individual words from it, so that the mail text is represented as a vector:

    D = (d1, d2, d3, …, dn)

where di (i ∈ {1, 2, …, n}) is a word appearing in the mail body. Any di that belongs to the stop words (words so common as to be useless as discriminators, like "the" and "is"; these words are organized as a Stop list in the system) is removed from D.

For the remaining words in D, the agent uses the Porter suffix-stripping algorithm[1][2] to reduce them to their stems. For instance, computer, computing, and computability are all reduced to comput.

The words are then weighted using a "TFIDF" scheme: the weight vdi of a word di in an e-mail text D is the product of a term frequency ("TF") component and an inverse document (here, e-mail) frequency ("IDF") component:

    vdi = (0.5 + 0.5 * tfi / tfmax) * log(n / dfi)    (Eqn. 1)

where tfi is the number of times word di appears in e-mail text D (the term frequency), tfmax is the maximum term frequency over all words in D, n is the number of e-mails that have been handled, and dfi is the number of handled e-mails that contain di (the document frequency).

The process is illustrated in Figure 1:

[Figure 1 shows the e-mail representation pipeline: the header feeds the history record, while the body yields a word stream that is filtered through the stop list into keywords, reduced by suffix stripping into stems, and weighted by the TFIDF scheme into a weighted vector.]

Figure 1: The process of e-mail representation
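The pipeline of Figure 1 can be sketched in a few lines of Python. This is a simplified illustration rather than the original implementation: the stop list is abbreviated, and a toy suffix stripper stands in for the full Porter algorithm[1][2].

import math
import re

STOP_LIST = {"the", "is", "a", "an", "of", "to", "and"}  # abbreviated stop list

def stem(word):
    # Stand-in for the Porter suffix-stripping algorithm:
    # strip a few common suffixes so that e.g. "computing" -> "comput".
    for suffix in ("ability", "ing", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def represent(body_text, n_handled, df):
    """Turn a mail body into a weighted vector per (Eqn. 1).

    n_handled: number of e-mails handled so far.
    df: dict mapping stem -> number of handled e-mails containing it.
    """
    words = re.findall(r"[a-z]+", body_text.lower())      # word stream
    keywords = [w for w in words if w not in STOP_LIST]   # stop-list filtering
    stems = [stem(w) for w in keywords]                   # suffix stripping

    tf = {}
    for s in stems:                                       # term frequencies
        tf[s] = tf.get(s, 0) + 1
    tf_max = max(tf.values(), default=1)

    # TFIDF weighting (Eqn. 1); unseen stems get df = 1 to avoid log(n/0).
    return {
        s: (0.5 + 0.5 * tf_s / tf_max) * math.log(max(n_handled, 1) / df.get(s, 1))
        for s, tf_s in tf.items()
    }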


For the system discussed here, the e-mail Situation is divided into 7 levels and no further distinction is made between e-mails belonging to the same Situation, so in the log(n/dfi) term of (Eqn. 1) the document frequency dfi can be computed per Situation, i.e., as the number of e-mails in the current Situation that contain di.
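Under this reading of the substitution (our interpretation of the sentence above), only the source of the document frequency changes in (Eqn. 1); a hypothetical helper makes the point explicit:

import math

def situation_weight(tf_s, tf_max, n_handled, df_in_situation):
    """(Eqn. 1) with the document frequency taken per Situation:
    df_in_situation counts only the handled e-mails of the candidate
    Situation that contain the stem."""
    return (0.5 + 0.5 * tf_s / tf_max) * \
        math.log(max(n_handled, 1) / max(df_in_situation, 1))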
3.2 Agent's Learning Process

The agent's learning process can be divided into three stages according to her degree of adeptness:

1. Learning Stage
At this stage the agent has no experience; she simply accumulates knowledge about her user's interest model from the user's actions and evaluations, and cannot yet give the user any help.

2. Growing-up Stage
After the agent has gained some experience, she enters the growing-up stage. Using the gained experience, the agent can assist her user in dealing with e-mails. However, at this stage she is not yet competent enough, and she needs further learning from her user's feedback (especially in unexplored situations). For each e-mail the agent presents her evaluation to the user; if the user is not satisfied with it, he/she can give his/her own evaluation, and the agent updates the interest model accordingly.

3. Applying Stage
If the agent has accumulated enough experience, evaluates with high accuracy, and is permitted to handle e-mails for the user, she is in the applying stage. In this final stage of learning the agent automatically evaluates and handles e-mails for her user; for instance, she can delete a "Terrible" e-mail, or interrupt the user's current work when an important e-mail arrives.

3.3 Agent's Learning Method

The e-mail agent employs a statistics-based learning method. First, the agent derives a normalized vector for each Situation from statistics over a large number of e-mails (the derivation is discussed below). Second, the agent chooses an action according to the similarity between the current e-mail vector and each normalized vector. In this process the agent faces the problem of dictionary construction and maintenance, which is discussed first.

1. Dictionary Construction and Maintenance
The agent's dictionary is constructed dynamically. Since e-mails are represented by stems, the elements of the dictionary are also stems; in addition, the dictionary stores the frequency with which every stem occurs in every Situation. The dictionary is initially empty; during learning, new stems are added to it and the frequencies of some old stems are recalculated.

As the number of handled e-mails grows, the dictionary may become larger and larger and retrieval slows down, so a mechanism for maintaining the dictionary is very important. The agent maintains her dictionary with the following rules:

Rule 1: if a stem occurs very rarely, the agent deletes it from her dictionary.

Rule 2: if a stem appears with nearly the same frequency in every Situation, it is useless for classifying e-mails, so the agent deletes it from her dictionary and stores it in her Stem-Stop list (analogous to the Stop list introduced before). The agent uses the frequency equilibration (FE) of a stem to decide whether it should be stored in the Stem-Stop list. A stem's FE is computed as in (Eqn. 2), where E is the FE of the stem, Si is the frequency with which the stem occurs in Situation i (i = 1..7 ranges over the 7 levels), and SA is the mean of the Si, i.e. SA = (S1 + … + S7)/7:

    E = ( Σ i=1..7 (Si − SA)^2 )^(1/2)    (Eqn. 2)

If a stem's FE is less than a threshold (adjustable by the user), this rule is applied.

Rule 3: the user can add words to, or delete words from, the Stop list.

2. Learning Method
The agent's learning module uses a statistics-based method. For every stem in the dictionary the agent calculates its frequency of occurrence in each Situation. The normalized vector of a Situation is obtained by sorting the stems according to their frequencies in that Situation; Situation i's normalized vector is denoted Di.

Let the current e-mail vector be denoted D. The similarity between D and Di is obtained by calculating their cosine: SIM(D, Di) = Cos(D, Di) (D and Di must order their components identically). The Situation whose similarity is the maximum among the seven is the one the agent chooses.
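The two core computations of this section, the frequency equilibration of (Eqn. 2) and the cosine-based choice of Situation, can be sketched as follows. The vectors are assumed to be stem-to-weight mappings, and the function names are ours, not the paper's:

import math

def frequency_equilibration(freqs):
    """FE of a stem per (Eqn. 2): freqs = [S1, ..., S7], one count per
    Situation. A small value means near-equal frequencies, i.e. a stem
    that is useless for classification (Rule 2)."""
    mean = sum(freqs) / len(freqs)
    return math.sqrt(sum((s - mean) ** 2 for s in freqs))

def cosine(u, v):
    """Cosine similarity of two stem -> weight vectors."""
    stems = set(u) | set(v)
    dot = sum(u.get(s, 0.0) * v.get(s, 0.0) for s in stems)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def choose_situation(mail_vector, normalized_vectors):
    """Pick the Situation whose normalized vector is most similar to the mail."""
    return max(normalized_vectors,
               key=lambda sit: cosine(mail_vector, normalized_vectors[sit]))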
The information in the e-mail header can be used to revise this result: a user may, for example, be especially interested in e-mails from particular people. The agent learns such revision rules through inductive learning methods, which are beyond the scope of this paper.

3. Action Prediction
Currently, the e-mail agent adopts the same user-defined action for all e-mails in the same Situation. After determining which Situation the current e-mail belongs to, the agent acts as follows: if the similarity between the e-mail and the Situation is above the tell-me threshold, the agent suggests that the user take the corresponding action; if the similarity is also above the do-it threshold, the agent takes the corresponding action autonomously. The default values of the tell-me and do-it thresholds are 0.7 and 0.95, respectively; both can be set by the user, and the do-it threshold must be greater than the tell-me threshold.
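The two-threshold policy can be written down directly. The suggest and execute hooks below are placeholders for the user-interface actions, not the system's actual interface:

TELL_ME = 0.7   # default tell-me threshold
DO_IT = 0.95    # default do-it threshold (must exceed TELL_ME)

def predict_action(similarity, action, suggest, execute):
    """Apply the tell-me / do-it policy to the chosen Situation's action."""
    if similarity > DO_IT:
        execute(action)      # act autonomously
    elif similarity > TELL_ME:
        suggest(action)      # only propose the action to the user
    # below the tell-me threshold the agent stays silent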
4 Experimental Results

To test the capability of the e-mail agent, 55 e-mails were selected for the experiment. The parameters of the selected e-mails and the experimental results are as follows:

1. Compression properties: maximum compression rate 77.1%; minimum compression rate 35.7%; average length after compression 113 words.

2. Correctness of prediction: number of predicted e-mails 60 (with repetitions); number of wrongly predicted e-mails 16.

[Figure 2 plots the cumulative number of errors (0-16) against the number of handled e-mails (10-60).]

Figure 2. The relationship between errors and handled e-mails


The relationship between the number of wrongly predicted e-mails and the number of handled e-mails is illustrated in Figure 2. The figure shows that the error rate decreases as the number of handled e-mails grows: 11 errors occur in the first 20 e-mails, but only 5 in the last 40.

5 Comparison with Related Works

With the fast development of the Internet, network-based services have become a hotspot of computer applications. Some corporations (e.g., Microsoft[3]) and institutes[4] have researched how to handle e-mails automatically for the user, but for a variety of reasons those references do not describe the details of the realization.

6 Conclusion and Future Directions

The experimental results show that the performance of the e-mail agent is to some extent satisfactory, and the method discussed in this paper can also be applied to other Internet-based services. Since the agent uses a statistics-based learning method, problems related to context are inevitable; the agent would be better suited for application if a natural language processing module were added.
References

[1] M.F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
[2] W.B. Frakes. Stemming algorithms. In: W.B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pp. 131-160. Prentice Hall, Englewood Cliffs, NJ, 1992.
[3] Based on a presentation by Kai-Fu Lee, head of Microsoft Research China, 1998.
[4] Y. Lashkari et al. Collaborative Interface Agents. MIT Media Laboratory, 1996.
[5] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
