2. Introduction to Expert Systems
Title: Author Identification System Using Keywords and
Pattern Frequency
OBJECTIVE OF SYSTEM: To identify or verify authors
through text analysis.
How it works: The system analyzes word frequency,
keyword frequency, and part-of-speech pattern frequency
(POS permutations) after keywords.
3. Problem definition and system objectives
PROBLEM DEFINITION: Given multiple texts, it is difficult
to identify or verify who wrote each one. This capability is
needed in many areas, such as copyright infringement
disputes, pragmatism research, and authorship attribution
of literary works.
System Purpose: The system addresses the above problem
using text analysis and Natural Language Processing (NLP)
techniques.
4. Introduction of the entire system
Our system analyzes word frequency, keyword frequency,
and POS patterns after keywords in the input text.
These data form a feature vector that represents the
writing style of a particular author.
Finally, these feature vectors are used to identify or verify
authors.
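As a rough illustration of the word-frequency part of such a feature vector, here is a minimal Python sketch. The vocabulary list, the sample sentence, and the function names are assumptions made for this example and are not taken from the system itself. Keyword and POS-pattern features would be appended to the same vector in a similar way.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequencies(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}

def feature_vector(text, vocabulary):
    """Return one relative frequency per vocabulary word, in vocabulary order."""
    freqs = relative_frequencies(tokenize(text))
    return [freqs.get(word, 0.0) for word in vocabulary]

# Hypothetical feature words and sample text, chosen only for illustration.
vocabulary = ["the", "honestly", "cheers"]
vector = feature_vector("Honestly, the trip was fine. Cheers, the end.", vocabulary)
print(vector)
```

Using relative rather than raw counts keeps vectors comparable between short and long texts.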
5. Specific details in the system
Word frequency: Authors tend to favor certain words.
Analyzing the frequency of these words can help identify
authors.
Keyword frequency: The frequency of certain keywords is
also important because authors frequently write about a
particular subject or topic.
POS Patterns: Authors also tend to use grammatical
patterns consistently. For example, an author may almost
always follow a particular keyword with the same part of speech.
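The POS-pattern idea can be sketched as follows. This is a simplified example under stated assumptions: TOY_POS is a hypothetical hand-made lexicon standing in for a real part-of-speech tagger, and only the single tag directly after the keyword is counted, whereas the system may use longer patterns.

```python
from collections import Counter

# Hypothetical hand-made lexicon standing in for a real POS tagger.
TOY_POS = {
    "i": "PRON", "you": "PRON", "really": "ADV",
    "love": "VERB", "miss": "VERB", "like": "VERB",
    "the": "DET", "beach": "NOUN",
}

def pos_after_keyword(tokens, keyword, lexicon=TOY_POS):
    """Count the POS tag of the token that directly follows each keyword occurrence."""
    tags = Counter()
    for current, following in zip(tokens, tokens[1:]):
        if current == keyword:
            tags[lexicon.get(following, "UNK")] += 1  # unknown words get "UNK"
    return tags

# Sample text invented for illustration.
tokens = "i really love the beach i really miss you i really like you".split()
patterns = pos_after_keyword(tokens, "really")
print(patterns)
```

In this sample, the keyword "really" is followed by a verb every time, which is the kind of consistent habit the feature is meant to capture.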
7. Problem
Debbie was on her honeymoon and wrote an email to her mother. However,
her mother suspected that Debbie might not have written the email and
contacted the police. We need to consider an expert system to ascertain
whether Debbie really wrote the email.
8. Overall system
Prepare four data sets: one set of emails received from Debbie after the
marriage (Questioned), one set of emails received from Debbie before the
marriage (Known1), one set of emails received from Jamie (Known2), and an
unspecified set of emails (Reference).
Using statistical analysis, discover Debbie's and Jamie's respective keywords
and check whether the keywords in Questioned match those of either person.
9. System in detail (1/2)
The system splits the sentences in the four data sets into words and counts how
often each word occurs. It then sorts the words by count.
For each of the three data sets, it collects the words whose share of use in the
reference is smallest compared with their share in that data set. These become keywords.
It compares the keywords of the Questioned with those of the two people and calculates
how many apply. The person with the highest number of applicable
keywords is assumed to be the writer.
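One possible reading of the keyword step is sketched below: rank each word by its share in an author's data set divided by its share in the reference, and keep the top few. The ratio score, the `floor` value for words missing from the reference, `top_n`, and the sample texts are all assumptions made for this example.

```python
from collections import Counter
import re

def word_shares(text):
    """Map each word to its share of all word tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}

def find_keywords(author_text, reference_text, top_n=3, floor=1e-6):
    """Rank words by (share in author text) / (share in reference), highest first."""
    author = word_shares(author_text)
    reference = word_shares(reference_text)
    # Words absent from the reference use a small floor to avoid division by zero.
    ratio = {w: share / max(reference.get(w, 0.0), floor) for w, share in author.items()}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(ratio, key=lambda w: (-ratio[w], w))[:top_n]

# Sample texts invented for illustration.
author_text = "lovely day lovely trip the hotel"
reference_text = "the day the trip the hotel the end"
keywords = find_keywords(author_text, reference_text, top_n=2)
print(keywords)
```

Common words such as "the" score low because they are just as frequent in the reference, while words the author uses unusually often rise to the top.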
10. System in detail (2/2)
Flow: Tokenization of sentences → Sort by word frequency → Find keywords based on references → Decision
Decision: Does the keyword in Known1 apply to the Questioned more than the keyword in Known2?
・Yes: Output Debbie label
・No: Output Jamie label
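The final decision in this flow could look like the following minimal sketch. It assumes a keyword "applies" when it occurs at least once in the Questioned text. The function names and the sample keywords are hypothetical.

```python
def count_applicable(keywords, questioned_tokens):
    """Number of keywords that occur at least once in the questioned text."""
    present = set(questioned_tokens)
    return sum(1 for keyword in keywords if keyword in present)

def attribute(questioned_tokens, known1_keywords, known2_keywords):
    """Return 'Debbie' if Known1 keywords apply more than Known2 keywords, else 'Jamie'."""
    score_known1 = count_applicable(known1_keywords, questioned_tokens)
    score_known2 = count_applicable(known2_keywords, questioned_tokens)
    return "Debbie" if score_known1 > score_known2 else "Jamie"

# Hypothetical keyword lists and questioned text, for illustration only.
questioned = "miss you loads mum".split()
label = attribute(questioned, ["loads", "mum"], ["cheers", "mate"])
print(label)
```

Because the question is "more than", a tie falls through to the "No" branch and yields the Jamie label. A real system might prefer to report ties as undecided.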
14. System in detail: how
① How to create the dataset
・Collect many sentences written by the person we want to identify.
・Split the sentences into words.
・Summarize the dataset as word frequencies, keyword frequencies, and keyword/pattern frequencies.
15. System in detail: how (continued)
② How to compare input sentences with the datasets
・Split the input sentences into words.
・Compare the input words with the dataset to see whether they are similar in word frequency, keyword
frequency, and keyword/pattern frequency.
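The slides do not say how similarity is measured. As one simple possibility, the sketch below compares the most frequent words of the input with those of a dataset using Jaccard overlap. Both the measure and the cutoff `n` are assumptions introduced for this example.

```python
from collections import Counter
import re

def top_words(text, n=5):
    """Return the set of the n most frequent words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {word for word, _ in Counter(tokens).most_common(n)}

def overlap_score(input_text, dataset_text, n=5):
    """Jaccard overlap between the top-n words of both texts (0.0 to 1.0)."""
    a, b = top_words(input_text, n), top_words(dataset_text, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Sample texts invented for illustration.
score = overlap_score("see you soon", "see you later")
print(score)
```

The same comparison can be repeated for keyword frequencies and keyword/pattern frequencies, giving one score per feature type.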
19. System in detail 1
・The knowledge needed
Known Dataset: Emails received from Debbie before marriage,
Emails received from Jamie
Reference Dataset: Large collection of emails from many different
senders
・The inference needed
Frequency comparison, Keyword matching, Pattern analysis
20. System in detail 2
• Frequency comparison
This builds a word frequency list from each email dataset and checks which words or keywords
overlap between them.
• Keyword matching
This splits the sentences in the emails into words, identifies the keywords, counts their frequency,
and searches for keywords that match.
• Pattern analysis
This calculates the POS patterns that follow keywords, comparing the reference dataset with the other
datasets. It also evaluates how much the keyword patterns in the emails overlap.