Authorship attribution


Author Identification (شناسایی مولف)

Published in: Technology, Education
  • Lemmatisation (or lemmatization), in linguistics, is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
  • Oriental: of the East, e.g., China. Nuance: 1. a subtle or slight degree of difference, as in meaning, feeling, or tone; a gradation. 2. Expression or appreciation of subtle shades of meaning, feeling, or tone: "a rich artistic performance, full of nuance".
  • Adverbial: an adverbial phrase; prepositional: a preposition.
  • Pragmatic: realistic.
  • Nuances: subtleties, tone, purport; farewells: goodbyes; indentation.
  • When important features are removed, the accuracy of the classifier that identifies the correct class degrades the most, because the important attributes are the most influential in it.

    1. Basic Definitions, Stylometry, Features, Algorithms
    2. Authorship Attribution – Reza Ramezani. Importance: Danielle Jones • Last seen 18th June 2001. • After her disappearance a series of text messages was sent from her phone. • Linguistic analysis showed that the later messages were sent by her uncle, Stuart Campbell. • Campbell was convicted of Danielle’s murder on 19th December 2002, in part because of the linguistic evidence. Jenny Nicholl • Last seen 30th June 2005. • After her disappearance a series of text messages was sent from her phone. • Linguistic analysis showed that the later messages were sent by her classmate, David Hodgson. • Hodgson was convicted of Jenny’s murder on 19th February 2008, in part because of the linguistic evidence.
    3. Authorship Attribution • Definition – In the typical authorship attribution problem, a text of unknown authorship is assigned to one candidate author, given a set of candidate authors for whom text samples of undisputed authorship are available. – From a machine learning point of view, this can be viewed as a multi-class, single-label text-categorization task. – This task is also called authorship (or author) identification. • Idea – The main idea behind authorship attribution is that writing style can be captured by measuring textual features. – Authorship attribution is supported by statistical or computational methods. – This scientific field takes advantage of research advances in areas such as machine learning, information retrieval, and natural language processing.
    4. Supervised Learning
    5. Stylometry • Representation – Research in authorship attribution is done by attempting to define features that quantify writing style, a line of research known as “stylometry”. • Sentence length, word length, word frequencies, character frequencies, and vocabulary richness functions. – About 1,000 different measures were estimated by Rudman (1998).
    6. Stylometric Features • Lexical and Character Features – Consider a text as a mere sequence of word tokens or characters, respectively. • Syntactic and Semantic Features – Require deeper linguistic analysis. • Application-Specific Features – Can be defined only in certain text domains or languages.
    7. 1. Lexical Features • A simple and natural way to view a text is as a sequence of tokens grouped into sentences, with each token corresponding to a word, number, or punctuation mark. • Method – The very first attempts to attribute authorship were based on simple measures such as sentence length counts and word length counts. • Advantage – They can be applied to any language and any corpus with no additional requirements except the availability of a tokenizer. • The most straightforward approach to represent texts is by vectors of word frequencies (used in the vast majority of authorship attribution studies).
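A minimal sketch of this word-frequency representation, using a naive regex tokenizer as a stand-in for a real one:

```python
from collections import Counter
import re

def word_frequencies(text):
    """Represent a text as a vector of relative word frequencies."""
    tokens = re.findall(r"\w+", text.lower())  # naive tokenizer
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

freqs = word_frequencies("The cat sat on the mat. The dog sat too.")
# "the" accounts for 3 of the 10 tokens, so freqs["the"] is 0.3
```

In practice each text's dictionary would be aligned against a fixed vocabulary so that all texts share the same vector dimensions.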
    8. Classification & Authorship Attribution • Difference between style-based and topic-based text classification – The most common words (articles, prepositions, pronouns, etc.) are found to be among the best features to discriminate between authors. – Such words, usually called “function words”, are excluded from the feature set of topic-based text-classification methods since they do not carry any semantic information. • Various sets of function words have been used for English (150, 303, 365, 480, and 675 words), but limited information was provided about the way they were selected.
    9. Methods • A Simple Method – Extract the most frequent words found in the available corpus (comprising all the texts of the candidate authors). – Then, a decision has to be made about the number of frequent words that will be used as features. • In the earlier studies, sets of at most 100 frequent words were considered adequate to represent the style of an author. – Another factor that affects the feature-set size is the classification algorithm that will be used, since many algorithms over-fit the training data when the dimensionality of the problem increases. – Some machine learning algorithms (such as SVM) can deal with thousands of features.
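The first step of this simple method, extracting the most frequent words of the whole corpus as the candidate feature set, could be sketched like this (the toy corpus and the cut-off k are illustrative):

```python
from collections import Counter

def most_frequent_words(corpus_texts, k=100):
    """Return the k most frequent words across all candidate authors' texts."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(k)]

corpus = ["the cat sat on the mat", "the dog and the cat"]
features = most_frequent_words(corpus, k=3)
# "the" and "cat" head the list in this toy corpus
```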
    10. Methods (Cont’d) • Required Routines – Tokenizer (word extraction) – Conversion to lowercase – Stemmers – Lemmatizers – Detectors of common homographic forms • Disadvantage – The bag-of-words approach provides a simple and efficient solution but disregards word-order (i.e., contextual) information. • One Possible Solution – n-grams
    11. Methods (Cont’d) • n-grams – n contiguous words, also known as word collocations • Features – The dimensionality of the problem increases considerably with n, to account for all the possible combinations between words. – The representation produced by this approach is very sparse, since most of the word combinations are not encountered in a given (especially short) text, making it very difficult for a classification algorithm to handle effectively. – The classification accuracy achieved by word n-grams is not always better than that of individual word features.
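Word n-gram extraction is a one-line sliding window; a sketch with naive whitespace tokenization:

```python
def word_ngrams(text, n=2):
    """Extract contiguous word n-grams from a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = word_ngrams("to be or not to be")
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```

Even this six-word text yields four distinct bigrams, which hints at why the feature space grows so quickly with n.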
    12. Methods (Cont’d) • Writing Error Measures – Spelling errors • Letter omissions • Insertions – Formatting errors • “All caps” words – This method needs an accurate spell checker.
    13. Methods • Vocabulary Richness – Vocabulary richness functions attempt to quantify the diversity of the vocabulary of a text. – A typical example is the type-token ratio V/N, where V is the size of the vocabulary (unique tokens) and N is the total number of tokens of the text. • Unreliable Measures – Vocabulary size (V) depends heavily on text length (as the text length increases, the vocabulary also increases, quickly at the beginning and then more and more slowly). – Various functions have been proposed to achieve stability over text length, including K (Yule, 1944) and R (Honoré, 1979), with questionable results.
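The type-token ratio V/N is straightforward to compute; the sketch below also illustrates its length dependence, since repeating a text doubles N without adding any new types:

```python
def type_token_ratio(text):
    """V/N: number of unique tokens over total number of tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

short = "the cat sat on the mat"
ttr_once = type_token_ratio(short)                  # 5 types / 6 tokens
ttr_twice = type_token_ratio(short + " " + short)   # same 5 types / 12 tokens
# the ratio drops as the text grows, which is why V/N alone is unreliable
```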
    14. 2. Character Features • A text is viewed as a mere sequence of characters. • Character-Level Measures – Alphabetic character count – Digit character count – Uppercase and lowercase character count – Punctuation mark count – And so on … • Features – This type of information is easily available for any natural language and corpus. – It has been proven quite useful for quantifying writing style.
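These character-level counts need nothing beyond the standard library; a minimal sketch:

```python
import string

def char_counts(text):
    """Simple character-level style measures of a text."""
    return {
        "alphabetic": sum(c.isalpha() for c in text),
        "digits": sum(c.isdigit() for c in text),
        "uppercase": sum(c.isupper() for c in text),
        "lowercase": sum(c.islower() for c in text),
        "punctuation": sum(c in string.punctuation for c in text),
    }

stats = char_counts("Hello, World! 42")
# 10 alphabetic characters, 2 digits, 2 uppercase, 2 punctuation marks
```

For use as features, the raw counts would normally be normalized by text length.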
    15. Methods • Character n-grams – Extract frequencies of n-grams at the character level. • Features – An advantage of this representation is its tolerance to noise. – In cases of spelling errors, the character n-gram representation is not affected dramatically. • The words “simplistic” and “simpilstc” would still produce common character 3-grams. – For oriental languages, where the tokenization procedure is quite hard, character n-grams offer a suitable solution. – The procedure of extracting the most frequent n-grams is language-independent and requires no special tools. • Compression-Based Approaches – Will be discussed later …
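The noise-tolerance point can be checked directly with the slide's own example pair:

```python
def char_ngrams(text, n=3):
    """Set of character n-grams occurring in a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

shared = char_ngrams("simplistic") & char_ngrams("simpilstc")
# the misspelling still shares 3-grams such as 'sim' and 'imp' with the
# correct word, so frequency vectors over 3-grams remain partly aligned
```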
    16. 3. Syntactic Features • Employing syntactic information • Idea – The idea is that authors tend to unconsciously use similar syntactic patterns. – Therefore, syntactic information is considered a more reliable authorial fingerprint than lexical information. – This type of information requires robust and accurate NLP tools able to perform syntactic analysis of texts. • Syntactic measure extraction is a language-dependent procedure. • Such features will produce noisy datasets due to unavoidable errors made by the parser.
    17. Methods • Rewrite Rules – Extracting rewrite-rule frequencies from a full parse tree produced for each sentence. – Rewrite rules are used to analyze syntactic structure. – Consider the following rewrite rule: A:PP → P:PREP + PC:NP – It means that an adverbial prepositional phrase is constituted by a preposition followed by a noun phrase as a prepositional complement. – This information describes how words are combined to form phrases or other structures. – Experimental results have shown that this type of measure performs better than lexical and character features. • It needs an accurate, fully automated parser able to provide a detailed syntactic analysis of sentences.
    18. Methods (Cont’d) • Paragraph Analysis – Another attempt to exploit syntactic information was proposed by Stamatatos. – This sentence would be analyzed as follows: – NP[Another attempt] VP[to exploit] NP[syntactic information] VP[was proposed] PP[by Stamatatos] – where NP, VP, and PP stand for noun phrase, verb phrase, and prepositional phrase, respectively. – This type of information is simpler than rewrite rules. – It can be extracted automatically with relatively high accuracy. – The extracted measures refer to noun phrase counts, verb phrase counts, length of noun phrases, length of verb phrases, and so on …
    19. Methods (Cont’d) • Part-of-Speech (POS) – A POS tagger is a tool that assigns a tag of morpho-syntactic information to each word token based on contextual information. – Several researchers have used POS tag frequencies or POS tag n-gram frequencies to represent style. – POS tag information provides only a hint of the structural analysis of sentences, since it is not clear: • How the words are combined to form phrases • How the phrases are combined into higher-level structures
    20. 4. Semantic Features • Low-Level vs. High-Level – The previous methods operate at the context level (low-level), not the semantic level (high-level). – NLP tools can be applied successfully to low-level tasks such as: • Sentence splitting • POS tagging • Text chunking • Partial parsing – More complicated tasks, such as the following, cannot yet be handled adequately by current NLP technology for unrestricted text: • Full syntactic parsing • Semantic analysis • Pragmatic analysis – As a result, very few attempts have been made to exploit high-level features for stylometric purposes.
    21. Semantic Feature Tools • Producing semantic dependency graphs – Gamon, Michael. "Linguistic correlates of style: authorship classification with deep linguistic analysis features." In Proceedings of the 20th International Conference on Computational Linguistics, p. 611. Association for Computational Linguistics, 2004. • Extracting semantic measures based on WordNet – McCarthy, Philip M., Gwyneth A. Lewis, David F. Dufty, and Danielle S. McNamara. "Analyzing writing styles with Coh-Metrix." In Proceedings of the Florida Artificial Intelligence Research Society International Conference (FLAIRS), pp. 764-769. 2006. • Using the theory of Systemic Functional Grammar (SFG) – Argamon, Shlomo, Casey Whitelaw, Paul Chase, Sobhan Raj Hota, Navendu Garg, and Shlomo Levitan. "Stylistic text classification using functional lexical features." Journal of the American Society for Information Science and Technology 58, no. 6 (2007): 802-822.
    22. 5. Application-Specific Features • One can define application-specific measures to better represent the nuances of style in a given text domain. – Defining structural measures to quantify authorial style in special domains, such as e-mail messages and online-forum messages. – Structural measures include: • The use of greetings and farewells in the messages • Types of signatures • Use of indentation • Paragraph length • And so on … – Other types of application-specific features can be defined only for certain natural languages, such as Greek.
    23. Attribution Methods • Profile-Based Approaches – Probabilistic models – Compression models – Common n-grams and variants • Instance-Based Approaches – Vector space models – Similarity-based models – Meta-learning models • Hybrid Approaches – Averaging methods
    24. Attribution Methods (Cont’d) • Profile-Based Approaches – Cumulative (per author) • Concatenate all the available training texts of each author into one big file (the author’s profile). • The stylometric measures extracted from the concatenated file may be quite different from those of each of the original training texts. • Instance-Based Approaches – Individual (each training text is treated separately)
    25. 1. Profile-Based Approaches
    26. 1.1. Probabilistic Models
    27. 1.2. Compression Models • Compression Models – Such methods do not produce a concrete vector representation of the author’s profile. • Steps – Initially, a compression algorithm is applied to each author’s concatenated training texts xa to produce a compressed file C(xa). – Then, the unseen text x is appended to each xa, and the compression algorithm is called again for each C(xa + x). – The difference in bit-wise size of the compressed files, d(x, xa) = C(xa + x) − C(xa), indicates the similarity of the unseen text to each candidate author. – These models are applied only to character sequences, not word sequences.
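A minimal sketch of this measure, using zlib as the off-the-shelf compressor (the choice of compressor is an assumption here; published methods often use specialized algorithms, and sizes are in bytes rather than bits):

```python
import zlib

def c(text):
    """Size in bytes of the compressed text, standing in for C(.)."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def compression_distance(author_profile, unseen):
    """d(x, xa) = C(xa + x) - C(xa): extra bytes the unseen text costs
    when compressed together with the author's profile."""
    return c(author_profile + unseen) - c(author_profile)

profile = "the quick brown fox jumps over the lazy dog " * 20
similar = "the quick brown fox jumps over the lazy dog"
different = "zygote quartz vexing jackdaws blitz phlegm"
# text resembling the profile compresses more cheaply on top of it,
# so its distance is smaller and the author is ranked as more likely
```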
    28. 1.3. Common n-grams (CNG)
    29. 1.3. Common n-grams (CNG) (Cont’d) • Parameters – The CNG method has two important parameters that should be tuned: – The profile size L • How many n-grams constitute the profile. – And the character n-gram length n • How long the n-grams in the profile are. – Keselj et al. (2003) reported their best results for 1,000 ≤ L ≤ 5,000 and 3 ≤ n ≤ 5. – The CNG distance function performs well when the training corpus is relatively balanced. – But it fails in imbalanced cases where at least one author’s profile is shorter than L. • CNG variants – Proposed to solve the class-imbalance problem.
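A sketch of the CNG profile and its relative-difference dissimilarity, with L and n as the tunable parameters described above (the exact distance formula follows Keselj et al., 2003, as I understand it):

```python
from collections import Counter

def cng_profile(text, n=3, L=1000):
    """The L most frequent character n-grams with normalized frequencies."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: count / total for g, count in grams.most_common(L)}

def cng_distance(p1, p2):
    """Sum of squared relative frequency differences over the union of
    both profiles; n-grams absent from a profile count as frequency 0."""
    d = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        d += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return d
```

Each term is bounded by 4 (reached when an n-gram appears in only one profile), so profiles that share no n-grams are maximally distant.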
    30. 2. Instance-Based Approaches
    31. 2.1. Vector Space Models • Definition – Each text can be considered as a vector in a multivariate space. – Then, a variety of powerful statistical and machine learning algorithms can be used to build a classification model, including: • Discriminant analysis • SVM • Decision trees • Neural networks • Genetic algorithms • Memory-based learners • Classifier ensemble methods • And so on. – Some of these algorithms can effectively handle high-dimensional, noisy, and sparse data, allowing more expressive representations of texts. – The effectiveness of these methods is diminished by class imbalance.
    32. 2.2. Similarity-Based Models • Idea – Calculate pairwise similarity measures between the unseen text and all the training texts, – And then estimate the most likely author with a nearest-neighbor algorithm. • Example – Compression model • Compress each training text into a separate file using an off-the-shelf algorithm. • C(x) is the bit-wise size of the compression of file x. • The difference C(x + y) − C(x) indicates the similarity of a training text x to the unseen text y.
    33. 2.3. Meta-learning Models • Definition – More complex algorithms specifically designed for authorship attribution. – The main goal is to use such meta-data to understand how automatic learning can become flexible in solving different kinds of learning problems: • And hence to improve the performance of existing learning algorithms. – The most interesting approach of this kind is the unmasking method.
    34. Unmasking Method • Unmasking Method – In the unmasking method, for each unseen text an SVM classifier is built to discriminate it from the training texts of each candidate author. – Thus, for n candidate authors, n classifiers are built for each unseen text. – Then, in an iterative procedure, a predefined number of the most important features of each classifier is removed and the drop in accuracy is measured. – At the beginning, all the classifiers have more or less the same, very high accuracy. – After a few iterations, the accuracy of the classifier that discriminates between the unseen text and its true author becomes very low, while the accuracy of the other classifiers remains relatively high. – This happens because the differences between the unseen text and the other authors are manifold, so removing a few features does not affect accuracy dramatically.
    35. 3. Hybrid Approaches • Hybrid Approaches – Methods that borrow elements from both profile-based and instance-based approaches. • Example – All the training text samples are represented separately, as in the instance-based approaches. – The representation vectors for the texts of each author are averaged feature-wise to produce a single profile vector per author, as in the profile-based approaches. – The distance of the profile of an unseen text from the profile of each author is then calculated by a weighted feature-wise function.
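The feature-wise averaging step of this hybrid approach might look like the following sketch, with instance vectors as dictionaries mapping feature names to values and missing features counted as 0:

```python
def average_profile(instance_vectors):
    """Feature-wise average of an author's instance vectors into one profile."""
    features = set().union(*instance_vectors)  # union of all feature names
    n = len(instance_vectors)
    return {f: sum(v.get(f, 0.0) for v in instance_vectors) / n for f in features}

profile = average_profile([{"the": 0.10, "of": 0.04},
                           {"the": 0.06, "and": 0.02}])
# each feature is averaged over the author's texts; "of" and "and"
# are averaged with an implicit 0 for the text that lacks them
```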