Now in the UK – the expert, not the method; becoming more like the US.
Two aspects of particular interest, and of concern, in my research were text length and the number of texts. As Coulthard (2004), Coulthard and Johnson (2007) and Chaski (2001) repeatedly point out, the texts that forensic experts usually work with are few and short: around 100-400 words. In her 2001 study, Chaski used texts that varied in length between 93 and 556 words. By contrast, Wallace and Mosteller, in their study of the Federalist Papers, looked at 85 essays, each 900 to 3,500 words long. The number of texts matters as much as their length: for example, Grant (2007) examines 63 texts (3 authors with 21 texts per author). This study examines 20 texts (18 known (K) texts and 2 query (Q) texts), which also differ in length (see Table 5 for details). Although they are longer than usual forensic texts, they are relatively few. This may make the analysis more difficult, as Grant's (2007) analysis breaks down when he reduces the number of texts per author. However, this study attempts to replicate, at least in part, a forensic experiment, and the research difficulties may be seen as real-world challenges.
All the Text's a Stage; And All the Function Words Merely Players? Statistical Analysis of Authorship
  Vlad Mackevic
  Aston University
Playing detective?
  In forensic science, investigators look for clues that the culprit leaves unwittingly.
  In linguistics, the clue is ‘unconscious language’, i.e. function words (de Vel, 2001; Argamon & Levitan, 2005; Burrows, 2003).
  A rather old idea (Wallace & Mosteller, 1964), revisited in Holmes & Forsyth (1995).
Advantages of Function Words in FL
  ‘Unconscious language’
  Numerous even in a relatively short text
  Can be easily counted
  Related to the Daubert Criteria
  Enables corpus analysis (Key Words in Context)
The Daubert Criteria
  1. The theory must have been tested;
  2. It must have been subjected to peer review and publication;
  3. It must have a known error rate;
  4. It must be generally accepted in the scientific community.
  (Tiersma & Solan, 2002, cited in Coulthard, 2004; Chaski, 1997; Grant, 2007)
Implications for linguists
  Increased pressure on linguists to use mathematical methods and repeatable procedures;
  Forensic linguists must serve justice;
  ‘Beyond reasonable doubt’ in criminal cases (Grant, 2010);
  ‘Raise legitimate doubt’ in civil cases (ibid.);
  The method is King, not the expert.
  It is ‘a challenge to the academic community to test the error rate and at the same time to fix an acceptable statistical equivalent for “beyond reasonable doubt”’ Coulthard (2004: 476)
  It is ‘the linguist's responsibility to create theoretically sound hypotheses’ and test them Chaski (2001: 2).
Idiolect
  Defined as the idiosyncratic use of dialect, idiolect is a way of speaking (and, consequently, writing) that is unique to each individual (Chaski, 1997).
  ‘The totality of the possible utterances of one speaker at one time in using a language to interact with one other speaker’ (Bloch, 1948, cited in Grieve, 2007: 255).
Theory
  Grant (2010) describes two theoretical frameworks:
    Idiolect is linked to neuroscience;
    The author is influenced by the language he/she is exposed to.
  De Vel's (2001) and Argamon & Levitan's (2005) claims that certain function words are unconscious linguistic choices are also a theory.
Theory (cont.)
  Grant (2010): ‘simple detection of consistency and determination of distinctiveness’ would be able to help practical authorship analysis more than even a strong theory.
Hypotheses
  The use of function words is unique to each individual (could be limited by context or genre): idiolect;
  The frequency of certain function words is an authorship marker (e.g. Holmes & Forsyth, 1995);
  The frequency of the semantic roles that certain function words play is also an authorship marker.
Semantic Roles
  Semantic roles are the word's functions in the specific context of the sentence.
  The words I analysed were AS, IT, THAT and THERE.
  Criteria: frequency (corpus) and explicit multiple meanings.
AS
  Function | Examples
  Start of time adjunct clause | As we approached the small hut; as I followed the masses
  Fixed phrase as [adj/adv] as | As easily as; as soon as, as well as
  AS + noun phrase | as a museum; as the red-light district
  AS at the start of a manner adjunct | as you can imagine; as the locals do
  AS could be replaced with because | big push for the Chinese people to learn English, as they have now made it mandatory in their schools
  AS is used for comparison | as if they knew we were on their turf; still as a board; the same as fall back in Chicago
IT
  Function | Examples
  IT serves as a dummy subject:
  IT + [to be] + predicative + infinitive | Its hard to enjoy a festival the same way
  IT + [to be] or other verb phrase (+ adj/noun phrase) + relative clause (that, if etc.) | It turns out Ill be going to at least four
  IT + [to be] + time reference | its time for Pendulum
IT (cont.)
  Function | Examples
  IT + seem/feel/any other perception verb | it stops feeling like Hannover
  IT + [to be] + noun phrase | it would have been a great day
  IT refers to something mentioned before | We woke up early to catch the ferry and it couldnt have been easier.
  IT is a part of a fixed phrase | We made it to Macau in less than 2 hours
THAT
  Function | Examples
  THAT begins a subordinate clause | I also couldnt help but notice that when I looked toward the island
  THAT could be replaced with which | It was the spot on the beach that was shaped like a triangle
  THAT is a determiner | That night, we all reconvened at the hotel
THERE
  Function | Examples
  THERE serves as a dummy subject | there are a few longhaired dogs
  THERE refers to a place | it was there strictly for the tourists
My Dataset
  Attribute | Author A | Author B
  Type of text | Travel blog | Travel blog
  Gender (self-declared) | Female | Male
  Mother tongue and variety (self-declared) | English (American) | English (perhaps Irish)
  Website URL (the data source) | http://www.travelblog.org | http://www.getjealous.com

  Corpus size (words) | K: 9 texts | K: 7 texts | K: 5 texts | K: 3 texts | Q text
  Author A | 20,875 | 16,118 | 11,024 | 6,260 | 2,479
  Author B | 7,991 | 6,176 | 4,241 | 2,611 | 750
Methodology
  Texts were imported into TEXTSTAT concordance software;
  The words AS, IT, THAT and THERE were chosen for their explicit, diverse meanings in the sentence;
  Quantitative analysis was used to determine how different (or similar) the authors were in terms of their frequency of use of function words and their meanings;
  The number of texts was reduced to see if at some point the analysis breaks down (compare to Grant, 2007);
  Statistical technique used: T-test.
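The counting step can be sketched in a few lines of Python. This is only an illustration, not the TEXTSTAT procedure used in the study: the tokenizer, the per-1,000-words normalisation, the function name `marker_rates` and the sample sentence are all assumptions. It also counts raw word forms only; assigning each occurrence to a semantic role still requires reading the concordance lines.

```python
import re
from collections import Counter

# The four function words analysed in the study.
MARKERS = ("as", "it", "that", "there")


def marker_rates(text, per=1000):
    """Return occurrences of each marker per `per` words of running text."""
    # Simple tokenizer (an assumption): runs of letters, optionally with an
    # internal apostrophe, so "it's" stays one token and is not counted as "it".
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {m: 0.0 for m in MARKERS}
    return {m: counts[m] * per / total for m in MARKERS}


# Made-up sample sentence, used only to exercise the function.
sample = "As we approached the small hut, it was clear that there was no one there."
rates = marker_rates(sample)
```

Normalising by text length matters here because the K and Q texts differ considerably in size, so raw counts would not be comparable.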
Matrix of Probabilities
  Application | PSA values | Meaning
  Clustering | PSA > 90% | Success
  Clustering and Differentiating | PSA ≥ 95% | ‘Beyond Reasonable Doubt’
  Differentiating | PSA < 85% | Definite failure (an error rate of 15% causes reasonable doubt)
  Clustering and Differentiating | PSA > 50% | Balance of probabilities; suitable for civil court
  PSA = probability of same authorship
  Clustering = the author of both texts is likely to be the same person
  Differentiating = the texts were written by different authors
  Beyond reasonable doubt: 95%
Findings: T-Test
  Clustering: analysing each marker of one author against the values of that marker in the Q text by the same author. How likely is that person to have produced the text?
  Discriminating: analysing each marker of one author against the values of that marker in the Q text by the other author. How likely is it that the K and Q texts were produced by the same person?
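One way to read the comparison described above is as a one-sample t-test: the marker values from an author's K texts form the sample, and the value in the Q text is the reference mean. The reference list points to a one-sample t-test calculator and to t-tables, which is why this reading is used here, but it remains an assumption. The sketch below does not compute a PSA value, because the slides do not say how PSA was derived; it only checks significance at the 0.05 level. The function names and the example rates are hypothetical.

```python
import math
import statistics

# Two-tailed critical values of t at alpha = 0.05, keyed by degrees of freedom.
# Only the corpus sizes used in the study (9, 7, 5 and 3 K texts) are covered.
T_CRIT_05 = {8: 2.306, 6: 2.447, 4: 2.776, 2: 4.303}


def one_sample_t(k_values, q_value):
    """Return (t, df) for H0: the mean of the K values equals the Q value."""
    n = len(k_values)
    mean = statistics.mean(k_values)
    sd = statistics.stdev(k_values)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("K values are identical; t is undefined")
    return (mean - q_value) / (sd / math.sqrt(n)), n - 1


def consistent_with_q(k_values, q_value):
    """True if the Q value does not differ significantly from the K texts."""
    t, df = one_sample_t(k_values, q_value)
    return abs(t) < T_CRIT_05[df]


# Hypothetical rates of one marker per 1,000 words in five K texts.
k_rates = [10.0, 12.0, 11.0, 13.0, 9.0]
same = consistent_with_q(k_rates, 11.5)   # Q value close to the K mean
other = consistent_with_q(k_rates, 20.0)  # Q value far from the K mean
```

With only three K texts the critical value rises to 4.303, so the test can reject far less often. This is one reason an analysis of this kind may break down as texts are removed.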
Findings: Reliability of markers
  All texts by one author were compared against each other.
  Every semantic role of each function word was included.
  Special attention: the success of the test depends on the amount of text.
  Not all markers are reliable; their frequency can be too low in a short text.
  Marker | Clustering | Discrimination
  AS | Very inconsistent | Consistent
  IT | Very consistent | Very consistent
  THAT | Depends on the amount of text (A: yes; B: no) | Depends on the amount of text (A: yes; B: no)
  THERE | Very consistent | Very consistent
T-Test: Success
  Beyond Reasonable Doubt (BRD): 95% or more
  Function word | Function | Clustering (A, B) and Discriminating results, in slide order
  AS | Start of time adjunct clause | FAIL, YES, BRD, NO, BRD
  AS | Fixed phrase as [adj/adv] as | BRD, FAIL, FAIL, YES, BRD
  AS | AS + noun phrase | FAIL, BRD, YES, YES, NO
  AS | AS at the start of a manner adjunct | FAIL, YES, BRD, N/A, NO
  AS | AS could be replaced with because | BRD, BRD, N/A, N/A, N/A
  AS | AS is used for comparison | YES, BRD, BRD, FAIL, NO
  Function word | Function | Clustering (A, B) and Discriminating results, in slide order
  IT | Dummy subject | YES, YES, BRD, FAIL, BRD
  IT | Dummy subject at the start of the sentence | FAIL, FAIL, FAIL, FAIL, NO
  THAT | That begins a subordinate clause | BRD, YES, FAIL, FAIL, NO
  THAT | That could be replaced with which | FAIL, FAIL, BRD, BRD, BRD
  THAT | That is a determiner | FAIL, FAIL, FAIL, YES, BRD
  THERE | Dummy subject | YES, BRD, N/A, FAIL, NO
  THERE | Dummy subject at the start of the sentence | FAIL, FAIL, N/A, FAIL, BRD
Results
  Marker | Success | Failure | Explanation
  AS | 50% | 33.33% | A fairly reliable marker. Would do in civil court.
  IT | 80% | 20% | The most reliable marker in this study. IT at the start of the sentence has no linguistic theory behind it, and failure was expected.
  THAT | 46.67% | 53.33% | Also in Mackevic (2011): “Very unreliable across all authors – enormous error rates; PSA shooting over 50% most of the time.”
  THERE | 30% | 50% | Marker totally unreliable.
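The success and failure percentages for IT, THAT and THERE can be reproduced from the T-test outcomes under one reading, which is an inference and not something the slides state: YES and BRD count as success, FAIL and NO count as failure, and N/A cells stay in the total but count as neither. For IT, only the dummy-subject row is used, since the sentence-initial row was expected to fail. The outcome lists below are copied from the T-test slides; the function name is hypothetical.

```python
# Outcomes copied from the T-test slides, in slide order for each semantic role.
OUTCOMES = {
    "IT": ["YES", "YES", "BRD", "FAIL", "BRD"],  # dummy-subject row only
    "THAT": [
        "BRD", "YES", "FAIL", "FAIL", "NO",      # begins a subordinate clause
        "FAIL", "FAIL", "BRD", "BRD", "BRD",     # replaceable with "which"
        "FAIL", "FAIL", "FAIL", "YES", "BRD",    # determiner
    ],
    "THERE": [
        "YES", "BRD", "N/A", "FAIL", "NO",       # dummy subject
        "FAIL", "FAIL", "N/A", "FAIL", "BRD",    # dummy subject, sentence-initial
    ],
}


def success_failure(cells):
    """Return (success %, failure %) over all cells, N/A cells included in the total."""
    total = len(cells)
    success = sum(c in ("YES", "BRD") for c in cells)
    failure = sum(c in ("FAIL", "NO") for c in cells)
    return round(100 * success / total, 2), round(100 * failure / total, 2)


summary = {marker: success_failure(cells) for marker, cells in OUTCOMES.items()}
```

Under the same reading the AS cells give the reported failure rate but a slightly different success rate, so the counting rule may not match the author's exactly.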
Discussion of Results
  Most of the markers are much better at discriminating than at clustering;
  A lot depends on the text's length: when I started removing texts from the corpus (9, then 7, then 5 and finally 3), the analysis began breaking down;
  6,000 words for the reference corpus is an approximate benchmark.
  Possible conclusion: function words are really better for longer texts, which also occur in forensic settings.
Why did T-test fail?
  Possible explanation: some markers occurred very rarely.
  They had little linguistic significance (no theory behind them).
  Analysis broke down with very consistent markers. Why?
  Possibly because the amount of text (number of words) was insufficient.
  For comparison: Grant (2010) also reports his analysis breaking down when the amount of text is reduced.
  Perhaps qualitative analysis is better for shorter texts,
  but it works against the Daubert Criteria.
Recommendations
  Use grammar reference books for the semantic roles of function words and a more detailed division of roles
  Choose different words (look at what worked for other authors)
  Try more texts, but short ones (e.g. 50 texts of 400 words each)
  Try more statistical techniques
Conclusion
  Function words: potentially another tool in a forensic linguist's toolbox
  T-Test: a good analytical tool;
  It returns exact results with certain error rates that are easy to interpret (consistent with the Daubert criteria)
  However, it also has some limitations, and additional analysis may be needed to complete the picture
  T-Test works better with discriminating than with clustering
  Analysis breaks down with small corpora
References
  NB: The references are from the original paper; some authors present in this list may not have been cited in the presentation.
  Books and Journals
  Argamon, S. & Levitan, S. (2005). Measuring the Usefulness of Function Words for Authorship Attribution [Online]. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71.6935&rep=rep1&type=pdf [Accessed 12 September 2010].
  Burrows, J. (2003). Questions of Authorship: Attribution and Beyond. Computers and the Humanities [Online] 37, pp. 5-23. Available at: http://www.springerlink.com/content/nv46t75125472350/ [Accessed 1 August 2010].
  Chaski, C. E. (1997). Who Wrote It? Steps Towards a Science of Authorship Identification. National Institute of Justice Journal (September Issue) [Online]. Available at: http://www.ncjrs.gov/pdffiles/jr000233.pdf [Accessed 31 January 2010].
  Chaski, C. E. (2001). Empirical evaluations of language-based author identification techniques. The International Journal of Speech, Language and the Law [Online] 8 (1), pp. 1-65. Available at: http://www.equinoxjournals.com/ojs/index.php/IJSLL/article/view/1690/1151 [Accessed 12 June 2008].
  Chaski, C. E. (2005). Who's at the Keyboard? Authorship Attribution in Digital Evidence Investigations. International Journal of Digital Evidence [Online] 4 (1), pp. 1-14. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.3852&rep=rep1&type=pdf [Accessed 31 January 2010].
  Coulthard, M. (1998). Identifying the Author. Cahiers de Linguistique Française [Online] 20, pp. 139-161. Available at: http://clf.unige.ch/display.php?idFichier=168 [Accessed 28 January 2010].
  Coulthard, M. (2004). Author Identification, Idiolect and Linguistic Uniqueness. Applied Linguistics [Online] 25 (4), pp. 431-447. Available at: http://www.business-english.ch/downloads/Malcolm%20Coulthard/AppLing.art.final.pdf [Accessed 27 January 2010].
  Coulthard, M. & Johnson, A. (2007). An Introduction to Forensic Linguistics: Language in Evidence. Abingdon: Routledge.
  De Vel, O. (2001). Multi-Topic E-mail Authorship Attribution Forensics. In: ACM Conference on Computer Security – Workshop on data mining for security applications. November 8, 2001. Philadelphia, PA [Online]. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.9951&rep=rep1&type=pdf [Accessed 31 August 2010].
  Grant, T. (2007). Quantifying Evidence in Forensic Authorship Analysis. The International Journal of Speech, Language and the Law [Online] 14 (1), pp. 1-25. Available at: http://www.equinoxjournals.com/IJSLL/article/view/3955/2428 [Accessed 12 July 2010].
  Grant, T. (2008). Dr Tim Grant: How text-messaging slips can help catch murderers. The Independent [Online]. (Last updated 9 September 2009). Available at: http://www.independent.co.uk/opinion/commentators/dr-tim-grant-how-textmessaging-slips-can-help-catch-murderers-923503.html [Accessed 11 September 2010].
  Grant, T. D. (2010). Txt 4n6: idiolect free authorship analysis? In: Routledge Handbook of Forensic Linguistics. Abingdon: Routledge.
  Grant, T. & Baker, K. (2001). Identifying reliable, valid markers of authorship: a response to Chaski. The International Journal of Speech, Language and the Law [Online] 8 (1), pp. 66-79. Available at: http://www.equinoxjournals.com/ojs/index.php/IJSLL/article/view/1691/1150 [Accessed 12 June 2008].
  Holmes, D. I. & Forsyth, R. S. (1995). The Federalist Revisited: New Directions in Authorship Attribution. Literary and Linguistic Computing [Online] 10 (2), pp. 111-127. Available at: http://llc.oxfordjournals.org/cgi/reprint/10/2/111 [Accessed 1 August 2010].
  Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
  Mitchell, E. (2008). The Case for Forensic Linguistics. BBC News [Online]. (Last updated 8 September 2008). Available at: http://news.bbc.co.uk/1/hi/sci/tech/7600769.stm [Accessed 11 September 2010].
  Rudman, J. (1998). The State of Authorship Attribution Studies: Some Problems and Solutions. Computers and the Humanities [Online] 31, pp. 351-365. Available at: http://www.springerlink.com/content/l023q7047388133x/fulltext.pdf [Accessed 2 August 2010].
  Websites:
  Textstat: http://neon.niederlandistik.fu-berlin.de/textstat/
  T-test Calculator: http://www.graphpad.com/quickcalcs/OneSampleT1.cfm
  T-Tables: http://www.statsoft.com/textbook/distribution-tables/#t