I Know What You Wrote Last Summer
Using Cumulative Sum for Voice Unification in Authoring
Confidential SDL Information
Jonathan Slaughter – Business Consultant, SDL International
jslaughter@sdl.com
@JRSlaughterSDL
Today’s agenda
Overview
• What is it?
• History of Writer Analysis
Cumulative Sum
• Early origins
• Current uses
How it Works
• Creating a Voice
• Analysis of Authors
• Unification
Applicability and ROI
• Impact
• Where does it make sense?
• When is it “overkill?”
Examples
• Charts
• Customers
Q&A
What is it?
The Cumulative Sum technique is a recognition system applied to human utterance, whether written or spoken. Applied this way, it is commonly called “QSUM.”
It is a two-stage analysis based on:
1) sequences of language units (the normal unit is the sentence), and
2) counts of recurrent kinds of language use within each sentence.
It rests on “quantitative stylistics”: the use of mathematical models as a basis for examining the periodic, or recurrent, nature of language.
Literary “scholarship” versus “criticism”
Brief history
1859 – Augustus de Morgan, professor of mathematics at London University, first suggests using the number of words and the average word length of all the Epistles to confirm or deny the attribution of Hebrews to Paul.
1938 – Cambridge statistician G. Udny Yule develops the first formal word-length index format, focusing on word distribution within each sentence and across the document.
1960s – four major statistical studies of authorship:
• 1962 – Alvar Ellegard’s examination of the Junius Letters
• 1964 – Mosteller and Wallace’s study of the Federalist papers
• 1966 – Morton and McLeman’s work on the Pauline Epistles
• 1967 – Louis Milic’s analysis of Jonathan Swift’s prose
1988 – Andrew Morton brings cumulative sum tests, commonly used in industrial settings, into the study of human utterance.
1990 – QSUM techniques and graphs are used in a court case to attribute or refute ownership of a confession during an appeal. Further uses in court follow.
2005* – First uses of QSUM techniques to unify multiple authors’ “voices” into a single “voice.”
How does this fit in to business?
Global organizations are taking significant steps to improve, and reduce the cost of, creating and distributing content to their end-users. Examples include:
• Minimalism
• Global Authoring Practices/Training
• Workforce Globalization
• Content Management Systems
• Authoring Tools
What none of these tools and processes do is create a truly “homogeneous” voice for authored content.
• Content management systems optimize re-use (consistency) but assume the source content is of acceptable quality
• Global authoring and Minimalism teach “practices” but fail to address the effect of combined voices in re-used content
Voice Unification is a “next” step for organizations looking to establish optimal ROI on process and technology investments.
• A good investment where “brand image” and “brand communication” are central to company success
• Impact on technical material can vary, depending on target markets
• Recommended to clients centralizing source content development in organizations that have grown primarily through acquisition (loose integration) or have made significant shifts in development strategy
How to create/define your voice?
Understanding what your company “voice” sounds like is important. There are three common methods:
• Voice Creation
• Mean Voice Alteration
• Select Voice Modification
Each provides similar benefits, but the best option depends on a number of factors, including:
• Content types
• Number of voices
• Audience expectations
• Content re-use
Factors used to define
Cusum analysis aims to compare two aspects of habitual language use within a given text, segment of text, or combination of texts:
Length – the number of words in a sentence, written or uttered, by the person providing the sample.
• The cusum is the running sum of each sentence’s deviation, above or below, from the average sentence length. This produces the sld (sentence length distribution).
Habit – a feature of language use within each sentence. The most commonly used are the number of two- and three-letter words (23lw) and initial vowel words (ivw).
• The cusum of a habit takes the average number of occurrences per sentence and tracks the running deviation from that average.
QSUM charts can then be created by overlaying the graphs of both aspects, which gives a visual comparison of the two.
The closer the two charts (demonstrated on the next slides) are, the more “homogeneous” the user voices are, and the more likely it is that the text was written by the same person.
Voice Unification is a difficult process and requires conscious content creation.
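The two cusum series described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling used in the presentation: the sentence splitter is deliberately naive, all function names are hypothetical, and `chart_gap` is an illustrative stand-in for the visual "closeness" of two overlaid charts, not a standard QSUM statistic.

```python
import re
from itertools import accumulate

def split_sentences(text):
    """Naive splitter (assumption): break after . ! or ? followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]

def words(sentence):
    """Extract alphabetic tokens (apostrophes kept inside words)."""
    return re.findall(r"[A-Za-z']+", sentence)

def is_23lw(word):
    """Habit: two- and three-letter words."""
    return len(word) in (2, 3)

def is_ivw(word):
    """Habit: words that start with a vowel."""
    return word[0].lower() in "aeiou"

def cusum(values):
    """Running sum of each value's deviation from the mean of all values."""
    mean = sum(values) / len(values)
    return list(accumulate(v - mean for v in values))

def qsum_series(text, habit=is_23lw):
    """Return (cusum of sentence lengths, cusum of habit counts) for a text."""
    sentences = [w for w in (words(s) for s in split_sentences(text)) if w]
    lengths = [len(w) for w in sentences]
    habits = [sum(1 for x in w if habit(x)) for w in sentences]
    return cusum(lengths), cusum(habits)

def rescale(series):
    """Scale a series so its largest absolute value is 1 (for overlaying)."""
    peak = max(abs(x) for x in series)
    return [x / peak for x in series] if peak else list(series)

def chart_gap(a, b):
    """Hypothetical closeness score: mean absolute gap between rescaled series."""
    ra, rb = rescale(a), rescale(b)
    return sum(abs(x - y) for x, y in zip(ra, rb)) / len(ra)

sample = ("The cat sat on the mat. A dog ran. "
          "It is an old and empty house up north.")
length_cusum, habit_cusum = qsum_series(sample)
print(length_cusum)
print(habit_cusum)
print(chart_gap(length_cusum, habit_cusum))
```

Plotting both series against sentence number gives the overlaid chart the slide describes. A smaller `chart_gap` roughly corresponds to charts that track each other more closely. Passing `habit=is_ivw` swaps in the initial-vowel habit.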