From their earliest incarnations in the seventeenth century, through their Georgian expansion into provincial and colonial markets, and culminating in their late-Victorian transformation into New Journalism, British newspapers have relied upon scissors-and-paste journalism to meet consumer demands for the latest political intelligence and diverting content. Although this practice, wherein one newspaper extracted or wholly duplicated content from another, is well known to scholars of the periodical press, in-depth analysis of the process is hindered by the lack of formal records relating to it. Anecdotes abound, but attributions were rarely and inconsistently given and, with no legal requirement to recompense the original author, formal records of where material was obtained were unnecessary. Even if such records had existed, the number of titles that relied upon reprinted material makes systematic analysis impossible; for many periodicals, only a few issues, let alone business records, survive.
However, mass digitisation of these periodicals, in both photographic and machine-readable form, offers historians a new opportunity to rediscover the mechanics of nineteenth-century reprinting. By undertaking multi-modal and multi-scale analyses of digitised periodicals, we can begin to reconstruct the precise journeys these texts took from their first appearance to their multiple ends. Moreover, by repurposing individual ‘boutique’ research outputs within large-scale textual analyses, we can greatly enhance the resolution of our computer-aided conclusions and bridge the gaps between commercial, state and private databases.
First, this paper will explore the possibilities of large-scale reprint identification, using out-of-the-box and project-specific software, within and across digitised collections. Second, it will demonstrate how reprint directionality and branching can be established, and compare the relative precision of manual and computer-aided techniques. Finally, it will explore the nature of multi-scale analysis and how we might best reintegrate ‘boutique’ periodical research into large-scale text-mining projects.
Boutique Big Data: Reintegrating Close and Distant Reading of 19th-Century Newspapers
1. BOUTIQUE BIG DATA
Reintegrating Close and Distant Reading of 19th-Century Newspapers
M. H. Beals (ORCID: 0000-0002-2907-3313)
Loughborough University
@MHBEALS
2. THE HISTORICAL PROBLEM
Image Courtesy of Mike Licht (CC BY) at https://www.flickr.com/photos/notionscapital/2313507405
• Culture of Reprinting in 18th and 19th Centuries
• Inconsistent Attribution
• Inconsistent Survival of Network Components
• Limited Historiographical Resources
3.
4. SEARCH AND TRANSCRIBE
Left Image Courtesy of Dan Tantrum (CC BY NC ND) at https://www.flickr.com/photos/tantrum_dan/2344581860
5. COPYFIND REPRINT DETECTION
• Freeware Programme Developed by Lou Bloomfield
http://plagiarism.bloomfieldmedia.com/z-wordpress/software/copyfind/
• Highly Customisable Search As Well as Open Source
• Measures Left, Right and Overall Matches
• Displays Left-Right Comparisons of Text
Image Courtesy of Lou Bloomfield at http://rabi.phys.virginia.edu/lab3e/
6.
7. COPYFIND IN OCR CORPORA
• Freeware Programme Developed by Lou Bloomfield (University of Virginia)
• Highly Customisable Search Parameters
• Measures Left, Right and Overall Matches
• Displays Left-Right Comparisons of Text
• Extremely Effective at Discovering OCR-Transcribed Matches
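The left and right match figures listed above can be illustrated with a minimal sketch. It is not Copyfind's own algorithm: it simply marks every word that falls inside a run of `n` consecutive words shared by both texts, then reports the marked words as a count and as a share of each side. The function names, the tokenising rule and the default run length of three words are all assumptions made for illustration.

```python
import re


def words(text):
    """Lower-case the text and return its word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def start_positions(tokens, n):
    """Map every run of n consecutive words to the positions where it starts."""
    index = {}
    for i in range(len(tokens) - n + 1):
        index.setdefault(tuple(tokens[i:i + n]), []).append(i)
    return index


def match_report(left_text, right_text, n=3):
    """Count the words on each side that sit inside a shared n-word run."""
    left, right = words(left_text), words(right_text)
    left_index = start_positions(left, n)
    right_index = start_positions(right, n)
    shared = left_index.keys() & right_index.keys()

    def covered(index):
        # Collect every word position touched by a shared run.
        marked = set()
        for run in shared:
            for start in index[run]:
                marked.update(range(start, start + n))
        return len(marked)

    def percent(part, whole):
        return 100.0 * part / whole if whole else 0.0

    left_hits, right_hits = covered(left_index), covered(right_index)
    return {
        "left_words": left_hits,
        "right_words": right_hits,
        "left_pct": percent(left_hits, len(left)),
        "right_pct": percent(right_hits, len(right)),
    }


# Invented sample strings, not transcriptions from the corpus.
report = match_report(
    "An inquest was held at the Nag's Head",
    "Yesterday an inquest was held at the Nag's Head before the coroner",
)
```

A short text copied whole into a longer one scores high on its own side and lower on the other, which is why the two percentages are worth reading together. Exact-run matching of this kind is brittle against OCR errors; a tool intended for OCR corpora would need some tolerance for imperfect matches, which this sketch omits.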
11. ESTABLISHING LIKELY CANDIDATES
• Single Year (1810) Contained over 200,000 Possible Matches
• Removed Internal (Same Title) Reprints
• Restricted Match Size (90 Right, 90 Left or 160 Overall)
• Restricted Date Separation (200 Days)
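The three restrictions above can be read as a simple filter over candidate pairs. The sketch below is one possible reading, not the project's actual code. The slide does not say whether 90 and 160 are word counts or percentages, nor whether they are minimums, so the sketch treats them as opaque scores, assumes they are minimum thresholds, and assumes that meeting any one of them is enough. The `Match` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Match:
    # Hypothetical record for one candidate pair reported by the matcher.
    left_title: str
    left_date: date
    right_title: str
    right_date: date
    left_score: int
    right_score: int
    overall_score: int


def is_candidate(m, side_min=90, overall_min=160, max_days=200):
    """Apply the three restrictions; thresholds are assumed to be minimums."""
    if m.left_title == m.right_title:
        return False  # internal (same-title) reprint
    if abs((m.left_date - m.right_date).days) > max_days:
        return False  # too far apart in time
    return (
        m.left_score >= side_min
        or m.right_score >= side_min
        or m.overall_score >= overall_min
    )


def likely_candidates(matches):
    return [m for m in matches if is_candidate(m)]
```

Keeping the thresholds as keyword arguments makes it easy to rerun the filter with different settings and see how many of the initial matches survive each one.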
12. COMPARING DATABASES
• Historical Networks Bear Little Resemblance to Digitised Corpora
• Undigitised Collections Require Manual Discovery and Transcription
• Paywalled Collections (Currently) Require Search-and-Transcribe Inclusion
• Overcoming Political and Linguistic Divisions
13. ADVERTISEMENTS
• Reprinted in Same Title
• Reprinted in Other Titles
• Reprinted with Minor Variations
• Reprinted after Long Periods
• Similar Wording in Different Adverts
• Own Networks Ripe for Analysis!
14.
15. DIRECTIONALITY
• Reprint Maps are Non-Linear, Similar to Phylogenetic Trees
• Paths of Specific Branches Dictated by Date, Content, Errors
• Similar Method to Meme-Tracking (Adamic et al., 2014)
• Attributions Are Often Red Herrings
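The branching idea above can be sketched as a small tree-building routine. It uses only two of the three signals named (date and content): each version is linked to the strictly earlier version with which it shares the most three-word runs. Shared errors, the third signal, are not modelled. The function names and the sample texts are invented for illustration and do not come from the database.

```python
from datetime import date


def trigrams(text):
    """Return the set of three-word runs in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def likely_sources(versions):
    """Map each version name to its most similar earlier version, or None.

    versions: dict of name -> (publication date, text)
    """
    grams = {name: trigrams(text) for name, (_, text) in versions.items()}
    parents = {}
    for name, (when, _) in versions.items():
        best, best_shared = None, 0
        for other, (other_when, _) in versions.items():
            # Skips the version itself and anything not strictly earlier.
            if other_when >= when:
                continue
            shared = len(grams[name] & grams[other])
            if shared > best_shared:
                best, best_shared = other, shared
        parents[name] = best
    return parents


# Invented sample texts with hypothetical labels.
versions = {
    "A": (date(1810, 1, 6), "an inquest was held at the nags head before the coroner on the body"),
    "B": (date(1810, 1, 11), "an inquest was held at the nags head yesterday evening"),
    "C": (date(1810, 1, 17), "on friday an inquest was held at the nags head before the coroner"),
}
parents = likely_sources(versions)
```

Here the latest version is attached to the earliest one rather than to the intermediate one, because it preserves wording that the intermediate version dropped. That is the kind of branch a purely chronological chain would miss.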
16. ANOTHER DREADFUL MASSACRE…
Times Courier
Star
St. James Chronicle
Sydney Gazette
Morning Chronicle
Caledonian Mercury
Aberdeen Journal
17. AN AFFECTING INSTANCE OF SELF-MURDER
Coroner’s Inquest.–At half past two o’clock yesterday, an
Inquest was held at the Nag’s Head, Orange-court,
Leicester-fields, before Anthony Gell, Esq. Coroner for
Westminster, on the body of Madamoiselle Ann Paris, then
lying dead at No. 4, St. Martin’s-street, Leicester-fields.
Morning Chronicle (London, England, United Kingdom), 06 January 1810, p. 3,
available at the Scissors and Paste Database,
http://www.scissorsandpaste.net/381.
Trewman’s Exeter Flying Post (Exeter, England, United Kingdom), 11 January 1810,
p. 2, available at the Scissors and Paste Database,
http://www.scissorsandpaste.net/379.
Examiner (London, England, United Kingdom), 17 January 1810, p. 15, 16, available
at the Scissors and Paste Database, http://www.scissorsandpaste.net/380.
18. DIRECTIONALITY
| Perfect Match | Overall Match | Copy | Original | Reprint ID |
| --- | --- | --- | --- | --- |
| 559 (82% L, 31% R) | 559 (82%) L; 559 (31%) R | 1810-01-11_Trewman's Exeter Flying Post_379.txt | 1810-01-06_Morning Chronicle_381.txt | 381379 |
| 992 (96% L, 56% R) | 992 (96%) L; 992 (56%) R | 1810-01-17_Examiner_380.txt | 1810-01-06_Morning Chronicle_381.txt | 381380 |
| Perfect Match | Overall Match | View Both Files | File L | File R |
| --- | --- | --- | --- | --- |
| 559 (82% L, 31% R) | 559 (82%) L; 559 (31%) R | Side-by-Side | 1810-01-11_Trewman's Exeter Flying Post_379.txt | 1810-01-06_Morning Chronicle_381.txt |
| 992 (96% L, 56% R) | 992 (96%) L; 992 (56%) R | Side-by-Side | 1810-01-17_Examiner_380.txt | 1810-01-06_Morning Chronicle_381.txt |
| 533 (51% L, 78% R) | 533 (51%) L; 533 (78%) R | Side-by-Side | 1810-01-17_Examiner_380.txt | 1810-01-11_Trewman's Exeter Flying Post_379.txt |
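| Perfect Match | Overall Match | Original | Copy | Reprint ID |
| --- | --- | --- | --- | --- |
| 559 (82% L, 31% R) | 559 (82%) L; 559 (31%) R | 1810-01-06_Morning Chronicle_381 | 1810-01-11_Trewman's Exeter Flying Post_379 | 381379 |

Characters: 9923 (original), 3854 (copy)

| Type | Subtype | Text Copy | Text Original | Characters Removed | Characters Added | % Original | % Copy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Style | Capitalisation | CORONER'S INQUEST | Coroner's Inquest | 17 | 17 | 0.34% | 0.44% |
| Truncation | Text | | At half past two o'clock yesterday | 6919 | 0 | 69.73% | 0.02% |
| Addition | Text | An inquest was held yesterday evening | | 0 | 853 | 8.60% | 22.14% |
| Style | Punctuation | . | .-- | 3 | 1 | 0.04% | 0.03% |
| Style | Punctuation | ; | , | 1 | 1 | 0.02% | 0.03% |
| Style | Punctuation | , | | 0 | 1 | 0.01% | 0.03% |
| Style | Spelling | te | ea | 2 | 2 | 0.04% | 0.05% |
| Style | Punctuation | | , | 1 | 0 | 0.01% | 0.00% |
| Style | Punctuation | | , | 1 | 0 | 0.01% | 0.00% |
| Style | Punctuation | , | : | 1 | 1 | 0.02% | 0.03% |
| Style | Punctuation | | , | 1 | 0 | 0.01% | 0.00% |
| Style | Punctuation | , | : | 1 | 1 | 0.02% | 0.03% |
| Style | Punctuation | , | | 0 | 1 | 0.01% | 0.03% |
| Total | | | | | | 78.86% | 22.80% |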
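The file names in the comparisons above appear to follow a `date_title_id` pattern, and the two Reprint IDs shown look like the original's ID followed by the copy's ID (381 + 379, 381 + 380). The sketch below parses such names and orders a pair by date. Both the naming rule and the ID rule are inferred from these few examples rather than documented, and the function names are hypothetical. An earlier date makes a text a possible source, not a proven one.

```python
from datetime import date
from typing import NamedTuple


class Item(NamedTuple):
    published: date
    title: str
    item_id: str


def parse_name(filename):
    """Split 'YYYY-MM-DD_Title_ID(.txt)' into its three parts."""
    name = filename[:-4] if filename.endswith(".txt") else filename
    day, rest = name.split("_", 1)
    title, item_id = rest.rsplit("_", 1)
    return Item(date.fromisoformat(day), title, item_id)


def order_pair(name_a, name_b):
    """Return (original, copy), treating the earlier-dated file as the original.

    Items published on the same day are ordered by title, which is arbitrary.
    """
    first, second = sorted([parse_name(name_a), parse_name(name_b)])
    return first, second


def reprint_id(original, copy):
    # Assumed rule: original ID followed by copy ID.
    return original.item_id + copy.item_id


original, copy = order_pair(
    "1810-01-11_Trewman's Exeter Flying Post_379.txt",
    "1810-01-06_Morning Chronicle_381.txt",
)
```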
19. DIRECTIONALITY
| Perfect Match | Overall Match | Copy | Original | Reprint ID |
| --- | --- | --- | --- | --- |
| 992 (96% L, 56% R) | 992 (96%) L; 992 (56%) R | 1810-01-17_Examiner_380.txt | 1810-01-06_Morning Chronicle_381.txt | 381380 |

Characters: 5749 (copy), 9923 (original)

| Type | Subtype | Text Original | Text Copy | Characters Removed | Characters Added | % Original | % Copy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Truncation | Text | Coroner's Inquest.--At half past two | | 55 | 0 | 0.55% | 0.96% |
| Addition | Text | | On Friday, | 0 | 10 | 0.10% | 0.17% |
| Truncation | Text | Orange-court, | | 13 | 0 | 0.13% | 0.23% |
| Truncation | Text | before Anthony Gell, Esq | | 50 | 0 | 0.50% | 0.87% |
| Style | Punctuation | ; | . | 1 | 1 | 0.02% | 0.03% |
| Truncation | Text | that the deceased had lodged | | 174 | 0 | 1.75% | 3.03% |
| Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02% |
| Truncation | Text | She was also extremely incoherent | | 432 | 0 | 4.35% | 7.51% |
| Truncation | Text | told the witness that some one had | | 69 | 0 | 0.70% | 1.20% |
| Style | Capitalisation | M | m | 1 | 1 | 0.02% | 0.03% |
| Truncation | Text | At other times, the poor young lady | | 466 | 0 | 4.70% | 8.11% |
| Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02% |
| Truncation | Text | Immediately on the unfortunate | | 54 | 0 | 0.54% | 0.94% |
| Style | Capitalisation | m | M | 1 | 1 | 0.02% | 0.03% |
| Addition | Text | | | 0 | 1 | 0.01% | 0.02% |
| Truncation | Text | Mr. Emanuel Gristock, of Wardour-street | | 2840 | 0 | 28.62% | 49.40% |
| Truncation | Text | without a moment's hesitation, | | 31 | 0 | 0.31% | 0.54% |
| Editorial | Vocabulary | their | the | 5 | 3 | 0.08% | 0.14% |
| Style | Capitalisation | - | [unclear] | 1 | 1 | 0.02% | 0.03% |
| Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02% |
| Style | Spelling | at | te | 2 | 2 | 0.04% | 0.07% |
| Style | Punctuation | ; | , | 1 | 1 | 0.02% | 0.03% |
| Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02% |
| Style | Capitalisation | m | M | 1 | 1 | 0.02% | 0.03% |
| Style | Punctuation | ; | , | 1 | 1 | 0.02% | 0.03% |
| Style | Punctuation | ; | , | 1 | 1 | 0.02% | 0.03% |
| Style | Punctuation | | , | 0 | 1 | 0.01% | 0.02% |
| Truncation | Text | completely | | 10 | 0 | 0.10% | 0.17% |
| Style | Punctuation | , | | 1 | 0 | 0.01% | 0.02% |
| Style | Capitalisation | P | p | 1 | 1 | 0.02% | 0.03% |
| Style | Capitalisation | a disordered intellect | A DISORDERED INTELLECT | 22 | 22 | 0.44% | 0.77% |
| Style | Punctuation | . | ! | 1 | 1 | 0.02% | 0.03% |
| Total | | | | | | 43.20% | 74.57% |
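A tally of characters removed and added, like the one above, can be approximated automatically with Python's standard `difflib` module. The sketch below is an assumption about how such a tally might be produced, not a description of how these tables were made. It labels each change only as a truncation, addition or substitution; the finer Style and Editorial subtypes would still need human judgement. Its two percentage totals are simple shares (characters removed over original length, characters added over copy length) and are not claimed to reproduce the slide's percentage columns.

```python
from difflib import SequenceMatcher

# Coarse labels for difflib's opcode tags (an illustrative mapping).
LABELS = {"delete": "Truncation", "insert": "Addition", "replace": "Substitution"}


def tally_changes(original, copy):
    """List every character-level difference between original and copy."""
    matcher = SequenceMatcher(None, original, copy, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changes.append({
            "kind": LABELS[tag],
            "original_text": original[i1:i2],
            "copy_text": copy[j1:j2],
            "removed": i2 - i1,
            "added": j2 - j1,
        })
    return changes


def totals(changes, original, copy):
    """Share of the original that was removed and of the copy that was added."""
    removed = sum(c["removed"] for c in changes)
    added = sum(c["added"] for c in changes)
    return {
        "removed_share_of_original": 100.0 * removed / len(original) if original else 0.0,
        "added_share_of_copy": 100.0 * added / len(copy) if copy else 0.0,
    }


# Invented sample strings loosely modelled on the article's opening.
cut = tally_changes("Coroner's Inquest. At half past two", "Coroner's Inquest.")
grown = tally_changes("an inquest was held", "On Friday, an inquest was held")
```

Character-level comparison splits a change such as a fully capitalised phrase into several small substitutions, so grouping adjacent changes into a single row, as the tables do, would be a further manual or heuristic step.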
21. A MATTER OF SCALE
• Case-Study Search and Transcribe Limited by:
  • Time
  • Access to Relevant Collections
  • Creative Search Methods and Hidden Biases
• OCR-Reprint Matching Limited by:
  • OCR Quality
  • Reprint Matching Resolution (Article, Page, nGram)
22. RE-ANALYSING THE DATABASE
• Manual-to-OCR Matches Much More Accurate
• Finds a Small but Sometimes Crucial Set of New Matches
• Can Remap the Entire Reprint Network
23. BOUTIQUE BIG DATA?
• Shared Transcription Standards
• Collegial Sharing of Data and Results
• Reuse in New and Unexpected Ways
• Case Study Discoveries Refining Big Data Search Parameters
Image Courtesy of Mike Licht (CC BY) at https://www.flickr.com/photos/notionscapital/14032020799/