This document discusses extracting social networks from digital corpora to understand the dissemination of information. It covers identifying reprints and memes at scale using digitized newspapers. Methods include keyword searching, n-gram matching, and edition tracking. Understanding dissemination pathways involves identifying memes, modeling chronological spread, and constructing genealogical models. Both manual and computer-aided approaches are discussed, with future plans involving developing a computer program and directional social network database to better model relatedness factors and inform additional research.
Mapping Implicit Processes: Extracting Social Networks from Digital Corpora
1. VIEW THESE SLIDES
MAPPING IMPLICIT PROCESSES:
EXTRACTING SOCIAL NETWORKS FROM DIGITAL CORPORA
M. H. Beals
Sheffield Hallam University
@mhbeals
ABOUT ME
2. Overview
Understanding Scissors-and-Paste Journalism in Georgian Britain
Computer-Aided Identification of Reprints and Memes
Understanding Dissemination Pathways
Manual Construction of Social Networks
Computer-Aided Ordering of Dissemination Pathways
Future Plans
3. Scissors-and-Paste Journalism in Georgian Britain
Proliferation of Colonial and Provincial Presses
Spread of Journeyman Printers
Reduction of Stamp Duty
New Profit Models
Entertaining and Literary Content
Adverts to Attract Readers to Sell to Advertisers
Manual Dissemination of News
Limited Number of “Specials”
Postal Exchange, Subscriptions, Correspondence
No Telegraph until 1840s and Not Used for Miscellany
4. Computer-Aided Identification of Reprints & Memes
Promise
Large-Scale Digitisation Efforts
Keyword Searching
nGram Matching (WCopyFind)
Edition Tracking (Juxta)
Viral Texts Project (Cordell, Dillon, and Smith)
Large-Scale Corpus of Nineteenth Century Newspapers
Extensive, Automatic Repair of OCR Errors
Identification of Highly Reprinted Materials (Memes)
Discussion and Exploration of Meme Traits and Patterns
Perils
Discrete Digital Corpora (Paywalls)
Offline Penumbra (Curation)
Lost Nodes (Incomplete Data)
OCR Variability (50-80%)
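The OCR variability listed above means that exact phrase matching can miss reprints whose transcriptions differ by only a few characters. As a minimal sketch, assuming Python's standard `difflib` module as a stand-in (this is not how WCopyFind, Juxta, or the Viral Texts Project work internally), two OCR renderings of the same passage can be compared approximately. The function names, the sample strings, and the 0.8 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio, ignoring case and extra whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def is_probable_reprint(a, b, threshold=0.8):
    """Treat two passages as the same text if they are similar enough.

    The threshold is an arbitrary assumption and would need tuning per corpus.
    """
    return similarity(a, b) >= threshold


# Invented samples: a clean line, a simulated noisy OCR reading, an unrelated line
clean = "The Indians have killed and scalped a great number of persons"
noisy = "The Indians bave killcd and scalped a grcat number of perfons"
unrelated = "Prices of grain at the Corn Exchange were steady this week"

print(round(similarity(clean, noisy), 2))
print(is_probable_reprint(clean, noisy))
print(is_probable_reprint(clean, unrelated))
```

A ratio-based comparison tolerates scattered character errors, at the cost of being slower than exact n-gram lookup on a large corpus.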
5. Computer-Aided Identification of Reprints & Memes
# concordanceset.py
import re


def replace_words(text, word_dic):
    # Replace every key of word_dic found in text with its mapped value
    rc = re.compile('|'.join(map(re.escape, word_dic)))

    def translate(match):
        return word_dic[match.group(0)]

    return rc.sub(translate, text)


def getNGrams(wordlist, n):
    # Return every run of n consecutive words as a list of words
    return [wordlist[i:i+n] for i in range(len(wordlist)-(n-1))]


# Articles are expected as consecutively numbered files: <id>.txt
basenumber = input('What is the first id number? ')
number = str(basenumber)
numberint = int(basenumber)
basenumberend = input('What is the last id number? ')
endnumber = int(basenumberend)
ngram = input('How many words should be in a phrase? ')
ngrams = int(ngram)

# combine.txt holds the text whose phrases are searched for
combifile = 'combine.txt'
listopen = open(combifile, "r")
wordlist = listopen.read()
splitlist = wordlist.split()
listopen.close()

ngramslist = getNGrams(splitlist, ngrams)

# Sort the phrases, then delete adjacent duplicates
if ngramslist:
    ngramslist.sort()
    last = ngramslist[-1]
    for i in range(len(ngramslist)-2, -1, -1):
        if last == ngramslist[i]:
            del ngramslist[i]
        else:
            last = ngramslist[i]

# Build one CSV row per phrase: the phrase, then the id of every file containing it
tidystring = ''
for item in ngramslist:
    number = str(basenumber)
    numberint = int(basenumber)
    lineitem = " ".join(item)
    print(lineitem)
    tidystring += str('\n' + lineitem + ',')
    while (numberint<=endnumber):
        file = str(number + ".txt")
        fin = open(file, 'r')
        text = fin.read()
        fin.close()
        if lineitem in text:
            tidystring += str(number + ',')
        numberint = int(number)
        numberint += 1
        number = str(numberint)

# create an excelfile for this example
excel_file = "ngramcompiled.csv"
fout = open(excel_file, "w")
fout.write(tidystring)
fout.close()
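The script above works on numbered text files on disk. The same idea can be sketched on in-memory strings, which makes it easier to try on a small sample before running it over a corpus. The function names `get_ngrams` and `ngram_concordance` and the sample documents are hypothetical and not part of the original script. Like the original, it uses a plain substring test, so a phrase can also match inside longer words.

```python
def get_ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_concordance(base_text, documents, n):
    """Map each distinct n-gram of base_text to the ids of documents containing it.

    documents: dict mapping a document id to its text.
    Phrases found in no document are left out of the result.
    """
    concordance = {}
    for gram in sorted(set(get_ngrams(base_text.split(), n))):
        phrase = " ".join(gram)
        hits = [doc_id for doc_id, text in sorted(documents.items())
                if phrase in text]
        if hits:
            concordance[phrase] = hits
    return concordance


# Invented sample documents for illustration only
documents = {
    1: "a body of Indians attacked Greenfield station on the 27th",
    2: "we hear that a body of Indians attacked the station",
    3: "the price of corn rose sharply",
}
result = ngram_concordance(
    "a body of Indians attacked Greenfield station", documents, 4)
for phrase, ids in result.items():
    print(phrase, ids)
```

Documents that share many phrases with the base text are candidates for being reprints of it, and documents that share only some phrases may be abridged or rewritten versions.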
10. Manual Construction of Social Networks
The Glasgow Advertiser, 7 October 1793, p. 5
Knoxville, May 11.
IT is shocking to describe the bloody scenes that
have lately taken place in this district. The
Indians have killed and scalped a great number of
persons, among whom is Colonel Isaac Bledose,
who was massacred within 150 yards of his own
house.
On the 27th instant a body of Indians attacked
Greenfield station: they killed John Jervis, and
a negro fellow, belonging to Mrs. Tarker. By
the bravery of three young men, viz. William Neely,
William Wilson, and William Hall, the station
was preserved; they killed two Indians, wounded
several others, and put them to flight. It is to be
remembered, that Neely and Hall had each lost a
father and two brothers, and Wilson a brother, by
the savages. Men are now in pursuit of the Indians.
Full Discussion of Dissemination Pathway Available at: http://prezi.com/in4_bqvgmanr/
11. Manual Construction of Social Networks
Derived from
Glasgow News Archive, British Library 19th Century Newspapers,
NewspaperArchive.com, Readex Early American Newspapers, Newspapers.com, and the University of Kentucky
12. Computer-Aided Ordering of Dissemination Pathways
Binary Computer Model
Arbitrary Tolerance Levels
Reference to Additional Tables
Bypassing Missing Nodes
Flexibility
Difficult to Recreate Human Instinct…
…But is That a Bad Thing?
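One possible reading of the points above is a model that makes a yes-or-no decision for each printing: either it is linked to its most similar earlier printing, or it is left without a known source because nothing reaches the tolerance level. The sketch below is an assumption about how such a model could look, not the author's implementation. The function name `likely_sources`, the 0.7 tolerance, the paper titles, and the dates are all invented for illustration.

```python
from datetime import date
from difflib import SequenceMatcher


def likely_sources(versions, tolerance=0.7):
    """Link each version to its most similar earlier version, or to None.

    versions: dict mapping a title to a (publication date, text) pair.
    tolerance: arbitrary minimum similarity (0..1) required for a link.
    """
    links = {}
    for title, (published, text) in versions.items():
        best_title, best_score = None, tolerance
        for other, (other_date, other_text) in versions.items():
            if other == title or other_date >= published:
                continue  # a source must have been printed earlier
            score = SequenceMatcher(None, text, other_text).ratio()
            if score >= best_score:
                best_title, best_score = other, score
        links[title] = best_title
    return links


# Hypothetical titles and dates; the wording is adapted from the excerpt
versions = {
    "Paper A": (date(1793, 5, 11),
                "The Indians have killed and scalped a great number of persons"),
    "Paper B": (date(1793, 7, 1),
                "The Indians have killed and scalped a great number of persons "
                "in this district"),
    "Paper C": (date(1793, 10, 7),
                "Indians have killed a great number of persons in this district"),
}
links = likely_sources(versions)
for title, source in links.items():
    print(title, "<-", source)
```

If an intermediate printing is missing from the corpus, a model of this kind links the later text to the closest surviving earlier version, which is one way of bypassing a missing node.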
13. Computer-Aided Ordering of Dissemination Pathways
Phylogenetic Model
Image Courtesy of Fred Hsu (Wikipedia:User:Fredhsu on en.wikipedia)
CC-BY-SA-3.0 via Wikimedia Commons
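To illustrate the general idea of a phylogenetic (family-tree) model, the sketch below groups versions of a text by similarity using average-linkage clustering, the approach known in biology as UPGMA. This is a swapped-in, simple distance method chosen for brevity. The slides do not say which tree-building method was used, and the sketch does not infer the direction of copying. The function names and sample texts are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance(a, b):
    """Distance in 0..1 between two texts (0 means identical)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def average_distance(members_a, members_b, texts):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(distance(texts[x], texts[y])
                for x in members_a for y in members_b)
    return total / (len(members_a) * len(members_b))


def build_tree(texts):
    """Repeatedly merge the two closest clusters; return nested tuples of labels."""
    # Each cluster is a pair: (tree built so far, list of member labels)
    clusters = [(label, [label]) for label in texts]
    while len(clusters) > 1:
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0][1], pair[1][1], texts),
        )
        clusters = [c for c in clusters if c is not left and c is not right]
        clusters.append(((left[0], right[0]), left[1] + right[1]))
    return clusters[0][0]


# Invented sample texts: two variants each of two different news items
texts = {
    "A": "the indians have killed and scalped a great number of persons",
    "B": "the indians have killed and scalped a great number of people",
    "C": "prices of grain at the corn exchange were steady this week",
    "D": "prices of grain at the corn market were steady this week",
}
tree = build_tree(texts)
print(tree)
```

Versions that end up in the same branch are textually closest. Deciding which of them was the source still needs dates or other evidence.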
14. Future Plans
Computer Program
OCR Clean-up Processes
Division into Likely Meme Groupings
Variety of Relatedness Scores
Textual Integrity
Prefixes and Suffixes
Chronological Separation
Chronological-Geographical Feasibility
Well-Worn Path Modifier
Modeling of Relatedness Factors
Manual Corrections
Directional Social Network Database
Raw Data to Inform Additional Research
Direct Attributions
Parsing Compilations
Initial Discovery of Well-Worn Paths
Inclusion of Offline Materials
www.mhbeals.com/cnd
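Several of the relatedness factors listed above could be combined into a single score for each candidate link, and the surviving links stored as directed edges. The sketch below is a hypothetical illustration under stated assumptions, not the planned program. The weights, the travel times, the well-worn path counts, the place names, and the paper titles are all invented, and the "Prefixes and Suffixes" factor is not modeled.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass
class Printing:
    title: str
    place: str
    published: date
    text: str


# Invented minimum travel times in days between places
MIN_TRAVEL_DAYS = {("Town X", "Town Y"): 14, ("Town Y", "Town Z"): 30,
                   ("Town X", "Town Z"): 20}
# Invented counts of previously confirmed links between titles
WELL_WORN_PATHS = {("Paper A", "Paper B"): 5}


def relatedness(source, target, max_gap_days=365, path_bonus=0.05):
    """Score in 0..1 that target was copied from source; 0.0 if infeasible."""
    gap = (target.published - source.published).days
    travel = MIN_TRAVEL_DAYS.get((source.place, target.place), 0)
    if gap <= 0 or gap < travel:
        return 0.0  # fails chronological-geographical feasibility
    integrity = SequenceMatcher(None, source.text, target.text).ratio()
    closeness = max(0.0, 1.0 - gap / max_gap_days)  # chronological separation
    score = 0.7 * integrity + 0.3 * closeness       # arbitrary weights
    known = WELL_WORN_PATHS.get((source.title, target.title), 0)
    score += path_bonus * min(known, 3)             # well-worn path modifier
    return min(score, 1.0)


passage = "the indians have killed and scalped a great number of persons"
a = Printing("Paper A", "Town X", date(1793, 5, 11), passage)
b = Printing("Paper B", "Town Y", date(1793, 6, 1), passage)
c = Printing("Paper C", "Town Z", date(1793, 6, 5), passage)

# Directed edges (source, target, score) for every feasible link
edges = []
for source in (a, b, c):
    for target in (a, b, c):
        if source is not target:
            score = relatedness(source, target)
            if score > 0:
                edges.append((source.title, target.title, round(score, 2)))
print(edges)
```

Keeping the edges directional preserves who copied from whom, and manual corrections could then be recorded by overriding individual edges.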
15. VIEW THESE SLIDES ON SLIDESHARE
ABOUT ME
WWW.MHBEALS.COM