SlideShare a Scribd company logo
1 of 24
Download to read offline
Ancient History of the UK Web
With support by and thanks to Ning Wang and Adham Tamer
Josh Cowls, Scott A. Hale, Helen Margetts,
Eric T. Meyer, Ralph Schroeder, Taha Yasseri
Past Web Archive Activities at OII
• 2008-2009. JISC/NEH Transatlantic Digitisation Collaboration: World Wide Web of
Humanities (Jisc & NEH funded)
– OII, Internet Archive, Hanzo Archives
– Meyer, E.T., Carpenter, K., Middleton, M. (2009). World Wide Web of Humanities: Final
Report to JISC. Online:
http://www.jisc.ac.uk/media/documents/programmes/digitisation/humanitiesfinalrepor
t.pdf
• 2010. Researcher Engagement with Web Archives (Jisc funded)
– OII, VKS
– Dougherty, M., Meyer, E.T., Madsen, C., van den Heuvel, C., Thomas, A., Wyatt, S. (2010).
Researcher Engagement with Web Archives: State of the Art. London: JISC. Online:
http://ssrn.com/abstract=1714997 and http://ie-repository.jisc.ac.uk/544/
– Thomas, A., Meyer, E.T., Dougherty, M., van den Heuvel, C., Madsen, C., Wyatt, S. (2010).
Researcher Engagement with Web Archives: Challenges and Opportunities for
Investment. London: JISC. Online: http://ssrn.com/abstract=1715000 and http://ie-
repository.jisc.ac.uk/543/
– Dougherty, M., Meyer, E.T. (2014). Community, Tools, and Practices in Web Archiving:
The state of the art in relation to social science and humanities research needs. Journal
of the American Society of Information Science & Technology.
http://onlinelibrary.wiley.com/doi/10.1002/asi.23099/abstract
• 2011. Using Web Archives: A Futures Perspective (IIPC funded)
– OII
– Meyer, E.T., Thomas, A.J., Schroeder, R. (2011). Web Archives: The Future(s). London:
IIPC. Online: http://ssrn.com/abstract=1830025
Recent Web Archive Activities at OII
• 2013-2015: Jisc Big Data project (Jisc funded)
– OII, British Library
– Prepare and release hyperlink corpus
• 2014-2015: Big UK Domain Data for the Arts and Humanities (AHRC
funded)
– IHR, OII, British Library
– Supporting researchers in Arts & Humanities to use web archive data
– Producing edited book of empirical studies concerning the history of
the UK web
• First paper from these combined projects
– Hale, S.A., Yasseri, T., Cowls, J., Meyer, E.T., Schroeder, R., Margetts, H.
(2014, July). Mapping the UK webspace: Fifteen years of British
universities on the web. ACM WebSci’14, Bloomington, Indiana.
http://papers.ssrn.com/abstract=2435481 or
http://arxiv.org/abs/1405.2856
Big Data:
Demonstrating the Value of the UK Web Domain Dataset
for Social Science Research
This project aims to enhance JISC's UK Web
Domain archive, a 30 TB archive of the .uk
country-code top level domain collected from
1996 to 2010. It will extract link graphs from the
data and disseminate social science research
using the collection.
February 2012 - February 2014
Taming a mammoth:
Web Archive Dataset Preparation
30 TB compressed data
6.2TB metadata and links
2.5 TB temporal links
30 TB compressed data in (w)arc format
– Approx. 4.5 million files
– Mix of binary and plain text payloads along
with header data
– Two formats: old arc and newer warc
Housed at the BL, access restrictions
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://hits.guardian.co.uk/b/ss/guardiangu-blogs,guardiangu-news,guardiangu-
network/1/H.22.2/56938?ns=guardian&pageName=Prisoner+of+war+camps+in+the+UK+mapped+and+listed.+Download+the+d
ata%3AGraphic%3A1476560&ch=News&c3=GU.co.uk&c4=History+%28Books+genre%29%2CBooks%2CSecond+world+war+
%28News%29%2CGermany%2CUK+news%2CTechnology&c5=Not+commercially+useful%2CCorporate+IT&c6=Simon+Roger
s&c7=10-Nov-
08&c8=1476560&c9=Graphic&c10=Blogpost&c11=News&c13=&c25=Datablog&c30=content&h2=GU%2FNews%2Fblog%2FDa
tablog&c2=GUID:(none)
WARC-Date: 2010-12-05T02:58:00Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 66.235.138.18
WARC-Record-ID: <urn:uuid:7d5ce147-9b4b-46cb-8975-ee93b4d0dda8>
Content-Type: application/http; msgtype=response
Content-Length: 740
HTTP/1.1 302 Found
Date: Sun, 05 Dec 2010 02:58:00 GMT
Server: Omniture DC/2.0.0
X-C: ms-4.3.1
Expires: Sat, 04 Dec 2010 02:58:00 GMT
Last-Modified: Mon, 06 Dec 2010 02:58:00 GMT
Cache-Control: no-cache, no-store, must-revalidate, max-age=0, proxy-revalidate, no-transform, private
Pragma: no-cache
ETag: "4CFAFFB8-0E4C-7443902F"
Vary: *
P3P: policyref="/w3c/p3p.xml", CP="NOI DSP COR NID PSA OUR IND COM NAV STA"
Location: http://b.scorecardresearch.com/r?c2=6035250&d.c=gif&d.o=guardiangu-
network&d.x=243551159&d.t=page&d.u=http%3A%2F%2Fwww.guardian.co.uk%2Fnews%2Fdatablog%2F2010%2Fnov%2F08
%2Fprisoner-of-war-camps-uk
xserver: www422
Content-Length: 0
Keep-Alive: timeout=15
Connection: close
Content-Type: text/plain
Extract meta-data and links (wat format)
– Approx. 4.5 million files
– 6.2TB on disk compressed
– Housed at OII
– Structured JSON
– Different formats for arc/warcs
{
"Container": {
"Filename": "DOTUK-HISTORICAL-1996-2010-GROUP-AA-XAAAAA-20110428000000-
00000.arc.gz",
"Offset": "88937",
"Compressed": true,
"Gzip-Metadata": {
"Header-Length": "10",
"Inflated-CRC": "-1223265901",
"Inflated-Length": "26073",
"Deflate-Length": "4463",
"Footer-Length": "8"
}
},
"Envelope": {
"ARC-Header-Length": "102",
"ARC-Header-Metadata": {
"Date": "20080509081524",
"Target-URI": "http://www.ukhomeinteriors.co.uk/content/ext_corbels.php",
"Content-Length": "25970",
"Content-Type": "text/html",
"IP-Address": "83.223.106.10"
},
"Payload-Metadata": {
"Actual-Content-Type": "application/http; msgtype=response",
"Block-Digest": "sha1:MCCZNOKBJHTZ5MMMCUJGBPE25C2TVUWF",
"HTTP-Response-Metadata": {
"Headers-Length": "591",
"HTML-Metadata": {
"Head": {
"Title": "Exterior Corbels",
Plain text lists
Build own ad-hawk Hadoop cluster, fix
incompatibilities, divide into smaller batches
– Build plain text lists of pages and hyperlinks
– Remove error page (e.g., 404 Not Found)
– Remove pages not in .uk
– Standardize dates (many formats)
– Standardize hyperlinks (trailing /, etc.)
– Fix/remove tons of invalid hyperlinks (whitespace,
invalid characters, etc.)
Load results into Apache Hive (2.5 TB)
Source Destination Time
LinkText
http://octopus.well.ox.ac.uk:80/
http://octopus.well.ox.ac.uk:80/links.html
1032758438
Links
http://octopus.well.ox.ac.uk:80/
http://octopus.well.ox.ac.uk:80/projects.html
1001793436
Projects
http://octopus.well.ox.ac.uk:80/computing.sht
ml
http://debian.org/
1075794060
Debian/GNU
Overall Statistics
Third-level-
domains:
e.g.
ox.ac.uk
Relative size of second-level-domains
Number of links within SLD per node
Cross-domain links (2010)
Absolute Normalized to target size
Case of ac.uk
Mapping the UK Webspace:
Fifteen Years of British Universities on the Web
Hale et al., WebSci'14, available: http://arxiv.org/abs/1405.2856
121 UK universities
websites and links
1) League table ranking
2) Group affiliation
3) Geographical location
Group Affiliations
League table ranking
Geography
Colour ~ intensity
Gravity Law σ𝑖𝑗 =
𝑠𝑖𝑗
𝑠𝑖
𝑜𝑢𝑡
𝑠𝑗
𝑖𝑛
𝑠𝑖𝑗 =
𝑠𝑖
𝑜𝑢𝑡
𝑠𝑗
𝑖𝑛
𝑟0.28
Big UK Domain Data for the Arts and
Humanities
Primary aim: developing a methodological and
theoretical framework within which to study over 15
years of UK domain data – with lessons for the
future study of web archives more generally
Big UK Domain Data for the Arts and
Humanities
The dataset:
– Crawled from 1996 – 2013
– Approximately 65 TB, billions of words
– Building interface to allow search by retrieval
date, target domain of links, sentiment
– Allow qualitative and quantitative analysis – and
iteration between multiple research techniques
Big UK Domain Data for the Arts and
Humanities
Key outputs:
– Ten bursary projects using web archive data to
investigate a broad range of topics, for example…
• Armed services recruitment online
• The accessibility of the web for disabled users
• Online discussions of ‘Beat’ poetry
– An edited book of empirical studies concerning the
history of the UK web, featuring chapters on, for
example…
• Constitutional and institutional change in UK government
• The BBC’s online presence
• The ‘web of faith’ online
Next
● Studies underway at OII, BL, IHR
● Book and articles
– Study overall growth of .uk
– Case study of .gov.uk
– Study of media and select committee
visibility
● Releasing data open source

More Related Content

What's hot

Reports from the UKMHL and Historical Texts live lab
Reports from the UKMHL and Historical Texts live lab Reports from the UKMHL and Historical Texts live lab
Reports from the UKMHL and Historical Texts live lab
Jisc
 
DBpedia Mappings Wiki, SMWCon Fall 2013, Berlin
DBpedia Mappings Wiki, SMWCon Fall 2013, BerlinDBpedia Mappings Wiki, SMWCon Fall 2013, Berlin
DBpedia Mappings Wiki, SMWCon Fall 2013, Berlin
Anja Jentzsch
 
資訊素養工作坊PowerPoint
資訊素養工作坊PowerPoint資訊素養工作坊PowerPoint
資訊素養工作坊PowerPoint
kaikwong
 

What's hot (20)

Academic Libraries and Big Data: Trends in Collection, Publication, Preservat...
Academic Libraries and Big Data: Trends in Collection, Publication, Preservat...Academic Libraries and Big Data: Trends in Collection, Publication, Preservat...
Academic Libraries and Big Data: Trends in Collection, Publication, Preservat...
 
Open Access in Architectural Research
Open Access in Architectural ResearchOpen Access in Architectural Research
Open Access in Architectural Research
 
Digital Cultural Heritage and Open Education
Digital Cultural Heritage and Open EducationDigital Cultural Heritage and Open Education
Digital Cultural Heritage and Open Education
 
Reports from the UKMHL and Historical Texts live lab
Reports from the UKMHL and Historical Texts live lab Reports from the UKMHL and Historical Texts live lab
Reports from the UKMHL and Historical Texts live lab
 
3e Studiedag Webarchivering - Promise
3e Studiedag Webarchivering - Promise3e Studiedag Webarchivering - Promise
3e Studiedag Webarchivering - Promise
 
DBpedia Mappings Wiki, SMWCon Fall 2013, Berlin
DBpedia Mappings Wiki, SMWCon Fall 2013, BerlinDBpedia Mappings Wiki, SMWCon Fall 2013, Berlin
DBpedia Mappings Wiki, SMWCon Fall 2013, Berlin
 
Open.Ed
Open.EdOpen.Ed
Open.Ed
 
Linked Open Data in Libraries Archives & Museums
Linked Open Data in Libraries Archives & MuseumsLinked Open Data in Libraries Archives & Museums
Linked Open Data in Libraries Archives & Museums
 
Digital Humanities: A brief introduction to the field
Digital Humanities: A brief introduction to the fieldDigital Humanities: A brief introduction to the field
Digital Humanities: A brief introduction to the field
 
Maphub und Pelagios: Anwendung von Linked Data in den Digitalen Geisteswissen...
Maphub und Pelagios: Anwendung von Linked Data in den Digitalen Geisteswissen...Maphub und Pelagios: Anwendung von Linked Data in den Digitalen Geisteswissen...
Maphub und Pelagios: Anwendung von Linked Data in den Digitalen Geisteswissen...
 
Corpus Protocols IFLA Geneva August 2014 by Neil Smyth and Stella Wisdom
Corpus Protocols IFLA Geneva August 2014 by Neil Smyth and Stella WisdomCorpus Protocols IFLA Geneva August 2014 by Neil Smyth and Stella Wisdom
Corpus Protocols IFLA Geneva August 2014 by Neil Smyth and Stella Wisdom
 
Developing Open Access Content into Academic English Resources for Data-Drive...
Developing Open Access Content into Academic English Resources for Data-Drive...Developing Open Access Content into Academic English Resources for Data-Drive...
Developing Open Access Content into Academic English Resources for Data-Drive...
 
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
 
Disrupting the transactional library model: the challenges and opportunities ...
Disrupting the transactional library model: the challenges and opportunities ...Disrupting the transactional library model: the challenges and opportunities ...
Disrupting the transactional library model: the challenges and opportunities ...
 
Digital Libraries: Local and Global
Digital Libraries: Local and GlobalDigital Libraries: Local and Global
Digital Libraries: Local and Global
 
Launch of Welsh Newspapers Online
Launch of Welsh Newspapers OnlineLaunch of Welsh Newspapers Online
Launch of Welsh Newspapers Online
 
JCDL 2015 Tutorial Opening Slides
JCDL 2015 Tutorial Opening SlidesJCDL 2015 Tutorial Opening Slides
JCDL 2015 Tutorial Opening Slides
 
資訊素養工作坊PowerPoint
資訊素養工作坊PowerPoint資訊素養工作坊PowerPoint
資訊素養工作坊PowerPoint
 
Future Directions of the European Library
Future Directions of the European LibraryFuture Directions of the European Library
Future Directions of the European Library
 
Digital Cultural Heritage: Experiences from British Library
Digital Cultural Heritage: Experiences from British LibraryDigital Cultural Heritage: Experiences from British Library
Digital Cultural Heritage: Experiences from British Library
 

Viewers also liked (8)

The uk today
The uk todayThe uk today
The uk today
 
History of uk music press
History of uk music pressHistory of uk music press
History of uk music press
 
Culture in United Kingdom & Ireland
Culture in United Kingdom & IrelandCulture in United Kingdom & Ireland
Culture in United Kingdom & Ireland
 
The royal family of great britain
The royal family of great britainThe royal family of great britain
The royal family of great britain
 
Education in the uk
Education in the ukEducation in the uk
Education in the uk
 
Education In The Uk
Education In The UkEducation In The Uk
Education In The Uk
 
The uk education system
The uk education systemThe uk education system
The uk education system
 
Educational System in UK
Educational System in UKEducational System in UK
Educational System in UK
 

Similar to Ancient History of the UK Web

World Archaeology Congress paper
World Archaeology Congress paperWorld Archaeology Congress paper
World Archaeology Congress paper
dejp3
 
eResearch-Oz-Bellamy
eResearch-Oz-BellamyeResearch-Oz-Bellamy
eResearch-Oz-Bellamy
Craig Bellamy
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
Micah Altman
 

Similar to Ancient History of the UK Web (20)

A Service Perspective: Unlocking metadata to enhance discoverability and conn...
A Service Perspective: Unlocking metadata to enhance discoverability and conn...A Service Perspective: Unlocking metadata to enhance discoverability and conn...
A Service Perspective: Unlocking metadata to enhance discoverability and conn...
 
World Archaeology Congress paper
World Archaeology Congress paperWorld Archaeology Congress paper
World Archaeology Congress paper
 
Data Science at the ATI and BL Web Archiving
Data Science at the ATI and BL Web ArchivingData Science at the ATI and BL Web Archiving
Data Science at the ATI and BL Web Archiving
 
NORFest 2023 Lightning Talks Session Three
NORFest 2023 Lightning Talks Session Three NORFest 2023 Lightning Talks Session Three
NORFest 2023 Lightning Talks Session Three
 
Methodological Guidelines for Publishing Linked Data
Methodological Guidelines for Publishing Linked DataMethodological Guidelines for Publishing Linked Data
Methodological Guidelines for Publishing Linked Data
 
eResearch-Oz-Bellamy
eResearch-Oz-BellamyeResearch-Oz-Bellamy
eResearch-Oz-Bellamy
 
Where Do I Stand? Deconstructing Digital Collections [Research] Infrastructur...
Where Do I Stand? Deconstructing Digital Collections [Research] Infrastructur...Where Do I Stand? Deconstructing Digital Collections [Research] Infrastructur...
Where Do I Stand? Deconstructing Digital Collections [Research] Infrastructur...
 
EAA 2017 Re-engineering the process: How best to share, connect, re-use & pro...
EAA 2017 Re-engineering the process: How best to share, connect, re-use & pro...EAA 2017 Re-engineering the process: How best to share, connect, re-use & pro...
EAA 2017 Re-engineering the process: How best to share, connect, re-use & pro...
 
Chaos&Order: Using visualization as a means to
 explore large heritage collec...
Chaos&Order: Using visualization as a means to
 explore large heritage collec...Chaos&Order: Using visualization as a means to
 explore large heritage collec...
Chaos&Order: Using visualization as a means to
 explore large heritage collec...
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
 
Digital Scholarship at the British Library
Digital Scholarship at the British LibraryDigital Scholarship at the British Library
Digital Scholarship at the British Library
 
HIBERLINK: Reference Rot and Linked Data: Threat and Remedy
HIBERLINK: Reference Rot and Linked Data: Threat and RemedyHIBERLINK: Reference Rot and Linked Data: Threat and Remedy
HIBERLINK: Reference Rot and Linked Data: Threat and Remedy
 
Introduction to Edinburgh University Data Library and national data services
Introduction to Edinburgh University Data Library and national data servicesIntroduction to Edinburgh University Data Library and national data services
Introduction to Edinburgh University Data Library and national data services
 
NECTAR_VRE1
NECTAR_VRE1NECTAR_VRE1
NECTAR_VRE1
 
Reference Rot and Linked Data: Threat and Remedy
Reference Rot and Linked Data: Threat and RemedyReference Rot and Linked Data: Threat and Remedy
Reference Rot and Linked Data: Threat and Remedy
 
Ensuring Continuity of Access To Our Published Heritage
Ensuring Continuity of Access To Our Published HeritageEnsuring Continuity of Access To Our Published Heritage
Ensuring Continuity of Access To Our Published Heritage
 
Rethink research, illuminate history with the British Library
Rethink research, illuminate history with the British LibraryRethink research, illuminate history with the British Library
Rethink research, illuminate history with the British Library
 
Exploring the Web of Data for Earth and Environmental Sciences
Exploring the Web of Data for Earth and Environmental SciencesExploring the Web of Data for Earth and Environmental Sciences
Exploring the Web of Data for Earth and Environmental Sciences
 
bellamy_budapest
bellamy_budapestbellamy_budapest
bellamy_budapest
 
British Library Datasets Programme Feb 2011
British Library Datasets Programme Feb 2011British Library Datasets Programme Feb 2011
British Library Datasets Programme Feb 2011
 

More from Scott A. Hale

Oxford Digital Humanities Summer School
Oxford Digital Humanities Summer SchoolOxford Digital Humanities Summer School
Oxford Digital Humanities Summer School
Scott A. Hale
 

More from Scott A. Hale (12)

Researching Misinformation
Researching MisinformationResearching Misinformation
Researching Misinformation
 
Big Tech & Disinformation: What are the main threats and how can journalists ...
Big Tech & Disinformation: What are the main threats and how can journalists ...Big Tech & Disinformation: What are the main threats and how can journalists ...
Big Tech & Disinformation: What are the main threats and how can journalists ...
 
No Master Algorithm: Human-machine intelligence and the real-world needs of f...
No Master Algorithm: Human-machine intelligence and the real-world needs of f...No Master Algorithm: Human-machine intelligence and the real-world needs of f...
No Master Algorithm: Human-machine intelligence and the real-world needs of f...
 
Foreign-language Reviews: Help or Hindrance? (Slides)
Foreign-language Reviews: Help or Hindrance? (Slides)Foreign-language Reviews: Help or Hindrance? (Slides)
Foreign-language Reviews: Help or Hindrance? (Slides)
 
How much is said in a microblog? A multilingual inquiry based on Weibo and Tw...
How much is said in a microblog? A multilingual inquiry based on Weibo and Tw...How much is said in a microblog? A multilingual inquiry based on Weibo and Tw...
How much is said in a microblog? A multilingual inquiry based on Weibo and Tw...
 
Interactive Visualizations for teaching, research, and dissemination
Interactive Visualizations for teaching, research, and disseminationInteractive Visualizations for teaching, research, and dissemination
Interactive Visualizations for teaching, research, and dissemination
 
Oxford Digital Humanities Summer School
Oxford Digital Humanities Summer SchoolOxford Digital Humanities Summer School
Oxford Digital Humanities Summer School
 
Multilinguals and Wikipedia Editing
Multilinguals and Wikipedia EditingMultilinguals and Wikipedia Editing
Multilinguals and Wikipedia Editing
 
Mapping the UK Webspace: Fifteen Years of British Universities on the Web
Mapping the UK Webspace: Fifteen Years of British Universities on the WebMapping the UK Webspace: Fifteen Years of British Universities on the Web
Mapping the UK Webspace: Fifteen Years of British Universities on the Web
 
Design and Multilingual Users on Twitter and Wikipedia
Design and Multilingual Users on Twitter and WikipediaDesign and Multilingual Users on Twitter and Wikipedia
Design and Multilingual Users on Twitter and Wikipedia
 
Global connectivity and multilinguals in the Twitter network (slides)
Global connectivity and multilinguals in the Twitter network (slides)Global connectivity and multilinguals in the Twitter network (slides)
Global connectivity and multilinguals in the Twitter network (slides)
 
ECPR 2011 Leaders and Followers Experiment
ECPR 2011 Leaders and Followers ExperimentECPR 2011 Leaders and Followers Experiment
ECPR 2011 Leaders and Followers Experiment
 

Recently uploaded

怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
vexqp
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
vexqp
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
cnajjemba
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
vexqp
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 

Recently uploaded (20)

怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 

Ancient History of the UK Web

  • 1. Ancient History of the UK Web With support by and thanks to Ning Wang and Adham Tamer Josh Cowls, Scott A. Hale, Helen Margetts, Eric T. Meyer, Ralph Schroeder, Taha Yasseri
  • 2. Past Web Archive Activities at OII • 2008-2009. JISC/NEH Transatlantic Digitisation Collaboration: World Wide Web of Humanities (Jisc & NEH funded) – OII, Internet Archive, Hanzo Archives – Meyer, E.T., Carpenter, K., Middleton, M. (2009). World Wide Web of Humanities: Final Report to JISC. Online: http://www.jisc.ac.uk/media/documents/programmes/digitisation/humanitiesfinalrepor t.pdf • 2010. Researcher Engagement with Web Archives (Jisc funded) – OII, VKS – Dougherty, M., Meyer, E.T., Madsen, C., van den Heuvel, C., Thomas, A., Wyatt, S. (2010). Researcher Engagement with Web Archives: State of the Art. London: JISC. Online: http://ssrn.com/abstract=1714997 and http://ie-repository.jisc.ac.uk/544/ – Thomas, A., Meyer, E.T., Dougherty, M., van den Heuvel, C., Madsen, C., Wyatt, S. (2010). Researcher Engagement with Web Archives: Challenges and Opportunities for Investment. London: JISC. Online: http://ssrn.com/abstract=1715000 and http://ie- repository.jisc.ac.uk/543/ – Dougherty, M., Meyer, E.T. (2014). Community, Tools, and Practices in Web Archiving: The state of the art in relation to social science and humanities research needs. Journal of the American Society of Information Science & Technology. http://onlinelibrary.wiley.com/doi/10.1002/asi.23099/abstract • 2011. Using Web Archives: A Futures Perspective (IIPC funded) – OII – Meyer, E.T., Thomas, A.J., Schroeder, R. (2011). Web Archives: The Future(s). London: IIPC. Online: http://ssrn.com/abstract=1830025
  • 3. Recent Web Archive Activities at OII • 2013-2015: Jisc Big Data project (Jisc funded) – OII, British Library – Prepare and release hyperlink corpus • 2014-2015: Big UK Domain Data for the Arts and Humanities (AHRC funded) – IHR, OII, British Library – Supporting researchers in Arts & Humanities to use web archive data – Producing edited book of empirical studies concerning the history of the UK web • First paper from these combined projects – Hale, S.A., Yasseri, T., Cowls, J., Meyer, E.T., Schroeder, R., Margetts, H. (2014, July). Mapping the UK webspace: Fifteen years of British universities on the web. ACM WebSci’14, Bloomington, Indiana. http://papers.ssrn.com/abstract=2435481 or http://arxiv.org/abs/1405.2856
  • 4. Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research This project aims to enhance JISC's UK Web Domain archive, a 30 TB archive of the .uk country-code top level domain collected from 1996 to 2010. It will extract link graphs from the data and disseminate social science research using the collection. February 2012 - February 2014
  • 5. Taming a mammoth: Web Archive Dataset Preparation 30 TB compressed data 6.2TB metadata and links 2.5 TB temporal links
  • 6. 30 TB compressed data in (w)arc format – Approx. 4.5 million files – Mix of binary and plain text payloads along with header data – Two formats: old arc and newer warc Housed at the BL, access restrictions
  • 7. WARC/1.0 WARC-Type: response WARC-Target-URI: http://hits.guardian.co.uk/b/ss/guardiangu-blogs,guardiangu-news,guardiangu- network/1/H.22.2/56938?ns=guardian&pageName=Prisoner+of+war+camps+in+the+UK+mapped+and+listed.+Download+the+d ata%3AGraphic%3A1476560&ch=News&c3=GU.co.uk&c4=History+%28Books+genre%29%2CBooks%2CSecond+world+war+ %28News%29%2CGermany%2CUK+news%2CTechnology&c5=Not+commercially+useful%2CCorporate+IT&c6=Simon+Roger s&c7=10-Nov- 08&c8=1476560&c9=Graphic&c10=Blogpost&c11=News&c13=&c25=Datablog&c30=content&h2=GU%2FNews%2Fblog%2FDa tablog&c2=GUID:(none) WARC-Date: 2010-12-05T02:58:00Z WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ WARC-IP-Address: 66.235.138.18 WARC-Record-ID: <urn:uuid:7d5ce147-9b4b-46cb-8975-ee93b4d0dda8> Content-Type: application/http; msgtype=response Content-Length: 740 HTTP/1.1 302 Found Date: Sun, 05 Dec 2010 02:58:00 GMT Server: Omniture DC/2.0.0 X-C: ms-4.3.1 Expires: Sat, 04 Dec 2010 02:58:00 GMT Last-Modified: Mon, 06 Dec 2010 02:58:00 GMT Cache-Control: no-cache, no-store, must-revalidate, max-age=0, proxy-revalidate, no-transform, private Pragma: no-cache ETag: "4CFAFFB8-0E4C-7443902F" Vary: * P3P: policyref="/w3c/p3p.xml", CP="NOI DSP COR NID PSA OUR IND COM NAV STA" Location: http://b.scorecardresearch.com/r?c2=6035250&d.c=gif&d.o=guardiangu- network&d.x=243551159&d.t=page&d.u=http%3A%2F%2Fwww.guardian.co.uk%2Fnews%2Fdatablog%2F2010%2Fnov%2F08 %2Fprisoner-of-war-camps-uk xserver: www422 Content-Length: 0 Keep-Alive: timeout=15 Connection: close Content-Type: text/plain
  • 8. Extract meta-data and links (wat format) – Approx. 4.5 million files – 6.2TB on disk compressed – Housed at OII – Structured JSON – Different formats for arc/warcs
  • 9. { "Container": { "Filename": "DOTUK-HISTORICAL-1996-2010-GROUP-AA-XAAAAA-20110428000000- 00000.arc.gz", "Offset": "88937", "Compressed": true, "Gzip-Metadata": { "Header-Length": "10", "Inflated-CRC": "-1223265901", "Inflated-Length": "26073", "Deflate-Length": "4463", "Footer-Length": "8" } }, "Envelope": { "ARC-Header-Length": "102", "ARC-Header-Metadata": { "Date": "20080509081524", "Target-URI": "http://www.ukhomeinteriors.co.uk/content/ext_corbels.php", "Content-Length": "25970", "Content-Type": "text/html", "IP-Address": "83.223.106.10" }, "Payload-Metadata": { "Actual-Content-Type": "application/http; msgtype=response", "Block-Digest": "sha1:MCCZNOKBJHTZ5MMMCUJGBPE25C2TVUWF", "HTTP-Response-Metadata": { "Headers-Length": "591", "HTML-Metadata": { "Head": { "Title": "Exterior Corbels",
  • 10. Plain text lists Build own ad-hawk Hadoop cluster, fix incompatibilities, divide into smaller batches – Build plain text lists of pages and hyperlinks – Remove error page (e.g., 404 Not Found) – Remove pages not in .uk – Standardize dates (many formats) – Standardize hyperlinks (trailing /, etc.) – Fix/remove tons of invalid hyperlinks (whitespace, invalid characters, etc.) Load results into Apache Hive (2.5 TB)
  • 13. Relative size of second-level-domains
  • 14. Number of links within SLD per node
  • 15. Cross-domain links (2010) Absolute Normalized to target size
  • 16. Case of ac.uk Mapping the UK Webspace: Fifteen Years of British Universities on the Web Hale et al., WebSci'14, available: http://arxiv.org/abs/1405.2856 121 UK universities websites and links 1) League table ranking 2) Group affiliation 3) Geographical location
  • 20. Gravity Law σ𝑖𝑗 = 𝑠𝑖𝑗 𝑠𝑖 𝑜𝑢𝑡 𝑠𝑗 𝑖𝑛 𝑠𝑖𝑗 = 𝑠𝑖 𝑜𝑢𝑡 𝑠𝑗 𝑖𝑛 𝑟0.28
  • 21. Big UK Domain Data for the Arts and Humanities Primary aim: developing a methodological and theoretical framework within which to study over 15 years of UK domain data – with lessons for the future study of web archives more generally
  • 22. Big UK Domain Data for the Arts and Humanities The dataset: – Crawled from 1996 – 2013 – Approximately 65 TB, billions of words – Building interface to allow search by retrieval date, target domain of links, sentiment – Allow qualitative and quantitative analysis – and iteration between multiple research techniques
  • 23. Big UK Domain Data for the Arts and Humanities Key outputs: – Ten bursary projects using web archive data to investigate a broad range of topics, for example… • Armed services recruitment online • The accessibility of the web for disabled users • Online discussions of ‘Beat’ poetry – An edited book of empirical studies concerning the history of the UK web, featuring chapters on, for example… • Constitutional and institutional change in UK government • The BBC’s online presence • The ‘web of faith’ online
  • 24. Next ● Studies underway at OII, BL, IHR ● Book and articles – Study overall growth of .uk – Case study of .gov.uk – Study of media and select committee visibility ● Releasing data open source