ArAcAi - The Problem of Access: Prototyping a Researcher Dashboard for the UK Government Web Archive

Web archives and the
problem of access:
prototyping a researcher
dashboard for the UK
Government Web Archive
Mark Bell, Tom Storrar and
Jane Winters
15 January 2020
The National Archives is the official archive of UK government: collecting,
preserving and giving access to 1,000 years of history
Alongside paper and digitised records, our web archive collections are
growing rapidly and the UKGWA is our largest collection:
■ 1996 to present: over 23 years of government websites and social
media
■ 6 billion resources, 150TB+ (compressed) data
■ It has gov.uk domains but lots more, too - wherever government
hosts content (at present, over 800 websites!)
The UK Government Web Archive (UKGWA)
The UKGWA is openly available and well-used
Typical routes into the content of the collection include:
■ Through Google and other search engines
■ Redirection to it from government websites or from references to
historic documents within other documents
■ Direct “research sessions” - often returning users who have a specific
information need. They will often use our search service:
https://webarchive.nationalarchives.gov.uk/search/
■ Increasing use of the collection “as data” - but this is challenging in a
number of ways
Use of the UKGWA
What do researchers want to do with the
UKGWA?
❏ Essential primary source for the history of the late 20th and early 21st
century (mid 1990s to the present day)
❏ Record of government (central and local) and its interactions with its
citizens online
❏ Need to understand both its scope and its scale, and this means moving
beyond keyword searching (the default for many humanities researchers)
❏ Gain insight into the collection processes, how these have changed over
time, and the factors that have influenced when and how data is
harvested (these are patchwork or ‘Frankenstein’ archives)
❏ Extract different kinds of data
from the archive (text, images,
remove navigation etc.)
❏ Analyse trends in the data, e.g.
cultural and linguistic change
❏ Study online networks of
government and the flow of
information between and
within departments
❏ Deploy visualisation to aid
navigation and analysis
(macro- and micro-level)
What do researchers want to do with the
UKGWA?
Elevation for clock dial for Big Ben tower
Web archiving as collaboration
❏ The challenges posed by web archives (for researchers, web archivists
and research software engineers) are too complex to be solved by
individuals or organisations working on their own
❏ Researchers need web archivists, and web archivists need researchers
❏ Through collaboration, we can develop a robust community of practice
and knowledge
❏ We can argue for enhanced access to web archives, for researchers and
the wider public
❏ We can experiment, innovate and sometimes fail
❏ We can make the case for greater investment in web archiving (and in
web archiving institutions)
Dashboard basics
Rise and Fall of the Web
What are we analysing? - Macroscopic view
Archive
-> Domain
-> Sub-domain
-> Page
-> Resource
What are we analysing? - Content
History of salt
The craving for salt
Human beings have an intimate relationship with salt. Our
tears, blood and sweat taste of salt.
The chemical reactions inside our bodies need sodium - one of
the two elements that make up salt (with chloride).
We can't survive without sodium, but it was about five million
years before humans began to eat their sodium as salt.
Hunters in Greenland ate no salt until they were introduced to
it by whaling Europeans in the 17th century. Like our
prehistoric forebears, Lapps, Samoyeds, Kirghiz, Bedouin,
Masai and Zulus used to consume all the sodium they needed
from the animals and fish they ate.
Agriculture and salt
Archaeologists believe that salt eating developed as humans
learned how to keep animals and grow crops in the years after
10,000 BC. As the proportion of meat in their diet fell, people
had to find salt for themselves and for their domesticated
animals.
Content
What is content?
Content
What are we analysing? - Page Structure
What are we analysing? - Site Structure
https://webarchive.nationalarchives.gov.uk/20190102181627/https://www.gov.uk/guidance/cartels-confess-and-apply-for-leniency
Warning: This doesn’t exist!
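The archived address on this slide has a visible pattern: the archive host, a 14-digit capture timestamp, then the original URL. A minimal sketch of splitting such an address into parts that map onto the domain and page levels is given here. It assumes the timestamp reads as YYYYMMDDhhmmss; the function name and the returned keys are illustrative only and are not part of any UKGWA service.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit

# Assumed pattern: archive host / 14-digit timestamp / original URL.
ARCHIVED_URL = re.compile(
    r"^https?://webarchive\.nationalarchives\.gov\.uk/(\d{14})/(.+)$"
)


def parse_archived_url(url):
    """Split an archived address into capture time, host and path segments."""
    match = ARCHIVED_URL.match(url)
    if match is None:
        raise ValueError(f"not a recognised archived URL: {url}")
    timestamp, original = match.groups()
    parts = urlsplit(original)
    return {
        # Assumption: the timestamp is YYYYMMDDhhmmss.
        "captured": datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        "original_url": original,
        "host": parts.hostname,
        "path_segments": [seg for seg in parts.path.split("/") if seg],
    }


example = parse_archived_url(
    "https://webarchive.nationalarchives.gov.uk/20190102181627/"
    "https://www.gov.uk/guidance/cartels-confess-and-apply-for-leniency"
)
```

Grouping many parsed addresses by host and by leading path segment gives one simple way to roll resources up to the domain or page level.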
Topic Modelling
0 : research councils council innovation rcuk funding public government review business executive working training development work group
1 : museum maritime national greenwich royal nmm time london observatory family house rights world visit reserved events
2 : day information fruit local health navigation legal school scheme contact children vegetable healthy vegetables department content
3 : ocr science information gateway aqa including edexcel chemistry physics teachers webpage wjec teaching revision gcse century
4 : science triple learning support resources latest physics students schools programme teaching teachers gcse resource feedback comments
5 : food eat foods people bacteria meat fish agency fridge don standards raw cooked pregnant date find
6 : army museum national british war general nam enquiries pm services quick britain follow world field soldiers
7 : salt eat fruit foods fat food high good eating day milk diet children vitamin vegetables healthy
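Topic lists like those above are typically produced by a model such as latent Dirichlet allocation (LDA); the slides do not say which model or library was used. The sketch below is a small standard-library collapsed Gibbs sampler for LDA, included only to show the mechanics. The toy documents, parameter values and function names are assumptions made for illustration, not data or code from the project.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # topic counts per document
    topic_word = [Counter() for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                        # token counts per topic
    assignments = []
    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight each topic by its fit to this document and this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word


def top_words(topic_word, n=5):
    """Return the n most frequent words currently assigned to each topic."""
    return [
        [w for w, c in counts.most_common(n) if c > 0] for counts in topic_word
    ]


# Toy documents (invented for illustration), already tokenised.
toy_docs = [
    "salt eat food diet healthy salt food".split(),
    "fruit vegetables healthy eat diet food".split(),
    "museum army war national soldiers museum".split(),
    "museum maritime national greenwich royal".split(),
]
topics = top_words(lda_gibbs(toy_docs, n_topics=2))
for number, words in enumerate(topics):
    print(number, ":", " ".join(words))
```

At archive scale an established library implementation would be the practical choice; the point here is only that each numbered line is a ranked list of the words most strongly associated with one topic.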
Doc2Vec - Like word2vec but with documents
● Find similar documents
● Group documents together
● Enable semantic search
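Once every document is represented as a vector, "find similar documents" becomes a nearest-neighbour lookup. The sketch below shows that lookup step only. It deliberately swaps in plain word-count vectors for learned Doc2Vec embeddings so that it runs with the standard library alone; the page names and texts are invented for illustration.

```python
import math
import re
from collections import Counter


def vectorise(text):
    """Stand-in for a Doc2Vec embedding: a simple bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(query, documents):
    """Rank documents (name -> text) by similarity to the query text."""
    q = vectorise(query)
    scored = [(name, cosine(q, vectorise(text))) for name, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical pages, loosely echoing the topic words on the previous slide.
pages = {
    "salt": "salt eat fruit foods fat food high good eating day milk diet",
    "army": "army museum national british war general enquiries",
}
ranking = most_similar("healthy eating and salt in food", pages)
```

With real embeddings the ranking step stays the same; only `vectorise` changes, and similarity then reflects meaning rather than shared vocabulary.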
Document Summarisation
Scale reduction
[Diagram: a full site tree (Home Page; Sub-section A, Sub-section B; Page 1 to Page 4; PDF 1, PDF 2) labelled "10s of millions" alongside a reduced tree (Home Page; Sub-section A, Sub-section B; Page 1, Page 2) labelled "1000s"]
Change over time
Content
Structure
Static Dormant
Components of a dashboard
❏ Scope: select sites for analysis, manually or by similarity
❏ Granularity: level at which to perform analysis (archive, domain, page)
❏ Time: filter by time period (state at a point in time; activity during a period)
❏ Compare: compare change in one set of sites with another
❏ Content/Structure: analyse by content or structure (page, site, network)
❏ Visualise: charts, networks, word clouds etc.
❏ £: charges (paying for computation)
❏ Export: exporting results and visualisations
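One way to picture how these components fit together is as the fields of a single query object that a dashboard would build from the user's selections. The sketch below is hypothetical: the class name, field names and allowed values are assumptions made for illustration and do not describe the prototype's actual design. Charging for computation is left out.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

GRANULARITIES = ("archive", "domain", "page")
ANALYSES = ("content", "structure")


@dataclass
class DashboardQuery:
    """Hypothetical bundle of the dashboard components described above."""

    scope: List[str]                                        # Scope: sites to analyse
    granularity: str = "domain"                             # Granularity
    start: Optional[date] = None                            # Time: period start
    end: Optional[date] = None                              # Time: period end
    analyse_by: str = "content"                             # Content/Structure
    compare_with: List[str] = field(default_factory=list)   # Compare
    visualisation: str = "chart"                            # Visualise
    export_format: Optional[str] = None                     # Export, e.g. "csv"

    def __post_init__(self):
        # Reject selections that fall outside the assumed option lists.
        if not self.scope:
            raise ValueError("scope must name at least one site")
        if self.granularity not in GRANULARITIES:
            raise ValueError(f"granularity must be one of {GRANULARITIES}")
        if self.analyse_by not in ANALYSES:
            raise ValueError(f"analyse_by must be one of {ANALYSES}")
        if self.start and self.end and self.start > self.end:
            raise ValueError("start must not be after end")


query = DashboardQuery(
    scope=["www.gov.uk"],
    granularity="page",
    start=date(2019, 1, 1),
    end=date(2019, 12, 31),
    analyse_by="structure",
    export_format="csv",
)
```

Keeping the selections in one validated object means the same query could drive a visualisation, a comparison or an export without being re-entered.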
Web archives are created through actions and decisions, both human and
machine.
Human actions involve decisions on when and how to capture a resource
or a website but also why. Data on this is kept as part of the archive but
most of it is not public.
Machines make decisions based on the parameters or rules they are
provided by human actors. We can add trust and transparency to this
process by revealing as much of this as we can to our users.
We can commit to publishing this knowledge, but publishing in a way
that adds to users’ comprehension of the web archive is a challenge.
Static datasets (csv) are a start, leading to queryable ones (APIs…)
Key Context on the creation of the UKGWA
We’re not alone; we are part of a vibrant community of web archives and
researchers.
We are taking inspiration (and code!) from the great work being done by
Archives Unleashed, the Internet Archive, the British Library and many
others.
We’ve also been gaining more and more hands-on experience of
running research projects using UKGWA data, for example, recently:
■ Alan Turing Institute Data Challenge - Identifying Topics and Trends
(December 2019)
■ CAS Network Analysis Workshop (June 2019)
These are crucial to our work and there are many more to come!
Collaborate!
❏ Bring stakeholders together
regularly (workshops, hackathons
etc.)
❏ A wide range of skills and expertise
are required but some
interventions can lower barriers
❏ Artificial intelligence is already
helping us to explore web archives,
and will continue to transform
access
❏ … but it is not enough on its own
Conclusion
Wartime storage of documents in the
Long Gallery at Haddon Hall
Colossus electronic digital computer