3. www.bl.uk 3
What are we collecting?
All of the UK Public Web Space
• 5-10 million hosts (websites)
• Up to 80-100 TB of data each year
• Over 100 curated collections
What don’t we collect?
• Email
• Intranets
• Anything behind a user login
• Flash
• Most (but not all) video and audio content
• Very little Twitter or Facebook
How often?
• ‘Everything’ once a year (takes about 3 months)
• Selected sites more frequently (daily, weekly, monthly, quarterly, six-monthly)
• News and some other sites daily
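A minimal sketch of how the crawl frequencies listed above could be turned into a "is this site due?" check. This is an illustration only, not the archive's actual scheduler: the names `CRAWL_INTERVAL_DAYS` and `is_due` are hypothetical, and the day counts are rough approximations of each frequency.

```python
from datetime import date, timedelta

# Hypothetical mapping from the frequencies named on the slide to day counts.
# The day counts are approximations chosen for illustration.
CRAWL_INTERVAL_DAYS = {
    "daily": 1,
    "weekly": 7,
    "monthly": 30,
    "quarterly": 91,
    "six-monthly": 182,
    "annual": 365,
}


def is_due(frequency: str, last_crawled: date, today: date) -> bool:
    """Return True if a site with this frequency should be crawled again."""
    interval = timedelta(days=CRAWL_INTERVAL_DAYS[frequency])
    return today - last_crawled >= interval


print(is_due("weekly", date(2015, 1, 1), date(2015, 1, 8)))  # True
print(is_due("annual", date(2015, 1, 1), date(2015, 6, 1)))  # False
```

Keeping the frequencies in one lookup table means adding or adjusting a schedule does not touch the check itself.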
Do we collect ‘everything’?
‘Everything’ is not everything
• Most sites capped at 500 MB (not the BBC)
• Database-driven websites are very hard to collect
• Archived sites don’t always look how they should
• WordPress is really hard
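The per-site cap can be pictured as a running byte budget per host. The sketch below is an assumption about how such a cap might work, not a description of the crawler actually used: `HostBudget`, `CAP_BYTES` and the exemption list are hypothetical names, and 500 MB is read here as 500 MiB.

```python
CAP_BYTES = 500 * 1024 * 1024       # 500 MB cap, read as 500 MiB (assumption)
UNCAPPED_HOSTS = {"www.bbc.co.uk"}  # hypothetical list of exempt hosts


class HostBudget:
    """Track bytes downloaded per host and refuse fetches past the cap."""

    def __init__(self, cap_bytes=CAP_BYTES, uncapped=UNCAPPED_HOSTS):
        self.cap_bytes = cap_bytes
        self.uncapped = set(uncapped)
        self.used = {}  # host -> bytes downloaded so far

    def allow(self, host, size):
        """Return True if the fetch may proceed, recording its size if so."""
        if host in self.uncapped:
            return True
        total = self.used.get(host, 0)
        if total >= self.cap_bytes:
            return False  # cap already reached: skip the rest of this host
        self.used[host] = total + size
        return True


budget = HostBudget(cap_bytes=100, uncapped={"exempt.example"})
print(budget.allow("site.example", 60))       # True
print(budget.allow("site.example", 60))       # True (cap not yet reached)
print(budget.allow("site.example", 1))        # False (120 bytes already used)
print(budget.allow("exempt.example", 10**9))  # True (exempt host)
```

In this sketch a host can overshoot the cap by one download, because the check happens before the size is added; a capped site is therefore truncated rather than skipped, which is one reason ‘everything’ is not everything.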
Access
• Licence required to display website publicly (approx. 15,000 websites)
• Otherwise only in a reading room of a Legal Deposit Library
Discovery
• How do you find what you want when there are billions of potential results?
• Search can’t work like Google (Google know a LOT about you)
Secondary Datasets
• JISC UK Web Domain Dataset (1996-2013):
– Format Profile
– Geo-Index
– Host-Level Links
– Crawled URL Index
– WATs (rich resource-level metadata, not released yet)
• UK Open (Selective) Web Archive:
– Website Classification Dataset
• Available as CC0 downloads:
– http://data.webarchive.org.uk/opendata/