News data at the British Library
Luke McKernan
Lead Curator, News and Moving Image
Working with news data across different media
7 September 2015
www.bl.uk 2
Map of news stories in the UK as read via Twitter (created using bit.ly links), Guardian Datablog, 16 May 2012
Changing news
• Moving from a world-class newspaper service to a world-class news service
• Newspapers, television, radio and Web news
• Reflects the significant changes in news production and consumption taking place today, and also how news has always been consumed
• News does not exist in any one form. It is sought out and selected by its users, from the multiple forms of information on offer
• A change in how we manage news data is an essential part of delivering such change
• “News is information of current interest for a specific audience”
News content strategy
The Newcastle Courant, The Huffington Post, Today, Al Jazeera English
Newspapers
• The UK national collection
• 34,000 newspaper titles: approximately 60M issues or 450M individual pages, from the 17th century to the present day
• Current acquisition: 1,500 daily or weekly titles
• Print copies acquired under legal deposit, but acquisition will move increasingly towards digital
• Physical access at the Newsroom and Boston Spa
• Online access to 11M pages via the British Newspaper Archive (http://www.britishnewspaperarchive.com)
• Approximately a third of the collection has microfilm access copies; around 2.5% has been digitised so far
British Newspaper Archive
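As a rough cross-check, the figures on this slide can be related to each other with some simple arithmetic. The sketch below assumes that the "2.5% digitised" figure refers to pages rather than titles or issues, which the slide does not state; the variable names are purely illustrative.

```python
# Figures quoted on the Newspapers slide.
titles = 34_000
issues = 60_000_000
pages = 450_000_000
online_pages = 11_000_000        # pages on the British Newspaper Archive
digitised_per_mille = 25         # "around 2.5%", kept as an integer

# Average size of an issue implied by the totals.
pages_per_issue = pages / issues

# Assumption: the 2.5% applies to the page count.
digitised_pages = pages * digitised_per_mille // 1000

# Difference between the implied digitised total and the online figure.
gap = digitised_pages - online_pages

print(f"Average pages per issue: {pages_per_issue}")
print(f"Pages implied by 2.5%: {digitised_pages:,}")
print(f"Gap to the 11M online figure: {gap:,}")
```

On that assumption the two figures sit close together: 2.5% of 450M pages is 11.25M, in line with the roughly 11M pages quoted as available online.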
Television and radio news
• Began recording, in May 2010, television and radio news programmes receivable in the UK
• Collection of over 60,000 programmes, recorded off-air from 20 channels including BBC, Al Jazeera, Russia Today, CNN, CCTV (China), NHK, Bloomberg, France 24, World Service, LBC
• 30 hours of TV and 22 hours of radio captured per day
• Born-digital archive, including Electronic Programme Guide (EPG) data and subtitles where available
• Access onsite only, owing to copyright restrictions, via the Broadcast News service
Broadcast News
Web news
• Non-print legal deposit legislation, introduced in April 2013, means the British Library can start harvesting UK websites
• First annual crawl collected 4.5M .uk websites and web pages – the collection now amounts to around 3Bn digital assets
• Harvesting c.1,000 UK news websites (newspapers and web-only sites, e.g. hyperlocals) on a daily or weekly basis since the end of 2013, with another 500 to be added soon
• Access onsite only at the British Library and other Legal Deposit libraries
• Also the Open UK Web Archive, a smaller collection of selected websites, openly available at http://www.webarchive.org.uk
UK Web Archive
Our news research services
Explore.bl.uk, The Newsroom, Boston Spa reading room, British Newspaper Archive, UK Web Archive, Broadcast News
News data
• 2M 19th-century British newspaper pages – XML, images
• UK television news data, 2010 onwards – EPG data for 45,000 programmes, subtitles (XML) for c.25,000 programmes, some speech-to-text files for 2011 broadcasts (XML)
• UK radio news data, 2010 onwards – EPG data for 15,000 programmes, some speech-to-text files for 2011 broadcasts (XML)
• Financial Times – four years of content (1888, 1939, 1966, 1991) – XML, images
• Web news selection – possibly
Financial Times, 1893 and 2008
Plans
• All out-of-copyright UK newspapers on the British Newspaper Archive, with issue-level data for research re-use, covered by a single agreement and available through an API. Possibly…
• Title-level data for all newspapers we hold (34,000 titles) released as open data
• More partner initiatives
• Hackathon on 16 November 2015, to be followed by other news data events in 2016
• User-led development
BBC radio news script, 14/7/1969
Dreams
• An open news dataset
• An archive news data model
• All British Library news records available at issue level
Hyperlocal news sites: On the Wight, The City Talking, A Little Bit of Stone
Questions
• Copyright constraints limit use of much material to BL premises – how can tools such as named entity extraction help to work around this?
• How can print, web, television, radio and other news media be linked together, and to other resources, and how would this benefit research?
• What research questions will we be able to support through a greater focus on news data?
• Is news data only for the specialist, or can more general, user-friendly applications be produced?
• What can news archives learn from the management tools for current news?
• How can we help each other?
TV news idents
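One possible reading of the first question is that derived data, such as counts of the names mentioned in a text, might be shared even when the text itself cannot leave the premises. The sketch below illustrates that idea only. It uses a deliberately naive stand-in for named entity extraction (runs of two or more capitalised words), and the function names and document identifier are hypothetical, not part of any British Library system. Whether such derived data could actually be released is a rights question the slide leaves open.

```python
import re
from collections import Counter

# Naive stand-in for a real named entity recogniser: a run of two or more
# capitalised words. Illustrative assumption only.
CAPITALISED_RUN = re.compile(r"\b[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)+\b")

# Sentence-initial words the pattern tends to catch by accident.
LEADING_WORDS = {"The", "A", "An", "In", "On", "At"}


def entity_counts(text: str) -> Counter:
    """Count candidate entity names found in the text."""
    counts = Counter()
    for match in CAPITALISED_RUN.finditer(text):
        words = match.group().split(" ")
        # Drop a leading article or preposition picked up by the pattern.
        while words and words[0] in LEADING_WORDS:
            words = words[1:]
        if len(words) >= 2:
            counts[" ".join(words)] += 1
    return counts


def derived_record(doc_id: str, text: str) -> dict:
    """Return derived data only: an identifier and entity counts, no text."""
    return {"id": doc_id, "entities": dict(entity_counts(text))}


sample = (
    "The Financial Times reported from Boston Spa. "
    "Later the Financial Times cited the British Library."
)
record = derived_record("hypothetical-doc-1", sample)
print(record)
```

The record holds the identifier and the counts but none of the source text, so a researcher could compare which names appear across newspapers, broadcasts and websites without seeing the restricted content.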
Email: luke.mckernan@bl.uk
Twitter: @BL_newsroom
Web: http://bl.uk/subjects/news-media
Blog: http://britishlibrary.typepad.co.uk/thenewsroom
Contact
