Digitising Hansard

Edward Wood, Director of Information Management, House of Commons.

  • Today we’re going to say a little about our project to digitise Hansard, which is of course the official record of debates in Parliament. Although this isn’t records management in the traditional sense, it is about taking one of the most important records in the country, if not the world, using it to create a new digital asset, and fully exploiting its value as an information source rather than leaving it to gather dust on a shelf.

Transcript

  • 1. Digitising Hansard. Edward Wood, Director of Information Management, House of Commons. 16.6.08
  • 2. digitising Hansard• digitising Hansard: scanning and OCR• the policy context• database and front end
  • 3. Hansard• the official report of debates in Parliament• actually an unofficial private enterprise at first• “nationalised” in 1909• early reports written in the third person• eventually developed into a (nearly) verbatim account• volumes from 1803 – 2005 were digitised• nearly 3 million pages
  • 4. “though not strictly verbatim, [it] is substantially the verbatim report, with repetitions and redundancies omitted and with obvious mistakes corrected, but [...] on the other hand leaves out nothing that adds to the meaning of the speech or illustrates the argument.”
  • 5. why digitise?• enable preservation• conservation is expensive• increase access• increase usability• improve business processes• re-use physical storage space• costs have fallen significantly• quality improving steadily
  • 6. preservation vs. conservation• conservation: direct intervention to prevent/make good damage to materials• preservation: a broader term than conservation. It includes all managerial and financial considerations including storage and accommodation provision, staffing levels, policies, techniques, and methods involved in preserving library and archive materials and the information contained therein
  • 7. preservation• originals printed on poor quality paper• starting to deteriorate• reduce wear and tear from daily use• keep in a controlled environment• conservation is expensive
  • 8. improve access• internal – extensive day to day business use across a very large site• public – national heritage and birthright – disposal by libraries – international interest
  • 9. increase usability• search• print• share• novel uses/mash-ups• quality of digitisation techniques improving steadily
  • 10. costs• costs have fallen significantly• alternative funding models• reduce physical storage needs – dispose of surplus copies – locate in less valuable space• but beware the hidden costs…
  • 11. ongoing costs• developing a front-end and database• hosting• storing images• digital preservation• format migration
  • 12. alternatives• microfilm• conservation• facsimile
  • 13. why not leave it to the big boys? in a word, control• subject matter• quality• value added• use
  • 14. funding models• self-funding• commercial funding• joint funding• grants
  • 15. doing the work• In house or contractor?• method – image only – re-keying (single, double, triple...) – OCR (optical character recognition) – image plus text – metadata capture – manual intervention increases quality and costs!
  • 16. scanning from...• microfilm• loose originals• bound originals• dis-bound originals
  • 17. OCR• how accurate does it need to be?• mass vs batch capture• double or triple compare• diminishing returns
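The "double or triple compare" idea on slide 17 can be sketched as follows. This is a minimal illustration, not the project's actual tooling: it assumes two independent OCR passes over the same page are available as plain strings, and the function name `flag_disagreements` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def flag_disagreements(text_a: str, text_b: str) -> list[tuple[str, str]]:
    """Return (pass_a_words, pass_b_words) pairs where two OCR passes differ."""
    words_a, words_b = text_a.split(), text_b.split()
    # Compare word by word; autojunk is off so frequent words are not skipped.
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append(
                (" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
            )
    return disagreements


# Invented sample strings standing in for two OCR passes of one line.
pass_one = "the official report of debates in Parliament"
pass_two = "the official rep0rt of debates in Parliament"
print(flag_disagreements(pass_one, pass_two))
```

Only the words on which the passes disagree need a human decision, which is where the diminishing returns come in: each extra pass tends to flag fewer new disagreements while adding a full pass's cost.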
  • 18. QA (quality assurance)• automate where possible• contractor – 100% proof reading• client – heavy sampling of images – 1% sampling of text• third party?
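The client-side "1% sampling of text" on slide 18 could be drawn along these lines. This is a sketch under stated assumptions: the page identifiers, the `sample_pages` name and the fixed seed are all hypothetical, and the slides do not say how the real sample was selected.

```python
import random


def sample_pages(page_ids, rate=0.01, seed=None):
    """Pick a random sample of pages for proof-reading (at least one page)."""
    pages = list(page_ids)
    if not pages:
        return []
    rng = random.Random(seed)  # a seed makes the sample reproducible
    sample_size = max(1, round(len(pages) * rate))
    return sorted(rng.sample(pages, sample_size))


# Hypothetical batch of 1,000 scanned pages, numbered 1 to 1000.
batch = range(1, 1001)
to_check = sample_pages(batch, rate=0.01, seed=2008)
print(len(to_check), "pages selected for text QA")
```

Recording the seed alongside the batch lets a third party reproduce exactly which pages were checked.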
  • 19. the need for a policy framework• Hansard was the first major digitisation project in the UK parliament• an earlier project to digitise Local and Private Acts captured images only• we needed a digitisation policy for parliament to ensure consistency and learning from experience
  • 20. policy aims• ensure that individual projects: – take into account the wider information context both inside and outside Parliament – deliver their target benefits – offer value for money• ensure the resources created can be: – exploited fully – used for as long as is required
  • 21. policy scope• publications• photographs• archival documents• business records
  • 22. policy principles• digitisation needs to be seen as an integral part of the information work carried out by parliament• use of appropriate technical standards• scan once for many purposes• business cases should take account of all relevant costs
  • 23. selection criteria• measurable user demand (for public use)• business need (for internal use)• the potential for learning and educational use• cost and the availability of other resources• technical considerations• the uniqueness of the items• conservation requirements• intellectual property rights and copyright issues• the availability of digitised versions of the same material elsewhere• the potential for revenue raising• the feasibility of long-term preservation, where required
  • 24. other aspects of the policy• the delivery method will be planned at the outset• the preservation master will be an uncompressed TIFF file• metadata will be created, to support resource discovery, use, storage and digital preservation• we will adopt international standards where possible• we will work with partners where possible
  • 25. developing a digitisation strategy• a project board has been created• an integral part of an online parliamentary history programme for parliament• will use the criteria set out in the digitisation policy to prioritise future digitisation work
  • 26. practical guidelines• guidelines have been developed for all parts of parliament which need to create digitised assets: – a checklist for doing the work – glossary – details of file formats, OCR options – describes popular myths on costs
  • 27. hosting• text and images• text only• navigation• search• web 2.0• funding models• give it away!? http://www.parliament.uk/publications/archives.cfm
  • 28. developing a web interface• drivers – keep costs down – work closely with users – meaningful search across a large amount of data• solution – experimental approach – open source
  • 29. methodology and progress• small team of developers from Parliamentary ICT working closely with users (inside and outside Parliament)• uses “micro formats” approach• XML is parsed into HTML before loading into the database• JPEGs not currently being used• half of the data has been loaded (mainly 20th century)• public discussion group and issues log
  • 30. http://hansard.millbanksystems.com
  • 31. faceted classification• faceted approach to browsing and searching• assignment of multiple classifications to an object• classifications can be ordered in a variety of ways• facets include – date – volume number – monarch – chamber – content type (debates or questions) – constituencies – Members of Parliament – offices held.
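The faceted approach on slide 31 can be illustrated with a small sketch. It assumes each item is a plain dictionary carrying several facet values. The records, field names and helper functions below are placeholders for illustration and do not reflect the site's actual data model.

```python
from collections import Counter

# Placeholder records: each object carries several classifications at once.
items = [
    {"title": "Item A", "chamber": "commons", "content_type": "debates", "year": 1941},
    {"title": "Item B", "chamber": "lords", "content_type": "debates", "year": 1941},
    {"title": "Item C", "chamber": "commons", "content_type": "questions", "year": 2002},
]


def filter_by_facets(records, **selected):
    """Keep records matching every selected facet value."""
    return [
        record
        for record in records
        if all(record.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(record[facet] for record in records if facet in record)


commons_debates = filter_by_facets(items, chamber="commons", content_type="debates")
print([record["title"] for record in commons_debates])
print(facet_counts(items, "year"))
```

Because facets are independent, they can be applied in any order, and the counts show how many results each further refinement would leave.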
  • 32. other features• references using the standard format can be located using the search box, e.g. HC Deb Vol 385 13 May 2002 c498• predictable URLs, e.g. http://hansard.millbanksystems.com/commons/1941/may/07/w• pages created for: – individual Members of Parliament – constituencies – acts – bills – divisions
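A standard-format reference like the one on slide 32 can be parsed and mapped to a day-level URL roughly as follows. This is a sketch, not the site's own code. The regular expression, the "HL" prefix with its "lords" path, and the three-letter month slug are assumptions drawn from the single example shown. The final path segment of the example URL is truncated on the slide, so the sketch stops at the day.

```python
import re

BASE = "http://hansard.millbanksystems.com"
MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]
# Assumed pattern for references such as "HC Deb Vol 385 13 May 2002 c498".
CITATION = re.compile(
    r"^(?P<house>HC|HL) Deb Vol (?P<volume>\d+) "
    r"(?P<day>\d{1,2}) (?P<month>[A-Za-z]+) (?P<year>\d{4}) c(?P<column>\d+)$"
)


def parse_citation(text):
    """Split a reference into its parts, or return None if it does not match."""
    match = CITATION.match(text.strip())
    if match is None or match["month"].lower() not in MONTHS:
        return None
    return {
        "house": match["house"],
        "volume": int(match["volume"]),
        "day": int(match["day"]),
        "month": match["month"].lower(),
        "year": int(match["year"]),
        "column": int(match["column"]),
    }


def day_url(ref):
    """Build the chamber/year/month/day part of a URL (assumed path scheme)."""
    chamber = {"HC": "commons", "HL": "lords"}[ref["house"]]
    return f"{BASE}/{chamber}/{ref['year']}/{ref['month'][:3]}/{ref['day']:02d}"


ref = parse_citation("HC Deb Vol 385 13 May 2002 c498")
print(ref)
print(day_url(ref))
```

The column number is kept in the parsed result so that a search box could go on to locate the exact passage within that day's sitting.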
  • 33. http://hansard.millbanksystems.com