Thank you for inviting me to speak here today. Before I begin I would like to acknowledge the hard work of the ANDP team over the last 2 years. Our team was small, consisting of only 6 people, and we worked closely together with a shared vision and goal to achieve what I will show you today.
All the information I will speak about today is available on the ANDP website. The address is www.nla.gov.au/ndp. Under the project details tab are several papers, reports and all previous presentations. We have a high level of transparency with the program and this website has proved to be a useful information tool for the public, librarians and stakeholders. All information about titles to be digitised is available under the ‘selected titles’ tab.
The overall objective of the Australian Newspaper Digitisation Program is to improve access to Australian newspapers, focusing first on content that is out of copyright – so up until the end of 1954. Up until now people wishing to research historic Australian newspapers needed to go to libraries across Australia and scroll through reels of microfilm. This program aimed to provide an online service that will let people anywhere, anytime access these newspapers via the internet. The service is now available. It is free. You can full text search across every page of every newspaper in the service, including advertising, cartoons, letters to the editor as well as the news and sports articles.
Every state and territory library in Australia is involved in this national program. By 2011 we will have digitised 4 million newspaper pages, that’s about 40 million articles. The Sydney Morning Herald will comprise 600,000 pages of this. Each state has selected a major daily newspaper to begin with. We are working in collaboration with the state and territory libraries to digitise these first 4 million pages since many of them own the microfilm copies of the newspapers that we use to create the digital images from. Regional newspapers will be included from this year onwards. Regional titles are being contributed from libraries around Australia.
The program started 2 years ago and we have digitised 1.8 million pages from microfilm so far. It is a 2 step process. Firstly microfilm is scanned into digital images by our contractor in Sydney and then the pages are sent to our contractor in India for Optical Character Recognition (OCR) processing. This makes them full text searchable. After quality assurance they are made available to the public through the Australian Newspapers service. The Beta service was released to the public in July 2008. It now contains 360,000 pages (3.5 million articles) and is being very well used. We will add another 40 million articles into the service by 2011.
Today I am going to mainly talk about data enhancement by the public, including text correction. However I would just like to say that the development of the public search service is only one aspect of the overall program. Behind the scenes we have been undertaking significant software development – we have designed 2 systems, the Newspapers Content Management System which includes Quality Assurance Modules and the Search and Delivery System. We have also upgraded our infrastructure and purchased 63 TB of storage so far for the national newspapers storage infrastructure. All aspects of the digitisation process are being outsourced (some offshore). The ANDP team of 6 has been responsible for all aspects of the project. In addition we have employed some university students on a casual basis to undertake the quality assurance processes.
On the technical side of things we are using a MySQL database and a Lucene search index. It was not our preference to undertake software development to the extent we have, but since there were no solutions available off the shelf we have gone down this path. It is our intent to share the code as open source for both systems sometime in the near future. We have had a lot of interest from other national libraries and institutions who wish to obtain the code and/or assist us with software development.
The development cycle for the search and delivery system was first to release a prototype to state and territory libraries for feedback in 2007. We then developed a beta version in 2008 which had a public release. It is our intent this year to move the beta version into a version 1 and officially launch the service very soon.
This is the home page of Australian Newspapers beta. Users can either keyword search or browse by date, title or state. The service is being heavily used with around 28,000 keyword searches per day and an unknown number of browses. We have not widely publicised the service as yet since we are still in a beta version.
People are predominantly searching for names in the service. This is a visual image of search terms. The most searched names are John, William, Thomas, George and James.
Search phrases also remain pretty similar from month to month, with phrases often being a personal name in combination with a term such as births, deaths, murders or shipping. In December 2008 ‘christmas’ was also a popular search term.
Most of the users have found out about the service from genealogy blogs and forums. This is an example of a popular international forum where the news of OCR text correction wings its way from Mary in Italy to William in Gateshead, UK, to Zoe in London, to Uncle John in Bedfordshire, and then to Harry’s mum in Brisbane in a matter of minutes.
The features discussed on the forums that the public are using are adding tags and comments to articles, and correcting the text within articles. We’ll look at each of these in turn.
Firstly when a user comes to the site they can choose whether or not they want to login. It is not mandatory to login even if using the tag, comment or text correction features. The benefit of logging in is that users can track their activity and if they are a top corrector they may appear in the top correctors hall of fame. To date we have 3000 registered users, out of over 300,000 unique users. Of the 3000 registered users 1300 are regular text correctors. We do not know how many unregistered (anonymous) users are correcting text.
Newspapers have a hierarchy of issue, pages in an issue, and articles on a page, which is reflected in the system. It is easy to navigate between the levels when browsing newspapers. This shows the page level view. On this screen you can move the frame splitter on the left to entirely hide the left bar and view only the newspaper image if you want. To access the enhancement features the user needs to go to the article level. If you do a keyword search instead of a browse you will come to article view immediately.
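The issue, page and article hierarchy described above can be sketched as a small data structure. This is only an illustration: the class names, fields and the sample newspaper title are assumptions made for the example, not the schema used by the Australian Newspapers systems.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Article:
    heading: str
    ocr_text: str


@dataclass
class Page:
    number: int
    articles: List[Article] = field(default_factory=list)


@dataclass
class Issue:
    title: str   # newspaper title (hypothetical sample below)
    date: str    # ISO date string, e.g. "1912-04-19"
    pages: List[Page] = field(default_factory=list)


def iter_articles(issue: Issue) -> Iterator[Tuple[int, Article]]:
    """Walk from issue level down to article level, yielding (page number, article)."""
    for page in issue.pages:
        for article in page.articles:
            yield page.number, article


# Hypothetical sample issue with two pages and three articles.
sample_issue = Issue(
    title="Example Gazette",
    date="1912-04-19",
    pages=[
        Page(1, [Article("Shipping news", "..."), Article("Family notices", "...")]),
        Page(2, [Article("Letters to the editor", "...")]),
    ],
)

for page_number, article in iter_articles(sample_issue):
    print(page_number, article.heading)
```

A browse starts at the issue and steps down through pages to articles, whereas a keyword search would land directly on an article.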
This is the article view. Users can zoom in or out and choose to view the article in the context of the entire page. They can also navigate to any other page within the newspaper issue. The electronically generated text created through the OCR process is displayed on the left hand side. This is also where the users can use the 3 enhancement features. Users can tag the article with keywords and they can write comments and notes about the article. If users login they will be able to choose to make their tags and comments public or private. So they can share their comments with all users or they can add their own private research notes that only they can access. One feature that we believe is innovative and not available in any other online newspaper service is the ability for the user to correct the electronically generated text. There are a number of reasons why the electronically created text is not always 100% accurate, mainly due to the quality of the original newspaper that the image was created from. Users can correct the text by clicking on the ‘Help fix this text’ button. We will now use these features on this article. The article we are looking at is the first report in an Australian paper of the sinking of the Titanic. It’s in the Northern Territory Times on 19 April 1912.
I want to tag the article with ‘Titanic sinking’. If a user does not login when they first enter the service then the first time they want to enhance an article they will be offered the option to login. At this point they can either login or enter the captcha to verify they are human (and not a robot attempting to do something undesirable).
Once logged in or verified with captcha a user can enter their tags.
Now I want to add a comment. Those of you who read this article may have noticed that it was reported that all passengers were safely rescued from the Titanic and the weather was calm. I’ll just add a comment to say this was unfortunately not the case.
Now I have zoomed in on the image and if the OCR text was inaccurate I would edit it in the box on the left. In this article the text is actually very accurate so has either OCR’d very well, or already been corrected by someone else.
Now we can review the article with all the enhancements we have made showing on the left: tags, comments and corrections. We can view the history of all the enhancements (both ours and other people’s). So those were the basics, but let’s take a closer look at users’ activity with the enhancement features over the last 6 months.
Adding tags has been a hugely popular activity for users. 46,000 tags have been added. However, of these the vast majority are personal names and only 34 tags have been used more than 100 times. This has led not to a useful tag cloud, but to tag fog! The screenshot shows the ‘John’ fog. Most of the tags have been used fewer than 10 times. Of the 46,000, 16,500 are unique. The use of tags is surprising because we were initially dubious about the value of tags for articles when every article is full-text searchable, and if the name you are looking for is incorrect you can edit it so that you can find it again. It certainly appears that people are using tags to try and track their research. Very few services, if any, have enabled tagging of full-text items; most tagging is for image collections only, so what we are seeing here is new to us.
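Figures like the total, the unique count and the number of heavily used tags can be derived from a plain log of tag events. The following is a minimal sketch under stated assumptions: the function name, the case-insensitive normalisation and the sample tags are all hypothetical, and the threshold of 100 uses is scaled down in the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Union


def tag_stats(tag_events: Iterable[str], popular_threshold: int = 100) -> Dict[str, Union[int, List[str]]]:
    """Summarise a stream of applied tags.

    Tags are lower-cased and stripped before counting (an assumption made
    for this sketch; the real service may treat case differently).
    """
    counts = Counter(tag.strip().lower() for tag in tag_events)
    return {
        "total": sum(counts.values()),   # every tag application
        "unique": len(counts),           # distinct tags
        "popular": sorted(t for t, n in counts.items() if n > popular_threshold),
    }


# Hypothetical sample: 9 tag applications, 3 distinct tags.
events = ["John", "john", "John ", "shipping", "murder", "murder", "murder", "murder", "murder"]
stats = tag_stats(events, popular_threshold=2)
print(stats)
```

With a long tail of personal names, "unique" stays close to "total" while "popular" remains tiny, which is the fog effect described above.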
The most used tag (one of only 5 that jump from the fog) is LRRSA, which we have now discovered is short for the Light Railway Research Society of Australia. They have 250 members who are using the tag to record their group research.
Tagging enables ‘marking’ or ‘saving’ of records into a group so that you can come back to them later. There is currently no other method to save a group of articles, other than bookmarking them.
Each user has a profile page where they can view their latest tagging, commenting and text correction activities. The user profile pages are visible to other users. At this stage users cannot edit their profiles. It is desirable however that users are able to edit and personalise their profiles so they can share information about themselves and their research interests with other users.
By browsing user profile pages we can see 2 distinct methods that people use to correct text. This first profile shows us that this user is looking at lots of different articles with a similar subject (flying saucers and UFOs) and just correcting a few lines in each article. The profile shows the article, the date changed, the old text and the new text.
The next user profile shows method 2 – find an interesting article and then correct the whole article. Two of our top correctors are correcting long articles on gruesome murders, this is a popular theme. Text correctors report doing 1-3 hrs of text correction at a sitting on average. The average visitor spends 17 minutes searching and reading articles in a session.
Several people can correct the same article. All corrections are saved and viewable in the history of the article. All versions of corrections are searched. It is the last correction that is visible in the left hand pane. Articles are corrected by many users when they are either very long, very significant, or very illegible. For example this article is in the first Australian newspaper – the Sydney Gazette and NSW Advertiser of March 1803. Around 20 people have made corrections to this article. It is particularly challenging because of its use of the long s (ſ), which looks like an f, in place of the letter s.
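The behaviour just described (every correction is kept, every version is searched, and only the latest is displayed) can be sketched as an append-only history per line of text. This is an illustrative model only: the class and method names are assumptions, and a simple substring match stands in for the real search index.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Correction:
    user: str
    text: str


@dataclass
class ArticleLine:
    ocr_text: str                                   # original OCR output, never overwritten
    corrections: List[Correction] = field(default_factory=list)

    def correct(self, user: str, new_text: str) -> None:
        """Append a correction; earlier versions stay in the history."""
        self.corrections.append(Correction(user, new_text))

    def displayed_text(self) -> str:
        """The most recent correction is shown, or the raw OCR if none exist."""
        return self.corrections[-1].text if self.corrections else self.ocr_text

    def searchable_versions(self) -> List[str]:
        """The original OCR text and every correction are all searchable."""
        return [self.ocr_text] + [c.text for c in self.corrections]

    def matches(self, term: str) -> bool:
        needle = term.lower()
        return any(needle in version.lower() for version in self.searchable_versions())


# Hypothetical example: two users fix one badly OCR'd line in turn.
line = ArticleLine("Mr. Jolm Srnith arrived")
line.correct("anonymous", "Mr. John Srnith arrived")
line.correct("user2", "Mr. John Smith arrived")

print(line.displayed_text())
print(line.matches("smith"), line.matches("jolm"))
```

Because nothing is deleted, a mistaken correction never hides the original terms from search, and the full history remains available for review.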
This is the text correction history of this article, showing all the different users and what parts they corrected.
Another regular activity of text correctors is methodically working through the family notices to correct names in the births, marriages, and deaths columns. This is a perfect example of a barely legible births column in the image on the right. We can see that it has already been corrected by a user and we can view the corrections.
The raw OCR text has basically come out as rubbish (on the left) and users here have just fixed the names but not the rest of the words in the line. This means that other people will now be able to find these names.
The comments feature was originally for researchers to annotate articles. It changed its name from annotations to notes to comments after user feedback. Some users are annotating the articles and adding further information about the content of the article or people mentioned in the article.
Other users are adding comments on the physical state of the image or difficulties and questions they have around text correction. We have observed users using the comments to communicate with other users.
We are not moderating text correction and this was a risk that both we and the users were aware of. To date no vandalism of text has been reported to us or noticed by us. By being transparent about the lack of moderation and giving a high level of trust to our users we appear to have gained a committed, responsible and dedicated group of text correctors. Some have likened it to Wikipedia. However if a user were to change something incorrectly we can see by this example that it would not take long for another user to notice it and correct it. In the example 3 different users are correcting the same article and helping each other in a matter of minutes. The users are therefore moderating each other’s corrections at the moment. In the worst-case scenario that something was changed totally incorrectly, other users would be aware of this since they can all still see the image. Also the search engine searches all text, even corrections of corrections, so the original terms are still retrievable. Users have been using the comments field to communicate with each other and ask for help, as this example also shows. This is because there is no other forum for them to communicate with each other at present.
Since the release of the service in July 2008 text correction has remained consistent among a core group of 1300 correctors who have mostly been doing the same amount per month. Between 300,000 and 400,000 lines of text are corrected per month in 15,000 to 20,000 articles. There was a slight dip in November which was due to no new articles being added that month (which many users said de-motivated them). However text correction increased in January, despite there still being no new content added. Perhaps a lot of people were staying inside in the 40 degree heat looking for things to do with air-con on?
In the first 6 months a total of 2 million lines in 100,000 articles had been corrected. The top 5 correctors had consistently remained in the top 5 each month and were working up to 45 hrs per week on text correction. Top correctors are correcting up to 30,000 lines per month. We had many users saying that text correction is proving to be an ‘addictive’ or compulsive activity. They sat down to fix a few words for 5 minutes and before they knew it 3 hours had passed. This was very interesting.
Due to user demand we instigated the ‘hall of fame’ into the beta service. The top 5 correctors show on the home page and also in the hall of fame. Originally the hall of fame only showed the top 10 but users wanted to see more, so now it is anyone who has corrected more than 5000 lines per month. Users are still asking for entire league tables however so they can see where they are in the big picture. This is a motivating factor for them. During development it was suggested that we need to use gaming technologies to encourage people to correct text but this has so far not proved necessary!
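The hall of fame rule described above (top 5 on the home page, and everyone over 5000 corrected lines in the month listed in the hall of fame) can be sketched as a simple ranking. The function name, the tie-breaking by user name and the sample figures are assumptions made for illustration, not details of the live service.

```python
from typing import Dict, List, Tuple

Ranking = List[Tuple[str, int]]


def hall_of_fame(
    lines_this_month: Dict[str, int],
    threshold: int = 5000,
    home_page_size: int = 5,
) -> Tuple[Ranking, Ranking]:
    """Return (home page list, hall of fame list) for one month.

    Users are ranked by lines corrected, highest first; ties are broken
    alphabetically by user name (an assumption for this sketch).
    """
    ranked = sorted(lines_this_month.items(), key=lambda item: (-item[1], item[0]))
    home_page = ranked[:home_page_size]
    fame = [(user, lines) for user, lines in ranked if lines > threshold]
    return home_page, fame


# Hypothetical monthly totals for seven correctors.
monthly = {
    "user_a": 30000,
    "user_b": 12000,
    "user_c": 5000,
    "user_d": 7500,
    "user_e": 800,
    "user_f": 5001,
    "user_g": 40,
}

home_page, fame = hall_of_fame(monthly)
print([user for user, _ in home_page])
print([user for user, _ in fame])
```

The same ranked list could back the full league table that users have asked for, since each corrector's position is simply their index in the sorted result.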
So after all this activity the most common question people kept asking me was “Who are these people?” and also “Why do they do it?” Some people even suspected that the text correctors were really library staff, which is not the case. The text correctors are real, normal people. We sent some of them a survey to find answers to our questions about how long they spend correcting, why they do it, what motivates them, what would motivate them to do more or less? The responses were very interesting.
The three main reasons for correcting text were: (1) we’re helping to provide an accurate record of Australian history; (2) we want to record family names and help others as we go; and (3) we think it is a useful cause that will help all Australians, the Library, and ourselves, and we are willing to give time for this.
The motivating factors given were no different to those that motivate anyone to do anything: they enjoy it, they have their own research goals, they think about the main outcome (i.e. making it better for everyone), they have been given a high level of trust and respect to do the job, and it is a challenge.
To maintain or increase their motivation they again gave standard motivational answers. Things we had not done which they would like were to give them detailed instructions on how to do the job, to create for them a feeling of team spirit and being part of a virtual community, to recognise their achievements and acknowledge they were making a difference, and lastly to give them more content. They said the more content they were given the more they would do. Many noted that we had not publicised the service in any way or called for volunteers and the potential to harness a lot more volunteers was vast.
All our top 5 correctors are Australians living in Victoria, New South Wales, and Queensland, with one in America. The five turned out to be 6 since one was a married couple sharing a logon to do research. Of the 6, 4 are female and 2 male. One is working full-time, one is a stay at home mum and 4 are retired. They are aged between 38 and 65. Three of the correctors are correcting as a volunteer ‘do good’ activity and trying to think up topics to correct, whereas the other 3 are correcting around their own areas of family history and local research. 2 of the 6 are also transcribing shipping records and births, marriages and deaths for other organisations. Here are some quotes from some of our top correctors. Julie is our top corrector and has corrected 2,500 articles so far. She is in her thirties and is a stay at home mum. She mainly corrects articles on local history and murder and corrects whole articles at a time. She says “I enjoy the correction – it’s a great way to learn more about past history and things of interest whilst doing a service to the community by correcting text for the benefit of others. I keep doing it because of the knowledge that you are doing something that will benefit future people that wish to access articles on their family history.”
Catherine is located in Washington DC and works full-time as the Director of an e-commerce company. She says “I enjoy typing, want to do something useful and find the content fascinating. I do it to benefit others”. Also she does not watch much TV. Lyn and Maurie, a retired couple, work on it together as part of their family history shipping research. They also do voluntary work for the mariners records. They say “We get sick of doing housework, we find text correction addictive and it helps us and other people. How can you not correct errors when you see them?”
Mick is recently retired from IT. He says “I thought I could be of some assistance to the project. It benefits me and other people. It helps with my family research. I would do more if I had broadband and did not have to share the computer with the rest of my family!” Fay is retired. She says “I enjoy the challenge, I need something to do in my spare time and it benefits me and others”.
Many of our current text correctors are genealogists. Genealogists do things that other groups of people may not. There is a genealogists’ ‘to do’ list that is circulating on blogs at the moment. It gives a useful insight into the life of a genealogist. One thing that is very important to them is what they call ‘random acts of genealogical kindness’, where they may do something helpful for someone else that will help them trace their family tree. They also do organised acts of kindness such as transcribing births, marriages and deaths records. Genealogists very quickly get to grips with new technology if it helps them access resources or achieve one of their objectives.
We have been gathering feedback from users for 6 months about the beta service and text correction in particular. The feedback has been overwhelmingly positive and thousands of suggestions and comments have been received. The feedback was gathered from a survey form, from e-mails, by observation of users, by statistics, and by lurking on forums and blogs (going into the users’ spaces). The users have given us valuable feedback so that we can better meet their needs. Some of their ideas match our own and other ideas they have given us are innovative and fresh and we had not thought of them ourselves.
The main requests from users for improvements are as follows: improve the text correction feature (so they can do more); provide more advanced searching, including the ability to define and search across enhancement layers (e.g. tags only, tags and corrected text only, tags and comments); provide a communication mechanism, e.g. a forum; enhance user profiles; provide more statistics, including where they are in the big picture of text correctors; alert users to new content coming into the database; and provide guidelines for enhancement activities.
The lessons we have learnt to date are that engaging with users and building virtual communities is just as important to the users as providing the data itself. They want to be part of a community. By giving the users a high level of trust we have built commitment and loyalty in the community. Another lesson we have learnt is that using the term ‘text correction’ is not always helpful. It implies that something will be corrected and the old version deleted, which has caused concern to stakeholders and to the public. However as users undertake the activity it has become apparent that what they are doing is ‘enhancing’ or ‘enriching’ the data. They are actually creating layers on top of the original data, and all the layers can be transparent and separate or jointly searchable. The term ‘enhancement of data’ is not one which has yet become common terminology in Australian libraries but it will not be long before it does and is commonly understood by both the public and libraries. Lastly we know that Australian Newspapers has had a big ‘social impact’ on people’s lives and the genealogical community. We are unable to quantitatively measure the impact or predict what may happen next.
Traditionally libraries have held the power and control over data but the Australian Newspapers service is shifting that power to the community. Recently Barack Obama speaking on community engagement and volunteering said “Don’t under-estimate the power of people who join together …. They can accomplish amazing things”. This is true. People want to achieve amazing things and we as librarians have the power to give them both the data and the tools to do this – they will do the rest themselves. The challenge for the library is now how to nurture, sustain and grow this virtual community we have created and their resulting activities.
The future potential of text enhancement is mind boggling when you think of it in the world context. In Australia alone we have 21 million people, more than half of whom have internet access at home so could potentially be volunteers. The FamilySearch Indexing project reports that in their first year they had 2000 volunteers and by their third year they had 160,000 volunteers correcting birth, marriage and death records. The Australian Newspapers program has the potential to match this easily. But why just think about Australian Newspapers? This functionality could be applied to many other full-text resources; indeed a global centre could be established where users decide what types of materials from which countries they wish to enhance. The future is exciting and open.
That brings me to the end of my talk. I could of course talk a lot longer but I wanted to give you the opportunity to be able to ask me some questions. There is a full report on the activity of text correctors called ‘many hands make light work’ on the website. Thank you.
Enhancement and Enrichment of Digital Content by User Communities: The Australian Newspapers Experience. March 2009
Enhancement and Enrichment of Digital Content by User Communities: The Australian Newspapers Experience <ul><li>Rose Holley </li></ul><ul><li>Manager - Australian Newspaper Digitisation Program </li></ul><ul><li>National Library of Australia </li></ul><ul><li>Innovative Ideas Forum: </li></ul><ul><li>The value and significance of social networking for cultural institutions </li></ul><ul><li>27 March 2009, Canberra </li></ul>
Objectives <ul><li>Increase access to Australian newspapers </li></ul><ul><li>Build a national service that will provide free online access from the first Australian newspaper published in 1803 through to the end of 1954 </li></ul><ul><li>Key Features of the service </li></ul><ul><ul><li>Online access </li></ul></ul><ul><ul><li>Freely available </li></ul></ul><ul><ul><li>Full text searchable </li></ul></ul>
National Program and Content <ul><li>Initial focus on major titles from each state and territory </li></ul><ul><li>‘Regional’ titles being contributed by libraries 2009 onwards </li></ul><ul><li>Coverage: published between 1803 – 1954 </li></ul><ul><li>(out of copyright) </li></ul>Titles: West Australian, Northern Territory Times, Courier Mail, Advertiser, Sydney Morning Herald, Sydney Gazette, Argus, Mercury, Canberra Times
Overview <ul><li>Project started 2 years ago </li></ul><ul><li>Digitise from microfilm (outsourced) </li></ul><ul><li>1.8 million pages scanned so far </li></ul><ul><li>Australian Newspapers beta released July 2008 </li></ul><ul><li>360,000 pages (3.5 million articles) in beta </li></ul><ul><li>Will make 4 million pages (40 million articles) available to public by 2011. </li></ul>
Behind the scenes… <ul><li>Software development </li></ul><ul><ul><ul><li>Newspapers Content Management System </li></ul></ul></ul><ul><ul><ul><li>Quality Assurance modules </li></ul></ul></ul><ul><ul><ul><li>Search and Delivery System </li></ul></ul></ul><ul><li>Infrastructure – storage </li></ul><ul><ul><ul><li>63 TB </li></ul></ul></ul><ul><li>Digitisation (outsourced) </li></ul><ul><ul><ul><li>Scanning of microfilm </li></ul></ul></ul><ul><ul><ul><li>OCR of articles </li></ul></ul></ul><ul><ul><ul><li>Additional processes (categorising, zoning, re-keying) </li></ul></ul></ul><ul><li>Quality assurance of data </li></ul><ul><ul><ul><li>Before acceptance/delivery </li></ul></ul></ul>
Development cycle <ul><li>Search and Delivery System </li></ul><ul><li>2007 – Prototype (to state and territory libraries for feedback) </li></ul><ul><li>2008 – Beta (to public for feedback) </li></ul><ul><li>2009 – Version 1 official launch (planned) </li></ul>
Comments 1. Some users add further information about the content and people mentioned in article
Comments 2. Some users add notes on the physical state of the image or difficulties they are having with text correction.
Sample of user activity Nov 08 <ul><li>Users seem to observe accidental mis-corrections of others within a short space of time and correct them. </li></ul><ul><li>No vandalism of text has been observed to date </li></ul><ul><li>Correctors help each other </li></ul>
“Who are the text correctors?” Flickr: LucLeqay
Why correct text? <ul><li>Australian history - Helping to provide accurate record (sometimes linked to local history research) </li></ul><ul><li>Family Names - Doing family history and help others with names as they go by correcting </li></ul><ul><li>Useful cause and want to help Australian community/Library/themselves </li></ul>
Motivating factors <ul><li>Pleasure </li></ul><ul><li>Short and long term goals </li></ul><ul><li>Concentrating on outcomes </li></ul><ul><li>Trust and Respect given </li></ul><ul><li>The challenge </li></ul>http://www.pickthebrain.com/blog/21-proven-motivation-tactics/
Maintaining motivation <ul><li>Detailed instructions - If you want a specific result, give us specific instructions. We will work better when we know exactly what’s expected. </li></ul><ul><li>Team Spirit - Create an online environment of camaraderie. We’ll work more effectively when we feel like part of team or virtual community. We don’t want to let others down. </li></ul><ul><li>Recognize achievement - Make a point to recognize achievements one-on-one and also in group settings. We like to think we are being noticed and are making a difference. Show us how we fit into the big picture. </li></ul><ul><li>Raising the bar – The more we do the more you should expect us to do. We’ll do a lot more if you give us a lot more content. That would be our highest motivational factor. </li></ul>
Understanding genealogists <ul><li>http://blog.epcrowe.com/2009/01/07/104-genealogy-things-done-to-do-not-going-there </li></ul><ul><li>Things they do: </li></ul><ul><li>Learn new technology quickly to access relevant resources </li></ul><ul><li>Perform random acts of genealogical kindness (e.g. marking up names for others) </li></ul><ul><li>Regularly do indexing for Family Search Indexing or other genealogy projects to help others. </li></ul><ul><li>Do lots of social networking </li></ul><ul><li>Look for convict ancestors and long lost cousins in Australia </li></ul>
Opinions of users <ul><li>‘OCR text correction is great! I think I just found my new hobby!’ </li></ul><ul><li>‘It’s looking like it will be very cool and the text fixing and tagging is quite addictive.’ </li></ul><ul><li>‘An interesting way of using interested readers “labour”! I really like it.’ </li></ul><ul><li>‘A wonderful tool - the amount of user control is very surprising but refreshing.’ </li></ul><ul><li>‘I applaud the capability for readers to correct the text.’ </li></ul>http://www.nla.gov.au/ndp/project_details/documents/ANDP_TextCorrectionComments.pdf http://www.nla.gov.au/ndp/project_details/documents/ANDP_PositiveFeedbackBetaDec2008.pdf
Requests from users <ul><li>Improve text correction feature </li></ul><ul><li>Advanced searching of layers of enhancements </li></ul><ul><li>Communication mechanism </li></ul><ul><li>User profiles </li></ul><ul><li>More stats and where they are in big picture </li></ul><ul><li>Alerting to new content </li></ul><ul><li>Guidelines for enhancement activities </li></ul>
Lessons learnt <ul><li>Engaging with users just as important as improving data quality (in opinion of users) </li></ul><ul><li>Giving users high level of trust results in commitment and loyalty </li></ul><ul><li>‘ Correction’ implies deletion vs ‘Enhancement’ implies adding layers safely </li></ul><ul><li>Big social impact </li></ul>
The power <ul><li>"Don't under estimate the power of people who join together…. they can accomplish amazing things," </li></ul><ul><li>Barack Obama 19 Jan 2009 Speaking on community engagement and involvement and voluntary work </li></ul><ul><li>Rose says: </li></ul><ul><li>People want to work together to achieve amazing things – we as librarians have the power to give them both the data and tools to do this - they will do the rest…… </li></ul>
Future potential of text enhancement <ul><li>Could have hundreds of thousands of volunteers if publicised </li></ul><ul><li>Could apply to other full text collections </li></ul><ul><li>Could develop a global system </li></ul>