Metadata Extraction Projects   Pru Mitchell & Sarah Hayman Education Network Australia

Metadata Extraction Projects for Education Network Australia


Overview of proof of concept projects presented at Metadata 2010 conference: Sharing Data, Sharing Ideas Canberra, 26-27 May 2010.

Published in: Education, Technology
  • This presentation is based heavily on a paper by colleagues Sarah Hayman and Nick Lothian, written for the Dublin Core Metadata Initiative 2009 conference: Towards linked data: metadata extraction projects for Education Network Australia (edna), DC-2009 "Semantic Interoperability of Linked Data", Seoul, Korea, 12-16 October 2009. It briefly presents some metadata extraction projects undertaken over the past two years by the Education Network Australia (edna) team, now within Education Services Australia.
  • Education Services Australia was formed from the merger of Australia’s ICT in education national agency and Curriculum Corporation. It provides services for the Ministerial Council for Education, Early Childhood Development and Youth Affairs (MCEECDYA) and for the new national education agencies, the Australian Curriculum, Assessment and Reporting Authority (ACARA) and the Australian Institute for Teaching and School Leadership (AITSL). Its Objects are: (a) to advance key nationally-agreed and commissioned education initiatives, programs and projects in line with national education initiatives such as the national curriculum by providing services to MCEECDYA and other education and training bodies. Included within scope are: (i) researching, testing and developing effective and innovative technologies and communication systems for use in education; (ii) devising, developing and delivering curriculum and assessment, professional development, career and information support services; (iii) facilitating the pooling, sharing and distribution of knowledge, resources and services to support and promote e-learning; and (iv) supporting national infrastructure to ensure access to quality assured systems and content and interoperability between individuals, entities and systems; (b) to create, publish, disseminate and market curriculum and assessment materials, ICT based solutions, products and services to support learning, teaching, leadership and administration, as required by the company owners and/or to be paid for by those organisations which commission such work; and (c) to act as required as the legal company for MCEECDYA.
  • Across our services we are involved in a multitude of metadata standards, profiles, vocabularies and applications, all driving description and discovery of information, resources and services across Australian education and training. The logos in the ring indicate Australian education sector specific metadata, services and licences which were developed or are managed within Education Services Australia-based projects and services. New ones seem to be coming along regularly, eg ACARA levels of schooling have been added to the registry since this diagram was developed. Those outside the ring are a selection of external services, standards and organisations that we make use of in our work: education-specific ones on the right-hand side, and on the left-hand side those used across education, broader government and web-based services. The reference for Australian education metadata is A Handbook of Guidelines for Metadata Usage in Australian and New Zealand Education and Training (2007), HB 256:2007, available from Standards Australia. Obviously standards and interoperability are major priorities for us!
  • This presentation is about Education Network Australia, our second longest-running service, established in 1997. edna is an aggregator service, linking to evaluated online resources, news and events of relevance to educators across all sectors. We host and facilitate online communities through groups, lists and social networking services. Our metadata is available via search and browse functionality on the edna website, but can just as easily be exported or delivered via API, RSS feeds, or search services that can be plugged in to other sites. We provide a federated search across edna in combination with other related databases of interest to educators, eg ABC, the Australian Council for Educational Research’s EdResearch Online, the Australian Flexible Learning Framework and the National Council for Vocational Education Research’s VOCED database. Our services are all free and government funded. Our focus is firmly on content and communities for Australian educators. We collect and disseminate online content and information (metadata) about that content. Increasingly we are looking at Web 2.0 approaches to collecting content and metadata, and that is what I’ll be looking at today.
  • These are issues that have faced Education Network Australia, and many other similar services, over the past two years. The projects discussed in this presentation are investigations aimed at addressing one or more of them: more content (much more content); less funding (and less security of funding); fewer cataloguers (who are doing many new things); and the same old clunky metadata creation tools.
  • We have tried all these solutions in various ways: reduce quantity of metadata (AGLS was cut with the demise of the Government Education Portal); reduce quality of metadata (bookmarks harvest only url, title, description, subject and sector); get someone else to provide or pay for metadata, whether users (including via ESC, eg delicious feeds) or other organisations (federated search and harvesting; TLF had a deal with cultural institutions and organisations with content of interest to schools); collaborate and share metadata (ASN and machine-readable curriculum); improve metadata creation tools (the ESC tool gives a faster workflow); and, still an open question, program machines to create metadata (the Flinders project).
  • From the edna project, here are four proof of concept projects related to the harvesting and management of metadata. Proof of concept (POC) methodology at edna typically involved $30,000 for 3 months, with 3 show-and-tell presentations as progress markers. This was quite successful in developing options and issues, and in some cases led to production systems. There are also frustrations with POC methodology. The first is a professional networking project (which has just won a bronze award at the IMS international Learning Impact awards for New and Research & Development Entries); it contributes to the growth of the edna service by providing us with significant user-contributed metadata. The edna Sustainable Collections project resulted in the development of a tool (ESC) which is available to the edna metadata creation team and addresses efficiency by extracting user and page metadata, plus some linked data from OpenCalais, and presenting these to the metadata creator to aid decision making. The faceted search project addressed two issues which in fact did not save time, but were attempts to improve the value of metadata to the end user by making licence and audience metadata much more accessible and useful as filters for teachers. Finally, a proof of concept project with Flinders University Artificial Intelligence and Knowledge Laboratory looked at whether artificial intelligence applied to automated metadata generation could improve the efficiency or quality of subject metadata in edna.
  • This is a professional networking site, built for educators, filling a need for a place that allows social networking but in a professional environment. Sharing resources is one of its key features. There is much other activity on the site, eg blogging, posting messages and connecting with those of similar interests, and it is also used successfully as an e-portfolio environment. As well as many other things, the system collects and aggregates metadata.
  • This is a professional networking site for Australian educators. It currently has 19,000 registered users, of whom 12,500 are public, although with the same varying levels of activity that many social services experience. This shot shows a logged-in view of the site in the social bookmarking tab. Profile metadata about the user is down the side, including interests; the words used for interests are tags and automatically become community names. Being a registered me user means I have a blog space. However, I won’t go into all the details of me and how it works here; we will just look at the metadata harvesting aspect. One of the main features of the site is the ability for users to share web-based resources with their colleagues, either directly or by entering an RSS/Atom feed of their own publishing activity as a source of links. The inset shows the feed management screen, where I add feeds of my activity across the web that I think might be useful to my colleagues on the site. In the bottom right-hand inset we can see a list of my colleagues whose bookmarked resources will come through to my view, whether they have bookmarked them using the service itself or through any other bookmarking activity across the web that they choose to share through their feeds. These feeds are themselves aggregated and can be fed by RSS to anywhere else a user is ‘living’ on the web: no lock-in.
  • When resources are entered into the system, it automatically extracts as much metadata as possible. edna developers built a generalised metadata extraction and mapping system which can automatically extract and store metadata found in RSS/Atom feeds, HTML pages and MP3/podcasts. It stores metadata in a relational (Postgres) database, but in a schema based around the RDF data model. This screen shows the collection of geographic metadata to go with a Flickr picture that Nick posted as a shared link. It’s a photo from the Powerhouse Museum’s Flickr collection. When he added it, the site displayed a map using metadata the system had harvested. The shared link points to a Flickr photo page which was marked up using the geotag standard (the system supports both the ICBM and geo.position methods). Using that metadata enabled the location of the photo to be extracted and a map to be displayed. In the future there is potential to do a position-based search which shows other resources in the same area. The system also makes use of the Media RSS format, which allows extraction and use of thumbnail images.
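As a rough illustration of the geotag extraction described above, the sketch below reads the two meta tag conventions named in the notes (ICBM and geo.position) from an HTML page. It is not the edna developers' code: the class name `GeoMetaParser` and the function `extract_position` are made up for this example, and only the Python standard library is used.

```python
from html.parser import HTMLParser


class GeoMetaParser(HTMLParser):
    """Collects the first usable ICBM or geo.position meta tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.position = None  # (latitude, longitude) once found

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.position is not None:
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        if name == "icbm":
            parts = content.split(",")  # ICBM convention: "lat, lon"
        elif name == "geo.position":
            parts = content.split(";")  # geo.position convention: "lat;lon"
        else:
            return
        try:
            lat, lon = (float(p.strip()) for p in parts)
        except ValueError:
            return  # malformed value or wrong number of parts: ignore the tag
        self.position = (lat, lon)


def extract_position(html):
    """Return (lat, lon) from the page's geotag meta tags, or None if absent."""
    parser = GeoMetaParser()
    parser.feed(html)
    return parser.position
```

A caller would pass the fetched page source to `extract_position` and, when a position comes back, use it to centre a map or to store a location alongside the resource record.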
  • Tagging was always a key element of the thinking behind the site. This is the tag cloud of the top 100 communities, based on number of members and activity levels. The inset below is a section of the members and shared resources of the Digital Storytelling community; if a member bookmarks a resource and uses the tag Digital Storytelling, it will automatically feed the community’s resource list. As well as the traditional triad of person + resource + tag, we have added community.
  • Finally, in terms of metadata, the site enhances discoverability. In addition to finding bookmarks and third-party resources, users are starting to obtain easy access to personal profiles and blogs. This means that when they search edna for a topic such as education-related copyright policy, they retrieve not only documents, events, communities, news and resources, but can also find key people in this field who are members of the site with their status set to public (as opposed to private, colleagues only, or logged-in members only).
  • A project which also involves metadata extraction resulted in the development of ESC, a tool designed to take advantage of the Web 2.0 environment and harvest bookmarked items from key educators in Australia and worldwide. The tool at this stage is for internal use, to enhance the edna collection. Our aim in developing it was to inform and enhance the work of our internal resource coordinators, who find, evaluate and index resources for edna. Using social bookmarking services as part of collection building, we can see what our users themselves value and harvest those resources as candidates for our collections, as well as finding the terms that these domain experts are using to tag their resources. This in turn has the capacity to inform our own thesauri (folksonomy-directed taxonomy, Hayman & Lothian). We love Mark Matienzo’s picture from Why you should support linked data, which shows something of the initial feeling of some cataloguers about this process. The other interesting part of this project was that OpenCalais had just been released, and we built it in to see whether the entity data coming from there could add to the metadata creation workflow.
  • It takes selected quality RSS feeds and displays their items together with descriptions, subject tags and other useful metadata (some created through automation). RSS feeds come from bookmarkers, from delicious bookmarkers, or from delicious tags such as ‘most popular items tagged e-learning’; basically any RSS feed can be used. Information management staff at edna select feeds and describe each feed (more metadata). They then evaluate and select individual items from that feed, together with their metadata, and add those selected straight to the edna collection. ESC allows for mapping to DSpace and checks for duplicates.
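The workflow just described (take an RSS feed, extract the available metadata, check for duplicates, map to a metadata profile) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the ESC implementation: the function names, the Dublin Core style field names in `to_candidate_record` and the separate `bookmarker` field are all hypothetical, and the duplicate check is a plain comparison of URLs.

```python
import xml.etree.ElementTree as ET

# Namespace used by the dc:creator element that many bookmark feeds include.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def items_from_rss(rss_text):
    """Yield one dict of raw metadata per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    for item in root.iter("item"):
        yield {
            "link": (item.findtext("link") or "").strip(),
            "title": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "tags": [c.text.strip() for c in item.findall("category") if c.text],
            "bookmarker": (item.findtext(DC_NS + "creator") or "").strip(),
        }


def to_candidate_record(raw):
    """Map raw feed metadata to Dublin Core style fields (hypothetical mapping)."""
    return {
        "dc.identifier": raw["link"],
        "dc.title": raw["title"],
        "dc.description": raw["description"],
        "dc.subject": raw["tags"],
        # The feed's creator is the bookmarker, not the author of the resource,
        # so it is kept apart for a human to review.
        "bookmarker": raw["bookmarker"],
    }


def harvest(rss_text, known_urls):
    """Return candidate records for feed items whose URL is not already known."""
    seen = set(known_urls)
    candidates = []
    for raw in items_from_rss(rss_text):
        url = raw["link"]
        if not url or url in seen:
            continue  # duplicate check: already collected or repeated in the feed
        seen.add(url)
        candidates.append(to_candidate_record(raw))
    return candidates
```

The records returned by `harvest` are only candidates; in the workflow described in these notes a person still evaluates each item before anything is added to the collection.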
  • This picture shows the screen in ESC where feed management is done. This is what a logged-in user sees. For each feed it shows the name, the URL, the person who added it and whether it is active or inactive. If a feed is inactive then its items are not collected. That can be toggled on or off at any time. If a checkbox is ticked, the feed’s items are visible to the user. This is useful for our resource coordinators, who work in particular sectors and want to monitor certain feeds. Feeds can come from individual bookmarkers, internal or external, or from organisations, and can be built by users in, eg, Yahoo Pipes. It is simply a tool, and the choice of feed is up to us. We can use Diigo, delicious, our own bookmarks and so on. The aim is to identify good bookmarkers out there who are already doing this and monitor their recommendations.
  • This shot shows in close-up the addition of a new feed. Here I’m adding a feed of ABC Radio’s Education news. I’m providing a name, URL and description for the feed.
  • Next we see a couple of the items that have come through from a user’s feed, with three kinds of metadata. Metadata from delicious: url, title, date and creator (note we have an issue here with bookmarker vs creator and with bookmarking date; these will be corrected by the Information Officer, or IO). Metadata from the page: this resource has been given excellent DC metadata, which ESC has harvested from the page. Calais metadata: the OpenCalais service built into our ESC tool has provided some industry terms different from the delicious tags and the page subject headings. We found OpenCalais to be less useful than we had hoped, and would like to find ways to improve the relevance both in terms of Australian entries and education-sector entities. The National Library of Australia is an obvious place to start. All of this metadata is available to the IOs to select and pull into DSpace if they wish. The final judgment is theirs, but we have provided a very useful set of candidate terms.
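One way to picture the "set of candidate terms" drawn from the three sources above is a merge that removes case and spacing differences and remembers which source suggested each term. The sketch below is purely illustrative: the function name `merge_candidate_terms` and the source labels are assumptions, and nothing here reflects how ESC actually presents terms.

```python
def merge_candidate_terms(sources):
    """Merge subject terms from several sources, case-insensitively.

    sources maps a source label (eg "delicious", "page", "calais") to a list
    of terms. Returns (term, [source labels]) pairs in first-seen order, so a
    reviewer can see which sources agree on a term.
    """
    merged = {}
    for source, terms in sources.items():
        for term in terms:
            key = " ".join(term.lower().split())  # normalise case and spacing
            if not key:
                continue  # skip empty strings
            entry = merged.setdefault(key, {"term": term.strip(), "sources": []})
            if source not in entry["sources"]:
                entry["sources"].append(source)
    return [(entry["term"], entry["sources"]) for entry in merged.values()]
```

Terms suggested by more than one source could then be listed first for the reviewer, though whether agreement between sources signals relevance is itself an assumption that would need checking.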
  • This has resulted in efficiency gains as well as a unique perspective on what edna users value. From edna’s inception, all three of these tasks for the edna collection have been undertaken by Information Officers. It is important to free them to do the higher-level tasks, but equally important to harness the user perspective: what do users value and how do they describe it? Including user-contributed links has resulted in a wider range of resource types (blog posts, videos, etc). The tool is flexible: we can add any feed and can construct feeds; it is a management tool for us. Things to do: improve mapping (map metadata more precisely and look at what else we can capture, eg user ratings, or giving preference to items bookmarked by more than one person); and develop business rules for feed management (it is still early days, but we may need to look at how we choose feeds). That leads to the question of what we do with the repository itself. edna is still built by us, albeit now with help from user suggestions and metadata. But we could publish this other repository as something between edna at one end, a highly selective and evaluated set of resources, and, at the other, the full set of bookmarked items valued by educators worldwide, with some edna-like structure and metadata applied. This is worth considering for the future.
  • Copyright costs in education are a major issue. Delia Browne quotes 2006 figures for education copyright agency fees of $45m (see Slide 12). This was before the Digital Education Revolution and the explosion of digital use. There is pressure to apply the print licence regime to the free online digital world. Helping educators understand what they can do with which online resources, in terms of re-use, publishing online and modifying, was part of this project. The suite of 14 licences commonly used in education material was mapped to two basic questions: can I share this material with others by re-publishing it? And can I modify this material and share it with others, ie by re-publishing it online? Obviously there are many other nuances of licensing, but given that our users do not have to log in to conduct searches, it was not possible to tailor the facets to deal with education-specific licences, which depend on knowing the jurisdiction and rights of the individual searcher. There was also a requirement to study the impact of algorithms which preferenced openly licensed material.
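The idea of reducing each licence to the two questions above (can I share it, can I modify it and share the result) can be sketched as a small lookup plus a filter. The table below is illustrative only and is not the 14-licence mapping used in edna: it lists a few Creative Commons licences and an "All rights reserved" label, and it deliberately ignores conditions such as attribution, non-commercial use and share-alike. The names `Permissions`, `LICENCE_PERMISSIONS` and `filter_by_permissions` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permissions:
    can_share: bool   # may the material be re-published?
    can_modify: bool  # may it be modified and the result re-published?


# Illustrative table only; licence conditions are not modelled here.
LICENCE_PERMISSIONS = {
    "CC BY": Permissions(True, True),
    "CC BY-SA": Permissions(True, True),
    "CC BY-NC": Permissions(True, True),
    "CC BY-ND": Permissions(True, False),
    "All rights reserved": Permissions(False, False),
}


def filter_by_permissions(records, share=False, modify=False):
    """Keep records whose licence allows everything that was asked for.

    Records with a missing or unrecognised licence are treated as granting
    no permissions, which is the cautious default.
    """
    no_permissions = Permissions(False, False)
    kept = []
    for record in records:
        perms = LICENCE_PERMISSIONS.get(record.get("licence"), no_permissions)
        if share and not perms.can_share:
            continue
        if modify and not perms.can_modify:
            continue
        kept.append(record)
    return kept
```

Treating unknown licences as "no permission" keeps the facet conservative: a resource only appears under "ok to share" when there is positive evidence that sharing is allowed.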
  • The various edna facets are explained at: In this example, a search for assessment, filtered by sector = VET and then by ok to share & modify, produces resources labelled with a Creative Commons licence. To set this up after 10 years of not collecting this DC.Rights information consistently, an initial scan was made across the home pages of the 40,000 items in the edna repository, matching them to any of the machine-readable licences within this project (in particular Creative Commons). From now on, DSpace looks for this machine-readable information for individual urls as they are added to the database and harvests it if available.
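A scan of the kind described above could, for example, look for the common rel="license" link convention that Creative Commons recommends for marking up pages. The sketch below shows that one convention only and is an assumption about how such a scan might work, not a description of the edna or DSpace code; `LicenceLinkParser`, `find_licence_urls` and `creative_commons_code` are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LicenceLinkParser(HTMLParser):
    """Collects href values of <a> and <link> elements marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licence_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "license" in rel_values and href:
            self.licence_urls.append(href)


def find_licence_urls(html):
    """Return every licence URL declared on the page, in document order."""
    parser = LicenceLinkParser()
    parser.feed(html)
    return parser.licence_urls


def creative_commons_code(url):
    """Return eg 'by-nc-sa' for a creativecommons.org/licenses/ URL, else None."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("creativecommons.org", "www.creativecommons.org"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "licenses":
        return parts[1]
    return None
```

The licence code returned by `creative_commons_code` could then be looked up in whatever licence list the repository maintains; pages with no machine-readable declaration simply yield an empty list and would need a person to check the rights statement.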
  • See 14 licences and their mapping at: These are available from a lookup list within DSpace.
  • In 2008-2009, a joint project was undertaken by researchers from Flinders University Artificial Intelligence and Knowledge Laboratory in collaboration with information managers from the agency that manages and produces Education Network Australia (edna).
  • The project aimed at partial automation of the categorisation and annotation of web pages for inclusion in edna. It also investigated the possibility of analysing resources in order to classify them automatically into edna browse categories, and of suggesting controlled vocabulary subject terms from the Schools Online Thesaurus. It is similar to ESC in looking at discovery, evaluation and description, but different in that it aimed to learn from what metadata creators do and build a smart tool to suggest classifications, not just analyse text; this is the AI aspect. So the researchers have built a simple tool to collect data about how classification decisions are made for edna.
  • The intention was to use the edna repository as a training set. This screen shows a resource from the edna repository that has been delivered at random; it asks the IO to quote the text in the resource that was the reason for each of the categories assigned. This proved to be very difficult for audience level metadata: much of that information was not explicit in the text, but was conveyed through images, page layout, descriptions on pages deeper in the site, or style of language. The concept of subject was more successful, but the time required to undertake this manual training exercise was more than we could achieve with the time and funding available, so additional funding is being sought to continue this part of the research.
  • Attempting to predict the edna category from the subject term using semantic relatedness yielded very promising results: a 60% success rate in mapping a subject term to a category (see Table 1), against 35% accuracy for mapping a category from the title of the resource alone (suggesting that the expert-provided subject term is a more reliable guide to the topic of the resource). The findings so far seem to bear out that the best results come from a human eye and expertise, both in evaluating the items and in describing them. However, if more likely candidates, together with potential metadata terms, are delivered to the human experts, that will save time and increase efficiency, and therefore speed for the user. We found that, of course, much essential information is contained in images, style and tone, and this is difficult for a machine to capture. The researchers tried both WordNet and Wikipedia terms for semantic mapping and found that the Wikipedia set of terms was better for the purpose. Preliminary results have indicated that a knowledge engineering approach may be more successful than a standard machine learning approach. The researchers are currently looking at various possible data sets to use, such as conference papers from an IT teachers’ conference, and also a range of possible controlled vocabularies, including the Australian Schools Online Thesaurus, edna categories and potentially the new Australian national schools curriculum terms. If people have questions about this project I can refer you to the researchers involved.
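To make the shape of the category-mapping experiment concrete, the sketch below picks the category most related to a subject term and computes an accuracy figure as the fraction of terms mapped to the expected category. It is a toy under loud assumptions: the relatedness measure here is simple word overlap (Jaccard), standing in for the WordNet and Wikipedia based measures the researchers used, which are not reproduced; the category descriptions in the usage example are invented; and the way accuracy is computed is an assumption about how a figure such as 60% might be derived.

```python
def tokens(text):
    """Lower-case words of a term or description, with hyphens split."""
    return text.lower().replace("-", " ").split()


def jaccard(a, b):
    """Word-overlap score between two token lists (stand-in relatedness)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_category(subject_term, categories, relatedness=jaccard):
    """Return (category name, score) for the most related category.

    categories maps a category name to descriptive text. If nothing is
    related at all, (None, 0.0) is returned.
    """
    term_tokens = tokens(subject_term)
    best_name, best_score = None, 0.0
    for name, description in categories.items():
        score = relatedness(term_tokens, tokens(description))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


def accuracy(pairs, categories):
    """Fraction of (subject term, expected category) pairs mapped correctly."""
    if not pairs:
        return 0.0
    hits = sum(1 for term, expected in pairs
               if best_category(term, categories)[0] == expected)
    return hits / len(pairs)
```

For example, with hypothetical categories `{"Science": "science physics chemistry biology", "Arts": "arts music drama visual"}`, the term "marine biology" maps to "Science". Because `relatedness` is a parameter, a WordNet or Wikipedia based measure could be passed in instead of `jaccard` without changing the rest of the code.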
  • We need to consider these new approaches to keep pace with developments, cultural and technical. We see benefits in, and seek ongoing opportunities for, involving users in discovery, evaluation and description of content. We need to continue to explore smart tools to help build and manage collections. These semantic collection projects are part of a broader platform of digital content management efficiencies, enhancements and software tools to position edna and associated services for the future. The knowledge and tools developed can be shared with stakeholders (state and territory education jurisdictions and institutions) who have expressed interest in these approaches for their own collection purposes. In conclusion, we believe it is crucial to take this opportunity to harness user knowledge, experience and interests in building and managing our collections, and we are also keen to develop effective and innovative tools. We anticipate a movement toward a collection built to a far greater extent by users themselves, and managed and facilitated by smart tools exploiting interoperable metadata.

    Slide 1: Metadata Extraction Projects. Pru Mitchell & Sarah Hayman, Education Network Australia
    Slide 2:
      • delivering innovative, cost-effective services across all sectors of education
      • formed 1 March 2010
      • not-for-profit, ministerial company (MCEECDYA)
    Slide 5: Metadata is not scalable
      • We can no longer be comprehensive or meet the standards set by our collection policy, because now we have:
          • more content
          • less funding
          • fewer cataloguers
          • same old clunky metadata tools
    Slide 6: Solutions
      • reduce quantity of metadata
      • reduce quality of metadata
      • get someone else to create/pay for metadata
          • users
          • other organisations
      • improve metadata creation tools
      • ? program machines to create metadata
    Slide 7: edna proof of concepts
      • professional networking
      • edna sustainable collections (ESC)
      • faceted search: rights and user level
      • Flinders University-edna AI research
    Slide 8: professional networking site for educators; users bookmark and discuss resources and these are aggregated to own url; the system collects, manages and maps metadata
    Slide 11: person - resource - tag - community
    Slide 13: edna Sustainable Collections (ESC)
      • harvests bookmarks
      • from key educators in and external services
      • links to OpenCalais entity data
    Slide 14: How does it do this?
      • takes an RSS feed
      • extracts available metadata
      • checks for duplicates
      • maps it to edna metadata profile in DSpace metadata management system
    Slide 18: Outcomes
      • increasing efficiency for information managers
      • freeing of Information managers to focus on higher end work, eg subject, user level metadata
      • adding user suggestion to collection
      • widening the range of resources being captured and evaluated
    Slide 19: Faceted search
      • use metadata to help solve issue for stakeholder - cost of educational copying
      • harvest rights/licence metadata
      • make this meaningful to educators ‘what can I do with this resource?’
      • preference openly licensed content
    Slide 21: Rights schemes harvested
    Slide 22: AI proof of concept
      • Flinders University Artificial Intelligence and Knowledge Laboratory and 2008-09
      • partial automation of categorisation and annotation of web pages
    Slide 23: Elements of the project
      • text analysis
      • automatic classification edna category
      • suggestion of categories from controlled vocabulary
      • classification data capture tool
    Slide 25: Findings
      • 35% accuracy for mapping category from title alone, 60% accuracy using WordNet-based semantic relatedness
      • confirmation of the need for a human eye/expertise
      • classification information may be contained in images/style/tone not text
    Slide 26: Conclusions
      • consider new approaches and keep pace with developments, cultural and technical
      • find opportunities to involve users in discovery, evaluation and description of content
      • continue to explore smart tools to help build and manage collections
    Slide 27: Questions, feedback
      • Pru Mitchell
      • [email_address]
      • Sarah Hayman
      • [email_address]