This VALA 2016 conference paper presents the outcomes of a 2014-15 trial of automated subject indexing at the Australian Council for Educational Research. The integration of a machine learning classification tool has resulted in streamlined workflows and increased use of machine-readable data. Insights were gained into the decisions human indexers make in using a controlled vocabulary, and into the importance of quality abstracts and metadata.
3. Australian Education Index
• First print edition 1957
• Available on Informit as A+ Education, ProQuest, Taiwan
• Indexed by ACER staff and external contract indexers
• Indexing varies with staffing levels and budget: "an increasingly onerous task"
[Chart: records indexed per year, 2006 to 2015, scale 0 to 10,000]
4. Production steps
1. Identification of potential sources
2. Acquisition of identified sources
3. Selection of relevant material from these sources
4. Cataloguing or indexing of selected material
5. Quality assurance of indexed records
6. Dissemination of records to users
6. Indexing database
One vocabulary to bind them: the Australian Thesaurus of Education Descriptors is used across
• Cunningham catalogue
• AEI
• EdResearch Online
• Australian Education Research Theses
• IDP Database
• Learning Ground
[Diagram: material types feeding in - web docs, books, journal articles, conference papers]
8. Automated classification
Why
• More to index
• Less staff time available
• Increasing metadata feeds instead of print journals
• Increase efficiency
Our story
• 2009: First journal metadata
• 2011: Information Online presentation
• 2012: Increased metadata replacing print journals
• 2013: Feasibility study
• 2014: Initial installation in June, followed by continuous refinement of the system
9. What is the classifier?
Two processes
1. Training: uses past data to create models of how each subject term should be used
2. Classification: uses the models to assign subjects to new records based on article title, abstract and journal title
13. How the classifier has performed
• Provides a useful set of descriptors on the majority of records
• Average of 11.7 major descriptors assigned per record (max = 13)
• Average of 6.5 "correct" major descriptors per record
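Those two averages imply a rough per-record precision for the classifier's suggestions. This is a back-of-envelope calculation from the figures on the slide, not a metric reported in the paper itself:

```python
def classifier_precision(assigned: float, correct: float) -> float:
    """Per-record precision: fraction of assigned descriptors judged correct."""
    return correct / assigned

# Trial averages from the slide: 11.7 major descriptors assigned per record,
# of which 6.5 on average were judged "correct" by human indexers.
avg_precision = classifier_precision(11.7, 6.5)
print(f"{avg_precision:.0%}")  # roughly 56%
```

In other words, just over half of the suggested major descriptors survived human review, which fits the framing of the classifier as an assistant to the indexer rather than a replacement.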
14. Findings
A particular challenge: "Horse-Girl Assemblages: Towards a Post-Human Cartography of Girls' Desire in an Ex-Mining Valleys Community" [Discourse, 35(3)]
• Classifier performance greatly dependent on abstract length, style and level of detail
• ACER indexes a wide variety of material, some of which is not necessarily easy to index using ATED
• The specific topic of an article might only have a more general term in ATED
• Quality vs efficiency
16. Publisher feeds
• Taylor & Francis, 2009 onwards
• SAGE, 2013 onwards
• Wiley, 2013 onwards
• Springer, 2013 onwards
• Inderscience, 2013 onwards
• Emerald (in negotiation)
• Many publishers can provide a metadata feed of education journals
• All in XML, but all different from each other
• 24,138 articles received in feeds in 2015, up from 5,006 in 2010
17. Lessons
• Indexing from the abstract
• Thesaurus structure
• Metadata
• Process simplification
• Prioritisation
• Indexer experience
• Curation
• Skill set required in team
18. What next?
• Ongoing development of workflows
• Possible changes to our database structure
• More publisher feeds
• Other ways to get bibliographic metadata into the workflow, e.g. RSS feeds, search alerts from databases
• Develop selection processes further
• Documentation and
This paper presents the outcomes of a 2014-15 trial of automated subject indexing at the Australian Council for Educational Research. I will give some background to the project which involved the integration of a machine learning subject classification tool into our indexing process.
Tine will give you more detail on the trial and implementation stages and discuss the findings and insights we gained.
Rob, who adopted the classifier, trained it, fed it, kept it ticking and monitored its every motion, has provided the technical detail and analysis of findings in the paper itself. Thanks are also owed to Phil Anderson of Leidos for developing the machine learning algorithms.
Established in 1930, the Australian Council for Educational Research (ACER) is a not-for-profit organisation providing educational research services and products. The Cunningham Library at ACER (the Library) has a research level collection in Australian education. We have a mission to support the work of ACER staff working in educational research and assessment services, as well as the education community at large. Recently ACER has become a private higher education provider and we are now offering academic library services as well.
The library is still very much a physical presence, situated off the Atrium in the Melbourne office, and we are proud of our physical collection dating back to the 1930s. But luckily for our shelf space (and for the sake of our 8 interstate and overseas offices, and our predominantly remote graduate students) the new collection is predominantly digital.
As well as all this, we maintain an expert indexing team who provide a range of products and services.
Our flagship product is the Australian Education Index (AEI), a bibliographic database containing over 200,000 entries and abstracts. Increasingly it includes full-text material, with 58% of records since 2000 having a DOI, persistent URL or scanned PDF.
Selecting and indexing Australia’s education literature is labour intensive and thus an increasingly expensive activity. Curating the ever-growing range of documents and assigning thesaurus terms to metadata records are intellectually demanding processes, as well as being time consuming.
The goal of capturing and indexing comprehensively all research in Australian education is indeed an onerous task. "The sources are numerous, and the task is growing as years pass." That quote comes from the preface to the first edition of the Australian Education Index (Radford, 1958), so not much has changed. This reality of ever-increasing content to index and decreasing budget and staffing levels has been a long-term mantra for the indexing team, and really has to be addressed. For the sake of indexer sanity, subscriber satisfaction and budgets, something in the equation has to change.
In any process change it is helpful to break an issue into its component parts, and the process of producing the Australian Education Index involves the following six information tasks – familiar to all involved in collection building.
A rigorous selection process ensures comprehensive coverage of significant Australian education research. The challenge of curating and indexing the literature on Australian education required by administrators, teachers and students is one of even greater complexity and cost, as types and sources of literature increase and the topics related to education expand.
While these six steps of production for the AEI have been constant since 1958, there have been changes in the way they are performed over the intervening years. This has been in response to both the changing formats of the resources being indexed, and the format of the Index itself.
This is a screenshot of part of an AEI record - just so I can highlight the 4 subject descriptor fields that are of interest to the subject classifier project.
Major descriptors are the concepts discussed in the work, and all of them come from our thesaurus. They capture the 'aboutness' of the record.
Minor descriptors are also from the thesaurus, but they describe aspects such as the educational level of those involved in the research, and the research methodology.
Geographic descriptors refer to the location the content is 'about', rather than the place of publication, although the two may coincide.
Identifiers are terms NOT in the thesaurus but which the indexer would like to have used - that is, candidate terms.
At the heart of all our indexing services is a vocabulary. The Australian Thesaurus of Education Descriptors (ATED) is the source of the major and minor subject descriptors, and is basically the glue that holds ACER’s indexing services together.
A hierarchically-structured thesaurus of concepts across all levels of education from preschool to higher education, ATED is used to index and search the subject matter of the AEI and its subsidiary databases - as well as the Library’s catalogue. It is searchable free of charge online, and can be purchased in hard copy or as an electronic dataset to be embedded into an organisation’s own information services. ATED is updated on a six-monthly basis. As at February 2016, it contains over 10,000 terms, around half of which are preferred terms, and half are references.
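The structure described above (preferred terms with hierarchical relationships, plus references pointing to preferred terms) can be sketched in code. This is a minimal illustration of how a thesaurus lookup of that shape works; the terms and relationships below are invented for the example, not taken from ATED:

```python
# Toy thesaurus: preferred terms carry broader/narrower/related
# relationships; a reference carries only a USE pointer to the
# preferred term a searcher or indexer should switch to.
thesaurus = {
    "Automatic indexing": {"type": "preferred",
                           "broader": ["Indexing"],
                           "related": ["Computational linguistics"]},
    "Indexing": {"type": "preferred",
                 "narrower": ["Automatic indexing"]},
    "Machine indexing": {"type": "reference",
                         "use": "Automatic indexing"},
}

def resolve(term: str) -> str:
    """Follow a USE reference to its preferred term; preferred terms resolve to themselves."""
    entry = thesaurus.get(term)
    if entry is None:
        raise KeyError(f"{term!r} is not in the thesaurus")
    return entry["use"] if entry["type"] == "reference" else term

print(resolve("Machine indexing"))  # Automatic indexing
```

Roughly half of ATED's 10,000 terms are references of this kind, which is what makes a controlled vocabulary usable by searchers who do not know the preferred form.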
[Library catalogue: 60,000 records. Indexing Master Database: 214,000 records, containing all indexing records - books, reports, journal articles, theses, conference papers, book chapters etc. Relevant records for each separate database or index product are tagged and filtered from the Master Database into subject- or audience-specialist databases: IDP (Database of Research in International Education), Learning Ground (our Indigenous education database) and BOLDE, covering Blended, Online Learning and Distance Education.]
While the value of providing the Index is not disputed, as I said, increasing costs as well as a decrease in indexing output meant that support for the professional indexers was required. A typical strategy in cases like this is to investigate ways of automating the process. In our case, subject classification was considered the most complex aspect of indexing, so it was the obvious place to start. The quest for automated indexing is not new. A 1965 monograph by Stevens, entitled Automatic indexing: a state-of-the-art report, contains almost 200 pages of experiments [in 'automatic assignment indexing, automatic classification and categorisation, computer use of thesauri, statistical association techniques, and linguistic data processing' (p.1).]
'Automatic indexing' as a concept was actually added to ATED in 1984, with a related term 'computational linguistics'. Sadly, it has taken 30 years for us to actually put the concept into practice.
[described as:
A branch of linguistics concerned with the use of computers for the analysis and synthesis of language data - for example, in machine translation, word frequency counts, and speech recognition and synthesis (ATED, 2013, p. 25).]
In a nutshell, machine learning involves using an existing set of documents (a corpus) to train a computer about what constitutes an appropriate response (in this case, which thesaurus term to assign), so that when it encounters a new document it can suggest an appropriate response. In 2010 I was involved in a proof-of-concept project with Flinders University trialling automated metadata for web-based education services using the Schools Online Thesaurus (ScOT). At that time we concluded that
“automated classification based on artificial intelligence is useful as a means of supplementing and assisting human classification, but is not at this stage a replacement for human classification of educational resources” (Leibbrandt et al, 2010).
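The train-then-classify cycle described above can be illustrated with a generic multi-label text classifier. This sketch uses scikit-learn, not the TeraText-based custom programs ACER actually used, and all of the training records and descriptor names are invented for the example:

```python
# Generic illustration of the two processes: (1) train per-descriptor
# models from past human-indexed records, (2) rank descriptors for a
# new record by model probability.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# 1. Training corpus: past records (title + abstract text) with their
# human-assigned descriptors. Invented data for illustration only.
records = [
    "Literacy outcomes in primary school reading programs",
    "Numeracy assessment and mathematics teaching in secondary schools",
    "Reading intervention improves literacy in early childhood",
    "Secondary mathematics curriculum and numeracy standards",
]
labels = [["Literacy"], ["Numeracy"], ["Literacy"], ["Numeracy"]]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(records)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)
model = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# 2. Classification: rank candidate descriptors for a new, unseen record
new = vectorizer.transform(["Early literacy and reading development"])
probs = model.predict_proba(new)[0]
ranked = sorted(zip(mlb.classes_, probs), key=lambda p: -p[1])
print(ranked[0][0])  # the top-ranked suggested descriptor
```

In a production setting the ranked suggestions would then go to a human indexer for review, which is exactly the supplementing role the 2010 conclusion above envisaged.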
Our story is about two interweaving threads – Publisher journal feeds and automated classification
In 2009 Taylor & Francis changed their policy of providing us with “free for indexing” print journals. We began to receive xml files of journal article metadata instead of print journals.
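Because every publisher's XML schema differs, each feed needs its own mapping into a common record shape before indexing. The element names and sample article below are invented for illustration; the real feeds each used their own schema:

```python
# Sketch of normalising one publisher's article-metadata XML into a
# common record dict. A separate mapping like this would be written
# per publisher feed.
import xml.etree.ElementTree as ET

feed_xml = """<articles>
  <article>
    <title>Guided inquiry in the school library</title>
    <abstract>A study of collaborative inquiry units.</abstract>
    <journal>Access</journal>
    <doi>10.0000/example.1</doi>
  </article>
</articles>"""

def parse_feed(xml_text: str) -> list[dict]:
    """Extract title, abstract, journal and DOI from each article element."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": art.findtext("title", default=""),
            "abstract": art.findtext("abstract", default=""),
            "journal": art.findtext("journal", default=""),
            "doi": art.findtext("doi", default=""),
        }
        for art in root.iter("article")
    ]

print(parse_feed(feed_xml)[0]["title"])
```

The title, abstract and journal fields extracted this way are the same three inputs the classifier later uses to suggest subjects.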
In FEB 2011 Lance Deveson (our previous library manager) and myself attended the Information Online conference in Sydney and listened to a presentation about the Parliamentary Library’s automated indexing project.
This sparked our interest, because we were looking for ways to increase our indexing efficiency to help us cope with the increasing amounts of relevant material being published and reduced funding for indexing staff.
In order for ACER to even consider automated classification we would need a pool of metadata to be classified. The Journal metadata from publishers seemed to be a promising source.
We started to actively negotiate with other publishers for more journal feeds. We also spoke to the Parliamentary library about their experience.
We hoped we could use automated classification on our journal feed material to produce a set of terms that could be used by a human indexer as the basis of the final assigned terms.
In November 2012 we obtained a quotation from SAIC (now Leidos) to investigate the application of automated classification to journal data at ACER. Funding was approved in August 2013, and the feasibility study was undertaken from September to December 2013.
After significant dialogue back and forth, we received a feasibility report. We were satisfied that an “automated classifier” could work with our journal feed data to produce useful terms for our purposes.
In January 2014 we requested a quotation for implementing “automated classification” into our system. The expenditure was approved in March 2014, and in June 2014 the components were installed on Robert’s and my computers.
Since then we have been continuously refining our workflows.
Reference
Hutchinson, J., Missingham, R. (Information Access Branch, Parliamentary Library) & Anderson, P. (SAIC Pty Ltd). Revolutionising digital content ingest: building a newspaper clippings collection using practical automation to assist with selection and classification.
So what exactly is the Classifier? It is not hardware, and it is not an off-the-shelf software “product”. It requires the installation of TeraText software, but the Classifier itself is a set of custom programs built to suit our specific data and to work with our existing DBTextworks system. The system works with XML.
The feasibility stage of the project was about the training process -- creating and testing models by learning from past indexing records. It was found that the best fields to learn from were article title, abstract and journal title.
There was some testing using the fulltext of articles, but it was found that better results were obtained when just using the abstract.
The implementation stage of the project was about the workflow of using the training models to actually assign terms to new records, and get those terms into DBTextworks so indexers can see and use them.
The training program uses information from past records to create four models, one for each of the descriptor fields.
We need to keep retraining to include newly indexed material and improve results over time.
We need to ensure the latest update of our thesaurus is being used. ATED is updated periodically with new and/or modified terms.
Because Classifier “learns” from what subjects have been assigned to records in the past we did a significant amount of work on our subject data to improve what the Classifier is learning from.
Firstly, we cleaned up existing data to make sure only currently valid terms were used in each field:
Major Descriptors
Minor Descriptors
Identifiers
Geographical
Secondly we created a set of the “best” records to use for training. For example we excluded old records with no abstract, or one sentence abstracts.
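The two cleanup steps above can be sketched as follows. This is an invented illustration of the principle (the field names, record structure and abstract-length cut-off are assumptions, not our actual rules): drop descriptors that are no longer valid thesaurus terms, and exclude records without a usable abstract from the training set.

```python
# Hypothetical set of currently valid ATED terms.
valid_terms = {"Literacy", "Primary education", "Educational leadership"}

records = [
    {"abstract": "A long, detailed abstract about literacy programs in primary schools.",
     "major_descriptors": ["Literacy", "Reading (superseded term)"]},
    {"abstract": "", "major_descriptors": ["Educational leadership"]},
]

def clean_record(record, valid_terms):
    """Drop descriptors that are no longer valid thesaurus terms."""
    record = dict(record)
    record["major_descriptors"] = [
        t for t in record["major_descriptors"] if t in valid_terms
    ]
    return record

def is_training_quality(record, min_abstract_words=10):
    """Exclude records with no abstract, or only a very short one."""
    return len(record["abstract"].split()) >= min_abstract_words

training_set = [clean_record(r, valid_terms)
                for r in records if is_training_quality(r)]
print(len(training_set))  # only the record with a usable abstract survives
```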
Our indexers do their indexing in DBTextworks.
Records to be classified need to have their Article title, Journal title and Abstract exported in XML format.
That XML is then run through the classifier which uses the models to assign and add suggested terms.
A new XML file which includes the suggested terms is then imported back into DBTextworks.
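The export-classify-import round trip above can be illustrated with Python's standard XML library. The element names here are invented for illustration; they are not the actual DBTextworks or TeraText schema.

```python
import xml.etree.ElementTree as ET

# Build a hypothetical exported record with the three fields the
# classifier reads: article title, journal title and abstract.
record = ET.Element("Record")
ET.SubElement(record, "ArticleTitle").text = "Reading skills of primary students"
ET.SubElement(record, "JournalTitle").text = "Australian Journal of Education"
ET.SubElement(record, "Abstract").text = "A study of literacy in primary schools."

def add_suggested_terms(record, suggestions):
    """Mimic the classifier step: append weighted term suggestions
    to the record before it is re-imported."""
    for term, weight in suggestions:
        el = ET.SubElement(record, "SuggestedMajorDescriptor")
        el.text = term
        el.set("weight", str(weight))

add_suggested_terms(record, [("Literacy", 14200), ("Primary education", 9800)])
xml_out = ET.tostring(record, encoding="unicode")
print(xml_out)
```

The re-imported XML carries both the original bibliographic fields and the new weighted suggestions, which is what lets the indexer see them side by side in the data entry screen.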
Initially we had to manually export a batch of records, run the classifier and then re-import the updated records.
Rob, with his IT background, was able to come up with a much more streamlined way to do this. We can now process single records or batches of records with the click of a button.
We search for record(s) in DBTextworks, then click the “run classifier” button on the resulting form.
Rob has programmed this “run classifier” button to export XML data from the selected record(s), place it in a certain folder, run the classifier program, put results in another folder then import the new XML back into the record(s).
A human indexer then checks the suggested subjects and completes the record.
This is a screenshot of part of the data input screen used by indexers.
The classifier populates the right-hand fields with suggested terms arranged by weight: the bigger the number, the more confident the classifier is of that term.
Terms above a certain threshold weight are also copied into the left-hand fields.
In the Major Descriptors field, you can see that only terms with a weight above the threshold of 10,000 were assigned to the left-hand field.
The human indexer keeps “correct” terms, deletes “incorrect” terms, and adds “missed” terms to the left-hand fields, then marks the record complete and saves it.
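The threshold behaviour described above amounts to a simple split of the weighted suggestions. A minimal sketch, using the 10,000 threshold from the screenshot and invented term weights:

```python
THRESHOLD = 10_000  # weight above which a term is pre-filled into the left-hand field

# Hypothetical classifier output for one record: term -> confidence weight.
suggested = {"Literacy": 14200, "Primary education": 9800, "Fairy tales": 1200}

def prefill(suggested, threshold=THRESHOLD):
    """Terms above the threshold are copied into the working field;
    everything else stays as a right-hand suggestion for the human indexer."""
    assigned = {t for t, w in suggested.items() if w > threshold}
    review = {t for t in suggested if t not in assigned}
    return assigned, review

assigned, review = prefill(suggested)
print(assigned)  # only the high-confidence term is pre-filled
```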
Performance detail is very complex; see the full paper for detailed performance information.
Performance is different in each of the 4 subject fields. Major descriptors are currently performing best. We have some limited ways to continue improving the number of correct terms in all fields over time.
Overall we believe the classifier provides a useful set of descriptors for most records, and it does make indexing quicker and increase our efficiency.
Because the training models only consider terms that have been used at least a certain number of times, the classifier never assigns “new terms” itself.
New terms need to be added by human indexers a certain number of times before the classifier can learn and assign the term. This is the reason for some “missed” terms, and one of the reasons for poorer performance in the identifiers field, which is where new terms usually first appear.
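The minimum-usage behaviour described above can be sketched as a frequency cut-off over past term usage. The cut-off value and counts here are invented; the real threshold is internal to the classifier's training step.

```python
from collections import Counter

MIN_USES = 5  # hypothetical: times a term must be humanly assigned before it is learnable

# Invented past-usage counts for three terms.
term_usage = Counter({"Literacy": 120, "Educational leadership": 45, "NAPLAN": 3})

def learnable_terms(usage, min_uses=MIN_USES):
    """Only terms assigned at least min_uses times are visible to training,
    so a newly introduced term cannot be suggested until indexers have
    used it enough times."""
    return {term for term, n in usage.items() if n >= min_uses}

print(learnable_terms(term_usage))  # 'NAPLAN' is still too new to be learned
```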
“Incorrect” terms
Occasionally some terms assigned by the classifier are completely wrong, such as it once strangely assigning the term 'fairy tales' to an article from a psychometrics journal. More common though are terms that while not totally wrong, are deemed by the human indexer to be too general and therefore not needed. If the classifier assigns both 'Leadership' and 'Educational leadership' the indexer might choose only the more specific term. Also, there are some terms in ATED that are very similar but have different meanings. For example - the classifier struggles to decide between the separate ATED terms: 'Scholarship' and 'Scholarships'
We are finding the classifier a useful tool, but it can still be improved.
Classifier performance is greatly dependent on abstract length, style and level of detail; the classifier performs best when a quality abstract is available.
ACER indexes a wide variety of material, some of which is not easy to index using ATED terms. If an article is difficult for a human indexer to index, then the Classifier will usually also perform poorly.
For example, this article from Discourse has a title which doesn’t really give good clues to what the article is about.
The specific topic of an article might only have a more general term in ATED. The classifier is unlikely to perform well on an article whose main topic is not in ATED at all, such as a specific concept from physics or mathematics.
Quality vs efficiency - If we do lots of “classifier” indexing which has the most efficient workflow, we can certainly increase the amount of indexing throughput. However there will always be lots of indexing such as conference proceedings, book chapters and reports that don’t have metadata readily available in a useable form. These items are quite possibly more important to index as they aren’t available elsewhere. Creating an indexing record for these items is more labour intensive and time consuming.
We can run the classifier after the indexer populates all the bibliographic fields in the indexing record, but by the time they have cut and pasted or manually typed in all that data, experienced indexers report that running the classifier may not be significantly quicker than manually assigning terms.
What balance should we strike between the efficient “classifier” indexing and the less efficient, but possibly more important, other material?
We are using the classifier more and more. 41% of our latest batch of indexing used the classifier.
The biggest increase in usage comes as a result of the “run classifier” button developed by Robert and mentioned earlier.
There have been a string of tweaks and enhancements to processes, dating right back to the time of the feasibility study.
Most of these changes have been a result of collaborative discussions, which are time-consuming but get everyone’s agreement to changes. We have had some vigorous discussion about our indexing standards and methods, and even the scope of our index.
Some of the many tweaks not already mentioned include:
The Classifier has now been implemented in the catalogue as well as our indexing database
Improved data entry screens
Changes to data entry guidelines for some fields
Many changes to structure of indexing databases and usage of particular fields
Troubleshooting the classifier, the publisher feeds and DBTextworks: if one thing changes in one place, it can cause unexpected problems somewhere else.
All in-house indexers now have the programs available on their computer.
The workflows for dealing with publisher feeds and running the classifier are dependent on each other, so I wanted to show a slide with some information about the publisher feeds.
Image acknowledgement: https://plus.google.com/+SteveThomas/posts/SL77ii3qWa6
You may be wondering what there is for you to learn from this project which is so uniquely tailored to an obscure Australian education index.
Well let me suggest some lessons:
There is an obvious challenge for scholars, editors and publishers – think carefully about how dependent you are on your abstract in a machine-oriented landscape
There are particular challenges for vocabulary owners who want their thesaurus, taxonomy or subject headings to be read by machines. We have not gone anywhere near the linked data model for ATED, as our own systems have no way of accommodating this. We know now that the current classifier algorithms do not take into account ATED’s reference structure, so that is another challenge.
Metadata – machines like clean, consistent metadata. They are not as accommodating as humans. You will clean and clean and clean, so make sure you can export and import large chunks of your metadata. Thank goodness for DBTextworks.
As with any change management project, the human aspect is all important. Ours was the classic – fund the technology and a bit for the consultant, but no funds for time release/backfill for training, setting up, conducting the research required, or making a user-friendly interface. That was left to the already overloaded team.
We thought we were dealing with the subject classification step of the indexing process, but in fact the change that got the highest votes from the indexers was the reduction in manual keying and cutting and pasting of data - a not insignificant improvement.
We also thought classification was the most complex step, but we are beginning to regard curation as the real challenge. Given we can’t index everything, how do we prioritise across so many variables? Do we run the risk of doing the easier articles from feeds at the expense of the more important, but slower to index, material such as book chapters and web-based conference proceedings?
Finally: team members who are prepared to stick at long, tedious data-cleansing projects and to motivate those reluctant to change, and librarians with advanced technical skills, make an amazing difference to a project like this.
Watch the statistics: Can we arrest the sliding quantity without reducing the quality of our index?
When will we recoup the time invested in setting this up?