



Integration of an Automatic Indexing
System within the Document Flow of a
Grey Literature Repository

Jindřich Mynarz, Ctibor Škuta
National Technical Library

Grey Literature 12 Conference, 7 December 2010




 Indexing of Grey Literature
• self-publishing, self-indexing
• the Web made publishing easier, can it make
  indexing easier as well?
• make non-professional indexing better
  through technology
• increase grey literature visibility and support
  navigation interfaces




 Automatic Indexing
• conditional on full-text availability
• machine learning based on analysis of
  language corpora
• automatic term assignment
• automatic suggestions of indexing terms
  lessen the cognitive overhead involved in
  indexing
• human feedback to correct obvious
  mistakes
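
The suggest-then-correct workflow above can be sketched roughly as follows. This is a minimal illustration, not the system described in the slides: it ranks terms by plain frequency instead of a trained model, and the vocabulary, function names, and sample text are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary; a real system would load subject headings.
VOCABULARY = {"grey literature", "indexing", "machine learning", "repository"}

def suggest_terms(full_text, vocabulary, limit=3):
    """Rank vocabulary terms by how often they occur in the full text."""
    text = full_text.lower()
    counts = Counter()
    for term in vocabulary:
        hits = len(re.findall(r"\b" + re.escape(term) + r"\b", text))
        if hits:
            counts[term] = hits
    # Sort by descending frequency, then alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:limit]]

def apply_feedback(suggested, rejected=(), added=()):
    """Drop the terms the human indexer rejected and append manual additions."""
    rejected = set(rejected)
    kept = [term for term in suggested if term not in rejected]
    return kept + [term for term in added if term not in kept]

text = ("Indexing of grey literature is hard. Automatic indexing helps "
        "a repository expose grey literature to readers.")
suggested = suggest_terms(text, VOCABULARY)
final = apply_feedback(suggested, rejected=["repository"],
                       added=["machine learning"])
```

Here the indexer only reviews a short ranked list instead of searching the whole vocabulary, which is the reduction in cognitive overhead the slide refers to.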




 Implementation
• re-use of existing components
   o combination and extension
• open source, open formats

  subject headings system + digital repository
    + automatic indexer + text corpus + glue
                      code
                       =
           automatic indexing system




 Subject Heading System
• Polythematic Structured Subject Heading
  System (PSH)
   o universal Czech-English controlled
     vocabulary managed and used at the
     National Technical Library
   o expressed in RDF using the SKOS
     vocabulary
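
To show what a SKOS-encoded heading might look like to consuming code, here is a small sketch that reads concepts from an RDF/XML fragment with the Python standard library. The sample record, its example.org URIs, and the `read_concepts` helper are made up for illustration and are not real PSH data. The parser only handles this one flat RDF/XML shape; general RDF would need a proper RDF library.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample record; URIs and labels are illustrative, not real PSH data.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/psh/1">
    <skos:prefLabel xml:lang="cs">knihovnictví</skos:prefLabel>
    <skos:prefLabel xml:lang="en">librarianship</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/psh/0"/>
  </skos:Concept>
</rdf:RDF>"""

def read_concepts(rdf_xml):
    """Return {concept URI: {"labels": {lang: label}, "broader": [URIs]}}."""
    root = ET.fromstring(rdf_xml)
    concepts = {}
    for node in root.findall("skos:Concept", NS):
        uri = node.get("{%s}about" % NS["rdf"])
        labels = {
            label.get(XML_LANG): label.text
            for label in node.findall("skos:prefLabel", NS)
        }
        broader = [
            link.get("{%s}resource" % NS["rdf"])
            for link in node.findall("skos:broader", NS)
        ]
        concepts[uri] = {"labels": labels, "broader": broader}
    return concepts

concepts = read_concepts(SAMPLE)
```

The language-tagged `skos:prefLabel` values carry the bilingual Czech-English labels, and `skos:broader` links express the hierarchy of the headings.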




 Digital Repository
• CDS Invenio
  o open source, modular architecture
  o extensions to the interface for entering
    new documents and the search interface




 Automatic Indexer
• Maui Indexer
  o automatic term assignment with a
    controlled vocabulary
  o extensions for the Czech language (stemmer,
    stopwords)
  o indexing model for the Czech language,
    using PSH
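
The role of a stemmer and a stopword list when matching text against a controlled vocabulary can be illustrated with the sketch below. It does not use Maui (a Java tool) or its API. The stopword set, the suffix list, and every function name are toy assumptions, far cruder than a real Czech stemmer.

```python
import re
import unicodedata

# Toy resources for illustration only; not the stopwords or stemmer used with Maui.
STOPWORDS = {"a", "v", "na", "se", "je", "pro"}
SUFFIXES = ("ami", "ech", "ych", "a", "e", "i", "u", "y")

def strip_diacritics(text):
    """Decompose characters and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(phrase):
    """Lowercase, remove diacritics and stopwords, and stem each word."""
    words = re.findall(r"\w+", strip_diacritics(phrase.lower()))
    return tuple(stem(word) for word in words if word not in STOPWORDS)

def match_headings(text, headings):
    """Return the headings whose normalized form occurs in the normalized text."""
    tokens = normalize(text)
    found = []
    for heading in headings:
        key = normalize(heading)
        n = len(key)
        if n and any(tokens[i:i + n] == key for i in range(len(tokens) - n + 1)):
            found.append(heading)
    return found

headings = ["knihovny", "šedá literatura", "chemie"]
found = match_headings("Repozitář šedé literatury je v knihovně.", headings)
```

Without stemming, the inflected forms in the sentence would not match the headings as they appear in the vocabulary, which is why language-specific extensions are needed for Czech.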




 Text Corpus
• National Repository of Grey Literature
  o maintained by the National Technical
    Library
  o aggregates documents from partner
    institutions
  o in some cases, metadata are created by
    the users




  Glue Code
• code to tie all pieces together
• web services
   o loose coupling
   o re-use of existing code
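
One way to picture loosely coupled glue code is a small web service that wraps the indexer behind an HTTP interface, so the repository only needs to know a URL and a JSON shape. The sketch below uses the standard WSGI convention. The `suggest` stand-in, the `text` query parameter, and the response format are assumptions, not the interface of the system described here.

```python
import json
from urllib.parse import parse_qs

def suggest(text):
    """Hypothetical stand-in for the call to the automatic indexer."""
    known = ["grey literature", "indexing"]
    return [term for term in known if term in text.lower()]

def app(environ, start_response):
    """WSGI application: GET /?text=... returns suggested headings as JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    text = query.get("text", [""])[0]
    body = json.dumps({"suggestions": suggest(text)}).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

def call(application, query_string):
    """Invoke a WSGI app in-process, without opening a socket."""
    status = {}

    def start_response(code, headers):
        status["code"] = code

    environ = {"REQUEST_METHOD": "GET", "QUERY_STRING": query_string}
    chunks = application(environ, start_response)
    return status["code"], json.loads(b"".join(chunks))

code, payload = call(app, "text=Indexing+of+grey+literature")
```

To try it over HTTP, the same `app` can be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. Because the repository depends only on the HTTP contract, the indexer behind `suggest` can be replaced without touching the repository code.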




    User Interface Design
    Considerations
• opt-in indexing procedure
• suggest indexing headings
• autocomplete fragments of headings
• learn by example: show example
  documents indexed with the heading in
  question
• extending search interface
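
Autocompletion of heading fragments could be sketched as a prefix lookup over a sorted list of headings. This assumes that a fragment means the beginning of a heading. The `HeadingIndex` class and the sample headings are hypothetical.

```python
from bisect import bisect_left

class HeadingIndex:
    """Case-insensitive prefix lookup over a fixed list of headings."""

    def __init__(self, headings):
        # Sort once by lowercased form so lookups can use binary search.
        self._items = sorted((heading.lower(), heading) for heading in headings)
        self._keys = [key for key, _ in self._items]

    def complete(self, fragment, limit=5):
        """Return up to `limit` headings that start with the typed fragment."""
        prefix = fragment.lower()
        start = bisect_left(self._keys, prefix)
        results = []
        for key, heading in self._items[start:]:
            if not key.startswith(prefix) or len(results) == limit:
                break
            results.append(heading)
        return results

index = HeadingIndex(
    ["Informatics", "Indexing", "Information science", "Chemistry"]
)
matches = index.complete("inf")
```

Matching inside a heading rather than only at its start would need a different structure, such as an index over individual words.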




 Further Possibilities and Challenges
• indexing must be reflected in end-user
  interfaces
• continuous enhancements of the individual
  parts of the document processing pipeline
• user-generated indexing
• feeding back into the development of the
  subject headings system




Thank you for your
attention!
<mailto:jindrich.mynarz@techlib.cz>
<mailto:ctibor.skuta@techlib.cz>
<http://www.techlib.cz/en/>
