32 ADLUG Meeting - Vitoria-Gasteiz, October 2013
Stefano Bargioni - Pontificia Università della Santa
Croce
Catalog enrichment: importing Dewey Decimal
Classification from external sources
Slide 2 - The project

The project behind this presentation started from a
real need in the library of the Santa Croce Pontifical
University. After migrating to the Koha open source ILS
<http://koha-community.org> and adding authority
records for names, name-titles and uniform titles to
the catalog, we studied how to add subject headings.
Although we had been using uncontrolled subject
headings for many years, we realized that this job
is neither simple nor fast.
In the meantime, our catalog needed to be
accessible through other semantic search paths. So we
decided to extend the coverage of the Dewey Decimal
Classification (DDC) in the catalog. Not manually, of
course, but using an automatic procedure.
Slide 3 - Version 1: The batch mode

Batch mode is the process that scans the catalog and
automatically adds DDC numbers to some records.
This process can span hours or days, depending on
the catalog size and other factors, and is meant to
upgrade old bibliographic records. A similar process
can be applied in interactive mode, for new
bibliographic records. We will discuss it in the last part
of the presentation.
The aim of this process is to enrich the catalog by adding
DDC numbers to records, minimizing or even excluding
human intervention. The process detects records without
a DDC number (field 082) but with an ISBN (field 020).
The ISBN, a unique identifier, is used to discover
the corresponding record in other databases where
the DDC is in use. Large catalogs, such as those of
national libraries or national bibliographies, can be
good choices.
However, the DDC is not used everywhere at the same
level of quality or uniformity. This is why we defined
some criteria to check each DDC number before
adding it to the catalog.
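Because the whole lookup depends on the ISBN, it helps to
bring the content of field 020 into one canonical form
before querying. The following is a minimal sketch, not the
code used in the project: the function name
`normalize_isbn` is hypothetical, and the choice of
13 digits as the canonical form is an assumption.

```python
import re
from typing import Optional


def _valid_isbn10(isbn: str) -> bool:
    """Check the ISBN-10 check digit (weights 10..1, modulo 11)."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(isbn))
    return total % 11 == 0


def _check13(first12: str) -> str:
    """Compute the ISBN-13 check digit (weights 1,3,1,3,..., modulo 10)."""
    total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                for i, ch in enumerate(first12))
    return str((10 - total % 10) % 10)


def normalize_isbn(raw: str) -> Optional[str]:
    """Return a 13-digit ISBN, or None if `raw` holds no valid ISBN."""
    # Field 020 often carries hyphens or a qualifier such as "(pbk.)".
    match = re.match(r"\s*([0-9Xx\- ]+)", raw)
    if not match:
        return None
    digits = re.sub(r"[\- ]", "", match.group(1)).upper()
    if len(digits) == 10 and _valid_isbn10(digits):
        core = "978" + digits[:9]
        return core + _check13(core)
    if (len(digits) == 13 and digits.isdigit()
            and _check13(digits[:12]) == digits[12]):
        return digits
    return None
```

Values that fail the check-digit test return `None`, so
they can be skipped instead of being sent to a remote
catalog.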
Slide 4 - An atomic copy cataloguing

Adding a single field to a record is very similar to copy
cataloguing. We named it "atomic copy cataloguing". It
is based on a unique identifier (often the ISBN), and
requires that you are allowed to programmatically
modify single records in your Integrated Library
System (ILS). This can be done if Application
Programming Interfaces (APIs) are available. With
commercial ILSs there can be restrictions or reasons
to avoid this kind of operation.
Since this "atomic copy" is a subset of copy
cataloguing, we can say that its use is allowed thanks
to the same basic ideas and agreements that govern
full copy cataloguing in the international library
community.
Slide 5 - Records to be modified

This slide shows a way to detect the records involved in
the process and to extract their system number and ISBN.
We also chose to filter by the language of the work, to
avoid sending many useless queries to remote
servers that mostly contain records in the language of
their own country.
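The selection described above can be sketched as follows.
This is only an illustration: the in-memory record layout
(a dict with `system_number`, `language` and a `fields`
mapping from MARC tag to a list of values) and the function
name `records_to_enrich` are assumptions, not the query
shown on the slide.

```python
from typing import Any, Dict, Iterable, Iterator, Set, Tuple

# Hypothetical record layout, used only for this sketch.
Record = Dict[str, Any]


def records_to_enrich(records: Iterable[Record],
                      languages: Set[str]) -> Iterator[Tuple[Any, str]]:
    """Yield (system number, ISBN) for records worth querying."""
    for record in records:
        fields = record.get("fields", {})
        if fields.get("082"):
            continue  # already has a DDC number
        isbns = fields.get("020", [])
        if not isbns:
            continue  # no identifier to search with
        if record.get("language") not in languages:
            continue  # unlikely to be found in the chosen sources
        yield record["system_number"], isbns[0]
```

Writing it as a generator keeps memory use flat even when
the whole catalog is scanned.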
Slide 6 - Dewey Sources (I)

OCLC Classify and six national libraries or
bibliographies were selected as Dewey sources. They
must be accessible through a protocol
like REST, Z39.50 or HTTP, and return
responses as XML, binary MARC or HTML. The
best is XML, i.e. structured data; the worst is HTML,
because in this case the data carry no metadata,
are mixed with or merged into formatting code, and
are difficult to extract.
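To show why structured XML is the easiest case, here is a
minimal sketch that pulls field 082 out of a MARCXML
record using only the Python standard library. The sample
record is invented for the example, and the function name
`extract_ddc_fields` is hypothetical; real sources may
return other XML dialects.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Namespace of the MARCXML ("MARC 21 slim") schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">9780306406157</subfield>
  </datafield>
  <datafield tag="082" ind1="0" ind2="4">
    <subfield code="a">025.431</subfield>
    <subfield code="2">22</subfield>
  </datafield>
</record>"""


def extract_ddc_fields(marcxml: str) -> List[Dict[str, str]]:
    """Return indicators and subfields a/2 of every 082 field."""
    root = ET.fromstring(marcxml)
    results = []
    for field in root.iterfind(".//marc:datafield[@tag='082']", MARC_NS):
        entry = {"ind1": field.get("ind1", " "),
                 "ind2": field.get("ind2", " ")}
        for sub in field.iterfind("marc:subfield", MARC_NS):
            code = sub.get("code")
            if code in ("a", "2"):
                # Keep the first occurrence of each subfield.
                entry.setdefault(code, (sub.text or "").strip())
        results.append(entry)
    return results
```

With HTML there is no tag or subfield code to select on, so
the same extraction would depend on the page layout of each
source.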
Slide 7 - Dewey Sources (II): OCLC Classify

Although it is interesting, we can skip a detailed
description of the OCLC Classify service. It is important to
note what is highlighted in red:
• it has a machine interface, and its response is in
XML format;
• the ISBN is a search key to access it;
• it contains 36 million records having the DDC.
Slide 8 - Dewey Sources (III): National Libraries

The table shows the six important national libraries we
queried, the language we chose to limit queries to, and
the format of the responses.
The Dewey sources were queried in the same order as
in this table, with OCLC Classify as the first
one.
Note that there are no Spanish sources; we would be
very interested in finding one, if any exists.
Slide 9 - The logic used in the programs

As usual, we can skip the details of the
programs we wrote. Red refers to quality control,
as mentioned before, and to the policy we adopted to
avoid overloading. We are going to discuss both.
Slide 10 - Quality check

At the start of the project, our catalog contained DDC
numbers going back to the 19th edition. In the OCLC
Classify and Library of Congress catalogs there are many
DDC numbers that belong to older editions, or whose
edition is not specified, or whose indicators are
missing. We decided to discard DDC numbers inconsistent
with our quality standard. Even though many of them were
discarded, we added 50% more DDC numbers to our
catalog. Less strict criteria would add more DDC
numbers, but of lower quality.
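A quality check along these lines could look like the
sketch below. The exact rules used in the project are not
given in this text, so everything here is an assumption:
the function name `is_acceptable_ddc`, the dict layout of
a candidate field, the requirement that the first indicator
be non-blank, and the reading of "19th edition" as a
minimum threshold on subfield 2.

```python
import re
from typing import Mapping

MIN_EDITION = 19  # assumption: oldest DDC edition accepted


def is_acceptable_ddc(field: Mapping[str, str],
                      min_edition: int = MIN_EDITION) -> bool:
    """Decide whether a candidate 082 field may be copied."""
    # Segmentation slashes in subfield a are ignored for the check.
    number = field.get("a", "").strip().replace("/", "")
    if not re.fullmatch(r"\d{3}(\.\d+)?", number):
        return False  # not shaped like a DDC number
    if not field.get("ind1", " ").strip():
        return False  # first indicator missing or blank
    # Subfield 2 may look like "22" or "22/ger": read the leading digits.
    edition = re.match(r"\d+", field.get("2", "").strip())
    if edition is None:
        return False  # edition not specified
    return int(edition.group()) >= min_edition
```

Lowering `min_edition` or dropping the indicator test
would accept more numbers, which is the trade-off between
quantity and quality mentioned above.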
Slide 11 - Delay while searching sources

When running, the programs can query Dewey targets at
a very high rate. This can overwhelm the remote server,
especially if it is being accessed by other users.
Furthermore, if records in your catalog are continuously
modified, their indexing can overload your ILS.
To avoid that, and to make sure you respect the access
policies sometimes set by remote catalogs, we defined
a delay of 5-6 seconds between queries. As a
consequence, the harvesting process became slower
and in some cases lasted more than one day.
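The throttling policy can be sketched as a loop that
pauses after every query and stops at the first source
that returns a number. This is a simplified illustration
under stated assumptions: `harvest` and the `(name,
lookup)` pairs are hypothetical, each lookup stands for a
real remote query, and the quality check is left out.

```python
import random
import time
from typing import Callable, Dict, Iterable, Optional, Sequence, Tuple

# A source is a name plus a function returning a DDC number or None.
Source = Tuple[str, Callable[[str], Optional[str]]]


def harvest(isbns: Iterable[str],
            sources: Sequence[Source],
            min_delay: float = 5.0,
            max_delay: float = 6.0,
            sleep: Callable[[float], None] = time.sleep
            ) -> Dict[str, Tuple[str, str]]:
    """Query sources in order, pausing after every remote query."""
    found: Dict[str, Tuple[str, str]] = {}
    for isbn in isbns:
        for name, lookup in sources:
            ddc = lookup(isbn)
            # Pause 5-6 seconds so the remote server is not flooded.
            sleep(random.uniform(min_delay, max_delay))
            if ddc is not None:
                found[isbn] = (name, ddc)
                break  # later sources are not queried for this ISBN
    return found
```

The `sleep` parameter exists only so the loop can be
tried out without waiting. With a pause of 5-6 seconds,
15,000 queries already take about a day, which is
consistent with the running times reported above.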
Slide 12 - Statistics

The programs were written to log their operations. The log
files allowed us to build this table and some graphs.
Useful information about the harvested data and DDC use
in each catalog was analyzed.
Top results are from OCLC Classify, but we must
point out that records, once modified, were no longer
processed against the other Dewey sources.
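Per-source figures of this kind can be derived from the
logs with a few lines of code. The log format below (three
tab-separated columns: system number, source, outcome) is
invented for this sketch, as is the function name
`summarize`; the actual log layout is not described in
this text.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(log_lines: Iterable[str]) -> Dict[str, Counter]:
    """Count outcomes per Dewey source from tab-separated log lines."""
    stats: Dict[str, Counter] = {}
    for line in log_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not follow the assumed format
        _system_number, source, outcome = parts
        stats.setdefault(source, Counter())[outcome] += 1
    return stats
```

Because a record is no longer queried once a number has
been added, sources placed later in the order see fewer
records, and their counts should be read with that in
mind.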
Slide 13 - Browsing Dewey Index

The enriched DDC values were added to our catalog
browse search. This search path is now the most
used, more than the author index. While waiting for
controlled subject headings, of course.
Slide 14 - Software

The software involved in the project is listed here.
Only developers can appreciate this information, so we
can move to the next slide, but not before emphasizing
that software libraries, especially open source ones, act
like bricks in a building, making it possible to write
useful tools.
Slide 15 - A scientific article

The batch mode of this project is described in a
scientific article in the current issue of JLIS.it, a peer
reviewed academic journal whose editor in chief is
professor Mauro Guerrini.
It was written by me and my cataloguers (we are very
proud of this collaboration), and doesn't deal with the
next part of this presentation, since we developed it
after the publication.
Slide 16 - Version 2: The single record mode

The interactive mode was integrated into the ILS, and
helps cataloguers add the DDC number to new
records.
The basic ideas of the interactive mode are the same
as those of the batch mode. However, the seven Dewey
sources are accessed asynchronously or in parallel,
greatly reducing the delay.
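The text does not say how the parallel access is
implemented. One possible way, shown here purely as a
sketch, is a thread pool that queries every source at once
and collects whatever comes back; `query_all` and the
lookup functions are hypothetical stand-ins for the real
remote calls.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Optional


def query_all(isbn: str,
              sources: Dict[str, Callable[[str], Optional[str]]]
              ) -> Dict[str, Optional[str]]:
    """Query every source in parallel; map source name to DDC or None."""
    results: Dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {pool.submit(lookup, isbn): name
                   for name, lookup in sources.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception:
                # A failing source must not block the others.
                results[name] = None
    return results
```

The total waiting time is then close to that of the
slowest source rather than the sum of all of them, which is
what makes the approach usable while a cataloguer is
waiting.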
Slide 17 - Schema of the single record mode

When the ISBN is entered or, if it is already present,
when the "Go" button is pressed, the search starts and the
responses are used to compose the result table, from which
a cataloguer can choose a DDC number by clicking on it.
Subfields "a" and "2", as well as the indicators of field
082, are filled in. Also, a new occurrence of field 035 is
created to store the system number of the record from
which the DDC number was copied. This field logically
connects two records, one in your catalog and one in the
remote catalog, stating that they describe the same
resource, while tracing the contribution coming from
another catalog.
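The effect of choosing a row in the result table can be
sketched as a function that adds the two fields to a
record. The dict-based record layout and the function name
`add_copied_ddc` are assumptions made for the example.
The 035 value follows the usual MARC 21 convention of an
organization code in parentheses followed by the control
number.

```python
from typing import Any, Dict


def add_copied_ddc(record: Dict[str, Any], ddc: str, edition: str,
                   ind1: str, ind2: str,
                   source_code: str,
                   remote_system_number: str) -> Dict[str, Any]:
    """Add field 082 and a 035 link to the record it was copied from."""
    fields = record.setdefault("fields", {})
    # 082: the classification number with its edition and indicators.
    fields.setdefault("082", []).append({
        "ind1": ind1, "ind2": ind2,
        "subfields": [("a", ddc), ("2", edition)],
    })
    # 035: a new occurrence tracing the remote record.
    fields.setdefault("035", []).append({
        "ind1": " ", "ind2": " ",
        "subfields": [("a", f"({source_code}){remote_system_number}")],
    })
    return record
```

Appending to the list of 035 fields, rather than
overwriting it, keeps any system numbers the record
already carries.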
Slide 18 - Conclusions

More and more bibliographic data are available
worldwide through machine interfaces. Big and linked
data are going to heavily change cataloguing modules,
OPACs and so on. Catalog enrichment, through
unique identifiers, can help to improve bibliographic
and authority records and to expose them on the net. The
most important information to store in a catalog,
especially in a linked data environment, is standard
identifiers and coded information, which make it
possible to retrieve more related information.
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Catalog enrichment: importing Dewey Decimal Classification from external sources (text)

  • 1. 32 ADLUG Meeting - Vitoria-Gasteiz, October 2013 Stefano Bargioni - Pontificia Università della Santa Croce Catalog enrichment: importing Dewey Decimal Classification from external sources Slide 2 - The project The project behind this presentation started from a real need in the library of the Santa Croce Pontifical University. After migrating to the Koha open source ILS <http://koha-community.org> and adding authority records for names, name-titles and uniform titles to the catalog, we studied how to add subject headings. Although we had been using uncontrolled subject headings for many years, we realized that this job is neither simple nor fast. In the meantime, our catalog needed to be accessible through other semantic search paths. So we decided to improve the Dewey Decimal Classification (DDC) in our catalog. Not manually, of course, but using an automatic procedure. Slide 3 - Version 1: The batch mode Batch mode is the process that scans the catalog and automatically adds DDC numbers to some records. This process can take hours or days, depending on the catalog size and other factors, and is meant to upgrade old bibliographic records. A similar process can be applied in interactive mode, for new bibliographic records. We will discuss it in the last part of the presentation.
  • 2. The aim of this process is to enrich the catalog by adding DDC numbers to records, minimizing or even excluding human interaction. The process detects records without a DDC number (field 082) but with an ISBN (field 020). The ISBN, a unique identifier, is used to find the corresponding record in other databases where the DDC is in use. Big catalogs such as those of national libraries or national bibliographies can be good choices. However, the DDC is not used everywhere at the same level of quality or uniformity. This is why we defined some criteria to check each DDC number before adding it to the catalog. Slide 4 - An atomic copy cataloguing Adding a single field to a record is very similar to copy cataloguing. We named it "atomic copy cataloguing". It is based on a unique identifier (often the ISBN), and requires that you are allowed to programmatically modify single records in your Integrated Library System (ILS). This can be done if Application Programming Interfaces (APIs) are available. For commercial ILSs there can be restrictions or reasons to avoid this kind of operation. Since this "atomic copy" is a subset of copy cataloguing, we can say that its use is allowed thanks to the same basic ideas and agreements that govern full copy cataloguing in the international library community. Slide 5 - Records to be modified This slide shows a way to detect the records involved in the process and extract their system number and ISBN. We also chose to filter by the language of the work, to avoid sending many useless queries to some remote
  • 3. servers that mostly contain records in the language of their own country. Slide 6 - Dewey Sources (I) OCLC Classify and six national libraries or bibliographies were selected as Dewey sources. They must be accessible through a protocol such as REST, Z39.50 or HTTP, so that responses can be received as XML, binary MARC or HTML data. XML is the best, since it is structured data; HTML is the worst, because the data carry no metadata, are mixed with or merged into formatting code, and are difficult to extract. Slide 7 - Dewey Sources (II): OCLC Classify Interesting as it is, we can skip a detailed description of the OCLC Classify service. It is important to note what is highlighted in red: • it has a machine interface, and its response is in XML format; • the ISBN is a search key to access it; • it contains 36 million records having the DDC. Slide 8 - Dewey Sources (III): National Libraries The table shows the six important national libraries we queried, the language we chose to limit queries to, and the format of the responses. The Dewey sources were queried in the same order used in this table, with OCLC Classify as the first one. Note that there are no Spanish sources; we would be very interested in finding one, if any exists.
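The record selection described under Slide 5 (records that have an ISBN in field 020 but no DDC in field 082, restricted by language) can be sketched as follows. This is only an illustration under stated assumptions: the in-memory `CATALOG` structure, the `candidates` function and the sample records are hypothetical, and the talk does not say how the selection was actually implemented in Koha.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# Hypothetical in-memory records: MARC tag -> list of values, plus a language code.
CATALOG = [
    {"id": 101, "lang": "eng", "fields": {"020": ["9780140449136"]}},
    {"id": 102, "lang": "eng", "fields": {"020": ["9780199535569"], "082": ["883.01"]}},
    {"id": 103, "lang": "spa", "fields": {"020": ["9788437604947"]}},
    {"id": 104, "lang": "ita", "fields": {}},
]


def candidates(records: Iterable[Dict], languages: Set[str]) -> Iterator[Tuple[int, str]]:
    """Yield (system number, ISBN) for records with an ISBN (020) but no DDC (082)."""
    for rec in records:
        fields = rec["fields"]
        # Skip records that already have a DDC number or that lack an ISBN.
        if fields.get("082") or not fields.get("020"):
            continue
        # Skip languages that the remote sources are unlikely to cover.
        if rec["lang"] not in languages:
            continue
        yield rec["id"], fields["020"][0]


selected = list(candidates(CATALOG, {"eng", "ita"}))
print(selected)
```

Only the system number and the first ISBN are kept, since those are the two values the text says are extracted for each record.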
  • 4. Slide 9 - The logic used in the programs As before, we can skip the details of the programs we wrote. Red refers to quality control, as mentioned before, and to the policy we adopted to avoid overloading. We are going to discuss both. Slide 10 - Quality check At the start of the project, our catalog contained DDC numbers from editions as old as the 19th. In the OCLC Classify or Library of Congress catalogs there are many DDC numbers that belong to older editions, or whose edition is not specified, or whose indicators are missing. We decided to discard DDC numbers inconsistent with our quality standard. Even though many of them were discarded, we added 50% more DDC numbers to our catalog. Less strict criteria would add more DDC numbers, but of lower quality. Slide 11 - Delay while searching sources When running, programs can query Dewey targets at a very high rate. This can overwhelm the remote server, especially if it is also being accessed by other users. Furthermore, if records in your catalog are continuously modified, their indexing can overload your ILS. To avoid that, and to respect the access policies sometimes set by remote catalogs, we defined a delay of 5-6 seconds between queries. As a consequence, the harvesting process became slower and in some cases lasted more than one day. Slide 12 - Statistics The programs were written to log their operations. The log files allowed us to build this table and some charts. Useful information about the harvested data and the use of DDC in each catalog was analyzed.
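The quality check of Slide 10 and the pause of Slide 11 can be combined in a small harvesting loop. The sketch below is an assumption-laden illustration, not the programs used in the project: the candidate dictionaries, the `acceptable` and `harvest` functions and the `lookup` callable are hypothetical, and the exact acceptance criteria are only inferred from the text (edition present and not older than the 19th, indicators present).

```python
import re
import time

MIN_EDITION = 19     # the text names the 19th edition as the oldest in the local catalog
DELAY_SECONDS = 5.0  # the text mentions a 5-6 second pause between queries


def acceptable(candidate):
    """Return True if a candidate DDC meets the (assumed) quality criteria."""
    edition = candidate.get("edition") or ""        # subfield $2, e.g. "22"
    indicators = candidate.get("indicators") or ""  # e.g. "04"
    match = re.match(r"\d+", edition)
    if match is None:                  # edition not specified
        return False
    if int(match.group()) < MIN_EDITION:
        return False                   # DDC number from an older edition
    if not indicators.strip():         # indicators missing
        return False
    return bool(candidate.get("number"))


def harvest(isbns, lookup, delay=DELAY_SECONDS, sleep=time.sleep):
    """Query one source per ISBN, pausing between queries; keep the first acceptable DDC."""
    accepted = {}
    for position, isbn in enumerate(isbns):
        if position:                   # no pause before the very first query
            sleep(delay)
        for candidate in lookup(isbn):
            if acceptable(candidate):
                accepted[isbn] = candidate["number"]
                break
    return accepted
```

Passing `sleep` as a parameter is a design choice made here so that the pause can be replaced by a stub when trying the loop out; `lookup` stands for whatever function queries a remote source and returns a list of candidates.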
  • 5. The top results are from OCLC Classify, but we have to remark that records, once modified, were not processed again against the other Dewey sources. Slide 13 - Browsing Dewey Index The enriched DDC values were added to our catalog browse search. This search path is now the most used, more than the author index. While waiting for controlled subject headings, of course. Slide 14 - Software The software involved in the project is listed here. Only developers can appreciate this information, so we can move to the next slide, but not before emphasizing that software libraries, especially open source ones, act like bricks in a building, making it possible to write useful tools. Slide 15 - A scientific article The batch mode of this project is described in a scientific article published in the current issue of JLIS.it, a peer-reviewed academic journal whose editor in chief is Professor Mauro Guerrini. It was written by me and my cataloguers (we are very proud of this collaboration), and does not deal with the next part of this presentation, since we developed it after the publication. Slide 16 - Version 2: The single record mode The interactive mode was integrated into the ILS, and helps cataloguers add the DDC number to new records. The basic ideas of the interactive mode are the same as those of the batch mode. However, the seven Dewey sources are accessed asynchronously or in parallel, greatly reducing the delay.
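Querying the sources in parallel, as described for the single record mode, might look like the following sketch. It assumes each source is wrapped in a hypothetical Python callable that takes an ISBN and returns a list of DDC candidates; the `query_all` name and the use of a thread pool are illustrative choices, since the talk does not describe how the interactive mode was actually coded.

```python
from concurrent.futures import ThreadPoolExecutor


def query_all(isbn, sources):
    """Query every source concurrently; return {source name: list of DDC candidates}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # Submit all lookups first so that they run side by side.
        futures = {name: pool.submit(fn, isbn) for name, fn in sources.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception:
                results[name] = []  # a failing source contributes no candidates
    return results
```

With this arrangement the total waiting time is roughly that of the slowest source rather than the sum of all of them, which is the reduction in delay the text refers to.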
  • 6. Slide 17 - Schema of the single record mode When the ISBN is added or, if it is already present, when the "Go" button is pressed, the search starts and the responses are used to compose the result table, from which a cataloguer can choose a DDC number by clicking on it. Subfields "a" and "2", as well as the indicators of field 082, are filled in. Also, a new occurrence of field 035 is created to store the system number of the record from which the DDC number was copied. This field logically connects two records, one in your catalog and one in the remote catalog, stating that they describe the same resource, while tracing the contribution coming from another catalog. Slide 18 - Conclusions More and more bibliographic data are available worldwide through machine interfaces. Big data and linked data are going to heavily change cataloguing modules, OPACs and so on. Catalog enrichment, through unique identifiers, can help improve bibliographic and authority records in order to expose them on the net. The most important information to store in a catalog, especially in a linked data environment, is standard identifiers and coded information, which make it possible to retrieve more related information.
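The effect of choosing a DDC number, as described under Slide 17, can be pictured with a small sketch that adds an 082 field (indicators plus subfields "a" and "2") and a tracing 035 field to a record. The list-of-dictionaries record layout, the `apply_choice` function and the sample values are hypothetical; the parenthesized source code in front of the system number follows the usual MARC 21 convention for field 035 and is an assumption about how the project stored it.

```python
def apply_choice(record, choice):
    """Append an 082 field and a tracing 035 field to a record (a list of field dicts)."""
    record.append({
        "tag": "082",
        "ind1": choice["ind1"],
        "ind2": choice["ind2"],
        "subfields": [("a", choice["number"]), ("2", choice["edition"])],
    })
    record.append({
        "tag": "035",
        "ind1": " ",
        "ind2": " ",
        # System number of the remote record the DDC number was copied from.
        "subfields": [("a", "({}){}".format(choice["source"], choice["system_number"]))],
    })
    return record


# Hypothetical values for illustration only.
chosen = {"number": "230.2", "edition": "22", "ind1": "0", "ind2": "4",
          "source": "EXAMPLE", "system_number": "12345"}
updated = apply_choice([], chosen)
```

Keeping the 035 field alongside the 082 field records where the classification came from, which is the tracing role the text assigns to it.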