Your SlideShare is downloading. ×
0
Automatic creation of mappings between
classification systems for bibliographic data
Prof. Magnus Pfeffer
Stuttgart Media ...
Agenda


Motivation



Instance-based matching



Application to bibliographic data



Evaluation



Ongoing projects...
Motivation

November 26th, 2013

Semantic Web in Libraries, Hamburg

3
Current situation in Germany


Five regional library unions


Subject headings




Classification systems







...
Current situation in Germany


National library


Subject headings




Classification systems






Predominantly ...
Austrian National Library


Subject headings




Predominantly RSWK („Regeln für den
Schlagwortkatalog“ - „Rules for th...
Goals


Re-use existing indexing information


National level






BK is used mainly in northern Germany / Austria
...
Ideas


Use of appropriate classification systems


Facetted search in resource discovery systems





Browsing of si...
Ideas


Enable the use of existing tools and visualisations

Denton (2012)
November 26th, 2013

Legrady (2005)
Semantic W...
Instance-based Matching

November 26th, 2013

Semantic Web in Libraries, Hamburg

10
Ontology matching


Well-studied problem in computer science



Several approaches


Based on the descriptors



Based...
Instances


Entries in catalogues with multiple classifications

November 26th, 2013

Semantic Web in Libraries, Hamburg
...
Instance-based matching


Assumptions





Classes with semantic overlap co-occur in instances
The more often these cl...
Example

November 26th, 2013

Semantic Web in Libraries, Hamburg

14
Example


Entry 1



Pairs





179.9 / CC 7200



RVK: CC 7200



179.9 / CC 7250





DDC: 179.9
RVK: CC 7250

...
Normalisation


Comparing solely absolute numbers is bad





Some classes are more often used than others
Number of p...
Further interpretation


a and b are two classes from two classification
systems A and B


The classes a and b only occu...
Prior work


Pfeffer (2009)


Analysis of classification system structure and actual
use







Locating classes th...
Related work


Isaac et.al. (2007)


Applied instance based matching to bibliographic data



Data from the National Li...
Application to bibliographic data

November 26th, 2013

Semantic Web in Libraries, Hamburg

20
Bibliographic data is different


Multiple editions



Multiple document types

November 26th, 2013

Semantic Web in Lib...
Bibliographic data


Skewed data





Multiple editions → More pairs
Some co-occurrences could appear stronger than ot...
Prior work


Pfeffer (2013)


Matching bibliographic records


Based on author, title and uniform title




(as well ...
Evaluation

November 26th, 2013

Semantic Web in Libraries, Hamburg

24
Comparison with existing mappings


Existing (partial) mappings can be used as a basis for
evaluation
→ „Gold standard“

...
Ongoing projects

November 26th, 2013

Semantic Web in Libraries, Hamburg

26
Data


Bibliographic data



German National Library catalogue



Austrian National Library catalogue





German li...
Interesting Mappings


RVK → BK



BK well suited for faceted retrieval





Gold standard exists
RVK has largest pro...
Implementation: Tasks


Import and mapping of MAB2 and MARC data



Clustering



Matching and clustering





Gener...
Implementation: State


All steps implemented as a prototype





Perl scripts
File-based data and indexes

Current de...
RDF representation

November 26th, 2013

Semantic Web in Libraries, Hamburg

31
Classification systems as Linked Data


DDC has been published as Linked Data



RVK has not been published as Linked Da...
Direct Mappings


SKOS offers


skos:mappingRelation



skos:closeMatch



skos:exactMatch



skos:broadMatch



sko...
Further Mappings


1:n-Relationships





List of classes that are all narrow matches
Or: A combination of classes is ...
Qualification through indirection


RDF relations cannot be qualified
http://dewey.info/class/641/

November 26th, 2013

...
Qualification through indirection


So an intermediate node is used
http://dewey.info/class/641/

skos:exactMatch

ex:qua...
Summary


Mappings between classification systems are an
important means for interoperability and sharing of
classificati...
Thank you for listening.

Slides available online
http://www.slideshare.net/MagnusPfeffer/
This work is licensed under a C...
References


Denton, W. (2012). On dentographs, a new method of visualizing library collections.
In: Code4Lib.



Isaac,...
Upcoming SlideShare
Loading in...5
×

Automatic creation of mappings between classification systems for bibliographic data

320

Published on

Classification systems are an important means to provide topic-based access to large collections. Most library collections, however, are often only partially classified and use local or regional classification systems. Traditionally, manually created mappings between classification systems are used to improve this situation. I propose a different approach to automatically create such mappings: To achieve a large base for the mapping algorithm, bibliographic data from diverse sources that contain items classified by the classification systems is aggregated in a single database. Next, a clustering algorithm is used in order to group individual issues and editions of the same work. The basic idea is that for classification purposes, there is no significant difference across editions. Indexing information can thus be consolidated within the clusters, resulting in a higher proportion of dual-indexed entries. The novel step is that instead of individual catalogue entries, the "work-level" clusters are used for an instance-based matching: Statistical analysis creates a co-occurrence table of pairs of classes and high co-occurance of a given pair indicating a match between the two classification systems. This information is aggregated into a complete mapping The approach is implemented on an open-source infrastructure which was mainly developed by the German National Library: CultureGraph.org. In ongoing projects, mappings between several classification systems are being produced. The talk will discuss the approach, the implementation issues and the preliminary results as well as the challenges of publishing the created mappings as linked data.

Published in: Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
320
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Transcript of "Automatic creation of mappings between classification systems for bibliographic data"

  1. 1. Automatic creation of mappings between classification systems for bibliographic data Prof. Magnus Pfeffer Stuttgart Media University pfeffer@hdm-stuttgart.de
  2. 2. Agenda  Motivation  Instance-based matching  Application to bibliographic data  Evaluation  Ongoing projects  RDF Representation November 26th, 2013 Semantic Web in Libraries, Hamburg 2
  3. 3. Motivation November 26th, 2013 Semantic Web in Libraries, Hamburg 3
  4. 4. Current situation in Germany  Five regional library unions  Subject headings   Classification systems      Predominantly RSWK („Regeln für den Schlagwortkatalog“ „Rules for the subject catalogue“) using a shared authority file RVK (Regensburg Union Classification) BK (Basic Classification) DDC (Dewey Decimal Classification) Various local classification systems Low proportion of indexed titles (25-30%) November 26th, 2013 Semantic Web in Libraries, Hamburg 4
  5. 5. Current situation in Germany  National library  Subject headings   Classification systems     Predominantly RSWK („Regeln für den Schlagwortkatalog“ „Rules for the subject catalogue“) using a shared authority file DDC (Dewey Decimal Classification) Coarse categories DDC only for titles published since 2007 Only „Reihe A“ (print trade publications) is fully indexed with RSWK November 26th, 2013 Semantic Web in Libraries, Hamburg 5
  6. 6. Austrian National Library  Subject headings   Predominantly RSWK („Regeln für den Schlagwortkatalog“ - „Rules for the subject catalogue“) using a shared authority file Classification systems  BK since 2007  RVK in the Austrian librariy union catalogue November 26th, 2013 Semantic Web in Libraries, Hamburg 6
  7. 7. Goals  Re-use existing indexing information  National level     BK is used mainly in northern Germany / Austria RVK mainly in southern Germany DDC mainly by the National Library International level   Make RVK data more accessible to DDC users Use DDC indexing information available from e.g. the Library of Congress November 26th, 2013 Semantic Web in Libraries, Hamburg 7
  8. 8. Ideas  Use of appropriate classification systems  Facetted search in resource discovery systems    Browsing of similar titles   Should be monohierarchical Should have limited number of classes → DDC (first digits) or BK Should be fine-grained → DDC (full) or RVK (Multi-lingual retrieval) November 26th, 2013 Semantic Web in Libraries, Hamburg 8
  9. 9. Ideas  Enable the use of existing tools and visualisations Denton (2012) November 26th, 2013 Legrady (2005) Semantic Web in Libraries, Hamburg 9
  10. 10. Instance-based Matching November 26th, 2013 Semantic Web in Libraries, Hamburg 10
  11. 11. Ontology matching  Well-studied problem in computer science  Several approaches  Based on the descriptors  Based on the structure  Based on the manifestations (instances) November 26th, 2013 Semantic Web in Libraries, Hamburg 11
  12. 12. Instances  Entries in catalogues with multiple classifications November 26th, 2013 Semantic Web in Libraries, Hamburg 12
  13. 13. Instance-based matching  Assumptions    Classes with semantic overlap co-occur in instances The more often these classes co-occur, the stronger the overlap Preparation  Extraction of all pairs of classifications from the data  Count of the extracted pairs November 26th, 2013 Semantic Web in Libraries, Hamburg 13
  14. 14. Example November 26th, 2013 Semantic Web in Libraries, Hamburg 14
  15. 15. Example  Entry 1  Pairs   179.9 / CC 7200  RVK: CC 7200  179.9 / CC 7250   DDC: 179.9 RVK: CC 7250  179.9 / CC 7200 Entry 2  DDC: 179.9  RVK: CC 7200 November 26th, 2013 Semantic Web in Libraries, Hamburg 15
  16. 16. Normalisation  Comparing solely absolute numbers is bad    Some classes are more often used than others Number of pairs correlates with the number of entries that are classified using a given class Instead: Use proportion of co-occurrence ↔ occurrence ∣E c1∩ E c2∣ ∣E c1∪ E c2∣ number of entries with both classifications divided by number of entries with either classification (Jaccard measure for overlap of sets) November 26th, 2013 Semantic Web in Libraries, Hamburg 16
  17. 17. Further interpretation  a and b are two classes from two classification systems A and B  The classes a and b only occur together → exact match  a only co-occurs with b, but b co-occurs with other classes from A → a is narrower concept than b  a co-occurs with several classes from B (including b) → a is wider concept than b  a and b do not co-occur → cannot infer that a and b are unrelated November 26th, 2013 Semantic Web in Libraries, Hamburg 17
  18. 18. Prior work  Pfeffer (2009)  Analysis of classification system structure and actual use      Locating classes that describe the same concept Finding ways to improve existing mappings to RVK Focus on RVK, using data from library union catalogues Co-occurrence analysis Results    High co-occurrence and close in the hierarchy: → classes are hard to assign properly High co-occurrence and far in the hierarchy: → classes describe identical concepts Mappings from RSWK to RVK could be augmented November 26th, 2013 Semantic Web in Libraries, Hamburg 18
  19. 19. Related work  Isaac et.al. (2007)  Applied instance based matching to bibliographic data  Data from the National Library of the Netherlands  Mapping from a thesaurus to a classification system  Results   Generated mappings are quite good More sophisticated measures than Jaccard do not lead to better mappings November 26th, 2013 Semantic Web in Libraries, Hamburg 19
  20. 20. Application to bibliographic data November 26th, 2013 Semantic Web in Libraries, Hamburg 20
  21. 21. Bibliographic data is different  Multiple editions  Multiple document types November 26th, 2013 Semantic Web in Libraries, Hamburg 21
  22. 22. Bibliographic data  Skewed data    Multiple editions → More pairs Some co-occurrences could appear stronger than others Solution: Pre-clustering individual titles on the „work“ level  Increases chance for instances with more than one classifications  Each cluster contributes only once  Allows using absolute co-occurrence numbers   Cut-off for small numbers Ranking of competing matches November 26th, 2013 Semantic Web in Libraries, Hamburg 22
  23. 23. Prior work  Pfeffer (2013)  Matching bibliographic records  Based on author, title and uniform title   (as well as information on title changes) Matches any edition and revision of a work  Including translations  Merge match sets → Discrete clusters  Consolidating indexing information   For indexing purposes, the differences between editions and revisions are irrelevant Subject headings and classifications are shared between all members of a cluster November 26th, 2013 Semantic Web in Libraries, Hamburg 23
  24. 24. Evaluation November 26th, 2013 Semantic Web in Libraries, Hamburg 24
  25. 25. Comparison with existing mappings  Existing (partial) mappings can be used as a basis for evaluation → „Gold standard“  Comparison of automatic and manual mapping    Recall: Are all the mappings found? Precision: Are all found mappings correct? Analysis of additional links  Maybe the gold standard can be improved? November 26th, 2013 Semantic Web in Libraries, Hamburg 25
  26. 26. Ongoing projects November 26th, 2013 Semantic Web in Libraries, Hamburg 26
  27. 27. Data  Bibliographic data   German National Library catalogue  Austrian National Library catalogue   German library union catalogues British national bibliography Gold standards  Partial mappings BK ↔ RVK November 26th, 2013 Semantic Web in Libraries, Hamburg 27
  28. 28. Interesting Mappings  RVK → BK   BK well suited for faceted retrieval   Gold standard exists RVK has largest proportion of classified titles RVK ↔ DDC   Enable data sharing between the German National Library and the RVK-using libraries Not limited to classification systems  See Pfeffer (2009) and Wang et.al. (2009) November 26th, 2013 Semantic Web in Libraries, Hamburg 28
  29. 29. Implementation: Tasks  Import and mapping of MAB2 and MARC data  Clustering   Matching and clustering   Generation of keys for the match process Consolidation of indexing and classification information Statistics    Co-occurrence counts Jaccard measure Output  Full mappings November 26th, 2013 Semantic Web in Libraries, Hamburg 29
  30. 30. Implementation: State  All steps implemented as a prototype    Perl scripts File-based data and indexes Current development  Still Perl scripts (but better documented)  All data is accumulated in a document store   MongoDB Further plan: Porting to MetaFacture framework November 26th, 2013 Semantic Web in Libraries, Hamburg 30
  31. 31. RDF representation November 26th, 2013 Semantic Web in Libraries, Hamburg 31
  32. 32. Classification systems as Linked Data  DDC has been published as Linked Data  RVK has not been published as Linked Data    There is no versioning and no stable identifiers A project to fix this and to publish RVK as Linked Data has been cancelled by the university library of Regensburg BK has not been published as Linked Data  There is authority data in the GVK union catalogue → One would have to create temporary URIs for the RVK and BK classes November 26th, 2013 Semantic Web in Libraries, Hamburg 32
  33. 33. Direct Mappings  SKOS offers  skos:mappingRelation  skos:closeMatch  skos:exactMatch  skos:broadMatch  skos:narrowMatch  skos:relatedMatch November 26th, 2013 Semantic Web in Libraries, Hamburg 33
  34. 34. Further Mappings  1:n-Relationships    List of classes that are all narrow matches Or: A combination of classes is a (near) exact match Qualified mappings  Express the confidence of the proposed match  Allow applications to optimize for precision or recall November 26th, 2013 Semantic Web in Libraries, Hamburg 34
  35. 35. Qualification through indirection  RDF relations cannot be qualified http://dewey.info/class/641/ November 26th, 2013 skos:exactMatch http://ex.org/rvk/BF_8150 Semantic Web in Libraries, Hamburg 35
  36. 36. Qualification through indirection  So an intermediate node is used http://dewey.info/class/641/ skos:exactMatch ex:qualifiedMatch http://ex.org/rvk/BF_8150 ex:targetConcept _1 ex:confidence Further information “1.0” November 26th, 2013 Semantic Web in Libraries, Hamburg 36
  37. 37. Summary  Mappings between classification systems are an important means for interoperability and sharing of classification information and tools  Simple statistics on the existing data in catalogues can generate candidates for matches between individual classes  Where manually created mappings exist, they can be used to evaluate the algorithmic results  Mappings can be expressed in SKOS terms   To qualify the mappings further, intermediate nodes need to be introduced There is no standard for this yet November 26th, 2013 Semantic Web in Libraries, Hamburg 37
  38. 38. Thank you for listening. Slides available online http://www.slideshare.net/MagnusPfeffer/ This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. November 26th, 2013 Semantic Web in Libraries, Hamburg 38
  39. 39. References  Denton, W. (2012). On dentographs, a new method of visualizing library collections. In: Code4Lib.  Isaac, A., Van Der Meij, L., Schlobach, S. and Wang, S. (2007). An empirical study of instance-based ontology matching. In: The Semantic Web (pp. 253-266). Springer Berlin Heidelberg.  Legrady, G. (2005). Making visible the invisible. Seattle Library Data Flow Visualization. In: Digital Culture and Heritage. Proceedings of ICHIM05 Sept, 21-23.  Pfeffer, M. (2009). Äquivalenzklassen – Alle Doppelstellen der RVK finden. Presentation given at the Librarian Workshop of the 33rd Annual Conference of the German Classification Society on Data Analysis, Machine Learning, and Applications (GfKl).  Pfeffer, M. (2013). Using clustering across union catalogues to enrich entries with indexing information. In: Data Analysis, Machine Learning and Knowledge Discovery - Proceedings of the 36th Annual Conference of the Gesellschaft für Klassifikation e. V.. Springer Berlin Heidelberg.  Wang, S., Isaac, A., Schopman, B., Schlobach, S. and Van Der Meij, L. (2009). Matching multi-lingual subject vocabularies. In: Research and Advanced Technology for Digital Libraries (pp. 125-137). Springer Berlin Heidelberg. November 26th, 2013 Semantic Web in Libraries, Hamburg 39
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×