32nd ADLUG ANNUAL MEETING 2013
ARTIUM and the Fundación Sancho El Sabio – Vitoria-Gasteiz 
16th – 18th October 2013 

ADLUG 2013 - A proposal for an RDF assembly line

A proposal for an RDF assembly line that converts bibliographic data to RDF using Java, SOLR and Apache Camel.

Published in: Technology

  1. 32nd ADLUG ANNUAL MEETING 2013, ARTIUM and the Fundación Sancho El Sabio, Vitoria-Gasteiz, 16th – 18th October 2013. Frederick W. Taylor. Andrea Gazzarini, Software Architect. A proposal for an RDF assembly line. Copyright 2009-2010 @CULT. All rights reserved
  2. Agenda: Goals; Intermediate storage layer; Final storage layer; Chaining all together; Q&A
  4. Goals: functional
     000 00694nam a2200241 i 4500
     008 971205s1997 it j 000 0 ita c
     020 a 880921191X
     082 1 a 853.8
     100 1 a Collodi, Carlo.
     245 13 a Le avventure di Pinocchio / c C. Collodi ; illustrazioni di A. Mussino
     260 a Firenze : b Giunti, c 1997.
     440 0 a Collana favolosa / [Giunti]
     521 a Letteratura per ragazzi
     700 1 a Mussino, Attilio.
  5. Do you remember?
  6. Goals: non functional (slide graphic: many overlapping copies of the sample MARC record from slide 4)
  8. Intermediate storage layer
     We assume that the conversion process takes records in MARC format as input. This is very important because it gives us:
     - complete decoupling from whatever LMS is in use;
     - a set of predefined, standard rules;
     - a fine level of granularity on the input: the same process can be run for one record or for a million records without any difference.
     (The sample MARC record from slide 4 is shown again.)
  9. Intermediate storage layer
     Using a MARC binary stream as the main input requires an intermediate storage that collects the MARC data and is used heavily during the conversion phase. Consider, for example, the relationships between records (i.e. entities) expressed in the 77X tags: while processing record X we may need to examine record Y in order to take a decision.
     With a file as the only data container this would be very hard, because a file can be traversed only sequentially and offers no query language; we would have to hard-code a lookup process.
     Using a database as the main input source (instead of a MARC file or stream) would solve the query language issue, but each LMS has its own database with its own schema.
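
The difference between sequential traversal and a queryable storage can be sketched as follows. This is a minimal illustration in Python, not the OseeGenius implementation: the record layout (a list of `(tag, value)` pairs) and the helper names `find_by_scan` and `build_index` are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Assumption for this sketch: a record is a list of (tag, value) pairs.
# Real MARC records are binary structures with indicators and subfields.
Record = List[Tuple[str, str]]


def control_number(record: Record) -> Optional[str]:
    """Return the value of the 001 field, if present."""
    for tag, value in record:
        if tag == "001":
            return value
    return None


def find_by_scan(records: Iterable[Record], wanted: str) -> Optional[Record]:
    """File-like access: walk the records in order until one matches."""
    for record in records:
        if control_number(record) == wanted:
            return record
    return None


def build_index(records: Iterable[Record]) -> Dict[str, Record]:
    """Storage-like access: index every record once by its 001 value."""
    index: Dict[str, Record] = {}
    for record in records:
        key = control_number(record)
        if key is not None:
            index[key] = record
    return index


records = [
    [("001", "27283"), ("245", "Le avventure di Pinocchio")],
    [("001", "92827"), ("245", "The Hug")],
]
index = build_index(records)
linked = index.get("92827")  # direct lookup while processing another record
```

Both functions return the same record; the scan has to be repeated for every lookup, while the index is built once and then queried as often as the conversion needs.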
  10. OseeGenius with M(arc)QL
     (Slide diagram with two labels, INDEX and SEARCH: copies of the sample MARC record are indexed, and the search side shows the MQL query below.)
     (245a=History AND 1XX=Potter) OR (260b=Addison Wesley)
  11. MQL examples
     (245a=History AND 1XX=Potter) OR (260b=Addison Wesley)
     245a=* AND 245c=Potter AND 260b=G?unt?
     001=(27892 27290 29900 246923)
     001=(27892 OR 27290 OR 29900 OR 246923)
     005=[1995 TO 199502231]
     100a=Morgan AND 100e=(interviewer OR collector)
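
To make the query examples more concrete, here is a rough Python sketch of how a single `field=value` clause could be checked against one record. The slides do not specify the MQL grammar or its matching rules, so everything here is an assumption: the record layout, the word-level case-insensitive matching, and the function names `tag_matches` and `clause_matches`. Grouping, ranges and multi-word values are left out.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# Assumption for this sketch: a field is (tag, subfield code, value).
Field = Tuple[str, str, str]


def tag_matches(pattern: str, tag: str) -> bool:
    """Compare a tag pattern such as '1XX' with a concrete tag such as '100'."""
    return len(pattern) == len(tag) and all(
        p in ("X", t) for p, t in zip(pattern, tag)
    )


def clause_matches(clause: str, record: List[Field]) -> bool:
    """Evaluate one 'field=value' clause; '*' and '?' act as wildcards."""
    field, _, wanted = clause.partition("=")
    # '245a' -> tag pattern '245', subfield 'a'; '1XX' -> tag pattern '1XX', any subfield
    tag_pattern, code = field[:3], field[3:]
    for tag, sub, value in record:
        if not tag_matches(tag_pattern, tag):
            continue
        if code and code != sub:
            continue
        # Hypothetical matching rule: compare word by word, ignoring case.
        words = value.lower().replace(",", " ").replace(".", " ").split()
        if any(fnmatchcase(word, wanted.lower()) for word in words):
            return True
    return False


record = [
    ("100", "a", "Collodi, Carlo."),
    ("245", "a", "Le avventure di Pinocchio /"),
    ("260", "b", "Giunti,"),
]

# Boolean operators are expressed here with plain Python 'and' / 'or'.
hit = clause_matches("245a=Pinocchio", record) and clause_matches("1XX=Collodi", record)
```

A real query engine would parse the whole expression and run it against an index rather than test records one at a time; the sketch only shows what a clause means for a single record.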
  12. MQL result (example). Screenshot callouts: search metadata; search result (first record).
  14. Final storage layer
     Although the final result could be written to a local file, a more appropriate destination is a triple store, which brings the following benefits:
     1) a standard query language (SPARQL) to query the store;
     2) a standard format for exchanging data (RDF);
     3) a storage where data can be changed in real time without any reindexing.
     (Slide diagram: RDF stream -> TRIPLE STORE)
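
The third benefit, changing data without reindexing, can be illustrated with a toy in-memory store. This Python sketch is only a stand-in: `TinyTripleStore` is a hypothetical class, and its `match` method with `None` as a wildcard is a much simplified substitute for a SPARQL query against a real triple store.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TinyTripleStore:
    """Hypothetical in-memory store; a real triple store would speak SPARQL."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def remove(self, s: str, p: str, o: str) -> None:
        self._triples.discard((s, p, o))

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> List[Triple]:
        """Return all triples matching the pattern; None matches anything."""
        return sorted(
            t
            for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


store = TinyTripleStore()
store.add("atcult:27283", "bibo:isbn", "880921191X")
before = store.match(s="atcult:27283")

# A change is visible to the next query, with no separate reindexing step.
store.add("atcult:27283", "dc:title", "Le avventure di Pinocchio")
after = store.match(s="atcult:27283")
```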
  15. Agenda: Goals; Information Retrieval; Triple store; Chaining all together; Q&A
  16. Introducing Apache Camel
     Apache Camel ™ is a versatile open-source integration framework based on known Enterprise Integration Patterns. Camel empowers you to define routing and mediation rules in a variety of domain-specific languages, including a Java-based Fluent API, Spring or Blueprint XML Configuration files, and a Scala DSL.
  17. Introducing Apache Camel
     In practice, Apache Camel lets us define things like the route pictured on the slide (image not included in this transcript). Basically, the whole process is divided into atomic pieces (called “processors”), each responsible for a small part of the overall work. A processor can act as a splitter or an aggregator, can manipulate content or, more generally, can perform some task on the incoming message.
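
The idea of chaining small processors can be sketched without Camel itself. The Python below is not the Camel API: `run_route`, `split_words` and `upper_case` are hypothetical names, and a message is modelled as a plain dictionary. Each processor receives one message and returns a list of messages, so the same shape covers a splitter (many out) and a content manipulation step (one out); an aggregator, which needs to see several messages at once, is left out of this sketch.

```python
from typing import Callable, Dict, List

# Assumption for this sketch: a message is a dict with a 'body' entry.
Message = Dict[str, object]
Processor = Callable[[Message], List[Message]]


def run_route(message: Message, processors: List[Processor]) -> List[Message]:
    """Push a message through the processors in order."""
    messages = [message]
    for processor in processors:
        next_messages: List[Message] = []
        for current in messages:
            next_messages.extend(processor(current))
        messages = next_messages
    return messages


def split_words(message: Message) -> List[Message]:
    """Splitter: one incoming message becomes one message per word."""
    return [{"body": word} for word in str(message["body"]).split()]


def upper_case(message: Message) -> List[Message]:
    """Content manipulation: transform the body, keep a single message."""
    return [{"body": str(message["body"]).upper()}]


result = run_route({"body": "rdf assembly line"}, [split_words, upper_case])
```

Because every processor has the same signature, steps can be added, removed or reordered without touching the others, which is the property the assembly line relies on.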
  18. Indexing processor
     (Slide diagram: copies of the sample MARC record flow into INDEX.)
     Once that is done, the storage contains the MARC records and provides MQL capabilities. Note that this is a Near Real Time (NRT) engine, so records can be added on the fly and made immediately visible to searchers. Last but not least, although we labelled this storage “intermediate”, that does not mean it is transient: data is persisted and can be reused for further indexing processes.
  19. Split processor
     The incoming record (the sample MARC record from slide 4) is split into pieces. Each “piece” will have a record fragment and a special correlation ID. Fragments shown on the slide:
     001 00122721727
     000 00694nam a2200241 i 4500
     008 971205s1997 it j 000 0 ita c
     020 $a 880921191X
     082 1 $a 853.8
     100 1 $a Collodi, Carlo.
     245 13 $a Le avventure di Pinocchio / $c C. Collodi ; illustrazioni di A. Mussino
     260 $a Firenze : $b Giunti, $c 1997.
     440 0 $a Collana favolosa / [Giunti]
     521 $a Letteratura per ragazzi
     700 1 $a Mussino, Attilio
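
A split step of this kind could look roughly like the Python below. It is a sketch under assumptions: the record is a list of `(tag, content)` pairs, the 001 value is used as the correlation ID (as the deck later suggests), and `split_record` and the message keys are names invented for the example.

```python
from typing import Dict, List, Tuple

# Assumption for this sketch: a record is a list of (tag, content) pairs.
Field = Tuple[str, str]


def split_record(record: List[Field]) -> List[Dict[str, str]]:
    """Turn one record into one message per field, each carrying the 001 value."""
    correlation_id = next(
        (content for tag, content in record if tag == "001"), None
    )
    if correlation_id is None:
        raise ValueError("record has no 001 field to use as correlation ID")
    return [
        {"correlation_id": correlation_id, "tag": tag, "content": content}
        for tag, content in record
        if tag != "001"
    ]


sample = [
    ("001", "27283"),
    ("020", "$a 880921191X"),
    ("100", "$a Collodi, Carlo."),
]
pieces = split_record(sample)
```

Carrying the correlation ID on every piece is what allows later processors to work on a single tag and still know which record it belongs to.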
  20. Transformation processor (1/2)
     A processor that operates on a fragment (i.e. a tag) of a MARC record to produce RDF triples. This is really a family of processors, because different logic applies depending on the incoming tag. In the simplest case a tag translates directly into a triple:
     001 27283
     020 1 $a880921191X
     <atcult:27283> <bibo:isbn> “880921191X”
     Other times a single tag generates a different (dependent) entity plus a reference to it from the main entity. If the dependent entity has already been created by another similar tag, the processor creates just the reference:
     001 27283
     100 1 $aCollodi Carlo.
     <atcult:029100> <foaf:name> “Collodi Carlo”
     <atcult:27283> <dc:creator> <atcult:029100>
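
Both cases can be sketched in a few lines of Python. The code is illustrative only: `isbn_triples` and `CreatorTransformer` are hypothetical names, and the way entity identifiers are minted (`atcult:person-N`) is an assumption, since the slides do not say how identifiers such as `atcult:029100` are assigned.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def isbn_triples(record_id: str, isbn: str) -> List[Triple]:
    """Simplest case: one tag maps directly to one triple."""
    return [(f"atcult:{record_id}", "bibo:isbn", isbn)]


class CreatorTransformer:
    """Dependent-entity case: create the person once, then only reference it."""

    def __init__(self) -> None:
        self._known: Dict[str, str] = {}  # normalised name -> entity URI

    def triples(self, record_id: str, name: str) -> List[Triple]:
        name = name.rstrip(". ")
        out: List[Triple] = []
        uri = self._known.get(name)
        if uri is None:
            # Hypothetical identifier scheme, used only for this sketch.
            uri = f"atcult:person-{len(self._known) + 1}"
            self._known[name] = uri
            out.append((uri, "foaf:name", name))
        out.append((f"atcult:{record_id}", "dc:creator", uri))
        return out


creators = CreatorTransformer()
first = creators.triples("27283", "Collodi, Carlo.")   # entity + reference
second = creators.triples("30001", "Collodi, Carlo.")  # reference only
```

In a distributed run the `_known` registry would have to live in shared storage rather than in one processor's memory; the sketch keeps it local for brevity.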
  21. Transformation processor (2/2)
     Some transformers, in order to produce their results (one or more triples), need information about another record; think, for example, of the 77X relation tags. How do we get that information? The incoming message carries just a correlation ID (the 001) and a tag. This is where the intermediate storage with its MQL capabilities comes in:
     001 27283
     773 1 $aThe Hug
     Slide labels for the lookup: 245a=The Hug; MARC XML
     <atcult:27283> <frbr:Work> <atcult:92827>
     <atcult:27283> <bibo:DocumentPart> <atcult:92827>
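
A transformer that depends on another record could be sketched as below. This is a Python illustration under assumptions: the dictionary `storage` stands in for the intermediate storage, `find_by_title` stands in for an MQL query such as `245a=The Hug`, and the predicate is passed in by the caller (the `dcterms:isPartOf` used in the example is an illustrative choice, not taken from the slides).

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Assumption for this sketch: storage maps a 001 value to its indexed fields.
Storage = Dict[str, Dict[str, str]]


def find_by_title(storage: Storage, title: str) -> Optional[str]:
    """Stand-in for the query 245a=<title>; returns the 001 of the first hit."""
    for record_id, fields in storage.items():
        if fields.get("245a") == title:
            return record_id
    return None


def link_triples(
    record_id: str, linked_title: str, storage: Storage, predicate: str
) -> List[Triple]:
    """Resolve the linked record by title and emit a reference to it."""
    target = find_by_title(storage, linked_title)
    if target is None:
        return []  # nothing to link if the other record is not in the storage
    return [(f"atcult:{record_id}", predicate, f"atcult:{target}")]


storage = {
    "27283": {"245a": "Le avventure di Pinocchio"},
    "92827": {"245a": "The Hug"},
}
links = link_triples("27283", "The Hug", storage, "dcterms:isPartOf")
```

The processor itself stays stateless: everything it needs beyond the incoming fragment is fetched from the storage at the moment it is needed.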
  22. The big picture
     (Slide diagram: a LOAD BALANCER and four Solr nodes, SOLR1 to SOLR4, holding numbered index partitions 1 to 4; caption, translated from Italian: “Index split horizontally”.)
  23. An example (1/5)
  24. An example (2/5)
  25. An example (3/5)
  26. An example (4/5)
  27. An example (5/5)
  29. A proposal for an RDF assembly line. Thank You!
