OWLIM

Mariana Damova, PhD



     DM2E
Vienna, November 2012
Ontotext
   – Top-5 provider of core Semantic Technology
   – Established in year 2000; offices in Bulgaria, UK, USA
   – Active both in research and commercial projects (FP7 funding for 10 years)

• 360° semantic technology – unique portfolio:
   – Semantic Databases: high-performance RDF DBMS, scalable reasoning
   – Semantic Search: text-mining (IE), metadata generation, Information Retrieval (IR)
   – Web Mining: focused crawling, screen scraping, data fusion
   – Linked Data Management and Data Integration

   Good recognition in the SemTech community
   – Ontotext pages are ranked #1 for “semantic annotation” and “semantic repository” at
     GYM, #3 for “linked data management” at Google

   Several joint ventures and subsidiaries
   – Innovantage: leading online recruitment intelligence provider in the UK
Ontotext Clients (selected)

          British Broadcasting Corporation (BBC)
                – Ran its World Cup 2010 sites on top of OWLIM
                – Since Mar’12, the BBC Sports and 2012 Olympics sections are driven
                  by OWLIM and a Concept Extraction service developed by Ontotext
          Press Association (UK)
                – Analysis of Sports news
                – Concept extraction
                – Linked data generation
          Top-3 USA media (not allowed to name)
          The National Archives (UK): contracted Ontotext to implement a
          semantic KB and semantic search for the Government Web Archive
          British Museum (UK): Ontotext leads the development of Phase 3 of the
          ResearchSpace project on collaborative research in cultural heritage;
          the British Museum’s public SPARQL end-point is powered by OWLIM
          de Bibliothek (Holland): aggregation of data from 150 library databases
Semantic Technologies


•   Semantic technologies (RDF, LOD) allow for an unprecedented ease of
    integration of heterogeneous data sources
      – Already adopted in pharmaceuticals and publishing industries
      – Cultural heritage is next

     BBC: when MySQL was replaced with OWLIM in their “Dynamic Semantic
       Publishing” architecture, the BBC team observed a considerable reduction in
       the complexity of database design, query specification and application
       development, and in query evaluation time.
       Source: “BBC World Cup 2010 dynamic semantic publishing”, Jem Rayfield,
       Senior Technical Architect, BBC News and Knowledge.
       http://www.bbc.co.uk/blogs/bbcinternet/2010/07/bbc_world_cup_2010_dynamic_sem.html
OWLIM
Semantic Repository for RDFS and OWL

• OWLIM is a family of scalable semantic repositories
   • OWLIM-Lite: in-memory, fastest, scales to ~100 million statements
   • OWLIM-SE: file-based, sameAs & query optimizations, scales to 20 billion
     statements
   • OWLIM-Enterprise: replication cluster deployment for resilience and high
     performance parallel query-answering

• OWLIM provides
    – Management, integration and analysis of heterogeneous data
    – Combined with light-weight, high-performance reasoning
    – The inference is based on logical rule-entailment
    – Full RDFS, OWL Horst, restricted OWL-Lite, OWL 2 QL and OWL 2 RL
    – Custom semantics can be defined via rules and axiomatic triples
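
A minimal sketch of what rule-based entailment over explicit and axiomatic triples can look like. It is not OWLIM's rule engine or rule syntax: the two rules are the standard RDFS entailments rdfs9 and rdfs11, while the `ex:` data and all function names are hypothetical and chosen only for illustration.

```python
# Sketch of rule-based forward chaining. rdfs9 and rdfs11 are standard RDFS
# entailment rules; the ex: data and all function names are hypothetical.
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def rdfs11(triples):
    """(A subClassOf B), (B subClassOf C) => (A subClassOf C)."""
    for a, p1, b in triples:
        for b2, p2, c in triples:
            if p1 == SUBCLASS and p2 == SUBCLASS and b == b2:
                yield (a, SUBCLASS, c)


def rdfs9(triples):
    """(x type A), (A subClassOf B) => (x type B)."""
    for x, p1, a in triples:
        for a2, p2, b in triples:
            if p1 == RDF_TYPE and p2 == SUBCLASS and a == a2:
                yield (x, RDF_TYPE, b)


def materialise(explicit, rules):
    """Apply every rule repeatedly until no new statement is inferred."""
    closure = set(explicit)
    while True:
        inferred = set()
        for rule in rules:
            inferred.update(rule(closure))
        inferred -= closure
        if not inferred:
            return closure
        closure |= inferred


# Axiomatic triples (schema-level statements) plus one explicit data triple.
axioms = {
    ("ex:Painting", SUBCLASS, "ex:Artwork"),
    ("ex:Artwork", SUBCLASS, "ex:CulturalObject"),
}
data = {("ex:monaLisa", RDF_TYPE, "ex:Painting")}

closure = materialise(axioms | data, [rdfs9, rdfs11])
for triple in sorted(closure - axioms - data):
    print("inferred:", triple)
```

Custom semantics in this sketch would amount to passing a different list of rule functions and a different set of axiomatic triples to `materialise`.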
OWLIM in the Cultural Heritage Domain

Selected commercial projects
   • ResearchSpace, funded by the Andrew W. Mellon Foundation: support for
     collaborative web-based research, information sharing and web publishing for
     the cultural heritage scholarly community. An Ontotext-led international consortium.
   • The Polish Digital National Museum aggregates artifacts from over 70 contributing
     cultural institutions in the Digital Libraries Federation PIONIER Network, using
     Ontotext's OWLIM repository.
   • LODAC (Linked Open Data in Academia), Japan's National Institute of Informatics:
     aggregates information from multiple Japanese resources as LOD. The system
     uses 8 OWLIM nodes and aggregates 19 collections with 700 000 entities and 15M triples.
   • SemTech for Cultural Heritage, funded by ITCC: semantic publishing of Bulgarian
     cultural heritage to Europeana; establishing a Bulgarian technical aggregator for Europeana.
Selected research projects
   • MOLTO (FP7 project): a cultural heritage use case for a semantic knowledge
     representation infrastructure for querying RDF and presenting query results; includes close
     to 9K museum objects from two collections of The Gothenburg City
   • Charisma (Cultural Heritage Advanced Research Infrastructures): an EU-funded
     integrating activity project with a consortium of 21 partners and metadata from 6 major
     European cultural institutions; it has selected Ontotext's OWLIM repository.
OWLIM PERFORMANCE



•   OWLIM is a scalable, robust and efficient triple store
     – Serving the two most important web-sites for the London Olympic Games
         • Official Olympics website
         • BBC Olympics website
     – Performance highlights
         • OWLIM loads the 100M and the 200M datasets almost twice as fast as the next best product
           (17 min. for 100M)
         • Best query performance among those repositories that can handle update and multi-client
           query tasks (5,285 Query-mixes-per-hour, where a query mix contains 25 queries; e.g. about
           100 queries/sec)
         • OWLIM v5 is 43% faster than v4.3 on the BSBM Explore and Update scenario
         • OWLIM v5 requires between 25% and 70% less storage space



•   OWL 2 RL-type languages have proven to be the only feasible approach for
    reasoning over billions of statements
Reasoning complexity
owl:sameAs Optimization

A way to handle equivalent statements through a single master node.
The effect is efficient, compact handling of inferred statements:
4-6 times more statements are available to query than were
explicitly asserted.
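
The idea can be illustrated with a small sketch, assuming that each group of `owl:sameAs`-equivalent nodes is collapsed onto one master node, statements are stored once against the masters, and the equivalent statements are generated only when queried. This is an illustration of the principle, not OWLIM's storage engine; the class `SameAsStore`, its methods and the `ex:` names are all hypothetical.

```python
from itertools import product


class SameAsStore:
    """Hypothetical store: one master node per owl:sameAs equivalence class."""

    def __init__(self):
        self.parent = {}      # union-find forest: node -> parent node
        self.triples = set()  # statements stored once, over master nodes

    def _find(self, node):
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            self.parent[node] = self.parent[self.parent[node]]  # path halving
            node = self.parent[node]
        return node

    def add_same_as(self, a, b):
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self.parent[rb] = ra
            # Re-key stored statements so they mention masters only.
            self.triples = {(self._find(s), p, self._find(o))
                            for s, p, o in self.triples}

    def add(self, s, p, o):
        self.triples.add((self._find(s), p, self._find(o)))

    def _classes(self):
        classes = {}
        for node in list(self.parent):
            classes.setdefault(self._find(node), set()).add(node)
        return classes

    def expanded(self):
        """Yield every statement implied by the equivalences, at query time."""
        classes = self._classes()
        for s, p, o in self.triples:
            for s2, o2 in product(classes[s], classes[o]):
                yield (s2, p, o2)


store = SameAsStore()
store.add_same_as("ex:Wien", "ex:Vienna")
store.add_same_as("ex:Vienna", "ex:Vindobona")
store.add("ex:Vienna", "ex:locatedIn", "ex:Austria")

print(len(store.triples), "stored;", len(set(store.expanded())), "queryable")
```

One stored statement answers queries phrased with any of the three equivalent identifiers, which is the kind of compaction the slide describes.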
OWLIM Replication Cluster

• Distribution through data replication is used to ensure:
   – Better handling of concurrent user requests
   – Failover support
• How does it work?
   – Every user request is pushed into a transaction queue
   – Each data write request is multiplexed to all repository instances
   – Each read request is dispatched to one instance only
   – To ensure load balancing, each read request is sent to the
     instance with the smallest execution queue at that point in time
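
The dispatch rules above can be sketched as follows. This is a toy model under stated assumptions, not the OWLIM-Enterprise implementation: the classes `Instance` and `Master`, their methods and the request payloads are hypothetical, and instances never drain their queues here so that the routing decisions stay visible.

```python
from collections import deque


class Instance:
    """Hypothetical repository instance with its own execution queue."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()


class Master:
    """Hypothetical dispatcher: writes go to all instances, reads to one."""

    def __init__(self, instances):
        self.instances = instances
        self.transactions = deque()  # every user request is queued here first

    def submit(self, kind, payload):
        self.transactions.append((kind, payload))

    def dispatch(self):
        routed = []
        while self.transactions:
            kind, payload = self.transactions.popleft()
            if kind == "write":
                # Writes are multiplexed to every instance to keep replicas in sync.
                targets = list(self.instances)
            else:
                # Reads go to the instance with the shortest queue right now.
                targets = [min(self.instances, key=lambda i: len(i.queue))]
            for target in targets:
                target.queue.append((kind, payload))
            routed.append((kind, payload, [t.name for t in targets]))
        return routed


master = Master([Instance("a"), Instance("b"), Instance("c")])
master.submit("write", "INSERT DATA { ... }")
for n in range(4):
    master.submit("read", f"query-{n}")

routed = master.dispatch()
for kind, payload, names in routed:
    print(kind, payload, "->", names)
```

Ties are broken by instance order in this sketch; a real cluster would also need failure detection to deliver the failover support mentioned above.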
Geo-spatial index

• Geo-spatial information concerns the geometry of points, shapes and distances relative to the
  surface of the Earth (or any spherical object).
• When using OWLIM-SE all angles are in decimal degrees with the latitude ranging from -90 to
  +90 degrees and the longitude ranging from -180 to +180 degrees.




• Airports have a reference point given by latitude, longitude and altitude.
• Political boundaries can be specified by polygons where each vertex is a 2-dimensional
  latitude/longitude pair.
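
A short sketch of the coordinate conventions above: it checks the decimal-degree ranges and computes a great-circle distance with the haversine formula on a spherical Earth. It does not use or reproduce the OWLIM-SE geo-spatial index or its query predicates; the function names, the mean-radius constant and the sample coordinates are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius, spherical approximation


def validate(lat, lon):
    """Check that a point uses decimal degrees within the stated ranges."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    return lat, lon


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on a sphere, in kilometres."""
    validate(lat1, lon1)
    validate(lat2, lon2)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-centre coordinates, used only as sample input.
vienna = (48.2082, 16.3738)
sofia = (42.6977, 23.3219)
print(f"{haversine_km(*vienna, *sofia):.0f} km")
```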
RDF Rank

• OWLIM-SE includes a plug-in that allows for efficient
  calculation of a modification of PageRank over RDF graphs
• Computation of rank values is fast, e.g.
   – 400M LOD statements take 310 sec (27 iterations)

• Results are available through a system predicate
• Example: get the 100 most important nodes in the RDF graph
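      # Prefix for the RDF Rank plug-in's system predicate
      PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
      SELECT ?n WHERE { ?n rank:hasRDFRank ?r }
      ORDER BY DESC(?r) LIMIT 100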
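
For intuition, here is a sketch of plain PageRank by power iteration, treating every triple as a link from subject to object. The slides say only that the plug-in computes "a modification of PageRank", so this is the textbook algorithm rather than the plug-in's; the damping factor, the handling of dangling nodes, the function name `rdf_rank` and the `ex:` data are assumptions.

```python
def rdf_rank(triples, damping=0.85, iterations=27):
    """Textbook PageRank over an RDF graph: each triple links subject -> object."""
    nodes = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    out_links = {n: [] for n in nodes}
    for s, _, o in triples:
        out_links[s].append(o)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1.0 - damping) / count + damping * dangling / count
                    for n in nodes}
        for n in nodes:
            for target in out_links[n]:
                new_rank[target] += damping * rank[n] / len(out_links[n])
        rank = new_rank
    return rank


triples = [
    ("ex:a", "ex:cites", "ex:hub"),
    ("ex:b", "ex:cites", "ex:hub"),
    ("ex:c", "ex:cites", "ex:hub"),
    ("ex:hub", "ex:cites", "ex:a"),
]
ranks = rdf_rank(triples)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

The default of 27 iterations only echoes the figure quoted above; a production implementation would normally stop on a convergence threshold instead.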
Define: nested repositories

“Nested repositories” represent a new data
   management concept for RDF data:
•   a mechanism for sharing data stored across
    multiple repositories
•   one repository contains a large body of
    knowledge, which is embedded in the other
    repositories
•   each of the other repositories contains more specific
    data, interlinked with the common body of
    knowledge
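
A toy model of the concept, assuming that a nested repository answers queries from its own statements plus those of the shared repository embedded in it, while the shared repository stays unaware of the specific data. The `Repository` class, its `parent` link and the `ex:` statements are hypothetical and say nothing about how OWLIM implements the feature.

```python
class Repository:
    """Hypothetical repository that can embed one shared parent repository."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # the common body of knowledge, if any
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Yield matching triples from this repository and the embedded one."""
        seen = set()
        repo = self
        while repo is not None:
            for triple in repo.triples:
                fits = all(q is None or q == v
                           for q, v in zip((s, p, o), triple))
                if fits and triple not in seen:
                    seen.add(triple)
                    yield triple
            repo = repo.parent


common = Repository("common")
common.add("ex:Vienna", "ex:locatedIn", "ex:Austria")

museum = Repository("museum", parent=common)
museum.add("ex:object42", "ex:keptIn", "ex:Vienna")

library = Repository("library", parent=common)
library.add("ex:book7", "ex:publishedIn", "ex:Vienna")

print(sorted(museum.match()))            # specific data plus shared knowledge
print(sorted(common.match()))            # shared knowledge only
```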
http://www.ontotext.com/owlim




                       mariana.damova@ontotext.com
