Linked data based semantic annotation using Drupal and Apache Stanbol


My presentation from Drupalaton 2013 -

This session will focus on the implementation of semantic services (automatic content enhancement, autotagging, content recommendation, reasoning) based on linked data datasets using the integration of Drupal with Apache Stanbol.

During the presentation the audience will find out about:
the main features of Apache Stanbol and its integration with Drupal
how to discover and use custom/domain specific Linked Data datasets with Apache Stanbol/Drupal
how to build an advanced semantic processing chain in Apache Stanbol that will automatically annotate Drupal entities
how to implement a content recommendation/reasoning feature for Drupal based on Apache Stanbol services.

Apache Stanbol is an open source software stack designed to provide a powerful semantic engine via RESTful services that return results as RDF (Resource Description Framework) and JSON. Unlike existing proprietary, commercially oriented solutions such as OpenCalais, Apache Stanbol is highly customizable and may be trained to provide semantic services for virtually any language.

Published in: Technology, Education

  1. Drupal and Apache Stanbol. LINKED DATA BASED SEMANTIC ANNOTATION. Gabriel Dragomir. Sunday, August 18, 13
  2. The Semantic Web. Tim Berners-Lee: “The first step is putting data on the Web in a form that machines can naturally understand, or converting it to that form. This creates what I call a Semantic Web – a Web of data that can be processed directly or indirectly by machines.”
  3. What’s the hype? Most organizations need to organize, analyze and relate huge amounts of textual, unstructured, dissipated data. Examples: keyword extraction from content (annotate abstracts); text categorization (organize big volumes of text based on a thesaurus); media monitoring of tags (occurrences of a specific keyword on social media channels).
  4. Linked data
  5. Linked data. Project started in 2007. Aimed at building the Web of Data by: identifying open access data sets; converting them into RDF vocabularies; publishing them as open access data sets.
  6. Linked data ecosystem. Linked Open Vocabularies (LOV): dataset/lov/ Provides a conceptual map of the vocabularies. Various providers: libraries, governmental actors, NGOs.
  7. Linked data ecosystem. Where to find other data sets? Swoogle: PoolParty:
  8. Linked data at work!
  9. Semantic annotation. Creates specific metadata that enable new ways to retrieve and aggregate information. Annotations are based on a conceptual scheme, an ontology (e.g. FOAF, DC Core). For more on ontologies see: Good_Ontologies. The annotations build semantic relationships, e.g. rdf:type, owl:sameAs.
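
To make the rdf:type and owl:sameAs relationships from this slide concrete, here is a minimal sketch that builds two annotation triples and prints them as N-Triples. The namespace URIs are the published RDF, OWL and FOAF ones; the local subject URI and the choice of DBpedia resource are illustrative assumptions and do not come from the slides.

```python
# Published namespace URIs for RDF, OWL and FOAF.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical URI of a local (e.g. Drupal) entity; not taken from the slides.
LOCAL_PERSON = "http://example.org/user/42"
# Illustrative external resource the local entity is linked to.
DBPEDIA_PERSON = "http://dbpedia.org/resource/Tim_Berners-Lee"


def annotate(subject, external_uri):
    """Return (subject, predicate, object) triples that type the subject as a
    foaf:Person and link it to an external resource via owl:sameAs."""
    return [
        (subject, RDF + "type", FOAF + "Person"),
        (subject, OWL + "sameAs", external_uri),
    ]


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = annotate(LOCAL_PERSON, DBPEDIA_PERSON)
ntriples = to_ntriples(triples)
print(ntriples)
```

The ontology (FOAF here) supplies the class, while owl:sameAs carries the link into the linked data cloud.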
  10. Semantic annotation. Most common uses: Named Entity Linking: limited to recognizing entities of type person, organization, place (e.g. OpenCalais). Entityhub Linking: annotation based on vocabularies, with no limitation on entity types; requires more natural language processing prior to annotation.
  11. Apache Stanbol on the fly. Here comes Apache Stanbol. A new approach: modular semantic analysis of documents; processing components can be built for virtually any language; flexible workflows via semantic annotation chains; any vocabulary (Linked Data, custom) can be used.
  12. From IKS to Apache Stanbol. IKS - Interactive Knowledge Stack for small to medium CMS providers - EU funded consortium. An open source software stack written in Java. Goal: extract and process semantic data from documents. Project undergoing incubation at the Apache Software Foundation.
  13. Service oriented architecture. Stanbol is designed to offer service oriented integration. RESTful web services API returning RDF or JSON/JSON-LD. Each component exposes an endpoint independently. Open Services Gateway initiative (OSGi) compliant via Apache Felix and Apache Sling. Remote component management.
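
As a rough illustration of the service oriented design, the sketch below builds (but does not send) an HTTP request to the Enhancer endpoint and selects the response format through the Accept header. It assumes a Stanbol launcher running locally at http://localhost:8080 with the Enhancer under /enhancer; the base URL, the helper name build_enhancer_request and the sample text are assumptions for illustration, not details from the slides.

```python
import urllib.request

# Assumption: a Stanbol instance running locally on port 8080.
STANBOL_BASE = "http://localhost:8080"


def build_enhancer_request(text, accept="application/rdf+xml", base=STANBOL_BASE):
    """Build a POST request carrying plain text to the Enhancer endpoint.

    The Accept header selects the serialization of the returned enhancements
    (for example RDF/XML, or JSON for a JSON-LD style result).
    """
    return urllib.request.Request(
        url=f"{base}/enhancer",
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": accept,
        },
        method="POST",
    )


request = build_enhancer_request(
    "Tim Berners-Lee described the Semantic Web.",
    accept="application/json",
)
print(request.get_method(), request.full_url, request.get_header("Accept"))

# Sending is left out on purpose; with a running server it would be:
# with urllib.request.urlopen(request) as response:
#     body = response.read().decode("utf-8")
```

Because each component exposes its own endpoint, the same pattern would apply to other services by changing only the path.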
  14. Implementation. OSGi layer: Apache Felix and Apache Sling. Build environment: Apache Maven. RDF framework: Apache Clerezza. Triple store, reasoning engine: Apache Jena. Indexing and semantic search: Apache Solr. Content analysis/metadata extraction: Apache Tika. Natural language processing: Apache OpenNLP.
  15. Architecture
  16. Components. Semantic layer: Enhancer, EntityHub, ContentHub. Enhancement engines: internal, 3rd party. User interfaces. Knowledge integration (rule sets, reasoners). Storage integration.
  17. Content enhancement. Examples: retrieve additional metadata for a piece of content; identify the language of a text; extract entities (persons, places, organizations); create annotations to external sources; use 3rd party services for named entity recognition.
  18. Drupal meets Stanbol. Several modules implement RDF support allowing data transport to Stanbol semantic annotations. Taxonomy system allows for complex annotation. Fieldable taxonomy terms allow for storage of complex semantic data.
  19. User scenarios. Semantic indexing via Stanbol (SOLR yard). Content enrichment with semantically related information (documents, factual data, images etc.). Tag as you type: dynamic annotation of text in editors.
  20. How it works. A POST request sends content via the REST API. Content is processed by an enhancement chain. Returns JSON-LD, RDF/XML, RDF/JSON etc. JSON-LD (JavaScript Object Notation for Linked Data): a human readable and simple linked data transport format. For best results an enhancement chain should do language detection, tokenization and POS tagging prior to performing semantic annotation.
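
The following sketch shows how a client might pull entity links out of a JSON-LD style enhancement result. The sample document is hand-written for illustration: the fise: property names follow the Stanbol enhancement vocabulary as I understand it, but the exact shape of a real response (context, compaction, value encoding) may differ, so treat the structure and the helper entity_links as assumptions.

```python
import json

# Hand-written, simplified sample; NOT captured from a real Stanbol server.
SAMPLE_RESPONSE = """
{
  "@context": {"fise": "http://fise.iks-project.eu/ontology/"},
  "@graph": [
    {
      "@id": "urn:enhancement:1",
      "@type": ["fise:Enhancement", "fise:TextAnnotation"],
      "fise:selected-text": "Drupal",
      "fise:confidence": 0.9
    },
    {
      "@id": "urn:enhancement:2",
      "@type": ["fise:Enhancement", "fise:EntityAnnotation"],
      "fise:entity-label": "Drupal",
      "fise:entity-reference": {"@id": "http://dbpedia.org/resource/Drupal"},
      "fise:confidence": 0.87
    }
  ]
}
"""


def entity_links(document, min_confidence=0.5):
    """Return (label, uri, confidence) for each entity annotation whose
    confidence is at least min_confidence."""
    links = []
    for node in document.get("@graph", []):
        types = node.get("@type", [])
        if isinstance(types, str):  # JSON-LD allows a single type as a string
            types = [types]
        if "fise:EntityAnnotation" not in types:
            continue
        confidence = node.get("fise:confidence", 0.0)
        if confidence < min_confidence:
            continue
        reference = node.get("fise:entity-reference", {})
        links.append((node.get("fise:entity-label"), reference.get("@id"), confidence))
    return links


document = json.loads(SAMPLE_RESPONSE)
links = entity_links(document)
print(links)
```

Filtering on confidence is one simple way to decide which suggested entities become tags; the 0.5 threshold is arbitrary.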
  21. Drupal integration. Source:
  22. Drupal distribution: IKS CE. IKS CE distribution - Wolfgang Ziegler (fago), Stéphane Corlosquet (scor). Components: Search API Stanbol; VIE.js - semantic annotation UI.
  23. Search API Stanbol. Enables the indexing of Drupal entities such as nodes, users, taxonomy terms, files, etc. in Stanbol EntityHub. Data sent as RDF. Data can be mashed up with data from other sources (Managed Sites, Remote Sites).
  24. VIE.js. “Vienna IKS Editables”: JavaScript library for implementing decoupled Content Management Systems and semantic interaction in web applications.
  25. Monolithic vs Decoupled Content Management. Monolithic vs Decoupled Content Management Systems. Source: Henri Bergius -
  26. Demo setup. We store Drupal entities in a SOLR index. Annotations are to be made based on: DBpedia, bundled with Apache Stanbol; a custom vocabulary of terms related to the semantic web, the Social Semantic Web Thesaurus. SemWeb is imported as a SOLR index into Apache Stanbol.
  27. Custom vocabularies. Social Semantic Web Thesaurus: 1959 concepts related to the semantic web. Author: Andreas Blumauer.
  28. Demo. Index Drupal entities in Apache Stanbol; retrieve annotated entities via REST API; annotate entities using dbpedia and semweb indexes; edit Drupal entities and annotate on the fly; retrieve linked data tag recommendations.
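
One way to picture the tag recommendation step of the demo is a label lookup against each vocabulary. The sketch below only builds the requests; it assumes the two indexes are registered as Entityhub sites named dbpedia and semweb, that each site exposes a find service under /entityhub/site/{site}/find accepting form-encoded name and limit parameters, and that Stanbol runs at http://localhost:8080. All of these, along with the helper name build_find_request, are assumptions rather than details confirmed by the slides.

```python
import urllib.parse
import urllib.request


def build_find_request(site, name, limit=5, base="http://localhost:8080"):
    """Build a form-encoded POST request that searches one Entityhub site
    for entities whose label matches the given name."""
    body = urllib.parse.urlencode({"name": name, "limit": limit}).encode("ascii")
    return urllib.request.Request(
        url=f"{base}/entityhub/site/{urllib.parse.quote(site)}/find",
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
        method="POST",
    )


# One lookup per vocabulary used in the demo (site names are assumed).
requests_by_site = {
    site: build_find_request(site, "Linked Data")
    for site in ("dbpedia", "semweb")
}

for site, request in requests_by_site.items():
    print(site, request.full_url, request.data.decode("ascii"))
```

Merging the results from both sites would give tag candidates from the general purpose and the domain specific vocabulary side by side.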
  29. Questions?
  30. Contact me. Twitter: gabidrg
  31. Thank you!