Gabriel Dragomir

Drupal and Apache Stanbol
SEMANTIC ANNOTATION WITH CUSTOM VOCABULARIES
About me

• Drupal developer, trainer and consultant
• Founding member of Drupal Romania
Association
The Semantic Web
• Tim Berners-Lee:
‘‘The first step is putting data on the
Web in a form that machines can
naturally understand, or converting
it to that form. This creates what I
call a Semantic Web – a Web of data
that can be processed directly or
indirectly by machines.’’
What’s the hype?
• Most organizations need to organize, analyze and relate huge amounts of unstructured, scattered textual data

• Examples:
• keyword extraction from content: annotate
abstracts

• text categorization: organize big volumes of text
based on a thesaurus

• media monitoring of tags: occurrences of a specific
keyword on social media channels
Linked data

http://lod-cloud.net/
Linked data
• Project started in 2007
• Aimed at building the Web of Data by:
• identifying open access data sets
• converting them into RDF
vocabularies

• publishing them as open access data
sets
Linked data ecosystem
• Linked Open Vocabularies (LOV):
http://lov.okfn.org/dataset/lov/

• Provides a conceptual map of the
vocabularies

• Various providers: libraries,
governmental actors, NGOs
Linked data ecosystem
• Where to find other data sets?
• http://www.w3.org/2001/sw/wiki/SKOS/Datasets

• Swoogle: http://swoogle.umbc.edu/
• PoolParty: http://vocabulary.semantic-web.at
Linked data at work!
Semantic annotation
• Creates specific metadata that enable
new ways to retrieve and aggregate
information

• Annotations are based on a conceptual scheme, an ontology (e.g. FOAF, Dublin Core)

• For more on ontologies see: http://www.w3.org/wiki/Good_Ontologies

• The annotations build semantic
Semantic annotation
• Most common uses:
• Named Entity Linking: limited to recognizing entities of type person,
organization, place (e.g. OpenCalais)

• Entityhub Linking: annotation based on vocabularies with no limitations of
entity types. Requires more natural
language processing prior to annotation.
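
To make vocabulary-based linking concrete, here is a toy sketch that matches labels from a small vocabulary against a text and returns the concept URI for each hit. It is not how Stanbol implements linking: the vocabulary, the example.org URIs and the `annotate` function are invented for illustration, and real linking adds the language processing mentioned above.

```python
import re

# Hypothetical mini-vocabulary: label -> concept URI (all invented).
VOCABULARY = {
    "linked data": "http://example.org/concept/linked-data",
    "ontology": "http://example.org/concept/ontology",
    "drupal": "http://example.org/concept/drupal",
}


def annotate(text, vocabulary):
    """Return sorted (start, end, matched_text, uri) tuples for each label found."""
    annotations = []
    for label, uri in vocabulary.items():
        # Whole-word, case-insensitive match of the label.
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append((match.start(), match.end(), match.group(0), uri))
    return sorted(annotations)


result = annotate("Drupal can publish Linked Data.", VOCABULARY)
for start, end, surface, uri in result:
    print(start, end, surface, uri)
```

Because the vocabulary is just a lookup table, any entity type can be annotated, which is the point of the second approach above.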
Apache Stanbol on the fly
• Here comes Apache Stanbol
• A new approach:
• modular semantic analysis of documents
• processing components can be built for
virtually any language

• flexible workflows via semantic annotation
chains

• any vocabulary (Linked Data, custom) can be
used
Service oriented architecture
• Stanbol is designed to offer service oriented
integration

• RESTful web services API returning RDF or
JSON/JSON-LD

• Each component exposes an endpoint
independently

• OSGi (Open Services Gateway initiative) compliant
via Apache Felix and Apache Sling

• Remote component management
Implementation
• OSGi layer: Apache Felix and Apache Sling
• Build environment: Apache Maven
• RDF framework: Apache Clerezza
• Triple store, reasoning engine: Apache Jena
• Indexing and semantic search: Apache Solr
• Content analysis/metadata extraction: Apache
Tika

• Natural language processing: Apache OpenNLP
Architecture
Components
• Semantic layer:
• Enhancer, EntityHub, ContentHub
• Enhancement engines: internal, 3rd party
• User interfaces
• Knowledge integration (rule sets,
reasoners)

• Storage integration
Content enhancement
• Examples:
• retrieve additional metadata for a piece of
content

• identify the language of a text
• extract entities (persons, places, organizations)
• create annotations to external sources
• use 3rd party services for named entities
recognition
Drupal meets Stanbol
• Several modules implement RDF support, so Drupal data can be sent to
Stanbol for semantic annotation

• Taxonomy system allows for complex
annotation

• Fieldable taxonomy terms allow for
storage of complex semantic data
User scenarios
• Semantic indexing via Stanbol (SOLR
yard)

• Content enrichment with semantically
related information (documents,
factual data, images etc.)

• Tag as you type: dynamic annotation
of text in editors
How it works
• POST request sends content via REST API
• content is processed by an enhancement chain
• returns JSON-LD, RDF/XML, RDF/JSON etc.

JSON-LD (JavaScript Object Notation for Linked
Data): a simple, human-readable linked data
transport format

• for best results, an enhancement chain should perform
language detection, tokenization and POS tagging
before semantic annotation

• http://stanbol-yle.jelastic.planeetta.net/demo/enhancer
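
The POST step can be sketched in Python with the standard library. This is a minimal sketch under assumptions: a Stanbol instance at http://localhost:8080, the enhancer at `/enhancer`, named chains under `/enhancer/chain/<name>`, and a hypothetical chain called `semweb-chain`. The request is only built here, not sent.

```python
import urllib.request
from urllib.parse import quote


def build_enhancer_request(text, base_url="http://localhost:8080",
                           chain=None, accept="application/rdf+xml"):
    """Build (but do not send) a POST request carrying plain text to the enhancer."""
    # Assumed layout: default chain at /enhancer, named chains below it.
    path = "/enhancer" if chain is None else "/enhancer/chain/" + quote(chain)
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            # The Accept header selects the serialization of the result.
            "Accept": accept,
        },
        method="POST",
    )


request = build_enhancer_request("Drupal meets Apache Stanbol.")
print(request.get_method(), request.full_url)

# Sending it requires a running Stanbol instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Passing `chain="semweb-chain"` would target a custom chain instead of the default one, assuming a chain with that name has been configured.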
Drupal integration

Source: blog.iks-project.eu
Drupal distribution: IKS CE
• IKS CE distribution - Wolfgang Ziegler (fago),
Stéphane Corlosquet (scor)

• Components:
• Search API Stanbol
• VIE.js - semantic annotation UI
• https://drupal.org/project/iksce
• http://drupal.org/project/vie
• http://drupal.org/project/search_api_stanbol
• https://github.com/fago/stanbol-for-drupal
Search API Stanbol
• enables the indexing of Drupal entities such as nodes, users,
taxonomy terms, files, etc. in Stanbol
EntityHub.

• data sent as RDF
• data can be mashed up with data from other sources (Managed Sites, Remote
Sites)
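
As a rough illustration of "data sent as RDF", the sketch below serializes one entity's literal properties as N-Triples. The node URI, the choice of Dublin Core predicates and the `to_ntriples` helper are assumptions for illustration only. The mapping that Search API Stanbol actually produces is not shown in these slides.

```python
def to_ntriples(subject_uri, properties):
    """Serialize (predicate_uri, literal_value) pairs as N-Triples lines."""
    lines = []
    for predicate_uri, value in properties:
        # Escape backslashes, quotes and newlines inside the literal.
        escaped = (value.replace("\\", "\\\\")
                        .replace('"', '\\"')
                        .replace("\n", "\\n"))
        lines.append(f'<{subject_uri}> <{predicate_uri}> "{escaped}" .')
    return "\n".join(lines)


# Hypothetical Drupal node described with Dublin Core terms.
node_rdf = to_ntriples(
    "http://example.org/node/1",
    [
        ("http://purl.org/dc/terms/title", 'Drupal and "Stanbol"'),
        ("http://purl.org/dc/terms/creator", "gabidrg"),
    ],
)
print(node_rdf)
```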
VIE.js
• “Vienna IKS Editables”
• JavaScript library for implementing decoupled Content
Management Systems and semantic
interaction in web applications.
Monolithic vs Decoupled Content Management Systems
• Monolithic vs Decoupled Content
Management Systems

source: Henri Bergius - http://bergie.iki.fi
Demo setup
• we store Drupal entities in a SOLR index
• annotations are to be made based on:
• DBPedia - bundled with Apache Stanbol
• a custom vocabulary of terms related to
semantic web - Social Semantic Web
Thesaurus

• SemWeb is imported as a SOLR index
into Apache Stanbol
Custom vocabularies
• PoolParty Semantic Web
• 224 concepts related to semantic web
• Author: Andreas Blumauer
• http://vocabulary.semantic-web.at/PoolPartySemanticWeb.html

• http://vocabulary.semantic-web.at/PoolPartySemanticWeb/Drupal.html
Demo
• index Drupal entities in Apache Stanbol
• retrieve annotated entities via REST API
• annotate entities using dbpedia and
semweb indexes

• edit Drupal entities and annotate on the
fly

• retrieve linked data tag recommendations
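
Retrieving an entity over REST can be sketched as a URL builder. The path layout `/entityhub/site/<site>/entity?id=<uri>`, the `dbpedia` site name and the localhost base URL are assumptions about a default Stanbol install, so check them against the running instance before relying on them.

```python
from urllib.parse import urlencode


def entity_lookup_url(entity_uri, site="dbpedia",
                      base_url="http://localhost:8080"):
    """Build the URL for looking up one entity in an Entityhub site (assumed layout)."""
    # The entity URI goes into the query string, so it must be percent-encoded.
    query = urlencode({"id": entity_uri})
    return f"{base_url.rstrip('/')}/entityhub/site/{site}/entity?{query}"


url = entity_lookup_url("http://dbpedia.org/resource/Drupal")
print(url)
```

The same function could point at the custom vocabulary by passing the name under which it was installed, for example a hypothetical `site="semweb"`.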
Questions?
Contact me

• gabriel.dragomir@webikon.com
• twitter: gabidrg
Thank you!
