DOMEO ANNOTATION TOOLKIT
AND TEXT MINING


CREATING, VISUALISING, CURATING AND SHARING
TEXT MINING RESULTS

Paolo Ciccarese, PhD
paolo.ciccarese@gmail.com


January 30th 2012, W3C Scientific Discourse Call
• Domeo Annotation Toolkit is a collection of software
  components for creating and sharing annotation of
  web documents and their fragments
• It can export and exchange all annotation in
  Annotation Ontology (AO) RDF format
• The Domeo client is the user interface for producing
  manual and semi-automatic annotation of HTML
  documents directly in the browser


                              http://annotationframework.org/
ANNOTATION ONTOLOGY
• OWL vocabulary for representing and sharing
  annotation and semantic annotation of digital
  resources and their fragments:
  - Is orthogonal to the domain(s) of interest
  - Supports stand-off annotation
  - Offers tools for identifying fragments
  - Designed with extension points
  - Defines basic annotation containers
  - Supports versioning
  - Tracks provenance

                                                     http://purl.org/ao/home
DOMEO AND TEXT MINING SERVICES
• Domeo can trigger text mining algorithms
  when they are available through web services
• Software connectors have to be developed to
  translate the results into a suitable format
• The results are displayed in the web documents
• Users can record their feedback/judgment through
  customizable user interfaces
NCBO ANNOTATOR
                                                            http://www.bioontology.org/annotator-service
• Web service that annotates textual metadata (e.g.
  journal abstracts) with relevant ontology concepts
• The ontologies of interest can be preselected
  through one of the service's many parameters
DOMEO AND THE NCBO ANNOTATOR
                                                       http://www.bioontology.org/annotator-service
• Domeo allows automatic/manual annotation with
  terms coming from selected ontologies managed by
  BioPortal
RUNNING NCBO ANNOTATOR
Additional text mining services will be listed here
NCBO ANNOTATOR RESULTS IN DOMEO
List of recognized entities
RESULTS CURATION
Customizable
CUMULATIVE RESULTS CURATION
• One item only
• All instances with the same text match
• All instances, regardless of the text match
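
The three curation scopes above could be sketched roughly as follows. This is an illustration in Python, not Domeo's code: the `Result` record, its fields, and the scope names are hypothetical, and it assumes that "instances" means results pointing at the same recognized term.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    """One text mining result shown in the document (hypothetical record)."""
    text: str                       # the matched text
    concept: str                    # identifier of the recognized term
    judgment: Optional[str] = None  # curator feedback, e.g. "accepted"


def curate(results: List[Result], index: int, judgment: str, scope: str) -> int:
    """Apply a judgment to the selected result and, depending on the scope,
    to related results. Returns the number of results updated."""
    selected = results[index]
    updated = 0
    for i, r in enumerate(results):
        if scope == "one_item":
            hit = i == index
        elif scope == "same_text":
            # same recognized term and same matched text
            hit = r.concept == selected.concept and r.text == selected.text
        elif scope == "any_text":
            # same recognized term, whatever text produced the match
            hit = r.concept == selected.concept
        else:
            raise ValueError(f"unknown scope: {scope}")
        if hit:
            r.judgment = judgment
            updated += 1
    return updated
```

A curator rejecting one false positive would use `"one_item"`, while rejecting a systematically wrong term would use `"any_text"`.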
SERIALIZATION IN AO/RDF
SOFTWARE CONNECTORS
At the current stage
• For each text mining service we have to write a
  specific connector, which normally translates offset
  and range into prefix and postfix
• And keep it up to date!
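
The core of such a connector, translating offset and range into prefix and postfix, might look like the following minimal sketch. It assumes character offsets into the same plain text the service analyzed; the function name, the `window` parameter, and the returned tuple are hypothetical and not taken from Domeo.

```python
from typing import NamedTuple


class PrefixPostfix(NamedTuple):
    prefix: str   # text immediately before the match
    exact: str    # the matched text itself
    postfix: str  # text immediately after the match


def to_prefix_postfix(text: str, offset: int, length: int,
                      window: int = 30) -> PrefixPostfix:
    """Convert an (offset, range) match into (prefix, exact, postfix).

    `offset` is the start of the match in `text` and `length` its range,
    both in characters; `window` is how much context to keep on each side.
    """
    if offset < 0 or length <= 0 or offset + length > len(text):
        raise ValueError("match lies outside the text")
    end = offset + length
    return PrefixPostfix(
        prefix=text[max(0, offset - window):offset],
        exact=text[offset:end],
        postfix=text[end:end + window],
    )


text = "Mutations in the APP gene cause early-onset disease."
selector = to_prefix_postfix(text, offset=17, length=3, window=10)
```

A prefix/exact/postfix selector locates the match by its surrounding text rather than by its character position, which is why the connector has to see the same text the service saw.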
UIMA, CLEREZZA AND AO
OSS-BASED INFRASTRUCTURE FOR TEXT MINING OVER
ONTOLOGIES

Tommaso Teofili and Paolo Ciccarese
tommaso@apache.org
APACHE UIMA
• Architectural framework for unstructured
  information management (UIM)
• OASIS standard
• Build, deploy and run text mining pipelines
• Scaling capabilities for large volumes of data
• NLP/TM algorithms wrapped as Analysis Engines

                                   http://uima.apache.org/
UIMA TYPES
• The annotation domain is defined in type systems
• Types and features are just declared
• Existing type systems can be
  imported/exported/enhanced
• Eases data exchange between Analysis Engines (AEs)
• Two "main" types
  - TOP
  - Annotation
APACHE CLEREZZA
• Service platform for linked data
• OSGi-based
• RDF API
• RESTful Web Service Framework
• Triplestore independent
• Integrated with Apache UIMA

                          http://incubator.apache.org/clerezza/
UIMA/CLEREZZA CONVENTION
• Developers can create custom types / type systems
• URIs need to be managed
• Integration of services vs ontology sharing
• ClerezzaTypeSystem
  - ClerezzaBaseAnnotation
      uri
  - ClerezzaBaseEntity
      uri
      label (rdfs:label)
      references (annotations referring to this entity)
  - Service-specific annotation and entity types are defined
      by subclassing the above
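
The convention can be pictured with a small class hierarchy. The sketch below is a Python analogy only: real UIMA types are declared in type system descriptors, not in Python. The two base types and their features come from the slide; `GeneMention`, `Gene`, and their extra fields are hypothetical service-specific subtypes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClerezzaBaseAnnotation:
    uri: str  # every annotation carries its own URI


@dataclass
class ClerezzaBaseEntity:
    uri: str
    label: str  # corresponds to rdfs:label
    # annotations referring to this entity
    references: List[ClerezzaBaseAnnotation] = field(default_factory=list)


# Hypothetical service-specific types: they add their own features
# and inherit the uri (and label, references) from the base types.
@dataclass
class GeneMention(ClerezzaBaseAnnotation):
    begin: int = 0
    end: int = 0


@dataclass
class Gene(ClerezzaBaseEntity):
    symbol: str = ""


mention = GeneMention(uri="urn:example:annotation/1", begin=17, end=20)
gene = Gene(uri="urn:example:entity/APP", label="APP", symbol="APP")
gene.references.append(mention)
```

Because every subtype inherits `uri`, generic code can turn any annotation or entity into an RDF resource without knowing the service that produced it.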
CLEREZZABASEANNOTATION DESCRIPTOR
CLEREZZABASEENTITY DESCRIPTOR
BEFORE
AFTER (URI FIELD INHERITED)
CONVERSION STRATEGIES
• UIMA annotations are stored inside the CAS
• Services "talk" via web services + RDF
• CAS to RDF mapping via Clerezza
• Pluggable mapping strategies
  - Clerezza Default
  - AnnotationOntology
  - …
CONVERSION STRATEGIES
Change mapping strategies via XML/Eclipse plugin
Or in the descriptor directly
 <nameValuePair>
   <name>mappingStrategy</name>
   <value><string>ao</string></value>
 </nameValuePair>
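
To show how a setting like this could be read back, here is a minimal Python sketch that parses the fragment above with the standard library. The `read_setting` helper and the `"default"` fallback are hypothetical. A complete UIMA descriptor also declares an XML namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

SNIPPET = """
<nameValuePair>
  <name>mappingStrategy</name>
  <value><string>ao</string></value>
</nameValuePair>
"""


def read_setting(xml_text: str, name: str, default: str = "") -> str:
    """Return the string value of the named nameValuePair, or `default`."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so a bare fragment works too
    for pair in root.iter("nameValuePair"):
        if pair.findtext("name") == name:
            return pair.findtext("value/string", default)
    return default


strategy = read_setting(SNIPPET, "mappingStrategy", default="default")
```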
CLEREZZA WEB SERVICES EXAMPLE
LOOKING AHEAD
DOMEO TOOLKIT V. 2

Paolo Ciccarese, PhD
DOMEO ANNOTATION TOOLKIT V.2
• Domeo Annotation Toolkit v.2 is planned for the end
  of the first quarter of 2012
• It will consist of a major refactoring to improve
  modularity and make writing plug-ins easier
• It will include various new features and will be the
  first step towards a federated architecture
• It will be open source!
DOMEO FEDERATION
• We currently have two instances of the Domeo
  Toolkit and the number of instances is going to
  increase
• We need to define a clean architecture that
  supports communication between instances or
  nodes
• Instances should be able to access each other's
  annotations in multiple ways
DOMEO FEDERATION
[Diagram. Legend: Annotation Flow, Web Service, Triplestore. Components: Domeo Node 1, Domeo Node 2, Domeo Node 3 and Domeo Node 4, Web Clients, SPARQL links]
Example: DT3 retrieves annotation from DT1 through a web service
and from DT2 through a SPARQL query against its triplestore
SOFTWARE ANNOTATION ACCESS
Nodes can access annotations of other nodes through
• Web services
  - Annotation by User
  - Annotation by Group
  - Annotation by Document
  - Annotation by Corpora
  - …
• SPARQL queries, when a SPARQL endpoint is available
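
As an illustration of the SPARQL route, the sketch below builds an "annotation by document" query as a string. It is a hedged example, not a Domeo API: the namespace `http://purl.org/ao/` and the property `ao:annotatesResource` are assumptions about the AO vocabulary that should be checked against the AO specification, and the function name is hypothetical.

```python
# Assumed AO namespace and property name; verify against the AO specification
AO_NS = "http://purl.org/ao/"
ANNOTATES = "ao:annotatesResource"


def annotations_by_document_query(document_uri: str, limit: int = 100) -> str:
    """Build a SPARQL SELECT listing the annotations on one document."""
    # Refuse characters that cannot appear inside <...> in SPARQL
    if any(c in document_uri for c in '<>"{}|\\^` '):
        raise ValueError("not a valid IRI for a SPARQL query")
    return (
        f"PREFIX ao: <{AO_NS}>\n"
        "SELECT ?annotation WHERE {\n"
        f"  ?annotation {ANNOTATES} <{document_uri}> .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


query = annotations_by_document_query("http://example.org/paper.html")
```

The resulting string would be sent to the other node's SPARQL endpoint; the by-user, by-group, and by-corpora cases would follow the same pattern with a different triple pattern.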
USERS ANNOTATION ACCESS
Users can export their own annotation in AO RDF
• Annotation by document
• Annotation by corpora
• All of the annotation
CURRENT DOMEO ARCHITECTURE
[Diagram. Legend: Request, Annotation. Components: Domeo Web Client; AO-RDF; Annotation Web Services; Domeo; MySQL; User Annotation Export; UI; Text Mining Connector; NCBO Web Service; NCBO Annotator]
DOMEO NODE ARCHITECTURE
> ACCESSING EXTERNAL ANNOTATION
[Diagram with callouts 1 and 2. Components: Other Domeo Node; Domeo Web Client; External Triplestore; AO-RDF; SPARQL; Annotation Web Services; Triple Store Connector; Domeo v.2 Node; MySQL; User Annotation Export; UI; Text Mining Connector; NCBO Web Service; NCBO Annotator]
DOMEO NODE ARCHITECTURE
> ADDING A SPARQL ENDPOINT
[Diagram. Components: Other Domeo Node; Domeo Web Client; External Triplestore; AO-RDF; SPARQL; Annotation Web Services; Triple Store Connector; SPARQL; Triplestore; Domeo v.2 Node; MySQL; User Annotation Export; UI; Text Mining Connector; NCBO Web Service; NCBO Annotator]
DOMEO NODE ARCHITECTURE
> TEXT MINING ALGORITHMS INTEGRATION
[Diagram with callouts 1 to 4. Components: Other Domeo Node; Domeo Web Client; External Triplestore; AO-RDF; SPARQL; Annotation Web Services; Triple Store Connector; SPARQL; Triplestore; Domeo v.2 Node; MySQL; User Annotation Export; UI; Text Mining Connector; NCBO Web Service; NCBO Annotator; Clerezza Connector; Clerezza Web Service; UIMA Algorithm; Text Mining Connector; Text Mining Library Manager; Text Mining Algorithm]
DOMEO AND TEXT MINING
IN SUMMARY
• Run algorithms within Domeo
  - Making the algorithms available through web services
  - Integrating the algorithms, as libraries, within the
    Domeo architecture
• Run algorithms separately and then
  - Load the results into a Domeo node through web
    services
  - Store the results directly in the (a) triplestore
  - Store the results directly in the database
W3C COMMUNITY GROUP
OPEN ANNOTATION
• Annotation Ontology (AO) and Open Annotation
  Collaboration (OAC) are merging
• Unified model for representing and sharing
  annotation in RDF

                 http://www.w3.org/community/openannotation/
THANK YOU!
If you are interested in using, or contributing to,
the Domeo Annotation Toolkit, visit our website
http://annotationframework.org or contact
paolo.ciccarese -at- gmail.com
