 Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
   The Semantic Web vision & Linked Data
   Multi-disciplinary perspective
       Linked Data, IR, NLP
   Case study: Treo
       Talking to the Linked Data Web
   Semantic application patterns
   Take-away message
2001:

   Software which is able to understand meaning (intelligent, flexible)
   Leveraging the Web for information scale

   What was the plan to achieve it?
   Build a Semantic Web Stack
   Which covers both representation and reasoning
   Adoption:
       No significant data growth
   Ontologies are not straightforward to build:
       People are not familiarized with the tools and principles
       Difficult to keep consistency at Web scale
   Scalability

   Problems:
       Consistency
       Scalability


[Diagram: Logic World vs. Web World]
2006:

   The Web as a Huge Database
   Fundamental step for data creation

   Where is the intelligence and flexibility?
   We will be back to this point in a minute

   Data Model Features:
       Graph-based data model
       Extensible schema
       Entity-centric data integration


   Specific Features:
       Designed over open Web standards
       Based on the Web infrastructure (HTTP, URIs)

   Positives:
       Solid adoption in the Open Data context (eGovernment, eScience, etc.)
       Existing data is relevant (you can build real applications)

   Negatives:
       Data consumption is a problem
       Data generation beyond database mapping/triplification is also a problem
       Still far from the Semantic Web vision
   How to address the previous challenges?

   Linked Data:
       Web-scale structured data representation
   Information Retrieval:
       Search, approximation, ranking strategies
       Scalability
   Natural Language Processing (NLP):
       Analysing natural language
       Semantic approximation (distributional semantics)
   IBM Watson approach
   With Linked Data we are still in the DB world




From which university did the wife of
Barack Obama graduate?
 With Linked Data we are still in the DB world
 (but slightly worse)




Demonstration
   Transform natural language queries into triple patterns
   Steps:
       Entity Recognition
        “From which university did the wife of Barack Obama graduate?”
       Dependency parsing
       Query Pattern detection
       Query Planning

   Dependency parse and POS tags:
       prep(graduate-10, From-1)     From/IN
       det(university-3, which-2)    which/WDT
       pobj(From-1, university-3)    university/NN
       aux(graduate-10, did-4)       did/VBD
       det(wife-6, the-5)            the/DT
       nsubj(graduate-10, wife-6)    wife/NN
       prep(wife-6, of-7)            of/IN
       nn(Obama-9, Barack-8)         Barack/NNP
       pobj(of-7, Obama-9)           Obama/NNP
       root(ROOT-0, graduate-10)     graduate/VB
                                     ?/.
Using NLP
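The query-pattern-detection step can be sketched as a small rule over the dependency relations above (a toy Python illustration, not Treo's actual implementation; the relation list is hardcoded from the parse shown):

```python
# Toy sketch: derive a triple pattern from the dependency parse above.
# Relations are hardcoded; a real system would call a dependency parser.
deps = [
    ("prep", "graduate-10", "From-1"),
    ("det", "university-3", "which-2"),
    ("pobj", "From-1", "university-3"),
    ("aux", "graduate-10", "did-4"),
    ("det", "wife-6", "the-5"),
    ("nsubj", "graduate-10", "wife-6"),
    ("prep", "wife-6", "of-7"),
    ("pobj", "of-7", "Obama-9"),
    ("root", "ROOT-0", "graduate-10"),
]

def word(node):
    # Strip the token index: "wife-6" -> "wife"
    return node.rsplit("-", 1)[0]

# Rule: X --prep--> "of" --pobj--> Y means "the X of Y" ("wife of Obama"),
# which maps to the triple pattern (Y, X, ?x).
def possessive_patterns(deps):
    patterns = []
    for rel, head, dep in deps:
        if rel == "prep" and word(dep) == "of":
            for rel2, head2, dep2 in deps:
                if rel2 == "pobj" and head2 == dep:
                    patterns.append((word(dep2), word(head), "?x"))
    return patterns

print(possessive_patterns(deps))  # [('Obama', 'wife', '?x')]
```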
Query:
[Diagram: the natural language query translated into triple patterns]
Using NLP
   Entity Search:
       Build an entity index (instances)
       Extract terms from URIs and index the terms using your
        favourite IR framework
       Search instances by keywords
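A minimal sketch of such an entity index, assuming a plain in-memory inverted index in place of a real IR framework (the URIs and helper names are illustrative):

```python
# Toy inverted index over entity URIs (illustrative; a real system
# would use Lucene/Solr or another IR framework).
from collections import defaultdict

uris = [
    "http://dbpedia.org/resource/Barack_Obama",
    "http://dbpedia.org/resource/Michelle_Obama",
    "http://dbpedia.org/resource/Princeton_University",
]

def terms(uri):
    # Extract and normalise terms from the URI's local name.
    local = uri.rsplit("/", 1)[-1]
    return [t.lower() for t in local.split("_")]

index = defaultdict(set)
for uri in uris:
    for t in terms(uri):
        index[t].add(uri)

def search(keywords):
    # Return the entities matching all keywords.
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

print(search(["michelle", "obama"]))
# {'http://dbpedia.org/resource/Michelle_Obama'}
```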




Using IR
Query
[Diagram: keyword search for instances over the Linked Data Web]
Using IR
   Use distributional semantics to semantically match query terms to predicates and classes
   Distributional principle: words that co-occur together tend to have related meaning
       Allows the creation of a comprehensive semantic model from unstructured text
       Based on statistical patterns over large amounts of text
       No human annotations
   Distributional semantics can be used to compute a semantic relatedness measure between two words
Using NLP and IR
   Computation of a measure of “semantic proximity” between two terms
   Allows approximate semantic matching between query terms and dataset predicates and classes
   It supports a reasoning-like behavior based on the knowledge embedded in the corpus
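A minimal sketch of the idea, assuming simple sentence-level co-occurrence vectors and cosine similarity over a toy corpus (real distributional models, such as ESA, are built from corpora the size of Wikipedia):

```python
# Toy distributional-relatedness sketch: co-occurrence vectors from a
# tiny corpus, compared with cosine similarity.
import math
from collections import defaultdict

corpus = [
    "the wife married her spouse",
    "his spouse is his wife",
    "the university awarded the degree",
    "she graduated from the university with a degree",
]

def cooc_vectors(sentences):
    # Count, for each word, the words it co-occurs with per sentence.
    vecs = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        words = s.split()
        for w in words:
            for c in words:
                if c != w:
                    vecs[w][c] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = cooc_vectors(corpus)
# 'wife' and 'spouse' share contexts, so they score higher than
# 'wife' and 'degree'.
print(cosine(vecs["wife"], vecs["spouse"]) >
      cosine(vecs["wife"], vecs["degree"]))  # True
```

The same relatedness score is what lets a query term like "wife" match a dataset predicate like "spouse" without any manual mapping.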




Query
[Diagram: matching over the Linked Data Web. Which properties are semantically related to ‘wife’?]
Using NLP and IR


   Semantic approximation in databases (as in any IR system): semantic best-effort
   Needs some level of user disambiguation, refinement and feedback
   As we move in the direction of semantic systems we should expect the need for principled dialog mechanisms (as in human communication)
   Pull the user interaction back into the system



                                           Using NLP
                                           and IR
   Derived from the experience developing Treo

   Not restricted to queries over Linked Data

   The following list is not intended to be complete
   Pattern #1: Maximize the amount of knowledge in
    your semantic application

   Meaning interpretation depends on knowledge

   Using LOD: DBpedia, Freebase and YAGO can give you a very comprehensive set of instances and their types

   Wikipedia can provide you a comprehensive distributional semantic model
   Pattern #2: Allow your database to grow

   Dynamic schema

   Entity-centric data integration
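A sketch of what a dynamic, entity-centric schema can look like in code (illustrative names; not a real triple store):

```python
# Entity-centric graph store with a dynamic schema: new predicates can
# be added at any time, with no migration step (illustrative sketch).
from collections import defaultdict

graph = defaultdict(list)  # entity -> list of (predicate, value)

def add_fact(entity, predicate, value):
    graph[entity].append((predicate, value))

add_fact("Barack_Obama", "spouse", "Michelle_Obama")
# Later, a new predicate appears in the data; the schema simply grows.
add_fact("Barack_Obama", "almaMater", "Harvard_University")

print(graph["Barack_Obama"])
# [('spouse', 'Michelle_Obama'), ('almaMater', 'Harvard_University')]
```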
   Pattern #3: Once the database grows in complexity
    use semantic search instead of structured queries

   Instances can be used as pivot entities to reduce
    the search space
       They are easier to search
       Higher specificity and lower vocabulary variation
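A sketch of the pivot-entity idea, assuming a toy in-memory graph (the entity and predicate names are illustrative):

```python
# Once the recognised instance is used as a pivot, the search space
# shrinks to the predicates attached to that one entity (toy data;
# a real system would rank them by semantic relatedness).
graph = {
    "Michelle_Obama": [
        ("spouse", "Barack_Obama"),
        ("almaMater", "Princeton_University"),
        ("birthPlace", "Chicago"),
    ],
}

def candidate_predicates(pivot):
    # Only the pivot's own predicates need to be matched.
    return [p for p, _ in graph.get(pivot, [])]

print(candidate_predicates("Michelle_Obama"))
# ['spouse', 'almaMater', 'birthPlace']
```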
   Pattern #4: Use distributional semantics and
    semantic relatedness for a robust semantic
    matching

   Distributional semantics allows your application to
    digest (and make use of) large amounts of
    unstructured information

   Multilingual solution

   Can be complemented with WordNet
   Pattern #5: POS tags, syntactic parsing and rules will go a long way toward interpreting natural language queries and sentences
   Use them to explore the regularities in natural
    language

   Define a scope for natural language processing in
    your application (restrict by domain, syntactic
    complexity)

   These tools are easy to use and quite robust (at
    least for English)
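A toy illustration of such a POS-tag pattern rule, assuming a hand-tagged sentence (a real application would use a tagger such as Stanford's or NLTK's; the question and rule here are made up for the example):

```python
# Toy rule over Penn Treebank POS tags for questions of the form
# "Who VERB ENTITY?" (hand-tagged input; illustrative only).
tagged = [("Who", "WP"), ("directed", "VBD"), ("Avatar", "NNP"), ("?", ".")]

def match_who_verb_entity(tagged):
    # Rule: WP + past-tense verb + proper noun -> (?x, verb, entity)
    tags = [t for _, t in tagged]
    if tags[:3] == ["WP", "VBD", "NNP"]:
        return ("?x", tagged[1][0], tagged[2][0])
    return None

print(match_who_verb_entity(tagged))  # ('?x', 'directed', 'Avatar')
```

Such rules exploit the regularities of question syntax; restricting the domain and syntactic complexity keeps the rule set small.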
   Pattern #6: Provide a user dialog mechanism in the
    application

   Improve the semantic model with user feedback
   Part of the Semantic Web vision can be addressed
    today with a multi-disciplinary perspective
       Linked Data, IR and NLP
 You can build your own IBM Watson-like application
 Both data and tools are available and ready to use:
  the barrier is the mindset
 Large opportunity for new solutions
   NLP                         Datasets
        WordNet                    DBpedia
        VerbNet                    Freebase
        Stanford parser            YAGO
        C&C parser/Boxer
        NLTK
                                Tools that will be
        DBpedia Spotlight       available soon:
        Gate                       Treo
        UIMA                       Treo-ESA
   IR                              Graphia
        Lucene/Solr
        Terrier
André Freitas, Edward Curry, João Gabriel Oliveira, Sean O'Riain, . IEEE Internet Computing, Special Issue on Internet-Scale Data, 2012.

André Freitas, Edward Curry, João Gabriel Oliveira, Sean O'Riain, . International Journal of Semantic Computing (IJSC), 2012.

André Freitas, Sean O'Riain, Edward Curry, . 27th ACM Applied Computing Symposium, Semantic Web and Its Applications Track, 2012.

André Freitas, João Gabriel Oliveira, Sean O'Riain, Edward Curry, João Carlos Pereira da Silva, . In Proceedings of the 16th International Conference on Applications of Natural Language to Information Systems (NLDB), 2011.

André Freitas, Danilo S. Carvalho, João Carlos Pereira da Silva, Sean O'Riain, Edward Curry, A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia. In Proceedings of the 1st Workshop on the Web of Linked Entities (WoLE 2012) at the 11th International Semantic Web Conference (ISWC), 2012.
andrefreitas.org

andre (dot) freitas – at – deri (dot) org

From Linked Data to Semantic Applications
