Stardog
Linked Data Catalog
      Héctor Pérez-Urbina
     Edgar Rodríguez-Díaz

       Clark & Parsia, LLC
 {hector, edgar}@clarkparsia.com
Who are we?
● Clark & Parsia is a semantic software startup
● HQ in Washington, DC & office in Boston
● Provides software development and integration
  services
● Specializing in Semantic Web, web services, and
  advanced AI technologies for federal and
  enterprise customers

        http://clarkparsia.com/
        Twitter: @candp
What's SLDC?
● Stardog Linked Data Catalog
● A catalog of data sources
    ○ Semi-structured
    ○ Relational
    ○ Object-oriented
    ○ ...
● Provides a coherent view over existing data
  repositories so that users and/or
  applications can easily find them and query
  them
Use Cases
● Sources
   ○ Management, import, subscription,
     categorization, sharing
● Query
   ○ Management, sharing, results export
   ○ Querying
      ■ Metadata, external sources, integration
● Locating sources
   ○ Search, browse
● NLP/AI
   ○ Entity extraction, graph algorithms, clustering
     analysis
[Architecture diagram: Application layer, Middleware layer, NLP/AI analytics layer, Data layer]
Demo
Semantic Technologies
● W3C standards
   ○ RDF(S), OWL, SPARQL
● Lower operational costs and raise productivity
   ○ Cooperation without coordination
   ○ Appropriate abstractions
   ○ Declarative is better than imperative
   ○ Correctness when it matters; sloppiness
     when it doesn’t
Data Model
● Similar to DCAT from W3C
   ○ Catalog entries
● Enhanced with
   ○ SSD
   ○ VoID datasets
   ○ SKOS background models
   ○ Axioms & rules
Modeling the Domain
● Use of axioms to model
  relationships between
  classes
   ○ :Query subClassOf :Resource
   ○ :Entry subClassOf :Resource
● Retrieve the resources
  user :u can see
   ○ SELECT ?resource
     WHERE { ?resource a :Resource . }
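A minimal Python sketch of why this query can return queries and entries even though none of them is asserted to be a :Resource: the two subClassOf axioms let a reasoner infer the missing type information. The individuals :q1 and :e1 and the helper names are hypothetical, and the hand-rolled closure only stands in for what a real reasoner does.

```python
# Only the two subClassOf axioms come from the slide; the data is hypothetical.
SUBCLASS_OF = {":Query": {":Resource"}, ":Entry": {":Resource"}}
TYPES = {":q1": {":Query"}, ":e1": {":Entry"}}  # asserted type triples


def superclasses(cls):
    """Return cls plus every class reachable through subClassOf."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def instances_of(cls):
    """Individuals whose asserted or inferred types include cls."""
    return sorted(
        individual
        for individual, asserted in TYPES.items()
        if any(cls in superclasses(t) for t in asserted)
    )


print(instances_of(":Resource"))  # both :e1 and :q1 are found
```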
Security
● Authentication
   ○ Shiro-based implementation
   ○ Extensible to LDAP and/or AD
● Authorization
   ○ Eat-your-own-dog-food approach
   ○ Reasoning-based
   ○ Use of axioms & rules
Deriving Permissions
● Users have permission
  roles
● Permission roles have
  permission relations with
  resources
Deriving Permissions
● If a user has a permission role containing a
  read permission associated with a resource,
  then the user has the same permission over
  the resource
     :permissionRole(?user,?role),
     :readPermission(?role,?resource) ->
     :readUserPermission(?user,?resource)
● Everybody has read access to public
  resources
     :User(?user),
     :PublicResource(?resource) ->
     :readUserPermission(?user,?resource)
Deriving Permissions
● User :user1 has delete permissions over any
  source
   ○ :deleteUserPermission(?user,:anySource),
     :DataSource(?source) ->
     :deleteUserPermission(?user,?source)
   ○ :user1 :deleteUserPermission :anySource
● Everybody has all permissions to the resources
  they created
   ○ :resourceCreator(?user,?resource) ->
     :allUserPermissions(?user,?resource)
   ○ :allUserPermissions(?user,?resource) ->
     :readUserPermission(?user,?resource)
   ○ ...
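The permission rules on these slides can be read as forward-chaining rules over triples. The sketch below is one possible reading in plain Python, not the SLDC implementation: the facts, the role :analyst, and the function names are hypothetical, and only the five rules spelled out on the slides are encoded (the elided "..." rules are left out).

```python
# Facts are (subject, predicate, object) triples; every individual is hypothetical.
FACTS = {
    (":user1", "type", ":User"),
    (":user2", "type", ":User"),
    (":user1", ":deleteUserPermission", ":anySource"),
    (":user2", ":permissionRole", ":analyst"),
    (":analyst", ":readPermission", ":source1"),
    (":user2", ":resourceCreator", ":query1"),
    (":source1", "type", ":DataSource"),
    (":source2", "type", ":PublicResource"),
}


def apply_rules(facts):
    """Apply each rule from the slides once and return the derived triples."""
    derived = set()
    users = {s for s, p, o in facts if p == "type" and o == ":User"}
    for s, p, o in facts:
        if p == ":permissionRole":  # role-based read permission
            derived |= {(s, ":readUserPermission", res)
                        for role, q, res in facts
                        if role == o and q == ":readPermission"}
        elif p == "type" and o == ":PublicResource":  # public resources
            derived |= {(u, ":readUserPermission", s) for u in users}
        elif p == ":deleteUserPermission" and o == ":anySource":  # wildcard
            derived |= {(s, ":deleteUserPermission", src)
                        for src, q, cls in facts
                        if q == "type" and cls == ":DataSource"}
        elif p == ":resourceCreator":  # creators get all permissions
            derived.add((s, ":allUserPermissions", o))
        elif p == ":allUserPermissions":  # all permissions imply read
            derived.add((s, ":readUserPermission", o))
    return derived


def materialize(facts):
    """Forward-chain until no rule produces a new triple (a fixpoint)."""
    closure = set(facts)
    while True:
        new = apply_rules(closure) - closure
        if not new:
            return closure
        closure |= new


closure = materialize(FACTS)
print((":user1", ":deleteUserPermission", ":source1") in closure)
print((":user2", ":readUserPermission", ":query1") in closure)
```

The second check needs two rule applications (creator, then all permissions, then read), which is why the loop runs to a fixpoint instead of applying the rules once.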
Impact of Reasoning
Can user :user1 delete resource :source1?
     ASK WHERE {
         { :user1 :deleteUserPermission :source1 . }
         UNION
         { :user1 :permissionRole ?role .
           ?role :deletePermission :source1 . }
         UNION
         { :user1 :resourceCreator :source1 . }
         UNION
         { :user1 :deleteUserPermission :anyResource . }
         UNION
         { :user1 :allUserPermissions :source1 . }
         UNION
         { ... }
         UNION
         ...
Impact of Reasoning
● Are you sure you're not missing anything?
● New awesome way of getting delete permissions
  you came up with yesterday
● Model knowledge where it belongs and let the
  reasoner do the work for you:
    ASK WHERE {
        { :user1 :deleteUserPermission :source1 . }
    }
Too much Inference?
When I say
   :deleteUserPermission domain :User
   :deleteUserPermission range :Resource
I mean that for every triple
  :user1 :deleteUserPermission :resource1
the individual :user1 must be an instance of
:User and :resource1 of :Resource.

But the reasoner doesn't find the error!!
Typing Constraint
Only users can have delete user permissions
  ● :deleteUserPermission domain :User
  ● :user1 :deleteUserPermission :resource1


              OWA                  CWA
Consistent    true                 false
Reason        Infer that           Assume that
              :user1 type :User    :user1 type not :User
CWA or OWA?
● Which one?
   ○ Of course use both!
● Some axioms should be interpreted under
  CWA
        :deleteUserPermission domain :User
● And others under OWA
        :SuperUser subClassOf :User
● So the right thing happens
        :user1 :deleteUserPermission :resource1
        :user1 type :SuperUser
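A rough Python sketch of mixing the two readings, under the assumption that the subClassOf axiom is used for inference (OWA) while the domain axiom is checked as a constraint (CWA). The function names and the second, untyped data set are hypothetical; this is an illustration, not how Stardog implements validation.

```python
SUBCLASS_OF = {":SuperUser": ":User"}                     # read under OWA
DOMAIN_CONSTRAINTS = {":deleteUserPermission": ":User"}   # read under CWA


def inferred_types(individual, triples):
    """Asserted types plus superclasses reached through SUBCLASS_OF."""
    types = {o for s, p, o in triples if s == individual and p == "type"}
    frontier = list(types)
    while frontier:
        parent = SUBCLASS_OF.get(frontier.pop())
        if parent is not None and parent not in types:
            types.add(parent)
            frontier.append(parent)
    return types


def domain_violations(triples):
    """Triples whose subject is not known to belong to the required domain."""
    return [
        (s, p, o)
        for s, p, o in triples
        if p in DOMAIN_CONSTRAINTS
        and DOMAIN_CONSTRAINTS[p] not in inferred_types(s, triples)
    ]


valid = [
    (":user1", ":deleteUserPermission", ":resource1"),
    (":user1", "type", ":SuperUser"),
]
untyped = valid[:1]  # same permission triple, but no type information at all

print(domain_violations(valid))    # no violation: :SuperUser implies :User
print(domain_violations(untyped))  # the permission triple is reported
```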
SLDC for Data Integration
● SLDC provides descriptions of data sources,
  relationships between them, and information
  to query them
● We can treat data sources as an integrated
  single data source
    ○ Distributed querying
    ○ AI analytics
● Virtual, materialized, hybrid
Mappings
● Simple
   ○ pops:Employee subClassOf foaf:Person
   ○ pops:Project equivalentTo foaf:Project
   ○ pops:hasEmployee subPropertyOf foaf:member
● SWRL-Based
   ○ pops:firstName(?person, ?first),
     pops:lastName(?person, ?last),
     swrlb:concat(?name, ?first, " ", ?last) ->
     foaf:name(?person, ?name)
   ○ pops:worksOnProject(?person,?project),
     pops:ActiveProject(?project) ->
     foaf:currentProject(?person,?project)
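To make the two SWRL-based mappings concrete, here is a small Python sketch that derives the foaf: triples from pops: triples. The sample individuals and literals are hypothetical, and the sketch assumes each person has at most one first name and one last name.

```python
# All individuals and literals below are hypothetical sample data.
TRIPLES = [
    (":alice", "pops:firstName", "Alice"),
    (":alice", "pops:lastName", "Smith"),
    (":alice", "pops:worksOnProject", ":p1"),
    (":alice", "pops:worksOnProject", ":p2"),
    (":p1", "type", "pops:ActiveProject"),
]


def apply_mappings(triples):
    """Derive foaf: triples from pops: triples using the two rules above."""
    first = {s: o for s, p, o in triples if p == "pops:firstName"}
    last = {s: o for s, p, o in triples if p == "pops:lastName"}
    active = {s for s, p, o in triples
              if p == "type" and o == "pops:ActiveProject"}
    derived = set()
    # firstName + lastName -> foaf:name (mirrors the swrlb:concat built-in)
    for person in first.keys() & last.keys():
        derived.add((person, "foaf:name", f"{first[person]} {last[person]}"))
    # worksOnProject on an active project -> foaf:currentProject
    for s, p, o in triples:
        if p == "pops:worksOnProject" and o in active:
            derived.add((s, "foaf:currentProject", o))
    return derived


for triple in sorted(apply_mappings(TRIPLES)):
    print(triple)
```

Only :p1 is mapped to foaf:currentProject, because :p2 is not asserted to be a pops:ActiveProject.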
Summing Up
● SLDC is a linked data catalog
    ○ Manage a variety of sources
    ○ Find sources
    ○ Query sources
● Implemented using Semantic Technologies
    ○ Reasoning
       ■ Axioms & Rules
    ○ Data validation
    ○ Data integration
Questions?
Why?
● Large organizations
   ○ Disparate departments
   ○ Independent, isolated sources
● Where is what?
   ○ Do we have a data source about clients?
   ○ Where is it?
● Who created what?
   ○ Who owns it?
● Who has access to what?
   ○ Do I have access to it?
   ○ Who do I talk to to get it?
Source Management
● Management
    ○ Create, delete, update, clone
● Import
    ○ RDF, HTML, XML
● Subscription
    ○ Endpoint location
● Categorization
    ○ Categories
    ○ External vocabularies
● Sharing
    ○ To specific users
    ○ Public
Querying Sources
● Querying metadata
    ○ Queries about the catalog itself
● External query
    ○ Querying a particular source
● Integrated query
    ○ Querying a set of integrated sources
● Query management
● Query sharing
● Results export
Finding Sources
● Browse
   ○ Facets
   ○ Pelorus
● Search
   ○ Text-based search
   ○ Rich query language
Last but not least
● NLP processing
   ○ Entity/Event extraction from natural language
     source descriptions
   ○ Better source classification & search
● Graph algorithms
   ○ What's the shortest path between these
     resources?
● Clustering
   ○ Can we discover similar sources based on
     given criteria?
Axioms
● It's not always about simple taxonomies...
● What about domain/range axioms?
   ○ :someProperty domain :SomeClass
   ○ :a :someProperty :b
   ○ :SomeClass(x)?
● What about complex subclass chains?
   ○ :SomeClass subClassOf :someProperty
     some :OtherClass
   ○ :someProperty some :OtherClass subClassOf
     :AnotherClass
   ○ :a type :SomeClass
   ○ :AnotherClass(x)?
● What about cardinality constraints, universal
  quantification, datatype reasoning, ...?
Data Validation
● Fundamental data management problem
   ○ Verify data integrity and correctness
   ○ Data corruption can lead to failures in applications, errors
     in decision making, security vulnerabilities, etc.
● Relevant in many scenarios
   ○ Storing data for stand-alone applications
   ○ Exchanging data in distributed settings
● For some use cases, data validation is critical but
  we still want to do it intelligently
Participation Constraint
Each resource must have been created by a user
 ● :Resource subClassOf inv(:resourceCreator) some
   :User
 ● :resource1 type :Resource

              OWA                                  CWA
Consistent    true                                 false
Reason        Infer that                           Assume that
              ● _:b :resourceCreator :resource1    _:b :resourceCreator :resource1
              ● _:b type :User                     is false
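Under the CWA reading, the participation constraint becomes a check over the explicit data. A minimal Python sketch, assuming a simple triple list; the function name and the individual :user1 are hypothetical.

```python
def participation_violations(triples):
    """Resources with no explicit :resourceCreator triple from a known :User."""
    users = {s for s, p, o in triples if p == "type" and o == ":User"}
    resources = {s for s, p, o in triples if p == "type" and o == ":Resource"}
    created = {o for s, p, o in triples
               if p == ":resourceCreator" and s in users}
    return sorted(resources - created)


incomplete = [(":resource1", "type", ":Resource")]
complete = incomplete + [
    (":user1", "type", ":User"),
    (":user1", ":resourceCreator", ":resource1"),
]

print(participation_violations(incomplete))  # :resource1 has no creator
print(participation_violations(complete))    # no violations
```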
Uniqueness Constraint
Each data source must belong to at most one
catalog entry
 ● :dataSource inverseFunctional
 ● :entry1 :dataSource :dataSource1
 ● :entry2 :dataSource :dataSource1

              OWA                       CWA
Consistent    true                      false
Reason        Infer that                Assume that
              :entry1 sameAs :entry2    :entry1 sameAs :entry2
                                        is false
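The two readings of the inverse-functional axiom can be contrasted in a few lines of Python. This is a sketch with hypothetical helper names: the OWA function reports the sameAs conclusions a reasoner would draw, and the CWA function reports the shared data source as a violation.

```python
from collections import defaultdict

TRIPLES = [
    (":entry1", ":dataSource", ":dataSource1"),
    (":entry2", ":dataSource", ":dataSource1"),
]


def entries_by_source(triples):
    """Group catalog entries by the data source they point to."""
    groups = defaultdict(set)
    for s, p, o in triples:
        if p == ":dataSource":
            groups[o].add(s)
    return groups


def owa_same_as(triples):
    """OWA reading: entries sharing a data source are inferred to be equal."""
    return sorted(tuple(sorted(entries))
                  for entries in entries_by_source(triples).values()
                  if len(entries) > 1)


def cwa_violations(triples):
    """CWA reading: a data source with several distinct entries is an error."""
    return sorted(source
                  for source, entries in entries_by_source(triples).items()
                  if len(entries) > 1)


print(owa_same_as(TRIPLES))     # :entry1 and :entry2 are treated as the same
print(cwa_violations(TRIPLES))  # :dataSource1 violates the constraint
```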
