Data Segmenting in Anzo


Contact:
Lee Feigenbaum
lee@cambridgesemantics.com


                                      ©2011 Cambridge Semantics Inc. All rights reserved.
Simple Introduction to Cambridge Semantics & Anzo

     • Cambridge Semantics is a software startup founded
       in 2007 by a team of engineers from IBM’s Advanced
       Internet Technology group
     • We sell the Anzo platform and tools to (mainly)
       Fortune 500 companies
     • Anzo is Semantic Web middleware that often stores
       large amounts of data for diverse uses




We Use Named Graphs

    • Primary tool for segmenting data in Anzo
    • Smallest unit of granularity for:
       –   Versioning & provenance
       –   Access control
       –   Notifications
       –   Replication
    • (Concretely: we use TriG extensively)




Which Triples Go Into a Named Graph?

    • Everything
       – Effectively a triple store
    • Single triple
       – Gives per-statement access control, etc.
    • Whatever was in the source document
       – OK in some cases, but documents are often an artificial
         construct
       – What happens when doing a bulk load of hundreds of
         millions of triples?
    • All triples that share a subject
       – Decent compromise / default state in our experience
    • Closure of triples from a given subject following
      predicates annotated as “internal”

Typical Anzo Data Segmenting

    [Diagram: example RDF data]
    • Pulp Fiction
       – debut showing: 10/14/1994
       – budget: $8,500,000
       – director: Tarantino
    • Tarantino
       – directed: Reservoir Dogs
       – birth date: 3/27/1963
       – full name: Quentin Jerome Tarantino
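
As a rough illustration of the two subject-based options from the previous slide, the sketch below groups the example data into one graph per subject and computes a closure over "internal" predicates. It is not Anzo code: the `ex:` identifiers, the function names, and the choice to name each graph after its subject are assumptions made for the example, and the closure rule shown (take every triple of each reached subject, follow only internal predicates) is one plausible reading of the slide.

```python
from collections import defaultdict

# Example data from the diagram; the ex: names are hypothetical identifiers.
TRIPLES = [
    ("ex:PulpFiction", "ex:debutShowing", "1994-10-14"),
    ("ex:PulpFiction", "ex:budget", "8500000"),
    ("ex:PulpFiction", "ex:director", "ex:Tarantino"),
    ("ex:Tarantino", "ex:directed", "ex:ReservoirDogs"),
    ("ex:Tarantino", "ex:birthDate", "1963-03-27"),
    ("ex:Tarantino", "ex:fullName", "Quentin Jerome Tarantino"),
]


def segment_by_subject(triples):
    """Group triples into one named graph per subject.

    Assumption: each graph is named after the subject it describes.
    """
    graphs = defaultdict(list)
    for s, p, o in triples:
        graphs[s].append((s, p, o))
    return dict(graphs)


def internal_closure(triples, root, internal_predicates):
    """Collect the triples of `root` plus those of every subject reachable
    from it by following only predicates listed as internal."""
    by_subject = segment_by_subject(triples)
    seen, stack, result = set(), [root], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in by_subject.get(node, []):
            result.append((s, p, o))
            if p in internal_predicates:
                stack.append(o)  # follow internal links only
    return result
```

With this data, `segment_by_subject(TRIPLES)` yields two graphs (one for the film, one for the director), while treating `ex:director` as internal would pull all six triples into the film's graph.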




Impact of Typical Anzo Data Segmenting

    • Many, many (millions of) small graphs
    • Often corresponds with the natural granularity at
      which you want to do things like
      permissions, versioning, alerting, etc.
    • Significant overhead for per-graph metadata
       – Sometimes encourages other partitioning schemes




Finding the Graph for a Particular Resource

    • Default case: graph name is the same as the resource
      name
       – Not kosher, but works well
    • Fallback case: system-wide SPARQL query
    • General case: graph resolution framework that can
      identify appropriate graph(s) via:
       – SPARQL DESCRIBE query (just kicks the can down the road
         a bit)
       – Lookup (registry)
       – Pattern matching (similar to POWDER)
    • (Graphs do not have to be local; sometimes
      resolution ends up retrieving them via HTTP or from
      an RDB)
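
A minimal sketch of how such a resolution chain might look, assuming the strategies are tried from cheapest to most expensive. The slide does not state an order, so the ordering, the function name `resolve_graphs`, and all example URIs are hypothetical. The system-wide query is represented by a plain callable; in a real store it could be something like `SELECT DISTINCT ?g WHERE { GRAPH ?g { <resource> ?p ?o } }`.

```python
import re


def resolve_graphs(resource, known_graphs, registry, patterns, system_wide_query):
    """Return candidate graph names for a resource.

    Assumed order: default case, registry lookup, pattern matching,
    then the system-wide query as a last resort.
    """
    # Default case: a graph with the same name as the resource exists.
    if resource in known_graphs:
        return [resource]
    # Lookup: an explicit resource-to-graphs registry.
    if resource in registry:
        return list(registry[resource])
    # Pattern matching: derive the graph name from the resource URI.
    for regex, template in patterns:
        match = regex.match(resource)
        if match:
            return [match.expand(template)]
    # Fallback case: ask the whole system which graphs mention the resource.
    return list(system_wide_query(resource))


# Hypothetical configuration for the example.
KNOWN_GRAPHS = {"http://example.org/people/tarantino"}
REGISTRY = {
    "http://example.org/films/reservoir-dogs": ["http://example.org/graphs/films-1992"],
}
PATTERNS = [
    # Fragment identifiers resolve to the graph named by the base URI.
    (re.compile(r"^(http://example\.org/films/[^#]+)#.+$"), r"\1"),
]


def scan_everything(resource):
    """Stand-in for the system-wide SPARQL query."""
    if resource == "http://example.org/other":
        return ["http://example.org/graphs/misc"]
    return []
```

The pattern entry plays the role of the POWDER-like rule: any URI of the form `http://example.org/films/<name>#<fragment>` is assumed to live in the graph named by its base URI.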

Accessing Graphs

    • Replication service
       – Chunked to handle large graphs gracefully
       – Client replicas kept up to date via JMS-driven notification
         service
       – Replicas are cached aggressively – encourages smaller
         graphs to limit client memory footprint (e.g. in a Web
         browser)




Linked Data in Anzo

    • Data in Anzo can be exposed as linked data
    • Anzo will dereference external URIs to get at
      data, but that’s of limited utility
       – Allows single-instance views, but not faceted browsing
    • Anzo does not use linked data internally for data
      access
    • Linked Data consumption/publication is a
      feature, not a core part of Anzo’s architecture




Accessing Graphs

     • SPARQL queries
       – Clients (e.g. the Anzo on the Web faceted browser) target
         subsets of the server data with SPARQL queries
       – Impractical to enumerate millions of graphs in FROM or FROM
         NAMED clauses
       – Extend SPARQL with named datasets
          • Server-based lists of graphs that comprise an RDF dataset (default
            graph and named graphs)
          • Add FROM DATASET clause to reference named datasets from a
            query
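
To make the idea concrete, the sketch below expands a named dataset into the standard FROM and FROM NAMED clauses it stands for. This is only an illustration: the slide names the FROM DATASET clause but not its grammar, so the `FROM DATASET <uri>` form, the dataset registry layout, and all URIs are assumptions, and a real server would presumably resolve the dataset internally rather than rewrite query text.

```python
import re

# Hypothetical server-side registry: dataset URI -> graphs in the RDF dataset.
NAMED_DATASETS = {
    "http://example.org/datasets/films": {
        "default": ["http://example.org/films/pulp-fiction"],
        "named": ["http://example.org/people/tarantino"],
    },
}

# Assumed concrete syntax for the extension clause.
FROM_DATASET = re.compile(r"FROM\s+DATASET\s+<([^>]+)>", re.IGNORECASE)


def expand_named_datasets(query, datasets):
    """Rewrite each FROM DATASET clause into plain FROM / FROM NAMED clauses.

    Raises KeyError if the query references an unregistered dataset.
    """
    def replace(match):
        dataset = datasets[match.group(1)]
        clauses = [f"FROM <{g}>" for g in dataset["default"]]
        clauses += [f"FROM NAMED <{g}>" for g in dataset["named"]]
        return "\n".join(clauses)

    return FROM_DATASET.sub(replace, query)


QUERY = (
    "SELECT ?s\n"
    "FROM DATASET <http://example.org/datasets/films>\n"
    "WHERE { ?s ?p ?o }"
)
EXPANDED = expand_named_datasets(QUERY, NAMED_DATASETS)
```

With millions of graphs the expanded form is exactly what becomes impractical to write by hand, which is the motivation for keeping the list on the server and referring to it by name.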




Anzo and other Sem Web Technologies

     • Everything described in RDFS and OWL (used as a rich
       data modeling language mostly)
     • We publish RDFa
     • We use JSON serializations of SPARQL results and RDF
     • We implement SPARQL Update but don’t use it from our
       tools
     • SPARQL-based rules (used to be CONSTRUCT, now INSERT)
     • We use SPARQL ASK queries for transaction
       preconditions and validation
     • We have our own long-in-the-tooth implementation of
       the D2RQ mapping language that we don’t use often

This is the full architecture that drives the Anzo Server and applications.

These parts are driven primarily by SemWeb technologies.

These parts are driven primarily by quality software engineering.

We can’t & shouldn’t standardize everything.

     • Need to leave room for competitive differentiation
       that goes beyond simply who has the “best”
       implementation of a standard
     • For standardization work, take a disciplined approach
       to identifying what problems are both:
        – Costly (a.k.a. valuable to solve)
        – Impacting interoperability




What we could use

     • We often get asked “can we use your tools against
       <insert arbitrary SPARQL endpoint or linked data
       source here>?”
        – “No.”
     • We need standards for & adoption of:
        – Richly advertising contents of linked data sources
           • cf. VoID
        – Richly advertising capabilities of SPARQL endpoints
           • cf. SPARQL 1.1 Service Description and Basic Federated Query
        – Named datasets
        – Various other SPARQL extensions (though we can work
          around many of these)