Data Segmenting in Anzo

Contact:
Lee Feigenbaum
lee@cambridgesemantics.com

©2011 Cambridge Semantics Inc. All rights reserved.

Simple Introduction to Cambridge Semantics & Anzo

• Cambridge Semantics is a software startup founded
in 2007 by a team of engineers from IBM’s Advanced
Internet Technology group
• We sell the Anzo platform and tools to (mainly)
Fortune 500 companies
• Anzo is Semantic Web middleware that often stores
large amounts of data for diverse uses


We Use Named Graphs

• Primary tool for segmenting data in Anzo
• Smallest unit of granularity for:
– Versioning & provenance
– Access control
– Notifications
– Replication
• (Concretely: we use TriG extensively)


Which Triples Go Into a Named Graph?

• Everything in one graph
– Effectively a plain triple store
• A single triple per graph
– Gives per-statement access control, etc.
• Whatever was in the source document
– OK in some cases, but documents are often an artificial
construct
– What happens when doing a bulk load of hundreds of
millions of triples?
• All triples that share a subject
– Decent compromise / default state in our experience
• Closure of triples from a given subject, following
predicates annotated as “internal”
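
The last two strategies can be sketched in a few lines of Python. This is an illustrative sketch only, not Anzo's implementation: triples are plain tuples, every `ex:` name is made up, and the `INTERNAL` set stands in for whatever annotation marks a predicate as “internal”.

```python
from collections import defaultdict

# Hypothetical sample data: (subject, predicate, object) tuples with made-up ex: names.
TRIPLES = [
    ("ex:PulpFiction", "ex:budget", "8500000"),
    ("ex:PulpFiction", "ex:director", "ex:Tarantino"),
    ("ex:PulpFiction", "ex:debutShowing", "ex:Showing1"),
    ("ex:Showing1", "ex:date", "1994-10-14"),
    ("ex:Tarantino", "ex:fullName", "Quentin Jerome Tarantino"),
    ("ex:Tarantino", "ex:directed", "ex:ReservoirDogs"),
]

# Assumption: predicates whose objects belong in the same graph as the subject.
INTERNAL = {"ex:debutShowing"}


def by_subject(triples):
    """One named graph per subject; the graph is named after the subject."""
    graphs = defaultdict(list)
    for s, p, o in triples:
        graphs[s].append((s, p, o))
    return dict(graphs)


def by_internal_closure(triples, root, internal=INTERNAL):
    """Triples reachable from root by following only 'internal' predicates."""
    by_subj = by_subject(triples)
    graph, seen, todo = [], set(), [root]
    while todo:
        node = todo.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in by_subj.get(node, []):
            graph.append((s, p, o))
            if p in internal:
                todo.append(o)  # pull the object's own triples into this graph
    return graph
```

With this data, `by_subject` yields three graphs, while `by_internal_closure(TRIPLES, "ex:PulpFiction")` folds the showing into the film's graph and leaves the director in a graph of his own.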

Typical Anzo Data Segmenting

[Diagram: example RDF data about two films and their director]
• Pulp Fiction
– debut showing: 10/14/1994
– budget: $8,500,000
– director: Tarantino
• Tarantino
– full name: Quentin Jerome Tarantino
– birth date: 3/27/1963
– directed: Reservoir Dogs


Impact of Typical Anzo Data Segmenting

• Many (millions of) small graphs
• Often corresponds with the natural granularity at
which you want to do things like
permissions, versioning, alerting, etc.
• Significant overhead for per-graph metadata
– Sometimes encourages other partitioning schemes


Finding the Graph for a Particular Resource

• Default case: graph name is the same as the resource
name
– Not kosher, but works well
• Fallback case: system-wide SPARQL query
• General case: graph resolution framework that can
identify appropriate graph(s) via:
– SPARQL DESCRIBE query (just kicks the can down the road
a bit)
– Lookup (registry)
– Pattern matching (similar to POWDER)
• (Graphs do not have to be local; sometimes
resolution ends up retrieving them via HTTP or from
an RDB)
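
A resolution chain along these lines might look like the sketch below. It is a minimal illustration under stated assumptions, not Anzo's framework: the `GraphResolver` class, its method names, the ordering of the steps, and all `example.org` URIs are hypothetical, and the SPARQL-based fallbacks are left out because they need a query engine.

```python
import re


class GraphResolver:
    """Hypothetical resolver chain: registry lookup, then URI patterns, then default."""

    def __init__(self):
        self.registry = {}  # resource URI -> list of graph URIs
        self.patterns = []  # (compiled regex, graph URI template)

    def register(self, resource, graphs):
        self.registry[resource] = list(graphs)

    def add_pattern(self, regex, template):
        self.patterns.append((re.compile(regex), template))

    def resolve(self, resource):
        if resource in self.registry:
            return self.registry[resource]
        for regex, template in self.patterns:
            match = regex.fullmatch(resource)
            if match:
                return [match.expand(template)]
        # Default case: the graph is named after the resource itself.
        return [resource]


resolver = GraphResolver()
resolver.register(
    "http://example.org/film/PulpFiction",
    ["http://example.org/graphs/films"],
)
resolver.add_pattern(
    r"http://example\.org/person/(.+)",
    r"http://example.org/graphs/person/\1",
)
```

Putting the default case last keeps the cheap "graph name equals resource name" convention while still letting a registry entry or pattern override it for resources stored elsewhere.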

Accessing Graphs

• Replication service
– Chunked to handle large graphs gracefully
– Client replicas kept up to date via JMS-driven notification
service
– Replicas are cached aggressively – encourages smaller
graphs to limit client memory footprint (e.g. in a Web
browser)
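
The chunking idea can be illustrated with a small generator. This is a generic sketch, not Anzo's replication protocol: the function names, the default chunk size, and the use of a Python set as the client replica are all assumptions, and the JMS-driven notifications are not modeled.

```python
from itertools import islice


def chunked(statements, size=1000):
    """Yield successive lists of at most `size` statements."""
    iterator = iter(statements)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def replicate(statements, replica, size=1000):
    """Copy statements into a client-side replica (here a set) chunk by chunk."""
    transferred = 0
    for chunk in chunked(statements, size):
        replica.update(chunk)
        transferred += 1
    return transferred  # number of chunks sent
```

Because each chunk is materialized separately, the client never has to hold more than `size` in-flight statements on top of the replica itself.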


Linked Data in Anzo

• Data in Anzo can be exposed as linked data
• Anzo will dereference external URIs to get at
data, but that’s of limited utility
– Allows single-instance views, but not faceted browsing
• Anzo does not use linked data internally for data
access
• Linked Data consumption/publication is a
feature, not a core part of Anzo’s architecture


Accessing Graphs

• SPARQL queries
– Clients (e.g. Anzo on the Web faceted browser) target
subsets of the server data with SPARQL queries
– Impractical to enumerate millions of graphs in FROM or FROM
NAMED clauses
– Extend SPARQL with named datasets
• Server-based lists of graphs that comprise an RDF dataset (default
graph and named graphs)
• Add FROM DATASET clause to reference named datasets from a
query
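
The named-dataset idea can be sketched as a server-side registry plus a step that resolves dataset references before query evaluation. Everything here is an assumption made for illustration: the slides name the FROM DATASET clause but do not show its grammar, so the `FROM DATASET <uri>` form, the `NamedDataset` class, and the `example.org` URIs are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class NamedDataset:
    """Server-side list of graphs that together form an RDF dataset."""
    default_graphs: set = field(default_factory=set)
    named_graphs: set = field(default_factory=set)


# Hypothetical server-side registry keyed by dataset URI.
DATASETS = {
    "http://example.org/datasets/films": NamedDataset(
        default_graphs={
            "http://example.org/film/PulpFiction",
            "http://example.org/film/ReservoirDogs",
        },
        named_graphs={"http://example.org/person/Tarantino"},
    ),
}

# Assumed surface syntax for the extension clause.
FROM_DATASET = re.compile(r"FROM\s+DATASET\s+<([^>]+)>", re.IGNORECASE)


def resolve_dataset(query, registry=DATASETS):
    """Merge every dataset referenced by the query into one NamedDataset."""
    merged = NamedDataset()
    for uri in FROM_DATASET.findall(query):
        dataset = registry[uri]  # raises KeyError for an unknown dataset
        merged.default_graphs |= dataset.default_graphs
        merged.named_graphs |= dataset.named_graphs
    return merged


QUERY = """
SELECT ?film
FROM DATASET <http://example.org/datasets/films>
WHERE { ?film <http://example.org/director> ?d }
"""
```

The query text stays short no matter how many graphs the dataset lists, which is the point of keeping the graph lists on the server instead of in FROM or FROM NAMED clauses.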


Anzo and other Sem Web Technologies

• Everything is described in RDFS and OWL (used mostly
as a rich data modeling language)
• We publish RDFa
• We use JSON serializations of SPARQL results and RDF
• We implement SPARQL Update but don’t use it from our
tools
• SPARQL-based rules (used to be CONSTRUCT, now INSERT)
• We use SPARQL ASK queries for transaction
preconditions and validation
• We have our own long-in-the-tooth implementation of
the D2RQ mapping language that we don’t use often


This is the full architecture that drives the Anzo
Server and applications.

[Architecture diagram with two callouts]
• These parts are driven primarily by SemWeb technologies.
• These parts are driven primarily by quality software engineering.

We can’t & shouldn’t standardize everything.

• Need to leave room for competitive differentiation
that goes beyond simply who has the “best”
implementation of a standard
• For standardization work, take a disciplined approach
to identifying what problems are both:
– Costly (a.k.a. valuable to solve)
– Impacting interoperability


What we could use

• We often get asked “can we use your tools against
<insert arbitrary SPARQL endpoint or linked data
source here>?”
– “No.”
• We need standards for & adoption of:
– Richly advertising contents of linked data sources
• cf. VoID
– Richly advertising capabilities of SPARQL endpoints
• cf. SPARQL 1.1 Service Description and Basic Federated Query
– Named datasets
– Various other SPARQL extensions (though we can work
around many of these)
