
Challenges and patterns for semantics at scale

Discusses some of the challenges around applying semantics at scale (tens of billions of triples and larger). Describes some of the patterns that can be used to meet those challenges.


  1. C O M P U T E | S T O R E | A N A L Y Z E
     Challenges and Patterns for Semantics at Scale
     Rob Vesse @RobVesse
  2. Overview
     ● Background
     ● Challenges & Patterns
       ● Obtaining Data
       ● Input Format
       ● Blank Nodes
       ● Graph Partitioning
       ● Benchmarking
  3. Background
     ● PhD in Computer Science
     ● Open Source
       ● Apache Jena
       ● dotNetRDF
     ● Software Engineer at Cray Inc
       ● In Analytics R&D for the last 5 years
     ● Cray sells a range of analytics products
       ● Cray Graph Engine
         ● Massively scalable parallel RDF database and SPARQL engine
         ● Runs on GX and XC hardware platforms
         ● GX nodes are roughly equivalent to an r3.8xlarge EC2 instance
  4. Background - Terminology
     ● What do we mean by at scale?
       ● Typical customers have 10s of billions of triples
       ● Some are around the 100 billion mark
     ● What do we mean by parallelism?
       ● On node, i.e. multiple threads/processes
       ● Across nodes, i.e. multiple machines
  5. Challenge #1 - Obtaining Data
     ● Most data does not start out as RDF
       ● Relational databases, spreadsheets, structured/semi-structured data, flat files etc.
       ● It varies depending on customer domain
     ● Therefore the first challenge is to get the data into RDF
     ● Problems
       ● Many ETL tools don't support RDF as an output format
       ● Even if tools do support it they are not scalable, e.g. D2RQ
  6. Pattern #1 - Leverage Big Data
     ● Lots of big data projects can be used to implement ETL pipelines
       ● E.g. MapReduce, Spark, Flume, Sqoop
     ● There are some libraries available that provide basic plumbing for this
       ● E.g. Apache Jena Elephas
     ● Unfortunately ETL tends to be very customer and data specific
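A minimal sketch of such an ETL map step, converting tabular records into NTriples lines so they can be emitted independently by parallel workers. The column names (`id`, `name`, `age`) and the `http://example.org/` vocabulary are hypothetical, chosen only for illustration:

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace for the example

def row_to_ntriples(row, base=BASE):
    """Map one tabular record to a list of NTriples lines.

    Stateless per-record mapping like this is what makes the step
    trivially parallelisable in MapReduce/Spark-style pipelines.
    """
    subject = f"<{base}person/{row['id']}>"
    return [
        f'{subject} <{base}name> "{row["name"]}" .',
        f'{subject} <{base}age> "{row["age"]}"^^<http://www.w3.org/2001/XMLSchema#integer> .',
    ]

# Stand-in for a real relational/CSV source
data = io.StringIO("id,name,age\n1,Alice,34\n2,Bob,29\n")
triples = [line for row in csv.DictReader(data) for line in row_to_ntriples(row)]
print("\n".join(triples))
```

In a real pipeline `row_to_ntriples` would be the map function handed to the framework; as the slide notes, the mapping itself is inevitably customer and data specific.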
  7. Challenge #2 - Input Format
     ● What data format should we be using?
     ● There are at least four widely used standard serialisations:
       ● NTriples/NQuads, Turtle/TriG, RDF/XML and JSON-LD
       ● Plus a variety of lesser used formats, e.g. TriX, RDF/JSON, HDT, RDF/Thrift, Sesame Binary RDF etc.
     ● Choice of format affects how you process it
       ● Parallel processing
       ● Error tolerance
       ● State tracking
  8. Pattern #2 - Use NTriples/NQuads
     ● Simple but effective
     ● Can be arbitrarily split into chunks
       ● E.g. pick some number of bytes, split into chunks, seek from chunk boundaries to find actual line boundaries, process line by line
     ● Extremely error tolerant
       ● Every line can be processed independently without needing any shared state
     ● Even this has challenges:
       ● Verbose format, so large datasets require extremely large files
       ● Blank nodes can still be problematic
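The chunk-splitting idea above can be sketched in a few lines: pick a nominal chunk size, then slide each chunk's end forward to the next newline so every chunk contains only whole triples. This is an illustrative sketch, not the implementation any particular framework uses:

```python
import os

def chunk_boundaries(path, chunk_size):
    """Yield (start, end) byte ranges aligned to NTriples line boundaries."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        start = 0
        while start < size:
            end = min(start + chunk_size, size)
            if end < size:
                f.seek(end)
                f.readline()      # advance past the partial line
                end = f.tell()    # real boundary: start of the next line
            yield (start, end)
            start = end

def process_chunk(path, start, end):
    """Read one chunk; every line is a complete, independent triple."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).splitlines()
```

Because the ranges never split a line, each `(start, end)` pair can be handed to a separate worker (or machine) with no shared parser state, which is exactly why line-oriented NTriples/NQuads suit parallel processing.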
  9. Challenge #3 - Blank Node Identifiers
     ● Specifications say that a blank node identifier is file scoped
       ● I.e. _:foo in a.nt is a different node from _:foo in b.nt
       ● And _:foo is the same node throughout a.nt
     ● Need to consistently assign identifiers despite processing the data in chunks on different physical nodes
       ● Preferably without resorting to global state/synchronisation

     a.nt:
       <urn:a> <urn:link> _:foo .
       _:foo <urn:link> <urn:b> .
       # Many 100,000s of lines later
       <urn:z> <urn:link> _:foo .

     b.nt:
       _:foo <urn:value> "example" .
       _:bar <urn:value> "other" .
  10. Pattern #3 - Derived Blank Node Identifiers
      ● Derive identifiers from a combination of their local identifier and a scope identifier
        ● E.g. _:foo and a.nt
      ● Derivation method doesn't matter provided it is:
        ● Scope aware
        ● Deterministic
      ● Some possibilities:
        ● One-way hash, e.g. MD5
        ● Mathematical transform
        ● Seeded random number generator (RNG)
      ● Apache Jena uses a seeded RNG
        ● Scope awareness achieved by seeding the RNG based upon the filename
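The one-way-hash variant of this pattern is simple to sketch: hash the (scope, local identifier) pair into a new label. The function name and label format here are illustrative, not Jena's actual scheme (which, as noted, uses a seeded RNG instead):

```python
import hashlib

def derive_bnode_label(local_id, scope):
    """Derive a stable blank node label from (scope, local identifier).

    Deterministic: any worker, on any machine, maps the same pair to the
    same label with no coordination. Scope aware: the same local
    identifier in different files yields different labels.
    """
    digest = hashlib.md5(f"{scope}|{local_id}".encode()).hexdigest()
    return f"_:b{digest}"
```

Determinism is what removes the need for global state: two chunks of a.nt processed on different physical nodes still agree on the label for `_:foo`, while b.nt's `_:foo` stays distinct.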
  11. Challenge #4 - Graph Partitioning
      ● Open problem
        ● NP-hard
      ● Large graphs are never going to be processable on a single node
        ● Need to partition across multiple nodes
      ● Partitioning affects both storage and processing of a graph
        ● May need different schemes depending on desired processing
  12. Pattern #4 - Domain Specific/Avoid It!
      ● For specific workloads a domain specific partitioning will be best
        ● Needs knowledge of data and workload
        ● E.g. Educating the Planet with Pearson
      ● If you can, then avoid it!
        ● Take advantage of increasingly capable hardware
        ● Large memory sizes, non-volatile memory, RDMA, high speed interconnects, SSDs
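To make the trade-off concrete, here is one of the simplest possible schemes: hash each triple's subject to pick a node. This is an illustrative sketch only, not the Cray Graph Engine's layout; it keeps every resource's outgoing edges together (good for star-shaped lookups) but says nothing about path-heavy workloads, which is exactly why the slide argues for domain-specific schemes:

```python
import hashlib

def node_for_subject(subject, num_nodes):
    """Assign a triple to a cluster node by hashing its subject.

    All triples sharing a subject land on the same node, so
    subject-centric queries touch a single partition; traversals
    crossing subjects still need inter-node communication.
    """
    digest = hashlib.sha1(subject.encode()).hexdigest()
    return int(digest, 16) % num_nodes

triples = [
    ("<urn:a>", "<urn:link>", "<urn:b>"),
    ("<urn:a>", "<urn:link>", "<urn:c>"),
    ("<urn:z>", "<urn:link>", "<urn:a>"),
]
placement = {t: node_for_subject(t[0], 4) for t in triples}
```

Swapping the hashed key (subject, predicate, or a domain-specific grouping attribute) yields different partitionings with different query trade-offs, which is the sense in which "may need different schemes depending on desired processing" applies.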
  13. Challenge #5 - Benchmarking
      ● Many of the classic benchmarks were developed by academics
        ● E.g. LUBM, SP2B
        ● Often aren't representative of actual customer problems
      ● Many data generators are single threaded
        ● Difficult to generate large-scale datasets
  14. Pattern #5 - Change Benchmarks
      ● Linked Data Benchmark Council (LDBC)
        ● Industry working group that develops standardised benchmarks
        ● Equivalent to the Transaction Processing Council (TPC) in the relational database industry
      ● Design your own
      ● Improve an existing one
        ● LUBM 8k (~1 billion triples) can be generated in under 7 minutes, which is a 10x speed up
  15. Questions?
      @RobVesse