C O M P U T E | S T O R E | A N A L Y Z E
Challenges and Patterns for
Semantics at Scale
Rob Vesse
rvesse@cray.com
@RobVesse
Overview
● Background
● Challenges & Patterns
● Obtaining Data
● Input Format
● Blank Nodes
● Graph Partitioning
● Benchmarking
Background
● PhD in Computer Science
● Open Source
● Apache Jena
● dotNetRDF
● Software Engineer at Cray Inc
● In Analytics R&D
● Last 5 years
● Cray sells a range of analytics products
● Cray Graph Engine
● Massively scalable parallel RDF database and SPARQL engine
● Runs on GX and XC hardware platforms
● GX nodes are roughly equivalent to an r3.8xlarge EC2 instance
Background - Terminology
● What do we mean by at scale?
● Typical customers have 10s of billions of triples
● Some are around the 100 billion mark
● What do we mean by parallelism?
● On node i.e. multiple threads/processes
● Across nodes i.e. multiple machines
Challenge #1 - Obtaining Data
● Most Data does not start out as RDF
● Relational databases, spreadsheets, structured/semi-structured
data, flat files etc.
● It varies depending on customer domain
● Therefore the first challenge is to get the data into RDF
● Problems
● Many ETL tools don't support it as an output format
● Even if tools do support it they are not scalable
● E.g. D2RQ (http://d2rq.org)
Pattern #1 - Leverage Big Data
● Lots of big data projects can be used to implement ETL
pipelines
● E.g. Map Reduce, Spark, Flume, Sqoop
● There are some libraries available that provide basic
plumbing for this e.g.
● Apache Jena Elephas
● http://jena.apache.org/documentation/hadoop/index.html
● Unfortunately ETL tends to be very customer and data
specific
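As a rough illustration of the per-record step such a pipeline performs, the sketch below turns one tabular row into N-Triples lines. It is not taken from the talk or from Elephas: the function names, the http://example.org/ base IRI and the id/ and col/ path segments are all assumptions chosen for the example, and every value is emitted as a plain string literal.

```python
from urllib.parse import quote

def escape_literal(value):
    """Escape the characters that may not appear raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

def row_to_ntriples(row, key_column, base="http://example.org/"):
    """Convert one row (a dict of column -> value) into N-Triples lines.

    The base IRI and the id/ and col/ segments are hypothetical; a real
    mapping would be customer and data specific.
    """
    subject = "<" + base + "id/" + quote(str(row[key_column]), safe="") + ">"
    lines = []
    for column, value in row.items():
        # Skip the key itself and any empty cells
        if column == key_column or value is None or value == "":
            continue
        predicate = "<" + base + "col/" + quote(column, safe="") + ">"
        lines.append('%s %s "%s" .' % (subject, predicate, escape_literal(str(value))))
    return lines
```

Because each row is converted without reference to any other row, a function like this could serve as the map step of a Map Reduce or Spark job.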
Challenge #2 - Input Format
● What data format should we be using?
● There are at least four widely used standard
serialisations:
● NTriples/NQuads, Turtle/TriG, RDF/XML and JSON-LD
● Plus the variety of lesser used formats e.g. TriX, RDF/JSON,
HDT, RDF/Thrift, Sesame Binary RDF etc
● Choice of format affects how you process it
● Parallel processing
● Error Tolerance
● State Tracking
Pattern #2 - Use NTriples/NQuads
● Simple but effective
● Can be arbitrarily split into chunks
● E.g. Pick some number of bytes, split into chunks, seek from
chunk boundaries to find actual line boundaries, process line by
line
● Extremely error tolerant
● Every line can be processed independently without needing
any shared state
● Even this has challenges:
● Verbose format so large datasets require extremely large files
● Blank nodes can still be problematic
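The chunking approach described above can be sketched as follows. This is a minimal single-machine illustration, not code from any particular library; the function names chunk_ranges and read_chunk_lines are assumptions. The convention used is that a line belongs to the chunk containing its first byte, so each worker skips any partial line at its start and reads past its end to finish its last line.

```python
def chunk_ranges(size, chunk_bytes):
    """Yield (start, end) byte ranges that together cover [0, size)."""
    for start in range(0, size, chunk_bytes):
        yield start, min(start + chunk_bytes, size)

def read_chunk_lines(path, start, end):
    """Return the complete lines whose first byte lies in [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Look at the byte just before the chunk. If it is not a
            # newline, the chunk starts mid-line; that line belongs to
            # the previous chunk, so skip to the next line boundary.
            f.seek(start - 1)
            if f.read(1) != b"\n":
                f.readline()
        # Read whole lines while the line start is inside the chunk;
        # the last line may run past `end`, which is intentional.
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\r\n"))
    return lines
```

Each (start, end) pair can be handed to a different thread, process or machine, and every line of the file is processed exactly once without any coordination between workers.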
Challenge #3 - Blank Node Identifiers
● Specifications say that a blank node
identifier is file scoped
● I.e. _:foo in a.nt is a different node from
_:foo in b.nt
● And _:foo is the same node throughout
a.nt
● Need to consistently assign identifiers
despite processing the data in chunks on
different physical nodes
● Preferably without resorting to global
state/synchronisation
a.nt:
<urn:a> <urn:link> _:foo .
_:foo <urn:link> <urn:b> .
# Many 100,000s of lines later
<urn:z> <urn:link> _:foo .
b.nt:
_:foo <urn:value> "example" .
_:bar <urn:value> "other" .
Pattern #3 - Derived Blank Node Identifiers
● Derive identifiers from a combination of their local
identifier and a scope identifier
● E.g. _:foo and a.nt
● Derivation method doesn't matter provided it is:
● Scope aware
● Deterministic
● Some possibilities:
● One-way hash e.g. MD5
● Mathematical transform
● Seeded random number generator (RNG)
● Apache Jena uses seeded RNG
● Scope awareness achieved by seeding the RNG based upon
the filename
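A minimal sketch of the one-way hash option listed above, assuming the filename is used as the scope identifier. The function name derive_bnode_id and the "b" prefix are choices made for this example. Note that this is the hash variant, not the seeded RNG approach that Apache Jena uses.

```python
import hashlib

def derive_bnode_id(scope, label):
    """Map (scope, local label) to a stable, globally usable blank node label.

    scope: an identifier for the file, e.g. "a.nt" (assumed here)
    label: the local label without the "_:" prefix, e.g. "foo"
    """
    # A NUL separator keeps distinct (scope, label) pairs from
    # concatenating to the same byte string.
    key = scope.encode("utf-8") + b"\x00" + label.encode("utf-8")
    # The "b" prefix guarantees the label starts with a letter.
    return "_:b" + hashlib.md5(key).hexdigest()
```

Any worker that sees _:foo while processing a chunk of a.nt computes the same identifier, and a worker processing b.nt computes a different one, so no global state or synchronisation is needed.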
Challenge #4 - Graph Partitioning
● Open Problem
● NP-hard
● Large graphs are never going to be processable on a
single node
● Need to partition across multiple nodes
● Partitioning affects both storage and processing of a
graph
● May need different schemes depending on desired processing
Pattern #4 - Domain Specific/Avoid It!
● For specific workloads a domain specific partitioning will
be best
● Needs knowledge of data and workload
● E.g. Educating the Planet with Pearson
● If you can then avoid it!
● Take advantage of increasingly capable hardware
● Large memory sizes, non-volatile memory, RDMA, high speed
interconnects, SSDs
Challenge #5 - Benchmarking
● Many of the classic benchmarks were developed by
academics
● E.g. LUBM, SP2B
● Often aren’t representative of actual customer problems
● Many data generators are single threaded
● Difficult to generate large-scale datasets
Pattern #5 - Change Benchmarks
● Linked Data Benchmark Council (LDBC)
● Industry working group that develops standardised benchmarks
● Equivalent to Transaction Processing Council (TPC) in
relational database industry
● http://ldbcouncil.org
● Design your own
● https://github.com/rvesse/sparql-query-bm
● Improve an existing one
● https://github.com/rvesse/lubm-uba
● LUBM 8k (~ 1 Billion Triples) can be generated in under 7
minutes which is a 10x speed up
Questions?
rvesse@cray.com
@RobVesse

Challenges and patterns for semantics at scale

  • 1. C O M P U T E | S T O R E | A N A L Y Z E Challenges and Patterns for Semantics at Scale Rob Vesse rvesse@cray.com @RobVesse
  • 2. C O M P U T E | S T O R E | A N A L Y Z E Overview ● Background ● Challenges & Patterns ● Obtaining Data ● Input Format ● Blank Nodes ● Graph Partitioning ● Benchmarking
  • 3. C O M P U T E | S T O R E | A N A L Y Z E Background ● PhD in Computer Science ● Open Source ● Apache Jena ● dotNetRDF ● Software Engineer at Cray Inc ● In Analytics R&D ● Last 5 years ● Cray sells a range of analytics products ● Cray Graph Engine ● Massively scalable parallel RDF database and SPARQL engine ● Runs on GX and XC hardware platforms ● GX nodes are roughly equivalent to r3.8xlarge EC2 instance
  • 4. C O M P U T E | S T O R E | A N A L Y Z E Background - Terminology ● What do we mean by at scale? ● Typical customers have 10s of billions of triples ● Some are around the 100 billion mark ● What do we mean by parallelism? ● On node i.e. multiple threads/processes ● Across nodes i.e. multiple machines
  • 5. Challenge #1 - Obtaining Data
    ● Most data does not start out as RDF
    ● Relational databases, spreadsheets, structured/semi-structured data, flat files etc.
    ● It varies depending on customer domain
    ● Therefore the first challenge is to get the data into RDF
    ● Problems
    ● Many ETL tools don't support RDF as an output format
    ● Even when tools do support it, they are not scalable
    ● E.g. D2RQ (http://d2rq.org)
  • 6. Pattern #1 - Leverage Big Data
    ● Lots of big data projects can be used to implement ETL pipelines
    ● E.g. MapReduce, Spark, Flume, Sqoop
    ● There are some libraries available that provide basic plumbing for this, e.g.
    ● Apache Jena Elephas
    ● http://jena.apache.org/documentation/hadoop/index.html
    ● Unfortunately ETL tends to be very customer and data specific
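As a rough illustration of the kind of per-record step such a pipeline would run, here is a minimal sketch that maps tabular rows to N-Triples lines. It is not taken from the slides: the `http://example.org/` namespace, the `record/` and `column/` IRI layout, and the function names are all hypothetical, and every value is emitted as a plain string literal. Because each row is converted independently, the same function could in principle serve as the map stage of a MapReduce or Spark job.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace; a real pipeline would use the customer's own vocabulary.
BASE = "http://example.org/"


def escape_literal(value):
    """Escape the characters N-Triples does not allow raw inside a literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def row_to_ntriples(row, key_column):
    """Map one tabular record to N-Triples lines, one triple per non-empty column."""
    subject = f"<{BASE}record/{quote(row[key_column], safe='')}>"
    lines = []
    for column, value in row.items():
        if column == key_column or not value:
            continue  # the key becomes the subject; empty cells produce no triple
        predicate = f"<{BASE}column/{quote(column, safe='')}>"
        lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


def csv_to_ntriples(text, key_column):
    """Apply the per-row mapping to every record of a CSV document."""
    out = []
    for row in csv.DictReader(io.StringIO(text)):
        out.extend(row_to_ntriples(row, key_column))
    return out
```

The mapping itself (which column is the key, which predicates to use, which datatypes apply) is exactly the customer and data specific part that the slide warns about.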
  • 7. Challenge #2 - Input Format
    ● What data format should we be using?
    ● There are at least four widely used standard serialisations:
    ● NTriples/NQuads, Turtle/TriG, RDF/XML and JSON-LD
    ● Plus a variety of lesser used formats e.g. TriX, RDF/JSON, HDT, RDF/Thrift, Sesame Binary RDF etc.
    ● Choice of format affects how you process it
    ● Parallel processing
    ● Error Tolerance
    ● State Tracking
  • 8. Pattern #2 - Use NTriples/NQuads
    ● Simple but effective
    ● Can be arbitrarily split into chunks
    ● E.g. pick some number of bytes, split into chunks, seek from chunk boundaries to find actual line boundaries, process line by line
    ● Extremely error tolerant
    ● Every line can be processed independently without needing any shared state
    ● Even this has challenges:
    ● Verbose format so large datasets require extremely large files
    ● Blank nodes can still be problematic
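The chunking idea in Pattern #2 can be sketched in a few lines of Python. This is an illustrative sketch, not the Cray or Jena implementation: the function names `aligned_boundaries` and `read_chunk` are made up here, and it assumes a UTF-8 file with one statement per line. Nominal boundaries are placed every `chunk_size` bytes, and each one is then moved forward to the end of the line it falls in, so that no line is split between two chunks.

```python
import os


def aligned_boundaries(path, chunk_size):
    """Return byte offsets that cut the file into chunks ending on line boundaries."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    size = os.path.getsize(path)
    boundaries = [0]
    with open(path, "rb") as f:
        for offset in range(chunk_size, size, chunk_size):
            f.seek(offset)
            f.readline()  # skip to the end of the line containing `offset`
            pos = f.tell()
            if pos > boundaries[-1]:  # a very long line may swallow several offsets
                boundaries.append(pos)
    if boundaries[-1] < size:
        boundaries.append(size)
    return boundaries


def read_chunk(path, start, end):
    """Yield the lines of one chunk; needs no state shared with other chunks."""
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            yield f.readline().decode("utf-8").rstrip("\r\n")
```

Each `(start, end)` pair taken from consecutive boundaries could be handed to a different thread, process or machine, and a malformed line only affects the worker that reads it.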
  • 9. Challenge #3 - Blank Node Identifiers
    ● Specifications say that a blank node identifier is file scoped
    ● I.e. _:foo in a.nt is a different node from _:foo in b.nt
    ● And _:foo is the same node throughout a.nt
    ● Need to consistently assign identifiers despite processing the data in chunks on different physical nodes
    ● Preferably without resorting to global state/synchronisation
    a.nt:
      <urn:a> <urn:link> _:foo .
      _:foo <urn:link> <urn:b> .
      # Many 100,000s of lines later
      <urn:z> <urn:link> _:foo .
    b.nt:
      _:foo <urn:value> "example" .
      _:bar <urn:value> "other" .
  • 10. Pattern #3 - Derived Blank Node Identifiers
    ● Derive identifiers from a combination of their local identifier and a scope identifier
    ● E.g. _:foo and a.nt
    ● Derivation method doesn't matter provided it is:
    ● Scope aware
    ● Deterministic
    ● Some possibilities:
    ● One-way hash e.g. MD5
    ● Mathematical transform
    ● Seeded random number generator (RNG)
    ● Apache Jena uses seeded RNG
    ● Scope awareness achieved by seeding the RNG based upon the filename
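A minimal sketch of the one-way hash option from Pattern #3 is shown below. Note that this is the MD5 possibility listed on the slide, not the seeded RNG approach that Apache Jena uses, and the function name `derive_bnode_id` and the `b` prefix are hypothetical choices made for this example. The derived label depends only on the scope (here the filename) and the local label, so any worker on any node computes the same result without coordination.

```python
import hashlib


def derive_bnode_id(scope, local_id):
    """Derive a global blank node label from a scope (e.g. filename) and a local label."""
    # The NUL separator keeps ("a", "bc") and ("ab", "c") from hashing the same input.
    digest = hashlib.md5(f"{scope}\x00{local_id}".encode("utf-8")).hexdigest()
    # Prefix with a letter so the label stays valid even if the digest starts with a digit-only run.
    return f"_:b{digest}"


# Same scope and label always give the same node; a different scope gives a different node.
same = derive_bnode_id("a.nt", "foo") == derive_bnode_id("a.nt", "foo")
different = derive_bnode_id("a.nt", "foo") != derive_bnode_id("b.nt", "foo")
```

Because the derivation is deterministic and scope aware, `_:foo` seen in two different chunks of a.nt maps to one node, while `_:foo` in b.nt maps to another.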
  • 11. Challenge #4 - Graph Partitioning
    ● Open Problem
    ● NP-hard
    ● Large graphs are never going to be processable on a single node
    ● Need to partition across multiple nodes
    ● Partitioning affects both storage and processing of a graph
    ● May need different schemes depending on desired processing
  • 12. Pattern #4 - Domain Specific/Avoid It!
    ● For specific workloads a domain specific partitioning will be best
    ● Needs knowledge of data and workload
    ● E.g. Educating the Planet with Pearson
    ● If you can, then avoid it!
    ● Take advantage of increasingly capable hardware
    ● Large memory sizes, non-volatile memory, RDMA, high speed interconnects, SSDs
  • 13. Challenge #5 - Benchmarking
    ● Many of the classic benchmarks were developed by academics
    ● E.g. LUBM, SP2B
    ● Often aren't representative of actual customer problems
    ● Many data generators are single threaded
    ● Difficult to generate large-scale datasets
  • 14. Pattern #5 - Change Benchmarks
    ● Linked Data Benchmark Council (LDBC)
    ● Industry working group that develops standardised benchmarks
    ● Equivalent to the Transaction Processing Performance Council (TPC) in the relational database industry
    ● http://ldbcouncil.org
    ● Design your own
    ● https://github.com/rvesse/sparql-query-bm
    ● Improve an existing one
    ● https://github.com/rvesse/lubm-uba
    ● LUBM 8k (~1 billion triples) can be generated in under 7 minutes, which is a 10x speedup
  • 15. Questions?
    rvesse@cray.com
    @RobVesse