…or how to query an RDF graph with 28 billion triples on a standard laptop
These slides correspond to my talk at the Stanford Center for Biomedical Informatics on 25th April 2018.
Democratizing Big Semantic Data management
1. Democratizing Big Semantic Data management
Javier D. Fernández
WU Vienna, Austria
Complexity Science Hub Vienna, Austria
Privacy and Sustainable Computing Lab, Austria
STANFORD CENTER FOR BIOMEDICAL INFORMATICS
APRIL 25TH, 2018
or how to query an RDF graph with 28 billion triples on a standard laptop
2. The Linked Open Data cloud (2018)
~10K datasets organized into 9 domains covering many and varied knowledge fields.
150B statements, including entity descriptions and (inter/intra-dataset) links between them.
>500 live endpoints serving this data.
http://lod-cloud.net/
http://stats.lod2.eu/
http://sparqles.ai.wu.ac.at/
3. But what about Web-scale queries?
E.g. retrieve all entities in LOD referring to the gene WBGene00000001 (aap-1):

select distinct ?x {
  ?x dcterms:title "WBGene00000001 (aap-1)" .
}

Solutions?
5. A) Federated Queries!!
1. Get a list of potential SPARQL endpoints
datahub.io, LOV, other catalogs?
2. Query each SPARQL endpoint
Problems?
Many SPARQL endpoints have low availability (see http://sparqles.ai.wu.ac.at/)
SPARQL endpoints are usually restricted (timeouts, maximum #results)
Complex queries (joins) can be tricky due to intermediate results, delays, etc.
7. B) Follow-your-nose
1. Follow self-descriptive IRIs and links
2. Filter the results you are interested in
Problems?
You need some initial seed (DBpedia could be a good start)
It's slow (fetching many documents)
Where should I start for unbounded queries such as ?x dcterms:title "WBGene00000001 (aap-1)"?
8. C) Use the RDF dumps by yourself
1. Crawl the Web of Data
Probably start with datahub.io, LOV, other catalogs?
2. Download the datasets
You'd better have some free space on your machine
3. Index the datasets locally
You'd better be patient and survive the parsing errors
4. Query all datasets
You'd better still be alive by then
Problems?
Huge resources! + messiness of the data
Rietveld, L., Beek, W., & Schlobach, S. (2015). LOD Lab: Experiments at LOD scale. In ISWC.
9. Publication, Exchange and Consumption of large RDF datasets
Most RDF formats (N3, XML, Turtle) are text serializations, designed for human readability (not for machines).
Verbose = high costs to write/exchange/parse.
A basic offline search = (decompress) + index the file + search.
The problem is in the roots (big tree = big roots).
10. 1) HDT
A Linked Data hacker toolkit
Highly compact serialization of RDF.
Allows fast RDF retrieval in compressed space (without prior decompression).
Includes internal indexes to solve basic queries with a small (3%) memory footprint.
Very fast on basic queries (triple patterns): ~1.5x faster than Virtuoso, Jena, RDF3X.
Supports full SPARQL as the compressed backend store of Jena, with efficiency on the same scale as current, more optimized solutions.
Challenges:
The publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently!)
Example (~431M triples): 63 GB as N-Triples, 5 GB as NT + gzip, 6.6 GB as HDT (slightly more than gzip, but you can query!)
rdfhdt.org
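As a quick illustration of what "ready to consume" means, here is a minimal sketch using the community Python bindings for HDT (the pyHDT `hdt` package); the file name is hypothetical and the exact API may vary across versions:

from hdt import HDTDocument

doc = HDTDocument("dataset.hdt")                # hypothetical local HDT file
triples, cardinality = doc.search_triples("", "", "")   # "" acts as a wildcard
print(cardinality, "triples, resolved over the compressed file")
for s, p, o in triples:
    pass  # stream triples without ever materializing an uncompressed dump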
12. Header
Publication metadata: publisher, public endpoint, …
Statistical metadata: number of triples, subjects, entities, histograms…
Format metadata: description of the data structure, e.g. the triples order.
Additional metadata: domain-specific.
… all expressed in RDF.
16. Dictionary
Mapping of strings to correlative IDs {1..n}.
Lexicographically sorted, no duplicates.
Prefix-based compression in each section.
Efficient ID ↔ string operations.
17. Dictionary: Plain Front Coding (PFC)
Prefix-based compression is used in each section.
1. Each string is encoded with two values:
an integer representing the number of characters shared with the previous string, and
a sequence of characters representing the suffix that is not shared with the previous string.
E.g. the sorted strings
A, An, Ant, Antivirus, Antivirus Software, Best
are encoded as
(0,a) (1,n) (2,t) (3,ivirus) (9, Software) (0,Best)
18. Dictionary: Plain Front Coding (PFC)
2. The vocabulary is split into buckets, each of them storing "b" strings:
the first string of each bucket (the header) is coded explicitly (i.e. the full string);
the subsequent b-1 strings (internal strings) are coded differentially.
With b = 3, the example becomes (IDs 1-6):
Bucket 1: a (1,n) (2,t)
Bucket 2: Antivirus (9, Software) (0,Best)
19. Dictionary: Plain Front Coding (PFC)
3. PFC is encoded as a byte sequence plus an array of pointers (ptr) marking the first byte of each bucket:
ptr: 1 9
Bucket 1: a (1,n) (2,t)
Bucket 2: Antivirus (9, Software) (0,Best)
20. Dictionary: Plain Front Coding (PFC)
locate(string) performs a binary search on the bucket headers + sequential decoding of internal strings,
e.g. locate(Antivirus Software) = 5
extract(id) finds the bucket id/b and decodes until the given position,
e.g. extract(5) = Antivirus Software
More on compressed dictionaries: Martínez-Prieto, M. A., Brisaboa, N., Cánovas, R., Claude, F., & Navarro, G. (2016). Practical compressed string dictionaries. Information Systems, 56, 73-108.
22. Remember… Succinct Data Structures
Represent and index large volumes of data in ~ the theoretical minimum space while serving efficient operations.
Mostly based on 3 operations:
access
rank
select
23. Bit Sequence Coding
Bitmap sequence with operations in constant time:
access(position) = value at that position
rank(position) = number of ones, up to position
select(i) = position of the i-th one
Implementation: n + o(n) bits, with adjustable space overhead (in practice, 37.5% overhead)
Example bitmap (positions 1-16):
1 1 0 1 0 1 0 0 1 0 1 1 0 1 1 0
rank(7) = 4
select(5) = 9
access(14) = 1
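A naive Python sketch of the three operations on this example (a real succinct bitmap answers all of them in constant time within n + o(n) bits; this version is linear-time, just to pin down the semantics):

class Bitmap:
    def __init__(self, bits):
        self.bits = bits                       # list of 0/1; positions are 1-based

    def access(self, pos):                     # value at position pos
        return self.bits[pos - 1]

    def rank(self, pos):                       # number of ones up to (and including) pos
        return sum(self.bits[:pos])

    def select(self, i):                       # position of the i-th one
        count = 0
        for pos, b in enumerate(self.bits, start=1):
            count += b
            if count == i:
                return pos
        raise ValueError("fewer than i ones in the bitmap")

B = Bitmap([1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0])
assert B.rank(7) == 4 and B.select(5) == 9 and B.access(14) == 1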
24. Bitmap Triples Encoding
(figure: triples sorted by subject; predicates and objects stored as ID sequences Sp and So, with bitsequences Bp and Bo delimiting the predicate list of each subject and the object list of each (subject, predicate) pair)
E.g. retrieve (2,5,?):
Find the position of the second '1'-bit in Bp (select)
Binary search on that list of predicates looking for 5
Note that such predicate 5 is in position 4 of Sp
Find the position of the fourth '1'-bit in Bo (select)
Triple patterns resolved directly: S P O, S P ?, S ? O, S ? ?, ? ? ?
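The following toy sketch replays that (S, P, ?) walk in Python. It assumes one simplified convention (my assumption, not necessarily HDT's exact layout): Bp holds a 1 at the last predicate of each subject's list in Sp, and Bo holds a 1 at the last object of each (subject, predicate) pair's list in So:

def select1(bits, i):
    # 1-based position of the i-th set bit (0 when i == 0)
    if i == 0:
        return 0
    count = 0
    for pos, b in enumerate(bits, start=1):
        count += b
        if count == i:
            return pos
    return 0

def search_sp(s, p, Bp, Sp, Bo, So):
    lo = select1(Bp, s - 1) + 1                # subject s's predicates: Sp[lo..hi]
    hi = select1(Bp, s)
    for j in range(lo, hi + 1):                # binary search is possible (sorted)
        if Sp[j - 1] == p:                     # j = global position of the (s,p) pair
            olo = select1(Bo, j - 1) + 1       # its object list: So[olo..ohi]
            ohi = select1(Bo, j)
            return So[olo - 1:ohi]
    return []

# Toy data: subject 1 -> {p1: [o3], p2: [o1, o4]}, subject 2 -> {p2: [o2]}
Bp = [0, 1, 1]; Sp = [1, 2, 2]
Bo = [1, 0, 1, 1]; So = [3, 1, 4, 2]
assert search_sp(1, 2, Bp, Sp, Bo, So) == [1, 4]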
25. Additional Predicate Index
To resolve ? P ?, the occurrences of each predicate in Sp are encoded as a position list Sl, with a bitsequence Bl delimiting the occurrence list of each predicate:
Sl: 5 6 2 3 1 4
Bl: 0 1 1 1 1 1
E.g. retrieve (?,4,?):
Get the start of predicate 4's list: Bl.select(3) + 1 = 4 + 1 = 5
Retrieve the position: Sl[5] = 1
Get the associated subject: rank(1) = 1st subject
Access the objects as (1,4,?)
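A matching toy sketch for ? P ?, reusing select1 and the toy Bp/Sp/Bo/So from the sketch above (again under my simplified last-element-marks-the-list convention, so the offsets differ slightly from the slide's figure):

def rank1(bits, pos):
    return sum(bits[:pos])                     # ones up to and including pos

def search_p(p, Bl, Sl, Bp, Sp, Bo, So):
    results = []
    lo = select1(Bl, p - 1) + 1                # predicate p's occurrences: Sl[lo..hi]
    hi = select1(Bl, p)
    for k in range(lo, hi + 1):
        j = Sl[k - 1]                          # a position of p inside Sp
        s = rank1(Bp, j - 1) + 1               # subject owning that position
        olo = select1(Bo, j - 1) + 1           # objects of that (s, p) pair
        ohi = select1(Bo, j)
        results += [(s, p, o) for o in So[olo - 1:ohi]]
    return results

# Occurrences of p1 in Sp: position 1; of p2: positions 2 and 3
Bl = [1, 0, 1]; Sl = [1, 2, 3]
assert search_p(2, Bl, Sl, Bp, Sp, Bo, So) == [(1, 2, 1), (1, 2, 4), (2, 2, 2)]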
26. Additional Object Index
Analogously, to resolve ? ? O and ? P O, the occurrences of each object are encoded as a position list:
Sl: 2 5 6 4 3 3 1
Bl: 0 0 1 1 1 1 1
27. On-the-fly indexes: HDT-FoQ
From the exchanged HDT to the functional HDT-FoQ: publish and exchange HDT; at the consumer, build lightweight indexes on the fly:
1. Index the bitsequences -> subject index (SPO order): resolves SPO, SP?, S??, S?O, ???
2. Index the position of each predicate (just a position list) -> predicate index (PSO order): resolves ?P?
3. Index the position of each object (just a position list) -> object index (OPS order): resolves ?PO, ??O
29. Some numbers on size
28,362,198,927 triples
http://dataweb.infor.uva.es/projects/hdt-mr/
José M. Giménez-García, Javier D. Fernández, and Miguel A. Martínez-Prieto. HDT-MR: A Scalable Solution for RDF Compression with HDT and MapReduce. In Proc. of the International Semantic Web Conference (ISWC), 2015.
30. Results
Data is ready to be consumed 10-15x faster:
HDT << any other RDF format || RDF engine
Competitive query performance:
Very fast on triple patterns, ~1.5x faster than Virtuoso and RDF3X
Integration with Jena: joins on the same scale as existing solutions (Virtuoso, RDF3X)
33. 2) LOD Laundromat
A Linked Data hacker toolkit: uses HDT as the main storage solution.
http://lodlaundromat.org/
Challenges:
You still need to query 650K datasets
Of course it does not contain all of LOD, but "a good approximation"
Beek, W., Rietveld, L., Bazoobandi, H. R., Wielemaker, J., & Schlobach, S. (2014). LOD Laundromat: A uniform way of publishing other people's dirty data. In ISWC (pp. 213-228).
34. 3) Linked Data Fragments
A Linked Data hacker toolkit: typically uses HDT as the main engine.
Challenges:
Still room for optimization for complex federated queries (delays, intermediate results, …)
Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014). Querying datasets on the web with high availability. In ISWC (pp. 180-196).
35. Get more than 650K HDT datasets from LOD Laundromat…
37. Application scenarios (HDT)
Scalable storage:
Store and serve thousands of (large) datasets (e.g. LOD Laundromat)
Archiving (reduce storage costs + foster smart consumption), also with deltas (see https://aic.ai.wu.ac.at/qadlod/bear.html)
Better consumer-centric publication:
Compress and share ready-to-consume RDF datasets
Consumption with limited resources:
Smartphones, standard laptops
Fast, low-cost SPARQL query engine:
Via HDT-Jena
Via Linked Data Fragments
38. Application Examples
Storage + light API:
http://lodlaundromat.org/
http://linkeddatafragments.org/
Storage + SPARQL query engine:
https://data.world/
Advanced features:
Top-k shortest paths
Query answering over the Web of Data
Others: versioning, streaming, …
39. Application Examples
Publication in HDT (~1B triples): https://zenodo.org/record/1116889#.WuBt0C7FKpo
URI resolver based on HDT: https://github.com/pharmbio/urisolve
91,498,351 compounds from PubChem with predicted logD (water–octanol distribution coefficient) values at the 90% confidence level
40. But what about Web-scale queries?
E.g. retrieve all entities referring to the gene WBGene00000001 (aap-1):

select distinct ?x {
  ?x dcterms:title "WBGene00000001 (aap-1)" .
}

Solutions?
45. LOD-a-lot (some use cases)
Query resolution at Web scale, using LDF, Jena
Evaluation and benchmarking: no excuse
RDF metrics and analytics
(figure: distributions of subjects, predicates and objects)
46. LOD-a-lot (some use cases)
Identity closure: ?x owl:sameAs ?y
Graph navigations, e.g. shortest path, random walk
More use cases: Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.
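As an idea of how the identity-closure use case above can be scripted against the single LOD-a-lot HDT file, here is a hedged sketch using the community pyHDT `hdt` bindings plus a small union-find (the file name and memory budget are assumptions; 28B triples still fit a laptop's disk, though the closure map itself may not fit RAM for the full file):

from hdt import HDTDocument

doc = HDTDocument("lod-a-lot.hdt")             # hypothetical local file
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
triples, cardinality = doc.search_triples("", SAME_AS, "")
print(cardinality, "owl:sameAs links")

parent = {}                                    # union-find over IRIs
def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]          # path halving
        x = parent[x]
    return x

for s, _, o in triples:
    parent[find(s)] = find(o)                  # merge the two identity sets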
47. Roadmap
Update LOD-a-lot regularly: more and newer datasets from the LOD Cloud
Leverage the HDT indexes to support "data science", e.g. get links across datasets, study the topology of the network, optimize query planning
Support provenance of the triples (i.e. the origin of each triple), currently supported only via LOD Laundromat
… implement the use cases and help the community to democratize access to LOD
48. Take-home messages
We are currently facing Big Linked Data challenges: generation, publication and consumption; archiving, evolution…
Thanks to compression/HDT, the Big Linked Data of today will be the "pocket" data of tomorrow.
HDT democratizes access to Big Linked Data = cheap, scalable consumers.
Low-cost access to LOD = high-impact research.
50. Thank you!
javier.fernandez@wu.ac.at
Kudos to all the co-authors involved in the works presented here
Incomplete list of ACKs:
Miguel A. Martínez-Prieto
Mario Arias
Pablo de la Fuente
Claudio Gutierrez
Axel Polleres
Wouter Beek
Ruben Verborgh
… And many others