The evolution of Netflix's S3 data warehouse (Strata NY 2018), by Ryan Blue
In the last few years, Netflix’s S3 data warehouse has grown to more than 100 PB. In that time, the company has shared several techniques and released open source tools for working around S3’s quirks, including s3mper to work around eventual consistency, S3 multipart committers to commit data without renames, and the batchid pattern for cross-partition atomic commits.
Ryan Blue and Daniel Weeks share lessons learned, the tools Netflix currently uses and those it has retired, and the improvements it is rolling out, including Iceberg, a new table format for S3 that is replacing many of the company’s current tools. Iceberg enables a new generation of improvements, including:
* Snapshot isolation with no directory listing or file renames
* Distributed planning to relieve metastore bottlenecks
* Improved data layout for S3 performance
* Immediately available writes from streaming applications
* Opportunistic compaction and data optimization
Slide show for the webinar on "Spatial Data Science with R" organized for the GeoDevelopers.org community. The video of the webinar and all the related materials including source code and sample data can be downloaded from this link: http://amsantac.co/blog/en/2016/08/07/spatial-data-science-r.html
In this webinar I talked about Data Science in the context of its application to spatial data and explained how we can use the R language for the analysis of geographic information within the different stages of a data science workflow, from the import and processing of spatial data to visualization and publication of results.
Slides of the Apache Omid presentation at Hadoop Summit 2016 in San Jose, CA. Omid is a flexible, reliable, high-performance, and scalable transaction manager for HBase.
Shen Li, VP of Engineering at PingCAP, shares these slides about TiDB and the big data ecosystem. Enjoy!
TiDB, an open source distributed HTAP database. Inspired by Google Spanner/F1, PingCAP develops TiDB, an open source distributed Hybrid Transactional/Analytical Processing (HTAP) database. TiDB features infinite horizontal scalability, strong consistency, and high availability. The goal of TiDB is to serve as a one-stop solution for online transactions and analysis.
Introduction to Spark Datasets: Functional and relational together at last, by Holden Karau
Spark Datasets are an evolution of Spark DataFrames that let us combine functional and relational transformations on big data with the speed of Spark.
Sorry - How Bieber broke Google Cloud at Spotify, by Neville Li
Talk at Scala Up North, July 21, 2017.
We will talk about Spotify's big data story with Scala and our journey to migrate our entire data infrastructure to Google Cloud, and how Justin Bieber contributed to breaking it. We'll cover Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and the technology behind it, including macros, algebird, chill, and shapeless. There'll also be a live coding demo.
Extending Spark for Qbeast's SQL Data Source, with Paola Pardo and Cesare Cugnasco (Qbeast)
Slides from the Barcelona Spark meetup of 24 October 2019. The recording is available at https://www.youtube.com/watch?v=eCoCcBH4hIU.
Abstract
One of the key strengths of Spark is its flexibility, as it integrates with dozens of different storage systems and file formats. However, reading from a CSV file is not the same as reading from a SQL database or from an exotic stratified-sampled multidimensional database. And finding the right balance between modularity and flexibility is not easy!
In this presentation, we will talk about the evolution of Spark's DataSource API and how it integrates with the SQL optimizer, highlighting how logical and physical plans that integrate better with the storage enable much faster queries. Moving from theory to practice, we will then discuss how we extended Spark's internals and built a new source integration that allows push-down of both sampling and multidimensional filtering.
About the speakers:
Paola Pardo is a computer engineer from Barcelona. She graduated in computer engineering last summer from the Technical University of Catalonia with a thesis focused on data storage push-down optimization based on Apache Spark. She is currently working at the Barcelona Supercomputing Center and at its spin-off Qbeast, developing the Qbeast-Spark connector.
Cesare Cugnasco holds a PhD in Computer Architecture and is a researcher at the Barcelona Supercomputing Center. His research focuses on NoSQL databases, distributed computing, and high-performance storage. He invented and patented a new database architecture for Big Data, and he is building a spin-off for its commercialization.
Linked Open Data is the most usable kind of Open Data. An example of a well-integrated source of Linked Open Data on tourism and mobility is the Open Data Hub operated by NOI. We will use the SPARQL query language, a W3C standard, to query the data and show how this differs from other access methods. The tour will start by querying the endpoint directly from the command line with tools like curl. Then, one by one, well-known data science software packages, like R and Pandas, will be used to work directly with these datasets, to perform statistical calculations and to generate graphs from the data.
In the final part, these software packages will be used to query data from other well known data sources, like Wikidata and DBpedia.
Graphs in data structures are non-linear data structures made up of a finite ... (bhargavi804095)
Graphs in data structures are non-linear data structures made up of a finite number of nodes, or vertices, and the edges that connect them. Graphs are used to address real-world problems by representing the problem domain as a network, such as telephone networks, circuit networks, and social networks.
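As a minimal illustration of the adjacency-list representation described above (a hypothetical Python sketch, not taken from the deck; the node names and edges are invented for the example):

from collections import defaultdict

# An undirected graph stored as an adjacency list: node -> set of neighbours.
graph = defaultdict(set)

def add_edge(u, v):
    # Store each edge in both directions, as in a social network friendship.
    graph[u].add(v)
    graph[v].add(u)

add_edge("Alice", "Bob")
add_edge("Bob", "Carol")
print(graph["Bob"])  # {'Alice', 'Carol'}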
Remember the last time you tried to write a MapReduce job (obviously something less trivial than a word count)? It surely did the work, but there are a lot of pain points between having an idea and implementing it in terms of map and reduce. Did you wonder how much simpler life would be if you could code as if performing collection operations, staying transparent* to the distributed nature of the computation? Did you want or hope for more performant, lower-latency jobs? Well, it seems you are in luck.
In this talk, we will cover a different way to do MapReduce-style operations without being limited to just map and reduce: yes, we will be talking about Apache Spark. We will compare and contrast Spark's programming model with MapReduce. We will see where it shines, why to use it, and how to use it. We'll cover aspects like testability, maintainability, and conciseness of the code, and features like iterative processing, optional in-memory caching, and others. We will see how Spark, being just a cluster computing engine, abstracts away the underlying distributed storage and cluster management aspects, giving us a uniform interface to consume, process, and query data. We will explore the basic abstraction of the RDD, which gives us so many awesome features and makes Apache Spark a very good choice for big data applications. We will see all of this through some non-trivial code examples.
Session at the IndicThreads.com conference held in Pune, India, on 27-28 Feb 2015
http://www.indicthreads.com
http://pune15.indicthreads.com
Out of the box, Accumulo's strengths are difficult to appreciate without first building an application that showcases its capabilities to handle massive amounts of data. Unfortunately, building such an application is non-trivial for many would-be users, which affects Accumulo's adoption.
In this talk, we introduce Datawave, a complete ingest, query, and analytic framework for Accumulo. Datawave, recently open-sourced by the National Security Agency, capitalizes on Accumulo's capabilities, provides an API for working with structured and unstructured data, and boasts a robust, flexible, and scalable backend.
We'll do a deep dive into Datawave's project layout, table structures, and APIs in addition to demonstrating the Datawave quickstart—a tool that makes it incredibly easy to hit the ground running with Accumulo and Datawave without having to develop a complete application.
This presentation is an attempt to demystify the practice of building reliable data processing pipelines. We go through the pieces needed to build a stable processing platform: data ingestion, processing engines, workflow management, schemas, and pipeline development processes. The presentation also includes component choice considerations and recommendations, as well as best practices and pitfalls to avoid, most of them learnt through expensive mistakes.
The magic of (data parallel) distributed systems and where it all breaks - Re..., by Holden Karau
Distributed systems can seem magical, and sometimes all of the magic works and our job succeeds. However, if you've worked with them for long enough, you've found a few places where the magic starts to break down and discovered that it's actually a collection of several hundred garden gnomes* rather than a single large garden gnome.
This talk will use Apache Spark, Beam, Flink, Kafka, and Map Reduce to explore the world of data parallel distributed systems. We'll start with some happy pieces of magic, like how we can combine different transformations into a single pass over the data, working between different languages, data partitioning, and lambda serialization. After each new piece of magic is introduced we'll look at how it breaks in one (or two) of the systems.
Come to be told it's not your fault everything is broken, or, if your distributed software still works, to get an exciting preview of everything that's going to go wrong. Don't work with distributed systems? Come to be reassured you've made good life choices.
An introduction to MapReduce and Hadoop, and a view on the opportunities to use MapReduce with databases, e.g., SQL-MapReduce by Teradata and in-database MapReduce by Oracle.
The presentation was used during a class of Datenbanken Implementierungstechniken (database implementation techniques) in 2013.
The term NewSQL covers databases that promise the scalability of NoSQL together with the ACID transactions and SQL language of traditional databases. VoltDB, developed by Michael Stonebraker's team, is the main example of this strand. This talk presents the experience of using VoltDB, describes the benefits and challenges of its use, and compares this solution with other tools such as Apache Ignite.
This presentation discusses the decision process carried out by Socialbase when choosing the best machine learning framework/service. Among the aspects evaluated are features, scalability, learning curve, cost, and security.
Presented with Narlei Moreira and Igor Siciliani at TDC Floripa 2018.
Workshop presented at the Escola Regional de Banco de Dados (Regional Database School) in 2018. This workshop showed how to combine Spark, VoltDB, and Elasticsearch, three technologies that put Big Data concepts into practice, to achieve high processing speed over a large volume of data. Using an example based on geographic information, participants learned how to process data in real time using Apache Spark, build visualizations with Elasticsearch, and make the data available in a scalable way through a NewSQL tool, VoltDB.
NoSQL databases, once a trend, are now the reality for storing the large data volumes managed by Big Data applications. Big Data, however, also brings other challenges, such as integrated, real-time access to varied information sources. Although relatively recent in the history of computer science, NoSQL systems are in many respects supported by a long tradition of concepts and tools. This is especially visible in NoSQL integration, where well-known ideas such as federation, integration, and migration are still valid. In this context, this work compared the most recent works dealing with integrated access to multiple NoSQL databases. These works propose solutions at different levels, ranging from simple code-level integrations to the creation of integrated models, but there is a gap with respect to integrated, semantic, real-time access to NoSQL repositories. Based on this analysis, a middleware called Rendezvous is proposed that offers integrated access through a semantic view of the data, using the RDF and SPARQL standards, over any of the main NoSQL data models (key/value, columnar, document, and graph), and that allows real-time access to the data it manages.
An Approach for RDF-based Semantic Access to NoSQL Repositories, presented as a partial requirement for the course "Metodologia da Pesquisa em Ciência da Computação" (Research Methodology in Computer Science) at UFSC, 2015.
Connector Corner: Automate dynamic content and events by pushing a button, by DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo..., by James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a PASSION for technology and making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Essentials of Automations: Optimizing FME Workflows with Parameters, by Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Transcript: Selling digital books in 2024: Insights from industry leaders - T..., by BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
The Art of the Pitch: WordPress Relationships and Sales, by Laura Byrne
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Securing your Kubernetes cluster: a step-by-step guide to success! (by KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Generating a custom Ruby SDK for your web service or Rails API using Smithy, by g2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality, by Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
GraphRAG is All You Need? LLM & Knowledge Graph, by Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Epistemic Interaction - tuning interfaces to provide information for AI support, by Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Graphs stored in NoSQL Databases
1. Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Graphs stored in NoSQL Databases
Simpósio Brasileiro de Banco de Dados (SBBD)
Uberlândia, October 2017
Luiz Henrique Zambom Santana
Prof. Dr. Ronaldo dos Santos Mello
2. Agenda
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
● Rendezvous
○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
3. Introduction: Motivation
● Since the Semantic Web proposal in 2001, many advances have been introduced by the W3C
● RDF and SPARQL are currently widespread:
○ Best Buy: http://www.nytimes.com/external/readwriteweb/2010/07/01/01readwriteweb-how-best-buy-is-using-the-semantic-web-23031.html
○ Globo.com: https://www.slideshare.net/icaromedeiros/apresantacao-ufrj-icaro2013
○ US data.gov: https://www.data.gov/developers/semantic-web
5. Introduction: Motivation
● Research problem
○ Storing/querying large RDF graphs
■ No single node can handle the complete graph
■ Native RDF storage cannot scale to current data requirements
■ Inter-partition joins are very costly
● Research hypothesis
○ A workload-aware approach based on distributed polyglot NoSQL persistence could be a good solution
6. Rendezvous
● A triplestore implemented as middleware for storing massive RDF graphs in multiple NoSQL databases
● A novel data partitioning approach
● A fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models
● A caching structure that accelerates the query response
7. Introduction: Contributions
● A mapping of RDF to the columnar, document, and key/value NoSQL models;
● A workload-aware partitioner based on the current graph structure and, mainly, on the typical application workload;
● A caching schema based on key/value databases for speeding up the query response time;
● An experimental evaluation that compares the current version of our approach against two baselines, including ScalaRDF (Hu et al., 2016), by considering Redis, Apache Cassandra, and MongoDB, the most popular key/value, columnar, and document NoSQL databases, respectively.
8. Agenda
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
● Rendezvous
○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
10. Background: NoSQL
● No SQL interface
● No ACID transactions
● Very scalable
● Schemaless
https://db-engines.com/en/ranking
11. Agenda
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
● Rendezvous
○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
● Schedule
12. State of the Art - Triplestores
Triplestore | Frag. | Replication | Partitioning | Model | In-memory | Workload-aware
Hexastore (2008) | No | No | No | Native | No | No
SW-Store (2009) | No | No | Vertical | SQL | No | No
CumulusRDF (2011) | No | No | Vertical | Columnar (Cassandra) | No | No
SPOVC (2012) | No | No | Horizontal | Columnar (MonetDB) | No | No
WARP (2013) | Yes | N-hop replication on partition boundary | Hash | Native | No | Dynamic
Rainbow (2015) | No | No | Hash | Polyglot | K/V cache | Static
ScalaRDF (2016) | No | Next-hop | Hash | Polyglot | K/V cache | No
Rendezvous | Yes | N-hop replication on fragment and on partition boundary | V and H | Polyglot | K/V and local cache | Dynamic
The last row marks the key differentials of Rendezvous.
13. Agenda
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
● Rendezvous
○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
● Schedule
14. Rendezvous
● A triplestore implemented as middleware for storing massive RDF graphs in multiple NoSQL databases
● A novel data partitioning approach
● A fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models
● A caching structure that accelerates the query response
17. Workload awareness
Given the example graph [Figure: RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and predicates p1-p11], suppose the following queries are issued:
SELECT ?x WHERE {
B p2 C .
C p3 ?x
}
SELECT ?x WHERE {
F p6 G .
F p9 L .
F p8 ?x
}
The Dataset Characterizer records the typical query shapes in two tables: a star-shaped table keyed by the subject/object (e.g., F -> {Fp6G, Fp9L, Fp8?}) and a chain-shaped table keyed by the predicate (e.g., p3 -> {Bp2C, Cp3?}).
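To make the idea concrete, here is a minimal sketch of a dataset characterizer in Python (hypothetical code, not the actual Rendezvous implementation; the classification rule assumed here is simply "all patterns share one subject -> star, otherwise chain"):

star_table = {}   # subject/object -> triple patterns seen together in star queries
chain_table = {}  # predicate      -> triple patterns seen together in chain queries

def characterize(patterns):
    # patterns: list of (subject, predicate, object); variables start with "?".
    subjects = {s for s, _, _ in patterns}
    if len(subjects) == 1:
        # Star-shaped: every pattern hangs off the same subject.
        key = next(iter(subjects))
        star_table.setdefault(key, set()).update(patterns)
    else:
        # Chain-shaped: record the patterns under each predicate in the chain.
        for _, p, _ in patterns:
            chain_table.setdefault(p, set()).update(patterns)

characterize([("F", "p6", "G"), ("F", "p9", "L"), ("F", "p8", "?x")])  # star
characterize([("B", "p2", "C"), ("C", "p3", "?x")])                    # chain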
19. Star Fragmentation (n-hop expansion)
Given the graph and this state of the Dataset Characterizer (star-shaped: F -> {Fp6G, Fp9L, Fp8?}; chain-shaped: p3 -> {Bp2C, Cp3?}), suppose the triple F p10 C is to be stored. According to the characterizer, F tends to appear in star queries with diameter 1, so we expand the triple Fp10C to a 1-hop fragment around F before storing it [Figure: the 1-hop fragment containing B, C, G, H, I, and L, connected to F through p5-p10].
20. Star Fragmentation (mapping)
With the expanded fragment, we translate it to a JSON document and store it in the document database:
{
subject: F,
p6: G,
p7: I,
p8: H,
p10: C,
p9: L,
p5: {
object: B
}
}
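A sketch of this star-to-document mapping in Python (hypothetical code; the function name and fragment layout are ours, following the example above, and the pymongo call in the final comment is only illustrative):

def star_fragment_to_document(center, outgoing, incoming):
    # outgoing: {predicate: object} for edges leaving the center (F -p6-> G, ...)
    # incoming: {predicate: subject} for edges arriving at the center (B -p5-> F)
    doc = {"subject": center}
    doc.update(outgoing)
    for pred, subj in incoming.items():
        doc[pred] = {"object": subj}  # matches the p5: {object: B} layout above
    return doc

doc = star_fragment_to_document(
    "F",
    {"p6": "G", "p7": "I", "p8": "H", "p9": "L", "p10": "C"},
    {"p5": "B"},
)
# e.g., with pymongo: MongoClient().rdf.fragments.insert_one(doc)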
21. Chain Fragmentation (n-hop expansion)
Given the graph and this state of the Dataset Characterizer, suppose the triple C p3 G is to be stored. According to the characterizer, p3 tends to appear in chain queries with max diameter 1, so we expand the triple Cp3G to a 1-hop fragment before storing it [Figure: the chain fragment containing B p2 C, C p3 D, C p3 G, and F p6 G].
22. Chain Fragmentation (mapping)
With the expanded fragment, we translate it to a set of columnar tables, one per predicate, and store them in the columnar database:
Table p2: Subj=B, Obj=C
Table p3: Subj=C, Obj=D; Subj=C, Obj=G
Table p6: Subj=F, Obj=G
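A matching sketch of the chain-to-columnar mapping (hypothetical Python; in Rendezvous the per-predicate tables would live in the columnar database, here they are plain dictionaries):

from collections import defaultdict

tables = defaultdict(list)  # predicate -> list of (Subj, Obj) rows

def store_chain_fragment(triples):
    # One (Subj, Obj) table per predicate, as in the example above.
    for s, p, o in triples:
        tables[p].append((s, o))

store_chain_fragment([
    ("B", "p2", "C"),
    ("C", "p3", "D"),
    ("C", "p3", "G"),
    ("F", "p6", "G"),
])
print(tables["p3"])  # [('C', 'D'), ('C', 'G')]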
24. Indexing
The Indexer indexes each new triple by its subject and by its object:
S_PO: F -> {p10C}, C -> {p3G}
O_SP: C -> {Fp10}, G -> {Cp3}
This helps with triple expansion and solves simple queries like:
SELECT ?x WHERE { F p10 ?x }
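A minimal sketch of the two indexes (hypothetical Python, mirroring the S_PO and O_SP tables above):

s_po = {}  # subject -> set of (predicate, object)
o_sp = {}  # object  -> set of (subject, predicate)

def index_triple(s, p, o):
    s_po.setdefault(s, set()).add((p, o))
    o_sp.setdefault(o, set()).add((s, p))

index_triple("F", "p10", "C")
index_triple("C", "p3", "G")

# Solving SELECT ?x WHERE { F p10 ?x } is a single index lookup:
answers = [o for (p, o) in s_po["F"] if p == "p10"]  # ['C']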
26. Partitioning
If a graph is bigger than the capabilities of a single server, the Rendezvous DBA can create multiple partitions [Figure: the example graph split into partitions P1, P2, and P3, stored across a columnar database and a document database]. Each NoSQL server can hold one or more partitions, and each partition lives in only one server.
27. Partitioning
Rendezvous manages the partitions by recording them in a dictionary of fragment hashes:
(F p10 C), size 2 -> {P1, P2}
(C p3 D), size 2 -> {P3}
(L p12 H), size 1 -> {P2}
Each partition stores its triples as S/P/O elements; for example, P1 holds (A p1 B) and (F p10 C), P2 holds (F p10 C) and (L p12 J), and P3 holds (C p3 D).
28. Partitioning (boundary replication)
If a triple lies on the boundary between two partitions, it is replicated in both partitions; for example, (F p10 C) is stored in both P1 and P2. The size of this boundary is defined by the DBA.
29. Partitioning (data placement)
The fragment hash helps with data placement: based on the triple and the size of its fragment, Rendezvous finds the best partition in which to store the triple.
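A sketch of this placement decision (hypothetical Python; the dictionary contents follow the example above, and the "best partition" rule assumed here, least-loaded candidate, is ours):

fragments = {  # fragment hash: triple -> candidate partitions
    ("F", "p10", "C"): {"P1", "P2"},
    ("C", "p3", "D"): {"P3"},
    ("L", "p12", "H"): {"P2"},
}
partition_sizes = {"P1": 120, "P2": 80, "P3": 45}  # invented sizes

def place(triple):
    # Restrict to the partitions already holding the triple's fragment, if any.
    candidates = fragments.get(triple, set(partition_sizes))
    # Assumed heuristic: pick the least-loaded candidate partition.
    return min(candidates, key=lambda part: partition_sizes[part])

print(place(("F", "p10", "C")))  # 'P2' under the sizes above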
31. Querying evaluation
Given the graph, if the following query is issued:
Q: SELECT ?x WHERE {
?w p6 G .
?w p7 I .
?w p8 H .
?x p1 ?y .
?y p2 ?z .
?z p3 ?w
}
Rendezvous will (1) search for simple queries, star queries, and chain queries inside Q, and (2) update the Dataset Characterizer. Q decomposes into a star subquery and a chain subquery:
Star, Qs: SELECT ?x WHERE {
?w p6 G .
?w p7 I .
?w p8 H
}
Chain, Qc: SELECT ?x WHERE {
?x p1 ?y .
?y p2 ?z .
?z p3 ?w
}
32. Querying decomposition
Rendezvous finds the right partition using the dictionary and translates each SPARQL subquery into the final query to be processed by the corresponding NoSQL database.
Star, Qs (document database, partition 2):
D: db.partition2.find({
{p6:{$exists:true}, object:G},
{p7:{$exists:true}, object:I},
{p8:{$exists:true}, object:H},
})
Chain, Q2c (columnar database):
Partition 1:
Cp1: SELECT S1, O1 FROM p1
Cp2: SELECT S2, O2 FROM p2 WHERE O=S1
Partition 3:
Cp3: SELECT S3, O3 FROM p3 WHERE O=S2
34. Caching (two-level cache)
After the last query was issued:
Q: SELECT ?x WHERE {
?w p6 G .
?w p7 I .
?w p8 H .
?x p1 ?y .
?y p2 ?z .
?z p3 ?w .
?y p5 ?w
}
the two cache levels hold:
Near cache (in-memory tree map):
A:p1:B -> {A:p1:B, B:p2:C}
B:p2:C -> {B:p2:C, C:p3:D}
Remote cache (key/value NoSQL database):
A:p1:B -> {A:p1:B, B:p2:C}
B:p2:C -> {B:p2:C, C:p3:D}
...
B:p5:F -> {B:p5:F, F:p9:D}
Normally, the near cache is smaller than the remote cache.
35. Caching (querying)
If the following query is issued:
Q: SELECT ?x WHERE {
?x p1 ?y .
?y p2 ?z .
?z p3 ?w .
?y p5 F
}
it can be solved using only triples from the cache.
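A sketch of the two-level lookup (hypothetical Python; in Rendezvous the remote cache is a key/value NoSQL database such as Redis, here it is a plain dictionary, and the near-cache capacity is invented):

near_cache = {}    # small in-memory map: "S:p:O" -> set of related triples
remote_cache = {   # stands in for the key/value store, preloaded as above
    "A:p1:B": {"A:p1:B", "B:p2:C"},
    "B:p2:C": {"B:p2:C", "C:p3:D"},
    "B:p5:F": {"B:p5:F", "F:p9:D"},
}
NEAR_CAPACITY = 2

def lookup(key):
    if key in near_cache:
        return near_cache[key]            # first level: near-cache hit
    value = remote_cache.get(key)         # second level: remote cache
    if value is not None and len(near_cache) < NEAR_CAPACITY:
        near_cache[key] = value           # promote hot entries to the near cache
    return value                          # None means: query the NoSQL partitions

print(lookup("B:p5:F"))  # served from the remote cache, then promoted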
36. Agenda
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
● Rendezvous
○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
37. Evaluation
● LUBM: an ontology for the university domain, synthetic RDF data scalable to any size, and 14 extensional queries representing a variety of properties
● Generated dataset with 4,000 universities (around 100 GB, containing around 500 million triples)
● 12 queries with joins; all of them have at least one subject-subject join, and six of them also have at least one subject-object join
● Apache Jena version 3.2.0 with Java 1.8; Redis 3.2, MongoDB 3.4.3, and Apache Cassandra 3.10
● Amazon m3.xlarge spot instances with 7.5 GB of memory and 1 x 32 GB SSD storage
38. Evaluation: Rendezvous performance
The bigger the number of hops (the replication), the bigger (exponentially) the size of the dataset and the loading time. However, since joins are avoided, the response time decreases.
39. Evaluation: Rendezvous with different settings
Performance is better when the partitioning is managed by Rendezvous. The bigger the boundary replication, the faster the response time, without a big impact on the dataset size.
41. Conclusions
● Rendezvous contributes:
○ A solution to the graph partitioning problem via fragments
○ Better query response time through n-hop and boundary replication
○ Better query response time via two-level caching
○ Scalable RDF storage provided by NoSQL databases
● About the evaluation:
○ Fragments are scalable
○ Bigger boundaries are not necessarily related to bigger storage size
○ Graph-aware partitions are better than NoSQL partitions
○ The near cache is fast, but it makes it harder to keep data consistent
42. Future Work
● Formalize the query mapping
○ No standard query language to rely on
● Compression of triples during storage
● Update and delete operations
● Other NoSQL types (e.g., graph)
● Better datasets
43. Obrigado! (Thank you!)
Simpósio Brasileiro de Banco de Dados (SBBD)
Uberlândia, October 2017
Luiz Henrique Zambom Santana
Prof. Dr. Ronaldo dos Santos Mello
50. State of the Art - SQL Triplestores
WARP, Hexastore, YARS, 4store, SPIDER, RDF-3x, SHARD, SW-Store, SOLID, SPOVC, S2X
51. State of the Art - NoSQL Triplestores
RDFJoin, RDFKB, Jena+HBase, Hive+HBase, CumulusRDF, Rya, Stratustore, MAPSIN, H2RDF, AMADA, Trinity.RDF, H2RDF+, MonetDBRDF, xR2RML, W3C RDF/JSON, Rainbow, Sempala, PrestoRDF, RDFChain, Tomaszuk, Bouhali and Laurent, Papailiou et al., and ScalaRDF
52. State of the Art - Triplestores
Recent survey (September 2017):
Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, Panos Kalnis: A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data. Proceedings of the VLDB Endowment, Vol. 10, No. 13, September 2017, pp. 2049-2060.