A multi-model database is a document store, a graph database and a key/value store in one. To allow for convenient and powerful querying, such a database needs a query language that understands all three data models and allows mixing them in a single query. For example, it should be possible to find some documents in a collection according to some criteria, then follow some edges in a graph in which the documents represent vertices, and finally join the results with documents from yet another collection.
In this talk I will explain how a query engine for such a language works and give an overview of the life of a query, from parsing through translation into an execution plan and the optimisation phase to the final execution. I will show what distributed query execution plans look like, how the query optimiser reasons about them and how the distributed execution works.
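A minimal sketch of such a mixed-model query in AQL, using the traversal syntax of more recent ArangoDB versions; the collection, edge-collection and attribute names are invented purely for illustration:

FOR p IN persons
  FILTER p.city == "Cologne"
  FOR friend IN 1..2 OUTBOUND p knows
    FOR o IN orders
      FILTER o.customer == friend._key
      RETURN { person: p.name, friend: friend.name, total: o.total }

Here the first FILTER selects documents by criteria, the OUTBOUND clause follows edges in the knows edge collection as a graph traversal, and the inner FOR/FILTER pair joins the traversal results against a third collection.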
Fault-tolerant and thus distributed applications are hard, but Mesos is the ultimate breeding ground for them. Rule number one for designing such beasts is to use well-known, tried and tested, and in particular proven methods and tools as building blocks.
Video recording of the talk: https://www.youtube.com/watch?v=6O_Wuc9FUXg&index=1&list=PLbzoR-pLrL6pLSHrXSg7IYgzSlkOh132K
Exceptions are the Norm: Dealing with Bad Actors in ETL (Databricks)
Stable and robust data pipelines are a critical component of the data infrastructure of enterprises. Most commonly, data pipelines ingest messy data sources with incorrect, incomplete or inconsistent records and produce curated and/or summarized data for consumption by subsequent applications.
In this talk, we go over new and upcoming features in Spark that enable it to better serve such workloads. These features include isolation of corrupt input records and files, useful diagnostic feedback to users, and improved support for nested types, which are common in ETL jobs.
Speaker: Sameer Agarwal
This talk was originally presented at Spark Summit East 2017.
A Graph Database That Scales - ArangoDB 3.7 Release Webinar (ArangoDB Database)
Jörg Schad (Head of Engineering and ML) and Chris Woodward (Developer Relations Engineer) introduce the new capabilities for working with graphs in a distributed setting. In addition, they explain and showcase the new fuzzy search within ArangoDB's search engine as well as JSON schema validation.
Get started with ArangoDB: https://www.arangodb.com/arangodb-tra...
Explore ArangoDB Cloud for free with 1-click demos: https://cloud.arangodb.com/home
ArangoDB is a native multi-model database written in C++, supporting graph, document and key/value needs with one engine and one query language. Fulltext search and ranking are supported via ArangoSearch, the fully integrated C++-based search engine in ArangoDB.
How to design software and APIs for parallelism on modern hardware
- Richard Parker, mathematician and freelance computer programmer in Cambridge, England
I see only three techniques capable of speeding up well written but old software by a factor of ten or more, namely
- Using multiple cores
- Making good use of cache memory
- Using the full power of a single processor.
So far, commercial software seems to have ignored recent changes in hardware. Often, performance does not really matter, but when it does matter, it requires changes to the design and coding of the computer software.
I have been looking into the changes needed for several years now, and it is gradually emerging that one aspect sticks out as being more important than all the others.
If functions, subroutines, methods etc. do a large number of whatever they do, rather than just one, it becomes possible to write clever implementations that run much faster. Several examples of this will be discussed.
I suggest that software designed and written today should take this into account, making it easier to speed it up if and when it becomes necessary to do so.
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in... (DataWorks Summit)
Last year, in Apache Spark 2.0, we introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0 we've been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from pub/sub systems like Kafka and Kinesis.
We'll walk through a concrete example where in less than 10 lines, we read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. We'll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
Deep Dive: Spark DataFrames, SQL and Catalyst Optimizer (Sachin Aggarwal)
RDD recap
Spark SQL library
Architecture of Spark SQL
Comparison with Pig and Hive Pipeline
DataFrames
Definition of a DataFrames API
DataFrames Operations
DataFrames features
Data cleansing
Diagram for logical plan container
Plan Optimization & Execution
Catalyst Analyzer
Catalyst Optimizer
Generating Physical Plan
Code Generation
Extensions
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure... (Databricks)
A technical overview of Spark’s DataFrame API. First, we’ll review the DataFrame API and show how to create DataFrames from a variety of data sources such as Hive, RDBMS databases, or structured file formats like Avro. We’ll then give example user programs that operate on DataFrames and point out common design patterns. The second half of the talk will focus on the technical implementation of DataFrames, such as the use of Spark SQL’s Catalyst optimizer to intelligently plan user programs, and the use of fast binary data structures in Spark’s core engine to substantially improve performance and memory use for common types of operations.
Designing and Building a Graph Database Application - Ian Robinson (Neo Techn...) (jaxLondonConference)
Presented at JAX London
In this session we'll look at some of the design and implementation strategies you can employ when building a Neo4j-based graph database solution, including architectural choices, data modelling, and testing.
Robust and Scalable ETL over Cloud Storage with Apache Spark (Databricks)
The majority of reported Spark deployments are now in the cloud. In such an environment, it is preferable for Spark to access data directly from services such as Amazon S3, thereby decoupling storage and compute. However, there are limitations to object stores such as S3. Chained or concurrent ETL jobs often run into issues on S3 due to inconsistent file listings and the lack of atomic rename support. Metadata performance also becomes an issue when running jobs over many thousands to millions of files.
Speaker: Eric Liang
This talk was originally presented at Spark Summit East 2017.
Open Source Big Data Ingestion - Without the Heartburn! (Pat Patterson)
Big Data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail, inducing heartburn in even the most resilient data scientist. This session will survey the big data ingestion landscape, focusing on how open source tools such as Sqoop, Flume, Nifi and StreamSets can keep the data pipeline flowing.
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis.
In this session, Das will walk through a concrete example where – in less than 10 lines – you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He’ll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
Patterns and Operational Insights from the First Users of Delta Lake (Databricks)
Cyber threat detection and response requires demanding workloads over large volumes of log and telemetry data. A few years ago I came to Apple after building such a system at another FAANG company, and my boss asked me to do it again.
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark (Databricks)
In this talk, we will introduce some of the new available APIs around stateful aggregation in Structured Streaming, namely flatMapGroupsWithState. We will show how this API can be used to power many complex real-time workflows, including stream-to-stream joins, through live demos using Databricks and Apache Kafka.
In this hotcode 2013 talk, Lucas and Frank gave an overview of NoSQL and explained why it is a good idea to use JavaScript in the database environment as well.
Backbone using Extensible Database APIs over HTTP (Max Neunhöffer)
These days, more and more software applications are designed using a microservices architecture, that is, as suites of independently deployable services talking to each other with well-defined interfaces. This approach is helped by the fact that many NoSQL databases expose their API through HTTP, which makes it particularly easy to define the interfaces.
The multi-model NoSQL database ArangoDB embeds Google's V8 JavaScript engine and features the Foxx framework, which allows the developer to extend ArangoDB's API with user-defined JavaScript code that runs on the database server.
In this talk I will explain the benefits of this approach to the software architecture and development process. I will keep the presentation practice-oriented by showing concrete examples in ArangoDB and JavaScript, using Backbone.js.
In this talk the general idea of graph databases is presented and the execution of a graph traversal is shown.
The talk was given at the NoSQL UG Cologne in November 2013.
In this talk I will explain the motivation behind the multi-model database approach, discuss its advantages and limitations, and will keep the presentation concrete and practice-oriented by showing usage examples from Node.js.
Extensible Database APIs and their role in Software Architecture (Max Neunhöffer)
This event will start with a presentation on “Extensible database APIs and their role in software architecture”, centered around JavaScript. This will be followed by a hands-on interactive workshop. Participants with their own computers will learn how to create a small web application with a database backend, within the session, using only JavaScript. This will be a guided hands-on session using the multi-model NoSQL database ArangoDB and its Foxx JavaScript extension framework. Presenting this workshop will be Max Neunhöffer from https://www.arangodb.com/.
ArangoDB – Polyglot Persistence and a Multi-Model Database (Helder Santana)
In this talk, I give a brief introduction to the world of NoSQL databases, pointing out the benefits of polyglot persistence. I present ArangoDB, an open-source universal database with a flexible data model for documents, graphs and key/value pairs. Build high-performance applications using an SQL-like query language or JavaScript/Ruby extensions.
In this talk we present the term polyglot persistence, give a brief introduction to the world of NoSQL databases and point out the benefits and costs of polyglot persistence. Thereafter we present the idea of a multi-model database that reduces the costs of polyglot persistence but keeps its benefits. Next up we present ArangoDB as a multi-model database.
Jan Steemann: Modelling data in a schema-free world (Talk held at Froscon, 2...) (ArangoDB Database)
Even though most NoSQL databases follow the "schema-free" data paradigm, it is still important to choose the right data model to make the best of the underlying database technology. This talk provides an overview of the different data storage models available in popular NoSQL databases. It also introduces some best practices on how to model your data for both best performance and best querying.
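For example, a one-to-many "contains" relation can either be embedded in the parent document or referenced by key; the collection and attribute names in this JSON sketch are made up purely for illustration:

// Embedded: the order lines only exist inside their order
{ "_key": "order1",
  "customer": "chars/123456",
  "lines": [ { "product": "Golf clubs", "qty": 1 },
             { "product": "Sheet music", "qty": 3 } ] }

// Referenced: order lines live in their own collection and point back
{ "_key": "line1", "order": "orders/order1", "product": "Golf clubs", "qty": 1 }

Embedding keeps data that is always read together in one place; referencing pays off when the related items are queried or updated independently.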
Recently a new breed of "multi-model" databases has emerged. They are a document store, a graph database and a key/value store combined in one program. Therefore they are able to cover a lot of use cases which otherwise would need multiple different database systems.
This approach promises a boost to the idea of "polyglot persistence", which has become very popular in recent years although it creates some friction in the form of data conversion and synchronisation between different systems. This is because with a multi-model database one can enjoy the benefits of polyglot persistence without the disadvantages.
In this talk I will explain the motivation behind the multi-model approach, discuss its advantages and limitations, and will then risk making some predictions about the NoSQL database market in five years' time, which I shall only reveal during the talk.
Processing large-scale graphs with Google(TM) Pregel by MICHAEL HACKSTEIN at... (Big Data Spain)
This talk will give a good overview over the complex architecture of the Pregel framework and will give some insights where there are potential bottlenecks when writing a Pregel algorithm.
Jan Steemann gave a talk at JavaScript Everywhere in Paris 2012 on JavaScript in ArangoDB, an open-source NoSQL database. With ArangoDB you can use JavaScript and/or Ruby (mruby) as an embedded language.
In 2014 we had to do a major overhaul of ArangoDB's database engine, because we wanted to introduce a write-ahead log. Since for a database this change is similar in nature to the proverbial open-heart surgery for humans, it was clear from day one that this would be a difficult endeavour with a lot of risk of breaking things. Rather fundamental changes were needed in nearly all places of the kernel code and it seemed impossible to serialise the work to keep the system in a working state. As usual, time was at a premium, since the next major release had to go out of the door in two months' time.
In this talk I will tell the story of this overhaul, explain the role of unit tests and continuous integration, and describe the challenges we faced and how we finally overcame them.
guacamole: an Object Document Mapper for ArangoDB (Max Neunhöffer)
In this talk I will give a brief introduction and overview for guacamole, showing how easy it is to get started with using ArangoDB as the persistence layer for a Rails app. I will also explain the philosophy behind ArangoDB's "multi-model approach", but still show concrete code examples, and all of this in 15 minutes.
Processing large-scale graphs with Google Pregel (Max Neunhöffer)
Graphs are a very popular data structure to store relations like friendship or web pages and their links. Therefore graph databases have become popular recently, and some of them even allow sharding, i.e. automatic distribution of the data across multiple machines.
On the other hand, very computation-intensive algorithms for graphs are known and used in practice, and they often access very large data sets, which leads to heavy communication loads.
Therefore, it is an obvious idea to run such graph algorithms on the database servers, close to the data, making use of the computational power of the storage nodes.
Google's Pregel framework allows one to implement a lot of graph algorithms in a general system and plays a role similar to the map-reduce skeleton, but for graphs.
In this talk I will explain the framework and describe its implementation in the multi-model database ArangoDB.
Max Neunhöffer – Joins and aggregations in a distributed NoSQL DB - NoSQL mat... (NoSQLmatters)
Max Neunhöffer – Joins and aggregations in a distributed NoSQL DB
NoSQL databases are renowned for their good horizontal scalability, and sharding is these days an essential feature for every self-respecting DB. However, most systems choose to offer fewer features with respect to joins and aggregations in queries than traditional relational DBMS do. In this talk I report about the joys and pains of (re-)implementing the powerful query language AQL with joins and aggregations for ArangoDB. I will cover (distributed) execution plans, query optimisation and data locality issues.
Search Engine-Building with Lucene and Solr (Kai Chan)
These are the slides for the session I presented at SoCal Code Camp San Diego on July 27, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6b28337d-6eae-4003-a664-5ed719f43533
Querying and Merging Heterogeneous Data by Approximate Joins on Higher-Order ... (Simon Price)
This paper addresses the important problem of integrating heterogeneous data from sources as diverse as web pages, digital libraries, knowledge bases and databases. The ultimate aim of this work is to be able to query such heterogeneous data sources as if their data were conveniently held in a single relational database. Pursuant of this aim, we propose a generalisation of relational joins from the relational database model to enable joins on arbitrarily complex structured data in a higher-order representation. By incorporating kernels and distances for structured data, we further extend this model to support approximate joins of data originating from heterogeneous sources. We have implemented these higher-order relational operators and their associated kernels in Prolog and applied this framework on the CORA data sets. We demonstrate the flexibility of our approach in the publications domain by evaluating example approximate queries on structured data, joining on types ranging from sets of co-authors through to entire publications.
Powerful Analysis with the Aggregation Pipeline (MongoDB)
Speaker: Asya Kamsky
Think you need to move your data "elsewhere" to do powerful analysis? Think again. The most efficient way to analyze your data is where it already lives. MongoDB Aggregation Pipeline has been getting more and more powerful and using new stages, expressions and tricks we can do extensive analysis of our data inside MongoDB Server.
The core Search frameworks in Liferay 7 have been significantly retooled to benefit not only from Liferay's new modular architecture, but also from one of the most innovative players in the market: Elasticsearch, which replaces Lucene as the default search engine in Portal. This session will cover topics like clustering and scalability, unveil improvements (both Elasticsearch and Solr) like aggregations, filters, geolocation, "more like this" and other new query types, and also hot new features for the Enterprise like out-of-the-box Marvel cluster monitoring and Shield security.
André "Arbo" Oliveira joined Liferay in early 2014 as a senior engineer and leads the Search Infrastructure team. He's been writing code for a living for 22 years, 14 of them as a Java developer and architect. Ever since discovering Elasticsearch, he's vowed never to write another SQL WHERE clause again.
Dynamic languages, for software craftsmanship group (Reuven Lerner)
Reuven Lerner's talk about dynamic programming languages in general, and about Ruby in particular. Why would you want to use a dynamic language? What can you do with one that isn't possible (or easy) with a static language?
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da... (Paul Leclercq)
Paris Spark Meetup - May 2017
Video : https://www.youtube.com/watch?v=w5Zd-1wIJrU
Ad-hoc analysis of radio station broadcasts stored in Parquet files with plain SQL and the DataFrame API.
The aim was to identify radio stations' habits and differences, and whether radio-station brainwashing is a thing.
This talk's Databricks notebook can be found here : https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6937750999095841/3645330882010081/6197123402747553/latest.html
What/How to do with GraphQL? - Valentyn Ostakh (ENG) | Ruby Meditation 27 (Ruby Meditation)
Talk by Valentyn Ostakh, Ruby Developer at Ruby Garage, at Ruby Meditation 27, Dnipro, 19.05.2019
Slideshare -
Next conference - http://www.rubymeditation.com/
This talk explores basic concepts of GraphQL.
The main goal is to show how GraphQL works and what parts it consists of.
From the Ruby side we will look at how to create a GraphQL schema.
In addition, we will consider what pitfalls can be encountered at the start of work with GraphQL.
Announcements and conference materials https://www.fb.me/RubyMeditation
News https://twitter.com/RubyMeditation
Photos https://www.instagram.com/RubyMeditation
The stream of Ruby conferences (not just ours) https://t.me/RubyMeditation
Ontop: Answering SPARQL Queries over Relational Databases (Guohui Xiao)
We present Ontop, an open-source Ontology-Based Data Access (OBDA) system that allows for querying relational data sources through a conceptual representation of the domain of interest, provided in terms of an ontology, to which the data sources are mapped. Key features of Ontop are its solid theoretical foundations, a virtual approach to OBDA, which avoids materializing triples and is implemented through the query rewriting technique, extensive optimizations exploiting all elements of the OBDA architecture, its compliance to all relevant W3C recommendations (including SPARQL queries, R2RML mappings, and OWL 2 QL and RDFS ontologies), and its support for all major relational databases.
What makes a search engine "intelligent"? In this talk I discuss MarkLogic's full text search features and demonstrate how to enhance search functionality using MarkLogic's new Search API to deliver better, faster results automatically. You will learn how to use Search API to include indexed facets alongside results and perform query expansion to add robust automatic semantic search for known entities and expand thesaurus terms to reduce false negatives.
Introduction to source{d} Engine and source{d} Lookout (source{d})
Join us for a presentation and demo of source{d} Engine and source{d} Lookout. Combining code retrieval, language-agnostic parsing, and git management tools with familiar APIs, source{d} Engine simplifies code analysis. source{d} Lookout is a service for assisted code review that enables running custom code analyzers on GitHub pull requests.
MongoDB Europe 2016 - Advanced MongoDB Aggregation Pipelines (MongoDB)
We will do a deep dive into the powerful query capabilities of MongoDB's Aggregation Framework, and show you how you can use MongoDB's built-in features to inspect the execution and tune the performance of your queries. And, last but not least, we will also give you a brief outlook into MongoDB 3.4's awesome new Aggregation Framework additions.
We will take a deep dive into ArangoDB (https://www.arangodb.com/) together with Max (https://www.linkedin.com/in/maxneunhoeffer), one of the core developers of the product.
ArangoDB is a multi-model database, which means that it is a document store, a key/value store and a graph database, all in one engine and with a query language that supports all three data models, as well as joins and transactions. Queries can use a single data model or can even mix them.
ArangoDB scales out horizontally with convenient cluster deployment using Apache Mesos. Furthermore, the HTTP API can easily be extended by server-side JavaScript code using high performance access to the C++ database core.
During the talk I will show all these features using several different cloud deployments, since in most projects one will not deploy an ArangoDB monolith, but rather multiple instances, each either a possibly replicated single server or a cluster. This demonstrates that all these properties together make ArangoDB a very useful and valuable tool in modern microservice-oriented architectures.
Recently, ArangoDB integrated its cluster management with Apache Mesos. This makes it now possible to launch an ArangoDB cluster on a Mesos cluster with a single, albeit complex shell command. In a DCOS-enabled Mesosphere cluster this is even easier, because one can use the dcos subcommand for ArangoDB, which essentially turns a Mesosphere cluster into a single, large computer.
In this talk I explain the whole setup and show (live on stage) how to deploy ArangoDB clusters on Amazon Web Services, and how we used this to scale ArangoDB up until it could sustain 1000000 document writes per second.
Recently, ArangoDB integrated its cluster management with Apache Mesos. This makes it now possible to launch an ArangoDB cluster on a Mesos cluster with a single, albeit complex shell command. In a DCOS-enabled Mesosphere cluster this is even easier, because one can use the dcos subcommand for ArangoDB, which essentially turns a Mesosphere cluster into a single, large computer.
In this talk I explain the whole setup and show (live on stage) how to deploy ArangoDB clusters on Google Compute Engine, and how we used this to scale ArangoDB up until it could sustain 1000000 document writes per second.
ArangoDB is an open source, multi-model NoSQL database that is written in C++ and embeds Google's V8 engine to implement the higher levels of its functionality in JavaScript. Recently we decided to switch from C++03 to C++11 for the database kernel. In this talk I will first give a short overview of the software architecture of ArangoDB and proceed to tell you about our practical experiences with the switch to C++11. I will explain which of the parts of the "new" standard have been more important and which have been less useful, and I will report about the difficulties we encountered.
Domain Driven Design is a software development process that focuses on finding a common language for the involved parties. This language and the resulting models are taken from the domain rather than the technical details of the implementation. The goal is to improve the communication between customers, developers and all other involved groups. Even if Eric Evan's book about this topic was written almost ten years ago, this topic remains important because a lot of projects fail for communication reasons.
Relational databases have their own language and influence the design of software in a direction further away from the domain: entities have to be created for the sole purpose of adhering to best practices of relational databases. Two kinds of NoSQL databases are changing that: document stores and graph databases. In a document store you can model a "contains" relation in a more natural way and thereby express whether this entity can exist outside of its surrounding entity. A graph database allows you to model relationships between entities in a straightforward way that can be expressed in the language of the domain.
In this talk I want to look at the way a multi-model database that combines a document store and a graph database can help you to model your problems in a way that is understandable for all parties involved, and explain the benefits of this approach for the software development process.
3–7. Documents and collections
{
  "_key": "123456",
  "_id": "chars/123456",
  "name": "Duck",
  "firstname": "Donald",
  "dob": "1934-11-13",
  "hobbies": ["Golf", "Singing", "Running"],
  "home": { "town": "Duck town",
            "street": "Lake Road",
            "number": 17 },
  "species": "duck"
}
When I say “document”, I mean “JSON”.
A “collection” is a set of documents in a DB.
The DB can inspect the values, allowing for secondary indexes.
Or one can just treat the DB as a key/value store.
Sharding: the data of a collection is distributed between multiple servers.
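As an illustration of the key/value view (reusing the example document above; the AQL below is a sketch, not taken from the slides), a lookup by primary key needs no query logic at all:

RETURN DOCUMENT("chars/123456")

while a secondary index on, say, the name attribute would let a query such as FOR c IN chars FILTER c.name == "Duck" RETURN c avoid a full collection scan.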
13–14. Graphs
(Diagram: vertices A, B, C, D, E, F, G connected by edges labelled “likes” and “hates”.)
A graph consists of vertices and edges.
Graphs model relations, can be directed or undirected.
Vertices and edges are documents.
Every edge has a _from and a _to attribute.
The database offers queries and transactions dealing with graphs.
For example, paths in the graph are interesting.
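For instance, an edge stating that A “likes” B could itself be stored as a document; the collection names in this sketch are invented for illustration:

{
  "_key": "4711",
  "_id": "likes/4711",
  "_from": "chars/A",
  "_to": "chars/B",
  "label": "likes"
}

Because edges are ordinary documents, they can carry arbitrary further attributes (weights, timestamps, …) and be indexed like any other document.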
15–17. Query 1
Fetch all documents in a collection
FOR p IN people
  RETURN p
[ { "name": "Schmidt", "firstname": "Helmut",
    "hobbies": ["Smoking"]},
  { "name": "Neunhöffer", "firstname": "Max",
    "hobbies": ["Piano", "Golf"]},
  ...
]
(Actually, a cursor is returned.)
18–19. Query 2
Use filtering, sorting and limit
FOR p IN people
  FILTER p.age >= @minage
  SORT p.name, p.firstname
  LIMIT @nrlimit
  RETURN { name: CONCAT(p.name, ", ", p.firstname),
           age: p.age }
[ { "name": "Neunhöffer, Max", "age": 44 },
  { "name": "Schmidt, Helmut", "age": 95 },
  ...
]
20–21. Query 3
Aggregation and functions
FOR p IN people
  COLLECT a = p.age INTO L
  FILTER a >= @minage
  RETURN { "age": a, "number": LENGTH(L) }
[ { "age": 18, "number": 10 },
  { "age": 19, "number": 17 },
  { "age": 20, "number": 12 },
  ...
]
22–23. Query 4
Joins
FOR p IN @@peoplecollection
  FOR h IN houses
    FILTER p._key == h.owner
    SORT h.streetname, h.housename
    RETURN { housename: h.housename,
             streetname: h.streetname,
             owner: p.name,
             value: h.value }
[ { "housename": "Firlefanz",
    "streetname": "Meyer street",
    "owner": "Hans Schmidt", "value": 423000 },
  ...
]
24. Query 5
Modifying data
FOR e IN events
  FILTER e.timestamp < "2014-09-01T09:53+0200"
  INSERT e IN oldevents
FOR e IN events
  FILTER e.timestamp < "2014-09-01T09:53+0200"
  REMOVE e._key IN events
25–26. Query 6
Graph queries
FOR x IN GRAPH_SHORTEST_PATH(
    "routeplanner", "germanCity/Cologne",
    "frenchCity/Paris", {weight: "distance"} )
  RETURN { begin: x.startVertex,
           end: x.vertex,
           distance: x.distance,
           nrPaths: LENGTH(x.paths) }
[ { "begin": "germanCity/Cologne",
    "end": {"_id": "frenchCity/Paris", ... },
    "distance": 550,
    "nrPaths": 10 },
  ...
]
38. Life of a query
1. Text and query parameters come from user
2. Parse text, produce abstract syntax tree (AST)
3. Substitute query parameters
4. First optimisation: constant expressions, etc.
5. Translate AST into an execution plan (EXP)
6. Optimise one EXP, produce many, potentially better EXPs
7. Reason about distribution in cluster
8. Optimise distributed EXPs
9. Estimate costs for all EXPs, and sort by ascending cost
10. Instantiate “cheapest” plan, i.e. set up execution engine
11. Distribute and link up engines on different servers
12. Execute plan, provide cursor API
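To make these steps concrete, the whole sequence can be pictured as a driver function roughly like the following C++ sketch; all type and function names here are illustrative placeholders, not ArangoDB's actual internals.

// Sketch of a query lifecycle driver (illustrative names, not ArangoDB's API).
#include <algorithm>
#include <string>
#include <vector>

// Placeholder types -- in reality each of these is a large component.
struct Ast {};
struct BindParameters {};
struct ExecutionPlan { double estimatedCost = 0.0; };
struct Cursor {};

// Stubs for the individual phases (illustrative signatures only).
Ast parseToAst(std::string const& /*queryText*/) { return {}; }              // 2. parse
void injectBindParameters(Ast& /*ast*/, BindParameters const& /*p*/) {}      // 3. substitute parameters
void optimizeAst(Ast& /*ast*/) {}                                            // 4. constant expressions etc.
ExecutionPlan astToPlan(Ast const& /*ast*/) { return {}; }                    // 5. AST -> execution plan
std::vector<ExecutionPlan> makePlanVariants(ExecutionPlan initial) {          // 6.-8. plan-level rules,
  return { initial };                                                         //      incl. cluster variants
}
Cursor instantiateAndExecute(ExecutionPlan const& /*plan*/) { return {}; }    // 10.-12. set up engines, run

Cursor runQuery(std::string const& text, BindParameters const& params) {      // 1. text and parameters
  Ast ast = parseToAst(text);
  injectBindParameters(ast, params);
  optimizeAst(ast);
  auto plans = makePlanVariants(astToPlan(ast));
  std::sort(plans.begin(), plans.end(),                                        // 9. sort by estimated cost
            [](ExecutionPlan const& a, ExecutionPlan const& b) {
              return a.estimatedCost < b.estimatedCost;
            });
  return instantiateAndExecute(plans.front());                                 // run the "cheapest" plan
}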
44. Execution plans
Query → EXP:
FOR a IN collA
  LET xx = a.x
  FOR b IN collB
    FILTER xx == b.y
    RETURN {x: a.x, z: b.z}
Resulting execution plan (the arrows are dependencies):
Singleton → EnumerateCollection a → Calculation xx
→ EnumerateCollection b → Calculation xx == b.y → Filter xx == b.y
→ Calculation {x: a.x, z: b.z} → Return {x: a.x, z: b.z}
Think of a pipeline.
Each node provides a cursor API.
Blocks of “items” travel through the pipeline.
What is an “item”?
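What an “item” is, is answered below; the cursor API itself can be pictured roughly like this C++ sketch, in which every block pulls a batch from its dependency, works on it, and hands the result on (illustrative code, not ArangoDB's real ExecutionBlock interface):

// Minimal sketch of a pull-based pipeline of execution blocks.
#include <cstddef>
#include <memory>
#include <optional>
#include <vector>

using Item = std::vector<double>;      // placeholder: one "register" value per variable
using ItemBlock = std::vector<Item>;   // blocks of items travel through the pipeline

class ExecutionBlock {
 public:
  explicit ExecutionBlock(std::unique_ptr<ExecutionBlock> dep) : dependency_(std::move(dep)) {}
  virtual ~ExecutionBlock() = default;
  // Cursor API: return up to atMost items, or nothing once exhausted.
  virtual std::optional<ItemBlock> getSome(std::size_t atMost) = 0;
 protected:
  std::unique_ptr<ExecutionBlock> dependency_;
};

// A filter block: pull a batch from the dependency and keep matching items.
class FilterBlock : public ExecutionBlock {
 public:
  FilterBlock(std::unique_ptr<ExecutionBlock> dep, std::size_t reg)
      : ExecutionBlock(std::move(dep)), reg_(reg) {}
  std::optional<ItemBlock> getSome(std::size_t atMost) override {
    while (true) {
      auto input = dependency_->getSome(atMost);
      if (!input) return std::nullopt;               // dependency exhausted
      ItemBlock output;
      for (auto& item : *input)
        if (item[reg_] != 0.0) output.push_back(std::move(item));
      if (!output.empty()) return output;            // skip over fully filtered batches
    }
  }
 private:
  std::size_t reg_;  // register holding the (boolean) filter condition
};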
48. Pipeline and items
The lower part of the pipeline for this query:
Singleton → EnumerateCollection a (FOR a IN collA) → Calculation xx (LET xx = a.x) → EnumerateCollection b (FOR b IN collB)
Items leaving the Singleton have no vars; items leaving Calculation xx have vars a and xx.
Items are the thingies traveling through the pipeline.
An item holds the values of those variables in the current frame.
Thus: items look different in different parts of the plan.
We always deal with blocks of items for performance reasons.
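One way to picture a block of items is a little table: one row per item, one column (register) per variable that is visible at this point of the plan. A minimal sketch, again with made-up types rather than ArangoDB's actual item blocks:

// Sketch of a block of items: rows = items, columns = registers,
// one register per variable that is alive at this point in the plan.
#include <cstddef>
#include <string>
#include <vector>

struct Value { std::string json; };    // placeholder for an AQL value

class ItemBlock {
 public:
  ItemBlock(std::size_t rows, std::size_t registers)
      : registers_(registers), data_(rows * registers) {}
  // Value of variable (register) `reg` in item (row) `row`.
  Value& at(std::size_t row, std::size_t reg) { return data_[row * registers_ + reg]; }
  std::size_t size() const { return registers_ == 0 ? 0 : data_.size() / registers_; }
 private:
  std::size_t registers_;        // differs between parts of the plan
  std::vector<Value> data_;      // stored contiguously, processed block-wise
};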
49. Execution plans
FOR a IN collA
  LET xx = a.x
  FOR b IN collB
    FILTER xx == b.y
    RETURN {x: a.x, z: b.z}
Execution plan:
Singleton → EnumerateCollection a → Calculation xx
→ EnumerateCollection b → Calculation xx == b.y → Filter xx == b.y
→ Calculation {x: a.x, z: b.z} → Return {x: a.x, z: b.z}
51. Move filters up
FOR a IN collA
  FOR b IN collB
    FILTER a.x == 10
    FILTER a.u == b.v
    RETURN {u: a.u, w: b.w}
The result and behaviour do not change if the first FILTER is pulled out of the inner FOR.
Execution plan before the rule:
Singleton → EnumColl a → EnumColl b → Calc a.x == 10 → Filter a.x == 10
→ Calc a.u == b.v → Filter a.u == b.v → Return {u: a.u, w: b.w}
53. Move filters up
FOR a IN collA
  FILTER a.x == 10
  FOR b IN collB
    FILTER a.u == b.v
    RETURN {u: a.u, w: b.w}
The result and behaviour do not change if the first FILTER is pulled out of the inner FOR.
However, the number of items traveling in the pipeline is decreased.
Note that the two FOR statements could be interchanged!
Execution plan after the rule:
Singleton → EnumColl a → Calc a.x == 10 → Filter a.x == 10
→ EnumColl b → Calc a.u == b.v → Filter a.u == b.v → Return {u: a.u, w: b.w}
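The rule itself only needs one check: a Filter may move towards the Singleton as long as it does not pass a node that introduces a variable its condition depends on; the Calculation that produces the condition therefore acts as a natural barrier. A rough sketch with made-up node structures (not ArangoDB's optimiser code):

// Sketch of the "move filters up" optimiser rule: push a Filter towards the
// Singleton past any node that does not set a variable the filter uses.
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Node {
  std::string type;                 // "EnumerateCollection", "Filter", ...
  std::set<std::string> setsVars;   // variables this node introduces
  std::set<std::string> usesVars;   // variables this node reads
};

// plan[0] is the Singleton; dependencies run from left to right.
// A real rule must also stop at nodes like LIMIT that it may not cross.
void moveFiltersUp(std::vector<Node>& plan) {
  for (std::size_t i = 1; i < plan.size(); ++i) {
    if (plan[i].type != "Filter") continue;
    std::size_t pos = i;
    while (pos > 1) {
      Node const& below = plan[pos - 1];
      bool introducesUsedVar = std::any_of(
          plan[pos].usesVars.begin(), plan[pos].usesVars.end(),
          [&](std::string const& v) { return below.setsVars.count(v) > 0; });
      if (introducesUsedVar) break;          // cannot move past this node
      std::swap(plan[pos], plan[pos - 1]);   // move the filter one step up
      --pos;
    }
  }
}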
55. Remove unnecessary calculations
FOR a IN collA
  LET L = LENGTH(a.hobbies)
  FOR b IN collB
    FILTER a.u == b.v
    RETURN {h: a.hobbies, w: b.w}
The calculation of L is unnecessary!
Execution plan:
Singleton → EnumColl a → Calc L = ... → EnumColl b
→ Calc a.u == b.v → Filter a.u == b.v → Return {...}
57. Remove unnecessary calculations
FOR a IN collA
  FOR b IN collB
    FILTER a.u == b.v
    RETURN {h: a.hobbies, w: b.w}
The calculation of L is unnecessary (since it cannot throw an exception), therefore we can just leave it out.
Execution plan after the rule:
Singleton → EnumColl a → EnumColl b
→ Calc a.u == b.v → Filter a.u == b.v → Return {...}
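In the same spirit, this rule only has to check that the variable produced by a Calculation is never read by any node between the Calculation and the Return, and that its expression has no side effects and cannot throw. A rough sketch, again with made-up structures rather than ArangoDB's optimiser code:

// Sketch of the "remove unnecessary calculations" rule: drop a Calculation
// whose output variable nobody reads, provided the expression is
// side-effect free and cannot throw.
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct Node {
  std::string type;                 // "Calculation", "Filter", ...
  std::string outVar;               // variable produced (Calculation only)
  std::set<std::string> usesVars;   // variables read by this node
  bool canThrow = false;            // e.g. division, user-defined functions
};

// plan[0] is the Singleton; later nodes consume the output of earlier ones.
void removeUnusedCalculations(std::vector<Node>& plan) {
  for (std::size_t i = plan.size(); i-- > 0;) {
    if (plan[i].type != "Calculation" || plan[i].canThrow) continue;
    bool used = false;
    for (std::size_t j = i + 1; j < plan.size(); ++j)
      if (plan[j].usesVars.count(plan[i].outVar) > 0) { used = true; break; }
    if (!used) plan.erase(plan.begin() + i);  // nobody reads L, drop the node
  }
}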
58. Use index for FILTER and SORT
FOR a IN collA
  FILTER a.x > 17 &&
         a.x <= 23 &&
         a.y == 10
  SORT a.y, a.x
  RETURN a
Execution plan:
Singleton → EnumColl a → Calc ... → Filter ...
→ Sort a.y, a.x → Return a
60. Use index for FILTER and SORT
FOR a IN collA
  FILTER a.x > 17 &&
         a.x <= 23 &&
         a.y == 10
  SORT a.y, a.x
  RETURN a
Assume collA has a skiplist index on “y” and “x” (in this order), then we can read off the half-open interval between { y: 10, x: 17 } and { y: 10, x: 23 } from the skiplist index.
Execution plan using the index:
Singleton → IndexRange a → Sort a.y, a.x → Return a
61. Use index for FILTER and SORT
FOR a IN collA
  FILTER a.x > 17 &&
         a.x <= 23 &&
         a.y == 10
  SORT a.y, a.x
  RETURN a
Assume collA has a skiplist index on “y” and “x” (in this order), then we can read off the half-open interval between { y: 10, x: 17 } and { y: 10, x: 23 } from the skiplist index.
The result will automatically be sorted by y and then by x.
Execution plan using the index, with the Sort removed:
Singleton → IndexRange a → Return a
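The heart of this rule is to match the FILTER condition against a prefix of the index fields and turn it into a lookup interval; if the SORT criteria then coincide with the index fields, the Sort node can be dropped as well. A small sketch of the bounds construction, with made-up types (ArangoDB's real condition handling is considerably more general):

// Sketch: turn conjunctive FILTER conditions into skiplist lookup bounds.
// For the example (index on ["y","x"], FILTER a.y == 10 && a.x > 17 && a.x <= 23)
// this yields the half-open interval between {y:10, x:17} and {y:10, x:23}.
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct Range { double low; bool lowIncluded; double high; bool highIncluded; };
struct Condition {
  std::map<std::string, double> equalities;   // attr == value
  std::map<std::string, Range> ranges;        // attr constrained to a range
};

struct IndexBounds {
  std::vector<std::pair<std::string, double>> equalityPrefix;
  std::optional<std::pair<std::string, Range>> finalRange;
};

// Match index fields (in order) against the condition: equalities may be
// followed by at most one range; anything after that cannot use the index.
std::optional<IndexBounds> buildBounds(std::vector<std::string> const& indexFields,
                                       Condition const& cond) {
  IndexBounds bounds;
  for (auto const& field : indexFields) {
    if (auto it = cond.equalities.find(field); it != cond.equalities.end()) {
      bounds.equalityPrefix.emplace_back(field, it->second);
    } else if (auto rt = cond.ranges.find(field); rt != cond.ranges.end()) {
      bounds.finalRange.emplace(field, rt->second);
      break;                                  // a range ends the usable prefix
    } else {
      break;                                  // field not constrained at all
    }
  }
  if (bounds.equalityPrefix.empty() && !bounds.finalRange) return std::nullopt;
  return bounds;
}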
63. Data distribution in a cluster
[Diagram: client requests arrive at the coordinators, which talk to three DBservers holding the numbered shards of a collection.]
The shards of a collection are distributed across the DBservers.
The coordinators receive queries and organise their execution.
70. Modifying queries
Fortunately:
There can be at most one modifying node in each query.
There can be no modifying nodes in subqueries.
Modifying nodes
The modifying node in a query is executed on the DBservers. To this end, we either scatter the items to all DBservers, or, if possible, we distribute each item to the shard that is responsible for the modification.
Sometimes, we can even optimise away a gather/scatter combination and parallelise completely.
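The choice between scattering and distributing can be sketched like this: if an item carries the shard-key attribute, the responsible shard can be computed directly and the item is sent only there; otherwise it has to go to all shards (illustrative code, not ArangoDB's actual sharding logic):

// Sketch: route an item to the shard(s) responsible for a modification.
// If the shard-key attribute is present we can hash it to a single shard,
// otherwise we have to scatter the item to all shards.
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Item = std::map<std::string, std::string>;   // attribute -> value (placeholder)

std::vector<std::size_t> targetShards(Item const& item,
                                      std::string const& shardKey,
                                      std::size_t numberOfShards) {
  auto it = item.find(shardKey);
  if (it != item.end()) {
    // Distribute: exactly one shard is responsible for this item.
    return { std::hash<std::string>{}(it->second) % numberOfShards };
  }
  // Scatter: we do not know the responsible shard, send to all of them.
  std::vector<std::size_t> all(numberOfShards);
  for (std::size_t i = 0; i < numberOfShards; ++i) all[i] = i;
  return all;
}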