Max Neunhöffer – Joins and aggregations in a distributed NoSQL DB
NoSQL databases are renowned for their good horizontal scalability, and sharding is these days an essential feature for every self-respecting DB. However, most systems choose to offer fewer features for joins and aggregations in queries than traditional relational DBMSs do. In this talk I report on the joys and pains of (re-)implementing the powerful query language AQL, with joins and aggregations, for ArangoDB. I will cover (distributed) execution plans, query optimisation and data locality issues.
Understanding Graph Databases with Neo4j and Cypher – Ruhaim Izmeth
Introduction to graph database concepts, explained by comparing them with the widely popular relational databases and the SQL query language. Neo4j and Cypher are used to describe how graph databases work in real life.
Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It provides reliable storage through its distributed file system and scalable processing through its MapReduce programming model. Yahoo! uses Hadoop extensively for applications like log analysis, content optimization, and computational advertising, processing over 6 petabytes of data across 40,000 machines daily.
What is the best full text search engine for Python? – Andrii Soldatenko
Nowadays we can see lots of benchmarks and performance tests of different web frameworks and Python tools. When it comes to search engines, though, it is difficult to find useful information, especially benchmarks or comparisons between different engines. It is hard to decide which search engine to select: Elasticsearch, Postgres full-text search, or maybe Sphinx or Whoosh. You face a difficult choice, which is why I am pleased to share my acquired experience and benchmarks, and to focus on how to compare full text search engines for Python.
Internet search and e-reputation – Bernard André
The document discusses searching on the internet and managing one's online identity and digital footprint. It provides tips for how to search more effectively using different search engines and keywords. It also discusses the importance of verifying information from multiple sources and being aware that personal information can be found online. Employers may examine an applicant's online profile, so one's digital identity could impact career opportunities.
What/How to do with GraphQL? – Valentyn Ostakh (ENG) | Ruby Meditation 27 – Ruby Meditation
Talk by Valentyn Ostakh, Ruby developer at Ruby Garage, at Ruby Meditation 27, Dnipro, 19.05.2019
Slideshare -
Next conference - http://www.rubymeditation.com/
This talk explores basic concepts of GraphQL.
The main goal is to show how GraphQL works and what parts it consists of.
From the Ruby side we will look at how to create a GraphQL schema.
In addition, we will consider what pitfalls can be encountered at the start of work with GraphQL.
Announcements and conference materials https://www.fb.me/RubyMeditation
News https://twitter.com/RubyMeditation
Photos https://www.instagram.com/RubyMeditation
The stream of Ruby conferences (not just ours) https://t.me/RubyMeditation
The new MongoDB aggregation framework provides a more powerful and performant way to perform data aggregation compared to the existing MapReduce functionality. The aggregation framework uses a pipeline of aggregation operations like $match, $project, $group and $unwind. It allows expressing data aggregation logic through a declarative pipeline in a more intuitive way without needing to write JavaScript code. This provides better performance than MapReduce as it is implemented in C++ rather than JavaScript.
These are slides from our Big Data Warehouse Meetup in April. We talked about NoSQL databases: What they are, how they’re used and where they fit in existing enterprise data ecosystems.
Mike O’Brian from 10gen introduced the syntax and usage patterns for a new aggregation system in MongoDB and gave some demonstrations of aggregation using the new system. The new MongoDB aggregation framework makes it simple to do tasks such as counting, averaging, and finding minima or maxima while grouping by keys in a collection, complementing MongoDB’s built-in map/reduce capabilities.
For more information, visit our website at http://casertaconcepts.com/ or email us at info@casertaconcepts.com.
Java Performance Tips (So Code Camp San Diego 2014) – Kai Chan
Slides for my presentation at SoCal Code Camp, June 29, 2014 (http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=68942cd0-6714-4753-a218-20d4b48da07d)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013) – Kai Chan
These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=8cdfd955-2cd4-44a2-ad08-5353e079685a
Search Engine-Building with Lucene and Solr – Kai Chan
These are the slides for the session I presented at SoCal Code Camp San Diego on July 27, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6b28337d-6eae-4003-a664-5ed719f43533
Google Code Search was a code search engine that indexed open source code from various sources online. It allowed programmers to search code using regular expressions, keywords, and other metadata tags. However, Google discontinued the service in 2013. Popular alternatives to Google Code for hosting and searching code include GitHub, Bitbucket, and CodePlex. These services provide version control, code review, issue tracking, and other collaboration features for developers.
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013) – Kai Chan
These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=cc1e6803-b0ec-4832-b8df-e15ea7bd7694
Complex queries in a distributed multi-model database – Max Neunhöffer
A multi-model database is a document store, a graph database and a key/value store at the same time. To allow for convenient and powerful querying, such a database needs a query language that understands all three data models and allows mixing them in a single query. For example, it should be possible to find some documents in a collection according to some criteria, then follow some edges in a graph in which the documents represent vertices, and finally join the results with documents from yet another collection.
In this talk I will explain how a query engine for such a language works and give an overview of the life of a query, from parsing, over translation into an execution plan, through the optimisation phase, to the final execution. I will show what distributed query execution plans look like, how the query optimiser reasons about them and how the distributed execution works.
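To make this concrete, here is a hedged AQL sketch of such a mixed-model query, written in the graph traversal syntax of later ArangoDB versions; all collection, edge collection and attribute names are invented for illustration:

// Filter documents, traverse edges from each match, then join the
// reached vertices with documents from another collection.
FOR d IN docs
  FILTER d.interesting == true
  FOR v IN 1..2 OUTBOUND d knows      // 'knows' is an assumed edge collection
    FOR o IN other
      FILTER o.ref == v._key
      RETURN { doc: d, vertex: v, joined: o }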
Powerful Analysis with the Aggregation Pipeline – MongoDB
Speaker: Asya Kamsky
Think you need to move your data "elsewhere" to do powerful analysis? Think again. The most efficient way to analyze your data is where it already lives. MongoDB Aggregation Pipeline has been getting more and more powerful and using new stages, expressions and tricks we can do extensive analysis of our data inside MongoDB Server.
"Powerful Analysis with the Aggregation Pipeline (Tutorial)"MongoDB
This document discusses MongoDB aggregation pipelines and array expressions. It provides examples of various array expression operators like $map, $filter, and $reduce. $map projects each element of an array to a new array. $filter filters an array to only elements that match a condition. $reduce iterates an array and combines element values using an accumulating parameter to return a single value. The document demonstrates how these expressions work with sample array data.
Speaker: Asya Kamsky
MongoDB Aggregation Language has been getting more powerful and expressive with every release. In this talk we'll review how to create powerful aggregation pipelines and how to leverage aggregation expressions in your queries.
[MongoDB.local Bengaluru 2018] Tutorial: Pipeline Power - Doing More with Mon... – MongoDB
This document discusses using MongoDB aggregation to analyze and transform data. It provides examples of using aggregation pipelines and stages like $match, $group, $unwind, $sort, $limit, $project to perform analytics on document data. It also covers expressions like $map, $filter, $reduce that can be used in aggregation to transform arrays and extract/calculate values. The document emphasizes best practices for readability when writing complex aggregation expressions.
This document discusses mapping and analysis in ElasticSearch. It explains that mapping defines how documents are indexed and stored, including specifying field types and custom analyzers. Different analyzers, like standard, simple, and language-specific analyzers, tokenize and normalize text differently. Inner objects and arrays in documents are flattened during indexing for search. The document provides examples of mapping definitions and using the _analyze API to test analyzers.
The document discusses MongoDB's Aggregation Framework, which allows users to perform ad-hoc queries and reshape data in MongoDB. It describes the key components of the aggregation pipeline including $match, $project, $group, $sort operators. It provides examples of how to filter, reshape, and summarize document data using the aggregation framework. The document also covers usage and limitations of aggregation as well as how it can be used to enable more flexible data analysis and reporting compared to MapReduce.
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da... – Paul Leclercq
Paris Spark Meetup - May 2017
Video : https://www.youtube.com/watch?v=w5Zd-1wIJrU
Ad-hoc analysis of radio station broadcasts stored in Parquet files, with plain SQL and the DataFrame API.
The aim was to spot radio stations' habits and differences, and to find out whether radio-station brainwashing is a thing.
This talk's Databricks notebook can be found here : https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6937750999095841/3645330882010081/6197123402747553/latest.html
MongoDB World 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pipeline Em... – MongoDB
Aggregation pipeline has been able to power your analysis of data since version 2.2. In 4.2 we added more power and now you can use it for more powerful queries, updates, and outputting your data to existing collections. Come hear how you can do everything with the pipeline, including single-view, ETL, data roll-ups and materialized views.
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++ – MongoDB
Aggregation pipeline has been able to power your analysis of data since version 2.2. In 4.2 we added more power and now you can use it for more powerful queries, updates, and outputting your data to existing collections. Come hear how you can do everything with the pipeline, including single-view, ETL, data roll-ups and materialized views.
The core Search frameworks in Liferay 7 have been significantly retooled to benefit not only from Liferay's new modular architecture, but also from one of the most innovative players in the market: Elasticsearch, which replaces Lucene as the default search engine in Portal. This session will cover topics like clustering and scalability, unveil improvements (both Elasticsearch and Solr) like aggregations, filters, geolocation, "more like this" and other new query types, and also hot new features for the Enterprise like out-of-the-box Marvel cluster monitoring and Shield security.
André "Arbo" Oliveira joined Liferay in early 2014 as a senior engineer and leads the Search Infrastructure team. He's been writing code for a living for 22 years, 14 of them as a Java developer and architect. Ever since discovering Elasticsearch, he's vowed never to write another SQL WHERE clause again.
MongoDB .local Toronto 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi... – MongoDB
Aggregation pipeline has been able to power your analysis of data since version 2.2. In 4.2 we added more power and now you can use it for more powerful queries, updates, and outputting your data to existing collections. Come hear how you can do everything with the pipeline, including single-view, ETL, data roll-ups and materialized views.
MongoDB .local Chicago 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi... – MongoDB
Aggregation pipeline has been able to power your analysis of data since version 2.2. In 4.2 we added more power and now you can use it for more powerful queries, updates, and outputting your data to existing collections. Come hear how you can do everything with the pipeline, including single-view, ETL, data roll-ups and materialized views.
This document provides an introduction to using ElasticSearch. It discusses installing ElasticSearch and making API calls. It demonstrates indexing employee documents with fields like name, age, interests. It shows how to search for documents by field values or full text, do phrase searches, and highlight search terms. It also introduces analytics capabilities like aggregations to analyze field values.
This document provides an overview of key Python concepts including variables, data types, operators, conditional statements, loops, functions, classes, modules and file handling. It also discusses Python collections like lists, tuples, sets and dictionaries. Finally, it covers how to connect Python to MySQL databases and perform basic operations like selecting, inserting, updating and deleting data.
Dynamic languages, for software craftmanship group – Reuven Lerner
Reuven Lerner's talk about dynamic programming languages in general, and about Ruby in particular. Why would you want to use a dynamic language? What can you do with one that isn't possible (or easy) with a static language?
I survey three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package. I also discuss some methods for visualizing large data sets.
Webinar: Strongly Typed Languages and Flexible Schemas – MongoDB
This document discusses strategies for managing flexible schemas in strongly typed languages and databases, including decoupled architectures, object-document mappers (ODMs), versioning, and data migrations. It describes how decoupled architectures allow business logic and data storage to evolve independently. ODMs like Spring Data and Morphia reduce impedance mismatch and handle mapping between objects and database documents. Versioning strategies include incrementing fields, storing full documents, or maintaining separate collections for current and past versions. Migrations involve adding/removing fields, changing names/data types, or extracting embedded documents. The document outlines tradeoffs between these approaches.
Similar to Max Neunhöffer – Joins and aggregations in a distributed NoSQL DB - NoSQL matters Barcelona 2014 (20)
Nathan Ford – Divination of the Defects (Graph-Based Defect Prediction through... – NoSQLmatters
While metrics generated by static code analysis are well established as predictors of possible future defects, there is another untapped source of useful information, namely your source code revision history. This presentation will discuss converting this revision information into a graph representation, various defect prediction models and how to generate their related change metrics through graph traversal, as well as the potential applications and benefits of these graph enabled prediction models.
Stefan Hochdörfer – The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt... – NoSQLmatters
PostgreSQL is well known as an object-relational database management system. At its core, PostgreSQL is schema-aware, dealing with fixed database tables and column types. However, recent versions of PostgreSQL have made it possible to deal with schema-free data. Learn which new features PostgreSQL supports and how to use those features in your application.
NoSQL matters, on that much I'm sure we can all agree. But if we take a closer look, what really matters when it comes to choosing a data store and/or a data processing platform? What really matters when it comes to getting the most out of that platform? And what is really going to matter as we take things to the next level?
Peter Bakas – Zero to Insights - Real time analytics with Kafka, C*, and Spar... – NoSQLmatters
In this talk, Peter will cover his experience using Spark, Cassandra & Kafka to build a real-time analytics platform that processed billions of events a day. He will cover the challenges of turning all those raw events into actionable insights. He will also cover scaling the platform across multiple regions, as well as across multiple cloud environments.
Dan Sullivan – Data Analytics and Text Mining with MongoDB - NoSQL matters Du... – NoSQLmatters
Data analysis is an exploratory process that requires a variety of tools and a flexible data store. Data analysis projects are easy to start but quickly become difficult to manage and error-prone when they depend on file-based data storage. Relational databases are poorly equipped to accommodate the dynamic demands of complex analysis. This talk describes best practices for using MongoDB for analytics projects. Examples will be drawn from a large-scale text mining project (approximately 25 million documents) that applies machine learning (neural networks and support vector machines) and statistical analysis. Tools discussed include R, Spark, the Python scientific stack, and custom pre-processing scripts, but the focus is on using these with the document database.
Mark Harwood – Building Entity Centric Indexes - NoSQL matters Dublin 2015 – NoSQLmatters
Sometimes we need to step back and take a look at the bigger picture - not just counting huge piles of individual log records, but reasoning about the behaviors of the people who are ultimately generating this firehose of data. While your DevOps folks care deeply about log records from a machine utilization perspective, marketing wants to know what these records tell us about the customers' needs. Elasticsearch Aggregations are a great feature but are not a panacea. We can happily use them to summarise complex things like the number of web requests per day broken down by geography and browser type on a busy website, but we would quickly run out of memory if we tried to calculate something as simple as a single number for the average duration of visitor web sessions when using the very same dataset. Why does this occur? A web session duration is an example of a behavioural attribute not held on any one log record; it has to be derived by finding the first and last records for each session in our weblogs, requiring some complex query expressions and a lot of memory to connect all the data points. We can maintain a more useful joined-up picture if we run an ongoing background process to fuse related events from one index into "entity-centric" summaries in another index, e.g.:
• Web log events summarised into "web session" entities
• Road-worthiness test results summarised into "car" entities
• Reviews in a marketplace summarised into a "reviewer" entity
Using real data, this session will demonstrate how to incrementally build entity-centric indexes alongside event-centric indexes by using simple scripts to uncover interesting behaviours that accumulate over time. We'll explore:
• Which cars are driven long distances after failing roadworthiness tests?
• Which website visitors look to be behaving like "bots"?
• Which seller in my marketplace has employed an army of "shills" to boost his feedback rating?
Attendees will leave this session with all the tools required to begin building entity-centric indexes and using that data to derive richer business insights across every department in their organization.
Prassnitha Sampath – Real Time Big Data Analytics with Kafka, Storm & HBase -... – NoSQLmatters
Relevance and personalization are crucial to building a personalized local commerce experience at Groupon. This talk gives an overview of the real-time analytics infrastructure that handles over 3 million events per second and scales to billions of data points. It shows how our Kafka -> Storm -> Redis/HBase pipeline is used to generate real-time analytics for hundreds of millions of Groupon users, and covers various architecture design choices and trade-offs, including some interesting algorithmic choices such as Bloom filters and HyperLogLog. Attendees can take away learnings from our real-life experience that can help them understand various tuning methods, their trade-offs, and how to apply them in their own solutions.
Akmal Chaudhri – How to Build Streaming Data Applications: Evaluating the Top... – NoSQLmatters
Building applications on streaming data has its challenges. If you are trying to use programs such as Apache Spark or Storm to build applications, this presentation will explain the advantages and disadvantages of each solution and how to choose the right tool for your next streaming data project. Building streaming data applications that can manage the massive quantities of data generated from mobile devices, M2M, sensors and other IoT devices is a big challenge that many organizations face today. Traditional tools, such as conventional database systems, do not have the capacity to ingest data, analyze it in real time, and make decisions. New technologies such as Apache Spark and Storm are now coming to the forefront as possible solutions for handling fast data streams. Typical technology choices fall into one of three categories: OLAP, OLTP, and stream-processing systems. Each of these solutions has its benefits, but some choices support streaming data and application development much better than others. Employing a solution that handles streaming data, provides state, ensures durability, and supports transactions and real-time decisions is key to benefitting from fast data. During this presentation you will learn:
- The difference between fast OLAP, stream-processing, and OLTP database solutions.
- The importance of state, real-time analytics and real-time decisions when building applications on streaming data.
- How streaming applications deliver more value when built on a super-fast in-memory SQL database.
Just a few years ago all software systems were designed to be monoliths running on a single big and powerful machine. But nowadays most companies prefer to scale out instead of scaling up, because it is much easier to buy or rent a large cluster of commodity hardware than to get a single machine that is powerful enough. In the database area, scaling out is realized by a combination of polyglot persistence and sharding of data. On the application level, scaling out is realized by microservices. In this talk I will briefly introduce the concepts and ideas of microservices and discuss their benefits and drawbacks. Afterwards I will focus on the intersection of a microservice-based application talking to one or many NoSQL databases. We will try to find answers to these questions: What are the differences to a monolithic application? How do we scale the whole system properly? What about polyglot persistence? Is there a data-centric way to split microservices?
Chris Ward – Understanding databases for distributed docker applications - No... – NoSQLmatters
In this talk we'll focus on the use of Crate alongside Weave in Docker containers, the technical challenges, the best practices learned, and getting a big data application running alongside it. You'll learn about the reasons why Crate.IO is building "yet another NoSQL database" and why it's unique and important when running web-scale containerized applications. We'll show why the shared-nothing architecture is so important when deploying large clusters in containers and how it addresses the issues and fears around a Docker-based persistence layer. You will learn how to deploy a Crate cluster in the cloud within minutes using Docker, some of the challenges you'll encounter, and how to overcome them in order to scale your backends efficiently. We focused on super-simple integration with any cloud provider, striving to be as turnkey as possible, with minimal up-front configuration required to establish a cluster. Once established, we'll show how to scale the cluster horizontally by simply adding more nodes. The session will also give examples of when you should use Crate compared to other similar technologies such as MongoDB, Hadoop, Cassandra or FoundationDB. We'll talk about this approach's strengths and what types of applications are well suited for this type of data store, as well as what is not. Finally we'll outline how to architect an application that is easy to scale using Crate and Docker.
Philipp Krenn – Host your database in the cloud, they said... - NoSQL matters... – NoSQLmatters
More than two years ago we faced the decision whether to run our MongoDB database on Amazon's EC2 ourselves or to rely on a Database as a Service provider. Common wisdom told us that a well known provider, focusing all its knowledge and energy on running MongoDB, would be a better choice than us trying it on the side. Well, this talk describes what can go wrong, since we have seen a lot of interesting minor and major hiccups — including stopped instances, broken backups, a major security incident, and more broken backups. Additionally, we discuss some reasons why a hosted solution is not always the better choice and which new challenges arise from it.
Lucian Precup – Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters... – NoSQLmatters
What if we tried to make Elasticsearch SQL 92 compliant (http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt)? This wouldn't serve much purpose nowadays, you might say. Well, we actually did the exercise and we have some interesting conclusions. While we take Elasticsearch as the example for this side-by-side comparison, the issues we address also apply to NoSQL in general. With this unusual exercise, we take the occasion to compare relational databases/SQL with Elasticsearch/NoSQL on all levels: functionality, semantics, performance and user experience.
Bruno Guedes – Hadoop real time for dummies - NoSQL matters Paris 2015 – NoSQLmatters
There are many frameworks that can offer real time on top of Hadoop. This talk will show you the usage of Pivotal HAWQ and how easy it is to use SQL for querying your Hadoop data. Come and see the power and ease of use that can help you in using the Hadoop ecosystem.
DuyHai DOAN – Real time analytics with Cassandra and Spark - NoSQL matters Pa... – NoSQLmatters
Apache Spark is a general data processing framework which allows you to perform map-reduce tasks (but not only those) in memory. Apache Cassandra is a highly available and massively scalable NoSQL data store. By combining Spark's flexible API and Cassandra's performance, we get an interesting alternative to the Hadoop ecosystem for both real-time and batch processing. During this talk we will highlight the tight integration between Spark and Cassandra and demonstrate some usages with a live code demo.
Benjamin Guinebertière – Microsoft Azure: Document DB and other noSQL databas... – NoSQLmatters
When deploying your service to Microsoft Azure, you have a number of options in terms of NoSQL: you can install databases on Linux or Windows virtual machines by yourself or via the marketplace, or you can use open source databases available as a service, like HBase, or proprietary managed databases, like Document DB. After showing these options, we'll show Document DB in more detail. This is a NoSQL database as a service that stores JSON.
David Pilato – Advance search for your legacy application - NoSQL matters Par... – NoSQLmatters
How do you mix the SQL and NoSQL worlds without starting a messy revolution? This live coding talk will show you how to add Elasticsearch to your legacy application without changing all your current development habits. Your application will suddenly have advanced search features, all without the need to write complex SQL code! David will start from a Spring, Hibernate and PostgreSQL based application and will add a complete integration of Elasticsearch, all live from the stage during his presentation.
Tugdual Grall – From SQL to NoSQL in less than 40 min - NoSQL matters Paris 2015 – NoSQLmatters
During this live-coding session, Tugdual will move an old-fashioned full-SQL application (Java EE) to the new NoSQL world. Using MongoDB and REST, he will show the benefits of this new architecture: ease of use, flexibility, high availability and scalability. During this presentation, you will learn more about the document-oriented model, JSON, REST and iterative development. This demonstration is also a good opportunity to see how you can migrate data from a relational database, and the various schema options.
Gregorry Letribot – Druid at Criteo - NoSQL matters 2015 – NoSQLmatters
How do you monitor performance for one of your clients on a specific user segmentation when dealing with billions of events a day? With over 2 billion ads served and 230 TB of data processed a day, we at Criteo have a comprehensive need for an interactive analytics stack. And by interactive, we mean a querying system with dynamic filtering to drill down over multiple dimensions, answering within sub-second latency. This session will take you on our journey with Druid, "an open-source data store designed for real-time exploratory analytics on large data sets". We will explore Druid's architecture and notable concepts, how relevant they are for some use cases, and how it really performs.
In many modern applications the database side is realized using polyglot persistence: store each data format (graphs, documents, etc.) in an appropriate separate database. This approach yields several benefits, since each database is optimized for its specific duty, but there are also drawbacks: all databases have to be kept in sync, queries might require data from several databases, and experts are needed for all the systems used. A multi-model database is not restricted to one data format, but can cope with several of them. In this talk I will present how a multi-model database can be used in a polyglot persistence setup and how it can reduce the effort drastically.
Rob Harrop – Keynote: The Good, the Bad and the Ugly - NoSQL matters Paris 2015 – NoSQLmatters
The impact that NoSQL has had on the technology community cannot be overstated. The proliferation of new and exciting data systems has led to a slew of interesting solutions to problems that were once solved the relational way. In this session we explore all that is great and good about NoSQL: the innovative software, the clever storage paradigms and the reigniting of developer interest in data access. It is unfortunate that NoSQL is not only a force for good in our community. We'll explore some of the darker corners of NoSQL: the disregard for years of proven technology, the overbearing hype, the overblown marketing and the ever present arguments over which technology is best. We close the session by exploring what can be done to extract even more value from the NoSQL movement, where we can improve how the community interacts with the larger technology community and what the future holds for data access technologies.
3–7. Documents and collections

{
  "_key": "123456",
  "_id": "chars/123456",
  "name": "Duck",
  "firstname": "Donald",
  "dob": "1934-11-13",
  "hobbies": ["Golf", "Singing", "Running"],
  "home": { "town": "Duck town",
            "street": "Lake Road",
            "number": 17 },
  "species": "duck"
}

When I say “document”, I mean “JSON”.
A “collection” is a set of documents in a DB.
The DB can inspect the values, allowing for secondary indexes.
Or one can just treat the DB as a key/value store.
Sharding: the data of a collection is distributed between multiple servers.
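As a small aside, here is a hedged AQL sketch of the two access styles just mentioned; the chars collection is taken from the example document above, while the query itself and the assumed secondary index on species are my own illustration:

// Key/value style: direct lookup by document key; _key is always indexed.
RETURN DOCUMENT("chars/123456")

// Document style: filter on an attribute value; with a secondary index
// on `species`, the database can avoid a full collection scan.
FOR c IN chars
  FILTER c.species == "duck"
  RETURN c.name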
9–14. Graphs

[Diagram: a graph with vertices A–G and directed edges, two of them labelled "likes" and "hates".]

A graph consists of vertices and edges.
Graphs model relations, can be directed or undirected.
Vertices and edges are documents.
Every edge has a _from and a _to attribute.
The database offers queries and transactions dealing with graphs.
For example, paths in the graph are interesting.
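To make the _from/_to convention concrete, here is a hedged sketch of what an edge document could look like; the edge collection name "likes", the key and the second vertex key are invented for illustration:

{
  "_key": "7",
  "_id": "likes/7",
  "_from": "chars/123456",
  "_to": "chars/654321",
  "label": "likes"
}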
15–17. Query 1
Fetch all documents in a collection

FOR p IN people
  RETURN p

[ { "name": "Schmidt", "firstname": "Helmut",
    "hobbies": ["Smoking"]},
  { "name": "Neunhöffer", "firstname": "Max",
    "hobbies": ["Piano", "Golf"]},
  ...
]

(Actually, a cursor is returned.)
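Since a cursor is returned, clients pull the result in batches rather than all at once. A minimal sketch using ArangoDB's JSON-over-HTTP cursor API (the batchSize value is an arbitrary assumption):

POST /_api/cursor
{ "query": "FOR p IN people RETURN p", "batchSize": 100 }

The response carries the first batch together with a cursor id and a hasMore flag; further batches are then fetched via PUT /_api/cursor/<id>.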
18–19. Query 2
Use filtering, sorting and limit

FOR p IN people
  FILTER p.age >= @minage
  SORT p.name, p.firstname
  LIMIT @nrlimit
  RETURN { name: CONCAT(p.name, ", ", p.firstname),
           age: p.age }

[ { "name": "Neunhöffer, Max", "age": 44 },
  { "name": "Schmidt, Helmut", "age": 95 },
  ...
]
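The @-prefixed names (@minage, @nrlimit) are bind parameters: they are supplied separately from the query text. A hedged example of the accompanying bind-parameter document (the concrete values are made up):

{ "minage": 21, "nrlimit": 10 }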
20–21. Query 3
Aggregation and functions

FOR p IN people
  COLLECT a = p.age INTO L
  FILTER a >= @minage
  RETURN { "age": a, "number": LENGTH(L) }

[ { "age": 18, "number": 10 },
  { "age": 19, "number": 17 },
  { "age": 20, "number": 12 },
  ...
]
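As an aside: when only the group sizes are needed, later AQL versions added a dedicated counting form that avoids collecting the group members into L first. A hedged sketch, assuming a version that supports COLLECT ... WITH COUNT INTO:

FOR p IN people
  COLLECT a = p.age WITH COUNT INTO number
  FILTER a >= @minage
  RETURN { "age": a, "number": number }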
22–23. Query 4
Joins

FOR p IN @@peoplecollection
  FOR h IN houses
    FILTER p._key == h.owner
    SORT h.streetname, h.housename
    RETURN { housename: h.housename,
             streetname: h.streetname,
             owner: p.name,
             value: h.value }

[ { "housename": "Firlefanz",
    "streetname": "Meyer street",
    "owner": "Hans Schmidt", "value": 423000
  },
  ...
]
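The double @@ in @@peoplecollection marks a collection bind parameter: even the collection name is supplied at run time. In the bind-parameter document the key keeps a single leading @; the value "people" here is an assumed example:

{ "@peoplecollection": "people" }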
24. Query 5
Modifying data

FOR e IN events
  FILTER e.timestamp < "2014-09-01T09:53+0200"
  INSERT e IN oldevents

FOR e IN events
  FILTER e.timestamp < "2014-09-01T09:53+0200"
  REMOVE e._key IN events
25–26. Query 6
Graph queries

FOR x IN GRAPH_SHORTEST_PATH(
    "routeplanner", "germanCity/Cologne",
    "frenchCity/Paris", {weight: "distance"} )
  RETURN { begin: x.startVertex,
           end: x.vertex,
           distance: x.distance,
           nrPaths: LENGTH(x.paths) }

[ { "begin": "germanCity/Cologne",
    "end": {"_id": "frenchCity/Paris", ... },
    "distance": 550,
    "nrPaths": 10 },
  ...
]
27. Life of a query
Text and query parameters come from user
9
28. Life of a query
Text and query parameters come from user
Parse text, produce abstract syntax tree (AST)
9
29. Life of a query
Text and query parameters come from user
Parse text, produce abstract syntax tree (AST)
Substitute query parameters
9
30. Life of a query
Text and query parameters come from user
Parse text, produce abstract syntax tree (AST)
Substitute query parameters
First optimisation: constant expressions, etc.
9
31. Life of a query
Text and query parameters come from user
Parse text, produce abstract syntax tree (AST)
Substitute query parameters
First optimisation: constant expressions, etc.
Translate AST into an execution plan (EXP)
9
32. Life of a query
Text and query parameters come from user
Parse text, produce abstract syntax tree (AST)
Substitute query parameters
First optimisation: constant expressions, etc.
Translate AST into an execution plan (EXP)
Optimise one EXP, produce many, potentially better EXPs
9
33. Life of a query
Text and query parameters come from user
Parse text, produce abstract syntax tree (AST)
Substitute query parameters
First optimisation: constant expressions, etc.
Translate AST into an execution plan (EXP)
Optimise one EXP, produce many, potentially better EXPs
Reason about distribution in cluster
9
34. Life of a query
Text and query parameters come from user
Parse text, produce abstract syntax tree (AST)
Substitute query parameters
First optimisation: constant expressions, etc.
Translate AST into an execution plan (EXP)
Optimise one EXP, produce many, potentially better EXPs
Reason about distribution in cluster
Optimise distributed EXPs
9
35. Life of a query
Text and query parameters come from user
Parse text, produce abstract syntax tree (AST)
Substitute query parameters
First optimisation: constant expressions, etc.
Translate AST into an execution plan (EXP)
Optimise one EXP, produce many, potentially better EXPs
Reason about distribution in cluster
Optimise distributed EXPs
Estimate costs for all EXPs, and sort by ascending cost
9
36. Life of a query
Text and query parameters come from user
Parse text, produce abstract syntax tree (AST)
Substitute query parameters
First optimisation: constant expressions, etc.
Translate AST into an execution plan (EXP)
Optimise one EXP, produce many, potentially better EXPs
Reason about distribution in cluster
Optimise distributed EXPs
Estimate costs for all EXPs, and sort by ascending cost
Instanciate “cheapest” plan, i.e. set up execution engine
9
37. Life of a query
Text and query parameters come from user
Parse text, produce abstract syntax tree (AST)
Substitute query parameters
First optimisation: constant expressions, etc.
Translate AST into an execution plan (EXP)
Optimise one EXP, produce many, potentially better EXPs
Reason about distribution in cluster
Optimise distributed EXPs
Estimate costs for all EXPs, and sort by ascending cost
Instanciate “cheapest” plan, i.e. set up execution engine
Distribute and link up engines on different servers
9
38. Life of a query
Text and query parameters come from the user
Parse text, produce abstract syntax tree (AST)
Substitute query parameters
First optimisation: constant expressions, etc.
Translate AST into an execution plan (EXP)
Optimise one EXP, produce many, potentially better EXPs
Reason about distribution in cluster
Optimise distributed EXPs
Estimate costs for all EXPs, and sort by ascending cost
Instantiate the “cheapest” plan, i.e. set up the execution engine
Distribute and link up engines on different servers
Execute plan, provide cursor API
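From the client's point of view only the first and the last of these steps are visible: it sends query text plus bind parameters and consumes a cursor. A minimal sketch of that round trip, assuming the python-arango driver and a local server (the driver, the credentials, and the document attributes are my assumptions; only the collection name frenchCity comes from the earlier example):

from arango import ArangoClient  # pip install python-arango

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="")  # assumed credentials

# Query text and bind parameters travel separately; the server substitutes
# @minPop into the AST before the optimisation passes run.
cursor = db.aql.execute(
    "FOR c IN frenchCity FILTER c.population > @minPop RETURN c.name",
    bind_vars={"minPop": 1000000},
    batch_size=100,  # results stream back in blocks through the cursor API
)
for name in cursor:
    print(name)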
44. Execution plans
FOR a IN collA
  LET xx = a.x
  FOR b IN collB
    FILTER xx == b.y
    RETURN {x: a.x, z: b.z}
Query → EXP:
Singleton
EnumerateCollection a
Calculation xx
EnumerateCollection b
Calculation xx == b.y
Filter xx == b.y
Calculation {x: a.x, z: b.z}
Return {x: a.x, z: b.z}
Black arrows are dependencies
Think of a pipeline
Each node provides a cursor API
Blocks of “items” travel through the pipeline
What is an “item”?
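Each node pulls from its dependency, transforms what it gets, and hands the result upward, so the whole plan behaves like a chain of cursors. A toy sketch of this idea (my own Python model, working on single items rather than blocks; the real engine is C++):

def singleton():
    yield {}  # one empty item starts the pipeline

def enumerate_collection(dep, var, docs):
    for item in dep:
        for doc in docs:
            yield {**item, var: doc}  # extend each item with a new variable

def calculation(dep, var, expr):
    for item in dep:
        yield {**item, var: expr(item)}

def filter_node(dep, var):
    for item in dep:
        if item[var]:
            yield item

# Wire up the plan from this slide on two tiny in-memory "collections".
collA = [{"x": 1}, {"x": 2}]
collB = [{"y": 1, "z": "a"}, {"y": 2, "z": "b"}]

plan = singleton()
plan = enumerate_collection(plan, "a", collA)
plan = calculation(plan, "xx", lambda it: it["a"]["x"])
plan = enumerate_collection(plan, "b", collB)
plan = calculation(plan, "cond", lambda it: it["xx"] == it["b"]["y"])
plan = filter_node(plan, "cond")
for item in plan:
    print({"x": item["a"]["x"], "z": item["b"]["z"]})  # the RETURN node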
48. Pipeline and items
Singleton (items have no vars yet)
EnumerateCollection a (FOR a IN collA)
Calculation xx (LET xx = a.x; items now have vars a, xx)
EnumerateCollection b (FOR b IN collB)
Items are the thingies traveling through the pipeline.
An item holds the values of those variables in the current frame.
Thus: items look different in different parts of the plan.
We always deal with blocks of items for performance reasons.
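A block is just a batch of same-shaped items, so neighbouring nodes exchange fewer, larger units. A tiny continuation of the toy model above showing the batching (illustrative only):

def blocks(dep, size=1000):
    # Group single items into blocks; nodes then pass whole blocks
    # through the pipeline instead of one item at a time.
    block = []
    for item in dep:
        block.append(item)
        if len(block) == size:
            yield block
            block = []
    if block:  # flush the final partial block
        yield block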
50. Move filters up
FOR a IN collA
  FOR b IN collB
    FILTER a.x == 10
    FILTER a.u == b.v
    RETURN {u: a.u, w: b.w}
The result and behaviour do not change if the first FILTER is pulled out of the inner FOR:
FOR a IN collA
  FILTER a.x == 10
  FOR b IN collB
    FILTER a.u == b.v
    RETURN {u: a.u, w: b.w}
However, the number of items traveling in the pipeline is decreased. The plan becomes:
Singleton
EnumColl a
Calc a.x == 10
Filter a.x == 10
EnumColl b
Calc a.u == b.v
Filter a.u == b.v
Return {u: a.u, w: b.w}
Note that the two FOR statements could be interchanged!
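The rule itself is mechanical: a Filter (with the Calculation feeding it) may move above any node that does not introduce a variable its expression reads. A simplified sketch of such a rule on a plan represented as a list of nodes (my own toy representation, not ArangoDB's internal one):

from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                     # "Singleton", "EnumColl", "Calc", ...
    sets_vars: set = field(default_factory=set)   # variables this node introduces
    uses_vars: set = field(default_factory=set)   # variables its expression reads

def move_filters_up(plan):
    # Repeatedly swap a Calc/Filter with its predecessor, as long as the
    # predecessor neither defines a variable the node depends on nor is
    # itself a Calc/Filter/Singleton (which also keeps the sweep finite).
    plan = plan[:]
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            node, prev = plan[i], plan[i - 1]
            if (node.kind in ("Calc", "Filter")
                    and prev.kind not in ("Calc", "Filter", "Singleton")
                    and not (node.uses_vars & prev.sets_vars)):
                plan[i - 1], plan[i] = node, prev
                changed = True
    return plan

# The plan from this slide: c1 stands for "a.x == 10", c2 for "a.u == b.v".
plan = [
    Node("Singleton"),
    Node("EnumColl", sets_vars={"a"}),
    Node("EnumColl", sets_vars={"b"}),
    Node("Calc", sets_vars={"c1"}, uses_vars={"a"}),
    Node("Filter", uses_vars={"c1"}),
    Node("Calc", sets_vars={"c2"}, uses_vars={"a", "b"}),
    Node("Filter", uses_vars={"c2"}),
    Node("Return", uses_vars={"a", "b"}),
]
for n in move_filters_up(plan):
    print(n.kind, sorted(n.sets_vars), sorted(n.uses_vars))

Running this moves the a.x filter above EnumColl b, exactly the transformation shown on the slide, because its expression only depends on a.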
54. Remove unnecessary calculations
FOR a IN collA
  LET L = LENGTH(a.hobbies)
  FOR b IN collB
    FILTER a.u == b.v
    RETURN {h: a.hobbies, w: b.w}
The calculation of L is unnecessary: its result is used nowhere, and it cannot throw an exception. Therefore we can just leave it out:
Singleton
EnumColl a
EnumColl b
Calc a.u == b.v
Filter a.u == b.v
Return {...}
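Expressed on the toy Node representation from the "move filters up" sketch, this rule is a single back-to-front sweep over the plan (again my own illustration, not the actual C++ rule):

def remove_unused_calculations(plan):
    # Walk the plan back to front, tracking which variables some later
    # node still reads. A Calc whose result nobody reads can be dropped,
    # provided its expression cannot throw (assumed here; the real rule
    # has to check this).
    needed, kept = set(), []
    for node in reversed(plan):
        if node.kind == "Calc" and not (node.sets_vars & needed):
            continue  # dead calculation: drop it
        kept.append(node)
        needed |= node.uses_vars
    return list(reversed(kept))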
58. Use index for FILTER and SORT
FOR a IN collA
  FILTER a.x > 17 &&
         a.x <= 23 &&
         a.y == 10
  SORT a.y, a.x
  RETURN a
Assume collA has a skiplist index on “y” and “x” (in this order). Then we can read off the half-open interval between { y: 10, x: 17 } and { y: 10, x: 23 } from the skiplist index. The result will automatically be sorted by y and then by x, so the SORT node disappears as well:
Singleton
IndexRange a
Return a
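For completeness, a sketch of setting this up through the python-arango driver, reusing the db handle from the first sketch (the calls exist in the driver, though add_skiplist_index has been deprecated in favour of add_index in recent versions; collection and attribute names come from the slide):

coll = db.collection("collA")
coll.add_skiplist_index(fields=["y", "x"])  # index on y first, then x

# db.aql.explain returns the chosen execution plan, so one can check
# that the optimiser replaced EnumerateCollection + Filter + Sort with
# an index-based node.
plan = db.aql.explain(
    "FOR a IN collA "
    "FILTER a.x > 17 && a.x <= 23 && a.y == 10 "
    "SORT a.y, a.x RETURN a"
)
print(plan)  # inspect the node list for the index node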
62. Data distribution in a cluster
[Diagram: client requests hit two coordinators, which talk to three DBservers holding the numbered shards of a collection.]
The shards of a collection are distributed across the DBservers.
The coordinators receive queries and organise their execution.
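Which DBserver is responsible for a given document follows from its shard key (by default the document key): the key is hashed and the hash picks the shard. A simplified sketch of that routing; ArangoDB's actual hash function is an implementation detail, so this is purely illustrative:

import hashlib

def responsible_shard(shard_key_value, num_shards):
    # Hash the shard key value and map it onto one of the shards.
    h = int(hashlib.md5(str(shard_key_value).encode()).hexdigest(), 16)
    return h % num_shards

print(responsible_shard("Paris", 5))  # the same key always lands on the same shard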
70. Modifying queries
Fortunately:
There can be at most one modifying node in each query.
There can be no modifying nodes in subqueries.
Modifying nodes
The modifying node in a query is executed on the DBservers; to this end, we either scatter the items to all DBservers, or, if possible, we distribute each item to the shard that is responsible for the modification.
Sometimes, we can even optimise away a gather/scatter combination and parallelise completely.
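The scatter-or-distribute decision boils down to one question: can the responsible shard be computed from the item itself? A toy sketch on top of the responsible_shard function above (my own illustration of the idea):

def route_items(items, num_shards, shard_key="_key"):
    # If an item carries the shard key, send it only to its responsible
    # shard (distribute); otherwise it has to go to every shard (scatter).
    per_shard = {s: [] for s in range(num_shards)}
    for item in items:
        if shard_key in item:
            per_shard[responsible_shard(item[shard_key], num_shards)].append(item)
        else:
            for s in per_shard:
                per_shard[s].append(item)
    return per_shard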