Whether it's statistics, weather forecasting, astronomy, finance, or network management, time series data plays a critical role in analytics and forecasting. Unfortunately, while many tools exist for time series storage and analysis, few are able to scale past memory limits or provide rich query and analytics capabilities beyond what is necessary to produce simple plots. For those challenged by large volumes of data, there is much room for improvement.
Apache Cassandra is a fully distributed second-generation database. Cassandra stores data in key-sorted order, making it ideal for time series, and its high throughput and linear scalability make it well suited to very large data sets.
This talk will cover some of the requirements and challenges of large-scale time series storage and analysis. Cassandra data and query modeling for this use case will be discussed, and Newts, an open source Cassandra-based time series store under development at The OpenNMS Group, will be introduced.
Presented at Cassandra London (April 7, 2014): the challenges of time-series storage and analytics in OpenNMS, with an introduction to Newts, a new Cassandra-based time-series data store.
Stream Processing Live Traffic Data with Kafka Streams (Tim Ysewyn)
In this workshop we will set up a streaming framework which will process realtime data of traffic sensors installed within the Belgian road system.
Starting with the intake of the data, you will learn best practices and the recommended approach to split the information into events in a way that won't come back to haunt you.
With some basic stream operations (count, filter, ... ) you will get to know the data and experience how easy it is to get things done with Spring Boot & Spring Cloud Stream.
But since simple data processing is not enough to fulfill all your streaming needs, we will also let you experience the power of windows. After this workshop, tumbling, sliding and session windows hold no more mysteries and you will be a true streaming wizard.
http://bit.ly/1ALVcwR – MapR Director of Architecture and Enterprise Strategy Jim Scott presented a session titled “Time Series Data in a Time Series World.” His session focused on working with time series data, including single-value, geospatial, and log time series data. Focusing on enterprise applications and the data center, he used OpenTSDB as an example to explain key time series core concepts, including when to use different storage models.
Things Expo | San Jose, California - November 2014
tado° Makes Your Home Environment Smart with InfluxDB (InfluxData)
Michal Knizek, Head of Research and Development at tado° GmbH, will share how they use InfluxData to gather data collected from their Smart Thermostat to help turn any home thermostat into a smart device. The device uses a variety of collected information (geo-location, temperature, user settings, current device functional state) to automatically control the environment temperature and to let users know when the device may need maintenance.
Terark (Y Combinator W17) has built a new storage engine based on a nested succinct trie, which provides a 10x-500x performance improvement, a 10:1 compression ratio, and remarkably low latency compared to Google's LevelDB and Facebook's RocksDB. It is usable as a standalone key-value store, or as a storage engine for MySQL and MongoDB.
TypoScript and EEL outside of Neos [InspiringFlow2013] (Christian Müller)
Talk from Inspiring Flow 2013 describing TypoScript and the Embedded Expression Language (EEL) in Neos, use cases for both in TYPO3 Flow applications, and how to extend them for your needs.
[Input Speed]
TPS increased as the number of input processes increased, reaching 1,090,000 TPS with 5 input processes.
[Query Speed]
Query speed was measured with time-period conditions against 7,000,000,000 stored data points.
Virtual Training: Intro to the TICK Stack and InfluxEnterprise (InfluxData)
In this webinar, we will provide an introduction to the components of the TICK Stack and a review of the features of InfluxEnterprise and InfluxCloud. We also demo how to install the TICK Stack.
Bitcoin Price Detection with PySpark (Yakup Görür)
Cryptocurrencies are digital currencies that have garnered significant investor attention in the financial markets. The aim of this project is to predict the daily price, particularly the daily closing price, of the cryptocurrency Bitcoin, which plays a vital role in making trading decisions. Various factors affect the price of Bitcoin, making price prediction a complex and technically challenging task. To perform prediction, a random forest model was trained on the historical time series of past Bitcoin prices over several years. Features such as the opening price, highest price, lowest price, closing price, volume of Bitcoin, volume of currencies, and weighted price were taken into consideration to predict the closing price of the next day. The random forest model was designed and implemented in both the PySpark and scikit-learn frameworks, and the models were evaluated by computing measures such as RMSE (root mean square error) and r (Pearson's correlation coefficient) on test data. The PySpark framework was used to parallelize tree construction when training the random forest, in order to handle big data. Code has been made available at: https://github.com/ykpgrr/Price-Prediction-with-Random-Forest
Precog & MongoDB User Group: Skyrocket Your Analytics (MongoDB)
Learn how to do advanced analytics with the Precog data science platform on your MongoDB database. Precog is free to download, and after installing it you'll be on your way to analyzing all the data in your MongoDB database, without having to export data into another tool or write any custom code. Learn more here: www.precog.com/mongodb
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ... (NETWAYS)
At Uber we use high cardinality monitoring to observe and detect issues with our 4,000 microservices running on Mesos and across our infrastructure systems and servers. We’ll cover how we put the resulting 6 billion plus time series to work in a variety of different ways: auto-discovering services and their usage of other systems at Uber, setting up and tearing down alerts automatically for services, sending smart alert notifications that roll up different failures into individual high-level contextual alerts, and more. We’ll also talk about how we accomplish all this with a global view of our systems with M3, our open source metrics platform. We’ll take a deep dive into how we use M3DB, now available as an open source Prometheus long-term storage backend, to horizontally scale our metrics platform in a cost-efficient manner with a system that’s still sane to operate with petabytes of metrics data.
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas (MongoDB)
Moving to a new home is daunting. Packing up all your things, getting a vehicle to move it all, unpacking it, updating your mailing address, and making sure you did not leave anything behind. Well, the move to MongoDB Atlas is similar, but all the logistics are already figured out for you by MongoDB.
Our database experts, Rajnikant Tandel and Anup Gopinathan, will show you how to identify and fine tune your problem queries to make a significant impact on the overall performance of your database.
GridSQL is commonly thought of as a replication solution along the lines of Slony and Bucardo, but the open source GridSQL project actually allows PostgreSQL queries to be parallelized across many servers, allowing performance to scale nearly linearly. In this session, we will discuss the advantages of using GridSQL for large multi-terabyte data warehouses and how to design your PostgreSQL schemas and queries to leverage GridSQL. We will dig into how GridSQL plans a query capable of spanning multiple PostgreSQL servers and executes across those nodes. We will delve into some performance expectations and where GridSQL should be deployed.
Matt Sarrel of Imply draws on his work benchmarking Apache Druid with the Star Schema Benchmark (SSB) and shows how you can performance test Druid with your workload. Virtual meetup of July 16, 2020.
Watch the video: https://www.youtube.com/watch?v=RbwMCy4GsIE
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013 (Amazon Web Services)
Get the most out of Amazon Redshift by learning about cutting-edge data warehousing implementations. Desk.com, a Salesforce.com company, discusses how they maintain a large concurrent user base on their customer-facing business intelligence portal powered by Amazon Redshift. HasOffers shares how they load 60 million events per day into Amazon Redshift with a 3-minute end-to-end load latency to support ad performance tracking for thousands of affiliate networks. Finally, Aggregate Knowledge discusses how they perform complex queries at scale with Amazon Redshift to support their media intelligence platform.
Run your queries 14X faster without any investment! (Knoldus Inc.)
OLAP queries that need a full table scan tend to take a lot of time; with the advent of NoSQL databases, there are many techniques you can apply to make your queries faster.
Cloud Computing course presentation, Tarbiat Modares University
By: Sina Ebrahimi, Mohammadreza Noei
Advisor: Sadegh Dorri Nogoorani, PhD.
Presentation Date: 1397/03/07
Video Link in Aparat: https://www.aparat.com/v/N5VbK
Video Link on TMU Cloud: http://cloud.modares.ac.ir/public.php?service=files&t=9ecb8d2dd08df6f990a3eb63f42011f7
This presentation's pptx file (some animations may be lost on SlideShare): http://cloud.modares.ac.ir/public.php?service=files&t=f62282dbd205abaa66de2512d9fdfc83
Similar to Redis Day TLV 2018 - RediSearch Aggregations (20)
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The climate impact / sustainability of software testing is discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize our carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability, which can then be measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
The New Frontiers of AI in RPA with UiPath Autopilot™ (UiPathCommunity)
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that integrates Artificial Intelligence into the development and use of Automations.
📕 Together we will look at some examples of using Autopilot in various tools of the UiPath Suite:
Autopilot for Studio Web
Autopilot for Studio
Autopilot for Apps
Clipboard AI
GenAI applied to Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs (Alex Pruden)
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the 30.5.2024 DASA Connect conference. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We also had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
2. RediSearch 101
● High Performance Search Index
● Redis Module
● Written From Scratch in C
● Supports multiple index types
● Open Source / Commercial
○ Commercial is scalable and distributed
○ Several commercial customers
○ Largest cluster: 200 shards, ~2B documents, 8TB RAM
3. What Can RediSearch Do?
● Full-Text, Numeric, Geo and Tag indexes
● Real-time indexing and updates
● Distributed search on Billions of documents
● Complex query language
● Auto-Complete
● Highlighting
4. So, Aggregations...
“Data aggregation is the compiling of information from databases with intent to prepare combined datasets for data processing.”
(Whatever you say, Wikipedia)
● Take a Search Query
● Process the results to produce some insights from them by:
○ Grouping
○ Reducing groups
○ Applying Transformations
○ Sorting
● Transform plain search results into statistical insights
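As a rough illustration of these four operations, here is a pure-Python stand-in (the sample data and field names are invented; this emulates the concepts only, not RediSearch itself):

```python
# Minimal stand-in for an aggregation pipeline: group, reduce, apply, sort.
from collections import defaultdict

results = [
    {"country": "IL", "age": 30},
    {"country": "IL", "age": 40},
    {"country": "IT", "age": 25},
]

# GROUPBY country; REDUCE count and avg(age) per group
groups = defaultdict(list)
for row in results:
    groups[row["country"]].append(row)

agg = []
for country, rows in groups.items():
    count = len(rows)
    avg_age = sum(r["age"] for r in rows) / count        # reduce
    agg.append({"country": country, "count": count,
                "avg_age": round(avg_age, 1)})           # apply (a transformation)

agg.sort(key=lambda g: g["count"], reverse=True)         # sort
```

After the pipeline, `agg` holds one statistical row per group instead of raw search hits.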
5. Aggregate Request vs Search Request
● Search request just fetches the top N most relevant results
● Processing is minimal if any
● Focus is on filtering and ranking
● Aggregate requests use the same query language
● But add a pipeline of processing steps
● Emphasis on processing and transforming
6. The Aggregation Pipeline
● Transform the results by combining multiple steps
● All steps are repeatable and work on the upstream data
Example pipeline: Filter → Group/Reduce → Apply → Sort → Apply → Group/Reduce
7. What The API Looks Like
FT.AGGREGATE {index} {query}
[LOAD {len} {property} ...]
[GROUPBY {len} {property} ...
[REDUCE {func} {len} {arg} … [AS {alias}] ]
[SORTBY {len} {property} [ASC|DESC] ... [MAX {n}]]
[APPLY {expr} AS {alias}]
[LIMIT {offset} {num} ]
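To make the shape of the command concrete, here is a sketch in Python that assembles an FT.AGGREGATE argument list following the syntax above (the index name `idx` and the helper `build_aggregate` are hypothetical, not part of any client library):

```python
# Assemble the argument list for an FT.AGGREGATE call. Each clause that takes
# {len} is preceded by the count of the tokens that follow it.
def build_aggregate(index, query, groupby=None, reducers=(), sortby=None, limit=None):
    args = ["FT.AGGREGATE", index, query]
    if groupby:
        args += ["GROUPBY", str(len(groupby)), *groupby]
    for func, fargs, alias in reducers:
        args += ["REDUCE", func, str(len(fargs)), *fargs, "AS", alias]
    if sortby:  # sortby is a list of (property, direction) pairs
        props = [p for pair in sortby for p in pair]
        args += ["SORTBY", str(len(props)), *props]
    if limit:
        args += ["LIMIT", str(limit[0]), str(limit[1])]
    return args

cmd = build_aggregate("idx", "@type:{PushEvent}",
                      groupby=["@actor"],
                      reducers=[("COUNT", [], "commits")],
                      sortby=[("@commits", "DESC")])
# With a live server one might run: redis.execute_command(*cmd)
```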
8. Aggregate Filters
● Determine what results we process
● Same query syntax as normal RediSearch queries
● Can be used to create complex selections
@country:{Israel | Italy}
@age:[25 52]
@profession:"1337 h4x0r"
-@languages:{perl | php}
9. GROUPBY and Reduce
● Grouping by any number of fields / upstream properties
● A GROUPBY is associated with many reducers
● Reducers process each matching result per group and “flatten” them
GROUPBY @x @y ⇒ Group{foo,bar}, Group{bar,baz}
Each group is reduced independently: Reduce ⇒ count
10. Available Reducers
● COUNT
● COUNT_DISTINCT
● COUNT_DISTINCTISH
● MIN / MAX
● AVG
● SUM
● STDDEV
● QUANTILE
● FIRST_VALUE
● TOLIST
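As a sketch of what several of these reducers compute, here are pure-Python stand-ins (illustration only; RediSearch implements them natively inside the engine, and the exact quantile method may differ):

```python
# Pure-Python stand-ins for a few reducers.
import statistics

def r_count(vals):          return len(vals)
def r_count_distinct(vals): return len(set(vals))
def r_min(vals):            return min(vals)
def r_max(vals):            return max(vals)
def r_avg(vals):            return sum(vals) / len(vals)
def r_sum(vals):            return sum(vals)
def r_stddev(vals):         return statistics.stdev(vals)
def r_quantile(vals, q):    # nearest-rank quantile over the sorted values
    return sorted(vals)[int(q * (len(vals) - 1))]
def r_tolist(vals):         return sorted(set(vals))

ages = [25, 30, 30, 52]
```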
11. APPLY Transformations
● APPLY applies 1:1 transformations on results
● Evaluate a complex expression per result
● Can be numeric / text functions
● Numeric operators
● Runtime errors result in NULL
APPLY "sqrt((@a + @b)/2)" AS avg
APPLY "time(floor(@timestamp/86400)*86400)" AS date
APPLY "format('Hello, %s', @userName)" AS msg
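The three APPLY expressions above can be emulated in plain Python to show what they compute (a sketch, not the engine's expression evaluator):

```python
# What each APPLY expression evaluates to, one result at a time.
import math

def avg_expr(a, b):
    # "sqrt((@a + @b)/2)"
    return math.sqrt((a + b) / 2)

def day_bucket(timestamp):
    # "floor(@timestamp/86400)*86400" truncates a Unix time to midnight UTC
    return (timestamp // 86400) * 86400

def greeting(user_name):
    # "format('Hello, %s', @userName)"
    return "Hello, %s" % user_name
```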
14. SORTBY
● Sort by one or more properties
● Each can be ASC/DESC
● MAX limits to top-N results
● Using MinMaxHeap
15. LIMIT vs. SORTBY MAX
● SORTBY … MAX {n}
● LIMIT {offset} {num}
● Limit can be applied even without sorting
● Max just speeds up sorts, but doesn’t page through results
● Getting results 10-20 will work best as:
SORTBY … MAX 20 LIMIT 10 10
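The distinction can be sketched in plain Python: MAX bounds the sort itself (a bounded heap, like the MinMaxHeap mentioned above), while LIMIT pages through an already-sorted result set:

```python
# MAX vs. LIMIT over an invented score list.
import heapq

scores = [5, 1, 9, 7, 3, 8, 2]

# SORTBY ... DESC MAX 3: keep only the top 3 while sorting (bounded heap)
top3 = heapq.nlargest(3, scores)

# LIMIT 2 3: sort everything, then return 3 results starting at offset 2
sorted_all = sorted(scores, reverse=True)
page = sorted_all[2:2 + 3]
```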
17. The Problem
● Github events (push, issue, pr, etc) stream
● For each event we have:
○ Repo
○ Actor (user)
○ Type
○ Time
○ Text
● Let’s extract the top committers on Github!
18. Step 1: Filter
We scan only pushes:
FT.AGGREGATE gh "@type:{PushEvent}"
19. Step 2: Group/Reduce
Let’s group by actor, and count the number of commits and unique repos.
FT.AGGREGATE gh "@type:{PushEvent}"
GROUPBY 1 @actor
REDUCE COUNT 0 AS commits
REDUCE COUNT_DISTINCT 1 @repo AS repos
20. Step 3: Apply transformations if needed
Now add an APPLY step to compute commits per repo:
FT.AGGREGATE gh "@type:{PushEvent}"
GROUPBY 1 @actor
REDUCE COUNT 0 AS commits
REDUCE COUNT_DISTINCT 1 @repo AS repos
APPLY "@commits/@repos" AS commits_per_repo
21. Step 4: Sort
Sort by commits per repo:
FT.AGGREGATE gh "@type:{PushEvent}"
GROUPBY 1 @actor
REDUCE COUNT 0 AS commits
REDUCE COUNT_DISTINCT 1 @repo AS repos
APPLY "@commits/@repos" AS commits_per_repo
SORTBY 2 @commits_per_repo DESC MAX 20
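Putting the four steps together, here is a pure-Python emulation of the whole pipeline on a small invented sample of events (a sketch of the semantics, not the RediSearch implementation):

```python
# Filter -> group/reduce -> apply -> sort, over invented GitHub events.
from collections import defaultdict

events = [
    {"type": "PushEvent",  "actor": "alice", "repo": "a/x"},
    {"type": "PushEvent",  "actor": "alice", "repo": "a/x"},
    {"type": "PushEvent",  "actor": "alice", "repo": "a/y"},
    {"type": "PushEvent",  "actor": "bob",   "repo": "b/z"},
    {"type": "IssueEvent", "actor": "bob",   "repo": "b/z"},
]

# Step 1 - filter: @type:{PushEvent}
pushes = [e for e in events if e["type"] == "PushEvent"]

# Step 2 - GROUPBY @actor; REDUCE COUNT AS commits, COUNT_DISTINCT @repo AS repos
by_actor = defaultdict(list)
for e in pushes:
    by_actor[e["actor"]].append(e)

rows = []
for actor, evs in by_actor.items():
    commits = len(evs)
    repos = len({e["repo"] for e in evs})
    # Step 3 - APPLY "@commits/@repos" AS commits_per_repo
    rows.append({"actor": actor, "commits": commits, "repos": repos,
                 "commits_per_repo": commits / repos})

# Step 4 - SORTBY @commits_per_repo DESC MAX 20
rows.sort(key=lambda r: r["commits_per_repo"], reverse=True)
top = rows[:20]
```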
22. Another Example: Total/Unique Visitors Per Hour
FT.AGGREGATE visits
APPLY "@time - (@time%3600)" AS hour
GROUPBY 1 @hour
REDUCE COUNT_DISTINCT 1 @userId AS unique
REDUCE COUNT 0 AS total
SORTBY 2 @hour ASC
APPLY "time(@hour)" AS hour
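The same hourly bucketing, emulated in plain Python on invented sample visits:

```python
# Bucket visits by hour, then count total and unique visitors per bucket.
from collections import defaultdict

visits = [
    {"time": 3600 * 10 + 5,  "userId": "u1"},
    {"time": 3600 * 10 + 60, "userId": "u1"},
    {"time": 3600 * 10 + 90, "userId": "u2"},
    {"time": 3600 * 11 + 10, "userId": "u3"},
]

buckets = defaultdict(list)
for v in visits:
    hour = v["time"] - (v["time"] % 3600)   # APPLY "@time - (@time%3600)"
    buckets[hour].append(v["userId"])

per_hour = [{"hour": h, "total": len(u), "unique": len(set(u))}
            for h, u in sorted(buckets.items())]  # SORTBY @hour ASC
```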
24. How Do You Scale Aggregations?
● All this works very well on a single node
● But what if your data is split across nodes?
● Simple searches run the query on each node and merge the results
25. How Do You Scale Aggregations?
● That cannot always be done for aggregations
○ How would you merge AVG?
○ How would you merge COUNT_DISTINCT?
26. Simple Approach - Filter Push-Down
● We push the filtering stage and the properties we want to all nodes
● We merge all results into one stream
● Perform aggregations on it as if it was a single node’s results
● Pros: simple
● Cons: ALL matching records must be transferred over the network
27. Nicer Approach - Complex Push-Down
● We take the original request
● For each step we can either:
○ Push it down to the nodes as is
○ Push down and create some special merge logic
○ Decide we cannot push down anything and quit
28. Nicer Approach - Complex Push-Down
● Merge logic:
○ COUNT ⇒ Pushed down as COUNT, merged as SUM
○ AVG ⇒ Pushed down as SUM + COUNT, merged as SUM/COUNT
○ COUNT_DISTINCT ⇒ Pushed as TOLIST, merged as COUNT_DISTINCT
○ Etc
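These merge rules can be sketched in a few lines of Python, with invented shard-local partial results (an illustration of the push-down idea, not the actual cluster code):

```python
# Each shard returns partial aggregates; the coordinator merges them.
shard_a = {"count": 3, "sum": 30.0, "values": ["u1", "u2"]}
shard_b = {"count": 2, "sum": 50.0, "values": ["u2", "u3"]}

# COUNT: pushed down as COUNT, merged as SUM
total_count = shard_a["count"] + shard_b["count"]

# AVG: pushed down as SUM + COUNT, merged as SUM/COUNT
merged_avg = (shard_a["sum"] + shard_b["sum"]) / total_count

# COUNT_DISTINCT: pushed down as TOLIST, merged as COUNT_DISTINCT of the union
merged_distinct = len(set(shard_a["values"]) | set(shard_b["values"]))
```

Note that only small partial states cross the network, rather than every matching record as in the filter push-down approach.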