About the how and why of taking Lucene and Elasticsearch and turning them into a relational database.
Talk I gave at Search User Group Berlin September Meetup http://www.meetup.com/de/Search-UG-Berlin/events/224765731/
3. Crate.io
The Company
• Founded in 2013 in Dornbirn, Austria
• Offices in Dornbirn, Berlin, San Francisco
• Team of 14 people (with and without strong Austrian dialect)
• Won the TechCrunch Disrupt Startup Battlefield
4. SQL Database
TABLES
• Table == Tuple Store
• Primary key -> Tuple
• Index == B-Tree
  • allows equality and range queries in O(log N)
  • sorting
• Query planner + engine for LOCAL query execution
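The B-tree bullet above can be illustrated in miniature: a sorted key list answers both equality and range queries with binary search in O(log N). This is a minimal sketch, not a real B-tree (no pages, no balancing), but the query behavior is the same.

```python
import bisect

class SortedIndex:
    def __init__(self, rows):
        # rows: list of (key, value); kept sorted so binary search works
        self.entries = sorted(rows)
        self.keys = [k for k, _ in self.entries]

    def equals(self, key):
        # binary search to the first matching key, then collect duplicates
        i = bisect.bisect_left(self.keys, key)
        return [v for k, v in self.entries[i:] if k == key]

    def range(self, lo, hi):
        # half-open range [lo, hi), also O(log N) to locate the bounds
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_left(self.keys, hi)
        return [v for _, v in self.entries[i:j]]

idx = SortedIndex([(3, "c"), (1, "a"), (2, "b"), (2, "bb")])
print(idx.equals(2))    # ['b', 'bb']
print(idx.range(1, 3))  # ['a', 'b', 'bb']
```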
5. LUCENE INDEX
• Inverted Index
  • equality queries
  • range queries
  • fulltext search with analyzed queries
• NO sorting
• Stored Fields
• DocValues / FieldCache
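A toy inverted index shows why equality and fulltext term queries are cheap: each term maps to the set of document ids containing it, so a query is a dictionary lookup plus set intersection. This is an illustration only, not Lucene itself (real analysis, scoring, and postings formats are far richer).

```python
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> {doc_id, ...}

    def add(self, doc_id, text):
        # "analysis" here is just lowercasing and whitespace tokenizing
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: documents containing every query term
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings[terms[0]])
        for term in terms[1:]:
            result &= self.postings[term]
        return result

idx = InvertedIndex()
idx.add(1, "this is a quite full text")
idx.add(2, "another text")
print(sorted(idx.search("full text")))  # [1]
print(sorted(idx.search("text")))       # [1, 2]
```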
6. SQL TABLE
LUCENE INDEX
CREATE TABLE t (
  id int primary key,
  name string,
  marks array(float),
  text string index using fulltext
) clustered into 5 shards
with (number_of_replicas=1)
• 1 TABLE
  • S shards, each with R replicas
  • metadata in cluster state
• 1 SHARD
  • 1 Lucene index (inverted index + stored documents/fields)
  • field mappings
  • field caches
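How "clustered into 5 shards" maps rows to shards can be sketched as hashing a routing value (by default the primary key) modulo the shard count fixed at CREATE TABLE time. This is an assumption-level illustration of the routing idea, not Crate's or Elasticsearch's actual hash function.

```python
import hashlib

NUM_SHARDS = 5  # matches "clustered into 5 shards" above

def shard_for(routing_value):
    # a stable hash, so every node routes the same key to the same shard
    digest = hashlib.md5(str(routing_value).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# every insert/lookup for id=42 always hits the same shard
print(shard_for(42) == shard_for(42))          # True
print(0 <= shard_for("Zaphod") < NUM_SHARDS)   # True
```

Because the shard count participates in the hash, it cannot change after table creation without re-routing every row, which is why the shard count is part of the CREATE TABLE statement.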
7. SQL TABLE
LUCENE INDEX
Components
• Inverted Index
• Translog (WAL)
• "Tuple Store" - Stored Fields
• Lucene Field Data
• DocValues (on disk)
9. SQL TABLE
LUCENE INDICES
• Differences to relational databases
  • DISTRIBUTED
  • 2 different indices needed for all operations
  • inverted index not suited for all kinds of queries
  • persistence is expensive
  • limited schema altering
  • no pull-based database cursor (yet)
10. Crate
Features
• Distributed SQL database written in Java (7)
• Accessible via HTTP & TCP (TCP for Java clients only)
• Graphical admin interface
• Blob storage
• Plugin infrastructure
• Clients available in Java, Python, Ruby, PHP, Scala, Node.js, Erlang
• Runs on Docker, AWS, GCE, Mesos, …
12. Crate
SQL
• subset of ANSI SQL with extensions
• arrays and nested objects
• different types
• Information Schema
• cluster and node state exposed via tables
• Partitioned Tables
• speaks JDBC, ODBC, SQLAlchemy, ActiveRecord, PHP-PDO/DBAL
14. Crate
SQL … NOT
• JOINs underway
• no subselects or foreign key constraints yet
• no sessions, no client cursor
• no transactions
15. CRATE - A RELATIONAL DATABASE
• Relational Algebra
  • SQL statement
  • tree of relational operators
  • mostly tables == leaves
• ES
  • single-table operations only
• No simple SQL wrapper around the ES Query DSL

SELECT
  id,
  substr(name, 4),
  id % 2 AS "EVEN",
  text,
  marks
FROM t
WHERE
  name IS NOT NULL
  AND match(text, 'full')
ORDER BY id DESC
LIMIT 10;
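The "tree of relational operators" idea can be shown in miniature: the SELECT on this slide becomes roughly Limit(10, OrderBy(id desc, Filter(name IS NOT NULL, Scan(t)))), where each operator consumes the rows of its child. This is a toy pipeline for illustration, not Crate's planner.

```python
def scan(table):
    # leaf of the operator tree: the table itself
    yield from table

def filter_(rows, pred):
    return (r for r in rows if pred(r))

def order_by(rows, key, desc=False):
    # sorting is a blocking operator: it must consume its child fully
    return iter(sorted(rows, key=key, reverse=desc))

def limit(rows, n):
    for i, r in enumerate(rows):
        if i >= n:
            break
        yield r

t = [{"id": 1, "name": None}, {"id": 2, "name": "x"}, {"id": 3, "name": "y"}]

plan = limit(
    order_by(
        filter_(scan(t), lambda r: r["name"] is not None),
        key=lambda r: r["id"], desc=True),
    2)
print([r["id"] for r in plan])  # [3, 2]
```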
16. Querying crate
• Query Engine
  • node-based query execution
  • directly against the Lucene indices
  • circumventing ES query execution

SELECT
  id,
  substr(name, 4),
  id % 2 AS "EVEN",
  text,
  marks
FROM t
WHERE
  name IS NOT NULL
  AND match(text, 'full')
ORDER BY id DESC
LIMIT 10;
17. SQL TABLE
LUCENE INDEX
INSERT INTO t (id, name, marks, text)
VALUES (
  42,
  format('%d - %s', 42, 'Zaphod'),
  [1.5, 4.6],
  'this is a quite full text!')
ON DUPLICATE KEY UPDATE name='DUPLICATE';
• INSERT INTO
  • insert values are validated against their configured types
  • types are guessed for new columns
  • primary key and routing values extracted
  • JSON _source is created:
    {"id": 42, "name": "42 - Zaphod", "marks": [1.5, 4.6], "text": "this is a quite full text!"}
18. SQL TABLE
LUCENE INDEX
{
  "id": 42,
  "name": "42 - Zaphod",
  "marks": [1.5, 4.6],
  "text": "this is a quite full text!"
}
• INSERT INTO
  • request is routed by the "id" column to the node containing the shard
  • row stored on that shard
20. Querying crate
• Sorting and Grouping
  • inverted index is not enough
  • per-document values (DocValues)

SELECT
  id,
  substr(name, 4),
  id % 2 AS "EVEN",
  text,
  marks
FROM t
WHERE
  name IS NOT NULL
  AND match(text, 'full')
ORDER BY id DESC
LIMIT 10;
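Why the inverted index alone cannot sort: it maps term -> documents, but ORDER BY needs the reverse lookup, document -> value. DocValues is a columnar per-document store; a plain list indexed by doc id models it. Illustration only; Lucene's on-disk format is more elaborate.

```python
# inverted index: term -> doc ids (good for WHERE, useless for ORDER BY)
inverted = {"full": {0, 2}, "text": {0, 1, 2}}

# doc values: doc id -> column value (columnar; good for ORDER BY/GROUP BY)
doc_values_id = [42, 7, 99]  # value of column "id" for docs 0, 1, 2

matching = inverted["full"]  # docs matching match(text, 'full')
# ORDER BY id DESC, evaluated only over the matching docs:
ordered = sorted(matching, key=lambda d: doc_values_id[d], reverse=True)
print(ordered)  # [2, 0]  (doc 2 has id 99, doc 0 has id 42)
```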
21. Querying crate
• "Simple" SELECT - QTF (query then fetch)
  • extract fields to SELECT
  • route to shards / Lucene indices
  • open and keep Lucene reader in query context
  • only collect doc/row identifiers (and all fields necessary for sorting)
  • merge separate results on handler
  • apply limit/offset
  • fetch all fields
  • evaluate expressions
  • return Object[][]

SELECT
  id,
  substr(name, 4),
  id % 2 AS "EVEN",
  text,
  marks
FROM t
WHERE
  name IS NOT NULL
  AND match(text, 'full')
ORDER BY id DESC
LIMIT 10;
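The query-then-fetch steps above can be sketched as two phases (assumed shapes, not Crate's API): phase 1 collects only (sort key, row reference) per shard, the handler merges the locally sorted partials and applies the global LIMIT, and phase 2 fetches full rows only for the winners.

```python
import heapq

# each "shard" holds full rows keyed by an internal doc reference
shards = {
    "s0": {10: {"id": 1}, 11: {"id": 5}},
    "s1": {20: {"id": 3}, 21: {"id": 9}},
}

LIMIT = 2

def query_phase(name):
    # per shard: return (sort_key, shard, ref), top-LIMIT, sorted descending
    keyed = [(row["id"], name, ref) for ref, row in shards[name].items()]
    return sorted(keyed, reverse=True)[:LIMIT]

# handler: merge the locally sorted partials, apply the global LIMIT
merged = heapq.merge(*(query_phase(n) for n in shards), reverse=True)
winners = list(merged)[:LIMIT]

# fetch phase: pull full rows only for the winning references
rows = [shards[shard][ref] for _, shard, ref in winners]
print([r["id"] for r in rows])  # [9, 5]
```

Collecting only sort keys in phase 1 keeps the merge cheap; the expensive stored-field reads happen just LIMIT times, not once per candidate row.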
22. Querying crate
• INTERNAL PAGING
  • problems with big result sets / high offsets
  • need to fetch LIMIT + OFFSET rows from every shard
  • execution starts at the TOP relation
  • trickles down to the tables (Lucene indices)
  • hybrid of push- and pull-based data flow

SELECT
  id,
  substr(name, 4),
  id % 2 AS "EVEN",
  text,
  marks
FROM t
WHERE
  name IS NOT NULL
  AND match(text, 'full')
ORDER BY id DESC
LIMIT 1
OFFSET 10000000;
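A back-of-the-envelope sketch of why deep offsets hurt: a shard cannot know which of its rows fall inside the global window, so every shard must return its top LIMIT + OFFSET candidates, and the handler discards almost all of them.

```python
def rows_transferred(num_shards, limit, offset):
    # every shard must send limit + offset candidate rows
    per_shard = limit + offset
    total = num_shards * per_shard
    kept = limit  # the handler keeps only LIMIT rows after merging
    return total, kept

# the LIMIT 1 OFFSET 10000000 query above, on the 5-shard table:
total, kept = rows_transferred(num_shards=5, limit=1, offset=10_000_000)
print(total)  # 50000005 candidate rows moved
print(kept)   # 1 row actually returned
```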
23. Querying crate
• GROUP BY - AGGREGATIONS
  • aggregation framework developed in parallel to Elasticsearch aggregations
  • ES - 2-phase aggregations (HyperLogLog, moving averages, percentiles, …)
  • online algorithms on partial (mergeable) data necessary
  • https://github.com/elastic/elasticsearch/issues/4915

SELECT
  avg(temp) AS avg,
  stddev(temp) AS stddev,
  max(temp) AS max,
  min(temp) AS min,
  count(distinct temp),
  date_trunc('year', date) AS year
FROM t
WHERE temp IS NOT NULL
GROUP BY 2
ORDER BY avg DESC
LIMIT 10;
24. Querying crate
• GROUP BY - AGGREGATIONS
  • split into 3 phases
  • partial aggregation executed on each shard in parallel
  • partial results distributed to "reduce" nodes by hashing the group keys
  • final aggregation on handler/reducer
  • merge on handler

SELECT
  avg(temp) AS avg,
  stddev(temp) AS stddev,
  max(temp) AS max,
  min(temp) AS min,
  count(distinct temp),
  date_trunc('year', date) AS year
FROM t
WHERE temp IS NOT NULL
GROUP BY 2
ORDER BY avg DESC
LIMIT 10;
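The 3 phases can be sketched with avg as the classic mergeable state (sum, count): shards produce partial states, states are routed to a reducer by hashing the group key, and the reducer finalizes. This is an illustrative structure, not Crate's implementation; count(distinct) would carry a set, or a HyperLogLog sketch to bound its size.

```python
def partial(rows):
    # phase 1, per shard: group_key -> (sum, count)
    states = {}
    for key, temp in rows:
        s, c = states.get(key, (0.0, 0))
        states[key] = (s + temp, c + 1)
    return states

def merge(a, b):
    # phase 2, on the reducer: combine two partial states per group key
    out = dict(a)
    for key, (s, c) in b.items():
        s0, c0 = out.get(key, (0.0, 0))
        out[key] = (s0 + s, c0 + c)
    return out

def finalize(states):
    # phase 3: turn merged states into final avg values
    return {key: s / c for key, (s, c) in states.items()}

shard0 = partial([(2014, 10.0), (2014, 20.0)])
shard1 = partial([(2014, 30.0), (2015, 5.0)])
print(finalize(merge(shard0, shard1)))  # {2014: 20.0, 2015: 5.0}
```

The key property is that (sum, count) merges associatively, so shards never need to see each other's raw rows.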
25. Querying crate

SELECT
  avg(temp) AS avg,
  stddev(temp) AS stddev,
  max(temp) AS max,
  min(temp) AS min,
  count(distinct temp),
  date_trunc('year', date) AS year
FROM t
WHERE temp IS NOT NULL
GROUP BY 2
ORDER BY avg DESC
LIMIT 10;

(Diagram: partial aggregation results, e.g. [1,2,2], [2,3,7], [4,9,42], flow from the Shards to the Reducer nodes, where they are merged, and finally to the Handler.)
26. Querying crate
• GROUP BY - AGGREGATIONS
  • row authority by hashing
  • split huge datasets
  • expensive intermediate aggregation states possible (COUNT DISTINCT)

SELECT
  avg(temp) AS avg,
  stddev(temp) AS stddev,
  max(temp) AS max,
  min(temp) AS min,
  count(distinct temp),
  date_trunc('year', date) AS year
FROM t
WHERE temp IS NOT NULL
GROUP BY 2
ORDER BY avg DESC
LIMIT 10;
27. FINALLY
• GETTING RELATIONAL…
  • still in transition
  • more relational operators to come
  • JOINs are underway
  • CROSS JOINs already "work"