A talk given at the August, 2013 NoSQL Now! conference in San Jose, California.
2013 is the year of that SQL meets Hadoop & NoSQL. There is a surprisingly good fit between thirty-something SQL and the upstart millennial data technologies. SQL helps you deliver faster value, improve integration with visualization and analytics tools, and bring your data to a larger audience.
Julian Hyde, author of Mondrian and Optiq, and a contributor to Apache Drill and Cascading Lingual, shows how to quickly build a SQL interface to a NoSQL system using the Optiq framework.
As a case study, he shows how Optiq has allows the Mondrian in-memory analysis engine (part of Pentaho's business intelligence suite) to leverage the memory and compute power of a cluster and pull data from hybrid SQL and NoSQL sources.
Discardable In-Memory Materialized Queries With HadoopJulian Hyde
What to do with all that memory in a Hadoop cluster? Should we load all of our data into memory to process it?
The goal should be to put memory into its right place in the storage hierarchy, alongside disk and solid-state drives (SSD). Data should reside in the right place for how it is being used, and should be organized appropriately for where it resides. This proposed solution requires a new kind of data set called the Discardable, In-Memory, Materialized Query (DIMMQ).
In this session we will talk through how we can build on existing Hadoop facilities to deliver three key underlying concepts that enable this approach.
What's new in Mondrian 4? Slides for a talk given by Julian Hyde to the Pentaho Bay Area User Group meetup in San Francisco on April 3rd, 2014.
Topics covered include attributes and attribute hierarchies, measure groups and aggregate tables, physical schema, and how to download and start using Mondrian 4 beta with Pentaho CE.
SQL on Big Data is not a "one size fits all". Optiq is a framework that allows you to build a data management system on top of any back-end system, including NoSQL and Hadoop, and rules that optimize query processing for capabilities of the data source. We show how Optiq is used in the Apache Drill and Cascading Lingual projects, and how we plan to combine Optiq materialized views, Mondrian, and a data grid to create next-generation in-memory analytics.
This presentation was given at the Real-Time Big Data meetup at RichRelevance in San Francisco, 2013-04-09.
Why is data independence (still) so important? Optiq and Apache Drill.Julian Hyde
Presentation to the Apache Drill Meetup in Sunnyvale, CA on 2012/9/13. Framing the debate about Drill's goals in terms of a "typical" modern DBMS architecture; and also introducing the Optiq extensible query optimizer.
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
A talk from given by Julian Hyde and Tomer Shiran at Hadoop Summit, Dublin.
Data scientists and analysts want the best API, DSL or query language possible, not to be limited by what the processing engine can support. Polyalgebra is an extension to relational algebra that separates the user language from the engine, so you can choose the best language and engine for the job. It also allows the system to optimize queries and cache results. We demonstrate how Ibis uses Polyalgebra to execute the same Python-based machine learning queries on Impala, Drill and Spark. And we show how to build Polyalgebra expressions in Calcite and how to define optimization rules and storage handlers.
With the rise of the Internet of Things (IoT) and low-latency analytics, streaming data becomes ever more important. Surprisingly, one of the most promising approaches for processing streaming data is SQL. In this presentation, Julian Hyde shows how to build streaming SQL analytics that deliver results with low latency, adapt to network changes, and play nicely with BI tools and stored data. He also describes how Apache Calcite optimizes streaming queries, and the ongoing collaborations between Calcite and the Storm, Flink and Samza projects.
This talk was given Julian Hyde at Apache Big Data conference, Vancouver, on 2016/05/09.
Discardable In-Memory Materialized Queries With HadoopJulian Hyde
What to do with all that memory in a Hadoop cluster? Should we load all of our data into memory to process it?
The goal should be to put memory into its right place in the storage hierarchy, alongside disk and solid-state drives (SSD). Data should reside in the right place for how it is being used, and should be organized appropriately for where it resides. This proposed solution requires a new kind of data set called the Discardable, In-Memory, Materialized Query (DIMMQ).
In this session we will talk through how we can build on existing Hadoop facilities to deliver three key underlying concepts that enable this approach.
What's new in Mondrian 4? Slides for a talk given by Julian Hyde to the Pentaho Bay Area User Group meetup in San Francisco on April 3rd, 2014.
Topics covered include attributes and attribute hierarchies, measure groups and aggregate tables, physical schema, and how to download and start using Mondrian 4 beta with Pentaho CE.
SQL on Big Data is not a "one size fits all". Optiq is a framework that allows you to build a data management system on top of any back-end system, including NoSQL and Hadoop, and rules that optimize query processing for capabilities of the data source. We show how Optiq is used in the Apache Drill and Cascading Lingual projects, and how we plan to combine Optiq materialized views, Mondrian, and a data grid to create next-generation in-memory analytics.
This presentation was given at the Real-Time Big Data meetup at RichRelevance in San Francisco, 2013-04-09.
Why is data independence (still) so important? Optiq and Apache Drill.Julian Hyde
Presentation to the Apache Drill Meetup in Sunnyvale, CA on 2012/9/13. Framing the debate about Drill's goals in terms of a "typical" modern DBMS architecture; and also introducing the Optiq extensible query optimizer.
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
A talk from given by Julian Hyde and Tomer Shiran at Hadoop Summit, Dublin.
Data scientists and analysts want the best API, DSL or query language possible, not to be limited by what the processing engine can support. Polyalgebra is an extension to relational algebra that separates the user language from the engine, so you can choose the best language and engine for the job. It also allows the system to optimize queries and cache results. We demonstrate how Ibis uses Polyalgebra to execute the same Python-based machine learning queries on Impala, Drill and Spark. And we show how to build Polyalgebra expressions in Calcite and how to define optimization rules and storage handlers.
With the rise of the Internet of Things (IoT) and low-latency analytics, streaming data becomes ever more important. Surprisingly, one of the most promising approaches for processing streaming data is SQL. In this presentation, Julian Hyde shows how to build streaming SQL analytics that deliver results with low latency, adapt to network changes, and play nicely with BI tools and stored data. He also describes how Apache Calcite optimizes streaming queries, and the ongoing collaborations between Calcite and the Storm, Flink and Samza projects.
This talk was given Julian Hyde at Apache Big Data conference, Vancouver, on 2016/05/09.
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
ONE FOR ALL! Using Apache Calcite to make SQL smartEvans Ye
In the past when Hadoop was born, the big data world were focusing on how to build systems that scales. Now the world has evolved. HBase hits 2.0, Cassandra hits 3.0, Hive hits 3.0, etc. When scalability is conquered, what's next? That’s right, usability comes into play. If we look back into the history, NoSQL is really just using divide and concur mechanism to tackle big data problems by trading off SQL capabilities. But once big data problem solved, we see more and more NoSQL and data processing engines start to build up SQL or SQL-like interfaces. Therefore, a generic SQL engine that provides core SQL capabilities such as query parsing, relational algebra, and query optimization starts to shine.
In this talk, I'll walk you through the architecture, functionality, and design concept of Apache Calcite. Notice that Calcite itself is not a database, but many well known systems already incorporate Calcite as a library. For instance, Hive, Drill, Druid, Phoenix, Apex, Flink, Storm, Samza, and more. To better illustrate how Calcite works, I'll choose some of the systems and describe how they adopt Calcite and which part is enhanced by Calcite. Furthermore, I'll talk about several features that Calcite provides such as query optimization, heterogeneous data source, materialized view, and Stream SQL. From user's perspective, knowing better how these systems work behind the scene equips you with more knowledge to chose a system that ultimately suits your needs.
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives.
In this tutorial (given at BOSS '21 in Copenhagen as part of VLDB '21) the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Presenters: Julian Hyde and Stamatis Zampetakis
How to integrate Splunk with any data solutionJulian Hyde
A presentation Julian Hyde gave to the Splunk 2012 User conference in Las Vegas, Tue 2012/9/11. Julian demonstrated a new technology called Optiq, described how it could be used to integrate data in Splunk with other systems, and demonstrated several queries accessing data in Splunk via SQL and JDBC.
https://fosdem.org/2017/schedule/event/hpc_bigdata_calcite/
When working with BigData & IoT systems we often feel the need for a Common Query Language. The platform specific languages are often harder to integrate with and require longer adoption time.
To fill this gap many NoSql (Not-only-Sql) vendors are building SQL layers for their platforms. It is worth exploring the driving forces behind this trend, how it fits in your BigData stacks and how we can adopt it in our favorite tools. However building SQL engine from scratch is a daunting job and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allow you to integrate SQL parser, cost-based optimizer, and JDBC with your big data system.
Calcite has been used to empower many Big-Data platforms such as Hive, Spark, Drill Phoenix to name some.
I will walk you through the process of building a SQL access layer for Apache Geode (In-Memory Data Grid). I will share my experience, pitfalls and technical consideration like balancing between the SQL/RDBMS semantics and the design choices and limitations of the data system.
Hopefully this will enable you to add SQL capabilities to your prefered NoSQL data system.
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
What if Apache Pig had a SQL front-end and query optimizer? What if Apache Calcite was able to use Pig and MapReduce to run queries? In this project, we aimed to answer both questions by adding a Pig adapter for Calcite. In this talk, we describe Calcite's adapter framework, how we used it to write a Pig adapter, and how you can use this SQL interface to Pig for interactive and long-running queries.
A talk given by Eli Levine and Julian Hyde at Apache: Big Data, Miami, on May 17th, 2017.
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
The revolution has happened. We are living the age of the deconstructed database. The modern enterprises are powered by data, and that data lives in many formats and locations, in-flight and at rest, but somewhat surprisingly, the lingua franca for remains SQL.
In this talk, Julian describes Apache Calcite, a toolkit for relational algebra that powers many systems including Apache Beam, Flink and Hive. He discusses some areas of development in Calcite: streaming SQL, materialized views, enabling spatial query on vanilla databases, and what a mash-up of all three might look like.
He also describes how SQL is being extended to handle streaming, and the challenges that will need to be solved if it is to become standard.
A talk given by Julian Hyde at Lyft, San Francisco, on 2018/06/27.
Enterprise data is moving into Hadoop, but some data has to stay in operational systems. Apache Calcite (the technology behind Hive’s new cost-based optimizer, formerly known as Optiq) is a query-optimization and data federation technology that allows you to combine data in Hadoop with data in NoSQL systems such as MongoDB and Splunk, and access it all via SQL.
Hyde shows how to quickly build a SQL interface to a NoSQL system using Calcite. He shows how to add rules and operators to Calcite to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Christian Tzolov
When working with BigData & IoT systems we often feel the need for a Common Query Language. The system specific languages usually require longer adoption time and are harder to integrate within the existing stacks.
To fill this gap some NoSql vendors are building SQL access to their systems. Building SQL engine from scratch is a daunting job and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allow you to integrate SQL parser, cost-based optimizer, and JDBC with your NoSql system.
We will walk through the process of building a SQL access layer for Apache Geode (In-Memory Data Grid). I will share my experience, pitfalls and technical consideration like balancing between the SQL/RDBMS semantics and the design choices and limitations of the data system.
Hopefully this will enable you to add SQL capabilities to your prefered NoSQL data system.
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
What if Looker saw the queries you just executed and could predict your next query? Could it make those queries faster, by smarter caching, or aggregate navigation? Could it read your past SQL queries and help you write your LookML model? Those are some of the reasons to add relational algebra into Looker’s query engine, and why Looker hired Julian Hyde, author of Apache Calcite, to lead the effort. In this talk about the internals of Looker’s query engine, Julian Hyde will describe how the engine works, how Looker queries are described in Calcite’s relational algebra, and some features that it makes possible.
A talk by Julian Hyde at JOIN 2019 in San Francisco.
Mondrian update (Pentaho community meetup 2012, Amsterdam)Julian Hyde
Mondrian 4 is major improvement to Mondrian model & engine. It is now beta.
In this talk, I explain what is new, what has been changed, what has been taken away, and how you can help us test it, and get it to production quality faster.
A talk given by Julian Hyde at FlinkForward, Berlin, on 2016/09/12.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Flink is using Calcite to support both regular and streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
This talk provides an in-depth overview of the key concepts of Apache Calcite. It explores the Calcite catalog, parsing, validation, and optimization with various planners.
Optiq is a dynamic query planning framework. It can potentially help integrate Pentaho Mondrian and Kettle with various SQL, NoSQL and BigData data sources.
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
This talk, given by Maryann Xue and Julian Hyde at Hadoop Summit, San Jose on June 30th, 2016, describes how we re-engineered Apache Phoenix with a cost-based optimizer based on Apache Calcite.
Apache Phoenix has rapidly become a workhorse in many organizations, providing a convenient standard SQL interface to HBase suitable for a wide variety of workloads from transactions to ETL and analytics. But Phoenix's initial query optimizer was based on static optimization procedures and thus could not choose between several potential plans or indices based on cost metrics.
We describe how we rebuilt Phoenix's parser and query optimizer using the Calcite framework, improving Phoenix's performance and SQL compliance. The new architecture uses relational algebra as an intermediate language, and this enables you to switch in other engines, especially those also based on Calcite. As an example of this, we demonstrate querying a Phoenix database via Apache Drill.
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
Slides for a talk.
Talk abstract:
In the dark of the night, if you listen carefully enough, you can hear databases cry. But why? As developers, we rarely consider what happens under the hood of widely used abstractions such as databases. As a consequence, we rarely think about the performance of databases. This is especially true to less widespread, but often very useful NoSQL databases.
In this talk we will take a close look at NoSQL database performance, peek under the hood of the most frequently used features to see how they affect performance and discuss performance issues and bottlenecks inherent to all databases.
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
ONE FOR ALL! Using Apache Calcite to make SQL smartEvans Ye
In the past when Hadoop was born, the big data world were focusing on how to build systems that scales. Now the world has evolved. HBase hits 2.0, Cassandra hits 3.0, Hive hits 3.0, etc. When scalability is conquered, what's next? That’s right, usability comes into play. If we look back into the history, NoSQL is really just using divide and concur mechanism to tackle big data problems by trading off SQL capabilities. But once big data problem solved, we see more and more NoSQL and data processing engines start to build up SQL or SQL-like interfaces. Therefore, a generic SQL engine that provides core SQL capabilities such as query parsing, relational algebra, and query optimization starts to shine.
In this talk, I'll walk you through the architecture, functionality, and design concept of Apache Calcite. Notice that Calcite itself is not a database, but many well known systems already incorporate Calcite as a library. For instance, Hive, Drill, Druid, Phoenix, Apex, Flink, Storm, Samza, and more. To better illustrate how Calcite works, I'll choose some of the systems and describe how they adopt Calcite and which part is enhanced by Calcite. Furthermore, I'll talk about several features that Calcite provides such as query optimization, heterogeneous data source, materialized view, and Stream SQL. From user's perspective, knowing better how these systems work behind the scene equips you with more knowledge to chose a system that ultimately suits your needs.
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives.
In this tutorial (given at BOSS '21 in Copenhagen as part of VLDB '21) the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Presenters: Julian Hyde and Stamatis Zampetakis
How to integrate Splunk with any data solutionJulian Hyde
A presentation Julian Hyde gave to the Splunk 2012 User conference in Las Vegas, Tue 2012/9/11. Julian demonstrated a new technology called Optiq, described how it could be used to integrate data in Splunk with other systems, and demonstrated several queries accessing data in Splunk via SQL and JDBC.
https://fosdem.org/2017/schedule/event/hpc_bigdata_calcite/
When working with BigData & IoT systems we often feel the need for a Common Query Language. The platform specific languages are often harder to integrate with and require longer adoption time.
To fill this gap many NoSql (Not-only-Sql) vendors are building SQL layers for their platforms. It is worth exploring the driving forces behind this trend, how it fits in your BigData stacks and how we can adopt it in our favorite tools. However building SQL engine from scratch is a daunting job and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allow you to integrate SQL parser, cost-based optimizer, and JDBC with your big data system.
Calcite has been used to empower many Big-Data platforms such as Hive, Spark, Drill Phoenix to name some.
I will walk you through the process of building a SQL access layer for Apache Geode (In-Memory Data Grid). I will share my experience, pitfalls and technical consideration like balancing between the SQL/RDBMS semantics and the design choices and limitations of the data system.
Hopefully this will enable you to add SQL capabilities to your prefered NoSQL data system.
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
What if Apache Pig had a SQL front-end and query optimizer? What if Apache Calcite was able to use Pig and MapReduce to run queries? In this project, we aimed to answer both questions by adding a Pig adapter for Calcite. In this talk, we describe Calcite's adapter framework, how we used it to write a Pig adapter, and how you can use this SQL interface to Pig for interactive and long-running queries.
A talk given by Eli Levine and Julian Hyde at Apache: Big Data, Miami, on May 17th, 2017.
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
The revolution has happened. We are living the age of the deconstructed database. The modern enterprises are powered by data, and that data lives in many formats and locations, in-flight and at rest, but somewhat surprisingly, the lingua franca for remains SQL.
In this talk, Julian describes Apache Calcite, a toolkit for relational algebra that powers many systems including Apache Beam, Flink and Hive. He discusses some areas of development in Calcite: streaming SQL, materialized views, enabling spatial query on vanilla databases, and what a mash-up of all three might look like.
He also describes how SQL is being extended to handle streaming, and the challenges that will need to be solved if it is to become standard.
A talk given by Julian Hyde at Lyft, San Francisco, on 2018/06/27.
Enterprise data is moving into Hadoop, but some data has to stay in operational systems. Apache Calcite (the technology behind Hive’s new cost-based optimizer, formerly known as Optiq) is a query-optimization and data federation technology that allows you to combine data in Hadoop with data in NoSQL systems such as MongoDB and Splunk, and access it all via SQL.
Hyde shows how to quickly build a SQL interface to a NoSQL system using Calcite. He shows how to add rules and operators to Calcite to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Christian Tzolov
When working with BigData & IoT systems we often feel the need for a Common Query Language. The system specific languages usually require longer adoption time and are harder to integrate within the existing stacks.
To fill this gap some NoSql vendors are building SQL access to their systems. Building SQL engine from scratch is a daunting job and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allow you to integrate SQL parser, cost-based optimizer, and JDBC with your NoSql system.
We will walk through the process of building a SQL access layer for Apache Geode (In-Memory Data Grid). I will share my experience, pitfalls and technical consideration like balancing between the SQL/RDBMS semantics and the design choices and limitations of the data system.
Hopefully this will enable you to add SQL capabilities to your prefered NoSQL data system.
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
What if Looker saw the queries you just executed and could predict your next query? Could it make those queries faster, by smarter caching, or aggregate navigation? Could it read your past SQL queries and help you write your LookML model? Those are some of the reasons to add relational algebra into Looker’s query engine, and why Looker hired Julian Hyde, author of Apache Calcite, to lead the effort. In this talk about the internals of Looker’s query engine, Julian Hyde will describe how the engine works, how Looker queries are described in Calcite’s relational algebra, and some features that it makes possible.
A talk by Julian Hyde at JOIN 2019 in San Francisco.
Mondrian update (Pentaho community meetup 2012, Amsterdam)Julian Hyde
Mondrian 4 is major improvement to Mondrian model & engine. It is now beta.
In this talk, I explain what is new, what has been changed, what has been taken away, and how you can help us test it, and get it to production quality faster.
A talk given by Julian Hyde at FlinkForward, Berlin, on 2016/09/12.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Flink is using Calcite to support both regular and streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
This talk provides an in-depth overview of the key concepts of Apache Calcite. It explores the Calcite catalog, parsing, validation, and optimization with various planners.
Optiq is a dynamic query planning framework. It can potentially help integrate Pentaho Mondrian and Kettle with various SQL, NoSQL and BigData data sources.
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
This talk, given by Maryann Xue and Julian Hyde at Hadoop Summit, San Jose on June 30th, 2016, describes how we re-engineered Apache Phoenix with a cost-based optimizer based on Apache Calcite.
Apache Phoenix has rapidly become a workhorse in many organizations, providing a convenient standard SQL interface to HBase suitable for a wide variety of workloads from transactions to ETL and analytics. But Phoenix's initial query optimizer was based on static optimization procedures and thus could not choose between several potential plans or indices based on cost metrics.
We describe how we rebuilt Phoenix's parser and query optimizer using the Calcite framework, improving Phoenix's performance and SQL compliance. The new architecture uses relational algebra as an intermediate language, and this enables you to switch in other engines, especially those also based on Calcite. As an example of this, we demonstrate querying a Phoenix database via Apache Drill.
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
Slides for a talk.
Talk abstract:
In the dark of the night, if you listen carefully enough, you can hear databases cry. But why? As developers, we rarely consider what happens under the hood of widely used abstractions such as databases. As a consequence, we rarely think about the performance of databases. This is especially true to less widespread, but often very useful NoSQL databases.
In this talk we will take a close look at NoSQL database performance, peek under the hood of the most frequently used features to see how they affect performance and discuss performance issues and bottlenecks inherent to all databases.
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets.
If you want to do multi-dimension analysis on large data sets (billion+ rows) with low query latency (sub-seconds), Kylin is a good option. Kylin also provides seamless integration with existing BI tools (e.g Tableau).
MongoDB has taken a clear lead in adoption among the new generation of databases, including the enormous variety of NoSQL offerings. A key reason for this lead has been a unique combination of agility and scalability. Agility provides business units with a quick start and flexibility to maintain development velocity, despite changing data and requirements. Scalability maintains that flexibility while providing fast, interactive performance as data volume and usage increase. We'll address the key organizational, operational, and engineering considerations to ensure that agility and scalability stay aligned at increasing scale, from small development instances to web-scale applications. We will also survey some key examples of highly-scaled customer applications of MongoDB.
Solr cloud the 'search first' nosql database extended deep divelucenerevolution
Presented by Mark Miller, Software Engineer, Cloudera
As the NoSQL ecosystem looks to integrate great search, great search is naturally beginning to expose many NoSQL features. Will these Goliath's collide? Or will they remain specialized while intermingling – two sides of the same coin.
Come learn about where SolrCloud fits into the NoSQL landscape. What can it do? What will it do? And how will the big data, NoSQL, Search ecosystem evolve. If you are interested in Big Data, NoSQL, distributed systems, CAP theorem and other hype filled terms, than this talk may be for you.
An Open Talk at DeveloperWeek Austin 2017 by Kimberly Wilkins (@dba_denizen), Principal Engineer - Databases at ObjectRocket. Featuring new use cases like Bitcoin, AI, IoT, and all the cool things.
Similar to SQL Now! How Optiq brings the best of SQL to NoSQL data. (20)
Building a semantic/metrics layer using CalciteJulian Hyde
A semantic layer, also known as a metrics layer, lies between business users and the database, and lets those users compose queries in the concepts that they understand. It also governs access to the data, manages data transformations, and can tune the database by defining materializations.
Like many new ideas, the semantic layer is a distillation and evolution of many old ideas, such as query languages, multidimensional OLAP, and query federation.
In this talk, we describe the features we are adding to Calcite to define business views, query measures, and optimize performance.
A talk given at Community over Code, the annual conference of the Apache Software Foundation, in Halifax, NS, on 9th October, 2023.
If SQL is the universal language of data, why do we author our most important data applications (metrics, analytics, business intelligence) in languages other than SQL? Multidimensional databases and languages such as MDX, DAX and Tableau LOD solve these problems but introduce others: they require specialized knowledge, complicate the data pipeline and don’t integrate well. Is it possible to define and query business intelligence models in SQL?
Apache Calcite has extended SQL to support metrics (which we call ‘measures’), filter context, and analytic expressions. With these concepts you can define data models (which we call Analytic Views) that contain metrics, use them in queries, and define new metrics in queries.
In this talk by the original developer of Apache Calcite, we describe the SQL syntax extensions for metrics,
and how to use them for cross-dimensional calculations such as period-over-period, percent-of-total,
non-additive and semi-additive measures. We describe how we got around fundamental limitations in SQL
semantics, and approaches for optimizing queries that use metrics.
A talk given by Julian Hyde at Data Council, Austin, TX, on March 29, 2023.
If SQL is the universal language of data, why do we author our most important data applications (metrics, analytics, business intelligence) in languages other than SQL? Multidimensional databases and languages such as MDX, DAX and Tableau LOD solve these problems but introduce others: they require specialized knowledge, complicate the data pipeline and don’t integrate well. Is it possible to define and query business intelligence models in SQL?
Apache Calcite has extended SQL to support measures, evaluation context, and multidimensional expressions. With these concepts you can define data models that contain measures, use them in queries, and define new measures in queries.
A talk given by Julian Hyde to Apache Calcite's virtual meetup, 2023-03-15.
Morel, a data-parallel programming languageJulian Hyde
What would the perfect data-parallel programming language look like? It would be as expressive as a general-purpose functional programming language, as powerful and concise as SQL, and run programs just as efficiently on a laptop or a thousand-node cluster.
We present Morel, a functional programming language with relational extensions, working towards that goal. Morel is implemented in the Apache Calcite community on top of Calcite’s relational algebra framework. In this talk, we describe Morel’s evolution, including how we are pushing Calcite’s capabilities with graph and recursive queries.
A talk given by Julian Hyde at ApacheCon, New Orleans, October 4th 2022.
Is there a perfect data-parallel programming language? (Experiments with More...Julian Hyde
The perfect data parallel language has not yet been invented. SQL queries can achieve great performance and scale, but there are many general purpose algorithms that it cannot express. In Morel, we build on the functional and relational roots of MapReduce in an elegant and strongly-typed general-purpose programming language. But Morel is, in a real sense, a query language; programs are executed on relational frameworks such as Google BigQuery and Spark.
In this talk, we describe the principles that drove Morel’s design, the problems that we had to solve in order to implement a hybrid functional/relational language, and how Morel can be applied to implement data-intensive systems.
We also introduce Apache Calcite, the popular open source framework for query planning, and describe how Morel's compiler uses Calcite's relational algebra and rewrite rules to generate efficient plans.
Is it easier to add functional programming features to a query language, or to add query capabilities to a functional language? In Morel, we have done the latter.
Functional and query languages have much in common, and yet much to learn from each other. Functional languages have a rich type system that includes polymorphism and functions-as-values and Turing-complete expressiveness; query languages have optimization techniques that can make programs several orders of magnitude faster, and runtimes that can use thousands of nodes to execute queries over terabytes of data.
Morel is an implementation of Standard ML on the JVM, with language extensions to allow relational expressions. Its compiler can translate programs to relational algebra and, via Apache Calcite’s query optimizer, run those programs on relational backends.
In this talk, we describe the principles that drove Morel’s design, the problems that we had to solve in order to implement a hybrid functional/relational language, and how Morel can be applied to implement data-intensive systems.
(A talk given by Julian Hyde at Strange Loop 2021, St. Louis, MO, on October 1st, 2021.)
The evolution of Apache Calcite and its CommunityJulian Hyde
Apache Calcite is an open source framework for building databases, and includes a SQL parser, relational algebra, and a highly extensible query optimizer.
It has achieved wide adoption, used in many commercial products, open source projects, and as a test bed for computer science research.
But there is a bootstrap problem: If software is written by a community of contributors, and each contributor acts in their own self-interest, how do you get the first working version of the product? The answer is in the story of how the technology evolved, and how the community evolved with it, and in this talk we tell that story.
The Hop project entered Apache Software Foundation as an Incubator project in 2020, and Julian Hyde, one of their mentors, gave this presentation to educate the initial committers on the Apache Way and what to expect during the Incubation process.
The talk was given by Julian Hyde on October 1st, 2020, with the original title "Apache Incubation - What's it all about?"
Efficient spatial queries on vanilla databasesJulian Hyde
A talk given by Julian Hyde at the Apache Calcite online meetup, 2021/01/20.
Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people are unaware of the power of spatial queries because their favorite database does not support them.
In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes. Calcite is bringing spatial query to the masses!
A talk given by Julian Hyde at DataCouncil SF on April 18, 2019
How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering — but it is also describes the purpose of every DBMS (database management system). And it’s not a coincidence that these are so similar.
This talk looks at the patterns that reoccur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of Relational Algebra and DBMS architecture. But then we apply these patterns to the modern data pipeline, ETL and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.
Don't optimize my queries, organize my data!Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we make it optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt. We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries.
A talk given by Julian Hyde at ApacheCon NA 2018 in Montreal on September 26th, 2018.
Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people are unaware of the power of spatial queries because their favorite database does not support them.
In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes. Calcite is bringing spatial query to the masses!
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Julian Hyde
A talk given at ACM SIGMOD 2018 in support of the paper <a href="https://arxiv.org/abs/1802.10233"> Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources</a>.
Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for the new types of data sources, query languages, and approaches to query processing and optimization.
A talk given by Julian Hyde at DataEngConf SF on April 17th 2018.
Did you know that databases often “cheat”? Even with a scalable query engine and smart optimizer, many real-world queries would be too slow if the engine read all the data, so the engine re-writes your query to use a pre-materialized result. B-tree indexes made the first relational databases possible, and there are now many flavors of materialization, from explicit materialized views to OLAP-style caching and spatial indexes. Materialization is more relevant than ever in today’s heterogenous, distributed systems.
If you are evaluating data engines, we describe what materialization features to look for in your next engine. If you are implementing an engine, we describe the features provided by Apache Calcite to design, maintain and use materializations.
Don’t optimize my queries, optimize my data!Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we evolve it so that it can optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt.
We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries, suggesting materialized views that will make your queries run faster without you changing them.
A talk given by Julian Hyde at DataEngConf NYC, Columbia University, on 2017/10/30.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at DataWorks Summit, San Jose, on June 14th 2017.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at Apache: Big Data, Miami, on May 16th 2017.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Apex is using Calcite to support streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Julian Hyde gave this talk at Apex Big Data World, Mountain View, on April 4, 2017.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
5. 3 things
1. Modern data challenges need both NoSQL
and SQL.
2. Optiq is not a database (and this is a Good
Thing).
3. How to use Optiq with your data.
8. NoSQL: trade-offs vs
SQL
Gain Lose
Scale out Optimization
Flexibility Data independence
Productivity Central control
Purchase cost Tool integration
9. SQL: key features
• Inspired by Codd’s 1970 “relational
database” paper
• Semantics based on relational operators
(scan, filter, project, join, agg, union, sort)
• All implementations transform queries
• Optimizer rewrites queries onto available
data structures and operators
15. Example #1: CSV
• Uses CSV adapter (optiq-csv)
• Demo using sqlline
• Easy to run this for yourself:
$ git clone https://github.com/julianhyde/optiq-csv
$ cd optiq-csv
$ mvn install
$ ./sqlline
16. !connect jdbc:optiq:model=target/test-classes/model.json admin admin
!tables
!describe emps
SELECT * FROM emps;
EXPLAIN PLAN FOR SELECT * FROM emps;
SELECT depts.name, count(*)
FROM emps JOIN depts USING (deptno)
GROUP BY depts.name;
model.json:
{
version: '1.0',
defaultSchema: 'SALES',
schemas: [
{
name: 'SALES',
type: 'custom',
factory: 'net.hydromatic.optiq.impl.csv.CsvSchemaFactory',
operand: {
directory: 'target/test-classes/sales'
}
}
]
}
18. SQL on NoSQL
• No schema or schema-on-read
• Optiq MAP columns, dynamic schema
• Nested data (Drill, MongoDB)
• Optiq ARRAY columns
• Source system optimized for transactions &
focused reads, not analytic queries
• Optiq cache and materializations
19. • Mongo’s standard “zips” data set with all US
zip codes
• Raw ZIPS table with _MAP column
• ZIPS view extracts named fields, including
nested latitude and longitude, and converts
them to SQL types
Example #2: MongoDB
20. Splunk
• NoSQL database
• Every log file in the enterprise
• A single “table”
• A record for every line in every log file
• A column for every field that exists in
any log file
• No schema
21. Example #3:
Splunk + MySQL
SELECT p.product_name,
COUNT(*) AS c
FROM splunk.splunk AS s
JOIN mysql.products AS p
ON s.product_id = p.product_id
WHERE s.action = 'purchase'
GROUP BY p.product_name
ORDER BY c DESC
25. Interactive analytics on
NoSQL?
• NoSQL operational DB (e.g. HBase,
MongoDB, Cassandra)
• Analytic queries aggregate over full scan
• Speed-of-thought response (< 5 seconds)
• Data freshness (< 10 minutes)
26. Simple analytics
problem
• 100M U.S. census records
• 1KB each record, 100GB total
• 4 SATA3 disks, total 1.2GB/s
• How to count all records in under 5s?
27. Simple analytics
problem
• 100M U.S. census records
• 1KB each record, 100GB total
• 4 SATA3 disks, total 1.2GB/s
• How to count all records in under 5s?
28. Simple analytics
problem
• 100M U.S. census records
• 1KB each record, 100GB total
• 4 SATA3 disks, total 1.2GB/s
• How to count all records in under 5s?
• Not possible?! It takes 80s just to read the
data.
30. Solution: Cheat!
• Compress data
• Column-oriented storage
• Store data in sorted order
• Put data in memory
• Cache previous query results
• Pre-compute (materialize) aggregates
33. Hybrid analytic
architecture
1. NoSQL source system
2. Access via Optiq SQL
3. Optiq rewrites queries to use materialized data
(e.g. aggregate tables)
4. Cached results are treated as “in-memory
tables”
5. Materializations offline dynamically as underlying
data changes, and go online as they are refreshed
35. 3 things (reprise)
1. SQL allows you to reorganize your data and
optimize your queries—while still using your
NoSQL database for what it does best.
2. Optiq is not a database. It lets you create
very powerful federated data architectures.
3. Access your data using Optiq adapters.
Write schemas in JSON or use schema SPI.
Connect via JDBC.