Years ago, when working at Amazon on shopping cart infrastructure and the precursor to DynamoDB, my co-founder and I realized that while distributed key-value stores were useful for a few use cases, we missed many of the benefits of relational databases: transactions, joins, and the power of the lingua franca of RDBMS’s: SQL. So we challenged ourselves to modernize the traditional relational database: to take a robust open source relational database and transform it into a distributed database.
This talk is about my team’s journey to create a more modern relational database. I’ll talk about the distributed systems problems we had to solve to scale out the open source Postgres database and achieve parallelism with a concomitant increase in performance. I’ll describe the architecture of the distributed query planner and how we extend traditional relational algebra operators to plan distributed queries and scale reads. I’ll also describe distributed deadlock detection, and how that enabled us to scale out transactions spanning multiple machines.
"Building financial-grade applications involve performing complex calculations over a wide range of data from across different domains, with challenges including stringent accuracy requirements, latency constraints, along with the need to share states across distributed services.
During this session, I will cover how, at Morgan Stanley, we built a real-time, microservices-based Liquidity Management platform using event streaming with the Kafka Streams API, to tackle high volumes of data and to perform calculations on cross-domain events spanning wide time windows over the past and the future.
I will demonstrate how we used Kafka Streams & state stores, along with patterns like Saga to achieve eventual data consistency and use state-enriched events to decouple services when transferring them through multiple business domains. I will cover mechanisms to ensure accuracy and transparency with idempotency at heart along with error detection and replay strategies.
Finally, I will look at how we used a high-performance in-memory cache to stage the results of cascaded KStream-based calculation engines, which powered our high-speed, ticking, stateful data visualisations.
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent times. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of the Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by David Anderson
Relational Database to Apache Spark (and sometimes back again) | Ed Thewlis
We've spent a lot of time using SQL Server. However, we started to struggle against it when we were building our SaaS Product.
This is an overview of where we started from, where we struggled, and some of our conclusions.
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration | EDB
Big Data. Data Science. AI. It's all big business.
Once upon a time we succeeded in these fields by selectively storing, processing and learning from just the right data. This, of course, requires you to know what "the right data" is. We know there are valuable insights in data, so why not store the lot? It's the 21st century equivalent of "there's gold in them thar hills!"
So having spent years stashing away terabytes of your data in PostgreSQL, you want to start learning from that data. Queries. More queries. More complex queries. Lots of real-time queries. Lots of concurrent users. It might be tempting at this point to give up on PostgreSQL and stash your data into a different solution, more suited to purpose. Don't. PostgreSQL can perform very well in HTAP environments and performs even better with a little help.
In this presentation we dive into the current state of the art with regards to PostgreSQL in HTAP environments and expose how hardware acceleration can help squeeze as much knowledge as possible out of your data.
"Building financial-grade applications involve performing complex calculations over a wide range of data from across different domains, with challenges including stringent accuracy requirements, latency constraints, along with the need to share states across distributed services.
During this session, I will cover how, at Morgan Stanley, we built a real-time, microservices based Liquidity Management platform using event streaming with Kafka Streams API, to tackle high volumes of data and to perform calculations on cross domain events, spanning wide time windows over the past and the future.
I will demonstrate how we used Kafka Streams & state stores, along with patterns like Saga to achieve eventual data consistency and use state-enriched events to decouple services when transferring them through multiple business domains. I will cover mechanisms to ensure accuracy and transparency with idempotency at heart along with error detection and replay strategies.
Finally, I will look at how we used a high-performant in-memory cache to stage the results of cascaded KStream based calculation engines, which powered our high-speed, ticking and stateful data visualisations."
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent time. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by
David Andreson
Relational Database to Apache Spark (and sometimes back again)Ed Thewlis
We've spent a lot of time using SQL Server. However, we started to struggle against it when we were building our SaaS Product.
This is an overview of where we started from, where we struggled, and some of our conclusions.
HTAP By Accident: Getting More From PostgreSQL Using Hardware AccelerationEDB
Big Data. Data Science. AI. It's all big business.
Once upon a time we succeeded in these fields by selectively storing, processing and learning from just the right data. This, of course, requires you to know what "the right data" is. We know there are valuable insights in data, so why not store the lot? It's the 21st century equivalent of "there's gold in them thar hills!"
So having spent years stashing away terabytes of your data in PostgreSQL, you want to start learning from that data. Queries. More queries. More complex queries. Lots of real-time queries. Lots of concurrent users. It might be tempting at this point to give up on PostgreSQL and stash your data into a different solution, more suited to purpose. Don't. PostgreSQL can perform very well in HTAP environments and performs even better with a little help.
In this presentation we dive into the current state of the art with regards to PostgreSQL in HTAP environments and expose how hardware acceleration can help squeeze as much knowledge as possible out of your data.
More Related Content
Similar to The Future of Distributed Databases is Relational
The Balanced Scorecard for Green Engineering Products in Ark Design is emulated with Push Funding ...addendum is for further analysis of Clock-based Financial problem
Why and how to leverage the simplicity and power of SQL on Flink | DataWorks Summit
SQL is the lingua franca of data processing, and everybody working with data knows SQL. Apache Flink provides SQL support for querying and processing batch and streaming data. Flink's SQL support powers large-scale production systems at Alibaba, Huawei, and Uber. Based on Flink SQL, these companies have built systems for their internal users as well as publicly offered services for paying customers.
In our talk, we will discuss why you should, and how you can (not being Alibaba or Uber), leverage the simplicity and power of SQL on Flink. We will start by exploring the use cases that Flink SQL was designed for and present real-world problems that it can solve. In particular, you will learn why unified batch and stream processing is important and what it means to run SQL queries on streams of data. After exploring why you should use Flink SQL, we will show how you can leverage its full potential.
The Flink community has recently been working on a service that integrates a query interface, (external) table catalogs, and result-serving functionality for static, appending, and updating result sets. We will discuss the design and feature set of this query service and how it can be used for exploratory batch and streaming queries, ETL pipelines, and live-updating query results that serve applications such as real-time dashboards. The talk concludes with a brief demo of a client running queries against the service.
Speaker
Timo Walther, Software Engineer, Data Artisans
Oracle GoldenGate for Big Data - LendingClub Implementation | Vengata Guruswamy
This slide deck covers the LendingClub use case for implementing real-time analytics using Oracle GoldenGate for Big Data. It covers architecture, implementation, and troubleshooting steps.
Who knew time travel could be possible!
While you can use the features of Delta Lake, what is actually happening underneath the covers? We will walk you through the concepts of ACID transactions, the Delta time machine, the transaction protocol, and how Delta brings reliability to data lakes. Organizations can finally standardize on a clean, centralized, versioned big data repository in their own cloud storage for analytics.
How do you learn a new language? How do you figure out how to do something new in a language you already know? Code samples! This is a collection of some of my favorite bits of code. Some of it is useful in every application you build. Some of it solves an unusual problem. Some of it is just plain cool. This includes BSO and MDX code.
We will also explore some design concepts such as when to use outline formulas vs. script-based calculations, dimensional design for different calculation requirements, and how to take advantage of calc manager's modular approach.
Here are a few of the snippets we will cover:
• Transposing data from one dimension to another
• Forecasting with S curves
• Rolling forecasts
• Using XREF to code overrides
• Useful date calcs
• Useful Planning expressions
This session is designed for developers with a beginning to intermediate skill level in BSO and/or MDX calculations.
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how..." | Flink Forward
SQL is the lingua franca of data processing, and everybody working with data knows SQL. Apache Flink provides SQL support for querying and processing batch and streaming data. Flink’s SQL support powers large-scale production systems at Alibaba, Huawei, and Uber. Based on Flink SQL, these companies have built systems for their internal users as well as publicly offered services for paying customers. In our talk, we will discuss why you should, and how you can (not being Alibaba or Uber), leverage the simplicity and power of SQL on Flink. We will start by exploring the use cases that Flink SQL was designed for and present real-world problems that it can solve. In particular, you will learn why unified batch and stream processing is important and what it means to run SQL queries on streams of data. After exploring why you should use Flink SQL, we will show how you can leverage its full potential. The Flink community has recently been working on a service that integrates a query interface, (external) table catalogs, and result-serving functionality for static, appending, and updating result sets. We will discuss the design and feature set of this query service and how it can be used for exploratory batch and streaming queries, ETL pipelines, and live-updating query results that serve applications such as real-time dashboards. The talk concludes with a brief demo of a client running queries against the service.
LendingClub RealTime BigData Platform with Oracle GoldenGate | Rajit Saha
LendingClub RealTime BigData Platform with the Oracle GoldenGate BigData Adapter. This was presented at Oracle OpenWorld 2017 in San Francisco.
Speakers:
Rajit Saha
Vengata Guruswami
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in... | Julian Hyde
What if Looker saw the queries you just executed and could predict your next query? Could it make those queries faster, by smarter caching, or aggregate navigation? Could it read your past SQL queries and help you write your LookML model? Those are some of the reasons to add relational algebra into Looker’s query engine, and why Looker hired Julian Hyde, author of Apache Calcite, to lead the effort. In this talk about the internals of Looker’s query engine, Julian Hyde will describe how the engine works, how Looker queries are described in Calcite’s relational algebra, and some features that it makes possible.
A talk by Julian Hyde at JOIN 2019 in San Francisco.
Think Outside the Close: Profitability & Costing Reconciliations in EPM Cloud... | Alithya
Let me tell you a story. The characters? Spreadsheets, shared drives, emails, and auditors. The conflict? A process in shambles—siloed data in disagreement. The conclusion? The implementation of EPM Cloud Account Reconciliation to save the day!
But here’s the secret: this isn’t just a Financial Close narrative. It’s a common tale across all domains—but can we give them the same fairytale ending? Using Profitability & Costing as a launching pad, explore what it looks like to move other financial processes towards being trustworthy and compliant, not just actuals data. Join us as we look at how EPM Cloud Account Reconciliation can move “outside the close.”
Hosted by Nick Boronski, Alithya and Andrew Laferla, Alithya at the ODTUG Learn from Home Series
Blogs on Document Splitting at www.veritysolutions.com.au
Document Splitting is a very powerful feature delivered by SAP ECC.
Prior to SAP ECC, if new fields were required in the General Ledger, SAP had to deliver these new fields in Special Purpose Ledger tables. Profit Centre Accounting in R3 was Special Purpose Ledger table 8*, and Joint Venture Accounting was ledger 4*. This essentially meant that data had to be copied from General Ledger table GLT0 to special ledger tables so these could be reported upon. However, technical glitches in code and incorrect usage of functionality caused imbalances between the main ledger GLT0 and the special purpose ledgers.
SAP customers who wanted to expand the functionality of General Ledger to cater to special business requirements (like reporting General Ledger with another fiscal year variant) had to create custom Special Purpose Ledger tables. For example, if a customer wanted to report by two fiscal year variants, they could report one variant using General Ledger and the other variant using Special Purpose Ledger.
All these disparate ledgers reported the same source information in different views. Customers had to execute several month-end jobs to ensure synchronisation of data across all these ledgers. Differences in balances and information between ledgers led to delays in month-end close and reporting.
With SAP ECC new GL, SAP Customers can add new fields (which SAP calls “scenarios”) into General Ledger. This allows customers to perform, for example, Profit Centre Accounting and Reporting within General Ledger.
With SAP ECC new GL, SAP Customers can add new ledgers (which SAP calls “parallel accounting”) into General Ledger. This allows customers to report, for example, the same General Ledger data in multiple fiscal year variants.
This replication of data happens in real-time. SAP customers no longer need to execute month end jobs to synchronise data between different ledgers.
ADAPTIVE FEATURES OR: HOW I LEARNED TO STOP WORRYING AND TROUBLESHOOT THE BOMB | Ludovico Caldara
Oracle Database 12c became available to the public back in 2013, but it took more than one year for Oracle customers to start upgrading their existing databases to this new release. Many customers, in 2016, are still not in the process of migrating to 12c, despite the premier support deadline for Oracle Database 11g having passed in January 2015.
I had the chance to spend the last two years at a customer who decided to embrace the new release and start the migration to 12c as soon as possible, in order to get the most out of the (many) new features that this release offers. When the very first production databases had been migrated to 12c, users began noticing quite soon that some queries started to take much more time to complete; some of them were actually several orders of magnitude slower than before. After a brief investigation, I understood that most of those queries had been slowed down by the new “Adaptive Features” that were introduced in 12c for the opposite reason: increasing performance. This is what this article is about.
A complete lowdown on Estimate To Complete (ETC). This presentation defines and explains project forecasting using ETC. It describes different scenarios and their associated formulas. It uses an example project to calculate ETC in different scenarios.
Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ... | Citus Data
A story about powering a 1.5 petabyte internal analytics application at Microsoft with 2816 cores and 18.7 TB of memory in the Citus cluster.
The internal RQV analytics dashboard at Microsoft helps the Windows team to assess the quality of upcoming Windows releases. The system tracks 20,000 diagnostic and quality metrics, digests data from 800 million Windows devices and currently supports over 6 million queries per day, with hundreds of concurrent users. The RQV analytics dashboard relies on Postgres—along with the Citus extension to Postgres to scale out horizontally—and is deployed on Microsoft Azure.
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi... | Citus Data
As a developer using PostgreSQL one of the most important tasks you have to deal with is modeling the database schema for your application. In order to achieve a solid design, it’s important to understand how the schema is then going to be used as well as the trade-offs it involves.
As Fred Brooks said: “Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.”
In this talk we're going to see practical normalisation examples and their benefits, and also review some anti-patterns and their typical PostgreSQL solutions, including Denormalization techniques thanks to advanced Data Types.
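As one concrete example of the denormalization-with-advanced-data-types the abstract alludes to, here is a small sketch (schema and names are illustrative, not from the talk):
-- Instead of an articles/article_tags/tags trio, store tags inline:
CREATE TABLE articles (
  id    bigserial PRIMARY KEY,
  title text NOT NULL,
  tags  text[] NOT NULL DEFAULT '{}'
);
-- A GIN index makes containment queries on the array fast:
CREATE INDEX ON articles USING GIN (tags);
SELECT title FROM articles WHERE tags @> ARRAY['postgres'];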
JSONB Tricks: Operators, Indexes, and When (Not) to Use It | PostgresOpen 201... | Citus Data
When do you use jsonb, and when don’t you? How do you make it fast? What operators are available, and what can they do? How will this change? These are all very good questions, but jsonb support in Postgres moves so fast that it’s hard to keep up.
In this talk, you will get details on these topics, complete with practical examples and real-world stories:
- When to use jsonb, what it’s good for, and when to not use it
- Operators and how to use them effectively
- Indexing, operator support for indexes, and the tradeoffs involved
- Postgres 12 improvements and new features
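To make the list above concrete, a small sketch (table and keys are illustrative):
CREATE TABLE events (
  id      bigserial PRIMARY KEY,
  payload jsonb NOT NULL
);
-- jsonb_path_ops GIN indexes support the containment operator @>:
CREATE INDEX ON events USING GIN (payload jsonb_path_ops);
SELECT payload->>'user_id' AS user_id     -- ->> extracts a field as text
FROM events
WHERE payload @> '{"type": "login"}';     -- indexed containment test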
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak... | Citus Data
One of the strongest features of any database is its extensibility, and PostgreSQL comes with a rich extension API. It allows you to define new functions, types, and operators. It even allows you to modify some of its core parts, like the planner, executor, or storage engine. You read that right: you can even change the behavior of the PostgreSQL planner. How cool is that?
Such freedom in extensibility has created a strong extension community around PostgreSQL and made way for a vast number of extensions such as pg_stat_statements, citus, postgresql-hll, and many more.
In this tutorial, we will look at how you can create your own PostgreSQL extension. We will start with more common tasks like defining new functions and types, but gradually explore lesser-known parts of PostgreSQL's extension API, like the C-level hooks that let you change the behavior of the planner, executor, and other core parts of PostgreSQL. We will see how to code, debug, compile, and test our extension. After that, we will also look into how to package and distribute our extension for other people to use.
To get the best benefit from the tutorial, C and SQL knowledge would be beneficial. Some knowledge of PostgreSQL internals would also be useful, but we will cover the necessary details, so it is not required.
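For orientation, the SQL half of a minimal extension looks roughly like this; the extension and function names are made up for illustration:
-- myext--1.0.sql, shipped next to a myext.control file:
CREATE FUNCTION add_one(integer) RETURNS integer
AS 'MODULE_PATHNAME', 'add_one'   -- resolved to the extension's shared library
LANGUAGE C STRICT;
-- once the files are installed:
CREATE EXTENSION myext;
SELECT add_one(41);  -- returns 42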
What's wrong with Postgres | PGConf EU 2019 | Craig Kerstiens | Citus Data
Postgres is a powerful database, it continues to improve in terms of performance, extensibility, and more broadly in features. However it is not perfect.
Here I'll cover a highly opinionated view of all the areas where Postgres falls flat, with some rough ideas on how we can make it better. Opinions are all informed by 10 years of interacting with customers running literally millions of databases for users.
When it all goes wrong | PGConf EU 2019 | Will Leinweber | Citus Data
You're woken up in the middle of the night to your phone. Your app is down and you're on call to fix it. Eventually you track it down to "something with the db," but what exactly is wrong? And of course, you're sure that nothing changed recently…
Knowing what to fix, and even where to start looking, is a skill that takes a long time to develop. Especially since Postgres normally works very well for months at a time, not letting you get practice!
In this talk, I'll share not only the more common failure cases and how to fix them, but also a general approach to efficiently figuring out what's wrong in the first place.
Amazing SQL your ORM can (or can't) do | PGConf EU 2019 | Louise Grandjonc | Citus Data
SQL can seem like an obscure, complex, but powerful language. Learning it can be intimidating. As developers, we can easily be tempted to use only the basic SQL provided by the ORM. But did you know that you can use window functions in some ORMs? The same goes for a lot of other fun SQL functionality.
In this talk we will explore some advanced SQL features that you might find useful. We will discover the wonderful world of joins (lateral, cross…), subqueries, grouping sets, window functions, common table expressions.
But most importantly, this talk is not only here to show you how great SQL is. It is here to show you how to use it in real life. Which of these features does your ORM support? And how can you use the ones it doesn't?
Whether you know SQL or not, and whether you are a developer or a DBA working with developers, you might learn a lot about SQL, ORMs, and application development using Postgres.
What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E... | Citus Data
Many people have asked us: “Why did Microsoft acquire Citus Data?” and “What do you plan to do with the Citus open source extension to Postgres?” Come join us to see the exciting work we are doing with Postgres and open source at Microsoft.
Deep Postgres Extensions in Rust | PGCon 2019 | Jeff Davis | Citus Data
Postgres relies heavily on an extension ecosystem, but that ecosystem is almost 100% dependent on C, which cuts out developers, libraries, and ideas from the world of Postgres. postgres-extension.rs changes that by supporting development of extensions in Rust. Rust is a memory-safe language that integrates nicely in any environment, has powerful libraries, a vibrant ecosystem, and a prolific developer community.
Rust is a unique language because it supports high-level features but all the magic happens at compile time, and the resulting code is not dependent on an intrusive or bulky runtime. That makes it ideal for integrating with Postgres, which has a lot of its own runtime, like memory contexts and signal handlers. postgres-extension.rs offers this integration, allowing the development of extensions in Rust, even if deeply integrated into the Postgres internals, and helping handle tricky issues like error handling. This is done through a collection of Rust function declarations, macros, and utility functions that allow Rust code to call into Postgres and safely handle resulting errors.
Why Postgres Why This Database Why Now | SF Bay Area Postgres Meetup | Claire... | Citus Data
I spent the early part of my career working on developer tools, operating systems, high-speed file systems, and scale-out storage. Not databases. Frankly, I always thought that databases were a bit boring. So almost 2 years into my new job at a Postgres company, I continue to be amazed at the enthusiasm of the PostgreSQL developer community and users. I mean, people’s eyes light up when you ask them why they love Postgres. Sure, a lot of us get animated when talking about our newest gadget, or Ronaldo’s phenomenal free-kick goal in the World Cup, or mint chip gelato from La Strega Nocciola—but most platform software simply doesn’t trigger this kind of passion. So why does Postgres? Why is this open source database having such a “moment”? Well, I’ve been trying to understand, looking at this “Postgres moment” from a few different angles. In this talk I’ll share what I’ve observed to be the top 10 business, technology, and community reasons so many of you have so much affection for PostgreSQL.
A story on Postgres index types | PostgresLondon 2019 | Louise Grandjonc | Citus Data
Want to know everything about indexes in Postgres? Here are the slides for a PostgreSQL talk, and if you want to know more, you can read the articles at www.louisemeta.com.
Why developers need marketing now more than ever | GlueCon 2019 | Claire Gior... | Citus Data
Many in today’s developer world look down on marketing. I mean, after all, the marketing team is usually “not technical.” And they’re not developers. It’s 2019 and while we try to promote inclusiveness of all types, inclusiveness doesn’t seem to apply to marketers. Why? Is that OK? Who does that hurt? I grew up in engineering and spent the first 15 years of my career as a developer or an engineering manager of some type. So now that I’m in marketing, it surprised me when one of my engineering colleagues blurted out “But it’s a technical conference!” when he learned one of my talks was accepted to a technical conference.
This keynote is about why developers really need marketing. About how good marketing managers can make it so visitors to your website don’t leave empty-handed, confused about what your technology actually does or why it matters. About how the ability to translate technology into what-users-actually-care-about can make your project be the one that takes off. About why Dormain Drewitz said at Monktoberfest: “I work in product marketing. My preferred programming language is English.” Finally, this talk explores how to be sensitive to the bias against marketing that pervades some of our teams—and how to instead embrace teamwork best practices employed by sailors, where everyone in the boat has an important role to play if you are to win the race.
The Art of PostgreSQL | PostgreSQL Ukraine | Dimitri Fontaine | Citus Data
PostgreSQL is the World’s Most Advanced Open Source Relational Database and by the end of this talk you will understand what that means for you, an application developer. What kind of problems PostgreSQL can solve for you, and how much you can rely on PostgreSQL in your daily activities, including unit-testing.
Optimizing your app by understanding your Postgres | RailsConf 2019 | Samay S... | Citus Data
I’m a Postgres person. Period. After talking to many Rails developers about their application performance, I realized many performance issues can be solved by understanding your database a bit better. So I thought I’d share the statistics Postgres captures for you and how you can use them to find slow queries, unused indexes, or tables which are not getting vacuumed correctly. This talk will cover Postgres tools and tips for the above, including pg_stat_statements, useful catalog tables, and recently added Postgres features such as CREATE STATISTICS.
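For instance, once pg_stat_statements is loaded (it must be listed in shared_preload_libraries), finding the slowest queries is a one-liner; a sketch (the timing column is total_exec_time on Postgres 13+, total_time on older versions):
CREATE EXTENSION pg_stat_statements;
-- Top 5 queries by cumulative execution time:
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;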
When it all goes wrong (with Postgres) | RailsConf 2019 | Will Leinweber | Citus Data
You're woken up in the middle of the night to your phone. Your app is down and you're on call to fix it. Eventually you track it down to "something with the db," but what exactly is wrong? And of course, you're sure that nothing changed recently…
Knowing what to fix, and even where to start looking, is a skill that takes a long time to develop. Especially since Postgres normally works very well for months at a time, not letting you get practice!
In this talk, I'll share not only the more common failure cases and how to fix them, but also a general approach to efficiently figuring out what's wrong in the first place.
The Art of PostgreSQL | PostgreSQL Ukraine Meetup | Dimitri Fontaine | Citus Data
PostgreSQL is the World’s Most Advanced Open Source Relational Database and by the end of this talk you will understand what that means for you, an application developer. What kind of problems PostgreSQL can solve for you, and how much you can rely on PostgreSQL in your daily activities, including unit-testing.
Using Postgres and Citus for Lightning Fast Analytics, also ft. Rollups | Liv... | Citus Data
Watch Sai Srirampur, Solutions Engineer at Citus Data (now part of the Microsoft family), give a live demo of how you can use Postgres and the Citus extension to Postgres to manage real-time analytics workloads.
Watch if you & your application need:
>> A relational database that scales for customer-facing analytics dashboards, with real-time data ingest and a large volume of queries
>> A way to scale out Postgres horizontally, to address the performance hiccups you’re experiencing as you run into the resource limits of single-node Postgres
>> A way to roll-up and pre-aggregate data to build fast data pipelines and enable sub-second response times.
>> A way to consolidate your database platforms, to avoid having separate stores for your transactional and analytics workloads
Using a 4-node Citus database cluster in the cloud, Sai will show you how Citus shards Postgres to give you lightning fast performance, at scale. Also featuring rollups.
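The rollup pattern mentioned here boils down to pre-aggregating raw events into a small table that dashboards query directly; a sketch with illustrative names:
CREATE TABLE page_views_rollup (
  minute     timestamptz NOT NULL,
  page_id    bigint      NOT NULL,
  view_count bigint      NOT NULL,
  PRIMARY KEY (minute, page_id)
);
-- Periodically recompute and upsert the per-minute aggregates
-- from a raw page_views(page_id, viewed_at) table:
INSERT INTO page_views_rollup (minute, page_id, view_count)
SELECT date_trunc('minute', viewed_at), page_id, count(*)
FROM page_views
GROUP BY 1, 2
ON CONFLICT (minute, page_id)
DO UPDATE SET view_count = EXCLUDED.view_count;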
How to write SQL queries | pgDay Paris 2019 | Dimitri Fontaine | Citus Data
Most of the time we see finished SQL queries, whether in code repositories, blog posts, or talk slides. This talk focuses on the process of writing an SQL query, from a problem statement expressed in English to code review and long-term maintenance of SQL code.
When it all Goes Wrong | Nordic PGDay 2019 | Will Leinweber | Citus Data
You're woken up in the middle of the night to your phone. Your app is down and you're on call to fix it. Eventually you track it down to "something with the db," but what exactly is wrong? And of course, you're sure that nothing changed recently…
Knowing what to fix, and even where to start looking, is a skill that takes a long time to develop. Especially since Postgres normally works very well for months at a time, not letting you get practice!
In this talk, I'll share not only the more common failure cases and how to fix them, but also a general approach to efficiently figuring out what's wrong in the first place.
Why PostgreSQL Why This Database Why Now | Nordic PGDay 2019 | Claire Giordano | Citus Data
I spent the early part of my career working on developer tools, operating systems, high-speed file systems, and scale-out storage. Not databases. Frankly, I always thought that databases were a bit boring. So one year into my new job at a Postgres company, I continue to be amazed at the enthusiasm of the PostgreSQL developer community and users. I mean, people’s eyes light up when you ask them why they love Postgres. Sure, a lot of us get animated when talking about our newest iPhone, or Ronaldo’s phenomenal free-kick goal in the World Cup, or mint chip gelato from La Strega Nociola—but most platform software simply doesn’t trigger this kind of passion. So why does Postgres? Why is this open source database having such a “moment”? Why now? Well, I’ve been trying to find out, looking at this “Postgres moment” from a few different angles. In this talk I’ll share what I’ve observed to be the top 10 business, technology, and community reasons so many of you have so much affection for PostgreSQL.
Adjusting primitives for graph: SHORT REPORT / NOTES | Subhajit Sahu
Graph algorithms, like PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is ...
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. vertices with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Opendatabay - Open Data Marketplace.pptx | Opendatabay
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
1. Sumedh Pathak, Co-Founder & VP Engineering, Citus Data
QCon London 2018
The Future of Distributed Databases is Relational
@moss_toss | @citusdata
2. About Me
• Co-Founder & VP Engineering at Citus Data
• Amazon Shopping Cart (former)
• Amazon Supply Chain & Order Fulfillment (former)
• Stanford Computer Science
[Photo] Citus Data co-founders, left-to-right: Ozgun Erdogan, Sumedh Pathak, & Umur Cubukcu. Photo cred: Willy Johnson, Monterey CA, Sep 2017
Sumedh Pathak | Citus Data | QCon London 2018
6. An RDBMS is a general-purpose data platform
Fast writes
Real-time & bulk
High throughput
High concurrency
Data consistency
Query optimizers
Sumedh Pathak | Citus Data | QCon London 2018
7. [Chart] Startups Are Choosing Postgres: % of database job posts mentioning each database (PostgreSQL, MySQL, MongoDB, SQL Server, Oracle), across 20K+ job posts. Source: Hacker News, https://news.ycombinator.com
Sumedh Pathak | Citus Data | QCon London 2018
9. but RDBMS’s don’t scale
RDBMS’s are hard to scale
Sumedh Pathak | Citus Data | QCon London 2018
10. What exactly needs to Scale?
1. Tables (Data): Partitioning, Co-location, Reference Tables
2. SQL (Reads): How do we express and optimize distributed SQL
3. Transactions (Writes): Cross-shard updates/deletes, Global Atomic Transactions
Sumedh Pathak | Citus Data | QCon London 2018
18. The key to scaling tables...
- Use relational databases as a building block
- Understand semantics of the application—to be smart about partitioning
- Multi-tenant applications
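In Citus, for instance, the mechanics of this boil down to a couple of SQL calls; a minimal sketch, assuming the Citus extension and the orders/nation tables used later in this talk:
CREATE EXTENSION citus;
-- Shard the large table across the cluster by a distribution column:
SELECT create_distributed_table('orders', 'nation');
-- Keep small lookup tables as reference tables, replicated to every node:
SELECT create_reference_table('nation');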
20. SQL ↔ Relational Algebra
FROM table    ↔  R
SELECT x      ↔  Project_x(R) → R'
WHERE f(x)    ↔  Filter_f(x)(R) → R'
… JOIN …      ↔  R × S → R'
Sumedh Pathak | Citus Data | QCon London 2018
25. SELECT sum(price) FROM orders, nation
WHERE orders.nation = nation.name AND
orders.date >= '2012-01-01' AND
nation.region = 'Asia';
Sumedh Pathak | Citus Data | QCon London 2018
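For reference, a plausible schema behind this query; a sketch only, with column types inferred from the predicates:
CREATE TABLE nation (
  name   text PRIMARY KEY,
  region text
);
CREATE TABLE orders (
  id     bigserial,
  nation text,
  date   date,
  price  numeric
);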
30. [Query plan diagram] Filters & projections are done before the join; joins & filters are pushed below the Collect operator so the fragments run in parallel across all nodes; the aggregate is parallelized (per-shard partial aggregates, merged at the top).
Sumedh Pathak | Citus Data | QCon London 2018
31. The distributed plan becomes one fragment query per shard pair, plus a final merge over the concatenated results:
SELECT sum(intermediate_col)
FROM <concatenated results>;

SELECT sum(price)
FROM orders_1 JOIN nation_1
  ON (orders_1.nation = nation_1.name)
WHERE orders_1.date >= '2017-01-01'
  AND nation_1.region = 'Asia';

SELECT sum(price)
FROM orders_2 JOIN nation_2
  ON (orders_2.nation = nation_2.name)
WHERE orders_2.date >= '2017-01-01'
  AND nation_2.region = 'Asia';
Sumedh Pathak | Citus Data | QCon London 2018
32. Executing Distributed SQL
[Diagram] The coordinator holds the distributed tables (orders, nation); each worker SQL database holds shards: one holds orders_1 and nation_1, the other orders_2 and nation_1 (the nation shard is present on both workers).
Coordinator: SELECT sum(price) FROM <results>;
Worker 1:
SELECT sum(price)
FROM orders_1 o JOIN nation_1 n ON (o.nation = n.name)
WHERE o.date >= '2017-01-01' AND n.region = 'Asia';
Worker 2:
SELECT sum(price)
FROM orders_2 o JOIN nation_1 n ON (o.nation = n.name)
WHERE o.date >= '2017-01-01' AND n.region = 'Asia';
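One way to watch this happen from the outside is plain EXPLAIN; a sketch (the exact output shape varies across Citus versions):
EXPLAIN (VERBOSE, COSTS OFF)
SELECT sum(price)
FROM orders JOIN nation ON (orders.nation = nation.name)
WHERE orders.date >= '2012-01-01'
  AND nation.region = 'Asia';
-- The plan lists one fragment query per shard plus the final
-- aggregation step that runs on the coordinator.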
33. The key to scaling SQL...
- New relational algebra operators for distributed processing
- Relational algebra properties to optimize the tree: Commutativity, Associativity, & Distributivity
- Map / Reduce operators
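Distributivity is what makes the aggregate pushdown on the previous slides legal; a sketch of the rewrite rules involved (fragment names are illustrative):
-- sum() distributes over shards:
--   sum(price) over orders  =  sum of ( sum(price) over each orders_i )
-- avg() does not distribute directly, so it is rewritten into sum and count:
--   avg(price)  =  sum_i(sum(price) over orders_i) / sum_i(count(price) over orders_i)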
35. Money Transfer, as an example
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'ALICE';
UPDATE accounts SET balance = balance + 100 WHERE id = 'BOB';
COMMIT;
Sumedh Pathak | Citus Data | QCon London 2018
36. [Diagram] Accounts are sharded as A1–A4 across two workers behind a Coordinator. The client's transaction arrives at the coordinator:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'ALICE';
UPDATE accounts SET balance = balance + 100 WHERE id = 'BOB';
COMMIT;
Sumedh Pathak | Citus Data | QCon London 2018
37. [Diagram] The coordinator relays the first statement, opening a transaction on the worker holding ALICE's row:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'ALICE';
Sumedh Pathak | Citus Data | QCon London 2018
38. [Diagram] If the transaction only touched ALICE's shard, COMMIT could simply be relayed to that one worker:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'ALICE';
COMMIT;
Sumedh Pathak | Citus Data | QCon London 2018
39. [Diagram] But this transaction is cross-shard: ALICE's and BOB's rows live on different workers, each with its own open local transaction:
Worker with ALICE's row:  BEGIN; UPDATE accounts SET balance = balance - 100 WHERE id = 'ALICE';
Worker with BOB's row:    BEGIN; UPDATE accounts SET balance = balance + 100 WHERE id = 'BOB';
Sumedh Pathak | Citus Data | QCon London 2018
40. [Diagram] Committing one worker at a time is unsafe: here BOB's worker has committed while ALICE's transaction is still open, so a failure at this point would violate atomicity:
Worker with BOB's row:    BEGIN; UPDATE accounts SET balance = balance + 100 WHERE id = 'BOB'; COMMIT;
Worker with ALICE's row:  BEGIN; UPDATE accounts SET balance = balance - 100 WHERE id = 'ALICE';
Sumedh Pathak | Citus Data | QCon London 2018
41. [Diagram] Two-phase commit instead: the coordinator first asks both workers to prepare:
PREPARE TRANSACTION 'citus_...98';   -- on each worker
Sumedh Pathak | Citus Data | QCon London 2018
42. What happens during PREPARE?
The state of the transaction is stored on a durable store, & its locks are maintained.
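Vanilla PostgreSQL already exposes these two-phase-commit building blocks as SQL; a minimal sketch of what a coordinator drives on one worker (the transaction name is illustrative):
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'ALICE';
PREPARE TRANSACTION 'citus_...98';   -- durably records the transaction and keeps its locks
-- later, once every participant has prepared, the decision is replayed:
COMMIT PREPARED 'citus_...98';       -- or ROLLBACK PREPARED 'citus_...98' on abort
-- in-doubt transactions remain visible in the pg_prepared_xacts view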
43. [Diagram] The abort path: if any participant fails to prepare, the coordinator rolls the already-prepared transactions back:
One worker:    PREPARE TRANSACTION 'citus_...98';
Other worker:  ROLLBACK PREPARED 'citus_...98';
Sumedh Pathak | Citus Data | QCon London 2018
44. [Diagram] The happy path resumes: both workers have now prepared the transaction:
PREPARE TRANSACTION 'citus_...98';   -- on each worker
Sumedh Pathak | Citus Data | QCon London 2018
45–47. [Diagram] Once every participant has prepared, the coordinator commits the prepared transaction on both workers:
COMMIT PREPARED 'citus_...98';   -- on each worker
Sumedh Pathak | Citus Data | QCon London 2018
49–52. Example: two concurrent sessions interleave into a deadlock
SESSION 1: BEGIN;
SESSION 1: UPDATE accounts SET balance = balance - 100 WHERE id = 'ALICE';
           (LOCK on ROW with id 'ALICE')
SESSION 2: BEGIN;
SESSION 2: UPDATE accounts SET balance = balance + 100 WHERE id = 'BOB';
           (LOCK on ROW with id 'BOB')
SESSION 1: UPDATE accounts SET balance = balance + 100 WHERE id = 'BOB';
           (blocks, waiting for the lock on 'BOB')
SESSION 2: UPDATE accounts SET balance = balance - 100 WHERE id = 'ALICE';
           (blocks, waiting for the lock on 'ALICE')
Sumedh Pathak | Citus Data | QCon London 2018
53. How do Relational DB’s solve this?
[Diagram] Build a directed wait-for graph:
- Each node is a session/transaction
- An edge represents a wait on a lock
- Here, S1 is waiting on 'BOB' (held by S2) and S2 is waiting on 'ALICE' (held by S1)
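On a single PostgreSQL node, the raw material for this graph is already available; a sketch of listing local wait-for edges with the stock pg_blocking_pids() function (this is generic Postgres, not the Citus-internal mechanism):
-- one row per wait-for edge: waiter -> blocker
SELECT waiter.pid AS waiter,
       unnest(pg_blocking_pids(waiter.pid)) AS blocker
FROM pg_stat_activity AS waiter
WHERE cardinality(pg_blocking_pids(waiter.pid)) > 0;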
54. [Diagram] Each node sees only its local wait-for graph: on one node, S2 waits on S1; on the other, S1 waits on S2. Neither local graph contains a cycle.
Sumedh Pathak | Citus Data | QCon London 2018
55. [Diagram] The Distributed Deadlock Detector gathers the local wait-for edges (S2 → S1 from one node, S1 → S2 from the other) into a single global graph, where the cycle S1 ⇄ S2, the distributed deadlock, becomes visible.
Sumedh Pathak | Citus Data | QCon London 2018
56. keys to scaling transactions
- 2PC to ensure atomic transactions across nodes
- Deadlock Detection—to scale complex & concurrent transaction workloads
- MVCC
- Failure Handling
57. It’s 2018. Distributed can be Relational
- Scale tables—via sharding
- Scale SQL—via distributed relational algebra
- Scale transactions—via 2PC & Deadlock Detection
Sumedh Pathak | Citus Data | QCon London 2018
Now, how do we implement all of this?