CockroachDB is a distributed SQL database that aims for scalability, strong consistency, and survivability. It implements a distributed key-value store and translates SQL queries into key-value operations. Data is partitioned into ranges that are replicated across multiple nodes for fault tolerance. Transactions are executed using a two-phase commit process to maintain strong consistency across the distributed database.
CockroachDB
1. Cockroach DB
brief overview
“We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions” — Google Spanner
2. Agenda
× Databases short history. CockroachDB (roach db) – 3rd category (not generation) database
× Architecture. Components and their responsibilities (brief overview)
× SQL capabilities
× RocksDB overview
× Raft algorithm overview
36. Databases characteristics brief overview
youtube.com/watch?v=GtQueJe6xRQ
transactions – MVCC in PostgreSQL
× In PostgreSQL each row has xmin & xmax properties, which hold transaction ids: xmin is set when data is inserted and xmax when data is deleted -> a) data is read-only b) for a delete, only the xmax “metadata property” is updated c) an update is done as a delete + insert (see the sketch below)
× In PostgreSQL, as in CockroachDB & RocksDB, data is read-only; VACUUM (PostgreSQL) cleans old versions of the records, as compaction does in RocksDB
× There can exist 2 types of snapshots: a) query level (read committed) b) transaction level (snapshot isolation)
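A quick way to see this on a PostgreSQL instance (a minimal illustrative sketch; xmin and xmax are PostgreSQL system columns, the table t is made up):

-- Create a throwaway table and insert one row.
CREATE TABLE t (id INT PRIMARY KEY, v TEXT);
INSERT INTO t VALUES (1, 'a');
-- xmin = id of the inserting transaction; xmax = 0 while the row version is live.
SELECT xmin, xmax, id, v FROM t;
-- An UPDATE is internally a delete + insert: the old version gets xmax set,
-- and a new version with a fresh xmin becomes the visible row.
UPDATE t SET v = 'b' WHERE id = 1;
SELECT xmin, xmax, id, v FROM t;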
43. Databases characteristics brief overview
transactions
× CockroachDB checks the running transaction's record to see if it's been ABORTED; if it has, it restarts the transaction
× If the transaction passes these checks, it's moved to COMMITTED and the transaction's success is reported to the client
× Enables transactions that can span your entire cluster (including cross-range and cross-table transactions); it ensures correctness through a two-phase commit process
What is 2-phase commit?
A special object, known as a coordinator, is required in a distributed transaction. As its name implies, the coordinator arranges activities and synchronization between distributed servers.
Phase 1 – Each server that needs to commit writes its data records to the log. If successful, the server replies with an OK message.
Phase 2 – This phase begins after all participants respond OK. The coordinator then sends a signal to each server with commit instructions. After committing, each server writes the commit as part of its log record for reference and sends the coordinator a message that its commit has been successfully implemented. If a server fails, the coordinator sends instructions to all servers to roll back the transaction. After the servers roll back, each sends feedback that this has been completed.
cockroachlabs.com
51. What is roach db?
× “CockroachDB is a distributed SQL database. The primary design goals are scalability, strong consistency and survivability (hence the name). CockroachDB aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. CockroachDB nodes are symmetric; a design goal is homogeneous deployment (one binary) with minimal configuration and no required external dependencies …. CockroachDB implements a single, monolithic sorted map from key to value where both keys and values are byte strings” (github.com/cockroachdb)
× Inspired by the Google Spanner database
× Started by engineers who left Google (one of the sponsors is Google)
× Cloud-native database
× Go language (C++ storage – RocksDB)
× PostgreSQL driver -> SQL database
× Read & Delete scenarios
× High availability
× No stale reads when a failure occurs
52. roach db – layers & their roles
cockroachlabs.com
× SQL - Translate client SQL queries to KV operations. When developers send requests to the
cluster, they arrive as SQL statements, but data is ultimately written to and read from the storage
layer as key-value (KV) pairs. To handle this, the SQL layer converts SQL statements into a plan of
KV operations, which it passes along to the Transaction Layer
× Transactional - Allow atomic changes to multiple KV entries. The only transaction isolation level is SERIALIZABLE
× Distribution - Present replicated KV ranges as a single entity.
× Replication - Consistently and synchronously replicate KV ranges across many nodes. This layer
also enables consistent reads via leases.
× Storage - Write and read KV data on disk.
54. roach db – terms & concepts
cockroachlabs.com
× Range = A set of sorted, contiguous data from your cluster
× Replica = Copies of your ranges, which are stored on at least 3 nodes to ensure survivability
× Replication = Replication involves creating and distributing copies of data, as well as ensuring
copies remain consistent. There are 2 types of replication: synchronous and asynchronous.
CockroachDB adopted the synchronous replication mechanism
× Range Lease = For each range, one of the replicas holds the "range lease". This replica, referred to
as the "leaseholder", is the one that receives and coordinates all read and write requests for the
range
× Consensus = When a range receives a write, a quorum of nodes containing replicas of the range
acknowledge the write. This means your data is safely stored and a majority of nodes agree on the
database's current state, even if some of the nodes are offline
× Multi-Active Availability = In CockroachDB, the consensus-based notion of high availability lets each
node in the cluster handle reads and writes for a subset of the stored data (on a per-range basis).
This is in contrast to active-passive replication, in which the active node receives 100% of request
traffic, as well as active-active replication, in which all nodes accept requests but typically can't
guarantee that reads are both up-to-date and fast
55. roach db – short description
cockroachlabs.com
× CockroachDB's nodes all behave symmetrically
× CockroachDB nodes convert SQL RPCs into operations that work with the distributed key-value store. At the highest level, CockroachDB converts clients' SQL statements into key-value (KV) data, which gets distributed among nodes. If a node cannot serve a request directly, it finds the node that can handle it and communicates with it, so users don't need to know about the locality of data
× It algorithmically distributes data across nodes by dividing it into 64MiB chunks (these chunks are known as ranges). Each range gets replicated synchronously to at least 3 nodes
× Cockroach keys are arbitrary byte arrays. Keys come in two flavors: system keys and
table data keys. System keys are used by Cockroach for internal data structures and
metadata
56. roach db characteristics - transactions
cockroachlabs.com
× Supports bundling multiple SQL statements into a single all-or-nothing transaction. Each
transaction guarantees ACID semantics spanning arbitrary tables and rows, even when data is
distributed
× Efficiently supports the strongest ANSI transaction isolation level: SERIALIZABLE. All other ANSI
transaction isolation levels (e.g., READ UNCOMMITTED, READ COMMITTED, and REPEATABLE
READ) are automatically upgraded to SERIALIZABLE
× Transactions are executed in two phases:
- Start the transaction by selecting a range where the first write occurs and writing a new transaction record to a reserved area of that range with state "PENDING" (it ends as either COMMITTED or ABORTED)
- Commit the transaction by updating its transaction record
× SQL-86 was just ACD (isolation = SERIALIZABLE); with SQL-92, ACID was introduced along with a lot of anomalies/phenomena (dirty read, non-repeatable or fuzzy read, phantom reads, write skew, read skew, lost update)
× CockroachDB’s default isolation level is called Serializable (Serializable Snapshot for versions prior to 2.1), and it is an optimistic, multi-version, timestamp-ordered concurrency control (see the client-side retry sketch below)
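Because concurrency control is optimistic, clients should be prepared to retry transactions that get restarted. A minimal sketch of the client-side retry protocol described by Cockroach Labs (the accounts table and values are illustrative):

BEGIN;
SAVEPOINT cockroach_restart;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- On a retryable error (SQLSTATE 40001) the client issues
--   ROLLBACK TO SAVEPOINT cockroach_restart;
-- and re-runs the statements; otherwise it completes with:
RELEASE SAVEPOINT cockroach_restart;
COMMIT;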
59. roach db – haproxy
cockroachlabs.com & haproxy.org
× HAProxy is one of the most popular open-source TCP load balancers, and
CockroachDB includes a built-in command for generating a configuration file that is
preset to work with your running cluster
× HAProxy is a free, very fast and reliable solution offering high availability, load
balancing, and proxying for TCP and HTTP-based applications. It is particularly suited
for very high traffic web sites and powers quite a number of the world's most visited
ones. Over the years it has become the de-facto standard opensource load balancer, is
now shipped with most mainstream Linux distributions, and is often deployed by
default in cloud platforms
× cockroach gen haproxy --certs-dir=<path to certs directory>
--host=<address of any node in the cluster> --port=26257
× listen psql
bind :26257
balance roundrobin
server cockroach1 <node1 address>:26257
server cockroach2 <node2 address>:26257
66. roach db - Data mapping between the SQL model and KV
× Every SQL table has a primary key in CockroachDB. If a table is created without one, an
implicit primary key is provided automatically. The table identifier, followed by the value of
the primary key for each row, are encoded as the prefix of a key in the underlying KV store.
× Each remaining column or column family in the table is then encoded as a value in the
underlying KV store, and the column/family identifier is appended as suffix to the KV key
× Example: a table customers is created in a database mydb with a primary key column name and normal columns address and URL; the KV pairs used to store the table are sketched below
× Each database/table/column name is mapped to an automatically generated identifier, so as to simplify renames
cockroachlabs.com
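A rough sketch of that example (an approximation of the encoding described in the CockroachDB design doc, not the exact on-disk format; the row values are invented):

CREATE DATABASE mydb;
CREATE TABLE mydb.customers (
  name STRING PRIMARY KEY,
  address STRING,
  url STRING
);
INSERT INTO mydb.customers VALUES ('Apple', '1 Infinite Loop', 'http://apple.com');
-- Approximate keys produced in the underlying KV store:
--   /<table-id>/'Apple'/address -> '1 Infinite Loop'
--   /<table-id>/'Apple'/url     -> 'http://apple.com'
-- i.e. the table identifier plus the primary key value form the key prefix,
-- and each remaining column (or column family) becomes one KV entry.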
67. roach db - Data mapping between the SQL model and KV
× SHOW EXPERIMENTAL-RANGES FROM TABLE alarms
× CREATE TABLE IF NOT EXISTS alarm (cen_id STRING(30), subscription_id STRING(30),
alarm_emission_date TIMESTAMP, alarm_id INT4, alarm_status STRING(30),
trigger_id INT4, trigger_scope STRING(15), trigger_scope_value STRING(50), .....,
PRIMARY KEY (cen_id, alarm_emission_date, subscription_id, alarm_id),
FAMILY read_only_columns
(cen_id, subscription_id, alarm_emission_date, alarm_id, trigger_id, trigger_scope, ....),
FAMILY updatable_columns (alarm_status));
68. roach db sql characteristics - pagination
× Example 1
SELECT id, name FROM accounts LIMIT 5
× Example 2
SELECT id, name FROM accounts LIMIT 5 OFFSET 5
cockroachlabs.com
69. roach db sql characteristics – ordering the results
× The ORDER BY clause controls the order in which rows are returned or processed
× The ORDER BY PRIMARY KEY notation guarantees that the results are presented in primary key
order
× The ORDER BY clause is only effective at the top-level statement in most of the cases
- SELECT * FROM a, b ORDER BY a.x; -- valid, effective
- SELECT * FROM (SELECT * FROM a ORDER BY a.x), b; -- ignored, ineffective
Exceptions from the rule:
- SELECT * FROM (SELECT * FROM a ORDER BY a.x) WITH ORDINALITY
ensures that the rows are numbered in the order of column a.x
Ex: SELECT * FROM (VALUES ('a'), ('b'), ('c')) WITH ORDINALITY
- SELECT * FROM a, ((SELECT * FROM b ORDER BY b.x) LIMIT 1)
ensures that only the first row of b in the order of column b.x is used in the cross join
- INSERT INTO a (SELECT * FROM b ORDER BY b.x) LIMIT 1
ensures that only the first row of b in the order of column b.x is inserted into a
- SELECT ARRAY(SELECT a.x FROM a ORDER BY a.x);
ensures that the array is constructed using the values of a.x in sorted order
cockroachlabs.com
70. roach db sql characteristics – online schema changes
× CockroachDB's online schema changes provide a simple way to update a table schema without imposing any negative consequences on an application - including downtime. The schema change engine is a built-in feature requiring no additional tools, resources, or ad hoc sequencing of operations
Benefits
- Changes to your table schema happen while the database is running
- The schema change runs as a background job without holding locks on the underlying table data
- Your application's queries can run normally, with no effect on read/write latency. The schema is cached for performance
- Your data is kept in a safe, consistent state throughout the entire schema change process
× Recommended: do schema changes outside transactions where possible (see the example below)
cockroachlabs.com
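A minimal illustrative pair of statements (table and column names are made up); both run online as background jobs while the table keeps serving reads and writes:

ALTER TABLE orders ADD COLUMN delivery_notes STRING;
CREATE INDEX orders_by_customer ON orders (customer_id);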
71. roach db sql characteristics – truncate tables
× The TRUNCATE statement deletes all rows from the specified tables
× TRUNCATE removes all rows from a table by dropping the table and recreating a new table with the same name. For large tables, this is much more performant than deleting each of the rows. However, for smaller tables, it's more performant to use a DELETE statement without a WHERE clause
× TRUNCATE is a schema change, and as such is not transactional
× CASCADE does not list the dependent tables it truncates, so it should be used cautiously. Truncate dependent tables explicitly (TRUNCATE customers, orders) – see the example below
× RESTRICT does not truncate the table if any other tables have Foreign Key dependencies on it
cockroachlabs.com
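For illustration (table names are made up, assuming orders has a foreign key to customers):

TRUNCATE customers CASCADE;   -- also truncates dependent tables such as orders, without listing them
TRUNCATE customers RESTRICT;  -- refuses to truncate while orders still references customers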
72. roach db sql characteristics – split ranges
× SPLIT AT statement forces a key-value layer range split at the specified row in a table or index
× The key-value layer of CockroachDB is broken into sections of contiguous key-space known as
ranges. By default, CockroachDB attempts to keep ranges below a size of 64MiB.
× Why you may want to perform manual splits ?
- When a table only consists of a single range, all writes and reads to the table will be served by
that range's leaseholder. If a table only holds a small amount of data but is serving a large amount
of traffic
- When a table is created, it will only consist of a single range & if you know that a new table will
immediately receive significant write traffic
× Example 1:
ALTER TABLE kv SPLIT AT VALUES (10), (20), (30)
× Example 2:
CREATE TABLE kv (k1 INT, k2 INT, v INT, w INT, PRIMARY KEY (k1, k2))
ALTER TABLE kv SPLIT AT VALUES (5,1), (5,2), (5,3)
SHOW EXPERIMENTAL-RANGES FROM TABLE kv
× Example 3:
CREATE INDEX secondary ON kv (v)
SHOW EXPERIMENTAL-RANGES FROM INDEX kv@secondary
ALTER INDEX kv@secondary SPLIT AT (SELECT v FROM kv LIMIT 3)
cockroachlabs.com
73. roach db sql characteristics - joins
cockroachlabs.com
× Supports all kinds of joins
× Joins over interleaved tables are usually (but not always) processed more effectively than over non-interleaved tables
× When no indexes can be used to satisfy a join, CockroachDB may load into memory all the rows of one of the join operands that satisfy the condition before starting to return result rows. This may cause joins to fail if the join condition or other WHERE clauses are insufficiently selective
× Outer joins are generally processed less efficiently than inner joins. Prefer using inner joins whenever possible. Full outer joins are the least optimized
× Use EXPLAIN over queries containing joins to verify that indexes are used (see the sketch below)
× My rules: avoid cross joins & theta joins for any database if possible & avoid joins as much as possible in BigData by denormalizing data. In general, data is written once & read many times in BigData; in some cases new data or a newer version of the data is appended/added to an existing “entity”. If there are still problems, maybe conflict-free replicated data types (CRDTs) or AVRO files + schema can help. Try to avoid read-before-write
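A quick illustrative check (table and column names are made up):

EXPLAIN SELECT o.id, c.name
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.id;
-- Inspect the resulting plan to confirm the join uses an index (e.g. a lookup join)
-- instead of scanning and buffering a whole join operand in memory.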
74. roach db sql characteristics - sequences
cockroachlabs.com
× CREATE SEQUENCE seq1 MINVALUE 1 MAXVALUE 9223372036854775807 INCREMENT 1 START 1
× CREATE SEQUENCE seq2 MINVALUE -9223372036854775808 MAXVALUE -1 INCREMENT -2 START -1
× CREATE TABLE table_name (id INT PRIMARY KEY DEFAULT nextval('seqname'), …. )
× SELECT nextval('seqname')
× SELECT * FROM seqname / SELECT currval('seqname')
× Sequences are slow; if you have many records to insert it is preferable to use value = SELECT nextval('seqname') & SELECT setval('seqname', value + X) -> to negotiate the parallelism (see the sketch below)
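A minimal sketch of that block-reservation idea (sequence name and block size are illustrative):

CREATE SEQUENCE order_ids;
-- Reserve a block of 1000 ids in one round trip: advance the sequence by 1000
-- and hand out the returned range locally, instead of calling nextval() per row.
SELECT setval('order_ids', nextval('order_ids') + 999);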
75. roach db sql characteristics – parallel statement execution
cockroachlabs.com
× CONVERSATIONAL API
BEGIN;
UPDATE users SET lastname = 'Smith' WHERE id = 1;
UPDATE favoritemovies SET movies = 'The Matrix' WHERE userid = 1;
UPDATE favoritesongs SET songs = 'All this time' WHERE userid = 1;
COMMIT;
× The statements are executed in parallel until roach db encounters a barrier statement
BEGIN;
UPDATE users SET lastname = 'Smith' WHERE id = 1 RETURNING NOTHING;
UPDATE favoritemovies SET movies = 'The Matrix' WHERE userid = 1 RETURNING NOTHING;
UPDATE favoritesongs SET songs = 'All this time' WHERE userid = 1 RETURNING NOTHING;
COMMIT;
78. roach db sql characteristics - json columns
cockroachlabs.com
× Example:
CREATE TABLE users (
profile_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
last_updated TIMESTAMP DEFAULT now(),
user_profile JSONB);
SHOW COLUMNS FROM users; returns the type for user_profile as JSON (JSON is an alias of JSONB)
× If duplicate keys are included in the input, only the last value is kept
× Recommended to keep values under 1 MB to ensure performance
× A standard index cannot be created on a JSONB column; you must use an inverted index.
× The primary key, foreign key, and unique constraints cannot be used on JSONB values
79–83. roach db sql characteristics - json columns
(image-only slides; source: youtube.com/watch?v=v2QK5VgLx6E)
84. roach db sql characteristics – inverted indexes
cockroachlabs.com
× Inverted indexes improve your database's performance by helping SQL locate the schemaless data in a
JSONB column. JSONB cannot be queried without a full table scan, since it does not adhere to ordinary
value prefix comparison operators
× Inverted indexes filter on components of tokenizable data. JSONB data type is built on two structures
that can be tokenized: objects & arrays
× Example:
{ "firstName": "John", "lastName": "Smith", "age": 25,
"address": { "state": "NY", "postalCode": "10021" }, "cars": [ "Subaru", "Honda" ] }
inverted index for this object
"firstName": "John" "lastName": "Smith" "age": 25 "address": "state": "NY"
"address": "postalCode": "10021" "cars" : "Subaru" "cars" : "Honda"
× Creation
- At the same time as the table with the INVERTED INDEX clause of CREATE TABLE
- For existing tables with CREATE INVERTED INDEX
- CREATE INDEX <optional name> ON <table> USING GIN (<column>)
× Inverted indexes only support equality comparisons using the = operator
× If >= or <= comparisons are required, you can create a computed column from your JSON payload and then create a regular index on that column
85. roach db sql characteristics – inverted indexes
cockroachlabs.com
× Example 1:
- CREATE TABLE test (id INT, data JSONB, foo INT AS ((data->>'foo')::INT) STORED)
- CREATE INDEX test_idx ON test (foo)
- SELECT * FROM test WHERE foo > 3
× Example 2:
- CREATE TABLE users (profile_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
last_updated TIMESTAMP DEFAULT now(), user_profile JSONB,
INVERTED INDEX user_details (user_profile))
- INSERT INTO users (user_profile) VALUES
('{"first_name": "Lola", "last_name": "Dog", "location": "NYC", "online": true, "friends": 547}'),
('{"first_name": "Ernie", "status": "Looking for treats", "location": "Brooklyn"}')
- SELECT * FROM users WHERE user_profile @> '{"location": "NYC"}'
× Indexes greatly improve the speed of queries, but slightly slow down writes (because new values have to be copied and sorted). The first index you create has the largest impact, but additional indexes only introduce marginal overhead.
86. roach db sql characteristics - computed columns
cockroachlabs.com
× Example:
CREATE TABLE names (id INT PRIMARY KEY, firstname STRING, lastname STRING,
fullname STRING AS (CONCAT(firstname, ' ', lastname)) STORED);
CREATE TABLE userlocations (
locality STRING AS (CASE
WHEN country IN ('ca', 'mx', 'us') THEN 'northamerica'
WHEN country IN ('au', 'nz') THEN 'australia' END) STORED,
id SERIAL, name STRING, country STRING,
PRIMARY KEY (locality, id))
PARTITION BY LIST (locality) (PARTITION northamerica VALUES IN ('northamerica'), PARTITION australia VALUES IN ('australia'));
× Cannot be added after a table is created
× Cannot be used to generate other computed columns
× Cannot be a foreign key reference
× Behave like any other column, with the exception that they cannot be written directly
× Are mutually exclusive with DEFAULT
87. roach db sql characteristics - foreign keys
cockroachlabs.com
× For example, if you create a foreign key on the orders table, on column customerId, that references column id from table customers (see the sketch below):
- Each value inserted or updated in orders.customerId must exactly match a value in customers.id
- Values in customers.id that are referenced by orders.customerId cannot be deleted or updated. However, customers.id values that aren't present in orders.customerId can be updated or deleted
× Each column cannot belong to more than 1 Foreign Key constraint
× Cannot be a computed column.
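A small illustrative schema matching the description above (names follow the slide's example; the index on the referencing column is added because older CockroachDB versions required one):

CREATE TABLE customers (id INT PRIMARY KEY, name STRING);
CREATE TABLE orders (
  id INT PRIMARY KEY,
  customerId INT NOT NULL REFERENCES customers (id),
  INDEX (customerId)
);
INSERT INTO customers VALUES (1, 'Apple');
INSERT INTO orders VALUES (10, 1);     -- ok: customerId 1 exists in customers.id
INSERT INTO orders VALUES (11, 999);   -- rejected: 999 is not present in customers.id
DELETE FROM customers WHERE id = 1;    -- rejected: the row is still referenced by orders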
88. roach db sql characteristics - interleaving tables
cockroachlabs.com
× Improves query performance by optimizing the key-value structure of closely related tables, attempting to keep data on the same key-value range if it's likely to be read and written together
× When tables are interleaved, data written to one table (known as the child) is inserted directly into another (known as the parent) in the key-value store. This is accomplished by matching the child table's Primary Key to the parent's
× For interleaved tables to have Primary Keys that can be matched, the child table must use the parent table's entire Primary Key as a prefix of its own Primary Key – these matching columns are referred to as the interleave prefix (see the sketch below)
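A minimal sketch of the syntax available in CockroachDB at the time (table and column names are made up; the child's primary key starts with the parent's full primary key):

CREATE TABLE customers (id INT PRIMARY KEY, name STRING);
CREATE TABLE orders (
  customer_id INT,
  id INT,
  total DECIMAL,
  PRIMARY KEY (customer_id, id)
) INTERLEAVE IN PARENT customers (customer_id);
-- Rows of orders are stored next to the matching customers row in the KV space,
-- so reads touching a customer and its orders tend to stay within one range.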
89. roach db sql characteristics – column families
cockroachlabs.com
× A column family is a group of columns in a table that is stored as a single key-value pair in
the underlying key-value store. When frequently updated columns are grouped with seldom
updated columns, the seldom updated columns are nonetheless rewritten on every update
× Columns that are part of the primary index are always assigned to the first column family. If
you manually assign primary index columns to a family, it must therefore be the first family
listed in the CREATE TABLE statement.
× Storage requirements (experimental observation)
× Examples:
CREATE TABLE test (id INT PRIMARY KEY, lastAccessed TIMESTAMP, data BYTES,
FAMILY modifiableFamily (id, lastaccessed), FAMILY readonlyFamily (data));
ALTER TABLE test ADD COLUMN data2 BYTES CREATE FAMILY f3;
ALTER TABLE test ADD COLUMN name STRING CREATE IF NOT EXISTS FAMILY f1
90. roach db sql characteristics – time travel queries
cockroachlabs.com
× The AS OF SYSTEM TIME timestamp clause causes statements to execute using the database
contents "as of" a specified time in the past
× Historical data is available only within the garbage collection window, which is determined by the
ttlseconds
× SELECT name, balance FROM accounts WHERE name = 'Edna Barath'
× SELECT name, balance FROM accounts AS OF SYSTEM TIME '2016-10-03 12:45:00' WHERE name = 'Edna Barath'
91. roach db - sql best practices (partial)
cockroachlabs.com
× Insert, Delete, Upsert multiple rows (the UPSERT statement is short-hand for INSERT ON CONFLICT; see the example below)
INSERT INTO accounts (id, balance) VALUES (3, 8100.73), (4, 9400.10)
× The TRUNCATE statement removes all rows from a table by dropping the table and
recreating a new table with the same name. This performs better than using DELETE,
which performs multiple transactions to delete all rows
× Use IMPORT instead of INSERT for Bulk Inserts into New Tables
× Execute Statements in Parallel
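For illustration, the UPSERT form mentioned above, reusing the slide's accounts example:

UPSERT INTO accounts (id, balance) VALUES (3, 8100.73), (4, 9400.10);
-- Inserts the rows if ids 3 and 4 are new, otherwise overwrites their balance,
-- acting as shorthand for INSERT ... ON CONFLICT (id) DO UPDATE.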
92. RocksDB
× Key-value persistent store
× Embedded
× Exceptionally fast (designed for SSDs)
× Log-structured merge engine - data in RAM + append/transaction log
× Not distributed (C++ library)
× No failover
× No high availability (if the SSD dies you lose your data)
93. RocksDB
× Keys & values are byte arrays (not typed like in an RDBMS)
× Data is stored sorted by key
× In Java terms, a Sorted Map – similar to Cassandra clustering keys
× Operations are: Put, Delete & Merge
× Basic queries: Get & Iterator (Scan)
100. Other places where RocksDB is used
youtube.com/watch?v=aKAJMd0iKtI
× MyRocks = MySQL + RocksDB
“MyRocks has 2x better compression compared to compressed InnoDB, 3-4x better compression compared to uncompressed InnoDB, meaning you use less space.”
× Rocksandra = Cassandra + RocksDB
thenewstack.io/instagram-supercharges-cassandra-pluggable-rocksdb-storage-engine
CASSANDRA-13476 & CASSANDRA-13474 (pluggable storage engine)