I'm going to cover something that could be seen as essential for Cassandra but hasn't gotten much attention in the Cassandra community and literature: schema migrations, that is, how you go about pushing out and versioning changes to your keyspace and table definitions across environments. This is an area with established solutions in the relational database world, with tools like Liquibase (http://www.liquibase.org/) and Flyway (http://flywaydb.org/), and in web frameworks like Rails and Grails.
I'll outline the different types of migrations, then focus for most of the talk on schema migrations: how they have been done in the Cassandra community, and the roadblocks teams have faced trying to use Liquibase and Flyway to manage Cassandra migrations.
Then I'll share an elegant, lightweight schema migrations system that we at GridPoint built on top of Flyway. I'll use our system as a context for discussing schema migration best practices for Cassandra and the various choices teams have for their migrations and table definitions, including when NOT to use a tool like Flyway. I'll also touch on the other types of migrations besides keyspace and table definitions that can be versioned and driven off source control.
We've had some sexy, exciting, cutting-edge topics today. This is not one of them. ... This is more the sort of routine, good-housekeeping, foundational work that can make the exciting stuff a little less exciting. I’m going to be talking about managing migrations in Cassandra and in particular schema migrations.
Let me give a nod to my employer. From the web site: “GridPoint is a leader in comprehensive, data-driven energy management solutions (EMS) that leverage the power of real-time data collection, big data analytics and cloud computing to maximize energy savings, operational efficiency, capital utilization and sustainability benefits.” The company is based in Arlington, VA, with a development office in Seattle.
Disclaimer… This is my perspective.
Oh, the statue you see is from Bonn, Germany, according to the photographer.
A live-data migration is the process that runs to take the data in one table and adapt it to another table, such that the data in the first table can eventually be retired.
I’m not going to be focusing so much on live-data migrations.
I’m going to be focusing instead on what I would call source-driven migrations.
For schema migrations, think DDL.
The migrations are stored in source control and subject to source control versioning. They may be published to an artifact repository, where artifact versioning and release versioning can be applied.
I’ll be focusing in particular on schema migrations.
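To make this concrete, here's a minimal sketch of what a single source-driven schema migration might look like as a CQL file, assuming Flyway's V<version>__<description> file-naming convention; the table and columns are hypothetical, not our actual schema:

-- V2__create_sensor_readings.cql (hypothetical file name following Flyway's convention)
CREATE TABLE IF NOT EXISTS sensor_readings (
    sensor_id    uuid,
    reading_time timestamp,
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

Each file like this lives in source control next to the application code, so keyspace and table definitions get versioned, reviewed, and released the same way the code does.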
These sorts of problems are covered in depth in this book from the Martin Fowler series that came out in 2006.
I can’t speak to containerizing migrations. We haven’t explored that.
A couple other established standalone tools are DBMaintain and DBDeploy, although those projects have not been active in recent years.
12.2. Schema Changes in RDBMS
Liquibase, Mybatis Migrator, DBDeploy, DBMaintain
12.3. Schema Changes in a NoSQL Data Store
The schema needs to change frequently in response to changing business requirements; you can use similar techniques as with databases with strong schemas.
With schemalessness at the database level, the burden of supporting the effective schema shifts up to the application; the application still needs to be able to marshal and unmarshal the data.
With this slide, I hope you can see that I’m setting up a bit of a straw man. (A straw man with a strong man.)
There was a StackOverflow thread on schema migration tools for Cassandra (http://stackoverflow.com/questions/25286273/is-there-a-schema-versioning-tool-for-cassandra), and there was an erroneous answer I found amusing:
"Cassandra is by its nature… 'schemaless.' It is a structured key-value store, so it is very different from a traditional rdbms in that regard.”
Think about it though. With Cassandra as much as with a relational database, you pay a bitter price for getting your schema wrong.
You end up defining a good number of tables.
I have the fortune of not having worked much with Thrift. But I know that with Thrift, you'd be in the business of manipulating the contents of messages, which obscures the database's desire to have a schema applied to it.
With Thrift, you had super columns and super column families. With CQL, you have collections. But the collections still have to be part of a table. The things that might smack of schemalessness still come back to a schema.
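To make that concrete, here's an illustrative snippet (hypothetical table and column names): the collections are flexible in what they hold, but they're still typed, declared columns in a table schema.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CollectionsStillHaveSchema {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");
        // The set and map columns feel schemaless, but they live inside
        // a declared table, with declared types.
        session.execute(
            "CREATE TABLE IF NOT EXISTS user_profiles (" +
            "  user_id uuid PRIMARY KEY," +
            "  emails set<text>," +
            "  preferences map<text, text>)");
        cluster.close();
    }
}
===========================================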
Thought experiment. Go into cqlsh and execute:
describe keyspace keyspace_name
How big is that output getting? How much is it changing over time?
===========================================
At last month's Cassandra Summit, there was an interesting talk by a company called Reltio, and they described how they were using Cassandra to support "metadata-driven documents in columnar storage." So they produced a keyspace that had a generic table like this. And maybe that schema only had one or two tables. But even they acknowledged that this is an atypical use case for Cassandra.
===========================================
So how have teams been managing their keyspace and table definitions? My anecdotal experience is that whenever the question has come up, teams have usually rolled their own, especially because, on the face of it, this seems like such a simple thing.
Next I want to get into the tools that are out there for Cassandra migrations, and the roadblocks teams have faced trying to manage Cassandra schema migrations via LiquiBase and Flyway.
===========================================
Some history. The obvious way to integrate Liquibase or Flyway with Cassandra comes back to the prospect of the DataStax Java Driver supporting JDBC. There’s this statement from the 2013 announcement of the introduction of the driver (http://www.datastax.com/dev/blog/new-datastax-drivers-a-new-face-for-cassandra): "Today, DataStax announces version 1.0.0 of a new Java Driver, designed for CQL and based on years of experience within the Cassandra community. This Java driver is a first step; an object mapping and a JDBC extension will be available soon…."
Let’s keep that JDBC extension in mind.
===========================================
There was a liquibase-cassandra project that seemed to hit a wall. So some people gravitated toward Flyway.
===========================================
Then there was a GitHub issue for the Flyway project, “Cassandra support.”
https://github.com/flyway/flyway/issues/823
In January someone mentions a cassandra-jdbc project that’s out there and which also seems to have hit a wall.
"I …recently looked into adding support for Cassandra to Flyway, but using the existing cassandra-jdbc driver from https://code.google.com/a/apache-extras.org/p/cassandra-jdbc/ , just to see how far I could get. I found a few issues:"
Proceeds to list the issues.
"I disabled or stubbed out code to get past these, but gave up soon after."
That same poster referenced a thread he started on the DataStax Java Driver user mailing list.
===========================================
So if we go to that thread, which is from last December (https://groups.google.com/a/lists.datastax.com/forum/#!msg/java-driver-user/kspAx0neZlI/8A59HmYc-rwJ):
Subject: "Timeline for JDBC support?"
"Is there any timeline for JDBC support in the DataStax Java Driver for Cassandra, please?"
Alex Popescu, Sen. Product Manager @ DataStax responds:
"While I cannot (yet) promise an ETA for JDBC support, what I can say is that it's on our todo list (and very close to the top)."
===========================================
I look forward to seeing how DataStax pulls off the Cassandra JDBC support, but to my mind, trying to do JDBC against Cassandra seems like, I dunno, a bit of an uphill climb.
So let's set aside the prospect of first-class Cassandra support in Flyway and see what else is out there.
===========================================
Toward the end of the DataStax Java Driver mailing list thread, someone else chimes in and mentions Pillar, which is a dedicated Cassandra migrations tool written in Scala.
And here’s roughly what I wrote in my own internal tool evaluation:
“Before settling on (our) Flyway design for Cassandra schema migrations, I evaluated various open-source Cassandra migration tools. They’re listed below. Of them, the most promising tool was Pillar, which is implemented in Scala. The problem with Pillar vs. (Flyway) was the risk. I was afraid I’d invest time with Pillar and come up emptyhanded, that it wouldn’t deliver the sort of contract I expect from Flyway.” That’s what I wrote. I’m happy we went down the road we did (if I weren’t I wouldn’t be here talking about it), but I’d still maintain that Pillar is worth checking out.
There's mutagen-cassandra, which is a Java tool written against the Astyanax driver but which hasn't been adapted to the DataStax Java Driver.
Then there are these three Python-based tools: Trireme, cql-migrate, mschematool.
Here’s a view of a migrations table that’s responsible for several schemas in PostgreSQL; PostgreSQL’s concept of a schema is analogous to a keyspace in Cassandra.
So let’s get back to the two prominent database migration tools in the relational world.
I think of Liquibase as the Martha Stewart of migration tools. It’s somewhat of a control freak. It wants to do everything itself.
On the other hand, I think of Flyway as the Oprah of migration tools. It provides a framework and then gives you the space to figure things out for yourself.
You see, Liquibase wants to generate the SQL from XML constructs. In the typical usage, the SQL is NOT a first-class citizen. You can define Liquibase migrations as SQL, but even then (to the best of my knowledge) you have to define it inline in the XML.
With Flyway, though, SQL is a first-class citizen. You can make migrations out of straight .sql files. It’s Flyway’s lightweight, unobtrusive, extensible approach that’s going to provide the leverage for using it with Cassandra.
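To be concrete, in the simple case a Flyway migration is nothing more than a .sql file on the classpath, named by Flyway's V<version>__<description> convention (these particular file names are hypothetical):

src/main/resources/db/migration/V1__create_events_table.sql
src/main/resources/db/migration/V2__add_device_id_column.sql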
So instead of first-class Flyway, we’re going to do faked-out Flyway.
The idea is, let Flyway do what it knows, which is migrations. Let Cassandra do what it knows, which is CQL. All we need is an adapter or translator to connect the two.
And one key point. When I say that Flyway knows migrations, I’m saying that Flyway knows migrations in SQL.
So here’s the tradeoff. Or “the weird trick,” to use the parlance of an Internet ad.
Here’s what I wrote in my own internal design doc:
“The reality is that first-class Flyway support for Cassandra doesn’t really gain us anything more than our fake-Flyway solution does, especially considering that we’re fine with persisting the Flyway migrations table to PostgreSQL; once you’re embracing polyglot persistence, you’d realize that a relational database is a better fit anyway for keeping track of the migrations.”
Failure handling: If a migration produces invalid CQL, the driver throws a RuntimeException. That RuntimeException is the signal I need to tell the JDBC Connection to roll back the transaction, which emulates the JDBC contract where a RuntimeException causes the transaction in the actual migrate call to roll back. We do this in the beforeEachMigrate hook so that we have a chance to fail the migration before our dummy, token migration has a chance to run. Flyway will have succeeded with all the migrations up to that point; it will fail only with this particular migration. That preserves the expected Flyway behavior.
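To make the adapter idea concrete, here's a minimal sketch. This is not our actual code: the CqlExecutor type and the .sql-to-.cql naming convention are hypothetical stand-ins, and it assumes Flyway 3.x's callback API with a no-op base class to extend (in versions without one, you'd implement FlywayCallback directly).

import java.sql.Connection;
import org.flywaydb.core.api.MigrationInfo;
import org.flywaydb.core.api.callback.BaseFlywayCallback;

// Hypothetical abstraction over the DataStax Session: reads a .cql
// classpath resource and executes its statements against Cassandra.
interface CqlExecutor {
    void executeScript(String classpathResource) throws Exception;
}

public class CqlMigrationCallback extends BaseFlywayCallback {

    private final CqlExecutor cql;

    public CqlMigrationCallback(CqlExecutor cql) {
        this.cql = cql;
    }

    @Override
    public void beforeEachMigrate(Connection connection, MigrationInfo info) {
        // Pair the dummy .sql migration Flyway is about to run with the
        // real .cql script of the same simple name.
        String script = info.getScript().replace(".sql", ".cql");
        try {
            cql.executeScript(script);
        } catch (Exception e) {
            // Rethrowing as a RuntimeException is the rollback signal:
            // Flyway never records the dummy migration as applied.
            throw new RuntimeException("CQL migration failed: " + script, e);
        }
    }
}

You'd register it with something like flyway.setCallbacks(new CqlMigrationCallback(cql)) before calling migrate().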
Our migrations follow a two-step process. At build time, we produce an artifact that gets published to an artifact repository. That’s the work of a proprietary class called MigrationsBuilder. At runtime, we have another custom class called FlywayMigrator that runs the published migrations against the target database.
In the simple case with Flyway, there’s only a single step, the deploy-time step, even if that step might be executed at build time, or, to be precise, by a build tool like Maven or Gradle.
It’s worth noting that we use the same two-step process, with the same classes, in just the same way when the destination database is PostgreSQL.
We have the .cql files organized into directories according to our releases.
Here you can see that MigrationsBuilder is executed in a Maven build. And you can see that the execution for CQL, as opposed to SQL, differs only by some arguments.
Here we can see the output of MigrationsBuilder. MigrationsBuilder creates .sql files in a package structure that Flyway expects. But our .cql files just show up in the root of the classpath. The generated .sql files have the same simple names as the generated .cql files, and those names have been tweaked from the names in source control to comply with Flyway conventions.
Contains the CQL script’s contents.
This is the dummy, token script that the Flyway class executes with its migrate method.
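So, putting the two together, the generated artifact might look roughly like this (file names hypothetical):

target/classes/db/migration/V001__create_events_table.sql   (the dummy, token script that Flyway itself executes)
target/classes/V001__create_events_table.cql                (the real CQL, executed by our adapter)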
Now, at deploy time, when we go to execute FlywayMigrator against the destination database, you can see that the CQL and SQL invocations are quite similar.
Here we see the dependencies for the standalone JAR that’s executed at deploy time. Both JARs depend on the flywayMigrator library. The Cassandra JAR has only one other dependency because it has to support only one keyspace. The PostgreSQL JAR has numerous other dependencies because it has to support multiple schemas along with some migrations and constructs that don’t fit nicely in a schema.
Here you can see how the migrations version tracking table for Cassandra has been populated after a FlywayMigrator execution.
Now I want to go beyond our own Cassandra migration solution and share some best practices that I’ve arrived at and that I’d recommend however you do your migrations.
First, it’s worth keeping in mind the distinction between different kinds of versioning.
Regarding effective contract versions, there’s a nice discussion in Chapter 12 of “NoSQL Distilled” of making two schema versions coexist in a running application.
Consistent deployment across environments. You should be trying to execute your migrations the same way on a local dev box as you do in production. Or at least isolate the differences.
Failure handling: This goes back to the rollback semantics I was describing in beforeEachMigrate. The Flyway contract is that every migration up to the one that failed sticks, because every one of those migrations succeeded.
Baselining: If you haven’t been doing formalized database migrations from the get-go, you can use the current state of production as the starting point for your migrations: take the “describe keyspace” CQL from cqlsh and make that your initial migration, but only for installations that you want to create from scratch. And if you’ve made a lot of changes to your tables but your migrations haven’t made it to production yet, you can scrap all the history and start from your latest definitions. You get to call a mulligan. Declaring migration bankruptcy.
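On the Flyway side, baselining is supported directly, for installations that already exist: you mark the current state as already applied. A minimal sketch, assuming Flyway 3.x's API and placeholder connection details:

import org.flywaydb.core.Flyway;
import org.flywaydb.core.api.MigrationVersion;

public class BaselineExample {
    public static void main(String[] args) {
        Flyway flyway = new Flyway();
        // Placeholder DataSource details; in our setup this would be the
        // PostgreSQL database that hosts the migrations table.
        flyway.setDataSource("jdbc:postgresql://localhost/flyway_db", "flyway", "secret");
        // Treat the "describe keyspace" snapshot as version 1; Flyway will
        // then apply only migrations with versions above the baseline.
        flyway.setBaselineVersion(MigrationVersion.fromVersion("1"));
        flyway.setBaselineDescription("snapshot of production schema");
        flyway.baseline();
    }
}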
Rollbacks: Something that Liquibase supports. Part of why Liquibase tries to be such a control freak. Flyway, on the other hand, purposely does not support rollbacks. When I first looked into Flyway, that to me was a downside. But I eventually came around to the Flyway way of thinking. You keep progressing forward, even if you’re semantically going backwards. A little like an event sourcing paradigm.
The DataStax Java Driver has a nice mechanism for checking that your schema changes have propagated across the entire cluster. This snippet is taken from the DataStax Java Driver documentation.
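That snippet isn't reproduced in these notes, but here's a minimal sketch along the same lines, assuming driver 2.1.x and hypothetical keyspace and table names:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class SchemaAgreementCheck {
    public static void main(String[] args) throws InterruptedException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        ResultSet rs = session.execute(
            "ALTER TABLE my_keyspace.events ADD processed boolean");

        // The driver reports whether all nodes agreed on the new schema
        // version within its wait window; if not, keep polling.
        if (!rs.getExecutionInfo().isSchemaInAgreement()) {
            while (!cluster.getMetadata().checkSchemaAgreement()) {
                Thread.sleep(1000);
            }
        }
        cluster.close();
    }
}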
The graphic shows how a source-driven migration can expand, sometimes inevitably, into incorporating a live-data migration as well. Maybe you’re changing a column or moving from one table to another, and in the process, you need to copy over the data.
This isn’t so much a limitation. In a way, it’s a strength. Because we’re doing everything programmatically, there’s nothing stopping us from coupling a live-data migration with a source-driven migration. It’s just an extra amount of complexity to account for.
Now here is an actual limitation.
The two tables you see represent the same data, but with one having the data clustered in ascending order and the other with the data clustered in descending order. We need to have a time bucket to keep the partitions from growing indefinitely. In the ascending table, we’re able to incorporate the bucket into the partition key. But with the descending table, we want to be able to drop the tables entirely after a certain amount of time. So with those tables, we make the effective bucket part of the table name.
The ascending table, where the bucket is part of the partition key, we’re able to create statically in the migrations. But the descending table we have to create dynamically, on the fly, in the application, so it falls outside the realm of the migrations. I’m sure there’s a better solution out there; we’re living with this one for now.
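Here's roughly what that dynamic creation looks like, as a sketch only, with hypothetical table and column names:

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import com.datastax.driver.core.Session;

public class BucketedTableCreator {
    private static final DateTimeFormatter BUCKET_FMT =
            DateTimeFormatter.ofPattern("yyyy_MM");

    public static void createDescendingTable(Session session, LocalDate bucket) {
        // The bucket lives in the table name, so an expired bucket can be
        // dropped wholesale with a single DROP TABLE.
        String table = "readings_desc_" + bucket.format(BUCKET_FMT);
        session.execute(
            "CREATE TABLE IF NOT EXISTS my_keyspace." + table + " (" +
            "  sensor_id uuid," +
            "  reading_time timestamp," +
            "  value double," +
            "  PRIMARY KEY (sensor_id, reading_time)" +
            ") WITH CLUSTERING ORDER BY (reading_time DESC)");
    }
}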
Some other considerations…
Making it part of the main app is what I believe a lot of teams do.
There’s another use case, where you want to migrate not CQL but actual SSTables. At that point you might consider storing the data in external storage like S3 or even a separate Cassandra cluster.
I mentioned Chapter 12 of “NoSQL Distilled,” “Schema Migrations.” Well, Chapter 13 is “Polyglot Persistence.”
And the authors proceed to state the obvious, that different databases solve different problems. Relational databases excel at enforcing the existence of relationships. Not good at discovering relationships or pulling data from different tables into a single object. (Of course, these days some folks will say relational databases aren’t good enough at anything to justify their existence, but even then, that doesn’t necessarily mean that Cassandra is the best fit for everything either.)
13.5. Choosing the Right Technology
"Initially, the pendulum had shifted from specialty databases to a single RDBMS database which allows all types of data models to be stored, although with some abstraction. The trend is now shifting back to using the data storage that supports the implementation of solutions natively."
"Encapsulating data access into services reduces the impact of data storage choices on other parts of a system.“
Our Flyway-based solution has the promise to be a unified migrations solution for disparate persistence stores. What you see here is the view in PostgreSQL’s pgAdmin3 GUI of our dedicated flyway schema. There are two tables, one for the Cassandra migration versions, the other for the PostgreSQL migration versions. The name of that one is flyway_schema_version; it should really be called postgresql_schema_version. Not that I want to be encouraging persistence store proliferation, but you could see how we could create another table for another RDBMS vendor or for another entirely different type of persistence store.
I hope by now you can appreciate that I’m not trying to sell you on our particular solution.
I am trying to sell you on the value of source-driven schema migrations for Cassandra, and more broadly on the value of adding automation in building blocks at the right granularity.
I’d initially figured this talk would be a better fit for the beginners’ track. It’s not one of the more challenging and exciting things you’ll be doing with Cassandra, but it’s doing routine, boring things like this that I believe will eventually pay off for you in your work with Cassandra.