In Apache Cassandra Lunch #52: Airflow and Cassandra for Cluster Management, we discussed using Airflow to schedule tasks on a Cassandra cluster beyond what could be accomplished with the Cassandra provider package.
In Data Engineer's Lunch #46, we discuss the architecture of Node.js and use it to make an API call and collect the returned data.
Accompanying Blog: https://blog.anant.us/data-engineers-lunch-45-apache-livy
Accompanying YouTube: https://youtu.be/WMRN815FuQ8
Sign Up For Our Newsletter: http://eepurl.com/grdMkn
Join Data Engineer’s Lunch Weekly at 12 PM EST Every Monday:
https://www.meetup.com/Data-Wranglers-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Join The Anant Team:
https://www.careers.anant.us
In Apache Cassandra Lunch #59: Functions in Cassandra, we discussed the functions that are usable inside of the Cassandra database. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live.
Apache Cassandra Lunch #70: Basics of Apache Cassandra (Anant Corporation)
In Cassandra Lunch #70, we discuss the basics of Apache Cassandra and set up a stand-alone Apache Cassandra instance.
Accompanying Blog: https://blog.anant.us/cassandra-launch-70-basics-of-apache-cassandra
Accompanying YouTube: https://youtu.be/o-yU0mi4nzc
Join Cassandra Lunch Weekly at 12 PM EST Every Wednesday: https://www.meetup.com/Cassandra-DataStax-DC/events/
Cassandra.Lunch:
https://github.com/Anant/Cassandra.Lunch
Building a REST API with Cassandra on Datastax Astra Using Python and Node (Anant Corporation)
DataStax Astra lets you develop and deploy data-driven applications on a cloud-native service, without the hassle of database and infrastructure administration. In this webinar, we walk you through creating a REST API that exposes your Cassandra database.
Webinar Link: https://www.youtube.com/watch?v=O64pJa3eLqs&t=20s
Cassandra is a highly scalable, open-source distributed database designed to handle large amounts of structured data across many servers. It provides high availability with no single point of failure and was created by Facebook to power search on their messaging platform. Cassandra uses a decentralized peer-to-peer architecture and replicates data across multiple data centers for fault tolerance. It emphasizes performance and scalability over more complex query options and does not support features like joins typically found in relational databases. Companies like Netflix and Hulu use Cassandra for its availability, scalability, and ability to span large clusters with minimal maintenance.
This document provides an introduction to Apache Cassandra, a distributed column-based NoSQL database. It discusses Cassandra's features such as horizontal scaling, high availability without a single point of failure, and supporting large amounts of data. It also briefly explains how Cassandra works by distributing data across nodes, and introduces the Cassandra Query Language for querying the database and includes references for further reading.
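The data-distribution idea in the two summaries above can be sketched in a few lines of Python. This is an illustrative stand-in, not Cassandra's actual partitioner: the node names are hypothetical and md5 replaces Murmur3, but the shape is the same: hash the partition key onto a token ring, then walk clockwise to pick replicas.

```python
import hashlib

RING_SIZE = 2 ** 64
NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster

def token_for(partition_key: str) -> int:
    # Hash the partition key to a position on a 0..2**64 token ring.
    # (md5 is just an illustrative stand-in for Murmur3.)
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")

def owner(partition_key: str, replication_factor: int = 2) -> list:
    # Each node owns an equal, contiguous slice of the ring; replicas are
    # the next nodes walking clockwise, akin to SimpleStrategy.
    slot = token_for(partition_key) * len(NODES) // RING_SIZE
    return [NODES[(slot + i) % len(NODES)] for i in range(replication_factor)]

print(owner("user:42"))
```

Because the hash is deterministic, any node can compute the same owner list for a key, which is what lets Cassandra route requests without a central coordinator.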
Apache Cassandra Lunch #67: Moving Data from Cassandra to Datastax Astra (Anant Corporation)
In Apache Cassandra Lunch #67, we discussed how to move data from open-source Cassandra to Datastax Astra using DSBulk and the Scylla Migrator.
https://github.com/DataStax-Examples/dsbulk-to-astra/
Accompanying Blog: https://blog.anant.us/apache-cassandra-lunch-67-moving-data-from-cassandra-to-datastax-astra-with-dsbulk
Accompanying Youtube: https://youtu.be/0k7RBf5vi5M
No matter how resilient your database infrastructure is, backups are still needed to defend against catastrophic failures, be it the unlikely hardware failure of all data centers or the more likely, all-too-human user error. Acknowledging the importance of good backup procedures, Scylla Manager now natively supports backup and restore operations. In this talk, we will learn how that works and what guarantees it provides, as well as how to set it up for maximum cluster resiliency.
What Kiwi.com Has Learned Running ScyllaDB and Go (ScyllaDB)
Kiwi.com, a global travel booking site, uses Scylla as its search engine storage backend. Since the last Scylla Summit, Kiwi.com has migrated from Cassandra to Scylla. Find out how our distributed database topology influences the development of all our applications. Also learn how we rewrote our core services, originally written in Python, in Go, and how we obtained performance improvements with the gocql driver.
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu... (ScyllaDB)
This document compares MongoDB and ScyllaDB databases. It discusses their histories, architectures, data models, querying capabilities, consistency handling, and scaling approaches. It also provides takeaways for operations teams and developers, noting that ScyllaDB favors consistent performance over flexibility while MongoDB is more flexible but sacrifices some performance. The document also outlines how a company called Numberly uses both MongoDB and ScyllaDB for different use cases.
A database is an organized collection of data, generally stored and accessed electronically from a computer system. Where databases are more complex, they are often developed using formal design and modeling techniques.
Migrating from a Relational Database to Cassandra: Why, Where, When and How (Anant Corporation)
Everything you need to know about moving from a relational database to Cassandra.
You may be very familiar with what Cassandra is, or the name might just be a buzzword you've heard used when discussing databases. Regardless of your familiarity with Cassandra, this database should be the first tool you consider when you need scalability and high availability without compromising performance.
Running a DynamoDB-compatible Database on Managed Kubernetes Services (ScyllaDB)
With the release of Alternator, Scylla’s DynamoDB-compatible API, you can now take your locked-in DynamoDB workloads and run them anywhere. Scylla provides a cost-effective open source alternative to Amazon’s DynamoDB, deployable wherever a user would want: on-premises, on other public clouds like Microsoft Azure or Google Cloud Platform, still on AWS (such as the high-density i3en instances) or as a fully managed DBaaS.
In this session, we will cover:
- Scylla Alternator: Scylla’s Amazon DynamoDB-compatible API
- Scylla Operator: Running Scylla Alternator on Kubernetes
- Demo: Alternator serving DynamoDB workloads on GKE
ScyllaDB's Avi Kivity on UDF, UDA, and the Future (ScyllaDB)
Scylla is now capable of executing user-defined functions (UDFs) and user-defined aggregates (UDAs). That allows queries to be more flexible and, in many situations, faster too, by avoiding server-client data transfers. In this talk, we will look at the infrastructure added to Scylla to make it happen. One key piece of that infrastructure is the integration of a programming-language interpreter that lets users inject their own custom code. But once that happens, where do we stop? We will look into proposed extensions to Scylla that leverage this infrastructure to consume your data in faster, more efficient, and more creative ways.
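The UDA model the talk covers, a state function folded over each row plus a final function applied at the end, can be illustrated in plain Python. This mirrors only the shape of a Cassandra/Scylla user-defined aggregate, not actual CQL syntax:

```python
# A user-defined aggregate is defined by a state function (applied to the
# running state for each row) and an optional final function (applied once
# at the end). Here: a server-side-style average.

def avg_state(state, value):
    # state is (running_sum, count)
    total, count = state
    return (total + value, count + 1)

def avg_final(state):
    total, count = state
    return total / count if count else None

def run_aggregate(rows, state_fn, final_fn, initial):
    # What the server conceptually does while scanning matching rows.
    state = initial
    for value in rows:
        state = state_fn(state, value)
    return final_fn(state)

print(run_aggregate([10, 20, 30], avg_state, avg_final, (0, 0)))  # prints 20.0
```

The performance point in the abstract follows from this shape: only the small final value crosses the wire, instead of every row the aggregate consumed.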
This document covers 10 different Cassandra distributions and variants, ranging from Cassandra and Cassandra-compliant databases on the JVM, to Cassandra-compliant databases in C++, to Cassandra as a Service / managed Cassandra based on open-source Cassandra, and Cassandra as a Service / managed Cassandra based on proprietary technology.
Cassandra Lunch #87: Recreating Cassandra.api using Astra and Stargate (Anant Corporation)
In Cassandra Lunch #87, we work on using Astra DB's included Stargate API layer as a substitute for the hand-written Node and Python APIs in our Cassandra.api project.
Accompanying YouTube: Coming Soon!
Scylla Summit 2022: IO Scheduling & NVMe Disk Modelling (ScyllaDB)
This document discusses IO scheduling and modeling NVMe disks. It explains that different components compete for limited disk resources with different priorities and may overconsume resources if not scheduled properly. It then describes using an IO scheduler to get maximum concurrency from the disk and apply request priorities while avoiding overconsumption. The document outlines a token bucket algorithm for rate limiting and previews ongoing work to implement a new scheduler in Seastar and Scylla and add related metrics and tuning capabilities.
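The token-bucket algorithm the summary mentions is straightforward to sketch. This is a generic illustration of the technique, not Seastar's or Scylla's scheduler code:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: tokens accrue at a fixed rate up to a
    burst capacity, and each request must take a token to proceed."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.last = time.monotonic()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        # Refill lazily based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

bucket = TokenBucket(rate=10.0, capacity=5.0)
# Back-to-back requests drain the burst allowance; once tokens run out,
# try_acquire returns False until enough refill time has passed.
print([bucket.try_acquire() for _ in range(6)])
```

In an IO scheduler, the "tokens" would be sized in disk bandwidth or request slots, which is how overconsumption by one component is bounded while still allowing short bursts.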
Scylla Summit 2022: Stream Processing with ScyllaDB (ScyllaDB)
Palo Alto Networks processes terabytes of events each day. One of their many challenges is to understand which of those events (which might come from various different sensors) actually describe the same story but from many different viewpoints.
Traditionally, such a system would need some sort of database to store the events, and a message queue to notify consumers about new events arriving in the system. Palo Alto Networks wanted to avoid the cost and operational overhead of deploying yet another stateful component, and designed a solution that uses ScyllaDB both as the database for the events *and* as a message queue that lets their consumers consume the correct events each time. Join Daniel Belenky, Principal Software Engineer at Palo Alto Networks, as he walks you through their process.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
In Data Engineer's Lunch #54, we discuss dbt (the data build tool), a tool for managing data transformations with configuration files rather than code. We connect it to Apache Spark and use it to perform transformations.
Accompanying YouTube: https://youtu.be/dwZlYG6RCSY
This document discusses ScyllaDB's process for sizing a Scylla cluster. It begins by outlining the importance of understanding business, application, and infrastructure requirements. Then it walks through building a sample system based on provided workload details. It shows how the sample system could be configured on different cloud platforms like AWS, Azure, and GCP. Finally, it highlights Scylla's sizing sheet tool for helping to determine hardware needs based on workload characteristics and performance goals.
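The sizing walkthrough described above reduces to simple arithmetic once the workload numbers are known. Here is a hedged sketch of that arithmetic; all figures and per-node capabilities are hypothetical, not Scylla's actual sizing-sheet formulas:

```python
import math

def nodes_needed(dataset_gb, replication_factor, usable_gb_per_node,
                 peak_ops, ops_per_node):
    # Total footprint after replication, divided by usable capacity per node.
    for_storage = math.ceil(dataset_gb * replication_factor / usable_gb_per_node)
    # Peak throughput requirement, divided by per-node capability.
    for_throughput = math.ceil(peak_ops / ops_per_node)
    # The cluster must satisfy both constraints.
    return max(for_storage, for_throughput)

# Hypothetical workload: 10 TB raw data, RF=3, nodes with ~3.5 TB usable,
# 900k peak ops/s against ~100k ops/s per node.
print(nodes_needed(10_000, 3, 3_500, 900_000, 100_000))  # prints 9
```

The point of the exercise is that storage and throughput are separate constraints; whichever demands more nodes wins, and headroom for compaction, growth, and failure tolerance is layered on top.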
Migrating Data Pipeline from MongoDB to Cassandra (Demi Ben-Ari)
MongoDB is a great NoSQL database; it's very flexible and easy to use.
But would it handle massive read/write throughput?
And what happens when you need to scale everything out, easily?
We will lay out the reasons for migrating our data pipeline to Apache Cassandra, and the steps we took to do it in a short period without any prior knowledge.
We'll list our lessons learned as well.
Bio:
Demi Ben-Ari, Sr. Data Engineer @ Windward.
I have over 9 years of experience building various systems, both near-real-time applications and Big Data distributed systems.
Co-organizer of the "Big Things" Big Data community: http://somebigthings.com/big-things-intro/
Ryan will expand on his popular blog series and drill down into the internals of the database. Ryan will discuss optimizing query performance, best indexing schemes, how to manage clustering (including meta and data nodes), the impact of IFQL on the database, the impact of cardinality on performance, TSI, and other internals that will help you architect better solutions around InfluxDB.
Creating an open source load balancer for S3 (Anders Bruvik)
This document discusses Safespring's development of an open source load balancer for their object storage solution. It describes Safespring as an infrastructure company that offers storage, compute and backup services from local data centers using open source technologies like OpenStack and Ceph. The document outlines Safespring's goals for a new load balancer to support a hybrid cloud service for customers, and describes the open source components chosen including Bird, Træfik, Etcd, Letsencrypt, Radosgw and Prometheus. It provides details on the load balancer design, configuration management with Ansible, and Safespring's DevOps workflow and tools.
This document summarizes using Kubernetes to deploy a Spark big data computing environment. It discusses why Kubernetes is preferable to other solutions like Cloudera for managing Spark. The architecture of running Spark on Kubernetes is shown, with the Spark master and worker controllers. Performance is compared between Spark on Kubernetes and standalone Spark using the SparkPI and WordCount examples. Support for Spark 2.3.0 on Kubernetes is now official.
The document discusses operations in the cloud and best practices. It describes how companies like Zynga and others were able to scale games and applications using AWS services like EC2, S3, EBS, ELB, and RDS. It discusses high availability, making applications stateless, monitoring, and open source alternatives. Best practices include infrastructure as code, automated provisioning, eliminating single points of failure, caching, and following the sun for development.
Seastar is a C++ asynchronous programming framework that allows for multi-domain async programming across networking, storage I/O, and multi-core communications. It uses an event-driven model where each logical core runs a task scheduler independently. Logical cores communicate through queues. Seastar is applicable for workloads with high I/O to compute ratios, high concurrency needs, and distributed applications. It provides futures/promises abstractions and rich APIs for tasks like HTTP servers, RPC, and distributed databases.
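The shard-per-core model described above (one scheduler per logical core, cross-core communication via queues, results delivered through futures/promises) can be loosely imitated with Python's asyncio. This is a sketch of the shape only; Seastar is C++ and this is not its API:

```python
import asyncio

# Each "shard" is a task draining its own inbox queue. Work submitted to
# another shard is a (function, future) pair: the shard runs the function
# and resolves the caller's future with the result.

async def shard(inbox: asyncio.Queue):
    while True:
        fn, done = await inbox.get()
        if fn is None:              # shutdown sentinel
            return
        done.set_result(fn())       # run the work, fulfil the promise

async def main():
    inbox = asyncio.Queue()
    worker = asyncio.create_task(shard(inbox))

    done = asyncio.get_running_loop().create_future()
    await inbox.put((lambda: 2 + 2, done))  # "submit to" the other shard
    print(await done)                       # future resolves with 4

    await inbox.put((None, None))
    await worker

asyncio.run(main())
```

The design point this illustrates: no locks are shared between shards; all coordination happens through message passing, which is what lets each core run its scheduler independently.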
DockerCon 2016 Ecosystem - Everything You Need to Know About Docker and Stora... (ClusterHQ)
Docker volumes allow storing state from containers outside of the image layer for persistence. They can be local storage or enabled for external storage management using plugins. This provides high availability for data by allowing the data to move with containers during failures or maintenance. The document discusses key concepts around stateful vs stateless containers and volumes. It also demonstrates creating and managing volumes using Docker Volume plugins in UCP for shared storage and failover capabilities.
This document provides an overview of Apache Airflow, an open-source workflow management platform. It describes Airflow as a tool for scheduling and running jobs and data pipelines, ensuring correct ordering based on dependencies and recovering from failures. The key benefits of Airflow are that it is easy to use with Python knowledge, open source, supports many platforms and systems through integrations, uses Python flexibly for workflows, and enables visualization of workflows. The document outlines Airflow's architecture, core concepts including DAGs (directed acyclic graphs), tasks, and operators, and how to create a workflow by defining a DAG as a Python file with tasks and their dependencies and order.
An introduction to Apache Airflow, its main concepts and features, and an example of a DAG, followed by some lessons and best practices learned from the 3 years I have been using Airflow to power workflows in production.
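The core ordering guarantee the Airflow summaries describe (a task runs only after all of its upstream dependencies) can be sketched without Airflow at all, using the standard library's `graphlib` (Python 3.9+). The task names are hypothetical and this is the dependency-ordering idea only, not Airflow's DAG API:

```python
from graphlib import TopologicalSorter

# A DAG is just tasks plus "runs after" edges. Each key maps a task to
# the set of tasks it depends on (its upstream tasks).
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load", "transform"},
}

# static_order() yields the tasks in an order that respects every edge,
# which is the ordering guarantee a scheduler like Airflow enforces.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Airflow layers scheduling, retries, and operators on top, but any valid run of a DAG is some topological order of its tasks, possibly with independent tasks executing in parallel.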
What Kiwi.com Has Learned Running ScyllaDB and GoScyllaDB
Kiwi.com, a global travel booking site, uses Scylla as its search engine storage backend. Since last Scylla Summit, Kiwi.com has migrated from Cassandra to Scylla. Find out how our distributed database topology influences the development of all our applications. Also learn how we rewrote our core services, originally written in Python, as Go, and how we obtained performance improvements with the gocql driver.
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...ScyllaDB
This document compares MongoDB and ScyllaDB databases. It discusses their histories, architectures, data models, querying capabilities, consistency handling, and scaling approaches. It also provides takeaways for operations teams and developers, noting that ScyllaDB favors consistent performance over flexibility while MongoDB is more flexible but sacrifices some performance. The document also outlines how a company called Numberly uses both MongoDB and ScyllaDB for different use cases.
A database is an organized collection of data, generally stored and accessed electronically from a computer system. Where databases are more complex they are often developed using formal design and modeling techniques.
Migrating from a Relational Database to Cassandra: Why, Where, When and HowAnant Corporation
Everything you need to know about moving from a relational database to Cassandra.
You may be very familiar with what Cassandra is, or the name might just be a buzzword you've heard used when discussing databases. Regardless of your familiarity with Cassandra, this database should be the first tool you consider when you need scalability and high availability without compromising performance.
Running a DynamoDB-compatible Database on Managed Kubernetes ServicesScyllaDB
With the release of Alternator, Scylla’s DynamoDB-compatible API, you can now take your locked-in DynamoDB workloads and run them anywhere. Scylla provides a cost-effective open source alternative to Amazon’s DynamoDB, deployable wherever a user would want: on-premises, on other public clouds like Microsoft Azure or Google Cloud Platform, still on AWS (such as the high-density i3en instances) or as a fully managed DBaaS.
In this session, we will cover:
- Scylla Alternator: Scylla’s Amazon DynamoDB-compatible API
- Scylla Operator: Running Scylla Alternator on Kubernetes
- Demo Alternator - Demo and explain DynamoDB on GKE
ScyllaDB's Avi Kivity on UDF, UDA, and the FutureScyllaDB
Scylla is now capable of executing user-defined functions and user-defined aggregates. That allows queries to be more flexible, and in many situations, by avoiding server - client data transfers, faster too. In this talk, we will look at the infrastructure added to Scylla to make it happen. One key piece of that infrastructure, is the integration of a programming language interpreter that allows the users to inject their own custom code. But once that happens, where do we stop? We will look into proposed extensions to Scylla to leverage this infrastructure to allow Scylla to consume your data in faster, more efficient, and creative ways.
10 different Cassandra distributions and variants ranging from Cassandra / Cassandra Compliant Databases on JVM, Cassandra Compliant Databases on C++, Cassandra as a Service / Managed Cassandra Based on Open Source Cassandra, and Cassandra as a Service / Managed Cassandra Based on Proprietary Technology.
Cassandra Lunch #87: Recreating Cassandra.api using Astra and StargateAnant Corporation
In Cassandra Lunch #87, we will work on using AstraDBs included Stargate API layer to substitute for the written Node and Python APIs in our Cassandra.api project.
Accompanying YouTube: Coming Soon!
Sign Up For Our Newsletter: http://eepurl.com/grdMkn
Join Cassandra Lunch Weekly at 12 PM EST Every Wednesday: https://www.meetup.com/Cassandra-DataStax-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Cassandra.Lunch:
https://github.com/Anant/Cassandra.Lunch
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Join The Anant Team:
https://www.careers.anant.us
Scylla Summit 2022: IO Scheduling & NVMe Disk ModellingScyllaDB
This document discusses IO scheduling and modeling NVMe disks. It explains that different components compete for limited disk resources with different priorities and may overconsume resources if not scheduled properly. It then describes using an IO scheduler to get maximum concurrency from the disk and apply request priorities while avoiding overconsumption. The document outlines a token bucket algorithm for rate limiting and previews ongoing work to implement a new scheduler in Seastar and Scylla and add related metrics and tuning capabilities.
Scylla Summit 2022: Stream Processing with ScyllaDBScyllaDB
Palo Alto Networks processes terabytes of events each day. One of their many challenges is to understand which of those events (which might come from various different sensors) actually describe the same story but from many different viewpoints.
Traditionally, such a system would need some sort of a database to store the events, and a message queue to notify consumers about new events that arrived into the system. They wanted to mitigate the cost and operational overhead of deploying yet another stateful component to their system, and designed a solution that uses ScyllaDB as the database for the events *and* as a message queue that allows our consumers to consume the correct events each time. Join this talk with Daniel Belenky, Principal Software Engineer, Palo Alto Networks where he will walk you through their process.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
In Data Engineer's Lunch #54, we will discuss the data build tool, a tool for managing data transformations with config files rather than code. We will be connecting it to Apache Spark and using it to perform transformations.
Accompanying YouTube: https://youtu.be/dwZlYG6RCSY
Sign Up For Our Newsletter: http://eepurl.com/grdMkn
Join Data Engineer’s Lunch Weekly at 12 PM EST Every Monday:
https://www.meetup.com/Data-Wranglers-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Join The Anant Team:
https://www.careers.anant.us
This document discusses ScyllaDB's process for sizing a Scylla cluster. It begins by outlining the importance of understanding business, application, and infrastructure requirements. Then it walks through building a sample system based on provided workload details. It shows how the sample system could be configured on different cloud platforms like AWS, Azure, and GCP. Finally, it highlights Scylla's sizing sheet tool for helping to determine hardware needs based on workload characteristics and performance goals.
Migrating Data Pipeline from MongoDB to CassandraDemi Ben-Ari
MongoDB is a great NoSQL database, it’s very flexible and easy to use,
but would it handle massive Read / Write throughput?
actually, what happens when you need to scale everything out and easily?
We will lay out the reasons and the steps of migrating our data pipeline to Apache Cassandra in a short period without having any prior knowledge.
We’ll list our lessons learned as well.
Bio:
Demi Ben-Ari, Sr. Data Engineer @Windward,
I have over 9 years of experience in building various systems both from the field of near real time applications and Big Data distributed systems.
Co-Organizer of the “Big Things” Big Data community:http://somebigthings.com/big-things-intro/
Ryan will expand on his popular blog series and drill down into the internals of the database. Ryan will discuss optimizing query performance, best indexing schemes, how to manage clustering (including meta and data nodes), the impact of IFQL on the database, the impact of cardinality on performance, TSI, and other internals that will help you architect better solutions around InfluxDB.
Creating an open source load balancer for S3 - Anders Bruvik
This document discusses Safespring's development of an open source load balancer for their object storage solution. It describes Safespring as an infrastructure company that offers storage, compute and backup services from local data centers using open source technologies like OpenStack and Ceph. The document outlines Safespring's goals for a new load balancer to support a hybrid cloud service for customers, and describes the open source components chosen including Bird, Træfik, Etcd, Letsencrypt, Radosgw and Prometheus. It provides details on the load balancer design, configuration management with Ansible, and Safespring's DevOps workflow and tools.
This document summarizes using Kubernetes to deploy a Spark big data computing environment. It discusses why Kubernetes is preferable to other solutions like Cloudera for managing Spark. The architecture of running Spark on Kubernetes is shown, with the Spark master and worker controllers. Performance is compared between Spark on Kubernetes and standalone Spark using the SparkPI and WordCount examples. Support for Spark 2.3.0 on Kubernetes is now official.
The document discusses operations in the cloud and best practices. It describes how companies like Zynga and others were able to scale games and applications using AWS services like EC2, S3, EBS, ELB, and RDS. It discusses high availability, making applications stateless, monitoring, and open source alternatives. Best practices include infrastructure as code, automated provisioning, eliminating single points of failure, caching, and following the sun for development.
Seastar is a C++ asynchronous programming framework that allows for multi-domain async programming across networking, storage I/O, and multi-core communications. It uses an event-driven model where each logical core runs a task scheduler independently. Logical cores communicate through queues. Seastar is applicable for workloads with high I/O to compute ratios, high concurrency needs, and distributed applications. It provides futures/promises abstractions and rich APIs for tasks like HTTP servers, RPC, and distributed databases.
DockerCon 2016 Ecosystem - Everything You Need to Know About Docker and Stora... - ClusterHQ
Docker volumes allow storing state from containers outside of the image layer for persistence. They can be local storage or enabled for external storage management using plugins. This provides high availability for data by allowing the data to move with containers during failures or maintenance. The document discusses key concepts around stateful vs stateless containers and volumes. It also demonstrates creating and managing volumes using Docker Volume plugins in UCP for shared storage and failover capabilities.
This document provides an overview of Apache Airflow, an open-source workflow management platform. It describes Airflow as a tool for scheduling and running jobs and data pipelines, ensuring correct ordering based on dependencies and recovering from failures. The key benefits of Airflow are that it is easy to use with Python knowledge, open source, supports many platforms and systems through integrations, uses Python flexibly for workflows, and enables visualization of workflows. The document outlines Airflow's architecture, core concepts including DAGs (directed acyclic graphs), tasks, and operators, and how to create a workflow by defining a DAG as a Python file with tasks and their dependencies and order.
Introduction to Apache Airflow, it's main concepts and features and an example of a DAG. Afterwards some lessons and best practices learned by from the 3 years I have been using Airflow to power workflows in production.
Terraforming your Infrastructure on GCP - Samuel Chow
A talk I gave at the Google Cloud Platform LA Meetup event at Google Playa Vista on Nov 6, 2019. This is a 1+ hour-long, tutorial-oriented talk on Infrastructure as Code (IaC), Terraform (as a toolset for IaC and modern devops), and how to leverage the practice and tools to define, deploy, and manage your infrastructure in GCP.
Airflow Best Practises & Roadmap to Airflow 2.0 - Kaxil Naik
This document provides an overview of new features in Airflow 1.10.8/1.10.9 and best practices for writing DAGs and configuring Airflow for production. It also outlines the roadmap for Airflow 2.0, including dag serialization, a revamped real-time UI, developing a production-grade modern API, releasing official Docker/Helm support, and improving the scheduler. The document aims to help users understand recent Airflow updates and plan their migration to version 2.0.
This document summarizes some of the key upcoming features in Airflow 2.0, including scheduler high availability, DAG serialization, DAG versioning, a stable REST API, functional DAGs, an official Docker image and Helm chart, and providers packages. It provides details on the motivations, designs, and status of these features. The author is an Airflow committer and release manager who works on Airflow full-time at Astronomer.
Apache Cassandra Lunch #94: StreamSets and Cassandra - Anant Corporation
In Cassandra Lunch #94, Arpan Patel will discuss how to connect StreamSets and Cassandra.
Accompanying Blog: Coming Soon!
Accompanying YouTube: https://youtu.be/9-v5mOk6c9c
Introduction to Docker and Monitoring with InfluxData - InfluxData
In this webinar, Gary Forgheti, Technical Alliance Engineer at Docker, and Gunnar Aasen, Partner Engineering, provide an introduction to Docker and InfluxData. From there, they will show you how to use the two together to setup and monitor your containers and microservices to properly manage your infrastructure and track key metrics (CPU, RAM, storage, network utilization), as well as the availability of your application endpoints.
Airflow is a platform for authoring, scheduling, and monitoring workflows or data pipelines. It uses a directed acyclic graph (DAG) to define dependencies between tasks and schedule their execution. The UI provides dashboards to monitor task status and view workflow histories. Hands-on exercises demonstrate installing Airflow and creating sample DAGs.
Orchestrating workflows Apache Airflow on GCP & AWS - Derrick Qin
Working in a cloud or on-premises environment, we all somehow move data from A to B, on demand or on schedule. It is essential to have a tool that can automate recurring workflows. This can be anything from an ETL (Extract, Transform, and Load) job for a regular analytics report all the way to automatically re-training a machine learning model.
In this talk, we will introduce Apache Airflow and how it can help orchestrate your workflows. We will cover key concepts, features, and use cases of Apache Airflow, as well as how you can enjoy Apache Airflow on GCP and AWS by demo-ing a few practical workflows.
The document discusses upcoming features and changes in Apache Airflow 2.0. Key points include:
1. Scheduler high availability will use an active-active model with row-level locks to allow killing a scheduler without interrupting tasks.
2. DAG serialization will decouple DAG parsing from scheduling to reduce delays, support lazy loading, and enable features like versioning.
3. Performance improvements include optimizing the DAG file processor and using a profiling tool to identify other bottlenecks.
4. The Kubernetes executor will integrate with KEDA for autoscaling and allow customizing pods through templating.
5. The official Helm chart, functional DAGs, and smaller usability changes are also planned.
In Data Engineer's Lunch #44, we will discuss Prefect and how it compares to Airflow when scheduling tasks.
Accompanying Blog: https://blog.anant.us/data-engineers-lunch-44-prefect-for-workflow-management/
Accompanying YouTube: https://youtu.be/P184heuv8ws
This document provides an overview of Apache Airflow, including:
- What Apache Airflow is and its benefits such as being open-source, having a large community, and integrating with cloud platforms.
- Common use cases for Airflow like ETL pipelines, machine learning model training, report generation, and DevOps tasks.
- The key components of Airflow including DAGs, tasks, operators, hooks, providers, plugins, and connections.
- Best practices for using Airflow such as keeping workflow files updated, defining clear purposes for DAGs, using variables, setting priorities, and defining SLAs.
- A live demo of running Airflow locally using Docker.
Best Practices for Developing & Deploying Java Applications with Docker - Eric Smalling
This document provides a summary of best practices for developing and deploying Java applications with Docker. It begins with an introduction and overview of Docker terminology. It then demonstrates how to build a simple Java web application as a Docker image and run it as a container. The document also covers deploying applications to clusters as services and stacks, and techniques for application management, configuration, monitoring, troubleshooting and logging in Docker environments.
Kubernetes is an open-source container management platform. It has a master-node architecture with control plane components like the API server on the master and node components like kubelet and kube-proxy on nodes. Kubernetes uses pods as the basic building block, which can contain one or more containers. Services provide discovery and load balancing for pods. Deployments manage pods and replicasets and provide declarative updates. Key concepts include volumes for persistent storage, namespaces for tenant isolation, labels for object tagging, and selector matching.
Apache Spark in Depth: Core Concepts, Architecture & Internals - Anton Kirillov
Slides cover Spark core concepts of Apache Spark such as RDD, DAG, execution workflow, forming stages of tasks and shuffle implementation and also describes architecture and main components of Spark Driver. The workshop part covers Spark execution modes , provides link to github repo which contains Spark Applications examples and dockerized Hadoop environment to experiment with
[WSO2Con EU 2018] Deploying Applications in K8S and Docker - WSO2
Within the last four years, container technologies have become very popular. A lot of companies and developers are now using containers to ship their applications. Docker provides an easy-to-use packaging model to bundle the application. However, in many cases a single container is not enough to run an application. It requires multiple containers, scaled across multiple host machines, to become a production-grade deployment. Kubernetes is an open source system for automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery. This presentation discusses best practices for deploying applications in Docker and Kubernetes while covering Docker and Kubernetes concepts.
This document discusses using Megam and Opennebula to deploy applications to cloud environments in a flexible and portable way. Megam allows deploying applications to any public or private cloud, provides automated scaling, and avoids vendor lock-in. The document outlines Megam's features like deployment recipes, monitoring, and integration with development tools. It also discusses Megam's support for Docker containers, including a visual designer and "Cloud in a Box" for deploying private clouds.
OpenNebula Conf 2014 | Cloud Automation for OpenNebula by Kishorekumar Neelam... - NETWAYS
Kishore works with the engineering team in building the open source product with a future focussed cloud technical strategy for “Megam – Cloud Automation Platform “http://gomegam.com”. In his prior incarnation Kishore has worked as an Architect in complex system integration projects for Airport systems with high availability. Kishore has avid experience in architecting large scale build and packaging tools for mainframe platform integrated via thin clients and eclipse IDE.
Helm and the zen of managing complex Kubernetes apps - Abhishek Chanda
Helm is a package manager for Kubernetes that allows deploying and managing Kubernetes applications. It defines applications as charts that contain templates for Kubernetes manifest files along with configuration parameters. Helm runs a server called Tiller on the Kubernetes cluster that manages releasing and installing charts. Charts can be stored in local or remote repositories and contain templates, dependencies, configuration and hooks. Helm provides commands to search, install, upgrade, and delete releases of packaged applications on Kubernetes clusters.
Similar to Apache Cassandra Lunch #52: Airflow and Cassandra for Cluster Management:
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137 - Anant Corporation
Discussion of LLM fine-tuning with an overview of fine-tuning types and datasets: specifically we will talk about the method that we used to turn an existing collection of Cassandra information into a set of instructions and responses that we can use for fine tuning.
What's AGI? How is it different from an Agent or an AI Assistant? If you're looking to understand how AI Agents/AGI can help your company, check this out.
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot - Anant Corporation
In this meetup, we will introduce the concepts of Real Time Analytics, why it is important, the evolution of Analytics, and how companies such as LinkedIn, Stripe, Uber and more are using Real Time analytics to grow their audience and improve usability by using Apache Pinot. What is Apache Pinot? Followed by Demo and Q&A.
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval... - Anant Corporation
Series: Using AI / ChatGPT at Work - GPT Automation
Are you a small business owner or web developer interested in leveraging the power of GPT (Generative Pretrained Transformer) technology to enhance your business processes? If so, Join us for a series of events focused on using GPT in business. Whether you're a small business owner or a web developer, you'll learn how to leverage GPT to improve your workflow and provide better services to your customers.
GPT Automation: What it is and How it Works
How Time-Saving GPT Automation Can Improve Your Business
Cost-Effective GPT Automation: How it Can Save Your Business Money
Using GPT Automation for Customer Service: Benefits and Best Practices
The Power of GPT Automation for Content Creation
Data Analysis Made Easy with GPT Automation
Top GPT-3 Automation Tools for Businesses
The Ethical Considerations of GPT Automation
Overcoming Bias in GPT Automation: Best Practices
The Future of GPT Automation: Trends and Predictions
Since we focus on "no code" here, we'll explore the tools that are already out there such as ChatGPT plugins for Chrome, OpenAI GPT API, low-code/no-code platforms like Make/Integromat and Zapier, existing apps like Jasper/Rytr, and ecosystem tools like Everyprompt. We'll also discuss the resources available for those interested in learning more about GPT, including other people’s prompts.
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT - Anant Corporation
This document provides an agenda for a full-day bootcamp on large language models (LLMs) like GPT-3. The bootcamp will cover fundamentals of machine learning and neural networks, the transformer architecture, how LLMs work, and popular LLMs beyond ChatGPT. The agenda includes sessions on LLM strategy and theory, design patterns for LLMs, no-code/code stacks for LLMs, and building a custom chatbot with an LLM and your own data.
In Apache Cassandra Lunch #131: YugabyteDB Developer Tools, we discussed third party developer tools that are compatible with YugabyteDB. We talked about using Yugabyte Developer Tools for data visualization and schema management. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST.
Developer tools play a critical role in simplifying and streamlining database development and management. They allow developers and administrators to be more productive, reducing the time and effort required to create and maintain database schemas, write SQL queries, test database performance, and enable collaboration. Developer tools also make it possible to track changes over time, improving the ability to manage the entire development lifecycle.
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap - Anant Corporation
In this episode we'll discuss the different flavors of prompt engineering in the LLM/GPT space. According to your skill level you should be able to pick up at any of the following:
Leveling up with GPT
1: Use ChatGPT / GPT Powered Apps
2: Become a Prompt Engineer on ChatGPT/GPT
3: Use GPT API with NoCode Automation, App Builders
4: Create Workflows to Automate Tasks with NoCode
5: Use GPT API with Code, make your own APIs
6: Create Workflows to Automate Tasks with Code
7: Use GPT API with your Data / a Framework
8: Use GPT API with your Data / a Framework to Make your own APIs
9: Create Workflows to Automate Tasks with your Data /a Framework
10: Use Another LLM API other than GPT (Cohere, HuggingFace)
11: Use open source LLM models on your computer
12: Finetune / Build your own models
Series: Using AI / ChatGPT at Work - GPT Automation
Are you a small business owner or web developer interested in leveraging the power of GPT (Generative Pretrained Transformer) technology to enhance your business processes?
If so, Join us for a series of events focused on using GPT in business. Whether you're a small business owner or a web developer, you'll learn how to leverage GPT to improve your workflow and provide better services to your customers.
In Data Engineer’s Lunch #89: Machine Learning Orchestration with Airflow, we discussed using Apache Airflow to manage and schedule machine learning tasks. By following the best practices of ML Ops, teams can streamline their ML workflows and build scalable, efficient, and accurate models that deliver real-world business value. Properly implemented ML Ops can help organizations stay ahead of the curve and achieve their goals in the fast-paced world of machine learning. Apache Airflow is an open-source tool for scheduling and automating workflows. Airflow allows you to define workflows in Python, with tasks defined as Python functions that can include Operators for all sorts of external tools. This makes it easy to automate repeated processes and define dependencies between tasks, creating directed-acyclic-graphs of tasks that can be scheduled using cron syntax or frequency tasks. Airflow also features a user-friendly UI for monitoring task progress and viewing logs, giving you greater control over your data pipeline.
Cassandra Lunch 130: Recap of Cassandra Forward Talks - Anant Corporation
If you didn't attend, don't miss this much shorter synopsis of what was covered, along with some of our thoughts on why the talks matter. We'll talk about the main topics of the event.
1. ACID transactions on Cassandra by Aaron Ploetz, Datastax
2. Apache Flink with Apache Cassandra by Satyajit Thadeswar, Netflix
3. Durable Execution built on Apache Cassandra by Loren Sands-Ramshaw, Temporal
4. Switching from Mongo to Cassandra with Mongoose & new Stargate JSON API, Valeri Karpov
5. Cloud Native and Realtime AI/ML with Patrick Mcfadin and Davor Boncaci, Datastax
Data Engineer's Lunch 90: Migrating SQL Data with Arcion - Anant Corporation
In Data Engineer's Lunch 90, Eric Ramseur teaches our audience how to use Arcion.
From best practices to real-world examples, this talk will provide you with the knowledge and insights you need to ensure a successful migration of your SQL data. So whether you're new to data migration or looking to improve your existing process, join us and discover how Arcion can help you achieve your goals.
Data Engineer's Lunch 89: Machine Learning Orchestration with Airflow - Anant Corporation
In Data Engineer's Lunch 89, Obioma Anomnachi will discuss how to manage and schedule Machine Learning operations via Airflow. Learn how you can write complete end-to-end pipelines starting with retrieving raw data to serving ML predictions to end-users, entirely in Airflow.
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S... - Anant Corporation
As the demand for real-time data processing continues to grow, so too do the challenges associated with building production-ready applications that can handle large volumes of data quickly. In this talk, we will explore common problems faced when building real-time applications at scale, with a focus on a specific use case: detecting and responding to cyclist crashes. Using telemetry data collected from a fitness app, we’ll demonstrate how we used a combination of Apache Kafka and Python-based microservices running on Kubernetes to build a pipeline for processing and analyzing this data in real time. We'll also discuss how we used machine learning techniques to build a model for detecting collisions and how we implemented notifications to alert family members of a crash. Our ultimate goal is to help you navigate the challenges that come with building data-intensive, real-time applications that use ML models. By showcasing a real-world example, we aim to provide practical solutions and insights that you can apply to your own projects.
Key takeaways:
An understanding of the common challenges faced when building real-time applications at scale
Strategies for using Apache Kafka and Python-based microservices to process and analyze data in real-time
Tips for implementing machine learning models in a real-time application
Best practices for responding to and handling critical events in a real-time application
Data Engineer's Lunch #85: Designing a Modern Data Stack - Anant Corporation
What are the design considerations that go into architecting a modern data warehouse? This presentation will cover some of the requirements analysis, design decisions, and execution challenges of building a modern data lake/data warehouse.
In Apache Cassandra Lunch #121: Migrating to Azure Managed Instance for Apache Cassandra, we discussed different methods for migrating data from existing Cassandra instances to Azure hosted options.
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg - Anant Corporation
In this talk, Dremio Developer Advocate, Alex Merced, discusses strategies for migrating your existing data over to Apache Iceberg. He'll go over the following:
How to Migrate Hive, Delta Lake, JSON, and CSV sources to Apache Iceberg
Pros and Cons of an In-place or Shadow Migration
Migrating between Apache Iceberg catalogs Hive/Glue -- Arctic/Nessie
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps - Anant Corporation
In this lunch, Johnny will show us how easy it is to start monitoring your Cassandra cluster in minutes. He will explain the various aspects and features of Cassandra that need to be monitored, how to do it, and most importantly why! Approaches for backups and Cassandra repairs will be discussed and explored in detail.
Learn how AxonOps significantly reduces the complexity and overhead when looking after Cassandra and ensures your Cassandra cluster is reliable and resilient.
Experienced developer, DevOps, architect, and AxonOps co-founder, Johnny Miller, has worked with a wide variety of companies – from small start-ups to large enterprises. He has been working with Cassandra for many years and has a deep understanding of the challenges facing modern companies looking to adopt Apache Cassandra.
In Apache Cassandra Lunch #119, Rahul Singh will cover a refresher on GUI desktop/web tools for users that want to get their hands dirty with Cassandra but don't want to deal with CQLSH to do simple queries. Some of the tools are web-based and others are installed on your desktop. Since the beginning days of Cassandra, a lot has changed and there are many options for command-line-haters to use Cassandra.
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache... - Anant Corporation
This document discusses automating Apache Cassandra operations using Apache Airflow. It recommends using Airflow to schedule and automate workflows for ETL, data hygiene, import/export, and more. It provides an overview of using Apache Spark jobs within Airflow DAGs to perform tasks like data cleaning, deduplication, and migrations for Cassandra. The document includes demos of using Airflow and Spark with Cassandra on DataStax Astra and discusses considerations for implementing this solution.
Did you know that drowning is a leading cause of unintentional death among young children? According to recent data, children aged 1-4 years are at the highest risk. Let's raise awareness and take steps to prevent these tragic incidents. Supervision, barriers around pools, and learning CPR can make a difference. Stay safe this summer!
Interview Methods - Marital and Family Therapy and Counselling - Psychology S... - PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr... - Marlon Dumas
This webinar discusses the limitations of traditional approaches to business process simulation based on hand-crafted models with restrictive assumptions. It shows how process mining techniques can be assembled together to discover high-fidelity digital twins of end-to-end processes from event data.
Do People Really Know Their Fertility Intentions? Correspondence between Sel... - Xiao Xu
Fertility intention data from surveys often serve as a crucial component in modeling fertility behaviors. Yet, the persistent gap between stated intentions and actual fertility decisions, coupled with the prevalence of uncertain responses, has cast doubt on the overall utility of intentions and sparked controversies about their nature. In this study, we use survey data from a representative sample of Dutch women. With the help of open-ended questions (OEQs) on fertility and Natural Language Processing (NLP) methods, we are able to conduct an in-depth analysis of fertility narratives. Specifically, we annotate the (expert) perceived fertility intentions of respondents and compare them to their self-reported intentions from the survey. Through this analysis, we aim to reveal the disparities between self-reported intentions and the narratives. Furthermore, by applying neural topic modeling methods, we could uncover which topics and characteristics are more prevalent among respondents who exhibit a significant discrepancy between their stated intentions and their probable future behavior, as reflected in their narratives.
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases - Timothy Spann
Tech Talk: Unstructured Data and Vector Databases
Speaker: Tim Spann (Zilliz)
Abstract: In this session, I will discuss unstructured data and the world of vector databases; we will see how they differ from traditional databases, in which cases you need one, and in which you probably don’t. I will also go over similarity search, where you get vectors from, and an example of a vector database architecture, wrapping up with an overview of Milvus.
Introduction
Unstructured data, vector databases, traditional databases, similarity search
Vectors
Where, What, How, Why Vectors? We’ll cover a Vector Database Architecture
Introducing Milvus
What drives Milvus' Emergence as the most widely adopted vector database
Hi Unstructured Data Friends!
I hope this video had all the unstructured data processing, AI and Vector Database demo you needed for now. If not, there’s a ton more linked below.
My source code is available here
https://github.com/tspannhw/
Let me know in the comments if you liked what you saw, how I can improve and what should I show next? Thanks, hope to see you soon at a Meetup in Princeton, Philadelphia, New York City or here in the Youtube Matrix.
Get Milvused!
https://milvus.io/
Read my Newsletter every week!
https://github.com/tspannhw/FLiPStackWeekly/blob/main/141-10June2024.md
For more cool Unstructured Data, AI and Vector Database videos check out the Milvus vector database videos here
https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups -
https://www.meetup.com/unstructured-data-meetup-new-york/
https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7
https://www.meetup.com/pro/unstructureddata/
https://zilliz.com/community/unstructured-data-meetup
https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
https://www.meetup.com/unstructured-data-meetup-new-york/events/301383476/?slug=unstructured-data-meetup-new-york&eventId=301383476
https://www.aicamp.ai/event/eventdetails/W2024062014
2. Airflow Overview
● A tool for scheduling and automating workflows and tasks
● Good for automating repeated processes
○ Common ETL tasks
○ Machine learning model training
● Write workflows in Python
○ Any tool that can be interacted with from Python should work as well
○ Define dependencies between different sections of workflows
○ Workflows form a DAG of tasks
● Schedule workflows or execute the processes by hand
○ Cron-like syntax or frequency tags
● Monitor tasks and collect/view logs
4. DAGs
● DAG - Directed Acyclic Graph
○ A DAG of tasks w/ dependencies as edges
○ Individual data engineering tasks combine to form a DAG
■ Airflow allows the definition of relationships between tasks
■ Define dependencies and run order
○ DAGs are written in Python and saved as normal .py files
● DAGs run on a specific schedule
○ They can also be triggered manually
○ Schedule defined using CRON notation
■ Also have some tags for frequencies
5. Airflow Providers
● Airflow provider packages allow for integration with external systems
● They are mostly maintained by the Airflow community
● It is possible to create your own provider packages
6. Airflow Connections
● Airflow connections manage the network connections to external systems
● Different types of connections are used to connect to different external tools
● Connection types are added alongside their provider package, with information customized to their application
● Connections are ultimately JSON strings, which Airflow turns into Python dictionaries to pull data from
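As an illustration of that JSON shape, Airflow (2.3+) can read a connection straight from an AIRFLOW_CONN_<ID> environment variable as a JSON document. The host, keyspace, and connection id below are hypothetical placeholders:

```python
import json
import os

# Hypothetical connection details for a local Cassandra node.
conn = {
    "conn_type": "cassandra",
    "host": "127.0.0.1",
    "port": 9042,
    "schema": "my_keyspace",  # the keyspace to use by default
    "extra": {},              # provider-specific options go here
}

# Airflow scans AIRFLOW_CONN_* environment variables and turns the
# JSON back into a Connection object (internally dict-like).
os.environ["AIRFLOW_CONN_CASSANDRA_DEFAULT"] = json.dumps(conn)

print(os.environ["AIRFLOW_CONN_CASSANDRA_DEFAULT"])
```

The same connection can also be created through the Airflow UI or CLI; the environment-variable form is just the easiest to show inline.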
7. Airflow Operators for Cassandra
● Previously covered the Cassandra Operators (table and record sensors), the Cassandra Hooks (give access to all Python driver functionality), and the Cassandra Connection (holds data for connecting to Cassandra nodes)
● More potential Airflow Operators that might be useful with Cassandra
○ The Docker Operator brings up a new Docker container on a given machine (defined via a Docker API URL) based on a given image and can run defined commands in that container
○ The Bash Operator runs commands on the local machine; it can be used to interact with local Cassandra installs or use docker exec to interact with Dockerized Cassandra installs
○ The SSH Operator connects with an SSHHook and SSH Connection to run bash commands on a machine with SSH access
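The shell commands those three operator styles would ultimately run can be sketched as plain strings; this is only a sketch, and the container and host names are hypothetical:

```python
from typing import Optional


def nodetool_command(subcommand: str,
                     container: Optional[str] = None,
                     ssh_host: Optional[str] = None) -> str:
    """Build the shell command a Bash/SSH operator would execute."""
    cmd = f"nodetool {subcommand}"
    if container:
        # Dockerized Cassandra: run nodetool inside the container.
        cmd = f"docker exec {container} {cmd}"
    if ssh_host:
        # Remote node: wrap the command for execution over SSH.
        cmd = f'ssh {ssh_host} "{cmd}"'
    return cmd


print(nodetool_command("status"))
print(nodetool_command("flush", container="cassandra1"))
print(nodetool_command("repair", ssh_host="node2.example.com"))
```

Each resulting string is exactly what would be passed as the bash_command (or SSH command) of the corresponding operator.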
8. Cluster Management Tasks
● We can therefore use Airflow to trigger any given nodetool command on a schedule
○ nodetool flush - flushes in-memory data (memtables) to disk in the form of SSTables
○ nodetool compact - performs compaction, resolving duplicate copies and tombstones and consolidating data into fewer SSTable files
○ Nodetool repair - repairs data mismatches between nodes
○ Change configurations using commands like nodetool disableautocompaction, etc
○ Save status info to Airflow logs using nodetool status, etc