R2D2 is LinkedIn's request/response infrastructure. It uses Zookeeper to store information about servers and services, and performs client-side load balancing across clusters. Traffic can be partitioned by request key, and requests are routed to servers within a cluster via a consistent hash ring. The degrader load balancer monitors server metrics such as latency, error rate, and outstanding calls to detect failures and route traffic away from unhealthy servers, gradually recovering as they improve. It also supports configurations such as read-only and read-write cluster variants to split traffic based on client needs.
3. R2D2 in a nutshell
[Diagram: a client wants to send a request to get profile?id=123. Three clusters are shown (the Profile, Inbox, and Ads services), each a set of servers hosting a resource "foo", with Zookeeper alongside. Which server should receive the request?]
The client:
• Listens to the profile zookeeper node
• Gets a list of the URIs of the servers where profile is hosted
• Gets notified if a server leaves or joins the cluster
• Chooses one server to send the request to
4. Agenda
R2D2 Architecture
How information is stored and organized in zookeeper
How R2D2 does load balancing and graceful degradation
Partitioning and sticky routing
Miscellaneous D2 use cases at LinkedIn:
- Redlining
- Cluster variants
Q&A
6. What is rest.li?
Open source Java REST framework. Go to http://rest.li
7. What is D2?
Primarily a name server and traffic router
The global "address book" is stored in zookeeper
We store a back-up in the local filesystem
Definitions:
D2 Cluster represents a collection of identical servers that host one or more D2 services
D2 Service represents a named service; each service maps to the cluster that hosts it
D2 Uri represents a server's address and weight
8. How is D2 information organized and stored?
/ (root)
  /d2
    /d2/clusters
      /d2/clusters/clusterA
      /d2/clusters/clusterB
    /d2/services
      /d2/services/serviceA1
      /d2/services/serviceA2
      /d2/services/serviceB
    /d2/uris
      /d2/uris/clusterA
      /d2/uris/clusterB
        /d2/uris/clusterB/ephemeralNode1
        /d2/uris/clusterB/ephemeralNode2
Service properties: cluster (e.g. clusterA), load-balancer configuration, degrader configuration, strategy configuration, etc.
Cluster properties: partition configuration, etc.
Uri properties: machine URI, weight
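To make the layout concrete, here is a minimal sketch (not D2's actual client code) of how a client could read a service's properties and subscribe to its cluster's membership using the plain ZooKeeper API. The paths match the layout above; the connect string and the property encoding are illustrative assumptions.

    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class D2LookupSketch {
        public static void main(String[] args) throws Exception {
            // Watcher prints every event; a real client would refresh its
            // cached URI list when the cluster's children change.
            ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000,
                    event -> System.out.println("zk event: " + event));

            // Service node (permanent): says which cluster hosts serviceA1
            // and carries load-balancer/degrader/strategy configuration.
            byte[] serviceProps = zk.getData("/d2/services/serviceA1", false, null);
            System.out.println("serviceA1 properties: " + new String(serviceProps));

            // URI nodes (ephemeral): one child per live server in clusterA.
            // watch=true means the Watcher above fires when a server joins
            // or leaves, so the client can re-read the list.
            List<String> servers = zk.getChildren("/d2/uris/clusterA", true);
            System.out.println("live servers in clusterA: " + servers);

            zk.close();
        }
    }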
9. How is zookeeper initialized?
[Diagram: D2Config.java reads a config file and writes the permanent nodes into zookeeper (/d2/clusters/clusterA, clusterB, clusterC and /d2/services/serviceA1, serviceA2, serviceA3). A ClusterA server then announces itself by creating the ephemeral node /d2/uris/clusterA/ephemeralNode1, which a ServiceA1 client reads.]
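The server side of that announcement could look like the following sketch, again under my own assumptions about the payload format (the helper name and node prefix are illustrative, not D2's code):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class AnnounceSketch {
        // Registers this server under its cluster's /d2/uris node.
        // EPHEMERAL: zookeeper deletes the node automatically if the
        // server's session dies, so clients learn about the departure.
        static String announce(ZooKeeper zk, String cluster,
                               String machineUri, double weight) throws Exception {
            byte[] payload = (machineUri + " weight=" + weight).getBytes();
            return zk.create("/d2/uris/" + cluster + "/server-", payload,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL_SEQUENTIAL);
        }
    }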
10. D2 Load Balancer
Client-side load balancer
Client keeps track of the state
2 Strategies to use:
- Random
- Degrader
11. How does the degrader load balancer work?
[Slide: a client balancing across Server 1 and Server 2 over three periods, updated every 5 seconds. Both servers start at 100 points with zero calls, zero cluster average latency, and cluster drop rate 0.0. In LOAD_BALANCE mode, as Server 1's latency climbs far above Server 2's (4900 ms vs. 100 ms over ~100 calls each), Server 1's points are cut (100 to 61) so traffic shifts to the healthy server. Once the cluster average latency (around 3000 ms) exceeds the high water mark, the strategy switches to CALL_DROPPING and raises the cluster drop rate (to 0.2). Notice: the number of points doesn't change while we are in CALL_DROPPING mode.]
LB configuration: latency low water mark: 500 ms; latency high water mark: 2000 ms; min call count: 10
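A minimal sketch of the load-balancing half of this logic, under the assumption (mine, not the slide's) that points scale linearly between the two water marks; the real degrader in rest.li is considerably more involved:

    public class PointsSketch {
        static final long LOW_WATER_MARK_MS = 500;    // below this: fully healthy
        static final long HIGH_WATER_MARK_MS = 2000;  // above this: maximally degraded
        static final int MAX_POINTS = 100;
        static final int MIN_CALL_COUNT = 10;         // too few calls -> stats unreliable

        // Returns the number of points a server should get on the hash ring
        // for the next period, given its average latency over the last period.
        static int pointsFor(long avgLatencyMs, int callCount, int currentPoints) {
            if (callCount < MIN_CALL_COUNT) {
                return currentPoints;                 // not enough data; don't flap
            }
            if (avgLatencyMs <= LOW_WATER_MARK_MS) {
                return MAX_POINTS;                    // healthy: full share of traffic
            }
            if (avgLatencyMs >= HIGH_WATER_MARK_MS) {
                return 1;                             // very slow: minimal share
            }
            // Linearly interpolate between the water marks.
            double frac = (double) (avgLatencyMs - LOW_WATER_MARK_MS)
                    / (HIGH_WATER_MARK_MS - LOW_WATER_MARK_MS);
            return Math.max(1, (int) Math.round(MAX_POINTS * (1.0 - frac)));
        }
    }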
12. How does the degrader recover from a bad state?
[Slide: by period N the cluster drop rate has reached 1.0 and each server is down to 1 point, so all traffic is choked. Notice: we're in recovery mode; because we choke all traffic, we try recovering regardless of call stats. Over the following periods the drop rate steps back down (1.0 to 0.8 to 0.6) and the points ramp back up (1 to 2 to 37 per server) as probe calls come back with healthy latencies (150-200 ms over 15-50 calls per server).]
LB configuration: latency low water mark: 500 ms; latency high water mark: 2000 ms; min call count: 10
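One way to read the recovery behavior as code; this is a sketch under my own assumptions (step size, update rule), not D2's implementation. The key idea from the slide is that a fully choked cluster must probe optimistically, since no call stats can accrue:

    public class RecoverySketch {
        static final double DROP_RATE_STEP = 0.2;   // how fast we re-open the tap

        // Called once per update period. At dropRate 1.0 all calls are
        // dropped and no stats accrue, so we lower the rate regardless
        // of call stats; otherwise we adjust based on observed latency.
        static double recoverStep(double dropRate, long avgLatencyMs,
                                  long highWaterMarkMs) {
            if (avgLatencyMs < highWaterMarkMs) {
                // Things look better (or no data yet): let more traffic through.
                return Math.max(0.0, dropRate - DROP_RATE_STEP);
            }
            // Still unhealthy: keep dropping.
            return Math.min(1.0, dropRate + DROP_RATE_STEP);
        }
    }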
13. A few more extra details
Min call count is reduced depending on how degraded the state is
It's not just latency: we also consider error rate and the number of outstanding calls
We can use many types of latency:
- AVERAGE
- 90th percentile
- 95th percentile
- 99th percentile
We can set different low/high water marks for the cluster vs. for an individual node
14. Call Dropping vs Load Balancing
Call Dropping mode                | Load Balancing mode
Affects the entire cluster        | Affects only individual machines in the cluster
Purpose: graceful degradation     | Purpose: load balancing traffic
Knob: drop rate                   | Knob: points
Hints: latency                    | Hints: individual node latency, error rate, # outstanding calls
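Putting the two modes together, the per-request decision might look like this sketch (illustrative only; the method shape and the ring abstraction are mine, not D2's API):

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.function.LongFunction;

    public class RequestPathSketch {
        // Sketch of the client-side decision for one outgoing request:
        // first the cluster-wide CALL_DROPPING gate, then per-server
        // LOAD_BALANCING via the consistent hash ring.
        static String chooseServer(double clusterDropRate, long requestKeyHash,
                                   LongFunction<String> ring) {
            if (ThreadLocalRandom.current().nextDouble() < clusterDropRate) {
                return null; // dropped for graceful degradation (caller fails fast)
            }
            return ring.apply(requestKeyHash); // healthier servers own more points
        }
    }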
15. Partitioning and Sticky Routing
D2 supports partitioning of clusters:
- Range partitioning
- Hash partitioning (MD5 or modulo)
- Use a regex to extract the key from the URI to determine where a request should go
Sticky routing within a partition is also supported:
- Use a regex to extract the key from the URI (same as above)
- Use a consistent hash ring
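A sketch of the key-extraction step, with a made-up regex and URI format (the slides don't show D2's actual partition configuration):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PartitionSketch {
        // Hypothetical pattern: pull the numeric id out of .../profile?id=123
        static final Pattern KEY_PATTERN = Pattern.compile("id=(\\d+)");

        // Modulo hash partitioning: the same key always lands in the same partition.
        static int partitionFor(String uri, int partitionCount) {
            Matcher m = KEY_PATTERN.matcher(uri);
            if (!m.find()) {
                throw new IllegalArgumentException("no partition key in " + uri);
            }
            long key = Long.parseLong(m.group(1));
            return (int) (key % partitionCount);
        }

        public static void main(String[] args) {
            // id=123 with 4 partitions -> partition 3
            System.out.println(partitionFor("d2://profile?id=123", 4));
        }
    }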
16. Consistent Hash Ring
[Diagram: a ring of hash values spanning Integer.MIN_VALUE to Integer.MAX_VALUE, with points (e.g. at -100, 0, 100) owned by app1.foo.com, app2.foo.com, and app3.foo.com. A request for "foo" hashes to a position on the ring, which determines the owning server.]
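A compact sketch of a point-based consistent hash ring, assuming (as the speaker notes say) 100 points per server and MD5 hashing; class and method names are mine, not D2's:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class HashRingSketch {
        private final TreeMap<Long, String> ring = new TreeMap<>();

        // Each server gets several points on the ring, so heavier or
        // healthier servers own a larger share of the key space.
        void addServer(String uri, int points) throws Exception {
            for (int i = 0; i < points; i++) {
                ring.put(hash(uri + "#" + i), uri);
            }
        }

        // Walk clockwise from the key's hash to the nearest server point.
        // Only ~1/n of keys move when a server joins or leaves.
        String serverFor(String key) throws Exception {
            SortedMap<Long, String> tail = ring.tailMap(hash(key));
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }

        private static long hash(String s) throws Exception {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {            // fold first 8 bytes into a long
                h = (h << 8) | (d[i] & 0xffL);
            }
            return h;
        }

        public static void main(String[] args) throws Exception {
            HashRingSketch ring = new HashRingSketch();
            ring.addServer("app1.foo.com", 100);
            ring.addServer("app2.foo.com", 100);
            ring.addServer("app3.foo.com", 100);
            System.out.println(ring.serverFor("foo"));
        }
    }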
17. Miscellaneous D2 use cases
Redlining: measure the max capacity of a server
- Uses real traffic
- Don't have to worry about mutable operations
[Diagram: the consistent hash ring again (Integer.MIN_VALUE to Integer.MAX_VALUE, with app1.foo.com, app2.foo.com, app3.foo.com), used here to steer a controlled share of real traffic at the server being measured.]
18. Miscellaneous D2 use cases
What if different clients have different requirements?
Let's say we have a service called profile:
- Clients that only view profiles should go to the read-only cluster
- Clients that edit profiles should go to the read-write cluster
Use the cluster variant technique.
A cluster variant changes the D2 service's namespace to get around the restriction that a zookeeper node's name must be unique.
19. Miscellaneous D2 use cases
[Diagram: two clients request the profile service. The view client resolves /d2/services/profile, whose service properties say Cluster = readonly, and is routed via /d2/uris/readonly/ephemeralNode1 to a readonly server. The edit client resolves the cluster variant /d2/profileClusterVariant/profile, whose properties say Cluster = readwrite, and is routed via /d2/uris/readwrite/ephemeralNode1 to a readwrite server. Zookeeper also holds /d2/clusters/readonly and /d2/clusters/readwrite.]
20. Q&A
Questions?
Email me at: osumampouw@linkedin.com
Check out http://rest.li and https://github.com/linkedin/rest.li for more info
We’re hiring!
Step back several years: LinkedIn had a small code base and a single small binary.
Easy to scale up: just add more servers.
The one binary became too big to deploy on a single server, so it was split into multiple binaries: the birth of specialized services and Service Oriented Architecture (SOA).
When a service wants to talk to another service, we have to wire in the address of the load balancer for that cluster.
Now that we have hundreds of services, manually wiring the load balancer address for each route is error prone and slow: imagine that as a developer you have to ask around for the IP addresses of the load balancers.
Load balancers are expensive and introduce an extra network hop.
Imagine you have a client.
Machines can leave and join a cluster at any time. D2 has a server side and a client side.
Zookeeper is a distributed service used to maintain the state of a system. It's fault tolerant: even if a few servers inside the zookeeper ensemble die, we're still OK.
Zookeeper is similar to a file system that provides a way to publish/subscribe messages on znodes.
Servers announce their addresses to zookeeper.
Point: zookeeper is not involved in sending every request.
Rest.li is an open source Java REST framework currently used at LinkedIn.
This is how it works:
The application sits on top of the rest.li layer.
Rest.li sits on top of R2D2.
D2 finds the services that rest.li creates, load balances traffic from clients to servers, and provides graceful degradation.
R2 handles the request/response interaction between servers and clients. R2 is asynchronous and is implemented using netty/jetty.
R2D2 is independent of rest.li: D2 can be used outside rest.li as a name resolver and load balancer.
There are 3 different constructs we use to store information for D2.
A D2 cluster is comprised of identical nodes: no first-class or second-class nodes, no master/slave, no ACLs. D2 is ideal for a trusted middle-tier layer (simple to understand).
Each D2 URI gets its own client abstraction for sending traffic to that server.
URI nodes are zookeeper ephemeral nodes; cluster and service nodes are zookeeper permanent nodes.
Point: cluster properties and service properties are rarely updated and almost static, which is why they are permanent zookeeper nodes.
Some restrictions:
ZKFSUtil sets the d2 config writer to write to /services, /clusters, and /uris.
The /d2 path is configurable.
Once the client subscribes, it keeps the global information in its internal storage, so after it receives the information it won't need to contact zookeeper again; zookeeper publishes updates to refresh the D2 client's internal state.
If the ClusterA server dies, zookeeper automatically removes the ephemeral node, so the ServiceA1 client knows it can't send requests to that server.
Imagine we have a client and a cluster that consists of 2 servers.
The client keeps track of call statistics to each server.
We update the statistics on a 5-second interval.
Talk about the initial state (min call count is there to reduce flapping).
We have 2 modes of operation: CALL_DROPPING and LOAD_BALANCING.
In CALL_DROPPING we change the cluster's drop rate; in LOAD_BALANCING we direct traffic to healthier machines.
Explain why there are 2 different modes: because there are 2 types of problem that can affect a service, cluster-wide vs. individual node.
CALL_DROPPING is for the cluster: a problem with downstream services.
LOAD_BALANCING is for an individual node: a problem with a particular server.
Why does call dropping affect the entire cluster? Because we don't want to double-penalize a bad client (reducing the number of points while also increasing the drop rate for a particular client).
From the request URI we compute which partition it belongs to.
We use a regex to extract the key.
Extra attribute: "D2-KeyMapper-TargetHost": a URI (but it must be part of the cluster).
"D2-Hint-TargetService" to override the URI.
Imagine we have 3 servers. This is how the client views the servers.
MD5-hash the URI. For each URI we create 100 points.
The number of points is based on the weight of the node times the number of points per weight (configurable).
The reason we use a consistent hash ring is that servers can join/leave the cluster at any time. With a consistent hash ring, we're guaranteed that only 1/n of the requests will be reshuffled if the cluster membership changes.
Redlining means performance testing a server so we know the maximum capacity the server can handle.
We can use real production traffic and not be afraid of mutable requests.
So we've talked about how R2D2 works: how we discover services and how we load balance traffic. For more information check http://rest.li.