Steve McGhee talks about how to build reliable things on top of unreliable things. Steve was a Google SRE for 10 years, then left to help move a company onto the cloud, and came back to Google to help more customers do the same.
Recording on YouTube: https://youtu.be/YnjsYzCwTQI
Check out presos here: https://gdg.community.dev/gdg-cloud-southlake/
8. Scale on the same reliable infrastructure Google uses
[World map of Google's network: current and future regions with their number of zones (typically 3 per region), >100 edge points of presence, >1000 edge node locations, and subsea cables including Unity (US, JP) 2010; Monet (US, BR) 2017; Tannat (BR, UY, AR) 2017; Junior (Rio, Santos) 2017; FASTER (US, JP, TW) 2016; PLCN (HK, LA) 2019; Indigo (SG, ID, AU) 2019; Curie (CL, US) 2019; Havfrue (US, IE, DK) 2019; SJC (JP, HK, SG) 2013; HK-G (HK, GU) 2019.]
9. The Network Matters
[Diagram contrasting traffic paths: with a typical cloud provider, the user crosses the public ISP network to reach the provider's cloud; with Google Cloud, the user enters at a nearby Google PoP and rides Google's private network to Google Cloud.]
10. Confidential & Proprietary
GCP - Architected for Resilience and Scale
Compute
Borg
Scalable job scheduler
Behind Google's 8+
Billion-user Products
Inspiration for Kubernetes
Storage
Colossus
Exabyte storage clusters
Next-Generation cluster
storage system
Networking
Andromeda
Global software-defined network
Highly-available, flat global
network
11. Confidential & Proprietary
GCP leadership in infrastructure innovation
Compute
Borg
10+ years of evolution
Cloud specific clusters,
Layers of failure domains,
Flexible, fast control
Live Migration running VMs
No more maintenance windows.
Security patches and hardware
changes without VM downtime.
Storage
Colossus
Every bit triple-redundant
Services using Colossus inherit
world-class replication and
encoding
Distributed metadata model
Allows for fast, independent
retrieval of "hot" or "cold" data
Networking
Andromeda
Fail static
In the case of programming failure
or control plane fault, last-known-
good network remains in place
12. Confidential & Proprietary
Zones & Regions are the basic building blocks of global compute infrastructure
Zone: a unit of deployment of computing and supporting infrastructure
Region: A collection of Zones, typically in a single or nearby metros. Expectation: Region is >= 3 Zones.
Networking connects resources within a zone, region, and across regions
cluster cluster cluster
zone zone zone
region region region region
global network
A Logical view
GCP building blocks - Regions, Zones
13. Confidential & Proprietary
GCP Service Topology
Zones, Regions, Multi-Region (visible)
● Campuses, Buildings (internal)
● Borg Clusters (internal)
● Racks, Machines,
Power/Cooling (internal)
Think of Services within a scope:
● Zonal Service generally @ 99.9%
● Regional Service generally @
99.99%
Survive disaster (eg: hurricanes,
floods) via multi-regional
deployments.
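A minimal sketch of the math behind those numbers, assuming zone failures are independent and a hypothetical service that stays up as long as any one of its zones is up (real services share regional dependencies, so treat this as an upper bound):

```python
# Sketch: composing zonal availability into multi-zone availability.
# Assumes independent zone failures and a service that survives while
# at least one zone is healthy; shared dependencies break this in practice.

def multi_zone_availability(zone_availability: float, zones: int) -> float:
    """P(at least one zone up) = 1 - P(all zones down)."""
    return 1.0 - (1.0 - zone_availability) ** zones

if __name__ == "__main__":
    a = 0.999  # a zonal service, "generally @ 99.9%"
    for n in (1, 2, 3):
        print(f"{n} zone(s): {multi_zone_availability(a, n):.9f}")
    # 1 zone:  0.999000000
    # 2 zones: 0.999999000
    # 3 zones: 0.999999999 (well past the 99.99% regional figure,
    # which is why shared regional pieces, not zone count,
    # usually set the real ceiling)
```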
18. Context: The Pyramids
Component-level reliability:
- solid base (big cold building, heavy iron, redundant disks/net/power)
- each component up as much as possible
- total availability as goal
- "scale up"
Scalable reliability:
- less-reliable, cost-effective base
- "warehouse scale" (many machines)
- software improves availability
- aggregate availability as goal
- "scale out"
19. This Bears Repeating
You can build more reliable things on top of less reliable things.
A simple example: RAID. See "The SRE I Aspire to Be", @aknin, SREcon EMEA 2019.
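A minimal sketch of the RAID example in numbers, assuming independent disk failures (correlated failures from a shared batch or shelf violate this): an array of five unreliable disks that tolerates one failure ends up more reliable than any single disk.

```python
# Sketch: more-reliable storage from less-reliable parts (the RAID idea).
# Assumes independent disk failures over some time window.
from math import comb

def survives(p_disk_fail: float, n_disks: int, tolerated: int) -> float:
    """P(array survives) = P(at most `tolerated` disks fail)."""
    return sum(
        comb(n_disks, k) * p_disk_fail**k * (1 - p_disk_fail)**(n_disks - k)
        for k in range(tolerated + 1)
    )

if __name__ == "__main__":
    p = 0.05  # illustrative chance one disk fails in the window
    print(f"single disk:       {1 - p:.4f}")              # 0.9500
    print(f"5 disks, 1 spare:  {survives(p, 5, 1):.4f}")  # ~0.9774
```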
27. Blast radius
"How many users were affected by this change?"
● everybody 💥
● just one region 🤭
● just logged-in users 😿
● anyone who was checking out during the time 🛍
● 1% of all users 🤓
● 0.001% of all users 😮
28. Time
"Area under the curve"
● MTTD | MTTR
● Detect, Mitigate, Prevent!
● Total outage → Partial outage → Degraded state → Recovered
Note: incident time might be different, due to post-incident "cleanup" or analysis.
29. So What?
We have three things we want to minimize:
● probability of the bad thing occurring
● blast radius, when it does happen
● time to get it fixed
Reduce any of these; ideally, all of them.
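A minimal sketch of why any of the three knobs works, with illustrative numbers: if expected impact is roughly the product of frequency, blast radius, and time-to-fix, then halving any single factor halves the total.

```python
# Sketch: expected impact as the product of the slide's three factors.
# All inputs are illustrative assumptions.

def expected_impact(incidents_per_year: float,
                    blast_radius: float,       # fraction of users affected
                    minutes_to_fix: float) -> float:
    return incidents_per_year * blast_radius * minutes_to_fix

baseline = expected_impact(12, 0.50, 60)            # 360.0
halve_probability = expected_impact(6, 0.50, 60)    # better prevention
halve_blast_radius = expected_impact(12, 0.25, 60)  # canaries, regional rollouts
halve_time_to_fix = expected_impact(12, 0.50, 30)   # faster detect + mitigate
print(baseline, halve_probability, halve_blast_radius, halve_time_to_fix)
# 360.0 180.0 180.0 180.0 (each lever is worth the same here)
```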
32. The Reliability (r9y) Journey
Cloud customers have a hard time knowing what reliability is, what they've done, and what they even want! We need to learn how best to help them.
● Start with a map of reliability capabilities
○ both known and unknown unknowns are presented, in context!
● Plot their current position with an orienteering survey
● Determine their destination with a compass
○ making a choice based on cost and business needs ("nines" of availability, latency, DR, geography)
● Help plan their journey with a guidebook
○ how to decide next steps (feedback loops)
○ how to implement that step
○ what to buy or adopt along the way
33. The Reliability Map (WIP)
Eras (nines):
● Demo (90%)
● Deterministic (99.0%)
● Reactive (99.9%)
● Proactive (99.99%)
● Autonomic (99.999%)
Streams / Personas:
● Development
● Infra
● Operations
● Observability
● People
34. Quick Hack: the Virtuous Cycle
First: SLOs / Error Budget
⇒ Incident Response
⇒ Blameless Postmortems + Postmortem review
⇒ Risk Analysis
⇒ Resilience Engineering Backlog and prioritization
⇒ Risk / Impact Reduction!
⇒ SLOs (adjust)
This then becomes your flywheel for deciding which capabilities to build next.
* Separate: reduce toil as needed
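A minimal sketch of the error-budget arithmetic that seeds the cycle, assuming a 99.9% availability SLO over a 30-day window: the budget is the allowed unreliability times the window, and what's left after incidents drives prioritization of the resilience backlog.

```python
# Sketch: error budget for an assumed 99.9% SLO over 30 days.

slo = 0.999
window_minutes = 30 * 24 * 60            # 43,200 minutes in the window

budget_minutes = (1 - slo) * window_minutes
print(f"budget: {budget_minutes:.1f} minutes/month")  # 43.2

downtime_so_far = 30.0                   # illustrative incident cost
remaining = budget_minutes - downtime_so_far
print(f"remaining: {remaining:.1f} minutes ({remaining / budget_minutes:.0%})")
# A nearly spent budget is the signal to pull resilience-engineering
# backlog items ahead of feature work; a full one says ship faster.
```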
35. Start with SLOs, unless you can't
In order to define and use SLOs (SLIs, error budgets, etc.), you need:
● accuracy
○ metrics that sufficiently represent the state of your system
○ using only blackbox/synthetic or "ping" checks is insufficient and not representative of user traffic
○ changing a system to export its internal state can be more useful, either via metrics or logs
● precision
○ you can't measure per-minute SLOs if you're only tracking "good days"
○ average latency ⇒ latency distribution over time
● breakdown per service
○ measuring only at "the front door" or cross-stack can often be misleading
○ this is just another form of precision: breaking down per service or per container
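A minimal sketch of the "average latency ⇒ distribution" point, with illustrative numbers and an assumed 300 ms threshold: the mean looks healthy while one in ten requests is awful, which is why SLIs count good events against a threshold instead of averaging.

```python
# Sketch: why a latency SLI beats average latency.
latencies_ms = [50] * 90 + [2000] * 10   # illustrative heavy-tailed sample

mean = sum(latencies_ms) / len(latencies_ms)
print(f"average latency: {mean:.0f} ms")            # 245 ms; looks fine

threshold_ms = 300                                  # assumed SLO threshold
good = sum(1 for l in latencies_ms if l <= threshold_ms)
print(f"SLI (<= {threshold_ms} ms): {good / len(latencies_ms):.1%}")
# 90.0%: one in ten users is having a terrible time, invisible to the mean.
```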
36. The Pyramids
Component-level reliability:
- solid base (big cold building, heavy iron, redundant disks/net/power)
- each component up as much as possible
- union of availability as goal
- "scale up"
Scalable reliability:
- less-reliable, cost-effective base
- "warehouse scale" (many machines)
- highly connected, API-driven
- software improves availability
- aggregate availability as goal
- "scale out"
37. Key Takeaway
We can build more-reliable things on top of less-reliable things.
This is counterintuitive!
Software lets us build systems that can cope with failure in ways hardware alone can't.
Apply this at many levels (app, system, team, org!) for great success.
38. Business Service Orientation
[Diagram: Business Services 1 through N each decompose into capabilities (A, B, D, F, ...), and each capability carries a limitation (X, Y, Z, W, ...); the same capability/limitation pairs recur across services.]
Identification of common limitations across Business Services surfaces the high-impact modernization tasks.
39. Modernization Adoption
[Chart: platform maturity grows over time as capabilities 1 through 4 mature. Service 1 (low-risk) adopts early and makes slow progress; service N (high-risk) doesn't adopt prematurely, gains confidence in the capabilities, then adopts late with fast, safe progress.]