Invited Keynote at the 9th International Workshop on Software Engineering for Resilient Systems, September 4-5, 2017, Geneva, Switzerland
Title: Cloud Reliability: Decreasing outage frequency using fault injection
Abstract: In 2016, Google Cloud had 74 minutes of total downtime, Microsoft Azure had 270 minutes, and Amazon Web Services had 108 minutes (see cloudharmony.com). Reliability is one of the most important properties of a successful cloud platform. Several approaches can be explored to increase reliability, ranging from automated replication, to live migration, to formal system analysis. Another interesting approach is to use software fault injection to test a platform during prototyping, implementation, and operation. Fault injection was popularized by Netflix and its Chaos Monkey fault-injection tool for testing cloud applications. The main idea behind this technique is to inject failures in a controlled manner to verify the ability of a system to survive failures during operation. This talk will explain how fault injection can also be applied to detect vulnerabilities of the OpenStack cloud platform, and how to effectively and efficiently detect the damage caused by the injected faults.
This document discusses increasing cloud resilience through fault injection testing. It describes using the Butterfly Effect system to intentionally inject failures like hardware failures, network issues, and software bugs into OpenStack deployments. This allows failures to be tested, repairs to be learned, and future failures to be predicted. Monitoring tools like Monasca and Zabbix are used to detect damages and visualize results. The goal is to automate repair and increase reliability through continuous testing and learning from failures.
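To make the inject-observe-decide loop concrete, here is a minimal sketch in Python, assuming a hypothetical OpenStack test deployment. The host name, service unit, and health endpoint are illustrative placeholders; this is not the Butterfly Effect tool itself, only the shape of the experiment it automates.

```python
# Minimal fault-injection sketch (illustrative; not the Butterfly Effect tool).
# Assumes a test OpenStack controller reachable over SSH and an API endpoint
# usable as a health probe; host, service, and URL are hypothetical.
import subprocess
import time
import requests

TARGET_HOST = "controller-1.example.org"              # hypothetical test node
SERVICE = "openstack-nova-api"                        # service chosen for the experiment
HEALTH_URL = "http://controller-1.example.org:8774/"  # illustrative endpoint

def inject_service_crash():
    """Inject a fault: stop one control-plane service on the target host."""
    subprocess.run(["ssh", TARGET_HOST, "sudo", "systemctl", "stop", SERVICE], check=True)

def platform_healthy(timeout=5):
    """Damage detection: can the API still answer within a deadline?"""
    try:
        return requests.get(HEALTH_URL, timeout=timeout).status_code < 500
    except requests.RequestException:
        return False

def run_experiment(observation_window=120):
    inject_service_crash()
    deadline = time.time() + observation_window
    while time.time() < deadline:
        if platform_healthy():
            return "survived"   # redundancy or self-healing masked the fault
        time.sleep(5)
    return "outage"             # vulnerability found: record it and repair

if __name__ == "__main__":
    print(run_experiment())
```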
Cloud Operations and Analytics: Improving Distributed Systems Reliability using Fault Injection – Jorge Cardoso
Lecture given at the Technical University of Munich, 12 December 2016, on Cloud Operations and Analytics: Improving Distributed Systems Reliability using Fault Injection.
Gartner analyzed data centers over a period of 10 years and found that 47% of all problems were caused by cloud service outages. The duration of outages ranged between 40 minutes and five days. The Ponemon Institute studied the financial impact and found that, on average, outages cost US$690,204, with an average downtime cost of US$6,828 per minute. These results matter because of the economic impact of unplanned outages on cloud operations, which calls for higher platform reliability.
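As a sanity check on the Ponemon figures, dividing the average cost per outage by the per-minute cost implies an average outage of roughly 100 minutes:

```python
# Cross-checking the Ponemon figures: the average outage cost and the
# per-minute cost imply an average outage length of about 101 minutes.
avg_outage_cost = 690_204    # US$ per outage
cost_per_minute = 6_828      # US$ per minute of downtime
print(avg_outage_cost / cost_per_minute)   # ≈ 101.1 minutes
```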
The first part of this talk will present the mechanisms that pioneers, such as Amazon, Google, and Netflix, have already developed to increase the reliability of their cloud platforms. The second part of the talk will describe how Huawei Research is exploring the use of fault-injection mechanisms to effectively increase the reliability of the Open Telekom Cloud platform from Deutsche Telekom.
In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms can no longer be done manually or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally exploiting AI to provide tools for autonomous cloud operations. This talk will explain how deep learning, distributed traces, and time-series analysis (sequence analysis) can be used to effectively detect anomalous cloud infrastructure behaviors during operations and reduce the workload of human operators. The iForesight system is being used to evaluate this new O&M approach. iForesight 2.0 is the result of two years of research aimed at providing an intelligent new tool for SRE cloud maintenance teams, enabling them, with the help of artificial intelligence, to quickly detect and predict anomalies when cloud services are slow or unresponsive.
Distributed Trace & Log Analysis using ML – Jorge Cardoso
The field of AIOps, also known as Artificial Intelligence for IT Operations, uses advanced technologies to dramatically improve the monitoring, operation, and troubleshooting of distributed systems. Its main premise is that operations can be automated using monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how AIOps – and many related fields such as deep learning, machine learning, distributed traces, graph analysis, time-series analysis, sequence analysis, advanced statistics, NLP, and log analysis – can be used to effectively detect, localize, predict, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs) by analyzing service management data (e.g., distributed traces, logs, events, alerts, metrics). In particular, this talk will describe how a particular monitoring data structure, called a distributed trace, can be analyzed using deep learning to identify anomalies in its spans. This capability empowers operators to quickly identify which components of a distributed system are faulty.
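The talk applies deep learning to spans; as a simplified stand-in, the sketch below uses a plain per-operation statistical baseline (a z-score on span durations) to show mechanically what "anomalous span" means. The span representation and the threshold are illustrative assumptions, not the model from the talk.

```python
# Simplified span anomaly detection: a statistical baseline standing in for
# the deep learning approach described in the talk. A span here is just
# (operation_name, duration_ms); real traces carry much more context.
from collections import defaultdict
from statistics import mean, stdev

def fit_baseline(history):
    """Learn mean/stddev of duration per operation from past healthy traces."""
    by_op = defaultdict(list)
    for op, duration in history:
        by_op[op].append(duration)
    return {op: (mean(d), stdev(d)) for op, d in by_op.items() if len(d) > 1}

def anomalous_spans(trace, baseline, threshold=3.0):
    """Flag spans whose duration deviates more than `threshold` sigmas."""
    flagged = []
    for op, duration in trace:
        mu, sigma = baseline.get(op, (None, None))
        if mu is not None and sigma > 0 and abs(duration - mu) / sigma > threshold:
            flagged.append((op, duration))
    return flagged

history = [("auth", 12), ("auth", 11), ("auth", 13),
           ("db.query", 40), ("db.query", 44), ("db.query", 38)]
baseline = fit_baseline(history)
print(anomalous_spans([("auth", 12), ("db.query", 400)], baseline))  # db.query flagged
```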
Online Memory Leak Detection in the Cloud-based Infrastructures – Anshul Jindal
A memory leak in an application deployed on the cloud can affect the availability and reliability of the application. Therefore, identifying and ultimately resolving it quickly is highly important. However, in a production environment running in the cloud, memory leak detection is challenging without knowledge of the application or its internal object allocation details. This paper addresses the challenge of online detection of memory leaks in cloud-based infrastructure, without any internal application knowledge, by introducing a novel machine-learning-based algorithm, Precog. The algorithm uses only one metric, the memory utilization of the system on which the application is deployed, to detect a memory leak. The algorithm's accuracy was tested on manually labeled memory utilization data from 60 virtual machines provided by our industry partner, Huawei Munich Research Center; the proposed algorithm achieves an accuracy score of 85% with less than half a second of prediction time per virtual machine.
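Precog's internals are not reproduced in this summary; the sketch below only shows the general shape of metric-only leak detection, flagging a sustained upward trend in a VM's memory utilization series. The slope threshold is an illustrative assumption.

```python
# Metric-only leak heuristic (a sketch in the spirit of, but not identical to,
# Precog): flag a VM whose memory utilization shows a sustained upward trend.
def slope(values):
    """Least-squares slope of a series sampled at unit intervals."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

def looks_like_leak(mem_utilization, min_slope=0.1):
    """True if memory utilization (%) climbs steadily, e.g. >0.1 pp per sample."""
    return slope(mem_utilization) > min_slope

print(looks_like_leak([40, 42, 45, 47, 50, 53]))   # True: steady climb
print(looks_like_leak([50, 48, 51, 49, 50, 48]))   # False: stable, no trend
```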
The emergence of cloud computing and software testing – RIA RUI Society
This document discusses the emergence of cloud computing and its impact on software testing. It provides background on related topics such as grid computing, virtualization, and cloud infrastructure testing. It also notes, based on a survey, that around 50% of organizations use cloud computing experimentally or for prototyping, nearly 30% use it for non-critical applications, and about 22% use it for business-critical applications. In summary, it explores the history of cloud computing and how the rise of the cloud has changed traditional software testing practices.
What the cloud has to do with a burning house? – Nane Kratzke
Cloud-native applications can create enormous business growth and value in a very short amount of time. Take Instagram as one example: it took only two years to reach a net asset value of 1 billion USD. However, cloud-native applications are often characterized by a highly implicit technological dependency on their hosting cloud infrastructures. What happens if you are forced to leave your cloud service provider? What happens if your cloud is burning? The project Cloud TRANSIT investigates how to design cloud-native applications and services to reduce technological dependencies on underlying cloud infrastructures.
Cloud computing is a better platform for business continuity and disaster recovery for several reasons:
1) It provides flexibility and scalability, allowing organizations to easily adjust resources in response to disasters or failures.
2) Using cloud computing reduces costs compared to maintaining physical infrastructure, as capital expenditures turn into operational expenditures.
3) Cloud computing enables remote access and monitoring of systems, allowing disaster recovery plans to be managed from any location.
This document discusses how automation can help organizations manage constant infrastructure changes in a cloud DevOps world. It provides examples of how to design infrastructure for constant changes like new environments, configuration updates, failures, and code deployments through automation using techniques like configuration management, auto scaling, immutable infrastructure, and security scanning. Automation improves security by reducing human errors and ensuring consistent patching and configuration across all environments.
The new stack isn’t a stack: Fragmentation and terraforming the service layer – Donnie Berkholz
Open source, cloud, and the API revolution have already changed the way we build software. What's next? Donnie's spent the past 5 years trying to figure that out through observation and research at RedMonk and now at 451 Research. In this talk, he'll share what he's seen and what he predicts for the future of how we develop applications. You'll hear buzzwords like DevOps and microservices used in ways that actually make sense (for a change), see real-world examples of companies that have succeeded and failed, and learn how approaches like the one taken by HashiCorp's Terraform (by the authors of Vagrant) will be critical to the future of how we build software.
The document discusses incident response in cloud-based environments, with a focus on Amazon Web Services (AWS). It outlines some challenges of incident response in the cloud, such as shared responsibilities and lack of direct access to hardware. It then provides guidelines for effective cloud-based incident response, such as establishing joint response plans with cloud providers and evaluating monitoring and forensic tools. Finally, it examines how these guidelines can be implemented within AWS using services like CloudWatch, CloudTrail, and tools developed by AWS and third parties.
Faster deep learning solutions from training to inference - Amitai Armon & Ni... – Codemotion Tel Aviv
The document discusses Intel's Deep Learning SDK which aims to democratize deep learning by making it easily accessible and deployable. The SDK provides plug and train functionality through an easy installer, maximizes performance through optimized frameworks, and includes productivity tools. It allows for distributed multi-node training and deploys trained models across Intel hardware and systems through optimizations, compression, and quantization to accelerate deployment. The SDK addresses challenges in deep learning like limited labeled data, reinforcement learning tasks, and deployment at the edge with lower memory models.
As the complexity of your AWS environment grows, automating security is a crucial step in protecting your data from malicious attacks and unintentional vulnerabilities. Automation and security practices must work hand-in-hand in order to effectively protect your environment. In this session, you will learn how to leverage AWS tools and best-in-class 3rd party services to automate access control, security configuration, and monitoring in order to improve your overall security posture. Using real-life examples, you will come away with an understanding of how to secure your deployments while minimizing the work it takes to keep them secure – all while simplifying your compliance audit process. Topics include:
· Controlling access to your environment with automation tools
· Maintaining security during high velocity deployment cycles
· Protecting data from malicious attacks with automated security controls
· Monitoring and measuring configuration, access, and policy changes (one automated check is sketched below)
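As one concrete instance of the monitoring topic above, here is a small scan for security groups that expose ports to the whole internet. It uses the real boto3 EC2 API; the region is an illustrative choice, and credentials are assumed to be configured.

```python
# Automated security-configuration check: list EC2 security group rules that
# are open to the entire internet (0.0.0.0/0). Region is illustrative.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

def world_open_rules():
    findings = []
    for sg in ec2.describe_security_groups()["SecurityGroups"]:
        for perm in sg.get("IpPermissions", []):
            for ip_range in perm.get("IpRanges", []):
                if ip_range.get("CidrIp") == "0.0.0.0/0":
                    findings.append((sg["GroupId"], perm.get("FromPort"), perm.get("ToPort")))
    return findings

for group_id, from_port, to_port in world_open_rules():
    print(f"{group_id}: ports {from_port}-{to_port} open to the world")
```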
This presentation covers two use cases using OpenPOWER Systems:
1. Diabetic Retinopathy using AI on NVIDIA Jetson Nano: The objective is to classify the diabetic level solely from a retina image, in a remote area, with minimal involvement from doctors. The model uses the VGG16 network architecture and is trained from scratch on POWER9. The model was deployed on the Jetson Nano board.
2. Classifying Covid positivity using lung X-ray images: The idea is to build ML models to detect positive cases from X-ray images. The model was trained on POWER9, and the application was developed using Python.
IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit... – IRJET Journal
This document discusses using multiple clouds for securely storing and accessing data through an identity-based key. It proposes a protocol called E-PDP that uses identity-based cryptography and Diffie-Hellman key exchange to verify data integrity across clouds while allowing both public and private access. E-PDP checks client credentials before updating data to prevent spam and supports different levels of private and public verification depending on client authorization.
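The E-PDP protocol itself is not reproduced here; the sketch below shows only the Diffie-Hellman key-exchange building block the abstract mentions, using the X25519 primitives from the Python `cryptography` package.

```python
# The Diffie-Hellman building block mentioned in the abstract, using X25519
# from the `cryptography` package. Only the key agreement step is shown,
# not the E-PDP protocol.
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey

client_priv = X25519PrivateKey.generate()
cloud_priv = X25519PrivateKey.generate()

# Each side combines its own private key with the other's public key ...
client_shared = client_priv.exchange(cloud_priv.public_key())
cloud_shared = cloud_priv.exchange(client_priv.public_key())

# ... and both arrive at the same shared secret without ever sending it.
assert client_shared == cloud_shared
```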
Talk presented by Pedro Mário Cruz e Silva, Solution Architect at NVIDIA, as part of the program of the VIII Semana de Inverno de Geofísica (8th Geophysics Winter Week), on 19 July 2017.
App Performance Tip: Sharing Flash Across Virtualized Workloads – DataCore Software
Core business applications like Oracle, SAP, SQL Server, Exchange and SharePoint often perform poorly when virtualized. More often than not, the root cause is a data I/O bottleneck.
In this presentation, DataCore Software and Fusion-io highlight how to:
• Integrate flash memory to overcome I/O bottlenecks in real-world environments
• Combine flash technology with existing storage
• Speed up virtualized applications
• Prevent storage from slowing down or taking down your applications
Modern applications are complex, built from microservices and containers, which makes observability challenging. Observability refers to understanding an application's state through logs, events, and other data; it is more comprehensive than monitoring alone. To achieve effective observability, organizations should use error tracking, distributed tracing, APM, infrastructure monitoring, log aggregation, and incident management tools. They should also implement development best practices like shift-left processes and continuous communication between teams. The goal of observability is to deliver the best user experience and maximize business value.
This document discusses how Splunk can help organizations address challenges related to escalating IT complexity. It notes that IT environments have become more complex with disconnected point solutions, over 70% of time spent maintaining rather than innovating, and latency in resolving issues measured in hours or days. Splunk provides a single platform to gather, analyze, and search machine data from various sources in real-time. It allows correlating data across silos for faster problem resolution. The document highlights how Splunk reduced escalations by 90% and mean time to resolution by 67% for one customer. It then discusses how Splunk offers pre-built apps for monitoring different parts of the IT infrastructure and applications.
Splunk Enterprise for Information Security Hands-On – Splunk
Splunk is the ultimate tool for the InfoSec hunter. In this unique session, we’ll dive straight into the Splunk search interface, and interact with wire data harvested from various interesting and hostile environments, as well as some web access logs. We’ll show how you can use Splunk Enterprise with a few free Splunk applications to hunt for attack patterns. We’ll also demonstrate some ways to add context to your data in order to reduce false positives and more quickly respond to information. Bring your laptop – you’ll need a web browser to access our demo systems!
This document summarizes experiences implementing workload management goals for CICS and IMS transactions as well as DB2 stored procedures. It describes converting two critical CICS systems to WLM transaction management to meet an SLA of 98% of transactions completing within 2 seconds. This improved the response time distribution. IMS regions were later converted to WLM transaction management as well, which provided more consistent response times during resource shortages. Managing DB2 stored procedures with WLM initially caused problems due to dependent and independent enclaves performing the same tasks, which was later addressed.
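The 98%-within-2-seconds SLA mentioned above reduces to a simple percentile check over measured response times. A minimal sketch, with illustrative sample data:

```python
# Checking a CICS-style SLA: did 98% of transactions complete within 2 s?
def sla_met(response_times_s, target_s=2.0, required_fraction=0.98):
    within = sum(1 for t in response_times_s if t <= target_s)
    return within / len(response_times_s) >= required_fraction

samples = [0.4, 0.9, 1.1, 1.8, 1.9, 2.5] + [0.5] * 94   # 100 transactions, 1 slow
print(sla_met(samples))   # True: 99/100 = 99% within 2 s
```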
CodeValue Architecture Next 2018 - Executive track dilemmas and solutions in... – Erez PEDRO
Modern software projects are challenging to develop. Eran Stiller, Ronen Rubinfeld, and Erez Pedro from CodeValue show a method for conducting multidisciplinary product discovery.
Adapting to a Cambrian AI/SW/HW explosion with open co-design competitions an... – Grigori Fursin
Slides from ARM's Research Summit'17 about "Community-Driven and Knowledge-Guided Optimization of AI Applications Across the Whole SW/HW Stack" (http://cKnowledge.org/repo , http://cKnowledge.org/ai , http://tinyurl.com/zlbxvmw , https://developer.arm.com/research/summit )
Co-designing the whole AI/SW/HW stack in terms of speed, accuracy, energy consumption, size, costs and other metrics has become extremely complex, long and costly. With no rigorous methodology for analyzing performance and accumulating optimisation knowledge, we are simply destined to drown in the ever growing number of design choices, system features and conflicting optimisation goals.
We present our novel community-driven approach to solve the above problems. Originating from natural sciences, this approach is embodied in Collective Knowledge (CK), our open-source cross-platform workflow framework and repository for automatic, collaborative and reproducible experimentation. CK helps organize, unify and share representative workloads, data sets, AI frameworks, libraries, compilers, scripts, models and other artifacts as customizable and reusable components with a common JSON API.
CK helps bring academia, industry and end-users together to gradually expose optimisation choices at all levels (e.g. from parameterized models and algorithmic skeletons to compiler flags and hardware configurations) and autotune them across diverse inputs and platforms. Optimization knowledge gets continuously aggregated in public or private repositories such as cKnowledge.org/repo in a reproducible way, and can then be mined and extrapolated to predict better AI algorithm choices, compiler transformations and hardware designs.
We also demonstrate how we use this approach in practice together with ARM and other companies to adapt to a Cambrian AI/SW/HW explosion by creating an open repository of reusable AI artifacts, and then collaboratively optimising and co-designing the whole deep learning stack (software, hardware and models).
A Novel Method of Directly Auditing Integrity On Encrypted Data – IRJET Journal
The document proposes two systems, SecCloud and SecCloud+, to achieve both data integrity and deduplication on cloud data. SecCloud introduces an auditing entity that helps clients generate data tags before uploading and audits the integrity of data stored in the cloud, enabling secure deduplication. SecCloud+ allows integrity auditing and deduplication directly on encrypted data by involving a key server that assigns encryption keys based on file content. Both systems use the same three protocols for file uploading, integrity auditing, and proof of ownership, with SecCloud+ adding communication between the client and the key server to encrypt files before uploading. The systems aim to reduce user computation during uploading and auditing while allowing integrity checks and deduplication on encrypted data in the cloud.
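SecCloud+'s key-server protocol is beyond this summary, but the classic building block that makes deduplication of encrypted data possible is convergent encryption: derive the key from the file content itself, so identical plaintexts encrypt to identical ciphertexts. A minimal sketch (the deterministic nonce is for the demo only; schemes like SecCloud+ add a key server precisely to harden this construction):

```python
# Convergent encryption: content-derived keys make ciphertexts of identical
# files identical, so the server can deduplicate without reading the content.
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def convergent_encrypt(plaintext: bytes) -> bytes:
    key = hashlib.sha256(plaintext).digest()                     # key from content
    nonce = hashlib.sha256(b"nonce" + plaintext).digest()[:12]   # deterministic (demo only)
    return AESGCM(key).encrypt(nonce, plaintext, None)

a = convergent_encrypt(b"same file contents")
b = convergent_encrypt(b"same file contents")
assert a == b   # equal ciphertexts -> the cloud can deduplicate
```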
If you’re just getting started with Splunk, this session will help you understand how to use Splunk software to turn your silos of data into insights that are actionable. In this session, we’ll dive right into a Splunk environment and show you how to use the simple Splunk search interface to quickly find the needle-in-the-haystack or multiple needles in multiple haystacks. We’ll demonstrate how to perform rapid ad-hoc searches to conduct routine investigations across your entire IT infrastructure in one place, whether physical, virtual or in the cloud. We’ll show you how to then convert these searches into real time alerts and dashboards, so you can proactively monitor for problems before they impact your end user. We’ll demonstrate how you can use Splunk to connect the dots across heterogeneous systems in your environment for cross-tier, cross-silo visibility. You’ll have access to a demo environment. So, don’t forget to bring your laptop and follow along for a hands-on experience.
This document discusses the need for continuous delivery in software development. It defines continuous delivery as making sure software can be reliably released at any time. The document outlines some key aspects of continuous delivery including automated testing, infrastructure as code, continuous integration, and blue/green deployments. It provides an example of implementing continuous delivery for a large retail company using tools like Jenkins, Puppet, Logstash and practices like infrastructure as code and automated testing.
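As a sketch of the blue/green mechanism mentioned above (environment names, URLs, and the health check are placeholders, not any specific tool's API): deploy to the idle environment, verify it, then flip traffic in one step.

```python
# Minimal blue/green switch sketch: two environments, one live at a time.
ENVIRONMENTS = {"blue": "http://blue.internal:8080", "green": "http://green.internal:8080"}
live = "blue"

def deploy_and_switch(new_version, deploy, healthy):
    """deploy(env, version) installs; healthy(url) -> bool verifies the idle env."""
    global live
    idle = "green" if live == "blue" else "blue"
    deploy(idle, new_version)
    if healthy(ENVIRONMENTS[idle]):
        live = idle    # atomic flip: traffic now goes to the freshly verified env
    else:
        raise RuntimeError("new version unhealthy; traffic stays on " + live)
```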
Between spending hours (or days!) making sure you can code and test locally and the difficulties of keeping remote environments up to date, sometimes we find ourselves falling back on "It works on my machine!". Getting rid of the difficulties in making new development environments and maintaining testing infrastructure is really key to banishing the dreaded phrase. In this session, we'll take you through some of the recent tools and techs that will not only make your life easier but will mean you never have to say "works on my machine" ever again.
Prometheus for Monitoring Metrics (Fermilab 2018) – Brian Brazil
From its humble beginnings in 2012, the Prometheus monitoring system has grown a substantial community with a comprehensive set of integrations. This talk will give an overview of the core ideas behind Prometheus, its feature set, and how it has grown to meet the challenges of modern cloud-based systems.
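For readers new to Prometheus, the core idea is the pull model: an application exposes metrics over HTTP and Prometheus scrapes them on its own schedule. A minimal sketch using the official Python client; the port and metric name are illustrative.

```python
# Expose a metric for Prometheus to scrape, using the official Python client.
import random
import time
from prometheus_client import Gauge, start_http_server

queue_depth = Gauge("demo_queue_depth", "Jobs waiting in the demo queue")

if __name__ == "__main__":
    start_http_server(8000)    # metrics served at http://localhost:8000/metrics
    while True:
        queue_depth.set(random.randint(0, 50))   # stand-in for a real measurement
        time.sleep(5)
```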
Frank Cohen - Are We Ready For Cloud Testing - EuroSTAR 2010 – TEST Huddle
EuroSTAR Software Testing Conference 2010 presentation on Are We Ready For Cloud Testing by Frank Cohen. See more at: http://conference.eurostarsoftwaretesting.com/past-presentations/
What is Cloud Testing Everything you need to know.pdf – pcloudy2
No product owner wants to leave a bug unresolved in a live web app. Bugs slip through when there is a lack of coordination and poor communication between the development and testing teams, which can result in a blunder for the organization. To solve this problem, the organization should look for a locally hosted web app that supports integration with commonly favored CI/CD tools and helps build a strong delivery pipeline. Relying on trusted third-party cloud-based testing tools simplifies tracking bugs, prioritizing tests, and managing projects, helping ensure bug-free apps.
The document discusses Effektives Consulting's performance engineering portfolio, which includes user experience and web performance management, cloud-based commerce recommendations, zero-touch deployments, and emerging augmented reality applications. It focuses on web performance management, covering infrastructure capacity planning, a two-stage performance testing approach using both on-premise and cloud-based resources, application profiling, and reporting.
Unit testing focuses on testing individual software modules to uncover errors. Integration testing exercises the interfaces between modules incrementally to isolate errors. The objectives of testing are to find errors, to use test cases with a high probability of uncovering them, and to ensure that specifications are met. Testing matters for correctness, efficiency, and complexity. Test oracles verify expected outputs, increasing the efficiency of automated testing and reducing costs, though complete automation has its challenges.
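A test oracle in miniature, as an illustration: the oracle supplies independently computed expected outputs so the check can run unattended, which is what makes automation pay off. The function under test is a made-up example.

```python
# A unit test whose oracle is a hand-computed expected output.
import unittest

def moving_average(xs, window):
    return [sum(xs[i:i + window]) / window for i in range(len(xs) - window + 1)]

class MovingAverageTest(unittest.TestCase):
    def test_known_answer(self):
        # Oracle: expected values computed independently by hand.
        self.assertEqual(moving_average([1, 2, 3, 4], 2), [1.5, 2.5, 3.5])

if __name__ == "__main__":
    unittest.main()
```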
Harnessing the Cloud for Performance Testing - Impetus White Paper – Impetus Technologies
For Impetus’ White Papers archive, visit- http://www.impetus.com/whitepaper
The paper provides insights on the various benefits of using the Cloud for Performance Testing as well as how to address the various challenges associated with this approach.
IRJET- An Efficient Hardware-Oriented Runtime Approach for Stack-Based Softwa... – IRJET Journal
This document discusses an efficient hardware-oriented runtime approach for detecting stack-based buffer overflow attacks during program execution. The approach automatically archives and compares the original and modified information of static variables in the program to detect any changes from the compiler-generated object code. This is done transparently to programmers without requiring any source code modifications. By leveraging the hardware of the CPU pipeline, the approach can identify buffer overflows during runtime to prevent security vulnerabilities from being exploited. The approach aims to provide protections against runtime attacks while having low performance and memory overhead.
Surekha Kadi has over 7 years of experience in software testing and automation. She has expertise in agile methodologies, scripting languages like Perl and Python, and testing domains including mobile applications, virtualization, cloud computing, and big data using tools like Hadoop, Informatica, Hive, and Vector. She has led testing projects for various clients, developing frameworks, automating test cases, and ensuring quality standards like CMMI level 3.
DevOps is a methodology capturing the practices adopted from the very start by the web giants, who had both a unique opportunity and a strong requirement to invent new ways of working due to the very nature of their business: the need to evolve their systems at an unprecedented pace and to extend them, and their business, sometimes on a daily basis.
While DevOps makes obviously a critical sense for startups, I believe that the big corporations with large and old-fashioned IT departments are actually the ones that can benefit the most from adopting these principles and practices.
Malware Detection in Cloud Computing Infrastructures
A short presentation explaining the overall design and operation of malware detection in cloud computing infrastructures, including the detection criteria.
Shweta Bijay has over 15 years of experience in software testing and development. She has worked on projects at F5 Networks, Microsoft, and Reliance Energy Ltd. Some of her roles included performing functional testing, writing test automation, designing and developing testing tools, and measuring impact of technologies like harmonics. She has expertise in languages like Python, C#, and tools like Perforce, Eclipse, SQL, and Agile methodologies like Scrum.
The document discusses unknown vulnerability management (UVM) which involves detecting vulnerabilities, including zero-days, building defenses, and deploying patches. The UVM process includes attack surface analysis through fuzz testing software, reporting issues found, and mitigating risks through patch verification and IDS rule development. Key challenges are communicating issues without leaks, reproducing bugs easily, and ensuring patches do not introduce new issues.
Operations: Production Readiness Review – How to stop bad things from Happening – Amazon Web Services
The document provides an overview of key areas to review for production readiness including architecture design, monitoring, logging, documentation, alerting, service level agreements, expected throughput, testing, and deployment strategy. It summarizes best practices and considerations for each area such as using circuit breakers in monitoring, consistent logging formats, storing documentation near code, automating level 1 operations, and strategies for testing, deployments, and managing error budgets.
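Error budgets, mentioned in the deployment discussion above, are straightforward arithmetic: the availability target fixes a downtime allowance per window. A worked example:

```python
# Error-budget arithmetic: an availability target translates directly into
# a downtime allowance per 30-day window.
def error_budget_minutes(availability_target, window_days=30):
    return (1 - availability_target) * window_days * 24 * 60

print(error_budget_minutes(0.999))    # 43.2 minutes per 30 days
print(error_budget_minutes(0.9999))   # 4.32 minutes per 30 days
```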
Finding Bugs, Fixing Bugs, Preventing Bugs — Exploiting Automated Tests to In... – University of Antwerp
With the rise of agile development, software teams all over the world embrace faster release cycles as *the* way to incorporate customer feedback into product development processes. Yet, faster release cycles imply rethinking the traditional notion of software quality: agile teams must balance reliability (minimize known defects) against agility (maximize ease of change). This talk will explore the state-of-the-art in software test automation and the opportunities this may present for maintaining this balance. We will address questions like: Will our test suite detect critical defects early? If not, how can we improve our test suite? Where should we fix a defect?
(Keynote for the SHIFT 2020 and IWSF 2020 Workshops, October 2020)
This document contains the resume of MNREDDY summarizing their work experience as a software tester. They have over 4 years of experience in manual testing and 1.6 years in automation testing. They have experience using tools such as Selenium, QTP, QC, JMeter, LoadRunner and BugTracker. They have worked on projects in various domains including insurance, finance, healthcare and products.
Similar to Cloud Reliability: Decreasing outage frequency using fault injection
On the Application of AI for Failure Management: Problems, Solutions and Algo... – Jorge Cardoso
Artificial Intelligence for IT Operations (AIOps) is a class of software which targets the automation of operational tasks through machine learning technologies. ML algorithms are typically used to support tasks such as anomaly detection, root-cause analysis, failure prevention, failure prediction, and system remediation. AIOps is gaining increasing interest from industry due to the exponential growth of IT operations and the complexity of new technology. Modern applications are assembled from hundreds of dependent microservices distributed across many cloud platforms, leading to extremely complex software systems. Studies show that cloud environments are now too complex to be managed solely by humans. This talk discusses various AIOps problems we have addressed over the years and sketches the solutions and algorithms we have implemented. Interesting problems include hypervisor anomaly detection, root-cause analysis of software service failures using application logs, multi-modal anomaly detection, root-cause analysis using distributed traces, and verification of virtual private cloud networks.
AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning – Jorge Cardoso
The field of AIOps, also known as Artificial Intelligence for IT Operations, uses algorithms and machine learning to dramatically improve the monitoring, operation, and maintenance of distributed systems. Its main premise is that operations can be automated using monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how AIOps – and many related fields such as deep learning, machine learning, distributed traces, graph analysis, time-series analysis, sequence analysis, and log analysis – can be used to effectively detect, localize, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs). In particular, this lecture will describe how a particular monitoring data structure, called a distributed trace, can be analyzed using deep learning to identify anomalies in its spans. This capability empowers operators to quickly identify which components of a distributed system are faulty.
AIOps: Anomalies Detection of Distributed Traces – Jorge Cardoso
Introduction to the field of AIOps, large-scale monitoring, and observability. Provides an example illustrating how deep learning can be used to analyze distributed traces to reveal exactly which component is causing a problem in microservice applications.
Presentation given at the National University of Ireland, Galway (NUI Galway) on 2019.08.20. Thanks to Prof. John Breslin.
Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Twitter: @CloudRR2016
For more than 10 years, research on service descriptions has mainly studied software-based services and provided languages such as WSDL, OWL-S, and WSMO for SOAP, and hREST for REST. Nonetheless, recent developments in service management (e.g., ITIL and COBIT) and cloud computing (e.g., Software-as-a-Service) have brought new requirements to service description languages: the need to also model business services and account for the multi-faceted nature of services. Business orientation, co-creation, pricing, legal aspects, and security issues are all elements which must also be part of service descriptions. While ontologies such as e³service and e³value provided a first modeling attempt to capture a business perspective, concerns on how to contract services and the agreements entailed by a contract also need to be taken into account. This has for the most part been disregarded by the e³ family of ontologies. In this paper, we review the evolution and provide an overview of Linked USDL, a comprehensive language which provides a (multi-faceted) description to enable the commercialization of (business and technical) services over the web.
Ten years of service research from a computer science perspective – Jorge Cardoso
…It has been more than 10 years since a strong research stream on services started from the field of computer science. The main trigger was without a doubt the introduction of the Web Service Description Language (WSDL), a specification to represent a piece of software functionality which could be remotely invoked. Nonetheless, this was only the “tipping point”. The generalized interest in this new development was followed by interesting research topics on the application of semantics to enhance the description of services, the composition of services into processes, the analysis of the quality of services, the complexity of processes supporting services, and the development of comprehensive service description languages. This seminar will provide an overview of the main research topics around services and will glimpse at a new research field on the analysis of service networks...
Cloud Computing Automation: Integrating USDL and TOSCA – Jorge Cardoso
-- Presented at CAiSE 2013, Valencia, Spain --
Standardization efforts to simplify the management of cloud applications are being conducted in isolation. The objective of this paper is to investigate to what extent two promising specifications, USDL and TOSCA, can be integrated to automate the lifecycle of cloud applications. In our approach, we selected a commercial SaaS CRM platform, modeled it using the service description language USDL, modeled its cloud deployment using TOSCA, and constructed a prototypical platform to integrate service selection with deployment. Our evaluation indicates that a high level of integration is possible. We were able to fully automate the remote deployment of a cloud service after it was selected by a customer in a marketplace. Architectural decisions emerged during the construction of the platform and were related to global service identification and access, multi-layer routing, and dynamic binding.
Understanding how services operate as part of large scale global networks, the related risks and gains of different network structures and their dynamics is becoming increasingly critical for society. Our vision and research agenda focuses on the particularly challenging task of building, analyzing, and reasoning about global service networks. This paper explains how Service Network Analysis (SNA) can be used to study and optimize the provisioning of complex services modeled as Open Semantic Service Networks (OSSN), a computer-understandable digital structure which represents connected and dependent services.
Open Semantic Service Networks: Modeling and Analysis – Jorge Cardoso
A new and interesting research area is the representation and analysis of the networked economy using Open Semantic Service Networks (OSSN). OSSN are represented using the service description language USDL to model nodes and the service relationship model OSSR to model edges. Nonetheless, in their current form USDL and OSSR do not provide constructs to capture the dynamic behavior of service networks. To bridge this gap, we used General System Theory (GST) as a framework guiding the extension of USDL and OSSR to model dynamic OSSN. We evaluated the extensions by applying USDL and OSSR to two distinct types of dynamic OSSN analysis: 1) evolutionary, using Preferential Attachment (PA), and 2) analytical, using concepts from System Dynamics (SD). Results indicate that OSSN can constitute the first stepping stones toward the analysis of global service-based economies.
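The evolutionary analysis via Preferential Attachment can be reproduced in miniature with networkx's Barabási-Albert generator: each new service node links preferentially to already well-connected services, producing the heavy-tailed degree distributions typical of real networks. Node counts and parameters below are illustrative.

```python
# Preferential Attachment growth in miniature: 200 service nodes, each new
# node attaching to 2 existing nodes, favoring well-connected ones.
import networkx as nx

ossn = nx.barabasi_albert_graph(n=200, m=2, seed=42)

degrees = sorted((d for _, d in ossn.degree()), reverse=True)
print("hub degrees:", degrees[:5])                 # a few highly connected 'hub' services
print("median degree:", degrees[len(degrees) // 2])  # most services have few links
```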
Modeling Service Relationships for Service NetworksJorge Cardoso
The last decade has seen an increased interest in the study of networks in many fields of science. Examples are numerous, from sociology to biology, and to physical systems such as power grids. Nonetheless, the field of service networks has received less attention. Previous research has mainly tackled the modeling of single service systems and service compositions, often focusing only on studying temporal relationships between services. The objective of this paper is to propose a computational model to represent the various types of relationships which can be established between service systems to model service networks. This work acquires a particular importance since the study of service networks can bring new scientific discoveries on how service-based economies operate at a global scale.
The document discusses the Linked-USDL (Unified Service Description Language), which is an ontology for describing services using semantic web principles. It provides an overview of the Linked-USDL modules, syntax, core classes and properties for describing services, service models, offerings, interactions and more. The goal is to develop a standardized way to describe services in a holistic manner, covering both business and technical aspects.
Challenges for Open Semantic Service Networks: models, theory, applications Jorge Cardoso
The document discusses challenges for open semantic service networks. It provides an overview of different types of networks including the world wide web, social networks, biological networks, and service networks. It then summarizes research on services from software and IT, business model, and design perspectives. Key aspects of open semantic service networks are that they are constructed by accessing, retrieving and combining service and relationship models which are openly available and use semantic descriptions. Building blocks for open semantic service networks include modeling services, modeling service relationships, populating models, and analyzing service networks.
Description and portability of cloud services with USDL and TOSCAJorge Cardoso
The provisioning and management of cloud services are major concerns since they bring clear benefits such as elasticity, flexibility, scalability, and high availability of applications for enterprises. Two emerging contributions provide semantic, machine-understandable specifications for the description and portability of cloud-based services: USDL and TOSCA. In this talk we will explain how both can be articulated to work in conjunction. The Unified Service Description Language (USDL) was created for describing business or real world services to allow services to become tradable and consumable on marketplaces. On the other hand, the Topology and Orchestration Specification for Cloud Applications (TOSCA) was standardized to enable the portability of complex cloud applications and their management across different cloud providers.
To address the emerging importance of services and the relevance of relationships, we have developed and introduced the concept of Open Semantic Service Network (OSSN). OSSN are networks which relate services with the assumption that firms make the information of their services openly available using suitable models. Services, relationships and networks are said to be open (similar to LOD) when their models are transparently available and accessible by external entities and follow an open world assumption. Networks are said to be semantic when they explicitly describe their capabilities and usage, typically using a conceptual or domain model, and ideally using Semantic Web standards and techniques. One limitation of OSSNs is that they were conceived without accounting for the dynamic behavior of service networks. In other words, they can only capture static snapshots of service-based economies but do not include any mechanism to model reactions and effects that services have on other services, nor the notion of time.
This document describes the Open Semantic Service Networks research group at the University of Coimbra in Portugal. It introduces the group's supervisors and students, and summarizes each student's research project related to semantic services, including their background, problem being addressed, and technical solution. The group's projects involve models, platforms, applications, and visualizations for open semantic service networks.
This document proposes SEASIDE, a systematic methodology for developing Internet-based self-services (ISS) from analysis through deployment. SEASIDE leverages enterprise architectures, model-driven development, MVC patterns, and platform as a service. It was applied to develop an ITIL incident management service as a case study. SEASIDE increases traceability, consistency, and reduces complexity of ISS development.
The document discusses the Linked-USDL standardization working group. It provides an overview of the Unified Service Description Language (USDL), including its key components like the Service class and how it can be used to describe various types of services. It also discusses efforts to standardize USDL through the W3C and how USDL compares to other standards like WSDL.
Cloud Reliability: Decreasing outage frequency using fault injection
1. 9th International Workshop on Software Engineering for Resilient Systems
September 4-5, 2017, Geneva, Switzerland
Cloud Reliability
Decreasing outage frequency using fault injection
Prof. Dr. Jorge Cardoso
University of Coimbra, Portugal
Chief Architect for Cloud Operations and Analytics
Huawei Research, Munich, Germany
2. 2
Executive Summary
- Title: Cloud Reliability: Decreasing outage frequency using fault injection
- Abstract: In 2016, Google Cloud had 74 minutes of total downtime, Microsoft Azure had 270 minutes, and Amazon Web Services had 108 minutes (see cloudharmony.com). Reliability is one of the most important properties of a successful cloud platform. Several approaches can be explored to increase reliability, ranging from automated replication, to live migration, to formal system analysis. Another interesting approach is to use software fault injection to test a platform during prototyping, implementation and operation. Fault injection was popularized by Netflix and their Chaos Monkey fault-injection tool to test cloud applications. The main idea behind this technique is to inject failures in a controlled manner to guarantee the ability of a system to survive failures during operations. This talk will explain how fault injection can also be applied to detect vulnerabilities of the OpenStack cloud platform and how to effectively and efficiently detect the damages caused by the faults injected.
- Acknowledgments: This research was conducted in collaboration with Deutsche Telekom/T-Systems and with Ankur Bhatia from the Technical University of Munich to analyze the reliability and resilience of modern public cloud platforms.
- Short CV: Dr. Jorge Cardoso is Chief Architect for Cloud Operations and Analytics at Huawei's German Research Centre (GRC) in Munich. He has also been a Professor at the University of Coimbra since 2009. In 2013 and 2014, he was a Guest Professor at the Karlsruhe Institute of Technology (KIT) and a Fellow at the Technical University of Dresden (TU Dresden). Previously, he worked for major companies such as SAP Research (Germany) on the Internet of Services and the Boeing Company in Seattle (USA) on Enterprise Application Integration. Since 2013, he has been the Vice-Chair of the KEYSTONE COST Action, an EU research network bringing together more than 70 researchers from 26 countries. He has a Ph.D. in Computer Science from the University of Georgia (USA).
6. 6
Huawei Public Cloud Products and Services
- Compute: Elastic Cloud Server, Image Management Service, Auto Scaling, Container Service, Dedicated Cloud Service
- Storage: Elastic Volume Service, Object Storage Service, Volume Backup Service
- Network: Elastic IP, Elastic Load Balance, Virtual Private Cloud, Direct Connect, DNS
- Management & Deployment: Cloud Eye, Identity and Access Management, MaaS
- Security: Anti-DDoS
- Database: Relational Database Service
- Enterprise Application: Workspace, SAP HANA/SAP Suite (IaaS), MapReduce Solution
7. 7
Global Deployment of Public Cloud Services
- Huawei Enterprise Cloud (HEC) public cloud regions include Langfang and Suzhou in China.
- Open Telefonica Cloud worldwide regions include Mexico, Brazil, Chile, USA, Spain, Peru, Argentina, and Colombia.
- China Telecom Cloud (CTC) public cloud regions include Guizhou and Beijing in China.
- Orange Cloud Business (OBS) public cloud regions include France, Singapore, USA, Holland, and Africa.
- Open Telekom Cloud (OTC), Germany.
[Figure: world map of regions annotated with launch dates ranging from 2015 Q2 (China) through 2016 Q1/Q4, 2017 Jan/July, to 2018 Jan (Africa region)]
9. 9
Fault Injection into Clouds
Enables OpenStack and cloud applications to be automatically tested and repaired.
[Figure: a cloud application running on HUAWEI FusionSphere, surrounded by the cycle Failure -> Test -> Repair]
The system works by intentionally injecting different failures, testing the ability to survive them, and learning how to predict and repair failures preemptively.
11. 11
Fault Injection Plans
[Figure: pipeline of FIP components; I = input, O = output, N = instances, V = variability]
- Director (Create VM). I: None; O: VM name; N: 1/controller
- Deployer (Update DB). I: None; O: None; N: 1/controller
- Selector (Get compute PPID). I: None; O: (node, pid)+; N: 1/controller; V: 1/nova-compute; schedule: mon-fri 9h-17h, every 3h
- Observer (Wait for State). I: VM name; O: VM name, state; N: 1/controller; V: 1/{networking, block_device_mapping, spawning}
- Injector (Inject Fault). I: (node, pid)+; O: None; V: 1/{SIGKILL, SIGTERM, SIGINT}
- Detector (Check VM State?). I: VM name; O: VM state; N: 1/controller
- Cleaner (Delete VM). I: VM name; O: None; N: 1/localhost
- Rapporteur (Generate Report). I: data pipeline; O: None; N: 1/localhost
Data flows: (node, pid)+ and VM_name are passed between the components.

FIP  | Scenario        | Get PPID   | Variability                | Inject Fault           | Variability                       | Wait for State                | Variability 2                                       | Result
ID12 | Create VM       | Controller | One of controller services | Send signal to process | Signal {SIGKILL, SIGTERM, SIGINT} | Loop, read status, sleep      | States {networking, block_device_mapping, spawning} |
T1   | Create VM B_xzy | Controller | nova-compute               | Send signal            | SIGKILL                           | Wait for networking           |                                                     | OK
T2   | Create VM B_jsk | Controller | nova-api                   | Send signal            | SIGTERM                           | Wait for block_device_mapping |                                                     | NOK
T999 | ...
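
To make the Injector concrete, a minimal sketch in Python is shown below. It assumes local pgrep-based process selection and os.kill signalling; a real plan would distribute (node, pid)+ pairs across controllers and follow the mon-fri schedule above. The target service nova-compute is taken from test T1.

import os
import random
import signal
import subprocess

# V: one signal out of {SIGKILL, SIGTERM, SIGINT}
SIGNALS = [signal.SIGKILL, signal.SIGTERM, signal.SIGINT]

def select_pids(service):
    # Selector (sketch): resolve the PIDs of a controller service via pgrep
    out = subprocess.run(["pgrep", "-f", service], capture_output=True, text=True)
    return [int(p) for p in out.stdout.split()]

def inject_fault(pids):
    # Injector (sketch): send one randomly chosen signal to each process
    sig = random.choice(SIGNALS)
    for pid in pids:
        os.kill(pid, sig)
    return sig

pids = select_pids("nova-compute")          # O: (node, pid)+ on a real deployment
if pids:
    print("Injected", inject_fault(pids).name, "into", pids)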
12. 12
Background - Requirement
- With improvements in processing power, network and storage technologies, cloud platforms have witnessed an unprecedented growth in complexity.
- Due to this increase in complexity, the need to efficiently diagnose failures in cloud platforms has also risen.
- Because of these challenges, cloud operators often develop new sets of tests to diagnose failures.
- But, as mentioned before, cloud platforms are continuously evolving: they undergo modification and frequent updates, and have periodic release cycles. Hence, the tests become outdated and there is a constant need to modify them when a new release is available. This approach is therefore costly for cloud operators.
Work done in collaboration with Ankur Bhatia from the Technical University of Munich (TUM), Germany
13. 13
Solution
- As with most software, the validation of all the modules of a cloud platform is done through a test suite containing a large number of unit tests. Unit testing is the part of the software development process where the smallest testable part of an application, called a unit, along with associated control data, is individually and independently tested.
- Executing unit tests is a very effective way to test code integration during the development of software. They are often executed to validate changes made by developers and to guarantee that the code is error free.
- Although unit tests are extremely useful for the purposes of development and integration, they are not meant to diagnose failures.
14. 14
Fault Injection Plans
[Figure: the same FIP pipeline as slide 11 (Director, Deployer, Selector, Observer, Injector, Detector, Cleaner, Rapporteur), annotated with the three concepts below]
- Unit tests. A part of the software development process where the smallest testable part of a cloud platform, called a unit, along with associated control data, is individually and independently tested.
- Failure diagnosis system. Diagnoses failures in cloud platforms making use of unit tests and helps to detect nonresponsive and failed services.
- Unit tests for failure diagnosis. Reuse the already developed unit tests to test a cloud platform at runtime.
15. 15
Challenges
Using unit tests for failure diagnosis presents a set of challenges:
- Unit tests do not provide any information about the nonresponsive or failed services in cloud platforms.
  - The execution of unit tests generates a list of passed and failed tests. This list can help to locate software errors or to find issues with individual modules of the code, but it cannot diagnose failures, as there are no relationships between unit tests and the services running on a cloud platform.
- As the codebase of a cloud platform grows, the number of unit tests also increases. Thus, it takes a considerable amount of time to execute them.

Experiments
- OpenStack has more than 1500 unit tests
- Used for development and integration
- Only validate APIs
- Time consuming
  - E.g., unit tests that create a VM, upload a large operating system image, etc. need a few minutes to execute.
  - It can take up to 3 to 4 hours to execute all the unit tests. Considering a reliability requirement of 99.95%, cloud platforms can have a downtime of only 21.6 minutes per month. Hence, the time required to execute the unit tests is considerably high.
- Not able to directly detect services that are not functioning as expected
16. 16
General Approach
The 3-phase approach efficiently diagnoses a cloud platform:
- 1. Reduce the number of unit tests using the set cover algorithm.
- 2. Establish relationships between the reduced unit tests and the services running on a cloud platform. These relationships help to determine nonresponsive or failed services. This is done by simulating failures of services in cloud platforms and running the unit tests in this environment.
- 3. Construct a decision tree based on these relationships to select the unit tests that are most relevant to diagnose failures. The ID3 algorithm is used to construct the decision tree.

Results
- The approach diagnoses failures by running only 4-5% of the original unit tests.
17. 17
General Approach
- Unit Tests Reduction (Phase 1). Reduce the number of unit tests written during code development.
- Service Mapping (Phase 2). Establish relationships between the reduced unit tests and the services. It further reduces the number of unit tests based on these relationships.
- Sequential Diagnosis (Phase 3). Use the relationships to construct a decision tree that selects the most relevant unit tests to diagnose failures.
[Figure: a long list of tempest unit tests (e.g., tempest.api.volume.admin.test_volume_types_negative.VolumeTypesNegativeV2Test.test_create_with_empty_name, ...test_create_with_nonexistent_volume_type, ...test_delete_nonexistent_type_id, ...test_get_nonexistent_type_id) is reduced in Phase 1 (unit tests -> reduced unit tests); Phase 2 produces a service mapping matrix (rows t1..tm, columns s1..sn, 0/1 entries); Phase 3 produces a decision tree pointing to the faulty service]
18. 18
Phase 1 - Unit Test Reduction
- Identify modules (1/5): identify the modules available in the unit test framework.
- Construct Abstract Syntax Tree (2/5): for each module, an AST is constructed to establish a relationship between unit test methods and support methods.
- Filter AST (3/5): all irrelevant support methods are filtered out from the AST using an Inverse Document Frequency.
- Support method deduplication (4/5): unit test methods that perform redundant operations are eliminated by finding the minimum set cover. The time to execute a unit test is the cost of a set.
- Cross module deduplication (5/5): unit test methods from one module are compared with the methods of other modules and are further reduced.
[Figure: unit test manager pipeline - list of all unit tests -> identify modules -> construct AST -> filter AST (identify irrelevant support methods) -> support method de-duplication (eliminate redundant unit test methods within modules) -> cross module de-duplication (eliminate redundant unit test methods across modules) -> final reduced unit tests]
19. 19
Phase 1 (1-2/5): Identify Module & Construct AST
- Identify modules (1/5): a module is the source file which contains the definition of the unit test method.
  - E.g., tempest.api.compute.flavors.test_flavors.Flavors.test_create_flavor decomposes into the path of the module (tempest/api/compute/flavors/test_flavors.py), the class name (Flavors), and the unit test method (test_create_flavor).
- Construct AST (2/5): construct an AST for each module.
- The AST is traversed to identify unit test methods and their support methods.
- For each unit test method, a set of its support methods is created:
  - test_create_flavor: {create_flavor, assertEqual}
  - test_delete_flavor: {create_flavor, delete_fl, assertTrue}

A Python module (test_flavors.py) containing the implementations of the unit tests:

class Flavors(base.Test):
    def test_create_flavor(self):
        new_flavor_id = self.create_flavor()
        self.assertEqual(new_flavor_id, flavor_id)

    def test_delete_flavor(self):
        flavor = self.create_flavor(flavor_name)
        del_flavor = self.delete_fl(flavor)
        self.assertTrue(some_statement)

[Figure: a simple AST - module test_flavors.py -> class Flavors -> unit test methods test_create_flavor, test_delete_flavor -> support methods create_flavor, assertEqual, delete_fl, assertTrue]
Note: Apart from the nodes shown in the figure, an AST contains other nodes like variables, constructors, etc. Since they are not relevant to us, they are excluded for simplicity.
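
The mapping above can be sketched with Python's built-in ast module. The sketch below parses the Flavors module, walks the tree, and collects the support methods called by each test method; the test_ name prefix and the attribute-call heuristic are simplifying assumptions, not necessarily the exact traversal used in the experiments.

import ast

SOURCE = '''
class Flavors(base.Test):
    def test_create_flavor(self):
        new_flavor_id = self.create_flavor()
        self.assertEqual(new_flavor_id, flavor_id)

    def test_delete_flavor(self):
        flavor = self.create_flavor(flavor_name)
        del_flavor = self.delete_fl(flavor)
        self.assertTrue(some_statement)
'''

def support_methods(func):
    # Collect the names of methods called inside a unit test method,
    # e.g. self.create_flavor(...) -> "create_flavor"
    calls = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            calls.add(node.func.attr)
    return calls

tree = ast.parse(SOURCE)
mapping = {n.name: support_methods(n)
           for n in ast.walk(tree)
           if isinstance(n, ast.FunctionDef) and n.name.startswith("test_")}
print(mapping)
# -> {'test_create_flavor': {'create_flavor', 'assertEqual'},
#     'test_delete_flavor': {'create_flavor', 'delete_fl', 'assertTrue'}}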
20. 20
Phase 1 (3/5): Filter AST
- Support methods are eliminated
  - Some of the support methods are specific to the unit test framework, e.g., assertEquals
- Use the Inverse Document Frequency (IDF) technique
  - Higher occurrence -> lower IDF
- Calculate the IDF for the support methods
  - Set of documents = ASTs of the modules
- Eliminate support methods having an IDF below a threshold alpha
  - The support methods identified do not play any role in the reduction of the unit tests; they are irrelevant
[Figure: AST construction for Module 1 - classes A and B, unit test methods U1, U2, U3, and support methods (words) Assert, S1, S2, S3]
21. 21
Phase 1 (3/5): IDF example
IDF(x) = log(# documents / # documents containing word x)
[Figure: four documents (module ASTs) with unit test methods U1-U12 and support methods (words) Assert, S1, S2, S3, S5, S6, S7, S18]
- The support method Assert is present in all the modules -> its IDF is log(4/4) = 0
- Similarly, the IDF for S1 is log(4) and for S2 is log(2)
- Thus, Assert is irrelevant and is eliminated from the list of support methods
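
The worked example can be reproduced with a few lines of Python. The document sets below transcribe the figure, while the threshold alpha is illustrative, not the value used in the experiments.

import math

# Documents transcribe the four module ASTs of the worked example above
documents = {
    "document1": {"Assert", "S1", "S2"},
    "document2": {"Assert", "S2", "S3"},
    "document3": {"Assert", "S5"},
    "document4": {"Assert", "S6", "S7", "S18"},
}
ALPHA = 0.1   # illustrative threshold

def idf(word):
    n_containing = sum(1 for words in documents.values() if word in words)
    return math.log(len(documents) / n_containing)

vocabulary = set().union(*documents.values())
relevant = sorted(w for w in vocabulary if idf(w) >= ALPHA)
# IDF(Assert) = log(4/4) = 0 -> eliminated; IDF(S1) = log(4) -> kept
print(relevant)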
22. 22
Phase 1 (4/5): Support method deduplication
- Each module has unit test methods which call support methods
- In most cases, a subset of the unit test methods calls all the support methods
- Thus, some unit test methods are redundant
- The constructed ASTs are pruned
  - Pruning is done using the minimum set cover
- Two important parameters: cost and coverage
  - Cost is the time a unit test takes to execute
  - Coverage is the percentage of support methods to consider in the set cover
23. 23
Phase 1 (4/5): Set cover example
Support methods (S) and time in seconds for each unit test method (U) of Module 1:
- U1: {S1, S3}, 15
- U2: {S1, S3}, 8
- U3: {S2}, 9

Case 1: minimum coverage requested = 100%
- U = {S1, S2, S3}; required set = all elements of U
- Loop 1: a1 = 15/2, a2 = 8/2, a3 = 9/1 -> U2 is selected; I = {S1, S3}
- Loop 2: a1 = 15/0, a2 = 8/0, a3 = 9/1 -> U3 is selected; I = {S1, S2, S3}
- Coverage obtained: 100%. U2 and U3 cover all the distinct support methods at the least cost; hence, U1 is eliminated.

Case 2: minimum coverage requested = 50%
- U = {S1, S2, S3}; required set = ceil((50/100) * 3) = 2 elements
- Loop 1: a1 = 15/2, a2 = 8/2, a3 = 9/1 -> U2 is selected; I = {S1, S3}
- Coverage obtained: 66.6%. Any 2 support methods out of 3 are sufficient; U2 contains 2 distinct support methods and has the least cost. Hence, U1 and U3 are eliminated.
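
Both cases follow from a greedy weighted set cover, sketched below with the example's numbers. The cost-per-newly-covered-method selection rule is the standard greedy heuristic and is assumed to match the loops above.

import math

# Unit test -> (support methods, cost in seconds), as in the example above
tests = {
    "U1": ({"S1", "S3"}, 15),
    "U2": ({"S1", "S3"}, 8),
    "U3": ({"S2"}, 9),
}

def reduce_tests(tests, coverage=1.0):
    universe = set().union(*(methods for methods, _ in tests.values()))
    required = math.ceil(coverage * len(universe))
    covered, kept = set(), []
    while len(covered) < required:
        # Greedy choice: lowest cost per newly covered support method
        name, (methods, cost) = min(
            ((n, t) for n, t in tests.items()
             if n not in kept and t[0] - covered),
            key=lambda item: item[1][1] / len(item[1][0] - covered),
        )
        covered |= methods
        kept.append(name)
    return kept

print(reduce_tests(tests, coverage=1.0))   # ['U2', 'U3'] -> U1 eliminated
print(reduce_tests(tests, coverage=0.5))   # ['U2'] -> ceil(0.5 * 3) = 2 methods suffice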
24. 24
Phase 1 (5/5): Cross module deduplication
- AST pruning reduces the unit test methods by finding the minimum set cover of the support methods within the same module. However, some unit test methods are also covered in other modules.
- Thus, the reduction can be improved without losing coverage.
- Cross module deduplication compares the support methods of a unit test method from one module with the universe of the support methods of other modules.

Example:
- Module 2: U4 {Assert, S2}, U5 {S4, S5}, U6 {S4, S7}
- Module 4: U10 {Assert, S4}, U11 {S10, S4}, U12 {S12, S5}
- The support method de-duplication subsystem guarantees that {S4, S5, S10, S12} are called by the subset of the unit test methods kept in Module 4.
- All the support methods called by U5 (i.e., {S4, S5}) are already covered in Module 4.
- Hence, U5 is redundant and can be eliminated.

Our experiment with OpenStack enabled us to reduce the 1391 unit tests created by developers to 538 tests with 100% coverage.
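
The elimination of U5 can be expressed as a simple subset check against the other module's guaranteed support-method universe, sketched below with the example's sets.

# Module 2 unit tests and the support-method set guaranteed for Module 4
module2 = {"U4": {"Assert", "S2"}, "U5": {"S4", "S5"}, "U6": {"S4", "S7"}}
module4_universe = {"S4", "S5", "S10", "S12"}

# A unit test whose support methods are all covered by another module's
# universe is redundant and can be eliminated
redundant = [u for u, methods in module2.items() if methods <= module4_universe]
print(redundant)   # -> ['U5']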
25. 25
Phase 2: Service Mapping
One of the major challenges of using unit tests for failure diagnosis is the lack of relationships between unit tests and the services running on cloud platforms. The execution of unit tests outputs a list of passed, failed and skipped tests.
- Establish a relation between the unit tests and the services they can test
- Isomorphic unit test elimination
[Figure: test service mapper - reduced set of unit tests from Phase 1 -> unit test and service mapping -> matrix representing relationships between unit tests and testability of services -> isomorphic unit test elimination -> reduced matrix]

For each service s:
1. Disable s (stop/shutdown) to simulate a failure of a service in a cloud platform
2. Run the reduced set of unit tests. Unit tests that depend on service s will fail. Record the result.
3. Enable s (start)

Isomorphic unit test elimination: the execution time for unit test U1 is 10 seconds, U2 is 9 seconds, U3 is 13 seconds, U4 is 19 seconds and U5 is 3 seconds. Hence, U1, U5 and U2 are selected.
Experiment: 20 OpenStack services running and 538 unit tests after the reduction from Phase 1. At the end of Phase 2, the isomorphic test elimination procedure reduced the number of tests to 25.
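
A minimal sketch of the mapping loop is shown below, assuming placeholder systemctl service control and a pytest-style test runner; the slides do not prescribe the exact tooling.

import subprocess

SERVICES = ["nova-api", "nova-compute", "cinder-api"]   # illustrative subset of the 20
TESTS = ["t1", "t2", "t3"]                              # reduced tests from Phase 1

def set_service(name, running):
    # Placeholder service control; real deployments may use systemd units,
    # init scripts, or an orchestrator instead
    subprocess.run(["systemctl", "start" if running else "stop", name], check=True)

def test_passes(test):
    # Placeholder test runner; a tempest/stestr invocation would go here
    return subprocess.run(["python", "-m", "pytest", test]).returncode == 0

matrix = {}   # (test, service) -> 1 if the test can detect the service's failure
for service in SERVICES:
    set_service(service, running=False)     # 1. disable s to simulate a failure
    for test in TESTS:
        matrix[(test, service)] = 0 if test_passes(test) else 1   # 2. record result
    set_service(service, running=True)      # 3. re-enable s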
26. 26
Phase 3: Sequential Diagnosis
[Figure: decision tree with unit test 3 at the root, internal nodes unit test 1, unit test 2, unit test 4 and unit test 5 reached via pass/fail edges, and leaves naming the faulty services (service 1 to service 5)]
The main task of this phase is to analyze the relationships by constructing a decision tree and to use the tree to detect failed service(s) in a cloud platform in operation. Moreover, the failed services are detected without executing all the unit tests.
- Construction of the decision tree
- Execution of unit tests based on the decision tree
Reading the tree in the figure:
- Execute unit test 3
- If it fails, unit test 2 is executed
- The failure of unit test 2 indicates that service 1 is in a failed state
In some cases, there might be more than one possible service in a failed state:
- If unit test 2 passes, then either service 2 or service 5 or both are in a failed state
Failed service(s) can be detected by running log(n) unit tests instead of n, the total number of unit tests after eliminating the isomorphic unit tests. Hence, the tree enables users to detect the failed service(s) automatically without executing all the unit tests.
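
The sequential diagnosis is a plain walk down the tree, sketched below. Only the root branch (unit test 3 -> unit test 2 -> service 1, or service 2/5 on a pass) is stated on this slide; the remaining branches in the dictionary are illustrative.

# Internal nodes are unit tests, edges are pass/fail outcomes,
# leaves name the faulty service(s)
TREE = {
    "unit test 3": {"fail": "unit test 2", "pass": "unit test 4"},
    "unit test 2": {"fail": "service 1", "pass": "service 2 and/or service 5"},
    "unit test 4": {"fail": "unit test 5", "pass": "service 4"},       # illustrative
    "unit test 5": {"fail": "service 5", "pass": "service 3"},         # illustrative
}

def run_unit_test(name):
    # Placeholder: execute the unit test and return True if it passes
    raise NotImplementedError

def diagnose(node="unit test 3"):
    while node in TREE:                                  # internal node: run a test
        node = TREE[node]["pass" if run_unit_test(node) else "fail"]
    return node                                          # leaf: faulty service(s)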
27. 27
Results
We evaluated the approach with OpenStack (1391 unit tests):
- Phase 1: reduced the number of unit tests to 538 with 100% coverage.
- Phase 2: established relationships between the reduced set of unit tests and the 20 OpenStack services and eliminated isomorphic unit tests, reducing the number of unit tests to 25.
- Phase 3: built a decision tree using the relationships established; on average, 5 unit tests suffice to diagnose a failure.
Reduction: 1391 -> 538 -> 25 -> 5
28. 28
Open Positions and CFP
- FTE, PostDoc, PhD Students
  - Fault injection, fault models, fault libraries, fault plans, break and rebuild systems all day long, ...
  - Cloud Operations and Analytics for High Availability
  - AI, machine learning, data mining, and time-series analysis for Cloud operations