The field of AIOps, also known as Artificial Intelligence for IT Operations, uses algorithms and machine learning to dramatically improve the monitoring, operation, and maintenance of distributed systems. Its main premise is that operations can be automated using monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how AIOps – and many related fields such as deep learning, machine learning, distributed traces, graph analysis, time-series analysis, sequence analysis, and log analysis – can be applied to effectively detect, localize, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs). In particular, this lecture will describe how a particular monitoring data structure, the distributed trace, can be analyzed using deep learning to identify anomalies in its spans. This capability empowers operators to quickly identify which components of a distributed system are faulty.
On the Application of AI for Failure Management: Problems, Solutions and Algo... (Jorge Cardoso)
Artificial Intelligence for IT Operations (AIOps) is a class of software which targets the automation of operational tasks through machine learning technologies. ML algorithms are typically used to support tasks such as anomaly detection, root-cause analysis, failure prevention, failure prediction, and system remediation. AIOps is gaining increasing interest from industry due to the exponential growth of IT operations and the complexity of new technology. Modern applications are assembled from hundreds of dependent microservices distributed across many cloud platforms, leading to extremely complex software systems. Studies show that cloud environments are now too complex to be managed solely by humans. This talk discusses various AIOps problems we have addressed over the years and gives a sketch of the solutions and algorithms we have implemented. Interesting problems include hypervisor anomaly detection, root-cause analysis of software service failures using application logs, multi-modal anomaly detection, root-cause analysis using distributed traces, and verification of virtual private cloud networks.
Effective AIOps with Open Source Software in a Week (Databricks)
Classic event, incident, problem and change management are ITSM practices that are getting integrated with DevOps/SRE and ML through a competency known as AIOps. Large streams of data generated through logs, metrics and traces are organized and computed using machine learning algorithms to extract insights on the anomalies of system behavior that could be impacting end-users and business transactions. Businesses cannot afford to see their end-users impacted by those anomalies and therefore would want to proactively predict the likelihood of systems regressing and take corrective action long before any material impact.
In this talk, we show the use of simple and multivariate linear regression techniques to predict the likelihood of system behavior deviating by one or two standard deviations (sigma). We show how to make these predictions with FOSS tools using decision trees, integrated with high-performing streaming platforms such as Apache Flink and Apache Beam, plus Prometheus and Grafana, which make it much easier to visualize alerts and triage back to the root cause. These systems are backed by Kafka for its streaming and distributed computing capabilities, partitioning the data for various analysis stages, some of which can run in parallel and concurrently depending on the use case. We present a fully integrated architecture that helps you realize a commercial-grade AIOps capability without having to license expensive software products. This open architecture allows you to implement various ML algorithms as needed and is agnostic to programming languages and tools.
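The "linear regression plus sigma threshold" idea described above can be sketched in a few lines. This is an illustrative sketch only, not the talk's actual pipeline; the metric values and the two-sigma threshold are example assumptions.

```python
import numpy as np

def sigma_anomalies(timestamps, values, n_sigma=2.0):
    """Fit a simple linear trend to a metric and flag points whose
    residual exceeds n_sigma standard deviations (the 'two sigma' rule)."""
    t = np.asarray(timestamps, dtype=float)
    y = np.asarray(values, dtype=float)
    slope, intercept = np.polyfit(t, y, deg=1)   # least-squares linear fit
    residuals = y - (slope * t + intercept)
    threshold = n_sigma * residuals.std()
    return np.flatnonzero(np.abs(residuals) > threshold)

# Latency samples (ms) with one injected spike at index 5
latency = [100, 102, 101, 103, 102, 250, 104, 103, 105, 104]
print(sigma_anomalies(range(10), latency))  # flags index 5
```

In a streaming deployment, the same residual test would run over a sliding window fed from Kafka rather than a static array.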
The talk combines various techniques with demos and is aimed at practicing engineers and developers who are familiar with ML.
Customers migrating workloads to AWS have a variety of tools to monitor their infrastructure, generating large volumes of alarms from services such as Amazon CloudWatch, AWS Config, and other third-party tools. Without careful curation, events and tickets can exponentially multiply and overwhelm ITSM systems and the teams operating them, obscuring real problems and wasting time. Using advanced machine learning techniques, customers can reduce noise from these events and tickets and increase their service quality. In this presentation, we explore the challenges of adopting AIOps and provide examples of how AIOps can be used to reduce Mean Time To Restore and improve customer outcomes.
Survive the fog of system development! Developers' lives have gotten more complex in the last decade. There is too much to learn and understand now, and you need a co-pilot. Let AIOps be that co-pilot.
In this webinar, we'll share use cases and discuss:
What is AIOps?
Why AI and ML are well-suited for Ops and DevOps
A guide for assessing where to automate
Customer case - Dynatrace Monitoring Redefined (Michel Duruel)
One of the largest airlines in the world chose Dynatrace; here is the customer case.
Including:
Vision and Goal / Challenges / Requirements / Why Dynatrace is Unique / ROI and TCO / Rollout Status / Solution Screenshots
Dynatrace redefined monitoring with AI-powered 3rd-generation APM, user experience monitoring and continuous improvement: cloud-native, full stack, auto-everything, end-to-end, and the easiest to implement, use and maintain.
Since 2012, leading IT research firm EMA has conducted more than five separate AIOps research projects, including reviews of more than 70 AIOps-related customer deployments. Deep insights into this topic continue with these slides, based on the research webinar, which provide the latest guidance on how to best succeed in AIOps deployments and unify IT in the process.
Modernizing Infrastructure Monitoring and Management with AIOpsOpsRamp
Artificial intelligence for IT operations (AIOps), with its promises of smarter automation, data ingestion, and actionable insights, is all the rage in the world of IT infrastructure monitoring and management. But how do you fundamentally implement it in an organization that is simultaneously balancing the demands of legacy, cloud, and hyperconverged digital infrastructure?
Join the OpsRamp team to see a simplified roadmap to bring AIOps to hybrid infrastructure monitoring and management, and watch a demo of the OpsRamp platform in action.
You will learn:
How AIOps can drive faster alert correlation, deduplication, and suppression
How you can observe AIOps in action before you actually push a solution to production
How you can bring AIOps to both your IT operations and IT service management practices simultaneously
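The alert correlation, deduplication, and suppression mentioned above is commonly fingerprint-based. The sketch below is a generic illustration of that idea, not OpsRamp's actual implementation; the alert fields (`host`, `check`, `ts`) and the five-minute window are assumptions.

```python
from collections import defaultdict

def deduplicate_alerts(alerts, window_s=300):
    """Collapse alerts sharing a fingerprint (host + check name) that
    arrive within window_s seconds into one alert with a repeat count."""
    groups = defaultdict(list)
    for a in sorted(alerts, key=lambda a: a["ts"]):
        key = (a["host"], a["check"])
        bucket = groups[key]
        if bucket and a["ts"] - bucket[-1]["ts"] <= window_s:
            bucket[-1]["count"] += 1          # suppress the repeat
            bucket[-1]["ts"] = a["ts"]        # remember the latest arrival
        else:
            bucket.append({**a, "count": 1})  # new incident for this key
    return [g for buckets in groups.values() for g in buckets]

alerts = [
    {"host": "web-1", "check": "cpu",  "ts": 0},
    {"host": "web-1", "check": "cpu",  "ts": 60},    # duplicate within window
    {"host": "web-1", "check": "cpu",  "ts": 1000},  # outside window -> new incident
    {"host": "db-1",  "check": "disk", "ts": 30},
]
print(deduplicate_alerts(alerts))  # 4 raw alerts collapse to 3
```

Production systems add correlation on top of this, e.g. grouping deduplicated alerts that share a topology dependency.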
Learn more at https://www.opsramp.com
Also, follow us on social media channels to learn about product highlights, news, announcements, events, conferences and more:
Twitter - https://www.twitter.com/OpsRamp
LinkedIn - https://www.linkedin.com/company/opsramp
Facebook - https://www.facebook.com/OpsRampHQ/
Distributed Trace & Log Analysis using ML (Jorge Cardoso)
The field of AIOps, also known as Artificial Intelligence for IT Operations, uses advanced technologies to dramatically improve the monitoring, operation, and troubleshooting of distributed systems. Its main premise is that operations can be automated using monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how AIOps – and many related fields such as deep learning, machine learning, distributed traces, graph analysis, time-series analysis, sequence analysis, advanced statistics, NLP and log analysis – can be applied to effectively detect, localize, predict, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs) by analyzing service management data (e.g., distributed traces, logs, events, alerts, metrics). In particular, this talk will describe how a particular monitoring data structure, the distributed trace, can be analyzed using deep learning to identify anomalies in its spans. This capability empowers operators to quickly identify which components of a distributed system are faulty.
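The talk's deep-learning models are not reproduced here, but the task itself, flagging spans of a trace whose behavior is abnormal for their operation, can be illustrated with a simple statistical baseline. This is a hypothetical stand-in, not the method from the talk; the span fields and thresholds are assumptions.

```python
import statistics

def anomalous_spans(spans, n_sigma=3.0):
    """Flag spans whose duration deviates strongly from the durations of
    other spans with the same operation name (leave-one-out z-score).
    A statistical stand-in for the learned models described in the talk."""
    by_op = {}
    for s in spans:
        by_op.setdefault(s["op"], []).append(s["duration_ms"])
    flagged = []
    for s in spans:
        others = list(by_op[s["op"]])
        others.remove(s["duration_ms"])   # leave the span itself out
        if len(others) < 3:
            continue                      # not enough history to judge
        mean = statistics.mean(others)
        stdev = statistics.pstdev(others) or 1e-9
        if abs(s["duration_ms"] - mean) > n_sigma * stdev:
            flagged.append(s["span_id"])
    return flagged

trace = [
    {"span_id": "a", "op": "db.query", "duration_ms": 10},
    {"span_id": "b", "op": "db.query", "duration_ms": 11},
    {"span_id": "c", "op": "db.query", "duration_ms": 9},
    {"span_id": "d", "op": "db.query", "duration_ms": 10},
    {"span_id": "e", "op": "db.query", "duration_ms": 500},  # the faulty call
]
print(anomalous_spans(trace))  # ['e']
```

A deep-learning approach would replace the per-operation statistics with a model (e.g., a sequence model over span attributes) but keeps the same input (spans) and output (suspect components).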
The vital role of AIOps in overcoming IT operational challenges - DEM07-SR - ... (Amazon Web Services)
Digital transformation has sparked an IT service availability crisis in the enterprise, driven by unprecedented levels of infrastructure scale and complexity, rendering the traditional, rules-driven approach to incident management obsolete. Artificial intelligence for IT Operations, or AIOps, is a new direction for assuring service availability via operational data. In this talk, we show how AIOps gives enterprises a critical advantage and paves the way for the future of service assurance. We examine how adopting AIOps yields greater business agility while improving customer service, lowering operation costs, and boosting IT Ops teams’ productivity. This presentation is brought to you by AWS partner, Moogsoft Inc.
SRE (service reliability engineer) on big DevOps platform running on the clou... (DevClub_lv)
SRE (service reliability engineer): this talk explains the SRE philosophy and the principles of production engineering and operations in clouds.
(Language – English)
Pavlo is ADOP (Accenture DevOps Platform) Service Reliability Team Lead and an SRE practitioner, with more than 18 years of IT experience in Ops and Dev.
Application Management Service Offerings (Gss America)
GSS America extends support for Application Development and Maintenance initiatives and helps customers focus on their strategic priorities. The cost-effective AMS solutions from GSS ensure that customers achieve significant savings on running their business operations, so that they can channel these savings towards new product development initiatives that contribute to additional revenue generation.
Amidst an industry cloud of confusion about what “AIOps” is and what it can do, these slides, based on the webinar from EMA research, delineate a clear path to victory for business and IT stakeholders seeking to use machine learning to optimize the performance of critical business services.
MLOps: Bridging the gap between Data Scientists and Ops (Knoldus Inc.)
Through this session we introduce the MLOps lifecycle and discuss the hidden loopholes that can affect an ML project. We then discuss the ML model lifecycle and the problems with training, and introduce the MLflow Tracking module for tracking experiments.
Driving AI Innovation with Machine Learning powered by AWS. AI is opening up new insights and efficiencies in enterprises of every industry. Learn how enterprises are using AWS’ machine learning capabilities combined with its deep storage, compute, analytics, and security services to deliver intelligent applications today. Strategies to develop ML expertise within your org will also be discussed.
Machine Learning Operations (MLOps) brings data science to the world of DevOps. Data scientists create models on their workstations; MLOps adds automation, validation and monitoring to any environment, including machine learning on Kubernetes. In this session you'll hear about the latest developments and see them in action.
Any modern business with digital assets requires a robust site reliability framework to secure its digital domains and ensure uninterrupted service delivery.
This is essential to protect revenue streams and brand value as well as to shield your website against cyber threats.
More Than Monitoring: How Observability Takes You From Firefighting to Fire P... (DevOps.com)
For some, observability is just a hollow rebranding of monitoring; for others, it’s monitoring on steroids. But what if we told you observability is the new way to find out why—not just if—your distributed system or application isn’t working as expected? Today, we see that traditional monitoring approaches can fall short if a system or application doesn’t adequately externalize its state.
This is truer as workloads move into the cloud and leverage ephemeral technologies, such as microservices and containers. To reach observability, IT and DevOps teams need to correlate different sources from logs, metrics, traces, events and more. This becomes even more challenging when defining the online revenue impact of a failed container—after all, this is what really matters to the business.
This webinar will cover:
The differences between observability and monitoring
Why it is a bigger challenge in a multicloud and containerized world
How observability results in less firefighting and more fire prevention
How new platforms can help gain observability (on premises and in the cloud) for containers, microservices and even SAP or mainframes
Learn how Site24x7 gives you end-to-end application performance visibility for your Java, .NET and Ruby web transactions with metrics of all components starting from URLs to SQL queries.
AIOps is becoming imperative to the management of today’s complex IT systems and their ability to support changing business conditions. This slide explains the role that AIOps can and will play in the enterprise of the future, how the scope of AIOps platforms will expand, and what new functionality may be deployed.
Watch the webinar here. https://www.moogsoft.com/resources/aiops/webinar/aiops-the-next-five-years
The catalyst for the success of automobiles came not through the invention of the car but rather through the establishment of an innovative assembly line. History shows us that the ability to mass produce and distribute a product is the key to driving adoption of any innovation, and machine learning is no different. MLOps is the assembly line of Machine Learning and in this presentation we will discuss the core capabilities your organization should be focused on to implement a successful MLOps system.
AIOps: Anomalies Detection of Distributed Traces (Jorge Cardoso)
Introduction to the field of AIOps, large-scale monitoring, and observability. Provides an example illustrating how deep learning can be used to analyze distributed traces to reveal exactly which component is causing a problem in microservice applications.
Presentation given at the National University of Ireland, Galway (NUI Galway) on 2019.08.20.
Thanks to Prof. John Breslin
Privacy preserving public auditing for secured cloud storage (dbpublications)
As cloud computing technology has developed over the last decade, outsourcing data to cloud storage services has become an attractive trend, which spares the effort of heavy data maintenance and management. Nevertheless, since outsourced cloud storage is not fully trustworthy, it raises security concerns about how to realize data deduplication in the cloud while achieving integrity auditing. In this work, we study the problem of integrity auditing and secure deduplication on cloud data. Specifically, aiming at achieving both data integrity and deduplication in the cloud, we propose two secure systems, namely SecCloud and SecCloud+. SecCloud introduces an auditing entity that maintains a MapReduce cloud, which helps clients generate data tags before uploading and audits the integrity of data stored in the cloud. Compared with previous work, the computation performed by the user in SecCloud is greatly reduced during the file uploading and auditing phases. SecCloud+ is motivated by the fact that customers always want to encrypt their data before uploading, and enables integrity auditing and secure deduplication on encrypted data.
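SecCloud's cryptographic protocols are not reproduced here, but the underlying tag-then-audit idea can be shown with a toy sketch: compute one tag per data block before upload, then recompute and compare to audit integrity. Plain SHA-256 stands in for the paper's actual tag scheme, and the tiny block size is an illustrative assumption.

```python
import hashlib

BLOCK_SIZE = 4  # bytes per block; real systems use kilobytes or more

def block_tags(data: bytes) -> list:
    """Generate one integrity tag (here a plain SHA-256 digest) per block,
    standing in for the tags SecCloud's auditing entity helps generate."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return [hashlib.sha256(b).hexdigest() for b in blocks]

def audit(data: bytes, tags: list) -> bool:
    """Recompute tags over the stored data and compare (integrity audit)."""
    return block_tags(data) == tags

original = b"pay load"
tags = block_tags(original)
print(audit(original, tags))     # True: storage is intact
print(audit(b"pay l0ad", tags))  # False: a block was tampered with
```

The same digests also enable deduplication: two clients uploading identical blocks produce identical tags, so the provider can store the block once.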
A confluence of events is accelerating the growth of AI in the enterprise: (i) the COVID pandemic is accelerating the digital transformation of enterprises, (ii) increased digital sales and digital interaction are fueling interest in operationalizing AI to drive revenue and cost efficiencies, and (iii) enterprise databases and enterprise apps are infusing AI to transparently augment predictive capabilities for clients. Enterprise Power Systems are pillars of the global economy hosting our trinity of operating systems
A Comprehensive Guide to AIOps Integration in Organizations (CloudZenix LLC)
AIOps, aka Artificial Intelligence for IT Operations, is an AI application and technology platform with multiple layers. AIOps automates and enhances IT operations with the help of analytics and machine learning. The breakthrough that came with AIOps is making IT operations relatively seamless. According to Gartner, AIOps combines machine learning and big data to automate processes, event correlation, causality determination, and anomaly detection. Read more: https://cloudzenix.com/a-comprehensive-guide-to-aiops-integration-in-organizations/
Is your organization drowning in siloed cybersecurity data? Are you eager to put Big Data to work on your cybersecurity haystack? Are you planning an Apache Metron deployment? Early in 2018, T-Mobile began their journey to cybersecurity at scale. Come learn how one of the largest wireless carriers in the US successfully operationalized Apache Metron, a horizontally scalable cybersecurity analytics platform that ingests, enriches and triages events in real time. Hear why T-Mobile chose Metron and how they planned and executed their deployment. Learn how the team leveraged built-in Metron components and tapped into existing event pipelines to get ingestion up and running quickly. Dive into the details on tuning ingest on a real event feed. Finally get tips and best practices for staying on top of security event monitoring in today’s challenging threat landscape. We discuss migrating log sources to Metron, monitoring and troubleshooting ingest, adapting security configurations to find new attacks, as well as capacity planning.
Industrial production is becoming increasingly interlinked with modern information and communication technology. On the foundation of intelligent, digitally-networked systems, largely self-organized production becomes possible. In Industrie 4.0, people, machinery, plants, logistics and products will communicate and cooperate directly. To connect these different strands, a unified, flexible, high-performance system is needed to provide a company-wide, real-time information flow.
To target these issues, we developed enterprise:inmation.
It securely and efficiently gathers data from manufacturing, process control and IT systems all around the globe, contextualizes it and transforms it into actionable information, which is presented to every decision-maker on any device, anytime, at any location.
Software made by industrial system integration pros, in close cooperation with industry leaders. Business performance in real-time, anytime, anywhere, for all decision- makers -that is enterprise:inmation.
The evolution of cloud technology has helped many laboratories to increase their productivity and reliability by digitalising and automating their operations. Digitalising here means easily migrating their data saved in the form of old paper notebooks and spreadsheets to computerized storage and management systems.
Similar to AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning (20)
AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
1. AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
Prof. Dr. Jorge Cardoso
E-mail: jorge.cardoso@huawei.com
Chief Architect, Planet-scale AIOps
Ireland and Munich Research Centers
Definition (Gartner) [AIOps]
AIOps platforms utilize big data, modern machine learning, and other advanced analytics technologies to directly and indirectly enhance IT operations (monitoring, automation, and service desk) functions with proactive, personal, and dynamic insight.
Lecture on Intelligent Systems
Trinity College Dublin, Ireland
02.10.2019
2. HUAWEI TECHNOLOGIES CO., LTD. 2
Abstract and Bio
The field of AIOps, also known as Artificial Intelligence for IT Operations, uses algorithms and machine learning to dramatically improve the monitoring, operation, and maintenance of distributed systems. Its main premise is that operations can be automated using monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how AIOps – and many related fields such as deep learning, machine learning, distributed traces, graph analysis, time-series analysis, sequence analysis, and log analysis – can be explored to effectively detect, localize, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs). In particular, this lecture will describe how a particular monitoring data structure, called a distributed trace, can be analyzed using deep learning to identify anomalies in its spans. This capability empowers operators to quickly identify which components of a distributed system are faulty.
Dr. Jorge Cardoso is Chief Architect for Planet-scale AIOps at Huawei’s Ireland and Munich Research Centers. Previously he worked for several major companies such as SAP Research (Germany) on the Internet of Services and the Boeing Company in Seattle (USA) on Enterprise Application Integration. He previously gave lectures at the Karlsruhe Institute of Technology (Germany), University of Georgia (USA), University of Coimbra, and University of Madeira (Portugal). His current research involves the development of the next generation of AIOps platforms, Cloud Operations and Analytics tools driven by AI, Cloud Reliability and Resilience, and High Performance Business Process Management systems. He has a Ph.D. in Computer Science from the University of Georgia (USA).
GitHub | Slideshare.net | GoogleScholar
Interests: AIOps, Service Reliability Engineering, Cloud Computing, Distributed Systems, Business Process Management
3. Underlying Problem
Increasing Complexity of Distributed Systems
Every year, the management of IT O&M becomes more complex
► Digitalization with cloud, mobile, and edge
► Large-scale microservice and serverless systems
► Increase in IT size, and event/alert volumes
► SLA guarantee for critical applications
► Nano/microservices break existing monitoring tools
Microservices and serverless add complexity to communications between services, making systems far more difficult to maintain and troubleshoot. With a monolithic application, when there is a problem, only one product owner needs to be contacted.
[Figure: hierarchy of the HC Cloud platform, from data centers (1 .. 10) to servers (1 .. 20,000), sockets, cores (1 .. 16), and VMs; for illustrative purposes only]
Huawei Cloud is an example of a planet-scale complex system: 40 availability zones and 23 geographical regions (from Germany, France, South/Central America, Hong Kong, and Russia to Thailand and South Africa), 20M VMs (3 VMs/core), and billions of metrics.
However, existing O&M tools still rely on old approaches, which involve manual rule configuration, averages and standard deviations, and data wrangling, to manage physical assets.
4. Industry Trends
Bringing AI to O&M
“Virtyt Koshi, the EMEA general manager for virtualization vendor Mavenir, reckons Google is able to run all of its data centers in the Asia-Pacific with only about 30 people, and that a telco would typically have about 3,000 employees to manage infrastructure across the same area."
Reducing O&M Costs
Google: 30 operators
Others: 3000 operators
“We began applying machine learning two years ago (2016) to operate our data centers more efficiently… over the past few months, DeepMind researchers began working with Google’s data center team to significantly improve the system’s utility. Using neural networks trained on different operating scenarios and parameters, we created a more efficient and adaptive framework to understand data center dynamics and optimize efficiency.” Eric Schmidt, Dec. 2018
O&M Activities
System monitoring and 24x7 technical support, Tier 1-3 support
Troubleshooting and resolution of operational issues
Backup, restoration, archival services
Update, distribution and release of software
Change, capacity, and configuration management
…
Harvard Business Review
Early indicators: Moogsoft AIOps, Amazon EC2 Predictive Scaling, Azure VM resiliency, Amazon S3 Intelligent-Tiering
38.4% of organizations take more than 30 minutes to resolve IT incidents that impact consumer-facing services (PagerDuty)
https://www.lightreading.com/automation/google-has-intent-to-cut-humans-out-of-network/d/d-id/742158
https://www.lightreading.com/automation/automation-is-about-job-cuts---lr-poll/d/d-id/741989
https://www.lightreading.com/automation/the-zero-person-network-operations-center-is-here-(in-finland)/d/d-id/741695
5. AIOps @ Huawei Cloud
Troubleshooting Scenarios
Response Time Analysis
‒ A service started responding to requests more slowly than normal
‒ The change happened suddenly, as a consequence of a regression in the latest deployment
System Load
‒ The demand placed on the system, e.g., REST API requests per second, has increased since yesterday
Error Analysis
‒ The rate of requests that fail, either explicitly (HTTP 5xx) or implicitly (HTTP 2xx with wrong content), is increasing slowly but steadily
System Saturation
‒ The resources (e.g., memory, I/O, disk) used by key controller services are rapidly reaching threshold levels
https://intl.huaweicloud.com/
HC System’s Components
► OBS. Object Storage Service is a stable, secure, and efficient cloud
storage service
► EVS. Elastic Volume Service offers scalable block storage for
servers
► VPC. Virtual Private Cloud enables the creation of private, isolated
virtual networks
► ECS. Elastic Cloud Server is a cloud server that provides
scalable, on-demand computing resources
Anomaly Detection | Fault Diagnosis | Fault Prediction | Fault Recovery
AIOps @ Huawei Cloud
Troubleshooting Tasks
Reference architecture for AIOps platforms:
‒ Data sources: metrics, tracing, app logs, events, topology
‒ Data source connectors and distribution layer
‒ Real-time ingestion and historical data storage
‒ Stream processing and batch processing
‒ Time series registry and analytical data store
‒ Stability analysis: machine learning and analytics (anomaly detection, fault diagnosis,
fault prediction, fault prevention, fault recovery, topology analysis)
‒ Visualization, exploration, and reporting for business operators, developers, and DevOps
► Anomaly detection. Determine what constitutes normal system
behavior, and then discern departures from that normal
behavior
► Fault diagnosis (root cause analysis). Identify the dependency
links that represent causal relationships to discover the
true source of an anomaly
► Fault prediction. Use historical or streaming data to predict
incidents with varying degrees of probability
► Fault recovery (incident remediation). Explore how decision
support systems can manage and select recovery processes to
repair failed systems
O&M Troubleshooting Tasks
iForesight is an intelligent tool aimed at SRE cloud maintenance teams.
It uses artificial intelligence to let them quickly detect anomalies when
cloud services are slow or unresponsive.
iForesight3.0
Operational AI & AIOps
AIOps @ Huawei Cloud
Monitoring Data Structures
Logs. Services, microservices, and applications
generate logs, composed of timestamped
records with a structured part and free-form text,
which are stored in system files.
Metrics. Examples of metrics include CPU
load, available memory, and the response
time of an HTTP request.
Traces. A trace records the workflow and tasks
executed in response to an HTTP request.
Events. Major milestones which occur within a
data center can be exposed as events.
Examples include alarms, service upgrades,
and software releases.
{"traceId": "72c53", "name": "get", "timestamp": 1529029301238, "id": "df332",
"duration": 124957, "annotations": [{"key": "http.status_code", "value": "200"}, {"key":
"http.url", "value": "https://v2/e5/servers/detail?limit=200"}, {"key": "protocol", "value":
"HTTP"}], "endpoint": {"serviceName": "hss", "ipv4": "126.75.191.253"}}
{"tags": ["mem", "192.196.0.2", "AZ01"], "data": [2483, 2669, 2576, 2560, 2549, 2506,
2480, 2565, 3140, …, 2542, 2636, 2638, 2538, 2521, 2614, 2514, 2574, 2519]}
{"id": "dns_address_match", "timestamp": 1529029301238, …}
{"id": "ping_packet_loss", "timestamp": 152902933452, …}
{"id": "tcp_connection_time", "timestamp": 15290294516578, …}
{"id": "cpu_usage_average", "timestamp": 1529023098976, …}
2017-01-18 15:54:00.467 32552 ERROR oslo_messaging.rpc.server [req-c0b38ace -
default default] Exception during message handling
System’s Components (e.g., OBS, EVS, VPC, ECS) are monitored and generate various types of data:
Logs, Metrics, Traces, Events, Topologies
AIOps @ Huawei Cloud
Distributed Trace Analysis
1. User commands: image delete,
create server, image list, flavor set…
[Figure: services (keystone -> nova -> glance -> network) execute across nodes N1-N4;
warnings A, B, C, Y and an error Z are raised; trace t1 is shown as a sequence of
spans a1, 62, e2, 28]
auth_tokens_v3 1314421019193f1313162f253233191e341a3c3b402f2624192f214521162f163e2746282c192f2e2e24161d2c2c161628241635287a
images_v2 4342421119131616161a311316192216162f23213e1d2127272728241621191d5e5c471d132c4a16241616213c63
_var__detail_flavors_v2.1 5d13421314151316161b193b21242113364024192f162f131d21161621212c27282c2c2d162f5e5c132c16164d
image_schemas_v2 141319133113162f1644161b331e351a213b1613133b13192f5116132113131616252121282c2c162d21582e2e161d13643b132c4a78134d213c21
gather_result 1912141613252128211a1f402f221921162f2713282b2d2416131d13211a5a21
security-groups_v2.0 4c1319131d19131316132f
_var__images_v2 121313251b2f192f35211f241916223b1613231616242713462c53241654192c285a21
_var___var__action_servers_v2.1 61135d106c12131325191b1c193d281a4022136d13162f3b292116231d1627272741283b2416131d5c2f4a5121352f13
neutronclient.v2_0.client.retry_request 162f1319132f165124191316
4. Sequence Classification: identify invalid sequences
Service call encoding
16: {'v2.0', 'ports'}
51 : {'v2.0', '_var_', 'ports'}=200
24 : {'nova.compute.api.API.get'}=0
…
Service call encoding
13: {'v3', 'tokens', 'auth'}
Encoding matching
53.47%
2. A distributed trace is generated
3. The distributed trace is encoded into a sequence
Tracing has applicability in many fields besides microservices, e.g., end-to-end enterprise applications orchestrating mobile apps, edges,
websites, storage, and centralized systems.
Function call sequences
[{'v3'}=200, {'v3', 'tokens', 'auth'}=200, {'v...
Function call sequences
Similar call sequence
Distributed Tracing
Distributed context propagation
is still new to many people
Applications
Distributed transaction
monitoring
Anomaly detection
Performance analysis
‒ Latency optimization
Dependency analysis
‒ Who are my upstream and
downstream dependencies?
‒ How many different workflows
depend on my service?
‒ Is my service a critical (tier 1)
service for core business flows?
‒ How do my SLIs affect other
services?
‒ Will my service survive
Halloween?
Root cause analysis
{"traceId": "72c53", "name": "get", "timestamp": 1529029301238, "id": "df332", "duration": 124957, "annotations": [{"key": "http.status_code", "value": "200"},
{"key": "http.url", "value": "https://v2/e5/servers/detail?limit=200"}, {"key": "protocol", "value": "HTTP"}], "endpoint": {"serviceName": "hss", "ipv4":
"126.75.191.253"}}
Distributed Trace Analysis
Introduction to Sequence Analysis
Sequence Prediction
‒ Predict elements of a sequence on the basis of the preceding elements
‒ Given si, si+1, …, sj, predict the next element, i.e., si, si+1, …, sj → sj+1
‒ Input: 1, 2, 3, 4, 5; output: 6
‒ Applications: weather forecast
Sequence Classification
‒ Predict a class label for a given input sequence.
‒ Given si, si+1, …, sj, determine whether it is legitimate, i.e., si, si+1, …, sj → yes or no
‒ Input: 1, 2, 3, 4, 5; output: “normal” or “anomalous”
‒ Applications: trace anomaly detection, DNA sequence classification
Sequence Generation
‒ Generate a new output sequence with the same general characteristics as the input
‒ Input: [1, 3, 5], [7, 9, 11]; output: [3, 5, 7]
‒ Applications: text generation
Sequence to Sequence Prediction
‒ Predict an output sequence given an input sequence.
‒ Input: 1, 2, 3, 4, 5; output: 6, 7, 8, 9, 10
‒ Applications: text summarization
1, 2, 3, 4, 5 ► 6
1, 2, 3, 4, 5 ► NORMAL
1, 3, 5 and 7, 9, 11 ► 3, 5, 7
1, 2, 3, 4, 5 ► 6, 7, 8, 9, 10
Sequence learning: from recognition and prediction to sequential decision making, IEEE Intelligent Systems, 2001,
http://www.sts.rpi.edu/~rsun/sun.expert01.pdf
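As a minimal, hypothetical sketch of the sequence-prediction framing above, the snippet below turns an integer sequence into (preceding elements, next element) training pairs; the function name and the `window` parameter are illustrative assumptions, not from the lecture.

```python
def make_prediction_pairs(sequence, window=3):
    """Slide a fixed-size window over the sequence and emit
    (preceding elements, next element) training pairs."""
    pairs = []
    for i in range(len(sequence) - window):
        pairs.append((sequence[i:i + window], sequence[i + window]))
    return pairs

# Sequence prediction: given 1..5, a model should learn to emit 6.
pairs = make_prediction_pairs([1, 2, 3, 4, 5, 6], window=5)
# pairs == [([1, 2, 3, 4, 5], 6)]
```

A sequence classifier consumes the same windows but maps each one to a label ("normal"/"anomalous") instead of the next element.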
Related Work
Facebook Mystery Machine
https://blog.acolyer.org/2015/10/07/the-mystery-machine-end-to-end-performance-analysis-of-large-scale-internet-services/
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chow.pdf
Measure end-to-end performance of requests
Infer causal relationships from logs
‒ Adding instrumentation retroactively is an
expensive task
‒ Hypothesize and confirm relationships in
messages
Log Data
‒ Request & host id, timestamp, unique event label
Finding Relationships
‒ Sample requests, store logs in Hive, and run
Hadoop jobs to infer causal relationships
‒ 2 hours of Hadoop processing to analyze 1.3M requests
sampled over 30 days
‒ A hypothesized relationship between two
segments is assumed until a counterexample is found
‒ 3 types of relationship inferred: happens-before,
mutual exclusion, pipeline
Applications
‒ Find critical path and slack segments for
performance optimization
‒ Anomaly detection: 1) select top 5% of end-to-end
latency, 2) identify segments with proportionally
greater representation in the outlier set of
requests than in the non-outlier set.
As the number of traces analyzed increases, the observation of new
counterexamples diminishes, leaving behind only true relationships
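The hypothesize-and-refute idea can be sketched as follows — a simplified, hypothetical re-implementation covering only the happens-before relationship, not Facebook's actual code; all names here are mine.

```python
from itertools import permutations

def infer_happens_before(traces):
    """Start by hypothesizing 'a happens before b' for every ordered pair
    of segments, then discard a hypothesis whenever a trace provides a
    counterexample (b observed at or before a)."""
    segments = {s for trace in traces for s in trace}
    hypotheses = set(permutations(segments, 2))
    for trace in traces:
        position = {s: i for i, s in enumerate(trace)}
        for a, b in list(hypotheses):
            if a in position and b in position and position[a] >= position[b]:
                hypotheses.discard((a, b))  # counterexample found
    return hypotheses

traces = [["connect", "auth", "query"], ["connect", "query", "auth"]]
# Only 'connect before auth' and 'connect before query' survive:
# 'auth before query' is refuted by the second trace.
```

With more traces, spurious orderings keep getting refuted, which mirrors the convergence claim above.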
Related Work
Sequence Analysis of Log Records
Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning. In
Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS '17). ACM, New York, NY, USA, 1285-1298.
Sequence Prediction using Logs
Parsing
‒ Spell tool (ICDM’16) parses logs into patterns that
represent the fixed part of printf-like statements
‒ Log messages ► Log key
‒ https://www.cs.utah.edu/~lifeifei/papers/spell.pdf
Processing
‒ Workflow models are built to help anomaly
diagnosis
‒ Log Key ► Workflow
‒ Log Key + Parameters ► Behavior Model
‒ LSTM is used to model system execution paths
and log parameter values
Precision
‒ F1-score: 96%
How many scenarios need to be labeled? 10 or
10,000?
Dataset distribution
‒ HDFS: 2.9% labeled anomalies
‒ Openstack: 7% anomalies
Do precision numbers hold in more realistically
distributed logs?
Distributed Trace Analysis
Trace and Span Representation
► Training
1. Collect traces
2. Parse and abstract spans
3. Feed span lists to deep neural network
4. Train the network with traces
► Testing
1. Collect real-time traces
2. Parse and abstract spans
3. Feed span lists to deep neural network
4. Retrieve probability matrix
5. Analyze matrix to decide on the severity of the anomaly
1. We analyze historical traces and collect the generated spans to create a structure containing span types (symbols) using label encoding (e.g., 12, 23).
2. The network is trained with the encoded traces. During operation, an LSTM network takes as input a sequence of symbols representing a new real-time trace,
e.g., 1: "72", 2: "83", 3: "54", …n: "23".
3. The output is a matrix with the probability of each symbol belonging to a learned structure.
4. Analyzing the probability matrix makes it possible to decide whether a trace should be marked as normal or anomalous.
In this experiment, we processed >1M spans and >100K traces
Partial view of the spans collected
auth_tokens_v3 1314421019193f1313162f253233191e341a3c3b402f2624192f214521162f163e2746282c192f2e2e24161d2c2c161628241635287a
images_v2 4342421119131616161a311316192216162f23213e1d2127272728241621191d5e5c471d132c4a16241616213c63
_var__detail_flavors_v2.1 5d13421314151316161b193b21242113364024192f162f131d21161621212c27282c2c2d162f5e5c132c16164d
image_schemas_v2 141319133113162f1644161b331e351a213b1613133b13192f5116132113131616252121282c2c162d21582e2e161d13643b132c4a78134d213c21
gather_result 1912141613252128211a1f402f221921162f2713282b2d2416131d13211a5a21
security-groups_v2.0 4c1319131d19131316132f
_var__images_v2 121313251b2f192f35211f241916223b1613231616242713462c53241654192c285a21
_var___var__action_servers_v2.1 61135d106c12131325191b1c193d281a4022136d13162f3b292116231d1627272741283b2416131d5c2f4a5121352f13
neutronclient.v2_0.client.retry_request 162f1319132f165124191316
auth_tokens_v3 13__421019193f1313162f253233191e341a3c3b402f2____4192f214521162f163e2746282c192f2e2f44161d2c2c161628241635287a
Is this trace recognized by the model? Are these changes significant? Are they associated with a system anomaly?
Distributed Trace Analysis
Using LSTMs
1. Generate Synthetic Traces
OPENSTACK_SEQUENCE = [
('keystone', 1.0, 'authenticate user with credentials and generate auth-token'),
('nova-api', 1.0, 'get user request and sends token to Keystone for validation'),
('keystone', 0.2, 'validate the token if not in cache'),
('nova-api', 1.0, 'starts VM creation process'),
('nova-database', 1.0, 'create initial database entry for new VM'),
('nova-scheduler', 1.0, 'locate an appropriate host using filters and weights'),
('nova-database', 1.0, 'execute query'),
('nova-compute', 1.0, 'dispatches request'),
('nova-conductor', 1.0, 'gets instance information'),
('nova-compute', 1.0, 'get image URI from image service'),
('glance-api', 1.0, 'get image metadata'),
('nova-compute', 1.0, 'setup network'),
('neutron-server', .5, 'allocate and configure network IP address'),
('nova-compute', 1.0, 'setup volume'),
('cinder-api', .75, 'attach volume to the instance or VM'),
('nova-compute', 1.0, 'generates data for the hypervisor driver')
]
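Step 1 can be sketched as follows, reusing the (service, execution probability, description) template above; the generator below is a hypothetical sketch (the real one lives in the linked repository), and the helper names are mine.

```python
import random

# (service, execution probability, description) template, as above
OPENSTACK_SEQUENCE = [
    ('keystone', 1.0, 'authenticate user with credentials and generate auth-token'),
    ('nova-api', 1.0, 'get user request and sends token to Keystone for validation'),
    ('keystone', 0.2, 'validate the token if not in cache'),
    ('nova-api', 1.0, 'starts VM creation process'),
    ('nova-database', 1.0, 'create initial database entry for new VM'),
    ('nova-scheduler', 1.0, 'locate an appropriate host using filters and weights'),
    ('nova-database', 1.0, 'execute query'),
    ('nova-compute', 1.0, 'dispatches request'),
    ('nova-conductor', 1.0, 'gets instance information'),
    ('nova-compute', 1.0, 'get image URI from image service'),
    ('glance-api', 1.0, 'get image metadata'),
    ('nova-compute', 1.0, 'setup network'),
    ('neutron-server', 0.5, 'allocate and configure network IP address'),
    ('nova-compute', 1.0, 'setup volume'),
    ('cinder-api', 0.75, 'attach volume to the instance or VM'),
    ('nova-compute', 1.0, 'generates data for the hypervisor driver'),
]

def synthesize_trace(template, rng):
    """Keep each span with its execution probability: 1.0 spans always
    appear; e.g. the 0.2 keystone validation only fires on cache misses."""
    return [name for name, prob, _desc in template if rng.random() < prob]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
traces = [synthesize_trace(OPENSTACK_SEQUENCE, rng) for _ in range(100)]
```

Because three spans are optional (probabilities 0.2, 0.5, and 0.75), the generated traces vary in length between 13 and 16 spans, which is exactly the variation visible in the train dataset below.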
2. Transform traces with service names into sequences of integers
Train dataset
span_list
trace_id
0 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
1 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
2 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
3 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
4 [3, 5, 3, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
... ...
95 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
96 [3, 5, 3, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
97 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
98 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
99 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
[100 rows x 1 columns]
Test dataset
span_list
trace_id
0 [3, 5, 5, 1, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
1 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
4. Encoding trace sequences
● The encoding scheme pads each sequence of inputs with zeros, up to a pre-
defined maximum length.
● This allows pre-allocating a chain of LSTM units of a specific length
● We also pass information about the sequence lengths. This is important for not
treating the zero padding as actual inputs, and for injecting the error signal at
the right unit in the sequence during backpropagation.
One-hot encoding
● One-hot encoding represents categorical variables as binary vectors.
● Categorical values are mapped to integer values.
● Each integer value is represented as a binary vector that is all zeros except
at the index of the integer, which is marked with a 1.
[[0 0 3 ... 6 1 6]
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
...
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
[0 3 5 ... 6 1 6]]
Padding with symbol 0
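Assuming numpy, the padding and one-hot steps can be sketched as below; the function name and the left-padding choice (zeros at the front, matching the matrix above) are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def pad_and_one_hot(span_lists, max_len, n_symbols):
    """Left-pad each encoded trace with the reserved symbol 0 up to
    max_len, then expand every symbol into a one-hot binary vector."""
    padded = np.zeros((len(span_lists), max_len), dtype=int)
    for row, spans in enumerate(span_lists):
        padded[row, max_len - len(spans):] = spans  # left padding, as above
    one_hot = np.eye(n_symbols, dtype=int)[padded]  # (traces, max_len, n_symbols)
    return padded, one_hot

padded, one_hot = pad_and_one_hot([[3, 5, 5, 8], [3, 5]], max_len=5, n_symbols=10)
# padded[1] == [0, 0, 0, 3, 5]; one_hot.shape == (2, 5, 10)
```

Indexing an identity matrix with the padded integer array is a compact way to produce the one-hot tensor fed to the LSTM.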
[0, ['keystone',
'nova-api',
'nova-api',
'nova-database',
'nova-scheduler',
'nova-database',
'nova-compute',
'nova-conductor',
'nova-compute',
'glance-api',
'nova-compute',
'nova-compute',
'cinder-api',
'nova-compute']]
Trace #0 (cache simulation: probability of execution = 75%)
One-hot encoding not shown
Code available at: https://github.com/jorge-cardoso/dta_lstm
Distributed Trace Analysis
Using LSTMs
6. Create the DL model
Keras and TensorFlow
● LSTM network
● Dropout layer of 0.2 (high, but necessary to
avoid quick divergence)
● Softmax activation to generate probabilities
over the different categories
● Because it is a classification problem, the loss is
computed via categorical cross-entropy, which compares
the output probabilities against the one-hot encoding
● ADAM optimizer (instead of classical
stochastic gradient descent) to update the weights
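The softmax and categorical cross-entropy mentioned above can be illustrated in plain numpy — a sketch of what Keras computes internally, not the project's code; the toy scores and target are mine.

```python
import numpy as np

def softmax(z):
    """Squash arbitrary scores into probabilities that sum to 1."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(probs, one_hot_target):
    """Compare predicted probabilities against a one-hot target: the loss
    is the negative log-probability assigned to the true symbol."""
    return -np.sum(one_hot_target * np.log(probs + 1e-12), axis=-1)

probs = softmax(np.array([2.0, 1.0, 0.1]))  # scores for 3 candidate symbols
target = np.array([1, 0, 0])                # true next symbol is index 0
loss = categorical_cross_entropy(probs, target)
# loss falls as the model puts more probability on the true symbol
```

Minimizing this loss over all (input, next-symbol) pairs is what the ADAM optimizer does during training.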
5. Shift traces left
X [[0 0 3 ... 6 1 6]
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
...
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
[0 3 5 ... 6 1 6]]
y [[0 0 5 ... 1 6 0]
[0 0 5 ... 1 6 0]
[0 5 5 ... 1 6 0]
...
[0 0 5 ... 1 6 0]
[0 5 5 ... 1 6 0]
[0 5 5 ... 1 6 0]]
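Step 5 — building next-symbol targets by shifting each trace one position to the left while keeping the left padding in place, as in the X/y pair above — can be sketched with numpy; the helper name is an illustrative assumption.

```python
import numpy as np

def shift_targets(X, pad_symbol=0):
    """For each left-padded trace, keep the padding where it is and shift
    the actual span sequence one step left, appending the padding symbol
    after the last event (so y[t] holds the symbol that follows X[t])."""
    y = np.full_like(X, pad_symbol)
    for row in range(X.shape[0]):
        start = np.argmax(X[row] != pad_symbol)  # index of first real symbol
        span = X[row, start:]
        y[row, start:-1] = span[1:]              # shifted span; last cell stays pad
    return y

X = np.array([[0, 0, 3, 5, 8],
              [0, 3, 5, 8, 9]])
y = shift_targets(X)
# y == [[0, 0, 5, 8, 0],
#       [0, 5, 8, 9, 0]]
```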
Distributed Trace Analysis
Why is trace analysis challenging?
1. Retry pattern
n_tries = 3
while True:
    try:
        ExecuteOperation() # Success
        break
    except TimeoutException:
        n_tries -= 1       # Consume one attempt
        if n_tries == 0:   # Give up
            raise Failure
        sleep(1000)        # Wait until retrying the call
2. Cache-Aside pattern
v = cache.get(k)
if v is None:              # Check if the item is in the cache
    v = store.get(k)       # If not in cache, read from the data store
    cache.put(k, v)        # Put a copy of the item into the cache
3. One-to-many subcalls
@app.route('/servers/<int:user_id>/<int:number>')
def create(user_id, number):
for i in range(number):
authentication(user_id)
# Call remote microservice to create a server
rpc.create_server()
T1: A, B, C, D, E, …
T2: A, B, C, C, D, E, ….
T3: A, B, C, C, C, D, E, …
T4: A, B, C, C, C, Z
T1: A, B, C, D, E, …
T1: A, C, D, E, …
T1: A, B, C, D, E, …
T2: A, B, C, C, D, E, ….
T3: A, B, C, C, C, D, E, …
T4: A, B, C, C, C, C, D, E, …
Many "hidden" software patterns affect running systems: circuit breaker, concurrency and parallelism,
priority queues, publisher-subscriber, bulkhead, etc.
Distributed Trace Analysis
Why is trace analysis challenging?
Span Model
• Introduced by Google Dapper in 2010; base model in Zipkin
• Suitable for describing synchronous REST operations
• Every span consists of 4 events: "client send", "server receive", "server
send", "client receive"
Limitations of the span model
• Non-synchronous execution models, such as queues and asynchronous
executions, cannot be described as span trees
• Fine-grained metrics are hard to express; e.g., an RPC response handler can
read data from the network, enqueue it, and then process it, but queue time
is not taken into account
• Multithreaded processing, like spawning a group of threads and then joining
them, requires additional tags to distinguish it from sequential processing.
Evolution of trace model in Facebook*
a) Started with Dapper span model
b) Idle (blocked) time is taken into account
c) Internal queue is taken into account
d) Client-queue metrics for AJAX processing are added
e) Request and response are completely decoupled; the trace is represented
as a set of events with causal relations
* http://cs.brown.edu/~jcmace/papers/kaldor2017canopy.pdf
Inspired by Facebook Canopy design
(http://cs.brown.edu/~jcmace/papers/kaldor2017canopy.pdf)
Facebook uses end-to-end tracing for all services, from data center to
mobile applications. As of 2017 the system processes 1.3 billion traces per
day, each containing up to several thousand events.
► Monitoring and Incident Detection
Anomaly detection in large-scale systems is widely
studied
IT monitoring systems typically use application logs
and resource metrics to detect failures
⎻ Distributed tracing is becoming the third pillar of
microservices observability
⎻ Logs and metrics were previously investigated
(see [1-4])
► Single and Bi-Models
Use of single and bi-models to capture traces as
sequences of events and their response time
Use a sequential model representation by utilizing
long short-term memory (LSTM) networks
► Results
Detect anomalies in Huawei Cloud infrastructure
The novel approach outperforms other deep learning
methods based on traditional architectures
Distributed Trace Analysis
Using Multimodal Deep Learning
S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection from System Tracing Data using Multimodal Deep Learning, IEEE Cloud 2019, July 3-8, 2019, Milan, Italy.
S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection and Classification using Distributed Tracing and Deep Learning, CCGrid 2019, 14-17.05, Cyprus.
J. Cardoso, Mastering AIOps with Deep Learning, Presentation at SRECon18, 29–31 August 2018, Dusseldorf, Germany.
Long Short Term Memory (LSTM)
LSTMs [1] are models which capture sequential data by using an
internal memory. They are very good at analyzing sequences of
values and predicting the next point of a given time series. This
makes LSTMs adequate for machine learning problems that involve
sequential data (see [3]), such as speech recognition, machine
translation, visual recognition and description, and distributed
trace analysis.
[1] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8
(1997): 1735-1780.
[2] Sequential processing in LSTM (from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
[3] LSTM model description (from Andrej Karpathy: http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
Idea: represent distributed traces as sequences of
events and their response time
A span (event) is a vector of key-value pairs (k_i, v_i)
describing a characteristic of a microservice at time t_i
A trace T is an enumerated collection of events (i.e.,
spans) sorted by timestamp: e_0, e_1, …, e_i [16]
An event contains
⎻ trace id, event id, parent id
⎻ protocol, host ip, status code, url
⎻ response time, timestamp
⎻ and much more
Traces can have different lengths
⎻ T^p = e_0^p, e_1^p, e_2^p, …, e_i^p and T^q = e_0^q, e_2^q, e_1^q, …, e_i^q are
different (e_1 and e_2 are swapped) but originate from
the same activity
⎻ Possibly caused by concurrent systems
Label each span as
⎻ label = concat(url, status code, host ip)
⎻ This yields N_l unique labels
Pad each trace vector up to length T_l, or truncate longer traces
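The labeling rule above can be sketched in Python; the span field names below are assumptions for illustration, not the actual schema.

```python
def span_label(span):
    """Label each span by concatenating its url, status code, and host ip
    (field names here are illustrative, not the real trace schema)."""
    return "{}_{}_{}".format(span["url"], span["status_code"], span["host_ip"])

trace = [
    {"url": "/v3/auth/tokens", "status_code": 200, "host_ip": "126.75.191.253"},
    {"url": "/v2/servers/detail", "status_code": 200, "host_ip": "126.75.191.254"},
]
labels = sorted({span_label(s) for s in trace})  # N_l unique labels overall
```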
Distributed Trace Analysis
Bi-modal Approach
Trace structure
⎻ Trace one-hot categorical encoding [17]
⎻ D_1 has shape (N_t, T_l, N_l)
⎻ N_t = # traces, T_l = max length, N_l = # labels
Response time
⎻ Min-max scaling to [0, 1]
⎻ D_2 has shape (N_t, T_l, 1)
⎻ The last dimension is the response time (duration)
Trace Structure (sequence of events) and
Response time (duration)
e.g., Service 11 → Service 21; duration = 12ms
We model anomaly detection in traces as a sequence-to-
sequence, multi-class, single-label problem
⎻ Use multiple possible labels for one trace that are not
mutually exclusive.
⎻ The partial trace can have multiple subsequent events
LSTM network architecture
⎻ SAD = Structural Anomaly Detection (𝐷1)
Model input
⎻ T_k = e_0, e_1, …, e_{T_l}
Model output
⎻ Given e_0, e_1, …, e_{i-1}: a probability distribution over the N_l unique labels
from L, representing the probability of the next label (event) e_i in
the sequence
Compare the predicted output against the observed label
⎻ input e_0, e_1, …, e_{T_l} → output e_1, …, e_{T_l}, '!0'
⎻ The output is shifted by one event and padded
Update network weights using categorical cross-entropy
loss minimization via gradient descent
Distributed Trace Analysis
Single-Modality LSTM (1)
SAD = Structural Anomaly Detection (𝐷1)
Evaluate if trace T_test is anomalous
⎻ T_test = e_0, e_1, …, e_{T_l}
The network calculates
⎻ A probability distribution P
⎻ P = {l_0: p_0, l_1: p_1, …, l_{N_l}: p_{N_l}}
⎻ The probability of each label of L to appear as the next label value
in a trace
The output layer has a softmax function
⎻ A generalization of the logistic function that "squashes" a
K-dimensional vector z of arbitrary real values to a
K-dimensional vector σ(z) of real values in the range [0, 1]
that add up to 1
Distribute the probability over labels
⎻ σ(z)_i = e^{z_i} / Σ_{j=1}^{K} e^{z_j}
Such that
⎻ Σ_{i=1}^{N_l} p_i = 1
Classification
⎻ Accept top-𝒌 predicted labels as behaviorally correct
⎻ Otherwise, report an anomaly along with the events which
contributed to the decision
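The top-k acceptance rule above can be sketched with numpy; the helper name, k, and the toy distribution are illustrative assumptions.

```python
import numpy as np

def is_anomalous(prob_dist, observed_label, k=3):
    """Accept the observed next label if it is among the k most probable
    labels predicted by the network; otherwise flag an anomaly."""
    top_k = np.argsort(prob_dist)[-k:]  # indices of the k largest probabilities
    return observed_label not in top_k

P = np.array([0.02, 0.60, 0.25, 0.10, 0.03])  # distribution over N_l labels
# label 2 is in the top-3 -> normal; label 0 is not -> anomalous
```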
Distributed Trace Analysis
Single-Modality LSTM (2)
SAD = Structural Anomaly Detection (𝐷1)
We model response time anomaly detection in traces as a
regression task
⎻ E.g., predict the duration of a span
LSTM network architecture
⎻ RTAD = Response Time Anomaly Detection (𝐷2)
⎻ Linear (i.e. identity) activation function
Approach
⎻ At each timestep i, given rt_0, rt_1, …, rt_{i-1}
⎻ Predict the response time rt_i of the next event
Update network weights using mean squared error loss via
gradient descent
Detection
⎻ Compute the squared error distance
⎻ error_i = (rt_i − rt_i^p)^2, where rt_i^p is the predicted value at
timestep i
⎻ Errors are fitted by a Gaussian N(0, σ^2)
⎻ Report the trace as anomalous if the squared error between
prediction and input at time i is outside the 95% confidence
interval
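Assuming a zero-mean Gaussian fitted to training errors, the 95% confidence check can be sketched as below — an equivalent test on |error| > 1.96σ; the helper names and the toy numbers are illustrative, not the paper's code.

```python
import numpy as np

def fit_sigma(train_errors):
    """Fit N(0, sigma^2) to the prediction errors seen on normal traces."""
    return np.sqrt(np.mean(np.square(train_errors)))

def is_rt_anomalous(rt_true, rt_pred, sigma):
    """Flag a response time whose prediction error falls outside the 95%
    confidence interval of the fitted zero-mean Gaussian (|err| > 1.96*sigma)."""
    return abs(rt_true - rt_pred) > 1.96 * sigma

train_errors = np.array([-2.0, 1.5, 0.5, -1.0, 2.0, -0.5])  # ms, illustrative
sigma = fit_sigma(train_errors)
# a 12 ms deviation against errors of ~1-2 ms is flagged as anomalous
```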
Distributed Trace Analysis
Single-Modality LSTM (3)
RTAD = Response Time Anomaly Detection (𝐷2)
Explore the correlation between trace structure and
response time
⎻ Improve accuracy/recall
LSTM network architecture
⎻ Concatenation of both single-modality
architectures in the second hidden layer
Approach
⎻ input e_0, e_1, …, e_{T_l} and rt_0, rt_1, …, rt_{T_l} →
⎻ output e_1, …, e_{T_l}, '!0' and rt_1, rt_2, …, rt_{T_l}, 0
Update network weights using
⎻ The sum of the mean squared error and
categorical cross-entropy losses
Distributed Trace Analysis
Multimodal LSTM (1)
Detection
⎻ Performed by comparing the output element-
wise with the input for both modalities using
the strategy developed in the single-modality
architectures
⎻ Anomaly source: 1) response time, 2)
structural, or 3) both
SAD (𝐷1) and RTAD (𝐷2)
► Datasets
The system under study runs a production OpenStack-based cloud with 1000+ services [19]
Traces collected using Zipkin [16] over 50 days: >4.5M events; 1M traces
► Evaluation Platform
Python using Keras; model trained with batch size = 512, learning rate 0.001, and 400 epochs
PC using GPU-NVIDIA GTX 1060
► Preprocessing
To avoid outliers, select labels that appear more than 1000 times (105 unique labels)
Distribution of trace lengths is imbalanced
⎻ >90% have lengths <10 events
Select only 1000 samples of each trace length
⎻ Requires <1% of all the recorded data
⎻ Efficient and fast for training
For robustness, we also select the traces with lengths between 4 and 20
► Performance
Time to train the multimodal LSTM on 1% of the 1M traces: approx. 30 min
Prediction time per trace: <50 ms
Distributed Trace Analysis
Evaluation
Distributed Trace Analysis
Evaluation
Best results in terms of accuracy of structural anomaly
detection are achieved using the multimodal LSTM predictions
Architectures compared: multimodal LSTM, single-modality LSTM, multimodal dense, single-modality dense
Single- and multimodal LSTMs achieve comparable
accuracy, while single- and multimodal dense architectures
have low accuracies
The multimodal LSTM achieves better accuracy than the
single-modality LSTM for k=1; otherwise it is similar
Accuracy when the anomaly is injected in traces with
different sizes
Multimodal slightly outperforms the single-modality
approach in 9 out of 15 trace lengths for k = 3 and k = 5
Significantly better results are achieved for k = 1 for
almost all of the trace lengths
Distributed Trace Analysis
Evaluation
For SAD, significantly better performance of the multimodal
approach for k = 1
The single-modality LSTM achieves comparable accuracy, while
single- and multimodal dense architectures have low
accuracies
The multimodal LSTM achieves better accuracy than the
single-modality LSTM for k=1; otherwise it is similar
For RTAD, single-task models have lower performance
than both multimodal models
Multimodal approach achieves a higher accuracy for
RTAD
Models have high accuracy even when the length of the
trace increases. This is because the LSTMs are able to
learn long-term dependencies in sequential tasks
[1] F. Schmidt, A. Gulenko, M. Wallschläger, A. Acker, V. Hennig, F. Liu, and O. Kao, "IFTM - unsupervised anomaly detection for
virtualized network function services," in 2018 IEEE International Conference on Web Services (ICWS), July 2018, pp. 187–194.
[2] A. Gulenko, F. Schmidt, A. Acker, M. Wallschläger, O. Kao, and F. Liu, "Detecting anomalous behavior of black-box services
modeled with distance-based online clustering," in 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), Jul 2018,
pp. 912–915.
[3] M. Du, F. Li, G. Zheng, and V. Srikumar, "DeepLog: Anomaly detection and diagnosis from system logs through deep learning," in
Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017, pp. 1285–1298.
[4] F. Schmidt, F. Suri-Payer, A. Gulenko, M. Wallschläger, A. Acker, and O. Kao, "Unsupervised anomaly event detection for cloud
monitoring using online ARIMA," in 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC
Companion), Dec 2018, pp. 71–76.
[16] OpenZipkin, "openzipkin/zipkin," 2018. [Online]. Available: https://github.com/openzipkin/zipkin
[19] A. Shrivastwa, S. Sarat, K. Jackson, C. Bunch, E. Sigler, and T. Campbell, OpenStack: Building a Cloud Environment. Packt
Publishing, 2016.
Further reading
S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection from System Tracing Data using Multimodal Deep Learning, IEEE Cloud
2019, July 3-8, 2019, Milan, Italy.
S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection and Classification using Distributed Tracing and Deep Learning, CCGrid 2019,
14-17.05, Cyprus.
J. Cardoso, Mastering AIOps with Deep Learning, Presentation at SRECon18, 29–31 August 2018, Dusseldorf, Germany.
Distributed Trace Analysis
References