The field of AIOps, also known as Artificial Intelligence for IT Operations, uses advanced technologies to dramatically improve the monitoring, operation, and troubleshooting of distributed systems. Its main premise is that operations can be automated using monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how AIOps – and many related fields such as deep learning, machine learning, distributed traces, graph analysis, time-series analysis, sequence analysis, advanced statistics, NLP, and log analysis – can be applied to effectively detect, localize, predict, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs) by analyzing service management data (e.g., distributed traces, logs, events, alerts, metrics). In particular, this talk will describe how a particular monitoring data structure, called a distributed trace, can be analyzed using deep learning to identify anomalies in its spans. This capability empowers operators to quickly identify which components of a distributed system are faulty.
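The span-anomaly idea above can be sketched in a few lines. This is a minimal illustration only: it uses a simple per-operation z-score baseline as a stand-in for a trained deep model, and the operations, durations, and threshold are hypothetical values invented for the example.

```python
import statistics

# Historical span durations (ms) per operation, assumed to reflect normal behavior.
baseline = {
    "auth": [12.0, 11.5, 12.4, 12.1, 11.8, 12.6],
    "db":   [30.0, 29.0, 31.0, 30.5, 29.5],
}
# "Model": mean and standard deviation per operation (a deep model would
# instead learn these patterns from raw trace data).
model = {op: (statistics.mean(d), statistics.stdev(d)) for op, d in baseline.items()}

def is_anomalous(operation, duration_ms, z_threshold=3.0):
    """Flag a span whose duration deviates strongly from its operation's norm."""
    mean, std = model[operation]
    return std > 0 and abs(duration_ms - mean) / std > z_threshold

# New spans arriving from the tracing backend: (trace_id, operation, duration_ms)
new_spans = [("t9", "auth", 12.3), ("t9", "db", 30.2), ("t10", "auth", 95.0)]
flagged = [s for s in new_spans if is_anomalous(s[1], s[2])]
print(flagged)  # only the 95 ms "auth" span stands out
```

Flagged spans point the operator at the component (here, the "auth" service) that is behaving abnormally.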
AIOps: Anomalies Detection of Distributed Traces – Jorge Cardoso
Introduction to the field of AIOps, large-scale monitoring, and observability. Provides an example illustrating how Deep Learning can be used to analyze distributed traces to reveal exactly which component is causing a problem in microservice applications.
Presentation given at the National University of Ireland, Galway (NUI Galway)
on 2019.08.20.
Thanks to Prof. John Breslin
User and entity behavior analytics: building an effective solution – Yolanta Beresna
This presentation provides an overview of the UEBA space and gives insights into the core components of an effective solution, such as relevant Threat and Attack Scenarios, Data Sources, and various Analytic techniques. It was presented during an ISSA-UK chapter meeting.
AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning – Jorge Cardoso
The field of AIOps, also known as Artificial Intelligence for IT Operations, uses algorithms and machine learning to dramatically improve the monitoring, operation, and maintenance of distributed systems. Its main premise is that operations can be automated using monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how AIOps – and many related fields such as deep learning, machine learning, distributed traces, graph analysis, time-series analysis, sequence analysis, and log analysis – can be applied to effectively detect, localize, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs). In particular, this lecture will describe how a particular monitoring data structure, called a distributed trace, can be analyzed using deep learning to identify anomalies in its spans. This capability empowers operators to quickly identify which components of a distributed system are faulty.
- Azure Databricks provides a curated platform for data science and machine learning workloads using notebooks, data services, and machine learning tools.
- Only a small fraction of real-world machine learning systems is composed of the actual machine learning code, as vast surrounding infrastructure is required for data collection, feature extraction, model training, and deployment.
- Azure Databricks can be used across many industries for applications like customer analytics, financial modeling, healthcare analytics, industrial IoT, and cybersecurity threat detection through machine learning on structured and unstructured data.
The document provides an introduction to Prof. Dr. Sören Auer and his background in knowledge graphs. It discusses his current role as a professor and director focusing on organizing research data using knowledge graphs. It also briefly outlines some of his past roles and major scientific contributions in the areas of technology platforms, funding acquisition, and strategic projects related to knowledge graphs.
GenerativeAI and Automation - IEEE ACSOS 2023.pptx – Allen Chan
Generative AI has been rapidly evolving, enabling different and more sophisticated interactions with Large Language Models (LLMs) like those available in IBM watsonx.ai or Meta Llama2. In this session, we will take a use-case-based approach to look at how we can leverage LLMs together with existing automation technologies like Workflow, Content Management, and Decisions to enable new solutions.
On the Application of AI for Failure Management: Problems, Solutions and Algo... – Jorge Cardoso
Artificial Intelligence for IT Operations (AIOps) is a class of software which targets the automation of operational tasks through machine learning technologies. ML algorithms are typically used to support tasks such as anomaly detection, root-cause analysis, failure prevention, failure prediction, and system remediation. AIOps is gaining increasing interest from industry due to the exponential growth of IT operations and the complexity of new technology. Modern applications are assembled from hundreds of dependent microservices distributed across many cloud platforms, leading to extremely complex software systems. Studies show that cloud environments are now too complex to be managed solely by humans. This talk discusses various AIOps problems we have addressed over the years and gives a sketch of the solutions and algorithms we have implemented. Interesting problems include hypervisor anomaly detection, root-cause analysis of software service failures using application logs, multi-modal anomaly detection, root-cause analysis using distributed traces, and verification of virtual private cloud networks.
This session was presented at the AWS Community Day in Munich (September 2023). It's for builders who have heard the buzz about Generative AI but can’t quite grok it yet. Useful if you are eager to connect the dots on the Generative AI terminology and get a fast start to explore further and navigate the space. This session is largely product agnostic and meant to give you the fundamentals to get started.
A technology poised to create a revolution in every industry, including healthcare. What is it, what are the tools, and what is the outcome?
NASA started research on digital twins for space travel, driven by the need for real-time feedback on components. The concept is now extending even to healthcare, with the prospect of a human digital twin.
This document provides an overview of NVIDIA's accelerated computing capabilities across a wide range of industries and applications. It highlights that NVIDIA GPUs power the majority of the world's top supercomputers and are used for AI, robotics, science, and more. New product announcements include updates to NVIDIA's computing platforms, networking, security, and simulation technologies.
Amidst an industry cloud of confusion about what “AIOps” is and what it can do, these slides, based on the webinar from EMA research, delineate a clear path to victory for business and IT stakeholders seeking to use machine learning to optimize the performance of critical business services.
Survive the fog of system development! Developers' lives have gotten more complex in the last decade. There is too much to learn and understand now, and you need a co-pilot. Let AIOps be that co-pilot.
In this webinar, we'll share use cases and discuss:
What is AIOps?
Why AI and ML are well-suited for Ops and DevOps
A guide for assessing where to automate
This talk overviews my background as a female data scientist, introduces many types of generative AI, discusses potential use cases, highlights the need for representation in generative AI, and showcases a few tools that currently exist.
Our report will provide a look into the technology landscape of the future, including:
- Importance of AI in enabling innovation
- Catalysts of future innovations
- Top technology trends in 2023-2024
- Main benefits of AI adoption
- Steps to prepare for future disruptions.
Download your free copy now and implement the key findings to improve your business.
Logical Data Fabric: Architectural Components – Denodo
Watch full webinar here: https://bit.ly/39MWm7L
Is the Logical Data Fabric one monolithic technology, or does it comprise various components? If so, what are they? In this presentation, Denodo CTO Alberto Pan will elucidate what components make up the logical data fabric.
Promising to transform trillion-dollar industries and address the “grand challenges” of our time, NVIDIA founder and CEO Jensen Huang shared a vision of an era where intelligence is created on an industrial scale and woven into real and virtual worlds at GTC 2022.
Data Con LA 2020
Description
More and more organizations are embracing AI technology by infusing it in their products and services to differentiate themselves against their competitors. AI is being utilized in some sensitive areas of human life. In this session, let's look at some of the principles governing adoption of AI in a responsible manner. Why are companies accelerating adoption of AI?
Increasingly, organizations are accelerating adoption of AI to differentiate their products and services in the market. We have seen the outcomes of this digital transformation in the areas of optimizing operations, engaging customers, empowering employees, and transforming products and services.
*List some of the sensitive use cases where AI is being applied
*Why governing AI is important and what are those principles?
*How Microsoft is approaching it?
Speaker
Suresh Paulraj, Microsoft, Principal Cloud Solution Architect Data & AI
Generative AI models, such as ChatGPT and Stable Diffusion, can create new and original content like text, images, video, audio, or other data from simple prompts, as well as handle complex dialogs and reason about problems with or without images. These models are disrupting traditional technologies, from search and content creation to automation and problem solving, and are fundamentally shaping the future user interface to computing devices. Generative AI can apply broadly across industries, providing significant enhancements for utility, productivity, and entertainment. As generative AI adoption grows at record-setting speeds and computing demands increase, on-device and hybrid processing are more important than ever. Just like traditional computing evolved from mainframes to today’s mix of cloud and edge devices, AI processing will be distributed between them for AI to scale and reach its full potential.
In this presentation you’ll learn about:
- Why on-device AI is key
- Full-stack AI optimizations to make on-device AI possible and efficient
- Advanced techniques like quantization, distillation, and speculative decoding
- How generative AI models can be run on device and examples of some running now
- Qualcomm Technologies’ role in scaling on-device generative AI
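Of the optimization techniques listed above, quantization is the most common. A minimal sketch of symmetric per-tensor int8 weight quantization, with hypothetical weight values chosen for illustration:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: approximate each weight w as scale * q,
    where q is an integer in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the quantized integers."""
    return [qi * scale for qi in q]

# Hypothetical float32 weights from some layer
weights = [0.02, -0.51, 0.34, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing `q` (1 byte each) plus one `scale` shrinks the model roughly 4x versus float32, at the cost of a bounded rounding error of at most half a quantization step per weight.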
Benchmark comparison of Large Language Models – Matej Varga
The document summarizes the results of a benchmark comparison that tested several large language models across different skillsets and domains. It shows that GPT-4 performed the best overall based on metrics like logical robustness, correctness, efficiency, factuality, and common sense. Tables display the scores each model received for different skillsets and how they compare between open-source, proprietary, and oracle models. The source is listed as an unreviewed preprint paper and related GitHub page under a Creative Commons license.
Using Databricks as an Analysis Platform – Databricks
Over the past year, YipitData spearheaded a full migration of its data pipelines to Apache Spark via the Databricks platform. Databricks now empowers its 40+ data analysts to independently create data ingestion systems, manage ETL workflows, and produce meaningful financial research for our clients.
Presentation "AI Product Manager" at the Digital Product School (on 10/22/2020) from Datentreiber.
Content:
• Overview of the AI product innovation cycle
• AI Thinking: ideating and prioritizing the right use cases
• AI Prototyping: testing critical hypotheses with experiments
• AI Engineering: building scalable & user-friendly AI applications
• AI Management: maintaining AI solutions with DataOps
• Outlook: how to become an AI product manager (links & more)
The introductory morning session will discuss big data challenges and provide an overview of the AWS Big Data Platform. We will also cover:
• How AWS customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
• Reference architectures for popular use cases, including: connected devices (IoT), log streaming, real-time intelligence, and analytics.
• The AWS big data portfolio of services, including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR) and Redshift.
• The latest relational database engine, Amazon Aurora - a MySQL-compatible, highly available relational database engine which provides up to five times better performance than MySQL at one-tenth the cost of a commercial database.
• Amazon Machine Learning – the latest big data service from AWS provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology.
Knowledge Graphs - The Power of Graph-Based Search – Neo4j
1) Knowledge graphs are graphs that are enriched with data over time, resulting in graphs that capture more detail and context about real-world entities and their relationships. This allows the information in the graph to be meaningfully searched.
2) In Neo4j, knowledge graphs are built by connecting diverse data across an enterprise using nodes, relationships, and properties. Tools like natural language processing and graph algorithms further enrich the data.
3) Cypher is Neo4j's graph query language that allows users to search for graph patterns and return relevant data and paths. This reveals why certain information was returned based on the context and structure of the knowledge graph.
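The pattern-matching idea behind Cypher can be mimicked over a toy in-memory graph. This is an illustrative sketch only, not the Neo4j API; the labels, names, and relationship data are invented for the example:

```python
# The Cypher pattern  MATCH (p:Person)-[:WORKS_AT]->(c:Company) RETURN p.name, c.name
# asks for every Person connected to a Company by a WORKS_AT relationship.
nodes = {
    1: {"label": "Person", "name": "Ada"},
    2: {"label": "Person", "name": "Grace"},
    3: {"label": "Company", "name": "Acme"},
}
relationships = [(1, "WORKS_AT", 3), (2, "WORKS_AT", 3)]  # (source, type, target)

def match(src_label, rel_type, dst_label):
    """Return (source, target) node pairs matching the (src)-[rel]->(dst) pattern."""
    return [
        (nodes[s], nodes[d])
        for s, r, d in relationships
        if r == rel_type
        and nodes[s]["label"] == src_label
        and nodes[d]["label"] == dst_label
    ]

pairs = [(p["name"], c["name"]) for p, c in match("Person", "WORKS_AT", "Company")]
```

In a real knowledge graph the same query also returns the matched paths, which is what lets users see *why* a result was returned.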
This document discusses privacy-preserving artificial intelligence and differential privacy. Differential privacy is a technique that adds calibrated mathematical noise to aggregate data so that an individual's private data is provably protected. It allows organizations to gain insights from large datasets while guaranteeing that no single person's data can be identified. The document provides an overview of differential privacy, how it works, examples of its applications, and addresses some common misconceptions about the technique.
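The noise-adding step described above is typically the Laplace mechanism: a counting query has sensitivity 1 (one person changes the count by at most 1), so adding Laplace noise with scale 1/ε makes the released count ε-differentially private. A minimal sketch with hypothetical numbers:

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon, rng):
    """Release a count with Laplace(1/epsilon) noise; sensitivity of a count is 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Example: release "100 patients matched the query" with epsilon = 1.0
rng = random.Random(42)
samples = [private_count(100, epsilon=1.0, rng=rng) for _ in range(10_000)]
mean = sum(samples) / len(samples)
```

Each individual release is noisy, but the noise is unbiased, so aggregate insights survive while any single person's presence in the data stays deniable.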
This document summarizes the key phases and sections of an IT 265 Data Structures course project. The project covered common data structures like lists, stacks, queues, trees, and sorting/searching algorithms. It evaluated recursion and provided examples of insertion sort, bubble sort, and selection sort. The goal was to demonstrate understanding of these fundamental data structures and algorithms through code examples and explanations of their applications and efficiency.
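One of the sorts mentioned above, insertion sort, can be shown in a few lines (a generic sketch, not the course project's code):

```python
def insertion_sort(items):
    """Return a sorted copy: grow a sorted prefix, inserting each new item
    into its correct position by shifting larger elements right. O(n^2) worst
    case, but fast on nearly-sorted input."""
    a = list(items)
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]  # shift the larger element one slot right
            j -= 1
        a[j + 1] = key  # drop the key into the gap
    return a

print(insertion_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]
```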
This document discusses artificial intelligence and machine learning. It explains that deep learning is a key area of focus. It also outlines some fundamental components of an AI system including inputs, an inference engine, and outputs. Additionally, it discusses Google Cloud Platform's key AI components like TensorFlow, APIs for speech, video, vision, and translation. Finally, it provides examples of enterprise applications of AI in various industries.
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow – Trivadis
This document provides an overview of artificial intelligence trends and applications in development and operations. It discusses how AI is being used for rapid prototyping, intelligent programming assistants, automatic error handling and code refactoring, and strategic decision making. Examples are given of AI tools from Microsoft, Facebook, and Codota. The document also discusses challenges like interpretability of neural networks and outlines a vision of "Software 2.0" where programs are generated automatically to satisfy goals. It emphasizes that AI will transform software development over the next 10 years.
The Study of the Large Scale Twitter on Machine Learning – IRJET Journal
This document discusses a study on integrating machine learning tools into a Hadoop-based analytics platform at Twitter. Specifically, it focuses on extending the Pig dataflow language to provide predictive analytics capabilities using machine learning algorithms like logistic regression and decision trees. Key points discussed include developing storage functions in Pig for both batch and online learning algorithms, training models by streaming training data through the functions, and scaling out training via model ensembles. An example application of sentiment analysis on tweets using these new Pig extensions is also described to illustrate the end-to-end workflow.
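The online learning algorithms mentioned above train by streaming examples through the model one at a time. A minimal stand-in sketch of online logistic regression with per-example SGD updates (not the Pig extensions themselves; the toy "sentiment" features and hyperparameters are invented for illustration):

```python
import math

def online_logistic_regression(stream, n_features, lr=0.1, epochs=20):
    """Train by streaming (features, label) pairs, one SGD update per example."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in stream:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of label 1
            g = p - y                       # gradient of log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Toy data: feature 0 = count of positive words, feature 1 = count of negative words
data = [([2, 0], 1), ([3, 1], 1), ([0, 2], 0), ([1, 3], 0)]
w, b = online_logistic_regression(data, n_features=2)

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0
```

Because each update touches one example, the same code works whether the training data arrives in a batch or as a live stream, which is what makes this style of learner easy to embed in a dataflow system.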
Dell NVIDIA AI Roadshow - South Western Ontario – Bill Wong
- Artificial intelligence (AI) mimics human intelligence through machine algorithms like those used for chess and facial recognition. Machine learning (ML) is a subset of AI that uses algorithms to parse data, learn from it, and make predictions. Deep learning (DL) uses artificial neural networks to discover relationships in data and is used for applications like driverless cars and cybersecurity.
- AI technologies are enabling digital transformation and require infrastructure like edge computing, GPUs, FPGAs, deep learning accelerators, and specialized hardware to power applications of AI, ML, and DL. Dell Technologies provides platforms and solutions to accelerate AI workloads and support digital transformation.
Hari Arjun Duche has over 12 years of experience working with companies like Persistent Systems and IBM India Software Labs. He has extensive experience in database internals, data warehousing, and business intelligence. Some of his areas of expertise include RDBMS like Netezza and PostgreSQL, programming languages like C/C++, and tools like GDB. He has worked on projects involving data engine design, data migration, performance improvement, and developing new product features. Hari has published 4 patents and received several awards for his work.
A confluence of events is accelerating the growth of AI in the Enterprise - (i) The COVID pandemic is accelerating the digital transformation of enterprises, (ii) increased digital sales & digital interaction is fueling interest in operationalizing AI to drive revenue and cost efficiencies and (iii) Enterprise databases and enterprise apps are infusing AI to transparently augment predictive capabilities for clients. Enterprise Power Systems are pillars of the global economy hosting our trinity of operating systems
Lambda Architecture 2.0 Convergence between Real-Time Analytics, Context-awar...Sabri Skhiri
At Huawei, we have developed a scalable Complex Event Processing with a significant improvement of the expressiveness. In the scope of the "context-aware" distributed systems, we need to define new architecture patterns. In this way we open new doors to new features and capabilities.
The document discusses several cloud computing platforms including Google App Engine, Amazon Web Services, Microsoft Azure, Hadoop, and Aneka. It provides descriptions of each platform, what types of applications they support, their capabilities and limitations. Key points covered include:
- Google App Engine allows developers to build applications using Python and scales automatically. Amazon Web Services provides infrastructure services like storage, computing and databases.
- Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It faces challenges around complexity, tuning, and platform choices.
- Microsoft Azure provides a cloud platform for running Windows applications and storing data, allowing applications to scale out across many servers.
Fundamental question and answer in cloud computing quiz by animesh chaturvediAnimesh Chaturvedi
The document contains questions and answers related to a cloud computing exam. It includes 5 questions worth 5 marks each on topics like the 2013 ACM Turing Award winner and their contributions to distributed systems and cloud computing, different cloud computing models, data transfer methods, descriptions of Google File System and Hadoop Distributed File System, and architectures for Hadoop on Google Cloud Platform and web applications on Google App Engine. The answers to the questions are provided in slides within the linked website.
This document discusses cloud computing, big data, Hadoop, and data analytics. It begins with an introduction to cloud computing, explaining its benefits like scalability, reliability, and low costs. It then covers big data concepts like the 3 Vs (volume, variety, velocity), Hadoop for processing large datasets, and MapReduce as a programming model. The document also discusses data analytics, describing different types like descriptive, diagnostic, predictive, and prescriptive analytics. It emphasizes that insights from analyzing big data are more valuable than raw data. Finally, it concludes that cloud computing can enhance business efficiency by enabling flexible access to computing resources for tasks like big data analytics.
IRJET- Search Improvement using Digital Thread in Data AnalyticsIRJET Journal
This document discusses the use of digital thread in data analytics to improve search and provide end-to-end visibility across product lifecycles. Digital thread is a communication system that connects manufacturing process elements and provides a complete view of each element throughout the lifecycle. It allows sharing of information across organizations and suppliers. Digital thread brings quality gains by managing large amounts of data and complex supply chains. It helps enterprises quickly redesign products and meet timelines while maintaining visibility of each component's journey. The document proposes using a Neo4j graph database hosted on AWS cloud to implement a digital thread that links product data. This would provide security, performance, and analytics benefits across the overall manufacturing process.
This document provides a summary of Abby Brown's technical experience including book reviews and editor roles, software patents submitted and issued, research projects and presentations conducted, and technical memberships. Key details include serving as technical editor for two books on Tibco software, submitting several patents around automated tools for services, metadata, and network alarms, presenting on topics such as cloud computing and SOA, and holding memberships in technical organizations like IEEE, The Open Group, and OASIS.
Data Analysis and Report Generation in Enterprise Mobility Solution IRJET Journal
This document discusses data analysis and report generation in enterprise mobility solutions. It describes how data is collected from multiple sources using backend components and RESTful web services. The data, which is in JSON format, is analyzed and converted into reports using the AmCharts API. Various visualization methods like Google Charts and Google Analytics are used to generate visual analytics. The analyzed data and reports are then presented to users through interfaces like HTML5 and CSS in formats like tables, graphs and charts. The overall aim is to help business managers make more informed decisions by analyzing enterprise data collected from employee mobile device use.
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopHazelcast
In this webinar
This talk identifies several shortcomings of Apache Hadoop and presents an alternative approach for building simple and flexible Big Data software stacks quickly, based on next generation computing paradigms, such as in-memory data/compute grids. The focus of the talk is on software architectures, but several code examples using Hazelcast will be provided to illustrate the concepts discussed.
We’ll cover these topics:
-Briefly explain why Hadoop is not a universal, or inexpensive, Big Data solution – despite the hype
-Lay out technical requirements for a flexible Big/Fast Data processing stack
-Present solutions thought to be alternatives to Hadoop
-Argue why In-Memory Data/Compute Grids are so attractive in creating future-proof Big/Fast Data applications
-Discuss how well Hazelcast meets the Big/Fast Data requirements vs Hadoop
-Present several code examples using Java and Hazelcast to illustrate concepts discussed
-Live Q&A Session
Presenter:
Jacek Kruszelnicki, President of Numatica Corporation
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...IRJET Journal
This document proposes using Apache Hadoop and a data-aware cache framework called Dache to analyze large amounts of social media data from Twitter in real-time. The goals are to overcome limitations of existing analytics tools by leveraging Hadoop's ability to handle big data, improve processing speed through Dache caching, and provide visualizations of trends. Data would be grabbed from Twitter using Flume, stored in HDFS, converted to CSV format using MapReduce, analyzed using Dache to optimize Hadoop jobs, and visualized using tools like Tableau. The system aims to efficiently analyze social media trends at low cost using open source tools.
The document discusses artificial intelligence (AI) and data analytics opportunities in the oil and gas industry. It provides examples of how AI can be used for virtual assistants, video analytics, fleet management, production optimization, preventative maintenance, and precision drilling. The benefits of these applications include increased profitability, risk mitigation, improved agility, and cost reductions. It also discusses Dell Technologies AI and HPC solutions that can enable oil and gas companies to leverage AI for applications like reservoir modeling and autonomous vehicles.
How to add Artificial Intelligence Capabilities to Existing Software PlatformsHarish Nalagandla
Artificial Intelligence is real and over the next few years, will significantly change the world as we know it. Even though there is some hype around this technology, many companies have put this to practical use, helping businesses delight their customers, increase engagement on their Platforms, and ultimately delivering positive financial results. Many companies have already invested into IT Platforms and it is not practical to invest in bringing up an Artificial Intelligence Platform in parallel. So, the more practical approach is to add Artificial Intelligence capabilities to existing Platforms.
In this keynote session, Harish Nalagandla, Director of Engineering, Enterprise Services Platform, PayPal, will discuss the pillars that form the foundation of Artificial Intelligence and will describe an approach towards how companies with existing systems can add Artificial Intelligence capabilities to their existing Platforms. This session will present a high level blueprint to help guide organizations in their own journey towards adoption of Artificial Intelligence technologies as they make progress towards digital transformation. This is a high level presentation targeted towards technology executives.
In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms cannot be done any longer manually or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally exploiting the use of AI to provide tools for autonomous cloud operations. This talk will explain how deep learning, distributed traces, and time-series analysis (sequence analysis) can be used to effectively detect anomalous cloud infrastructure behaviors during operations to reduce the workload of human operators. The iForesight system is being used to evaluate this new O&M approach. iForesight 2.0 is the result of 2 years of research with the goal to provide an intelligent new tool aimed at SRE cloud maintenance teams. It enables them to quickly detect and predict anomalies thanks to the use of artificial intelligence when cloud services are slow or unresponsive.
Cloud Reliability: Decreasing outage frequency using fault injectionJorge Cardoso
Invited Keynote at the 9th International Workshop on Software Engineering for Resilient Systems, September 4-5, 2017, Geneva, Switzerland
Title: Cloud Reliability: Decreasing outage frequency using fault injection
Abstract: In 2016, Google Cloud had 74 minutes of total downtime, Microsoft Azure had 270 minutes, and Amazon Web Services had 108 minutes (see cloudharmony.com). Reliability is one of the most important properties of a successful cloud platform. Several approaches can be explored to increase reliability, ranging from automated replication, to live migration, to formal system analysis. Another interesting approach is to use software fault injection to test a platform during prototyping, implementation and operation. Fault injection was popularized by Netflix and their Chaos Monkey fault-injection tool to test cloud applications. The main idea behind this technique is to inject failures in a controlled manner to guarantee the ability of a system to survive failures during operations. This talk will explain how fault injection can also be applied to detect vulnerabilities of the OpenStack cloud platform and how to effectively and efficiently detect the damages caused by the faults injected.
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...Jorge Cardoso
Lecture given at the Technical University of Munich, 12 December 2016, on Cloud Operations and Analytics: Improving Distributed Systems Reliability using Fault Injection.
Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Twitter: @CloudRR2016
Gartner analyzed data centers for a period of 10 years and found that 47% of all problems were caused by cloud service outages. The duration of outages ranged between 40 minutes and five days. The Ponemon Institute studied the financial impact and found that on average outages cost US$690,204, with an average downtime cost of US$6,828 per minute. These results are important due to the economic impact of unplanned outages on cloud operations, which calls for higher platform reliability.
The first part of this talk will present the mechanisms that pioneers, such as Amazon, Google, and Netflix, have already developed to increase the reliability of their cloud platforms. The second part of the talk will describe how Huawei Research is exploring the use of fault-injection mechanisms to effectively increase the reliability of the Open Telekom Cloud platform from Deutsche Telekom.
This document discusses increasing cloud resilience through fault injection testing. It describes using the Butterfly Effect system to intentionally inject failures like hardware failures, network issues, and software bugs into OpenStack deployments. This allows failures to be tested, repairs to be learned, and future failures to be predicted. Monitoring tools like Monasca and Zabbix are used to detect damages and visualize results. The goal is to automate repair and increase reliability through continuous testing and learning from failures.
For more than 10 years, research on service descriptions has mainly studied software-based services and provided languages such as WSDL, OWL-S, WSMO for SOAP, and hREST for REST. Nonetheless, recent developments from service management (e.g., ITIL and COBIT) and cloud computing (e.g., Software-as-a-Service) have brought new requirements to service description languages: the need to also model business services and account for the multi-faceted nature of services. Business-orientation, co-creation, pricing, legal aspects, and security issues are all elements which must also be part of service descriptions. While ontologies such as e3service and e3value provided a first modeling attempt to capture a business perspective, concerns on how to contract services and the agreements entailed by a contract also need to be taken into account. This has for the most part been disregarded by the e3-family of ontologies. In this paper, we review the evolution and provide an overview of Linked USDL, a comprehensive language which provides a (multi-faceted) description to enable the commercialization of (business and technical) services over the web.
Ten years of service research from a computer science perspectiveJorge Cardoso
…It has been more than 10 years since a strong research stream on services started from the field of computer science. The main trigger was without a doubt the introduction of the Web Service Description Language (WSDL), a specification to represent a piece of software functionality which could be remotely invoked. Nonetheless, this was only the “tipping point”. The generalized interest in this new development was followed by interesting topics of research on the application of semantics to enhance the description of services, the composition of services into processes, the analysis of the quality of services, the complexity of processes supporting services, and the development of comprehensive service description languages. This seminar will provide an overview of the main research topics around services and will glimpse at a new research field on the analysis of service networks...
Cloud Computing Automation: Integrating USDL and TOSCAJorge Cardoso
-- Presented at CAiSE 2013, Valencia, Spain --
Standardization efforts to simplify the management of cloud applications are being conducted in isolation. The objective of this paper is to investigate to what extent two promising specifications, USDL and TOSCA, can be integrated to automate the lifecycle of cloud applications. In our approach, we selected a commercial SaaS CRM platform, modeled it using the service description language USDL, modeled its cloud deployment using TOSCA, and constructed a prototypical platform to integrate service selection with deployment. Our evaluation indicates that a high level of integration is possible. We were able to fully automatize the remote deployment of a cloud service after it was selected by a customer in a marketplace. Architectural decisions emerged during the construction of the platform and were related to global service identification and access, multi-layer routing, and dynamic binding.
Understanding how services operate as part of large scale global networks, the related risks and gains of different network structures and their dynamics is becoming increasingly critical for society. Our vision and research agenda focuses on the particularly challenging task of building, analyzing, and reasoning about global service networks. This paper explains how Service Network Analysis (SNA) can be used to study and optimize the provisioning of complex services modeled as Open Semantic Service Networks (OSSN), a computer-understandable digital structure which represents connected and dependent services.
Open Semantic Service Networks: Modeling and AnalysisJorge Cardoso
A new interesting research area is the representation and analysis of the networked economy using Open Semantic Service Networks (OSSN). OSSN are represented using the service description language
USDL to model nodes and using the service relationship model OSSR to model edges. Nonetheless, in their current form USDL and OSSR do not provide constructs to capture the dynamic behavior of service networks. To bridge this gap, we used the General System Theory (GST) as a framework guiding the extension of USDL and OSSR to model dynamic OSSN. We evaluated the extensions made by applying USDL and OSSR to two distinct types of dynamic OSSN analysis: 1) evolutionary by using a Preferential Attachment (PA) and 2) analytical by using concepts from System Dynamics (SD). Results indicate that OSSN can constitute the first stepping stones toward the analysis of global service-based economies.
Modeling Service Relationships for Service NetworksJorge Cardoso
The last decade has seen an increased interest in the study of networks in many fields of science. Examples are numerous, from sociology to biology, and to physical systems such as power grids. Nonetheless, the field of service networks has received less attention. Previous research has mainly tackled the modeling of single service systems and service compositions, often focusing only on studying temporal relationships between services. The objective of this paper is to propose a computational model to represent the various types of relationships which can be established between services systems to model service networks. This work acquires a particular importance since the study of service networks can bring new scientific discoveries on how service-based economies operate at a global scale.
The document discusses the Linked-USDL (Unified Service Description Language), which is an ontology for describing services using semantic web principles. It provides an overview of the Linked-USDL modules, syntax, core classes and properties for describing services, service models, offerings, interactions and more. The goal is to develop a standardized way to describe services in a holistic manner, covering both business and technical aspects.
Challenges for Open Semantic Service Networks: models, theory, applications Jorge Cardoso
The document discusses challenges for open semantic service networks. It provides an overview of different types of networks including the world wide web, social networks, biological networks, and service networks. It then summarizes research on services from software and IT, business model, and design perspectives. Key aspects of open semantic service networks are that they are constructed by accessing, retrieving and combining service and relationship models which are openly available and use semantic descriptions. Building blocks for open semantic service networks include modeling services, modeling service relationships, populating models, analyzing service networks.
Description and portability of cloud services with USDL and TOSCAJorge Cardoso
The provisioning and management of cloud services are major concerns since they bring clear benefits such as elasticity, flexibility, scalability, and high availability of applications for enterprises. Two emerging contributions set semantics and machine-understandable specifications for the description and portability of cloud-based services: USDL and TOSCA. In this talk we will explain how both can be articulated to work in conjunction. The Unified Service Description Language (USDL) was created for describing business or real world services to allow services to become tradable and consumable on marketplaces. On the other hand, the Topology and Orchestration Specification for Cloud Applications (TOSCA) was standardized to enable the portability of complex cloud applications and their management across different cloud providers.
To address the emerging importance of services and the relevance of relationships, we have developed and introduced the concept of Open Semantic Service Network (OSSN). OSSN are networks which relate services with the assumption that firms make the information of their services openly available using suitable models. Services, relationships and networks are said to be open (similar to LOD), when their models are transparently available and accessible by external entities and follow an open world assumption. Networks are said to be semantic when they explicitly describe their capabilities and usage, typically using a conceptual or domain model, and ideally using Semantic Web standards and techniques. One limitation of OSSNs is that they were conceived without accounting for the dynamic behavior of service networks. In other words, they can only capture static snapshots of service-based economies but do not include any mechanism to model reactions and effects that services have on other services and the notion of time
This document describes the Open Semantic Service Networks research group at the University of Coimbra in Portugal. It introduces the group's supervisors and students, and summarizes each student's research project related to semantic services, including their background, problem being addressed, and technical solution. The group's projects involve models, platforms, applications, and visualizations for open semantic service networks.
This document proposes SEASIDE, a systematic methodology for developing Internet-based self-services (ISS) from analysis through deployment. SEASIDE leverages enterprise architectures, model-driven development, MVC patterns, and platform as a service. It was applied to develop an ITIL incident management service as a case study. SEASIDE increases traceability, consistency, and reduces complexity of ISS development.
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdfFlorence Consulting
Fourteenth Milan Meetup, held in Milan on 23 May 2024 from 17:00 to 18:30, in person and remotely.
We discussed how Axpo Italia S.p.A. reduced its technical debt by migrating its APIs from Mule 3.9 to Mule 4.4, also moving from on-premises to CloudHub 1.0.
1. Distributed Trace & Log Analysis using ML
Prof. Jorge Cardoso
E-mail: jorge.cardoso@huawei.com
Chief Architect for AIOps
Munich & Dublin Research Center
2020.11.04
Definition: AIOps platforms utilize big data, modern machine learning and other advanced analytics
technologies to directly and indirectly enhance IT operations (monitoring, automation and service
desk) functions with proactive, personal and dynamic insight.
Gartner
Huawei SRE Summit: Running Planet Scale Systems
Dublin, 2-4 November 2020
2. ULTRA-SCALE AIOPS LAB 1
Intelligent Cloud Operations
The emerging field of AIOps
The field of AIOps, also known as Artificial Intelligence for IT Operations, uses advanced technologies to dramatically improve
the monitoring, operation, and troubleshooting of distributed systems. Its main premise is that operations can be automated using
monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how
AIOps – and many related fields such as deep learning, machine learning, distributed traces, graph analysis, time-series
analysis, sequence analysis, advanced statistics, NLP and log analysis – can be explored to effectively detect, localize,
predict, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs) by analyzing service management
data (e.g., distributed traces, logs, events, alerts, metrics). In particular, this talk will describe how a particular monitoring data
structure, called distributed traces, can be analyzed using deep learning to identify anomalies in its spans. This capability
empowers operators to quickly identify which components of a distributed system are faulty.
Planet/large-scale Distributed systems and cloud computing
Distributed traces, logs, events, alerts, metrics, ….
Big Data platforms with Kubernetes, Hadoop, Spark, Flink, …
Deep Learning, machine learning, data mining, advanced statistics, time-series analysis, NLP, ….
Detect, localize, predict, and remediate failures in infrastructures
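The span-level anomaly detection described above uses deep learning in the actual talk; as a much-simplified statistical stand-in, the same idea can be sketched by flagging spans whose duration deviates strongly from the historical durations of the same operation. The span fields, history data, and z-score threshold below are hypothetical illustrations, not the production model.

```python
from statistics import mean, stdev

def anomalous_spans(history, trace, z_threshold=3.0):
    """Flag spans whose duration deviates strongly from the
    historical durations observed for the same operation name."""
    anomalies = []
    for span in trace:
        past = history.get(span["name"], [])
        if len(past) < 2:
            continue  # not enough history to judge this operation
        mu, sigma = mean(past), stdev(past)
        if sigma == 0:
            continue
        z = abs(span["duration"] - mu) / sigma
        if z > z_threshold:
            anomalies.append((span["name"], round(z, 1)))
    return anomalies

# Hypothetical history of span durations (microseconds) per operation
history = {"get": [120_000, 125_000, 123_000, 121_000],
           "put": [80_000, 82_000, 79_000, 81_000]}
trace = [{"name": "get", "duration": 124_957},
         {"name": "put", "duration": 910_000}]  # "put" is far off its history

print(anomalous_spans(history, trace))  # only "put" is flagged
```

A deep-learning detector replaces the per-operation mean/deviation with a learned model of normal span sequences, but the operator-facing output is the same: which spans, and therefore which components, look faulty.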
Dr. Jorge Cardoso is Chief Architect for Planet-scale AIOps at Huawei’s Munich and Ireland Research Centers. Previously he
worked for several major companies such as SAP Research (Germany) on the Internet of Services and the Boeing Company in
Seattle (USA) on Enterprise Application Integration. He previously gave lectures at the Karlsruhe Institute of Technology
(Germany), University of Georgia (USA), University of Coimbra and University of Madeira (Portugal). His current research
involves the development of the next generation of AIOps platforms, Cloud Operations and Analytics tools driven by AI, Cloud
Reliability and Resilience, and High Performance Business Process Management systems. He has a Ph.D. in Computer
Science from the University of Georgia (USA).
GitHub | Slideshare.net | GoogleScholar
Interests: AIOps, Service Reliability Engineering, Cloud Computing, Distributed Systems, Business Process Management
3. ULTRA-SCALE AIOPS LAB 2
HUAWEI CLOUD
Site Reliability Engineering (SRE)
Reliability is an important feature of HUAWEI CLOUD. SRE is responsible for it …
50% software engineer / 50% system engineer
Load slows down systems
Elite forces, focusing on high-ROI work
Automation
Set up quantified SLOs
Run error budgets
Self-constraint, balanced Dev/Ops
Fact oriented
Four golden signals
Alerting from time-series
Eliminating toil
Implement monitoring and develop handling processes
Diagnosis, analysis, and detailed data
Define contextual, customer-focused SLOs
70% of outages due to changes in a live system
Addressing cascading failures
Take turns to monitor and document solutions
Distributed consensus for reliability
Regular review meetings
Humans add latency
Training through brain games
Perform regular drills in the environment
Guarantee that there are few problems
It's either a problem or a quick recovery
The recovery tool is designed in the early stage of the process
Continuous review and optimization
… a few numbers…
Worldwide, Huawei Cloud has 44 availability zones across 23 regions (June 2019)
More than 180 cloud services and 180 solutions for a wide range of sectors
Customers include the European Organization for Nuclear Research (CERN), PSA Group, Shenzhen Airport, Port of Tianjin, …
SRE HUAWEI CLOUD
4. ULTRA-SCALE AIOPS LAB 3
AI for IT Operations (AIOps)
Bringing AI to O&M
“Virtyt Koshi, the EMEA general manager for virtualization vendor Mavenir, reckons Google is able to run all of its data centers in the Asia-Pacific with only about 30 people, and that a telco would typically have about 3,000 employees to manage infrastructure across the same area."
“We began applying machine learning two years ago (2016) to operate
our data centers more efficiently… over the past few months,
DeepMind researchers began working with Google’s data center team
to significantly improve the system’s utility. Using neural networks
trained on different operating scenarios and parameters, we created a
more efficient and adaptive framework to understand data center
dynamics and optimize efficiency.” Eric Schmidt, Dec. 2018
SRE / O&M Activities
System monitoring and 24x7 technical support, Tier 1-3 support
Troubleshooting and resolution of operational issues
Backup, restoration, archival services
Update, distribution and release of software
Change, capacity, and configuration management
…
Harvard Business Review
Early indicators
Moogsoft AIOps,
Amazon EC2
Predictive Scaling,
Azure VM resiliency,
Amazon S3 Intelligent
Tiering
38.4% of organizations take
more than 30 minutes to
resolve IT incidents that
impact consumer-facing
services (PagerDuty)
https://www.lightreading.com/automation/google-has-intent-to-cut-humans-out-of-network/d/d-id/742158
https://www.lightreading.com/automation/automation-is-about-job-cuts---lr-poll/d/d-id/741989
https://www.lightreading.com/automation/the-zero-person-network-operations-center-is-here-(in-finland)/d/d-id/741695
5. ULTRA-SCALE AIOPS LAB 4
Intelligent Operation & Maintenance
Troubleshooting
Anomaly Detection
Response Time Analysis
‒ A service started responding to requests more slowly than
normal
‒ The change happened suddenly as a consequence of
regressionin the latest deployment
System Load
‒ The demand placed on the system, e.g., REST API requests
per second, has increased since yesterday
Error Analysis
‒ The rate of requests that fail -- either explicitly (HTTP 5xx) or
implicitly (HTTP 2xx with wrong content) -- is increasing slowly
but steadily
System Saturation
‒ The resources (e.g., memory, I/O, disk) used by key controller
services are rapidly reaching threshold levels
Complex System
OBS. Object Storage Service
EVS. Elastic Volume Service (block storage)
VPC. Virtual Private Cloud (private virtual networks)
ECS. Elastic Cloud Server (scalable computing)
SRE / O&M Activities
System monitoring and 24x7/Tier 1-3 technical support
Troubleshooting and resolution of operational issues
Backup, restoration, archival services
Update, distribution and release of software
Change, capacity, and configuration management
…
Troubleshooting tasks
Anomaly detection. Determine what constitutes normal
system behavior, and then discern departures from that
normal behavior
Fault diagnosis (root-cause analysis). Identify links of
dependency that represent causal relationships to discover
the true source of an anomaly
Fault prediction. Use historical or streaming data to
predict incidents with varying degrees of probability
Fault recovery. Explore how decision support systems can
manage and select recovery processes to repair failed
systems
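The first task, detecting departures from learned normal behavior, can be illustrated with a minimal sketch (a simple threshold detector on a metric baseline, illustrative only, not the lab's actual method):

```python
import statistics

def detect_anomalies(history, new_values, k=3.0):
    """Flag values deviating more than k standard deviations
    from the historical baseline (a simple model of 'normal')."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    return [v for v in new_values if abs(v - mean) > k * std]

# Baseline response times (ms) versus a sudden regression
baseline = [100, 102, 98, 101, 99, 103, 97, 100]
print(detect_anomalies(baseline, [101, 99, 250]))  # → [250]
```

Real detectors must cope with trends and seasonality, but the principle is the same: model normality, then flag deviations.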
6. ULTRA-SCALE AIOPS LAB 5
Troubleshooting
Monitoring Data Sources
Logs. Service, microservices, and applications
generate logs, composed of timestamped
records with a structure and free-form text,
which are stored in system files.
Metrics. Examples of metrics include CPU
load, memory available, and the response
time of a HTTP request.
Traces. Traces record the workflow and tasks
executed in response to, e.g., an HTTP
request.
Events. Major milestones which occur within a
data center can be exposed as events.
Examples include alarms, service upgrades,
and software releases.
{"traceId": "72c53", "name": "get", "timestamp": 1529029301238, "id": "df332",
"duration": 124957, "annotations": [{"key": "http.status_code", "value": "200"}, {"key":
"http.url", "value": "https://v2/e5/servers/detail?limit=200"}, {"key": "protocol", "value":
"HTTP"}], "endpoint": {"serviceName": "hss", "ipv4": "126.75.191.253"}}
{"tags": ["mem", "192.196.0.2", "AZ01"], "data": [2483, 2669, 2576, 2560, 2549, 2506,
2480, 2565, 3140, …, 2542, 2636, 2638, 2538, 2521, 2614, 2514, 2574, 2519]}
{"id": "dns_address_match", "timestamp": 1529029301238, …}
{"id": "ping_packet_loss", "timestamp": 152902933452, …}
{"id": "tcp_connection_time", "timestamp": 15290294516578, …}
{"id": "cpu_usage_average", "timestamp": 1529023098976, …}
2017-01-18 15:54:00.467 32552 ERROR oslo_messaging.rpc.server [req-c0b38ace -
default default] Exception during message handling
System’s Components (e.g., OBS, EVS, VPC, ECS) are monitored and generate various types of data:
Logs, Metrics, Traces, Events, Topologies
7. ULTRA-SCALE AIOPS LAB 6
INNOVATION PATH FORWARD
BACKGROUND & MOTIVATION DESCRIPTION ANTICIPATED IMPACT
Intelligent Log Analysis
Root Cause Localization using Application Logs
Problem
Complex systems generate high volumes of logs
for manual analysis/troubleshooting. E.g., a
single-node check using logs takes >15 min
Analyzing logs manually is time-consuming
and error-prone
Intelligent Log Analysis
Insight. Recent research has shown that it is
possible to identify and model the underlying
structure of application logs using ML [1, 2]
Higher levels of visualization and abstraction are required (bird's-eye
view, zoom in/out)
MAIN ACHIEVEMENT
Extracting valuable insights from massive, noisy, and
unstructured logs
Pattern mining and NLP are used to abstract and summarize
massive logs
Help O&M teams improve the efficiency of troubleshooting using
logs
ASSUMPTIONS & LIMITATIONS
Processing requires historical logs to learn normality
No real-time processing. Only on-demand operation is available
Status
Higher levels of visualization and abstraction are
effective for operators to quickly troubleshoot
system using logs
Next steps
Research other forms of visualization (e.g., Kernel
Density Estimation, Network Diagrams,
Correlation Matrices)
TechnologyTransition
Evaluation of infrastructures to process ultra-scale
logs (e.g., batch versus stream processing)
Potential End Users
Service Owners, DevOps and SRE
Log Analysis
Log analysis can be integrated into traditional log
platforms
HOW IT WORKS
Performance is a challenge! Processing millions of
log records per second is difficult
1. Fast log template mining using
Drain algorithm
2. Language-aware log parsing
and keyword extraction using
NLP approaches (www.spacy.io)
3. Time-series classification using a
Poisson model
(en.wikipedia.org/wiki/Poisson_distribution)
4. Grouping using the Pearson
correlation coefficient
(en.wikipedia.org/wiki/Pearson_correlation_coefficient)
5. Distance-aware correlation
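Step 4 above groups events whose occurrence patterns move together over time. A minimal sketch of the idea, correlating per-minute counts of mined log templates (template names and counts are illustrative):

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# Per-minute occurrence counts of three mined log templates
counts = {
    "disk_full":    [0, 1, 5, 9, 4, 0],
    "write_failed": [0, 2, 6, 8, 5, 1],   # tracks disk_full
    "user_login":   [3, 3, 2, 4, 3, 3],   # unrelated background event
}
r = pearson(counts["disk_full"], counts["write_failed"])
print(round(r, 2))  # → 0.98, so the two templates are grouped together
```

Templates whose correlation exceeds a threshold would be merged into a single event, shrinking the volume an operator must inspect.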
Fig. Log abstraction funnel vs. existing log analysis services (e.g., ELK): 10,000 error messages → 100 templates → 30 events → 6 correlated events
Fig. Complex services generate several GB of logs per second
Fig. The process of intelligent log analysis (as-is → to-be): template mining, template analysis, anomaly detection, visualization
8. ULTRA-SCALE AIOPS LAB 7
INNOVATION PATH FORWARD
BACKGROUND & MOTIVATION DESCRIPTION ANTICIPATED IMPACT
Trace Analysis (TA)
Structure-based Anomaly Detection and RCA
Current limitations
Existing solutions only provide trace visualization tools
Analyzing traces manually is error-prone and not scalable
While popular, only visualization tools
have been developed for trace analysis
Apply recent ML and statistical
approaches to process sequential data
Explore the use of Deep Learning: Long Short-Term Memory (LSTM)
Explore the use of attention networks
Explore the use of association rules
TRL 4. Small-scale prototype. Basic technological components are
integrated to establish that they will work together.
MAIN ACHIEVEMENT
Trace anomaly detection and root-cause analysis using
trace structure
Previous attempts using Deep Learning (LSTM, CNN), Machine
Learning (OPTICS), and Sequence Analysis (LCS, multiple sequence
alignment, Needleman-Wunsch) algorithms did not enable
root-cause analysis
ASSUMPTIONS & LIMITATIONS
Microservices are instrumented with tracing capabilities
Tracing tracks microservice to microservice communication
Improve trace-based RCA by 90%
HOW IT WORKS
1) Learning. For each service
endpoint, learn the traces’
structure it generates
2) Modeling. Aggregate all the
traces into a behavior model
3) Anomaly detection. When a
new trace is generated, compare
its structure with the behavior
model. If it was not seen
before, an anomaly exists
4) Root-cause analysis. When an
anomaly is detected, determine
in which span it occurred and
identify host, service, function
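The four steps can be sketched with a set-based behavior model (a deliberately simple stand-in for the learned model; the endpoint and service names are illustrative):

```python
def common_prefix_len(a, b):
    """Length of the shared prefix of two span sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def build_model(traces):
    """2) Modeling: aggregate the trace structures seen per endpoint."""
    model = {}
    for endpoint, spans in traces:
        model.setdefault(endpoint, set()).add(tuple(spans))
    return model

def check(model, endpoint, spans):
    """3) Anomaly detection and 4) root cause: return (is_anomaly,
    index of the first span diverging from every known structure)."""
    known = model.get(endpoint, set())
    if tuple(spans) in known:
        return (False, None)
    divergence = max((common_prefix_len(spans, k) for k in known), default=0)
    return (True, divergence)

model = build_model([
    ("POST /servers", ["keystone", "nova-api", "nova-scheduler", "nova-compute"]),
])
print(check(model, "POST /servers", ["keystone", "nova-api", "glance-api"]))  # → (True, 2)
```

In the deep-learning variant described later, the set of seen structures is replaced by an LSTM that predicts the next span, but the localization idea, pointing at the first span the model cannot explain, is the same.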
Fig. An LSTM network learns the trace structure of "Create VM" requests and classifies trace events as OK | ANOMALY
Research Next Steps
What data should be injected into traces' payloads to
significantly assist troubleshooting?
New visualization tools beyond Gantt diagrams
for O&M
Technology Transition
Integration/alignment of tracing, logging, and
metrics pipelines
Potential End Users
SRE and DevOps
PaaS and customer cloud native applications
E2E: Mobile apps, web clients, monoliths, etc.
Fig. Red circles show structural anomalies (mockup)
Fig. Pipeline: learn trace structure → create model → anomaly detection → root-cause analysis (e.g., Host: 192.168.5.13, Service: nova-api, Function: schedule VM)
Fig. Elastic APM trace visualization (in blue) with structural anomalies (in red)
9. ULTRA-SCALE AIOPS LAB 8
Troubleshooting
Using Tracing Technologies
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/architecture-guide/chapter-1-components
Troubleshooting
1. Find root cause issues in requests across several services / hosts
2. Benchmark different systems and identify performance bottlenecks
3. Reconstruct the workflow of service-to-service communications
10. ULTRA-SCALE AIOPS LAB 9
Tracing Technologies
Systems and pseudo-standards
Fig. Amazon AWS X-Ray
Fig. Google Cloud Trace
Fig. OpenTracing
Tracing Services
‒ Google Stackdriver (cloud.google.com/trace)
‒ Amazon AWS X-Ray (aws.amazon.com/xray)
‒ Lightstep (lightstep.com)
Tracing Systems
‒ Twitter Zipkin (zipkin.io)
‒ Uber Jaeger
‒ OSProfiler
‒ Pivot Tracing (pivottracing.io)
Tracing Standards
‒ opentelemetry.io
‒ OpenTracing
‒ OpenCensus
[1] http://opentracing.io/documentation/
[2] https://github.com/opentracing/specification/blob/master/specification.md
OpenTracing provides consistent, expressive, vendor-neutral
APIs for popular platforms [1]
2019: The open source
distributed tracing projects
OpenCensus and OpenTracing
were merged into a new project
called OpenTelemetry
11. ULTRA-SCALE AIOPS LAB 10
Troubleshooting Openstack
Traces and Workflows
Distributed Tracing
1. Provides developers a detailed view of requests as they flow across a
distributed system
2. Identifies which microservices are involved in processing a request
3. Traces the path of a request to generate a transaction
4. Localizes which microservice in a path (trace) is anomalous
12. ULTRA-SCALE AIOPS LAB 11
Distributed Traces
Abstractions
Markov Chain
‒ A stochastic process is a Markov chain if the probability distribution
of a state Xi depends only on the previous state Xi-1 and does not
depend on earlier states:
p(x1, ..., xn) = p(X1 = x1) ∏(i=2..n) p(Xi = xi | Xi-1 = xi-1) = p(x1) ∏(i=2..n) a(xi-1, xi)
‒ K-th order Markov chain: the distribution of Xi depends on the k
previous states:
p(x1, ..., xn) = p(x1, ..., xk) ∏(i=k+1..n) p(Xi = xi | Xi-k = xi-k, ..., Xi-1 = xi-1)
Tree Structure
‒ Finite set of one or more nodes A, B, C, …
‒ Special nodes: the root node has no parent; leaf nodes have no children
‒ The remaining nodes are partitioned into n >= 0 disjoint sets T1, ..., Tn,
where each set is also a tree
‒ Tree representation: (A (B (C, D), E))
Sequence of Events
‒ Given a set E = {e1, ..., en} of event types, an event is a pair (A, t), where
A ∈ E and t ∈ N is the occurrence time of the event
‒ An event sequence on E is an ordered sequence of events:
s = ⟨(A1, t1), (A2, t2), ..., (An, tn)⟩
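The first-order Markov chain abstraction applies directly to span sequences: estimate the transition probabilities a(x, y) from observed traces, then score new traces (a minimal sketch; the service names are illustrative):

```python
from collections import Counter, defaultdict

def fit_markov(sequences):
    """Estimate transition probabilities a[x][y] = p(X_i = y | X_{i-1} = x)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {x: {y: n / sum(c.values()) for y, n in c.items()}
            for x, c in counts.items()}

def seq_prob(chain, seq):
    """Probability of the transitions in seq (0.0 for unseen transitions)."""
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p *= chain.get(prev, {}).get(cur, 0.0)
    return p

chain = fit_markov([["keystone", "nova", "glance"],
                    ["keystone", "nova", "neutron"]])
print(seq_prob(chain, ["keystone", "nova", "glance"]))  # → 0.5
print(seq_prob(chain, ["keystone", "glance"]))          # → 0.0 (never observed)
```

A score of zero (or below a threshold) flags a trace whose structure was never seen during training, which is exactly the anomaly criterion used throughout this deck.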
13. ULTRA-SCALE AIOPS LAB 14
Distributed Trace Analysis
Sequence Analysis
Sequence Prediction
‒ Predict elements of a sequence on the basis of the preceding elements
‒ Given si, si+1, …, sj, predict sj+1, i.e., si, si+1, …, sj → sj+1
‒ Input: 1, 2, 3, 4, 5; output: 6
‒ Applications: weather forecast
Sequence Classification
‒ Predict a class label for a given input sequence.
‒ Given si, si+1, …, sj, determine if it is legitimate, i.e., si, si+1, …, sj → yes or no
‒ Input: 1, 2, 3, 4, 5; output: “normal” or “anomalous”
‒ Applications: trace anomaly detection, DNA sequence classification
Sequence Generation
‒ Generate new output sequence with same general characteristics as the input
‒ Input: [1, 3, 5], [7, 9, 11]; output: [3, 5, 7]
‒ Applications: text generation
Sequence to Sequence Prediction
‒ Predict an output sequence given an input sequence.
‒ Input: 1, 2, 3, 4, 5; output: 6, 7, 8, 9, 10
‒ Applications: text summarization
1, 2, 3, 4, 5 ► 6
1, 2, 3, 4, 5 ► NORMAL
1, 3, 5 and 7, 9, 11 ► 3, 5, 7
1, 2, 3, 4, 5 ► 6, 7, 8, 9, 10
Sequence learning: from recognition and prediction to sequential decision making, IEEE Intelligent Systems, 2001,
http://www.sts.rpi.edu/~rsun/sun.expert01.pdf
14. ULTRA-SCALE AIOPS LAB 15
Distributed Trace Analysis
Sequence Analysis
1. User commands: image
delete, create server, …
Fig. Example workflow across nodes N1–N4: service call path keystone -> nova -> glance -> network, with branch probabilities (30%, 70%) and spans annotated with warnings (Y, A, B) and errors (Z)
Trace t1 Sequence of spans
Trace t1 Sequence of spans
auth_tokens_v3 1314421019193f1313162f253233191e341a3c3b402f2624192f214521162f163e2746282c192f2e2e24161d2c2c161628241635287a
images_v2 4342421119131616161a311316192216162f23213e1d2127272728241621191d5e5c471d132c4a16241616213c63
_var__detail_flavors_v2.1 5d13421314151316161b193b21242113364024192f162f131d21161621212c27282c2c2d162f5e5c132c16164d
image_schemas_v2 141319133113162f1644161b331e351a213b1613133b13192f5116132113131616252121282c2c162d21582e2e161d13643b132c4a78…
gather_result 1912141613252128211a1f402f221921162f2713282b2d2416131d13211a5a21
security-groups_v2.0 4c1319131d19131316132f
_var__images_v2 121313251b2f192f35211f241916223b1613231616242713462c53241654192c285a21
_var___var__action_servers_v2.1 61135d106c12131325191b1c193d281a4022136d13162f3b292116231d1627272741283b2416…f4a5121352f13
neutronclient.v2_0.client.retry_request 162f1319132f165124191316
4. Sequence Classification: identify invalid sequences
Service call encoding
16: {'v2.0', 'ports'}
51 : {'v2.0', '_var_', 'ports'}=200
24 : {'nova.compute.api.API.get'}=0
…
Service call encoding
13: {'v3', 'tokens', 'auth'}
Encoding matching
53.47%
2. A distributed trace is
generated
3. The distributed trace is
encoded into a sequence
Function call sequences
[{'v3'}=200, {'v3', 'tokens', 'auth'}=200, {'v...
Function call sequences
Similar call sequence
Anomaly Detection from System Tracing Data using Multimodal Deep Learning
IEEE Cloud 2019, July 3-8, 2019, Milan, Italy.
15. ULTRA-SCALE AIOPS LAB 16
Distributed Trace Analysis
Why is trace analysis challenging?
2. Cache-Aside pattern
1. Retry pattern
n_tries = 3
while True:
    try:
        ExecuteOperation()  # Success
        break
    except TimeoutException:
        n_tries -= 1
        if n_tries == 0:  # Give up
            raise
        sleep(1)  # Wait before retrying the call
v = cache.get(k)
if v is None:  # Check if the item is in the cache
    v = store.get(k)  # If not in cache, read from the data store
    cache.put(k, v)  # Put a copy of the item into the cache
3. One-to-many subcalls
@app.route('/servers/<int:user_id>/<int:number>')
def create(user_id, number):
    for i in range(number):
        authentication(user_id)
        # Call remote microservice to create a server
        rpc.create_server()
T1: A, B, C, D, E, …
T2: A, B, C, C, D, E, ….
T3: A, B, C, C, C, D, E, …
T4: A, B, C, C, C, Z
T1: A, B, C, D, E, …
T1: A, C, D, E, …
T1: A, B, C, D, E, …
T2: A, B, C, C, D, E, ….
T3: A, B, C, C, C, D, E, …
T4: A, B, C, C, C, C, D, E, …
Many “hidden” software “patterns” affect running systems… circuit breaker, concurrency and parallelism, priority
queues, publisher-subscriber, bulkhead, etc.
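The retry pattern alone makes the "same" request produce the traces T1–T4 of different lengths shown above. A small simulation of this effect (illustrative, not from the slides' code):

```python
import random

def simulate_retry(fail_probability, max_tries=3, rng=random.Random(0)):
    """Span sequence produced when the call to C is guarded by a retry loop."""
    trace = ["A", "B"]
    for _ in range(max_tries):
        trace.append("C")                # one span per attempt
        if rng.random() >= fail_probability:
            return trace + ["D", "E"]    # success: workflow continues
    return trace + ["Z"]                 # every retry timed out

print(simulate_retry(0.0))  # → ['A', 'B', 'C', 'D', 'E'] (first try succeeds)
print(simulate_retry(1.0))  # → ['A', 'B', 'C', 'C', 'C', 'Z'] (always fails)
```

Because such patterns produce many structurally valid variants of one request, a trace model must learn a distribution over structures rather than a single expected sequence.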
16. ULTRA-SCALE AIOPS LAB 17
Distributed Trace Analysis
Using LSTMs
1. Model OpenStack behavior
OPENSTACK_SEQUENCE= [
('keystone', 1.0, 'authenticate user with credentials and generate auth-token'),
('nova-api', 1.0, 'get user request and sends token to Keystone for validation'),
('keystone', 0.2, 'validate the token if not in cache'),
('nova-api', 1.0, 'starts VM creation process'),
('nova-database', 1.0, 'create initial database entry for new VM'),
('nova-scheduler', 1.0, 'locate an appropriate host using filters and weights'),
('nova-database', 1.0, 'execute query'),
('nova-compute', 1.0, 'dispatches request'),
('nova-conductor', 1.0, 'gets instance information'),
('nova-compute', 1.0, 'get image URI from image service'),
('glance-api', 1.0, 'get image metadata'),
('nova-compute', 1.0, 'setup network'),
('neutron-server', .5, 'allocate and configure network IP address'),
('nova-compute', 1.0, 'setup volume'),
('cinder-api', .75, 'attach volume to the instance or VM'),
('nova-compute', 1.0, 'generates data for the hypervisor driver')
]
3. Transform traces into sequences of integers
Train dataset
span_list
trace_id
0 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
1 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
2 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
3 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
4 [3, 5, 3, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
... ...
95 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
96 [3, 5, 3, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
97 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
98 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
99 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 1, 6]
[100 rows x 1 columns]
Test dataset
span_list
trace_id
0 [3, 5, 5, 1, 9, 8, 6, 7, 6, 2, 6, 6, 1, 6]
1 [3, 5, 5, 8, 9, 8, 6, 7, 6, 2, 6, 4, 6, 6]
[0,
['keystone',
'nova-api',
'nova-api',
'nova-database',
'nova-scheduler',
'nova-database',
'nova-compute',
'nova-conductor',
'nova-compute',
'glance-api',
'nova-compute',
'nova-compute',
'cinder-api',
'nova-compute']]
Cache Simulation: Probability of execution = 75%
Trace #0
Generate Synthetic Traces (generate_traces())
‒ Create a simple model to generate traces
‒ OPENSTACK_SEQUENCE models sequences of spans which are valid
• e.g., ['keystone', 'nova-api', 'nova-api', 'nova-database', 'nova-scheduler', 'nova-database', 'nova-compute',
'nova-conductor', 'nova-compute', 'glance-api', 'nova-compute', 'nova-compute', 'cinder-api', 'nova-compute']
‒ Each microservice name is encoded into a number using a dictionary (create_coders())
Trace Mutation (mutate_traces())
‒ Selected synthetic traces are mutated
‒ Spans can be deleted or replaced
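The slides name mutate_traces() without showing it; a plausible minimal sketch of such a mutation step (the function signature and vocabulary of out-of-place spans are illustrative) could be:

```python
import random

def mutate_trace(spans, vocabulary, rng=random.Random(42)):
    """Inject a structural anomaly: delete one random span, or replace
    it with a service drawn from a vocabulary of out-of-place names."""
    mutated = list(spans)
    i = rng.randrange(len(mutated))
    if rng.random() < 0.5:
        del mutated[i]                       # deletion
    else:
        mutated[i] = rng.choice(vocabulary)  # replacement
    return mutated

normal = ["keystone", "nova-api", "nova-scheduler", "nova-compute"]
print(mutate_trace(normal, ["glance-api", "cinder-api"]))
```

Mutated traces form the labeled anomalous test set against which the trained model's detections can be evaluated.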
2. Generate Synthetic Traces
17. ULTRA-SCALE AIOPS LAB 18
Distributed Trace Analysis
Using LSTMs
[[0 0 3 ... 6 1 6]
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
...
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
[0 3 5 ... 6 1 6]]
Padding
2. One-hot encoding
Tensorflow tutorial: https://github.com/campdav/text-rnn-tensorflow
Keras tutorial: https://github.com/campdav/text-rnn-keras
Code available at: https://github.com/jorge-cardoso/aiops_practice
Subdirectory: dta_lstm
Padding (_pad_traces())
‒ Pad each trace (sequence) with zeros (0), up to a pre-defined maximum length.
‒ This allows pre-allocating a chain of LSTM units of a specific length
‒ Padding is done using pad_sequences(sequences,...)
One hot encoding (_traces_to_binary())
‒ Enables representing categorical variables as binary vectors
‒ Each integer value is represented as a binary vector that has all zero values except the index of the integer, which is
marked with a 1
X = pad_sequences(X, maxlen=self.trace_size, dtype=np.int16,
truncating='pre', padding='pre',
value=DEFAULTS['padding_symbol'])
maxlen: int, maximum length of all sequences.
truncating='pre' remove values at the beginning from sequences larger than maxlen
padding='pre' pads each trace at the beginning with a special integer (e.g., 0)
1. Padding
keras.utils.to_categorical(X, num_classes=self.max_n_span_types,
dtype='int32')
X.shape = (n_traces, self.trace_size)
returns shape = (n_traces, self.trace_size, self.max_n_span_types)
[[[1. 0. 0. ... 0. 0. 0.]
[1. 0. 0. ... 0. 0. 0.]
[0. 0. 1. ... 0. 0. 0.]
...
]
]
18. ULTRA-SCALE AIOPS LAB 19
Distributed Trace Analysis
Using LSTMs
X [[0 0 3 ... 6 1 6]
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
...
[0 0 3 ... 6 1 6]
[0 3 5 ... 6 1 6]
[0 3 5 ... 6 1
6]]
y [[0 0 5 ... 1 6 0]
[0 0 5 ... 1 6 0]
[0 5 5 ... 1 6 0]
...
[0 0 5 ... 1 6 0]
[0 5 5 ... 1 6 0]
[0 5 5 ... 1 6 0]]
Generate y
‒ Shift traces by one position to the left
‒ Fill new cell with DEFAULTS['padding_symbol']
# X.shape == (n_traces, self.trace_size)
y = np.roll(X, -1, axis=1)
# Pad the y array where the X array has zeros
y[X == 0] = DEFAULTS['padding_symbol']
# Write DEFAULTS['padding_symbol'] at the last
# position of the shifted array
y[:, -1] = DEFAULTS['padding_symbol']
1. Generate y by shifting X
0 3 5 2 1 7X
y 0 5 2 1 7 0
Shift traces left
19. ULTRA-SCALE AIOPS LAB 20
Distributed Trace Analysis
Using LSTMs
Create DL model (_build_model())
‒ Use a LSTM network
‒ dropout layer of 0.2 (high, but necessary to avoid quick divergence)
‒ Softmax activation to generate probabilities over the different categories
‒ Because it is a classification problem, loss calculation via categorical cross-entropy compares the output probabilities
against the one-hot encoding
‒ ADAM optimizer (instead of the classical stochastic gradient descent) to update weights
model = Sequential()
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True,
input_shape=(self.trace_size, input_dim)))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
# A Dense layer is used as the output for the network.
model.add(TimeDistributed(Dense(input_dim, activation='softmax')))
if self.gpus > 1:
model = keras.utils.multi_gpu_model(model, gpus=self.gpus)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
1. Create DL model
20. ULTRA-SCALE AIOPS LAB 21
Distributed Trace Analysis
Using LSTMs
Train the model (fit())
‒ X = Obtains the traces composed of spans
‒ Pad the traces of X
‒ y = Shift the spans of the traces left
‒ Apply one-hot encoding to X and y
‒ Build the model
‒ Fit the model f(X) = y
# X_train has 2 columns: traceId, span_list. Here we select the span_list
X_train = X_train.loc[:, DEFAULTS['span_list_col_name']].values
X = self._pad_traces(X_train)
y = self._shift_traces_left(X)
self._get_lstm_parameters(X)
X = self._traces_to_binary(X)
y = self._traces_to_binary(y)
self.model = self._build_model(input_dim=X.shape[2])
self.model.fit(X, y, epochs=self.epochs, batch_size=self.batch_size, shuffle=True,
validation_split=self.validation_split)
1. Train the model
21. ULTRA-SCALE AIOPS LAB 22
Distributed Trace Analysis
Using LSTMs
Test Traces
X [[0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6]
[0 0 3 5 5 8 9 8 6 7 6 2 6 4 6 6]]
X[0]: [0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6] =
[[0, 0.2625601], [1, 0.38167596], [0, 0.75843567],
[0, 0.8912414], [9, 2.394792e-05], [0, 0.94694203],
[0, 0.9656691], [0, 0.9686248], [0, 0.9666971],
[0, 0.95213455], [0, 0.94981205], [0, 0.93016535],
[0, 0.60129535], [0, 0.5417027], [0, 0.6876041],
[0, 0.82936114]]
X[1]: [0 0 3 5 5 8 9 8 6 7 6 2 6 4 6 6] =
[[0, 0.2625601], [1, 0.38167596], [0, 0.75843567],
[0, 0.8912414], [0, 0.9302344], [0, 0.94507533],
[0, 0.9638574], [0, 0.9665326], [0, 0.9627945],
[0, 0.9435249], [0, 0.9405963], [0, 0.92075515],
[1, 0.3207201], [1, 0.46595544], [0, 0.64980185],
[0, 0.8390282]]
trace_id 0 indices [4]
trace_id 1 indices []
Test traces (predict())
‒ X = Obtains the traces composed of spans
‒ Pad the traces of X
‒ Apply one-hot encoding to X
‒ Calculate: yhat = f(X)
‒ Identify in which spans the anomalies occur
X_test = X_test.loc[:, DEFAULTS['span_list_col_name']].values
X_test = self._pad_traces(X_test)
X_test_bin = self._traces_to_binary(X_test)
yhat = self.model.predict(X_test_bin)
self._identify_anomalies(X_test, yhat, prob=self.threshold)
return self.rank_prob
1. Test unseen traces
22. ULTRA-SCALE AIOPS LAB 23
Distributed Trace Analysis
Using LSTMs
Identifying Anomalous Traces Sequences (_identify_anomalies())
‒ Mark anomalies based on the difference between y and yhat
‒ y is the shifted version of X
‒ yhat = f(X)
‒ If the shift cannot be predicted correctly, the trace did not occur before and is marked as anomalous
y = DTA_LSTM._shift_traces_left(X)
for i in range(len(X)):
rp = self._compare_traces(y[i], yhat[i])
self.rank_prob.append(rp)
if self.explain:
print('X[{}]: {} = {}'.format(i, X[i], rp))
return self.rank_prob
Input X = [0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6]
Shifted_X = [0 0 5 5 1 9 8 6 7 6 2 6 6 1 6 0]
Prediction
yhat = [0 0 5 5 12 9 8 6 7 6 2 6 6 1 6 0]
X[0]: [0 0 3 5 5 1 9 8 6 7 6 2 6 6 1 6] =
[[0, 0.2625601], [1, 0.38167596], [0, 0.75843567],
[0, 0.8912414], [9, 2.394792e-05], [0, 0.94694203],
[0, 0.9656691], [0, 0.9686248], [0, 0.9666971],
[0, 0.95213455], [0, 0.94981205], [0, 0.93016535],
[0, 0.60129535], [0, 0.5417027], [0, 0.6876041],
[0, 0.82936114]]
Error at index [4]
X[i]: Input: [ 0 0 0 0 0 0 0 0 0 0 0 10 10 17 5 5 7 7 6 6]
y[i]: Output: [ 0 0 0 0 0 0 0 0 0 0 0 10 17 5 5 7 7 6 6 0]
yhat[i]: Predicted: [ 0 0 0 0 0 0 0 0 0 0 0 10 17 5 5 7 7 6 6 0]
Correct prediction
23. ULTRA-SCALE AIOPS LAB 24
Single and Bi-Models
Use of single and bi-models to capture traces as
sequences of events and their response time
Use sequential model representation by utilizing
long short-term memory (LSTM) networks
Related Research
Technical University of Berlin
S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection from System Tracing Data using Multimodal Deep Learning, IEEE Cloud 2019, July 3-8, 2019, Milan, Italy.
Idea: represent distributed traces as sequences of
events and their response time
Trace Structure (sequence of events) and
Response time (duration)
e.g., Service 11 → Service 21; duration = 12ms
24. ULTRA-SCALE AIOPS LAB 25
Related Research
Facebook
https://blog.acolyer.org/2015/10/07/the-mystery-machine-end-to-end-performance-analysis-of-large-scale-internet-services/
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chow.pdf
Measure end-to-end performance of requests
Infer causal relationships from logs
‒ Adding instrumentation retroactively is an
expensive task
‒ Hypothesize and confirm relationships in
messages
Log Data
‒ Request & host id, timestamp, unique event label
Finding Relationships
‒ Sample requests, store logs in Hive, and run
Hadoop jobs to infer causal relationships
‒ A 2h Hadoop job analyzes 1.3M requests sampled
over 30 days
‒ Assumes a hypothesized relationship between
two segments until finding a counterexample
‒ 3 types of relationship inferred: happens-before,
mutual exclusion, pipeline
Applications
‒ Find critical path and slack segments for
performance optimization
‒ Anomaly detection: 1) select the top 5% of end-to-end
latency, 2) identify segments with proportionally
greater representation in the outlier set of
requests than in the non-outlier set
As the number of traces analyzed increases, the observation of new
counterexamples diminishes, leaving behind only true relationships
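The hypothesize-and-refute approach can be sketched in a few lines: assume every possible ordering between two event labels, then discard a hypothesis as soon as one trace contradicts it (a simplification; The Mystery Machine also infers mutual-exclusion and pipeline relationships, and the event labels here are illustrative):

```python
from itertools import permutations

def infer_happens_before(traces):
    """Keep (a, b), meaning 'a happens before b', until a counterexample appears."""
    events = {e for t in traces for e in t}
    hypotheses = set(permutations(events, 2))   # assume every ordering holds
    for trace in traces:
        pos = {e: i for i, e in enumerate(trace)}
        for a, b in list(hypotheses):
            if a in pos and b in pos and pos[a] > pos[b]:
                hypotheses.discard((a, b))      # counterexample: b preceded a
    return hypotheses

traces = [["connect", "auth", "query"],
          ["connect", "query", "auth"]]         # auth/query order varies
print(sorted(infer_happens_before(traces)))
# → [('connect', 'auth'), ('connect', 'query')]
```

With only two traces the auth/query ordering is already refuted in both directions, which mirrors the closing observation: more traces eliminate more false hypotheses, leaving only true relationships.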