Gartner analyzed data centers over a 10-year period and found that 47% of all problems were caused by cloud service outages, with outage durations ranging from 40 minutes to five days. The Ponemon Institute studied the financial impact and found that outages cost US$690,204 on average, with an average downtime cost of US$6,828 per minute. These figures underline the economic impact of unplanned outages on cloud operations, which calls for higher platform reliability.
The first part of this talk will present the mechanisms that pioneers, such as Amazon, Google, and Netflix, have already developed to increase the reliability of their cloud platforms. The second part of the talk will describe how Huawei Research is exploring the use of fault-injection mechanisms to effectively increase the reliability of the Open Telekom Cloud platform from Deutsche Telekom.
Cloud Operations and Analytics: Improving Distributed Systems Reliability using Fault Injection - Jorge Cardoso
Lecture given at the Technical University of Munich, 12 December 2016, on Cloud Operations and Analytics: Improving Distributed Systems Reliability using Fault Injection.
Cloud Reliability: Decreasing outage frequency using fault injection - Jorge Cardoso
Invited Keynote at the 9th International Workshop on Software Engineering for Resilient Systems, September 4-5, 2017, Geneva, Switzerland
Title: Cloud Reliability: Decreasing outage frequency using fault injection
Abstract: In 2016, Google Cloud had 74 minutes of total downtime, Microsoft Azure had 270 minutes, and Amazon Web Services had 108 minutes (see cloudharmony.com). Reliability is one of the most important properties of a successful cloud platform. Several approaches can be explored to increase reliability, ranging from automated replication to live migration to formal system analysis. Another interesting approach is to use software fault injection to test a platform during prototyping, implementation, and operation. Fault injection was popularized by Netflix and their Chaos Monkey fault-injection tool for testing cloud applications. The main idea behind this technique is to inject failures in a controlled manner to guarantee the ability of a system to survive failures during operations. This talk will explain how fault injection can also be applied to detect vulnerabilities of the OpenStack cloud platform, and how to effectively and efficiently detect the damage caused by the injected faults.
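The controlled-injection idea can be sketched in a few lines. The following is a toy illustration, not Chaos Monkey or the OpenStack tooling discussed in the talk: a wrapper injects failures at a configurable rate, and a retrying client is expected to survive them. All names and thresholds are illustrative assumptions.

```python
import random

def inject_faults(func, failure_rate=0.3, rng=None):
    """Wrap a callable so it raises a simulated fault with some probability.

    This mimics the core idea of fault injection: failures are introduced
    in a controlled, tunable manner rather than waiting for real outages.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise RuntimeError("injected fault")
        return func(*args, **kwargs)
    return wrapper

def call_with_retries(func, attempts=5):
    """A client that is expected to survive injected failures via retries."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except RuntimeError as err:
            last_error = err
    raise last_error

# Example: a service call that normally succeeds, wrapped with fault injection.
flaky = inject_faults(lambda: "ok", failure_rate=0.3)
print(call_with_retries(flaky))
```

Running the injected system under test like this exposes whether the retry (or replication, or failover) logic actually masks failures, which is the property the talk argues should be verified continuously.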
Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Twitter: @CloudRR2016
In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms can no longer be done manually or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally exploiting AI to provide tools for autonomous cloud operations. This talk will explain how deep learning, distributed traces, and time-series analysis (sequence analysis) can be used to effectively detect anomalous cloud infrastructure behaviors during operations and reduce the workload of human operators. The iForesight system is being used to evaluate this new O&M approach. iForesight 2.0 is the result of two years of research with the goal of providing an intelligent new tool for SRE cloud maintenance teams. It enables them to quickly detect and predict anomalies, thanks to the use of artificial intelligence, when cloud services are slow or unresponsive.
Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Twitter: @CloudRR2016
Online Memory Leak Detection in Cloud-based Infrastructures - Anshul Jindal
A memory leak in an application deployed on the cloud can affect the application's availability and reliability; identifying and ultimately resolving it quickly is therefore highly important. However, in a production environment running on the cloud, memory leak detection is a challenge without knowledge of the application or its internal object-allocation details. This paper addresses the challenge of online detection of memory leaks in cloud-based infrastructure, without any internal application knowledge, by introducing a novel machine-learning-based algorithm, Precog. The algorithm uses a single metric: the memory utilization of the system on which the application is deployed. Its accuracy was tested on manually labeled memory-utilization data from 60 virtual machines, provided by our industry partner Huawei Munich Research Center, and the proposed algorithm achieves an accuracy score of 85% with a prediction time of less than half a second per virtual machine.
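Precog itself is described in the paper; as a minimal illustration of the underlying idea (leak detection from the memory-utilization metric alone), the sketch below flags a virtual machine whose memory utilization shows a steady upward trend. The thresholds and the trend-fitting approach are illustrative assumptions, not the paper's actual algorithm.

```python
def linear_trend(values):
    """Least-squares slope and R^2 of a time series (x = 0, 1, 2, ...)."""
    n = len(values)
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    ss_tot = sum((y - mean_y) ** 2 for y in values)
    ss_res = sum((y - (mean_y + slope * (x - mean_x))) ** 2
                 for x, y in zip(xs, values))
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0
    return slope, r2

def looks_like_leak(mem_utilization, min_slope=0.1, min_r2=0.8):
    """Flag a VM whose memory utilization shows a steady upward trend:
    a clearly positive slope with a good linear fit."""
    slope, r2 = linear_trend(mem_utilization)
    return slope > min_slope and r2 > min_r2

leaking = [10 + 0.5 * t for t in range(60)]               # steadily growing
healthy = [30 + (5 if t % 2 else -5) for t in range(60)]  # oscillating, flat
print(looks_like_leak(leaking), looks_like_leak(healthy))
```

Note that, like Precog, this uses only the system-level memory metric: no application internals are needed.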
Distributed Trace & Log Analysis using ML - Jorge Cardoso
The field of AIOps, also known as Artificial Intelligence for IT Operations, uses advanced technologies to dramatically improve the monitoring, operation, and troubleshooting of distributed systems. Its main premise is that operations can be automated using monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how AIOps – and many related fields such as deep learning, machine learning, distributed traces, graph analysis, time-series analysis, sequence analysis, advanced statistics, NLP, and log analysis – can be applied to effectively detect, localize, predict, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs) by analyzing service management data (e.g., distributed traces, logs, events, alerts, metrics). In particular, this talk will describe how a particular monitoring data structure, called distributed traces, can be analyzed using deep learning to identify anomalies in its spans. This capability empowers operators to quickly identify which components of a distributed system are faulty.
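As a simplified illustration of span-level anomaly detection (a statistical stand-in, not the deep-learning model described in the talk), the sketch below flags spans whose duration deviates strongly from the baseline of other spans for the same operation. Operation names and durations are hypothetical.

```python
import statistics

def anomalous_spans(spans, threshold=3.5):
    """Flag spans whose duration deviates strongly from the baseline
    for the same operation, using a robust modified z-score
    (median and median absolute deviation).

    `spans` is a list of (operation_name, duration_ms) tuples.
    """
    by_op = {}
    for op, dur in spans:
        by_op.setdefault(op, []).append(dur)

    flagged = []
    for op, dur in spans:
        durs = by_op[op]
        med = statistics.median(durs)
        mad = statistics.median(abs(d - med) for d in durs) or 1e-9
        score = 0.6745 * abs(dur - med) / mad  # modified z-score
        if score > threshold:
            flagged.append((op, dur))
    return flagged

spans = [("auth", 12), ("auth", 11), ("auth", 13), ("auth", 250),
         ("db.query", 40), ("db.query", 42), ("db.query", 41)]
print(anomalous_spans(spans))
```

Because spans carry the name of the component that produced them, a flagged span points directly at the faulty component, which is exactly the localization capability the abstract describes.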
Observability at scale with Neural Networks: A more proactive approach - Tech Triveni
We at Expedia work on a mission of connecting people to places through the power of technology. To accomplish this, we build and run hundreds of microservices that provide different functionalities to serve every single customer request, which results in generating billions of events. Now, what happens when one or more services fail at the same time? To improve the observability of our system, we need to connect these failure points across our distributed topology to reduce mean time to detect (MTTD) and mean time to know (MTTK).
In this talk, we will present the journey of distributed tracing at Expedia, which started with Zipkin as a prototype and ended up as our own open-source solution. We will do a deep dive into our architecture and demonstrate how we ingest terabytes of tracing data (around 8 TB/day) in production, with a peak throughput of over 550,000 spans/second across hundreds of microservices.
We use this data for trending service errors, latencies, and rates, performing anomaly detection on the aggregated trends, and building service-dependency and network-latency graphs, in addition to our primary use case of distributed tracing.
With this increasing volume, we felt the need for a real-time, intelligent alerting and monitoring system to move towards 24/7 reliability. We will talk about how we use neural networks on these trends to perform anomaly detection, including a deep dive into the architecture of the automated training pipeline and of the cost-effective online computation using streams.
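The talk applies neural networks to these trends; as a much simpler statistical stand-in for the same idea, the sketch below flags points on a metric trend that deviate strongly from an exponentially weighted running baseline. The metric series is hypothetical.

```python
def ewma_anomalies(series, alpha=0.3, k=3.0):
    """Streaming anomaly detection on a metric trend using an
    exponentially weighted moving average and variance. Points more
    than k standard deviations from the running mean are flagged."""
    mean = series[0]
    var = 0.0
    flagged = []
    for i, x in enumerate(series[1:], start=1):
        std = var ** 0.5
        if std > 0 and abs(x - mean) > k * std:
            flagged.append(i)
        # Update running statistics (EW mean and variance).
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return flagged

errors_per_min = [20, 20, 21, 20, 21, 20, 20, 120, 20, 21]
print(ewma_anomalies(errors_per_min))
```

A streaming detector like this needs no training pipeline; a neural model, as in the talk, trades that simplicity for the ability to learn seasonality and cross-metric patterns.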
AIOps: Anomalies Detection of Distributed Traces - Jorge Cardoso
Introduction to the field of AIOps, large-scale monitoring, and observability. Provides an example illustrating how deep learning can be used to analyze distributed traces to reveal exactly which component is causing a problem in microservice applications.
Presentation given at the National University of Ireland, Galway (NUI Galway) on 20 August 2019.
Thanks to Prof. John Breslin
Splunk’s machine learning framework, combined with Splunk’s event management capabilities, gives operations teams the opportunity to act proactively and automate on an event before it becomes an IT outage. This session will detail and demonstrate how to predict a health score for your business service, proactively take action based on those predictions, and publish to your collaborative messaging and automation solutions.
Top Cited Papers - International Journal of Network Security & Its Applications (IJNSA) - IJNSA Journal
The International Journal of Network Security & Its Applications (IJNSA) is a bimonthly open-access peer-reviewed journal that publishes articles contributing new results in all areas of computer network security and its applications. The journal focuses on all technical and practical aspects of security and its applications for wired and wireless networks. Its goal is to bring together researchers and practitioners from academia and industry to focus on understanding modern security threats and countermeasures, and on establishing new collaborations in these areas.
Measuring Technical Lag in Software Deployments (CHAOSScon 2020) - Tom Mens
Presentation at CHAOSSCon Europe 2020 about the generic technical lag measurement framework for software. Technical lag measures the increasing difference between deployed software components and the ideal upstream components.
For more information, see https://doi.org/10.1002/smr.2157
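As a rough illustration of the idea (one simplified instantiation, not the framework's actual metric), technical lag can be computed as a component-wise version difference between a deployed package and its ideal upstream release. The package names and versions below are hypothetical.

```python
def parse_version(v):
    """Parse a 'major.minor.patch' version string into a tuple of ints."""
    return tuple(int(part) for part in v.split("."))

def technical_lag(deployed, latest):
    """Version-based technical lag: how many major/minor/patch levels
    the deployed component is behind the ideal upstream release."""
    d, u = parse_version(deployed), parse_version(latest)
    return tuple(max(0, up - dp) for dp, up in zip(d, u))

# Hypothetical deployment: (deployed version, latest upstream version).
deployment = {"django": ("3.2.1", "4.2.0"), "requests": ("2.31.0", "2.31.0")}
for pkg, (deployed, latest) in deployment.items():
    print(pkg, technical_lag(deployed, latest))
```

The framework is deliberately generic: the same "distance between deployed and ideal" shape also works with release dates, commit counts, or known vulnerabilities instead of version numbers.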
http://www.opendatacenteralliance.org/
Moderator:
Richard Villars,
Vice President, Information & Cloud
IDC
Panelists:
Curt Aubley,
VP & CTO Cyber Security & NexGen
Innovation
Lockheed Martin
Jeffrey R Deacon,
Chief Cloud Strategist,
Terremark
Joe Weinman,
Senior Vice President, Cloud Services and Strategy
Telx
Michael Kollar,
Global Cloud CTO, ATOS
Petteri Uljas,
CEO, Capgemini Finland
Unit Head, Capgemini Infra3 (Eastern Europe and India)
VMware Cloud Infrastructure and Management on NetApp - NetApp
This ESG Lab Validation describes the hands-on testing of a VMware cloud infrastructure and management on NetApp solution with a focus on the value of integrated manageability, policy-based provisioning and automation, and data management for business continuity.
Enterprise-Grade Disaster Recovery Without Breaking the Bank - Donna Perlstein
Until recently, enterprise-grade DR had been prohibitively expensive, leaving many companies with high risk levels and unreliable solutions. Now, many organizations are enjoying top-of-the-line disaster recovery at a fraction of the price, thanks to the rapid development of cloud technology. CloudEndure and Actual Tech Media are thrilled to present this presentation, including a cost comparison of three disaster recovery strategies, and much more.
Protect Your Data and Apps from Zombies and Other Disasters - Bluelock
How to protect your data and applications from zombies and other disasters. Learn how cloud for disaster recovery allows you to recover apps after a disaster.
Disasters happen, especially when it comes to technology. Hurricanes, floods, and viruses are all well known, but we took a look at a lesser-known phenomenon: the zombie apocalypse and the impact it could have on your business.
Regardless of the type of threat your business is worried about, the cloud offers newer, more affordable and more flexible recovery options through cloud service providers than have previously been available. Even if no zombie attacks have ever been confirmed (yet), the following data offers a compelling look at how you can protect your business, and where your infrastructure may be vulnerable. View the infographic below to learn more!
http://www.bluelock.com/recovery-from-zombies-and-disasters/
More at http://cloudify.co/2017/08/31/overcoming-the-five-hybrid-cloud-adoption-challenges/
First, should your enterprise work with a single cloud provider? Most likely your answer will be “No!”, and for good reason.
Second, will hybrid (not necessarily cloud) be part of your data center’s future in the next few years? Here the answer is not as clear-cut. And in the absence of an obvious answer, new questions come to mind – what should I do with my current data center and how might this public cloud environment be incorporated into the mix?
Aside from dealing with their current on-premises resources, there is at least one good reason for enterprises to want to keep resources on premises: avoiding vendor lock-in. As an IT leader, your responsibility for the data and for business continuity forces you to think long term. You need to maintain control and be able to move your IT assets according to your business needs at any time.
This consideration, combined with the current reality of having an on-premises data center to take care of, in most cases will launch you on the hybrid cloud journey. Leaders that see the half-full glass of this change will also see how this move forces their team to learn and innovate.
There are other incentives for building a hybrid cloud. Some enterprises simply want to use the public cloud to accommodate bursty workloads, and may want to migrate everything except for mission-critical applications and sensitive data repositories. Regardless of your incentive, it’s important to be aware of potential challenges lurking ahead.
Structural organization and architecture of a virtual reality explorer - Prachi Gupta
This paper discusses the structural organization and architecture of a Virtual Reality Explorer. It attempts to shed some light on the components of a VR Explorer and their functions, and on the issues to keep in mind while building such an application. via @prchg
Phytogenic feed additives: Keeping pace with trends and challenges in pig production - Milling and Grain magazine
As the global population and its prosperity steadily rise, demand for animal protein will further increase in the near future. Pig meat is the most consumed meat worldwide, closely followed by poultry: last year it comprised 38 percent (or 118 Mt) of total meat consumption, whereas poultry meat accounted for 35 percent (or 110 Mt). This growing demand is challenged, on the one hand, by consumers' awareness of food safety and, on the other, by the need for sustainable and efficient swine production. At the same time, production costs should be kept as low as possible while controlling the high risk that in-feed antibiotics, used as antimicrobial growth promoters (AGP) or as disease treatment, lead to drug-resistant bacteria in humans. Over the last decades, many feed additives have been developed and evaluated, among which phytogenic (plant-derived) substances have attracted much attention.
Presented at Kafka Summit 2016
Operating out of multiple datacenters is a large part of most disaster recovery plans, but it brings extra complications to our data pipelines. Instead of having a straight path from front to back, it now has forks and dead ends and odd little use cases that don’t match up with a perfect view of the world. This talk will focus on how to best utilize Apache Kafka in this world, including basic architectures for multi-datacenter and multi-tier clusters. We will also touch on how to assure messages make it from producer to consumer, and how to monitor the entire ecosystem.
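On the point of assuring that messages make it from producer to consumer, much of the work is done by producer and topic configuration. The following is an illustrative fragment of common Kafka reliability settings (not specific to this talk); exact values depend on your latency and durability requirements:

```properties
# Producer: a send only succeeds once all in-sync replicas have the record.
acks=all
# Retry transient failures; idempotence prevents duplicates on retry.
enable.idempotence=true
retries=2147483647
# Bound how long a send may take before failing, e.g. during a DC outage.
delivery.timeout.ms=120000
```

On the broker/topic side, pairing a replication factor of 3 with `min.insync.replicas=2` ensures that acknowledged writes survive the loss of one replica, which is the foundation the multi-datacenter architectures in the talk build on.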
Anand Swaminathan and Iain Beardsell debate the use of thrombolytics in the treatment of submassive pulmonary embolism (PE).
PE is a spectrum of disease. Patients should be treated differently depending on where they are on the spectrum.
Subsegmental PE may need no treatment at all, whereas massive PE is unlikely to improve without thrombolytics.
Anand argues for the use of thrombolytics.
Evidently, time is critical when dealing with patients and Anand posits that thrombolytics gives the physician control over time.
Submassive PE can deteriorate, leading to massive pulmonary embolism, and a proportion of these patients will die. The data are not conclusive on the use of thrombolytics in terms of mortality; however, long-term outcomes do improve.
Finally, Anand concludes by suggesting that the decision to use thrombolytics relies on sound clinical reasoning and decision making, informed by the available data. He argues for nuanced treatments and use of these drugs.
Iain takes a different approach in his reply.
Some of the most difficult topics in medicine attract considerable debate. The use of thrombolysis for submassive PE is one of these.
In this argument Iain attempts to highlight some of the most pertinent evidence against the use of thrombolysis. And he does so through song!
Submassive PE should be Thrombolysed: Anand Swaminathan and Iain Beardsell
For more like this, head to our podcast page. #CodaPodcast
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka - Guozhang Wang
To manage the ever-increasing volume and velocity of data within your company, you have successfully made the transition from single machines and one-off solutions to large distributed stream infrastructures in your data center, powered by Apache Kafka. But what if one data center is not enough? I will describe building resilient data pipelines with Apache Kafka that span multiple data centers and points of presence, and provide an overview of best practices and common patterns while covering key areas such as architecture guidelines, data replication, and mirroring as well as disaster scenarios and failure handling.
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar... - confluent
The concept of stream processing has been around for a while and most software systems continuously transform streams of inputs into streams of outputs. Yet the idea of directly modeling stream processing in infrastructure systems is just coming into its own after a few decades on the periphery.
At its core, stream processing is simple: read data in, process it, and maybe emit some data out. So why are there so many stream processing frameworks that all define their own terminology? And are the components of each even comparable? Why do I need to know about spouts or DStreams just to process a simple sequence of records? Depending on your application’s requirements, you may not need a framework.
This talk will be delivered by one of the creators of the popular stream data systems Apache Kafka and will abstract away the details of individual frameworks while describing the key features they provide. These core features include scalability and parallelism through data partitioning, fault tolerance and event processing order guarantees, support for stateful stream processing, and handy stream processing primitives such as windowing. Based on our experience building and scaling Kafka to handle streams that captured hundreds of billions of records per day — this presentation will help you understand how to map practical data problems to stream processing and how to write applications that process streams of data at scale.
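The core primitives named above (reading records in, partitioning by key, windowed stateful computation) can be illustrated without any framework. The following is a toy Python stand-in, not Kafka Streams code; event names and timestamps are hypothetical:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Group a stream of (timestamp_ms, key) events into fixed-size
    tumbling windows and count occurrences of each key per window.
    This is the essence of a windowed aggregation over a keyed stream."""
    counts = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = (ts // window_ms) * window_ms
        counts[window_start][key] += 1
    return {w: dict(kc) for w, kc in sorted(counts.items())}

events = [(100, "click"), (250, "click"), (900, "view"),
          (1100, "click"), (1800, "view")]
print(tumbling_window_counts(events, window_ms=1000))
```

What a stream-processing system adds on top of this simple loop is exactly the hard part the talk covers: partitioned parallelism, fault-tolerant state, and ordering guarantees when the input never ends.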
Software Defined Environment - In one click get the Dev/QA/Staging Environment - Venu Murthy
Get the Development, QA, Staging or Production Environment you need at the click of a button.
The current situation:
It is no bold claim to say that all software's ultimate goal is to enhance the customer experience. How many times have we read comments like this on app stores, or heard the business say:
“Great app, but I can only give it three stars until the developers add ...”
But the development team's side of the story is:
“I am waiting for the environment to test the code with new features”
Continuous Delivery and Continuous Integration can help release software updates more frequently and with almost no manual intervention, but there are some bottlenecks. A few of them:
Delay in getting the environments
Lack of self-provisioning creates dependency on the IT department.
Lack of easily customizable environments
For development, testing, and staging with new features or updates to dependencies.
Manual provisioning of environments
Being repetitive and involving several steps, it prevents us from leveraging the power of automated deployments and CI.
And the hilarious but unfortunately true risk of
“Oh! But it works on my laptop!”
Not being able to recreate environments easily and consistently means you cannot reliably reproduce performance issues, or release code and updates to production with confidence.
Inconsistent environments arise, for example, when a new update is released to the production system and the system admin has put in configuration or dependencies that only he or she knows about to get the app working. Similarly, the developer might have unique settings on his or her workstation or laptop to get the code running there. Every server thus becomes a "work of art", as unique as a snowflake. Needless to say, inconsistent environments make it very difficult to determine why an application breaks when it is promoted to the next environment, wasting the development and operations teams' time determining whether an issue is due to the source code or to environment configuration.
What is an Environment?
It is not just an image or template of a virtual machine but all the compute, storage, network and several other resources (XaaS) that are required to host your application. Quite simply put, everything you can find inside the server room!
Environments on Demand at the click of a button
A solution that could give the Development, QA, Staging or Production environment at the click of a button would remove all the bottlenecks and risks discussed earlier, and at the same time orchestrate software-defined compute, networking, storage, security and more to provide a smart infrastructure that is aware of the resources needed by the application and is adaptive and responsive to workloads under fluctuating business demand. All this while remaining easy to customise and simple to use.
OpenStack has the potential to deliver the agile, flexible infrastructure that businesses will need to compete in a fast changing global economy. For many users though, OpenStack appears complex and challenging to manage. During this session Mark Baker gives examples of how real users of OpenStack in production are addressing key operational requirements and will use live demos to show how Ubuntu OpenStack and automation tools can be used to simplify service delivery and make cloud life a lot easier.
Cloud giants such as Google, Twitter, and Netflix have made the core building blocks of their infrastructure available as open source. The result of many years of cloud experience is now freely accessible, and anyone can develop their own cloud-native applications: applications that run reliably in the cloud and scale almost arbitrarily. The individual building blocks are growing together into a larger whole, the cloud-native stack. In this session we briefly present the most important concepts and current key technologies. We then implement a simple microservice with .NET Core and Steeltoe OSS and, together with selected building blocks for service discovery and configuration, bring it step by step to run on a Kubernetes cluster. @BASTAcon #BASTA17 @qaware #CloudNativeNerd
https://basta.net/microservices-services/cloud-native-net-microservices-mit-kubernetes/
General overview of what "Chaos Engineering" is, the current "perturbation models" available, and the benefits of Chaos Engineering to customers, business, and tech.
Service Virtualization: What Testers Need to KnowTechWell
Unrestrained access to a trustworthy and realistic test environment—including the application under test and all of its dependent components—is essential for achieving “quality @ speed” with agile, DevOps, and continuous delivery. Service virtualization is an emerging technology that provides teams access to a complete test environment by simulating the dependent components that are beyond their control, still evolving, or too complex to configure in a test lab. Arthur Hicken covers the ABCs of service virtualization—what it is and how it impacts Access, Behavior, Cost, and Speed. Learn how it can help you test more rigorously, avoid parallel development bottlenecks, and isolate application layers for debugging and performance testing in two ways—first, by providing access to dependent system components that would otherwise delay development and testing tasks; and second, by allowing you to alter the behavior of those dependent components in ways that would be impossible with a staged test environment.
Don't Fumble the Data! Integrate Database Automation into your DevOps ToolchainDevOps.com
Today, we have proven techniques for many DevOps practices. For provisioning a new environment, we apply file-based environment definitions to dynamic infrastructure: Helm for Kubernetes, Heat for OpenStack, or Terraform for other clouds. For automating an application deployment, we can turn to basic pipelines like Jenkins provides or release automation tools like IBM UrbanCode Deploy.
However, one area is a constant sticking point: data. A provisioned test lab is useless without test data. Automated deployment tied to a manual schema update is only as fast as the DBAs working by hand. Meanwhile, data is different. There can be a lot of it. It's often sensitive. Changes to schema are generally incremental. Naively applying something like Terraform to the data problem is a recipe for trouble.
There is good news. Tools that specialize in managing databases are easy to integrate into your DevOps toolchain. Join Actifio's Jay Livens, DBmaestro's Chris Lucca and IBM's Eric Minick for a lively conversation examining how to overcome this stumbling block.
stackconf 2023 | Bringing Order to Chaos: Make Your Systems More Resilient wi...NETWAYS
Chaos Engineering is a new approach that helps identify & address weaknesses in software systems by intentionally introducing controlled failures. This talk covers principles & practices of chaos engineering, using real-world examples to show how it has improved resiliency, performance & saved costs. You’ll learn how to design & execute chaos experiments, interpret results, and implement chaos engineering within your organization. The goal is to create highly resilient systems that can withstand any challenge in today’s fast-paced digital landscape.
Proactive ops for container orchestration environmentsDocker, Inc.
Break -> inspect -> fix is the Ops workflow for infrastructure stacks of the past. Distributed infrastructure and applications claim to be the new generation, but why is it so much more painful to maintain and troubleshoot them? Much of the pain comes from outdated operational models relying on reactive or, worse yet, manual monitoring and Ops.
This talk lays out a proactive Ops model for container infrastructure. By focusing on event monitoring, infrastructure state monitoring, trend analysis, and distributed log collection, a proactive Ops model delivers observability for distributed apps that was not possible before. Using real-world examples from Swarm and Kubernetes, we'll demonstrate the tools used and how we relieve Ops pain in container orchestration.
Case Study: Datalink—Manage IT monitoring the MSP wayCA Technologies
Increasing infrastructure complexity is causing IT operations teams to re-think their monitoring approach. In this presentation with Datalink, learn how to build and evolve a proactive IT monitoring strategy geared towards the modern, dynamic IT landscape. Learn how Datalink proactively manages IT environments of leading Fortune 500 companies by leveraging analytics, intelligent alarms, a unified architecture and advanced process automation to achieve operational efficiencies. You will also learn how to make monitoring look easy to your end users while delivering the flexibility required to monitor just about anything they throw at you.
For more information on DevOps solutions from CA Technologies, please visit: http://bit.ly/1wbjjqX
On the Application of AI for Failure Management: Problems, Solutions and Algo...Jorge Cardoso
Artificial Intelligence for IT Operations (AIOps) is a class of software which targets the automation of operational tasks through machine learning technologies. ML algorithms are typically used to support tasks such as anomaly detection, root-cause analysis, failure prevention, failure prediction, and system remediation. AIOps is gaining increasing interest from industry due to the exponential growth of IT operations and the complexity of new technology. Modern applications are assembled from hundreds of dependent microservices distributed across many cloud platforms, leading to extremely complex software systems. Studies show that cloud environments are now too complex to be managed solely by humans. This talk discusses various AIOps problems we have addressed over the years and gives a sketch of the solutions and algorithms we have implemented. Interesting problems include hypervisor anomaly detection, root-cause analysis of software service failures using application logs, multi-modal anomaly detection, root-cause analysis using distributed traces, and verification of virtual private cloud networks.
AIOps: Anomalous Span Detection in Distributed Traces Using Deep LearningJorge Cardoso
The field of AIOps, also known as Artificial Intelligence for IT Operations, uses algorithms and machine learning to dramatically improve the monitoring, operation, and maintenance of distributed systems. Its main premise is that operations can be automated using monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how AIOps, together with many related fields such as deep learning, machine learning, distributed traces, graph analysis, time-series analysis, sequence analysis, and log analysis, can be applied to effectively detect, localize, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs). In particular, this lecture will describe how a particular monitoring data structure, called a distributed trace, can be analyzed using deep learning to identify anomalies in its spans. This capability empowers operators to quickly identify which components of a distributed system are faulty.
For more than 10 years, research on service descriptions has mainly studied software-based services and provided languages such as WSDL, OWL-S, and WSMO for SOAP, and hREST for REST. Nonetheless, recent developments from service management (e.g., ITIL and COBIT) and cloud computing (e.g., Software-as-a-Service) have brought new requirements to service description languages: the need to also model business services and account for the multi-faceted nature of services. Business orientation, co-creation, pricing, legal aspects, and security issues are all elements which must also be part of service descriptions. While ontologies such as e³service and e³value provided a first modeling attempt to capture a business perspective, concerns on how to contract services and the agreements entailed by a contract also need to be taken into account. This has for the most part been disregarded by the e³-family of ontologies. In this paper, we review the evolution and provide an overview of Linked USDL, a comprehensive language which provides a (multi-faceted) description to enable the commercialization of (business and technical) services over the web.
Ten years of service research from a computer science perspectiveJorge Cardoso
…It has been more than 10 years since a strong research stream on services started from the field of computer science. The main trigger was without a doubt the introduction of the Web Service Description Language (WSDL), a specification to represent a piece of software functionality which could be remotely invoked. Nonetheless, this was only the "tipping point". The generalized interest in this new development was followed by interesting topics of research on the application of semantics to enhance the description of services, the composition of services into processes, the analysis of the quality of services, the complexity of processes supporting services, and the development of comprehensive service description languages. This seminar will provide an overview of the main research topics around services and will glimpse at a new research field on the analysis of service networks...
Cloud Computing Automation: Integrating USDL and TOSCAJorge Cardoso
-- Presented at CAiSE 2013, Valencia, Spain --
Standardization efforts to simplify the management of cloud applications are being conducted in isolation. The objective of this paper is to investigate to what extent two promising specifications, USDL and TOSCA, can be integrated to automate the lifecycle of cloud applications. In our approach, we selected a commercial SaaS CRM platform, modeled it using the service description language USDL, modeled its cloud deployment using TOSCA, and constructed a prototypical platform to integrate service selection with deployment. Our evaluation indicates that a high level of integration is possible. We were able to fully automate the remote deployment of a cloud service after it was selected by a customer in a marketplace. Architectural decisions emerged during the construction of the platform and were related to global service identification and access, multi-layer routing, and dynamic binding.
Understanding how services operate as part of large scale global networks, the related risks and gains of different network structures and their dynamics is becoming increasingly critical for society. Our vision and research agenda focuses on the particularly challenging task of building, analyzing, and reasoning about global service networks. This paper explains how Service Network Analysis (SNA) can be used to study and optimize the provisioning of complex services modeled as Open Semantic Service Networks (OSSN), a computer-understandable digital structure which represents connected and dependent services.
Open Semantic Service Networks: Modeling and AnalysisJorge Cardoso
A new interesting research area is the representation and analysis of the networked economy using Open Semantic Service Networks (OSSN). OSSN are represented using the service description language USDL to model nodes and the service relationship model OSSR to model edges. Nonetheless, in their current form USDL and OSSR do not provide constructs to capture the dynamic behavior of service networks. To bridge this gap, we used General System Theory (GST) as a framework guiding the extension of USDL and OSSR to model dynamic OSSN. We evaluated the extensions made by applying USDL and OSSR to two distinct types of dynamic OSSN analysis: 1) evolutionary, using Preferential Attachment (PA), and 2) analytical, using concepts from System Dynamics (SD). Results indicate that OSSN can constitute the first stepping stones toward the analysis of global service-based economies.
Modeling Service Relationships for Service NetworksJorge Cardoso
The last decade has seen an increased interest in the study of networks in many fields of science. Examples are numerous, from sociology to biology, and to physical systems such as power grids. Nonetheless, the field of service networks has received less attention. Previous research has mainly tackled the modeling of single service systems and service compositions, often focusing only on studying temporal relationships between services. The objective of this paper is to propose a computational model to represent the various types of relationships which can be established between services systems to model service networks. This work acquires a particular importance since the study of service networks can bring new scientific discoveries on how service-based economies operate at a global scale.
Description and portability of cloud services with USDL and TOSCAJorge Cardoso
The provisioning and management of cloud services are major concerns since they bring clear benefits such as elasticity, flexibility, scalability, and high availability of applications for enterprises. Two emerging contributions set semantics and machine-understandable specifications for the description and portability of cloud-based services: USDL and TOSCA. In this talk we will explain how both can be articulated to work in conjunction. The Unified Service Description Language (USDL) was created for describing business or real world services to allow services to become tradable and consumable on marketplaces. On the other hand, the Topology and Orchestration Specification for Cloud Applications (TOSCA) was standardized to enable the portability of complex cloud applications and their management across different cloud providers.
To address the emerging importance of services and the relevance of relationships, we have developed and introduced the concept of Open Semantic Service Network (OSSN). OSSN are networks which relate services with the assumption that firms make the information of their services openly available using suitable models. Services, relationships and networks are said to be open (similar to LOD) when their models are transparently available and accessible by external entities and follow an open world assumption. Networks are said to be semantic when they explicitly describe their capabilities and usage, typically using a conceptual or domain model, and ideally using Semantic Web standards and techniques. One limitation of OSSNs is that they were conceived without accounting for the dynamic behavior of service networks. In other words, they can only capture static snapshots of service-based economies but do not include any mechanism to model reactions and effects that services have on other services and the notion of time.
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesSanjeev Rampal
Talk presented at Kubernetes Community Day, New York, May 2024.
Technical summary of Multi-Cluster Kubernetes Networking architectures with focus on 4 key topics.
1) Key patterns for Multi-cluster architectures
2) Architectural comparison of several OSS/ CNCF projects to address these patterns
3) Evolution trends for the APIs of these projects
4) Some design recommendations & guidelines for adopting/ deploying these solutions.
1. OpenStack: A Cloud without Failures?
Higher reliability through deliberately provoking failures!
Using Fault Injection to Increase Cloud Reliability!
Götz Brasche / Jorge Cardoso
CTO IT PL RnD and Director CSI / Lead Architect Cloud Operations and Analytics
21.06. - 22.06.2016
Cologne, Germany
2.
• 47% of all problems in data centers result from cloud service outages
• Outage durations range from 40 minutes to five days
• Average cost per outage: US$ 690,204
• Cost per minute: close to US$ 7,000
3. Unplanned downtime is caused by*:
software bugs … 27%
hardware … 23%
human error … 18%
network failures … 17%
natural disasters … 8%
* Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
4. Google's 2007 study found annualized failure rates (AFRs) for drives:
1 year old: 1.7%
3 years old: >8.6%
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proc. of the 5th USENIX Conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
6. Key User Interests
OpenStack User Survey: A snapshot of OpenStack users' attitudes and deployments. April 2016. (https://www.openstack.org/assets/survey/April-2016-User-Survey-Report.pdf). Fig. 2.1, Page 9.
8. Huawei Cloud Computing Investment & Rewards
• Gartner: member of the Magic Quadrant for x86 Server Virtualization Infrastructure and of the Magic Quadrant for Integrated Systems
• Frost & Sullivan 2013: Cloud Infrastructure Product Innovation Award
• 10,000 Huawei employees working in cloud computing and dedicated to meeting every IT requirement
• No. 1: industry-leading performance according to the SPECvirt server virtualization performance benchmark
• Best of Show Award nomination, Interop 2013
• DCD Blueprint Award: China's first in the data center industry
• 100,000 desktops in the world's largest-scale deployment
10. FusionSphere Architecture
(Architecture diagram.) On top of the physical infrastructure (server, storage, network & security) sit three virtualization architectures side by side: the Huawei virtualization architecture (FusionCompute, FusionStorage, FusionNetwork), VMware vSphere, and third-party virtualization architectures, exposed through the Huawei Open API and a cloud storage API. Above them, FusionManager provides the portal, RBAC, alarms, logs, an open API, resource management, configuration API adapters, cloud storage management, and VDC/VPC. Backup & DR components (eBackup, UltraVR) and FusionSphere SOI attach via SNMP/REST northbound interfaces.
11. Market Recognition Grows…
2x higher scalability (nodes per cluster: vSphere 5.1: 32; vSphere 6.0: 64; FusionSphere 5.0: 128).
Industry-leading performance in the SPECvirt test:
Hypervisor | SPECvirt Score | Ranking
FusionSphere 5.0 | 632 | 1
Linux 6.4 (KVM) | 625 | 2
ESXi 5.1 | 472 | 3
http://www.spec.org/virt_sc2013/results/specvirt_sc2013_perf.html
Member of the Gartner Magic Quadrant for x86 Server Virtualization Infrastructure: for the first time in 3 years, Gartner has introduced a new company, Huawei (FusionSphere), into the Magic Quadrant for x86 Server Virtualization Infrastructure. (Quadrant chart: leaders, challengers, niche players, visionaries; placing VMware, Microsoft, Citrix, Red Hat, Oracle, Parallels, and Huawei. As of July 2014. Source: Gartner (July 2014).)
Support for Critical Applications
• High performance: <5% CPU performance overheads; support for database, email, ERP, and CRM services
• High reliability: proactive event detection; active/standby management nodes; upgrade without service interruption; multi-level disaster recovery plans
12. FusionSphere 5.1 Key Performance Indicators
Physical server/VM performance indicators:
Max. number of vCPUs (virtual SMP) per VM: 128
Max. memory size per VM: 4 TB
Max. virtual disk capacity per VM: 64 TB
Max. number of virtual disks per VM: 60
Max. number of virtual NICs per VM: 12
Max. number of logical CPU cores per physical server: 480
Max. memory size per physical server: 12 TB
Max. number of powered-on VMs per physical server: 1024
13. FusionSphere 5.1 Key Performance Indicators
Management indicators:
Max. number of physical servers per logical cluster: 128
Max. number of VMs per logical cluster: 3000
Max. number of logical clusters supported by a VRM node: 32
Max. number of hosts supported by a VRM node: 1024
Max. number of VMs supported by a VRM node: 10,000
Max. number of VRM nodes that can be cascaded: 16
Max. number of physical servers supported by cascaded VRM nodes: 4096
Max. number of VMs supported by cascaded VRM nodes: 80,000 (best practice in the industry)
14. FAILURES ARE INEVITABLE! THE BEST WE CAN DO IS BE PREPARED FOR THEM AND LEARN FROM THEM.
TEST, REPAIR, LEARN & PREDICT!
Kripa Krishnan, Technical Program Director at Google
15. 16
One reason [Netflix]: It’s the lack of control over the underlying
hardware, the inability to configure it to ensure 100% uptime
Why does using a cloud infrastructure requires
advanced approaches for resiliency?
16. 17
A program designed to increase resilience by purposely injecting
major failures
Discover flaws and subtle dependencies
Amazon AWS: GameDay
“That seems totally bizarre on the face of it, but as you dig down, you end up finding
some dependency no one knew about previously […] We’ve had situations where we
brought down a network in, say, São Paulo, only to find that in doing so we broke our
links in Mexico.”
17. 18
Google DIRT (Disaster Recovery Test)
Annual disaster recovery & testing exercise
8 years since inception
Multi-day exercise triggering (controlled) failures in systems and process
Premise
30-day incapacitation of headquarters following a disaster
Other offices and facilities may be affected
When
“Big disaster”: Annually for 3-5 days
Continuous testing: Year-round
Who
100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities)
Business units (Human Resources, Finance, Safety, Crisis response etc.)
Google: DiRT
Source http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf
18. 19
Netflix: Chaos Monkey
Fewer alerts for
ops team
Amazon EC2 and Amazon RDS Service
Disruption in the US East Region
April 29, 2011
September 20th, 2015: Amazon’s DynamoDB service experienced an availability issue in their US-EAST-1
Transfer traffic
to east region
19. 20
Huawei: Butterfly Effect
-- Butterfly Effect System --
Enables to Automatically Test and Repair OpenStack and Cloud
Applications
CLOUD APPLICATION
HUAWEI FusionSphere
The system works by intentionally injecting different failures, test the ability to
survive them, and learn how to predict and repair failures preemptively
Failure
Repair
Test
21. 22
Design & Execute Fault-Injection Plan
Best way to avoid failure: Fail constantly
Kill cinder database
(Simulate update failure)
Introduce delay in messages
(Full-scale traffic shows where
the real bottlenecks are)
Operation Error
OPENSTACK_KEYSTONE_URL = "http://%s:5000/v2.0" % OPENSTACK_HOST
Operation Error
/etc/nova/nova.conf
Delete: auth_strategy=keystone
Remove driver to HD
Remove access to NFS
(Simulate hardware failure)
The main testing framework of OpenStack is called Tempest, an opensource project with more than 2000 tests: only black-box testing (test only access the public interfaces)
2
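The kind of fault plan listed above can be sketched as a small runner that injects each fault, probes whether the system survives, and reverts it. This is a minimal toy sketch: the `Fault` class, `run_plan`, the `probe` check, and the in-memory `system` dict are all invented for illustration, and the "faults" only mutate a toy model, not a real OpenStack deployment.

```python
# Toy fault-injection plan runner (illustrative names, not OpenStack APIs).

class Fault:
    def __init__(self, name, inject, revert):
        self.name = name
        self.inject = inject    # callable that applies the fault
        self.revert = revert    # callable that undoes it

def run_plan(system, faults, probe):
    """Inject each fault, probe whether the system survives, then revert."""
    results = {}
    for fault in faults:
        fault.inject(system)
        results[fault.name] = probe(system)   # True = survived the fault
        fault.revert(system)
    return results

# Toy system state standing in for a cloud platform.
system = {"cinder-db": "up",
          "nova.conf": {"auth_strategy": "keystone"},
          "msg_delay_ms": 0}

faults = [
    Fault("kill cinder database",
          lambda s: s.update({"cinder-db": "down"}),
          lambda s: s.update({"cinder-db": "up"})),
    Fault("delete auth_strategy from nova.conf",
          lambda s: s["nova.conf"].pop("auth_strategy", None),
          lambda s: s["nova.conf"].update({"auth_strategy": "keystone"})),
    Fault("introduce message delay",
          lambda s: s.update({"msg_delay_ms": 500}),
          lambda s: s.update({"msg_delay_ms": 0})),
]

def probe(s):
    # A black-box health check in the spirit of Tempest: only public state.
    return s["cinder-db"] == "up" and "auth_strategy" in s["nova.conf"]

report = run_plan(system, faults, probe)
```

In a real setup the inject/revert callables would stop services or edit configuration files, and the probe would run actual black-box tests against the platform's public APIs.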
40. Typically, the first thing to look for in the logs when you hit a problem is the error message or the stack trace that has the details on which part of the Python module the problem is generated, for that particular component. This should explain the problem in detail, as well as its origin. You should also look for the exception that gets logged for any failure, which will also help identify the root cause of the failure. If there are no exception/stack-trace messages dumped to the logs, the problem might not be of severe impact and the user can continue with their operation. However, in such cases, at least a warning message will get logged to make sure that we don't miss anything happening in the FusionSphere environment.
The root cause of most of the issues seen in FusionSphere can be triaged with the default logging levels. If the support team requires more detailed logging, you can enable the debug logs and perform the action again to capture all the details.
To have a stable environment, you want to detect failures promptly and determine causes efficiently. With a distributed system, it's even more important to track the right items to meet a service-level target. By knowing where the logs are and how to manage them, you can analyze most issues you encounter, allowing you to keep your environment running smoothly.
Limitations of Troubleshooting Approaches
Although today's programs are orders of magnitude more complex than those of 30 years ago, many people still use printf to log to console or local disk, and use some combination of manual inspection and regular expressions to locate specific messages or patterns.
February 2012, Vol. 55, No. 2, Communications of the ACM
Manual, complex, error-prone, and expensive.
http://www.slideshare.net/tomoya/openstack-at-ntt-resonant-lessons-learned-in-web-infrastructure
One (very) simple command, glance.images.list(), produces 316K of logs: 1499 messages (1068 DEBUG, 23 INFO, 408 others).
DOCOMO has shown (100 GB and 80M lines)/day for 100 nodes.
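The log triage described above, counting messages per level and pulling out the first error and stack trace, can be sketched in a few lines. The sample log lines, the regex, and the `summarize` helper are invented for illustration; they loosely follow the common OpenStack log layout but are not real Glance output.

```python
import re
from collections import Counter

# Match the log level token in each line (DEBUG/INFO/WARNING/ERROR/TRACE).
LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARNING|ERROR|TRACE)\b")

def summarize(lines):
    """Count messages per level and remember the first ERROR line."""
    levels = Counter()
    first_error = None
    for line in lines:
        m = LEVEL_RE.search(line)
        if m:
            levels[m.group(1)] += 1
            if m.group(1) == "ERROR" and first_error is None:
                first_error = line
    return levels, first_error

# Invented sample log lines in an OpenStack-like format.
log = [
    "2016-06-21 10:00:01 DEBUG glance.api [req-1] fetching image list",
    "2016-06-21 10:00:01 INFO glance.api [req-1] 200 GET /v2/images",
    "2016-06-21 10:00:02 ERROR glance.api [req-2] DBConnectionError: lost connection",
    "2016-06-21 10:00:02 TRACE glance.api Traceback (most recent call last):",
]

levels, first_error = summarize(log)
```

At DOCOMO-scale volumes (100 GB/day) the same idea would be run by a log pipeline rather than a script, but the triage logic is identical.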
45. Google's Tracing System (see also X-Trace and Magpie)
Originally created to understand system behavior starting from a search request. Today Google's production clusters generate >1 TB/day of sampled trace data.
Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Google, Inc. (2010).
Annotation example:
// Java:
Tracer t = Tracer.getCurrentTracer();
String request = ...;
if (hitCache())
  t.record("cache hit for " + request);
else
  t.record("cache miss for " + request);
41 Java and 68 C++ applications have custom annotations to better understand intra-span activity.
Overall approach: when a thread handles a traced control path, Dapper attaches a trace context to thread-local storage; when the control-flow library (threading, control flow, RPC) is used to schedule callbacks, Dapper attaches a trace context to them.
Performance: basic instrumentation is kept as small as possible, and only a fraction of all traces is recorded using sampling (1/1000). The collection daemon uses <0.3% CPU and has a small memory footprint.
46. Dapper
Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Google, Inc. (2010).
49. Fingerprint Analytics
Can you identify your failures?
Key challenges: analysis, new matching algorithms, real-time fingerprint detection, fingerprint prediction, alignments and nested fingerprints.
(Diagram: metrics M1…M8 across Service a, Service b, … Service k; running `$ openstack image list` is classified as OK or as Fault 15a.)
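The fingerprint idea can be illustrated with a toy matcher that treats a failure's fingerprint as the sequence of monitoring events it produces and looks for that sequence in an observed event stream. The event names and the fingerprint library are invented for illustration; real fingerprint analytics would use the matching algorithms listed above (alignments, nested fingerprints, real-time detection).

```python
# Invented fingerprint library: event sequence -> known failure label.
KNOWN_FINGERPRINTS = {
    ("api_timeout", "db_conn_lost", "retry_storm"): "Fault 15a: cinder DB outage",
    ("msg_delay", "queue_backlog"): "Fault 7: message queue congestion",
}

def identify(events, known=KNOWN_FINGERPRINTS):
    """Return the known failure whose fingerprint appears as a contiguous
    subsequence of the observed event stream, or None if nothing matches."""
    for fp, label in known.items():
        n = len(fp)
        if any(tuple(events[i:i + n]) == fp for i in range(len(events) - n + 1)):
            return label
    return None

observed = ["boot", "api_timeout", "db_conn_lost", "retry_storm", "alarm"]
match = identify(observed)
```

Exact contiguous matching is the simplest possible scheme; the "alignments" challenge on the slide corresponds to allowing gaps and interleaved events, which exact matching cannot handle.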
50. Application Fields
• Failure Mode and Effect Analysis (FMEA)
• Security and intrusion detection
• Performance analysis (bottlenecks)
• Cloud accountability
51.
• OpenStack engineer positions: rapid prototyping of cool ideas; propose it today, code it, and show it running in 3 months…
• Internship positions for MSc students: fault injection, fault models, fault libraries, fault plans; break and rebuild systems all day long, …
• Innovative PoCs: solving difficult challenges of real problems using quick-and-dirty prototyping
Join the Cause!
52. Industry-Academia Workshop on Cloud Reliability and Resilience
This workshop intends to bring together industry and academia to identify the most relevant requirements in the field of cloud reliability and resilience, on one hand, and existing state-of-the-art solutions, on the other. We invite engineers, scientists, regulators, and experts to discuss and contribute to the creation of a new generation of highly reliable cloud platforms.
November Event: Berlin, 7-8 November 2016
Ernst-Reuter-Platz 7, 10587 Berlin, Germany