An overview of Google's Site Reliability Engineering with a view toward possible incorporation in the IEEE P2675 DevOps security standard. (Creative Commons with credit.)
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit... - Tori Wieldt
How do you make DevOps magic when you aren’t Google? This talk will help whether you’re still figuring out how to create a site reliability practice at your company or you’re trying to improve the processes and habits of an existing SRE team.
Getting started with Site Reliability Engineering (SRE) - Abeer R
"Getting started with Site Reliability Engineering (SRE): A guide to improving systems reliability at production"
This is an introductory guide that shares some common SRE concepts with a non-technical audience. We will look at both the technical and the organizational changes that should be adopted to increase operational efficiency, ultimately enabling global optimizations such as minimizing downtime and improving systems architecture and infrastructure:
- Improving incident response
- Defining error budgets
- Better monitoring of systems
- Getting the best out of systems alerting
- Eliminating manual, repetitive actions (toil) through automation
- Designing better on-call shifts/rotations
- Designing the role of the Site Reliability Engineer (who effectively works between application development teams and operations support teams)
SRE (service reliability engineer) on big DevOps platform running on the clou... - DevClub_lv
SRE (service reliability engineer). The talk explains the SRE philosophy and the principles of production engineering and operations in the cloud.
(Language – English)
Pavlo is the ADOP (Accenture DevOps Platform) Service Reliability Team Lead and an SRE practitioner. He has more than 18 years of IT experience in Ops and Dev.
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa - Keet Sugathadasa
When it comes to Site Reliability Engineering (SRE), the resources available online are largely limited to the books published by Google itself. They share some useful case studies that help us understand what SRE is and how to interpret its concepts, but they do not clearly explain how to build an SRE team for your own organization. The concept of SRE was cooked fresh within the walls of Google and later released to the general public as a practice for anyone to follow.
In this presentation I would like to give a brief introduction to SRE and why it is important to any Software Engineering organization. This is based on my experiences and learnings from leading a Site Reliability Engineering team for leading organizations in the US and Norway.
I gave this presentation as a Tech Talk while working as an Associate Technical Lead at Creative Software Sri Lanka.
Overview of Site Reliability Engineering (SRE) & best practices - Ashutosh Agarwal
In any software organization, stability & innovation are always at loggerheads - the faster you move, the more things will break. This talk defines what an SRE org looks like at high-tech organizations (Google, Uber).
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W... - ITSM Academy, Inc.
Presenter: Perry Statham
SRE Squad Leader with IBM Cloud DevOps Services
In this presentation, the IBM DevOps Services SRE team will give a brief introduction to Site Reliability Engineering, then show how they adopted its principles in their existing enterprise organization.
How to bootstrap an SRE team into your company: how to hire them, what to have them work on, and how to interact with them as a team. Finally, some thoughts on general practices to consider before your SREs arrive. There are also kitten pictures.
<p>From <a href="https://en.wikipedia.org/wiki/Site_reliability_engineering" target="_blank">Wikipedia</a>: Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies that to operations whose goals are to create ultra-scalable and highly reliable software systems.</p>
<p>Over the past year Acquia has built their own SRE team to help their products and services scale with the demand of our growing number of customers. We wish to share our experience so that others are enabled to do the same and reap the rewards.</p>
<p>This presentation will discuss how the SRE team came about at Acquia, what achievements we have made so far, and the lessons we have learned along the way. We will then show the steps on how to introduce SRE to your workplace so you can deliver more reliable and scalable services to your customers! We will specifically cover:</p>
<ul>
<li>SRE's basic concepts and history from Google</li>
<li>The management support you will need to get started</li>
<li>Introducing the idea of service level objectives and error budgets</li>
<li>Operational Responsibility Assessments as a tool to measure risk</li>
<li>Creating a Launch Readiness Checklist to standardize and improve product launches</li>
<li>Finding ideal candidates for your SRE team</li></ul>
<p>The intended audience is software engineers, system administrators, and managers who want to improve how they do their work and how their products/services perform.</p>
According to Google, SRE is what you get when you treat operations as if it’s a software problem. In this video, I briefly explain the term SRE (Site Reliability Engineering) and introduce the key metrics for an SRE team: SLI, SLO, and SLA.
YouTube channel here: https://www.youtube.com/playlist?list=PLm_COkBtXzFq5uxmamT0tqXo-aKftLC1U
Adopting Kubernetes for production has huge impacts on operations at all levels. We present our pattern for formalizing cluster operations as a separate role from infrastructure and application operations, and explore the impact on the role of the SRE.
How Small Team Get Ready for SRE (public version) - Setyo Legowo
How Urbanindo's small engineering team implements Site Reliability Engineering (SRE) in its daily work, and why we chose SRE instead of ordinary DevOps.
According to Google, SRE is what you get when you treat operations as if it’s a software problem. In this video, I briefly explain what is and isn't toil, and how to identify, measure, and eliminate it.
YouTube channel here: https://youtu.be/EgpCw15fIK8
Independently of the DevOps movement but starting from the same problems, Google developed its own strategy, defining a new role called SRE (Site Reliability Engineer). This introduction explains the history and concepts of this methodology and compares it with the DevOps manifesto, to clarify what it means to adopt DevOps, what it means to be an SRE, what the two share, and where they diverge.
In this presentation I will talk about what SRE and DevOps are and what reliability is. I will also cover the approach to reliability in competitive gaming at Wargaming and show a few cases.
SRE-iously: Defining the Principles, Habits, and Practices of Site Reliabilit... - New Relic
No matter how you define it, the Site Reliability Engineer (SRE) role is clearly expanding into more and more companies. To be effective in this new role, SREs must possess a depth of understanding of how different systems work together, how they fail, how they can be improved, and how they can best be designed and monitored.
Service Level Terminology: SLA, SLO & SLI - Knoldus Inc.
Measuring outcomes is always at the top of our minds when approaching goals. While we may be aiming for specific targets, circling back to confirm that the resulting outcome is in fact what we were after is extremely important. Small course corrections are required. Outcomes may be more general but often attract the attention and support of decision-makers earlier.
Key measurements and thresholds need to be established to hold us accountable for our efforts and to communicate expectations across the entire organization. Nearly every resource you find regarding site reliability engineering will talk about key metrics used to establish high-level objectives, indicators of the movement toward or away from those objectives, and ultimately what agreements are in place should objectives be unfulfilled.
SLIs will help us know how we are performing against our SLOs, and our SLA will outline the consequences (good or bad) of meeting or missing those objectives. Once we have data to observe, we will begin orienting ourselves to it and establish what we believe our SLIs and SLOs to be.
Here’s an outline of the webinar -
~ Learn what an SRE is and isn't.
~ Understand the difference between service-level indicators (SLI), service-level objectives (SLO), and service-level agreements (SLA).
~ Gain an understanding of error budgets and how to calculate reliability cost.
~ Learn how SREs can embed themselves within development teams to increase operational stability.
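The relationship between an SLI, an SLO, and an error budget named in the outline above can be sketched in a few lines of Python. This is a minimal illustration and not material from the webinar: the function names (`evaluate_slo`, `allowed_downtime_minutes`), the request counts, and the 99.9% target are all assumptions chosen for the example, and it assumes a simple request-based availability SLI.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured fraction of good requests
    error_budget: float     # allowed fraction of bad requests (1 - SLO target)
    budget_consumed: float  # share of the budget spent so far (1.0 = all of it)


def evaluate_slo(good_requests: int, total_requests: int, slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = good_requests / total_requests
    error_budget = 1.0 - slo_target
    bad_fraction = 1.0 - sli
    return SloReport(sli, error_budget, bad_fraction / error_budget)


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Translate an availability target into minutes of downtime per window."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    # Hypothetical numbers: 1,000,000 requests, 500 of them failed.
    report = evaluate_slo(good_requests=999_500, total_requests=1_000_000, slo_target=0.999)
    print(f"SLI: {report.sli:.4%}, budget consumed: {report.budget_consumed:.0%}")
    print(f"Downtime allowed in 30 days: {allowed_downtime_minutes(0.999):.1f} min")
```

Under these assumptions a 99.9% target over a 30-day window leaves roughly 43 minutes of downtime, and a team that has consumed its whole budget would typically slow releases until the window recovers.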
The modern IT stack has become diverse and distributed, and it’s increasingly challenging to manage heterogeneous platforms and multi-vendor devices. Customers are looking to the cloud and APM to help address these hurdles, as well as accelerate IT transformation.
But migrating to the cloud will take time, it won’t make infrastructure ‘just disappear’, and legacy workloads are going to remain part of the enterprise reality for many. In addition, while APM will continue to be increasingly important, all applications are not the same and an application is still not equal to a digital business service.
Watch this webinar as John Worthington, a service management expert and Director of Product Marketing for eG Innovations, continues our Shift-Left series. You can learn:
• Why domain expertise is important when defining monitoring requirements
• What analytics are useful from a monitoring and observability context
• How end-to-end monitoring with converged application and infrastructure performance can drive ITSM and DevOps integration
Presentation of the talk given by Carmine Spagnuolo (Postdoctoral Research Fellow - Università degli Studi di Salerno / ACT OR), entitled "Technology insights: Decision Science Platform", during the Decision Science Forum 2019, the most important Italian event on Decision Science.
The Reality of Managing Microservices in Your CD Pipeline - DevOps.com
As we shift from monolithic software development practices to microservices, our well-designed CD pipeline will need to change. Microservices are small functions, deployed independently and linked via APIs at run-time. While these differences seem minor, they actually have a large impact on your overall CD structure. Think hundreds of workflows, small, if any, builds and the loss of a monolithic 'application.'
Join Tracy Ragan, CEO of DeployHub and Brendan O'Leary, Developer Evangelist at GitLab, to learn more.
It's never too early to start the conversation.
What’s New with NGINX Controller Load Balancing Module 2.0? - NGINX, Inc.
On-Demand Link: https://www.nginx.com/resources/webinars/new-nginx-controller-load-balancing-module-2-0/
Speaker:
Karthik Krishnaswamy
Sr Product Marketing Manager
NGINX, Inc.
About the webinar
Achieving consistency in application performance begins with a consistent load balancing configuration. NGINX Controller Load Balancing Module 2.0 introduces a policy-driven approach to configuration management resulting in consistent configuration across multiple NGINX Plus instances. This can be achieved with the push of a button, saving time and effort for I&O teams. We will also showcase NGINX Controller’s integration with ServiceNow which seamlessly blends into your IT service management workflows.
The webinar includes a live demo of the Load Balancing Module in action.
Top 5 Challenges in Scaling DevOps in Brownfield Environments - Deborah Schalm
Many believe that DevOps is primarily for greenfield projects. But in order to compete, enterprises must scale DevOps to utilize new technology solutions while maximizing the value of their current investments in critical IT infrastructure and business applications. Join Gary Gruver, well-known DevOps leader and author, and Mark Levy, Director of Strategy at Micro Focus, as they discuss the main challenges facing large enterprises as they try to scale DevOps across their brownfield environments.
VMworld 2015: vRealize Operations Insight: Manage vSphere and Your Entire Dat... - VMworld
Learn how vRealize Operations Insight delivers an integrated solution for performance management, capacity optimization, real-time log analytics, and more.
Measure and Increase Developer Productivity with Help of Serverless at AWS Co... - Vadym Kazulkin
The goal of Serverless is to focus on writing the code that delivers business value and offload everything else to your trusted partners (like Cloud providers or SaaS vendors). You want to iterate quickly, because today’s code quickly becomes tomorrow’s technical debt. In this talk we will show why Serverless adoption increases developer productivity and how to measure it. We will also go through AWS Serverless architectures where you only glue together different Serverless managed services relying solely on configuration, minimizing the amount of code written.
Devops On Cloud Powerpoint Template Slides Powerpoint Presentation Slides - SlideTeam
Introducing DevOps On Cloud PowerPoint Template Slides PowerPoint Presentation Slides. Provide an overview of DevOps with this attention-grabbing PPT slideshow. This presentation helps to understand the need for DevOps, how it is different from traditional IT, DevOps use cases in business, lifecycle, roadmap, and so on. Provide an overview of how DevOps is different from agile by using the content-ready DevOps strategy PPT visuals. The slides also explain the roles, responsibilities, and skills of DevOps engineers. DevOps automation tools and DevOps roadmap for implementation in the organization can be discussed effectively. Provide an overview of DevOps on the cloud by describing cloud computing, characteristics of cloud computing, benefits, top risks related to cloud computing, etc. Cloud computing use cases and cloud deployment models can be presented with the help of visual attention-grabbing DevOps implementation roadmap PowerPoint slides. The roadmap to integrate cloud computing in business can be depicted easily by using the DevOps implementation strategy PowerPoint slideshow. https://bit.ly/3d8uYRY
The Quality “Logs”-Jam: Why Alerting for Cybersecurity is Awash with False Po... - Mark Underwood
What happens when the (Observe) Plan-Do-Check-Adjust cycle is undermined by lapses in data integrity? Observations are questioned. Plans may be ill-conceived. Actions may be undertaken that undermine rather than enhance. “Checks” can fail. Adjustments may be guesswork. In cybersecurity, the results of poor data integrity can be expensive outages, ransom requests, breaches, fines -- even bankruptcy (think Cambridge Analytica). But data integrity issues take many forms, ranging from benign to malicious. The full range of these issues is surveyed from a cybersecurity perspective, where logs and alerts are critical for defenders -- as well as quality engineers. Techniques borrowed from model-based systems engineering and ontology AI are identified that can mitigate these deleterious effects on PDCA.
In the era of algorithms and AI, codes of ethics should have an added sense of purpose. But do they? The codes of ethics for ACM, IEEE and ASQ are reviewed in light of these concerns. Several case studies are cited which have grabbed headlines over the past two years. An increasingly software / code-driven universe in which AI is insinuated seemingly everywhere is one in which ethics must be present, part of enterprise decision-making, and traceable.
An introductory take on the ethical issues surrounding the use of algorithms and machine learning in finance, education, law enforcement and defense. This work was stimulated by, but is not a product or authorized content from the IEEE P7003 WG.
Disclaimer: This work is mine alone and does not reflect the views of IEEE, the IEEE 7003 WG, or my employer.
DevOps Support for an Ethical Software Development Life Cycle (SDLC) - Mark Underwood
As part of the IEEE SA P7000 and P2675 working groups, it has been determined that DevOps engineering practices can support (or hinder) the environment for an ethical software development life cycle (SDLC). This deck scratches the surface.
Implications of GDPR for IoT Big Data Security and Privacy Fabric - Mark Underwood
Discussion of ways in which GDPR has influenced, and will continue to influence, the SDLC and deployment of IoT, especially as it impacts the privacy and security fabric.
Technologies in Support of Big Data Ethics - Mark Underwood
As part of the NIST Big Data Public Working Group, we examine technologies that can support ethics in systems design. In particular, we review issues raised by the IEEE P7000 community regarding ethics for autonomous systems and robotics. Possible adaptations to the NBDPWG reference model are considered for the third and final version of SP1500.
NIST Big Data Public WG : Security and Privacy v2 - Mark Underwood
Presentation offers an overview of the security and privacy framework offered by the NIST BD PWG. The effort began in 2013 and a final version is due for publication at the end of 2018. The presentation reflects work presented in version 2, currently undergoing final review by NIST. The end product is a NIST special report.
Presents a more expansive view of "stakeholders" in systems design, specifically beyond purely human notions. Produced for use by the IEEE P7000 working group "Model Process for Addressing Ethical Concerns During System Design."
Slowing the Two Cultures continental drift. The humanities are drifting further and further away from the realities of science and technology. Their marginalization should worry us all. I survey the current state of affairs 50 years after CP Snow's talk, and suggest how poets should retool.
IoT Day 2016: Cloud Services for IoT Semantic Interoperability - Mark Underwood
Presentation made on IoT Day 2016 about the importance of API-first, cloud services role in implementing ontologies for IoT. The use case is homely: providing proper humidity to my electric violin and guitar instruments while in their cases.
Ontology Summit - Track D Standards Summary & Provocative Use Cases - Mark Underwood
The Ontology Summit is an annual series of events (first started by Ontolog and NIST in 2006) that involves the ontology community and communities related to each year's theme chosen for the summit. The Ontology Summit program is now co-organized by Ontolog, NIST, NCOR, NCBO, IAOA, NCO_NITRD along with the co-sponsorship of other organizations that are supportive of the Summit goals and objectives. This deck summarizes some of the work in Track D, IoT and Ontology Standards Synergies.
A presentation made at IoT Day 2015. It's an overview of the role ontologies could / should play in the internet of things. Calls for a general software engineering approach that integrates Big Data variety, velocity and veracity (i.e., provenance).
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam - takuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
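The contrast between event sourcing and a CRUD model described above can be illustrated with a toy Python sketch. It is not Wix's implementation: `StockEvent`, `replay`, and `CrudStockTable` are hypothetical names invented for this example, and an in-memory dictionary stands in for a real database.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class StockEvent:
    """An immutable record of one state change (event sourcing)."""
    sku: str
    delta: int  # positive = restock, negative = sale


def replay(events: Iterable[StockEvent]) -> Dict[str, int]:
    """Derive current stock by folding over the event history.

    Replaying only a prefix of the history yields the state at that
    earlier point, which is the "time travel" property of event sourcing.
    """
    stock: Dict[str, int] = {}
    for event in events:
        stock[event.sku] = stock.get(event.sku, 0) + event.delta
    return stock


class CrudStockTable:
    """CRUD alternative: store only the latest value per key."""

    def __init__(self) -> None:
        self._rows: Dict[str, int] = {}

    def upsert(self, sku: str, quantity: int) -> None:
        # Overwrites the previous value; no history is kept.
        self._rows[sku] = quantity

    def read(self, sku: str) -> int:
        return self._rows.get(sku, 0)


if __name__ == "__main__":
    history = [StockEvent("mug", 10), StockEvent("mug", -3), StockEvent("pen", 5)]
    print(replay(history))      # current state from the full history
    print(replay(history[:1]))  # state after the first event only

    table = CrudStockTable()
    table.upsert("mug", 10)
    table.upsert("mug", 7)
    print(table.read("mug"))
```

The sketch shows the trade-off the talk alludes to: the event log keeps every change and needs a fold to answer "what is the state now?", while the CRUD table answers that directly but forgets how it got there.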
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR - Tier1 app
Even though at surface level ‘java.lang.OutOfMemoryError’ appears to be one single error, underneath there are 9 types of OutOfMemoryError. Each type of OutOfMemoryError has different causes, diagnosis approaches and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus... - Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ... - Juraj Vysvader
In 2015, I wrote extensions for Joomla, WordPress, phpBB3, etc. I didn't get rich from it, but they did have 63K downloads (possibly powering tens of thousands of websites).
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart... - Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on demand and are capable of applying many data reduction and data analysis operations to the large ESGF data archives, transferring only the resultant analysis (e.g., visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
A Comprehensive Look at Generative AI in Retail App Testing.pdfkalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden,India, Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxrickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
Navigating the Metaverse: A Journey into Virtual Evolution"Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms."
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
2. GAPS IN AGILE, DEVOPS APPROACHES
WHY ADDITIONAL OR SUPPLEMENTARY APPROACHES ARE NEEDED
*EDITORIAL
M UNDERWOOD @KNOWLENGR | V1.2 | KNOWLENGR.COM | VIEWS MY OWN 2
3. HOW OPS GETS OVERLOOKED
• No obvious “product” release cycle
• Keeping complex systems running is not primarily a software
problem
• Ops troubleshooting may not follow any SDLC model
• Some Ops entail managing systems in which no code is readily available
4. PHILOSOPHICAL NOTES
• Technical approaches to privacy are inextricably tied to security
• Similarly, reliability engineering is also tied to security
• -- and not just “Availability”
• Quality engineering comfortably straddles both Dev and Ops
• Most quality engineering in practice is pure Ops
• Software engineering has immature notions of quality
• Supporting legacy systems may be more Ops than Dev
5. USE CASES
• Call center operations
• Field service
• Sales, sales support
• Most of health care (17.8% of US GDP spending)
• Rework and repair (all sectors)
• Financial services
• Government operations (e.g., voting systems, regulation, transportation management)
• Utilities
• Even the less obvious: decision support
6. SOFTWARE SUPPORTS OPS, BUT . . .
• Complex systems lack human-machine controls
• Humans are almost always “man in the middle” by design
• Ops were not designed to be automated
• Software only lightly mitigates labor increases when service
load increases
• Ops must encompass non-automated tasks
7. SITE RELIABILITY ENGINEERING
Edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (O’Reilly). Copyright 2016 Google, Inc., 978-1-491-92912-4.
8. SITE RELIABILITY WORKBOOK
Edited by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, and Stephen Thorne
O’Reilly Media
Source
9. CREDIT GOOGLE
GOOGLE DEVELOPED SRE AND PUBLISHES A FREE ONLINE TEXT.
BEN TREYNOR SLOSS ORIGINATED THE TERM.
10. GOOGLE’S DEFINITION
“SRE IS WHAT YOU GET WHEN YOU TREAT OPERATIONS AS IF IT’S A SOFTWARE
PROBLEM. OUR MISSION IS TO PROTECT, PROVIDE FOR, AND PROGRESS THE SOFTWARE
AND SYSTEMS BEHIND ALL OF GOOGLE’S PUBLIC SERVICES — GOOGLE SEARCH, ADS,
GMAIL, ANDROID, YOUTUBE, AND APP ENGINE, TO NAME JUST A FEW — WITH AN EVER-
WATCHFUL EYE ON THEIR AVAILABILITY, LATENCY, PERFORMANCE, AND CAPACITY.”
SOURCE
11. WHAT IS IT?
• Quasi open standardized process (vs. “standard”)
• Scalable, proven (albeit inside deep pocket enterprises)
• Begun in 2003, it predated DevOps
• Left-shift Sysadmin functions
• But with healthy skills in layers 1-3 in UNIX network stack
12. IS IT DEVOPS?
• “. . . We are distinct from the industry term DevOps, because
although we definitely regard infrastructure as code, we
have reliability as our main focus. Additionally, we are strongly
oriented toward removing the necessity for operations—
see The Evolution of Automation at Google for more details.”
13. IS IT DEVOPS? (PER GOOGLE)
“One could view DevOps as a generalization of several core SRE
principles to a wider range of organizations, management
structures, and personnel. One could equivalently view SRE as a
specific implementation of DevOps with some idiosyncratic
extensions.” (Chapter 1)
15. HOW SRE LEFT-SHIFTS OPS
• No more than 50% duty in Ops
• Remaining 50% is “coding skills on project work”
• Heavy reliance on “blame-free postmortem culture”
• Ed: Quality principle
• Ed: Implies analytics, evidence-, data-driven processes
16. SRE EVENT ANALYTICS
• Max of two events per 8- or 12-hour on-call shift
• No equivalent to these events in software engineering
• Tied to monitoring (alerts, tickets, logging)
• Emergency response is a useful event + event metrics
• MTTF and MTTR – MTTR is key
• Playbook* building as synthetic event / scenario construction
• “We have found that thinking through and recording the best practices ahead of time in a ‘playbook’ produces roughly a 3x improvement in MTTR as compared to the strategy of ‘winging it.’”
• “Wheel of Misfortune” (software engineering equivalent: Adversarial testing?)
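Ed. sketch: why MTTR is the key lever can be illustrated with the standard steady-state relation availability = MTTF / (MTTF + MTTR). The relation and the figures below are not taken from the SRE book; the MTTF and MTTR values are hypothetical and chosen only to show what the roughly 3x MTTR improvement quoted above would mean for availability.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: one failure every 30 days, 3 hours to repair.
mttf = 30 * 24.0
mttr_winging_it = 3.0
mttr_with_playbook = mttr_winging_it / 3  # the ~3x improvement from the quote

before = availability(mttf, mttr_winging_it)
after = availability(mttf, mttr_with_playbook)

print(f"without playbook: {before:.4%}")
print(f"with playbook:    {after:.4%}")
```

MTTF is unchanged in both cases; the gain comes entirely from repairing faster, which is the part Ops controls most directly.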
17. CHANGE MANAGEMENT IN @RL
• “SRE: 70% of outages due to changes in a live system.”
• SRE automation enables:
• Progressive rollouts (Ed: not just “promote to QA”)
• Rapid problem diagnosis
• Automated rollback (Ed: typically not an app ‘requirement’)
• Mitigate user exposure to service disruptions
• Automation reduces impact of fatigue, familiarity/contempt, challenges of
highly repetitive tasks
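Ed. sketch: a progressive rollout with automated rollback can be pictured as a loop over growing traffic fractions that stops at the first unhealthy stage. This is a minimal illustration, not Google's tooling; `progressive_rollout`, the stage fractions, the 1% error threshold, and the `fake_*` helpers are all hypothetical.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    error_rate: Callable[[], float],
    rollback: Callable[[], None],
    max_error_rate: float = 0.01,
) -> bool:
    """Deploy to growing fractions of traffic; roll back on the first bad stage.

    Returns True if every stage passed, False if a rollback was triggered.
    """
    for fraction in stages:
        deploy(fraction)
        if error_rate() > max_error_rate:
            rollback()
            return False
    return True


# Simulated run: the error rate jumps once 25% of traffic sees the new version.
log = []
current = {"fraction": 0.0}


def fake_deploy(fraction: float) -> None:
    current["fraction"] = fraction
    log.append(f"deploy {fraction:.0%}")


def fake_error_rate() -> float:
    return 0.05 if current["fraction"] >= 0.25 else 0.001


def fake_rollback() -> None:
    current["fraction"] = 0.0
    log.append("rollback")


ok = progressive_rollout(
    [0.01, 0.05, 0.25, 1.0], fake_deploy, fake_error_rate, fake_rollback
)
print(ok, log)
```

In this simulated run the rollout never reaches 100%: the problem is caught while only a quarter of users are exposed, which is the "mitigate user exposure" point in the list above.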
18. SRE TACKLES PLANNING, CAPACITY
• Dev rarely has eyes on metrics, processes for provisioning
• Provisioning is higher risk than load shifting: a class of Ops use cases
• Dev rarely accounts for ingest of demand data streams
• Dev has little insight into aperiodic spikes, trends, schedules,
dependencies
• Weather, cascading power outages
• Resource utilization entails variables Dev may be blind to
• Monitoring must utilize alerting from time series data (Few
devs get it)
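Ed. sketch: alerting from time series data usually means evaluating a condition over a window of samples, not reacting to a single reading. The example below is one simple way to do that, assuming a hypothetical utilization metric sampled once per minute; the function name, threshold, and data are illustrative only.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, metric value)


def sustained_breaches(
    series: Iterable[Sample], threshold: float, min_consecutive: int
) -> List[int]:
    """Return the timestamps at which an alert would fire.

    An alert fires once when the value has exceeded `threshold` for
    `min_consecutive` samples in a row, and cannot fire again until the
    value drops back to or below the threshold.
    """
    alerts: List[int] = []
    run = 0
    for timestamp, value in series:
        if value > threshold:
            run += 1
            if run == min_consecutive:
                alerts.append(timestamp)
        else:
            run = 0
    return alerts


# Hypothetical CPU utilization samples, one per minute.
cpu = [
    (0, 0.62), (60, 0.93), (120, 0.71), (180, 0.95),
    (240, 0.96), (300, 0.97), (360, 0.70),
]
print(sustained_breaches(cpu, threshold=0.90, min_consecutive=3))
```

The isolated spike at t=60 is ignored; only the sustained breach starting at t=180 produces an alert. That distinction between noise and trend is the kind of variable the slide suggests Dev may be blind to.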
19. SRE LEFT-SHIFTED COMPONENTS
• Abstract Machine (Apache Mesos-like)
• Distributed Storage
• OpenFlow-based SDN
• Prometheus-like Monitoring & Alerting for:
• Acute incidents
• A/B and E1/E2 comparisons
20. DEV FOR OPS @GOOGLE
• Single shared repo
• “All software is reviewed before being submitted”
• Even large builds are fast
• Same infrastructure for continuous testing
21. SOFTWARE-CENTRIC OPS
“Unlike traditional operations groups, we view software as the
primary tool through which our systems are managed,
maintained, and minded; to that end, we have the source-level
access and moral authority required to fix, extend and scale code
to keep it working, harden it against the vagaries of the Internet,
and develop our own planet-scale platforms.”
22. “FULL DEPTH OF THE STACK”
“In Google, we have the good fortune to have developed many
large systems ranging from planet-spanning databases to near
real-time scalable data warehousing to fault-tolerant datastream
joining. In SRE, we flip between the fine-grained detail of disk
driver IO scheduling to the big picture of continental-level
service capacity, across a range of systems and a user population
measured in billions. We own those products in production. We
drive reliability and performance across massive scale by
mastering the full depth of the stack.”
23. PRINCIPLES
• Embracing Risk (Ed: Listen up, FinTechs)
• Service Level Objectives
• Eliminating Toil (Ed: More than efficiency, velocity)
• Monitor (Ed: Integrated monitoring)
• Release Engineering
• Simplicity (Ed: Complexity evolved from simplicity?)
24. RISK MANAGEMENT IN SRE
“We strive to make a service reliable enough, but
no more reliable than it needs to be. That is, when we set an
availability target of 99.99%, we want to exceed it, but not by
much: that would waste opportunities to add features to the
system, clean up technical debt, or reduce its operational costs.
In a sense, we view the availability target as both a minimum and
a maximum. The key advantage of this framing is that it unlocks
explicit, thoughtful risktaking.” Source
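Ed. sketch: an availability target translates directly into a downtime allowance. A 99.99% target leaves roughly 13 minutes of downtime in a 90-day quarter. The helper below is a minimal illustration; the function name and the 90-day quarter are assumptions, not taken from the source.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, period: timedelta) -> timedelta:
    """Downtime permitted in `period` while still meeting the target."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return period * (1.0 - availability_target)


quarter = timedelta(days=90)  # assumption: a 90-day quarter
for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {allowed_downtime(target, quarter)}")
```

Each extra nine cuts the allowance by a factor of ten, which is why exceeding the target "by much" is treated as a cost and not a bonus.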
25. SRE RISK PROCESS INSIGHTS
• Risk tolerance of consumer services
• Differential impact of failure types on product/service offering
• Google Apps for Business vs. Consumer
• Cost vs. availability (“an extra nine of availability means . . .”)
• Google + Google Partner latency objectives
26. SRE “ERROR BUDGET”
“In order to base these decisions [product velocity vs. reliability] on
objective data, the two teams jointly define a quarterly error budget
based on the service’s service level objective, or SLO (see Service Level
Objectives). The error budget provides a clear, objective metric that
determines how unreliable the service is allowed to be within a single
quarter. This metric removes the politics from negotiations between
the SREs and the product developers when deciding how much risk to
allow.”
“The main benefit of an error budget is that it provides a common
incentive that allows both product development and SRE to focus on
finding the right balance between innovation and reliability.”
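Ed. sketch: for a request-based SLO, the error budget is simply the number of failed requests the SLO tolerates over the quarter, and the "objective metric" is how much of that allowance has been spent. The class below is a minimal illustration; the field names, the sample figures, and the rule that releases pause once the budget is spent are assumptions made for the example, not a description of Google's tooling.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float            # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int   # requests served so far this quarter
    failed_requests: int  # requests that counted against the SLO

    @property
    def allowed_failures(self) -> float:
        """Number of failed requests the SLO tolerates for this traffic volume."""
        return self.total_requests * (1.0 - self.slo)

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative means overspent."""
        if self.allowed_failures <= 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures

    def can_release(self) -> bool:
        """Assumed policy: new releases proceed only while budget remains."""
        return self.remaining_fraction > 0.0


budget = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=400)
print(f"allowed failures: {budget.allowed_failures:.0f}")
print(f"budget remaining: {budget.remaining_fraction:.0%}")
print("releases allowed:", budget.can_release())
```

Because both teams read the same number, the release decision becomes a lookup and not a negotiation.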
27. KEY INSIGHT
Ed: Ops has a perspective on product performance that Dev will
rarely have. SRE leverages this by integrating processes to
monitor and manage the product while making improvements.
28. SERVICE ABSTRACTIONS
• SLA: Set by product owners, not SRE
• SLI: Service Level Indicator (Ed: Domain-specific dependent measure)
• SLO: Service Level Objective (Ed: Complex target range of values; sets expectations)
• Agreements (usually, what happens when SLO not met)
29. OPS-DRIVEN TARGET GOALS
“Choosing targets (SLOs) is not a purely technical activity
because of the product and business implications, which should
be reflected in both the SLIs and SLOs (and maybe SLAs) that are
selected. Similarly, it may be necessary to trade off certain
product attributes against others within the constraints posed by
staffing, time to market, hardware availability, and funding.”
• SRE Ops-driven concepts: safety margin, throttling, systems
engineering (mod configs, OS tuning, load balancing, physical
updates)
30. SRE KEY MONITORING INSIGHT
“Monitoring a complex application is a significant engineering
endeavor in and of itself.”
Ed: Software engineering is 7-20 years away from fully integrating monitoring concepts into IDEs
31. ALERTING INSIGHTS
• Human alerts must be simple and fast
• Monitoring should identify what’s broken and why (Ed: Domain
dependent!)
• Focus should be on better post hoc analysis (Ed: Forensics; big data)
• “Google SRE has experienced only limited success with complex
dependency hierarchies”
• “Different aspects of a system should be measured with different
levels of granularity.”
• “In Google’s experience, basic collection and aggregation of metrics,
paired with alerting and dashboards, has worked well as a relatively
standalone system.”
32. TYPES OF AUTOMATION
• No automation
• Externally maintained system-specific automation
• Externally maintained generic automation
• Internally maintained system-specific automation
• Systems need no automation
• Ed: Conclude Ops is closer to automation (except domain
specific)
33. LEFT-SHIFTING OPS ISN’T ONE-AND-DONE
“Automation code, like unit test code, dies when the maintaining
team isn’t obsessive about keeping the code in sync with the
codebase it covers. The world changes around the code: the DNS
team adds new configuration options, the storage team changes
their package names, and the networking team needs to support
new devices.”
34. TYPICAL SRE RELEASE PROCESS
• A typical release process proceeds as follows:
• Rapid uses the requested integration revision number (often obtained automatically from
our continuous test system) to create a release branch.
• Rapid uses Blaze to compile all the binaries and execute the unit tests, often performing
these two steps in parallel. Compilation and testing occur in environments dedicated to
those specific tasks, as opposed to taking place in the Borg job where the Rapid workflow
is executing. This separation allows us to parallelize work easily.
• Build artifacts are then available for system testing and canary deployments. A typical
canary deployment involves starting a few jobs in our production environment after the
completion of system tests.
• The results of each step of the process are logged. A report of all changes since the last
release is created.
• Rapid allows us to manage our release branches and cherry picks; individual cherry pick
requests can be approved or rejected for inclusion in a release. Source
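Ed. sketch: the shape of the process above (branch, parallel compile and unit test, system test, canary, with every step logged) can be mimicked in a few lines. This is not Rapid, Blaze, or Borg, whose interfaces are not described in the source; `run_release`, the step names, and the revision string are hypothetical stand-ins.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("release")


def run_release(revision: str, steps: Dict[str, Callable[[str], bool]]) -> List[str]:
    """Run the release steps for `revision`; return the names of completed steps.

    `steps` must provide: branch, compile, unit_test, system_test, canary.
    Compilation and unit tests run in parallel; the rest run in sequence.
    The run stops after the first failing step.
    """
    completed: List[str] = []

    def record(name: str, ok: bool) -> bool:
        # Log the result of every step, as the process above requires.
        log.info("%s %s: %s", revision, name, "ok" if ok else "FAILED")
        if ok:
            completed.append(name)
        return ok

    if not record("branch", steps["branch"](revision)):
        return completed

    with ThreadPoolExecutor(max_workers=2) as pool:
        compile_job = pool.submit(steps["compile"], revision)
        test_job = pool.submit(steps["unit_test"], revision)
        compiled = record("compile", compile_job.result())
        tested = record("unit_test", test_job.result())
    if not (compiled and tested):
        return completed

    for name in ("system_test", "canary"):
        if not record(name, steps[name](revision)):
            break
    return completed


# Stand-in steps: everything succeeds except the canary.
fake_steps = {
    name: (lambda rev: True)
    for name in ("branch", "compile", "unit_test", "system_test")
}
fake_steps["canary"] = lambda rev: False
print(run_release("r1234", fake_steps))
```

A failed canary stops the release after system tests have passed, so the build artifacts exist but never reach full production.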
36.
1. Complex IT operations are challenging to left-shift at scale
2. Python (+ Go etc.) have facilitated left-shift
3. SDN (5-6G) is a game-changer; Ops is in the game, like it or not
4. Monitoring and alerting are beyond current SE skills
5. SRE treats security as a feature (casual?)
6. SRE measures manual processes as part of using automation to drive reliability
7. SRE has a more formal, Ops-driven approach to trade-off compacts with product owners
8. Current DevOps SDLC practices have not formalized how to capture and manage quality, reliability
9. Except for CMMI, risk is weakly integrated into the DevOps SDLC
10. DevOps does not identify “toil,” hence may not participate in PDCA cycle from Ops
11. Dev teams may not know what can/should be automated.