Presentation I made at the Sun Network conference in 2003 on how to do capacity planning for virtualized systems, tied into the N1 product that Sun was pushing at the time. This project was structured as a Design for Six Sigma (DFSS) project.
This is the latest version of the slides based on my book "Solaris Performance and Tuning" that has been extended to include Linux and many other more recent topics. It has been presented innumerable times, most recently at the CMG conference, Usenix 08 and LISA 08, and this version will be presented at Usenix 09, San Diego on June 16th, along with the Free Tools slides.
More and more clients are looking to understand the capabilities of the OTM/G-Log architecture and configuration in order to better tune OTM. Usually, this is required because of poor OTM performance or as preparation for significant changes to OTM configuration, volume, or platform. The client may be experiencing poor performance throughout the entire system or for very specific use cases. The primary objective of a Performance Tuning Exercise is to understand how OTM is being utilized and to recommend solutions to improve the performance of OTM.
We recommend and will take the audience through a “ground-up” performance tuning exercise, starting with hardware and infrastructure, moving to Java and App server tuning, then to OTM technical tuning and finally to the OTM functional tuning (data, agents, etc).
These audits may identify hardware constraints at each tier, networking, or other infrastructure constraints causing sub-optimal system performance. Simply stated, the performance audit will identify all bottlenecks in the system if they exist.
In many cases the largest performance impacts are not hardware, but rather how the data is configured within the application. So as part of the exercise we will analyze database performance, individual SQL queries, OTM queues, bulk planning parameters, agents, rates and the settlement process.
Understanding the methods which will best identify these bottlenecks will help you avoid performance issues early in your project and save considerable time and expense as you near go-live. This presentation will guide you through the steps necessary to better understand what is impacting performance and how to best handle it. It will provide lessons learned and tools that are available to help you better manage and maintain a healthy OTM environment.
Presented by Chris Plough at MavenWire
VMworld 2013: Big Data: Virtualized SAP HANA Performance, Scalability and Bes... (VMworld)
VMworld 2013
Bob Goldsand, VMware
Todd Muirhead, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
VMworld 2013
Peter Boone, VMware
Seongbeom Kim, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Mtc learnings from isv & enterprise interaction (Govind Kanshi)
This is one of those dated presentations that I keep getting requests for; please do reach out to me for the status of various things, as Azure keeps fixing/innovating a whole host of things every day.
There are a bunch of other things I can help you with to ensure you can take advantage of the Azure platform for OSS, .NET frameworks and databases.
Mtc learnings from isv & enterprise (dated - Dec 2014) (Govind Kanshi)
This is a slightly dated deck of our learnings; I keep getting multiple requests for it. I have removed one slide for access permissions (RBAC, which is now available).
As interest in cloud solutions and their use with enterprise applications has increased, MavenWire has taken the lead in implementing and benchmarking several instances of OTM using the Amazon Web Services (AWS) Elastic Compute Cloud (EC2). This presentation outlines how the instances were set up and configured; potential benefits of OTM in the cloud; cost and performance comparisons between the cloud and "traditional" server configurations; and areas of concern and issues to be aware of when implementing OTM in the cloud. In addition, we will also outline what we believe the future direction of cloud OTM will be, as well as where we believe it is best suited to customer needs.
Capacity Management for System z License Charge Reporting (Metron)
Capacity Management for System z is not a silo activity within an enterprise.
Every Capacity Management decision is a business decision, and with that in mind, there are either positive or negative cost implications with each decision made.
Capacity Management reports are visible to all the stakeholders within an organization from the C-Level down to the Lines of Business and the Analysts.
In order to be successful, one needs to take an enterprise view of all aspects of Capacity Management; with all the cost implications involved in the various licensing models, there needs to be an understanding of the current (and future) ways in which licenses can be allocated.
This presentation discusses:
•Capacity Management from an Enterprise level
•MLC & WLC (Monthly and Workload License Charges)
•Explanation of Country Multiplex Pricing (CMP) and how it may affect your enterprise
•Reporting necessary to understand the license charges
•Forecasting for future changes
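As background for the license-charge reporting bullets above (and not taken from the deck itself): IBM's Monthly License Charges are commonly driven by the peak rolling four-hour average (R4HA) of MSU consumption, which any license-charge report needs to surface. A toy sketch of that calculation:

```python
# Toy illustration: find the peak rolling 4-hour average (R4HA) of MSU
# consumption from hourly samples, the figure that typically drives
# IBM's sub-capacity Monthly License Charges.
def peak_r4ha(msu_samples, window=4):
    """Peak rolling average over `window` consecutive hourly samples."""
    if len(msu_samples) < window:
        raise ValueError("need at least one full window of samples")
    return max(
        sum(msu_samples[i:i + window]) / window
        for i in range(len(msu_samples) - window + 1)
    )

hourly_msu = [120, 180, 260, 300, 280, 150, 90, 80]
print(round(peak_r4ha(hourly_msu), 1))  # -> 255.0
```

Real reporting would of course work from SMF data at finer granularity, but the shape of the calculation is the same.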
SLA-aware Dynamic CPU Scaling in Business Cloud Computing Environments (Zhenyun Zhuang)
IEEE CLOUD 2015
Modern cloud computing platforms (e.g. Linux on Intel CPUs) feature an ACPI-based (Advanced Configuration and Power Interface) mechanism that dynamically scales CPU frequencies/voltages based on workload intensity. With this feature, CPU frequency is reduced when the workload is relatively light in order to save energy, and increased when the workload intensity is relatively high.
In business cloud computing environments, software products/services often need to “scale out” to multiple machines to form a cluster that achieves a pre-defined aggregated performance goal (e.g., an SLA-derived throughput). To reduce business operation cost, minimizing the provisioned cluster size is critical. However, as we show in this work, the way ACPI works in today's modern OS may result in more machines being provisioned, and hence higher business operation cost.
To deal with this problem, we propose an SLA-aware CPU scaling algorithm based on the business SLA (Service Level Agreement). The proposed design rationale and algorithm are a fundamental rethinking of how ACPI mechanisms should be implemented in business cloud computing environments. Contrary to current forms of ACPI, which adapt CPU power levels based only on workload intensity, the proposed SLA-aware algorithm is primarily based on current application performance relative to the pre-defined SLA. Specifically, the algorithm targets achieving the pre-defined SLA as the top-level goal, while saving energy as the second-level goal.
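The abstract does not give the algorithm itself; a minimal sketch of the idea it describes (choose the CPU frequency from measured application throughput relative to the SLA, rather than from raw utilization) might look like this, where all names, frequency levels and thresholds are hypothetical:

```python
# Hypothetical sketch of an SLA-aware frequency governor: raise frequency
# when the SLA is at risk (top-level goal), lower it only when the SLA is
# comfortably met (second-level goal: save energy).
FREQ_LEVELS_MHZ = [1200, 1800, 2400, 3000]  # assumed available P-states

def next_freq(current_mhz, throughput, sla_target, headroom=1.10):
    """Pick the next CPU frequency level from SLA attainment."""
    i = FREQ_LEVELS_MHZ.index(current_mhz)
    if throughput < sla_target and i < len(FREQ_LEVELS_MHZ) - 1:
        return FREQ_LEVELS_MHZ[i + 1]   # below SLA: speed up
    if throughput > sla_target * headroom and i > 0:
        return FREQ_LEVELS_MHZ[i - 1]   # ample margin: slow down
    return current_mhz                  # within band: hold

print(next_freq(1800, throughput=900, sla_target=1000))   # -> 2400
print(next_freq(3000, throughput=1300, sla_target=1000))  # -> 2400
```

A utilization-only governor would keep clocking down under light load even while the cluster misses its aggregate throughput target; keying the decision to the SLA avoids over-provisioning extra machines to compensate.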
Tips on implementing SAP adaptive computing design with SAP LaMa on Microsoft Azure. We discuss the best options for SAP and some of the challenges faced.
ANSYS SCADE Usage for Unmanned Aircraft Vehicles (Ansys)
SCADE on-board the UAS P.1HH HammerHead
The use of SCADE to develop the P.1HH Vehicle Control & Management System (an Integrated Modular Avionics system) greatly reduced development time and effort.
Learn more about ANSYS SCADE Solutions for Aerospace & Defense http://bit.ly/1EdcsOJ
Josh Berkus
You've heard that PostgreSQL is the highest-performance transactional open source database, but you're not seeing it on YOUR server. In fact, your PostgreSQL application is kind of poky. What should you do? While doing advanced performance engineering for really high-end systems takes years to learn, you can learn the basics to solve performance issues for 80% of PostgreSQL installations in less than an hour. In this session, you will learn: -- The parts of database application performance -- The performance setup procedure -- Basic troubleshooting tools -- The 13 postgresql.conf settings you need to know -- Where to look for more information.
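The talk's exact 13 settings aren't listed here; as an illustration of the kind of memory-sizing rules of thumb such sessions cover (the ratios below are common community guidance, not necessarily Berkus's figures), a sizing helper might look like:

```python
# Hypothetical postgresql.conf memory-sizing helper using widely cited
# rules of thumb: shared_buffers ~25% of RAM, effective_cache_size ~75%,
# work_mem divided across expected concurrent sorts.
def suggest_settings(ram_mb, max_connections=100):
    """Return rough starting values for key postgresql.conf settings."""
    return {
        "shared_buffers": f"{ram_mb // 4}MB",
        "effective_cache_size": f"{ram_mb * 3 // 4}MB",
        "work_mem": f"{max(1, ram_mb // (max_connections * 4))}MB",
        "maintenance_work_mem": f"{min(2048, ram_mb // 16)}MB",
    }

print(suggest_settings(16384))  # a 16 GB server
```

These are starting points only; the right values depend on the workload, which is exactly the kind of judgment the session teaches.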
One of the most difficult challenges facing security, VoIP, and network management solutions is how to implement network tools on enterprise networks. Enterprise networks are becoming more complex, which makes monitoring and capturing data harder. It can be difficult or impossible to gain access to network SPAN ports or to insert in-line devices, like Intrusion Prevention Systems, into enterprise networks. Contention for network access is a major problem. Learn how to design a network access solution that meets the requirements for security, network monitoring, and overall network access. We help answer the questions: How do I get secure access to the network for capturing or monitoring data traffic? Why TAP your network?
New Frameworks for Measuring Capacity and Assessing Performance (TCC Group)
If we start with the assumption that — in order to improve our social sector as a whole — those who do the work to strengthen our communities (the nonprofits) are equally as critical as those responsible for providing the resources for the work to get done (the foundations), then why wouldn’t we expect all social sector actors to build their capacity? How do we know when our grantees and our foundations are becoming more effective and impactful as a result of our capacity investments, organizational development efforts and technical assistance? What does a high performing organization or foundation look like? And can we measure that?
This presentation, provided during the Grantmakers for Effective Organizations 2016 National Conference in Minneapolis, reviews and demonstrates existing resources for assessing nonprofit and foundation capacity and effectiveness. Speakers introduced the pros and cons of a variety of rubrics in use in the field and offered guidance on how funders decide on the right fit for the desired purpose. Grantmaker peers also shared how they used different frameworks and tools to assess individual nonprofits and grantee cohorts. Session participants left with increased awareness of the importance of the facilitator’s role in interpreting data gleaned from assessments and of the data collection methods most appropriate for their organization.
SIP Trunking & Security in an Enterprise Network (Dan York)
How secure are your VoIP systems as you deploy SIP-based systems in an enterprise environment? In this slide deck presented by VOIPSA Best Practices Chair Dan York at the Ingate SIP Trunking Seminars at ITEXPO September 17, 2008, Dan York walks through the security issues related to VoIP (with a focus on SIP trunking), the tools out there to attack/test VoIP systems, best practices and resources. (An audio recording of this session was made and will be available.)
Secure Network Design with High-Availability & VoIP (Arpan Patel)
Networking, the communication between two or more networks, encompasses every aspect of connecting computers together. With the evolution of networking and the Internet, the threats to information and networks have risen dramatically and performance has degraded enormously.
As a company grows its business, its network design needs to be updated and expanded from the existing network to accommodate additional users or workloads. The difficulty arises as networks are pressured to cost less, yet support emerging applications and a higher number of users with increased performance. As personal, government and business-critical applications become more prevalent on the Internet, it is imperative that all networks be protected from threats and vulnerabilities in order for a business to achieve its fullest potential. Hence a secure design for a network is critical in today's expanding corporate world.
At this year’s annual Design Automation Conference (DAC 2020), Rob Lalonde and Bill Bryce of Univa partnered with representatives from Google and Synopsys to discuss EDA in the Cloud and share best practices related to cloud migration.
Applying Cloud Techniques to Address Complexity in HPC System Integrations (inside-BigData.com)
In this video from the HPC User Forum at Argonne, Arno Kolster from Providentia Worldwide presents: Applying Cloud Techniques to Address Complexity in HPC System Integrations.
"The Oak Ridge Leadership Computing Facility (OLCF) and technology consulting company Providentia Worldwide recently collaborated to develop an intelligence system that combines real-time updates from the IBM AC922 Summit supercomputer with local weather and operational data from its adjacent cooling plant, with the goal of optimizing Summit’s energy efficiency. The OLCF proposed the idea and provided facility data, and Providentia developed a scalable platform to integrate and analyze the data."
Watch the video: https://wp.me/p3RLHQ-kOg
Learn more: http://www.providentiaworldwide.com/
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
"Architecture assessment from classics to details", Dmytro Ovcharenko (Fwdays)
We will talk about architecture assessment and the SEI ATAM methodology in detail. We also review the Quality Attribute Workshop at a high level and find the differences between quantitative and qualitative analysis. The assessment process can be represented as a set of activities roughly split into assessment preparation, collection of the important data and stakeholders' inputs, architecture analysis, and presentation of findings and recommendations. Finally, we will review the assessment document and some examples.
SREcon 2016 Performance Checklists for SREs (Brendan Gregg)
Talk from SREcon2016 by Brendan Gregg. Video: https://www.usenix.org/conference/srecon16/program/presentation/gregg . "There's limited time for performance analysis in the emergency room. When there is a performance-related site outage, the SRE team must analyze and solve complex performance issues as quickly as possible, and under pressure. Many performance tools and techniques are designed for a different environment: an engineer analyzing their system over the course of hours or days, and given time to try dozens of tools: profilers, tracers, monitoring tools, benchmarks, as well as different tunings and configurations. But when Netflix is down, minutes matter, and there's little time for such traditional systems analysis. As with aviation emergencies, short checklists and quick procedures can be applied by the on-call SRE staff to help solve performance issues as quickly as possible.
In this talk, I'll cover a checklist for Linux performance analysis in 60 seconds, as well as other methodology-derived checklists and procedures for cloud computing, with examples of performance issues for context. Whether you are solving crises in the SRE war room, or just have limited time for performance engineering, these checklists and approaches should help you find some quick performance wins. Safe flying."
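The "Linux performance analysis in 60 seconds" checklist from this talk is well known (uptime, dmesg, vmstat, mpstat, pidstat, iostat, free, sar, top). A small sketch that wraps those commands with a per-command timeout (the command list is as published with the talk; the wrapper code itself is illustrative, not from the slides) could be:

```python
import subprocess

# The ten commands from the 60-second Linux checklist popularized by
# this talk; each is meant to be run briefly, in order.
CHECKLIST = [
    "uptime",
    "dmesg | tail",
    "vmstat 1 5",
    "mpstat -P ALL 1 5",
    "pidstat 1 5",
    "iostat -xz 1 5",
    "free -m",
    "sar -n DEV 1 5",
    "sar -n TCP,ETCP 1 5",
    "top -b -n 1",
]

def run_check(cmd, timeout=10.0):
    """Run one checklist command; return its output, or a note on failure."""
    try:
        done = subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=timeout)
        return done.stdout or done.stderr
    except subprocess.TimeoutExpired:
        return f"[{cmd} timed out after {timeout}s]"

# Demo with just the first, cheapest check; loop over CHECKLIST for the
# full 60-second pass.
print(run_check("uptime"))
```

The timeout matters in an incident: a hung command must not consume the whole 60-second budget.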
PowerArtist™ includes production-proven RTL power analysis with interactive visual debug, analysis-driven automatic RTL power reduction, and a Tcl interface to the database enabling custom reports and tracking of power through regressions. PowerArtist generated models bridge the RTL and layout gap delivering physical-aware RTL power accuracy and RTL-power driven early power grid integrity. This presentation provides an overview of PowerArtist and covers RTL design-for-power best practices using real-life examples. Learn more on our website: https://bit.ly/10Rpcxu
Large-Scale Optimization Strategies for Typical HPC Workloads (inside-BigData.com)
In this deck from PASC 2019, Liu Yu from Inspur presents: Large-Scale Optimization Strategies for Typical HPC Workloads.
"Ensuring performance of applications running on large-scale clusters is one of the primary focuses in HPC research. In this talk, we will show our strategies for performance analysis and optimization of applications in different fields of research using large-scale HPC clusters. Our strategies are designed to comprehensively analyze the runtime features of applications, the parallel mode of the physical model, algorithm implementation and other technical details. These three levels of strategy cover platform optimization, technological innovation, and model innovation, with targeted optimization based on these features. State-of-the-art CPU instructions, network communication and other modules, and innovative parallel modes of some applications have been optimized. After optimization, it is expected that these applications will outperform their non-optimized counterparts with an obvious increase in performance."
Watch the video: https://wp.me/p3RLHQ-kwB
Learn more: http://en.inspur.com/en/2403285/2403287/2403295/index.html
and
https://pasc19.pasc-conference.org/program/keynote-presentations/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
[EWiLi2016] Towards a performance-aware power capping orchestrator for the Xe... (Matteo Ferroni)
In the last few years, multi-core processors have entered the domain of embedded systems: this, together with virtualization techniques, allows multiple applications to easily run on the same System-on-Chip (SoC). As power consumption remains one of the costs with the greatest impact on any digital system, several approaches have been explored in the literature to cope with power caps while trying to maximize the performance of the hosted applications. In this paper, we present some preliminary results and opportunities towards a performance-aware power capping orchestrator for the Xen hypervisor. The proposed solution, called XeMPUPiL, uses the Intel Running Average Power Limit (RAPL) hardware interface to set a strict limit on the processor's power consumption, while a software-level Observe-Decide-Act (ODA) loop performs an exploration of the available resource allocations to find the most power-efficient one for the running workload. We show how XeMPUPiL is able to achieve higher performance under different power caps for almost all the classes of benchmarks analyzed (e.g., CPU-, memory- and IO-bound).
Full paper: http://ceur-ws.org/Vol-1697/EWiLi16_17.pdf
I presented "Cloudsim & Green Cloud" at the First National Workshop on Cloud Computing at Amirkabir University on 31st October and 1st November, 2012.
Enjoy it!
Adaptive Computing Using PlateSpin Orchestrate (Novell)
Adaptive computing goes beyond just intelligently utilizing available resources; it encompasses quality of service (QoS) targets, fault tolerance (high availability), monitoring, and iterative analysis of the resulting dataset to determine what corrective measures (adaptations) should occur at any given moment. As virtualization becomes widespread in the data center, the need for automating the placement and configuration of workloads (virtual machines) using an adaptive computing model becomes vitally important. This session demonstrates how to use events, introduced in PlateSpin Orchestrate 2.0.2, to create rules that trigger workload provisioning, migration, and other virtual machine lifecycle operations. It will also offer a preview of new functionality included in the upcoming 2.1 release of the product.
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments (LEGATO project)
Presentation by Jing Chen and Pirah Noor Soomro (Chalmers University of Technology) at the 16th International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems (SRMPDS 2020) on 17 August 2020.
SRMPDS was a virtual event and collocated with ICPP’20 - 2020 International Conference on Parallel Processing.
Cloud computing system models for distributed and cloud computing (hrmalik20)
Topics covered:
•System Models for Distributed and Cloud Computing
•Peer-to-peer (P2P) Networks
•Computational and Data Grids
•Clouds, and the advantages of clouds over traditional distributed systems
•Performance metrics and scalability analysis; system efficiency
•Performance challenges in cloud computing; why cloud computing, what it is, and why it is distinctive
•Cloud service delivery models and their performance challenges
•Cloud computing security: what it means, and the cloud security landscape
•Energy efficiency of cloud computing: how energy-efficient is cloud computing?
Similar to Capacity Planning for Virtualized Datacenters - Sun Network 2003
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ... (Adrian Cockcroft)
The Flowcon keynote was a few days before CMG; a few tweaks and some extra content were added at the start and end. Opening keynote talk for both conferences on how Speed Wins and how Netflix is doing Continuous Delivery.
For the Computer Measurement Group workshop in San Diego November 2013. Also presented to a student class at UC Santa Barbara. What is Cloud Native. Capacity and Performance benchmarks. Cost Optimization Techniques - content co-developed with Jinesh Varia of AWS.
A collection of information taken from previous presentations that was used as drill down for supporting discussion of specific topics during the tutorial.
Same basic flow as the keynote, but with a lot more detail, and we had a lot more interactive discussion rather than a presentation format. See part 2 for some more specific detail and links to other presentations.
Introduction to the Netflix Open Source Software project, explains why Netflix is doing this, how all the parts fit together and what is planned to come next. Presented at the inaugural NetflixOSS Meetup February 6th 2013 at Netflix headquarters in Los Gatos.
AWS Re:Invent - High Availability Architecture at Netflix (Adrian Cockcroft)
Slides from my talk at AWS Re:Invent November 2012. Describes the architecture, how to make highly available application code and data stores, a taxonomy of failure modes, and actual failures and effects. Ends with a summary of @NetflixOSS projects so others can easily leverage this architecture.
Architecture talk aimed at a well informed developer audience (i.e. QConSF Real Use Cases for NoSQL track), focused mainly on availability. Skips the Netflix cloud migration stuff that is in other talks.
SV Forum Platform Architecture SIG - Netflix Open Source Platform (Adrian Cockcroft)
Architecture overview of the Netflix cloud architecture, with a focus on the open source components that Netflix has published and is planning to release on http://netflix.github.com
Summary of past Cassandra benchmarks performed by Netflix and description of how Netflix uses Cassandra interspersed with a live demo automated using Jenkins and Jmeter that created two 12 node Cassandra clusters from scratch on AWS, one with regular disks and one with SSDs. Both clusters were scaled up to 24 nodes each during the demo.
The latest version of the Netflix Cloud Architecture story was given at Gluecon, May 23rd 2012. Gluecon rocks, and lots of Van Halen references were added for the occasion. The tradeoff between a developer-driven, high-functionality, AWS-based PaaS and an operations-driven, low-cost, portable PaaS is discussed. The three sections cover the developer view, the operator view and the builder view.
Cloud Architecture Tutorial - Why and What (1 of 3) (Adrian Cockcroft)
Introduction to the Netflix Cloud Architecture Tutorial - discusses the why and what of cloud including the thinking behind Netflix choice of AWS, and the product features that Netflix runs in the cloud.
This is the meat of the presentation, it describes in detail how do use anti-architecture to define what gets done, then discusses patterns, type systems, PaaS frameworks, services and components. There is a detailed explanation of Cassandra as a data store and open source components.
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Welocme to ViralQR, your best QR code generator.ViralQR
Welcome to ViralQR, your best QR code generator available on the market!
At ViralQR, we design static and dynamic QR codes. Our mission is to make business operations easier and customer engagement more powerful through the use of QR technology. Be it a small-scale business or a huge enterprise, our easy-to-use platform provides multiple choices that can be tailored according to your company's branding and marketing strategies.
Our Vision
We are here to make the process of creating QR codes easy and smooth, thus enhancing customer interaction and making business more fluid. We very strongly believe in the ability of QR codes to change the world for businesses in their interaction with customers and are set on making that technology accessible and usable far and wide.
Our Achievements
Ever since its inception, we have successfully served many clients by offering QR codes in their marketing, service delivery, and collection of feedback across various industries. Our platform has been recognized for its ease of use and amazing features, which helped a business to make QR codes.
Our Services
At ViralQR, here is a comprehensive suite of services that caters to your very needs:
Static QR Codes: Create free static QR codes. These QR codes are able to store significant information such as URLs, vCards, plain text, emails and SMS, Wi-Fi credentials, and Bitcoin addresses.
Dynamic QR codes: These also have all the advanced features but are subscription-based. They can directly link to PDF files, images, micro-landing pages, social accounts, review forms, business pages, and applications. In addition, they can be branded with CTAs, frames, patterns, colors, and logos to enhance your branding.
Pricing and Packages
Additionally, there is a 14-day free offer to ViralQR, which is an exceptional opportunity for new users to take a feel of this platform. One can easily subscribe from there and experience the full dynamic of using QR codes. The subscription plans are not only meant for business; they are priced very flexibly so that literally every business could afford to benefit from our service.
Why choose us?
ViralQR will provide services for marketing, advertising, catering, retail, and the like. The QR codes can be posted on fliers, packaging, merchandise, and banners, as well as to substitute for cash and cards in a restaurant or coffee shop. With QR codes integrated into your business, improve customer engagement and streamline operations.
Comprehensive Analytics
Subscribers of ViralQR receive detailed analytics and tracking tools in light of having a view of the core values of QR code performance. Our analytics dashboard shows aggregate views and unique views, as well as detailed information about each impression, including time, device, browser, and estimated location by city and country.
So, thank you for choosing ViralQR; we have an offer of nothing but the best in terms of QR code services to meet business diversity!
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
PHP Frameworks: I want to break free (IPC Berlin 2024)
Capacity Planning for Virtualized Datacenters - Sun Network 2003
1. Capacity Planning for N1
Sun Network 2003 Presentation
SunSigma DFSS Project P925
Adrian.Cockcroft@sun.com
Chief Architect - High Performance Technical Computing
August 29, 2003
2. Project: Capacity Planning for N1
ID: P925
What is N1?
- Datacenter Automation
  - Manage “N” systems as if they were “1” system
  - Solve the Total Cost of Ownership (TCO) problems
  - Manage all the “fabrics” as one - Network/VLAN, SAN/Zone, power, consoles, cluster
- Heterogeneous Support
  - Solaris, Linux, AIX, HP-UX, Windows, EMC etc…
- Layered Provisioning
  - Platform/OS, Application, Service
- Roadmap Includes Acquisitions
  - 2001 Sun internal N1 architectural definition
  - 2002 Terraspring platform level virtualization
  - 2003 CenterRun Application level provisioning
  - ……….
3. Project: Capacity Planning for N1
ID: P925
Voice of the Customer
- “We want better performance at a lower price”
- “We want higher utilization”
- “We don’t want application performance to degrade at times of peak load”
- “We want more and faster application changes”
- “How do we do capacity planning with N1?”
Scope…
4. DEFINE Project: Capacity Planning for N1
ID: P925
Capacity Planning for N1
- Define: Project goals, scope and plan, VOC, stakeholders
- Measure: Definition of Capacity Planning measurements
- Analyze: Gaps, N1CP Processes Concept Design, Survey
- Design: Prototype Use Cases
- Verify: Stakeholder communication and transition plan
- Monitor: N1 Capacity Planning implementation tracked as subgroup of N1 Strategic Working Group
5. MEASURE Project: Capacity Planning for N1
ID: P925
Translate VOC to Measurements
“We want better performance at a lower price”
- Fast, well tuned and efficient systems
- Lower Total Cost of Ownership
- Flexibility - choice of systems by price, performance, reliability, scalability, compatibility and feature set
“We want higher utilization”
- Consistently high utilization of expensive resources
“We don’t want application performance to degrade at times of peak load”
- Consistent and fast application or service response times
- Headroom needed to handle peak loads
“We want more and faster application changes”
- Flexible scenario planning, rapid provisioning
Question: “My company already has capacity planning processes and tools” - do you agree or disagree with this statement?
6. MEASURE Project: Capacity Planning for N1
ID: P925
N1 as a Constraint and Opportunity
- Centralized control and monitoring
- Highly replicated hardware configurations
- Well defined workload and capacity characterization
- Arrays of load-balanced systems, structured network
- Large SMP nodes, standardized storage layout
- Web services workloads follow an “open system” queuing model, which is simple to plan against
- Dynamic system domains and virtualized provisioning allow rapid capacity adjustments and pooled resources
- Primary capacity metrics are CPU power and storage; secondary metrics (memory, network and thermal) may be over-provisioned but should be watched
7. MEASURE Project: Capacity Planning for N1
ID: P925
Utilization Definition
- Utilization is the proportion of busy time
- Always defined over a time interval
- Sum over devices
[Chart: OnCPU scheduling for each CPU (mean load level 0.56) and usr+sys CPU % utilization vs. time for the peak period]
8. MEASURE Project: Capacity Planning for N1
ID: P925
Headroom Definition
- Headroom is available usable resources
  - Total Capacity minus Peak Utilization and Margin
  - Applies to CPU, RAM, Net, Disk and OS
  - Depends upon workload mixture
  - Can be very complex to determine
[Chart: usr+sys CPU % for the peak period, showing Margin at the top, Headroom below it, and Utilization at the bottom]
9. MEASURE Project: Capacity Planning for N1
ID: P925
CPU Capacity Measurements
- CPU utilization is defined as busy time divided by elapsed time for each CPU
- The number of CPUs is dynamic, so capacity at “100%” is not constant. Use units of “processors” to measure load.
- CPU type and speed vary, so we need something like MIPS or M-Values for mixed systems
- CPU utilization should be managed within a range that safely minimizes headroom to give stable performance at minimum cost
- Process level CPU wait time measures the time a process spent on the run queue waiting for a free CPU
  - This allows the response time increase to be observed directly, so that increased capacity can be provisioned before headroom is exhausted
10. MEASURE Project: Capacity Planning for N1
ID: P925
Response Time Definition
- Service time occurs while using a resource
- Queue time waits for access to a resource
- Response Time = Queue time + Service time
Response time curves for random arrival of work from a large unknown user population (e.g. the Internet!):
R = S / (1 - (U/m)^m)
[Chart: response time increase factor vs. mean CPU load level for one, two and four CPUs]
11. MEASURE Project: Capacity Planning for N1
ID: P925
Response Time Curves
Systems with many CPUs can run at higher utilization levels, but degrade more rapidly when they finally run out of capacity. Headroom margin should be set according to response time margin and CPU count.
R = S / (1 - U^m), where U is the total system utilization fraction
[Chart: response time increase factor vs. total system utilization % for 1 to 64 CPUs, with the headroom margin marked]
12. MEASURE Project: Capacity Planning for N1
ID: P925
CPU Scalability Differences
SMP allows work to migrate between CPUs, “blades” don’t
- A single queue of work gives lower response time for user sessions at high utilization than arrays of uniprocessor “blades”
- Headroom margin on an array of “blades” is constant as the array grows
- Two to four CPU systems need much less margin than uni-CPUs
- Measure and calibrate the actual response curve per workload
SMP: R = S / (1 - (U/m)^m) vs. Blade: R = S / (1 - U/m)
[Chart: response time increase factor vs. CPU demand level for 1 CPU/blade, 2 and 4 CPU SMP, and 2 and 4 blade arrays]
13. MEASURE Project: Capacity Planning for N1
ID: P925
CPU Measurement System Issues
Clock sampled CPU usage
- Poor clock resolution at 10ms (optionally 1 ms)
- Biased sample since the clock schedules jobs
- Underestimates more at lower utilization
- Creates apparent lack of scalability
Microstate measured CPU usage
- Measure state changes directly - “microstates”
- Per-CPU microstate based counters are not available
- Use microstates at process based workload level, sum over some or all processes as needed (can take a while on big systems)
- The microstate method simply extends to measuring services and mixed workloads
14. MEASURE Project: Capacity Planning for N1
ID: P925
N1 Capacity Planning CTQs
CTQ Name                  Pri  Units  LSL              USL              Gauge Acc.  Budget Sigma
CPU Utilization (TCO)     5    CPUs   30% of total     -                99%         3.0
CPU Responsiveness (SLA)  10   CPUs   -                70-98% of total  99%         4.0

Both of these Critical To Quality (CTQ) requirements are measured via the CPU load level, which can be measured accurately with a Gauge accuracy estimated at 99% and a sigma goal based on defect cost. Using sampled CPU, accuracy is estimated at 90%.
For CPU Utilization, a defect is unacceptable Total Cost of Ownership (TCO) and occurs if the total CPU load drops below the Lower Specification Limit (LSL) of 30% of the total configured for a sample taken during the peak load period.
For CPU Responsiveness, a defect is an overload leading to a Service Level Agreement (SLA) failure and occurs if the total CPU load goes above the Upper Specification Limit (USL), which is 70% of the total configured for uni-processors, increasing for larger CPU counts.
15. ANALYZE Project: Capacity Planning for N1
ID: P925
Concept Design - N1CP Roles
- Manager
- Application Architect
  - Developers
  - Database Administrators
- Systems Architect
  - Systems Administrators
  - Storage Administrators
  - Network Administrators
- Others?
Question: What roles do you do?
16. ANALYZE Project: Capacity Planning for N1
ID: P925
Scenarios - Top Level Functional Breakdown
[Diagram: top level functional breakdown. Install N1 Datacenter first; then, repeating infrequently, Over-Provision Applications and Right-size Applications; repeating on schedule, Re-Allocate Resources during low load times; repeating as needed, Grow or borrow Capacity just before overload occurs. Each scenario provisions at the system level and the application level.]
17. ANALYZE Project: Capacity Planning for N1
ID: P925
Installation Sizing Scenario
This scenario indicates the tasks for each role when an N1 datacenter fabric is created using currently available system level provisioning software. The set of tasks performed by each role in a scenario is called a “use case”. Future versions of N1 will configure services and policies during installation. Red arrows show the command flow between the roles.
[Swimlane table: tasks per role, flowing over time]
- Manager: I want an N1 ready datacenter
- Application Architect: Choose and size generic applications and platforms
- Database Admin: Install generic database images
- Developer: Install generic application servers
- Systems Architect: Choose systems mix
- Systems Admin: Size systems; build generic system images; measure capacity of generic systems
- Network Admin: Size overall network; setup switches and VLANs for N1
- Storage Admin: Size overall storage; setup SANs and storage for N1
18. ANALYZE Project: Capacity Planning for N1
ID: P925
Over-Provisioning Scenario
This gives an indication of the tasks performed by each role as a new application is provisioned using the capabilities of today's N1 products. The initial goal is to over-provision the capacity for the initial bring-up of the application, then later right-size it as its actual usage pattern becomes better understood. In future releases more and more of this activity will be automated, and more of the work will move to become pre-work related to setting up the overall N1 datacenter infrastructure.
[Swimlane table: tasks per role, flowing over time]
- Manager: Provide an online service
- Application Architect: Use these apps, versions and sizing
- Database Admin: Database versions and sizing; configure database policies; populate database
- Developer: App server versions and sizing; configure app server; acceptance test
- Systems Architect: Use these platforms and versions; define operations
- Systems Admin: Systems selection and sizing; build replicable system images; use the N1 GUI to over-provision the initial system; enable user access
- Network Admin: Network sizing; provision Internet connection; configure access and security
- Storage Admin: Storage sizing; provision LUNs; configure backup strategy
19. ANALYZE Project: Capacity Planning for N1
ID: P925
Rightsizing Scenario
Rightsizing adjusts the headroom for each component of the system to make sure that the
usage level falls inside the specification limits. Rightsizing can be performed during an
offline maintenance window but all the technologies exist to adjust domain size for tier 3
systems, and adjust the number of tier 1 and tier 2 systems dynamically.
[Swimlane table: tasks per role, flowing over time]
- Manager: Business level trend plan
- Database Admin: Monitor database headroom (memory and tables); increase headroom for bottleneck; reduce headroom for under-utilized database
- Systems Admin: Monitor CPU, network and memory headroom; increase headroom for bottleneck; reduce headroom for under-utilized systems
- Network Admin: Monitor WAN / Internet headroom; increase headroom for bottleneck; reduce headroom for under-utilized bandwidth
- Storage Admin: Monitor storage headroom; increase headroom for bottleneck; reduce headroom for under-utilized storage
20. ANALYZE Project: Capacity Planning for N1
ID: P925
Re-Allocation Scenario
Load levels vary during the day and the week. Regular times of low utilization can have
other work performed - e.g. overnight batch jobs. Batch workloads that cannot run on the
same systems due to configuration or security issues can run on systems (or Grids) that are
provisioned each night using spare capacity from other systems.
[Swimlane table: tasks per role, flowing over time]
- Manager: Batch workload capacity needed
- Application Architect: Define batch capable applications
- Developer: Build or configure batch capable applications
- Systems Architect: Define batch mechanism
- Systems Admin: Determine timing and depth of capacity to re-allocate; move resources to Grid after peak load time; bring resources back before peak load time
21. ANALYZE Project: Capacity Planning for N1
ID: P925
Overload Scenario
Load levels vary during the day and the week in a fairly consistent and predictable manner. Sizing
for the normal load level allows high utilization levels. Higher load levels can be handled as an
exception by watching for abnormally high levels before the load peaks and borrowing capacity
from lower priority applications such as development environments.
Question: “Are dynamic capacity adjustments a mature and reliable technology?”
[Swimlane table: tasks per role, flowing over time]
- Manager: Higher utilization needed to reduce cost of service; negotiate victim to steal capacity from
- Systems Admin: Determine normal load curve for time of day and day of week; monitor deviations above normal load level; provision extra capacity before it is needed
22. ANALYZE Project: Capacity Planning for N1
ID: P925
Rightsizing Scenario
- Detailed Design Concept via an Example
- Large scale Internet workload
  - Fairly predictable load shape
  - Peaks every evening (use peak hours)
  - Grows every week
- Key CTQs
  - Performance during peak hour
  - Cost of maintaining performance level
  - Risk of downtime
- Tier 3 backend database server
  - Primary bottleneck, over-provisioned elsewhere
  - Highest cost of CPU headroom (E10K/F15K class)
  - Initially 56 CPUs in domain, average 30 CPUs load
23. ANALYZE Project: Capacity Planning for N1
ID: P925
CPU Load Level
Monitor for days or weeks to establish a baseline and the time of peak load, then track that timeslot daily.
CPU load (units are CPUs, 56 configured) for a busy day:
[Chart: summed CPU utilization level vs. time of day, peaking near 50 CPUs for about 2 hours in the evening]
24. ANALYZE Project: Capacity Planning for N1
ID: P925
Utilization Distribution
The capability plot for the peak time shows the system is less than half utilized about 25% of the time - too much headroom. The defect rate corresponds to a Sigma level of 2.18.
[Capability plot: distribution of CPU demand level during the peak period]
25. ANALYZE Project: Capacity Planning for N1
ID: P925
Increase Utilization
Reduce the system to 40 CPUs and assume a linear increase in utilization - predicted sigma = 5.2.
Over-simplified - headroom margin and non-linearities are not included in the plan, so add a little extra headroom to compensate.
[Capability plot: predicted CPU demand distribution after reducing to 40 CPUs]
26. DESIGN Project: Capacity Planning for N1
ID: P925
Headroom Tool Prototype
- Solaris specific prototype
  - Rapid prototype using the SE Toolkit from http://www.setoolkit.com
  - Shows component level headroom vs. utilization goal
  - Automatic margin calculation based on CPU count
  - Samples every few minutes, reports every 30-60 minutes
  - Microstate based, sums over all processes
  - Headroom predictor uses mean plus two standard deviations
  - Text based, logs data to a daily file, 3.5 sigma headroom
- Codes: p.=processor, r.=ram, n.=network, d.=disk, .st=status, .cf=configured, .ll=min lsl, .ul=limit usl, .ld=mean load, .h%=headroom, .sd=std deviation, .tco=TCO defect rate, .sla=SLA defect rate, .tK=throughput K, .rm=response time in milliseconds, .rp=response time proportional increase

time     pll pul  pcf pst   ptco psla pld  psd  ph% ptK  prm  prp
17:36:04 3.6 11.6 12  Green 0.00 0.00 5.26 0.28 50  15.8 1.05 1.08
18:06:04 3.6 11.6 12  Green 0.00 0.00 4.90 0.38 51  13.9 1.01 1.06
18:36:04 3.6 11.6 12  Blue  0.40 0.00 4.55 2.19 23  13.0 0.93 1.09
19:06:03 3.6 11.6 12  Blue  1.00 0.00 3.02 0.17 71  12.7 0.86 1.05
19:36:03 3.6 11.6 12  Blue  0.93 0.00 2.82 0.53 67  12.0 0.67 1.04

Notes: samples are taken every two minutes and reported every 30 minutes. 12 CPUs are configured; the lower limit is 30% = 3.6 and the upper limit, based on CPU count, is 11.6. Status is based on the measured defect proportion of time that the CPU load level is below the pll=TCO or above the pul=SLA limits. The mean load level and standard deviation are compared to the upper limit to calculate headroom. Throughput is based on voluntary context switches; prm is very short, but prp defines a response time curve.
27. DESIGN Project: Capacity Planning for N1
ID: P925
Headroom Calculations
Set configured total to number of processors online:
    conf = sysconf(_SC_NPROCESSORS_ONLN);
Set lower spec limit to 30% for TCO failures:
    lsl = conf * 0.3;
Use a response time goal of 3 times baseline on the curve to determine the margin for maximum load level:
    rpgoal = 3.0;
Calculate max load level from the theoretical response time curve:
    /* rp = R/S, rp = 1/(1 - U^m), so U = exp(log((rp-1)/rp) / m) */
    usl = conf * exp(log((rpgoal-1.0)/rpgoal)/conf);
Calculate headroom % from mean plus two standard deviations versus the upper spec limit:
    headp = 100.0 * (1.0 - (mean + 2.0*sd) / usl);
Calculate Sigma Zst:
    tco_sigma = 1.5 + (mean - lsl) / sd;
    sla_sigma = 1.5 + (usl - mean) / sd;
28. DESIGN Project: Capacity Planning for N1
ID: P925
Design Optimization
Compare the “traditional” approach with the new design:
Run the headroom tool on a big and busy server, collect data, and show how a simplistic approach compares with the method described in this project.
A SunRay timesharing server was monitored for several days. The system is loaded to the limit at peak times, but idle out of hours, so focus on a scheduled capacity reallocation scenario.
Simplistic “Traditional” Approach
- Collect data using vmstat, sar, SunMC or 3rd party tools
- Plot CPU % busy - as shown on next slide
- There is spare capacity, but no indication of how many CPUs are unused
- Need extra information that this is a 12-CPU system
N1CP Approach
- Collect data using the headroom prototype
- Plot CPU load level in CPU units, no need to guess or replot data
- Calculate margin, headroom and sigma levels
- Plan capacity reallocation and recalculate margin, headroom and sigma levels
30. DESIGN Project: Capacity Planning for N1
ID: P925
N1CP View - CPU Counts
There are 12 CPUs; 6 to 8 are free overnight, and the system overloads at peak times.
[Chart: mean+2sd CPU load (pmd+2psd) vs. configured capacity (pcf) and upper limit (pul) over about two days]
Summary: mean CPU load 7.03, mean util 59%, mean headroom 34%, mean capacity 12.00
      DPMO    Min Sigma
TCO   110215  -1.5 Zst
SLA   538     2.5 Zst
31. DESIGN Project: Capacity Planning for N1
ID: P925
N1CP - Response Curve
The system is close to overload. This timeshared workload has a flatter curve than internet workloads (a closed rather than open queuing model).
[Chart: response time increase vs. CPU load level, measured up to 12 CPUs]
33. DESIGN Project: Capacity Planning for N1
ID: P925
N1CP View - Dynamic!
Vary the CPU count and times daily, and borrow extra for the peak load.
[Chart: mean+2sd CPU load (pmd+2psd) vs. configured capacity (pcf) and upper limit (pul) as the CPU count is varied through the day; annotations from 3.2s to 6.3s mark the capacity changes]
Summary: mean CPU load 7.03, mean util 74%, mean headroom 16%, mean capacity 9.52
Predicted Min Sigma: TCO 2.0 Zst, SLA 3.2 Zst
34. DESIGN Project: Capacity Planning for N1
ID: P925
Summary
Performance Impact
- SLA Sigma levels improve from a minimum of 2.5 Zst to 3.2 Zst
- Improvement of 0.7 Sigma by allowing for extra peak load
- Simplistic methods do not allow quality of service prediction
Cost Impact
- TCO Sigma levels improve from a minimum of -1.5 Zst to 2.0 Zst
- Improvement of 3.5 Sigma by reducing capacity from 12 to 9.5
Observability Impact
- Headroom tool prototype generates all required statistics
- Sigma level is simply calculated, or the headroom tool could print it
- Simplistic methods do not show what is going on
Complexity Impact
- Dynamic reconfiguration must be enabled
- One reconfiguration each morning and each evening
Applicability (assertions, out of scope for this project!)
- The CPU based example can be applied to blades, RAM, disk, net, thermal
- The method can be extended from platform level to services
36. GRID Project: Capacity Planning for N1
ID: P925
Capacity for Sale
Uses for Spare Capacity
- Carefully schedule batch work and backups
- Remotely support global timezones
- Run engineering dept. simulation jobs
Grid Oriented Solutions
- Project Grid - departmental cluster (Sun Grid Engine)
- Enterprise Grid - collection of clusters forming a general purpose Grid service (Sun Grid Engine Enterprise Edition)
- The Global Grid - Internet level - GT2.2, OGSA/OGSI/GT3
Provision an Enterprise Grid service using N1
Join The Global Grid and sell or share capacity
37. GRID Project: Capacity Planning for N1
ID: P925
Relationships: N1 and Grid
N1 is about provisioning things you own; Grid is about access to things you don’t own.

                             Infrastructure                Business Model
Things you own and control   N1                            Utility Computing
Things you borrow or lease   Grid Services, Web Services   Utility Computing
38. GRID Project: Capacity Planning for N1
ID: P925
Capacity Flows in a Grid Enabled N1 Datacenter
[Diagram: Utility Computing purchases capacity on demand (C.O.D.), sending capacity requests into the N1 Virtualized Datacenter. Tier 3 database storage, Tier 2 database/app servers, Tier 1 web servers and Tier 0 web front end serve User / Web Services. A free pool feeds a Cluster Grid run by Sun Grid Engine Enterprise Edition, whose unused compute and storage resources serve User / Grid Services. Obsolete capacity is retired, repaired and replaced.]
39. GRID Project: Capacity Planning for N1
ID: P925
IT market segments by “need to share”
                      Defense (spooks)      Commercial (suits)          Technical (geeks)        Consumer (users)
What can be shared    Nothing               Hardware                    Operating System         Everything
What is required      Nothing, physical     N1, Server domains, VLAN    Grid, VPN, encryption,   P2P apps, SETI, Kazaa,
                      separation            and SAN Zone partitioning   trusted firewalls        Limewire, People!
What is visible       Local systems,        Local systems and           Everything in The        Everything including
                      Project Grids         Internet                    Global Grid              other community users
Primary constraints   Organizational,       Organizational, legal,      CPU cycles. Latency.     Storage. CPU cycles,
                      National security     contractual issues          Know-how                 Network bandwidth.
                      issues