The document discusses goals for a dissertation on monitoring, diagnosing, and repairing systems. It aims to reduce the cost of system administration by automatically detecting and repairing problems faster. Key innovations proposed include using replicated, semi-hierarchical databases for fault tolerance and scalability, self-describing data structures, end-to-end notification, aggregation and high-resolution displays to reduce information, self-configuring systems, and secure remote actions. The system will be evaluated through fault injection, feature comparison, real-world usage, and testimonials.
Allocation of processors to processes in Distributed Systems. Strategies or algorithms for processor allocation. Design and Implementation Issues of Strategies.
Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost an... - Naoki Shibata
Yosuke Wakisaka, Naoki Shibata, Keiichi Yasumoto, Minoru Ito, and Junji Kitamichi : Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost and Hyper-Threading, In Proc. of The 2014 International Conference on Parallel and Distributed Processing Techniques and Applications(PDPTA'14), pp. 229-235
In this paper, we propose a task scheduling algorithm for multiprocessor systems with Turbo Boost and Hyper-Threading technologies. The proposed algorithm minimizes the total computation time taking account of dynamic changes of the processing speed by the two technologies, in addition to the network contention among the processors. We constructed a clock speed model with which the changes of processing speed with Turbo Boost and Hyper-threading can be estimated for various processor usage patterns. We then constructed a new scheduling algorithm that minimizes the total execution time of a task graph considering network contention and the two technologies. We evaluated the proposed algorithm by simulations and experiments with a multiprocessor system consisting of 4 PCs. In the experiment, the proposed algorithm produced a schedule that reduces the total execution time by 36% compared to conventional methods which are straightforward extensions of an existing method.
(Slides) Task scheduling algorithm for multicore processor system for minimiz... - Naoki Shibata
Shohei Gotoda, Naoki Shibata and Minoru Ito : "Task scheduling algorithm for multicore processor system for minimizing recovery time in case of single node fault," Proceedings of IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2012), pp.260-267, DOI:10.1109/CCGrid.2012.23, May 15, 2012.
In this paper, we propose a task scheduling algorithm for a multicore processor system which reduces the recovery time in case of a single fail-stop failure of a multicore processor. Many of the recently developed processors have multiple cores on a single die, so that one failure of a computing node results in failure of many processors. In the case of a failure of a multicore processor, all tasks which have been executed on the failed multicore processor have to be recovered at once. The proposed algorithm is based on an existing checkpointing technique, and we assume that the state is saved when nodes send results to the next node. If a series of computations that depends on former results is executed on a single die, we need to execute all parts of the series of computations again in the case of failure of the processor. The proposed scheduling algorithm tries not to concentrate tasks on the processors of a single die. We designed our algorithm as a parallel algorithm that achieves O(n) speedup where n is the number of processors. We evaluated our method using simulations and experiments with four PCs. We compared our method with an existing scheduling method, and in the simulation, the execution time including recovery time in the case of a node failure is reduced by up to 50% while the overhead in the case of no failure was a few percent in typical scenarios.
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems - knowdiff
PhD Candidate,
Department of Computer Science
Mälardalen University
Time: Tuesday, Dec. 30, 2014, 11:30 a.m.
Location: Computer Engineering Department, Urmia University
Abstract:
The processor is the brain of a computer system. Usually, one or more programs run on a processor, where each program is typically responsible for performing a particular task or function of the system. The performance of all the tasks together results in the system functionality. In many computer systems, it is not enough that all tasks deliver correct output; it is also crucial that these results are delivered at the proper time. Systems that have such timing requirements are known as real-time systems. A scheduler is responsible for scheduling all tasks on the processor, i.e., it dictates which task to run and when to run it, to ensure that all tasks are carried out on time. Typically, such tasks/programs need to use the computer system’s hardware and software resources to perform their calculations. Examples of resources that are shared among programs are I/O devices, buffers and memories. The technology used to manage shared resources is known as a resource sharing synchronization protocol.
In recent years, a shift from single-processor platforms to multiprocessor platforms has become inevitable due to the availability of processor chips and requirements for increased performance. Scheduling and resource sharing protocols have been well studied for uniprocessor systems. However, in the context of multiprocessors, such techniques are still not fully mature. The shift towards multi-core technology has revealed the demand for real-time scheduling algorithms along with synchronization protocols to support real-time applications on multiprocessors, both with and without dependencies.
In this talk, we first give an introduction to real-time embedded systems. Next, we look at scheduling and resource sharing policies in uniprocessor platforms. Further, we discuss the extension of scheduling and resource sharing policies for multiprocessor platforms and present the recent challenges that have arisen in this context.
Biography:
Sara Afshar is a PhD student at Mälardalen University. She received her B.Sc. degree in Electrical Engineering from Tabriz University, Iran in 2002. She worked at different engineering companies until 2009. In 2010 she started her M.Sc. in Embedded Systems at Mälardalen University. She obtained her Master's degree in 2012 and in the same year started her PhD studies at Mälardalen University. Currently she is working on the topic of resource sharing in multiprocessor systems. She is part of the Complex Real-Time Embedded Systems group at Mälardalen University.
Overview - Functions of an Operating System – Design Approaches – Types of Advanced
Operating System - Synchronization Mechanisms – Concept of a Process, Concurrent
Processes – The Critical Section Problem, Other Synchronization Problems – Language
Mechanisms for Synchronization – Axiomatic Verification of Parallel Programs - Process
Deadlocks - Preliminaries – Models of Deadlocks, Resources, System State – Necessary and
Sufficient conditions for a Deadlock – Systems with Single-Unit Requests, Consumable
Resources, Reusable Resources.
Solutions to Operating System Concepts, ninth edition.
By Navid Daneshvaran, software engineering student at Kharazmi university.
I would be grateful if you would notify me of any errors to solutions.
E-Mail:
nd.naviddaneshvaran@gmail.com
Introduction: What is clock synchronization?
The challenges of clock synchronization.
Basic Concepts: Software and hardware clocks. Basic clock synchronization algorithm
Algorithms: Deep dive into landmark papers
NTP: Internet scale time synchronization
Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor... - Rafael Ferreira da Silva
Presentation held at the 11th Workflows in Support of Large-Scale Science, October 14, 2016.
Abstract - Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. In spite of many success stories, a key challenge for running workflows in distributed systems is failure prediction, detection, and recovery. In this paper, we propose an approach that uses control theory developed as part of autonomic computing to predict failures before they happen and mitigate them when possible. The proposed approach applies the proportional-integral-derivative controller (PID controller) control loop mechanism, which is widely used in industrial control systems, to mitigate faults by adjusting the inputs of the controller. The PID controller aims at detecting the possibility of a fault far enough in advance so that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of the Big Data era: data storage overload and memory overflow. We define, implement, and evaluate simple PID controllers to autonomously manage data and memory usage of a bioinformatics workflow that consumes/produces over 4.4TB of data, and requires over 24TB of memory to run all tasks concurrently. Experimental results indicate that workflow executions may significantly benefit from PID controllers, in particular under online and unknown conditions. Simulation results show that nearly-optimal executions (slowdown of 1.01) can be attained when using our proposed method, and faults are detected and mitigated far in advance of their occurrence.
“Performance testing is the process by which software is tested to determine the current system performance. This process aims to gather information about current performance, but places no value judgments on the findings.”
Determining the root cause of performance issues is a critical task for Operations. In this webinar, we'll show you the tools and techniques for diagnosing and tuning the performance of your MongoDB deployment. Whether you're running into problems or just want to optimize your performance, these skills will be useful.
A cluster is a type of parallel or distributed computer system, which consists of a collection of inter-connected stand-alone computers working together as a single integrated computing resource.
Stop the Guessing: Performance Methodologies for Production Systems - Brendan Gregg
Talk presented at Velocity 2013. Description: When faced with performance issues on complex production systems and distributed cloud environments, it can be difficult to know where to begin your analysis, or to spend much time on it when it isn’t your day job. This talk covers various methodologies, and anti-methodologies, for systems analysis, which serve as guidance for finding fruitful metrics from your current performance monitoring products. Such methodologies can help check all areas in an efficient manner, and find issues that can be easily overlooked, especially for virtualized environments which impose resource controls. Some of the tools and methodologies covered, including the USE Method, were developed by the speaker and have been used successfully in enterprise and cloud environments.
These slides show how to use real-world applications to teach early architecture exploration of electronics, embedded systems, software/firmware and semiconductors using VisualSim.
Overview of Performance Evaluation
Intro & Objective
The Art of Performance Evaluation
Professional Organizations, Journals, and conferences.
Performance Projects
Common Mistakes and How to Avoid Them
Selection of Techniques and Metrics
2. Overview (Jun 28, 2013)
What is System Administration?
– What is the problem?
– Goals of Dissertation Research
– Goals of System Administration
Monitoring, diagnosing, and repairing
Dissertation Timeline
Conclusion
3. What is the problem?
Problems occur in systems and result in loss of productivity
– Server failures → denial of service
– System overload → lower productivity
Cost is too high
– Cost of ownership estimated at $5,000-$15,000/year/machine
– Median salary (~$50k) / (median # machines/admin) → $700
Our goal: Reduce cost by
– Repairing problems faster (possibly automatically)
– Handling more problems
4. Goals of Dissertation Research
Describe field of System Administration
Monitoring, Diagnosing, and Repairing:
– Approach: Synthesize solutions from other fields of research
1) Detect previously ignored problems
2) Automatic repair of some problems
3) Reduce number of administrators needed
4) Support users’ understanding of system
Apply here & distribute software
Thesis: Through our approach, we can achieve goals 1-4.
5. Goals of System Administration
Goal: Support cost-effective use of the computer environment
More specifically (some non-technical):
Environment: uniform, customizable, high performance and available
Faults & errors: recovery from benign errors, protection from malicious attacks
Users: training, accounting & planning, legal
6. Monitoring, Diagnosing, and Repairing (MDR)
• Introductory examples
• Fundamental requirements
• Environmental constraints
• Previous work
• Six key innovations
• Architecture
• Details on innovations
• Evaluation methodology
7. MDR: Examples - Intro
Four examples
1) Broken component
2) Resource overload — transient
3) Resource contention — user program
4) Resource exhaustion — long term
Previous Solutions
– Pay someone to watch
– Ignore or wait for someone to complain
– Specialized scripts (not general → vast repeated work)
8. MDR: Example 1
Web server has crashed/hung
Gather information: process existence, service uptime, restart times
Analyze data: process not responding, and hasn’t been recently restarted.
Automatic repair: restart daemon.
Notify administrator: had to restart daemon.
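The gather, analyze, repair, notify sequence of Example 1 can be sketched roughly as follows. This is only an illustration, not the dissertation's implementation: `DaemonStatus`, `needs_restart`, `monitor_once`, and the 300-second "recently restarted" threshold are all hypothetical names and values chosen for the sketch, and the gather/restart/notify steps are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DaemonStatus:
    """Hypothetical snapshot of the data gathered about one daemon."""
    process_exists: bool
    responding: bool
    seconds_since_restart: float


def needs_restart(status: DaemonStatus, min_restart_interval: float = 300.0) -> bool:
    """Analyze: the daemon is down or hung and has not been restarted recently."""
    unhealthy = (not status.process_exists) or (not status.responding)
    recently_restarted = status.seconds_since_restart < min_restart_interval
    return unhealthy and not recently_restarted


def monitor_once(
    gather: Callable[[], DaemonStatus],
    restart: Callable[[], None],
    notify: Callable[[str], None],
    min_restart_interval: float = 300.0,
) -> bool:
    """Run one gather/analyze/repair/notify pass; return True if a repair ran."""
    status = gather()
    if needs_restart(status, min_restart_interval):
        restart()                           # automatic repair
        notify("had to restart daemon")     # tell the administrator afterwards
        return True
    return False
```

Checking the restart time before repairing keeps a daemon that crashes immediately after every restart from being restarted in a tight loop; what should happen in that case (for example, escalating to a person) is left out of this sketch.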
9. MDR: Example 2
The NOW is “slow.”
Gather data: load, process info, CPU info
Analyze data: bounds on expected values
Notify administrator: fileserver overloaded.
Visualize data: nfsd’s are overloaded.
Repair: admin moves data, adds disks, or starts more nfsd’s
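The "bounds on expected values" analysis in Example 2 might look like the following minimal sketch. The metric names and the bounds used in the usage note are invented for illustration; the slides do not say how bounds are stored or checked.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: metric name -> (lowest, highest) expected value.
Bounds = Dict[str, Tuple[float, float]]


def out_of_bounds(sample: Dict[str, float], bounds: Bounds) -> List[str]:
    """Return the names of metrics whose value falls outside its expected range.

    Metrics without a configured bound are ignored rather than reported.
    """
    violations = []
    for name, value in sample.items():
        if name not in bounds:
            continue
        low, high = bounds[name]
        if not (low <= value <= high):
            violations.append(name)
    return violations
```

For example, with `bounds = {"nfsd_load": (0.0, 8.0)}`, the sample `{"nfsd_load": 12.0, "cpu_idle": 50.0}` would report only `nfsd_load`, which is the kind of result that could drive the "fileserver overloaded" notification.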
10. MDR: Example 3
User running program
Gather: user statistics, CPU, disk
Visualize: spending too much time waiting on remote accesses
(User fixes program; gathering and visualization are repeated)
Analyze: some nodes have less throughput
Visualize: those have other jobs running on them
Repair: user is benchmarking so kills all extraneous processes
11. MDR: Example 4
Web server increasing beyond capacity
Gather: CPU, request rate, reply latency
Analyze: Burst lengths getting longer, latency increasing
Visualize: Graph of burst lengths & CPU usage over time
Repair: Order more machines, install load balancer
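One way to detect that a value such as burst length or latency is "getting longer" over weeks is to fit a straight line to the samples and look at its slope. The sketch below uses an ordinary least-squares fit over equally spaced samples; this choice of technique, and the names `trend_slope` and `samples_until`, are assumptions made for illustration, not something the slides specify.

```python
from typing import Optional, Sequence


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def samples_until(values: Sequence[float], capacity: float) -> Optional[float]:
    """Rough number of further samples before the trend reaches capacity.

    Simplification: extrapolates from the last observed value using the fitted
    slope. Returns None when the trend is flat or decreasing.
    """
    slope = trend_slope(values)
    if slope <= 0:
        return None
    return max(0.0, (capacity - values[-1]) / slope)
```

An estimate like this is only meant to give early warning for the long-term case (ordering machines takes time); it says nothing about the short transients of Example 2.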
12. MDR: Fundamental Requirements
• Gathering
• Flexible data gathering, self-describing storage
• Analyzing
• Calculate statistical measures, identify relevant statistics.
• Notifying
• Flexible infrequent messages to administrators or users
• Visualizing
• Maximize information/pixel, support multiple interfaces
• Repairing
• Automate simple repairs, support group operations
13. MDR: Environmental Constraints
Change is inherent
– Lack of Web/Mbone 5 years ago, now most/many have these.
Problems on many time-scales
– Second-Minute transients vs. Week-Month capacity problems
Must operate under very adverse conditions
– Often used when system is broken
– Would like at least post-mortem analysis
Need to handle hundreds – thousands of nodes
– Scalability: All sites are getting larger, possibly wide area
– Our system has 200 (NOW) – 2000 (Soda) nodes
14. MDR: Previous Systems
Many previous systems: I’ve looked at about 16.
Not comprehensive, not extensible.
Look at a few that did a nice job of a piece:
[Fink97] — Run test, notify display engine
+ Easy to add tests
+ Selectivity of notification good
– Tests are just programs (redo gathering)
– Central, non-fault tolerant solution
– Many hard coded constants
15. MDR: Previous Systems, cont.
[Hard92] — buzzerd: Pager notification system
+ Flexible rules for notification
+ External interface for adding notify requests
– Simplistic gathering
– Poor fault tolerance
[Pier96] — Igor group fixes
+ Flexible operations
+ Nice reporting of success/failure
– Weak security, runs as root
– No delegation of responsibility
16. MDR: Six Key Innovations (1-3)
Replicated, semi-hierarchical, data storage nodes
– Rendezvous point for programs
– Handles scaling and fault-tolerance
Self describing structures
– Functions (visualize, summarize) + data go in database (OO)
– DB has machine and human readable descriptions of data
End to end notification
– Detect problems in MDR system
– Guarantee important messages get to users
17. MDR: Six Key Innovations (4-6)
Aggregation and High Resolution Color Displays
– Reduce information to manageable amounts
– Maximize information per unit area
Partially self-configuring
– Learn averages, deviations, burst sizes
– Learn which values are relevant to problems
Secure, user-specified group repairs
– Don’t enable malicious attacks
– Automate repairs of many machines
20. Key: Semi-Hier. DBs.
Fault tolerance
Scalability:
– Caches don’t need to commit to disk; the authoritative copy is elsewhere.
– Batching updates over wide area links.
[Diagram: two top-level caches, above three mid-level caches, above five per-node databases]
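A rough sketch of the two scalability points above: caches hold values in memory only, and updates travel upward in batches. The class `CacheLevel`, the `batch_size` knob, and the node and statistic names are hypothetical; replication and failure handling are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (node name, statistic name)


@dataclass
class CacheLevel:
    """One cache in the hierarchy; it never holds the authoritative copy."""
    name: str
    parent: Optional["CacheLevel"] = None
    batch_size: int = 3  # hypothetical tuning knob
    values: Dict[Key, float] = field(default_factory=dict)
    pending: List[Tuple[Key, float]] = field(default_factory=list)

    def update(self, key: Key, value: float) -> None:
        # Memory only: the per-node database keeps the authoritative copy.
        self.values[key] = value
        if self.parent is not None:
            self.pending.append((key, value))
            if len(self.pending) >= self.batch_size:
                self.flush()

    def flush(self) -> None:
        # Send all queued updates to the parent as one batch.
        if self.parent is None:
            return
        batch, self.pending = self.pending, []
        for key, value in batch:
            self.parent.update(key, value)


top = CacheLevel("top")
mid = CacheLevel("mid", parent=top)
for i, load in enumerate([0.2, 0.9, 0.4]):
    mid.update((f"node{i}", "cpu_load"), load)
```

With `batch_size=3`, the third update triggers one batched transfer to `top`; a later single update stays queued in `mid.pending` until the next flush.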
21. Key: Self-Describing
De-couple data gathering, data storage, and data use
Self-Describing for Humans
– Descriptions of meanings of values stored with tables
– Description of methods of gathering stored with tables
– Column names help with self
Self-Describing for Computers
– Functions for visualizing or summarizing data
– Indication of resource selection from resource statistics
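One way to picture a self-describing table: each column carries a human-readable meaning, a note on how it is gathered, and a function a program can use to summarize it. The names `Column`, `Table`, and the two example columns are assumptions for illustration, not the MDR schema.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Column:
    name: str
    meaning: str       # for humans: what the value means
    gathered_by: str   # for humans: how the value is gathered
    summarize: Callable[[List[float]], float]  # for programs


@dataclass
class Table:
    columns: Dict[str, Column]
    rows: List[Dict[str, float]]

    def describe(self) -> str:
        # Human-readable description stored alongside the data.
        return "\n".join(
            f"{c.name}: {c.meaning} (via {c.gathered_by})"
            for c in self.columns.values()
        )

    def summary(self) -> Dict[str, float]:
        # A generic tool can summarize any table without knowing its schema.
        return {
            name: col.summarize([row[name] for row in self.rows])
            for name, col in self.columns.items()
        }


table = Table(
    columns={
        "cpu_load": Column("cpu_load", "1-minute load average",
                           "hypothetical per-node gatherer", mean),
        "disk_free_mb": Column("disk_free_mb", "free disk space in MB",
                               "hypothetical per-node gatherer", min),
    },
    rows=[
        {"cpu_load": 0.5, "disk_free_mb": 900.0},
        {"cpu_load": 1.5, "disk_free_mb": 120.0},
    ],
)
```

Because the summarizing function travels with the column, gathering, storage, and use stay decoupled: a display tool only calls `table.summary()`.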
22. Key: End-to-End Notification
Recall: System must operate under extreme conditions
Humans must validate that system is still working
– Standalone display can indicate timestamps, mark out of date data
– Wireless machine could intermittently contact notification system
– Pager could be automatically paged every so often
Problems should be propagated to end users.
– Flexible notification — connected systems, e-mail, pager.
– Limit over-notification
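Two of the ideas above, marking out-of-date data and limiting over-notification, can be sketched together. The 60-second staleness limit, the 300-second repeat interval, and the names `Notifier` and `is_stale` are assumptions for the example; times are passed in explicitly so the behaviour is easy to check.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Notifier:
    min_interval: float = 300.0  # assumed seconds between repeats of one problem
    last_sent: Dict[str, float] = field(default_factory=dict)
    sent: List[str] = field(default_factory=list)

    def notify(self, problem: str, now: float) -> bool:
        """Record a message unless the same problem was reported recently."""
        last = self.last_sent.get(problem)
        if last is not None and now - last < self.min_interval:
            return False  # suppressed to limit over-notification
        self.last_sent[problem] = now
        self.sent.append(problem)
        return True


def is_stale(last_update: float, now: float, max_age: float = 60.0) -> bool:
    # Data older than max_age seconds is marked out of date.
    return now - last_update > max_age


notifier = Notifier()
now = 1000.0
last_seen = {"node1": 990.0, "node2": 800.0}  # hypothetical update timestamps
for node, timestamp in last_seen.items():
    if is_stale(timestamp, now):
        notifier.notify(f"{node}: data out of date", now)
```

A real system would replace the `sent` list with delivery to a display, e-mail, or pager; the suppression logic stays the same.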
23. Key: Aggregation & HiRes
System target has hundreds – thousands of nodes
Aggregate by showing out of bounds, relevant values (via automatic tuning)
Also want overview of system
– Aggregate across similar statistics; show value (fill) & dispersion (shade)
– Use color to highlight important values.
– Aggregate across values (machine utilization = CPU + disk + memory)
– Maximize data/pixel [Tufte]
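A small sketch of aggregating one statistic across nodes into a value and a dispersion, plus picking out the out-of-bounds nodes. Mean and population standard deviation are assumed choices here; the slides do not say which measures MDR uses, and the node names and bounds are invented.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def aggregate(samples: Dict[str, float]) -> Tuple[float, float]:
    """Reduce per-node samples to (value, dispersion) for one display cell."""
    values = list(samples.values())
    return mean(values), pstdev(values)


def out_of_bounds(samples: Dict[str, float], low: float, high: float) -> List[str]:
    """Names of nodes whose value falls outside [low, high]."""
    return sorted(node for node, v in samples.items() if not low <= v <= high)


# Hypothetical CPU utilization per node.
cpu = {"node1": 0.25, "node2": 0.25, "node3": 0.75, "node4": 0.75}
value, dispersion = aggregate(cpu)
flagged = out_of_bounds(cpu, 0.0, 0.5)
```

In a display, `value` could drive the fill level and `dispersion` the shade, with `flagged` nodes highlighted in color.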
25. Key: Self-Configuring
Single statistics
– Phase 1: Calculate averages, standard deviations, burst sizes
– Worked in other systems [Jaco88, Karn91]
Identify relevant statistics
– Give system Boolean examples (variables out of bounds, and system working/not working); get back a function.
– Works for Boolean disjunctions in some cases:
• With lots of irrelevant variables [Litt89]
• With random bad examples [Sloa89]
• In some cases, with malicious bad examples [Ande94]
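Learning a Boolean disjunction with many irrelevant variables can be illustrated with a Winnow-style multiplicative update in the spirit of [Litt89]. This is a sketch under assumptions: the slides do not say which algorithm MDR would use, and the six variables with two relevant ones are invented for the example.

```python
from itertools import product
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[int], bool]  # (out-of-bounds flags, system broken?)


def predict(weights: Sequence[float], x: Sequence[int]) -> bool:
    # Predict "broken" when the active weights reach the threshold n.
    return sum(w for w, xi in zip(weights, x) if xi) >= len(weights)


def winnow_train(examples: List[Example], n: int,
                 alpha: float = 2.0, max_passes: int = 100) -> List[float]:
    weights = [1.0] * n
    for _ in range(max_passes):
        mistakes = 0
        for x, label in examples:
            if predict(weights, x) == label:
                continue
            mistakes += 1
            if label:
                # Missed a problem: promote the weights of active variables.
                weights = [w * alpha if xi else w for w, xi in zip(weights, x)]
            else:
                # False alarm: eliminate the active variables.
                weights = [0.0 if xi else w for w, xi in zip(weights, x)]
        if mistakes == 0:
            break
    return weights


n = 6
relevant = (0, 3)  # hypothetical: the system breaks if variable 0 or 3 is out of bounds
examples = [(x, any(x[i] for i in relevant)) for x in product((0, 1), repeat=n)]
weights = winnow_train(examples, n)
```

After training, the large weights identify the relevant statistics; the irrelevant ones stay small or drop to zero.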
26. Key: Secure Remote Actions
Security because of malicious attacks, benign errors
Delegation to remove SA from the loop
Independence from particular algorithms
– Building a library
– Program with principals (hosts, users), and properties (signed, sealed, verifiable)
Use secure, run-time extensible languages
Actions report through gathering system
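To illustrate programming with principals and properties rather than keys and algorithms, the sketch below hides the algorithm behind `sign_action` and `verify_action`. HMAC-SHA256 with a shared key is an assumed stand-in, not the scheme the slides propose; it covers only the "signed" property, with no sealing or delegation, and the principal, action, and host names are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Dict, Iterable


def sign_action(key: bytes, principal: str, action: str,
                targets: Iterable[str]) -> Dict[str, str]:
    """Return an action request carrying the 'signed' property."""
    payload = json.dumps(
        {"principal": principal, "action": action, "targets": sorted(targets)},
        sort_keys=True,
    )
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_action(key: bytes, message: Dict[str, str]) -> bool:
    """Check the signature before any node acts on the request."""
    expected = hmac.new(key, message["payload"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["tag"])


key = b"demo-key-not-for-real-use"  # placeholder key for the example only
request = sign_action(key, "user:alice", "restart nfsd", ["node1", "node2"])
tampered = dict(request, payload=request["payload"].replace("alice", "mallory"))
```

Callers never name the algorithm, so the library could swap in public-key signatures or Kerberos later without changing the programs that use it.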
27. MDR: Testing Methodology
Fault injection
– Deliberately make the system slow
– Break hardware/software components
Feature comparison
– Paper comparison with other systems
Usage in practice
– Experience important to show system works
– We have need of administrative tools
Testimonials
– Experience at other sites lends credibility
28. MDR: Demo
Hierarchical structure working (1 level right now)
Alternative Interface
Fault Injection
Need for Aggregation
Crufty right now
Demo
29. Timeline: Key Pieces
1) (DBs) Replicated, semi-hierarchical, data storage nodes
2) (SDS) Self describing structures
3) (Vis) Aggregation and High Resolution Color Displays
4) (E2EN) End to end notification
5) (ReS) Automatic Restart
6) (Cfg) Partially self-configuring
7) (Rep) Secure, user-specified group repairs
30. Timeline
Deadlines: LISA 6/97, USENIX 12/97, OSDI 3/98, Graduation 12/98
Time axis: June 1997, Dec 1997, June 1998, Dec 1998, Mar 1999
– Prototype 1, 2, 3 (DBs, SelfD, Vis)
– Prototype 4, 5 (Notify, Restart)
– Prototype 6, 7 (AConfig, Repair)
– LISA 6/98: Experience with 1-7
– SOSP 3/99: Architecture of Complete System
– Writing
31. Conclusion
Description of field shows breadth
Monitoring, diagnosing, and repairing shows depth
– Examples show importance of problem
– Fundamental goals & environmental constraints show understanding of problem
– Key innovations show differences from previous systems.
– Architecture and initial prototype show approach to problem
– Testing methods show ways to validate solution.
Timeline shows plan & milestones to graduation
35. Supporting Users
Automated help desk
– Searchable collection of questions
– Easy method for addition
Remote device access
Site-wide training
36. Goals: Environment
Uniform
– Supports user mobility by eliminating arbitrary changes
– Increases effectiveness by avoiding need for users to learn multiple interfaces
Customizable
– Handles special systems and special needs [firewalls, servers]
– Obviously reduces uniformity
37. Goals: Environment, cont.
High Performance
– Increases effectiveness of users [HCI/psych]
– Limited by cost-effectiveness
Available
– Effectiveness is 0 if system isn’t working
– Balanced against expense
39. Goals: Users
Training
– Troubleshooting = one-on-one training
– Larger sessions = classes
Accounting
– Supports management, helps billing
Capacity Planning
– Expanding systems takes time
Legal
– Sensitive information needs protection
40. Simplifying Security
USENIX talk says “If cryptography is so great, why isn’t it used more?”
SA’s worry about security to protect data.
Goal: Ease development of secure applications
Write programs using principals & properties rather than keys and algorithms
Unify various forms of available cryptography (public key, secret-key, PGP, Kerberos)
My use: protected, transferable rights to allow various actions
– Modify system configurations (add filesystems, printers)
– Kill/restart processes (runaway, after configurations modified)
– Access data (private logs, for backups, etc.)
41. Conclusion
System administration as area of research
– Description of field
– Areas for future research
• Managing stable storage
• Supporting users
Initial investigation of research area
– Monitoring, diagnosing, and repairing
• Broad, draws from many fields
Editor's Notes
Key idea: None. Introduction slide.
Key Idea: Two contributions — System administration as a field of research; and initial work in the field produces initial results which substantially improve the state of the art.
Key idea: Has properties of “real” research — separation of concerns, important contributions, and a strategy for measuring effectiveness.