Monitoring, Diagnosing, and Repairing
Eric Anderson
U.C. Berkeley
2Jun 28, 2013
Overview
■ What is System Administration?
– What is the problem?
– Goals of Dissertation Research
– Goals of System Administration
■ Monitoring, diagnosing, and repairing
■ Dissertation Timeline
■ Conclusion

What is the problem?
■ Problems occur in systems and result in loss of productivity
– Server failures → denial of service
– System overload → lower productivity
■ Cost is too high
– Cost of ownership estimated at $5,000-$15,000/year/machine
– Median salary (~$50k) / (median # machines/admin) → $700 per machine
■ Our goal: Reduce cost by
– Repairing problems faster (possibly automatically)
– Handling more problems
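The cost figures above can be checked with a little arithmetic. The sketch below assumes the $700 figure is the administrator salary cost per machine per year; under that reading, the slide's numbers imply roughly 71 machines per administrator. The variable and function names are illustrative, not from the talk.

```python
# Figures quoted on the slide; variable names are illustrative only.
median_salary = 50_000           # dollars per year (the slide's "~50k")
admin_cost_per_machine = 700     # dollars per machine per year (the slide's result)

# salary / machines_per_admin = cost_per_machine, so solve for machines_per_admin.
machines_per_admin = median_salary / admin_cost_per_machine


def admin_share(cost_of_ownership):
    """Fraction of a machine's yearly cost of ownership that is administrator salary."""
    return admin_cost_per_machine / cost_of_ownership


print(round(machines_per_admin))
print(admin_share(5_000), admin_share(15_000))
```

Read this way, salary is only part of the quoted cost of ownership at either end of the $5,000-$15,000 range, which is consistent with the slide's emphasis on lost productivity as well as staffing.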

Goals of Dissertation Research
■ Describe field of System Administration
■ Monitoring, Diagnosing, and Repairing:
– Approach: Synthesize solutions from other fields of research
1) Detect previously ignored problems
2) Automatic repair of some problems
3) Reduce number of administrators needed
4) Support users’ understanding of system
■ Apply here & distribute software
■ Thesis: Through our approach, we can achieve goals 1-4.

Goals of System Administration
Goal: Support cost-effective use of the computer environment
More specifically (some non-technical):
Environment: uniform, customizable, high performance and available
Faults & errors: recovery from benign errors, protection from malicious attacks
Users: training, accounting & planning, legal

Monitoring, Diagnosing, and Repairing (MDR)
• Introductory examples
• Fundamental requirements
• Environmental constraints
• Previous work
• Six key innovations
• Architecture
• Details on innovations
• Evaluation methodology

MDR: Examples - Intro
■ Four examples
1) Broken component
2) Resource overload - transient
3) Resource contention - user program
4) Resource exhaustion - long term
■ Previous Solutions
– Pay someone to watch
– Ignore or wait for someone to complain
– Specialized scripts (not general → vast repeated work)

MDR: Example 1
Web server has crashed/hung
■ Gather information: process existence, service uptime, restart times
■ Analyze data: process not responding, and hasn’t been recently restarted.
■ Automatic repair: restart daemon.
■ Notify administrator: had to restart daemon.
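The gather/analyze/repair/notify steps of this example can be sketched as a small decision function. This is a minimal sketch, not the talk's implementation: `DaemonStatus`, `plan_repair`, the 300-second threshold, and the "escalate" branch for a daemon that fails again soon after a restart are all assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class DaemonStatus:
    """Gathered facts about one daemon (hypothetical structure)."""
    process_exists: bool
    responding: bool
    seconds_since_restart: float


def plan_repair(status, min_restart_interval=300.0):
    """Return (action, message_for_administrator)."""
    if status.process_exists and status.responding:
        return ("none", None)
    # Restart automatically only if the daemon has not been restarted recently.
    if status.seconds_since_restart >= min_restart_interval:
        return ("restart", "had to restart daemon")
    # Assumption: repeated failures are handed to a human instead of looping.
    return ("escalate", "daemon failing again shortly after a restart")


print(plan_repair(DaemonStatus(True, False, 3600.0)))
```

Keeping the decision separate from the code that actually restarts the daemon makes it easy to test the policy without touching a live service.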

MDR: Example 2
The NOW is “slow.”
■ Gather data: load, process info, CPU info
■ Analyze data: bounds on expected values
■ Notify administrator: fileserver overloaded.
■ Visualize data: nfsd’s are overloaded.
■ Repair: admin moves data, adds disks, or starts more nfsd’s
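The "bounds on expected values" step can be illustrated with a simple statistical check. The sketch below assumes bounds of the mean plus or minus `k` standard deviations over a history of samples; the choice of `k = 3` and the function names are illustrative assumptions, not taken from the talk.

```python
import statistics


def learn_bounds(history, k=3.0):
    """Derive (low, high) bounds from past samples of one statistic."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    return (mean - k * spread, mean + k * spread)


def out_of_bounds(value, bounds):
    """True if a new sample falls outside the learned bounds."""
    low, high = bounds
    return value < low or value > high


# Hypothetical load-average history for a file server.
load_bounds = learn_bounds([1.0, 2.0, 3.0])
print(out_of_bounds(2.5, load_bounds), out_of_bounds(12.0, load_bounds))
```

A value flagged as out of bounds would then be passed to the notification step, for example as "fileserver overloaded".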

MDR: Example 3
User running program
■ Gather: user statistics, CPU, disk
■ Visualize: spending too much time waiting on remote accesses
(User fixes program; gathering and visualization repeated)
■ Analyze: some nodes have less throughput
■ Visualize: those have other jobs running on them
■ Repair: user is benchmarking, so kills all extraneous processes

MDR: Example 4
Web server load increasing beyond capacity
■ Gather: CPU, request rate, reply latency
■ Analyze: Burst lengths getting longer, latency increasing
■ Visualize: Graph of burst lengths & CPU usage over time
■ Repair: Order more machines, install load balancer
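One way to make "burst lengths getting longer" concrete is to measure runs of consecutive samples above a threshold and compare early runs with late ones. This is a sketch under assumptions: the threshold, the sample data, and the half-versus-half comparison are illustrative choices, not the talk's method.

```python
def burst_lengths(request_rates, threshold):
    """Lengths of consecutive runs of samples strictly above the threshold."""
    lengths = []
    run = 0
    for rate in request_rates:
        if rate > threshold:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:  # a burst still in progress at the end of the data
        lengths.append(run)
    return lengths


def bursts_growing(lengths):
    """Compare mean burst length in the later half against the earlier half."""
    if len(lengths) < 2:
        return False
    half = len(lengths) // 2
    early = sum(lengths[:half]) / half
    late = sum(lengths[half:]) / (len(lengths) - half)
    return late > early


# Hypothetical requests-per-second samples.
rates = [10, 90, 20, 95, 96, 15, 91, 92, 93, 12, 99, 98, 97, 96]
print(burst_lengths(rates, threshold=80), bursts_growing(burst_lengths(rates, 80)))
```

A long-term trend like this is a capacity signal rather than a transient, which is why the repair on the slide is ordering machines rather than restarting anything.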

MDR: Fundamental Requirements
• Gathering
• Flexible data gathering, self-describing storage
• Analyzing
• Calculate statistical measures, identify relevant statistics.
• Notifying
• Flexible infrequent messages to administrators or users
• Visualizing
• Maximize information/pixel, support multiple interfaces
• Repairing
• Automate simple repairs, support group operations

MDR: Environmental Constraints
■ Change is inherent
– Lack of Web/Mbone 5 years ago; now most/many have these.
■ Problems on many time-scales
– Second-minute transients vs. week-month capacity problems
■ Must operate under very adverse conditions
– Often used when system is broken
– Would like at least post-mortem analysis
■ Need to handle hundreds to thousands of nodes
– Scalability: All sites are getting larger, possibly wide area
– Our system has 200 (NOW) to 2000 (Soda) nodes

MDR: Previous Systems
■ Many previous systems: I’ve looked at about 16.
■ Not comprehensive, not extensible.
■ Look at a few that did a nice job of a piece:
■ [Fink97] - Run test, notify display engine
+ Easy to add tests
+ Selectivity of notification good
– Tests are just programs (redo gathering)
– Central, non-fault tolerant solution
– Many hard coded constants

MDR: Previous Systems, cont.
■ [Hard92] - buzzerd: Pager notification system
+ Flexible rules for notification
+ External interface for adding notify requests
– Simplistic gathering
– Poor fault tolerance
■ [Pier96] - Igor group fixes
+ Flexible operations
+ Nice reporting of success/failure
– Weak security, runs as root
– No delegation of responsibility

MDR: Six Key Innovations (1-3)
■ Replicated, semi-hierarchical data storage nodes
– Rendezvous point for programs
– Handles scaling and fault-tolerance
■ Self-describing structures
– Functions (visualize, summarize) + data go in database (OO)
– DB has machine and human readable descriptions of data
■ End-to-end notification
– Detect problems in MDR system
– Guarantee important messages get to users

MDR: Six Key Innovations (4-6)
■ Aggregation and High Resolution Color Displays
– Reduce information to manageable amounts
– Maximize information per unit area
■ Partially self-configuring
– Learn averages, deviations, burst sizes
– Learn which values are relevant to problems
■ Secure, user-specified group repairs
– Don’t enable malicious attacks
– Automate repairs of many machines

MDR: Architecture
[Architecture diagram; component labels:]
■ SQL-based Data Repository
■ Gather Agent (vmstat thread, ping thread, tcpdump thread)
■ Diagnostic Console
■ E-mail or Phone Notifier
■ Long-term graphing
■ Tolerance, Relevance Learner
■ Daemon Restarter
■ Aggregation Engine

MDR-Arch: Derivations
[Architecture diagram; component labels:]
■ SQL-based Data Repository
■ Diagnostic Console
■ E-mail or Phone Notifier
■ Tolerance, Relevance Learner
■ Daemon Restarter

Key: Semi-Hier. DBs.
■ Fault tolerance
■ Scalability:
– Caches don’t need to commit to disk; authoritative copy elsewhere.
– Batching updates over wide area links.
[Hierarchy diagram: two top level caches, three mid level caches, five per-node databases]
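The two scalability points (memory-only caches and batched updates) can be sketched with a toy cache node. Everything here is an illustrative assumption: the `CacheNode` class, its method names, and the fixed batch size are not from the talk, and replication between caches at the same level is left out.

```python
class CacheNode:
    """In-memory cache that forwards updates to its parent in batches."""

    def __init__(self, parent=None, batch_size=3):
        self.parent = parent
        self.batch_size = batch_size
        self.data = {}             # (node, statistic) -> latest value; never written to disk
        self.pending = []          # updates not yet forwarded to the parent
        self.batches_received = 0  # how many batches children have sent us

    def update(self, node, statistic, value):
        self.data[(node, statistic)] = value
        if self.parent is not None:
            self.pending.append((node, statistic, value))
            if len(self.pending) >= self.batch_size:
                self.flush()

    def receive_batch(self, batch):
        self.batches_received += 1
        for node, statistic, value in batch:
            self.update(node, statistic, value)

    def flush(self):
        """Send all pending updates upward as one message."""
        if self.parent is not None and self.pending:
            batch, self.pending = self.pending, []
            self.parent.receive_batch(batch)


top = CacheNode()
mid = CacheNode(parent=top, batch_size=3)
for name, load in [("n1", 0.5), ("n2", 0.7), ("n3", 0.9)]:
    mid.update(name, "load", load)
print(top.batches_received, len(top.data))
```

Because the per-node databases hold the authoritative copy, a cache that crashes can be rebuilt from below, which is why it can skip disk commits.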

Key: Self-Describing
■ De-couple data gathering, data storage, and data use
■ Self-Describing for Humans
– Descriptions of meanings of values stored with tables
– Description of methods of gathering stored with tables
– Column names help with self-description
■ Self-Describing for Computers
– Functions for visualizing or summarizing data
– Indication of resource selection from resource statistics
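A minimal sketch of self-describing storage, using Python's built-in `sqlite3` as a stand-in for the SQL-based repository. The schema is an assumption made for illustration: the table names, the `column_description` layout, and storing the name of an SQL aggregate as the "summarize" function are not from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cpu_load (node TEXT, sampled_at INTEGER, load_1min REAL);
CREATE TABLE column_description (
    table_name TEXT, column_name TEXT,
    meaning TEXT,          -- for humans: what the value means
    gathered_by TEXT,      -- for humans: how it was gathered
    summarize_with TEXT    -- for programs: which aggregate summarizes it
);
""")
conn.execute(
    "INSERT INTO column_description VALUES (?, ?, ?, ?, ?)",
    ("cpu_load", "load_1min", "1-minute load average", "vmstat thread", "AVG"),
)
conn.executemany(
    "INSERT INTO cpu_load VALUES (?, ?, ?)",
    [("n1", 100, 0.5), ("n2", 100, 1.5)],
)

ALLOWED_AGGREGATES = {"AVG", "MIN", "MAX"}


def summarize(conn, table_name, column_name):
    """Summarize a column using whatever function the database says to use."""
    row = conn.execute(
        "SELECT summarize_with FROM column_description "
        "WHERE table_name = ? AND column_name = ?",
        (table_name, column_name),
    ).fetchone()
    if row is None or row[0] not in ALLOWED_AGGREGATES:
        raise ValueError("no usable summary function recorded")
    # Identifiers cannot be bound as parameters; they are trusted here only
    # because they matched a row in column_description.
    query = f'SELECT {row[0]}("{column_name}") FROM "{table_name}"'
    return conn.execute(query).fetchone()[0]


print(summarize(conn, "cpu_load", "load_1min"))
```

A display or analysis tool written this way needs no built-in knowledge of `cpu_load`: it reads the description row and acts on it, which is the de-coupling the slide asks for.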

Key: End-to-End Notification
Recall: System must operate under extreme conditions
■ Humans must validate that system is still working
– Standalone display can indicate timestamps, mark out-of-date data
– Wireless machine could intermittently contact notification system
– Pager could be automatically paged every so often
■ Problems should be propagated to end users.
– Flexible notification: connected systems, e-mail, pager.
– Limit over-notification
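Two of these ideas, marking out-of-date data and limiting over-notification, can be sketched as follows. The class and function names, the 120-second staleness limit, and the per-problem 600-second quiet period are illustrative assumptions rather than details from the talk.

```python
def is_stale(sample_time, now, max_age=120.0):
    """True if a displayed value is old enough to be marked out of date."""
    return now - sample_time > max_age


class Notifier:
    """Suppresses repeat notifications for the same problem within a quiet period."""

    def __init__(self, min_interval=600.0):
        self.min_interval = min_interval
        self.last_sent = {}   # problem key -> time of last notification
        self.sent = []        # (time, message) pairs actually delivered

    def notify(self, key, message, now):
        last = self.last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            return False      # too soon; do not page again
        self.last_sent[key] = now
        self.sent.append((now, message))
        return True


notifier = Notifier()
print(notifier.notify("fileserver", "fileserver overloaded", now=0.0))
print(notifier.notify("fileserver", "fileserver overloaded", now=30.0))
```

The staleness check is what lets a human notice that the monitoring system itself has stopped updating, which is the end-to-end part of the argument.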

Key: Aggregation & HiRes
■ System target has hundreds to thousands of nodes
■ Aggregate by showing out of bounds, relevant values (via automatic tuning)
■ Also want overview of system
– Aggregate across similar statistics; show value (fill) & dispersion (shade)
– Use color to highlight important values.
– Aggregate across values (machine utilization = CPU + disk + memory)
– Maximize data/pixel [Tufte]
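The two kinds of aggregation can be sketched numerically. The sketch assumes the "value" is the mean across nodes and the "dispersion" is the population standard deviation; dividing the combined utilization by three to keep it in the 0..1 range is also an assumption, since the slide gives only the sum.

```python
import statistics


def aggregate(values):
    """Aggregate one statistic across nodes: (fill, shade) = (mean, dispersion)."""
    return statistics.fmean(values), statistics.pstdev(values)


def machine_utilization(cpu, disk, memory):
    """Combine per-resource utilizations (each 0..1) into one number.

    The slide gives CPU + disk + memory; the division by 3 is an
    assumption made here so the result stays between 0 and 1.
    """
    return (cpu + disk + memory) / 3


fill, shade = aggregate([0.2, 0.4, 0.6])   # hypothetical CPU utilizations of three nodes
print(fill, shade, machine_utilization(0.9, 0.3, 0.6))
```

With hundreds of nodes, one filled and shaded cell per group conveys both the typical value and how much the group disagrees, in far fewer pixels than one gauge per node.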

Key: Agg & HiRes: Snapshot

Key: Self-Configuring
■ Single statistics
– Phase 1: Calculate averages, standard deviations, burst sizes
– Worked in other systems [Jaco88, Karn91]
■ Identify relevant statistics
– Give system Boolean examples (variables out of bounds, and system working/not working); get function.
– Works for Boolean disjunctions in some cases:
• With lots of irrelevant variables [Litt89]
• With random bad examples [Sloa89]
• In some cases, with malicious bad examples [Ande94]
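Learning a Boolean disjunction from examples when most variables are irrelevant can be illustrated with a Winnow-style learner. That this is the algorithm behind the [Litt89] citation is an assumption; the sketch is a generic elimination variant of Winnow, and the eight variables and the target "variable 0 or variable 3 is out of bounds" are made up for the example.

```python
from itertools import product


def winnow_train(examples, n_vars, max_passes=100):
    """Learn a monotone disjunction; returns (weights, predict)."""
    weights = [1.0] * n_vars
    threshold = float(n_vars)

    def predict(x):
        # Predict "system not working" when active weights reach the threshold.
        return sum(w for w, bit in zip(weights, x) if bit) >= threshold

    for _ in range(max_passes):
        mistakes = 0
        for x, label in examples:
            if predict(x) == label:
                continue
            mistakes += 1
            for i, bit in enumerate(x):
                if bit:
                    # Missed a failure: double the active weights.
                    # False alarm: zero the active weights.
                    weights[i] = weights[i] * 2 if label else 0.0
        if mistakes == 0:
            break
    return weights, predict


# Each example: which of 8 statistics are out of bounds, and whether the
# system was broken. Hypothetical truth: broken iff statistic 0 or 3 is out.
examples = [(x, bool(x[0] or x[3])) for x in product((0, 1), repeat=8)]
weights, predict = winnow_train(examples, n_vars=8)
print(weights)
```

After training, statistics whose weights were zeroed can be treated as not relevant to this failure. The sketch assumes clean examples; the noisy cases on the slide ([Sloa89], [Ande94]) need more than this.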

Key: Secure Remote Actions
■ Security because of malicious attacks, benign errors
■ Delegation to remove SA from the loop
■ Independence from particular algorithms
– Building a library
– Program with principals (hosts, users), and properties (signed, sealed, verifiable)
■ Use secure, run-time extensible languages
■ Actions report through gathering system
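The idea of programming with principals and properties, rather than keys and algorithms, can be sketched with a tiny "signed action" helper. The talk does not name a mechanism; HMAC-SHA256 from Python's standard library is swapped in here purely for illustration, and `sign_action`, `verify_action`, and the request layout are hypothetical.

```python
import hashlib
import hmac
import json


def sign_action(principal, action, key):
    """Produce a request carrying the 'signed' property for a principal's action."""
    payload = json.dumps({"principal": principal, "action": action}, sort_keys=True)
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": tag}


def verify_action(request, key):
    """Return the decoded action if the signature checks out, else None."""
    expected = hmac.new(key, request["payload"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, request["signature"]):
        return None
    return json.loads(request["payload"])


key = b"shared-secret"  # placeholder key for the example only
request = sign_action("admin@host1", "restart httpd", key)
print(verify_action(request, key))
```

Callers only see "sign" and "verify"; the algorithm behind them could be replaced without changing the programs that request or perform repairs, which is the independence the slide describes.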

MDR: Testing Methodology
■ Fault injection
– Deliberately make the system slow
– Break hardware/software components
■ Feature comparison
– Paper comparison with other systems
■ Usage in practice
– Experience important to show system works
– We have need of administrative tools
■ Testimonials
– Experience at other sites lends credibility

MDR: Demo
■ Hierarchical structure working (1 level right now)
■ Alternative Interface
■ Fault Injection
■ Need for Aggregation
■ Crufty right now
■ Demo

Timeline: Key Pieces
1) (DBs) Replicated, semi-hierarchical, data storage nodes
2) (SDS) Self describing structures
3) (Vis) Aggregation and High Resolution Color Displays
4) (E2EN) End to end notification
5) (ReS) Automatic Restart
6) (Cfg) Partially self-configuring
7) (Rep) Secure, user-specified group repairs

Timeline
Time axis: June 1997, Dec 1997, June 1998, Dec 1998, Mar 1999
Deadlines: LISA 6/97, USENIX 12/97, OSDI 3/98, LISA 6/98, Graduation 12/98, SOSP 3/99
Work items (as listed on the chart):
■ Prototype 1,2,3 (DBs, SelfD, Vis)
■ Prototype 4,5 (Notify, Restart)
■ Prototype 6,7 (AConfig, Repair)
■ Experience with 1-7
■ Architecture of Complete System
■ Writing

Conclusion
■ Description of field shows breadth
■ Monitoring, diagnosing, and repairing shows depth
– Examples show importance of problem
– Fundamental goals & environmental constraints show understanding of problem
– Key innovations show differences from previous systems.
– Architecture and initial prototype show approach to problem
– Testing methods show ways to validate solution.
■ Timeline shows plan & milestones to graduation

Old Slides

Solutions
■ Managing stable storage
■ Supporting users
■ Simplifying security
■ Monitoring, diagnosing, and repairing

Managing Stable Storage
■ Consistency vs. availability
■ Fault tolerance
■ Scalability
■ Recoverability
■ Customization

Supporting Users
■ Automated help desk
– Searchable collection of questions
– Easy method for addition
■ Remote device access
■ Site-wide training

Goals: Environment
■ Uniform
– Supports user mobility by eliminating arbitrary changes
– Increases effectiveness by avoiding need for users to learn multiple interfaces
■ Customizable
– Handles special systems and special needs [firewalls, servers]
– Obviously reduces uniformity

Goals: Environment, cont.
■ High Performance
– Increases effectiveness of users [HCI/psych]
– Limited by cost-effectiveness
■ Available
– Effectiveness is 0 if system isn’t working
– Balanced against expense

Goals: Faults & Errors
■ Benign errors:
– Accidentally deleted files
– Unnoticed runaway processes
■ Malicious attacks:
– TCP SYN attack
– Sendmail bugs
– Data stealing
– False data injection

Goals: Users
■ Training
– Troubleshooting = one-on-one training
– Larger sessions = classes
■ Accounting
– Supports management, helps billing
■ Capacity Planning
– Expanding systems takes time
■ Legal
– Sensitive information needs protection

Simplifying Security
USENIX talk asks, “If cryptography is so great, why isn’t it used more?”
SAs worry about security to protect data.
■ Goal: Ease development of secure applications
■ Write programs using principals & properties rather than keys and algorithms
■ Unify various forms of available cryptography (public key, secret-key, PGP, Kerberos)
■ My use: protected, transferable rights to allow various actions
– Modify system configurations (add filesystems, printers)
– Kill/restart processes (runaway, after configurations modified)
– Access data (private logs, for backups, etc.)

Conclusion
■ System administration as area of research
– Description of field
– Areas for future research
• Managing stable storage
• Supporting users
■ Initial investigation of research area
– Monitoring, diagnosing, and repairing
• Broad, draws from many fields

 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 

Quals

  • 1. Monitoring, Diagnosing, and Repairing
      Eric Anderson, U.C. Berkeley
  • 2. Overview
      • What is System Administration?
        – What is the problem?
        – Goals of Dissertation Research
        – Goals of System Administration
      • Monitoring, diagnosing, and repairing
      • Dissertation Timeline
      • Conclusion
  • 3. What is the problem?
      • Problems occur in systems and result in loss of productivity
        – Server failures → denial of service
        – System overload → lower productivity
      • Cost is too high
        – Cost of ownership estimated at $5,000-$15,000/year/machine
        – Median salary (~50k) / (median # machines/admin) → $700
      • Our goal: Reduce cost by
        – Repairing problems faster (possibly automatically)
        – Handling more problems
  • 4. Goals of Dissertation Research
      • Describe field of System Administration
      • Monitoring, Diagnosing, and Repairing:
        – Approach: Synthesize solutions from other fields of research
          1) Detect previously ignored problems
          2) Automatic repair of some problems
          3) Reduce number of administrators needed
          4) Support users’ understanding of system
      • Apply here & distribute software
      • Thesis: Through our approach, we can achieve goals 1-4.
  • 5. Goals of System Administration
      Goal: Support cost-effective use of the computer environment
      More specifically (some non-technical):
      • Environment: uniform, customizable, high performance and available
      • Faults & errors: recovery from benign errors, protection from malicious attacks
      • Users: training, accounting & planning, legal
  • 6. Monitoring, Diagnosing, and Repairing (MDR)
      • Introductory examples
      • Fundamental requirements
      • Environmental constraints
      • Previous work
      • Six key innovations
      • Architecture
      • Details on innovations
      • Evaluation methodology
  • 7. MDR: Examples (Intro)
      • Four examples
          1) Broken component
          2) Resource overload (transient)
          3) Resource contention (user program)
          4) Resource exhaustion (long term)
      • Previous Solutions
        – Pay someone to watch
        – Ignore or wait for someone to complain
        – Specialized scripts (not general → vast repeated work)
  • 8. MDR: Example 1
      Web server has crashed/hung
      • Gather information: process existence, service uptime, restart times
      • Analyze data: process not responding, and hasn’t been recently restarted.
      • Automatic repair: restart daemon.
      • Notify administrator: had to restart daemon.
  • 9. MDR: Example 2
      The NOW is “slow.”
      • Gather data: load, process info, CPU info
      • Analyze data: bounds on expected values
      • Notify administrator: fileserver overloaded.
      • Visualize data: nfsd’s are overloaded.
      • Repair: admin moves data, adds disks, or starts more nfsd’s
  • 10. MDR: Example 3
      User running program
      • Gather: user statistics, CPU, disk
      • Visualize: spending too much time waiting on remote accesses
        (User fixes program; gathering and visualization repeated)
      • Analyze: some nodes have less throughput
      • Visualize: those have other jobs running on them
      • Repair: user is benchmarking so kills all extraneous processes
  • 11. MDR: Example 4
      Web server increasing beyond capacity
      • Gather: CPU, request rate, reply latency
      • Analyze: Burst lengths getting longer, latency increasing
      • Visualize: Graph of burst lengths & CPU usage over time
      • Repair: Order more machines, install load balancer
  • 12. MDR: Fundamental Requirements
      • Gathering
        – Flexible data gathering, self-describing storage
      • Analyzing
        – Calculate statistical measures, identify relevant statistics.
      • Notifying
        – Flexible infrequent messages to administrators or users
      • Visualizing
        – Maximize information/pixel, support multiple interfaces
      • Repairing
        – Automate simple repairs, support group operations
  • 13. MDR: Environmental Constraints
      • Change is inherent
        – Lack of Web/Mbone 5 years ago, now most/many have these.
      • Problems on many time-scales
        – Second-Minute transients vs. Week-Month capacity problems
      • Must operate under very adverse conditions
        – Often used when system is broken
        – Would like at least post-mortem analysis
      • Need to handle hundreds to thousands of nodes
        – Scalability: All sites are getting larger, possibly wide area
        – Our system has 200 (NOW) to 2000 (Soda) nodes
  • 14. MDR: Previous Systems
      • Many previous systems: I’ve looked at about 16.
      • Not comprehensive, not extensible.
      • Look at a few that did a nice job of a piece:
      • [Fink97]: Run test, notify display engine
        + Easy to add tests
        + Selectivity of notification good
        – Tests are just programs (redo gathering)
        – Central, non-fault tolerant solution
        – Many hard coded constants
  • 15. MDR: Previous Systems, cont.
      • [Hard92]: buzzerd, a pager notification system
        + Flexible rules for notification
        + External interface for adding notify requests
        – Simplistic gathering
        – Poor fault tolerance
      • [Pier96]: Igor group fixes
        + Flexible operations
        + Nice reporting of success/failure
        – Weak security, runs as root
        – No delegation of responsibility
  • 16. MDR: Six Key Innovations (1-3)
      • Replicated, semi-hierarchical, data storage nodes
        – Rendezvous point for programs
        – Handles scaling and fault-tolerance
      • Self describing structures
        – Functions (visualize, summarize) + data go in database (OO)
        – DB has machine and human readable descriptions of data
      • End to end notification
        – Detect problems in MDR system
        – Guarantee important messages get to users
  • 17. MDR: Six Key Innovations (4-6)
      • Aggregation and High Resolution Color Displays
        – Reduce information to manageable amounts
        – Maximize information per unit area
      • Partially self-configuring
        – Learn averages, deviations, burst sizes
        – Learn which values are relevant to problems
      • Secure, user-specified group repairs
        – Don’t enable malicious attacks
        – Automate repairs of many machines
  • 18. MDR: Architecture
      (Architecture diagram; components shown:)
      • SQL-based Data Repository
      • Gather Agent (vmstat thread, ping thread, tcpdump thread)
      • Diagnostic Console
      • E-mail or Phone Notifier
      • Long-term graphing
      • Tolerance, Relevance Learner
      • Daemon Restarter
      • Aggregation Engine
  • 19. MDR-Arch: Derivations
      (Architecture diagram; components shown:)
      • SQL-based Data Repository
      • Diagnostic Console
      • E-mail or Phone Notifier
      • Tolerance, Relevance Learner
      • Daemon Restarter
  • 20. Key: Semi-Hier. DBs.
      • Fault tolerance
      • Scalability:
        – Caches don’t need to commit to disk; authoritative copy elsewhere.
        – Batching updates over wide area links.
      (Diagram: two top-level caches, three mid-level caches, five per-node databases)
  • 21. Key: Self-Describing
      • De-couple data gathering, data storage, and data use
      • Self-Describing for Humans
        – Descriptions of meanings of values stored with tables
        – Description of methods of gathering stored with tables
        – Column names help with self
      • Self-Describing for Computers
        – Functions for visualizing or summarizing data
        – Indication of resource selection from resource statistics
  • 22. Key: End-to-End Notification
      Recall: System must operate under extreme conditions
      • Humans must validate that system is still working
        – Standalone display can indicate timestamps, mark out of date data
        – Wireless machine could intermittently contact notification system
        – Pager could be automatically paged every so often
      • Problems should be propagated to end users.
        – Flexible notification: connected systems, e-mail, pager.
        – Limit over-notification
  • 23. Key: Aggregation & HiRes
      • System target has hundreds to thousands of nodes
      • Aggregate by showing out of bounds, relevant values (via automatic tuning)
      • Also want overview of system
        – Aggregate across similar statistics; show value (fill) & dispersion (shade)
        – Use color to highlight important values.
        – Aggregate across values (machine utilization = CPU + disk + memory)
        – Maximize data/pixel [Tufte]
  • 24. Key: Agg & HiRes: Snapshot
  • 25. Key: Self-Configuring
      • Single statistics
        – Phase 1: Calculate averages, standard deviations, burst sizes
        – Worked in other systems [Jaco88, Karn91]
      • Identify relevant statistics
        – Give system Boolean examples (variables out of bounds, and system working/not working); get a function.
        – Works for Boolean disjunctions in some cases:
          • With lots of irrelevant variables [Litt89]
          • With random bad examples [Sloa89]
          • In some cases, with malicious bad examples [Ande94]
  • 26. Key: Secure Remote Actions
      • Security because of malicious attacks, benign errors
      • Delegation to remove SA from the loop
      • Independence from particular algorithms
        – Building a library
        – Program with principals (hosts, users), and properties (signed, sealed, verifiable)
      • Use secure, run-time extensible languages
      • Actions report through gathering system
  • 27. MDR: Testing Methodology
      • Fault injection
        – Deliberately make the system slow
        – Break hardware/software components
      • Feature comparison
        – Paper comparison with other systems
      • Usage in practice
        – Experience important to show system works
        – We have need of administrative tools
      • Testimonials
        – Experience at other sites lends credibility
  • 28. MDR: Demo
      • Hierarchical structure working (1 level right now)
      • Alternative Interface
      • Fault Injection
      • Need for Aggregation
      • Crufty right now
      • Demo
  • 29. Timeline: Key Pieces
          1) (DBs) Replicated, semi-hierarchical, data storage nodes
          2) (SDS) Self describing structures
          3) (Vis) Aggregation and High Resolution Color Displays
          4) (E2EN) End to end notification
          5) (ReS) Automatic Restart
          6) (Cfg) Partially self-configuring
          7) (Rep) Secure, user-specified group repairs
  • 30. Timeline
      (Timeline chart; time axis marked June 1997, Dec 1997, June 1998, Dec 1998, Mar 1999)
      • Deadlines: LISA 6/97, USENIX 12/97, OSDI 3/98, Graduation 12/98
      • Remaining chart labels, in source order: Prototype 1,2,3 (DBs, SelfD, Vis); Prototype 4,5 (Notify, Restart); Prototype 6,7 (AConfig, Repair); LISA 6/98; Experience with 1-7; SOSP 3/99; Architecture of Complete System; Writing
  • 31. Conclusion
      • Description of field shows breadth
      • Monitoring, diagnosing, and repairing shows depth
        – Examples show importance of problem
        – Fundamental goals & environmental constraints show understanding of problem
        – Key innovations show differences from previous systems.
        – Architecture and initial prototype show approach to problem
        – Testing methods show ways to validate solution.
      • Timeline shows plan & milestones to graduation
  • 33. Solutions
      • Managing stable storage
      • Supporting users
      • Simplifying security
      • Monitoring, diagnosing, and repairing
  • 34. Managing Stable Storage
      • Consistency vs. availability
      • Fault tolerance
      • Scalability
      • Recoverability
      • Customization
  • 35. Supporting Users
      • Automated help desk
        – Searchable collection of questions
        – Easy method for addition
      • Remote device access
      • Site-wide training
  • 36. Goals: Environment
      • Uniform
        – Supports user mobility by eliminating arbitrary changes
        – Increases effectiveness by avoiding need for users to learn multiple interfaces
      • Customizable
        – Handles special systems and special needs [firewalls, servers]
        – Obviously reduces uniformity
  • 37. Goals: Environment, cont.
      • High Performance
        – Increases effectiveness of users [HCI/psych]
        – Limited by cost-effectiveness
      • Available
        – Effectiveness is 0 if system isn’t working
        – Balanced against expense
  • 38. Goals: Faults & Errors
      • Benign errors:
        – Accidentally deleted files
        – Unnoticed runaway processes
      • Malicious attacks:
        – TCP SYN attack
        – Sendmail bugs
        – Data stealing
        – False data injection
  • 39. Goals: Users
      • Training
        – Troubleshooting = one-on-one training
        – Larger sessions = classes
      • Accounting
        – Supports management, helps billing
      • Capacity Planning
        – Expanding systems takes time
      • Legal
        – Sensitive information needs protection
  • 40. Simplifying Security
      USENIX talk says “If cryptography is so great, why isn’t it used more?”
      SAs worry about security to protect data.
      • Goal: Ease development of secure applications
      • Write programs using principals & properties rather than keys and algorithms
      • Unify various forms of available cryptography (public key, secret-key, PGP, Kerberos)
      • My use: protected, transferable rights to allow various actions
        – Modify system configurations (add filesystems, printers)
        – Kill/restart processes (runaway, after configurations modified)
        – Access data (private logs, for backups, etc.)
  • 41. Conclusion
      • System administration as area of research
        – Description of field
        – Areas for future research
          • Managing stable storage
          • Supporting users
      • Initial investigation of research area
        – Monitoring, diagnosing, and repairing
          • Broad, draws from many fields
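
Slide 8 (restart a hung daemon unless it was restarted recently) and phase 1 of slide 25 (learn averages and standard deviations for single statistics) can be sketched together. This is a minimal illustration in Python, not the MDR implementation: the names `ToleranceLearner` and `decide_repair`, the 3-standard-deviation bound, and the 300-second restart back-off are all assumptions introduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class ToleranceLearner:
    """Running mean and standard deviation of one statistic (Welford's method)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def observe(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self) -> float:
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def out_of_bounds(self, x: float, k: float = 3.0) -> bool:
        # Assumed rule: flag values more than k standard deviations from the mean.
        if self.n < 2:
            return False
        return abs(x - self.mean) > k * self.stddev()


def decide_repair(responding: bool, seconds_since_restart: float,
                  backoff: float = 300.0) -> str:
    """Slide 8's rule: restart a hung daemon unless it was restarted recently."""
    if responding:
        return "ok"
    if seconds_since_restart >= backoff:
        return "restart-and-notify"
    return "notify-only"  # repeated failures are left to a human
```

The back-off is what keeps the automatic repair from looping on a daemon that crashes immediately after every restart; in that case only the notification path fires.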
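
Slide 25's second step, learning which out-of-bounds variables are relevant to "system not working", amounts to learning a Boolean disjunction from labeled examples. One well-known algorithm for that setting when most variables are irrelevant is Littlestone's Winnow; whether it is the method the slide's citations refer to is an assumption. The sketch below uses a Winnow-style multiplicative update, and the function names, the threshold of `n_vars / 2`, and the factor `alpha = 2` are illustrative choices rather than values from the talk.

```python
def winnow_train(examples, n_vars, alpha=2.0):
    """Learn a monotone Boolean disjunction with Winnow-style updates.

    examples: iterable of (x, label); x is a sequence of 0/1 flags
    ("statistic i is out of bounds"), label is True when the system was broken.
    """
    weights = [1.0] * n_vars
    threshold = n_vars / 2.0
    for x, label in examples:
        predicted = sum(w for w, xi in zip(weights, x) if xi) >= threshold
        if predicted == label:
            continue
        # Promote active variables on a missed failure, demote them on a false alarm.
        factor = alpha if label else 1.0 / alpha
        weights = [w * factor if xi else w for w, xi in zip(weights, x)]
    return weights, threshold


def predict(weights, threshold, x):
    """Predict 'broken' when the weighted sum of active variables reaches the threshold."""
    return sum(w for w, xi in zip(weights, x) if xi) >= threshold
```

After training, variables with large weights are the ones the learner treats as relevant; weights are only changed on mistakes, which is why many irrelevant variables cost little.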
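
Slide 22 asks for two things from end-to-end notification: a way to see that the monitoring data itself is out of date, and a limit on over-notification. A minimal sketch of both follows; the class name `Notifier`, the function `is_stale`, and the default intervals are hypothetical and not taken from the talk.

```python
class Notifier:
    """Suppress repeats of the same alert within `min_interval` seconds."""

    def __init__(self, min_interval: float = 3600.0):
        self.min_interval = min_interval
        self._last_sent = {}  # alert key -> time the alert was last sent

    def should_send(self, key: str, now: float) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            return False
        self._last_sent[key] = now
        return True


def is_stale(last_update: float, now: float, max_age: float = 120.0) -> bool:
    """True when data is too old to trust, e.g. because the monitor itself is down."""
    return now - last_update > max_age
```

A display that greys out values for which `is_stale` is true lets a human notice that the monitoring system has stopped, which is the validation step the slide describes.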

Editor's Notes

  1. Key idea: None. Introduction slide.
  2. Key idea: Two contributions: system administration as a field of research, and initial work in the field that produces results which substantially improve the state of the art.
  3. Key idea: Has properties of “real” research: separation of concerns, important contributions, and a strategy for measuring effectiveness.