Quals

Monitoring, Diagnosing, and Repairing
Eric Anderson
U.C. Berkeley

2Jun 30, 2013
Overview
 What is System Administration?
– What is the problem?
– Goals of Dissertation Research
– Goals of System Administration
 Monitoring, diagnosing, and repairing
 Dissertation Timeline
 Conclusion

What is the problem?
 Problems occur in systems, and result in loss of
productivity
– Server failures → denial of service
– System overload → lower productivity
 Cost is too high
– Cost of ownership estimated at $5,000-$15,000/year/machine
– Median salary (~$50k) / (median # machines/admin) ≈ $700/year/machine
 Our goal: Reduce cost by
– Repairing problems faster (possibly automatically)
– Handling more problems
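
The per-machine figure on this slide is a simple division, and it can be inverted to see what administrator-to-machine ratio it implies. The sketch below only restates the slide's numbers; the function names are hypothetical.

```python
def admin_cost_per_machine(salary: float, machines_per_admin: float) -> float:
    """Yearly administrator salary spread over the machines one admin handles."""
    return salary / machines_per_admin


def implied_machines_per_admin(salary: float, cost_per_machine: float) -> float:
    """Invert the relation: machines per admin implied by a per-machine cost."""
    return salary / cost_per_machine


if __name__ == "__main__":
    salary = 50_000      # approximate median salary from the slide
    per_machine = 700    # per-machine figure from the slide
    print(round(implied_machines_per_admin(salary, per_machine)), "machines/admin")
    # Salary share of the estimated $5,000-$15,000 yearly cost of ownership
    for total in (5_000, 15_000):
        print(f"{per_machine / total:.1%} of ${total:,}")
```

Under these numbers, salary accounts for only a small share of the estimated cost of ownership, which is consistent with the slide's emphasis on lost productivity.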

Goals of Dissertation Research
 Describe field of System Administration
 Monitoring, Diagnosing, and Repairing:
– Approach: Synthesize solutions from other fields of research
1) Detect previously ignored problems
2) Automatic repair of some problems
3) Reduce number of administrators needed
4) Support users’ understanding of system
 Apply here & distribute software
 Thesis: Through our approach, we can achieve
goals 1-4.

Goals of System Administration
Goal: Support cost-effective use of the computer
environment
More specifically (some non-technical):
Environment: uniform, customizable, high performance and
available
Faults & errors: recovery from benign errors, protection from
malicious attacks
Users: training, accounting & planning, legal

Monitoring, Diagnosing, and Repairing (MDR)
• Introductory examples
• Fundamental requirements
• Environmental constraints
• Previous work
• Six key innovations
• Architecture
• Details on innovations
• Evaluation methodology

MDR: Examples - Intro
 Four examples
1) Broken component
2) Resource overload — transient
3) Resource contention — user program
4) Resource exhaustion — long term
 Previous Solutions
– Pay someone to watch
– Ignore or wait for someone to complain
– Specialized scripts (not general → vast repeated work)

MDR: Example 1
Web server has crashed/hung
 Gather information: process existence, service
uptime, restart times
 Analyze data: process not responding, and hasn’t
been recently restarted.
 Automatic repair: restart daemon.
 Notify administrator: had to restart daemon.
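
The analysis step in this example is a small decision rule: restart only when the daemon is failing and has not been restarted recently. A minimal sketch follows; the `DaemonStatus` fields, the 300-second gap, and the "escalate" outcome are assumptions for illustration, not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class DaemonStatus:
    process_exists: bool
    responding: bool
    seconds_since_restart: float


def choose_action(status: DaemonStatus, min_restart_gap: float = 300.0) -> str:
    """Return 'ok', 'restart', or 'escalate' for one gathered status sample."""
    if status.process_exists and status.responding:
        return "ok"
    if status.seconds_since_restart >= min_restart_gap:
        # Failing and not recently restarted: restart, then notify the admin.
        return "restart"
    # Restarted recently and still failing: assumed to need a human.
    return "escalate"
```

Checking the restart time keeps an automatic repair from looping on a daemon that crashes immediately after every restart.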

MDR: Example 2
The NOW is “slow.”
 Gather data: load, process info, CPU info
 Analyze data: bounds on expected values
 Notify administrator: fileserver overloaded.
 Visualize data: nfsd’s are overloaded.
 Repair: admin moves data, adds disks, or starts
more nfsd’s
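
"Bounds on expected values" can be sketched as a range learned from history, with current readings compared against it. The mean plus or minus k standard deviations rule and the statistic names used in the usage note are assumptions for illustration.

```python
from statistics import mean, stdev


def learn_bounds(history, k=3.0):
    """Expected range for one statistic: mean +/- k sample standard deviations."""
    m, s = mean(history), stdev(history)
    return (m - k * s, m + k * s)


def out_of_bounds(current, bounds):
    """Return the statistics whose current value falls outside its learned range."""
    return {
        name: value
        for name, value in current.items()
        if name in bounds and not (bounds[name][0] <= value <= bounds[name][1])
    }
```

For instance, with `bounds = {"load": learn_bounds([1.0, 1.2, 0.9, 1.1, 1.0])}`, a current load of 8.0 would be flagged and could trigger the notification step.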

MDR: Example 3
User running program
 Gather: user statistics, CPU, disk
 Visualize: spending too much time waiting on remote
accesses
(User fixes program, gathering, visualization repeated)
 Analyze: some nodes have less throughput
 Visualize: those have other jobs running on them
 Repair: user is benchmarking so kills all extraneous
processes

MDR: Example 4
Web server load increasing beyond capacity
 Gather: CPU, request rate, reply latency
 Analyze: Burst lengths getting longer, latency
increasing
 Visualize: Graph of burst lengths & CPU usage over
time
 Repair: Order more machines, install load balancer
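
Long-term exhaustion shows up as a trend rather than a single bad reading. One way to sketch the analysis is a least-squares line through a statistic over time, extrapolated to a capacity limit. The choice of a linear fit and the function names are assumptions; the talk does not say which trend analysis it uses.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def time_until_capacity(xs, ys, capacity):
    """Time units after the last sample until the fitted line reaches capacity.

    Returns None when the trend is flat or decreasing.
    """
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    fitted_now = intercept + slope * xs[-1]
    return (capacity - fitted_now) / slope
```

A positive result gives the administrator lead time to order machines before the limit is reached.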

MDR: Fundamental Requirements
• Gathering
• Flexible data gathering, self-describing storage
• Analyzing
• Calculate statistical measures, identify relevant statistics.
• Notifying
• Flexible infrequent messages to administrators or users
• Visualizing
• Maximize information/pixel, support multiple interfaces
• Repairing
• Automate simple repairs, support group operations

MDR: Environmental Constraints
 Change is inherent
– No Web or MBone 5 years ago; now most sites have them.
 Problems on many time-scales
– Second-Minute transients vs. Week-Month capacity problems
 Must operate under very adverse conditions
– Often used when system is broken
– Would like at least post-mortem analysis
 Need to handle hundreds – thousands of nodes
– Scalability: All sites are getting larger, possibly wide area
– Our system has 200 (NOW) – 2000 (Soda) nodes

MDR: Previous Systems
 Many previous systems: I’ve looked at about 16.
 Not comprehensive, not extensible.
 Look at a few that did a nice job of a piece:
 [Fink97] — Run test, notify display engine
+ Easy to add tests
+ Selectivity of notification good
– Tests are just programs (redo gathering)
– Central, non-fault tolerant solution
– Many hard coded constants

MDR: Previous Systems, cont.
 [Hard92] — buzzerd: Pager notification system
+ Flexible rules for notification
+ External interface for adding notify requests
– Simplistic gathering
– Poor fault tolerance
 [Pier96] — Igor group fixes
+ Flexible operations
+ Nice reporting of success/failure
– Weak security, runs as root
– No delegation of responsibility

MDR: Six Key Innovations (1-3)
 Replicated, semi-hierarchical, data storage nodes
– Rendezvous point for programs
– Handles scaling and fault-tolerance
 Self describing structures
– Functions (visualize, summarize) + data go in database
(OO)
– DB has machine and human readable descriptions of data
 End to end notification
– Detect problems in MDR system
– Guarantee important messages get to users

MDR: Six Key Innovations (4-6)
 Aggregation and High Resolution Color Displays
– Reduce information to manageable amounts
– Maximize information per unit area
 Partially self-configuring
– Learn averages, deviations, burst sizes
– Learn which values are relevant to problems
 Secure, user-specified group repairs
– Don’t enable malicious attacks
– Automate repairs of many machines

MDR: Architecture
[Figure: architecture diagram] Components:
 SQL-based Data Repository
 Gather Agent
– vmstat thread
– ping thread
– tcpdump thread
 Diagnostic Console
 E-mail or Phone Notifier
 Long-term graphing
 Tolerance, Relevance Learner
 Daemon Restarter
 Aggregation Engine

MDR-Arch: Derivations
[Figure: architecture diagram] Components:
 SQL-based Data Repository
 Diagnostic Console
 E-mail or Phone Notifier
 Tolerance, Relevance Learner
 Daemon Restarter

Key: Semi-Hier. DBs.
 Fault tolerance
 Scalability:
– Caches don’t need to commit to disk — authoritative copy
elsewhere.
– Batching updates over wide area links.
[Figure: storage hierarchy] Three levels, top to bottom:
 Top level caches (two shown)
 Mid level caches (three shown)
 Per-node databases (five shown)
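
The batching idea can be sketched with a node that forwards rows to its parent only once a batch has accumulated. This is a toy model under stated assumptions: the `StoreNode` class is hypothetical, rows live in memory only, and replication and fault tolerance are not modeled.

```python
class StoreNode:
    """A storage node: leaves are per-node databases, inner nodes are caches."""

    def __init__(self, name, parent=None, batch_size=3):
        self.name = name
        self.parent = parent
        self.batch_size = batch_size
        self.rows = []      # every row this node has seen
        self.pending = []   # rows not yet forwarded to the parent

    def record(self, row):
        self.rows.append(row)
        if self.parent is not None:
            self.pending.append(row)
            if len(self.pending) >= self.batch_size:
                self.flush()

    def flush(self):
        """Send all pending rows upward as one batch."""
        if self.parent is None or not self.pending:
            return
        batch, self.pending = self.pending, []
        for row in batch:
            self.parent.record(row)
```

A larger `batch_size` on nodes that sit behind wide-area links trades freshness at the upper levels for fewer transfers.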

Key: Self-Describing
 De-couple data gathering, data storage, and data use
 Self-Describing for Humans
– Descriptions of meanings of values stored with tables
– Description of methods of gathering stored with tables
– Column names help with self-description
 Self-Describing for Computers
– Functions for visualizing or summarizing data
– Indication of resource selection from resource statistics
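
A self-describing table can be sketched as data stored together with its human-readable descriptions and a machine-usable summary function. The class, field names, and helper below are hypothetical; the talk stores these in an SQL-based repository, which this in-memory sketch does not attempt to reproduce.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class SelfDescribingTable:
    name: str
    gathered_by: str                           # how the data is collected
    columns: Dict[str, str]                    # column name -> meaning, for humans
    summarize: Callable[[List[dict]], dict]    # summary function, for programs
    rows: List[dict] = field(default_factory=list)

    def describe(self) -> str:
        lines = [f"{self.name} (gathered by: {self.gathered_by})"]
        lines += [f"  {col}: {text}" for col, text in self.columns.items()]
        return "\n".join(lines)


def mean_of(column: str) -> Callable[[List[dict]], dict]:
    """Build a summarizer that averages one column."""
    return lambda rows: {column: mean(r[column] for r in rows)}
```

Because the description and the summarizer travel with the table, a display or analysis tool can use data from a gatherer it has never seen before.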

Key: End-to-End Notification
Recall: System must operate under extreme conditions
 Humans must validate that system is still working
– Standalone display can indicate timestamps, mark out of
date data
– Wireless machine could intermittently contact notification
system
– Pager could be automatically paged every so often
 Problems should be propagated to end users.
– Flexible notification — connected systems, e-mail, pager.
– Limit over-notification
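
Two of the points above lend themselves to a short sketch: marking data as out of date from its timestamp, and limiting repeated notifications about the same problem. The class name, the 600-second gap, and the 120-second maximum age are assumptions for illustration.

```python
def is_stale(last_update: float, now: float, max_age: float = 120.0) -> bool:
    """True when data is older than max_age seconds and should be marked out of date."""
    return now - last_update > max_age


class Notifier:
    """Suppress repeat notifications for the same problem within min_gap seconds."""

    def __init__(self, min_gap: float = 600.0):
        self.min_gap = min_gap
        self.last_sent = {}   # problem key -> time of last notification
        self.sent = []        # (time, key, message) actually delivered

    def notify(self, key: str, message: str, now: float) -> bool:
        last = self.last_sent.get(key)
        if last is not None and now - last < self.min_gap:
            return False      # too soon, drop to avoid over-notification
        self.last_sent[key] = now
        self.sent.append((now, key, message))
        return True
```

Passing `now` explicitly keeps the sketch deterministic; a real notifier would read the clock and hand the message to e-mail or a pager.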

Key: Aggregation & HiRes
 System target has hundreds – thousands of nodes
 Aggregate by showing out of bounds, relevant values
(via automatic tuning)
 Also want overview of system
– Aggregate across similar statistics; show value (fill) &
dispersion (shade)
– Use color to highlight important values.
– Aggregate across values (machine utilization = CPU + disk +
memory)
– Maximize data/pixel [Tufte]
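
Aggregating across similar statistics reduces many per-node readings to one value and one dispersion per statistic, which a display could then map to fill and shade. The sketch below assumes mean and population standard deviation as the two measures, and takes the slide's "CPU + disk + memory" literally as a sum; the function names are hypothetical.

```python
from statistics import mean, pstdev


def aggregate(per_node):
    """per_node: {node: {stat: value}} -> {stat: (mean, dispersion)}."""
    by_stat = {}
    for values in per_node.values():
        for stat, value in values.items():
            by_stat.setdefault(stat, []).append(value)
    return {stat: (mean(vs), pstdev(vs)) for stat, vs in by_stat.items()}


def utilization(values):
    """Combine several statistics for one machine into a single number."""
    return values["cpu"] + values["disk"] + values["memory"]
```

With hundreds of nodes, the display then needs one cell per statistic instead of one per node per statistic.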

Key: Agg & HiRes: Snapshot

Key: Self-Configuring
 Single statistics
– Phase 1: Calculate averages, standard deviations, burst
sizes
– Worked in other systems [Jaco88, Karn91]
 Identify relevant statistics
– Give the system Boolean examples (variables out of bounds,
and system working/not working); get back a function.
– Works for Boolean disjunctions in some cases:
• With lots of irrelevant variables [Litt89]
• With random bad examples [Sloa89]
• In some cases, with malicious bad examples [Ande94]
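
Learning which statistics are relevant can be framed as learning a Boolean disjunction: each input bit says "statistic i is out of bounds" and the label says "system not working". The sketch below uses the Winnow algorithm (elimination variant), a well-known learner that tolerates many irrelevant variables. Whether this is the specific method meant by [Litt89] is an assumption, and the noise-tolerant variants mentioned above are not shown.

```python
def predict(weights, x, threshold):
    """Predict 'not working' when the weights of out-of-bounds variables reach the threshold."""
    return sum(w for w, xi in zip(weights, x) if xi) >= threshold


def train_winnow(examples, n, max_passes=100):
    """Learn a monotone disjunction over n Boolean variables from noise-free examples.

    examples: list of (x, label), x a tuple of 0/1, label True when the system is broken.
    """
    weights = [1.0] * n
    threshold = float(n)
    for _ in range(max_passes):
        mistakes = 0
        for x, label in examples:
            if predict(weights, x, threshold) == label:
                continue
            mistakes += 1
            for i, xi in enumerate(x):
                if xi:
                    # Missed a failure: double active weights.
                    # False alarm: eliminate active weights.
                    weights[i] = weights[i] * 2 if label else 0.0
        if mistakes == 0:
            break
    return weights, threshold
```

After training, variables whose weight reaches the threshold are the ones the learner treats as relevant to failures.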

Key: Secure Remote Actions
 Security because of malicious attacks, benign errors
 Delegation to remove SA from the loop
 Independence from particular algorithms
– Building a library
– Program with principals (hosts, users), and properties
(signed, sealed, verifiable)
 Use secure, run-time extensible languages
 Actions report through gathering system

MDR: Testing Methodology
 Fault injection
– Deliberately make the system slow
– Break hardware/software components
 Feature comparison
– Paper comparison with other systems
 Usage in practice
– Experience important to show system works
– We have need of administrative tools
 Testimonials
– Experience at other sites lends credibility

MDR: Demo
 Hierarchical structure working (1 level right now)
 Alternative Interface
 Fault Injection
 Need for Aggregation
 Crufty right now
 Demo

Timeline: Key Pieces
1) (DBs) Replicated, semi-hierarchical, data storage nodes
2) (SDS) Self describing structures
3) (Vis) Aggregation and High Resolution Color Displays
4) (E2EN) End to end notification
5) (ReS) Automatic Restart
6) (Cfg) Partially self-configuring
7) (Rep) Secure, user-specified group repairs

Timeline
[Figure: timeline chart]
 Time axis: June 1997, Dec 1997, June 1998, Dec 1998, Mar 1999
 Deadlines: LISA 6/97, USENIX 12/97, OSDI 3/98, LISA 6/98, Graduation 12/98, SOSP 3/99
 Milestones:
– Prototype 1, 2, 3 (DBs, SelfD, Vis)
– Prototype 4, 5 (Notify, Restart)
– Prototype 6, 7 (AConfig, Repair)
– Experience with 1-7
– Architecture of Complete System
– Writing

Conclusion
 Description of field shows breadth
 Monitoring, diagnosing, and repairing shows depth
– Examples show importance of problem
– Fundamental goals & environmental constraints show
understanding of problem
– Key innovations show differences from previous systems.
– Architecture and initial prototype show approach to problem
– Testing methods show ways to validate solution.
 Timeline shows plan & milestones to graduation

Solutions
 Managing stable storage
 Supporting users
 Simplifying security
 Monitoring, diagnosing, and repairing

Managing Stable Storage
 Consistency vs. availability
 Fault tolerance
 Scalability
 Recoverability
 Customization

Supporting Users
 Automated help desk
– Searchable collection of questions
– Easy method for addition
 Remote device access
 Site-wide training

Goals: Environment
 Uniform
– Supports user mobility by eliminating arbitrary changes
– Increases effectiveness by avoiding need for users to learn multiple
interfaces
 Customizable
– Handles special systems and special needs [firewalls, servers]
– Obviously reduces uniformity

Goals: Environment, cont.
 High Performance
– Increases effectiveness of users [HCI/psych]
– Limited by cost-effectiveness
 Available
– Effectiveness is 0 if system isn’t working
– Balanced against expense

Goals: Faults & Errors
 Benign errors:
– Accidentally deleted files
– Unnoticed runaway processes
 Malicious attacks:
– TCP SYN attack
– Sendmail bugs
– Data stealing
– False data injection

Goals: Users
 Training
– Troubleshooting = one-on-one training
– Larger sessions = classes
 Accounting
– Supports management, helps billing
 Capacity Planning
– Expanding systems takes time
 Legal
– Sensitive information needs protection

Simplifying Security
USENIX talk says “If cryptography is so great, why isn’t it used more?”
SAs worry about security to protect data.
 Goal: Ease development of secure applications
 Write programs using principals & properties rather than keys and algorithms
 Unify various forms of available cryptography (public key, secret-key, PGP,
Kerberos)
 My use: protected, transferable rights to allow various actions
– Modify system configurations (add filesystems, printers)
– Kill/restart processes (runaway, after configurations modified)
– Access data (private logs, for backups, etc.)

Conclusion
 System administration as area of research
– Description of field
– Areas for future research
• Managing stable storage
• Supporting users
 Initial investigation of research area
– Monitoring, diagnosing, and repairing
• Broad, draws from many fields
