Monitoring, Diagnosing, and Repairing
Eric Anderson
U.C. Berkeley
Jun 30, 2013

Overview
• What is System Administration?
  – What is the problem?
  – Goals of Dissertation Research
  – Goals of System Administration
• Monitoring, diagnosing, and repairing
• Dissertation Timeline
• Conclusion

What is the problem?
• Problems occur in systems and result in loss of productivity
  – Server failures → denial of service
  – System overload → lower productivity
• Cost is too high
  – Cost of ownership estimated at $5,000-$15,000/year/machine
  – Median salary (~$50k) / (median # machines/admin) ≈ $700
• Our goal: Reduce cost by
  – Repairing problems faster (possibly automatically)
  – Handling more problems
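
The salary figure on this slide can be checked with a little arithmetic. The slide does not state the median number of machines per administrator, so the sketch below backs it out from the two numbers that are given (about 71 machines). Treat that count as an inference, not a figure from the talk; the function names are illustrative.

```python
def per_machine_admin_cost(median_salary, machines_per_admin):
    """Administrator salary spread over the machines one administrator handles."""
    return median_salary / machines_per_admin


def implied_machines_per_admin(median_salary, per_machine_cost):
    """Invert the formula to see what machine count the slide's figures imply."""
    return median_salary / per_machine_cost


if __name__ == "__main__":
    machines = implied_machines_per_admin(50_000, 700)
    print(f"implied machines per admin: {machines:.0f}")
    # Share of the quoted cost-of-ownership range that is administrator salary.
    for total in (5_000, 15_000):
        print(f"salary share of ${total}/year/machine: {700 / total:.0%}")
```

Under these figures, salary accounts for roughly 5% to 14% of the quoted cost of ownership.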

Goals of Dissertation Research
• Describe field of System Administration
• Monitoring, Diagnosing, and Repairing:
  – Approach: Synthesize solutions from other fields of research
  1) Detect previously ignored problems
  2) Automatic repair of some problems
  3) Reduce number of administrators needed
  4) Support users’ understanding of system
• Apply here & distribute software
• Thesis: Through our approach, we can achieve goals 1-4.

Goals of System Administration
Goal: Support cost-effective use of the computer environment
More specifically (some non-technical):
Environment: uniform, customizable, high performance and available
Faults & errors: recovery from benign errors, protection from malicious attacks
Users: training, accounting & planning, legal

Monitoring, Diagnosing, and Repairing (MDR)
• Introductory examples
• Fundamental requirements
• Environmental constraints
• Previous work
• Six key innovations
• Architecture
• Details on innovations
• Evaluation methodology

MDR: Examples (Intro)
• Four examples
  1) Broken component
  2) Resource overload (transient)
  3) Resource contention (user program)
  4) Resource exhaustion (long term)
• Previous Solutions
  – Pay someone to watch
  – Ignore or wait for someone to complain
  – Specialized scripts (not general → vast repeated work)

MDR: Example 1
Web server has crashed/hung
• Gather information: process existence, service uptime, restart times
• Analyze data: process not responding, and hasn’t been recently restarted.
• Automatic repair: restart daemon.
• Notify administrator: had to restart daemon.
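
The gather, analyze, repair, notify steps of this example can be sketched as a small decision function. This is a minimal sketch, not the system's implementation: `ServiceStatus`, `decide_repair`, and the 300-second restart interval are hypothetical, and the "escalate" branch (what to do when the daemon was restarted recently) is an assumption the slide does not spell out.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    process_exists: bool          # gathered: is the daemon process present?
    responding: bool              # gathered: did a probe request get a reply?
    seconds_since_restart: float  # gathered: time since the last restart


def decide_repair(status, min_restart_interval=300.0):
    """Return (action, message) for one monitored daemon."""
    if status.process_exists and status.responding:
        return ("none", "service healthy")
    if status.seconds_since_restart < min_restart_interval:
        # Assumed policy: avoid a restart loop and hand the problem to a human.
        return ("escalate", "service failing shortly after a restart")
    return ("restart", "had to restart daemon")
```

A caller would carry out the returned action and pass the message to the notification path.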

MDR: Example 2
The NOW is “slow.”
• Gather data: load, process info, CPU info
• Analyze data: bounds on expected values
• Notify administrator: fileserver overloaded.
• Visualize data: nfsd’s are overloaded.
• Repair: admin moves data, adds disks, or starts more nfsd’s

MDR: Example 3
User running program
• Gather: user statistics, CPU, disk
• Visualize: spending too much time waiting on remote accesses
  (User fixes program; gathering and visualization are repeated)
• Analyze: some nodes have less throughput
• Visualize: those have other jobs running on them
• Repair: user is benchmarking so kills all extraneous processes

MDR: Example 4
Web server load increasing beyond capacity
• Gather: CPU, request rate, reply latency
• Analyze: Burst lengths getting longer, latency increasing
• Visualize: Graph of burst lengths & CPU usage over time
• Repair: Order more machines, install load balancer
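
One way to make "burst lengths getting longer" concrete is to measure runs of consecutive samples above a request-rate threshold and compare earlier bursts with later ones. The sketch below assumes evenly spaced samples; the threshold and the half-versus-half comparison are illustrative choices, not the analysis described in the talk.

```python
def burst_lengths(rates, threshold):
    """Lengths of consecutive runs of samples strictly above threshold."""
    lengths = []
    run = 0
    for rate in rates:
        if rate > threshold:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)  # series ended while still inside a burst
    return lengths


def bursts_growing(lengths):
    """True if the later half of the bursts is longer on average than the earlier half."""
    if len(lengths) < 2:
        return False
    half = len(lengths) // 2
    earlier, later = lengths[:half], lengths[half:]
    return sum(later) / len(later) > sum(earlier) / len(earlier)
```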

MDR: Fundamental Requirements
• Gathering
  • Flexible data gathering, self-describing storage
• Analyzing
  • Calculate statistical measures, identify relevant statistics.
• Notifying
  • Flexible infrequent messages to administrators or users
• Visualizing
  • Maximize information/pixel, support multiple interfaces
• Repairing
  • Automate simple repairs, support group operations

MDR: Environmental Constraints
• Change is inherent
  – Web/Mbone were absent 5 years ago; now most/many sites have them.
• Problems on many time-scales
  – Second-Minute transients vs. Week-Month capacity problems
• Must operate under very adverse conditions
  – Often used when system is broken
  – Would like at least post-mortem analysis
• Need to handle hundreds – thousands of nodes
  – Scalability: All sites are getting larger, possibly wide area
  – Our system has 200 (NOW) – 2000 (Soda) nodes

MDR: Previous Systems
• Many previous systems: I’ve looked at about 16.
• Not comprehensive, not extensible.
• Look at a few that did a nice job of a piece:
• [Fink97]: Run test, notify display engine
  + Easy to add tests
  + Selectivity of notification good
  – Tests are just programs (redo gathering)
  – Central, non-fault tolerant solution
  – Many hard coded constants

MDR: Previous Systems, cont.
• [Hard92]: buzzerd, pager notification system
  + Flexible rules for notification
  + External interface for adding notify requests
  – Simplistic gathering
  – Poor fault tolerance
• [Pier96]: Igor, group fixes
  + Flexible operations
  + Nice reporting of success/failure
  – Weak security, runs as root
  – No delegation of responsibility

MDR: Six Key Innovations (1-3)
• Replicated, semi-hierarchical, data storage nodes
  – Rendezvous point for programs
  – Handles scaling and fault-tolerance
• Self describing structures
  – Functions (visualize, summarize) + data go in database (OO)
  – DB has machine and human readable descriptions of data
• End to end notification
  – Detect problems in MDR system
  – Guarantee important messages get to users

MDR: Six Key Innovations (4-6)
• Aggregation and High Resolution Color Displays
  – Reduce information to manageable amounts
  – Maximize information per unit area
• Partially self-configuring
  – Learn averages, deviations, burst sizes
  – Learn which values are relevant to problems
• Secure, user-specified group repairs
  – Don’t enable malicious attacks
  – Automate repairs of many machines

MDR: Architecture
[Architecture diagram. Components: SQL-based Data Repository; Gather Agent (vmstat thread, ping thread, tcpdump thread); Diagnostic Console; E-mail or Phone Notifier; Long-term graphing; Tolerance, Relevance Learner; Daemon Restarter; Aggregation Engine]

MDR-Arch: Derivations
[Architecture diagram. Components: SQL-based Data Repository; Diagnostic Console; E-mail or Phone Notifier; Tolerance, Relevance Learner; Daemon Restarter]

Key: Semi-Hier. DBs.
• Fault tolerance
• Scalability:
  – Caches don’t need to commit to disk: authoritative copy elsewhere.
  – Batching updates over wide area links.
[Diagram: two top level caches, above three mid level caches, above five per-node databases]
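
The two scalability points (memory-only caches and batched updates) can be illustrated with a small in-memory cache that forwards updates to the next level in groups. This is a sketch under stated assumptions: `BatchingCache`, its `batch_size` parameter, and the callable used as the upstream link are all hypothetical, and replication and failure handling are left out.

```python
class BatchingCache:
    """In-memory cache level that forwards updates upstream in batches."""

    def __init__(self, upstream, batch_size=3):
        self.upstream = upstream      # callable taking a list of (key, value) pairs
        self.batch_size = batch_size
        self.data = {}                # latest value per key; never written to disk
        self.pending = []             # updates not yet sent upstream

    def update(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def receive(self, updates):
        """Accept a batch from a lower level of the hierarchy."""
        for key, value in updates:
            self.update(key, value)

    def flush(self):
        """Send all pending updates upstream as one message."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.upstream(batch)
```

A mid level cache would be built as `BatchingCache(top.receive)`, so that one wide-area message carries several per-node updates.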

Key: Self-Describing
• De-couple data gathering, data storage, and data use
• Self-Describing for Humans
  – Descriptions of meanings of values stored with tables
  – Description of methods of gathering stored with tables
  – Column names help with self-description
• Self-Describing for Computers
  – Functions for visualizing or summarizing data
  – Indication of resource selection from resource statistics
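
A rough illustration of storing descriptions alongside the data: a companion table records, for each column, what the value means, how it was gathered, and which summary function applies. The schema, the table and column names, and the use of SQLite are all assumptions made for this sketch; the talk only says the repository is SQL-based.

```python
import sqlite3

ALLOWED_AGGREGATES = {"AVG", "MAX", "MIN", "SUM"}


def make_repository():
    """Create an in-memory repository with one data table and its description."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cpu_load (host TEXT, sampled_at REAL, load_1min REAL);
        CREATE TABLE column_info (
            table_name TEXT, column_name TEXT,
            meaning TEXT,          -- for humans: what the value means
            gathered_by TEXT,      -- for humans: how the value was gathered
            summarize_with TEXT    -- for programs: aggregate used to summarize
        );
    """)
    conn.execute(
        "INSERT INTO column_info VALUES (?, ?, ?, ?, ?)",
        ("cpu_load", "load_1min", "1-minute load average",
         "hypothetical gather agent, sampled periodically", "AVG"),
    )
    return conn


def summarize(conn, table, column):
    """Summarize a column using the aggregate named in its stored description."""
    row = conn.execute(
        "SELECT summarize_with FROM column_info "
        "WHERE table_name = ? AND column_name = ?",
        (table, column),
    ).fetchone()
    if row is None or row[0] not in ALLOWED_AGGREGATES:
        raise ValueError(f"no usable description for {table}.{column}")
    # The identifiers matched a column_info row, so they are known names.
    return conn.execute(f'SELECT {row[0]}("{column}") FROM "{table}"').fetchone()[0]
```

A display program written this way needs no built-in knowledge of `cpu_load`; it reads the description and applies whatever summary the gatherer registered.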

Key: End-to-End Notification
Recall: System must operate under extreme conditions
• Humans must validate that system is still working
  – Standalone display can indicate timestamps, mark out of date data
  – Wireless machine could intermittently contact notification system
  – Pager could be automatically paged every so often
• Problems should be propagated to end users.
  – Flexible notification: connected systems, e-mail, pager.
  – Limit over-notification
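
Two of the ideas above lend themselves to a short sketch: marking data as out of date from its timestamp, and limiting over-notification by suppressing repeats of the same problem within a time window. The names `is_stale` and `Notifier`, and the per-problem window policy, are hypothetical choices for illustration.

```python
def is_stale(last_update, now, max_age):
    """True if data last updated at last_update should be shown as out of date."""
    return now - last_update > max_age


class Notifier:
    """Suppress repeat notifications for the same problem within min_interval."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last_sent = {}  # problem key -> time of last notification

    def should_notify(self, problem, now):
        last = self.last_sent.get(problem)
        if last is not None and now - last < self.min_interval:
            return False
        self.last_sent[problem] = now
        return True
```

Times are plain numbers (for example, seconds from `time.time()`), which keeps the logic easy to test without a real clock.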

Key: Aggregation & HiRes
• System target has hundreds – thousands of nodes
• Aggregate by showing out of bounds, relevant values (via automatic tuning)
• Also want overview of system
  – Aggregate across similar statistics; show value (fill) & dispersion (shade)
  – Use color to highlight important values.
  – Aggregate across values (machine utilization = CPU + disk + memory)
  – Maximize data/pixel [Tufte]
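
As a sketch of the aggregation step, the functions below reduce one statistic across many nodes to a value and a dispersion (the two quantities the display maps to fill and shade) and pick out the nodes whose values fall outside given bounds. Using the mean and population standard deviation is an assumption; the slide does not say which measures the display uses.

```python
import statistics


def aggregate(values):
    """Return (value, dispersion) for one statistic across nodes."""
    return statistics.fmean(values), statistics.pstdev(values)


def out_of_bounds(per_node, low, high):
    """Return only the nodes whose value lies outside [low, high]."""
    return {node: v for node, v in per_node.items() if v < low or v > high}
```

For example, `out_of_bounds({"n1": 0.2, "n2": 0.97}, 0.0, 0.9)` keeps only `n2`, so hundreds of healthy nodes collapse into a single aggregate while the exceptions stay visible.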

Key: Agg & HiRes: Snapshot

Key: Self-Configuring
• Single statistics
  – Phase 1: Calculate averages, standard deviations, burst sizes
  – Worked in other systems [Jaco88, Karn91]
• Identify relevant statistics
  – Give system Boolean examples (variables out of bounds, and system working/not working); get function.
  – Works for Boolean disjunctions in some cases:
    • With lots of irrelevant variables [Litt89]
    • With random bad examples [Sloa89]
    • In some cases, with malicious bad examples [Ande94]
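
Both phases can be sketched briefly. `learn_bounds` derives bounds for a single statistic from its mean and standard deviation (the multiplier `k = 3` is an assumed default). For the second phase, one well-known algorithm for learning a Boolean disjunction when most variables are irrelevant is Winnow; whether the talk has this exact algorithm in mind is an assumption. Each example is a vector of "out of bounds" flags plus a label saying whether the system was failing.

```python
import statistics


def learn_bounds(samples, k=3.0):
    """Bounds for one statistic: mean plus or minus k standard deviations."""
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)
    return mean - k * sd, mean + k * sd


def winnow_predict(weights, theta, x):
    """Predict 'failing' when the weights of the set flags reach the threshold."""
    return sum(w for w, xi in zip(weights, x) if xi) >= theta


def winnow_train(examples, n, passes=10):
    """Winnow with elimination: double weights on misses, zero them on false alarms."""
    weights = [1.0] * n
    theta = n / 2
    for _ in range(passes):
        mistakes = 0
        for x, label in examples:
            if winnow_predict(weights, theta, x) == bool(label):
                continue
            mistakes += 1
            for i, xi in enumerate(x):
                if xi:
                    weights[i] = weights[i] * 2 if label else 0.0
        if mistakes == 0:
            break  # every training example is now classified correctly
    return weights, theta
```

Variables that only ever appear in false alarms end up with weight zero, which is how the irrelevant statistics drop out.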

Key: Secure Remote Actions
• Security because of malicious attacks, benign errors
• Delegation to remove SA from the loop
• Independence from particular algorithms
  – Building a library
  – Program with principals (hosts, users), and properties (signed, sealed, verifiable)
• Use secure, run-time extensible languages
• Actions report through gathering system
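
To show what programming with principals and properties might look like, the sketch below exposes only "sign" and "verify" operations on an action request and hides the algorithm behind them. The choice of HMAC-SHA256 with a shared key, the message layout, and the function names are all assumptions for illustration; the talk's library is not described at this level of detail.

```python
import hashlib
import hmac
import json


def sign_action(principal, action, key):
    """Return a message carrying the action and a tag that makes it 'signed'."""
    payload = json.dumps({"principal": principal, "action": action}, sort_keys=True)
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_action(message, key):
    """Return the decoded request if the tag checks out, otherwise None."""
    expected = hmac.new(key, message["payload"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, message["tag"]):
        return None
    return json.loads(message["payload"])
```

Callers never name the algorithm, so it could be swapped for a public-key scheme without changing the code that requests or performs repairs.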

MDR: Testing Methodology
• Fault injection
  – Deliberately make the system slow
  – Break hardware/software components
• Feature comparison
  – Paper comparison with other systems
• Usage in practice
  – Experience important to show system works
  – We have need of administrative tools
• Testimonials
  – Experience at other sites lends credibility

MDR: Demo
• Hierarchical structure working (1 level right now)
• Alternative Interface
• Fault Injection
• Need for Aggregation
• Crufty right now
• Demo

Timeline: Key Pieces
1) (DBs) Replicated, semi-hierarchical, data storage nodes
2) (SDS) Self describing structures
3) (Vis) Aggregation and High Resolution Color Displays
4) (E2EN) End to end notification
5) (ReS) Automatic Restart
6) (Cfg) Partially self-configuring
7) (Rep) Secure, user-specified group repairs

Timeline
[Timeline diagram; the placement of items along the time axis is not preserved in this transcript]
Time axis: June 1997, Dec 1997, June 1998, Dec 1998, Mar 1999
Deadlines: LISA 6/97, USENIX 12/97, OSDI 3/98, LISA 6/98, Graduation 12/98, SOSP 3/99
• Prototype 1, 2, 3 (DBs, SelfD, Vis)
• Prototype 4, 5 (Notify, Restart)
• Prototype 6, 7 (AConfig, Repair)
• Experience with 1-7
• Architecture of Complete System
• Writing

Conclusion
• Description of field shows breadth
• Monitoring, diagnosing, and repairing shows depth
  – Examples show importance of problem
  – Fundamental goals & environmental constraints show understanding of problem
  – Key innovations show differences from previous systems.
  – Architecture and initial prototype show approach to problem
  – Testing methods show ways to validate solution.
• Timeline shows plan & milestones to graduation

Old Slides

Solutions
• Managing stable storage
• Supporting users
• Simplifying security
• Monitoring, diagnosing, and repairing

Managing Stable Storage
• Consistency vs. availability
• Fault tolerance
• Scalability
• Recoverability
• Customization

Supporting Users
• Automated help desk
  – Searchable collection of questions
  – Easy method for addition
• Remote device access
• Site-wide training

Goals: Environment
• Uniform
  – Supports user mobility by eliminating arbitrary changes
  – Increases effectiveness by avoiding need for users to learn multiple interfaces
• Customizable
  – Handles special systems and special needs [firewalls, servers]
  – Obviously reduces uniformity

Goals: Environment, cont.
• High Performance
  – Increases effectiveness of users [HCI/psych]
  – Limited by cost-effectiveness
• Available
  – Effectiveness is 0 if system isn’t working
  – Balanced against expense

Goals: Faults & Errors
• Benign errors:
  – Accidentally deleted files
  – Unnoticed runaway processes
• Malicious attacks:
  – TCP SYN attack
  – Sendmail bugs
  – Data stealing
  – False data injection

Goals: Users
• Training
  – Troubleshooting = one-on-one training
  – Larger sessions = classes
• Accounting
  – Supports management, helps billing
• Capacity Planning
  – Expanding systems takes time
• Legal
  – Sensitive information needs protection

Simplifying Security
USENIX talk says “If cryptography is so great, why isn’t it used more?”
SAs worry about security to protect data.
• Goal: Ease development of secure applications
• Write programs using principals & properties rather than keys and algorithms
• Unify various forms of available cryptography (public key, secret-key, PGP, Kerberos)
• My use: protected, transferable rights to allow various actions
  – Modify system configurations (add filesystems, printers)
  – Kill/restart processes (runaway, after configurations modified)
  – Access data (private logs, for backups, etc.)

Conclusion
• System administration as area of research
  – Description of field
  – Areas for future research
    • Managing stable storage
    • Supporting users
• Initial investigation of research area
  – Monitoring, diagnosing, and repairing
    • Broad, draws from many fields

More Related Content

What's hot

Data recovery
Data recoveryData recovery
Data recovery
Abhinav Parihar
 
(Slides) Task scheduling algorithm for multicore processor system for minimiz...
(Slides) Task scheduling algorithm for multicore processor system for minimiz...(Slides) Task scheduling algorithm for multicore processor system for minimiz...
(Slides) Task scheduling algorithm for multicore processor system for minimiz...
Naoki Shibata
 
On Tune Performance Monitoring
On Tune Performance MonitoringOn Tune Performance Monitoring
On Tune Performance MonitoringTeemStone Pty Ltd
 
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...
vtunotesbysree
 
Operating system concepts ninth edition (2012), chapter 2 solution e1
Operating system concepts ninth edition (2012), chapter 2 solution e1Operating system concepts ninth edition (2012), chapter 2 solution e1
Operating system concepts ninth edition (2012), chapter 2 solution e1
Navid Daneshvaran
 
DATA RECOVERY TECHNIQUES
DATA RECOVERY TECHNIQUESDATA RECOVERY TECHNIQUES
DATA RECOVERY TECHNIQUES
Venkatesh Pensalwar
 
3 securityarchitectureandmodels-120331064706-phpapp01
3 securityarchitectureandmodels-120331064706-phpapp013 securityarchitectureandmodels-120331064706-phpapp01
3 securityarchitectureandmodels-120331064706-phpapp01
wardell henley
 
STORAGE DEVICES & OPERATING SYSTEM SERVICES
STORAGE DEVICES & OPERATING SYSTEM SERVICESSTORAGE DEVICES & OPERATING SYSTEM SERVICES
STORAGE DEVICES & OPERATING SYSTEM SERVICES
Ayesha Tahir
 
The difference between in-depth analysis of virtual infrastructures & monitoring
The difference between in-depth analysis of virtual infrastructures & monitoringThe difference between in-depth analysis of virtual infrastructures & monitoring
The difference between in-depth analysis of virtual infrastructures & monitoring
BettyRManning
 
Chapter 2 (Part 2)
Chapter 2 (Part 2) Chapter 2 (Part 2)
Chapter 2 (Part 2) rohassanie
 
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time SystemsSara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
knowdiff
 
Data recovery
Data recoveryData recovery
Data recovery
bhaumik_c
 
Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...
Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...
Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...
Rafael Ferreira da Silva
 
Automation system performance myths
Automation system performance mythsAutomation system performance myths
Automation system performance mythspaulguerin
 
Testing System Qualities Agile2012 by Rebecca Wirfs-Brock and Joseph Yoder
Testing System Qualities Agile2012 by Rebecca Wirfs-Brock and Joseph YoderTesting System Qualities Agile2012 by Rebecca Wirfs-Brock and Joseph Yoder
Testing System Qualities Agile2012 by Rebecca Wirfs-Brock and Joseph Yoder
Joseph Yoder
 
Resource management
Resource managementResource management
Resource management
peeyushanand6
 
Os unit 3 , process management
Os unit 3 , process managementOs unit 3 , process management
Os unit 3 , process management
Arnav Chowdhury
 
Operating System Simple Introduction
Operating System Simple IntroductionOperating System Simple Introduction
Operating System Simple Introduction
Diwash Sapkota
 

What's hot (19)

Data recovery
Data recoveryData recovery
Data recovery
 
(Slides) Task scheduling algorithm for multicore processor system for minimiz...
(Slides) Task scheduling algorithm for multicore processor system for minimiz...(Slides) Task scheduling algorithm for multicore processor system for minimiz...
(Slides) Task scheduling algorithm for multicore processor system for minimiz...
 
On Tune Performance Monitoring
On Tune Performance MonitoringOn Tune Performance Monitoring
On Tune Performance Monitoring
 
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...
 
Operating system concepts ninth edition (2012), chapter 2 solution e1
Operating system concepts ninth edition (2012), chapter 2 solution e1Operating system concepts ninth edition (2012), chapter 2 solution e1
Operating system concepts ninth edition (2012), chapter 2 solution e1
 
DATA RECOVERY TECHNIQUES
DATA RECOVERY TECHNIQUESDATA RECOVERY TECHNIQUES
DATA RECOVERY TECHNIQUES
 
3 securityarchitectureandmodels-120331064706-phpapp01
3 securityarchitectureandmodels-120331064706-phpapp013 securityarchitectureandmodels-120331064706-phpapp01
3 securityarchitectureandmodels-120331064706-phpapp01
 
STORAGE DEVICES & OPERATING SYSTEM SERVICES
STORAGE DEVICES & OPERATING SYSTEM SERVICESSTORAGE DEVICES & OPERATING SYSTEM SERVICES
STORAGE DEVICES & OPERATING SYSTEM SERVICES
 
The difference between in-depth analysis of virtual infrastructures & monitoring
The difference between in-depth analysis of virtual infrastructures & monitoringThe difference between in-depth analysis of virtual infrastructures & monitoring
The difference between in-depth analysis of virtual infrastructures & monitoring
 
Chapter 2 (Part 2)
Chapter 2 (Part 2) Chapter 2 (Part 2)
Chapter 2 (Part 2)
 
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time SystemsSara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
Sara Afshar: Scheduling and Resource Sharing in Multiprocessor Real-Time Systems
 
Data recovery
Data recoveryData recovery
Data recovery
 
Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...
Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...
Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Wor...
 
Automation system performance myths
Automation system performance mythsAutomation system performance myths
Automation system performance myths
 
Testing System Qualities Agile2012 by Rebecca Wirfs-Brock and Joseph Yoder
Testing System Qualities Agile2012 by Rebecca Wirfs-Brock and Joseph YoderTesting System Qualities Agile2012 by Rebecca Wirfs-Brock and Joseph Yoder
Testing System Qualities Agile2012 by Rebecca Wirfs-Brock and Joseph Yoder
 
Resource management
Resource managementResource management
Resource management
 
Os unit 3 , process management
Os unit 3 , process managementOs unit 3 , process management
Os unit 3 , process management
 
Operating System Simple Introduction
Operating System Simple IntroductionOperating System Simple Introduction
Operating System Simple Introduction
 
OSCh4
OSCh4OSCh4
OSCh4
 

Similar to Quals

Stop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsStop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production Systems
Brendan Gregg
 
Forecasting database performance
Forecasting database performanceForecasting database performance
Forecasting database performance
Shenglin Du
 
Webinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationWebinar: Performance Tuning + Optimization
Webinar: Performance Tuning + Optimization
MongoDB
 
Introduction to embedded system
Introduction to embedded systemIntroduction to embedded system
Introduction to embedded system
ajitsaraf123
 
LM9 - OPERATIONS, SCHEDULING, Inter process xommuncation
LM9 - OPERATIONS, SCHEDULING, Inter process xommuncationLM9 - OPERATIONS, SCHEDULING, Inter process xommuncation
LM9 - OPERATIONS, SCHEDULING, Inter process xommuncation
Mani Deepak Choudhry
 
Speeding Up Atlas Deep Learning Platform with Alluxio + Fluid
Speeding Up Atlas Deep Learning Platform with Alluxio + FluidSpeeding Up Atlas Deep Learning Platform with Alluxio + Fluid
Speeding Up Atlas Deep Learning Platform with Alluxio + Fluid
Alluxio, Inc.
 
201201 ureason introduction to use
201201 ureason introduction to use201201 ureason introduction to use
201201 ureason introduction to use
UReasonChannel
 
Run MongoDB with Confidence: Backing up and Monitoring with MMS
Run MongoDB with Confidence: Backing up and Monitoring with MMSRun MongoDB with Confidence: Backing up and Monitoring with MMS
Run MongoDB with Confidence: Backing up and Monitoring with MMSMongoDB
 
Cluster computing
Cluster computingCluster computing
Cluster computing
Raja' Masa'deh
 
Software Performance
Software Performance Software Performance
Software Performance
Prabhanshu Saraswat
 
CISSP Week 22
CISSP Week 22CISSP Week 22
CISSP Week 22jemtallon
 
seed block algorithm
seed block algorithmseed block algorithm
seed block algorithmDipak Badhe
 
Health monitoring & predictive analytics to lower the TCO in a datacenter
Health monitoring & predictive analytics to lower the TCO in a datacenterHealth monitoring & predictive analytics to lower the TCO in a datacenter
Health monitoring & predictive analytics to lower the TCO in a datacenterAndrei Khurshudov
 
Scylla Summit 2016: Scylla at Samsung SDS
Scylla Summit 2016: Scylla at Samsung SDSScylla Summit 2016: Scylla at Samsung SDS
Scylla Summit 2016: Scylla at Samsung SDS
ScyllaDB
 
Agile performance engineering with cloud 2016
Agile performance engineering with cloud   2016Agile performance engineering with cloud   2016
Agile performance engineering with cloud 2016
Ken Chan
 
An Overview of Performance Evaluation & Simulation
An Overview of Performance Evaluation & SimulationAn Overview of Performance Evaluation & Simulation
An Overview of Performance Evaluation & Simulation
dasdfadfdsfsdfasdf
 
Cluster computing report
Cluster computing reportCluster computing report
Cluster computing report
Sudhanshu kumar Sah
 

Similar to Quals (20)

Stop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsStop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production Systems
 
Forecasting database performance
Forecasting database performanceForecasting database performance
Forecasting database performance
 
Webinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationWebinar: Performance Tuning + Optimization
Webinar: Performance Tuning + Optimization
 
Introduction to embedded system
Introduction to embedded systemIntroduction to embedded system
Introduction to embedded system
 
LM9 - OPERATIONS, SCHEDULING, Inter process xommuncation
LM9 - OPERATIONS, SCHEDULING, Inter process xommuncationLM9 - OPERATIONS, SCHEDULING, Inter process xommuncation
LM9 - OPERATIONS, SCHEDULING, Inter process xommuncation
 
Speeding Up Atlas Deep Learning Platform with Alluxio + Fluid
Speeding Up Atlas Deep Learning Platform with Alluxio + FluidSpeeding Up Atlas Deep Learning Platform with Alluxio + Fluid
Speeding Up Atlas Deep Learning Platform with Alluxio + Fluid
 
201201 ureason introduction to use
201201 ureason introduction to use201201 ureason introduction to use
201201 ureason introduction to use
 
Run MongoDB with Confidence: Backing up and Monitoring with MMS
Run MongoDB with Confidence: Backing up and Monitoring with MMSRun MongoDB with Confidence: Backing up and Monitoring with MMS
Run MongoDB with Confidence: Backing up and Monitoring with MMS
 
Cluster computing
Cluster computingCluster computing
Cluster computing
 
Software Performance
Software Performance Software Performance
Software Performance
 
CISSP Week 22
CISSP Week 22CISSP Week 22
CISSP Week 22
 
seed block algorithm
seed block algorithmseed block algorithm
seed block algorithm
 
Health monitoring & predictive analytics to lower the TCO in a datacenter
Health monitoring & predictive analytics to lower the TCO in a datacenterHealth monitoring & predictive analytics to lower the TCO in a datacenter
Health monitoring & predictive analytics to lower the TCO in a datacenter
 
Report_Internships
Report_InternshipsReport_Internships
Report_Internships
 
Ch24 system administration
Ch24 system administration Ch24 system administration
Ch24 system administration
 
Ch24
Ch24Ch24
Ch24
 
Scylla Summit 2016: Scylla at Samsung SDS
Scylla Summit 2016: Scylla at Samsung SDSScylla Summit 2016: Scylla at Samsung SDS
Scylla Summit 2016: Scylla at Samsung SDS
 
Agile performance engineering with cloud 2016
Agile performance engineering with cloud   2016Agile performance engineering with cloud   2016
Agile performance engineering with cloud 2016
 
An Overview of Performance Evaluation & Simulation
An Overview of Performance Evaluation & SimulationAn Overview of Performance Evaluation & Simulation
An Overview of Performance Evaluation & Simulation
 
Cluster computing report
Cluster computing reportCluster computing report
Cluster computing report
 

Recently uploaded

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Quals

  • 1. Monitoring, Diagnosing, and Repairing
      Eric Anderson, U.C. Berkeley
  • 2. Overview
      • What is System Administration?
          – What is the problem?
          – Goals of Dissertation Research
          – Goals of System Administration
      • Monitoring, diagnosing, and repairing
      • Dissertation Timeline
      • Conclusion
  • 3. What is the problem?
      • Problems occur in systems and result in loss of productivity
          – Server failures → denial of service
          – System overload → lower productivity
      • Cost is too high
          – Cost of ownership estimated at $5,000-$15,000/year/machine
          – Median salary (~50k) / (median # machines/admin) → $700
      • Our goal: Reduce cost by
          – Repairing problems faster (possibly automatically)
          – Handling more problems
  • 4. Goals of Dissertation Research
      • Describe field of System Administration
      • Monitoring, Diagnosing, and Repairing:
          – Approach: Synthesize solutions from other fields of research
            1) Detect previously ignored problems
            2) Automatic repair of some problems
            3) Reduce number of administrators needed
            4) Support users’ understanding of system
      • Apply here & distribute software
      • Thesis: Through our approach, we can achieve goals 1-4.
  • 5. Goals of System Administration
      Goal: Support cost-effective use of the computer environment
      More specifically (some non-technical):
      • Environment: uniform, customizable, high performance, and available
      • Faults & errors: recovery from benign errors, protection from malicious attacks
      • Users: training, accounting & planning, legal
  • 6. Monitoring, Diagnosing, and Repairing (MDR)
      • Introductory examples
      • Fundamental requirements
      • Environmental constraints
      • Previous work
      • Six key innovations
      • Architecture
      • Details on innovations
      • Evaluation methodology
  • 7. MDR: Examples (Intro)
      • Four examples
          1) Broken component
          2) Resource overload (transient)
          3) Resource contention (user program)
          4) Resource exhaustion (long term)
      • Previous solutions
          – Pay someone to watch
          – Ignore or wait for someone to complain
          – Specialized scripts (not general → vast repeated work)
  • 8. MDR: Example 1
      Web server has crashed/hung
      • Gather information: process existence, service uptime, restart times
      • Analyze data: process not responding, and hasn’t been recently restarted
      • Automatic repair: restart daemon
      • Notify administrator: had to restart daemon
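The gather/analyze/repair/notify loop on the slide above can be sketched as a small decision function. This is a minimal illustration, not the talk's implementation: the `DaemonStatus` fields, the `decide_action` name, the 300-second threshold, and the "escalate" outcome (used when the daemon was restarted recently, to avoid a restart loop) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class DaemonStatus:
    """Hypothetical snapshot of gathered data for one daemon."""
    process_exists: bool
    responding: bool
    seconds_since_restart: float


def decide_action(status, min_restart_interval=300.0):
    """Return 'ok', 'restart', or 'escalate' for a gathered status.

    'restart' mirrors the slide: the daemon is not responding and has
    not been restarted recently. 'escalate' is an assumed extra case:
    a recent restart did not help, so a human should look.
    """
    if status.process_exists and status.responding:
        return "ok"
    if status.seconds_since_restart >= min_restart_interval:
        return "restart"
    return "escalate"
```

In a real deployment, a "restart" result would be followed by both the repair and a notification to the administrator, as the slide describes.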
  • 9. MDR: Example 2
      The NOW is “slow.”
      • Gather data: load, process info, CPU info
      • Analyze data: bounds on expected values
      • Notify administrator: fileserver overloaded
      • Visualize data: nfsds are overloaded
      • Repair: admin moves data, adds disks, or starts more nfsds
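One way to read "bounds on expected values" in the slide above is as a band around the historical mean. The sketch below assumes a band of k standard deviations; the function names and the choice k = 3 are illustrative and do not come from the talk.

```python
import statistics


def learn_bounds(history, k=3.0):
    """Return (low, high) bounds as mean +/- k population std devs."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return mean - k * std, mean + k * std


def out_of_bounds(value, bounds):
    """True if a new measurement falls outside the learned bounds."""
    low, high = bounds
    return value < low or value > high
```

For example, bounds learned from recent load averages could flag an overloaded fileserver when a new sample lands far above the band.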
  • 10. MDR: Example 3
      User running program
      • Gather: user statistics, CPU, disk
      • Visualize: spending too much time waiting on remote accesses
        (User fixes program; gathering and visualization are repeated)
      • Analyze: some nodes have less throughput
      • Visualize: those have other jobs running on them
      • Repair: user is benchmarking, so kills all extraneous processes
  • 11. MDR: Example 4
      Web server load increasing beyond capacity
      • Gather: CPU, request rate, reply latency
      • Analyze: burst lengths getting longer, latency increasing
      • Visualize: graph of burst lengths & CPU usage over time
      • Repair: order more machines, install load balancer
  • 12. MDR: Fundamental Requirements
      • Gathering
          – Flexible data gathering, self-describing storage
      • Analyzing
          – Calculate statistical measures, identify relevant statistics
      • Notifying
          – Flexible, infrequent messages to administrators or users
      • Visualizing
          – Maximize information/pixel, support multiple interfaces
      • Repairing
          – Automate simple repairs, support group operations
  • 13. MDR: Environmental Constraints
      • Change is inherent
          – Lack of Web/Mbone 5 years ago; now most/many have these
      • Problems on many time-scales
          – Second-Minute transients vs. Week-Month capacity problems
      • Must operate under very adverse conditions
          – Often used when system is broken
          – Would like at least post-mortem analysis
      • Need to handle hundreds to thousands of nodes
          – Scalability: all sites are getting larger, possibly wide area
          – Our system has 200 (NOW) to 2000 (Soda) nodes
  • 14. MDR: Previous Systems
      • Many previous systems: I’ve looked at about 16.
      • Not comprehensive, not extensible.
      • Look at a few that did a nice job of a piece:
      • [Fink97]: Run test, notify display engine
          + Easy to add tests
          + Selectivity of notification good
          – Tests are just programs (redo gathering)
          – Central, non-fault tolerant solution
          – Many hard coded constants
  • 15. MDR: Previous Systems, cont.
      • [Hard92]: buzzerd, pager notification system
          + Flexible rules for notification
          + External interface for adding notify requests
          – Simplistic gathering
          – Poor fault tolerance
      • [Pier96]: Igor, group fixes
          + Flexible operations
          + Nice reporting of success/failure
          – Weak security, runs as root
          – No delegation of responsibility
  • 16. MDR: Six Key Innovations (1-3)
      • Replicated, semi-hierarchical, data storage nodes
          – Rendezvous point for programs
          – Handles scaling and fault-tolerance
      • Self describing structures
          – Functions (visualize, summarize) + data go in database (OO)
          – DB has machine and human readable descriptions of data
      • End to end notification
          – Detect problems in MDR system
          – Guarantee important messages get to users
  • 17. MDR: Six Key Innovations (4-6)
      • Aggregation and High Resolution Color Displays
          – Reduce information to manageable amounts
          – Maximize information per unit area
      • Partially self-configuring
          – Learn averages, deviations, burst sizes
          – Learn which values are relevant to problems
      • Secure, user-specified group repairs
          – Don’t enable malicious attacks
          – Automate repairs of many machines
  • 18. MDR: Architecture
      (Diagram; component labels)
      • SQL-based Data Repository
      • Gather Agent: vmstat thread, ping thread, tcpdump thread
      • Diagnostic Console
      • E-mail or Phone Notifier
      • Long-term graphing
      • Tolerance, Relevance Learner
      • Daemon Restarter
      • Aggregation Engine
  • 19. MDR-Arch: Derivations
      (Diagram; component labels)
      • SQL-based Data Repository
      • Diagnostic Console
      • E-mail or Phone Notifier
      • Tolerance, Relevance Learner
      • Daemon Restarter
  • 20. Key: Semi-Hier. DBs.
      • Fault tolerance
      • Scalability:
          – Caches don’t need to commit to disk: authoritative copy elsewhere
          – Batching updates over wide area links
      (Diagram: 2 top level caches, 3 mid level caches, 5 per-node databases)
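The batching idea on the slide above can be illustrated with a toy in-memory cache that forwards updates to its parent only once a batch has filled. The `CacheNode` class, its `batch_size` parameter, and the dictionary storage are assumptions made for this sketch; the talk's own design is SQL-based and replicated, which is not modeled here.

```python
class CacheNode:
    """Toy cache level that batches updates before sending them upward."""

    def __init__(self, parent=None, batch_size=3):
        self.parent = parent          # next level up, or None at the top
        self.batch_size = batch_size  # updates to collect before forwarding
        self.data = {}                # in-memory copy; not committed to disk
        self._pending = []            # updates not yet sent to the parent

    def update(self, key, value):
        """Store a value locally and queue it for the parent level."""
        self.data[key] = value
        if self.parent is not None:
            self._pending.append((key, value))
            if len(self._pending) >= self.batch_size:
                self.flush()

    def flush(self):
        """Forward all queued updates to the parent in one batch."""
        if self.parent is None:
            return
        batch, self._pending = self._pending, []
        for key, value in batch:
            self.parent.update(key, value)
```

Batching this way trades freshness at the upper levels for fewer messages over wide area links.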
  • 21. Key: Self-Describing
      • De-couple data gathering, data storage, and data use
      • Self-Describing for Humans
          – Descriptions of meanings of values stored with tables
          – Description of methods of gathering stored with tables
          – Column names help with self-description
      • Self-Describing for Computers
          – Functions for visualizing or summarizing data
          – Indication of resource selection from resource statistics
  • 22. Key: End-to-End Notification
      Recall: System must operate under extreme conditions
      • Humans must validate that system is still working
          – Standalone display can indicate timestamps, mark out of date data
          – Wireless machine could intermittently contact notification system
          – Pager could be automatically paged every so often
      • Problems should be propagated to end users
          – Flexible notification: connected systems, e-mail, pager
          – Limit over-notification
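Two points from the slide above, marking out of date data and limiting over-notification, can be sketched as follows. The names `is_stale` and `Notifier`, the per-key suppression window, and the default intervals are assumptions for illustration only.

```python
def is_stale(last_update, now, max_age=120.0):
    """True if data has not been refreshed within max_age seconds."""
    return now - last_update > max_age


class Notifier:
    """Suppress repeat notifications for the same problem key."""

    def __init__(self, min_interval=3600.0):
        self.min_interval = min_interval  # seconds between repeats
        self._last_sent = {}              # problem key -> last send time

    def should_send(self, key, now):
        """Return True (and record the send) unless recently notified."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            return False
        self._last_sent[key] = now
        return True
```

Passing `now` explicitly keeps the sketch deterministic; a real notifier would more likely read a clock.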
  • 23. Key: Aggregation & HiRes
      • System target has hundreds to thousands of nodes
      • Aggregate by showing out of bounds, relevant values (via automatic tuning)
      • Also want overview of system
          – Aggregate across similar statistics; show value (fill) & dispersion (shade)
          – Use color to highlight important values
          – Aggregate across values (machine utilization = CPU + disk + memory)
          – Maximize data/pixel [Tufte]
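As a rough sketch of aggregating across similar statistics, the function below reduces one statistic over many nodes to a value and a dispersion, which a display could then map to fill and shade. Using the mean for the value and the population standard deviation for the dispersion is an assumption; the slide does not say which measures are used.

```python
import statistics


def aggregate(values):
    """Summarize one statistic across nodes as value and dispersion."""
    return {
        "value": statistics.fmean(values),        # candidate for fill
        "dispersion": statistics.pstdev(values),  # candidate for shade
    }
```

Calling `aggregate` on the CPU utilization of every node in a group yields one cell for the overview display instead of one cell per node.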
  • 24. Key: Agg & HiRes: Snapshot
  • 25. Key: Self-Configuring
      • Single statistics
          – Phase 1: Calculate averages, standard deviations, burst sizes
          – Worked in other systems [Jaco88, Karn91]
      • Identify relevant statistics
          – Give system Boolean examples (variables out of bounds, and system working/not working); get function
          – Works for Boolean disjunctions in some cases:
              • With lots of irrelevant variables [Litt89]
              • With random bad examples [Sloa89]
              • In some cases, with malicious bad examples [Ande94]
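Learning a Boolean disjunction when most variables are irrelevant is the setting of Winnow-style multiplicative-update learners. The sketch below is a generic promote/demote variant of that idea, offered as an illustration of the technique; whether [Litt89] on the slide refers to this exact variant, and whether the MDR system would use it, are assumptions. Each input is a tuple of 0/1 flags ("variable out of bounds") and the label says whether the system was broken.

```python
def winnow_train(examples, n, alpha=2.0, epochs=10):
    """Train a Winnow-style learner on (bits, label) pairs over n variables.

    Weights start at 1. On a missed positive, weights of active
    variables are multiplied by alpha; on a false alarm, they are
    divided by alpha. The threshold n / 2 is a common textbook choice.
    """
    weights = [1.0] * n
    threshold = n / 2
    for _ in range(epochs):
        for bits, label in examples:
            predicted = winnow_predict(weights, threshold, bits)
            if predicted == label:
                continue
            factor = alpha if label else 1.0 / alpha
            weights = [w * factor if b else w for w, b in zip(weights, bits)]
    return weights, threshold


def winnow_predict(weights, threshold, bits):
    """Predict True if the weighted sum of active variables reaches the threshold."""
    return sum(w for w, b in zip(weights, bits) if b) >= threshold
```

After training, variables whose weights have grown are the ones the learner treats as relevant to failures; the rest can be de-emphasized in displays and notifications.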
  • 26. Key: Secure Remote Actions
      • Security because of malicious attacks, benign errors
      • Delegation to remove SA from the loop
      • Independence from particular algorithms
          – Building a library
          – Program with principals (hosts, users), and properties (signed, sealed, verifiable)
      • Use secure, run-time extensible languages
      • Actions report through gathering system
  • 27. MDR: Testing Methodology
      • Fault injection
          – Deliberately make the system slow
          – Break hardware/software components
      • Feature comparison
          – Paper comparison with other systems
      • Usage in practice
          – Experience important to show system works
          – We have need of administrative tools
      • Testimonials
          – Experience at other sites lends credibility
  • 28. MDR: Demo
      • Hierarchical structure working (1 level right now)
      • Alternative Interface
      • Fault Injection
      • Need for Aggregation
      • Crufty right now
      • Demo
  • 29. Timeline: Key Pieces
      1) (DBs) Replicated, semi-hierarchical, data storage nodes
      2) (SDS) Self describing structures
      3) (Vis) Aggregation and High Resolution Color Displays
      4) (E2EN) End to end notification
      5) (ReS) Automatic Restart
      6) (Cfg) Partially self-configuring
      7) (Rep) Secure, user-specified group repairs
  • 30. Timeline
      (Chart; labels only, placement of work items along the axis is not recoverable)
      • Time axis: June 1997, Dec 1997, June 1998, Dec 1998, Mar 1999
      • Deadlines: LISA 6/97, USENIX 12/97, OSDI 3/98, LISA 6/98, Graduation 12/98, SOSP 3/99
      • Work items:
          – Prototype 1,2,3 (DBs, SelfD, Vis)
          – Prototype 4,5 (Notify, Restart)
          – Prototype 6,7 (AConfig, Repair)
          – Experience with 1-7
          – Architecture of Complete System
          – Writing
  • 31. Conclusion
      • Description of field shows breadth
      • Monitoring, diagnosing, and repairing shows depth
          – Examples show importance of problem
          – Fundamental goals & environmental constraints show understanding of problem
          – Key innovations show differences from previous systems
          – Architecture and initial prototype show approach to problem
          – Testing methods show ways to validate solution
      • Timeline shows plan & milestones to graduation
  • 33. Solutions
      • Managing stable storage
      • Supporting users
      • Simplifying security
      • Monitoring, diagnosing, and repairing
  • 34. Managing Stable Storage
      • Consistency vs. availability
      • Fault tolerance
      • Scalability
      • Recoverability
      • Customization
  • 35. Supporting Users
      • Automated help desk
          – Searchable collection of questions
          – Easy method for addition
      • Remote device access
      • Site-wide training
  • 36. Goals: Environment
      • Uniform
          – Supports user mobility by eliminating arbitrary changes
          – Increases effectiveness by avoiding need for users to learn multiple interfaces
      • Customizable
          – Handles special systems and special needs [firewalls, servers]
          – Obviously reduces uniformity
  • 37. Goals: Environment, cont.
      • High Performance
          – Increases effectiveness of users [HCI/psych]
          – Limited by cost-effectiveness
      • Available
          – Effectiveness is 0 if system isn’t working
          – Balanced against expense
  • 38. Goals: Faults & Errors
      • Benign errors:
          – Accidentally deleted files
          – Unnoticed runaway processes
      • Malicious attacks:
          – TCP SYN attack
          – Sendmail bugs
          – Data stealing
          – False data injection
  • 39. Goals: Users
      • Training
          – Troubleshooting = one-on-one training
          – Larger sessions = classes
      • Accounting
          – Supports management, helps billing
      • Capacity Planning
          – Expanding systems takes time
      • Legal
          – Sensitive information needs protection
  • 40. Simplifying Security
      USENIX talk says “If cryptography is so great, why isn’t it used more?”
      SAs worry about security to protect data.
      • Goal: Ease development of secure applications
      • Write programs using principals & properties rather than keys and algorithms
      • Unify various forms of available cryptography (public key, secret-key, PGP, Kerberos)
      • My use: protected, transferable rights to allow various actions
          – Modify system configurations (add filesystems, printers)
          – Kill/restart processes (runaway, after configurations modified)
          – Access data (private logs, for backups, etc.)
  • 41. Conclusion
      • System administration as area of research
          – Description of field
          – Areas for future research
              • Managing stable storage
              • Supporting users
      • Initial investigation of research area
          – Monitoring, diagnosing, and repairing
              • Broad, draws from many fields

Editor's Notes

  1. Key idea: None. Introduction slide.
  2. Key Idea: Two contributions — System administration as a field of research; and initial work in the field produces initial results which substantially improve the state of the art.
  3. Key idea: Has properties of “real” research — separation of concerns, important contributions, and a strategy for measuring effectiveness.