1. Server Refresh Program
Improving Throughput by combining LEAN, Kanban and
Theory of Constraints (TOC) Principles
Tal Aviv - Duck Management
August 2016
2. Background
A large portion of the organization’s computing infrastructure has aged
and passed its warranty stage.
The annual cost to maintain that infrastructure and the increasing risk of
failure drove the organization to put a program in place and refresh the
aging servers.
3. What is Server Refresh?
Server refresh is a process in which a server that has been online for
several years and is no longer under warranty is replaced by new
hardware.
There are several reasons for refreshing the server:
Equipment is outdated – the old hardware can no longer support new
demand.
Equipment costs too much to maintain.
Equipment is out of warranty and the vendor no longer provides support, or
support is expensive.
4. How do you refresh a server?
Fresh Build Physical (FBP) – Replace the server with new hardware, install
the application on the new server and retire the old server.
Fresh Build Virtual (FBV) – Spin up a new virtual server, install the application
on the new virtual server and retire the old physical server.
Physical to Virtual migration (P2V) – Copy the physical server “as is” into a
virtual server. This approach is used when no changes are needed (same
application level, same operating system).
The organizational mandate is to virtualize as many servers as possible in order
to reduce the cost and complexity of running physical servers.
5. The Problem
The organization has thousands of servers around the world.
Management and control of the refresh project were disorganized, and the
work was tracked in Excel.
The time to refresh a server is lengthy due to corporate bureaucracy and
limited resources.
The project manager who was responsible for the project quit
unexpectedly after getting frustrated with the organizational “mess”.
7. The Solution?
Understand the Goal (TOC)
Understand Current State (LEAN)
Use the 5 focusing steps to improve throughput (TOC)
Build Future State (TOC/LEAN)
Organization and Control (PMP, Agile/Kanban)
8. Understand the Goal
The GOAL of the program is to
refresh the hardware that is no longer under
warranty.
This means:
We do not re-architect
We do not upgrade
We replace “Like for Like” only!
9. Why is the goal important?
In the past, time and resources were wasted on trying to leverage the
refresh program to re-architect the solution, upgrade current
infrastructure or utilize funds for other initiatives that needed new
equipment.
While all of these initiatives are valid, they kept the project team
from achieving its project goals.
10. Understand Current State
To identify what to change, first understand the current reality
Review documentation
Interview key players
Combine information
Build initial Value Stream Map
11. High level Process flow
System identified as potential candidate
Schedule meeting with:
· System Owner
· Business Owner
· Architect
· Technical Team
Review potential refresh plan
Plan agreed? (Yes / No)
If yes, initiate refresh activity:
· P2V
· FBV
· FBP
· Decommission
Note: The architect decides and approves the refresh path.
Note: The refresh type was initiated based on the decision made in the review meeting.
12. Using the TOC focusing steps to improve
throughput
What to change to?
14. Identify the system constraint
Using the Value Stream Map, we identified the system constraint: the planning
meetings.
Each meeting included a minimum of 5 very busy people.
While the system and business owners changed for each server, the architect was
part of every meeting.
Only one architect per platform is assigned to the project, on a limited-time basis.
Architects have limited availability.
Meetings had to be scheduled weeks in advance to accommodate the architect’s
availability.
15. Current Reality
Refresh throughput is governed by the ability to schedule planning meetings and decide on the refresh path.
System identified as potential candidate
Schedule meeting with:
· System Owner
· Business Owner
· Architect
· Technical Team
Review potential refresh plan
Plan agreed? (Yes / No)
If yes, initiate refresh activity:
· P2V
· FBV
· FBP
· Decommission
System Constraint: Scheduling meetings and repeat meetings with all team members took a long time and required significant lead time.
Note: The refresh type was initiated based on the decision made in the review meeting.
16. Identify the system conflict
Refresh a large number of servers
Quality of refresh must meet Amgen Standards
Quick turnaround of plans in order to accommodate the volume of servers
Skip architectural review to decide refresh path
Architect must review all refresh requests and decide refresh path
Conflict – Architects have limited capacity
17. Exploit the constraint
Assumption: the architect must be part of every review meeting with the client
so that they can bring the best solution to the refresh.
Reality: many of the migrations were standard, and an architect review
added no value.
18. Exploit the constraint
Solution:
1. Identify cases where the architect is needed
2. Establish operating procedures that ensure quality of delivery
in cases where architects are not needed
19. Exploit the constraint - New Current Reality
4 migration paths
Migration path is standard; no architect review is needed.
Migration path is standard, but due to the sensitivity (or risk) of the
system, the architect needs to be notified of the actions.
Migration path is complex; architect review is needed.
Server is no longer needed – decommission.
The Project Manager can determine the refresh path and call
the Architect only when needed.
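The four paths can be sketched as a small triage function. This is only an illustration: the field names (still_needed, standard_migration, sensitive) are hypothetical and stand in for whatever criteria the PM and architect agreed on.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    STANDARD = "standard, no architect review"
    STANDARD_NOTIFY = "standard, architect notified"
    COMPLEX = "complex, architect review needed"
    DECOMMISSION = "server no longer needed"


@dataclass
class Server:
    # All three flags are hypothetical inputs gathered from the client.
    name: str
    still_needed: bool
    standard_migration: bool
    sensitive: bool


def triage(server: Server) -> Path:
    """Pick one of the four migration paths for a server."""
    if not server.still_needed:
        return Path.DECOMMISSION
    if not server.standard_migration:
        return Path.COMPLEX
    if server.sensitive:
        return Path.STANDARD_NOTIFY
    return Path.STANDARD


def architect_review_needed(path: Path) -> bool:
    """Only the complex path puts the architect in the review."""
    return path is Path.COMPLEX
```

With a rule like this, the PM calls the architect only when triage returns the complex path; a sensitive but standard server triggers a notification only.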
20. Workflow Decision Points
Backlog
Review with System Owner
Fresh Build? (Yes / No)
Get approval from SO for P2V
Fresh Build Virtual? (Yes / No)
Notify Architect
Open IDR
Server build
SME to open CR
SME to initiate communication with SO. Identify risks and SRT
PM update IS Calendar and send update notification
LMR
Migrate server P2V
Decomm
Remove from Support contract
Open DCD
Customer to provide official P2P Justification
Submit justification for Exec Approval
Fresh Build Physical Approved? (Yes / No)
Engage Architect to verify P2P needs
Fresh Build Physical Approved? (Yes / No)
Open DCD
Purchase HW
Receive HW
LMR
Migrate
Open IDR
Swim lanes: Fresh Build Virtual, Fresh Build Physical
Elevate the Constraint
The architect is now needed for only a small portion of the servers.
The PM can determine when the architect is needed.
21. Exploit the constraint - New Current Reality
The result:
In 70% of the cases, architect review was not needed. The team could proceed
without waiting for the architect.
In only 20% of the cases was the architect needed.
10% were servers that were initially thought to be “standard” but turned out to be
complex after initial investigation.
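A rough way to see why this matters for throughput: if the architect is the only constraint and every review costs the same architect time, the throughput ceiling scales with the inverse of the share of servers that still need the architect. The sketch below is a simplified model under those assumptions, not the measurement behind the figures reported in this deck. It also assumes that the 10% of reclassified servers end up needing the architect.

```python
def architect_bound_multiplier(fraction_needing_architect: float) -> float:
    """Upper bound on the throughput gain when only a fraction of servers
    still consume architect time (previously, every server did)."""
    if not 0.0 < fraction_needing_architect <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction_needing_architect


# Shares taken from the results above.
architect_needed = 0.20
reclassified_as_complex = 0.10  # assumption: these also need the architect

fraction = architect_needed + reclassified_as_complex
multiplier = architect_bound_multiplier(fraction)
print(f"Architect-bound throughput ceiling: about {multiplier:.1f}x the old rate")
```

In practice other resources limit the flow as well, so the real gain depends on where the next constraint appears.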
22. Exploit the constraint - New Current Reality
By moving the decision on the migration path to the PM, we increased throughput by 300%, since only a small portion of the refreshes needed time from the busiest resource (the architect).
Preparation email sent to clients, letting them know about the project and potential refresh plans
System pulled from backlog
Schedule meeting with client only
Based on client input, the PM can decide on the refresh path:
· Initiate P2V activities
· Initiate FBV activities
· FBP – Architect required, schedule follow-up meeting
· Initiate Decommission activities
FBP Approved? (Yes / No)
If yes, initiate FBP activities
23. Subordinate to the constraint
Understanding the different paths allowed us to release
work into the system based on architecture needs.
If a server was standard, there were no issues (constraint-wise) releasing work.
If a server was not standard, work was released only when the architect had
capacity.
If the architect was not available, we could pull another item from the backlog and
initiate the refresh process.
WIP limits were enforced.
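The release rule above can be sketched as a small function. The names (Story, pull_next, architect_slots) are hypothetical, and the single WIP counter is a simplification; the board could just as well enforce limits per stage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Story:
    server: str
    needs_architect: bool


def pull_next(backlog: List[Story], wip: int, wip_limit: int,
              architect_slots: int) -> Optional[Story]:
    """Release the first backlog story the system can absorb right now."""
    if wip >= wip_limit:
        return None  # WIP limit reached: release nothing
    for index, story in enumerate(backlog):
        if story.needs_architect and architect_slots <= 0:
            continue  # architect unavailable: look further down the backlog
        return backlog.pop(index)
    return None
```

For example, with the architect fully booked, a standard server further down the backlog is released ahead of a complex one at the top, which stays in the backlog until an architect slot opens.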
24. Elevate the constraint
Budget constraints do not allow for adding architects to the team.
However, by working closely with the architect, we identified more
conditions under which the team could make the proper decisions and
turn to the architect only for approval.
27. To sprint or not to sprint, that is the
question.
How to select project methodology?
28. Waterfall? Agile/Scrum? Critical Chain?
Problem: Selecting a project management approach was not a
straightforward task for the server refresh project.
We identified 12 distinct stages that servers went through, from backlog
to removal from support.
Because of the high dependency on external and internal groups, we could not
identify a time frame for each task or for the entire chain.
A client might select a time window for migration several weeks in the future.
29. Agile/Scrum?
We could not time-box most of the tasks, which eliminated
most Agile approaches:
- No way to time-box a sprint
- No minimum viable product
30. Critical Chain? Waterfall?
Critical Chain was also eliminated since we could not create buffers or
build a chain for each server.
The complexity of running each server as a sub project created a large
overhead and we decided not to go with the traditional waterfall
approach.
32. Kanban? Well sort of…
Flexibility of running multiple stories
through a common work flow combined
with visual management and flow control.
33. Kanban Tools: JIRA vs. Rally
Two systems were examined: Rally and JIRA.
Our decision to go with Rally was mainly due to the fact that it looked
better, but also because JIRA belongs to the IT department while Rally
belongs to the portfolio team, which we could access directly.
34. Using Rally to implement Kanban
We implemented the Kanban board with swim lanes for each refresh
approach and with a column for each one of the stages.
All of the servers were imported in as stories and assigned to the backlog.
Each story contained the basic information for the server as well as notes,
discussion and additional custom fields to better manage the stories.
36. Controlling the flow.
Unlike in traditional Kanban, we could not use the built-in pull
functionality, since most external groups did not have access to Rally
(and did not want to be bothered with it).
The board was used primarily by the project manager, who moved the user
stories on the board.
37. Choking the release
In manufacturing, we know that one of the main issues we face on
the production floor is an overflow of work in process (WIP).
During the previous years of the project, the push was to start as much work
as early as possible, since the time to complete took forever.
The result was that work in progress grew, people could not keep up with the
work, and confusion was commonplace.
Furthermore, since the work was managed via Excel, communication went back and
forth over many channels in an effort to manage the work and remember
what actions had taken place.
38. Keeping track of work in progress
Visual Management
A visual approach to managing the queue made it simple to understand where
each story is in the process and which work stream it is taking.
Each ticket has a full history of work done, notes and emails attached.
The system made it easy to see what stage each server was in,
what action needed to take place and when (we added a field called
watch date to identify when the next step should take place).
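As an illustration of how a watch date field can drive the PM's daily review, the sketch below lists the stories whose next step is due. The record layout and the stage names in the usage example are hypothetical; this is not Rally's data model or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Story:
    server: str
    stage: str        # hypothetical stage name on the board
    watch_date: date  # when the next step should take place


def due_for_action(stories: List[Story], today: date) -> List[Story]:
    """Return stories whose watch date has arrived, oldest first."""
    due = [s for s in stories if s.watch_date <= today]
    return sorted(due, key=lambda s: s.watch_date)
```

Calling due_for_action with today's date gives the PM a short, ordered list of servers to follow up on instead of a scan of the whole board.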
40. Definition of Done - Current Reality:
Some servers that entered the process never completed and were stuck in one
stage or another.
Most of the stuck servers were ones whose application had been
migrated off but that no one had retired.
From the application team’s point of view, the work was done.
From an infrastructure point of view, the server was still up and running,
consuming power and space and still on the support list (i.e. incurring annual
support costs via the support contract).
41. Definition of Done - updated
A story is DONE only when the server is removed from the
support list.
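The updated definition can be expressed as a simple check against the support list. The sketch below is illustrative: the stage names and the shape of the inputs are assumptions, not the project's actual records.

```python
from typing import Dict, List, Set

# Hypothetical stages at which a team might consider its own work finished.
FINISHED_STAGES = {"migrated", "decommissioned"}


def is_done(server: str, support_list: Set[str]) -> bool:
    """A story is done only when the server is off the support list."""
    return server not in support_list


def find_not_done(stage_by_server: Dict[str, str],
                  support_list: Set[str]) -> List[str]:
    """Servers reported as finished that are still on the support list."""
    return sorted(
        server
        for server, stage in stage_by_server.items()
        if stage in FINISHED_STAGES and not is_done(server, support_list)
    )
```

A review of this kind surfaces servers that look finished on the board but are still incurring support costs.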
42. Definition of done - results
By changing the definition and reviewing the records, we identified
numerous servers that were “NOT DONE”:
Previous initiatives left many servers online when they were no longer used.
Previously decommissioned servers were not always removed from the support
contract.
Removing these servers from the support contract led to an
immediate savings of $500K.
Additional reviews are taking place to continue to identify servers that are “not done”.
44. Current Organization and Control
Excel Spreadsheet(s)
Multiple documentation sources
Tribal knowledge
SharePoint pages
45. Existing project
documentation and work
flow were listed across several
pages of an Excel template.
Difficult to read
Difficult to follow
Long Range Planning and Maintenance
Project Justification - Business Case
Accountable Team Deliverable T-Minus
Define Refresh Candidates
Define refresh candidates based upon installation dates noting the lifecycle 4, or 5+ years LRPM LRPM Inventory Collection Master 6 Months
Meet with AHS architects to determine the feasibility of deferring each candidate. Determine any
new innovations or other capacity considerations
LRPM LRPM Inventory Collection Master 6 Months
Finalize candidate selection with AHS Architecture concurrence LRPM LRPM Inventory Collection Master 5 Months
Request detailed capacity analysis for each Refresh Server PM/Server Team Capacity Analysis by System 5 Months
Assess Server Dispositions for Budgetary Purposes
Determine dispositions of servers:
Decommission, Migrate to USLV, Project Sponsored, P2P, P2V, No Refresh due to Technical Issues,
No Refresh due to Deferment
PM, SOs LRPM Inventory Collection Master 6 Months
Special architect/executive signoff for P2P dispositions PM, Exe Mgmt
Architect signoff on all dispositions PM, Architects 4 Months
Risks
Request do-nothing vendor maintenance/support costs maintenance expiration dates required for
a cash flow analysis
LRPM
Per executive management no approval
for extended warranty
4 Months
Financials - 5 Year Forecast
Request capacity-as-is budgetary quotes from vendors for preliminary CapEx and OpEx estimates LRPM
Financial Master Server Refresh -
available upon request
4 Months
Request quotes, 1Y Support, Years 2-4 support, and professional services costs, if any. LRPM LRPM PRA Master Worksheet 4 Months
Create detailed cash flow analysis on each candidate LRPM LRPM Cash Flow Analysis 3 Months
Create detailed financial worksheet leveraging LRPM templates. Note each Amgen region is a
separate tab, with a summary tab
LRPM
Financial Master Server Refresh -
available upon request
3 Months
Business Case
Create Business Case with detailed data driven metrics, benefits proceeding, risks not proceeding,
all costs, etc.
LRPM Global Server Refresh Business Case 3 Months
Approval Sign-off
LRPM team to peer review decks prior to submitting to Amgen, then schedule one hour with the
LRPM FTE lead , followed by another meeting with Tower/Service Lead
LRPM Sign-off Approval 2 Months
Create Project Request Authorization (PRA) Finance PRA 2 Months
Initiation Processes
Accountable Team Deliverable T-Minus
Project Charter, Project Scope
Develop Project Charter LRPM LRPM Server Refresh Overview 2 Months
Validate Project Scope and obtain buy off LRPM 2 Months
Budget Plan
Architecture to perform final review and acceptance of each candidate LRPM Kick-off meeting, Email confirmation 6 Weeks
Create current year LRPM Financial Runbook and project financial Masters for each project, each
region
LRPM
LRPM Financial Runbook, Project
Financial Masters
6 Weeks
Create and maintain monthly dashboards LRPM Dashboards 6 Weeks
Validate EPPM financials are accurate, PRA and PR numbers match T-Codes, update EPPM the first
week of each month
LRPM Financial Master Worksheet 6 Weeks
Obtain Quotes and Select Vendor(s)
AHS Architecture selects the vendor(s) AHS NA 6 Weeks
On purchases >$250K, engage Global Sourcing Solution (GSS) and submit RFP from Architecture for
vendor competition
LRPM RFP 4 Weeks
Request final vendor quotes via architecture, if <250K, schedule selection meeting with Tower
lead, Architecture, Project PM
LRPM NA 4 Weeks
If >$250K schedule meeting to discuss with Architecture, GSS, Tower Lead, Project PM LRPM NA 4 Weeks
Confirm with tower lead funds exist in Operations budget for Support Years 2-4 (Project funds
Year 1)
LRPM Purchase Request Form 4 Weeks
Collect technical specifications from vendors LRPM NA 3 Weeks
Ensure space and power availability in target data center LRPM DCD 3 Weeks
Project Initialization
Create project in EPPM (Enterprise Project and Portfolio Management). System generates PR
Code
PM NA 3 Weeks
Request T-Codes are created for each project PM Finance SharePoint Site 6 Weeks
Coordinate T-code number alignment with EPPM Project (PR Creation) PM LRPM PRA Master Worksheet 3 Weeks
Coordinate WBS code creation LRPM Email 3 Weeks
Procurement Process
LRPM P2V migrations: Purchase additional virtual licenses or blade servers as needed for the
migration
PM
LRPM P2P migrations: purchase replacement server PM
Submit Purchase Request Approvals (PRA) for signature. Add LRPM team as Watchers LRPM Hardcopy signatures required 3 Weeks
Confirm funds are available in SAP LRPM NA 3 Weeks
Create PRF (Purchase Request Form), then peer review. LRPM PRF 3 Weeks
Send FTE Manager an email that a X dollar spend will be submitted with brief description LRPM email 3 Weeks
Send PRF and quote to eFinity focal for PO submission review/approvals, following approval flow
at each step
LRPM Email to Focal 3 Weeks
Log requisition, PO number, and CapEx/OpEx in LRPM Master tracker by Region LRPM LRPM Master Tracker 2 Weeks
Upon PO creation, communicate with vendor and request delivery ETA LRPM Email 2 Weeks
Planning Processes
Accountable Team Deliverable T-Minus
CMDB Impact Analysis - Refreshed Weekly on Thursdays T-Minus 6 Weeks, 5, 4, etc.
Pull Impact Analysis LRPM Impact Analysis Document 6 Weeks & Weekly
Manipulate Impact Analysis into customized LRPM template LRPM Impact Analysis Document 6 Weeks & Weekly
Create Contacts' List LRPM Impact Analysis Document 6 Weeks & Weekly
Determine all stakeholders and add them to the Contacts' list LRPM Impact Analysis Document 6 Weeks & Weekly
Determine SMEs scheduled for refresh project; add them to the Contacts' list LRPM Migration Playbook 6 Weeks & Weekly
Email Server disposition worksheets to system owners. Conduct workshop meetings to review
process. LRPM Migration Playbook
6 Weeks & Weekly
Risk Planning
Risks associated with deferment
Determine when OEM support will end on the EOSL date. PM Risk Plan
Determine risks when server support will be a best effort for hardware. PM
Schedule meetings as requested by the SO to help rectify any issues for this selection. PM, SOs
Risks associated when technical issues prevent refresh
Server Refresh PM works with the site leads to rectify this issue PM, Server LRPM Risk Plan
Find workable solution with infrastructure SO, Application SO and Manufacturing PM, SOs,
Assess risks when no solution and hardware support ends
PM, SOs,
Manufacturing
Risk Plan Signoff PM, Architects
Requirements Planning
Ensure for all USLV migrations, server meets USLV requirements PM, Server LRPM
Ensure for all P2V migrations, server is eligible to be migrated to a Virtual environment PM, Server LRPM
Identify Dependencies
For Project-sponsored refresh, coordinate with project-sponsored PM to supply new server PMs
Check IS Maintenance Calendar & Patching Schedules for conflicts
Identify cross domain access requirements PM / Refresh Team Dependency Workbooks Project Start
Check if there are any home directories as part of the move PM / Refresh Team Dependency Workbooks Project Start
Prepare list of all systems PM / Refresh Team Migration Workbook(s) Masters Project Start
Identify other dependencies PM / Refresh Team Task Tracking Project Start
Remediate dependencies PM / Refresh Team Task Tracking Project Start, then Weekly
Workbooks
Prepare Discovery Workbooks Refresh Team Discovery Workbooks Project Start
Prepare Refresh Workbooks Refresh Team Discovery Workbooks Project Start
Meet with resources to ensure complete understanding of migration and tasks Refresh Team Task Tracking Master 5 days
Schedule SME Resources and Submit Tickets
Confirm resource availability LRPM Task Tracking Master 7 days
Track IRR Task numbers and assignments. PM Task Tracking Master Project Start + 3 Weeks
Technical Plans
Define tools/methodology to be used for refresh Refresh Team Migration Workbook(s) Masters Project Start
Build detailed refresh procedure Refresh Team Migration Workbook after baseline has completed
Build detailed back out plan Refresh Team Migration Workbook after baseline has completed
Perform procedure quality check Refresh Team Migration Workbook 4 weeks
Verify and capture workloads of systems source and target systems Refresh Team Migration Workbook 5 days
Supporting Documentation
Document (or video capture) all LRPM core competencies How-to Repository
Communications
Create LMR Presentation deck LRPM IS Release and Deployment Template 8 Weeks before refresh
Create IS Release and Maintenance calendar entry LRPM LMR Deck 6 weeks
Initiate/train managers and other stakeholders on how to read a CMDB Impact Analysis LRPM Customized Impact Analysis As requested
Email peer-reviewed Impact Notification LRPM Impact Notification Email 8 Weeks before refresh
Email stakeholders of upcoming refresh PM Manager's Email 8 Weeks before refresh
Create FAQs LRPM Lifecycle Refresh Outage FAQ 8 Weeks before refresh
Q&A invitational migration cutover meeting LRPM Meeting Invitation 1 week
Execution Processes
Accountable Team Deliverable T-Minus
Hardware Installation
Coordinate HW delivery ETA with Data Center lead LRPM DCD 10 Weeks
Coordinate rack and stack, installation, configuration, and qualifications LRPM 6 weeks
Final Health Check LRPM 6 weeks
Qualification Complete Refresh Team 6 weeks
Migration Cutover
Perform final refresh plan Refresh Team day of
Ensure both Server structures mirror each other Refresh Team day of
Ensure PlateSpin completion Refresh Team day of
Host Lync bridge during refresh activities (if required) LRPM day of
Coordinate technical resources to meet program schedules LRPM day of
Troubleshoot issues Refresh Team day of
Communications
Communicate to key stakeholders ongoing status of Refresh PM day of
Stay on Lync Conference Bridge (if required) PM /Refresh Team day of
46. Work flow diagrams
Documentation of the process was a key deliverable for the project. However, as
we know from Agile, over-documentation means that no one will
read the document.
A visual representation gives the user the ability to understand the work flow
and responsibilities.
Solution: we created a visual process flow and linked each stage to a short one-page
explanation of the stage and how to reproduce the results.
47. Visual Representation of Work Flow
DCD Process
Swim lanes: Refresh PM, DCD PM, SME, Architect, Client
Complete Fresh Build Check List
Open IDR
Build DCD
Build Summary Build Document
Approve Build Doc
Submit DCD
Create New server CI in CMDB
Receive Server name from Refresh PM
Review and Approve DCD
Kick off Build (VM) or Installation Process (Physical)
Attach Application CI to Server (Physical)
Deliver server to Production
Migrate Application
Decomm Server
Remove from Support
Workflow decision points diagram (as on slide 20)
48. Visual Representation
Diagram linked to detailed explanation
Clicking on any cell
opens an explanation
slide
50. Oversight
At the onset of the project, the amount of confusion made
managing and controlling the process complicated and cumbersome.
Using the visual approach, management could quickly
understand the process and the work load at each stage.
51. Throughput
The 2016 server refresh program is in full swing.
During the first 3 months of the project we managed to refresh more
servers than were in scope in 2015, and we constantly add servers to the
scope.
With the Kanban approach, the backlog is flexible enough to include additional
work as it flows in.
The reduction in overhead released PM capacity to include additional servers
that were not in scope and increase the throughput of the system.
52. Savings
Reducing support contract costs.
Annual support contract costs were reduced by removing old hardware
and replacing it with servers that are under warranty.
Reducing physical footprint.
By virtualizing 80%+ of the servers we reduced the physical footprint
needed to support the environment, leading to indirect savings in space,
power, cooling and support.
Retiring old infrastructure.
Server owners who were pushed to refresh old infrastructure often decided
that they no longer needed the environment and retired the servers without
spinning up a replacement.
53. Savings
Identifying the servers that were not done and removing them from
support.
One of the unexpected benefits of the project was the change in the definition of
done. By changing the definition and reviewing the records, we
identified numerous servers that were migrated but not removed from
support.
Removing these servers from the support contract led to an immediate
savings of $500K.
54. Customer and team member satisfaction
The reduction in the need for architecture input freed the architects
to deal with other issues (their “day to day” work), which increased their
satisfaction and willingness to support the project.
The Kanban approach gave the PM the ability to reach out to customers
only when needed, with fully tracked history and progress, which provided
the customers with clear and concise progress and planning options.