
    • Strategic approach to ascertain accurate decisions after unplanned service outage in Telecom operations
      Submitted to PMI National Conference 2013
      Author: Deep Kamal Singh
    • Table of Contents
      1. Abstract
      2. Keywords
      3. Introduction
        3.1. Objective of the research and paper
      4. The Concept
      5. Case Study
        5.1. Case Study: Activity description
        5.2. Case Study: Activity execution
        5.3. Key observation and learning drawn from the event
      6. Evolution of the strategic approach
        6.1. Identifying activity phases where things can go wrong
          6.1.1. Planning phase
          6.1.2. Executing phase and maintenance phase
      7. Conclusion
      Table of Figures
        Figure-1: Phases of activity
        Figure-2: Strategic approach to ascertain accurate decisions after unplanned service outage
      Tables
        Table 1 – Technical Teams involved in activity
        Table 2 – Case Study: Activity timelines and major milestones
        Table 3 – Sample Service Impact matrix
    • 1. Abstract
      After 50 years and four generations of mobile technology, subject matter experts have established standard operating procedures for every activity executed in the ever-improving telecom ecosystem. System failures still happen, however, and no operator is untouched by unforeseen outages that impact services to end subscribers. A team of project managers was formed with the objective of defining an effective action plan to follow during unplanned service outages. The team was involved in the planning and execution phases of various activities in a mobile network, and ideas and learnings were exchanged between the technical managers and the PMs of all teams involved. The team adopted a qualitative and interpretive approach to consolidate the steps for resolving a crisis in the least possible time while minimizing revenue loss. Based on these learnings, a general action plan was formulated that stakeholders can follow to take the best decisions during service outages in telecom operations. Currently practiced recovery approaches are native to a specific domain or subsystem of a service industry; there is no commonly applicable practice for taking the right decision in any service outage. This paper is therefore an initiative to construct a standard action plan, applicable to the overall telecom ecosystem, for project managers and apex decision makers to refer to when sudden, unexpected outages and service downtime occur in an otherwise well-functioning environment. Although this paper is based on explorations of activities in telecom network operations, its guidelines and underlying approach can be used by any service industry in case of unplanned outages.
      2. Keywords
      Risk planning, minimizing uncertainty, revenue loss control, decision making
    • 3. Introduction
      The telecommunication ecosystem is a varying and ever-improving industry. For reasons such as business growth, technology implementation, adaptability, innovation and cost control, upgrades, tune-ups and additions keep happening in the network. For the same reasons, activities are regularly planned and executed, ranging from a small node-level change to a complete subsystem replacement or the introduction of an entirely new setup. Continuous change subjects the network and the business to risks of service outages and revenue impact. To ensure business continuity and keep revenue impact to a minimum, technical teams prepare a step-wise execution plan before every change, which is then reviewed and approved by all relevant stakeholders. Often an activity implements changes that affect several services, which in turn requires several technical teams to align, coordinate and prepare their action steps in sync with one another. In such scenarios, when an unexpected and unplanned variation occurs, it becomes paramount to define the next course of action with the actual need and the impact to business in mind. Decision makers then need all technical and business-critical information together, so that they can quickly assess not only all possible decisions but also which decision leads to the lowest cost to business and the smallest revenue impact.
      Activity managers and technical specialists are well-qualified experts in their field. They can readily work out key planning factors such as steps of execution, a detailed activity plan and estimated outages. Up to a certain level they can also highlight the risks and the probable phases where an unplanned service outage might occur during the activity. However, it is neither their domain nor their expertise to analyze and predict business impact in terms of revenue when an unplanned outage occurs. On the other hand, business-driving functions such as sales and marketing, revenue assurance and customer services are well equipped with analytical data, which lets them calculate revenue inflow from a live service accurately and so estimate the revenue impact when a service is down. However, they lack the technical insight to understand the risk factors involved in a planned activity: they cannot predict which service is more prone to disruption than another, nor are they aware of the overhead cost to business if the activity is delayed or cancelled. When an unplanned service outage occurs during a complex activity, it becomes increasingly difficult to arrive at the best decision, both because of the gap in domain boundaries between the teams involved and because there is no cross-reference matrix that helps judge quickly whether to go ahead, delay or roll back the activity.
    • Activity owners must therefore compile, well before the activity, a reference approach with all necessary analysis, to be followed in case of unexpected downtime or service outages. This paper is an initiative to establish a common reference approach that can be used during unexpected outages in a planned activity to reach an appropriate decision that ensures the least cost to business.
    • 3.1. Objective of the research and paper
      To establish whether a methodical approach and standard operating procedure can be formulated which, when followed in the event of an unplanned and unexpected service outage, ensures that the best possible decision is taken in the least time.
      4. The Concept
      As described in the previous section, regular activities in the telecom domain require planned outages of live services. For every activity, the technical managers involved perform diligent analysis and prepare documents covering step-wise execution in great detail. The services that must implement changes at their own end because of the activity are also required to plan accordingly. The technical teams owning each service therefore prepare their own set of activity steps, and the project manager of the activity collates these into a combined execution plan with a team name and ownership assigned against each step. This document is signed off and approved by all stakeholders, so that everyone stays informed of the downtimes and service outages involved and of the step-wise execution planned in sequence and in parallel across all services.
      However, when an unexpected outage occurs during execution, or when some services do not function properly after a step, the first priority becomes deciding the next step. Technical teams get engulfed in finding the root cause and then a solution, whereas the parallel work on other connected systems and services may lose direction, because those teams have no clear understanding of whether to go ahead, hold or stop completely.
      Ownership of decision making is also not apparent in such situations. The owners of the malfunctioning connected service will not favor carrying on with the rest of the activity unless they know the root cause and the estimated time to implement a solution. The core team conducting the main track of the activity will always suggest carrying on, so that their planned timelines are not impacted and the services that are functioning properly face no delay. The business team can actually mandate the decision here, but it lacks the technical insight to take an informed decision. Events like this therefore severely consume the time approved for service outages. Lack of timely decision making also sets off a chain reaction that extends the outage duration even for services that are progressing as planned, and as a result the business has to face unestimated, unaccounted consequential costs.
      For example, consider a complex activity in which a core system is upgraded to a higher version. When the upgraded core system is switched on for initial testing, it is found that, among many connected services, one particular service is not able to connect to the new version. This behavior was not expected, and the launch of the new version is put on hold while the core team awaits management's 'go ahead' decision. It is now imperative to take the most appropriate decision in the least possible time, as the cost to business increases minute by minute due to the outage of all connected services. Decision makers need to be able to derive the next course of action:
      1. Go live and end the downtime of all other services, except the disrupted one.
    • 2. Delay the go-live until the problem is resolved, thus increasing the downtime of all connected services.
      3. Or simply roll back the complete version upgrade, so that all services function as they did before the activity, and plan the upgrade again, increasing the cost of the activity.
      How decision makers can decide which of the above options is most beneficial to business when a crisis occurs during a planned activity was the key challenge identified in the joint planning sessions.
    • 5. Case Study
      In Q3 2010, a leading telecom operator in India planned a nationwide upgrade of its core prepaid billing system (BSS). The activity was to be executed in 14 telecom circles across India and was complex, involving hardware additions and retrofits and a complete software upgrade. During the planning phase of this project, the group of project managers and functional heads jointly began to estimate the risks involved in each implementation and to prepare a mitigation approach. Out of many joint sessions on this subject, the concept was born that became the prime subject of this paper.
      The operator had 14 sites running on the old billing system, so the same upgrade was planned for all sites. Because different teams manage the same services at each site, a great deal of planning, coordination, testing and control was required at each site to execute the activity. Every step of the activity was therefore properly documented, reviewed and approved by all stakeholders.
      Even with this level of planning and coordination, various deviations were observed during execution at the first few sites. These increased the cost of the activity to business and caused sudden revenue losses. With every implementation, the technical teams captured the necessary learnings to ensure the same problems were not faced again in the next implementation, and the project managers and functional heads observed that a reliable approach is required to ensure critical decisions are taken in the least possible time.
      5.1. Case Study: Activity description
      Activity: Core billing system (IN system) upgrade to a newer version. This activity requires a complete outage of the billing system for 8 hours. Since the billing system stays unavailable, the connected services listed below also face downtime:
      1. Voice calls: Local / National long distance / International long distance
      2. SMS
      3. Data browsing
      4. Real-time data charging
      5. Recharges: Voucher recharge and E-topup
      6. USSD
      7. Unified subscriber life cycle management: Activation, churn, daily jobs and other offline processes
      8. Business reporting: MIS
    • A complex activity involves changes at more than one functional system and thus requires many technical teams to work in a coordinated and controlled manner. Table 1 lists the technical teams that were involved in the activity discussed here.

      SNO | Service | Owning Team
      1 | Voice calls | Circle Team
      2 | SMS | Circle Team
      3 | Data Browsing | Circle Team
      4 | Real-time data charging | Data Charging team
      5 | Voucher recharge | Billing System Team
      6 | ETOPUP | ETOPUP Team
      7 | USSD | USSD Team
      8 | Unified Subscriber life cycle management | Unify Team
      9 | Customer Service | CS Team
      10 | Revenue assurance and CDR analysis | RA Team
      11 | Business Reporting | Mediation Team

      Table 1 – Technical Teams involved in activity

      Due to the version upgrade of the billing system, the underlying communication protocol between the billing system and the connected services also changes at several layers. This demands a parallel upgrade of the IT applications: ETOPUP, USSD, the Unified app and online data charging. After several weeks of in-depth analysis and joint solution development sessions involving all functional team managers, it was concluded that the IT applications would have to upgrade their clients in parallel during the activity night to support the new billing system, and the timelines were finalized accordingly.
    • Table 2 gives a high-level view of the activity and its timelines, showing the functional teams involved in the complex upgrade.

      Day | Time | Activity | Team Ownership
      D-1 | 18:00 | Subscriber provisioning will be stopped (new subscriber creation and deletion will be stopped) | Unify Team
      D-1 | 22:00 | Etopup and paper recharge will be stopped | Etopup Team, Billing System team
      D-1 | 22:00 | 'Core balance <= 0' subscriber base dump with IMSI details to be shared with Switch team for barring at HLR | Circle Team
      D-1 | 22:00 | All changes from any node towards billing system will be stopped | Billing System team
      D-1 | 23:30 | Billing system's interface for incoming connectivity will be stopped (all IT apps' communication towards IN will stop) | Billing System team
      D | 00:00 | DOWNTIME: billing system bypass for local and national voice calls and SMS | Circle Team
      D | 00:15 | Billing system will be out of service after closing all CDR files | Billing System team
      D | 01:00 | Complete subscriber and service data dump to be provided to RA | Billing System team
      D | 00:30 | Billing system upgrade start | Billing System team
      D | 02:50 | Information given to all IT teams (Data charging, ETOPUP, USSD, Unified processes and other downstream systems) to get their applications ready for new version of billing system | Etopup Team, USSD Team, Unify Team, Data charging Team, Customer Service Team, Billing System team
      D | 05:00 | Confirmation of completion of activity from all IT teams | Etopup Team, USSD Team, Unify Team, Data charging Team, Customer Service Team
      D | 05:30 | Billing system upgrade complete | Billing System team
      D | 05:30 | Post-upgrade billing system data dumps to be provided to RA for reconciliation | Billing System team
      D | 05:45 | RA to confirm on provided data and give go-ahead | RA Team
      D | 05:45 | Test traffic to be routed on upgraded billing system | Circle Team
      D | 05:45 | UAT on critical products will be started by Customer Service/RA/ETOPUP/Roaming/ICR/USSD/Data charging/Unified teams | Etopup Team, USSD Team, Unify Team, Data charging Team, Customer Service Team, Billing System team
      D | 06:25 | CDR will be shared with RA team for testing number | RA Team
      D | 06:30 | Go-ahead confirmation will be given by Business UAT team | Business Team
      D | 06:35 | Final go-live confirmation from management team | CxO Team
      D | 07:00 | Billing system bypass will be removed and system will start handling live traffic, ending downtime | Circle Team
      D | 17:00 | Complete product and services testing to be completed | CS Team, RA Team, Billing System team

      Table 2 – Case Study: Activity timelines and major milestones
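A timeline such as Table 2 can also be held as data, so that the planned downtime window and any rows listed out of time order can be checked before sign-off. The following is a minimal sketch, assuming a subset of the Table 2 milestones with shortened labels; the anchor date and all function names are illustrative assumptions, not part of the original activity plan.

```python
from datetime import datetime, timedelta

# Subset of Table 2 rows: (day offset, "HH:MM", shortened label).
# Day 0 is the activity day "D"; -1 is "D-1".
MILESTONES = [
    (-1, "18:00", "Subscriber provisioning stopped"),
    (-1, "22:00", "Recharges stopped"),
    (-1, "23:30", "Incoming billing interfaces stopped"),
    (0, "00:00", "Downtime starts"),
    (0, "00:15", "Billing system out of service"),
    (0, "01:00", "Data dump to RA"),
    (0, "00:30", "Billing system upgrade start"),
    (0, "05:30", "Billing system upgrade complete"),
    (0, "06:35", "Final go-live confirmation"),
    (0, "07:00", "Downtime ends"),
]

# Arbitrary anchor date (assumption); only differences between times matter.
BASE_DAY = datetime(2000, 1, 2)


def to_datetime(day_offset, hhmm):
    """Convert a (day offset, 'HH:MM') pair into an absolute datetime."""
    hours, minutes = map(int, hhmm.split(":"))
    return BASE_DAY + timedelta(days=day_offset, hours=hours, minutes=minutes)


def planned_minutes_between(milestones, start_label, end_label):
    """Planned duration in minutes between two labelled milestones."""
    times = {label: to_datetime(day, hhmm) for day, hhmm, label in milestones}
    return int((times[end_label] - times[start_label]).total_seconds() // 60)


def out_of_order(milestones):
    """Labels of rows that are timed earlier than the row listed just before them."""
    flagged = []
    for prev, cur in zip(milestones, milestones[1:]):
        if to_datetime(cur[0], cur[1]) < to_datetime(prev[0], prev[1]):
            flagged.append(cur[2])
    return flagged


downtime = planned_minutes_between(MILESTONES, "Downtime starts", "Downtime ends")
misplaced = out_of_order(MILESTONES)
```

With these rows, `downtime` is the planned window between the bypass being applied and removed, and `misplaced` names the rows a reviewer may want to re-sequence.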
    • 5.2. Case Study: Activity execution
      Subscriber base: 5.4 million
      The core billing system upgrade was completed as per the schedule projected in the timelines. However, when the new version of the billing system was brought up for testing, it was found that ETOPUP was not able to connect, whereas all other services were able to connect and perform testing at their end. The technical teams started working to find the cause and a solution. Minute by minute, the time allocated for testing before go-live was shrinking for the ETOPUP service, while testing by all other teams progressed alongside. When the time allocated for testing ran out, the testing status from all teams was shared with business, but the problem with the ETOPUP service had still not been found. At this stage the business owner of the ETOPUP service was advocating a rollback of the complete activity. The billing system team was adamant that, since other services were working with the new version, there was no fault at their end and a rollback would not be done. The core business team had no visibility of the technical details to decide the next course of action, and it was not known how long it might take to fix the issue. During these discussions no one actually took ownership to either confirm the go-live or call off the activity and roll back. It is important to realize that the outage of all services (voice calls, SMS, VAS, USSD and others) kept growing, and with every passing minute the revenue loss to business kept rising. After 95 minutes of extended downtime the issue was identified and fixed, and all services were made live with the new version of the billing system.
      5.3. Key observation and learning drawn from the event
      • When it was crucial to identify and compare the increasing revenue losses, the business team did not have enough information to decide the next step.
      • No delay threshold had been predefined and agreed for the critical milestones of the activity, so when one service was delayed in its readiness, the go-live of the other services was consequently delayed as well.
      • The activity involved many stakeholders and cross-functional teams; a specialist group of managers could have been designated and dedicated to helping decision makers take decisions in crisis situations.
      • A methodical approach must be devised for future implementations to ensure quick decision making.
      Based on the learning and experience from the complex implementation at site one, and after several rounds of analysis and review sessions, a commonly applicable reference was evolved. It serves as a general methodological approach for apex decision makers to ensure the right decision is taken in the least possible time during an unexpected outage in the telecom ecosystem.
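The second learning above, a delay threshold agreed in advance for each critical milestone, can be expressed as a simple check. The sketch below is only an illustration: the `Milestone` type, the 30-minute tolerance and the status labels are assumptions, while the 95-minute overrun is the figure reported in the case study.

```python
from dataclasses import dataclass


@dataclass
class Milestone:
    name: str
    planned_minute: int      # minutes after downtime start
    max_delay_minutes: int   # tolerance agreed before the activity (hypothetical value)


def check_milestone(milestone, actual_minute):
    """Classify a milestone as on time, within its agreed delay, or needing escalation."""
    delay = max(0, actual_minute - milestone.planned_minute)
    if delay == 0:
        return "on time"
    if delay <= milestone.max_delay_minutes:
        return "within threshold"
    return "escalate"


# Go-live planned 420 minutes after downtime start (00:00 to 07:00 in Table 2),
# with an assumed 30-minute tolerance.
go_live = Milestone("Go-live", planned_minute=420, max_delay_minutes=30)

# The case study reports 95 minutes of extended downtime.
status = check_milestone(go_live, actual_minute=420 + 95)
```

Once a milestone returns "escalate", the decision no longer waits on the team that is troubleshooting; it moves to the designated decision makers with the revenue figures attached.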
    • 6. Evolution of the strategic approach
      For any planned activity there are three distinct phases in which an unexpected delay or failure can be encountered. An unexpected failure is a total loss to business, whereas in the case of an unexpected delay the revenue loss and the impact on the time-budget balance can be minimized with project management methodologies and best practices. The scope and conclusions of the study covered in this paper apply to minimizing business loss in cases of delay.
      6.1. Identifying activity phases where things can go wrong
      Figure-1: Phases of activity
      The figure illustrates the three phases of the activity with respect to the major cost to business if a delay or failure is experienced. It was concluded that specific preparation is required at every phase to be ready for unexpected outages, and that collecting relevant business data is key to decision making in crisis scenarios. Before the activity is executed, the PM has to ensure that the technical team has prepared discrete figures of the hourly revenue impact associated with each service, and has also included all possible consequential costs to business after any service failure. For example, if the SMS service is not functioning, the average revenue earned by the SMS service over the entire outage duration is the direct revenue loss, whereas subscribers calling customer care to enquire and complain about the disruption are a consequential cost to business for the service downtime.
      On the same principle, the group of PMs deduced the set of readiness points below, which were deemed most important for planning the same activity at the next site or, in general, for planning any change in any system or subsystem of a service industry. The approach below is intended to ensure quick and right decision making during an unexpected outage.
      6.1.1. Planning phase
      • Identify all services where changes will occur due to the planned activity. This is the first and foremost step in planning for the activity and being ready for unexpected variations. While preparing the list of 'to be affected' services, the project manager should quantify the level of impact on each service. A service might be impacted only partially or intermittently, and the total impact to business must be calculated appropriately in such cases.
      • Identify all connected interfaces which will be impacted during the activity.
      • Work out and prepare a service impact matrix as shown in Table 3.
        o This matrix should include the hourly revenue earning potential of every service, along with all consequential costs (whether calculable or incalculable).
      • The service impact matrix must be reviewed and approved by all stakeholders, including technical and business teams.
    • A Sample Service Impact matrix

      Column groups:
        SNO | Service Name | Subscriber Base (in Mn)
        Hourly Revenue Potential [in 100K Rs.]: H01, H02, ... H24
        Indirect cost to business: Customer Care Overhead | Dependent Services impact | Customer Satisfaction Impact
      Only the subscriber base column carries values in the sample rows below.

      SNO | Service Name | Subscriber Base (in Mn)
      1 | Voice Calls | 14.34
      1.1 | Local Home network | 14.34
      1.2 | Local cross network | 14.34
      1.3 | National Long Distance | 09.60
      1.4 | International Long Distance | 01.12
      2 | SMS | 14.34
      2.1 | Local | 02.21
      2.2 | National | 02.21
      2.3 | International | 00.01
      3 | Data Usage | 01.90
      4 | Paper Recharges | 08.14
      5 | ETOPUP | 11.89
      6 | USSD Services | 14.34
      6.1 | Subscriber Info | 14.34
      6.2 | Subscription management | 04.30
      6.3 | VAS Services | 01.80
      7 | Subscriber Life Cycle | 14.34
      7.1 | Subscriber Activation | 01.00
      7.2 | Subscriber Churn | 00.01
      7.3 | Service management | 00.01

      Table 3 – Sample Service Impact matrix

      This table illustrates the importance and criticality of each service in terms of revenue earning potential and cost to business in case of outage. (Data figures shown are indicative and are not real.) The revenue earning potential is distributed over a 24-hour period because service usage varies on an hourly basis. For example, revenue from (3) Data Usage can be higher during H19 to H22 than revenue from (6.3) VAS Services during the same period. This data helps decision makers perform a calculated analysis and take a go or no-go call when, during an activity, the VAS service is down but Data Usage is working fine.
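The hourly columns of the service impact matrix lend themselves to a small lookup: given a service and an outage window, sum the revenue potential at risk. This is a minimal sketch under stated assumptions: the hourly figures are invented for illustration (the paper's own sample leaves them indicative), H01 to H24 are treated as hour slots 1 to 24, and the helper names are hypothetical.

```python
HOURS = 24


def flat_profile(value):
    """Same hourly revenue potential for H01..H24."""
    return [value] * HOURS


def evening_peak(base, peak, start=19, end=22):
    """Hourly profile with a higher value during H{start}..H{end} inclusive."""
    return [peak if start <= h <= end else base for h in range(1, HOURS + 1)]


# Hypothetical hourly revenue potential, in units of 100K Rs. as in Table 3.
IMPACT_MATRIX = {
    "Data Usage": evening_peak(base=2.0, peak=6.0),
    "VAS Services": flat_profile(1.0),
}


def revenue_at_risk(matrix, service, first_hour, last_hour):
    """Sum the hourly potential of a service for H{first_hour}..H{last_hour} inclusive."""
    profile = matrix[service]
    return sum(profile[h - 1] for h in range(first_hour, last_hour + 1))


data_loss = revenue_at_risk(IMPACT_MATRIX, "Data Usage", 19, 22)
vas_loss = revenue_at_risk(IMPACT_MATRIX, "VAS Services", 19, 22)
```

Under these assumed figures an outage of Data Usage during H19 to H22 puts far more revenue at risk than the same outage of VAS Services, which is the kind of comparison the matrix is meant to make immediate.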
    • 6.1.2. Executing phase and maintenance phase
      • Identify affected services: whenever a variation from the planned activities is observed, all affected services must be identified to calculate the magnitude of impact, and the revenue-impacting factors for each service must be evaluated.
      • Identify the affected subscriber base and the probable service outage duration.
        o A particular service may be only partially affected, with only a limited set of users impacted, or the problem could be intermittent. In an outage, the first priority of the technical teams is to prepare and share these statistics.
        o The expected outage duration of each service is another important input for comparing the revenue losses associated with each feasible decision.
      • Calculate the total revenue impact for each affected service, taking the affected subscriber base into account, with the help of the service impact matrix. Considering the revenue-impacting factors associated with the activity, the following equation can be deduced. Let
          R(S_1), R(S_2), ..., R(S_n) = average revenue per user per minute of service 1, 2, ..., n
          O(S_1), O(S_2), ..., O(S_n) = total estimated outage duration (in minutes) of service 1, 2, ..., n
          Sb(S_1), Sb(S_2), ..., Sb(S_n) = estimated percentage of subscriber base affected during the outage of service 1, 2, ..., n
          Su(S_1), Su(S_2), ..., Su(S_n) = total users of service 1, 2, ..., n
        Then
          \[ \text{Total loss to business during service outage} = \sum_{i=1}^{n} \frac{R(S_i) \times O(S_i) \times Sb(S_i) \times Su(S_i)}{100} \]
      • Prepare statistics for indirect, additional and future costs to business due to service outages as accurately as possible.
      • Based on the calculated revenue impact of each service and the future cost to business associated with the outage, formulate a list of possible next steps or options.
      • Identify the variable (uncertain) factors and the risk associated with each option.
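The loss equation above translates directly into code. The sketch below is an illustration with made-up input values; the `ServiceOutage` type and function names are assumptions, and the outage duration is taken in minutes because R is defined per user per minute.

```python
from dataclasses import dataclass


@dataclass
class ServiceOutage:
    name: str
    revenue_per_user_per_minute: float  # R(S)
    outage_minutes: float               # O(S)
    affected_percent: float             # Sb(S), between 0 and 100
    total_users: int                    # Su(S)


def service_loss(s):
    """Loss for one service: R(S) x O(S) x Sb(S) x Su(S) / 100."""
    return (s.revenue_per_user_per_minute * s.outage_minutes
            * s.affected_percent * s.total_users / 100)


def total_loss(services):
    """Total loss to business: sum of the per-service losses."""
    return sum(service_loss(s) for s in services)


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
outages = [
    ServiceOutage("Service A", 0.5, 60, 50, 1000),    # half the users affected for an hour
    ServiceOutage("Service B", 0.25, 20, 100, 2000),  # all users affected for 20 minutes
]

loss = total_loss(outages)
```

Here Service A contributes 0.5 x 60 x 50 x 1000 / 100 = 15,000 and Service B contributes 0.25 x 20 x 100 x 2000 / 100 = 10,000, in whatever currency unit R is expressed in.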
    • • Prepare an Option vs. Risk vs. Revenue Impact vs. Variance Factor matrix: cross-functional managers and tech leads should prepare a tabulated listing of all options with clear information about the revenue loss, risks and variable factors associated with each option.
        o This matrix should be prepared with all three tracks:
          ▪ Track 1 – Go-live: present decision makers with all information and choices for going live. The activity is carried on as planned despite unexpected impacts on one or more services. This track ensures that the main activity is completed on time and that the outage duration of all properly functioning services stays within approved limits.
          ▪ Track 2 – Delay: present decision makers with all information and choices for holding all sub-activities until the unexpected problem at one or more services is fixed. This track ensures that all services are live once the change is implemented completely. However, it inherently exposes the business to the risk of a bigger revenue loss if the affected services have low revenue earning potential and/or the fix takes longer.
          ▪ Track 3 – Fall back: present decision makers with all information and choices for calling off the activity completely. This ensures that all services function as they did before the change. However, falling back means all effort and cost invested in the activity become void, and the separate cost of repeating the activity in future must also be accounted for.
      • From every track, rule out options which are least feasible, carry the highest risk or have the highest number of variable factors.
      • For every option listed in the tracks, evaluate the time each team needs to implement it.
        o Ensure every team is ready (technically and logistically) to go ahead with any of the options well in time.
      • Present decision makers with the final track, the concluded set of options and the 'time to implement' for each option.
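The option matrix and the rule-out step can be sketched as a small filter over a list of options. This is one possible reading of the steps above, not the paper's procedure: the 1 to 5 risk scale, the sample options and their figures are all hypothetical, the feasibility check is left to the people preparing the list, and sorting the survivors by revenue loss is an added convenience.

```python
from dataclasses import dataclass


@dataclass
class Option:
    track: str                 # "go-live", "delay" or "fall back"
    description: str
    revenue_loss: float        # estimated loss, same unit as the impact matrix
    risk: int                  # 1 (low) to 5 (high), hypothetical scale
    variable_factors: int      # number of uncertain factors
    minutes_to_implement: int  # time the teams need to carry it out


def shortlist(options):
    """Per track, drop the highest-risk and most-uncertain options when
    alternatives exist, then order the survivors by estimated revenue loss."""
    by_track = {}
    for option in options:
        by_track.setdefault(option.track, []).append(option)

    kept = []
    for track_options in by_track.values():
        if len(track_options) > 1:
            max_risk = max(o.risk for o in track_options)
            max_vars = max(o.variable_factors for o in track_options)
            survivors = [o for o in track_options
                         if o.risk < max_risk and o.variable_factors < max_vars]
            # If every option ties on risk or uncertainty, keep them all.
            track_options = survivors or track_options
        kept.extend(track_options)
    return sorted(kept, key=lambda o: o.revenue_loss)


# Hypothetical options for an outage like the one in the case study.
options = [
    Option("go-live", "Go live without ETOPUP", 40.0, 2, 1, 10),
    Option("go-live", "Go live with a manual recharge workaround", 35.0, 4, 3, 30),
    Option("delay", "Hold go-live until the fix is found", 90.0, 3, 2, 60),
    Option("fall back", "Roll back the upgrade", 120.0, 1, 1, 120),
]

final = [o.description for o in shortlist(options)]
```

The resulting list, together with each option's `minutes_to_implement`, is what would be put in front of the decision makers.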
    • 7. Conclusion
      To ensure the right decision is taken in the least possible time, all stakeholders involved in the project must invest the effort required for in-depth analysis and preparation of the service impact matrix. Before an activity is planned, a task force should be appointed with members from all teams. Its task is to swing into action when an unexpected outage occurs and quickly prepare the list of options for decision makers, using the strategic approach discussed and deduced in this paper and shown graphically below:
      Figure-2: Strategic approach to ascertain accurate decisions after unplanned service outage