This is the last picture of this presentation. It is a general but important consideration about product quality. The picture is derived from Applied Software Measurement by Capers Jones, published by McGraw-Hill in 1996. It shows that the cost to repair a defect in the coding phase is only $25, but in the post-release phase it’s $14,000. I think we need to cooperate on quality improvement in order to contribute to our customers and to IBM's business. That’s it for my presentation. Thank you.
Complex problem determination cases in real world, Hiroki ...
Complex problem cases in real world Hiroki Nakamura DRO, IBM Japan
Summary <ul><li>Need to reduce resolution turnaround time (TAT) and the account team's workload to improve customer satisfaction </li></ul><ul><li>Built-in trace/diagnostic code without a large performance penalty </li></ul><ul><li>Detailed descriptions in the problem fix database so that fixes are easier to search </li></ul><ul><li>Performance monitoring with problem detection </li></ul><ul><li>Easy-to-use core dump inspection for crash and hang cases </li></ul><ul><li>PD enablement (on/off) without restarting the product </li></ul>More effective PD schemes are required
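The requirement for PD enablement (on/off) without a restart can be illustrated with a minimal sketch. It assumes a hypothetical in-process `TraceSwitch` class and `handle_request` function; neither is taken from any IBM product. The point is only that a cheap flag check keeps the overhead small while tracing is off.

```python
import threading
import time


class TraceSwitch:
    """Hypothetical runtime switch for built-in diagnostic tracing."""

    def __init__(self):
        # An Event is a thread-safe flag that can be flipped while the process runs.
        self._enabled = threading.Event()
        self.records = []

    def enable(self):
        self._enabled.set()

    def disable(self):
        self._enabled.clear()

    def trace(self, component, message):
        # When tracing is off, the only cost is this flag check.
        if not self._enabled.is_set():
            return
        self.records.append((time.time(), component, message))


switch = TraceSwitch()


def handle_request(n):
    # Hypothetical unit of work with built-in trace points.
    switch.trace("handler", f"start request {n}")
    result = n * 2
    switch.trace("handler", f"end request {n}")
    return result


handle_request(1)   # tracing off: nothing is recorded
switch.enable()     # turned on without restarting the process
handle_request(2)
switch.disable()    # turned off again
handle_request(3)

for _, component, message in switch.records:
    print(component, message)
```

In a real product the switch would be flipped by an administrative command or signal rather than by a direct call.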
Customer Impacts and Requirements <ul><li>A large amount of time is wasted on PD and recreation tests </li></ul><ul><li>Even if a fix is provided, the account team needs a very long time for regression testing </li></ul><ul><li>Root cause analysis is demanded even for a problem that occurred only once in several years </li></ul><ul><li>Source code investigation is needed even when little material is available (for example, no trace) </li></ul><ul><li>A logical scenario of the problem (root cause) is needed to trust a fix </li></ul><ul><li>Detailed information for customer management </li></ul><ul><ul><li>In mission critical cases, Japanese companies tend to be very sensitive about quality </li></ul></ul><ul><ul><ul><li>Especially financial companies, because of strict guidance from the Financial Services Agency </li></ul></ul></ul><ul><ul><li>Frequent progress reports on problem resolution are required: every three hours, daily, and so on </li></ul></ul><ul><ul><li>A code fix might not be applied if low occurrence and easy recovery can be guaranteed </li></ul></ul><ul><li>Recurrence test to confirm that the problem is fixed by the provided solution </li></ul><ul><li>A special build is preferable to a fix pack because it contains a single fix and needs no long-term regression test </li></ul><ul><li>Direct communication channel to the laboratory change team, who make the solution </li></ul>Impacts Requirements
What is a Problem? <ul><li>Unexpected results were produced by an SQL calculation </li></ul><ul><ul><li>No hardware error was detected </li></ul></ul><ul><ul><li>The problem is 100% reproducible, but the symptoms are different each time </li></ul></ul><ul><ul><li>The SQL is very large and the data volume is huge. </li></ul></ul><ul><ul><li>The calculation takes about 30 hours. </li></ul></ul><ul><ul><li>With application debug code it takes about 4 days. </li></ul></ul><ul><ul><li>Not reproducible at IBM </li></ul></ul><ul><ul><li>Not reproducible with small data </li></ul></ul>DB2 Problem ? <ul><li>Frequent parity errors on an FC adapter can lead to a two-bit error. </li></ul><ul><ul><li>A parity error (single bit) can be recovered by retrying the access. </li></ul></ul><ul><ul><li>A two-bit error is neither detectable nor recoverable and can cause inconsistent behavior. </li></ul></ul><ul><ul><ul><li>Data corruption or wrong calculation </li></ul></ul></ul><ul><li>Some customers suffered from this problem </li></ul><ul><ul><li>A communication company </li></ul></ul><ul><ul><li>An insurance company </li></ul></ul><ul><li>The temporary error was not recognized as a severe H/W error, because it’s recoverable. </li></ul><ul><ul><li>This caused long-term problem determination of DB2 instead of the H/W </li></ul></ul>Double bit Parity Error of H/W can lead to a problem Resolved Case 1 : DB2 calculation error
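Why a single-bit error is caught while a two-bit error slips through can be shown with a small sketch of a plain even-parity check. This is a generic illustration, assuming one parity bit per 8-bit word; it does not model the actual FC adapter hardware.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2


def flip_bits(word: int, positions) -> int:
    """Simulate a transmission error by flipping the given bit positions."""
    for p in positions:
        word ^= 1 << p
    return word


def parity_check_passes(received: int, stored_parity: int) -> bool:
    """The receiver recomputes parity and compares it with the stored bit."""
    return parity_bit(received) == stored_parity


original = 0b10110010
stored = parity_bit(original)

single = flip_bits(original, [3])      # one bit flipped
double = flip_bits(original, [3, 6])   # two bits flipped

print("single-bit error detected:", not parity_check_passes(single, stored))
print("double-bit error detected:", not parity_check_passes(double, stored))
print("double-bit data corrupted:", double != original)
```

Flipping two bits leaves the count of 1 bits with the same parity, so the check passes even though the data is wrong. That matches the slide's point that the error is silent.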
Resolved Case 1 : DB2 calculation error timeframe Situation <ul><li>Product: Fiber Channel adapter in pSeries </li></ul><ul><li>Problems: Found inconsistent results every time the same SQL was executed with the same data. There are two systems with the same configuration. One system always produces proper results, but the other system produces wrong results. </li></ul><ul><li>Reasons for long term: Takes about 30 hours for reproduction and 4 days with application debug code, because of the considerably large quantity of SQL and data. </li></ul><ul><li>Frequency: SQL application error happened. Reproducible </li></ul><ul><li>TAT/WL: 93 days / 381 person days </li></ul>Chronology (timeline figure, milestone labels): Incident; Replace FC Adaptor; Add CPUs and memory; No FC Error Found; Service-in; Application Errors Found; HW Temporary Error; Problem Support Request; Critical Situation Process; No problem found in application; FA (76 days); Report to the Customer; Situation Close; Report to the Customer; Confirmed no H/W error
Who has ownership? Resolved Case 2 : MQ connection error MQ Server Routers Switches Host Broad band Ethernet Routers Switches MQ get connection error !! IBM Non IBM IBM Non IBM Non IBM <ul><li>The initial problem is an MQ get connection error. </li></ul><ul><li>No similar problem was found. </li></ul><ul><li>The MQ connection runs over a long network path. </li></ul>
Packet Capture to analyze network L2SW#1 Media Converter 1A Media Converter 1B L3SW#1 L2SW#2 Media Converter 2A Media Converter 2B L3SW#2 External F/W#1 External F/W#2 MQ Server #1 MQ Server #2 Fiber UTP UTP L2SW#3 L2SW#4 Back F/W#1 Back F/W#2 Broad Band Ether Router #1 Broad Band Ether Router #2 L2SW#5 L2SW#6 Router Host Broad Band Ethernet Intranet Firewall x 2 Host Sent reset packet Resolved Case 2 : MQ connection error Red line : connection path Bug in SYN Defender function L3 F/W, router : Data capture point
Symptom Resolved Case 3 : Connection Timeout <ul><li>A connection timeout happened between the Web server and the Application server. </li></ul><ul><li>Only MQ channels from the application server were disconnected. </li></ul><ul><li>Investigated MQ log and trace: no error found in MQ </li></ul><ul><li>Investigated AIX trace: MQ threads were waiting for a lock </li></ul><ul><li>The lock owner thread was not dispatched for a long time. </li></ul><ul><li>Then the connection timeout occurred. </li></ul>APL Server MQ Manager Web Server Gateway Server Client Channel Threads Disconnected only channels from Application Server Send Channel Process Receive Channel Process Send Channel Process Receive Channel Process Send Channel Process Receive Channel Process Web Server Web Server Gateway Server One MQGet request every 3 sec A client channel thread is created according to a request from the Gateway Server
How to determine configuration ? Resolved Case 3 : Connection Timeout CPU 1 CPU 2 K_T K_T K_T VP VP VP P1_T1 P1_T2 P1_T3 P1_T4 ・ ・ ・ P1_T99 Lock Owner ・ ・ ・ K_T : Kernel Thread, VP : Virtual Processor P#_T# : Process ID and Thread ID(Client Channel Thread) Processor wide CPU assignment Lock Wait : Check process Sleep Lock Wait : Check process Lock Wait : Check process CPU 1 CPU 2 VP VP VP P1_T1 P1_T2 P1_T3 P1_T4 ・ ・ ・ P1_T99 Lock Owner ・ ・ ・ System wide CPU assignment Lock Wait : Check process Sleep Lock Wait : Check process Lock Wait : Check process Bottleneck Change Configuration Wait for CPU dispatch for a long time
Problem sequence Login Request Web Browser Integrated Authentication Portal Server LDAP AIX AIX AIX TAM/WebSEAL WAS WPS Interceptor UDB User Registry Create Cookie Request Transfer Authentication Re-Authentication Retrieve Group Info Portal Screen x 3 x 2 HACMP (Active Standby) Directory Server 9:08 9:00 Portal #1 LDAP Master CPU Util 100 Portal #2 100 100 LDAP Backup 100 10:07-9 10:37-54 11:20-13:05 100% 100% 100% 100% 100% UID/PW Un-resolved Case 4 : CPU utilization 100%
Supposed Scenario Portal Server AIX WAS WPS Interceptor Re-Authentication Retrieve Group Info Create Portal LDAP AIX Directory Server UDB 9:08 9:00 10:07-9 10:37-54 11:20-13:05 2006/1/17 Request Response Smooth communication (Request/Response) Request Response Delayed Process in Portal Server Normal Process Request Response Discard response from LDAP because of no preparation. Inconsistent requests were left in Portal Server. Take longer to receive response, more degradation Request Portal Server re-sent inconsistent requests many times, so the LDAP server was overdriven. Prompt Reply Re-Request with inconsistency ・・ Overdrive by Re-Request with inconsistency 100% Request Response Recovered by reboot Receive response before preparation Logic Flaw? Logic Flaw? Take longer to receive response Prompt Reply Request with inconsistency 100% Prompt Reply Un-resolved Case 4 : CPU utilization 100% <ul><li>Based on Solution assurance review </li></ul><ul><li>No occurrence in the reproduction test on a customer test machine. </li></ul>
Configuration DB Server AIX UDB Engine TCP/IP Agent Agent Agent Agent Web Server Other Unix Servlet Engine Servlet Application TCP/IP TCP/IP PC terminal IE CLI Driver Hub Server Appl Server Application TCP/IP CLI Driver TCP/IP Agent UDB Engine db2tcpcm Connect SQL Terminate Servlet Application Call Timeout threshold 120 sec Return Connects and terminates for each SQL execution because it is a host-based legacy application 2006.02.23 Un-resolved Case 5 : Application timeout <ul><li>The problem is an application timeout. </li></ul><ul><li>The application is a legacy application. </li></ul><ul><li>Some components are from other companies. </li></ul><ul><li>It’s very difficult to gather data to analyze. </li></ul>Timeout
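Rapid increase of connections Un-resolved Case 5 : Application timeout Symptom 1) The number of connections increased (5 to 32), then after a 16 sec wait it became 46 2) The number of connections became 0, then after a 28 sec wait it became 45 3) Connection timeouts happened in some cases even when there was no rapid connection increase
<table>
<tr><th>Time</th><th>Connection</th><th>Time</th><th>Connection</th></tr>
<tr><td>10:21:40</td><td>10</td><td>12:58:23</td><td>5</td></tr>
<tr><td>10:21:41</td><td>12</td><td>12:58:24</td><td>3</td></tr>
<tr><td>10:21:42</td><td>12</td><td>12:58:25</td><td>3</td></tr>
<tr><td>10:21:44</td><td>11</td><td>12:58:26</td><td>3</td></tr>
<tr><td>10:21:45</td><td>5</td><td>12:58:27</td><td>3</td></tr>
<tr><td>10:21:46</td><td>8</td><td>12:58:29</td><td>3</td></tr>
<tr><td>10:21:47</td><td>5</td><td>12:58:30</td><td>3</td></tr>
<tr><td>10:21:48</td><td>2</td><td>12:58:31</td><td>2</td></tr>
<tr><td>10:21:49</td><td>5</td><td>12:58:32</td><td>1</td></tr>
<tr><td>10:21:50</td><td>5</td><td>12:58:33</td><td>0</td></tr>
<tr><td>10:21:51</td><td>32</td><td>12:59:01</td><td>45 (28sec wait)</td></tr>
<tr><td>10:22:07</td><td>46 (16sec wait)</td><td>12:59:02</td><td>32</td></tr>
<tr><td>10:22:09</td><td>38</td><td>12:59:03</td><td>26</td></tr>
<tr><td>10:22:10</td><td>38</td><td>12:59:04</td><td>20</td></tr>
<tr><td>10:22:11</td><td>38</td><td>12:59:06</td><td>16</td></tr>
<tr><td>10:22:12</td><td>34</td><td>12:59:07</td><td>11</td></tr>
<tr><td>10:22:13</td><td>33</td><td>12:59:08</td><td>3</td></tr>
<tr><td>10:22:14</td><td>30</td><td>12:59:09</td><td>4</td></tr>
<tr><td>10:22:15</td><td>32</td><td>12:59:10</td><td>2</td></tr>
<tr><td>10:22:16</td><td>33</td><td>12:59:11</td><td>3</td></tr>
</table>
<ul><li>The investigation started from DB2, because the other components are owned by other companies. </li></ul><ul><li>Nobody can explain the long connection wait time and the rapid increase of connections. </li></ul><ul><li>Taking the DB2 trace caused very heavy CPU utilization, which caused many timeouts. </li></ul>DB2 trace and AIX trace are taken, but not resolved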
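The 16 sec and 28 sec waits in the connection samples can be picked out mechanically. The following is a minimal sketch, assuming the samples are available as (time, connection count) pairs. The function name `find_sampling_gaps` and the 5 sec threshold are illustrative choices, and only a subset of the slide's rows is included.

```python
from datetime import datetime


def find_sampling_gaps(samples, max_gap_sec=5):
    """Return (prev_time, next_time, gap_sec, prev_count, next_count)
    for every pair of consecutive samples further apart than max_gap_sec."""
    gaps = []
    for (t_prev, c_prev), (t_next, c_next) in zip(samples, samples[1:]):
        delta = (datetime.strptime(t_next, "%H:%M:%S")
                 - datetime.strptime(t_prev, "%H:%M:%S")).total_seconds()
        if delta > max_gap_sec:
            gaps.append((t_prev, t_next, int(delta), c_prev, c_next))
    return gaps


# Subset of the rows shown on the slide (time, number of connections).
first_series = [("10:21:49", 5), ("10:21:50", 5), ("10:21:51", 32),
                ("10:22:07", 46), ("10:22:09", 38)]
second_series = [("12:58:32", 1), ("12:58:33", 0),
                 ("12:59:01", 45), ("12:59:02", 32)]

for series in (first_series, second_series):
    for t_prev, t_next, gap, c_prev, c_next in find_sampling_gaps(series):
        print(f"{t_prev} -> {t_next}: {gap} sec wait, "
              f"connections {c_prev} -> {c_next}")
```

A check like this only locates the stalls. It does not explain them, which is why the case remains unresolved.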
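Severe Problems Causal Analysis (long aged) N=15 Code=46 <ul><li>Others: 12, 27% </li></ul><ul><li>Long PD time by Lab: 8, 17% </li></ul><ul><li>Difficult reproduction: 7, 15% </li></ul><ul><li>Side effect by a fix: 3, 7% </li></ul><ul><li>High error rate: 3, 7% </li></ul><ul><li>No guidance of Important info: 3, 7% </li></ul><ul><li>Low quality: 2, 4% </li></ul><ul><li>Improper communication with lab: 2, 4% </li></ul><ul><li>Improper communication with customer: 2, 4% </li></ul><ul><li>Need to accelerate resolution: 2, 4% </li></ul><ul><li>Improper problem management: 2, 4% </li></ul>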
Cost of Poor Quality (may be common understanding) Source: Applied Software Measurement, Capers Jones, 1996 Chart: Percentage of Bugs by phase (% Defects introduced in this phase, % Defects found in this phase; 85% annotation) and $ Cost to repair defect in this phase <ul><li>Coding: $25 </li></ul><ul><li>Unit Test: $130 </li></ul><ul><li>Funct Test: $250 </li></ul><ul><li>Field Test: $1,000 </li></ul><ul><li>Post Release: $14,000 </li></ul>
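The gap between the two ends of the chart is easier to appreciate as a ratio. A minimal worked calculation, using only the coding and post-release figures quoted in the presentation; the defect count of 10 is a made-up illustration.

```python
# Repair cost per defect, as quoted in the presentation.
COST_CODING = 25           # dollars, coding phase
COST_POST_RELEASE = 14_000  # dollars, post-release phase

# How many times more expensive a post-release repair is.
ratio = COST_POST_RELEASE / COST_CODING

# Hypothetical batch of defects, for illustration only.
defects = 10
cost_if_fixed_in_coding = defects * COST_CODING
cost_if_fixed_post_release = defects * COST_POST_RELEASE

print(f"post-release repair costs {ratio:.0f}x the coding-phase repair")
print(f"{defects} defects: ${cost_if_fixed_in_coding:,} in coding "
      f"vs ${cost_if_fixed_post_release:,} post release")
```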