BIM BIG extractions and performance monitoring
or how to prevent extraction from BIM to BIG when BIM is overloaded
June 11th, 2007
Summary
The same tool collects the KPI of BIM (free memory, free WP, CPU usage...) and calculates how many new extractions on BIM can be started from BIG.
The GC can influence the calculated number of extractions with the following parameters in table YEDW_BWADMIN, maintained in BIM:
BIG_EXTRACT_BTC_MAX_ALLOW_PARA
BIG_EXTRACT_BTC_FREE_REQ_PARA
BIG_EXTRACT_DIA_FREE_REQ%
BIG_EXTRACT_DIA_FREE_REQ_PARA
BIG_EXTRACT_MAX_CPULOAD%
BIG_EXTRACT_MAX_IO_WAIT%
BIG_EXTRACT_MAX_TIME_RFC_MS
BIG_EXTRACT_MAX_TIME_CHECK_MS
Limiting factors A & B
YEDW_BWADMIN Parameters
A - BIG_EXTRACT_BTC_MAX_ALLOW_PARA is the main parameter. It is the number of extractions that can run in parallel (we count the number of BTC). The number of extractions that can be started is the value of this parameter minus the number of BTC used by BIG (i.e. the number of extractions already running). Change it to 0 to prevent any extraction. When you reduce this parameter, immediately inform the Performance team and the BIG project leader. Default is 4.
B - BIG_EXTRACT_BTC_FREE_REQ_PARA indicates how many batch work processes (BTC WP) must be free to kick off one additional extraction. Default is 5. It means that if 15 BTC are free, at most 3 additional extractions can be started (depending also on the other parameters).
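The arithmetic behind factors A and B can be sketched as follows. This is only an illustration in Python, not the code of the monitoring tool: the function name btc_limit and its argument names are hypothetical, and taking the smaller of the two limits is an assumption based on the remark that the result depends also on the other parameters.

```python
def btc_limit(max_allow_para, btc_used_by_big, btc_free, free_req_para=5):
    """New extractions allowed by factors A and B (hypothetical helper)."""
    # Factor A: parallel ceiling minus the extractions already running.
    limit_a = max(max_allow_para - btc_used_by_big, 0)
    # Factor B: one extraction per `free_req_para` free BTC work processes.
    # Assumption of this sketch: a non-positive requirement blocks everything.
    limit_b = btc_free // free_req_para if free_req_para > 0 else 0
    # Assumption: the stricter of the two factors wins.
    return min(limit_a, limit_b)


# Defaults from the slide: ceiling 4, 5 free BTC required per extraction.
print(btc_limit(max_allow_para=4, btc_used_by_big=1, btc_free=15))  # 3
```

Setting the ceiling to 0 makes the function return 0 whatever the number of free BTC, which matches the instruction to change parameter A to 0 to prevent any extraction.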
Limiting factors C, D, E
YEDW_BWADMIN Parameters
C - BIG_EXTRACT_DIA_FREE_REQ% indicates how many DIA must be free, as a percentage of the total number of DIA, to kick off additional extractions. Default is 20. It means that if you have 100 DIA and fewer than 20 are free, no new extraction will be started.
D - BIG_EXTRACT_DIA_FREE_REQ_PARA indicates how many DIA WP must be free to kick off one additional extraction. Default is 20. It means that if 60 DIA are free, at most 3 additional extractions can be started (depending also on the other parameters).
E - BIG_EXTRACT_MAX_CPULOAD% The load average tells you how many processes are trying to use the available CPUs. This figure is averaged over a period of 5 min. If the load is greater than the number of CPU, then jobs are queuing. If the load, as a percentage of the total number of CPU, is greater than this parameter, no new extraction will be started. Default is 80%.
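A minimal sketch of how factors C, D and E could be evaluated, assuming the figures are already available as plain numbers. The function names dia_limit and cpu_allows are hypothetical and the code is not taken from the monitoring tool; it only reproduces the examples given on this slide.

```python
def dia_limit(dia_total, dia_free, free_req_pct=20, free_req_para=20):
    """New extractions allowed by factors C and D (hypothetical helper)."""
    if dia_total <= 0 or free_req_para <= 0:
        return 0  # assumption of this sketch: no usable figures means no extraction
    # Factor C: the share of free DIA must reach the required percentage.
    if dia_free * 100 < free_req_pct * dia_total:
        return 0
    # Factor D: one extraction per `free_req_para` free DIA work processes.
    return dia_free // free_req_para


def cpu_allows(load_avg, nbr_cpu, max_cpuload_pct=80):
    """Factor E: True while load, as a percentage of CPU count, is within the limit."""
    return load_avg * 100 <= max_cpuload_pct * nbr_cpu


print(dia_limit(dia_total=100, dia_free=60))   # 3
print(dia_limit(dia_total=100, dia_free=19))   # 0
print(cpu_allows(load_avg=8.0, nbr_cpu=16))    # True
```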
Limiting factors F, G, H & I
YEDW_BWADMIN Parameters
F - BIG_EXTRACT_MAX_IO_WAIT% If the IO wait on the DB is higher than this limit, no new extraction is started. Default is 50%.
G - BIG_EXTRACT_MAX_PACK_DELAY_MIN If the tRFC queue is too long, no new extraction is started. The program checks the age of the oldest packet in the SM58 queue with target BIG. Default is 180 min. This indicator was removed on Aug 14th: with the "non standard" tRFC method, the age of the queue is the beginning of the IP, and with tRFC issues a packet sometimes remains in the queue, so the age of the queue is meaningless.
H - BIG_EXTRACT_MAX_TIME_RFC_MS This is the duration of a single RFC call. If the RFC call to an AS is too slow, no new extraction is started. Default is 1000 ms.
I - BIG_EXTRACT_MAX_TIME_CHECK_MS This is the total duration of the check program (RFC call, check of all KPI: memory, free WP...). It can be slow if BIG is slow or if BIM is slow. Default is 6000 ms. If the duration is higher, no new extraction is started.
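Factors F, H and I act as on/off gates rather than as counts. The following Python sketch shows one way to express them; the function name blocking_gates and its arguments are hypothetical, and factor G is left out because the slide states that it was removed.

```python
def blocking_gates(io_wait_pct, rfc_ms, check_ms,
                   max_io_wait_pct=50, max_time_rfc_ms=1000,
                   max_time_check_ms=6000):
    """Return the letters of the factors that block new extractions."""
    blocked = []
    if io_wait_pct > max_io_wait_pct:
        blocked.append("F")  # IO wait on the DB above the limit
    if rfc_ms > max_time_rfc_ms:
        blocked.append("H")  # a single RFC call is too slow
    if check_ms > max_time_check_ms:
        blocked.append("I")  # the whole check program is too slow
    return blocked


print(blocking_gates(io_wait_pct=10, rfc_ms=200, check_ms=1500))  # []
print(blocking_gates(io_wait_pct=60, rfc_ms=200, check_ms=7000))  # ['F', 'I']
```

An empty list means that none of these three gates prevents a new extraction; the count-based factors still apply.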
Calculate the number of extractions to start
Use function module YGTTC_PERF_MONITOR.
2 cases: the target contains the new FM, or not.
With the July 2007 import all, an FM will be delivered in BIM, and the test will be done in less than 1 sec.
If the source does not have this FM, no extraction is started if the check runs for more than 60 sec (no parameter for this temporary solution).
how many new extractions can be started and the limiting factor
the details: free WP...
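The result combines a number of extractions with the factor that limits it. How YGTTC_PERF_MONITOR combines the individual factors is not described here, so the Python sketch below simply assumes that the smallest per-factor limit wins; the function name combine_limits and the dictionary layout are hypothetical.

```python
def combine_limits(limits):
    """Return (number of new extractions, limiting factor).

    `limits` maps a factor letter to the number of extractions that this
    factor alone would allow. Assumption: the smallest value wins, and on a
    tie the factor listed first is reported.
    """
    factor = min(limits, key=limits.get)
    return max(limits[factor], 0), factor


print(combine_limits({"A": 3, "B": 3, "D": 1}))  # (1, 'D')
print(combine_limits({"A": 4, "E": 0}))          # (0, 'E')
```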
Monitoring
Report YGTTC_PERF_MONITOR must be scheduled on BIG to capture the KPI in the following tables:
table YGTTC_PERF_MONIT contains the history
table YGTTC_PERF_MLAST contains only the last snapshot
table /BIC/AYGTTCPM00 (ODS YGTTCPM) contains data formatted for charts
Fields of YGTTC_PERF_MONIT and YGTTC_PERF_MLAST
DATS & TIMS       Day and time of the snapshot
SID               SID
HOST              Host
DB                DB (Y/N)
PHYS_MEM          Total memory
FREE_MEM          Memory free
SWAP_FREE         Swap free
WP_DIA            No of DIA (for DB, sum of all AS and DB)
WP_DIA_FREE       No of free DIA (for DB, sum of all AS and DB)
WP_DIA_ITSELF     No of DIA used by ITSELF (for DB, sum of all AS and DB)
WP_BTC            No of BTC (for DB, sum of all AS and DB)
WP_BTC_FREE       No of free BTC (for DB, sum of all AS and DB)
WP_BTC_ITSELF     No of BTC used by ITSELF (for DB, sum of all AS and DB)
USR_TOTAL         CPU used by user
SYS_TOTAL         CPU used by system
IDLE_TRUE         True idle
WAIT_TRUE         IO wait
NBR_CPU           No of CPU on this box
LOAD_AVG          CPU load average on 5 min (ST06) * 100 (divide the value by 100 and compare with the number of CPU)
RESPTIME_MS       Response time to collect this information
TIMESTAMP         UTC time stamp in long form (YYYYMMDDhhmmss,mmmuuun) of the snapshot
PACKET_RECORDED   Number of packets in status recorded (to be sent)
PACKET_DELAY_MIN  Age of the oldest packet in status recorded (= delay in RFC queue)
MAX_BTC_ALLOWED   Max number of BTC allowed in YEDW_ADMIN in source
CALC_EXTRACT_PAR  Calculated number of extractions that can be started in parallel
CALC_LIMIT        Limiting factor in the calculated number of parallel extractions
DUMPS             Number of dumps in the last 10 minutes
RESPTIME_DIA      Average response time of DIA processes
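Because LOAD_AVG is stored multiplied by 100, it has to be scaled back before it is compared with NBR_CPU. A small Python sketch of that conversion follows; the function name decode_load_avg and the keys of the returned dictionary are hypothetical.

```python
def decode_load_avg(load_avg_x100, nbr_cpu):
    """Turn the stored LOAD_AVG (ST06 value * 100) into usable figures."""
    load = load_avg_x100 / 100  # back to the ST06 scale
    return {
        "load": load,
        # (load / nbr_cpu) * 100 simplifies to load_avg_x100 / nbr_cpu.
        "load_pct_of_cpu": load_avg_x100 / nbr_cpu,
        # More runnable processes than CPUs means that jobs are queuing.
        "queuing": load > nbr_cpu,
    }


print(decode_load_avg(1544, 16)["load_pct_of_cpu"])  # 96.5
```

With a stored value of 1544 on a box with 16 CPUs, the load is 15.44, which is 96.5% of the CPU count: jobs are not yet queuing, but the figure is above the 80% default of BIG_EXTRACT_MAX_CPULOAD%.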
YGTTC_PERF_M*
For each instance you have figures per AS and for the DB. On the DB line, you have the total of all AS for the number of WP, the detail of the tRFC queue, and the calculation of the number of new extractions that can be started.
DB = total for all AS
read 15.44
DB = time for complete check
result only on DB line
AS = time of a simple tRFC
Report YGTTC_SM66_READ
If a performance issue due to an overloaded system has to be investigated, report YGTTC_SM66_READ can capture all the running jobs (SM66) in the following tables:
YKPISM66MASTER: header info
YKPISM66RECORD: details of the running jobs
This report runs remotely on any system and uses the FM YS_KPI_RFC_GET_SM66 developed by FTS.
YKPISM66RECORD contains the userid, report, memory used... for all jobs running on the remote system. For example, if the issue is that all memory is used, you can see which job is using most of the memory.
Table /BIC/AYGTTCPM00
Contains mainly KPI in percentage, to represent all systems on the same baseline:
0CALDAY
0TCTTIMSTMP
0TCTSYSID
YAVLOAD - Load average of CPU of DB in percentage of total CPU. Trigger sm66 if >80%.
YIOWAIT - IO wait % of DB. Trigger sm66 if >30%.
YFREEDIA - Free DIA in percentage of all DIA (total of all AS)
YFREESWAP - Free swap in percentage of total swap. Minimum value of all AS & DB of one instance
YRESPPERC - Percentage of average response time of check report compared to the average over one week
YRESPDIA - Average response time of DIA processes
YRESPBIB - MAX or ?? Average response time of subjob ODS activation (jobs BIB*); they always insert the same number of rows (except the last packet) and should always be below 200 sec. Only available since v7.
YDUMPS - Number of dumps (total of all AS)
http://hqaap012.ctr.nestle.com:3401/sap/bw/BEx?cmd=ldoc&sap-language=EN&template_
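The two sm66 triggers listed for YAVLOAD and YIOWAIT can be expressed as a simple threshold check. The Python sketch below is an illustration only; the function name sm66_trigger_reasons is hypothetical, and how the capture itself is started is not covered.

```python
def sm66_trigger_reasons(yavload, yiowait,
                         max_yavload=80, max_yiowait=30):
    """Return the KPI names whose value calls for an sm66 capture."""
    reasons = []
    if yavload > max_yavload:
        reasons.append("YAVLOAD")  # CPU load average above 80% of total CPU
    if yiowait > max_yiowait:
        reasons.append("YIOWAIT")  # IO wait of the DB above 30%
    return reasons


print(sm66_trigger_reasons(yavload=96.5, yiowait=12))  # ['YAVLOAD']
print(sm66_trigger_reasons(yavload=40, yiowait=20))    # []
```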
High level view shows OB8 CPU overloaded
24 hours: 22 hours with hourly average, 2 hours with snapshot every 10 min
Next indicators to capture - to discuss with FTS
Average response time of BIB (running or finished in last 10 min)
locks and locks wait ?
Row scan for top 10 IP with too many packets
tablespace fill rate
BIG locks
Capture sm66 automatically if one indicator is too high
Who will be doing the monitoring ? PCI BW team. Alert to PCI team. Plasma screen (Kirkan>Shaun) ?
Analysis to send ticket to right team. Direct link from PCI team and SME BW or perf person ? i.e. no first level involvement. Michael ?
Front page to list known issues and IM in progress to avoid investigation of same issue by several persons ? Philippe to distribute names of PCI team and SME ? tbc