EXAMINING OUTLIER DETECTION PERFORMANCE FOR PRINCIPAL COMPONENTS ANALYSIS METHOD AND ITS ROBUSTIFICATION METHODS

International Journal of Advances in Engineering & Technology, May 2013. ©IJAET ISSN: 2231-1963. Vol. 6, Issue 2, pp. 573-582

Nada Badr, Noureldien A. Noureldien
Department of Computer Science
University of Science and Technology, Omdurman, Sudan

ABSTRACT

Intrusion detection has attracted the attention of both commercial institutions and the academic research community. In this paper, PCA (Principal Components Analysis) is used as an unsupervised technique to detect multivariate outliers in a dataset covering one hour of traffic. PCA is sensitive to outliers because it depends on non-robust estimators. This led us to use MCD (Minimum Covariance Determinant) and PP (Projection Pursuit) as two different robustification techniques for PCA. The experimental results show that PCA generates a high number of false alarms due to masking and swamping effects, while the MCD and PP detection rates are much more accurate, and both reveal the masking and swamping effects that the PCA method suffers from.

KEYWORDS: Multivariate Techniques, Robust Estimators, Principal Components, Minimum Covariance Determinant, Projection Pursuit.

I. INTRODUCTION

Principal Components Analysis (PCA) is a multivariate statistical method concerned with analyzing and understanding data in high dimensions; that is to say, PCA analyzes data sets of observations that are described by several inter-correlated dependent variables. PCA is one of the best known and most widely used multivariate exploratory analysis techniques [5].

Several robust competitors to classical PCA estimators have been proposed in the literature. A natural way to robustify PCA is to use robust location and scatter estimators instead of the sample mean and sample covariance matrix when estimating the eigenvalues and eigenvectors of the population covariance matrix.
The minimum covariance determinant (MCD) method is a highly robust estimator of multivariate location and scatter. Its objective is to find h observations out of n whose covariance matrix has the lowest determinant. The MCD location estimate is then the mean of these h points, and the estimate of scatter is their covariance matrix. Another robust method for principal component analysis uses the Projection-Pursuit (PP) principle. Here, one projects the data onto a lower-dimensional space such that a robust measure of variance of the projected data is maximized.

In this paper we investigate the effectiveness of the robust estimators provided by MCD and PP by applying PCA to the Abilene dataset and comparing its outlier detection performance with that of MCD and PP.

The rest of this paper is organized as follows. Section 2 gives an overview of related work. Section 3 is dedicated to classical PCA. The PCA robustification methods, MCD and PP, are discussed in Section 4. Section 5 presents the experimental results, and Section 6 draws conclusions and outlines future work.

II. RELATED WORK

A number of researchers have used principal components analysis to reduce dimensionality and to detect anomalous network traffic. The use of PCA to structure network traffic flows was introduced by Lakhina [13], who used principal components analysis to decompose the structure of Origin-Destination flows from two backbone networks into three main constituents, namely periodic trends, bursts and noise.

Labib [2] used PCA to reduce the dimension of the traffic data and to visualize and identify attacks. Bouzida et al. [7] presented a performance study of two machine learning algorithms, namely nearest neighbors and decision trees, when used with traffic data with or without PCA. They found that when PCA is applied to the KDD99 dataset to reduce the dimension of the data, the learning speed of the algorithms improved while accuracy remained the same.

Terrel [9] used principal components analysis on features of the aggregated network traffic of a link connecting a university campus to the Internet in order to detect anomalous traffic. Sastry [10] proposed the use of singular value decomposition and the wavelet transform for detecting anomalies in self-similar network traffic data. Wong [12] proposed an anomaly intrusion detection model based on PCA for monitoring network behavior. The model uses PCA to reduce the dimensions of historical data and to build the normal profile, as represented by the first few principal components. An anomaly is flagged when the distance between a new observation and the normal profile exceeds a predefined threshold.

Mei-ling [4] proposed an anomaly detection scheme based on robust principal components analysis. Two classifiers were implemented to detect anomalies: one based on the major components, which capture most of the variation in the data, and the second based on the minor components, or residuals.
A new observation is considered an outlier, or anomalous, when the sum of squares of the weighted principal components exceeds the threshold in either of the two classifiers.

Lakhina [6] applied principal components analysis to Origin-Destination (OD) flow traffic. The traffic is separated into normal and anomalous spaces by projecting the data onto the resulting principal components one at a time, ordered from high to low. Principal components (PCs) are added to the normal space as long as a predefined threshold is not exceeded. When the threshold is exceeded, that PC and all subsequent PCs are added to the anomalous space. New OD flow traffic is projected onto the anomalous space, and an anomaly is flagged if the value of the squared prediction error, or Q-statistic, exceeds a predefined limit.

PCA is therefore widely used to identify lower-dimensional structure in data and is commonly applied to high-dimensional data. PCA represents data by a small number of components that account for the variability in the data. This dimension reduction step can be followed by other multivariate methods, such as regression, discriminant analysis, cluster analysis, etc.

In classical PCA the sample mean and the sample covariance matrix are used to derive the principal components. These two estimators are highly sensitive to outlying observations and render PCA unreliable when outliers are present.

III. CLASSICAL PCA MODEL

The PCA detection model detects outliers by projecting the observations of the dataset onto the newly computed axes known as PCs.
The outliers detected by the PCA method are of two types: outliers detected by the major PCs and outliers detected by the minor PCs.

The basic goals of PCA [5] are to extract the important information from a data set, to compress the data set by keeping only this important information, to simplify the description of the data, and to analyze the structure of the observations and variables (finding patterns of similarity and difference).

To achieve these goals, PCA calculates new variables from the original variables, called Principal Components (PCs). The computed variables are linear combinations of the original variables (chosen to maximize the variance of the projected observations) and are uncorrelated. The first computed PCs, called major PCs, have the largest inertia (total variance in the data set), while the PCs computed after them, called minor PCs, have the greatest residual inertia and are orthogonal to the first principal components.

The Principal Components define orthogonal directions in the space of observations. In other words, PCA simply makes a change of orthogonal reference frame, the original variables being replaced by the Principal Components.
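The two properties stated above (orthogonal directions, uncorrelated components) can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic data set; the mixing matrix `A` and all variable names are illustrative and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated data: 200 observations, 3 variables.
A = np.array([[2.0, 0.0, 0.0], [1.5, 1.0, 0.0], [0.5, 0.3, 0.2]])
X = rng.standard_normal((200, 3)) @ A.T

Y = X - X.mean(axis=0)                # centre each variable
S = np.cov(Y, rowvar=False)           # sample covariance (divides by n - 1)
eigvals, E = np.linalg.eigh(S)        # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]     # reorder: largest variance first
eigvals, E = eigvals[order], E[:, order]

scores = Y @ E                        # observations in the PC reference frame
score_cov = np.cov(scores, rowvar=False)

print(np.round(E.T @ E, 6))           # identity: the directions are orthonormal
print(np.round(score_cov, 6))         # diagonal: the PCs are uncorrelated
```

The covariance of the scores is diagonal with the eigenvalues on the diagonal, which is what "uncorrelated" and "largest inertia first" mean in practice.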
3.1 PCA Advantages

The common advantages of PCA are:

3.1.1 Exploratory Data Analysis

PCA is mostly used for making 2-dimensional plots of the data for visual examination and interpretation. For this purpose, the data is projected on factorial planes that are spanned by pairs of Principal Components chosen among the first ones (that is, the most significant ones). From these plots, one tries to extract information about the data structure, such as the detection of outliers (observations that are very different from the bulk of the data).

According to most studies [8][11], PCA detects two types of outliers: type 1, outliers that inflate variance, which are detected by the major PCs; and type 2, outliers that violate structure, which are detected by the minor PCs.

3.1.2 Data Reduction Technique

All multivariate techniques are prone to the bias-variance tradeoff, which states that the number of variables entering a model should be severely restricted. Data is often described by many more variables than necessary for building the best model. PCA is better than other statistical reduction techniques in that it selects and feeds the model with a reduced number of variables.

3.1.3 Low Computational Requirement

PCA requires little computational effort since its algorithm consists of simple calculations.

3.2 PCA Disadvantages

PCA is based on the assumptions that the dimensionality of the data can be efficiently reduced by a linear transformation and that most information is contained in those directions where the variance of the input data is maximal.

Evidently, these conditions are by no means always met. For example, if the points of an input set are positioned on the surface of a hypersphere, no linear transformation can reduce the dimension (a nonlinear transformation, however, can easily cope with this task).
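The hypersphere remark can be illustrated in two dimensions. This is a small sketch, assuming NumPy and evenly spaced points on the unit circle (a made-up example, not data from the paper): both eigenvalues of the covariance matrix are equal, so no linear direction can be dropped, although a single nonlinear coordinate (the angle) describes every point.

```python
import numpy as np

n = 360
theta = 2 * np.pi * np.arange(n) / n
X = np.column_stack([np.cos(theta), np.sin(theta)])  # points on the unit circle

S = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(S)          # ascending order
explained = eigvals / eigvals.sum()      # share of variance per component

print(np.round(explained, 6))            # both components carry half of the variance
```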
From the above, the following disadvantages of PCA can be concluded.

3.2.1 Dependence on Linear Algebra

PCA relies on simple linear algebra as its main mathematical engine and is quite easy to interpret geometrically. But this strength is also a weakness, for it might very well be that other synthetic variables, more complex than just linear combinations of the original variables, would lead to a more complex data description.

3.2.2 Smallest Principal Components Receive Little Attention in Statistical Techniques

The lack of interest is due to the fact that, compared with the largest principal components, which contain most of the total variance in the data, the smallest principal components contain only the noise of the data and therefore appear to contribute minimal information. However, because outliers are a common source of noise, the smallest principal components should be useful for outlier detection.

3.2.3 High False Alarms

Principal components are sensitive to outliers, since they are determined by their directions and calculated from classical estimators such as the classical mean and the classical covariance or correlation matrices.

IV. PCA ROBUSTIFICATION

In real datasets, it often happens that some observations differ from the majority; such observations are called outliers, intrusions, discordant observations, etc. The classical PCA method, however, can be affected by outliers so that the PCA model cannot detect all of the truly deviating observations; this is known as the masking effect. In addition, some good data points might even appear to be outliers, which is known as the swamping effect.

Masking and swamping cause PCA to generate a high number of false alarms. To reduce these false alarms, the use of robust estimators was proposed, since outlying points are less likely to enter into the calculation of the robust estimators.

The well-known PCA robustification methods are the minimum covariance determinant (MCD) and the Projection-Pursuit (PP) principle. The objective of the raw MCD is to find h > n/2 observations out of n whose covariance matrix has the smallest determinant. Its breakdown value is $b_n = [n - h + 1]/n$; hence the number h determines the robustness of the estimator. In the Projection-Pursuit principle [3], one projects the data onto a lower-dimensional space such that a robust measure of variance of the projected data is maximized. PP is applied where the number of variables or dimensions is very large, so PP has an advantage over MCD, since MCD requires that the dataset not exceed 50 dimensions.

Principal Component Analysis (PCA) is an example of the PP approach, because both search for directions with maximal dispersion of the data projected onto them; but instead of using the variance as the measure of dispersion, PP uses a robust scale estimator [4].

V. EXPERIMENTS AND RESULTS

In this section we show how we tested PCA and its robustification methods, MCD and PP, on a dataset. The data used consist of OD (Origin-Destination) flows, which were collected and made available by Zhang [1]. The dataset is an extract of sixty minutes of traffic flows from the first week of the traffic matrix on 2004-03-01, which is the traffic matrix Yin Zhang built from the Abilene network.
The dataset is available in offline mode, having been extracted from the offline traffic matrix.

5.1 PCA on Dataset

First, the dataset (the traffic matrix) is arranged into the data matrix X, where rows represent observations and columns represent variables or dimensions:

$$X_{(144 \times 12)} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,12} \\ \vdots & \ddots & \vdots \\ x_{144,1} & \cdots & x_{144,12} \end{bmatrix}$$

The following steps are applied in using the PCA method on the dataset.

• Centering the dataset to have zero mean. The mean vector is calculated from the following equation and subtracted from each dimension:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{1}$$

The product of this step is the centered data matrix Y, which has the same size as the original dataset:

$$Y_{(n,p)} = \left( x_{i,j} - \mu(X) \right) \tag{2}$$

• Calculating the covariance matrix from the following equation, where T(X) is the location estimate of X (here the mean vector):

$$C(X) \ \text{or} \ \Sigma(X) = \frac{1}{n-1} \left( X - T(X) \right)^{T} \left( X - T(X) \right) \tag{3}$$

• Finding the eigenvectors and eigenvalues of the covariance matrix, where the eigenvalues are the diagonal elements of the matrix Λ, using the eigen-decomposition in equation (4):

$$E^{-1} \times \Sigma_Y \times E = \Lambda \tag{4}$$

where E is the matrix of eigenvectors and Λ holds the eigenvalues.

• Ordering the eigenvalues in decreasing order and sorting the eigenvectors according to the ordered eigenvalues. The sorted eigenvector matrix is the loadings matrix.

• Calculating the scores matrix (the dataset projected on the principal components), which describes the relations between the principal components and the observations. The scores matrix is calculated from the following equation:

$$scores_{(n,p)} = Y_{(n,p)} \times loadings_{(p,p)} \tag{5}$$
• Applying the 97.5% tolerance ellipse to the bivariate dataset (data projected on the first PCs, data projected on the minor PCs) to reveal outliers automatically. The ellipse is defined by those data points whose distance is equal to the square root of the 97.5% quantile of the chi-square distribution with 2 degrees of freedom. The distance takes the form:

$$dist \le \sqrt{\chi^{2}_{p,\,0.975}} \tag{6}$$

The screeplot was studied, and the first and second principal components accounted for 98% of the total variance of the dataset, so the first two principal components were retained to represent the dataset as a whole. Figure (1) shows the screeplot. The plot of the data projected onto the first two principal components, used to reveal the outliers in the dataset visually, is shown in figure (2).

Figure 1: PCA Screeplot (x-axis: principal components; y-axis: total variance)
Figure 2: PCA Visual outliers (data projected on major PCs; axes: PC1, PC2)

Figure (3) shows the tolerance ellipse on the major PCs, and figures (4) and (5) show, respectively, the visual recording of outliers from scatter plots of the data projected on the robust minor principal components and the outliers detected by the robust minor principal components tuned by the tolerance ellipse.

Figure 3: PCA Tolerance Ellipse (97.5% tolerance ellipse; axes: PC1, PC2)
Figure 4: PCA type2 Outliers (data projected on minor PCs; axes: last-1 PC, last PC)
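The PCA steps of this section, equations (1) to (6), can be sketched in a few lines. This is an illustrative sketch only, assuming NumPy, a synthetic 144×12 matrix in place of the Abilene data, two hypothetical planted outliers, and the assumption that the distance in equation (6) is the Mahalanobis distance in the plane of the two selected PCs. The function names `pca_scores` and `ellipse_outliers` are made up for this example.

```python
import numpy as np

def pca_scores(X):
    """Classical PCA by eigen-decomposition of the sample covariance."""
    mu = X.mean(axis=0)                        # eq. (1): mean vector
    Y = X - mu                                 # eq. (2): centred data
    S = (Y.T @ Y) / (X.shape[0] - 1)           # eq. (3): covariance matrix
    eigvals, E = np.linalg.eigh(S)             # eq. (4): eigen-decomposition
    order = np.argsort(eigvals)[::-1]          # decreasing eigenvalues
    eigvals, loadings = eigvals[order], E[:, order]
    return Y @ loadings, eigvals, loadings     # eq. (5): scores

def ellipse_outliers(scores2, eigvals2, level=0.975):
    """Flag points outside the tolerance ellipse of a pair of PCs, eq. (6)."""
    # The chi-square quantile with 2 degrees of freedom is -2 ln(1 - level).
    cutoff = np.sqrt(-2.0 * np.log(1.0 - level))
    dist = np.sqrt(np.sum(scores2**2 / eigvals2, axis=1))
    return np.flatnonzero(dist > cutoff), dist, cutoff

rng = np.random.default_rng(1)
X = rng.standard_normal((144, 12))             # synthetic stand-in, not Abilene data
X[[10, 20]] += 15.0                            # two hypothetical planted outliers

scores, eigvals, loadings = pca_scores(X)
major, _, cutoff = ellipse_outliers(scores[:, :2], eigvals[:2])    # type 1
minor, _, _ = ellipse_outliers(scores[:, -2:], eigvals[-2:])       # type 2
print(round(float(cutoff), 4), major, minor)
```

Running the same ellipse test on the first two and on the last two score columns mirrors the split into outliers detected by major PCs and by minor PCs.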
Figure 5: Tuned Minor PCS

5.2 MCD on Dataset

Testing the robust MCD (Minimum Covariance Determinant) estimator yields the robust location measure Tmcd and the robust dispersion Σmcd.

The following steps are applied to test MCD on the dataset in order to obtain the robust principal components.

• The MCD measure is calculated from the formula:

$$R = \left( x_i - T_{mcd}(X) \right)^{T} \, \Sigma_{mcd}(X)^{-1} \, \left( x_i - T_{mcd}(X) \right), \quad i = 1, \dots, n \tag{7}$$

Tmcd or µmcd = 1.0e+006 * […]

• From the robust covariance matrix Σmcd, the following are calculated:

C(X)mcd or Σ(X)mcd = 1.0e+012 * […]

  * the robust eigenvalues, as a diagonal matrix as in equation (4), replacing n with h;
  * the robust eigenvectors, as the loading matrix as in equation (5).

• Calculating the robust scores matrix in the following form:

$$robustscores_{(n,p)} = Y_{(n,p)} \times loadings_{(p,p)} \tag{8}$$

The robust screeplot, retaining the first two robust principal components, which accounted for more than 98% of the total variance, is shown in figure (6).
Figures (7) and (8) show, respectively, the visual recording of outliers from scatter plots of the data projected on the robust major principal components and the outliers detected by the robust major principal components tuned by the tolerance ellipse. Figures (9) and (10) show, respectively, the visual recording of outliers from scatter plots of the data projected on the robust minor principal components and the outliers detected by the robust minor principal components tuned by the tolerance ellipse.

Figure 6: MCD screeplot (robust MCD screeplot to retain robust PCs; x-axis: principal components; y-axis: total variance)
Figure 7: MCD Visual Outliers (major PCs from robust estimator; axes: robustmcd PC1, robustmcd PC2)
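The MCD steps of Section 5.2 can be sketched as follows. This is a simplified approximation and not the authors' implementation: it assumes NumPy, synthetic two-dimensional data with ten hypothetical planted outliers, and a plain search with random starts followed by concentration steps (repeatedly keeping the h points closest to the current fit). The function names `mahalanobis_sq` and `raw_mcd` are invented for the example, and no consistency or small-sample correction factor is applied.

```python
import numpy as np

def mahalanobis_sq(X, loc, cov):
    """Squared distances (x_i - loc)^T cov^{-1} (x_i - loc), as in eq. (7)."""
    D = X - loc
    return np.einsum("ij,ij->i", D @ np.linalg.inv(cov), D)

def raw_mcd(X, h, n_starts=50, n_steps=30, seed=0):
    """Approximate raw MCD: random starts followed by concentration steps."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_det, best_fit = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=p + 1, replace=False)   # small initial subset
        for _ in range(n_steps):
            cov = np.cov(X[idx], rowvar=False)
            if np.linalg.det(cov) <= 1e-12:              # degenerate subset
                break
            d2 = mahalanobis_sq(X, X[idx].mean(axis=0), cov)
            new_idx = np.sort(np.argsort(d2)[:h])        # keep the h closest points
            if np.array_equal(new_idx, np.sort(idx)):    # subset stopped changing
                break
            idx = new_idx
        if len(idx) != h:
            continue
        cov = np.cov(X[idx], rowvar=False)
        det = np.linalg.det(cov)
        if det < best_det:                               # smallest determinant wins
            best_det, best_fit = det, (X[idx].mean(axis=0), cov)
    return best_fit

rng = np.random.default_rng(42)
clean = rng.standard_normal((100, 2))
planted = rng.standard_normal((10, 2)) * 0.3 + 10.0      # hypothetical outliers
X = np.vstack([clean, planted])
h = int(0.75 * X.shape[0])                               # h > n/2

t_mcd, s_mcd = raw_mcd(X, h)
cutoff = np.sqrt(-2.0 * np.log(0.025))                   # sqrt of chi2(2) 97.5% quantile
robust_dist = np.sqrt(mahalanobis_sq(X, t_mcd, s_mcd))
flagged = np.flatnonzero(robust_dist > cutoff)

eigvals, E = np.linalg.eigh(s_mcd)                       # robust eigen-decomposition
order = np.argsort(eigvals)[::-1]
robust_scores = (X - t_mcd) @ E[:, order]                # robust scores, cf. eq. (8)
```

Because the raw covariance is computed from the h most concentrated points without a correction factor, it underestimates the scatter of the clean data, so a few regular points may also be flagged. Production code would normally rely on an established implementation such as FAST-MCD.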
Figure 8: MCD Tolerance Ellipse (97.5% tolerance ellipse; axes: robustmcd PC1, robustmcd PC2)
Figure 9: MCD type2 Outliers (data projected on robustmcd minor PCs; axes: robustmcd last-1 PC, robustmcd last PC)
Figure 10: MCD Tuned Minor PCs (97.5% tolerance ellipse; axes: robustmcd last-1 PC, robustmcd last PC)

5.3 Projection Pursuit on Dataset

Testing the projection pursuit method on the dataset involves the following steps:

• Center the data matrix X(n,p) around the L1-median to obtain the centered data matrix Y(n,p):

$$Y_{(n,p)} = \left( X_{(n,p)} - L1(X) \right) \tag{9}$$

where L1(X) is a highly robust estimator of multivariate location that resists up to 50% outliers [11].

• Construct the directions $P_i$ as the normalized rows of the matrix Y. This process includes the following:

$$PY = (Y[i,:])' \quad \text{for } i = 1:n \tag{10}$$

$$NPY = \max(SVD(PY)) \tag{11}$$

where SVD stands for singular value decomposition.

$$P_i = \frac{PY}{NPY} \tag{12}$$

• Project the whole dataset on all possible directions:

$$T_i = Y \times (P_i)^{t} \tag{13}$$

• Calculate the robust scale estimator for all the projections and find the direction that maximizes the qn estimator:

$$q = \max(qn(T_i)) \tag{14}$$

qn is a scale estimator; essentially it is the first quartile of all pairwise distances between two data points [5]. These steps yield the robust eigenvectors (PCs), and the square of the value of the robust scale estimator gives the eigenvalues.

• Project all data on the selected direction q to obtain the robust principal components:

$$T_i = Y_{n,p} \times P_q^{t} \tag{15}$$

• Update the data matrix by its orthogonal complement:

$$Y = Y - (P_q \times P_q^{t}) \cdot Y \tag{16}$$
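The projection-pursuit steps in equations (9) to (16) can be sketched as below. This is a hedged illustration, not the authors' code: it assumes NumPy, synthetic data with three hypothetical planted outliers, Weiszfeld iterations for the L1-median, and the Qn estimator of Rousseeuw and Croux without a small-sample correction. With observations stored as rows, the deflation of equation (16) is written as Y − (Y p) pᵀ so that the shapes conform. The names `l1_median`, `qn_scale` and `pp_pca` are made up for the example.

```python
import numpy as np

def l1_median(X, n_iter=200, tol=1e-8):
    """Weiszfeld iterations for the spatial (L1) median."""
    m = np.median(X, axis=0)
    for _ in range(n_iter):
        d = np.linalg.norm(X - m, axis=1)
        w = 1.0 / np.where(d < tol, tol, d)      # guard against division by zero
        m_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            return m_new
        m = m_new
    return m

def qn_scale(t):
    """Qn scale: 2.2219 times the k-th smallest pairwise distance."""
    n = len(t)
    h = n // 2 + 1
    k = h * (h - 1) // 2                         # roughly the first quartile
    i, j = np.triu_indices(n, k=1)
    return 2.2219 * np.sort(np.abs(t[i] - t[j]))[k - 1]

def pp_pca(X, n_components):
    Y = X - l1_median(X)                         # eq. (9)
    loadings, eigvals = [], []
    for _ in range(n_components):
        norms = np.linalg.norm(Y, axis=1)        # eq. (11): for a single vector the
        keep = norms > 1e-12                     # largest singular value is its norm
        P = Y[keep] / norms[keep][:, None]       # eqs. (10), (12): candidate directions
        T = Y @ P.T                              # eq. (13): one column per direction
        scales = np.array([qn_scale(T[:, c]) for c in range(T.shape[1])])
        best = int(np.argmax(scales))            # eq. (14)
        p = P[best]
        loadings.append(p)
        eigvals.append(scales[best] ** 2)        # eigenvalue = squared robust scale
        Y = Y - np.outer(Y @ p, p)               # eqs. (15), (16): project and deflate
    return np.array(loadings).T, np.array(eigvals)

rng = np.random.default_rng(7)
X = rng.standard_normal((60, 3)) * np.array([5.0, 1.0, 0.2])
X[:3, 1] += 40.0                                 # three hypothetical planted outliers

L, robust_eigvals = pp_pca(X, n_components=2)
classical_pc1 = np.linalg.eigh(np.cov(X, rowvar=False))[1][:, -1]
print(np.round(L[:, 0], 2), np.round(classical_pc1, 2))
```

In this toy setting the classical first eigenvector is pulled towards the axis carrying the planted outliers, while the first PP direction stays close to the axis with the largest spread of the regular data.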
Finally, all data are projected on the orthogonal complement:

$$scores = Y \times P_i \tag{17}$$

The plot of the data projected on the first two robust principal components, used to detect outliers visually, is shown in figure (11), and the tuning of the first two robust principal components by the tolerance ellipse is shown in figure (12). Figures (13) and (14) show, respectively, the plot of the data projected on the minor robust principal components, used to detect outliers visually, and the tuning of the last robust principal components by the tolerance ellipse.

Figure 11: PP Visual Outliers (data projected on robust major PCs by PP method; axes: PProbust PC1, PProbust PC2)
Figure 12: PP Tolerance Ellipse (97.5% tolerance ellipse; axes: PProbust PC1, PProbust PC2)
Figure 13: MCD type2 Outliers (data projected on robust minor PCs by PP)
Figure 14: MCD Tuned Minor PCs (97.5% tolerance ellipse)

5.4 Results

Table (1) summarizes the outliers detected by each method. The table shows that PCA suffers from both masking and swamping. The results of the MCD and PP methods reveal the masking and swamping effects of the PCA method. The PP results are similar to those of MCD, with slight differences, since we use 12 dimensions in the dataset.

Table 1: Outliers Detection

PCA outliers detected by major and minor PCs | MCD outliers detected by major and minor PCs | PP outliers detected by major and minor PCs | False alarm effect: Masking | False alarm effect: Swamping
66 | 66 | 66 | No | No
99 | 99 | 99 | No | No
100 | 100 | 100 | No | No
116 | 116 | 116 | No | No
117 | 117 | 117 | No | No
118 | 118 | 118 | No | No
119 | 119 | 119 | No | No
120 | 120 | 120 | No | No
129 | 129 | 129 | No | No
131 | 131 | 131 | No | No
135 | 135 | 135 | No | No
Normal | Normal | 69 | Yes | No
Normal | Normal | 70 | Yes | No
71 | Normal | Normal | No | Yes
76 | Normal | Normal | No | Yes
81 | Normal | Normal | No | Yes
101 | Normal | Normal | No | Yes
104 | Normal | Normal | No | Yes
111 | Normal | Normal | No | Yes
144 | Normal | Normal | No | Yes
Normal | 84 | Normal | Yes | No
Normal | 96 | Normal | Yes | No
Normal | 97 | 97 | Yes | No
Normal | 98 | 98 | Yes | No

VI. CONCLUSION AND FUTURE WORK

The study has examined the performance of PCA and its robustification methods (MCD, PP) for intrusion detection by presenting the bi-plots and the extracted outlying observations that are very different from the bulk of the data. The study showed that the tuned results are identical to the visualized ones. The study attributes the false alarm shortcoming of PCA to the masking and swamping effects. The comparison showed that the PP results are similar to those of MCD, with a slight difference in type 2 outliers, since these are considered a source of noise. Our future work will apply the hybrid method (ROBPCA), which takes PP as the reduction technique and MCD as the robust measure, for further performance, and will apply a dynamic robust PCA model to online intrusion detection.

REFERENCES

[1]. Abilene TMs, collected by Zhang. www.cs.utexas.edu/yzhang/ research, visited on 13/07/2012.
[2]. Khalid Labib and V. Rao Vemuri, "An application of principal components analysis to the detection and visualization of computer network". Annals of Telecommunications, pages 218-234, 2005.
[3]. C. Croux, A. Ruiz-Gazen, "A fast algorithm for robust principal components based on projection pursuit", COMPSTAT: Proceedings in Computational Statistics, Physica-Verlag, Heidelberg, 1996, 211-217.
[4]. Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang, "A novel anomaly detection scheme based on principal components classifier". In Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM03).
[5]. J. Edward Jackson, "A user guide to principal components". Wiley-Interscience, 1st edition, 2003.
[6]. Anukool Lakhina, Mark Crovella, and Christophe Diot, "Diagnosing network wide traffic anomalies". Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. ACM, 2004.
[7]. Yacine Bouzida, Frederic Cuppens, Nora Cuppens-Boulahia, and Sylvain Gombault, "Efficient Intrusion Detection Using Principal Component Analysis". La Londe, France, June 2004.
[8]. R. Gnanadesikan, "Methods for statistical data analysis of multivariate observations". Wiley-Interscience, New York, 2nd edition, 1997.
[9]. J. Terrel, K. Jeffay, L. Zhang, H. Shen, Zhu, and A. Nobel, "Multivariate SVD analysis for a network anomaly detection". In Proceedings of the ACM SIGCOMM Conference, 2005.
[10]. Challa S. Sastry, Sanjay Rawat, Arun K. Pujari and V.P. Gulati, "Network traffic analysis using singular value decomposition and multiscale transforms". Information Sciences: an International Journal, 2007.
[11]. I.T. Jolliffe, "Principal components analysis", Springer Series in Statistics, Springer, 2nd edition, 2007.
[12]. Wei Wong, Xiachong Guan, and Xiangliong Zhang, "Processing of massive audit data streams for real time anomaly intrusion detection". Computer Communications, Elsevier, 2008.
[13]. A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. Kolaczyk, and N. Taft, "Structural analysis of network traffic flows". In Proceedings of SIGMETRICS, New York, NY, USA, 2004.

AUTHORS BIOGRAPHIES

Nada Badr earned her BSc in Mathematical and Computer Science at the University of Gezira, Sudan. She received her MSc in Computer Science at the University of Science and Technology. She is pursuing her PhD in Computer Science at the University of Science and Technology, Omdurman, Sudan. She is currently serving as a lecturer at the University of Science and Technology, Faculty of Computer Science and Information Technology.

Noureldien A. Noureldien is an associate professor in Computer Science in the Department of Computer Science and Information Technology, University of Science and Technology, Omdurman, Sudan. He received his B.Sc. and M.Sc. from the School of Mathematical Sciences, University of Khartoum, and received his PhD in Computer Science in 2001 from the University of Science and Technology, Khartoum, Sudan. He has published many papers in reputable journals. He is currently the dean of the Faculty of Computer Science and Information Technology at the University of Science and Technology, Omdurman, Sudan.