A comparative analysis of data mining tools for performance mapping of wlan data



International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 2, March – April (2013), pp. 241-251, © IAEME: www.iaeme.com/ijcet.asp. Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

A COMPARATIVE ANALYSIS OF DATA MINING TOOLS FOR PERFORMANCE MAPPING OF WLAN DATA

Mr. Ajay M. Patel
Assistant Professor, Acharya Motibhai Patel Institute of Computer Studies, Ganpat University, Ganpat Vidyanagar-384012, India

Dr. A. R. Patel
Director, Department of Computer Application & Information Technology, H. North Gujarat University, Patan - 384265, India

Ms. Hiral R. Patel
Assistant Professor, Department of Computer Science, Ganpat University, Ganpat Vidyanagar-384012, India

ABSTRACT

Data mining is the non-trivial process of identifying valid, potentially useful, and understandable patterns in large volumes of data as a form of knowledge discovery. The main aim of this process is to discover patterns and associations among preprocessed and transformed data. Data mining is used for two types of analysis: prediction and description. Prediction estimates unknown or future values of selected variables; description finds human-interpretable patterns. Its major application areas include business and finance, the stock market, telecommunications, health care, surveillance, fraud detection, scientific discovery and, increasingly, networking. Data mining supports both supervised and unsupervised machine learning. This paper uses unsupervised learning, taking as its data set a wireless network log with 13 attributes and 1000 instances for anomaly detection. The research focuses on the performance mapping of different unsupervised algorithms supported by different data mining tools. Different tools provide different clustering algorithms with different performance measures. The same data set is applied to each tool. This paper presents a comparative analysis of the performance of the algorithms across the different data mining tools.
Index Terms: Accuracy, Anomaly Detection, Clustering, Data Mining, Error Rate, Unsupervised Learning.

1. INTRODUCTION

Data mining is a machine learning process for detecting previously unknown patterns in data, and it provides many useful analytical techniques. This research shows the use of data mining techniques for anomaly detection in wireless networking. The most obvious advantage of wireless networking is mobility: wireless network users can connect to existing networks and are then allowed to roam freely. In next-generation wireless networks, one of the most serious challenges is how to maintain a continuous connection as mobile users move among cells, which the handover procedure makes possible.

An intrusion prevention system (IPS) is software that has all the capabilities of an intrusion detection system (IDS) and can also attempt to stop possible incidents. An IPS combines an IDS with a firewall, a virus detection algorithm, a vulnerability assessment algorithm, and so on. The ambition of such a system is to manage both preventive and responsive actions against attacks on a computer network [10]. The wireless log history hides useful knowledge patterns that describe the typical behavior of anomalies in packet transmission [5].

In network security research, intrusion detection is a critical concern. Misuse detection and anomaly detection are the two basic approaches to intrusion detection. An intrusion detection system collects and examines data to detect intrusions and misuse in the computer system and network [7]. Data mining provides various techniques for finding these kinds of anomalous intrusion activities.

1.1 Data Mining

Data mining is a machine learning discipline that provides different techniques for extracting knowledge and unknown patterns from raw data. It is emerging as a key feature of many security initiatives, and both the private and public sectors use it increasingly. Application domains such as banking, insurance, medicine, and retailing frequently use data mining to reduce costs, enhance research, and increase sales. Data mining applications were initially used as a means to detect fraud and waste, but have grown to be used for purposes such as measuring and improving program performance.

Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods, that is, algorithms that improve their performance automatically through experience, such as neural networks or decision trees. Data mining takes a discovery approach, in which algorithms can be used to examine several multidimensional data relationships concurrently, identifying those that are unique or frequently represented.

Many organizations provide data mining tools that analyze task-oriented information and produce analytical results for interpretation. Such tools help reduce fraud and save the time otherwise spent developing algorithms for research. It is nevertheless possible, and often preferable, to use or modify the algorithms according to the requirements.

Recently, data mining has been increasingly cited as an important tool for various security efforts. Some observers suggest that data mining should be used as a means to identify terrorist or intrusive activities, such as money transfers and electronic communications, and to identify and track individual terrorists or intruders themselves, such as through travel and immigration records [9].

1.2 Why Unsupervised Learning?

Data mining is the process of extracting knowledge from a database. Data mining models can be categorized according to the tasks they perform: the techniques are either predictive (supervised) or descriptive (unsupervised). Of the main techniques, classification and prediction are supervised learning models, whereas clustering and association rules are descriptive models. Classification recognizes patterns that describe the group to which an item belongs. Prediction is the construction and use of a model to assess the class of an unlabeled object, or to assess the value or value range that a given object is likely to have.

A supervised learning model classifies the data according to a predefined class label. Unsupervised learning groups the data according to the behavior of the data itself. Unsupervised learning techniques treat all variables in the same way; there is no distinction between descriptive and dependent variables. However, in contrast to the name "undirected data mining", there is still some target to achieve. This target might be as general as data reduction or as specific as clustering. The difference between supervised and unsupervised learning is the same as that between discriminant analysis and cluster analysis. Supervised learning requires that the target variable is well defined and that a sufficient number of its values are given. For unsupervised learning, the target variable is typically either unknown or has been recorded for too small a number of cases.

1.3 Intrusion Detection in WLAN

A wireless intrusion detection and prevention system (IDPS) monitors wireless network traffic and analyzes its wireless networking protocols to identify suspicious activity performed by users and detectable in the protocols themselves. The NIST guide [10] discusses wireless IDPS technologies in detail: it gives a brief overview of wireless networking, covers the major components of wireless IDPSs and the architectures typically used for deploying them, examines their security capabilities in depth, including the methodologies they use to identify and stop suspicious activity, and discusses their management capabilities, including recommendations for implementation and operation.

Wireless intrusion detection systems can be divided into misuse-based and anomaly-based systems in the same way as IDSs for wired networks. Besides the classical misuse and anomalies detectable in any network, a wireless IDS must also detect wireless-specific misuse and anomalies. Machine learning is regarded as an effective tool for an IDS to detect abnormal activities in network traffic. In particular, neural networks, support vector machines (SVM), and decision trees are three significant and popular schemes borrowed from the machine learning community for intrusion detection in recent academic research [7].

1.4 Anomaly Detection

An anomaly is any event or entity that is eccentric, abnormal, or special. It can also indicate an inconsistency or divergence from a preset rule or tendency. For anomaly detection, normal behavior is modeled, and any event that contravenes this model is marked as suspicious.
For example, a normally passive public web server can be suspected of a worm infection if it tries to open connections to a large number of addresses.

An anomaly-based intrusion detection system finds intrusions and misuse in a computer by monitoring system activity and classifying it as normal or anomalous. Such a system can detect any type of misuse that falls outside normal system operation, since the classification is based entirely on rules or heuristics rather than on patterns or signatures. An anomaly-based detection system seeks deviations from the learned model of normal behavior: it analyzes ongoing traffic, activity, transactions, or behavior to detect anomalies in the system or network that may indicate an attack. More generally, an intrusion detection system (IDS) is a program that examines what happens or has happened during an execution and tries to find indications that the computer has been misused. The development of anomaly detection techniques suitable for wireless networks is regarded as a vital research area [7].

2. DATA MINING TECHNIQUES FOR ANOMALY DETECTION

In anomaly detection, any significant deviation from the expected behavior is reported as a possible attack. Anomalies are activities of the kind that intruders would perform, and anomaly detection is the process of finding the objects that are unlike the other, normal objects. Data mining provides various techniques for extracting knowledge from data, including techniques for finding groups or classes according to the requirements and purpose of the work. Classification assigns the collected data to classes. Clustering is another data mining technique, which groups the data according to its behavior. Data mining techniques are therefore useful for finding such groups or classes.
These classes or groups make it possible to distinguish dissimilar groups from one another, either by predefined labels or by the behavior of the data.

3. PROCESS OF UNSUPERVISED LEARNING (CLUSTERING)

Unsupervised learning is the method of grouping data according to its behavior. It is also known as the descriptive method. Clustering is one of the unsupervised learning techniques. It works on the data directly, so no predefined labels are required, and it produces as many groups as the user asks for. Clustering techniques form the groups according to a distance criterion among the data. Different distance measures are available for computing the distance among instances, and different clustering tools use different measures to group the data. The accuracy of the results depends on the algorithm used to cluster the instances. This paper applies the clustering techniques of several data mining tools to the same wireless log and compares them to determine which tool gives the most accurate results.

4. DATA MINING TOOLS USED FOR PERFORMANCE ANALYSIS

Various organizations provide data mining tools. Some of these tools are freeware and open source, so anyone can easily use them. Data mining tools provide built-in algorithms for various data mining techniques. In this paper, four tools are used: Weka, SPSS, Tanagra, and Microsoft SQL Server, which provides Business Intelligence Development Studio (BIDS) to support data mining analysis services.

In this paper, three clusters are generated and defined as “Normal activities”, “Suspicious activities”, and “Anomalous activities”. Each tool's clustering algorithm is applied to the same wireless log file to find the anomalous group of activities. Different tools give different results, so the important thing is to interpret the results of the applied techniques. Instances that are close to one another are put into the same cluster, and closeness is measured by computing the distances between instances; the clusters are generated on this basis. An unsupervised model is the most suitable for this task, but because the tools compute distances differently, the choice of the ideal model depends on the accuracy and error rate achieved by each tool's algorithm. The following subsections describe how the data mining techniques were applied with each tool.

4.1 WEKA

Weka stands for Waikato Environment for Knowledge Analysis. It is an open-source tool written in Java. It provides various data mining techniques and preprocessing facilities, and users can develop or change the built-in algorithms. Weka works with different file formats such as .arff, .csv, C4.5, and .xrff. In this paper, Weka 3.7 is used to apply SimpleKMeans with 3 clusters to the wireless log, based on Euclidean distance, because this is sufficient to group similar instances.

Figure 1: Clustering using Weka

4.2 SPSS

SPSS is a proprietary product from IBM designed specifically for statistical analysis. It provides various statistical tests as well as data mining techniques. SPSS works with .sav files and other data files such as Excel. In this paper, SPSS 16.0 is used to apply K-Means clustering from the Analyze > Classify menu.
This model also generates 3 clusters. SPSS offers two methods, “Iterate and classify” and “Classify only”. It also performs ANOVA for statistical verification.

Figure 2: Clustering using SPSS 16.0

4.3 Tanagra

Tanagra is also a freely available data mining tool. It provides various statistical methods, nonparametric tests, supervised learning techniques, association, and clustering. Tanagra works with .arff and other file formats that it specifies. Tanagra 1.4.43 is used here. It is a component-based visual tool. It generates 3 clusters for the wireless log. Tanagra normalizes distances by variance and selects the initial seeds either randomly or by a standard method that it specifies.

Figure 3: Clustering using Tanagra

4.4 BIDS of MS SQL Server 2008

Microsoft also provides a data mining tool: MS SQL Server 2008, which includes Business Intelligence Development Studio (BIDS). It provides several effective data mining algorithms that give scalable results. These algorithms are generally applied to data stored in SQL Server. In this paper, the Microsoft Clustering algorithm is used to generate 3 clusters for the same wireless log. The tool uses Microsoft's own algorithm. For a given data log, the user can specify the key, the input and predictable attributes, and the number of clusters; the tool then computes the clustering and, based on statistical testing, suggests settings that give better results.

Figure 4: Clustering using MS SQL Server 2008
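
The K-means procedure with Euclidean distance described for Weka and SPSS above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any of the tools: the two-attribute sample points and the hand-picked starting centers are hypothetical (the wireless log has 13 attributes, and each tool chooses its own seeds), and only the Euclidean distance and the choice of 3 clusters come from the text.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, initial_centers, max_iter=100):
    """Plain K-means: assign each point to its nearest center, then
    move each center to the mean of its members, until stable."""
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


# Hypothetical two-attribute instances, for illustration only.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
    (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
    (9.0, 1.0), (9.2, 0.9), (8.9, 1.2),
]
# Hand-picked starting centers (an assumption; tools usually seed randomly).
seeds = [points[0], points[3], points[6]]

centers, labels = kmeans(points, seeds)
print(labels)
```

Because K-means only converges to a local optimum, different seeding and distance normalization choices can lead to different groupings of the same data, which is one reason the tools compared here need not agree.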

5. RESULT INTERPRETATION

Various organizations now provide tools that support different analytical techniques, but the most important task is to interpret the results. In this paper, different tools are applied to the same wireless log and give different results. The three clusters are categorized as Normal activities, Suspicious activities, and Anomalous activities.

5.1 Results using WEKA

Weka applies the SimpleKMeans algorithm to cluster the wireless log. It can cluster the training data set itself, or the user can supply a separate test data set. Weka provides four distance functions for grouping similar instances into clusters; for this log, the Euclidean distance function is used. It generates 3 clusters according to the distance. As the figure shows, 15% of the instances are anomalous activities, 44% are normal activities, and the remaining 41% are suspicious activities.

Clustered Instances Result

Cluster    Clustered Instances
0          409 (41%)
1          440 (44%)
2          150 (15%)

Figure 5: Results of Weka

5.2 Results using SPSS

SPSS performs the clustering under the settings described above, using iterative classification. It assigns 25% of the instances to anomalous activities, 25% to suspicious activities, and 50% to normal activities. SPSS is also used to perform a statistical analysis of the data log. It shows the ANOVA table, which represents the normality and significance of the data in the log. The results also include the distance matrix of the clusters. This shows that the distance between cluster one and cluster two is very small compared with their distances to cluster three. This is interpreted to mean that the instances of cluster three differ most from the others; that is, cluster three contains activities whose behavior is not normal. For this reason, cluster three holds the anomalous activities, which are intrusive, because intrusive events are events that disturb the normal behavior of the network.
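
The distance-matrix reasoning in Section 5.2 can be illustrated with a short sketch. The cluster center coordinates below are hypothetical placeholders, not values from the SPSS output, and the rule "the cluster with the largest mean distance to the other centers is the most unusual" is one simple reading of that interpretation rather than something SPSS computes itself.

```python
import math
from itertools import combinations

# Hypothetical final cluster centers on two normalized attributes.
centers = {
    1: (0.2, 0.3),
    2: (0.3, 0.4),
    3: (0.9, 0.8),
}


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Pairwise distances between cluster centers (the "distance matrix").
distances = {
    (i, j): euclidean(centers[i], centers[j])
    for i, j in combinations(sorted(centers), 2)
}


def mean_distance_to_others(cluster_id):
    """Average distance from one center to every other center."""
    related = [d for pair, d in distances.items() if cluster_id in pair]
    return sum(related) / len(related)


# The center farthest from the rest is the candidate anomalous cluster.
most_isolated = max(centers, key=mean_distance_to_others)

for pair, d in sorted(distances.items()):
    print(pair, round(d, 3))
print("most isolated cluster:", most_isolated)
```

With these placeholder centers, clusters 1 and 2 lie close together and cluster 3 stands apart, which mirrors the pattern the text reports for the wireless log.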

Figure 6: Results of SPSS

5.3 Results using Tanagra

Clustering is used to generate homogeneous subgroups of instances. In Tanagra, the accuracy of the model depends on TSS (total sum of squares), WSS (within-cluster sum of squares), and BSS (between-cluster sum of squares). BSS is calculated from TSS and WSS, and the result ratio from BSS and TSS:

BSS = TSS - WSS = 39992.00 - 5665.077 = 34326.92
Result Ratio = BSS / TSS = 34326.92 / 39992.00 ≈ 0.858

That is, about 86% of the total variation lies between the clusters rather than within them. The result also shows the assignment to the individual groups: the numbers of instances in the 3 clusters do not differ much in proportion.

Figure 7: Results of Tanagra
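
The sums of squares used above can be checked with a short sketch. The helper names and the one-attribute toy data are hypothetical, and the definitions assumed here are the usual ones (TSS around the overall mean, WSS around each cluster's own mean); Tanagra's exact preprocessing, such as its variance normalization, is not reproduced. Only the TSS and WSS figures passed in at the end come from the text.

```python
def mean_point(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)


def total_ss(points):
    """TSS: squared distances to the overall mean."""
    return sum_of_squares(points, mean_point(points))


def within_ss(points, labels):
    """WSS: squared distances to each point's own cluster mean."""
    total = 0.0
    for label in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == label]
        total += sum_of_squares(members, mean_point(members))
    return total


def result_ratio(tss, wss):
    """Return (BSS, BSS / TSS), where BSS = TSS - WSS."""
    bss = tss - wss
    return bss, bss / tss


# Toy data: two tight groups far apart, so the ratio is close to 1.
toy_points = [(0.0,), (2.0,), (10.0,), (12.0,)]
toy_labels = [0, 0, 1, 1]
toy_tss = total_ss(toy_points)
toy_wss = within_ss(toy_points, toy_labels)
toy_bss, toy_ratio = result_ratio(toy_tss, toy_wss)

# Figures reported for the wireless log.
bss, ratio = result_ratio(39992.00, 5665.077)
print(round(bss, 2), round(ratio, 3))
```

A ratio near 1 means the clusters are compact relative to their separation; a ratio near 0 means the grouping explains little of the variation in the data.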

5.4 Results using BIDS of MS SQL Server 2008

MS SQL Server 2008, produced by Microsoft, also provides facilities for data mining tasks, with effective mining algorithms. According to the results, it creates the clusters automatically from the behavior of the data. The results also contain a lift chart and an accuracy chart, and the tool displays a statistical discrimination analysis. It gives a prediction model together with the results that support it. The lift chart shows the overall accuracy of the model in terms of statistics, data analysis, and model performance. For this log, it shows a linear lift chart with statistical measurements. Of all the tools, this one gives the most accurate results, and it supports them with statistics, as shown below.

Figure 8: Results of BIDS

The results show the clustering statistics and the clustering itself, which follows the behavior of the data. Each cluster shows its density, based on the instances assigned to it. The tool also provides statistics on each instance's distance to its own cluster as well as to the others. BIDS clustering is more flexible because it offers EM and K-Means methods, each in scalable and non-scalable variants.

The cluster diagram shows the characteristics of every cluster. The strength of the similarity between clusters is represented by the shading of the lines connecting them; light shading denotes that the clusters are not very similar. In this cluster diagram, clusters eight, nine, and ten are drawn with light shading, so their instances are not very similar to the others, and the instances belonging to those clusters represent the anomalous activities. Clusters five, six, and seven are drawn with medium shading, so their instances are interpreted as suspicious. The remaining clusters are drawn with full shading, so they contain instances with normal behavior. The model reports a density of 16%, which was verified by calculating the ratio of the number of instances in each cluster to the total number of instances in the log. The tool therefore gives an ideal model for identifying every instance of the log statistically.

6. CONCLUSION

Recent research suggests data mining techniques for fraud detection and anomaly detection. Unsupervised learning is the most useful technique for this objective because it deals with the behavior of complex data. Cluster analysis always produces a grouping based on several parameters, some of which the researcher can customize. This paper shows the use of different tools on the same wireless log and the interpretation of their results. Among these tools, MS SQL Server provides the best model. Some tools have data size limitations, and some are best suited to pure statistical analysis. MS SQL Server has the limitation that it is not available under the GPL; however, among the tools tested, it is preferable for dealing with lengthy, complex, and dynamically behaving data.

REFERENCES

1. Marc M. Van Hulle and Jesse Davis, “Data Mining”, Laboratorium voor Neuro- en Psychofysiologie, Katholieke Universiteit Leuven, pp. 1-54.
2. P. Nancy and R. Geetha Ramani, “A Comparison on Performance of Data Mining Algorithms in Classification of Social Network Data”, International Journal of Computer Applications (0975-8887), Volume 32, No. 8, October 2011.
3. Glenn A. Growe, “Comparing Algorithms and Clustering Data: Components of the Data Mining Process”, Thesis, Grand Valley State University, 1999.
4. Matthew S. Gast, “802.11 Wireless Networks: The Definitive Guide”, O’Reilly, ISBN 0-596-00183-5.
5. Thuy Van T. Duong and Dinh Que Tran, “An Effective Approach for Mobility Prediction in Wireless Network based on Temporal Weighted Mobility Rule”, International Journal of Computer Science and Telecommunications, Volume 3, Issue 2, February 2012, ISSN 2047-3338.
6. Mohamed Medhat Gaber, Shonali Krishnaswamy, and Arkady Zaslavsky, “A Wireless Data Stream Mining Model”, Published at: ICEIS
7. M. Moorthy and S. Sathiyabama, “A Hybrid Data Mining based Intrusion Detection System for Wireless Local Area Networks”, International Journal of Computer Applications (0975-8887), Volume 49, No. 10, July 2012.
8. Balaji Rengarajan and Gustavo de Veciana, “Data Mining and Coordination to Avoid Interference in Wireless Networks”, supported by Intel Research Council and NSF Award CNS-0721532.
9. Jeffrey W. Seifert, “Data Mining: An Overview”, CRS Report for Congress.
10. Karen Scarfone and Peter Mell, “Guide to Intrusion Detection and Prevention Systems (IDPS)”, NIST Special Publication 800-94.
11. Theodoros Lappas and Konstantinos Pelechrinis, “Data Mining Techniques for (Network) Intrusion Detection Systems”.
12. R. Manickam, D. Boominath and V. Bhuvaneswari, “An Analysis of Data Mining: Past, Present and Future”, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 1-9, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
13. M. Karthikeyan, M. Suriya Kumar and S. Karthikeyan, “A Literature Review on the Data Mining and Information Security”, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141-146, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
14. R. Lakshman Naik, D. Ramesh and B. Manjula, “Instances Selection Using Advance Data Mining Techniques”, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 47-53, ISSN Print: 0976-6367, ISSN Online: 0976-6375.

AUTHORS

A. Mr. Ajay M. Patel is an assistant professor in the faculty of computer application of Ganpat University in India. He is interested in networking. He has also worked with data mining and has gained expertise in data mining for wireless networks. His ongoing research focuses on intrusion detection in wireless LANs. He has published a number of journal and conference papers in the area of his research interests. He is currently working on pattern matching and prediction of wireless network traffic.

B. Dr. Ashok R. Patel is an eminent personality interested in finding ways to improve the teaching and learning process. He has extensive research experience in e-commerce and e-governance. He has guided more than 15 Ph.D. students as well as postgraduate students in diverse fields of computer application such as data mining, neural networks, computer networks, and enterprise resource planning. He is a director of the department of computer science of H. North Gujarat University in India. He is also working as a director in AICTE, the apex body in India for technical education.

C. Ms. Hiral R. Patel is an assistant professor in the faculty of computer application of Ganpat University in India. She is beginning work on pattern matching and prediction of financial data and wireless network traffic.