Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of       Engineering Research and Applicat...
Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of       Engineering Research and Applicat...
Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of       Engineering Research and Applicat...
Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of       Engineering Research and Applicat...
Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of         Engineering Research and Applic...
Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of           Engineering Research and Appl...
Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of                  Engineering Research a...
Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of    Engineering Research and Application...
Upcoming SlideShare
Loading in …5



Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 5, September- October 2012, pp.1616-1623 Implementation Of Internet Traffic Classifier Using Dbscan Algorithm Mr. Shezad Shaikh Mr. Niket Bhargava Ms. Urmila Mahor M.Tech Department of Computer Asst. Professor Department of Asst. Professor Department of Science & Engineering, Computer Science & Engineering, Computer Science & Engineering, B.I.S.T Bhopal, India B.I.S.T. Bhopal, India B.I.S.T. Bhopal, IndiaAbstract Traffic modeling and classification find related to network are linked to traffic. Networkimportance in many areas such as bandwidth traffic is an important carrier to record and reflectmanagement, traffic analysis, prediction and the internet and end user activities; it is also anengineering, network planning, Quality of important composition of network behavior, throughService provisioning and anomalous traffic the analysis of network traffic statistics, we candetection. Much of research work has been done master the network statistical behavior the area of network traffic classification by Network traffic classification plays an importantapplication type and several classifiers are role in network activities such as networksuggested. In past network traffic classification management, planning and network design. It alsousing traditional techniques such as well known includes the allocation, control and management ofport number based and payload analysis based resources in TCP/IP networks. Networktechniques are no more effective because various classification is also essential for bandwidthapplications uses port hopping and encryption management, traffic shaping, intrusion detection andtechnique to avoid detection. Recently machine abnormality. The above activities need thelearning techniques such as supervised, capability of accurately classifying and identifyingunsupervised and semi supervised techniques are internet traffic. At the same time accuracy of trafficused to overcome the problems of traditional classifier is a main basis of network security andtechniques. In this work we use semi supervised traffic engineering [1].machine learning approach to classify the In past classical methods such as portnetwork traffic using DBSCAN algorithm. This number based techniques and payload analysistechniques uses only flow statistics to classify the techniques are more popular to classify the networknetwork traffic. This methodology is based on traffic. The classical well known port number basedmachine learning principle, consists of two method [8, 12] is the simplest one. This methodcomponents: clustering and classification. The needs to access the header of the packet to inspectgoal of clustering is to partitions the training the port number and recognize the applicationdata set in to two disjoint group flow and traffic according to the IANA’s (Internet Assigned Numberclass. After making clusters classification is Authority) this technique fails to classify the trafficperformed in which labeled data are used for accurately because many application uses dynamicassigning class label to the cluster. A NSL-KDD port negotiation and because of ambiguity in portdata set is used for testing this approach. Which number assignment to application by IANA.includes much kinds of attacks and normal data. Payload Analysis [10] technique was introduced toExperimental result shows that DBSCAN has solve the problems of port number based technique.better effectiveness and efficiency. It needs to access the payload of the packet to find the specific pattern in the payload to classify theKeywords- Clustering, Classification, Machine traffic. This technique fails because variouslearning, DBSCAN, Traffic classification. applications use encryption techniques to avoid detection and legality and privacy law does notI. INTRODUCTION allow scanning users payload. Network traffic classification is the process Machine learning approach is now uses toof identifying traffic flows and associating them to classify network traffic. It uses only flow statisticsdifferent categories of network application, and it such as duration, protocol_types, services, flags etcrepresents an essential task in the whole chain of to classify the traffic and does not need to access thenetwork management [2]. The aim of network header and payload of the packet. Machine learningtraffic classification is to find out what type of approach is classified as [8, 9] unsupervised,application are run by end users, and what is the supervised and semi supervised approach.share of the traffic generated by different Supervised approach needs the labeled instances toapplications in the total traffic mix. All activities train the classifier. Decision tree, Support Vector 1616 | P a g e
  2. 2. Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 5, September- October 2012, pp.1616-1623Machine, Naïve Bayes etc. are supervised advance, and it may be changed along with thealgorithms. Supervised approach has following evolvement of applications. Another techniquelimitations; first, labeled instances are rare and based on transport layer information of packets isdifficult to obtain. Second, it forces mapping of often used, too. This technique usually makes use ofinstances to one of the known class without connection patterns of the traffic flows, or the roledetecting new ones. Unsupervised approach is a distinction of communication hosts, or theclass of machine learning in which unlabeled characteristic parameters of network and etc.instances are used and based on the inner similarity T.Karagiannis in [2][5] presents abetween instances clusters (groups) are formed. K- systematic approach based on the connection patternMeans, DBSCAN, CLARANS etc are unsupervised of P2P flows and the role distinction derived fromalgorithms. It has limitation in assigning label to the transport layer information. His results show thatcluster after clustering so that new instances this approach has the accuracy comparable with thatproperly mapped to applications. Semi supervised of payload based approach, and is capable ofapproach is a combination of supervised and identifying traffic flows missed by payload analysis.unsupervised approach [10]. The proposed It’s main merits are that it has no access to packettechnique permits both labeled and unlabeled payload, and need no knowledge of port numbersinstances to build the traffic classifier. and no additional information other than what current flow collectors provide. The rest of this paper is organized is F.Constantinou in [11] resents a novelwritten as followed. Section II represents back approach for P2P traffic identification that usesground and related work about internet traffic fundamental characteristic of P2P protocols, such asclassification. Section III introduces DBSCAN a large network diameter and the presence of manyalgorithm. Section IV and V presents our hosts acting both as servers and clients. At present, amethodology data set and experimental results. technique based on machine learning [3] [13-18] hasSection VI represents our conclusion. attracted more and more attention. This technique generally consists of two parts: model building andII. BACK GROUND AND RELATED classification. Firstly a model is built by some WORK methods such as clustering, and the clusters areA. Back ground labeled. Secondly a classifier utilizes this model to Much research work has been done in the classify flow data. According to whether the trainfield of traffic classification. This section introduces data classified or not in advance, the machinethe work related to semi supervised approach. The learning technique may primarily be divided intotraditional technique for the traffic classification is two types: unsupervised and supervised. Whenbased on the well-known port number. Literatures building the model, the supervised approach should[2-4] have presented that this technique is no longer classify the train data in advance, but theeffective for some applications due to the use of unsupervised approach not. The machine learningdynamic port number and the camouflage of approach is not panacea, we need apply otherapplication. So the port-based approach is currently techniques (e.g. payload analysis, based-on transportcombined with other approaches for traffic layer approaches, and etc.) to label the flows,classification. Another familiar technique used however.widely is based on analysis of packet payloads A.McGregor in [12] presents a machine[2][5-12]. It analyses packet payloads to determine learning approach, which uses the Expectationwhether they contain the given characteristic strings. Maximization EM clustering algorithm. J.ErmanThe most famous approach that uses this technique compares three clustering algorithms: K-means,is application signature approach, as some known DBSCAN and Auto Class in [3], and applies anapplications contain the specific characteristic supervised approach using EM and Naïve Bayesiansignatures [9]. H.Bleul [10-11] shows that classifier in [13]. In all those mentioned, themeasurement systems based on application clustering is one of the most important tools forsignatures can provide high accuracy. P.Haffner [6] model-building.presents a statistical machine learning algorithm toautomatically extract application signatures from IP B. Related worktraffic payload. Although packet payload analysis This section introduces the work relatedhas high accuracy, it has some shortcomings: (1) it with semi supervised approach.can’t deal with encrypted application payloads; (2) In 2007 Jeffery Erman et al.[1,3] proposedinspecting concerns of packet payloads may be a semi supervised traffic classification techniquefaced with the problem on legality and privacy; (3) consists of two steps clustering and required increased processing and storage For experimental purposes traces are collected fromcapacity; (4) it is useless if payloads are not the internet link of a large university in which 29available; (5) it is unable to identify the unknown application are identified. The authors categorizedapplication;(6) signatures must be obtained in traces as 1hour campus, 10 hour residential and 1 1617 | P a g e
  3. 3. Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 5, September- October 2012, pp.1616-1623hour wireless LAN. They performed various training the classifier. The classifier achieves 94.8%experiment on this work. In first experiment 64000 accuracy.unlabeled flows are provided for clustering after C. System Architecturethese flows is clustered, the fix numbers of random Before applying clustering, we need toflows in each cluster are labeled. The results show follow few prerequisites. Real word data tend to bethat 94% accuracy is achieved by two labeled flows dirty, incomplete and inconsistent. Dataper cluster and K=400. The second set of preprocessing technique can improve the quality ofexperiments 80,800 and 8000 labeled flows are the data, thereby helping to improve the accuracymixed with random number of unlabeled flows to and efficiency of the subsequent mining process.generate the training dataset. The accuracy will Data set can also be in varying from, e.g. oneincrease when five or more flows are labeled per attribute varies in range of 100 and other attributecluster. varies in the range of 10000. So, a proper In 2008 Chuanliang chan et al.[14] normalization of data set is done in which eachproposed two graph based semi supervised attribute come in the range of 100 or whatever usermethods(i.e. spectral graph transducer, Gaussian selects. This normalization technique is known asfields approach) and one semi supervised clustering Min-max normalization.(MPCK Means) method to perform intrusiondetection .KDD CUP 99 dataset is used for Selection ofexperimental purpose and Precision and Recall and Input Apply training Data DBSCANF Measure is used to evaluate the clustering results. Data setThe authors compared two semi supervised set after Algorithmclassification with other traditional supervised Dataalgorithm and finds that performance of their Preparationapproach are much better than other. Also show thatthe performance of MPCK means is better than K- Final ClassifiedMeans. Output Data Class Label In 2009 Levi Lesis and Jorg Sander [18]proposed semi supervised algorithm called as AssignmentSSDBSCAN. This algorithm requires only one input to Clusterparameter, does not need user intervention andautomatically finds noise objects. The authors usedboth artificial and real world datasets forexperimental purpose and compare SSDBSCAN Fig.1 Architecture of Proposed Systemwith HISSCLU and finds that their approach is The input data set is the real data whichbetter to find the cluster in datasets. The Liu bin and captured in the real network. It includes many kindsTu Hao in 2010 [19] proposed semi supervised of attack data, also includes the normal data. Theclustering methods based on particle swarm classification model is built is based on semi-optimization (PSO) algorithm and two host feature supervised machine learning approach, thus bothname IP address discreteness and success rate of labeled and unlabeled data record are present. Theconnections. They collected experimental dataset output will basically would be the classification thatfrom the router of their university which contained 7 will specify the class to which the data set is belongclasses. To evaluate their approach they used irrespective of the data input is labeled or unlabeled.precision. Result showed that 85% accuracy The figure 1. shows the sequentialachieved when 100 or more labeled samples are execution of various phases the output of the firstused in training dataset. phase act as input to the second phase and so on. In 2010 Amita Shrivastav and Aruna Initially the input is taken as the data set which actsTiwari [15] proposed a semi supervised approach as input to phase one in which both labeled andbased on clustering algorithm. This approach has unlabeled data are used. It will partition a wholetwo steps clustering and classification. KDD CUP- data space into small number of disjoint regions99 dataset is used for experimentation. They (cluster). Finally, it labels the cluster, for eachcompare their approach with SVM based classifier. cluster formed; perform the probabilisticThe experimental result showed that accuracy of Maintaining the Integrity of the Specificationsproposed classifier lies between 70% and 96% for assignment to find the mapping from cluster tovarious datasets. labels. If the maximum priority belongs to same In 2012 Vinod Mahajan and Bhupendra class, then all the labeled and unlabeled dataVerma [6] proposed semi supervised distance based samples within the cluster are assigned the sameclustering and probabilistic assignment technique class label. The result parameter after each phasefor network traffic classification. It permits both can be viewed by the user. These results arelabeled and unlabeled instances to be used in displayed in the form of graphical tables. The entire classification task terminate once the class are 1618 | P a g e
  4. 4. Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 5, September- October 2012, pp.1616-1623assigned to all the data samples. This classification and q, denoted by dist( p,q) . We use Euclideantechnique helps in classifying data and also making distance function for DBSCAN in this paper asthe system to learn how to classify a new coming PROPOSED WORK In this paper we proposed semi-supervisedtechnique using DBSCAN algorithm which Where n is the number of the features for pointclassifies network flows by using only flow statistics object p and q, pi and qi are the ith feature of pointwhich is analyzed and implemented. This technique object p and based on machine learning principle consists of A. Premitives of DBSAN Algorithmtwo components clustering and classification. 1. ε-neighbor: the neighbors in ε semiClustering is used to partitions the training data set diameter of an object.into disjoint group (cluster) .After making cluster, 2. Kernel object: certain number (MinP) ofclassification is performed in which labeled data are neighbors in ε semi diameter.used for assigning class labels to the clusters. 3. To a object set D, if object p is the ε-Labeled data means the input set for which the class neighbor of q, and q is kernel object, then pto which it belong is known. Unlabeled data set is can get “direct density reachable” from for which class to which it belongs is unknown 4. To a ε, p can get “direct density reachable”and is to be properly classified. This technique will from q; D contains Minp objects; if a seriesenable to built a traffic classifier using flow statistic object p1,p2,...,pn , p1= q, p n =p, then Pi+1from both labeled and un labeled flow. Our method can get “direct density reachable” from pi,,consists of two step clustering and classification. pi €i D, 1 ≤ i ≤ n.The details of these steps are as follows. 5. To ε and MinP, if there exist a object o(o€D) , p and q can get “direct densityA. Clustering reachable” from o, p and q are density We first employ a clustering algorithm to connected.partition a training data set that consist of labeledflows combined with unlabeled flows. Clustering The DBSCAN algorithm is based on thedata is method by which the large sets of data are concepts of density reach ability and density-grouped into clusters of small sets of similar data. connectivity. These concepts depend on two inputWe propose DBSCAN algorithm for clustering parameters: epsilon (eps) and minimum number ofpurpose. points (minpts). Epsilion is the distance around an object that defines its eps- neighborhood. For aB. Classification given object q, when the number of objects with in After clustering of training data, we use the the eps-neighborhood is at least minpts, then q is de-available labeled flows to obtain a mapping from the fined as a core object. All objects within its eps-clusters to the different known classes the result of neighborhood are said to directly density reachablethe learning is a set of cluster. Some mapped to the from q. In addition, an object p is said to densitydifferent flow types. This method referred to as reachable it is with in the eps-neighborhood of ansemi-supervised learning. The input data for object that is directly density-reachable or densityclassification task is collection of numbers of reachable from q. further more, objects p and q arerecords. Each records, also known as instance, is said to be density connected if an object o exists thatcharacterized by a tuple (x,y) where x is the attribute both p and q are density reachable [8]. Theseset and y is class attribute. notions of density – reach ability and density connectivity are used to define what the DBSCAN algorithm considers as a cluster. A cluster is definedIV. DBSCAN ALGORITHM as the set of objects in a data set that are density – Density-Based Spatial Clustering and connected to a particular core object. Any objectApplication with Noise (DBSCAN) was a clustering that is not a part of cluster is categorized as noise.algorithm based on density. It did clustering through This is in contrast to k- Means and Auto Class,growing high density area, and it can find any shape which assign every object as a cluster.of clustering. There are two important objectsclusters and noise, for DBSCAN algorithm. All B. Explanation of Algorithmpoints in data set are divided into points of clustersand noise. The key idea of DBSCAN is that for each  DBSCAN requires two parameters: epsilonpoint of a cluster the neighborhood of a given radius (eps) and minimum points (minPts). Ithas to contain at least a minimum number of points, starts with an arbitrary starting point thati.e. the density in the neighborhood has to exceed has not been visited. It then finds all thesome threshold. The shape of a neighborhood is neighbor points within distance eps of thedetermined by the distance function for two points p starting point. 1619 | P a g e
  5. 5. Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 5, September- October 2012, pp.1616-1623  If the number of neighbors is greater than take only 8,200 instances form NSL KDD dataset or equal to minPts, a cluster is formed. The because to test the effectiveness of proposed starting point and its neighbors are added to approach by using whole dataset requires more this cluster and the starting point is marked computational time and resources. This data set is as visited. The algorithm then repeats the divided into training dataset which contains 4,200 evaluation process for all the neighbors records and test dataset which contains 4000 recursively. records. Training dataset contains 1800 labeled  If the number of neighbors is less than instances and 2000 unlabeled instances. Both minPts, the point is marked as noise. training and test dataset contains all 41 features.  If a cluster is fully expanded (all points within reach are visited) then the algorithm Analysis Tool proceeds to iterate through the remaining The system will be implementing using unvisited points in the dataset. MATLAB software, which is user friendly with  graphical interface that can be easily implement inC. Working of DBSCAN Algorithm MATLAB as compared to other GUI development The DBSCAN algorithm works as follows. tools. Moreover MATLAB is high performanceInitially, all objects in the data set are assumed to be language for technical computing. It integratesunassigned. DBSCAN then chooses an arbitrary computation, visualization, and programming in anunassigned object p from the data set. If DBSCAN easy to use environment where problems andfinds p is a core object, it finds all the density solutions are expressed in familiar mathematicalconnected objects based on eps and minpts. It notation. Hardware requirements of the system areassigns all these objects to a new cluster. If as follows:DBSCAN finds p is not a core object , then p is  Processor : PENTIUM IV or onwardsconsidered to be noise and DBSCAN moves onto  RAM : 1 MB SDRAMthe next unassigned object. Once every object is  Hard disk : 10 GB of disk spaceassigned, the algorithm stops[8].  Operating System : windows, Mac, UNIX etc.V. EXPRIMENTAL SETUP This section describes the data set form the VI. EXPRIMENTAL RESULTbasis of our work as well as the analysis tool used in To perform the evaluation the followingthe evaluation. The basic input to any system is the evaluation parameters are used. It is necessary todata. In the proposed system the input is the real evaluate the performance of the system beingworld data set. Real data set are the set that a data designed. To do so the overall accuracy of theanalyst may encounter while dealing with real world system is calculated. All most all the testing samplesapplication where the attribute are real valued. is determined. The overall accuracy is calculatedFurthermore, the data set may contain binary or and as shown in table 2.numeric values. Since the system is designed for We also determine the following factors.semi-supervised classification thus we consider both  Over all Accuracy.labeled and unlabeled data for classification. The  Accuracy of each class.classifier is tested on NSL KDD data set.  Precision values of each class.  Recall values of each class.Dataset Description  Fmeasure of each class. The NSL-KDD Intrusion detection data isused in our experiments. This data was prepared by We apply the following metrics to indicatethe 1998 DARPA Intrusion Detection Evaluation the performance of our traffic classifier:program by MIT Lincoln Labs. This is the data set For a given class, the number of correctlyused for The Third International Knowledge classified objects is referred to as a True PositiveDiscovery and Data Mining Tools Competition, (TP). The number of objects falsely identified as awhich was held in conjunction with NSl KDD [6]. class is referred as a False Positive(FP). The numberThis dataset contains 41 features such as duration, of objects from a class that are falsely labeled asProtocol_type, service, flag etc. The raw training another class is referred as False Negative (FN).data was about four gigabytes of compressed binary Accuracy is the ratio of Sum of True PositiveTCP dump data from seven weeks of network and True Negative to the sum of all True and Falsetraffic. This was processed into about five million Positve, True and False Negative for all classesconnection records. Each connection is labeled as TP+TNeither normal, or as an attack, with exactly one Accuracy = TP + FP+TN+FNspecific attack type. Each connection record consistsof about 100 bytes. Attacks fall into four main Recall is the ratio of True Positives to thecategories Probe, DoS, U2R and R2L. The dataset number of True Positives and False Negatives. Thisused for experiments contains 8,200 records. We 1620 | P a g e
  6. 6. Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 5, September- October 2012, pp.1616-1623 determines how many objects in a class are achieved for all classes. Eighth R2L class achieved misclassified as something else. 91.26% f-measure value. This means that instances TP belong to this class are more correctly classified. Recall = TP + FN Ninth U2R class achieved lowest f-measure values i.e. the large number of instances are misclassified Precision is the ratio of True positives to True as compared to other class. and False Positives. This determines how many Table 2: Precision , Recall and F1 Measure identified objects were correct. F1 TP Accuracy Precision Recall Measure Precision = TP + FP Class (%) (%) (%) (%) Normal 98.57 100 86.13 92.54 F measures the balance between precision and Dos 97.65 99.11 90.02 94.35 recall; it is harmonic mean between them. U2R 98.00 57.29 99.06 72.60 2∙precision ∙ recall R2L 96.25 94.76 99.15 96.90 F = precision ∙ recall Probe 98.47 97.34 76.56 85.71 Performance Evaluation It is necessary to evaluate the performance VII. CONCLUSION of the technique being implemented. We evaluating In this paper we presented a DBSCAN the performance of classifier at number of clusters based semi supervised clustering algorithm to build equal to 18. To do so Evaluation matrix is computed the classifier and it has been achieved successfully. for test dataset and it is shown in Table 2. All the It permits both labeled and unlabeled instances to be testing instances it is determined how many used in training the classifier. The classifier instances are incorrectly classified and correctly achieves 97.33% accuracy at K=18. It is observed classified. It is difficult to determine the precise that the accuracy of the classifier depends on the values of Eps and MinPts for DBSCAN [20]. number of clusters and initial parameters i.e. minpts However, we can use a probable range for their and epsilion distance of DBSCAN algorithm. This values by means of experience. The values of Eps technique will be used in real time traffic are 1, 2, 4 and 8. The values of MinPts are 4, 6, 8 classification in future. and 10. Table 1: Confusion Matrix for Test Data set Predicted asClass Normal Dos U2R R2L ProbeNormal 354 0 7 50 0Dos 0 785 21 64 2U2R 0 0 106 1 0R2L 0 7 10 251 3Probe 0 0 41 15 183 From Table 2 we calculate the overall accuracy of the classifier and it is 97.76% at number of cluster =18. Table 2 shows the accuracy, precision, recall and F measure value of each class. And fig. 2,3,4,5 are plotted from the Table 2. Several observations can be made from the table 2 Fig. 2 Accuracy of each class First, more than 97% accuracy is achieved for all classes. Second, more than 94% precision achieved for all classes except U2R class. While normal class achieved 100% precisions i.e. the instances belong to other class are not classified as belongs to normal class. Third, the U2R class achieved lowest precision indicates that the other instances are misclassified as belongs to this class as compared to others. Forth, more than 76% recall achieved for all classes. Fifth, U2R class achieve 99.06% recall i.e. the instances belong to this class are more correctly classified. Sixth, Probe class achieved lowest recall values i.e. the large number of instances belongs to this class are misclassified as compared to the other classes. Seventh more than 72 % f-measure is 1621 | P a g e
  7. 7. Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 5, September- October 2012, pp.1616-1623 Fig. 3 Precision of each class supervised clustering and probabilistic assignment technique for network traffic classification” March –April 2012. [5] (Basic Book) H.Margaret , S. S. Dynham , “Data Mining Introductory and Advanced topics”. [6] NSL KDD Intrusion Detection Datasets. Available at: KDD. [7] “Design and Implementation of Semi- Supervised Classification using Support Vector Machine” GSITS M.E.Thesis , J. Thomas,2008 [8] Caihong Yang, Fei Wang, Benxiong Huang “ Internet Traffic Classification using DBSCAN” WASE Internetional conference of Information Technology 2009. [9] Alberto Dainotti, Walter de Donato,Fig. 4 Recall of each class Antonio Pescape and Pierluigi Salvo Rossi, “Classification of Network Traffic via Packet Level Hidden Markov Models”, In IEEE GLOBECOM, New Orleans, LO, Dec.2008. [10] IANA, “Internent Assigned Numbers Authority”, number. [11] Subhabrata Sen, Oliver Spatscheck, and Dongmesi Wang, “Accurate, Scalable In- network Identificatio of P2P Traffic using application Signature”, WWW 2004, New York, USA ACM, May 17-22, 2004, 512- 521. [12] I.Witten and E.Frank, Data mining: practical machine learning tools and techniques (Second Edition, Morgan Fig. 5 Fmeasure of each class Kaufmann Publishers, 2005). [13] Thuy T.T. Nguyen, Grenville Armitage, “ REFERENCES A Survey of Techniques for Internet Traffic [1] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, Classification using Machine Learning”, In and C. Williamson, “Semi-Supervised IEEE communication surveys and tutorials, Network Traffic 2008, 1-21. Classification”,SIGMETRICS07, June [14] Chuanliang Chen, Yunchao Gong and 12.16, 2007, San Diego, California, USA. Yingjie Tian, “ Semi-Supervised Learning ACM 978-1-59593-639-4/07/0006. Methods for Network Intrusion Detection”, [2] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, in proc. of IEEE International Conference and C. Williamson, “Offline/Online Traffic on Systems Man and Cybernetics (SMC), Classification Using Semi-Supervised 2008 , 2603-2608. Learning”, Technical report, University of [15] Amita Shrivastav and Aruna Calgary, 2007. Tiwari,”Network Traffic Classification [3] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, using Semi-Supervised Approach”, Second and C. Williamson,” Traffic Classification International Conference on Machine Using Clustering Algorithms”, University Learning and computing (ICMLC), 2010, of Calgary, SIGCOMM’06 Workshops 345-349. September 1115, 2006, Pisa, Italy. [16] Rudra Pratap, Getting started with Copyright 2006 ACM MATLAB 7 (OXFORD University press, 1595934170/06/0009. Indian Edition). [4] Vinod Mahajan and Bhupendra Verma “Implementation of distance based semi- 1622 | P a g e
  8. 8. Mr. Shezad Shaikh, Mr. Niket Bhargava, Ms. Urmila Mahor / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 5, September- October 2012, pp.1616-1623[17] Nigel Williams Sebastian Zander, [19] Liu Bin and Tu Hao, “An Application Grenville Armitage, Evaluating Machine Traffic Classification Method Based on Learning Algorihtms for Automated Semi-Supervised Clustering”, A 2nd Network Application Identification, CAIA International Symposium on Information Technical Report 060410B, March 2006, 1- Engineering and Electronics Commerce 14. (IEEC), 2010,1-4[18] Levi Lelis and Jorg Sander,”Semi- [20] Caihong Yang, Fei Wang, Benxiong Huang Supervised Density- Based Clustering”, in “Internet traffic classification using proc. of Nineth IEEE International DBSAN” A WASE International Conference on Data Mining , Miami, FL , conference on Information Engineering, Dec 2009, 842-847. 2010 1623 | P a g e