International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 5, Issue 5, May (2014), pp. 49-55 © IAEME: www.iaeme.com/ijcet.asp. Journal Impact Factor (2014): 8.5328 (Calculated by GISI), www.jifactor.com

PRE-PROCESSING OF SERVER LOG FILE ALONG WITH MAC ADDRESS USING APRIORI AND DECISION TREE

Bikramjit Kaur, Shanu Verma, Anumati Thakur
Students of M.Tech Computer Science, Department of CSE, Lovely Professional University, Phagwara, Punjab, India

ABSTRACT

Internet traffic is heavy today, and almost every task is carried out online. High network traffic also brings a high risk of attack, so securing data on the internet is a difficult challenge for professionals. Distributed systems are used to handle large amounts of data. In a distributed system, data can be accessed from anywhere over the web, which makes it hard for the administrator to identify an unauthorized client or attacker. Server administrators use the web server log file to identify the visitors to their web site, but server log files are noisy and contain unformatted data, and some of the information in the log can be altered by an attacker. The work proposed in this paper addresses these problems in the server log file. Recent research in this field has provided inefficient log file clustering. Web usage mining is a data mining technique used to mine the data in web server log files. To resolve the problem, we propose a method that stores the client's MAC address in the server log file so that attackers can be tracked.

Keywords: Distributed Database, Security of Server, Log Files, MAC Address, Web Usage Mining, Apriori Algorithm, Decision Tree.

I. INTRODUCTION

Web applications are growing at an enormous rate, and their user base is growing exponentially. Advances in technology have made it possible to capture users' interactions with web applications through the web server log file.

A distributed database is a collection of data that belongs logically to the same system but is spread over the sites of a computer network [1]. In a traditional database, the database administrator has centralized control and can ensure that only authorized access to the data is performed. In a distributed system, the data can be accessed from anywhere, because a distributed database is a collection of multiple databases that are distributed over the network and connected to each other, so unauthorized people can easily gain access to the data. The administrator of a distributed database therefore faces the following security problems:

• Unauthorized people who are trying to access the database must be recognized.
• Client-side information stored in the log file, such as the IP address, can be changed or spoofed by the user, so it is difficult for the administrator to identify the user.
• The server administrator is unaware of all the activities that users perform on the web site.
• Server-side log files are used to address these problems, but the data stored in them is noisy, unformatted and hard to read.

The log file records the user's IP address in order to identify the user. Today, however, changing an IP address is an easy task for a hacker, who can then use the changed address to access data in the database over the network. The question for the server administrator is therefore how the distributed database can be secured through log files when those log files contain noisy and unformatted data. The proposed work addresses these problems and secures the data in the distributed database.

II. SERVER LOG FILE

Web usage mining is the application of data mining techniques to extract useful information from raw web log files [2]. The server log file stores information about users and the actions they perform on a site. It is password protected and can be viewed only by the server administrator. It is a simple text file with the extension .txt that records users' activities and supports the analysis of user behaviour [3].
Every database server has log files at the back end. They store the user's web browsing history, such as the client IP address, date, time, server IP, server port, client-server method, bytes, the time taken to perform an action, and the requests made. When the server administrator exports the web log data from the server, the exported data is noisy and unformatted. A raw log file is therefore hard to read, and useful information cannot be extracted from it directly. There are different types of log files, such as the access log, error log, agent log and referrer log.

Figure 1: Raw log file [4]
Table 1: Attributes of the access log file and their descriptions.

| Attribute | Description |
|---|---|
| Client IP Address | IP address of the client machine. |
| Date | Date on which the user accessed the data and the transaction was recorded. Format: YYYY-MM-DD. |
| Time | Time of the transaction. Format: HH:MM:SS. |
| Server Site Name | Internet service name as it appears on the client machine. |
| Server Computer Name | Name of the server. |
| Server IP | Server IP address provided by the Internet Service Provider. |
| Server Port | Server port configured for data transmission, for example port 80. |
| Client Server Method | Method of the client request, such as GET, POST or HEAD. |
| Server Client Bytes | Number of bytes sent by the server to the client. |
| Client Server Bytes | Number of bytes sent by the client to the server. |
| Time Taken | Time taken to complete the action requested by the client. |
| Client Server Version | Protocol version used by the client to connect to the server. |
| User Agent | Type of browser used by the client. |
| Cookies | Contents of the cookies. |
| Referrer | Link from which the client arrived at this site. |
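As an illustration of how the attributes in Table 1 appear in practice, the sketch below parses a few lines written in the W3C extended log file format, where a `#Fields:` directive names the columns. The sample lines, the function name `parse_w3c_log` and the chosen subset of fields are assumptions made for this example; they are not taken from the paper's data set.

```python
from typing import Dict, Iterable, Iterator, List


def parse_w3c_log(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per request line, keyed by the names in #Fields."""
    fields: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive lines carry metadata; only #Fields defines columns.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # malformed row: skip it rather than guess
        yield dict(zip(fields, values))


# Hypothetical sample input, including one noisy line that gets dropped.
sample = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status sc-bytes time-taken",
    "2014-05-01 10:15:02 192.168.1.10 GET /index.html 200 5120 31",
    "2014-05-01 10:15:03 192.168.1.10 GET /logo.png 200 20480 12",
    "broken line",
]
records = list(parse_w3c_log(sample))
for record in records:
    print(record["c-ip"], record["cs-method"], record["cs-uri-stem"])
```

Reading the column names from the directive, instead of hard-coding positions, keeps the parser usable when a server is configured to log a different subset of the attributes.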
III. RELATED WORK

This section discusses related work on server log files and the existing pre-processing and classification techniques applied to them. Many research papers have been published on pre-processing and classifying server log files in order to remove noise from them. The problem addressed here is drawn from the conclusions of these papers, and the concepts of the proposed work follow from this survey.

Dr. C. Sunil et al., in “Security Implications of Distributed Database Management System Models” [5], discuss the security strengths and weaknesses of database models and the problems that arise in a distributed environment. When determining which distributed database model will be more secure for a particular function, the choice should not be made on the basis of available security features [5]. The paper reviews security concerns for databases and distributed database environments and examines the security issues found in database models. Security protections differ considerably between data models in a distributed environment, and each model has important strengths and weaknesses.

Marathe Dagadu Mitharam, in “Pre-processing in Web Usage mining” [6], explains that web usage mining is used to discover relevant, useful and hidden information from server log files. Analysing the visitors to a website consists of three steps: pre-processing, pattern discovery and pattern analysis. Only the first step is described in the paper. Pre-processing operates on web log files or other raw log files. The paper explains the fusion and synchronization of data from multiple log files, data cleaning, user identification and session time construction [6]. The proposed methods were tested on server log files, and they help the administrator to recognize users.

Jaideep Srivastava et al., in “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data” [7], provide an up-to-date survey of web usage mining, including research efforts in academic, commercial and industrial areas. The paper describes the kinds of web data that are useful for web usage mining and discusses the challenges involved in discovering usage patterns from web data. The three phases are preprocessing, pattern discovery and pattern analysis. The paper presents a detailed taxonomy and survey of existing efforts in web usage mining and gives an overview of the WebSIFT system as a prototypical example of a web usage mining system [7]. Finally, it discusses privacy issues and challenges regarding server log files.

S. Chitra and Dr. B. Kalpana, in “A Novel Pre-processing Mixed Ancestral Graph Technique for Session Construction” [8], address the problems of web server log files: normal log data is very noisy and unclear, so it must be pre-processed for web usage mining to be efficient. Web usage mining is then applied to discover patterns in the browsing behaviour of the website's visitors. The log files are available on the web server, and only they record session information about the web pages. Pre-processing consists of three phases: data cleaning, user identification and session construction. It cleans the recorded log data, discovers the interesting user information and constructs the sessions. There are many types of web log files, but they typically share the same basic information, such as IP address, request, bytes and requested URL [8].
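The three pre-processing phases named in the surveyed papers (data cleaning, user identification and session construction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited paper: the list of static-file suffixes, the use of the (IP address, user agent) pair to identify a user, the 30-minute idle gap, and the helper names `clean`, `build_sessions` and `rec` are all chosen for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumptions for illustration only: these suffixes and the 30-minute gap
# are common heuristics, not values taken from the surveyed papers.
STATIC_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js", ".ico")
SESSION_GAP = timedelta(minutes=30)


def clean(records):
    """Data cleaning: keep successful page requests, drop embedded files."""
    return [
        r for r in records
        if r["status"] == 200
        and not r["url"].lower().endswith(STATIC_SUFFIXES)
    ]


def build_sessions(records):
    """Identify users by (ip, agent), then split their hits on idle gaps."""
    by_user = defaultdict(list)
    for r in sorted(records, key=lambda r: r["time"]):
        by_user[(r["ip"], r["agent"])].append(r)

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit["time"] - prev["time"] > SESSION_GAP:
                sessions.append((user, current))
                current = []
            current.append(hit)
        sessions.append((user, current))
    return sessions


def rec(ip, agent, hhmm, url, status=200):
    """Build one hypothetical log record for the demo below."""
    t = datetime.strptime("2014-05-01 " + hhmm, "%Y-%m-%d %H:%M")
    return {"ip": ip, "agent": agent, "time": t, "url": url, "status": status}


log = [
    rec("10.0.0.1", "Firefox", "10:00", "/index.html"),
    rec("10.0.0.1", "Firefox", "10:00", "/style.css"),
    rec("10.0.0.1", "Firefox", "10:05", "/about.html"),
    rec("10.0.0.1", "Firefox", "11:30", "/index.html"),
    rec("10.0.0.2", "Chrome", "10:01", "/missing.html", status=404),
    rec("10.0.0.2", "Chrome", "10:02", "/index.html"),
]
sessions = build_sessions(clean(log))
for user, hits in sessions:
    print(user, [h["url"] for h in hits])
```

In this sample the stylesheet request and the failed request are removed during cleaning, and the first visitor's return after more than 30 minutes starts a second session.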
The paper also discusses the four main steps of web usage mining: data collection, data pre-processing, knowledge extraction and analysis of the extracted results. Its proposed method constructs sessions as a Mixed Ancestral Graph (MAG). MAG construction involves calculating browsing time, calculating the weight of pages, and discovering sequential patterns and cluster patterns [8]. Web site administrators can use the results to improve their web sites more easily. An experiment carried out in MATLAB demonstrates the effectiveness of the proposed MAG method.

Mirghani A. Eltahir and Anour F. A. Dafa-Alla, in “Extracting Knowledge from Web Server Logs Using Web Usage Mining” [2], describe how web usage mining is used to gather knowledge from web server log files, where the history of all visitors is registered, and present the main problem facing any website administrator. To implement web usage mining, various data sources were selected, such as the web server log, proxy server log and browser log. The paper presents the format of log files, the type of information stored in them, and how that information helps a web administrator. The web usage mining process proceeds step by step. The first step is data collection, in which raw usage data is collected from web servers. The second step is data pre-processing, which includes data cleaning, user identification and session identification [2]. The next step is pattern discovery, and the final step is pattern analysis. The paper highlights the exploration of data and the analysis of user activity on web sites, an aspect that many institutions ignore but that is important in building a strong relationship between the web administrator and the users.

Paweł Weichbroth et al., in “Web User Navigation Patterns Discovery from WWW Server Log Files” [9], present a framework for web mining that organizes knowledge about user navigation patterns.
They focus on the critical factors of the extracted knowledge in order to evaluate results effectively, so that only suitable knowledge is added to the knowledge base of existing personalization and recommendation systems (PRS) [9]. The paper also presents related work on web mining algorithms and software tools, and discusses the web usage mining process and its tasks, including data format, pre-processing and pattern discovery. The developed framework, the implemented algorithm and the results are presented. The framework is divided into six main components: database, data access service, controller, user interface, algorithm class and file controller. The tool for the framework was implemented in .NET 4.0 and uses Microsoft SQL Server 2008 as the back end.

IV. PROPOSED WORK

In a distributed system it is very important to secure the database from unauthorized people, so the main objective of this work is to provide security for the distributed database system. The proposed work addresses the problem of log file clustering. Recent research in this field has provided inefficient log file clustering, so sessions that were important to view were missed and the network administrator could miss important records in the log file. The technology we use is data mining, specifically two of its techniques: the decision tree and the Apriori algorithm. Apriori provides effective clustering of the log file to remove noisy data. The decision tree recognizes patterns and displays them session by session. In addition, the MAC address of the client system is fetched along with the IP address and displayed as well, which helps in identifying unauthorized users.

V. METHODOLOGY OF PROPOSED WORK

The methodology of the proposed work uses two data mining techniques, the Apriori algorithm and the decision tree.
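Because Apriori is central to the methodology, a minimal sketch of its frequent-itemset step is given here. It illustrates the classic algorithm on hypothetical page-visit sessions and is not the authors' implementation: the session data, the page names and the `min_support` threshold are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count how many transactions contain each candidate itemset.
        counts = {
            c: sum(1 for t in transactions if c <= t) for c in candidates
        }
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    frequent = count({frozenset([i]) for i in items})
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result


# Hypothetical sessions: each is the set of pages one visitor requested.
sessions = [
    {"/index", "/products", "/cart"},
    {"/index", "/products"},
    {"/index", "/cart"},
    {"/products", "/cart"},
    {"/index", "/products", "/cart"},
]
frequent = apriori(sessions, min_support=3)
for itemset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

The prune step is what gives Apriori its name: an itemset can only be frequent if all of its subsets are frequent, so candidates with an infrequent subset are discarded before any counting is done.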
The Apriori algorithm is used to optimize the server log file data. The raw server log data must pass through the web usage mining process to give the final result. Apriori is a classic data mining algorithm for learning association rules. It provides effective clustering of the log file to remove noisy and useless data. The decision tree then recognizes the patterns and displays them session by session, and it presents the noise-free, formatted log file along with the MAC address of the system, which is used to identify unauthorized users.

Figure 2: Flowchart of proposed work

VI. CONCLUSION

Security has been a growing concern in distributed database systems, where authentication and authorization are the main security issues. The main objective of this work is therefore to provide security for the distributed database system through the log files: the MAC address of the client machine is fetched along with the IP address and recorded in the server log file, which helps in identifying unauthorized users. Because the server administrator draws knowledge from the server log files, they should be in a proper format.

VII. REFERENCES

[1] Mahajan, S. and Shah, S. (2010) Distributed Computing, India.
[2] Mirghani A. Eltahir and Anour F. A. Dafa-Alla (2013) “Extracting Knowledge from Web Server Logs Using Web Usage Mining”, IEEE, 2013 International Conference on Computing, Electrical and Electronic Engineering (ICCEEE).
[3] http://www.mediacollege.com/internet/statistics/logs/
[4] http://www.axigen.com/docs/70/View-LogFiles_387.html
[5] Kumar Sunil, J. Seetha and S. R. Vinotha (2012) “Security Implications of Distributed Database Management System Models”, International Journal of Soft Computing and Software Engineering (JSCSE), 2(11).
[6] Marathe Dagadu Mitharam (2012) “Preprocessing in Web Usage mining”, International Journal of Scientific & Engineering Research, 3(2).
[7] Srivastava Jaideep, Cooley Robert, Deshpande Mukund and Tan Pang-Ning (2010) “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data”, IEEE, 1(2), pp. 1-12.
[8] S. Chitra and Dr. B. Kalpana (2013) “A Novel Preprocessing Mixed Ancestral Graph Technique for Session Construction”, IEEE, International Conference on Computer Communication and Informatics, Coimbatore, India, pp. 1-7.
[9] Paweł Weichbroth, Mieczysław Owoc and Michał Pleszkun (2012) “Web User Navigation Patterns Discovery from WWW Server Log Files”, IEEE, Proceedings of the Federated Conference on Computer Science and Information Systems, pp. 1171-1176.
[10] Ravita Mishra, “Web Usage Mining Contextual Factor: Human Information Behavior”, International Journal of Information Technology and Management Information Systems (IJITMIS), Volume 5, Issue 1, 2014, pp. 12-29, ISSN Print: 0976-6405, ISSN Online: 0976-6413.
[11] Shaymaa Mohammed Jawad Kadhim and Dr. Shashank Joshi, “Agent Based Web Service Communicating Different Is’s And Platforms”, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 5, 2013, pp. 9-14, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[12] Tawfiq Khalil and Ching-Seh (Mike) Wu, “Link Patterns in the World Wide Web”, International Journal of Information Technology and Management Information Systems (IJITMIS), Volume 4, Issue 3, 2013, pp. 96-113, ISSN Print: 0976-6405, ISSN Online: 0976-6413.
[13] Sumana M and Hareesha K S, “Preprocessing and Secure Computations for Privacy Preservation Data Mining”, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 4, 2013, pp. 203-212, ISSN Print: 0976-6367, ISSN Online: 0976-6375.