IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 9, Issue 4 (Mar. - Apr. 2013), PP 76-79
www.iosrjournals.org

          A Survey on Data Preprocessing in Web Usage Mining
                              Murti Punjani¹, Mr. Vinitkumar Gupta²
           ¹ (Department of Computer Engineering, Hasmukh Goswami College of Engineering, India)
           ² (Department of Computer Engineering, Hasmukh Goswami College of Engineering, India)

Abstract : With the widespread use of the Internet and the constant growth in the number of users, the World
Wide Web holds a huge store of data. These data, recorded in Web server logs, are an important source of
information about how users access Web sites. People are interested in analyzing log files because they show
the actual usage of a Web site. However, the raw data are not accurate, so preprocessing of Web log files is
essential before the data are suitable for knowledge discovery or mining tasks. Web Usage Mining, a part of Web
mining and an application of data mining, is the automatic discovery of patterns in clickstreams and associated
data collected or generated as a result of user interactions with one or more Web sites. This survey paper
presents a literature review and an overview of the steps needed in the preprocessing phase.
Keywords - Data Fusion, Path Completion, Preprocessing, Session Identification, Web Usage, Web Server
Log File.

                                            I.     INTRODUCTION
          With the fast growth of Internet technology, preprocessing is necessary to obtain useful information
about user access and is one of the most important research topics. With the explosive growth of information
available on the WWW (World Wide Web), discovery and analysis of useful information has become a necessity.
The Web has become an important medium to communicate ideas, transact business and promote entertainment. The
discovery and analysis of useful information from Web documents is referred to as Web mining [1]. The data are
stored in the web server log in heterogeneous form, so they must be preprocessed to extract useful
information. Web Mining is divided into three categories [11]: 1. Web Content Mining, 2. Web Structure Mining
and 3. Web Usage Mining. Web Content Mining is the process of extracting useful information from the contents
of web documents. Web Structure Mining is the process of discovering structure information from the web, where
structure means hyperlinks and document structure. Web Usage Mining is the application of data mining to
extract user access patterns from web server log files.

A. Web Usage Mining
Web Usage Mining, also known as Web Log Mining, is used to discover patterns from web server logs. The primary
source of data for web usage mining is the textual logs collected from web servers all around the world. There
are four phases in web usage mining [4]:
1. Data Collection - User logs are collected from the client side, the server side, proxy servers, application
     servers etc.
2. Data Preprocessing - Consists of phases such as data fusion and cleaning, user identification, session
     identification and path completion.
3. Pattern Discovery - Patterns are discovered from the preprocessed data using data mining techniques such as
     statistical analysis, association, clustering, pattern matching and so on.
4. Pattern Analysis - Once patterns are discovered, they are analyzed using a knowledge query mechanism such
     as SQL, or data cubes to perform OLAP operations.




                                          Fig. 1 Phases of Web Usage Mining
1.1    Motivation
       Thousands of users access multiple web sites all over the world. As they do so, a huge amount of data
is gathered in the web log files. These data are useful because they show, for example, how often a user
accesses the same page, and they can be further used to obtain user access patterns and user behavior. As the
raw data cannot be used directly in Web Usage Mining (WUM), preprocessing is necessary. Preprocessing of the
web log file is a tedious job and takes 80% of the total time of the web usage mining process as a whole [12].
We therefore conclude that preprocessing is a significant phase, which also improves the quality of the
data [13].

                                           II.        Literature Review
          The aim of this literature review is to study and compare the available techniques for preprocessing.
Because the web log file contains a huge number of extraneous and inaccurate entries, it cannot be used
directly in the WUM process, so preprocessing is a must.
          Ravindra Gupta and Prateek Gupta [17] carried out two main tasks: customized web log preprocessing
and an improved FP Tree algorithm. A raw web log file was taken as input. The authors modified the FP tree
algorithm and proposed an improved FP tree algorithm, divided into two main processes: creation of the modified
FP tree and mining. In the modified FP tree, items were stored in descending order of their frequency. The
customized preprocessing steps were Customization, in which log cleaning was performed on the basis of user
requirements, followed by Data Cleaning, User Identification, Session Identification and, as the last step, a
database of the cleaned log. Applying these steps generated a compressed log file holding user access behaviour
in numeric form, which can then be sent for mining using the modified FP tree algorithm.
          Wahab et al. [16] discussed the different types of log files in detail, as well as all 19 attributes
of the web log file and the different log file formats. They proposed an algorithm for reading server logs and
an algorithm for transferring the log file to a database. After reading web log files in any one of the three
formats, various attributes were ignored because they were considered not significant for the analysis, and
data filtering was performed to remove these unwanted attributes. The web server log file contained 18
attributes, of which 17 were removed as unwanted; only one attribute, "URL", was kept and stored in the
database. Because some important attributes were not considered, reliability was not maintained, so the
proposed algorithms need to be modified.
          In the work of Raju and Satyanarayana [6], the input was a raw web log file collected from the NASA
Web site during July 1995. Customized preprocessing steps generated a compressed log file holding user behavior
in numeric form, which was then passed to the mining process using the modified FP tree algorithm. The output
is a complete relational database model for storing structured information about the Web site, its usage and
its users.
          Since the web log file contains important data about a website, Suneetha and Krishnamoorthi [15] took
the web log data of the NASA website as input. The authors discussed the sources of web logs, the web log
structure and HTTP status codes in detail. They applied preprocessing techniques to the web server log file.
The first step was data cleaning, in which irrelevant entries, such as entries with an error or failure status
and entries for images, were removed. The next step was user identification, which used three attributes from
the log file: IP Address, Operating System and User Agent. The output can be used to increase the effectiveness
of the website. The authors did not apply a session identification phase.

     Author Name                     Source of log file   Preprocessing Technique                Algorithm applied
     ------------------------------  -------------------  -------------------------------------  --------------------------
     Ravindra Gupta, Prateek Gupta   Raw web log file     Data Cleaning, User Identification,    Improved FP Tree Algorithm
                                                          Session Identification, Formatting
     Mohd Helmy Abd Wahab, Mohd      Server Log File      File Reading, Data Cleaning,           Proposed
     Norzali Haji Mohd and Mohamad                        Data Filtering
     Farhan Mohamad Mohsin
     Raju and Satyanarayana          Server Log File      Data Merging, Data Cleaning,           NA
                                                          User Identification,
                                                          Session Identification
     Suneetha, K. R. and             Server Log File      Data Cleaning, User Identification     NA
     D. R. Krishnamoorthi
                                   TABLE 1 Summary of Literature Review


                                       III.    Data Preprocessing Tasks
         Fig 2 shows the phases of Data Preprocessing in Web Usage Mining. The goal of preprocessing is to
transform the raw clickstream data into a set of user profiles [5]. Data preprocessing presents a number of
unique challenges, which have led to a variety of algorithms and heuristic techniques for preprocessing tasks
such as merging and cleaning, user and session identification etc. [6]. The input to the preprocessing stage is
the web server log file. The web server log contains 19 attributes such as Date, Time, Client IP, AuthUser,
ServerName, ServerIP, ServerPort, Request Method, URI-Stem, URI-Query, Protocol Status, Time Taken, Bytes Sent,
Bytes Received, Protocol Version, Host, User Agent, Cookies, Referer.




                        Fig 2: Phases of Data Preprocessing in Web Usage Mining [3]


Sample Log file is given below [3]:

2007-12-06 05:22:16 ::1 GET /iisstart.htm - 80 - ::1 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.0;+SLCC1;+.NET+CLR+2.0.50727;+Media+Center+PC+5.0;+InfoPath.1;+.NET+CLR+1.1.4322;+.NET+CLR+3.5.21022;+.NET+CLR+3.0.04506) 200 0 0 296 336
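
As an illustration of how such an entry can be read before preprocessing, the following minimal Python sketch splits one space-delimited log line into named fields. The field names in ASSUMED_FIELDS are assumptions made for this example; the sample above does not label its columns, and the last two numeric columns are left unnamed. The user agent in the example is shortened for readability.

```python
# Field names are assumptions for illustration; the sample log does not label its columns.
ASSUMED_FIELDS = [
    "date", "time", "server_ip", "method", "uri_stem", "uri_query",
    "server_port", "username", "client_ip", "user_agent",
    "status", "substatus", "win32_status", "extra_1", "extra_2",
]


def parse_log_line(line):
    """Split one space-delimited log entry into a dict keyed by ASSUMED_FIELDS."""
    tokens = line.split()
    if len(tokens) != len(ASSUMED_FIELDS):
        raise ValueError(f"expected {len(ASSUMED_FIELDS)} fields, got {len(tokens)}")
    entry = dict(zip(ASSUMED_FIELDS, tokens))
    entry["status"] = int(entry["status"])
    # The sample writes '+' where the user agent string would contain spaces.
    entry["user_agent"] = entry["user_agent"].replace("+", " ")
    return entry


sample = ("2007-12-06 05:22:16 ::1 GET /iisstart.htm - 80 - ::1 "
          "Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.0) 200 0 0 296 336")
entry = parse_log_line(sample)
print(entry["uri_stem"], entry["status"], entry["client_ip"])
```

A dictionary per entry keeps the later phases (cleaning, user and session identification) independent of the column order of any particular log format.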

3.1 Data Fusion and Cleaning
Merging of the log files from various Web and application servers is done at the Data Fusion phase.
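
One possible way to carry out this merge is sketched below in Python. It assumes that every log line starts with a date and a time in the layout of the sample entry, and that each server's log is already in chronological order; the function names and the example lines are hypothetical.

```python
import heapq
from datetime import datetime


def timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' of a log line (assumed layout)."""
    date, time = line.split()[:2]
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")


def fuse_logs(*logs):
    """Merge several logs, each already in time order, into one time-ordered list."""
    return list(heapq.merge(*logs, key=timestamp))


# Hypothetical entries from two servers.
server_a = ["2007-12-06 05:22:16 10.0.0.1 GET /index.htm",
            "2007-12-06 05:25:00 10.0.0.1 GET /about.htm"]
server_b = ["2007-12-06 05:23:40 10.0.0.2 GET /news.htm"]
merged = fuse_logs(server_a, server_b)
for line in merged:
    print(line)
```

heapq.merge reads the inputs lazily, so the same approach would also work on file objects without loading whole logs into memory.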




                                      Fig. 3 Web Log File in Text format [2]

The goal of the data cleaning phase is to remove extraneous and redundant log entries. Important fields such
as date, time, client IP, user agent (browser used), URL requested, referring URL and time taken are retained
for further processing. The extraneous or redundant data to be removed are [2]: i) Requests for embedded
resources. Only log information related to user access is wanted, but because HTTP is a stateless protocol,
each graphic or script embedded in a page is fetched by a separate request and is also recorded. File
extensions are therefore checked, and entries for files with extensions such as .css, .gif, .jpeg, .jpg etc.
are eliminated. ii) Robot requests. iii) Entries with errors. Entries having a status code less than 200 or
greater than 299 are eliminated, as they are treated as failure entries.
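
The three cleaning rules can be sketched as a single filter. This is a minimal Python illustration, not the method of [2]: the entry layout (a dict with uri_stem, user_agent and status keys) and the list of robot markers are assumptions made for the example.

```python
import os

# Extensions named in the text; extend as needed for a given site.
RESOURCE_EXTENSIONS = {".css", ".gif", ".jpeg", ".jpg"}
# Substrings taken to indicate a crawler; this list is an assumption.
ROBOT_MARKERS = ("bot", "crawler", "spider")


def is_relevant(entry):
    """Return True if a log entry should survive the cleaning phase."""
    # i) drop requests for embedded resources, judged by file extension
    extension = os.path.splitext(entry["uri_stem"])[1].lower()
    if extension in RESOURCE_EXTENSIONS:
        return False
    # ii) drop robot requests (robots.txt fetches or crawler-like user agents)
    if entry["uri_stem"] == "/robots.txt":
        return False
    agent = entry["user_agent"].lower()
    if any(marker in agent for marker in ROBOT_MARKERS):
        return False
    # iii) keep only entries whose status code lies in 200..299
    return 200 <= entry["status"] <= 299


def clean(entries):
    """Apply the cleaning rules to a list of parsed log entries."""
    return [e for e in entries if is_relevant(e)]


# Hypothetical parsed entries.
entries = [
    {"uri_stem": "/index.htm", "user_agent": "Mozilla/4.0", "status": 200},
    {"uri_stem": "/logo.gif", "user_agent": "Mozilla/4.0", "status": 200},
    {"uri_stem": "/index.htm", "user_agent": "Googlebot/2.1", "status": 200},
    {"uri_stem": "/missing.htm", "user_agent": "Mozilla/4.0", "status": 404},
]
cleaned = clean(entries)
print(len(cleaned))
```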





3.2 User Identification
          This phase identifies individual users by their client IP address. A new IP address indicates a new
user. If the IP address is the same but the browser version or operating system is different, the entry is
treated as coming from a different user [7].
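
A minimal Python sketch of this heuristic follows. It assumes that the user agent string is used as a stand-in for browser version and operating system, so that each distinct (client IP, user agent) pair is treated as one user; the entry layout and the function name are hypothetical.

```python
def identify_users(entries):
    """Label each entry with a user id derived from (client IP, user agent)."""
    user_ids = {}   # (client_ip, user_agent) -> sequential user id
    labelled = []
    for entry in entries:
        key = (entry["client_ip"], entry["user_agent"])
        if key not in user_ids:
            # Unseen IP, or known IP with a different browser/OS: new user.
            user_ids[key] = len(user_ids) + 1
        labelled.append({**entry, "user_id": user_ids[key]})
    return labelled


# Hypothetical parsed entries.
entries = [
    {"client_ip": "10.0.0.1", "user_agent": "MSIE 7.0; Windows NT 6.0"},
    {"client_ip": "10.0.0.1", "user_agent": "Firefox/3.0; Linux"},
    {"client_ip": "10.0.0.1", "user_agent": "MSIE 7.0; Windows NT 6.0"},
    {"client_ip": "10.0.0.2", "user_agent": "MSIE 7.0; Windows NT 6.0"},
]
users = [e["user_id"] for e in identify_users(entries)]
print(users)
```

Note that this heuristic cannot separate two people who share one machine and browser, nor recognize one person whose IP address changes.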

3.3 Session Identification
          A session of a particular user is the period for which the user is connected to a particular website;
it gives the total page accesses of that user during the visit. The following rules are used to identify user
sessions [3]:
1) If there is a new user, there is a new session;
2) In one user session, if the referring page is null, there is a new session;
3) If the time between page requests exceeds a certain limit (30 or 25.5 minutes), it is assumed that the user
is starting a new session.
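
The three rules above can be sketched as follows. This is a minimal Python illustration under stated assumptions: entries are already labelled with a user_id, sorted by time, and carry a referrer that is None or "-" when empty; the 30 minute limit is one of the two values mentioned in rule 3, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def sessionize(entries, timeout=timedelta(minutes=30)):
    """Assign a session id to each entry; entries must be sorted by time."""
    last_seen = {}      # user_id -> (time of previous request, current session id)
    next_session = 1
    result = []
    for e in entries:
        prev = last_seen.get(e["user_id"])
        new_session = (
            prev is None                         # rule 1: new user
            or e["referrer"] in (None, "-")      # rule 2: empty referrer
            or e["time"] - prev[0] > timeout     # rule 3: time limit exceeded
        )
        if new_session:
            session_id = next_session
            next_session += 1
        else:
            session_id = prev[1]
        last_seen[e["user_id"]] = (e["time"], session_id)
        result.append({**e, "session_id": session_id})
    return result


def at(hour, minute):
    """Helper for the example: a timestamp on an arbitrary fixed day."""
    return datetime(2007, 12, 6, hour, minute)


# Hypothetical, time-ordered entries.
entries = [
    {"user_id": 1, "time": at(10, 0), "referrer": None},
    {"user_id": 1, "time": at(10, 5), "referrer": "/a.htm"},
    {"user_id": 2, "time": at(10, 6), "referrer": None},
    {"user_id": 1, "time": at(10, 50), "referrer": "/b.htm"},
    {"user_id": 1, "time": at(10, 55), "referrer": "-"},
]
sessions = [e["session_id"] for e in sessionize(entries)]
print(sessions)
```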

3.4 Path Completion
          Path completion follows session identification. Because clients use proxy servers and view cached
versions of pages via the "Back" button, such page views never reach the server, and the identified sessions
have many lost pages. This phase is used to identify those lost pages.
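
The text does not specify how lost pages are found. One common heuristic, assumed here purely for illustration, uses the referrer: if a request's referrer is not the page visited immediately before it but does appear earlier in the session, the user is assumed to have pressed "Back" through cached pages until reaching that referrer. A minimal Python sketch with a hypothetical function name:

```python
def complete_path(requests):
    """Rebuild a session's page sequence, re-inserting inferred 'Back' visits.

    requests: list of (page, referrer) pairs for one session, in time order.
    """
    path = []
    for page, referrer in requests:
        if path and referrer and referrer != path[-1] and referrer in path:
            # Index of the most recent visit to the referrer.
            idx = len(path) - 1 - path[::-1].index(referrer)
            # Pages stepped back through, ending at the referrer itself.
            backtrack = path[idx:-1][::-1]
            path.extend(backtrack)
        path.append(page)
    return path


# Logged requests: A, then B from A, then C from B, then D from A.
logged = [("A", None), ("B", "A"), ("C", "B"), ("D", "A")]
completed = complete_path(logged)
print(completed)
```

In the example the server never saw the return from C to B and A, yet D was requested from A, so the two cached page views are inferred and inserted before D.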

                                        IV.       Conclusion And Future Work
         Preprocessing of the web log file is a mandatory step for web usage mining. After the data cleaning
step, the remaining preprocessing steps can be applied to extract user access patterns, which can be used
further for pattern analysis. This paper has outlined various current preprocessing techniques and explained
the tasks needed for preprocessing of the data in web usage mining. Future work is to increase the performance
of the web server by obtaining meaningful and useful information quickly. By analyzing web server log files,
user behavior within the web structure can be understood, leading to better design of web components and web
applications.

                                                          References
[1]    O. Etzioni, The World Wide Web: Quagmire or gold mine, Communications of the ACM, 39(11):65-68, 1996.
[2]    Vijayashri Losarwar, Dr. Madhuri Joshi, Data Preprocessing in Web Usage Mining, International Conference on Artificial
       Intelligence and Embedded Systems (ICAIES'2012), July 15-16, 2012, Singapore.
[3]    Li Chaofeng, Research and Development of Data Preprocessing in Web Usage Mining, School of Management, South-Central
       University for Nationalities, Wuhan 430074, P.R. China.
[4]    V. Chitraa, Dr. Antony Selvdoss Davamani, A Survey on Preprocessing Methods for Web Usage Data, (IJCSIS) International
       Journal of Computer Science and Information Security, Vol. 7, No. 3, 2010, pp. 78-83.
[5]    Demin Dong, Exploration on Web Usage Mining and its Application, IEEE, 2009.
[6]    Raju G.T. and Sathyanarayana P., Knowledge discovery from Web Usage Data: Complete Preprocessing Methodology, IJCSNS,
       2008.
[7]    Priyanka Patil, Ujwala Patil, Preprocessing of web server log file for web mining, World Journal of Science and Technology,
       2012, 2(3):14-18, ISSN: 2231-2587.
[8]    Marathe Dagadu Mitharam, Preprocessing in Web Usage mining, International Journal of Scientific & Engineering Research,
       Volume 3, Issue 2, February-2012, 1, ISSN 2229-5518.
[9]    C.P. Sumathi, R. Padmaja Valli, T. Santhanam, "An Overview of Preprocessing of Web Log Files for Web Usage Mining", Journal
       of Theoretical and Applied Information Technology, 31st December 2011, Vol. 34 No. 2, pp. 178-185.
[10]   Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan, Web usage mining: Discovery and applications of
       usage patterns from web data, SIGKDD Explorations, 1(2):12-23, 2000.
[11]   Alam, S., G. Dobbie, et al. (2008). Particle Swarm Optimization Based Clustering Of Web Usage Data. 2008 IEEE/WIC/ACM
       International Conference on Web Intelligence and Intelligent Agent Technology, 978-0-7695-3496-1/08, DOI
       10.1109/WIIAT.2008.292.
[12]   Pabarskaite, Z. (2002). Implementing Advanced Cleaning and End-User Interpretability Technologies in Web Log Mining. 24th Int.
       Conf. Information Technology Interfaces ITI 2002, June 24-27, 2002, Cavtat, Croatia.
[13]   Han, J. and M. Kamber (2006). Data Mining: Concepts and Techniques. A. Stephan. San Francisco, Morgan Kaufmann Publishers
       (an imprint of Elsevier).
[14]   Yuan, F., L.-J. Wang, et al. (2003). Study on Data Preprocessing Algorithm in Web Log Mining. Proceedings of the Second
       International Conference on Machine Learning and Cybernetics, Wan, 2-5 November 2003.
[15]   Suneetha, K. R. and D. R. Krishnamoorthi (2009). "Identifying User Behavior by Analyzing Web Server Access Log File." IJCSNS
       International Journal of Computer Science and Network Security, Vol. 9 No. 4, April 2009.
[16]   Wahab, M. H. A., M. N. H. Mohd, et al. (2008). Data Preprocessing on Web Server Logs for Generalized Association Rules Mining
       Algorithm. World Academy of Science, Engineering and Technology 48, 2008.
[17]   Ravindra Gupta and Prateek Gupta, Application Oriented Web Usage Mining with Customized Web Log Preprocessing & Frequent
       Pattern Tree, International Journal of Engineering Research and Applications, ISSN: 2248-9622, Vol. 2, Issue 1, Jan-Feb 2012,
       pp. 596-598.





More Related Content

Viewers also liked

A010520107
A010520107A010520107
A010520107
IOSR Journals
ย 
F012515059
F012515059F012515059
F012515059
IOSR Journals
ย 
Integrated E-Health Approach For Early Detection of Human Body Disorders in R...
Integrated E-Health Approach For Early Detection of Human Body Disorders in R...Integrated E-Health Approach For Early Detection of Human Body Disorders in R...
Integrated E-Health Approach For Early Detection of Human Body Disorders in R...
IOSR Journals
ย 
H1304025762
H1304025762H1304025762
H1304025762
IOSR Journals
ย 
G010334554
G010334554G010334554
G010334554
IOSR Journals
ย 
E010222124
E010222124E010222124
E010222124
IOSR Journals
ย 
E010132529
E010132529E010132529
E010132529
IOSR Journals
ย 
Investigation of Reducing Process of Uneven Shade Problem In Case Of Compact ...
Investigation of Reducing Process of Uneven Shade Problem In Case Of Compact ...Investigation of Reducing Process of Uneven Shade Problem In Case Of Compact ...
Investigation of Reducing Process of Uneven Shade Problem In Case Of Compact ...
IOSR Journals
ย 
J012256367
J012256367J012256367
J012256367
IOSR Journals
ย 
Vehicle Obstacles Avoidance Using Vehicle- To Infrastructure Communication
Vehicle Obstacles Avoidance Using Vehicle- To Infrastructure  CommunicationVehicle Obstacles Avoidance Using Vehicle- To Infrastructure  Communication
Vehicle Obstacles Avoidance Using Vehicle- To Infrastructure Communication
IOSR Journals
ย 
Model-based Approach of Controller Design for a FOPTD System and its Real Tim...
Model-based Approach of Controller Design for a FOPTD System and its Real Tim...Model-based Approach of Controller Design for a FOPTD System and its Real Tim...
Model-based Approach of Controller Design for a FOPTD System and its Real Tim...
IOSR Journals
ย 
E012653744
E012653744E012653744
E012653744
IOSR Journals
ย 
M010127983
M010127983M010127983
M010127983
IOSR Journals
ย 
High Performance Error Detection with Different Set Cyclic Codes for Memory A...
High Performance Error Detection with Different Set Cyclic Codes for Memory A...High Performance Error Detection with Different Set Cyclic Codes for Memory A...
High Performance Error Detection with Different Set Cyclic Codes for Memory A...
IOSR Journals
ย 
C0121216
C0121216C0121216
C0121216
IOSR Journals
ย 
Mouth dissolving tablets- A unique dosage form curtailed for special purpose:...
Mouth dissolving tablets- A unique dosage form curtailed for special purpose:...Mouth dissolving tablets- A unique dosage form curtailed for special purpose:...
Mouth dissolving tablets- A unique dosage form curtailed for special purpose:...
IOSR Journals
ย 
E0522327
E0522327E0522327
E0522327
IOSR Journals
ย 
Modified Distributive Arithmetic Based DWT-IDWT Processor Design and FPGA Imp...
Modified Distributive Arithmetic Based DWT-IDWT Processor Design and FPGA Imp...Modified Distributive Arithmetic Based DWT-IDWT Processor Design and FPGA Imp...
Modified Distributive Arithmetic Based DWT-IDWT Processor Design and FPGA Imp...
IOSR Journals
ย 
B1304021217
B1304021217B1304021217
B1304021217
IOSR Journals
ย 
Causes of Delay in Construction of Bridge Girders
Causes of Delay in Construction of Bridge GirdersCauses of Delay in Construction of Bridge Girders
Causes of Delay in Construction of Bridge Girders
IOSR Journals
ย 

Viewers also liked (20)

A010520107
A010520107A010520107
A010520107
ย 
F012515059
F012515059F012515059
F012515059
ย 
Integrated E-Health Approach For Early Detection of Human Body Disorders in R...
Integrated E-Health Approach For Early Detection of Human Body Disorders in R...Integrated E-Health Approach For Early Detection of Human Body Disorders in R...
Integrated E-Health Approach For Early Detection of Human Body Disorders in R...
ย 
H1304025762
H1304025762H1304025762
H1304025762
ย 
G010334554
G010334554G010334554
G010334554
ย 
E010222124
E010222124E010222124
E010222124
ย 
E010132529
E010132529E010132529
E010132529
ย 
Investigation of Reducing Process of Uneven Shade Problem In Case Of Compact ...
Investigation of Reducing Process of Uneven Shade Problem In Case Of Compact ...Investigation of Reducing Process of Uneven Shade Problem In Case Of Compact ...
Investigation of Reducing Process of Uneven Shade Problem In Case Of Compact ...
ย 
J012256367
J012256367J012256367
J012256367
ย 
Vehicle Obstacles Avoidance Using Vehicle- To Infrastructure Communication
Vehicle Obstacles Avoidance Using Vehicle- To Infrastructure  CommunicationVehicle Obstacles Avoidance Using Vehicle- To Infrastructure  Communication
Vehicle Obstacles Avoidance Using Vehicle- To Infrastructure Communication
ย 
Model-based Approach of Controller Design for a FOPTD System and its Real Tim...
Model-based Approach of Controller Design for a FOPTD System and its Real Tim...Model-based Approach of Controller Design for a FOPTD System and its Real Tim...
Model-based Approach of Controller Design for a FOPTD System and its Real Tim...
ย 
E012653744
E012653744E012653744
E012653744
ย 
M010127983
M010127983M010127983
M010127983
ย 
High Performance Error Detection with Different Set Cyclic Codes for Memory A...
High Performance Error Detection with Different Set Cyclic Codes for Memory A...High Performance Error Detection with Different Set Cyclic Codes for Memory A...
High Performance Error Detection with Different Set Cyclic Codes for Memory A...
ย 
C0121216
C0121216C0121216
C0121216
ย 
Mouth dissolving tablets- A unique dosage form curtailed for special purpose:...
Mouth dissolving tablets- A unique dosage form curtailed for special purpose:...Mouth dissolving tablets- A unique dosage form curtailed for special purpose:...
Mouth dissolving tablets- A unique dosage form curtailed for special purpose:...
ย 
E0522327
E0522327E0522327
E0522327
ย 
Modified Distributive Arithmetic Based DWT-IDWT Processor Design and FPGA Imp...
Modified Distributive Arithmetic Based DWT-IDWT Processor Design and FPGA Imp...Modified Distributive Arithmetic Based DWT-IDWT Processor Design and FPGA Imp...
Modified Distributive Arithmetic Based DWT-IDWT Processor Design and FPGA Imp...
ย 
B1304021217
B1304021217B1304021217
B1304021217
ย 
Causes of Delay in Construction of Bridge Girders
Causes of Delay in Construction of Bridge GirdersCauses of Delay in Construction of Bridge Girders
Causes of Delay in Construction of Bridge Girders
ย 

Similar to M0947679

A Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningA Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
IJMER
ย 
a novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studioa novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studio
INFOGAIN PUBLICATION
ย 
A Survey of Issues and Techniques of Web Usage Mining
A Survey of Issues and Techniques of Web Usage MiningA Survey of Issues and Techniques of Web Usage Mining
A Survey of Issues and Techniques of Web Usage Mining
IRJET Journal
ย 
Web Data mining-A Research area in Web usage mining
Web Data mining-A Research area in Web usage miningWeb Data mining-A Research area in Web usage mining
Web Data mining-A Research area in Web usage mining
IOSR Journals
ย 
COMPARISON ANALYSIS OF WEB USAGE MINING USING PATTERN RECOGNITION TECHNIQUES
COMPARISON ANALYSIS OF WEB USAGE MINING USING PATTERN RECOGNITION TECHNIQUESCOMPARISON ANALYSIS OF WEB USAGE MINING USING PATTERN RECOGNITION TECHNIQUES
COMPARISON ANALYSIS OF WEB USAGE MINING USING PATTERN RECOGNITION TECHNIQUES
IJDKP
ย 
50120140505007
5012014050500750120140505007
50120140505007
IAEME Publication
ย 
Ijarcet vol-2-issue-7-2341-2343
Ijarcet vol-2-issue-7-2341-2343Ijarcet vol-2-issue-7-2341-2343
Ijarcet vol-2-issue-7-2341-2343
Editor IJARCET
ย 
Ijarcet vol-2-issue-7-2341-2343
Ijarcet vol-2-issue-7-2341-2343Ijarcet vol-2-issue-7-2341-2343
Ijarcet vol-2-issue-7-2341-2343
Editor IJARCET
ย 
Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...
Editor IJCATR
ย 
IRJET- Enhancing Prediction of User Behavior on the Basic of Web Logs
IRJET- Enhancing Prediction of User Behavior on the Basic of Web LogsIRJET- Enhancing Prediction of User Behavior on the Basic of Web Logs
IRJET- Enhancing Prediction of User Behavior on the Basic of Web Logs
IRJET Journal
ย 
625 634
625 634625 634
625 634
Editor IJARCET
ย 
Web personalization using clustering of web usage data
Web personalization using clustering of web usage dataWeb personalization using clustering of web usage data
Web personalization using clustering of web usage data
ijfcstjournal
ย 
Applying web mining application for user behavior understanding
Applying web mining application for user behavior understandingApplying web mining application for user behavior understanding
Applying web mining application for user behavior understanding
Zakaria Zubi
ย 
A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining
Editor IJMTER
ย 
Pxc3893553
Pxc3893553Pxc3893553
Pxc3893553
Ouzza Brahim
ย 
A new approach for user identification in web usage mining preprocessing
A new approach for user identification in web usage mining preprocessingA new approach for user identification in web usage mining preprocessing
A new approach for user identification in web usage mining preprocessing
IOSR Journals
ย 
Classification of User & Pattern discovery in WUM: A Survey
Classification of User & Pattern discovery in WUM: A SurveyClassification of User & Pattern discovery in WUM: A Survey
Classification of User & Pattern discovery in WUM: A Survey
IRJET Journal
ย 
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
cscpconf
ย 
Comparison of decision and random tree algorithms on
Comparison of decision and random tree algorithms onComparison of decision and random tree algorithms on
Comparison of decision and random tree algorithms on
eSAT Publishing House
ย 
BIDIRECTIONAL GROWTH BASED MINING AND CYCLIC BEHAVIOUR ANALYSIS OF WEB SEQUEN...
BIDIRECTIONAL GROWTH BASED MINING AND CYCLIC BEHAVIOUR ANALYSIS OF WEB SEQUEN...BIDIRECTIONAL GROWTH BASED MINING AND CYCLIC BEHAVIOUR ANALYSIS OF WEB SEQUEN...
BIDIRECTIONAL GROWTH BASED MINING AND CYCLIC BEHAVIOUR ANALYSIS OF WEB SEQUEN...
ijdkp
ย 

Similar to M0947679 (20)

A Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningA Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
ย 
a novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studioa novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studio
ย 
A Survey of Issues and Techniques of Web Usage Mining
A Survey of Issues and Techniques of Web Usage MiningA Survey of Issues and Techniques of Web Usage Mining
A Survey of Issues and Techniques of Web Usage Mining
ย 
Web Data mining-A Research area in Web usage mining
Web Data mining-A Research area in Web usage miningWeb Data mining-A Research area in Web usage mining
Web Data mining-A Research area in Web usage mining
ย 
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 9, Issue 4 (Mar. - Apr. 2013), PP 76-79
www.iosrjournals.org

A Survey on Data Preprocessing in Web Usage Mining
Murti Punjani1, Mr. Vinitkumar Gupta2
1 (Department of Computer Engineering, Hasmukh Goswami College of Engineering, India)
2 (Department of Computer Engineering, Hasmukh Goswami College of Engineering, India)

Abstract: With the widespread use of the Internet and the constant growth in the number of users, the World Wide Web holds a huge store of data. Web server logs record users' accesses to web sites, and people are increasingly interested in analyzing these log files because they show the actual usage of a site. The logged data is not accurate as it stands, however, so web log files must be preprocessed before the data is suitable for knowledge discovery or mining tasks. Web Usage Mining, a part of Web mining and an application of data mining, is used for the automatic discovery of patterns in clickstreams and associated data collected or generated as a result of user interactions with one or more Web sites. This survey paper gives a literature review and an overview of the steps needed in the preprocessing phase.
Keywords: Data Fusion, Path Completion, Preprocessing, Session Identification, Web Usage, Web Server Log File.

I. INTRODUCTION
With the fast growth of Internet technology, preprocessing is necessary to get useful information about user access and is one of the most important research topics. With the explosive growth of information available on the WWW (World Wide Web), the discovery and analysis of useful information has become a necessity. The Web has become an important medium to communicate ideas, transact business and promote entertainment. The discovery and analysis of useful information from Web documents is referred to as Web mining [1]. The data is stored in the web server log in heterogeneous form.
So we need to preprocess these data to extract useful information. Web Mining is divided into three categories [11]:
1. Web Content Mining
2. Web Structure Mining
3. Web Usage Mining
Web Content Mining is the process of extracting useful information from the contents of web documents. Web Structure Mining is the process of discovering structure information from the web, where structure means hyperlinks and document structure. Web Usage Mining is the application of data mining to extract user access patterns from web server log files.

A. Web Usage Mining
Web Usage Mining, also known as Web Log Mining, is used to discover patterns from web server logs. The primary source of data for web usage mining consists of textual logs collected from several web servers all around the world. There are four phases in web usage mining [4]:
1. Data Collection - User logs are collected from the client side and the server side, including proxy servers, application servers, etc.
2. Data Preprocessing - Consists of phases such as data fusion and cleaning, user identification, session identification and path completion.
3. Pattern Discovery - Patterns are discovered from the preprocessed data using data mining techniques such as statistical analysis, association, clustering, pattern matching and so on.
4. Pattern Analysis - Once patterns are discovered, they are analyzed using a knowledge query mechanism such as SQL, or data cubes to perform OLAP operations.

Fig. 1 Phases of Web Usage Mining

1.1 Motivation
Thousands of users access multiple web sites all over the world. When different users access these websites, a huge amount of data is gathered in the web log files. This data is useful because it shows, for example, how frequently a user accesses the same page. It can further be used to derive user access patterns and user behavior. As the data cannot be used directly in WUM, preprocessing is necessary. Preprocessing the web log file is a tedious job and takes 80% of the total time of the web usage mining process as a whole [12]. Weighing the advantages and disadvantages, we conclude that preprocessing is a significant phase that also improves the quality of the data [13].

II. Literature Review
The aim of the literature review is to study and compare the available preprocessing techniques. Because of the huge number of extraneous and inaccurate entries in a web log file, the log file cannot be used directly in the WUM process, so preprocessing is a must.

Ravindra Gupta and Prateek Gupta [17] perform two main tasks: customized web log preprocessing and an improved FP tree algorithm. A raw web log file was taken as input. The authors modified the FP tree algorithm and proposed an improved FP tree algorithm, divided into two main processes: creation of the modified FP tree, and mining. In the modified FP tree structure, items were stored in descending order of their frequency. The customized preprocessing steps were Customization, in which log cleaning was performed on the basis of user requirements, followed by Data Cleaning, User Identification and Session Identification; the last step was a database of the cleaned log. After applying these steps, a compressed log file holding user access behaviour in numeric form was generated, which can then be sent for mining with the modified FP tree algorithm.

Wahab et al. [16] discussed the different types of log files in detail, along with all 19 attributes of the web log file and the different log file formats. They proposed an algorithm for reading server logs and an algorithm for transferring the log file to a database.
After reading web log files in any one of the three formats, various attributes were ignored because they were not considered significant for the analysis. Data filtering was performed to remove unwanted attributes of the web log file. The web server log file contained 18 attributes, of which 17 were removed as unwanted; only one attribute, "URL", was kept and stored in the database. Some important attributes were not considered, so reliability was not maintained. Given these pros and cons, the proposed algorithms need to be modified.

In Raju and Satyanarayana [6], the input was a raw web log file collected from the NASA Web site during July 1995. Customized preprocessing steps generated a compressed log file holding user behavior in numeric form, which was then passed to the mining process using the modified FP tree algorithm. The output is a complete relational database model for storing structured information about the Web site, its usage and its users.

Because a web log file contains important data about a website, Suneetha and Krishnamoorthi [15] also took the web log data of the NASA website as input. The authors discussed the sources of web logs, the web log structure and HTTP status codes in detail. They applied preprocessing techniques to the web server log file. The first step was data cleaning, in which irrelevant entries were removed, such as entries with an error or failure status and requests for image pages. The next step was user identification, which used three attributes from the log file: IP Address, Operating System and User Agent. The output can be used to increase the effectiveness of the website. The authors did not apply a session identification phase.

Author Name | Source of log file | Preprocessing Technique applied | Algorithm
Ravindra Gupta, Prateek Gupta | Raw web log file | Data Cleaning, User Identification, Session Identification, Formatting | Improved FP Tree Algorithm
Mohd Helmy Abd Wahab, Mohd Norzali Haji Mohd and Mohamad Farhan Mohamad Mohsin | Server Log File | File Reading, Data Cleaning, Data Filtering | Proposed
Raju and Satyanarayana | Server Log File | Data Merging, Data Cleaning, User Identification, Session Identification | NA
Suneetha, K. R. and D. R. Krishnamoorthi | Server Log File | Data Cleaning, User Identification | NA

TABLE 1 Summary of Literature Review

III. Data Preprocessing Tasks
Fig. 2 shows the phases of data preprocessing in Web Usage Mining. The goal of preprocessing is to transform the raw clickstream data into a set of user profiles [5]. Data preprocessing presents a number of unique challenges, which have led to a variety of algorithms and heuristic techniques for preprocessing tasks such as merging and cleaning, user and session identification, etc. [6]. The input to the preprocessing stage is the web server log file. A web server log contains 19 attributes: Date, Time, Client IP, AuthUser, ServerName, ServerIP, ServerPort, Request Method, URI-Stem, URI-Query, Protocol Status, Time Taken, Bytes Sent, Bytes Received, Protocol Version, Host, User Agent, Cookies, Referer.

Fig. 2 Phases of Data Preprocessing in Web Usage Mining [3]

A sample log file entry is given below [3]:
2007-12-06 05:22:16 ::1 GET /iisstart.htm - 80 - ::1 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.0;+SLCC1;+.NET+CLR+2.0.50727;+Media+Center+PC+5.0;+InfoPath.1;+.NET+CLR+1.1.4322;+.NET+CLR+3.5.21022;+.NET+CLR+3.0.04506) 200 0 0 296 336

3.1 Data Fusion and Cleaning
Log files from the various Web and application servers are merged in the Data Fusion phase.

Fig. 3 Web Log File in Text format [2]

The goal of the data cleaning phase is to remove extraneous and redundant log entries. Important fields such as date, time, client IP, user agent, URL requested, URL referred (referrer), time taken and browser used are kept for further processing. The extraneous or redundant data to be removed are [2]:
i) Requests for graphics and scripts. Only log information related to user access is wanted, but because HTTP is a stateless protocol, requests for graphics and scripts are recorded as well. The file extensions are therefore checked, and entries for files with extensions such as .css, .gif, .jpeg, .jpg etc. are eliminated.
ii) Robot requests.
iii) Failed requests. Some entries record errors; entries with a status code less than 200 or greater than 299 are eliminated as failures.
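
The cleaning rules of Section 3.1 can be illustrated with a short Python sketch. It is not taken from any of the surveyed papers. It assumes whitespace-separated entries laid out like the single sample entry in Section III, so the field positions (`URI_STEM`, `CLIENT_IP`, `USER_AGENT`, `STATUS`) are assumptions that must be adapted to the actual log format. The robot check (`ROBOT_MARKERS` and the `/robots.txt` test) is a hypothetical heuristic, since the text does not say how robot requests are recognized.

```python
import os

# Assumed positions of the fields in a whitespace-separated entry, read off
# the one sample entry in Section III; adjust for the log format in use.
URI_STEM, CLIENT_IP, USER_AGENT, STATUS = 4, 8, 9, 10

# Extensions treated as embedded resources rather than page views (rule i).
SKIP_EXTENSIONS = {".css", ".gif", ".jpeg", ".jpg"}

# Hypothetical robot heuristic: substrings often seen in crawler user agents.
ROBOT_MARKERS = ("bot", "crawler", "spider")


def parse_entry(line):
    """Split one log line into the fields used for cleaning, or return None."""
    parts = line.split()
    if line.startswith("#") or len(parts) <= STATUS:
        return None  # directive line, blank line or malformed entry
    try:
        status = int(parts[STATUS])
    except ValueError:
        return None  # status field is not numeric
    return {
        "date": parts[0],
        "time": parts[1],
        "uri": parts[URI_STEM],
        "client_ip": parts[CLIENT_IP],
        "user_agent": parts[USER_AGENT],
        "status": status,
    }


def is_relevant(entry):
    """Apply cleaning rules i) to iii): resources, robots, failed requests."""
    extension = os.path.splitext(entry["uri"])[1].lower()
    if extension in SKIP_EXTENSIONS:
        return False  # rule i: graphics and style files
    agent = entry["user_agent"].lower()
    if entry["uri"].lower() == "/robots.txt" or any(m in agent for m in ROBOT_MARKERS):
        return False  # rule ii: robot requests (heuristic)
    return 200 <= entry["status"] <= 299  # rule iii: keep successful requests only


def clean_log(lines):
    """Return the parsed entries that survive all cleaning rules."""
    entries = (parse_entry(line) for line in lines)
    return [e for e in entries if e is not None and is_relevant(e)]
```

Calling `clean_log` on the lines of a merged log file returns a list of dictionaries, one per retained page request, which can then be passed to user and session identification.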

3.2 User Identification
This phase identifies individual users by their client IP address. A new IP address indicates a new user. If the IP address is the same but the browser version or operating system is different, the entry represents a different user [7].

3.3 Session Identification
A session of a particular user is the period of time for which the user is connected to a particular website. It gives the total page accesses of that user. The following rules are used to identify user sessions in our experiment [3]:
1) If there is a new user, there is a new session;
2) Within one user session, if the referring page is null, there is a new session;
3) If the time between page requests exceeds a certain limit (30 or 25.5 minutes), it is assumed that the user is starting a new session.

3.4 Path Completion
Path completion follows session identification. Because clients use proxy servers and view cached versions of pages when using 'Back', the identified sessions have many missing pages. This phase is used to identify the lost pages.

IV. Conclusion And Future Work
Preprocessing of the web log file is a mandatory step for web usage mining. After the data cleaning step, the remaining preprocessing steps can be applied, from which user access patterns can be extracted and used further for pattern analysis. In this paper, various current preprocessing techniques are outlined, and I have explained the tasks needed for preprocessing the data in web usage mining. My future work is to increase the performance of the web server by obtaining meaningful and useful information quickly. By analyzing web server log files, we can easily understand user behavior in the web structure and so arrive at a better design of web components and web applications.

References
[1] O. Etzioni, The World Wide Web: Quagmire or gold mine, Communications of the ACM, 39(11):65-68, 1996.
[2] Vijayashri Losarwar, Dr. Madhuri Joshi, Data Preprocessing in Web Usage Mining, International Conference on Artificial Intelligence and Embedded Systems (ICAIES'2012), July 15-16, 2012, Singapore.
[3] Li Chaofeng, Research and Development of Data Preprocessing in Web Usage Mining, School of Management, South-Central University for Nationalities, Wuhan 430074, P.R. China.
[4] V. Chitraa, Dr. Antony Selvdoss Davamani, A Survey on Preprocessing Methods for Web Usage Data, (IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, No. 3, 2010, pp. 78-83.
[5] Demin Dong, Exploration on Web Usage Mining and its Application, IEEE, 2009.
[6] Raju G.T. and Sathyanarayana P., Knowledge Discovery from Web Usage Data: Complete Preprocessing Methodology, IJCSNS, 2008.
[7] Priyanka Patil, Ujwala Patil, Preprocessing of web server log file for web mining, World Journal of Science and Technology, 2012, 2(3):14-18, ISSN: 2231-2587.
[8] Marathe Dagadu Mitharam, Preprocessing in Web Usage Mining, International Journal of Scientific & Engineering Research, Volume 3, Issue 2, February 2012, ISSN 2229-5518.
[9] C.P. Sumathi, R. Padmaja Valli, T. Santhanam, An Overview of Preprocessing of Web Log Files for Web Usage Mining, Journal of Theoretical and Applied Information Technology, 31st December 2011, Vol. 34, No. 2, pp. 178-185.
[10] Jaideep Srivastava, Robert Cooley, Mukund Deshpande and Pang-Ning Tan, Web usage mining: Discovery and applications of usage patterns from web data, SIGKDD Explorations, 1(2):12-23, 2000.
[11] Alam, S., G. Dobbie, et al. (2008). Particle Swarm Optimization Based Clustering of Web Usage Data. 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 978-0-7695-3496-1/08, DOI 10.1109/WIIAT.2008.292.
[12] Pabarskaite, Z. (2002). Implementing Advanced Cleaning and End-User Interpretability Technologies in Web Log Mining. 24th Int. Conf. Information Technology Interfaces ITI 2002, June 24-27, 2002, Cavtat, Croatia.
[13] Han, J. and M. Kamber (2006). Data Mining: Concepts and Techniques. A. Stephan. San Francisco, Morgan Kaufmann Publishers (an imprint of Elsevier).
[14] Yuan, F., L.-J. Wang, et al. (2003). Study on Data Preprocessing Algorithm in Web Log Mining. Proceedings of the Second International Conference on Machine Learning and Cybernetics, Wan, 2-5 November 2003.
[15] Suneetha, K. R. and D. R. Krishnamoorthi (2009). Identifying User Behavior by Analyzing Web Server Access Log File. IJCSNS International Journal of Computer Science and Network Security, Vol. 9, No. 4, April 2009.
[16] Wahab, M. H. A., M. N. H. Mohd, et al. (2008). Data Preprocessing on Web Server Logs for Generalized Association Rules Mining Algorithm. World Academy of Science, Engineering and Technology 48, 2008.
[17] Ravindra Gupta and Prateek Gupta, Application Oriented Web Usage Mining with Customized Web Log Preprocessing & Frequent Pattern Tree, International Journal of Engineering Research and Applications, ISSN: 2248-9622, Vol. 2, Issue 1, Jan-Feb 2012, pp. 596-598.
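
As a supplementary illustration, the user identification rule of Section 3.2 and the three session rules of Section 3.3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in [3] or [7]. The entry keys (`client_ip`, `user_agent`, `timestamp`, `referrer`, `uri`) are hypothetical names, the user agent string is assumed to stand in for both browser version and operating system, and the 30-minute limit is one of the two values quoted in Section 3.3.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # one of the limits quoted in Section 3.3


def identify_sessions(entries, timeout=SESSION_TIMEOUT):
    """Group log entries into sessions per user.

    Each entry is a dict with the hypothetical keys "client_ip",
    "user_agent", "timestamp" (a datetime), "referrer" (None, "" or "-"
    when absent) and "uri". Returns {user_key: [session, ...]}, where each
    session is a list of entries in time order.
    """
    sessions = {}
    for entry in sorted(entries, key=lambda e: e["timestamp"]):
        # Section 3.2: same IP but a different user agent means a different user.
        user = (entry["client_ip"], entry["user_agent"])
        user_sessions = sessions.setdefault(user, [])
        no_referrer = entry["referrer"] in (None, "", "-")
        if not user_sessions:
            start_new = True  # rule 1: new user, new session
        else:
            last = user_sessions[-1][-1]
            timed_out = entry["timestamp"] - last["timestamp"] > timeout
            start_new = no_referrer or timed_out  # rules 2 and 3
        if start_new:
            user_sessions.append([entry])
        else:
            user_sessions[-1].append(entry)
    return sessions
```

Passing `timeout=timedelta(minutes=25.5)` selects the other limit mentioned in the text. Path completion is not covered by this sketch, since it needs the site's link structure in addition to the log.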