Major Seminar
                             On
        Knowledge Discovery from Web Logs




Guided By:                                       Presented By:
Saurabh Anand                                    Avtar kishore Gaur
Lecturer                                         (IT/09/53)
Department Of IT                                 VIII Sem, IT

                   Poornima College Of Engineering
                          Sitapura, Jaipur
Introduction
• Vast amounts of Web site traversal information are available in the
  form of Web logs.
• By analyzing these logs, it is possible to discover various
  kinds of knowledge, which can be applied to improve the
  performance of Web services.
• It is possible to learn the behavior of the Web users by
  analyzing these logs.
Introduction
• A particular kind of knowledge that can be applied immediately to
  the operation of the Web site is called actionable knowledge.
• Mining of such knowledge is known as Knowledge
  Discovery from Web Logs.
How big is the Web?
• More than 4 billion websites are on the Internet (according to
  alexa.com).
• At least 7.92 billion pages (Thursday, 23 February 2012, according
  to worldwidewebsize.com).
History
• Earlier approaches aimed only to mine Web-log knowledge for human
  consumption.
• Today, actionable knowledge mined from Web logs is being used to
  improve the performance of Web services.
Fields in Web Log File
• Reference website: www.hdwally.com (Web server: Apache)
         1. 66.249.71.6 - - [23/Feb/2012:06:23:46 -0600] "GET
           /robots.txt HTTP/1.1" 500 7370 "-" "Mozilla/5.0
           (compatible; Googlebot/2.1;
           +http://www.google.com/bot.html)"
         2. 180.76.5.92 - - [23/Feb/2012:06:11:04 -0600] "GET /
           HTTP/1.1" 500 7370 "-" "Mozilla/5.0 (compatible;
           Baiduspider/2.0;
           +http://www.baidu.com/search/spider.html)"
• IP address: 66.249.71.6 and 180.76.5.92
• Remote identity and user name: "-" and "-" in both entries (not
  supplied)
• Timestamp: [23/Feb/2012:06:23:46 -0600] and
  [23/Feb/2012:06:11:04 -0600] (time at which the Web server
  received the request)
Fields in Web Log File
• Access request: "GET /robots.txt HTTP/1.1" and "GET / HTTP/1.1"
• Result status code: 500 and 500 (Internal Server Error)
• Bytes transferred: 7370 and 7370
• Referrer URL: "-" and "-" (no referrer was sent)
• User agent: "Mozilla/5.0 (compatible; Googlebot/2.1;
  +http://www.google.com/bot.html)" and "Mozilla/5.0 (compatible;
  Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
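
The field breakdown above can be sketched in code. This is a minimal illustration, assuming the log uses the Apache "combined" format seen in the two sample entries; the names LOG_PATTERN and parse_line are hypothetical and not part of any library.

```python
import re
from datetime import datetime

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Split one log line into its fields; return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The month abbreviation is parsed with %b, which assumes an English locale.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the size field means no bytes were sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = ('66.249.71.6 - - [23/Feb/2012:06:23:46 -0600] "GET /robots.txt HTTP/1.1" '
          '500 7370 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
          '+http://www.google.com/bot.html)"')
entry = parse_line(sample)
print(entry["host"], entry["status"], entry["size"], entry["referrer"])
```

Named groups keep each field addressable by name, which makes the later cleaning and mining steps easier to read.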
Example Of a Web Log File
• fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400]
  "GET /contacts.html HTTP/1.0" 200 4595 "-"
  "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
• fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400]
  "GET /news/news.html HTTP/1.0" 200 16716 "-"
  "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
• 123.123.123.123 - - [26/Apr/2000:00:23:48 -0400]
  "GET /pics/wpaper.gif HTTP/1.0" 200 6248
  "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC )"
Mining Web Logs for Path Profiles
•   Data Cleaning on Web Log Data
•   Mining Web Logs for Path Profiles
•   Web Object Prediction
•   Learning to Prefetch Web Documents
Data Cleaning on Web Log Data
• Break the long sequence of page visits by each user into user
  sessions.
• Identify each user by an individual IP address.
• Data cleaning thus means separating the sequence of visited pages
  into visiting sessions.
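
A minimal sketch of this session-splitting step follows. The slides do not say how a session boundary is detected; the 30-minute inactivity timeout used here is an assumption, and split_sessions is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: a gap longer than 30 minutes starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)

def split_sessions(requests, timeout=SESSION_TIMEOUT):
    """requests: iterable of (ip, timestamp, url) tuples.
    Returns a list of (ip, [urls]) pairs, one pair per visiting session."""
    by_ip = defaultdict(list)
    for ip, ts, url in requests:
        by_ip[ip].append((ts, url))       # the IP address stands in for the user
    sessions = []
    for ip, visits in by_ip.items():
        visits.sort(key=lambda v: v[0])   # order each user's visits by time
        current, last_ts = [], None
        for ts, url in visits:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((ip, current))
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append((ip, current))
    return sessions

requests = [
    ("10.0.0.1", datetime(2012, 2, 23, 6, 0), "/"),
    ("10.0.0.1", datetime(2012, 2, 23, 6, 5), "/news.html"),
    ("10.0.0.2", datetime(2012, 2, 23, 6, 6), "/"),
    ("10.0.0.1", datetime(2012, 2, 23, 7, 30), "/contacts.html"),
]
sessions = split_sessions(requests)
print(sessions)
```

Identifying users by IP address alone is a simplification: visitors behind a shared proxy appear as a single user.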
Web Log Mining for Prefetching
• After data cleaning, we have separate visiting sessions.
• We can now develop path profiles from these sessions, because a user
  visiting a sequence of Web pages leaves a trail of the pages' URLs
  in the Web log.
• A path profile consists of the frequent subsequences taken from the
  frequently occurring paths.
• A path profile helps us predict the pages that are most likely to be
  requested next.
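
One simple way to build such a profile is to count contiguous subsequences of URLs across sessions and keep the frequent ones. The sketch below assumes that reading; path_profile, max_len, and min_support are hypothetical names, and the sample sessions are made up for illustration.

```python
from collections import Counter

def path_profile(sessions, max_len=3, min_support=2):
    """Count contiguous URL subsequences of length 2..max_len and keep
    those that occur at least min_support times across all sessions."""
    counts = Counter()
    for urls in sessions:
        for n in range(2, max_len + 1):
            for i in range(len(urls) - n + 1):
                counts[tuple(urls[i:i + n])] += 1
    return {path: c for path, c in counts.items() if c >= min_support}

sessions = [
    ["/", "/news.html", "/contacts.html"],
    ["/", "/news.html", "/pics.html"],
    ["/news.html", "/contacts.html"],
]
profile = path_profile(sessions)
for path, count in profile.items():
    print(" -> ".join(path), count)
```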
Web Object Prediction
• It is possible to train a path-based model that predicts future
  URLs from a sequence of current URL accesses.
• This can be done on a per-user basis or on a per-server basis.
• The former requires that user sessions be recognized and separated
  by a filtering system; the latter takes the simplistic view that
  all accesses on a server form a single long thread.
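
The difference between the two bases can be illustrated with a small next-URL model. This is only a sketch, assuming a fixed-order model that counts which URL follows each context; train and predict are hypothetical names, and the sessions are invented.

```python
from collections import Counter, defaultdict

def train(sequences, order=1):
    """For every context of `order` consecutive URLs, count the URL that follows."""
    model = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - order):
            context = tuple(seq[i:i + order])
            model[context][seq[i + order]] += 1
    return model

def predict(model, recent, order=1):
    """Return the most frequent follower of the last `order` URLs, or None."""
    followers = model.get(tuple(recent[-order:]))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

user_sessions = [
    ["/", "/news.html", "/contacts.html"],
    ["/", "/news.html"],
    ["/", "/pics.html"],
]

# Per-user basis: each cleaned session is trained on separately.
per_user_model = train(user_sessions)

# Per-server basis: all accesses are treated as one long thread.
server_thread = [url for session in user_sessions for url in session]
per_server_model = train([server_thread])

print(predict(per_user_model, ["/"]))
print(predict(per_user_model, ["/contacts.html"]))
print(predict(per_server_model, ["/contacts.html"]))
```

The per-server model learns a transition from /contacts.html to / that no single user made, because it spans a session boundary. That is the cost of the simplistic single-thread view.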
Learning to Prefetch Web Documents
• The original cache memory is partitioned into two parts: a
  cache-buffer and a prefetching-buffer.
• A prefetching agent (a script) keeps pre-loading the
  prefetching-buffer with the documents predicted to be accessed
  next.
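
The two-buffer arrangement might look like the sketch below. It makes several assumptions that are not in the slides: both buffers evict the oldest entry when full, the agent prefetches one predicted document per request, and a prefetched document moves to the cache-buffer once it is requested. TwoPartCache, fetch, and predict_next are hypothetical names.

```python
from collections import OrderedDict

class TwoPartCache:
    """Cache split into a cache-buffer and a prefetching-buffer."""

    def __init__(self, cache_size, prefetch_size, fetch, predict_next):
        self.cache = OrderedDict()       # cache-buffer: documents already requested
        self.prefetch = OrderedDict()    # prefetching-buffer: predicted documents
        self.cache_size = cache_size
        self.prefetch_size = prefetch_size
        self.fetch = fetch               # callable that loads a document by URL
        self.predict_next = predict_next # callable: URL -> predicted next URL or None
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _put(buffer, limit, url, doc):
        buffer[url] = doc
        buffer.move_to_end(url)
        while len(buffer) > limit:
            buffer.popitem(last=False)   # evict the oldest entry

    def get(self, url):
        if url in self.cache:
            self.hits += 1
            doc = self.cache[url]
        elif url in self.prefetch:
            self.hits += 1
            doc = self.prefetch.pop(url) # promoted to the cache-buffer below
        else:
            self.misses += 1
            doc = self.fetch(url)
        self._put(self.cache, self.cache_size, url, doc)
        # Prefetching agent: pre-load the document predicted to come next.
        nxt = self.predict_next(url)
        if nxt is not None and nxt not in self.cache and nxt not in self.prefetch:
            self._put(self.prefetch, self.prefetch_size, nxt, self.fetch(nxt))
        return doc

fetched = []

def fetch(url):
    fetched.append(url)
    return "<body of %s>" % url

next_page = {"/": "/news.html", "/news.html": "/contacts.html"}
cache = TwoPartCache(2, 1, fetch, next_page.get)
cache.get("/")            # miss; the agent prefetches /news.html
cache.get("/news.html")   # served from the prefetching-buffer
print(cache.hits, cache.misses)
```

In a real deployment the predict_next callable would be backed by a model trained on the Web logs rather than a fixed dictionary.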
Web Page Clustering for Intelligent User Interfaces
• Web logs can be used to build server-side customization and
  transformation that make a website more convenient for users to
  visit and to find what they are looking for.
• Examples include path prediction algorithms, such as WebWatcher,
  that guess where the user wants to go next in a browsing session,
  and the PageGather algorithm.
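
As a rough illustration of clustering pages from log data, the sketch below groups pages that are often visited in the same session. It is not the PageGather algorithm itself, only a simplified co-occurrence approach under assumed names (cluster_pages, min_cooccurrence) and invented sessions.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cluster_pages(sessions, min_cooccurrence=2):
    """Link two pages when they share at least min_cooccurrence sessions,
    then return the connected components of that graph as clusters."""
    pair_counts = Counter()
    for urls in sessions:
        for a, b in combinations(sorted(set(urls)), 2):
            pair_counts[(a, b)] += 1

    graph = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count >= min_cooccurrence:
            graph[a].add(b)
            graph[b].add(a)

    clusters, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                      # depth-first walk of one component
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(graph[page] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

sessions = [
    ["/cars/a.html", "/cars/b.html"],
    ["/cars/b.html", "/cars/a.html", "/pets/x.html"],
    ["/pets/x.html", "/pets/y.html"],
    ["/pets/y.html", "/pets/x.html"],
]
clusters = cluster_pages(sessions)
print(clusters)
```

Each cluster could then back a generated index page that links related documents together.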
Applications
•    Search Engines
•    Similarity Measures
•    Ontology
•    Information aggregation
•    Recognition technology
•    Summarization
•    E-commerce
•    Content management
Advantages
• It is easy to implement.
• Companies can establish better customer relationships by giving
  customers exactly what they need.
• Personalized search engines can be created that interpret a
  person's search queries by analyzing and profiling the user's
  search behavior.
• Caching and prefetching of Web objects can be improved.
• The mined knowledge can be used to build better, adaptive user
  interfaces.
• Web query log knowledge can be applied to improve Web search in a
  search engine application.
Reference
• Weblogs from www.hdwally.com and
  www.hdwallpaper4u.com .
• www.jafsoft.com/searchengines/log_sample.html
• Research paper "Knowledge Discovery from Web Logs" by
  S Chandra and Dr B Kalpana.
• Research paper "Mining Web Logs for Actionable Knowledge" by
  Qiang Yang, Charles X. Ling and Jianfeng Gao.
• http://www.galeas.de/webmining.html
Queries?
Thanks
