Web Servers



  1. 1. Web Servers & Log Analysis <ul><li>What can we learn from looking at Web server logs? </li></ul><ul><ul><li>What server resources were requested </li></ul></ul><ul><ul><li>When the files were requested </li></ul></ul><ul><ul><li>Who requested them (where IP address = who) </li></ul></ul><ul><ul><li>How they requested them (browser types & OS) </li></ul></ul><ul><li>Some assumptions </li></ul><ul><ul><li>A request for a resource means the user did receive it </li></ul></ul><ul><ul><li>A resource is viewable & understandable to each user </li></ul></ul><ul><ul><li>Users are identified within a loose set of parameters </li></ul></ul><ul><li>How does knowing request patterns affect or help IA? </li></ul>
  2. 2. Types of Web Server Logs <ul><li>Proxy-based </li></ul><ul><ul><li>Web access servers to control access or cache popular files </li></ul></ul><ul><li>Client-based </li></ul><ul><ul><li>Local cache files </li></ul></ul><ul><ul><li>Browser History file(s) </li></ul></ul><ul><li>Network-based </li></ul><ul><ul><li>Routers, firewalls & access points </li></ul></ul><ul><li>Server-based </li></ul><ul><ul><li>Web servers to serve content </li></ul></ul>
  3. 3. Using Web Servers <ul><li>The Apache Software Foundation </li></ul><ul><li>Microsoft Internet Information Server (Services) </li></ul><ul><li>These applications “Serve” </li></ul><ul><ul><li>Text - HTML, XML, plain text </li></ul></ul><ul><ul><li>Graphics - jpeg, gif, png </li></ul></ul><ul><ul><li>CGI, servlets, XMLHttpRequest & other logic </li></ul></ul><ul><ul><li>other MIME types such as movies & sound </li></ul></ul><ul><li>Most servers can log these files </li></ul><ul><ul><li>Daily, weekly or monthly </li></ul></ul><ul><ul><li>Cannot always log CGI or related logic (specifically or “out of the box”) </li></ul></ul>
  4. 4. How Servers Work <ul><li>Hypertext Transfer Protocol - http </li></ul><ul><ul><li>A file is requested from the browser </li></ul></ul><ul><ul><li>The request is transferred via the network </li></ul></ul><ul><ul><li>The server receives the request (& logs it) </li></ul></ul><ul><ul><li>The server provides the file (& logs it) </li></ul></ul><ul><ul><li>The browser displays the file </li></ul></ul><ul><li>Almost all Web servers work this way </li></ul>
  5. 5. Types of Server Logs <ul><li>Access Log </li></ul><ul><ul><li>Logs information such as page served or time served </li></ul></ul><ul><li>Referer Log </li></ul><ul><ul><li>Logs name of the server and page that links to current served page </li></ul></ul><ul><ul><li>Not always </li></ul></ul><ul><ul><li>Can be from any Web site </li></ul></ul><ul><li>Agent Log </li></ul><ul><ul><li>Logs browser type and operating system </li></ul></ul><ul><ul><ul><li>Mozilla </li></ul></ul></ul><ul><ul><ul><li>Windows </li></ul></ul></ul>
  6. 6. Log File Format <ul><li>Extended Log File Format - W3C Working Draft WD-logfile-960323 </li></ul><ul><li>key advantage: </li></ul><ul><ul><li>computer storage cost decreases while paper cost rises </li></ul></ul><ul><li>every server generates slightly different logs </li></ul>
  7. 7. Extended Log File Formats <ul><li>WWW Consortium Standards </li></ul><ul><li>Will automatically record much of what is programmatically done now. </li></ul><ul><ul><li>faster </li></ul></ul><ul><ul><li>more accurate </li></ul></ul><ul><ul><li>standard baselines for comparison </li></ul></ul><ul><ul><li>graphics standards </li></ul></ul>
  8. 8. What is a log file? <ul><li>A delimited, text file with information about what the server is doing </li></ul><ul><ul><li>IP Address or Domain name </li></ul></ul><ul><ul><li>Date/Time </li></ul></ul><ul><ul><li>Method used & Page Requested </li></ul></ul><ul><ul><li>Protocol, Response Code & Bytes Returned </li></ul></ul><ul><ul><li>Referring Page (sometimes) </li></ul></ul><ul><ul><li>UserAgent & Operating System </li></ul></ul><ul><li> - - [01/Sep/2004:08:17:21 -0500] &quot;GET /images/sanchez.jpg HTTP/1.1&quot; 200 - &quot;; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)&quot; </li></ul>
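The fields listed on this slide can be pulled out of a log line with a regular expression. The sketch below is a minimal illustration, assuming the common "combined" layout (host, identity, user, timestamp, request, status, bytes, referrer, user agent). The sample line reuses the timestamp, request and user agent from the slide, but the host `192.0.2.10` and the referrer `http://example.com/index.html` are hypothetical placeholders, since those values are missing from the slide's example. The names `LOG_PATTERN` and `parse_line` are also illustrative, not part of any server or tool.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "request" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # what an analysis tool would count as a corrupt line
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the bytes field means no byte count was recorded.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


# Hypothetical host and referrer; the rest mirrors the slide's sample line.
sample = (
    '192.0.2.10 - - [01/Sep/2004:08:17:21 -0500] '
    '"GET /images/sanchez.jpg HTTP/1.1" 200 - '
    '"http://example.com/index.html" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)"'
)
entry = parse_line(sample)
print(entry["host"], entry["path"], entry["status"], entry["agent"])
```

Real servers differ slightly in what they write, so a pattern like this usually needs adjusting to the configured log format.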
  9. 9. In search of Reliable Data <ul><li>Not as Foolproof as Paper </li></ul><ul><ul><li>You can see when someone is reading a page </li></ul></ul><ul><ul><li>You can know the page is turned </li></ul></ul><ul><ul><li>You can know the book is checked out </li></ul></ul><ul><li>No State Information </li></ul><ul><ul><li>The same person or another person could be reading page 1 then page 2 </li></ul></ul><ul><ul><li>You really can’t tell how many users you have </li></ul></ul><ul><li>Server Hits not perfectly Representative </li></ul><ul><ul><li>Counters inaccurate </li></ul></ul><ul><ul><li>Caching & Robots can push counts up (+) or down (-) </li></ul></ul><ul><li>Floods/Bandwidth can Stop “intended” usage </li></ul>
  10. 10. What is a “hit”? <ul><li>Technically, a hit is simply any file requested from the server </li></ul><ul><ul><li>That is logged </li></ul></ul><ul><ul><li>That represents (usually) part of a request to “see” a whole Web page </li></ul></ul><ul><li>Hits combine to represent a “page view” </li></ul><ul><li>Page views combine to represent an “episode” or “session” </li></ul><ul><ul><li>An episode is one activity or question a user performs or requests on a Web site </li></ul></ul><ul><ul><li>A session is a series of episodes that embodies all the interactions a user undertakes using a Web site (bounded by time, with a cutoff of around 30 min. based on averages) </li></ul></ul>
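The hit, page view and session hierarchy above can be sketched in a few lines. This is an illustration under stated assumptions, not a standard algorithm: a "user" is approximated by host address, a page view is any hit whose path ends in `/`, `.html` or `.htm`, and a session ends after 30 minutes without a hit from that host. The names `is_page_view` and `sessionize`, and the sample hits, are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)   # assumed inactivity cutoff
PAGE_SUFFIXES = ("/", ".html", ".htm")    # assumed definition of a "page"


def is_page_view(path):
    """Treat only HTML-like paths as page views; images etc. are plain hits."""
    return path.lower().endswith(PAGE_SUFFIXES)


def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (host, time, path) hits into per-host sessions.

    Returns {host: [session, ...]}, where each session is a list of
    (time, path) tuples in time order.
    """
    sessions = {}
    for host, when, path in sorted(hits, key=lambda h: (h[0], h[1])):
        host_sessions = sessions.setdefault(host, [])
        if host_sessions and when - host_sessions[-1][-1][0] <= timeout:
            host_sessions[-1].append((when, path))   # same session continues
        else:
            host_sessions.append([(when, path)])     # gap too long: new session
    return sessions


# Hypothetical hits from two hosts.
hits = [
    ("192.0.2.1", datetime(2004, 9, 1, 8, 0), "/index.html"),
    ("192.0.2.1", datetime(2004, 9, 1, 8, 0), "/images/logo.gif"),
    ("192.0.2.1", datetime(2004, 9, 1, 8, 10), "/about.html"),
    ("192.0.2.1", datetime(2004, 9, 1, 9, 0), "/index.html"),
    ("192.0.2.2", datetime(2004, 9, 1, 8, 5), "/"),
]
sessions = sessionize(hits)
page_views = sum(is_page_view(path) for _, _, path in hits)
print(len(hits), "hits,", page_views, "page views")
for host, host_sessions in sessions.items():
    print(host, "sessions:", len(host_sessions))
```

Because several people can share one address (proxies, labs) and one person can use several, counts produced this way are estimates, which is the reliability caveat raised on the previous slide.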
  11. 11. Making Servers More Reliable <ul><li>Keep system setups simple </li></ul><ul><ul><li>unique file and directory names </li></ul></ul><ul><ul><li>clear, consistent structure </li></ul></ul><ul><li>Configure CMS for logging/serving </li></ul><ul><li>Use an FTP server for file transfer </li></ul><ul><ul><li>frees up logs and server! </li></ul></ul><ul><li>Judicious use of links </li></ul><ul><li>Wise MIME types </li></ul><ul><ul><li>some hard/impossible to log </li></ul></ul>
  12. 12. Clever Web Server Setup <ul><li>Redirect CGI to find referrer </li></ul><ul><li>Use a database </li></ul><ul><ul><li>store web content </li></ul></ul><ul><ul><li>record usage data </li></ul></ul><ul><li>create state information with programming </li></ul><ul><ul><li>NSAPI </li></ul></ul><ul><ul><li>ActiveX </li></ul></ul><ul><li>Have contact information </li></ul><ul><li>Have purpose statements </li></ul>
  13. 13. Managing Log Files <ul><li>Backup </li></ul><ul><li>Store Results or Logs? </li></ul><ul><li>Beginning New Logs </li></ul><ul><li>Posting Results </li></ul>
  14. 14. Log Analysis Tools <ul><li>Analog </li></ul><ul><li>Webalizer </li></ul><ul><li>Sawmill </li></ul><ul><li>WebTrends </li></ul><ul><li>AWStats </li></ul><ul><li>WWWStat </li></ul><ul><li>GetStats </li></ul><ul><li>Perl Scripts </li></ul><ul><li>Data Mining & Business Intelligence tools </li></ul>
  15. 15. WebTrends <ul><li>A whole industry of analytics </li></ul><ul><li>Most popular commercial application </li></ul>
  16. 16. Log Analysis Cumulative Sample <ul><ul><ul><ul><ul><li>Program started at Tue-03-Dec-1996 01:20 local time. </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Analysed requests from Thu-28-Jul-1994 20:31 to Mon-02-Dec-1996 23:59 (858.1 days). </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Total successful requests: 4 282 156 (88 952) </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Average successful requests per day: 4 990 (12 707) </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Total successful requests for pages: 1 058 526 (17 492) </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Total failed requests: 88 633 (1 649) </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Total redirected requests: 14 457 (197) </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Number of distinct files requested: 9 638 (2 268) </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Number of distinct hosts served: 311 878 (11 284) </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Number of new hosts served in last 7 days: 7 020 </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Corrupt logfile lines: 262 </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Unwanted logfile entries: 976 </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Total data transferred: 23 953 Mbytes (510 619 kbytes) </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Average data transferred per day: 28 582 kbytes (72 946 kbytes) </li></ul></ul></ul></ul></ul>
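A summary like the one above is mostly counting by status code. The sketch below shows one way such totals could be derived from already-parsed entries. The grouping rule is an assumption made for illustration (2xx and 304 count as successful, other 3xx as redirected, 4xx and 5xx as failed), and `classify`, `summarize` and the sample entries are hypothetical names and data, not output from any real tool.

```python
from collections import Counter


def classify(status):
    """Map an HTTP status code to a summary bucket (assumed grouping)."""
    if 200 <= status < 300 or status == 304:
        return "successful"
    if 300 <= status < 400:
        return "redirected"
    if 400 <= status < 600:
        return "failed"
    return "other"


def summarize(entries, days):
    """Summarize (path, status, bytes) entries collected over `days` days."""
    counts = Counter(classify(status) for _, status, _ in entries)
    total_bytes = sum(size for _, _, size in entries)
    return {
        "successful": counts["successful"],
        "failed": counts["failed"],
        "redirected": counts["redirected"],
        "successful_per_day": counts["successful"] / days,
        "distinct_files": len({path for path, _, _ in entries}),
        "bytes": total_bytes,
        "bytes_per_day": total_bytes / days,
    }


# Hypothetical parsed entries: (path, status, bytes).
entries = [
    ("/index.html", 200, 5120),
    ("/index.html", 304, 0),
    ("/old-page.html", 301, 0),
    ("/missing.html", 404, 512),
    ("/images/logo.gif", 200, 2048),
]
report = summarize(entries, days=2)
for label, value in report.items():
    print(f"{label}: {value}")
```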
  17. 17. How about the iSchool Web site? <ul><li>Our server files are collected constantly </li></ul><ul><ul><li>Daily </li></ul></ul><ul><ul><li>Weekly </li></ul></ul><ul><ul><li>Monthly </li></ul></ul><ul><ul><li>Even yearly </li></ul></ul><ul><li>What does a quick look tell us? </li></ul><ul><ul><li>How well is the server working? </li></ul></ul><ul><ul><ul><li>Uptime, server errors, logging errors </li></ul></ul></ul><ul><ul><li>How popular is our site? </li></ul></ul><ul><ul><ul><li>Number of hits, popular files </li></ul></ul></ul><ul><ul><li>Who is visiting the site? </li></ul></ul><ul><ul><ul><li>Countries, types of companies </li></ul></ul></ul><ul><ul><li>What searches led people here? </li></ul></ul>
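The last question on this slide, what searches led people here, can often be answered from the referrer field when it holds a search engine's results URL. The sketch below is a rough illustration: the parameter names in `QUERY_PARAMS` are assumptions about how search engines encode the query, and the function name `search_terms` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

# Assumed names of the URL parameter that carries the user's query.
QUERY_PARAMS = ("q", "p", "query")


def search_terms(referer):
    """Return the search phrase embedded in a referrer URL, or None."""
    if not referer or referer == "-":
        return None  # no referrer was logged
    params = parse_qs(urlparse(referer).query)
    for name in QUERY_PARAMS:
        if name in params:
            return params[name][0]
    return None


# Hypothetical referrer values as they might appear in a log.
referers = [
    "http://search.example.com/search?q=web+server+logs&hl=en",
    "http://example.org/links.html",
    "-",
]
found = [search_terms(r) for r in referers]
print(found)
```

Referrers that are not search result pages, or that omit the query, simply yield `None`, which matches the "Referring Page (sometimes)" caveat from the log file slide.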
  18. 18. UT & its Web server logs <ul><li>UT Web log reports </li></ul><ul><li>(Figures in parentheses refer to the 7 days to 28-Mar-2004 03:00). </li></ul><ul><li>Successful requests: 39,826,634 (39,596,364) </li></ul><ul><li>Average successful requests per day: 5,690,083 (5,656,623) </li></ul><ul><li>Successful requests for pages: 4,189,081 (4,154,717) </li></ul><ul><li>Average successful requests for pages per day: 598,499 (593,530) </li></ul><ul><li>Failed requests: 442,129 (439,467) </li></ul><ul><li>Redirected requests: 1,101,849 (1,093,606) </li></ul><ul><li>Distinct files requested: 479,022 (473,341) </li></ul><ul><li>Corrupt logfile lines: 427 </li></ul><ul><li>Data transferred: 278.504 Gbytes (276.650 Gbytes) </li></ul><ul><li>Average data transferred per day: 39.790 Gbytes (39.521 Gbytes) </li></ul>
  19. 19. Neat Analysis Tricks <ul><li>use a search engine to find references </li></ul><ul><ul><li>“” </li></ul></ul><ul><ul><ul><li>key to using unique names </li></ul></ul></ul><ul><ul><li>use many engines </li></ul></ul><ul><ul><ul><li>update times different </li></ul></ul></ul><ul><ul><ul><li>blocking mechanisms are different </li></ul></ul></ul><ul><li>use Web searches (or Yahoo, Bloglines…) </li></ul><ul><ul><li>look for references </li></ul></ul><ul><ul><li>look for IP addresses of users </li></ul></ul>
  20. 20. Neat Tricks, cont. <ul><li>Walking up the Links </li></ul><ul><ul><li>follow URLs upward </li></ul></ul><ul><li>Reverse Sort </li></ul><ul><ul><li>look for relations </li></ul></ul><ul><li>Use your own robot to index </li></ul><ul><ul><li>Test </li></ul></ul>
  21. 21. Web Surveys, an alternative <ul><li>Surveys actually ask users what they did, what they sought & if it helped </li></ul><ul><li>GVU, Nielsen and GNN </li></ul><ul><ul><li>Qualitative questions </li></ul></ul><ul><ul><ul><li>phone </li></ul></ul></ul><ul><ul><ul><li>web forms </li></ul></ul></ul><ul><ul><li>Self-selected sample problems </li></ul></ul><ul><ul><ul><li>random selection </li></ul></ul></ul><ul><ul><ul><li>oversample </li></ul></ul></ul>
  22. 22. Analysis of a Very Large Search Log <ul><li>What kinds of patterns can we find? </li></ul><ul><li>Request = query and results page </li></ul><ul><li>280 GB – Six Weeks of Web Queries </li></ul><ul><ul><li>Almost 1 Billion Search Requests, 850M valid, 575M queries </li></ul></ul><ul><ul><li>285 Million User Sessions (cookie issues) </li></ul></ul><ul><ul><li>Large volume, less trendy </li></ul></ul><ul><ul><li>Why are unique queries important? </li></ul></ul><ul><li>Web Users: </li></ul><ul><ul><li>Use Short Queries in short sessions - 63.7% one request </li></ul></ul><ul><ul><li>Mostly Look at the First Ten Results only </li></ul></ul><ul><ul><li>Seldom Modify Queries </li></ul></ul><ul><li>Traditional IR Isn’t Accurately Describing Web Search </li></ul><ul><li>Phrase Searching Could Be Augmented </li></ul><ul><ul><ul><ul><ul><li>Silverstein, Henzinger, Marais, Moricz (1998) </li></ul></ul></ul></ul></ul>
  23. 23. Analysis of a Very Large Search Log <ul><li>2.35 Average Terms Per Query </li></ul><ul><ul><li>0 = 20.6% (?) </li></ul></ul><ul><ul><li>1 = 25.8% </li></ul></ul><ul><ul><li>2 = 26.0% (0 to 2 terms combined = 72.4%) </li></ul></ul><ul><li>Operators Per Query </li></ul><ul><ul><li>0 = 79.6% </li></ul></ul><ul><li>Terms Predictable </li></ul><ul><li>First Set of Results Viewed Only = 85% </li></ul><ul><li>Some (Single Term Phrase) Query Correlation </li></ul><ul><ul><li>Augmentation </li></ul></ul><ul><ul><li>Taxonomy Input </li></ul></ul><ul><ul><li>Robots vs. Humans </li></ul></ul>
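Figures such as average terms per query and the share of 0-, 1- and 2-term queries come from a simple tally. The sketch below shows the arithmetic on a handful of made-up queries. It assumes terms are separated by whitespace, which is a simplification of how a real search engine tokenizes, and `term_stats` is a hypothetical helper name.

```python
from collections import Counter


def term_stats(queries):
    """Return (mean terms per query, {term count: share of queries})."""
    lengths = [len(query.split()) for query in queries]
    mean = sum(lengths) / len(lengths)
    tally = Counter(lengths)
    shares = {n: tally[n] / len(lengths) for n in sorted(tally)}
    return mean, shares


# Made-up queries; the empty string stands for an empty (0-term) request.
queries = ["", "weather", "austin texas", "web server log analysis"]
mean_terms, shares = term_stats(queries)
print(f"average terms per query: {mean_terms:.2f}")
for n, share in shares.items():
    print(f"{n} terms: {share:.1%}")
```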
  24. 24. Real Life Information Retrieval <ul><li>51K Queries from Excite (1997) </li></ul><ul><li>Search Terms = 2.21 </li></ul><ul><li>Number of Terms </li></ul><ul><ul><li>1 = 31% 2 = 31% 3 = 18% (80% Combined) </li></ul></ul><ul><li>Logic & Modifiers (by User) </li></ul><ul><ul><li>Infrequent </li></ul></ul><ul><ul><li>AND, “+”, “-” </li></ul></ul><ul><li>Logic & Modifiers (by Query) </li></ul><ul><ul><li>6% of Users </li></ul></ul><ul><ul><li>Less Than 10% of Queries </li></ul></ul><ul><ul><li>Lots of Mistakes </li></ul></ul><ul><li>Uniqueness of Queries </li></ul><ul><ul><li>35% successive </li></ul></ul><ul><ul><li>22% modified </li></ul></ul><ul><ul><li>43% identical </li></ul></ul>
  25. 25. Real Life Information Retrieval <ul><li>Queries per user 2.8 </li></ul><ul><li>Sessions </li></ul><ul><ul><li>Flawed Analysis (User ID) </li></ul></ul><ul><ul><li>Some Revisits to Query (Result Page Revisits) </li></ul></ul><ul><li>Page Views </li></ul><ul><ul><li>Accurate, but not by User </li></ul></ul><ul><li>Use of Relevance Feedback (more like this) </li></ul><ul><ul><li>Not Used Much (~11%) </li></ul></ul><ul><li>Terms Used Typical & frequent </li></ul><ul><li>Mistakes </li></ul><ul><ul><li>Typos </li></ul></ul><ul><ul><li>Misspellings </li></ul></ul><ul><ul><li>Bad (Advanced) Query Formulation </li></ul></ul><ul><li>Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998) </li></ul>
  26. 26. Downie & Web Usage <ul><li>Server logs are like library usage </li></ul><ul><li>User-based analyses </li></ul><ul><ul><li>who </li></ul></ul><ul><ul><li>where </li></ul></ul><ul><ul><li>what </li></ul></ul><ul><li>File-based analyses </li></ul><ul><ul><li>amount </li></ul></ul><ul><li>Request analyses </li></ul><ul><ul><li>conform (loosely) to Zipf’s Law </li></ul></ul><ul><li>Byte-based analyses </li></ul>
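The claim that requests conform loosely to Zipf's Law can be checked with a rank and frequency table: if frequency is roughly proportional to 1/rank, then a straight line fitted to log(frequency) against log(rank) has a slope near -1. The sketch below is a minimal illustration using an ordinary least-squares fit. The request counts are constructed to follow the law exactly, so they are not real data, and `rank_frequency` and `loglog_slope` are hypothetical names.

```python
import math
from collections import Counter


def rank_frequency(paths):
    """Return [(rank, path, count), ...] ordered from most to least requested."""
    counts = Counter(paths)
    return [
        (rank, path, n)
        for rank, (path, n) in enumerate(counts.most_common(), start=1)
    ]


def loglog_slope(table):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(n) for _, _, n in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Constructed example: counts of 120, 60, 40, 30 are exactly 120 / rank.
requests = (
    ["/"] * 120 + ["/courses.html"] * 60 + ["/people.html"] * 40 + ["/labs.html"] * 30
)
table = rank_frequency(requests)
slope = loglog_slope(table)
for rank, path, n in table:
    print(rank, path, n, "rank x count =", rank * n)
print(f"log-log slope: {slope:.2f}")
```

On real logs the slope will only be approximately -1, and a few very popular files (home page, shared images) typically dominate the top ranks.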
  27. 27. Web use analysis & IA? <ul><li>Another tool to begin to understand how people use your Web provided resources </li></ul><ul><li>With a small amount of setup, you can learn a large amount </li></ul><ul><li>Server use can be integrated into site usage for users </li></ul><ul><ul><li>Lists of popular pages & more interlinking pages </li></ul></ul><ul><ul><li>Adding search terms that found the page to related pages </li></ul></ul><ul><ul><li>Adjust metadata to reflect searches that find pages </li></ul></ul><ul><ul><li>Add pages to the site index or site map </li></ul></ul><ul><li>First-cut usability information </li></ul><ul><ul><li>Pages 1 & 2 were accessed, but not 3 - Why? </li></ul></ul><ul><ul><li>Navigation usage, link ordering and design understanding </li></ul></ul><ul><ul><li>Knowing what browsers & OS helps tailor design and media types </li></ul></ul>
  28. 28. BREAK! <ul><li>No Presentation this week </li></ul><ul><ul><li>Next week: Asset management, content management & version control </li></ul></ul><ul><li>Break up media development work </li></ul><ul><li>Examine current pages, style sheets & designs </li></ul><ul><li>Set up next set of pair & individual deliverables </li></ul>
  29. 29. Media Development work <ul><li>We need to find & create graphics for the new site </li></ul><ul><li>Content about: </li></ul><ul><ul><li>Austin </li></ul></ul><ul><ul><li>UT </li></ul></ul><ul><ul><li>iSchool </li></ul></ul><ul><ul><li>People at the iSchool </li></ul></ul><ul><ul><li>Students at work in the iSchool (classes, labs) </li></ul></ul><ul><li>Screen grab from videos </li></ul><ul><li>Search the Web for copyright free images </li></ul><ul><li>Take our own pictures </li></ul>
  30. 30. Current Pages & Designs <ul><li>First version of main iSchool page template and CSS complete </li></ul><ul><li>Secondary page template & CSS complete </li></ul><ul><ul><li>Some secondary pages already built </li></ul></ul><ul><li>Index page template set </li></ul><ul><li>Site map page initially set </li></ul><ul><ul><li>Big Map </li></ul></ul><ul><ul><li>Main pages map </li></ul></ul>
  31. 31. Next steps <ul><li>In class </li></ul><ul><ul><li>Test & evaluate current CSS and templates </li></ul></ul><ul><ul><li>Improvise secondary home page based on initial design </li></ul></ul><ul><ul><li>Examine new Alumni section </li></ul></ul><ul><ul><li>Examine new Course Listing page </li></ul></ul><ul><li>For homework </li></ul><ul><ul><li>Complete secondary page migration to new design </li></ul></ul><ul><ul><li>Rotate design work </li></ul></ul><ul><ul><ul><li>Alumni </li></ul></ul></ul><ul><ul><ul><li>Site Map </li></ul></ul></ul><ul><ul><ul><li>Home page design ideas </li></ul></ul></ul><ul><ul><li>Picture/Media creation work </li></ul></ul>