Structural profiling of web sites in the wild

Research on 708 websites to analyze the structure of web pages.

  1. STRUCTURAL PROFILING OF WEB SITES IN THE WILD. LABORATOIRE D’INFORMATIQUE FORMELLE, UNIVERSITÉ DU QUÉBEC À CHICOUTIMI. XAVIER CHAMBERLAND-THIBEAULT AND SYLVAIN HALLÉ. ICWE, JUNE 9, 2020.
  2. THE REASONING BEHIND THIS PAPER
  3. DEBUGGING AND FIXING WEB APPLICATIONS  An increasing number of tools are being created to help analyze, debug, detect errors in, or even process the output of web applications.  Most of these tools focus on analyzing the Document Object Model (DOM) and the Cascading Style Sheets (CSS) of a page.  These tools serve varied purposes:  fixing cross-browser issues;  interpreting the DOM;  detecting responsive web design bugs;  etc.
  4. WHAT DOES A WEB PAGE LOOK LIKE?  Most of the aforementioned tools base their scalability, and sometimes even their success, on size-related features.  What is the average size of a web page?  Walsh et al. (2015) ran experiments against pages of up to 196 DOM nodes, whereas Choudhary et al. (2013) chose pages of up to 39,146 DOM nodes.  This paper addresses the issue through a large-scale analysis of 708 websites, measuring an array of parameters relative to the size and structure of web pages.
  5. METHODOLOGY
  6. METHODOLOGY: Website collection, DOM harvesting, Data processing
  7. WEBSITE COLLECTION  To obtain a pool of websites representative of what users actually visit, it was necessary to collect the sites visited by the most users.  To that end, the Moz top 500 list of most visited websites was used. However, it contained many duplicates in the form of country-specific versions of the same web application.  Out of those 500 sites, only 300 non-duplicates remained (a possible deduplication rule is sketched below).  Moreover, the sites visited by the most users do not tell the whole story, since this notion is orthogonal to the sites most visited by an individual user.  Therefore, we informally asked people around us to provide the list of websites they use daily.
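The slides do not say how the country-specific duplicates in the Moz list were detected. Purely as an illustration (this is not the authors' procedure), a deduplication pass could group domains by their name stripped of country codes and common top-level suffixes, as in this hypothetical TypeScript sketch:

```typescript
// Hypothetical deduplication of country-specific variants: google.com,
// google.co.uk and google.de all collapse to the key "google". A real
// implementation would need a proper public-suffix list; this sketch only
// strips trailing two-letter codes and a few common TLD labels.
function siteKey(domain: string): string {
  const labels = domain.toLowerCase().split(".");
  while (labels.length > 1 && /^([a-z]{2}|com|org|net)$/.test(labels[labels.length - 1])) {
    labels.pop(); // drop TLDs and country-code suffixes
  }
  return labels[labels.length - 1];
}

function dedupe(domains: string[]): string[] {
  const seen = new Map<string, string>();
  for (const d of domains) {
    const key = siteKey(d);
    if (!seen.has(key)) seen.set(key, d); // keep the first variant encountered
  }
  return Array.from(seen.values());
}

// dedupe(["google.com", "google.co.uk", "amazon.com", "amazon.de"])
//   -> ["google.com", "amazon.com"]
```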
  8. DOM HARVESTING  To collect data on the DOM of each of these sites, a JavaScript program was designed to run once a page has finished loading.  The script starts at the body node of the page and performs a preorder traversal of the entire DOM tree, recording and computing various features:  tag names;  CSS classes;  visibility status;  structural information.  The script then generates two files: a JSON file containing all the data, and a DOT file accepted by the Graphviz library, so that statistical and visual representations of a web page can be obtained.
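The harvesting script itself is not reproduced in the slides. As a minimal sketch of the idea, the following TypeScript performs a preorder traversal from document.body and records, for each element, its tag name, CSS classes, a rough visibility flag, its depth and its degree; the names harvestDom and NodeRecord, as well as the exact visibility test, are assumptions made for this example:

```typescript
// Minimal sketch of a preorder DOM traversal, assumed to run in the browser
// once the page has finished loading. Identifiers are illustrative only.
interface NodeRecord {
  tag: string;        // HTML tag name
  classes: string[];  // CSS classes attached to the element
  visible: boolean;   // rough visibility flag derived from computed styles
  depth: number;      // distance from the body node
  degree: number;     // number of element children
}

function harvestDom(root: Element = document.body): NodeRecord[] {
  const records: NodeRecord[] = [];
  const visit = (el: Element, depth: number): void => {
    const style = window.getComputedStyle(el);
    records.push({
      tag: el.tagName.toLowerCase(),
      classes: Array.from(el.classList),
      visible: style.display !== "none" && style.visibility !== "hidden",
      depth,
      degree: el.children.length,
    });
    // Preorder: the node is recorded before its children are visited.
    for (const child of Array.from(el.children)) {
      visit(child, depth + 1);
    }
  };
  visit(root, 0);
  return records; // can then be serialized to JSON for offline processing
}
```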
  9. DOM HARVESTING – RUNNING ON EVERY PAGE  To actually run on every page, the TamperMonkey extension was used.  This extension, available for multiple browsers, allows the user to inject and run custom JavaScript code every time a new page is loaded in the browser.  Note that the harvesting was done on the browser-rendered DOM and properties.
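For context, a userscript wrapping the harvesting routine could look roughly as follows. The header fields are standard userscript metadata understood by TamperMonkey, but the concrete values, and the assumption that harvestDom from the earlier sketch is included in the same script, are illustrative only:

```typescript
// ==UserScript==
// @name         DOM structural profiler (illustrative)
// @match        *://*/*
// @run-at       document-idle
// @grant        none
// ==/UserScript==

// With @match covering every URL, the injected code runs on each page the
// browser loads, against the rendered DOM and computed properties.
window.addEventListener("load", () => {
  const records = harvestDom();         // hypothetical helper from the earlier sketch
  console.log(JSON.stringify(records)); // or save the JSON for offline processing
});
```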
  10. DATA PROCESSING  LabPal was used to process all 62 MB of raw data:  every website was turned into an experiment that processes the associated JSON file;  it was then possible to aggregate all the recovered data and even perform deeper statistical analysis.  Note that some files were discarded, since the automated loading also retrieved a large number of pop-ups.  Manually inspecting each recovered file to detect pop-ups would have been tedious, so a more generic filter was used instead: every file with fewer than 5 DOM nodes, or whose URL belonged to a list of known advertisement pages, was removed.
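The authors processed the data with LabPal (a Java framework); the filtering rule itself is simple enough to sketch independently. The Node.js-flavoured TypeScript below assumes each harvested page was saved as a JSON array of node records and that adDomains is a hand-maintained blocklist, both of which are assumptions for illustration:

```typescript
import { readFileSync } from "fs";

// Illustrative filter, not the authors' LabPal setup: a harvested file is kept
// only if it contains at least 5 DOM nodes and its URL does not belong to a
// list of known advertisement domains. File layout and names are assumptions.
const adDomains = ["doubleclick.net", "ads.example.com"]; // hypothetical blocklist

function keepHarvestedFile(path: string, url: string): boolean {
  const records: unknown[] = JSON.parse(readFileSync(path, "utf-8"));
  const tooSmall = records.length < 5;                 // likely a pop-up
  const isAd = adDomains.some((d) => url.includes(d)); // known advertisement page
  return !tooSmall && !isAd;
}
```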
  11. RESULTS
  12. GRAPHICAL REPRESENTATION OF A WEBSITE  Each color represents a different HTML tag name.  The root of the tree, the body tag, is represented by the black square.  This is the representation of Zippyshare.com.
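The slides show only the resulting pictures; the hypothetical sketch below shows how harvested records could be turned into a Graphviz DOT graph with one colour per tag name and the body root drawn as a black square. The record shape (with a parent index) and the palette are assumptions, not the authors' code:

```typescript
// Illustrative conversion of harvested records into a Graphviz DOT graph:
// one node per DOM element, filled with a colour chosen per tag name, and
// the root (the body tag) drawn as a black square.
interface TreeRecord {
  tag: string;
  parent: number | null; // index of the parent record, null for the root
}

const palette = ["#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00"];

function toDot(records: TreeRecord[]): string {
  const colours = new Map<string, string>();
  const lines = ["digraph dom {", '  node [style=filled, label=""];'];
  records.forEach((r, i) => {
    if (!colours.has(r.tag)) {
      colours.set(r.tag, palette[colours.size % palette.length]);
    }
    const attrs =
      r.parent === null
        ? 'shape=square, fillcolor="black"'                // root: body tag
        : `shape=circle, fillcolor="${colours.get(r.tag)}"`;
    lines.push(`  n${i} [${attrs}];`);
    if (r.parent !== null) lines.push(`  n${r.parent} -> n${i};`);
  });
  lines.push("}");
  return lines.join("\n");
}
```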
  13. RESULTS: cumulative distribution of websites based on the size of the DOM tree; distribution of websites based on DOM tree depth.
  14. RESULTS: cumulative distribution of websites based on maximum node degree; distribution of websites based on maximum node degree.
  15. RESULTS: total number of elements using each visibility value; distribution of websites according to the fraction of all DOM nodes that are invisible.
  16. RESULTS: size of the DOM tree vs. number of CSS classes; cumulative distribution of websites based on the average size of a CSS class.
  17. THREATS TO VALIDITY: website sample; variance due to browser; homepage analysis.
  18. REFERENCES  Walsh, T.A., McMinn, P., Kapfhammer, G.M.: Automatic detection of potential layout faults following changes to responsive web pages (N). In: Cohen, M.B., Grunske, L., Whalen, M. (eds.) Proc. ASE 2015, pp. 709–714. IEEE Computer Society (2015)  Choudhary, S.R., Prasad, M.R., Orso, A.: X-PERT: accurate identification of cross-browser issues in web applications. In: Notkin, D., Cheng, B.H.C., Pohl, K. (eds.) Proc. ICSE 2013, pp. 702–711. IEEE Computer Society (2013)  The Moz top 500 websites, https://moz.com/top500, accessed October 20th, 2019.  All pictures used are licence-free.
