
web page classification


Published in: Technology, Education


  1. web page classification with naïve bayes classifiers, nabeelah ali, 27 november 2013
  2. outline • what is web page classification • motivation • literature review • project design • experiments • evaluation
  3. description & motivation
  4. what is classification?
  5. web page classification: web page classification can be seen as a type of document classification
  6. documents vs web pages • web pages have structure • HTML indicates headings, paragraphs, meta-information • web pages are interconnected • they contain hyperlinks to other pages • they have locations (URLs)
  7. why? web directories
  8. why? improving search results
  9. why? • user profile mining • information filtering • creation of domain-specific search engines
  10. literature review
  11. bag of words: text is represented as an unordered list of words
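A minimal sketch of the bag-of-words representation described on this slide, assuming lowercase alphanumeric tokens; the function name `bag_of_words` and the sample sentence are illustrative and not taken from the slides.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, discarding word order.

    Tokenisation rule (an assumption): lowercase the text and keep
    runs of letters and digits.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


bow = bag_of_words("Professor of CS. Research assistant to the professor.")
print(bow["professor"])  # number of times "professor" occurs
```

Because only counts are kept, two texts containing the same words in a different order produce the same representation.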
  12. n-gram representation • document is represented by a vector of features • concepts expressed by phrases can be captured (e.g. “New York” vs “new” and “york”)
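As a rough illustration of the n-gram idea, the sketch below builds word n-grams from a token list. The helper name `word_ngrams` and the example sentence are assumptions made for this example.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "flights to new york".split()
bigrams = word_ngrams(tokens, 2)
print(bigrams)
```

With n = 2 the phrase "new york" survives as a single feature, whereas a plain bag of words would only see "new" and "york" separately.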
  13. using html structure • assign each word a weight depending on the HTML tags it appears in, and make the feature a linear combination of these weights • e.g. headings would have a greater weight • four main elements are considered: title, headings, metadata and main text • Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and metadata in automated subject classification." Research and Advanced Technology for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
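One possible way to sketch tag-based weighting with Python's standard `html.parser` module. The weights in `TAG_WEIGHTS` are made-up values chosen for illustration, not the ones used by Golub and Ardö or in this project, and metadata held in tag attributes is ignored here.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: text in these tags counts more than body text.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "h2": 2.0}
DEFAULT_WEIGHT = 1.0


class WeightedTermParser(HTMLParser):
    """Accumulate a weight per word based on the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag, tolerating unbalanced markup.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        # Use the heaviest enclosing tag as the weight of this text run.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for token in re.findall(r"[a-z0-9]+", data.lower()):
            self.weights[token] += weight


def weighted_terms(html):
    parser = WeightedTermParser()
    parser.feed(html)
    return parser.weights


page = (
    "<html><head><title>CS Course</title></head>"
    "<body><h1>Course outline</h1><p>course room 220</p></body></html>"
)
weights = weighted_terms(page)
print(weights["course"])
```

Each word's final value is the sum of the weights of the places it occurs, which is one simple reading of "linear combination" on the slide.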
  14. visual analysis • visual representation by web browser is important • each web page is visualised as an adjacency multigraph, with each section representing a different kind of content • Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel approach for a Web page classification." Proceedings of SAWM04 workshop, ECML2004. 2004.
  15. URL features • pages do not need to be fetched or analysed • fast! • derives tokens from the URL and uses these tokens as features • Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification using URL features." Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 2005.
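A small sketch of deriving tokens from a URL without fetching the page. The splitting rule (break the host and path on anything that is not a letter or digit) is an assumption for illustration and is simpler than the segmentation used by Kan and Nguyen Thi; the example URL is hypothetical.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split the host and path of a URL into lowercase tokens."""
    parsed = urlparse(url)
    text = f"{parsed.netloc} {parsed.path}".lower()
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]


# Hypothetical URL, used only to show the output shape.
tokens = url_tokens("http://www.cs.cornell.edu/courses/cs100/index.html")
print(tokens)
```

The resulting tokens can then be counted and fed to a classifier in the same way as words from the page body.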
  16. web page classification: project design
  17. dataset • 4 universities dataset (cornell, texas, washington, wisconsin) • each page must be classified into a category: course, department, faculty, project, staff, student, other • http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
  18. document classification • single label classification: one and only one class label is assigned to each instance • hard classification: an instance is either in a particular class or not, with no intermediate state • multi-class classification: instances can be divided into more than two categories
  19. details of the dataset
  20. experiment #1: bag of words • use the words, unweighted, as features • [figure: word cloud of example features: assistant, CS, Dr, intern, 220, admission, Professor, room, research]
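The slides name naïve Bayes as the classifier but do not show an implementation. The following is a minimal sketch of a multinomial naïve Bayes classifier over unweighted word features, assuming add-one (Laplace) smoothing; the class name and the toy training pages are invented for illustration and are not drawn from the 4 universities dataset.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(count / n_docs)
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    p = (self.word_counts[label][t] + 1) / (total + vocab_size)
                    score += math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, made up for this example.
train_docs = [
    ["course", "syllabus", "homework"],
    ["course", "lecture", "exam"],
    ["professor", "research", "publications"],
    ["professor", "office", "research"],
]
train_labels = ["course", "course", "faculty", "faculty"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["research", "professor"]))
```

Working in log space avoids numeric underflow when pages contain many words.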
  21. experiment #2: HTML tag weighting • use words weighted by the HTML tags (e.g. words in <h1> tags will be weighted more heavily than those in <p> tags) • [figure: word cloud of example features: assistant, CS, Dr, intern, 220, admission, Professor, room, research]
  22. experiment #3: n-gram • use phrases instead of single words as features • [figure: word cloud of example phrases: research assistant, contact information, program description, course outline]
  23. evaluation: k-fold cross validation (figure from http://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/)
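A sketch of how k-fold cross validation partitions a dataset: each fold serves once as the test set while the remaining folds form the training set. The generator name `k_fold_splits` is an assumption; in practice the indices would usually be shuffled first, which is omitted here to keep the output predictable.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print(len(train), len(test))
```

The reported score is then typically the average of the per-fold scores.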
  24. evaluation: confusion matrix (http://en.wikipedia.org/wiki/Confusion_matrix)
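A short sketch of building a confusion matrix for a multi-class problem, with rows for actual classes and columns for predicted classes. The labels and predictions below are invented to show the layout and are not results from the project.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


labels = ["course", "faculty", "student"]
actual = ["course", "course", "faculty", "student"]      # made-up labels
predicted = ["course", "faculty", "faculty", "student"]  # made-up predictions

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)

# Accuracy is the diagonal (correct predictions) over the total.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
```

Off-diagonal cells show which classes are confused with each other, which a single accuracy figure hides.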
  25. bibliography • B. Choi and Z. Yao: Web Page Classification, StudFuzz 180, 221–274 (2005) • Qi, Xiaoguang, and Brian D. Davison. "Web page classification: Features and algorithms." ACM Computing Surveys (CSUR) 41.2 (2009): 12. • Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and metadata in automated subject classification." Research and Advanced Technology for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378. • Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification using URL features." Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 2005. • Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel approach for a Web page classification." Proceedings of SAWM04 workshop, ECML2004. 2004.
  26. questions?
