04/29 regular meeting paper


  • 1. Image Classification for Mobile Web Browsing
    • Takuya Maekawa Takahiro Hara Shojiro Nishio
    • Proceedings of the 15th international conference on
    • World Wide Web
    • Reporter : Che-Min Liao
  • 2. Outline
    • INTRODUCTION
    • IMAGE CATEGORIES
    • DATA SET ANALYSIS
    • EVALUATION
    • WEB PAGE AUTOMATIC SCROLLING APPLICATION
    • CONCLUSION
  • 3. INTRODUCTION
    • Many commercial products and research studies focus on reconstructing Web pages to fit the small screens of mobile devices.
    • In doing so, some Web images must be discarded or downsized to fit the page layout of the small screen.
    • By detecting the role of each image, we can process it appropriately.
  • 4. INTRODUCTION
    • A few studies aim to develop applications for automatically providing images associated with the main contents of a Web page.
    • Some of these studies treat specific Web pages whose structures are well-known.
    • By detecting the roles of Web images, we can extract such content images from any Web page.
  • 5. Goal
    • Automatic classification of Web images into categories according to image role
      • Collect 3901 images from 40 Web sites.
      • Define 11 categories of Web images.
      • Manually categorize the 3901 images into the 11 categories.
      • Select 37 image features for automatic categorization of Web images (using C4.5 to build a decision tree).
  • 6. IMAGE CATEGORIES
    • Four categories with string images are :
      • MENU
      • SECTION
      • DECORATION
      • BUTTON
    • Two categories have small images :
      • ITEM
      • ICON
    • Other five categories :
      • TITLE
      • MAP
      • AD
      • CONTENT
      • LAYOUTER
  • 7. Four categories with string images
    • MENU
      • They are set in line horizontally in the upper and/or lower portion of the page .
        • 67.6% of them had more than two horizontally in-line images at the same height.
      • They usually have small aspect ratios (average : 0.320)
  • 8. Four categories with string images
    • SECTION
      • They have text following them (92.8%)
      • They usually have small aspect ratios (average : 0.142)
  • 9. Four categories with string images
    • DECORATION
      • Images for decorative text.
      • These images don’t have hyperlinks.
    • BUTTON
      • Images with hyperlinks.
      • These images have neighboring text (16.1% above them, 8% below them, 36.8% on the left and 13.8% on the right).
      • They usually have small aspect ratios (average : 0.266)
  • 10. Two categories have small images
    • ITEM
      • ITEM images with the same width are set in line vertically and have neighboring text on the right. (99.4% had text on the right)
      • They usually have aspect ratios of about 1 (average : 1.052)
  • 11. Two categories have small images
    • ICON
      • ICON images have neighboring text on the left or right (right: 58.3%, left: 22.0%).
      • They usually have aspect ratios of about 1 (average : 0.942)
  • 12. Other five categories
    • TITLE
      • In the upper portion of the pages.
      • They have hyperlinks to the index page of the site.
      • They usually have small aspect ratios (average : 0.279)
    • MAP
      • Image maps.
    <MAP NAME="world"> <AREA href="map.gif" … > </MAP>
  • 13. Other five categories
    • AD
      • Some AD images have hyperlinks to another domain (25.5%)
      • They usually have small aspect ratios (average : 0.459)
  • 14. Other five categories
    • CONTENT
      • Content images that are associated with the main contents of the page.
      • They usually have aspect ratios of about 1 (average : 0.951)
      • CONTENT images have neighboring text on the right or below them (right: 35.1%, below: 51.7%)
      • 55.4% of CONTENT images were in JPEG format, compared with 6.6% of the remaining images.
  • 15. Other five categories
    • LAYOUTER
      • Images to control the design and layout of other images and/or text on the page.
      • Most LAYOUTER images are filled with a single color.
      • They usually appear many times on a page. (average : 10.7)
  • 16. DATA SET ANALYSIS
    • 120 pages from 40 sites
      • Selected 3 pages per site, including the index page
      • Collected 3901 images in total
  • 17. Distribution of collected images
    • Category : number
      • MENU : 686
      • SECTION : 469
      • DECORATION : 69
      • BUTTON : 87
      • ITEM : 311
      • ICON : 264
      • TITLE : 141
      • MAP : 53
      • AD : 329
      • CONTENT : 951
      • LAYOUTER : 541
  • 18. Image features
    • Define 37 image features (F1-F37) to classify Web images.
    • There are four ways to extract features from Web images:
      • F1-F20: Use HTML source analysis
      • F21, F22: Query Web servers
      • F23-F30: Exploit the layout information of DOM trees when rendering the pages.
      • F31-F37: Use image processing
  • 19. Image features (HTML)
    • F1: Dimension
    • F2: Width
    • F3: Height
    • F4: Aspect ratio
    • F5: Uses Map or not {TRUE, FALSE}
    • F6: Has a hyperlink or not {TRUE, FALSE}
      • LAYOUTER images and DECORATION images are usually set as ‘FALSE’.
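
As a rough illustration of how F1-F6 could be read from HTML source, the sketch below parses img tags with Python's standard html.parser. The class name, the dictionary keys, and two definitions are assumptions not stated on the slides: dimension is taken as width × height, and aspect ratio as height / width (which would fit the averages below 1 reported for wide MENU and TITLE images).

```python
from html.parser import HTMLParser


def _to_int(value):
    """Parse a width/height attribute; missing or non-numeric values become 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


class ImageFeatureParser(HTMLParser):
    """Collect F1-F6 style features for every <img> tag (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.link_stack = []  # one entry per open <a>; True if it has an href
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.link_stack.append("href" in attrs)
        elif tag == "img":
            width = _to_int(attrs.get("width"))
            height = _to_int(attrs.get("height"))
            self.images.append({
                "src": attrs.get("src"),
                "dimension": width * height,                        # F1 (assumed: area)
                "width": width,                                     # F2
                "height": height,                                   # F3
                "aspect_ratio": height / width if width else None,  # F4 (assumed: h / w)
                "uses_map": "usemap" in attrs,                      # F5
                "has_hyperlink": any(self.link_stack),              # F6
            })

    def handle_endtag(self, tag):
        if tag == "a" and self.link_stack:
            self.link_stack.pop()
```

Feeding a page's HTML to `ImageFeatureParser().feed(...)` fills `images` in source order; images that omit width and height attributes would need their sizes from the rendered page instead.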
  • 20. Image features (HTML)
    • F7: Has an outlink or not {TRUE, FALSE}
      • Outlink: a hyperlink to another domain
    • F8: Has a loop-back-link or not {TRUE, FALSE}
      • A loop-back-link: a hyperlink to the index page of the site or a link to the page that it is on.
      • TITLE images and MENU images are usually set as ‘TRUE’.
    • F9: Has an ALT string or not {TRUE, FALSE}
      • String images and other text images are usually set as ‘TRUE’.
      • (MENU: 85.4%, SECTION: 74.0%, DECORATION: 66.7%, BUTTON: 63.2%)
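
The link features F7-F9 can be sketched with urllib.parse. This is a minimal illustration, not the authors' implementation: the function name and arguments are hypothetical, "another domain" is approximated by comparing host names, and an empty ALT attribute is treated as no ALT string.

```python
from urllib.parse import urljoin, urlparse


def link_features(page_url, index_url, href, alt):
    """Return F7-F9 for one image; href is None when the image is not linked."""
    has_outlink = False
    has_loop_back = False
    if href is not None:
        # Resolve relative links against the page the image appears on.
        target = urljoin(page_url, href)
        # F7 (assumption): a different host counts as another domain.
        has_outlink = urlparse(target).netloc != urlparse(page_url).netloc
        # F8: link to the site's index page or back to the current page.
        has_loop_back = target.split("#", 1)[0] in (page_url, index_url)
    return {
        "has_outlink": has_outlink,         # F7
        "has_loop_back_link": has_loop_back,  # F8
        "has_alt": bool(alt),               # F9 (assumption: empty ALT is FALSE)
    }
```

For example, a TITLE image linked with href="/" on http://example.com/news/a.html would get has_loop_back_link set to True when index_url is http://example.com/.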
  • 21. Image features (HTML)
    • F10: Number of characters in an ALT string
      • CONTENT images usually have large values (average : 26.8)
    • F11: Number of characters in neighboring text
      • MENU images usually have small values (average : 2.7)
  • 22. Image features (HTML)
    • F12: JPEG image or not {TRUE, FALSE}
    • F13: Index in the HTML source
      • The index is the order of the corresponding tag in an HTML source.
      • TITLE images have small values (average: 48.4, average of all images: 424.7).
    • F14: Number of appearances on a page
      • ITEM : 10.65 , LAYOUTER : 6.77
    • F15: Number of images with the same dimension on a page
      • CONTENT:7.5, ICON:4.3, ITEM:4.0
  • 23. Image features (HTML)
    • F16: Number of images with the same width on a page
      • CONTENT: 8.1, AD: 3.5, ICON: 4.3, ITEM:4.5, SECTION: 4.4
    • F17: Number of images with the same height on a page
      • CONTENT: 8.1, MENU: 8.5, SECTION: 4.8, ICON: 4.4, ITEM: 4.8
    • F18-F20: Number of neighboring images with the same attribute (distance between the index values is not more than 100)
      • Height: MENU, BUTTON and TITLE images usually have small values. (F18)
      • Width: BUTTON, TITLE and LAYOUTER images usually have small values. (F19)
      • Dimension: BUTTON, TITLE, MAP and LAYOUTER images usually have small values. (F20)
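
The counting features F14-F20 all follow one pattern: count the images on a page that share some attribute, optionally restricted to images whose source index is close. The sketch below shows that pattern under stated assumptions: the function names and the dictionary layout are hypothetical, the page-wide count includes the image itself, and the neighboring count excludes it. The slides do not say which convention the authors used.

```python
from collections import Counter


def count_same(images, key):
    """For each image, count images on the page with the same key (itself included)."""
    totals = Counter(key(img) for img in images)
    return [totals[key(img)] for img in images]


def count_same_nearby(images, key, max_gap=100):
    """For each image, count *other* images with the same key whose HTML
    source index (F13) differs by at most max_gap."""
    return [
        sum(
            1
            for other in images
            if other is not img
            and abs(other["index"] - img["index"]) <= max_gap
            and key(other) == key(img)
        )
        for img in images
    ]
```

With images stored as dictionaries holding index, src, width and height, `count_same(images, lambda i: i["src"])` would give F14, `count_same(images, lambda i: i["width"])` would give F16, and `count_same_nearby(images, lambda i: i["height"])` would give F18.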
  • 24. Image features (Web server)
    • F21: Byte size
    • F22: Byte size per dimension (byte/pix^2)
      • CONTENT: 0.83, AD: 0.71
      • ICON: 1.2, ITEM:1.0, LAYOUTER: 8.9
  • 25. Image features (Rendering info.)
    • F23-F30: Features extracted when rendering the page
      • X coordinate of the top left of the image (F23)
      • Y coordinate of the top left of the image (F24)
      • Number of images with the same F23 coordinate (F25)
      • Number of images with the same F24 coordinate (F26)
      • Number of images with the same F23 coordinate and same width (F27)
      • Number of images with the same F24 coordinate and same height (F28)
      • Distance between the bottom of the page and the bottom of the image (F29)
      • Location of the neighboring text (F30)
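
Given the rendered bounding box of each image, F23-F29 reduce to simple counting and subtraction. The following is a minimal sketch; the function name and the (x, y, width, height) tuple layout are assumptions, and each count includes the image itself. F30 is omitted because it needs the positions of text nodes as well.

```python
from collections import Counter


def rendering_features(boxes, page_height):
    """boxes: list of (x, y, width, height) tuples in rendered page pixels,
    where (x, y) is the top-left corner of the image."""
    same_x = Counter(x for x, _, _, _ in boxes)
    same_y = Counter(y for _, y, _, _ in boxes)
    same_x_and_width = Counter((x, w) for x, _, w, _ in boxes)
    same_y_and_height = Counter((y, h) for _, y, _, h in boxes)
    return [
        {
            "F23": x,
            "F24": y,
            "F25": same_x[x],
            "F26": same_y[y],
            "F27": same_x_and_width[(x, w)],
            "F28": same_y_and_height[(y, h)],
            "F29": page_height - (y + h),  # gap between image bottom and page bottom
        }
        for x, y, w, h in boxes
    ]
```

A row of equally tall images at the top of a page, as is typical for MENU images, would produce high F26 and F28 values for every image in the row.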
  • 26. Image features (Image processing)
    • F31: Number of colors
      • The MENU, SECTION and TITLE images had on average 42.32, 40.53 and 89.58 colors, respectively.
    • F32: Number of concolorous regions
      • The MENU, SECTION and TITLE images had on average 22.79, 41.83 and 109.55 concolorous regions, respectively.
    • F33: Minimum similarity to neighboring images (distance between the index values is not more than 30)
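
F31 and F32 can be illustrated on a decoded pixel grid. The sketch below assumes that a concolorous region means a 4-connected group of pixels with exactly the same color; the slides do not define the term further, and the authors may have used a tolerance or a different connectivity. The function names are hypothetical, and the image is assumed to be already decoded into rows of color values.

```python
from collections import deque


def count_colors(pixels):
    """F31: number of distinct colors in a grid of rows of color values."""
    return len({color for row in pixels for color in row})


def count_concolorous_regions(pixels):
    """F32 (assumed definition): number of 4-connected same-color regions."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions = 0
    for start_y in range(height):
        for start_x in range(width):
            if seen[start_y][start_x]:
                continue
            # Unvisited pixel: start a new region and flood-fill it.
            regions += 1
            color = pixels[start_y][start_x]
            seen[start_y][start_x] = True
            queue = deque([(start_x, start_y)])
            while queue:
                x, y = queue.popleft()
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and pixels[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
    return regions
```

An image with two colors can still have many regions, for instance when text strokes split the background, which is one way the two features could carry different information.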
  • 27. Image features (Image processing)
    • F34: Animation GIF or not
      • 14.29% of AD images were animated GIFs. (Other images: 0.36%)
    • F35: Has rounded corner rectangle or not (BUTTON: 37.9%)
    • F36: Text region occupancy ratio
      • LAYOUTER: 0.40%, SECTION: 37.89%, DECORATION: 55.19%, TITLE: 44.85%
    • F37: Number of text regions
      • AD: 2.75, MENU: 1.04, SECTION: 1.19
  • 28. EVALUATION
    • Performed forty classification tests (decision tree)
      • Training set: images from thirty-nine sites
      • Test set: images from the one remaining site
    • [Conditions]
    • C1: HTML source analysis (F1-20)
    • C2: HTML+Web server (F1-22)
    • C3: HTML+Web server+Rendering Info.(F1-30)
    • C4: HTML+Web server+Image processing (F1-22, F31-37)
    • C5: All features
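
The evaluation protocol above amounts to a leave-one-site-out split repeated for each feature condition. The sketch below shows only the splitting and the feature subsets; the names are hypothetical, and the C4.5 training step is left out because the slides give no details about its configuration.

```python
# Feature numbers (F1-F37) used under each condition, as listed on the slide.
CONDITIONS = {
    "C1": list(range(1, 21)),                        # HTML source analysis
    "C2": list(range(1, 23)),                        # + Web server
    "C3": list(range(1, 31)),                        # + rendering info
    "C4": list(range(1, 23)) + list(range(31, 38)),  # HTML + server + image processing
    "C5": list(range(1, 38)),                        # all features
}


def leave_one_site_out(samples):
    """samples: list of (site_id, features, label) tuples.
    Yields (held_out_site, train, test), one fold per site."""
    sites = sorted({site for site, _, _ in samples})
    for held_out in sites:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test
```

With 40 sites this yields the forty tests described above. Splitting by site rather than by image keeps images from one site out of both the training and test sets of the same fold.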
  • 29. EVALUATION
  • 30. EVALUATION
  • 31. EVALUATION
    • The more features are used, the higher the accuracy becomes.
    • These results indicate that features acquired by image processing work effectively when combined with features acquired from the DOM tree.
    • The result for AD in C3 is much lower, because F27 and F30 often cause AD images to be mistaken for MENU images.
    • MENU and AD images are often mistaken for each other because of their similar features.
  • 32. WEB PAGE AUTOMATIC SCROLLING APPLICATION
    • The automatic scrolling is done by extracting components from a Web page and setting the scrolling path to traverse the extracted components in order.
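
The slide does not say in which order the components are traversed. As one possible reading, the sketch below orders extracted components top to bottom and then left to right and returns their top-left corners as scroll waypoints. Both the ordering rule and the (name, x, y) layout are assumptions made for illustration only.

```python
def scrolling_path(components):
    """components: list of (name, x, y) tuples, where (x, y) is the top-left
    corner of an extracted component. Returns scroll waypoints ordered top to
    bottom, then left to right (an assumed ordering, not the authors')."""
    ordered = sorted(components, key=lambda c: (c[2], c[1]))
    return [(x, y) for _, x, y in ordered]
```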
  • 33. Conclusion
    • Image classification for mobile Web browsing
      • 3901 images
      • 11 categories
      • 83.1% classification accuracy
    • As part of our future work, we plan to examine classifying CONTENT images into more detailed categories for various applications.