Identifying Auxiliary Web Images Using Combinations of Analyses

Published in: Education, Technology

  1. Identifying Auxiliary Web Images Using Combination of Analyses
     Tewson Seeoun, Sirindhorn International Institute of Technology
     With guidance from:
     Asst. Prof. Dr. Toshiaki Kondo, Sirindhorn International Institute of Technology
     Dr. Choochart Haruechaiyasak, Human Language Technology Laboratory, NECTEC, NSTDA
  2. Agenda
     ● Introduction
     ● Background
       ● Document Object Model (DOM) in HTML
       ● Support Vector Machine (SVM)
     ● Objective
     ● Methodology
     ● Results
     ● Discussion / Future Work
     ● Conclusion
     ● Acknowledgement
  3. Introduction
     ● Websites contain images.
     ● Some images are not necessary.
       ● Search Engine Indexing
       ● Printing
     ● Ignoring them is sometimes economical and green.
  4. Background - DOM
     ● Web browsers / layout engines parse HTML / CSS / JavaScript into DOM.
     ● DOM represents things (elements) in a Web page.
     ● An element has properties (position, size, etc.).
     ● JavaScript sees DOM.
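To make the idea of elements with properties concrete, here is a minimal sketch using only Python's standard `html.parser` module. It is not the tool chain from the slides: a standard-library parser reads only the attributes written in the markup, whereas a layout engine also computes the rendered position and size. The `ImageCollector` class, the `collect_images` helper, and the sample markup are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect every <img> element together with its declared attributes."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the markup.
        if tag == "img":
            self.images.append(dict(attrs))


def collect_images(html):
    """Return one attribute dictionary per <img> element found in html."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.images


# Hypothetical sample page used only for this illustration.
sample = """
<html><body>
  <img src="logo.png" width="120" height="40">
  <p>Article text <img src="photo.jpg" width="640" height="480"></p>
</body></html>
"""

for image in collect_images(sample):
    print(image["src"], image.get("width"), image.get("height"))
```

Attribute values arrive as strings, and images sized through CSS or JavaScript would show no width or height here, which is one reason a real layout engine is the safer source for position and dimension.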
  5. Background - SVM
     ● SVM is a supervised machine learning algorithm.
     ● SVM is used for statistical pattern recognition.
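As a rough illustration of what "supervised" means here, the sketch below trains a linear SVM on labelled examples by subgradient descent on the regularised hinge loss. It is a toy written for this transcript, not the SVM implementation used in the work (the slides do not name one). The function names, the hyperparameter defaults, and the two made-up features (image width and height as fractions of the page) are all assumptions.

```python
def train_linear_svm(samples, labels, lam=0.01, epochs=200, lr=0.1):
    """Fit w and b by per-sample subgradient steps on
    lam/2 * ||w||^2 + hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin: hinge-loss subgradient plus regulariser.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Outside the margin: only the regulariser shrinks w.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Made-up training data: (relative width, relative height).
# +1 marks a hypothetical auxiliary image, -1 a content image.
samples = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15),
           (0.90, 0.80), (0.80, 0.90), (0.85, 0.70)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(samples, labels)
print([predict(w, b, x) for x in samples])
```

The training step sees both the feature vectors and their labels; prediction afterwards needs only the features.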
  6. Objective (for now)
     To recognize patterns of auxiliary Web images quickly using DOM analysis and basic image processing.
  7. Methodology
     [Pipeline diagram. Inputs shown: HTML, CSS, JS, and IMG files. Components shown: PyQtWebKit, Python, DOM, jQuery, PIL, OpenCV, Tesseract, MySQL. Outputs shown: Page Level Features, Domain Level Features, Image Level Features, Labels.]
  8. Methodology (continued)
     ● Image Level Features
       ● No. of Colors
       ● No. of Human Faces
       ● No. of Alphabetic Characters
     ● Page Level Features
       ● Position
       ● Dimension
       ● No. of Images with Similar Dimension
     ● Domain Level Features
       ● External / Internal Links
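Two of the listed features can be sketched in a few lines of standard-library Python. The slides do not define them precisely, so the definitions below are assumptions: "similar dimension" is taken to mean width and height each within a relative tolerance, and a link counts as external when it resolves to a different host than the page. Both function names and the 10% tolerance are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def count_similar_dimensions(images, index, tolerance=0.1):
    """Count the other images whose width and height are both within
    `tolerance` (relative) of the image at `index`.

    `images` is a list of (width, height) pairs in pixels.
    """
    width, height = images[index]
    count = 0
    for i, (w, h) in enumerate(images):
        if i == index:
            continue
        if abs(w - width) <= tolerance * width and abs(h - height) <= tolerance * height:
            count += 1
    return count


def is_external_link(page_url, href):
    """Return True when href resolves to a different host than the page."""
    target = urljoin(page_url, href)  # resolves relative links against the page
    return urlparse(target).netloc.lower() != urlparse(page_url).netloc.lower()


# Hypothetical page with three small icons and one large photo.
dimensions = [(16, 16), (16, 16), (17, 16), (640, 480)]
print(count_similar_dimensions(dimensions, 0))
print(is_external_link("http://example.com/a/b.html", "/img/x.png"))
print(is_external_link("http://example.com/a/b.html", "http://ads.example.net/banner"))
```

Many same-sized images on one page, or an image linking off-site, are the kind of signals such features could expose to a classifier.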
  9. Methodology (continued)
     [Diagram. Records from MySQL are randomly selected into two sets: 80% (500/626) go to SVM (Train), which produces a Model; the remaining 20% go to SVM (Predict), which produces Results.]
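The random 80/20 split can be sketched as follows. The record contents, the `split_records` name, and the fixed seed are assumptions for illustration; only the 626-record total and the 500-record training set come from the slide.

```python
import random


def split_records(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and cut it into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 626 labelled image records mentioned on the slide.
records = list(range(626))
train, test = split_records(records)
print(len(train), len(test))  # prints "500 126"
```

Truncating 626 × 0.8 = 500.8 gives the 500 training records shown on the slide, leaving 126 for prediction.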
  10. Results
      ● 10-fold Cross-Validation (10 Experiments): Average Accuracy = 84.92%
      ● After Applying Grid-Search Technique: Average Accuracy = 93.17%
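The two evaluation techniques named above, k-fold cross-validation and grid search, can be sketched in plain Python. The slides do not say which parameters were searched; for an SVM the grid is commonly over the penalty C and a kernel parameter, but that is an assumption. To stay self-contained, this sketch substitutes a tiny k-nearest-neighbour classifier and searches over its neighbour count. All names and data below are hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds of range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def knn_predict(train_x, train_y, x, neighbors):
    """Majority vote among the nearest training points (1-D feature, labels +1/-1)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:neighbors]
    return 1 if sum(train_y[i] for i in nearest) > 0 else -1


def cross_val_accuracy(xs, ys, neighbors, folds):
    """Mean accuracy over the folds for one hyperparameter value."""
    accuracies = []
    for train, test in k_fold_indices(len(xs), folds):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct = sum(knn_predict(train_x, train_y, xs[i], neighbors) == ys[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


def grid_search(xs, ys, candidates, folds):
    """Score every candidate by cross-validation and return the best one."""
    scores = {c: cross_val_accuracy(xs, ys, c, folds) for c in candidates}
    return max(scores, key=scores.get), scores


# Made-up 1-D data, interleaved so every fold contains both classes.
xs = [0.10, 0.90, 0.20, 0.80, 0.15, 0.85, 0.30, 0.70, 0.25, 0.75]
ys = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]

best, scores = grid_search(xs, ys, candidates=[1, 3, 5], folds=5)
print(best, scores)
```

Each candidate is judged only on folds it was not trained on, so the chosen value reflects held-out accuracy rather than fit to the training data.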
  11. Discussion
      ● Some pages cannot be parsed.
        ● Frames and redirections
      ● Positions can be miscalculated.
        ● JavaScript used in displaying images
        ● CSS sprites
      ● Tesseract is not well-tuned.
        ● Small images have to be magnified, but how much?
      ● Downloading images for processing is a bottleneck.
      ● Features are not weighted.
      ● The definition of “auxiliary image” is subjective.
  12. Future Work
      ● Context Analysis
      ● Weighted Features
      ● Adaptive Page Analysis (Website Categorization)
      ● Technique Evaluation / Optimization
  13. Conclusion
      Layout analysis and basic image processing techniques alone perform well, but the system could be better.
  14. Acknowledgement
      ● NSTDA, NECTEC, and YSTP program
      ● Dr. Choochart Haruechaiyasak
      ● Dr. Toshiaki Kondo
      ● Mr. Krikamol Muendet
      ● And Many Others...
