Identifying Auxiliary Web Images Using Combinations of Analyses
1. Identifying Auxiliary Web Images Using Combinations of Analyses
Tewson Seeoun
Sirindhorn International Institute of Technology
With Guidance From
Asst. Prof. Dr. Toshiaki Kondo
Sirindhorn International Institute of Technology
Dr. Choochart Haruechaiyasak
Human Language Technology Laboratory, NECTEC, NSTDA
2. Agenda
● Introduction
● Background
● Document Object Model (DOM) in HTML
● Support Vector Machine (SVM)
● Objective
● Methodology
● Results
● Discussion / Future Work
● Conclusion
● Acknowledgement
3. Introduction
● Websites contain images.
● Some images are not necessary.
● Search Engine Indexing
● Printing
● Ignoring them is sometimes economical and green.
4. Background - DOM
● Web browsers / layout engines parse HTML / CSS / JavaScript into the DOM.
● The DOM represents the elements of a Web page.
● Each element has properties (position, size, etc.).
● JavaScript can read the DOM.
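A minimal sketch of the idea that a page is made of elements with properties, using only Python's standard `html.parser`. The sample page and the `ImageCollector` class are hypothetical, and this is not the PyQtWebKit setup named in the methodology. A plain parser sees only the attributes written in the markup; rendered position and size still require a layout engine.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the attributes of every <img> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names and passes attributes as (name, value) pairs.
        if tag == "img":
            self.images.append(dict(attrs))


# Hypothetical page used only for illustration.
SAMPLE_HTML = """
<html><body>
  <img src="/logo.png" width="120" height="40" alt="Site logo">
  <p>Article text <img src="photo.jpg" width="640" height="480"></p>
</body></html>
"""

collector = ImageCollector()
collector.feed(SAMPLE_HTML)
for image in collector.images:
    print(image.get("src"), image.get("width"), image.get("height"))
```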
5. Background - SVM
● SVM is a supervised machine learning algorithm: it learns a decision boundary from labeled examples.
● SVM is used for statistical pattern recognition.
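To make "supervised" concrete, here is a small linear SVM trained by subgradient descent on the hinge loss, in pure Python. It is an illustrative sketch only: the slides do not say which SVM implementation or kernel was used, and the two features (normalized width and height) and the labels (+1 auxiliary, -1 content) are made-up toy data.

```python
def train_linear_svm(samples, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Fit weights and bias by subgradient descent on the regularized hinge loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            # L2 regularization shrinks the weights a little on every step.
            weights = [w * (1 - learning_rate * reg) for w in weights]
            # Update only when the sample is inside the margin or misclassified.
            if y * score < 1:
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the decision boundary."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


# Toy data (hypothetical): small images labeled auxiliary (+1), large ones content (-1).
samples = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.8, 0.9), (0.9, 0.8), (0.85, 0.7)]
labels = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_svm(samples, labels)
print(predict(weights, bias, (0.1, 0.1)), predict(weights, bias, (0.9, 0.9)))
```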
6. Objective (for now)
To recognize patterns of auxiliary Web images quickly using DOM analysis and basic image processing.
7. Methodology
[Diagram: processing pipeline. Components shown: HTML, CSS, and JS inputs; PyQtWebKit (Python); DOM; jQuery; IMG files; PIL; OpenCV; Tesseract; page-level, domain-level, and image-level features; labels; MySQL.]
8. Methodology (continued)
● Image Level Features
● No. of Colors
● No. of Human Faces
● No. of Alphabetic Characters
● Page Level Features
● Position
● Dimension
● No. of Images with Similar Dimension
● Domain Level Features
● External / Internal Links
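The features listed above could be computed along these lines. This is a hedged sketch with one example per level; the function names, the pixel tolerance, and the sample values are assumptions, and the slides do not give the exact definitions used. In practice the pixel data would come from an imaging library such as PIL.

```python
from urllib.parse import urlparse


def count_distinct_colors(pixels):
    """Image-level feature: number of distinct colors among RGB pixel tuples."""
    return len(set(pixels))


def count_similar_dimensions(target, others, tolerance=2):
    """Page-level feature: images whose width and height are within `tolerance` px of target."""
    width, height = target
    return sum(
        1 for w, h in others
        if abs(w - width) <= tolerance and abs(h - height) <= tolerance
    )


def is_external_link(page_url, link_url):
    """Domain-level feature: True if the link's host differs from the page's host."""
    page_host = urlparse(page_url).hostname
    link_host = urlparse(link_url).hostname
    # A relative link has no host, so it stays on the same site.
    return link_host is not None and link_host != page_host


# Hypothetical values for illustration only.
pixels = [(255, 255, 255), (255, 255, 255), (0, 0, 0), (200, 30, 30)]
print(count_distinct_colors(pixels))
print(count_similar_dimensions((16, 16), [(16, 16), (17, 15), (640, 480)]))
print(is_external_link("http://example.com/news/1.html", "http://ads.example.net/click"))
print(is_external_link("http://example.com/news/1.html", "/about.html"))
```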
9. Methodology (continued)
[Diagram: labeled records from MySQL are split at random. 80% (500/626) go to SVM (Train), which produces a model; the remaining 20% go to SVM (Predict), which applies the model to produce results.]
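The random 80/20 split can be sketched as follows. The helper name `split_train_test` and the fixed seed are assumptions added for reproducibility; the record IDs stand in for the 626 labeled images.

```python
import random


def split_train_test(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and cut it into training and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(626))  # stand-in IDs for the 626 labeled images
train, test = split_train_test(records)
print(len(train), len(test))  # 500 126
```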
10. Results
10-fold Cross-Validation (10 Experiments)
Average Accuracy = 84.92%
After Applying Grid-Search Technique
Average Accuracy = 93.17%
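A rough sketch of the two evaluation techniques named above: k-fold cross-validation and grid search. It assumes an SVM with two tunable parameters, C and gamma (which implies an RBF kernel; the slides do not state the kernel), searched over exponentially spaced grids. `toy_score` is a made-up stand-in for "train with these parameters and return the cross-validated accuracy".

```python
import math
from itertools import product


def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k nearly equal folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_val_accuracy(evaluate, n_samples, k=10):
    """Average the accuracy returned by evaluate(train, test) over k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)


def grid_search(score, c_values, gamma_values):
    """Return (best_score, best_c, best_gamma) over every parameter pair."""
    return max((score(c, g), c, g) for c, g in product(c_values, gamma_values))


def toy_score(c, gamma):
    # Hypothetical score surface peaking at C = 8, gamma = 0.5 (not real results).
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 1) ** 2)


c_grid = [2 ** e for e in range(-5, 16, 2)]
gamma_grid = [2 ** e for e in range(-15, 4, 2)]
best_score, best_c, best_gamma = grid_search(toy_score, c_grid, gamma_grid)
print(best_c, best_gamma)
```

In a real run, the score passed to `grid_search` would call `cross_val_accuracy` with a function that trains and tests the SVM on each fold.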
11. Discussion
● Some pages cannot be parsed.
● Frames and redirections
● Positions can be miscalculated.
● JavaScript used in displaying images
● CSS sprites
● Tesseract is not well-tuned.
● Small images have to be magnified, but how much?
● Downloading images for processing is a bottleneck.
● Features are not weighted.
● The definition of “auxiliary image” is subjective.
14. Acknowledgement
● NSTDA, NECTEC, and YSTP program
● Dr. Choochart Haruechaiyasak
● Dr. Toshiaki Kondo
● Mr. Krikamol Muendet
● And Many Others...