Identifying Auxiliary Web Images Using Combinations of Analyses

Published in: Education, Technology

  1. Identifying Auxiliary Web Images Using Combination of Analyses
     Tewson Seeoun, Sirindhorn International Institute of Technology
     With guidance from:
     ● Asst. Prof. Dr. Toshiaki Kondo, Sirindhorn International Institute of Technology
     ● Dr. Choochart Haruechaiyasak, Human Language Technology Laboratory, NECTEC, NSTDA
  2. Agenda
     ● Introduction
     ● Background
         ● Document Object Model (DOM) in HTML
         ● Support Vector Machine (SVM)
     ● Objective
     ● Methodology
     ● Results
     ● Discussion / Future Work
     ● Conclusion
     ● Acknowledgement
  3. Introduction
     ● Websites contain images.
     ● Some images are not necessary.
         ● Search Engine Indexing
         ● Printing
     ● Ignoring them is sometimes economical and green.
  4. Background - DOM
     ● Web browsers / layout engines parse HTML / CSS / JavaScript into DOM.
     ● DOM represents things (elements) in a Web page.
     ● An element has properties (position, size, etc.).
     ● JavaScript sees DOM.
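The idea that page elements carry properties can be illustrated with a small sketch. It uses only Python's standard `html.parser` and reads the `width` and `height` attributes written in the markup. This is a simplification: a real layout engine, such as the one behind PyQtWebKit in the methodology, computes position and size after applying CSS and JavaScript. The names `ImageCollector`, `collect_images`, and `SAMPLE_PAGE` are made up for this illustration.

```python
from html.parser import HTMLParser


def _to_int(value):
    """Return an int for purely numeric attribute values, else None."""
    return int(value) if value is not None and value.isdigit() else None


class ImageCollector(HTMLParser):
    """Collect <img> elements and the size attributes declared in the markup."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        self.images.append({
            "src": attr.get("src"),
            "width": _to_int(attr.get("width")),
            "height": _to_int(attr.get("height")),
        })


def collect_images(html):
    """Parse an HTML string and return one dict per <img> element."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.images


# Hypothetical sample page, used only for illustration.
SAMPLE_PAGE = """
<html><body>
  <img src="logo.png" width="120" height="40">
  <p>Article text</p>
  <img src="photo.jpg" width="640" height="480">
  <img src="spacer.gif">
</body></html>
"""

images = collect_images(SAMPLE_PAGE)
for image in images:
    print(image)
```

An image with no declared size, such as `spacer.gif` here, gets `None` for both dimensions. That gap is one reason a rendered DOM is more useful than raw markup for layout features.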
  5. Background - SVM
     ● SVM is a supervised machine learning algorithm.
     ● SVM is used for statistical pattern recognition.
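To make "supervised" concrete, here is a toy linear SVM trained by stochastic subgradient descent on the regularized hinge loss. The slides do not say which SVM implementation, kernel, or parameters were used, so this is not the author's setup. The data points, the hyperparameters (`lam`, `lr`, `epochs`), and all function names are assumptions chosen for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    labels must be +1 or -1. lam is the L2 regularization strength.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if y * (dot(w, x) + b) < 1:
                # Sample is inside the margin: step along the hinge subgradient.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Sample is safely classified: only the regularizer acts.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical, linearly separable training data with labels +1 / -1.
samples = [(2, 2), (3, 3), (2, 3), (3, 2), (-2, -2), (-3, -3), (-2, -3), (-3, -2)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(samples, labels)
predictions = [predict(w, b, x) for x in samples]
print(predictions)
```

The labels are what make this supervised: the model only learns a boundary because each training sample comes with the answer it should produce.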
  6. Objective (for now)
     To recognize patterns of auxiliary Web images quickly using DOM analysis and basic image processing.
  7. Methodology
     [Pipeline diagram. Labels: HTML, CSS, JS, IMG, Files, DOM, PyQtWebKit, Python, jQuery, PIL, OpenCV, Tesseract, Page Level Features, Domain Level Features, Image Level Features, Labels, MySQL]
  8. Methodology (continued)
     ● Image Level Features
         ● No. of Colors
         ● No. of Human Faces
         ● No. of Alphabetic Characters
     ● Page Level Features
         ● Position
         ● Dimension
         ● No. of Images with Similar Dimension
     ● Domain Level Features
         ● External / Internal Links
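Three of the listed features can be sketched with the standard library alone. The slides do not define the features precisely, so the definitions below are assumptions: colors are counted as distinct pixel tuples, "similar dimension" means width and height both within a tolerance, and a link is "external" when its host differs from the page's host. Face and character counts would need OpenCV and Tesseract and are left out. All function names are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def count_colors(pixels):
    """Image level: number of distinct colors in a sequence of pixel tuples."""
    return len(set(pixels))


def count_similar_dimensions(sizes, index, tolerance=0):
    """Page level: how many other images on the page share this image's size.

    sizes is a list of (width, height) pairs; index selects the image of interest.
    """
    width, height = sizes[index]
    return sum(
        1
        for i, (w, h) in enumerate(sizes)
        if i != index and abs(w - width) <= tolerance and abs(h - height) <= tolerance
    )


def is_external(page_url, target_url):
    """Domain level: True when target_url resolves to a different host than the page."""
    absolute = urljoin(page_url, target_url)
    return urlparse(absolute).hostname != urlparse(page_url).hostname


# Hypothetical inputs for illustration.
pixels = [(255, 255, 255), (0, 0, 0), (255, 255, 255)]
sizes = [(16, 16), (16, 16), (640, 480), (17, 16)]
page = "http://example.com/news/a.html"

print(count_colors(pixels))
print(count_similar_dimensions(sizes, 0))
print(is_external(page, "/img/logo.png"))
print(is_external(page, "http://ads.example.net/b.gif"))
```

With PIL available, the pixel sequence could come from `Image.getdata()`, though the slides do not show how the author computed the color count.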
  9. Methodology (continued)
     [Diagram] MySQL: 80% (500/626), randomly selected, goes to SVM (Train), which produces a Model; the remaining 20% goes to SVM (Predict), which produces Results.
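The random split can be sketched as follows. Only the counts come from the slide: 500 of 626 samples for training (about 79.9%), leaving 126 for prediction. The function name, the fixed seed, and the use of integers as stand-ins for database rows are assumptions.

```python
import random


def split_train_test(records, train_size, seed=0):
    """Shuffle a copy of records and cut it into training and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[:train_size], shuffled[train_size:]


# Integers stand in for the 626 labeled image records.
records = list(range(626))
train, test = split_train_test(records, train_size=500)

print(len(train), len(test))
```

Shuffling a copy keeps the original record order intact, which makes it easy to repeat the split with a different seed.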
  10. Results
     ● 10-fold Cross-Validation (10 Experiments): Average Accuracy = 84.92%
     ● After Applying Grid-Search Technique: Average Accuracy = 93.17%
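The two evaluation steps named on this slide, k-fold cross-validation and grid search, can be sketched generically. The slides do not state which parameters were searched or over what ranges. The parameter names `C` and `gamma` below are assumptions (they are common for an RBF-kernel SVM), and `fake_evaluate` is a stand-in scoring function that returns invented numbers, not the reported accuracies.

```python
from itertools import product


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Assumes the samples were already shuffled.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_val_accuracy(evaluate, n_samples, params, k=10):
    """Average the score returned by evaluate over all k folds."""
    scores = [evaluate(train, test, params) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


def grid_search(evaluate, n_samples, grid, k=10):
    """Try every parameter combination and keep the best cross-validated one."""
    best_params, best_score = None, float("-inf")
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(evaluate, n_samples, params, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fake_evaluate(train, test, params):
    """Stand-in for 'train an SVM on train, score it on test' (invented scores)."""
    return 1.0 / (1.0 + abs(params["C"] - 10) + abs(params["gamma"] - 0.1))


folds = list(k_fold_indices(626, k=10))
grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1]}  # hypothetical search grid
best_params, best_score = grid_search(fake_evaluate, 626, grid)
print(best_params, best_score)
```

In a real run, `evaluate` would train a model on the training indices and return its accuracy on the test indices. Scoring each parameter combination by cross-validation rather than on a single split reduces the chance of picking parameters that only suit one lucky partition.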
  11. Discussion
     ● Some pages cannot be parsed.
         ● Frames and redirections
     ● Positions can be miscalculated.
         ● JavaScript used in displaying images
         ● CSS sprites
     ● Tesseract is not well-tuned.
         ● Small images have to be magnified, but how much?
     ● Downloading images for processing is a bottleneck.
     ● Features are not weighted.
     ● The definition of “auxiliary image” is subjective.
  12. Future Work
     ● Context Analysis
     ● Weighted Features
     ● Adaptive Page Analysis (Website Categorization)
     ● Techniques Evaluation / Optimization
  13. Conclusion
     Layout analysis and basic image processing techniques alone perform well, but the system could be better.
  14. Acknowledgement
     ● NSTDA, NECTEC, and YSTP program
     ● Dr. Choochart Haruechaiyasak
     ● Dr. Toshiaki Kondo
     ● Mr. Krikamol Muendet
     ● And Many Others...
