Identifying Auxiliary Web Images
Using Combination of Analyses
                                         Tewson Seeoun
             Sirindhorn International Institute of Technology


                                         With Guidance From

                        Asst. Prof. Dr. Toshiaki Kondo
             Sirindhorn International Institute of Technology
                       Dr. Choochart Haruechaiyasak
  Human Language Technology Laboratory, NECTEC, NSTDA
Agenda
    ●   Introduction
    ●   Background
         ●   Document Object Model (DOM) in HTML
         ●   Support Vector Machine (SVM)
    ●   Objective
    ●   Methodology
    ●   Results
    ●   Discussion / Future Work
    ●   Conclusion
    ●   Acknowledgement
Introduction


       ●   Websites contain images.
       ●   Some images are not necessary for certain uses:
           ●   Search Engine Indexing
           ●   Printing
       ●   Ignoring them is sometimes
           economical and green.

Background - DOM


●   Web browsers / layout engines parse
    HTML / CSS / JavaScript into DOM.
●   DOM represents things (elements) in a Web page.
●   An element has properties (position, size, etc.).
●   JavaScript sees DOM.
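
As a rough illustration of the bullets above, the sketch below builds a tiny element tree from an HTML string and looks up the image elements in it. It is only a stand-in: it uses Python's standard `html.parser` rather than a browser layout engine, so it sees attribute values but not computed positions or sizes. The names `Node`, `TreeBuilder`, `find_all`, and `SAMPLE_HTML` are made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for the demo).
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class Node:
    """One element of the tree: a tag name, its attributes, and its children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree while the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if there is one.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break


def find_all(node, tag):
    """Yield every descendant element with the given tag, in document order."""
    for child in node.children:
        if child.tag == tag:
            yield child
        yield from find_all(child, tag)


SAMPLE_HTML = """
<html><body>
  <div id="header"><img src="logo.png" width="120" height="40"></div>
  <div id="content"><p>Text</p><img src="photo.jpg" width="640" height="480"></div>
</body></html>
"""

builder = TreeBuilder()
builder.feed(SAMPLE_HTML)
builder.close()
images = list(find_all(builder.root, "img"))
for image in images:
    print(image.attrs["src"], image.attrs["width"], image.attrs["height"])
```

In a real layout engine the position and size come from the rendered page, not from the `width` and `height` attributes read here.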



Background - SVM




 ●   SVM is a supervised machine learning algorithm.
 ●   SVM is used for statistical pattern recognition.
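
To make "supervised" concrete, here is a minimal sketch of a linear soft-margin SVM trained by sub-gradient descent on the regularized hinge loss. It is not the implementation used in this work: the slides do not name the SVM library or kernel, and the toy data, the two features, and all hyperparameter values below are assumptions chosen for illustration.

```python
def train_linear_svm(samples, labels, learning_rate=0.1,
                     regularization=0.01, epochs=20):
    """Fit weights and bias by sub-gradient descent on the hinge loss.

    Labels must be +1 or -1. This is a linear SVM only (no kernel).
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score < 1:
                # Inside the margin or misclassified: move toward the sample.
                weights = [w + learning_rate * (y * xi - regularization * w)
                           for w, xi in zip(weights, x)]
                bias += learning_rate * y
            else:
                # Correct with enough margin: only apply the L2 shrinkage.
                weights = [w - learning_rate * regularization * w
                           for w in weights]
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


# Hypothetical, already-scaled feature vectors; +1 and -1 are the two classes.
SAMPLES = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0),
           (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
LABELS = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_svm(SAMPLES, LABELS)
predictions = [predict(weights, bias, x) for x in SAMPLES]
print(predictions)
```

The labeled examples drive the training, which is what makes the method supervised; the learned `weights` and `bias` are then applied to unseen feature vectors.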




Objective (for now)




To recognize patterns of auxiliary Web images quickly
  using DOM analysis and basic image processing




Methodology
    ●   HTML / CSS / JS → PyQtWebKit → DOM
    ●   DOM → jQuery → Page Level Features, Domain Level Features
    ●   IMG Files → Python (PIL, OpenCV, Tesseract) → Image Level Features
    ●   Labels and all features → MySQL
Methodology (continued)
        ●   Image Level Features
             ●   No. of Colors
             ●   No. of Human Faces
             ●   No. of Alphabetic Characters
        ●   Page Level Features
             ●   Position
             ●   Dimension
             ●   No. of Images with Similar Dimension
        ●   Domain Level Features
             ●   External / Internal Links
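
One way a feature from each level could be computed is sketched below. None of this is taken from the original system: the function names, the pixel-list input, and the 2-pixel tolerance for "similar dimension" are assumptions, and the sketch uses plain Python rather than PIL, OpenCV, or jQuery.

```python
from urllib.parse import urlparse


def count_colors(pixels):
    """Image level: number of distinct colors in an iterable of RGB tuples."""
    return len(set(pixels))


def count_similar_dimensions(sizes, index, tolerance=2):
    """Page level: how many other images are within `tolerance` pixels
    of the width and height of the image at `index`."""
    width, height = sizes[index]
    return sum(
        1
        for i, (other_w, other_h) in enumerate(sizes)
        if i != index
        and abs(other_w - width) <= tolerance
        and abs(other_h - height) <= tolerance
    )


def is_external_link(page_url, link_url):
    """Domain level: True if the link points to a different host.

    Relative links have no host and are treated as internal.
    """
    link_host = urlparse(link_url).hostname
    return link_host is not None and link_host != urlparse(page_url).hostname


# Hypothetical inputs: a 2x2 image and the (width, height) of four page images.
pixels = [(255, 255, 255), (255, 255, 255), (0, 0, 0), (255, 0, 0)]
sizes = [(16, 16), (16, 16), (17, 16), (640, 480)]

print(count_colors(pixels))
print(count_similar_dimensions(sizes, 0))
print(is_external_link("http://example.com/a", "http://ads.example.net/x"))
```

Many small images of nearly identical size (icons, bullets, spacers) would score high on the similar-dimension count, which is presumably why it is a useful signal.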
Methodology (continued)

    ●   MySQL → 80% (500/626) Randomly-Selected → SVM (Train) → Model
    ●   Remaining 20% → SVM (Predict) with Model → Results
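
The random split on this slide can be sketched as follows. The counts (500 of 626 for training) come from the slide; the function name, the fixed seed, and the use of integer row ids in place of database rows are assumptions.

```python
import random


def split_dataset(rows, train_size, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# Stand-in for the 626 labeled images stored in the database.
rows = range(626)
train_rows, test_rows = split_dataset(rows, train_size=500)
print(len(train_rows), len(test_rows))  # 500 126
```

Fixing the seed makes a single experiment repeatable; repeating the split with different seeds gives several independent experiments.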
Results



   10-fold Cross-Validation (10 Experiments)
          Average Accuracy = 84.92%
     After Applying Grid-Search Technique
          Average Accuracy = 93.17%
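
The two techniques named above can be sketched in plain Python. This is a generic outline, not the evaluation code behind the reported numbers: the slides do not say which parameters were searched, so the `C` and `gamma` grid is an assumption, and `toy_score` is a placeholder for "train an SVM with these parameters and return its cross-validated accuracy".

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Shuffle the samples beforehand if their order is not already random.
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def grid_search(param_grid, score_fn):
    """Try every parameter combination and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Placeholder score that peaks at C=10, gamma=0.1 (not real accuracy)."""
    return -abs(params["C"] - 10) - abs(params["gamma"] - 0.1)


folds = list(k_fold_indices(626, 10))
print([len(test) for _, test in folds])

best_params, best_score = grid_search(
    {"C": [1, 10, 100], "gamma": [0.01, 0.1]}, toy_score
)
print(best_params)
```

In practice `score_fn` would average the accuracy over the folds from `k_fold_indices`, so that each parameter combination is judged on data it was not trained on.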



Discussion
   ●   Some pages cannot be parsed.
        ●   Frames and redirections
   ●   Positions can be miscalculated.
        ●   JavaScript used in displaying images
        ●   CSS sprites
   ●   Tesseract is not well-tuned.
   ●   Small images have to be magnified, but how much?
   ●   Downloading images for processing is a bottleneck.
   ●   Features are not weighted.
   ●   The definition of “auxiliary image” is subjective.
Future Work



 ●   Context Analysis
 ●   Weighted Features
 ●   Adaptive Page Analysis (Website Categorization)
 ●   Techniques Evaluation / Optimization



Conclusion




Layout analysis and basic image processing techniques
  alone perform well (93.17% average accuracy after
  grid search), but the system could be improved.




Acknowledgement


    ●   NSTDA, NECTEC, and YSTP program
    ●   Dr. Choochart Haruechaiyasak
    ●   Dr. Toshiaki Kondo
    ●   Mr. Krikamol Muendet
    ●   And Many Others...

