Web image size prediction for efficient focused image crawling
Katerina Andreadou, Symeon Papadopoulos and Yiannis Kompatsiaris
Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI)
CBMI 2015, June 11, 2015, Prague, Czech Republic
Challenges in Crawling Web Images
• Web pages contain many images
• A large number of HTTP requests must be issued
to download all of them
• Yet the majority of small images
– are either irrelevant, or
– correspond to decorative elements
The Problem
• Improve the performance of our focused image crawler,
which crawls images related to a given set of keywords
• Typical focused crawling metrics
– Harvest rate: the number of relevant web pages discovered
– Target precision: the number of relevant crawl links
• Proposed evaluation criteria for images
– Does the alternate text contain any of the keywords?
– Does the web page title contain any of the keywords?
It is very time consuming to download and evaluate the
whole HTML content and all available images
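
The two image evaluation criteria above can be sketched as a simple keyword check. This is a minimal illustration only: the function name `keyword_criteria` and the case-insensitive substring matching are assumptions, since the slides do not say how keywords are matched or how the two answers are combined.

```python
def keyword_criteria(alt_text, page_title, keywords):
    """Answer the two criteria separately for one image.

    Assumption: case-insensitive substring matching. The slides do not
    specify the matching rule, so the two answers are kept apart rather
    than merged into a single relevance decision.
    """
    kws = [kw.lower() for kw in keywords]
    alt = (alt_text or "").lower()
    title = (page_title or "").lower()
    return {
        "alt_match": any(kw in alt for kw in kws),
        "title_match": any(kw in title for kw in kws),
    }
```

Note that answering either question requires fetching and parsing the page first, which is the cost the next slides try to reduce.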
Objective: Predict Web Image Size
• Predict the size of images based solely on
– the image URL and
– the HTML metadata and surrounding HTML elements
(number of DOM siblings, depth of the DOM tree, parent
text, etc.)
• Classify the images into two groups
– SMALL: width and height smaller than 200 pixels
– BIG: width and height bigger than 400 pixels
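
The two class definitions can be written down as a small labeling function. This is a sketch using the thresholds from the slide; the name `size_label` is hypothetical, and returning `None` for images that fall into neither class is an assumption about how in-between images are handled.

```python
def size_label(width, height):
    """Label an image by its pixel dimensions.

    SMALL: both dimensions below 200 px. BIG: both dimensions above 400 px.
    Assumption: anything else is left unlabeled (None) in this sketch.
    """
    if width < 200 and height < 200:
        return "SMALL"
    if width > 400 and height > 400:
        return "BIG"
    return None
```

For example, a 100x800 banner matches neither definition and is returned as `None` here.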
Benefits of Predicting Image Size
• Substantial gains in time for the image crawler
• We used the Apache Benchmark tool (ab) to time random
image requests
– average download time for an image: 300 msec
– average classification time for an image: 10 msec
• For all images in Common Crawl (720 million)
– 10 download threads on a single core: 35 weeks
• For just the big images using our method
– 10 download threads on a single core: less than 3 weeks
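
The estimates above can be checked with a back-of-the-envelope calculation. The sketch below assumes that the 10 threads work perfectly in parallel and that about 8% of the images are big (the share reported in the dataset statistics); both are modelling assumptions, and the variable names are illustrative.

```python
N_IMAGES = 720_000_000          # images in Common Crawl (from the slide)
DOWNLOAD_S = 0.300              # average download time per image, seconds
CLASSIFY_S = 0.010              # average classification time per image, seconds
THREADS = 10                    # assumed to run perfectly in parallel
BIG_FRACTION = 0.08             # assumption: share of big images in the dataset
SECONDS_PER_WEEK = 7 * 24 * 3600


def weeks(n_items, seconds_each, threads=THREADS):
    """Wall-clock weeks needed to process n_items with the given threads."""
    return n_items * seconds_each / threads / SECONDS_PER_WEEK


all_images_weeks = weeks(N_IMAGES, DOWNLOAD_S)                # about 35.7
big_only_weeks = weeks(N_IMAGES * BIG_FRACTION, DOWNLOAD_S)   # about 2.9
classify_weeks = weeks(N_IMAGES, CLASSIFY_S)                  # about 1.2
```

Under these assumptions the download-only figures match the slide (about 35 weeks versus just under 3 weeks). Classifying every image would add roughly 1.2 weeks in this model; the slide does not state whether that overhead is included in its estimate.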
Related Work (Focused Crawling / Image Crawling)
• Link context algorithms rely on the lexical content of
the URL within its parent page
– The shark-search algorithm (Hersovici et al., 1998)
• Graph structure algorithms take advantage of the
structure of the Web around a page
– Focused crawling: A new approach to topic-specific web
resource discovery (Chakrabarti et al., 1999)
• Semantic analysis algorithms utilize ontologies for
semantic classification
– Ontology-focused crawling (Maedche et al., 2002)
Data Collection
• We used data from the July 2014 Common Crawl set
– petabytes of data collected over the last 7 years
– contains raw web page data, extracted metadata and text
– lives on Amazon S3 as part of the Amazon Public Datasets
• We created a MapReduce job to parse all images and
videos using Amazon Elastic MapReduce (EMR)
Statistics on Common Crawl Dataset
266 TB in size, containing 3.6 billion web pages:
• 78.5M unique domains
• 8% of images big
• 40% of images small
• 20% of images have no dimension information
We choose 400 pixels as the threshold to characterize big images.
Common Crawl and Big Data Analytics
• Used in combination with a Wikipedia dump to
investigate the frequency distribution of numbers
– Number frequency on the Web (van Hage et al., 2014)
• Question whether the heavy-tailed distributions
observed in many Web crawls are inherent in the
network or a side-effect of the crawling process
– Graph structure in the Web (Meusel et al., 2014)
• Analyze the challenges of marking up content with
microdata
– Integrating product data from websites offering microdata
markup (Petrovski et al., 2014)
Method Overview
We propose a supervised machine learning approach
for web image size prediction using different features:
1. The n-grams extracted from the image URL;
2. The tokens extracted from the image URL;
3. The HTML metadata and surrounding HTML
elements;
4. The combination of textual and non-textual
features (hybrid).
Method I: NG
• An n-gram is a contiguous sequence of n characters
from the given image URL
• Our main hypothesis:
“URLs that correspond to BIG and SMALL
images differ substantially in wording”
• BIG : large, x-large, gallery
• SMALL : logo, avatar, small, thumb, up, down
• First attempt: use the most frequent n-grams
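
Character n-gram extraction from a URL can be sketched in a few lines. The function name `char_ngrams` and the example URL fragment are illustrative, not taken from the slides.

```python
def char_ngrams(url, n):
    """Return all contiguous character n-grams of the URL, in order."""
    return [url[i:i + n] for i in range(len(url) - n + 1)]


example = char_ngrams("thumb", 3)  # ['thu', 'hum', 'umb']
```

A URL shorter than n characters simply yields an empty list.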
Method II: NG-TRF (term relative frequency)
1. Collect the most frequent n-grams (n={3,4,5})
for both classes (BIG and SMALL)
2. Rank the two separate lists by frequency
3. Discard n-grams below a threshold for every list
(e.g., fewer than 50 occurrences in 500K images)
4. For every n-gram, compute a correlation score
5. Re-rank the two lists by this score
6. Pick an equal number of n-grams from both lists to
create a feature vector (e.g., 500 SMALL n-grams
and 500 BIG n-grams for a 1000-dimensional vector)
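
The six steps above can be sketched as follows. The slides do not give the formula for the correlation score, so the ratio used here (frequency in the own class divided by frequency in the other class, plus one) is a stand-in assumption, as is the use of raw counts as feature values. All function names are hypothetical.

```python
from collections import Counter


def char_ngrams(url, sizes=(3, 4, 5)):
    """All contiguous character n-grams of the URL for n in sizes."""
    return [url[i:i + n] for n in sizes for i in range(len(url) - n + 1)]


def select_ngrams(small_urls, big_urls, min_count=50, per_class=500):
    """Build the vocabulary: per_class SMALL n-grams, then per_class BIG ones."""
    # Steps 1-2: count n-gram frequencies separately for each class.
    small = Counter(g for u in small_urls for g in char_ngrams(u))
    big = Counter(g for u in big_urls for g in char_ngrams(u))

    def top(own, other):
        # Step 3: discard n-grams below the frequency threshold.
        kept = {g: c for g, c in own.items() if c >= min_count}
        # Step 4 (ASSUMED score): own-class count relative to the other
        # class; the +1 avoids division by zero.
        score = {g: c / (other[g] + 1) for g, c in kept.items()}
        # Step 5: re-rank by score; ties are broken alphabetically.
        ranked = sorted(score, key=lambda g: (-score[g], g))
        return ranked[:per_class]

    # Step 6: equal number of n-grams from both lists.
    return top(small, big) + top(big, small)


def to_vector(url, vocabulary):
    """Feature vector of n-gram counts (counts are an assumption)."""
    counts = Counter(char_ngrams(url))
    return [counts[g] for g in vocabulary]
```

With a toy input such as `select_ngrams(["a.com/thumb/1.gif", "b.com/thumb/2.gif"], ["a.com/large/1.jpg", "b.com/large/2.jpg"], min_count=2, per_class=3)`, shared n-grams like `.com` score low and class-specific ones like `.gif` or `.jpg` rise to the top.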
Method III: TOKENS-TRF
• Same as before but with tokens instead of n-grams
• To produce the tokens, we split the image URL on all
non-alphanumeric characters (\W+)
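
Tokenizing a URL by splitting on runs of non-alphanumeric characters might look like the sketch below, here using the regular expression `\W+`. The function name `url_tokens` and the example URL are illustrative. Note that in Python (as in Java) `\W` does not match the underscore, so `thumb_small` stays a single token.

```python
import re


def url_tokens(url):
    """Split a URL on runs of non-word characters and drop empty strings."""
    return [t for t in re.split(r"\W+", url) if t]


tokens = url_tokens("http://example.com/img/thumb_small-01.jpg")
# ['http', 'example', 'com', 'img', 'thumb_small', '01', 'jpg']
```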
Method IV: NG-TSRF-IDF
• Stands for Term Squared Relative Frequency,
Inverse Document Frequency.
• If an n-gram is very frequent in both classes, we
should discard it.
• If an n-gram is not overall very frequent but it is
very class-specific, we should include it.
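
The intuition behind the two bullets above can be illustrated with a toy score. The exact formula is not given on the slide, so the form below (own-class frequency squared divided by other-class frequency, multiplied by a standard logarithmic IDF) is an assumption made for illustration, and all names and counts are hypothetical.

```python
import math


def tsrf_idf(ngram, own_counts, other_counts, doc_freq, n_docs):
    """Hypothetical class-specificity score for one n-gram.

    ASSUMED form: (own frequency squared / other-class frequency) * idf,
    where idf = log(n_docs / document frequency). The +1 in the
    denominator avoids division by zero.
    """
    tsrf = own_counts[ngram] ** 2 / (other_counts[ngram] + 1)
    idf = math.log(n_docs / doc_freq[ngram])
    return tsrf * idf


# Toy counts over 1000 URLs: "http" occurs everywhere, "thumb" is rare
# overall but almost exclusive to the SMALL class.
small = {"http": 500, "thumb": 60}
big = {"http": 500, "thumb": 2}
df = {"http": 1000, "thumb": 62}

score_http = tsrf_idf("http", small, big, df, 1000)
score_thumb = tsrf_idf("thumb", small, big, df, 1000)
```

In this toy setting the n-gram that appears in every URL gets an IDF of zero and is effectively discarded, while the class-specific one receives a high score.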
Method V: HTML metadata features
HTML metadata features may reveal cues about the image size.
Examples:
• Photos are more likely than graphics to have an alt text.
• Most photos are in JPG or PNG format.
• Most icons and graphics are in BMP or GIF format.
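
A few such cues can be collected with Python's standard `html.parser` module, as sketched below. The choice of features (alt text present, file extension, DOM depth), the class name `ImgFeatureParser`, and the short list of void tags are assumptions for illustration; the slides do not list the full feature set.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tags without a closing tag; they must not change the tracked depth.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class ImgFeatureParser(HTMLParser):
    """Collect simple metadata features for every <img> in a page."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # number of currently open ancestor elements
        self.features = []   # one dict per image, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            path = urlparse(a.get("src") or "").path
            ext = os.path.splitext(path)[1].lower().lstrip(".")
            self.features.append({
                "has_alt": bool(a.get("alt")),
                "extension": ext,
                "dom_depth": self.depth,
            })
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1
```

Feeding a page with `parser.feed(html_text)` fills `parser.features`; the resulting dictionaries could then be turned into numeric or categorical inputs for a classifier.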
Evaluation
• Training: 1M images (500K small/500K big)
• Testing: 200K images (100K small/100K big)
• Random Forest classifier (Weka)
• We also experimented with LibSVM and RandomTree, but
Random Forest achieved the best trade-off between accuracy
and training time
• Tested with 10, 30, 100 trees
• Performance measure:
Results
• Doubling the number of n-gram features improves the
performance
• Adding more trees to the Random Forest classifier
improves the performance
• NG-TSRF-IDF and TOKENS-TRF have the best
performance, followed closely by NG-TRF
Results: Hybrid method
• The hybrid method takes into account both textual
and non-textual features.
• Hypothesis: the two methods will complement each
other when aggregating their outputs:
• The adv parameter gives an advantage to
one of the two classifiers.
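
One possible way to aggregate two classifier outputs with an advantage parameter is a weighted sum of class probabilities. The aggregation rule itself is not shown here, so the weighting below is an assumption, and the function name `hybrid_predict` is hypothetical.

```python
def hybrid_predict(p_big_text, p_big_html, adv=1.0):
    """Combine the BIG-class probabilities of two classifiers.

    ASSUMPTION: adv multiplies the textual classifier's vote, so adv > 1
    favours the textual (URL-based) classifier and adv < 1 favours the
    HTML-metadata classifier. Ties go to SMALL in this sketch.
    """
    score_big = adv * p_big_text + p_big_html
    score_small = adv * (1 - p_big_text) + (1 - p_big_html)
    return "BIG" if score_big > score_small else "SMALL"
```

For instance, with probabilities 0.6 (textual) and 0.2 (metadata), the unweighted call returns SMALL, while `adv=4` lets the textual classifier prevail and returns BIG.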
Conclusion - Contributions
• A supervised machine learning approach for
automatically classifying Web images according to
their size.
• Assessment of textual and non-textual features.
• A statistical analysis and evaluation on a sample of
the Common Crawl set.
Future Work
• Apply the n-grams and tokens approaches to the
alternate and parent text
– create two additional classifiers and combine them with
the existing ones
• Detect more fine-grained characteristics
– landscape - portrait
– photographs - graphics
Thank you!
• Resources:
Slides: http://www.slideshare.net/KaterinaAndreadou1/kandreadou-cbmi-59
Code: https://github.com/MKLab-ITI/reveal-media-webservice/tree/year2/src/main/java/gr/iti/mklab/reveal/clustering
Common Crawl: http://commoncrawl.org/
• Get in touch:
@kandreads / kandreadou@iti.gr
@sympapadopoulos / papadop@iti.gr
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 

Web image size prediction for efficient focused image crawling

  • 1. Web image size prediction for efficient focused image crawling
      Katerina Andreadou, Symeon Papadopoulos and Yiannis Kompatsiaris
      Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI)
      CBMI 2015, June 11, 2015, Prague, Czech Republic
  • 2. Challenges in Crawling Web Images
      • Web pages contain loads of images
      • A large number of HTTP requests need to be issued to download all of them
      • Yet the majority of small images
          – are either irrelevant, or
          – correspond to decorative elements
  • 3. The Problem
      • Improve the performance of our focused image crawler, which crawls images related to a given set of keywords
      • Typical focused crawling metrics
          – Harvest rate: the number of relevant web pages discovered
          – Target precision: the number of relevant crawl links
      • Proposed evaluation criteria for images
          – Does the alternate text contain any of the keywords?
          – Does the web page title contain any of the keywords?
      • Downloading and evaluating the whole HTML content and all available images is very time-consuming
  • 4. Objective: Predict Web Image Size
      • Predict the size of images based solely on
          – the image URL and
          – the HTML metadata and surrounding HTML elements (number of DOM siblings, depth of the DOM tree, parent text, etc.)
      • Classify the images into two groups
          – SMALL: width and height smaller than 200 pixels
          – BIG: width and height bigger than 400 pixels
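The two classes on this slide can be written as a small labeling function. This is only a sketch: the name `size_class` and the `None` result for images that fit neither class are assumptions made for illustration, since the slide does not say how images between the two thresholds are treated.

```python
from typing import Optional

SMALL_MAX = 200  # pixels, from the slide: SMALL means width and height < 200
BIG_MIN = 400    # pixels, from the slide: BIG means width and height > 400


def size_class(width: int, height: int) -> Optional[str]:
    """Return "SMALL", "BIG", or None when the image fits neither class."""
    if width < SMALL_MAX and height < SMALL_MAX:
        return "SMALL"
    if width > BIG_MIN and height > BIG_MIN:
        return "BIG"
    return None  # assumption: in-between or mixed images are left unlabeled
```

Under this reading, an image that is small in one dimension and big in the other (for example a 100 x 800 banner) belongs to neither class.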
  • 5. Benefits of Predicting Image Size
      • Substantial gains in time for the image crawler
      • We used the Apache Benchmark to time random image requests
          – average download time for an image: 300 msec
          – average classification time for an image: 10 msec
      • For all images in Common Crawl (720 million)
          – 10 download threads on a single core: 35 weeks
      • For just the big images using our method
          – 10 download threads on a single core: less than 3 weeks
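The estimates on this slide can be roughly reproduced with a few lines of arithmetic. The sketch below assumes that the "big images" estimate uses the 8% share of big images reported in the Common Crawl statistics, and that it counts download time only (the 10 msec classification cost per image is ignored here); neither assumption is stated on the slide.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def download_weeks(num_images: float, secs_per_image: float, threads: int) -> float:
    """Wall-clock weeks needed to download num_images with parallel threads."""
    return num_images * secs_per_image / threads / SECONDS_PER_WEEK


all_images = 720e6   # images in Common Crawl, from the slide
big_share = 0.08     # ASSUMED to be the share used for the "big images" estimate

all_weeks = download_weeks(all_images, 0.300, 10)
big_weeks = download_weeks(big_share * all_images, 0.300, 10)
print(f"all images: {all_weeks:.1f} weeks, big images only: {big_weeks:.1f} weeks")
# all images: 35.7 weeks, big images only: 2.9 weeks
```

Both figures agree with the slide's "35 weeks" and "less than 3 weeks".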
  • 6. Related Work (Focused Crawling / Image Crawling)
      • Link context algorithms rely on the lexical content of the URL within its parent page
          – The shark-search algorithm (Hersovici et al., 1998)
      • Graph structure algorithms take advantage of the structure of the Web around a page
          – Focused crawling: A new approach to topic-specific web resource discovery (Chakrabarti et al., 1999)
      • Semantic analysis algorithms utilize ontologies for semantic classification
          – Ontology-focused crawling (Maedche et al., 2002)
  • 7. Data Collection
      • We used data from the July 2014 Common Crawl set
          – petabytes of data collected over the last 7 years
          – contains raw web page data, extracted metadata and text
          – lives on Amazon S3 as part of the Amazon Public Datasets
      • We created a MapReduce job to parse all images and videos using EMR
  • 8. Statistics on Common Crawl Dataset
      266 TB in size, containing 3.6 billion web pages:
      • 78.5M unique domains
      • 8% of images are big
      • 40% of images are small
      • 20% of images have no dimension information
      We chose 400 pixels as the threshold for characterizing big images.
  • 9. Common Crawl and Big Data Analytics
      • Used in combination with a Wikipedia dump to investigate the frequency distribution of numbers
          – Number frequency on the Web (van Hage et al., 2014)
      • Question whether the heavy-tailed distributions observed in many Web crawls are inherent in the network or a side-effect of the crawling process
          – Graph structure in the Web (Meusel et al., 2014)
      • Analyze the challenges of marking up content with microdata
          – Integrating product data from websites offering microdata markup (Petrovski et al., 2014)
  • 10. Method Overview
      We propose a supervised machine learning approach for web image size prediction using different features:
      1. The n-grams extracted from the image URL;
      2. The tokens extracted from the image URL;
      3. The HTML metadata and surrounding HTML elements;
      4. The combination of textual and non-textual features (hybrid).
  • 11. Method I: NG
      • An n-gram is a continuous sequence of n characters from the given image URL
      • Our main hypothesis: “URLs that correspond to BIG and SMALL images differ substantially in wording”
      • BIG: large, x-large, gallery
      • SMALL: logo, avatar, small, thumb, up, down
      • First attempt: use the most frequent n-grams
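As an illustration of the definition on this slide, character n-grams can be extracted with a sliding window over the URL. The function name `char_ngrams` and the example URL are hypothetical.

```python
def char_ngrams(url: str, n: int) -> list[str]:
    """Return every contiguous run of n characters in url, left to right."""
    return [url[i:i + n] for i in range(len(url) - n + 1)]


url = "http://example.com/img/thumb_01.jpg"  # hypothetical image URL
grams = char_ngrams(url, 5)
print("thumb" in grams)  # True
```

A URL shorter than n characters simply yields an empty list.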
  • 12. Method II: NG-TRF (term relative frequency)
      1. Collect the most frequent n-grams (n={3,4,5}) for both classes (BIG and SMALL)
      2. Rank the two separate lists by frequency
      3. Discard n-grams below a threshold in each list (e.g., fewer than 50 occurrences in 500K images)
      4. For every n-gram, compute a correlation score
      5. Re-rank the two lists by this score
      6. Pick an equal number of n-grams from both lists to create a feature vector (e.g., 500 SMALL n-grams and 500 BIG n-grams for a 1000-dimensional vector)
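The six steps on this slide can be sketched as follows. The transcript does not preserve the formula for the correlation score, so the score used here (relative frequency in the n-gram's own class minus relative frequency in the other class) is a placeholder assumption, not the authors' definition. Counting each n-gram once per URL and using a binary feature vector are likewise assumptions, and all function names are hypothetical.

```python
from collections import Counter


def url_ngrams(url: str, sizes=(3, 4, 5)) -> set[str]:
    """Distinct character n-grams of a URL for every n in sizes."""
    return {url[i:i + n] for n in sizes for i in range(len(url) - n + 1)}


def count_ngrams(urls: list[str]) -> Counter:
    """Steps 1 and 2: number of URLs in which each n-gram occurs."""
    counts = Counter()
    for url in urls:
        counts.update(url_ngrams(url))
    return counts


def select_ngrams(own: Counter, other: Counter, n_own: int, n_other: int,
                  min_count: int, k: int) -> list[str]:
    """Steps 3 to 5: drop rare n-grams, score the rest, keep the top k."""
    scores = {
        g: c / n_own - other[g] / n_other  # ASSUMED score, not from the slides
        for g, c in own.items() if c >= min_count
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


def build_vocabulary(small_urls: list[str], big_urls: list[str],
                     min_count: int = 50, k: int = 500) -> list[str]:
    """Step 6: k SMALL n-grams followed by k BIG n-grams."""
    small, big = count_ngrams(small_urls), count_ngrams(big_urls)
    ns, nb = len(small_urls), len(big_urls)
    return (select_ngrams(small, big, ns, nb, min_count, k)
            + select_ngrams(big, small, nb, ns, min_count, k))


def to_vector(url: str, vocabulary: list[str]) -> list[int]:
    """Binary features: 1 if the URL contains the n-gram, else 0."""
    present = url_ngrams(url)
    return [1 if g in present else 0 for g in vocabulary]
```

With the defaults `min_count=50` and `k=500`, the vocabulary matches the slide's example of a 1000-dimensional vector, provided each class has at least 500 n-grams above the threshold.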
  • 13. Method III: TOKENS-TRF
      • Same as before but with tokens
      • To produce the tokens, we split the image URL on all non-alphanumeric characters (\W+)
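A minimal sketch of this tokenization, using Python's `re` module rather than the authors' own code. The function name `url_tokens` and the example URL are hypothetical.

```python
import re


def url_tokens(url: str) -> list[str]:
    """Split a URL on runs of non-word characters and drop empty strings."""
    return [t for t in re.split(r"\W+", url) if t]


print(url_tokens("http://example.com/img/thumb_01-small.jpg"))
# ['http', 'example', 'com', 'img', 'thumb_01', 'small', 'jpg']
```

Note that `\W` does not match the underscore, so `thumb_01` stays a single token. Splitting on strictly non-alphanumeric characters would need a pattern such as `[^A-Za-z0-9]+` instead.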
  • 14. Method IV: NG-TSRF-IDF
      • Stands for Term Squared Relative Frequency, Inverse Document Frequency.
      • If an n-gram is very frequent in both classes, we should discard it.
      • If an n-gram is not very frequent overall but is very class-specific, we should include it.
  • 15. Method V: HTML metadata features
      HTML metadata features may reveal cues about the image size. Examples:
      • Photos are more likely than graphics to have an alt text.
      • Most photos are in JPG or PNG format.
      • Most icons and graphics are in BMP or GIF format.
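To make the idea concrete, the sketch below collects a few such cues for every `<img>` element using Python's standard `html.parser`. It covers only a subset of the features mentioned in the slides (alt text, file extension, DOM depth); the class name `ImageFeatureParser` and the feature keys are hypothetical, and the depth counter assumes reasonably well-formed HTML in which non-void elements are explicitly closed.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ImageFeatureParser(HTMLParser):
    """Collect a few per-image features while walking the HTML."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # number of currently open non-void elements
        self.features = []   # one dict per <img> encountered

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr = dict(attrs)
            path = urlparse(attr.get("src") or "").path
            self.features.append({
                "src": attr.get("src"),
                "has_alt": bool(attr.get("alt")),
                "extension": os.path.splitext(path)[1].lower().lstrip("."),
                "dom_depth": self.depth,
            })
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


page = ('<html><body><div>'
        '<img src="http://x.com/a/photo.JPG?w=1" alt="A cat">'
        '<img src="/i/icon.gif">'
        '</div></body></html>')  # hypothetical page
parser = ImageFeatureParser()
parser.feed(page)
for f in parser.features:
    print(f)
```

Each resulting dictionary could then be turned into a numeric feature vector for the classifier.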
  • 16. Evaluation
      • Training: 1M images (500K small/500K big)
      • Testing: 200K images (100K small/100K big)
      • Random Forest classifier (Weka)
      • Experimented with LibSVM and RandomTree but RF achieved the best trade-off between accuracy and training time
      • Tested with 10, 30, 100 trees
      • Performance measure:
  • 17. Results
      • Doubling the number of n-gram features improves the performance
      • Adding more trees to the Random Forest classifier improves the performance
      • The NG-tsrf-idf and TOKENS-trf have the best performance, followed closely by NG-trf
      Hybrid
  • 18. Results: Hybrid method
      • The hybrid method takes into account both textual and non-textual features.
      • Hypothesis: the two methods will complement each other when their outputs are aggregated.
      • The adv parameter gives an advantage to one of the two classifiers.
  • 19. Conclusion - Contributions
      • A supervised machine learning approach for automatically classifying Web images according to their size.
      • Assessment of textual and non-textual features.
      • A statistical analysis and evaluation on a sample of the Common Crawl set.
  • 20. Future Work
      • Apply the n-grams and tokens approaches to the alternate and parent text
          – create two additional classifiers and combine them with the existing ones
      • Detect more fine-grained characteristics
          – landscape vs. portrait
          – photographs vs. graphics
  • 21. Thank you!
      • Resources:
          Slides: http://www.slideshare.net/KaterinaAndreadou1/kandreadou-cbmi-59
          Code: https://github.com/MKLab-ITI/reveal-media-webservice/tree/year2/src/main/java/gr/iti/mklab/reveal/clustering
          Common Crawl: http://commoncrawl.org/
      • Get in touch:
          @kandreads / kandreadou@iti.gr
          @sympapadopoulos / papadop@iti.gr
