Search by Screenshots
for Universal Article Clipping
in Mobile Apps
Kazutoshi Umemoto 1, Ruihua Song 2, Jian-Yun Nie 3,
Xing Xie 2, Katsumi Tanaka 1, Yong Rui 2
1 Kyoto University, 2 Microsoft Research Asia, 3 University of Montreal
Information Access from Mobile
http://gs.statcounter.com/press/mobile-and-tablet-internet-usage-exceeds-desktop-for-first-time-worldwide
Web Access Style: Desktop vs. Mobile
[Figure labels: social, news, food, travel, ⋯]
How can we assist the read-it-later behavior of mobile users?
• What a user reads/likes is scattered across different apps
• People have limited time to read articles in one sitting
Existing Solutions
No universal way to clip articles on mobile
• Various features: OneNote, Evernote, Pocket, URL copy, Email, …
• It is difficult for a clipping service to partner with every mobile app
Proposal: UniClip
[Figure: UniClip architecture. Components: camera (screenshot), Cloud, Search by Screenshots (core task), Main Article Extractor]
Users only have to take one screenshot of the target article
Advantages
• UniClip requires only a single interaction
• UniClip allows users to save articles in one place
• UniClip is independent of app features
Core Task: Search by Screenshots
• Input
– One screenshot of a single article
– Any part is OK if it contains the article’s text
• Output
– A URL that corresponds to the given article
– Exactly the same page is desired
(if identifiable from the screenshot)
Challenges
1. How to represent a given article screenshot in a tractable format
2. How to formulate queries effective for identifying the article
3. How to aggregate search results of multiple queries
Overview of Our Approaches
[Figure: stages of the approach: text recognition, block segmentation, attribute extraction, query formulation, result aggregation, shown against the three challenges above]
Approaches
Block segmentation
Two-Phase Segmentation Algorithm
1. Merge adjacent lines detected by the OCR engine
2. Merge adjacent candidate segments
[Figure: detected lines, then (1) candidate segments, then (2) final segments. Merge cues shown for lines and for segments: alignment, distance, height]
[Figure: Before: OCR lines. After: segmented blocks]
[Figure: Baseline: OCR blocks. Proposed: segmented blocks]
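The slides name the merge cues (alignment, distance, height) but give no thresholds or merge rule. The following is a minimal sketch of a two-phase merge under assumptions: the `Segment` type, the left-edge-only alignment test, the single-column top-to-bottom scan, and every threshold value are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    left: float
    top: float
    right: float
    bottom: float
    line_height: float  # typical height of one text line in this segment (assumed > 0)


def union(a: Segment, b: Segment) -> Segment:
    """Bounding box of two segments; the line height is averaged."""
    return Segment(min(a.left, b.left), min(a.top, b.top),
                   max(a.right, b.right), max(a.bottom, b.bottom),
                   (a.line_height + b.line_height) / 2)


def similar(a: Segment, b: Segment, max_gap: float,
            max_height_ratio: float, max_shift: float) -> bool:
    """Check the three cues named on the slide; all thresholds are illustrative."""
    smaller = min(a.line_height, b.line_height)
    gap = b.top - a.bottom                               # distance
    ratio = max(a.line_height, b.line_height) / smaller  # height
    shift = abs(a.left - b.left)                         # alignment (left edge only)
    return (gap <= max_gap * smaller
            and ratio <= max_height_ratio
            and shift <= max_shift)


def merge_adjacent(items: List[Segment],
                   can_merge: Callable[[Segment, Segment], bool]) -> List[Segment]:
    """Scan top to bottom and merge each item into its predecessor when allowed."""
    merged: List[Segment] = []
    for item in sorted(items, key=lambda s: s.top):
        if merged and can_merge(merged[-1], item):
            merged[-1] = union(merged[-1], item)
        else:
            merged.append(item)
    return merged


def segment(lines: List[Segment]) -> List[Segment]:
    # Phase 1: OCR lines -> candidate segments (strict thresholds).
    candidates = merge_adjacent(lines, lambda a, b: similar(a, b, 0.6, 1.2, 5.0))
    # Phase 2: candidate segments -> final segments (looser gap and alignment).
    return merge_adjacent(candidates, lambda a, b: similar(a, b, 1.0, 1.2, 20.0))
```

With these made-up thresholds, a tall headline line followed by three closely spaced body lines comes out as two blocks, because the height ratio blocks the headline from merging in either phase.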
Approaches
Attribute extraction
Role of Each Block in Article
• Screenshots may contain blocks unrelated to the main content
❯ Queries generated from such blocks are useless for retrieving the original article
• Classify blocks into 3 groups
❯ Title block
❯ Body block
― e.g., paragraphs
❯ Other block
― e.g., ads, toolbar
How to Estimate Block Attribute
1. Estimate the attribute of each line using a CRF
(one line is regarded as one observation in the CRF model)
2. Use majority voting over the line sequence in a block
[Figure: lines of one block with their CRF labels: #{Title lines} = 1, #{Body lines} = 3, #{Other lines} = 1, so the block is labeled Body]
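Step 2 can be sketched in a few lines. The snippet below covers only the voting step and takes the per-line labels as given (the CRF itself is not reproduced). The function name `block_attribute` is hypothetical, and the tie-breaking rule (first label seen wins) is an assumption, since the slides do not say how ties are resolved.

```python
from collections import Counter
from typing import Sequence


def block_attribute(line_labels: Sequence[str]) -> str:
    """Return the most frequent line label in a block.

    Ties go to the label that appears first in the block (an assumption).
    """
    if not line_labels:
        raise ValueError("a block must contain at least one line")
    counts = Counter(line_labels)
    return counts.most_common(1)[0][0]


# The example from the slide: 1 Title line, 3 Body lines, 1 Other line.
label = block_attribute(["Title", "Body", "Body", "Other", "Body"])
```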
Features
• Point-wise features (for a target line)
❯ Font size of words
❯ Recognition confidence of words
❯ Number of words
❯ Does the line contain any punctuation?
❯ Does the line follow a certain format (e.g., upper-case, title-case)?
❯ Vertical position
• Pair-wise features (for a target line and the previous line)
❯ Agreement in alignment (left/center/right) between the two lines
❯ Gap (line spacing) between the two lines
❯ Difference in height between the two lines
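As a rough illustration of how such features might be computed from OCR output, here is a sketch under assumptions: the `OcrLine` record, the use of line height as a proxy for font size, the averaging of word confidences, and the 5-pixel alignment tolerance are all hypothetical choices, not details from the slides.

```python
import string
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OcrLine:
    words: List[str]
    confidences: List[float]  # one recognition confidence per word
    left: float
    right: float
    top: float
    bottom: float


def _center(line: OcrLine) -> float:
    return (line.left + line.right) / 2


def pointwise(line: OcrLine, page_height: float) -> Dict[str, object]:
    """Features of a single line."""
    text = " ".join(line.words)
    return {
        "font_size": line.bottom - line.top,  # line height as a proxy
        "mean_conf": sum(line.confidences) / len(line.confidences),
        "num_words": len(line.words),
        "has_punct": any(ch in string.punctuation for ch in text),
        "upper_case": text.isupper(),
        "title_case": text.istitle(),
        "vertical_pos": line.top / page_height,
    }


def pairwise(line: OcrLine, prev: OcrLine, tol: float = 5.0) -> Dict[str, object]:
    """Features relating a line to the previous line."""
    return {
        "left_aligned": abs(line.left - prev.left) <= tol,
        "center_aligned": abs(_center(line) - _center(prev)) <= tol,
        "right_aligned": abs(line.right - prev.right) <= tol,
        "gap": line.top - prev.bottom,
        "height_diff": (line.bottom - line.top) - (prev.bottom - prev.top),
    }
```

In a CRF setting, the point-wise dictionary would feed the per-observation features and the pair-wise dictionary the features tied to neighboring lines.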
Approaches
Query formulation
Basic Approach
• Simple query
❯ Formulate phrase queries from substrings of each block
❯ Length of each query must be in $[l_{\min}, l_{\max}]$
― ($l_{\min}$ and $l_{\max}$ are empirically set to 4 and 14)
• Compound query
❯ Simple queries generated from a single block may not be unique in some cases (e.g., cited paragraphs)
❯ Concatenate two (half-length) simple queries, each generated from a different block
Observation
• Queries should be as unique as possible
• Overly long queries do not return good results
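The two query types can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: query length is assumed to be counted in words, phrases are cut as consecutive non-overlapping windows, a compound query uses only the first half-length phrase of each block, and the function names are hypothetical. Only the bounds 4 and 14 come from the slide.

```python
from itertools import combinations
from typing import List

L_MIN, L_MAX = 4, 14  # bounds from the slide; the unit (words) is an assumption


def phrases(block: str, length: int) -> List[str]:
    """Cut a block into consecutive, non-overlapping runs of `length` words."""
    words = block.split()
    return [" ".join(words[i:i + length])
            for i in range(0, len(words) - length + 1, length)]


def simple_queries(block: str, length: int = L_MAX) -> List[str]:
    """Quoted phrase queries drawn from a single block."""
    n = len(block.split())
    if n < L_MIN:
        return []  # block too short to yield a valid query
    length = max(L_MIN, min(length, L_MAX, n))
    return [f'"{p}"' for p in phrases(block, length)]


def compound_queries(blocks: List[str], length: int = L_MAX) -> List[str]:
    """Pair half-length phrases taken from two different blocks."""
    half = length // 2
    firsts = [phrases(b, half)[:1] for b in blocks]
    return [f'"{a[0]}" "{b[0]}"'
            for a, b in combinations(firsts, 2) if a and b]
```

For two blocks of 20 and 8 words, `compound_queries` yields one query made of two quoted 7-word phrases, one per block.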
How Simple/Compound Queries Are Generated
[Figure: a screenshot segmented into Title, Body, Body, and Other blocks, with three numbered example queries in each of two panels: Simple query, Compound query]
Advanced Approach
• Hybrid query
❯ Title: simple query
❯ Body: compound query
❯ Other: simple query
Take block attributes into account for query formulation
• Title is unique enough to distinguish a given article from others
• Body blocks are sometimes not unique enough (e.g., citations)
• Other blocks are usually noisy but may have useful content
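One way to read the hybrid rule is as a plan that picks a query type per block attribute. The sketch below is an interpretation under assumptions: it pairs body blocks with each other to form compound queries, and it falls back to a simple query when only one body block exists. The slides specify neither detail, and the `QueryPlan` type and `plan_hybrid` name are hypothetical.

```python
from itertools import combinations
from typing import List, NamedTuple, Sequence, Tuple


class QueryPlan(NamedTuple):
    kind: str                # "simple" or "compound"
    attribute: str           # attribute of the source block(s)
    blocks: Tuple[int, ...]  # indices of the blocks the query draws on


def plan_hybrid(attributes: Sequence[str]) -> List[QueryPlan]:
    """Title/Other blocks get simple queries; Body blocks get compound ones."""
    plans = [QueryPlan("simple", a, (i,))
             for i, a in enumerate(attributes) if a in ("Title", "Other")]
    body = [i for i, a in enumerate(attributes) if a == "Body"]
    if len(body) == 1:
        # Assumed fallback: a lone body block cannot be paired.
        plans.append(QueryPlan("simple", "Body", (body[0],)))
    plans += [QueryPlan("compound", "Body", pair)
              for pair in combinations(body, 2)]
    return plans


plans = plan_hybrid(["Title", "Body", "Body", "Other"])
```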
[Figure: example comparing the Hybrid method with the Simple method]
Approaches
Result aggregation
Exploit Attribute and Rank
1. Score each search result based on its rank and the weight of the query's attribute
❯ $s(q, u_r) = \frac{w_{a(q)}}{r}$, where $u_r$ is the URL returned at rank $r$ for query $q$ and $w_{a(q)}$ is the weight of the attribute of $q$
2. Aggregate the scores for each URL over all queries $Q$
❯ $s(u) = \sum_{q \in Q} s(q, u)$
3. Output the URL with the highest aggregated score
❯ $\hat{u} = \arg\max_{u} s(u)$
• Good queries return answer URLs at high rank positions
• Bad queries return diverse URLs at different positions
• Queries from title/body blocks are more promising than others
[Figure: a query $q$ with its ranked result URLs $u_1, \dots, u_5$]
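The three steps amount to a weighted reciprocal-rank vote: each result contributes its query's attribute weight divided by its rank, contributions are summed per URL, and the top-scoring URL wins. A minimal sketch follows. The weight values in `ATTRIBUTE_WEIGHT` are made up for illustration (the slides say only that they are learned from training data), and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Illustrative weights only; not values from the slides.
ATTRIBUTE_WEIGHT = {"Title": 1.0, "Body": 1.0, "Other": 0.5}

# One entry per query: (attribute of the query, ranked result URLs).
Results = List[Tuple[str, List[str]]]


def score_urls(results: Results,
               weights: Dict[str, float] = ATTRIBUTE_WEIGHT) -> Dict[str, float]:
    """Sum weight / rank over all queries for each URL (steps 1 and 2)."""
    scores: Dict[str, float] = defaultdict(float)
    for attribute, ranked_urls in results:
        for rank, url in enumerate(ranked_urls, start=1):
            scores[url] += weights[attribute] / rank
    return dict(scores)


def best_url(results: Results,
             weights: Dict[str, float] = ATTRIBUTE_WEIGHT) -> Optional[str]:
    """Return the URL with the highest aggregated score (step 3)."""
    scores = score_urls(results, weights)
    return max(scores, key=scores.get) if scores else None


results = [("Title", ["a", "b"]), ("Body", ["a", "c"]), ("Other", ["c", "b"])]
scores = score_urls(results)  # a: 1 + 1, b: 1/2 + 0.5/2, c: 1/2 + 0.5/1
```

URL `a` wins here because two high-weight queries both rank it first, while the low-weight Other query spreads its votes elsewhere.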
Experiments
Datasets
• Training dataset: 98 screenshots
❯ to learn some parameters and to build the CRF model
• Testing dataset: 189 screenshots
❯ to evaluate the effectiveness of each approach
• Ground truth: manually assess the relevance of output URLs
Block Segmentation: Setting
• Baselines
❯ Line: regard each OCR line as a segment
❯ Region: regard each OCR region as a segment
• Procedure
1. Manually group OCR lines into ground-truth clusters
2. Evaluate the clustering quality of each method
― with Purity, NMI, RI, Precision, Recall, and F1
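For readers unfamiliar with these clustering measures, the sketch below computes purity and the pair-counting variants of RI, precision, recall, and F1 from two label lists (one predicted segment id and one ground-truth cluster id per OCR line). The pair-counting definitions are the common textbook ones and are assumed here; the slides do not state which definitions were used, and NMI is omitted for brevity.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, Sequence


def purity(pred: Sequence[Hashable], truth: Sequence[Hashable]) -> float:
    """Fraction of items that belong to the majority true class of their cluster."""
    clusters: Dict[Hashable, Counter] = {}
    for p, t in zip(pred, truth):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(c.most_common(1)[0][1] for c in clusters.values()) / len(pred)


def pair_counting(pred: Sequence[Hashable],
                  truth: Sequence[Hashable]) -> Dict[str, float]:
    """Rand index, precision, recall, and F1 over all pairs of items."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    total = tp + fp + fn + tn
    ri = (tp + tn) / total if total else 0.0
    return {"RI": ri, "Precision": precision, "Recall": recall, "F1": f1}


# Five OCR lines: predicted segments vs. manually grouped clusters.
pred, truth = [0, 0, 1, 1, 1], [0, 0, 0, 1, 1]
```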
Block Segmentation: Result
Attribute Extraction: Setting
• Heuristic baseline
❯ Title
― select the biggest block after some filtering
❯ Body
― select sentence-like blocks other than the title block
• Procedure
1. Manually label the ground-truth attribute of each OCR line
2. Evaluate the classification performance of each method
― with Precision, Recall, and F1
[Figure annotations: "Too short"; "Low confidence due to blur"]
Attribute Extraction: Result
Attribute  Method     Precision  Recall  F1
Title      Heuristic  0.340      0.912   0.327
Title      CRF        0.928      0.919   0.868
Body       Heuristic  0.754      0.780   0.702
Body       CRF        0.967      0.880   0.893
(Macro average)
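A note on reading these numbers: the F1 values are not the harmonic means of the precision and recall beside them (for the Title heuristic, the harmonic mean of 0.340 and 0.912 is about 0.495, not 0.327). That is what one would expect if F1 is computed per evaluation unit and then averaged, which is the usual meaning of a macro average. The slides do not say what the averaging unit is, so the toy data below is purely illustrative.

```python
from typing import List, Tuple


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(pairs: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Average precision, recall, and per-unit F1 over evaluation units."""
    n = len(pairs)
    mean_p = sum(p for p, _ in pairs) / n
    mean_r = sum(r for _, r in pairs) / n
    mean_f1 = sum(f1(p, r) for p, r in pairs) / n
    return mean_p, mean_r, mean_f1


# Two made-up units: one perfect, one with very low precision.
mean_p, mean_r, mean_f1 = macro_average([(1.0, 1.0), (0.1, 1.0)])
f1_of_means = f1(mean_p, mean_r)  # differs from mean_f1
```

In this toy case the macro-averaged F1 (about 0.59) is lower than the F1 of the averaged precision and recall (about 0.71), the same direction as in the table.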
Query Formulation & Result Aggregation: Setting
• Methods
❯ Hybrid: hybrid query (i.e., with attributes)
❯ Simple: simple query (i.e., without attributes)
❯ Keyword: non-phrase query consisting of keywords extracted by TextRank
• Procedure
❯ Evaluate the retrieval performance of each method under different query budgets
― Measure: F1 (and RR@8)
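RR@8 is read here as reciprocal rank with a cutoff of 8: the score is 1 divided by the rank of the first relevant URL if it appears in the top 8 of the aggregated ranking, and 0 otherwise. This is the standard definition and is assumed rather than confirmed by the slides. The function name is hypothetical.

```python
from typing import Collection, Sequence


def reciprocal_rank_at_k(ranked_urls: Sequence[str],
                         relevant: Collection[str], k: int = 8) -> float:
    """1 / rank of the first relevant URL within the top k, else 0."""
    for rank, url in enumerate(ranked_urls[:k], start=1):
        if url in relevant:
            return 1.0 / rank
    return 0.0


rr = reciprocal_rank_at_k(["x", "a", "y"], {"a"})  # relevant URL at rank 2
```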
Query Formulation & Result Aggregation: Result
Successful Cases
(1) Thanks to compound query: quotations from other pages
(2) Thanks to query weighting: the only block that is useful
(Simple) User Study
• Implement UniClip as an Android app
• Ask 22 participants to try our app and state their preference compared to other methods
Summary
• Approaches
❯ CRF-based attribute extraction for segmented blocks
❯ Attribute-dependent phrase query formulation
❯ Aggregation based on result rank and attribute weight
• Future work
❯ Improving efficiency by allowing only one query (done)
❯ Evaluation with larger datasets
❯ Leveraging the potential of screenshots for other tasks
UniClip: A Framework for Article Clipping in Mobile Devices
