Visually Extracting Data Records from the Deep Web

Neil Anderson and Jun Hong
Queen’s University Belfast, UK
  
Data Record Extraction

Given a query result page containing a set of data records, our goal is to group the data items and labels of each data record together.
  	
  
Previous Approaches
•  Common theme is to identify repeated patterns
•  Source code and regular expressions
   –  JavaScript makes this tricky
•  Supervised learning with annotated pages
   –  Wrapper induction
•  Tag tree representation (DOM)
   –  Hierarchical representation of the page, designed for the browser, not for humans
   –  Doesn’t mirror the displayed structure - modern complex web pages make this difficult
  
Layout Engine

What now?
  
Our Visual Approach
•  Mimic human intuition
•  Make use of the common sources of evidence that humans use on displayed pages, including:
   –  Structural regularity
   –  Visual and content similarity between data records
  
	
  
Previous Approaches Need to Identify Data Rich Section
Pitfalls:
•  How to identify the Data Rich Section (DRS)
•  The DRS does not contain all the records
•  The DRS contains noise as well as records
  
Our Approach
•  We find records, not the Data Rich Section
•  Extract data records individually on displayed query result pages, while excluding noise items
•  Records in a grid or a column
•  Use clustering algorithms and a set of similarity measures to:
   –  Identify records
   –  Exclude noise
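
The "records in a grid or a column" observation can be illustrated with a small check on block positions. This is only a sketch under assumptions: the `Block` class, the `layout_of` function, and the pixel `tolerance` are hypothetical names introduced here, not part of the presented system, and the bounding boxes are assumed to come from a rendering engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Bounding box of a rendered block in page pixels (hypothetical model)."""
    x: float
    y: float
    width: float
    height: float


def count_distinct(values, tolerance):
    """Count groups of values that differ by more than `tolerance`."""
    groups = []
    for value in sorted(values):
        if not groups or value - groups[-1] > tolerance:
            groups.append(value)
    return len(groups)


def layout_of(blocks, tolerance=2.0):
    """Classify blocks as 'column', 'grid', or 'irregular' by their top-left corners."""
    n = len(blocks)
    if n < 2:
        return "irregular"
    cols = count_distinct((b.x for b in blocks), tolerance)
    rows = count_distinct((b.y for b in blocks), tolerance)
    if cols == 1 and rows == n:
        return "column"
    # A grid fills every row except possibly the last one.
    if cols > 1 and (rows - 1) * cols < n <= rows * cols:
        return "grid"
    return "irregular"
```

A single shared left edge with one block per row reads as a column; several shared left edges with full rows read as a grid. Anything else is treated as irregular and would be a candidate for noise.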
   	
  	
  
Our Approach
[Pipeline diagram] Web Page Renderer -> Visual Block Modeller -> Seed Block Selector -> Data Record Block Selector -> Record Boundary Drawer
Technologies labelled in the diagram: WebKit, JavaScript, jQuery
Green and blue blocks
  
Selecting Other Candidate Containers
Filter the set of all container blocks on the page (blue blocks) and discard blocks that don’t match the width of any candidate container block (orange blocks).
Cluster the remaining blocks by width.

Why width?
Web pages are designed for vertical, not horizontal, scrolling.
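
The filter-then-cluster step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: blocks are assumed to be dictionaries with an `id` and a rendered `width` in pixels, and `cluster_by_width` and its `tolerance` parameter are hypothetical names.

```python
from collections import defaultdict


def cluster_by_width(container_blocks, candidate_blocks, tolerance=1.0):
    """Group container blocks by the candidate width they match.

    Containers whose width matches no candidate block (within `tolerance`
    pixels) are discarded. The result maps a candidate width to the list
    of container blocks that share it.
    """
    candidate_widths = sorted({c["width"] for c in candidate_blocks})
    clusters = defaultdict(list)
    for block in container_blocks:
        # Find the first candidate width this container matches, if any.
        match = next(
            (w for w in candidate_widths if abs(block["width"] - w) <= tolerance),
            None,
        )
        if match is not None:
            clusters[match].append(block)
    return dict(clusters)
```

Width is used as the only key here because, as the slide notes, pages scroll vertically: record blocks tend to share a width while their heights vary with content.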
  
	
  
	
  
Selecting Record Containers
Block content similarity measure
	
  
	
  
Block A – Candidate record block (orange)
Block B – Container block (blue) with the same width as A
  
The cluster with the maximum number of similar blocks is the winner!
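
The winner selection above can be sketched as below. The similarity formula on the slide is not recoverable from the text, so this example substitutes Jaccard similarity over content tokens as a stand-in; the function names, the token representation, and the `threshold` value are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections (stand-in measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pick_record_cluster(clusters, threshold=0.5):
    """Return (label, count) for the cluster with the most similar blocks.

    `clusters` maps a label to a pair: the tokens of candidate block A and
    a list of token lists, one per same-width container block B.
    """
    best_label, best_count = None, -1
    for label, (candidate, blocks) in clusters.items():
        # Count the blocks B whose content is similar enough to A.
        count = sum(1 for tokens in blocks if jaccard(candidate, tokens) >= threshold)
        if count > best_count:
            best_label, best_count = label, count
    return best_label, best_count


clusters = {
    "w300": (
        ["image", "link", "price"],
        [["image", "link", "price"], ["image", "link", "price", "text"], ["image", "link"]],
    ),
    "w960": (["link", "text"], [["image"], ["link", "text"]]),
}
winner = pick_record_cluster(clusters)
```

In this toy input the `w300` cluster has three blocks that resemble its candidate and `w960` has one, so `w300` is selected as the record container cluster.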
  
	
  
Visual Block Model

Visual Block Model - Clean
  
Conclusions: Main Contributions
•  A visual approach that directly accesses a rendering engine to get positional and visual features, rather than relying on source code or tag trees
•  No need to identify the Data Rich Section
•  Use observations on visual and content similarity, and structural regularity, to group data items into records
  
Future Work
•  Use a domain schema from schema.org, or a domain ontology, to annotate data records
•  Use a domain schema or ontology to annotate query forms too
•  Solve label incompleteness and inconsistency issues
•  Similarity threshold
   –  Set by machine learning
  