Machine Learning in DIADEM (Andrey Kravchenko)

  1. Machine Learning in DIADEM: Reading Course Presentation. Andrey Kravchenko, 20th of January, 2010
  2. Current area of research: Real estate page classification vs
  3. Current area of research: Input and output page distinction
  4. Current area of research: Page element classification
  5. The Reading List: papers not included in this presentation
     - “An interactive clustering-based approach to integrating source query interfaces on the Deep Web”
       - This paper is concerned with input forms.
     - “Automatic wrapper induction from hidden-web sources with domain knowledge”
       - Only a part of the paper deals with the output pages. Their methodology for processing the output pages is based on gazetteers and is thus closer to linguistics than ML.
     - “Web scale extraction of structured data”
       - Deals with the whole Web.
     - “An adaptive information extraction system based on wrapper induction with POS tagging”
       - The labels are of very low granularity (e.g. work_name, work_location) and of linguistic nature. The comparison is done against linguistic systems such as Rapier (another excluded paper on the reading list), GATE-SVM, etc. Introducing POS tagging provides only a 5% gain in accuracy, and only for some target slots for one corpus, with no gain for the other two.
  6. The Reading List: papers not included in this presentation
     - “Learning (k,l)-contextual tree languages for information extraction from Web pages”
       - The paper deals with learning an extraction language rather than extraction itself.
     - “Bottom-up relational learning of pattern matching rules for Information Extraction”
       - Deals with textual documents only.
     - “Learning rules to pre-process Web data for automatic integration”
       - Relies on web data extraction and alignment phases performed by the VIPER system that are not described in the paper. I wasn’t able to detect any ML involved in the stage of rule learning. No clear description of practical results. Low-level granularity of labels.
     - “Learning rules for information extraction”
       - Is not HTML/DOM specific.
  7. The Reading List: papers included in this presentation
     - #1 “Web-page classification: features and algorithms” (2007)
     - #2 “Web page element classification based on visual features”
     - #3 “Stylistic and lexical co-training for Web-block classification”
     - #4 “Can we learn a template-independent wrapper for news article extraction from a single training site?”
     - #5 “Efficient record-level wrapper induction”
     - #6 “Towards combining Web classification and Web Information Extraction: a case study”
  8. Paper #1: Web page classification: features and algorithms. X. Qi and B. Davison (Lehigh University, 2007)
     - The paper distinguishes between four types of classification;
     - They also distinguish between subject classification, functional classification, sentiment classification, and other types of classification;
     - The paper distinguishes between on-page features and the features of the neighbours;
     - On-page features:
       - Textual analysis: bag of words vs n-gram;
       - Visual analysis: the multigraph approach.
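
The bag-of-words vs n-gram distinction on slide 8 can be illustrated with a minimal sketch. The whitespace tokenisation, the function names and the sample text are assumptions made for illustration only; they are not taken from the paper.

```python
from collections import Counter


def bag_of_words(text):
    """Count individual lower-cased tokens, ignoring word order."""
    return Counter(text.lower().split())


def word_ngrams(text, n=2):
    """Count contiguous runs of n tokens, which keeps some local word order."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical page text, used only to show the two representations.
page_text = "two bedroom flat for sale two bedroom house to let"
unigrams = bag_of_words(page_text)
bigrams = word_ngrams(page_text, n=2)
```

The bag of words only records that "two" and "bedroom" each occur twice, whereas the bigram counts also record that they occur next to each other.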
  9. Paper #1: Web page classification: features and algorithms. X. Qi and B. Davison (Lehigh University, 2007)
  10. Paper #1: Web page classification: features and algorithms. X. Qi and B. Davison (Lehigh University, 2007)
     - When using the features of neighbouring pages the authors distinguish between the weak assumption and the strong assumption;
     - They also distinguish between different types of neighbours: parents/children, grandparents/grandchildren and siblings/spouses;
     - It appears that siblings are the most important neighbours;
     - There are various features used for different types of neighbouring pages;
     - Algorithm survey: dimension reduction and relational learning approaches.
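
The neighbour types listed on slide 10 can be derived from a hyperlink graph roughly as follows. This is a sketch under stated assumptions: the graph is given as (source, target) link pairs, siblings are taken to be pages sharing a parent, and spouses pages sharing a child. The function name and the example links are hypothetical.

```python
from collections import defaultdict


def neighbours(links, page):
    """Group the neighbours of `page` by type.

    links: iterable of (source, target) hyperlink pairs (assumed input format).
    """
    children_of = defaultdict(set)
    parents_of = defaultdict(set)
    for src, dst in links:
        children_of[src].add(dst)
        parents_of[dst].add(src)

    parents = set(parents_of[page])
    children = set(children_of[page])
    # Siblings share at least one parent with the page.
    siblings = set().union(*(children_of[p] for p in parents)) - {page}
    # Spouses share at least one child with the page.
    spouses = set().union(*(parents_of[c] for c in children)) - {page}
    return {"parents": parents, "children": children,
            "siblings": siblings, "spouses": spouses}


# Hypothetical link graph: hub -> a, hub -> b, a -> c, d -> c.
example_links = [("hub", "a"), ("hub", "b"), ("a", "c"), ("d", "c")]
result = neighbours(example_links, "a")
```

Grandparents and grandchildren would follow by applying the parent or child lookup twice.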
  11. Paper #2: Web page element classification based on visual features. R. Burget and I. Rudolfova (Brno University, 2009)
     - Problem: Classification of elements from a web page based on its visual rendering;
     - Assumptions: A tagged corpus, DOM tree, CSSBox layout;
     - Approach: Page segmentation followed by block classification performed via Weka’s J48 decision tree classifier;
     - Features: Font features, spatial features, text features, colour features;
     - Evaluation: News domain. Average F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.
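
Several evaluations in this deck are reported as F1 values. As a reminder, F1 is the harmonic mean of precision and recall; a minimal sketch follows, with a hypothetical function name and made-up counts that are not results from any of the papers.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 8 blocks labelled correctly, 2 spurious, 2 missed.
example_f1 = f1_score(true_pos=8, false_pos=2, false_neg=2)
```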
  12. Paper #2: Web page element classification based on visual features. R. Burget and I. Rudolfova (Brno University, 2009)
     - The approach of this paper is split into two phases:
       - Page segmentation;
       - Page element classification;
     - Page segmentation is done in four phases:
       - Page rendering;
       - Detecting basic visual areas;
       - Text line detection;
       - Block detection;
     - As a result of page segmentation we obtain a tree of areas.
  13. Paper #2: Web page element classification based on visual features. R. Burget and I. Rudolfova (Brno University, 2009)
     - The actual page element classification is performed for each area via Weka’s J48 decision tree classifier based on the following set of features:
       - Font features {fontsize, weight};
       - Spatial features {aabove, abelow, aleft, aright};
       - Text features {tdigits, tlower, tupper, tspaces, tlength};
       - Colour features {contrast}.
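
The text features on slide 13 could be computed along the following lines. The slide only gives the feature names, so the definitions here are assumptions: tdigits, tlower, tupper and tspaces are taken to be the share of digits, lower-case letters, upper-case letters and whitespace in the area's text, and tlength its length in characters.

```python
def text_features(text):
    """Compute assumed versions of the text features for one visual area."""
    length = len(text)

    def share(predicate):
        # Fraction of characters satisfying the predicate; 0.0 for empty text.
        if length == 0:
            return 0.0
        return sum(1 for ch in text if predicate(ch)) / length

    return {
        "tdigits": share(str.isdigit),
        "tlower": share(str.islower),
        "tupper": share(str.isupper),
        "tspaces": share(str.isspace),
        "tlength": length,
    }


# Hypothetical area text, e.g. a price line on a property page.
features = text_features("Ab 12")
```

A vector of such values per area, together with the font, spatial and colour features, is what a decision tree learner such as J48 would be trained on.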
  14. Paper #2: Web page element classification based on visual features. R. Burget and I. Rudolfova (Brno University, 2009)
     - [Figures: the set of labels; results, with the testing pages from another source than the training pages]
  15. Paper #3: Stylistic and Lexical Co-training for Web Block Classification. C. Lee et al (National University of Singapore, 2004)
     - Problem: Classification of elements from a web page based on both stylistic and lexical features;
     - Assumptions: A tagged corpus, DOM tree, CSSBox layout;
     - Approach: Web block division followed by co-training with Boostexter, an ensemble learning method with a decision stump corresponding to a single weak learner;
     - Features: Lexical and stylistic;
     - Evaluation: News domain. Average F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.
  16. Paper #3: Stylistic and Lexical Co-training for Web Block Classification. C. Lee et al (National University of Singapore, 2004)
     - The authors aim to combine two different classifiers with distinct sets of features (lexical and stylistic);
     - They’ve created a PARser for Content Extraction and Layout Structure (PARCELS);
     - Web page division: the authors differentiate between structural tags and content tags.
  17. Paper #3: Stylistic and Lexical Co-training for Web Block Classification. C. Lee et al (National University of Singapore, 2004)
  18. Paper #3: Stylistic and Lexical Co-training for Web Block Classification. C. Lee et al (National University of Singapore, 2004)
     - The authors distinguish between labels of different levels of granularity. They define 17 tags for labelling;
     - Stylistic features:
       - Linear structure: paragraph (<p>), header (<h1>-<h6>) and rule tags (<hr>);
       - Table structure: cell flow, neighbouring cells’ data, the position of table cells;
       - XHTML/CSS structure: height, width, z-index;
       - Font features: colour, weight, family, size, hyperlink features;
       - Images: size, number of images within a block;
     - Lexical features:
       - Low-level features: count and vocabulary of the words present in the text block;
       - High-level features: POS-tags, mailto-links, image-links, text-links, total-links;
     - Boostexter is used for co-training. It is an ensemble learning method with a decision stump corresponding to a single weak learner.
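
To make the notion of a decision stump as a weak learner concrete, here is a minimal sketch of fitting a single threshold stump on weighted examples. It is a generic illustration, not Boostexter's implementation; the function names, the feature layout (font size, bold flag) and the toy data are assumptions.

```python
def stump_predict(stump, row):
    """Predict +1 or -1 from a single feature compared against a threshold."""
    feature, threshold, polarity = stump
    return polarity if row[feature] >= threshold else -polarity


def fit_stump(rows, labels, weights):
    """Return the stump with the lowest weighted error, and that error.

    rows: list of feature vectors; labels: values in {-1, +1};
    weights: non-negative example weights, as a boosting round would supply.
    """
    best_error, best_stump = None, None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                error = sum(w for row, label, w in zip(rows, labels, weights)
                            if stump_predict(stump, row) != label)
                if best_error is None or error < best_error:
                    best_error, best_stump = error, stump
    return best_stump, best_error


# Toy blocks described by (font size, is bold); +1 marks a hypothetical "title" label.
toy_rows = [[10, 0], [12, 1], [18, 0], [24, 1]]
toy_labels = [-1, -1, 1, 1]
toy_weights = [0.25, 0.25, 0.25, 0.25]
stump, error = fit_stump(toy_rows, toy_labels, toy_weights)
```

A boosting method would repeat this fit over several rounds, increasing the weights of misclassified examples each time and combining the resulting stumps into one ensemble.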
  19. Paper #3: Stylistic and Lexical Co-training for Web Block Classification. C. Lee et al (National University of Singapore, 2004)
  20. Paper #3
  21. Paper #4: Can we learn a template-independent wrapper for news article extraction from a single training site? J. Wang et al (2009, Zhejiang University, MS Research)
     - Problem: Classification of titles and bodies of news taken from the webpages belonging to the news domain;
     - Assumptions: A tagged corpus, DOM tree, CSSBox layout;
     - Approach: SVM; decision function gets converted to posterior probability;
     - Features: Different sets of features for body and title extraction. Features are divided into content and spatial features;
     - Evaluation: Overall 99% extraction accuracy.
  22. Paper #4: Can we learn a template-independent wrapper for news article extraction from a single training site? J. Wang et al (2009, Zhejiang University, MS Research)
     - The aim of the paper is to efficiently extract and then combine titles and bodies of news articles;
     - The main problem is in dealing with the various kinds of noise around the titles.
  23. Paper #4: Can we learn a template-independent wrapper for news article extraction from a single training site? J. Wang et al (2009, Zhejiang University, MS Research)
     - News body extraction:
       - Content features: FormattingElementsNum and FormattedContentLen;
       - Spatial features: normalised RectLeft, RectTop, RectWidth and RectHeight;
       - News body extraction heuristics: TopInScreen(T) and BigEnough(T);
     - News title extraction:
       - Content features: FontSize, EndWithFullStop, WordNum;
       - Spatial features: RectLeft, RectTop, RectWidth, RectHeight, Overlap, Distance, Flat;
       - News title extraction heuristics: WholeInScreen(T), NoAnchorText(T), NotCategoryName(T);
     - An SVM approach is chosen for classification. The decision function gets converted to posterior probability.
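
The slide does not say how the SVM decision function is converted to a posterior probability. One common way is a Platt-style sigmoid; the sketch below assumes that approach, and its parameters `a` and `b` are placeholders that would normally be fitted on held-out data.

```python
import math


def posterior_from_decision_value(decision_value, a=-1.0, b=0.0):
    """Map an SVM decision value f to P(y = 1 | f) = 1 / (1 + exp(a * f + b)).

    a and b are placeholder values here, not fitted parameters.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# A value on the decision boundary maps to 0.5 when b = 0.
on_boundary = posterior_from_decision_value(0.0)
confident_positive = posterior_from_decision_value(2.0)
confident_negative = posterior_from_decision_value(-2.0)
```

With probabilities instead of raw margins, candidate title and body nodes can be ranked against each other and the most probable one selected.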
  24. Paper #4: Can we learn a template-independent wrapper for news article extraction from a single training site? J. Wang et al (2009, Zhejiang University, MS Research)
     - [Figures: extraction results; testing results on the large-scale experiment]
  25. Paper #5: Efficient record level wrapper induction. S. Zheng et al (Pennsylvania State University, 2009)
     - Problem: Efficient extraction of records from Web pages and classification of their elements;
     - Assumptions: A tagged corpus, DOM tree;
     - Approach: Alignment of the DOM subtree and the possible wrappers;
     - Features: None;
     - Evaluation: Four different domains (online shops, user reviews, digital libraries, search results). Seven detail page datasets and eleven list page datasets. A 99% F1 value.
  26. Paper #5: Efficient record level wrapper induction. S. Zheng et al (Pennsylvania State University, 2009)
     - The paper is concerned with extracting records and their respective attributes;
     - The key distinction from other approaches is record-level extraction as opposed to page-level extraction;
     - The authors propose a novel broom structure for this task;
     - The broom structure has a head and a stick;
     - One of the main issues is crossing records.
  27. Paper #5: Efficient record level wrapper induction. S. Zheng et al (Pennsylvania State University, 2009)
  28. Paper #5: Efficient record level wrapper induction. S. Zheng et al (Pennsylvania State University, 2009)
     - The general architecture of the system involves training and testing phases.
  29. Paper #5: Efficient record level wrapper induction. S. Zheng et al (Pennsylvania State University, 2009)
     - The authors claim to achieve a remarkable extraction accuracy and a significant boost in running time performance.
  30. Paper #6: Towards combining Web classification and Web Information Extraction: a case study. P. Luo et al (HP Labs China, 2009)
     - Problem: Combination of web page classification based on their relevance to a specific domain with the extraction of its specific elements, using both forward and backward dependencies;
     - Assumptions: A tagged corpus, DOM tree;
     - Approach: Conditional Random Fields (CRFs);
     - Features: Course terms and heuristics for course homepage detection; format, position and content features for course title extraction;
     - Evaluation: OfCourse system for online course information extraction. 90% F1 value for course page classification, 83% F1 value for course title extraction.
  31. Paper #6: Towards combining Web classification and Web Information Extraction: a case study. P. Luo et al (HP Labs China, 2009)
     - The authors propose a method that utilises both forward and backward dependencies between Web classification and information extraction;
     - The authors use a unified graphical CRF model for joint and simultaneous optimisation of these two steps;
     - This methodology has been used for building the OfCourse online search engine;
     - In their results for OfCourse the authors claim that their model significantly outperforms the two baseline methods;
     - Drawbacks: they only deal with DOM leaf nodes as classification variables for the information extraction phase.
  32. Lessons learnt from the Reading Course
     - #1 “Web page classification: features and algorithms” by X. Qi and B. Davison (2007): the importance of the features of neighbouring pages;
     - #2 “Web page element classification based on visual features” by R. Burget and I. Rudolfova (2009): a broad set of visual features (font features, spatial features, text features and colour features);
     - #3 “Stylistic and Lexical Co-training for Web Block Classification” by C. Lee et al (2004): a useful web block division algorithm. A possibility of co-training on the same corpus using two distinct sets of features;
  33. Lessons learnt from the Reading Course
     - #4 “Can we learn a template-independent wrapper for news article extraction from a single training site?” by J. Wang et al (2009): a distinctive set of features for news title extraction, a lot of which can be used for property title extraction in DIADEM;
     - #5 “Efficient record level wrapper induction” by S. Zheng et al (2009): a new record-level approach for extraction. Performs much better and faster than the page-level approaches. Can be useful for DIADEM extraction in the record-heavy domains;
     - #6 “Towards combining Web classification and Web Information Extraction: a case study” by P. Luo et al (2009): backward dependency between these two tasks can work as well. Thus it is worthwhile to experiment with their mutual tie-up.
  34. General lessons learnt
     - Most of the papers are recent or very recent (2004-2009);
     - Features play a much more important role than algorithms;
     - Initial page segmentation into blocks can help with subsequent determination of relevant DOM-subtrees;
     - All features can be broadly divided into content features and visual features;
     - The news domain is very popular (3 out of 5 reviewed systems). No mention of real estate in any of the papers.
  35. Summary of the Reading Course and its relevance to DIADEM
     - The six proposed papers are of relevance to all three areas of my current research:
       - Real estate page classification;
       - Output/Input page distinction;
       - Property page elements’ classification;
     - The most obvious synergy is with Omer’s NLP work, although overlaps with Cheng’s and Xiaonan’s work are also possible;
     - I plan to use a subset of the features presented in these papers in the classification of the elements of output pages and subsequent real estate page classification.
  36. Thank you for your attention!