Incorporating site level knowledge to extract structured data from web forums - keynote
Published in: Technology, Education
Transcript

  • 1. Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums. Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma. Web Search & Mining Group, Microsoft Research Asia. 2009-04. Saturday, May 22, 2010
  • 2. Web Forum Data • An important information resource containing a great deal of human knowledge. • Topics include recreation, sports, games, computers, art, society, science, home, and health. • 20% of the pages in search results come from forums.
  • 3. Understanding Forum: Crawling, Data Extraction, Quality Assessment
  • 4. Understanding Forum • Crawling: iRobot: An Intelligent Crawler for Web Forums (WWW'08); Exploring Traversal Strategy (SIGIR'08); Incremental Crawling (KDD'09) • Data Extraction: Automation Data Extraction (WWW'09, SIGIR'09) • Quality Assessment
  • 6. Challenge
  • 13. Challenge • Leverage more site-level knowledge
  • 17. Forum Sitemap • A sitemap is a directed graph consisting of a set of vertices and the links between them. • Rui Cai, Jiangming Yang, Wei Lai, Yida Wang, and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of the WWW 2008 Conference.
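
The sitemap definition above can be sketched as a small data structure. This is only an illustrative sketch: the class name `Sitemap`, its methods, and the vertex names such as `list_page` are hypothetical and do not come from the slides or from iRobot. Each vertex stands for a cluster of pages sharing a template, and each edge is a link between clusters.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Sitemap:
    """Directed graph: vertices are page clusters, edges are links between them."""
    vertices: set = field(default_factory=set)
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_link(self, src: str, dst: str) -> None:
        # Register both endpoints, then record the directed edge src -> dst.
        self.vertices.update((src, dst))
        self.edges[src].add(dst)

    def successors(self, vertex: str) -> set:
        # Return a copy so callers cannot mutate the graph by accident.
        return set(self.edges.get(vertex, set()))


# Hypothetical vertex names for a typical forum layout.
sitemap = Sitemap()
sitemap.add_link("board_index", "list_page")
sitemap.add_link("list_page", "list_page")   # pagination within the same cluster
sitemap.add_link("list_page", "post_page")
sitemap.add_link("post_page", "user_profile")
```

A self-loop such as `list_page -> list_page` is how pagination links would appear in this representation.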
  • 18. Page Clustering • Forum pages are generated from a database and templates. • Layout is a robust way to describe a template. – Layout can be characterized by the HTML elements in different DOM paths.
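
As a rough illustration of characterizing layout by DOM paths, the sketch below collects the set of tag paths in each page and compares two pages with Jaccard similarity. This is a simplified stand-in, not the clustering method from the talk; the names `DomPathCollector`, `dom_paths`, and `layout_similarity`, and the sample pages, are assumptions made for the example.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_paths(html: str) -> set:
    collector = DomPathCollector()
    collector.feed(html)
    return collector.paths


def layout_similarity(html_a: str, html_b: str) -> float:
    """Jaccard similarity of the two pages' DOM-path sets."""
    a, b = dom_paths(html_a), dom_paths(html_b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Hypothetical pages: two list pages from one template, and one post page.
LIST_PAGE_A = ("<html><body><table><tr><td><a href='/t/1'>First</a></td></tr>"
               "</table></body></html>")
LIST_PAGE_B = ("<html><body><table><tr><td><a href='/t/2'>Second</a></td></tr>"
               "<tr><td><a href='/t/3'>Third</a></td></tr></table></body></html>")
POST_PAGE = "<html><body><div><p>Hello</p></div></body></html>"
```

Pages from the same template share their path set regardless of how many records they hold, so `LIST_PAGE_A` and `LIST_PAGE_B` compare as identical while `POST_PAGE` shares only the outer `html` and `html/body` paths.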
  • 25. Page Clustering: DOM Path Feature Discovery; Clustering by Virtual Tables
  • 26. Link Analysis: A Link = URL Pattern + Location
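
One way to read "Link = URL Pattern + Location" is as a pair of a generalized URL and the DOM path where the anchor appears. The sketch below assumes a very crude generalization rule (every run of digits becomes a placeholder); the rule, the `Link` type, and the example URLs are hypothetical and are not taken from the slides.

```python
import re
from typing import NamedTuple


class Link(NamedTuple):
    url_pattern: str   # URL with its variable parts generalized
    location: str      # DOM path of the anchor element on the source page


def url_pattern(url: str) -> str:
    """Generalize a URL by replacing every run of digits with a placeholder."""
    return re.sub(r"\d+", "{n}", url)


def make_link(url: str, dom_path: str) -> Link:
    return Link(url_pattern(url), dom_path)


# Two thread links from the same table column collapse to the same Link,
# while the same URL pattern found in a sidebar counts as a different Link.
thread_a = make_link("/viewtopic.php?t=1234", "html/body/table/tr/td/a")
thread_b = make_link("/viewtopic.php?t=98", "html/body/table/tr/td/a")
sidebar = make_link("/viewtopic.php?t=7", "html/body/div/ul/li/a")
```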
  • 28. Inner-Page Features • The inclusion relation. Data records usually have inclusion relations. • The alignment relation. Since data is generated from a database and presented via templates, data records with the same label may appear repeatedly in a page. • Time order. Since post records are generated sequentially along a timeline, post times should be sorted in ascending or descending order.
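
The time-order feature lends itself to a simple check: a candidate set of timestamps supports the "post time" label only if it is monotonic. A minimal sketch, assuming the timestamps have already been parsed into `datetime` values; the function name and the sample data are hypothetical.

```python
from datetime import datetime


def is_time_ordered(timestamps) -> bool:
    """True if the timestamps are sorted ascending or descending (ties allowed)."""
    pairs = list(zip(timestamps, timestamps[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


# Hypothetical post times read from one candidate column of a post page.
candidate = [
    datetime(2009, 4, 20, 9, 0),
    datetime(2009, 4, 20, 9, 30),
    datetime(2009, 4, 21, 8, 15),
]
```

A column holding something else, such as each author's join date, would usually fail this check, which is what makes the feature useful for telling the two apart.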
  • 29. Inner-vertex Features
  • 32. Inter-vertex Features
  • 39. Problem Setting: Author, Title, Content
  • 40. Formulas of list page • Formulas for identifying list record • Formulas for identifying list title
  • 43. Formulas of post page • Formulas for identifying post record • Formulas for identifying post author
  • 45. Formulas of post page • Formulas for identifying post time • Formulas for identifying post content
  • 47. Markov Logic Networks • An MLN can be viewed as a template for constructing Markov Random Fields. • With a set of formulas and constants, MLNs define a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by:
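
In the standard formulation of Markov logic networks, this probability takes the log-linear form below. The symbols are the usual ones rather than ones confirmed by the slide: $w_i$ is the weight of formula $i$, $n_i(x)$ is the number of true groundings of formula $i$ in state $x$, and $Z$ is the normalizing constant.

```latex
P(X = x) = \frac{1}{Z} \exp\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\Big( \sum_{i} w_i \, n_i(x') \Big)
```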
  • 48. Markov Logic Networks • Divide DOM tree elements into three categories: – Text element – Hyperlink element – Inner element • Benefit – Reduce the number of possible groundings in inference. – Reduce the ambiguity and achieve better performance.
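
The three-way split of DOM elements might look like the sketch below. The slide does not define the categories, so the rules here are assumptions: an `<a>` node is a hyperlink element, any other leaf that holds text is a text element, and everything else is an inner element. The function names and the sample markup are hypothetical, and the markup is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET


def categorize(element) -> str:
    # Assumed rule 1: anchors are hyperlink elements.
    if element.tag == "a":
        return "hyperlink"
    # Assumed rule 2: a childless node with non-blank text is a text element.
    if len(element) == 0 and (element.text or "").strip():
        return "text"
    # Everything else only groups other nodes.
    return "inner"


def categorize_tree(markup: str) -> list:
    """Return (tag, category) pairs in document order."""
    root = ET.fromstring(markup)
    return [(el.tag, categorize(el)) for el in root.iter()]


# Hypothetical fragment of a post record.
POST_RECORD = "<div><a href='/u/42'>alice</a><span>Hello world</span></div>"
```

Typing the elements this way lets a formula be grounded only on elements of a compatible category, which is the reduction in groundings the slide refers to.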
  • 49. Experiments: List Pages, Post Pages
  • 50. Experiments
  • 57. Future works http://discussions.apple.com/
  • 58. Conclusion • A template-independent approach to extracting structured data from web forum sites. • We can leverage the power of site-level information, such as the mutual information among pages within or across vertices of the sitemap. • http://research.microsoft.com/people/jmyang/
