Incorporating site level knowledge to extract structured data from web forums - keynote
Transcript

  • 1. Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums. Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma. Web Search & Mining Group, Microsoft Research Asia. 2009-04. Saturday, May 22, 2010
  • 2. Web Forum Data
      • An important information resource containing a great deal of human knowledge.
      • This information covers recreation, sports, games, computers, art, society, science, home, and health.
      • 20% of the pages in search results come from forums.
  • 3. Understanding Forum: Crawling, Quality Assessment, Data Extraction
  • 4-5. Understanding Forum: Crawling, Quality Assessment, Data Extraction
      • WWW'08: iRobot: An Intelligent Crawler for Web Forums
      • SIGIR'08: Exploring Traversal Strategy
      • KDD'09: Incremental Crawling
      • WWW'09, SIGIR'09: Automation Data Extraction; Quality Assessment
  • 6-12. Challenge
  • 13. Challenge
      • Leverage more site-level knowledge
  • 14-15.
  • 16. Forum Sitemap
      • A sitemap is a directed graph consisting of a set of vertices and the links between them.
  • 17. Forum Sitemap
      • A sitemap is a directed graph consisting of a set of vertices and the links between them.
      • Rui Cai, Jiangming Yang, Wei Lai, Yida Wang, and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of the WWW 2008 Conference.
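The sitemap definition on this slide can be pictured as a small adjacency structure. The following is only a minimal sketch: the `Sitemap` class and the vertex names (`board_list`, `thread_list`, and so on) are hypothetical and are not taken from the iRobot paper.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Sitemap:
    """Directed graph: each vertex stands for a group of similar pages,
    each edge for a link from one group to another (assumed reading)."""
    vertices: set = field(default_factory=set)
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_link(self, source: str, target: str) -> None:
        # Register both endpoints, then record the directed edge.
        self.vertices.update((source, target))
        self.edges[source].add(target)

    def successors(self, vertex: str) -> set:
        # Copy of the targets reachable in one step; empty set if none.
        return set(self.edges.get(vertex, ()))


# Hypothetical forum layout, for illustration only.
sitemap = Sitemap()
sitemap.add_link("board_list", "thread_list")
sitemap.add_link("thread_list", "post_page")
sitemap.add_link("thread_list", "thread_list")  # pagination loops back
sitemap.add_link("post_page", "user_profile")
```

A self-loop such as `thread_list -> thread_list` is one way to represent pagination inside a single vertex.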
  • 18-21. Page Clustering
      • Forum pages are generated from a database and templates.
      • Layout is a robust way to describe a template.
          – Layout can be characterized by the HTML elements in different DOM paths.
  • 22. Page Clustering
  • 23-24. Page Clustering: DOM Path Feature Discovery
  • 25. Page Clustering: DOM Path Feature Discovery, Clustering by Virtual Tables
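To make the idea of characterizing layout by DOM paths concrete, here is a rough sketch using Python's standard `html.parser`. The Jaccard similarity over path sets is an assumption chosen for illustration. The slides do not describe the actual feature-discovery or "virtual table" clustering steps, so this is not the authors' method.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DomPathCollector(HTMLParser):
    """Collects the set of tag paths (e.g. 'html/body/table') in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def dom_paths(html):
    collector = DomPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def layout_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' DOM path sets (assumed metric)."""
    a, b = dom_paths(html_a), dom_paths(html_b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Hypothetical pages: two list pages from one template, one post page.
list_page_1 = "<html><body><table><tr><td><a href='t1'>A</a></td></tr></table></body></html>"
list_page_2 = ("<html><body><table><tr><td><a href='t2'>B</a></td></tr>"
               "<tr><td><a href='t3'>C</a></td></tr></table></body></html>")
post_page = "<html><body><div><p>Hello</p></div></body></html>"
```

Pages rendered from the same template share their path set even when the number of records differs, which is why the two list pages compare as identical here while the post page does not.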
  • 26. Link Analysis: A Link = URL Pattern + Location
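One possible reading of "URL pattern + location" is a pair of a generalized URL and the DOM path where the anchor sits. The sketch below assumes that reading. Replacing digit runs with `*` is a simplification chosen for illustration, and `Link`, `url_pattern`, and `make_link` are hypothetical names.

```python
import re
from typing import NamedTuple


class Link(NamedTuple):
    url_pattern: str  # URL with variable parts generalized
    location: str     # DOM path of the anchor on the source page


def url_pattern(url: str) -> str:
    """Generalize a URL by replacing each run of digits with a wildcard."""
    return re.sub(r"\d+", "*", url)


def make_link(url: str, dom_path: str) -> Link:
    return Link(url_pattern(url), dom_path)


# Two thread links from the same table cell collapse to one link type;
# the same URL pattern found in a different location stays distinct.
a = make_link("/viewtopic.php?t=1234&page=2", "html/body/table/tr/td/a")
b = make_link("/viewtopic.php?t=98&page=1", "html/body/table/tr/td/a")
c = make_link("/viewtopic.php?t=98&page=1", "html/body/div/a")
```

Keeping the location in the key separates, for example, a thread link in the main listing from a link with the same URL shape in a sidebar.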
  • 27.
  • 28. Inner-Page Features
      • The inclusion relation. Data records usually have inclusion relations.
      • The alignment relation. Since data is generated from a database and rendered via templates, data records with the same label may appear repeatedly in a page.
      • Time order. Since post records are generated sequentially along a timeline, the post times should be sorted in ascending or descending order.
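The time-order feature above lends itself to a very small check. This sketch assumes the candidate post times have already been parsed into `datetime` values; the function name `is_time_ordered` is hypothetical.

```python
from datetime import datetime


def is_time_ordered(timestamps):
    """True when the timestamps are monotonically ascending or descending."""
    pairs = list(zip(timestamps, timestamps[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


# Hypothetical post times taken from one thread page.
ascending_times = [
    datetime(2009, 4, 1, 9, 0),
    datetime(2009, 4, 1, 9, 30),
    datetime(2009, 4, 2, 8, 0),
]
shuffled_times = [ascending_times[1], ascending_times[0], ascending_times[2]]
```

A set of candidate elements whose dates fail this check is less likely to be the true post-time field, which is how such a feature could help disambiguate labels.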
  • 29-31. Inner-vertex Features
  • 32-34. Inter-vertex Features
  • 35.
  • 36. Problem Setting
  • 37. Problem Setting: Author
  • 38. Problem Setting: Author, Title
  • 39. Problem Setting: Author, Title, Content
  • 40-42. Formulas of list page
      • Formulas for identifying list record
      • Formulas for identifying list title
  • 43-44. Formulas of post page
      • Formulas for identifying post record
      • Formulas for identifying post author
  • 45. Formulas of post page
      • Formulas for identifying post time
      • Formulas for identifying post content
  • 46.
  • 47. Markov Logic Networks
      • An MLN can be viewed as a template for constructing Markov Random Fields.
      • With a set of formulas and constants, an MLN defines a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by:
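The formula on this slide did not survive in the transcript. For reference, the standard Markov logic network definition, which is presumably what the slide showed, is given below. The symbols are the usual ones from the MLN literature rather than ones recovered from the slide: w_i is the weight of formula i, n_i(x) is the number of true groundings of formula i in state x, and Z is the normalizing constant.

```latex
\[
P(X = x) = \frac{1}{Z} \exp\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\Big( \sum_{i} w_i \, n_i(x') \Big)
\]
```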
  • 48. Markov Logic Networks
      • Divide DOM tree elements into three categories:
          – Text element
          – Hyperlink element
          – Inner element
      • Benefit
          – Reduce the number of possible groundings in inference.
          – Reduce the ambiguity and achieve better performance.
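The slide names the three categories but does not define them, so the rule below is an assumption: an `<a>` element is a hyperlink element, a leaf that carries text is a text element, and everything else is an inner element. The sketch uses `xml.etree.ElementTree` on a well-formed snippet for brevity; real forum HTML would need a tolerant parser.

```python
import xml.etree.ElementTree as ET


def categorize(element):
    """Assign one of three categories under the assumed rule above."""
    if element.tag == "a":
        return "hyperlink"
    if len(element) == 0 and (element.text or "").strip():
        return "text"  # leaf node that carries visible text
    return "inner"     # container, or a leaf with no text


# Hypothetical row from a thread list.
snippet = "<tr><td><a href='/t/1'>Topic</a></td><td><span>alice</span></td></tr>"
root = ET.fromstring(snippet)
categories = [(el.tag, categorize(el)) for el in root.iter()]
```

Typing the elements this way means a formula about, say, post authors only needs to be grounded over the elements of a matching category rather than over every DOM node, which is consistent with the stated benefit of fewer groundings.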
  • 49. Experiments: List Pages, Post Pages
  • 50-55. Experiments
  • 56. Future work
  • 57. Future work: http://discussions.apple.com/
  • 58. Conclusion
      • A template-independent approach to extracting structured data from web forum sites.
      • We can leverage the power of site-level information, such as the mutual information among pages within (inner) or across (inter) vertices of the sitemap.
      • http://research.microsoft.com/people/jmyang/