Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums
Transcript of "Incorporating site level knowledge to extract structured data from web forums - keynote"

  1. Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums
     Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma
     Web Search & Mining Group, Microsoft Research Asia
     2009-04
     Saturday, May 22, 2010
  2. Web Forum Data
     • An important information resource containing a great deal of human knowledge.
     • Topics include recreation, sports, games, computers, art, society, science, home, and health.
     • 20% of the pages in search results come from forums.
  3-5. Understanding Forum
     • Three stages: Crawling, Data Extraction, Quality Assessment
     • Crawling: iRobot: An Intelligent Crawler for Web Forums (WWW'08); Exploring Traversal Strategy (SIGIR'08); Incremental Crawling (KDD'09)
     • Data Extraction and Quality Assessment: Automation Data Extraction; Quality Assessment (WWW'09, SIGIR'09)
  6-13. Challenge
     • Leverage more site-level knowledge
  16-17. Forum Sitemap
     • A sitemap is a directed graph consisting of a set of vertices and the links between them.
     • Rui Cai, Jiangming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of the WWW 2008 Conference.
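The directed-graph view of a sitemap can be sketched in a few lines. This is only an illustration: the class name `Sitemap`, its methods, and the vertex names (`board_list`, `thread_list`, and so on) are hypothetical, and treating each vertex as a group of similarly laid-out pages is an assumption suggested by the page-clustering slides rather than something stated here.

```python
from collections import defaultdict


class Sitemap:
    """Directed graph: a set of vertices plus the links between them."""

    def __init__(self):
        self.vertices = set()
        self.links = defaultdict(set)  # source vertex -> set of target vertices

    def add_link(self, source, target):
        # Adding a link registers both endpoints as vertices.
        self.vertices.update((source, target))
        self.links[source].add(target)

    def successors(self, vertex):
        # Sorted for a stable, readable result.
        return sorted(self.links.get(vertex, ()))


# Hypothetical forum layout used purely as an example.
sitemap = Sitemap()
sitemap.add_link("board_list", "thread_list")
sitemap.add_link("thread_list", "post_page")
sitemap.add_link("thread_list", "thread_list")  # pagination links back to the same vertex
sitemap.add_link("post_page", "user_profile")

print(sitemap.successors("thread_list"))
```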
  18-21. Page Clustering
     • Forum pages are generated from a database and a template.
     • Layout is a robust way to describe a template.
       – Layout can be characterized by the HTML elements in different DOM paths.
  22-25. Page Clustering
     • DOM Path Feature Discovery
     • Clustering by Virtual Tables
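As a rough illustration of characterizing layout by DOM paths, the sketch below collects the set of tag paths on each page and groups pages whose path sets overlap strongly. It is not the virtual-table method named on the slide: the greedy Jaccard grouping, the `threshold` value, and all function names are assumptions chosen to keep the example short.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class DomPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag to tolerate unclosed elements.
            while self.stack.pop() != tag:
                pass


def dom_paths(html):
    collector = DomPathCollector()
    collector.feed(html)
    return collector.paths


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0


def cluster_pages(pages, threshold=0.6):
    """Greedy grouping: a page joins the first cluster whose representative is similar enough."""
    clusters = []  # each cluster is a list of (name, paths); the first entry is the representative
    for name, html in pages.items():
        paths = dom_paths(html)
        for cluster in clusters:
            if jaccard(paths, cluster[0][1]) >= threshold:
                cluster.append((name, paths))
                break
        else:
            clusters.append([(name, paths)])
    return [[name for name, _ in cluster] for cluster in clusters]


# Toy pages: two list-like pages sharing a table layout, one post-like page.
pages = {
    "list_1": "<html><body><table><tr><td><a href='t1'>A</a></td></tr></table></body></html>",
    "list_2": "<html><body><table><tr><td><a href='t2'>B</a></td></tr>"
              "<tr><td><a href='t3'>C</a></td></tr></table></body></html>",
    "post_1": "<html><body><div><span>user</span><p>text</p></div></body></html>",
}
print(cluster_pages(pages))
```

The two list pages have identical path sets even though one has more rows, which is the sense in which layout is robust to the amount of data the template renders.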
  26. Link Analysis
     • A Link = URL Pattern + Location
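One way to read "URL Pattern + Location" is as a pair: a generalized form of the URL and the position in the page where the link occurs. The sketch below is an assumption-laden illustration, not the method from the talk: the generalization rules (digits replaced by `<num>`, query values replaced by `*`), the use of a DOM path string as the location, and the example URLs are all hypothetical.

```python
import re
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url):
    """Generalize a URL: numbers in the path become <num>, query values become *."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "<num>", parts.path)
    keys = sorted(key for key, _ in parse_qsl(parts.query, keep_blank_values=True))
    query = "&".join(f"{key}=*" for key in keys)
    return path + ("?" + query if query else "")


def link_signature(url, dom_path):
    """A link is described by its URL pattern together with where it sits in the page."""
    return (url_pattern(url), dom_path)


# Hypothetical links found at the same place in two thread-list pages.
first = link_signature("http://forum.example.com/showthread.php?t=123&page=2",
                       "html/body/table/tr/td/a")
second = link_signature("http://forum.example.com/showthread.php?t=987&page=1",
                        "html/body/table/tr/td/a")
print(first == second)
```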
  28. Inner-Page Features
     • The inclusion relation. Data records usually have inclusion relations.
     • The alignment relation. Since data is generated from a database and presented via templates, data records with the same label may appear repeatedly in a page.
     • Time order. Since post records are generated sequentially along a timeline, the post times should be sorted in ascending or descending order.
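The time-order feature lends itself to a small check: given the timestamps extracted from candidate post records in page order, test whether they are monotonic. The function names and the timestamp format below are assumptions for illustration only.

```python
from datetime import datetime


def parse_times(raw_times, fmt="%Y-%m-%d %H:%M"):
    # The format string is an assumption; real forums use many date formats.
    return [datetime.strptime(raw, fmt) for raw in raw_times]


def is_time_ordered(timestamps):
    """True if the sequence is sorted ascending or descending (ties allowed)."""
    pairs = list(zip(timestamps, timestamps[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


# Hypothetical candidates: the first looks like real post times, the second does not.
post_times = parse_times(["2010-05-22 09:00", "2010-05-22 09:30", "2010-05-22 10:15"])
noise = parse_times(["2010-05-22 09:00", "2009-01-01 12:00", "2010-05-22 10:15"])

print(is_time_ordered(post_times), is_time_ordered(noise))
```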
  29-31. Inner-vertex Features
  32-34. Inter-vertex Features
  36-39. Problem Setting
     • Labels to extract: Author, Title, Content
  40-42. Formulas of list page
     • Formulas for identifying list record
     • Formulas for identifying list title
  43-44. Formulas of post page
     • Formulas for identifying post record
     • Formulas for identifying post author
  45. Formulas of post page
     • Formulas for identifying post time
     • Formulas for identifying post content
  47. Markov Logic Networks
     • An MLN can be viewed as a template for constructing Markov Random Fields.
     • With a set of formulas and constants, an MLN defines a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by a formula shown on the slide (not captured in this transcript).
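For reference, the standard MLN distribution from the literature, which is presumably what the slide displayed, is written out below. The symbols are the usual ones and are assumptions with respect to this talk: $w_i$ is the weight of formula $i$, $n_i(x)$ is the number of true groundings of formula $i$ in state $x$, and $Z$ is the normalizing constant.

```latex
P(X = x) = \frac{1}{Z} \exp\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\Big( \sum_{i} w_i \, n_i(x') \Big)
```

Each ground formula contributes one feature, so a state that satisfies more groundings of a positively weighted formula becomes exponentially more probable.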
  48. Markov Logic Networks
     • Divide DOM tree elements into three categories:
       – Text element
       – Hyperlink element
       – Inner element
     • Benefit
       – Reduces the number of possible groundings in inference.
       – Reduces ambiguity and achieves better performance.
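A minimal sketch of the three-way split follows. The slide does not define the categories, so the rules here are assumptions: an `a` tag is a hyperlink element, a leaf carrying non-empty text is a text element, and everything else is an inner element. The function name and sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def categorize(element):
    """Assign a DOM element to one of three categories (assumed definitions)."""
    if element.tag == "a":
        return "hyperlink"
    if len(element) == 0 and (element.text or "").strip():
        return "text"  # leaf node that carries visible text
    return "inner"  # container node, or an empty leaf


# Well-formed sample markup so the standard-library XML parser can read it.
snippet = "<div><a href='u1'>alice</a><span>Re: question</span><div><p>body</p></div></div>"
root = ET.fromstring(snippet)
labels = [(el.tag, categorize(el)) for el in root.iter()]
print(labels)
```

Typing elements this way means a predicate about, say, post authors only has to be grounded over the elements of the relevant category rather than over every node in the tree.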
  49-55. Experiments
     • List Pages, Post Pages
  56-57. Future works
     • http://discussions.apple.com/
  58. Conclusion
     • A template-independent approach to extracting structured data from web forum sites.
     • We can leverage the power of site-level information, such as the mutual information among pages within or across vertices of the sitemap.
     • http://research.microsoft.com/people/jmyang/