Incorporating site level knowledge for incremental crawling of web forums - a list-wise strategy

Transcript

  • 1. Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy. Jiang-Ming Yang, Rui Cai, Lei Zhang, and Wei-Ying Ma (Microsoft Research, Asia); Chun-song Wang (University of Wisconsin-Madison); Hua Huang (Beijing University of Posts and Telecommunications). Saturday, May 22, 2010
  • 2. Web Forums, Web Search, Q & A, Social Network. Forums are a huge resource of human knowledge!
  • 7. Forum Data Crawl and Mining. Layers: Crawling, Data Parsing, Content Mining. Publications: iRobot: Sitemap Reconstruction (WWW 2008); Exploring Traversal Strategy (SIGIR 2008); Automation Data Parsing (WWW 2009); Expert Finding & Junk Detection (SIGIR 2009); Incremental Crawling (KDD 2009); User Behavior in Forums (KDD 2009).
  • 10. Characteristics of Forums: Index Page, Post Page.
  • 14. Incremental Crawling. General web pages: each page is treated independently, i.e., page-wise. Forum pages: pagination is taken into account, i.e., list-wise.
  • 16. Our Solution. Incorporating site-level knowledge: how many kinds of pages a website has, and how the various pages link to each other. Purposes: distinguish index and post pages. Pipeline: Sitemap Construction → List Construction & Classification → Timestamp Extraction → Prediction Models → Bandwidth Control.
  • 17. Pipeline: Sitemap Construction → List Construction & Classification → Timestamp Extraction → Prediction Models → Bandwidth Control.
  • 18. Forum Sitemap. A sitemap is a directed graph consisting of a set of vertices and links (http://forums.asp.net).
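The directed-graph view of a sitemap can be sketched as follows. This is a minimal illustration, assuming vertices stand for page types and links carry a label; the class name `Sitemap`, the vertex names, and the link labels are hypothetical and do not come from the slides.

```python
from collections import defaultdict


class Sitemap:
    """Directed graph: vertices are page types, links are labelled edges."""

    def __init__(self):
        # source vertex -> {target vertex -> set of link labels}
        self.edges = defaultdict(lambda: defaultdict(set))
        self.vertices = set()

    def add_link(self, source, target, label):
        self.vertices.update((source, target))
        self.edges[source][target].add(label)

    def successors(self, vertex):
        """Vertices reachable from `vertex` in one step, sorted by name."""
        return sorted(self.edges[vertex])


# Hypothetical forum structure, for illustration only.
sitemap = Sitemap()
sitemap.add_link("board_list", "thread_list", "board link")
sitemap.add_link("thread_list", "thread_list", "next page")
sitemap.add_link("thread_list", "post_page", "thread title")
sitemap.add_link("post_page", "post_page", "next page")
```

The self-loops labelled "next page" are where pagination shows up in such a graph, which is the structure the list-wise strategy relies on.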
  • 20. Page Layout Clustering. Forum pages are generated from a database and templates. Layout is a robust way to describe a template: it can be characterized by the HTML elements in different DOM paths (e.g., repetitive patterns). Reference: Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008.
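One way to characterize layout by DOM paths is sketched below. It is only an illustration under stated assumptions: the Jaccard similarity over sets of tag paths is a stand-in chosen for brevity, not the measure used in iRobot, and all function and class names are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_paths(html):
    collector = PathCollector()
    collector.feed(html)
    return collector.paths


def layout_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' DOM path sets."""
    a, b = dom_paths(html_a), dom_paths(html_b)
    return len(a & b) / len(a | b) if a | b else 1.0


thread_a = "<html><body><table><tr><td>Post 1</td></tr><tr><td>Post 2</td></tr></table></body></html>"
thread_b = "<html><body><table><tr><td>Other</td></tr></table></body></html>"
index = "<html><body><ul><li><a href='#'>Board</a></li></ul></body></html>"
```

Pages rendered from the same template (`thread_a`, `thread_b`) share all their paths regardless of how many records they hold, while a page from a different template (`index`) shares only the outer skeleton.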
  • 22. Link Analysis. References: Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008. Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai, and Lei Zhang. Exploring Traversal Strategy for Web Forum Crawling. In Proceedings of SIGIR 2008.
  • 24. Identify Index & Post Nodes. An SVM-based classifier: site-independent; features: node size, link structure, keywords. Node classification is more robust than page classification: it is robust to noise on individual pages.
  • 25. List Reconstruction. Given a new page: (1) classify it into a node; (2) detect pagination links; (3) find out the link order.
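Steps 2 and 3 above can be sketched as follows, assuming pagination is expressed through a numeric query parameter. The parameter name `page`, the function names, and the `forum.example` URLs are hypothetical; real forums use many other pagination schemes.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit


def list_key_and_order(url, page_param="page"):
    """Split a URL into (list identity, position within the list)."""
    parts = urlsplit(url)
    page = 1  # a missing page parameter is assumed to mean the first page
    rest = []
    for name, value in parse_qsl(parts.query):
        if name == page_param and value.isdigit():
            page = int(value)
        else:
            rest.append((name, value))
    key = (parts.netloc, parts.path, urlencode(sorted(rest)))
    return key, page


def reconstruct_lists(urls, page_param="page"):
    """Group URLs into lists and sort each list by page order."""
    lists = defaultdict(list)
    for url in urls:
        key, page = list_key_and_order(url, page_param)
        lists[key].append((page, url))
    return {key: [u for _, u in sorted(pages)] for key, pages in lists.items()}


urls = [
    "http://forum.example/thread?id=7&page=3",
    "http://forum.example/thread?id=7",
    "http://forum.example/thread?id=7&page=2",
    "http://forum.example/thread?id=9",
]
result = reconstruct_lists(urls)
ordered = result[("forum.example", "/thread", "id=7")]
```

Here `ordered` holds the three pages of thread 7 in reading order, so the crawler can treat them as one list rather than three unrelated pages.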
  • 30. Timestamp Extraction. Distinguish real timestamps from noise: the temporal order can help!
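The idea that temporal order separates real timestamps from noise can be sketched as below. This is an assumption-laden illustration, not the authors' algorithm: it assumes each record on a list yields several date candidates at fixed positions, and it scores each position by how well its values follow the record order. The field names and dates are invented.

```python
from datetime import datetime


def order_score(values):
    """Fraction of adjacent pairs that are in non-decreasing order."""
    pairs = list(zip(values, values[1:]))
    if not pairs:
        return 0.0
    return sum(a <= b for a, b in pairs) / len(pairs)


def pick_timestamp_field(records, fmt="%Y/%m/%d"):
    """Return the candidate field whose values best follow temporal order.

    `records` is a list of dicts mapping a candidate field name to a date
    string, in the order the records appear on the list.
    """
    best_field, best_score = None, -1.0
    for field in records[0]:
        values = [datetime.strptime(r[field], fmt) for r in records]
        # Accept either ascending or descending lists.
        score = max(order_score(values), order_score(values[::-1]))
        # A constant column (e.g. a page-generation date) carries no signal.
        if len(set(values)) == 1:
            score = 0.0
        if score > best_score:
            best_field, best_score = field, score
    return best_field


records = [
    {"post_time": "2008/01/03", "join_date": "2005/06/01", "page_date": "2008/06/30"},
    {"post_time": "2008/01/05", "join_date": "2003/02/11", "page_date": "2008/06/30"},
    {"post_time": "2008/02/10", "join_date": "2007/09/20", "page_date": "2008/06/30"},
    {"post_time": "2008/03/01", "join_date": "2004/12/25", "page_date": "2008/06/30"},
]
chosen = pick_timestamp_field(records)
```

User join dates are scattered and the page date never changes, so only the post time follows the order of the records.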
  • 32. Feature Extraction. Features to describe update frequency: list-dependent and list-independent (site-level statistics); absolute and relative.
  • 34. Regression Model. Linear regression; advantages: lightweight computational cost, efficient for online processing. Predict when the next new record arrives, using CT (current time) and LT (last (re-)visit time by the crawler).
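A one-feature linear regression is enough to illustrate the prediction step. The sketch below assumes a single hypothetical feature (the average gap between recent records, in hours) and a hypothetical way of combining the prediction with CT and LT; the slides list feature families but do not give the actual model, so none of this should be read as the authors' formulation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_gap(model, recent_gap):
    """Predicted time until the next new record, in the training unit."""
    intercept, slope = model
    return intercept + slope * recent_gap


def expected_new_records(ct, lt, gap):
    """Rough count of records expected between last visit LT and now CT."""
    return max(0.0, (ct - lt) / gap)


# Invented training data: recent average gap -> observed next gap (hours).
model = fit_line([1.0, 2.0, 4.0, 8.0], [1.5, 2.5, 4.5, 8.5])
gap = predict_gap(model, 3.0)
pending = expected_new_records(ct=20.0, lt=6.0, gap=gap)
```

Fitting is done offline; at crawl time the prediction is one multiplication and one addition per list, which matches the lightweight, online-friendly property the slide highlights.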
  • 37. Bandwidth Control. Index and post pages are quite different. Quantity: index < 10%, post > 90%. Average update frequency: index high, post low. Number of re-crawl pages: index small, post large. Post pages block the bandwidth, so new threads cannot be discovered in time. A simple but practical solution.
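The transcript does not say what the "simple but practical solution" is. One plausible reading, sketched here purely as an assumption, is to reserve a fixed share of the crawl budget for index pages so that post pages cannot crowd them out. The function name and the `index_share` parameter are hypothetical.

```python
def allocate_budget(index_queue, post_queue, budget, index_share=0.3):
    """Pick pages to crawl, reserving part of the budget for index pages.

    Each queue is a list of (priority, url); higher priority goes first.
    `index_share` is a hypothetical tuning knob, not a value from the slides.
    """
    index_sorted = sorted(index_queue, reverse=True)
    post_sorted = sorted(post_queue, reverse=True)
    index_quota = min(len(index_sorted), int(budget * index_share))
    post_quota = min(len(post_sorted), budget - index_quota)
    # Hand any capacity the post queue cannot use back to index pages.
    index_quota = min(len(index_sorted), budget - post_quota)
    chosen = index_sorted[:index_quota] + post_sorted[:post_quota]
    return [url for _, url in chosen]


index_queue = [(0.9, "index/1"), (0.4, "index/2")]
post_queue = [(0.95, "post/1"), (0.8, "post/2"), (0.7, "post/3"), (0.6, "post/4")]
selected = allocate_budget(index_queue, post_queue, budget=4, index_share=0.5)
```

With a single shared priority queue the low-priority `index/2` would lose its slot to `post/3`; with the reservation it is still visited, which is how new threads get discovered in time.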
  • 38. Experiment Setup. 18 web forums in diverse categories: March 1999 ~ June 2008; 990,476 pages and 5,407,854 posts. Simulation: repeatable and controllable. Comparison: list-wise strategy (LWS); LWS with bandwidth control (LWS + BC); curve-fitting policy (CF); bound-based policy (BB, WWW 2008); oracle (the ideal case).
  • 39. Measurements. Bandwidth utilization: I_new = #pages with new information; I_B = #pages crawled. Coverage: I_crawl = #new posts crawled; I_all = #new posts published on forums. Timeliness: ∆t_i = #minutes between publish and download.
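The formulas on this slide did not survive extraction. The sketch below assumes the natural readings of the definitions: utilization as I_new / I_B, coverage as I_crawl / I_all, and timeliness as the mean of the ∆t_i values. The function names and the sample numbers are invented for illustration.

```python
def bandwidth_utilization(pages_with_new_info, pages_crawled):
    """Assumed form: I_new / I_B."""
    return pages_with_new_info / pages_crawled


def coverage(new_posts_crawled, new_posts_published):
    """Assumed form: I_crawl / I_all."""
    return new_posts_crawled / new_posts_published


def timeliness(delays_minutes):
    """Assumed form: mean delay in minutes between publish and download."""
    return sum(delays_minutes) / len(delays_minutes)


# Invented numbers, for illustration only.
utilization = bandwidth_utilization(1500, 3000)
covered = coverage(900, 1000)
mean_delay = timeliness([30, 90, 60])
```

Higher utilization and coverage are better, while a lower mean delay means new posts reach the index sooner.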
  • 43. Performance Comparison. Warm-up stage; bandwidth: 3000 pages/day.
  • 47. Performance Comparison (Cont.). Comparison with various bandwidths.
  • 48. Performance Comparison (Cont.). Detailed performance of index and post pages; bandwidth: 3000 pages/day.
  • 49. Conclusions and Future Work. Targeted web forums, a specific but interesting field. Developed an effective solution for incremental forum crawling: integrating site-level knowledge; some practical engineering implementation. Future work: improve the timestamp extraction algorithm; use a stronger prediction model than linear regression.
