Incorporating site level knowledge for incremental crawling of web forums - a list-wise strategy

Transcript of "Incorporating site level knowledge for incremental crawling of web forums - a list-wise strategy"

  1. Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy
     Jiang-Ming Yang, Rui Cai, Lei Zhang, and Wei-Ying Ma (Microsoft Research, Asia)
     Chun-song Wang (University of Wisconsin-Madison)
     Hua Huang (Beijing University of Posts and Telecommunications)
     Saturday, May 22, 2010
  2. Web Forums
     Web Search, Q & A, Social Network
     Forums are a huge resource of human knowledge!
  7. Forum Data Crawl and Mining
     Crawling, Data Parsing, Content Mining
     • Incremental Crawling (KDD 2009)
     • Exploring Traversal Strategy (SIGIR 2008)
     • iRobot: Sitemap Reconstruction (WWW 2008)
     • Expert Finding & Junk Detection (SIGIR 2009)
     • Automation Data Parsing (WWW 2009)
     • User Behavior in Forums (KDD 2009)
  8. Characteristics of Forums
     Index Page
     Post Page
  13. Incremental Crawling
      • General Web Pages
        – Treating each page independently, i.e., page-wise
      • Forum Pages
        – Considering pagination, i.e., list-wise
  16. Our Solution
      • Incorporating Site-level Knowledge
        – How many kinds of pages a website has
        – How the various pages link to each other
      • Purposes
        – Distinguish index and post pages
      Sitemap Construction | List Construction & Classification | Timestamp Extraction | Prediction Models | Bandwidth Control
  17. Sitemap Construction | List Construction & Classification | Timestamp Extraction | Prediction Models | Bandwidth Control
  18. Forum Sitemap
      • A sitemap is a directed graph consisting of a set of vertices and links
      http://forums.asp.net
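The sitemap definition above can be sketched as a small adjacency structure. This is a minimal illustration only; the class name `Sitemap` and the vertex labels (`board_list`, `thread_list`, `post_page`) are hypothetical and do not come from the slides.

```python
from collections import defaultdict


class Sitemap:
    """Directed graph: vertices are groups of pages, links are directed edges."""

    def __init__(self):
        self.vertices = set()
        self.links = defaultdict(set)  # source vertex -> set of target vertices

    def add_link(self, src, dst):
        # Register both endpoints and the directed edge src -> dst.
        self.vertices.update((src, dst))
        self.links[src].add(dst)

    def successors(self, vertex):
        # Vertices reachable in one hop, sorted for stable output.
        return sorted(self.links.get(vertex, ()))


sitemap = Sitemap()
sitemap.add_link("board_list", "thread_list")
sitemap.add_link("thread_list", "thread_list")  # a self-loop models pagination
sitemap.add_link("thread_list", "post_page")
```

A self-loop is one simple way to represent "next page" links within the same kind of page; a real crawler would likely attach more information to each edge.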
  20. Page Layout Clustering
      • Forum pages are based on database & template
      • Layout is robust to describe template
        – Layout can be characterized by the HTML elements in different DOM paths (e.g. repetitive patterns)
      Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008 Conference
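To make the idea of "HTML elements in different DOM paths" concrete, here is a rough sketch that collects the set of tag paths on a page and compares two pages by Jaccard overlap. The Jaccard measure, the function names, and the sample pages are assumptions for illustration; this is not the clustering method of the iRobot paper.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Collects tag paths such as 'html/body/table/tr/td' while parsing."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no close

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates sloppy HTML).
            while self.stack and self.stack.pop() != tag:
                pass


def layout_signature(html):
    collector = PathCollector()
    collector.feed(html)
    return collector.paths


def layout_similarity(html_a, html_b):
    a, b = layout_signature(html_a), layout_signature(html_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


thread_a = "<html><body><table><tr><td>A</td></tr><tr><td>B</td></tr></table></body></html>"
thread_b = "<html><body><table><tr><td>C</td></tr></table></body></html>"
profile = "<html><body><div><p>bio</p></div></body></html>"
```

Pages rendered from the same template share the same tag paths even when their text differs, which is why the two thread pages above match exactly while the profile page does not.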
  22. Link Analysis
      Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008 Conference
      Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai and Lei Zhang. Exploring Traversal Strategy for Web Forum Crawling. In Proceedings of SIGIR 2008 Conference
  23. Sitemap Construction | List Construction & Classification | Timestamp Extraction | Prediction Models | Bandwidth Control
  24. Identify Index & Post Nodes
      • An SVM-based Classifier
        – Site independent
        – Features
          • Node size
          • Link structure
          • Keywords
      • Node classification is more robust than page classification
        – Robust to noise on individual pages
  25. List Reconstruction
      • Given a new page
        1. Classify into a node
        2. Detect pagination links
        3. Find out link orders
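Steps 2 and 3 above can be sketched as follows, assuming step 1 (node classification) has already been done and that pagination shows up as a hypothetical `page=N` query parameter. The URLs and function names are illustrative; real forums use many pagination schemes, and the slides do not say how the links are detected.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def split_pagination(url, page_param="page"):
    """Return (url without the page parameter, page number); page defaults to 1."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    page = int(query.pop(page_param, ["1"])[0])
    base = urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
    return base, page


def reconstruct_lists(urls):
    """Group paginated URLs into lists and order each list by page number."""
    lists = defaultdict(list)
    for url in urls:
        base, page = split_pagination(url)
        lists[base].append((page, url))
    return {base: [u for _, u in sorted(pages)] for base, pages in lists.items()}


crawled = [
    "http://forum.example/thread?id=7&page=2",
    "http://forum.example/thread?id=7",
    "http://forum.example/thread?id=7&page=3",
    "http://forum.example/thread?id=9",
]
lists = reconstruct_lists(crawled)
```

Each resulting list corresponds to one thread (or one board index) spread over several pages, which is the unit the list-wise strategy schedules.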
  26. YYYY/MM/DD
      Sitemap Construction | List Construction & Classification | Timestamp Extraction | Prediction Models | Bandwidth Control
  29. Timestamp Extraction
      • Distinguish real timestamps from noise
        – The temporal order can help!
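One way to read "the temporal order can help": among several date-like fields found in each record, the real post timestamp should be monotonic along the list, while noise such as a user's join date should not. The sketch below assumes that reading; the field names `posted` and `joined` and the sample data are hypothetical.

```python
from datetime import datetime


def parse_day(text):
    # Dates are assumed to use the YYYY/MM/DD format.
    return datetime.strptime(text, "%Y/%m/%d")


def is_ordered(times):
    """True if the sequence is entirely non-decreasing or non-increasing."""
    pairs = list(zip(times, times[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def ordered_fields(records):
    """Names of candidate fields whose values follow the record order."""
    fields = records[0].keys()
    return [name for name in fields
            if is_ordered([parse_day(r[name]) for r in records])]


# Records in the order they appear on a post page.
records = [
    {"posted": "2010/05/01", "joined": "2007/03/12"},
    {"posted": "2010/05/03", "joined": "2009/11/30"},
    {"posted": "2010/05/07", "joined": "2005/01/08"},
]
real = ordered_fields(records)
```

Here only `posted` survives the ordering check, so it would be kept as the timestamp field.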
  31. Sitemap Construction | List Construction & Classification | Timestamp Extraction | Prediction Models | Bandwidth Control
  32. Feature Extraction
      • Features to describe update frequency
        – List-dependent & independent (site-level statistics)
        – Absolute & Relative
  34. Regression Model
      • Linear regression
        – Advantages
          Lightweight computational cost
          Efficient for online processing
      • Predict when the next new record arrives
        – CT: current time
        – LT: last (re-)visit time by crawler
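As a rough sketch of how a linear model with CT and LT might be used: fit a line that predicts the interval until the next record, then divide the time elapsed since the last visit by that interval to estimate how many new records are waiting. The single feature (previous interval), the training numbers, and the `(CT - LT) / interval` estimate are all assumptions; the slide does not give the actual formula.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expected_new_records(ct, lt, predicted_interval):
    """Hypothetical estimate: elapsed time since last visit / arrival interval."""
    return (ct - lt) / predicted_interval


# Hypothetical training data: previous interval -> next interval (minutes).
previous = [10, 20, 30, 40]
following = [12, 22, 32, 42]
slope, intercept = fit_line(previous, following)

predicted = slope * 25 + intercept  # a list whose last interval was 25 minutes
pending = expected_new_records(ct=1000, lt=730, predicted_interval=predicted)
```

Fitting and prediction are each a handful of arithmetic operations, which matches the slide's point about lightweight cost for online processing.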
  35. Sitemap Construction | List Construction & Classification | Timestamp Extraction | Prediction Models | Bandwidth Control
  37. Bandwidth Control
      • Index and post pages are quite different

                                  Index     Post
        Quantity                  < 10%     > 90%
        Avg. Update Frequency     high      low
        Num. Re-crawl Pages       small     large

      • Post pages block the bandwidth
        – Cannot discover new threads in time
        – A simple but practical solution
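The transcript does not say what the "simple but practical solution" is. One plausible reading, sketched here purely as an assumption, is to reserve a fixed share of the daily budget for index pages so that post pages cannot starve them. The function name and the 30% share are hypothetical; the 3000 pages / day figure is the bandwidth quoted in the experiments.

```python
def allocate_budget(index_queue, post_queue, daily_budget, index_percent=30):
    """Split a daily page budget between index and post pages.

    index_percent is a hypothetical reserved share for index pages;
    any budget the post pages cannot use is handed back to index pages.
    """
    index_quota = min(len(index_queue), daily_budget * index_percent // 100)
    post_quota = min(len(post_queue), daily_budget - index_quota)
    index_quota = min(len(index_queue), daily_budget - post_quota)
    return index_queue[:index_quota], post_queue[:post_quota]


# Queues are assumed to be sorted by crawl priority already.
index_pages = [f"index-{i}" for i in range(1000)]
post_pages = [f"post-{i}" for i in range(5000)]
chosen_index, chosen_post = allocate_budget(index_pages, post_pages, daily_budget=3000)
```

With these numbers, 900 index pages and 2100 post pages are scheduled, so new threads keep being discovered even when many post pages are due for a re-crawl.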
  38. Experiment Setup
      • 18 web forums in diverse categories
        – March 1999 ~ June 2008
        – 990,476 pages and 5,407,854 posts
      • Simulation
        – Repeatable and controllable
      • Comparison
        – List-wise strategy (LWS)
        – LWS with bandwidth control (LWS + BC)
        – Curve-fitting policy (CF)
        – Bound-based policy (BB, WWW 2008)
        – Oracle (the ideal case)
  39. Measurements
      • Bandwidth Utilization
        – I_new: # pages with new information
        – I_B: # pages crawled
      • Coverage
        – I_crawl: # new posts crawled
        – I_all: # new posts published on forums
      • Timeliness
        – Δt_i: # minutes between publish and download
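The three measurements can be computed as below, assuming each is the natural ratio or mean of the quantities listed (I_new / I_B, I_crawl / I_all, and the average of Δt_i). The slide lists only the quantities, so the exact formulas and the sample numbers here are assumptions.

```python
def bandwidth_utilization(pages_with_new_info, pages_crawled):
    """I_new / I_B: share of crawled pages that carried new information."""
    return pages_with_new_info / pages_crawled


def coverage(new_posts_crawled, new_posts_published):
    """I_crawl / I_all: share of newly published posts that were crawled."""
    return new_posts_crawled / new_posts_published


def timeliness(delays_minutes):
    """Mean of the delays between publication and download, in minutes."""
    return sum(delays_minutes) / len(delays_minutes)


utilization = bandwidth_utilization(1500, 3000)
covered = coverage(900, 1000)
average_delay = timeliness([10, 20, 30])
```

Higher utilization and coverage are better, while a lower average delay is better.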
  40. Performance Comparison
      • Warm-up Stage
        – Bandwidth: 3000 pages / day
  44. Performance Comparison (Cont.)
      • Comparison under various bandwidths
  48. Performance Comparison (Cont.)
      • Detailed performance of Index and Post pages
        – Bandwidth: 3000 pages / day
  49. Conclusions and Future Work
      • Targeted web forums, a specific but interesting field
      • Developed an effective solution for incremental forum crawling
        – Integrating site-level knowledge
        – Some practical engineering implementation
      • Future work
        – Improve the timestamp extraction algorithm
        – Find a stronger prediction model than linear regression