20081009 meeting
Published in: Technology

Transcript

  • 1. Blog Post and Comment Extraction Using Information Quantity of Web Format. 4th Asia Information Retrieval Symposium, AIRS 2008. Reporter: Che-Min Liao
  • 2. Framework of Blog Extraction
      • The framework of blog extraction includes two stages:
          • Locating the main text (post and comments)
          • Finding the separator between the post and the comments
  • 3. Locating Main Text
      • Based on the DOM tree structure, the task of locating the main text is to find the minimal subtree that contains the main text.
      • Three kinds of noise appear in blog pages:
          • Advertisements
          • Some useful links (e.g. blogrolls)
          • Some routine texts (e.g. "about the author")
  • 4. Important Features of the Main Text
      • The main text of most blogs occupies the largest visual area compared with its siblings in the DOM tree.
          • Use the CSS styles in the HTML to obtain the visual information.
      • The main text of most blogs contains more words than the other routine texts.
          • Calculate the effective text information in each node of the DOM tree.
          • $W_a$ is the number of words.
          • $W_e$ is the number of words without links in the text.
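The two word counts from slide 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "words" are whitespace-separated tokens and that "words without links" means text outside any `<a>` element. The names `WordCounter` and `count_words` are hypothetical. The transcript does not give the formula that combines the two counts into the effective text information, so the sketch stops at the counts.

```python
from html.parser import HTMLParser


class WordCounter(HTMLParser):
    """Counts all words (W_a) and words outside <a> elements (W_e).

    Assumption: a word is a whitespace-separated token.
    """

    def __init__(self):
        super().__init__()
        self.link_depth = 0  # > 0 while the parser is inside an <a> element
        self.w_a = 0         # all words in the fragment
        self.w_e = 0         # words that are not part of link text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.split())
        self.w_a += n
        if self.link_depth == 0:
            self.w_e += n


def count_words(html_fragment):
    """Return (W_a, W_e) for the HTML of one DOM node."""
    parser = WordCounter()
    parser.feed(html_fragment)
    parser.close()
    return parser.w_a, parser.w_e


post = '<div>Hello world <a href="#">click here now</a> bye</div>'
blogroll = '<ul><li><a href="/a">Blog A</a></li><li><a href="/b">Blog B</a></li></ul>'
print(count_words(post))      # (6, 3)
print(count_words(blogroll))  # (4, 0)
```

A link-heavy node such as a blogroll ends up with few or no non-link words, which is the intuition behind separating $W_e$ from $W_a$.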
  • 5. Locating Main Text Algorithm
  • 6. Example
  • 7. Finding Separator
      • Information quantity:
          • $H(X_i) = -\log_2 P(X_i)$
      • Assume that there are $m$ kinds of HTML tags in a blog page. The information quantity of a separator is defined as follows:
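The per-tag information quantity $H(X_i) = -\log_2 P(X_i)$ from slide 7 can be sketched as follows. This is an illustration under a stated assumption: $P(X_i)$ is estimated as the relative frequency of tag $X_i$ among all start tags in the page, which the transcript does not confirm. The names `TagCollector` and `tag_information` are hypothetical, and the separator-level definition is not reproduced because it is missing from the transcript.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every start tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_information(html_page):
    """Return {tag: H(tag)} with H = -log2(P).

    Assumption: P(tag) is the tag's share of all start tags in the page.
    """
    collector = TagCollector()
    collector.feed(html_page)
    collector.close()
    counts = Counter(collector.tags)
    total = sum(counts.values())
    return {tag: -math.log2(n / total) for tag, n in counts.items()}


info = tag_information("<div><p>a</p><p>b</p><hr></div>")
print(info)  # {'div': 2.0, 'p': 1.0, 'hr': 2.0}
```

Under this estimate, tags that occur rarely in a page carry more information than tags that occur often.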
  • 8. Finding Separator Algorithm
  • 9. Experiment
      • The experiment uses the standard blog corpus from the blog track of TREC 2006.
          • Choose all the permalinks from the data crawled on December 7, 2005.
          • Select the blog pages in the top 100 domains as test data.
          • Download the CSS style file of each page.
          • After eliminating the pages without CSS styles, 25910 blog pages remain.
  • 10. Corpus Distribution
  • 11. Experiment Result
      • Four kinds of precision are defined as follows:
  • 12. Performance Comparison of Three Locating Main Text Algorithms
  • 13. Performance of finding separator algorithm
