Survey on article extraction and comment monitoring techniques

Online news publishers publish their news in the form of articles. Most online news websites let users comment on news articles, and as a result many people do. A news web page therefore contains a large amount of data in the form of article content and comments, and it has good potential as a resource for many information retrieval systems and data mining applications. Extracting the main content (the article content) from a web page has always been a challenging task because a web page also contains other information, such as advertisements and hyperlinks, that is not related to the article text. In this survey, we review techniques proposed by various researchers to extract the article content from a news website. We also review techniques that monitor and analyze comments for applications such as predicting the popularity of articles and identifying discussion threads in the comments data.

  1. 1. Submitted by Ankur Kumar Agrawal M.Tech(CS)-II Year 13535009 Under the guidance of Dr. Dhaval Patel
  2. 2. Introduction; Why Article Extraction and Comments Monitoring?; Challenges in Article Extraction and Comments Monitoring; Article Extraction Techniques (Learning Based Techniques, Heuristic Techniques, Visual Based Approach); Comments Monitoring (News Article Popularity Prediction, Extracting Discussion Structure); Conclusion
  3. 3. What is an Article on a news web page? Online news sources publish their news in the form of articles. An article describes a particular event that has happened. The main content of a news web page is the article content. Other content on the page, such as hyperlinks, images, and side banners, is considered noise content. What are Comments? Comments are the reactions of citizens to the article published by the news media.
  4. 4. [Figure: screenshot of a news web page with region 1 marked as Article Text and region 2 marked as Comments.]
  5. 5. Article extraction can be used in: information retrieval systems; search engines like Google and Yahoo (indexing on article content to give the best search results); news aggregator systems like Google News. Comments monitoring can be used for: news article popularity prediction; advertisement agencies; news agencies; debate identification; sentiment analysis and opinion mining.
  6. 6. [Figure: screenshot of a news web page with region 1 marked as Article Text and the regions marked 2 as Noise Content: menus, advertisements, side banners, and hyperlinks.]
  7. 7. Public comments are not available for every news source; only some websites provide their comments data. It is difficult to apply standard NLP techniques to comments, since comments may not be syntactically correct.
  8. 8. Heuristic Based Techniques Learning Based Techniques Visual Based Techniques
  9. 9. Parsed news web page → apply heuristics on the parsed document → article text content as output.
  10. 10. The web page is processed as a DOM tree. The DOM tree represents each tag as a node object in a tree. Two important factors in heuristic techniques are Text Count and Link Count. Text Count: the number of words in the text of a node. Link Count: the number of links in the subtree rooted at a node.
  11. 11. Node structure, labelled Node Name (Text Count, Link Count):
      Html (7,1)
        Head (1,0)
        Body (6,1)
          DIV (5,1)
            P (3,0)
              This is (2,0)
              Article (1,0)
            A (1,1)
              More detail (1,1)
            P (1,0)
              Text (1,0)
          DIV (1,0)
            P (1,0)
              Noise (1,0)
  12. 12. For each node of the DOM tree, a Basic Score is calculated using the following formula: $\text{Basic Score} = \frac{\text{Text Count} - \text{Link Count}}{\text{Text Count}}$. The node with the maximum Basic Score is selected as the probable node containing the article text. If multiple nodes share the same maximum score, the one that is higher in level is selected. Drawback: the score favors nodes that have a small text count and no links.
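
The Basic Score computation can be sketched in a few lines of Python. This is a minimal illustration, not the surveyed implementation: the `Node` class and the helper names are hypothetical, and treating every link as one word and one link is an assumption read off the A (1,1) node in the example tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified DOM node: a tag, its own text, and child nodes."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def counts(node):
    """Return (text_count, link_count) for the subtree rooted at node."""
    if node.tag == "a":
        # Assumption taken from the example tree's A (1,1):
        # a link contributes one word and one link.
        return 1, 1
    text_count, link_count = len(node.text.split()), 0
    for child in node.children:
        child_text, child_links = counts(child)
        text_count += child_text
        link_count += child_links
    return text_count, link_count


def basic_score(node):
    """(Text Count - Link Count) / Text Count, or 0.0 for a node without text."""
    text_count, link_count = counts(node)
    return (text_count - link_count) / text_count if text_count else 0.0


# Rebuild the Body subtree of the example: an article DIV and a noise DIV.
article_div = Node("div", children=[
    Node("p", "This is Article"),
    Node("a", "More detail"),
    Node("p", "Text"),
])
noise_div = Node("div", children=[Node("p", "Noise")])
body = Node("body", children=[article_div, noise_div])

for name, node in [("body", body), ("article div", article_div), ("noise div", noise_div)]:
    print(name, counts(node), round(basic_score(node), 2))
```

On this tree the noise DIV scores higher than the article DIV, which is the drawback the slide points out.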
  13. 13. [Figure: DOM tree after applying the Basic Score function. Nodes are labelled Name (Text Count, Link Count). The real article node DIV (5,1) scores 0.8 and Body (6,1) scores 0.83, while link-free nodes such as P (3,0), P (1,0) and DIV (1,0) score 1; among the nodes with the maximum score, one is selected as the article text node because it is higher in level.]
  14. 14. $\text{Score} = \text{Weight}_{ratio} \times \frac{\text{Text Count} - \text{Link Count}}{\text{Text Count}} + \text{Weight}_{text} \times \frac{\text{Text Count}}{\text{PageText}}$. Here one extra factor is added to the basic scoring function. The extra factor describes the fraction of the total text of the page that lies in a node. Optimal weights are then assigned to both factors. This extra factor removes the drawback of using only the basic scoring function.
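
As a rough sketch of the weighted score, the function below takes the two counts of a node and the total text of the page. The function name and the weights 0.5 and 0.5 are arbitrary illustrative assumptions; they are not the optimal weights the slides refer to.

```python
def weighted_score(text_count, link_count, page_text, weight_ratio, weight_text):
    """Weight_ratio * (text - links) / text + Weight_text * text / page_text."""
    if text_count == 0 or page_text == 0:
        return 0.0
    ratio = (text_count - link_count) / text_count  # the basic score
    share = text_count / page_text                  # fraction of the page text in this node
    return weight_ratio * ratio + weight_text * share


# Two candidate nodes on a page with 6 words of text in total:
# an article DIV (5,1) and a noise DIV (1,0).
PAGE_TEXT = 6
article = weighted_score(5, 1, PAGE_TEXT, weight_ratio=0.5, weight_text=0.5)
noise = weighted_score(1, 0, PAGE_TEXT, weight_ratio=0.5, weight_text=0.5)
print(round(article, 4), round(noise, 4))
```

With these illustrative weights the article DIV outranks the noise DIV, whereas the basic score alone ranks them the other way round (0.8 against 1.0).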
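  15. 15. [Figure: DOM tree after applying the weighted scoring function. Node values as shown on the slide: Html (6,1); Body (0.8, 0.83); DIV (0.8, 0.9953); P (1, 0.7); This is (1, 0.9333); Article (1, 0.9333); a (0, 0); More detail (0, 0); P (1, 0.91667); Text (1, 0.9166); DIV (1, 0.9408); P (1, 0.83); Noise (1, 0.91667). The real article text node is the one with the maximum score.]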
  16. 16. An experiment was performed on 1620 news articles from 27 different news sources. Using the Basic Score: precision is around 0.85 and recall is 0.02 (very poor). Using the modified weighted score function: precision is around 0.9562 (improved) and recall is 0.9088 (great improvement). Source: Jyotika Prasad et al., “CoreEx: Content Extraction from Online News Articles”.
  17. 17. Heuristic Based Techniques Learning Based Techniques Visual Based Techniques
  18. 18. This approach works in two steps. Step 1: learning is performed on a set of news web pages, and a model is built that identifies the location of article content and noise content. Step 2: a new web page is given as input to the model, and the article text is obtained.
  19. 19. Learning based technique: training dataset → model (learns common features of web pages to distinguish noise from the main article text content); target web page → model → output: article text.
  20. 20. The technique focuses on removing noise content from a news web page. Learning is performed on web pages from a single news source. The model builds a Style Tree after learning the common layout from all the web pages. The model (Style Tree) is then applied to a target web page from the same news source to classify noise nodes and content nodes.
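  21. 21. [Figure: DOM trees d1 (Html > Body > DIV, DIV with children P, IMG, P) and d2 (Html > Body > DIV, DIV with children a, P, BR, P) merged into one style tree; the shared Html, Body and DIV nodes carry a page count of 2, and the two distinct child sequences carry a count of 1 each.]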
  22. 22. Noise nodes and content nodes are identified based on the information (entropy) of each node. It is assumed that a node whose presentation style is the same across many pages is likely to be a noise node. If the presentation styles and actual content of a node are more diverse across pages, it is a probable content node.
  23. 23. If E is an element node and the number of pages that contain E is m, then $\text{NodeImp}(E) = \begin{cases} -\sum_{i=1}^{l} p_i \log p_i, & \text{if } m > 1 \\ 1, & \text{if } m = 1 \end{cases}$ where l denotes the number of child style nodes of E and $p_i$ is the probability that a web page uses the i-th style node in l.
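
A small sketch of the node importance computation follows. The function name and input format are hypothetical. Two assumptions are made because the slide does not state them: each $p_i$ is estimated as the share of the m pages that use the i-th style node, and the logarithm is taken to base m.

```python
import math


def node_importance(style_usage, m):
    """Entropy-style importance of an element node E.

    style_usage: for each child style node of E, the number of pages using it.
    m: number of pages that contain E (assumed to equal sum(style_usage)).
    """
    if m == 1:
        return 1.0
    importance = 0.0
    for pages in style_usage:
        if pages:  # a style that no page uses contributes nothing
            p = pages / m
            importance -= p * math.log(p, m)  # base m is an assumption
    return importance


# All 100 pages use the same style below E: no diversity, importance 0.
print(node_importance([100], m=100))
# Two pages, each with its own style below E: maximal diversity for m = 2.
print(node_importance([1, 1], m=2))
```

A node presented identically on every page gets importance 0 and is a noise candidate, while a node seen on a single page gets importance 1 by definition.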
  24. 24. [Figure: example style tree with root, body, Table, Tr, IMG, P, A and Text nodes, annotated with page counts such as 100, 35, 25 and 15.]
  25. 25. Advantage: the algorithm is fast once the learning is over. Disadvantages: the Style Tree can take a large amount of memory; it requires some web pages from a single domain to learn from.
  26. 26. Heuristic Based Techniques Learning Based Techniques Visual Based Techniques
  27. 27. The technique learns visual features of the web page and identifies the boundary of the article text content. A simple visual based technique uses the following two steps. Step 1: identify the different text segments using break node identification based on CSS. Step 2: use the global optimization method MSS (Maximum Scoring Subsequence) to identify the article text body.
  28. 28. <Br> and <Hr> tags are always break nodes. For other element nodes, the CSS display property is checked. If the CSS display property is “block”, the element has a line break before or after it. Text segments are then formed using the nearest line break nodes of every text node.
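
The segmentation step can be sketched as follows. This is a simplified illustration under a strong assumption: the DOM has already been flattened into a document-order stream in which a string stands for a text node and a `(tag, css_display)` pair stands for an element boundary. The function names and the stream format are hypothetical.

```python
ALWAYS_BREAK = {"br", "hr"}


def is_break_node(tag, css_display="inline"):
    """<br> and <hr> always break; other elements break when displayed as a block."""
    return tag.lower() in ALWAYS_BREAK or css_display == "block"


def text_segments(stream):
    """Group consecutive text nodes that are not separated by a break node."""
    segments, current = [], []
    for item in stream:
        if isinstance(item, str):      # a text node
            current.append(item)
        elif is_break_node(*item):     # an element boundary that breaks the line
            if current:
                segments.append(" ".join(current))
                current = []
    if current:
        segments.append(" ".join(current))
    return segments


stream = ["t1", ("b", "inline"), "t2", ("br", "inline"),
          "t3", ("div", "block"), "t4", ("em", "inline"), "t5"]
print(text_segments(stream))
```

Inline elements such as `b` and `em` leave their neighbouring text nodes in the same segment, while `br` and the block-level `div` start a new one.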
  29. 29. [Figure: example DOM tree (Body, P, DIV, A, I, em, U, B, Br) with element nodes, break nodes and text nodes t1 to t8; consecutive text nodes are grouped into text segments based on the nearest line break node.]
  30. 30. Given the set of text segments from Step 1, we have to group the segments that can be part of the article text. The algorithm assigns each segment a score of +1 or -1 in the following way: $F(S) = \begin{cases} +1, & P_{size} > c_1,\ P_{colour} > c_2,\ P_{link} < c_3 \\ -1, & \text{otherwise} \end{cases}$
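
One way to read the grouping step is as a maximum-sum contiguous run over the segment scores. The sketch below assumes that reading and uses a standard single-pass scan; the function names are hypothetical, and the slides do not define the features behind `p_size`, `p_colour` and `p_link` or the thresholds, so they are plain numeric placeholders here.

```python
def segment_score(p_size, p_colour, p_link, c1, c2, c3):
    """F(S): +1 when all three threshold conditions hold, otherwise -1."""
    return 1 if (p_size > c1 and p_colour > c2 and p_link < c3) else -1


def max_scoring_subsequence(scores):
    """Return (start, end, total) of the contiguous run with the highest total.

    end is exclusive; an empty run (0, 0, 0) is returned if no score is positive.
    """
    best_total, best_start, best_end = 0, 0, 0
    run_total, run_start = 0, 0
    for i, score in enumerate(scores):
        if run_total <= 0:             # a non-positive prefix never helps: restart here
            run_total, run_start = score, i
        else:
            run_total += score
        if run_total > best_total:
            best_total, best_start, best_end = run_total, run_start, i + 1
    return best_start, best_end, best_total


scores = [-1, 1, 1, -1, 1, 1, 1, -1, -1]
print(max_scoring_subsequence(scores))
```

For the scores above the selected run covers segments 1 to 6: the single -1 inside the run is kept because dropping it would split the article body into two lower-scoring pieces.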
  31. 31. Learning based techniques are fast. Heuristic techniques can be applied to any web page. Heuristic techniques rely on threshold values, which may not always be accurate. Heuristic techniques are slow. Learning based techniques require a sufficient number of web pages to learn from.
  32. 32.  News Comments monitoring can be used to predict the popularity of an article prior to its publication.  Comments also describe the mindset of the citizens about a particular event.  Comments can also be used to identify discussions/debates going on about a news story.
  33. 33. The technique uses the number of comments as the key factor to predict the popularity of an article. The method also considers the publication hour of an article and the category it belongs to. The method is based on linear regression, $Y = a + bX$, where X is the number of comments an article received over a given time period and Y is the predicted volume of comments.
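
The regression $Y = a + bX$ can be fitted with ordinary least squares. The sketch below is a plain-Python illustration; the function names are hypothetical and the comment counts are made-up numbers, not data from the surveyed experiment.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for Y = a + b * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, x):
    """Predicted final volume of comments for an article with x early comments."""
    return a + b * x


# Made-up training data: early comment counts and final comment volumes.
early_comments = [2, 4, 6, 8]
final_comments = [7, 13, 19, 25]

a, b = fit_line(early_comments, final_comments)
print(a, b, predict(a, b, 10))
```

Separate models of this form could be fitted per publication hour or per category by restricting the training data to the matching articles.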
  34. 34. How does the proposed technique work? The comments repository feeds different regression models of the form Y = a + bX: regression based on publication hours, regression based on category, and regression based on articles published per year. The best regression model is selected and applied to the article for popularity prediction, and the output is the predicted volume of comments.
  35. 35. The experiment was performed on four years of article data (from February 2006 to June 2010). Based on per-year data: it was concluded that articles published during 2008-10 are good for prediction. Based on the publication time of an article: articles published between 6 and 11 AM are best suited for prediction.
  36. 36. When people comment on the comments of other people, a discussion structure is created. The proposed method is used to identify that discussion structure in Dutch news media. The technique addresses the following two questions: 1. How to extract the comments? 2. How to identify the discussion thread?
  37. 37. [Figure: extraction pipeline. Dutch news sources like Torus and AD → RSS feed → article scraper → articles; comment URL → HTML page → comments scraper → comments; articles and comments are stored in the comments and articles repository.]
  38. 38. The technique identifies commenter names in the comment text, for example “Yes Tom you are right” (posted by: Bob). It also assumes that the @ character can be used to refer to someone, for example “@Bob this is not a good political view.” (posted by: Jimmy). Issue: a commenter name may also occur as an ordinary word in the comment text; for example, the name Boy may appear in “good boy”.
  39. 39. The following machine learning based methods are proposed. Word Boundary Based: tokenize the comments and commenter names, and check for each commenter name in the comments. POS Tagging and Loose Match: only words that are nouns are matched, using the similarity measure $\text{similarity}(m_1, m_2) = \frac{2 \cdot \text{match}(m_1, m_2)}{\text{length}(m_1) + \text{length}(m_2)}$; an optimal threshold value of 0.85 was obtained experimentally. @ Trigger and Loose Match: the @ character is used as a trigger for a reference to a previous comment, and loose match is used to resolve all references in the comment text.
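
The three matching ideas can be sketched with the Python standard library. This is an illustration under assumptions, not the surveyed implementation: the function names are hypothetical, POS tagging is left out, and `match` is taken to be the number of matching characters found by `difflib.SequenceMatcher`, whose `ratio()` is defined as 2·M/T and therefore has the same shape as the similarity measure.

```python
import re
from difflib import SequenceMatcher


def word_boundary_match(name, comment):
    """True if name occurs in comment as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(name) + r"\b"
    return re.search(pattern, comment, flags=re.IGNORECASE) is not None


def similarity(m1, m2):
    """2 * match(m1, m2) / (length(m1) + length(m2)), via SequenceMatcher.ratio()."""
    return SequenceMatcher(None, m1.lower(), m2.lower()).ratio()


def loose_match(name, word, threshold=0.85):
    """Accept slightly misspelled names; 0.85 is the threshold reported in the slides."""
    return similarity(name, word) >= threshold


def at_mentions(comment):
    """Words that directly follow an @ character."""
    return re.findall(r"@(\w+)", comment)


print(word_boundary_match("Tom", "Yes Tom you are right"))
print(word_boundary_match("Tom", "Tomorrow will be better"))
print(loose_match("Jimmy", "Jimy"), loose_match("Bob", "Boy"))
print(at_mentions("@Bob this is not a good political view."))
```

Word boundaries alone do not resolve the “good boy” problem, since “boy” is a whole word there; that ambiguity is what restricting matches to nouns is meant to reduce.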
  40. 40. We have learned the importance of article text and comments. Articles can be extracted using heuristic, learning based, and visual based techniques. Comments can be monitored for popularity prediction and for identifying discussion structure or debate.