Content extraction: By Hadi Mohammadzadeh

Slide 1: Content Extraction – Identifying the Main Content in HTML Documents
By: Hadi Mohammadzadeh
Institute of Applied Information Processing, University of Ulm – 6th of July 2010

Slide 2: Outline
1. Introduction
2. Basic Terms and Concepts
3. New Single Document Algorithms
4. Template Clustering and Detection

Slide 3: Part One – Introduction

Slide 4: What is the Problem?
• Most HTML documents on the World Wide Web contain far more than the article or text which forms their main content. Navigation menus, functional and design elements or commercial banners are typical examples of additional contents.

Slide 5: What is the Problem? (cont.)
• So what is Content Extraction? CE is the process of identifying the main content and/or removing the additional contents.
• Two different kinds of approaches have evolved to solve the CE task:
  – Heuristic approaches on single documents.
  – Template Detection (TD) approaches on multiple documents. The template portions of the documents occur more frequently or even in every document.

Slide 6: What is the Problem? (cont.)
• Several applications benefit from CE under different aspects:
  – Web Mining (WM) and Information Retrieval (IR) applications use CE to preprocess the raw HTML data, to reduce noise and to obtain more accurate results.
  – Other applications use CE to reduce the document size for presentation on screen readers and small screen devices.

Slide 7: Part Two – Basic Terms and Concepts

Slide 8: What you need to know before …
• Three essential fields are addressed here:
  – Some common data models for web documents and their representations
    • XHTML (Extensible Hypertext Markup Language), XML (Extensible Markup Language), XSLT (Extensible Stylesheet Language Transformations), XPath
    • SAX (Simple API for XML)
    • DOM (Document Object Model)
    • Templates, Content Management Systems (CMS) – including: main navigation, location display, date of publication, news article, commercials, related links, external links
  – Basic issues from the field of Information Retrieval
    • Concepts, Instances and Attributes
    • Distance and Similarity Measures
    • Query, Result Set, and Gold Standard
    • Evaluation and Visualization – Recall, Precision, F1-measure (a sketch of these measures follows this slide)

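The following minimal Python sketch (not part of the original slides) illustrates how the recall, precision and F1 measures listed above could be computed for a CE result against a gold standard; the set-based comparison of words is an assumption made purely for illustration.

```python
# Sketch: set-based precision/recall/F1 for a content extraction result.

def precision_recall_f1(extracted_words, gold_words):
    """Compute precision, recall and F1 on word sets (illustrative only)."""
    extracted, gold = set(extracted_words), set(gold_words)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Example: toy extraction result vs. gold standard.
p, r, f = precision_recall_f1("the main article text".split(),
                              "the full main article text body".split())
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```
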
Slide 9: What you need to know before …
• Methods and data structures that can be used to represent documents for data and text mining applications:
  – Document representation
  – Methods for classification and clustering
    • Instance based methods: k-means for clustering, k-nearest neighbor for classification
    • Statistical methods: Naïve Bayes (NB)
    • Kernel based methods: Support Vector Machines (SVM)

Slide 10: Part Three – New Single Document Algorithms: Content Code Blurring (CCB)

Slide 11: Single Document Content Extraction
• CE methods which are based on single documents perform the extraction by analyzing only the document at hand.
• CE algorithms and frameworks:
  – The Crunch framework.
  – The Body Text Extraction (BTE) algorithm interprets an HTML document as a sequence of word and tag tokens. It identifies a single, continuous region which contains most words while excluding most tags. A problem of BTE is its quadratic complexity and its restriction to discovering only a single, continuous text passage as main content.
  – The Document Slope Curves (DSC) algorithm extends BTE. Using a windowing technique, it can locate several document regions in which word tokens are more frequent than tag tokens, while also reducing the complexity to linear runtime.
  – Link Quota Filters (LQF) are a quite common heuristic for identifying link lists and navigation elements. The basic idea is to find DOM elements which consist mainly of text in hyperlink anchors (a sketch follows this list).
  – Content Code Blurring (CCB) is based on finding regions in the source code character sequence which represent homogeneously formatted text. Its ACCB variation, which ignores format changes caused by hyperlinks, performed better than all previous CE heuristics.

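As a rough illustration of the LQF heuristic mentioned in the list above, the following Python sketch flags a DOM element as navigation when most of its text sits inside hyperlink anchors. The Node class and the 0.75 link quota threshold are assumptions for this example, not part of the original algorithms.

```python
# Sketch of the Link Quota Filter idea on a toy DOM-like tree.
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    text: str = ""          # text directly inside this element
    children: list = field(default_factory=list)

def text_length(node):
    """Total length of the text below a node."""
    return len(node.text) + sum(text_length(c) for c in node.children)

def anchor_text_length(node, inside_anchor=False):
    """Length of the text below a node that sits inside <a> anchors."""
    inside = inside_anchor or node.tag == "a"
    own = len(node.text) if inside else 0
    return own + sum(anchor_text_length(c, inside) for c in node.children)

def is_navigation(node, link_quota=0.75):
    """LQF heuristic: mostly-anchor-text elements are treated as navigation."""
    total = text_length(node)
    return total > 0 and anchor_text_length(node) / total >= link_quota

# Example: a menu-like <ul> vs. a paragraph of article text with one link.
menu = Node("ul", children=[Node("li", children=[Node("a", "Home")]),
                            Node("li", children=[Node("a", "Sports")])])
article = Node("p", "A long paragraph of article text with ",
               children=[Node("a", "one link"), Node("span", " inside it.")])
print(is_navigation(menu), is_navigation(article))   # True False
```
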
Slide 12: Evaluation of Content Extraction Algorithms
• Human user evaluation
• Application specific evaluation
• Evaluation based on Information Retrieval measures

Slide 13: Introduction of CCB
• CCB is a novel CE algorithm.
• CCB is robust to invalid or badly formatted HTML documents; it is fast and delivers very good results on most documents.
• The idea underlying content code blurring is to take advantage of visual features of the main and the additional contents. Additional contents are usually highly formatted and contain little and short texts.
• The main text content, on the other hand, is long and homogeneously formatted.
• As any change of format in the source code of an HTML document is indicated by a tag, we try to identify those parts of the document which contain a lot of text and few or no tags.

Slide 14: Concept and Idea of CCB
• Two different ways to obtain a suitable document representation:
  – The first strikes a new path for document representations in the CE context by determining for each single character whether it is content or code.
  – The second approach is based on a token sequence as used by BTE and DSC.
• Both ways lead to a representation of a document as a sequence of atomic elements which are either content or code. We will refer to this vector from now on as the content code vector (CCV); a character-level sketch follows below.

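A minimal Python sketch of the character-level representation described above: every character of the HTML source is classified as content (1) or code (0) to form the CCV. The tag-detection logic is deliberately simplified and is an assumption of this sketch, not the original implementation.

```python
# Sketch: build a character-level content code vector (CCV).

def content_code_vector(html):
    """Return a list of 1/0 flags: 1 for content characters, 0 for code.

    Characters inside tags (and the tag delimiters themselves) count as
    code, everything else as content. Scripts, comments and entities are
    deliberately ignored to keep the sketch short.
    """
    ccv = []
    inside_tag = False
    for ch in html:
        if ch == "<":
            inside_tag = True
        ccv.append(0 if inside_tag else 1)
        if ch == ">":
            inside_tag = False
    return ccv

html = "<div><a href='#'>menu</a></div><p>A longer article paragraph.</p>"
print(content_code_vector(html))
```
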
Slide 15: Concept and Idea of CCB
• For each single element in the CCV we determine a ratio of content to code in its vicinity, to find out if it is surrounded mainly by content or by code.
• If this content code ratio (CCR) is high for several elements in a row, i.e. they are surrounded mainly by text and only by a few tags, these elements are considered to belong to the main content.

Slide 16: Blurring the Content Code Vector
• Each entry in the CCV is initialized with a value of 1 if the corresponding element is of type content, and with a value of 0 for code.
• To obtain the CCR we calculate for each entry a weighted, local average of the values in a neighborhood with a fixed symmetric range. In inhomogeneous neighborhoods the average value will lie between 0 and 1: if the neighborhood is mainly content, the ratio will be high; if it is mainly code, the ratio will be low. So the average values have exactly the properties we need for our CCR values (see the sketch below).

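The blurring step can be sketched as a simple moving average over the CCV; the original CCB uses a weighted kernel, so the uniform weights and the radius below are assumptions made for illustration.

```python
# Sketch: blur the CCV with a uniformly weighted local average.

def blur_ccv(ccv, radius=10):
    """Return CCR values: the local average of the CCV around each element."""
    n = len(ccv)
    ccr = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = ccv[lo:hi]
        ccr.append(sum(window) / len(window))
    return ccr

# Example: a short code region followed by a long content region.
ccv = [0] * 15 + [1] * 40 + [0] * 10
ccr = blur_ccv(ccv, radius=5)
print([round(x, 2) for x in ccr[10:20]])  # ratios rise as we enter the content
```
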
Slide 17: Implementation and Adaptations
• Finding the main content corresponds to selecting those elements of the CCV which have a high CCR value, i.e. a value close to 1.
• An element in the CCV is considered to be part of the main content if it has a CCR value above a fixed threshold t (a selection sketch follows below).

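The selection step can then be sketched as a simple threshold filter over the CCR values; the threshold value used below is illustrative, not the one tuned in the CCB paper.

```python
# Sketch: keep the characters whose CCR value exceeds a fixed threshold t.
# Reuses content_code_vector() and blur_ccv() from the sketches above.

def extract_main_content(html, ccr, t=0.7):
    """Return the characters of the document whose CCR value is above t."""
    assert len(html) == len(ccr)
    return "".join(ch for ch, ratio in zip(html, ccr) if ratio > t)

html = "<div><a href='#'>menu</a></div><p>A longer article paragraph.</p>"
ccv = content_code_vector(html)
main = extract_main_content(html, blur_ccv(ccv, radius=8), t=0.6)
print(main)
```
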
Slide 18: Part Four – Clustering Template Based Web Documents (TBWD)

Slide 19: Abstract
• More and more documents on the World Wide Web are based on templates.
• On a technical level this causes those documents to have quite similar source code and DOM tree structures.
• Grouping together documents which are based on the same template is an important task for applications that analyze the template structure and need clean training data.
• This paper develops and compares several distance measures for clustering web documents according to their underlying templates. In other words, we take a closer look at web document distance measures which are supposed to reflect template related structural similarities and dissimilarities.

Slide 20: General Information
• As more and more documents on the World Wide Web are generated automatically by Content Management Systems (CMS), more and more of them are based on templates.
• Templates can be seen as framework documents which are filled with different contents to compile the final documents.
• A technical side effect is that the source code of template generated documents is always very similar.

Slide 21: Related Works (1) – Recognizing Template Structures in HTML Documents
• First, Bar-Yossef and Rajagopalan proposed a template recognition algorithm based on DOM tree segmentation and segment selection. (Template detection via data mining and its applications, 2002)
• Lin and Ho developed InfoDiscoverer, which is based on the idea that, in contrast to the main content, template generated contents appear more frequently. (Discovering informative content blocks from web documents, 2002)
• Debnath et al. used a similar assumption of redundant blocks in ContentExtractor, but take into account not only words and text but also other features like image or script elements. (Automatic extraction of informative blocks from webpages, 2005)

Slide 22: Related Works (2) – Recognizing Template Structures in HTML Documents
• The Site Style Tree (SST) approach of Yi, Liu and Li instead concentrates more on the visual impression single DOM tree elements are supposed to achieve, and declares identically formatted DOM sub-trees to be template generated. (Eliminating noisy information in web pages for data mining, 2003)
• Cruz et al. describe several distance measures for web documents. They distinguish between distance measures based on tag vectors, parametric functions or tree edit distances. (Measuring structural similarity among web documents: preliminary results, 1998)
• In the more general context of comparing XML documents, Buttler stated tree edit distances to be probably the best, but also very expensive, similarity measures. Therefore Buttler proposes the path shingling approach, which makes use of the shingling technique. (A short survey of document structure similarity algorithms, 2004)

Slide 23: Related Works (3) – Recognizing Template Structures in HTML Documents
• Shi et al. propose an alignment based on a simplified DOM tree representation to find parallel versions of web documents in different languages. (A DOM tree alignment model for mining parallel data from the web, 2006)

Slide 24: Distance Measures for TBWD Structures
There are six measures, based on tag sequences and DOM tree structures, for calculating distances between template based web documents:
• RTDM (Restricted Top-Down Mapping) algorithm – tree edit distance: this distance measure is based on calculating the cost for transforming a source tree into a target tree structure.
• CP – Common Paths: another way is to look at the paths leading from the root node to the leaf nodes in the DOM tree.
• CPS – Common Path Shingles: the idea is not to compare complete paths but rather to break them up into smaller pieces of equal length – the shingles (a sketch follows this list).

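A minimal Python sketch of the common path shingles (CPS) idea referenced in the list above: root-to-leaf tag paths are cut into fixed-length shingles and two documents are compared via the Jaccard overlap of their shingle sets. The nested-tuple tree format and the shingle length are assumptions made for this example.

```python
# Sketch: common path shingles distance on toy (tag, [children]) trees.

def root_to_leaf_paths(tree, prefix=()):
    """Yield root-to-leaf tag paths from a (tag, [children]) tree."""
    tag, children = tree
    path = prefix + (tag,)
    if not children:
        yield path
    for child in children:
        yield from root_to_leaf_paths(child, path)

def path_shingles(tree, k=3):
    """Break every root-to-leaf path into shingles of length k."""
    shingles = set()
    for path in root_to_leaf_paths(tree):
        for i in range(max(1, len(path) - k + 1)):
            shingles.add(path[i:i + k])
    return shingles

def cps_distance(tree_a, tree_b, k=3):
    """1 minus the Jaccard similarity of the two shingle sets."""
    a, b = path_shingles(tree_a, k), path_shingles(tree_b, k)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

doc1 = ("html", [("body", [("div", [("p", []), ("p", [])]), ("ul", [("li", [])])])])
doc2 = ("html", [("body", [("table", [("tr", [("td", [])])]), ("ul", [("li", [])])])])
print(cps_distance(doc1, doc2))   # ≈ 0.71
```
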
Slide 25: Distance Measures for TBWD Structures
• TV – Tag Vector: counting how many times each possible tag appears converts a document D into a vector v(D) of fixed dimension N (a sketch follows this list).
• LCTS – Longest Common Tag Subsequence: the distance of two documents can be expressed based on their longest common tag subsequence.
• CTSS – Common Tag Sequence Shingles: to overcome the computational costs of the previous distance measure we again utilize the shingling technique.

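A minimal Python sketch of the tag vector (TV) measure from the list above: documents are mapped to vectors of tag counts and compared with a vector distance. The cosine distance and the simple regex-based tag counting are assumptions of this sketch; the slides do not specify the exact vector distance.

```python
# Sketch: tag vector distance between two HTML documents.
import math
import re
from collections import Counter

def tag_vector(html):
    """Count how often each opening tag appears in the HTML source."""
    return Counter(re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html))

def tv_distance(html_a, html_b):
    """Cosine distance between the two tag-count vectors."""
    a, b = tag_vector(html_a), tag_vector(html_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

print(tv_distance("<html><body><p>a</p><p>b</p></body></html>",
                  "<html><body><ul><li>x</li></ul></body></html>"))
```
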
Slide 26: Clustering Techniques
In this paper we have applied two different techniques for clustering TBWD (a single linkage sketch follows below):
1. K-Median Clustering
2. Single Linkage

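As a sketch of the single linkage option mentioned above, the following Python snippet clusters documents from a precomputed pairwise distance matrix with SciPy; the 4x4 matrix is made-up example data, not taken from the experiments.

```python
# Sketch: single linkage clustering on a precomputed document distance matrix.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Symmetric pairwise distances between four documents: the first two and
# the last two are meant to share a template.
dist = np.array([[0.0, 0.1, 0.8, 0.9],
                 [0.1, 0.0, 0.7, 0.8],
                 [0.8, 0.7, 0.0, 0.2],
                 [0.9, 0.8, 0.2, 0.0]])

# Single linkage works on the condensed (upper-triangular) form.
Z = linkage(squareform(dist), method="single")

# Cut the dendrogram into two clusters (the paper uses five, one per site).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 2 2]
```
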
Slide 27: Experiments
• To evaluate the different distance measures we collected a corpus of 500 documents from five different German news web sites.
• Each web site contributed 20 documents from each of five topical categories: national and international politics, sports, business and IT related news.
• Once the distance matrices had been computed, the different cluster analysis methods were applied to each of them.

Slide 28: Experiments (cont.)
• Evaluation of clustering: we used three different measures to evaluate the k-median and the single linkage algorithms:
  – The Rand index: a measure of how close the clustering results are to the original classes; a value of one means perfect clustering (a sketch follows below).
  – Cluster purity
  – Mutual information

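A minimal Python sketch of the Rand index mentioned above: it counts the fraction of document pairs on which the clustering and the true classes agree (either both put the pair together or both keep it apart). The toy labels are example data.

```python
# Sketch: Rand index between a ground-truth labeling and a clustering.
from itertools import combinations

def rand_index(true_labels, cluster_labels):
    """Fraction of item pairs on which truth and clustering agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (cluster_labels[i] == cluster_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Example: one document of the second class ends up in the wrong cluster.
truth    = ["siteA", "siteA", "siteB", "siteB", "siteB"]
clusters = [1, 1, 2, 2, 1]
print(rand_index(truth, clusters))   # 0.6
```
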
Slide 29: Experiments (cont.)
Evaluation of k-median clustering for k = 5 (average of 100 repetitions), based on the different distance measures (RTDM, CP, CPS, TV, LCTS, CTSS) and the performance measures Rand index, cluster purity and mutual information:

Distance Measure     RTDM     TV       CP       CPS      LCTS     CTSS
Rand Index           0.9399   0.9140   0.9157   0.9293   0.9608   0.9560
Ave. Purity          0.9235   0.9057   0.8629   0.9218   0.9613   0.9535
Mutual Information   0.1354   0.1302   0.1250   0.1350   0.1444   0.1432

RTDM is providing the best results, followed by the common path measures.

Slide 30: Experiments (cont.)
Evaluation of single linkage clustering for five clusters, based on the different distance measures (RTDM, CP, CPS, TV, LCTS, CTSS) and the performance measures Rand index, cluster purity and mutual information:

Distance Measure     RTDM     TV       CP       CPS      LCTS     CTSS
Rand Index           0.9200   0.9200   1.0000   1.0000   1.0000   1.0000
Ave. Purity          0.9005   0.9005   1.0000   1.0000   1.0000   1.0000
Mutual Information   0.1287   0.1287   0.1553   0.1553   0.1553   0.1553

We can deduce that single linkage is a better way to form clusters for template based documents.

Slide 31: References
• Thomas Gottron. Evaluating content extraction on HTML documents. In ITA '07: Proceedings of the 2nd International Conference on Internet Technologies and Applications, pages 123–132, September 2007.
• Thomas Gottron. Combining content extraction heuristics: the CombinE system. In iiWAS '08: Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services, pages 591–595, New York, NY, USA, 2008. ACM.
• Thomas Gottron. Content code blurring: A new approach to content extraction. In DEXA '08: 19th International Workshop on Database and Expert Systems Applications, pages 29–33. IEEE Computer Society, September 2008.
• Thomas Gottron. Clustering template based web documents. In Proceedings of the 30th European Conference on Information Retrieval, pages 40–51, 2008.
