



                             Content Extraction

       Identifying the Main Content in HTML Documents




By: Hadi Mohammadzadeh
Institute of Applied Information Processing
University of Ulm – 6th of July 2010

            Hadi Mohammadzadeh          Content Extraction       1




Outline

1.   Introduction
2.   Basic Terms and Concepts
3.   New Single-Document Algorithms
4.   Template clustering and detection








                           Part One



            Introduction








                               What Is the Problem?

•   Most HTML documents on the World Wide Web contain far more than the article or text
    which forms their main content.
    Navigation menus, functional and design elements, and commercial banners are typical
    examples of this additional content.








                           What Is the Problem? (cont.)

•   What is Content Extraction (CE)?
    CE is the process of identifying the main content and/or removing the
    additional content.

•   Two different kinds of approaches have evolved to solve the CE task:
    –   Heuristic approaches on single documents.
    –   Template Detection (TD) approaches on multiple documents. These exploit the fact
        that the template portions occur more frequently, or even in every document.








                           What Is the Problem? (cont.)

• Several applications benefit from CE in different ways:
    – Web Mining (WM) and Information Retrieval (IR) applications use CE to
      preprocess the raw HTML data, reducing noise and yielding more accurate results.
    – Other applications use CE to reduce the document size for presentation on screen
      readers and small-screen devices.








                           Part Two


Basic Terms and Concepts








                     What You Need to Know Beforehand

•   Three essential fields are addressed here:
        –   Some common data models for web documents and their representations
               • XHTML (Extensible Hypertext Markup Language), XML (Extensible Markup Language), XSLT
                 (Extensible Stylesheet Language Transformations), XPath
               • SAX (Simple API for XML)
               • DOM (Document Object Model)
               • Templates, Content Management Systems (CMS)
                       – Including: main navigation, location display, date of publication, news article,
                         commercials, related links, external links
        –   Basic issues from the field of Information Retrieval
               •   Concepts, instances and attributes
               •   Distance and similarity measures
               •   Query, result set, and gold standard
               •   Evaluation and visualization
                       – Recall, precision, F1-measure
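
The three evaluation measures named above can be illustrated with a small sketch. It assumes, purely for illustration, that extracted text and gold standard are compared as bags of words; the function name `prf1` and this comparison scheme are assumptions and are not taken from the slides.

```python
from collections import Counter

def prf1(extracted_words, gold_words):
    """Precision, recall and F1 of an extracted word list against a gold standard.

    Assumption: overlap is counted on bags of words (multiset intersection).
    """
    overlap = sum((Counter(extracted_words) & Counter(gold_words)).values())
    precision = overlap / len(extracted_words) if extracted_words else 0.0
    recall = overlap / len(gold_words) if gold_words else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The extraction kept two menu words and missed three gold-standard words.
extracted = "home news the cat sat".split()
gold = "the cat sat on the mat".split()
p, r, f = prf1(extracted, gold)  # p = 3/5, r = 3/6
```

Precision drops when additional content such as menu words slips into the extraction; recall drops when parts of the main content are lost.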








            What You Need to Know Beforehand (cont.)

•  The third field: methods and data structures that can be used to represent
   documents for data and text mining applications
            •     Document representation
            •     Methods for classification and clustering
                    –   Instance-based methods
                          »      K-means for clustering
                          »      K-nearest neighbor for classification
                    –   Statistical methods
                          »      Naïve Bayes (NB)
                    –   Kernel-based methods
                          »      Support vector machine








                        Part Three


New Single-Document Algorithms
          Content Code Blurring (CCB)








                   Single Document Content Extraction

•   CE methods which are based on single documents perform the extraction by analyzing
    only the document at hand.
•   CE algorithms and frameworks:
       –   Crunch framework
       –   The Body Text Extraction (BTE) algorithm interprets an HTML document as a sequence
           of word and tag tokens. It identifies a single, continuous region which contains most
           words while excluding most tags. BTE has two drawbacks: its quadratic complexity and
           its restriction to a single, continuous text passage as main content.

       –   The Document Slope Curves (DSC) algorithm extends BTE. Using a windowing
           technique, it can also locate several document regions in which word
           tokens are more frequent than tag tokens, while reducing the complexity to linear
           runtime.

       –   Link Quota Filters (LQF) are a quite common heuristic for identifying link lists and
           navigation elements. The basic idea is to find DOM elements which consist mainly of
           text in hyperlink anchors.

       –   Content Code Blurring (CCB) is based on finding regions in the source code character
           sequence which represent homogeneously formatted text. Its ACCB variation, which
           ignores format changes caused by hyperlinks, performed better than all previous CE
           heuristics.
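
The BTE idea described above can be sketched as follows. The sketch assumes one common way to state the objective: choose the region so that the number of tag tokens outside it plus the number of word tokens inside it is maximal. The function name `bte_region` and the boolean token encoding are illustrative assumptions, not the original implementation.

```python
def bte_region(is_tag):
    """Return (i, j), the inclusive token range that maximizes
    (tag tokens outside the range) + (word tokens inside the range).

    is_tag: list of booleans, True for a tag token, False for a word token.
    The double loop over all ranges shows the quadratic cost noted above.
    """
    n = len(is_tag)
    # prefix[k] = number of tag tokens among the first k tokens
    prefix = [0]
    for t in is_tag:
        prefix.append(prefix[-1] + (1 if t else 0))
    total_tags = prefix[n]

    best_score, best_span = -1, (0, -1)
    for i in range(n):
        for j in range(i, n):
            tags_inside = prefix[j + 1] - prefix[i]
            words_inside = (j - i + 1) - tags_inside
            score = (total_tags - tags_inside) + words_inside
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span

# T T w w w T w w T T  (T = tag token, w = word token)
tokens = [True, True, False, False, False, True, False, False, True, True]
span = bte_region(tokens)  # (2, 7): one inner tag is tolerated to keep five words
```

The example also shows the restriction mentioned above: the result is always one continuous range, so two text passages separated by a long run of tags cannot both be selected.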




            Evaluation of Content Extraction Algorithms

• Human User Evaluation
• Application Specific Evaluation
• Evaluation based on Information Retrieval Measures








                                 Introduction of CCB

• CCB is a novel CE algorithm.

• CCB is:
       –   robust to invalid or badly formatted HTML documents,
       –   fast, and delivers very good results on most documents.

• The idea underlying content code blurring is
                   to take advantage of visual features
                  of the main and the additional contents.

• Additional contents are usually highly formatted and contain few, short texts.

• The main text content, on the other hand, is long and homogeneously formatted.

• Since any change of format in the source code of an HTML document is indicated by a tag,
  we try to identify those parts of the document which contain a lot of text and few or
  no tags.







                          Concept and Idea of CCB

• There are two different ways to obtain a suitable document representation:
      –   The first approach breaks new ground for document representations in the CE
          context by determining for each single character whether it is content or code.
      –   The second approach is based on a token sequence as used by BTE and DSC.


• Both ways lead to a representation of a document as a sequence of atomic
  elements which are either content or code. We will refer to this sequence from now
  on as the content code vector (CCV).
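
A minimal sketch of the character-based representation, assuming the simplest possible rule: every character from "<" up to the matching ">" counts as code, everything else as content. Comments, scripts and stylesheets would need extra handling in practice; the function name `char_ccv` is hypothetical.

```python
def char_ccv(html):
    """Build a character-level content code vector: 1 = content, 0 = code.

    Simplifying assumption: only the text between '<' and '>' (inclusive)
    is treated as code.
    """
    ccv = []
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True
        ccv.append(0 if in_tag else 1)
        if ch == ">":
            in_tag = False
    return ccv

ccv = char_ccv("<b>Hi</b>")  # [0, 0, 0, 1, 1, 0, 0, 0, 0]
```

Because the scan never requires tags to be balanced or well nested, a representation of this kind is still defined for invalid or badly formatted HTML.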








                          Concept and Idea of CCB

• For each single element in the CCV we determine the ratio of content to code in its
  vicinity, to find out whether it is surrounded mainly by content or by code.

• If this content code ratio (CCR) is high for several elements in a row, i.e. they
  are surrounded mainly by text and only by a few tags, we take these elements to
  belong to a homogeneously formatted text region, i.e. to the main content.








                    Blurring the Content Code Vector

• Each entry in the CCV is initialized with a value of 1 if the corresponding element is
  of type content and with a value of 0 if it is code.

• To obtain the CCR we calculate for each entry a weighted local average of
  the values in a neighborhood with a fixed symmetric range. In inhomogeneous
  neighborhoods the average value will be between 0 and 1. If the neighbors are mainly
  content, the ratio will be high; if they are mainly code, the ratio will be low. So
  the average values have exactly the properties we need for our CCR values.
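
The blurring step can be sketched as a weighted moving average. Gaussian weights, the parameters `radius` and `sigma`, and the renormalization at the document borders are assumptions made for this sketch; the slide only requires a weighted local average over a fixed symmetric range.

```python
import math

def blur(ccv, radius=2, sigma=1.0):
    """Return one CCR value per CCV entry: a weighted local average over
    the entries within `radius` positions on either side.

    Assumptions: Gaussian weights; at the borders the average is taken
    over the neighbors that actually exist.
    """
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(d * d) / (2 * sigma * sigma)) for d in offsets]
    ccr = []
    for i in range(len(ccv)):
        num = den = 0.0
        for d, w in zip(offsets, weights):
            k = i + d
            if 0 <= k < len(ccv):
                num += w * ccv[k]
                den += w
        ccr.append(num / den)
    return ccr

ccr = blur([0, 0, 0, 1, 1, 1])
# ccr[0] is 0.0 (only code nearby), ccr[5] is 1.0 (only content nearby),
# and the entries around the boundary lie strictly between 0 and 1.
```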








                     Implementation and Adaptations

• Finding the main content corresponds to selecting those elements of the CCV which have
  a high CCR value, i.e. a value close to 1.

• An element in the CCV is considered to be part of the main content if it has a CCR
  value above a fixed threshold t.
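
The selection step can be sketched as below. The default `t=0.75` is an arbitrary illustrative value, not a threshold recommended by the source, and grouping the selected elements into contiguous regions is a presentation choice of this sketch.

```python
def main_content_regions(elements, ccr, t=0.75):
    """Collect maximal runs of elements whose CCR value lies above t.

    elements: the atomic elements (here: characters) behind the CCV.
    ccr:      one blurred value per element.
    t:        threshold; 0.75 is a hypothetical default.
    """
    regions, current = [], []
    for element, ratio in zip(elements, ccr):
        if ratio > t:
            current.append(element)
        elif current:
            regions.append("".join(current))
            current = []
    if current:
        regions.append("".join(current))
    return regions

regions = main_content_regions("abXYZcd", [0.1, 0.2, 0.9, 0.8, 0.95, 0.3, 0.9])
# regions == ["XYZ", "d"]
```

A full implementation would additionally drop any code characters that fall inside a selected region; the sketch leaves that step out.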








                       Part Four


                     Clustering

 Template Based Web Documents
            (TBWD)








                                       Abstract
• More and more documents on the World Wide Web are based on templates.

• On a technical level this causes those documents to have a quite similar source
  code and DOM tree structure.

• Grouping together documents which are based on the same template is an
  important task for applications that analyze the template structure and need
  clean training data.

• This paper develops and compares several distance measures for clustering web
  documents according to their underlying templates. In other words, we take a
  closer look at web document distance measures which are supposed to reflect
  template-related structural similarities and dissimilarities.








                               General Information

• As more and more documents on the World Wide Web are generated
  automatically by Content Management Systems (CMS), more and more of them
  are based on templates.

• Templates can be seen as framework documents which are filled with different
  contents to compile the final documents.

• A technical side effect is that the source code of template generated documents
  is always very similar.







                                        Related Work (1)

          Recognizing template structures in HTML documents
• Bar-Yossef and Rajagopalan were the first to propose a template recognition
  algorithm, based on DOM tree segmentation and segment selection.
  (Template detection via data mining and its applications, 2002)

• Lin and Ho developed InfoDiscoverer, which is based on the idea that, in contrast
  to the main content, template-generated contents appear more frequently.
  (Discovering informative content blocks from web documents, 2002)

• Debnath et al. used a similar assumption of redundant blocks in
  ContentExtractor, but take into account not only words and text but also other
  features like image or script elements.
  (Automatic extraction of informative blocks from webpages, 2005)







                                        Related Work (2)

             Recognizing template structures in HTML documents
•   The Site Style Tree (SST) approach of Yi, Liu and Li instead concentrates
    more on the visual impression that single DOM tree elements are supposed to achieve,
    and declares identically formatted DOM sub-trees to be template generated.
    (Eliminating noisy information in web pages for data mining, 2003)

•   Cruz et al. describe several distance measures for web documents. They
    distinguish between distance measures based on tag vectors, parametric functions
    or tree edit distances.
    (Measuring structural similarity among web documents: preliminary results, 1998)

•   In the more general context of comparing XML documents, Buttler stated that tree
    edit distances are probably the best, but also very expensive, similarity
    measures. Buttler therefore proposes the path shingling approach, which makes
    use of the shingling technique.
    (A short survey of document structure similarity algorithms, 2004)





                                        Related Work (3)

             Recognizing template structures in HTML documents
•   Shi et al. propose an alignment based on a simplified DOM tree representation to
    find parallel versions of web documents in different languages.
    (A DOM tree alignment model for mining parallel data from the web, 2006)








               Distance Measures for TBWD Structures

    Six measures, based on the DOM tree or on the tag sequence, are used for
    calculating distances between TBWD.

•   RTDM (Restricted Top-Down Mapping) algorithm – Tree Edit Distance
    This distance measure is based on calculating the cost of transforming a
    source tree into a target tree structure.

•   CP – Common Paths
    Another way is to look at the paths leading from the root node to the leaf
    nodes in the DOM tree.

•   CPS – Common Path Shingles
    The idea is not to compare complete paths but rather to break them up into
    smaller pieces of equal length, the shingles.
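
The two path-based measures can be sketched as set comparisons. The sketch assumes that each document is already given as a collection of root-to-leaf tag paths, and it normalizes the overlap by the size of the larger set; both the input format and this normalization are assumptions of the sketch, as are all function names.

```python
def shingles(path, k):
    """All contiguous pieces of length k of a path (the whole path if shorter)."""
    if len(path) <= k:
        return {tuple(path)}
    return {tuple(path[i:i + k]) for i in range(len(path) - k + 1)}

def set_distance(a, b):
    """1 - |a & b| / max(|a|, |b|); an assumed normalization."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / max(len(a), len(b))

def cp_distance(paths1, paths2):
    """Common Paths: compare the sets of complete root-to-leaf paths."""
    return set_distance({tuple(p) for p in paths1}, {tuple(p) for p in paths2})

def cps_distance(paths1, paths2, k=3):
    """Common Path Shingles: compare the sets of length-k path pieces."""
    s1 = set().union(*(shingles(p, k) for p in paths1))
    s2 = set().union(*(shingles(p, k) for p in paths2))
    return set_distance(s1, s2)

doc_a = [("html", "body", "div", "p"), ("html", "body", "div", "a"), ("html", "head", "title")]
doc_b = [("html", "body", "div", "p"), ("html", "body", "ul", "li"), ("html", "head", "title")]
d_cp = cp_distance(doc_a, doc_b)    # 1 - 2/3
d_cps = cps_distance(doc_a, doc_b)  # 1 - 3/5
```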





              Distance Measures for TBWD Structures

•   TV – Tag Vector
    Counting how many times each possible tag appears converts a document D
    into a vector v(D) of fixed dimension N.


•   LCTS – Longest Common Tag Subsequence
    The distance between two documents can be expressed based on their longest
    common tag subsequence.


•   CTSS – Common Tag Sequence Shingles
    To reduce the computational cost of the previous distance measure, we
    again utilize the shingling technique.
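
The three tag-sequence measures can be sketched on documents given as lists of tag names in source order. The Euclidean distance for TV and the normalization by the longer sequence (or larger shingle set) for LCTS and CTSS are assumptions of this sketch, not formulas taken from the slides.

```python
import math
from collections import Counter

def tv_distance(tags1, tags2):
    """Tag Vector: Euclidean distance between tag count vectors (assumed metric)."""
    c1, c2 = Counter(tags1), Counter(tags2)
    return math.sqrt(sum((c1[t] - c2[t]) ** 2 for t in set(c1) | set(c2)))

def lcs_length(a, b):
    """Length of the longest common subsequence; O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcts_distance(tags1, tags2):
    """1 - LCS length / length of the longer tag sequence (assumed normalization)."""
    longest = max(len(tags1), len(tags2))
    return 0.0 if longest == 0 else 1.0 - lcs_length(tags1, tags2) / longest

def ctss_distance(tags1, tags2, k=3):
    """Compare the sets of length-k windows (shingles) of the tag sequences."""
    s1 = {tuple(tags1[i:i + k]) for i in range(len(tags1) - k + 1)}
    s2 = {tuple(tags2[i:i + k]) for i in range(len(tags2) - k + 1)}
    largest = max(len(s1), len(s2))
    return 0.0 if largest == 0 else 1.0 - len(s1 & s2) / largest

a = ["html", "body", "div", "p", "p", "div"]
b = ["html", "body", "div", "p", "ul", "li"]
# tv_distance(a, b) is 2.0, lcts_distance(a, b) is 1 - 4/6, ctss_distance(a, b) is 0.5
```

The quadratic table in `lcs_length` is the cost that the shingle-based variant avoids: building and intersecting the shingle sets takes time roughly linear in the sequence lengths.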







                              Clustering Techniques

In this paper two different techniques were applied for clustering TBWD:

1.   K-Median Clustering
2.   Single Linkage
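
Of the two techniques, single linkage is the easier one to sketch: start with one cluster per document and repeatedly merge the two clusters whose closest members are nearest to each other, until the requested number of clusters remains. The naive implementation below works on a precomputed distance matrix; the function name and input format are illustrative assumptions.

```python
def single_linkage(dist, k):
    """Agglomerative single linkage clustering down to k clusters.

    dist: symmetric matrix (list of lists), dist[i][j] = distance of documents i and j.
    Returns a list of sets of document indices.
    """
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > max(k, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = smallest pairwise distance.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters

dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
clusters = single_linkage(dist, 2)  # [{0, 1}, {2, 3}]
```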








                                   Experiments

• To evaluate the different distance measures we collected a corpus
  of 500 documents from five different German news web sites.

• Each web site contributed 20 documents from each of five
  topical categories: national politics, international politics, sports,
  business, and IT-related news.

• Once the distance matrices had been computed, the different
  cluster analysis methods were applied to each of them.








                               Experiments (cont.)


• Evaluation of clustering:
  We used three different measures to evaluate the k-median and
    the single linkage algorithms:
  – The Rand index
     • The Rand index (or Rand measure) expresses how close the clustering
       result is to the original classes. A value of one means perfect clustering.
  – Cluster purity
  – Mutual information
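
Two of these measures are simple enough to sketch directly. The Rand index is the fraction of document pairs on which the clustering and the original classes agree (both put the pair together, or both keep it apart). Purity is computed here as the share of documents that belong to the majority class of their cluster, which is one common definition and may differ in detail from the averaged purity reported in the experiments. The function names are illustrative, and both functions expect at least two documents.

```python
from collections import Counter
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of document pairs treated the same way by classes and clusters."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def purity(labels_true, labels_pred):
    """Share of documents that belong to the majority class of their cluster."""
    majority_total = 0
    for cluster in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == cluster]
        majority_total += Counter(members).most_common(1)[0][1]
    return majority_total / len(labels_true)

true_classes = [0, 0, 0, 1, 1, 1]
clustering = [0, 0, 1, 1, 1, 1]  # one document ends up in the wrong cluster
ri = rand_index(true_classes, clustering)  # 10 of 15 pairs agree
pu = purity(true_classes, clustering)      # 5 of 6 documents in a majority class
```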








                                          Experiments (cont.)

Evaluation of k-median clustering for k = 5 (average of 100 repetitions),
    based on the distance measures RTDM, TV, CP, CPS, LCTS and CTSS
    and the performance measures Rand index, cluster purity and mutual information.

     Distance Measure      RTDM     TV       CP       CPS      LCTS     CTSS
     Rand Index            0.9608   0.9399   0.9560   0.9140   0.9157   0.9293
     Ave. Purity           0.9613   0.9235   0.9535   0.9057   0.8629   0.9218
     Mutual Information    0.1444   0.1354   0.1432   0.1302   0.1250   0.1350


    RTDM provides the best results, followed by the common paths (CP) measure.






                                         Experiments (cont.)

Evaluation of single linkage clustering for five clusters,
    based on the distance measures RTDM, TV, CP, CPS, LCTS and CTSS
    and the performance measures Rand index, cluster purity and mutual information.

   Distance Measure      RTDM     TV       CP       CPS      LCTS     CTSS
   Rand Index            0.9200   0.9200   1.0000   1.0000   1.0000   1.0000
   Ave. Purity           0.9005   0.9005   1.0000   1.0000   1.0000   1.0000
   Mutual Information    0.1287   0.1287   0.1553   0.1553   0.1553   0.1553


    With CP, CPS, LCTS and CTSS, single linkage reproduces the original classes perfectly
    (Rand index 1.0), so we can deduce that single linkage is the better way to form
    clusters for template based documents. RTDM and TV, however, score lower here than
    with k-median.







                                     References
•   Thomas Gottron. Evaluating content extraction on HTML documents. In ITA ’07: Proceedings of the 2nd
    International Conference on Internet Technologies and Applications, pages 123–132, September 2007.

•   Thomas Gottron. Combining content extraction heuristics: the CombinE system. In iiWAS ’08: Proceedings
    of the 10th International Conference on Information Integration and Web-based Applications & Services,
    pages 591–595, New York, NY, USA, 2008. ACM.

•   Thomas Gottron. Content code blurring: A new approach to content extraction. In DEXA ’08: 19th
    International Workshop on Database and Expert Systems Applications, pages 29–33. IEEE Computer
    Society, September 2008.

•   Thomas Gottron. Clustering template based web documents. In Proceedings of the 30th European
    Conference on Information Retrieval, pages 40–51, 2008.




How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 

Content extraction: By Hadi Mohammadzadeh

  • 1. Content Extraction: Identifying the Main Content in HTML Documents. By Hadi Mohammadzadeh, Institute of Applied Information Processing, University of Ulm, 6 July 2010.
  • 2. Outline
      1. Introduction
      2. Basic Terms and Concepts
      3. New Single-Document Algorithms
      4. Template Clustering and Detection
  • 3. Part One: Introduction
  • 4. What Is the Problem?
      • Most HTML documents on the World Wide Web contain far more than the article or text that forms their main content. Navigation menus, functional and design elements, and commercial banners are typical examples of additional content.
  • 5. What Is the Problem? (cont.)
      • What is Content Extraction (CE)? CE is the process of identifying the main content and/or removing the additional content.
      • Two kinds of approaches have evolved to solve the CE task:
          – Heuristic approaches on single documents.
          – Template Detection (TD) approaches on multiple documents. The template portions of the documents occur more frequently than the main content, or even in every document.
  • 6. What Is the Problem? (cont.)
      • Several applications benefit from CE in different ways:
          – Web Mining (WM) and Information Retrieval (IR) applications use CE to preprocess the raw HTML data, reducing noise and obtaining more accurate results.
          – Other applications use CE to reduce the document size for presentation on screen readers and small-screen devices.
  • 7. Part Two: Basic Terms and Concepts
  • 8. What You Need to Know Before …
      • Three essential fields are addressed here:
          – Some common data models for web documents and their representations
              • XHTML (Extensible Hypertext Markup Language), XML (Extensible Markup Language), XSLT (Extensible Stylesheet Language Transformations), XPath
              • SAX (Simple API for XML)
              • DOM (Document Object Model)
              • Templates and Content Management Systems (CMS), including: main navigation, location display, date of publication, news article, commercials, related links, external links
          – Basic issues from the field of Information Retrieval
              • Concepts, instances and attributes
              • Distance and similarity measures
              • Query, result set and gold standard
              • Evaluation and visualization: recall, precision, F1-measure
  • 9. What You Need to Know Before … (cont.)
          – Methods and data structures that can be used to represent documents for data and text mining applications
              • Document representation
              • Methods for classification and clustering
                  – Instance-based methods: k-means for clustering, k-nearest neighbor for classification
                  – Statistical methods: Naïve Bayes (NB)
                  – Kernel-based methods: support vector machines
  • 10. Part Three: New Single-Document Algorithms. Content Code Blurring (CCB)
  • 11. Single Document Content Extraction
      • CE methods based on single documents perform the extraction by analyzing only the document at hand.
      • CE algorithms and frameworks:
          – The Crunch framework.
          – The Body Text Extraction (BTE) algorithm interprets an HTML document as a sequence of word and tag tokens. It identifies a single, continuous region that contains most of the words while excluding most of the tags. BTE's drawbacks are its quadratic complexity and its restriction to a single, continuous text passage as main content.
          – The Document Slope Curves (DSC) algorithm extends BTE. Using a windowing technique, it can also locate several document regions in which word tokens are more frequent than tag tokens, while reducing the complexity to linear runtime.
          – Link Quota Filters (LQF) are a common heuristic for identifying link lists and navigation elements. The basic idea is to find DOM elements that consist mainly of text in hyperlink anchors.
          – Content Code Blurring (CCB) finds regions in the source code character sequence that represent homogeneously formatted text. Its ACCB variation, which ignores format changes caused by hyperlinks, performed better than all previous CE heuristics.
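The BTE idea described above can be sketched in a few lines. This is a minimal illustration, not the original implementation: the regular-expression tokenizer and the function name `bte` are assumptions made for the example. It scores every candidate region by the number of words inside plus the number of tags outside, which also makes the quadratic cost visible.

```python
import re


def bte(html):
    """Return the words of the single region that keeps most words and drops most tags."""
    # Assumed tokenizer: anything in <...> is a tag token, other non-space runs are words.
    tokens = re.findall(r"<[^>]*>|[^<\s]+", html)
    is_tag = [t.startswith("<") for t in tokens]

    # prefix[k] = number of tag tokens among the first k tokens
    prefix = [0]
    for flag in is_tag:
        prefix.append(prefix[-1] + flag)
    total_tags = prefix[-1]

    best_score, best_span = -1, (0, -1)
    n = len(tokens)
    for i in range(n):              # quadratic search over all regions [i, j]
        for j in range(i, n):
            tags_inside = prefix[j + 1] - prefix[i]
            words_inside = (j + 1 - i) - tags_inside
            score = (total_tags - tags_inside) + words_inside
            if score > best_score:
                best_score, best_span = score, (i, j)

    i, j = best_span
    return " ".join(t for t in tokens[i:j + 1] if not t.startswith("<"))


page = ("<html><body><div><a>Home</a></div>"
        "<p>This is the long main article text of the page</p>"
        "<div><a>Contact</a></div></body></html>")
print(bte(page))
```

On this toy page the selected region is the paragraph text, and the two navigation links fall outside it.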
  • 12. Evaluation of Content Extraction Algorithms
      • Human user evaluation
      • Application-specific evaluation
      • Evaluation based on Information Retrieval measures
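For the evaluation based on Information Retrieval measures, a small sketch may help. It assumes that the extracted text and the gold standard are compared as sets of words; that comparison unit and the function name are choices made for this example, and other units (word sequences, characters) are equally possible.

```python
def precision_recall_f1(extracted, gold):
    """Compare an extracted word collection against a gold standard, both treated as sets."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: three of four extracted words are correct,
# and three of the six gold-standard words were found.
p, r, f1 = precision_recall_f1(["a", "b", "c", "menu"],
                               ["a", "b", "c", "d", "e", "f"])
print(p, r, f1)
```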
  • 13. Introduction of CCB
      • CCB is a novel CE algorithm.
      • It is robust to invalid or badly formatted HTML documents.
      • It is fast and delivers very good results on most documents.
      • The idea underlying content code blurring is to take advantage of visual features of the main and the additional content. Additional content is usually highly formatted and contains little, short text.
      • The main text content, on the other hand, is long and homogeneously formatted.
      • Because any change of format in the source code of an HTML document is indicated by a tag, the algorithm tries to identify those parts of the document that contain a lot of text and few or no tags.
  • 14. Concept and Idea of CCB
      • There are two ways to obtain a suitable document representation:
          – The first takes a new path for document representations in the CE context by determining for each single character whether it is content or code.
          – The second is based on a token sequence, as used by BTE and DSC.
      • Both ways lead to a representation of a document as a sequence of atomic elements which are either content or code. This vector is referred to from now on as the content code vector (CCV).
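The character-based representation can be illustrated as follows. The sketch assumes a deliberately naive rule: every character from `<` up to and including the matching `>` counts as code and everything else as content. Comments, scripts and stylesheets are not treated specially here, and the function name is hypothetical.

```python
def content_code_vector(html):
    """Build a character-level CCV: 1 for content characters, 0 for code characters."""
    ccv = []
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True
        ccv.append(0 if in_tag else 1)
        if ch == ">":
            in_tag = False
    return ccv


print(content_code_vector("<b>Hi</b>"))
```

The vector has one entry per source character, so positions in the CCV map directly back to positions in the HTML source.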
  • 15. Concept and Idea of CCB (cont.)
      • For each single element in the CCV, a ratio of content to code in its vicinity is determined to find out whether it is surrounded mainly by content or by code.
      • If this content code ratio (CCR) is high for several elements in a row, i.e. they are surrounded mainly by text and only by a few tags, these elements are likely to be part of the main content.
  • 16. Blurring the Content Code Vector
      • Each entry in the CCV is initialized with a value of 1 if the corresponding element is of type content and with a value of 0 if it is code.
      • To obtain the CCR, a weighted local average of the values in a neighborhood with a fixed symmetric range is calculated for each entry. In inhomogeneous neighborhoods the average value lies between 0 and 1. If the neighbors are mainly content, the ratio is high; if they are mainly code, the ratio is low. The average values therefore have exactly the properties needed for the CCR values.
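A weighted local average of this kind can be sketched as below. The Gaussian weights, the default `radius` and `sigma`, and the renormalization at the borders of the vector are assumptions made for this example; the slide only requires a weighted average over a fixed symmetric range.

```python
import math


def blur(ccv, radius=2, sigma=1.0):
    """Return one CCR value per CCV entry: a Gaussian-weighted average of its neighborhood."""
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(d * d) / (2 * sigma * sigma)) for d in offsets]
    ccr = []
    for i in range(len(ccv)):
        num = den = 0.0
        for d, w in zip(offsets, weights):
            k = i + d
            if 0 <= k < len(ccv):   # ignore neighbors outside the vector
                num += w * ccv[k]
                den += w
        ccr.append(num / den)
    return ccr


ccr = blur([0, 0, 0, 1, 1, 1, 1, 1, 1])
print([round(v, 2) for v in ccr])
```

Entries deep inside a run of content get a ratio of 1, entries deep inside code get 0, and entries near a boundary fall in between.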
  • 17. Implementation and Adaptations
      • Finding the main content corresponds to selecting those elements of the CCV that have a high CCR value, i.e. a value close to 1.
      • An element in the CCV is considered to be part of the main content if its CCR value is above a fixed threshold t.
  • 18. Part Four: Clustering Template Based Web Documents (TBWD)
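The thresholding step might look like the sketch below. The threshold value 0.75 is purely illustrative, and the extra condition that a selected element must itself be content (so that tag characters are never emitted) is an assumption of this example rather than something stated on the slide.

```python
def select_main_content(chars, ccv, ccr, t=0.75):
    """Keep the content characters whose CCR value lies above the threshold t."""
    return "".join(
        ch
        for ch, is_content, ratio in zip(chars, ccv, ccr)
        if is_content and ratio > t
    )


# Hypothetical values: 'X' is code, 'a' and 'Z' have a low CCR.
result = select_main_content("abXYZ", [1, 1, 0, 1, 1], [0.2, 0.9, 0.9, 0.8, 0.5])
print(result)
```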
  • 19. Abstract
      • More and more documents on the World Wide Web are based on templates.
      • On a technical level this causes those documents to have quite similar source code and DOM tree structure.
      • Grouping together documents that are based on the same template is an important task for applications that analyze the template structure and need clean training data.
      • This paper develops and compares several distance measures for clustering web documents according to their underlying templates. In other words, it takes a closer look at web document distance measures that are supposed to reflect template-related structural similarities and dissimilarities.
  • 20. General Information
      • As more and more documents on the World Wide Web are generated automatically by Content Management Systems (CMS), more and more of them are based on templates.
      • Templates can be seen as framework documents that are filled with different content to compile the final documents.
      • A technical side effect is that the source code of template-generated documents is always very similar.
  • 21. Related Work (1): Recognizing Template Structures in HTML Documents
      • Bar-Yossef and Rajagopalan first proposed a template recognition algorithm based on DOM tree segmentation and segment selection. (Template detection via data mining and its applications, 2002)
      • Lin and Ho developed InfoDiscoverer, which is based on the idea that, in contrast to the main content, template-generated content appears more frequently. (Discovering informative content blocks from web documents, 2002)
      • Debnath et al. used a similar assumption of redundant blocks in ContentExtractor, but take into account not only words and text but also other features such as image or script elements. (Automatic extraction of informative blocks from webpages, 2005)
  • 22. Related Work (2): Recognizing Template Structures in HTML Documents
      • The Site Style Tree (SST) approach of Yi, Liu and Li instead concentrates on the visual impression that single DOM tree elements are supposed to achieve, and declares identically formatted DOM sub-trees to be template generated. (Eliminating noisy information in web pages for data mining, 2003)
      • Cruz et al. describe several distance measures for web documents. They distinguish between distance measures based on tag vectors, parametric functions or tree edit distances. (Measuring structural similarity among web documents: preliminary results, 1998)
      • In the more general context of comparing XML documents, Buttler stated that tree edit distances are probably the best similarity measures, but also very expensive ones. Buttler therefore proposes the path shingling approach, which makes use of the shingling technique. (A short survey of document structure similarity algorithms, 2004)
  • 23. Related Work (3): Recognizing Template Structures in HTML Documents
      • Shi et al. propose an alignment based on a simplified DOM tree representation to find parallel versions of web documents in different languages. (A DOM tree alignment model for mining parallel data from the web, 2006)
  • 24. Distance Measures for TBWD Structures
      Six measures, based on the DOM tree or on the tag sequence, are used for calculating distances between TBWD.
      • RTDM (Restricted Top-Down Mapping) algorithm, a tree edit distance: this distance measure is based on calculating the cost of transforming a source tree into a target tree structure.
      • CP (Common Paths): another way is to look at the paths leading from the root node to the leaf nodes in the DOM tree.
      • CPS (Common Path Shingles): the idea is not to compare complete paths but rather to break them up into smaller pieces of equal length, the shingles.
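The two path-based measures can be sketched with the standard-library HTML parser. Several details here are assumptions made for the example: the small `VOID` tag set, the shingle length `k=2`, and above all the normalization, which uses a Jaccard distance (one minus intersection over union) as one plausible choice rather than the exact formula of the original work.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags without an end tag


class PathCollector(HTMLParser):
    """Collect the root-to-leaf tag paths of a document."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open elements as [tag, has_child]
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.stack[-1][1] = True
        self.stack.append([tag, False])
        if tag in VOID:
            self._close()

    def handle_endtag(self, tag):
        if any(t == tag for t, _ in self.stack):
            while self.stack:            # also closes unclosed inner elements
                closed = self.stack[-1][0]
                self._close()
                if closed == tag:
                    break

    def _close(self):
        tag, has_child = self.stack[-1]
        if not has_child:                # leaf element: record its full path
            self.paths.add("/".join(t for t, _ in self.stack))
        self.stack.pop()


def paths_of(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def path_shingles(paths, k=2):
    """Break every path into overlapping pieces of k consecutive tags."""
    out = set()
    for path in paths:
        tags = path.split("/")
        for i in range(max(1, len(tags) - k + 1)):
            out.add("/".join(tags[i:i + k]))
    return out


def jaccard_distance(a, b):
    return 0.0 if not a and not b else 1 - len(a & b) / len(a | b)


doc1 = "<html><body><div><p>a</p></div><ul><li>x</li></ul></body></html>"
doc2 = "<html><body><div><p>b</p><p>c</p></div><ul><li>y</li><li>z</li></ul></body></html>"
doc3 = "<html><body><table><tr><td>q</td></tr></table></body></html>"

print(jaccard_distance(paths_of(doc1), paths_of(doc2)))   # CP, same template
print(jaccard_distance(paths_of(doc1), paths_of(doc3)))   # CP, different template
print(jaccard_distance(path_shingles(paths_of(doc1)),
                       path_shingles(paths_of(doc3))))    # CPS
```

In this toy example the first two documents share all root-to-leaf paths although their content differs, while the third shares none; with shingles, the common `html/body` piece still registers as a small overlap.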
  • 25. Distance Measures for TBWD Structures (cont.)
      • TV (Tag Vector): counting how many times each possible tag appears converts a document D into a vector v(D) of fixed dimension N.
      • LCTS (Longest Common Tag Subsequence): the distance between two documents can be expressed based on their longest common tag subsequence.
      • CTSS (Common Tag Sequence Shingles): to overcome the computational cost of the previous distance measure, the shingling technique is used again.
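The three tag-sequence measures can be sketched as follows. The concrete formulas are assumptions for illustration: Euclidean distance between tag count vectors for TV, one minus the LCS length divided by the longer sequence length for LCTS, and a Jaccard distance over shingles of length 3 for CTSS. Treating closing tags such as `/p` as tags of their own is also a choice made for this example.

```python
import math
import re
from collections import Counter


def tag_sequence(html):
    """Extract the sequence of opening and closing tag names from the source."""
    return [m.lower() for m in re.findall(r"<(/?[a-zA-Z][a-zA-Z0-9]*)", html)]


def tag_vector_distance(a, b):
    """TV: Euclidean distance between tag count vectors (absent tags count as 0)."""
    ca, cb = Counter(a), Counter(b)
    return math.sqrt(sum((ca[t] - cb[t]) ** 2 for t in set(ca) | set(cb)))


def lcs_length(a, b):
    """Length of the longest common subsequence, quadratic dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcts_distance(a, b):
    if not a and not b:
        return 0.0
    return 1 - lcs_length(a, b) / max(len(a), len(b))


def sequence_shingles(seq, k=3):
    return {tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)}


def ctss_distance(a, b, k=3):
    sa, sb = sequence_shingles(a, k), sequence_shingles(b, k)
    return 0.0 if not sa and not sb else 1 - len(sa & sb) / len(sa | sb)


s1 = tag_sequence("<html><body><div><p>a</p></div></body></html>")
s2 = tag_sequence("<html><body><div><p>a</p><p>b</p></div></body></html>")
print(tag_vector_distance(s1, s2), lcts_distance(s1, s2), ctss_distance(s1, s2))
```

The shingle comparison only needs set operations, which is what makes it cheaper than the quadratic subsequence computation.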
  • 26. Clustering Techniques
      In this paper, two different techniques are applied for clustering TBWD:
      1. K-median clustering
      2. Single linkage
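Single linkage clustering on a precomputed distance matrix can be sketched as below. This is a naive illustration that stops when `k` clusters remain; the function name and the tiny example matrix are hypothetical, and a production system would normally use an optimized library routine instead.

```python
def single_linkage(dist, k):
    """Merge the two closest clusters until k clusters remain.

    The distance between two clusters is the smallest distance
    between any pair of their members (single linkage).
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical distances: documents 0 and 1 share a template, as do 2 and 3.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
print(single_linkage(dist, 2))
```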
  • 27. Experiments
      • To evaluate the different distance measures, a corpus of 500 documents was collected from five different German news web sites.
      • Each web site contributed 20 documents from each of five topical categories: national politics, international politics, sports, business and IT-related news.
      • Once the distance matrices had been computed, the different cluster analysis methods were applied to each of them.
  • 28. Experiments (cont.)
      • Evaluation of clustering: three different measures were used to evaluate the k-median and the single linkage algorithms:
          – The Rand index (or Rand measure), which measures how close the clustering results are to the original classes. A value of 1 means perfect clustering.
          – Cluster purity
          – Mutual information
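Two of these measures are easy to state in code. The sketch uses the standard textbook definitions of the Rand index and of overall cluster purity; the "Ave. Purity" reported in the result tables may have been computed differently (for example, averaged per cluster), so the functions below are an illustration only.

```python
from collections import Counter
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of document pairs on which the clustering and the true classes agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def purity(labels_true, labels_pred):
    """Share of documents that belong to the majority class of their cluster."""
    total = 0
    for cluster in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == cluster]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels_true)


# Hypothetical labels: one document of class 0 ends up in the wrong cluster.
true_classes = [0, 0, 0, 1, 1, 1]
clustering = [0, 0, 1, 1, 1, 1]
print(rand_index(true_classes, clustering), purity(true_classes, clustering))
```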
  • 29. Experiments (cont.)
      Evaluation of k-median clustering for k = 5 (average of 100 repetitions), based on the distance measures RTDM, CP, CPS, TV, LCTS and CTSS, using the performance measures Rand index, cluster purity and mutual information.

        Distance measure      RTDM     TV       CP       CPS      LCTS     CTSS
        Rand index            0.9399   0.9140   0.9157   0.9293   0.9608   0.9560
        Ave. purity           0.9235   0.9057   0.8629   0.9218   0.9613   0.9535
        Mutual information    0.1354   0.1302   0.1250   0.1350   0.1444   0.1432

      RTDM is providing the best results, followed by common path measures.
  • 30. Experiments (cont.)
      Evaluation of single linkage clustering for five clusters, based on the distance measures RTDM, CP, CPS, TV, LCTS and CTSS, using the performance measures Rand index, cluster purity and mutual information.

        Distance measure      RTDM     TV       CP       CPS      LCTS     CTSS
        Rand index            0.9200   0.9200   1.0000   1.0000   1.0000   1.0000
        Ave. purity           0.9005   0.9005   1.0000   1.0000   1.0000   1.0000
        Mutual information    0.1287   0.1287   0.1553   0.1553   0.1553   0.1553

      We can deduce that single linkage is a better way to form clusters for template-based documents.
  • 31. References
      • Thomas Gottron. Evaluating content extraction on HTML documents. In ITA '07: Proceedings of the 2nd International Conference on Internet Technologies and Applications, pages 123–132, September 2007.
      • Thomas Gottron. Combining content extraction heuristics: the CombinE system. In iiWAS '08: Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services, pages 591–595, New York, NY, USA, 2008. ACM.
      • Thomas Gottron. Content code blurring: A new approach to content extraction. In DEXA '08: 19th International Workshop on Database and Expert Systems Applications, pages 29–33. IEEE Computer Society, September 2008.
      • Thomas Gottron. Clustering template based web documents. In Proceedings of the 30th European Conference on Information Retrieval, pages 40–51, 2008.