This document summarizes a presentation on extracting the main content from HTML documents. It introduces the Content Code Blurring (CCB) algorithm for single-document content extraction. CCB represents a document as a content code vector (CCV) that labels each element as either content or code, then calculates a content code ratio (CCR) for each element as a local average over its neighbors. Elements with a CCR close to 1 are likely part of the main content, as they are surrounded primarily by other content elements and few code elements such as tags.
Content extraction: By Hadi Mohammadzadeh
1. Content Extraction
Identifying the Main Content in HTML Documents
By: Hadi Mohammadzadeh
Institute of Applied Information Processing
University of Ulm – 6th of July 2010
Hadi Mohammadzadeh Content Extraction 1
2. Outline
1. Introduction
2. Basic Terms and Concepts
3. New Single Document Algorithms
4. Template Clustering and Detection
3. Part One: Introduction
4. What is the Problem?
• Most HTML documents on the World Wide Web contain far more than the article or text which forms their main content. Navigation menus, functional and design elements or commercial banners are typical examples of such additional content.
5. What is the Problem? (Cont.)
• So what is Content Extraction? CE is the process of identifying the main content and/or removing the additional content.
• Two different kinds of approaches have evolved to solve the CE task:
– Heuristic approaches on single documents.
– Template Detection (TD) approaches on multiple documents: the template portions of the documents occur more frequently or even in every document.
6. What is the Problem? (Cont.)
• Several applications benefit from CE under different aspects:
– Web Mining (WM) and Information Retrieval (IR) applications use CE to preprocess the raw HTML data, reducing noise and obtaining more accurate results.
– Other applications use CE to reduce the document size for presentation on screen readers and small-screen devices.
7. Part Two: Basic Terms and Concepts
8. What You Need to Know First
• Three essential fields are addressed here:
– Common data models for web documents and their representations:
• XHTML (Extensible Hypertext Markup Language), XML (Extensible Markup Language), XSLT (Extensible Stylesheet Language Transformations), XPath
• SAX (Simple API for XML)
• DOM (Document Object Model)
• Templates, Content Management Systems (CMS)
– Typical template elements, including: main navigation, location display, date of publication, news article, commercials, related links, external links
– Basic notions from the field of Information Retrieval:
• Concepts, instances and attributes
• Distance and similarity measures
• Query, result set and gold standard
• Evaluation and visualization: recall, precision, F1-measure
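The IR evaluation measures listed above are easy to compute once the extraction result and the gold standard are represented as sets (e.g. of words or tokens); a minimal sketch:

```python
def precision_recall_f1(extracted: set, gold: set):
    """Standard IR measures for evaluating an extraction result
    against a gold standard, both given as sets of items."""
    tp = len(extracted & gold)  # correctly extracted items
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```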
9. What You Need to Know First (Cont.)
• Methods and data structures that can be used to represent documents for data and text mining applications:
• Document representation
• Methods for classification and clustering:
– Instance-based methods
» k-means for clustering
» k-nearest neighbors for classification
– Statistical methods
» Naïve Bayes (NB)
– Kernel-based methods
» Support vector machines
10. Part Three: New Single Document Algorithms – Content Code Blurring (CCB)
11. Single Document Content Extraction
• CE methods based on single documents perform the extraction by analyzing only the document at hand.
• CE algorithms and frameworks:
– The Crunch framework.
– The Body Text Extraction (BTE) algorithm interprets an HTML document as a sequence of word and tag tokens. It identifies a single, continuous region which contains most words while excluding most tags. Problems of BTE are its quadratic complexity and its restriction to discovering only a single, continuous text passage as main content.
– The Document Slope Curves (DSC) algorithm extends BTE. Using a windowing technique, it can also locate several document regions in which the word tokens are more frequent than tag tokens, while reducing the complexity to linear runtime.
– Link Quota Filters (LQF) are a quite common heuristic for identifying link lists and navigation elements. The basic idea is to find DOM elements which consist mainly of text in hyperlink anchors.
– Content Code Blurring (CCB) is based on finding regions in the source code character sequence which represent homogeneously formatted text. Its ACCB variant, which ignores format changes caused by hyperlinks, performed better than all previous CE heuristics.
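As an illustration of the token-based view used by BTE and DSC, here is a hedged sketch of the BTE idea: maximizing the number of words inside a contiguous token range plus the tags outside it is equivalent to a maximum-sum subarray over scores of +1 per word and −1 per tag, which a Kadane-style scan solves in linear time (the naive formulation is quadratic). The tokenizer regex is a simplification, not the tokenization of the original algorithm.

```python
import re

def bte_main_content(html: str) -> str:
    """BTE-style sketch: split the document into word and tag tokens,
    then pick the contiguous token range maximizing (words inside +
    tags outside), i.e. a maximum-sum subarray over +1/-1 scores."""
    tokens = re.findall(r"<[^>]*>|[^<\s]+", html)
    scores = [-1 if t.startswith("<") else 1 for t in tokens]
    best_sum = cur_sum = 0
    best = (0, 0)   # [start, end) of the best token range
    start = 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:          # restart the candidate range
            start, cur_sum = i, 0
        cur_sum += s
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (start, i + 1)
    words = [t for t in tokens[best[0]:best[1]] if not t.startswith("<")]
    return " ".join(words)
```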
12. Evaluation of Content Extraction Algorithms
• Human user evaluation
• Application-specific evaluation
• Evaluation based on Information Retrieval measures
13. Introduction of CCB
• CCB is a novel CE algorithm:
– It is robust to invalid or badly formatted HTML documents.
– It is fast and delivers very good results on most documents.
• The idea underlying content code blurring is to take advantage of visual features of the main and the additional content: additional content is usually highly formatted and contains little and only short text.
• The main text content, on the other hand, is long and homogeneously formatted.
• Since in the source code of an HTML document any change of format is indicated by a tag, we try to identify those parts of the document which contain a lot of text and few or no tags.
14. Concept and Idea of CCB
• There are two different ways to obtain a suitable document representation:
– The first strikes a new path for document representations in the CE context by determining for each single character whether it is content or code.
– The second approach is based on a token sequence as used by BTE and DSC.
• Both ways lead to a representation of a document as a sequence of atomic elements which are either content or code. We refer to this vector from now on as the content code vector (CCV).
15. Concept and Idea of CCB (Cont.)
• For each single element in the CCV we determine a ratio of content to code in its vicinity, to find out whether it is surrounded mainly by content or by code.
• If this content code ratio (CCR) is high for several elements in a row, i.e. they are surrounded mainly by text and only a few tags, those elements are likely part of the main content.
16. Blurring the Content Code Vector
• Each entry in the CCV is initialized with a value of 1 if the corresponding element is of type content, and with a value of 0 for code.
• To obtain the CCR we calculate for each entry a weighted, local average of the values in a neighborhood with a fixed symmetric range. In inhomogeneous neighborhoods the average value lies between 0 and 1: if the neighbors are mainly content, the ratio will be high; if they are mainly code, the ratio will be low. The average values therefore have exactly the properties we need for our CCR values.
17. Implementation and Adaptations
• Finding the main content corresponds to selecting those elements of the CCV which have a high CCR value, i.e. a value close to 1.
• An element in the CCV is considered to be part of the main content if it has a CCR value above a fixed threshold t.
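Slides 14–17 can be condensed into a short sketch, assuming a token-based CCV and an unweighted box-filter average in place of the paper's weighted blurring; the neighborhood radius and the threshold t below are illustrative values, not the parameters of the original algorithm.

```python
import re

def ccb_extract(html: str, radius: int = 20, threshold: float = 0.7) -> str:
    """Minimal content code blurring sketch over a token-based CCV."""
    tokens = re.findall(r"<[^>]*>|[^<\s]+", html)
    # Content code vector: 1 for content (word) tokens, 0 for code (tags).
    ccv = [0 if t.startswith("<") else 1 for t in tokens]
    n = len(ccv)
    # Blurring: local average over a symmetric, fixed-range neighborhood
    # (a plain box filter; the paper uses a weighted average).
    ccr = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        ccr.append(sum(ccv[lo:hi]) / (hi - lo))
    # Keep content elements whose CCR lies above the threshold t.
    words = [t for t, r in zip(tokens, ccr)
             if r >= threshold and not t.startswith("<")]
    return " ".join(words)
```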
18. Part Four: Clustering Template Based Web Documents (TBWD)
19. Abstract
• More and more documents on the World Wide Web are based on templates.
• On a technical level this causes those documents to have quite similar source code and DOM tree structure.
• Grouping together documents which are based on the same template is an important task for applications that analyze the template structure and need clean training data.
• This paper develops and compares several distance measures for clustering web documents according to their underlying templates. In other words, we take a closer look at web document distance measures which are supposed to reflect template-related structural similarities and dissimilarities.
20. General Information
• As more and more documents on the World Wide Web are generated automatically by Content Management Systems (CMS), more and more of them are based on templates.
• Templates can be seen as framework documents which are filled with different contents to compile the final documents.
• A technical side effect is that the source code of template-generated documents is always very similar.
21. Related Works (1): Recognizing Template Structures in HTML Documents
• Bar-Yossef and Rajagopalan first proposed a template recognition algorithm based on DOM tree segmentation and segment selection. (Template detection via data mining and its applications, 2002)
• Lin and Ho developed InfoDiscoverer, based on the idea that – opposite to the main content – template-generated contents appear more frequently. (Discovering informative content blocks from web documents, 2002)
• Debnath et al. used a similar assumption of redundant blocks in ContentExtractor, but took into account not only words and text but also other features like image or script elements. (Automatic extraction of informative blocks from webpages, 2005)
22. Related Works (2): Recognizing Template Structures in HTML Documents
• The Site Style Tree (SST) approach of Yi, Liu and Li instead concentrates more on the visual impression single DOM tree elements are supposed to achieve, and declares identically formatted DOM sub-trees to be template generated. (Eliminating noisy information in web pages for data mining, 2003)
• Cruz et al. describe several distance measures for web documents. They distinguish between distance measures based on tag vectors, parametric functions or tree edit distances. (Measuring structural similarity among web documents: preliminary results, 1998)
• In the more general context of comparing XML documents, Buttler stated tree edit distances to be probably the best but also very expensive similarity measures. Therefore Buttler proposes the path shingling approach, which makes use of the shingling technique. (A short survey of document structure similarity algorithms, 2004)
23. Related Works (3): Recognizing Template Structures in HTML Documents
• Shi et al. propose an alignment based on a simplified DOM tree representation to find parallel versions of web documents in different languages. (A DOM tree alignment model for mining parallel data from the web, 2006)
24. Distance Measures for TBWD Structures
There are six tag sequence based measures for calculating distances between TBWD:
• RTDM (Restricted Top-Down Mapping) – tree edit distance: this distance measure is based on calculating the cost of transforming a source tree into a target tree structure.
• CP – Common Paths: another way is to look at the paths leading from the root node to the leaf nodes in the DOM tree.
• CPS – Common Path Shingles: the idea is not to compare complete paths but rather to break them up into smaller pieces of equal length – the shingles.
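A minimal sketch of a CP-style measure: collect the sets of root-to-node tag paths of two documents and compare them with a Jaccard-based distance. This is an illustrative simplification, not the paper's exact definition, and it assumes well-formed markup without void elements (which would unbalance the tag stack).

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths of an HTML document."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))
    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

def common_paths_distance(html_a: str, html_b: str) -> float:
    """CP-style sketch: Jaccard-based distance over DOM tag paths."""
    def paths(html):
        p = PathCollector()
        p.feed(html)
        return p.paths
    pa, pb = paths(html_a), paths(html_b)
    return 1.0 - len(pa & pb) / len(pa | pb)
```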
25. Distance Measures for TBWD Structures (Cont.)
• TV – Tag Vector: counting how many times each possible tag appears converts a document D into a vector v(D) of fixed dimension N.
• LCTS – Longest Common Tag Subsequence: the distance of two documents can be expressed based on their longest common tag subsequence.
• CTSS – Common Tag Sequence Shingles: to overcome the computational costs of the previous distance measure we again utilize the shingling technique.
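The shingling idea behind CTSS can be sketched by cutting each document's tag sequence into overlapping length-k shingles and comparing the shingle sets; the shingle length k = 4 and the Jaccard-based distance below are illustrative choices, not the paper's exact formulation.

```python
import re

def tag_shingle_distance(html_a: str, html_b: str, k: int = 4) -> float:
    """CTSS-style sketch: Jaccard-based distance over overlapping
    length-k shingles of the documents' tag sequences."""
    def shingles(html: str) -> set:
        tags = tuple(re.findall(r"<\s*(/?\w+)", html))
        if len(tags) < k:        # document shorter than one shingle
            return {tags}
        return {tags[i:i + k] for i in range(len(tags) - k + 1)}
    sa, sb = shingles(html_a), shingles(html_b)
    return 1.0 - len(sa & sb) / len(sa | sb)
```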
26. Clustering Techniques
In this paper we applied two different techniques for clustering TBWD:
1. k-median clustering
2. Single linkage
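Of the two techniques, single linkage is easy to sketch in plain Python given a precomputed matrix of pairwise document distances; this naive O(n³) version is for illustration only.

```python
def single_linkage(dist, k):
    """Agglomerative single-linkage sketch: start with every document
    in its own cluster and repeatedly merge the two clusters whose
    closest members are nearest, until k clusters remain.
    `dist` is a symmetric matrix of pairwise document distances."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = minimum pairwise distance.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters.pop(b)   # merge b into a
    return clusters
```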
27. Experiments
• To evaluate the different distance measures we collected a corpus of 500 documents from five different German news web sites.
• Each web site contributed 20 documents from each of five topical categories: national and international politics, sports, business, and IT-related news.
• Once the distance matrices had been computed, the different cluster analysis methods were applied to each of them.
28. Experiments (Cont.)
• Evaluation of clustering: we used three different measures to evaluate the k-median and single linkage algorithms:
– The Rand index: a measure of how close the clustering results are to the original classes; a value of one means perfect clustering.
– Cluster purity
– Mutual information
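The Rand index described above can be computed by pair counting: over all pairs of documents, count how often the clustering and the reference classes agree on whether the pair belongs together. A minimal sketch:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index sketch: fraction of item pairs on which two
    clusterings agree (same cluster in both, or different in both).
    A value of 1.0 means identical clusterings up to relabeling."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```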
29. Experiments (Cont.)
Evaluation of k-median clustering for k = 5 (average of 100 repetitions), based on the different distance measures (RTDM, TV, CP, CPS, LCTS, CTSS) and performance measures (Rand index, cluster purity, mutual information):

Measure              RTDM     TV       CP       CPS      LCTS     CTSS
Rand index           0.9399   0.9140   0.9157   0.9293   0.9608   0.9560
Avg. purity          0.9235   0.9057   0.8629   0.9218   0.9613   0.9535
Mutual information   0.1354   0.1302   0.1250   0.1350   0.1444   0.1432

RTDM is providing the best results, followed by the common path measures.
30. Experiments (Cont.)
Evaluation of single linkage clustering for five clusters, based on the different distance measures and the same performance measures:

Measure              RTDM     TV       CP       CPS      LCTS     CTSS
Rand index           0.9200   0.9200   1.0000   1.0000   1.0000   1.0000
Avg. purity          0.9005   0.9005   1.0000   1.0000   1.0000   1.0000
Mutual information   0.1287   0.1287   0.1553   0.1553   0.1553   0.1553

We can deduce that single linkage is a better way to form clusters for template based documents.
31. References
• Thomas Gottron. Evaluating content extraction on HTML documents. In ITA ’07: Proceedings of the 2nd International Conference on Internet Technologies and Applications, pages 123–132, September 2007.
• Thomas Gottron. Combining content extraction heuristics: the CombinE system. In iiWAS ’08: Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services, pages 591–595, New York, NY, USA, 2008. ACM.
• Thomas Gottron. Content code blurring: A new approach to content extraction. In DEXA ’08: 19th International Workshop on Database and Expert Systems Applications, pages 29–33. IEEE Computer Society, September 2008.
• Thomas Gottron. Clustering template based web documents. In Proceedings of the 30th European Conference on Information Retrieval, pages 40–51, 2008.