This document describes the Higgs-Reader project, which extended an existing reader engine, DOM-Distiller, to improve support for Korean web pages. The project created a new Korean language model for text extraction, along with tools for annotating web pages and training the model. It also tuned the existing heuristics in DOM-Distiller and implemented a new Chrome extension, Final Reader, that incorporates these changes. The approach had four stages: building a web page annotator, creating the Korean language model, tuning the reader engine, and building the reader UI.
2. Contents
❏ Overview
❏ Problems & Requirements
❏ Goals Met
❏ Approach
❏ Architecture
❏ WebAnn - Training Set Creator Tool
❏ The Korean Language Model
❏ Final Reader
❏ Demo
4. Overview
● A reader engine is composed of:
○ A web-page text extraction algorithm to find the main article text
○ Heuristics to find metadata and images relevant to the main article
○ A user interface to embody the reader engine
6. Overview
● Higgs-Reader (built upon DOM-Distiller)
○ Boilerpipe extended with a Korean Language Model
■ Tools to train the model - Weka / C4.5 Decision Trees
■ Tools to generate the training set - WebAnn
■ Integration of the model back into the reader engine
○ Existing heuristics in DOM-Distiller will be tuned to improve performance on Korean web pages
○ Final Reader Chrome Extension
7. Goals Met
● Extended the DOM-Distiller reader engine with enhanced support for Korean web pages
● Created a new Korean language model for text extraction
● Tuned the existing heuristics to improve performance on Korean websites
● Created a Reader UI to embody the reader engine
8. Problems Encountered
● Existing reader engines, such as DOM-Distiller, had poor support for non-English web pages.
● Korean websites did not commonly follow website markup standards such as the Open Graph protocol or schema.org.
● Most websites still use generic <div> or <table> tags to separate content, which makes it hard to identify the semantics of any particular section from the HTML source alone.
● Poor performance on multi-page articles. The reader should be able to retrieve all of the pages, or at most K pages, at once.
● Poor detection of relevant images and other rich-content media.
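The multi-page requirement (retrieve all pages, or at most K) can be sketched as a bounded walk over next-page links. This is only an illustration, not the pagination code in DOM-Distiller; `collect_pages`, `fetch`, and the page dictionary shape are hypothetical names chosen for the example, and Python is used purely for brevity.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical page shape: {"text": <article text>, "next": <next-page URL or None>}
Page = Dict[str, Optional[str]]


def collect_pages(start_url: str,
                  fetch: Callable[[str], Page],
                  max_pages: int) -> List[str]:
    """Follow next-page links from start_url and return at most max_pages texts."""
    texts: List[str] = []
    seen = set()
    url: Optional[str] = start_url
    while url is not None and url not in seen and len(texts) < max_pages:
        seen.add(url)  # guard against pagination loops
        page = fetch(url)
        texts.append(page["text"] or "")
        url = page.get("next")
    return texts


# In-memory stand-in for a real fetcher (assumed data, for illustration only).
SITE: Dict[str, Page] = {
    "/a?p=1": {"text": "part 1", "next": "/a?p=2"},
    "/a?p=2": {"text": "part 2", "next": "/a?p=3"},
    "/a?p=3": {"text": "part 3", "next": "/a?p=1"},  # links back to page 1
}

first_two = collect_pages("/a?p=1", SITE.__getitem__, max_pages=2)
```

The `max_pages` argument plays the role of K, and the `seen` set stops the walk when a site's "next" link loops back to an earlier page.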
9. Requirements Met
● A Korean language model was made and integrated into the Boilerpipe
algorithm.
○ Tooling for creating the training set (WebAnn)
● The existing DOM-Distiller was tuned to work with Korean websites.
● Better support for web pages with table-based layouts.
● Better support for multi-page articles.
● Enhanced the relevant image detection heuristic
● Chrome Extension Implementation (Final Reader)
● Comparison mechanism for testing purposes
10. Approach
● 4 Stages
○ Web Page Annotator
○ Korean Language Model for boilerpipe
○ Reader Engine tuning
○ Reader UI
12. Approach / Architecture (WebAnn)
● Web Page Annotator (WebAnn)
○ Built as a Chrome extension
○ Provides a simple UI to annotate
different sections of a web page
with predefined labels.
■ HEADING
■ FULL_CONTENT
■ SUPPLEMENTARY
■ COMMENTS
■ RELEVANT_IMAGES
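As a rough illustration of what one annotation produced by such a tool might look like, the sketch below models the five predefined labels as an enum and an annotation as a small record. WebAnn itself is a Chrome extension, so this Python sketch is not its code; the `Annotation` fields (`url`, `selector`) and the `to_record` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Label(Enum):
    """The predefined annotation labels listed on the slide."""
    HEADING = "HEADING"
    FULL_CONTENT = "FULL_CONTENT"
    SUPPLEMENTARY = "SUPPLEMENTARY"
    COMMENTS = "COMMENTS"
    RELEVANT_IMAGES = "RELEVANT_IMAGES"


@dataclass(frozen=True)
class Annotation:
    url: str       # page being annotated
    selector: str  # hypothetical: a locator for the annotated element
    label: Label


def to_record(annotation: Annotation) -> Dict[str, str]:
    """Flatten an annotation into a plain dict, e.g. for saving a training set."""
    return {
        "url": annotation.url,
        "selector": annotation.selector,
        "label": annotation.label.value,
    }


example = Annotation("http://example.com/news/1", "div#article", Label.FULL_CONTENT)
record = to_record(example)
```

Restricting labels to a closed enum keeps the training set consistent, since a mistyped label fails at annotation time rather than surfacing later as an extra class during training.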
16. Approach / Architecture (Language Model)
● Korean Language Model
○ A corresponding model for each of the Models listed in Table 3.2
■ DensityRulesClassifier
■ HeuristicsFilterBase
■ IgnoreBlocksAfterContentFilter
■ IgnoreBlocksAfterContentFromEndFilter
■ KeepLargestFulltextBlockFilter
■ MinFullTextWordsFilter
■ NumWordsRulesClassifier
■ TerminatingBlocksFinder
○ Will be trained using Shallow Text features listed in Table 3.2
■ prev_link_density
■ prev_text_density
■ prev_num_words
■ prev_num_words_in_anchor_text
■ curr_link_density
■ curr_text_density
■ curr_num_words
■ curr_num_words_in_anchor_text
■ next_link_density
■ next_text_density
■ next_num_words
■ next_num_words_in_anchor_text
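The twelve feature names follow a prev/curr/next pattern: four measurements for a text block and the same four for each of its neighbours. A minimal sketch of assembling such a vector is shown below. The `Block` class and `feature_vector` function are hypothetical, link density is assumed to be the share of a block's words that sit inside anchor text, text density is taken as a given input, and missing neighbours are assumed to count as empty blocks.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    num_words: int
    num_words_in_anchor_text: int
    text_density: float  # taken as given; its computation is out of scope here

    @property
    def link_density(self) -> float:
        # Assumption: link density = anchor-text words / all words in the block.
        if self.num_words == 0:
            return 0.0
        return self.num_words_in_anchor_text / self.num_words


FEATURES = ("link_density", "text_density", "num_words", "num_words_in_anchor_text")


def feature_vector(blocks: List[Block], i: int) -> Dict[str, float]:
    """Return the 12 prev/curr/next features for the block at index i."""
    empty = Block(0, 0, 0.0)  # assumption: a missing neighbour is an empty block
    neighbours = {
        "prev": blocks[i - 1] if i > 0 else empty,
        "curr": blocks[i],
        "next": blocks[i + 1] if i + 1 < len(blocks) else empty,
    }
    return {
        f"{position}_{name}": float(getattr(block, name))
        for position, block in neighbours.items()
        for name in FEATURES
    }


# A short navigation-like block followed by a longer article-like block.
blocks = [Block(4, 4, 2.0), Block(40, 2, 10.0)]
vector = feature_vector(blocks, 1)
```

Including the neighbours' features lets a classifier use context, for example keeping a short paragraph that sits between two long ones.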
17. Approach / Architecture (Language Model)
● Korean Language Models
○ Trained using the C4.5 decision tree algorithm
■ Existing English language models were also trained with this algorithm
■ Better performance on multi-category classification problems
■ Good performance in supervised learning
○ Use the Weka ML toolset
■ Provides a wide range of ML algorithm implementations
■ Easy to compare and evaluate different models by tuning the parameters
■ Provides cross-validation features, such as k-fold cross-validation
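C4.5 chooses splits by gain ratio: the information gain of a split divided by the entropy of the split itself. The sketch below computes that quantity for a single numeric threshold. It is a simplified illustration rather than Weka's implementation (full C4.5 adds further refinements when ranking candidate splits), and the function names and sample data are made up for the example.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values: Sequence[float], labels: Sequence[str],
               threshold: float) -> float:
    """Gain ratio of the binary split `value <= threshold`."""
    left = [label for v, label in zip(values, labels) if v <= threshold]
    right = [label for v, label in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    total = len(labels)
    remainder = sum(len(part) / total * entropy(part) for part in (left, right))
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["left"] * len(left) + ["right"] * len(right))
    return info_gain / split_info


# Made-up data: link density of four blocks and their annotated class.
link_density = [0.0, 0.1, 0.8, 0.9]
classes = ["content", "content", "boilerplate", "boilerplate"]

clean_split = gain_ratio(link_density, classes, threshold=0.5)
uneven_split = gain_ratio(link_density, classes, threshold=0.05)
```

On this toy data the threshold 0.5 separates the two classes perfectly, so it scores higher than the uneven split at 0.05. A decision tree learner repeats this comparison over many candidate features and thresholds at each node.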
22. Approach / Architecture (Reader Engine)
● Reader Engine
○ Based on the DOM-Distiller project
○ New language model will be integrated into Boilerpipe
○ Existing heuristics will be tuned to improve performance on Korean web pages
○ Built upon Google Web Toolkit (GWT)
■ Can use Java libraries
■ Can use Java OOP features
■ The GWT compiler produces cross-browser JS code
■ Reader engine can be ported to any browser
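One plausible way to integrate a trained decision tree into an extraction engine is to translate the tree into nested threshold rules over the shallow text features. The sketch below shows the shape of such a rule-based classifier. It is written in Python for brevity, whereas the engine itself is built on GWT and Java. The thresholds are placeholders chosen for illustration, not values from the trained Korean model, and `is_content` is a hypothetical name.

```python
from typing import Mapping

# Placeholder thresholds for illustration only; real values would come
# from the trained decision tree.
MAX_LINK_DENSITY = 0.33
MIN_WORDS = 15


def is_content(features: Mapping[str, float]) -> bool:
    """Classify one text block as main content (True) or boilerplate (False)."""
    if features["curr_link_density"] > MAX_LINK_DENSITY:
        return False  # link-heavy blocks are treated as navigation
    if features["curr_num_words"] >= MIN_WORDS:
        return True   # long, link-light blocks are treated as content
    # Short block: keep it only if a neighbouring block is long.
    return (features["prev_num_words"] >= MIN_WORDS
            or features["next_num_words"] >= MIN_WORDS)


menu = {"curr_link_density": 0.9, "curr_num_words": 50,
        "prev_num_words": 0, "next_num_words": 0}
paragraph = {"curr_link_density": 0.0, "curr_num_words": 50,
             "prev_num_words": 0, "next_num_words": 0}
caption = {"curr_link_density": 0.0, "curr_num_words": 3,
           "prev_num_words": 40, "next_num_words": 0}

results = [is_content(menu), is_content(paragraph), is_content(caption)]
```

Expressing the model as plain conditionals keeps it free of any ML runtime dependency, which suits code that is compiled to JavaScript.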