Extending DOM-Distiller Reader Engine with Enhanced Support for Korean Web Pages

Higgs-Reader
Team C.
Arif Jafer, Camilo Celis, Marcus Low,

Contents
❏ Overview
❏ Problems & Requirements
❏ Goals Met
❏ Approach
❏ Architecture
❏ WebAnn - Training Set Creator Tool
❏ The Korean Language Model
❏ Final Reader
❏ Demo

Overview
● A reader engine is composed of:
○ A web-page text extraction algorithm that finds the
main article text
○ Heuristics that find metadata and images relevant to
the main article
○ A user interface that presents the reader engine's output
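As an illustration of a metadata heuristic, the sketch below looks for an OpenGraph title and falls back to the plain <title> element when the page does not provide one. It is not taken from DOM-Distiller; the names TitleFinder and find_title are hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Hypothetical helper: records og:title and the <title> text of a page."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self.html_title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:title":
            self.og_title = attrs.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate the text found between <title> and </title>.
        if self._in_title and data.strip():
            self.html_title = (self.html_title or "") + data.strip()


def find_title(html):
    """Prefer the OpenGraph title; fall back to <title> if it is missing."""
    finder = TitleFinder()
    finder.feed(html)
    return finder.og_title or finder.html_title
```

A page without OpenGraph markup still yields a title through the fallback, which is the situation the heuristics need to handle.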

Overview
● Higgs-Reader (built upon DOM-Distiller)
○ Boilerpipe extended with a Korean Language Model
■ Tools to train the model - Weka / C4.5 Decision Trees
■ Tools to generate the training set - WebAnn
■ Integration of the model back into the reader engine
○ Existing Heuristics in DOM-Distiller will be tuned to improve the
performance for Korean Web pages
○ Final Reader Chrome Extension

Goals Met
● Extended the DOM-Distiller reader engine, with
enhanced support for Korean web pages.
● Created a new Korean language model for text
extraction
● Tuned the existing heuristics to improve the
performance on Korean web sites
● Created a Reader UI to embody the reader engine

Problems Encountered
● Existing reader engines, such as DOM-Distiller, had poor support for
non-English web pages.
● Korean websites often did not follow common markup standards,
such as the OpenGraph protocol and schema.org.
● Most websites still use generic <div> or <table> tags to separate
content, so the markup alone does not reveal the semantics of any
particular section of the HTML source.
● Poor performance on multi-page articles. The reader should be able to
retrieve all pages, or at most K pages, at once.
● Poor detection of relevant images and other rich-content
media.
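The "all or at most K pages" requirement can be sketched as a bounded loop over next-page links. This is a minimal illustration, not the DOM-Distiller pagination code: collect_pages, fetch, and next_url are hypothetical names, and the toy SITE dictionary stands in for real HTTP requests.

```python
def collect_pages(start_url, fetch, next_url, max_pages):
    """Follow next-page links from start_url and return at most max_pages bodies."""
    pages = []
    seen = set()  # guards against pagination loops
    url = start_url
    while url is not None and url not in seen and len(pages) < max_pages:
        seen.add(url)
        body = fetch(url)
        pages.append(body)
        url = next_url(url, body)
    return pages


# Toy three-page article: url -> (body, next-page url). Purely illustrative data.
SITE = {
    "/article?page=1": ("part one", "/article?page=2"),
    "/article?page=2": ("part two", "/article?page=3"),
    "/article?page=3": ("part three", None),
}


def fetch(url):
    return SITE[url][0]


def next_url(url, body):
    return SITE[url][1]
```

Calling collect_pages("/article?page=1", fetch, next_url, 2) stops after two pages, while a larger limit returns the whole article.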

Requirements Met
● A Korean language model was made and integrated into the Boilerpipe
algorithm.
○ Tooling for creating the training set (WebAnn)
● The existing DOM-Distiller was tuned to work with Korean websites.
● Better support for web pages whose layouts are built with tables.
● Better support for multi-page articles.
● Enhanced the relevant image detection heuristic
● Chrome Extension Implementation (Final Reader)
● Comparison mechanism for testing purposes

Approach
● 4 Stages
○ Web Page Annotator
○ Korean Language Model for boilerpipe
○ Reader Engine tuning
○ Reader UI

Approach / Architecture (Overall)

Approach / Architecture (WebAnn)
● Web Page Annotator (WebAnn)
○ Built as a Chrome extension
○ Provides a simple UI to annotate
different sections of a web page
with predefined labels.
■ HEADING
■ FULL_CONTENT
■ SUPPLEMENTARY
■ COMMENTS
■ RELEVANT_IMAGES
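One way to picture the annotator's output is as a list of labelled page sections. The sketch below models the five labels as an enum and shows a possible reduction to the two classes used by the classifier. The Annotation fields and the label-to-class mapping are assumptions made for illustration; the slides do not specify how WebAnn stores annotations or which labels count as content.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    HEADING = "HEADING"
    FULL_CONTENT = "FULL_CONTENT"
    SUPPLEMENTARY = "SUPPLEMENTARY"
    COMMENTS = "COMMENTS"
    RELEVANT_IMAGES = "RELEVANT_IMAGES"


@dataclass
class Annotation:
    url: str        # page the annotation belongs to
    selector: str   # hypothetical: CSS selector of the annotated section
    label: Label


# Assumption: only headings and the full article text count as "content".
CONTENT_LABELS = {Label.HEADING, Label.FULL_CONTENT}


def to_class(annotation):
    """Map an annotation to the classifier's two classes (assumed mapping)."""
    return "content" if annotation.label in CONTENT_LABELS else "boilerplate"
```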

WebAnn -- Training Set Creator Tool
[Screenshot: ordinary webpage]

WebAnn -- Training Set Creator Tool
[Screenshot: annotator in action]

Approach / Architecture (Machine Learning)

Approach / Architecture (Language Model)
● Korean Language Model
○ A corresponding model for each of the models listed in Table 3.2:
■ DensityRulesClassifier
■ HeuristicsFilterBase
■ IgnoreBlocksAfterContentFilter
■ IgnoreBlocksAfterContentFromEndFilter
■ KeepLargestFulltextBlockFilter
■ MinFullTextWordsFilter
■ NumWordsRulesClassifier
■ TerminatingBlocksFinder
○ Trained using the shallow text features listed in Table 3.2:
■ prev_link_density
■ prev_text_density
■ prev_num_words
■ prev_num_words_in_anchor_text
■ curr_link_density
■ curr_text_density
■ curr_num_words
■ curr_num_words_in_anchor_text
■ next_link_density
■ next_text_density
■ next_num_words
■ next_num_words_in_anchor_text
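The prev/curr/next feature names suggest that each text block is described together with its two neighbours. The sketch below shows how such a twelve-value vector could be assembled. It is a simplification under stated assumptions: Block, block_features, and feature_vector are hypothetical names, words are split on whitespace, and text density is approximated as words per line after wrapping at an assumed 80 characters.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    anchor_text: str = ""  # the part of `text` that sits inside <a> tags


LINE_WIDTH = 80  # assumed wrap width for the text-density estimate


def block_features(block):
    """Shallow text features of a single block (simplified definitions)."""
    num_words = len(block.text.split())
    num_anchor_words = len(block.anchor_text.split())
    # Link density: share of the block's words that are link text.
    link_density = num_anchor_words / num_words if num_words else 0.0
    # Text density: words per line when the text is wrapped at LINE_WIDTH.
    num_lines = max(1, -(-len(block.text) // LINE_WIDTH))  # ceiling division
    return {
        "link_density": link_density,
        "text_density": num_words / num_lines,
        "num_words": num_words,
        "num_words_in_anchor_text": num_anchor_words,
    }


def feature_vector(blocks, i):
    """Features of block i plus its previous and next neighbours."""
    empty = Block("")  # stands in for a missing neighbour at either end
    vector = {}
    for prefix, j in (("prev", i - 1), ("curr", i), ("next", i + 1)):
        block = blocks[j] if 0 <= j < len(blocks) else empty
        for name, value in block_features(block).items():
            vector[f"{prefix}_{name}"] = value
    return vector
```

A navigation bar made only of links gets a link density of 1.0, while a long paragraph with no links gets 0.0, which is the kind of contrast the decision tree can split on.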

● Korean Language Models
○ Trained using the C4.5 decision tree algorithm
■ The existing English language models were also trained with this algorithm
■ Better performance on multi-category classification problems
■ Good performance in supervised learning
○ Use the Weka ML toolset
■ Provides a wide range of ML algorithm implementations
■ Easy to compare and evaluate different models by tuning the
parameters
■ Provides validation features, such as k-fold cross-validation
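C4.5 chooses its splits by gain ratio, that is, information gain divided by the split information. The sketch below computes that criterion for a single numeric threshold so the idea behind a node such as "num_words <= t" is concrete. It is a standalone illustration, not Weka's implementation, and it leaves out everything else C4.5 does (candidate threshold search, pruning, missing values). The sample values are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`."""
    left = [label for value, label in zip(values, labels) if value <= threshold]
    right = [label for value, label in zip(values, labels) if value > threshold]
    if not left or not right:
        return 0.0  # the split does not separate anything
    total = len(labels)
    weights = (len(left) / total, len(right) / total)
    info_gain = entropy(labels) - weights[0] * entropy(left) - weights[1] * entropy(right)
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


# Made-up example: word counts of four blocks and their classes.
num_words = [2, 3, 40, 55]
classes = ["boilerplate", "boilerplate", "content", "content"]
```

On this toy data a threshold of 3 separates the two classes perfectly, so it scores higher than a threshold of 2.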

Korean Language Heuristics
● Lack of <p> tags
● Terminating Blocks
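For the missing <p> tags, one possible workaround is to treat runs of <br> as paragraph boundaries. The slides do not say how Higgs-Reader handles this, so the sketch below is only an assumption about what such a heuristic could look like; split_paragraphs is a hypothetical name and the regular expressions are deliberately simple.

```python
import re

# One or more consecutive <br>, <br/> or <br /> tags, with optional whitespace.
BR_RUN = re.compile(r"(?:<br\s*/?>\s*)+", re.IGNORECASE)
# Any remaining tag, removed after splitting.
TAG = re.compile(r"<[^>]+>")


def split_paragraphs(html_fragment):
    """Split a fragment into paragraphs at <br> runs and strip other tags."""
    paragraphs = []
    for part in BR_RUN.split(html_fragment):
        text = TAG.sub("", part).strip()
        if text:
            paragraphs.append(text)
    return paragraphs
```

For example, a <div> whose sentences are separated only by <br> tags comes back as one list entry per sentence group.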

Korean Language Model
[Figures: decision tree based on number of words; decision tree based on density of words]

Korean Language Model: Number of Words

English Model, Confusion Matrix
              boilerplate   content
boilerplate         21032       621
content               225       647

Korean Language Model, Confusion Matrix
              boilerplate   content
boilerplate         21637        16
content               142       730

Correctly Classified Instances      22367   99.2986 %
Incorrectly Classified Instances      158    0.7014 %

Korean Language Model: Density of Words

English Model, Confusion Matrix
              boilerplate   content
boilerplate         21105       548
content               220       652

Korean Language Model, Confusion Matrix
              boilerplate   content
boilerplate         21637        16
content               142       730

Correctly Classified Instances      22367   99.2986 %
Incorrectly Classified Instances      158    0.7014 %
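The percentages above follow directly from the confusion matrices. The sketch below recomputes accuracy, plus precision and recall for the content class, from the four counts. It assumes the Weka convention that rows are actual classes and columns are predicted classes, in the order boilerplate, content; the slides do not state the orientation, and summarize is a hypothetical helper.

```python
def summarize(matrix):
    """Accuracy plus content precision/recall from a 2x2 confusion matrix.

    Assumes rows = actual class, columns = predicted class,
    both in the order (boilerplate, content).
    """
    (bb, bc), (cb, cc) = matrix
    total = bb + bc + cb + cc
    return {
        "accuracy": (bb + cc) / total,
        "content_precision": cc / (cc + bc),
        "content_recall": cc / (cc + cb),
    }


# Counts copied from the slides.
korean = summarize([[21637, 16], [142, 730]])
english_num_words = summarize([[21032, 621], [225, 647]])
english_density = summarize([[21105, 548], [220, 652]])

print(f"Korean model accuracy: {korean['accuracy']:.4%}")
```

The Korean model's 22367 correct instances out of 22525 reproduce the reported 99.2986 %, and the same function makes the two English baselines directly comparable.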

Approach / Architecture (Reader Engine)
● Reader Engine
○ Based on the DOM-Distiller project
○ New Language model will be integrated into Boilerpipe
○ Existing Heuristics will be tuned to improve performance on Korean
web pages
○ Built upon Google Web Toolkit (GWT)
■ Can use Java libraries
■ Can use Java OOP features
■ Compiler will produce cross-browser JS code
■ Reader engine can be ported into any browser

Approach / Architecture (Reader Engine)

Final Reader UI (Implementation)
[Screenshots: Final Reader; Old Reader]

OLD READER (Live Demo)
● Small Chrome extension using the old DOM-Distiller code
and the old language model

FINAL READER (Live Demo)
● Faster build cycles
● Can easily be compared with the Old
Reader extension
