Higgs-Reader
Team C.
Arif Jafer, Camilo Celis, Marcus Low,
Contents
❏ Overview
❏ Problems & Requirements
❏ Goals Met
❏ Approach
❏ Architecture
❏ WebAnn - Training Set Creator Tool
❏ The Korean Language Model
❏ Final Reader
❏ Demo
Overview
● A reader engine is composed of:
○ A web-page text extraction algorithm that finds the main article text
○ Heuristics that find metadata and images relevant to the main article
○ A user interface that exposes the reader engine to the user
Overview (boilerpipe)
Overview
● Higgs-Reader (built upon DOM-Distiller)
○ Boilerpipe extended with a Korean language model
■ Tools to train the model: Weka / C4.5 decision trees
■ Tools to generate the training set: WebAnn
■ Integration of the model back into the reader engine
○ Existing heuristics in DOM-Distiller will be tuned to improve performance on Korean web pages
○ Final Reader Chrome extension
Goals Met
● Extended the DOM-Distiller reader engine with enhanced support for Korean web pages.
● Created a new Korean language model for text extraction.
● Tuned the existing heuristics to improve performance on Korean websites.
● Created a reader UI that exposes the reader engine to the user.
Problems Encountered
● Existing reader engines, such as DOM-Distiller, had poor support for non-English web pages.
● Korean websites did not commonly follow website markup standards such as the OpenGraph protocol or schema.org.
● Most websites still use generic <div> or <table> tags to separate content. These tags carry no semantic information, so the role of a given section cannot be read directly from the HTML source.
● Poor performance on multi-page websites. The reader should be able to retrieve all pages, or at most K pages, at once.
● Poor detection of relevant images and other rich-content media.
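A minimal Python sketch of why missing markup matters for metadata heuristics: when a page provides OpenGraph tags, the title can be read directly, and otherwise a fallback is needed. This is an illustration only, not DOM-Distiller code; the class and function names are hypothetical, and falling back to the <title> element is an assumed strategy.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects OpenGraph <meta> properties and the <title> text."""

    def __init__(self):
        super().__init__()
        self.og = {}            # e.g. {"og:title": "..."}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content"):
                self.og[prop] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Prefer og:title; fall back to <title> when OpenGraph is absent (assumed strategy)."""
    parser = MetadataParser()
    parser.feed(html)
    return parser.og.get("og:title") or parser.title.strip()
```

Pages without OpenGraph tags take the fallback path, which is the case the deck describes as common on Korean websites.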
Requirements Met
● A Korean language model was built and integrated into the Boilerpipe algorithm.
○ Tooling for creating the training set (WebAnn)
● The existing DOM-Distiller was tuned to work with Korean websites.
● Better support for web pages whose layouts are built with tables.
● Better support for multi-page web pages.
● Enhanced the relevant-image detection heuristic.
● Chrome extension implementation (Final Reader)
● Comparison mechanism for testing purposes
Approach
● 4 Stages
○ Web Page Annotator
○ Korean Language Model for boilerpipe
○ Reader Engine tuning
○ Reader UI
Approach / Architecture (Overall)
Approach / Architecture (WebAnn)
● Web Page Annotator (WebAnn)
○ Built as a Chrome extension
○ Provides a simple UI to annotate different sections of a web page with predefined labels:
■ HEADING
■ FULL_CONTENT
■ SUPPLEMENTARY
■ COMMENTS
■ RELEVANT_IMAGES
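One way an annotation could be stored for later training, sketched in Python. The five labels come from the slide; the record layout (`url`, `selector`) and the JSON export are assumptions made for illustration and do not describe WebAnn's actual storage format.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class Label(Enum):
    """The predefined WebAnn labels listed on the slide."""
    HEADING = "HEADING"
    FULL_CONTENT = "FULL_CONTENT"
    SUPPLEMENTARY = "SUPPLEMENTARY"
    COMMENTS = "COMMENTS"
    RELEVANT_IMAGES = "RELEVANT_IMAGES"


@dataclass
class Annotation:
    url: str        # page being annotated
    selector: str   # hypothetical: CSS selector identifying the annotated element
    label: Label


def to_json(annotations):
    """Serialize annotations to a JSON array, storing each label by name."""
    rows = []
    for a in annotations:
        row = asdict(a)
        row["label"] = a.label.value
        rows.append(row)
    return json.dumps(rows, ensure_ascii=False)
```

For example, `to_json([Annotation("http://example.com/news/1", "div#article", Label.FULL_CONTENT)])` produces one record that a training-set builder could later join with the features of the selected element.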
WebAnn -- Training Set Creator Tool
Ordinary
Webpage
WebAnn -- Training Set Creator Tool
Annotator in
Action
Approach / Architecture
(Machine Learning)
Approach / Architecture (Language Model)
● Korean Language Model
○ A corresponding model for each of the models listed in Table 3.2
■ DensityRulesClassifier
■ HeuristicsFilterBase
■ IgnoreBlocksAfterContentFilter
■ IgnoreBlocksAfterContentFromEndFilter
■ KeepLargestFulltextBlockFilter
■ MinFullTextWordsFilter
■ NumWordsRulesClassifier
■ TerminatingBlocksFinder
○ Will be trained using the shallow text features listed in Table 3.2
■ prev_link_density
■ prev_text_density
■ prev_num_words
■ prev_num_words_in_anchor_text
■ curr_link_density
■ curr_text_density
■ curr_num_words
■ curr_num_words_in_anchor_text
■ next_link_density
■ next_text_density
■ next_num_words
■ next_num_words_in_anchor_text
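The twelve shallow text features can be sketched in Python as four measurements taken on the previous, current, and next text block. This is an illustration, not Boilerpipe's code: `TextBlock` is a hypothetical type, words are split on whitespace, link density is taken as anchor words divided by all words, and text density is simplified to words per 80-character wrapped line (an assumption).

```python
import textwrap
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str         # all visible text in the block
    anchor_text: str  # the part of that text that sits inside <a> tags


def block_features(block, prefix):
    """Compute the four per-block features, named with the given prefix."""
    num_words = len(block.text.split())
    num_anchor = len(block.anchor_text.split())
    lines = textwrap.wrap(block.text, width=80)  # simplified line wrapping
    return {
        f"{prefix}_link_density": num_anchor / num_words if num_words else 0.0,
        f"{prefix}_text_density": num_words / len(lines) if lines else 0.0,
        f"{prefix}_num_words": num_words,
        f"{prefix}_num_words_in_anchor_text": num_anchor,
    }


EMPTY = TextBlock("", "")  # stands in for a missing neighbour at either end


def feature_vector(blocks, i):
    """Build the prev/curr/next feature vector for block i."""
    prev_block = blocks[i - 1] if i > 0 else EMPTY
    next_block = blocks[i + 1] if i + 1 < len(blocks) else EMPTY
    features = {}
    features.update(block_features(prev_block, "prev"))
    features.update(block_features(blocks[i], "curr"))
    features.update(block_features(next_block, "next"))
    return features
```

A navigation bar made entirely of links yields a link density of 1.0, while a long paragraph with few links yields a high word count and a low link density, which is the contrast a classifier can split on.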
Approach / Architecture (Language Model)
● Korean Language Models
○ Trained using the C4.5 decision tree algorithm
■ The existing English language models were also trained with this algorithm
■ Better performance on multi-category classification problems
■ Good performance in supervised learning
○ Use the Weka ML toolset
■ Provides implementations of a wide range of ML algorithms
■ Makes it easy to compare and evaluate different models by tuning their parameters
■ Provides cross-validation features, such as k-fold cross-validation
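For readers unfamiliar with k-fold cross-validation, the splitting step can be sketched in a few lines of Python. Weka performs this internally; the function below is a hypothetical stand-alone illustration and not Weka's implementation. Each instance is used for testing exactly once and for training in the other k - 1 rounds.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)          # fixed seed for repeatable splits
    folds = [indices[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

A model is trained on `train` and scored on `test` for each of the k rounds, and the k scores are averaged to estimate how well it generalizes.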
Korean Language Heuristics
● Lack of <p> tags
● Terminating Blocks
Korean Language Model
[Figure: Decision Tree based on Number of Words]
[Figure: Decision Tree based on Density of Words]

Number of Words
Korean Language Model
(rows: actual class, columns: classified as)

English Model
              boilerplate  content
boilerplate         21032      621
content               225      647

Confusion Matrix
              boilerplate  content
boilerplate         21637       16
content               142      730

Correctly Classified Instances      22367   99.2986 %
Incorrectly Classified Instances      158    0.7014 %

Density of Words
(rows: actual class, columns: classified as)

English Model
              boilerplate  content
boilerplate         21105      548
content               220      652

Confusion Matrix
              boilerplate  content
boilerplate         21637       16
content               142      730

Correctly Classified Instances      22367   99.2986 %
Incorrectly Classified Instances      158    0.7014 %
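The reported percentages can be reproduced from the confusion matrices with a short Python sketch. It assumes that rows are actual classes and columns are predicted classes, in the order boilerplate, content; the slide does not state this, although the row totals (21653 and 872) are the same in every matrix, which is consistent with that reading. The variable names are hypothetical.

```python
def metrics(matrix):
    """matrix[actual][predicted], class order: boilerplate, content (assumed layout)."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "content_precision": tp / (tp + fp),
        "content_recall": tp / (tp + fn),
    }


# Values copied from the slides.
english_num_words = [[21032, 621], [225, 647]]
english_density = [[21105, 548], [220, 652]]
# The matrix shown next to the Weka summary (presumably the Korean model).
second_matrix = [[21637, 16], [142, 730]]

for name, m in [
    ("English, number of words", english_num_words),
    ("English, density of words", english_density),
    ("Second matrix", second_matrix),
]:
    print(name, {k: round(v, 4) for k, v in metrics(m).items()})
```

Because content blocks make up only 872 of the 22525 instances, accuracy alone hides much of the difference between the models, so precision and recall on the content class are worth reading alongside it.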
Approach / Architecture (Language Model)
Approach / Architecture (Reader Engine)
● Reader Engine
○ Based on the DOM-Distiller project
○ The new language model will be integrated into Boilerpipe
○ Existing heuristics will be tuned to improve performance on Korean web pages
○ Built upon Google Web Toolkit (GWT)
■ Can use Java libraries
■ Can use Java OOP features
■ The compiler produces cross-browser JavaScript code
■ The reader engine can be ported to any browser
Approach / Architecture (Reader Engine)
Final Reader UI (Implementation)
Final Reader Old Reader
OLD READER (Live Demo)
● Small Chrome extension using the old DOM-Distiller code and the old language model
FINAL READER (Live Demo)
● Faster build cycles
● Can be used for easy side-by-side comparison with the Old Reader extension
Thank you

Extending DOM-Distiller Reader Engine with Enhanced Support for Korean Web Pages
