9.

| Name | Structure retainment | Inner content cleaning | Implementation | Source parameter | Language dependency | Additional features and remarks |
|---|---|---|---|---|---|---|
| Boilerpipe | plain text only | uses a classifier to determine whether or not an atomic text block holds useful content | open source Java library | you can fetch documents yourself or use the built-in utilities to fetch them for you | should be language independent, since the text block classifier observes language-independent text features | implements many extractors with different classification rules trained on different datasets |
| Alchemy API | text only (has an option to include relevant hyperlinks) | n/a | commercial web API | include the whole document in the POST request or provide a URL | observation: returns an error for non-English content, e.g. the document contains “unsupported text …” | extra API call to extract the title |
| Diffbot | plain text or HTML | an option to remove inline ads | web API (private beta) | does the fetching for you via the provided URL | n/a | extracts: relevant media, title, tags, XPath descriptor for wrappers, comments and comment count, article summary |
| Readability | retains original structure | uses hardcoded heuristics to extract content divided by ads | open source JavaScript bookmarklet | via browser | language independent, but it relies on language-dependent regular expressions to match id and class labels | |
| Goose | plain text | n/a | open source Java library | URL only (my fork enables you to fetch the document yourself) | language independent, but it relies on language-dependent regular expressions to match id and class labels | uses hardcoded heuristics to search for related images and embedded media |
| Extractiv | depends on the chosen output format, e.g. the XML format breaks the content | n/a | commercial web API | include the whole document in the POST request or provide a URL | n/a | capable of enriching the extracted text with semantic entities and relationships |
| Repustate API | plain text | n/a | commercial web API | URL only | n/a | |
| Webstemmer | plain text | n/a | open source Python library | first runs a crawler to obtain seed pages, then learns layout patterns that are later put to work for extraction | language independent | the only piece of software on this list that requires a cluster of similar documents obtained by crawling |
| NCleaner (paper) | plain text | uses character-level n-grams to detect content text blocks | open source Perl library | arbitrary HTML document | depends on the training language | reliant on the Lynx browser for converting HTML to structured plain text |
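The Readability and Goose entries above note that an otherwise language-independent extractor becomes language dependent once it matches id and class labels with regular expressions. The following is a minimal sketch of that idea, not the actual patterns or scoring used by either tool: the keyword lists, the `label_score` function, and the +1/-1 weighting are all assumptions made for illustration.

```python
import re

# Hypothetical keyword patterns (assumed for illustration only).
# The keywords are English, which is what makes the heuristic language dependent.
POSITIVE = re.compile(r"article|body|content|entry|main|post|story|text", re.I)
NEGATIVE = re.compile(r"banner|comment|footer|menu|nav|promo|sidebar|sponsor", re.I)


def label_score(element_id: str, class_attr: str) -> int:
    """Score an element from its id and class labels.

    Returns +1 if only a content-like keyword matches, -1 if only a
    boilerplate-like keyword matches, and 0 if neither or both match.
    """
    label = f"{element_id} {class_attr}"
    score = 0
    if POSITIVE.search(label):
        score += 1
    if NEGATIVE.search(label):
        score -= 1
    return score


if __name__ == "__main__":
    # English labels are recognised by the patterns.
    print(label_score("main-content", ""))
    print(label_score("", "sidebar promo"))
    # German labels ("Inhalt" = content, "Seitenleiste" = sidebar) match nothing.
    print(label_score("inhalt", "seitenleiste"))
```

With English labels the score separates content containers from boilerplate, while the German labels in the last call fall through with a neutral score. That is the sense in which such extractors depend on the language used by the page's authors in their markup, even though the visible text is never inspected.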
10. Reference
Evaluating Text Extraction Algorithms
List of resources: Article text extraction from HTML documents
Feature-wise Comparison of HTML Article Text Extractors
Overview: Extracting article text from HTML documents
Readability for Java - Snacktory