Python Packages for Web Data Extraction and Analysis


Summary
• HTML
• Feature Extraction
• Detect Soft 404 pages
• Detect and Classify Pagination Links
• Classify the form types in a web page
• Similarity between web pages

HTML
HTML is the standard markup language for creating Web
pages.

Feature Extraction
• Aggregate groups of text by block tags.
• Represent HTML as a sequence of tags.
• Annotate information using webstruct.
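
The first two ideas above can be sketched with the standard library alone. This is a minimal illustration, not how any of the packages in this deck implement it; the names FeatureParser, extract_features and BLOCK_TAGS are hypothetical, and the set of block tags is an assumption.

```python
from html.parser import HTMLParser

# Assumption: an illustrative, non-exhaustive set of block-level tags.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "td", "section", "article"}


class FeatureParser(HTMLParser):
    """Collect the sequence of opening tags and the text between block boundaries."""

    def __init__(self):
        super().__init__()
        self.tags = []     # opening tags in document order
        self.blocks = []   # one text string per block of text
        self._buffer = []  # text pieces collected since the last block boundary

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag in BLOCK_TAGS:
            self.flush_block()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush_block()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._buffer.append(text)

    def flush_block(self):
        # Close the current group of text, if any.
        if self._buffer:
            self.blocks.append(" ".join(self._buffer))
            self._buffer = []


def extract_features(html):
    """Return (tag sequence, list of text blocks) for an HTML string."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    parser.flush_block()  # keep any trailing text after the last block tag
    return parser.tags, parser.blocks


tags, blocks = extract_features("<div><h1>Title</h1><p>Hello <b>world</b></p></div>")
print(tags)
print(blocks)
```

Inline tags such as b stay inside the surrounding block, so the text is grouped per paragraph or heading rather than per tag.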

Detect Soft 404 pages

A soft 404 is a URL that returns a page
telling the user that the page does not
exist, yet responds with a 200-level
(success) status code.
• soft404 Python Package.
>>> import soft404
>>> soft404.probability('<h1>Page not found</h1>')
0.9736860086882132
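
One possible way to turn that probability into a decision is to combine it with the HTTP status code. The helper below is a hypothetical sketch, not part of soft404; the name is_soft_404 and the 0.5 threshold are assumptions to tune on your own data.

```python
def is_soft_404(status_code, probability, threshold=0.5):
    """Flag a response as a soft 404.

    status_code: HTTP status code of the response.
    probability: score in [0, 1], for example the value returned by
        soft404.probability(html).
    threshold: assumed cut-off, not a value recommended by the library.
    """
    is_success = 200 <= status_code < 300
    # A real 404 is not "soft"; only success responses can be soft 404s.
    return is_success and probability >= threshold


print(is_soft_404(200, 0.9736860086882132))
print(is_soft_404(404, 0.9736860086882132))
```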

• Trained on 120k pages from 25k domains, with a 404-page ratio of about 1/3.
• It uses SGDClassifier + Logistic Regression.
• ROC AUC 0.995 +/- 0.002.

Detect and Classify Pagination Links
AutoPager Python package.
• It uses Conditional Random Fields to train the model.

• It classifies each link as one of:
• PREV: Link to the previous page.
• PAGE: Link to a specific page.
• NEXT: Link to the next page.
• OTHER: Not a pagination link.

Features:
• Text of the link.
• CSS class of the link.
• Part of the HTML.
• Context from the left and right.
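
A rough sketch of how such features could be built per link, as input for a sequence model like a CRF. This is not AutoPager's actual feature code; the function link_features, the feature names and the is_digit feature are assumptions made for illustration.

```python
def link_features(links):
    """Build one feature dict per link, including left/right context.

    links: list of (text, css_class) tuples in document order.
    """
    features = []
    last = len(links) - 1
    for i, (text, css_class) in enumerate(links):
        clean = text.strip().lower()
        feats = {
            "text": clean,
            "is_digit": clean.isdigit(),  # assumed extra feature: page numbers
            "css_class": css_class,
            # Context: text of the neighbouring links, or a boundary marker.
            "left_text": links[i - 1][0].strip().lower() if i > 0 else "<start>",
            "right_text": links[i + 1][0].strip().lower() if i < last else "<end>",
        }
        features.append(feats)
    return features


for feats in link_features([("Prev", "nav"), ("1", "page"), ("Next", "nav")]):
    print(feats)
```

Because each dict carries its neighbours' text, a sequence model can learn patterns such as "a number between Prev and Next is a PAGE link".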

>>> import autopager
>>> import requests
>>> autopager.urls(requests.get("https://manolo.rocks/search/?q=fujimori"))
['https://manolo.rocks/search/?page=1&q=fujimori',
'https://manolo.rocks/search/?page=2&q=fujimori',
'https://manolo.rocks/search/?q=fujimori',
'https://manolo.rocks/search/?page=2&q=fujimori']

Classify the form types on a web page
Formasaurus Python Package
• It uses 2 models, one for detecting the form type and the
other for detecting the field types.
• The model was trained with 1000+ annotated forms.

Form Types:
• search
• login
• registration
• password/login recovery
• contact/comment
• join mailing list
• order/add to cart
• other

Features
• POST/GET
• Text of the submit buttons.
• Names of the CSS classes and IDs.
• Tags of the inputs.
• Strings in the URL.

• Field types are detected using Conditional Random Fields. The
form is treated as a sequence of fields, where the order matters.

Extra work may be needed to make this library work. It does not
work with Python 3.7 out of the box because
sklearn.externals.joblib is deprecated as of scikit-learn 0.21.

Similarity between web pages
Web pages can be compared by structure (DOM tree) and
style (CSS).
html-similarity Python Package.

• For structure, it uses a sequence of tags and calculates the
similarity between the sequences using sequence matcher.
• For style similarity, it compares the sets of CSS classes of the
two pages using the Jaccard index.
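
The two measures above can be approximated with the standard library. This is a simplified sketch under stated assumptions, not the html-similarity implementation: structural_sim, style_sim and TagClassCollector are hypothetical names, the HTML is not normalized first, and the scores will not necessarily match the package's output.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagClassCollector(HTMLParser):
    """Collect the sequence of opening tags and the set of CSS classes."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def collect(html):
    parser = TagClassCollector()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.classes


def structural_sim(html_a, html_b):
    """Similarity ratio between the two tag sequences."""
    tags_a, _ = collect(html_a)
    tags_b, _ = collect(html_b)
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def style_sim(html_a, html_b):
    """Jaccard index of the two sets of CSS classes."""
    _, classes_a = collect(html_a)
    _, classes_b = collect(html_b)
    union = classes_a | classes_b
    if not union:
        return 1.0  # assumption: two pages without classes count as identical
    return len(classes_a & classes_b) / len(union)


page_a = '<h1 class="title">A</h1><ul class="menu"><li class="active">x</li><li>y</li></ul>'
page_b = '<h1 class="title">B</h1><ul class="menu"><li class="active">x</li></ul>'
print(structural_sim(page_a, page_b))
print(style_sim(page_a, page_b))
```

Both pages use the same three classes, so the style score is 1.0, while the extra li in the first page lowers the structural score.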

In [1]: html_1 = '''
<h1 class="title">First Document</h1>
<ul class="menu">
<li class="active">Documents</li>
<li>Extra</li>
</ul>
'''
In [2]: html_2 = '''
<h1 class="title">Second document Document</h1>
<ul class="menu">
<li class="active">Extra Documents</li>
</ul>
'''
In [3]: from html_similarity import style_similarity, structural_similarity, similarity
In [4]: style_similarity(html_1, html_2)
Out[4]: 1.0
In [7]: structural_similarity(html_1, html_2)
Out[7]: 0.9090909090909091
In [8]: similarity(html_1, html_2)
Out[8]: 0.9545454545454546

More Libraries
• mdr: Detect and extract listing data from HTML pages
• aile: Automatic Item List Extraction
• pydepta: Extract structured data from HTML pages

Takeaways
• Follow engineering blogs and conferences on web crawling.
The Zyte engineering blog is good, and the PyData videos are
awesome!
• Follow interesting topics on Google Scholar, such as web
scraping, web crawling and wrapper induction.
• Understand the feature extraction techniques. You can use
them in your next project.

Questions
