Python Packages for Web Data Extraction and Analysis
1. Python Packages for Web Data Extraction and Analysis
2. Summary
• HTML
• Feature Extraction
• Detect Soft 404 pages
• Detect and Classify Pagination Links
• Classify the form types in a web page
• Similarity between web pages
3. HTML
HTML is the standard markup language for creating Web
pages.
4. Feature Extraction
• Aggregate groups of text by block tags.
• Represent HTML as a sequence of tags.
• Annotate information using webstruct.
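The first two bullets can be sketched with the standard library alone. This is a minimal illustration, not the approach of any particular package: the `BLOCK_TAGS` set and the names `FeatureParser` and `extract_features` are assumptions made for the example, and nested block tags are handled only in a simplified way.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags; extend it for real pages.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "td", "section", "article"}


class FeatureParser(HTMLParser):
    """Collects the start-tag sequence and the text grouped by block tag."""

    def __init__(self):
        super().__init__()
        self.tag_sequence = []  # every start tag, in document order
        self.blocks = []        # (block tag, text) pairs
        self._current_tag = None
        self._buffer = []

    def flush(self):
        # Join the buffered text and normalise whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append((self._current_tag, text))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        self.tag_sequence.append(tag)
        if tag in BLOCK_TAGS:
            self.flush()
            self._current_tag = tag

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
            self._current_tag = None

    def handle_data(self, data):
        self._buffer.append(data)


def extract_features(html):
    """Return (tag sequence, list of (block tag, text)) for an HTML string."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text outside a closed block
    return parser.tag_sequence, parser.blocks
```

For example, `extract_features('<div><h1>Title</h1><p>Hello <b>big</b> world</p></div>')` yields the tag sequence `div, h1, p, b` and the text blocks `("h1", "Title")` and `("p", "Hello big world")`.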
6. Detect Soft 404 pages
7. Detect Soft 404 pages
A soft 404 is a URL that returns a page
telling the user that the page does not
exist while still responding with a
200-level (success) status code.
• soft404 Python Package.
>>> import soft404
>>> soft404.probability('<h1>Page not found</h1>')
0.9736860086882132
8. Detect Soft 404 pages
• Trained on 120k pages from 25k domains, with 404 pages making up about 1/3 of the data.
• It uses scikit-learn's SGDClassifier with a logistic regression loss.
• ROC AUC 0.995 +/- 0.002.
11. Detect and Classify Pagination Links
AutoPager Python package.
• It uses Conditional Random Fields to train the model.
12. Detect and Classify Pagination Links
• It classifies each link as one of:
• PREV: Link to the previous page.
• PAGE: Link to a specific page.
• NEXT: Link to the next page.
• OTHER: Not a pagination link.
13. Detect and Classify Pagination Links
Features:
• Text of the link.
• CSS classes of the link.
• Part of the HTML.
• Context from the left and right.
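As a rough illustration of these features, the sketch below builds one feature dictionary per link using only the standard library. It is not AutoPager's actual feature code: the names `LinkCollector` and `link_features` are hypothetical, the slide's "part of the HTML" is approximated here by the raw `href` attribute (an assumption), and the left and right context is reduced to the text of the neighbouring links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href, CSS classes and text for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self._current = {
                "href": attrs.get("href") or "",
                "css": (attrs.get("class") or "").split(),
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def link_features(html):
    """Return one feature dict per link, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    links = collector.links
    features = []
    for i, link in enumerate(links):
        features.append({
            "text": link["text"].lower(),
            "text_is_digit": link["text"].isdigit(),
            "css_classes": link["css"],
            "href": link["href"],  # assumption: stands in for "part of the HTML"
            # Context: text of the neighbouring links, if any.
            "left_text": links[i - 1]["text"].lower() if i > 0 else None,
            "right_text": links[i + 1]["text"].lower() if i + 1 < len(links) else None,
        })
    return features
```

A sequence model such as a CRF can then label each dictionary in the list as PREV, PAGE, NEXT or OTHER, using the neighbouring items as context.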
14. Detect and Classify Pagination Links
>>> import autopager
>>> import requests
>>> autopager.urls(requests.get("https://manolo.rocks/search/?q=fujimori"))
['https://manolo.rocks/search/?page=1&q=fujimori',
'https://manolo.rocks/search/?page=2&q=fujimori',
'https://manolo.rocks/search/?page=3&q=fujimori',
'https://manolo.rocks/search/?page=4&q=fujimori',
'https://manolo.rocks/search/?page=5&q=fujimori',
'https://manolo.rocks/search/?q=fujimori',
'https://manolo.rocks/search/?page=12&q=fujimori',
'https://manolo.rocks/search/?page=13&q=fujimori',
'https://manolo.rocks/search/?page=14&q=fujimori',
'https://manolo.rocks/search/?page=15&q=fujimori',
'https://manolo.rocks/search/?page=16&q=fujimori',
'https://manolo.rocks/search/?page=2&q=fujimori']
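The returned list can contain duplicates and is not ordered by page. A small post-processing step, sketched below with the standard library, removes duplicates and sorts by page number. The helper names are hypothetical, the `page` parameter name is taken from the URLs shown above, and treating a URL without that parameter as page 1 is an assumption.

```python
from urllib.parse import parse_qs, urlsplit


def page_number(url, param="page"):
    """Return the page number in the query string, or 1 if it is absent."""
    values = parse_qs(urlsplit(url).query).get(param)
    if values and values[0].isdigit():
        return int(values[0])
    return 1  # assumption: no page parameter means the first page


def unique_sorted(urls, param="page"):
    """Drop duplicate URLs (keeping first occurrences) and sort by page number."""
    unique = list(dict.fromkeys(urls))
    return sorted(unique, key=lambda url: page_number(url, param))
```

Because `sorted` is stable, URLs that map to the same page number keep their original relative order.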
16. Classify the form types on a web page
17. Classify the form types on a web page
Formasaurus Python Package
• It uses 2 models: one to detect the form type and the other to
detect the type of each field.
• The model was trained with 1000+ annotated forms.
18. Classify the form types on a web page
Form Types:
• search
• login
• registration
• password/login recovery
• contact/comment
• join mailing list
• order/add to cart
• other
19. Classify the form types on a web page
Features
• POST/GET
• Text of the submit buttons.
• Names of the CSS classes and IDs.
• Tags of the inputs.
• Strings in the URL.
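The features listed above can be collected from raw HTML roughly as follows. This is a standard-library sketch, not the library's own feature extraction: `FormFeatureParser` and `form_features` are hypothetical names, "tags of the inputs" is interpreted as the tag name plus the `type` attribute for `<input>`, and "strings in the URL" is reduced to the form's `action` attribute.

```python
from html.parser import HTMLParser

FIELD_TAGS = {"input", "select", "textarea"}


class FormFeatureParser(HTMLParser):
    """Collects simple classification features for every <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._form = None
        self._in_button = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._form = {
                "method": (attrs.get("method") or "get").lower(),  # GET is the HTML default
                "action": attrs.get("action") or "",
                "css": [],          # class names and ids seen inside the form
                "fields": [],       # e.g. "input:text", "select"
                "submit_text": [],  # text of the submit buttons
            }
            self.forms.append(self._form)
        if self._form is None:
            return
        for key in ("class", "id"):
            self._form["css"].extend((attrs.get(key) or "").split())
        if tag == "input":
            input_type = (attrs.get("type") or "text").lower()
            self._form["fields"].append("input:" + input_type)
            if input_type == "submit" and attrs.get("value"):
                self._form["submit_text"].append(attrs["value"].strip().lower())
        elif tag in FIELD_TAGS:
            self._form["fields"].append(tag)
        elif tag == "button":
            self._in_button = True

    def handle_endtag(self, tag):
        if tag == "button":
            self._in_button = False
        elif tag == "form":
            self._form = None

    def handle_data(self, data):
        if self._form is not None and self._in_button and data.strip():
            self._form["submit_text"].append(data.strip().lower())


def form_features(html):
    """Return one feature dict per form found in the HTML string."""
    parser = FormFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.forms
```

A feature dictionary like this can be turned into a vector (for example with a bag-of-words over the strings) before it is passed to a classifier.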
20. Classify the form types on a web page
• Field types are detected using Conditional Random Fields. The
form is treated as a sequence of fields in which the order matters.
21. Classify the form types on a web page
Extra work may be needed to make this library work. It does not
work with Python 3.7 out of the box because
sklearn.externals.joblib is deprecated as of scikit-learn 0.21.
24. Similarity between web pages
Web pages can be compared by structure (DOM tree) and
style (CSS).
html-similarity Python Package.
26. Similarity between web pages
• For structural similarity, it represents each page as a sequence of
tags and compares the sequences with a sequence matcher.
• For style similarity, it takes the set of CSS classes of each page
and measures the overlap with the Jaccard similarity.
27. Similarity between web pages
In [1]: html_1 = '''
<h1 class="title">First Document</h1>
<ul class="menu">
<li class="active">Documents</li>
<li>Extra</li>
</ul>
'''
In [2]: html_2 = '''
<h1 class="title">Second document Document</h1>
<ul class="menu">
<li class="active">Extra Documents</li>
</ul>
'''
In [3]: from html_similarity import style_similarity, structural_similarity, similarity
In [4]: style_similarity(html_1, html_2)
Out[4]: 1.0
In [7]: structural_similarity(html_1, html_2)
Out[7]: 0.9090909090909091
In [8]: similarity(html_1, html_2)
Out[8]: 0.9545454545454546
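The two measures can be approximated with the standard library, which helps to see what the scores mean. The sketch below is not the package's implementation: all function names are hypothetical, only the tags literally present in the input are counted (so the structural score need not match the value shown above), and the equal weighting `k=0.5` is an assumption suggested by the combined score above being the mean of the other two.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagClassParser(HTMLParser):
    """Collects the start-tag sequence and the set of CSS classes."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def parse(html):
    parser = TagClassParser()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.classes


def structural_sim(html_a, html_b):
    """Similarity ratio of the two tag sequences (difflib.SequenceMatcher)."""
    tags_a, _ = parse(html_a)
    tags_b, _ = parse(html_b)
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def style_sim(html_a, html_b):
    """Jaccard similarity of the two sets of CSS classes."""
    _, classes_a = parse(html_a)
    _, classes_b = parse(html_b)
    union = classes_a | classes_b
    if not union:
        return 1.0  # assumption: two pages without classes have identical style
    return len(classes_a & classes_b) / len(union)


def combined_sim(html_a, html_b, k=0.5):
    """Weighted mean of structural and style similarity; k is the structural weight."""
    return k * structural_sim(html_a, html_b) + (1 - k) * style_sim(html_a, html_b)
```

On the two documents from the session above, both pages use the same three classes (`title`, `menu`, `active`), so the style score is 1.0, while the tag sequences differ by one `li`, which lowers the structural score.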
28. More Libraries
• mdr: Detect and extract listing data from an HTML page
• aile: Automatic Item List Extraction
• pydepta: Extract structured data from an HTML page
29. Takeaways
• Follow engineering blogs and conferences on web crawling.
The Zyte engineering blog is good and the videos from PyData are
awesome!
• Follow interesting topics on Google Scholar, such as web
scraping, web crawling, and wrapper induction.
• Understand the feature extraction techniques. You can use them in your
next project.