Python Packages for Web Data Extraction and Analysis
Summary
• HTML
• Feature Extraction
• Detect Soft 404 pages
• Detect and Classify Pagination Links
• Classify the form types on a web page
• Similarity between web pages
HTML
HTML is the standard markup language for creating web pages.
Feature Extraction
• Aggregate groups of text by block tags.
• Represent HTML as a sequence of tags.
• Annotate information using webstruct.
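The first two ideas above can be sketched with the standard library alone. This is a minimal illustration, not the approach of any package named in this deck: the `BLOCK_TAGS` set, the `FeatureParser` class and the `extract_features` helper are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser

# Assumption: this set of "block" tags is chosen for illustration only.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "td", "section", "article"}


class FeatureParser(HTMLParser):
    """Collects the tag sequence and the text grouped by enclosing block tag."""

    def __init__(self):
        super().__init__()
        self.tag_sequence = []   # every opening tag, in document order
        self.blocks = []         # (block tag, text) pairs
        self._open_blocks = []   # stack of currently open block tags
        self._buffers = []       # one text buffer per open block

    def handle_starttag(self, tag, attrs):
        self.tag_sequence.append(tag)
        if tag in BLOCK_TAGS:
            self._open_blocks.append(tag)
            self._buffers.append([])

    def handle_data(self, data):
        text = data.strip()
        if text and self._buffers:
            # Text belongs to the innermost open block tag.
            self._buffers[-1].append(text)

    def handle_endtag(self, tag):
        if self._open_blocks and self._open_blocks[-1] == tag:
            self._open_blocks.pop()
            text = " ".join(self._buffers.pop())
            if text:
                self.blocks.append((tag, text))


def extract_features(html):
    """Return (tag sequence, list of (block tag, text)) for an HTML string."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    return parser.tag_sequence, parser.blocks


tags, blocks = extract_features(
    "<div><h1>Title</h1><p>Hello <b>world</b></p></div>"
)
print(tags)
print(blocks)
```

Inline tags such as `<b>` stay in the tag sequence but do not start a new text group, so their text is aggregated into the enclosing block.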
Detect Soft 404 pages
A soft 404 is a URL that returns a page telling the user that the page does not exist, while the server still responds with a 200-level (success) status code.
• The soft404 Python package estimates the probability that a page is a 404 page:
>>> import soft404
>>> soft404.probability('<h1>Page not found</h1>')
0.9736860086882132
• Trained on 120k pages from 25k domains, with a 404-page ratio of about 1/3.
• It uses scikit-learn's SGDClassifier to train a logistic regression model.
• ROC AUC 0.995 +/- 0.002.
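To make "logistic regression trained with stochastic gradient descent" concrete, here is a toy sketch in pure Python. It is not the soft404 implementation: the bag-of-words features, the six-page training set and the names `tokenize`, `train` and `probability` are all assumptions made for this example.

```python
import math
import random
import re


def tokenize(html):
    """Lowercased word tokens of the visible text; tags are dropped."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z0-9]+", text.lower())


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=200, lr=0.5, seed=0):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)
        for html, label in samples:
            tokens = set(tokenize(html))
            z = bias + sum(weights.get(t, 0.0) for t in tokens)
            error = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            bias -= lr * error
            for t in tokens:
                weights[t] = weights.get(t, 0.0) - lr * error
    return weights, bias


def probability(html, weights, bias):
    """Predicted probability that the page is a 'not found' page."""
    tokens = set(tokenize(html))
    return sigmoid(bias + sum(weights.get(t, 0.0) for t in tokens))


# Tiny made-up training set: 1 = "not found" page, 0 = regular page.
TRAIN = [
    ("<h1>Page not found</h1>", 1),
    ("<p>Sorry, this page does not exist</p>", 1),
    ("<h1>Oops, nothing found here</h1>", 1),
    ("<h1>Welcome to our store</h1>", 0),
    ("<p>Latest news and articles</p>", 0),
    ("<h1>Product catalogue</h1>", 0),
]

weights, bias = train(TRAIN)
p_missing = probability("<h1>Not found</h1>", weights, bias)
p_regular = probability("<h1>Latest news</h1>", weights, bias)
print(p_missing, p_regular)
```

A real classifier would need far more data and richer features; the point here is only the update rule, where each weight moves against the prediction error of one sample at a time.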
Detect and Classify Pagination Links
AutoPager Python package.
• Its model is a Conditional Random Field (CRF) trained on labeled links.
• It classifies each link as one of:
• PREV: Link to the previous page.
• PAGE: Link to a specific page.
• NEXT: Link to the next page.
• OTHER: Not a pagination link.
Features:
• Text of the link.
• CSS classes of the link.
• Part of the HTML.
• Context to the left and right of the link.
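As a rough sketch of what such per-link features could look like, the code below collects every link on a page and describes it by its text, its CSS classes and the text of its neighbours. The feature names and the `LinkCollector` and `link_features` helpers are hypothetical; AutoPager's actual feature set may differ.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href, text and CSS classes for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self._current = {
                "href": attrs.get("href") or "",
                "classes": (attrs.get("class") or "").split(),
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def link_features(html):
    """One feature dict per link, including the text of its neighbours."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    links = collector.links
    features = []
    for i, link in enumerate(links):
        features.append({
            "text": link["text"],
            "is_digit": link["text"].isdigit(),
            "classes": link["classes"],
            # Left/right context: text of the previous and next link.
            "left_text": links[i - 1]["text"] if i > 0 else "<start>",
            "right_text": links[i + 1]["text"] if i + 1 < len(links) else "<end>",
        })
    return features


SAMPLE = (
    '<div class="pager">'
    '<a class="prev" href="?page=1">Previous</a>'
    '<a href="?page=1">1</a>'
    '<a href="?page=3">3</a>'
    '<a class="next" href="?page=3">Next</a>'
    '</div>'
)
features = link_features(SAMPLE)
for f in features:
    print(f)
```

Because the links are kept in document order, the resulting list can be fed to a sequence model such as a CRF, which labels each link while taking its neighbours into account.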
>>> import autopager
>>> import requests
>>> autopager.urls(requests.get("https://manolo.rocks/search/?q=fujimori"))
['https://manolo.rocks/search/?page=1&q=fujimori',
'https://manolo.rocks/search/?page=2&q=fujimori',
'https://manolo.rocks/search/?page=3&q=fujimori',
'https://manolo.rocks/search/?page=4&q=fujimori',
'https://manolo.rocks/search/?page=5&q=fujimori',
'https://manolo.rocks/search/?q=fujimori',
'https://manolo.rocks/search/?page=12&q=fujimori',
'https://manolo.rocks/search/?page=13&q=fujimori',
'https://manolo.rocks/search/?page=14&q=fujimori',
'https://manolo.rocks/search/?page=15&q=fujimori',
'https://manolo.rocks/search/?page=16&q=fujimori',
'https://manolo.rocks/search/?page=2&q=fujimori']
Classify the form types on a web page
Formasaurus Python package
• It uses two models: one to classify the form type and one to classify the type of each field.
• The models were trained on 1000+ annotated forms.
Form Types:
• search
• login
• registration
• password/login recovery
• contact/comment
• join mailing list
• order/add to cart
• other
Features
• HTTP method of the form (POST/GET).
• Text of the submit buttons.
• Names of the CSS classes and IDs.
• Tags of the inputs.
• Substrings of the URL.
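The features listed above could be gathered roughly as follows. This is a standard-library sketch under assumptions, not Formasaurus code: `FormFeatureParser`, `form_features` and the dictionary keys are hypothetical names.

```python
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Builds one feature dict per <form> element."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._form = None         # form currently being read
        self._in_button = False   # True while inside a <button> element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._form = {
                "method": (attrs.get("method") or "get").lower(),
                "action": attrs.get("action") or "",
                "css": (attrs.get("class") or "").split(),
                "id": attrs.get("id") or "",
                "input_types": [],
                "submit_text": [],
            }
            return
        if self._form is None:
            return  # ignore inputs outside of any form
        if tag == "input":
            input_type = (attrs.get("type") or "text").lower()
            self._form["input_types"].append(input_type)
            if input_type == "submit" and attrs.get("value"):
                self._form["submit_text"].append(attrs["value"])
        elif tag in ("select", "textarea"):
            self._form["input_types"].append(tag)
        elif tag == "button":
            self._in_button = True

    def handle_data(self, data):
        if self._form is not None and self._in_button and data.strip():
            self._form["submit_text"].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "button":
            self._in_button = False
        elif tag == "form" and self._form is not None:
            self.forms.append(self._form)
            self._form = None


def form_features(html):
    """Return a list with one feature dict per form found in the HTML."""
    parser = FormFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.forms


SAMPLE = (
    '<form id="login-form" class="auth box" method="POST" action="/login">'
    '<input type="text" name="user">'
    '<input type="password" name="pass">'
    '<input type="submit" value="Sign in">'
    '</form>'
)
forms = form_features(SAMPLE)
print(forms[0])
```

A classifier would then turn these raw values into numeric features, for example counts of input types or n-grams of the class names and the action URL.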
• Field types are detected with Conditional Random Fields: the fields of a form are treated as a sequence, so their order matters.
Extra work may be needed to get this library running. It does not work with Python 3.7 out of the box because sklearn.externals.joblib is deprecated as of scikit-learn 0.21.
Similarity between web pages
Web pages can be compared by structure (DOM tree) and by style (CSS).
• html-similarity Python package.
• For structural similarity, it represents each page as a sequence of tags and compares the two sequences with a sequence matcher.
• For style similarity, it compares the sets of CSS classes of the two pages using the Jaccard similarity.
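Both measures can be approximated with the standard library. The sketch below uses `difflib.SequenceMatcher` for the tag sequences and a set-based Jaccard similarity for the CSS classes. The function names and the weight `k` are hypothetical, and html-similarity may parse pages differently (for instance with a full HTML parser), so scores need not match the package's output.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagAndClassParser(HTMLParser):
    """Records the sequence of opening tags and the set of CSS classes."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def parse(html):
    parser = TagAndClassParser()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.classes


def structural_sim(html_a, html_b):
    """SequenceMatcher ratio between the two tag sequences."""
    tags_a, _ = parse(html_a)
    tags_b, _ = parse(html_b)
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def style_sim(html_a, html_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    _, classes_a = parse(html_a)
    _, classes_b = parse(html_b)
    union = classes_a | classes_b
    if not union:
        return 1.0  # assumption: pages without classes count as identical in style
    return len(classes_a & classes_b) / len(union)


def combined_sim(html_a, html_b, k=0.5):
    """Weighted mix; k is a hypothetical weight for the structural part."""
    return k * structural_sim(html_a, html_b) + (1 - k) * style_sim(html_a, html_b)


page_a = '<h1 class="title">A</h1><ul class="menu"><li>x</li><li>y</li></ul>'
page_b = '<h1 class="title">B</h1><ul class="menu"><li>x</li></ul>'
print(structural_sim(page_a, page_b))
print(style_sim(page_a, page_b))
print(combined_sim(page_a, page_b))
```

With `k=0.5` the combined score is the plain average of the two parts, which is consistent with the deck's own session, where 0.9545... is the mean of 1.0 and 0.9090....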
In [1]: html_1 = '''
<h1 class="title">First Document</h1>
<ul class="menu">
<li class="active">Documents</li>
<li>Extra</li>
</ul>
'''
In [2]: html_2 = '''
<h1 class="title">Second document Document</h1>
<ul class="menu">
<li class="active">Extra Documents</li>
</ul>
'''
In [3]: from html_similarity import style_similarity, structural_similarity, similarity
In [4]: style_similarity(html_1, html_2)
Out[4]: 1.0
In [7]: structural_similarity(html_1, html_2)
Out[7]: 0.9090909090909091
In [8]: similarity(html_1, html_2)
Out[8]: 0.9545454545454546
More Libraries
• mdr: Detect and extract listing data from HTML pages
• aile: Automatic Item List Extraction
• pydepta: Extract structured data from HTML pages
Takeaways
• Follow engineering blogs and conferences on web crawling. The Zyte engineering blog is good, and the PyData videos are awesome!
• Follow topics such as web scraping, web crawling and wrapper induction on Google Scholar.
• Understand the feature extraction techniques. You can reuse them in your next project.
Questions