Web Crawling and Scraping
with Python
Introduction
contact@molescrape.com
www.molescrape.com
Why have a full platform?
● Handle multiple projects
● Repeated scraping of the same website
● Error monitoring
● Postprocessing allows the extracted information to be
changed later
Technology
[Diagram: A, B, C, D; Custom Processing]
Newspaper Mining
Party names mentioned in the titles of newspaper
articles before the German national elections in 2017
Full article (in German): goo.gl/EfvwQP
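The counting behind such a chart can be sketched in a few lines of Python. This is only an illustration, not the code used for the article: the function name `count_party_mentions`, the party list, and the sample titles are all made up for this example.

```python
import re
from collections import Counter

# Assumption: an illustrative subset of party abbreviations, not the list
# used in the original analysis.
PARTIES = ["CDU", "SPD", "FDP", "AfD"]


def count_party_mentions(titles, parties=PARTIES):
    """Count in how many titles each party name appears as a whole word."""
    counts = Counter({party: 0 for party in parties})
    for title in titles:
        for party in parties:
            # \b keeps "SPD" from matching inside a longer word.
            if re.search(r"\b" + re.escape(party) + r"\b", title):
                counts[party] += 1
    return counts


# Invented sample titles, for demonstration only.
sample_titles = [
    "SPD and CDU clash over pensions",
    "CDU gains in latest poll",
    "Weather outlook for the weekend",
]
result = count_party_mentions(sample_titles)
```

Each title counts at most once per party, so a title that names a party twice does not inflate the result.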
Legal Challenges
● discussed under the term screen scraping
● EU Directive 96/9/EC
● sui generis protection of databases (no level of creativity
needed, only financial investments)
● Court decisions in Germany/Europe usually only about
live requests (i.e. not collection of whole database)
– Often: Price search engines for flights
– Meta search engine for vehicle prices
Future Ideas
● Postprocessing: Replace Mosquito Custom Code with
Apache Kafka
● LegalityPipeline: only ingest k% of the data into database,
but mark all data as already read
● Better integrate broad crawl into infrastructure
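The LegalityPipeline idea could look roughly like the following sketch. It is a hypothetical illustration, not existing molescrape code: the class layout, the `process_item` method, the in-memory `seen` set, and the `database` list are assumptions, and the hash-based sampling is just one possible way to pick k% of the items.

```python
import hashlib


class LegalityPipeline:
    """Hypothetical sketch: store only about k% of items, mark all as read."""

    def __init__(self, keep_percent):
        if not 0 <= keep_percent <= 100:
            raise ValueError("keep_percent must be between 0 and 100")
        self.keep_percent = keep_percent
        self.seen = set()      # stand-in for the "already read" marker
        self.database = []     # stand-in for the real database

    @staticmethod
    def _bucket(key):
        # Map the key to a stable bucket in 0..99 so the decision for a
        # given item is the same on every run.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % 100

    def process_item(self, key, item):
        """Return True if the item was stored, False otherwise."""
        if key in self.seen:
            return False
        # Every new item is marked as read, whether it is stored or not.
        self.seen.add(key)
        if self._bucket(key) < self.keep_percent:
            self.database.append(item)
            return True
        return False


pipeline = LegalityPipeline(keep_percent=10)
for i in range(1000):
    pipeline.process_item("https://example.com/item/%d" % i, {"id": i})
```

Hashing the key instead of drawing a random number keeps the selection reproducible, so an item that was skipped once is not ingested on a later crawl.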
Dashboard
Questions
either now or to contact@molescrape.com

Molescrape: Web Crawling and Scraping with Python