Molescrape: Web Crawling and Scraping with Python


Molescrape is a platform for web crawling and scraping with Python. It uses Scrapy to schedule the spiders, and postprocessing scripts to process the scraped data and ingest it into databases such as MongoDB and Elasticsearch.
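
The split between crawling and postprocessing can be illustrated with a minimal sketch. This is not Molescrape's actual code: `RawPage`, `extract_title`, and `postprocess` are hypothetical names, and a plain list stands in for the stored crawl results and the target database. The idea is that raw pages are kept as crawled, so the extraction logic can be changed and rerun later without crawling again.

```python
import re
from dataclasses import dataclass


@dataclass
class RawPage:
    """A crawled page stored unmodified (hypothetical structure)."""
    url: str
    html: str


TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(page: RawPage) -> dict:
    """One possible extractor: pull the <title> text out of the raw HTML."""
    match = TITLE_RE.search(page.html)
    title = match.group(1).strip() if match else None
    return {"url": page.url, "title": title}


def postprocess(pages, extractor):
    """Run an extractor over all stored pages.

    A real setup would write the records to a database such as MongoDB
    or Elasticsearch; here they are simply returned as a list.
    """
    return [extractor(page) for page in pages]


# Stand-in for stored crawl results (assumed example data).
raw_store = [
    RawPage("https://example.com/a", "<html><title> First article </title></html>"),
    RawPage("https://example.com/b", "<html><body>No title here</body></html>"),
]

records = postprocess(raw_store, extract_title)
```

Swapping `extract_title` for a different extractor and calling `postprocess` again reprocesses the same stored pages, which is the benefit of keeping the raw data.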

Published in: Internet

  1. Web Crawling and Scraping with Python
  2. Introduction
  3. Introduction
  4. Introduction
  5. Why have a full platform?
     ● Handle multiple projects
     ● Repeated scraping of the same website
     ● Error monitoring
     ● Postprocessing allows the extracted information to be changed later
  6. Technology (diagram; labels: A, B, C, D, Custom Processing)
  7. Newspaper Mining
     Party names mentioned in the titles of newspaper articles before the German national elections in 2017
     Full article (in German):
  8. Legal Challenges
     ● Discussed under the term "screen scraping"
     ● EU Directive 96/9/EC
     ● Sui generis protection of databases (no level of creativity needed, only financial investment)
     ● Court decisions in Germany/Europe are usually only about live requests (i.e. not the collection of a whole database)
       – Often: price search engines for flights
       – Meta search engine for vehicle prices
  9. Future Ideas
     ● Postprocessing: replace the Mosquito custom code with Apache Kafka
     ● LegalityPipeline: only ingest k% of the data into the database, but mark all data as already read
     ● Better integrate broad crawls into the infrastructure
  10. Dashboard
  11. Questions: Questions either now or
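
The LegalityPipeline idea from the Future Ideas slide could look roughly like the sketch below. The slides only name the idea, so everything here is an assumption: the class layout, the `process_item` method, the use of the item's `url` as its identity, and random sampling as the way to pick the k%. It is written in plain Python rather than as a real Scrapy pipeline, and in-memory collections stand in for the database and the "already read" marker.

```python
import random


class LegalityPipeline:
    """Ingest only about keep_percent of items, but mark every item as read.

    Hypothetical sketch: `seen` stands in for the "already read" marker and
    `ingested` for the database.
    """

    def __init__(self, keep_percent, rng=None):
        if not 0 <= keep_percent <= 100:
            raise ValueError("keep_percent must be between 0 and 100")
        self.keep_percent = keep_percent
        self.rng = rng or random.Random()
        self.seen = set()
        self.ingested = []

    def process_item(self, item):
        """Return True if the item was ingested, False otherwise."""
        key = item["url"]  # assumption: the URL identifies an item
        if key in self.seen:
            return False  # already read earlier, never reconsidered
        self.seen.add(key)  # mark as read whether or not it is ingested
        if self.rng.random() * 100 < self.keep_percent:
            self.ingested.append(item)
            return True
        return False
```

Because an item is marked as read before the sampling decision, a later crawl of the same URL cannot add it after the fact, so the stored share stays near k% instead of growing with every repeated scrape.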