Molescrape is a platform for web crawling and scraping with python. It uses scrapy to schedule the spiders and postprocessing scripts to process the data and ingest it into different databases like MongoDB and Elasticsearch.
Why having afull platform?
● Handle multiple projects
● Repeated scraping on same website
● Error monitoring
● Postprocessing allows to change the extracted information
later
contact@molescrape.com
www.molescrape.com
Legal Challenges
● discussedunder the term screen scraping
● EU Directive 96/6/EC
● sui generis protection of databases (no level of creativity
needed, only financial investments)
● Court decisions in Germany/Europe usually only about
live requests (i.e. not collection of whole database)
– Ofen: Price search engines for flights
– Meta search engine for vehicle prices
contact@molescrape.com
www.molescrape.com
9.
Future Ideas
● Postprocessing:Replace Mosquito Custom Code with
Apache Kafka
● LegalityPipeline: only ingest k% of the data into database,
but mark all data as already read
● Beter integrate broad crawl into infrastructure
contact@molescrape.com
www.molescrape.com