Data Scraping (crawling) Risks
Scrapers take for free what a company has
spent large sums to develop, resulting in lost
revenue and eroded customer confidence in the
brand. This is theft of digital property and an
attack on the uniqueness of online brands.
It is impossible for traditional network security
devices such as firewalls, intrusion detection and
prevention systems, or even application-layer
firewalls to detect or block scrapers, because
sophisticated scraping tools mimic the browsing
patterns of real users.
After analyzing the frequency of requests to the
server, you can pick one or more of the following
techniques based on that analysis.
IP-address ban
The easiest and most common way to detect
website scraping attempts is to analyze the frequency
of requests to the server. If requests from a certain IP
address come too often or in too large a volume, the
address might be blocked, or the client may be asked
to solve a CAPTCHA to continue.
The most important thing in this protection method is to
find the boundary between the normal frequency and
number of requests and actual scraping attempts, so as
not to block ordinary users. Commonly this boundary is
determined by analyzing the behavior of ordinary users.
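The detection step above can be sketched as a per-IP sliding-window counter. This is a minimal illustration, not a production design; the window size and threshold are hypothetical values that would in practice come from analyzing ordinary users' behavior, as described above.

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds -- real values should come from
# analyzing the behavior of ordinary users.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 100

_requests = defaultdict(deque)  # ip -> timestamps of recent requests


def is_suspicious(ip, now=None):
    """Return True if `ip` exceeded the allowed request rate."""
    now = time.monotonic() if now is None else now
    window = _requests[ip]
    window.append(now)
    # Drop timestamps that fell out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS_PER_WINDOW
```

A flagged address would then be blocked outright or redirected to a CAPTCHA page.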
Bypass (IP-address ban)
One may bypass this protection by using multiple
proxies to hide the real IP address of the
scraper. Don't use your real IP address in the
first place.
CAPTCHA
CAPTCHA is a popular way of protecting data from web
scraping, too. In this case the user is asked to type the
CAPTCHA text to get access to the website. The significant
disadvantage of this method is the inconvenience to regular
users who are forced to enter CAPTCHAs. Therefore, it is
mostly applicable in systems where data is accessed
infrequently and upon individual requests.
Bypass (CAPTCHA)
Many web services and browser extensions allow you
to bypass CAPTCHAs.
Most CAPTCHA-cracking services are commercial.
Using different accounts
With this protection method the data can be
accessed by authorized users only. This simplifies
monitoring users' behavior and blocking
suspicious accounts regardless of the IP
address the client is working from.
You can't always use this approach, though,
since mandatory registration will cost you many
customers.
Bypass (Using different accounts)
This protection might be bypassed by creating a
set of accounts, including automatically
registered ones. There are services selling
ready-made accounts on well-known social
networks. Verifying the account by phone (a
so-called PVA, Phone Verified Account) to check
its authenticity can add significant complexity to
automatic account creation, although it can be
bypassed using disposable SIM cards.
Create your own bulk account generator
JavaScript protection
With this method the browser sends a special code (or
several codes) in its request to the server, and the
codes are formed by complex logic written in
JavaScript. The code is often obfuscated, and
reproducing it outside a real browser is difficult.
It might be bypassed through scraping with real
browsers (for example, using the Selenium or
Mechanize libraries). But this gives the method an
additional advantage: the scraper will show
up in website traffic analytics (e.g., Google
Analytics), which allows the webmaster to
immediately notice that something is going on.
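A much-simplified sketch of the server-side check follows. In a real scheme the token derivation is buried in obfuscated JavaScript that runs in the browser; here a shared HMAC key stands in for that hidden logic, purely to illustrate how the server validates that the page's script actually ran.

```python
import hashlib
import hmac

# Stand-in for the hidden derivation logic: in a real deployment this
# computation would be buried in obfuscated JavaScript on the page,
# and the scraper that skips the script cannot produce the token.
SECRET = b"server-side-secret"  # hypothetical key for this sketch


def expected_token(session_id, path):
    """Token the client-side script is supposed to derive for this request."""
    msg = f"{session_id}:{path}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def is_valid_request(session_id, path, token):
    """Server-side check: reject requests whose token was not
    computed by the page's JavaScript."""
    return hmac.compare_digest(expected_token(session_id, path), token)
```

A request arriving without a valid token is then treated as automated traffic.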
Frequent update of the page structure
One of the most effective ways to protect a
website against automatic scraping is to change
its structure frequently. This can apply not only
to changing the names of HTML element
identifiers and classes, but even to the entire
hierarchy. This makes writing a scraper very
complicated, although it clutters the website
code and, sometimes, the entire system as well.
Bypass (Frequent update of the page structure)
To bypass protection like this a more flexible
and “intelligent” scraper is required, or the
scraper must simply be corrected manually when
these changes occur.
Selenium will also help in developing such a
scraper.
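One way to make a scraper more tolerant of such changes is to try several known class names instead of one. The sketch below uses only the standard-library HTML parser; the class names in the fallback list are hypothetical examples of names a site might have used over time.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of elements whose class matches any candidate name."""

    def __init__(self, candidates):
        super().__init__()
        self.candidates = set(candidates)
        self._depth = 0       # >0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or self.candidates.intersection(classes):
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.texts.append(data.strip())


def extract_prices(html):
    # Hypothetical fallback list: class names the site has used over time.
    parser = ClassTextExtractor(["price", "p-value", "cost"])
    parser.feed(html)
    return [t for t in parser.texts if t]
```

When the site renames `price` to `cost`, the scraper keeps working without a code change; only genuinely new names require a manual correction.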
Limitation of the frequency of requests and
downloadable data allowance
This makes scraping large amounts of data very
slow and therefore impractical. At the same time
the restrictions must be set with the needs of a
common user in mind, so that they do not reduce
the overall usability of the website.
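A common way to implement such a limit is a per-client token bucket, which permits the short bursts a common user generates but caps the sustained rate a scraper needs. A minimal sketch, with illustrative rate and burst values:

```python
import time


class TokenBucket:
    """Per-client throttle: allows short bursts but caps the sustained rate."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, cost=1, now=None):
        """Charge `cost` tokens; return False if the client must wait."""
        now = time.monotonic() if now is None else now
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The same structure works for a download allowance if `cost` is set to the size of the response instead of 1.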
Bypass (Limitation of the frequency of
requests and downloadable data allowance)
It might be bypassed by accessing the
website from different IP addresses or accounts
(simulating multiple users).
Multiple VPS servers will also help.
Mapping the important data as images
This method of content protection makes
automatic data collection more complicated
while maintaining visual access for ordinary
users. Images often replace e-mail addresses
and phone numbers, but some websites even
manage to replace random letters in the text.
Although nothing prevents displaying the
content of a website in graphic form (e.g.,
using Flash or HTML5), it can significantly hurt
search-engine indexing.
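The server-side substitution can be sketched as a simple rewrite pass over the page. The `/text-image/` endpoint here is hypothetical: it stands for a handler that looks up the stored address and renders it as a picture.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def protect_emails(html):
    """Replace each e-mail address with an <img> tag pointing at a
    (hypothetical) endpoint that renders the address as an image."""
    def to_img(match):
        addr = match.group(0)
        # The opaque id hides the address itself; the server would map
        # it back to the stored text when rendering the image.
        addr_id = hashlib.sha1(addr.encode()).hexdigest()[:12]
        return f'<img src="/text-image/{addr_id}" alt="e-mail address">'
    return EMAIL_RE.sub(to_img, html)
```

A human visitor still sees the address as a picture, while a text scraper finds only an opaque image reference.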
Bypass (Mapping the important data as images)
It is hard to bypass this protection, as some
automatic or manual image recognition is
required, similar to that used for solving
CAPTCHAs.