6. OWASP
Data Scraping (crawling) Risks
Scrapers take for free what a company has
spent large sums to develop, causing loss of
revenue and loss of customer confidence in a
brand. This is theft of digital property and an
attack on the uniqueness of online brands.
Traditional network security devices such as
firewalls, intrusion detection and prevention
systems, or even application-layer firewalls
cannot detect or block scrapers, because
sophisticated scraping tools mimic the request
patterns of real users.
Anti-Crawling
Analyze the frequency of requests to the
server; based on that analysis, you can pick
one or more of the following techniques:
IP-address ban
The easiest and most common way to detect
website-scraping attempts is to analyze the frequency
of requests to the server. If requests from a certain IP
address arrive too often or in too great a number, the
address can be blocked, and the user is often asked to
solve a CAPTCHA to unblock it.
The most important part of this protection method is
finding the boundary between the normal frequency and
number of requests and actual scraping attempts, so that
ordinary users are not blocked. This boundary is usually
determined by analyzing the behavior of ordinary users.
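This frequency analysis can be sketched as a sliding-window counter per IP address. The threshold below (100 requests per 60 seconds) is a hypothetical value; in practice it should come from observing ordinary users' behavior, as described above.

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds -- tune these against real user behavior.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 100

_recent = defaultdict(deque)  # ip -> timestamps of recent requests

def is_suspicious(ip, now=None):
    """Record one request from `ip` and return True if the IP exceeded
    the threshold inside the sliding window (a candidate for a
    CAPTCHA challenge or an outright block)."""
    now = time.time() if now is None else now
    window = _recent[ip]
    window.append(now)
    # Drop timestamps that have fallen out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS_PER_WINDOW
```

An IP that stays under the threshold is never flagged, so ordinary users pass through unaffected.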
Bypass (IP-address ban)
One may bypass this protection by using multiple
proxies to hide the scraper's real IP address.
Don't use your real IP address even for the first
request.
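A minimal sketch of round-robin proxy rotation; the proxy addresses are hypothetical placeholders, and the returned dict follows the shape that, for example, the Python `requests` library accepts for its `proxies` argument:

```python
from itertools import cycle

# Hypothetical proxy pool -- in practice, addresses of real HTTP
# proxies that hide the scraper's IP.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

_proxy_pool = cycle(PROXIES)

def next_proxy():
    """Pick the next proxy in round-robin order, so consecutive
    requests appear to come from different IP addresses."""
    proxy = next(_proxy_pool)
    return {"http": proxy, "https": proxy}
```

Each request then uses `next_proxy()` so that no single IP address accumulates enough requests to trip the frequency threshold.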
CAPTCHA
This is another popular way of protecting data from web
scraping. The user is asked to type the CAPTCHA text to
get access to the website. The significant disadvantage of
this method is the inconvenience to regular users forced
to enter CAPTCHAs. It is therefore mostly applicable in
systems where data is accessed infrequently and via
individual requests.
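The challenge/response flow can be sketched as follows. A real CAPTCHA renders distorted text as an image; this simplified, assumed version uses a plain arithmetic question only to show the server-side flow:

```python
import random

def make_challenge(rng=random):
    """Generate a simple challenge and its expected answer.
    (Real CAPTCHAs render a distorted image instead of plain text.)"""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return f"What is {a} + {b}?", str(a + b)

def check_answer(expected, submitted):
    """Server-side verification of the user's response."""
    return submitted.strip() == expected
```

The server stores the expected answer in the session and only unblocks the client once `check_answer` succeeds.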
Bypass (CAPTCHA)
Many web services and browser extensions allow you
to bypass CAPTCHAs.
Most CAPTCHA-cracking services are commercial.
Using different accounts
With this protection method the data can be
accessed by authorized users only. This simplifies
monitoring users' behavior and blocking
suspicious accounts, regardless of the IP
address the client is working from.
You can't always use this approach, since
requiring accounts may cost you many customers.
Bypass (Using different accounts)
This protection can be bypassed by creating a
set of accounts, including automatically
created ones. There are services that sell
accounts on well-known social networks. Verifying
an account by phone (a so-called PVA, Phone
Verified Account) to check its authenticity makes
automatic account creation considerably harder,
although it can be bypassed using disposable
SIM cards.
Create your own bulk account generator.
Usage of complex JavaScript logic
In this case the browser sends a special code (or
several codes) with its request to the server, and
the codes are computed by complex logic written in
JavaScript. The code is often obfuscated, and
the logic is placed in one or more loadable
JavaScript files.
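One way such a scheme can work, sketched from the server's side: assume the page's (obfuscated) JavaScript computes an HMAC token over a server-issued nonce using a secret hidden in the script, and the server rejects requests without a valid token. The secret and scheme below are assumptions for illustration, not a standard protocol.

```python
import hashlib
import hmac

# Hypothetical shared secret, assumed to be embedded (obfuscated)
# in the site's JavaScript.
SECRET = b"obfuscated-page-secret"

def expected_token(nonce: str) -> str:
    """The token the page's JavaScript is assumed to compute
    for a server-issued nonce."""
    return hmac.new(SECRET, nonce.encode(), hashlib.sha256).hexdigest()

def verify_request(nonce: str, token: str) -> bool:
    """Server-side check: a plain HTTP client that never executed
    the page's JavaScript cannot supply a valid token."""
    return hmac.compare_digest(expected_token(nonce), token)
```

A scraper that merely fetches HTML never runs the JavaScript, never computes the token, and so fails this check.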
Bypass (Usage of complex JavaScript
logic)
This can be bypassed by scraping with a real
browser (for example, using the Selenium or
Mechanize libraries). But this gives the method an
additional advantage: when executing JavaScript,
the scraper shows up in website traffic analytics
(e.g. Google Analytics), which lets the
webmaster notice immediately that something
is going on.
Frequent update of the page structure
One of the most effective ways to protect a
website against automatic scraping is to change
its structure frequently. This can apply not only
to the names of HTML element identifiers and
classes, but even to the entire hierarchy. It
makes writing a scraper very complicated,
although it also bloats the website code and,
sometimes, burdens the entire system.
Bypass (Frequent update of the page
structure)
To bypass protection like this, a more flexible
and "intelligent" scraper is required, or the
scraper simply has to be corrected manually
whenever these changes occur.
Selenium will also help in developing such a
scraper.
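A sketch of such a more flexible scraper: instead of relying on volatile class names or ids, it anchors on the visible label text preceding the value (here the hypothetical label "Price:"), using only Python's standard-library html.parser.

```python
from html.parser import HTMLParser

class LabelledValueScraper(HTMLParser):
    """Find a value by the visible label that precedes it,
    rather than by class names or ids that change frequently."""

    def __init__(self, label):
        super().__init__()
        self.label = label.rstrip(":")
        self._grab_next = False
        self.value = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._grab_next and self.value is None:
            # The first non-empty text after the label is the value.
            self.value = text
            self._grab_next = False
        elif text.rstrip(":") == self.label:
            self._grab_next = True

def scrape_labelled_value(html, label):
    parser = LabelledValueScraper(label)
    parser.feed(html)
    return parser.value
```

Because it keys on visible text, the same scraper survives a redesign from, say, a span-based layout to a table-based one without any correction.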
Limitation of the frequency of requests and
downloadable data allowance
This makes scraping large amounts of data
very slow and therefore impractical. At the
same time, the restrictions must be tuned to
the needs of a common user, so that they do
not reduce the overall usability of the
site.
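The data allowance part can be sketched as a per-user byte quota; the 10 MiB daily limit below is a hypothetical value that would have to be tuned so ordinary users never hit it:

```python
from collections import defaultdict

# Hypothetical daily allowance -- tune against real user needs.
DAILY_BYTE_ALLOWANCE = 10 * 1024 * 1024  # 10 MiB

_downloaded = defaultdict(int)  # user -> bytes served today

def allow_download(user, size_bytes):
    """Grant the download only while the user stays under the daily
    allowance; bulk scraping then takes many days to complete."""
    if _downloaded[user] + size_bytes > DAILY_BYTE_ALLOWANCE:
        return False
    _downloaded[user] += size_bytes
    return True
```

A periodic job would reset `_downloaded` at the start of each day.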
Bypass (Limitation of the frequency of
requests and downloadable data
allowance)
This can be bypassed by accessing the
website from different IP addresses or accounts
(simulating multiple users).
Multiple VPS servers will also help.
Mapping the important data as images
This method of content protection makes
automatic data collection more complicated
while maintaining visual access for ordinary
users. Images often replace e-mail addresses
and phone numbers, and some websites even
manage to replace random letters in the text.
Although nothing prevents displaying the
content of a website in graphic form (e.g.
using Flash or HTML5), doing so can
significantly hurt indexing by search engines.
Bypass (Mapping the important data as
images)
This protection is hard to bypass, as it requires
some automatic or manual image recognition,
similar to that used in the CAPTCHA case.