Anti (anti crawling) techniques
  • 1. OWASP Anti "Anti-Crawling" Techniques. Ayman Mohammed Mohammed, IBM, 14/06/2014. Copyright © The OWASP Foundation. Permission is granted to copy, distribute and/or modify this document under the terms of the OWASP License.
  • 2. OWASP Why is data important?
  • 3. OWASP Web 2.0
  • 4. OWASP Web 3.0 (semantic web)
  • 5. OWASP Web 3.0 (semantic web)
  • 6. OWASP 6 Data Scraping (Crawling) Risks Scrapers take for free what a company has spent large sums to develop, resulting in loss of revenue and loss of customer confidence in the brand. This is theft of digital property and an attack on the uniqueness of online brands. It is impossible for traditional network security devices such as firewalls, intrusion detection and prevention systems, or even application-layer firewalls to detect or block them, as sophisticated scraping tools mimic user search patterns.
  • 7. OWASP Security Triangle
  • 8. OWASP 8 Anti-Crawling After analyzing the frequency of requests to the server, you can pick one or more of the following techniques based on that analysis.
  • 9. OWASP 9 IP-address ban The easiest and most common way to detect website scraping attempts is to analyze the frequency of requests to the server. If requests from a certain IP address are too frequent or too numerous, the address can be blocked, and the user is often asked to solve a CAPTCHA to unblock it. The most important thing in this protection method is to find the boundary between the normal frequency and volume of requests and scraping attempts, so as not to block ordinary users. Commonly this is determined by analyzing typical user behavior.
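The frequency analysis described above can be sketched as a per-IP sliding window; the thresholds and class name here are illustrative, not from the slides:

```python
from collections import defaultdict, deque

class RateMonitor:
    """Flag IPs whose request count exceeds a threshold in a sliding window."""

    def __init__(self, max_requests=100, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, now):
        """Record one request; return True if the IP should be blocked
        or challenged with a CAPTCHA."""
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_requests
```

Tuning `max_requests` and `window_seconds` against real traffic is exactly the "boundary" problem the slide mentions: set them too low and ordinary users get challenged.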
  • 10. OWASP 10 Bypass (IP-address ban) One may bypass this protection by using multiple proxies to hide the real IP address of the scraper. Don't use your real IP address even in the first attack.
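A minimal proxy-rotation sketch, assuming a pre-collected pool of proxy URLs (the class and its logic are illustrative; the returned value would be handed to the HTTP client, e.g. the `proxies` argument of `requests.get`):

```python
class ProxyRotator:
    """Round-robin over a proxy pool, retiring proxies that get banned."""

    def __init__(self, proxies):
        self.pool = list(proxies)
        self.i = 0

    def next_proxy(self):
        """Return the next proxy in round-robin order."""
        if not self.pool:
            raise RuntimeError("all proxies banned")
        proxy = self.pool[self.i % len(self.pool)]
        self.i += 1
        return proxy

    def retire(self, proxy):
        """Drop a proxy after it receives a ban or CAPTCHA response."""
        if proxy in self.pool:
            self.pool.remove(proxy)
```

Each request then appears to the target server to come from a different address, defeating a purely per-IP threshold.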
  • 11. OWASP 11 CAPTCHA This is also a popular way of protecting data from web scraping. In this case the user is asked to type the CAPTCHA text to get access to the website. The inconvenience to regular users forced to enter CAPTCHAs is the significant disadvantage of this method. Therefore, it is mostly applicable in systems where data is accessed infrequently and upon individual requests.
  • 12. OWASP 12 Bypass (CAPTCHA) Many web services and browser extensions allow you to bypass CAPTCHAs. Most CAPTCHA-cracking services are commercial.
  • 13. OWASP 13 Using different accounts With this protection method the data can be accessed by authorized users only. This simplifies monitoring user behavior and blocking suspicious accounts regardless of the IP address the client is working from. You can't always use this approach, however, as you may lose many customers.
  • 14. OWASP 14 Bypass (Using different accounts) This protection can be bypassed by creating a set of accounts, including automated ones. There are services selling accounts on well-known social networks. Verifying the account by phone (a so-called PVA, Phone Verified Account) to check its authenticity can make automatic account creation considerably harder, although it can be bypassed using disposable SIM cards. Or create your own bulk account generator.
  • 15. OWASP 15 Usage of complex JavaScript logic In this case the browser sends a special code (or several codes) in its request to the server, and the codes are formed by complex logic written in JavaScript. The code is often obfuscated, and the logic is placed in one or more loadable JavaScript files.
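One common shape for such a scheme: the page embeds random operands, the obfuscated JavaScript derives a value from them, and the server accepts a request only when its mirrored logic agrees. A minimal server-side sketch (the formula and function names are illustrative assumptions, not from the slides):

```python
import random

def make_challenge():
    """Random operands the server embeds in the page; the site's
    obfuscated JavaScript computes the answer client-side."""
    return {"a": random.randint(1, 1000), "b": random.randint(1, 1000)}

def expected_answer(ch):
    # Placeholder for the real (normally obfuscated) logic,
    # mirrored on the server so it can verify the client's value.
    return (ch["a"] * 31 + ch["b"]) ^ 0x5A5A

def validate(ch, answer):
    """Reject requests whose client did not actually run the JavaScript."""
    return answer == expected_answer(ch)
```

A plain HTTP scraper that never executes the page's JavaScript cannot produce the expected value, so its requests fail validation.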
  • 16. OWASP 16 Bypass (Usage of complex JavaScript logic) It can be bypassed by scraping with real browsers (for example, using the Selenium or Mechanize libraries). But this gives the method an additional advantage: when executing JavaScript, the scraper will show up in website traffic analytics (e.g. Google Analytics), which allows the webmaster to immediately notice that something is going on.
  • 17. OWASP Crawljax Demo
  • 18. OWASP 18 Frequent update of the page structure One of the most effective ways to protect a website against automatic scraping is to change its structure frequently. This can apply not only to changing the names of HTML element identifiers and classes, but even to the entire hierarchy. This makes writing a scraper very complicated, although it also bloats the website code and, sometimes, the entire system.
  • 19. OWASP 19 Bypass (Frequent update of the page structure) To bypass protection like this, a more flexible and "intelligent" scraper is required, or the scraper must be corrected manually whenever these changes occur. Selenium will also help in developing such a scraper.
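One way to make a scraper resilient to this rotation is to anchor on stable visible text rather than on volatile class names or element hierarchy. A minimal sketch (the "Price:" label and the HTML snippets are hypothetical):

```python
import re

def extract_price(html):
    """Locate a price by the visible label that precedes it, ignoring
    the (frequently rotated) class names, ids, and tag nesting."""
    m = re.search(r"Price:\s*</[^>]+>\s*<[^>]+>\s*\$?([\d.]+)", html)
    return float(m.group(1)) if m else None
```

Because only the human-facing label is matched, the extractor survives a redesign that renames every identifier, which is exactly what this protection relies on breaking.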
  • 20. OWASP 20 Limitation of the frequency of requests and downloadable data allowance This makes scraping large amounts of data very slow and therefore impractical. At the same time, the restrictions must be applied with the needs of a common user in mind, so that they do not reduce the overall usability of the site.
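A common way to implement such an allowance is a token bucket per client; a minimal sketch (the class name and parameters are illustrative):

```python
class DownloadAllowance:
    """Token bucket: a client may consume up to `capacity` bytes at once,
    with the allowance refilled at `rate` bytes per second."""

    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = 0.0

    def allow(self, nbytes, now):
        """Return True and spend tokens if the download fits the allowance."""
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A modest `rate` barely affects a human browsing page by page but throttles a bulk scraper to a crawl, which is the trade-off the slide describes.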
  • 21. OWASP 21 Bypass (Limitation of the frequency of requests and downloadable data allowance) It can be bypassed by accessing the website from different IP addresses or accounts (simulating multiple users). Multiple VPS servers will also help.
  • 22. OWASP 22 Mapping the important data as images This method of content protection makes automatic data collection more complicated while maintaining visual access for ordinary users. Images often replace e-mail addresses and phone numbers, and some websites even manage to replace random letters in the text. Although nothing prevents displaying the content of a website in graphic form (e.g. using Flash or HTML5), it can significantly hurt indexing by search engines.
  • 23. OWASP 23 Bypass (Mapping the important data as images) This protection is hard to bypass, as some automatic or manual image recognition is required, similar to that used for CAPTCHAs.
  • 24. OWASP Questions?
  • 25. OWASP 25 The question is: what's the fastest way to collect Facebook users' info?