
Web Crawling- Scraping Ajax Sites

2,117 views

Published on

Challenges with crawling AJAX pages on the web and their solutions.

Published in: Technology
  • How do you deal with JavaScript Invocation Graphs and Hot Call Conjunctions Queries? These, I feel, are some of the real AJAX crawl problems.

  1. Scraping AJAX Pages: Big Data made small
  2. What’s AJAX on a web page? (1) Filters, (2) Load more results, (3) Forms, and others.
  3. GET vs. POST. GET: the client sends its parameters to the server in the URL, e.g. http://example.com?date=20140410. POST: the client requests http://example.com and sends a payload (form data, JSON strings, query parameters, view states, etc.).
  4. What makes crawling AJAX difficult?
  5. Challenge 1: JavaScript calls. Solution: emulate JavaScript calls using headless browsers. [Figure: data fetched from under JavaScript code]
  6. Challenge 2: Fetch bandwidths. Solution: optimize fetch limits. [Figure: incomplete page fetched because of low fetch age. Image credit: ticketmaster.com]
  7. Challenge 3: .NET architectures. Solution: track states, pass event validations, restore states for mitigation. [Figure: Viewstate]
  8. Challenge 4: Page encoding. Solution: send the request (content type, media type, accept field parameters) and parse responses in the same format as expected by the server.
  9. Use case: crawl ticketing sites
  10. Thank you! Have specific queries on AJAX crawling? Reach out to info@promptcloud.com.
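
The GET vs. POST distinction from slide 3, together with the content-type point from slide 8, might be sketched as follows using only Python's standard library. This is a minimal illustration, not the deck's own code: the helper names (`build_get`, `build_form_post`, `build_json_post`) are hypothetical, the requests are only constructed and never sent, and the `date` parameter is simply reused from the slide's example URL.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_get(base_url, params):
    """GET: the parameters travel in the URL's query string."""
    return Request(f"{base_url}?{urlencode(params)}", method="GET")


def build_form_post(base_url, fields):
    """POST with form data: the parameters travel in the request body."""
    body = urlencode(fields).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"}
    return Request(base_url, data=body, headers=headers, method="POST")


def build_json_post(base_url, payload):
    """POST with a JSON string: declare the body format and the format we accept back."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return Request(base_url, data=body, headers=headers, method="POST")


# Illustrative values only; nothing is sent over the network here.
get_req = build_get("http://example.com", {"date": "20140410"})
form_req = build_form_post("http://example.com", {"date": "20140410"})
json_req = build_json_post("http://example.com", {"date": "20140410"})

for req in (get_req, form_req, json_req):
    print(req.get_method(), req.full_url, req.data)
```

A crawler would pass one of these objects to `urllib.request.urlopen` and then decode the response according to the `Content-Type` the server returns. Which of the three shapes a given AJAX endpoint expects has to be read from the site's own network traffic.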

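Slide 7's "track states, pass event validations" step could look roughly like the sketch below, assuming the target is an ASP.NET Web Forms page that carries its state in hidden inputs such as `__VIEWSTATE` and `__EVENTVALIDATION`. The sample HTML, its field values, and the control name `ctl00$NextPage` are made up for illustration, and `HiddenFieldParser` and `build_postback` are hypothetical helpers rather than anything shown in the deck.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, event_argument=""):
    """Echo the page's hidden state back and name the control that 'fired'."""
    parser = HiddenFieldParser()
    parser.feed(html)
    fields = dict(parser.fields)
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields).encode("utf-8")


# Made-up page fragment; real view state values are much longer.
SAMPLE_PAGE = """
<form method="post" action="./results.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="/wEWAgL+" />
  <input type="text" name="q" />
</form>
"""

body = build_postback(SAMPLE_PAGE, "ctl00$NextPage")
print(body.decode("utf-8"))
```

The resulting bytes would be sent as the body of a form-encoded POST to the same page. Because each response carries fresh state values, the hidden fields need to be re-extracted after every postback rather than cached.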