Generic Crawler
CRAWL THE WORLD!
Content
• Introduction to Generic Crawler
• The Infrastructure
• Introduction to Crawler Rule
• Crawl Procedure
• Data Fact Sheet
• Limitation
• Future Work
Introduction to Generic Crawler
• Information is not only on social media
• Some websites do not provide an API
Proposed Solution
• Multi-purpose crawler
• Rule-based crawler
• The power of the cloud
The Infrastructure
Introduction to Crawler Rule
• XPath or CSS expression
• Tree data structure
• Depth-First Search algorithm
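A minimal sketch of how a rule tree could be walked depth-first, under stated assumptions: the `RULE` dictionary layout, the field names, and the sample page are all hypothetical and not taken from the crawler itself. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree`, so the markup must be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule tree: each node names a field, gives a path expression,
# and may hold child rules that are evaluated relative to each match.
RULE = {
    "name": "article",
    "path": ".//div[@class='detail_content']",
    "children": [
        {"name": "title", "path": "./h1", "children": []},
        {"name": "body", "path": "./p", "children": []},
    ],
}

def apply_rule(node, rule, out=None):
    """Depth-first traversal: for every element a rule matches, descend
    into the child rules before moving on to the next match."""
    if out is None:
        out = []
    for match in node.findall(rule["path"]):
        if not rule["children"]:
            # Leaf rule: record the field name and the element's plain text.
            out.append((rule["name"], "".join(match.itertext()).strip()))
        for child in rule["children"]:
            apply_rule(match, child, out)
    return out

# Made-up, well-formed sample page.
PAGE = """<html><body>
<div class="detail_content"><h1>Headline</h1><p>First <b>bold</b> paragraph.</p><p>Second.</p></div>
</body></html>"""

results = apply_rule(ET.fromstring(PAGE), RULE)
```

Real pages are rarely well-formed XML, so a production crawler would likely need an HTML-tolerant parser instead.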
XPath
• Search for HTML tags
• Basic string and array functions
• Text extraction (remove HTML tags)
Introduction to Crawler Rule
XPath: //div[@class='detail_content']
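As an illustration of the expression above, the following sketch selects the `detail_content` block and strips its tags. It is an assumption-laden example: the sample markup is invented, and `xml.etree.ElementTree` only supports a subset of XPath, so the path is written relative (`.//div[...]`) and the input must be well-formed. XPath string functions are not available in this subset; a full XPath 1.0 engine such as lxml would be needed for those.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; only the class name comes from the slide.
HTML = """<html><body>
<div class="header">Site name</div>
<div class="detail_content"><p>Jakarta: <b>Breaking</b> news   text.</p></div>
</body></html>"""

def extract_text(markup, path=".//div[@class='detail_content']"):
    """Return the plain text of every element matched by `path`."""
    root = ET.fromstring(markup)
    texts = []
    for element in root.findall(path):
        # itertext() yields only text nodes, which drops the HTML tags.
        raw = "".join(element.itertext())
        # Collapse runs of whitespace left behind by the markup.
        texts.append(" ".join(raw.split()))
    return texts

content = extract_text(HTML)
```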
Crawl Procedure
Link Generation
• Schedule Auto Runner task
• Schedule Auto Pusher task
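The slides do not describe how links are generated, so the following is only one plausible sketch: collect anchors from a source page, resolve them against the page URL, keep those on the same host, and skip links already seen. The names `LinkCollector` and `generate_links` and the same-host filter are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

def generate_links(base_url, html, seen):
    """Return links on the same host as base_url that are not in `seen`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    new_links = []
    for link in collector.links:
        if urlparse(link).netloc == host and link not in seen:
            seen.add(link)  # remember it so later runs skip it
            new_links.append(link)
    return new_links

# Invented source page for the demonstration.
SOURCE_HTML = (
    '<a href="/news/1">One</a>'
    '<a href="https://other.org/x">External</a>'
    '<a href="/news/1">Duplicate</a>'
    '<a href="news/2">Two</a>'
)
seen_links = set()
first_run = generate_links("https://example.com/", SOURCE_HTML, seen_links)
second_run = generate_links("https://example.com/", SOURCE_HTML, seen_links)
```

A scheduled runner task could call a function like this once per source on each tick.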
Crawl
• Crawl based on the generated links
• Save the crawled data to local DB
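The crawl step above can be sketched as a loop that fetches each link and writes the result to a local database. This is an illustration only: SQLite, the `pages` table, and the injected `fetch` function are assumptions, and `fake_fetch` stands in for a real HTTP client so the example runs without network access.

```python
import sqlite3

def crawl(links, fetch, conn):
    """Fetch each link and store its content in a local SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT)"
    )
    saved = 0
    for url in links:
        try:
            content = fetch(url)
        except Exception:
            # Skip links that cannot be fetched; a real crawler would log this.
            continue
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)",
            (url, content),
        )
        saved += 1
    conn.commit()
    return saved

# Invented pages used in place of real HTTP responses.
SAMPLE_PAGES = {
    "https://example.com/news/1": "First article text",
    "https://example.com/news/2": "Second article text",
}

def fake_fetch(url):
    return SAMPLE_PAGES[url]  # raises KeyError for unknown URLs

local_db = sqlite3.connect(":memory:")
saved_count = crawl(
    ["https://example.com/news/1", "https://example.com/missing",
     "https://example.com/news/2"],
    fake_fetch,
    local_db,
)
```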
On-Demand
Central DB
Pusher
• Keyword Matching
• Push to Central DB
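A hedged sketch of the pusher step: read crawled pages from the local database, keep those containing any of the keywords, and copy them to the central database. Both databases are modeled as in-memory SQLite connections here, and the table names, the case-insensitive substring match, and the sample rows are assumptions rather than details from the slides.

```python
import sqlite3

def push_matching(local, central, keywords):
    """Copy pages whose content contains any keyword to the central DB."""
    central.execute(
        "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, content TEXT)"
    )
    lowered = [k.lower() for k in keywords]
    matched = 0
    rows = local.execute("SELECT url, content FROM pages").fetchall()
    for url, content in rows:
        text = content.lower()
        if any(k in text for k in lowered):
            # INSERT OR IGNORE keeps repeated pushes from duplicating rows.
            central.execute(
                "INSERT OR IGNORE INTO articles (url, content) VALUES (?, ?)",
                (url, content),
            )
            matched += 1
    central.commit()
    return matched

# Invented local data for the demonstration.
local = sqlite3.connect(":memory:")
local.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT)")
local.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("https://example.com/a", "Court ruling on election dispute"),
        ("https://example.com/b", "Weekend football results"),
    ],
)
central = sqlite3.connect(":memory:")
matched_count = push_matching(local, central, ["Election"])
```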
Data Fact Sheet
Average Crawling Time: 15s
* Based on 1,000 links
New Links Generation Time: 3/min
* From 5 sources
Limitation
• AJAX websites
• Depends on the rules
• High CPU and bandwidth demand
• robots.txt
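One way a crawler can respect robots.txt is to check each URL before fetching it. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `GenericCrawler` user agent string, and the URLs are invented for illustration, and the file is parsed from a string so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for the demonstration.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

public_ok = is_allowed(ROBOTS_TXT, "GenericCrawler", "https://example.com/news/1")
private_ok = is_allowed(ROBOTS_TXT, "GenericCrawler", "https://example.com/private/x")
```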
Links
Viva.co.id 724
Detik.com 418
Beritajatim.com 120
Hukumonline.com 13
* Last update: 27 January 2016 – 16:00
Future Work
• Input URL to scrape
• Scheduler for Auto Crawl
• Crawler Health Monitoring System
THANK YOU
GENERIC CRAWLER
