Generic Crawler

CRAWL THE WORLD!

Content
• Introduction to Generic Crawler
• The Infrastructure
• Introduction to Crawler Rule
• Crawl Procedure
• Data Fact Sheet
• Limitation
• Future Work

Introduction to Generic Crawler
• Information is not only on social media
• Some websites do not provide an API
Proposed Solution
• Multi-purpose crawler
• Rule-based crawler
• The power of cloud

Introduction to Crawler Rule
• XPATH or CSS Expression
• Tree Data Structure
• Depth-First Search Algorithm
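
The rule model above treats a page as a tree and visits its nodes depth-first. A minimal sketch of that idea, assuming a well-formed snippet parsed with Python's standard `xml.etree.ElementTree`. The sample markup and the `dfs_tags` helper are hypothetical illustrations, not the crawler's actual code:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page. Real-world HTML is often not well-formed XML,
# so a production crawler would likely need a more tolerant HTML parser.
PAGE = """
<html>
  <body>
    <div class="header"><h1>Title</h1></div>
    <div class="detail_content"><p>First</p><p>Second</p></div>
  </body>
</html>
""".strip()


def dfs_tags(node):
    """Yield tag names in depth-first (pre-order) order using an explicit stack."""
    stack = [node]
    while stack:
        current = stack.pop()
        yield current.tag
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(list(current)))


root = ET.fromstring(PAGE)
order = list(dfs_tags(root))
print(order)
```

Each node is visited before its siblings' subtrees, so the whole first `div` is explored before the second one is touched.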
XPATH
• Search HTML Tag
• String and Array basic function
• Text extraction (Remove HTML tag)

Introduction to Crawler Rule
XPATH: //div[@class='detail_content']
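
A rule like the one above can be tried out with a short sketch. It assumes a well-formed page and uses the limited XPath support in Python's standard `xml.etree.ElementTree`, which needs the relative form `.//div[...]` rather than `//div[...]`. The sample page and the `extract_text` helper are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page used only for illustration.
PAGE = """
<html>
  <body>
    <div class="menu">Home</div>
    <div class="detail_content"><p>Hello <b>crawler</b> world.</p></div>
  </body>
</html>
""".strip()


def extract_text(page, rule):
    """Return the tag-free text of the first node matching the rule, or None."""
    root = ET.fromstring(page)
    node = root.find(rule)
    if node is None:
        return None
    # itertext() walks every descendant text node, which drops the tags.
    raw = "".join(node.itertext())
    # Collapse runs of whitespace into single spaces.
    return " ".join(raw.split())


text = extract_text(PAGE, ".//div[@class='detail_content']")
print(text)
```

The same function covers both the tag search and the text extraction steps listed under XPATH; a rule that matches nothing returns `None` instead of raising.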

Crawl Procedure
Link Generation
• Schedule Auto Runner task
• Schedule Auto Pusher task
Crawl
• Crawl based on the generated links
• Save the crawled data to local DB
On-Demand
Central DB
Pusher
• Keyword Matching
• Push to Central DB
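
The three stages above (link generation, crawl to a local DB, keyword-matched push to a central DB) can be sketched end to end. This is only an illustration under stated assumptions: both databases are in-memory SQLite, pages come from a hypothetical `FAKE_PAGES` dictionary instead of HTTP requests, and the table layout, keyword list, and function names are invented for the example:

```python
import sqlite3

# Hypothetical stand-ins: a real crawler would fetch these pages over HTTP.
FAKE_PAGES = {
    "http://example.com/a": "Election results announced today",
    "http://example.com/b": "Local football match report",
}
KEYWORDS = ["election"]  # hypothetical keyword list for the pusher


def make_db():
    """Create an in-memory database with a single pages table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT)")
    return db


def generate_links():
    """Link generation: return the links that should be crawled."""
    return sorted(FAKE_PAGES)


def crawl(links, local_db):
    """Crawl: read each link and save its content to the local DB."""
    for url in links:
        local_db.execute(
            "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)",
            (url, FAKE_PAGES[url]),
        )
    local_db.commit()


def push(local_db, central_db, keywords):
    """Pusher: copy rows that match any keyword to the central DB."""
    pushed = 0
    rows = local_db.execute("SELECT url, content FROM pages").fetchall()
    for url, content in rows:
        if any(k.lower() in content.lower() for k in keywords):
            central_db.execute(
                "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)",
                (url, content),
            )
            pushed += 1
    central_db.commit()
    return pushed


local_db, central_db = make_db(), make_db()
crawl(generate_links(), local_db)
pushed = push(local_db, central_db, KEYWORDS)
print(pushed)
```

Keeping the local and central stores separate means every crawled page is retained locally, while only keyword matches reach the central DB.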

Data Fact Sheet
Average Crawling Time: 15s
* Based on 1,000 links
New Links Generation Time: 3/min
* From 5 sources

Limitation
• AJAX websites
• Depends on the rule
• High CPU and bandwidth demand
• robots.txt
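
For the robots.txt item in the list above, one way a crawler could check a site's rules before fetching is Python's standard `urllib.robotparser`. This is a sketch, not a description of how this crawler behaves: the rules, the `generic-crawler` agent name, and the `allowed` helper are assumptions made for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the target site instead of hard-coding it.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)


def allowed(url, agent="generic-crawler"):
    """Return True if the parsed rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.com/news/1"))
print(allowed("http://example.com/private/x"))
```

Links that fail the check would simply be skipped during the crawl stage.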
Source            Links
Viva.co.id          724
Detik.com           418
Beritajatim.com     120
Hukumonline.com      13
* Last update: 27 January 2016 – 16:00

Future Work
• Input URL to scrape
• Scheduler for Auto Crawl
• Crawler Health Monitoring System
