Heritrix REST API
Roger G. Coram
Web Crawl Engineer

Heritrix API
URL structure mimics that of the interface:
• https://___.bl.uk:8443/engine/
• https://___.bl.uk:8444/engine/job/daily-0900
Actions are POSTed to those URLs along with relevant parameters.
Any client supporting HTTPS can use the API, e.g. curl.
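As a minimal sketch of what POSTing an action looks like from a shell, the helper below wraps curl. The function name heritrix_post and the HERITRIX_USER / HERITRIX_PASS variables are hypothetical and not part of Heritrix; the curl flags are the ones used elsewhere in these slides.

```shell
# Hypothetical helper: POST a single action to a Heritrix engine or job URL.
# HERITRIX_USER and HERITRIX_PASS are assumed to be set by the caller.
heritrix_post() {
    url=$1      # e.g. https://localhost:8443/engine/job/daily-0900
    action=$2   # e.g. pause
    # -k          accept a certificate curl cannot verify
    # --anyauth   let curl pick the authentication scheme the server offers
    # --location  follow any redirect returned in response to the POST
    curl -k -u "$HERITRIX_USER:$HERITRIX_PASS" --anyauth --location \
         -d "action=$action" "$url"
}
```

It could then be called as: heritrix_post "https://$HOST:8443/engine/job/daily-0900" pause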

Actions
Possible actions:
• create
• add
• build
• launch
• rescan
• pause
• unpause
• terminate
• teardown
• checkpoint
• execute
• submit
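Because an action is just a string parameter, a typo is easy to send by accident. One way to guard against that, sketched here with a hypothetical is_known_action function, is to check the name against the list above before calling curl. The list is taken verbatim from this slide; whether each name is passed as an "action" value or maps to its own URL is not checked here.

```shell
# Hypothetical guard: succeed only for action names listed on the slide.
is_known_action() {
    case $1 in
        create|add|build|launch|rescan|pause|unpause|terminate|teardown|checkpoint|execute|submit)
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```

For example, is_known_action "$ACTION" || exit 1 would stop a wrapper script before any request is made.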

BL Use Case
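Our normal workflow is:
1. Check for an existing job.
• If one exists, pause, terminate and tear it down:
curl -k -u $USER:$PASS -d "action=pause" \
  --anyauth --location https://$HOST:8443/engine/job/daily-0900
curl -k -u $USER:$PASS -d "action=terminate" \
  --anyauth --location https://$HOST:8443/engine/job/daily-0900
curl -k -u $USER:$PASS -d "action=teardown" \
  --anyauth --location https://$HOST:8443/engine/job/daily-0900
2. Copy the relevant profile, seeds, etc. into the job directory.
3. Build and launch:
curl -k -u $USER:$PASS -d "action=build" \
  --anyauth --location https://$HOST:8443/engine/job/daily-0900
curl -k -u $USER:$PASS -d "action=launch" \
  --anyauth --location https://$HOST:8443/engine/job/daily-0900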
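The three steps above can be strung together in one script. The following is only a sketch under stated assumptions: the slides do not say how an existing job is detected, so this version simply treats the presence of the job directory as the signal. JOB_URL, JOB_DIR, PROFILE_DIR, HERITRIX_USER, HERITRIX_PASS and both function names are hypothetical.

```shell
# Hypothetical locations; adapt to the local installation.
JOB_URL=${JOB_URL:-https://localhost:8443/engine/job/daily-0900}
JOB_DIR=${JOB_DIR:-/heritrix/jobs/daily-0900}
PROFILE_DIR=${PROFILE_DIR:-/heritrix/profiles/daily}

# POST one action to the job URL, discarding the response body.
# Note: curl fails here only on transport errors; HTTP error statuses
# are not inspected in this sketch.
post_action() {
    curl -k -sS -u "$HERITRIX_USER:$HERITRIX_PASS" --anyauth --location \
         -o /dev/null -d "action=$1" "$JOB_URL"
}

restart_job() {
    # 1. Assumption: an existing job directory means a job may be loaded.
    if [ -d "$JOB_DIR" ]; then
        for action in pause terminate teardown; do
            post_action "$action" || return 1
        done
    fi
    # 2. Copy the profile, seeds, etc. into the job directory.
    mkdir -p "$JOB_DIR" && cp -R "$PROFILE_DIR"/. "$JOB_DIR"/ || return 1
    # 3. Build, then launch.
    post_action build && post_action launch
}
```

Calling restart_job from cron shortly before 09:00 would match the job name used on the slide, though the actual scheduling is not described in the source.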

BL Use Case
We also apply bespoke settings via Sheets, either in crawler-beans.cxml or with a script sent through the API:
SCRIPT='appCtx.getBean("sheetOverlaysManager").addSurtAssociation("http://(uk,bl,", "higherLimit");'
curl -k -u $USER:$PASS -d "action=script&engine=beanshell&script=$SCRIPT" \
  --anyauth --location https://$HOST:8443/engine/job/daily-0900/script
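Since the script travels as a form parameter, characters such as & or + in a SURT or sheet name would be misread unless encoded. A possible variant, sketched below, builds the BeanShell line from two arguments and lets curl encode it with --data-urlencode. The function names, JOB_URL, HERITRIX_USER and HERITRIX_PASS are hypothetical; the /script path on the job URL is assumed from the Heritrix 3.x API guide.

```shell
# Hypothetical default; adapt to the local installation.
JOB_URL=${JOB_URL:-https://localhost:8443/engine/job/daily-0900}

# Build the BeanShell statement that associates a SURT prefix with a sheet.
build_sheet_script() {
    # $1 = SURT prefix, e.g. 'http://(uk,bl,'   $2 = sheet name
    printf 'appCtx.getBean("sheetOverlaysManager").addSurtAssociation("%s", "%s");' "$1" "$2"
}

# Send the statement to the job's script endpoint.
associate_sheet() {
    script=$(build_sheet_script "$1" "$2")
    # --data-urlencode percent-encodes the value after "script=";
    # curl joins it to the plain -d field with "&".
    curl -k -u "$HERITRIX_USER:$HERITRIX_PASS" --anyauth --location \
         -d "engine=beanshell" --data-urlencode "script=$script" \
         "$JOB_URL/script"
}
```

The example from the slide would then read: associate_sheet 'http://(uk,bl,' higherLimit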

Documentation
Fully documented here:
• https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+3.x+API+Guide
