Heritrix DecideRules
Roger G. Coram
Web Crawl Engineer
DecideRuleSequence
A list of DecideRules:
• Processed in order
• *Every rule is processed
• Result will be:
• ACCEPT: URI is ruled in scope
• REJECT: URI is ruled out of scope
• PASS: DecideRule has no effect
*A DecideRule can implement an onlyDecision method so that it is skipped when it cannot change the outcome.
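The evaluation order described above can be sketched in a few lines of Java. This is a simplified illustration, not Heritrix's own code: the class DecideSequenceSketch, the Rule interface and the contains helper are hypothetical stand-ins, and the skip test assumes that a rule is bypassed when the only decision it could make equals the current result.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of DecideRuleSequence semantics; hypothetical types, not Heritrix classes. */
final class DecideSequenceSketch {

    enum Decision { ACCEPT, REJECT, PASS }

    /** Hypothetical stand-in for a DecideRule. */
    interface Rule {
        Decision decide(String uri);

        /** The only non-PASS decision this rule can make, or null if it varies. */
        default Decision onlyDecision() {
            return null;
        }
    }

    /** Builds a rule that returns the given decision when the URI contains needle, else PASS. */
    static Rule contains(final String needle, final Decision decision) {
        return new Rule() {
            @Override
            public Decision decide(String uri) {
                return uri.contains(needle) ? decision : Decision.PASS;
            }

            @Override
            public Decision onlyDecision() {
                return decision;
            }
        };
    }

    /** Visits every rule in order; the last non-PASS decision wins. */
    static Decision decide(List<Rule> rules, String uri) {
        Decision result = Decision.PASS;
        for (Rule rule : rules) {
            // Assumed shortcut: skip a rule whose only possible decision
            // is already the current result.
            if (rule.onlyDecision() == result) {
                continue;
            }
            Decision d = rule.decide(uri);
            if (d != Decision.PASS) {
                result = d;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
            contains("", Decision.REJECT),           // REJECT everything by default
            contains(".uk/", Decision.ACCEPT),       // ACCEPT .uk hosts
            contains("/calendar/", Decision.REJECT)  // REJECT a hypothetical crawler trap
        );
        System.out.println(decide(rules, "http://www.example.co.uk/"));
        System.out.println(decide(rules, "http://www.example.co.uk/calendar/2014"));
        System.out.println(decide(rules, "http://www.example.com/"));
    }
}
```

Because later rules override earlier ones, the order of the list matters as much as the rules themselves.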
UK Domain Crawl 2014
Rule                              Decision  Notes
RejectDecideRule                  REJECT    everything by default
SurtPrefixedDecideRule            ACCEPT    .uk, london, etc.
MatchesRegexDecideRule            ACCEPT    try to capture media files
HopsPathMatchesRegexDecideRule    ACCEPT    anything embedded on a seed; ^E*$
*ExternalGeoLocationDecideRule    ACCEPT    IP addresses in GB
*OnDomainsDecideRule              ACCEPT    specific domains; disabled by default
*HopsPathMatchesRegexDecideRule   ACCEPT    redirects from seeds; ^R+$
CompressibilityDecideRule         REJECT    highly (in)compressible URIs; experimental
*TooManyHopsDecideRule            REJECT    URIs more than 20 hops from a seed
*MatchesListRegexDecideRule       REJECT    specific patterns
PathologicalPathDecideRule        REJECT    URIs with more than 3 recurrences of a pattern
TooManyPathSegmentsDecideRule     REJECT    URIs with more than 15 path segments
SurtPrefixedDecideRule            ACCEPT    URIs matching a list of URL-shortening services
SurtPrefixedDecideRule            REJECT    a list of SURTs from a file, exclude.txt
PrerequisiteAcceptDecideRule      ACCEPT    prerequisites
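The two hops-path regexes in the list are easier to read with examples. Heritrix records the route from a seed to a URI as a string with one letter per hop (for instance L for a link, E for an embed, R for a redirect). The sketch below applies the two patterns quoted above to a few sample hops paths; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Sketch: testing hops-path strings against the two regexes quoted in the rule list. */
final class HopsPathSketch {

    // Zero or more embed hops; this also matches the empty hops path of a seed itself.
    static final Pattern EMBEDS_ONLY = Pattern.compile("^E*$");

    // One or more redirect hops and nothing else.
    static final Pattern REDIRECTS_ONLY = Pattern.compile("^R+$");

    static boolean isEmbedOfSeed(String hopsPath) {
        return EMBEDS_ONLY.matcher(hopsPath).matches();
    }

    static boolean isRedirectFromSeed(String hopsPath) {
        return REDIRECTS_ONLY.matcher(hopsPath).matches();
    }

    public static void main(String[] args) {
        for (String hops : new String[] {"", "E", "EE", "R", "RR", "LE", "ER"}) {
            System.out.println("'" + hops + "' embed=" + isEmbedOfSeed(hops)
                + " redirect=" + isRedirectFromSeed(hops));
        }
    }
}
```

A path such as LE (an embed on a page one link away from the seed) matches neither pattern, so these two rules leave its fate to the other rules in the sequence.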
UK Domain Crawl 2014
Basic flow:
• REJECT everything.
• Look for reasons to ACCEPT content.
• REJECT anything you absolutely do not want.
Those marked with a ‘*’ are specified using Spring’s <ref bean…/> syntax. This
facilitates changing their values dynamically or with Sheets.
Those struck out are disabled by default (and typically enabled using Sheets for
specific sites).
Experimental DecideRules:
• ExternalGeoLocationDecideRule: found 2,544,426 new hosts.
• CompressibilityDecideRule: REJECTed 1,650,861 URIs.
Beyond Scoping
All Processors have a shouldProcessRule property—you can use DecideRules
instead of simple true/false values.
We use this to filter viral content:
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
  <property name="shouldProcessRule">
    <bean class="org.archive.modules.deciderules.DecideRuleSequence">
      <property name="rules">
        <list>
          <!-- Write every record by default... -->
          <bean class="org.archive.modules.deciderules.AcceptDecideRule"/>
          <!-- ...but REJECT any whose annotations match the pattern below. -->
          <bean class="uk.bl.wap.modules.deciderules.AnnotationMatchesListRegexDecideRule">
            <property name="decision" value="REJECT"/>
            <property name="regexList">
              <list>
                <value>^.*stream:.+FOUND.*$</value>
              </list>
            </property>
          </bean>
        </list>
      </property>
    </bean>
  </property>
</bean>
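To show what the regex in this configuration would catch, the sketch below applies it to a list of annotation strings. It rests on several assumptions: each annotation is tested on its own, the sample "stream: ... FOUND" annotation is an illustrative virus-scanner style message rather than one taken from a real crawl, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch: deciding whether to write a record based on its annotations. */
final class ViralAnnotationSketch {

    // The pattern from the regexList in the configuration above.
    static final Pattern VIRUS_FOUND = Pattern.compile("^.*stream:.+FOUND.*$");

    /**
     * Mirrors the ACCEPT-then-REJECT sequence: write by default,
     * but refuse if any annotation matches the pattern.
     */
    static boolean shouldWrite(List<String> annotations) {
        for (String annotation : annotations) {
            if (VIRUS_FOUND.matcher(annotation).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Both annotation lists are made up for illustration.
        List<String> clean = Arrays.asList("exampleAnnotation");
        List<String> flagged = Arrays.asList("exampleAnnotation",
            "stream: Eicar-Test-Signature FOUND");
        System.out.println("clean   -> write? " + shouldWrite(clean));
        System.out.println("flagged -> write? " + shouldWrite(flagged));
    }
}
```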
logging.properties
In the logging.properties file:
org.archive.modules.deciderules.DecideRuleSequence.level=FINEST
This logs the decision of every DecideRule for every URI.
It generates a very large number of log entries, so it is only practical for
small crawls or testing.
scope.log
Alternatively, a recent addition to the DecideRuleSequence:
<property name="logToFile" value="true" />
This will create a file, scope.log, containing the final decision for every
URI along with the specific rule which made that decision:
2014-11-05T10:17:39.790Z 4 ExternalGeoLocationDecideRule ACCEPT http://www.jaymoy.com/
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT https://t.co/Sz15mxnvtQ
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT http://twitter.com/2017Hull
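A scope.log in this format is straightforward to summarise. The sketch below splits each line into five whitespace-separated fields and counts final decisions per rule. The field names are assumptions read off the sample lines: in particular, the second field appears to be the zero-based position of the deciding rule in the sequence. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: parsing scope.log lines and tallying decisions per rule. */
final class ScopeLogSketch {

    /** One parsed log line; field meanings are inferred from the sample output. */
    static final class Entry {
        final String timestamp;
        final int ruleIndex;
        final String rule;
        final String decision;
        final String uri;

        Entry(String timestamp, int ruleIndex, String rule, String decision, String uri) {
            this.timestamp = timestamp;
            this.ruleIndex = ruleIndex;
            this.rule = rule;
            this.decision = decision;
            this.uri = uri;
        }
    }

    static Entry parse(String line) {
        // Limit the split to 5 parts so the URI stays in one piece.
        String[] f = line.trim().split("\\s+", 5);
        if (f.length != 5) {
            throw new IllegalArgumentException("Unexpected scope.log line: " + line);
        }
        return new Entry(f[0], Integer.parseInt(f[1]), f[2], f[3], f[4]);
    }

    /** Counts lines per "rule decision" pair, sorted by key. */
    static Map<String, Integer> countByRule(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Entry e = parse(line);
            counts.merge(e.rule + " " + e.decision, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2014-11-05T10:17:39.790Z 4 ExternalGeoLocationDecideRule ACCEPT http://www.jaymoy.com/",
            "2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT https://t.co/Sz15mxnvtQ",
            "2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT http://twitter.com/2017Hull");
        System.out.println(countByRule(lines));
    }
}
```

A tally like this gives a quick view of which rules are actually deciding the outcome for most URIs.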
List of DecideRules
AcceptDecideRule
ContentLengthDecideRule
ContentTypeMatchesRegexDecideRule
ContentTypeNotMatchesRegexDecideRule
ExternalGeoLocationDecideRule
FetchStatusDecideRule
FetchStatusMatchesRegexDecideRule
FetchStatusNotMatchesRegexDecideRule
HasViaDecideRule
HopCrossesAssignmentLevelDomainDecideRule
HopsPathMatchesRegexDecideRule
IpAddressSetDecideRule
MatchesFilePatternDecideRule
MatchesListRegexDecideRule
MatchesRegexDecideRule
MatchesStatusCodeDecideRule
NotMatchesFilePatternDecideRule
NotMatchesListRegexDecideRule
NotMatchesRegexDecideRule
NotMatchesStatusCodeDecideRule
PathologicalPathDecideRule
PredicatedDecideRule
PrerequisiteAcceptDecideRule
RejectDecideRule
ResourceLongerThanDecideRule
ResourceNoLongerThanDecideRule
ResponseContentLengthDecideRule
SchemeNotInSetDecideRule
ScriptedDecideRule
SeedAcceptDecideRule
TooManyHopsDecideRule
TooManyPathSegmentsDecideRule
TransclusionDecideRule
ViaSurtPrefixedDecideRule
IdenticalDigestDecideRule
NotOnDomainsDecideRule
NotOnHostsDecideRule
NotSurtPrefixedDecideRule
OnDomainsDecideRule
OnHostsDecideRule
SurtPrefixedDecideRule