Heritrix DecideRules
Roger G. Coram
Web Crawl Engineer

Presentation for the IIPC Technical Training Workshop 2015 #iipctech15.


  1. Heritrix DecideRules. Roger G. Coram, Web Crawl Engineer.
  2. DecideRuleSequence
     A list of DecideRules:
       • Processed in order
       • *Every rule is processed
       • Result will be:
           • ACCEPT: URI is ruled in scope
           • REJECT: URI is ruled out of scope
           • PASS: DecideRule has no effect
     *DecideRules can have an onlyDecision method to skip processing if they can’t change the outcome.
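
The evaluation order described on this slide can be modelled in a few lines. This is a minimal Python sketch, not Heritrix code: the Rule class, run_sequence, and the two rules AcceptUkHosts and RejectDeepPaths are hypothetical names invented for illustration, and the sketch assumes that the last rule to return something other than PASS determines the result.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple
from urllib.parse import urlsplit

ACCEPT, REJECT, PASS = "ACCEPT", "REJECT", "PASS"


@dataclass
class Rule:
    """Hypothetical stand-in for a DecideRule (not the Heritrix class)."""
    name: str
    decide: Callable[[str], str]         # returns ACCEPT, REJECT or PASS
    only_decision: Optional[str] = None  # the one non-PASS result this rule can give


def run_sequence(rules, uri) -> Tuple[str, Optional[Tuple[int, str]]]:
    """Apply rules in order; assume the last non-PASS decision wins."""
    result, decisive = PASS, None
    for index, rule in enumerate(rules):
        # Skip a rule whose only possible decision is already the current result.
        if rule.only_decision == result:
            continue
        decision = rule.decide(uri)
        if decision != PASS:
            result, decisive = decision, (index, rule.name)
    return result, decisive


rules = [
    Rule("RejectDecideRule", lambda uri: REJECT, only_decision=REJECT),
    Rule("AcceptUkHosts",
         lambda uri: ACCEPT if (urlsplit(uri).hostname or "").endswith(".uk") else PASS,
         only_decision=ACCEPT),
    Rule("RejectDeepPaths",
         lambda uri: REJECT if urlsplit(uri).path.count("/") > 15 else PASS,
         only_decision=REJECT),
]

print(run_sequence(rules, "http://www.bl.uk/"))
print(run_sequence(rules, "http://example.com/"))
```

For the second URI the third rule is never evaluated: the running result is already REJECT, which is the only decision that rule could return.
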
  3. UK Domain Crawl 2014
       • RejectDecideRule: REJECT everything by default
       • SurtPrefixedDecideRule: ACCEPT .uk, london, etc.
       • MatchesRegexDecideRule: ACCEPT; try to capture media files
       • HopsPathMatchesRegexDecideRule: ACCEPT anything embedded on a seed; ^E*$
       • *ExternalGeoLocationDecideRule: ACCEPT IP addresses in GB
       • *OnDomainsDecideRule: ACCEPT specific domains; disabled by default
       • *HopsPathMatchesRegexDecideRule: ACCEPT redirects from seeds; ^R+$
       • CompressibilityDecideRule: REJECT highly (in)compressible URIs; experimental
       • *TooManyHopsDecideRule: REJECT URIs more than 20 hops from a seed
       • *MatchesListRegexDecideRule: REJECT specific patterns
       • PathologicalPathDecideRule: REJECT URIs with more than 3 recurrences of a pattern
       • TooManyPathSegmentsDecideRule: REJECT URIs with more than 15 path segments
       • SurtPrefixedDecideRule: ACCEPT URIs matching a list of URL-shortening services
       • SurtPrefixedDecideRule: REJECT a list of SURTs from a file (exclude.txt)
       • PrerequisiteAcceptDecideRule: ACCEPT prerequisites
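
Two of the rule types above rely on string matching that is easy to try out in isolation. The Python sketch below is an illustration only and does not use Heritrix: to_surt is a simplified, hypothetical conversion to SURT form (reversed, comma-separated host labels), the prefix list is an assumed rendering of ".uk, london", and the hop-path strings assume one letter per hop (for example E for an embed, R for a redirect, L for a link).

```python
import re
from urllib.parse import urlsplit


def to_surt(uri: str) -> str:
    """Simplified SURT form: host labels reversed and comma-separated."""
    parts = urlsplit(uri)
    labels = (parts.hostname or "").split(".")
    return f"{parts.scheme}://({','.join(reversed(labels))},){parts.path or '/'}"


# Assumed SURT prefixes for ".uk" and ".london" hosts (illustrative only).
ACCEPTED_PREFIXES = ["http://(uk,", "http://(london,"]


def surt_prefixed(uri: str) -> bool:
    """True if the URI's SURT form starts with one of the accepted prefixes."""
    surt = to_surt(uri)
    return any(surt.startswith(prefix) for prefix in ACCEPTED_PREFIXES)


# The two hop-path patterns quoted on the slide.
EMBEDS_ONLY = re.compile(r"^E*$")     # zero or more embed hops
REDIRECTS_ONLY = re.compile(r"^R+$")  # one or more redirect hops

print(to_surt("http://www.bl.uk/collection"))
print(surt_prefixed("http://example.london/"), surt_prefixed("http://example.com/"))
for hops in ["", "E", "EE", "LE", "R", "RR", "RL"]:
    print(repr(hops), bool(EMBEDS_ONLY.match(hops)), bool(REDIRECTS_ONLY.match(hops)))
```

Note that ^E*$ also matches the empty string, so a URI with no hops at all satisfies it, whereas ^R+$ needs at least one redirect.
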
  4. UK Domain Crawl 2014
     Basic flow:
       • REJECT everything.
       • Look for reasons to ACCEPT content.
       • REJECT anything you absolutely do not want.
     Those marked with a ‘*’ are specified using Spring’s <ref bean…/> syntax. This facilitates changing their values dynamically or with Sheets.
     Those struck out are disabled by default (and typically enabled using Sheets for specific sites).
     Experimental DecideRules:
       • ExternalGeoLocationDecideRule: found 2,544,426 new hosts.
       • CompressibilityDecideRule: REJECTed 1,650,861 URIs.
  5. Beyond Scoping
     All Processors have a shouldProcessRule property: you can use DecideRules instead of simple true/false values.
     We use this to filter viral content:
       <bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
         <property name="shouldProcessRule">
           <bean class="org.archive.modules.deciderules.DecideRuleSequence">
             <property name="rules">
               <list>
                 <bean class="org.archive.modules.deciderules.AcceptDecideRule"/>
                 <bean class="uk.bl.wap.modules.deciderules.AnnotationMatchesListRegexDecideRule">
                   <property name="decision" value="REJECT"/>
                   <property name="regexList">
                     <list>
                       <value>^.*stream:.+FOUND.*$</value>
                     </list>
                   </property>
                 </bean>
               </list>
             </property>
           </bean>
         </property>
       </bean>
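
The effect of that two-rule sequence (ACCEPT everything, then REJECT on a matching annotation) can be checked outside the crawler. This Python sketch only exercises the regular expression from the slide; should_write is a hypothetical helper, and the annotation strings are invented examples shaped like a virus scanner's "stream: <signature> FOUND" result, not output captured from a real crawl.

```python
import re

# The pattern from the slide's regexList.
VIRUS_FOUND = re.compile(r"^.*stream:.+FOUND.*$")


def should_write(annotations) -> bool:
    """Model of the sequence: ACCEPT by default, REJECT if any annotation matches."""
    decision = "ACCEPT"
    if any(VIRUS_FOUND.match(annotation) for annotation in annotations):
        decision = "REJECT"
    return decision == "ACCEPT"


# Invented annotation values, for illustration only.
print(should_write(["stream: Eicar-Test-Signature FOUND"]))
print(should_write(["stream: OK"]))
print(should_write([]))
```
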
  6. logging.properties
     In the logging.properties file:
       org.archive.modules.deciderules.DecideRuleSequence.level=FINEST
     This will output the decision of every DecideRule for every URI.
     This will generate a lot of log entries. Only practical for small crawls or testing.
  7. scope.log
     Alternatively, a recent addition to the DecideRuleSequence:
       <property name="logToFile" value="true" />
     This will create a file, scope.log, containing the final decision for every URI along with the specific rule which made that decision:
       2014-11-05T10:17:39.790Z 4 ExternalGeoLocationDecideRule ACCEPT http://www.jaymoy.com/
       2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT https://t.co/Sz15mxnvtQ
       2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT http://twitter.com/2017Hull
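
A log in this shape is straightforward to summarise. The Python sketch below tallies decisions per rule using only the three sample lines from the slide; the tally function is a hypothetical helper, and the field layout (timestamp, a number, rule name, decision, URI, separated by whitespace) is inferred from those samples. The number appears to be the zero-based position of the deciding rule in the sequence, which is an assumption, although it agrees with the rule order listed for the UK Domain Crawl.

```python
from collections import Counter

# The three sample lines shown on the slide.
SAMPLE = """\
2014-11-05T10:17:39.790Z 4 ExternalGeoLocationDecideRule ACCEPT http://www.jaymoy.com/
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT https://t.co/Sz15mxnvtQ
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT http://twitter.com/2017Hull
"""


def tally(lines) -> Counter:
    """Count (rule, decision) pairs; lines without five fields are skipped."""
    counts = Counter()
    for line in lines:
        fields = line.split(maxsplit=4)
        if len(fields) != 5:
            continue
        _timestamp, _index, rule, decision, _uri = fields
        counts[(rule, decision)] += 1
    return counts


counts = tally(SAMPLE.splitlines())
for (rule, decision), n in counts.most_common():
    print(f"{n:6d}  {decision:6s}  {rule}")
```

To run it against a real file, pass an open file handle instead of SAMPLE.splitlines(), since tally only needs an iterable of lines.
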
  8. List of DecideRules
       • AcceptDecideRule
       • ContentLengthDecideRule
       • ContentTypeMatchesRegexDecideRule
       • ContentTypeNotMatchesRegexDecideRule
       • ExternalGeoLocationDecideRule
       • FetchStatusDecideRule
       • FetchStatusMatchesRegexDecideRule
       • FetchStatusNotMatchesRegexDecideRule
       • HasViaDecideRule
       • HopCrossesAssignmentLevelDomainDecideRule
       • HopsPathMatchesRegexDecideRule
       • IpAddressSetDecideRule
       • MatchesFilePatternDecideRule
       • MatchesListRegexDecideRule
       • MatchesRegexDecideRule
       • MatchesStatusCodeDecideRule
       • NotMatchesFilePatternDecideRule
       • NotMatchesListRegexDecideRule
       • NotMatchesRegexDecideRule
       • NotMatchesStatusCodeDecideRule
       • PathologicalPathDecideRule
       • PredicatedDecideRule
       • PrerequisiteAcceptDecideRule
       • RejectDecideRule
       • ResourceLongerThanDecideRule
       • ResourceNoLongerThanDecideRule
       • ResponseContentLengthDecideRule
       • SchemeNotInSetDecideRule
       • ScriptedDecideRule
       • SeedAcceptDecideRule
       • TooManyHopsDecideRule
       • TooManyPathSegmentsDecideRule
       • TransclusionDecideRule
       • ViaSurtPrefixedDecideRule
       • IdenticalDigestDecideRule
       • NotOnDomainsDecideRule
       • NotOnHostsDecideRule
       • NotSurtPrefixedDecideRule
       • OnDomainsDecideRule
       • OnHostsDecideRule
       • SurtPrefixedDecideRule
