Class WebcrawlerConnector.DocumentURLFilter
- java.lang.Object
  - org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.DocumentURLFilter
- Enclosing class:
- WebcrawlerConnector
protected class WebcrawlerConnector.DocumentURLFilter extends java.lang.Object
This class describes the URL filtering information (for crawling and indexing) obtained from a digested DocumentSpecification.
-
-
Field Summary
Fields
Modifier and Type | Field | Description
protected WebcrawlerConnector.CanonicalizationPolicies | canonicalizationPolicies | Canonicalization policies
protected java.util.List<java.util.regex.Pattern> | excludeContentIndexPatterns | List of content exclusion patterns
protected java.util.List<java.util.regex.Pattern> | excludeIndexPatterns | The arraylist of index exclude patterns
protected java.util.List<java.util.regex.Pattern> | excludePatterns | The arraylist of exclude patterns
protected java.util.List<java.util.regex.Pattern> | includeIndexPatterns | The arraylist of index include patterns
protected java.util.List<java.util.regex.Pattern> | includePatterns | The arraylist of include patterns
protected WebcrawlerConnector.MappingRules | mappings | Mapping rules
protected java.util.Set<java.lang.String> | seedHosts | The set of seed hosts, used to limit URLs if non-null
protected java.lang.String | versionString | The version string
-
Constructor Summary
Constructors
Constructor | Description
DocumentURLFilter(org.apache.manifoldcf.core.interfaces.Specification spec) | Process a document specification to produce a filter.
-
Method Summary
Modifier and Type | Method | Description
protected java.lang.String | findSpecifiedContent(java.lang.String currentURI, java.util.List<java.util.regex.Pattern> patterns) |
WebcrawlerConnector.CanonicalizationPolicies | getCanonicalizationPolicies() | Get canonicalization policies
java.lang.String | getVersionString() | Get whatever contribution to the version string should come from this data.
boolean | isDocumentAndHostLegal(java.lang.String url, org.apache.manifoldcf.crawler.interfaces.IHistoryActivity activities) | Check if both a document and host are legal.
boolean | isDocumentContentIndexable(java.lang.String documentIdentifier) |
java.lang.String | isDocumentIndexable(java.lang.String url, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities) | Check if the document identifier is indexable, and return the indexing URL if found.
boolean | isDocumentLegal(java.lang.String url, org.apache.manifoldcf.crawler.interfaces.IHistoryActivity activities) | Check if the document identifier is legal.
boolean | isHostLegal(java.lang.String host) | Check if a host is legal.
-
-
-
Field Detail
-
versionString
protected java.lang.String versionString
The version string
-
mappings
protected final WebcrawlerConnector.MappingRules mappings
Mapping rules
-
includePatterns
protected final java.util.List<java.util.regex.Pattern> includePatterns
The arraylist of include patterns
-
excludePatterns
protected final java.util.List<java.util.regex.Pattern> excludePatterns
The arraylist of exclude patterns
-
includeIndexPatterns
protected final java.util.List<java.util.regex.Pattern> includeIndexPatterns
The arraylist of index include patterns
-
excludeIndexPatterns
protected final java.util.List<java.util.regex.Pattern> excludeIndexPatterns
The arraylist of index exclude patterns
-
seedHosts
protected java.util.Set<java.lang.String> seedHosts
The set of seed hosts, used to limit URLs if non-null
-
excludeContentIndexPatterns
protected final java.util.List<java.util.regex.Pattern> excludeContentIndexPatterns
List of content exclusion patterns
-
canonicalizationPolicies
protected final WebcrawlerConnector.CanonicalizationPolicies canonicalizationPolicies
Canonicalization policies
-
-
Constructor Detail
-
DocumentURLFilter
public DocumentURLFilter(org.apache.manifoldcf.core.interfaces.Specification spec) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a document specification to produce a filter. Note that we EXPECT the regexps in the document specification to be properly formed. This should be checked at save time to prevent errors. Any syntax errors found here will thus cause the include or exclude regexp to be skipped.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
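The skip-on-syntax-error behavior described for the constructor can be illustrated with a minimal, self-contained sketch. This is not the connector's actual code: the class name PatternListSketch and the helper compileSkippingInvalid are hypothetical, and the sketch assumes the regexps arrive as plain strings rather than as nodes of a Specification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PatternListSketch {
    // Hypothetical helper (not part of ManifoldCF): compiles each regexp
    // and silently skips any that has a syntax error.
    public static List<Pattern> compileSkippingInvalid(List<String> regexps) {
        List<Pattern> compiled = new ArrayList<>();
        for (String regexp : regexps) {
            try {
                compiled.add(Pattern.compile(regexp));
            } catch (PatternSyntaxException e) {
                // Malformed regexp: leave it out of the resulting list.
            }
        }
        return compiled;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = compileSkippingInvalid(
            Arrays.asList("^https?://example\\.com/", "([unclosed", "\\.pdf$"));
        // The middle entry is malformed, so only two patterns survive.
        System.out.println(patterns.size()); // prints 2
    }
}
```

One consequence of skipping is that a malformed include pattern quietly narrows the crawl, while a malformed exclude pattern quietly widens it, which is why the note above recommends validating regexps at save time.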
-
-
Method Detail
-
getVersionString
public java.lang.String getVersionString()
Get whatever contribution to the version string should come from this data.
-
isDocumentAndHostLegal
public boolean isDocumentAndHostLegal(java.lang.String url, org.apache.manifoldcf.crawler.interfaces.IHistoryActivity activities) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Check if both a document and host are legal.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
isHostLegal
public boolean isHostLegal(java.lang.String host)
Check if a host is legal.
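Read together with the seedHosts field description ("to limit urls by, if non-null"), a host check could plausibly look like the sketch below. This is an assumption about the behavior, not the connector's implementation: the class SeedHostSketch is hypothetical, the set is passed as a parameter instead of being read from a field, and any host normalization the real code may perform is omitted.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SeedHostSketch {
    // Assumed semantics: a null seed-host set means "no host restriction";
    // otherwise the host must be a member of the set.
    public static boolean isHostLegal(String host, Set<String> seedHosts) {
        if (seedHosts == null) {
            return true;
        }
        return seedHosts.contains(host);
    }

    public static void main(String[] args) {
        Set<String> seeds = new HashSet<>(Arrays.asList("example.com"));
        System.out.println(isHostLegal("example.com", seeds)); // prints true
        System.out.println(isHostLegal("other.org", seeds));   // prints false
        System.out.println(isHostLegal("other.org", null));    // prints true
    }
}
```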
-
isDocumentLegal
public boolean isDocumentLegal(java.lang.String url, org.apache.manifoldcf.crawler.interfaces.IHistoryActivity activities) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Check if the document identifier is legal.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
isDocumentIndexable
public java.lang.String isDocumentIndexable(java.lang.String url, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Check if the document identifier is indexable, and return the indexing URL if found.
- Returns:
- null if the URL doesn't match or should not be ingested, or the new string if it does.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
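The null-or-URL return contract can be sketched as follows. This is a simplified illustration under stated assumptions, not the connector's logic: the class IndexableSketch and its methods are hypothetical, a URL is assumed to be indexable when it matches at least one include pattern and no exclude pattern, matching is assumed to use Matcher.find(), and the URL is returned unchanged rather than rewritten into a "new string".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

public class IndexableSketch {
    // Returns true if any pattern is found somewhere in the URL.
    public static boolean matchesAny(String url, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    // Assumed rule: include must match and exclude must not; otherwise null.
    // The real method may return a rewritten URL; this sketch returns it as-is.
    public static String indexingUrlOrNull(String url,
                                           List<Pattern> includeIndexPatterns,
                                           List<Pattern> excludeIndexPatterns) {
        if (!matchesAny(url, includeIndexPatterns)) {
            return null;
        }
        if (matchesAny(url, excludeIndexPatterns)) {
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        List<Pattern> include = Arrays.asList(Pattern.compile("^https://example\\.com/"));
        List<Pattern> exclude = Collections.singletonList(Pattern.compile("\\.css$"));
        // Callers are expected to null-check before ingesting.
        String a = indexingUrlOrNull("https://example.com/page.html", include, exclude);
        String b = indexingUrlOrNull("https://example.com/site.css", include, exclude);
        System.out.println(a); // prints https://example.com/page.html
        System.out.println(b); // prints null
    }
}
```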
-
getCanonicalizationPolicies
public WebcrawlerConnector.CanonicalizationPolicies getCanonicalizationPolicies()
Get canonicalization policies
-
isDocumentContentIndexable
public boolean isDocumentContentIndexable(java.lang.String documentIdentifier) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
findSpecifiedContent
protected java.lang.String findSpecifiedContent(java.lang.String currentURI, java.util.List<java.util.regex.Pattern> patterns) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
-