Crawler plug-ins
Crawler plug-ins are Java™ programs, written against the crawler application programming interfaces (APIs), that you can use to change the content or metadata of crawled documents.
Data source crawler plug-ins
You can apply business and security rules to enforce document-level security and add, update, or delete the crawled metadata and document content that is associated with documents in an index. The data source crawler plug-in APIs cannot be used with the web crawler.
You can also create a plug-in that extracts entries from archive files. The extracted files can then be parsed individually and included in collections. These plug-ins are supported by the following crawlers:
- Agent for Windows file systems crawler
- BoardReader crawler
- Case Manager crawler
- Exchange Server crawler
- FileNet P8 crawler
- SharePoint crawler
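As a rough sketch of the document-update hook that a data source crawler plug-in provides, the following self-contained example uses stand-in types (`CrawledDoc`, `CrawlerPlugin`, and the `updateDocument` method are illustrative names, not the product's real plug-in classes; consult the Javadoc documentation shipped with the product for the actual API). It shows a plug-in that applies a business rule to each crawled document: drafts are excluded from the index, and a metadata field is added to everything else.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataPluginSketch {

    /** Minimal stand-in for a crawled document as exposed to a plug-in. */
    static class CrawledDoc {
        final Map<String, String> metadata = new HashMap<>();
        byte[] content;
        boolean deleted; // marking a document deleted drops it from the index
    }

    /** Minimal stand-in for the data source crawler plug-in contract. */
    interface CrawlerPlugin {
        /** Return the (possibly modified) document. */
        CrawledDoc updateDocument(CrawledDoc doc);
    }

    /** Example rule: tag every document with a department; drop drafts. */
    static class TaggingPlugin implements CrawlerPlugin {
        @Override
        public CrawledDoc updateDocument(CrawledDoc doc) {
            if ("draft".equals(doc.metadata.get("status"))) {
                doc.deleted = true; // enforce a business rule: no drafts
                return doc;
            }
            doc.metadata.put("department", "engineering"); // add metadata
            return doc;
        }
    }

    public static void main(String[] args) {
        CrawledDoc doc = new CrawledDoc();
        doc.metadata.put("status", "final");
        CrawledDoc out = new TaggingPlugin().updateDocument(doc);
        System.out.println(out.metadata.get("department")); // engineering
    }
}
```

The same hook is where document-level security rules would go, for example by setting a security token in the document's metadata before it is indexed.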
Web crawler plug-ins
You can add fields to the HTTP request header that is sent to the origin server to request a document. You can also view the content, security tokens, and metadata of a document after the document is downloaded. You can add to, delete from, or replace any of these fields, or stop the document from being parsed.
Web crawler plug-ins support two kinds of filtering: prefetch and postparse. You can specify only one Java class as the web crawler plug-in. However, because the prefetch and postparse behaviors are defined in two separate Java interfaces, and because a Java class can implement any number of interfaces, that single class can implement either behavior or both.
- Prefetch plug-in
- A prefetch plug-in is called before the crawler downloads a document. Your plug-in is given the document URL, the fetch method, the HTTP version, and the HTTP request header. Your plug-in can use these elements to decide whether to modify the request header (for example, to add cookies) or even to cancel the download.
- Postparse plug-in
- A postparse plug-in is called after each download attempt. By the time the plug-in is called, the crawler has downloaded and parsed the target content. The plug-in is given the document URL, the metadata that the crawler extracted from various sources, and the document's content. The plug-in can decide whether to alter any of these items and whether to save the document's content as it existed before parsing.
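To make the two behaviors concrete, here is a hypothetical sketch of a single plug-in class that implements both. The interface and method names (`PrefetchPlugin`, `PostparsePlugin`, `beforeFetch`, `afterParse`) are illustrative stand-ins, not the product's real web crawler plug-in interfaces; the Javadoc documentation shipped with the product describes the actual signatures. The prefetch side adds a cookie to the request header and cancels some downloads; the postparse side adds a metadata field.

```java
import java.util.HashMap;
import java.util.Map;

public class WebCrawlerPluginSketch {

    /** Stand-in: called before the crawler downloads a document. */
    interface PrefetchPlugin {
        /** Return false to cancel the download. */
        boolean beforeFetch(String url, Map<String, String> requestHeader);
    }

    /** Stand-in: called after a download attempt, once the content is parsed. */
    interface PostparsePlugin {
        /** Return false to stop further processing of the document. */
        boolean afterParse(String url, Map<String, String> metadata, byte[] content);
    }

    /** One class can implement both behaviors, since they are separate interfaces. */
    static class CookieAndTagPlugin implements PrefetchPlugin, PostparsePlugin {
        @Override
        public boolean beforeFetch(String url, Map<String, String> requestHeader) {
            requestHeader.put("Cookie", "session=abc123"); // modify the request header
            return !url.endsWith(".zip");                  // cancel archive downloads
        }

        @Override
        public boolean afterParse(String url, Map<String, String> metadata, byte[] content) {
            // add a metadata field derived from the URL
            metadata.put("source-host", url.replaceFirst("^https?://([^/]+).*$", "$1"));
            return true; // keep the document
        }
    }

    public static void main(String[] args) {
        CookieAndTagPlugin plugin = new CookieAndTagPlugin();
        Map<String, String> header = new HashMap<>();
        boolean fetch = plugin.beforeFetch("http://example.com/index.html", header);
        System.out.println(fetch + " " + header.get("Cookie")); // true session=abc123
    }
}
```

Because both methods live on one class, registering that single class with the crawler is enough to get both the prefetch and the postparse behavior.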
Javadoc documentation for crawler plug-ins
For detailed information about each plug-in API, see the Javadoc documentation in the following directory: ES_INSTALL_ROOT/docs/api/.