crawler Module

Module to store crawler classes.

More than one class can be defined here, but only one (the one specified in the configuration file) is used by the client to instantiate a crawler object, whose crawl() method is called to collect each resource received.
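
As a rough illustration of the selection described above, the sketch below picks one class by name and calls its crawl() method. It is not the client's actual code: the stand-in BaseCrawler, the EchoCrawler class, the AVAILABLE_CRAWLERS registry, the build_crawler helper, and the "class" configuration key are all hypothetical names introduced for this example.

```python
# Hypothetical stand-in for crawler.BaseCrawler; the real class may differ.
class BaseCrawler:
    def __init__(self, configurationsDictionary):
        self.config = configurationsDictionary

    def crawl(self, resourceID, filters):
        raise NotImplementedError("crawl() must be overridden")


# Hypothetical crawler that only echoes the resource ID back.
class EchoCrawler(BaseCrawler):
    def crawl(self, resourceID, filters):
        return ({"id": resourceID}, None, None)


# Assumed lookup table standing in for "the classes defined in the module".
AVAILABLE_CRAWLERS = {"EchoCrawler": EchoCrawler}


def build_crawler(crawlerConfig):
    """Instantiate the single crawler class named in the configuration.

    The "class" key is an assumption made for this sketch.
    """
    crawlerClass = AVAILABLE_CRAWLERS[crawlerConfig["class"]]
    return crawlerClass(crawlerConfig)


crawlerObject = build_crawler({"class": "EchoCrawler"})
result = crawlerObject.crawl("resource-1", [])
```

Other classes may sit in the same module, but only the one named in the configuration is ever instantiated.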

class crawler.BaseCrawler(configurationsDictionary)

Abstract class. All crawlers should inherit from it, either directly or through another class that inherits from it.

__init__(configurationsDictionary)

Constructor.

Upon initialization, the crawler object receives everything in the crawler section of the XML configuration file as the parameter configurationsDictionary.

_extractConfig(configurationsDictionary)

Extract and store configurations.

If some configuration needs any kind of pre-processing, it is done here. Extend this method if you need to pre-process custom configuration options.
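
A minimal sketch of extending _extractConfig() in a subclass follows. It assumes that the constructor calls _extractConfig() and that the base method stores the dictionary in an attribute named config; neither detail is confirmed here, so BaseCrawler below is only a stand-in. The TimeoutCrawler class and the timeout and verbose options are hypothetical.

```python
# Stand-in for crawler.BaseCrawler. Assumption: the constructor delegates to
# _extractConfig(), which stores the dictionary as self.config.
class BaseCrawler:
    def __init__(self, configurationsDictionary):
        self._extractConfig(configurationsDictionary)

    def _extractConfig(self, configurationsDictionary):
        self.config = configurationsDictionary


# Hypothetical crawler with two custom options that need pre-processing.
class TimeoutCrawler(BaseCrawler):
    def _extractConfig(self, configurationsDictionary):
        # Let the base class handle the common options first.
        super()._extractConfig(configurationsDictionary)
        # Assumption: values read from the XML file arrive as strings,
        # so they are converted once here instead of on every crawl() call.
        self.timeout = float(self.config.get("timeout", "30"))
        self.verbose = str(self.config.get("verbose", "false")).lower() == "true"


crawlerObject = TimeoutCrawler({"timeout": "2.5", "verbose": "True"})
```

Calling the parent method first keeps any pre-processing done by the base class intact while the subclass adds its own.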

crawl(resourceID, filters)

Collect the resource.

Must be overridden.

Args:
  • resourceID (user defined type): ID of the resource to be collected, sent by the server.
  • filters (list): All data (if any) generated by the filters added to the server. Data from sequential filters comes first, in the same order in which the filters were specified in the configuration file. Data from parallel filters comes next, in no guaranteed order.
Returns:

A tuple in the format (resourceInfo, extraInfo, newResources). Any element of the tuple can be None, depending on what the user desires.

  • resourceInfo (dict): Resource information dictionary, used to update resource information at the server side. This information is user defined and must be understood by the persistence handler used.
  • extraInfo (dict): Additional information. This information is passed to all filters via their callback() method and is not used by the server itself.
  • newResources (list): Resources to be stored by the server when the feedback option is enabled. Each new resource is described by a tuple in the format (resourceID, resourceInfo), where the first element is the resource ID (whose type is defined by the user) and the second element is a dictionary containing resource information (in a format understood by the persistence handler used).
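
The return format described above can be sketched with a small crawl() override. This is an illustrative example only: BaseCrawler is a stand-in, and the NeighborCrawler class, the integer resource IDs, and the dictionary keys are hypothetical, since the real contents depend on the persistence handler and filters in use.

```python
# Stand-in for crawler.BaseCrawler; the real class may differ.
class BaseCrawler:
    def __init__(self, configurationsDictionary):
        self.config = configurationsDictionary

    def crawl(self, resourceID, filters):
        raise NotImplementedError("crawl() must be overridden")


# Hypothetical crawler that treats resource IDs as integers.
class NeighborCrawler(BaseCrawler):
    def crawl(self, resourceID, filters):
        # filters holds whatever data the server-side filters produced;
        # it may be empty. Here it is only counted, not interpreted.
        resourceInfo = {"status": "collected", "filterCount": len(filters)}

        # Passed on to the filters' callback(); not used by the server.
        extraInfo = {"note": "collected by NeighborCrawler"}

        # Each new resource is a (resourceID, resourceInfo) tuple. The server
        # stores these only when the feedback option is enabled.
        newResources = [(resourceID + 1, {"discoveredFrom": resourceID})]

        return (resourceInfo, extraInfo, newResources)


crawlerObject = NeighborCrawler({})
resourceInfo, extraInfo, newResources = crawlerObject.crawl(10, [{"lang": "en"}])
```

Any of the three elements could be replaced by None, for example when the crawler discovers no new resources.
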
class crawler.DemoCrawler(configurationsDictionary)

Bases: crawler.BaseCrawler

Example crawler, just for demonstration.
