filters Module

Module to store filter classes.

The filters are applied sequentially, in the same order in which they were specified in the configuration file, unless they were explicitly set as parallel.

class filters.BaseFilter(configurationsDictionary)

Abstract class. All filters should inherit from it or from another class that inherits from it.

__init__(configurationsDictionary)

Constructor.

Each filter receives everything in its corresponding filter section of the XML configuration file as the parameter configurationsDictionary.

_extractConfig(configurationsDictionary)

Extract and store configurations.

If some configuration needs any kind of pre-processing, it is done here. Extend this method if you need to pre-process custom configuration options.
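As an illustration, extending _extractConfig() might look like the sketch below. The BaseFilter class here is only a stand-in so that the example runs on its own, and it assumes that the constructor delegates to _extractConfig() and keeps the options in an attribute named config. LimitFilter and the maxitems option are hypothetical names, not part of the module.

```python
class BaseFilter(object):
    """Stand-in for filters.BaseFilter, just enough to run this sketch."""

    def __init__(self, configurationsDictionary):
        # Assumption: the real constructor delegates to _extractConfig().
        self._extractConfig(configurationsDictionary)

    def _extractConfig(self, configurationsDictionary):
        # Assumption: options are kept in an attribute named "config".
        self.config = dict(configurationsDictionary)


class LimitFilter(BaseFilter):
    """Hypothetical filter with one custom option, 'maxitems'."""

    def _extractConfig(self, configurationsDictionary):
        # Let the base class store the raw options first.
        BaseFilter._extractConfig(self, configurationsDictionary)
        # Values read from an XML file usually arrive as strings,
        # so convert them once here instead of on every call.
        self.config["maxitems"] = int(self.config.get("maxitems", "10"))


limitFilter = LimitFilter({"name": "limit", "maxitems": "25"})
defaultFilter = LimitFilter({"name": "limit"})
print(limitFilter.config["maxitems"], defaultFilter.config["maxitems"])
```

Doing the conversion in _extractConfig() keeps apply() and callback() free of parsing code, since both can rely on the option already having the right type.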

setup()

Execute per client initialization procedures.

This method is called every time a connection to a new client is opened, allowing initialization code to be executed on a per-client basis. This differs from __init__(), which is called when the server instantiates the filter, i.e., only once for the whole period of execution of the program.

apply(resourceID, resourceInfo, extraInfo)

Process resource information before it is sent to a client.

Args:
  • resourceID (user defined type): ID of the resource to be collected, sent by the server.
  • extraInfo (dict): Reference to a dictionary that can be used to pass information among sequential filters. It is not sent to clients and its value will always be None if the filter is executed in parallel.
Returns:
A dictionary containing the desired filter information to be sent to clients.
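
The following sketch shows one way two filters could cooperate through extraInfo in apply(). The BaseFilter class is a stand-in so the example is self-contained. TimestampFilter, PriorityFilter, the history option and the dictionary keys are all hypothetical. Since resourceInfo is not described in the argument list above, the sketch passes None for it and ignores it.

```python
class BaseFilter(object):
    """Stand-in for filters.BaseFilter, just enough to run this sketch."""

    def __init__(self, configurationsDictionary):
        self.config = dict(configurationsDictionary)

    def apply(self, resourceID, resourceInfo, extraInfo):
        return {}


class TimestampFilter(BaseFilter):
    """Hypothetical filter: tells the client when the resource was last crawled."""

    def apply(self, resourceID, resourceInfo, extraInfo):
        lastCrawled = self.config["history"].get(resourceID)
        # extraInfo is None when the filter runs in parallel, so guard before writing.
        if extraInfo is not None:
            extraInfo["lastCrawled"] = lastCrawled
        return {"lastCrawled": lastCrawled}


class PriorityFilter(BaseFilter):
    """Hypothetical filter: reads what an earlier sequential filter left behind."""

    def apply(self, resourceID, resourceInfo, extraInfo):
        seenBefore = bool(extraInfo) and extraInfo.get("lastCrawled") is not None
        return {"priority": "low" if seenBefore else "high"}


first = TimestampFilter({"history": {"user42": "2014-01-01"}})
second = PriorityFilter({})

# Simulated sequential run: both filters share the same extraInfo dictionary.
sharedInfo = {}
timestampData = first.apply("user42", None, sharedInfo)
priorityData = second.apply("user42", None, sharedInfo)

# Simulated parallel run: extraInfo is None, so nothing can be shared.
parallelData = second.apply("user42", None, None)
print(timestampData, priorityData, parallelData)
```

Checking extraInfo for None lets the same filter class be configured as either sequential or parallel without raising an error.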

callback(resourceID, resourceInfo, newResources, extraInfo)

Process information sent by clients after a resource has been crawled.

Args:
  • resourceID (user defined type): ID of the crawled resource.
  • resourceInfo (dict): Resource information dictionary sent by client. Sequential filters receive this parameter as reference, so they can alter its value, but parallel filters receive just a copy of it. The server will store the final value of resourceInfo as it is after all filters were called back.
  • newResources (list): List of new resources sent by client to be stored by the server. Sequential filters receive this parameter as reference, so they can alter its value, but parallel filters receive just a copy of it. The server will store the final value of newResources as it is after all filters were called back.
  • extraInfo (dict): Dictionary that contains information sent by the client to filters. Sequential filters receive this parameter as reference, so they can alter its value, but parallel filters receive just a copy of it. As in apply(), extraInfo can also be used to pass information among sequential filters (in the case of sequential filters, the original information received from the crawler is stored in extraInfo["original"], so it is available at any time). This information is not used by the server.
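
The difference between receiving the parameters by reference and receiving copies can be sketched as follows. DeduplicateFilter is a hypothetical filter (a real one would inherit from filters.BaseFilter), the resource values are made up, and the use of copy.deepcopy to imitate the parallel case is an assumption about how the server produces its copies.

```python
import copy


class DeduplicateFilter(object):
    """Hypothetical filter that removes duplicated new resources."""

    def callback(self, resourceID, resourceInfo, newResources, extraInfo):
        # Mutate the arguments in place. Rebinding the local name
        # (newResources = ...) would not be visible to the caller.
        resourceInfo["newResourcesReceived"] = len(newResources)
        newResources[:] = sorted(set(newResources))


resourceInfo = {"status": "ok"}
newResources = ["b", "a", "b"]
extraInfo = {"original": {"note": "as sent by the crawler"}}

# Parallel-style call: the filter only sees copies, so the originals stay untouched.
DeduplicateFilter().callback("user42", copy.deepcopy(resourceInfo),
                             copy.deepcopy(newResources), copy.deepcopy(extraInfo))
afterParallel = (dict(resourceInfo), list(newResources))

# Sequential-style call: the filter works on the very same objects.
DeduplicateFilter().callback("user42", resourceInfo, newResources, extraInfo)
afterSequential = (dict(resourceInfo), list(newResources))
print(afterParallel)
print(afterSequential)
```

In this sketch only the sequential call changes what the caller would go on to store, which matches the description of resourceInfo and newResources above.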

finish()

Execute per client finalization procedures.

This method is called every time a connection to a client is closed, allowing finalization code to be executed on a per-client basis. It is the counterpart of setup().

shutdown()

Execute program finalization procedures (similar to a destructor).

This method is called when the server is shut down, allowing finalization code to be executed in a global manner. It is intended to be the counterpart of __init__(), but differs from __del__() in that it is not bound to the lifetime of the filter object itself, but rather to the span of execution time of the server.
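
The order in which the lifecycle methods are expected to run can be illustrated with a small simulation. CountingFilter and the driver loop below are hypothetical stand-ins for a real filter and for the server. The loop handles one client at a time for simplicity, whereas a real server may have several clients connected at once.

```python
class CountingFilter(object):
    """Hypothetical filter that records which lifecycle methods were called."""

    def __init__(self, configurationsDictionary):
        # Runs once, when the filter is instantiated.
        self.events = ["__init__"]

    def setup(self):
        # Runs each time a client connection is opened.
        self.events.append("setup")

    def finish(self):
        # Runs each time a client connection is closed.
        self.events.append("finish")

    def shutdown(self):
        # Runs once, when the server is shut down.
        self.events.append("shutdown")


# Simulated server: one filter instance serving two consecutive clients.
filterInstance = CountingFilter({})
for clientName in ("client-1", "client-2"):
    filterInstance.setup()
    filterInstance.finish()
filterInstance.shutdown()
print(filterInstance.events)
```

Resources tied to a single client (for example, a per-client counter) fit setup() and finish(), while resources shared by all clients fit __init__() and shutdown().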

class filters.SaveResourcesFilter(configurationsDictionary)

Bases: filters.BaseFilter

Save resources sent by clients in a user specified location.

This post-processing-only filter makes use of the persistence infrastructure to save resources sent by clients. The location where the resources are stored can be specified in the XML configuration file by setting up the persistence handler to be used.
