persistence Module

Module to store persistence handler classes.

Persistence handlers take care of all implementation details related to resource storage. They expose a common interface (defined in BasePersistenceHandler) through which the server (and/or filters/crawlers) can load, save and perform other operations over resources independently from where and how the resources are actually stored. At any point in time, the collection status of each resource must be one of those defined in the struct-like class StatusCodes.

class persistence.StatusCodes

A struct-like class to hold constants for resource status codes.

The numeric value of each code can be modified to match the one used in the final location where the resources are persisted. The name of each code (SUCCEEDED, INPROGRESS, AVAILABLE, FAILED, ERROR) must not be modified.
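As a rough illustration of the rule above, a struct-like class of this kind could look like the following sketch. The numeric values are assumptions chosen for the example only; just the five names come from this documentation.

```python
class StatusCodes:
    """Struct-like holder for resource status codes.

    The numeric values below are illustrative assumptions; change them to
    match the values used where the resources are persisted. The names
    must stay as they are.
    """
    SUCCEEDED = 2
    INPROGRESS = 1
    AVAILABLE = 0
    FAILED = -1
    ERROR = -2


# Code elsewhere refers to the names, never to the raw numbers.
newStatus = StatusCodes.AVAILABLE
print(newStatus == StatusCodes.AVAILABLE)
```

Because callers compare against the names, renumbering the codes to match an existing storage location does not require changes elsewhere.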

class persistence.BasePersistenceHandler(configurationsDictionary)

Abstract class. All persistence handlers should inherit from it, either directly or through another class that inherits from it.

__init__(configurationsDictionary)

Constructor.

Each persistence handler receives everything in its corresponding handler section of the XML configuration file as the parameter configurationsDictionary.

_extractConfig(configurationsDictionary)

Extract and store configurations.

If some configuration needs any kind of pre-processing, it is done here. Extend this method if you need to pre-process custom configuration options.
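Extending this method might look like the sketch below, written for Python 3. The base class here is a reduced stand-in, not the real persistence.BasePersistenceHandler. The sketch assumes that __init__() calls _extractConfig() and that option values arrive as strings. The option name "batchsize" and the attribute name "config" are hypothetical.

```python
class BasePersistenceHandler:
    """Reduced stand-in for persistence.BasePersistenceHandler (assumption)."""

    def __init__(self, configurationsDictionary):
        self._extractConfig(configurationsDictionary)

    def _extractConfig(self, configurationsDictionary):
        # Store a copy of the configurations received from the server.
        self.config = dict(configurationsDictionary)


class BatchingHandler(BasePersistenceHandler):
    """Hypothetical handler with one custom option that needs pre-processing."""

    def _extractConfig(self, configurationsDictionary):
        # Let the parent class store the common configurations first.
        super()._extractConfig(configurationsDictionary)
        # "batchsize" is a made-up option: convert it to int, defaulting to 100.
        self.config["batchsize"] = int(self.config.get("batchsize", 100))


handler = BatchingHandler({"batchsize": "25"})
print(handler.config["batchsize"])
```

Calling the parent implementation first keeps any pre-processing it performs intact.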

setup()

Execute per-client initialization procedures.

This method is called every time a connection to a new client is opened, so initialization code can run on a per-client basis. This differs from __init__(), which is called only once, when the server instantiates the persistence handler, for the whole execution of the program.

select()

Retrieve an AVAILABLE resource.

Returns:

A tuple in the format (resourceKey, resourceID, resourceInfo).

  • resourceKey (user defined type): Value that uniquely identifies the resource internally. It works like a primary key in a relational database and allows resources with the same ID to coexist, if needed.
  • resourceID (user defined type): Resource ID to be sent to a client.
  • resourceInfo (dict): Other information related to the resource, if there is any.
update(resourceKey, status, resourceInfo)

Update the specified resource, setting its status and information data to the ones given.

Args:
  • resourceKey (user defined type): Value that uniquely identifies the resource internally.
  • status (StatusCodes): New status of the resource.
  • resourceInfo (dict): Other information related to the resource, if there is any.
insert(resourcesList)

Insert new resources into the final location where resources are persisted.

Args:
  • resourcesList (list): List of tuples containing all new resources to be inserted. Each resource is defined by a tuple in the format (resourceID, resourceInfo).
count()

Count the number of resources in each status category.

Returns:
A tuple in the format (total, succeeded, inprogress, available, failed, error) where all fields are integers representing the number of resources with the respective status code.
reset(status)

Set all resources with the given status code to AVAILABLE.

Args:
  • status (StatusCodes): Status of the resources to be reset.
Returns:
Number of resources reset.
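The data methods described above (select, update, insert, count and reset) can be sketched with a toy handler that keeps resources in a Python list. This is an illustration only, not one of the handlers in this module. It uses the list index as resourceKey and made-up status values. Two behaviours are assumptions that the interface description does not state: select() marks the returned resource as INPROGRESS, and it returns (None, None, None) when nothing is AVAILABLE.

```python
class StatusCodes:
    # Illustrative values only.
    SUCCEEDED, INPROGRESS, AVAILABLE, FAILED, ERROR = 2, 1, 0, -1, -2


class ListHandlerSketch:
    """Toy handler: resources live in a list, the index is the resourceKey."""

    def __init__(self, configurationsDictionary=None):
        self.resources = []

    def insert(self, resourcesList):
        # Each new resource is a (resourceID, resourceInfo) tuple.
        for resourceID, resourceInfo in resourcesList:
            self.resources.append(
                {"id": resourceID, "status": StatusCodes.AVAILABLE, "info": resourceInfo})

    def select(self):
        for key, resource in enumerate(self.resources):
            if resource["status"] == StatusCodes.AVAILABLE:
                # Assumption: mark it so it is not handed out twice.
                resource["status"] = StatusCodes.INPROGRESS
                return (key, resource["id"], resource["info"])
        # Assumption: signal "nothing available" with a tuple of Nones.
        return (None, None, None)

    def update(self, resourceKey, status, resourceInfo):
        self.resources[resourceKey]["status"] = status
        self.resources[resourceKey]["info"] = resourceInfo

    def count(self):
        statuses = [resource["status"] for resource in self.resources]
        order = (StatusCodes.SUCCEEDED, StatusCodes.INPROGRESS,
                 StatusCodes.AVAILABLE, StatusCodes.FAILED, StatusCodes.ERROR)
        return (len(statuses),) + tuple(statuses.count(code) for code in order)

    def reset(self, status):
        changed = 0
        for resource in self.resources:
            if resource["status"] == status:
                resource["status"] = StatusCodes.AVAILABLE
                changed += 1
        return changed


handler = ListHandlerSketch()
handler.insert([("user1", {"lang": "en"}), ("user2", None)])
key, resourceID, info = handler.select()
handler.update(key, StatusCodes.FAILED, info)
print(handler.count())                     # (2, 0, 0, 1, 1, 0)
print(handler.reset(StatusCodes.FAILED))   # 1
```
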
finish()

Execute per-client finalization procedures.

This method is called every time a connection to a client is closed, so finalization code can run on a per-client basis. It is the counterpart of setup().

shutdown()

Execute program finalization procedures (similar to a destructor).

This method is called when the server is shut down, so finalization code can run globally. It is intended to be the counterpart of __init__(), but differs from __del__() in that it is not bound to the life of the persistence handler object itself, but rather to the span of execution time of the server.
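The relationship between __init__(), setup(), finish() and shutdown() can be sketched as follows. The serve() function is a hypothetical stand-in for the server, and it handles clients one after another for simplicity. In a real server the per-client calls of different clients may well interleave.

```python
class RecordingHandler:
    """Stand-in handler that only records which lifecycle hook ran."""

    def __init__(self, configurationsDictionary):
        self.calls = ["__init__"]       # once, when the handler is instantiated

    def setup(self):
        self.calls.append("setup")      # once per client connection opened

    def finish(self):
        self.calls.append("finish")     # once per client connection closed

    def shutdown(self):
        self.calls.append("shutdown")   # once, when the server shuts down


def serve(handler, clients):
    # Hypothetical, simplified driver standing in for the server.
    for _client in clients:
        handler.setup()
        # ... select() / update() calls on behalf of this client go here ...
        handler.finish()
    handler.shutdown()


handler = RecordingHandler({})
serve(handler, ["client-a", "client-b"])
print(handler.calls)
# ['__init__', 'setup', 'finish', 'setup', 'finish', 'shutdown']
```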

class persistence.FilePersistenceHandler(configurationsDictionary)

Bases: persistence.MemoryPersistenceHandler

Load and dump resources from/to a file.

All resources in the file are loaded into memory before the server operations begin, so this handler is recommended for small to medium-sized datasets that fit completely in the machine's memory. For larger datasets, consider using another persistence handler. Another option for large datasets is to split the resources across more than one file and collect the resources of one file at a time.

The default version of this handler supports CSV and JSON files. Support for other file types can be added by subclassing BaseFileColumns and BaseFileHandler. The new file type must also be included in the supportedFileTypes dictionary.

class BaseFileColumns(fileName, idColumn, statusColumn)

Hold column names of data in the file, allowing fast access to names of ID, status and info columns.

_extractColNames(fileName)

Extract column names from the file.

Must be overridden, as column name extraction depends on the file type.

Returns:
A list of all column names in the file.
class FilePersistenceHandler.BaseFileHandler

Handle low level details about persistence in a specific file type.

Each resource loaded from a file is stored in memory in a dictionary in the format {"id": X, "status": X, "info": {...}}, which is the resource internal representation format. This handler is responsible for translating resources in the internal representation format to the format used in a specific file type and vice-versa.

parse(resource, columns)

Transform resource from file format to internal representation format.

Args:
  • resource (file specific type): Resource given in file format.
  • columns (BaseFileColumns subclass): Object holding column names.
Returns:
A resource in internal representation format.
unparse(resource, columns)

Transform resource from internal representation format to file format.

Args:
  • resource (dict): Resource given in internal representation format.
  • columns (BaseFileColumns subclass): Object holding column names.
Returns:
A resource in file format.
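As a sketch of the translation performed by parse() and unparse(), the functions below convert between a dictionary keyed by column name (roughly what a CSV row could look like) and the internal representation format. The Columns class and its attribute names are hypothetical stand-ins for a BaseFileColumns subclass, and converting the status to int is an assumption of this example.

```python
class Columns:
    """Hypothetical stand-in for a BaseFileColumns subclass."""

    def __init__(self, names, idColumn, statusColumn):
        self.idColumn = idColumn
        self.statusColumn = statusColumn
        # Every column that is neither ID nor status is treated as info.
        self.infoColumns = [n for n in names if n not in (idColumn, statusColumn)]


def parse(resource, columns):
    # File format (dict keyed by column name) -> internal representation.
    return {
        "id": resource[columns.idColumn],
        "status": int(resource[columns.statusColumn]),
        "info": {name: resource[name] for name in columns.infoColumns},
    }


def unparse(resource, columns):
    # Internal representation -> file format (dict keyed by column name).
    row = {columns.idColumn: resource["id"], columns.statusColumn: resource["status"]}
    row.update({name: resource["info"].get(name, "") for name in columns.infoColumns})
    return row


columns = Columns(["user", "state", "lang"], idColumn="user", statusColumn="state")
internal = parse({"user": "alice", "state": "0", "lang": "en"}, columns)
print(internal)                    # {'id': 'alice', 'status': 0, 'info': {'lang': 'en'}}
print(unparse(internal, columns))  # {'user': 'alice', 'state': 0, 'lang': 'en'}
```
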
load(file, columns)

Load resources in file format and yield them in internal representation format.

Args:
  • file (file object): File object bound to the physical file where resources are stored.
  • columns (BaseFileColumns subclass): Object holding column names.
Yields:
A resource in internal representation format.
dump(resources, file, columns)

Save resources in internal representation format to file format.

Args:
  • resources (list): List of resources in internal representation format.
  • file (file object): File object bound to the physical file where resources will be stored.
  • columns (BaseFileColumns subclass): Object holding column names.
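A load()/dump() pair for CSV data could be sketched as below, using Python's csv module and in-memory file objects. This is not the code of CSVHandler. The Columns namedtuple and its field names are hypothetical stand-ins for a BaseFileColumns subclass, and the status is converted to int as an assumption of this example.

```python
import csv
import io
from collections import namedtuple

# Hypothetical stand-in for a BaseFileColumns subclass.
Columns = namedtuple("Columns", ["idColumn", "statusColumn", "infoColumns"])


def load(file, columns):
    # Read rows in file format and yield them in internal representation format.
    for row in csv.DictReader(file):
        yield {
            "id": row[columns.idColumn],
            "status": int(row[columns.statusColumn]),
            "info": {name: row[name] for name in columns.infoColumns},
        }


def dump(resources, file, columns):
    # Write resources in internal representation format as CSV rows.
    fieldnames = [columns.idColumn, columns.statusColumn] + list(columns.infoColumns)
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for resource in resources:
        row = {columns.idColumn: resource["id"], columns.statusColumn: resource["status"]}
        row.update({name: resource["info"].get(name, "") for name in columns.infoColumns})
        writer.writerow(row)


columns = Columns("user", "state", ["lang"])
source = io.StringIO("user,state,lang\nalice,0,en\nbob,2,pt\n")
resources = list(load(source, columns))

target = io.StringIO()
dump(resources, target, columns)
print(target.getvalue())
```

Writing load() as a generator means a caller can process resources one at a time while the file is being read.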
class FilePersistenceHandler.CSVColumns(fileName, idColumn, statusColumn)

Bases: persistence.BaseFileColumns

Hold column names of data in CSV files, allowing fast access to names of ID, status and info columns.

class FilePersistenceHandler.CSVHandler

Bases: persistence.BaseFileHandler

Handle low level details about persistence in CSV files.

Note

This class and the CSVColumns class use Python’s built-in csv module internally.

class FilePersistenceHandler.JSONColumns(fileName, idColumn, statusColumn)

Bases: persistence.BaseFileColumns

Hold column names of data in JSON files, allowing fast access to names of ID, status and info columns.

class FilePersistenceHandler.JSONHandler

Bases: persistence.BaseFileHandler

Handle low level details about persistence in JSON files.

Note

This class and the JSONColumns class use Python’s built-in json module internally.

FilePersistenceHandler.supportedFileTypes = {'JSON': ['JSONColumns', 'JSONHandler'], 'CSV': ['CSVColumns', 'CSVHandler']}

Associate file types with their columns and handler classes. The type of the current file is provided by the user directly (through the filetype option in the XML configuration file) or indirectly (through the file extension extracted from the file name). When checking whether the type of the current file is in the list of supported file types, the string comparison is case insensitive.
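The lookup described above might be sketched like this. The function name resolveFileType and the choice to raise ValueError for unsupported types are assumptions of the example, not part of the handler's documented behaviour.

```python
import os

supportedFileTypes = {
    "JSON": ["JSONColumns", "JSONHandler"],
    "CSV": ["CSVColumns", "CSVHandler"],
}


def resolveFileType(fileName, fileType=None):
    """Return (type name, [columns class name, handler class name])."""
    if fileType is None:
        # No explicit type given: fall back to the file extension.
        fileType = os.path.splitext(fileName)[1].lstrip(".")
    for supported, classNames in supportedFileTypes.items():
        # Case-insensitive comparison, so "csv", "Csv" and "CSV" all match.
        if supported.lower() == fileType.lower():
            return supported, classNames
    raise ValueError("Unsupported file type: %r" % fileType)


print(resolveFileType("resources.csv"))
print(resolveFileType("dump.txt", fileType="json"))
```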

class persistence.RolloverFilePersistenceHandler(configurationsDictionary)

Bases: persistence.FilePersistenceHandler

Load and dump resources from/to files respecting limits of file size and/or number of resources per file.

This handler uses multiple instances of FilePersistenceHandler to allow insertion of new resources respecting limits specified by the user. It is also capable of reading and updating resources from multiple files.

The rollover handler leaves the low-level details of persistence to the file handlers attached to each file. It takes care of the coordination needed to keep them consistent and of checking the established limits.

When inserting new resources, every time the file size limit and/or the limit on the number of resources per file is reached, the rollover handler opens a new file and assigns a new instance of FilePersistenceHandler to handle it. All resources, however, are kept in memory. So, as with FilePersistenceHandler, this handler is not well suited for large datasets that cannot fit completely in memory.
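The rollover idea can be sketched with a resources-per-file limit alone. The function below only decides which file each new resource would go to. It does not create FilePersistenceHandler instances or write anything to disk, and the parameter names and file name pattern are hypothetical.

```python
def rollover(resourcesList, amountThreshold, fileNamePattern="resources.{}.csv"):
    """Group new resources into files holding at most amountThreshold each."""
    chunks = [[]]
    for resource in resourcesList:
        if len(chunks[-1]) >= amountThreshold:
            # Limit reached: "open" a new file for the following resources.
            chunks.append([])
        chunks[-1].append(resource)
    return {fileNamePattern.format(index): chunk for index, chunk in enumerate(chunks)}


newResources = [("user%d" % i, None) for i in range(5)]
for fileName, chunk in rollover(newResources, amountThreshold=2).items():
    print(fileName, [resourceID for resourceID, _info in chunk])
```

A file size limit would follow the same pattern, with the check comparing the bytes written so far instead of the number of resources.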

Note

This handler was inspired by Python’s logging.handlers.RotatingFileHandler class.

class persistence.MySQLPersistenceHandler(configurationsDictionary)

Bases: persistence.BasePersistenceHandler

Store and retrieve resources to/from a MySQL database.

The table must already exist in the database and must contain at least three columns: a primary key column, a resource ID column and a status column.
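A table satisfying the minimum requirement above could be created with a statement along the lines of the one held in the string below. The table name, column names and column types are all hypothetical. The actual names are whatever the handler is configured to use. The snippet only prints the statement and does not connect to a database.

```python
# Hypothetical schema: one primary key column, one resource ID column and
# one status column, which is the minimum the handler is documented to need.
CREATE_TABLE_SQL = """
CREATE TABLE resources (
    resources_pk INT UNSIGNED NOT NULL AUTO_INCREMENT,
    resource_id VARCHAR(255) NOT NULL,
    status TINYINT NOT NULL,
    PRIMARY KEY (resources_pk)
)
""".strip()

print(CREATE_TABLE_SQL)
```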

Note

This handler uses MySQL Connector/Python to interact with MySQL databases.
