The issue

The WMS performs an implicit cleanup of the staging directories (input/output sandboxes) only for jobs declared as aborted and for those whose output is retrieved by the user by means of the glite-wms-job-output command.

In a normal deployment the WMS supplies the system administrator with a purge command which should be executed periodically (as a cron job) to clean up old input/output sandboxes, according to basic policies that can be specified as command-line options.

The purge command

The glite-wms-purgeStorage command performs a clean removal of the sandbox staging directories, following the purging algorithm described below, for those jobs matching the purging criteria specified as command-line options. The command, which typically runs as a cron job, iterates over the paths within the staging directory and applies the purging algorithm to each path, where each path corresponds to a given job id belonging either to a simple job, a DAG or a DAG node.

The gLite 3.1 implementation supplies users with an option (--allocated-limit, -a) which defines the percentage of allocated blocks, within the partition holding the staging path, above which the actual purging is triggered. In other words, this option provides a way to specify a hard limit based on the allocated size of the staging partition.
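The partition-occupancy trigger described above can be sketched with a standard-library call; this is an illustrative check, not the actual WMS implementation, and the function name is hypothetical:

```python
import os

def allocation_exceeds_limit(staging_path, limit_percent):
    """Return True when the percentage of allocated blocks on the
    partition holding staging_path exceeds limit_percent."""
    st = os.statvfs(staging_path)
    used_blocks = st.f_blocks - st.f_bfree           # blocks already allocated
    used_percent = 100.0 * used_blocks / st.f_blocks
    return used_percent > limit_percent

# Example: trigger the actual purging only when more than 80% of the
# partition holding the staging path is allocated.
if allocation_exceeds_limit("/tmp", 80.0):
    print("partition above limit: purging triggered")
```

Basing the trigger on statvfs means the limit reflects the whole partition, not just the sandboxes, which matches the behaviour described above for the --allocated-limit option.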

The other command line options which can be specified are in brief:

  • --threshold: sets the purging threshold to the specified number of seconds;
  • --skip-status-checking: does not perform any status checking before purging;
  • --force-orphan-node-removal: forces removal of orphan DAG nodes;
  • --force-dag-node-removal: forces removal of DAG nodes even if the purging conditions are not met for their parent;

The latter three options do not define any triggering condition; they influence how the purging algorithm processes the status of jobs and thus identifies the ones to be purged.

Purging algorithm

As described above, the purging algorithm is invoked for each path within the staging directory. The path is converted into the corresponding job id, and the L&B is then queried to acquire information about the current status of the job.

A job, and thus its staging directory, is considered removable if it has been declared DONE, ABORTED, CLEARED or CANCELLED. Unless the --skip-status-checking flag is explicitly specified, all the remaining job states are considered not removable.

If a job is in a removable state and --threshold has been specified, an age check is also performed. This check compares the time elapsed since the last logged event (lastUpdateTime) with the specified threshold; if the threshold is exceeded, the job and its staging space are removed. It should be pointed out that the threshold check does not refer to the age of the files belonging to the job about to be purged, but to the time of the last event logged to the L&B.

If the job is a DAG and the purging conditions are met, according to any modifiers specified (either --threshold or --skip-status-checking), the staging space of the whole DAG, including its child nodes, is removed. This implies that the staging space of a DAG node is cleaned up once the purging conditions are met for its parent. By setting the --force-dag-node-removal flag, it is possible to force the removal of DAG nodes even if the purging conditions are not met for their parent.
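The decision logic above can be sketched as follows. The function name and parameters are illustrative stand-ins for the real L&B queries; only the state names and the threshold semantics come from the description above:

```python
import time

# States in which a job's staging directory is considered removable.
REMOVABLE_STATES = {"DONE", "ABORTED", "CLEARED", "CANCELLED"}

def is_purgeable(state, last_update_time, threshold=None,
                 skip_status_checking=False):
    """Decide whether a job's staging directory can be removed.

    state                -- current job state as reported by the L&B
    last_update_time     -- Unix time of the last event logged to the L&B
    threshold            -- optional age limit in seconds (--threshold)
    skip_status_checking -- mimics the --skip-status-checking flag
    """
    # Status check: only jobs in a removable state qualify,
    # unless status checking is explicitly skipped.
    if not skip_status_checking and state not in REMOVABLE_STATES:
        return False
    # Age check: compare the time elapsed since the last logged event
    # (not the age of the sandbox files) with the threshold.
    if threshold is not None:
        elapsed = time.time() - last_update_time
        if elapsed <= threshold:
            return False
    return True
```

For a DAG, this decision taken on the parent would then apply to the staging space of all its child nodes, unless --force-dag-node-removal is set.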

Modifications to be made

Decisions made at the IT/CZ meeting in Rome

  • The purger should be in charge of removing expired delegated proxies from the WMProxy. Therefore, mechanisms like the delegation proxy cache cleanup should be performed by the purger itself. In the current WMS deployment this is achieved by adding the glite-wms-wmproxy-purge-proxycache command to the WMS node crontab. This is strongly required in order to avoid exceeding the maximum number of files per directory and filling up the disk space. The mechanism should be migrated into the purger.

  • The purger has to be modified to check whether a job has been in a given state for a defined time period; this is not done by the purger now, as it only checks for completed jobs (in a removable state). Different timeouts could be specified for different states.

  • The purger should start its algorithm from the jobs kept by the LBProxy and not, as in the current implementation, from the collection of job ids generated by scanning the staging path for directories relevant to submitted jobs/DAGs. Choosing the LBProxy as the source of information about jobs and their status would improve overall performance by avoiding remote queries to the LB server.

  • The purger should also be responsible for cleaning up JobControl-specific files such as those generated in the /var/glite/jobcontrol/{condorio,submit} directories.
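The per-state timeout check proposed in the second point above could be sketched like this; the state names come from the purging algorithm, but the timeout values and the function are purely illustrative, not part of any actual WMS configuration:

```python
import time

# Hypothetical per-state timeouts in seconds; the values are
# illustrative only. States with no entry are never purged.
STATE_TIMEOUTS = {
    "DONE": 7 * 24 * 3600,     # purge a week after completion
    "ABORTED": 2 * 24 * 3600,  # purge aborted jobs sooner
    "CLEARED": 3600,           # output already retrieved by the user
}

def expired(state, time_entered_state, now=None):
    """Return True if the job has been in `state` longer than that
    state's timeout; states without a timeout are never purged."""
    timeout = STATE_TIMEOUTS.get(state)
    if timeout is None:
        return False
    if now is None:
        now = time.time()
    return now - time_entered_state > timeout
```

Compared with the current single --threshold, such a table would let an administrator purge, say, CLEARED jobs almost immediately while keeping DONE sandboxes around until the user has had time to retrieve the output.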

Other suggestions / requests

  • Daniele Cesini: "For sandbox, It would be good if beside complicated purging algorithms it will be possible to define a set of hard limits based only on age, size and combination of age/size of files, not depending on any other variable (i.e. status of parents and sons) that could depend somehow on the correct behavior of WMS services and system administrators".
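A hard limit of the kind suggested above, depending only on the age and size of the files themselves, could be checked as follows; the function and its parameters are hypothetical, intended only to illustrate a policy that needs no knowledge of job status or parent/child relations:

```python
import os
import time

def violates_hard_limits(path, max_age=None, max_size=None, now=None):
    """Check a sandbox directory against hard limits that depend only
    on the files themselves: total size in bytes, and age in seconds
    measured from the last modification of the newest file."""
    if now is None:
        now = time.time()
    total_size = 0
    newest_mtime = 0.0
    for root, _, files in os.walk(path):
        for name in files:
            st = os.stat(os.path.join(root, name))
            total_size += st.st_size
            newest_mtime = max(newest_mtime, st.st_mtime)
    too_old = max_age is not None and now - newest_mtime > max_age
    too_big = max_size is not None and total_size > max_size
    return too_old or too_big
```

Because this check reads only file metadata, it would still work as a safety net even if the L&B, the WMS services or the DAG bookkeeping misbehave.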

-- SalvatoreMonforte - 03 Apr 2008

Topic revision: r2 - 2008-04-09 - SalvatoreMonforte