The issue
The WMS performs an implicit cleanup of the staging directories (input/output sandboxes) only for jobs declared as aborted and for jobs whose output is retrieved by the user by means of the glite-wms-job-output command.
In a normal deployment the WMS supplies the system administrator with a purge command that should be executed periodically (as a cron job) to clean up old input/output sandboxes, according to basic policies that can be specified as command line options.
The purge command
The command glite-wms-purgeStorage is designed to perform a clean removal of the sandbox staging directories, according to the purging algorithm, for those jobs matching the purging criteria specified as command line options.
The command, which basically runs as a cron job, iterates over the paths within the staging directory and calls the purge algorithm on each path, where each path corresponds to a given job id relevant either to a simple job, a DAG or a DAG node.
The gLite 3.1 implementation supplies users with an option (--allocated-limit, -a) which defines the percentage of allocated blocks, within the partition that holds the staging path, that triggers the actual purging. In other words, this option provides a way to specify a hard limit based on the allocated size of the staging partition.
The other command line options that can be specified are, in brief:
- --threshold : sets the purging threshold to the specified number of seconds;
- --skip-status-checking : does not perform any status checking before purging;
- --force-orphan-node-removal : forces the removal of orphan DAG nodes;
- --force-dag-node-removal : forces the removal of DAG nodes even if the purging conditions are not met for their father.
These last three options do not define any triggering condition, but influence how the purging algorithm processes the status of jobs and thus identifies the ones to be purged.
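In a typical deployment the command is run from cron. A hypothetical system crontab entry, where the schedule, the user and the option values are illustrative examples rather than documented defaults, might look like:

```shell
# Illustrative system crontab entry: run the purger hourly as the glite
# user, removing sandboxes of jobs whose last L&B event is older than
# 7 days (604800 seconds), with an additional hard limit triggered when
# the staging partition exceeds 85% of allocated blocks.
0 * * * * glite glite-wms-purgeStorage --threshold 604800 --allocated-limit 85
```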
Purging algorithm
As described above, the purging algorithm is called for each path within the staging directory. The path is converted into the corresponding job id, and L&B is then queried to acquire information about the current status of the job.
A job, and thus its staging directory, is considered removable if the job has been declared DONE, ABORTED, CLEARED or CANCELLED.
Unless explicitly specified otherwise (with the --skip-status-checking flag), all the remaining job states are considered not removable.
If a job is in a removable state and --threshold has been specified, an age check is performed. This check consists of comparing the time elapsed since the last logged event (lastUpdateTime) with the specified threshold. If the threshold is exceeded, the job and its staging space are removed. It should be pointed out that the threshold check does not refer to the age of the files belonging to the job about to be purged, but to the time of the last event logged to the L&B.
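The status and age checks described above can be sketched as follows. This is a minimal Python sketch, not the actual WMS implementation; the function and parameter names are illustrative:

```python
import time

# Job states that the purger considers removable (per the text above).
REMOVABLE_STATES = {"DONE", "ABORTED", "CLEARED", "CANCELLED"}

def is_purgeable(state, last_update_time, threshold=None,
                 skip_status_checking=False, now=None):
    """Decide whether a job's staging directory may be purged.

    state                -- current L&B job state, e.g. "DONE"
    last_update_time     -- Unix time of the last event logged to L&B
    threshold            -- optional age threshold in seconds (--threshold)
    skip_status_checking -- mimics the --skip-status-checking flag
    """
    if now is None:
        now = time.time()
    # Unless status checking is skipped, only removable states qualify.
    if not skip_status_checking and state not in REMOVABLE_STATES:
        return False
    # The age check compares the time since the last logged L&B event
    # (not the age of the sandbox files) against the threshold.
    if threshold is not None and (now - last_update_time) <= threshold:
        return False
    return True
```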
If the job is a DAG and the purging conditions are met, according to any modifiers specified (either --threshold or --skip-status-checking), the staging space of the whole DAG, including its children nodes, is removed. This implies that the staging space of a DAG node is cleaned up once the purging conditions are met for its father. By setting the --force-dag-node-removal flag, it is possible to force the removal of DAG nodes even if the purging conditions are not met for their father.
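The decision for a DAG node can be sketched as below; this is an illustrative Python sketch under the assumption that, with --force-dag-node-removal set, a node is purged on its own merits when its father does not qualify (the argument names are not the real API):

```python
def purge_dag_node(father_purgeable, force_dag_node_removal=False,
                   node_purgeable=False):
    """Decide whether a DAG node's staging space is removed.

    Normally a node is cleaned up only when the purging conditions are
    met for its father; --force-dag-node-removal allows removing the
    node even when the father does not qualify.
    """
    if father_purgeable:
        # The whole DAG, children nodes included, goes with the father.
        return True
    # Assumed behaviour: the forced removal still requires the node
    # itself to satisfy the purging conditions.
    return force_dag_node_removal and node_purgeable
```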
Modifications to be made
Decisions made at the IT/CZ meeting in Rome
- The purger should be in charge of removing the expired delegated proxies from the WMProxy; mechanisms like the delegation proxy cache cleanup should therefore be performed by the purger itself. In the current WMS deployment this is achieved by adding the glite-wms-wmproxy-purge-proxycache command to the WMS node crontab. This is strongly required in order to avoid exceeding the maximum number of files per directory and filling up the disk space. The mechanism should be migrated into the purger.
- The purger has to be modified to check whether a job has been in a given state for a defined time period; this is not done by the purger now, as it only checks for completed jobs (i.e. jobs in a removable state). Different timeouts could be specified for different states.
- The purger should start its algorithm from the jobs kept by the LBProxy and not, as in the current implementation, from the collection of job ids generated by scanning the staging path for directories relevant to submitted jobs/DAGs. Choosing the LBProxy as the source of information about jobs and their status would increase the overall performance by avoiding remote queries to the LB server.
- The purger should also be responsible for cleaning up JobControl-specific files, such as those generated in the /var/glite/jobcontrol/{condorio,submit} directories.
Other suggestions / requests
- Daniele Cesini: "For sandboxes, it would be good if, besides complicated purging algorithms, it were possible to define a set of hard limits based only on age, size and combinations of age/size of files, not depending on any other variable (i.e. the status of parents and children) that could depend somehow on the correct behavior of WMS services and system administrators".
--
SalvatoreMonforte - 03 Apr 2008