EMI WMS Test Plan
Service Description
The Workload Management System (gLite WMS) is a software service of the gLite suite which is responsible for distributing and managing tasks across computing and storage resources available on a Grid. WMS assigns user jobs to CEs and SEs belonging to a Grid environment in a convenient fashion, so that:
- jobs are always executed on resources that match the job requirements
- grid-wide load balance is maintained, i.e. jobs are evenly and efficiently distributed across the entire Grid.
The WMS receives job execution requests from a client, finds appropriate resources that satisfy them, then dispatches the jobs and follows them until completion, handling failures along the way whenever possible. Besides single batch-like jobs, the WMS handles compound job types: Directed Acyclic Graphs (a set of jobs where the input/output/execution of one or more jobs may depend on one or more other jobs), Parametric Jobs (multiple jobs with one parametrized description), and Collections (multiple jobs with a common description). Jobs are described via a flexible, high-level Job Definition Language (JDL).
Functional Description
Service Reference Card
Unit tests
N/A
Deployment tests
Repository
The EMI-1 RC4 repository can be found under:
http://emisoft.web.cern.ch/emisoft/dist/EMI/1/RC4/sl5/x86_64
Other repositories:
- epel.repo
- lcg-CA.repo
- sl.repo
- sl-security.repo
Installation test
First of all, install the yum-protectbase rpm:
yum install yum-protectbase.noarch
Then proceed with the installation of the CA certificates by issuing:
yum install ca-policy-egi-core
Install the WMS metapackage:
yum install emi-wms
(see log file)
Configure the WMS:
/opt/glite/yaim/bin/yaim -c -s site-info.def -n WMS
(see log file)
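The installation and configuration steps above can be collected into a small driver, sketched here in Python. The command strings are copied from this section; the helper name `run_steps` and the dry-run default are assumptions made for illustration and are not part of any EMI tooling.

```python
import shlex
import subprocess

# Commands copied from the installation test above, in the order they are run.
INSTALL_STEPS = [
    "yum install yum-protectbase.noarch",
    "yum install ca-policy-egi-core",
    "yum install emi-wms",
    "/opt/glite/yaim/bin/yaim -c -s site-info.def -n WMS",
]


def run_steps(steps, dry_run=True):
    """Parse each step and, unless dry_run is set, execute them in order.

    With dry_run=True (the default) nothing is executed: the parsed
    argument vectors are only returned so they can be reviewed.
    """
    parsed = [shlex.split(step) for step in steps]
    if not dry_run:
        for argv in parsed:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so the sequence stops at the first failing step.
            subprocess.run(argv, check=True)
    return parsed


if __name__ == "__main__":
    for argv in run_steps(INSTALL_STEPS):
        print(" ".join(argv))
```

Running it with `dry_run=False` would need root privileges and the repositories listed above already configured.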
Update test
N/A
Functionality tests
Features/Scenarios to be tested
YAIM-WMS Configuration Testing
- Installation and configuration starting from a clean machine (i.e. only the OS installed)
- Update and configuration from a previous version
WMS Job Submission/GetOutput Testing
Submit a job to the WMS service and when finished retrieve the output.
Test job submission with the following type of jobs:
Normal Job
- Test submission of normal jobs with different options and in different situations (Implemented)
- Test the complete cycle with the two types of CEs: lcg-CE and CREAM (Implemented)
More JDLs can be added in the future.
Perusal job
Job perusal is the ability to view output from a job while it is running.
Implemented
DAG job
Directed Acyclic Graphs (a set of jobs where the input/output/execution of one or more jobs may depend on one or more other jobs).
- Submit a jdl like this one:
[
  type = "dag";
  DefaultNodeShallowRetryCount = 3;
  nodes = [
    nodeA = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ;
    ];
    nodeB = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ;
    ];
    nodeC = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ;
    ];
    dependencies = {
      { nodeA, nodeB },
      { nodeA, nodeC }
    }
  ];
]
- When the DAG finishes, retrieve the output files
- Check the final status of the DAG (all nodes and the parent should be "Cleared")
More JDLs can be added in the future.
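To make the meaning of the dependencies attribute concrete, here is a minimal Python sketch that derives a valid execution order for the three nodes of the DAG example. It assumes that each pair reads { parent, child }, so that nodeB and nodeC both wait for nodeA; the function name `execution_order` is hypothetical and this is not how the WMS itself schedules nodes.

```python
from collections import deque


def execution_order(nodes, dependencies):
    """Return one valid execution order, or raise ValueError on a cycle.

    `dependencies` is a list of (parent, child) pairs, mirroring the JDL
    attribute `dependencies = { { nodeA, nodeB }, ... }`.
    """
    children = {n: [] for n in nodes}
    pending_parents = {n: 0 for n in nodes}
    for parent, child in dependencies:
        children[parent].append(child)
        pending_parents[child] += 1

    # Nodes without unfinished parents can run immediately (Kahn's algorithm).
    ready = deque(n for n in nodes if pending_parents[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            pending_parents[child] -= 1
            if pending_parents[child] == 0:
                ready.append(child)

    if len(order) != len(nodes):
        raise ValueError("dependencies contain a cycle, so this is not a DAG")
    return order


NODES = ["nodeA", "nodeB", "nodeC"]
DEPENDENCIES = [("nodeA", "nodeB"), ("nodeA", "nodeC")]

if __name__ == "__main__":
    print(execution_order(NODES, DEPENDENCIES))  # ['nodeA', 'nodeB', 'nodeC']
```

A test harness could use the same check to reject a JDL whose dependencies accidentally form a cycle before submitting it.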
Parametric Job
Multiple jobs with one parametrized description.
Implemented
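As an illustration of "one parametrized description", the Python sketch below expands a template into the per-job values. It assumes the gLite JDL convention of a `_PARAM_` placeholder driven by integer `Parameters`, `ParameterStart` and `ParameterStep` attributes, with `Parameters` acting as an exclusive upper bound; these details are assumptions here and should be checked against the JDL specification. The helper names are hypothetical.

```python
def parameter_values(parameters, start=0, step=1):
    """Values substituted for the placeholder, one per generated job.

    Assumption: values run from `start` in increments of `step` and stop
    before reaching `parameters`.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(start, parameters, step))


def expand(template, parameters, start=0, step=1, placeholder="_PARAM_"):
    """Return the template with the placeholder replaced for every job."""
    return [
        template.replace(placeholder, str(value))
        for value in parameter_values(parameters, start, step)
    ]


if __name__ == "__main__":
    # File name chosen only for illustration.
    print(expand("input_PARAM_.txt", parameters=6, start=0, step=2))
    # ['input0.txt', 'input2.txt', 'input4.txt']
```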
Collection Job
Multiple jobs with a common description. There are two ways to submit a collection: you can create a single JDL containing the JDLs of all the nodes, or you can submit all the JDLs stored in a directory (bulk submission).
- Submit a jdl like this one:
[
  nodes = {
    [
      file="jdl/arg.jdl";
    ],
    [
      executable="/bin/env";
      ShallowRetryCount = 0;
      RetryCount = 0;
      Stdoutput = "file.out" ;
      StdError = "file.err" ;
      OutputSandbox ={ "file.out" ,"file.err"} ;
      FuzzyRank = true;
    ],
    [
      NodeName="nodeA";
      executable="/bin/ls" ;
      Stdoutput = "file.out" ;
      OutputSandbox ={ "file.out"} ;
    ]
  };
  Type = "Collection" ;
  requirements = other.GlueCEStateStatus == "Production" ;
  rank = -other.GlueCEStateEstimatedResponseTime ;
]
- When the collection finishes, retrieve the output files
- Check the final status of the collection (all nodes and the parent should be "Cleared")
- To test bulk submission, use the "--collection" option of the glite-wms-job-submit command.
- When the collection finishes, retrieve the output files
- Check the final status of the collection (all nodes and the parent should be "Cleared")
More JDLs can be added in the future.
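The single-JDL approach can be automated when the node descriptions already exist as separate files. The Python sketch below builds a collection JDL in which every node uses the `file` attribute, as the first node of the example above does. The function name `collection_jdl` is hypothetical, and the output is only a starting point: it omits the requirements and rank expressions.

```python
def collection_jdl(node_files):
    """Build a collection JDL whose nodes each reference an existing JDL file."""
    if not node_files:
        raise ValueError("a collection needs at least one node")
    nodes = ",\n".join(
        '    [\n      file = "%s";\n    ]' % path for path in node_files
    )
    return '[\n  Type = "Collection";\n  nodes = {\n%s\n  };\n]\n' % nodes


if __name__ == "__main__":
    # jdl/arg.jdl is the node file used in the examples of this page.
    print(collection_jdl(["jdl/arg.jdl", "jdl/arg.jdl"]))
```

Generating the text this way makes it easy to vary the number of nodes between test runs.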
Parallel Job
Jobs that can run on one or more CPUs in parallel.
- Submit a jdl like this one:
[
  Executable = "cpi";
  CpuNumber = 2;
  Stdoutput = "cpi.out" ;
  StdError = "cpi.err" ;
  OutputSandbox = { "cpi.out" ,"cpi.err"} ;
  InputSandbox = { "exe/cpi" };
  FuzzyRank = true;
  usertags = [ exe = "cpi" ];
]
- When the job finishes, retrieve the output files
- Check the final status of the job
WMS Job shallow and deep re-submission
There are two types of resubmission. The first, called deep, occurs when the user's job has started running on the WN and then the job itself or the WMS JobWrapper has failed. The second, called shallow, occurs when the WMS JobWrapper has failed before starting the actual user's job.
Implemented
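The distinction between the two resubmission types can be expressed as a small decision function, sketched here in Python. It assumes that the JDL attributes `RetryCount` and `ShallowRetryCount`, which appear in the examples on this page, bound the number of deep and shallow resubmissions respectively; the function names are hypothetical and this is not the WMS implementation.

```python
def resubmission_type(user_job_started):
    """Classify a failure: 'deep' if the user's job had already started
    running on the WN, 'shallow' if the JobWrapper failed before that."""
    return "deep" if user_job_started else "shallow"


def may_resubmit(user_job_started, attempts_so_far, retry_count, shallow_retry_count):
    """Tell whether another resubmission of the matching type is allowed.

    attempts_so_far counts the resubmissions of that same type already made.
    """
    if resubmission_type(user_job_started) == "deep":
        limit = retry_count
    else:
        limit = shallow_retry_count
    return attempts_so_far < limit


if __name__ == "__main__":
    # With RetryCount = 1 and ShallowRetryCount = 2:
    print(may_resubmit(True, 1, retry_count=1, shallow_retry_count=2))   # False
    print(may_resubmit(False, 1, retry_count=1, shallow_retry_count=2))  # True
```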
WMS Job List-match Testing
Without data
Test the job-list-match command and its options
Implemented
With data
- You need to register a file on an SE, then submit a jdl like this one (as InputData put the lfn(s) registered before):
###########################################
# JDL with Data Requirements #
###########################################
Executable = "calc-pi.sh";
Arguments = "1000";
StdOutput = "std.out";
StdError = "std.err";
Prologue = "prologue.sh";
InputSandbox = {"calc-pi.sh", "fileA", "fileB","prologue.sh"};
OutputSandbox = {"std.out", "std.err","out-PI.txt","out-e.txt"};
Requirements = true;
DataRequirements = {
[
DataCatalogType = "DLI";
DataCatalog = "http://lfcserver.cnaf.infn.it:8085";
InputData = {"lfn:/grid/infngrid/cesini/PI_1M.txt","lfn:/grid/infngrid/cesini/e-2M.txt"};
]
};
DataAccessProtocol = "gsiftp";
- Then try a list-match; the listed CEs should be the ones "close" to the SE that holds the registered files
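The final check of this test, that every matched CE is close to the SE in use, can be sketched in Python as follows. The host names and the CE-to-close-SE mapping are hypothetical placeholders; in a real run they would come from the list-match output and from the information system.

```python
def ces_not_close(matched_ces, close_ses, used_se):
    """Return the matched CEs that do NOT list used_se among their close SEs.

    An empty result means the list-match behaved as expected.
    """
    return [ce for ce in matched_ces if used_se not in close_ses.get(ce, ())]


# Hypothetical data, for illustration only.
CLOSE_SES = {
    "ce1.example.org:8443/cream-pbs-short": ["se1.example.org"],
    "ce2.example.org:8443/cream-pbs-short": ["se2.example.org"],
}

if __name__ == "__main__":
    matched = ["ce1.example.org:8443/cream-pbs-short"]
    print(ces_not_close(matched, CLOSE_SES, "se1.example.org"))  # []
```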
WMS Job Cancel Testing
Test the cancellation of these types of jobs (final status should be "Cleared"):
Normal job
Submit and cancel a normal job
Implemented
DAG job
Submit a DAG job and then cancel it (the parent)
Collection
Submit a collection job and then cancel it (the parent)
Node of a collection
Submit a collection job and then cancel some of its nodes
Others
Delegation Testing
Test the delegation command and its options
Implemented
Job-info Testing
Test the job-info command and its options
Implemented
Logging-info Testing
Test the logging-info command and its options
Implemented
Job Status Testing
Test the job-status command and its options
Implemented
Prologue and Epilogue jobs
In the JDL you can specify two attributes, prologue and epilogue, which are scripts that are executed respectively before and after the user's job.
Implemented
Performance tests
Collection of 1000 nodes
Submit a collection of 1000 nodes.
Stress test
An example of a stress test:
- 2880 collections each of 20 jobs
- One collection every 60 seconds
- Four users
- Use LCG-CEs and CREAM-CEs (with different batch systems)
- Use automatic-delegation
- The job is a "sleep random(672)"
- Resubmission is enabled
- Enable proxy renewal
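The overall size of this stress test follows directly from the figures in the list. The short Python calculation below works it out; the even split of collections among the four users is an assumption, since the plan does not say how the load is shared.

```python
COLLECTIONS = 2880
JOBS_PER_COLLECTION = 20
SUBMIT_INTERVAL_S = 60
USERS = 4

# Total number of jobs handled by the WMS during the test.
total_jobs = COLLECTIONS * JOBS_PER_COLLECTION

# Approximate length of the submission window, in hours.
submission_hours = COLLECTIONS * SUBMIT_INTERVAL_S / 3600

# Assumption: collections are shared evenly among the users.
collections_per_user = COLLECTIONS // USERS

if __name__ == "__main__":
    print(total_jobs, submission_hours, collections_per_user)  # 57600 48.0 720
```

In other words, the test submits 57600 jobs over roughly two days.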
Regression tests
bug #33342: separate retry policies for ISB and OSB
Of course we're speaking of submission to the lcg-CE, as CREAM uses its own jobwrapper.
ISB:
https://devel11.cnaf.infn.it:9000/a
...
We submitted a job and then removed its ISB:
[mcecchi@cert-19 ~]$ glite-wms-job-submit -a -c devel11.conf lsisb.jdl
Connecting to the service https://devel11.cnaf.infn.it:7443/g...
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://devel11.cnaf.infn.it:9000/-...
==========================================================================
server side:
[root@devel11 input]# rm -f a
[root@devel11 input]# pwd
/var/SandboxDir/-h/https_3a_2f_2fdevel11.cnaf.infn.it_3a9000_2f-h4MRDzYkufRu71MKfF1pw/input
After a while, Maradona reports:
[root@devel11 https_3a_2f_2fdevel11.cnaf.infn.it_3a9000_2f-h4MRDzYkufRu71MKfF1pw]# cat Maradona.output
LM_log_done_begin
Wed Apr 20 22:08:51 CEST 2011: lcg-jobwrapper-hook.sh not readable or not present
Wed Apr 20 22:08:52 CEST 2011: Error during transfer
Wed Apr 20 22:09:53 CEST 2011: Error during transfer
Wed Apr 20 22:11:54 CEST 2011: Error during transfer
LM_log_done_end
Cannot download a from gsiftp://devel11.cnaf.infn.it:2811/var...
Killing log watchdog (pid=21047)...
jw exit status = 1
OSB:
[mcecchi@cert-19 ~]$ glite-wms-job-submit -a -c devel11.conf lsosb.jdl
Connecting to the service https://devel11.cnaf.infn.it:7443/g...
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://devel11.cnaf.infn.it:9000/3...
==========================================================================
After more than twenty minutes Maradona has not returned yet and the job is still running, meaning that different defaults are in place for the OSB (the ones previously used for both ISB and OSB).
bug #36292: Not all attributes of a SA/SE could be used in a gangmatching
Fix certified doing a listmatch with the following expression in the jdl:
Requirements = regexp(".in2p3.fr:2119.*",other.GlueCEUniqueID) && anyMatch(other.storage.CloseSEs,target.GlueSEImplementationVersion=="1.9.5-24");
which returns:
- cclcgceli02.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli04.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli06.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli09.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli02.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli04.in2p3.fr:2119/jobmanager-bqs-long
- cclcgceli06.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli09.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli02.in2p3.fr:2119/jobmanager-bqs-long
- cclcgceli06.in2p3.fr:2119/jobmanager-bqs-long
- cclcgceli04.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli09.in2p3.fr:2119/jobmanager-bqs-long
Double checking that the correct "GlueSEImplementationVersion" is picked up:
lcg-infosites --vo dteam closeSE >closeses.txt
gives the following closeSEs:
cclcgceli02.in2p3.fr:2119/jobmanager-bqs-long
ccsrm.in2p3.fr
ccsrm02.in2p3.fr
and ldapsearch -x -H ldap://lcg-bdii.cern.ch:2170 -b 'Mds-vo-name=local,o=Grid' '(GlueSEUniqueId=ccsrm.in2p3.fr)'
returns:
...
GlueSEImplementationVersion: 1.9.5-24
...
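The logic of the Requirements expression used for this bug can be mirrored in a few lines of Python, which may help when reading the list-match result. This is only an emulation under stated assumptions: `regexp` is taken to match anywhere in the CE identifier, `anyMatch` is taken to be true when at least one close SE satisfies the condition, and the close SEs are modelled as plain dictionaries. The version attributed to the second SE in the negative example is a placeholder.

```python
import re


def matches(ce_id, close_ses, version="1.9.5-24"):
    """Emulate: regexp(".in2p3.fr:2119.*", CE id) && anyMatch(close SEs, version test)."""
    ce_ok = re.search(r".in2p3.fr:2119.*", ce_id) is not None
    se_ok = any(
        se.get("GlueSEImplementationVersion") == version for se in close_ses
    )
    return ce_ok and se_ok


if __name__ == "__main__":
    # Values taken from the listmatch and ldapsearch output above.
    ce = "cclcgceli02.in2p3.fr:2119/jobmanager-bqs-long"
    ses = [{"GlueSEUniqueID": "ccsrm.in2p3.fr",
            "GlueSEImplementationVersion": "1.9.5-24"}]
    print(matches(ce, ses))  # True
```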
bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
coll_10.jdl is a ten-node collection in which only the first node has a non-empty ISB of its own.
[mcecchi@cert-19 ~]$ head -25 coll_10.jdl
[
Type = "collection";
InputSandbox = {"/home/mcecchi/Test.sh"};
RetryCount = 1;
Requirements = ( random(1.0) < 0.5 );
ShallowRetryCount = 2;
nodes = {
[
JobType = "Normal";
Zippedisb=true;
Executable = "Test.sh";
StdOutput = "test.out";
StdError = "test.err";
InputSandbox = {"a"};
OutputSandbox = {};
],
[
JobType = "Normal";
Executable = "Test.sh";
StdOutput = "test.out";
StdError = "test.err";
OutputSandbox = {};
],
[
JobType = "Normal";
We register the collection:
[mcecchi@cert-19 ~]$ glite-wms-job-submit -a -c devel11.conf --register-only coll_10.jdl
Connecting to the service https://devel11.cnaf.infn.it:7443/g...
====================== glite-wms-job-submit Success ======================
The job has been successfully registered to the WMProxy
Your job identifier is:
https://devel11.cnaf.infn.it:9000/M...
==========================================================================
To complete the operation, the following file containing the InputSandbox of the job needs to be transferred:
==========================================================================================================
ISB ZIP file : /tmp/ISBfiles_aoIPOxSR3GFuEcTxqJ6_Mg_0.tar.gz
Destination : gsiftp://devel11.cnaf.infn.it:2811/var...
We do NOT transfer ISB for the first node and start the job.
[mcecchi@cert-19 ~]$ glite-wms-job-submit --start https://devel11.cnaf.infn.it:9000/M...
Connecting to the service https://devel11.cnaf.infn.it:7443/g...
====================== glite-wms-job-submit Success ======================
The job has been successfully started to the WMProxy
Your job identifier is:
https://devel11.cnaf.infn.it:9000/M...
==========================================================================
After some seconds:
[mcecchi@cert-19 ~]$ glite-wms-job-status https://devel11.cnaf.infn.it:9000/M... Aborted|wc -l
11
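The count of 11 is consistent with one "Aborted" status for the parent plus one for each of the ten nodes. Under that interpretation, which is an assumption about how the figure is obtained, the pass criterion can be written in Python as below; the function names are hypothetical and the list of statuses stands in for whatever the status query returns.

```python
def expected_aborted(n_nodes):
    """One Aborted status per node, plus one for the collection itself."""
    return n_nodes + 1


def all_aborted(statuses, n_nodes):
    """True when the parent and every node are reported as Aborted."""
    aborted = sum(1 for status in statuses if status == "Aborted")
    return aborted == expected_aborted(n_nodes)


if __name__ == "__main__":
    print(expected_aborted(10))                # 11
    print(all_aborted(["Aborted"] * 11, 10))   # True
```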
-- ElisabettaMolinari - 2010-02-24
-- MarcoCecchi - 2011-06-27