
WMS Test Plan

Service Description

The Workload Management System (gLite WMS) is a software service of the gLite suite which is responsible for distributing and managing tasks across computing and storage resources available on a Grid. WMS assigns user jobs to CEs and SEs belonging to a Grid environment in a convenient fashion, so that:

  • jobs are always executed on resources that match the job requirements
  • grid-wide load balance is maintained, i.e. jobs are evenly and efficiently distributed across the entire Grid.

The WMS receives job execution requests from a client, finds appropriate resources, then dispatches the jobs and follows them until completion, handling failures along the way whenever possible. Besides single batch-like jobs, the WMS handles these compound job types: Directed Acyclic Graphs (a set of jobs where the input/output/execution of one or more jobs may depend on one or more other jobs), Parametric Jobs (multiple jobs with one parametrized description), and Collections (multiple jobs with a common description). Jobs are described via a flexible, high-level Job Description Language (JDL).

Functional Description

Service Reference Card

Unit tests

N/A

Deployment tests

Repository

The EMI-1 RC4 repository can be found under:

http://emisoft.web.cern.ch/emisoft/dist/EMI/1/RC4/sl5/x86_64

Other repositories:

  • epel.repo
  • lcg-CA.repo
  • sl.repo
  • sl-security.repo

Installation test

First of all, install the yum-protectbase rpm:

yum install yum-protectbase.noarch

Then proceed with the installation of the CA certificates by issuing:

yum install ca-policy-egi-core

Install the WMS metapackage:

yum install emi-wms 

(see log file)

Configure the WMS:

/opt/glite/yaim/bin/yaim -c -s site-info.def -n WMS  

(see log file)

Update test

N/A

Functionality tests

Features/Scenarios to be tested

YAIM-WMS Configuration Testing

  • Installation and configuration starting from a clean machine (i.e. only the OS installed)
  • Update and configuration from a previous version

WMS Job Submission/GetOutput Testing

Submit a job to the WMS service and, when it has finished, retrieve the output. Test job submission with the following types of jobs:

Normal Job
  • Test submission of normal jobs with different options and in different situations (Implemented)

  • Test the complete cycle with the two types of CEs: LCG-CE and CREAM-CE (Implemented)

More JDLs can be added in the future.

Perusal job

Job perusal is the ability to view output from a job while it is running. (Implemented)

DAG job

Directed Acyclic Graphs (a set of jobs where the input/output/execution of one or more jobs may depend on one or more other jobs).

  • Submit a jdl like this one:
[
  type = "dag";
  DefaultNodeShallowRetryCount = 3;
  nodes = [
    nodeA = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ; 
    ];
    nodeB = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ; 
    ];
    nodeC = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ; 
    ];
    dependencies = {
      { nodeA, nodeB },
      { nodeA, nodeC }
    }
  ];
]

  • When the DAG finishes, retrieve the output files
  • Check the final status of the DAG (all nodes and the parent should be "Cleared")
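
The dependencies in the JDL above can be read as ordering constraints. The following is a minimal Python sketch of that reading, assuming each pair means {parent, child}, so nodeA must finish before nodeB and nodeC start. The function name execution_order is hypothetical, and this only illustrates the ordering; it is not how the WMS itself schedules DAG nodes. It needs Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter


def execution_order(nodes, dependencies):
    """Return batches of node names; each batch can run once the previous one is done."""
    sorter = TopologicalSorter({name: set() for name in nodes})
    for parent, child in dependencies:
        sorter.add(child, parent)  # the child waits for its parent
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Node names and dependency pairs taken from the DAG JDL above.
nodes = ["nodeA", "nodeB", "nodeC"]
dependencies = [("nodeA", "nodeB"), ("nodeA", "nodeC")]
print(execution_order(nodes, dependencies))
```

For this DAG the sketch yields nodeA alone in the first batch, then nodeB and nodeC together.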

More JDLs can be added in the future.

Parametric Job

Multiple jobs with one parametrized description. (Implemented)
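
To illustrate what "one parametrized description" means, here is a small Python sketch that expands a template into one description per parameter value. It assumes the placeholder is written _PARAM_ and that values are generated from a start, a count and a step. The template text and the function name expand_parametric are hypothetical, and the expansion done by the WMS may differ.

```python
def expand_parametric(template, start, count, step=1):
    """Return one job description per parameter value."""
    values = [start + i * step for i in range(count)]
    # Every occurrence of the placeholder is replaced by the current value.
    return [template.replace("_PARAM_", str(value)) for value in values]


# Hypothetical one-line template, for illustration only.
template = 'Executable = "/bin/echo"; Arguments = "run _PARAM_"; StdOutput = "out_PARAM_.txt";'
for description in expand_parametric(template, start=0, count=3):
    print(description)
```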

Collection Job

Multiple jobs with a common description. There are two ways to submit a collection: create a single JDL containing the JDLs of all nodes, or submit all the JDLs stored in a directory (bulk submission).

  • Submit a jdl like this one:
[
nodes = {
   [
   file="jdl/arg.jdl";
   ],
   [
  executable="/bin/env";
  ShallowRetryCount = 0;
  RetryCount = 0;
  Stdoutput = "file.out" ;
  StdError =  "file.err" ;
  OutputSandbox ={ "file.out" ,"file.err"} ;
  FuzzyRank = true;
   ],
   [
   NodeName="nodeA";
   executable="/bin/ls" ;
  Stdoutput = "file.out" ;
  OutputSandbox ={ "file.out"} ;
   ]
};
Type = "Collection" ;
requirements =  other.GlueCEStateStatus == "Production" ;
rank = -other.GlueCEStateEstimatedResponseTime ;
]

  • When the collection finishes, retrieve the output files
  • Check the final status of the collection (all nodes and the parent should be "Cleared")

  • To test bulk submission, use the "--collection" option of the glite-wms-job-submit command.
  • When the collection finishes, retrieve the output files
  • Check the final status of the collection (all nodes and the parent should be "Cleared")
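
The final-status check can be expressed as a small helper. This is a sketch only: it assumes the statuses have already been collected into a dictionary keyed by job name (how they are obtained is left out), and the names not_cleared and statuses are hypothetical.

```python
def not_cleared(statuses):
    """Return the names of jobs whose final status is not "Cleared"."""
    return sorted(name for name, status in statuses.items() if status != "Cleared")


# Hypothetical statuses for a parent and two nodes.
statuses = {"parent": "Cleared", "node0": "Cleared", "nodeA": "Aborted"}
print(not_cleared(statuses))  # names that still need attention
```

The test passes when the returned list is empty.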

More JDLs can be added in the future.

Parallel Job

Jobs that can run on one or more CPUs in parallel.

  • Submit a jdl like this one:
[
Executable = "cpi";
CpuNumber = 2;
Stdoutput = "cpi.out" ;
StdError =  "cpi.err" ;
OutputSandbox = { "cpi.out" ,"cpi.err"} ;
InputSandbox = { "exe/cpi" };
FuzzyRank = true;
usertags = [ exe = "cpi" ];
]

  • When the job finishes, retrieve the output files
  • Check the final status of the job

WMS Job shallow and deep re-submission

There are two types of resubmission. Deep resubmission occurs when the user's job has started running on the WN and then the job itself or the WMS JobWrapper has failed. Shallow resubmission occurs when the WMS JobWrapper has failed before starting the actual user's job. (Implemented)
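
The distinction can be sketched as a simplified Python model. It assumes that deep and shallow resubmissions draw on two independent budgets, named after the RetryCount and ShallowRetryCount JDL attributes used elsewhere on this page. The function names are hypothetical and the real WMS logic may differ.

```python
def classify_failure(user_job_started):
    """Deep if the user's job had already started on the WN, shallow otherwise."""
    return "deep" if user_job_started else "shallow"


def resubmissions_allowed(failures, retry_count, shallow_retry_count):
    """Count the failures that are followed by a resubmission.

    `failures` is a sequence of booleans: True means the user's job had
    started before the failure. Counting stops at the first failure whose
    budget is already exhausted.
    """
    deep_left, shallow_left = retry_count, shallow_retry_count
    resubmitted = 0
    for user_job_started in failures:
        if classify_failure(user_job_started) == "deep":
            if deep_left == 0:
                break
            deep_left -= 1
        else:
            if shallow_left == 0:
                break
            shallow_left -= 1
        resubmitted += 1
    return resubmitted


# Two shallow failures, then two deep ones, with RetryCount = 1 and ShallowRetryCount = 2.
print(resubmissions_allowed([False, False, True, True], retry_count=1, shallow_retry_count=2))
```

In this example the first three failures are resubmitted and the fourth one finds the deep budget exhausted.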

WMS Job List-match Testing

Without data

Test the job-list-match command and its options (Implemented)

With data

  • Register a file on an SE, then submit a JDL like this one (set InputData to the LFN(s) registered before):
###########################################
#      JDL with Data Requirements         #
###########################################

Executable = "calc-pi.sh";
Arguments = "1000";
StdOutput = "std.out";
StdError = "std.err";
Prologue = "prologue.sh";
InputSandbox = {"calc-pi.sh", "fileA", "fileB","prologue.sh"};
OutputSandbox = {"std.out", "std.err","out-PI.txt","out-e.txt"};
Requirements = true;

DataRequirements = {
[
DataCatalogType = "DLI";
DataCatalog = "http://lfcserver.cnaf.infn.it:8085";
InputData = {"lfn:/grid/infngrid/cesini/PI_1M.txt","lfn:/grid/infngrid/cesini/e-2M.txt"};
]
};
DataAccessProtocol = "gsiftp";

  • Then try a list-match; the listed CEs should be the ones "close" to the SE holding the file
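
The expected result of this list-match can be stated as a filter: keep the CEs that have at least one close SE holding a replica of the input data. A minimal Python sketch follows, assuming the close-SE mapping and the replica locations are already known. All host names below are hypothetical, as is the function name ces_close_to_data.

```python
def ces_close_to_data(close_ses_by_ce, ses_with_replicas):
    """Return the CEs that have at least one close SE holding a replica."""
    wanted = set(ses_with_replicas)
    return sorted(ce for ce, close_ses in close_ses_by_ce.items() if wanted & set(close_ses))


# Hypothetical information-system snapshot.
close_ses_by_ce = {
    "ce01.example.org:8443/cream-pbs-short": ["se01.example.org"],
    "ce02.example.org:2119/jobmanager-lcgpbs-long": ["se02.example.org", "se03.example.org"],
    "ce03.example.org:8443/cream-lsf-short": [],
}
print(ces_close_to_data(close_ses_by_ce, ["se02.example.org"]))
```

Comparing this expected set with the CEs actually returned by the list-match gives a pass/fail criterion for the test.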

WMS Job Cancel Testing

Test the cancellation of these types of jobs (final status should be cleared):

Normal job

Submit and cancel a normal job (Implemented)

DAG job

Submit a dag job and then cancel it (the parent)

Collection

Submit a collection job and then cancel it (the parent)

Node of a collection

Submit a collection job and then cancel some of its nodes

Others

Delegation Testing

Test the delegation command and its options (Implemented)

Job-info Testing

Test the job-info command and its options (Implemented)

Logging-info Testing

Test the logging-info command and its options (Implemented)

Job Status Testing

Test the job-status command and its options (Implemented)

Prologue and Epilogue jobs

In the JDL you can specify two attributes, Prologue and Epilogue, which are scripts executed respectively before and after the user's job. (Implemented)

Performance tests

Collection of 1000 nodes

Submit a collection of 1000 nodes.
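
Writing a 1000-node JDL by hand is impractical, so it can be generated. The Python sketch below builds a collection JDL in the same shape as the collection example earlier on this page. The node content (running /bin/hostname) and the function name collection_jdl are assumptions chosen for illustration.

```python
def collection_jdl(node_count, executable="/bin/hostname"):
    """Build the text of a collection JDL with `node_count` identical nodes."""
    node = (
        "  [\n"
        f'    Executable = "{executable}";\n'
        '    StdOutput = "file.out";\n'
        '    OutputSandbox = { "file.out" };\n'
        "  ]"
    )
    nodes = ",\n".join([node] * node_count)
    return '[\nType = "Collection";\nnodes = {\n' + nodes + "\n};\n]\n"


jdl_text = collection_jdl(1000)
print(jdl_text.count("Executable"), "nodes generated")
```

The generated text can be saved to a file and submitted like any other collection JDL.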

Stress test

A possible stress test:

  • 2880 collections each of 20 jobs
  • One collection every 60 seconds
  • Four users
  • Use LCG-CEs and CREAM-CEs (with different batch systems)
  • Use automatic-delegation
  • The job is a "sleep random(672)"
  • Resubmission is enabled
  • Enable proxy renewal
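
The parameters above imply some headline figures that are useful when sizing the testbed. The Python sketch below derives them. It assumes that sleep times are uniformly distributed between 0 and 672 seconds and it counts only payload run time, ignoring matchmaking, queueing and resubmission, so the number of jobs in flight is a rough lower bound. The function name is hypothetical.

```python
def stress_test_figures(collections=2880, jobs_per_collection=20,
                        submit_interval_s=60, max_sleep_s=672):
    """Derive headline numbers from the stress-test parameters."""
    total_jobs = collections * jobs_per_collection
    submission_hours = collections * submit_interval_s / 3600
    # Assumption: sleep times are uniform on [0, max_sleep_s].
    mean_sleep_s = max_sleep_s / 2
    # Little's law: jobs in flight = arrival rate * mean run time.
    mean_running_jobs = jobs_per_collection * mean_sleep_s / submit_interval_s
    return {
        "total_jobs": total_jobs,
        "submission_hours": submission_hours,
        "mean_sleep_s": mean_sleep_s,
        "mean_running_jobs": mean_running_jobs,
    }


print(stress_test_figures())
```

With the listed parameters this gives 57600 jobs submitted over 48 hours, with about 112 payloads running at any time under the stated assumptions.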

Regression tests

bug #33342: separate retry policies for ISB and OSB

Of course we're speaking of submission to the lcg-CE, as CREAM uses its own jobwrapper.

ISB: https://devel11.cnaf.infn.it:9000/a...

Submitted a job and then removed its ISB:

[mcecchi@cert-19 ~]$ glite-wms-job-submit -a -c devel11.conf lsisb.jdl
Connecting to the service https://devel11.cnaf.infn.it:7443/g...

====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://devel11.cnaf.infn.it:9000/-...

==========================================================================

server side:
[root@devel11 input]# rm -f a
[root@devel11 input]# pwd
/var/SandboxDir/-h/https_3a_2f_2fdevel11.cnaf.infn.it_3a9000_2f-h4MRDzYkufRu71MKfF1pw/input
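
The sandbox directory name above appears to be the job identifier with ":" written as "_3a" and "/" as "_2f", placed under a subdirectory named after the first two characters of the unique part. The Python sketch below reproduces that mapping. It is inferred from this single path only, so other characters may be encoded in ways not covered here, and the function names are hypothetical.

```python
def sandbox_dir_name(job_id):
    """Encode a job identifier the way the directory name above appears to be built."""
    return job_id.replace(":", "_3a").replace("/", "_2f")


def sandbox_path(job_id, root="/var/SandboxDir"):
    """Return the sandbox directory of a job; the layout is inferred, not documented here."""
    unique_part = job_id.rsplit("/", 1)[1]
    return f"{root}/{unique_part[:2]}/{sandbox_dir_name(job_id)}"


# Job identifier recovered by decoding the directory name shown above.
job_id = "https://devel11.cnaf.infn.it:9000/-h4MRDzYkufRu71MKfF1pw"
print(sandbox_path(job_id) + "/input")
```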

After a while, Maradona reports:
[root@devel11 https_3a_2f_2fdevel11.cnaf.infn.it_3a9000_2f-h4MRDzYkufRu71MKfF1pw]# cat Maradona.output
LM_log_done_begin
Wed Apr 20 22:08:51 CEST 2011: lcg-jobwrapper-hook.sh not readable or not present
Wed Apr 20 22:08:52 CEST 2011: Error during transfer
Wed Apr 20 22:09:53 CEST 2011: Error during transfer
Wed Apr 20 22:11:54 CEST 2011: Error during transfer

LM_log_done_end
Cannot download a from gsiftp://devel11.cnaf.infn.it:2811/var...
Killing log watchdog (pid=21047)...
jw exit status = 1

OSB:
[mcecchi@cert-19 ~]$ glite-wms-job-submit -a -c devel11.conf lsosb.jdl

Connecting to the service https://devel11.cnaf.infn.it:7443/g...

====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://devel11.cnaf.infn.it:9000/3...

==========================================================================
After more than twenty minutes Maradona hasn't returned yet and the job is still running, meaning that other defaults are in place for the OSB (the ones previously used for both ISB and OSB).

bug #36292: Not all attributes of a SA/SE coul be used in a gangmatching

Fix certified by doing a list-match with the following expression in the JDL:

Requirements = regexp(".in2p3.fr:2119.*",other.GlueCEUniqueID) && anyMatch(other.storage.CloseSEs,target.GlueSEImplementationVersion=="1.9.5-24");

which returns:

- cclcgceli02.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli04.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli06.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli09.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli02.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli04.in2p3.fr:2119/jobmanager-bqs-long
- cclcgceli06.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli09.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli02.in2p3.fr:2119/jobmanager-bqs-long
- cclcgceli06.in2p3.fr:2119/jobmanager-bqs-long
- cclcgceli04.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli09.in2p3.fr:2119/jobmanager-bqs-long

Double checking that the correct "GlueSEImplementationVersion" is picked up:

lcg-infosites --vo dteam closeSE >closeses.txt

gives the following closeSEs:

cclcgceli02.in2p3.fr:2119/jobmanager-bqs-long
ccsrm.in2p3.fr
ccsrm02.in2p3.fr

and ldapsearch -x -H ldap://lcg-bdii.cern.ch:2170 -b 'Mds-vo-name=local,o=Grid' '(GlueSEUniqueId=ccsrm.in2p3.fr)'
returns:

...
GlueSEImplementationVersion: 1.9.5-24
...
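
The logic of the Requirements expression can be mirrored in a few lines of Python, which makes the double check above explicit. This is a sketch only: it assumes the regexp is an unanchored match and it models anyMatch as "at least one close SE has the given implementation version". The ce.example.org entry is hypothetical; the other values come from the output above.

```python
import re

CE_PATTERN = r".in2p3.fr:2119.*"
WANTED_VERSION = "1.9.5-24"


def ce_matches(ce_id, close_se_versions):
    """Emulate: regexp(pattern, CE id) && anyMatch(close SEs, version == wanted)."""
    id_ok = re.search(CE_PATTERN, ce_id) is not None
    se_ok = any(version == WANTED_VERSION for version in close_se_versions)
    return id_ok and se_ok


# ccsrm.in2p3.fr publishes GlueSEImplementationVersion 1.9.5-24 (see the ldapsearch output).
print(ce_matches("cclcgceli02.in2p3.fr:2119/jobmanager-bqs-long", ["1.9.5-24"]))
# Hypothetical CE outside in2p3.fr, even if it had a close SE with the same version.
print(ce_matches("ce.example.org:2119/jobmanager-pbs-short", ["1.9.5-24"]))
```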

bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2

coll_10.jdl is a ten-node collection in which only the first node has a non-empty ISB.

[mcecchi@cert-19 ~]$ head -25 coll_10.jdl
[
Type = "collection";
InputSandbox = {"/home/mcecchi/Test.sh"};
RetryCount = 1;
Requirements = ( random(1.0) < 0.5 );
ShallowRetryCount = 2;
nodes = {
[
JobType = "Normal";
Zippedisb=true;
Executable = "Test.sh";
StdOutput = "test.out";
StdError = "test.err";
InputSandbox = {"a"};
OutputSandbox = {};
],
[
JobType = "Normal";
Executable = "Test.sh";
StdOutput = "test.out";
StdError = "test.err";
OutputSandbox = {};
],
[
JobType = "Normal";

We register the collection:

[mcecchi@cert-19 ~]$ glite-wms-job-submit -a -c devel11.conf --register-only coll_10.jdl

Connecting to the service https://devel11.cnaf.infn.it:7443/g...

====================== glite-wms-job-submit Success ======================

The job has been successfully registered to the WMProxy
Your job identifier is:

https://devel11.cnaf.infn.it:9000/M...

==========================================================================

To complete the operation, the following file containing the InputSandbox of the job needs to be transferred:
==========================================================================================================
ISB ZIP file : /tmp/ISBfiles_aoIPOxSR3GFuEcTxqJ6_Mg_0.tar.gz
Destination : gsiftp://devel11.cnaf.infn.it:2811/var...

We do NOT transfer ISB for the first node and start the job.

[mcecchi@cert-19 ~]$ glite-wms-job-submit --start https://devel11.cnaf.infn.it:9000/M...
Connecting to the service https://devel11.cnaf.infn.it:7443/g...

====================== glite-wms-job-submit Success ======================

The job has been successfully started to the WMProxy
Your job identifier is:

https://devel11.cnaf.infn.it:9000/M...

==========================================================================

After some seconds:

[mcecchi@cert-19 ~]$ glite-wms-job-status https://devel11.cnaf.infn.it:9000/M... Aborted|wc -l
11
-- ElisabettaMolinari - 2010-02-24
Topic revision: r19 - 2011-06-24 - FabioCapannini
 