Difference: Test_Plan_WMS (1 vs. 4)

Revision 4 - 2011-06-30 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="WebTopicList"
Test bunch #1: requests:
Line: 196 to 196
 #77004: Wrong myproxyserver string processing in ICE
 #75402: Synchronization loss between real validity of proxy and exp. time saved in ICE's database
 #74259: Previous matches information is not taken into account if direct submission is used
 \ No newline at end of file
Added:
>
>
META TOPICMOVED by="MarcoCecchi" date="1309439888" from="EgeeJra1It.Workplan_WMS_EMI" to="EgeeJra1It.Test_Plan_WMS"

Revision 3 - 2011-06-30 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="WebTopicList"
Changed:
<
<

EMI WMS Test Plan

>
>
Test bunch #1: requests:
 
Changed:
<
<

Service Description

>
>
#1.1:
 
Changed:
<
<
The Workload Management System (gLite WMS) is a software service of the gLite suite which is responsible for distributing and managing tasks across computing and storage resources available on a Grid. WMS assigns user jobs to CEs and SEs belonging to a Grid environment in a convenient fashion, so that:
>
>
#1.1.1: submit a simple "Hello, world" job with trivial ISB and OSB.
#1.1.2: submit a collection made of a bunch of such jobs, say 10.
 
Deleted:
<
<
  • jobs are always executed on resources that match the job requirements
  • grid-wide load balance is maintained, i.e. jobs are evenly and efficiently distributed across the entire Grid.

The WMS receives job execution requests from a client, finds appropriate resources, then dispatches the jobs and follows them until completion, handling failures along the way whenever possible. Besides single batch-like jobs, the compound job types handled by the WMS are Directed Acyclic Graphs (a set of jobs where the input/output/execution of one or more jobs may depend on one or more other jobs), Parametric Jobs (multiple jobs with one parametrized description), and Collections (multiple jobs with a common description). Jobs are described via a flexible, high-level Job Definition Language (JDL).

Functional Description

Service Reference Card

Unit tests

N/A

Deployment tests

Repository

The EMI-1 RC4 repository can be found under:

http://emisoft.web.cern.ch/emisoft/dist/EMI/1/RC4/sl5/x86_64

Other repositories:

  • epel.repo
  • lcg-CA.repo
  • sl.repo
  • sl-security.repo

Installation test

First of all, install the yum-protectbase rpm:

yum install yum-protectbase.noarch

Then proceed with the installation of the CA certificates by issuing:

yum install ca-policy-egi-core

Install the WMS metapackage:

yum install emi-wms 

(see log file)

Configure the WMS:

/opt/glite/yaim/bin/yaim -c -s site-info.def -n WMS  

(see log file)

Update test

N/A

Functionality tests

Features/Scenarios to be tested

YAIM-WMS Configuration Testing

  • Installation and configuration starting from a clean machine (i.e. only OS)
  • Update and configuration from a previous version

WMS Job Submission/GetOutput Testing

Submit a job to the WMS service and when finished retrieve the output. Test job submission with the following type of jobs:

Normal Job
  • Test submission of normal jobs with different options and situations Implemented

  • Test the complete cycle with the two types of CEs: LCG-CE and CREAM-CE Implemented

More JDLs can be added in the future.

Perusal job

Job perusal is the ability to view output from a job while it is running. Implemented

DAG job

Directed Acyclic Graphs (a set of jobs where the input/output/execution of one or more jobs may depend on one or more other jobs).

  • Submit a jdl like this one:
[
  type = "dag";
  DefaultNodeShallowRetryCount = 3;
  nodes = [
    nodeA = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ; 
    ];
    nodeB = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ; 
    ];
    nodeC = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ; 
    ];
    dependencies = {
      { nodeA, nodeB },
      { nodeA, nodeC }
    }
  ];
]

  • When the DAG finishes, retrieve the output files
  • Check the final status of the DAG (all nodes and parent should be "Cleared")
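
As a rough illustration of what the dependencies attribute in the DAG JDL above implies, the following Python sketch computes one valid execution order. It assumes each pair is read as { parent, child }, so that nodeA must finish before nodeB and nodeC can start. The data layout and variable names are hypothetical and are not part of the WMS.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependency pairs copied from the JDL above, read as (parent, child).
# Assumption: the first element of each pair must complete before the second starts.
dependencies = [("nodeA", "nodeB"), ("nodeA", "nodeC")]

# TopologicalSorter expects a mapping: node -> set of predecessors.
predecessors = {"nodeA": set(), "nodeB": set(), "nodeC": set()}
for parent, child in dependencies:
    predecessors[child].add(parent)

execution_order = list(TopologicalSorter(predecessors).static_order())
print(execution_order)  # nodeA comes first; nodeB and nodeC may follow in either order
```

Under this reading, nodeB and nodeC are independent of each other, so a test can only rely on nodeA completing before either of them starts.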

More JDLs can be added in the future.

Parametric Job

Multiple jobs with one parametrized description. Implemented

Collection Job

Multiple jobs with a common description. There are two ways to submit a collection: create a single JDL containing the JDLs of all the nodes, or submit all the JDLs stored in a directory (bulk submission).

  • Submit a jdl like this one:
[
nodes = {
   [
   file="jdl/arg.jdl";
   ],
   [
  executable="/bin/env";
  ShallowRetryCount = 0;
  RetryCount = 0;
  Stdoutput = "file.out" ;
  StdError =  "file.err" ;
  OutputSandbox ={ "file.out" ,"file.err"} ;
  FuzzyRank = true;
   ],
   [
   NodeName="nodeA";
   executable="/bin/ls" ;
  Stdoutput = "file.out" ;
  OutputSandbox ={ "file.out"} ;
   ]
};
Type = "Collection" ;
requirements =  other.GlueCEStateStatus == "Production" ;
rank = -other.GlueCEStateEstimatedResponseTime ;
]

  • When the collection finishes, retrieve the output files
  • Check the final status of the collection (all nodes and parent should be "Cleared")

  • To test bulk submission, use the "--collection" option of the glite-wms-job-submit command.
  • When the collection finishes, retrieve the output files
  • Check the final status of the collection (all nodes and parent should be "Cleared")

More JDLs can be added in the future.

Parallel Job

Jobs that can run on one or more CPUs in parallel.

  • Submit a jdl like this one:
[
Executable = "cpi";
CpuNumber = 2;
Stdoutput = "cpi.out" ;
StdError =  "cpi.err" ;
OutputSandbox = { "cpi.out" ,"cpi.err"} ;
InputSandbox = { "exe/cpi" };
FuzzyRank = true;
usertags = [ exe = "cpi" ];
]

  • When the job finishes, retrieve the output files
  • Check the final status of the job

WMS Job shallow and deep re-submission

There are two types of resubmission. The first, called deep, occurs when the user's job has started running on the WN and then the job itself or the WMS JobWrapper has failed. The second, called shallow, occurs when the WMS JobWrapper has failed before starting the actual user's job. Implemented

WMS Job List-match Testing

Without data

Test the job-list-match command and its options Implemented

With data

  • You need to register a file on an SE, then submit a jdl like this one (as InputData put the lfn(s) registered before):
###########################################
#      JDL with Data Requirements         #
###########################################

Executable = "calc-pi.sh";
Arguments = "1000";
StdOutput = "std.out";
StdError = "std.err";
Prologue = "prologue.sh";
InputSandbox = {"calc-pi.sh", "fileA", "fileB","prologue.sh"};
OutputSandbox = {"std.out", "std.err","out-PI.txt","out-e.txt"};
Requirements = true;

DataRequirements = {
[
DataCatalogType = "DLI";
DataCatalog = "http://lfcserver.cnaf.infn.it:8085";
InputData = {"lfn:/grid/infngrid/cesini/PI_1M.txt","lfn:/grid/infngrid/cesini/e-2M.txt"};
]
};
DataAccessProtocol = "gsiftp";

  • Then try a list-match: the listed CEs should be the ones "close" to the SE used

WMS Job Cancel Testing

Test the cancellation of these types of jobs (final status should be "Cleared"):

Normal job

Submit and cancel a normal job Implemented

DAG job

Submit a dag job and then cancel it (the parent)

Collection

Submit a collection job and then cancel it (the parent)

Node of a collection

Submit a collection job and then cancel some of its nodes

Others

Delegation Testing

Test the delegation command and its options Implemented

Job-info Testing

Test the job-info command and its options Implemented

Logging-info Testing

Test the logging-info command and its options Implemented

Job Status Testing

Test the job-status command and its options Implemented

Prologue and Epilogue jobs

In the JDL you can specify two attributes, prologue and epilogue, which are scripts that are executed respectively before and after the user's job. Implemented

Performance tests

Collection of 1000 nodes

Submit a collection of 1000 nodes.

Stress test

This could be an example of a stress test:

  • 2880 collections each of 20 jobs
  • One collection every 60 seconds
  • Four users
  • Use LCG-CEs and CREAM-CEs (with different batch systems)
  • Use automatic-delegation
  • The job is a "sleep random(672)"
  • Resubmission is enabled
  • Enable proxy renewal
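
To get a feeling for the scale of the stress test above, the following Python sketch derives the total number of jobs and the length of the submission window from the listed figures. The estimate of concurrently running jobs is only a back-of-the-envelope value: it assumes that "sleep random(672)" sleeps for a time uniformly distributed between 0 and 672 seconds, and it ignores queueing, matchmaking time and resubmissions. All variable names are illustrative.

```python
# Figures taken from the stress-test list above.
collections = 2880
jobs_per_collection = 20
submit_interval_s = 60   # one collection every 60 seconds
max_sleep_s = 672        # upper bound of "sleep random(672)"

total_jobs = collections * jobs_per_collection
submission_window_h = collections * submit_interval_s / 3600

# Assumption: uniform sleep time, so the mean job duration is half the upper bound.
mean_sleep_s = max_sleep_s / 2
# Average jobs in flight = arrival rate * mean duration (Little's law),
# ignoring queueing, matchmaking and resubmission.
avg_running = jobs_per_collection * mean_sleep_s / submit_interval_s

print(total_jobs)           # 57600
print(submission_window_h)  # 48.0
print(avg_running)          # 112.0
```

In other words, the plan amounts to 57600 jobs submitted over two days, which is why proxy renewal is enabled for the run.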

Regression tests

bug #33342: separate retry policies for ISB and OSB

Of course we're speaking of submission to the lcg-CE, as CREAM uses its own jobwrapper.

ISB: https://devel11.cnaf.infn.it:9000/a...

submitted a job and then removed its ISB

 \ No newline at end of file
Added:
>
>
check all options, especially json compliance
test submission to CREAM
test submission to ICE

glite-wms-job-cancel
#1.2.1: glite-wms-job-cancel for single job
#1.2.2: glite-wms-job-cancel for collection job parent and nodes. Pending nodes should be in various mixed states (submitted, ready, scheduled, running)
check all options, especially json compliance

glite-wms-job-info
glite-wms-job-logging-info

check that ReallyRunning event is present in LogMonitor

glite-wms-job-perusal
glite-wms-job-submit
glite-wms-job-delegate-proxy
glite-wms-job-list-match
check all options, especially json compliance

glite-wms-job-output
glite-wms-job-status
check all options, especially json compliance
glite-wms-proxy-sign

Test bunch #2: job types:

-collection
use both: node = [a=b;c=d;...] and node = [file = "..."]
-dag
DAG1: dag with no dependencies
DAG2: dag with terminal failing node
DAG3: dag with non-terminal failing node
max_running_nodes, NodesCollocation
-parametric
-MPI

Test bunch #3: security features:

delegation: automatic, explicit
voms proxy: old-style, --rfc, no attributes, one attribute, more attributes
authorization: check how gacl file is generated and match by dn and fqan.
proxy renewal: myproxyserver="myproxy.cnaf.infn.it" check with both ICE and jc
sandbox mapping with gridftp and lcas, lcmaps.
check that no mix up is done when using delegations from the same user with different certificates/attributes

Test bunch #4: server features:

check that init scripts work with all their options (start, stop, graceful, etc.)
check limiter kicking in
check wm recovery
check stale gridftp processes (#53700)
stop the wm with pending jobs and check that at restart they are properly restored from their last state
create a dump of the ism, adding a request on file like [command="ism_dump";]
check that proxycache cron purges expired delegated proxies.
check that lb proxy does NOT purge done jobs before one week
check that wms purger works properly
check ExpiryPeriod with jobs that do not specify it in their jdl
check MatchRetryPeriod
check MaxOutputSandboxSize
check ice configuration
check wmproxy in static and dynamic mode (now with fcgid)
check WMS+LB co-hosting, checklbproxy=false/true

job-list-match: requirements=true, requirements=false
job submit
1) ISB
2) ISB with one empty file
3) zipped ISB
4) OSB
5) ISB+OSB

after job is complete:
- check that proxy is deregistered
- sandbox is purged
- check that lb proxy did NOT purge the done job before one week

Test bunch #5: JDL features:

-nodescollocation mm with data
-inputdata lfn:, guid:, lds and query
-DataRequirements
-DataCatalog, DataCatalogType, DataAccessProtocol, OutputSE
-Gang-matching
-Resubmission:
deep: RetryCount
shallow: ShallowRetryCount
- requirements, rank, fuzzyrank
- ExpiryTime (=1)
- resubmission
- zippedisb
- replanning
- Environment (Environment={"CIAO=1008"}, executable = echo $CIAO)

Test bunch #6: Jobwrapper:

sandbox splitting sandbox tracking

Test bunch #7: Configuration:

metapackage, dependencies
glue and glue2.0 service information

regression tests

submit a job with expired ACs but valid certificate and check that fqan authZ doesn't let the job pass
URLs in ISB are sometimes converted to lowercase by WMproxy
Detect job status changes in ICE->CREAM with secondary FQANs
Env variables, ~ character are not correctly expanded in the WMS UI
#77284 WMProxy deleagation ID should be associated to DN + FQAN
#77282 WMS information system usage is case-sensitive
#74832 Files specified with absolute paths shouldn't be used with inputsandboxbaseuri
#74737 glite-wms-wmproxy start removes fastcgi socket !
#74221 Perusal doesn't work with nodes of a collection
#73329 ICE and CREAM can loose eventid synchronization
#73286 glite-wms-job-status and glite-wms-job-logging info wrongly see a corrupted help file
#72870 The WMS-UI does not check type mismatch for SMPGranularity, WholeNodes and HostNumber
#71438 proxycache purger fails on symlinked proxy cache
#70331 glite-wms-create-proxy "ambiguous redirect"
#70401 stale Condor-G jobs slowly piling up in held state
#70061 WMS hates collections with 192 nodes!
#68307 WMS-UI: syntax error for correct JDL when substr is used
#68786 "InputSandboxBaseURI" JDL attribute work only in some circumstances...
#66721 Ineffective and never removed Job cancels
#64698 jobwrapper max osb limit should be considered only if the gridftp server is the wms
#64462 high CPU usage of workload manager in gLite-WMS
#64567 job from gLite-WMS is still running after proxy expiration
#63824 LCMAPS excessive logging
#63113 WMS should allow any groups/roles by default and #50009 wmproxy.gacl person record allows anyone to pass !!!!
#59453 ICE polling needs to be improved
#59502 WM became unresponsive after St9bad_alloc
#59611 WMS UI: should transfer files even if size is 0 bytes
#59781 limit maximum sleep time in job wrapper
#56827 workload manager needs better ISM logging
#56734 ListMatch should consider also SDJ specification
#56673 WMProxy SL5: Gridsite is not backward compatible with old style proxy
#56395 WMProxy fails to enqueue jobs and they remain Submitted forever
#56330 httpd doesn't start after upgrade WMS
#56090 Correctly submitted JDL staying in SUBMITTED - error parsing classad
#56034 Matchmaking with JobType=Normal does not take NodeNumber into account
#55684 WMSProxy(wmproxy) gets FQAN wrong using mod_gridsite 1.7.4 Patch
#55649 WMS 3.2: wmproxy/fastcgi crashes if gcal contains invalid FQAN
#55606 glite-wms-job-listmatch is sometimes slow
#55532 WMS only accepts LSF/PBS batch system for MPI jobs when expressed as MPICH
#55452 CMS production struck by waves of "Globus error 10: data transfer to the server failed"
#54728 WMP finds FQAN inconsistency only if GROUPS are different, not ROLES
#54079 glite-wms-job-submit segmentation if requirements=();
#53733 The StorageElements section of the BrokerInfo files contains some repetitions
#53714 WMS PURGER SHOULD NOT directly FORCE PURGE OF jobs when its DN is not authorized on LB server
#53700 HUGE NUMBER OF GRIDFTP CONNECTION in CLOSE_WAIT STATUS
#53294 WMS 3.2 WMProxy logs are useless below level 6
#52371 [ yaim-wms ] /var/log on WMS ends up owned by glite
#52003 ICE crashes when the purger is called
#51296 glite-wms-job-info: does not recognize if a job is simple or DAG/Collection
#51295 glite-wms-job-output: fails creating the output directory if not present
#51294 glite-wms-job-submit: ExpiryTime wrong type
#51293 glite-wms-job-status does not load LBAddresses attribute
#51292 glite-wms-job-status does not load default configuration file
#48640 glite-wms-wmproxy to support graceful command
#48598 job submission fails when first WMS tried is draining !!
#48172 wmproxy stop/restart can leave old processes behind
#48079 WMProxy has garbage in GlueServiceStatusInfo
#48068 [wms] GlueServiceStatusInfo content is ugly
#47404 Ambiguous error message 'System load is too high:'
#47150 glite-wms-wm script: problem when moving files from too much populated directories
#33342: separate retry policies for ISB and OSB
#36292: Not all attributes of a SA/SE coul be used in a gangmatching
When a collection is aborted the "Abort" event should be logged for the sub-nodes as well
#40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
#44599: WMS should consider MaxTotalJobs
#45883: Optimization of resubmission
#48636: job wrapper should log events for truncated files
#49844: WMProxy does not catch signal 25
#52617: [ yaim-wms ] host{cert,key}.pem in /home/glite
#56933: WMProxy Server: gSoap needs to be built with WITH_IPV6 flag
#58878: Request for a feature allowing propagation of generic parameters from JDL to LRMs
#61557: user job is not killed when proxy expires
#70824: environment values in JDL cannot have spaces
#78030: Alternative GLITE_WMS_LOG_DESTINATION in the jobwrapper
#77004: Wrong myproxyserver string processing in ICE
#75402: Synchronization loss between real validity of proxy and exp. time saved in ICE's database
#74259: Previous matches information is not taken into account if direct submission is used

 \ No newline at end of file

Revision 2 - 2011-06-29 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="WebTopicList"

EMI WMS Test Plan

Line: 298 to 298
 https://devel11.cnaf.infn.it:9000/a...

submitted a job and then removed its ISB

Deleted:
<
<
[mcecchi@cert-19 ~]$ glite-wms-job-submit -a -c devel11.conf lsisb.jdl
Connecting to the service https://devel11.cnaf.infn.it:7443/g...

====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://devel11.cnaf.infn.it:9000/-...

==========================================================================

server side:
[root@devel11 input]# rm -f a
[root@devel11 input]# pwd
/var/SandboxDir/-h/https_3a_2f_2fdevel11.cnaf.infn.it_3a9000_2f-h4MRDzYkufRu71MKfF1pw/input

after a while, maradona reports:
[root@devel11 https_3a_2f_2fdevel11.cnaf.infn.it_3a9000_2f-h4MRDzYkufRu71MKfF1pw]# cat Maradona.output
LM_log_done_begin
Wed Apr 20 22:08:51 CEST 2011: lcg-jobwrapper-hook.sh not readable or not present
Wed Apr 20 22:08:52 CEST 2011: Error during transfer
Wed Apr 20 22:09:53 CEST 2011: Error during transfer
Wed Apr 20 22:11:54 CEST 2011: Error during transfer

LM_log_done_end
Cannot download a from gsiftp://devel11.cnaf.infn.it:2811/var...
Killing log watchdog (pid=21047)...
jw exit status = 1
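
The spacing of the three "Error during transfer" lines above can be checked with a short Python sketch. It only computes the gaps between the logged timestamps. The gaps come out at roughly 60 and 120 seconds, which would be consistent with a retry delay that doubles after each failed attempt, but the log does not state the retry policy, so this reading is an inference.

```python
from datetime import datetime

# Timestamps of the three "Error during transfer" lines in the Maradona output above,
# rewritten in ISO-like form (the CEST label is dropped; only differences matter).
attempts = [
    "2011-04-20 22:08:52",
    "2011-04-20 22:09:53",
    "2011-04-20 22:11:54",
]
times = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in attempts]

# Seconds elapsed between consecutive failed transfer attempts.
gaps = [int((later - earlier).total_seconds()) for earlier, later in zip(times, times[1:])]
print(gaps)  # [61, 121]
```

After the third failure the jobwrapper gives up and exits with status 1, about three minutes after the first attempt.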

OSB:
[mcecchi@cert-19 ~]$ glite-wms-job-submit -a -c devel11.conf lsosb.jdl

Connecting to the service https://devel11.cnaf.infn.it:7443/g...

====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://devel11.cnaf.infn.it:9000/3...

==========================================================================
after more than twenty minutes Maradona hasn't returned yet and the job is running, meaning that other defaults are in place (the ones previously used for both ISB and OSB)

bug #36292: Not all attributes of a SA/SE coul be used in a gangmatching

Fix certified by doing a list-match with the following expression in the JDL:

Requirements = regexp(".in2p3.fr:2119.*",other.GlueCEUniqueID) && anyMatch(other.storage.CloseSEs,target.GlueSEImplementationVersion=="1.9.5-24");

which returns:

- cclcgceli02.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli04.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli06.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli09.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli02.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli04.in2p3.fr:2119/jobmanager-bqs-long
- cclcgceli06.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli09.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli02.in2p3.fr:2119/jobmanager-bqs-long
- cclcgceli06.in2p3.fr:2119/jobmanager-bqs-long
- cclcgceli04.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli09.in2p3.fr:2119/jobmanager-bqs-long

Double checking that the correct "GlueSEImplementationVersion" is picked up:

lcg-infosites --vo dteam closeSE >closeses.txt

gives the following closeSEs:

cclcgceli02.in2p3.fr:2119/jobmanager-bqs-long
ccsrm.in2p3.fr
ccsrm02.in2p3.fr

and ldapsearch -x -H ldap://lcg-bdii.cern.ch:2170 -b 'Mds-vo-name=local,o=Grid' '(GlueSEUniqueId=ccsrm.in2p3.fr)'
returns:

...
GlueSEImplementationVersion: 1.9.5-24
...

bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2

coll_10.jdl is a ten-node collection; only the first node has a non-empty ISB.

[mcecchi@cert-19 ~]$ head -25 coll_10.jdl
[
Type = "collection";
InputSandbox = {"/home/mcecchi/Test.sh"};
RetryCount = 1;
Requirements = ( random(1.0) < 0.5 );
ShallowRetryCount = 2;
nodes = {
[
JobType = "Normal";
Zippedisb=true;
Executable = "Test.sh";
StdOutput = "test.out";
StdError = "test.err";
InputSandbox = {"a"};
OutputSandbox = {};
],
[
JobType = "Normal";
Executable = "Test.sh";
StdOutput = "test.out";
StdError = "test.err";
OutputSandbox = {};
],
[
JobType = "Normal";

We register the collection:

[mcecchi@cert-19 ~]$ glite-wms-job-submit -a -c devel11.conf --register-only coll_10.jdl

Connecting to the service https://devel11.cnaf.infn.it:7443/g...

====================== glite-wms-job-submit Success ======================

The job has been successfully registered to the WMProxy
Your job identifier is:

https://devel11.cnaf.infn.it:9000/M...

==========================================================================

To complete the operation, the following file containing the InputSandbox of the job needs to be transferred:
==========================================================================================================
ISB ZIP file : /tmp/ISBfiles_aoIPOxSR3GFuEcTxqJ6_Mg_0.tar.gz
Destination : gsiftp://devel11.cnaf.infn.it:2811/var...

We do NOT transfer ISB for the first node and start the job.

[mcecchi@cert-19 ~]$ glite-wms-job-submit --start https://devel11.cnaf.infn.it:9000/M...
Connecting to the service https://devel11.cnaf.infn.it:7443/g...

====================== glite-wms-job-submit Success ======================

The job has been successfully started to the WMProxy
Your job identifier is:

https://devel11.cnaf.infn.it:9000/M...

==========================================================================

After some seconds:

[mcecchi@cert-19 ~]$ glite-wms-job-status https://devel11.cnaf.infn.it:9000/M... Aborted|wc -l
11
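
The count of 11 matches one "Aborted" entry for the parent plus one for each of the ten nodes, which is what the fix is expected to produce. The Python sketch below spells out that check. The sample lines are hypothetical and heavily simplified (the real glite-wms-job-status output has a richer layout), and count_aborted is an illustrative helper, not part of the WMS UI.

```python
def count_aborted(status_lines):
    """Count the lines that mention the Aborted state."""
    return sum(1 for line in status_lines if "Aborted" in line)

# Hypothetical, simplified status lines: one for the parent plus one per node.
nodes = 10
sample = ["Current Status: Aborted"] * (1 + nodes)

aborted = count_aborted(sample)
print(aborted)  # 11
```

A lower count would mean that some sub-nodes did not get the "Abort" event logged when the collection was aborted.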
-- ElisabettaMolinari - 2010-02-24

-- MarcoCecchi - 2011-06-27

 \ No newline at end of file

Revision 1 - 2011-06-27 - MarcoCecchi

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="WebTopicList"

EMI WMS Test Plan

Service Description

The Workload Management System (gLite WMS) is a software service of the gLite suite which is responsible for distributing and managing tasks across computing and storage resources available on a Grid. WMS assigns user jobs to CEs and SEs belonging to a Grid environment in a convenient fashion, so that:

  • jobs are always executed on resources that match the job requirements
  • grid-wide load balance is maintained, i.e. jobs are evenly and efficiently distributed across the entire Grid.

The WMS receives job execution requests from a client, finds appropriate resources, then dispatches the jobs and follows them until completion, handling failures along the way whenever possible. Besides single batch-like jobs, the compound job types handled by the WMS are Directed Acyclic Graphs (a set of jobs where the input/output/execution of one or more jobs may depend on one or more other jobs), Parametric Jobs (multiple jobs with one parametrized description), and Collections (multiple jobs with a common description). Jobs are described via a flexible, high-level Job Definition Language (JDL).

Functional Description

Service Reference Card

Unit tests

N/A

Deployment tests

Repository

The EMI-1 RC4 repository can be found under:

http://emisoft.web.cern.ch/emisoft/dist/EMI/1/RC4/sl5/x86_64

Other repositories:

  • epel.repo
  • lcg-CA.repo
  • sl.repo
  • sl-security.repo

Installation test

First of all, install the yum-protectbase rpm:

yum install yum-protectbase.noarch

Then proceed with the installation of the CA certificates by issuing:

yum install ca-policy-egi-core

Install the WMS metapackage:

yum install emi-wms 

(see log file)

Configure the WMS:

/opt/glite/yaim/bin/yaim -c -s site-info.def -n WMS  

(see log file)

Update test

N/A

Functionality tests

Features/Scenarios to be tested

YAIM-WMS Configuration Testing

  • Installation and configuration starting from a clean machine (i.e. only OS)
  • Update and configuration from a previous version

WMS Job Submission/GetOutput Testing

Submit a job to the WMS service and when finished retrieve the output. Test job submission with the following type of jobs:

Normal Job
  • Test submission of normal jobs with different options and situations Implemented

  • Test the complete cycle with the two types of CEs: LCG-CE and CREAM-CE Implemented

More JDLs can be added in the future.

Perusal job

Job perusal is the ability to view output from a job while it is running. Implemented

DAG job

Directed Acyclic Graphs (a set of jobs where the input/output/execution of one or more jobs may depend on one or more other jobs).

  • Submit a jdl like this one:
[
  type = "dag";
  DefaultNodeShallowRetryCount = 3;
  nodes = [
    nodeA = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ; 
    ];
    nodeB = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ; 
    ];
    nodeC = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ; 
    ];
    dependencies = {
      { nodeA, nodeB },
      { nodeA, nodeC }
    }
  ];
]

  • When the DAG finishes, retrieve the output files
  • Check the final status of the DAG (all nodes and parent should be "Cleared")

More JDLs can be added in the future.

Parametric Job

Multiple jobs with one parametrized description. Implemented

Collection Job

Multiple jobs with a common description. There are two ways to submit a collection: create a single JDL containing the JDLs of all the nodes, or submit all the JDLs stored in a directory (bulk submission).

  • Submit a jdl like this one:
[
nodes = {
   [
   file="jdl/arg.jdl";
   ],
   [
  executable="/bin/env";
  ShallowRetryCount = 0;
  RetryCount = 0;
  Stdoutput = "file.out" ;
  StdError =  "file.err" ;
  OutputSandbox ={ "file.out" ,"file.err"} ;
  FuzzyRank = true;
   ],
   [
   NodeName="nodeA";
   executable="/bin/ls" ;
  Stdoutput = "file.out" ;
  OutputSandbox ={ "file.out"} ;
   ]
};
Type = "Collection" ;
requirements =  other.GlueCEStateStatus == "Production" ;
rank = -other.GlueCEStateEstimatedResponseTime ;
]

  • When the collection finishes, retrieve the output files
  • Check the final status of the collection (all nodes and parent should be "Cleared")

  • To test bulk submission, use the "--collection" option of the glite-wms-job-submit command.
  • When the collection finishes, retrieve the output files
  • Check the final status of the collection (all nodes and parent should be "Cleared")

More JDLs can be added in the future.

Parallel Job

Jobs that can run on one or more CPUs in parallel.

  • Submit a jdl like this one:
[
Executable = "cpi";
CpuNumber = 2;
Stdoutput = "cpi.out" ;
StdError =  "cpi.err" ;
OutputSandbox = { "cpi.out" ,"cpi.err"} ;
InputSandbox = { "exe/cpi" };
FuzzyRank = true;
usertags = [ exe = "cpi" ];
]

  • When the job finishes, retrieve the output files
  • Check the final status of the job

WMS Job shallow and deep re-submission

There are two types of resubmission. The first, called deep, occurs when the user's job has started running on the WN and then the job itself or the WMS JobWrapper has failed. The second, called shallow, occurs when the WMS JobWrapper has failed before starting the actual user's job. Implemented

WMS Job List-match Testing

Without data

Test the job-list-match command and its options Implemented

With data

  • You need to register a file on an SE, then submit a jdl like this one (as InputData put the lfn(s) registered before):
###########################################
#      JDL with Data Requirements         #
###########################################

Executable = "calc-pi.sh";
Arguments = "1000";
StdOutput = "std.out";
StdError = "std.err";
Prologue = "prologue.sh";
InputSandbox = {"calc-pi.sh", "fileA", "fileB","prologue.sh"};
OutputSandbox = {"std.out", "std.err","out-PI.txt","out-e.txt"};
Requirements = true;

DataRequirements = {
[
DataCatalogType = "DLI";
DataCatalog = "http://lfcserver.cnaf.infn.it:8085";
InputData = {"lfn:/grid/infngrid/cesini/PI_1M.txt","lfn:/grid/infngrid/cesini/e-2M.txt"};
]
};
DataAccessProtocol = "gsiftp";

  • Then try a list-match: the listed CEs should be the ones "close" to the SE used

WMS Job Cancel Testing

Test the cancellation of these types of jobs (final status should be "Cleared"):

Normal job

Submit and cancel a normal job Implemented

DAG job

Submit a dag job and then cancel it (the parent)

Collection

Submit a collection job and then cancel it (the parent)

Node of a collection

Submit a collection job and then cancel some of its nodes

Others

Delegation Testing

Test the delegation command and its options Implemented

Job-info Testing

Test the job-info command and its options Implemented

Logging-info Testing

Test the logging-info command and its options Implemented

Job Status Testing

Test the job-status command and its options Implemented

Prologue and Epilogue jobs

In the JDL you can specify two attributes, prologue and epilogue, which are scripts that are executed respectively before and after the user's job. Implemented

Performance tests

Collection of 1000 nodes

Submit a collection of 1000 nodes.

Stress test

This could be an example of a stress test:

  • 2880 collections each of 20 jobs
  • One collection every 60 seconds
  • Four users
  • Use LCG-CEs and CREAM-CEs (with different batch systems)
  • Use automatic-delegation
  • The job is a "sleep random(672)"
  • Resubmission is enabled
  • Enable proxy renewal

Regression tests

bug #33342: separate retry policies for ISB and OSB

Of course we're speaking of submission to the lcg-CE, as CREAM uses its own jobwrapper.

ISB: https://devel11.cnaf.infn.it:9000/a...

submitted a job and then removed its ISB

[mcecchi@cert-19 ~]$ glite-wms-job-submit -a -c devel11.conf lsisb.jdl
Connecting to the service https://devel11.cnaf.infn.it:7443/g...

====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://devel11.cnaf.infn.it:9000/-...

==========================================================================

server side:
[root@devel11 input]# rm -f a
[root@devel11 input]# pwd
/var/SandboxDir/-h/https_3a_2f_2fdevel11.cnaf.infn.it_3a9000_2f-h4MRDzYkufRu71MKfF1pw/input

after a while, maradona reports:
[root@devel11 https_3a_2f_2fdevel11.cnaf.infn.it_3a9000_2f-h4MRDzYkufRu71MKfF1pw]# cat Maradona.output
LM_log_done_begin
Wed Apr 20 22:08:51 CEST 2011: lcg-jobwrapper-hook.sh not readable or not present
Wed Apr 20 22:08:52 CEST 2011: Error during transfer
Wed Apr 20 22:09:53 CEST 2011: Error during transfer
Wed Apr 20 22:11:54 CEST 2011: Error during transfer

LM_log_done_end
Cannot download a from gsiftp://devel11.cnaf.infn.it:2811/var...
Killing log watchdog (pid=21047)...
jw exit status = 1

OSB:
[mcecchi@cert-19 ~]$ glite-wms-job-submit -a -c devel11.conf lsosb.jdl

Connecting to the service https://devel11.cnaf.infn.it:7443/g...

====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://devel11.cnaf.infn.it:9000/3...

==========================================================================
after more than twenty minutes Maradona hasn't returned yet and the job is running, meaning that other defaults are in place (the ones previously used for both ISB and OSB)

bug #36292: Not all attributes of a SA/SE coul be used in a gangmatching

Fix certified by doing a list-match with the following expression in the JDL:

Requirements = regexp(".in2p3.fr:2119.*",other.GlueCEUniqueID) && anyMatch(other.storage.CloseSEs,target.GlueSEImplementationVersion=="1.9.5-24");

which returns:

- cclcgceli02.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli04.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli06.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli09.in2p3.fr:2119/jobmanager-bqs-short
- cclcgceli02.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli04.in2p3.fr:2119/jobmanager-bqs-long
- cclcgceli06.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli09.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli02.in2p3.fr:2119/jobmanager-bqs-long
- cclcgceli06.in2p3.fr:2119/jobmanager-bqs-long
- cclcgceli04.in2p3.fr:2119/jobmanager-bqs-medium
- cclcgceli09.in2p3.fr:2119/jobmanager-bqs-long

Double checking that the correct "GlueSEImplementationVersion" is picked up:

lcg-infosites --vo dteam closeSE >closeses.txt

gives the following closeSEs:

cclcgceli02.in2p3.fr:2119/jobmanager-bqs-long
ccsrm.in2p3.fr
ccsrm02.in2p3.fr

and ldapsearch -x -H ldap://lcg-bdii.cern.ch:2170 -b 'Mds-vo-name=local,o=Grid' '(GlueSEUniqueId=ccsrm.in2p3.fr)'
returns:

...
GlueSEImplementationVersion: 1.9.5-24
...

bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2

coll_10.jdl is a ten-node collection; only the first node has a non-empty ISB.

[mcecchi@cert-19 ~]$ head -25 coll_10.jdl
[
Type = "collection";
InputSandbox = {"/home/mcecchi/Test.sh"};
RetryCount = 1;
Requirements = ( random(1.0) < 0.5 );
ShallowRetryCount = 2;
nodes = {
[
JobType = "Normal";
Zippedisb=true;
Executable = "Test.sh";
StdOutput = "test.out";
StdError = "test.err";
InputSandbox = {"a"};
OutputSandbox = {};
],
[
JobType = "Normal";
Executable = "Test.sh";
StdOutput = "test.out";
StdError = "test.err";
OutputSandbox = {};
],
[
JobType = "Normal";

We register the collection:

[mcecchi@cert-19 ~]$ glite-wms-job-submit -a -c devel11.conf --register-only coll_10.jdl

Connecting to the service https://devel11.cnaf.infn.it:7443/g...

====================== glite-wms-job-submit Success ======================

The job has been successfully registered to the WMProxy
Your job identifier is:

https://devel11.cnaf.infn.it:9000/M...

==========================================================================

To complete the operation, the following file containing the InputSandbox of the job needs to be transferred:
==========================================================================================================
ISB ZIP file : /tmp/ISBfiles_aoIPOxSR3GFuEcTxqJ6_Mg_0.tar.gz
Destination : gsiftp://devel11.cnaf.infn.it:2811/var...

We do NOT transfer ISB for the first node and start the job.

[mcecchi@cert-19 ~]$ glite-wms-job-submit --start https://devel11.cnaf.infn.it:9000/M...
Connecting to the service https://devel11.cnaf.infn.it:7443/g...

====================== glite-wms-job-submit Success ======================

The job has been successfully started to the WMProxy
Your job identifier is:

https://devel11.cnaf.infn.it:9000/M...

==========================================================================

After some seconds:

[mcecchi@cert-19 ~]$ glite-wms-job-status https://devel11.cnaf.infn.it:9000/M... Aborted|wc -l
11
-- ElisabettaMolinari - 2010-02-24

-- MarcoCecchi - 2011-06-27
