TESTs (WMS: devel18)
List Match
- without data:
- with data:
Submission/GetOutput
- Normal jobs through:
- ICE work:
- JC work:
- Dag jobs through:
- JC work:
- tested with the following JDL:
[
Type = "dag";
VirtualOrganisation = "dteam";
Max_nodes_running = 10;
InputSandbox = "test.sh";
FuzzyRank = true;
Nodes = [
nodeA = [
file= "test_dag.jdl";
];
nodeB = [
file= "test_dag.jdl";
];
nodeC = [
file= "test_dag.jdl";
];
nodeD = [
file= "test_dag.jdl";
];
nodeE = [
file= "test_dag.jdl";
];
nodeF= [
file= "test_dag.jdl";
];
];
Dependencies = {
{{nodeA, nodeB}, nodeC},{nodeD,nodeE.nodeF}
}
]
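As a reading aid for the Dependencies attribute, here is a minimal Python sketch of how a nested pair such as {{nodeA, nodeB}, nodeC} expands into individual parent/child edges (nodeC runs after both nodeA and nodeB). The function name `expand_dependencies` and the list/tuple encoding are hypothetical and exist only for this illustration; only the first dependency of the JDL above is used, since the intended nesting of the second one is not clear from the text.

```python
def expand_dependencies(pairs):
    """Expand JDL-style (parents, children) pairs into single edges.

    Each pair is (parents, children); either side may be one node name
    or a list of node names. This encoding is an assumption made for
    the example, not a WMS API.
    """
    def as_list(value):
        # A bare node name is treated as a one-element list.
        return list(value) if isinstance(value, (list, tuple)) else [value]

    edges = []
    for parents, children in pairs:
        for parent in as_list(parents):
            for child in as_list(children):
                edges.append((parent, child))
    return edges


# {{nodeA, nodeB}, nodeC}: nodeC depends on both nodeA and nodeB.
deps = [(["nodeA", "nodeB"], "nodeC")]
edges = expand_dependencies(deps)
for parent, child in edges:
    print(parent, "->", child)
```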
- Collection jobs through:
- ICE work:
- JC work:
- Parametric jobs through:
- ICE work:
- JC work:
- Bulk jobs sent through both ICE and JC, with RetryCount = 0:
- Submit a bulk of 3 jobs -> success 100%
- Submit a bulk of 50 jobs -> success 99.99%
- Submit a bulk of 100 jobs -> success 99.99%
- Submit a bulk of 500 jobs -> success 99.99%
- Submit a bulk of 1000 jobs -> success 99.99%
- Perusal jobs through:
- JC work:
- ICE work:
- MPICH jobs:
Cancel
- Normal jobs
- ICE:
- JC:
- Dag:
- Collection:
- Node of a collection:
Others
- BrokerInfo
- ICE creation
- JC creation:
- Verify all the glite-brokerinfo functions with the generated file
- Resubmission
- Shallow:
- Deep:
- Job Recovery
- Tested with a few collections, restarting the WM while some node jobs were still in 'submitted' or 'waiting' status
- Prologue and Epilogue jobs
- ICE:
- JC:
Check bugs:
- BUG #53106: Inefficient ICE's database access (HOPEFULLY FIXED)
- BUG #53223: Proxy renewal of ICE should be enhanced (FIXED)
- BUG #53502: Using sqlite database transaction instead of "old" ICE's mutex. (HOPEFULLY FIXED)
- BUG #53714: WMS PURGER SHOULD NOT directly FORCE PURGE OF jobs when its DN is not authorized on LB server (FIXED)
- BUG #55237: WMS job wrapper first customization point should be moved (HOPEFULLY FIXED)
- Check the job wrapper created
- BUG #55290: ICE's delegation renewal needs several enhancements. (HOPEFULLY FIXED)
- BUG #55329: BAD delegation ID generation in ICE (HOPEFULLY FIXED)
- BUG #55606: glite-wms-job-listmatch is sometimes slow (HOPEFULLY FIXED)
- The cron job glite-wms-wmproxy-purge-proxycache.cron removes proxy files from '/var/glite/proxycache/' every six hours
- BUG #55709: problems with glite-wms-wm restart in WMS 3.2 (FIXED)
Check old bugs:
- Bug #27215: WM to set the maximum output sandbox size
- Bug #47447: Cream doesn't handle the jdl parameter MaxOutputSandboxSize.
- Bug #27797: Mixed int and string in Parameters attribute generates wrong jdl
- Bug #28235: Previously used CEs are not considered at all in the resubmission
- Bug #28642: User environment breaks WMS wrapper
- Bug #30308: created .mpi file in MPICH job wrapper causes jobs to fail (Open bug #56762)
- Bug #30896: WMS must limit number of files per sandbox
- Bug #31012: WMS Client does not print properly WMProxy server version
- Bug #31278: WMS should prevent non-SDJ jobs from being scheduled on SDJ CEs (Open bug #56734)
- Bug #31669: org.glite.jdl.api-cpp: defaultNode[Shallow]RetryCount attributes unexpected behavior
- Bug #32078: Problem with GangMatching statement involving GlueSEStatus
- Bug #32980: Maradona file should be removed at resubmission
- Bug #33103: Request for adding a feature to select only specific VO resources via an additional LDAP filter
- Bug #34420: WMS Client: glite-wms-job-submit option --valid does not accept any time value
- Bug #34508: Any collection submitted while the WMS is down is not recovered upon WM startup
- Bug #34510: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well (there is a problem when the WM is not able to retrieve the child status: "unable to retrieve children information from jobstatus")
- Bug #35250: DAG: glite_wms_wmproxy_dirmanager does not extract links from tar.gz
- Bug #36536: The glite wms purge storage library should rely on LBProxy while logging CLEAR events.
- Bug #38366: Recovery doesn't work with a list-match request
- Bug #48533: Recovery ignores the requests
TESTs only on ICE
4) Test starts on Wed Sep 16 at 14:08:40 CEST 2009 (WMS: devel20)
Description:
- 7200 collections each of 40 jobs
- One collection every 60 seconds
- Four users
- max_ice_threads = 10
- Use all the CEs of testbedB
- Use automatic-delegation
- Use 2 proxy renewal servers ("myproxy.cern.ch" and "myproxy.cnaf.infn.it")
- The job is a "sleep 4242"
- Resubmission was enabled after 2 days of testing
- Lease mechanism is not used
- Interventions in the testbed:
- PBS CEs
- qmgr -c "set server node_pack = False"
- restart of pbs and maui services
- PBS WNs
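The submission pattern described above (one 40-job collection every 60 seconds, each job a "sleep 4242", automatic delegation) could be driven by a small script along these lines. This is only a sketch under stated assumptions, not the script actually used for the test: the helper names, the JDL file path and the mapping of "sleep 4242" to `/bin/sleep` with an argument are all hypothetical, and the per-user proxy handling is left out. The `-a` option of `glite-wms-job-submit` requests automatic delegation.

```python
import subprocess
import time


def collection_jdl(n_jobs=40, sleep_seconds=4242, myproxy_server=None):
    """Build the JDL text of a collection of identical sleep jobs."""
    # Assumption: the job "sleep 4242" is expressed as /bin/sleep + argument.
    node = '[ Executable = "/bin/sleep"; Arguments = "%d"; ]' % sleep_seconds
    lines = ["[", 'Type = "collection";']
    if myproxy_server:
        lines.append('MyProxyServer = "%s";' % myproxy_server)
    lines.append("Nodes = {")
    lines.append(",\n".join([node] * n_jobs))
    lines.append("};")
    lines.append("]")
    return "\n".join(lines)


def submit_command(jdl_path):
    """Command line for one submission with automatic delegation (-a)."""
    return ["glite-wms-job-submit", "-a", jdl_path]


def submit_loop(jdl_path, n_collections, interval_s=60,
                runner=subprocess.call, sleep=time.sleep):
    """Submit n_collections collections, pausing interval_s between them.

    runner and sleep are injectable so the loop can be dry-run.
    Returns the list of exit codes, one per submission.
    """
    codes = []
    for i in range(n_collections):
        codes.append(runner(submit_command(jdl_path)))
        if i < n_collections - 1:
            sleep(interval_s)
    return codes
```

For a dry run, pass `runner=print` and `sleep=lambda s: None` so that the commands are only printed and no pause is taken.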
Submissions finish on Mon Sep 21 at 14:03:44 CEST 2009
- 7059 collections submitted in 157326 seconds: 4/22.3/145 (min/avg/max)
- 141 (1.96%) submissions failed
Final results taken on Thu Sep 24 at 17:36:18 CEST 2009
- Collections correctly submitted: 7059 (282360 jobs)
- DONE OK: 277253 (98.19%)
- NOT TERMINATED: 3238 (1.15%) ****
- ABORTED+CANCELLED: 1819+49 (0.66%)
- Resubmitted: 3441 (1.22%)
- Errors found (3959)
- Transfer to CREAM failed due to exception:
- FaultCause=[Batch System lsf not supported!] (1010 times) *
- FaultSubCode=[SOAP-ENV:Client] (2 times)
- CREAM Register raised std::exception Connection to service [https://cream-23.pd.infn.it:8443/ce-cream/services/CREAM2] failed (3 times)
- Authentication error: Unable to open the file [/var/glite/SandboxDir/<xx>/<jobid>/user.proxy] : No such file or directory (423 times) **
- Cannot move ISB (1774 times)
- blah error: send command timeout (509 times)
- pbs_reason=127 (206 times) ***
- lsf_reason=306 (1 time)
- lsf_reason=11 (3 times)
- Cannot take token; /opt/edg/libexec/edg-gridftp-base-rm: error globus_ftp_client: the server responded with an error 421 Service not available, closing control connection Cannot take token (28 times)
Note:
* The blparser on cream-23 was not running. It was restarted on Fri Sep 18 at 11:22:06.
** The proxy renewal service did not renew the job's proxy in time
*** Jobs stuck in the PBS queue (error = 15020)
**** The "NOT TERMINATED" jobs are distributed as follows:
- 73 collections (i.e. 2920 jobs) failed to be submitted (by the WM) with reason "request expired"
- 6 collections (i.e. 240 jobs) failed to be submitted (by wmproxy) with reason "Register DAG subjobs failed Exit code: 1416 LB[Proxy] Error: LB server (bkserver,lbproxy) store protocol error"
- 77 jobs are running
- 1 job is scheduled
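The figures reported for test 4 can be cross-checked from the raw counts. The short Python sketch below recomputes the percentages, the error total and the "NOT TERMINATED" breakdown; all numbers are copied from the results above, while the variable names are only illustrative.

```python
JOBS_PER_COLLECTION = 40
collections_submitted = 7059
total_jobs = collections_submitted * JOBS_PER_COLLECTION

# Final job states as reported in the results.
counts = {
    "DONE OK": 277253,
    "NOT TERMINATED": 3238,
    "ABORTED+CANCELLED": 1819 + 49,
}


def pct(part, whole):
    """Percentage of whole, rounded to two decimals as in the report."""
    return round(100.0 * part / whole, 2)


percentages = {state: pct(n, total_jobs) for state, n in counts.items()}

# Occurrences of each error, in the order listed under "Errors found".
error_counts = [1010, 2, 3, 423, 1774, 509, 206, 1, 3, 28]
total_errors = sum(error_counts)

# Breakdown of NOT TERMINATED: 73 + 6 whole collections, 77 running, 1 scheduled.
not_terminated = (73 + 6) * JOBS_PER_COLLECTION + 77 + 1

print(total_jobs, percentages, total_errors, not_terminated)
```

The recomputed values agree with the report. The three final states add up to 282359, one job fewer than the 282360 submitted; the results above do not account for that job.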
3) Test starts on Fri Sep 11 at 16:28:30 CEST 2009 (WMS: devel20)
Description:
- 3600 collections each of 40 jobs
- One collection every 60 seconds
- Four users
- max_ice_threads = 10
- Use all the CEs of testbedB
- Use automatic-delegation
- Use 2 proxy renewal servers ("myproxy.cern.ch" and "myproxy.cnaf.infn.it")
- The job is a "sleep 4242"
- Resubmission is NOT enabled
- Lease mechanism is not used
Submissions finish on Mon Sep 14 at 04:22:55 CEST 2009
- 3600 collections submitted in 25710 seconds: 3/7.14/54 (min/avg/max)
Final results taken on Wed Sep 16 at 09:11:18 CEST 2009
- Collections correctly submitted: 3600 (144000 jobs)
- DONE OK: 143485 (99.64%)
- NOT TERMINATED: 0 (0%)
- ABORTED+CANCELLED: 2+515 (0.36%)
- Resubmitted: - (-%)
- Errors found (2)
- Transfer to CREAM failed due to exception: (2 times)
- CREAM Register raised std::exception Received NULL fault; the error is due to another cause: FaultString=[] - FaultCode=[SOAP-ENV:Server.generalException] - FaultSubCode=[SOAP-ENV:Server.generalException] - FaultDetail=[invoke2009-09-14T01:18:56.638Z0cannot write the authN proxy to file: nullcannot write the authN proxy to file: nullorg.glite.ce.faults.AuthenticationFaultcream-29.pd.infn.it]
- CREAM Start raised exception Received NULL fault; the error is due to another cause: FaultString=[] - FaultCode=[SOAP-ENV:Server.generalException] - FaultSubCode=[SOAP-ENV:Server.generalException] - FaultDetail=[invoke2009-09-13T19:31:01.591Z0USER_VO_LABEL not defined in msgContextUSER_VO_LABEL not defined in msgContextorg.glite.ce.faults.AuthenticationFaultcream-34.pd.infn.it]
2) Test starts on Thu Sep 10 at 16:00:00 CEST 2009 (WMS: devel20)
Description:
- 720 collections each of 40 jobs
- One collection every 60 seconds
- Four users
- max_ice_threads = 10
- Used all the CEs of testbedB
- Used automatic-delegation
- Use two proxy renewal servers ("myproxy.cern.ch" and "myproxy.cnaf.infn.it") and also submit 33% of jobs without setting MyproxyServer
- The job is a "sleep 2424"
- Resubmission is NOT enabled
- Lease mechanism is not used
- Changes in the software wrt previous test:
- CREAM
- Use old blparser for all the CEs
Submissions finish on Fri Sep 11 at 03:51:54 CEST 2009
- 718 collections submitted in 4540 seconds: 3/6.3/37 (min/avg/max)
Final results taken on Fri Sep 11 at 15:20:33 CEST 2009
- Collections correctly submitted: 718 (28720 jobs)
- DONE OK: 28319 (98.6%)
- NOT TERMINATED: 0 (0%)
- ABORTED+CANCELLED: 5+396 (1.4%)
- Resubmitted: - (-%)
- Errors found (5)
- Cannot move ISB [...] : (5 times)
Note:
- Some failures are due to bug #54949
1) Test starts on Wed Sep 9 at 16:08:00 CEST 2009 (WMS: devel20)
Description:
- 800 collections each of 30 jobs
- One collection every 60 seconds
- Four users
- max_ice_threads = 10
- Used all the CEs of testbedB
- Used automatic-delegation and 2 proxy renewal servers ("myproxy.cern.ch" and "myproxy.cnaf.infn.it")
- The job is a "sleep 4242"
- Resubmission is NOT enabled
- Lease mechanism is not used
- Changes in the software wrt previous test:
- Use a WMS updated with patches #3156 and #3183
- ICE
- Use new delegation renewal mechanism
- CREAM
- Use new blparser for pbs CEs
Submissions finish on Thu Sep 10 at 05:20:33 CEST 2009
- 799 collections submitted in 4499 seconds: 3/5.6/48 (min/avg/max)
Final results taken on Fri Sep 11 at 09:20:33 CEST 2009
- Collections correctly submitted: 799 (23970 jobs)
- DONE OK: 23752 (99.09%)
- NOT TERMINATED: 36 (0.15%)
- ABORTED+CANCELLED: 95+87 (0.76%)
- Resubmitted: - (-%)
- Errors found (95)
- Cannot move ISB [...] : (1 time)
- reason=999: (1 time)
- Proxy is expired; (93 times)
Note:
- Some failures are due to the use of the new blparser
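To compare the four ICE tests side by side, the DONE OK rate and the average time per submission can be recomputed from the figures reported in each test. The sketch below does this in Python; the tuple layout and function names are illustrative only, and the numbers are copied from the results above.

```python
# (test number, collections submitted, jobs per collection,
#  DONE OK jobs, total submission time in seconds)
tests = [
    (1, 799, 30, 23752, 4499),
    (2, 718, 40, 28319, 4540),
    (3, 3600, 40, 143485, 25710),
    (4, 7059, 40, 277253, 157326),
]


def done_ok_rate(collections, jobs_per_collection, done_ok):
    """Share of submitted jobs that ended in DONE OK, in percent."""
    return round(100.0 * done_ok / (collections * jobs_per_collection), 2)


def avg_submit_time(collections, total_seconds):
    """Average duration of one collection submission, in seconds."""
    return round(float(total_seconds) / collections, 2)


rates = {n: done_ok_rate(c, j, ok) for n, c, j, ok, _ in tests}
avg_times = {n: avg_submit_time(c, secs) for n, c, _, _, secs in tests}

for n in sorted(rates):
    print("test %d: DONE OK %.2f%%, avg submission %.2f s"
          % (n, rates[n], avg_times[n]))
```

The recomputed averages match the min/avg/max figures given for each test, up to the rounding used in the report.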
-- AlessioGianelle - 10 Sep 2009