TESTs (WMS: devel18)

List Match

  • without data: Yes / Done
  • with data: Yes / Done

Submission/GetOutput

  • Normal jobs through
    • ICE work: Yes / Done
    • JC work: Yes / Done

  • Dag jobs through:
    • JC work: Yes / Done
      • tested with the following
        [
          Type = "dag";
          VirtualOrganisation = "dteam";
          Max_nodes_running = 10;
          InputSandbox = "test.sh";
          FuzzyRank = true;
          Nodes = [
            nodeA = [
              file = "test_dag.jdl";
            ];
            nodeB = [
              file = "test_dag.jdl";
            ];
            nodeC = [
              file = "test_dag.jdl";
            ];
            nodeD = [
              file = "test_dag.jdl";
            ];
            nodeE = [
              file = "test_dag.jdl";
            ];
            nodeF = [
              file = "test_dag.jdl";
            ];
          ];
          Dependencies = {
            {{nodeA, nodeB}, nodeC},{nodeD,nodeE.nodeF}
          }
        ]

  • Collection jobs through:
    • ICE work: Yes / Done
    • JC work: Yes / Done

  • Parametric jobs through:
    • ICE work: Yes / Done
    • JC work: Yes / Done
      • tested with the following
         [
          JobType = "parametric";
          Executable = "/usr/bin/env";
          Environment = {"MYPATH_PARAM_=$PATH:/bin:/usr/bin:$HOME"};
          StdOutput = "echo_PARAM_.out";
          StdError = "echo_PARAM_.err";
          OutputSandbox = {"echo_PARAM_.out","echo_PARAM_.err"};
          Parameters =  5;
          usertags = [ jdl = "parametric" ];
         ]

  • Bulk jobs sent through both ICE and JC, with RetryCount = 0:
    • Submit a bulk of 3 jobs -> success 100% Yes / Done
    • Submit a bulk of 50 jobs -> success 99.99% Yes / Done
    • Submit a bulk of 100 jobs -> success 99.99% Yes / Done
    • Submit a bulk of 500 jobs -> success 99.99% Yes / Done
    • Submit a bulk of 1000 jobs -> success 99.99% Yes / Done

  • Perusal jobs through:
    • JC work: Yes / Done
    • ICE work: Yes / Done

  • MPICH jobs: Yes / Done
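
The parametric JDL tested above relies on the `_PARAM_` placeholder being replaced in each generated job. The following is a minimal sketch of that expansion, not the WMS implementation. It assumes the usual JDL convention that an integer `Parameters = 5` with the default start (0) and step (1) yields the values 0 to 4; the function name `expand_parametric` and the dictionary layout are hypothetical.

```python
def expand_parametric(template, parameters, start=0, step=1):
    """Return one attribute dict per generated job, with _PARAM_ substituted.

    Assumption: an integer `parameters` is an exclusive upper bound,
    so parameters=5 gives the values 0, 1, 2, 3, 4.
    """
    def substitute(value, param):
        # Replace the placeholder in strings and, recursively, in lists.
        if isinstance(value, str):
            return value.replace("_PARAM_", str(param))
        if isinstance(value, list):
            return [substitute(item, param) for item in value]
        return value

    return [
        {key: substitute(value, param) for key, value in template.items()}
        for param in range(start, parameters, step)
    ]


# Attributes copied from the parametric JDL used in the test.
template = {
    "Executable": "/usr/bin/env",
    "Environment": ["MYPATH_PARAM_=$PATH:/bin:/usr/bin:$HOME"],
    "StdOutput": "echo_PARAM_.out",
    "StdError": "echo_PARAM_.err",
    "OutputSandbox": ["echo_PARAM_.out", "echo_PARAM_.err"],
}

jobs = expand_parametric(template, parameters=5)
for job in jobs:
    print(job["StdOutput"], job["Environment"][0])
```

Under these assumptions the first generated job writes to `echo0.out` and the last one to `echo4.out`.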

Cancel

  • Normal jobs
    • ICE: Yes / Done
    • JC: Yes / Done
  • Dag: Yes / Done
  • Collection: Yes / Done
  • Node of a collection: Yes / Done

Others

  • BrokerInfo
    • ICE creation Yes / Done
    • JC creation: Yes / Done
    • Verify all the glite-brokerinfo functions with the generated file Yes / Done

  • Resubmission
    • Shallow: Yes / Done
    • Deep: Yes / Done

  • Job Recovery
    • Tested with a few collections, restarting the WM while some node jobs were still in 'submitted' or 'waiting' status Yes / Done

  • Prologue and Epilogue jobs
    • ICE: Yes / Done
    • JC: Yes / Done



Check bugs:

  • BUG #53106: Inefficient ICE's database access HOPEFULLY FIXED
    • Changes inside the code.

  • BUG #53223: Proxy renewal of ICE should be enhanced FIXED
    • Submit a job to ICE (e.g. requirements = regexp("8443/cream", other.GlueCEUniqueID);) with a short proxy (at least 1 hour)
    • Look at the ICE log and wait for the next "Delegation(s) check" (i.e. DEBUG - iceCommandDelegationRenewal::renewAllDelegations() - There are [3] Delegation(s) to check...). If it is time to renew the delegation, the log file should contain a line like this one:
      2009-10-06 10:38:20,141 WARN - iceCommandProxyRenewal::renewAllDelegations() - The better proxy [/var/glite/ice/persist_dir/6FD2EF1252D6668B438CACCD47D7A98277751464.betterproxy] is expiring NOT AFTER the current delegation [12548182532E416358devel182Ecnaf2Einfn2Eit]. Skipping ...

  • BUG #53502: Using sqlite database transaction instead of "old" ICE's mutex. HOPEFULLY FIXED
    • Changes inside the code.

  • BUG #53714: WMS PURGER SHOULD NOT directly FORCE PURGE OF jobs when its DN is not authorized on LB server FIXED
    • Remove the DN of the WMS host machine from the /opt/glite/etc/LB_SUPERUSER file on the LB server machine
    • restart LB server on the LB host machine (by /opt/glite/etc/init.d/glite-lb-bkserverd restart)
    • submit a job and wait for it to be in status DONE
    • run the wms-purge cron by hand without specifying the -t (# of secs) option
    • check that the sandbox of the previously submitted job is not removed by the cron job; the WMS purge log should contain a message like the following:
      05 Oct, 13:16:11 -E: [Error] query_job_status(purger.cpp:127): https://devel15.cnaf.infn.it:9000/7QnNitYk3bT8BNHw-grosQ: edg_wll_JobStat [1] Operation not permitted(matching jobs found but authorization failed)

  • BUG #55237: WMS job wrapper first customization point should be moved HOPEFULLY FIXED
    • Check the job wrapper created

  • BUG #55290: ICE's delegation renewal needs several enhancements. HOPEFULLY FIXED
    • See tests below.

  • BUG #55329: BAD delegation ID generation in ICE HOPEFULLY FIXED
    • Changes inside the code.

  • BUG #55606: glite-wms-job-listmatch is sometimes slow HOPEFULLY FIXED
    • The cron job glite-wms-wmproxy-purge-proxycache.cron removes proxy files from '/var/glite/proxycache/' every six hours

  • BUG #55709: problems with glite-wms-wm restart in WMS 3.2 FIXED
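
The two log checks described for BUG #53223 and BUG #53714 can be automated with a small scanner. This is a sketch only: the regular expressions are derived from the two sample log lines quoted above, and the names `DELEGATION_SKIP`, `PURGE_DENIED` and `scan` are hypothetical. Real log files may use slightly different wording.

```python
import re

# Pattern for the ICE message logged when a delegation renewal is skipped.
DELEGATION_SKIP = re.compile(
    r"The better proxy \[(?P<proxy>[^\]]+)\] is expiring NOT AFTER "
    r"the current delegation \[(?P<delegation>[^\]]+)\]"
)

# Pattern for the purger message logged when the LB server refuses the query.
PURGE_DENIED = re.compile(
    r"query_job_status\([^)]*\): (?P<jobid>https://\S+?): "
    r"edg_wll_JobStat \[\d+\] Operation not permitted"
)


def scan(lines):
    """Return (skipped delegation IDs, job IDs whose purge was not authorized)."""
    skipped, denied = [], []
    for line in lines:
        match = DELEGATION_SKIP.search(line)
        if match:
            skipped.append(match.group("delegation"))
        match = PURGE_DENIED.search(line)
        if match:
            denied.append(match.group("jobid"))
    return skipped, denied


# The two sample lines quoted in this report.
sample = [
    "2009-10-06 10:38:20,141 WARN - iceCommandProxyRenewal::renewAllDelegations() - "
    "The better proxy [/var/glite/ice/persist_dir/"
    "6FD2EF1252D6668B438CACCD47D7A98277751464.betterproxy] is expiring NOT AFTER "
    "the current delegation [12548182532E416358devel182Ecnaf2Einfn2Eit]. Skipping ...",
    "05 Oct, 13:16:11 -E: [Error] query_job_status(purger.cpp:127): "
    "https://devel15.cnaf.infn.it:9000/7QnNitYk3bT8BNHw-grosQ: edg_wll_JobStat [1] "
    "Operation not permitted(matching jobs found but authorization failed)",
]

skipped, denied = scan(sample)
print(skipped)
print(denied)
```

In practice the `lines` argument would be an open log file; the paths of the ICE and purger logs depend on the installation and are not assumed here.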



Check old bugs:

  • Bug #27215 WM to set the maximum output sandbox size Yes / Done
  • Bug #47447 Cream doesn't handle the jdl parameter MaxOutputSandboxSize. Yes / Done
  • Bug #27797 Mixed int and string in Parameters attribute generates wrong jdl Yes / Done
  • Bug #28235 Previously used CEs are not considered at all in the resubmission Yes / Done
  • Bug #28642 User environment breaks WMS wrapper Yes / Done
  • Bug #30308 created .mpi file in MPICH job wrapper causes jobs to fail Yes / Done (Open bug #56762)
  • Bug #30896 WMS must limit number of files per sandbox Yes / Done
  • Bug #31012 WMS Client does not print properly WMProxy server version Yes / Done
  • Bug #31278 WMS should prevent non-SDJ jobs from being scheduled on SDJ CEs Yes / Done (Open bug #56734)
  • Bug #31669 org.glite.jdl.api-cpp: defaultNode[Shallow]RetryCount attributes unexpected behavior Yes / Done
  • Bug #32078 Problem with GangMatching statement involving GlueSEStatus Yes / Done
  • Bug #32980 Maradona file should be removed at resubmission Yes / Done
  • Bug #33103 Request for adding an feature to select only specific VO resources via an additional LDAP filter Yes / Done
  • Bug #34420 WMS Client: glite-wms-job-submit option --valid does not accept any time value Yes / Done
  • Bug #34508 Any collection submitted while the WMS is down is not recovered upon WM startup Yes / Done
  • Bug #34510 When a collection is aborted the "Abort" event should be logged for the sub-nodes as well Warning: there are problems when the WM is not able to retrieve the child status (unable to retrieve children information from jobstatus)
  • Bug #35250 DAG: glite_wms_wmproxy_dirmanager does not extract links from tar.gz Yes / Done
  • Bug #36536 The glite wms purge storage library should rely on LBProxy while logging CLEAR events. Yes / Done
  • Bug #38366 Recovery doesn't work with a list-match request Yes / Done
  • Bug #48533 Recovery ignores the requests Yes / Done



TESTs only on ICE

4) Test starts on Wed Sep 16 at 14:08:40 CEST 2009 (WMS: devel20)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedB
  • Use automatic-delegation
  • Use 2 proxy renewal servers ("myproxy.cern.ch" and "myproxy.cnaf.infn.it")
  • The job is a "sleep 4242"
  • Resubmission has been enabled after 2 days of test
  • Lease mechanism is not used

  • Interventions in the testbed:
    • PBS CEs
      • qmgr -c "set server node_pack = False"
      • restart of pbs and maui services
    • PBS WNs
      • Set in /var/spool/pbs/mom_priv/config files:
          $ideal_load 3.5
          $max_load 4.2
      • Cleaning of the directories: /var/spool/pbs/mom_priv/jobs/
      • Restart of pbs_mom services
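
The nominal submission plan in the description above can be worked out with a few lines of date arithmetic. This is a minimal sketch that assumes every collection is submitted exactly on its 60-second slot and treats the CEST timestamps as naive local times; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Values taken from the test description.
start = datetime(2009, 9, 16, 14, 8, 40)   # test start, naive local time
collections = 7200
jobs_per_collection = 40
interval = timedelta(seconds=60)

# Slot of the last collection if no submission is delayed or skipped.
last_submission = start + (collections - 1) * interval

# Number of jobs the plan would produce if every submission succeeded.
total_jobs = collections * jobs_per_collection

print(last_submission)
print(total_jobs)
```

Under these assumptions the last collection would go out on Sep 21 at 14:07:40, for a planned total of 288000 jobs.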

Submissions finish on Mon Sep 21 at 14:03:44 CEST 2009

  • 7059 collections submitted in 157326 seconds: 4/22.3/145 (min/avg/max)
    • 141 submissions (1.96%) failed

Final results taken on Thu Sep 24 at 17:36:18 CEST 2009

  • Collections correctly submitted: 7059 (282360 jobs)
    • DONE OK: 277253 (98.19%)
    • NOT TERMINATED: 3238 (1.15%) ****
    • ABORTED+CANCELLED: 1819+49 (0.66%)
    • Resubmitted: 3441 (1.22%)

  • Errors found (3959)
    • Transfer to CREAM failed due to exception:
      • FaultCause=[Batch System lsf not supported!] (1010 times) *
      • FaultSubCode=[SOAP-ENV:Client] (2 times)
      • CREAM Register raised std::exception Connection to service [https://cream-23.pd.infn.it:8443/ce-cream/services/CREAM2] failed (3 times)
      • Authentication error: Unable to open the file [/var/glite/SandboxDir/<xx>/<jobid>/user.proxy] : No such file or directory (423 times) **
    • Cannot move ISB (1774 times)
    • blah error: send command timeout (509 times)
    • pbs_reason=127 (206 times) ***
    • lsf_reason=306 (1 time)
    • lsf_reason=11 (3 times)
    • Cannot take token; /opt/edg/libexec/edg-gridftp-base-rm: error globus_ftp_client: the server responded with an error 421 Service not available, closing control connection Cannot take token (28 times)

Note:

* The blparser on cream-23 was not running. It was restarted on Fri Sep 18 at 11:22:06.

** The proxy renewal service renewed the job's proxy too late

*** Jobs stuck in the PBS queue (error = 15020)

**** The "NOT TERMINATED" jobs are distributed as follows:

  • 73 Collections (i.e. 2920 jobs) failed to be submitted (by WM) with reason request expired
  • 6 Collections (i.e. 240 jobs) failed to be submitted (by wmproxy) with reason Register DAG subjobs failed Exit code: 1416 LB[Proxy] Error: LB server (bkserver,lbproxy) store protocol error
  • 77 jobs are running
  • 1 job is scheduled
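
The figures reported for this test can be cross-checked with simple arithmetic. The sketch below only recomputes numbers already given in this section; the variable names are illustrative.

```python
# Jobs actually submitted: 7059 collections of 40 jobs each.
submitted_jobs = 7059 * 40

# Final outcome counts as reported.
outcomes = {
    "DONE OK": 277253,
    "NOT TERMINATED": 3238,
    "ABORTED+CANCELLED": 1819 + 49,
    "Resubmitted": 3441,
}
percentages = {
    name: round(100 * count / submitted_jobs, 2)
    for name, count in outcomes.items()
}

# Breakdown of the NOT TERMINATED jobs: 73 and 6 whole collections,
# plus 77 running jobs and 1 scheduled job.
not_terminated = 73 * 40 + 6 * 40 + 77 + 1

# Occurrences of each error type listed under "Errors found".
error_counts = [1010, 2, 3, 423, 1774, 509, 206, 1, 3, 28]

# Jobs not covered by the three final-state classes.
unaccounted = submitted_jobs - (
    outcomes["DONE OK"] + outcomes["NOT TERMINATED"] + outcomes["ABORTED+CANCELLED"]
)

# Share of the 7200 planned submissions that failed.
failed_submission_pct = round(100 * (7200 - 7059) / 7200, 2)

print(submitted_jobs, percentages)
print(not_terminated, sum(error_counts), unaccounted, failed_submission_pct)
```

The recomputed percentages, the NOT TERMINATED breakdown and the error total all match the reported values. The three final-state classes add up to 282359, one job short of the submitted total; the report does not say where that job went.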

ice3_04.png

3) Test starts on Fri Sep 11 at 16:28:30 CEST 2009 (WMS: devel20)

Description:
  • 3600 collections each of 40 jobs
  • One collection every 60 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedB
  • Use automatic-delegation
  • Use 2 proxy renewal servers ("myproxy.cern.ch" and "myproxy.cnaf.infn.it")
  • The job is a "sleep 4242"
  • Resubmission is NOT enabled
  • Lease mechanism is not used

Submissions finish on Mon Sep 14 at 04:22:55 CEST 2009

  • 3600 collections submitted in 25710 seconds: 3/7.14/54 (min/avg/max)

Final results taken on Wed Sep 16 at 09:11:18 CEST 2009

  • Collections correctly submitted: 3600 (144000 jobs)
    • DONE OK: 143485 (99.64%)
    • NOT TERMINATED: 0 (0%)
    • ABORTED+CANCELLED: 2+515 (0.36%)
    • Resubmitted: - (-%)

  • Errors found (2)
    • Transfer to CREAM failed due to exception: (2 times)
      • CREAM Register raised std::exception Received NULL fault; the error is due to another cause: FaultString=[] - FaultCode=[SOAP-ENV:Server.generalException] - FaultSubCode=[SOAP-ENV:Server.generalException] - FaultDetail=[invoke2009-09-14T01:18:56.638Z0cannot write the authN proxy to file: nullcannot write the authN proxy to file: nullorg.glite.ce.faults.AuthenticationFaultcream-29.pd.infn.it]
      • CREAM Start raised exception Received NULL fault; the error is due to another cause: FaultString=[] - FaultCode=[SOAP-ENV:Server.generalException] - FaultSubCode=[SOAP-ENV:Server.generalException] - FaultDetail=[invoke2009-09-13T19:31:01.591Z0USER_VO_LABEL not defined in msgContextUSER_VO_LABEL not defined in msgContextorg.glite.ce.faults.AuthenticationFaultcream-34.pd.infn.it]

ice3_03.png

2) Test starts on Thu Sep 10 at 16:00:00 CEST 2009 (WMS: devel20)

Description:
  • 720 collections each of 40 jobs
  • One collection every 60 seconds
  • Four users
  • max_ice_threads = 10
  • Used all the CEs of testbedB
  • Used automatic-delegation
  • Use two proxy renewal servers ("myproxy.cern.ch" and "myproxy.cnaf.infn.it") and also submit 33% of jobs without setting MyproxyServer
  • The job is a "sleep 2424"
  • Resubmission is NOT enabled
  • Lease mechanism is not used

  • Changes in the software with respect to the previous test:
    • CREAM
      • Use old blparser for all the CEs

Submissions finish on Fri Sep 11 at 03:51:54 CEST 2009

  • 718 collections submitted in 4540 seconds: 3/6.3/37 (min/avg/max)
    • 2 submissions fail

Final results taken on Fri Sep 11 at 15:20:33 CEST 2009

  • Collections correctly submitted: 718 (28720 jobs)
    • DONE OK: 28319 (98.6%)
    • NOT TERMINATED: 0 (0%)
    • ABORTED+CANCELLED: 5+396 (1.4%)
    • Resubmitted: - (-%)

  • Errors found (5)
    • Cannot move ISB [...] : (5 times)

Note:

  • Some failures are due to bug #54949

ice3_02.png

1) Test starts on Wed Sep 9 at 16:08:00 CEST 2009 (WMS: devel20)

Description:
  • 800 collections each of 30 jobs
  • One collection every 60 seconds
  • Four users
  • max_ice_threads = 10
  • Used all the CEs of testbedB
  • Used automatic-delegation and 2 proxy renewal servers ("myproxy.cern.ch" and "myproxy.cnaf.infn.it")
  • The job is a "sleep 4242"
  • Resubmission is NOT enabled
  • Lease mechanism is not used

  • Changes in the software with respect to the previous test:
    • Use a WMS updated with patches #3156 and #3183
    • ICE
      • Use new delegation renewal mechanism
    • CREAM
      • Use new blparser for pbs CEs

Submissions finish on Thu Sep 10 at 05:20:33 CEST 2009

  • 799 collections submitted in 4499 seconds: 3/5.6/48 (min/avg/max)
    • 1 submission fails

Final results taken on Fri Sep 11 at 09:20:33 CEST 2009

  • Collections correctly submitted: 799 (23970 jobs)
    • DONE OK: 23752 (99.09%)
    • NOT TERMINATED: 36 (0.15%)
    • ABORTED+CANCELLED: 95+87 (0.76%)
    • Resubmitted: - (-%)

  • Errors found (95)
    • Cannot move ISB [...] : (1 time)
    • reason=999: (1 time)
    • Proxy is expired; (93 times)

Note:

  • Some failures are due to the use of the new blparser

ice3_01.png

-- AlessioGianelle - 10 Sep 2009

Topic attachments

  Attachment               Size       Date                 Who              Comment
  Error_ISBMove            2284.1 K   2009-09-28 - 07:22   AlessioGianelle  ErrorISB_04
  Error_LSFNotSup          276.2 K    2009-09-25 - 16:16   AlessioGianelle  ErrorLSFNS_04
  Error_SOAP               0.5 K      2009-09-25 - 16:16   AlessioGianelle  ErrorSOAP_04
  Error_UnableToOpenFile   115.5 K    2009-09-25 - 16:17   AlessioGianelle  ErrorUnableOpenFile_04
  Error_blah               44.7 K     2009-09-25 - 16:14   AlessioGianelle  ErrorBlah_04
  Error_connection         0.7 K      2009-09-25 - 16:15   AlessioGianelle  ErrorConnection_04
  Error_lsf11              0.2 K      2009-09-25 - 16:15   AlessioGianelle  Errorlsf11_04
  Error_lsf306             0.1 K      2009-09-25 - 16:15   AlessioGianelle  Errorlsf306_04
  Error_pbs127             32.9 K     2009-09-25 - 16:16   AlessioGianelle  Errorpbs127_04
  Error_token              6.8 K      2009-09-25 - 16:17   AlessioGianelle  ErrorToken04
  JobNotTerminated04.txt   8.4 K      2009-09-25 - 11:22   AlessioGianelle  JobNotTerminated_04
  ice3_01.png              6.6 K      2009-09-11 - 09:05   AlessioGianelle  Test_01
  ice3_02.png              7.0 K      2009-09-11 - 09:05   AlessioGianelle  Test_02
  ice3_03.png              7.4 K      2009-09-16 - 08:29   AlessioGianelle  Test_03
  ice3_04.png              7.9 K      2009-09-24 - 16:24   AlessioGianelle  Test_04
Topic revision: r49 - 2011-02-24 - AlessioGianelle