
TESTs (WMS: devel18)

  • List Match
    • without data:
    • with data:

  • Normal jobs through
    • ICE work: OK
    • JC work: OK

  • Dag jobs through:
    • JC work: OK
      • tested with the following
        [
          Type = "dag";
          VirtualOrganisation = "dteam";
          Max_nodes_running = 10;
          InputSandbox = "test.sh";
          FuzzyRank = true;
          Nodes = [
            nodeA = [
              file = "test_dag.jdl";
            ];
            nodeB = [
              file = "test_dag.jdl";
            ];
            nodeC = [
              file = "test_dag.jdl";
            ];
            nodeD = [
              file = "test_dag.jdl";
            ];
            nodeE = [
              file = "test_dag.jdl";
            ];
            nodeF = [
              file = "test_dag.jdl";
            ];
          ];
          Dependencies = {
            {{nodeA, nodeB}, nodeC},{nodeD,nodeE.nodeF}
          }
        ]

  • Collection jobs through:
    • ICE work: OK
    • JC work: OK

  • Parametric jobs through:
    • ICE:
    • JC work: OK
      • tested with the following
         [
          JobType = "parametric";
          Executable = "/usr/bin/env";
          Environment = {"MYPATH_PARAM_=$PATH:/bin:/usr/bin:$HOME"};
          StdOutput = "echo_PARAM_.out";
          StdError = "echo_PARAM_.err";
          OutputSandbox = {"echo_PARAM_.out","echo_PARAM_.err"};
          Parameters =  5;
          usertags = [ jdl = "parametric" ];
         ]
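To make the effect of Parameters = 5 concrete, here is a minimal sketch of how the _PARAM_ placeholder is expected to expand. It assumes the JDL defaults ParameterStart = 0 and ParameterStep = 1 (neither is set in the JDL above) and that an integer Parameters value is an exclusive upper bound; expand_param is a hypothetical helper, not part of any gLite tool.

```python
def expand_param(template, parameters, start=0, step=1):
    """Return one string per sub-job, with _PARAM_ replaced by its value.

    Assumption: an integer Parameters is an exclusive upper bound, and
    ParameterStart / ParameterStep default to 0 and 1.
    """
    return [template.replace("_PARAM_", str(value))
            for value in range(start, parameters, step)]


if __name__ == "__main__":
    # Output file names of the sub-jobs generated from the JDL above.
    for name in expand_param("echo_PARAM_.out", 5):
        print(name)
```

Under these assumptions the JDL yields five sub-jobs, and the same substitution applies to the Environment, StdOutput, StdError and OutputSandbox attributes.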

  • Bulk jobs sent both through ICE and JC:
    • Submit a bulk of 3 jobs -> success 100 % OK
    • Submit a bulk of 50 jobs -> success 99.99 % OK
    • Submit a bulk of 100 jobs
    • Submit a bulk of 500 jobs
    • Submit a bulk of 1000 jobs

  • Perusal jobs through:
    • JC work: OK
    • ICE work: OK

  • MPICH jobs:

Check bugs:

  • BUG #53106: Inefficient ICE's database access HOPEFULLY FIXED
    • Changes inside the code.

  • BUG #53223: Proxy renewal of ICE should be enhanced FIXED
    • Submit a job to ICE (e.g. requirements = regexp("8443/cream", other.GlueCEUniqueID);) with a short proxy (at least 1 hour)
    • Look at the ICE log and wait for the next "Delegation(s) check" (i.e. DEBUG - iceCommandDelegationRenewal::renewAllDelegations() - There are [3] Delegation(s) to check...). If it is time to renew the delegation, the log file should contain a line like this one:
      2009-10-06 10:38:20,141 WARN - iceCommandProxyRenewal::renewAllDelegations() - The better proxy [/var/glite/ice/persist_dir/6FD2EF1252D6668B438CACCD47D7A98277751464.betterproxy] is expiring NOT AFTER the current delegation [12548182532E416358devel182Ecnaf2Einfn2Eit]. Skipping ...
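The log check above can be automated along these lines. This is only a sketch: the regular expression is derived from the single sample line shown here, the ICE log location is not given on this page (so the function takes already-read lines), and find_skipped_renewals is a hypothetical name.

```python
import re

# Pattern built from the one sample line above; other ICE versions may differ.
SKIP_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ WARN - .*"
    r"The better proxy \[(?P<proxy>[^\]]+)\] is expiring NOT AFTER "
    r"the current delegation \[(?P<delegation>[^\]]+)\]"
)

SAMPLE = (
    "2009-10-06 10:38:20,141 WARN - iceCommandProxyRenewal::renewAllDelegations() - "
    "The better proxy [/var/glite/ice/persist_dir/"
    "6FD2EF1252D6668B438CACCD47D7A98277751464.betterproxy] is expiring NOT AFTER "
    "the current delegation [12548182532E416358devel182Ecnaf2Einfn2Eit]. Skipping ... "
)


def find_skipped_renewals(lines):
    """Return (timestamp, proxy path, delegation id) for each skipped renewal."""
    hits = []
    for line in lines:
        match = SKIP_RE.search(line)
        if match:
            hits.append((match.group("ts"), match.group("proxy"),
                         match.group("delegation")))
    return hits


if __name__ == "__main__":
    for hit in find_skipped_renewals([SAMPLE]):
        print(hit)
```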

  • BUG #53502: Using sqlite database transaction instead of "old" ICE's mutex. HOPEFULLY FIXED
    • Changes inside the code.

  • BUG #53714: WMS PURGER SHOULD NOT directly FORCE PURGE OF jobs when its DN is not authorized on LB server FIXED
    • Remove the DN of the WMS host machine from the /opt/glite/etc/LB_SUPERUSER file on the LB server machine
    • Restart the LB server on the LB host machine (with /opt/glite/etc/init.d/glite-lb-bkserverd restart)
    • Submit a job and wait for it to reach status DONE
    • Run the wms-purge cron job by hand without specifying the -t (# of secs) option
    • Check that the sandbox of the previously submitted job is not removed by the cron job; the WMS purger log should contain a message like the following:
      05 Oct, 13:16:11 -E: [Error] query_job_status(purger.cpp:127): https://devel15.cnaf.infn.it:9000/7QnNitYk3bT8BNHw-grosQ: edg_wll_JobStat [1] Operation not permitted(matching jobs found but authorization failed)
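To confirm the last step for a given job, the purger log can be searched for the authorization failure. The sketch below assumes the message layout of the sample line above; the purger log path is not given on this page, so the function works on lines that have already been read, and unauthorized_jobs is a hypothetical name.

```python
import re

# Layout taken from the sample purger message above (an assumption for other versions).
DENIED_RE = re.compile(
    r"query_job_status\([^)]*\): (?P<jobid>https://\S+?): "
    r"edg_wll_JobStat \[\d+\] Operation not permitted"
)

SAMPLE = (
    "05 Oct, 13:16:11 -E: [Error] query_job_status(purger.cpp:127): "
    "https://devel15.cnaf.infn.it:9000/7QnNitYk3bT8BNHw-grosQ: edg_wll_JobStat [1] "
    "Operation not permitted(matching jobs found but authorization failed)"
)


def unauthorized_jobs(lines):
    """Return the job IDs for which the LB query was refused."""
    return [m.group("jobid") for line in lines
            for m in [DENIED_RE.search(line)] if m]


if __name__ == "__main__":
    print(unauthorized_jobs([SAMPLE]))
```

If the ID of the test job appears in the result and its sandbox directory is still present, the purger behaved as the fix intends.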

  • BUG #55237: WMS job wrapper first customization point should be moved

  • BUG #55290: ICE's delegation renewal needs several enhancements. HOPEFULLY FIXED
    • See tests below.

  • BUG #55329: BAD delegation ID generation in ICE HOPEFULLY FIXED
    • Changes inside the code.

  • BUG #55606: glite-wms-job-listmatch is sometimes slow

  • BUG #55709: problems with glite-wms-wm restart in WMS 3.2 FIXED

TESTs only on ICE

4) Test starts on Wed Sep 16 at 14:08:40 CEST 2009 (WMS: devel20)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedB
  • Use automatic-delegation
  • Use 2 proxy renewal servers ("myproxy.cern.ch" and "myproxy.cnaf.infn.it")
  • The job is a "sleep 4242"
  • Resubmission was enabled after 2 days of testing
  • Lease mechanism is not used

  • Interventions in the testbed:
    • PBS CEs
      • qmgr -c "set server node_pack = False"
      • restart of pbs and maui services
    • PBS WNs
      • Set in /var/spool/pbs/mom_priv/config files:
          $ideal_load 3.5
          $max_load 4.2
      • Cleaning of the directories: /var/spool/pbs/mom_priv/jobs/
      • Restart of pbs_mom services

Submissions finish on Mon Sep 21 at 14:03:44 CEST 2009

  • 7059 collections submitted in 157326 seconds of total submission time: 4/22.3/145 seconds per collection (min/avg/max)
    • 141 submissions (1.96%) failed
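The submission figures can be cross-checked with a few lines of arithmetic. This sketch assumes that the 157326 seconds are the sum of the individual submission times and that the failure percentage is relative to the 7200 planned collections; submission_stats is a hypothetical helper.

```python
def submission_stats(planned, submitted, total_seconds):
    """Derive failure count, failure rate and average submission time."""
    failed = planned - submitted
    return {
        "failed": failed,
        "failed_pct": round(100.0 * failed / planned, 2),
        "avg_seconds": round(total_seconds / submitted, 1),
    }


if __name__ == "__main__":
    # Figures of test 4: 7200 planned, 7059 submitted, 157326 s in total.
    print(submission_stats(planned=7200, submitted=7059, total_seconds=157326))
```

Both derived values agree with the figures reported above, which supports reading the average as seconds per submitted collection.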

Final results taken on Thu Sep 24 at 17:36:18 CEST 2009

  • Collections correctly submitted: 7059 (282360 jobs)
    • DONE OK: 277253 (98.19%)
    • NOT TERMINATED: 3238 (1.15%) ****
    • ABORTED+CANCELLED: 1819+49 (0.66%)
    • Resubmitted: 3441 (1.22%)
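The percentages above are all relative to the number of jobs in correctly submitted collections (7059 × 40). A small sketch to recompute them; outcome_percentages is a hypothetical helper, and the counts are copied from the list above.

```python
def outcome_percentages(total_jobs, counts):
    """Express each job count as a percentage of total_jobs (two decimals)."""
    return {name: round(100.0 * n / total_jobs, 2) for name, n in counts.items()}


TOTAL_JOBS = 7059 * 40  # collections correctly submitted x jobs per collection

COUNTS = {
    "DONE OK": 277253,
    "NOT TERMINATED": 3238,
    "ABORTED+CANCELLED": 1819 + 49,
    "Resubmitted": 3441,
}

if __name__ == "__main__":
    for name, pct in outcome_percentages(TOTAL_JOBS, COUNTS).items():
        print(f"{name}: {pct}%")
```

Note that the three final states (DONE OK, NOT TERMINATED, ABORTED+CANCELLED) add up to 282359, one job short of the total.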

  • Errors found (3959)
    • Transfer to CREAM failed due to exception:
      • FaultCause=[Batch System lsf not supported!] (1010 times) *
      • FaultSubCode=[SOAP-ENV:Client] (2 times)
      • CREAM Register raised std::exception Connection to service [https://cream-23.pd.infn.it:8443/ce-cream/services/CREAM2] failed (3 times)
      • Authentication error: Unable to open the file [/var/glite/SandboxDir/<xx>/<jobid>/user.proxy] : No such file or directory (423 times) **
    • Cannot move ISB (1774 times)
    • blah error: send command timeout (509 times)
    • pbs_reason=127 (206 times) ***
    • lsf_reason=306 (1 time)
    • lsf_reason=11 (3 times)
    • Cannot take token; /opt/edg/libexec/edg-gridftp-base-rm: error globus_ftp_client: the server responded with an error 421 Service not available, closing control connection Cannot take token (28 times)
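The header total (3959) can be checked against the individual entries by summing the "(N times)" suffixes. A sketch with abbreviated error texts; only the counts matter here, and total_occurrences is a hypothetical helper.

```python
import re

COUNT_RE = re.compile(r"\((\d+) times?\)")

# Error entries from the list above, texts abbreviated, counts unchanged.
ERROR_LINES = [
    "FaultCause=[Batch System lsf not supported!] (1010 times)",
    "FaultSubCode=[SOAP-ENV:Client] (2 times)",
    "CREAM Register raised std::exception Connection to service failed (3 times)",
    "Authentication error: Unable to open the file (423 times)",
    "Cannot move ISB (1774 times)",
    "blah error: send command timeout (509 times)",
    "pbs_reason=127 (206 times)",
    "lsf_reason=306 (1 time)",
    "lsf_reason=11 (3 times)",
    "Cannot take token (28 times)",
]


def total_occurrences(lines):
    """Sum every '(N times)' / '(1 time)' counter found in the lines."""
    return sum(int(m.group(1)) for line in lines for m in COUNT_RE.finditer(line))


if __name__ == "__main__":
    print(total_occurrences(ERROR_LINES))
```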

Note:

* The blparser on cream-23 was not running. It was restarted on Fri Sep 18 at 11:22:06.

** The proxy renewal service renewed the job's proxy too late

*** Jobs stuck in the PBS queue (error = 15020)

**** The "NOT TERMINATED" jobs break down as follows:

  • 73 Collections (i.e. 2920 jobs) failed to be submitted (by WM) with reason request expired
  • 6 Collections (i.e. 240 jobs) failed to be submitted (by wmproxy) with reason Register DAG subjobs failed Exit code: 1416 LB[Proxy] Error: LB server (bkserver,lbproxy) store protocol error
  • 77 jobs are running
  • 1 job is scheduled

ice3_04.png

3) Test starts on Fri Sep 11 at 16:28:30 CEST 2009 (WMS: devel20)

Description:
  • 3600 collections each of 40 jobs
  • One collection every 60 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedB
  • Use automatic-delegation
  • Use 2 proxy renewal servers ("myproxy.cern.ch" and "myproxy.cnaf.infn.it")
  • The job is a "sleep 4242"
  • Resubmission is NOT enabled
  • Lease mechanism is not used
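With one collection every 60 seconds, the nominal duration of the submission phase follows directly from the number of collections. The sketch below computes the nominal time of the last submission for this test; it assumes an exact 60-second interval with no drift, and nominal_last_submission is a hypothetical helper.

```python
from datetime import datetime, timedelta


def nominal_last_submission(start, collections, interval_s=60):
    """Time of the last submission if one collection starts every interval_s seconds."""
    return start + timedelta(seconds=interval_s * (collections - 1))


if __name__ == "__main__":
    start = datetime(2009, 9, 11, 16, 28, 30)  # start of test 3 (CEST)
    print(nominal_last_submission(start, 3600))
    print(3600 * 40, "jobs in total")
```

The nominal value falls within about five minutes of the end of submissions reported for this test.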

Submissions finish on Mon Sep 14 at 04:22:55 CEST 2009

  • 3600 collections submitted in 25710 seconds of total submission time: 3/7.14/54 seconds per collection (min/avg/max)

Final results taken on Wed Sep 16 at 09:11:18 CEST 2009

  • Collections correctly submitted: 3600 (144000 jobs)
    • DONE OK: 143485 (99.64%)
    • NOT TERMINATED: 0 (0%)
    • ABORTED+CANCELLED: 2+515 (0.36%)
    • Resubmitted: - (-%)

  • Errors found (2)
    • Transfer to CREAM failed due to exception: (2 times)
      • CREAM Register raised std::exception Received NULL fault; the error is due to another cause: FaultString=[] - FaultCode=[SOAP-ENV:Server.generalException] - FaultSubCode=[SOAP-ENV:Server.generalException] - FaultDetail=[invoke2009-09-14T01:18:56.638Z0cannot write the authN proxy to file: nullcannot write the authN proxy to file: nullorg.glite.ce.faults.AuthenticationFaultcream-29.pd.infn.it]
      • CREAM Start raised exception Received NULL fault; the error is due to another cause: FaultString=[] - FaultCode=[SOAP-ENV:Server.generalException] - FaultSubCode=[SOAP-ENV:Server.generalException] - FaultDetail=[invoke2009-09-13T19:31:01.591Z0USER_VO_LABEL not defined in msgContextUSER_VO_LABEL not defined in msgContextorg.glite.ce.faults.AuthenticationFaultcream-34.pd.infn.it]

ice3_03.png

2) Test starts on Thu Sep 10 at 16:00:00 CEST 2009 (WMS: devel20)

Description:
  • 720 collections each of 40 jobs
  • One collection every 60 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedB
  • Use automatic-delegation
  • Use 2 proxy renewal servers ("myproxy.cern.ch" and "myproxy.cnaf.infn.it") and also submit 33% of the jobs without setting MyproxyServer
  • The job is a "sleep 2424"
  • Resubmission is NOT enabled
  • Lease mechanism is not used

  • Changes in the software with respect to the previous test:
    • CREAM
      • Use old blparser for all the CEs

Submissions finish on Fri Sep 11 at 03:51:54 CEST 2009

  • 718 collections submitted in 4540 seconds of total submission time: 3/6.3/37 seconds per collection (min/avg/max)
    • 2 submissions failed

Final results taken on Fri Sep 11 at 15:20:33 CEST 2009

  • Collections correctly submitted: 718 (28720 jobs)
    • DONE OK: 28319 (98.6%)
    • NOT TERMINATED: 0 (0%)
    • ABORTED+CANCELLED: 5+396 (1.4%)
    • Resubmitted: - (-%)

  • Errors found (5)
    • Cannot move ISB [...] : (5 times)

Note:

  • Some failures are due to bug #54949

ice3_02.png

1) Test starts on Wed Sep 9 at 16:08:00 CEST 2009 (WMS: devel20)

Description:
  • 800 collections each of 30 jobs
  • One collection every 60 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedB
  • Use automatic-delegation and 2 proxy renewal servers ("myproxy.cern.ch" and "myproxy.cnaf.infn.it")
  • The job is a "sleep 4242"
  • Resubmission is NOT enabled
  • Lease mechanism is not used

  • Changes in the software with respect to the previous test:
    • Use a WMS updated with patches #3156 and #3183
    • ICE
      • Use new delegation renewal mechanism
    • CREAM
      • Use new blparser for pbs CEs

Submissions finish on Thu Sep 10 at 05:20:33 CEST 2009

  • 799 collections submitted in 4499 seconds of total submission time: 3/5.6/48 seconds per collection (min/avg/max)
    • 1 submission failed

Final results taken on Fri Sep 11 at 09:20:33 CEST 2009

  • Collections correctly submitted: 799 (23970 jobs)
    • DONE OK: 23752 (99.09%)
    • NOT TERMINATED: 36 (0.15%)
    • ABORTED+CANCELLED: 95+87 (0.76%)
    • Resubmitted: - (-%)

  • Errors found (95)
    • Cannot move ISB [...] : (1 time)
    • reason=999: (1 time)
    • Proxy is expired; (93 times)

Note:

  • Some failures are due to the use of the new blparser

ice3_01.png

-- AlessioGianelle - 10 Sep 2009

Topic attachments
Attachment              Size      Date                Who              Comment
Error_ISBMove           2284.1 K  2009-09-28 - 07:22  AlessioGianelle  ErrorISB_04
Error_LSFNotSup         276.2 K   2009-09-25 - 16:16  AlessioGianelle  ErrorLSFNS_04
Error_SOAP              0.5 K     2009-09-25 - 16:16  AlessioGianelle  ErrorSOAP_04
Error_UnableToOpenFile  115.5 K   2009-09-25 - 16:17  AlessioGianelle  ErrorUnableOpenFile_04
Error_blah              44.7 K    2009-09-25 - 16:14  AlessioGianelle  ErrorBlah_04
Error_connection        0.7 K     2009-09-25 - 16:15  AlessioGianelle  ErrorConnection_04
Error_lsf11             0.2 K     2009-09-25 - 16:15  AlessioGianelle  Errorlsf11_04
Error_lsf306            0.1 K     2009-09-25 - 16:15  AlessioGianelle  Errorlsf306_04
Error_pbs127            32.9 K    2009-09-25 - 16:16  AlessioGianelle  Errorpbs127_04
Error_token             6.8 K     2009-09-25 - 16:17  AlessioGianelle  ErrorToken04
JobNotTerminated04.txt  8.4 K     2009-09-25 - 11:22  AlessioGianelle  JobNotTerminated_04
ice3_01.png             6.6 K     2009-09-11 - 09:05  AlessioGianelle  Test_01
ice3_02.png             7.0 K     2009-09-11 - 09:05  AlessioGianelle  Test_02
ice3_03.png             7.4 K     2009-09-16 - 08:29  AlessioGianelle  Test_03
ice3_04.png             7.9 K     2009-09-24 - 16:24  AlessioGianelle  Test_04
Topic revision: r33 - 2009-10-06 - AlessioGianelle