Difference: WmsTestsP2597 (1 vs. 55)

Revision 55 (2011-02-24) - AlessioGianelle

Line: 1 to 1
Changed:
<
<
META TOPICPARENT name="TestWokPlan"
>
>
META TOPICPARENT name="TestPage"
 

TESTs

2009-03-25 (Danilo)

Revision 54 (2009-04-10) - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

TESTs

2009-03-25 (Danilo)

Line: 58 to 58
  The following CE(s) matching your job requirements have been found
  - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
Changed:
<
<

Problems

  • Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
>
>
  1. File perusal with data. OK
  2. DAGS with data. OK
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_12

Changed:
<
<
  1. It seems that for some jobs condor is not able to retrieve the standard output of the job; in all these cases the "maradona" mechanism is used.
>
>
  1. It seems that for some jobs condor is not able to retrieve the standard output of the job; in all these cases the "maradona" mechanism is used. DOES NOT SEEM A WMS ISSUE
 24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Going to parse standard output file.
 24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Standard output does not contain useful data.
 24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Standard output was not useful, passing ball to Maradona...
Line: 73 to 75
 
  1. Submit a collection with all the nodes with "false" requirements, then a cancel for the collection; when you try to restart the wm it fails. ---> HOPEFULLY FIXED BY MANAGER
      <mcecchi> glite-wms-workload_manager: recovery.cpp:387: void glite::wms::manager::server::single_request_recovery(const glite::wms::manager::server::<unnamed>::IdRequests&, 
     glite::wms::manager::server::Events&, const glite::wms::manager::server::WMReal&, glite::wms::manager::server::<unnamed>::IdToRequests&, 
Changed:
<
<
boost::shared_ptr<std::string>): Assertion `requests_for_id.size() == 1' failed.
>
>
boost::shared_ptr<std::string>): Assertion `requests_for_id.size() == 1' failed. FIXED BY MANAGER
 
  1. Cancelling a collection triggers a cancel also for the "Done" nodes:
    Event: Done
    - Arrived                    =    Thu Mar 26 13:48:39 2009 CET
Line: 114 to 116
 - Status code = REFUSE
 - Timestamp = Thu Mar 26 15:29:13 2009 CET
 - User = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
Changed:
<
<
  1. Recovery is not able to correctly manage a node resubmission request (it triggers two resubmissions: one due to the collection recovery and the other to the node resubmission request).
>
>
  1. Recovery is not able to correctly manage a node resubmission request (it triggers two resubmissions: one due to the collection recovery and the other to the node resubmission request). FIXED BY MANAGER
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_8

Line: 124 to 126
 19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobcancel request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
 19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobresubmit request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
Changed:
<
<
HOPEFULLY FIXED BY MANAGER
>
>
FIXED BY MANAGER
 
  1. Sometimes the submit command fails with this message:
    [ale@cream-15 UI]$ glite-wms-job-submit  -a -c ~/UI/etc/wmp_wms007.conf -o wms007 jdl/fail/fail.jdl

Revision 53 (2009-03-26) - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"
Changed:
<
<

TEST 03252009, Configuration Name: glite-wms_R_3_2_1_11

>
>

TESTs

2009-03-25 (Danilo)

  1. MatchMaking with data. OK
    • Files used:
      >lcg-lr lfn:test_e-2M.txt 
      srm://gridit-se-01.cnaf.infn.it/dpm/cnaf.infn.it/home/infngrid/generated/2009-03-23/file881fe282-f7ce-4e0a-bf6e-e45ebc2f54bc 
      srm://gridsrm.ts.infn.it/infngrid/generated/2009-03-23/fileb225a02d-672b-439c-bf60-2ec5939563ac 
      >lcg-lr lfn:test_PI_1M.txt
      srm://gridit-se-01.cnaf.infn.it/dpm/cnaf.infn.it/home/infngrid/generated/2009-03-23/filee260881d-f7b9-4ba5-aadb-c73da3537700 
      srm://gridsrm.ts.infn.it/infngrid/generated/2009-03-23/file175411c7-870d-4302-bcfe-e4e471ebb23b
    • DataRequirements = { [ DataCatalogType = "DLI"; DataCatalog = "http://lfcserver.cnaf.infn.it:8085"; InputData = {"lfn:/grid/infngrid/test_e-2M.txt","lfn:/grid/infngrid/test_PI_1M.txt"}; ] };
    • DataAccessProtocol = "gsiftp";
      [danilo@ui]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a data-req.jdl 
      Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
      
                           COMPUTING ELEMENT IDs LIST
      
       The following CE(s) matching your job requirements have been found
      - grid001.ts.infn.it:2119/jobmanager-lcglsf-cert
      - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
  2. GANG MATCHING
    • JDL Requirements = anyMatch( other.storage.CloseSEs ,target.GlueSAStateAvailableSpace == 87200000);
      [danilo@ui TEST_WMS]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a test_gang.jdl 
      Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
      
                           COMPUTING ELEMENT IDs LIST
      
       The following CE(s) matching your job requirements have been found:
         - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert (the closeSE is gridit-se-01 with 87200000 available space.)
    • JDL Requirements = anyMatch( other.storage.CloseSEs ,target.GlueSAStateAvailableSpace >= 50340000000);
      [danilo@ui TEST_WMS]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a test_gang.jdl 
      Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
      
                           COMPUTING ELEMENT IDs LIST
      
       The following CE(s) matching your job requirements have been found:
      
       - argoce01.na.infn.it:2119/jobmanager-lcgpbs-cert 
       - atlasce1.lnf.infn.it:2119/jobmanager-lcgpbs-cert 
       - grid-ce2.pr.infn.it:2119/jobmanager-pbs-cert 
       - grid001.ts.infn.it:2119/jobmanager-lcglsf-cert 
       - prod-ce-01.pd.infn.it:2119/jobmanager-lcglsf-cert 
       - prod-ce-02.pd.infn.it:2119/jobmanager-lcglsf-cert 
       - t2-ce-01.mi.infn.it:2119/jobmanager-lcgpbs-cert 
       - atlasce01.na.infn.it:2119/jobmanager-lcgpbs-cert 
       - atlas-ce-01.roma1.infn.it:2119/jobmanager-lcglsf-atlasgcert 
       - grid012.ct.infn.it:2119/jobmanager-lcglsf-cert
  3. OUTPUT SE OK
    • OutputSE = "grid007g.cnaf.infn.it";
      [danilo@ui]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a outputSE.jdl 
      Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
      
                           COMPUTING ELEMENT IDs LIST
      
       The following CE(s) matching your job requirements have been found
       - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert 
 

Problems

  • Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
Line: 58 to 114
 - Status code = REFUSE
 - Timestamp = Thu Mar 26 15:29:13 2009 CET
 - User = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
Changed:
<
<
>
>
  1. Recovery is not able to correctly manage a node resubmission request (it triggers two resubmissions: one due to the collection recovery and the other to the node resubmission request).
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_8

Revision 52 (2009-03-26) - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

TEST 03252009, Configuration Name: glite-wms_R_3_2_1_11

Problems

Changed:
<
<
Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
>
>
  • Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_12

Line: 14 to 14
 24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Got info from Maradona...
 24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): Maradona makes another goal !!!
 24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): The legend goes on...
Changed:
<
<
  1. Submit a collection with all the nodes with "false" requirements, then a cancel for the collection; when you try to restart the wm it fails.
>
>
  1. Submit a collection with all the nodes with "false" requirements, then a cancel for the collection; when you try to restart the wm it fails. ---> HOPEFULLY FIXED BY MANAGER
  glite-wms-workload_manager: recovery.cpp:387: void glite::wms::manager::server::single_request_recovery(const glite::wms::manager::server::<unnamed>::IdRequests&, glite::wms::manager::server::Events&, const glite::wms::manager::server::WMReal&, glite::wms::manager::server::<unnamed>::IdToRequests&, boost::shared_ptr<std::string>): Assertion `requests_for_id.size() == 1' failed.
Added:
>
>
  1. Cancelling a collection triggers a cancel also for the "Done" nodes:
    Event: Done
    - Arrived                    =    Thu Mar 26 13:48:39 2009 CET
    - Exit code                  =    0
    - Host                       =    wms009.cnaf.infn.it
    - Reason                     =    Job terminated successfully
    - Source                     =    LogMonitor
    - Src instance               =    unique
    - Status code                =    OK
    - Timestamp                  =    Thu Mar 26 13:48:39 2009 CET
    - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
            ---
    Event: Cancel
    - Arrived                    =    Thu Mar 26 15:29:09 2009 CET
    - Host                       =    wms009.cnaf.infn.it
    - Source                     =    WorkloadManager
    - Src instance               =    23245
    - Status code                =    DONE
    - Timestamp                  =    Thu Mar 26 15:29:09 2009 CET
    - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
            ---
    Event: Cancel
    - Arrived                    =    Thu Mar 26 15:29:13 2009 CET
    - Host                       =    wms009.cnaf.infn.it
    - Reason                     =    Cancel requested by WorkloadManager
    - Source                     =    JobController
    - Src instance               =    unique
    - Status code                =    REQ
    - Timestamp                  =    Thu Mar 26 15:29:13 2009 CET
    - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
            ---
    Event: Cancel
    - Arrived                    =    Thu Mar 26 15:29:14 2009 CET
    - Host                       =    wms009.cnaf.infn.it
    - Reason                     =    I'm not able to retrieve the condor ID.
    - Source                     =    JobController
    - Src instance               =    unique
    - Status code                =    REFUSE
    - Timestamp                  =    Thu Mar 26 15:29:13 2009 CET
    - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
 
Deleted:
<
<
HOPEFULLY FIXED BY MANAGER
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_8

Revision 51 (2009-03-26) - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

TEST 03252009, Configuration Name: glite-wms_R_3_2_1_11

Line: 98 to 98
  HOPEFULLY FIXED BY MANAGER_3_2_10
Changed:
<
<
  1. Cancel of pending nodes (or jobs) doesn't work FIXED (glite-wms-manager-3.2.1-7). But the final status of the job is ABORTED instead of CANCELLED.

HOPEFULLY FIXED BY MANAGER_3_2_10 - TO BE RETESTED

  1. recovery doesn't recognize a mm request: FIXED (glite-wms-manager-3.2.1-7)
>
>
  1. Cancel of pending nodes (or jobs) doesn't work FIXED (glite-wms-manager-3.2.1-7). But the final status of the job is ABORTED instead of CANCELLED.
  2. recovery doesn't recognize a mm request: NOT FIXED
 09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
Changed:
<
<
09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2) and it is not removed from the jobdir. Now at startup the request is ignored: ignoring match request, but the ui command hangs.

THIS IS THE EXPECTED BEHAVIOUR WITH THE UI 3.1

>
>
09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2) and it is not removed from the jobdir. Note: the ui command hangs.
 
  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so

Revision 50 (2009-03-26) - DaniloDongiovanni

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

TEST 03252009, Configuration Name: glite-wms_R_3_2_1_11

Problems

Changed:
<
<
Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
>
>
Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_12

Revision 49 (2009-03-26) - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

TEST 03252009, Configuration Name: glite-wms_R_3_2_1_11

Problems

Deleted:
<
<
MM with data does not work:
Requirements = other.GlueCEInfoHostName != "spacin-ce1.dma.unina.it";
DataRequirements = {
[
DataCatalogType = "DLI";
DataCatalog = "http://lfcserver.cnaf.infn.it:8085";
InputData = {"lfn:/grid/infngrid/cesini/PI_1M.txt","lfn:/grid/infngrid/cesini/e-2M.txt"};
]
};
DataAccessProtocol = "gsiftp";
does not match, while
File1:
[cesini@ui DataReq]$ lcg-lr --vo infngrid lfn:/grid/infngrid/cesini/e-2M.txt
sfn://gridba6.ba.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file1d1e9575-ae91-4d67-9080-f4a7f4687806
sfn://gridse.pi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-26/fileb28f35a1-fbb4-413c-920f-b52cf002012e
sfn://prod-se-01.pd.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file88152626-c3c9-448e-9275-b01278a0d9d0
sfn://t2-se-01.mi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/fileba70145a-2a5c-4f61-ac18-a4b48bc0e78b
sfn://t2-se-03.lnl.infn.it/data1/infngrid/generated/2007-11-26/file2ddaccf9-c62c-417a-826d-3a1755cf3718

File2:
[cesini@ui DataReq]$ lcg-lr --vo infngrid lfn:/grid/infngrid/cesini/PI_1M.txt
sfn://gridba6.ba.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/filec4384060-5c78-4ee8-ac75-6a5b7161987d
sfn://gridse.pi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-26/file6a7ce8b1-3aa9-442b-80c2-21493a60553e
sfn://prod-se-01.pd.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/filed03f96ea-4a70-4315-bc3a-a3839c9e0bed
sfn://t2-se-01.mi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file5a113abe-0a0e-459f-9a2f-2478f8bb6c6a
sfn://t2-se-03.lnl.infn.it/data1/infngrid/generated/2007-11-26/filed1f43d74-24ac-47a2-a83a-8b0e8b0d9d4f

Log-status
Status info for the Job : https://devel20.cnaf.infn.it:9000/LkRfwhr3ibCfROQfCCj0jA
Current Status:     Aborted 
Status Reason:      PID: 11874 - "wmpcoreoperations::jobStart"
Submitted:          Fri Mar 20 20:04:16 2009 CET

HOPEFULLY FIXED BY HELPER

 Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2

Using this repository: Configuration Name: glite-wms_R_3_2_1_12

Revision 48 (2009-03-26) - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

TEST 03252009, Configuration Name: glite-wms_R_3_2_1_11

Deleted:
<
<
  1. SIMPLE job submission, requires 3-4 attempts to work (KNOWN ISSUE, JRA1 working on that)
  2. MM with data: MM with data WORKS:
    Files used:
      >lcg-lr lfn:test_e-2M.txt
      srm://gridit-se-01.cnaf.infn.it/dpm/cnaf.infn.it/home/infngrid/generated/2009-03-23/file881fe282-f7ce-4e0a-bf6e-e45ebc2f54bc
      srm://gridsrm.ts.infn.it/infngrid/generated/2009-03-23/fileb225a02d-672b-439c-bf60-2ec5939563ac
      >lcg-lr lfn:test_PI_1M.txt
      srm://gridit-se-01.cnaf.infn.it/dpm/cnaf.infn.it/home/infngrid/generated/2009-03-23/filee260881d-f7b9-4ba5-aadb-c73da3537700
      srm://gridsrm.ts.infn.it/infngrid/generated/2009-03-23/file175411c7-870d-4302-bcfe-e4e471ebb23b
    DataRequirements = { [ DataCatalogType = "DLI"; DataCatalog = "http://lfcserver.cnaf.infn.it:8085"; InputData = {"lfn:/grid/infngrid/test_e-2M.txt","lfn:/grid/infngrid/test_PI_1M.txt"}; ] };
    DataAccessProtocol = "gsiftp";

    [danilo@ui DataReq]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a data-req.jdl
    Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server

                         COMPUTING ELEMENT IDs LIST
    The following CE(s) matching your job requirements have been found:
      - grid001.ts.infn.it:2119/jobmanager-lcglsf-cert
      - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert

  1. GANG MATCHING
    JDL Requirements = anyMatch( other.storage.CloseSEs ,target.GlueSAStateAvailableSpace == 87200000);
    TEST RESULT: MM is OK for "==", FAILS FOR ">=, >, <". PROBLEM: likely the available space number is treated as a string.

    [danilo@ui TEST_WMS]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a test_gang.jdl
    Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server

                         COMPUTING ELEMENT IDs LIST
    The following CE(s) matching your job requirements have been found:
      - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
    whose closeSE is gridit-se-01 with 87200000 available space. TEST RESULT: MM is OK for "==".

    Requirements = anyMatch( other.storage.CloseSEs ,target.GlueSAStateAvailableSpace >= 50340000000);
    It should match atlas-ce-01.roma1.infn.it:2119/jobmanager-lcglsf-atlasgcert only; instead it matches:
      - argoce01.na.infn.it:2119/jobmanager-lcgpbs-cert
      - atlasce1.lnf.infn.it:2119/jobmanager-lcgpbs-cert
      - grid-ce2.pr.infn.it:2119/jobmanager-pbs-cert
      - grid001.ts.infn.it:2119/jobmanager-lcglsf-cert
      - prod-ce-01.pd.infn.it:2119/jobmanager-lcglsf-cert
      - prod-ce-02.pd.infn.it:2119/jobmanager-lcglsf-cert
      - t2-ce-01.mi.infn.it:2119/jobmanager-lcgpbs-cert
      - atlasce01.na.infn.it:2119/jobmanager-lcgpbs-cert
      - atlas-ce-01.roma1.infn.it:2119/jobmanager-lcglsf-atlasgcert
      - grid012.ct.infn.it:2119/jobmanager-lcglsf-cert
    TEST RESULT: MM for >=, >, < FAILED

  1. OUTPUT SE OK:

OutputSE = "grid007g.cnaf.infn.it";

CEId gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert TEST RESULT: MM is OK

 

Problems

MM with data does not work:

Revision 47 (2009-03-26) - DaniloDongiovanni

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"
Added:
>
>

TEST 03252009, Configuration Name: glite-wms_R_3_2_1_11

  1. SIMPLE job submission, requires 3-4 attempts to work (KNOWN ISSUE, JRA1 working on that)
  2. MM with data: MM with data WORKS:
    Files used:
      >lcg-lr lfn:test_e-2M.txt
      srm://gridit-se-01.cnaf.infn.it/dpm/cnaf.infn.it/home/infngrid/generated/2009-03-23/file881fe282-f7ce-4e0a-bf6e-e45ebc2f54bc
      srm://gridsrm.ts.infn.it/infngrid/generated/2009-03-23/fileb225a02d-672b-439c-bf60-2ec5939563ac
      >lcg-lr lfn:test_PI_1M.txt
      srm://gridit-se-01.cnaf.infn.it/dpm/cnaf.infn.it/home/infngrid/generated/2009-03-23/filee260881d-f7b9-4ba5-aadb-c73da3537700
      srm://gridsrm.ts.infn.it/infngrid/generated/2009-03-23/file175411c7-870d-4302-bcfe-e4e471ebb23b
    DataRequirements = { [ DataCatalogType = "DLI"; DataCatalog = "http://lfcserver.cnaf.infn.it:8085"; InputData = {"lfn:/grid/infngrid/test_e-2M.txt","lfn:/grid/infngrid/test_PI_1M.txt"}; ] };
    DataAccessProtocol = "gsiftp";

    [danilo@ui DataReq]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a data-req.jdl
    Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server

                         COMPUTING ELEMENT IDs LIST
    The following CE(s) matching your job requirements have been found:
      - grid001.ts.infn.it:2119/jobmanager-lcglsf-cert
      - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert

  1. GANG MATCHING
    JDL Requirements = anyMatch( other.storage.CloseSEs ,target.GlueSAStateAvailableSpace == 87200000);
    TEST RESULT: MM is OK for "==", FAILS FOR ">=, >, <". PROBLEM: likely the available space number is treated as a string.

    [danilo@ui TEST_WMS]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a test_gang.jdl
    Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server

                         COMPUTING ELEMENT IDs LIST
    The following CE(s) matching your job requirements have been found:
      - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
    whose closeSE is gridit-se-01 with 87200000 available space. TEST RESULT: MM is OK for "==".

    Requirements = anyMatch( other.storage.CloseSEs ,target.GlueSAStateAvailableSpace >= 50340000000);
    It should match atlas-ce-01.roma1.infn.it:2119/jobmanager-lcglsf-atlasgcert only; instead it matches:
      - argoce01.na.infn.it:2119/jobmanager-lcgpbs-cert
      - atlasce1.lnf.infn.it:2119/jobmanager-lcgpbs-cert
      - grid-ce2.pr.infn.it:2119/jobmanager-pbs-cert
      - grid001.ts.infn.it:2119/jobmanager-lcglsf-cert
      - prod-ce-01.pd.infn.it:2119/jobmanager-lcglsf-cert
      - prod-ce-02.pd.infn.it:2119/jobmanager-lcglsf-cert
      - t2-ce-01.mi.infn.it:2119/jobmanager-lcgpbs-cert
      - atlasce01.na.infn.it:2119/jobmanager-lcgpbs-cert
      - atlas-ce-01.roma1.infn.it:2119/jobmanager-lcglsf-atlasgcert
      - grid012.ct.infn.it:2119/jobmanager-lcglsf-cert
    TEST RESULT: MM for >=, >, < FAILED

  1. OUTPUT SE OK:

OutputSE = "grid007g.cnaf.infn.it";

CEId gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert TEST RESULT: MM is OK

 

Problems

MM with data does not work:

Revision 46 (2009-03-25) - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

MM with data does not work:
Line: 36 to 36
 Submitted: Fri Mar 20 20:04:16 2009 CET
Added:
>
>
HOPEFULLY FIXED BY HELPER
  Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
Line: 53 to 55
  glite::wms::manager::server::Events&, const glite::wms::manager::server::WMReal&, glite::wms::manager::server::<unnamed>::IdToRequests&, boost::shared_ptr<std::string>): Assertion `requests_for_id.size() == 1' failed.
Added:
>
>
HOPEFULLY FIXED BY MANAGER
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_8

  1. Recovery mechanism is not able to manage a cancel and a resubmit request for the same job (it should cancel the job).
Line: 60 to 64
 19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:804): invalid pattern; ignoring all requests
 19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobcancel request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
 19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobresubmit request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
Added:
>
>
HOPEFULLY FIXED BY MANAGER
 
  1. Sometimes the submit command fails with this message:
    [ale@cream-15 UI]$ glite-wms-job-submit  -a -c ~/UI/etc/wmp_wms007.conf -o wms007 jdl/fail/fail.jdl
    Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
Line: 69 to 76
 JobId: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg (please contact the server administrator)
Changed:
<
<
THIS IS BOTH A UI PROBLEM AND A WMPROXY 3.2 BUG
>
>
HOPEFULLY FIXED BY WMPROXY
 
  1. LogMonitor dies when it invokes operator()(purger.cpp:432)
    19 Mar, 15:50:20 -U- JobFilePurger::do_purge(...): Going to purge job storage...

Revision 45 (2009-03-25) - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

MM with data does not work:
Line: 60 to 60
 19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:804): invalid pattern; ignoring all requests
 19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobcancel request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
 19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobresubmit request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
Deleted:
<
<
HOPEFULLY FIXED BY MANAGER_3_2_10
 
  1. Sometimes the submit command fails with this message:
    [ale@cream-15 UI]$ glite-wms-job-submit  -a -c ~/UI/etc/wmp_wms007.conf -o wms007 jdl/fail/fail.jdl
    Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server

Revision 44 (2009-03-24) - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

MM with data does not work:
Line: 41 to 41
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_12

Changed:
<
<
  1. It seems that for all the jobs condor is not able to retrieve the standard output of the job; in all these cases the "maradona" mechanism is used.
>
>
  1. It seems that for some jobs condor is not able to retrieve the standard output of the job; in all these cases the "maradona" mechanism is used.
 24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Going to parse standard output file.
 24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Standard output does not contain useful data.
 24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Standard output was not useful, passing ball to Maradona...
 24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Got info from Maradona...
 24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): Maradona makes another goal !!!
 24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): The legend goes on...
Added:
>
>
  1. Submit a collection with all the nodes with "false" requirements, then a cancel for the collection; when you try to restart the wm it fails.
      <mcecchi> glite-wms-workload_manager: recovery.cpp:387: void glite::wms::manager::server::single_request_recovery(const glite::wms::manager::server::<unnamed>::IdRequests&, 
     glite::wms::manager::server::Events&, const glite::wms::manager::server::WMReal&, glite::wms::manager::server::<unnamed>::IdToRequests&, 
     boost::shared_ptr<std::string>): Assertion `requests_for_id.size() == 1' failed.
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_8

Revision 43 (2009-03-24) - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

MM with data does not work:
Line: 76 to 76
 19 Mar, 15:50:20 -*- MonitorLoop::run(): Namely: "call to empty boost::function"
 19 Mar, 15:50:20 -*- MonitorLoop::run(): Aborting daemon...
Added:
>
>
HOPEFULLY FIXED BY PURGER AND JOBSUBMISSION
 
  1. Submit a collection with attribute "requirements = false;" doesn't work. Using this collection: FIXED (glite-wms-wmproxy-3.2.1-4.slc4)
    [
    Type = "collection";
Line: 123 to 126
 HOPEFULLY FIXED BY MANAGER_3_2_10

  1. Cancel of pending nodes (or jobs) doesn't work FIXED (glite-wms-manager-3.2.1-7). But the final status of the job is ABORTED instead of CANCELLED.
Added:
>
>
HOPEFULLY FIXED BY MANAGER_3_2_10 - TO BE RETESTED
 
  1. recovery doesn't recognize a mm request: FIXED (glite-wms-manager-3.2.1-7)
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir. Now at startup the request is ignored: ignoring match request, but the ui command hangs.
Changed:
<
<
  1. problems with cancels` nodes during recovery -----> HOPEFULLY FIXED BY MANAGER_3_2_8
  2. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3)
>
>
THIS IS THE EXPECTED BEHAVIOUR WITH THE UI 3.1

  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3)
 12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
 12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
 12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
Line: 136 to 144
 12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
 Program received signal SIGSEGV, Segmentation fault.
 0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
Changed:
<
<
  1. A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (through the recovery mechanism); this means that a job can run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816 FIXED (glite-wms-manager-3.2.1-7)
  2. Recovery of a collection with two "scheduled" nodes and the remaining in "waiting" is treated as if all nodes were recoverable. What happens is that collections node status as returned upon recovery does not seem to match the real LB status.
    
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/fQwKKHBLncwJqJCVbiwOQw: recoverable node (WAITING)
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/MrLckmBvcibWQl3TnUyF2Q: recoverable node (WAITING)
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/wSfED2AXfBUhcKaJBWvqYg: recoverable node (WAITING)
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283): https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA: recoverable node (READY)
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283): https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw: recoverable node (READY)
    In the above example, https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA and https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA were respectively SUBMITTED and RUNNING.
  3. WM dies after this message: -----------> HOPEFULLY FIXED BY MANAGER_3_2_10
>
>
  1. A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (through the recovery mechanism); this means that a job can run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816 FIXED (glite-wms-manager-3.2.1-7)
  2. WM dies after this message: -----------> HOPEFULLY FIXED BY MANAGER_3_2_10
 09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
Changed:
<
<
  1. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)
>
>
  1. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)
  -- AlessioGianelle - 04 Mar 2009

Revision 42 (2009-03-24) - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

MM with data does not work:
Line: 76 to 76
 19 Mar, 15:50:20 -*- MonitorLoop::run(): Namely: "call to empty boost::function"
 19 Mar, 15:50:20 -*- MonitorLoop::run(): Aborting daemon...
Changed:
<
<
  1. Submit a collection with attribute "requirements = false;" doesn't work. Using this collection:
>
>
  1. Submit a collection with attribute "requirements = false;" doesn't work. Using this collection: FIXED (glite-wms-wmproxy-3.2.1-4.slc4)
 [ Type = "collection"; InputSandbox = {"exe/test.sh"};

Revision 41 (2009-03-24) - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

MM with data does not work:
Line: 68 to 68
 JobId: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg (please contact the server administrator)
Changed:
<
<
SHOULD BE A UI PROBLEM
>
>
THIS IS BOTH A UI PROBLEM AND A WMPROXY 3.2 BUG
 
  1. LogMonitor dies when it invokes operator()(purger.cpp:432)
    19 Mar, 15:50:20 -U- JobFilePurger::do_purge(...): Going to purge job storage...

Revision 40 (2009-03-24) - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

MM with data does not work:
Changed:
<
<
>
>
 Requirements = other.GlueCEInfoHostName != "spacin-ce1.dma.unina.it";
 DataRequirements = {
 [
Line: 12 to 12
 ] }; DataAccessProtocol = "gsiftp";
Changed:
<
<
does not match, while
>
>
does not match, while
 File1: [cesini@ui DataReq]$ lcg-lr --vo infngrid lfn:/grid/infngrid/cesini/e-2M.txt sfn://gridba6.ba.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file1d1e9575-ae91-4d67-9080-f4a7f4687806
Line: 36 to 34
 Current Status: Aborted Status Reason: PID: 11874 - "wmpcoreoperations::jobStart" Submitted: Fri Mar 20 20:04:16 2009 CET
Added:
>
>
  Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
Added:
>
>

Using this repository: Configuration Name: glite-wms_R_3_2_1_12

  1. It seems that for all the jobs Condor is not able to retrieve the standard output of the job; in all these cases the "maradona" mechanism is used.
    24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Going to parse standard output file.
    24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Standard output does not contain useful data.
    24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Standard output was not useful, passing ball to Maradona...
    24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Got info from Maradona...
    24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): Maradona makes another goal !!!
    24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): The legend goes on...
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_8

  1. Recovery mechanism is not able to manage a cancel and a resubmit request for the same job (it should cancel the job).

Revision 392009-03-24 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

MM with data does not work:
Line: 39 to 39
  Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
Deleted:
<
<

Using this repository: Configuration Name: glite-wms_R_3_2_1_11

 

Using this repository: Configuration Name: glite-wms_R_3_2_1_8

  1. Recovery mechanism is not able to manage a cancel and a resubmit request for the same job (it should cancel the job).

Revision 382009-03-23 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

MM with data does not work:
Line: 39 to 39
  Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
Added:
>
>

Using this repository: Configuration Name: glite-wms_R_3_2_1_11

 

Using this repository: Configuration Name: glite-wms_R_3_2_1_8

  1. Recovery mechanism is not able to manage a cancel and a resubmit request for the same job (it should cancel the job).

Revision 372009-03-23 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

MM with data does not work:
Line: 46 to 46
 19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:804): invalid pattern; ignoring all requests 19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobcancel request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q 19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobresubmit request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
Added:
>
>
HOPEFULLY FIXED BY MANAGER_3_2_10
 
  1. Sometimes the submit command fails with this message:
    [ale@cream-15 UI]$ glite-wms-job-submit  -a -c ~/UI/etc/wmp_wms007.conf -o wms007 jdl/fail/fail.jdl
    Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
Line: 54 to 57
 (DestinationURI with https protocol not found) JobId: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg (please contact the server administrator)
Added:
>
>
SHOULD BE A UI PROBLEM
 
  1. LogMonitor dies when invokes operator()(purger.cpp:432)
    19 Mar, 15:50:20 -U- JobFilePurger::do_purge(...): Going to purge job storage...
    19 Mar, 15:50:20 -*- MonitorLoop::run(): Got an unhandled standard exception !!!
Line: 103 to 109
 

Using this repository: Configuration Name: glite-wms_branch_3_2_0

  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
Added:
>
>
HOPEFULLY FIXED BY MANAGER_3_2_10
 
  1. Cancel of pending nodes (or jobs) doesn't work FIXED (glite-wms-manager-3.2.1-7). But the final status of the job is ABORTED instead of CANCELLED.
  2. recovery doesn't recognize a mm request: FIXED (glite-wms-manager-3.2.1-7)
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir. Now at startup the request is ignored ("ignoring match request"), but the UI command hangs.
Changed:
<
<
  1. Problems with cancelled nodes during recovery -----> HOPEFULLY FIXED BY MANAGER_3_2_7
>
>
  1. Problems with cancelled nodes during recovery -----> HOPEFULLY FIXED BY MANAGER_3_2_8
 
  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
Line: 125 to 134
 18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/wSfED2AXfBUhcKaJBWvqYg: recoverable node (WAITING) 18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283): https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA: recoverable node (READY) 18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283): https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw: recoverable node (READY) In the above example, https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA and https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw were respectively SUBMITTED and RUNNING.
Changed:
<
<
  1. WM dies after this message: -----------> HOPEFULLY FIXED BY MANAGER_3_2_7
>
>
  1. WM dies after this message: -----------> HOPEFULLY FIXED BY MANAGER_3_2_10
 09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
  1. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)

Revision 362009-03-23 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

MM with data does not work:
Line: 31 to 31
 sfn://t2-se-01.mi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file5a113abe-0a0e-459f-9a2f-2478f8bb6c6a sfn://t2-se-03.lnl.infn.it/data1/infngrid/generated/2007-11-26/filed1f43d74-24ac-47a2-a83a-8b0e8b0d9d4f
Deleted:
<
<
Submission of collections of the form ... does not work

WmProxy log: : Error while checking JDL: requirements: wrong type caught for attribut@

 Log-status Status info for the Job : https://devel20.cnaf.infn.it:9000/LkRfwhr3ibCfROQfCCj0jA Current Status: Aborted
Line: 65 to 60
 19 Mar, 15:50:20 -*- MonitorLoop::run(): Namely: "call to empty boost::function" 19 Mar, 15:50:20 -*- MonitorLoop::run(): Aborting daemon...
Added:
>
>
  1. Submitting a collection with the attribute "requirements = false;" does not work. Using this collection:
    [
    Type = "collection";
    InputSandbox = {"exe/test.sh"};
    requirements = false;
    nodes = {
    [
        JobType = "Normal";
        Executable = "test.sh";
        StdOutput = "test.out";
        StdError = "test.err";
        OutputSandbox = {};
        ]
    }
    ]
    you should find in the wmproxy.log this message:
    23 Mar, 10:30:34 -I- PID: 28115 - "wmpcommon::getType": JDL Type: dag
    23 Mar, 10:30:34 -E- PID: 28115 - "wmpcoreoperations::jobStart": Error while checking JDL: requirements: wrong type caught for attributx
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

Revision 352009-03-23 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

MM with data does not work:
Line: 31 to 31
 sfn://t2-se-01.mi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file5a113abe-0a0e-459f-9a2f-2478f8bb6c6a sfn://t2-se-03.lnl.infn.it/data1/infngrid/generated/2007-11-26/filed1f43d74-24ac-47a2-a83a-8b0e8b0d9d4f
Changed:
<
<
Submission of collections of the form [ Type = "collection"; InputSandbox = {"/home/mcecchi/Test.sh"}; RetryCount = 1; ShallowRetryCount = 2; nodes = { [ JobType = "Normal"; Executable = "Test.sh"; StdOutput = "test.out"; StdError = "test.err"; OutputSandbox = {}; ], does not work
>
>
Submission of collections of the form ... does not work
  WmProxy log: : Error while checking JDL: requirements: wrong type caught for attribut@

Revision 342009-03-21 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Added:
>
>
MM with data does not work:

Requirements = other.GlueCEInfoHostName = "spacin-ce1.dma.unina.it"; DataRequirements = { [ DataCatalogType = "DLI"; DataCatalog = "http://lfcserver.cnaf.infn.it:8085"; InputData = {"lfn:/grid/infngrid/cesini/PI_1M.txt","lfn:/grid/infngrid/cesini/e-2M.txt"}; ] }; DataAccessProtocol = "gsiftp";

does not match, while

File1: [cesini@ui DataReq]$ lcg-lr --vo infngrid lfn:/grid/infngrid/cesini/e-2M.txt sfn://gridba6.ba.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file1d1e9575-ae91-4d67-9080-f4a7f4687806 sfn://gridse.pi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-26/fileb28f35a1-fbb4-413c-920f-b52cf002012e sfn://prod-se-01.pd.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file88152626-c3c9-448e-9275-b01278a0d9d0 sfn://t2-se-01.mi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/fileba70145a-2a5c-4f61-ac18-a4b48bc0e78b sfn://t2-se-03.lnl.infn.it/data1/infngrid/generated/2007-11-26/file2ddaccf9-c62c-417a-826d-3a1755cf3718

File2: [cesini@ui DataReq]$ lcg-lr --vo infngrid lfn:/grid/infngrid/cesini/PI_1M.txt sfn://gridba6.ba.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/filec4384060-5c78-4ee8-ac75-6a5b7161987d sfn://gridse.pi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-26/file6a7ce8b1-3aa9-442b-80c2-21493a60553e sfn://prod-se-01.pd.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/filed03f96ea-4a70-4315-bc3a-a3839c9e0bed sfn://t2-se-01.mi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file5a113abe-0a0e-459f-9a2f-2478f8bb6c6a sfn://t2-se-03.lnl.infn.it/data1/infngrid/generated/2007-11-26/filed1f43d74-24ac-47a2-a83a-8b0e8b0d9d4f

Submission of collections of the form [ Type = "collection"; InputSandbox = {"/home/mcecchi/Test.sh"}; RetryCount = 1; ShallowRetryCount = 2; nodes = { [ JobType = "Normal"; Executable = "Test.sh"; StdOutput = "test.out"; StdError = "test.err"; OutputSandbox = {}; ], does not work

WmProxy log: : Error while checking JDL: requirements: wrong type caught for attribut@

Log-status Status info for the Job : https://devel20.cnaf.infn.it:9000/LkRfwhr3ibCfROQfCCj0jA Current Status: Aborted Status Reason: PID: 11874 - "wmpcoreoperations::jobStart" Submitted: Fri Mar 20 20:04:16 2009 CET

Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2

 

Using this repository: Configuration Name: glite-wms_R_3_2_1_8

  1. Recovery mechanism is not able to manage a cancel and a resubmit request for the same job (it should cancel the job).

Revision 332009-03-20 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_8

Line: 25 to 25
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

Changed:
<
<
  1. WM startup script must be in charge of creating both JC and ICE input -----> HOPEFULLY FIXED BY MANAGER_3_2_8
>
>
  1. WM startup script must be in charge of creating both JC and ICE input FIXED (glite-wms-manager-3.2.1-8)
 
  1. During recovery pending nodes are Aborted FIXED (glite-wms-manager-3.2.1-6)
  2. After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped: FIXED (glite-wms-manager-3.2.1-6)
    16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery

Revision 322009-03-19 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_8

Line: 16 to 16
 (DestinationURI with https protocol not found) JobId: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg (please contact the server administrator)
Added:
>
>
  1. LogMonitor dies when invokes operator()(purger.cpp:432)
    19 Mar, 15:50:20 -U- JobFilePurger::do_purge(...): Going to purge job storage...
    19 Mar, 15:50:20 -*- MonitorLoop::run(): Got an unhandled standard exception !!!
    19 Mar, 15:50:20 -*- MonitorLoop::run(): Namely: "call to empty boost::function"
    19 Mar, 15:50:20 -*- MonitorLoop::run(): Aborting daemon...
    
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

Revision 312009-03-19 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Changed:
<
<

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

>
>

Using this repository: Configuration Name: glite-wms_R_3_2_1_8

 
Changed:
<
<
  1. WM startup script must be in charge of creating both JC and ICE input
>
>
  1. Recovery mechanism is not able to manage a cancel and a resubmit request for the same job (it should cancel the job).
    19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:651): multiple requests [CS] for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q (status 2)
    19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:804): invalid pattern; ignoring all requests
    19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobcancel request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
    19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobresubmit request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
  2. Sometimes the submit command fails with this message:
    [ale@cream-15 UI]$ glite-wms-job-submit  -a -c ~/UI/etc/wmp_wms007.conf -o wms007 jdl/fail/fail.jdl
    Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
    Error - The job has been successfully registered (the JobId is: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg), but an error occurred while transferring files:
    Unable to find Job InputSandbox Relative Path information needed to create the ISB Zipped File(s)
    (DestinationURI with https protocol not found)
    JobId: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg
    (please contact the server administrator)
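The first item above describes the multiple-request pattern that recovery currently rejects as invalid: a cancel and a resubmit queued for the same job, where the report argues the cancel should win. That reduction rule can be sketched as follows — a hypothetical illustration, not the actual recovery.cpp logic:

```python
def reduce_requests(requests):
    """Collapse multiple queued recovery requests for one job id.
    Rule (as the report argues it should be): if a cancel is queued
    together with a resubmit, only the cancel survives, so the job
    is never resubmitted after being cancelled."""
    kinds = [r["type"] for r in requests]
    if "jobcancel" in kinds:
        # keep a single cancel, drop every other request
        return [r for r in requests if r["type"] == "jobcancel"][:1]
    # otherwise keep only the most recent request for the job
    return requests[-1:]
```

With this rule the [CS] pattern in the log would resolve to a plain cancel instead of being discarded as "invalid pattern; ignoring all requests".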
 
Changed:
<
<
HOPEFULLY FIXED BY MANAGER_3_2_8
>
>

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

 
Added:
>
>
  1. WM startup script must be in charge of creating both JC and ICE input -----> HOPEFULLY FIXED BY MANAGER_3_2_8
 
  1. During recovery pending nodes are Aborted FIXED (glite-wms-manager-3.2.1-6)
  2. After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped: FIXED (glite-wms-manager-3.2.1-6)
    16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery
Line: 28 to 41
 

Using this repository: Configuration Name: glite-wms_branch_3_2_0

  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
Changed:
<
<
  1. Cancel of pending nodes (or jobs) doesn't work

HOPEFULLY FIXED BY MANAGER_3_2_7

  1. recovery doesn't recognize a mm request:
     
>
>
  1. Cancel of pending nodes (or jobs) doesn't work FIXED (glite-wms-manager-3.2.1-7). But the final status of the job is ABORTED instead of CANCELLED.
  2. recovery doesn't recognize a mm request: FIXED (glite-wms-manager-3.2.1-7)
 09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
Changed:
<
<
09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2) and it is not removed from the jobdir

HOPEFULLY FIXED BY MANAGER_3_2_7

  1. Problems with cancelled nodes during recovery

HOPEFULLY FIXED BY MANAGER_3_2_7

>
>
09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2) and it is not removed from the jobdir. Now at startup the request is ignored ("ignoring match request"), but the UI command hangs.
  1. Problems with cancelled nodes during recovery -----> HOPEFULLY FIXED BY MANAGER_3_2_7
 
  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
Line: 59 to 63
 18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/wSfED2AXfBUhcKaJBWvqYg: recoverable node (WAITING) 18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283): https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA: recoverable node (READY) 18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283): https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw: recoverable node (READY) In the above example, https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA and https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw were respectively SUBMITTED and RUNNING.
Changed:
<
<
  1. WM dies after this message:
>
>
  1. WM dies after this message: -----------> HOPEFULLY FIXED BY MANAGER_3_2_7
 09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
Deleted:
<
<
HOPEFULLY FIXED BY MANAGER_3_2_7
 
  1. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)

  -- AlessioGianelle - 04 Mar 2009

Revision 302009-03-19 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

Line: 51 to 51
 12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request Program received signal SIGSEGV, Segmentation fault. 0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
Changed:
<
<
  1. A pending collection is completely resubmitted (i.e. including the DONE nodes) after every restart (through the recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
>
>
  1. A pending collection is completely resubmitted (i.e. including the DONE nodes) after every restart (through the recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816 FIXED (glite-wms-manager-3.2.1-7)
 
  1. Recovery of a collection with two "scheduled" nodes and the remaining in "waiting" is treated as if all nodes were recoverable. What happens is that collections node status as returned upon recovery does not seem to match the real LB status.
    
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/fQwKKHBLncwJqJCVbiwOQw: recoverable node (WAITING)

Revision 292009-03-19 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

  1. WM startup script must be in charge of creating both JC and ICE input
Changed:
<
<
  1. During recovery pending nodes are Aborted FIXED (glite-wms-manager-3.2.1-6)
  2. After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped: FIXED (glite-wms-manager-3.2.1-6)
>
>
HOPEFULLY FIXED BY MANAGER_3_2_8

  1. During recovery pending nodes are Aborted FIXED (glite-wms-manager-3.2.1-6)
  2. After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped: FIXED (glite-wms-manager-3.2.1-6)
 16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery

Using this repository: Configuration Name: glite-wms_R_3_2_1_3

Line: 18 to 21
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job. FIXED (glite-wms-manager-3.2.1-4)
Changed:
<
<
  1. It is very hard to stop wm FIXED (glite-wms-manager-3.2.1-6)
>
>
  1. It is very hard to stop wm FIXED (glite-wms-manager-3.2.1-6)
 [root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop stopping workload manager... failure (stop it manually)

Using this repository: Configuration Name: glite-wms_branch_3_2_0

  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
Changed:
<
<
  1. Cancel of pending nodes (or jobs) doesn't work
>
>
  1. Cancel of pending nodes (or jobs) doesn't work
  HOPEFULLY FIXED BY MANAGER_3_2_7
Changed:
<
<
  1. recovery doesn't recognize a mm request:
     
>
>
  1. recovery doesn't recognize a mm request:
     
 09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ 09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2) and it is not removed from the jobdir

HOPEFULLY FIXED BY MANAGER_3_2_7

Changed:
<
<
  1. Problems with cancelled nodes during recovery
>
>
  1. Problems with cancelled nodes during recovery
  HOPEFULLY FIXED BY MANAGER_3_2_7
Changed:
<
<
  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3)
>
>
  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3)
 12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953 12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so 12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
Line: 48 to 51
 12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request Program received signal SIGSEGV, Segmentation fault. 0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
Changed:
<
<
  1. A pending collection is completely resubmitted (i.e. including the DONE nodes) after every restart (through the recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
  2. Recovery of a collection with two "scheduled" nodes and the remaining in "waiting" is treated as if all nodes were recoverable. What happens is that collections node status as returned upon recovery does not seem to match the real LB status.
>
>
  1. A pending collection is completely resubmitted (i.e. including the DONE nodes) after every restart (through the recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816

  1. Recovery of a collection with two "scheduled" nodes and the remaining in "waiting" is treated as if all nodes were recoverable. What happens is that collections node status as returned upon recovery does not seem to match the real LB status.
  18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/fQwKKHBLncwJqJCVbiwOQw: recoverable node (WAITING) 18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/MrLckmBvcibWQl3TnUyF2Q: recoverable node (WAITING)
Line: 61 to 66
  HOPEFULLY FIXED BY MANAGER_3_2_7
Changed:
<
<
  1. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)
>
>
  1. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)
  -- AlessioGianelle - 04 Mar 2009

Revision 282009-03-19 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

Added:
>
>
  1. WM startup script must be in charge of creating both JC and ICE input
 
  1. During recovery pending nodes are Aborted FIXED (glite-wms-manager-3.2.1-6)
  2. After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped: FIXED (glite-wms-manager-3.2.1-6)
    16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery

Revision 272009-03-18 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

Line: 48 to 48
 Program received signal SIGSEGV, Segmentation fault. 0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
  1. A pending collection is completely resubmitted (i.e. including the DONE nodes) after every restart (through the recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
Changed:
<
<
PLEASE ADD MORE DETAILS

  1. Recovery of a collection with two "scheduled" nodes and the remaining in "waiting" is treated as if all nodes were recoverable. What happens is that collections node status as returned upon recovery does not seem to match the real LB status.
>
>
  1. Recovery of a collection with two "scheduled" nodes and the remaining in "waiting" is treated as if all nodes were recoverable. What happens is that collections node status as returned upon recovery does not seem to match the real LB status.
  18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/fQwKKHBLncwJqJCVbiwOQw: recoverable node (WAITING) 18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/MrLckmBvcibWQl3TnUyF2Q: recoverable node (WAITING) 18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/wSfED2AXfBUhcKaJBWvqYg: recoverable node (WAITING) 18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283): https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA: recoverable node (READY)
Changed:
<
<
18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283): https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw: recoverable node (READY)

In the above example, https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA and https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw were respectively SUBMITTED and RUNNING.

>
>
18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283): https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw: recoverable node (READY) In the above example, https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA and https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw were respectively SUBMITTED and RUNNING.
 
  1. WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus

Revision 262009-03-18 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

Line: 49 to 49
 0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
  1. A pending collection is completely resubmitted (i.e. including the DONE nodes) after every restart (through the recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
Changed:
<
<
Even if the problem looks similar, this behaviour has nothing to do with bug #30816, given that the design is different in this new architecture. What happens is that collections node status as returned upon recovery does not seem to match the real LB status.
>
>
PLEASE ADD MORE DETAILS

  1. Recovery of a collection with two "scheduled" nodes and the remaining in "waiting" is treated as if all nodes were recoverable. What happens is that collections node status as returned upon recovery does not seem to match the real LB status.
  18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/fQwKKHBLncwJqJCVbiwOQw: recoverable node (WAITING) 18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/MrLckmBvcibWQl3TnUyF2Q: recoverable node (WAITING)
Line: 62 to 64
 
  1. WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
Changed:
<
<
HOPEFULLY FIXED WHEN THE LATEST LDAP RESTRUCTURING WORKS
>
>
HOPEFULLY FIXED BY MANAGER_3_2_7
 
  1. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)

Revision 252009-03-18 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

Line: 25 to 25
 
  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
  2. Cancel of pending nodes (or jobs) doesn't work
Added:
>
>
HOPEFULLY FIXED BY MANAGER_3_2_7
 
  1. recovery doesn't recognize a mm request:
     
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir
Added:
>
>
HOPEFULLY FIXED BY MANAGER_3_2_7
 
  1. Problems with cancelled nodes during recovery
Added:
>
>
HOPEFULLY FIXED BY MANAGER_3_2_7
 
  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
Line: 53 to 62
 
  1. WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
Changed:
<
<
HOPEFULLY FIXED WITH >=MANAGER_3_2_2
>
>
HOPEFULLY FIXED WHEN THE LATEST LDAP RESTRUCTURING WORKS
 
  1. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)

Revision 242009-03-18 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

Line: 9 to 9
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_3

Changed:
<
<
  1. The output of the startup script of JobController is not correct: FIXED (glite-wms-jobsubmission-3.2.1-3_200903161019.slc4)
>
>
  1. The output of the startup script of JobController is not correct: FIXED (glite-wms-jobsubmission-3.2.1-3)
 [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-jc start Starting JobController daemon(s) [root@wms007 ~]# CondorG...ler... [ OK ]

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

Changed:
<
<
  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job. FIXED (glite-wms-manager-3.2.1-4_200903161021.slc4)
>
>
  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job. FIXED (glite-wms-manager-3.2.1-4)
 
  1. It is very hard to stop wm FIXED (glite-wms-manager-3.2.1-6)
    [root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
    stopping workload manager... failure (stop it manually)
Line: 24 to 24
 

Using this repository: Configuration Name: glite-wms_branch_3_2_0

  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
Deleted:
<
<
HOPEFULLY FIXED WITH glite-wms-ism_R_3_2_1_4
 
  1. Cancel of pending nodes (or jobs) doesn't work
  2. recovery doesn't recognize a mm request:
     
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir
Deleted:
<
<
HOPEFULLY FIXED WITH glite-wms-manager-3.2.1-6
 
  1. problems with cancelled nodes during recovery
Changed:
<
<
  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3_200903131510.slc4)
>
>
  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
Line: 61 to 55
  HOPEFULLY FIXED WITH >=MANAGER_3_2_2
Changed:
<
<
  1. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2_200903131509.slc4)
>
>
  1. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)
  -- AlessioGianelle - 04 Mar 2009 \ No newline at end of file

Revision 232009-03-18 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

Line: 45 to 45
    Program received signal SIGSEGV, Segmentation fault.
    0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
  1. A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (through recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
Added:
>
>
Even if the problem looks similar, this behaviour has nothing to do with bug #30816, given that the design is different in this new architecture. What happens is that the collection node statuses returned upon recovery do not seem to match the real LB status.

    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/fQwKKHBLncwJqJCVbiwOQw: recoverable node (WAITING)
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/MrLckmBvcibWQl3TnUyF2Q: recoverable node (WAITING)
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/wSfED2AXfBUhcKaJBWvqYg: recoverable node (WAITING)
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283): https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA: recoverable node (READY)
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283): https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw: recoverable node (READY)

In the above example, https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA and https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw were respectively SUBMITTED and RUNNING.
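The mismatch above can be sketched as follows. This is only an illustration with hypothetical names (the real logic lives in the manager's recovery code): recovery resubmits every node whose recorded status is still non-terminal, so a stale status snapshot makes it resubmit nodes that are in fact already submitted or running.

```python
# Hedged sketch, hypothetical names: how recovery might pick "recoverable"
# collection nodes from a status snapshot. Not the real WMS API.

RECOVERABLE = {"WAITING", "READY"}  # states the recovery treats as resubmittable

def select_recoverable_nodes(snapshot):
    """Return the node ids whose *recorded* status is non-terminal."""
    return [node for node, status in snapshot.items() if status in RECOVERABLE]

# Status snapshot as seen at recovery time (stale)...
snapshot = {"node-A": "WAITING", "node-B": "READY", "node-C": "DONE"}
# ...node-B is really RUNNING according to LB, but it gets resubmitted anyway.
print(select_recoverable_nodes(snapshot))  # prints ['node-A', 'node-B']
```

The point is that the selection depends entirely on the freshness of the snapshot: any node whose LB status advanced after the snapshot was taken is resubmitted a second time.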

 
  1. WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus

Revision 222009-03-17 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

Line: 31 to 31
 
  1. recovery doesn't recognize a mm request:
     
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir
Changed:
<
<
  1. problems with cancelled nodes during recovery (SHOULD BE INVALID)
>
>
HOPEFULLY FIXED WITH glite-wms-manager-3.2.1-6

  1. problems with cancelled nodes during recovery
 
  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3_200903131510.slc4)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so

Revision 212009-03-17 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

Changed:
<
<
  1. During recovery pending nodes are Aborted

HOPEFULLY FIXED WITH glite-wms-manager_R_3_2_1_6

  1. After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped:
>
>
  1. During recovery pending nodes are Aborted FIXED (glite-wms-manager-3.2.1-6)
  2. After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped: FIXED (glite-wms-manager-3.2.1-6)
 16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery
Deleted:
<
<
HOPEFULLY FIXED WITH glite-wms-manager_R_3_2_1_6
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_3

  1. The output of the startup script of JobController is not correct: FIXED (glite-wms-jobsubmission-3.2.1-3_200903161019.slc4)
Line: 22 to 17
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job. FIXED (glite-wms-manager-3.2.1-4_200903161021.slc4)
Changed:
<
<
  1. It is very hard to stop wm
>
>
  1. It is very hard to stop wm FIXED (glite-wms-manager-3.2.1-6)
    [root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
    stopping workload manager... failure (stop it manually)
Deleted:
<
<
HOPEFULLY FIXED WITH glite-wms-manager_R_3_2_1_6 (and prior)
 

Using this repository: Configuration Name: glite-wms_branch_3_2_0

  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...

HOPEFULLY FIXED WITH glite-wms-ism_R_3_2_1_4

Changed:
<
<
  1. Cancel of pending nodes (or jobs) doesn't work HOPEFULLY FIXED WITH glite-wms-manager_R_3_2_1_6

  1. recovery doesn't recognize a mm request: FIXED (glite-wms-manager-3.2.1-4_200903161021.slc4)
     
>
>
  1. Cancel of pending nodes (or jobs) doesn't work
  2. recovery doesn't recognize a mm request:
     
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir
Deleted:
<
<
 
  1. problems with cancelled nodes during recovery (SHOULD BE INVALID)
Deleted:
<
<
 
  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3_200903131510.slc4)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
Line: 52 to 42
    Program received signal SIGSEGV, Segmentation fault.
    0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
  1. A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (through recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
Deleted:
<
<
HOPEFULLY FIXED WITH glite-wms-manager_R_3_2_1_6
 
  1. WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus

Revision 202009-03-16 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

  1. During recovery pending nodes are Aborted
Added:
>
>
HOPEFULLY FIXED WITH glite-wms-manager_R_3_2_1_6
 
  1. After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped:
    16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery
Added:
>
>
HOPEFULLY FIXED WITH glite-wms-manager_R_3_2_1_6
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_3

Line: 20 to 24
 
  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job. FIXED (glite-wms-manager-3.2.1-4_200903161021.slc4)
  2. It is very hard to stop wm
    [root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
Changed:
<
<
    stopping workload manager... failure (stop it manually)
    TO BE RETESTED WITH MANAGER_3_2_3
>
>
stopping workload manager... failure (stop it manually)

HOPEFULLY FIXED WITH glite-wms-manager_R_3_2_1_6 (and prior)

 

Using this repository: Configuration Name: glite-wms_branch_3_2_0

Changed:
<
<
  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting... --> THIS IS A MEMORY LEAK IN THE NEW LDAP MANAGEMENT CODE
  2. Cancel of pending nodes (or jobs) doesn't work
>
>
  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...

HOPEFULLY FIXED WITH glite-wms-ism_R_3_2_1_4

  1. Cancel of pending nodes (or jobs) doesn't work HOPEFULLY FIXED WITH glite-wms-manager_R_3_2_1_6
 
  1. recovery doesn't recognize a mm request: FIXED (glite-wms-manager-3.2.1-4_200903161021.slc4)
     
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir
Changed:
<
<
  1. problems with cancelled nodes during recovery
>
>
  1. problems with cancels` nodes during recovery (SHOULD BE INVALID)
 
  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3_200903131510.slc4)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
Line: 40 to 52
    Program received signal SIGSEGV, Segmentation fault.
    0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
  1. A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (through recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
Added:
>
>
HOPEFULLY FIXED WITH glite-wms-manager_R_3_2_1_6
 
  1. WM dies after this message:
Changed:
<
<
09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatusTO BE RETESTED WITH MANAGER_3_2_2
>
>
09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus

HOPEFULLY FIXED WITH >=MANAGER_3_2_2

 
  1. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2_200903131509.slc4)

-- AlessioGianelle - 04 Mar 2009

Revision 192009-03-16 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Added:
>
>

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

  1. During recovery pending nodes are Aborted
  2. After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped:
    16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_3

Changed:
<
<
  1. The output of the startup script of JobController is not correct:
>
>
  1. The output of the startup script of JobController is not correct: FIXED (glite-wms-jobsubmission-3.2.1-3_200903161019.slc4)
    [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-jc start
    Starting JobController daemon(s)
    [root@wms007 ~]# CondorG...ler...                          [  OK  ]
Deleted:
<
<
TO BE RETESTED WITH JOBSUBMISSION_3_2_3
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

Changed:
<
<
  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job.
TO BE RETESTED WITH MANAGER_3_2_4
>
>
  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job. FIXED (glite-wms-manager-3.2.1-4_200903161021.slc4)
 
  1. It is very hard to stop wm
    [root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
    stopping workload manager... failure (stop it manually)
    TO BE RETESTED WITH MANAGER_3_2_3
Line: 19 to 22
    [root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
    stopping workload manager... failure (stop it manually)
    TO BE RETESTED WITH MANAGER_3_2_3
Deleted:
<
<
 

Using this repository: Configuration Name: glite-wms_branch_3_2_0

Changed:
<
<
  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
THIS IS A MEMORY LEAK IN THE NEW LDAP MANAGEMENT CODE

  1. Cancel of pending nodes (or jobs) doesn't work TO BE RETESTED WITH MANAGER_3_2_4
  2. recovery doesn't recognize a mm request:
     
>
>
  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting... --> THIS IS A MEMORY LEAK IN THE NEW LDAP MANAGEMENT CODE
  2. Cancel of pending nodes (or jobs) doesn't work
  3. recovery doesn't recognize a mm request: FIXED (glite-wms-manager-3.2.1-4_200903161021.slc4)
     
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir
Deleted:
<
<
DID YOU MEAN A MATCH REQUEST? TO BE RETESTED WITH MANAGER_3_2_4
 
  1. problems with cancelled nodes during recovery
  2. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3_200903131510.slc4)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
Line: 41 to 39
    12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
    Program received signal SIGSEGV, Segmentation fault.
    0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
Changed:
<
<
  1. A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (trough recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
>
>
  1. A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (through recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
 
  1. WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
    TO BE RETESTED WITH MANAGER_3_2_2
  2. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2_200903131509.slc4)

Revision 182009-03-16 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Line: 26 to 26
 
  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
THIS IS A MEMORY LEAK IN THE NEW LDAP MANAGEMENT CODE
Changed:
<
<
  1. Cancel of pending nodes (or jobs) doesn't work > YET TO BE IMPLEMENTED
>
>
  1. Cancel of pending nodes (or jobs) doesn't work TO BE RETESTED WITH MANAGER_3_2_4
 
  1. recovery doesn't recognize a mm request:
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir
Changed:
<
<
TO BE RETESTED WITH MANAGER_3_2_4
>
>
DID YOU MEAN A MATCH REQUEST? TO BE RETESTED WITH MANAGER_3_2_4
 
  1. problems with cancelled nodes during recovery
  2. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3_200903131510.slc4)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953

Revision 172009-03-14 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Line: 14 to 14
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job.
Added:
>
>
TO BE RETESTED WITH MANAGER_3_2_4
 
  1. It is very hard to stop wm
    [root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
    stopping workload manager... failure (stop it manually)
    TO BE RETESTED WITH MANAGER_3_2_3
Line: 22 to 23
 

Using this repository: Configuration Name: glite-wms_branch_3_2_0

Changed:
<
<
  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting... THIS SMELLS LIKE A MEMORY LEAK
>
>
  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
THIS IS A MEMORY LEAK IN THE NEW LDAP MANAGEMENT CODE
 
  1. Cancel of pending nodes (or jobs) doesn't work > YET TO BE IMPLEMENTED
  2. recovery doesn't recognize a mm request:
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir
Added:
>
>
TO BE RETESTED WITH MANAGER_3_2_4
 
  1. problems with cancels` nodes during recovery
  2. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3_200903131510.slc4)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953

Revision 162009-03-13 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Line: 9 to 9
    Starting JobController daemon(s)
    [root@wms007 ~]# CondorG...ler...                          [  OK  ]
Added:
>
>
TO BE RETESTED WITH JOBSUBMISSION_3_2_3
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job.

Revision 152009-03-13 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Added:
>
>

Using this repository: Configuration Name: glite-wms_R_3_2_1_3

  1. The output of the startup script of JobController is not correct:
    [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-jc start
    Starting JobController daemon(s)
    [root@wms007 ~]# CondorG...ler...                          [  OK  ]
 

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job.
Line: 18 to 26
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir
  1. problems with cancelled nodes during recovery
Changed:
<
<
  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED
>
>
  1. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3_200903131510.slc4)
 12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953 12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so 12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
Line: 30 to 38
 
  1. A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (trough recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
  2. WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
    TO BE RETESTED WITH MANAGER_3_2_2
Changed:
<
<
  1. Condor fails to start TO BE RETESTED WITH JOBSUBMISSION_3_2_2 FIXED
>
>
  1. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2_200903131509.slc4)
  -- AlessioGianelle - 04 Mar 2009 \ No newline at end of file

Revision 142009-03-13 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

Line: 6 to 6
 
  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job.
  2. It is very hard to stop wm
    [root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
Changed:
<
<
stopping workload manager... failure (stop it manually)
>
>
    stopping workload manager... failure (stop it manually)
    TO BE RETESTED WITH MANAGER_3_2_3
 

Using this repository: Configuration Name: glite-wms_branch_3_2_0

Revision 132009-03-13 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job.
Added:
>
>
  1. It is very hard to stop wm
    [root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
    stopping workload manager... failure (stop it manually)
 

Using this repository: Configuration Name: glite-wms_branch_3_2_0

Revision 122009-03-12 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

Line: 24 to 24
    12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
    Program received signal SIGSEGV, Segmentation fault.
    0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
Changed:
<
<
  1. A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (trough recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g)
>
>
  1. A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (trough recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
 
  1. WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
    TO BE RETESTED WITH MANAGER_3_2_2
  2. Condor fails to start TO BE RETESTED WITH JOBSUBMISSION_3_2_2 FIXED

Revision 112009-03-12 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

Added:
>
>
  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job.
 
Deleted:
<
<

Using this repository: Configuration Name: glite-wms_branch_3_2_0

  • 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting... THIS SMELLS LIKE A MEMORY LEAK
 
Added:
>
>

Using this repository: Configuration Name: glite-wms_branch_3_2_0

 
Deleted:
<
<
  • Cancel of pending nodes (or jobs) doesn't work > YET TO BE IMPLEMENTED
 
Changed:
<
<
  • recovery doesn't recognize a mm request:
>
>
  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting... THIS SMELLS LIKE A MEMORY LEAK
  2. Cancel of pending nodes (or jobs) doesn't work > YET TO BE IMPLEMENTED
  3. recovery doesn't recognize a mm request:
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir
Changed:
<
<
  • problems with cancelled nodes during recovery

  • Recovery of a collection with all aborted nodes causes a crash in the wm
>
>
  1. problems with cancelled nodes during recovery
  2. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
Line: 27 to 24
    12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
    Program received signal SIGSEGV, Segmentation fault.
    0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
Changed:
<
<

  • WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
TO BE RETESTED WITH MANAGER_3_2_2

  • Condor fails to start _ TO BE RETESTED WITH JOBSUBMISSION_3_2_2
>
>
  1. A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (trough recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g)
  2. WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
    TO BE RETESTED WITH MANAGER_3_2_2
  3. Condor fails to start TO BE RETESTED WITH JOBSUBMISSION_3_2_2 FIXED
  -- AlessioGianelle - 04 Mar 2009 \ No newline at end of file

Revision 102009-03-12 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

Line: 18 to 18
 
  • problems with cancelled nodes during recovery
Changed:
<
<
  • Recovery of a collection with all aborted nodes causes a crash in the wm
>
>
  • Recovery of a collection with all aborted nodes causes a crash in the wm
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:402): starting recovery
    12 Mar, 12:44:33 -I: [Info] operator()(recovery.cpp:836): recovering https://devel17.cnaf.infn.it:9000/4ZppCer_07TdPvtmVAsqYQ
    12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
    Program received signal SIGSEGV, Segmentation fault.
    0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
 

Revision 92009-03-12 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"
Changed:
<
<

Problems

>
>

Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

Using this repository: Configuration Name: glite-wms_branch_3_2_0

 
Deleted:
<
<

Repository: http://etics-hd.cern.ch:8080/repository/pm/volatile/repomd/id/eaff55e8-37ff-4a75-b40f-1544f3232359/slc4_ia32_gcc346

 
  • 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting... THIS SMELLS LIKE A MEMORY LEAK

Revision 82009-03-11 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Repository: http://etics-hd.cern.ch:8080/repository/pm/volatile/repomd/id/eaff55e8-37ff-4a75-b40f-1544f3232359/slc4_ia32_gcc346

Changed:
<
<
  • 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
>
>
  • 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting... THIS SMELLS LIKE A MEMORY LEAK
 
  • Cancel of pending nodes (or jobs) doesn't work > YET TO BE IMPLEMENTED

Revision 72009-03-11 - MarcoCecchi

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Repository: http://etics-hd.cern.ch:8080/repository/pm/volatile/repomd/id/eaff55e8-37ff-4a75-b40f-1544f3232359/slc4_ia32_gcc346

  • 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
Changed:
<
<
  • Cancel of pending nodes (or jobs) doesn't work
>
>
  • Cancel of pending nodes (or jobs) doesn't work > YET TO BE IMPLEMENTED
 
  • recovery doesn't recognize a mm request:
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
Line: 19 to 19
 
  • WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
Added:
>
>
TO BE RETESTED WITH MANAGER_3_2_2
 
Changed:
<
<
  • Condor fails to start
>
>
  • Condor fails to start > TO BE RETESTED WITH JOBSUBMISSION_3_2_2
  -- AlessioGianelle - 04 Mar 2009 \ No newline at end of file

Revision 6 2009-03-11 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Line: 20 to 20
 
  • WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
Added:
>
>
  • Condor fails to start
 -- AlessioGianelle - 04 Mar 2009 \ No newline at end of file

Revision 5 2009-03-10 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Problems

Added:
>
>

Repository: http://etics-hd.cern.ch:8080/repository/pm/volatile/repomd/id/eaff55e8-37ff-4a75-b40f-1544f3232359/slc4_ia32_gcc346

 
  • 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...

  • Cancel of pending nodes (or jobs) doesn't work

Revision 4 2009-03-10 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"
Changed:
<
<

Errors

>
>

Problems

 
  • 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
Line: 12 to 12
 
  • problems with cancelling nodes during recovery
Changed:
<
<
  • Recovery of collection with all aborted nodes causes a crash in the wm
>
>
  • Recovery of a collection with all aborted nodes causes a crash in the wm

  • WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
  -- AlessioGianelle - 04 Mar 2009

Revision 3 2009-03-09 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"
Added:
>
>

Errors

 
Added:
>
>
  • 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
 
Added:
>
>
  • Cancel of pending nodes (or jobs) doesn't work
 
Changed:
<
<

Errors

>
>
  • recovery doesn't recognize a mm request:
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir
 
Changed:
<
<
  • 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
>
>
  • problems with cancelling nodes during recovery
 
Added:
>
>
  • Recovery of collection with all aborted nodes causes a crash in the wm
  -- AlessioGianelle - 04 Mar 2009 \ No newline at end of file

Revision 2 2009-03-09 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="TestWokPlan"

Added:
>
>

Errors

  • 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
  -- AlessioGianelle - 04 Mar 2009 \ No newline at end of file

Revision 1 2009-03-04 - AlessioGianelle

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="TestWokPlan"

-- AlessioGianelle - 04 Mar 2009

 