TESTs

2009-03-25 (Danilo)

  1. MatchMaking with data. OK
    • Files used:
      >lcg-lr lfn:test_e-2M.txt 
      srm://gridit-se-01.cnaf.infn.it/dpm/cnaf.infn.it/home/infngrid/generated/2009-03-23/file881fe282-f7ce-4e0a-bf6e-e45ebc2f54bc 
      srm://gridsrm.ts.infn.it/infngrid/generated/2009-03-23/fileb225a02d-672b-439c-bf60-2ec5939563ac 
      >lcg-lr lfn:test_PI_1M.txt
      srm://gridit-se-01.cnaf.infn.it/dpm/cnaf.infn.it/home/infngrid/generated/2009-03-23/filee260881d-f7b9-4ba5-aadb-c73da3537700 
      srm://gridsrm.ts.infn.it/infngrid/generated/2009-03-23/file175411c7-870d-4302-bcfe-e4e471ebb23b
    • DataRequirements = { [ DataCatalogType = "DLI"; DataCatalog = "http://lfcserver.cnaf.infn.it:8085"; InputData = {"lfn:/grid/infngrid/test_e-2M.txt","lfn:/grid/infngrid/test_PI_1M.txt"}; ] };
    • DataAccessProtocol = "gsiftp";
      [danilo@ui]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a data-req.jdl 
      Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
      
                           COMPUTING ELEMENT IDs LIST
      
       The following CE(s) matching your job requirements have been found
      - grid001.ts.infn.it:2119/jobmanager-lcglsf-cert
      - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
  2. GANG MATCHING
    • JDL Requirements = anyMatch( other.storage.CloseSEs ,target.GlueSAStateAvailableSpace == 87200000);
      [danilo@ui TEST_WMS]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a test_gang.jdl 
      Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
      
                           COMPUTING ELEMENT IDs LIST
      
       The following CE(s) matching your job requirements have been found:
         - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert (the closeSE is gridit-se-01 with 87200000 available space.)
    • JDL Requirements = anyMatch( other.storage.CloseSEs ,target.GlueSAStateAvailableSpace >= 50340000000);
      [danilo@ui TEST_WMS]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a test_gang.jdl 
      Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
      
                           COMPUTING ELEMENT IDs LIST
      
       The following CE(s) matching your job requirements have been found:
      
       - argoce01.na.infn.it:2119/jobmanager-lcgpbs-cert 
       - atlasce1.lnf.infn.it:2119/jobmanager-lcgpbs-cert 
       - grid-ce2.pr.infn.it:2119/jobmanager-pbs-cert 
       - grid001.ts.infn.it:2119/jobmanager-lcglsf-cert 
       - prod-ce-01.pd.infn.it:2119/jobmanager-lcglsf-cert 
       - prod-ce-02.pd.infn.it:2119/jobmanager-lcglsf-cert 
       - t2-ce-01.mi.infn.it:2119/jobmanager-lcgpbs-cert 
       - atlasce01.na.infn.it:2119/jobmanager-lcgpbs-cert 
       - atlas-ce-01.roma1.infn.it:2119/jobmanager-lcglsf-atlasgcert 
       - grid012.ct.infn.it:2119/jobmanager-lcglsf-cert
  3. OUTPUT SE OK
    • OutputSE = "grid007g.cnaf.infn.it";
      [danilo@ui]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a outputSE.jdl 
      Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
      
                           COMPUTING ELEMENT IDs LIST
      
       The following CE(s) matching your job requirements have been found
       - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert 

  1. File perusal with data. OK
  2. DAGS with data. OK
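
The gang-matching requirement in the GANG MATCHING test can be read as: a CE matches when at least one of its close SEs satisfies the predicate. The Python sketch below mimics that reading outside the WMS; it is not the ClassAd implementation. Only the 87200000 figure for gridit-se-01 comes from the test above. The available space given for gridsrm.ts.infn.it, its pairing with grid001.ts.infn.it, and the names any_match and matching_ces are assumptions made for illustration.

```python
from typing import Callable, Dict, List

# CE id -> list of close SEs with their GlueSAStateAvailableSpace.
# gridit-se-01 / 87200000 is taken from the test above; the entry for
# gridsrm.ts.infn.it is an assumed figure used only for illustration.
CLOSE_SES: Dict[str, List[dict]] = {
    "gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert": [
        {"name": "gridit-se-01.cnaf.infn.it", "GlueSAStateAvailableSpace": 87200000},
    ],
    "grid001.ts.infn.it:2119/jobmanager-lcglsf-cert": [
        {"name": "gridsrm.ts.infn.it", "GlueSAStateAvailableSpace": 60000000000},
    ],
}


def any_match(close_ses: List[dict], predicate: Callable[[dict], bool]) -> bool:
    """True when at least one close SE satisfies the predicate."""
    return any(predicate(se) for se in close_ses)


def matching_ces(ces: Dict[str, List[dict]], predicate: Callable[[dict], bool]) -> List[str]:
    """Return the CE ids for which any close SE satisfies the predicate."""
    return [ce for ce, ses in ces.items() if any_match(ses, predicate)]


# Same two predicates as in the JDL requirements above.
exact = matching_ces(CLOSE_SES, lambda se: se["GlueSAStateAvailableSpace"] == 87200000)
large = matching_ces(CLOSE_SES, lambda se: se["GlueSAStateAvailableSpace"] >= 50340000000)
print(exact)
print(large)
```

With these inputs the equality predicate selects only the CE whose close SE reports 87200000, which is consistent with the single CE returned in the first gang-matching run.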

Using this repository: Configuration Name: glite-wms_R_3_2_1_12

  1. It seems that for some jobs condor is not able to retrieve the standard output of the job; in all these cases the "maradona" mechanism is used. DOES NOT SEEM A WMS ISSUE
    24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Going to parse standard output file.
    24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Standard output does not contain useful data.
    24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Standard output was not useful, passing ball to Maradona...
    24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Got info from Maradona...
    24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): Maradona makes another goal !!!
    24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): The legend goes on...
  2. Submit a collection whose nodes all have "false" requirements, then cancel the collection; when you then try to restart the wm, it fails. ---> HOPEFULLY FIXED BY MANAGER
      <mcecchi> glite-wms-workload_manager: recovery.cpp:387: void glite::wms::manager::server::single_request_recovery(const glite::wms::manager::server::<unnamed>::IdRequests&, 
     glite::wms::manager::server::Events&, const glite::wms::manager::server::WMReal&, glite::wms::manager::server::<unnamed>::IdToRequests&, 
     boost::shared_ptr<std::string>): Assertion `requests_for_id.size() == 1' failed.
    FIXED BY MANAGER
  3. Cancelling a collection also triggers a cancel for the "Done" nodes:
    Event: Done
    - Arrived                    =    Thu Mar 26 13:48:39 2009 CET
    - Exit code                  =    0
    - Host                       =    wms009.cnaf.infn.it
    - Reason                     =    Job terminated successfully
    - Source                     =    LogMonitor
    - Src instance               =    unique
    - Status code                =    OK
    - Timestamp                  =    Thu Mar 26 13:48:39 2009 CET
    - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
            ---
    Event: Cancel
    - Arrived                    =    Thu Mar 26 15:29:09 2009 CET
    - Host                       =    wms009.cnaf.infn.it
    - Source                     =    WorkloadManager
    - Src instance               =    23245
    - Status code                =    DONE
    - Timestamp                  =    Thu Mar 26 15:29:09 2009 CET
    - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
            ---
    Event: Cancel
    - Arrived                    =    Thu Mar 26 15:29:13 2009 CET
    - Host                       =    wms009.cnaf.infn.it
    - Reason                     =    Cancel requested by WorkloadManager
    - Source                     =    JobController
    - Src instance               =    unique
    - Status code                =    REQ
    - Timestamp                  =    Thu Mar 26 15:29:13 2009 CET
    - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
            ---
    Event: Cancel
    - Arrived                    =    Thu Mar 26 15:29:14 2009 CET
    - Host                       =    wms009.cnaf.infn.it
    - Reason                     =    I'm not able to retrieve the condor ID.
    - Source                     =    JobController
    - Src instance               =    unique
    - Status code                =    REFUSE
    - Timestamp                  =    Thu Mar 26 15:29:13 2009 CET
    - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
  4. Recovery does not correctly handle a node resubmission request (it triggers two resubmissions: one due to the collection recovery and the other due to the node resubmission). FIXED BY MANAGER
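
The behaviour reported for cancelled collections, where "Done" nodes also receive a Cancel, suggests that a collection cancel should skip nodes that have already finished. The Python sketch below illustrates that filtering only; it is an assumption about the expected behaviour, not the WorkloadManager code. The function name, the node ids, and the choice of terminal states are hypothetical.

```python
# States in which a node has already finished. Done, Aborted and Cancelled
# appear on this page; Cleared is added on the assumption that it is
# terminal as well.
TERMINAL_STATES = {"Done", "Aborted", "Cancelled", "Cleared"}


def nodes_to_cancel(node_states):
    """Return the ids of collection nodes that should still receive a Cancel."""
    return [node for node, state in node_states.items() if state not in TERMINAL_STATES]


# Hypothetical collection with one finished node and two active ones.
nodes = {"node_a": "Done", "node_b": "Running", "node_c": "Waiting"}
print(nodes_to_cancel(nodes))
```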

Using this repository: Configuration Name: glite-wms_R_3_2_1_8

  1. The recovery mechanism is not able to handle a cancel and a resubmit request for the same job (it should cancel the job).
    19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:651): multiple requests [CS] for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q (status 2)
    19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:804): invalid pattern; ignoring all requests
    19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobcancel request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
    19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobresubmit request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q

FIXED BY MANAGER
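
The expected outcome for a cancel and a resubmit request pending on the same job ("it should cancel the job") can be sketched as a small decision function. This is an illustration of the stated expectation only, not the logic in recovery.cpp; the function name is hypothetical, and the request names are borrowed from the log lines above.

```python
def resolve_requests(requests):
    """Pick the single action recovery should apply to one job id.

    Assumption: a pending cancel always wins over any other request.
    """
    kinds = set(requests)
    if "jobcancel" in kinds:
        return "jobcancel"
    if len(kinds) == 1:
        return requests[0]
    # Anything else is left unresolved rather than guessed at.
    raise ValueError(f"unsupported request pattern: {requests}")


print(resolve_requests(["jobcancel", "jobresubmit"]))
```

The order of the requests does not matter in this sketch, so a resubmit logged before the cancel resolves the same way.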

  1. Sometimes the submit command fails with this message:
    [ale@cream-15 UI]$ glite-wms-job-submit  -a -c ~/UI/etc/wmp_wms007.conf -o wms007 jdl/fail/fail.jdl
    Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
    Error - The job has been successfully registered (the JobId is: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg), but an error occurred while transferring files:
    Unable to find Job InputSandbox Relative Path information needed to create the ISB Zipped File(s)
    (DestinationURI with https protocol not found)
    JobId: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg
    (please contact the server administrator)

HOPEFULLY FIXED BY WMPROXY

  1. LogMonitor dies when it invokes operator()(purger.cpp:432)
    19 Mar, 15:50:20 -U- JobFilePurger::do_purge(...): Going to purge job storage...
    19 Mar, 15:50:20 -*- MonitorLoop::run(): Got an unhandled standard exception !!!
    19 Mar, 15:50:20 -*- MonitorLoop::run(): Namely: "call to empty boost::function"
    19 Mar, 15:50:20 -*- MonitorLoop::run(): Aborting daemon...
    

HOPEFULLY FIXED BY PURGER AND JOBSUBMISSION

  1. Submitting a collection with the attribute "requirements = false;" doesn't work. FIXED (glite-wms-wmproxy-3.2.1-4.slc4). Using this collection:
    [
    Type = "collection";
    InputSandbox = {"exe/test.sh"};
    requirements = false;
    nodes = {
    [
        JobType = "Normal";
        Executable = "test.sh";
        StdOutput = "test.out";
        StdError = "test.err";
        OutputSandbox = {};
        ]
    }
    ]
    you should find this message in wmproxy.log:
    23 Mar, 10:30:34 -I- PID: 28115 - "wmpcommon::getType": JDL Type: dag
    23 Mar, 10:30:34 -E- PID: 28115 - "wmpcoreoperations::jobStart": Error while checking JDL: requirements: wrong type caught for attributx

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

  1. WM startup script must be in charge of creating both JC and ICE input FIXED (glite-wms-manager-3.2.1-8)
  2. During recovery pending nodes are Aborted FIXED (glite-wms-manager-3.2.1-6)
  3. After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped: FIXED (glite-wms-manager-3.2.1-6)
    16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery

Using this repository: Configuration Name: glite-wms_R_3_2_1_3

  1. The output of the startup script of JobController is not correct: FIXED (glite-wms-jobsubmission-3.2.1-3)
    [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-jc start
    Starting JobController daemon(s)
    [root@wms007 ~]# CondorG...ler...                          [  OK  ]

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job. FIXED (glite-wms-manager-3.2.1-4)
  2. It is very hard to stop wm FIXED (glite-wms-manager-3.2.1-6)
    [root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
    stopping workload manager... failure (stop it manually)
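
To reproduce the 20-job collection with "requirements=false" described above, the JDL can be generated rather than written by hand. The Python sketch below reuses the node layout of the collection JDL shown earlier on this page. Placing requirements at collection level and the helper name collection_jdl are assumptions; the page does not say how the 20-job JDL was actually written.

```python
def collection_jdl(n_nodes, executable="test.sh"):
    """Build the text of a collection JDL with n_nodes identical nodes."""
    node = (
        "    [\n"
        '        JobType = "Normal";\n'
        f'        Executable = "{executable}";\n'
        '        StdOutput = "test.out";\n'
        '        StdError = "test.err";\n'
        "        OutputSandbox = {};\n"
        "    ]"
    )
    # Nodes are separated by commas inside the nodes = { ... } list.
    nodes = ",\n".join([node] * n_nodes)
    return (
        "[\n"
        'Type = "collection";\n'
        f'InputSandbox = {{"exe/{executable}"}};\n'
        "requirements = false;\n"
        "nodes = {\n"
        f"{nodes}\n"
        "}\n"
        "]\n"
    )


jdl = collection_jdl(20)
print(jdl)
```

The generated text can be saved to a file and passed to glite-wms-job-submit in the same way as the other JDL files used in these tests.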

Using this repository: Configuration Name: glite-wms_branch_3_2_0

  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...

HOPEFULLY FIXED BY MANAGER_3_2_10

  1. Cancel of pending nodes (or jobs) doesn't work FIXED (glite-wms-manager-3.2.1-7). But the final status of the job is ABORTED instead of CANCELLED.
  2. recovery doesn't recognize a mm request: NOT FIXED
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    The request is also not removed from the jobdir. Note: the UI command hangs.
  3. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:402): starting recovery
    12 Mar, 12:44:33 -I: [Info] operator()(recovery.cpp:836): recovering https://devel17.cnaf.infn.it:9000/4ZppCer_07TdPvtmVAsqYQ
    12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
    Program received signal SIGSEGV, Segmentation fault.
    0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
  4. A pending collection is completely (i.e. including the DONE nodes) resubmitted after every restart (through the recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816 FIXED (glite-wms-manager-3.2.1-7)
  5. WM dies after this message: -----------> HOPEFULLY FIXED BY MANAGER_3_2_10
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
  6. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)

-- AlessioGianelle - 04 Mar 2009

Topic revision: r55 - 2011-02-24 - AlessioGianelle
 