TESTs
2009-03-25 (Danilo)
- MatchMaking with data. OK
- Files used:
>lcg-lr lfn:test_e-2M.txt
srm://gridit-se-01.cnaf.infn.it/dpm/cnaf.infn.it/home/infngrid/generated/2009-03-23/file881fe282-f7ce-4e0a-bf6e-e45ebc2f54bc
srm://gridsrm.ts.infn.it/infngrid/generated/2009-03-23/fileb225a02d-672b-439c-bf60-2ec5939563ac
>lcg-lr lfn:test_PI_1M.txt
srm://gridit-se-01.cnaf.infn.it/dpm/cnaf.infn.it/home/infngrid/generated/2009-03-23/filee260881d-f7b9-4ba5-aadb-c73da3537700
srm://gridsrm.ts.infn.it/infngrid/generated/2009-03-23/file175411c7-870d-4302-bcfe-e4e471ebb23b
- DataRequirements = { [ DataCatalogType = "DLI"; DataCatalog = "http://lfcserver.cnaf.infn.it:8085"; InputData = {"lfn:/grid/infngrid/test_e-2M.txt","lfn:/grid/infngrid/test_PI_1M.txt"}; ] };
- DataAccessProtocol = "gsiftp";
[danilo@ui]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a data-req.jdl
Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found
- grid001.ts.infn.it:2119/jobmanager-lcglsf-cert
- gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
- GANG MATCHING
- JDL Requirements = anyMatch( other.storage.CloseSEs ,target.GlueSAStateAvailableSpace == 87200000);
[danilo@ui TEST_WMS]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a test_gang.jdl
Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found:
- gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert (the closeSE is gridit-se-01 with 87200000 available space.)
- JDL Requirements = anyMatch( other.storage.CloseSEs ,target.GlueSAStateAvailableSpace >= 50340000000);
[danilo@ui TEST_WMS]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a test_gang.jdl
Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found:
- argoce01.na.infn.it:2119/jobmanager-lcgpbs-cert
- atlasce1.lnf.infn.it:2119/jobmanager-lcgpbs-cert
- grid-ce2.pr.infn.it:2119/jobmanager-pbs-cert
- grid001.ts.infn.it:2119/jobmanager-lcglsf-cert
- prod-ce-01.pd.infn.it:2119/jobmanager-lcglsf-cert
- prod-ce-02.pd.infn.it:2119/jobmanager-lcglsf-cert
- t2-ce-01.mi.infn.it:2119/jobmanager-lcgpbs-cert
- atlasce01.na.infn.it:2119/jobmanager-lcgpbs-cert
- atlas-ce-01.roma1.infn.it:2119/jobmanager-lcglsf-atlasgcert
- grid012.ct.infn.it:2119/jobmanager-lcglsf-cert
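Note on the two gang-matching requirements above: the sketch below is a toy Python model of what anyMatch( other.storage.CloseSEs, ... ) is expected to do, namely keep a CE when at least one of its close SEs satisfies the condition. It is not the WMS implementation. The class and function names are made up for illustration, and the second CE/SE pair (example.org hosts and their space values) is hypothetical; only the gridit-ce-001 / gridit-se-01 pair with 87200000 available space comes from the test above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StorageElement:
    name: str
    available_space: int  # stands in for GlueSAStateAvailableSpace


def any_match(close_ses: List[StorageElement],
              predicate: Callable[[StorageElement], bool]) -> bool:
    """Return True if at least one close SE satisfies the predicate."""
    return any(predicate(se) for se in close_ses)


def matching_ces(close_se_map: Dict[str, List[StorageElement]],
                 predicate: Callable[[StorageElement], bool]) -> List[str]:
    """Keep the CEs for which any close SE satisfies the predicate."""
    return [ce for ce, ses in close_se_map.items() if any_match(ses, predicate)]


# Toy information system: the first entry mirrors the test above,
# the second one is hypothetical.
close_se_map = {
    "gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert": [
        StorageElement("gridit-se-01.cnaf.infn.it", 87200000),
    ],
    "ce.example.org:2119/jobmanager-lcgpbs-cert": [
        StorageElement("se-a.example.org", 1000),
        StorageElement("se-b.example.org", 60000000000),
    ],
}

# Equivalent of: target.GlueSAStateAvailableSpace == 87200000
exact = matching_ces(close_se_map, lambda se: se.available_space == 87200000)
# Equivalent of: target.GlueSAStateAvailableSpace >= 50340000000
at_least = matching_ces(close_se_map, lambda se: se.available_space >= 50340000000)

print(exact)
print(at_least)
```

With the equality condition only the CE close to gridit-se-01 is kept, which matches the first list-match result; with the ">=" condition a CE is kept as soon as one of its close SEs is large enough, even if the others are not.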
- OUTPUT SE OK
- File perusal with data. OK
- DAGS with data. OK
Using this repository: Configuration Name: glite-wms_R_3_2_1_12
- It seems that for some jobs condor is not able to retrieve the standard output of the job; in all these cases the "maradona" mechanism is used. DOES NOT SEEM A WMS ISSUE
24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Going to parse standard output file.
24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Standard output does not contain useful data.
24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Standard output was not useful, passing ball to Maradona...
24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Got info from Maradona...
24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): Maradona makes another goal !!!
24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): The legend goes on...
- Submit a collection in which all the nodes have "false" requirements, then cancel the collection; when you then try to restart the wm, it fails. ---> HOPEFULLY FIXED BY MANAGER
<mcecchi> glite-wms-workload_manager: recovery.cpp:387: void glite::wms::manager::server::single_request_recovery(const glite::wms::manager::server::<unnamed>::IdRequests&,
glite::wms::manager::server::Events&, const glite::wms::manager::server::WMReal&, glite::wms::manager::server::<unnamed>::IdToRequests&,
boost::shared_ptr<std::string>): Assertion `requests_for_id.size() == 1' failed.
FIXED BY MANAGER
- Cancelling a collection triggers a cancel also for the "Done" nodes:
Event: Done
- Arrived = Thu Mar 26 13:48:39 2009 CET
- Exit code = 0
- Host = wms009.cnaf.infn.it
- Reason = Job terminated successfully
- Source = LogMonitor
- Src instance = unique
- Status code = OK
- Timestamp = Thu Mar 26 13:48:39 2009 CET
- User = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
---
Event: Cancel
- Arrived = Thu Mar 26 15:29:09 2009 CET
- Host = wms009.cnaf.infn.it
- Source = WorkloadManager
- Src instance = 23245
- Status code = DONE
- Timestamp = Thu Mar 26 15:29:09 2009 CET
- User = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
---
Event: Cancel
- Arrived = Thu Mar 26 15:29:13 2009 CET
- Host = wms009.cnaf.infn.it
- Reason = Cancel requested by WorkloadManager
- Source = JobController
- Src instance = unique
- Status code = REQ
- Timestamp = Thu Mar 26 15:29:13 2009 CET
- User = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
---
Event: Cancel
- Arrived = Thu Mar 26 15:29:14 2009 CET
- Host = wms009.cnaf.infn.it
- Reason = I'm not able to retrieve the condor ID.
- Source = JobController
- Src instance = unique
- Status code = REFUSE
- Timestamp = Thu Mar 26 15:29:13 2009 CET
- User = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
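To spot this pattern in a longer event dump, a small script can help. The following is a minimal sketch that assumes the dump has the same layout as the excerpt above ("Event: <name>" followed by "- key = value" lines); the function names are illustrative and not part of any gLite tool. The sample text is a shortened copy of the events above.

```python
from datetime import datetime
from typing import Dict, List


def parse_events(dump: str) -> List[Dict[str, str]]:
    """Split an event dump into one dict per 'Event:' block."""
    events: List[Dict[str, str]] = []
    current = None
    for raw in dump.splitlines():
        line = raw.strip()
        if line.startswith("Event:"):
            current = {"Event": line.split(":", 1)[1].strip()}
            events.append(current)
        elif current is not None and line.startswith("- ") and " = " in line:
            key, value = line[2:].split(" = ", 1)
            current[key.strip()] = value.strip()
    return events


def parse_timestamp(text: str) -> datetime:
    """Parse e.g. 'Thu Mar 26 13:48:39 2009 CET', ignoring the zone name."""
    without_zone = text.rsplit(" ", 1)[0]
    return datetime.strptime(without_zone, "%a %b %d %H:%M:%S %Y")


def cancels_after_done(events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the Cancel events logged after the first Done event."""
    done_times = [parse_timestamp(e["Timestamp"])
                  for e in events if e["Event"] == "Done"]
    if not done_times:
        return []
    first_done = min(done_times)
    return [e for e in events
            if e["Event"] == "Cancel"
            and parse_timestamp(e["Timestamp"]) > first_done]


sample = """\
Event: Done
- Source = LogMonitor
- Status code = OK
- Timestamp = Thu Mar 26 13:48:39 2009 CET
---
Event: Cancel
- Source = WorkloadManager
- Status code = DONE
- Timestamp = Thu Mar 26 15:29:09 2009 CET
---
Event: Cancel
- Source = JobController
- Status code = REQ
- Timestamp = Thu Mar 26 15:29:13 2009 CET
---
Event: Cancel
- Source = JobController
- Status code = REFUSE
- Timestamp = Thu Mar 26 15:29:13 2009 CET
"""

late = cancels_after_done(parse_events(sample))
for event in late:
    print(event["Source"], event["Status code"], event["Timestamp"])
```

On the sample, all three Cancel events are reported, since each one is logged after the node was already Done.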
- Recovery does not correctly handle a node resubmission request: it triggers two resubmissions, one due to the collection recovery and the other due to the node resubmission. FIXED BY MANAGER
Using this repository: Configuration Name: glite-wms_R_3_2_1_8
- Recovery mechanism is not able to manage a cancel and a resubmit request for the same job (it should cancel the job).
19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:651): multiple requests [CS] for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q (status 2)
19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:804): invalid pattern; ignoring all requests
19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobcancel request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobresubmit request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
FIXED BY MANAGER
- Sometimes the submit command fails with this message:
[ale@cream-15 UI]$ glite-wms-job-submit -a -c ~/UI/etc/wmp_wms007.conf -o wms007 jdl/fail/fail.jdl
Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
Error - The job has been successfully registered (the JobId is: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg), but an error occurred while transferring files:
Unable to find Job InputSandbox Relative Path information needed to create the ISB Zipped File(s)
(DestinationURI with https protocol not found)
JobId: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg
(please contact the server administrator)
HOPEFULLY FIXED BY WMPROXY
- LogMonitor dies when it invokes operator()(purger.cpp:432):
19 Mar, 15:50:20 -U- JobFilePurger::do_purge(...): Going to purge job storage...
19 Mar, 15:50:20 -*- MonitorLoop::run(): Got an unhandled standard exception !!!
19 Mar, 15:50:20 -*- MonitorLoop::run(): Namely: "call to empty boost::function"
19 Mar, 15:50:20 -*- MonitorLoop::run(): Aborting daemon...
HOPEFULLY FIXED BY PURGER AND JOBSUBMISSION
- Submitting a collection with the attribute "requirements = false;" doesn't work. FIXED (glite-wms-wmproxy-3.2.1-4.slc4). Using this collection:
[
Type = "collection";
InputSandbox = {"exe/test.sh"};
requirements = false;
nodes = {
[
JobType = "Normal";
Executable = "test.sh";
StdOutput = "test.out";
StdError = "test.err";
OutputSandbox = {};
]
}
]
you should find this message in wmproxy.log:
23 Mar, 10:30:34 -I- PID: 28115 - "wmpcommon::getType": JDL Type: dag
23 Mar, 10:30:34 -E- PID: 28115 - "wmpcoreoperations::jobStart": Error while checking JDL: requirements: wrong type caught for attributx
Using this repository: Configuration Name: glite-wms_R_3_2_1_5
- WM startup script must be in charge of creating both JC and ICE input FIXED (glite-wms-manager-3.2.1-8)
- During recovery pending nodes are Aborted FIXED (glite-wms-manager-3.2.1-6)
- After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped: FIXED (glite-wms-manager-3.2.1-6)
16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery
Using this repository: Configuration Name: glite-wms_R_3_2_1_3
- The output of the startup script of JobController is not correct: FIXED (glite-wms-jobsubmission-3.2.1-3)
[root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-jc start
Starting JobController daemon(s)
[root@wms007 ~]# CondorG...ler... [ OK ]
Using this repository: Configuration Name: glite-wms_R_3_2_1_2
- After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job. FIXED (glite-wms-manager-3.2.1-4)
- It is very hard to stop wm FIXED (glite-wms-manager-3.2.1-6)
[root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
stopping workload manager... failure (stop it manually)
Using this repository: Configuration Name: glite-wms_branch_3_2_0
- 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
HOPEFULLY FIXED BY MANAGER_3_2_10
- Cancel of pending nodes (or jobs) doesn't work FIXED (glite-wms-manager-3.2.1-7). But the final status of the job is ABORTED instead of CANCELLED.
- Recovery doesn't recognize a mm request: NOT FIXED
09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
and it is not removed from the jobdir. Note: the ui command hangs.
- Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3)
12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
12 Mar, 12:44:33 -I: [Info] main(main.cpp:402): starting recovery
12 Mar, 12:44:33 -I: [Info] operator()(recovery.cpp:836): recovering https://devel17.cnaf.infn.it:9000/4ZppCer_07TdPvtmVAsqYQ
12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
Program received signal SIGSEGV, Segmentation fault.
0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
- A pending collection is completely resubmitted (i.e. including the DONE nodes) after every restart (through the recovery mechanism); this means that a job can run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
FIXED (glite-wms-manager-3.2.1-7)
- WM dies after this message: -----------> HOPEFULLY FIXED BY MANAGER_3_2_10
09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
- Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)
--
AlessioGianelle - 04 Mar 2009