TEST 03252009, Configuration Name: glite-wms_R_3_2_1_11
Problems
Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
Using this repository: Configuration Name: glite-wms_R_3_2_1_12
- It seems that for some jobs Condor is not able to retrieve the standard output of the job; in all these cases the "maradona" mechanism is used.
24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Going to parse standard output file.
24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Standard output does not contain useful data.
24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Standard output was not useful, passing ball to Maradona...
24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Got info from Maradona...
24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): Maradona makes another goal !!!
24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): The legend goes on...
- Submit a collection whose nodes all have "false" requirements, then cancel the collection; when you try to restart the WM, it fails.
<mcecchi> glite-wms-workload_manager: recovery.cpp:387: void glite::wms::manager::server::single_request_recovery(const glite::wms::manager::server::<unnamed>::IdRequests&,
glite::wms::manager::server::Events&, const glite::wms::manager::server::WMReal&, glite::wms::manager::server::<unnamed>::IdToRequests&,
boost::shared_ptr<std::string>): Assertion `requests_for_id.size() == 1' failed.
HOPEFULLY FIXED BY MANAGER
Using this repository: Configuration Name: glite-wms_R_3_2_1_8
- The recovery mechanism is not able to handle a cancel and a resubmit request for the same job (it should cancel the job).
19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:651): multiple requests [CS] for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q (status 2)
19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:804): invalid pattern; ignoring all requests
19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobcancel request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobresubmit request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
HOPEFULLY FIXED BY MANAGER
- Sometimes the submit command fails with this message:
[ale@cream-15 UI]$ glite-wms-job-submit -a -c ~/UI/etc/wmp_wms007.conf -o wms007 jdl/fail/fail.jdl
Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
Error - The job has been successfully registered (the JobId is: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg), but an error occurred while transferring files:
Unable to find Job InputSandbox Relative Path information needed to create the ISB Zipped File(s)
(DestinationURI with https protocol not found)
JobId: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg
(please contact the server administrator)
HOPEFULLY FIXED BY WMPROXY
- LogMonitor dies when it invokes operator()(purger.cpp:432)
19 Mar, 15:50:20 -U- JobFilePurger::do_purge(...): Going to purge job storage...
19 Mar, 15:50:20 -*- MonitorLoop::run(): Got an unhandled standard exception !!!
19 Mar, 15:50:20 -*- MonitorLoop::run(): Namely: "call to empty boost::function"
19 Mar, 15:50:20 -*- MonitorLoop::run(): Aborting daemon...
HOPEFULLY FIXED BY PURGER AND JOBSUBMISSION
- Submitting a collection with the attribute "requirements = false;" doesn't work. Using this collection: FIXED (glite-wms-wmproxy-3.2.1-4.slc4)
[
Type = "collection";
InputSandbox = {"exe/test.sh"};
requirements = false;
nodes = {
[
JobType = "Normal";
Executable = "test.sh";
StdOutput = "test.out";
StdError = "test.err";
OutputSandbox = {};
]
}
]
you should find this message in wmproxy.log:
23 Mar, 10:30:34 -I- PID: 28115 - "wmpcommon::getType": JDL Type: dag
23 Mar, 10:30:34 -E- PID: 28115 - "wmpcoreoperations::jobStart": Error while checking JDL: requirements: wrong type caught for attribute
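The first problem in this section (a cancel and a resubmit request pending for the same job) states that recovery should end up cancelling the job. A minimal Python sketch of that expected decision is given below. It is not the WM recovery code: the function name resolve_requests and the "act on the latest request otherwise" fallback are assumptions made for illustration; only the request names jobcancel and jobresubmit come from the log above.

```python
from typing import List


def resolve_requests(kinds: List[str]) -> str:
    """Pick the single action to apply for one job's pending requests.

    Assumption (illustrative, not the WM's actual logic): a pending
    cancel always wins, whatever else is queued for the same job.
    """
    if not kinds:
        return "none"
    normalized = [k.lower() for k in kinds]
    if "jobcancel" in normalized:
        return "jobcancel"
    # Hypothetical fallback: with no cancel pending, act on the latest request.
    return normalized[-1]


if __name__ == "__main__":
    # The case from the report: cancel + resubmit for the same job.
    print(resolve_requests(["jobcancel", "jobresubmit"]))
```

With this rule the order of the two requests in the jobdir does not matter: the job is cancelled in both cases instead of both requests being ignored.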
Using this repository: Configuration Name: glite-wms_R_3_2_1_5
- WM startup script must be in charge of creating both JC and ICE input FIXED (glite-wms-manager-3.2.1-8)
- During recovery pending nodes are Aborted FIXED (glite-wms-manager-3.2.1-6)
- After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped: FIXED (glite-wms-manager-3.2.1-6)
16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery
Using this repository: Configuration Name: glite-wms_R_3_2_1_3
- The output of the startup script of JobController is not correct: FIXED (glite-wms-jobsubmission-3.2.1-3)
[root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-jc start
Starting JobController daemon(s)
[root@wms007 ~]# CondorG...ler... [ OK ]
Using this repository: Configuration Name: glite-wms_R_3_2_1_2
- After the submission of a collection with 20 jobs with "requirements=false", 20 "Pending" events are logged for the same job instead of one event per job. FIXED (glite-wms-manager-3.2.1-4)
- It is very hard to stop the WM FIXED (glite-wms-manager-3.2.1-6)
[root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
stopping workload manager... failure (stop it manually)
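The "Pending" events problem in this section can be checked mechanically once the logged events have been collected. The Python sketch below assumes the events are already available as (job id, event name) pairs; how they are obtained from LB is not covered here, and the function name pending_event_problems is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def pending_event_problems(
    events: Iterable[Tuple[str, str]], node_ids: List[str]
) -> Dict[str, int]:
    """Return the nodes that did not receive exactly one "Pending" event.

    The result maps each offending node id to the number of "Pending"
    events actually seen for it (0 means the event is missing).
    """
    counts = Counter(job_id for job_id, name in events if name == "Pending")
    return {node: counts[node] for node in node_ids if counts[node] != 1}


if __name__ == "__main__":
    nodes = ["node-1", "node-2", "node-3"]
    # Buggy behaviour: every event is logged to the same job.
    buggy = [("node-1", "Pending")] * 3
    print(pending_event_problems(buggy, nodes))
```

An empty result means every node of the collection got its own single "Pending" event, which is the behaviour expected after the fix.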
Using this repository: Configuration Name: glite-wms_branch_3_2_0
- 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
HOPEFULLY FIXED BY MANAGER_3_2_10
- Cancelling pending nodes (or jobs) doesn't work FIXED (glite-wms-manager-3.2.1-7). However, the final status of the job is ABORTED instead of CANCELLED.
- recovery doesn't recognize a mm request: NOT FIXED
09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
and the request is not removed from the jobdir. Note: the UI command hangs.
- Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3)
12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
12 Mar, 12:44:33 -I: [Info] main(main.cpp:402): starting recovery
12 Mar, 12:44:33 -I: [Info] operator()(recovery.cpp:836): recovering https://devel17.cnaf.infn.it:9000/4ZppCer_07TdPvtmVAsqYQ
12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
Program received signal SIGSEGV, Segmentation fault.
0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
- A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (through the recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816 FIXED (glite-wms-manager-3.2.1-7)
- WM dies after this message: HOPEFULLY FIXED BY MANAGER_3_2_10
09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
- Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)
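The workload manager log excerpts quoted in this report share one line layout (date, time, severity flag, level, function, source file and line, message). When triaging many such lines, a small parser can help group them by source location or job id. The Python sketch below is inferred only from the lines shown in this page, so the pattern is an assumption and may not cover every WM log line; LogMonitor and wmproxy lines use a different layout and are deliberately not matched.

```python
import re
from typing import Dict, Optional, Union

# Layout inferred from lines such as:
# 19 Mar, 12:45:21 -I: [Info] main(main.cpp:411): skipping recovery
WM_LINE = re.compile(
    r"^(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2}), (?P<time>\d{2}:\d{2}:\d{2}) "
    r"-(?P<flag>\w): \[(?P<level>\w+)\] "
    r"(?P<func>.+?)\((?P<file>[\w.]+):(?P<line>\d+)\): (?P<message>.*)$"
)
# Job ids in this report look like https://host:port/unique-string.
JOB_ID = re.compile(r"https://[\w.\-]+:\d+/[\w\-]+")


def parse_wm_line(text: str) -> Optional[Dict[str, Union[str, int, None]]]:
    """Split one WM log line into fields; return None if it does not match."""
    match = WM_LINE.match(text.strip())
    if match is None:
        return None
    record: Dict[str, Union[str, int, None]] = dict(match.groupdict())
    record["line"] = int(match.group("line"))
    job = JOB_ID.search(match.group("message"))
    record["job_id"] = job.group(0) if job else None
    return record


if __name__ == "__main__":
    sample = (
        "19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): "
        "ignoring jobcancel request for "
        "https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q"
    )
    print(parse_wm_line(sample))
```

Grouping the parsed records by the file and line fields makes it easier to see, for example, which recovery.cpp code paths a given job id went through.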
--
AlessioGianelle - 04 Mar 2009