Problems
Using this repository: Configuration Name: glite-wms_R_3_2_1_5
- During recovery, pending nodes are aborted: FIXED (glite-wms-manager-3.2.1-6)
- After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped: FIXED (glite-wms-manager-3.2.1-6)
16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery
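A quick way to confirm whether recovery actually ran after a restart is to look for the two messages the wm writes at startup ("skipping recovery" above, "starting recovery" in the excerpts further down). The following is only a minimal sketch: the function name is hypothetical, and it assumes the log text has already been read into a string, since the log file location is not stated here.

```python
def recovery_outcome(log_text):
    """Return 'skipped', 'started', or None, depending on which recovery
    message appears last in the given Workload Manager log text."""
    outcome = None
    for line in log_text.splitlines():
        if "skipping recovery" in line:
            outcome = "skipped"
        elif "starting recovery" in line:
            outcome = "started"
    return outcome


# Log line taken from the report above.
sample = "16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery"
print(recovery_outcome(sample))  # skipped
```

The last matching message wins, so if the log covers several restarts the result refers to the most recent one.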
Using this repository: Configuration Name: glite-wms_R_3_2_1_3
- The output of the startup script of JobController is not correct: FIXED (glite-wms-jobsubmission-3.2.1-3)
[root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-jc start
Starting JobController daemon(s)
[root@wms007 ~]# CondorG...ler... [ OK ]
Using this repository: Configuration Name: glite-wms_R_3_2_1_2
- After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job. FIXED (glite-wms-manager-3.2.1-4)
- It is very hard to stop the wm: FIXED (glite-wms-manager-3.2.1-6)
[root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
stopping workload manager... failure (stop it manually)
Using this repository: Configuration Name: glite-wms_branch_3_2_0
- 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
- Cancelling pending nodes (or jobs) doesn't work
HOPEFULLY FIXED BY MANAGER_3_2_7
- Recovery doesn't recognize an mm request:
09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
The request is also not removed from the jobdir.
HOPEFULLY FIXED BY MANAGER_3_2_7
- Problems with cancelling nodes during recovery
HOPEFULLY FIXED BY MANAGER_3_2_7
- Recovery of a collection with all aborted nodes causes a crash in the wm: FIXED (glite-wms-manager-3.2.1-3)
12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
12 Mar, 12:44:33 -I: [Info] main(main.cpp:402): starting recovery
12 Mar, 12:44:33 -I: [Info] operator()(recovery.cpp:836): recovering https://devel17.cnaf.infn.it:9000/4ZppCer_07TdPvtmVAsqYQ
12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
Program received signal SIGSEGV, Segmentation fault.
0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
- A pending collection is completely resubmitted (i.e. including the DONE nodes) after every restart through the recovery mechanism; this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
Even if the problem looks similar, this behaviour has nothing to do with bug #30816, since the design is different in this new architecture. What happens is that the status of the collection's nodes as returned upon recovery does not seem to match the real LB status.
18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254):
https://devel20.cnaf.infn.it:9000/fQwKKHBLncwJqJCVbiwOQw: recoverable node (WAITING)
18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254):
https://devel20.cnaf.infn.it:9000/MrLckmBvcibWQl3TnUyF2Q: recoverable node (WAITING)
18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254):
https://devel20.cnaf.infn.it:9000/wSfED2AXfBUhcKaJBWvqYg: recoverable node (WAITING)
18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283):
https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA: recoverable node (READY)
18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283):
https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw: recoverable node (READY)
In the above example,
https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA and
https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA were respectively SUBMITTED and RUNNING.
- WM dies after this message:
09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
HOPEFULLY FIXED WHEN THE LATEST LDAP RESTRUCTURING WORKS
- Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)
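For the collection recovery problem above (nodes reported as recoverable in WAITING or READY while LB shows a different state), the "recoverable node" debug entries can be extracted from the wm log and compared with statuses checked separately in LB. This is only a sketch: the function names are hypothetical, the regular expression is inferred from the excerpt above, and `lb_states` is assumed to be a mapping filled in by hand or by whatever LB query is available.

```python
import re

# Matches "<jobid>: recoverable node (<STATUS>)" as seen in the excerpt above.
NODE_RE = re.compile(r"(https://\S+):\s+recoverable node \((\w+)\)")


def recovered_node_states(log_text):
    """Return {job_id: status} for every 'recoverable node' entry in log_text."""
    return {job_id: status for job_id, status in NODE_RE.findall(log_text)}


def mismatches(recovered, lb_states):
    """Return {job_id: (recovered_status, lb_status)} for jobs whose status
    at recovery differs from the status known from LB (lb_states)."""
    return {
        job_id: (status, lb_states[job_id])
        for job_id, status in recovered.items()
        if job_id in lb_states and lb_states[job_id] != status
    }


# Two entries copied from the debug log above.
log_text = """\
18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254):
https://devel20.cnaf.infn.it:9000/fQwKKHBLncwJqJCVbiwOQw: recoverable node (WAITING)
18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283):
https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA: recoverable node (READY)
"""
# LB status as reported in the text above for this node.
lb_states = {"https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA": "SUBMITTED"}

for job_id, (seen, real) in mismatches(recovered_node_states(log_text), lb_states).items():
    print(job_id, "recovery:", seen, "LB:", real)
```

Jobs missing from `lb_states` are simply ignored, so the mapping can be built up incrementally for the nodes that were actually checked.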
--
AlessioGianelle - 04 Mar 2009