Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_5

  1. During recovery, pending nodes are Aborted: FIXED (glite-wms-manager-3.2.1-6)
  2. After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped: FIXED (glite-wms-manager-3.2.1-6)
    16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery

Using this repository: Configuration Name: glite-wms_R_3_2_1_3

  1. The output of the startup script of JobController is not correct: FIXED (glite-wms-jobsubmission-3.2.1-3)
    [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-jc start
    Starting JobController daemon(s)
    [root@wms007 ~]# CondorG...ler...                          [  OK  ]

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job. FIXED (glite-wms-manager-3.2.1-4)
  2. It is very hard to stop the WM: FIXED (glite-wms-manager-3.2.1-6)
    [root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
    stopping workload manager... failure (stop it manually)

Using this repository: Configuration Name: glite-wms_branch_3_2_0

  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
  2. Cancelling pending nodes (or jobs) doesn't work

HOPEFULLY FIXED BY MANAGER_3_2_7

  1. Recovery doesn't recognize a mm request, and the request is not removed from the jobdir:
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)

HOPEFULLY FIXED BY MANAGER_3_2_7

  1. Problems with cancelled nodes during recovery

HOPEFULLY FIXED BY MANAGER_3_2_7

  1. Recovery of a collection with all aborted nodes crashes the WM: FIXED (glite-wms-manager-3.2.1-3)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:402): starting recovery
    12 Mar, 12:44:33 -I: [Info] operator()(recovery.cpp:836): recovering https://devel17.cnaf.infn.it:9000/4ZppCer_07TdPvtmVAsqYQ
    12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
    Program received signal SIGSEGV, Segmentation fault.
    0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
  2. A pending collection is resubmitted in full (i.e. including the DONE nodes) after every restart through the recovery mechanism; this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816

PLEASE ADD MORE DETAILS

  1. Recovery of a collection with two "scheduled" nodes and the remaining ones in "waiting" is treated as if all nodes were recoverable. The node statuses of the collection, as returned upon recovery, do not seem to match the real LB statuses.

    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/fQwKKHBLncwJqJCVbiwOQw: recoverable node (WAITING)
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/MrLckmBvcibWQl3TnUyF2Q: recoverable node (WAITING)
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:254): https://devel20.cnaf.infn.it:9000/wSfED2AXfBUhcKaJBWvqYg: recoverable node (WAITING)
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283): https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA: recoverable node (READY)
    18 Mar, 10:04:25 -D: [Debug] operator()(recovery.cpp:283): https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw: recoverable node (READY)

In the above example, https://devel20.cnaf.infn.it:9000/pLzF8vJud_B6P1lo8OsOGA and https://devel20.cnaf.infn.it:9000/3D7JRvT1QDpci3biRroWpw were in fact SUBMITTED and RUNNING, respectively.
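
To add details to a report like this one, it may help to compare the statuses that recovery logs with the statuses the LB actually reports. The following is only a minimal sketch: the function names `recovered_statuses` and `mismatches` are hypothetical, the regular expression assumes the "recoverable node" debug format shown above, and the `lb_statuses` mapping is assumed to be collected separately by whoever runs the check.

```python
import re

# Matches WM debug lines of the form shown above, e.g.
#   ... operator()(recovery.cpp:254): <job id>: recoverable node (WAITING)
# The pattern is an assumption based on those sample lines only.
RECOVERABLE = re.compile(
    r"recovery\.cpp:\d+\): (?P<job_id>https://\S+): "
    r"recoverable node \((?P<status>[A-Z]+)\)"
)


def recovered_statuses(log_text):
    """Return {job_id: status} for every 'recoverable node' entry in log_text."""
    return {
        m.group("job_id"): m.group("status")
        for m in RECOVERABLE.finditer(log_text)
    }


def mismatches(recovered, lb_statuses):
    """Return (job_id, status seen by recovery, LB status) where the two differ.

    lb_statuses is a hypothetical {job_id: status} mapping supplied by the
    caller; job ids missing from it are skipped.
    """
    return [
        (job_id, seen, lb_statuses[job_id])
        for job_id, seen in recovered.items()
        if job_id in lb_statuses and lb_statuses[job_id] != seen
    ]
```

Feeding the WM log and a mapping of the real LB statuses to `mismatches` yields one tuple per node whose status differs, which can be pasted directly into the report.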

  1. WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus

HOPEFULLY FIXED BY MANAGER_3_2_7

  1. Condor fails to start: FIXED (glite-wms-jobsubmission-3.2.1-2)

-- AlessioGianelle - 04 Mar 2009

Topic revision: r26 - 2009-03-18 - MarcoCecchi