
Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

  1. After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job.
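A check for this symptom could be sketched as follows. It assumes the logged events have already been extracted as (job ID, event name) pairs; the function names and the input format are hypothetical and are not part of any gLite or LB API.

```python
from collections import Counter


def pending_event_counts(events):
    """Count "Pending" events per job ID.

    `events` is an iterable of (job_id, event_name) pairs (assumed format).
    """
    return Counter(job_id for job_id, name in events if name == "Pending")


def jobs_with_duplicate_pending(events):
    """Return the sorted job IDs that received more than one "Pending" event."""
    counts = pending_event_counts(events)
    return sorted(job_id for job_id, n in counts.items() if n > 1)
```

For a correctly logged collection of 20 jobs, each job ID would appear once and the second function would return an empty list; in the faulty case described above, a single job ID would carry all 20 events.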

Using this repository: Configuration Name: glite-wms_branch_3_2_0

  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting... THIS SMELLS LIKE A MEMORY LEAK
  2. Cancel of pending nodes (or jobs) doesn't work > YET TO BE IMPLEMENTED
  3. recovery doesn't recognize a mm request:
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    and it is not removed from the jobdir
  4. Problems with cancel requests for nodes during recovery
  5. Recovery of a collection with all aborted nodes causes a crash in the wm FIXED
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:402): starting recovery
    12 Mar, 12:44:33 -I: [Info] operator()(recovery.cpp:836): recovering https://devel17.cnaf.infn.it:9000/4ZppCer_07TdPvtmVAsqYQ
    12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
    Program received signal SIGSEGV, Segmentation fault.
    0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
  6. A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (through the recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
  7. WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
    TO BE RETESTED WITH MANAGER_3_2_2
  8. Condor fails to start TO BE RETESTED WITH JOBSUBMISSION_3_2_2 FIXED
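
The behaviour expected for items 5 and 6 can be illustrated with a minimal sketch: during recovery of a collection, only nodes that have not reached a final state should be resubmitted, and a collection with no such nodes should yield an empty selection rather than a crash. The set of final states, the function name and the input format below are assumptions made for illustration; this is not the actual select_recoverable_nodes code.

```python
# Assumed set of final node states; the real WM may use a different list.
TERMINAL_STATES = {"DONE", "ABORTED", "CANCELLED", "CLEARED"}


def select_nodes_to_resubmit(node_states):
    """Return the node IDs whose state is not final.

    `node_states` maps node ID -> state name (hypothetical input format).
    An empty result is valid, e.g. when every node is already aborted.
    """
    return [
        node_id
        for node_id, state in node_states.items()
        if state.upper() not in TERMINAL_STATES
    ]
```

With this selection, a node already in a DONE state would not be run a second time after a restart, and a collection whose nodes are all aborted would simply produce nothing to resubmit.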

-- AlessioGianelle - 04 Mar 2009

Topic revision: r12 - 2009-03-12 - AlessioGianelle
 