
Problems

Using this repository: Configuration Name: glite-wms_R_3_2_1_3

  1. The JobController startup script prints incorrect output:
    [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-jc start
    Starting JobController daemon(s)
    [root@wms007 ~]# CondorG...ler...                          [  OK  ]

Using this repository: Configuration Name: glite-wms_R_3_2_1_2

  1. After submitting a collection of 20 jobs with "requirements=false", 20 "Pending" events are logged against the same job instead of one event per job.
  2. It is very hard to stop the WM:
    [root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
    stopping workload manager... failure (stop it manually)
    TO BE RETESTED WITH MANAGER_3_2_3

Using this repository: Configuration Name: glite-wms_branch_3_2_0

  1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting... THIS SMELLS LIKE A MEMORY LEAK
  2. Cancelling pending nodes (or jobs) doesn't work: YET TO BE IMPLEMENTED
  3. Recovery doesn't recognize a mm request:
    09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
    09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
    The request is also not removed from the jobdir.
  4. Problems with cancelling nodes during recovery
  5. Recovery of a collection with all aborted nodes causes a crash in the WM. FIXED (glite-wms-manager-3.2.1-3_200903131510.slc4)
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
    12 Mar, 12:44:33 -I: [Info] main(main.cpp:402): starting recovery
    12 Mar, 12:44:33 -I: [Info] operator()(recovery.cpp:836): recovering https://devel17.cnaf.infn.it:9000/4ZppCer_07TdPvtmVAsqYQ
    12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
    Program received signal SIGSEGV, Segmentation fault.
    0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
  6. A pending collection is completely resubmitted (i.e. including the DONE nodes) after every restart through the recovery mechanism; this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816
  7. WM dies after this message:
    09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
    TO BE RETESTED WITH MANAGER_3_2_2
  8. Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2_200903131509.slc4)

-- AlessioGianelle - 04 Mar 2009

Topic revision: r15 - 2009-03-13 - AlessioGianelle