Difference: PreCertificationTestP2876 ( vs. 1)

Revision 1, 2011-02-22 - AlessioGianelle


Problems and solutions

  1.  /opt/glite/yaim/functions/config_glite_lb: line 99: /opt/glite/etc/glite-lb-dbsetup.sql: No such file or directory
     ERROR 1146 (42S02) at line 1: Table 'lbserver20.short_fields' doesn't exist
     ERROR 1146 (42S02) at line 1: Table 'lbserver20.long_fields' doesn't exist
     ERROR 1146 (42S02) at line 1: Table 'lbserver20.states' doesn't exist
     ERROR 1146 (42S02) at line 1: Table 'lbserver20.events' doesn't exist
     /opt/glite/yaim/functions/config_glite_lb: line 190: /opt/glite/etc/init.d/glite-lb-bkserverd: No such file or directory
     /opt/glite/yaim/functions/config_glite_lb: line 200: /opt/glite/etc/init.d/glite-lb-bkserverd: No such file or directory
        ABORT: Service glite-lb-bkserverd failed to start!
        ERROR: Error during the execution of function: config_glite_lb
        ERROR: Error during the configuration.Exiting.                              [FAILED]
        ERROR: One of the functions returned with error without specifying it's nature ! 
    We need to install LB. FIXED
  2.  DEBUG: Skipping function: config_glite_lb_setenv because it is not defined
     DEBUG: Skipping function: config_glite_lb because it is not defined
     ERROR: Error during the configuration.Exiting.                              [FAILED]
    Install glite-yaim-lb. FIXED
  3. ERROR: Unable to execute /etc/init.d/globus-gridftp.
     ERROR: Error during the execution of function: config_globus_gridftp 
    Install glite-initscript-globus-gridftp-1.0.2-1.noarch.rpm. FIXED
  4.  Syntax error on line 242 of /opt/glite/etc/glite_wms_wmproxy_httpd.conf:
     FastCgiConfig: invalid option: -intial-env 
    Fix the template /opt/glite/etc/glite_wms_wmproxy_httpd.conf.template: the FastCgiConfig option is misspelled and should read -initial-env. FIXED
  5. The ICE start/stop script doesn't work
    Fix committed in CVS. FIXED
  6. Wrong glite-yaim-wms version string: glite-yaim-wms ame:-ame:
    The version of yaim-wms is written directly into the Makefile during the CVS checkout (see keyword substitution). Since HEAD is used for the development of this component, the keyword substitution is not correct. When a tag is available the version will be correctly defined. FIXED
  7. Starting program: /opt/glite/bin/glite_wms_wmproxy_server
    (no debugging symbols found)
    [Thread debugging using libthread_db enabled]
    [New Thread 0x2b227bb94260 (LWP 25595)]
    
    Program received signal SIGSEGV, Segmentation fault.
    0x00000000004e94a8 in __static_initialization_and_destruction_0 ()
    (gdb) bt
    #0  0x00000000004e94a8 in __static_initialization_and_destruction_0 ()
    #1  0x0000000000565c26 in __do_global_ctors_aux ()
    #2  0x000000000045665b in _init ()
    #3  0x00002b2274847010 in __CTOR_LIST__ () from /opt/glite/lib64/libglite_lb_clientpp_gcc64dbg.so.4
    #4  0x0000000000565ba7 in __libc_csu_init ()
    #5  0x0000003835e1d92e in __libc_start_main () from /lib64/libc.so.6
    #6  0x0000000000457fc9 in _start ()
    (gdb)
    FIXED
  8.  Warning - Unable to register the job to the service: https://cream-44.pd.infn.it:7443/glite_wms_wmproxy_server
    Unable to create job local directory
    (please contact server administrator)
    
    
    [root@cream-03 ~]# ls -l /opt/glite/bin/glite_wms_wmproxy_dirmanager 
          -r-sr-xr-x 1 nobody nobody 46363 Jul  8 11:00 /opt/glite/bin/glite_wms_wmproxy_dirmanager
    FIXED . Implemented postun in ETICS.
  9. Bug: #72573
    [root@cream-03 ~]# service gLite status
    [...]
    *** glite-lb-locallogger:
    glite-lb-logd not running 
    [...]
    FIXED Using new LB tag (patch 4423)
  10. lbproxy not started
    GLITE_LB_TYPE=proxy is the default behaviour FIXED
  11. [Fri Jul 09 18:05:21 2010] [error] Certificate Verification: Error (24): invalid CA certificate
    [Fri Jul 09 18:05:21 2010] [error] Certificate Verification: Error (26): unsupported certificate purpose
    Trying a simple list-match. Is it a warning?
  12. In /opt/glite/etc/lcmaps/lcmaps.db substitute path = ${moddir} with path = /opt/glite/lib64/modules
    FIXED
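
The substitution in item 12 can be sketched as a small text filter. This is only an illustration: the helper name fix_lcmaps_path is hypothetical, and it assumes the line appears in lcmaps.db literally as "path = ${moddir}" (possibly with different spacing).

```python
import re

# Strings taken from item 12; the helper name is hypothetical.
# Only spaces and tabs are allowed around the tokens, so line ends are preserved.
OLD_PATH_RE = re.compile(r'^([ \t]*path[ \t]*=[ \t]*)\$\{moddir\}[ \t]*$', re.MULTILINE)
NEW_PATH = "/opt/glite/lib64/modules"


def fix_lcmaps_path(text: str) -> str:
    """Replace 'path = ${moddir}' with the absolute module directory."""
    return OLD_PATH_RE.sub(lambda m: m.group(1) + NEW_PATH, text)


if __name__ == "__main__":
    sample = "# LCMAPS config\npath = ${moddir}\n"
    print(fix_lcmaps_path(sample), end="")
```

Lines that do not match the pattern are returned unchanged, so the filter can be run more than once on the same file content.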
  13. Remove SDJRequirements and WmsRequirements from section WorkloadManagerProxy of /opt/glite/etc/glite_wms.conf
    FIXED
  14. In section WorkloadManager of /opt/glite/etc/glite_wms.conf put:
    WmsRequirements  = ( (ShortDeadlineJob =?= TRUE) ? RegExp(".*sdj$", other.GlueCEUniqueID) : !RegExp(".*sdj$", other.GlueCEUniqueID)) && 
    (other.GlueCEPolicyMaxTotalJobs == 0 || other.GlueCEStateTotalJobs < other.GlueCEPolicyMaxTotalJobs);
    FIXED
  15. Correct /opt/glite/etc/glite_wms_wmproxy.gacl: insert a / before the "vo" and the rule /"vo"/Role=NULL/Capability=NULL
    FIXED
  16. Cleared event is not logged:
    27 Jul, 17:41:22 -E- PID: 25792 - "wmputils::doPurge": [Error] remove_path(purger.cpp:256): LB event logging failed LB server (bkserver,lbproxy) store protocol error (1417) - edg_wll_LogEvent(): 
    LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: 
    LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error
    Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused
    27 Jul, 17:41:22 -S- PID: 25792 - "wmpcoreoperations::jobpurge": Unable to complete job purge
    27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": ------------------------------- Fault Description --------------------------------
    27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": Method: jobPurge
    27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": Code: 1202
    27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": Description: The Operation is not allowed: Unable to complete job purge
    27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": Stack: 
    27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": JobOperationException: The Operation is not allowed: Unable to complete job purge
       at jobpurge()[wmpcoreoperations.cpp:2640]
    27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge":    at jobpurge()[wmpcoreoperations.cpp:2546]
    27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge":    at jobPurge()[wmpcoreoperations.cpp:2667]
    27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": ----------------------------------------------------------------------------------
    27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": jobPurge operation completed
    FIXED for normal jobs
  17. Submission to LCG CE doesn't work: Got a job held event, reason: Failed to initialize GAHP
    Possible solutions are:
    • install condor-lcg-1.1.0-1
    • set GT2_GAHP = /opt/condor-7.4.1/sbin/gahp_server and GRID_MONITOR = /opt/condor-7.4.1/libexec/glite/grid_monitor.sh in /opt/condor-c/local.<$HOSTNAME>/condor_config.local
      Waiting for Marteen's investigation. FIXED Decided to use the second option. Changes done in yaim-wms.
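
A quick way to check that the second option of item 17 is in place is to compare the two settings against the values quoted above. The sketch below is not part of yaim-wms: the helper names are hypothetical, and it assumes the file holds simple "NAME = value" lines.

```python
# Expected values copied from item 17; helper names are hypothetical.
EXPECTED = {
    "GT2_GAHP": "/opt/condor-7.4.1/sbin/gahp_server",
    "GRID_MONITOR": "/opt/condor-7.4.1/libexec/glite/grid_monitor.sh",
}


def parse_settings(text):
    """Collect 'NAME = value' pairs, skipping blank lines and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def wrong_settings(text):
    """Return the expected names whose value is missing or different."""
    found = parse_settings(text)
    return {k: found.get(k) for k, v in EXPECTED.items() if found.get(k) != v}


if __name__ == "__main__":
    sample = "GT2_GAHP = /opt/condor-7.4.1/sbin/gahp_server\n"
    print(wrong_settings(sample))
```

An empty result means both settings carry the values listed in the item.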
  18. BUG #73192 Submission failed:
    [ale@egee-rb-03 UI]$ glite-wms-job-submit -a --config ~/UI/etc/wmp_cream-03.conf jdl/env.jdl 
    Connecting to the service https://cream-03.pd.infn.it:7443/glite_wms_wmproxy_server
    
    Warning - Unable to register the job to the service: https://cream-03.pd.infn.it:7443/glite_wms_wmproxy_server
    LB: :2652
    Set logging job failed
    edg_wll_SetLoggingJob
    LB[Proxy] Error: GSSAPI Error
    (failed to load GSI credentials: GSS Major Status: General failure
     (GSS Minor Status Error Chain:
    globus_gsi_gssapi: Error with GSI credential
    globus_gsi_gssapi: Error with gss credential handle
    globus_credential: Valid credentials could not be found in any of the possible locations specified by the credential search order.
    Valid credentials could not be found in any of the possible locations specified by the credential search order.
    Possible (ugly) workarounds for this problem are:
    • chown root.glite /etc/grid-security/host*.pem
    • cp /etc/grid-security/hostcert.pem /home/glite/.globus/usercert.pem and cp /etc/grid-security/hostkey.pem /home/glite/.globus/userkey.pem
      FIXED using new LB client (patch 4423)
  19. WM requires a huge amount of memory:
    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  
    15007 glite     25   0 2657m 2.8g 8320 S  0.0 77.0   2:34.79 glite-wms-workl 
    Monitoring. Google malloc is used by default and is correctly set by yaim.
  20. *** glite-lb-bkserverd:
    Starting glite-lb-bkserver ...Warning: MySQL library version mismatch (compiled '50045', runtime '50077')
     done
    Salvet said: the installed MySQL library is slightly newer than the library used for the glite-lb-bkserver build. This is normal. No problem.
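
The two numbers in the warning of item 20 are MySQL numeric version ids (major*10000 + minor*100 + patch). A minimal sketch to decode them, with a hypothetical helper name:

```python
def decode_mysql_version(version_id: str) -> str:
    """Turn a numeric MySQL version id such as '50045' into dotted form."""
    n = int(version_id)
    major, rest = divmod(n, 10000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


if __name__ == "__main__":
    for label, vid in (("compiled", "50045"), ("runtime", "50077")):
        print(label, decode_mysql_version(vid))
```

Decoded this way, the build used 5.0.45 and the runtime library is 5.0.77, which is the same 5.0 series with a newer patch level.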
  21. Bug #73206 Collection doesn't work. The only suspicious messages are in /var/log/messages:
     Sep 10 15:50:48 cream-44 glite_wms_wmproxy_server[6179]: ts=2010-09-10T13:50:48Z : event=wms.wmpserver_setJobFileSystem() : userid=18118 jobid=https://devel07.cnaf.infn.it:9000/pvRt15v6hiIBbEKj0UeeAw
    Sep 10 15:50:48 cream-44 glite_wms_wmproxy_server[6179]: ts=2010-09-10T13:50:48Z : event=wms.wmpserver_setSubjobFileSystem() : userid=18118 jobid=https://devel07.cnaf.infn.it:9000/pvRt15v6hiIBbEKj0UeeAw
    Sep 10 15:50:50 cream-44 glite_wms_wmproxy_server[6179]: ts=2010-09-10T13:50:50Z : event=wms.wmpserver_submit() : userid=18118 jobid=https://devel07.cnaf.infn.it:9000/pvRt15v6hiIBbEKj0UeeAw
    Sep 10 15:50:50 cream-44 kernel: glite_wms_wmpro[6179] general protection rip:3e86c797e0 rsp:7fff60c65148 error:0
    FIXED
  22. Bug #72970 I'm running on the WMS an lbserver in "both" mode:
     9197 ?        S      0:00 /opt/glite/bin/glite-lb-bkserverd --notif-il-sock=/tmp/glite-lb-notif.sock --notif-il-fprefix=/var/tmp/glite-lb-notif -c /home/glite/.certs/hostcert.pem -k /home/glite/.certs/hostkey.pem -i /var/glite/glite-lb-bkserverd.pid --dump-prefix /var/glite/dump --purge-prefix /var/glite/purge -B --proxy-il-sock /tmp/glite-lbproxy-ilog.sock --proxy-il-fprefix /tmp/glite-lbproxy-ilog_events --policy /opt/glite/etc/glite-lb/glite-lb-authz.conf
    Then I started locallogger:
    [root@devel08 glite]# /opt/glite/etc/init.d/glite-lb-locallogger start
    Starting glite-lb-logd ...This is LocalLogger, part of Workload Management System in EU DataGrid & EGEE.
     done
    Starting glite-lb-interlogd ... done
    but:
    [root@devel08 glite]# /opt/glite/etc/init.d/glite-lb-locallogger status
    glite-lb-logd not running
    instead:
    [root@devel08 ~]# ps ax | grep lb-log
    10049 ?        Ss     0:00 /opt/glite/bin/glite-lb-logd -i /var/glite/glite-lb-logd.pid -c /home/glite/.certs/hostcert.pem -k /home/glite/.certs/hostkey.pem
    12135 pts/0    S+     0:00 grep lb-log
    FIXED Using new LB tag (patch 4423)
  23. Problem during installation:
    Error: Missing Dependency: org.glite.build.common-cpp >= 3.2.1 is needed by package glite-security-lcmaps-without-gsi-1.4.8-5.sl5.x86_64 (ETICS-volatile-build-5b42070f-c48b-4e7b-9819-48810222a0b3-sl5_x86_64_gcc412)
    Error: Missing Dependency: mod_fastcgi >= 2.4.3 is needed by package glite-wms-wmproxy-3.3.0-3.sl5.x86_64 (ETICS-volatile-build-5b42070f-c48b-4e7b-9819-48810222a0b3-sl5_x86_64_gcc412)
    Error: Missing Dependency: mod_fastcgi >= 2.4.3 is needed by package glite-WMS-3.3.0-0.sl5.x86_64 (ETICS-volatile-build-5b42070f-c48b-4e7b-9819-48810222a0b3-sl5_x86_64_gcc412)
    Two packages were missing. FIXED
  24. Resubmission for jobs submitted with -r option is not done. Jobs are aborted with "hit job shallow retry count (0)" even if ShallowRetryCount was set in the JDL FIXED .
  25. Problems with DAG. DAG job reported as failed:
    Current Status:     Done (Exit Code !=0)
    Exit code:          1
    Status Reason:      Warning: job exit code != 0
    Destination:        dagman
    
    and nodes stuck in "Submitted". FIXED .
  26. BUG #75223 FIXED When a job fails, the logged reason is wrong:
    Event: Done
    - Arrived                    =    Thu Nov 11 14:06:54 2010 CET
    - Exit code                  =    1
    - Host                       =    cream-46.pd.infn.it
    - Reason                     =    LM_log_done_beginThu Nov 11 14:03:17 CET 2010: prologue failed with error 1
    - Source                     =    LogMonitor
    - Src instance               =    unique
    - Status code                =    FAILED
    - Timestamp                  =    Thu Nov 11 14:06:54 2010 CET
    Another type of error:
    Event: Done
    - Arrived                    =    Fri Nov 26 12:15:30 2010 CET
    - Exit code                  =    0
    - Host                       =    pamelawn23.na.infn.it
    - Reason                     =    Fri Nov 26 12:13:45 CET 2010: Cannot
    - Source                     =    LRMS
    - Status code                =    OK
    - Timestamp                  =    Fri Nov 26 12:13:46 2010 CET
    - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
  27. Problem installing new build (07/11/2010) FIXED .
      --> Missing Dependency: mod_fastcgi >= 2.4.3 is needed by package glite-wms-wmproxy-3.3.0-4.sl5.x86_64 (ETICS-name-patch_2876_1)
  28. Problem starting bdii:
    Starting BDII slapd: Traceback (most recent call last):
      File "/usr/sbin/bdii-update", line 821, in ?
        create_daemon(config['BDII_LOG_FILE'])
      File "/usr/sbin/bdii-update", line 168, in create_daemon
        e = os.open(log_file, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0644)
    OSError: [Errno 13] Permission denied: '/var/log/bdii/bdii-update.log'
                                                               [  OK  ]
    BDII update process failed to startStarting BDII update pro[FAILED]
    Be sure that the installed version comes from the ETICS repo: bdii-5.0.9-1
  29. BUG #75099 FIXED .
    2010-11-09 13:01:07,649 DEBUG - iceCommandEventQuery::execute() -  TID=[203357600] Database  ID=[1265375986000]
    2010-11-09 13:01:07,650 DEBUG - iceCommandEventQuery::execute() -  TID=[203357600] Exec time ID=[6]
    2010-11-09 13:01:07,651 INFO - scoped_timer iceCommandEventQuery::processEvents() - TID=[203357600] All Events Proc Time 1289304067.650547 1289304067.651501 0.000954
    terminate called after throwing an instance of 'std::out_of_range'
      what():  basic_string::at
    Aborted
    It happens when shallow resubmission is set to -1.
  30. Submission failed Probably FIXED using a new LB server (2.1)
    Warning - Unable to register the job to the service: https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server
    HTTP Error
    <html><head>
    <title>500 Internal Server Error</title>
    </head><body>
    <h1>Internal Server Error</h1>
    <p>The server encountered an internal error or
    misconfiguration and was unable to complete
    your request.</p>
    <p>Please contact the server administrator,
     [no address given] and inform them of the time the error occurred,
    and anything you might have done that may have
    caused the error.</p>
    <p>More information about this error may be available
    in the server error log.</p>
    </body></html>
    
    Error code: SOAP-ENV:Server
    In the /var/log/messages file of the WMS you can read:
    Nov 10 17:10:44 cream-46 kernel: glite_wms_wmpro[15571]: segfault at 0000000000000000 rip 00002b6b3ca32548 rsp 00007fffa962f320 error 4
    Instead the wmproxy log says:
    10 Nov, 17:10:35 -S- PID: 15571 - "WMPEventlogger::registerJob": Register job failed
    edg_wll_RegisterJobProxy
    Exit code: 22
    LB[Proxy] Error: Invalid argument
    (edg_wll_RegisterJobMaster(): unable to register job
    Resource temporarily unavailable;; Logging library ERROR: 
    Resource temporarily unavailable;; edg_wll_DoLogEventServer(): edg_wll_log_direct_read error
    LB server (bkserver,lbproxy) store protocol error;; edg_wll_log_proto_client_direct(): error reading answer from L&B direct server
    LB server (bkserver,lbproxy) store protocol error;; get_reply_gss(): error reading reply
    LB server (bkserver,lbproxy) store protocol error;; gss_reader(): error reading message
    Transport endpoint is not connected;; edg_wll_gss_read_full;; GSS Error: EOF occured;)
    The httpd-wmproxy-errors.log says:
    [Wed Nov 10 17:10:44 2010] [error] [client 193.206.210.108] FastCGI: incomplete headers (0 bytes) received from server "/opt/glite/bin/glite_wms_wmproxy_server"
  31. Cleared event not logged FIXED .
    [ale@cream-03 ~]$ glite-wms-job-output https://devel07.cnaf.infn.it:9000/IvaPn8c4ezVSiHbQP8FJOg
    
    Connecting to the service https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server
    
    Warning - JobPurging not allowed
     (The Operation is not allowed: Unable to complete job purge)
    You need to set the WMS DN in both ACTION "READ_ALL" and "LOG_WMS_EVENTS" of glite-lb-authz.conf file of the LB Server
  32. Cleared event not logged FIXED (2011/01/19).
    [ale@cream-03 ~]$ glite-wms-job-output https://devel07.cnaf.infn.it:9000/Hn0VS-oMpYlT7bdAIXNB7g
    
    Connecting to the service https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server
    
    Warning - JobPurging not allowed
     (Proxy exception: The delegated Proxy has expired)
    This happens when the job proxy has expired.
  33. Bug #75368 FIXED A "DONE OK" job is marked as "ABORTED". The problem is that a failed job should not be aborted (resubmission is possible only for Done Failed jobs). Another case:
     Event: Abort
    - Arrived                    =    Wed Nov 17 13:31:54 2010 CET
    - Host                       =    cream-46.pd.infn.it
    - Reason                     =    BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:Connection timed out-qsub: cannot connect to server gridba3.ba.infn.it (errno=110) Connection timed out-TERM environment variable not set.-) N/A (jobId = CREAM809848230)
    - Source                     =    LogMonitor
    - Timestamp                  =    Wed Nov 17 13:31:53 2010 CET
    - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
       ---
    Event: Resubmission
    - Arrived                    =    Wed Nov 17 13:31:54 2010 CET
    - Host                       =    cream-46.pd.infn.it
    - Reason                     =    Job resubmitted by ICE
    - Result                     =    WILLRESUB
    - Source                     =    LogMonitor
    - Tag                        =    unavailable
    - Timestamp                  =    Wed Nov 17 13:31:54 2010 CET
    - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
  34. St9exception FIXED (2011/02/01). After the "get-output" of a dag, the nodes are in "DONE-OK" state. If you try a get-output for a node at this point you get (see also 53.):
     [ale@cream-03 ~]$ glite-wms-job-output https://devel17.cnaf.infn.it:9000/-ZpRldEJcENxhNiwMqidiw
    
    Connecting to the service https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server
    
    Error - getOutputFileList Error
     (St9exception)
    The right message can instead be found in the wmproxy log file:
    19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": JobId Exception: The Operation is not allowed: The job has not been registered from this Workload Manager Proxy server (https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server) or it has been purged
    19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": ------------------------------- Fault Description --------------------------------
    19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": Method: getOutputFileList
    19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": Code: 1202
    19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": Description: St9exception
    19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": Stack: 
    19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": ----------------------------------------------------------------------------------
    19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": getOutputFileList operation completed
  35. The state of the nodes doesn't change after the output retrieval of the parent of a collection. Bug #77876 FIXED (2011/02/14)
  36. Dag job doesn't work FIXED . Probably the name of the token is wrongly set.
  37. Bug #73715 FIXED the Really Running event is not logged from JC/LM (instead it works for ICE).
  38. The "type" attribute is case sensitive FIXED installing the new jdl's rpm on the UI
    Error - AdSyntaxException
    The following parsing error(s) have been found:
    'node_type' must be "dag"
  39. Bug #75402 FIXED (to verify with the new tag). Synchronization loss between the real validity of the proxy and the expiration time saved in ICE's database; this can happen when the copy of the new proxy fails:
    2010-11-15 10:57:41,869 INFO - DNProxyManager::setUserProxyIfLonger_Legacy() - Setting user proxy to [ /var/glite/SandboxDir/jI/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fjIwx1BAneLSLa93u3CEeAQ/user.proxy] copied to /var/glite/ice/persist_dir/B23D0D7177A8B6234F1985493FA09FF41A4FA98C.proxy] because the old one is less long-lived.
    2010-11-15 10:57:42,019 ERROR - DNProxyManager::setUserProxyIfLonger_Legacy() - Error copying proxy [/var/glite/SandboxDir/jI/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fjIwx1BAneLSLa93u3CEeAQ/user.proxy] to [/var/glite/ice/persist_dir/B23D0D7177A8B6234F1985493FA09FF41A4FA98C.proxy].
    2010-11-15 10:57:42,019 DEBUG - DNProxyManager::setUserProxyIfLonger_Legacy() - New proxy [/var/glite/SandboxDir/jI/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fjIwx1BAneLSLa93u3CEeAQ/user.proxy] has been copied into [/var/glite/ice/persist_dir/B23D0D7177A8B6234F1985493FA09FF41A4FA98C.proxy] - New Expiration Time is [Tue Nov 16 10:56:34 2010]
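
The log in item 39 shows an ERROR for the failed copy followed by a DEBUG line that still reports a new expiration time. One way to avoid that mismatch is to record the new expiration only after the copy has succeeded. The sketch below is not ICE's code: the function name, its arguments, and the dict standing in for ICE's database are all hypothetical.

```python
import os
import shutil
import tempfile


def set_user_proxy_if_longer(new_proxy, new_expiry, stored_proxy, expiry_db, key):
    """Copy new_proxy over stored_proxy if it lives longer.

    expiry_db is a plain dict standing in for ICE's database (hypothetical).
    Returns True only when both the file and the recorded expiry were updated.
    """
    if new_expiry <= expiry_db.get(key, 0):
        return False  # the stored proxy is at least as long-lived
    try:
        shutil.copyfile(new_proxy, stored_proxy)
    except OSError:
        return False  # copy failed: leave the recorded expiry untouched
    expiry_db[key] = new_expiry
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "user.proxy")
        dst = os.path.join(tmp, "stored.proxy")
        with open(src, "w") as fh:
            fh.write("dummy proxy")
        db = {}
        print(set_user_proxy_if_longer(src, 200, dst, db, "dn"), db)
        # A missing source makes the copy fail; the recorded expiry is not changed.
        print(set_user_proxy_if_longer(os.path.join(tmp, "missing"), 300, dst, db, "dn"), db)
```

With this ordering the stored file and the recorded expiration cannot diverge because of a failed copy.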
  40. Problem with expired job proxies in ICE. FIXED (to verify with the new tag). When the proxy expires, the jobs in the ICE queue are not correctly removed:
    2010-11-17 10:08:37,269 DEBUG - iceCommandLBLogging::execute - TID=[] Will not LOG anything to LB for Job [https://devel17.cnaf.infn.it:9000/ZSwHnJmAMblpLYKwz69mFQ] for reason: CreamJobID [CREAM898633002] disappeared from ICE database !
  41. ICE logs Cancel events with a wrong sequence code. FIXED (2011-01-19)
  42. ICE aborted. Monitoring. It happens that ICE aborts with these messages (see also 49.):
    t1169246528:p13952: Fatal error: [Thread System] GLOBUSTHREAD: pthread_mutex_destroy() failed
    
    [Thread System] mutex is locked (EBUSY)t1295124800:p13952: Fatal error: [Thread System] GLOBUSTHREAD: globus_thread_setspecific() failed
    or
    t1313102144:p12551: Fatal error: [Thread System] GLOBUSTHREAD: globus_thread_setspecific() failed
    
    [Thread System] invalid value passed to thread interface (EINVAL)Aborted (core dumped)
  43. glite-wms-ice-db-rm, various errors FIXED (2011/01/19)
    [root@cream-46 persist_dir]# /opt/glite/bin/glite-wms-ice-db-rm -h
    /opt/glite/bin/glite-wms-ice-db-rm: invalid option -- h
    Type /opt/glite/bin/glite-wms-ice-db-rm -h for help
    
    [root@cream-46 persist_dir]# /opt/glite/bin/glite-wms-ice-db-rm
    Must specify at least one of the options --from-file <pathfile> or <gridjobid>
    
    [root@cream-46 persist_dir]# /opt/glite/bin/glite-wms-ice-db-rm --from-file a
    /opt/glite/bin/glite-wms-ice-db-rm: unrecognized option `--from-file' 
  44. feedback doesn't work FIXED The "old" token is not removed!
    [root@devel09 ~]# ls -l  /var/glite/SandboxDir/B4/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fB4ni_5fIMNzn6cY6W8llClSA/
    total 20
    drwxrwx--- 2 dteam035 glite 4096 Nov 30 10:17 input
    -rw-r--r-- 1 dteam035 dteam  133 Nov 30 10:41 Maradona.output
    drwxrwx--- 2 dteam035 glite 4096 Nov 30 10:17 output
    drwxrwx--- 2 dteam035 glite 4096 Nov 30 10:17 peek
    -rw-r--r-- 1 glite    glite    0 Nov 30 10:19 token.txt_0_
    -rw-r--r-- 1 glite    glite    0 Nov 30 10:17 token.txt_1
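
To find the token files of a given job it helps to derive the sandbox directory from the job id. The mapping below is inferred only from the paths shown on this page (":" becomes "_3a", "/" becomes "_2f", "_" becomes "_5f", and the parent directory is the first two characters of the unique part of the id). It is an assumption, not the WMS implementation, and the helper names are hypothetical.

```python
import posixpath

SANDBOX_ROOT = "/var/glite/SandboxDir"


def escape_job_id(job_id: str) -> str:
    """Escape every character except ASCII letters, digits, '.' and '-' as _<hex>."""
    out = []
    for ch in job_id:
        if (ch.isascii() and ch.isalnum()) or ch in ".-":
            out.append(ch)
        else:
            out.append("_%02x" % ord(ch))
    return "".join(out)


def sandbox_dir(job_id: str) -> str:
    """Build the sandbox path: root / first two chars of the unique id / escaped id."""
    unique = job_id.rsplit("/", 1)[-1]
    return posixpath.join(SANDBOX_ROOT, unique[:2], escape_job_id(job_id))


if __name__ == "__main__":
    print(sandbox_dir("https://devel17.cnaf.infn.it:9000/B4ni_IMNzn6cY6W8llClSA"))
```

For the job id used in the example, the result matches the directory listed in item 44.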
  45. Misleading messages in Maradona file: FIXED (2011/02/02)
    LM_log_done_beginThu Dec  2 12:08:53 CET 2010: 0
    LM_log_done_end
    jw exit status = 1
    Then the job fails with this message:
            - Standard output does not contain useful data.Cannot read JobWrapper output, both from Condor and from Maradona.
    Thu Dec  2 12:08:53 CET 2010: 0LM_log_done_beginThu Dec  2 12:08:53 CET 2010: 0LM_log_done_end
  46. Feedback doesn't work with LCG-CE, so these CEs must be excluded in the MM. FIXED (2011/02/10) The proposed fix is that yaim sets:
    WmsRequirements  = ((ShortDeadlineJob =?= TRUE) ? RegExp(".*sdj$", other.GlueCEUniqueID) : !RegExp(".*sdj$", other.GlueCEUniqueID)) && (other.GlueCEPolicyMaxTotalJobs == 0 || other.GlueCEStateTotalJobs < other.GlueCEPolicyMaxTotalJobs) && ((EnableWmsFeedback =?= TRUE) ? (RegExp("cream", other.GlueCEImplementationName, "i")) : true)
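
The three clauses of the WmsRequirements expression above can be read as the predicate sketched below. This only illustrates the logic and is not a ClassAd evaluator: the function name and the plain dicts are hypothetical, and RegExp is assumed to behave like an unanchored search.

```python
import re


def wms_requirements(job: dict, ce: dict) -> bool:
    """Mimic the WmsRequirements expression of item 46 on plain dicts."""
    is_sdj_ce = re.search(r".*sdj$", ce["GlueCEUniqueID"]) is not None
    # '=?=' compares for identity, so a job without ShortDeadlineJob is
    # treated as a normal (non short-deadline) job.
    if job.get("ShortDeadlineJob") is True:
        queue_ok = is_sdj_ce
    else:
        queue_ok = not is_sdj_ce
    # A MaxTotalJobs of 0 means "no limit published".
    max_total = ce["GlueCEPolicyMaxTotalJobs"]
    load_ok = max_total == 0 or ce["GlueCEStateTotalJobs"] < max_total
    # With feedback enabled only CREAM CEs match (case-insensitive).
    if job.get("EnableWmsFeedback") is True:
        name = ce["GlueCEImplementationName"]
        feedback_ok = re.search("cream", name, re.IGNORECASE) is not None
    else:
        feedback_ok = True
    return queue_ok and load_ok and feedback_ok


if __name__ == "__main__":
    lcg_ce = {
        "GlueCEUniqueID": "ce.example.org:2119/jobmanager-pbs-long",
        "GlueCEPolicyMaxTotalJobs": 100,
        "GlueCEStateTotalJobs": 5,
        "GlueCEImplementationName": "LCG-CE",
    }
    print(wms_requirements({}, lcg_ce))
    print(wms_requirements({"EnableWmsFeedback": True}, lcg_ce))
```

The last clause is the new part: a job with EnableWmsFeedback set to TRUE no longer matches a CE whose implementation name does not contain "cream". The CE name in the example is made up.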
  47. WM defines a rescheduled request as a "Submission" FIXED (2011/01/25)
  48. Bug: #76097 During the first mm of a node the "UserTag CEInfoHostName" is not logged FIXED
  49. Under high load both WM and ICE give: t1168148800:p30771: Fatal error: [Thread System] GLOBUSTHREAD: pthread_mutex_lock() failed. Supposedly fixed by the 'newest' GPT in glite 32. Pay attention. (see also 42.)
  50. repo 110114 Yaim fails FIXED (2011/01/31):
    cp: cannot stat `/opt/glite/sbin/glite_wms_wmproxy_load_monitor.template': No such file or directory
    chmod: cannot access `/opt/glite/sbin/glite_wms_wmproxy_load_monitor': No such file or directory
  51. repo 110114 Submission failed FIXED Problem was with fast-cgi in gsoap (2011/01/27) :
    Status: 500 Internal Server Error
    Server: gSOAP/2.7
    Content-Type: text/xml; charset=utf-8
    Content-Length: 831
    Connection: close
    
    <?xml version="1.0" encoding="UTF-8"?>
    <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl" xmlns:jsdlposix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix" xmlns:delegation1="http://www.gridsite.org/namespaces/delegation-1" xmlns:delegationns="http://www.gridsite.org/namespaces/delegation-2" xmlns:ns1="http://glite.org/wms/wmproxy"><SOAP-ENV:Body><SOAP-ENV:Fault SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"><faultcode>SOAP-ENV:Client</faultcode><faultstring>End of file or no input: 'Invalid argument'</faultstring></SOAP-ENV:Fault></SOAP-ENV:Body></SOAP-ENV:Envelope>[Fri Jan 14 16:45:46 2011] [error] [client 193.206.210.108] FastCGI: incomplete headers (0 bytes) received from server "/opt/glite/bin/glite_wms_wmproxy_server"
  52. repo 110114 Cannot take shallow resubmission token FIXED (2011/01/31). In fact, in the sandbox dir of the job the token is named token.txt_0. (note the trailing dot)
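
A stray character like the trailing dot of item 52 is easy to miss in a directory listing. The sketch below flags token names that do not follow the form token.txt_<number>. That form is an assumption based on the names token.txt_0 and token.txt_1 seen on this page, and the helper name is hypothetical.

```python
import re

# Assumed well-formed name: 'token.txt_' followed by digits only.
TOKEN_RE = re.compile(r"token\.txt_\d+")


def unexpected_tokens(names):
    """Return the token-like file names that do not match token.txt_<number>."""
    return [n for n in names
            if n.startswith("token.txt") and not TOKEN_RE.fullmatch(n)]


if __name__ == "__main__":
    listing = ["input", "output", "peek", "token.txt_0.", "token.txt_1"]
    print(unexpected_tokens(listing))
```

Files that are not tokens at all (input, output, peek) are ignored.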
  53. repo 110114 Misleading output message from UI FIXED (2011/02/01) When a user tries to retrieve the output files of a job of another user the message is (see also 34.):
     
    [ale@cream-03 UI]$ glite-wms-job-output https://devel17.cnaf.infn.it:9000/cxyj4EodCZx7qY2vretM6g
    
    Connecting to the service https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server
    
    Error - getOutputFileList Error
     (St9exception)
    The right reason can instead be found in the wmproxy log file:
    19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": JobId Exception: User not authorized to perform this operation
    19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": ------------------------------- Fault Description --------------------------------
    19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": Method: getOutputFileList
    19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": Code: 1202
    19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": Description: St9exception
    19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": Stack: 
    19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": ----------------------------------------------------------------------------------
    19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": getOutputFileList operation completed
  54. repo 110114 deep resubmission doesn't work FIXED (2011/02/07) In fact when the old token is grabbed, wm doesn't create the new one
     31 Jan, 16:37:21 -I: [Info] operator()(dispatcher_utils.cpp:228): new jobresubmit for https://cream-44.pd.infn.it:9000/Ham9zx-h36iNR0FVtjT9Ng
    31 Jan, 16:37:21 -E: [Error] operator()(submit_request.cpp:536): cannot rename temporary token /var/glite/SandboxDir/Ha/https_3a_2f_2fcream-44.pd.infn.it_3a9000_2fHam9zx-h36iNR0FVtjT9Ng/token.txt_0. (error 2)
    31 Jan, 16:37:21 -I: [Info] postpone(submit_request.cpp:227): postponing https://cream-44.pd.infn.it:9000/Ham9zx-h36iNR0FVtjT9Ng (cannot create token for shallow resubmission)
  55. repo 110114 Bug: #77366: Sometimes submission fails because of an LB error. Hopefully FIXED (2011/02/10) :
    Warning - Unable to register the job to the service: https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server
    Register COLLECTIONfailed to LB server:devel17.cnaf.infn.it:9000
    edg_wll_RegisterJobProxy/Sync
    Exit code: 22
    LB[Proxy] Error: Invalid argument
    (edg_wll_RegisterJobMaster(): unable to register job
    Resource temporarily unavailable;; Logging library ERROR: 
    Resource temporarily unavailable;; edg_wll_DoLogEventServer(): edg_wll_log_direct_read error
    LB server (bkserver,lbproxy) store protocol error;; edg_wll_log_proto_client_direct(): error reading answer from L&B direct server
    LB server (bkserver,lbproxy) store protocol error;; get_reply_gss(): error reading reply
    LB server (bkserver,lbproxy) store protocol error;; gss_reader(): error reading message
    Transport endpoint is not connected;; edg_wll_gss_read_full;; GSS Error: EOF occured;)
    
    Method: jobRegister
    
    
    Error - Operation failed
    Unable to find any endpoint where to perform service request
  56. repo 110114 Bug #77055 FIXED (2011/01/31) "MyProxyServer: wrong type caught for attribute" for parametric jobs
  57. repo 110114 Feedback doesn't work FIXED (2011/02/07)
    20 Jan, 13:28:13 -I: [Info] operator()(replanner.cpp:226): created replanning request for job https://cream-46.pd.infn.it:9000/P4kfEQQnyw3g9NL7qyfA3Q with token /var/glite/SandboxDir/P4/https_3a_2f_2fcream-46.pd.infn.it_3a9000_2fP4kfEQQnyw3g9NL7qyfA3Q/token.txt_1
    20 Jan, 13:28:14 -I: [Info] operator()(dispatcher_utils.cpp:310): cannot create LB context (2) for [...] 
  58. repo 110114 Before resubmission ICE must make sure that the job's proxy is valid for at least "n" minutes. Hopefully FIXED (2011/02/10)
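
The check required in item 58 amounts to comparing the proxy's remaining lifetime with a threshold. This is a minimal sketch, not ICE's implementation: the function name is hypothetical and the expiration time is assumed to be available as a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def proxy_valid_for(expires_at, minutes, now=None):
    """Return True if the proxy stays valid for at least `minutes` more minutes."""
    now = now or datetime.now(timezone.utc)
    return expires_at - now >= timedelta(minutes=minutes)


if __name__ == "__main__":
    now = datetime(2011, 2, 10, 12, 0, tzinfo=timezone.utc)
    expires = now + timedelta(minutes=30)
    print(proxy_valid_for(expires, 20, now))
    print(proxy_valid_for(expires, 45, now))
```

Passing "now" explicitly keeps the check deterministic in tests; in normal use it defaults to the current UTC time.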
  59. WMS repo 110131 & LB patch #4623 Yaim error: sed: can't read /opt/glite/etc/gip/glite-info-generic.conf: No such file or directory FIXED (2011/02/07) Probably the glite-info-generic package is missing. After configuration we have:
    [root@cream-44 ~]# ls -l /opt/glite/etc/gip/glite-info-generic.conf
    -rw-r--r-- 1 root root 0 Jan 31 15:45 /opt/glite/etc/gip/glite-info-generic.conf
  60. repo 110204 FIXED (2011/02/14) Cron Purger doesn't work:
    [root@cream-46 ~]#   /opt/glite/sbin/glite-wms-purgeStorage --help
    glite-wms-purgeStorage: glite-wms-purgeStorage.cpp:87: bool<unnamed>::lb_proxy(): Assertion `f_conf' failed.
    Aborted
  61. repo 110204 NOT FIXED (2011/02/11) All the queries done for replanning time out (open bug #78047)
    10 Feb, 10:35:20 -D: [Debug] get_scheduled_jobs(lb_utils.cpp:153): error (110) while querying for scheduled jobs
    10 Feb, 10:35:20 -D: [Debug] operator()(replanner.cpp:382): no jobs in scheduled state for more than 1800 seconds for replanning 
  62. repo 110204 NOT FIXED (2011/02/11) If a collection is aborted with "request expired", its nodes don't change status. The problem is due to the LB query (it is probably related to item 61):
    10 Feb, 07:01:48 -E: [Error] unrecoverable_collection(submit_request.cpp:108): https://cream-46.pd.infn.it:9000/MRy3d5geSzmep67b0VXV1A: unable to retrieve children information from jobstatus
    10 Feb, 07:01:48 -E: [Error] unrecoverable(submit_request.cpp:126): https://cream-46.pd.infn.it:9000/MRy3d5geSzmep67b0VXV1A failed (request expired) 
  63. repo 110204 NOT FIXED (2011/02/11) Submitting a collection with 192 nodes produces this error (see bug #70061):
    
    Status info for the Job : https://cream-44.pd.infn.it:9000/Tck2cFcOKuOjM4gfH9Ermg
    Current Status:     Waiting
    Status Reason:      jobid: unable to complete the operation: the attribute has not been initialised yet
    Submitted:          Fri Feb 11 17:06:23 2011 CET 
  64. repo 110214 NOT FIXED (2011/02/14) The glite-WMS metapackage is wrong (too many dependencies, and google-perftools is missing)
  65. repo 110214 NOT FIXED (2011/02/14) There is a mistake in /opt/glite/yaim/functions/config_info_service_wms (a "\" is missing at line 86); yaim reports this message:
    sed: -e expression #3, char 84: unknown option to `s'
  66. repo 110214 NOT FIXED (2011/02/14) When you cancel a collection, some nodes (the ones which were in state Running or DONE OK) are put in state "Cleared".
  67. repo 110214 NOT FIXED (2011/02/14) WM dies with these messages:
    14 Feb, 16:05:46 -E: [Error] unrecoverable(submit_request.cpp:126): https://devel09.cnaf.infn.it:9000/3hTRNHN65Dv64kYDLKjZkQ failed (hit job shallow retry count (2))
    14 Feb, 16:05:46 -I: [Info] checkRequirement(matchmakerISMImpl.cpp:107): MM for job: https://devel09.cnaf.infn.it:9000/dsX5dYARJ2rAnk--uyYcJg (19/2976 [0] )
    14 Feb, 16:05:46 -E: [Error] handle_synch_signal(signal_handling.cpp:77): Got a synchronous signal (11), stack trace:
    /opt/glite/bin/glite-wms-workload_manager
    /lib64/libpthread.so.0
    classad::EvalState::SetRootScope()
    classad::ClassAd::EvaluateAttr(std::string const&, classad::Value&) const
    glite::wmsutils::classads::evaluate_attribute(classad::ClassAd const&, std::string const&)
    /opt/glite/lib64/libglite_wms_matchmaking.so.0(_ZN5glite3wms11matchmaking17matchmakerISMImpl16checkRequirementERN7classad7Class
    glite::wms::broker::RBSimpleISMImpl::findSuitableCEs(classad::ClassAd const*)
    glite::wms::broker::ResourceBroker::findSuitableCEs(classad::ClassAd const*)
    /opt/glite/lib64/libglite_wms_helper_broker_ism.so
    glite::wms::helper::broker::Helper::resolve(classad::ClassAd const*, boost::shared_ptr<std::string>) const
    glite::wms::helper::Helper::resolve(classad::ClassAd const*, boost::shared_ptr<std::string>) const
    glite::wms::helper::RequestStateMachine::next_step(classad::ClassAd const*, boost::shared_ptr<std::string>)
    glite::wms::helper::Request::Impl::resolve()
    glite::wms::manager::server::Plan(classad::ClassAd const&, boost::shared_ptr<std::string>)
    glite::wms::manager::server::WMReal::submit(classad::ClassAd const&, boost::shared_ptr<_edg_wll_Context>, boost::shared_ptr<std::string>, bool)
    glite::wms::manager::server::SubmitProcessor::operator()()
    boost::function0<void, std::allocator<void> >::operator()() const
    glite::wms::manager::server::Events::run()
    boost::function0<void, std::allocator<boost::function_base> >::operator()() const
    /usr/lib64/libboost_thread.so.2
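
For item 59, the listing shows that glite-info-generic.conf exists after configuration but has size 0. A minimal sketch of a post-configuration check that tells a missing file from an empty one could look like this. The `check_conf` helper is a hypothetical name introduced here for illustration and is not part of yaim; only the file path comes from the item itself.

```shell
#!/bin/sh
# check_conf FILE (hypothetical helper, not part of yaim):
#   returns 0 if FILE exists and is non-empty,
#   returns 1 if FILE exists but is empty,
#   returns 2 if FILE does not exist.
check_conf() {
    if [ ! -e "$1" ]; then
        echo "MISSING: $1"
        return 2
    elif [ ! -s "$1" ]; then
        echo "EMPTY: $1"
        return 1
    fi
    echo "OK: $1"
}

# Path taken from item 59; running the check after yaim is an assumption.
check_conf /opt/glite/etc/gip/glite-info-generic.conf \
    || echo "glite-info-generic.conf is missing or empty"
```

The test relies on `[ -s FILE ]`, which is true only for an existing file with a size greater than zero, so the zero-byte file shown in item 59 would be reported as EMPTY.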

-- AlessioGianelle - 2011-02-22
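
For item 65, one common way to get "unknown option to `s'" from sed is a "/" inside the pattern or replacement that is not escaped with "\". Whether that is exactly what happens at line 86 of config_info_service_wms is an assumption; the sketch below only reproduces this class of error. The input string and the paths in the expressions are made up for illustration and are not taken from the yaim function.

```shell
#!/bin/sh
# Illustrative input; not taken from config_info_service_wms.
input='endpoint=/opt/glite/etc'

# Correct: every "/" inside the pattern and the replacement is escaped.
good=$(printf '%s\n' "$input" | sed -e 's/\/opt\/glite/\/usr/')
echo "good: $good"

# Broken: the "\" before the second "/" is missing, so sed ends the
# pattern at "/opt", takes "glite" as the replacement and then tries
# to read "\/usr/" as flags of the s command, which fails.
if bad_err=$(printf '%s\n' "$input" | sed -e 's/\/opt/glite/\/usr/' 2>&1); then
    bad_rc=0
else
    bad_rc=$?
fi
echo "broken expression: exit status $bad_rc, message: $bad_err"

# Alternative: a different delimiter needs no escaping of "/" at all.
alt=$(printf '%s\n' "$input" | sed -e 's|/opt/glite|/usr|')
echo "alt: $alt"
```

Using a delimiter such as "|" for expressions that contain file paths avoids this kind of mistake, as long as the delimiter itself does not occur in the paths.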

META FILEATTACHMENT attachment="loginfo.txt" attr="" comment="Problem 33: loginfo" date="1289471741" moveby="AlessioGianelle" movedto="EgeeJra1It.PreCertificationTestP2876.loginfo.txt" movedwhen="1298370069" movefrom="EgeeJra1It.WmsTestsP2876.loginfo.txt" name="loginfo.txt" path="loginfo.txt" size="42700" stream="loginfo.txt" tmpFilename="/usr/tmp/CGItemp13678" user="AlessioGianelle" version="1"
META FILEATTACHMENT attachment="status.txt" attr="" comment="Problem 33: status" date="1289471718" moveby="AlessioGianelle" movedto="EgeeJra1It.PreCertificationTestP2876.status.txt" movedwhen="1298370114" movefrom="EgeeJra1It.WmsTestsP2876.status.txt" name="status.txt" path="status.txt" size="2301" stream="status.txt" tmpFilename="/usr/tmp/CGItemp16567" user="AlessioGianelle" version="1"
 