---+!! Known issues

%TOC%

---++ Open known issues

Known problems in the CREAM software, or in other software modules affecting a CREAM-based CE. The list refers to known problems affecting the latest release of the software distributed in EMI.

---+++ Problem switching off the JobSubmissionManager (i.e. JOB_SUBMISSION_MANAGER_ENABLE false in /etc/glite-ce-cream/cream-config.xml)

Switching off the JobSubmissionManager (<parameter name="JOB_SUBMISSION_MANAGER_ENABLE" value="false" /> in /etc/glite-ce-cream/cream-config.xml) makes the CREAM service unavailable to users. In particular, the CREAM UI reports the following error message:

"Received NULL fault; the error is due to another cause: FaultString=[CREAM service not available: configuration failed!] - FaultCode=[SOAP-ENV:Server] - FaultSubCode=[SOAP-ENV:Server]"

---+++ Problem at first configuration with EMI-2 CREAM with GE

The first yaim configuration of an EMI-2 CREAM CE using GE as batch system fails with:
<verbatim>
/etc/lrms/scheduler.conf: No such file or directory
</verbatim>
The problem disappears when yaim is run again.

---+++ Problem if there are "extra" characters at the beginning of the tomcat key file

Because of an issue in trustmanager, there can be problems if anything precedes =-----BEGIN ...----= in =/etc/grid-security/tomcat-key.pem=.

The workaround is simply to remove these characters.

---+++ Problem suspending non-running jobs for LSF with the old blparser

When the CREAM CE is configured to use the old blparser, there might be problems suspending jobs when LSF is used. In this case the blparser can crash. It is then automatically restarted by the =blparser_master= process, but for a few minutes submissions won't work.

---+++ Issue with conflicting BUpdaterSGE instances

The gLite service on a CreamCE starts the following services in this exact order: tomcat5, glite-lb-locallogger and glite-ce-blahparser. By default, tomcat5 starts the BUpdaterSGE daemon if it thinks it is not running.
The problem is that, at startup, BUpdaterSGE is then started again by glite-ce-blahparser. This gives rise to two running instances of the BUpdaterSGE daemon and to a race condition while monitoring running jobs. Jobs may end up being cancelled because of the BUpdaterSGE conflicts.

The workaround is to change the order in which the services are started:
<verbatim>
# diff /root/gLite.orig /etc/init.d/gLite
36c36
<     start)  SERVICE_LIST=`cat $GLITE_STARTUP_FILE`
---
>     start)  SERVICE_LIST=`cat $GLITE_STARTUP_FILE | sort`
44c44
<     stop)   SERVICE_LIST=`cat $GLITE_STARTUP_FILE | sort`
---
>     stop)   SERVICE_LIST=`cat $GLITE_STARTUP_FILE | sort -r`
</verbatim>
This will be fixed in EMI 2.

---+++ CREAM jobs are cancelled with status reason=3 in a GE system

When a job is submitted by BLAH to a GE system, the BLAH job_registry is updated by the sge_submit.sh script with status "1"; all subsequent statuses are written to the BLAH job_registry by the BUpdaterSGE daemon. BUpdaterSGE decides the status of a given job by examining the output of a "qstat" command. The tricky case is a job that disappears: was it cancelled, or did it finish? To tell the difference, BUpdaterSGE uses "qacct" to query the accounting log. If the accounting log has information about the job, the job finished; otherwise it was cancelled. The accounting log is queried twice with "qacct -j", one minute apart. If both queries return an error, the job is assumed to be cancelled.
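
The two-step check described above can be sketched in shell as follows. This is only an illustration of the logic, not the actual BUpdaterSGE implementation; the function name =classify_vanished_job= and the =QACCT_CMD= and =RETRY_DELAY= variables are hypothetical and were introduced here so that the sketch can be tried without a GE installation.

```shell
#!/bin/sh
# Sketch of the decision described above; NOT the actual BUpdaterSGE code.
# QACCT_CMD and RETRY_DELAY are hypothetical knobs added for illustration.
QACCT_CMD="${QACCT_CMD:-qacct}"
RETRY_DELAY="${RETRY_DELAY:-60}"

# classify_vanished_job JOBID
# For a job that is no longer listed by qstat, print "finished" if the
# accounting log knows the job, "cancelled" otherwise.
classify_vanished_job() {
    jobid="$1"
    # First query of the accounting log.
    if $QACCT_CMD -j "$jobid" >/dev/null 2>&1; then
        echo finished
        return 0
    fi
    # The record may not be there yet: wait, then query a second time.
    sleep "$RETRY_DELAY"
    if $QACCT_CMD -j "$jobid" >/dev/null 2>&1; then
        echo finished
    else
        echo cancelled
    fi
}
```

Note that, with this logic, a job is reported as cancelled whenever =qacct -j= fails twice for any reason, including when the command cannot be executed at all.
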
If you are seeing systematically cancelled jobs in glite-cream-ce.log, like
<verbatim>
02 Dec 2012 12:43:01,247 org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor - JOB CREAM172977114 STATUS CHANGED: REALLY-RUNNING => CANCELLED [description=Cancelled by CE admin] [failureReason=reason=3] [localUser=XXX] [workerNode=XXX] [delegationId=1354371579.274301]
</verbatim>
this most probably means that the "qstat" and "qacct" commands cannot be successfully executed by tomcat. This can happen for several reasons:
   * The BUpdaterSGE daemon does not inherit the correct GE environment variables
   * The tomcat user is not allowed to query the GE system
   * The accounting file is not shared with the CreamCE

---++++ The BUpdaterSGE daemon does not inherit the correct GE environment variables

If the environment of a BUpdaterSGE process does not include the GE environment variables, the GE client commands (qstat, qconf) cannot be executed by BUpdaterSGE. As a consequence, BUpdaterSGE assumes that jobs have been cancelled (because it receives no information from qstat or qacct). You can check the environment of the BUpdaterSGE process with the following commands, searching for the GE environment variables (SGE_EXECD, SGE_QMASTER, SGE_ROOT, SGE_CLUSTER_NAME, SGE_CELL):
<verbatim>
# ps xuawww | grep -i sge
tomcat    7423  0.6  0.5  37184 21328 ?        S    Nov23 103:56 /usr/bin/BUpdaterSGE
root     30622  0.0  0.0  61180   804 pts/0    R+   13:41   0:00 grep -i sge

# cat /proc/7423/environ
</verbatim>
This can happen if the BUpdaterSGE daemon is restarted by a user other than root (for example, tomcat starts the daemon at boot time and restarts it if the daemon is dead) without sourcing the proper environment. The workaround is to force the environment to be loaded in /etc/init.d/gLite and /etc/init.d/glite-ce-blahparser. This can be done simply by adding a line like the one below at the beginning of both scripts:
<verbatim>
. /etc/profile.d/sge.sh
</verbatim>

---++++ Tomcat user is not allowed to query the GE system

Some GE systems use certificates to encrypt the communication between GE client and server. For CreamCE, tomcat must be able to query your system (the BUpdaterSGE daemon runs as user tomcat). If this is not the case, you will most probably get the following error when running "qstat" as user tomcat:
<verbatim>
su - tomcat
sh-3.2$ qstat -u '*'
error: commlib error: can't set CA chain file (/var/sgeCA/sge_qmaster/GridKa/userkeys/tomcat/cert.pem)
error: commlib error: ssl error ([ID=33558530] in module "system library": "No such file or directory")
error: unable to send message to qmaster using port 15020 on host "xxxxxxxx": can't set CA chain file
</verbatim>

---+++ Significant changes introduced with Torque 2.5.7-1

The updated EPEL5 build of torque-2.5.7-1, unlike previous versions, enables munge as an inter-node authentication method.

See the release notes [[http://www.eu-emi.eu/kebnekaise-products/-/asset_publisher/4BKc/content/cream-torque-module#Release_Notes][here]].

---+++ pbs_server create hangs on first time run

On a fresh installation of torque-server, running pbs_server for the first time with =/etc/init.d/pbs_server start= will hang. The command to create a server database file (=/etc/init.d/pbs_server create=) hangs as well.

Issue tracked at https://bugzilla.redhat.com/show_bug.cgi?id=744138.

---+++ Problems with suspend when the old blparser is used

When the old blparser is used, there are problems with the suspend command.

Relevant bug: https://savannah.cern.ch/bugs/?90085

---+++ Problems running single yaim functions in EMI-2

When configuring an EMI-2 CREAM-CE with yaim, there might be problems if a single function is run (the problem is that the =TOMCAT_VERSION= variable is not defined).
The problem involves the following functions:
   * =config_cream_ce=
   * =config_cream_cemon=
   * =config_cream_emies=
   * =config_cream_gliteservices=
   * =config_cream_stop=

There are no problems if the whole CREAM-CE is configured instead (i.e. =yaim -c -s site-info.def -n creamCE ...=).

---+++ Problems re-enabling CEMon and/or EMI-ES

When configuring an EMI-2 CREAM-CE, CEMon and EMI-ES are not deployed by default, unless the relevant yaim variables =USE_CEMON= and =USE_EMIES= are set to =true=.

There are problems if CEMon is disabled (i.e. with =USE_CEMON= not set or set to =false=) and a reconfiguration is then done to re-enable it (i.e. setting =USE_CEMON= to =true=).

The workaround is to reinstall the =glite-ce-monitor= rpm before reconfiguring.

The same issue applies to EMI-ES.

---+++ Execution of DAG jobs

Execution of DAG jobs on the CREAM based CE through the gLite WMS is not implemented yet.

---+++ qsub crashes

With some Torque versions, qsub was observed crashing with glibc detecting a double free or corruption. Although this is a problem to be addressed in Torque, adding:
<verbatim>
export MALLOC_CHECK_=0
</verbatim>
to =/etc/blah.config= should help.

---+++ CREAM CE not Torque master: communication errors when the maui server and client are not of the same builds.

   * [[https://savannah.cern.ch/bugs/?61968][Bug #61698]]: when the CREAM CE is not a Torque server, there could be communication errors when the maui (and probably torque) server and client are NOT of the same builds. A common scenario where this can happen:
      * The maui server is a 32bit binary deployed on a 32bit LCG-CE
      * The 64bit maui client is deployed on a 64bit CREAM-CE

From the CREAM-CE node run:
<verbatim>
[root@cream-ce]# diagnose -g
</verbatim>
If you see:
<verbatim>
ERROR:    lost connection to server
ERROR:    cannot request service (status)
</verbatim>
you are affected by the problem.
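
The check above can also be scripted, for instance from a monitoring probe. The following is a minimal sketch, assuming that the two error strings shown above are the only symptoms to look for; the function name =maui_client_affected= is hypothetical.

```shell
#!/bin/sh
# maui_client_affected: reads the output of `diagnose -g` on standard input
# and succeeds (exit status 0) if it contains one of the error messages
# listed above. The function name is hypothetical, chosen for illustration.
maui_client_affected() {
    grep -q -e 'ERROR:.*lost connection to server' \
            -e 'ERROR:.*cannot request service'
}
```

On the CREAM-CE it could be used as =diagnose -g 2>&1 | maui_client_affected && echo "affected"=.
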
A possible workaround is the following.

On the LCG-CE, create a cron file to dump the =diagnose -g= output to a file:
<verbatim>
[root@lcg-ce]# cat <<EOF>> /etc/cron.d/diagnose-for-cream
*/5 * * * * root /usr/bin/diagnose -g > /export/dir/to/cream-ce/diagnose.out
EOF
</verbatim>
The interval defined in the =/etc/cron.d/diagnose-for-cream= file has to be set by the site experts; the value above is only an example.

Then export over NFS the directory where the file is located:
<verbatim>
[root@lcg-ce]# cat /etc/exports
/export/dir/to/cream-ce cream-ce(rw,map_identity,no_root_squash,sync)
</verbatim>
On the CREAM-CE, mount the remote directory on a local one:
<verbatim>
[root@cream-ce]# cat /etc/fstab | grep diagnose
lcg-ce:/export/dir/to/cream-ce /import/dir/to/cream-ce nfs defaults,bg 0 0
</verbatim>
Then feed the =lcg-info-dynamic-scheduler= with the diagnose output file:
<verbatim>
[root@cream-ce]# cat /opt/glite/etc/lcg-info-dynamic-scheduler.conf|grep vomaxjobs-maui
vo_max_jobs_cmd: /opt/lcg/libexec/vomaxjobs-maui -h lcg-ce --infile /import/dir/to/cream-ce/diagnose.out
</verbatim>

---+++ Reconfiguration after update

After an update of the CREAM RPM, it is mandatory to reconfigure (via yaim).

---+++ Special characters in CREAM_DB_USER and CREAM_DB_PASSWORD

Don't use special characters in the CREAM_DB_USER and CREAM_DB_PASSWORD yaim variables.

---+++ Problems with OS language different than US English

Problems have been reported when jobs are submitted through the WMS to a CREAM CE deployed on a machine installed with a non-English language. This is because of different representations of decimal numbers.

The workaround in this case is to uncomment the line:
<verbatim>
LANG=en_US
</verbatim>
in =$CATALINA_HOME/conf/tomcat5.conf= and then restart tomcat.

---++ Old known issues

Problems in the CREAM software, or in other software modules affecting a CREAM-based CE, that have already been fixed (i.e.
they do not affect the latest release of the software distributed in EMI).

---+++ Problem submitting jobs via WMS (direct submission to CREAM CE isn't affected by this problem)

There is an unwanted auto-update of the "creationTime" field in the CREAM database. This happens, for example, when tomcat is restarted (yaim stops and starts the tomcat service). When submitting jobs via WMS, you could get the following misleading message:

======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : ...
Current Status: Aborted <-----------------
Status Reason: CREAM'S database has been scratched and all its jobs have been lost <-----------------
Destination: ...
Submitted: ...
Parent Job: ...
==========================================================================

N.B.: the CREAM database has not been scratched, but it appears so from the WMS point of view because of the above problem.

The problem is solved by applying the following workaround to the CREAM database:

----------------------------
use creamdb;
ALTER TABLE db_info MODIFY creationTime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;
commit;
----------------------------

Fix provided with EMI-2 update 7

---+++ !EMI-2 CREAM CE delegates bad proxy to WN

The delegated and limited proxy on the CE contains a corrupted "X509v3 Key Usage" field; this issue can be reproduced by executing the following command:
<verbatim>openssl x509 -noout -text -in /var/glite/cream_sandbox/[vo]/[dn_fqan_mappeduser]/proxy/[delegationid_internalid]</verbatim>
A temporary workaround is to change the proxy-limiting command in the script /usr/bin/glite-cream-copyProxyToSandboxDir.sh; the modified version of the script is available [[%ATTACHURL%/glite-cream-copyProxyToSandboxDir.sh][here]].

Fix provided with EMI-2 update 7

---+++ Problem with cancelled jobs notification

In BLAH 1.18.0 (EMI-2), for the notification of cancelled jobs to work correctly, the following row must be added to /etc/blah.config:
<verbatim>
bupdater_use_bhist_for_killed="yes"
</verbatim>
Fix released with EMI-1 Update 17 and EMI-2 Update 1

---+++ Problem with generic dynamic scheduler with SGE

The yaim plugin for SGE configures the gip for publishing information, but when it is used out of the box the following error is shown in the BDII log:
<verbatim>
Traceback (most recent call last):
  File "/usr/libexec/lcg-info-dynamic-scheduler", line 256, in ?
    import lrms
ImportError: No module named lrms
</verbatim>
The workaround is to define =PYTHONPATH= in =/var/lib/bdii/gip/plugin/glite-info-dynamic-scheduler-wrapper=:
<verbatim>
$ cat /var/lib/bdii/gip/plugin/glite-info-dynamic-scheduler-wrapper
#!/bin/sh
#/opt/lcg/libexec/lcg-info-dynamic-scheduler -c /opt/glite/etc/lcg-info-dynamic-scheduler.conf
export PYTHONPATH=/usr/lib/python:$PYTHONPATH
/usr/libexec/lcg-info-dynamic-scheduler -c /etc/lcg-info-dynamic-scheduler.conf
</verbatim>
Relevant ticket: https://ggus.eu/ws/ticket_info.php?ticket=76961

Fix provided with EMI-2 update 3

---+++ Error parsing GLUE2PolicyRule

All the "GLUE2PolicyRule" attributes defined in the file /var/lib/bdii/gip/ldif/ComputingShare.ldif MUST be in the form "VO:nameofthevo" (the VO prefix is mandatory).

Other strings, even the empty one, are not correctly parsed by the lcg-info-dynamic-scheduler script, and the following error is reported in the BDII log:
<verbatim>
    vogrp = tmpl[2].strip()
IndexError: list index out of range
</verbatim>
In that case the wrong attributes MUST be removed.

Fix provided with EMI-2 update 3

---+++ GlueCEStateWaitingJobs: 444444 and WallTime workaround

If the queues publish:
<verbatim>GlueCEStateWaitingJobs: 444444</verbatim>
and in the log /var/log/bdii/bdii-update.log you notice errors like the following:
<verbatim>Traceback (most recent call last):
  File "/usr/libexec/lcg-info-dynamic-scheduler", line 435, in ?
    wrt = qwt * nwait
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
</verbatim>
then the queues probably have no "resources_default.walltime" parameter configured. Define it for each queue by running, for example:
<verbatim># qmgr -c "set queue prod resources_default.walltime = 01:00:00"
# qmgr -c "set queue cert resources_default.walltime = 01:00:00"
# qmgr -c "set queue cloudtf resources_default.walltime = 01:00:00"
</verbatim>
Relevant ticket: https://ggus.eu/tech/ticket_show.php?ticket=83229

Fix provided with EMI-2 update 3

---+++ Issue with the setting of the maximum number of accepted FTP connections

The maximum number of gridftp connections is now set automatically in =/etc/grid-security/gridftp.conf=.

It should also be set manually in the file =/etc/gridftp.conf=, by adding the line:
<verbatim>
connections_max 150
</verbatim>
Relevant ticket: https://ggus.eu/tech/ticket_show.php?ticket=78902

---+++ Memory leak in bupdater for PBS and LSF

Version 1.16.3 of BLAH is affected by a quite critical memory leak in the bupdater component for LSF and PBS.

Because of it, the memory usage of the bupdater process keeps increasing until the process crashes or is killed by the OOM killer. It is then automatically restarted by BLAH.

SGE is not affected by this problem.

For LSF and PBS, the workaround is to configure the blparser using the old method (see: http://wiki.italiangrid.it/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#1_2_4_Choose_the_BLAH_BLparser_d) or to restart the bupdater from time to time.

Relevant bug: https://savannah.cern.ch/bugs/index.php?89859

Fix provided with EMI-1 Update 12

---+++ Problems starting the bupdater and bnotifier with (S)GE

With SGE, there is a problem when starting the bupdater and bnotifier:
<verbatim>
Starting BNotifier: /usr/bin/BNotifier: sge_helperpath not defined.
Exiting [FAILED]
Starting BUpdaterSGE: /usr/bin/BUpdaterSGE: sge_helperpath not defined.
Exiting [FAILED]
</verbatim>
The workaround is to uncomment the variable:
<verbatim>
sge_helperpath=/usr/bin/sge_helper
</verbatim>
in the =blah.config= file.

See https://savannah.cern.ch/bugs/index.php?88974 and https://ggus.eu/ws/ticket_info.php?ticket=76067

Fix provided with EMI-1 Update 12

---+++ No dynamic info published for one VOview

For one VOView the =lcg-info-dynamic-scheduler= doesn't publish information, and therefore the values defined in the static ldif file are used.

As found by Jan Astalos (thanks!), this is because of a missing blank line at the end of the =/var/lib/bdii/gip/ldif/static-file-CE.ldif= file created by YAIM.

Until the fix is available, the workaround is simply to run:
<verbatim>
echo >> /var/lib/bdii/gip/ldif/static-file-CE.ldif
</verbatim>
after having configured via yaim.

Relevant bug: http://savannah.cern.ch/bugs/?86191

Fix provided with CREAM CE 1.13.3 (see http://savannah.cern.ch/task/?24022) released with EMI-1 Update 10

---+++ Problems if Torque is not configured to suppress mails

Torque should be configured to suppress all mails (mail_domain=never). Otherwise the bupdater process of the blparser will keep dying.

Relevant bug: https://savannah.cern.ch/bugs/index.php?86238

Fix provided with BLAH 1.16.3 (see http://savannah.cern.ch/task/?22845) released with EMI-1 Update 10

---+++ Memory issues with new BLAH Blparser

If the new Blparser is used (click [[http://wiki.italiangrid.org/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#2_15_How_to_check_if_you_are_usi][here]] to check this), there can be issues if the BLAH registry becomes very large. The submission process can get slower and there can be problems with memory usage.

Until the fix is available, there are two possible workarounds:
   * Reduce the number of multiple instances of blahpd (the default value is 50). This means changing the value of =cream_concurrency_level= in =cream-config.xml=. To apply the change, you will then need to restart tomcat.
     This should help to address the issue, but it also means fewer parallel instances interacting with the batch system (and so a possible reduction of the submission throughput to the batch system). Click [[http://wiki.italiangrid.org/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#1_4_7_1_Tune_the_number_of_concu][here]] for more details.
   * Reduce the value of =purge_interval= in =blah.config=. This value is expressed in seconds. A job is removed from the BLAH registry (and is therefore no longer managed by BLAH, and hence by CREAM) =purge_interval= seconds after its submission. To apply the change, you will then need to restart the blparser (=/etc/init.d/glite-ce-blahparser restart=).

Relevant bug: https://savannah.cern.ch/bugs/index.php?75854

Fix provided with BLAH 1.16.3 (see http://savannah.cern.ch/task/?22845) released with EMI-1 Update 10

---+++ Problems with Torque 2.5.7-1

There is a problem with the latest torque version available in the EPEL repository (2.5.7-1). At start, the following error is reported:
<verbatim>
[root@cream-38 ~]# /etc/init.d/pbs_server start
/var/torque/server_priv/serverdb
Starting TORQUE Server: PBS_Server: LOG_ERROR::No such file or directory (2) in pbs_init, unable to stat checkpoint directory /var/torque/checkpoint/, errno 2 (No such file or directory)
PBS_Server: LOG_ERROR::PBS_Server, pbsd_init failed
[FAILED]
</verbatim>
Problem addressed in Torque 2.5.7-2.

---+++ Problems affecting users with certificates signed by the GermanGrid

Because of a bug in trustmanager, users with certificates signed by the GermanGrid CA can't submit jobs to CREAM. The error message is something like:
<verbatim>
Failed to create a delegation id for job https://grid-lb0.desy.de:9000/ADkeOt6tc0Rfi8oP-pzUrQ: reason is Client 'O=GermanGrid,OU=DESY,CN=Alexander Fomenko' is not issuer of proxy 'O=GermanGrid,OU=DESY,CN=Alexander Fomenko,CN=proxy,CN=proxy'.
</verbatim>
   * Relevant bug: https://savannah.cern.ch/bugs/?83426
   * Fix released with CREAM CE 1.13.2 (http://savannah.cern.ch/task/?21573), released with EMI-1 Update 4

---+++ Problems with SubCAs when Argus is used as authorization system

When the CREAM CE is configured to use Argus, there are problems with sub-CAs (e.g. CERN-TCA, UKeScienceCA).
   * Relevant bug: https://savannah.cern.ch/bugs/?82567
   * Fix released with CREAM CE 1.13.1 (http://savannah.cern.ch/task/?20813)

---+++ CREAM doesn't transfer the output files remotely if SANDBOX_TRANSFER_METHOD="LRMS"

The related bug is https://savannah.cern.ch/bugs/index.php?95480; the error occurs only if the selected transfer method is "LRMS" and the URL is lexicographically greater than "gsiftp://localhost". No workaround is available.

---+++ FAILURE_REASON="Cannot enqueue the command id=-1: Data truncation: Data too long for column 'commandGroupId' at row 1 (rollback performed)"

This is a bug reintroduced in the EMI-2 CREAM CE: https://savannah.cern.ch/bugs/index.php?95593

A workaround is to modify a database table by executing the following query:
<verbatim>
use creamdb;
ALTER TABLE command_queue MODIFY commandGroupId varchar(255) NULL;
</verbatim>

-- Main.LisaZangrando - 2012-10-26
Topic attachments
| *Attachment* | *Size* | *Date* | *Who* |
| glite-cream-copyProxyToSandboxDir.sh | 1.5 K | 2012-09-07 - 07:34 | PaoloAndreetto |
Topic revision: r44 - 2012-12-21 - PaoloAndreetto
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.