Known issues
Open known issues
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI)
Memory leak in bupdater for PBS and LSF
Version 1.16.3 of BLAH is affected by a critical memory leak in the bupdater component for LSF and PBS. As a result, the memory usage of the bupdater process keeps increasing until it crashes or is killed by the OOM killer; it is then automatically restarted by blah.
SGE is not affected by this problem.
For LSF and PBS, the workaround is to configure the blparser using the old method (see:
http://wiki.italiangrid.it/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#1_2_4_Choose_the_BLAH_BLparser_d
) or to restart the bupdater from time to time.
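As a sketch of the second workaround, a cron entry can restart the bupdater periodically before its memory usage grows too large. The restart command below is the blparser init script mentioned elsewhere in this page; the file name and the 6-hour interval are examples only, to be tuned per site:

```shell
# Hypothetical cron fragment, e.g. /etc/cron.d/restart-bupdater
# (restarting the blparser also restarts the bupdater; interval is an example)
0 */6 * * * root /etc/init.d/glite-ce-blahparser restart
```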
Relevant bug:
https://savannah.cern.ch/bugs/index.php?89859
Problems starting the bupdater and bnotifier with (S)GE
With SGE, there is a problem when starting the bupdater and bnotifier:
Starting BNotifier: /usr/bin/BNotifier: sge_helperpath not defined. Exiting
[FAILED]
Starting BUpdaterSGE: /usr/bin/BUpdaterSGE: sge_helperpath not defined. Exiting
[FAILED]
The workaround is to uncomment the variable:
sge_helperpath=/usr/bin/sge_helper
in the
blah.config
file.
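A minimal sketch of that edit, demonstrated here on a throwaway copy so it can be tried safely; on a real CE the file to edit is /etc/blah.config:

```shell
# Demo on a throwaway copy of blah.config; on a real CE edit /etc/blah.config instead.
conf=$(mktemp)
echo '#sge_helperpath=/usr/bin/sge_helper' > "$conf"
# Uncomment the sge_helperpath line
sed -i 's|^#\(sge_helperpath=\)|\1|' "$conf"
cat "$conf"
```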
See
https://savannah.cern.ch/bugs/index.php?88974
and
https://ggus.eu/ws/ticket_info.php?ticket=76067
Significant changes introduced with Torque 2.5.7-1
The updated EPEL5 build of torque-2.5.7-1, compared with previous versions, enables munge as an inter-node authentication method.
pbs_server create hangs on first time run
On a fresh installation of torque-server, starting pbs_server for the first time with
/etc/init.d/pbs_server start
hangs. The command to create a server database file (
/etc/init.d/pbs_server create
) also hangs. Issue tracked at
https://bugzilla.redhat.com/show_bug.cgi?id=744138
.
Execution of DAG jobs
Execution of DAG jobs on the CREAM based CE through the gLite WMS is not implemented yet.
qsub crashes
With some Torque versions, qsub was observed to crash with glibc detecting a double free or corruption. Although this is a problem to be addressed in Torque itself, adding:
export MALLOC_CHECK_=0
to
/etc/blah.config
should help.
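A sketch of that change, demonstrated on a throwaway copy so it can be tried safely (on a real CE the target file is /etc/blah.config); the line is added only if it is not already present:

```shell
# Demo on a throwaway copy; on a real CE the target file is /etc/blah.config.
conf=$(mktemp)
# Append the workaround line only if it is not already there
grep -q '^export MALLOC_CHECK_=0$' "$conf" || echo 'export MALLOC_CHECK_=0' >> "$conf"
cat "$conf"
```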
CREAM CE not Torque master: communication errors when the maui server and client are not the same build.
Bug #61698
: when the CREAM CE is not a Torque server, there can be communication errors when the maui (and probably torque) server and client are not the same build.
A common scenario/example when this can happen:
- The maui server is a 32bit binary deployed on a 32bit LCG-CE
- The 64bit maui client is deployed on a 64bit CREAM-CE
From the CREAM-CE node perform:
[root@cream-ce]# diagnose -g
If you see:
ERROR: lost connection to server
ERROR: cannot request service (status)
you are affected by the problem.
A possible workaround is the following:
On the LCG-CE create a cron file to dump the
diagnose -g
output to a file:
[root@lcg-ce]# cat <<EOF>> /etc/cron.d/diagnose-for-cream
*/5 * * * * root /usr/bin/diagnose -g > /export/dir/to/cream-ce/diagnose.out
EOF
The interval defined in the
/etc/cron.d/diagnose-for-cream
file has to be tuned by the site administrators; the value above is just an example.
Then export over NFS the directory where the file is located:
[root@lcg-ce]# cat /etc/exports
/export/dir/to/cream-ce cream-ce(rw,map_identity,no_root_squash,sync)
On the CREAM-CE include/mount the remote directory to a local one:
[root@cream-ce]# cat /etc/fstab | grep diagnose
lcg-ce:/export/dir/to/cream-ce /import/dir/to/cream-ce nfs defaults,bg 0 0
Then feed the
lcg-info-dynamic-scheduler
with the diagnose output file:
[root@cream-ce]# cat /opt/glite/etc/lcg-info-dynamic-scheduler.conf|grep vomaxjobs-maui
vo_max_jobs_cmd: /opt/lcg/libexec/vomaxjobs-maui -h lcg-ce -infile /import/dir/to/cream-ce/diagnose.out
Reconfiguration after update
After an update of the CREAM RPMs, it is mandatory to reconfigure the service via yaim.
Special characters in CREAM_DB_USER and CREAM_DB_PASSWORD
Don't use special characters in the CREAM_DB_USER and CREAM_DB_PASSWORD yaim variables
Problems with OS language different than US English
Problems have been reported when jobs are submitted through the WMS to a CREAM CE deployed on a machine installed with a non-English language. This is caused by locale-dependent representations of decimal numbers. The workaround in this case is to uncomment the line:
LANG=en_US
in
$CATALINA_HOME/conf/tomcat5.conf
and then restart tomcat
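The root cause can be illustrated with printf: the decimal separator depends on the locale, so numbers written under a non-English locale are not parsed back correctly by components expecting the dot notation (de_DE below is only an example and may not be installed on a minimal system):

```shell
# The decimal separator used for %f output depends on the locale:
LC_NUMERIC=C printf '%.1f\n' 1.5          # dot: "1.5"
# A non-English locale (e.g. de_DE, if installed) would use a comma instead,
# which is what confuses components expecting the dot notation:
LC_NUMERIC=de_DE.UTF-8 printf '%.2f\n' 1 2>/dev/null || true
```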
Old known issues
Problems in CREAM software or in other software modules affecting a CREAM based CE that have already been fixed (i.e. they do not affect the latest release of the software released in EMI)
No dynamic info published for one VOview
For one VOView the
lcg-info-dynamic-scheduler
doesn't publish information, and therefore the values defined in the static ldif file are used.
As found by Jan Astalos (thanks!), this is caused by a missing blank line at the end of
/var/lib/bdii/gip/ldif/static-file-CE.ldif
created by YAIM.
Until the fix is available, the workaround is simply to run:
echo >> /var/lib/bdii/gip/ldif/static-file-CE.ldif
after having configured via yaim
Relevant bug:
http://savannah.cern.ch/bugs/?86191
Fix provided with CREAM CE 1.13.3 (see
http://savannah.cern.ch/task/?24022
) released with EMI-1 Update 10
Problems if Torque is not configured to suppress mails
Torque should be configured to suppress all mails (mail_domain=never). Otherwise the bupdater process of the blparser will keep dying.
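The setting can be applied with qmgr, Torque's administration tool, on the host running pbs_server (a sketch; these commands need a running pbs_server and administrator privileges):

```shell
# Run on the Torque server host; suppresses all job-related mail.
qmgr -c 'set server mail_domain = never'
# Verify the current value:
qmgr -c 'print server' | grep mail_domain
```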
Relevant bug:
https://savannah.cern.ch/bugs/index.php?86238
Fix provided with BLAH 1.16.3 (see
http://savannah.cern.ch/task/?22845
) released with EMI-1 Update 10
Memory issues with new BLAH Blparser
If the new BLAH Blparser is used, there can be issues if the blah registry becomes very large. The submission process can get slower and there can be problems with memory usage.
Until the fix is available, there are two possible workarounds:
- Reduce the number of concurrent blahpd instances (the default value is 50). This means changing the value of
cream_concurrency_level
in cream-config.xml
and then restarting tomcat to apply the change. This should help address the issue, but it also means fewer parallel instances interacting with the batch system (and so a possible reduction of the submission throughput).
- Reduce the value of
purge_interval
in blah.config
. This value is expressed in seconds: a job is removed from the BLAH registry (and is therefore no longer managed by BLAH, and hence by CREAM) purge_interval seconds after its submission. To apply the change, restart the blparser (/etc/init.d/glite-ce-blahparser restart).
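A sketch of the two edits; the file paths and values below are illustrative only (the cream-config.xml location depends on the installation, and the numbers must be tuned per site):

```shell
# Workaround 1 - cream-config.xml (restart tomcat afterwards):
#   lower cream_concurrency_level below the default of 50, e.g. to 20
#
# Workaround 2 - blah.config (restart the blparser afterwards):
#   purge_interval=1200000        # seconds; example value only
#   /etc/init.d/glite-ce-blahparser restart
```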
Relevant bug:
https://savannah.cern.ch/bugs/index.php?75854
Fix provided with BLAH 1.16.3 (see
http://savannah.cern.ch/task/?22845
) released with EMI-1 Update 10
Problems with Torque 2.5.7-1
There is a problem with the latest torque version available in the EPEL repository (2.5.7-1).
At startup the following error is reported:
[root@cream-38 ~]# /etc/init.d/pbs_server start
/var/torque/server_priv/serverdb
Starting TORQUE Server: PBS_Server: LOG_ERROR::No such file or directory (2)
in pbs_init, unable to stat checkpoint directory /var/torque/checkpoint/,
errno 2 (No such file or directory)
PBS_Server: LOG_ERROR::PBS_Server, pbsd_init failed
[FAILED]
Problem addressed with Torque 2.5.7-2.
Problems affecting users with certificates signed by the GermanGrid
Because of a bug in trustmanager, users with certificates signed by the
GermanGrid CA can't submit jobs to CREAM.
The error message is something like:
Failed to create a delegation id for job https://grid-lb0.desy.de:9000/ADkeOt6tc0Rfi8oP-pzUrQ: reason is Client 'O=GermanGrid,OU=DESY,CN=Alexander Fomenko' is not issuer of proxy 'O=GermanGrid,OU=DESY,CN=Alexander Fomenko,CN=proxy,CN=proxy'.
Problems with SubCAs when Argus is used as authorization system
There are problems with sub-CAs (e.g. CERN-TCA, UKeScienceCA) when the CREAM CE is configured to use Argus as authorization system.
--
MassimoSgaravatto - 2011-05-05