
HOW-TO: optimise performance by distributing WMS and LB on two hosts

WMS+LB physical architecture

To gain better performance, the components of a single WMS instance have been distributed over two hosts, in a layout different from the typical one. The LBserver is hosted on one machine (devel20 in our case), together with WMproxy and WM but without an LBproxy, so that the same events are not stored twice in the database (this issue will disappear with the advent of LB 2.0). The Job Submission Service is moved to another machine, gundam in our case: JC, LM and CondorG are hosted by gundam, and they connect directly to the LBserver on devel20 without using an LBproxy outpost on gundam.

Configure LBproxy = false on gundam.
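For reference, this is how the setting would look in /opt/glite/etc/glite_wms.conf on gundam; note that the section shown (Common) is an assumption, as the exact placement of the LBproxy flag depends on the WMS release:

```
Common = [
    ...
    LBProxy = false;
    ...
]
```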

Component                     host devel20   host gundam
glite_wms_wmproxy             Yes / Done     No
glite-wms-workload_manager    Yes / Done     No
glite-proxy-renewd            Yes / Done     No
glite-wms-job_controller      No             Yes / Done
glite-wms-log_monitor         No             Yes / Done
CondorG                       No             Yes / Done
glite-lb-logd                 Yes / Done     Yes / Done
glite-lb-interlogd            Yes / Done     Yes / Done
glite-lb-bkserverd            Yes / Done     No

Filesystem sharing

Interoperation between the various WMS components running on the two hosts is (temporarily) guaranteed by exporting /var/glite on devel20 to the host gundam via NFS; this is done only for simplicity. gundam mounts the filesystem under /mnt/devel20. Since the gahp_server is highly CPU-bound, this physical architecture should perform better than a WMS+LB on a single machine, even one with two separately controlled disks.

devel20: NFS server configuration

On devel20, as root, insert the following lines in /etc/hosts.deny:
portmap: ALL
lockd: ALL
statd: ALL
mountd: ALL
rquotad: ALL 
Insert the following lines in /etc/hosts.allow:
portmap: gundam.cnaf.infn.it
lockd: gundam.cnaf.infn.it 
statd: gundam.cnaf.infn.it
mountd: gundam.cnaf.infn.it
rquotad: gundam.cnaf.infn.it 
There is no need to restart the portmap daemon.

Start the NFS service:

# /etc/init.d/nfs start

Make the NFS service start at boot:

# chkconfig nfs on

Insert the following line in /etc/exports:

/var/glite  gundam.cnaf.infn.it(rw,sync,wdelay)

Re-export the filesystem:

# exportfs -r
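Before re-exporting, it is worth double-checking the shape of the /etc/exports line: a stray space between the host name and the parenthesised options would export the filesystem to the whole world. A small throw-away check (a hypothetical helper, not part of the gLite tooling):

```shell
# Hypothetical sanity check: the client name and its option list must be
# glued together -- "host (rw)" with a space would export world-wide.
line='/var/glite  gundam.cnaf.infn.it(rw,sync,wdelay)'
if printf '%s\n' "$line" | grep -Eq '^/[^ ]+ +[^ ]+\([^)]+\)$'; then
    echo "exports line OK"
else
    echo "exports line suspicious"
fi
```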

gundam: NFS client configuration

In order to prevent any problems during the booting process, we don't mount the NFS filesystem at boot on gundam. Instead, we configure automount to mount the filesystem automatically at first access, and disable subsequent auto-unmount.

As root, insert the following line in /etc/auto.master:

/mnt /etc/auto.mnt --timeout=0

Create the file /etc/auto.mnt with the following line:

devel20   -rw,hard,intr,nosuid,noauto,timeo=600,wsize=32768,rsize=32768,tcp   devel20.cnaf.infn.it:/var/glite

Start the automount daemon:

# /etc/init.d/autofs start

Make automount start at boot:

# chkconfig autofs on

The filesystem /mnt/devel20 gets mounted automatically at first access attempt after boot, and is never automatically unmounted. If the filesystem is not busy, it can be manually unmounted either by:

  • issuing the usual command `umount /mnt/devel20`
  • sending the USR1 signal to the automount daemon
Of course, upon a subsequent access attempt, the filesystem gets automatically remounted.

gundam: creation of the necessary links

On gundam create the following symbolic links:
# ln -s /mnt/devel20/jobcontrol /var/glite/jobcontrol
# ln -s /mnt/devel20/SandboxDir /var/glite/SandboxDir
# ln -s /mnt/devel20/spool /var/glite/spool

This is essential, because ClassAd attributes will still point to the canonical paths under "/var/glite/...."
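A throw-away demonstration (in a scratch directory, no NFS involved) of why the links matter: a symlink at the canonical path transparently redirects accesses to the mounted tree, so a component writing to "/var/glite/jobcontrol" really writes on the NFS side:

```shell
# Scratch demo: a symlink at the canonical path redirects to the "mount".
scratch=$(mktemp -d)
mkdir -p "$scratch/mnt/devel20/jobcontrol" "$scratch/var/glite"
ln -s "$scratch/mnt/devel20/jobcontrol" "$scratch/var/glite/jobcontrol"
echo probe > "$scratch/var/glite/jobcontrol/probe"   # write via canonical path
cat "$scratch/mnt/devel20/jobcontrol/probe"          # the file really lives under the mount
rm -rf "$scratch"
```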

Configuration

On devel20 no changes were made to the default configuration file glite_wms.conf.
On gundam it is necessary to update some entries so that the JobController and the LogMonitor find the jobdir under /mnt/devel20.

  1. On gundam, after exporting
           GLITE_LOCAL_LOCATION_LOG=/var/log/glite 
           GLITE_LOCAL_LOCATION_VAR=/var/glite 
           GLITE_REMOTE_LOCATION_VAR=/mnt/devel20 
    modify the following line in the LogMonitor [] section of /opt/glite/etc/glite_wms.conf:
           LogMonitor = [
           ...
           MonitorInternalDir  =  "${GLITE_LOCAL_LOCATION_VAR}/logmonitor/internal";
           ...
           ]
  2. Still on gundam, modify the following lines in the JobController [] section of /opt/glite/etc/glite_wms.conf:
    JobController = [
    ...
    Input  =  "${GLITE_REMOTE_LOCATION_VAR}/jobcontrol/jobdir";
    LockFile  =  "${GLITE_REMOTE_LOCATION_VAR}/jobcontrol/lock";
    SubmitFileDir  =  "${GLITE_REMOTE_LOCATION_VAR}/jobcontrol/submit";
    ...
    ]
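The JobController edits above are a uniform prefix change, so they can also be applied with a single sed substitution. A sketch on a scratch copy of the relevant lines (applying it in place to the real /opt/glite/etc/glite_wms.conf is left to the reader):

```shell
# Sketch: rewrite the jobcontrol paths from the local to the remote prefix.
# Shown on a scratch file; on gundam the target would be glite_wms.conf.
conf=$(mktemp)
cat > "$conf" <<'EOF'
Input  =  "${GLITE_LOCAL_LOCATION_VAR}/jobcontrol/jobdir";
LockFile  =  "${GLITE_LOCAL_LOCATION_VAR}/jobcontrol/lock";
SubmitFileDir  =  "${GLITE_LOCAL_LOCATION_VAR}/jobcontrol/submit";
EOF
sed -i 's|${GLITE_LOCAL_LOCATION_VAR}/jobcontrol|${GLITE_REMOTE_LOCATION_VAR}/jobcontrol|g' "$conf"
grep -c 'GLITE_REMOTE_LOCATION_VAR' "$conf"   # all three lines now use the remote prefix
rm -f "$conf"
```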

Each component stores its logs locally; this is especially important on gundam, where the LM, JC and CondorG logs produce a huge amount of data:

LogMonitor = [
...
CondorLogRecycleDir  =  "${GLITE_LOCAL_LOCATION_VAR}/logmonitor/CondorG.log/recycle";
LockFile  =  "${GLITE_LOCAL_LOCATION_LOG}/logmonitor/lock";
CondorLogDir  =  "${GLITE_LOCAL_LOCATION_LOG}/logmonitor/CondorG.log";
LogFile  =  "${GLITE_LOCAL_LOCATION_LOG}/logmonitor_events.log";
ExternalLogFile  =  "${GLITE_LOCAL_LOCATION_LOG}/logmonitor_external.log";
...
]

JobController = [
...
LogFile  =  "${GLITE_LOCAL_LOCATION_LOG}/jobcontroller_events.log";
OutputFileDir  =  "${GLITE_LOCAL_LOCATION_LOG}/jobcontrol/condorio";
...
]

TODO: Condor tweaks:

  1. "Can't send RESCHEDULE command to condor scheduler": set SUBMIT_SEND_RESCHEDULE = False
  2. ... GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE = 100

scripts

devel20:

# /opt/glite/etc/init.d/glite-wms-wm start/stop/status
# /opt/glite/etc/init.d/glite-wms-wmproxy start/stop/status
# /opt/glite/etc/init.d/glite-proxy-renewald start/stop/status
# /opt/glite/etc/init.d/glite-lb-locallogger start/stop/status 
# /opt/glite/etc/init.d/glite-lb-bkserverd start/stop/status

gundam:

# /opt/glite/etc/init.d/glite-wms-lm start/stop/status
# /opt/glite/etc/init.d/glite-wms-jc start/stop/status
# /opt/glite/etc/init.d/glite-lb-locallogger start/stop/status

Note that gundam must be registered as a superuser on the LB server at devel20.

A preview from stress tests recently made with CMS (thanks to Enzo Miccio): a stable rate of more than 1 Hz to Condor (blue line in the attached plot) whenever the Grid resources were able to keep the pace. These tests were made with an experimental version of the gLite WMS, which will be released after patch #1841.

-- FabioCapannini - 02 Oct 2008

Topic revision: r16 - 2009-08-31 - MarcoCecchi