-- AlessandroPaolini - 2010-11-26
How to test a site before putting it into the production grid
Make sure that the site's GIIS URL is included in the BDII (gridit-bdii-01.cnaf.infn.it) that we use for certification. If it is missing, please open a ticket on ticketing.cnaf.infn.it.
1) Check the consistency of the published information
The main branches of the LDAP tree are:
- 1.1) GlueSiteUniqueID
- 1.2) GlueSubClusterUniqueID
- 1.3) GlueCEUniqueID
- 1.4) GlueCESEBind
- 1.5) GlueSEUniqueID
- 1.6) GlueServiceUniqueID
An LDAP browser makes this work easier. For those who prefer the command line, each section below includes an example ldapsearch command for checking the relevant information.
1.1) GlueSiteUniqueID branch
Under the GlueSiteUniqueID branch, check the values of the following parameters:
- GlueSiteName
- GlueSiteUserSupportContact
- GlueSiteSysAdminContact
- GlueSiteSecurityContact
- GlueSiteOtherInfo
EXAMPLE:
$ ldapsearch -x -LLL -H ldap://gridit-bdii-01.cnaf.infn.it:2170 -b mds-vo-name=local,o=grid 'objectClass=GlueSite' GlueSiteName GlueSiteUserSupportContact GlueSiteSysAdminContact GlueSiteSecurityContact GlueSiteOtherInfo
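The check above can be partly automated. What follows is only a minimal sketch, not part of the certification procedure: missing_site_attrs is a hypothetical helper that reads the LDIF printed by ldapsearch on standard input and prints every attribute from the list above that is not published.

```shell
#!/bin/sh
# Hypothetical helper: print each required GlueSite attribute that does not
# appear in the LDIF read from standard input.
missing_site_attrs() {
    ldif=$(cat)
    for attr in GlueSiteName GlueSiteUserSupportContact \
                GlueSiteSysAdminContact GlueSiteSecurityContact \
                GlueSiteOtherInfo; do
        # An attribute line starts with "<name>:" ("<name>::" when the
        # value is base64-encoded).
        printf '%s\n' "$ldif" | grep -q "^${attr}:" || echo "$attr"
    done
}

# Usage (same query as in the example above, without the attribute list):
#   ldapsearch -x -LLL -H ldap://gridit-bdii-01.cnaf.infn.it:2170 \
#       -b mds-vo-name=local,o=grid 'objectClass=GlueSite' | missing_site_attrs
```

The result is only meaningful when the query returns a single site, so it is probably best to restrict the search base or the filter to the site under test.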
1.2) GlueSubClusterUniqueID branch
Under the GlueSubClusterUniqueID branch, check the values of the following parameters:
* GlueHostApplicationSoftwareRunTimeEnvironment, which should contain:
  * the site name
  * the current version of the middleware
  * R-GMA
  * SI00MeanPerCPU_<value> and SF00MeanPerCPU_<value>
  * MPICH (and other related tags), if the site supports MPI jobs
  * AFS, where applicable (and verify that the WNs mount /afs)
* GlueHostProcessorOtherDescription (for instance: Cores=2,Benchmark=7.92-HEP-SPEC06)
* GlueHostOperatingSystemName (e.g. ScientificSL)
* GlueHostOperatingSystemVersion (e.g. Beryllium)
* GlueHostOperatingSystemRelease (e.g. 4.5)
EXAMPLE:
$ ldapsearch -x -LLL -H ldap://virgo-ce.roma1.infn.it:2170 -b mds-vo-name=resource,o=grid 'objectClass=GlueSubCluster' GlueHostOperatingSystemName GlueHostOperatingSystemVersion GlueHostOperatingSystemRelease GlueHostProcessorOtherDescription
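To check for a single runtime tag without reading the whole output, something like the following sketch may help. has_runtime_tag is a hypothetical helper, and it assumes that each tag is published on its own GlueHostApplicationSoftwareRunTimeEnvironment line, as ldapsearch normally prints multi-valued attributes.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the tag given as first argument is
# published as a GlueHostApplicationSoftwareRunTimeEnvironment value in the
# LDIF read from standard input. The match is exact and case-sensitive.
has_runtime_tag() {
    grep -qxF "GlueHostApplicationSoftwareRunTimeEnvironment: $1"
}

# Usage:
#   ldapsearch -x -LLL -H ldap://virgo-ce.roma1.infn.it:2170 \
#       -b mds-vo-name=resource,o=grid 'objectClass=GlueSubCluster' \
#       GlueHostApplicationSoftwareRunTimeEnvironment \
#     | has_runtime_tag MPICH && echo "MPICH tag published"
```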
1.3) GlueCEUniqueID branch
Under the GlueCEUniqueID branch, check the values of the following parameters:
- GlueCEInfoTotalCPUs: if it is “0”, something is wrong!
- GlueCEStateWaitingJobs: if it is “444444”, red alarm!
- GlueCEInfoLRMSType: pbs or lsf (or sge, …)
- GlueCEStateStatus: Production or Draining
- GlueCEAccessControlBaseRule: the VOs enabled on the queue
- GlueCECapability
EXAMPLE:
$ ldapsearch -x -LLL -H ldap://virgo-ce.roma1.infn.it:2170 -b mds-vo-name=INFN-ROMA1-VIRGO,o=grid 'objectClass=GlueCE' GlueCEInfoTotalCPUs GlueCEStateWaitingJobs GlueCEInfoJobManager GlueCEImplementationName GlueCEInfoLRMSType GlueCEStateStatus GlueCEAccessControlBaseRule GlueCECapability
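The two alarm conditions above (no CPUs, and a waiting-jobs value made up only of the digit 4) can be spotted automatically. The following is a sketch under assumptions: check_ce_values is a hypothetical helper, and it treats any GlueCEStateWaitingJobs value consisting of five or more 4s as a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: read GlueCE LDIF on stdin, print a message for each
# suspicious value, and return non-zero if any was found.
# Note: long DNs that ldapsearch wraps over several lines are shown truncated.
check_ce_values() {
    awk '
        /^dn: /                            { dn = $2 }
        /^GlueCEInfoTotalCPUs: 0$/         { print "WARNING: no CPUs published by " dn; bad = 1 }
        /^GlueCEStateWaitingJobs: 44444+$/ { print "ALARM: placeholder waiting jobs on " dn; bad = 1 }
        END                                { exit bad }
    '
}

# Usage:
#   ldapsearch -x -LLL -H ldap://virgo-ce.roma1.infn.it:2170 \
#       -b mds-vo-name=INFN-ROMA1-VIRGO,o=grid 'objectClass=GlueCE' \
#       GlueCEInfoTotalCPUs GlueCEStateWaitingJobs | check_ce_values
```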
1.4) GlueCESEBindSEUniqueID branch
For each SE, the following must be defined:
- GlueCESEBindSEUniqueID
- GlueCESEBindCEAccesspoint and GlueCESEBindMountInfo
EXAMPLE:
$ ldapsearch -x -LLL -H ldap://prod-ce-02.pd.infn.it:2170 -b mds-vo-name=resource,o=grid 'objectClass=GlueCESEBind' GlueCESEBindSEUniqueID GlueCESEBindCEUniqueID GlueCESEBindMountInfo
1.5) GlueSEUniqueID branch
Under the GlueSEUniqueID branch, check the values of the following parameters:
- GlueSALocalID: VO information
- GlueSEAccessProtocolLocalID: rfio, srm_v1, srm_v2, classic, gsiftp, gsidcap
- GlueSEType (deprecated)
- GlueSEArchitecture
- GlueSAStateUsedSpace
- GlueSAStateAvailableSpace
EXAMPLE:
$ ldapsearch -x -LLL -H ldap://prod-bdii-02.pd.infn.it:2170 -b mds-vo-name=INFN-PADOVA,o=grid 'objectClass=GlueSE'
$ ldapsearch -x -LLL -H ldap://prod-bdii-02.pd.infn.it:2170 -b mds-vo-name=INFN-PADOVA,o=grid 'objectClass=GlueSA' GlueSAAccessControlBaseRule
1.6) GlueServiceUniqueID branch
There is a GlueServiceUniqueID branch for each service published by the site (WMS, LFC, DPM, GRIDICE, LB, MYPROXY, BDII,…). The services are distinguished by the value of GlueServiceType, e.g.:
- lcg-file-catalog
- org.glite.wms.WMProxy
- org.glite.lb.Server
- srm_v1, SRM
EXAMPLE:
$ ldapsearch -x -LLL -H ldap://gridit-ce-001.cnaf.infn.it:2170 -b mds-vo-name=INFN-CNAF,o=grid 'objectClass=GlueService' GlueServiceType GlueServiceEndpoint GlueServiceName
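For a quick overview of what a site publishes, the output of the query above can be grouped by service type. A minimal sketch, where count_service_types is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: read GlueService LDIF on stdin and print how many
# services are published for each GlueServiceType value.
count_service_types() {
    sed -n 's/^GlueServiceType: //p' | sort | uniq -c
}

# Usage:
#   ldapsearch -x -LLL -H ldap://gridit-ce-001.cnaf.infn.it:2170 \
#       -b mds-vo-name=INFN-CNAF,o=grid 'objectClass=GlueService' \
#       GlueServiceType | count_service_types
```

A service that you expect but that does not show up in the counts is a good candidate for a closer look with an LDAP browser.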
Check the functionality of the grid elements
lcg-CE checks
Verify authentication and authorization on the CE by running a simple command such as /bin/hostname (or /usr/bin/whoami, or anything else you like):
$ globus-job-run inaf-ce-01.ct.pi2s2.it /bin/hostname
If the batch system is PBS, check the WNs, e.g.:
$ globus-job-run pbs-enmr.cerm.unifi.it /usr/bin/pbsnodes -a
Verify that the batch system works: make sure that the queue you are querying really exists and that your VO is enabled on it. For example:
$ globus-job-run ce-cyb.ca.infn.it/jobmanager-lcglsf -queue poncert /bin/pwd
Check that the DGAS processes are running on the CE (e.g. with ps ax | grep dgas).
Cream-CE checks
Open your browser at https://<hostname-of-cream-ce>:8443/ce-cream/services
A page with a link to the CREAM WSDL should be shown.
Try a gsiftp transfer (e.g. using globus-url-copy or uberftp) towards that CREAM CE, e.g.:
$ globus-url-copy gsiftp://<hostname-of-cream-ce>/opt/glite/yaim/etc/versions/ig-yaim file:/tmp/ig-version-test
Try the following command:
$ glite-ce-allowed-submission <hostname-of-cream-ce>:8443
It should report:
Job Submission to this CREAM CE is enabled
Try a submission to the Cream-CE using the glite-ce-job-submit command, e.g.:
$ /bin/cat sleep.jdl
[
executable="/bin/sleep";
arguments="1";
]
$ glite-ce-job-submit -a -r <hostname-of-cream-ce>:8443/<queue> sleep.jdl
$ glite-ce-job-submit -a -r ce-cr-02.ts.infn.it:8443/cream-lsf-cert sleep.jdl
https://ce-cr-02.ts.infn.it:8443/CREAM127814374
Check the status of that job, which should eventually be DONE-OK:
$ glite-ce-job-status https://ce-cr-02.ts.infn.it:8443/CREAM127814374
2010-07-27 11:55:37,986 WARN - No configuration file suitable for loading. Using built-in configuration
****** JobID=[https://ce-cr-02.ts.infn.it:8443/CREAM127814374]
Status = [DONE-OK]
ExitCode = [0]
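Rather than re-running glite-ce-job-status by hand, the status can be polled in a loop. This is only a sketch: cream_status and wait_for_cream_job are hypothetical helpers, the 30-second interval and the default of 20 attempts are arbitrary choices, and the parsing assumes the "Status = [...]" layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "Status = [...]" line found in
# glite-ce-job-status output read from standard input.
cream_status() {
    sed -n 's/^[[:space:]]*Status[[:space:]]*=[[:space:]]*\[\(.*\)].*$/\1/p'
}

# Hypothetical helper: poll the job given as $1 until it reaches a final
# state, trying at most $2 times (default 20, an arbitrary choice).
wait_for_cream_job() {
    tries=${2:-20}
    while [ "$tries" -gt 0 ]; do
        status=$(glite-ce-job-status "$1" | cream_status)
        case $status in
            DONE-OK) return 0 ;;
            DONE-FAILED|ABORTED|CANCELLED)
                echo "job ended as $status" >&2
                return 1 ;;
        esac
        tries=$((tries - 1))
        sleep 30
    done
    echo "gave up waiting for $1" >&2
    return 1
}

# Usage:
#   wait_for_cream_job https://ce-cr-02.ts.infn.it:8443/CREAM127814374
```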
Try another submission to that CE using the glite-ce-job-submit command, and then try to cancel the job (using the glite-ce-job-cancel command):
$ /bin/cat sleep2.jdl
[
executable="/bin/sleep";
arguments="1000";
]
$ glite-ce-job-submit -a -r cecream-cyb.ca.infn.it:8443/cream-lsf-poncert sleep2.jdl
https://cecream-cyb.ca.infn.it:8443/CREAM126335182
$ glite-ce-job-cancel https://cecream-cyb.ca.infn.it:8443/CREAM126335182
$ glite-ce-job-status https://cecream-cyb.ca.infn.it:8443/CREAM126335182
2010-07-27 12:18:26,973 WARN - No configuration file suitable for loading. Using built-in configuration
****** JobID=[https://cecream-cyb.ca.infn.it:8443/CREAM126335182]
Status = [CANCELLED]
ExitCode = []
Description = [Cancelled by user]
SE checks
Check that the gridftp server on the SE works (NOTE: this command is no longer present on the SL5 UI):
$ edg-gridftp-ls gsiftp://inaf-se-01.ct.pi2s2.it/
Check that the SRM service answers to the SRM client (the right port to use can be found in the published information):
$ clientSRM ping -e httpg://sunstorm.cnaf.infn.it:8444
============================================================
Sending Ping request to: httpg://sunstorm.cnaf.infn.it:8444
============================================================
Request status:
statusCode="SRM_SUCCESS"(0)
explanation="SRM server successfully contacted"
============================================================
SRM Response:
versionInfo="v2.2"
otherInfo (size=2)
[0] key="backend_type"
[0] value="StoRM"
[1] key="backend_version"
[1] value="<FE:1.5.0-1.sl4><BE:1.5.3-4.sl4>"
============================================================
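When the ping is run from a script, the interesting parts of the output can be reduced to one line. A minimal sketch, assuming the output layout shown above; srm_ping_summary is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read clientSRM ping output on stdin, print a one-line
# summary and return non-zero unless statusCode="SRM_SUCCESS" was reported.
srm_ping_summary() {
    awk '
        /statusCode="SRM_SUCCESS"/ { ok = 1 }
        /key="backend_type"/       { want = 1; next }
        # The value belonging to backend_type is on the following line.
        want && /value=/           { split($0, q, "\""); backend = q[2]; want = 0 }
        END {
            if (ok) {
                print "SRM OK, backend: " (backend == "" ? "unknown" : backend)
            } else {
                print "SRM ping failed"
                exit 1
            }
        }
    '
}

# Usage:
#   clientSRM ping -e httpg://sunstorm.cnaf.infn.it:8444 | srm_ping_summary
```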
If you want, try to write to the SE. Make sure that your UI points to an information system (IS) that contains the SE (you may use our certification BDII), e.g.:
$ export LCG_GFAL_INFOSYS=gridit-bdii-01.cnaf.infn.it:2170
$ lcg-cr -v --vo glast.org -d storm-fe-cg.cr.cnaf.infn.it -l lfn:/grid/glast.org/wfug.jdl file:/home/paolini/rank.jdl
$ lcg-del -v --vo glast.org -a <guid>
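The <guid> needed by lcg-del is the one returned by lcg-cr, so the two steps can be chained. The sketch below assumes that lcg-cr prints the GUID on standard output as a line starting with "guid:"; please verify this on your UI. extract_guid is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print the first line starting with "guid:" from the
# lcg-cr output read on standard input (assumed output format, see above).
extract_guid() {
    sed -n '/^guid:/{p;q;}'
}

# Usage (same VO, SE and files as in the example above):
#   guid=$(lcg-cr --vo glast.org -d storm-fe-cg.cr.cnaf.infn.it \
#              -l lfn:/grid/glast.org/wfug.jdl file:/home/paolini/rank.jdl \
#          | extract_guid)
#   [ -n "$guid" ] && lcg-del -v --vo glast.org -a "$guid"
```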
Job submission
Submit a test job to either the lcg-CE or the Cream-CE through the WMS, i.e. using the glite-wms-job-submit command. If the site supports MPI, also submit an MPI test job. Our certification WMS is gridit-cert-wms.cnaf.infn.it
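As an illustration, a trivial test job could be prepared and submitted as sketched below. write_test_jdl is a hypothetical helper and the JDL is just an example, not the certification job. The WMProxy endpoint URL (port 7443 and the glite_wms_wmproxy_server path) is an assumption based on the usual WMS setup, so check it against the published GlueService information.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal JDL that runs /bin/hostname and
# returns its standard output and error in the output sandbox.
write_test_jdl() {
    cat > "$1" <<'EOF'
[
  Executable = "/bin/hostname";
  StdOutput = "std.out";
  StdError = "std.err";
  OutputSandbox = {"std.out", "std.err"};
]
EOF
}

# Usage (endpoint URL is an assumption; the CE id after -r forces the job
# onto the resource under test, here in the form used elsewhere on this page):
#   write_test_jdl hostname.jdl
#   glite-wms-job-submit -a \
#       -e https://gridit-cert-wms.cnaf.infn.it:7443/glite_wms_wmproxy_server \
#       -r gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert \
#       -o jobid.txt hostname.jdl
#   glite-wms-job-status -i jobid.txt
#   glite-wms-job-output -i jobid.txt
```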
Registration into 1st level HLR
After the site has entered production, its resources need to be registered in the HLR.
Ask the site-admins to open a ticket towards the HLR administrators, passing them the following information:
- grid queue names, in the form:
  - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
- non-grid queue names, in the form:
- Name, surname and certificate subject of each site-admin
- Certificate subject of the Computing Element
Finally, the site-admins have to open a ticket to the DGAS support unit asking to enable the forwarding of accounting data from the 2nd level HLR to APEL
Certification Job
The test job checks several things, like the environment on the WN and the rpms installed. Moreover, it performs some replica management tests.
With a "grep TEST" you can get a summary of the results: in case of errors, you have to look in detail at what went wrong!
As already said, if the site supports any flavour of MPI, launch an MPI test job, like this one. Don't forget to set a reasonable value for CPUNumber: the important thing is that your job starts running soon.
If you want less output in the .out and .err files, comment out the following line in the file mpi-start-wrapper.sh:
export I2G_MPI_START_DEBUG=1
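If you prefer not to edit the wrapper by hand, the line can be commented out with sed. A minimal sketch; disable_mpi_start_debug is a hypothetical helper, and it assumes the line appears in the wrapper exactly as written above (possibly indented).

```shell
#!/bin/sh
# Hypothetical helper: comment out "export I2G_MPI_START_DEBUG=1" in the
# wrapper script given as first argument, leaving every other line untouched.
disable_mpi_start_debug() {
    tmp="$1.tmp.$$"
    # Write the edited copy to a temporary file, then copy it back over the
    # original so that the original file's permissions are preserved.
    sed 's/^\([[:space:]]*\)\(export I2G_MPI_START_DEBUG=1\)/\1# \2/' "$1" > "$tmp" &&
        cat "$tmp" > "$1"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Usage:
#   disable_mpi_start_debug mpi-start-wrapper.sh
```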
A successful output will look like the following (extract):
[...]
mpi-start [DEBUG ]: using user supplied startup : '/opt/mpich-1.2.7p1/bin/mpirun '
mpi-start [DEBUG ]: => MPI_SPECIFIC_PARAMS=
mpi-start [DEBUG ]: => I2G_MPI_PRECOMMAND=
mpi-start [DEBUG ]: => MPIEXEC=/opt/mpich-1.2.7p1/bin/mpirun
mpi-start [DEBUG ]: => I2G_MACHINEFILE_AND_NP=-machinefile /tmp/tmp.iBypc12521 -np 6
mpi-start [DEBUG ]: => I2G_MPI_APPLICATION=/home/dteam022/globus-tmp.t3-wn-13.11955.0/https_3a_2f_2falbalonga.cnaf.infn.it_3a9000_2fI06uWaKi1evxL3tTF-DTOg/hello
mpi-start [DEBUG ]: => I2G_MPI_APPLICATION_ARGS=
mpi-start [DEBUG ]: /opt/mpich-1.2.7p1/bin/mpirun -machinefile /tmp/tmp.iBypc12521 -np 6 /home/dteam022/globus-tmp.t3-wn-13.11955.0/https_3a_2f_2falbalonga.cnaf.infn.it_3a9000_2fI06uWaKi1evxL3tTF-DTOg/hello
Process 4 on t3-wn-37.pn.pd.infn.it out of 6
Process 3 on t3-wn-34.pn.pd.infn.it out of 6
Process 1 on t3-wn-13.pn.pd.infn.it out of 6
Process 2 on t3-wn-34.pn.pd.infn.it out of 6
Process 5 on t3-wn-37.pn.pd.infn.it out of 6
Process 0 on t3-wn-13.pn.pd.infn.it out of 6
[...]
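To verify quickly that every MPI process reported back, the "Process <rank> on <host> out of <N>" lines can be counted. This is a sketch that assumes the hello program prints exactly that format, as in the extract above; check_mpi_ranks is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: read the job output on stdin and succeed only if the
# number of distinct ranks that printed a "Process ..." line equals the
# total N announced in those lines.
check_mpi_ranks() {
    awk '
        /^Process [0-9]+ on .+ out of [0-9]+/ {
            n = $NF
            if (!($2 in seen)) { distinct++ }
            seen[$2]++
        }
        END {
            if (n > 0 && distinct == n) {
                print "all " n " ranks reported"
                exit 0
            }
            print "expected " (n + 0) " ranks, saw " (distinct + 0)
            exit 1
        }
    '
}

# Usage (the file name is only an example):
#   check_mpi_ranks < mpi-test.out
```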