-- AlessandroPaolini - 2010-11-26

How to test a site before putting it into the production grid

Make sure that the site's GIIS URL is included in the BDII we use for certification (gridit-bdii-01.cnaf.infn.it).

Check the consistency of the published information

The main branches of the LDAP tree are:

  • GlueSiteUniqueID
  • GlueSubClusterUniqueID
  • GlueCEUniqueID
  • GlueCESEBind
  • GlueSEUniqueID
  • GlueServiceUniqueID
An LDAP browser makes this work easier. For those who prefer the command line, each check below comes with an example ldapsearch command.
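
ldapsearch folds long LDIF lines, so a long dn or attribute value may be split across several lines, which gets in the way of grep or awk. The following is a minimal sketch of a helper that joins them back together; the name unfold_ldif is hypothetical and not part of any grid tool.

```shell
# unfold_ldif (hypothetical helper): join LDIF continuation lines (lines that
# start with a single space) onto the line they continue, so that every
# attribute ends up on one line.
unfold_ldif() {
    awk '
        /^ / { buf = buf substr($0, 2); next }  # continuation: append without the leading space
        { if (NR > 1) print buf; buf = $0 }      # new logical line: flush the previous one
        END { if (NR > 0) print buf }            # flush the last line
    '
}
```

It can be appended to any of the queries in this page, for example: ldapsearch -x -LLL -H ldap://gridit-bdii-01.cnaf.infn.it:2170 -b mds-vo-name=local,o=grid 'objectClass=GlueSite' GlueSiteName | unfold_ldif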

Under the branch GlueSiteUniqueID check the values of the following parameters:

  • GlueSiteName
  • GlueSiteUserSupportContact
  • GlueSiteSysAdminContact
  • GlueSiteSecurityContact
  • GlueSiteOtherInfo
EXAMPLE:

$ ldapsearch -x -LLL -H ldap://gridit-bdii-01.cnaf.infn.it:2170 -b mds-vo-name=local,o=grid 'objectClass=GlueSite' GlueSiteName GlueSiteUserSupportContact GlueSiteSysAdminContact GlueSiteSecurityContact GlueSiteOtherInfo
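
To avoid checking the output by eye, the list of missing attributes can be computed from the LDIF. This is only a sketch: missing_attrs is a hypothetical helper, and it assumes the input contains a single site entry with each attribute on one line.

```shell
# missing_attrs (hypothetical helper): read LDIF on stdin and print every
# attribute name given as an argument that has no non-empty value in the input.
# Prints nothing when all the requested attributes are present.
missing_attrs() {
    awk -v wanted="$*" '
        BEGIN { n = split(wanted, names, " ") }
        # LDIF lines look like "Attr: value"; remember attributes that carry a value
        /^[A-Za-z][A-Za-z0-9-]*: *[^ ]/ {
            attr = $0
            sub(/:.*/, "", attr)
            seen[attr] = 1
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(names[i] in seen)) print names[i]
        }
    '
}
```

For example, pipe the query above into: missing_attrs GlueSiteName GlueSiteUserSupportContact GlueSiteSysAdminContact GlueSiteSecurityContact GlueSiteOtherInfo. Since the certification BDII returns every site it knows, the filter should be narrowed to the site under test, e.g. '(&(objectClass=GlueSite)(GlueSiteUniqueID=<site-name>))', where <site-name> is a placeholder.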

Under the branch GlueSubClusterUniqueID check the values of the following parameters:

  • GlueHostApplicationSoftwareRunTimeEnvironment, which should contain:
    • the site name
    • the current version of the middleware
    • R-GMA
    • SI00MeanPerCPU_<value> and SF00MeanPerCPU_<value>
    • MPICH (and other related tags), if the site supports MPI jobs
    • AFS, where applicable (and verify that the WNs mount /afs)
  • GlueHostProcessorOtherDescription (for instance: Cores=1, or: Cores=2,Benchmark=7.92-HEP-SPEC06)
  • GlueHostOperatingSystemName (e.g. ScientificSL)
  • GlueHostOperatingSystemVersion (e.g. Beryllium)
  • GlueHostOperatingSystemRelease (e.g. 4.5)

EXAMPLE:

$ ldapsearch -x -LLL -H ldap://virgo-ce.roma1.infn.it:2170 -b mds-vo-name=resource,o=grid 'objectClass=GlueSubCluster' GlueHostProcessorOtherDescription
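
The runtime environment tags can be checked in the same spirit. The sketch below uses a hypothetical helper, missing_rte_prefixes, which matches tags by prefix so that entries with a site-specific suffix (such as SI00MeanPerCPU_<value>) are recognised. It assumes each tag is on its own LDIF line.

```shell
# missing_rte_prefixes (hypothetical helper): read LDIF on stdin and print each
# prefix given as an argument for which no
# GlueHostApplicationSoftwareRunTimeEnvironment value starts with that prefix.
missing_rte_prefixes() {
    awk -v wanted="$*" '
        BEGIN { n = split(wanted, pfx, " ") }
        /^GlueHostApplicationSoftwareRunTimeEnvironment: / {
            tag = $0
            sub(/^[^:]*: */, "", tag)          # keep only the tag value
            for (i = 1; i <= n; i++)
                if (index(tag, pfx[i]) == 1) hit[i] = 1
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(i in hit)) print pfx[i]
        }
    '
}
```

A possible use is to request GlueHostApplicationSoftwareRunTimeEnvironment in the query above and pipe the result into: missing_rte_prefixes SI00MeanPerCPU_ SF00MeanPerCPU_ MPICH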

Under the branch GlueCEUniqueID check the values of the following parameters:

  • GlueCEInfoTotalCPUs: if the value is “0”, something is wrong
  • GlueCEStateWaitingJobs: if the value is “44444”, red alarm!
  • GlueCEInfoLRMSType: pbs or lsf (or sge, …)
  • GlueCEStateStatus: Production or Draining
  • GlueCEAccessControlBaseRule: VOs enabled on the queue
  • GlueCECapability
EXAMPLE:
$ ldapsearch -x -LLL -H ldap://virgo-ce.roma1.infn.it:2170 -b mds-vo-name=INFN-ROMA1-VIRGO,o=grid 'objectClass=GlueCE' GlueCEInfoTotalCPUs GlueCEInfoJobManager GlueCEImplementationName 
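
The two alarm values mentioned above can be spotted automatically. A minimal sketch follows; ce_sanity and its output labels are hypothetical, and the placeholder value for waiting jobs is taken from the text above and can be overridden with the first argument.

```shell
# ce_sanity (hypothetical helper): read GlueCE entries in LDIF form on stdin and
# report queues that publish 0 total CPUs or the placeholder number of waiting
# jobs. $1 = placeholder value (defaults to 44444, as quoted in this page).
# Assumes each dn and attribute is on a single line.
ce_sanity() {
    awk -v bogus="${1:-44444}" '
        /^dn: / { dn = substr($0, 5) }   # remember the entry being read
        $1 == "GlueCEInfoTotalCPUs:"    && $2 == "0"   { print "ZERO_CPUS " dn }
        $1 == "GlueCEStateWaitingJobs:" && $2 == bogus { print "BOGUS_WAITING " dn }
    '
}
```

To use it, request GlueCEInfoTotalCPUs and GlueCEStateWaitingJobs in the query above and pipe the result into ce_sanity; no output means neither alarm value was found.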

For each SE, the following must be defined:

  • GlueCESEBindSEUniqueID
    • GlueCESEBindCEAccesspoint and GlueCESEBindMountInfo
EXAMPLE:
$ ldapsearch -x -LLL -H ldap://prod-ce-02.pd.infn.it:2170 -b mds-vo-name=resource,o=grid 'objectClass=GlueCESEBind' GlueCESEBindSEUniqueID GlueCESEBindCEUniqueID GlueCESEBindMountInfo 

Under the branch GlueSEUniqueID check the values of the following parameters:

  • GlueSALocalID: VO information
  • GlueSEAccessProtocolLocalID: rfio, srm_v1, srm_v2, classic, gsiftp, gsidcap
  • GlueSEType (deprecated)
  • GlueSEArchitecture
  • GlueSAStateUsedSpace
  • GlueSAStateAvailableSpace
EXAMPLE:
$ ldapsearch -x -LLL -H ldap://prod-bdii-02.pd.infn.it:2170 -b mds-vo-name=INFN-PADOVA,o=grid 'objectClass=GlueSE' 
$ ldapsearch -x -LLL -H ldap://prod-bdii-02.pd.infn.it:2170 -b mds-vo-name=INFN-PADOVA,o=grid 'objectClass=GlueSA' GlueSAAccessControlBaseRule 

There is a GlueServiceUniqueID branch for each service published by the site (WMS, LFC, DPM, GRIDICE, LB, MYPROXY, BDII, …). Services are distinguished by their GlueServiceType value, e.g.:

  • lcg-file-catalog
  • org.glite.wms.WMProxy
  • org.glite.lb.Server
  • srm_v1, SRM
EXAMPLE:
$ ldapsearch -x -LLL -H ldap://gridit-ce-001.cnaf.infn.it:2170 -b mds-vo-name=INFN-CNAF,o=grid 'objectClass=GlueService' GlueServiceType GlueServiceEndpoint GlueServiceName 
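
For a quick overview of what a site publishes, the GlueServiceType values can be counted. This is just a sketch, and service_types is a hypothetical name.

```shell
# service_types (hypothetical helper): read GlueService entries in LDIF form on
# stdin and print one "<type> <count>" line per distinct GlueServiceType.
service_types() {
    sed -n 's/^GlueServiceType: *//p' |
        LC_ALL=C sort |
        uniq -c |
        awk '{ print $2, $1 }'
}
```

Piping the query above into service_types shows at a glance whether an expected service type is missing or published more often than expected.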

Check the functionality of the grid elements

lcg-CE checks

Verify authentication and authorization on the CE by running a simple command such as /bin/hostname or /usr/bin/whoami:

$ globus-job-run inaf-ce-01.ct.pi2s2.it /bin/hostname

If the batch system is PBS, check the WNs, e.g.:

$ globus-job-run pbs-enmr.cerm.unifi.it /usr/bin/pbsnodes -a 

Verify that the batch system works. Make sure that the queue you are querying really exists and that your VO is enabled on it. For example:

$ globus-job-run ce-cyb.ca.infn.it/jobmanager-lcglsf -queue poncert /bin/pwd 

Check that the DGAS processes are running on the CE (with ps ax | grep dgas).

Cream-CE checks

Open your browser to

https://<hostname-of-cream-ce>:8443/ce-cream/services 

A page with a link to the CREAM WSDL should be shown.

Try a gsiftp transfer (e.g. using globus-url-copy or uberftp) towards that CREAM CE, for example:

$ globus-url-copy gsiftp://<hostname-of-cream-ce>/opt/glite/yaim/etc/versions/ig-yaim file:/tmp/ig-version-test  

Try the following command:

$ glite-ce-allowed-submission <hostname-of-cream-ce>:8443

It should report:

Job Submission to this CREAM CE is enabled  

Try a submission to Cream-CE using the glite-ce-job-submit command, e.g.:

$ /bin/cat sleep.jdl 
 
[ 
executable="/bin/sleep"; 
arguments="1"; 
] 
$ glite-ce-job-submit -a -r <hostname-of-cream-ce>:8443/<queue> sleep.jdl
$ glite-ce-job-submit -a -r ce-cr-02.ts.infn.it:8443/cream-lsf-cert sleep.jdl 
https://ce-cr-02.ts.infn.it:8443/CREAM127814374 

Check the status of that job, which should eventually be DONE-OK:

$ glite-ce-job-status https://ce-cr-02.ts.infn.it:8443/CREAM127814374
2010-07-27 11:55:37,986 WARN - No configuration file suitable for loading. Using built-in configuration

******  JobID=[https://ce-cr-02.ts.infn.it:8443/CREAM127814374]
        Status        = [DONE-OK]
        ExitCode      = [0]
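
Rather than re-running the status command by hand, the status can be extracted from its output and polled. The sketch below assumes the output layout shown above; cream_status and wait_cream_job are hypothetical helper names, and the default interval and number of polls are arbitrary.

```shell
# cream_status (hypothetical helper): print the value between brackets on the
# "Status = [...]" line of glite-ce-job-status output read from stdin.
cream_status() {
    sed -n 's/^[[:space:]]*Status[[:space:]]*=[[:space:]]*\[\(.*\)\][[:space:]]*$/\1/p'
}

# wait_cream_job (hypothetical helper): poll a job until it reaches a final state.
# $1 = job ID, $2 = seconds between polls (default 30), $3 = max polls (default 20).
# Returns 0 on DONE-OK, 1 on any other final state, 2 if no final state was seen.
wait_cream_job() {
    tries=0
    while [ "$tries" -lt "${3:-20}" ]; do
        st=$(glite-ce-job-status "$1" 2>/dev/null | cream_status)
        case $st in
            DONE-OK) return 0 ;;
            DONE-FAILED|ABORTED|CANCELLED) echo "final state: $st" >&2; return 1 ;;
        esac
        tries=$((tries + 1))
        sleep "${2:-30}"
    done
    echo "no final state after $tries polls" >&2
    return 2
}
```

For example: wait_cream_job https://ce-cr-02.ts.infn.it:8443/CREAM127814374 10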

Submit another job to that CE with glite-ce-job-submit, then try to cancel it with glite-ce-job-cancel:

$ /bin/cat sleep2.jdl 
 
[ 
executable="/bin/sleep"; 
arguments="1000"; 
] 

$ glite-ce-job-submit -a -r cecream-cyb.ca.infn.it:8443/cream-lsf-poncert sleep2.jdl
https://cecream-cyb.ca.infn.it:8443/CREAM126335182

$ glite-ce-job-cancel https://cecream-cyb.ca.infn.it:8443/CREAM126335182

$ glite-ce-job-status https://cecream-cyb.ca.infn.it:8443/CREAM126335182
2010-07-27 12:18:26,973 WARN - No configuration file suitable for loading. Using built-in configuration

******  JobID=[https://cecream-cyb.ca.infn.it:8443/CREAM126335182]
        Status        = [CANCELLED]
        ExitCode      = []
        Description   = [Cancelled by user]

SE checks

Check that the GridFTP server on the SE works (NOTE: this command is no longer available on SL5 UIs):

$ edg-gridftp-ls gsiftp://inaf-se-01.ct.pi2s2.it/ 

Check that the SRM endpoint answers a ping from the SRM client (the right port to use can be found in the published information):

$ clientSRM ping -e httpg://sunstorm.cnaf.infn.it:8444
============================================================
Sending Ping request to: httpg://sunstorm.cnaf.infn.it:8444
============================================================
Request status:
  statusCode="SRM_SUCCESS"(0)
  explanation="SRM server successfully contacted"
============================================================
SRM Response:
  versionInfo="v2.2"
  otherInfo (size=2)
    [0] key="backend_type"
    [0] value="StoRM"
    [1] key="backend_version"
    [1] value="<FE:1.5.0-1.sl4><BE:1.5.3-4.sl4>"
============================================================
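
In a script, the ping result can be tested by looking for the SRM_SUCCESS status code in the output. This is a sketch based only on the output shown above; srm_ping_ok is a hypothetical name.

```shell
# srm_ping_ok (hypothetical helper): succeed if the clientSRM ping output read
# from stdin reports statusCode="SRM_SUCCESS".
srm_ping_ok() {
    grep -q 'statusCode="SRM_SUCCESS"'
}
```

For example: clientSRM ping -e httpg://sunstorm.cnaf.infn.it:8444 | srm_ping_ok && echo "SRM endpoint OK"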

Optionally, try writing to the SE. Make sure your UI points to an information system (IS) that contains the SE (you may use our certification BDII), e.g.:

$ export LCG_GFAL_INFOSYS=gridit-bdii-01.cnaf.infn.it:2170 
$ lcg-cr -v --vo glast.org -d storm-fe-cg.cr.cnaf.infn.it -l lfn:/grid/glast.org/wfug.jdl file:/home/paolini/rank.jdl 
$ lcg-del -v --vo glast.org -a <guid> 

Job submission

Submit a test job to either the lcg-CE or the Cream-CE through the WMS, i.e. using the glite-wms-job-submit command. If the site supports MPI, submit an MPI test job as well. Our certification WMS is gridit-cert-wms.cnaf.infn.it

Registration into 1st level HLR

Once the site has entered production, its resources must be registered in the HLR.
Ask the site-admins to open a ticket to the HLR administrators, providing the following information:

  • grid queue names, in the form:
    • gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert

  • non-grid queue names, in the form:
    • hostname:queue

  • Name, surname and certificate subject of each site-admin
  • Certificate subject of Computing Element
Finally, the site-admins have to open a ticket to the DGAS support unit asking to enable the forwarding of accounting data from the 2nd level HLR to APEL.

Certification Job

The test job checks several things, such as the environment on the WN and the installed RPMs. It also performs some replica management tests.
Running "grep TEST" on the job output gives a summary of the results; if there are errors, look at the details to see what went wrong.

As already mentioned, if the site supports any flavour of MPI, launch an MPI test job as well (an example is in the mpijobcert.tar attachment).
Don't forget to set a reasonable value for CPUNumber: what matters is that your job starts running soon.

For less verbose .out and .err files, comment out the following line in mpi-start-wrapper.sh:

export I2G_MPI_START_DEBUG=1 

Successful output looks like the following (excerpt):

[...] 
mpi-start [DEBUG  ]: using user supplied startup : '/opt/mpich-1.2.7p1/bin/mpirun ' 
mpi-start [DEBUG  ]: => MPI_SPECIFIC_PARAMS= 
mpi-start [DEBUG  ]: => I2G_MPI_PRECOMMAND= 
mpi-start [DEBUG  ]: => MPIEXEC=/opt/mpich-1.2.7p1/bin/mpirun 
mpi-start [DEBUG  ]: => I2G_MACHINEFILE_AND_NP=-machinefile /tmp/tmp.iBypc12521 -np 6 
mpi-start [DEBUG  ]: => I2G_MPI_APPLICATION=/home/dteam022/globus-tmp.t3-wn-13.11955.0/https_3a_2f_2falbalonga.cnaf.infn.it_3a9000_2fI06uWaKi1evxL3tTF-DTOg/hello 
mpi-start [DEBUG  ]: => I2G_MPI_APPLICATION_ARGS= 
mpi-start [DEBUG  ]: /opt/mpich-1.2.7p1/bin/mpirun -machinefile /tmp/tmp.iBypc12521 -np 6 /home/dteam022/globus-tmp.t3-wn-13.11955.0/https_3a_2f_2falbalonga.cnaf.infn.it_3a9000_2fI06uWaKi1evxL3tTF-DTOg/hello 
Process 4 on t3-wn-37.pn.pd.infn.it out of 6 
Process 3 on t3-wn-34.pn.pd.infn.it out of 6 
Process 1 on t3-wn-13.pn.pd.infn.it out of 6 
Process 2 on t3-wn-34.pn.pd.infn.it out of 6 
Process 5 on t3-wn-37.pn.pd.infn.it out of 6 
Process 0 on t3-wn-13.pn.pd.infn.it out of 6 
[...]  
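
A quick way to confirm that every MPI process reported in is to compare the number of distinct ranks with the total announced on each line. The sketch below assumes the "Process N on host out of M" lines printed by the hello test above; mpi_ranks_ok is a hypothetical name.

```shell
# mpi_ranks_ok (hypothetical helper): read the job output on stdin and succeed
# only if the number of distinct "Process N" ranks equals the "out of M" total.
mpi_ranks_ok() {
    awk '
        /^Process [0-9]+ on .* out of [0-9]+/ {
            total = $NF                                   # M, the announced number of processes
            if (!($2 in seen)) { seen[$2] = 1; count++ }  # count each rank once
        }
        END { exit !(count > 0 && count == total) }
    '
}
```

For example, with the job standard output saved in a file (the file name depends on the JDL): mpi_ranks_ok < <output-file> && echo "all ranks reported"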

Topic attachments

  • jobcert.rar (3.1 K, 2010-11-26 15:29, TWikiAdminUser)
  • mpijobcert.tar (10.0 K, 2010-11-26 15:29, TWikiAdminUser)
Topic revision: r1 - 2010-11-26 - TWikiAdminUser
 