-- AlessandroPaolini - 2010-11-26

---+ *How to test a site before putting it into the production grid*

Make sure that the site's *GIIS URL* is listed in the <a target="_top" href="http://www.italiangrid.org/fileadmin/bdii/cert-bdii-update.conf">BDII</a> (gridit-bdii-01.cnaf.infn.it) that we use for certification. If it is missing, please open a ticket on ticketing.cnaf.infn.it.

---++ <a name="Check_the_consistency_of_the_pub"></a> *1) Check the consistency of the published information*

The main branches of the LDAP tree are:
   * 1.1) =GlueSiteUniqueID=
   * 1.2) =GlueSubClusterUniqueID=
   * 1.3) =GlueCEUniqueID=
   * 1.4) =GlueCESEBind=
   * 1.5) =GlueSEUniqueID=
   * 1.6) =GlueServiceUniqueID=

An LDAP browser makes this work easier. For those who prefer the command line, each subsection gives an example =ldapsearch= command for checking the relevant information.

---+++ 1.1) !GlueSiteUniqueID branch

Under the branch *GlueSiteUniqueID* check the values of the following parameters:
   * =GlueSiteName=
   * =GlueSiteUserSupportContact=
   * =GlueSiteSysAdminContact=
   * =GlueSiteSecurityContact=
   * =GlueSiteOtherInfo=

EXAMPLE:
<pre>$ ldapsearch -x -LLL -H ldap://gridit-bdii-01.cnaf.infn.it:2170 -b mds-vo-name=local,o=grid 'objectClass=GlueSite' GlueSiteName GlueSiteUserSupportContact GlueSiteSysAdminContact GlueSiteSecurityContact GlueSiteOtherInfo
</pre>

---+++ 1.2) !GlueSubClusterUniqueID branch

Under the branch *GlueSubClusterUniqueID* check the values of the following parameters:
<pre>
   * GlueHostApplicationSoftwareRunTimeEnvironment, checking for:
      * site name
      * current version of the middleware
      * R-GMA
      * SI00MeanPerCPU_<value> and SF00MeanPerCPU_<value>
      * if the site supports MPI jobs, MPICH (and other related tags)
      * if applicable, AFS (and verify that the WNs mount /afs)
   * GlueHostProcessorOtherDescription (for instance: Cores=2,Benchmark=7.92-HEP-SPEC06)
   * GlueHostOperatingSystemName (e.g. ScientificSL)
   * GlueHostOperatingSystemVersion (e.g. Beryllium)
   * GlueHostOperatingSystemRelease (e.g. 4.5)
</pre>

EXAMPLE:
<pre>$ ldapsearch -x -LLL -H ldap://virgo-ce.roma1.infn.it:2170 -b mds-vo-name=resource,o=grid 'objectClass=GlueSubCluster' GlueHostOperatingSystemName GlueHostOperatingSystemVersion GlueHostOperatingSystemRelease GlueHostProcessorOtherDescription
</pre>

---+++ 1.3) !GlueCEUniqueID branch

Under the branch *GlueCEUniqueID* check the values of the following parameters:
   * =GlueCEInfoTotalCPUs=: a value of “0” means something is wrong
   * =GlueCEStateWaitingJobs=: a value of “44444” is a red alarm
   * =GlueCEInfoLRMSType=: pbs or lsf (or sge, …)
   * =GlueCEStateStatus=: Production or Draining
   * =GlueCEAccessControlBaseRule=: VOs enabled on the queue
   * =GlueCECapability=

EXAMPLE:
<pre>$ ldapsearch -x -LLL -H ldap://virgo-ce.roma1.infn.it:2170 -b mds-vo-name=INFN-ROMA1-VIRGO,o=grid 'objectClass=GlueCE' GlueCEInfoTotalCPUs GlueCEStateWaitingJobs GlueCEInfoJobManager GlueCEImplementationName GlueCEInfoLRMSType GlueCEStateStatus GlueCEAccessControlBaseRule GlueCECapability
</pre>

---+++ 1.4) !GlueCESEBindSEUniqueID branch

For each SE the following must be defined:
   * =GlueCESEBindSEUniqueID=
   * =GlueCESEBindCEAccesspoint= and =GlueCESEBindMountInfo=

EXAMPLE:
<pre>$ ldapsearch -x -LLL -H ldap://prod-ce-02.pd.infn.it:2170 -b mds-vo-name=resource,o=grid 'objectClass=GlueCESEBind' GlueCESEBindSEUniqueID GlueCESEBindCEUniqueID GlueCESEBindMountInfo
</pre>

---+++ 1.5) !GlueSEUniqueID branch

Under the branch *GlueSEUniqueID* check the values of the following parameters:
   * =GlueSALocalID=: VO information
   * =GlueSEAccessProtocolLocalID=: rfio, srm_v1, srm_v2, classic, gsiftp, gsidcap
   * =GlueSEType= (deprecated)
   * =GlueSEArchitecture=
   * =GlueSAStateUsedSpace=
   * =GlueSAStateAvailableSpace=

EXAMPLE:
<pre>$ ldapsearch -x -LLL -H ldap://prod-bdii-02.pd.infn.it:2170 -b mds-vo-name=INFN-PADOVA,o=grid 'objectClass=GlueSE'
$ ldapsearch -x -LLL -H ldap://prod-bdii-02.pd.infn.it:2170 -b mds-vo-name=INFN-PADOVA,o=grid 'objectClass=GlueSA' GlueSAAccessControlBaseRule
</pre>

---+++ 1.6) !GlueServiceUniqueID branch

There is a *GlueServiceUniqueID* branch for each service published by the site (WMS, LFC, DPM, GRIDICE, LB, MYPROXY, BDII,…). The services are distinguished by the value of =GlueServiceType=, e.g.:
   * lcg-file-catalog
   * org.glite.wms.WMProxy
   * org.glite.lb.Server
   * srm_v1, SRM

EXAMPLE:
<pre>$ ldapsearch -x -LLL -H ldap://gridit-ce-001.cnaf.infn.it:2170 -b mds-vo-name=INFN-CNAF,o=grid 'objectClass=GlueService' GlueServiceType GlueServiceEndpoint GlueServiceName
</pre>

---++ <a name="Check_the_functionality_of_the_g"></a> *Check the functionality of the grid elements*

---+++ <a name="lcg_CE_checks"></a> *lcg-CE checks*

Verify authentication and authorization on the CE by running =/bin/hostname= (or =/usr/bin/whoami=, or any other command you like):
<pre>$ globus-job-run inaf-ce-01.ct.pi2s2.it /bin/hostname
</pre>

If the batch system is PBS, check the WNs, e.g.:
<pre>$ globus-job-run pbs-enmr.cerm.unifi.it /usr/bin/pbsnodes -a
</pre>

Verify that the batch system works. Make sure that the queue you are querying really exists and that your VO is enabled on it. For example:
<pre>$ globus-job-run ce-cyb.ca.infn.it/jobmanager-lcglsf -queue poncert /bin/pwd
</pre>

Check that the DGAS processes are running on the CE (with =ps ax | grep dgas=).

---+++ <a name="Cream_CE_checks"></a> *Cream-CE checks*

Open your browser at
<pre>https://<hostname-of-cream-ce>:8443/ce-cream/services
</pre>

A page with a link to the *CREAM WSDL* should be shown.

Try a *gsiftp* transfer (e.g. using =globus-url-copy= or =uberftp=) towards that CREAM CE.
E.g.:
<pre>$ globus-url-copy gsiftp://<hostname-of-cream-ce>/opt/glite/yaim/etc/versions/ig-yaim file:/tmp/ig-version-test
</pre>

Try the following command:
<pre>$ glite-ce-allowed-submission <hostname-of-cream-ce>:8443
</pre>

It should report:
<pre>Job Submission to this CREAM CE is enabled
</pre>

Try a submission to the Cream-CE using the *glite-ce-job-submit* command, e.g.:
<pre>$ /bin/cat sleep.jdl
[
executable="/bin/sleep";
arguments="1";
]
</pre>
<pre>$ glite-ce-job-submit -a -r <hostname-of-cream-ce>:8443/<queue> sleep.jdl
</pre>

For example:
<pre>$ glite-ce-job-submit -a -r ce-cr-02.ts.infn.it:8443/cream-lsf-cert sleep.jdl
https://ce-cr-02.ts.infn.it:8443/CREAM127814374
</pre>

Check the status of that job, which should eventually be *DONE-OK*:
<pre>$ glite-ce-job-status https://ce-cr-02.ts.infn.it:8443/CREAM127814374
2010-07-27 11:55:37,986 WARN - No configuration file suitable for loading. Using built-in configuration
****** JobID=[https://ce-cr-02.ts.infn.it:8443/CREAM127814374]
Status = [DONE-OK]
ExitCode = [0]
</pre>

Submit another job to that CE with the glite-ce-job-submit command, then try to cancel it (using the glite-ce-job-cancel command):
<pre>$ /bin/cat sleep2.jdl
[
executable="/bin/sleep";
arguments="1000";
]
</pre>
<pre>$ glite-ce-job-submit -a -r cecream-cyb.ca.infn.it:8443/cream-lsf-poncert sleep2.jdl
https://cecream-cyb.ca.infn.it:8443/CREAM126335182
$ glite-ce-job-cancel https://cecream-cyb.ca.infn.it:8443/CREAM126335182
$ glite-ce-job-status https://cecream-cyb.ca.infn.it:8443/CREAM126335182
2010-07-27 12:18:26,973 WARN - No configuration file suitable for loading. Using built-in configuration
****** JobID=[https://cecream-cyb.ca.infn.it:8443/CREAM126335182]
Status = [CANCELLED]
ExitCode = []
Description = [Cancelled by user]
</pre>

---+++ <a name="SE_checks"></a> *SE checks*

Check that the *gridftp server* on the SE works (NOTE: this command is no longer available on the sl5 UI):
<pre>$ edg-gridftp-ls gsiftp://inaf-se-01.ct.pi2s2.it/
</pre>

Check that the SRM client works (the published information tells you the right port to use):
<pre>$ clientSRM ping -e httpg://sunstorm.cnaf.infn.it:8444
============================================================
Sending Ping request to: httpg://sunstorm.cnaf.infn.it:8444
============================================================
Request status:
statusCode="SRM_SUCCESS"(0)
explanation="SRM server successfully contacted"
============================================================
SRM Response:
versionInfo="v2.2"
otherInfo (size=2)
[0] key="backend_type"
[0] value="StoRM"
[1] key="backend_version"
[1] value="<FE:1.5.0-1.sl4><BE:1.5.3-4.sl4>"
============================================================
</pre>

Optionally, try to write to the SE. Make sure your UI points to an information system (IS) that contains the SE (you may use our certification BDII), e.g.:
<pre>$ export LCG_GFAL_INFOSYS=gridit-bdii-01.cnaf.infn.it:2170
$ lcg-cr -v --vo glast.org -d storm-fe-cg.cr.cnaf.infn.it -l lfn:/grid/glast.org/wfug.jdl file:/home/paolini/rank.jdl
$ lcg-del -v --vo glast.org -a <guid>
</pre>

---+++ <a name="Job_submission"></a> *Job submission*

Submit a test job to either the *lcg-CE* or the *Cream-CE* through the *WMS*, i.e. using the *glite-wms-job-submit* command. If the site supports MPI, also submit an MPI test job.
Our certification WMS is gridit-cert-wms.cnaf.infn.it

---+++ <a name="Registration_into_1st_level_HLR"></a> *Registration into 1st level HLR*

After the site has entered production, its resources need to be registered in the HLR.<br />
Ask the site admins to open a ticket to the HLR administrators, giving them the following information:
   * grid queue names, in the form:
      * gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
   * non-grid queue names, in the form:
      * hostname:queue
   * Name, surname and certificate subject of each site admin
   * Certificate subject of the Computing Element

Finally, the site admins have to open a ticket to the DGAS support unit asking to enable the forwarding of accounting data from the 2nd level HLR to APEL.

---++ <a name="Certication_Job"></a> *Certification Job*

The <a target="_top" href="https://twiki.cnaf.infn.it/twiki/pub/Sandbox/SiteCertification/jobcert.rar">test job</a> checks several things, such as the environment on the WN and the installed rpms. It also performs some replica management tests.<br />
With a "grep TEST" you get a summary of the results: in case of errors, you have to look in detail at what went wrong.

As already said, if the site supports any flavour of MPI, launch an MPI test job, like <a target="_top" href="https://twiki.cnaf.infn.it/twiki/pub/Sandbox/SiteCertification/mpijobcert.tar">this</a><br />
Don't forget to set a reasonable value for =CPUNumber=: what matters is that your job starts running soon.

If you want less output in the .out and .err files, comment out the following line in the file mpi-start-wrapper.sh:
<pre>export I2G_MPI_START_DEBUG=1
</pre>

A successful output looks like the following (extract):
<pre>[...]
mpi-start [DEBUG ]: using user supplied startup : '/opt/mpich-1.2.7p1/bin/mpirun '
mpi-start [DEBUG ]: => MPI_SPECIFIC_PARAMS=
mpi-start [DEBUG ]: => I2G_MPI_PRECOMMAND=
mpi-start [DEBUG ]: => MPIEXEC=/opt/mpich-1.2.7p1/bin/mpirun
mpi-start [DEBUG ]: => I2G_MACHINEFILE_AND_NP=-machinefile /tmp/tmp.iBypc12521 -np 6
mpi-start [DEBUG ]: => I2G_MPI_APPLICATION=/home/dteam022/globus-tmp.t3-wn-13.11955.0/https_3a_2f_2falbalonga.cnaf.infn.it_3a9000_2fI06uWaKi1evxL3tTF-DTOg/hello
mpi-start [DEBUG ]: => I2G_MPI_APPLICATION_ARGS=
mpi-start [DEBUG ]: /opt/mpich-1.2.7p1/bin/mpirun -machinefile /tmp/tmp.iBypc12521 -np 6 /home/dteam022/globus-tmp.t3-wn-13.11955.0/https_3a_2f_2falbalonga.cnaf.infn.it_3a9000_2fI06uWaKi1evxL3tTF-DTOg/hello
Process 4 on t3-wn-37.pn.pd.infn.it out of 6
Process 3 on t3-wn-34.pn.pd.infn.it out of 6
Process 1 on t3-wn-13.pn.pd.infn.it out of 6
Process 2 on t3-wn-34.pn.pd.infn.it out of 6
Process 5 on t3-wn-37.pn.pd.infn.it out of 6
Process 0 on t3-wn-13.pn.pd.infn.it out of 6
[...]
</pre>

   * <a target="_top" href="https://twiki.cnaf.infn.it/twiki/pub/Sandbox/SiteCertification/jobcert.rar">jobcert.rar</a>: certification job
   * <a target="_top" href="https://twiki.cnaf.infn.it/twiki/pub/Sandbox/SiteCertification/mpijobcert.tar">mpijobcert.tar</a>: MPI test job
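The red flags listed in section 1.3 for the *GlueCEUniqueID* branch can also be spotted with a small filter. What follows is only a sketch and not part of the certification procedure: =check_glue_ce= is a hypothetical helper name, it assumes the plain LDIF printed by =ldapsearch -LLL=, and it assumes that the alarm value of =GlueCEStateWaitingJobs= is any run of five or more 4s.

```shell
#!/bin/sh
# check_glue_ce: hypothetical helper, not shipped with any grid middleware.
# Reads LDIF (as printed by "ldapsearch -LLL ... objectClass=GlueCE") on
# stdin and prints one line per suspicious value described in section 1.3.
# Returns 1 if anything was flagged, 0 otherwise.
check_glue_ce() {
    awk '
        # Remember the entry the following attributes belong to.
        # (ldapsearch may wrap a long dn; only its first line is kept here.)
        /^dn: / { dn = substr($0, 5) }

        # No CPUs published for this queue.
        /^GlueCEInfoTotalCPUs: 0$/ {
            print "TotalCPUs is 0: " dn
            bad = 1
        }

        # Assumption: the alarm value is a run of five or more 4s.
        /^GlueCEStateWaitingJobs: 44444+$/ {
            print "WaitingJobs is " $2 ": " dn
            bad = 1
        }

        # Section 1.3 lists Production and Draining as the expected states.
        /^GlueCEStateStatus: / && $2 != "Production" && $2 != "Draining" {
            print "Status is " $2 ": " dn
            bad = 1
        }

        END { exit (bad ? 1 : 0) }
    '
}
```

Once the function is defined in your shell, it can be fed with the same query used in section 1.3, for instance:
<pre>$ ldapsearch -x -LLL -H ldap://virgo-ce.roma1.infn.it:2170 -b mds-vo-name=INFN-ROMA1-VIRGO,o=grid 'objectClass=GlueCE' GlueCEInfoTotalCPUs GlueCEStateWaitingJobs GlueCEStateStatus | check_glue_ce
</pre>
No output and a zero exit status mean that none of the three checks fired; the other attributes still have to be reviewed by hand.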
Topic revision: r3 - 2011-10-28 - AlessandroPaolini
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.