From angel at miami.edu Mon Dec 1 10:25:34 2003
From: angel at miami.edu (Angel Li)
Date: Mon, 01 Dec 2003 13:25:34 -0500
Subject: [Rocks-Discuss]cluster-fork
Message-ID: <3FCB879E.8050905@miami.edu>

Hi,

I recently installed Rocks 3.0 on a Linux cluster and when I run the
command "cluster-fork" I get this error:

apple* cluster-fork ls
Traceback (innermost last):
  File "/opt/rocks/sbin/cluster-fork", line 88, in ?
    import rocks.pssh
  File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ?
    import gmon.encoder
ImportError: Bad magic number in
/usr/lib/python1.5/site-packages/gmon/encoder.pyc

Any thoughts? I'm also wondering where to find the python sources for
files in /usr/lib/python1.5/site-packages/gmon.

Thanks,

Angel



From jghobrial at uh.edu Mon Dec 1 11:35:06 2003
From: jghobrial at uh.edu (Joseph)
Date: Mon, 1 Dec 2003 13:35:06 -0600 (CST)
Subject: [Rocks-Discuss]cluster-fork
In-Reply-To: <3FCB879E.8050905@miami.edu>
References: <3FCB879E.8050905@miami.edu>
Message-ID: <Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu>

On Mon, 1 Dec 2003, Angel Li wrote:
Hello Angel, I have the same problem, and so far there has been no
response to my post about this from a month ago.

Is your frontend an AMD setup??

I am thinking this is an AMD problem.

Thanks,
Joseph


>   Hi,
>
>   I recently installed Rocks 3.0 on a Linux cluster and when I run the
>   command "cluster-fork" I get this error:
>
>   apple* cluster-fork ls
>   Traceback (innermost last):
>     File "/opt/rocks/sbin/cluster-fork", line 88, in ?
>       import rocks.pssh
>     File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ?
>       import gmon.encoder
>   ImportError: Bad magic number in
>   /usr/lib/python1.5/site-packages/gmon/encoder.pyc
>
>   Any thoughts? I'm also wondering where to find the python sources for
>   files in /usr/lib/python1.5/site-packages/gmon.
>
>   Thanks,
>
>   Angel
>


From tim.carlson at pnl.gov Mon Dec 1 14:58:54 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Mon, 01 Dec 2003 14:58:54 -0800 (PST)
Subject: [Rocks-Discuss]odd kickstart problem
In-Reply-To: <76AC0F5E-2025-11D8-804D-000393A4725A@sdsc.edu>
Message-ID: <Pine.LNX.4.44.0312011453020.22892-100000@scorpion.emsl.pnl.gov>

Trying to bring up an old dead node on a Rocks 2.3.2 cluster, I get the
following error in /var/log/httpd/error_log:


Traceback (innermost last):
  File "/opt/rocks/sbin/kgen", line 530, in ?
    app.run()
  File "/opt/rocks/sbin/kgen", line 497, in run
    doc = FromXmlStream(file)
  File "/usr/lib/python1.5/site-packages/xml/dom/ext/reader/Sax2.py", line 386, in FromXmlStream
    return reader.fromStream(stream, ownerDocument)
  File "/usr/lib/python1.5/site-packages/xml/dom/ext/reader/Sax2.py", line 372, in fromStream
    self.parser.parse(s)
  File "/usr/lib/python1.5/site-packages/xml/sax/expatreader.py", line 58, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python1.5/site-packages/xml/sax/xmlreader.py", line 125, in parse
    self.close()
  File "/usr/lib/python1.5/site-packages/xml/sax/expatreader.py", line 154, in close
    self.feed("", isFinal = 1)
  File "/usr/lib/python1.5/site-packages/xml/sax/expatreader.py", line 148, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python1.5/site-packages/xml/dom/ext/reader/Sax2.py", line 340, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <stdin>:3298:0: no element found


Doing a wget of
http://frontend-0/install/kickstart.cgi?arch=i386&np=2&project=rocks
on one of the working internal nodes yields the same error.

Any thoughts on this?
I've also done a fresh
rocks-dist dist

Tim



From sjenks at uci.edu Mon Dec 1 15:35:54 2003
From: sjenks at uci.edu (Stephen Jenks)
Date: Mon, 1 Dec 2003 15:35:54 -0800
Subject: [Rocks-Discuss]cluster-fork
In-Reply-To: <Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu>
References: <3FCB879E.8050905@miami.edu>
<Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu>
Message-ID: <1B15A45F-2457-11D8-A374-00039389B580@uci.edu>

FYI, I have a dual Athlon frontend and didn't have that problem. I know
that doesn't exactly help you, but at least it doesn't fail on all AMD
machines.

It looks like the .pyc file might be corrupt in your installation. The
source .py file (encoder.py) is in the
/usr/lib/python1.5/site-packages/gmon directory, so perhaps removing
the .pyc file would let Python regenerate it (if you run cluster-fork as root?)

The md5sum for encoder.pyc on my system is:
459c78750fe6e065e9ed464ab23ab73d encoder.pyc
So you can check if yours is different.
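(A side note, sketched with today's Python 3 importlib names rather than the
Python 1.5 that Rocks ships: deleting a bytecode cache file is safe, because
the interpreter rebuilds it from the .py on the next import. demo_mod below
is a made-up scratch module, not anything from the gmon package.)

```python
import importlib
import importlib.util
import os
import pathlib
import sys
import tempfile

# Create a throwaway module on the import path.
d = tempfile.mkdtemp()
pathlib.Path(d, "demo_mod.py").write_text("value = 42\n")
sys.path.insert(0, d)

import demo_mod                                       # first import writes the bytecode cache

cache = importlib.util.cache_from_source(demo_mod.__file__)
os.remove(cache)                                      # simulate removing a corrupt .pyc
importlib.reload(demo_mod)                            # re-import recompiles and rewrites it
print(os.path.exists(cache))
```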

Steve Jenks


On Dec 1, 2003, at 11:35 AM, Joseph wrote:

> On Mon, 1 Dec 2003, Angel Li wrote:
> Hello Angel, I have the same problem and so far there is no response
> when
> I posted about this a month ago.
>
> Is your frontend an AMD setup??
>
> I am thinking this is an AMD problem.
>
> Thanks,
> Joseph
>
>
>> Hi,
>>
>> I recently installed Rocks 3.0 on a Linux cluster and when I run the
>> command "cluster-fork" I get this error:
>>
>> apple* cluster-fork ls
>> Traceback (innermost last):
>>   File "/opt/rocks/sbin/cluster-fork", line 88, in ?
>>     import rocks.pssh
>>   File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ?
>>     import gmon.encoder
>> ImportError: Bad magic number in
>>   /usr/lib/python1.5/site-packages/gmon/encoder.pyc
>>
>>   Any thoughts? I'm also wondering where to find the python sources for
>>   files in /usr/lib/python1.5/site-packages/gmon.
>>
>>   Thanks,
>>
>>   Angel
>>



From mjk at sdsc.edu Mon Dec 1 19:03:16 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Mon, 1 Dec 2003 19:03:16 -0800
Subject: [Rocks-Discuss]odd kickstart problem
In-Reply-To: <Pine.LNX.4.44.0312011453020.22892-100000@scorpion.emsl.pnl.gov>
References: <Pine.LNX.4.44.0312011453020.22892-100000@scorpion.emsl.pnl.gov>
Message-ID: <132DD626-2474-11D8-A7A4-000A95DA5638@sdsc.edu>

You'll need to run the kpp and kgen steps (what kickstart.cgi does for
you) manually to find out if this is an XML error.

       # cd /home/install/profiles/current
       # kpp compute

This will generate a kickstart file for a compute node, although some
information will be missing since it isn't specific to a node (unlike
what ./kickstart.cgi --client=node-name generates). What this does do
is traverse the XML graph and build a monolithic XML kickstart
profile. If this step works, you can then pipe ("|") the output into kgen
to convert the XML to kickstart syntax. Something in this procedure
should fail and point to the error.
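(As a side note on the error itself: "no element found" is expat's message
for an XML stream that ends before any root element appears, i.e. an empty or
truncated profile. A minimal reproduction of that class of failure, shown in
modern Python; on Rocks 2.3.2 the Python 1.5 module layout differs slightly.)

```python
import xml.sax

# Feeding expat an empty document triggers the same fatal parse error
# kgen reported: "no element found".
msg = ""
try:
    xml.sax.parseString(b"", xml.sax.ContentHandler())
except xml.sax.SAXParseException as exc:
    msg = exc.getMessage()

print(msg)
```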

       -mjk

On Dec 1, 2003, at 2:58 PM, Tim Carlson wrote:

>   Trying to bring up an old dead node on a Rocks 2.3.2 cluster and I get
>   the
>   following error in /var/log/httpd/error_log
>
>
>   Traceback (innermost last):
>     File "/opt/rocks/sbin/kgen", line 530, in ?
>        app.run()
>     File "/opt/rocks/sbin/kgen", line 497, in run
>        doc = FromXmlStream(file)
>     File "/usr/lib/python1.5/site-packages/xml/dom/ext/reader/Sax2.py",
>   line
>   386, in FromXmlStream
>        return reader.fromStream(stream, ownerDocument)
>     File "/usr/lib/python1.5/site-packages/xml/dom/ext/reader/Sax2.py",
>   line
>   372, in fromStream
>        self.parser.parse(s)
>     File "/usr/lib/python1.5/site-packages/xml/sax/expatreader.py", line
>   58,
>   in parse
>        xmlreader.IncrementalParser.parse(self, source)
>     File "/usr/lib/python1.5/site-packages/xml/sax/xmlreader.py", line
>   125,
>   in parse
>        self.close()
>     File "/usr/lib/python1.5/site-packages/xml/sax/expatreader.py", line
>   154, in close
>        self.feed("", isFinal = 1)
>     File "/usr/lib/python1.5/site-packages/xml/sax/expatreader.py", line
>   148, in feed
>        self._err_handler.fatalError(exc)
>     File "/usr/lib/python1.5/site-packages/xml/dom/ext/reader/Sax2.py",
>   line
>   340, in fatalError
>        raise exception
>   xml.sax._exceptions.SAXParseException: <stdin>:3298:0: no element found
>
>
>   Doing a wget of
>   http://frontend-0/install/kickstart.cgi?
>   arch=i386&np=2&project=rocks
>   on one of the working internal nodes yields the same error.
>
>   Any thoughts on this?
>
>   I've also done a fresh
>   rocks-dist dist
>
>   Tim



From tim.carlson at pnl.gov Mon Dec 1 20:42:51 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Mon, 01 Dec 2003 20:42:51 -0800 (PST)
Subject: [Rocks-Discuss]odd kickstart problem
In-Reply-To: <132DD626-2474-11D8-A7A4-000A95DA5638@sdsc.edu>
Message-ID: <Pine.GSO.4.44.0312012040250.3148-100000@paradox.emsl.pnl.gov>

On Mon, 1 Dec 2003, Mason J. Katz wrote:

> You'll need to run the kpp and kgen steps (what kickstart.cgi does for
> your) manually to find if this is an XML error.
>
>     # cd /home/install/profiles/current
>     # kpp compute

That was the trick. This sent me down the correct path. I had uninstalled
SGE on the frontend (I was having problems with SGE and wanted to start
from scratch).

Adding the two SGE XML files back to /home/install/profiles/2.3.2/nodes/
fixed everything.

Thanks!

Tim


From landman at scalableinformatics.com Tue Dec 2 04:15:07 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 02 Dec 2003 07:15:07 -0500
Subject: [Rocks-Discuss]supermicro based MB's
Message-ID: <3FCC824B.5060406@scalableinformatics.com>

Folks:

  Working on integrating a Supermicro MB-based cluster. Discovered early
on that all of the compute nodes have an Intel-based NIC that RedHat
doesn't know anything about (any version of RH). Some of the
administrative nodes have other, similar issues. I am seeing a surprising
amount of misdetected or undetected hardware across the collection of MBs.

  Anyone have advice on where to get modules/module source for RedHat
for these things? It looks like I will need to rebuild the boot CD,
though the several times I have tried this previously have failed to
produce a working/bootable system. It looks like new modules need to be
created/inserted into the boot-process kernels (head node and cluster
nodes), as well as into the installable kernels.

   Has anyone done this for a Supermicro MB-based system? Thanks.

Joe

--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615




From jghobrial at uh.edu Tue Dec 2 08:28:08 2003
From: jghobrial at uh.edu (Joseph)
Date: Tue, 2 Dec 2003 10:28:08 -0600 (CST)
Subject: [Rocks-Discuss]cluster-fork
In-Reply-To: <1B15A45F-2457-11D8-A374-00039389B580@uci.edu>
References: <3FCB879E.8050905@miami.edu>
<Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu>
 <1B15A45F-2457-11D8-A374-00039389B580@uci.edu>
Message-ID: <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu>

Indeed my md5sum is different for encoder.pyc. However, when I pulled the
file and ran "cluster-fork", Python complained about an import problem. So it
seems that regeneration did not occur. Is there a flag I need to pass?

I have also tried to figure out what package provides encoder and
reinstall that package, but an rpm query reveals nothing.

If this is a generated file, what generates it?

An rpm file query on ganglia shows that the other files in the
directory belong to the package, but encoder.pyc does not.

Thanks,
Joseph



On Mon, 1 Dec 2003, Stephen Jenks wrote:
> FYI, I have a dual Athlon frontend and didn't have that problem. I know
> that doesn't exactly help you, but at least it doesn't fail on all AMD
> machines.
>
> It looks like the .pyc file might be corrupt in your installation. The
> source .py file (encoder.py) is in the
> /usr/lib/python1.5/site-packages/gmon directory, so perhaps removing
> the .pyc file would regenerate it (if you run cluster-fork as root?)
>
> The md5sum for encoder.pyc on my system is:
> 459c78750fe6e065e9ed464ab23ab73d encoder.pyc
> So you can check if yours is different.
>
> Steve Jenks
>
>
> On Dec 1, 2003, at 11:35 AM, Joseph wrote:
>
> > On Mon, 1 Dec 2003, Angel Li wrote:
> > Hello Angel, I have the same problem and so far there is no response
> > when
> > I posted about this a month ago.
> >
> > Is your frontend an AMD setup??
> >
> > I am thinking this is an AMD problem.
> >
> > Thanks,
> > Joseph
> >
> >
> >> Hi,
> >>
> >> I recently installed Rocks 3.0 on a Linux cluster and when I run the
> >> command "cluster-fork" I get this error:
> >>
> >> apple* cluster-fork ls
> >> Traceback (innermost last):
> >>   File "/opt/rocks/sbin/cluster-fork", line 88, in ?
> >>     import rocks.pssh
> >>   File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ?
> >>     import gmon.encoder
> >> ImportError: Bad magic number in
> >> /usr/lib/python1.5/site-packages/gmon/encoder.pyc
> >>
> >> Any thoughts? I'm also wondering where to find the python sources for
> >> files in /usr/lib/python1.5/site-packages/gmon.
> >>
> >> Thanks,
> >>
> >> Angel
> >>
>
From angel at miami.edu Tue Dec 2 09:02:55 2003
From: angel at miami.edu (Angel Li)
Date: Tue, 02 Dec 2003 12:02:55 -0500
Subject: [Rocks-Discuss]cluster-fork
In-Reply-To: <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu>
References: <3FCB879E.8050905@miami.edu>
<Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> <1B15A45F-2457-11D8-
A374-00039389B580@uci.edu> <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu>
Message-ID: <3FCCC5BF.3030903@miami.edu>

Joseph wrote:

>Indeed my md5sum is different for encoder.pyc. However, when I pulled the
>file and run "cluster-fork" python responds about an import problem. So it
>seems that regeneration did not occur. Is there a flag I need to pass?
>
>I have also tried to figure out what package provides encoder and
>reinstall the package, but an rpm query reveals nothing.
>
>If this is a generated file, what generates it?
>
>It seems that an rpm file query on ganglia show that files in the
>directory belong to the package, but encoder.pyc does not.
>
>Thanks,
>Joseph
>
>
>
>
I have finally found the Python sources on the HPC roll CD, in
ganglia-python-3.0.0-2.i386.rpm. I'm not familiar with Python, but it
seems Python "compiles" the .py files to ".pyc" and then deletes the
source files the first time they are referenced? I also noticed that
there are two versions of Python installed. Maybe the .pyc files from one
version won't load into the other one?

Angel




From mjk at sdsc.edu Tue Dec 2 15:52:52 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 2 Dec 2003 15:52:52 -0800
Subject: [Rocks-Discuss]cluster-fork
In-Reply-To: <3FCCC5BF.3030903@miami.edu>
References: <3FCB879E.8050905@miami.edu>
<Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> <1B15A45F-2457-11D8-
A374-00039389B580@uci.edu> <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu>
<3FCCC5BF.3030903@miami.edu>
Message-ID: <A43157DE-2522-11D8-A7A4-000A95DA5638@sdsc.edu>

Python creates the .pyc files for you, and does not remove the original
.py file. I would be extremely surprised if two "identical" .pyc files
had the same md5 checksum. I'd expect this to be more like a C .o file,
which often contains padding data out to the end of a page and to 32/64-bit
word sizes. Still, this is just a guess; the real point is that you
can always remove a .pyc file and the .py will regenerate it
when imported (although standard UNIX file/dir permissions still apply).
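(To illustrate the "bad magic number" side of this, using today's Python 3
API rather than the 1.5 one on Rocks: the first bytes of a .pyc record which
interpreter compiled it, which is why bytecode from one installed Python
won't load into the other. encoder_demo below is a made-up scratch module.)

```python
import importlib.util
import pathlib
import py_compile
import tempfile

# Byte-compile a throwaway module and inspect the .pyc header: it begins
# with the compiling interpreter's magic number. On import, a mismatch
# with the running interpreter raises "ImportError: bad magic number".
with tempfile.TemporaryDirectory() as d:
    src = pathlib.Path(d) / "encoder_demo.py"
    src.write_text("x = 1\n")
    pyc_path = py_compile.compile(str(src))      # byte-compile, as import does
    header = pathlib.Path(pyc_path).read_bytes()[:4]

print(header == importlib.util.MAGIC_NUMBER)
```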

What is the import error you get from cluster-fork?

     -mjk

On Dec 2, 2003, at 9:02 AM, Angel Li wrote:

> Joseph wrote:
>
>> Indeed my md5sum is different for encoder.pyc. However, when I pulled
>> the file and run "cluster-fork" python responds about an import
>> problem. So it seems that regeneration did not occur. Is there a flag
>> I need to pass?
>>
>> I have also tried to figure out what package provides encoder and
>> reinstall the package, but an rpm query reveals nothing.
>>
>> If this is a generated file, what generates it?
>>
>> It seems that an rpm file query on ganglia show that files in the
>> directory belong to the package, but encoder.pyc does not.
>>
>> Thanks,
>> Joseph
>>
>>
>>
> I have finally found the python sources in the HPC rolls CD, filename
> ganglia-python-3.0.0-2.i386.rpm. I'm not familiar with python but it
> seems python "compiles" the .py files to ".pyc" and then deletes the
> source file the first time they are referenced? I also noticed that
> there are two versions of python installed. Maybe the pyc files from
> one version won't load into the other one?
>
> Angel
>
>



From vrowley at ucsd.edu Mon Dec 1 14:27:03 2003
From: vrowley at ucsd.edu (V. Rowley)
Date: Mon, 01 Dec 2003 14:27:03 -0800
Subject: [Rocks-Discuss]PXE boot problems
Message-ID: <3FCBC037.5000302@ucsd.edu>

We have installed a ROCKS 3.0.0 frontend on a DL380 and are trying to
install a compute node via PXE. We are getting an error similar to the
one mentioned in the archives, e.g.

> Loading initrd.img....
> Ready
>
> Failed to free base memory
>
We have upgraded to syslinux-2.07-1, per the suggestion in the archives,
but continue to get the same error. Any ideas?

--
Vicky Rowley                              email: vrowley at ucsd.edu
Biomedical Informatics Research Network      work: (858) 536-5980
University of California, San Diego           fax: (858) 822-0828
9500 Gilman Drive
La Jolla, CA 92093-0715


See pictures from our trip to China at http://www.sagacitech.com/Chinaweb



From naihh at imcb.a-star.edu.sg Tue Dec 2 18:50:55 2003
From: naihh at imcb.a-star.edu.sg (Nai Hong Hwa Francis)
Date: Wed, 3 Dec 2003 10:50:55 +0800
Subject: [Rocks-Discuss]RE: When will Sun Grid Engine be included in Rocks 3
      for Itanium?
Message-ID: <5E118EED7CC277468A275F11EEEC39B94CCC22@EXIMCB2.imcb.a-star.edu.sg>


Hi Laurence,

I just downloaded Rocks 3.0 for IA32 and installed it, but SGE is
still not working.

Any idea?

Nai Hong Hwa Francis
Institute of Molecular and Cell Biology (A*STAR)
30 Medical Drive
Singapore 117609.
DID: (65) 6874-6196

-----Original Message-----
From: Laurence Liew [mailto:laurence at scalablesys.com]
Sent: Thursday, November 20, 2003 2:53 PM
To: Nai Hong Hwa Francis
Cc: npaci-rocks-discussion at sdsc.edu
Subject: Re: [Rocks-Discuss]RE: When will Sun Grid Engine be included
inRocks 3 for Itanium?

Hi Francis

The GridEngine roll is ready for ia32. We will get an ia64 native version
ready as soon as we get back from SC2003. It will be released in a few
weeks' time.

Globus GT2.4 is included in the Grid Roll

Cheers!
Laurence


On Thu, 2003-11-20 at 10:13, Nai Hong Hwa Francis wrote:
>
> Hi,
>
> Does anyone have any idea when will Sun Grid Engine be included as
part
> of Rocks 3 distribution.
>
> I am a newbie to Grid Computing.
> Anyone have any idea on how to invoke Globus in Rocks to setup a Grid?
>
> Regards
>
> Nai Hong Hwa Francis
>
> Institute of Molecular and Cell Biology (A*STAR)
> 30 Medical Drive
> Singapore 117609
> DID: 65-6874-6196
>
> -----Original Message-----
> From: npaci-rocks-discussion-request at sdsc.edu
> [mailto:npaci-rocks-discussion-request at sdsc.edu]
> Sent: Thursday, November 20, 2003 4:01 AM
> To: npaci-rocks-discussion at sdsc.edu
> Subject: npaci-rocks-discussion digest, Vol 1 #613 - 3 msgs
>
> Send npaci-rocks-discussion mailing list submissions to
>     npaci-rocks-discussion at sdsc.edu
>
> To subscribe or unsubscribe via the World Wide Web, visit
>
> http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
> or, via email, send a message with subject or body 'help' to
>     npaci-rocks-discussion-request at sdsc.edu
>
> You can reach the person managing the list at
>     npaci-rocks-discussion-admin at sdsc.edu
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of npaci-rocks-discussion digest..."
>
>
> Today's Topics:
>
>    1. top500 cluster installation movie (Greg Bruno)
>    2. Re: Running Normal Application on Rocks Cluster -
>         Newbie Question (Laurence Liew)
>
> --__--__--
>
> Message: 1
> To: npaci-rocks-discussion at sdsc.edu
> From: Greg Bruno <bruno at rocksclusters.org>
> Date: Tue, 18 Nov 2003 13:41:15 -0800
> Subject: [Rocks-Discuss]top500 cluster installation movie
>
> here's a crew of 7, installing the 201st fastest supercomputer in the
> world in under two hours on the showroom floor at SC 03:
>
> http://www.rocksclusters.org/rocks.mov
>
> warning: the above file is ~65MB.
>
>    - gb
>
>
> --__--__--
>
> Message: 2
> Subject: Re: [Rocks-Discuss]Running Normal Application on Rocks
Cluster
> -
>      Newbie Question
> From: Laurence Liew <laurenceliew at yahoo.com.sg>
> To: Leong Chee Shian <chee-shian.leong at schenker.com>
> Cc: npaci-rocks-discussion at sdsc.edu
> Date: Wed, 19 Nov 2003 12:31:18 +0800
>
> Chee Shian,
>
> Thanks for your call. We will take this off list and visit you next
week
> in your office as you requested.
>
> Cheers!
> laurence
>
>
>
> On Tue, 2003-11-18 at 17:29, Leong Chee Shian wrote:
> > I have just installed Rocks 3.0 with one frontend and two compute
> > node.
> >
> > A normal file based application is installed on the frontend and is
> > NFS shared to the compute nodes .
> >
> > Question is : When run 5 sessions of my applications , the CPU
> > utilization is all concentrated on the frontend node , nothing is
> > being passed on to the compute nodes . How do I make these 3
computers
> > to function as one and share the load ?
> >
> > Thanks everyone as I am really new to this clustering stuff..
> >
> > PS : The idea of exploring rocks cluster is to use a few inexpensive
> > intel machines to replace our existing multi CPU sun server,
> > suggestions and recommendations are greatly appreciated.
> >
> >
> > Leong
> >
> >
> >
>
>
>
> --__--__--
>
> _______________________________________________
> npaci-rocks-discussion mailing list
> npaci-rocks-discussion at sdsc.edu
> http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
>
>
> End of npaci-rocks-discussion Digest
>
>
> DISCLAIMER:
> This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its contents to any
other person as it may be an offence under the Official Secrets Act.
Thank you.
--
Laurence Liew
CTO, Scalable Systems Pte Ltd
7 Bedok South Road
Singapore 469272
Tel   : 65 6827 3953
Fax    : 65 6827 3922
Mobile: 65 9029 4312
Email : laurence at scalablesys.com
http://www.scalablesys.com



DISCLAIMER:
This email is confidential and may be privileged. If you are not the intended
recipient, please delete it and notify us immediately. Please do not copy or use it
for any purpose, or disclose its contents to any other person as it may be an
offence under the Official Secrets Act. Thank you.


From laurence at scalablesys.com Tue Dec 2 19:10:08 2003
From: laurence at scalablesys.com (Laurence Liew)
Date: Wed, 03 Dec 2003 11:10:08 +0800
Subject: [Rocks-Discuss]RE: When will Sun Grid Engine be included
      in Rocks 3 for Itanium?
In-Reply-To: <5E118EED7CC277468A275F11EEEC39B94CCC22@EXIMCB2.imcb.a-star.edu.sg>
References:
       <5E118EED7CC277468A275F11EEEC39B94CCC22@EXIMCB2.imcb.a-star.edu.sg>
Message-ID: <1070421007.2452.51.camel@scalable>

Hi,

SGE is in the SGE roll.

You need to download the base, hpc and sge roll.

The install is now different from V2.3.x

Cheers!
laurence



On Wed, 2003-12-03 at 10:50, Nai Hong Hwa Francis wrote:
> Hi Laurence,
>
>   I just downloaded the Rocks3.0 for IA32 and installed it but SGE is
>   still not working.
>
>   Any idea?
>
>   Nai Hong Hwa Francis
>   Institute of Molecular and Cell Biology (A*STAR)
>   30 Medical Drive
>   Singapore 117609.
>   DID: (65) 6874-6196
>
>   -----Original Message-----
>   From: Laurence Liew [mailto:laurence at scalablesys.com]
>   Sent: Thursday, November 20, 2003 2:53 PM
>   To: Nai Hong Hwa Francis
>   Cc: npaci-rocks-discussion at sdsc.edu
>   Subject: Re: [Rocks-Discuss]RE: When will Sun Grid Engine be included
>   inRocks 3 for Itanium?
>
>   Hi Francis
>
>   GridEngine roll is ready for ia32. We will get a ia64 native version
>   ready as soon as we get back from SC2003. It will be released in a few
>   weeks time.
>
>   Globus GT2.4 is included in the Grid Roll
>
>   Cheers!
>   Laurence
>
>
>   On Thu, 2003-11-20 at 10:13, Nai Hong Hwa Francis wrote:
>   >
>   > Hi,
>   >
>   > Does anyone have any idea when will Sun Grid Engine be included as
>   part
>   > of Rocks 3 distribution.
>   >
>   > I am a newbie to Grid Computing.
>   > Anyone have any idea on how to invoke Globus in Rocks to setup a Grid?
>   >
>   > Regards
>   >
>   > Nai Hong Hwa Francis
>   >
>   > Institute of Molecular and Cell Biology (A*STAR)
>   > 30 Medical Drive
>   > Singapore 117609
>   > DID: 65-6874-6196
>   >
>   > -----Original Message-----
>   > From: npaci-rocks-discussion-request at sdsc.edu
>   > [mailto:npaci-rocks-discussion-request at sdsc.edu]
>   > Sent: Thursday, November 20, 2003 4:01 AM
>   > To: npaci-rocks-discussion at sdsc.edu
>   > Subject: npaci-rocks-discussion digest, Vol 1 #613 - 3 msgs
>   >
>   > Send npaci-rocks-discussion mailing list submissions to
>   >   npaci-rocks-discussion at sdsc.edu
>   >
>   > To subscribe or unsubscribe via the World Wide Web, visit
>   >
>   > http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
>   > or, via email, send a message with subject or body 'help' to
>   >   npaci-rocks-discussion-request at sdsc.edu
>   >
>   > You can reach the person managing the list at
>   >   npaci-rocks-discussion-admin at sdsc.edu
>   >
>   > When replying, please edit your Subject line so it is more specific
>   > than "Re: Contents of npaci-rocks-discussion digest..."
>   >
>   >
>   > Today's Topics:
>   >
>   >     1. top500 cluster installation movie (Greg Bruno)
>   >     2. Re: Running Normal Application on Rocks Cluster -
>   >         Newbie Question (Laurence Liew)
>   >
>   > --__--__--
>   >
>   > Message: 1
>   > To: npaci-rocks-discussion at sdsc.edu
>   > From: Greg Bruno <bruno at rocksclusters.org>
>   > Date: Tue, 18 Nov 2003 13:41:15 -0800
>   > Subject: [Rocks-Discuss]top500 cluster installation movie
>   >
>   > here's a crew of 7, installing the 201st fastest supercomputer in the
>   > world in under two hours on the showroom floor at SC 03:
>   >
>   > http://www.rocksclusters.org/rocks.mov
>   >
>   > warning: the above file is ~65MB.
>   >
>   >    - gb
>   >
>   >
>   > --__--__--
>   >
>   > Message: 2
>   > Subject: Re: [Rocks-Discuss]Running Normal Application on Rocks
>   Cluster
>   > -
>   >   Newbie Question
>   > From: Laurence Liew <laurenceliew at yahoo.com.sg>
>   > To: Leong Chee Shian <chee-shian.leong at schenker.com>
>   > Cc: npaci-rocks-discussion at sdsc.edu
>   > Date: Wed, 19 Nov 2003 12:31:18 +0800
>   >
>   > Chee Shian,
>   >
>   > Thanks for your call. We will take this off list and visit you next
>   week
>   > in your office as you requested.
>   >
>   > Cheers!
>   > laurence
> >
> >
> >
> > On Tue, 2003-11-18 at 17:29, Leong Chee Shian wrote:
> > > I have just installed Rocks 3.0 with one frontend and two compute
> > > node.
> > >
> > > A normal file based application is installed on the frontend and is
> > > NFS shared to the compute nodes .
> > >
> > > Question is : When run 5 sessions of my applications , the CPU
> > > utilization is all concentrated on the frontend node , nothing is
> > > being passed on to the compute nodes . How do I make these 3
> computers
> > > to function as one and share the load ?
> > >
> > > Thanks everyone as I am really new to this clustering stuff..
> > >
> > > PS : The idea of exploring rocks cluster is to use a few inexpensive
> > > intel machines to replace our existing multi CPU sun server,
> > > suggestions and recommendations are greatly appreciated.
> > >
> > >
> > > Leong
> > >
> > >
> > >
> >
> >
> >
> > --__--__--
> >
> > _______________________________________________
> > npaci-rocks-discussion mailing list
> > npaci-rocks-discussion at sdsc.edu
> > http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
> >
> >
> > End of npaci-rocks-discussion Digest
> >
> >
> > DISCLAIMER:
> > This email is confidential and may be privileged. If you are not the
> intended recipient, please delete it and notify us immediately. Please
> do not copy or use it for any purpose, or disclose its contents to any
> other person as it may be an offence under the Official Secrets Act.
> Thank you.
--
Laurence Liew
CTO, Scalable Systems Pte Ltd
7 Bedok South Road
Singapore 469272
Tel   : 65 6827 3953
Fax    : 65 6827 3922
Mobile: 65 9029 4312
Email : laurence at scalablesys.com
http://www.scalablesys.com


From DGURGUL at PARTNERS.ORG Wed Dec 3 07:24:29 2003
From: DGURGUL at PARTNERS.ORG (Gurgul, Dennis J.)
Date: Wed, 3 Dec 2003 10:24:29 -0500
Subject: [Rocks-Discuss]RE: When will Sun Grid Engine be included
      in Rocks 3 for Itanium?
Message-ID: <BC447F1AD529D311B4DE0008C71BF2EB0AE157F7@phsexch7.mgh.harvard.edu>

Where do we find the SGE roll? Under Lhoste at http://rocks.npaci.edu/Rocks/
there is a "Grid" roll listed. Is SGE in that? The user guide doesn't mention
SGE.

Dennis J. Gurgul
Partners Health Care System
Research Management
Research Computing Core
617.724.3169


-----Original Message-----
From: npaci-rocks-discussion-admin at sdsc.edu
[mailto:npaci-rocks-discussion-admin at sdsc.edu]On Behalf Of Laurence Liew
Sent: Tuesday, December 02, 2003 10:10 PM
To: Nai Hong Hwa Francis
Cc: npaci-rocks-discussion at sdsc.edu
Subject: RE: [Rocks-Discuss]RE: When will Sun Grid Engine be included
inRocks 3 for Itanium?


Hi,

SGE is in the SGE roll.

You need to download the base, hpc and sge roll.

The install is now different from V2.3.x

Cheers!
laurence



On Wed, 2003-12-03 at 10:50, Nai Hong Hwa Francis wrote:
> Hi Laurence,
>
> I just downloaded the Rocks3.0 for IA32 and installed it but SGE is
> still not working.
>
> Any idea?
>
> Nai Hong Hwa Francis
> Institute of Molecular and Cell Biology (A*STAR)
> 30 Medical Drive
> Singapore 117609.
> DID: (65) 6874-6196
>
> -----Original Message-----
> From: Laurence Liew [mailto:laurence at scalablesys.com]
> Sent: Thursday, November 20, 2003 2:53 PM
>   To: Nai Hong Hwa Francis
>   Cc: npaci-rocks-discussion at sdsc.edu
>   Subject: Re: [Rocks-Discuss]RE: When will Sun Grid Engine be included
>   inRocks 3 for Itanium?
>
>   Hi Francis
>
>   GridEngine roll is ready for ia32. We will get a ia64 native version
>   ready as soon as we get back from SC2003. It will be released in a few
>   weeks time.
>
>   Globus GT2.4 is included in the Grid Roll
>
>   Cheers!
>   Laurence
>
>
>   On Thu, 2003-11-20 at 10:13, Nai Hong Hwa Francis wrote:
>   >
>   > Hi,
>   >
>   > Does anyone have any idea when will Sun Grid Engine be included as
>   part
>   > of Rocks 3 distribution.
>   >
>   > I am a newbie to Grid Computing.
>   > Anyone have any idea on how to invoke Globus in Rocks to setup a Grid?
>   >
>   > Regards
>   >
>   > Nai Hong Hwa Francis
>   >
>   > Institute of Molecular and Cell Biology (A*STAR)
>   > 30 Medical Drive
>   > Singapore 117609
>   > DID: 65-6874-6196
>   >
>   > -----Original Message-----
>   > From: npaci-rocks-discussion-request at sdsc.edu
>   > [mailto:npaci-rocks-discussion-request at sdsc.edu]
>   > Sent: Thursday, November 20, 2003 4:01 AM
>   > To: npaci-rocks-discussion at sdsc.edu
>   > Subject: npaci-rocks-discussion digest, Vol 1 #613 - 3 msgs
>   >
>   > Send npaci-rocks-discussion mailing list submissions to
>   >   npaci-rocks-discussion at sdsc.edu
>   >
>   > To subscribe or unsubscribe via the World Wide Web, visit
>   >
>   > http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
>   > or, via email, send a message with subject or body 'help' to
>   >   npaci-rocks-discussion-request at sdsc.edu
>   >
>   > You can reach the person managing the list at
>   >   npaci-rocks-discussion-admin at sdsc.edu
>   >
>   > When replying, please edit your Subject line so it is more specific
>   > than "Re: Contents of npaci-rocks-discussion digest..."
>   >
>   >
>   > Today's Topics:
>   >
>   >     1. top500 cluster installation movie (Greg Bruno)
>   >     2. Re: Running Normal Application on Rocks Cluster -
>   >         Newbie Question (Laurence Liew)
>   >
>   > --__--__--
>   >
>   > Message: 1
>   > To: npaci-rocks-discussion at sdsc.edu
>   > From: Greg Bruno <bruno at rocksclusters.org>
>   > Date: Tue, 18 Nov 2003 13:41:15 -0800
>   > Subject: [Rocks-Discuss]top500 cluster installation movie
>   >
>   > here's a crew of 7, installing the 201st fastest supercomputer in the
>   > world in under two hours on the showroom floor at SC 03:
>   >
>   > http://www.rocksclusters.org/rocks.mov
>   >
>   > warning: the above file is ~65MB.
>   >
>   >    - gb
>   >
>   >
>   > --__--__--
>   >
>   > Message: 2
>   > Subject: Re: [Rocks-Discuss]Running Normal Application on Rocks
>   Cluster
>   > -
>   >   Newbie Question
>   > From: Laurence Liew <laurenceliew at yahoo.com.sg>
>   > To: Leong Chee Shian <chee-shian.leong at schenker.com>
>   > Cc: npaci-rocks-discussion at sdsc.edu
>   > Date: Wed, 19 Nov 2003 12:31:18 +0800
>   >
>   > Chee Shian,
>   >
>   > Thanks for your call. We will take this off list and visit you next
>   week
>   > in your office as you requested.
>   >
>   > Cheers!
>   > laurence
>   >
>   >
>   >
>   > On Tue, 2003-11-18 at 17:29, Leong Chee Shian wrote:
>   > > I have just installed Rocks 3.0 with one frontend and two compute
>   > > node.
>   > >
>   > > A normal file based application is installed on the frontend and is
>   > > NFS shared to the compute nodes .
>   > >
>   > > Question is : When run 5 sessions of my applications , the CPU
>   > > utilization is all concentrated on the frontend node , nothing is
>   > > being passed on to the compute nodes . How do I make these 3
>   computers
> > > to function as one and share the load ?
> > >
> > > Thanks everyone as I am really new to this clustering stuff..
> > >
> > > PS : The idea of exploring rocks cluster is to use a few inexpensive
> > > intel machines to replace our existing multi CPU sun server,
> > > suggestions and recommendations are greatly appreciated.
> > >
> > >
> > > Leong
> > >
> > >
> > >
> >
> >
> >
> > --__--__--
> >
> > _______________________________________________
> > npaci-rocks-discussion mailing list
> > npaci-rocks-discussion at sdsc.edu
> > http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
> >
> >
> > End of npaci-rocks-discussion Digest
> >
> >
> > DISCLAIMER:
> > This email is confidential and may be privileged. If you are not the
> intended recipient, please delete it and notify us immediately. Please
> do not copy or use it for any purpose, or disclose its contents to any
> other person as it may be an offence under the Official Secrets Act.
> Thank you.
--
Laurence Liew
CTO, Scalable Systems Pte Ltd
7 Bedok South Road
Singapore 469272
Tel   : 65 6827 3953
Fax    : 65 6827 3922
Mobile: 65 9029 4312
Email : laurence at scalablesys.com
http://www.scalablesys.com


From bruno at rocksclusters.org Wed Dec 3 07:32:14 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Wed, 3 Dec 2003 07:32:14 -0800
Subject: [Rocks-Discuss]RE: When will Sun Grid Engine be included in Rocks 3 for
	Itanium?
In-Reply-To: <BC447F1AD529D311B4DE0008C71BF2EB0AE157F7@phsexch7.mgh.harvard.edu>
References: <BC447F1AD529D311B4DE0008C71BF2EB0AE157F7@phsexch7.mgh.harvard.edu>
Message-ID: <DF132702-25A5-11D8-86E6-000A95C4E3B4@rocksclusters.org>

>   Where do we find the SGE roll?   Under Lhoste at
>   http://rocks.npaci.edu/Rocks/
>   there is a "Grid" roll listed.   Is SGE in that?   The userguide doesn't
>   mention
>   SGE.
the SGE roll will be available in the upcoming v3.1.0 release.
scheduled release date is december 15th.

  - gb



From jlkaiser at fnal.gov Wed Dec 3 08:35:18 2003
From: jlkaiser at fnal.gov (Joe Kaiser)
Date: Wed, 03 Dec 2003 10:35:18 -0600
Subject: [Rocks-Discuss]supermicro based MB's
In-Reply-To: <3FCC824B.5060406@scalableinformatics.com>
References: <3FCC824B.5060406@scalableinformatics.com>
Message-ID: <1070469318.12324.13.camel@nietzsche.fnal.gov>

Hi,

You don't say what version of Rocks you are using. The following is for
the X5DPA-GG board and Rocks 3.0. It requires modifying only the
pcitable in the boot image on the tftp server. I believe the procedure
for 2.3.2 requires a heck of a lot more work (but it may not); I would
have to dig deep for my notes about changing 2.3.2.

This should be done on the frontend:

cd /tftpboot/X86PC/UNDI/pxelinux/
cp initrd.img initrd.img.orig
cp initrd.img /tmp
cd /tmp
mv initrd.img initrd.gz
gunzip initrd.gz
mkdir /mnt/loop
mount -o loop initrd /mnt/loop
cd /mnt/loop/modules/
vi pcitable

Search for the e1000 drivers and add the following line:

0x8086	0x1013	"e1000"	"Intel Corp.|82546EB Gigabit Ethernet Controller"

Write the file, then repack the image and put it back:

cd /tmp
umount /mnt/loop
gzip initrd
mv initrd.gz initrd.img
mv initrd.img /tftpboot/X86PC/UNDI/pxelinux/

Then boot the node.
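The manual pcitable edit above can also be scripted. A sketch (the device
IDs come from the post; `add_e1000_id` is a made-up helper name):

```shell
# Append the 82546EB entry to a pcitable unless it is already present,
# so the edit is safe to run more than once.
add_e1000_id() {
    f="$1"
    if ! grep -q '0x1013' "$f"; then
        printf '0x8086\t0x1013\t"e1000"\t"Intel Corp.|82546EB Gigabit Ethernet Controller"\n' >> "$f"
    fi
}
```

Run it against /mnt/loop/modules/pcitable after mounting the image, in
place of the vi step.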

Hope this helps.

Thanks,

Joe

On Tue, 2003-12-02 at 06:15, Joe Landman wrote:
> Folks:
>
>   Working on integrating a Supermicro MB based cluster. Discovered early
> on that all of the compute nodes have an Intel based NIC that RedHat
> doesn't know anything about (any version of RH). Some of the
> administrative nodes have other similar issues. I am seeing simply a
> surprising number of mis/un detected hardware across the collection of MBs.
>
>   Anyone have advice on where to get modules/module source for Redhat
> for these things? It looks like I will need to rebuild the boot CD,
> though the several times I have tried this previously have failed to
> produce a working/bootable system. It looks like new modules need to be
> created/inserted into the boot process (head node and cluster nodes)
> kernels, as well as into the installable kernels.
>
>     Has anyone done this for a Supermicro MB based system?  Thanks .
>
> Joe
--
===================================================================
Joe Kaiser - Systems Administrator

Fermi Lab
CD/OSS-SCS                Never laugh at live dragons.
630-840-6444
jlkaiser at fnal.gov
===================================================================



From jghobrial at uh.edu Wed Dec 3 08:59:15 2003
From: jghobrial at uh.edu (Joseph)
Date: Wed, 3 Dec 2003 10:59:15 -0600 (CST)
Subject: [Rocks-Discuss]cluster-fork
In-Reply-To: <A43157DE-2522-11D8-A7A4-000A95DA5638@sdsc.edu>
References: <3FCB879E.8050905@miami.edu>
<Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu>
 <1B15A45F-2457-11D8-A374-00039389B580@uci.edu>
<Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu>
 <3FCCC5BF.3030903@miami.edu> <A43157DE-2522-11D8-A7A4-000A95DA5638@sdsc.edu>
Message-ID: <Pine.LNX.4.56.0312031057280.11073@mail.tlc2.uh.edu>

Here is the error I receive when I remove the file encoder.pyc and run the
command cluster-fork

Traceback (innermost last):
  File "/opt/rocks/sbin/cluster-fork", line 88, in ?
    import rocks.pssh
  File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ?
    import gmon.encoder
ImportError: No module named encoder

Thanks,
Joseph


On Tue, 2 Dec 2003, Mason J. Katz wrote:

> Python creates the .pyc files for you, and does not remove the original
>   .py file. I would be extremely surprised if two "identical" .pyc files
>   had the same md5 checksum. I'd expect this to be more like C .o files,
>   which always contain random data to pad out to the end of a page and
>   32/64 bit word sizes. Still this is just a guess, the real point is
>   you can always remove the .pyc files and the .py will regenerate it
>   when imported (although standard UNIX file/dir permission still apply).
>
>   What is the import error you get from cluster-fork?
>
>      -mjk
>
>   On Dec 2, 2003, at 9:02 AM, Angel Li wrote:
>
>   > Joseph wrote:
>   >
>   >> Indeed my md5sum is different for encoder.pyc. However, when I pulled
>   >> the file and run "cluster-fork" python responds about an import
>   >> problem. So it seems that regeneration did not occur. Is there a flag
>   >> I need to pass?
>   >>
>   >> I have also tried to figure out what package provides encoder and
>   >> reinstall the package, but an rpm query reveals nothing.
>   >>
>   >> If this is a generated file, what generates it?
>   >>
>   >> It seems that an rpm file query on ganglia show that files in the
>   >> directory belong to the package, but encoder.pyc does not.
>   >>
>   >> Thanks,
>   >> Joseph
>   >>
>   >>
>   >>
>   > I have finally found the python sources in the HPC rolls CD, filename
>   > ganglia-python-3.0.0-2.i386.rpm. I'm not familiar with python but it
>   > seems python "compiles" the .py files to ".pyc" and then deletes the
>   > source file the first time they are referenced? I also noticed that
>   > there are two versions of python installed. Maybe the pyc files from
>   > one version won't load into the other one?
>   >
>   > Angel
>   >
>   >
>
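For reference, "Bad magic number" means the .pyc was produced by a
different interpreter version than the one importing it: every .pyc
begins with a version-specific magic number. A sketch with a modern
interpreter (the 1.5-era header layout differed in detail, but the
magic-number check is the same idea):

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "encoder.py")
    with open(src, "w") as f:
        f.write("VALUE = 42\n")
    # Compile the source to bytecode, then read back the 4-byte magic header.
    pyc = py_compile.compile(src, cfile=os.path.join(d, "encoder.pyc"))
    with open(pyc, "rb") as f:
        magic = f.read(4)
    # A .pyc whose magic differs from the running interpreter's
    # MAGIC_NUMBER fails to import with "Bad magic number".
    print(magic == importlib.util.MAGIC_NUMBER)
```

This is why removing the stale .pyc (and letting the matching .py
regenerate it) or reinstalling the right package fixes the error.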


From mjk at sdsc.edu Wed Dec 3 15:19:38 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Wed, 3 Dec 2003 15:19:38 -0800
Subject: [Rocks-Discuss]cluster-fork
In-Reply-To: <Pine.LNX.4.56.0312031057280.11073@mail.tlc2.uh.edu>
References: <3FCB879E.8050905@miami.edu>
<Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> <1B15A45F-2457-11D8-
A374-00039389B580@uci.edu> <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu>
<3FCCC5BF.3030903@miami.edu> <A43157DE-2522-11D8-A7A4-000A95DA5638@sdsc.edu>
<Pine.LNX.4.56.0312031057280.11073@mail.tlc2.uh.edu>
Message-ID: <2A332131-25E7-11D8-A641-000A95DA5638@sdsc.edu>

This file comes from a ganglia package. What does

# rpm -q ganglia-receptor

return?

     -mjk


On Dec 3, 2003, at 8:59 AM, Joseph wrote:

> Here is the error I receive when I remove the file encoder.pyc and run
> the
> command cluster-fork
>
> Traceback (innermost last):
>    File "/opt/rocks/sbin/cluster-fork", line 88, in ?
>      import rocks.pssh
>    File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ?
>      import gmon.encoder
> ImportError: No module named encoder
>
> Thanks,
> Joseph
>
>
> On Tue, 2 Dec 2003, Mason J. Katz wrote:
>
>> Python creates the .pyc files for you, and does not remove the
>> original
>> .py file. I would be extremely surprised it two "identical" .pyc
>> files
>> had the same md5 checksum. I'd expect this to be more like C .o file
>> which always contain random data to pad out to the end of a page and
>> 32/64 bit word sizes. Still this is just a guess, the real point is
>> you can always remove the .pyc files and the .py will regenerate it
>> when imported (although standard UNIX file/dir permission still
>> apply).
>>
>> What is the import error you get from cluster-fork?
>>
>>     -mjk
>>
>> On Dec 2, 2003, at 9:02 AM, Angel Li wrote:
>>
>>> Joseph wrote:
>>>
>>>> Indeed my md5sum is different for encoder.pyc. However, when I
>>>> pulled
>>>> the file and run "cluster-fork" python responds about an import
>>>> problem. So it seems that regeneration did not occur. Is there a
>>>> flag
>>>> I need to pass?
>>>>
>>>> I have also tried to figure out what package provides encoder and
>>>> reinstall the package, but an rpm query reveals nothing.
>>>>
>>>> If this is a generated file, what generates it?
>>>>
>>>> It seems that an rpm file query on ganglia show that files in the
>>>> directory belong to the package, but encoder.pyc does not.
>>>>
>>>> Thanks,
>>>> Joseph
>>>>
>>>>
>>>>
>>> I have finally found the python sources in the HPC rolls CD, filename
>>> ganglia-python-3.0.0-2.i386.rpm. I'm not familiar with python but it
>>> seems python "compiles" the .py files to ".pyc" and then deletes the
>>> source file the first time they are referenced? I also noticed that
>>> there are two versions of python installed. Maybe the pyc files from
>>> one version won't load into the other one?
>>>
>>> Angel
>>>
>>>
>>



From csamuel at vpac.org Wed Dec 3 18:09:26 2003
From: csamuel at vpac.org (Chris Samuel)
Date: Thu, 4 Dec 2003 13:09:26 +1100
Subject: [Rocks-Discuss]Confirmation of Rocks 3.1.0 Opteron support & RHEL
trademark removal ?
Message-ID: <200312041309.27986.csamuel@vpac.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi folks,

Can someone confirm that the next Rocks release will support Opteron, please?

Also, I noticed that the current Itanium Rocks release, which is based on
RHEL, still has a lot of mentions of Red Hat in it, which from my reading
of their trademark guidelines is not permitted. Is that fixed in the new
version?

cheers!
Chris
- --
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE/zpdWO2KABBYQAh8RAqB8AJ9FG+IjIeem21qlFS6XYIHamIMPmwCghVTV
AgjAlVHWgdv/KzYQinHGPxs=
=IAWU
-----END PGP SIGNATURE-----



From bruno at rocksclusters.org Wed Dec 3 18:46:30 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Wed, 3 Dec 2003 18:46:30 -0800
Subject: [Rocks-Discuss]Confirmation of Rocks 3.1.0 Opteron support & RHEL
trademark removal ?
In-Reply-To: <200312041309.27986.csamuel@vpac.org>
References: <200312041309.27986.csamuel@vpac.org>
Message-ID: <10AD9827-2604-11D8-86E6-000A95C4E3B4@rocksclusters.org>

> Can someone confirm that the next Rocks release will support Opteron
> please ?

yes, it will support opteron.

>   Also, I noticed that the current Rocks release on Itanium based on
>   RHEL still
>   has a lot of mentions of RedHat in it, which from my reading of their
>   trademark guidelines is not permitted, is that fixed in the new
>   version ?

and yes, (even though it doesn't feel like the right thing to do, as
redhat has offered to the community some outstanding technologies that
we'd like to credit), all redhat trademarks will be removed from 3.1.0.

    - gb



From fds at sdsc.edu Thu Dec 4 06:46:32 2003
From: fds at sdsc.edu (Federico Sacerdoti)
Date: Thu, 4 Dec 2003 06:46:32 -0800
Subject: [Rocks-Discuss]cluster-fork
In-Reply-To: <Pine.LNX.4.56.0312031057280.11073@mail.tlc2.uh.edu>
References: <3FCB879E.8050905@miami.edu>
<Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> <1B15A45F-2457-11D8-
A374-00039389B580@uci.edu> <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu>
<3FCCC5BF.3030903@miami.edu> <A43157DE-2522-11D8-A7A4-000A95DA5638@sdsc.edu>
<Pine.LNX.4.56.0312031057280.11073@mail.tlc2.uh.edu>
Message-ID: <A69923FA-2668-11D8-804D-000393A4725A@sdsc.edu>

Please install the
http://www.rocksclusters.org/errata/3.0.0/ganglia-python-3.0.1-2.i386.rpm
package, which includes the correct encoder.py file. (This package is
listed on the 3.0.0 errata page.)

-Federico

On Dec 3, 2003, at 8:59 AM, Joseph wrote:

>   Here is the error I receive when I remove the file encoder.pyc and run
>   the
>   command cluster-fork
>
>   Traceback (innermost last):
>     File "/opt/rocks/sbin/cluster-fork", line 88, in ?
>       import rocks.pssh
>     File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ?
>       import gmon.encoder
>   ImportError: No module named encoder
>
>   Thanks,
>   Joseph
>
>
> On Tue, 2 Dec 2003, Mason J. Katz wrote:
>
>> Python creates the .pyc files for you, and does not remove the
>> original
>> .py file. I would be extremely surprised it two "identical" .pyc
>> files
>> had the same md5 checksum. I'd expect this to be more like C .o file
>> which always contain random data to pad out to the end of a page and
>> 32/64 bit word sizes. Still this is just a guess, the real point is
>> you can always remove the .pyc files and the .py will regenerate it
>> when imported (although standard UNIX file/dir permission still
>> apply).
>>
>> What is the import error you get from cluster-fork?
>>
>>    -mjk
>>
>> On Dec 2, 2003, at 9:02 AM, Angel Li wrote:
>>
>>> Joseph wrote:
>>>
>>>> Indeed my md5sum is different for encoder.pyc. However, when I
>>>> pulled
>>>> the file and run "cluster-fork" python responds about an import
>>>> problem. So it seems that regeneration did not occur. Is there a
>>>> flag
>>>> I need to pass?
>>>>
>>>> I have also tried to figure out what package provides encoder and
>>>> reinstall the package, but an rpm query reveals nothing.
>>>>
>>>> If this is a generated file, what generates it?
>>>>
>>>> It seems that an rpm file query on ganglia show that files in the
>>>> directory belong to the package, but encoder.pyc does not.
>>>>
>>>> Thanks,
>>>> Joseph
>>>>
>>>>
>>>>
>>> I have finally found the python sources in the HPC rolls CD, filename
>>> ganglia-python-3.0.0-2.i386.rpm. I'm not familiar with python but it
>>> seems python "compiles" the .py files to ".pyc" and then deletes the
>>> source file the first time they are referenced? I also noticed that
>>> there are two versions of python installed. Maybe the pyc files from
>>> one version won't load into the other one?
>>>
>>> Angel
>>>
>>>
>>
>>
Federico

Rocks Cluster Group, San Diego Supercomputing Center, CA
From jghobrial at uh.edu Thu Dec 4 07:14:21 2003
From: jghobrial at uh.edu (Joseph)
Date: Thu, 4 Dec 2003 09:14:21 -0600 (CST)
Subject: [Rocks-Discuss]cluster-fork
In-Reply-To: <A69923FA-2668-11D8-804D-000393A4725A@sdsc.edu>
References: <3FCB879E.8050905@miami.edu>
<Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu>
 <1B15A45F-2457-11D8-A374-00039389B580@uci.edu>
<Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu>
 <3FCCC5BF.3030903@miami.edu> <A43157DE-2522-11D8-A7A4-000A95DA5638@sdsc.edu>
 <Pine.LNX.4.56.0312031057280.11073@mail.tlc2.uh.edu>
 <A69923FA-2668-11D8-804D-000393A4725A@sdsc.edu>
Message-ID: <Pine.LNX.4.56.0312040913110.13972@mail.tlc2.uh.edu>

Thank you very much this solved the problem.

Joseph


On Thu, 4 Dec 2003, Federico Sacerdoti wrote:

>   Please install the
>   http://www.rocksclusters.org/errata/3.0.0/ganglia-python-3.0.1
>   -2.i386.rpm package, which includes the correct encoder.py file. (This
>   package is listed on the 3.0.0 errata page)
>
>   -Federico
>
>   On Dec 3, 2003, at 8:59 AM, Joseph wrote:
>
>   > Here is the error I receive when I remove the file encoder.pyc and run
>   > the
>   > command cluster-fork
>   >
>   > Traceback (innermost last):
>   >   File "/opt/rocks/sbin/cluster-fork", line 88, in ?
>   >     import rocks.pssh
>   >   File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ?
>   >     import gmon.encoder
>   > ImportError: No module named encoder
>   >
>   > Thanks,
>   > Joseph
>   >
>   >
>   > On Tue, 2 Dec 2003, Mason J. Katz wrote:
>   >
>   >> Python creates the .pyc files for you, and does not remove the
>   >> original
>   >> .py file. I would be extremely surprised it two "identical" .pyc
>   >> files
>   >> had the same md5 checksum. I'd expect this to be more like C .o file
>   >> which always contain random data to pad out to the end of a page and
>   >> 32/64 bit word sizes. Still this is just a guess, the real point is
>   >> you can always remove the .pyc files and the .py will regenerate it
>   >> when imported (although standard UNIX file/dir permission still
>   >> apply).
>   >>
>   >> What is the import error you get from cluster-fork?
>   >>
>   >> -mjk
>   >>
>   >> On Dec 2, 2003, at 9:02 AM, Angel Li wrote:
>   >>
>   >>> Joseph wrote:
>   >>>
>   >>>> Indeed my md5sum is different for encoder.pyc. However, when I
>   >>>> pulled
>   >>>> the file and run "cluster-fork" python responds about an import
>   >>>> problem. So it seems that regeneration did not occur. Is there a
>   >>>> flag
>   >>>> I need to pass?
>   >>>>
>   >>>> I have also tried to figure out what package provides encoder and
>   >>>> reinstall the package, but an rpm query reveals nothing.
>   >>>>
>   >>>> If this is a generated file, what generates it?
>   >>>>
>   >>>> It seems that an rpm file query on ganglia show that files in the
>   >>>> directory belong to the package, but encoder.pyc does not.
>   >>>>
>   >>>> Thanks,
>   >>>> Joseph
>   >>>>
>   >>>>
>   >>>>
>   >>> I have finally found the python sources in the HPC rolls CD, filename
>   >>> ganglia-python-3.0.0-2.i386.rpm. I'm not familiar with python but it
>   >>> seems python "compiles" the .py files to ".pyc" and then deletes the
>   >>> source file the first time they are referenced? I also noticed that
>   >>> there are two versions of python installed. Maybe the pyc files from
>   >>> one version won't load into the other one?
>   >>>
>   >>> Angel
>   >>>
>   >>>
>   >>
>   >>
>   Federico
>
>   Rocks Cluster Group, San Diego Supercomputing Center, CA
>


From vrowley at ucsd.edu Thu Dec 4 12:29:55 2003
From: vrowley at ucsd.edu (V. Rowley)
Date: Thu, 04 Dec 2003 12:29:55 -0800
Subject: [Rocks-Discuss]Re: PXE boot problems
In-Reply-To: <3FCBC037.5000302@ucsd.edu>
References: <3FCBC037.5000302@ucsd.edu>
Message-ID: <3FCF9943.1020806@ucsd.edu>

Uh, nevermind. We had upgraded syslinux on our frontend, not the node
we were trying to PXE boot. Sigh.

V. Rowley wrote:
> We have installed a ROCKS 3.0.0 frontend on a DL380 and are trying to
> install a compute node via PXE. We are getting an error similar to the
> one mentioned in the archives, e.g.
>
>> Loading initrd.img....
>> Ready
>>
>> Failed to free base memory
>>
>
> We have upgraded to syslinux-2.07-1, per the suggestion in the archives,
> but continue to get the same error. Any ideas?
>

--
Vicky Rowley                              email: vrowley at ucsd.edu
Biomedical Informatics Research Network      work: (858) 536-5980
University of California, San Diego           fax: (858) 822-0828
9500 Gilman Drive
La Jolla, CA 92093-0715


See pictures from our trip to China at http://www.sagacitech.com/Chinaweb



From cdwan at mail.ahc.umn.edu Fri Dec 5 08:16:07 2003
From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB))
Date: Fri, 5 Dec 2003 10:16:07 -0600 (CST)
Subject: [Rocks-Discuss]Private NIS master
Message-ID: <Pine.GSO.4.58.0312042305070.18193@lenti.med.umn.edu>

Hello all. Long time listener, first time caller.    Thanks for all the
great work.

I'm integrating a Rocks cluster into an existing NIS domain. I noticed
that while the cluster database now supports a PrivateNISMaster, that
variable doesn't make it into the /etc/yp.conf on the compute nodes. They
remain broadcast.

Assume that, for whatever reason, I don't want to set up a repeater
(slave) ypserv process on my frontend.    I added the option "--nisserver
<var name="Kickstart_PrivateNISMaster"/>" to the
"profiles/3.0.0/nodes/nis-client.xml" file, removed the ypserver on my
frontend, and it works like I want it to.

Am I missing anything fundamental here?

-Chris Dwan
 University of Minnesota
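A sketch of the change described above (only the --nisserver flag and the
var name come from the post; the surrounding authconfig arguments and
markup are assumptions about the profile's layout):

```xml
<!-- profiles/3.0.0/nodes/nis-client.xml (sketch; surrounding markup
     assumed). Pinning the server stops the nodes from broadcasting. -->
authconfig --enablenis \
    --nisserver <var name="Kickstart_PrivateNISMaster"/>
```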


From wyzhong78 at msn.com Mon Dec 8 06:18:34 2003
From: wyzhong78 at msn.com (zhong wenyu)
Date: Mon, 08 Dec 2003 22:18:34 +0800
Subject: [Rocks-Discuss]3.0.0 problem: not able to boot up
Message-ID: <BAY3-F14uFqD45TpNO40002c14c@hotmail.com>

Hi, everyone!
I installed Rocks 3.0.0 with the defaults, and there wasn't any trouble
during the install. But I haven't been able to boot: it stops at the
beginning, the message "GRUB" shows on the screen, and it waits....
   My hardware is dual Xeon 2.4 GHz, an MSI 9138 board, and a Seagate SCSI disk.
   Any help is appreciated!




From angelini at vki.ac.be Mon Dec 8 06:20:45 2003
From: angelini at vki.ac.be (Angelini Giuseppe)
Date: Mon, 08 Dec 2003 15:20:45 +0100
Subject: [Rocks-Discuss]How to use MPICH with ssh
Message-ID: <3FD488BD.3EBBDB8D@vki.ac.be>

Dear rocks folk,


I have recently installed MPICH with Lahey Fortran, and now that I can
compile and link, I would like to run. But it seems I have another
problem; I get the following error message when I try to run:

[panara at compute-0-7 ~]$ mpirun -np $NPROC -machinefile $PBS_NODEFILE
$DPT/hybflow
p0_13226: p4_error: Path to program is invalid while starting
/dc_03_04/panara/PREPRO_TESTS/hybflow with /usr/bin/rsh on compute-0-7:
-1
    p4_error: latest msg from perror: No such file or directory
p0_13226: p4_error: Child process exited while making connection to
remote process on compute-0-6: 0
p0_13226: (6.025133) net_send: could not write to fd=4, errno = 32
p0_13226: (6.025231) net_send: could not write to fd=4, errno = 32

I am wondering why it is looking for /usr/bin/rsh for the communication,

I expected to use ssh and not rsh.

Any help will be welcome.


Regards.


Giuseppe Angelini



From casuj at cray.com Mon Dec 8 07:31:21 2003
From: casuj at cray.com (John Casu)
Date: Mon, 8 Dec 2003 07:31:21 -0800
Subject: [Rocks-Discuss]How to use MPICH with ssh
In-Reply-To: <3FD488BD.3EBBDB8D@vki.ac.be>; from Angelini Giuseppe on Mon, Dec 08,
2003 at 03:20:45PM +0100
References: <3FD488BD.3EBBDB8D@vki.ac.be>
Message-ID: <20031208073121.A10151@stemp3.wc.cray.com>
On Mon, Dec 08, 2003 at 03:20:45PM +0100, Angelini Giuseppe wrote:
>
> Dear rocks folk,
>
>
> I have recently installed mpich with Lahay Fortran and now that I can
> compile and link,
> I would like to run but it seems that I have another problem. In fact I
> have the following
> error message when I try to run:
>
> [panara at compute-0-7 ~]$ mpirun -np $NPROC -machinefile $PBS_NODEFILE
> $DPT/hybflow
> p0_13226: p4_error: Path to program is invalid while starting
> /dc_03_04/panara/PREPRO_TESTS/hybflow with /usr/bin/rsh on compute-0-7:
> -1
>     p4_error: latest msg from perror: No such file or directory
> p0_13226: p4_error: Child process exited while making connection to
> remote process on compute-0-6: 0
> p0_13226: (6.025133) net_send: could not write to fd=4, errno = 32
> p0_13226: (6.025231) net_send: could not write to fd=4, errno = 32
>
> I am wondering why it is looking for /usr/bin/rsh for the communication,
>
> I expected to use ssh and not rsh.
>
> Any help will be welcome.
>


build mpich thus:

RSHCOMMAND=ssh ./configure .....


>
> Regards.
>
>
> Giuseppe Angelini

--
"Roses are red, Violets are blue,
 You lookin' at me ?
 YOU LOOKIN' AT ME ?!"    -- Get Fuzzy.
=======================================================================
John Casu
Cray Inc.                                           casuj at cray.com
411 First Avenue South, Suite 600                   Tel: (206) 701-2173
Seattle, WA 98104-2860                              Fax: (206) 701-2500
=======================================================================
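If rebuilding is not convenient, the ch_p4 device in MPICH 1.2.x also
honors a run-time override; a sketch (the build-time line is what John
describes above, the prefix path is an assumption):

```shell
# Run-time: point an already-built ch_p4 MPICH at ssh instead of rsh.
export P4_RSHCOMMAND=ssh
# Build-time alternative (needs the MPICH source tree, so commented out):
#   RSHCOMMAND=ssh ./configure --prefix=/opt/mpich/gnu && make
```

With the variable exported, mpirun's started processes use ssh to reach
the other nodes.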


From davidow at molbio.mgh.harvard.edu Mon Dec 8 08:12:53 2003
From: davidow at molbio.mgh.harvard.edu (Lance Davidow)
Date: Mon, 8 Dec 2003 11:12:53 -0500
Subject: [Rocks-Discuss]How to use MPICH with ssh
In-Reply-To: <3FD488BD.3EBBDB8D@vki.ac.be>
References: <3FD488BD.3EBBDB8D@vki.ac.be>
Message-ID: <p06002001bbfa51fea005@[132.183.190.222]>

Giuseppe,

Here's an answer from a newbie who just faced the same problem.

You are using the wrong flavor of mpich (and mpirun). There are
several different distributions, which work differently in Rocks. The
one you are using in the default path expects serv_p4 daemons and
.rhosts files in your home directory. The different flavors may also be
more compatible with different compilers.

[lance at rescluster2 lance]$ which   mpirun
/opt/mpich-mpd/gnu/bin/mpirun

the one you probably want is
/opt/mpich/gnu/bin/mpirun

[lance at rescluster2 lance]$ locate mpirun
...
/opt/mpich-mpd/gnu/bin/mpirun
...
/opt/mpich/myrinet/gnu/bin/mpirun
...
/opt/mpich/gnu/bin/mpirun

Cheers,
Lance


At 3:20 PM +0100 12/8/03, Angelini Giuseppe wrote:
>Dear rocks folk,
>
>
>I have recently installed mpich with Lahay Fortran and now that I can
>compile and link,
>I would like to run but it seems that I have another problem. In fact I
>have the following
>error message when I try to run:
>
>[panara at compute-0-7 ~]$ mpirun -np $NPROC -machinefile $PBS_NODEFILE
>$DPT/hybflow
>p0_13226: p4_error: Path to program is invalid while starting
>/dc_03_04/panara/PREPRO_TESTS/hybflow with /usr/bin/rsh on compute-0-7:
>-1
>     p4_error: latest msg from perror: No such file or directory
>p0_13226: p4_error: Child process exited while making connection to
>remote process on compute-0-6: 0
>p0_13226: (6.025133) net_send: could not write to fd=4, errno = 32
>p0_13226: (6.025231) net_send: could not write to fd=4, errno = 32
>
>I am wondering why it is looking for /usr/bin/rsh for the communication,
>
>I expected to use ssh and not rsh.
>
>Any help will be welcome.
>
>
>Regards.
>
>Giuseppe Angelini


--
Lance Davidow, PhD
Director of Bioinformatics
Dept of Molecular Biology
Mass General Hospital
Boston MA 02114
davidow at molbio.mgh.harvard.edu
617.726-5955
Fax: 617.726-6893


From rscarce at caci.com Fri Dec 5 16:43:00 2003
From: rscarce at caci.com (Reed Scarce)
Date: Fri, 5 Dec 2003 19:43:00 -0500
Subject: [Rocks-Discuss]PXE and system images
Message-ID: <OFF783DCCA.8F016562-ON85256DF3.008001FC-85256DF7.00043E45@caci.com>

We want to initialize new hardware with a known good image from identical
hardware currently in use. The process imagined would be to PXE boot to a
disk image server, PXE would create a RAM system that would request the
system disk image from the server, which would push the desired system
disk image to the requesting system. Upon completion the system would be
available as a cluster member.

The lab configuration is a PC grade frontend with two 3Com 905s and a
single server grade cluster node with integrated Intel 82551 (10/100)(the
only PXE interface) and two integrated Intel 82546 (10/100/1000). The
cluster node is one of the stock of nodes for the expansion. The stock of
nodes have a Linux OS pre-installed, which would be eliminated in the
process.

Currently the node will PXE boot from the 10/100 and pickup an
installation boot from one of the g-bit interfaces. From there kickstart
wants to take over.

Any recommendations how to get kickstart to push an image to the disk?

Thanks,

Reed Scarce

From wyzhong78 at msn.com Mon Dec 8 05:36:37 2003
From: wyzhong78 at msn.com (zhong wenyu)
Date: Mon, 08 Dec 2003 21:36:37 +0800
Subject: [Rocks-Discuss]Rocks 3.0.0 problem:not able to boot up
Message-ID: <BAY3-F9yOi5AgJQlDrR0002a5da@hotmail.com>

Hi, everyone!
I have installed Rocks 3.0.0 with the default options successfully; there
wasn't any trouble. But when I boot it up, it stops at the beginning, just
showing "GRUB" on the screen and waiting...
Thanks for your help!




From daniel.kidger at quadrics.com Mon Dec 8 09:54:53 2003
From: daniel.kidger at quadrics.com (daniel.kidger at quadrics.com)
Date: Mon, 8 Dec 2003 17:54:53 -0000
Subject: [Rocks-Discuss]custom-kernels : naming conventions ? (Rocks 3.0.0)
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F622357C7@tardis0.quadrics.com>

Dear all,
    Previously I have been installing a custom kernel on the compute nodes
with an "extend-compute.xml" and an "/etc/init.d/qsconfigure" (to fix grub.conf).

However I am now trying to do it the 'proper' way. So I do (on the frontend):
# cp qsnet-RedHat-kernel-2.4.18-27.3.10qsnet.i686.rpm \
    /home/install/rocks-dist/7.3/en/os/i386/force/RPMS
# cd /home/install
# rocks-dist dist
# SSH_NO_PASSWD=1 shoot-node compute-0-0

Hence:
# find /home/install/ |xargs -l grep -nH qsnet
shows me that hdlist and hdlist2 now contain this RPM (and indeed if I
duplicate my rpm in that directory, rocks-dist notices this and warns me).

However the node always ends up with "2.4.20-20.7smp" again.
anaconda-ks.cfg contains just "kernel-smp" and install.log has "Installing
kernel-smp-2.4.20-20.7."

So my question is:
   It looks like my RPM has a name that Rocks doesn't understand properly.
   What is wrong with my name, and what are the rules for getting the
   correct name? (.i686.rpm is of course correct, but I don't have -smp.
   in the name. Is this the problem?)

cf. Greg Bruno's wisdom:
  https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-April/001770.html


Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------



From DGURGUL at PARTNERS.ORG Mon Dec 8 11:09:27 2003
From: DGURGUL at PARTNERS.ORG (Gurgul, Dennis J.)
Date: Mon, 8 Dec 2003 14:09:27 -0500
Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
Message-ID: <BC447F1AD529D311B4DE0008C71BF2EB0AE15840@phsexch7.mgh.harvard.edu>

I just did "cluster-fork -Uvh /sourcedir/ganglia-python-3.0.1-2.i386.rpm" and
then "cluster-fork service gschedule restart" (not sure I had to do the last).
I also put 3.0.1-2 and restarted gschedule on the frontend.

Now I run "cluster-fork --mpd w".

I currently have a user who ssh'd to compute-0-8 from the frontend and one who
ssh'd into compute-0-17 from the front end.

But the return shows the users on lines for 17 (for the user on 0-8) and 10 (for
the user on 0-17):

17:   1:58pm up 24 days, 3:20, 1 user, load average: 0.00, 0.00, 0.03
17: USER     TTY     FROM             LOGIN@   IDLE  JCPU  PCPU WHAT
17: lance    pts/0   rescluster2.mgh. 1:31pm 40.00s 0.02s 0.02s -bash

10:   1:58pm up 24 days, 3:21, 1 user, load average: 0.02, 0.04, 0.07
10: USER     TTY     FROM             LOGIN@   IDLE  JCPU  PCPU WHAT
10: dennis   pts/0   rescluster2.mgh. 1:57pm 17.00s 0.02s 0.02s -bash

When I do "cluster-fork w" (without the --mpd) the users show up on the correct
nodes.

Do the numbers on the left of the -mpd output correspond to the node names?

Thanks.

Dennis

Dennis J. Gurgul
Partners Health Care System
Research Management
Research Computing Core
617.724.3169



From DGURGUL at PARTNERS.ORG Mon Dec 8 11:28:30 2003
From: DGURGUL at PARTNERS.ORG (Gurgul, Dennis J.)
Date: Mon, 8 Dec 2003 14:28:30 -0500
Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
Message-ID: <BC447F1AD529D311B4DE0008C71BF2EB0AE15843@phsexch7.mgh.harvard.edu>

Maybe this is a better description of the "strangeness".

I did "cluster-fork --mpd hostname":

1:   compute-0-0.local
2:   compute-0-1.local
3:   compute-0-3.local
4:   compute-0-13.local
5:   compute-0-11.local
6:   compute-0-15.local
7:   compute-0-16.local
8:   compute-0-19.local
9:   compute-0-21.local
10: compute-0-17.local
11: compute-0-5.local
12: compute-0-20.local
13: compute-0-18.local
14: compute-0-12.local
15: compute-0-9.local
16: compute-0-4.local
17: compute-0-8.local
18: compute-0-14.local
19: compute-0-2.local
20: compute-0-6.local
0: compute-0-7.local
21: compute-0-10.local

Dennis J. Gurgul
Partners Health Care System
Research Management
Research Computing Core
617.724.3169


-----Original Message-----
From: npaci-rocks-discussion-admin at sdsc.edu
[mailto:npaci-rocks-discussion-admin at sdsc.edu]On Behalf Of Gurgul,
Dennis J.
Sent: Monday, December 08, 2003 2:09 PM
To: npaci-rocks-discussion at sdsc.edu
Subject: [Rocks-Discuss]cluster-fork --mpd strangeness


I just did "cluster-fork -Uvh /sourcedir/ganglia-python-3.0.1-2.i386.rpm"
and
then "cluster-fork service gschedule restart" (not sure I had to do the
last).
I also put 3.0.1-2 and restarted gschedule on the frontend.

Now I run "cluster-fork --mpd w".

I currently have a user who ssh'd to compute-0-8 from the frontend and one
who
ssh'd into compute-0-17 from the front end.

But the return shows the users on lines for 17 (for the user on 0-8) and 10
(for
the user on 0-17):

17:   1:58pm up 24 days, 3:20, 1 user, load average: 0.00, 0.00, 0.03
17: USER     TTY     FROM             LOGIN@   IDLE  JCPU  PCPU WHAT
17: lance    pts/0   rescluster2.mgh. 1:31pm 40.00s 0.02s 0.02s -bash

10:   1:58pm up 24 days, 3:21, 1 user, load average: 0.02, 0.04, 0.07
10: USER     TTY     FROM             LOGIN@   IDLE  JCPU  PCPU WHAT
10: dennis   pts/0   rescluster2.mgh. 1:57pm 17.00s 0.02s 0.02s -bash

When I do "cluster-fork w" (without the --mpd) the users show up on the
correct
nodes.

Do the numbers on the left of the -mpd output correspond to the node names?
Thanks.

Dennis

Dennis J. Gurgul
Partners Health Care System
Research Management
Research Computing Core
617.724.3169


From tim.carlson at pnl.gov Mon Dec 8 12:35:16 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Mon, 08 Dec 2003 12:35:16 -0800 (PST)
Subject: [Rocks-Discuss]PXE and system images
In-Reply-To:
 <OFF783DCCA.8F016562-ON85256DF3.008001FC-85256DF7.00043E45@caci.com>
Message-ID: <Pine.LNX.4.44.0312081226270.19031-100000@scorpion.emsl.pnl.gov>

On Fri, 5 Dec 2003, Reed Scarce wrote:

>   We want to initialize new hardware with a known good image from identical
>   hardware currently in use. The process imagined would be to PXE boot to a
>   disk image server, PXE would create a RAM system that would request the
>   system disk image from the server, which would push the desired system
>   disk image to the requesting system. Upon completion the system would be
>   available as a cluster member.
>
>   The lab configuration is a PC grade frontend with two 3Com 905s and a
>   single server grade cluster node with integrated Intel 82551 (10/100)(the
>   only PXE interface) and two integrated Intel 82546 (10/100/1000). The
>   cluster node is one of the stock of nodes for the expansion. The stock of
>   nodes have a Linux OS pre-installed, which would be eliminated in the
>   process.
>
>   Currently the node will PXE boot from the 10/100 and pickup an
>   installation boot from one of the g-bit interfaces. From there kickstart
>   wants to take over.
>
>   Any recommendations how to get kickstart to push an image to the disk?

This sounds like you want to use Oscar instead of ROCKS.

http://oscar.openclustergroup.org/tiki-index.php

I'm not exactly sure why you think that the kickstart process won't give
you exactly the same image on every machine. If the hardware is the same,
you'll get the same image on each machine.

We have boxes with the same setup, 10/100 PXE, and then dual gigabit. Our
method for installing ROCKS on this type of hardware is the following

1) Run insert-ethers and choose the "manager" type of node.
2) Connect all the PXE interfaces to the switch and boot them all. Do not
   connect the gigabit interfaces.
3) Once all of the nodes have PXE booted, exit insert-ethers. Start
   insert-ethers again and this time choose the compute node type.
4) Hook up the gigabit interface and the PXE interface to your nodes. All
   of your machines will now install.
5) In our case, we now quickly disconnect the PXE interface because we
   don't want the machines to continually reinstall. The real ROCKS
   method would have you choose (HD/net) for booting in the BIOS, but if
   you already have an OS on your machine, you would have to go into the
   BIOS twice before the compute nodes were installed. We disable
   rocks-grub and just connect up the PXE cable if we need to reinstall.

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support



From tim.carlson at pnl.gov Mon Dec 8 12:42:23 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Mon, 08 Dec 2003 12:42:23 -0800 (PST)
Subject: [Rocks-Discuss]custom-kernels : naming conventions ? (Rocks 3.0.0)
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F622357C7@tardis0.quadrics.com>
Message-ID: <Pine.LNX.4.44.0312081238270.19031-100000@scorpion.emsl.pnl.gov>

On Mon, 8 Dec 2003 daniel.kidger at quadrics.com wrote:

I've gotten confused from time to time as to where to place custom RPMS
(it's changed between releases), so my not-so-clean method is to just rip
out the kernels in /home/install/rocks-dist/7.3/en/os/i386/Redhat/RPMS
and drop my own in. Then do a

cd /home/install
rocks-dist dist
shoot-node

You are probably running into an issue where the "force" directory is more
of an "in addition to" directory and your 2.4.18 kernel is being noted,
but ignored since the 2.4.20 kernel is newer. I assume your nodes get both
an SMP and a UP version of 2.4.20 and that your custom 2.4.18 is nowhere to
be found on the compute node.
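The "newer kernel wins" behavior above can be sketched with a plain version
sort. This is only an illustration: "sort -V" stands in for RPM's actual
version-comparison logic, and the helper name is made up.

```shell
# Pick the newest of several kernel version strings. "sort -V" is a
# stand-in for the installer's real version comparison (an assumption,
# not anaconda's actual code).
newest_kernel() {
  printf '%s\n' "$@" | sort -V | tail -n 1
}

# The custom 2.4.18 kernel loses to the stock 2.4.20 kernel:
newest_kernel 2.4.18-27.3.10qsnet 2.4.20-20.7
```

This is why merely adding the 2.4.18 RPM to the distribution is not enough;
the stock 2.4.20 kernel still sorts newer and gets installed.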

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support

>       Previously I have been installing a custom kernel on the compute nodes
>   with an "extend-compute.xml" and an "/etc/init.d/qsconfigure" (to fix grub.conf).
>
>   However I am now trying to do it the 'proper' way. So I do (on :
>   # cp qsnet-RedHat-kernel-2.4.18-27.3.10qsnet.i686.rpm 
>     /home/install/rocks-dist/7.3/en/os/i386/force/RPMS
>   # cd /home/install
>   # rocks-dist dist
>   # SSH_NO_PASSWD=1 shoot-node compute-0-0
>
>   Hence:
>   # find /home/install/ |xargs -l grep -nH qsnet
> shows me that hdlist and hdlist2 now contain this RPM. (and indeed if I
> duplicate my rpm in that directory rocks-dist notices this and warns me.)
>
> However the node always ends up with "2.4.20-20.7smp" again.
> anaconda-ks.cfg contains just "kernel-smp" and install.log has "Installing
> kernel-smp-2.4.20-20.7."
>
> So my question is:
>    It looks like my RPM has a name that Rocks doesn't understand properly.
>    What is wrong with my name?
>    and what are the rules for getting the correct name?
>      (.i686.rpm is of course correct, but I don't have -smp. in the name.
> Is this the problem?)
>
> cf. Greg Bruno's wisdom:
>   https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-April/001770.html
>
>
> Yours,
> Daniel.



From fds at sdsc.edu Mon Dec 8 12:51:12 2003
From: fds at sdsc.edu (Federico Sacerdoti)
Date: Mon, 8 Dec 2003 12:51:12 -0800
Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
In-Reply-To: <BC447F1AD529D311B4DE0008C71BF2EB0AE15843@phsexch7.mgh.harvard.edu>
References: <BC447F1AD529D311B4DE0008C71BF2EB0AE15843@phsexch7.mgh.harvard.edu>
Message-ID: <423D0494-29C0-11D8-804D-000393A4725A@sdsc.edu>

You are right, and I think this is a shortcoming of MPD. There is no
obvious way to force the MPD numbering to correspond to the order the
nodes were called out on the command line (cluster-fork --mpd actually
makes a shell call to mpirun and it calls out all the node names
explicitly). MPD seems to number the output differently, as you found
out.

So mpd for now may be more useful for jobs that are not sensitive to
this. If enough of you find this shortcoming to be a real annoyance, we
could work on putting the node name label on the output by explicitly
calling "hostname" or similar.
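A minimal sketch of that labeling idea, for discussion: wrap the command and
prefix every output line with the node's name. The wrapper name and approach
here are illustrative, not part of cluster-fork or MPD.

```shell
# Run a command and prefix each line of its output with this node's
# name (via uname -n), so interleaved multi-node output stays
# attributable regardless of MPD's numbering.
label_output() {
  host=$(uname -n)
  "$@" 2>&1 | sed "s/^/${host}: /"
}

label_output echo hello
```

Each node would run its command through such a wrapper, making the MPD rank
numbers on the left irrelevant for identifying the source host.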

Good ideas are welcome :)
-Federico

On Dec 8, 2003, at 11:28 AM, Gurgul, Dennis J. wrote:

>   Maybe this is a better description of the "strangeness".
>
>   I did "cluster-fork --mpd hostname":
>
>   1:   compute-0-0.local
>   2:   compute-0-1.local
>   3:   compute-0-3.local
>   4:   compute-0-13.local
>   5:   compute-0-11.local
>   6:   compute-0-15.local
>   7:   compute-0-16.local
>   8: compute-0-19.local
>   9: compute-0-21.local
>   10: compute-0-17.local
>   11: compute-0-5.local
>   12: compute-0-20.local
>   13: compute-0-18.local
>   14: compute-0-12.local
>   15: compute-0-9.local
>   16: compute-0-4.local
>   17: compute-0-8.local
>   18: compute-0-14.local
>   19: compute-0-2.local
>   20: compute-0-6.local
>   0: compute-0-7.local
>   21: compute-0-10.local
>
>   Dennis J. Gurgul
>   Partners Health Care System
>   Research Management
>   Research Computing Core
>   617.724.3169
>
>
>   -----Original Message-----
>   From: npaci-rocks-discussion-admin at sdsc.edu
>   [mailto:npaci-rocks-discussion-admin at sdsc.edu]On Behalf Of Gurgul,
>   Dennis J.
>   Sent: Monday, December 08, 2003 2:09 PM
>   To: npaci-rocks-discussion at sdsc.edu
>   Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
>
>
>   I just did "cluster-fork -Uvh
>   /sourcedir/ganglia-python-3.0.1-2.i386.rpm"
>   and
>   then "cluster-fork service gschedule restart" (not sure I had to do the
>   last).
>   I also put 3.0.1-2 and restarted gschedule on the frontend.
>
>   Now I run "cluster-fork --mpd w".
>
>   I currently have a user who ssh'd to compute-0-8 from the frontend and
>   one
>   who
>   ssh'd into compute-0-17 from the front end.
>
>   But the return shows the users on lines for 17 (for the user on 0-8)
>   and 10
>   (for
>   the user on 0-17):
>
>   17:   1:58pm up 24 days, 3:20, 1 user, load average: 0.00, 0.00,
>   0.03
>   17: USER     TTY     FROM             LOGIN@   IDLE  JCPU  PCPU
>   WHAT
>   17: lance    pts/0   rescluster2.mgh. 1:31pm 40.00s 0.02s 0.02s
>   -bash
>
>   10:   1:58pm   up 24 days,   3:21,   1 user,   load average: 0.02, 0.04,
> 0.07
> 10: USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU
> WHAT
> 10: dennis   pts/0    rescluster2.mgh. 1:57pm 17.00s 0.02s 0.02s
> -bash
>
> When I do "cluster-fork w" (without the --mpd) the users show up on the
> correct
> nodes.
>
> Do the numbers on the left of the -mpd output correspond to the node
> names?
>
> Thanks.
>
> Dennis
>
> Dennis J. Gurgul
> Partners Health Care System
> Research Management
> Research Computing Core
> 617.724.3169
>
Federico

Rocks Cluster Group, San Diego Supercomputing Center, CA



From DGURGUL at PARTNERS.ORG Mon Dec 8 12:55:13 2003
From: DGURGUL at PARTNERS.ORG (Gurgul, Dennis J.)
Date: Mon, 8 Dec 2003 15:55:13 -0500
Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
Message-ID: <BC447F1AD529D311B4DE0008C71BF2EB0AE15847@phsexch7.mgh.harvard.edu>

Thanks.

On a related note, when I did "cluster-fork service gschedule restart" gschedule
started with the "OK" output, but then the fork process hung on each node and I
had to ^c out for it to go on to the next node.

I tried to ssh to a node and then did the gschedule restart. Even then, after I
tried to "exit" out of the node, the session hung and I had to log back in and
kill it from the frontend.


Dennis J. Gurgul
Partners Health Care System
Research Management
Research Computing Core
617.724.3169


-----Original Message-----
From: Federico Sacerdoti [mailto:fds at sdsc.edu]
Sent: Monday, December 08, 2003 3:51 PM
To: Gurgul, Dennis J.
Cc: npaci-rocks-discussion at sdsc.edu
Subject: Re: [Rocks-Discuss]cluster-fork --mpd strangeness
You are right, and I think this is a shortcoming of MPD. There is no
obvious way to force the MPD numbering to correspond to the order the
nodes were called out on the command line (cluster-fork --mpd actually
makes a shell call to mpirun and it calls out all the node names
explicitly). MPD seems to number the output differently, as you found
out.

So mpd for now may be more useful for jobs that are not sensitive to
this. If enough of you find this shortcoming to be a real annoyance, we
could work on putting the node name label on the output by explicitly
calling "hostname" or similar.

Good ideas are welcome :)
-Federico

On Dec 8, 2003, at 11:28 AM, Gurgul, Dennis J. wrote:

>   Maybe this is a better description of the "strangeness".
>
>   I did "cluster-fork --mpd hostname":
>
>   1: compute-0-0.local
>   2: compute-0-1.local
>   3: compute-0-3.local
>   4: compute-0-13.local
>   5: compute-0-11.local
>   6: compute-0-15.local
>   7: compute-0-16.local
>   8: compute-0-19.local
>   9: compute-0-21.local
>   10: compute-0-17.local
>   11: compute-0-5.local
>   12: compute-0-20.local
>   13: compute-0-18.local
>   14: compute-0-12.local
>   15: compute-0-9.local
>   16: compute-0-4.local
>   17: compute-0-8.local
>   18: compute-0-14.local
>   19: compute-0-2.local
>   20: compute-0-6.local
>   0: compute-0-7.local
>   21: compute-0-10.local
>
>   Dennis J. Gurgul
>   Partners Health Care System
>   Research Management
>   Research Computing Core
>   617.724.3169
>
>
>   -----Original Message-----
>   From: npaci-rocks-discussion-admin at sdsc.edu
>   [mailto:npaci-rocks-discussion-admin at sdsc.edu]On Behalf Of Gurgul,
>   Dennis J.
>   Sent: Monday, December 08, 2003 2:09 PM
>   To: npaci-rocks-discussion at sdsc.edu
> Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
>
>
> I just did "cluster-fork -Uvh
> /sourcedir/ganglia-python-3.0.1-2.i386.rpm"
> and
> then "cluster-fork service gschedule restart" (not sure I had to do the
> last).
> I also put 3.0.1-2 and restarted gschedule on the frontend.
>
> Now I run "cluster-fork --mpd w".
>
> I currently have a user who ssh'd to compute-0-8 from the frontend and
> one
> who
> ssh'd into compute-0-17 from the front end.
>
> But the return shows the users on lines for 17 (for the user on 0-8)
> and 10
> (for
> the user on 0-17):
>
> 17:    1:58pm up 24 days, 3:20, 1 user, load average: 0.00, 0.00,
> 0.03
> 17: USER      TTY     FROM              LOGIN@   IDLE   JCPU   PCPU
> WHAT
> 17: lance     pts/0   rescluster2.mgh. 1:31pm 40.00s 0.02s 0.02s
> -bash
>
> 10:    1:58pm up 24 days, 3:21, 1 user, load average: 0.02, 0.04,
> 0.07
> 10: USER      TTY     FROM              LOGIN@   IDLE   JCPU   PCPU
> WHAT
> 10: dennis    pts/0   rescluster2.mgh. 1:57pm 17.00s 0.02s 0.02s
> -bash
>
> When I do "cluster-fork w" (without the --mpd) the users show up on the
> correct
> nodes.
>
> Do the numbers on the left of the -mpd output correspond to the node
> names?
>
> Thanks.
>
> Dennis
>
> Dennis J. Gurgul
> Partners Health Care System
> Research Management
> Research Computing Core
> 617.724.3169
>
Federico

Rocks Cluster Group, San Diego Supercomputing Center, CA


From mjk at sdsc.edu Mon Dec 8 12:58:22 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Mon, 8 Dec 2003 12:58:22 -0800
Subject: [Rocks-Discuss]PXE and system images
In-Reply-To: <Pine.LNX.4.44.0312081226270.19031-100000@scorpion.emsl.pnl.gov>
References: <Pine.LNX.4.44.0312081226270.19031-100000@scorpion.emsl.pnl.gov>
Message-ID: <4261C250-29C1-11D8-AECB-000A95DA5638@sdsc.edu>

On Dec 8, 2003, at 12:35 PM, Tim Carlson wrote:

> 5) In our case, we now quickly disconnect the PXE interface because we
>    don't want to have the machine continually install. The real ROCKS
>    method would have you choose (HD/net) for booting in the BIOS, but
> if you already
>    have an OS on your machine, you would have to go into the BIOS twice
>    before the compute nodes were installed. We disable rocks-grub and
> just
>    connect up the PXE cable if we need to reinstall.
>

For most boxes we've seen that support PXE there is an option to hit
<F12> to force a network PXE boot, this allows you to force a PXE even
when a valid OS/Boot block exists on your hard disk. If you don't have
this you do indeed need to go into BIOS twice -- a pain.


       -mjk



From fds at sdsc.edu Mon Dec 8 13:26:46 2003
From: fds at sdsc.edu (Federico Sacerdoti)
Date: Mon, 8 Dec 2003 13:26:46 -0800
Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
In-Reply-To: <BC447F1AD529D311B4DE0008C71BF2EB0AE15847@phsexch7.mgh.harvard.edu>
References: <BC447F1AD529D311B4DE0008C71BF2EB0AE15847@phsexch7.mgh.harvard.edu>
Message-ID: <39CC5B05-29C5-11D8-804D-000393A4725A@sdsc.edu>

I've seen this before as well. I believe it has something to do with
the way the colored "[ OK ]" characters interact with the ssh
session from the normal cluster-fork. We have yet to characterize this
bug adequately.

-Federico

On Dec 8, 2003, at 12:55 PM, Gurgul, Dennis J. wrote:

>   Thanks.
>
>   On a related note, when I did "cluster-fork service gschedule restart"
>   gschedule
>   started with the "OK" output, but then the fork process hung on each
>   node and I
>   had to ^c out for it to go on to the next node.
>
>   I tried to ssh to a node and then did the gschedule restart. Even
>   then, after I
>   tried to "exit" out of the node, the session hung and I had to log
>   back in and
>   kill it from the frontend.
>
>
> Dennis J. Gurgul
> Partners Health Care System
> Research Management
> Research Computing Core
> 617.724.3169
>
>
> -----Original Message-----
> From: Federico Sacerdoti [mailto:fds at sdsc.edu]
> Sent: Monday, December 08, 2003 3:51 PM
> To: Gurgul, Dennis J.
> Cc: npaci-rocks-discussion at sdsc.edu
> Subject: Re: [Rocks-Discuss]cluster-fork --mpd strangeness
>
>
> You are right, and I think this is a shortcoming of MPD. There is no
> obvious way to force the MPD numbering to correspond to the order the
> nodes were called out on the command line (cluster-fork --mpd actually
> makes a shell call to mpirun and it calls out all the node names
> explicitly). MPD seems to number the output differently, as you found
> out.
>
> So mpd for now may be more useful for jobs that are not sensitive to
> this. If enough of you find this shortcoming to be a real annoyance, we
> could work on putting the node name label on the output by explicitly
> calling "hostname" or similar.
>
> Good ideas are welcome :)
> -Federico
>
> On Dec 8, 2003, at 11:28 AM, Gurgul, Dennis J. wrote:
>
>> Maybe this is a better description of the "strangeness".
>>
>> I did "cluster-fork --mpd hostname":
>>
>> 1: compute-0-0.local
>> 2: compute-0-1.local
>> 3: compute-0-3.local
>> 4: compute-0-13.local
>> 5: compute-0-11.local
>> 6: compute-0-15.local
>> 7: compute-0-16.local
>> 8: compute-0-19.local
>> 9: compute-0-21.local
>> 10: compute-0-17.local
>> 11: compute-0-5.local
>> 12: compute-0-20.local
>> 13: compute-0-18.local
>> 14: compute-0-12.local
>> 15: compute-0-9.local
>> 16: compute-0-4.local
>> 17: compute-0-8.local
>> 18: compute-0-14.local
>> 19: compute-0-2.local
>> 20: compute-0-6.local
>> 0: compute-0-7.local
>>   21: compute-0-10.local
>>
>>   Dennis J. Gurgul
>>   Partners Health Care System
>>   Research Management
>>   Research Computing Core
>>   617.724.3169
>>
>>
>>   -----Original Message-----
>>   From: npaci-rocks-discussion-admin at sdsc.edu
>>   [mailto:npaci-rocks-discussion-admin at sdsc.edu]On Behalf Of Gurgul,
>>   Dennis J.
>>   Sent: Monday, December 08, 2003 2:09 PM
>>   To: npaci-rocks-discussion at sdsc.edu
>>   Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
>>
>>
>>   I just did "cluster-fork -Uvh
>>   /sourcedir/ganglia-python-3.0.1-2.i386.rpm"
>>   and
>>   then "cluster-fork service gschedule restart" (not sure I had to do
>>   the
>>   last).
>>   I also put 3.0.1-2 and restarted gschedule on the frontend.
>>
>>   Now I run "cluster-fork --mpd w".
>>
>>   I currently have a user who ssh'd to compute-0-8 from the frontend and
>>   one
>>   who
>>   ssh'd into compute-0-17 from the front end.
>>
>>   But the return shows the users on lines for 17 (for the user on 0-8)
>>   and 10
>>   (for
>>   the user on 0-17):
>>
>>   17:   1:58pm up 24 days, 3:20, 1 user, load average: 0.00, 0.00,
>>   0.03
>>   17: USER     TTY     FROM             LOGIN@   IDLE  JCPU  PCPU
>>   WHAT
>>   17: lance    pts/0   rescluster2.mgh. 1:31pm 40.00s 0.02s 0.02s
>>   -bash
>>
>>   10:   1:58pm up 24 days, 3:21, 1 user, load average: 0.02, 0.04,
>>   0.07
>>   10: USER     TTY     FROM             LOGIN@   IDLE  JCPU  PCPU
>>   WHAT
>>   10: dennis   pts/0   rescluster2.mgh. 1:57pm 17.00s 0.02s 0.02s
>>   -bash
>>
>>   When I do "cluster-fork w" (without the --mpd) the users show up on
>>   the
>>   correct
>>   nodes.
>>
>>   Do the numbers on the left of the -mpd output correspond to the node
>>   names?
>>
>> Thanks.
>>
>> Dennis
>>
>> Dennis J. Gurgul
>> Partners Health Care System
>> Research Management
>> Research Computing Core
>> 617.724.3169
>>
> Federico
>
> Rocks Cluster Group, San Diego Supercomputing Center, CA
>
Federico

Rocks Cluster Group, San Diego Supercomputing Center, CA



From bruno at rocksclusters.org Mon Dec 8 15:31:08 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Mon, 8 Dec 2003 15:31:08 -0800
Subject: [Rocks-Discuss]Rocks 3.0.0 problem:not able to boot up
In-Reply-To: <BAY3-F9yOi5AgJQlDrR0002a5da@hotmail.com>
References: <BAY3-F9yOi5AgJQlDrR0002a5da@hotmail.com>
Message-ID: <9979F090-29D6-11D8-9715-000A95C4E3B4@rocksclusters.org>

> I have installed Rocks 3.0.0 with the default options successfully; there
> was no trouble at all. But when I boot it up, it stops at the very
> beginning, just showing "GRUB" on the screen and waiting...

when you built the frontend, did you start with the rocks base CD then
add the HPC roll?

    - gb



From bruno at rocksclusters.org Mon Dec 8 15:37:46 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Mon, 8 Dec 2003 15:37:46 -0800
Subject: [Rocks-Discuss]custom-kernels : naming conventions ? (Rocks 3.0.0)
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F622357C7@tardis0.quadrics.com>
References: <30062B7EA51A9045B9F605FAAC1B4F622357C7@tardis0.quadrics.com>
Message-ID: <8700A2BE-29D7-11D8-9715-000A95C4E3B4@rocksclusters.org>

>       Previously I have been installing a custom kernel on the compute
>   nodes
>   with an "extend-compute.xml" and an "/etc/init.d/qsconfigure" (to fix
>   grub.conf).
>
>   However I am now trying to do it the 'proper' way. So I do (on :
>   # cp qsnet-RedHat-kernel-2.4.18-27.3.10qsnet.i686.rpm 
>     /home/install/rocks-dist/7.3/en/os/i386/force/RPMS
>   # cd /home/install
>   # rocks-dist dist
>   # SSH_NO_PASSWD=1 shoot-node compute-0-0
>
>   Hence:
>   # find /home/install/ |xargs -l grep -nH qsnet
>   shows me that hdlist and hdlist2 now contain this RPM. (and indeed If
>   I duplicate my rpm in that directory rocks-dist notices this and warns
>   me.)
>
>   However the node always ends up with "2.4.20-20.7smp" again.
>   anaconda-ks.cfg contains just "kernel-smp" and install.log has
>   "Installing kernel-smp-2.4.20-20.7."
>
>   So my question is:
>      It looks like my RPM has a name that Rocks doesn't understand
>   properly.
>      What is wrong with my name ?
>      and what are the rules for getting the correct name ?
>        (.i686.rpm is of course correct, but I don't have -smp. in the
>   name Is this the problem ?)

the anaconda installer looks for kernel packages with a specific format:

       kernel-<kernel ver>-<redhat ver>.i686.rpm

and for smp nodes:

       kernel-smp-<kernel ver>-<redhat ver>.i686.rpm
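As a quick sanity check, a filename can be tested against that scheme with a
shell glob. This is a rough approximation of the installer's matching for
illustration only, not anaconda's actual code.

```shell
# Return success if an RPM filename looks like a kernel package the
# installer will recognize (per the patterns above); failure otherwise.
is_installer_kernel() {
  case "$1" in
    kernel-smp-*.i686.rpm | kernel-[0-9]*.i686.rpm) return 0 ;;
    *) return 1 ;;
  esac
}

is_installer_kernel kernel-smp-2.4.20-20.7.i686.rpm && echo recognized
is_installer_kernel qsnet-RedHat-kernel-2.4.18-27.3.10qsnet.i686.rpm || echo skipped
```

The qsnet RPM fails the check because its name starts with "qsnet-", not
"kernel-", which is consistent with the installer ignoring it.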

we have made the necessary patches to files under /usr/src/linux-2.4 in
order to produce redhat-compliant kernels. see:

http://www.rocksclusters.org/rocks-documentation/3.0.0/customization-kernel.html

also, would you be interested in making your changes for the quadrics
interconnect available to the general rocks community?

    - gb



From purikk at hotmail.com Mon Dec 8 20:23:35 2003
From: purikk at hotmail.com (purushotham komaravolu)
Date: Mon, 8 Dec 2003 23:23:35 -0500
Subject: [Rocks-Discuss]AMD Opteron
References: <200312082001.hB8K1KJ24139@postal.sdsc.edu>
Message-ID: <BAY1-DAV65Bp80SiEmA00005c14@hotmail.com>

Hello,
            I am a newbie to ROCKS clusters. I want to set up clusters on
32-bit architectures (Intel and AMD) and 64-bit architectures (Intel and
AMD).
I found the 64-bit download for Intel on the website but not for AMD. Does
it work for the AMD Opteron? If not, what is the ETA for AMD-64?
We are planning to buy AMD 64-bit machines shortly, and I would like to
volunteer for beta testing if needed.
Thanks
Regards,
Puru
From mjk at sdsc.edu Tue Dec 9 07:28:51 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 9 Dec 2003 07:28:51 -0800
Subject: [Rocks-Discuss]AMD Opteron
In-Reply-To: <BAY1-DAV65Bp80SiEmA00005c14@hotmail.com>
References: <200312082001.hB8K1KJ24139@postal.sdsc.edu> <BAY1-
DAV65Bp80SiEmA00005c14@hotmail.com>
Message-ID: <6413D41A-2A5C-11D8-AECB-000A95DA5638@sdsc.edu>

We have a beta right now that we have sent to a few people. We plan on
a release this month, and AMD_64 will be part of this release along
with the usual x86, IA64 support.

If you want to help accelerate this process please talk to your vendor
about loaning/giving us some hardware for testing. Having access to a
variety of Opteron hardware (we own two boxes) is the only way we can
have good support for this chip.

       -mjk


On Dec 8, 2003, at 8:23 PM, purushotham komaravolu wrote:

>   Hello,
>               I am a newbie to ROCKS cluster. I wanted to setup clusters
>   on
>   32-bit Architectures( Intel and AMD) and 64-bit Architecture( Intel
>   and
>   AMD).
>   I found the 64-bit download for Intel on the website but not for AMD.
>   Does
>   it work for AMD opteron? if not what is the ETA for AMD-64.
>   We are planning to but AMD-64 bit machines shortly, and I would like to
>   volunteer for the beta testing if needed.
>   Thanks
>   Regards,
>   Puru



From cdmaest at sandia.gov Tue Dec 9 07:48:31 2003
From: cdmaest at sandia.gov (Christopher D. Maestas)
Date: Tue, 09 Dec 2003 08:48:31 -0700
Subject: [Rocks-Discuss]AMD Opteron
In-Reply-To: <6413D41A-2A5C-11D8-AECB-000A95DA5638@sdsc.edu>
References: <200312082001.hB8K1KJ24139@postal.sdsc.edu>
 <BAY1-DAV65Bp80SiEmA00005c14@hotmail.com>
 <6413D41A-2A5C-11D8-AECB-000A95DA5638@sdsc.edu>
Message-ID: <1070984911.19042.12.camel@capdesk.sandia.gov>

What do I have to do to sign up to test? We have Opteron systems we can
test on here.

On Tue, 2003-12-09 at 08:28, Mason J. Katz wrote:
> We have a beta right now that we have sent to a few people. We plan on
> a release this month, and AMD_64 will be part of this release along
> with the usual x86, IA64 support.
>
>   If you want to help accelerate this process please talk to your vendor
>   about loaning/giving us some hardware for testing. Having access to a
>   variety of Opteron hardware (we own two boxes) is the only way we can
>   have good support for this chip.
>
>        -mjk
>
>
>   On Dec 8, 2003, at 8:23 PM, purushotham komaravolu wrote:
>
>   >   Hello,
>   >               I am a newbie to ROCKS cluster. I wanted to setup clusters
>   >   on
>   >   32-bit Architectures( Intel and AMD) and 64-bit Architecture( Intel
>   >   and
>   >   AMD).
>   >   I found the 64-bit download for Intel on the website but not for AMD.
>   >   Does
>   >   it work for AMD opteron? if not what is the ETA for AMD-64.
>   >   We are planning to but AMD-64 bit machines shortly, and I would like to
>   >   volunteer for the beta testing if needed.
>   >   Thanks
>   >   Regards,
>   >   Puru
>




From vincent_b_fox at yahoo.com Tue Dec 9 11:10:40 2003
From: vincent_b_fox at yahoo.com (Vincent Fox)
Date: Tue, 9 Dec 2003 11:10:40 -0800 (PST)
Subject: [Rocks-Discuss]ATLAS rpm build problems on PII platform
Message-ID: <20031209191040.71171.qmail@web14811.mail.yahoo.com>

I tried doing a rebuild of the ATLAS libraries on a
PII test cluster, and it was no go. I did an export
PATH=/opt/gcc32/bin:$PATH first to make it easy on
myself.

The "make rpm" appears to get stuck in a loop on the
xconfig part. I paused it, and it seems the prompt
is defining f77=-O and f77 FLAGS=y, which of course
doesn't work. My guess is the spec file doesn't have an
answer for a previous question, so the /usr/bin/g77
answer is getting applied to the previous prompt, and
since no f77 is defined, it gets stuck.

Anyhow thought I would note this problem on the list
for those more qualified to address it.


__________________________________
Do you Yahoo!?
New Yahoo! Photos - easier uploading and sharing.
http://photos.yahoo.com/


From bryan at UCLAlumni.net Tue Dec 9 12:14:16 2003
From: bryan at UCLAlumni.net (Bryan Littlefield)
Date: Tue, 09 Dec 2003 12:14:16 -0800
Subject: [Rocks-Discuss] AMD Opteron - Contact Appro
In-Reply-To: <200312091531.hB9FV9J12694@postal.sdsc.edu>
References: <200312091531.hB9FV9J12694@postal.sdsc.edu>
Message-ID: <3FD62D18.7010208@UCLAlumni.net>

Hi Mason,

I suggest contacting Appro. We are using Rocks on our Opteron cluster,
and Appro would likely love to help. I will contact them as well to see
if they can help get an Opteron machine for testing. Contact info
below:

Thanks --Bryan

Jian Chang - Regional Sales Manager
(408) 941-8100 x 202
(800) 927-5464 x 202
(408) 941-8111 Fax
jian at appro.com
http://www.appro.com

npaci-rocks-discussion-request at sdsc.edu wrote:

>From: "Mason J. Katz" <mjk at sdsc.edu>
>Subject: Re: [Rocks-Discuss]AMD Opteron
>Date: Tue, 9 Dec 2003 07:28:51 -0800
>To: "purushotham komaravolu" <purikk at hotmail.com>
>
>We have a beta right now that we have sent to a few people. We plan on
>a release this month, and AMD_64 will be part of this release along
>with the usual x86, IA64 support.
>
>If you want to help accelerate this process please talk to your vendor
>about loaning/giving us some hardware for testing. Having access to a
>variety of Opteron hardware (we own two boxes) is the only way we can
>have good support for this chip.
>
>     -mjk
>
>
>On Dec 8, 2003, at 8:23 PM, purushotham komaravolu wrote:
>
>
>
> Cc: <npaci-rocks-discussion at sdsc.edu>
>
>>Hello,
>>            I am a newbie to ROCKS cluster. I wanted to setup clusters
>>on
>>32-bit Architectures( Intel and AMD) and 64-bit Architecture( Intel
>>and
>>AMD).
>>I found the 64-bit download for Intel on the website but not for AMD.
>>Does
>>it work for AMD Opteron? If not, what is the ETA for AMD-64?
>>We are planning to buy AMD-64 bit machines shortly, and I would like to
>>volunteer for the beta testing if needed.
>>Thanks
>>Regards,
>>Puru
>>
>>
>
>_______________________________________________
>npaci-rocks-discussion mailing list
>npaci-rocks-discussion at sdsc.edu
>http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
>
>
>End of npaci-rocks-discussion Digest
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.sdsc.edu/pipermail/npaci-rocks-
discussion/attachments/20031209/611e65b4/attachment-0001.html

From vincent_b_fox at yahoo.com Tue Dec 9 13:22:59 2003
From: vincent_b_fox at yahoo.com (Vincent Fox)
Date: Tue, 9 Dec 2003 13:22:59 -0800 (PST)
Subject: [Rocks-Discuss]ATLAS rpm build problems on PII platform
Message-ID: <20031209212259.39587.qmail@web14810.mail.yahoo.com>

Okay, I came up with my own quick hack:

Edit atlas.spec.in, go to the "other x86" section, and
remove the 2 lines right above "linux"; that seems to
make the rpm now.

A more formal patch would add a section for
cpuid eq 4 with this correction, I suppose.




From landman at scalableinformatics.com Tue Dec 9 13:49:06 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 09 Dec 2003 16:49:06 -0500
Subject: [Rocks-Discuss]Has anyone tried Gaussian binary only on the ROCKS 3.1.0
beta?
Message-ID: <1071006546.18100.46.camel@squash.scalableinformatics.com>

Hi Folks

  Working on building the same cluster from last week.   The admin nodes
are up and functional (plain old RH9+XFS).

  I want to get the head nodes up, with one of the requirements being
running the Gaussian binary-only code. Gaussian's page lists RH9.0
support, so I wanted to see if someone has tried the beta with this
code.

  Thanks.
Joe

--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615



From landman at scalableinformatics.com Tue Dec 9 13:59:37 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 09 Dec 2003 16:59:37 -0500
Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets ...
Message-ID: <1071007177.18100.58.camel@squash.scalableinformatics.com>

Folks:

  As indicated previously, I am wrestling with a Supermicro based
cluster. None of the RH distributions come with the correct E1000
driver, so a new kernel is needed (in the boot CD, and for
installation).

  The problem I am running into is that it isn't at all obvious/easy how
to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable
this thing to work. Following the examples in the documentation has
not met with success. Running "rocks-dist cdrom" with the new kernels
(2.4.23 works nicely on the nodes) in the force/RPMS directory generates
a bootable CD with the original 2.4.18BOOT kernel.

  What I (and I think others) need, is a simple/easy to follow method
that will generate a bootable CD with the correct linux kernel, and the
correct modules.

  Is this in process somewhere? What would be tremendously helpful is
if we can generate a binary module, and put that into the boot process
by placing it into the force/modules/binary directory (assuming one
exists) with the appropriate entry of a similar name in the
force/modules/meta directory as a simple XML document giving pci-ids,
description, name, etc.
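
Purely to illustrate the proposal (this file layout and schema are hypothetical, not an existing Rocks mechanism), such a meta entry might look like:

```xml
<!-- hypothetical force/modules/meta/e1000.xml; schema invented for illustration -->
<module name="e1000" binary="force/modules/binary/e1000.o">
  <description>Intel PRO/1000 gigabit ethernet driver</description>
  <pci-id vendor="0x8086" device="0x1013"/>
</module>
```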

  Is anything close to this coming? Modules are killing future ROCKS
installs: the inability to easily inject a new module has created a
problem whereby ROCKS does not function (because the underlying RH
does not function).



--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615

From tim.carlson at pnl.gov Tue Dec 9 14:11:43 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Tue, 09 Dec 2003 14:11:43 -0800 (PST)
Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets ...
In-Reply-To: <1071007177.18100.58.camel@squash.scalableinformatics.com>
Message-ID: <Pine.GSO.4.44.0312091406080.17458-100000@paradox.emsl.pnl.gov>

On Tue, 9 Dec 2003, Joe Landman wrote:

>     The problem I am running into is that it isn't at all obvious/easy how
>   to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable
>   this thing to work. Following the examples in the documentation have
>   not met with success. Running "rocks-dist cdrom" with the new kernels
>   (2.4.23 works nicely on the nodes) in the force/RPMS directory generates
>   a bootable CD with the original 2.4.18BOOT kernel.

So you built a 2.4.23BOOT rpm? The problem people have is with the naming
convention of kernels. A kernel.org spec file isn't going to generate
proper kernel rpms IMHO. What you really want to do (and maybe you are
already doing this) is steal the bit of the Redhat spec building scripts
that generate the -smp .i686 and BOOT rpms.

New hardware is tough for any distro.

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support



From tmartin at physics.ucsd.edu Tue Dec 9 15:57:17 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Tue, 09 Dec 2003 15:57:17 -0800
Subject: [Rocks-Discuss]Intel MT based Gigabit controllers
Message-ID: <3FD6615D.8090200@physics.ucsd.edu>

Does Rocks 3.0 support the Intel MT based Gigabit controllers (PCI
8086:1013) without any modifications? My new cluster has these new
controllers.

Rocks 2.3.1 does not seem to detect/drive these cards correctly (the
installer fails to detect them, and the e1000 driver does not seem to
work). So I was going to go ahead and move my new head node to 3.0.0,
and was wondering if I am going to have to do additional work to get a
working Intel driver into the boot image (for cluster nodes) for these
cards.

Terrence



From tmartin at physics.ucsd.edu Tue Dec 9 15:59:29 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Tue, 09 Dec 2003 15:59:29 -0800
Subject: [Rocks-Discuss]how to include custom driver
In-Reply-To: <Pine.GSO.4.44.0306092142150.18083-100000@poincare.emsl.pnl.gov>
References: <Pine.GSO.4.44.0306092142150.18083-100000@poincare.emsl.pnl.gov>
Message-ID: <3FD661E1.90307@physics.ucsd.edu>

Tim Carlson wrote:
> On Mon, 9 Jun 2003, Greg Bruno wrote:
>
>
>>what driver did you have to add?
>>
>>we may be able to provide a patch for your compute nodes.
>
>
> Ah!!! I didn't see this response before I sent off my reply to Matthew.
> Can I please have the aic79xx driver, and while you're at it, can I get a
> module-info file that has this entry for gigabit? Not sure if it is
> already in there? ;)
>
> 0x8086 0x100f "e1000" "Intel Corp. 82545EM Gigabit Ethernet Controller rev
(01)"
>
> It is also quite possible that I burned the 2.3.0 media instead of
> 2.3.2. It was late in the day when I tried to do my install.
>
> Tim
>
> Tim Carlson
> Voice: (509) 376 3423
> Email: Tim.Carlson at pnl.gov
> EMSL UNIX System Support

I would also like to request that this driver/change be made. I have a
cluster with these newer Intel gigabit chipsets.

Terrence



From tmartin at physics.ucsd.edu Tue Dec 9 16:33:18 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Tue, 09 Dec 2003 16:33:18 -0800
Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets
 ...
In-Reply-To: <Pine.GSO.4.44.0312091406080.17458-100000@paradox.emsl.pnl.gov>
References: <Pine.GSO.4.44.0312091406080.17458-100000@paradox.emsl.pnl.gov>
Message-ID: <3FD669CE.1070700@physics.ucsd.edu>

Tim Carlson wrote:
> On Tue, 9 Dec 2003, Joe Landman wrote:
>
>
>> The problem I am running into is that it isn't at all obvious/easy how
>>to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable
>>this thing to work. Following the examples in the documentation have
>>not met with success. Running "rocks-dist cdrom" with the new kernels
>>(2.4.23 works nicely on the nodes) in the force/RPMS directory generates
>>a bootable CD with the original 2.4.18BOOT kernel.
>
>
> So you built a 2.4.23BOOT rpm? The problem people have is with the naming
> convention of kernels. A kernel.org spec file isn't going to generate
> proper kernel rpms IMHO. What you really want to do (and maybe you are
> already doing this) is steal the bit of the Redhat spec building scripts
> that generate the -smp .i686 and BOOT rpms.
>
> New hardware is tough for any distro.
>
> Tim
>
> Tim Carlson
> Voice: (509) 376 3423
> Email: Tim.Carlson at pnl.gov
> EMSL UNIX System Support
>

Where do you start if you want to update the PXE boot image to support a
new kernel?

Terrence



From tmartin at physics.ucsd.edu Tue Dec 9 16:58:08 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Tue, 09 Dec 2003 16:58:08 -0800
Subject: [Rocks-Discuss]Could not allocate requested partitions
Message-ID: <3FD66FA0.5070401@physics.ucsd.edu>

I am getting the following error when trying to install a Rocks 3.0.0
headnode. The headnode works fine in Rocks 2.3.2.

Could not allocate requested partitions: Partitioning failed: Could not
allocate partitions as primary partitions

What is also odd is that when I alt-F2 and run fdisk /dev/hda, it tells me it
cannot find that device (unable to open /dev/hda). However, when I watch
the boot messages, hda definitely comes up.

Any ideas?

Terrence




From tmartin at physics.ucsd.edu Tue Dec 9 17:33:24 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Tue, 09 Dec 2003 17:33:24 -0800
Subject: [Rocks-Discuss]Could not allocate requested partitions
In-Reply-To: <3FD66FA0.5070401@physics.ucsd.edu>
References: <3FD66FA0.5070401@physics.ucsd.edu>
Message-ID: <3FD677E4.8050806@physics.ucsd.edu>

Terrence Martin wrote:
> I am getting the following error when trying to install a Rocks 3.0.0
> headnode. The headnode works fine in Rocks 2.3.2.
>
>   Could not allocate requested partitions: Partitioning failed: Could not
>   allocate partitions as primary partitions
>
>   What is also odd is when I alt-f2 and run fdisk /dev/hda it tells me it
>   cannot find that device (unable to open /dev/hda). However when I watch
>   the boot messages hda definitely comes up. Also the headnode works fine
>   with 2.3.2.
>
>   Any ideas?
>
>   Terrence
>
>
>

Figured it out; apparently Rocks 3.0.0 did not like my partitions from
Rocks 2.3.2. I booted Knoppix, blew away the partition table, and so far
so good on the head node.

Terrence




From mjk at sdsc.edu Tue Dec 9 17:54:01 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 9 Dec 2003 17:54:01 -0800
Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets ...
In-Reply-To: <1071007177.18100.58.camel@squash.scalableinformatics.com>
References: <1071007177.18100.58.camel@squash.scalableinformatics.com>
Message-ID: <BA0ADEC6-2AB3-11D8-981C-000A95DA5638@sdsc.edu>

If the underlying RedHat doesn't support your hardware, you are pretty
much dead in the water. We do at times include drivers that RH does
not, but this is an exception, and only for hardware we physically have
access to. The rocks-boot (rocks/src/rock/boot in CVS) package
controls the boot kernel and module selection. You can look into this
to see what it would take to add your own module. We do plan on
refining and documenting this, but not for several months. We also have
some very good ideas on how we can track this faster than RH, but again
nothing coming in the next few months.

To continue my earlier rant for today: until more hardware vendors
start taking the Linux marketplace seriously, buying bleeding-edge
hardware and CPUs is asking for problems. It takes several months for
any new hardware to become supported by RedHat and several years for
any new CPU to be supported well. This isn't killing future Rocks
installs, it's just correctly delaying them until the underlying OS
supports the hardware.

       -mjk

On Dec 9, 2003, at 1:59 PM, Joe Landman wrote:

> Folks:
>
>   As indicated previously, I am wrestling with a Supermicro based
> cluster. None of the RH distributions come with the correct E1000
> driver, so a new kernel is needed (in the boot CD, and for
>   installation).
>
> The problem I am running into is that it isn't at all obvious/easy how
> to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable
> this thing to work. Following the examples in the documentation have
> not met with success. Running "rocks-dist cdrom" with the new kernels
> (2.4.23 works nicely on the nodes) in the force/RPMS directory generates
> a bootable CD with the original 2.4.18BOOT kernel.
>
>     What I (and I think others) need, is a simple/easy to follow method
>   that will generate a bootable CD with the correct linux kernel, and the
>   correct modules.
>
>     Is this in process somewhere? What would be tremendously helpful is
>   if we can generate a binary module, and put that into the boot process
>   by placing it into the force/modules/binary directory (assuming one
>   exists) with the appropriate entry of a similar name in the
>   force/modules/meta directory as a simple XML document giving pci-ids,
>   description, name, etc.
>
>     Anything close to this coming? Modules are killing future ROCKS
>   installs, the inability to easily inject a new module in there has
>   created a problem whereby ROCKS does not function (as the underlying RH
>   does not function).
>
>
>
>   --
>   Joseph Landman, Ph.D
>   Scalable Informatics LLC,
>   email: landman at scalableinformatics.com
>   web : http://scalableinformatics.com
>   phone: +1 734 612 4615



From gotero at linuxprophet.com Tue Dec 9 18:02:23 2003
From: gotero at linuxprophet.com (gotero at linuxprophet.com)
Date: Tue, 09 Dec 2003 18:02:23 -0800 (PST)
Subject: [Rocks-Discuss]custom-kernels : naming conventions ? (Rocks 3.0.0)
Message-ID:
<20031209180224.24711.h014.c001.wm@mail.linuxprophet.com.criticalpath.net>

Daniel-

I recently had the same problem when building a Quadrics cluster on Rocks 2.3.2
with the qsnet-RedHat-kernel-2.4.18-27.3.4qsnet.i686 rpms. The problem is
definitely in the naming of the rpms: anaconda, running on the compute
nodes, is not going to recognize kernel rpms that begin with 'qsnet' as potential
boot options. Unfortunately, being under a severe time constraint, I resorted to
manually installing the qsnet kernel on all nodes of the cluster, which isn't
the Rocks way. The long-term solution is to mangle the kernel makefiles so that
the qsnet kernel rpms have conventional kernel rpm names, which is what Greg's
post referred to.
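
As a sketch of the convention (the filenames below are illustrative, and the patterns are an assumption about what the installer's kernel selection effectively matches, not its real code), only packages whose names start with the stock kernel names are treated as boot candidates:

```shell
# Illustrative only: which rpm filenames would be recognized as kernels.
for f in kernel-2.4.18-27.i686.rpm \
         kernel-smp-2.4.18-27.i686.rpm \
         kernel-BOOT-2.4.18-27.i386.rpm \
         qsnet-RedHat-kernel-2.4.18-27.3.4qsnet.i686.rpm
do
    case "$f" in
        kernel-[0-9]*.rpm|kernel-smp-*.rpm|kernel-BOOT-*.rpm)
            echo "kernel candidate: $f" ;;
        *)
            echo "ignored:          $f" ;;
    esac
done
```

The qsnet-prefixed rpm falls through to "ignored", which is why the node keeps installing the stock kernel-smp.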

Glen
On Mon, 8 Dec 2003 17:54:53 -0000, daniel.kidger at quadrics.com wrote:

>
> Dear all,
>      Previously I have been installing a custom kernel on the compute nodes
> with an "extend-compute.xml" and an "/etc/init.d/qsconfigure" (to fix
grub.conf).
>
> However I am now trying to do it the 'proper' way. So I do:
> # cp qsnet-RedHat-kernel-2.4.18-27.3.10qsnet.i686.rpm 
>    /home/install/rocks-dist/7.3/en/os/i386/force/RPMS
> # cd /home/install
> # rocks-dist dist
> # SSH_NO_PASSWD=1 shoot-node compute-0-0
>
> Hence:
> # find /home/install/ |xargs -l grep -nH qsnet
> shows me that hdlist and hdlist2 now contain this RPM. (And indeed, if I
> duplicate my rpm in that directory, rocks-dist notices this and warns me.)
>
> However the node always ends up with "2.4.20-20.7smp" again.
> anaconda-ks.cfg contains just "kernel-smp" and install.log has "Installing
> kernel-smp-2.4.20-20.7."
>
> So my question is:
>     It looks like my RPM has a name that Rocks doesn't understand properly.
>     What is wrong with my name ?
>     and what are the rules for getting the correct name ?
>       (.i686.rpm is of course correct, but I don't have -smp. in the name. Is
> this the problem?)
>
> cf. Greg Bruno's wisdom:
>
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-April/001770.html
>
>
> Yours,
> Daniel.
>
> --------------------------------------------------------------
> Dr. Dan Kidger, Quadrics Ltd.       daniel.kidger at quadrics.com
> One Bridewell St., Bristol, BS1 2AA, UK          0117 915 5505
> ----------------------- www.quadrics.com --------------------
>
> >

Glen Otero, Ph.D.
Linux Prophet


From gotero at linuxprophet.com Tue Dec 9 18:05:04 2003
From: gotero at linuxprophet.com (gotero at linuxprophet.com)
Date: Tue, 09 Dec 2003 18:05:04 -0800 (PST)
Subject: [Rocks-Discuss]Could not allocate requested partitions
Message-ID:
<20031209180504.716.h014.c001.wm@mail.linuxprophet.com.criticalpath.net>
On Tue, 09 Dec 2003 17:33:24 -0800, Terrence Martin wrote:

>
>   Terrence Martin wrote:
>   > I am getting the following error when trying to install a Rocks 3.0.0
>   > headnode. The headnode works fine in Rocks 2.3.2.
>   >
>   > Could not allocate requested partitions: Partitioning failed: Could not
>   > allocate partitions as primary partitions
>   >
>   > What is also odd is when I alt-f2 and run fdisk /dev/hda it tells me it
>   > cannot find that device (unable to open /dev/hda). However when I watch
>   > the boot messages hda definitely comes up. Also the headnode works fine
>   > with 2.3.2.
>   >
>   > Any ideas?
>   >
>   > Terrence
>   >
>   >
>   >
>
>   Figured it out, aparently rocks 3.0.0 did not like my partitions from
>   rocks 2.3.2. I booted knoppix, blew away the partition table and so far
>   so good on the head node.

I had the same problem with moving from 2.3.2 to 3.1. I'll try your solution.

Glen

>
> Terrence

Glen Otero, Ph.D.
Linux Prophet


From jorge at phys.ufl.edu Tue Dec 9 18:55:02 2003
From: jorge at phys.ufl.edu (Jorge L. Rodriguez)
Date: Tue, 09 Dec 2003 21:55:02 -0500
Subject: [Rocks-Discuss]Adding partitions that are not reformatted under hard boots
or shoot-node
Message-ID: <3FD68B06.9010709@phys.ufl.edu>

Hi,

How do I add an extra partition to my compute nodes and retain the data
on all non-/ partitions when the system hard boots or is shot?
I tried the suggestion in the documentation under "Customizing your
ROCKS Installation" where you replace auto-partition.xml, but hard
boots or shoot-node on those nodes reformat all partitions instead of
just /. I have also tried modifying installclass.xml so that an
extra partition is added in the python code (see below). This does
mostly what I want, but now I can't shoot-node, even though a hard boot
reinstalls without reformatting anything but /. Is this the right approach?
I'd rather avoid having to replace installclass, since I don't really
want to partition all nodes this way, but if I must I will.

Jorge
        #
        # set up the root partition
        #
        args = [ "/", "--size", "4096",
                 "--fstype", "&fstype;",
                 "--ondisk", devnames[0] ]
        KickstartBase.definePartition(self, id, args)

        # ---- Jorge: I added these args
        args = [ "/state/partition1", "--size", "55000",
                 "--fstype", "&fstype;",
                 "--ondisk", devnames[0] ]
        KickstartBase.definePartition(self, id, args)
        # -----

        args = [ "swap", "--size", "1000",
                 "--ondisk", devnames[0] ]
        KickstartBase.definePartition(self, id, args)

        #
        # greedy partitioning
        #
        # ----- Jorge: I changed this from i = 1
        i = 2
        # -----
        for devname in devnames:
                partname = "/state/partition%d" % (i)
                args = [ partname, "--size", "1",
                         "--fstype", "&fstype;",
                         "--grow", "--ondisk", devname ]
                KickstartBase.definePartition(self, id, args)

                i = i + 1




From bruno at rocksclusters.org Tue Dec 9 22:43:04 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Tue, 9 Dec 2003 22:43:04 -0800
Subject: [Rocks-Discuss]ATLAS rpm build problems on PII platform
In-Reply-To: <20031209212259.39587.qmail@web14810.mail.yahoo.com>
References: <20031209212259.39587.qmail@web14810.mail.yahoo.com>
Message-ID: <1B097BEE-2ADC-11D8-9715-000A95C4E3B4@rocksclusters.org>

>   Okay, came up my own quick hack:
>
>   Edit atlas.spec.in, go to "other x86" section, remove
>   2 lines right above "linux", seems to make rpm now.
>
>   A more formal patch would be put in a section for
>   cpuid eq 4 with this correction I suppose.

if you provide the patch, we'll include it in our next release.

    - gb

From tlw at cs.unm.edu Tue Dec 9 23:23:43 2003
From: tlw at cs.unm.edu (Tiffani Williams)
Date: Wed, 10 Dec 2003 00:23:43 -0700
Subject: [Rocks-Discuss]PBS errors
Message-ID: <3FD6C9FF.60603@cs.unm.edu>

Hello,

I am trying to submit a job through PBS, but I receive 2 errors. The
first error is

       Job cannot be executed
       See job standard error file

The second error is that the standard error file cannot be written into
my home directory.

I downloaded the sample script at

http://rocks.npaci.edu/papers/rocks-documentation/launching-batch-jobs.html

and have also tried a simpler script with just PBS directives and echo
commands.

I do not know what I am doing wrong; I have used PBS successfully on
other clusters.

Does anyone have any suggestions?
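
For comparison, a minimal script of the kind described (the directive values here are generic assumptions, not taken from the linked sample) would be:

```shell
#!/bin/bash
#PBS -N hello        # job name
#PBS -o hello.out    # stdout, written back into the home directory
#PBS -e hello.err    # stderr -- the file PBS reports it cannot write

# the job body itself is trivial
echo "running on $(hostname)"
```

If even a script this small fails the same way, the script is unlikely to be the problem; it points at the compute nodes not being able to write into the home directory.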

Tiffani




From bruno at rocksclusters.org Tue Dec 9 23:35:59 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Tue, 9 Dec 2003 23:35:59 -0800
Subject: [Rocks-Discuss]PBS errors
In-Reply-To: <3FD6C9FF.60603@cs.unm.edu>
References: <3FD6C9FF.60603@cs.unm.edu>
Message-ID: <7F75D3D2-2AE3-11D8-9715-000A95C4E3B4@rocksclusters.org>

>   I am trying to submit a job through PBS, but I receive 2 errors.   The
>   first error is
>         Job cannot be executed
>         See job standard error file
>
>   The second error is that the standard error file cannot be written
>   into my home directory.
>   I downloaded the sample script at
>
>   http://rocks.npaci.edu/papers/rocks-documentation/launching-batch-
>   jobs.html
>   and have tried a more simple script with PBS directives and echo
>   commands.
>
> I do not know what I am doing wrong?   I have used PBS successfully on
> other clusters.
>
> Does anyone have any suggestions?

can you login to the compute nodes successfully?

if not, try restarting autofs on all the compute nodes. on the
frontend, execute:

     # ssh-agent $SHELL
     # ssh-add

     # cluster-fork "/etc/rc.d/init.d/autofs restart"

we've found the startup of autofs to be flaky at times.

 - gb



From tlw at cs.unm.edu Wed Dec 10 00:03:13 2003
From: tlw at cs.unm.edu (Tiffani Williams)
Date: Wed, 10 Dec 2003 01:03:13 -0700
Subject: [Rocks-Discuss]PBS errors
In-Reply-To: <7F75D3D2-2AE3-11D8-9715-000A95C4E3B4@rocksclusters.org>
References: <3FD6C9FF.60603@cs.unm.edu>
<7F75D3D2-2AE3-11D8-9715-000A95C4E3B4@rocksclusters.org>
Message-ID: <3FD6D341.5070501@cs.unm.edu>

>> I am trying to submit a job through PBS, but I receive 2 errors.
>> The first error is
>>       Job cannot be executed
>>       See job standard error file
>>
>> The second error is that the standard error file cannot be written
>> into my home directory.
>> I downloaded the sample script at
>>
>> http://rocks.npaci.edu/papers/rocks-documentation/launching-batch-
>> jobs.html
>> and have tried a more simple script with PBS directives and echo
>> commands.
>>
>> I do not know what I am doing wrong? I have used PBS successfully
>> on other clusters.
>>
>> Does anyone have any suggestions?
>
>
> can you login to the compute nodes successfully?
>
> if not, try restarting autofs on all the compute nodes. on the
> frontend, execute:
>
>     # ssh-agent $SHELL
>     # ssh-add
>
>     # cluster-fork "/etc/rc.d/init.d/autofs restart"
>
> we've found the startup of autofs to be flaky at times.
>
> - gb


Do these commands have to be run by an administrator? If so, I do not
have such privileges. I can ssh to the compute nodes, but I am denied
entry. Am I supposed to be able to log in to a compute node as a user?

Tiffani



From bruno at rocksclusters.org Wed Dec 10 06:37:05 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Wed, 10 Dec 2003 06:37:05 -0800
Subject: [Rocks-Discuss]PBS errors
In-Reply-To: <3FD6D341.5070501@cs.unm.edu>
References: <3FD6C9FF.60603@cs.unm.edu>
<7F75D3D2-2AE3-11D8-9715-000A95C4E3B4@rocksclusters.org>
<3FD6D341.5070501@cs.unm.edu>
Message-ID: <53451392-2B1E-11D8-9715-000A95C4E3B4@rocksclusters.org>

On Dec 10, 2003, at 12:03 AM, Tiffani Williams wrote:

>
>>> I am trying to submit a job through PBS, but I receive 2 errors.
>>> The first error is
>>>       Job cannot be executed
>>>       See job standard error file
>>>
>>> The second error is that the standard error file cannot be written
>>> into my home directory.
>>> I downloaded the sample script at
>>>
>>> http://rocks.npaci.edu/papers/rocks-documentation/launching-batch-
>>> jobs.html
>>> and have tried a more simple script with PBS directives and echo
>>> commands.
>>>
>>> I do not know what I am doing wrong? I have used PBS successfully
>>> on other clusters.
>>>
>>> Does anyone have any suggestions?
>>
>>
>> can you login to the compute nodes successfully?
>>
>> if not, try restarting autofs on all the compute nodes. on the
>> frontend, execute:
>>
>>     # ssh-agent $SHELL
>>     # ssh-add
>>
>>     # cluster-fork "/etc/rc.d/init.d/autofs restart"
>>
>> we've found the startup of autofs to be flaky at times.
>>
>> - gb
>
>
> Do these commands have to be run by an administrator? If so, I do not
> have such privileges. I can ssh to the compute nodes, but I am denied
> entry. Am I supposed to be able to login to a compute node as a user.

yes, you need to be 'root'.

it appears your home directory is not being mounted when you login --
have your administrator run the commands above.

  - gb



From mjk at sdsc.edu Wed Dec 10 07:20:47 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Wed, 10 Dec 2003 07:20:47 -0800
Subject: [Rocks-Discuss]PBS errors
In-Reply-To: <53451392-2B1E-11D8-9715-000A95C4E3B4@rocksclusters.org>
References: <3FD6C9FF.60603@cs.unm.edu>
<7F75D3D2-2AE3-11D8-9715-000A95C4E3B4@rocksclusters.org>
<3FD6D341.5070501@cs.unm.edu>
<53451392-2B1E-11D8-9715-000A95C4E3B4@rocksclusters.org>
Message-ID: <6E659550-2B24-11D8-981C-000A95DA5638@sdsc.edu>

This is most likely the dreaded NIS crash. You'll need to restart the
ypserver on the frontend and the ypbind daemon on all the nodes. We've
seen this on our clusters maybe 4 times (on production systems) in the
last several years; others have seen it on a weekly basis. This is
why NIS is dead in Rocks 3.1 - it served us reasonably well but never
matured into a stable system.

       -mjk

On Dec 10, 2003, at 6:37 AM, Greg Bruno wrote:

>
> On   Dec 10, 2003, at 12:03 AM, Tiffani Williams wrote:
>
>>
>>>>   I am trying to submit a job through PBS, but I receive 2 errors.
>>>>   The first error is
>>>>         Job cannot be executed
>>>>         See job standard error file
>>>>
>>>>   The second error is that the standard error file cannot be written
>>>>   into my home directory.
>>>>   I downloaded the sample script at
>>>>
>>>>   http://rocks.npaci.edu/papers/rocks-documentation/launching-batch-
>>>>   jobs.html
>>>>   and have tried a more simple script with PBS directives and echo
>>>>   commands.
>>>>
>>>>   I do not know what I am doing wrong?   I have used PBS successfully
>>>>   on other clusters.
>>>>
>>>> Does anyone have any suggestions?
>>>
>>>
>>> can you login to the compute nodes successfully?
>>>
>>> if not, try restarting autofs on all the compute nodes. on the
>>> frontend, execute:
>>>
>>>     # ssh-agent $SHELL
>>>     # ssh-add
>>>
>>>     # cluster-fork "/etc/rc.d/init.d/autofs restart"
>>>
>>> we've found the startup of autofs to be flaky at times.
>>>
>>> - gb
>>
>>
>> Do these commands have to be run by an administrator? If so, I do not
>> have such privileges. I can ssh to the compute nodes, but I am
>> denied entry. Am I supposed to be able to login to a compute node as
>> a user.
>
> yes, you need to be 'root'.
>
> it appears your home directory is not being mounted when you login --
> have your administrator run the commands above.
>
> - gb



From vincent_b_fox at yahoo.com Wed Dec 10 07:59:14 2003
From: vincent_b_fox at yahoo.com (Vincent Fox)
Date: Wed, 10 Dec 2003 07:59:14 -0800 (PST)
Subject: [Rocks-Discuss]one node short in "labels"
Message-ID: <20031210155914.55789.qmail@web14812.mail.yahoo.com>

So I go to the "labels" selection on the web page to print out the pretty labels.
What a nice idea, by the way!

EXCEPT... it's one node short! My nodes go up to 0-13, but the labels stop at
0-12. Any ideas where I should check to fix this?



-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.sdsc.edu/pipermail/npaci-rocks-
discussion/attachments/20031210/c5bf5e79/attachment-0001.html

From cdwan at mail.ahc.umn.edu Wed Dec 10 12:04:53 2003
From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB))
Date: Wed, 10 Dec 2003 14:04:53 -0600 (CST)
Subject: [Rocks-Discuss]Non-homogenous legacy hardware
Message-ID: <Pine.GSO.4.58.0312101359380.22@lenti.med.umn.edu>
I am integrating legacy systems into a ROCKS cluster, and have hit a
snag with the auto-partition configuration: the newly added (but older)
systems have SCSI disks, while the existing (newer) ones contain IDE.
This is a non-issue as long as the initial install does its default
partitioning. However, I have a "replace-auto-partition.xml" file which
is unworkable for the SCSI based systems, since it makes specific
reference to "hda" rather than "sda."

I would like to have a site-nodes/replace-auto-partition.xml file with a
conditional such that "hda" or "sda" is used, based on the name of the
node (or some other criterion).

Is this possible?
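
I don't know of a built-in conditional, but one sketch of a workaround (the helper below is hypothetical, not part of the Rocks API) is to pick the device at kickstart-generation time from whatever disk node actually exists on the machine:

```python
# Hypothetical helper: choose the first disk device that exists on this
# machine, so one partition description covers both IDE (hda) and SCSI
# (sda) nodes.
import os

def first_disk(candidates=("hda", "sda")):
    for dev in candidates:
        if os.path.exists("/dev/" + dev):
            return dev
    raise RuntimeError("no known disk device found")
```

Partition lines in a replace-auto-partition.xml or installclass could then use first_disk() in place of a hard-coded "hda".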

Thanks, in advance. If this is out there on the mailing list archives, a
pointer would be greatly appreciated.

-Chris Dwan
 The University of Minnesota


From tmartin at physics.ucsd.edu Wed Dec 10 12:09:11 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Wed, 10 Dec 2003 12:09:11 -0800
Subject: [Rocks-Discuss]Error during Make when building a new install floppy
Message-ID: <3FD77D67.7000708@physics.ucsd.edu>

I get the following error when I try to rebuild a boot floppy for Rocks.

This is with a default CVS checkout, updated today, following the Rocks
user guide. I have not actually attempted to make any changes.

make[3]: Leaving directory
`/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3/loader'
make[2]: Leaving directory
`/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3'
strip -o loader         anaconda-7.3/loader/loader
strip: anaconda-7.3/loader/loader: No such file or directory
make[1]: *** [loader] Error 1
make[1]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader'
make: *** [loader] Error 2

Of course I could avoid all of this altogether and just put my binary
module into the appropriate location in the boot image.

Would it be correct to modify the following image file with my changes
and then write it to a floppy via dd?

/home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img

Basically I am injecting an updated e1000 driver, with changes to
pcitable to cover the PCI IDs of my gigabit cards.
Terrence
From tim.carlson at pnl.gov Wed Dec 10 12:40:41 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Wed, 10 Dec 2003 12:40:41 -0800 (PST)
Subject: [Rocks-Discuss]Error during Make when building a new install floppy
In-Reply-To: <3FD77D67.7000708@physics.ucsd.edu>
Message-ID: <Pine.LNX.4.44.0312101235310.20272-100000@scorpion.emsl.pnl.gov>

On Wed, 10 Dec 2003, Terrence Martin wrote:

> I get the following error when I try to rebuild a boot floppy for rocks.
>

You can't make a boot floppy with Rocks 3.0. That isn't supported. Or at
least it wasn't the last time I checked

> Of course I could avoid all of this together and just put my binary
> module into the appropriate location in the boot image.
>
> Would it be correct to modify the following image file with my changes
> and then write it to a floppy via dd?
>
> /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-
dist/7.3/en/os/i386/images/bootnet.img
>
> Basically I am injecting an updated e1000 driver with changes to
> pcitable to support the address of my gigabit cards.

Modifying the bootnet.img is about 1/3 of what you need to do if you go
down that path. You also need to work on netstg1.img, and you'll need to
update the driver in the kernel rpm that gets installed on the box. None of
this is trivial.

If it were me, I would go down the same path I took for updating the
AIC79XX driver

https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/003533.html

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support



From tim.carlson at pnl.gov Wed Dec 10 12:52:38 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Wed, 10 Dec 2003 12:52:38 -0800 (PST)
Subject: [Rocks-Discuss]Non-homogenous legacy hardware
In-Reply-To: <Pine.GSO.4.58.0312101359380.22@lenti.med.umn.edu>
Message-ID: <Pine.LNX.4.44.0312101249400.20272-100000@scorpion.emsl.pnl.gov>

On Wed, 10 Dec 2003, Chris Dwan (CCGB) wrote:

>
> I am integrating legacy systems into a ROCKS cluster, and have hit a
> snag with the auto-partition configuration: The new (old) systems have
> SCSI disks, while old (new) ones contain IDE. This is a non-issue so
>   long as the initial install does its default partitioning. However, I
>   have a "replace-auto-partition.xml" file which is unworkable for the SCSI
>   based systems since it makes specific reference to "hda" rather than
>   "sda."

If you have just a single drive, then you should be able to skip the
"--ondisk" bits of your "part" command
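Concretely, that could be a site-nodes/replace-auto-partition.xml patterned on the userguide's example but with the "--ondisk" references dropped, so the installer uses whatever single drive it finds. A sketch with placeholder sizes, not tested on 3.0:

```xml
<?xml version="1.0" standalone="no"?>
<kickstart>
<main>
<part> / --size 4096 </part>
<part> swap --size 1000 </part>
<part> /export --size 1 --grow </part>
</main>
</kickstart>
```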

Otherwise, you would first have to do something ugly like the following:

http://penguin.epfl.ch/slides/kickstart/ks.cfg

You could probably (maybe) wrap most of that in an
<eval sh="bash">
</eval>

block in the <main> block.

Just guessing.. haven't tried this.
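If you try it, the body of that <eval> might look something like the sketch below: pick "sda" when the install environment reports SCSI devices, otherwise fall back to "hda". The /proc/scsi/scsi probe and the sizes are assumptions, and none of this is tested under anaconda:

```shell
# Hedged sketch of logic for an <eval sh="bash"> block: choose the disk name
# from whether the kernel reports any attached SCSI devices. The probe and
# the partition sizes are assumptions, untested in the Rocks installer.
if [ -r /proc/scsi/scsi ] && grep -q "^Host:" /proc/scsi/scsi 2>/dev/null; then
    disk=sda
else
    disk=hda
fi
echo "part / --size 4096 --ondisk $disk"
echo "part swap --size 1000 --ondisk $disk"
```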

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support



From agrajag at dragaera.net Wed Dec 10 10:21:07 2003
From: agrajag at dragaera.net (Jag)
Date: Wed, 10 Dec 2003 13:21:07 -0500
Subject: [Rocks-Discuss]ssh_known_hosts and ganglia
Message-ID: <1071080467.4693.6.camel@pel>

I noticed a previous post on this list
(https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001934.html)
indicating that Rocks distributes ssh keys for all the nodes over
ganglia. Can anyone enlighten me as to how this is done?

I looked through the ganglia docs and didn't see anything indicating how
to do this, so I'm assuming Rocks made some changes. Unfortunately the
rocks iso images don't seem to contain srpms, so I'm now coming here.
What did Rocks do to ganglia to make the distribution of ssh keys work?

Also, does anyone know where Rocks SRPMs can be found? I've done quite
a bit of searching, but haven't found them anywhere.



From mjk at sdsc.edu Wed Dec 10 14:39:15 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Wed, 10 Dec 2003 14:39:15 -0800
Subject: [Rocks-Discuss]ssh_known_hosts and ganglia
In-Reply-To: <1071080467.4693.6.camel@pel>
References: <1071080467.4693.6.camel@pel>
Message-ID: <AF006859-2B61-11D8-981C-000A95DA5638@sdsc.edu>

Most of the SRPMS are on our FTP site, but we've screwed this up
before. The SRPMS are entirely Rocks specific so they are of little
value outside of Rocks. You can also check out our CVS tree
(cvs.rocksclusters.org), where rocks/src/ganglia shows what we add. We
have a ganglia-python package we created to allow us to write our own
metrics at a higher level than the provided gmetric application. We've
also moved from this method to a single cluster-wide ssh key for Rocks
3.1.

       -mjk

On Dec 10, 2003, at 10:21 AM, Jag wrote:

>   I noticed a previous post on this list
>   (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/
>   001934.html) indicating that Rocks distributes ssh keys for all the
>   nodes over
>   ganglia. Can anyone enlighten me as to how this is done?
>
>   I looked through the ganglia docs and didn't see anything indicating
>   how
>   to do this, so I'm assuming Rocks made some changes. Unfortunately the
>   rocks iso images don't seem to contain srpms, so I'm now coming here.
>   What did Rocks do to ganglia to make the distribution of ssh keys work?
>
>   Also, does anyone know where Rocks SRPMs can be found?   I've done quite
>   a bit of searching, but haven't found them anywhere.



From vrowley at ucsd.edu Wed Dec 10 14:43:49 2003
From: vrowley at ucsd.edu (V. Rowley)
Date: Wed, 10 Dec 2003 14:43:49 -0800
Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build
CD distro
Message-ID: <3FD7A1A5.2030805@ucsd.edu>

When I run this:

[root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; rocks-dist
--dist=cdrom cdrom

on a server installed with ROCKS 3.0.0, I eventually get this:

>   Cleaning distribution
>   Resolving versions (RPMs)
>   Resolving versions (SRPMs)
>   Adding support for rebuild distribution from source
>   Creating files (symbolic links - fast)
>   Creating symlinks to kickstart files
>   Fixing Comps Database
>   Generating hdlist (rpm database)
>   Patching second stage loader (eKV, partioning, ...)
>       patching "rocks-ekv" into distribution ...
>       patching "rocks-piece-pipe" into distribution ...
>       patching "PyXML" into distribution ...
>       patching "expat" into distribution ...
>       patching "rocks-pylib" into distribution ...
>       patching "MySQL-python" into distribution ...
>       patching "rocks-kickstart" into distribution ...
>       patching "rocks-kickstart-profiles" into distribution ...
>       patching "rocks-kickstart-dtds" into distribution ...
>       building CRAM filesystem ...
>   Cleaning distribution
>   Resolving versions (RPMs)
>   Resolving versions (SRPMs)
>   Creating symlinks to kickstart files
>   Generating hdlist (rpm database)
>   Segregating RPMs (rocks, non-rocks)
>   sh: ./kickstart.cgi: No such file or directory
>   sh: ./kickstart.cgi: No such file or directory
>   Traceback (innermost last):
>     File "/opt/rocks/bin/rocks-dist", line 807, in ?
>       app.run()
>     File "/opt/rocks/bin/rocks-dist", line 623, in run
>       eval('self.command_%s()' % (command))
>     File "<string>", line 0, in ?
>     File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>       builder.build()
>     File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>       (rocks, nonrocks) = self.segregateRPMS()
>     File "/opt/rocks/lib/python/rocks/build.py", line 1107, in segregateRPMS
>       for pkg in ks.getSection('packages'):
>   TypeError: loop over non-sequence

Any ideas?

--
Vicky Rowley                               email: vrowley at ucsd.edu
Biomedical Informatics Research Network       work: (858) 536-5980
University of California, San Diego            fax: (858) 822-0828
9500 Gilman Drive
La Jolla, CA 92093-0715


See pictures from our trip to China at http://www.sagacitech.com/Chinaweb



From bruno at rocksclusters.org Wed Dec 10 15:12:49 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Wed, 10 Dec 2003 15:12:49 -0800
Subject: [Rocks-Discuss]one node short in "labels"
In-Reply-To: <20031210155914.55789.qmail@web14812.mail.yahoo.com>
References: <20031210155914.55789.qmail@web14812.mail.yahoo.com>
Message-ID: <5F8539FC-2B66-11D8-9715-000A95C4E3B4@rocksclusters.org>

>   So I go to the "labels" selection on the web page to print out the
>   pretty labels. What a nice idea by the way!
>
>   EXCEPT....it's one node short! I go up to 0-13 and this stops at
>   0-12. Any ideas where I should check to fix this?

yeah, we found this corner case -- it'll be fixed in the next release.

thanks for the bug report.

    - gb
From mjk at sdsc.edu Wed Dec 10 15:16:27 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Wed, 10 Dec 2003 15:16:27 -0800
Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build
CD distro
In-Reply-To: <3FD7A1A5.2030805@ucsd.edu>
References: <3FD7A1A5.2030805@ucsd.edu>
Message-ID: <E17B3F9E-2B66-11D8-981C-000A95DA5638@sdsc.edu>

It looks like someone moved the profiles directory to profiles.orig.

     -mjk


[root at rocks14 install]# ls -l
total 56
drwxr-sr-x    3 root     wheel         4096 Dec 10 21:16 cdrom
drwxrwsr-x    5 root     wheel         4096 Dec 10 20:38 contrib.orig
drwxr-sr-x    3 root     wheel         4096 Dec 10 21:07 ftp.rocksclusters.org
drwxr-sr-x    3 root     wheel         4096 Dec 10 20:38 ftp.rocksclusters.org.orig
-r-xrwsr-x    1 root     wheel        19254 Sep  3 12:40 kickstart.cgi
drwxr-xr-x    3 root     root          4096 Dec 10 20:38 profiles.orig
drwxr-sr-x    3 root     wheel         4096 Dec 10 21:15 rocks-dist
drwxrwsr-x    3 root     wheel         4096 Dec 10 20:38 rocks-dist.orig
drwxr-sr-x    3 root     wheel         4096 Dec 10 21:02 src
drwxr-sr-x    4 root     wheel         4096 Dec 10 20:49 src.foo
On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:

> When I run this:
>
> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
> rocks-dist --dist=cdrom cdrom
>
> on a server installed with ROCKS 3.0.0, I eventually get this:
>
>> Cleaning distribution
>> Resolving versions (RPMs)
>> Resolving versions (SRPMs)
>> Adding support for rebuild distribution from source
>> Creating files (symbolic links - fast)
>> Creating symlinks to kickstart files
>> Fixing Comps Database
>> Generating hdlist (rpm database)
>> Patching second stage loader (eKV, partioning, ...)
>>     patching "rocks-ekv" into distribution ...
>>     patching "rocks-piece-pipe" into distribution ...
>>     patching "PyXML" into distribution ...
>>     patching "expat" into distribution ...
>>     patching "rocks-pylib" into distribution ...
>>     patching "MySQL-python" into distribution ...
>>     patching "rocks-kickstart" into distribution ...
>>     patching "rocks-kickstart-profiles" into distribution ...
>>     patching "rocks-kickstart-dtds" into distribution ...
>>     building CRAM filesystem ...
>> Cleaning distribution
>> Resolving versions (RPMs)
>> Resolving versions (SRPMs)
>> Creating symlinks to kickstart files
>> Generating hdlist (rpm database)
>> Segregating RPMs (rocks, non-rocks)
>> sh: ./kickstart.cgi: No such file or directory
>> sh: ./kickstart.cgi: No such file or directory
>> Traceback (innermost last):
>>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
>>     app.run()
>>   File "/opt/rocks/bin/rocks-dist", line 623, in run
>>     eval('self.command_%s()' % (command))
>>   File "<string>", line 0, in ?
>>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>     builder.build()
>>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>     (rocks, nonrocks) = self.segregateRPMS()
>>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>> segregateRPMS
>>     for pkg in ks.getSection('packages'):
>> TypeError: loop over non-sequence
>
> Any ideas?
>
> --
> Vicky Rowley                             email: vrowley at ucsd.edu
> Biomedical Informatics Research Network     work: (858) 536-5980
> University of California, San Diego           fax: (858) 822-0828
> 9500 Gilman Drive
> La Jolla, CA 92093-0715
>
>
> See pictures from our trip to China at
> http://www.sagacitech.com/Chinaweb



From vrowley at ucsd.edu Wed Dec 10 16:50:16 2003
From: vrowley at ucsd.edu (V. Rowley)
Date: Wed, 10 Dec 2003 16:50:16 -0800
Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying
 to build CD distro
In-Reply-To: <E17B3F9E-2B66-11D8-981C-000A95DA5638@sdsc.edu>
References: <3FD7A1A5.2030805@ucsd.edu>
<E17B3F9E-2B66-11D8-981C-000A95DA5638@sdsc.edu>
Message-ID: <3FD7BF48.9020409@ucsd.edu>

Yep, I did that, but only *AFTER* getting the error. [Thought it was
generated by the rocks-dist sequence, but apparently not.] Go ahead.
Move it back. Same difference.

Vicky

Mason J. Katz wrote:
> It looks like someone moved the profiles directory to profiles.orig.
>
>     -mjk
>
>
> [root at rocks14 install]# ls -l
> total 56
> drwxr-sr-x    3 root     wheel         4096 Dec 10 21:16 cdrom
> drwxrwsr-x    5 root     wheel         4096 Dec 10 20:38 contrib.orig
> drwxr-sr-x    3 root     wheel         4096 Dec 10 21:07
> ftp.rocksclusters.org
> drwxr-sr-x    3 root     wheel         4096 Dec 10 20:38
> ftp.rocksclusters.org.orig
> -r-xrwsr-x    1 root     wheel        19254 Sep 3 12:40 kickstart.cgi
> drwxr-xr-x    3 root     root          4096 Dec 10 20:38 profiles.orig
> drwxr-sr-x    3 root     wheel         4096 Dec 10 21:15 rocks-dist
> drwxrwsr-x    3 root     wheel         4096 Dec 10 20:38 rocks-dist.orig
> drwxr-sr-x    3 root     wheel         4096 Dec 10 21:02 src
> drwxr-sr-x    4 root     wheel         4096 Dec 10 20:49 src.foo
> On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:
>
>> When I run this:
>>
>> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
>> rocks-dist --dist=cdrom cdrom
>>
>> on a server installed with ROCKS 3.0.0, I eventually get this:
>>
>>> Cleaning distribution
>>> Resolving versions (RPMs)
>>> Resolving versions (SRPMs)
>>> Adding support for rebuild distribution from source
>>> Creating files (symbolic links - fast)
>>> Creating symlinks to kickstart files
>>> Fixing Comps Database
>>> Generating hdlist (rpm database)
>>> Patching second stage loader (eKV, partioning, ...)
>>>     patching "rocks-ekv" into distribution ...
>>>     patching "rocks-piece-pipe" into distribution ...
>>>     patching "PyXML" into distribution ...
>>>     patching "expat" into distribution ...
>>>     patching "rocks-pylib" into distribution ...
>>>     patching "MySQL-python" into distribution ...
>>>     patching "rocks-kickstart" into distribution ...
>>>     patching "rocks-kickstart-profiles" into distribution ...
>>>     patching "rocks-kickstart-dtds" into distribution ...
>>>     building CRAM filesystem ...
>>> Cleaning distribution
>>> Resolving versions (RPMs)
>>> Resolving versions (SRPMs)
>>> Creating symlinks to kickstart files
>>> Generating hdlist (rpm database)
>>> Segregating RPMs (rocks, non-rocks)
>>> sh: ./kickstart.cgi: No such file or directory
>>> sh: ./kickstart.cgi: No such file or directory
>>> Traceback (innermost last):
>>>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
>>>     app.run()
>>>   File "/opt/rocks/bin/rocks-dist", line 623, in run
>>>     eval('self.command_%s()' % (command))
>>>   File "<string>", line 0, in ?
>>>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>>     builder.build()
>>>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>>     (rocks, nonrocks) = self.segregateRPMS()
>>>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>>> segregateRPMS
>>>     for pkg in ks.getSection('packages'):
>>> TypeError: loop over non-sequence
>>
>>
>> Any ideas?
>>
>> --
>> Vicky Rowley                             email: vrowley at ucsd.edu
>> Biomedical Informatics Research Network     work: (858) 536-5980
>> University of California, San Diego           fax: (858) 822-0828
>> 9500 Gilman Drive
>> La Jolla, CA 92093-0715
>>
>>
>> See pictures from our trip to China at http://www.sagacitech.com/Chinaweb
>
>
>

--
Vicky Rowley                               email: vrowley at ucsd.edu
Biomedical Informatics Research Network       work: (858) 536-5980
University of California, San Diego            fax: (858) 822-0828
9500 Gilman Drive
La Jolla, CA 92093-0715


See pictures from our trip to China at http://www.sagacitech.com/Chinaweb



From tim.carlson at pnl.gov Wed Dec 10 17:23:25 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Wed, 10 Dec 2003 17:23:25 -0800 (PST)
Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to
 build CD distro
In-Reply-To: <3FD7BF48.9020409@ucsd.edu>
Message-ID: <Pine.GSO.4.44.0312101722470.711-100000@poincare.emsl.pnl.gov>

On Wed, 10 Dec 2003, V. Rowley wrote:

Did you remove python by chance? kickstart.cgi calls python directly via
/usr/bin/python, while rocks-dist does an "env python".

Tim
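As a side note on the traceback itself: "loop over non-sequence" is Python 1.5's wording for iterating something that isn't a sequence, which fits a section lookup coming back empty after kickstart.cgi failed to run. A toy reconstruction of that failure mode (the get_section helper here is hypothetical, not the real rocks code):

```python
# Toy reconstruction, not the actual rocks code: if kickstart.cgi never ran,
# the 'packages' section is never populated, the lookup returns None, and the
# for-loop raises TypeError. Python 1.5 phrased this as "loop over
# non-sequence"; modern Pythons say the object is "not iterable".
def get_section(name, sections):
    return sections.get(name)  # None when the section was never filled in

sections = {}  # kickstart.cgi failed, so nothing was parsed into here
try:
    for pkg in get_section("packages", sections):
        pass
    error = None
except TypeError as exc:
    error = str(exc)
```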

>   Yep, I did that, but only *AFTER* getting the error. [Thought it was
>   generated by the rocks-dist sequence, but apparently not.] Go ahead.
>   Move it back. Same difference.
>
>   Vicky
>
>   Mason J. Katz wrote:
>   > It looks like someone moved the profiles directory to profiles.orig.
>   >
>   >     -mjk
>   >
>   >
>   > [root at rocks14 install]# ls -l
>   > total 56
>   > drwxr-sr-x    3 root     wheel         4096 Dec 10 21:16 cdrom
>   > drwxrwsr-x    5 root     wheel         4096 Dec 10 20:38 contrib.orig
>   > drwxr-sr-x    3 root     wheel         4096 Dec 10 21:07
>   > ftp.rocksclusters.org
>   > drwxr-sr-x    3 root     wheel         4096 Dec 10 20:38
>   > ftp.rocksclusters.org.orig
>   > -r-xrwsr-x    1 root     wheel        19254 Sep 3 12:40 kickstart.cgi
>   > drwxr-xr-x    3 root     root          4096 Dec 10 20:38 profiles.orig
>   > drwxr-sr-x    3 root     wheel         4096 Dec 10 21:15 rocks-dist
>   > drwxrwsr-x    3 root     wheel         4096 Dec 10 20:38 rocks-dist.orig
>   > drwxr-sr-x    3 root     wheel         4096 Dec 10 21:02 src
>   > drwxr-sr-x    4 root     wheel         4096 Dec 10 20:49 src.foo
>   > On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:
>   >
>   >> When I run this:
>   >>
>   >> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
>   >> rocks-dist --dist=cdrom cdrom
>   >>
>   >> on a server installed with ROCKS 3.0.0, I eventually get this:
>   >>
>   >>> Cleaning distribution
>   >>> Resolving versions (RPMs)
>   >>> Resolving versions (SRPMs)
>   >>> Adding support for rebuild distribution from source
>   >>> Creating files (symbolic links - fast)
>   >>> Creating symlinks to kickstart files
>   >>> Fixing Comps Database
>   >>> Generating hdlist (rpm database)
>   >>> Patching second stage loader (eKV, partioning, ...)
>   >>>     patching "rocks-ekv" into distribution ...
>   >>>     patching "rocks-piece-pipe" into distribution ...
>   >>>     patching "PyXML" into distribution ...
>   >>>     patching "expat" into distribution ...
>   >>>     patching "rocks-pylib" into distribution ...
>   >>>     patching "MySQL-python" into distribution ...
>   >>>     patching "rocks-kickstart" into distribution ...
>   >>>     patching "rocks-kickstart-profiles" into distribution ...
>   >>>     patching "rocks-kickstart-dtds" into distribution ...
>   >>>     building CRAM filesystem ...
>   >>> Cleaning distribution
>   >>> Resolving versions (RPMs)
>   >>> Resolving versions (SRPMs)
>   >>> Creating symlinks to kickstart files
>   >>> Generating hdlist (rpm database)
>   >>> Segregating RPMs (rocks, non-rocks)
>   >>> sh: ./kickstart.cgi: No such file or directory
>   >>> sh: ./kickstart.cgi: No such file or directory
>   >>> Traceback (innermost last):
>   >>>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
>   >>>     app.run()
>   >>>   File "/opt/rocks/bin/rocks-dist", line 623, in run
>   >>>     eval('self.command_%s()' % (command))
>   >>>   File "<string>", line 0, in ?
>   >>>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>   >>>     builder.build()
>   >>>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>   >>>     (rocks, nonrocks) = self.segregateRPMS()
>   >>>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>   >>> segregateRPMS
>   >>>     for pkg in ks.getSection('packages'):
>   >>> TypeError: loop over non-sequence
>   >>
>   >>
>   >> Any ideas?
>   >>
>   >> --
>   >> Vicky Rowley                             email: vrowley at ucsd.edu
>   >> Biomedical Informatics Research Network     work: (858) 536-5980
>   >> University of California, San Diego           fax: (858) 822-0828
>   >> 9500 Gilman Drive
>   >> La Jolla, CA 92093-0715
>   >>
>   >>
>   >> See pictures from our trip to China at http://www.sagacitech.com/Chinaweb
>   >
>   >
>   >
>
>   --
>   Vicky Rowley                              email: vrowley at ucsd.edu
>   Biomedical Informatics Research Network      work: (858) 536-5980
>   University of California, San Diego           fax: (858) 822-0828
>   9500 Gilman Drive
>   La Jolla, CA 92093-0715
>
>
>   See pictures from our trip to China at http://www.sagacitech.com/Chinaweb
>
>




From naihh at imcb.a-star.edu.sg Wed Dec 10 17:45:18 2003
From: naihh at imcb.a-star.edu.sg (Nai Hong Hwa Francis)
Date: Thu, 11 Dec 2003 09:45:18 +0800
Subject: [Rocks-Discuss]RE: Do you have a list of the various models of Gigabit
Ethernet Interfaces compatible to Rocks 3?
Message-ID: <5E118EED7CC277468A275F11EEEC39B94CCD66@EXIMCB2.imcb.a-star.edu.sg>


Hi All,

Do you have a list of the various gigabit Ethernet interfaces that are
compatible with Rocks 3?

I am changing my nodes' connectivity from 10/100 to 1000.

Has anyone done that, and what differences in performance or turnaround
time did you see?

Has anyone successfully built a set of grid compute nodes using Rocks
3?

Thanks and Regards

Nai Hong Hwa Francis
Institute of Molecular and Cell Biology (A*STAR)
30 Medical Drive
Singapore 117609.
DID: (65) 6874-6196

-----Original Message-----
From: npaci-rocks-discussion-request at sdsc.edu
[mailto:npaci-rocks-discussion-request at sdsc.edu]
Sent: Thursday, December 11, 2003 9:25 AM
To: npaci-rocks-discussion at sdsc.edu
Subject: npaci-rocks-discussion digest, Vol 1 #641 - 13 msgs

Send npaci-rocks-discussion mailing list submissions to
      npaci-rocks-discussion at sdsc.edu

To subscribe or unsubscribe via the World Wide Web, visit

http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
or, via email, send a message with subject or body 'help' to
      npaci-rocks-discussion-request at sdsc.edu

You can reach the person managing the list at
      npaci-rocks-discussion-admin at sdsc.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of npaci-rocks-discussion digest..."


Today's Topics:

   1. Non-homogenous legacy hardware (Chris Dwan (CCGB))
   2. Error during Make when building a new install floppy (Terrence
Martin)
   3. Re: Error during Make when building a new install floppy (Tim
Carlson)
   4. Re: Non-homogenous legacy hardware (Tim Carlson)
   5. ssh_known_hosts and ganglia (Jag)
   6. Re: ssh_known_hosts and ganglia (Mason J. Katz)
   7. "TypeError: loop over non-sequence" when trying to build CD
distro (V. Rowley)
   8. Re: one node short in "labels" (Greg Bruno)
   9. Re: "TypeError: loop over non-sequence" when trying to build CD
distro (Mason J. Katz)
  10. Re: "TypeError: loop over non-sequence" when trying
        to build CD distro (V. Rowley)
  11. Re: "TypeError: loop over non-sequence" when trying to
        build CD distro (Tim Carlson)

--__--__--

Message: 1
Date: Wed, 10 Dec 2003 14:04:53 -0600 (CST)
From: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu>
To: npaci-rocks-discussion at sdsc.edu
Subject: [Rocks-Discuss]Non-homogenous legacy hardware


I am integrating legacy systems into a ROCKS cluster, and have hit a
snag with the auto-partition configuration: The new (old) systems have
SCSI disks, while old (new) ones contain IDE. This is a non-issue so
long as the initial install does its default partitioning. However, I
have a "replace-auto-partition.xml" file which is unworkable for the
SCSI
based systems since it makes specific reference to "hda" rather than
"sda."

I would like to have a site-nodes/replace-auto-partition.xml file with a
conditional such that "hda" or "sda" is used, based on the name of the
node (or some other criterion).

Is this possible?

Thanks, in advance. If this is out there on the mailing list archives,
a
pointer would be greatly appreciated.

-Chris Dwan
 The University of Minnesota

--__--__--

Message: 2
Date: Wed, 10 Dec 2003 12:09:11 -0800
From: Terrence Martin <tmartin at physics.ucsd.edu>
To: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu>
Subject: [Rocks-Discuss]Error during Make when building a new install
floppy

I get the following error when I try to rebuild a boot floppy for rocks.

This is with the default CVS checkout with an update today according to
the rocks userguide. I have not actually attempted to make any changes.

make[3]: Leaving directory
`/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3/loader'
make[2]: Leaving directory
`/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3'
strip -o loader         anaconda-7.3/loader/loader
strip: anaconda-7.3/loader/loader: No such file or directory
make[1]: *** [loader] Error 1
make[1]: Leaving directory
`/home/install/rocks/src/rocks/boot/7.3/loader'
make: *** [loader] Error 2

Of course I could avoid all of this together and just put my binary
module into the appropriate location in the boot image.

Would it be correct to modify the following image file with my changes
and then write it to a floppy via dd?

/home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3
/en/os/i386/images/bootnet.img
Basically I am injecting an updated e1000 driver with changes to
pcitable to support the address of my gigabit cards.

Terrence


--__--__--

Message: 3
Date: Wed, 10 Dec 2003 12:40:41 -0800 (PST)
From: Tim Carlson <tim.carlson at pnl.gov>
Subject: Re: [Rocks-Discuss]Error during Make when building a new
install floppy
To: Terrence Martin <tmartin at physics.ucsd.edu>
Cc: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu>
Reply-to: Tim Carlson <tim.carlson at pnl.gov>

On Wed, 10 Dec 2003, Terrence Martin wrote:

> I get the following error when I try to rebuild a boot floppy for
rocks.
>

You can't make a boot floppy with Rocks 3.0. That isn't supported. Or at
least it wasn't the last time I checked

> Of course I could avoid all of this together and just put my binary
> module into the appropriate location in the boot image.
>
> Would it be correct to modify the following image file with my changes
> and then write it to a floppy via dd?
>
>
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3
/en/os/i386/images/bootnet.img
>
> Basically I am injecting an updated e1000 driver with changes to
> pcitable to support the address of my gigabit cards.

Modifiying the bootnet.img is about 1/3 of what you need to do if you go
down that path. You also need to work on netstg1.img and you'll need to
update the drive in the kernel rpm that gets installed on the box. None
of
this is trivial.

If it were me, I would go down the same path I took for updating the
AIC79XX driver

https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/003
533.html

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support
--__--__--

Message: 4
Date: Wed, 10 Dec 2003 12:52:38 -0800 (PST)
From: Tim Carlson <tim.carlson at pnl.gov>
Subject: Re: [Rocks-Discuss]Non-homogenous legacy hardware
To: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu>
Cc: npaci-rocks-discussion at sdsc.edu
Reply-to: Tim Carlson <tim.carlson at pnl.gov>

On Wed, 10 Dec 2003, Chris Dwan (CCGB) wrote:

>
> I am integrating legacy systems into a ROCKS cluster, and have hit a
> snag with the auto-partition configuration: The new (old) systems
have
> SCSI disks, while old (new) ones contain IDE. This is a non-issue so
> long as the initial install does its default partitioning. However, I
> have a "replace-auto-partition.xml" file which is unworkable for the
SCSI
> based systems since it makes specific reference to "hda" rather than
> "sda."

If you have just a single drive, then you should be able to skip the
"--ondisk" bits of your "part" command

Otherwise, you would have first to do something ugly like the following:

http://penguin.epfl.ch/slides/kickstart/ks.cfg

You could probably (maybe) wrap most of that in an
<eval sh="bash">
</eval>

block in the <main> block.

Just guessing.. haven't tried this.

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support


--__--__--

Message: 5
From: Jag <agrajag at dragaera.net>
To: npaci-rocks-discussion at sdsc.edu
Date: Wed, 10 Dec 2003 13:21:07 -0500
Subject: [Rocks-Discuss]ssh_known_hosts and ganglia

I noticed a previous post on this list
(https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001934
.html) indicating that Rocks distributes ssh keys for all the nodes over
ganglia. Can anyone enlighten me as to how this is done?
I looked through the ganglia docs and didn't see anything indicating how
to do this, so I'm assuming Rocks made some changes. Unfortunately the
rocks iso images don't seem to contain srpms, so I'm now coming here.
What did Rocks do to ganglia to make the distribution of ssh keys work?

Also, does anyone know where Rocks SRPMs can be found?    I've done quite
a bit of searching, but haven't found them anywhere.


--__--__--

Message: 6
Cc: npaci-rocks-discussion at sdsc.edu
From: "Mason J. Katz" <mjk at sdsc.edu>
Subject: Re: [Rocks-Discuss]ssh_known_hosts and ganglia
Date: Wed, 10 Dec 2003 14:39:15 -0800
To: Jag <agrajag at dragaera.net>

Most of the SRPMS are on our FTP site, but we've screwed this up
before. The SRPMS are entirely Rocks specific so they are of little
value outside of Rocks. You can also checkout our CVS tree
(cvs.rocksclusters.org) where rocks/src/ganglia shows what we add. We
have a ganglia-python package we created to allow us to write our own
metrics at a high level than the provide gmetric application. We've
also moved from this method to a single cluster-wide ssh key for Rocks
3.1.

     -mjk

On Dec 10, 2003, at 10:21 AM, Jag wrote:

> I noticed a previous post on this list
> (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/
> 001934.html) indicating that Rocks distributes ssh keys for all the
> nodes over
> ganglia. Can anyone enlighten me as to how this is done?
>
> I looked through the ganglia docs and didn't see anything indicating
> how
> to do this, so I'm assuming Rocks made some changes. Unfortunately
the
> rocks iso images don't seem to contain srpms, so I'm now coming here.
> What did Rocks do to ganglia to make the distribution of ssh keys
work?
>
> Also, does anyone know where Rocks SRPMs can be found? I've done
quite
> a bit of searching, but haven't found them anywhere.


--__--__--

Message: 7
Date: Wed, 10 Dec 2003 14:43:49 -0800
From: "V. Rowley" <vrowley at ucsd.edu>
To: npaci-rocks-discussion at sdsc.edu
Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying
to build CD distro
When I run this:

[root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; rocks-dist

--dist=cdrom cdrom

on a server installed with ROCKS 3.0.0, I eventually get this:

> Cleaning distribution
> Resolving versions (RPMs)
> Resolving versions (SRPMs)
> Adding support for rebuild distribution from source
> Creating files (symbolic links - fast)
> Creating symlinks to kickstart files
> Fixing Comps Database
> Generating hdlist (rpm database)
> Patching second stage loader (eKV, partioning, ...)
>     patching "rocks-ekv" into distribution ...
>     patching "rocks-piece-pipe" into distribution ...
>     patching "PyXML" into distribution ...
>     patching "expat" into distribution ...
>     patching "rocks-pylib" into distribution ...
>     patching "MySQL-python" into distribution ...
>     patching "rocks-kickstart" into distribution ...
>     patching "rocks-kickstart-profiles" into distribution ...
>     patching "rocks-kickstart-dtds" into distribution ...
>     building CRAM filesystem ...
> Cleaning distribution
> Resolving versions (RPMs)
> Resolving versions (SRPMs)
> Creating symlinks to kickstart files
> Generating hdlist (rpm database)
> Segregating RPMs (rocks, non-rocks)
> sh: ./kickstart.cgi: No such file or directory
> sh: ./kickstart.cgi: No such file or directory
> Traceback (innermost last):
>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
>     app.run()
>   File "/opt/rocks/bin/rocks-dist", line 623, in run
>     eval('self.command_%s()' % (command))
>   File "<string>", line 0, in ?
>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>     builder.build()
>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>     (rocks, nonrocks) = self.segregateRPMS()
>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in segregateRPMS
>     for pkg in ks.getSection('packages'):
> TypeError: loop over non-sequence

Any ideas?

--
Vicky Rowley                              email: vrowley at ucsd.edu
Biomedical Informatics Research Network      work: (858) 536-5980
University of California, San Diego           fax: (858) 822-0828
9500 Gilman Drive
La Jolla, CA 92093-0715
See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb


--__--__--

Message: 8
Cc: rocks <npaci-rocks-discussion at sdsc.edu>
From: Greg Bruno <bruno at rocksclusters.org>
Subject: Re: [Rocks-Discuss]one node short in "labels"
Date: Wed, 10 Dec 2003 15:12:49 -0800
To: Vincent Fox <vincent_b_fox at yahoo.com>

> So I go to the "labels" selection on the web page to print out the
> pretty labels. What a nice idea by the way!
>
> EXCEPT....it's one node short! I go up to 0-13 and this stops at
> 0-12. Any ideas where I should check to fix this?

yeah, we found this corner case -- it'll be fixed in the next release.

thanks for the bug report.

    - gb


--__--__--

Message: 9
Cc: npaci-rocks-discussion at sdsc.edu
From: "Mason J. Katz" <mjk at sdsc.edu>
Subject: Re: [Rocks-Discuss]"TypeError:    loop over non-sequence" when
trying to build CD distro
Date: Wed, 10 Dec 2003 15:16:27 -0800
To: "V. Rowley" <vrowley at ucsd.edu>

It looks like someone moved the profiles directory to profiles.orig.

       -mjk


[root at rocks14 install]# ls -l
total 56
drwxr-sr-x    3 root     wheel        4096 Dec 10 21:16 cdrom
drwxrwsr-x    5 root     wheel        4096 Dec 10 20:38 contrib.orig
drwxr-sr-x    3 root     wheel        4096 Dec 10 21:07 ftp.rocksclusters.org
drwxr-sr-x    3 root     wheel        4096 Dec 10 20:38 ftp.rocksclusters.org.orig
-r-xrwsr-x    1 root     wheel       19254 Sep  3 12:40 kickstart.cgi
drwxr-xr-x    3 root     root         4096 Dec 10 20:38 profiles.orig
drwxr-sr-x    3 root     wheel        4096 Dec 10 21:15 rocks-dist
drwxrwsr-x    3 root     wheel        4096 Dec 10 20:38 rocks-dist.orig
drwxr-sr-x    3 root     wheel        4096 Dec 10 21:02 src
drwxr-sr-x    4 root     wheel        4096 Dec 10 20:49 src.foo

On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:

> When I run this:
>
> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
> rocks-dist --dist=cdrom cdrom
>
> on a server installed with ROCKS 3.0.0, I eventually get this:
>
> [traceback snipped; quoted in full in the previous message]
>
> Any ideas?


--__--__--

Message: 10
Date: Wed, 10 Dec 2003 16:50:16 -0800
From: "V. Rowley" <vrowley at ucsd.edu>
To: "Mason J. Katz" <mjk at sdsc.edu>
CC: npaci-rocks-discussion at sdsc.edu
Subject: Re: [Rocks-Discuss]"TypeError:   loop over non-sequence" when
trying
 to build CD distro

Yep, I did that, but only *AFTER* getting the error. [Thought it was
generated by the rocks-dist sequence, but apparently not.] Go ahead.
Move it back. Same difference.

Vicky

Mason J. Katz wrote:
> It looks like someone moved the profiles directory to profiles.orig.
>
> [ls -l output and earlier quoted messages snipped]

--
Vicky Rowley                              email: vrowley at ucsd.edu
Biomedical Informatics Research Network      work: (858) 536-5980
University of California, San Diego           fax: (858) 822-0828
9500 Gilman Drive
La Jolla, CA 92093-0715


See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb


--__--__--

Message: 11
Date: Wed, 10 Dec 2003 17:23:25 -0800 (PST)
From: Tim Carlson <tim.carlson at pnl.gov>
Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when
trying to
 build CD distro
To: "V. Rowley" <vrowley at ucsd.edu>
Cc: "Mason J. Katz" <mjk at sdsc.edu>, npaci-rocks-discussion at sdsc.edu
Reply-to: Tim Carlson <tim.carlson at pnl.gov>

On Wed, 10 Dec 2003, V. Rowley wrote:

Did you remove python by chance? kickstart.cgi calls /usr/bin/python
directly, while rocks-dist uses "env python".
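
The distinction matters because the two shebang styles resolve the
interpreter differently: a hard-coded path ignores $PATH entirely,
while "env python" searches it. A small sketch of that difference using
shutil.which as a modern stand-in for the PATH search "env" performs:

```python
import shutil

HARDCODED = "/usr/bin/python"  # what a "#!/usr/bin/python" shebang always uses

def resolve_env_style(cmd, path=None):
    # Mimic "#!/usr/bin/env cmd": search PATH (or an explicit path string).
    return shutil.which(cmd, path=path)

# With an empty search path, "env python" finds nothing at all, while a
# hard-coded shebang still points at /usr/bin/python whether or not that
# file actually exists:
print(resolve_env_style("python", path=""))
print(HARDCODED)
```

So the two tools can disagree: rocks-dist may find a working python on
$PATH while kickstart.cgi fails because /usr/bin/python specifically is
missing, or vice versa.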

Tim

> Yep, I did that, but only *AFTER* getting the error. [Thought it was
> generated by the rocks-dist sequence, but apparently not.] Go ahead.
> Move it back. Same difference.
>
> Vicky
>
> Mason J. Katz wrote:
> > It looks like someone moved the profiles directory to profiles.orig.
> >
> > [ls -l output and earlier quoted messages snipped]



--__--__--

_______________________________________________
npaci-rocks-discussion mailing list
npaci-rocks-discussion at sdsc.edu
http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion


End of npaci-rocks-discussion Digest




From tmartin at physics.ucsd.edu Wed Dec 10 18:03:41 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Wed, 10 Dec 2003 18:03:41 -0800
Subject: [Rocks-Discuss]Rocks 3.0.0
Message-ID: <3FD7D07D.8090108@physics.ucsd.edu>

I am having a problem installing Rocks 3.0.0 on my new cluster.

The python error occurs right after anaconda starts and just before the
install asks for the roll CDROM.

The error refers to an inability to find or load rocks.file. I think the
error is associated with the window that pops up and asks you to put the
roll CDROM in.

The process I followed to get to this point is:

Put the Rocks 3.0.0 CDROM into the CDROM drive
Boot the system
At the prompt, type "frontend"
Wait until anaconda starts
Error referring to being unable to load rocks.file

I have successfully installed Rocks on a smaller cluster, but that has
different hardware. I used the same CDROM for both installs.

Any thoughts?

Terrence




From vrowley at ucsd.edu Wed Dec 10 19:52:49 2003
From: vrowley at ucsd.edu (V. Rowley)
Date: Wed, 10 Dec 2003 19:52:49 -0800
Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying
 to build CD distro
In-Reply-To: <Pine.GSO.4.44.0312101722470.711-100000@poincare.emsl.pnl.gov>
References: <Pine.GSO.4.44.0312101722470.711-100000@poincare.emsl.pnl.gov>
Message-ID: <3FD7EA11.10204@ucsd.edu>

Looks like python is okay:

>   [root at rocks14 birn-oracle1]# which python
>   /usr/bin/python
>   [root at rocks14 birn-oracle1]# python --help
>   Unknown option: --
>   usage: python [option] ... [-c cmd | file | -] [arg] ...
>   Options and arguments (and corresponding environment variables):
>   -d     : debug output from parser (also PYTHONDEBUG=x)
>   -i     : inspect interactively after running script, (also PYTHONINSPECT=x)
>            and force prompts, even if stdin does not appear to be a terminal
>   -O     : optimize generated bytecode (a tad; also PYTHONOPTIMIZE=x)
>   -OO    : remove doc-strings in addition to the -O optimizations
>   -S     : don't imply 'import site' on initialization
>   -t     : issue warnings about inconsistent tab usage (-tt: issue errors)
>   -u     : unbuffered binary stdout and stderr (also PYTHONUNBUFFERED=x)
>   -v     : verbose (trace import statements) (also PYTHONVERBOSE=x)
>   -x     : skip first line of source, allowing use of non-Unix forms of #!cmd
>   -X     : disable class based built-in exceptions
>   -c cmd : program passed in as string (terminates option list)
>   file   : program read from script file
>   -      : program read from stdin (default; interactive mode if a tty)
>   arg ...: arguments passed to program in sys.argv[1:]
>   Other environment variables:
>   PYTHONSTARTUP: file executed on interactive startup (no default)
>   PYTHONPATH   : ':'-separated list of directories prefixed to the
>                   default module search path. The result is sys.path.
>   PYTHONHOME   : alternate <prefix> directory (or <prefix>:<exec_prefix>).
>                   The default module search path uses <prefix>/python1.5.
>   [root at rocks14 birn-oracle1]#



Tim Carlson wrote:
> On Wed, 10 Dec 2003, V. Rowley wrote:
>
> Did you remove python by chance? kickstart.cgi calls python directly in
> /usr/bin/python while rocks-dist does an "env python"
>
> Tim
>
> [earlier quoted messages snipped]

--
Vicky Rowley                              email: vrowley at ucsd.edu
Biomedical Informatics Research Network      work: (858) 536-5980
University of California, San Diego           fax: (858) 822-0828
9500 Gilman Drive
La Jolla, CA 92093-0715
See pictures from our trip to China at http://www.sagacitech.com/Chinaweb



From wyzhong78 at msn.com Wed Dec 10 20:38:53 2003
From: wyzhong78 at msn.com (zhong wenyu)
Date: Thu, 11 Dec 2003 12:38:53 +0800
Subject: [Rocks-Discuss]Rocks 3.0.0 problem:not able to boot up
Message-ID: <BAY3-F3296PnPlpNvHX000097eb@hotmail.com>



>From: Greg Bruno <bruno at rocksclusters.org>
>To: "zhong wenyu" <wyzhong78 at msn.com>
>CC: npaci-rocks-discussion at sdsc.edu
>Subject: Re: [Rocks-Discuss]Rocks 3.0.0 problem:not able to boot up
>Date: Mon, 8 Dec 2003 15:31:08 -0800
>
>>I have installed Rocks 3.0.0 with default options successful,there
>>was not any trouble.But I boot it up,it stopped at beginning,just
>>show "GRUB" on the screen and waiting...
>
>when you built the frontend, did you start with the rocks base CD
>then add the HPC roll?
>
> - gb
>
I have resolved this problem, but I don't know why.
I have one SCSI hard disk and one IDE disk on the frontend. I chose the
SCSI disk to be the first HDD and installed "/" on it; then it could not
boot up. Even after disabling the IDE HDD and installing again, it still
could not boot. Finally I chose the SCSI disk as the first HDD for the
install, then set the IDE HDD to be first for booting, and it worked!
Must GRUB be installed on the IDE HDD?
thanks!




From wyzhong78 at msn.com Wed Dec 10 20:44:09 2003
From: wyzhong78 at msn.com (zhong wenyu)
Date: Thu, 11 Dec 2003 12:44:09 +0800
Subject: [Rocks-Discuss]I can't use xpbs in rocks
Message-ID: <BAY3-F24QLayI4TY7zD00009bf1@hotmail.com>

Hi, everyone!
I have installed rocks 2.3.2 and 3.0.0; xpbs cannot be used in either of
them.
typed: xpbs[enter]
showed: xpbs: initialization failed! output: invalid command name
"Pref_Init"
thanks!

From phil at sdsc.edu Wed Dec 10 21:26:50 2003
From: phil at sdsc.edu (Philip Papadopoulos)
Date: Wed, 10 Dec 2003 21:26:50 -0800
Subject: [Rocks-Discuss]Rocks 3.0.0 problem:not able to boot up
In-Reply-To: <BAY3-F3296PnPlpNvHX000097eb@hotmail.com>
References: <BAY3-F3296PnPlpNvHX000097eb@hotmail.com>
Message-ID: <3FD8001A.9030702@sdsc.edu>

There is a conflict between the way the BIOS numbers drives and the way
the install kernel numbers them (and this is not standardized). Check
whether your BIOS lets you select which device is the boot device. If it
just says "Hard Disk" (no choice between IDE and SCSI), then you are
stuck with needing to have GRUB on the device that the BIOS thinks is
the boot device. If you can choose, then SCSI can probably be made to
work.

These sorts of issues (this is a general redhat/linux problem) can be
quite troublesome (and annoying). We had some older hardware with two
different types of SCSI controllers and drives on each controller. The
boot kernel labeled /sda differently than the BIOS did. The install went
fine, but then came the dreaded "OS Not Found" BIOS message on reboot.
The cause was that the GRUB loader was being put on Linux's notion of
/sda, but when the BIOS loaded, it found nothing (because GRUB was
installed on the BIOS's idea of /sdb). For this particular machine, we
were not able to change the BIOS's notion -- we had to force Linux to
install the bootloader on Linux's idea of /sdb.

-P
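
If the BIOS cannot be told which disk to boot, GRUB legacy can be
pointed at the BIOS's boot disk explicitly at install time. A sketch,
assuming the BIOS boots the IDE drive, which Linux sees as /dev/hda
(adjust the device names for your hardware):

```shell
# Map BIOS boot disk (hd0) to the IDE drive, then install GRUB on its MBR.
grub --batch <<'EOF'
device (hd0) /dev/hda
root (hd0,0)
setup (hd0)
EOF
```

The "device" command overrides GRUB's device.map guess, which is exactly
the BIOS-vs-kernel numbering mismatch described above.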


zhong wenyu wrote:

>
> [quoted message snipped]




From mjk at sdsc.edu Wed Dec 10 22:04:57 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Wed, 10 Dec 2003 22:04:57 -0800
Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build
CD distro
In-Reply-To: <3FD7EA11.10204@ucsd.edu>
References: <Pine.GSO.4.44.0312101722470.711-100000@poincare.emsl.pnl.gov>
<3FD7EA11.10204@ucsd.edu>
Message-ID: <F23F7B5E-2B9F-11D8-981C-000A95DA5638@sdsc.edu>

Hi Vicky,

The following directory cannot resolve its symlinks anymore. If you
move the profiles and mirror directories around, Rocks cannot find them
to build the kickstart files.

     -mjk


[root at rocks14 default]# ls -l
total 16
lrwxrwxrwx    1 root     root          113 Nov 13 20:19 core.xml ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/
7.3/en/os/i386/build/graphs/default/core.xml
-rwxrwsr-x    1 root     wheel        3123 Sep 3 17:10 hpc.xml
-rwxr-xr-x    1 root     root          495 Sep 9 22:55 patch.xml
-rwxrwsr-x    1 root     wheel         452 Sep 3 17:10 root.xml
lrwxrwxrwx    1 root     root          112 Nov 13 20:19 rsh.xml ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/
7.3/en/os/i386/build/graphs/default/rsh.xml
-rwxrwsr-x    1 root     wheel         923 Sep 3 17:10 sge.xml
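
This is where the "loop over non-sequence" error earlier in the thread
comes from: with the profile symlinks broken, kickstart.cgi produces no
output, getSection() comes back empty-handed, and Python 1.5 reports
iterating over None with that message. A minimal illustration
(get_section is a hypothetical stand-in, not the real rocks.pylib code,
and modern Pythons word the error differently):

```python
def get_section(profile, name):
    # Hypothetical stand-in for ks.getSection(): returns None when
    # kickstart.cgi produced no output for the requested section.
    return profile.get(name)

profile = {}  # empty profile: kickstart.cgi missing or failing
try:
    for pkg in get_section(profile, "packages"):
        print(pkg)
except TypeError as err:
    # Python 1.5 printed this as "TypeError: loop over non-sequence"
    print("TypeError:", err)
```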

On Dec 10, 2003, at 7:52 PM, V. Rowley wrote:

> Looks like python is okay:
>
> [python --help output and earlier quoted messages snipped]
>>>>> 9500 Gilman Drive
>>>>> La Jolla, CA 92093-0715
>>>>>
>>>>>
>>>>> See pictures from our trip to China at
>>>>> http://www.sagacitech.com/Chinaweb
>>>>
>>>>
>>>>
>>> --
>>> Vicky Rowley                              email: vrowley at ucsd.edu
>>> Biomedical Informatics Research Network      work: (858) 536-5980
>>> University of California, San Diego           fax: (858) 822-0828
>>> 9500 Gilman Drive
>>> La Jolla, CA 92093-0715
>>>
>>>
>>> See pictures from our trip to China at
>>> http://www.sagacitech.com/Chinaweb
>>>
>>>
>
> --
> Vicky Rowley                              email: vrowley at ucsd.edu
> Biomedical Informatics Research Network      work: (858) 536-5980
> University of California, San Diego           fax: (858) 822-0828
> 9500 Gilman Drive
> La Jolla, CA 92093-0715
>
>
> See pictures from our trip to China at
> http://www.sagacitech.com/Chinaweb



From bruno at rocksclusters.org Wed Dec 10 22:31:11 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Wed, 10 Dec 2003 22:31:11 -0800
Subject: [Rocks-Discuss]Rocks 3.0.0
In-Reply-To: <3FD7D07D.8090108@physics.ucsd.edu>
References: <3FD7D07D.8090108@physics.ucsd.edu>
Message-ID: <9C7EE8E9-2BA3-11D8-9715-000A95C4E3B4@rocksclusters.org>

>   I am having a problem on install of rocks 3.0.0 on my new cluster.
>
>   The python error occurs right after anaconda starts and just before
>   the install asks for the roll CDROM.
>
>   The error refers to an inability to find or load rocks.file. The error
>   is, I think, associated with the window that pops up and asks you to put
>   the roll CDROM in.
>
>   The process I followed to get to this point is
>
>   Put the Rocks 3.0.0 CDROM into the CDROM drive
>   Boot the system
>   At the prompt type frontend
>   Wait till anaconda starts
>   Error referring to unable to load rocks.file.
>
>   I have successfully installed rocks on a smaller cluster but that has
>   different hardware. I used the same CDROM for both installs.
>
>   Any thoughts?

hard to say -- but some folks had similar problems due to bad memory:

https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-February/001246.html

    - gb



From vincent_b_fox at yahoo.com Wed Dec 10 22:43:21 2003
From: vincent_b_fox at yahoo.com (Vincent Fox)
Date: Wed, 10 Dec 2003 22:43:21 -0800 (PST)
Subject: [Rocks-Discuss]ATLAS rpm build problems on PII platform
In-Reply-To: <1B097BEE-2ADC-11D8-9715-000A95C4E3B4@rocksclusters.org>
Message-ID: <20031211064321.41781.qmail@web14801.mail.yahoo.com>

Okay, here's the context diff as plain text. I test-applied it using "patch -p0 <
atlas.patch" and did a successful compile on my PII box. I can send it as an
attachment or submit it to CVS or some other way if you need:

*** atlas.spec.in.orig Thu Dec 11 06:27:13 2003
--- atlas.spec.in       Thu Dec 11 06:30:46 2003
***************
*** 111,117 ****
--- 111,133 ----
  y
  " | make
+ elif [ $CPUID -eq 4 ]
+ then
+ #
+ # Pentium II
+ #
+ echo "0
+ y
+ y
+ n
+ y
+ linux
+ 0
+ /usr/bin/g77
+ -O
+ y
+ " | make
  else
#
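For readers unfamiliar with the spec file's idiom: the `echo "0 / y / y / ..." | make` construct pipes a canned list of answers, one per line, into ATLAS's interactive configure prompts. A minimal stand-alone demonstration of the same technique (the `/tmp/prompted.sh` script is a stand-in for the ATLAS configure step, not part of the patch):

```shell
# Stand-in for an interactive configure: reads two answers from stdin,
# the same way ATLAS's build reads its CPU/setup prompts.
cat > /tmp/prompted.sh <<'EOF'
#!/bin/sh
read cpu      # answer to the "which CPU?" prompt
read express  # answer to the "express setup?" prompt
echo "cpu=$cpu express=$express"
EOF
chmod +x /tmp/prompted.sh

# Feed the canned answers in, exactly as the spec file does with make:
echo "0
y" | /tmp/prompted.sh
# prints: cpu=0 express=y
```

This is why the patch adds a new `elif [ $CPUID -eq 4 ]` branch: a different CPU needs a different canned answer sequence.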


Greg Bruno <bruno at rocksclusters.org> wrote:
> Okay, I came up with my own quick hack:
>
> Edit atlas.spec.in, go to "other x86" section, remove
> 2 lines right above "linux", seems to make rpm now.
>
> A more formal patch would be put in a section for
> cpuid eq 4 with this correction I suppose.

if you provide the patch, we'll include it in our next release.

- gb


-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031210/be5c8b04/attachment-0001.html

From naihh at imcb.a-star.edu.sg Thu Dec 11 00:08:14 2003
From: naihh at imcb.a-star.edu.sg (Nai Hong Hwa Francis)
Date: Thu, 11 Dec 2003 16:08:14 +0800
Subject: [Rocks-Discuss]RE: Have anyone successfully build a set of grid compute
nodes using Rocks?
Message-ID: <5E118EED7CC277468A275F11EEEC39B94CCDB9@EXIMCB2.imcb.a-star.edu.sg>



Hi,

Has anyone successfully built a set of grid compute nodes using Rocks 3?
Would anyone care to share?


Nai Hong Hwa Francis
Institute of Molecular and Cell Biology (A*STAR)
30 Medical Drive
Singapore 117609.
DID: (65) 6874-6196

-----Original Message-----
From: npaci-rocks-discussion-request at sdsc.edu
[mailto:npaci-rocks-discussion-request at sdsc.edu]
Sent: Thursday, December 11, 2003 11:54 AM
To: npaci-rocks-discussion at sdsc.edu
Subject: npaci-rocks-discussion digest, Vol 1 #642 - 4 msgs

Send npaci-rocks-discussion mailing list submissions to
      npaci-rocks-discussion at sdsc.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
or, via email, send a message with subject or body 'help' to
      npaci-rocks-discussion-request at sdsc.edu

You can reach the person managing the list at
      npaci-rocks-discussion-admin at sdsc.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of npaci-rocks-discussion digest..."


Today's Topics:

   1. RE: Do you have a list of the various models of Gigabit Ethernet
Interfaces compatible to Rocks 3? (Nai Hong Hwa Francis)
   2. Rocks 3.0.0 (Terrence Martin)
   3. Re: "TypeError: loop over non-sequence" when trying
       to build CD distro (V. Rowley)

--__--__--

Message: 1
Date: Thu, 11 Dec 2003 09:45:18 +0800
From: "Nai Hong Hwa Francis" <naihh at imcb.a-star.edu.sg>
To: <npaci-rocks-discussion at sdsc.edu>
Subject: [Rocks-Discuss]RE: Do you have a list of the various models of
Gigabit Ethernet Interfaces compatible to Rocks 3?



Hi All,

Do you have a list of the various gigabit Ethernet interfaces that are
compatible with Rocks 3?

I am changing my nodes' connectivity from 10/100 to 1000.

Has anyone done that, and what differences did you see in performance or
turnaround time?



Thanks and Regards

Nai Hong Hwa Francis
Institute of Molecular and Cell Biology (A*STAR)
30 Medical Drive
Singapore 117609.
DID: (65) 6874-6196

-----Original Message-----
From: npaci-rocks-discussion-request at sdsc.edu
[mailto:npaci-rocks-discussion-request at sdsc.edu]
Sent: Thursday, December 11, 2003 9:25 AM
To: npaci-rocks-discussion at sdsc.edu
Subject: npaci-rocks-discussion digest, Vol 1 #641 - 13 msgs

Send npaci-rocks-discussion mailing list submissions to
      npaci-rocks-discussion at sdsc.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
or, via email, send a message with subject or body 'help' to
      npaci-rocks-discussion-request at sdsc.edu

You can reach the person managing the list at
      npaci-rocks-discussion-admin at sdsc.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of npaci-rocks-discussion digest..."


Today's Topics:

   1. Non-homogenous legacy hardware (Chris Dwan (CCGB))
   2. Error during Make when building a new install floppy (Terrence
Martin)
   3. Re: Error during Make when building a new install floppy (Tim
Carlson)
   4. Re: Non-homogenous legacy hardware (Tim Carlson)
   5. ssh_known_hosts and ganglia (Jag)
   6. Re: ssh_known_hosts and ganglia (Mason J. Katz)
   7. "TypeError: loop over non-sequence" when trying to build CD
distro (V. Rowley)
   8. Re: one node short in "labels" (Greg Bruno)
   9. Re: "TypeError: loop over non-sequence" when trying to build CD
distro (Mason J. Katz)
  10. Re: "TypeError: loop over non-sequence" when trying
        to build CD distro (V. Rowley)
  11. Re: "TypeError: loop over non-sequence" when trying to
        build CD distro (Tim Carlson)

-- __--__--

Message: 1
Date: Wed, 10 Dec 2003 14:04:53 -0600 (CST)
From: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu>
To: npaci-rocks-discussion at sdsc.edu
Subject: [Rocks-Discuss]Non-homogenous legacy hardware


I am integrating legacy systems into a ROCKS cluster and have hit a
snag with the auto-partition configuration: the newly added (older) systems
have SCSI disks, while the existing (newer) ones contain IDE. This is a
non-issue so long as the initial install does its default partitioning.
However, I have a "replace-auto-partition.xml" file which is unworkable for
the SCSI-based systems, since it makes specific reference to "hda" rather
than "sda."

I would like to have a site-nodes/replace-auto-partition.xml file with a
conditional such that "hda" or "sda" is used, based on the name of the
node (or some other criterion).

Is this possible?

Thanks in advance. If this is out there in the mailing list archives, a
pointer would be greatly appreciated.

-Chris Dwan
 The University of Minnesota

-- __--__--

Message: 2
Date: Wed, 10 Dec 2003 12:09:11 -0800
From: Terrence Martin <tmartin at physics.ucsd.edu>
To: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu>
Subject: [Rocks-Discuss]Error during Make when building a new install
floppy

I get the following error when I try to rebuild a boot floppy for rocks.

This is with the default CVS checkout, updated today according to the
Rocks user guide. I have not actually attempted to make any changes.

make[3]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3/loader'
make[2]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3'
strip -o loader         anaconda-7.3/loader/loader
strip: anaconda-7.3/loader/loader: No such file or directory
make[1]: *** [loader] Error 1
make[1]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader'
make: *** [loader] Error 2

Of course, I could avoid all of this altogether and just put my binary
module into the appropriate location in the boot image.

Would it be correct to modify the following image file with my changes
and then write it to a floppy via dd?

/home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img

Basically I am injecting an updated e1000 driver with changes to
pcitable to support the address of my gigabit cards.
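As a rough, untested sketch of the dd step being asked about (scratch files stand in for the boot image and /dev/fd0 here, since writing a real floppy needs root and a drive; the loop-mount/driver-swap step is omitted):

```shell
# Hypothetical sketch: after modifying bootnet.img (e.g. loop-mounting it
# and swapping in an updated e1000 driver and pcitable), the image is
# written out raw with dd. A 1.44 MB scratch file plays the floppy device.
IMG=$(mktemp)
FLOPPY=$(mktemp)   # stand-in for /dev/fd0
dd if=/dev/zero of="$IMG" bs=1024 count=1440 2>/dev/null   # blank 1.44 MB image
dd if="$IMG" of="$FLOPPY" bs=1440k 2>/dev/null             # raw copy, as for a real floppy
cmp -s "$IMG" "$FLOPPY" && echo "write verified"
```

With a real floppy the last two commands would be `dd if=bootnet.img of=/dev/fd0 bs=1440k`; as Tim notes below, though, the floppy path is not supported in Rocks 3.0.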

Terrence


-- __--__--

Message: 3
Date: Wed, 10 Dec 2003 12:40:41 -0800 (PST)
From: Tim Carlson <tim.carlson at pnl.gov>
Subject: Re: [Rocks-Discuss]Error during Make when building a new
install floppy
To: Terrence Martin <tmartin at physics.ucsd.edu>
Cc: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu>
Reply-to: Tim Carlson <tim.carlson at pnl.gov>
On Wed, 10 Dec 2003, Terrence Martin wrote:

> I get the following error when I try to rebuild a boot floppy for
rocks.
>

You can't make a boot floppy with Rocks 3.0. That isn't supported, or at
least it wasn't the last time I checked.

> Of course I could avoid all of this together and just put my binary
> module into the appropriate location in the boot image.
>
> Would it be correct to modify the following image file with my changes
> and then write it to a floppy via dd?
>
>
> /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img
>
> Basically I am injecting an updated e1000 driver with changes to
> pcitable to support the address of my gigabit cards.

Modifying the bootnet.img is about a third of what you need to do if you
go down that path. You also need to work on netstg1.img, and you'll need to
update the driver in the kernel RPM that gets installed on the box. None of
this is trivial.

If it were me, I would go down the same path I took for updating the
AIC79XX driver

https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/003533.html

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support


-- __--__--

Message: 4
Date: Wed, 10 Dec 2003 12:52:38 -0800 (PST)
From: Tim Carlson <tim.carlson at pnl.gov>
Subject: Re: [Rocks-Discuss]Non-homogenous legacy hardware
To: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu>
Cc: npaci-rocks-discussion at sdsc.edu
Reply-to: Tim Carlson <tim.carlson at pnl.gov>

On Wed, 10 Dec 2003, Chris Dwan (CCGB) wrote:

>
> I am integrating legacy systems into a ROCKS cluster, and have hit a
> snag with the auto-partition configuration: The new (old) systems
have
> SCSI disks, while old (new) ones contain IDE. This is a non-issue so
> long as the initial install does its default partitioning. However, I
> have a "replace-auto-partition.xml" file which is unworkable for the
SCSI
> based systems since it makes specific reference to "hda" rather than
> "sda."

If you have just a single drive, then you should be able to skip the
"--ondisk" bits of your "part" command.

Otherwise, you would first have to do something ugly like the following:

http://penguin.epfl.ch/slides/kickstart/ks.cfg

You could probably (maybe) wrap most of that in an
<eval sh="bash">
</eval>

block in the <main> block.

Just guessing... I haven't tried this.
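An untested sketch of what such a replace-auto-partition.xml might look like (the `<eval sh="bash">` and `<main>` elements come from the message above; everything else, including whether eval output is interpreted as kickstart "part" lines, is an assumption, not verified Rocks syntax):

```xml
<main>
  <eval sh="bash">
  # Assumption: pick the disk by probing which device node exists --
  # sda for the SCSI machines, hda for the IDE ones.
  if [ -b /dev/sda ]; then disk=sda; else disk=hda; fi
  echo "part /    --size 4096 --ondisk $disk"
  echo "part swap --size 1024 --ondisk $disk"
  </eval>
</main>
```

As Tim says, with a single drive per node, simply dropping `--ondisk` entirely may be the cleaner fix.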

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support


-- __--__--

Message: 5
From: Jag <agrajag at dragaera.net>
To: npaci-rocks-discussion at sdsc.edu
Date: Wed, 10 Dec 2003 13:21:07 -0500
Subject: [Rocks-Discuss]ssh_known_hosts and ganglia

I noticed a previous post on this list
(https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001934.html)
indicating that Rocks distributes ssh keys for all the nodes over
ganglia. Can anyone enlighten me as to how this is done?

I looked through the ganglia docs and didn't see anything indicating how
to do this, so I'm assuming Rocks made some changes. Unfortunately the
Rocks ISO images don't seem to contain SRPMs, so I'm now coming here.
What did Rocks do to ganglia to make the distribution of ssh keys work?

Also, does anyone know where Rocks SRPMs can be found? I've done quite
a bit of searching, but haven't found them anywhere.


-- __--__--

Message: 6
Cc: npaci-rocks-discussion at sdsc.edu
From: "Mason J. Katz" <mjk at sdsc.edu>
Subject: Re: [Rocks-Discuss]ssh_known_hosts and ganglia
Date: Wed, 10 Dec 2003 14:39:15 -0800
To: Jag <agrajag at dragaera.net>
Most of the SRPMS are on our FTP site, but we've screwed this up
before. The SRPMS are entirely Rocks-specific, so they are of little
value outside of Rocks. You can also check out our CVS tree
(cvs.rocksclusters.org), where rocks/src/ganglia shows what we add. We
have a ganglia-python package we created to allow us to write our own
metrics at a higher level than the provided gmetric application. We've
also moved from this method to a single cluster-wide ssh key for Rocks
3.1.

       -mjk

On Dec 10, 2003, at 10:21 AM, Jag wrote:

> I noticed a previous post on this list
> (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001934.html)
> indicating that Rocks distributes ssh keys for all the nodes over
> ganglia. Can anyone enlighten me as to how this is done?
>
> I looked through the ganglia docs and didn't see anything indicating how
> to do this, so I'm assuming Rocks made some changes. Unfortunately the
> Rocks ISO images don't seem to contain SRPMs, so I'm now coming here.
> What did Rocks do to ganglia to make the distribution of ssh keys work?
>
> Also, does anyone know where Rocks SRPMs can be found? I've done quite
> a bit of searching, but haven't found them anywhere.


-- __--__--

Message: 7
Date: Wed, 10 Dec 2003 14:43:49 -0800
From: "V. Rowley" <vrowley at ucsd.edu>
To: npaci-rocks-discussion at sdsc.edu
Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying
to build CD distro

When I run this:

[root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; rocks-dist
--dist=cdrom cdrom

on a server installed with ROCKS 3.0.0, I eventually get this:

>   Cleaning distribution
>   Resolving versions (RPMs)
>   Resolving versions (SRPMs)
>   Adding support for rebuild distribution from source
> Creating files (symbolic links - fast)
> Creating symlinks to kickstart files
> Fixing Comps Database
> Generating hdlist (rpm database)
> Patching second stage loader (eKV, partioning, ...)
>     patching "rocks-ekv" into distribution ...
>     patching "rocks-piece-pipe" into distribution ...
>     patching "PyXML" into distribution ...
>     patching "expat" into distribution ...
>     patching "rocks-pylib" into distribution ...
>     patching "MySQL-python" into distribution ...
>     patching "rocks-kickstart" into distribution ...
>     patching "rocks-kickstart-profiles" into distribution ...
>     patching "rocks-kickstart-dtds" into distribution ...
>     building CRAM filesystem ...
> Cleaning distribution
> Resolving versions (RPMs)
> Resolving versions (SRPMs)
> Creating symlinks to kickstart files
> Generating hdlist (rpm database)
> Segregating RPMs (rocks, non-rocks)
> sh: ./kickstart.cgi: No such file or directory
> sh: ./kickstart.cgi: No such file or directory
> Traceback (innermost last):
>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
>     app.run()
>   File "/opt/rocks/bin/rocks-dist", line 623, in run
>     eval('self.command_%s()' % (command))
>   File "<string>", line 0, in ?
>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>     builder.build()
>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>     (rocks, nonrocks) = self.segregateRPMS()
>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in segregateRPMS
>     for pkg in ks.getSection('packages'):
> TypeError: loop over non-sequence

Any ideas?

--
Vicky Rowley                              email: vrowley at ucsd.edu
Biomedical Informatics Research Network      work: (858) 536-5980
University of California, San Diego           fax: (858) 822-0828
9500 Gilman Drive
La Jolla, CA 92093-0715


See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb


-- __--__--

Message: 8
Cc: rocks <npaci-rocks-discussion at sdsc.edu>
From: Greg Bruno <bruno at rocksclusters.org>
Subject: Re: [Rocks-Discuss]one node short in "labels"
Date: Wed, 10 Dec 2003 15:12:49 -0800
To: Vincent Fox <vincent_b_fox at yahoo.com>

> So I go to the "labels" selection on the web page to print out the
> pretty labels. What a nice idea by the way!
>
> EXCEPT....it's one node short! I go up to 0-13 and this stops at
> 0-12. Any ideas where I should check to fix this?

yeah, we found this corner case -- it'll be fixed in the next release.

thanks for the bug report.

  - gb


-- __--__--

Message: 9
Cc: npaci-rocks-discussion at sdsc.edu
From: "Mason J. Katz" <mjk at sdsc.edu>
Subject: Re: [Rocks-Discuss]"TypeError:    loop over non-sequence" when
trying to build CD distro
Date: Wed, 10 Dec 2003 15:16:27 -0800
To: "V. Rowley" <vrowley at ucsd.edu>

It looks like someone moved the profiles directory to profiles.orig.

     -mjk


[root at rocks14 install]# ls -l
total 56
drwxr-sr-x    3 root     wheel        4096 Dec 10 21:16 cdrom
drwxrwsr-x    5 root     wheel        4096 Dec 10 20:38 contrib.orig
drwxr-sr-x    3 root     wheel        4096 Dec 10 21:07 ftp.rocksclusters.org
drwxr-sr-x    3 root     wheel        4096 Dec 10 20:38 ftp.rocksclusters.org.orig
-r-xrwsr-x    1 root     wheel       19254 Sep  3 12:40 kickstart.cgi
drwxr-xr-x    3 root     root         4096 Dec 10 20:38 profiles.orig
drwxr-sr-x    3 root     wheel        4096 Dec 10 21:15 rocks-dist
drwxrwsr-x    3 root     wheel        4096 Dec 10 20:38 rocks-dist.orig
drwxr-sr-x    3 root     wheel        4096 Dec 10 21:02 src
drwxr-sr-x    4 root     wheel        4096 Dec 10 20:49 src.foo
On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:

> When I run this:
>
> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
> rocks-dist --dist=cdrom cdrom
>
> on a server installed with ROCKS 3.0.0, I eventually get this:
>
>> Cleaning distribution
>> Resolving versions (RPMs)
>> Resolving versions (SRPMs)
>> Adding support for rebuild distribution from source
>> Creating files (symbolic links - fast)
>> Creating symlinks to kickstart files
>> Fixing Comps Database
>> Generating hdlist (rpm database)
>> Patching second stage loader (eKV, partioning, ...)
>>      patching "rocks-ekv" into distribution ...
>>      patching "rocks-piece-pipe" into distribution ...
>>      patching "PyXML" into distribution ...
>>      patching "expat" into distribution ...
>>      patching "rocks-pylib" into distribution ...
>>      patching "MySQL-python" into distribution ...
>>      patching "rocks-kickstart" into distribution ...
>>      patching "rocks-kickstart-profiles" into distribution ...
>>      patching "rocks-kickstart-dtds" into distribution ...
>>      building CRAM filesystem ...
>> Cleaning distribution
>> Resolving versions (RPMs)
>> Resolving versions (SRPMs)
>> Creating symlinks to kickstart files
>> Generating hdlist (rpm database)
>> Segregating RPMs (rocks, non-rocks)
>> sh: ./kickstart.cgi: No such file or directory
>> sh: ./kickstart.cgi: No such file or directory
>> Traceback (innermost last):
>>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
>>      app.run()
>>   File "/opt/rocks/bin/rocks-dist", line 623, in run
>>      eval('self.command_%s()' % (command))
>>   File "<string>", line 0, in ?
>>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>      builder.build()
>>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>      (rocks, nonrocks) = self.segregateRPMS()
>>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>> segregateRPMS
>>      for pkg in ks.getSection('packages'):
>> TypeError: loop over non-sequence
>
> Any ideas?
>
> --
> Vicky Rowley                              email: vrowley at ucsd.edu
> Biomedical Informatics Research Network      work: (858) 536-5980
> University of California, San Diego           fax: (858) 822-0828
> 9500 Gilman Drive
> La Jolla, CA 92093-0715
>
>
> See pictures from our trip to China at
> http://www.sagacitech.com/Chinaweb


-- __--__--

Message: 10
Date: Wed, 10 Dec 2003 16:50:16 -0800
From: "V. Rowley" <vrowley at ucsd.edu>
To: "Mason J. Katz" <mjk at sdsc.edu>
CC: npaci-rocks-discussion at sdsc.edu
Subject: Re: [Rocks-Discuss]"TypeError:   loop over non-sequence" when
trying
 to build CD distro

Yep, I did that, but only *AFTER* getting the error. [Thought it was
generated by the rocks-dist sequence, but apparently not.] Go ahead.
Move it back. Same difference.

Vicky

Mason J. Katz wrote:
> It looks like someone moved the profiles directory to profiles.orig.
>
>     -mjk
>
>
> [root at rocks14 install]# ls -l
> total 56
> drwxr-sr-x    3 root     wheel         4096 Dec 10 21:16 cdrom
> drwxrwsr-x    5 root     wheel         4096 Dec 10 20:38 contrib.orig
> drwxr-sr-x    3 root     wheel         4096 Dec 10 21:07 ftp.rocksclusters.org
> drwxr-sr-x    3 root     wheel         4096 Dec 10 20:38 ftp.rocksclusters.org.orig
> -r-xrwsr-x    1 root     wheel       19254 Sep  3 12:40 kickstart.cgi
> drwxr-xr-x    3 root     root          4096 Dec 10 20:38 profiles.orig
> drwxr-sr-x    3 root     wheel         4096 Dec 10 21:15 rocks-dist
> drwxrwsr-x    3 root     wheel         4096 Dec 10 20:38 rocks-dist.orig
> drwxr-sr-x    3 root     wheel         4096 Dec 10 21:02 src
> drwxr-sr-x    4 root     wheel         4096 Dec 10 20:49 src.foo
> On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:
>
>> When I run this:
>>
>> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
>> rocks-dist --dist=cdrom cdrom
>>
>> on a server installed with ROCKS 3.0.0, I eventually get this:
>>
>>> Cleaning distribution
>>> Resolving versions (RPMs)
>>> Resolving versions (SRPMs)
>>> Adding support for rebuild distribution from source
>>> Creating files (symbolic links - fast)
>>> Creating symlinks to kickstart files
>>> Fixing Comps Database
>>> Generating hdlist (rpm database)
>>> Patching second stage loader (eKV, partioning, ...)
>>>     patching "rocks-ekv" into distribution ...
>>>     patching "rocks-piece-pipe" into distribution ...
>>>     patching "PyXML" into distribution ...
>>>     patching "expat" into distribution ...
>>>     patching "rocks-pylib" into distribution ...
>>>     patching "MySQL-python" into distribution ...
>>>     patching "rocks-kickstart" into distribution ...
>>>     patching "rocks-kickstart-profiles" into distribution ...
>>>     patching "rocks-kickstart-dtds" into distribution ...
>>>     building CRAM filesystem ...
>>> Cleaning distribution
>>> Resolving versions (RPMs)
>>> Resolving versions (SRPMs)
>>> Creating symlinks to kickstart files
>>> Generating hdlist (rpm database)
>>> Segregating RPMs (rocks, non-rocks)
>>> sh: ./kickstart.cgi: No such file or directory
>>> sh: ./kickstart.cgi: No such file or directory
>>> Traceback (innermost last):
>>>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
>>>      app.run()
>>>   File "/opt/rocks/bin/rocks-dist", line 623, in run
>>>      eval('self.command_%s()' % (command))
>>>   File "<string>", line 0, in ?
>>>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>>      builder.build()
>>>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>>      (rocks, nonrocks) = self.segregateRPMS()
>>>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>>> segregateRPMS
>>>      for pkg in ks.getSection('packages'):
>>> TypeError: loop over non-sequence
>>
>>
>> Any ideas?
>>
>> --
>> Vicky Rowley                              email: vrowley at ucsd.edu
>> Biomedical Informatics Research Network      work: (858) 536-5980
>> University of California, San Diego           fax: (858) 822-0828
>> 9500 Gilman Drive
>> La Jolla, CA 92093-0715
>>
>>
>> See pictures from our trip to China at
>> http://www.sagacitech.com/Chinaweb
>
>
>

--
Vicky Rowley                              email: vrowley at ucsd.edu
Biomedical Informatics Research Network      work: (858) 536-5980
University of California, San Diego           fax: (858) 822-0828
9500 Gilman Drive
La Jolla, CA 92093-0715


See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb


-- __--__--

Message: 11
Date: Wed, 10 Dec 2003 17:23:25 -0800 (PST)
From: Tim Carlson <tim.carlson at pnl.gov>
Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when
trying to
 build CD distro
To: "V. Rowley" <vrowley at ucsd.edu>
Cc: "Mason J. Katz" <mjk at sdsc.edu>, npaci-rocks-discussion at sdsc.edu
Reply-to: Tim Carlson <tim.carlson at pnl.gov>

On Wed, 10 Dec 2003, V. Rowley wrote:

Did you remove python by chance? kickstart.cgi calls python directly as
/usr/bin/python, while rocks-dist does an "env python".

Tim
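
As a side note, a minimal, self-contained illustration (not Rocks code; the `get_section` helper is hypothetical) of how this traceback can arise: if kickstart.cgi never runs, the 'packages' section presumably comes back as None, and iterating over None is what Python 1.5 reports as "loop over non-sequence" (modern Pythons say "'NoneType' object is not iterable"):

```python
# Stand-in for something like ks.getSection(): returns None when the
# section is missing, e.g. because kickstart.cgi produced no output.
def get_section(sections, name):
    return sections.get(name)  # None if absent

sections = {}  # nothing parsed: kickstart.cgi was never found
try:
    for pkg in get_section(sections, "packages"):
        print(pkg)
except TypeError as err:
    # Python 1.5 wording was "loop over non-sequence"
    print("TypeError:", err)

# One defensive pattern: treat a missing section as an empty list.
for pkg in get_section(sections, "packages") or []:
    print(pkg)  # loop body never runs here, but no crash
```

This is consistent with the "sh: ./kickstart.cgi: No such file or directory" lines appearing immediately before the traceback in the output above.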

> Yep, I did that, but only *AFTER* getting the error. [Thought it was
> generated by the rocks-dist sequence, but apparently not.] Go ahead.
> Move it back. Same difference.
>
> Vicky
>
> Mason J. Katz wrote:
> > It looks like someone moved the profiles directory to profiles.orig.
> >
> >     -mjk
> >
> >
> > [root at rocks14 install]# ls -l
> > total 56
> > drwxr-sr-x    3 root     wheel         4096 Dec 10 21:16 cdrom
> > drwxrwsr-x    5 root     wheel         4096 Dec 10 20:38 contrib.orig
> > drwxr-sr-x    3 root     wheel         4096 Dec 10 21:07 ftp.rocksclusters.org
> > drwxr-sr-x    3 root     wheel         4096 Dec 10 20:38 ftp.rocksclusters.org.orig
> > -r-xrwsr-x    1 root     wheel       19254 Sep  3 12:40 kickstart.cgi
> > drwxr-xr-x    3 root     root          4096 Dec 10 20:38 profiles.orig
> > drwxr-sr-x    3 root     wheel         4096 Dec 10 21:15 rocks-dist
> > drwxrwsr-x    3 root     wheel         4096 Dec 10 20:38 rocks-dist.orig
> > drwxr-sr-x    3 root     wheel         4096 Dec 10 21:02 src
> > drwxr-sr-x    4 root     wheel         4096 Dec 10 20:49 src.foo
> > On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:
> >
> >> When I run this:
> >>
> >> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
> >> rocks-dist --dist=cdrom cdrom
> >>
> >> on a server installed with ROCKS 3.0.0, I eventually get this:
> >>
> >>> Cleaning distribution
> >>> Resolving versions (RPMs)
> >>> Resolving versions (SRPMs)
> >>> Adding support for rebuild distribution from source
> >>> Creating files (symbolic links - fast)
> >>> Creating symlinks to kickstart files
> >>> Fixing Comps Database
> >>> Generating hdlist (rpm database)
> >>> Patching second stage loader (eKV, partioning, ...)
> >>>     patching "rocks-ekv" into distribution ...
> >>>     patching "rocks-piece-pipe" into distribution ...
> >>>     patching "PyXML" into distribution ...
> >>>     patching "expat" into distribution ...
> >>>     patching "rocks-pylib" into distribution ...
> >>>     patching "MySQL-python" into distribution ...
> >>>     patching "rocks-kickstart" into distribution ...
> >>>     patching "rocks-kickstart-profiles" into distribution ...
> >>>     patching "rocks-kickstart-dtds" into distribution ...
> >>>     building CRAM filesystem ...
> >>> Cleaning distribution
> >>> Resolving versions (RPMs)
> >>> Resolving versions (SRPMs)
> >>> Creating symlinks to kickstart files
> >>> Generating hdlist (rpm database)
> >>> Segregating RPMs (rocks, non-rocks)
> >>> sh: ./kickstart.cgi: No such file or directory
> >>> sh: ./kickstart.cgi: No such file or directory
> >>> Traceback (innermost last):
> >>>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
> >>>     app.run()
> >>>   File "/opt/rocks/bin/rocks-dist", line 623, in run
> >>>     eval('self.command_%s()' % (command))
> >>>   File "<string>", line 0, in ?
> >>>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
> >>>     builder.build()
> >>>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
> >>>     (rocks, nonrocks) = self.segregateRPMS()
> >>>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
> >>> segregateRPMS
> >>>     for pkg in ks.getSection('packages'):
> >>> TypeError: loop over non-sequence
> >>
> >>
> >> Any ideas?
> >>
> >> --
> >> Vicky Rowley                              email: vrowley at ucsd.edu
> >> Biomedical Informatics Research Network      work: (858) 536-5980
> >> University of California, San Diego           fax: (858) 822-0828
> >> 9500 Gilman Drive
> >> La Jolla, CA 92093-0715
> >>
> >>
> >> See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb
> >
> >
> >
>
> --
> Vicky Rowley                              email: vrowley at ucsd.edu
> Biomedical Informatics Research Network      work: (858) 536-5980
> University of California, San Diego           fax: (858) 822-0828
> 9500 Gilman Drive
> La Jolla, CA 92093-0715
>
>
> See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb
>
>




-- __--__--

_______________________________________________
npaci-rocks-discussion mailing list
npaci-rocks-discussion at sdsc.edu
http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion


End of npaci-rocks-discussion Digest


DISCLAIMER:
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its contents to any
other person, as it may be an offence under the Official Secrets Act.
Thank you.

--__--__--

Message: 2
Date: Wed, 10 Dec 2003 18:03:41 -0800
From: Terrence Martin <tmartin at physics.ucsd.edu>
To: npaci-rocks-discussion at sdsc.edu
Subject: [Rocks-Discuss]Rocks 3.0.0

I am having a problem installing Rocks 3.0.0 on my new cluster.

A Python error occurs right after anaconda starts and just before the
install asks for the roll CDROM.

The error refers to an inability to find or load rocks.file, and I
think it is associated with the window that pops up and asks you to
put the roll CDROM in.

The process I followed to get to this point is

Put the Rocks 3.0.0 CDROM into the CDROM drive
Boot the system
At the prompt type frontend
Wait till anaconda starts
Error referring to unable to load rocks.file.

I have successfully installed rocks on a smaller cluster but that has
different hardware. I used the same CDROM for both installs.

Any thoughts?

Terrence



--__--__--
Message: 3
Date: Wed, 10 Dec 2003 19:52:49 -0800
From: "V. Rowley" <vrowley at ucsd.edu>
To: npaci-rocks-discussion at sdsc.edu
Subject: Re: [Rocks-Discuss]"TypeError:   loop over non-sequence" when
trying
 to build CD distro

Looks like python is okay:

> [root at rocks14 birn-oracle1]# which python
> /usr/bin/python
> [root at rocks14 birn-oracle1]# python --help
> Unknown option: --
> usage: python [option] ... [-c cmd | file | -] [arg] ...
> Options and arguments (and corresponding environment variables):
> -d     : debug output from parser (also PYTHONDEBUG=x)
> -i     : inspect interactively after running script, (also
PYTHONINSPECT=x)
>          and force prompts, even if stdin does not appear to be a
terminal
> -O     : optimize generated bytecode (a tad; also PYTHONOPTIMIZE=x)
> -OO    : remove doc-strings in addition to the -O optimizations
> -S     : don't imply 'import site' on initialization
> -t     : issue warnings about inconsistent tab usage (-tt: issue
errors)
> -u     : unbuffered binary stdout and stderr (also PYTHONUNBUFFERED=x)
> -v     : verbose (trace import statements) (also PYTHONVERBOSE=x)
> -x     : skip first line of source, allowing use of non-Unix forms of
#!cmd
> -X     : disable class based built-in exceptions
> -c cmd : program passed in as string (terminates option list)
> file   : program read from script file
> -      : program read from stdin (default; interactive mode if a tty)
> arg ...: arguments passed to program in sys.argv[1:]
> Other environment variables:
> PYTHONSTARTUP: file executed on interactive startup (no default)
> PYTHONPATH   : ':'-separated list of directories prefixed to the
>                 default module search path. The result is sys.path.
> PYTHONHOME   : alternate <prefix> directory (or
<prefix>:<exec_prefix>).
>                 The default module search path uses <prefix>/python1.5.
> [root at rocks14 birn-oracle1]#
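Tim's point (quoted below) is that kickstart.cgi hardcodes /usr/bin/python while rocks-dist resolves the interpreter via `env python`, so the two can disagree when PATH puts a different python first. A small sketch of that `env`-style lookup (the helper name is illustrative, not from Rocks):

```python
# Sketch: mimic how "#!/usr/bin/env python" finds an interpreter, to see
# why it can differ from a hardcoded /usr/bin/python. The helper is
# illustrative, not part of the Rocks sources.
import os

def first_python_on_path(path):
    """Return the first executable named 'python' on a PATH-style string,
    the way /usr/bin/env would, or None if there is none."""
    for d in path.split(os.pathsep):
        candidate = os.path.join(d, "python")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

if __name__ == "__main__":
    # kickstart.cgi always gets /usr/bin/python; rocks-dist gets this:
    print(first_python_on_path(os.environ.get("PATH", "")))
```

If the two resolve to different installs, one of them can be a broken or older python even when `which python` looks fine.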



Tim Carlson wrote:
> On Wed, 10 Dec 2003, V. Rowley wrote:
>
> Did you remove python by chance? kickstart.cgi calls python directly
in
> /usr/bin/python while rocks-dist does an "env python"
>
> Tim
>
>
>>Yep, I did that, but only *AFTER* getting the error. [Thought it was
>>generated by the rocks-dist sequence, but apparently not.] Go ahead.
>>Move it back. Same difference.
>>
>>Vicky
>>
>>Mason J. Katz wrote:
>>
>>>It looks like someone moved the profiles directory to profiles.orig.
>>>
>>>     -mjk
>>>
>>>
>>>[root at rocks14 install]# ls -l
>>>total 56
>>>drwxr-sr-x     3 root     wheel        4096 Dec 10 21:16 cdrom
>>>drwxrwsr-x     5 root     wheel        4096 Dec 10 20:38 contrib.orig
>>>drwxr-sr-x     3 root     wheel        4096 Dec 10 21:07
>>>ftp.rocksclusters.org
>>>drwxr-sr-x     3 root     wheel        4096 Dec 10 20:38
>>>ftp.rocksclusters.org.orig
>>>-r-xrwsr-x     1 root     wheel       19254 Sep 3 12:40 kickstart.cgi
>>>drwxr-xr-x     3 root     root         4096 Dec 10 20:38 profiles.orig
>>>drwxr-sr-x     3 root     wheel        4096 Dec 10 21:15 rocks-dist
>>>drwxrwsr-x     3 root     wheel        4096 Dec 10 20:38 rocks-dist.orig
>>>drwxr-sr-x     3 root     wheel        4096 Dec 10 21:02 src
>>>drwxr-sr-x     4 root     wheel        4096 Dec 10 20:49 src.foo
>>>On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:
>>>
>>>
>>>>When I run this:
>>>>
>>>>[root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
>>>>rocks-dist --dist=cdrom cdrom
>>>>
>>>>on a server installed with ROCKS 3.0.0, I eventually get this:
>>>>
>>>>
>>>>>Cleaning distribution
>>>>>Resolving versions (RPMs)
>>>>>Resolving versions (SRPMs)
>>>>>Adding support for rebuild distribution from source
>>>>>Creating files (symbolic links - fast)
>>>>>Creating symlinks to kickstart files
>>>>>Fixing Comps Database
>>>>>Generating hdlist (rpm database)
>>>>>Patching second stage loader (eKV, partioning, ...)
>>>>>     patching "rocks-ekv" into distribution ...
>>>>>     patching "rocks-piece-pipe" into distribution ...
>>>>>     patching "PyXML" into distribution ...
>>>>>     patching "expat" into distribution ...
>>>>>     patching "rocks-pylib" into distribution ...
>>>>>     patching "MySQL-python" into distribution ...
>>>>>     patching "rocks-kickstart" into distribution ...
>>>>>     patching "rocks-kickstart-profiles" into distribution ...
>>>>>     patching "rocks-kickstart-dtds" into distribution ...
>>>>>     building CRAM filesystem ...
>>>>>Cleaning distribution
>>>>>Resolving versions (RPMs)
>>>>>Resolving versions (SRPMs)
>>>>>Creating symlinks to kickstart files
>>>>>Generating hdlist (rpm database)
>>>>>Segregating RPMs (rocks, non-rocks)
>>>>>sh: ./kickstart.cgi: No such file or directory
>>>>>sh: ./kickstart.cgi: No such file or directory
>>>>>Traceback (innermost last):
>>>>> File "/opt/rocks/bin/rocks-dist", line 807, in ?
>>>>>    app.run()
>>>>> File "/opt/rocks/bin/rocks-dist", line 623, in run
>>>>>    eval('self.command_%s()' % (command))
>>>>> File "<string>", line 0, in ?
>>>>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>>>>    builder.build()
>>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>>>>    (rocks, nonrocks) = self.segregateRPMS()
>>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>>>>>segregateRPMS
>>>>>    for pkg in ks.getSection('packages'):
>>>>>TypeError: loop over non-sequence
>>>>
>>>>
>>>>Any ideas?
>>>>
>>>>--
>>>>Vicky Rowley                              email: vrowley at ucsd.edu
>>>>Biomedical Informatics Research Network      work: (858) 536-5980
>>>>University of California, San Diego           fax: (858) 822-0828
>>>>9500 Gilman Drive
>>>>La Jolla, CA 92093-0715
>>>>
>>>>
>>>>See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb
>>>
>>>
>>>
>>--
>>Vicky Rowley                              email: vrowley at ucsd.edu
>>Biomedical Informatics Research Network      work: (858) 536-5980
>>University of California, San Diego           fax: (858) 822-0828
>>9500 Gilman Drive
>>La Jolla, CA 92093-0715
>>
>>
>>See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb
>>
>>
>
>
>
>

--
Vicky Rowley                              email: vrowley at ucsd.edu
Biomedical Informatics Research Network      work: (858) 536-5980
University of California, San Diego           fax: (858) 822-0828
9500 Gilman Drive
La Jolla, CA 92093-0715
See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb



--__--__--

_______________________________________________
npaci-rocks-discussion mailing list
npaci-rocks-discussion at sdsc.edu
http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion


End of npaci-rocks-discussion Digest




From naihh at imcb.a-star.edu.sg Thu Dec 11 00:09:34 2003
From: naihh at imcb.a-star.edu.sg (Nai Hong Hwa Francis)
Date: Thu, 11 Dec 2003 16:09:34 +0800
Subject: [Rocks-Discuss]RE: Install rocks on Titan64 Superblade Classic with Dual
Opteron 244
Message-ID: <5E118EED7CC277468A275F11EEEC39B94CCDBA@EXIMCB2.imcb.a-star.edu.sg>


Hi,

Has anyone successfully installed Rocks on a Titan64 Superblade
Classic with Dual Opteron 244?


Nai Hong Hwa Francis
Institute of Molecular and Cell Biology (A*STAR)
30 Medical Drive
Singapore 117609.
DID: (65) 6874-6196

-----Original Message-----
From: npaci-rocks-discussion-request at sdsc.edu
[mailto:npaci-rocks-discussion-request at sdsc.edu]
Sent: Thursday, December 11, 2003 11:54 AM
To: npaci-rocks-discussion at sdsc.edu
Subject: npaci-rocks-discussion digest, Vol 1 #642 - 4 msgs

Send npaci-rocks-discussion mailing list submissions to
      npaci-rocks-discussion at sdsc.edu

To subscribe or unsubscribe via the World Wide Web, visit

http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
or, via email, send a message with subject or body 'help' to
npaci-rocks-discussion-request at sdsc.edu

You can reach the person managing the list at
      npaci-rocks-discussion-admin at sdsc.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of npaci-rocks-discussion digest..."


Today's Topics:

   1. RE: Do you have a list of the various models of Gigabit Ethernet
Interfaces compatible to Rocks 3? (Nai Hong Hwa Francis)
   2. Rocks 3.0.0 (Terrence Martin)
   3. Re: "TypeError: loop over non-sequence" when trying
       to build CD distro (V. Rowley)

--__--__--

Message: 1
Date: Thu, 11 Dec 2003 09:45:18 +0800
From: "Nai Hong Hwa Francis" <naihh at imcb.a-star.edu.sg>
To: <npaci-rocks-discussion at sdsc.edu>
Subject: [Rocks-Discuss]RE: Do you have a list of the various models of
Gigabit Ethernet Interfaces compatible to Rocks 3?



Hi All,

Do you have a list of the various gigabit Ethernet interfaces that are
compatible with Rocks 3?

I am changing my nodes' connectivity from 10/100 to 1000.

Has anyone done that, and what differences in performance or
turnaround time did you see?

Has anyone successfully built a set of grid compute nodes using Rocks
3?


Thanks and Regards

Nai Hong Hwa Francis
Institute of Molecular and Cell Biology (A*STAR)
30 Medical Drive
Singapore 117609.
DID: (65) 6874-6196

-----Original Message-----
From: npaci-rocks-discussion-request at sdsc.edu
[mailto:npaci-rocks-discussion-request at sdsc.edu]
Sent: Thursday, December 11, 2003 9:25 AM
To: npaci-rocks-discussion at sdsc.edu
Subject: npaci-rocks-discussion digest, Vol 1 #641 - 13 msgs

Send npaci-rocks-discussion mailing list submissions to
      npaci-rocks-discussion at sdsc.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
or, via email, send a message with subject or body 'help' to
      npaci-rocks-discussion-request at sdsc.edu

You can reach the person managing the list at
      npaci-rocks-discussion-admin at sdsc.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of npaci-rocks-discussion digest..."


Today's Topics:

   1. Non-homogenous legacy hardware (Chris Dwan (CCGB))
   2. Error during Make when building a new install floppy (Terrence
Martin)
   3. Re: Error during Make when building a new install floppy (Tim
Carlson)
   4. Re: Non-homogenous legacy hardware (Tim Carlson)
   5. ssh_known_hosts and ganglia (Jag)
   6. Re: ssh_known_hosts and ganglia (Mason J. Katz)
   7. "TypeError: loop over non-sequence" when trying to build CD
distro (V. Rowley)
   8. Re: one node short in "labels" (Greg Bruno)
   9. Re: "TypeError: loop over non-sequence" when trying to build CD
distro (Mason J. Katz)
  10. Re: "TypeError: loop over non-sequence" when trying
        to build CD distro (V. Rowley)
  11. Re: "TypeError: loop over non-sequence" when trying to
        build CD distro (Tim Carlson)

-- __--__--

Message: 1
Date: Wed, 10 Dec 2003 14:04:53 -0600 (CST)
From: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu>
To: npaci-rocks-discussion at sdsc.edu
Subject: [Rocks-Discuss]Non-homogenous legacy hardware


I am integrating legacy systems into a ROCKS cluster, and have hit a
snag with the auto-partition configuration: the newly added (older)
systems have SCSI disks, while the existing (newer) ones contain IDE.
This is a non-issue so long as the initial install does its default
partitioning. However, I have a "replace-auto-partition.xml" file
which is unworkable for the SCSI-based systems, since it makes
specific reference to "hda" rather than "sda."

I would like to have a site-nodes/replace-auto-partition.xml file with
a conditional such that "hda" or "sda" is used, based on the name of
the node (or some other criterion).

Is this possible?

Thanks, in advance. If this is out there on the mailing list archives,
a pointer would be greatly appreciated.

-Chris Dwan
 The University of Minnesota

-- __--__--

Message: 2
Date: Wed, 10 Dec 2003 12:09:11 -0800
From: Terrence Martin <tmartin at physics.ucsd.edu>
To: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu>
Subject: [Rocks-Discuss]Error during Make when building a new install
floppy

I get the following error when I try to rebuild a boot floppy for rocks.

This is with the default CVS checkout with an update today according to
the rocks userguide. I have not actually attempted to make any changes.

make[3]: Leaving directory
`/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3/loader'
make[2]: Leaving directory
`/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3'
strip -o loader         anaconda-7.3/loader/loader
strip: anaconda-7.3/loader/loader: No such file or directory
make[1]: *** [loader] Error 1
make[1]: Leaving directory
`/home/install/rocks/src/rocks/boot/7.3/loader'
make: *** [loader] Error 2

Of course I could avoid all of this altogether and just put my binary
module into the appropriate location in the boot image.

Would it be correct to modify the following image file with my changes
and then write it to a floppy via dd?

/home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img

Basically I am injecting an updated e1000 driver with changes to
pcitable to support the address of my gigabit cards.

Terrence


-- __--__--

Message: 3
Date: Wed, 10 Dec 2003 12:40:41 -0800 (PST)
From: Tim Carlson <tim.carlson at pnl.gov>
Subject: Re: [Rocks-Discuss]Error during Make when building a new
install floppy
To: Terrence Martin <tmartin at physics.ucsd.edu>
Cc: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu>
Reply-to: Tim Carlson <tim.carlson at pnl.gov>
On Wed, 10 Dec 2003, Terrence Martin wrote:

> I get the following error when I try to rebuild a boot floppy for
rocks.
>

You can't make a boot floppy with Rocks 3.0. That isn't supported, or
at least it wasn't the last time I checked.

> Of course I could avoid all of this together and just put my binary
> module into the appropriate location in the boot image.
>
> Would it be correct to modify the following image file with my changes
> and then write it to a floppy via dd?
>
>
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img
>
> Basically I am injecting an updated e1000 driver with changes to
> pcitable to support the address of my gigabit cards.

Modifying the bootnet.img is about 1/3 of what you need to do if you go
down that path. You also need to work on netstg1.img, and you'll need to
update the driver in the kernel rpm that gets installed on the box. None
of this is trivial.

If it were me, I would go down the same path I took for updating the
AIC79XX driver

https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/003533.html

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support


-- __--__--

Message: 4
Date: Wed, 10 Dec 2003 12:52:38 -0800 (PST)
From: Tim Carlson <tim.carlson at pnl.gov>
Subject: Re: [Rocks-Discuss]Non-homogenous legacy hardware
To: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu>
Cc: npaci-rocks-discussion at sdsc.edu
Reply-to: Tim Carlson <tim.carlson at pnl.gov>

On Wed, 10 Dec 2003, Chris Dwan (CCGB) wrote:

>
> I am integrating legacy systems into a ROCKS cluster, and have hit a
> snag with the auto-partition configuration: The new (old) systems
have
> SCSI disks, while old (new) ones contain IDE. This is a non-issue so
> long as the initial install does its default partitioning. However, I
> have a "replace-auto-partition.xml" file which is unworkable for the
SCSI
> based systems since it makes specific reference to "hda" rather than
> "sda."

If you have just a single drive, then you should be able to skip the
"--ondisk" bits of your "part" command

Otherwise, you would first have to do something ugly like the following:

http://penguin.epfl.ch/slides/kickstart/ks.cfg

You could probably (maybe) wrap most of that in an
<eval sh="bash">
</eval>

block in the <main> block.

Just guessing.. haven't tried this.
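To make that suggestion concrete, here is a rough, untested sketch of what site-nodes/replace-auto-partition.xml might look like with the partition lines emitted from a bash eval block. The device probe and sizes are purely illustrative, not from the Rocks sources:

```xml
<!-- Hypothetical, untested sketch: choose sda vs hda inside an eval
     block and echo kickstart "part" lines. Sizes are placeholders. -->
<main>
  <eval sh="bash">
  if [ -b /dev/sda ]; then disk=sda; else disk=hda; fi
  echo "part / --size 4096 --ondisk $disk"
  echo "part swap --size 1024 --ondisk $disk"
  </eval>
</main>
```

Whether the eval output is actually spliced into the generated kickstart at that point is exactly the part Tim says he hasn't verified.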

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support


-- __--__--

Message: 5
From: Jag <agrajag at dragaera.net>
To: npaci-rocks-discussion at sdsc.edu
Date: Wed, 10 Dec 2003 13:21:07 -0500
Subject: [Rocks-Discuss]ssh_known_hosts and ganglia

I noticed a previous post on this list
(https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001934.html)
indicating that Rocks distributes ssh keys for all the nodes over
ganglia. Can anyone enlighten me as to how this is done?

I looked through the ganglia docs and didn't see anything indicating how
to do this, so I'm assuming Rocks made some changes. Unfortunately the
rocks iso images don't seem to contain srpms, so I'm now coming here.
What did Rocks do to ganglia to make the distribution of ssh keys work?

Also, does anyone know where Rocks SRPMs can be found? I've done quite
a bit of searching, but haven't found them anywhere.


-- __--__--

Message: 6
Cc: npaci-rocks-discussion at sdsc.edu
From: "Mason J. Katz" <mjk at sdsc.edu>
Subject: Re: [Rocks-Discuss]ssh_known_hosts and ganglia
Date: Wed, 10 Dec 2003 14:39:15 -0800
To: Jag <agrajag at dragaera.net>
Most of the SRPMS are on our FTP site, but we've screwed this up
before. The SRPMS are entirely Rocks specific, so they are of little
value outside of Rocks. You can also check out our CVS tree
(cvs.rocksclusters.org), where rocks/src/ganglia shows what we add. We
have a ganglia-python package we created to allow us to write our own
metrics at a higher level than the provided gmetric application. We've
also moved from this method to a single cluster-wide ssh key for Rocks
3.1.

       -mjk

On Dec 10, 2003, at 10:21 AM, Jag wrote:

> I noticed a previous post on this list
> (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/
> 001934.html) indicating that Rocks distributes ssh keys for all the
> nodes over ganglia. Can anyone enlighten me as to how this is done?
>
> I looked through the ganglia docs and didn't see anything indicating
> how to do this, so I'm assuming Rocks made some changes. Unfortunately
> the rocks iso images don't seem to contain srpms, so I'm now coming
> here. What did Rocks do to ganglia to make the distribution of ssh
> keys work?
>
> Also, does anyone know where Rocks SRPMs can be found? I've done
> quite a bit of searching, but haven't found them anywhere.


-- __--__--

Message: 7
Date: Wed, 10 Dec 2003 14:43:49 -0800
From: "V. Rowley" <vrowley at ucsd.edu>
To: npaci-rocks-discussion at sdsc.edu
Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying
to build CD distro

When I run this:

[root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; rocks-dist
--dist=cdrom cdrom

on a server installed with ROCKS 3.0.0, I eventually get this:

>   Cleaning distribution
>   Resolving versions (RPMs)
>   Resolving versions (SRPMs)
>   Adding support for rebuild distribution from source
> Creating files (symbolic links - fast)
> Creating symlinks to kickstart files
> Fixing Comps Database
> Generating hdlist (rpm database)
> Patching second stage loader (eKV, partioning, ...)
>     patching "rocks-ekv" into distribution ...
>     patching "rocks-piece-pipe" into distribution ...
>     patching "PyXML" into distribution ...
>     patching "expat" into distribution ...
>     patching "rocks-pylib" into distribution ...
>     patching "MySQL-python" into distribution ...
>     patching "rocks-kickstart" into distribution ...
>     patching "rocks-kickstart-profiles" into distribution ...
>     patching "rocks-kickstart-dtds" into distribution ...
>     building CRAM filesystem ...
> Cleaning distribution
> Resolving versions (RPMs)
> Resolving versions (SRPMs)
> Creating symlinks to kickstart files
> Generating hdlist (rpm database)
> Segregating RPMs (rocks, non-rocks)
> sh: ./kickstart.cgi: No such file or directory
> sh: ./kickstart.cgi: No such file or directory
> Traceback (innermost last):
>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
>     app.run()
>   File "/opt/rocks/bin/rocks-dist", line 623, in run
>     eval('self.command_%s()' % (command))
>   File "<string>", line 0, in ?
>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>     builder.build()
>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>     (rocks, nonrocks) = self.segregateRPMS()
>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
segregateRPMS
>     for pkg in ks.getSection('packages'):
> TypeError: loop over non-sequence

Any ideas?

--
Vicky Rowley                              email: vrowley at ucsd.edu
Biomedical Informatics Research Network      work: (858) 536-5980
University of California, San Diego           fax: (858) 822-0828
9500 Gilman Drive
La Jolla, CA 92093-0715


See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb


-- __--__--

Message: 8
Cc: rocks <npaci-rocks-discussion at sdsc.edu>
From: Greg Bruno <bruno at rocksclusters.org>
Subject: Re: [Rocks-Discuss]one node short in "labels"
Date: Wed, 10 Dec 2003 15:12:49 -0800
To: Vincent Fox <vincent_b_fox at yahoo.com>

> So I go to the "labels" selection on the web page to print out the
> pretty labels. What a nice idea by the way!
>
> EXCEPT....it's one node short! I go up to 0-13 and this stops at
> 0-12. Any ideas where I should check to fix this?

yeah, we found this corner case -- it'll be fixed in the next release.

thanks for bug report.

  - gb


-- __--__--

Message: 9
Cc: npaci-rocks-discussion at sdsc.edu
From: "Mason J. Katz" <mjk at sdsc.edu>
Subject: Re: [Rocks-Discuss]"TypeError:    loop over non-sequence" when
trying to build CD distro
Date: Wed, 10 Dec 2003 15:16:27 -0800
To: "V. Rowley" <vrowley at ucsd.edu>

It looks like someone moved the profiles directory to profiles.orig.

     -mjk


[root at rocks14 install]# ls -l
total 56
drwxr-sr-x    3 root     wheel        4096 Dec 10 21:16 cdrom
drwxrwsr-x    5 root     wheel        4096 Dec 10 20:38 contrib.orig
drwxr-sr-x    3 root     wheel        4096 Dec 10 21:07 ftp.rocksclusters.org
drwxr-sr-x    3 root     wheel        4096 Dec 10 20:38 ftp.rocksclusters.org.orig
-r-xrwsr-x    1 root     wheel       19254 Sep  3 12:40 kickstart.cgi
drwxr-xr-x    3 root     root         4096 Dec 10 20:38 profiles.orig
drwxr-sr-x    3 root     wheel        4096 Dec 10 21:15 rocks-dist
drwxrwsr-x    3 root     wheel        4096 Dec 10 20:38 rocks-dist.orig
drwxr-sr-x    3 root     wheel        4096 Dec 10 21:02 src
drwxr-sr-x    4 root     wheel        4096 Dec 10 20:49 src.foo
On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:

> When I run this:
>
> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
> rocks-dist --dist=cdrom cdrom
>
> on a server installed with ROCKS 3.0.0, I eventually get this:
>
>> Cleaning distribution
>> Resolving versions (RPMs)
>> Resolving versions (SRPMs)
>> Adding support for rebuild distribution from source
>> Creating files (symbolic links - fast)
>> Creating symlinks to kickstart files
>> Fixing Comps Database
>> Generating hdlist (rpm database)
>> Patching second stage loader (eKV, partioning, ...)
>>      patching "rocks-ekv" into distribution ...
>>      patching "rocks-piece-pipe" into distribution ...
>>      patching "PyXML" into distribution ...
>>      patching "expat" into distribution ...
>>      patching "rocks-pylib" into distribution ...
>>      patching "MySQL-python" into distribution ...
>>      patching "rocks-kickstart" into distribution ...
>>      patching "rocks-kickstart-profiles" into distribution ...
>>      patching "rocks-kickstart-dtds" into distribution ...
>>      building CRAM filesystem ...
>> Cleaning distribution
>> Resolving versions (RPMs)
>> Resolving versions (SRPMs)
>> Creating symlinks to kickstart files
>> Generating hdlist (rpm database)
>> Segregating RPMs (rocks, non-rocks)
>> sh: ./kickstart.cgi: No such file or directory
>> sh: ./kickstart.cgi: No such file or directory
>> Traceback (innermost last):
>>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
>>      app.run()
>>   File "/opt/rocks/bin/rocks-dist", line 623, in run
>>      eval('self.command_%s()' % (command))
>>   File "<string>", line 0, in ?
>>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>      builder.build()
>>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>      (rocks, nonrocks) = self.segregateRPMS()
>>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>> segregateRPMS
>>      for pkg in ks.getSection('packages'):
>> TypeError: loop over non-sequence
>
> Any ideas?
>
> --
> Vicky Rowley                              email: vrowley at ucsd.edu
> Biomedical Informatics Research Network      work: (858) 536-5980
> University of California, San Diego           fax: (858) 822-0828
> 9500 Gilman Drive
> La Jolla, CA 92093-0715
>
>
> See pictures from our trip to China at
> http://www.sagacitech.com/Chinaweb


-- __--__--

Message: 10
Date: Wed, 10 Dec 2003 16:50:16 -0800
From: "V. Rowley" <vrowley at ucsd.edu>
To: "Mason J. Katz" <mjk at sdsc.edu>
CC: npaci-rocks-discussion at sdsc.edu
Subject: Re: [Rocks-Discuss]"TypeError:   loop over non-sequence" when
trying
 to build CD distro

Yep, I did that, but only *AFTER* getting the error. [Thought it was
generated by the rocks-dist sequence, but apparently not.] Go ahead.
Move it back. Same difference.

Vicky

Mason J. Katz wrote:
> It looks like someone moved the profiles directory to profiles.orig.
>
>     -mjk
>
>
> [root at rocks14 install]# ls -l
> total 56
> drwxr-sr-x    3 root     wheel        4096 Dec 10 21:16 cdrom
> drwxrwsr-x    5 root     wheel        4096 Dec 10 20:38 contrib.orig
> drwxr-sr-x    3 root     wheel        4096 Dec 10 21:07 ftp.rocksclusters.org
> drwxr-sr-x    3 root     wheel        4096 Dec 10 20:38 ftp.rocksclusters.org.orig
> -r-xrwsr-x    1 root     wheel       19254 Sep  3 12:40 kickstart.cgi
> drwxr-xr-x    3 root     root         4096 Dec 10 20:38 profiles.orig
> drwxr-sr-x    3 root     wheel        4096 Dec 10 21:15 rocks-dist
> drwxrwsr-x    3 root     wheel        4096 Dec 10 20:38 rocks-dist.orig
> drwxr-sr-x    3 root     wheel        4096 Dec 10 21:02 src
> drwxr-sr-x    4 root     wheel        4096 Dec 10 20:49 src.foo
> On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:
>
>> When I run this:
>>
>> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
>> rocks-dist --dist=cdrom cdrom
>>
>> on a server installed with ROCKS 3.0.0, I eventually get this:
>>
>>> Cleaning distribution
>>> Resolving versions (RPMs)
>>> Resolving versions (SRPMs)
>>> Adding support for rebuild distribution from source
>>> Creating files (symbolic links - fast)
>>> Creating symlinks to kickstart files
>>> Fixing Comps Database
>>> Generating hdlist (rpm database)
>>> Patching second stage loader (eKV, partioning, ...)
>>>     patching "rocks-ekv" into distribution ...
>>>     patching "rocks-piece-pipe" into distribution ...
>>>     patching "PyXML" into distribution ...
>>>     patching "expat" into distribution ...
>>>     patching "rocks-pylib" into distribution ...
>>>     patching "MySQL-python" into distribution ...
>>>     patching "rocks-kickstart" into distribution ...
>>>     patching "rocks-kickstart-profiles" into distribution ...
>>>     patching "rocks-kickstart-dtds" into distribution ...
>>>     building CRAM filesystem ...
>>> Cleaning distribution
>>> Resolving versions (RPMs)
>>> Resolving versions (SRPMs)
>>> Creating symlinks to kickstart files
>>> Generating hdlist (rpm database)
>>> Segregating RPMs (rocks, non-rocks)
>>> sh: ./kickstart.cgi: No such file or directory
>>> sh: ./kickstart.cgi: No such file or directory
>>> Traceback (innermost last):
>>>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
>>>      app.run()
>>>   File "/opt/rocks/bin/rocks-dist", line 623, in run
>>>      eval('self.command_%s()' % (command))
>>>   File "<string>", line 0, in ?
>>>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>>      builder.build()
>>>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>>      (rocks, nonrocks) = self.segregateRPMS()
>>>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>>> segregateRPMS
>>>      for pkg in ks.getSection('packages'):
>>> TypeError: loop over non-sequence
>>
>>
>> Any ideas?
>>
>> --
>> Vicky Rowley                              email: vrowley at ucsd.edu
>> Biomedical Informatics Research Network      work: (858) 536-5980
>> University of California, San Diego           fax: (858) 822-0828
>> 9500 Gilman Drive
>> La Jolla, CA 92093-0715
>>
>>
>> See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb
>
>
>

--
Vicky Rowley                              email: vrowley at ucsd.edu
Biomedical Informatics Research Network      work: (858) 536-5980
University of California, San Diego           fax: (858) 822-0828
9500 Gilman Drive
La Jolla, CA 92093-0715


See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb


-- __--__--

Message: 11
Date: Wed, 10 Dec 2003 17:23:25 -0800 (PST)
From: Tim Carlson <tim.carlson at pnl.gov>
Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when
trying to
 build CD distro
To: "V. Rowley" <vrowley at ucsd.edu>
Cc: "Mason J. Katz" <mjk at sdsc.edu>, npaci-rocks-discussion at sdsc.edu
Reply-to: Tim Carlson <tim.carlson at pnl.gov>

On Wed, 10 Dec 2003, V. Rowley wrote:

Did you remove python by chance? kickstart.cgi calls python directly in
/usr/bin/python while rocks-dist does an "env python"
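That diagnosis fits the traceback: if kickstart.cgi cannot run ("No such file or directory"), no %packages output is parsed, getSection('packages') comes back empty-handed, and the for loop dies with "loop over non-sequence" (that is Python 1.5's wording; newer Pythons say "'NoneType' object is not iterable"). A minimal sketch of that failure mode, where FakeKickstart and its method are hypothetical stand-ins, not the real rocks.build API:

```python
# Hedged sketch of the failure mode; FakeKickstart/getSection are
# hypothetical stand-ins for the rocks.build kickstart object.
class FakeKickstart:
    def __init__(self, sections):
        self.sections = sections

    def getSection(self, name):
        # Returns None when the section is missing -- e.g. when
        # kickstart.cgi never ran and produced no %packages output.
        return self.sections.get(name)

ks = FakeKickstart({})          # kickstart.cgi failed: nothing parsed
try:
    for pkg in ks.getSection('packages'):
        pass
except TypeError as e:
    # Python 1.5 phrased this as "loop over non-sequence"
    print("TypeError:", e)
```

So the TypeError is a downstream symptom; the two "sh: ./kickstart.cgi" lines just above it are the real error to chase.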

Tim
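The difference matters because a hard-coded `#!/usr/bin/python` shebang ignores `$PATH`, while `env python` resolves through it; a quick, illustrative way to compare what each would run:

```shell
# What "env python" would execute (the first python found on $PATH):
command -v python || echo "no python on PATH"

# What a hard-coded #!/usr/bin/python shebang would execute, if it exists:
ls -l /usr/bin/python 2>/dev/null || echo "/usr/bin/python missing"
```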

> Yep, I did that, but only *AFTER* getting the error. [Thought it was
> generated by the rocks-dist sequence, but apparently not.] Go ahead.
> Move it back. Same difference.
>
> Vicky
>
> Mason J. Katz wrote:
> > It looks like someone moved the profiles directory to profiles.orig.
> >
> >     -mjk
> >
> >
> > [root at rocks14 install]# ls -l
> > total 56
> > drwxr-sr-x    3 root     wheel         4096 Dec 10 21:16 cdrom
> > drwxrwsr-x    5 root     wheel         4096 Dec 10 20:38 contrib.orig
> > drwxr-sr-x    3 root     wheel         4096 Dec 10 21:07 ftp.rocksclusters.org
> > drwxr-sr-x    3 root     wheel         4096 Dec 10 20:38 ftp.rocksclusters.org.orig
> > -r-xrwsr-x    1 root     wheel        19254 Sep 3 12:40 kickstart.cgi
> > drwxr-xr-x    3 root     root          4096 Dec 10 20:38 profiles.orig
> > drwxr-sr-x    3 root     wheel         4096 Dec 10 21:15 rocks-dist
> > drwxrwsr-x    3 root     wheel         4096 Dec 10 20:38 rocks-dist.orig
> > drwxr-sr-x    3 root     wheel         4096 Dec 10 21:02 src
> > drwxr-sr-x    4 root     wheel         4096 Dec 10 20:49 src.foo
> > On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:
> >
> >> When I run this:
> >>
> >> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
> >> rocks-dist --dist=cdrom cdrom
> >>
> >> on a server installed with ROCKS 3.0.0, I eventually get this:
> >>
> >>> Cleaning distribution
> >>> Resolving versions (RPMs)
> >>> Resolving versions (SRPMs)
> >>> Adding support for rebuild distribution from source
> >>> Creating files (symbolic links - fast)
> >>> Creating symlinks to kickstart files
> >>> Fixing Comps Database
> >>> Generating hdlist (rpm database)
> >>> Patching second stage loader (eKV, partioning, ...)
> >>>     patching "rocks-ekv" into distribution ...
> >>>     patching "rocks-piece-pipe" into distribution ...
> >>>     patching "PyXML" into distribution ...
> >>>     patching "expat" into distribution ...
> >>>     patching "rocks-pylib" into distribution ...
> >>>     patching "MySQL-python" into distribution ...
> >>>     patching "rocks-kickstart" into distribution ...
> >>>     patching "rocks-kickstart-profiles" into distribution ...
> >>>     patching "rocks-kickstart-dtds" into distribution ...
> >>>     building CRAM filesystem ...
> >>> Cleaning distribution
> >>> Resolving versions (RPMs)
> >>> Resolving versions (SRPMs)
> >>> Creating symlinks to kickstart files
> >>> Generating hdlist (rpm database)
> >>> Segregating RPMs (rocks, non-rocks)
> >>> sh: ./kickstart.cgi: No such file or directory
> >>> sh: ./kickstart.cgi: No such file or directory
> >>> Traceback (innermost last):
> >>>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
> >>>     app.run()
> >>>   File "/opt/rocks/bin/rocks-dist", line 623, in run
> >>>     eval('self.command_%s()' % (command))
> >>>   File "<string>", line 0, in ?
> >>>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
> >>>     builder.build()
> >>>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
> >>>     (rocks, nonrocks) = self.segregateRPMS()
> >>>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
> >>> segregateRPMS
> >>>     for pkg in ks.getSection('packages'):
> >>> TypeError: loop over non-sequence
> >>
> >>
> >> Any ideas?
> >>
> >> --
> >> Vicky Rowley                              email: vrowley at ucsd.edu
> >> Biomedical Informatics Research Network      work: (858) 536-5980
> >> University of California, San Diego           fax: (858) 822-0828
> >> 9500 Gilman Drive
> >> La Jolla, CA 92093-0715
> >>
> >>
> >> See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb
> >
> >
> >
>
> --
> Vicky Rowley                              email: vrowley at ucsd.edu
> Biomedical Informatics Research Network      work: (858) 536-5980
> University of California, San Diego           fax: (858) 822-0828
> 9500 Gilman Drive
> La Jolla, CA 92093-0715
>
>
> See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb
>
>
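The `TypeError: loop over non-sequence` above is what old Pythons raise when a for-loop is handed `None` instead of a list; here `ks.getSection('packages')` presumably returned `None` once `kickstart.cgi` stopped being runnable. A minimal sketch of the failure mode and a guard (function and message names are illustrative, not the actual rocks-dist code):

```python
def get_section(name, sections):
    # Like ks.getSection() in the traceback, this returns None when the
    # section is missing -- the value that triggers "loop over non-sequence".
    return sections.get(name)

def segregate(sections):
    pkgs = get_section('packages', sections)
    if pkgs is None:
        # Guard instead of letting the for-loop blow up on None.
        raise RuntimeError("no 'packages' section -- was kickstart.cgi runnable?")
    return [p for p in pkgs]

print(segregate({'packages': ['rocks-ekv', 'expat']}))
```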




-- __--__--

_______________________________________________
npaci-rocks-discussion mailing list
npaci-rocks-discussion at sdsc.edu
http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion


End of npaci-rocks-discussion Digest


DISCLAIMER:
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its contents to any
other person as it may be an offence under the Official Secrets Act.
Thank you.

--__--__--

Message: 2
Date: Wed, 10 Dec 2003 18:03:41 -0800
From: Terrence Martin <tmartin at physics.ucsd.edu>
To: npaci-rocks-discussion at sdsc.edu
Subject: [Rocks-Discuss]Rocks 3.0.0

I am having a problem on install of rocks 3.0.0 on my new cluster.

A python error occurs right after anaconda starts and just before the
install asks for the roll CDROM.

The error refers to an inability to find or load rocks.file. The error
is associated, I think, with the window that pops up and asks you to put
the roll CDROM in.

The process I followed to get to this point is:

Put the Rocks 3.0.0 CDROM into the CDROM drive
Boot the system
At the prompt, type frontend
Wait till anaconda starts
An error appears referring to being unable to load rocks.file

I have successfully installed rocks on a smaller cluster but that has
different hardware. I used the same CDROM for both installs.

Any thoughts?

Terrence



--__--__--
Message: 3
Date: Wed, 10 Dec 2003 19:52:49 -0800
From: "V. Rowley" <vrowley at ucsd.edu>
To: npaci-rocks-discussion at sdsc.edu
Subject: Re: [Rocks-Discuss]"TypeError:   loop over non-sequence" when
trying
 to build CD distro

Looks like python is okay:

> [root at rocks14 birn-oracle1]# which python
> /usr/bin/python
> [root at rocks14 birn-oracle1]# python --help
> Unknown option: --
> usage: python [option] ... [-c cmd | file | -] [arg] ...
> Options and arguments (and corresponding environment variables):
> -d     : debug output from parser (also PYTHONDEBUG=x)
> -i     : inspect interactively after running script, (also PYTHONINSPECT=x)
>          and force prompts, even if stdin does not appear to be a terminal
> -O     : optimize generated bytecode (a tad; also PYTHONOPTIMIZE=x)
> -OO    : remove doc-strings in addition to the -O optimizations
> -S     : don't imply 'import site' on initialization
> -t     : issue warnings about inconsistent tab usage (-tt: issue errors)
> -u     : unbuffered binary stdout and stderr (also PYTHONUNBUFFERED=x)
> -v     : verbose (trace import statements) (also PYTHONVERBOSE=x)
> -x     : skip first line of source, allowing use of non-Unix forms of #!cmd
> -X     : disable class based built-in exceptions
> -c cmd : program passed in as string (terminates option list)
> file   : program read from script file
> -      : program read from stdin (default; interactive mode if a tty)
> arg ...: arguments passed to program in sys.argv[1:]
> Other environment variables:
> PYTHONSTARTUP: file executed on interactive startup (no default)
> PYTHONPATH   : ':'-separated list of directories prefixed to the
>                 default module search path. The result is sys.path.
> PYTHONHOME   : alternate <prefix> directory (or <prefix>:<exec_prefix>).
>                 The default module search path uses <prefix>/python1.5.
> [root at rocks14 birn-oracle1]#



Tim Carlson wrote:
> On Wed, 10 Dec 2003, V. Rowley wrote:
>
> Did you remove python by chance? kickstart.cgi calls python directly
in
> /usr/bin/python while rocks-dist does an "env python"
>
> Tim
>
>
>>Yep, I did that, but only *AFTER* getting the error. [Thought it was
>>generated by the rocks-dist sequence, but apparently not.] Go ahead.
>>Move it back. Same difference.
>>
>>Vicky
>>
>>Mason J. Katz wrote:
>>
>>>It looks like someone moved the profiles directory to profiles.orig.
>>>
>>>     -mjk
>>>
>>>
>>>[root at rocks14 install]# ls -l
>>>total 56
>>>drwxr-sr-x     3 root     wheel        4096 Dec 10 21:16 cdrom
>>>drwxrwsr-x     5 root     wheel        4096 Dec 10 20:38 contrib.orig
>>>drwxr-sr-x     3 root     wheel        4096 Dec 10 21:07 ftp.rocksclusters.org
>>>drwxr-sr-x     3 root     wheel        4096 Dec 10 20:38 ftp.rocksclusters.org.orig
>>>-r-xrwsr-x     1 root     wheel       19254 Sep 3 12:40 kickstart.cgi
>>>drwxr-xr-x     3 root     root         4096 Dec 10 20:38 profiles.orig
>>>drwxr-sr-x     3 root     wheel        4096 Dec 10 21:15 rocks-dist
>>>drwxrwsr-x     3 root     wheel        4096 Dec 10 20:38 rocks-dist.orig
>>>drwxr-sr-x     3 root     wheel        4096 Dec 10 21:02 src
>>>drwxr-sr-x     4 root     wheel        4096 Dec 10 20:49 src.foo
>>>On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:
>>>
>>>
>>>>When I run this:
>>>>
>>>>[root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
>>>>rocks-dist --dist=cdrom cdrom
>>>>
>>>>on a server installed with ROCKS 3.0.0, I eventually get this:
>>>>
>>>>
>>>>>Cleaning distribution
>>>>>Resolving versions (RPMs)
>>>>>Resolving versions (SRPMs)
>>>>>Adding support for rebuild distribution from source
>>>>>Creating files (symbolic links - fast)
>>>>>Creating symlinks to kickstart files
>>>>>Fixing Comps Database
>>>>>Generating hdlist (rpm database)
>>>>>Patching second stage loader (eKV, partioning, ...)
>>>>>     patching "rocks-ekv" into distribution ...
>>>>>     patching "rocks-piece-pipe" into distribution ...
>>>>>     patching "PyXML" into distribution ...
>>>>>     patching "expat" into distribution ...
>>>>>     patching "rocks-pylib" into distribution ...
>>>>>     patching "MySQL-python" into distribution ...
>>>>>     patching "rocks-kickstart" into distribution ...
>>>>>     patching "rocks-kickstart-profiles" into distribution ...
>>>>>     patching "rocks-kickstart-dtds" into distribution ...
>>>>>     building CRAM filesystem ...
>>>>>Cleaning distribution
>>>>>Resolving versions (RPMs)
>>>>>Resolving versions (SRPMs)
>>>>>Creating symlinks to kickstart files
>>>>>Generating hdlist (rpm database)
>>>>>Segregating RPMs (rocks, non-rocks)
>>>>>sh: ./kickstart.cgi: No such file or directory
>>>>>sh: ./kickstart.cgi: No such file or directory
>>>>>Traceback (innermost last):
>>>>> File "/opt/rocks/bin/rocks-dist", line 807, in ?
>>>>>    app.run()
>>>>> File "/opt/rocks/bin/rocks-dist", line 623, in run
>>>>>    eval('self.command_%s()' % (command))
>>>>> File "<string>", line 0, in ?
>>>>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>>>>    builder.build()
>>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>>>>    (rocks, nonrocks) = self.segregateRPMS()
>>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>>>>>segregateRPMS
>>>>>    for pkg in ks.getSection('packages'):
>>>>>TypeError: loop over non-sequence
>>>>
>>>>
>>>>Any ideas?
>>>>
>>>>--
>>>>Vicky Rowley                              email: vrowley at ucsd.edu
>>>>Biomedical Informatics Research Network      work: (858) 536-5980
>>>>University of California, San Diego           fax: (858) 822-0828
>>>>9500 Gilman Drive
>>>>La Jolla, CA 92093-0715
>>>>
>>>>
>>>>See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb
>>>
>>>
>>>
>>--
>>Vicky Rowley                              email: vrowley at ucsd.edu
>>Biomedical Informatics Research Network      work: (858) 536-5980
>>University of California, San Diego           fax: (858) 822-0828
>>9500 Gilman Drive
>>La Jolla, CA 92093-0715
>>
>>
>>See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb
>>
>>
>
>
>
>

--
Vicky Rowley                              email: vrowley at ucsd.edu
Biomedical Informatics Research Network      work: (858) 536-5980
University of California, San Diego           fax: (858) 822-0828
9500 Gilman Drive
La Jolla, CA 92093-0715
See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb



--__--__--

_______________________________________________
npaci-rocks-discussion mailing list
npaci-rocks-discussion at sdsc.edu
http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion


End of npaci-rocks-discussion Digest




From wyzhong78 at msn.com Thu Dec 11 07:27:39 2003
From: wyzhong78 at msn.com (zhong wenyu)
Date: Thu, 11 Dec 2003 23:27:39 +0800
Subject: [Rocks-Discuss]3.0.0 problem:Does my namd job allocate to each node?
Message-ID: <BAY3-F25UBUhr3ukkwu000156fe@hotmail.com>

I have built a Rocks cluster from four dual-Xeon computers to run NAMD: one
frontend and the other three as compute nodes. With Intel's hyper-threading
technology I have 16 CPUs in all.
Now I have some trouble; maybe someone can help me.
I created the PBS script below, named mytask:
#!/bin/csh
#PBS -N NAMD
#PBS -m be
#PBS -l ncpus=8
#PBS -l nodes=2
#
cd $PBS_O_WORKDIR
/charmrun namd2 +p8 mytask.namd

I typed:
qsub mytask
qrun N

Then I used
qstat -f N

The message that came back showed (I'm sorry I can't copy the original
message, just the meaning):

host: compute-0-0/0+compute-0-0/1+compute-0-1/0+compute-0-1/1
cpu used: 8

It's strange: why 4 host slots and 8 CPUs used?
But when I looked at Ganglia for the cluster status, it showed me only one
node in use (for example, compute-0-0); both of the other two were idle.
I want to know whether the job was being done by one node or two.
So I created a new task bound to compute-0-1; the feedback message showed
no resource available.
When the task ended I checked the information and found that the CPU time
per step was half that of 4 CPUs (1 node), but the whole time (including
wall time) was equal.
Does my namd job get allocated to each node?
Please help me!
Thanks

_________________________________________________________________
MSN Messenger: http://messenger.msn.com/cn



From bruno at rocksclusters.org Thu Dec 11 07:55:17 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Thu, 11 Dec 2003 07:55:17 -0800
Subject: [Rocks-Discuss]ATLAS rpm build problems on PII platform
In-Reply-To: <20031211064321.41781.qmail@web14801.mail.yahoo.com>
References: <20031211064321.41781.qmail@web14801.mail.yahoo.com>
Message-ID: <6A67C95F-2BF2-11D8-B821-000A95C4E3B4@rocksclusters.org>

outstanding -- thanks for the patch!

i just committed the change to cvs. the fix will be reflected in the
upcoming release (or immediately for anyone who has the rocks source
tree checked out on their local frontend).

    - gb


On Dec 10, 2003, at 10:43 PM, Vincent Fox wrote:

>   Okay, here's the context diff as plain text. I test-applied it using
>   "patch -p0 < atlas.patch" and did a compile on my PII box
>   successfully. I can send it as attachment or submit to CVS or some
>   other way if you need:
>
>   *** atlas.spec.in.orig  Thu Dec 11 06:27:13 2003
>   --- atlas.spec.in       Thu Dec 11 06:30:46 2003
>   ***************
>   *** 111,117 ****
>   --- 111,133 ----
>     y
>     " | make
>   + elif [ $CPUID -eq 4 ]
>   + then
>   + #
>   + # Pentium II
>   + #
>   + echo "0
>   + y
>   + y
>   + n
>   + y
>   + linux
>   +   0
>   +   /usr/bin/g77
>   +   -O
>   +   y
>   +   " | make
>     else
>     #
>
>
>   Greg Bruno <bruno at rocksclusters.org>wrote:
>   > Okay, came up my own quick hack:
>   >
>   > Edit atlas.spec.in, go to "other x86" section, remove
>   > 2 lines right above "linux", seems to make rpm now.
>   >
>   > A more formal patch would be put in a section for
>   > cpuid eq 4 with this correction I suppose.
>
>   if you provide the patch, we'll include it in our next release.
>
>   - gb
>
>   Do you Yahoo!?
>   New Yahoo! Photos - easier uploading and sharing


From phil at sdsc.edu Thu Dec 11 08:00:06 2003
From: phil at sdsc.edu (Philip Papadopoulos)
Date: Thu, 11 Dec 2003 12:00:06 -0400
Subject: [Rocks-Discuss]3.0.0 problem:Does my namd job allocate to each node?
Message-ID: <1920451470-1071158479-cardhu_blackberry.rim.net-21416-@engine05>

The important thing to understand is that PBS only gives an allocation of nodes
(listed in the file named by the PBS_NODEFILE environment variable) when the job
is run. It is the user's responsibility to actually start the code on multiple
nodes. This is the way PBS works on all platforms, not just Rocks.

PBS will start the submitted code (usually a script) on the first node listed in
PBS_NODEFILE. This environment variable is only available once the queued job is
running. Your mytask script must explicitly start work on the allocated nodes.

PBS (actually Maui) will pack jobs onto nodes by default, so packing an 8-CPU
job onto a small number of nodes is normal, but changeable.

-p
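The point above — that the job script itself must launch work on every allocated node — can be sketched as a job script. The resource line, launcher, and binary names are illustrative, not taken from this thread; an MPICH-style `mpirun` is assumed in place of charmrun:

```shell
#!/bin/sh
#PBS -N NAMD
#PBS -l nodes=2:ppn=4
cd "$PBS_O_WORKDIR"

# PBS runs this script on the first allocated node only; it is up to the
# script to start processes on the rest. The allocation is listed one CPU
# slot per line in the file named by $PBS_NODEFILE.
NP=$(wc -l < "$PBS_NODEFILE")

# With an MPICH-style launcher the node file can be passed directly
# (launcher and binary names are illustrative):
mpirun -np "$NP" -machinefile "$PBS_NODEFILE" ./namd2 mytask.namd
```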

-----Original Message-----
From: "zhong wenyu" <wyzhong78 at msn.com>
Date: Thu, 11 Dec 2003 23:27:39
To:npaci-rocks-discussion at sdsc.edu
Subject: [Rocks-Discuss]3.0.0 problem:Does my namd job allocate to each node?

I have built a Rocks cluster from four dual-Xeon computers to run NAMD: one
frontend and the other three as compute nodes. With Intel's hyper-threading
technology I have 16 CPUs in all.
Now I have some trouble; maybe someone can help me.
I created the PBS script below, named mytask:
#!/bin/csh
#PBS -N NAMD
#PBS -m be
#PBS -l ncpus=8
#PBS -l nodes=2
#
cd $PBS_O_WORKDIR
/charmrun namd2 +p8 mytask.namd

I typed:
qsub mytask
qrun N

Then I used
qstat -f N

The message that came back showed (I'm sorry I can't copy the original
message, just the meaning):

host: compute-0-0/0+compute-0-0/1+compute-0-1/0+compute-0-1/1
cpu used: 8

It's strange: why 4 host slots and 8 CPUs used?
But when I looked at Ganglia for the cluster status, it showed me only one
node in use (for example, compute-0-0); both of the other two were idle.
I want to know whether the job was being done by one node or two.
So I created a new task bound to compute-0-1; the feedback message showed
no resource available.
When the task ended I checked the information and found that the CPU time
per step was half that of 4 CPUs (1 node), but the whole time (including
wall time) was equal.
Does my namd job get allocated to each node?
Please help me!
Thanks

_________________________________________________________________
MSN Messenger: http://messenger.msn.com/cn


Sent via BlackBerry - a service from AT&T Wireless.

From jlkaiser at fnal.gov Thu Dec 11 08:28:08 2003
From: jlkaiser at fnal.gov (Joe Kaiser)
Date: Thu, 11 Dec 2003 10:28:08 -0600
Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets ...
In-Reply-To: <1071007177.18100.58.camel@squash.scalableinformatics.com>
References: <1071007177.18100.58.camel@squash.scalableinformatics.com>
Message-ID: <1071160088.18486.25.camel@nietzsche.fnal.gov>

Hi,

I'm sorry, I thought I sent email to the list reporting how I did this.

You have not said what motherboard you are using or what the error
exactly is. The instructions below are for the X5DPA-GG and the error
isn't reported as an error, I just get prompted to insert my driver.

If it is the X5DPA-GG then 3.0.0 will support the e1000 but you have to
make a change to the pcitable on the initrd.img. The current pcitable
on the initrd.img does NOT have the proper deviceId for the e1000 for
this board. If you look in /etc/sysconfig/hwconf and search for the
e1000, you will find this:

class: NETWORK
bus: PCI
detached: 0
device: eth
driver: e1000
desc: "Unknown vendor|Generic e1000 device"
vendorId: 8086
deviceId: 1013
subVendorId: 8086
subDeviceId: 1213
pciType: 1

The device ID is 1013. If you look in the pcitable that comes off of
the initrd.img you will see that the highest the e1000 device id's go is
1012. Just add in the proper line to the initrd.img in your /tftpboot
directory and it should work. Instructions are below.

Here are the instructions:

This should be done on the frontend:

cd /tftpboot/X86PC/UNDI/pxelinux/
cp initrd.img initrd.img.orig
cp initrd.img /tmp
cd /tmp
mv initrd.img initrd.gz
gunzip initrd.gz
mkdir /mnt/loop
mount -o loop initrd /mnt/loop
cd /mnt/loop/modules/
vi pcitable

Search for the e1000 drivers and add the following line:

0x8086 0x1013    "e1000" "Intel Corp.|82546EB Gigabit Ethernet Controller"

write the file

cd /tmp
umount /mnt/loop
gzip initrd
mv initrd.gz initrd.img
mv initrd.img /tftpboot/X86PC/UNDI/pxelinux/

Then boot the node.
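Before editing the initrd by hand, it is easy to check whether a pcitable already carries a given vendor/device pair. A small sketch — the parsing is a simplification, since this post only shows fragments of the real pcitable format:

```python
def has_pci_entry(pcitable_text, vendor, device):
    """Scan pcitable-style lines for a matching '0xVVVV 0xDDDD ...' entry."""
    for line in pcitable_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == vendor and fields[1] == device:
            return True
    return False

# Illustrative single-line table: the stock file tops out at 0x1012 for
# the e1000, and 0x1013 is the entry the instructions above add by hand.
sample = '0x8086\t0x1012\t"e1000"\t"Intel Corp.|e1000 device"'
print(has_pci_entry(sample, "0x8086", "0x1013"))
```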

Hope this helps.

Thanks,

Joe

On Tue, 2003-12-09 at 15:59, Joe Landman wrote:
> Folks:
>
>   As indicated previously, I am wrestling with a Supermicro based
> cluster. None of the RH distributions come with the correct E1000
> driver, so a new kernel is needed (in the boot CD, and for
> installation).
>
>   The problem I am running into is that it isn't at all obvious/easy how
> to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable
> this thing to work. Following the examples in the documentation have
> not met with success. Running "rocks-dist cdrom" with the new kernels
> (2.4.23 works nicely on the nodes) in the force/RPMS directory generates
> a bootable CD with the original 2.4.18BOOT kernel.
>
>   What I (and I think others) need, is a simple/easy to follow method
> that will generate a bootable CD with the correct linux kernel, and the
> correct modules.
>
>   Is this in process somewhere? What would be tremendously helpful is
> if we can generate a binary module, and put that into the boot process
> by placing it into the force/modules/binary directory (assuming one
> exists) with the appropriate entry of a similar name in the
> force/modules/meta directory as a simple XML document giving pci-ids,
> description, name, etc.
>
>   Anything close to this coming? Modules are killing future ROCKS
> installs, the inability to easily inject a new module in there has
> created a problem whereby ROCKS does not function (as the underlying RH
> does not function).
>
>
>
--
===================================================================
Joe Kaiser - Systems Administrator

Fermi Lab
CD/OSS-SCS                Never laugh at live dragons.
630-840-6444
jlkaiser at fnal.gov
===================================================================



From jghobrial at uh.edu Thu Dec 11 08:41:42 2003
From: jghobrial at uh.edu (Joseph)
Date: Thu, 11 Dec 2003 10:41:42 -0600 (CST)
Subject: [Rocks-Discuss]Re: Rocks Pythone Error with rocks.file
In-Reply-To: <3FD82F68.9070600@physics.ucsd.edu>
References: <3FD82F68.9070600@physics.ucsd.edu>
Message-ID: <Pine.LNX.4.56.0312111001150.9106@mail.tlc2.uh.edu>

On Thu, 11 Dec 2003, Terrence Martin wrote:

>   I am having the exact same error that you reported to the list on my
>   cluster when I try to install rocks 3.0.0.
>
>   X tries to start, fails, then just before the HPC roll is supposed to
>   start I get the python error about not being able to load the rocks.file.
>
>   The thing is that my system is a dual Xeon supermicro not AMD, so it
>   must not be an AMD specific issue.
>
> Did you ever find a resolution to the problem?
>
> Thanks,
>
> Terrence
>

Yes, I guess you should check your memory as Greg suggests, but my
solution was to install the frontend on a different machine and then take
the HD back to the original frontend. The only problem I had was that
the build box was a single-processor setup, so when I went back to the
dual-AMD box pvfs failed because it was built against a non-SMP kernel.
I installed the SMP kernel and noticed this problem.

It seems the problem may be related to an SMP issue, due to the fact that we
both have SMP setups. I did not check the frontend's memory, so this may still
be a factor, but I have had no trouble with the box since the installation.

My initial problem was a booting problem on the frontend due to a cdrom
issue. All my other attempts at installing failed with the error you
mentioned, but as I posted earlier I tried 3 different AMD single-processor
boxes and they failed. The boxes are up all the time and stressed pretty
hard, so I don't believe it is a memory issue.

This is some very strange behaviour.

Thanks,
Joseph



From shewa at inel.gov Thu Dec 11 10:02:59 2003
From: shewa at inel.gov (Andrew Shewmaker)
Date: Thu, 11 Dec 2003 11:02:59 -0700
Subject: [Rocks-Discuss]ssh_known_hosts and ganglia
Message-ID: <3FD8B153.6000205@inel.gov>

"Mason J. Katz" <mjk at sdsc.edu> wrote:

 > We've also moved from this method to a single cluster-wide ssh key for
 > Rocks 3.1.

How does a single key work? I have successfully set up ssh host
based authentication for some non-Rocks systems using

http://www.omega.telia.net/vici/openssh/

(Note that OpenSSH_3.7.1p2 requires one more setting in addition
to those mentioned in the above url.

In <dir-of-ssh-conf-files>/ssh_config:
EnableSSHKeysign yes)

But I thought it still requires that each host has a key...
am I wrong? Do you do it differently?

Thanks,
Andrew

--
Andrew Shewmaker, Associate Engineer
Phone: 1-208-526-1415
Idaho National Eng. and Environmental Lab.
P.0. Box 1625, M.S. 3605
Idaho Falls, Idaho 83415-3605
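For comparison, the per-host-key style of host-based authentication referenced above typically needs settings like the following (a sketch for OpenSSH 3.7-era configs; file locations vary by distribution):

```text
# /etc/ssh/ssh_config on the client side
HostbasedAuthentication yes
EnableSSHKeysign yes        # required from OpenSSH 3.7.1 onward

# /etc/ssh/sshd_config on the server side
HostbasedAuthentication yes
# ...plus each client's host key in /etc/ssh/ssh_known_hosts and its
# name in shosts.equiv -- this is the per-host-key requirement the
# question asks about; a single cluster-wide key avoids it.
```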



From tmartin at physics.ucsd.edu Thu Dec 11 11:13:16 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Thu, 11 Dec 2003 11:13:16 -0800
Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets
 ...
In-Reply-To: <1071160088.18486.25.camel@nietzsche.fnal.gov>
References: <1071007177.18100.58.camel@squash.scalableinformatics.com>
<1071160088.18486.25.camel@nietzsche.fnal.gov>
Message-ID: <3FD8C1CC.20700@physics.ucsd.edu>

Hi Joe,

Do you know if 2.3.2 can also benefit from the same small change?

Terrence

Joe Kaiser wrote:
> Hi,
>
> I'm sorry, I thought I sent email to the list reporting how I did this.
>
> You have not said what motherboard you are using or what the error
> exactly is. The instructions below are for the X5DPA-GG and the error
> isn't reported as an error, I just get prompted to insert my driver.
>
> If it is the X5DPA-GG then 3.0.0 will support the e1000 but you have to
> make a change to the pcitable on the initrd.img. The current pcitable
> on the initrd.img does NOT have the proper deviceId for the e1000 for
> this board. If you look in /etc/sysconfig/hwconf and search for the
> e1000, you will find this:
>
> class: NETWORK
> bus: PCI
> detached: 0
> device: eth
> driver: e1000
> desc: "Unknown vendor|Generic e1000 device"
> vendorId: 8086
> deviceId: 1013
> subVendorId: 8086
> subDeviceId: 1213
> pciType: 1
>
> The device ID is 1013. If you look in the pcitable that comes off of
> the initrd.img you will see that the highest the e1000 device id's go is
> 1012. Just add in the proper line to the initrd.img in your /tftpboot
> directory and it should work. Instructions are below.
>
> Here are the instructions:
>
> This should be done on the frontend:
>
> cd /tftpboot/X86PC/UNDI/pxelinux/
> cp initrd.img initrd.img.orig
> cp initrd.img /tmp
> cd /tmp
> mv initrd.img initrd.gz
> gunzip initrd.gz
> mkdir /mnt/loop
> mount -o loop initrd /mnt/loop
> cd /mnt/loop/modules/
> vi pcitable
>
> Search for the e1000 drivers and add the following line:
>
> 0x8086 0x1013 "e1000" "Intel Corp.|82546EB Gigabit Ethernet Controller"
>
> write the file
>
> cd /tmp
> umount /mnt/loop
> gzip initrd
> mv initrd.gz initrd.img
> mv initrd.img /tftpboot/X86PC/UNDI/pxelinux/
>
> Then boot the node.
>
> Hope this helps.
>
> Thanks,
>
> Joe
>
> On Tue, 2003-12-09 at 15:59, Joe Landman wrote:
>
>>Folks:
>>
>> As indicated previously, I am wrestling with a Supermicro based
>>cluster. None of the RH distributions come with the correct E1000
>>driver, so a new kernel is needed (in the boot CD, and for
>>installation).
>>
>> The problem I am running into is that it isn't at all obvious/easy how
>>to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable
>>this thing to work. Following the examples in the documentation have
>>not met with success. Running "rocks-dist cdrom" with the new kernels
>>(2.4.23 works nicely on the nodes) in the force/RPMS directory generates
>>a bootable CD with the original 2.4.18BOOT kernel.
>>
>> What I (and I think others) need, is a simple/easy to follow method
>>that will generate a bootable CD with the correct linux kernel, and the
>>correct modules.
>>
>> Is this in process somewhere? What would be tremendously helpful is
>>if we can generate a binary module, and put that into the boot process
>>by placing it into the force/modules/binary directory (assuming one
>>exists) with the appropriate entry of a similar name in the
>>force/modules/meta directory as a simple XML document giving pci-ids,
>>description, name, etc.
>>
>> Anything close to this coming? Modules are killing future ROCKS
>>installs, the inability to easily inject a new module in there has
>>created a problem whereby ROCKS does not function (as the underlying RH
>>does not function).
>>
>>
>>




From tmartin at physics.ucsd.edu Thu Dec 11 11:19:55 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Thu, 11 Dec 2003 11:19:55 -0800
Subject: [Rocks-Discuss]Re: Rocks Pythone Error with rocks.file
In-Reply-To: <Pine.LNX.4.56.0312111001150.9106@mail.tlc2.uh.edu>
References: <3FD82F68.9070600@physics.ucsd.edu>
<Pine.LNX.4.56.0312111001150.9106@mail.tlc2.uh.edu>
Message-ID: <3FD8C35B.2090309@physics.ucsd.edu>

I am fairly certain it is not the memory, even without memtest86. I have
in my office the same Supermicro 613A-Xi (SB-613A-Xi-B) with a SUPER
X5DPA-GG motherboard as the ones at the SDSC, but it is from a different
vendor, with completely different RAM from another manufacturer.

When I put Rocks 3.0.0 on it I get the crash of the installer in the
same spot: right after the system attempts to start X windows and fails
(either because X simply fails to start, or because a mouse is not
present), a python error comes up complaining that the rocks.file could
not be found.

On the exact same system rocks 2.3.2 installs fine.

Terrence

Joseph wrote:
> On Thu, 11 Dec 2003, Terrence Martin wrote:
>
>
>>I am having the exact same error that you reported to the   list on my
>>cluster when I try to install rocks 3.0.0.
>>
>>X tries to start, fails, then just before the HPC roll is   supposed to
>>start I get the python error about not being able to load   the rocks.file.
>>
>>The thing is that my system is a dual Xeon supermicro not   AMD, so it
>>must not be an AMD specific issue.
>>
>>Did you ever find a resolution to the problem?
>>
>>Thanks,
>>
>>Terrence
>>
>
>
> Yes, I guess you should check your memory as Greg suggests, but my
> solution was to install the frontend on a different machine and then take
> the HD back to the original frontend. The only problem that I had was that
> the build box was a single processor setup so when I went back to the
> dual-AMD pvfs fails because it was built against a non-SMP kernel.
> I installed the SMP kernel and noticed this problem.
>
> It seems the problem may be related to an SMP issue do to the fact we both
> have an SMP setup. I did not check the frontend's memory so this may still
> be a factor, but I have had no trouble with the box after the installation.
>
> My initial problem was a booting problem on the frontend due to a cdrom
> issue. All my other attempts at installing failed with the error you mentioned,
but as I
> posted early I tried 3 different AMD single processor boxes and they
> failed. The boxes are up all the time and stressed pretty hard so I don't
> believe it is a memory issue.
>
> This is some very strange behaviour.
>
> Thanks,
> Joseph
>




From landman at scalableinformatics.com Thu Dec 11 11:42:14 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Thu, 11 Dec 2003 14:42:14 -0500
Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets
      ...
In-Reply-To: <3FD8C1CC.20700@physics.ucsd.edu>
References: <1071007177.18100.58.camel@squash.scalableinformatics.com>
       <1071160088.18486.25.camel@nietzsche.fnal.gov>
       <3FD8C1CC.20700@physics.ucsd.edu>
Message-ID: <1071171734.6164.12.camel@squash.scalableinformatics.com>

Hi Terrence and Joe:

  These are indeed X5DPA-GG. I am working on a device driver disk for
3.0 ROCKS. If this works, it is a weak hack, but it might be fine.
More later (testing it now as we speak)...

Joe


On Thu, 2003-12-11 at 14:13, Terrence Martin wrote:
> Hi Joe,
>
> Do you know if 2.3.2 can also benefit from the same small change?
>
> Terrence
>
> Joe Kaiser wrote:
> > Hi,
> >
> > I'm sorry, I thought I sent email to the list reporting how I did this.
> >
> > You have not said what motherboard you are using or what the error
> > exactly is. The instructions below are for the X5DPA-GG and the error
> > isn't reported as an error, I just get prompted to insert my driver.
> >
> > If it is the X5DPA-GG then 3.0.0 will support the e1000 but you have to
> > make a change to the pcitable on the initrd.img. The current pcitable
> > on the initrd.img does NOT have the proper deviceId for the e1000 for
> > this board. If you look in /etc/sysconfig/hwconf and search for the
> > e1000, you will find this:
> >
> > class: NETWORK
> > bus: PCI
> > detached: 0
> > device: eth
> > driver: e1000
> > desc: "Unknown vendor|Generic e1000 device"
> > vendorId: 8086
> > deviceId: 1013
> > subVendorId: 8086
> > subDeviceId: 1213
> > pciType: 1
> >
> > The device ID is 1013. If you look in the pcitable that comes off of
> > the initrd.img you will see that the highest the e1000 device id's go is
> > 1012. Just add in the proper line to the initrd.img in your /tftpboot
> > directory and it should work. Instructions are below.
> >
> > Here are the instructions:
> >
> > This should be done on the frontend:
> >
> > cd /tftpboot/X86PC/UNDI/pxelinux/
> > cp initrd.img initrd.img.orig
> > cp initrd.img /tmp
> > cd /tmp
> > mv initrd.img initrd.gz
> > gunzip initrd.gz
> > mkdir /mnt/loop
> > mount -o loop initrd /mnt/loop
> > cd /mnt/loop/modules/
> > vi pcitable
> >
> > Search for the e1000 drivers and add the following line:
> >
> > 0x8086 0x1013 "e1000" "Intel Corp.|82546EB Gigabit Ethernet
> > Controller"
> >
> > write the file
> >
> > cd /tmp
> > umount /mnt/loop
> > gzip initrd
> > mv initrd.gz initrd.img
> > mv initrd.img /tftpboot/X86PC/UNDI/pxelinux/
> >
> > Then boot the node.
> >
> > Hope this helps.
> >
> > Thanks,
> >
> > Joe
> >
> > On Tue, 2003-12-09 at 15:59, Joe Landman wrote:
> >
> >>Folks:
> >>
> >> As indicated previously, I am wrestling with a Supermicro based
> >>cluster. None of the RH distributions come with the correct E1000
> >>driver, so a new kernel is needed (in the boot CD, and for
> >>installation).
> >>
> >> The problem I am running into is that it isn't at all obvious/easy how
> >>to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable
> >>this thing to work. Following the examples in the documentation have
> >>not met with success. Running "rocks-dist cdrom" with the new kernels
> >>(2.4.23 works nicely on the nodes) in the force/RPMS directory generates
> >>a bootable CD with the original 2.4.18BOOT kernel.
> >>
> >> What I (and I think others) need, is a simple/easy to follow method
> >>that will generate a bootable CD with the correct linux kernel, and the
> >>correct modules.
> >>
> >> Is this in process somewhere? What would be tremendously helpful is
> >>if we can generate a binary module, and put that into the boot process
> >>by placing it into the force/modules/binary directory (assuming one
> >>exists) with the appropriate entry of a similar name in the
> >>force/modules/meta directory as a simple XML document giving pci-ids,
> >>description, name, etc.
> >>
> >> Anything close to this coming? Modules are killing future ROCKS
> >>installs, the inability to easily inject a new module in there has
> >>created a problem whereby ROCKS does not function (as the underlying RH
> >>does not function).
> >>
> >>
> >>
--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615
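[Editor's note: the pcitable line in the quoted instructions can be derived mechanically from the /etc/sysconfig/hwconf stanza also quoted there. A hedged sketch: the function name `mk_pcitable_line` is illustrative, and the quoted description string is taken from the message, not from any authoritative PCI id database.]

```shell
# mk_pcitable_line turns a hwconf-style stanza (stdin) into a pcitable
# line (stdout). The field names match the /etc/sysconfig/hwconf excerpt
# quoted in the message; the description string is illustrative.
mk_pcitable_line() {
  awk '
    /^vendorId:/ { ven = $2 }
    /^deviceId:/ { dev = $2 }
    /^driver:/   { drv = $2 }
    END {
      printf "0x%s\t0x%s\t\"%s\"\t\"Intel Corp.|82546EB Gigabit Ethernet Controller\"\n",
             ven, dev, drv
    }
  '
}

# The stanza quoted above produces the line Joe says to add (tab-separated here):
printf 'driver: e1000\nvendorId: 8086\ndeviceId: 1013\n' | mk_pcitable_line
```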



From jlkaiser at fnal.gov Thu Dec 11 11:33:03 2003
From: jlkaiser at fnal.gov (Joe Kaiser)
Date: Thu, 11 Dec 2003 13:33:03 -0600
Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets ...
In-Reply-To: <3FD8C1CC.20700@physics.ucsd.edu>
References: <1071007177.18100.58.camel@squash.scalableinformatics.com>
 <1071160088.18486.25.camel@nietzsche.fnal.gov>
 <3FD8C1CC.20700@physics.ucsd.edu>
Message-ID: <1071171183.18486.28.camel@nietzsche.fnal.gov>
I am not sure.   Presumably, yes....

On Thu, 2003-12-11 at 13:13, Terrence Martin wrote:
> Hi Joe,
>
> Do you know if 2.3.2 can also benefit from the same small change?
>
> Terrence
>
> Joe Kaiser wrote:
> > Hi,
> >
> > I'm sorry, I thought I sent email to the list reporting how I did this.
> >
> > You have not said what motherboard you are using or what the error
> > exactly is. The instructions below are for the X5DPA-GG and the error
> > isn't reported as an error, I just get prompted to insert my driver.
> >
> > If it is the X5DPA-GG then 3.0.0 will support the e1000 but you have to
> > make a change to the pcitable on the initrd.img. The current pcitable
> > on the initrd.img does NOT have the proper deviceId for the e1000 for
> > this board. If you look in /etc/sysconfig/hwconf and search for the
> > e1000, you will find this:
> >
> > class: NETWORK
> > bus: PCI
> > detached: 0
> > device: eth
> > driver: e1000
> > desc: "Unknown vendor|Generic e1000 device"
> > vendorId: 8086
> > deviceId: 1013
> > subVendorId: 8086
> > subDeviceId: 1213
> > pciType: 1
> >
> > The device ID is 1013. If you look in the pcitable that comes off of
> > the initrd.img you will see that the highest the e1000 device id's go is
> > 1012. Just add in the proper line to the initrd.img in your /tftpboot
> > directory and it should work. Instructions are below.
> >
> > Here are the instructions:
> >
> > This should be done on the frontend:
> >
> > cd /tftpboot/X86PC/UNDI/pxelinux/
> > cp initrd.img initrd.img.orig
> > cp initrd.img /tmp
> > cd /tmp
> > mv initrd.img initrd.gz
> > gunzip initrd.gz
> > mkdir /mnt/loop
> > mount -o loop initrd /mnt/loop
> > cd /mnt/loop/modules/
> > vi pcitable
> >
> > Search for the e1000 drivers and add the following line:
> >
> > 0x8086 0x1013 "e1000" "Intel Corp.|82546EB Gigabit Ethernet
> > Controller"
> >
> > write the file
> >
> > cd /tmp
> > umount /mnt/loop
> > gzip initrd
> > mv initrd.gz initrd.img
> > mv initrd.img /tftpboot/X86PC/UNDI/pxelinux/
> >
> > Then boot the node.
> >
> > Hope this helps.
> >
> > Thanks,
> >
> > Joe
> >
> > On Tue, 2003-12-09 at 15:59, Joe Landman wrote:
> >
> >>Folks:
> >>
> >> As indicated previously, I am wrestling with a Supermicro based
> >>cluster. None of the RH distributions come with the correct E1000
> >>driver, so a new kernel is needed (in the boot CD, and for
> >>installation).
> >>
> >> The problem I am running into is that it isn't at all obvious/easy how
> >>to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable
> >>this thing to work. Following the examples in the documentation have
> >>not met with success. Running "rocks-dist cdrom" with the new kernels
> >>(2.4.23 works nicely on the nodes) in the force/RPMS directory generates
> >>a bootable CD with the original 2.4.18BOOT kernel.
> >>
> >> What I (and I think others) need, is a simple/easy to follow method
> >>that will generate a bootable CD with the correct linux kernel, and the
> >>correct modules.
> >>
> >> Is this in process somewhere? What would be tremendously helpful is
> >>if we can generate a binary module, and put that into the boot process
> >>by placing it into the force/modules/binary directory (assuming one
> >>exists) with the appropriate entry of a similar name in the
> >>force/modules/meta directory as a simple XML document giving pci-ids,
> >>description, name, etc.
> >>
> >> Anything close to this coming? Modules are killing future ROCKS
> >>installs, the inability to easily inject a new module in there has
> >>created a problem whereby ROCKS does not function (as the underlying RH
> >>does not function).
> >>
> >>
> >>
--
===================================================================
Joe Kaiser - Systems Administrator

Fermi Lab
CD/OSS-SCS                Never laugh at live dragons.
630-840-6444
jlkaiser at fnal.gov
===================================================================



From landman at scalableinformatics.com Thu Dec 11 11:51:51 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Thu, 11 Dec 2003 14:51:51 -0500
Subject: [Rocks-Discuss]driver disk for e1000 for rocks 3.0.0
Message-ID: <1071172311.6164.18.camel@squash.scalableinformatics.com>

Folks:

  I have built a slightly modified RedHat 7.3 driver disk with the
updated 5.2.22 e1000 driver. I verified that this does indeed work on
my systems (during initial portion of ROCKS install, I can now insmod
e1000 in the shell window and see the ethernet... this is a big change
from before). If you want the driver disk grab it from
http://scalableinformatics.com/downloads/newdrv.img . To use it while
installing a front end, type

     frontend dd

at the boot prompt (not just frontend). I believe it should work for
the compute nodes as well (I will test it soon). Now it is time to work
around the rest of the Supermicro "features".
--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615



From dtwright at uiuc.edu Thu Dec 11 12:32:54 2003
From: dtwright at uiuc.edu (Dan Wright)
Date: Thu, 11 Dec 2003 14:32:54 -0600
Subject: [Rocks-Discuss]3.0.0 problem:Does my namd job allocate to each node?
In-Reply-To: <BAY3-F25UBUhr3ukkwu000156fe@hotmail.com>
References: <BAY3-F25UBUhr3ukkwu000156fe@hotmail.com>
Message-ID: <20031211203254.GP6476@uiuc.edu>

NAMD2 needs some more information to be started on multiple nodes like that.
You need to give it a nodelist, in particular, so it knows where to run
itself. We run namd2 on several clusters here (UIUC chemistry department).

Below is a script used to exec namd2 with the right options, etc, on a
cluster. Below that is a script that automates the PBS job submission. Hope
this helps!

- Dan Wright
(dtwright at uiuc.edu)
(http://www.scs.uiuc.edu/)
(UNIX Systems Administrator, School of Chemical Sciences)
(333-1728)
-- namd2.csh --

#!/bin/csh
# Script to run NAMD2 on the cluster automatically.
# Courtesy of Jim Phillips.

setenv CONV_RSH ssh
setenv TMPDIR /tmp
setenv BINDIR /home/NAMD

if ( $?PBS_JOBID ) then
  if ( $?PBS_NODEFILE ) then
     set nodes = `cat $PBS_NODEFILE`
  else
     set nodes = localhost
  endif
  set nodefile = $TMPDIR/namd2.nodelist.$PBS_JOBID
  echo group main >! $nodefile
  foreach node ( $nodes )
     echo host $node >> $nodefile
  end
  $BINDIR/charmrun $BINDIR/namd2 +p$#nodes ++nodelist $nodefile $*
else
  $BINDIR/charmrun $BINDIR/namd2 ++local $*
endif

-------------
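[Editor's note: the file that namd2.csh builds for charmrun's ++nodelist option has a simple shape. A plain-sh rerun of the same loop, with example hostnames standing in for the contents of $PBS_NODEFILE:]

```shell
# Same nodelist-building logic as the csh script above, in plain sh,
# with example hostnames standing in for the contents of $PBS_NODEFILE.
nodefile=$(mktemp)
echo "group main" > "$nodefile"          # charmrun nodelist header line
for node in compute-0-0 compute-0-1 compute-0-2; do
  echo "host $node" >> "$nodefile"       # one "host" line per allocated node
done
cat "$nodefile"                          # what gets passed via ++nodelist
```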

Here's an example script using this to start namd2 on 8 uniprocessor nodes;
you'd just run it as "namd2-8p <jobfile>" to automatically do the PBS job
submission and everything.

-- namd2-8p --

#!/bin/bash
# This script runs namd2 on 8 nodes.
#

echo
echo "Please remember to specify the FULL PATH to your namd2 job file."
echo "If you haven't done that, please press ctrl-c now and re-run"
echo "this command with the full path."
echo
sleep 10

export SCRIPTFILE=/tmp/namd2-script.$USER.`date "+%s"`
export NAMD_SCRIPT=/usr/local/bin/namd2.csh

NAMD_CMD="$NAMD_SCRIPT $* > $HOME/namd2.out.`date '+%d%b%Y-%H:%M:%S'` 2>&1"

cat >$SCRIPTFILE <<EOF
#!/bin/bash
#PBS -l nodes=8

EOF
echo $NAMD_CMD >> $SCRIPTFILE
echo "exit" >> $SCRIPTFILE
/usr/apps/pbs/bin/qsub -V $SCRIPTFILE
sleep 5

rm -f $SCRIPTFILE

--------------


zhong wenyu said:
> I have built a Rocks cluster with four dual-Xeon computers to run NAMD: one
> frontend and the other three as compute nodes. With Intel's hyper-threading
> technology I have 16 CPUs in all.
> Now I have some troubles. Maybe someone can help me.
> I created the PBS script below, named mytask:
> #!/bin/csh
> #PBS -N NAMD
> #PBS -m be
> #PBS -l ncpus=8
> #PBS -l nodes=2
> #
> cd $PBS_O_WORKDIR
> /charmrun namd2 +p8 mytask.namd
>
> I typed:
> qsub mytask
> qrun N
>
> Then I used
> qstat -f N
>
> The feedback message showed (I'm sorry I can't copy the original message,
> just the meaning):
>
> host: compute-0-0/0+compute-0-0/1+compute-0-1/0+compute-0-1/1
> cpu used: 8
>
> It's strange: why 4 host slots and 8 CPUs used?
> But when I looked at Ganglia's cluster status, it showed me only one node
> in use (for example, compute-0-0); the other two were idle.
> I want to know whether the job was being done by one node or two,
> so I created a new task bound to compute-0-1, and the feedback showed no
> resource available.
> When the task ended I checked the information and found that the CPU time
> per step was half that of 4 CPUs (1 node), but the whole time (including
> wall time) was equal.
> Does my namd job allocate to each node?
> Please help me!
> Thanks
>
>
- Dan Wright
(dtwright at uiuc.edu)
(http://www.uiuc.edu/~dtwright)

-] ------------------------------ [-] -------------------------------- [-
``Weave a circle round him thrice, / And close your eyes with holy dread,
  For he on honeydew hath fed, / and drunk the milk of Paradise.''
       Samuel Taylor Coleridge, Kubla Khan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031211/417e39b4/attachment-0001.bin

From mjk at sdsc.edu Thu Dec 11 13:16:45 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Thu, 11 Dec 2003 13:16:45 -0800
Subject: [Rocks-Discuss]ssh_known_hosts and ganglia
In-Reply-To: <3FD8B153.6000205@inel.gov>
References: <3FD8B153.6000205@inel.gov>
Message-ID: <52B4A71C-2C1F-11D8-832A-000A95DA5638@sdsc.edu>

Download 3.1 (out very soon now) and poke around. Basically there is a
single SSH host key, and all the nodes have a copy. This kills the
"man in the middle" warning every time you reinstall.

       -mjk

On Dec 11, 2003, at 10:02 AM, Andrew Shewmaker wrote:

>   "Mason J. Katz" <mjk at sdsc.edu> wrote:
>
>   > We've also moved from this method to a single cluster-wide ssh key
>   for
>   > Rocks 3.1.
>
>   How does a single key work? I have successfully set up ssh host
>   based authentication for some non-Rocks systems using
>
>   http://www.omega.telia.net/vici/openssh/
>
>   (Note that OpenSSH_3.7.1p2 requires one more setting in addition
>   to those mentioned in the above url.
>
>   In <dir-of-ssh-conf-files>/ssh_config:
>   EnableSSHKeysign yes)
>
>   But I thought it still requires that each host in the cluster has a key...
>   am I wrong? Do you do it differently?
>
>   Thanks,
>
>   Andrew
>
>   --
>   Andrew Shewmaker, Associate Engineer
>   Phone: 1-208-526-1415
>   Idaho National Eng. and Environmental Lab.
>   P.O. Box 1625, M.S. 3605
>   Idaho Falls, Idaho 83415-3605
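[Editor's note: a hedged sketch of what the shared-host-key scheme Mason describes implies for ssh_known_hosts — one pattern entry whose key every node presents. The hostname patterns and the KEY placeholder are illustrative; Rocks 3.1's actual file layout may differ.]

```shell
# Every node presents the same host key, so a single wildcard entry in
# ssh_known_hosts matches the whole cluster, and a reinstalled node never
# triggers the man-in-the-middle warning. KEY is placeholder material.
KEY="AAAAB3...sharedkey..."           # hypothetical base64 key blob
known_hosts=$(mktemp)
cat > "$known_hosts" <<EOF
compute-*,frontend ssh-rsa ${KEY}
EOF
cat "$known_hosts"
```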



From landman at scalableinformatics.com      Thu Dec 11 13:36:44 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Thu, 11 Dec 2003 16:36:44 -0500
Subject: [Rocks-Discuss]ssh_known_hosts and ganglia
In-Reply-To: <52B4A71C-2C1F-11D8-832A-000A95DA5638@sdsc.edu>
References: <3FD8B153.6000205@inel.gov>
       <52B4A71C-2C1F-11D8-832A-000A95DA5638@sdsc.edu>
Message-ID: <1071178604.6164.46.camel@squash.scalableinformatics.com>

Hi Mason:

  ETA? I have a non-functional cluster that I think I can make functional with
3.1. I would be happy to be a real-world beta/gamma tester for it
(immediately, e.g. today). Please send me a URL. ...

Joe

On Thu, 2003-12-11 at 16:16, Mason J. Katz wrote:
> Download 3.1 (out very soon now) and poke around. Basically there is a
> single SSH host key, and all the nodes have a copy. This kills the
> "man in the middle" warning every time you reinstall.
>
>     -mjk
>
> On Dec 11, 2003, at 10:02 AM, Andrew Shewmaker wrote:
>
> > "Mason J. Katz" <mjk at sdsc.edu> wrote:
> >
> > > We've also moved from this method to a single cluster-wide ssh key
> > for
> > > Rocks 3.1.
> >
> > How does a single key work? I have successfully set up ssh host
> > based authentication for some non-Rocks systems using
> >
> > http://www.omega.telia.net/vici/openssh/
> >
> > (Note that OpenSSH_3.7.1p2 requires one more setting in addition
> > to those mentioned in the above url.
> >
> > In <dir-of-ssh-conf-files>/ssh_config:
> > EnableSSHKeysign yes)
> >
> > But I thought it still requires that each host in the cluster has a key...
> > am I wrong? Do you do it differently?
> >
> > Thanks,
> >
> > Andrew
> >
> > --
> > Andrew Shewmaker, Associate Engineer
> > Phone: 1-208-526-1415
> > Idaho National Eng. and Environmental Lab.
> > P.O. Box 1625, M.S. 3605
> > Idaho Falls, Idaho 83415-3605
--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615



From mjk at sdsc.edu Thu Dec 11 13:34:30 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Thu, 11 Dec 2003 13:34:30 -0800
Subject: [Rocks-Discuss]ssh_known_hosts and ganglia
In-Reply-To: <1071178604.6164.46.camel@squash.scalableinformatics.com>
References: <3FD8B153.6000205@inel.gov>
<52B4A71C-2C1F-11D8-832A-000A95DA5638@sdsc.edu>
<1071178604.6164.46.camel@squash.scalableinformatics.com>
Message-ID: <CD814510-2C21-11D8-832A-000A95DA5638@sdsc.edu>

We're too close to send out more betas right now, but if something bad
happens before Friday we'll reconsider. We are shooting for next week
- but absolutely before the holidays. ho ho ho. We recognize that our
delay on getting a current release out there is hurting new clusters,
and just having the latest Red Hat kernel is going to fix most of these
issues.

     -mjk


On Dec 11, 2003, at 1:36 PM, Joe Landman wrote:

> Hi Mason:
>
>    Eta? I have a non-functional cluster I think I can make function
> with
> 3.1. I would be happy to be a real world beta/gamma tester for it
> (immediately, eg. today). Please send me a URL. ...
>
> Joe
>
> On Thu, 2003-12-11 at 16:16, Mason J. Katz wrote:
>> Download 3.1 (out very soon now) and poke around. Basically there is
>> a
>> single SSH host key, and all the nodes have a copy. This kills the
>> "man in the middle" warning every time you reinstall.
>>
>>     -mjk
>>
>> On Dec 11, 2003, at 10:02 AM, Andrew Shewmaker wrote:
>>
>>> "Mason J. Katz" <mjk at sdsc.edu> wrote:
>>>
>>>> We've also moved from this method to a single cluster-wide ssh key
>>> for
>>>> Rocks 3.1.
>>>
>>> How does a single key work? I have successfully set up ssh host
>>> based authentication for some non-Rocks systems using
>>>
>>> http://www.omega.telia.net/vici/openssh/
>>>
>>> (Note that OpenSSH_3.7.1p2 requires one more setting in addition
>>> to those mentioned in the above url.
>>>
>>> In <dir-of-ssh-conf-files>/ssh_config:
>>> EnableSSHKeysign yes)
>>>
>>> But I thought it still requires that each host in the cluster has a key...
>>> am I wrong? Do you do it differently?
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> --
>>> Andrew Shewmaker, Associate Engineer
>>> Phone: 1-208-526-1415
>>> Idaho National Eng. and Environmental Lab.
>>> P.O. Box 1625, M.S. 3605
>>> Idaho Falls, Idaho 83415-3605
> --
> Joseph Landman, Ph.D
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web : http://scalableinformatics.com
> phone: +1 734 612 4615



From purikk at hotmail.com Thu Dec 11 15:06:17 2003
From: purikk at hotmail.com (Purushotham Komaravolu)
Date: Thu, 11 Dec 2003 18:06:17 -0500
Subject: [Rocks-Discuss]Kernal of Rocks 3.0
References: <200312112001.hBBK1IJ18815@postal.sdsc.edu>
Message-ID: <BAY1-DAV391Zg8eBpx700008b71@hotmail.com>

Hi,
     I am a newbie to Rocks and have a few questions. I would appreciate help
with those.
1) What kernel does the latest Rocks use? If it's not the latest, can I use
the latest kernel, and how?
2) Is there any way to have more than one frontend node for failover
redundancy?
3) Did anybody install the Penguin compilers over the cluster?
Thanks
Regards,
Puru


From bruno at rocksclusters.org Thu Dec 11 15:42:27 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Thu, 11 Dec 2003 15:42:27 -0800
Subject: [Rocks-Discuss]Kernal of Rocks 3.0
In-Reply-To: <BAY1-DAV391Zg8eBpx700008b71@hotmail.com>
References: <200312112001.hBBK1IJ18815@postal.sdsc.edu> <BAY1-DAV391Zg8eBpx700008b71@hotmail.com>
Message-ID: <AD988A9F-2C33-11D8-B821-000A95C4E3B4@rocksclusters.org>

> 1) what kernel does latest rocks use, if its not latest can I use
> latest
> kernel and how?
our upcoming release (scheduled to release next week) has kernel
version 2.4.21. additionally, the new release includes documentation on
how to build your own kernel RPM from a kernel.org tarball.

> 2) is there any way to have more than 1 fronend nodes for failover
> redundancy?

no, that has not yet been implemented.

> 3) did anybody install penguin compilers over the cluster

i apologize, but i'm not familiar with the penguin compiler. we do have
experience with gnu compilers, intel compilers and the portland group
compilers. additionally, some folks in the rocks community have also
successfully deployed the lahey compiler.

  - gb



From oconnor at ucsd.edu Thu Dec 11 14:29:46 2003
From: oconnor at ucsd.edu (Edward O'Connor)
Date: Thu, 11 Dec 2003 14:29:46 -0800
Subject: [Rocks-Discuss]ia64 compute nodes with ia32 frontends?
In-Reply-To: <ddptix48s6.fsf@oecpc11.ucsd.edu> (Edward O'Connor's message of
 "Fri, 22 Aug 2003 15:39:05 -0700")
References: <793188FE-D411-11D7-8529-000393C7898E@sdsc.edu>
      <ddptix48s6.fsf@oecpc11.ucsd.edu>
Message-ID: <ddwu930yzp.fsf_-_@oecpc11.ucsd.edu>

Hi everybody,

I'm trying to bring up some ia64 compute nodes in a cluster with an ia32
frontend. Normally, `cd /home/install; rocks-dist mirror dist` only sets
up the frontend to handle ia32 compute nodes. I tried to manhandle
`rocks-dist mirror` into mirroring the ia64 stuff from
ftp.rocksclusters.org by giving it the --arch=ia64 option, but that
didn't work, so I went ahead and did the mirroring step by hand.

After having done so, `rocks-dist dist` still doesn't do the right
thing. So, adding --arch=ia64 to that command yields this error output:

,----
| # rocks-dist --arch=ia64 dist
| Cleaning distribution
| Resolving versions (RPMs)
| Resolving versions (SRPMs)
| Adding support for rebuild distribution from source
| Creating files (symbolic links - fast)
| Creating symlinks to kickstart files
| Fixing Comps Database
| error - comps file is missing, skipping this step
| Generating hdlist (rpm database)
| error - could not find rpm anaconda-runtime
| error - could not find genhdlist
| Patching second stage loader (eKV, partioning, ...)
| error - could not find second stage, skipping this step
`----
So my question is, what do I need to do to the ia32 frontend to enable
it to kickstart an ia64 compute node? Thanks.


Ted

--
Edward O'Connor
oconnor at ucsd.edu



From gotero at linuxprophet.com Thu Dec 11 21:14:33 2003
From: gotero at linuxprophet.com (Glen Otero)
Date: Thu, 11 Dec 2003 21:14:33 -0800
Subject: Fwd: [Rocks-Discuss]RE: Have anyone successfully build a set of grid
compute nodes using Rocks?
Message-ID: <1279F870-2C62-11D8-AAC6-000A95CD8EC8@linuxprophet.com>

>
>
> We put two Itanium clusters and an x86 cluster together on a grid at
> SC2003 using Rocks 3.1 beta and the Grid Roll. Simple CA is installed
> on the cluster frontends for you, so all one has to do is create and
> exchange certificates and update the grid-mapfiles. This grid was a
> joint collaboration between SDSC, Promicro Systems and Callident.
>
> On Dec 11, 2003, at 12:08 AM, Nai Hong Hwa Francis wrote:
>
>>
>>
>>
>> Hi,
>>
>> Has anyone successfully built a set of grid compute nodes using Rocks
>> 3?
>> Anyone care to share?
>>
>>
>> Nai Hong Hwa Francis
>> Institute of Molecular and Cell Biology (A*STAR)
>> 30 Medical Drive
>> Singapore 117609.
>> DID: (65) 6874-6196
>>
>> -----Original Message-----
>> From: npaci-rocks-discussion-request at sdsc.edu
>> [mailto:npaci-rocks-discussion-request at sdsc.edu]
>> Sent: Thursday, December 11, 2003 11:54 AM
>> To: npaci-rocks-discussion at sdsc.edu
>> Subject: npaci-rocks-discussion digest, Vol 1 #642 - 4 msgs
>>
>> Send npaci-rocks-discussion mailing list submissions to
>>    npaci-rocks-discussion at sdsc.edu
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>
>> http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
>> or, via email, send a message with subject or body 'help' to
>>     npaci-rocks-discussion-request at sdsc.edu
>>
>>   You can reach the person managing the list at
>>      npaci-rocks-discussion-admin at sdsc.edu
>>
>>   When replying, please edit your Subject line so it is more specific
>>   than "Re: Contents of npaci-rocks-discussion digest..."
>>
>>
>>   Today's Topics:
>>
>>      1. RE: Do you have a list of the various models of Gigabit Ethernet
>>   Interfaces compatible to Rocks 3? (Nai Hong Hwa Francis)
>>      2. Rocks 3.0.0 (Terrence Martin)
>>      3. Re: "TypeError: loop over non-sequence" when trying
>>          to build CD distro (V. Rowley)
>>
>>   --__--__--
>>
>>   Message: 1
>>   Date: Thu, 11 Dec 2003 09:45:18 +0800
>>   From: "Nai Hong Hwa Francis" <naihh at imcb.a-star.edu.sg>
>>   To: <npaci-rocks-discussion at sdsc.edu>
>>   Subject: [Rocks-Discuss]RE: Do you have a list of the various models
>>   of
>>   Gigabit Ethernet Interfaces compatible to Rocks 3?
>>
>>
>>
>>   Hi All,
>>
>>   Do you have a list of the various gigabit Ethernet interfaces that are
>>   compatible with Rocks 3?
>>
>>   I am changing my nodes' connectivity from 10/100 to 1000.
>>
>>   Has anyone done that, and what are the differences in performance or
>>   turnaround time?
>>
>>
>>
>>   Thanks and Regards
>>
>>   Nai Hong Hwa Francis
>>   Institute of Molecular and Cell Biology (A*STAR)
>>   30 Medical Drive
>>   Singapore 117609.
>>   DID: (65) 6874-6196
>>
>>   -----Original Message-----
>>   From: npaci-rocks-discussion-request at sdsc.edu
>>   [mailto:npaci-rocks-discussion-request at sdsc.edu]
>>   Sent: Thursday, December 11, 2003 9:25 AM
>>   To: npaci-rocks-discussion at sdsc.edu
>>   Subject: npaci-rocks-discussion digest, Vol 1 #641 - 13 msgs
>>
>>   Send npaci-rocks-discussion mailing list submissions to
>>      npaci-rocks-discussion at sdsc.edu
>>
>>   To subscribe or unsubscribe via the World Wide Web, visit
>>   http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
>>   or, via email, send a message with subject or body 'help' to
>>      npaci-rocks-discussion-request at sdsc.edu
>>
>>   You can reach the person managing the list at
>>      npaci-rocks-discussion-admin at sdsc.edu
>>
>>   When replying, please edit your Subject line so it is more specific
>>   than "Re: Contents of npaci-rocks-discussion digest..."
>>
>>
>>   Today's Topics:
>>
>>      1. Non-homogenous legacy hardware (Chris Dwan (CCGB))
>>      2. Error during Make when building a new install floppy (Terrence
>>   Martin)
>>      3. Re: Error during Make when building a new install floppy (Tim
>>   Carlson)
>>      4. Re: Non-homogenous legacy hardware (Tim Carlson)
>>      5. ssh_known_hosts and ganglia (Jag)
>>      6. Re: ssh_known_hosts and ganglia (Mason J. Katz)
>>      7. "TypeError: loop over non-sequence" when trying to build CD
>>   distro (V. Rowley)
>>      8. Re: one node short in "labels" (Greg Bruno)
>>      9. Re: "TypeError: loop over non-sequence" when trying to build CD
>>   distro (Mason J. Katz)
>>     10. Re: "TypeError: loop over non-sequence" when trying
>>           to build CD distro (V. Rowley)
>>     11. Re: "TypeError: loop over non-sequence" when trying to
>>           build CD distro (Tim Carlson)
>>
>>   -- __--__--
>>   Message: 1
>>   Date: Wed, 10 Dec 2003 14:04:53 -0600 (CST)
>>   From: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu>
>>   To: npaci-rocks-discussion at sdsc.edu
>>   Subject: [Rocks-Discuss]Non-homogenous legacy hardware
>>
>>
>>   I am integrating legacy systems into a ROCKS cluster, and have hit a
>>   snag with the auto-partition configuration: The new (old) systems have
>>   SCSI disks, while old (new) ones contain IDE. This is a non-issue so
>>   long as the initial install does its default partitioning. However, I
>>   have a "replace-auto-partition.xml" file which is unworkable for the SCSI
>>   based systems since it makes specific reference to "hda" rather than
>>   "sda."
>>
>>   I would like to have a site-nodes/replace-auto-partition.xml file with a
>>   conditional such that "hda" or "sda" is used, based on the name of the
>>   node (or some other criterion).
>>
>>   Is this possible?
>>
>>   Thanks in advance. If this is out there on the mailing list archives,
>>   a pointer would be greatly appreciated.
>>
>>   -Chris Dwan
>>    The University of Minnesota
>>
>>   -- __--__--
>>   Message: 2
>>   Date: Wed, 10 Dec 2003 12:09:11 -0800
>>   From: Terrence Martin <tmartin at physics.ucsd.edu>
>>   To: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu>
>>   Subject: [Rocks-Discuss]Error during Make when building a new install
>>   floppy
>>
>>   I get the following error when I try to rebuild a boot floppy for
>>   rocks.
>>
>>   This is with the default CVS checkout with an update today according
>>   to=20
>>   the rocks userguide. I have not actually attempted to make any
>>   changes.
>>
>>   make[3]: Leaving directory
>>   `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3/loader'
>>   make[2]: Leaving directory
>>   `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3'
>>   strip -o loader         anaconda-7.3/loader/loader
>>   strip: anaconda-7.3/loader/loader: No such file or directory
>>   make[1]: *** [loader] Error 1
>>   make[1]: Leaving directory
>>   `/home/install/rocks/src/rocks/boot/7.3/loader'
>>   make: *** [loader] Error 2
>>
>>   Of course I could avoid all of this altogether and just put my binary
>>   module into the appropriate location in the boot image.
>>
>>   Would it be correct to modify the following image file with my
>>   changes
>>   and then write it to a floppy via dd?
>>
>>   /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img
>>
>>   Basically I am injecting an updated e1000 driver with changes to
>>   pcitable to support the address of my gigabit cards.
>>
>>   Terrence
>>
>>
>>   -- __--__--
>>   Message: 3
>>   Date: Wed, 10 Dec 2003 12:40:41 -0800 (PST)
>>   From: Tim Carlson <tim.carlson at pnl.gov>
>>   Subject: Re: [Rocks-Discuss]Error during Make when building a new
>>   install floppy
>>   To: Terrence Martin <tmartin at physics.ucsd.edu>
>>   Cc: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu>
>> Reply-to: Tim Carlson <tim.carlson at pnl.gov>
>>
>> On Wed, 10 Dec 2003, Terrence Martin wrote:
>>
>>> I get the following error when I try to rebuild a boot floppy for
>> rocks.
>>>
>>
>> You can't make a boot floppy with Rocks 3.0. That isn't supported. Or
>> at
>> least it wasn't the last time I checked
>>
>>> Of course I could avoid all of this together and just put my binary
>>> module into the appropriate location in the boot image.
>>>
>>> Would it be correct to modify the following image file with my
>>> changes
>>> and then write it to a floppy via dd?
>>>
>>>
>> /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img
>>>
>>> Basically I am injecting an updated e1000 driver with changes to
>>> pcitable to support the address of my gigabit cards.
>>
>> Modifying the bootnet.img is about 1/3 of what you need to do if you go
>> down that path. You also need to work on netstg1.img, and you'll need to
>> update the driver in the kernel rpm that gets installed on the box.
>> None of this is trivial.
>>
>> If it were me, I would go down the same path I took for updating the
>> AIC79XX driver
>>
>> https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/003533.html
>>
>> Tim
>>
>> Tim Carlson
>> Voice: (509) 376 3423
>> Email: Tim.Carlson at pnl.gov
>> EMSL UNIX System Support
>>
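[Editorial note: the bootnet.img route discussed above can be sketched roughly as follows. This is a hypothetical outline only; the file layout inside the image and the mount point are illustrative, and as Tim says it covers only the boot floppy, not netstg1.img or the kernel rpm on the installed node.]

```shell
#!/bin/sh
# Hypothetical sketch of the bootnet.img approach from this thread.
# Paths are illustrative. The commands are echoed so the sequence can
# be read without root; drop "echo" to actually run them.
IMG=bootnet.img
MNT=/mnt/bootimg

echo "mount -o loop $IMG $MNT"          # the image holds a mountable filesystem
echo "cp e1000.o $MNT/modules/"         # inject the updated driver (path is a guess)
echo "edit $MNT/modules/pcitable"       # add the gigabit card's PCI id
echo "umount $MNT"
echo "dd if=$IMG of=/dev/fd0 bs=1440k"  # write the modified image to floppy
```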
>>
>> -- __--__--
>> Message: 4
>> Date: Wed, 10 Dec 2003 12:52:38 -0800 (PST)
>> From: Tim Carlson <tim.carlson at pnl.gov>
>> Subject: Re: [Rocks-Discuss]Non-homogenous legacy hardware
>> To: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu>
>> Cc: npaci-rocks-discussion at sdsc.edu
>> Reply-to: Tim Carlson <tim.carlson at pnl.gov>
>>
>> On Wed, 10 Dec 2003, Chris Dwan (CCGB) wrote:
>>
>>>
>>> I am integrating legacy systems into a ROCKS cluster, and have hit a
>>> snag with the auto-partition configuration: The new (old) systems
>> have
>>> SCSI disks, while old (new) ones contain IDE. This is a non-issue so
>>> long as the initial install does its default partitioning. However, I
>>> have a "replace-auto-partition.xml" file which is unworkable for the
>>> SCSI-based systems since it makes specific reference to "hda" rather
>>> than "sda."
>>
>> If you have just a single drive, then you should be able to skip the
>> "--ondisk" bits of your "part" command
>>
>> Otherwise, you would first have to do something ugly like the
>> following:
>>
>> http://penguin.epfl.ch/slides/kickstart/ks.cfg
>>
>> You could probably (maybe) wrap most of that in an
>> <eval sh="bash">
>> </eval>
>>
>> block in the <main> block.
>>
>> Just guessing.. haven't tried this.
>>
>> Tim
>>
>> Tim Carlson
>> Voice: (509) 376 3423
>> Email: Tim.Carlson at pnl.gov
>> EMSL UNIX System Support
>>
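[Editorial note: a hedged sketch of what Tim suggests above. Dropping the "--ondisk" option from each part line in replace-auto-partition.xml lets anaconda place the partitions on whatever single drive it finds, hda or sda alike. The sizes and mount points below are made up, and the exact node-XML syntax may differ across Rocks versions; this is untested.]

```xml
<!-- replace-auto-partition.xml: hypothetical sketch, untested.
     No "ondisk" option, so the single drive is auto-selected,
     whether it shows up as hda (IDE) or sda (SCSI). -->
<main>
  <part> / --size 4096 </part>
  <part> swap --size 1024 </part>
  <part> /state/partition1 --size 1 --grow </part>
</main>
```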
>>
>> -- __--__--
>> Message: 5
>> From: Jag <agrajag at dragaera.net>
>> To: npaci-rocks-discussion at sdsc.edu
>> Date: Wed, 10 Dec 2003 13:21:07 -0500
>> Subject: [Rocks-Discuss]ssh_known_hosts and ganglia
>>
>> I noticed a previous post on this list
>> (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001934.html)
>> indicating that Rocks distributes ssh keys for all the nodes over
>> ganglia. Can anyone enlighten me as to how this is done?
>>
>> I looked through the ganglia docs and didn't see anything indicating
>> how to do this, so I'm assuming Rocks made some changes. Unfortunately
>> the rocks iso images don't seem to contain srpms, so I'm now coming
>> here. What did Rocks do to ganglia to make the distribution of ssh
>> keys work?
>>
>> Also, does anyone know where Rocks SRPMs can be found? I've done
>> quite a bit of searching, but haven't found them anywhere.
>>
>>
>> -- __--__--
>> Message: 6
>> Cc: npaci-rocks-discussion at sdsc.edu
>> From: "Mason J. Katz" <mjk at sdsc.edu>
>> Subject: Re: [Rocks-Discuss]ssh_known_hosts and ganglia
>> Date: Wed, 10 Dec 2003 14:39:15 -0800
>> To: Jag <agrajag at dragaera.net>
>>
>> Most of the SRPMS are on our FTP site, but we've screwed this up
>> before. The SRPMS are entirely Rocks specific, so they are of little
>> value outside of Rocks. You can also check out our CVS tree
>> (cvs.rocksclusters.org), where rocks/src/ganglia shows what we add. We
>> have a ganglia-python package we created to allow us to write our own
>> metrics at a higher level than the provided gmetric application. We've
>> also moved from this method to a single cluster-wide ssh key for Rocks
>> 3.1.
>>
>>    -mjk
>>
>> On Dec 10, 2003, at 10:21 AM, Jag wrote:
>>
>>> I noticed a previous post on this list
>>> (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001934.html)
>>> indicating that Rocks distributes ssh keys for all the nodes over
>>> ganglia. Can anyone enlighten me as to how this is done?
>>>
>>> I looked through the ganglia docs and didn't see anything indicating
>>> how to do this, so I'm assuming Rocks made some changes. Unfortunately
>>> the rocks iso images don't seem to contain srpms, so I'm now coming here.
>>> What did Rocks do to ganglia to make the distribution of ssh keys
>> work?
>>>
>>> Also, does anyone know where Rocks SRPMs can be found? I've done
>>> quite a bit of searching, but haven't found them anywhere.
>>
>>
>> -- __--__--
>> Message: 7
>> Date: Wed, 10 Dec 2003 14:43:49 -0800
>> From: "V. Rowley" <vrowley at ucsd.edu>
>> To: npaci-rocks-discussion at sdsc.edu
>> Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when
>> trying to build CD distro
>>
>> When I run this:
>>
>> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
>> rocks-dist --dist=cdrom cdrom
>>
>> on a server installed with ROCKS 3.0.0, I eventually get this:
>>
>>> Cleaning distribution
>>> Resolving versions (RPMs)
>>> Resolving versions (SRPMs)
>>> Adding support for rebuild distribution from source
>>> Creating files (symbolic links - fast)
>>> Creating symlinks to kickstart files
>>> Fixing Comps Database
>>> Generating hdlist (rpm database)
>>> Patching second stage loader (eKV, partioning, ...)
>>>      patching "rocks-ekv" into distribution ...
>>>      patching "rocks-piece-pipe" into distribution ...
>>>      patching "PyXML" into distribution ...
>>>      patching "expat" into distribution ...
>>>      patching "rocks-pylib" into distribution ...
>>>      patching "MySQL-python" into distribution ...
>>>      patching "rocks-kickstart" into distribution ...
>>>      patching "rocks-kickstart-profiles" into distribution ...
>>>      patching "rocks-kickstart-dtds" into distribution ...
>>>      building CRAM filesystem ...
>>> Cleaning distribution
>>> Resolving versions (RPMs)
>>> Resolving versions (SRPMs)
>>> Creating symlinks to kickstart files
>>> Generating hdlist (rpm database)
>>> Segregating RPMs (rocks, non-rocks)
>>> sh: ./kickstart.cgi: No such file or directory
>>> sh: ./kickstart.cgi: No such file or directory
>>> Traceback (innermost last):
>>>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
>>>      app.run()
>>>   File "/opt/rocks/bin/rocks-dist", line 623, in run
>>>      eval('self.command_%s()' % (command))
>>>   File "<string>", line 0, in ?
>>>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>>      builder.build()
>>>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>>      (rocks, nonrocks) = self.segregateRPMS()
>>>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>> segregateRPMS
>>>      for pkg in ks.getSection('packages'):
>>> TypeError: loop over non-sequence
>>
>> Any ideas?
>>
>> --
>> Vicky Rowley                              email: vrowley at ucsd.edu
>> Biomedical Informatics Research Network      work: (858) 536-5980
>> University of California, San Diego           fax: (858) 822-0828
>> 9500 Gilman Drive
>> La Jolla, CA 92093-0715
>>
>>
>> See pictures from our trip to China at
>> http://www.sagacitech.com/Chinaweb
>>
>>
>> -- __--__--
>> Message: 8
>> Cc: rocks <npaci-rocks-discussion at sdsc.edu>
>> From: Greg Bruno <bruno at rocksclusters.org>
>> Subject: Re: [Rocks-Discuss]one node short in "labels"
>> Date: Wed, 10 Dec 2003 15:12:49 -0800
>> To: Vincent Fox <vincent_b_fox at yahoo.com>
>>
>>> So I go to the "labels" selection on the web page to print out the
>>> pretty labels. What a nice idea by the way!
>>>
>>> EXCEPT....it's one node short! I go up to 0-13 and this stops at
>>> 0-12. Any ideas where I should check to fix this?
>>
>> yeah, we found this corner case -- it'll be fixed in the next release.
>>
>> thanks for the bug report.
>>
>>   - gb
>>
>>
>> -- __--__--
>> Message: 9
>> Cc: npaci-rocks-discussion at sdsc.edu
>> From: "Mason J. Katz" <mjk at sdsc.edu>
>> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when
>> trying to build CD distro
>> Date: Wed, 10 Dec 2003 15:16:27 -0800
>> To: "V. Rowley" <vrowley at ucsd.edu>
>>
>> It looks like someone moved the profiles directory to profiles.orig.
>>
>>    -mjk
>>
>>
>> [root at rocks14 install]# ls -l
>> total 56
>> drwxr-sr-x    3 root      wheel        4096 Dec 10 21:16 cdrom
>> drwxrwsr-x    5 root      wheel        4096 Dec 10 20:38 contrib.orig
>> drwxr-sr-x    3 root      wheel        4096 Dec 10 21:07 ftp.rocksclusters.org
>> drwxr-sr-x    3 root      wheel        4096 Dec 10 20:38 ftp.rocksclusters.org.orig
>> -r-xrwsr-x    1 root      wheel       19254 Sep 3 12:40 kickstart.cgi
>> drwxr-xr-x    3 root      root         4096 Dec 10 20:38 profiles.orig
>> drwxr-sr-x    3 root      wheel        4096 Dec 10 21:15 rocks-dist
>> drwxrwsr-x    3 root      wheel        4096 Dec 10 20:38
>> rocks-dist.orig
>> drwxr-sr-x     3 root     wheel        4096 Dec 10 21:02 src
>> drwxr-sr-x     4 root     wheel        4096 Dec 10 20:49 src.foo
>> On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:
>>
>>> When I run this:
>>>
>>> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
>>> rocks-dist --dist=cdrom cdrom
>>>
>>> on a server installed with ROCKS 3.0.0, I eventually get this:
>>>
>>>> Cleaning distribution
>>>> Resolving versions (RPMs)
>>>> Resolving versions (SRPMs)
>>>> Adding support for rebuild distribution from source
>>>> Creating files (symbolic links - fast)
>>>> Creating symlinks to kickstart files
>>>> Fixing Comps Database
>>>> Generating hdlist (rpm database)
>>>> Patching second stage loader (eKV, partioning, ...)
>>>>      patching "rocks-ekv" into distribution ...
>>>>      patching "rocks-piece-pipe" into distribution ...
>>>>      patching "PyXML" into distribution ...
>>>>      patching "expat" into distribution ...
>>>>      patching "rocks-pylib" into distribution ...
>>>>      patching "MySQL-python" into distribution ...
>>>>      patching "rocks-kickstart" into distribution ...
>>>>      patching "rocks-kickstart-profiles" into distribution ...
>>>>      patching "rocks-kickstart-dtds" into distribution ...
>>>>      building CRAM filesystem ...
>>>> Cleaning distribution
>>>> Resolving versions (RPMs)
>>>> Resolving versions (SRPMs)
>>>> Creating symlinks to kickstart files
>>>> Generating hdlist (rpm database)
>>>> Segregating RPMs (rocks, non-rocks)
>>>> sh: ./kickstart.cgi: No such file or directory
>>>> sh: ./kickstart.cgi: No such file or directory
>>>> Traceback (innermost last):
>>>>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
>>>>      app.run()
>>>>   File "/opt/rocks/bin/rocks-dist", line 623, in run
>>>>      eval('self.command_%s()' % (command))
>>>>   File "<string>", line 0, in ?
>>>>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>>>      builder.build()
>>>>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>>>      (rocks, nonrocks) = self.segregateRPMS()
>>>>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>>>> segregateRPMS
>>>>      for pkg in ks.getSection('packages'):
>>>> TypeError: loop over non-sequence
>>>
>>> Any ideas?
>>>
>>> --
>>> Vicky Rowley                              email: vrowley at ucsd.edu
>>> Biomedical Informatics Research Network      work: (858) 536-5980
>>> University of California, San Diego           fax: (858) 822-0828
>>> 9500 Gilman Drive
>>> La Jolla, CA 92093-0715
>>>
>>>
>>> See pictures from our trip to China at
>>> http://www.sagacitech.com/Chinaweb
>>
>>
>> -- __--__--
>> Message: 10
>> Date: Wed, 10 Dec 2003 16:50:16 -0800
>> From: "V. Rowley" <vrowley at ucsd.edu>
>> To: "Mason J. Katz" <mjk at sdsc.edu>
>> CC: npaci-rocks-discussion at sdsc.edu
>> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when
>> trying
>> to build CD distro
>>
>> Yep, I did that, but only *AFTER* getting the error. [Thought it was
>> generated by the rocks-dist sequence, but apparently not.] Go ahead.
>> Move it back. Same difference.
>>
>> Vicky
>>
>> Mason J. Katz wrote:
>>> It looks like someone moved the profiles directory to profiles.orig.
>>>
>>>      -mjk
>>>
>>>
>>> [root at rocks14 install]# ls -l
>>> total 56
>>> drwxr-sr-x     3 root    wheel         4096 Dec 10 21:16 cdrom
>>> drwxrwsr-x     5 root    wheel         4096 Dec 10 20:38 contrib.orig
>>> drwxr-sr-x     3 root    wheel         4096 Dec 10 21:07 ftp.rocksclusters.org
>>> drwxr-sr-x     3 root    wheel         4096 Dec 10 20:38 ftp.rocksclusters.org.orig
>>> -r-xrwsr-x     1 root    wheel        19254 Sep 3 12:40 kickstart.cgi
>>> drwxr-xr-x     3 root    root          4096 Dec 10 20:38 profiles.orig
>>> drwxr-sr-x     3 root    wheel         4096 Dec 10 21:15 rocks-dist
>>> drwxrwsr-x     3 root    wheel         4096 Dec 10 20:38
>> rocks-dist.orig
>>> drwxr-sr-x     3 root    wheel         4096 Dec 10 21:02 src
>>> drwxr-sr-x     4 root    wheel         4096 Dec 10 20:49 src.foo
>>> On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:
>>>
>>>> When I run this:
>>>>
>>>> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
>>>> rocks-dist --dist=cdrom cdrom
>>>>
>>>> on a server installed with ROCKS 3.0.0, I eventually get this:
>>>>
>>>>> Cleaning distribution
>>>>> Resolving versions (RPMs)
>>>>> Resolving versions (SRPMs)
>>>>> Adding support for rebuild distribution from source
>>>>> Creating files (symbolic links - fast)
>>>>> Creating symlinks to kickstart files
>>>>> Fixing Comps Database
>>>>> Generating hdlist (rpm database)
>>>>> Patching second stage loader (eKV, partioning, ...)
>>>>>      patching "rocks-ekv" into distribution ...
>>>>>      patching "rocks-piece-pipe" into distribution ...
>>>>>      patching "PyXML" into distribution ...
>>>>>      patching "expat" into distribution ...
>>>>>      patching "rocks-pylib" into distribution ...
>>>>>      patching "MySQL-python" into distribution ...
>>>>>      patching "rocks-kickstart" into distribution ...
>>>>>      patching "rocks-kickstart-profiles" into distribution ...
>>>>>      patching "rocks-kickstart-dtds" into distribution ...
>>>>>      building CRAM filesystem ...
>>>>> Cleaning distribution
>>>>> Resolving versions (RPMs)
>>>>> Resolving versions (SRPMs)
>>>>> Creating symlinks to kickstart files
>>>>> Generating hdlist (rpm database)
>>>>> Segregating RPMs (rocks, non-rocks)
>>>>> sh: ./kickstart.cgi: No such file or directory
>>>>> sh: ./kickstart.cgi: No such file or directory
>>>>> Traceback (innermost last):
>>>>>    File "/opt/rocks/bin/rocks-dist", line 807, in ?
>>>>>      app.run()
>>>>>    File "/opt/rocks/bin/rocks-dist", line 623, in run
>>>>>      eval('self.command_%s()' % (command))
>>>>>    File "<string>", line 0, in ?
>>>>>    File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>>>>      builder.build()
>>>>>    File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>>>>      (rocks, nonrocks) = self.segregateRPMS()
>>>>>    File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>>>>> segregateRPMS
>>>>>      for pkg in ks.getSection('packages'):
>>>>> TypeError: loop over non-sequence
>>>>
>>>>
>>>> Any ideas?
>>>>
>>>> --
>>>> Vicky Rowley                              email: vrowley at ucsd.edu
>>>> Biomedical Informatics Research Network      work: (858) 536-5980
>>>> University of California, San Diego           fax: (858) 822-0828
>>>> 9500 Gilman Drive
>>>> La Jolla, CA 92093-0715
>>>>
>>>>
>>>> See pictures from our trip to China at
>> http://www.sagacitech.com/Chinaweb
>>>
>>>
>>>
>>
>> --
>> Vicky Rowley                              email: vrowley at ucsd.edu
>> Biomedical Informatics Research Network      work: (858) 536-5980
>> University of California, San Diego           fax: (858) 822-0828
>> 9500 Gilman Drive
>> La Jolla, CA 92093-0715
>>
>>
>> See pictures from our trip to China at
>> http://www.sagacitech.com/Chinaweb
>>
>>
>> -- __--__--
>> Message: 11
>> Date: Wed, 10 Dec 2003 17:23:25 -0800 (PST)
>> From: Tim Carlson <tim.carlson at pnl.gov>
>> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when
>> trying to
>> build CD distro
>> To: "V. Rowley" <vrowley at ucsd.edu>
>> Cc: "Mason J. Katz" <mjk at sdsc.edu>, npaci-rocks-discussion at sdsc.edu
>> Reply-to: Tim Carlson <tim.carlson at pnl.gov>
>>
>> On Wed, 10 Dec 2003, V. Rowley wrote:
>>
>> Did you remove python by chance? kickstart.cgi calls python directly
>> in /usr/bin/python while rocks-dist does an "env python"
>>
>> Tim
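
[Editorial note: the distinction Tim draws can be seen with a tiny shell check. The script below is only an illustration, not part of Rocks; the interpreter paths are the ones named in the thread.]

```shell
#!/bin/sh
# "env python"-style lookup resolves through $PATH, so any installed
# python will do.
command -v python || echo "no python anywhere on PATH"
# A hardcoded path, like the one kickstart.cgi uses, needs this exact file.
PYBIN=/usr/bin/python
[ -x "$PYBIN" ] && echo "$PYBIN present" || echo "$PYBIN missing: a hardcoded shebang would fail"
```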
>>
>>> Yep, I did that, but only *AFTER* getting the error. [Thought it was
>>> generated by the rocks-dist sequence, but apparently not.] Go ahead.
>>> Move it back. Same difference.
>>>
>>> Vicky
>>>
>>> Mason J. Katz wrote:
>>>> It looks like someone moved the profiles directory to profiles.orig.
>>>>
>>>>      -mjk
>>>>
>>>>
>>>> [root at rocks14 install]# ls -l
>>>> total 56
>>>> drwxr-sr-x    3 root     wheel         4096 Dec 10 21:16 cdrom
>>>> drwxrwsr-x    5 root     wheel         4096 Dec 10 20:38 contrib.orig
>>>> drwxr-sr-x    3 root     wheel         4096 Dec 10 21:07
>>>> ftp.rocksclusters.org
>>>> drwxr-sr-x    3 root     wheel         4096 Dec 10 20:38
>>>> ftp.rocksclusters.org.orig
>>>> -r-xrwsr-x    1 root     wheel        19254 Sep 3 12:40
>> kickstart.cgi
>>>> drwxr-xr-x    3 root     root          4096 Dec 10 20:38
>> profiles.orig
>>>> drwxr-sr-x    3 root     wheel         4096 Dec 10 21:15 rocks-dist
>>>> drwxrwsr-x    3 root     wheel         4096 Dec 10 20:38
>> rocks-dist.orig
>>>> drwxr-sr-x    3 root     wheel         4096 Dec 10 21:02 src
>>>> drwxr-sr-x    4 root     wheel         4096 Dec 10 20:49 src.foo
>>>> On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:
>>>>
>>>>> When I run this:
>>>>>
>>>>> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
>>>>> rocks-dist --dist=cdrom cdrom
>>>>>
>>>>> on a server installed with ROCKS 3.0.0, I eventually get this:
>>>>>
>>>>>> Cleaning distribution
>>>>>> Resolving versions (RPMs)
>>>>>> Resolving versions (SRPMs)
>>>>>> Adding support for rebuild distribution from source
>>>>>> Creating files (symbolic links - fast)
>>>>>> Creating symlinks to kickstart files
>>>>>> Fixing Comps Database
>>>>>> Generating hdlist (rpm database)
>>>>>> Patching second stage loader (eKV, partioning, ...)
>>>>>>     patching "rocks-ekv" into distribution ...
>>>>>>     patching "rocks-piece-pipe" into distribution ...
>>>>>>     patching "PyXML" into distribution ...
>>>>>>     patching "expat" into distribution ...
>>>>>>     patching "rocks-pylib" into distribution ...
>>>>>>     patching "MySQL-python" into distribution ...
>>>>>>     patching "rocks-kickstart" into distribution ...
>>>>>>     patching "rocks-kickstart-profiles" into distribution ...
>>>>>>     patching "rocks-kickstart-dtds" into distribution ...
>>>>>>     building CRAM filesystem ...
>>>>>> Cleaning distribution
>>>>>> Resolving versions (RPMs)
>>>>>> Resolving versions (SRPMs)
>>>>>> Creating symlinks to kickstart files
>>>>>> Generating hdlist (rpm database)
>>>>>> Segregating RPMs (rocks, non-rocks)
>>>>>> sh: ./kickstart.cgi: No such file or directory
>>>>>> sh: ./kickstart.cgi: No such file or directory
>>>>>> Traceback (innermost last):
>>>>>>   File "/opt/rocks/bin/rocks-dist", line 807, in ?
>>>>>>     app.run()
>>>>>>   File "/opt/rocks/bin/rocks-dist", line 623, in run
>>>>>>     eval('self.command_%s()' % (command))
>>>>>>   File "<string>", line 0, in ?
>>>>>>   File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>>>>>     builder.build()
>>>>>>   File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>>>>>     (rocks, nonrocks) = self.segregateRPMS()
>>>>>>   File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>>>>>> segregateRPMS
>>>>>>     for pkg in ks.getSection('packages'):
>>>>>> TypeError: loop over non-sequence
>>>>>
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> --
>>>>> Vicky Rowley                             email: vrowley at ucsd.edu
>>>>> Biomedical Informatics Research Network     work: (858) 536-5980
>>>>> University of California, San Diego          fax: (858) 822-0828
>>>>> 9500 Gilman Drive
>>>>> La Jolla, CA 92093-0715
>>>>>
>>>>>
>>>>> See pictures from our trip to China at
>> http://www.sagacitech.com/Chinaweb
>>>>
>>>>
>>>>
>>>
>>> --
>>> Vicky Rowley                             email: vrowley at ucsd.edu
>>> Biomedical Informatics Research Network     work: (858) 536-5980
>>> University of California, San Diego          fax: (858) 822-0828
>>> 9500 Gilman Drive
>>> La Jolla, CA 92093-0715
>>>
>>>
>>> See pictures from our trip to China at
>> http://www.sagacitech.com/Chinaweb
>>>
>>>
>>
>>
>>
>>
>> -- __--__--
>> _______________________________________________
>> npaci-rocks-discussion mailing list
>> npaci-rocks-discussion at sdsc.edu
>> http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
>>
>>
>> End of npaci-rocks-discussion Digest
>>
>>
>> DISCLAIMER:
>> This email is confidential and may be privileged. If you are not the
>> intended recipient, please delete it and notify us immediately.
>> Please do not copy or use it for any purpose, or disclose its
>> contents to any other person as it may be an offence under the
>> Official Secrets Act. Thank you.
>>
>> --__--__--
>>
>> Message: 2
>> Date: Wed, 10 Dec 2003 18:03:41 -0800
>> From: Terrence Martin <tmartin at physics.ucsd.edu>
>> To: npaci-rocks-discussion at sdsc.edu
>> Subject: [Rocks-Discuss]Rocks 3.0.0
>>
>> I am having a problem on install of rocks 3.0.0 on my new cluster.
>>
>> The python error occurs right after anaconda starts and just before
>> the install asks for the roll CDROM.
>>
>> The error refers to an inability to find or load rocks.file. I think
>> the error is associated with the window that pops up and asks you to
>> put the roll CDROM in.
>>
>> The process I followed to get to this point is
>>
>> Put the Rocks 3.0.0 CDROM into the CDROM drive
>> Boot the system
>> At the prompt type frontend
>> Wait till anaconda starts
>> Error referring to unable to load rocks.file.
>>
>> I have successfully installed rocks on a smaller cluster but that has
>> different hardware. I used the same CDROM for both installs.
>>
>> Any thoughts?
>>
>> Terrence
>>
>>
>>
>> --__--__--
>>
>> Message: 3
>> Date: Wed, 10 Dec 2003 19:52:49 -0800
>> From: "V. Rowley" <vrowley at ucsd.edu>
>> To: npaci-rocks-discussion at sdsc.edu
>> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when
>> trying
>> to build CD distro
>>
>> Looks like python is okay:
>>
>>> [root at rocks14 birn-oracle1]# which python
>>> /usr/bin/python
>>> [root at rocks14 birn-oracle1]# python --help
>>> Unknown option: --
>>> usage: python [option] ... [-c cmd | file | -] [arg] ...
>>> Options and arguments (and corresponding environment variables):
>>> -d      : debug output from parser (also PYTHONDEBUG=x)
>>> -i      : inspect interactively after running script, (also
>> PYTHONINSPECT=x)
>>>           and force prompts, even if stdin does not appear to be a
>> terminal
>>> -O      : optimize generated bytecode (a tad; also PYTHONOPTIMIZE=x)
>>> -OO     : remove doc-strings in addition to the -O optimizations
>>> -S      : don't imply 'import site' on initialization
>>> -t      : issue warnings about inconsistent tab usage (-tt: issue
>> errors)
>>> -u      : unbuffered binary stdout and stderr (also
>>> PYTHONUNBUFFERED=x)
>>> -v      : verbose (trace import statements) (also PYTHONVERBOSE=x)
>>> -x      : skip first line of source, allowing use of non-Unix forms of
>> #!cmd
>>> -X      : disable class based built-in exceptions
>>> -c cmd : program passed in as string (terminates option list)
>>> file    : program read from script file
>>> -       : program read from stdin (default; interactive mode if a tty)
>>> arg ...: arguments passed to program in sys.argv[1:]
>>> Other environment variables:
>>> PYTHONSTARTUP: file executed on interactive startup (no default)
>>> PYTHONPATH   : ':'-separated list of directories prefixed to the
>>>                 default module search path. The result is sys.path.
>>> PYTHONHOME   : alternate <prefix> directory (or
>> <prefix>:<exec_prefix>).
>>>                 The default module search path uses
>>> <prefix>/python1.5.
>>> [root at rocks14 birn-oracle1]#
>>
>>
>>
>> Tim Carlson wrote:
>>> On Wed, 10 Dec 2003, V. Rowley wrote:
>>>
>>> Did you remove python by chance? kickstart.cgi calls python directly
>>> in /usr/bin/python while rocks-dist does an "env python"
>>>
>>> Tim
>>>
>>>
>>>> Yep, I did that, but only *AFTER* getting the error. [Thought it
>>>> was
>>>> generated by the rocks-dist sequence, but apparently not.] Go
>>>> ahead.
>>>> Move it back. Same difference.
>>>>
>>>> Vicky
>>>>
>>>> Mason J. Katz wrote:
>>>>
>>>>> It looks like someone moved the profiles directory to
>>>>> profiles.orig.
>>>>>
>>>>>    -mjk
>>>>>
>>>>>
>>>>> [root at rocks14 install]# ls -l
>>>>> total 56
>>>>> drwxr-sr-x     3 root     wheel        4096 Dec 10 21:16 cdrom
>>>>> drwxrwsr-x     5 root     wheel        4096 Dec 10 20:38
>>>>> contrib.orig
>>>>> drwxr-sr-x     3 root     wheel        4096 Dec 10 21:07
>>>>> ftp.rocksclusters.org
>>>>> drwxr-sr-x     3 root     wheel        4096 Dec 10 20:38
>>>>> ftp.rocksclusters.org.orig
>>>>> -r-xrwsr-x     1 root     wheel       19254 Sep 3 12:40
>>>>> kickstart.cgi
>>>>> drwxr-xr-x     3 root     root         4096 Dec 10 20:38
>>>>> profiles.orig
>>>>> drwxr-sr-x     3 root     wheel        4096 Dec 10 21:15 rocks-dist
>>>>> drwxrwsr-x     3 root     wheel        4096 Dec 10 20:38
>> rocks-dist.orig
>>>>> drwxr-sr-x     3 root     wheel        4096 Dec 10 21:02 src
>>>>> drwxr-sr-x     4 root     wheel        4096 Dec 10 20:49 src.foo
>>>>> On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:
>>>>>
>>>>>
>>>>>> When I run this:
>>>>>>
>>>>>> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;
>>>>>> rocks-dist --dist=cdrom cdrom
>>>>>>
>>>>>> on a server installed with ROCKS 3.0.0, I eventually get this:
>>>>>>
>>>>>>
>>>>>>> Cleaning distribution
>>>>>>> Resolving versions (RPMs)
>>>>>>> Resolving versions (SRPMs)
>>>>>>> Adding support for rebuild distribution from source
>>>>>>> Creating files (symbolic links - fast)
>>>>>>> Creating symlinks to kickstart files
>>>>>>> Fixing Comps Database
>>>>>>> Generating hdlist (rpm database)
>>>>>>> Patching second stage loader (eKV, partioning, ...)
>>>>>>>    patching "rocks-ekv" into distribution ...
>>>>>>>    patching "rocks-piece-pipe" into distribution ...
>>>>>>>    patching "PyXML" into distribution ...
>>>>>>>    patching "expat" into distribution ...
>>>>>>>    patching "rocks-pylib" into distribution ...
>>>>>>>    patching "MySQL-python" into distribution ...
>>>>>>>    patching "rocks-kickstart" into distribution ...
>>>>>>>    patching "rocks-kickstart-profiles" into distribution ...
>>>>>>>    patching "rocks-kickstart-dtds" into distribution ...
>>>>>>>    building CRAM filesystem ...
>>>>>>> Cleaning distribution
>>>>>>> Resolving versions (RPMs)
>>>>>>> Resolving versions (SRPMs)
>>>>>>> Creating symlinks to kickstart files
>>>>>>> Generating hdlist (rpm database)
>>>>>>> Segregating RPMs (rocks, non-rocks)
>>>>>>> sh: ./kickstart.cgi: No such file or directory
>>>>>>> sh: ./kickstart.cgi: No such file or directory
>>>>>>> Traceback (innermost last):
>>>>>>> File "/opt/rocks/bin/rocks-dist", line 807, in ?
>>>>>>>    app.run()
>>>>>>> File "/opt/rocks/bin/rocks-dist", line 623, in run
>>>>>>>    eval('self.command_%s()' % (command))
>>>>>>> File "<string>", line 0, in ?
>>>>>>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>>>>>>    builder.build()
>>>>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>>>>>>    (rocks, nonrocks) = self.segregateRPMS()
>>>>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in
>>>>>>> segregateRPMS
>>>>>>>    for pkg in ks.getSection('packages'):
>>>>>>> TypeError: loop over non-sequence
>>>>>>
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> --
>>>>>> Vicky Rowley                              email: vrowley at ucsd.edu
>>>>>> Biomedical Informatics Research Network      work: (858) 536-5980
>>>>>> University of California, San Diego           fax: (858) 822-0828
>>>>>> 9500 Gilman Drive
>>>>>> La Jolla, CA 92093-0715
>>>>>>
>>>>>>
>>>>>> See pictures from our trip to China at
>> http://www.sagacitech.com/Chinaweb
>>>>>
>>>>>
>>>>>
>>>> --
>>>> Vicky Rowley                              email: vrowley at ucsd.edu
>>>> Biomedical Informatics Research Network      work: (858) 536-5980
>>>> University of California, San Diego           fax: (858) 822-0828
>>>> 9500 Gilman Drive
>>>> La Jolla, CA 92093-0715
>>>>
>>>>
>>>> See pictures from our trip to China at
>> http://www.sagacitech.com/Chinaweb
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>> --
>> Vicky Rowley                              email: vrowley at ucsd.edu
>> Biomedical Informatics Research Network      work: (858) 536-5980
>> University of California, San Diego           fax: (858) 822-0828
>> 9500 Gilman Drive
>> La Jolla, CA 92093-0715
>>
>>
>> See pictures from our trip to China at
>> http://www.sagacitech.com/Chinaweb
>>
>>
>>
>> --__--__--
>>
>> _______________________________________________
>> npaci-rocks-discussion mailing list
>> npaci-rocks-discussion at sdsc.edu
>> http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
>>
>>
>> End of npaci-rocks-discussion Digest
>>
>>
>>
>>
> Glen Otero, Ph.D.
> Linux Prophet
> 619.917.1772
>
>
Glen Otero, Ph.D.
Linux Prophet
619.917.1772

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 35605 bytes
Desc: not available
Url : https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031211/1a0b38fb/attachment-0001.bin

From tmartin at physics.ucsd.edu Fri Dec 12 10:26:58 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Fri, 12 Dec 2003 10:26:58 -0800
Subject: [Rocks-Discuss]ftp.rocksclusters.org mirror?
Message-ID: <3FDA0872.8010405@physics.ucsd.edu>

I was wondering, does the command rocks-dist do anything else besides
call wget on the correct tree at ftp.rocksclusters.org?

I ask because some firewall restrictions on a system I am hesitant to
fiddle are preventing me from running rocks-dist mirror from my head
node. I would like to download the mirror of the rocks distro on another
system, transfer the tree and then run rocks-dist dist to rebuild the
rocks for my compute nodes. Is this reasonable?

Also am I going to run into any problems with rocks 3.0.0 having
installed the head node on a UP system but my compute nodes are SMP? I
am assuming that once I get all of the packages into rocks (currently
there are no smp kernels on the head node) the compute nodes will
install the right kernel?

BTW thanks for the help so far. The trick, it seems, to getting Rocks
3.0.0 on these supermicro systems is to install rocks on the hard drive
in a separate computer and then install the hard disk.

Thanks,

Terrence




From mjk at sdsc.edu Fri Dec 12 10:48:17 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Fri, 12 Dec 2003 10:48:17 -0800
Subject: [Rocks-Discuss]ftp.rocksclusters.org mirror?
In-Reply-To: <3FDA0872.8010405@physics.ucsd.edu>
References: <3FDA0872.8010405@physics.ucsd.edu>
Message-ID: <BF99287A-2CD3-11D8-A2DC-000A95DA5638@sdsc.edu>

- Yes, "rocks-dist mirror" does a python system() call to run the wget
application. It does this several times for the various directories it
needs.

- No, the compute nodes do not need to match the SMPness of the
frontend. All installations are done with Red Hat Kickstart (plus our
pixie dust), so hardware is auto-detected for you. This is not disk
imaging :)

       -mjk
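Since rocks-dist mirror is just shelling out to wget, the offline workaround Terrence asks about might be sketched like this. This is only a sketch: the exact tree path under ftp.rocksclusters.org and the wget flags are assumptions, not taken from rocks-dist itself.

```shell
# On a machine outside the firewall: fetch the distribution tree.
# (URL path is illustrative -- match it to the tree for your Rocks version.)
wget --mirror --no-parent http://ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/

# Transfer the mirrored tree to the frontend, e.g. tar over ssh:
tar cf - ftp.rocksclusters.org | ssh frontend 'cd /home/install && tar xf -'

# Then, on the frontend, rebuild the distribution for the compute nodes:
cd /home/install
rocks-dist dist
```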

On Dec 12, 2003, at 10:26 AM, Terrence Martin wrote:

>   I was wondering, does the command rocks-dist do anything else besides
>   call wget on the correct tree at ftp.rocksclusters.org?
>
>   I ask because some firewall restrictions on a system I am hesitant to
>   fiddle are preventing me from running rocks-dist mirror from my head
>   node. I would like to download the mirror of the rocks distro on
>   another system, transfer the tree and then run rocks-dist dist to
>   rebuild the rocks for my compute nodes. Is this reasonable?
>
>   Also am I going to run into any problems with rocks 3.0.0 having
>   installed the head node on a UP system but my compute nodes are SMP? I
>   am making an assumption that once I get all of the packages into rocks
>   (currently there is no smp kernels on the head node) the compute nodes
>   will install the right kernel?
>
>   BTW thanks for the help so far, the trick it seems to getting Rocks
>   3.0.0 on these supermicro systems is to install rocks on the hard
>   drive in a separate computer and then install the hard disk.
>
>   Thanks,
>
>   Terrence
>
>



From mjk at sdsc.edu Fri Dec 12 10:54:03 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Fri, 12 Dec 2003 10:54:03 -0800
Subject: [Rocks-Discuss]ia64 compute nodes with ia32 frontends?
In-Reply-To: <ddwu930yzp.fsf_-_@oecpc11.ucsd.edu>
References: <793188FE-D411-11D7-8529-000393C7898E@sdsc.edu>
<ddptix48s6.fsf@oecpc11.ucsd.edu> <ddwu930yzp.fsf_-_@oecpc11.ucsd.edu>
Message-ID: <8E405599-2CD4-11D8-A2DC-000A95DA5638@sdsc.edu>

We haven't done this for a while, and since our 3.0 release uses
different versions of Red Hat for x86 and IA64, cross-building a
distribution may not work. 3.1.0 (since you are on campus you'll get a
CD set from us next week) uses the same base RH for all architectures, so
this should be possible again.

The mirror should have worked:

       # rocks-dist --arch=ia64 mirror

This should mirror the ia64 tree from ftp.rocksclusters.org. You can also
use your IA64 DVD: mount it on /mnt/cdrom and do a "rocks-dist copycd" to
create the IA64 mirror.

If this works you will then need to use the --genhdlist flag w/ rocks-dist.
For example:

          # cd /home/install
          # rocks-dist dist             --- build the x86 distribution
          # rocks-dist --arch=ia64 --genhdlist=rocks-dist/.../i386/.../genhdlist

You'll need to use find to determine the path of the genhdlist
executable in your x86 distribution. This may still fail (since the RH
versions differ), but it does work when the versions are the same for
both archs.
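The find step might look like this in practice (a sketch; the example path in the comment is one plausible location, not confirmed for every RH version):

```shell
# Search the x86 distribution tree for the genhdlist executable;
# its location varies with the RH version, e.g. something like
# rocks-dist/7.3/en/os/i386/usr/lib/anaconda-runtime/genhdlist
find /home/install -name genhdlist -type f 2>/dev/null
```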

          -mjk

On Dec 11, 2003, at 2:29 PM, Edward O'Connor wrote:

>   Hi everybody,
>
>   I'm trying to bring up some ia64 compute nodes in a cluster with an
>   ia32
>   frontend. Normally, `cd /home/install; rocks-dist mirror dist` only
>   sets
>   up the frontend to handle ia32 compute nodes. I tried to manhandle
>   `rocks-dist mirror` into mirroring the ia64 stuff from
>   ftp.rocksclusters.org by giving it the --arch=ia64 option, but that
>   didn't work, so I went ahead and did the mirroring step by hand.
>
>   After having done so, `rocks-dist dist` still doesn't do the right
>   thing. So, adding --arch=ia64 to that command yields this error output:
>
>   ,----
>   | # rocks-dist --arch=ia64 dist
>   | Cleaning distribution
>   | Resolving versions (RPMs)
>   | Resolving versions (SRPMs)
>   | Adding support for rebuild distribution from source
>   | Creating files (symbolic links - fast)
>   | Creating symlinks to kickstart files
>   | Fixing Comps Database
>   | error - comps file is missing, skipping this step
>   | Generating hdlist (rpm database)
>   | error - could not find rpm anaconda-runtime
>   | error - could not find genhdlist
>   | Patching second stage loader (eKV, partioning, ...)
>   | error - could not find second stage, skipping this step
>   `----
>
>   So my question is, what do I need to do to the ia32 frontend to enable
>   it to kickstart an ia64 compute node? Thanks.
>
>
>   Ted
>
>   --
>   Edward O'Connor
>   oconnor at ucsd.edu



From mjk at sdsc.edu      Fri Dec 12 11:12:59 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Fri, 12 Dec 2003 11:12:59 -0800
Subject: [Rocks-Discuss]I can't use xpbs in rocks
In-Reply-To: <BAY3-F24QLayI4TY7zD00009bf1@hotmail.com>
References: <BAY3-F24QLayI4TY7zD00009bf1@hotmail.com>
Message-ID: <32F6A3BA-2CD7-11D8-A2DC-000A95DA5638@sdsc.edu>

Unfortunately we don't have a fix here. We've moved to SGE (you can
now use QMon). We do have a PBS roll, but we plan to release 3.1 before
the PBS roll is complete.

       -mjk

On Dec 10, 2003, at 8:44 PM, zhong wenyu wrote:

>   Hi,everyone!
>   I have installed rocks 2.3.2 and 3.0.0,xpbs can not be use in both of
>   them.
>   typed:xpbs[enter]
>   showed:xpbs: initialization failed! output: invalid command name
>   "Pref_Init"
>   thanks!
>
>   _________________________________________________________________
>   MSN Messenger: http://messenger.msn.com/cn



From fparnold at chem.northwestern.edu Fri Dec 12 06:52:45 2003
From: fparnold at chem.northwestern.edu (Fred P. Arnold)
Date: Fri, 12 Dec 2003 08:52:45 -0600 (CST)
Subject: [Rocks-Discuss]Gig E on HP ZX6000
Message-ID: <Pine.GSO.4.33.0312120850030.4235-100000@mercury.chem.northwestern.edu>

Hello,

I know this is a hardware question, not technically a Rocks one, but I
can't find the answer in my HP manuals:

On the ZX6000, there are two ethernet ports, a 10/100 basic/management
port, and a 1000 which is designated the primary interface.
Unfortunately, rocks always identifies the 10/100 as eth0.

Does anyone know how to disable the 10/100 on a ZX6000?    On an IA32, I'd
go into the bios, but these don't technically have one.    We'd like to run
ours on a pure Gig network.

Thanks.

                                     -Fred

                               Frederick P. Arnold, Jr.
                               NUIT, Northwestern U.
                               f-arnold at northwestern.edu



From mjk at sdsc.edu Fri Dec 12 11:16:42 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Fri, 12 Dec 2003 11:16:42 -0800
Subject: [Rocks-Discuss]ScalablePBS.
In-Reply-To: <200311212352.27000.Roy.Dragseth@cc.uit.no>
References: <200311212352.27000.Roy.Dragseth@cc.uit.no>
Message-ID: <B83C8894-2CD7-11D8-A2DC-000A95DA5638@sdsc.edu>

hi Roy,

This should become the basis of the PBS roll (currently openpbs). We
are seeking developers who would like to help write and maintain this
-- I'm not singling you out Roy, although you would be more than
welcome, rather I'm taking advantage of your message to solicit other
volunteers. Anyone?

         -mjk


On Nov 21, 2003, at 2:52 PM, Roy Dragseth wrote:

>   Hi   folks.
>
>   I've been testing ScalablePBS (SPBS) from supercluster.org for a few
>   weeks now
>   and it seems like a fairly good replacement for OpenPBS. Only a few
>   minor
>   changes to the OpenPBS infrastructure were needed to accomplish the
>   neccessary changes in the kickstart generation to make the nodes
>   switch to
>   SPBS.
>
>   SPBS is based on OpenPBS 2.3.12, but incorporates most provided patches
>   (sandia etc) and is actively developed by the same maintainers that
>   develop
>   maui. It scales better than OpenPBS, to around 2K nodes, has better
>   fault
>   tolerance and communicates better with maui. It has, as far as I can
>   see, no
>   user visible changes from OpenPBS.
>
>   I know, a lot of people are moving away from pbs and into sge, I was
>   thinking
>   about making the switch too. The emergence SPBS seems to make the
>   switch
>   unneccessary and I don't have to teach myself (and the users) a new
>   queueing
>   interface...
>
>   Configuration tested:
>   Rocks 3.0.0
>   SPBS 1.0.1p0 (should leave beta phase next month)
>   Maui 3.2.6p6 (available for "Early Access Production")
>
>   SPBS and Maui can be downloaded from http://www.supercluster.org/
>
>   Have a nice weekend,
>   r.
>
>   --
>
>     The Computer Center, University of Troms?, N-9037 TROMS?, Norway.
>             phone:+47 77 64 41 07, fax:+47 77 64 41 00
>        Roy Dragseth, High Performance Computing System Administrator
>        Direct call: +47 77 64 62 56. email: royd at cc.uit.no



From jlkaiser at fnal.gov Fri Dec 12 11:25:58 2003
From: jlkaiser at fnal.gov (Joseph L. Kaiser)
Date: Fri, 12 Dec 2003 13:25:58 -0600
Subject: [Rocks-Discuss](no subject)
Message-ID: <1071257158.3719.9.camel@ajax.kaisergroup.net>

My install of 3.0.0 is crapping out here:

  File "/usr/src/build/90289-i386/install//usr/lib/anaconda/comps.py", line 153, in __getitem__
  KeyError: PyXML

Even though PyXML is in the distribution I have built. Is there
anything that can cause this other than the missing RPM?

Thanks,

Joe



From oconnor at soe.ucsd.edu Fri Dec 12 11:36:04 2003
From: oconnor at soe.ucsd.edu (Edward O'Connor)
Date: Fri, 12 Dec 2003 11:36:04 -0800
Subject: [Rocks-Discuss]ia64 compute nodes with ia32 frontends?
In-Reply-To: <8E405599-2CD4-11D8-A2DC-000A95DA5638@sdsc.edu> (Mason J.
 Katz's message of "Fri, 12 Dec 2003 10:54:03 -0800")
References: <793188FE-D411-11D7-8529-000393C7898E@sdsc.edu>
      <ddptix48s6.fsf@oecpc11.ucsd.edu> <ddwu930yzp.fsf_-_@oecpc11.ucsd.edu>
      <8E405599-2CD4-11D8-A2DC-000A95DA5638@sdsc.edu>
Message-ID: <ddiskl4ymz.fsf@oecpc11.ucsd.edu>

> We haven't done this for a while, and since our 3.0 release using
> different version of Red Hat for x86 and IA64 cross-building
> distribution may not work.

Ahh. After further travails (read below), I'm pretty willing to suspect
that this indeed does not work in Rocks 3.0.0. I'm looking forward to
those 3.1.0 CDs and DVDs next week! :)

> you can also use your IA64 DVD mount it on /mnt/cdrom and do a
> "rocks-dist copycd" to create the IA64 mirror.

Unfortunately, the ia32 frontend machine doesn't have a DVD drive in it.
So I mounted the ia64 ISO image on /mnt/cdrom via a loopback device and
that worked fine.
However, `rocks-dist copycd` seemed to have nuked the ia32 stuff under
/home/install/ftp.rocksclusters.org/, or, if it didn't entirely nuke it,
it made the bare `rocks-dist dist` of your next instructions fail:

> If this works you will the to use the --genhdlist flag w/ rocks-dist.
> For example:
>
>     # cd /home/install
>       # rocks-dist dist --- build the x86 distribution

As this failed, I went ahead and also ran a `rocks-dist mirror`, which
proceeded to download a whole lot of stuff from you guys. After it
finished, `rocks-dist dist` completed without error. I double-checked
and the ia64 mirror from the `rocks-dist copycd` command still appears
to be there.

>        # rocks-dist --arch=ia64 --genhdlist=rocks-dist/.../i386/.../genhdlist

Should there be a `dist` at the end of that? The above command (with the
substitution of the appropriate genhdlist path) appears to be a no-op.
So I appended a `dist` as the idea is for it to create the appropriate
symlinks for ia64 as well, and it bombs out too, in the same way as
before:

,----
| # rocks-dist --arch=ia64 --genhdlist=rocks-dist/7.3/en/os/i386/usr/lib/anaconda-runtime/genhdlist dist
| Cleaning distribution
| Resolving versions (RPMs)
| Resolving versions (SRPMs)
| Adding support for rebuild distribution from source
| Creating files (symbolic links - fast)
| Creating symlinks to kickstart files
| Fixing Comps Database
| error - comps file is missing, skipping this step
| Generating hdlist (rpm database)
| error creating file /home/install/rocks-dist/desktop/7.3/en/os/ia64/RedHat/base/hdlist: No such file or directory
| Patching second stage loader (eKV, partioning, ...)
| error - could not find second stage, skipping this step
`----

>   You'll need to use find to determine the path of the genhdlist
>   executable in you x86 distribution. This may still fail (since RH
>   version differ), but it does work when the version are the same for
>   both archs.

I suppose at this point that it's still failing due to the RH version
mismatch, and that getting this to work in 3.0.0 is a lost cause.


Ted

--
Edward O'Connor
oconnor at ucsd.edu

From jared_hodge at iat.utexas.edu Fri Dec 12 12:07:32 2003
From: jared_hodge at iat.utexas.edu (Jared Hodge)
Date: Fri, 12 Dec 2003 14:07:32 -0600
Subject: [Rocks-Discuss]I can't use xpbs in rocks
References: <BAY3-F24QLayI4TY7zD00009bf1@hotmail.com> <32F6A3BA-2CD7-11D8-A2DC-000A95DA5638@sdsc.edu>
Message-ID: <3FDA2004.3020203@iat.utexas.edu>

OK, I've got a fix for this one.
The problem is that xpbs thinks that it's in the directory
/var/tmp/OpenPBS-buildroot/opt/OpenPBS/
Anyway, the path is mangled to get to some of the subroutines. The
rocks guys can figure out a way to prevent this in future releases, but
here's how you can get it working (and pbsmon while were at it):

First fix the scripts:
/opt/OpenPBS/bin/xpbs Need's the following changes:

#set libdir /var/tmp/OpenPBS-buildroot/opt/OpenPBS/lib/xpbs
#set appdefdir /var/tmp/OpenPBS-buildroot/opt/OpenPBS/lib/xpbs
set libdir            /opt/OpenPBS/lib/xpbs
set appdefdir         /opt/OpenPBS/lib/xpbs

/opt/OpenPBS/bin/xpbsmon needs the same thing, plus the first line needs
to be changed.

now do the following:
cd /opt/OpenPBS/lib/xpbs
rm tclIndex
./buildindex `pwd`
cd /opt/OpenPBS/lib/xpbsmon
rm tclIndex
./buildindex `pwd`


That should fix it all up. I tested this on a 2.3.2 cluster; I assume
it's the same on 3.0.
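The libdir/appdefdir edits above could presumably be scripted; here is a hypothetical one-liner, assuming GNU sed's -i option (the .bak suffix keeps backups in case the substitution hits something unexpected):

```shell
# Rewrite the buildroot paths in both scripts in place, keeping backups.
sed -i.bak 's|/var/tmp/OpenPBS-buildroot/opt/OpenPBS|/opt/OpenPBS|g' \
    /opt/OpenPBS/bin/xpbs /opt/OpenPBS/bin/xpbsmon
```

Note that the xpbsmon first-line change mentioned above still has to be checked by hand, and the tclIndex files still need rebuilding afterward.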

--
Jared Hodge
The Institute for Advanced Technology
The University of Texas at Austin
3925 W. Braker Lane, Suite 400
Austin, Texas 78759

Phone: 512-232-4460
Fax: 512-471-9096
Email: jared_hodge at iat.utexas.edu



Mason J. Katz wrote:

> Unfortunately we don't have a fix here. We've moved to SGE (your can
> now use QMon). We do have a PBS roll but we plan to release 3.1
> before the PBS roll is complete.
>
>     -mjk
>
> On Dec 10, 2003, at 8:44 PM, zhong wenyu wrote:
>
>> Hi,everyone!
>> I have installed rocks 2.3.2 and 3.0.0,xpbs can not be use in both of
>> them.
>> typed:xpbs[enter]
>> showed:xpbs: initialization failed! output: invalid command name
>> "Pref_Init"
>> thanks!
>>
>> _________________________________________________________________
>> MSN Messenger: http://messenger.msn.com/cn
>
>



From jlkaiser at fnal.gov Fri Dec 12 14:39:42 2003
From: jlkaiser at fnal.gov (Joe Kaiser)
Date: Fri, 12 Dec 2003 16:39:42 -0600
Subject: [Rocks-Discuss](no subject)
In-Reply-To: <1071257158.3719.9.camel@ajax.kaisergroup.net>
References: <1071257158.3719.9.camel@ajax.kaisergroup.net>
Message-ID: <1071268782.22030.0.camel@nietzsche.fnal.gov>

Sorry, creating extra links where they don't belong.   Nevermind.

On Fri, 2003-12-12 at 13:25, Joseph L. Kaiser wrote:
> My install of 3.0.0 is crapping out here:
>
>   File "/usr/src/build/90289-i386/install//usr/lib/anaconda/comps.py", line 153, in __getitem__
>   KeyError: PyXML
>
>
> Even though PyXML is in the distribution I have built. Is there
> anything that can cause this other than the missing RPM?
>
> Thanks,
>
> Joe
--
===================================================================
Joe Kaiser - Systems Administrator

Fermi Lab
CD/OSS-SCS                Never laugh at live dragons.
630-840-6444
jlkaiser at fnal.gov
===================================================================

From jholland at cs.uh.edu Fri Dec 12 14:52:10 2003
From: jholland at cs.uh.edu (Jason Holland)
Date: Fri, 12 Dec 2003 16:52:10 -0600 (CST)
Subject: [Rocks-Discuss]Gig E on HP ZX6000
In-Reply-To: <Pine.GSO.4.33.0312120850030.4235-100000@mercury.chem.northwestern.edu>
References: <Pine.GSO.4.33.0312120850030.4235-100000@mercury.chem.northwestern.edu>
Message-ID: <Pine.GSO.4.58.0312121650350.4139@leibnitz.cs.uh.edu>

Fred,

Try flipping the modules in /etc/modules.conf: swap eth0 with eth1 so
that the gige interface comes up as eth0. Or just turn off eth0
altogether with 'alias eth0 off'. I think that's the right syntax.
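As an illustration, the swap might look like this in /etc/modules.conf. The driver names here are assumptions for illustration only; check which modules your kernel actually binds to each port:

```shell
# /etc/modules.conf -- hypothetical fragment
# gige driver bound to eth0, 10/100 driver bound to eth1
alias eth0 e1000
alias eth1 eepro100
# or, to disable the 10/100 port entirely:
# alias eth1 off
```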

We have 60 zx6000's and I personally have never found a way to
disable the port.

Jason P Holland
Texas Learning and Computation Center
http://www.tlc2.uh.edu
University of Houston
Philip G Hoffman Hall rm 207A
tel: 713-743-4850

On Fri, 12 Dec 2003, Fred P. Arnold wrote:

>   Hello,
>
>   I know this is a hardware question, not technically a Rocks one, but I
>   can't find the answer in my HP manuals:
>
>   On the ZX6000, there are two ethernet ports, a 10/100 basic/management
>   port, and a 1000 which is designated the primary interface.
>   Unfortunately, rocks always identifies the 10/100 as eth0.
>
>   Does anyone know how to disable the 10/100 on a ZX6000?   On an IA32, I'd
>   go into the bios, but these don't technically have one.   We'd like to run
>   ours on a pure Gig network.
>
>   Thanks.
>
>                                    -Fred
>
>                              Frederick P. Arnold, Jr.
>                              NUIT, Northwestern U.
>                              f-arnold at northwestern.edu
>


From jian at appro.com Fri Dec 12 17:27:51 2003
From: jian at appro.com (Jian Chang)
Date: Fri, 12 Dec 2003 17:27:51 -0800
Subject: [Rocks-Discuss]RE: Rocks-Discuss] AMD Opteron - Contact Appro
Message-ID: <4AE58AD63966B24B99F95CA24C02EB1903414F@hawk.appro.com>

Hello Mason / Puru,

I got your contact information from Bryan Littlefield.
I would like to talk with you about benchmark test systems you might need
down the road.
We can also share with you our findings as to what is compatible in the Opteron
systems.
Please reply with your phone number where I can reach you, and I will call
promptly.

Bryan,

Thank you for the referral.

Best regards,

Jian Chang
Regional Sales Manager
(408) 941-8100 x 202
(800) 927-5464 x 202
(408) 941-8111 Fax
jian at appro.com
www.appro.com

-----Original Message-----
From: Bryan Littlefield [mailto:bryan at UCLAlumni.net]
Sent: Tuesday, December 09, 2003 12:14 PM
To: npaci-rocks-discussion at sdsc.edu; mjk at sdsc.edu
Cc: Jian Chang
Subject: Rocks-Discuss] AMD Opteron - Contact Appro

Hi Mason,

I suggest contacting Appro. We are using Rocks on our Opteron cluster and Appro
would likely love to help. I will contact them as well to see if they can help
in getting an Opteron machine for testing. Contact info below:

Thanks --Bryan

Jian Chang - Regional Sales Manager
(408) 941-8100 x 202
(800) 927-5464 x 202
(408) 941-8111 Fax
jian at appro.com
http://www.appro.com

npaci-rocks-discussion-request at sdsc.edu wrote:


From: "Mason J. Katz"   <mailto:mjk at sdsc.edu> <mjk at sdsc.edu>
Subject: Re: [Rocks-Discuss]AMD Opteron
Date: Tue, 9 Dec 2003 07:28:51 -0800
To: "purushotham komaravolu"   <mailto:purikk at hotmail.com> <purikk at
hotmail.com>

We have a beta right now that we have sent to a few people. We plan on
a release this month, and AMD_64 will be part of this release along
with the usual x86, IA64 support.

If you want to help accelerate this process please talk to your vendor
about loaning/giving us some hardware for testing. Having access to a
variety of Opteron hardware (we own two boxes) is the only way we can
have good support for this chip.

   -mjk


On Dec 8, 2003, at 8:23 PM, purushotham komaravolu wrote:


Cc: <mailto:npaci-rocks-discussion at sdsc.edu> <npaci-rocks-discussion at
sdsc.edu>


Hello,
I am a newbie to ROCKS cluster. I wanted to set up clusters on 32-bit
architectures (Intel and AMD) and 64-bit architectures (Intel and AMD).
I found the 64-bit download for Intel on the website but not for AMD.
Does it work for AMD Opteron? If not, what is the ETA for AMD-64?
We are planning to buy AMD-64 bit machines shortly, and I would like to
volunteer for the beta testing if needed.
Thanks
Regards,
Puru


_______________________________________________
npaci-rocks-discussion mailing list
npaci-rocks-discussion at sdsc.edu
http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion


End of npaci-rocks-discussion Digest

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031212/dec7e41b/attachment-0001.html

From landman at scalableinformatics.com Sat Dec 13 07:50:02 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Sat, 13 Dec 2003 10:50:02 -0500
Subject: [Rocks-Discuss]Trying to integrate a new kernel into 3.0.0
Message-ID: <1071330602.4444.56.camel@protein.scalableinformatics.com>

Folks:

  Finally built the 2.4.23 kernel into an RPM via the RedHat tools.   Had
to hack up the spec file a bit, but you can see the results at

http://scalableinformatics.com/downloads/kernels/2.4.23/

These are 2.4.23 with the 2.4.24-pre1 patch (e.g. xfs is in there, woo
hoo!). I had to strip out most of the previous patches as they were
incompatible with .23 (and I don't want to spend time forward porting
them). The spec file, the sources, etc are released under the normal
licenses (GPL). No warranties, use at your own risk, and these are NOT
official Redhat kernels. Don't ask them for support for these, they
won't do it, and they will look at you funny.

That said, I had also checked out the cvs tree to start the "Carlson"
process :) indicated in the list a few months ago (see
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/003533.html)
to build a more customized distribution. I got to the

      Build the boot RPM

      cd rocks/src/rocks/boot
      make rpm

point, and lo and behold this is what I see ...

      rm version.mk
      rm arch
      rm -f /local/rocks/src/rocks/boot/.rpmmacros
      rm -f /usr/src/redhat/SOURCES/rocks-boot-3.1.0.tar
      rm -f /usr/src/redhat/SOURCES/rocks-boot-3.1.0.tar.gz
      ...

Ok... I wanted to rebuild 3.0.0, as I cannot wait for 3.1.0 (my customer
has a strong sense of urgency and little time to wait for an operational
cluster). I checked out the system from CVS earlier this week.

Is there any way to switch the build back to 3.0.0?   Or am I really out
of luck at this moment??? Clues/hints welcome.

These kernels might work, though I don't have a method to try them in
the distro yet. They work on the build machine.

      [root at head root]# uname -a
      Linux head.public 2.4.23-1 #1 SMP Sat Dec 13 14:41:06 GMT 2003    i686
unknown

      [root at head root]# rpm -qa | grep -i kernel
      kernel-2.4.23-1
      kernel-BOOT-2.4.23-1
      rocks-kernel-3.0.0-0
      pvfs-kernel-1.6.0-1
      kernel-doc-2.4.23-1
      kernel-source-2.4.23-1
      kernel-smp-2.4.23-1


The spec file is in the above download section, along with a .src.rpm
and other stuff. If anyone does have a clue as to how to build with
3.0.0 given the current cvs, or if there is a tagged set I needed to
get, please let me know.

Joe

--
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman at scalableinformatics.com
   web: http://scalableinformatics.com
phone: +1 734 612 4615

From tim.carlson at pnl.gov Sat Dec 13 08:31:03 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Sat, 13 Dec 2003 08:31:03 -0800 (PST)
Subject: [Rocks-Discuss]Trying to integrate a new kernel into 3.0.0
In-Reply-To: <1071330602.4444.56.camel@protein.scalableinformatics.com>
Message-ID: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov>

On Sat, 13 Dec 2003, Joe Landman wrote:

> That said, I had also checked out the cvs tree to start the "Carlson"
> process :) indicated in the list a few months ago (see

yikes.. ! :)

>
> Ok... I wanted to rebuild 3.0.0, as I cannot wait for 3.1.0 (my customer
> has a strong sense of urgency and little time to wait for an operational
> cluster). I checked out the system from CVS earlier this week.

You needed to check out the 3.0.0 tagged version

ROCKS_3_0_0_i386
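For reference, checking out the tag might look something like this. Hedged: the module name "rocks" and the CVS root depend on how the tree was originally checked out; adjust both to your setup:

```shell
# Check out the tree at the 3.0.0 release tag instead of HEAD:
cvs checkout -r ROCKS_3_0_0_i386 rocks
# or move an existing working copy onto the tag:
cvs update -r ROCKS_3_0_0_i386
```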

Off thread, but it would seem to me that the numbering scheme for ROCKS
got out of whack somewhere. Shouldn't 3.0.0 have been 2.3.3 and the new
3.1 been 3.0? The reasoning being that the current 3.0.0 is still RH 7.3
based and the new 3.1 will be RH 3.0 based. Not that it matters. Just
curious.

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support



From phil at sdsc.edu Sat Dec 13 08:51:29 2003
From: phil at sdsc.edu (Philip Papadopoulos)
Date: Sat, 13 Dec 2003 08:51:29 -0800
Subject: [Rocks-Discuss]Trying to integrate a new kernel into 3.0.0
In-Reply-To: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov>
References: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov>
Message-ID: <3FDB4391.4080405@sdsc.edu>



Tim Carlson wrote:

>On Sat, 13 Dec 2003, Joe Landman wrote:
>
>
>
>>That said, I had also checked out the cvs tree to start the "Carlson"
>>process :) indicated in the list a few months ago (see
>>
>>
>
>yikes.. ! :)
>
>
>
>>Ok... I wanted to rebuild 3.0.0, as I cannot wait for 3.1.0 (my customer
>>has a strong sense of urgency and little time to wait for an operational
>>cluster). I checked out the system from CVS earlier this week.
>>
>>
>
>You needed to check out the 3.0.0 tagged version
>
>ROCKS_3_0_0_i386
>
this is correct.

>
>Off thread, but it would seem to me that the numbering scheme for ROCKS
>got out of whack somewhere. Shouldn't 3.0.0 have been 2.3.3 and the new
>3.1 been 3.0? The reasoning being that the current 3.0.0 is still RH 7.3
>based and the new 3.1 will be RH 3.0 based. Not that it matters. Just
>curious.
>
I blame Bruno ...
We moved to 3.0 because rolls is very different from the way 2.3.2 was
put together -- this wasn't a minor change, and so a subminor revision
number didn't make sense.

3.0 --> 3.1: change from 7.3 to recompiled RHEL, and change from PBS as
the default to SGE as the default. ... OK, you could argue that this is
also a major change and shouldn't have a minor version #. We didn't want
to go from 3.0 to 4.0 for some non-definable reasons :-), but mostly it's
that 3.0 and 3.1 feel pretty similar in terms of the way they are put
together (with rolls).
-P

>
>Tim
>
>Tim Carlson
>Voice: (509) 376 3423
>Email: Tim.Carlson at pnl.gov
>EMSL UNIX System Support
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031213/69aa41fa/attachment-0001.html

From landman at scalableinformatics.com Sat Dec 13 11:14:51 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Sat, 13 Dec 2003 14:14:51 -0500
Subject: [Rocks-Discuss]Trying to integrate a new kernel into 3.0.0
In-Reply-To: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov>
References: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov>
Message-ID: <1071342891.4445.58.camel@protein.scalableinformatics.com>

Thanks. Magic incantations, and I have the "Carlson" process
implemented. Ok, next step is the roll-my-own ... more later


On Sat, 2003-12-13 at 11:31, Tim Carlson wrote:
> On Sat, 13 Dec 2003, Joe Landman wrote:
>
> > That said, I had also checked out the cvs tree to start the "Carlson"
> > process :) indicated in the list a few months ago (see
>
> yikes.. ! :)
>
> >
> > Ok... I wanted to rebuild 3.0.0, as I cannot wait for 3.1.0 (my customer
> > has a strong sense of urgency and little time to wait for an operational
> > cluster). I checked out the system from CVS earlier this week.
>
> You needed to check out the 3.0.0 tagged version
>
> ROCKS_3_0_0_i386
>
> Off thread, but it would seem to me that the numbering scheme for ROCKS
> got out of whack somewhere. Shouldn't 3.0.0 have been 2.3.3 and the new
> 3.1 been 3.0? The reasoning being that the current 3.0.0 is still RH 7.3
> based and the new 3.1 will be RH 3.0 based. Not that it matters. Just
> curious.
>
> Tim
>
> Tim Carlson
> Voice: (509) 376 3423
> Email: Tim.Carlson at pnl.gov
> EMSL UNIX System Support



From wyzhong78 at msn.com Mon Dec 15 00:02:15 2003
From: wyzhong78 at msn.com (zhong wenyu)
Date: Mon, 15 Dec 2003 16:02:15 +0800
Subject: [Rocks-Discuss]about add-extra-nic
Message-ID: <BAY3-F40JRkRy9Iwgel00056a6d@hotmail.com>

Hi, everyone!
My compute node's motherboard is an MSI 9141, which has one 1000M NIC and
one 100M NIC. I plan to use the 100M network for control and the 1000M
network for the application, so I use a 100M switch to connect the compute
nodes to the frontend, and a 1000M switch to connect the compute nodes to
each other (not including the frontend).
The first time I installed a compute node, it sat at "waiting for dhcp ip
information" too long and the install could not finish. I figured the 1000M
NIC must be responsible, so I disabled it in the BIOS. After that the
install worked and the compute nodes appeared.
Then I wanted to add the extra NIC. I used the add-extra-nic command and
shoot-node; the compute node rebooted (during the reboot I re-enabled the
NIC) and got stuck at "waiting for dhcp ip information" again.
So I disabled it again and restarted, and the node reinstalled fine,
finishing with no trouble. I can even see the boot message
"start eth1....[ok]"! But "ifconfig eth1" only reports an error, even
after I re-enable the 1000M NIC!
Thanks and regards!

_________________________________________________________________
MSN Messenger: http://messenger.msn.com/cn



From Roy.Dragseth at cc.uit.no Mon Dec 15 02:31:51 2003
From: Roy.Dragseth at cc.uit.no (Roy Dragseth)
Date: Mon, 15 Dec 2003 11:31:51 +0100
Subject: [Rocks-Discuss]ia64 compute nodes with ia32 frontends?
In-Reply-To: <ddwu930yzp.fsf_-_@oecpc11.ucsd.edu>
References: <793188FE-D411-11D7-8529-000393C7898E@sdsc.edu>
<ddptix48s6.fsf@oecpc11.ucsd.edu> <ddwu930yzp.fsf_-_@oecpc11.ucsd.edu>
Message-ID: <200312151131.51410.Roy.Dragseth@cc.uit.no>

Hi.

I've been running a setup like this for over a year now; it will not
(ever?) work right out of the box due to some kernel problems.

rocks-dist --arch ia64   dist

will most likely crash an ia32 frontend. The ia32 kernel doesn't like to mount
a cramfs image generated on an ia64 machine; it gives me a kernel panic.

Here is a rough guide to get this kind of setup going.

1. Set up the ia32 frontend as usual, but allow root write access to /export
by inserting "no_root_squash" as an option in /etc/exports.

2. Create a "fake" ia64 frontend using one of the ia64 nodes: let it configure
eth0 by dhcp and let the ia32 frontend think it is a compute node.

3. On the fake frontend, turn off the nis daemons except ypbind.

4. edit /etc/auto.home to mount /home from the ia32 frontend and restart
autofs.

5. On the fake frontend, do a rocks-dist copycd to dump the ia64 DVD into
/home/install.

6. Now you can do a rocks-dist dist on the ia64 box.

7. At last you need a symlink to make the ia32 frontend happy:
      ln -s enterprise/2.1AW/en/os/ia64 rocks-dist/7.3/en/os/ia64

Now you can boot up your ia64 nodes from the ia32 frontend.
After you are confident that your ia64 nodes are installed correctly, you can
reinstall the fake ia64 frontend as a regular compute node. Subsequent
rocks-dist dist runs can be done on any ia64 compute node as long as it has
the anaconda-runtime and rocks-dist rpms installed.
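Condensed, steps 5 through 7 on the fake frontend might look like the following. The working directory for the symlink is an assumption based on step 7's relative paths:

```shell
# Step 5: dump the ia64 DVD into /home/install
rocks-dist copycd
# Step 6: build the ia64 distribution on the ia64 box
rocks-dist dist
# Step 7: symlink so the ia32 frontend finds the ia64 tree
cd /home/install
ln -s enterprise/2.1AW/en/os/ia64 rocks-dist/7.3/en/os/ia64
```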

Hope this helps,
r.


--

  The Computer Center, University of Troms?, N-9037 TROMS? Norway.
            phone:+47 77 64 41 07, fax:+47 77 64 41 00
     Roy Dragseth, High Performance Computing System Administrator
       Direct call: +47 77 64 62 56. email: royd at cc.uit.no



From Roy.Dragseth at cc.uit.no Mon Dec 15 04:28:15 2003
From: Roy.Dragseth at cc.uit.no (Roy Dragseth)
Date: Mon, 15 Dec 2003 13:28:15 +0100
Subject: [Rocks-Discuss]Gig E on HP ZX6000
In-Reply-To: <Pine.GSO.4.58.0312121650350.4139@leibnitz.cs.uh.edu>
References: <Pine.GSO.4.33.0312120850030.4235-100000@mercury.chem.northwestern.edu>
<Pine.GSO.4.58.0312121650350.4139@leibnitz.cs.uh.edu>
Message-ID: <200312151328.15826.Roy.Dragseth@cc.uit.no>

I had similar problems on our HP rx2600 boxes and found a way to make the
kernel ignore the 100Mb/s NIC by adding this append line in elilo.conf:

append="reserve=0xd00,64"

See my post
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/003483.html

for details on how to figure out this parameter.

Remark: this has to be modified both in elilo.conf and elilo-ks.conf in
/boot/efi/efi/redhat/. The problem is that cluster-kickstart overwrites
these files at every reboot, and the setup is hardcoded into the
cluster-kickstart executable, so you need to figure out a way to work around
this. I grabbed cluster-kickstart.c from CVS, made the necessary modifications,
and installed the new binary on every compute node.
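
For reference, the resulting stanza in elilo.conf (and elilo-ks.conf) would
look something like the following; the image and root values here are
illustrative, only the append line carries the workaround:

```
image=vmlinuz-2.4.18-e.31smp
    label=linux
    root=/dev/sda2
    read-only
    append="reserve=0xd00,64"
```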

r.

--

  The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
            phone:+47 77 64 41 07, fax:+47 77 64 41 00
     Roy Dragseth, High Performance Computing System Administrator
       Direct call: +47 77 64 62 56. email: royd at cc.uit.no



From fds at sdsc.edu Mon Dec 15 11:31:01 2003
From: fds at sdsc.edu (Federico Sacerdoti)
Date: Mon, 15 Dec 2003 11:31:01 -0800
Subject: [Rocks-Discuss]Trying to integrate a new kernel into 3.0.0
In-Reply-To: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov>
References: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov>
Message-ID: <37508BEC-2F35-11D8-804D-000393A4725A@sdsc.edu>

We did indeed change our versioning scheme. We used to be "Redhat minus 5",
so a RH 7.3-based Rocks was called 2.3.x. This became moot when Redhat
quickly went from 8 to 9 to Enterprise 3. So we decided to be selfish
and move to 3.0.0 when we made a big internal change (Rolls and the end
of monolithic Rocks).

3.1.0 is a minor-number revision, which corresponds to how much has
changed in the Rocks code, not the underlying Redhat system. A bugfix
release would be 3.1.1, etc.

We hope this versioning scheme will be more resilient to Linux system
changes (which are out of our control), while keeping the focus on the
Rocks structure.


On Dec 13, 2003, at 8:31 AM, Tim Carlson wrote:

> Off thread, but it would seem to me that the numbering scheme for ROCKS
> got out of whack somewhere. Shouldn't 3.0.0 have been 2.3.3 and the new
> 3.1 been 3.0? The reasoning being that the current 3.0.0 is still RH
> 7.3
> based and the new 3.1 will be RH 3.0 based. Not that it matters. Just
> curious.
>
Federico

Rocks Cluster Group, San Diego Supercomputing Center, CA



From jlkaiser at fnal.gov Mon Dec 15 11:43:43 2003
From: jlkaiser at fnal.gov (Joseph L. Kaiser)
Date: Mon, 15 Dec 2003 13:43:43 -0600
Subject: [Rocks-Discuss]problem forcing a kernel
Message-ID: <1071517423.3719.4.camel@ajax.kaisergroup.net>

Hi,

I am trying to install this kernel:

kernel-smp-2.4.20-20.XFS1.3.1.i686.rpm and keep getting the following
whether I put it in the force directory of my distro or the regular RPMS
directory or contrib:

During package installation it gives me this:


/mnt/sysimage/var/tmpkernel-smp-2.4.20-20.9.XFS1.3.1.i686.rpm cannot be
opened. This is due to a missing file, a bad package, or bad media.
Press <return> to try again.


The file is there. The media is the network. I have installed the
package on other systems by hand. Any ideas?

Thanks,

Joe
From tmartin at physics.ucsd.edu Mon Dec 15 15:58:51 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Mon, 15 Dec 2003 15:58:51 -0800
Subject: [Rocks-Discuss]removing a node from the cluster
Message-ID: <3FDE4ABB.6030302@physics.ucsd.edu>

How does one go about removing a node from the cluster? Is there a
straightforward way to do this?

Terrence



From ebpeele2 at pams.ncsu.edu Mon Dec 15 16:42:47 2003
From: ebpeele2 at pams.ncsu.edu (Elliot Peele)
Date: Mon, 15 Dec 2003 19:42:47 -0500
Subject: [Rocks-Discuss]removing a node from the cluster
In-Reply-To: <3FDE4ABB.6030302@physics.ucsd.edu>
References: <3FDE4ABB.6030302@physics.ucsd.edu>
Message-ID: <1071535367.1871.1.camel@localhost.localdomain>

insert-ethers --replace hostname

Select compute from the menu then exit insert-ethers.

Elliot

On Mon, 2003-12-15 at 18:58, Terrence Martin wrote:
> How does one go about removing a node from the cluster? Is there a
> straight forward way to do this?
>
> Terrence
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031215/
ebf9581b/attachment-0001.bin

From phil at sdsc.edu Mon Dec 15 16:44:29 2003
From: phil at sdsc.edu (Philip Papadopoulos)
Date: Mon, 15 Dec 2003 16:44:29 -0800
Subject: [Rocks-Discuss]removing a node from the cluster
In-Reply-To: <3FDE4ABB.6030302@physics.ucsd.edu>
References: <3FDE4ABB.6030302@physics.ucsd.edu>
Message-ID: <3FDE556D.4040100@sdsc.edu>

insert-ethers --replace "compute-0-0"
select "compute" from the menu
and then hit f1 to exit.

This will re-create all of the files that have host names and remove
the node (you are essentially replacing the node named "compute-0-0" with
the empty set).

PBS will likely be unhappy with this change -- if I remember correctly,
it has an additional file that it creates when a node is added to the
queuing system -- and when the node doesn't appear in the host table, it
gets cranky. You should look in /opt/OpenPBS/server_priv/nodes to solve
this problem. Suppose you want to delete compute-0-0:

# qmgr -c "delete node compute-0-0"
# insert-ethers --replace "compute-0-0"


-P




Terrence Martin wrote:

> How does one go about removing a node from the cluster? Is there a
> straight forward way to do this?
>
> Terrence


--
==   Philip Papadopoulos, Ph.D.
==   Program Director for                 San Diego Supercomputer Center
==      Grid and Cluster Computing        9500 Gilman Drive
==   Ph: (858) 822-3628                   University of California, San Diego
==   FAX: (858) 822-5407                  La Jolla, CA 92093-0505




From gotero at linuxprophet.com Mon Dec 15 16:52:23 2003
From: gotero at linuxprophet.com (Glen Otero)
Date: Mon, 15 Dec 2003 16:52:23 -0800
Subject: [Rocks-Discuss]removing a node from the cluster
In-Reply-To: <1071535367.1871.1.camel@localhost.localdomain>
References: <3FDE4ABB.6030302@physics.ucsd.edu>
<1071535367.1871.1.camel@localhost.localdomain>
Message-ID: <1C2131BE-2F62-11D8-9436-000A95CD8EC8@linuxprophet.com>

On Dec 15, 2003, at 4:42 PM, Elliot Peele wrote:

> insert-ethers --replace hostname
>
> Select compute from the menu then exit insert-ethers.

Then run:

# insert-ethers --update

to update the database

Check the database entries with:
# dbreport hosts

Glen

>
> Elliot
>
> On Mon, 2003-12-15 at 18:58, Terrence Martin wrote:
>> How does one go about removing a node from the cluster? Is there a
>> straight forward way to do this?
>>
>> Terrence
>>
Glen Otero, Ph.D.
Linux Prophet
619.917.1772



From landman at scalableinformatics.com Mon Dec 15 17:13:29 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Mon, 15 Dec 2003 20:13:29 -0500
Subject: [Rocks-Discuss]removing a node from the cluster
In-Reply-To: <1C2131BE-2F62-11D8-9436-000A95CD8EC8@linuxprophet.com>
References: <3FDE4ABB.6030302@physics.ucsd.edu>
<1071535367.1871.1.camel@localhost.localdomain>
<1C2131BE-2F62-11D8-9436-000A95CD8EC8@linuxprophet.com>
Message-ID: <3FDE5C39.1030503@scalableinformatics.com>

Harumph:

       rmnode nasty_compute_node
       insert-ethers --update

(rmnode at http://scalableinformatics.com/downloads/rmnode.gz).

I thought insert-ethers had a simple version of this. All rmnode is, is
a hacked version of one of the other rocks tools.

Joe



Glen Otero wrote:

>
> On Dec 15, 2003, at 4:42 PM, Elliot Peele wrote:
>
>> insert-ethers --replace hostname
>>
>> Select compute from the menu then exit insert-ethers.
>
>
> Then run:
>
> # insert-ethers --update
>
> to update the database
>
> Check the database entries with:
>
> # dbreport hosts
>
> Glen
>
>>
>> Elliot
>>
>> On Mon, 2003-12-15 at 18:58, Terrence Martin wrote:
>>
>>> How does one go about removing a node from the cluster? Is there a
>>> straight forward way to do this?
>>>
>>> Terrence
>>>
> Glen Otero, Ph.D.
> Linux Prophet
> 619.917.1772


--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615




From csamuel at vpac.org Mon Dec 15 18:06:47 2003
From: csamuel at vpac.org (Chris Samuel)
Date: Tue, 16 Dec 2003 13:06:47 +1100
Subject: [Rocks-Discuss]ScalablePBS.
In-Reply-To: <B83C8894-2CD7-11D8-A2DC-000A95DA5638@sdsc.edu>
References: <200311212352.27000.Roy.Dragseth@cc.uit.no> <B83C8894-2CD7-11D8-
A2DC-000A95DA5638@sdsc.edu>
Message-ID: <200312161306.55651.csamuel@vpac.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sat, 13 Dec 2003 06:16 am, Mason J. Katz wrote:

>   This should become the basis of the PBS roll (currently openpbs). We
>   are seeking developers who would like to help write and maintain this
>   -- I'm not singling you out Roy, although you would be more than
>   welcome, rather I'm taking advantage of your message to solicit other
>   volunteers. Anyone?

I think we might be interested in getting involved with this; we migrated from
OpenPBS to ScalablePBS some time ago and spent quite a bit of time tracking
down memory leaks and the like with DJ and friends at SuperCluster.
We've also started using Rocks on a cluster that we manage for one of our
member institutions and a colleague of mine is having fun trying to get it to
go onto an Itanium cluster at the moment plus we should have some Opteron
boxes arriving in a month or so for a mini-cluster which we'd like to run
Rocks on.

Currently we install Rocks on the cluster and then remove the PBS and MAUI RPMs
and install SPBS and the 3.2.6 version of MAUI we have access to, so a
version that came with SPBS ready to go would make life a lot simpler for us.
:-)

cheers!
Chris
- --
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE/3mi3O2KABBYQAh8RAuSLAJ9Bx/5aCF8kRjHFapUpiASQUJeCTwCcD9y7
Y/ZM38t0J8r5dAYj1MdiUWA=
=bCIS
-----END PGP SIGNATURE-----



From bruno at rocksclusters.org Mon Dec 15 18:30:03 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Mon, 15 Dec 2003 18:30:03 -0800
Subject: [Rocks-Discuss]removing a node from the cluster
In-Reply-To: <3FDE5C39.1030503@scalableinformatics.com>
References: <3FDE4ABB.6030302@physics.ucsd.edu>
<1071535367.1871.1.camel@localhost.localdomain>
<1C2131BE-2F62-11D8-9436-000A95CD8EC8@linuxprophet.com>
<3FDE5C39.1030503@scalableinformatics.com>
Message-ID: <C13C5DE4-2F6F-11D8-B821-000A95C4E3B4@rocksclusters.org>

>   Harumph:
>
>          rmnode nasty_compute_node
>          insert-ethers --update
>
>   (rmnode at   http://scalableinformatics.com/downloads/rmnode.gz).
>
>   I thought insert-ethers had a simple version of this.   All rmnode is,
>   is a hacked version of one of the other rocks tools.

actually, since v3.0.0, i think it does:

http://www.rocksclusters.org/rocks-documentation/3.0.0/faq-configuration.html#REMOVE-NODE

    - gb



From bruno at rocksclusters.org Mon Dec 15 19:40:49 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Mon, 15 Dec 2003 19:40:49 -0800
Subject: [Rocks-Discuss]problem forcing a kernel
In-Reply-To: <1071517423.3719.4.camel@ajax.kaisergroup.net>
References: <1071517423.3719.4.camel@ajax.kaisergroup.net>
Message-ID: <A3F73894-2F79-11D8-B821-000A95C4E3B4@rocksclusters.org>

>   I am trying to install this kernel:
>
>   kernel-smp-2.4.20-20.XFS1.3.1.i686.rpm and keep getting the following
>   whether I put it in the force directory of my distro or the regular
>   RPMS
>   directory or contrib:
>
>   During package installation it gives me this:
>
>
>   /mnt/sysimage/var/tmpkernel-smp-2.4.20-20.9.XFS1.3.1.i686.rpm cannot be
>   opened. This is due to a missing file, a bad package, or bad media.
>   Press <return> to try again.
>
>
>   The file is there. The media is the network.    I have installed the
>   package on other systems by hand. Any ideas?

just to be sure, do you run the following after you copy the RPM into
the force directory:

          # cd /home/install
          # rocks-dist dist

    - gb



From bruno at rocksclusters.org Mon Dec 15 19:56:51 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Mon, 15 Dec 2003 19:56:51 -0800
Subject: [Rocks-Discuss]Adding partitions that are not reformatted under hard boots
or shoot-node
In-Reply-To: <3FD68B06.9010709@phys.ufl.edu>
References: <3FD68B06.9010709@phys.ufl.edu>
Message-ID: <E12881B4-2F7B-11D8-B821-000A95C4E3B4@rocksclusters.org>

sorry for the late response.

i recently tested the manual partitioning procedure on our upcoming
release and there was a bug. a fix has been committed for the next
release -- so manual partitioning will work on 3.1.0 as explained in
the 3.0.0 documentation.

    - gb


On Dec 9, 2003, at 6:55 PM, Jorge L. Rodriguez wrote:

>   Hi,
>
>   How do I add an extra partition to my compute nodes and retain the
>   data on all non / partitions when system hard boots or is shot?
>   I tried the suggestion in the documentation under "Customizing your
>   ROCKS Installation" where you replace the auto-partition.xml but hard
>   boots or shoot-nodes on these reformat all partitions instead of just
>   the /. I have also tried to modify the installclass.xml so that an
>   extra partition is added into the python code see below. This does
>   mostly what I want but now I can't shoot-node even though a hard boot
>   reinstalls without reformatting all but /. Is this the right approach?
>   I'd rather avoid having to replace installclass since I don't really
>   want to partition all nodes this way but if I must I will.
>
>   Jorge
>
>                         #
>                         # set up the root partition
>                         #
>                         args = [ "/" , "--size" , "4096",
>                                 "--fstype", "&fstype;",
>                                 "--ondisk", devnames[0] ]
>                         KickstartBase.definePartition(self, id, args)
>
>   # ---- Jorge, I added this args
>                         args = [ "/state/partition1" , "--size" ,
>   "55000",
>                                 "--fstype", "&fstype;",
>                                 "--ondisk", devnames[0] ]
>                         KickstartBase.definePartition(self, id, args)
>   # -----
>                         args = [ "swap" , "--size" , "1000",
>                                 "--ondisk", devnames[0] ]
>                         KickstartBase.definePartition(self, id, args)
>
>                         #
>                         # greedy partitioning
>                         #
>   # ----- Jorge, I change this from i = 1
>                         i = 2
>   # -----
>                         for devname in devnames:
>                                 partname = "/state/partition%d" % (i)
>                                 args = [ partname, "--size", "1",
>                                         "--fstype", "&fstype;",
>                                         "--grow", "--ondisk", devname ]
>                                 KickstartBase.definePartition(self, id,
>   args)
>
>                                 i = i + 1
>
>
>
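
For readers following along, the quoted definePartition() calls can be
sketched outside of anaconda as plain Python. The define_partition helper
below is illustrative only (the real Rocks KickstartBase API differs); it just
shows the kickstart "part" lines the argument lists amount to, including
Jorge's extra fixed partition and the greedy loop starting at 2:

```python
# Illustrative only: mimics what the definePartition() calls above
# request, rendered as kickstart "part" directives. Not the real API.
def define_partition(args):
    """Render an argument list into a kickstart 'part' line."""
    return "part " + " ".join(args)

fstype = "ext3"        # stand-in for the &fstype; entity
devnames = ["hda"]     # stand-in for the detected disk list

lines = []
# root partition
lines.append(define_partition(
    ["/", "--size", "4096", "--fstype", fstype, "--ondisk", devnames[0]]))
# Jorge's added fixed data partition
lines.append(define_partition(
    ["/state/partition1", "--size", "55000", "--fstype", fstype,
     "--ondisk", devnames[0]]))
# swap
lines.append(define_partition(
    ["swap", "--size", "1000", "--ondisk", devnames[0]]))

# greedy partitioning starts at 2 because partition1 is now taken
i = 2
for devname in devnames:
    lines.append(define_partition(
        ["/state/partition%d" % i, "--size", "1", "--fstype", fstype,
         "--grow", "--ondisk", devname]))
    i += 1

print("\n".join(lines))
```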



From jlkaiser at fnal.gov Mon Dec 15 20:17:52 2003
From: jlkaiser at fnal.gov (Joseph L. Kaiser)
Date: Mon, 15 Dec 2003 22:17:52 -0600
Subject: [Rocks-Discuss]problem forcing a kernel
In-Reply-To: <A3F73894-2F79-11D8-B821-000A95C4E3B4@rocksclusters.org>
References: <1071517423.3719.4.camel@ajax.kaisergroup.net>
 <A3F73894-2F79-11D8-B821-000A95C4E3B4@rocksclusters.org>
Message-ID: <1071548271.3720.0.camel@ajax.kaisergroup.net>

yup
On Mon, 2003-12-15 at 21:40, Greg Bruno wrote:
> > I am trying to install this kernel:
> >
> > kernel-smp-2.4.20-20.XFS1.3.1.i686.rpm and keep getting the following
> > whether I put it in the force directory of my distro or the regular
> > RPMS
> > directory or contrib:
> >
> > During package installation it gives me this:
> >
> >
> > /mnt/sysimage/var/tmpkernel-smp-2.4.20-20.9.XFS1.3.1.i686.rpm cannot be
> > opened. This is due to a missing file, a bad package, or bad media.
> > Press <return> to try again.
> >
> >
> > The file is there. The media is the network. I have installed the
> > package on other systems by hand. Any ideas?
>
> just to be sure, do you run the following after you copy the RPM into
> the force directory:
>
>     # cd /home/install
>     # rocks-dist dist
>
>   - gb
>



From Roy.Dragseth at cc.uit.no Tue Dec 16 02:13:50 2003
From: Roy.Dragseth at cc.uit.no (Roy Dragseth)
Date: Tue, 16 Dec 2003 11:13:50 +0100
Subject: [Rocks-Discuss]ScalablePBS.
In-Reply-To: <B83C8894-2CD7-11D8-A2DC-000A95DA5638@sdsc.edu>
References: <200311212352.27000.Roy.Dragseth@cc.uit.no> <B83C8894-2CD7-11D8-
A2DC-000A95DA5638@sdsc.edu>
Message-ID: <200312161113.50076.Roy.Dragseth@cc.uit.no>

On Friday 12 December 2003 20:16, Mason J. Katz wrote:
> This should become the basis of the PBS roll (currently openpbs). We
> are seeking developers who would like to help write and maintain this
> -- I'm not singling you out Roy, although you would be more than
> welcome, rather I'm taking advantage of your message to solicit other
> volunteers. Anyone?
>

I talked to my boss and he gave me thumbs up, so I'll be glad to take care of
the Maui/PBS roll of rocks.

I'd love to see some more hands in the air as maintainers/testers...

r.


--

  The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
            phone:+47 77 64 41 07, fax:+47 77 64 41 00
     Roy Dragseth, High Performance Computing System Administrator
       Direct call: +47 77 64 62 56. email: royd at cc.uit.no



From daniel.kidger at quadrics.com Tue Dec 16 07:08:44 2003
From: daniel.kidger at quadrics.com (Dan Kidger)
Date: Tue, 16 Dec 2003 15:08:44 +0000
Subject: [Rocks-Discuss]custom-kernels : naming conventions ? (Rocks 3.0.0)
In-Reply-To:
<20031209180224.24711.h014.c001.wm@mail.linuxprophet.com.criticalpath.net>
References:
<20031209180224.24711.h014.c001.wm@mail.linuxprophet.com.criticalpath.net>
Message-ID: <3FDF1FFC.60501@quadrics.com>

Glen et al.

>I recently had the same problem when building a quadrics cluster on Rocks 2.3.2
>with the qsnet-RedHat-kernel-2.4.18-27.3.4qsnet.i686.rpms. The problem is
>definitely in the naming of the rpms, in that anaconda running on the compute
>nodes is not going to recognize kernel rpms that begin with 'qsnet' as potential
>boot options. Unfortunately, being under a severe time contraint, I resorted to
>manually installing the qsnet kernel on all nodes of the cluster, which isn't
>the Rocks way. The long term solution is to mangle the kernel makefiles so that
>the qsnet kernel rpms have conventional kernel rpm names, which is what Greg's
>post referred to.

    I have been thinking about this.

I reckon that the long term solution is *not* to rename the kernel that
we use (nor indeed to change the naming convention of any other kernels
that people want to work on). As well as the triplet version numbering
and the architecture, the kernel naming that we use includes the kernel
source tree (Redhat, Suse, LSY, Vanilla, ...) and our patch-level
version numbering triplet.
   Quadrics cannot be the only people who need the freedom to include extra
information in our naming convention for kernels.
The solution must lie either in anaconda itself or, more likely, in a
cleaner way to include extra kernel(s) as well as the stock one in the
compute node install process.
Using extend-nodes.xml this works, apart from niggles about the
/boot/grub/menu.lst that our kernel post-install configures getting
clobbered by Rocks.

Yours,
Daniel.


gotero at linuxprophet.com wrote:

>Daniel-
>
>
>

--
Yours,
Daniel.
--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------




From mjk at sdsc.edu Tue Dec 16 07:09:56 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 16 Dec 2003 07:09:56 -0800
Subject: [Rocks-Discuss]ScalablePBS.
In-Reply-To: <200312161113.50076.Roy.Dragseth@cc.uit.no>
References: <200311212352.27000.Roy.Dragseth@cc.uit.no> <B83C8894-2CD7-11D8-
A2DC-000A95DA5638@sdsc.edu> <200312161113.50076.Roy.Dragseth@cc.uit.no>
Message-ID: <E89F1F82-2FD9-11D8-A2DC-000A95DA5638@sdsc.edu>

Fantastic! I think this puts us at three people who have volunteered
to help out on this. I will follow up on this and help organize,
support, and do some of the development also. But I'm going to push
this back until after we get 3.1 out, which looks like Monday.

     -mjk

On Dec 16, 2003, at 2:13 AM, Roy Dragseth wrote:

> On Friday 12 December 2003 20:16, Mason J. Katz wrote:
>> This should become the basis of the PBS roll (currently openpbs). We
>> are seeking developers who would like to help write and maintain this
>> -- I'm not singling you out Roy, although you would be more than
>> welcome, rather I'm taking advantage of your message to solicit other
>> volunteers. Anyone?
>>
>
> I talked to my boss and he gave me thumbs up, so I'll be glad to take
> care of
> the Maui/PBS roll of rocks.
>
> I'd love to see some more hands in the air as maintainers/testers...
>
> r.
>
>
> --
>
>    The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
>            phone:+47 77 64 41 07, fax:+47 77 64 41 00
>       Roy Dragseth, High Performance Computing System Administrator
>       Direct call: +47 77 64 62 56. email: royd at cc.uit.no



From mjk at sdsc.edu Tue Dec 16 07:37:04 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 16 Dec 2003 07:37:04 -0800
Subject: [Rocks-Discuss]custom-kernels : naming conventions ? (Rocks 3.0.0)
In-Reply-To: <3FDF1FFC.60501@quadrics.com>
References:
<20031209180224.24711.h014.c001.wm@mail.linuxprophet.com.criticalpath.net>
<3FDF1FFC.60501@quadrics.com>
Message-ID: <B3192AFA-2FDD-11D8-A2DC-000A95DA5638@sdsc.edu>

  If you rename the Linux kernel to include other arbitrary strings, the
RedHat kickstart installer will not recognize it as a kernel. This
means you lose probing for the correct x86 CPU (386/486/586/686) and
probing for SMP vs. uni. This implies you would need to re-write the
anaconda code to do this for arbitrarily named packages; if you could
convince RedHat to do this, great, but it's not worth our development
time to do this ourselves when properly named kernel packages work
wonderfully. The unfortunate reality is the kernel RPM is not just
another package -- it has some special installation logic to optimize
for your hardware. Sure, they could have done this better, but they do a
darn good job as is.

This is not a Rocks issue; it means you have created a package that
does not work with RedHat. I understand why you need to include extra
strings in the kernel name, but suggest that there are several
alternatives to this that don't break RedHat kickstart. For example,
you could:

      - Write a kernel version module that reports the same information
on /proc/qsnet_kernel.

      - Have the kernel RPM install a /usr/doc/qsnet/VERSION file.

      - Have a subpackage of the kernel RPM that includes the extra strings
(and extra docs).

      - Stop patching the kernel and only use a module. True, some things
require kernel patches, but almost all driver changes can go into
modules only. This was not always true a few years ago; the module
system has improved a lot.

We've faced numerous issues like this with RedHat in creating Rocks,
and for every issue we have found a work around that keeps us w/in the
RedHat way of doing things. This is not always optimal for development
but always yields a simpler, and more supportable, system.

     -mjk


On Dec 16, 2003, at 7:08 AM, Dan Kidger wrote:

> Glen et al.
>
>> I recently had the same problem when building a quadrics cluster on
>> Rocks 2.3.2
>> with the qsnet-RedHat-kernel-2.4.18-27.3.4qsnet.i686.rpms. The
>> problem is
>> definitely in the naming of the rpms, in that anaconda running on the
>> compute
>> nodes is not going to recognize kernel rpms that begin with 'qsnet'
>> as potential
>> boot options. Unfortunately, being under a severe time contraint, I
>> resorted to
>> manually installing the qsnet kernel on all nodes of the cluster,
>> which isn't
>> the Rocks way. The long term solution is to mangle the kernel
>> makefiles so that
>> the qsnet kernel rpms have conventional kernel rpm names, which is
>> what Greg's
>> post referred to.
>
>     I have been thinking about this.
>
> I reckon that the long term solution is *not* to rename the kernel
> that we use. (nor indeed to change the naming convention of any other
> kernels that people want to work on).     As well as the triplet version
> numbering and the architecture, the kernel naming that we use includes
> the kernel source tree (Redhat, Suse, LSY, Vanilia, ..) and our partch
> level version numering triplet.
>    Quadrics cannot be the only people who need freedom to include extra
> information in our naming convention for kernels.
> The solution must lie in either annaconda itself or more likely a
> cleaner way to include extra kernel(s) as well as the stock one in the
> compute node install process. Using extend-nodes.xml this works apart
> from niggles about the /boot/grub/menu.lst that our kernel
> post-instal;l configures getting clobbered by Rocks.
>
> Yours,
> Daniel.
>
>
> gotero at linuxprophet.com wrote:
>
>> Daniel-
>>
>>
>
> --
> Yours,
> Daniel.
>
> --------------------------------------------------------------
> Dr. Dan Kidger, Quadrics Ltd.        daniel.kidger at quadrics.com
> One Bridewell St., Bristol, BS1 2AA, UK           0117 915 5505
> ----------------------- www.quadrics.com --------------------
>



From dtwright at uiuc.edu Tue Dec 16 11:45:55 2003
From: dtwright at uiuc.edu (Dan Wright)
Date: Tue, 16 Dec 2003 13:45:55 -0600
Subject: [Rocks-Discuss]a minor ganglia question
Message-ID: <20031216194554.GH26246@uiuc.edu>

Hello all,

I'm in the process of setting up a 3.0.0 cluster and have a question about the
"Physical view" in ganglia. In this view (which is quite cool, BTW :) it shows
higher-numbered nodes on top and lower-numbered nodes on bottom:

compute-0-12
...
compute-0-2
compute-0-1
compute-0-0

and my cluster is physically reversed from that:

compute-0-0
compute-0-1
compute-0-2
...
compute-0-12

Is there an easy way to switch this display around so it matches the real
physical layout? I poked around in ganglia for a few minutes and didn't see
anything obvious, so I thought I'd ask before I actually start wasting time on
this :)

Thanks,

- Dan Wright
(dtwright at uiuc.edu)
(http://www.scs.uiuc.edu/)
(UNIX Systems Administrator, School of Chemical Sciences, UIUC)
(333-1728)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : https://lists.sdsc.edu/pipermail/npaci-rocks-
discussion/attachments/20031216/28f3eb5a/attachment-0001.bin

From purikk at hotmail.com Tue Dec 16 12:34:51 2003
From: purikk at hotmail.com (Purushotham Komaravolu)
Date: Tue, 16 Dec 2003 15:34:51 -0500
Subject: [Rocks-Discuss]hardware-setup for the Rocks cluster
References: <200312162016.hBGKGuJ05160@postal.sdsc.edu>
Message-ID: <BAY1-DAV575EPSM0omP0000cb94@hotmail.com>

Hi All,
       We are trying to set up a Rocks cluster with 1 frontend and 20 compute
nodes.
Frontend:
  1) Dual Pentium Xeon 2.4 GHz, PC 533, 512k L2 cache
  2) Dual-port Gigabit Ethernet
  3) 1 GB DDR RAM
  4) 3x 200 GB EIDE Ultra ATA 100

Compute nodes:
  1) Pentium Xeon 2.4 GHz, PC 533, 512k L2 cache
  2) Dual-port Gigabit Ethernet
  3) 1 GB DDR RAM
  4) 41 GB UDMA EIDE
1 HP ProCurve 24-port switch


Does the setup look ok?
Does Rocks support the following features?

* Remote power monitoring for individual nodes

* Temperature monitoring of individual processors

* Power sequencing on startup to prevent possible power spiking

* Remote power-down and reset of system and nodes

* Serial access to nodes

* Disk cloning

* Plug-In Extensible Architecture

* Image Manager

and also:

How should the disks be set up? Do all the disks need to be attached to the
frontend, with compute nodes having small 3 or 4 GB disks?

Can someone point me to clustering software which supports all of the above
features if Rocks doesn't support them?

thanks a lot

Regards,

Puru




From purikk at hotmail.com Tue Dec 16 12:39:19 2003
From: purikk at hotmail.com (Purushotham Komaravolu)
Date: Tue, 16 Dec 2003 15:39:19 -0500
Subject: [Rocks-Discuss]Java Rocks cluster
Message-ID: <BAY1-DAV62R0rmTIVvL0000cc3a@hotmail.com>

I am a newbie to ROCKS.
I have a question about running Java on a Rockster.
Is it possible to start only one JVM on one machine and have the task be run
distributed on the cluster? It is a multi-threaded application.
Say I have an application with 100 threads. Can I have 50 threads run on one
machine and 50 on another by launching the application (JVM) on one machine
(similar to Sun Firebird)? I don't want to use MPI or any special code.
Thanks,
Sincerely,
Puru
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.sdsc.edu/pipermail/npaci-rocks-
discussion/attachments/20031216/ee12ac80/attachment-0001.html

From mjk at sdsc.edu   Tue Dec 16 13:20:24 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 16 Dec 2003 13:20:24 -0800
Subject: [Rocks-Discuss]Java Rocks cluster
In-Reply-To: <BAY1-DAV62R0rmTIVvL0000cc3a@hotmail.com>
References: <BAY1-DAV62R0rmTIVvL0000cc3a@hotmail.com>
Message-ID: <A9849F18-300D-11D8-A2DC-000A95DA5638@sdsc.edu>

There are a few research projects that do map Java threads onto cluster
compute node processes. At the IEEE Cluster '03 conference a couple of
weeks ago in Hong Kong there were a few interesting Java talks on this
subject. You can see the schedule at the following link and do some
Google research for more info. I think the papers will be online
soon...

http://www.csis.hku.hk/cluster2003/advance-program.html

Rocks 3.1 will include a Java Roll, but this is nothing more than Sun's
Java sdk/rte and doesn't do any cluster magic for you.


       -mjk

On Dec 16, 2003, at 12:39 PM, Purushotham Komaravolu wrote:

>   I am a newbie to ROCKS.
>   I have a question about running Java on a Rockster.
>   Is it possible that I can start only one JVM on one machine and the
>   task be run distributed on the cluster? It is a multi-threaded
>   application.
>   Like say, I have an application with 100 threads. Can I have 50
>   threads run on one machine and 50 on another by launching the
>   application(jvm) on one machine? (similar to SUN Firebird) I dont want
>   to use MPI or any special code.
>   Thanks
>   Sincerely
>   Puru



From phil at sdsc.edu Tue Dec 16 13:38:48 2003
From: phil at sdsc.edu (Philip Papadopoulos)
Date: Tue, 16 Dec 2003 13:38:48 -0800
Subject: [Rocks-Discuss]hardware-setup for the Rocks cluster
In-Reply-To: <BAY1-DAV575EPSM0omP0000cb94@hotmail.com>
References: <200312162016.hBGKGuJ05160@postal.sdsc.edu> <BAY1-
DAV575EPSM0omP0000cb94@hotmail.com>
Message-ID: <3FDF7B68.3030302@sdsc.edu>


Purushotham Komaravolu wrote:

>Hi All,
>        We are trying to setup rocks cluster with 1 front and 20 computing
>nodes.
>Frontend:
> 1) Dual Pentium Xeon 2.4 GHz PC 533 and 512lk L2 Cache
> 2) Dual port Gigabit Ethernet
> 3) 1 GB DDR RAM
>   4) 3* 200 GB EIDE ULTRA ATA 100
>
>Compute nodes:
>     1) Pentium Xeon 2.4 GHz PC 533 and 512k L2 Cache
> 2) Dual port Gigabit Ethernet
> 3) 1 GB DDR RAM
>   4) 41 GB UDMA EIDE
>1 HP Procurve 24 port switch
>
>
>Does the setup look ok?
>
Setup looks fine.

>
>
>Does Rocks support the following features
>Remote power monitoring for individual nodes
>
>*Temperature monitoring of individual processors
>
Not directly -- there isn't a completely general solution to this --
though lm_sensors is good for non-server boards. However, nothing
prevents you from adding the proper software. It's fairly easy to add
metrics to ganglia if you have the baseline drivers for your particular
temperature monitoring software.
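
The approach Phil describes -- parse your temperature-monitoring software's output and feed it into ganglia -- can be sketched roughly as below. This is a hedged illustration, not Rocks code: the `temp1` label, the `sensors` invocation, and the metric name are assumptions that vary by board and install; the gmetric flags (`--name`, `--value`, `--type`, `--units`) are standard ganglia.

```python
import re
import subprocess

def read_cpu_temp(sensors_output: str) -> float:
    """Pull a CPU temperature out of lm_sensors `sensors` output.
    The 'temp1' label is an assumption; boards name their sensors differently."""
    match = re.search(r"temp1:\s*\+?([\d.]+)", sensors_output)
    if not match:
        raise ValueError("no temp1 line found in sensors output")
    return float(match.group(1))

def publish_temp() -> None:
    """Read the local sensor and publish it as a custom ganglia metric."""
    output = subprocess.run(["sensors"], capture_output=True,
                            text=True, check=True).stdout
    temp = read_cpu_temp(output)
    # gmetric multicasts the value so it appears alongside the stock metrics.
    subprocess.run(["gmetric", "--name", "cpu_temp", "--value", str(temp),
                    "--type", "float", "--units", "Celsius"], check=True)
```

Run periodically (e.g. from cron) on each node, the value then shows up on the ganglia pages like any built-in metric.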

>
>*Power sequencing on startup to prevent possible power spiking
>
>*Remote power-down and reset of system and nodes
>
>*Serial access to nodes
>
All of these generally require another network (serial, lights-out
management, etc). We don't assume any of these extra networks exist.
Again, layering that functionality atop Rocks is very, very
straightforward. See the FAQ for how to add packages to nodes.

>
>*Disk cloning
>
No. Emphatically no. Disk cloning is not anywhere in the Rocks
vocabulary. We have distributions (RedHat + Rocks + cluster tools + your
own software) and a way to generate a kickstart file programmatically.
Disk cloning assumes homogeneity of hardware (we don't), requires a
custom aftermarket installer to fix up a node after an image is put on
it (we use RedHat as the installer), and requires a completely different
image for every functional type of node (frontend, compute, nfs, web,
pvfs, etc).

>
>*Plug-In Extensible Architecture
>
Uh. Yeah. That's the whole point. Again, see the FAQ for how you add
packages. Rolls are an additional extension mechanism that lets you add
larger chunks of functionality at cluster build time. We extend base
Rocks with grid software, schedulers, Java, and community-specific
software stacks. You should wait (about 5 days) for the final release of
3.1.0 to see how rolls work.

>
>*Image Manager
>
Definitely no. There are no images in Rocks. We have distributions and
appliance types. A graph description of appliances is melded with
distributions to define a complete node. Shared configuration is truly
shared. None of that happens with images -- the base software and the
configuration are locked together.

>
>and also
>
>How should the disk setup be? Do all the disks need to be attached to the
>frontend, with compute nodes having small 3 or 4 GB disks?
>
Nodes must be diskfull (each node needs a local disk); type and size are
up to you (8GB is probably the minimum given the size of Linux these
days). You can put as many disks as you want on your frontend and have
it double as an NFS server for your cluster (the default). You can build
other NFS servers easily (and manage them as easily as you do a compute
node).

>
>Can someone point me to a clustering software which supports all the above
>features if Rocks doesn't support them.
>
Sorry. Doesn't exist. Pick the things that you can live without today
(but would
want to add tomorrow).

-P

>
>thanks a lot
>
>Regards,
>
>Puru
>
>
>
>
>
>
--
==   Philip Papadopoulos, Ph.D.
==   Program Director for                  San Diego Supercomputer Center
==      Grid and Cluster Computing         9500 Gilman Drive
==   Ph: (858) 822-3628                    University of California, San Diego
==   FAX: (858) 822-5407                   La Jolla, CA 92093-0505




From mjk at sdsc.edu Tue Dec 16 13:38:59 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 16 Dec 2003 13:38:59 -0800
Subject: [Rocks-Discuss]hardware-setup for the Rocks cluster
In-Reply-To: <BAY1-DAV575EPSM0omP0000cb94@hotmail.com>
References: <200312162016.hBGKGuJ05160@postal.sdsc.edu> <BAY1-
DAV575EPSM0omP0000cb94@hotmail.com>
Message-ID: <421F6254-3010-11D8-A2DC-000A95DA5638@sdsc.edu>

On Dec 16, 2003, at 12:34 PM, Purushotham Komaravolu wrote:

>   Hi All,
>          We are trying to setup rocks cluster with 1 front and 20
>   computing
>   nodes.
>   Frontend:
>    1) Dual Pentium Xeon 2.4 GHz PC 533 and 512k L2 Cache
>     2) Dual port Gigabit Ethernet
>     3) 1 GB DDR RAM
>      4) 3* 200 GB EIDE ULTRA ATA 100
>
>   Compute nodes:
>        1) Pentium Xeon 2.4 GHz PC 533 and 512k L2 Cache
>     2) Dual port Gigabit Ethernet
>     3) 1 GB DDR RAM
>      4) 41 GB UDMA EIDE
>   1 HP Procurve 24 port switch
>
>
>   Does the setup look ok?

Sounds good. If you have device driver issues, just wait until next week
when 3.1 comes out; it will have a new kernel and more supported
hardware.

> Does Rocks support the following features
> Remote power monitoring for individual nodes

Ethernet addressable power strips can be used for this.

> *Temperature monitoring of individual processors

No, although a ganglia module can be created to do this. The problem
is there isn't a common standard out there for *all* hardware right
now.

> *Power sequencing on startup to prevent possible power spiking
Ethernet addressable power strips can be used for this.

> *Remote power-down and reset of system and nodes

Yes (using sw). For hw control you would need a remote management
board in every node, or ethernet addressable power strips.

> *Serial access to nodes

No, Rocks uses ssh and ethernet for this. But you can add your own
serial port concentrator if you need one.

> *Disk cloning

Nope, this doesn't scale in either system or people time. Rocks uses
RedHat's Kickstart to build the disk image on each node in a cluster
programmatically. This is extremely fast -- in fact a 128 node cluster
can be built from scratch (including hardware integration) in under 2
hours, and the entire cluster can be reinstalled in around 12 minutes.
We did this as a demonstration of Rocks' scalability at SC'03 (we even
have a movie of it).

> *Plug-In Extensible Architecture

Yes. You can add to the cluster database and extend our utilities.
Everything is open.

> *Image Manager

Rocks does not do system imaging. We have a utility called rocks-dist
that builds distributions for you. This combined with the XML profile
graph gives you what you want here.

> How should the disk setup be? Do all the disks need to be attached to the
> frontend, with compute nodes having small 3 or 4 GB disks?

Buy the smallest modern HD you can for the compute nodes (4 GB is
fine). By default the frontend serves user directories over NFS, so you
should have more storage on the frontend node.


     -mjk



From landman at scalableinformatics.com Tue Dec 16 13:43:51 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 16 Dec 2003 16:43:51 -0500
Subject: [Rocks-Discuss]Java Rocks cluster
In-Reply-To: <BAY1-DAV62R0rmTIVvL0000cc3a@hotmail.com>
References: <BAY1-DAV62R0rmTIVvL0000cc3a@hotmail.com>
Message-ID: <1071611031.9903.77.camel@squash.scalableinformatics.com>

Hi Puru:

  Java threads are shared memory objects at this moment. You would need
to look at thread-migration schemes to layer atop the process, and a
distributed shared memory model to handle memory issues. I don't think
Java natively supports this, so you will likely have to appeal to some
other method.

  Moreover, shared memory across slower cluster network fabrics is
painful at best. If you are going to work on a single system image
machine with shared memory, you want the fastest/best fabric you can
get.

  If it is easier to re-architect your code to be independent worker
processes, you could write it using JVMs and simple sockets or similar.
If it is threaded, you may have problems parallelizing it on a cluster.
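
Joe's suggestion -- independent worker processes exchanging tasks over simple sockets instead of shared-memory threads -- can be sketched as below. Python stands in for Java purely for brevity, a local thread stands in for a JVM on another node, and the JSON task format is invented for the example; the point is the pattern, not the implementation.

```python
import json
import socket
import threading

def recv_all(sock: socket.socket) -> bytes:
    """Read until the peer closes (or half-closes) its sending side."""
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            return b"".join(chunks)
        chunks.append(data)

def worker(srv: socket.socket) -> None:
    """Independent worker: accept one task, compute, reply, exit.
    In the Java scenario this would be a plain JVM process on another node."""
    conn, _ = srv.accept()
    task = json.loads(recv_all(conn))
    result = sum(x * x for x in task["numbers"])   # the "work": sum of squares
    conn.sendall(json.dumps({"result": result}).encode())
    conn.close()
    srv.close()

def dispatch(numbers: list) -> int:
    """Master: hand a chunk of work to a worker over a socket, await the answer."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))            # ephemeral port; worker is local here
    srv.listen(1)
    t = threading.Thread(target=worker, args=(srv,))
    t.start()
    sock = socket.create_connection(srv.getsockname())
    sock.sendall(json.dumps({"numbers": numbers}).encode())
    sock.shutdown(socket.SHUT_WR)         # signal end-of-task to the worker
    reply = json.loads(recv_all(sock))
    sock.close()
    t.join()
    return reply["result"]
```

Because each worker owns its memory and nothing is shared, the cluster fabric only carries task and result messages -- which is exactly why this shape avoids the distributed-shared-memory pain described above.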

Joe

On Tue, 2003-12-16 at 15:39, Purushotham Komaravolu wrote:
> I am a newbie to ROCKS
> I have a question about running Java on a Rockster.
> Is it possible that I can start only one JVM on one machine and the
> task be run distributed on the cluster? It is a multi-threaded
> application.
> Like say, I have an application with 100 threads. Can I have 50
> threads run on one machine and 50 on another by launching the
> application(jvm) on one machine?(similar to SUN Firebird) I dont want
> to use MPI or any special code.
> Thanks
> Sincerely
> Puru
--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615



From rscarce at caci.com Tue Dec 16 10:56:18 2003
From: rscarce at caci.com (Reed Scarce)
Date: Tue, 16 Dec 2003 13:56:18 -0500
Subject: [Rocks-Discuss]grub / boot / fdisk problem
Message-ID: <OF2C6AD168.EB3D778E-ON85256DFE.0067CF1C-85256DFE.006812B4@caci.com>

I installed Rocks on a primary master hard drive.
It became necessary to re-install, so I took an
identical hd and made it primary master. The first drive, which boots
fine, was left off the system to act as an archive, to mount after the
new system was up and running.
The new system was installed and works great, now to correctly install
the old drive as primary slave, reboot, mount and copy the scripts and
configs to the new system!
There the problem began.
When I boot either drive as primary master and only primary drive,
they boot fine.
When I connect either drive correctly configured and recognized by the
BIOS, as primary or secondary slave - grub gives a GRUB prompt and
won't boot.
Something interesting, when booted from a floppy (mkbootdisk)from the
new disk, in /var/log/dmesg both drives are visible but fdisk reports
the partition table is empty - so I can't mount the drive from a
floppy boot.
dmesg is like this: (my comments)
hda: ST34321A, ... (pri master)
hdb: ST34321A, ... (pri slave)
hdc: FX4010M, ATAPI CD/DVD-ROM drive (secnd master)
hdd: ST320420A, ... (secnd slave)
ide0 at ... (ide pri chain)
ide1 at ... (ide secnd chain)
hda: 8404830 sectors ... (good)
hdb: 8404830 sectors ... (good)
hdd: 39851760 sectors ... (good)
ide-floppy driver ... (ok)
Partition check: (<---<<<this is where it gets interesting)
hda:
hdb:
hdd: hdd1 hdd2 hdd3 (<---<<<that's right, hdd is now the boot drive.
Even if I boot without the floppy, hdd is the boot drive.)

Any suggestions?




Reed Scarce
Systems Engineer
CACI, Inc.
1100 N. Glebe Rd
Arlington, VA 22201
(703) 841-3045

From ShiYi.Yue at astrazeneca.com Tue Dec 16 14:05:46 2003
From: ShiYi.Yue at astrazeneca.com (ShiYi.Yue at astrazeneca.com)
Date: Tue, 16 Dec 2003 23:05:46 +0100
Subject: [Rocks-Discuss]hardware compatibility check with Rocks 3.00
Message-ID: <D2A2B86E8730D711B8560008028AC980257A2E@camrd9.camrd.astrazeneca.net>

hi,

I was wondering if there is a way to set up a hardware compatibility
check in the kickstart of Rocks, and give us an opportunity to add the
drivers once incompatible hardware is detected.

I have some PCs with Broadcom Gbit 10/100/1000 network cards, and it
looks like Rocks 3.0 was not happy with these cards. The only way I can
fix this now (without rebuilding the distribution) is to replace them. I
am afraid this type of situation will happen again and again since RH7.3
is getting older and older.
I hope I am wrong and someone can point me to a solution.
Shi-Yi
shiyi.yue at astrazeneca.com



From mjk at sdsc.edu Tue Dec 16 14:55:38 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 16 Dec 2003 14:55:38 -0800
Subject: [Rocks-Discuss]hardware compatibility check with Rocks 3.00
In-Reply-To: <D2A2B86E8730D711B8560008028AC980257A2E@camrd9.camrd.astrazeneca.net>
References: <D2A2B86E8730D711B8560008028AC980257A2E@camrd9.camrd.astrazeneca.net>
Message-ID: <F7910D2D-301A-11D8-A2DC-000A95DA5638@sdsc.edu>

We've been thinking about this off and on for over a year -- it's a
pretty hard problem. The real trick to supporting all hardware is
keeping the boot kernel current. We've let our releases get old and
more and more people are seeing hardware support issues.

Rocks 3.1 (out next week) will include the latest RedHat kernel from
RHEL 3.0. This will fix most of the hardware support issues out there.
When we release, please download 3.1 and try it with your hardware; if
it still fails, please let us know. Thanks.

          -mjk


On Dec 16, 2003, at 2:05 PM, ShiYi.Yue at astrazeneca.com wrote:

>   hi,
>
>   I was wondering if there is a way to set a hardware compability check
>   in the
>   kickstart of Rocks, and give us an oppotunity to add the drvers once
>   the
>   uncompatible hardware was detected.
>
>   I have some PCs with Broadcom Gbit 10/100/1000 network cards, It looks
>   Rocks
>   3.0 was not happy with these network cards. The only way I can do now
>   (without rebuild the distribution) is to replace these cards. I am
>   afraid
>   this type of situation will happen again and again since RH7.3 is
>   getting
>   older and older.
>   I hope I were wrong and someone can point me a solution.
>   Shi-Yi
>   shiyi.yue at astrazeneca.com



From msherman at informaticscenter.info Tue Dec 16 16:25:45 2003
From: msherman at informaticscenter.info (Mark Sherman)
Date: Tue, 16 Dec 2003 17:25:45 -0700
Subject: [Rocks-Discuss]RE: Rocks-Discuss] AMD Opteron - Contact Appro
Message-ID: <20031217002545.17912.qmail@webmail-2-2.mesa1.secureserver.net>

Hello,
I'm an administrator on a pure i386 cluster under Rocks 3.0.0, and our clients are
pushing us to include some Opteron nodes. I'm trying to find out the feasibility of
such an addition. I know there's been a lot of talk about Opterons on the Rocks
list, so I'm wondering if someone can give a boiled-down can-do / can't-do /
maybe-but-we-haven't-tested-it-yet kind of status.
With that, I'd say I'm probably willing to be a pseudo-beta site and give feedback
on how the system works.
Thank you very much, and keep up the good work. I love the Rocks system.
~M
______________________________________________
Mark Sherman
Computing Systems Administrator
Informatics Center
Massachusetts Biomedical Initiatives
Worcester MA 01605
508-797-4200
msherman at informaticscenter.info
----------------------~-----------------------


>   -------- Original Message --------
>   Subject: [Rocks-Discuss]RE: Rocks-Discuss] AMD Opteron - Contact Appro
>   From: "Jian Chang" <jian at appro.com>
>   Date: Fri, December 12, 2003 6:27 pm
>   To: "Bryan Littlefield" <bryan at UCLAlumni.net>,
>   npaci-rocks-discussion at sdsc.edu, mjk at sdsc.edu
>
>   Hello Mason / Puru,
>
>   I got your contact information from Bryan Littlefield.
>   I would like to discuss with you regarding benchmark test systems you
>   might need down the road.
>   We can also share with you our findings as to what is compatible in the
>   Opteron systems.
>   Please reply with your phone number where I can reach you, and I will
>   call promptly.
>
>   Bryan,
>
>   Thank you for the referral.
>
>   Best regards,
>
>   Jian Chang
>   Regional Sales Manager
>   (408) 941-8100 x 202
>   (800) 927-5464 x 202
>   (408) 941-8111 Fax
>   jian at appro.com
>   www.appro.com
>
>   -----Original Message-----
>   From: Bryan Littlefield [mailto:bryan at UCLAlumni.net]
>   Sent: Tuesday, December 09, 2003 12:14 PM
>   To: npaci-rocks-discussion at sdsc.edu; mjk at sdsc.edu
>   Cc: Jian Chang
>   Subject: Rocks-Discuss] AMD Opteron - Contact Appro
>
>   Hi Mason,
>
>   I suggest contacting Appro. We are using Rocks on our Opteron cluster
>   and Appro would likely love to help. I will contact them as well to see
>   if they could help getting a opteron machine for testing. Contact info
>   below:
>
>   Thanks --Bryan
>
>   Jian Chang - Regional Sales Manager
>   (408) 941-8100 x 202
>   (800) 927-5464 x 202
>   (408) 941-8111 Fax
>   jian at appro.com
>   http://www.appro.com
>
>   npaci-rocks-discussion-request at sdsc.edu wrote:
>
>
>   From: "Mason J. Katz"   <mailto:mjk at sdsc.edu> <mjk at sdsc.edu>
>   Subject: Re: [Rocks-Discuss]AMD Opteron
>   Date: Tue, 9 Dec 2003 07:28:51 -0800
>   To: "purushotham komaravolu"   <mailto:purikk at hotmail.com>
>   <purikk at hotmail.com>
>
>   We have a beta right now that we have sent to a few people.   We plan on
>
>   a release this month, and AMD_64 will be part of this release along
>   with the usual x86, IA64 support.
>
>   If you want to help accelerate this process please talk to your vendor
>
>   about loaning/giving us some hardware for testing.   Having access to a
>
>   variety of Opteron hardware (we own two boxes) is the only way we can
>   have good support for this chip.
>
>      -mjk
>
>
>   On Dec 8, 2003, at 8:23 PM, purushotham komaravolu wrote:
>
>
>   Cc: <mailto:npaci-rocks-discussion at sdsc.edu>
>   <npaci-rocks-discussion at sdsc.edu>
>
>
>   Hello,
>               I am a newbie to ROCKS cluster. I wanted to setup clusters
>
>   on
>   32-bit Architectures( Intel and AMD) and 64-bit Architecture( Intel
>   and
>   AMD).
>   I found the 64-bit download for Intel on the website but not for AMD.
>   Does
>   it work for AMD opteron? if not what is the ETA for AMD-64.
>   We are planning to buy AMD-64 bit machines shortly, and I would like
>   to
>   volunteer for the beta testing if needed.
>   Thanks
>   Regards,
>   Puru
>
>
>   _______________________________________________
>   npaci-rocks-discussion mailing list
>   npaci-rocks-discussion at sdsc.edu
>   http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
>
>
> End of npaci-rocks-discussion Digest


From fds at sdsc.edu Tue Dec 16 18:04:47 2003
From: fds at sdsc.edu (Federico Sacerdoti)
Date: Tue, 16 Dec 2003 18:04:47 -0800
Subject: [Rocks-Discuss]a minor ganglia question
In-Reply-To: <20031216194554.GH26246@uiuc.edu>
References: <20031216194554.GH26246@uiuc.edu>
Message-ID: <63C818CD-3035-11D8-8652-000393A4725A@sdsc.edu>

Dan,

Good question. Unfortunately this behavior is hardwired into stock
Ganglia, not the Rocks-specific pages that we have more control over.

The good news is that I wrote the code for this page :) It's easy to fix
if you would like to do it yourself.

Edit the file /var/www/html/ganglia/functions.php. On line 386, you
should see:

            krsort($racks[$rack]);

To get the ordering you desire, change this to:

            ksort($racks[$rack]);

That's it. You should see the high-numbered compute nodes at the bottom
of the rack. I will see if we can get a config file option on the page
to offer this choice in a later release of Ganglia.
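
For readers unfamiliar with PHP's array functions: krsort() sorts an array by key in descending order, ksort() in ascending order, which is the whole one-word fix. The effect on the rack display, sketched in Python:

```python
# Nodes keyed by rack position, as in ganglia's $racks[$rack] array.
rack = {12: "compute-0-12", 0: "compute-0-0", 2: "compute-0-2", 1: "compute-0-1"}

# krsort (keys descending): high-numbered nodes are drawn first, i.e. on top.
top_down_krsort = [rack[k] for k in sorted(rack, reverse=True)]

# ksort (keys ascending): compute-0-0 is drawn first, matching Dan's rack.
top_down_ksort = [rack[k] for k in sorted(rack)]

print(top_down_krsort[0], top_down_ksort[0])  # → compute-0-12 compute-0-0
```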

-Federico

On Dec 16, 2003, at 11:45 AM, Dan Wright wrote:

>   Hello all,
>
>   I'm in the process of setting up a 3.0.0 cluster and have a question
>   about the
>   "Physical view" in ganglia. In this view (which is quite cool BTW :)
>   is shows
>   higher-numbered nodes on top and lower-numbered nodes on bottom:
>
>   compute-0-12
>   ...
>   compute-0-2
>   compute-0-1
>   compute-0-0
>
>   and my cluster is physically reversed from that:
>
>   compute-0-0
>   compute-0-1
>   compute-0-2
>   ...
>   compute-0-12
>
> Is there an easy way to switch this display around so it matches the
> real
> physical layout? I poked around and ganglia for a few minutes and
> didn't see
> anything obvious, so I thought I'd ask before I actually start wasting
> time on
> this :)
>
> Thanks,
>
> - Dan Wright
> (dtwright at uiuc.edu)
> (http://www.scs.uiuc.edu/)
> (UNIX Systems Administrator, School of Chemical Sciences, UIUC)
> (333-1728)
>
Federico

Rocks Cluster Group, San Diego Supercomputing Center, CA



From csamuel at vpac.org Tue Dec 16 18:49:22 2003
From: csamuel at vpac.org (Chris Samuel)
Date: Wed, 17 Dec 2003 13:49:22 +1100
Subject: [Rocks-Discuss]a minor ganglia question
In-Reply-To: <20031216194554.GH26246@uiuc.edu>
References: <20031216194554.GH26246@uiuc.edu>
Message-ID: <200312171349.24485.csamuel@vpac.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, 17 Dec 2003 06:45 am, Dan Wright wrote:

> Is there an easy way to switch this display around so it matches the real
> physical layout?

I think this is why they tell you to install the compute nodes from the bottom
of the rack. :-)

cheers,
Chris
- --
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE/38QyO2KABBYQAh8RAo+vAJ0XcP6tBJpwjxYnicEQkysRslWmmQCcDpeb
K8bNCLgiF5umMiJ/59ICN70=
=57YJ
-----END PGP SIGNATURE-----

From hermanns at tupi.dmt.upm.es Wed Dec 17 00:08:19 2003
From: hermanns at tupi.dmt.upm.es (Miguel Hermanns)
Date: Wed, 17 Dec 2003 09:08:19 +0100
Subject: [Rocks-Discuss]Creation of a hardware compatibility list?
Message-ID: <3FE00EF3.4020809@tupi.dmt.upm.es>

Since one of the strong features of Rocks is the possibility of fast
deployment of clusters, wouldn't it be of interest to create a hardware
compatibility list on the web page of Rocks? This list could be filled
in by the users of Rocks with their experience and the hardware they
have. In this way somebody interested in building a cluster as fast as
possible could check the list and buy something absolutely 100%
compatible with Rocks.

I know that in principle one could check the compatibility list of RH,
but my own experience was negative in that aspect (I installed an
Adaptec IDE RAID controller, supported by RH7.3, but Rocks 2.3 was
unable to recognize it).

Miguel




From mjk at sdsc.edu Wed Dec 17 09:03:00 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Wed, 17 Dec 2003 09:03:00 -0800
Subject: [Rocks-Discuss]Creation of a hardware compatibility list?
In-Reply-To: <3FE00EF3.4020809@tupi.dmt.upm.es>
References: <3FE00EF3.4020809@tupi.dmt.upm.es>
Message-ID: <DEEE58E0-30B2-11D8-9543-000A95DA5638@sdsc.edu>

We have thought about this, and have some ideas on how to setup a
useful page. Something like the old Linux laptop hardware list but
simpler to mine for data. It's been on our long list of things to do
for a while now :)

       -mjk

On Dec 17, 2003, at 12:08 AM, Miguel Hermanns wrote:

>   Since one of the strong features of Rocks is the posibility of fast
>   deployment of clusters, wouldn't it be of interest to create a
>   hardware compatibility list on the web page of Rocks? This list could
>   be filled in by the users of Rocks with their experience and the
>   hardware they have. In this way somebody interested in building a
>   cluster as fast as possible could check the list and buy something
>   absolutely 100% compatible with Rocks.
>
>   I know that in principle one could check the compatibility list of RH,
>   but my own experience was negative in that aspect (I installed an
>   Adaptec IDE RAID controller, supported by RH7.3, but Rocks 2.3 was
>   unable to recognize it).
>
>   Miguel
>
From junkscarce at hotmail.com Wed Dec 17 09:31:21 2003
From: junkscarce at hotmail.com (Reed Scarce)
Date: Wed, 17 Dec 2003 17:31:21 +0000
Subject: [Rocks-Discuss]fdisk reports all zeros, need actual
Message-ID: <BAY1-F978XKPl5GDrPi0003db4e@hotmail.com>

Good ol' fdisk "print" on my compute node gives me a line:
Device Boot Start End Blocks Id System

but no data.

Extra Functionality's "print" reports
Nr AF Hd Sec Cyl Hd Sec Cyl Start Size ID
1 00 0      0    0   0   0    0    0      0   0
2 00 0      0    0   0   0    0    0      0   0
3 00 0      0    0   0   0    0    0      0   0
4 00 0      0    0   0   0    0    0      0   0

How can I retrieve the partition information I need for scripting at
node installation time?

TIA
--RRS




From dtwright at uiuc.edu Wed Dec 17 11:49:53 2003
From: dtwright at uiuc.edu (Dan Wright)
Date: Wed, 17 Dec 2003 13:49:53 -0600
Subject: [Rocks-Discuss]a minor ganglia question
In-Reply-To: <200312171349.24485.csamuel@vpac.org>
References: <20031216194554.GH26246@uiuc.edu> <200312171349.24485.csamuel@vpac.org>
Message-ID: <20031217194953.GS26246@uiuc.edu>

Eh...whatever ;-) I started using rocks with 2.2.1 (when there was no
physical layout display) and haven't read the manual again since :)

Chris Samuel said:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Wed, 17 Dec 2003 06:45 am, Dan Wright wrote:
>
> > Is there an easy way to switch this display around so it matches the real
> > physical layout?
>
> I think this is why they tell you to install the compute nodes from the bottom
> of the rack. :-)
>
> cheers,
> Chris
> - --
> Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
> Victorian Partnership for Advanced Computing http://www.vpac.org/
> Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.2.2 (GNU/Linux)
>
> iD8DBQE/38QyO2KABBYQAh8RAo+vAJ0XcP6tBJpwjxYnicEQkysRslWmmQCcDpeb
> K8bNCLgiF5umMiJ/59ICN70=
> =57YJ
> -----END PGP SIGNATURE-----
>
- Dan Wright
(dtwright at uiuc.edu)
(http://www.uiuc.edu/~dtwright)

-] ------------------------------ [-] -------------------------------- [-
``Weave a circle round him thrice, / And close your eyes with holy dread,
  For he on honeydew hath fed, / and drunk the milk of Paradise.''
       Samuel Taylor Coleridge, Kubla Khan

From dtwright at uiuc.edu Wed Dec 17 11:51:00 2003
From: dtwright at uiuc.edu (Dan Wright)
Date: Wed, 17 Dec 2003 13:51:00 -0600
Subject: [Rocks-Discuss]a minor ganglia question
In-Reply-To: <63C818CD-3035-11D8-8652-000393A4725A@sdsc.edu>
References: <20031216194554.GH26246@uiuc.edu>
<63C818CD-3035-11D8-8652-000393A4725A@sdsc.edu>
Message-ID: <20031217195100.GT26246@uiuc.edu>

Federico,

Thanks! That'll make this easy enough... maybe next time I'll read the
manual and install the machines in the rocks-recommended order as another
poster suggested :)

Federico Sacerdoti said:
> Dan,
>
> Good question. Unfortunately this behavior is hardwired into stock
> Ganglia, not the Rocks-specific pages that we have more control over.
>
> The good news is that I wrote the code for this page :) Its easy to fix
> if you would like to do it yourself.
>
> Edit the file /var/www/html/ganglia/functions.php. On line 386, you
> should see:
>
>          krsort($racks[$rack]);
>
> To get the ordering you desire, change this to:
>
>          ksort($racks[$rack]);
>
> Thats it. You should see the high-numbered compute nodes at the bottom
> of the rack. I will see if we can get a config file button on the page
> to give this option for a later release of Ganglia.
>
> -Federico
>
> On Dec 16, 2003, at 11:45 AM, Dan Wright wrote:
>
> >Hello all,
> >
> >I'm in the process of setting up a 3.0.0 cluster and have a question
> >about the
> >"Physical view" in ganglia. In this view (which is quite cool BTW :)
> >is shows
> >higher-numbered nodes on top and lower-numbered nodes on bottom:
> >
> >compute-0-12
> >...
> >compute-0-2
> >compute-0-1
> >compute-0-0
> >
> >and my cluster is physically reversed from that:
> >
> >compute-0-0
> >compute-0-1
> >compute-0-2
> >...
> >compute-0-12
> >
> >Is there an easy way to switch this display around so it matches the
> >real
> >physical layout? I poked around and ganglia for a few minutes and
> >didn't see
> >anything obvious, so I thought I'd ask before I actually start wasting
> >time on
> >this :)
> >
> >Thanks,
> >
> >- Dan Wright
> >(dtwright at uiuc.edu)
> >(http://www.scs.uiuc.edu/)
> >(UNIX Systems Administrator, School of Chemical Sciences, UIUC)
> >(333-1728)
> >
> Federico
>
> Rocks Cluster Group, San Diego Supercomputing Center, CA
>
- Dan Wright
(dtwright at uiuc.edu)
(http://www.uiuc.edu/~dtwright)

-] ------------------------------ [-] -------------------------------- [-
``Weave a circle round him thrice, / And close your eyes with holy dread,
  For he on honeydew hath fed, / and drunk the milk of Paradise.''
       Samuel Taylor Coleridge, Kubla Khan

From bruno at rocksclusters.org Wed Dec 17 12:52:30 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Wed, 17 Dec 2003 12:52:30 -0800
Subject: [Rocks-Discuss]fdisk reports all zeros, need actual
In-Reply-To: <BAY1-F978XKPl5GDrPi0003db4e@hotmail.com>
References: <BAY1-F978XKPl5GDrPi0003db4e@hotmail.com>
Message-ID: <EDF0DAE8-30D2-11D8-B821-000A95C4E3B4@rocksclusters.org>

>   Good ol' fdisk "print" on my compute node give me a line:
>   Device Boot Start End Blocks Id System
>
>   but no data.
>
>   Extra Functionality's "print" reports
>   Nr AF Hd Sec Cyl Hd Sec Cyl Start Size ID
>   1 00 0      0    0   0   0    0    0      0   0
>   2 00 0      0    0   0   0    0    0      0   0
>   3 00 0      0    0   0   0    0    0      0   0
>   4 00 0      0    0   0   0    0    0      0   0
>
>   How can I retrieve the information necessary for scripted information
>   at node installation time?

this should answer your question:

https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-February/001388.html

    - gb



From anand at novaglobal.com.sg Wed Dec 17 20:14:45 2003
From: anand at novaglobal.com.sg (Anand Vaidya)
Date: Wed, 17 Dec 2003 23:14:45 -0500
Subject: [Rocks-Discuss]Creation of a hardware compatibility list?
In-Reply-To: <DEEE58E0-30B2-11D8-9543-000A95DA5638@sdsc.edu>
References: <3FE00EF3.4020809@tupi.dmt.upm.es>
<DEEE58E0-30B2-11D8-9543-000A95DA5638@sdsc.edu>
Message-ID: <200312172314.48434.anand@novaglobal.com.sg>

Why not create a Wiki? A wiki is easy enough to install (60 seconds?) and just
the right tool for user-driven projects like Rocks.

Nice examples of wiki wiki webs are http://en.wikipedia.org/ or even my
favourite GentooServer project, which has a very nice wiki at
http://www.subverted.net/wakka/wakka.php?wakka=MainPage (though not
related to clustering).

Regards,
Anand
On Wednesday 17 December 2003 12:03, Mason J. Katz wrote:
> We have thought about this, and have some ideas on how to setup a
> useful page. Something like the old Linux laptop hardware list but
> simpler to mine for data. It's been on our long list of things to do
> for a while now :)
>
>     -mjk
>
> On Dec 17, 2003, at 12:08 AM, Miguel Hermanns wrote:
> > Since one of the strong features of Rocks is the posibility of fast
> > deployment of clusters, wouldn't it be of interest to create a
> > hardware compatibility list on the web page of Rocks? This list could
> > be filled in by the users of Rocks with their experience and the
> > hardware they have. In this way somebody interested in building a
> > cluster as fast as possible could check the list and buy something
> > absolutely 100% compatible with Rocks.
> >
> > I know that in principle one could check the compatibility list of RH,
> > but my own experience was negative in that aspect (I installed an
> > Adaptec IDE RAID controller, supported by RH7.3, but Rocks 2.3 was
> > unable to recognize it).
> >
> > Miguel

-



From mjk at sdsc.edu Thu Dec 18 08:02:14 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Thu, 18 Dec 2003 08:02:14 -0800
Subject: [Rocks-Discuss]Creation of a hardware compatibility list?
In-Reply-To: <200312172314.48434.anand@novaglobal.com.sg>
References: <3FE00EF3.4020809@tupi.dmt.upm.es>
<DEEE58E0-30B2-11D8-9543-000A95DA5638@sdsc.edu>
<200312172314.48434.anand@novaglobal.com.sg>
Message-ID: <8BA1598E-3173-11D8-9543-000A95DA5638@sdsc.edu>

I've been thinking about a rocks wiki for a few months now, but I'm a
bit paranoid about the lack of authentication for updates (basically
anyone can modify your site).

If there is interest out there, we could just set one up, leave it
alone, and let our users worry about the content. Done well this could
have information on:

     -   hardware issues
     -   bug reports
     -   feature requests
     -   contributed documentation (to be moved into our users manual)
     -   etc

Basically a simple version of sourceforge (we have no plans to move to
sourceforge -- the interface and bandwidth both stink). Ideas....?

     -mjk

On Dec 17, 2003, at 8:14 PM, Anand Vaidya wrote:
> Why not create a Wiki? Wiki is easy enough to install (60seconds?) and
> just
> the right tool for user-driven projects like Rocks.
>
> Nice example of wiki wiki webs are http://en.wikipedia.org/ or even my
> favourite GentooServer project has a very nice wiki at http://
> www.subverted.net/wakka/wakka.php?wakka=MainPage (Though not related to
> clustering)
>
> Regards,
> Anand
>
> On Wednesday 17 December 2003 12:03, Mason J. Katz wrote:
>> We have thought about this, and have some ideas on how to setup a
>> useful page. Something like the old Linux laptop hardware list but
>> simpler to mine for data. It's been on our long list of things to do
>> for a while now :)
>>
>>    -mjk
>>
>> On Dec 17, 2003, at 12:08 AM, Miguel Hermanns wrote:
>>> Since one of the strong features of Rocks is the posibility of fast
>>> deployment of clusters, wouldn't it be of interest to create a
>>> hardware compatibility list on the web page of Rocks? This list could
>>> be filled in by the users of Rocks with their experience and the
>>> hardware they have. In this way somebody interested in building a
>>> cluster as fast as possible could check the list and buy something
>>> absolutely 100% compatible with Rocks.
>>>
>>> I know that in principle one could check the compatibility list of
>>> RH,
>>> but my own experience was negative in that aspect (I installed an
>>> Adaptec IDE RAID controller, supported by RH7.3, but Rocks 2.3 was
>>> unable to recognize it).
>>>
>>> Miguel
>
> -



From hermanns at tupi.dmt.upm.es Fri Dec 19 00:47:11 2003
From: hermanns at tupi.dmt.upm.es (Miguel Hermanns)
Date: Fri, 19 Dec 2003 09:47:11 +0100
Subject: [Rocks-Discuss]Creation of a hardware compatibility list?
Message-ID: <3FE2BB0F.4060908@tupi.dmt.upm.es>

> I've been thinking about a rocks wiki for a few months now, but I'm a
> bit paranoid about the lack of authentication for updates (basically
> anyone can modify your site).

One possible filter could be that only the users of the registered
clusters can modify the wiki (so that when you submit the data of the
cluster you also include a user and a password), although in that case I
would be excluded, since our cluster has been unable to run Rocks
yet :-(.

>>     -   hardware issues
>>     -   bugs reports
>>     -   feature requests
>>     -   contributed documentation (to be moved into our users manual)
>>     -   etc

So, for example, the cluster register could be editable by the registered
users (each one editing only its own entry) and could include a
description of the installed hardware (not just the processor, but also
the motherboard model, the hard disks, NICs, etc.). Anybody interested
in building a cluster could go to the register, have a look, and click
on the clusters that are similar to the one in mind. With just another
click the user could review the hardware configuration and the problems
encountered.

This would also be great when Rocks clusters get updated, because their
builders could go and update their entry without needing to submit an
email to the Rocks team, hence avoiding giving them extra work.

To include the not-yet-working Rocks clusters, the database of clusters
(with the corresponding users and passwords) could be extended with
them, but their entries would not be shown on the Rocks register until
they are fully working. In this way information on hardware
incompatibilities can be collected and shown on a different part of
www.rocksclusters.org.

The feature requests would still be handled through the mailing list,
and for the contributed documentation I would place the source files in
read-only mode on the FTP server; if somebody makes modifications to
them, the new version should be emailed to the people in charge of the
docs for their approval.

Miguel



From jkreuzig at uci.edu Fri Dec 19 16:58:58 2003
From: jkreuzig at uci.edu (James Kreuziger)
Date: Fri, 19 Dec 2003 16:58:58 -0800 (PST)
Subject: [Rocks-Discuss]Dell Power Connect 5224
In-Reply-To: <1062015636.6781.100.camel@babylon.physics.ncsu.edu>
References: <1062015636.6781.100.camel@babylon.physics.ncsu.edu>
Message-ID: <Pine.GSO.4.58.0312191642260.19504@massun.ucicom.uci.edu>

Ok, I need some help here. I've managed to setup
my frontend node, and it is up and running. I have
my 8 nodes all connected up to a Dell Power Connect 5224.
I can access the switch through a serial terminal and
get a command line interface. The little lights on the
front of the switch are blinking, so that's good.

However, I can't get the switch recognized by insert-ethers.
I've even managed to change the IP of the switch through
the CLI, but I can't see the switch from the frontend node.
I can't telnet, get the web interface or anything. I haven't
saved the configuration, so a reboot of the switch will
reset the values.

I'm grasping at straws here. I'm not a network engineer,
so I could use some help getting this thing configured.
If anybody can help me out, contact me by email.

Thanks,

-Jim

*************************************************
Jim Kreuziger
jkreuzig at uci.edu
949-824-4474
*************************************************




From tim.carlson at pnl.gov Fri Dec 19 17:24:22 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Fri, 19 Dec 2003 17:24:22 -0800 (PST)
Subject: [Rocks-Discuss]Dell Power Connect 5224
In-Reply-To: <Pine.GSO.4.58.0312191642260.19504@massun.ucicom.uci.edu>
Message-ID: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov>

On Fri, 19 Dec 2003, James Kreuziger wrote:

I think we need a Rocks FAQ

https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-August/002762.html

You need to turn on fast-link.

>   Ok, I need some help here. I've managed to setup
>   my frontend node, and it is up and running. I have
>   my 8 nodes all connected up to a Dell Power Connect 5224.
>   I can access the switch through a serial terminal and
>   get a command line interface. The little lights on the
>   front of the switch are blinking, so that's good.
>
>   However, I can't get the switch recognized by insert-ethers.
>   I've even managed to change the IP of the switch through
>   the CLI, but I can't see the switch from the frontend node.
>   I can't telnet, get the web interface or anything. I haven't
>   saved the configuration, so a reboot of the switch will
>   reset the values.
>
>   I'm grasping at straws here. I'm not a network engineer,
>   so I could use some help getting this thing configured.
>
>   If anybody can help me out, contact me by email.
>
>   Thanks,
>
>   -Jim
>
>   *************************************************
>   Jim Kreuziger
>   jkreuzig at uci.edu
>   949-824-4474
>   *************************************************
>
>
>

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support



From Georgi.Kostov at umich.edu Fri Dec 19 17:34:15 2003
From: Georgi.Kostov at umich.edu (Georgi Kostov)
Date: Fri, 19 Dec 2003 20:34:15 -0500
Subject: [Rocks-Discuss]Dell Power Connect 5224
In-Reply-To: <Pine.GSO.4.58.0312191642260.19504@massun.ucicom.uci.edu>
References: <1062015636.6781.100.camel@babylon.physics.ncsu.edu>
<Pine.GSO.4.58.0312191642260.19504@massun.ucicom.uci.edu>
Message-ID: <1071884055.3fe3a717b3efc@carrierpigeon.mail.umich.edu>

Jim,

I have a 5224 here. What are your config settings on the switch? I.e. IP,
sub-net mask, gateway settings - for both the switch and the interface of the
head-node on which the 5224 is connected (I assume it's on the private subnet,
so the subnet is something like 10.0.0.0/255.0.0.0 with the frontend internal
interface (eth0) as 10.0.1.1, right?)

One thing to try on the head node is to run (as root) "tcpdump -i eth0" and
watch for packets. To avoid clutter, I would either turn the rest (compute
nodes, etc.) off, or filter them out with tcpdump's filter expressions.
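That tcpdump suggestion might look like this in practice (10.1.1.1 is a
made-up switch address and eth0 is the usual Rocks private interface;
substitute your own values):

```shell
# Listen on the private interface; -n suppresses name lookups.
# Watch only traffic to or from the suspected switch address:
tcpdump -i eth0 -n host 10.1.1.1

# Or, with the compute nodes still running, filter their chatter out,
# e.g. ignore ssh and ganglia (port 8649) traffic:
tcpdump -i eth0 -n not port 22 and not port 8649
```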

With some more info we should be able to tease this out.

--Georgi

Michigan Center for Biological Information (MCBI)
University of Michigan
3600 Green Court, Suite 700
Ann Arbor, MI 48105-1570
Phone/Fax: (734) 998-9236/8571
kostov at umich.edu
www.ctaalliance.org



Quoting James Kreuziger <jkreuzig at uci.edu>:

>   Ok, I need some help here. I've managed to setup
>   my frontend node, and it is up and running. I have
>   my 8 nodes all connected up to a Dell Power Connect 5224.
>   I can access the switch through a serial terminal and
>   get a command line interface. The little lights on the
>   front of the switch are blinking, so that's good.
>
>   However, I can't get the switch recognized by insert-ethers.
>   I've even managed to change the IP of the switch through
>   the CLI, but I can't see the switch from the frontend node.
>   I can't telnet, get the web interface or anything. I haven't
>   saved the configuration, so a reboot of the switch will
>   reset the values.
>
>   I'm grasping at straws here. I'm not a network engineer,
>   so I could use some help getting this thing configured.
>
>   If anybody can help me out, contact me by email.
>
>   Thanks,
>
>   -Jim
>
>   *************************************************
>   Jim Kreuziger
>   jkreuzig at uci.edu
>   949-824-4474
>   *************************************************
>
>
>


From daniel.kidger at quadrics.com Mon Dec 22 01:45:47 2003
From: daniel.kidger at quadrics.com (Dan Kidger)
Date: Mon, 22 Dec 2003 09:45:47 +0000
Subject: Fwd: Re: [Rocks-Discuss]Dell Power Connect 5224
Message-ID: <200312220945.47665.daniel.kidger@quadrics.com>


----------    Forwarded Message   ----------

Subject: Re: [Rocks-Discuss]Dell Power Connect 5224
Date: Mon, 22 Dec 2003 09:38:41 +0000
From: Dan Kidger <daniel.kidger at quadrics.com>
To: Georgi Kostov <Georgi.Kostov at umich.edu>
Cc: paci-rocks-discussion at sdsc.edu

>   Quoting James Kreuziger <jkreuzig at uci.edu>:
>   > Ok, I need some help here. I've managed to setup
>   > my frontend node, and it is up and running. I have
>   > my 8 nodes all connected up to a Dell Power Connect 5224.
>   > I can access the switch through a serial terminal and
>   > get a command line interface. The little lights on the
>   > front of the switch are blinking, so that's good.
>   >
>   > However, I can't get the switch recognized by insert-ethers.
>   > I've even managed to change the IP of the switch through
>   > the CLI, but I can't see the switch from the frontend node.
>   > I can't telnet, get the web interface or anything. I haven't
>   > saved the configuration, so a reboot of the switch will
>   > reset the values.

I don't know much about the 5224 per se, but I do know that much of the time
embedded devices *have* to be rebooted to pick up a new IP setting.

Once done, I would try pinging the switch's IP and then doing 'arp -a' to
see its MAC address (which should match the one on the white sticky label on
the back).
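That check could be sketched as follows (10.1.1.1 is an example address;
use whatever IP you gave the switch):

```shell
# Ping the switch so the frontend learns its MAC, then read it back
# out of the ARP cache:
ping -c 3 10.1.1.1
arp -a | grep '10.1.1.1'
# The MAC address shown should match the sticker on the switch.
```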
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------

-------------------------------------------------------

--
Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------




From daniel.kidger at quadrics.com Mon Dec 22 09:03:56 2003
From: daniel.kidger at quadrics.com (daniel.kidger at quadrics.com)
Date: Mon, 22 Dec 2003 17:03:56 -0000
Subject: [Rocks-Discuss]RE:Writing a Roll ?
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F622357D0@tardis0.quadrics.com>

Folks,
   I have made good headway in adding software and its configuration using
extend-compute.xml, and now have a robust system. (The head node install is
still rather manual though :-( )

I would now like to move to doing this as a Roll. However, I am not sure of
the best way of proceeding - there appears to be little documentation,
either HOWTO-style or on the underlying concepts.

I have mounted the HPC_roll.iso and browsed around:
 - the image seems to consist of 2 subdirectories, in the same style as
RedHat CDs
 - as expected, ./SRPMS contains the source RPMs, and ./RedHat/RPMS contains
binary RPMs
     (the latter contains many more RPMs than there are SRPMs for)

There is no obvious configuration information until you explore:
  roll-hpc-kickstart-3.0.0-0.noarch.rpm
This seems to contain lots of XML which at first glance is hard to decipher.

So my question is:
   Should we be writing our own rolls, and if so how ? (examples?)


Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------


From daniel.kidger at quadrics.com Mon Dec 22 09:08:21 2003
From: daniel.kidger at quadrics.com (daniel.kidger at quadrics.com)
Date: Mon, 22 Dec 2003 17:08:21 -0000
Subject: [Rocks-Discuss]shucks.
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F622461C9@tardis0.quadrics.com>

# rpm -ql roll-hpc-kickstart |xargs -l grep -inH sucks

/export/home/install/profiles/current/nodes/force-smp.xml:21: IBM sucks
/export/home/install/profiles/current/nodes/ganglia-server.xml:134: perl sucks
/export/home/install/profiles/current/nodes/ganglia-server.xml:148: Switch from
ISC to RedHat's pump. Pump sucks but it is standard so
/export/home/install/profiles/current/nodes/sendmail-masq.xml:31: m4 sucks

:-)

Have a good Christmas,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------



From fds at sdsc.edu Mon Dec 22 10:22:54 2003
From: fds at sdsc.edu (Federico Sacerdoti)
Date: Mon, 22 Dec 2003 10:22:54 -0800
Subject: [Rocks-Discuss]RE:Writing a Roll ?
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F622357D0@tardis0.quadrics.com>
References: <30062B7EA51A9045B9F605FAAC1B4F622357D0@tardis0.quadrics.com>
Message-ID: <DBF30128-34AB-11D8-8652-000393A4725A@sdsc.edu>

You are right, we have little documentation on creating new rolls. I
have lamented to Greg about this, and he has done the same to me.
Basically we have been so busy trying to get the 3.1.0 release out that
we haven't put our nose to the grindstone about the Developer docs.

Here is a little primer since it sounds like you are indeed ready.

1. The first thing to realize is that rolls are not built from
"scratch", but from the safe confines of our build environment. This
environment is the directory:

[your local rocks CVS sandbox]/src/roll/

You must check out the Rocks CVS tree to get this. Instructions on how
to do this (anonymously) are at http://cvs.rocksclusters.org/.

Once you have this build environment on your frontend system, you are
ready for the next step of building your roll. You should make a new
directory here called "quadrics" - the name matters, as it will be the
identifier for your roll from now on.
2. Now the best thing I can tell you is to look at the "hpc" and "sge"
rolls (two of our most mature) for the directory structure in
"quadrics". It's fairly straightforward, and mirrors what we do for the
base. The "nodes" directory will hold your "extend-compute.xml", etc.
(more on this later). The "roll-quadrics-kickstart.noarch.rpm" is made
automatically for you from information in these directories.

3. The "src" dir holds anything you need to compile. Anything in src
should deposit an RPM package in the "RPMS" directory when its build is
finished.

4. You type "make roll" to start the build process. It will take a bit
of study for you to get things correct, but suffice it to say that you
will have an ISO file suitable for burning when you are done. Thank
Bruno for this sweet fact - everything is automatic except your
intellectual property :)
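Put together, the skeleton those steps describe would look something like
this (directory names inferred from the description of the hpc and sge
rolls above; treat this as a sketch, not gospel):

```shell
# In your Rocks CVS sandbox, under src/roll/, lay out a new roll
# named "quadrics" (the name becomes the roll's identifier):
cd src/roll
mkdir quadrics
mkdir quadrics/nodes                  # extend-compute.xml, quadrics.xml, ...
mkdir -p quadrics/graphs/default      # edges into the kickstart graph
mkdir quadrics/src                    # sources to compile...
mkdir quadrics/RPMS                   # ...whose builds deposit RPMs here
# Then "make roll" from inside quadrics/ drives the build to an ISO.
```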

One more word on your XML files. Our philosophy of rolls is not to use
the "extend/replace" strategy that we advocate for customization. As a
roll builder, you are at the grass-roots level, and can rise above
simple customization techniques.

Your roll should define a "quadrics.xml" node in the kickstart graph.
You define the node in the file "roll/quadrics/nodes/quadrics.xml" and
the edges in the file "roll/quadrics/graphs/default/quadrics.xml". Look
at the SGE roll for a good example of this. By defining your
configuration this way, you have more power to do complex tasks
(different configuration for different appliance types), and to leave
room for future growth.
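For instance, the graph-side file might contain little more than a single
edge. This is a guessed sketch modeled on the pattern described above; the
file path and the "compute" attachment point are assumptions, not verified
against a real roll:

```xml
<?xml version="1.0" standalone="no"?>
<!-- roll/quadrics/graphs/default/quadrics.xml (hypothetical sketch) -->
<graph>
        <!-- hang the quadrics configuration node off the compute appliance -->
        <edge from="compute" to="quadrics"/>
</graph>
```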

Good luck, and we hope and pray for a good technical writer that will
do this process justice.

-Federico

On Dec 22, 2003, at 9:03 AM, daniel.kidger at quadrics.com wrote:

>   Folks,
>      I have made good headway in adding software and its configuration
>   using extend-compute.xml and now have a robust system. (the head node
>   install is still rather manual though :-( )
>
>   I would now like to move to doing this as a Roll. However I am not
>   sureof the best way of proceeding - there appears to be little
>   documentation - either on HOWTO or on the underlying concepts.
>
>   I have mounted the HPC_roll.iso and   browsed around:
>    - the image seems to consists of 2   subdirectories - in the same style
>   as RedHat CD's
>    - as expected ./SRPMS contains the   source RPMs, and ./RedHat/RPMS
>   contains binary RPMs
>       ( the latter contains many more   RPMs than there is an SRPM for. )
>
>   There is no obvious configuration information until you explore:
>     roll-hpc-kickstart-3.0.0-0.noarch.rpm
>   This seems to contain lots of XML which at first glance is hard to
>   decifer.
>
> So my question is:
>    Should we be writing our own rolls, and if so how ? (examples?)
>
>
> Yours,
> Daniel.
>
> --------------------------------------------------------------
> Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
> One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
> ----------------------- www.quadrics.com --------------------
>
Federico

Rocks Cluster Group, San Diego Supercomputing Center, CA



From mjk at sdsc.edu Mon Dec 22 11:07:32 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Mon, 22 Dec 2003 11:07:32 -0800
Subject: [Rocks-Discuss]shucks.
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F622461C9@tardis0.quadrics.com>
References: <30062B7EA51A9045B9F605FAAC1B4F622461C9@tardis0.quadrics.com>
Message-ID: <18168448-34B2-11D8-8AD9-000A95DA5638@sdsc.edu>

If these are the worst CVS log comments you've found you aren't looking
very hard. The only one here I'm compelled to clarify is IBM. There
are around 3-5 ways of probing the chipset to determine if the box is
SMP, RedHat supports the most common ones which everyone in the world
except IBM use. This forced us to patch anaconda to detect SMP for IBM
hardware (or in this case just force it) -- didn't these guys invent
the PC?

          -mjk

On Dec 22, 2003, at 9:08 AM, daniel.kidger at quadrics.com wrote:

>
>   # rpm -ql roll-hpc-kickstart |xargs -l grep -inH sucks
>
>   /export/home/install/profiles/current/nodes/force-smp.xml:21: IBM
>   sucks
>   /export/home/install/profiles/current/nodes/ganglia-server.xml:134:
>   perl sucks
>   /export/home/install/profiles/current/nodes/ganglia-server.xml:148:
>   Switch from ISC to RedHat's pump. Pump sucks but it is standard so
>   /export/home/install/profiles/current/nodes/sendmail-masq.xml:31: m4
>   sucks
>
>   :-)
>
>   Have a good Christmas,
>   Daniel.
>
>   --------------------------------------------------------------
>   Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
> One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
> ----------------------- www.quadrics.com --------------------



From mjk at sdsc.edu Mon Dec 22 11:13:30 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Mon, 22 Dec 2003 11:13:30 -0800
Subject: [Rocks-Discuss]RE:Writing a Roll ?
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F622357D0@tardis0.quadrics.com>
References: <30062B7EA51A9045B9F605FAAC1B4F622357D0@tardis0.quadrics.com>
Message-ID: <EDBC4D7D-34B2-11D8-9250-000A95DA5638@sdsc.edu>

http://cvs.rocksclusters.org

In the rocks/src/roll directory you can see several roll examples, all
of which are built by typing "make roll". The
roll-*-kickstart.*.noarch.rpm is the real magic that includes the XML
profiles that are grafted onto the base kickstart graph.

     -mjk

On Dec 22, 2003, at 9:03 AM, daniel.kidger at quadrics.com wrote:

> Folks,
>    I have made good headway in adding software and its configuration
> using extend-compute.xml and now have a robust system. (the head node
> install is still rather manual though :-( )
>
> I would now like to move to doing this as a Roll. However I am not
> sureof the best way of proceeding - there appears to be little
> documentation - either on HOWTO or on the underlying concepts.
>
> I have mounted the HPC_roll.iso and browsed around:
> - the image seems to consists of 2 subdirectories - in the same style
> as RedHat CD's
> - as expected ./SRPMS contains the source RPMs, and ./RedHat/RPMS
> contains binary RPMs
>     ( the latter contains many more RPMs than there is an SRPM for. )
>
> There is no obvious configuration information until you explore:
>   roll-hpc-kickstart-3.0.0-0.noarch.rpm
> This seems to contain lots of XML which at first glance is hard to
> decifer.
>
> So my question is:
>    Should we be writing our own rolls, and if so how ? (examples?)
>
>
> Yours,
> Daniel.
>
> --------------------------------------------------------------
> Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
> One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
> ----------------------- www.quadrics.com --------------------
>
>>
From daniel.kidger at quadrics.com Mon Dec 22 11:12:17 2003
From: daniel.kidger at quadrics.com (daniel.kidger at quadrics.com)
Date: Mon, 22 Dec 2003 19:12:17 -0000
Subject: [Rocks-Discuss]RE:Writing a Roll ?
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F622357D1@tardis0.quadrics.com>

Federico,

> Here is a little primer since it sounds like you are indeed ready.
> --- many very informative lines deleted ---

   Thanks for that long reply. :-)
I am currently pulling a copy of the source tree from cvs.rocksclusters.org
(194MB of rocks/doc alone!)


Just a couple of questions for now:
 1. Do rolls have to be CD based?
   (during development I would probably get through a lot of CDROMs - but more
importantly it would get a bit fiddly to keep walking round to the CD-writer
and then nipping off to the room with the cluster in every time)

 2. Do I have to reinstall the headnode from scratch each time I want to test a
roll ?
(even if the roll only affects RPMs that get installed on compute nodes)

 3. Can a CD contain multiple rolls?
    (Once mature, a cluster may have quite a few rolls: pbs, sge, gm, IB, etc.
    Quadrics would probably have two - the (open-source) hardware
drivers, MPI, etc., and also RMS - our (closed-source) cluster Resource Manager.)

 4. What subset of the cvs tree does a Roll developer need? The whole tree is
clearly rather excessive.

  5. I am a little concerned about the amount of bloat needed to install our five
RPMs as a Roll. (The RPMs are already prebuilt by our own internal build
procedures.)
So taking another case - let's say the Intel Compilers - these have 4 RPMs (plus a
little sed-ery of their config files and pasting in the license file). Would these
be best installed as a Roll, or as a simple extend-compute.xml as I have currently?

Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------



From sjenks at uci.edu Mon Dec 22 11:17:07 2003
From: sjenks at uci.edu (Stephen Jenks)
Date: Mon, 22 Dec 2003 11:17:07 -0800
Subject: [Rocks-Discuss]rocks-dist suggestion
Message-ID: <6F2FB100-34B3-11D8-88FD-000A95B96C68@uci.edu>
Hi ROCKS folks,

Just a suggestion for when you guys are bored after the 3.1 release 8-)

I ran into some trouble installing some updates to a ROCKS 3.0 cluster
that could easily be solved with some checking in rocks-dist:

I put the openssh and other updates in the proper contrib directory
under /home/install and ran "rocks-dist dist", which properly updated the
distribution.

The problem occurred when I tried to reload the compute nodes - the
install failed when it hit any of the RPMs in the contrib directory. It
turns out the permissions on those RPMs were set to 600 because I had
copied them out of root's home directory, so they couldn't be read by
the server to send them down to the compute nodes. After fixing the
permissions, all was well.

So rocks-dist should check (and possibly fix) permissions on files that
will be included in the kickstart distribution. I realize that the
mistake was entirely mine, but I'm probably not the only one to ever
forget to set permissions correctly and the tool could easily catch
such mistakes.
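Until rocks-dist grows such a check, something like this catches the
problem up front (the contrib location under /home/install is as described
above; adjust the path to your layout):

```shell
# List contributed RPMs that are not world-readable (the server
# cannot send these down to the compute nodes):
find /home/install/contrib -name '*.rpm' ! -perm -o+r -print

# ...and open them up:
find /home/install/contrib -name '*.rpm' ! -perm -o+r -exec chmod o+r {} +
```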

Thanks for putting together such a useful cluster distribution!

Steve Jenks



From msherman at informaticscenter.info Mon Dec 22 11:50:03 2003
From: msherman at informaticscenter.info (Mark Sherman)
Date: Mon, 22 Dec 2003 12:50:03 -0700
Subject: [Rocks-Discuss]MPI and memory + node rescue
Message-ID: <20031222195003.7688.qmail@webmail4.mesa1.secureserver.net>

just for future consideration...
any time I need to look at a system without booting it or it's ability to boot I
just throw in the knoppix cd.
www.knoppix.org
______________________________________________
Mark Sherman
Computing Systems Administrator
Informatics Center
Massachusetts Biomedical Initiatives
Worcester MA 01605
508-797-4200
msherman at informaticscenter.info
----------------------~-----------------------


>   -------- Original Message --------
>   Subject: Re: [Rocks-Discuss]MPI and memory + node rescue
>   From: "Trond SAUE" <saue at quantix.u-strasbg.fr>
>   Date: Thu, November 27, 2003 1:38 am
>   To: "Stephen P. Lebeau" <lebeau at openbiosystems.com>
>   Cc: npaci-rocks-discussion at sdsc.edu
>
>   On 2003.11.26 16:52, Stephen P. Lebeau wrote:
>   > If you go here, they talk about creating a Linux floppy
>   > repair disk. Make sure to read the README file... they
>   > require that you make a 1.68MB floppy ( README explains how )
>   >
>   > http://www.tux.org/pub/people/kent-robotti/looplinux/rip/
>   >
>   > If that doesn't work...
>   >
>   > http://www.toms.net/rb/download.html
>   >
>   > I've actually used this one before.
>   >
>   > -S
>   >
>   In order to have a look at the disk of my crashed node, I downloaded
>   RIP-2.2-1680.bin from the first site, but I was not able to boot
>   properly. However, tomsrtbt-2.0.103 from the second site worked very
>   well and allowed me to reboot the node as well as mount its disk to
>   look at messages. Unfortunately, they did not really tell me anything
>   more...However, it might be an idea for a future release of ROCKS to
>   include a second "standalone" boot option for the computer nodes, so
>   that one can access them independent of the frontend....
>        All the best,
>            Trond Saue
>   --
>   Trond SAUE                                (DIRAC:
>   http://dirac.chem.sdu.dk/)
>   Laboratoire de Chimie Quantique et Mod?lisation Mol?culaire
>   Universite Louis Pasteur ; 4, rue Blaise Pascal ; F-67000 STRASBOURG
>   t?l: 03 90 24 13 01   fax: 03 90 24 15 89   email: saue at quantix.u-
>   strasbg.fr


From daniel.kidger at quadrics.com Mon Dec 22 11:51:16 2003
From: daniel.kidger at quadrics.com (daniel.kidger at quadrics.com)
Date: Mon, 22 Dec 2003 19:51:16 -0000
Subject: [Rocks-Discuss]rocks-dist suggestion
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F622461CD@tardis0.quadrics.com>

> Just a suggestion for when you guys are bored after the 3.1
> release 8-)

>   The problem occurred when I tried to reload the computed nodes - the
>   install failed when it hit any of the RPMs in the contrib
>   directory. It
>   turns out the protections on those RPMs was set to 600 because I had
>   copied them out of root's home directory, thus they couldn't
>   be read by
>   the server to send them down to the compute nodes. After fixing the
>   permissions, all was well.

This is a 'me-too' reply.

Rocks reads the RPMs over http, hence they need to be readable by the apache
user. With symlinks it is all too easy, even when the RPMs themselves are 644,
for the directory tree to sit somewhere not walkable by a third-party userid
like apache.
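One way to spot that case: every directory component above the RPMs needs
world-execute so the apache user can walk down to them. A quick sketch
(the paths are examples from this thread):

```shell
# Find directories under the tree that a third-party uid cannot enter:
find /home/install -type d ! -perm -o+x -print

# util-linux's namei shows the permissions of every component of one path:
namei -m /home/install/contrib
```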


Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------



From fds at sdsc.edu Mon Dec 22 15:26:01 2003
From: fds at sdsc.edu (Federico Sacerdoti)
Date: Mon, 22 Dec 2003 15:26:01 -0800
Subject: [Rocks-Discuss]RE:Writing a Roll ?
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F622357D1@tardis0.quadrics.com>
References: <30062B7EA51A9045B9F605FAAC1B4F622357D1@tardis0.quadrics.com>
Message-ID: <34B2A95C-34D6-11D8-8652-000393A4725A@sdsc.edu>

On Dec 22, 2003, at 11:12 AM, daniel.kidger at quadrics.com wrote:

> Federico,
>
>> Here is a little primer since it sounds like you are indeed ready.
>> --- many very informative lines deleted ---
>
>
> Just a couple of questions for now:
> 1. Do rolls have to be CD based ?
>    (during development I would probably get through a lot of CDROMs -
> but more importantly it would get a bit fiddly
> - to be keep walking round to the CD-writer - then nipping of to the
> room with the cluster in every time)
>
For distribution, the rolls should probably be CD based. For
development, however, that is not necessary. There is a make target
which will compile your source and "install" the roll into your local
distribution. This is "make intodist", and it assumes you are building on
a frontend node. You would follow this with a call to "rocks-dist
dist" in the "/home/install" directory.

Of course, this makes most sense for rolls that affect compute nodes.
To test parts of your roll that affect frontend functionality, you
still need to use the CDs.
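So a compute-node-only development cycle, as described, would look roughly
like this (the roll name "quadrics" is an example carried over from earlier
in the thread; paths relative to your CVS sandbox):

```shell
# On the frontend, from your roll's directory in the CVS sandbox:
cd src/roll/quadrics
make intodist            # build and drop the roll into the local distribution

# Then rebuild the distribution so reinstalled compute nodes pick it up:
cd /home/install
rocks-dist dist
```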

> 2. Do I have to reinstall the headnode from scratch each time I want
> to test a roll ?
> (even if the roll only affects RPMs that get installed on compute
> nodes)

See comment above. We're working on a way to fully install frontends
over the network, but it will not make it into the new release.

>
> 3. Can a CD contain multiple rolls?
>     (Once mature - a cluster may have quite a few rolls: pbs, sge, gm,
> IB, etc.
>     and Quadrics would proably have two - the (open-source) hardware
> drivers,MPI,etc and also RMS - our (closed-source) cluster Resource
> Manager.)

There is some support for this; we call them "Metarolls", and we know they
are important. The build process for them is a bit different, and won't
arrive for this release, but soon after.

> 4. What subset of the cvs tree does a Roll developer need? The whole
> tree is clearly rather excessive.
>
There are definitely areas of the tree not necessary for roll building.
It's always safest to have everything, but you're welcome to crop and
test.

>     5. I am a little concerned about the amount of bloat needed to
>   install our five RPMs as a Roll.(The RPMs are already prebuilt by our
>   own internal build proceedures).
>   So taking another case - lets say the Intel Compilers - These have 4
>   RPMs (plus a little sed-ery of their config files and pasting in the
>   license file). Would these be best installed as a Roll or as a simple
>   extend-compute.xml as I have currently?

It is better to put them in a roll. We have ways to combine,
distribute, sort, etc. these rolls, and they form a nice capsule of
software to introduce into the system. I understand that pulling the
whole source tree seems a bit excessive, but it is rather standard
practice for working on an open project.

Plus only the developer needs the source, the consumer does not.

Good luck, and we're glad someone is asking the questions. Rolls are
intended for outside construction, and we need to document the process.
:)

-Federico

>
> Yours,
> Daniel.
>
> --------------------------------------------------------------
> Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
> One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
> ----------------------- www.quadrics.com --------------------
>
>
Federico

Rocks Cluster Group, San Diego Supercomputing Center, CA



From tlinden at pcu.helsinki.fi Tue Dec 23 05:28:35 2003
From: tlinden at pcu.helsinki.fi (=?ISO-8859-15?Q?Tomas_Lind=E9n?=)
Date: Tue, 23 Dec 2003 15:28:35 +0200 (EET)
Subject: [Rocks-Discuss]Lost nodes during cluster-kickstart?
Message-ID: <Pine.OSF.4.58.0312231440260.353431@rock.helsinki.fi>

To reinstall a cluster I use the command
cluster-fork /boot/kickstart/cluster-kickstart
Now since all 32 nodes have been PXE installed this means that the
reinstallation is performed by first doing a PXE-boot to load the
installation kernel. My problem is that sometimes a few nodes fail
during this reinstallation process. The failing nodes seem to be different
whenever this problem occurs. The really strange thing is that after
more than a day or so some nodes somehow manage to finish the
reinstallation process!

Sometimes the whole cluster comes up fine without any lost node.

The problematic nodes _seem_ to get the installation kernel with PXE, so
it might be not a PXE problem but something odd that happens later?

Has anyone seen anything like this before?

I'm aware of a bug in the RedHat installation kernel
on Athlon systems when trying to run with a serial console.
  https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001988.html
This is why I run the installation kernel without a serial console, but
this makes debugging difficult because the serial console only shows
output during the PXE boot process. No output is generated by the
installation kernel itself. The next output is generated when
the node has finished the installation and loads the final kernel which
runs fine with a serial console.

This is using Rocks 2.3.2 on a 32 node cluster with Tyan Tiger MPX
S2466N-4M motherboards and dual Athlon MP CPUs with no graphics
adapters, so the system has a 32 port serial console switch. The
motherboards have integrated 100 Mb/s 3Com 3C920 NICs (in practice a
3C905 NIC). The switch is made by Enterasys. The frontend private NIC is
also running at 100 Mb/s. When doing the cluster reinstallation the
network bandwidth over the frontend NIC saturates at 12.5 MB/s. Maybe
some packets are lost because of this?

The frontend's private ethernet connection will be upgraded to 1 Gb/s.
Hopefully this will solve the reinstallation problem.

Do you have any other ideas how to solve this problem?

Best regards,                               Tomas Lindén
--------------------------------------------------------------------------
I           ,                                                            I
I Tomas Linden                   Helsinki Institute of Physics (HIP)     I
I Tomas.Linden at Helsinki.FI       P.O. Box 64 (Gustaf Hällströmin katu 2) I
I phone: +358-9-191 505 63       FIN-00014 UNIVERSITY OF HELSINKI        I
I fax:   +358-9-191 505 53       Finland                                 I
I WWW: http://www.physics.helsinki.fi/~tlinden/eindex.html               I
--------------------------------------------------------------------------


From kjcruz at ece.uprm.edu Tue Dec 23 05:31:26 2003
From: kjcruz at ece.uprm.edu (Kennie Cruz)
Date: Tue, 23 Dec 2003 09:31:26 -0400 (AST)
Subject: [Rocks-Discuss]Error installing the compute node
Message-ID: <Pine.LNX.4.58.0312230921290.23333@alambique.ece.uprm.edu>

Hi,
I am trying to kickstart the compute nodes with Rocks 3.0.0; the
frontend is already working. I reviewed FAQ question 7.1.2 and the
services (dhcpd, httpd, mysqld and autofs) are running, but running
kickstart.cgi from the command line gives an error:

     error - cannot kickstart external nodes

I made a quick search on the list, but without any success.

The compute node gets the assigned IP and insert-ethers detects the
appliance without any trouble, but it fails to run kickstart.cgi from
the frontend. The web server error log says something like this:

 [Tue Dec 23 09:10:08 2003] [error] [client 10.255.255.254] malformed header
 from script. Bad header=# @Copyright@: /var/www/html/install/kickstart.cgi

While the access log says this:

 10.255.255.254 - - [23/Dec/2003:09:10:08 -0400] "GET
 /install/kickstart.cgi?arch=i386&np=2&if=eth0&project=rocks HTTP/1.0"
 500 587 "-" "-"

I ran insert-ethers with the Ethernet Switches option. My nodes are
connected via 3 managed ethernet switches.

Any help will be appreciated.

Thanks in advance.

--
Kennie J. Cruz Gutierrez, System Administrator
Department of Electrical and Computer Engineering
University of Puerto Rico, Mayaguez Campus
Work Phone: (787) 832-4040 x 3798
Email: Kennie.Cruz at ece.uprm.edu
Web: http://ece.uprm.edu/~kennie/

[2003-12-23/09:21]
Black holes are created when God divides by zero!


From bruno at rocksclusters.org Tue Dec 23 08:33:39 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Tue, 23 Dec 2003 08:33:39 -0800
Subject: [Rocks-Discuss]Error installing the compute node
In-Reply-To: <Pine.LNX.4.58.0312230921290.23333@alambique.ece.uprm.edu>
References: <Pine.LNX.4.58.0312230921290.23333@alambique.ece.uprm.edu>
Message-ID: <C33DF11A-3565-11D8-B821-000A95C4E3B4@rocksclusters.org>

just to be clear, did you execute:

# cd /home/install
# ./kickstart.cgi --client compute-0-0


 - gb
On Dec 23, 2003, at 5:31 AM, Kennie Cruz wrote:

>   Hi,
>
>   I am trying to kickstart the compute nodes with Rocks 3.0.0, the
>   frontend
>   is already working. I revised the FAQ question 7.1.2, the services
>   (dhcpd,
>   httpd, mysqld and autofs) are running, but running kickstar.cgi from
>   the
>   command line give an error:
>
>         error - cannot kickstart external nodes
>
>   I made a quick search on the list, but without any success.
>
>   The compute node gets the assigned IP and insert-ethers detect the
>   appliance without any trouble, but fails to run the kickstart.cgi from
>   the
>   frontend. The web server error log says something like this:
>
>     [Tue Dec 23 09:10:08 2003] [error] [client 10.255.255.254] malformed
>   header
>     from script. Bad header=# @Copyright@:
>   /var/www/html/install/kickstart.cgi
>
>   While the access log says this:
>
>     10.255.255.254 - - [23/Dec/2003:09:10:08 -0400] "GET
>     /install/kickstart.cgi?arch=i386&np=2&if=eth0&project=rocks HTTP/1.0"
>     500 587 "-" "-"
>
>   I ran insert-ethers with the Ethernet Switches option. My nodes are
>   connected via 3 managed ethernet switches.
>
>   Any help will be appreciated.
>
>   Thanks in advance.
>
>   --
>   Kennie J. Cruz Gutierrez, System Administrator
>   Department of Electrical and Computer Engineering
>   University of Puerto Rico, Mayaguez Campus
>   Work Phone: (787) 832-4040 x 3798
>   Email: Kennie.Cruz at ece.uprm.edu
>   Web: http://ece.uprm.edu/~kennie/
>
>   [2003-12-23/09:21]
>   Black holes are created when God divides by zero!



From daniel.kidger at quadrics.com Tue Dec 23 09:03:49 2003
From: daniel.kidger at quadrics.com (Daniel Kidger)
Date: Tue, 23 Dec 2003 17:03:49 +0000
Subject: [Rocks-Discuss]Lost nodes during cluster-kickstart?
In-Reply-To: <Pine.OSF.4.58.0312231440260.353431@rock.helsinki.fi>
References: <Pine.OSF.4.58.0312231440260.353431@rock.helsinki.fi>
Message-ID: <3FE87575.5060807@quadrics.com>

Tomas Lindén wrote:

>To reinstall a cluster I use the command
> cluster-fork /boot/kickstart/cluster-kickstart
>Now since all 32 nodes have been PXE installed this means that the
>reinstallation is performed by first doing a PXE-boot to load the
>installation kernel. My problem is that sometimes a few nodes fail
>during this reinstallation process.
>
Although I haven't PXE-installed a Rocks cluster of this size, I have
done PXE-based installs of (larger) Red Hat clusters using a customised
kickstart file. What can go wrong is that I have seen timeouts if too
many nodes dhcp/tftp for their installer kernel simultaneously. You
could try to increase the timeout or, better, not do too many at once --
say start 8 at a time every 30 seconds. There is plenty of precedent for
this in, say, the automated installer of the AlphaServer SC Tru64
clusters. Also, outside of Rocks I have seen folk use multiple
'sub-master' nodes to act as tftp/http fileservers during the install
process. It would be interesting to hear what the Rocks developers'
vision is for the scalable installation of large clusters.
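
The "start a few at a time" idea can be sketched as a small shell loop.
This is a dry run by default: KICK is echo here; on a real cluster you
would set it to the actual per-node reinstall command (e.g. an ssh
invocation of /boot/kickstart/cluster-kickstart), and the compute-0-N
node names are an assumption about your naming scheme.

```shell
#!/bin/sh
# Kick off node reinstalls in batches so the DHCP/TFTP server is never
# hit by all nodes at once.
KICK=${KICK:-echo}   # replace echo with the real per-node reinstall command
BATCH=8              # nodes per batch
DELAY=0              # seconds between batches (use e.g. 30 in practice)

count=0
for node in $(seq -f "compute-0-%g" 0 31); do
    $KICK "$node" &               # start this node's reinstall
    count=$((count + 1))
    if [ $((count % BATCH)) -eq 0 ]; then
        sleep "$DELAY"            # let the install server drain
    fi
done
wait
echo "started $count reinstalls"
```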

--
Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------




From mjk at sdsc.edu Tue Dec 23 09:44:14 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 23 Dec 2003 09:44:14 -0800
Subject: [Rocks-Discuss]Lost nodes during cluster-kickstart?
In-Reply-To: <Pine.OSF.4.58.0312231440260.353431@rock.helsinki.fi>
References: <Pine.OSF.4.58.0312231440260.353431@rock.helsinki.fi>
Message-ID: <9F7E8D1C-356F-11D8-8281-000A95DA5638@sdsc.edu>

The problem is that PXE has an extremely short timeout, and once it
fails it does not retry. Since this is a BIOS thing, there isn't a lot
we can do about it. If you boot your compute nodes off of CDs (and
avoid PXE), the problem goes away: even if DHCP times out, we've
modified our installation to be extremely aggressive in its DHCP
requests, and the entire installation process will watchdog-timeout and
restart itself if needed. Unfortunately, the PXE timeout cannot be
fixed in the same way.

Our experience shows PXE scales to 128 nodes for a mass re-install on
current hardware. Older machines may show issues. The only answer
right now is to stage your re-install so the PXE server can handle the
load. The load is actually very low, but the PXE server for Linux is
still maturing.

     -mjk
On Dec 23, 2003, at 5:28 AM, Tomas Lindén wrote:

>   To reinstall a cluster I use the command
>     cluster-fork /boot/kickstart/cluster-kickstart
>   Now since all 32 nodes have been PXE installed this means that the
>   reinstallation is performed by first doing a PXE-boot to load the
>   installation kernel. My problem is that sometimes a few nodes fail
>   during this reinstallation process. The failing nodes seem to be
>   different
>   whenever this problem occurs. The really strange thing is that after
>   more than a day or so some nodes somehow manage to finish the
>   reinstallation process!
>
>   Sometimes the whole cluster comes up fine without any lost node.
>
>   The problematic nodes _seem_ to get the installation kernel with PXE,
>   so
>   it might be not a PXE problem but something odd that happens later?
>
>   Has anyone seen anything like this before?
>
>   I'm aware of a bug in the RedHat installation kernel
>   on Athlon systems when trying to run with a serial console.
>
>   https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/
>   001988.html
>   This is why I run the installation kernel without a serial console, but
>   this makes debugging difficult because the serial console only shows
>   output during the PXE boot process. No output is generated by the
>   installation kernel itself. The next output is generated when
>   the node has finished the installation and loads the final kernel which
>   runs fine with a serial console.
>
>   This is using Rocks 2.3.2 on a 32 node cluster with Tyan Tiger MPX
>   S2466N-4M motherboards and dual Athlon MP CPUs with no graphics
>   adapters, so the system has a 32 port serial console switch. The
>   motherboards have integrated 100 Mb/s 3Com 3C920 NICs (in practice a
>   3C905 NIC). The switch is made by Enterasys. The frontend private NIC
>   is
>   also running at 100 Mb/s. When doing the cluster reinstallation the
>   network bandwidth over the frontend NIC saturates at 12,5 MB/s. Maybe
>   some packets are lost because of this?
>
>   The frontend private ethernet connection will be upgraded to Gb/s.
>   Hopefully this will solve this reinstallation problem.
>
>   Do you have any other ideas how to solve this problem?
>
>   Best regards,                               Tomas Lindén
>   --------------------------------------------------------------------------
>   I           ,                                                            I
>   I Tomas Linden                   Helsinki Institute of Physics (HIP)     I
>   I Tomas.Linden at Helsinki.FI       P.O. Box 64 (Gustaf Hällströmin katu 2) I
>   I phone: +358-9-191 505 63       FIN-00014 UNIVERSITY OF HELSINKI        I
>   I fax:   +358-9-191 505 53       Finland                                 I
>   I WWW: http://www.physics.helsinki.fi/~tlinden/eindex.html               I
>   --------------------------------------------------------------------------



From Timothy.Carlson at pnl.gov Tue Dec 23 08:57:07 2003
From: Timothy.Carlson at pnl.gov (Carlson, Timothy S)
Date: Tue, 23 Dec 2003 08:57:07 -0800
Subject: [Rocks-Discuss]Error installing the compute node
Message-ID: <A383F042472668459D642266F8B41692056B9F@pnlmse24.pnl.gov>

The problem he is having is that he chose "ethernet switches" when
running insert-ethers. He should have chosen "Compute nodes".

Only choose "ethernet switches" when you are assigning an IP address to
an ethernet switch via DHCP. If your managed switches already have IP
addresses, then just install them as "compute nodes".

Tim

-----Original Message-----
From: Greg Bruno [mailto:bruno at rocksclusters.org]
Sent: Tuesday, December 23, 2003 8:34 AM
To: Kennie Cruz
Cc: npaci-rocks-discussion at sdsc.edu
Subject: Re: [Rocks-Discuss]Error installing the compute node


just to be clear, did you execute:

# cd /home/install
# ./kickstart.cgi --client compute-0-0


    - gb




On Dec 23, 2003, at 5:31 AM, Kennie Cruz wrote:

>   Hi,
>
>   I am trying to kickstart the compute nodes with Rocks 3.0.0, the
>   frontend
>   is already working. I revised the FAQ question 7.1.2, the services
>   (dhcpd,
>   httpd, mysqld and autofs) are running, but running kickstar.cgi from
>   the
>   command line give an error:
>
>         error - cannot kickstart external nodes
>
>   I made a quick search on the list, but without any success.
>
> The compute node gets the assigned IP and insert-ethers detect the
> appliance without any trouble, but fails to run the kickstart.cgi from

> the frontend. The web server error log says something like this:
>
>    [Tue Dec 23 09:10:08 2003] [error] [client 10.255.255.254] malformed
> header
>    from script. Bad header=# @Copyright@:
> /var/www/html/install/kickstart.cgi
>
> While the access log says this:
>
>    10.255.255.254 - - [23/Dec/2003:09:10:08 -0400] "GET
>    /install/kickstart.cgi?arch=i386&np=2&if=eth0&project=rocks
HTTP/1.0"
>    500 587 "-" "-"
>
> I ran insert-ethers with the Ethernet Switches option. My nodes are
> connected via 3 managed ethernet switches.
>
> Any help will be appreciated.
>
> Thanks in advance.
>
> --
> Kennie J. Cruz Gutierrez, System Administrator
> Department of Electrical and Computer Engineering
> University of Puerto Rico, Mayaguez Campus
> Work Phone: (787) 832-4040 x 3798
> Email: Kennie.Cruz at ece.uprm.edu
> Web: http://ece.uprm.edu/~kennie/
>
> [2003-12-23/09:21]
> Black holes are created when God divides by zero!



From purikk at hotmail.com Tue Dec 23 12:48:30 2003
From: purikk at hotmail.com (Purushotham Komaravolu)
Date: Tue, 23 Dec 2003 15:48:30 -0500
Subject: [Rocks-Discuss]beowulf and rocks
Message-ID: <BAY1-DAV43JrOq93dSA00011dba@hotmail.com>

Hi,
     I keep hearing people mention Beowulf and Rocks; can somebody
explain the difference between them? Are they just two different
solutions for clusters?
Thanks
Regards,
Puru


From tim.carlson at pnl.gov Tue Dec 23 13:19:39 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Tue, 23 Dec 2003 13:19:39 -0800 (PST)
Subject: [Rocks-Discuss]beowulf and rocks
In-Reply-To: <BAY1-DAV43JrOq93dSA00011dba@hotmail.com>
Message-ID: <Pine.LNX.4.44.0312231314420.25800-100000@localhost.localdomain>
On Tue, 23 Dec 2003, Purushotham Komaravolu wrote:

>      I keep people mentioning about beowulf and Rocks, can somebody point me
> the differnece between them. They they just two different solutions for
> Clusters?

Beowulf is a loose definition for a cluster of machines (typically off the
shelf hardware). Beowulf is not software.

Rocks is a software solution to manage your beowulf.

You can compare rocks/oscar/scyld as software systems for your beowulf
cluster.

Read Robert Brown's book on beowulfs at this URL

http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book/beowulf_book/index.html

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support



From dlane at ap.stmarys.ca Tue Dec 23 14:53:51 2003
From: dlane at ap.stmarys.ca (Dave Lane)
Date: Tue, 23 Dec 2003 18:53:51 -0400
Subject: [Rocks-Discuss]beowulf and rocks
In-Reply-To: <BAY1-DAV43JrOq93dSA00011dba@hotmail.com>
Message-ID: <5.2.0.9.0.20031223185219.01b444e8@ap.stmarys.ca>

At 03:48 PM 12/23/2003 -0500, Purushotham Komaravolu wrote:
>Hi,
>      I keep people mentioning about beowulf and Rocks, can somebody point me
>the differnece between them. They they just two different solutions for
>Clusters?

Beowulf is a loosely-defined generic term (that I won't attempt to
define now!), while Rocks is one of several software distributions that
implement a beowulf cluster.

... Dave



From junkscarce at hotmail.com Tue Dec 23 15:43:05 2003
From: junkscarce at hotmail.com (Reed Scarce)
Date: Tue, 23 Dec 2003 23:43:05 +0000
Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
Message-ID: <BAY1-F147XhOous6jec0001512f@hotmail.com>

Within /export/home/install/profiles/2.3.2/site-nodes,
extend-compute.xml contains commented code like this:
<post>
/bin/mkdir /mnt/plc/ <-- works -->
/bin/mkdir /mnt/plc/plc_data <-- works -->
/bin/ln -s /mnt/plc_data /data1 <-- works -->
/bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to ln, source
exists -->
</post>

I don't understand why the ln to a directory succeeds but the ln to a
script fails.

BTW, Dr. Landman, I've attempted to use your build.pl but it seems to
fail with:
Can't stat `/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm
.
(my note: the path ends at RPMS) I swear I thought I saw a solution to this
once but I can't find it again.
Upon reinstallation with the file your tool created
(/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm) anaconda threw
back the exception: Traceback (innermost last): file
"/usr/bin/anaconda.real", line 633, in ? intf.run(id, dispatch,
configFileData) File
"/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py", line 427 in
run
ok save debug


TIA Reed Scarce

_________________________________________________________________
Tired of slow downloads? Compare online deals from your local high-speed
providers now. https://broadband.msn.com



From landman at scalableinformatics.com Tue Dec 23 16:17:58 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 23 Dec 2003 19:17:58 -0500
Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
In-Reply-To: <BAY1-F147XhOous6jec0001512f@hotmail.com>
References: <BAY1-F147XhOous6jec0001512f@hotmail.com>
Message-ID: <1072225078.4501.82.camel@protein.scalableinformatics.com>

Hi Reed:

  Which version of finishing server fails on which version of ROCKS? It
looks like 3.0. I am up to 3.1.0 now. With a little bit of modification
I could make it work with 2.3.2. Likely just a single line to point to
the right path.

  Let me know and I'll see what I can do. I would recommend using the
3.1.0 environment, as it is a significant (read as massive) improvement
over previous versions. If you (and others) need it to work with older
(pre-3.0) versions of ROCKS, I think I can handle that. Let me know.

Joe

On Tue, 2003-12-23 at 18:43, Reed Scarce wrote:
> Within /export/home/install/profiles/2.3.2/site-nodes extend-compute.xml
> lies code like this commented code:
> <post>
>   /bin/mkdir /mnt/plc/ <-- works -->
>   /bin/mkdir /mnt/plc/plc_data <-- works -->
>   /bin/ln -s /mnt/plc_data /data1 <-- works -->
>   /bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to ln, source
>   exists -->
>   </post>
>
>   I don't understand why the ln to a directory succeeds but a ln to a script
>   fails.
>
>   BTW, Dr. Landman, I've attempted to use your build.pl but it seems to faill
>   with:
>   Can't stat `/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm



From mjk at sdsc.edu Tue Dec 23 16:35:13 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 23 Dec 2003 16:35:13 -0800
Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
In-Reply-To: <BAY1-F147XhOous6jec0001512f@hotmail.com>
References: <BAY1-F147XhOous6jec0001512f@hotmail.com>
Message-ID: <09B1C3EA-35A9-11D8-8281-000A95DA5638@sdsc.edu>

"man chkconfig"

If you use chkconfig, you do not need to create the rc*.d/* files
yourself; they are put in place for you.
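
For the gpm case in the original post, the extend-compute.xml <post>
section could let chkconfig build the links (a sketch; it assumes the
gpm init script with its chkconfig header is already installed on the
node):

```xml
<post>
# let chkconfig create the rc*.d symlinks from the init script's
# chkconfig header, instead of hand-made ln calls:
/sbin/chkconfig --add gpm
/sbin/chkconfig gpm on
</post>
```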

       -mjk

On Dec 23, 2003, at 3:43 PM, Reed Scarce wrote:

>   Within /export/home/install/profiles/2.3.2/site-nodes
>   extend-compute.xml lies code like this commented code:
>   <post>
>   /bin/mkdir /mnt/plc/ <-- works -->
>   /bin/mkdir /mnt/plc/plc_data <-- works -->
>   /bin/ln -s /mnt/plc_data /data1 <-- works -->
>   /bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to ln,
>   source exists -->
>   </post>
>
>   I don't understand why the ln to a directory succeeds but a ln to a
>   script fails.
>
>   BTW, Dr. Landman, I've attempted to use your build.pl but it seems to
>   faill with:
>   Can't stat
>   `/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm .
>   (my note: the path ends at RPMS) I swear I thought I saw a solution
>   to this once but I can't find it again.
>   Upon reinstallation with the file your tool created
>   (/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm) anaconda
>   threw back the exception: Traceback (innermost last): file
>   "/usr/bin/anaconda.real", line 633, in ? intf.run(id, dispatch,
>   configFileData) File
>   "/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py", line
>   427 in run
>   ok save debug
>
>
>   TIA Reed Scarce
>
>   _________________________________________________________________
>   Tired of slow downloads? Compare online deals from your local
>   high-speed providers now. https://broadband.msn.com



From jkreuzig at uci.edu Tue Dec 23 19:53:16 2003
From: jkreuzig at uci.edu (James Kreuziger)
Date: Tue, 23 Dec 2003 19:53:16 -0800 (PST)
Subject: [Rocks-Discuss]Dell Power Connect 5224
In-Reply-To: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov>
References: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov>
Message-ID: <Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu>

Thanks everybody for the info. I was aware of the fast-link issue;
however, after enabling it, we still were unable to see the switch
from the frontend. We had a laptop hooked up to the switch via serial
and ethernet and were able to turn on fast-link and assign an
IP address. After that, the web-based interface came up on the laptop.
Still, no response on the switch from the frontend.

So after great gnashing of teeth, and dozens of re-installs of the
frontend, success! The problem? The extra NIC in the frontend.
We had bought the frontend with a dual 1 Gb card and a single 100 Mb
card. Whenever the single NIC is installed, the system always takes it
as eth0. This was staring us right in the face, which is probably why
it took so long to figure out.

After 3 years of trying to find the money, we finally have our first
8 node cluster up!

-Jim

*************************************************
Jim Kreuziger
jkreuzig at uci.edu
949-824-4474
*************************************************




From landman at scalableinformatics.com Tue Dec 23 20:23:35 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 23 Dec 2003 23:23:35 -0500
Subject: [Rocks-Discuss]Dell Power Connect 5224
In-Reply-To: <Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu>
References: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov>
<Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu>
Message-ID: <3FE914C7.3050001@scalableinformatics.com>

Hi James:

    One of the things I do the first time I boot up a new head node is
to map the ethernet ports. I take out all but one of the network wires
and make sure there is real network traffic; a ping on the subnet is
fine. Then I tcpdump the network port. What is surprising to me is how
many times the assumed eth0 is mapped differently. Then, after mapping
the rest of the ports, I manually modify the /etc/modules.conf file to
reflect what I need.
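
That port-mapping procedure boils down to a few commands (a sketch; the
interface names and the ping target are placeholders for your own
hardware and subnet, and the modules.conf aliases are only examples):

```shell
# With a single cable plugged in, generate known traffic on the subnet,
# then watch each interface in turn until the packets show up:
ping -c 20 10.1.1.1 &        # some live host on the private subnet
tcpdump -n -i eth0           # no packets? repeat with eth1, eth2, ...

# Once each physical port is identified, pin the driver-to-name mapping
# in /etc/modules.conf, for example:
#   alias eth0 e1000
#   alias eth1 3c59x
```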

  Just a suggestion. Having been bitten enough, I find simple sanity
checks help reduce the size, or dimensionality, of the space of possible
problems. This usually makes these debugging sessions faster and
allows for better characterization of the issue.

Joe

James Kreuziger wrote:

>Thanks everybody for the info. I was aware of the fast-link issue;
>However, after enabling it, we still were unable to see the switch
>from the frontend. We had a laptop hooked up to the switch via serial
>and ethernet and was able to turn on the fast-link, and assign an
>IP address. After that, the web-based interface came up on the laptop.
>Still, no response on the switch from the frontend.
>
>So after great gnashing of teeth, and dozens of re-installs of the
>frontend, success! The problem? The extra nic card on the frontend.
>We had bought the frontend with a dual 1GB card and a single 100MB card.
>Whenever the single nic card is installed, the system always takes this
>as eth0. This is something that was staring us right in the face, so
>that's why it probably took so long to figure out.
>
>After 3 years of trying to find the money, we finally have our first
>8 node cluster up!
>
>-Jim
>
>*************************************************
>Jim Kreuziger
>jkreuzig at uci.edu
>949-824-4474
>*************************************************
>
>
>

--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615




From bruno at rocksclusters.org Tue Dec 23 21:26:08 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Tue, 23 Dec 2003 21:26:08 -0800
Subject: [Rocks-Discuss]Rocks 3.1.0 is released for x86, ia64 and x86-64
Message-ID: <ADA8CDD0-35D1-11D8-B821-000A95C4E3B4@rocksclusters.org>

Version 3.1.0 (Matterhorn) of the Rocks cluster distribution is
released and now supports three processor families: Intel IA-32, Intel
Itanium Processor Family, and AMD Opteron. This is the released
version of the software that was used to build a fully-functioning
128-node grid-enabled cluster in under 2 hours on opening night last
month at SC2003 in Phoenix, AZ. Rocks is developed by the Grid and
Cluster Computing Group at SDSC and by partners at the University of
California, Berkeley, Scalable Systems in Singapore, and individual
open-source software developers.

This is a co-release for x86 (Pentium, Athlon, and others), Itanium2
(IA-64) and Opteron (x86-64) based clusters. Software is freely
available for download to burn onto a bootable CD set for x86 and
x86-64 or a single DVD for Itanium2. Versions for all processor
families are available at http://www.rocksclusters.org/.

Introduced in Version 3.0.0, the "roll" mechanism is enhanced in this
version to enable users, communities and others to easily add optional
software and configuration. These optional "Roll CDs" extend the
system by integrating seamlessly and automatically into the management
and packaging mechanisms used by the base software. For all intents and
purposes, rolls appear as if they are part of the original CD
distribution. A number of defined extension rolls are freely available,
including HPC, Sun Grid Engine, Grid (based on NMI), Java and Intel
Compiler. An important feature is that new rolls can be created or
updated independently of the core distribution. This fundamentally
enables science teams and communities to add domain-specific
software packages, define a particular grid configuration, or simply
modify any of the default configuration or package settings.

New features in NPACI Rocks 3.1.0 include:

- Opteron Support
- Sun Grid Engine as the default queuing system
- Upgraded Ganglia server and client, used for collecting and
visualizing cluster-wide monitoring metrics
- Upgraded MPICH-GM and Myrinet GM 2.0 for the latest Rev D cards
- Rocks-developed 411 information system to replace Network Information
Service (NIS)
- Updated SSH version 3.7.1 with no login delay
- Several Optional Software Rolls including:
      - NSF Middleware Initiative version R4 grid distribution
      - Java 2
      - Intel Compilers for x86 and ia64

Rocks 3.1.0 is derived from Red Hat's publicly available source
packages (SRPMS) used in portions of their Enterprise Linux 3.0 product
line. All SRPMS have been recompiled to enable redistribution. All
available updates for these packages have been pre-applied.
Rocks-specific software and standard cluster and grid community
software is then added to create a complete clustering toolkit. All
Rocks source code is available in a public CVS repository.



From angel at miami.edu Wed Dec 24 13:14:59 2003
From: angel at miami.edu (Angel Li)
Date: Wed, 24 Dec 2003 16:14:59 -0500
Subject: [Rocks-Discuss]Rocks 3.1.0 is released for x86, ia64 and x86-64
In-Reply-To: <ADA8CDD0-35D1-11D8-B821-000A95C4E3B4@rocksclusters.org>
References: <ADA8CDD0-35D1-11D8-B821-000A95C4E3B4@rocksclusters.org>
Message-ID: <3FEA01D3.8080204@miami.edu>

Hi,

I currently have a cluster running Rocks 3.0 and I'm considering
upgrading to 3.1. Now that SGE is the default batch queue, is maui
working? Also, the Intel compiler roll is included. What licensing
issues will I encounter? We currently have a license for version 7.

Thanks,

Angel



From bruno at rocksclusters.org Wed Dec 24 14:14:46 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Wed, 24 Dec 2003 14:14:46 -0800
Subject: [Rocks-Discuss]Rocks 3.1.0 is released for x86, ia64 and x86-64
In-Reply-To: <3FEA01D3.8080204@miami.edu>
References: <ADA8CDD0-35D1-11D8-B821-000A95C4E3B4@rocksclusters.org>
<3FEA01D3.8080204@miami.edu>
Message-ID: <94F9D6F6-365E-11D8-B821-000A95C4E3B4@rocksclusters.org>

> I currently have a cluster running Rocks 3.0 and I'm considering
> upgrading to 3.1. Now that SGE is the default batch queue, is maui
> working?

maui and pbs are currently not available in rocks 3.1, but they will
be soon.

maui and pbs will be included in their own roll -- that effort will be
driven by Roy Dragseth from the University of Tromsø.

> Also, the Intel compiler roll is included. What licensing issues will
> I encounter? We currently have a license for version 7.

i'm not sure how the licenses transfer between versions.

after you bring up a frontend with the intel roll, the following link
is available on the frontend's home page:

        http://www.intel.com/software/products/distributors/rock_cluster.htm

after you purchase a license, you just need to copy the license into
the appropriate directory and then start compiling.

for fortran, the appropriate directory is:

        /opt/intel_fc_80/licenses

and for C, the appropriate directory is:

        /opt/intel_cc_80/licenses

also, the intel roll contains a pre-built MPICH environment -- it is
found under /opt/mpich/intel.
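to make the license step concrete, here is a minimal sketch of copying a
license file into both roll directories. the file name l_example.lic is a
placeholder (use whatever Intel actually sends you), and ROOT is introduced
here only so the commands can be exercised safely outside a real frontend:

```shell
#!/bin/sh
# Illustrative sketch only: on a real frontend ROOT would be empty (so the
# paths are /opt/...), and l_example.lic stands in for the license file
# Intel sends you after purchase -- both are placeholders.
ROOT="${ROOT:-/tmp/intel-lic-demo}"
mkdir -p "$ROOT/opt/intel_fc_80/licenses" "$ROOT/opt/intel_cc_80/licenses"
: > "$ROOT/l_example.lic"              # stand-in for the purchased license
for d in intel_fc_80 intel_cc_80; do   # fortran and C compiler rolls
    cp "$ROOT/l_example.lic" "$ROOT/opt/$d/licenses/"
done
```

on a real frontend you would drop ROOT (copying straight into
/opt/intel_fc_80/licenses and /opt/intel_cc_80/licenses) and then compile
with the wrappers under /opt/mpich/intel.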

  - gb



From cdwan at mail.ahc.umn.edu Wed Dec 24 14:17:28 2003
From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB))
Date: Wed, 24 Dec 2003 16:17:28 -0600 (CST)
Subject: [Rocks-Discuss]Dell Power Connect 5224
In-Reply-To: <3FE914C7.3050001@scalableinformatics.com>
References: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov>
 <Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu>
 <3FE914C7.3050001@scalableinformatics.com>
Message-ID: <Pine.GSO.4.58.0312241612450.25288@lenti.med.umn.edu>

Once upon a time, I decided to install a third interface in a rocks head
node (Dell SC1400, and a Syskonnect 98x Gig NIC for the interested) for a
data network. At boot time *everything* was broken.

To make a long story less long, the system had remapped itself with the
new gig card as eth0, and the other two shifted up by one. That was
really close to "no fun at all."

Happy holidays!   I'm burning the new release right now!

-C


From michal at harddata.com Wed Dec 24 15:05:43 2003
From: michal at harddata.com (Michal Jaegermann)
Date: Wed, 24 Dec 2003 16:05:43 -0700
Subject: [Rocks-Discuss]Dell Power Connect 5224
In-Reply-To: <Pine.GSO.4.58.0312241612450.25288@lenti.med.umn.edu>; from
cdwan@mail.ahc.umn.edu on Wed, Dec 24, 2003 at 04:17:28PM -0600
References: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov>
<Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu>
<3FE914C7.3050001@scalableinformatics.com>
<Pine.GSO.4.58.0312241612450.25288@lenti.med.umn.edu>
Message-ID: <20031224160543.A25886@mail.harddata.com>

On Wed, Dec 24, 2003 at 04:17:28PM -0600, Chris Dwan (CCGB) wrote:
>
> Once upon a time, I decided to install a third interface in a rocks head
> node (Dell SC1400, and a Syskonnect 98x Gig NIC for the interested) for a
> data network. At boot time *everything* was broken.

I still cannot understand why people insist on NOT using the 'nameif'
utility. All network interfaces can be named whichever way you want
and they will not move regardless of how many NICs you add or
remove, as long as MACs are not changed. If you replace a card with
a different one then /etc/mactab needs to be edited to reflect your
new configuration. On client nodes with an automatic reinstall
this indeed is not practical, but for your front end machine this is
another story.

It is indeed the case that default startup scripts from Red Hat 7.3
need some simple additions, as interface (re)naming needs to be done
before NICs are brought up for the first time. In RH9 and FC1
'nameif' will be used "automagically" if the HWADDR variable is defined
(and with a correct value).

Of course, if you have different drivers for different NICs, and they
are loaded as modules, then names can be assigned by editing
/etc/modules.conf.
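For reference, an /etc/mactab along these lines might look like the sketch
below (the MAC addresses and the data0 name are invented for illustration).
'nameif' reads the file and renames each interface to the name bound to its
MAC:

```text
# /etc/mactab: interface-name  MAC-address
eth0    00:0e:0c:11:22:33   # onboard NIC, public network
eth1    00:0e:0c:44:55:66   # onboard NIC, private network
data0   00:00:5a:77:88:99   # add-in gig card, data network
```

Because the binding is by MAC, adding or removing other cards leaves these
names untouched.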

  Michal


From bruno at rocksclusters.org Wed Dec 24 15:41:25 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Wed, 24 Dec 2003 15:41:25 -0800
Subject: [Rocks-Discuss]Dell Power Connect 5224
In-Reply-To: <20031224160543.A25886@mail.harddata.com>
References: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov>
<Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu>
<3FE914C7.3050001@scalableinformatics.com>
<Pine.GSO.4.58.0312241612450.25288@lenti.med.umn.edu>
<20031224160543.A25886@mail.harddata.com>
Message-ID: <AFFB44D8-366A-11D8-B821-000A95C4E3B4@rocksclusters.org>

>> Once upon a time, I decided to install a third interface in a rocks
>> head
>> node (Dell SC1400, and a Syskonnect 98x Gig NIC for the interested)
>> for a
>> data network. At boot time *everything* was broken.
>
> I still cannot understand why people insist on NOT using the 'nameif'
> utility. All network interfaces can be named whichever way you want
> and they will not move regardless of how many NICs you add or
> remove, as long as MACs are not changed. If you replace a card with
> a different one then /etc/mactab needs to be edited to reflect your
> new configuration. On client nodes with an automatic reinstall
> this indeed is not practical, but for your front end machine this is
> another story.
>
> It is indeed the case that default startup scripts from Red Hat 7.3
> need some simple additions, as interface (re)naming needs to be done
> before NICs are brought up for the first time. In RH9 and FC1
> 'nameif' will be used "automagically" if the HWADDR variable is defined
> (and with a correct value).

michal,

for this release, we looked at your suggestion of using nameif -- we
did a quick prototype and it looks like it will be the right thing to
do. we sketched out a design and found that the full solution will
require many pieces (database changes, installer changes and the
obvious XML file changes). we left this out of 3.1.0 but it is towards
the top of our list for the next release.

thanks for the suggestion of nameif -- it is suggestions like that
which help us to define the direction of rocks.

  - gb
From landman at scalableinformatics.com Wed Dec 24 16:08:54 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 24 Dec 2003 19:08:54 -0500
Subject: [Rocks-Discuss]Dell Power Connect 5224
In-Reply-To: <20031224160543.A25886@mail.harddata.com>
References: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov>
<Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu>
<3FE914C7.3050001@scalableinformatics.com>
<Pine.GSO.4.58.0312241612450.25288@lenti.med.umn.edu>
<20031224160543.A25886@mail.harddata.com>
Message-ID: <3FEA2A96.3060405@scalableinformatics.com>


Michal Jaegermann wrote:

>On Wed, Dec 24, 2003 at 04:17:28PM -0600, Chris Dwan (CCGB) wrote:
>
>
>>Once upon a time, I decided to install a third interface in a rocks head
>>node (Dell SC1400, and a Syskonnect 98x Gig NIC for the interested) for a
>>data network. At boot time *everything* was broken.
>>
>>
>
>I still cannot understand why people insist on NOT using the 'nameif'
>utility. All network interfaces can be named whichever way you want
>and they will not move regardless of how many NICs you add or
>remove, as long as MACs are not changed. If you replace a card with
>a different one then /etc/mactab needs to be edited to reflect your
>new configuration. On client nodes with an automatic reinstall
>this indeed is not practical, but for your front end machine this is
>another story.
>
>
Agreed, though as far as I can tell, nameif is not used in the
/etc/init.d scripts. It is used by ifup, so you would have to set HWADDR
on each interface in the /etc/sysconfig/.../ifcfg-eth* files (the ...
refers to the fact that RH9 and RHEL3 have moved where these things sit
from what we were used to in RH7.x). You still need to map the interfaces,
though, to see which physical port corresponds to which device/MAC address.
With that in hand, you can set up the HWADDR or just swap cables. With the
advent of folks making exactly the right length cables (e.g. not giving any
play, and placing them under tension while plugged in...) the cable swap
doesn't work well for mapping on some systems. Moreover, on a fair number
of systems I have played with, the BIOS is set up so that if they PXE boot,
they are doing so from the address that the installed version of ROCKS
would see as eth1. Annoying.
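As a sketch of the HWADDR approach (the MAC and IP addresses below are
placeholders), the per-interface file carries the binding, e.g. on RH9
under /etc/sysconfig/network-scripts/:

```text
# ifcfg-eth0 -- sketch; MAC and IP values are placeholders
DEVICE=eth0
HWADDR=00:0e:0c:11:22:33   # tie the eth0 name to this physical NIC
ONBOOT=yes
BOOTPROTO=static
IPADDR=10.1.1.1
NETMASK=255.255.255.0
```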


--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615
From junkscarce at hotmail.com Fri Dec 26 15:35:57 2003
From: junkscarce at hotmail.com (Reed Scarce)
Date: Fri, 26 Dec 2003 23:35:57 +0000
Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
Message-ID: <BAY1-F88Kxt8zPdqJL900052b1b@hotmail.com>

The line:

chkconfig --level 3 gpm on

works great from the command line, not in extend-compute.xml. Thanks for
the new tool though, always glad. The line above is in a block without
<eval shell="bash"> tags. I'll keep trying and rtm. Is it possible this is
a 2.6.2 issue? The live environment restricts me from using a more recent
version.


>From: "Mason J. Katz" <mjk at sdsc.edu>
>To: "Reed Scarce" <junkscarce at hotmail.com>
>CC: npaci-rocks-discussion at sdsc.edu
>Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
>Date: Tue, 23 Dec 2003 16:35:13 -0800
>
>"man chkconfig"
>
>If you use chkconfig you do not need to create the rc*.d/* files and they
>are put in place for you.
>
>     -mjk
>
>On Dec 23, 2003, at 3:43 PM, Reed Scarce wrote:
>
>>Within /export/home/install/profiles/2.3.2/site-nodes extend-compute.xml
>>lies code like this commented code:
>><post>
>>/bin/mkdir /mnt/plc/ <-- works -->
>>/bin/mkdir /mnt/plc/plc_data <-- works -->
>>/bin/ln -s /mnt/plc_data /data1 <-- works -->
>>/bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to ln,
>>source exists -->
>></post>
>>
>>I don't understand why the ln to a directory succeeds but a ln to a script
>>fails.
>>
>>BTW, Dr. Landman, I've attempted to use your build.pl but it seems to
>>faill with:
>>Can't stat
>>`/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm .
>>(my note: the path ends at RPMS) I swear I thought I saw a solution to
>>this once but I can't find it again.
>>Upon reinstallation with the file your tool created
>>(/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm) anaconda
>>threw back the exception: Traceback (innermost last): file
>>"/usr/bin/anaconda.real", line 633, in ? intf.run(id, dispatch,
>>configFileData) File
>>"/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py", line 427 in
>>run
>>ok save debug
>>
>>
>>TIA Reed Scarce
>>
>>_________________________________________________________________
>>Tired of slow downloads? Compare online deals from your local high-speed
>>providers now. https://broadband.msn.com
>

_________________________________________________________________
Worried about inbox overload? Get MSN Extra Storage now!
http://join.msn.com/?PAGE=features/es



From mjk at sdsc.edu Fri Dec 26 16:46:22 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Fri, 26 Dec 2003 16:46:22 -0800
Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
In-Reply-To: <BAY1-F88Kxt8zPdqJL900052b1b@hotmail.com>
References: <BAY1-F88Kxt8zPdqJL900052b1b@hotmail.com>
Message-ID: <1759D2DF-3806-11D8-98D0-000A95DA5638@sdsc.edu>

Not sure if this answers your question. But...

The <eval></eval> blocks are for code to be run on the kickstart server
(the one that generates the kickstart file). Code outside of the eval
blocks is run on the kickstarting host.
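A minimal sketch of that split, using the extend-compute.xml conventions
already shown in this thread (the commands themselves are illustrative):

```xml
<post>
<!-- outside eval: runs on the kickstarting node during install -->
/sbin/chkconfig --level 3 gpm on

        <eval shell="bash">
        <!-- inside eval: runs on the frontend while the kickstart
             file is being generated -->
        echo "# kickstart generated on $(hostname)"
        </eval>
</post>
```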

     -mjk


On Dec 26, 2003, at 3:35 PM, Reed Scarce wrote:

> The line:
>
> chkconfig --level 3 gpm on
>
> works great from the command line, not in extend-compute.xml. Thanks
> for the new tool though, always glad. The line above is in a block
> without <eval shell="bash"> tags. I'll keep trying and rtm. Is it
> possible this is a 2.6.2 issue? The live environment restricts me
> from using a more recent version.
>
>
>> From: "Mason J. Katz" <mjk at sdsc.edu>
>> To: "Reed Scarce" <junkscarce at hotmail.com>
>> CC: npaci-rocks-discussion at sdsc.edu
>> Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation
>> fails
>> Date: Tue, 23 Dec 2003 16:35:13 -0800
>>
>> "man chkconfig"
>>
>> If you use chkconfig you do not need to create the rc*.d/* files and
>> they are put in place for you.
>>
>>    -mjk
>>
>> On Dec 23, 2003, at 3:43 PM, Reed Scarce wrote:
>>
>>> Within /export/home/install/profiles/2.3.2/site-nodes
>>> extend-compute.xml lies code like this commented code:
>>> <post>
>>> /bin/mkdir /mnt/plc/ <-- works -->
>>> /bin/mkdir /mnt/plc/plc_data <-- works -->
>>> /bin/ln -s /mnt/plc_data /data1 <-- works -->
>>> /bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to ln,
>>> source exists -->
>>> </post>
>>>
>>> I don't understand why the ln to a directory succeeds but a ln to a
>>> script fails.
>>>
>>> BTW, Dr. Landman, I've attempted to use your build.pl but it seems
>>> to faill with:
>>> Can't stat
>>> `/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm .
>>> (my note: the path ends at RPMS) I swear I thought I saw a solution
>>> to this once but I can't find it again.
>>> Upon reinstallation with the file your tool created
>>> (/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm)
>>> anaconda threw back the exception: Traceback (innermost last): file
>>> "/usr/bin/anaconda.real", line 633, in ? intf.run(id, dispatch,
>>> configFileData) File
>>> "/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py", line
>>> 427 in run
>>> ok save debug
>>>
>>>
>>> TIA Reed Scarce
>>>
>>> _________________________________________________________________
>>> Tired of slow downloads? Compare online deals from your local
>>> high-speed providers now. https://broadband.msn.com
>>
>
> _________________________________________________________________
> Worried about inbox overload? Get MSN Extra Storage now!
> http://join.msn.com/?PAGE=features/es



From apseyed at bu.edu Sat Dec 27 12:32:40 2003
From: apseyed at bu.edu (apseyed at bu.edu)
Date: Sat, 27 Dec 2003 15:32:40 -0500
Subject: [Rocks-Discuss]Re: npaci-rocks-discussion digest, Vol 1 #663 - 2 msgs
In-Reply-To: <200312272013.hBRKDbJ15227@postal.sdsc.edu>
References: <200312272013.hBRKDbJ15227@postal.sdsc.edu>
Message-ID: <1072557160.3fedec68d07d6@www.bu.edu>

For what it's worth,

Why don't you try specifying the absolute path (/sbin/chkconfig) and
setting a debug flag and an output file? (If you can confirm /sbin is in
$PATH during the life of the script, never mind the first suggestion.)

echo "got to chkconfig beginning" > /tmp/ks.log
/sbin/chkconfig --level 3 gpm on
echo "got to chkconfig end" >> /tmp/ks.log
/sbin/chkconfig --list | grep gpm >> /tmp/ks.log

-Patrice


Quoting npaci-rocks-discussion-request at sdsc.edu:

>   Send npaci-rocks-discussion mailing list submissions to
>       npaci-rocks-discussion at sdsc.edu
>
>   To subscribe or unsubscribe via the World Wide Web, visit
>       http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
>   or, via email, send a message with subject or body 'help' to
>       npaci-rocks-discussion-request at sdsc.edu
>
>   You can reach the person managing the list at
>       npaci-rocks-discussion-admin at sdsc.edu
>
>   When replying, please edit your Subject line so it is more specific
>   than "Re: Contents of npaci-rocks-discussion digest..."
>
>
>   Today's Topics:
>
>      1. Re: Extend-compute.xml issue, ln creation fails (Reed Scarce)
>      2. Re: Extend-compute.xml issue, ln creation fails (Mason J.
>   Katz)
>
>   --__--__--
>
>   Message: 1
>   From: "Reed Scarce" <junkscarce at hotmail.com>
>   To: mjk at sdsc.edu
>   Cc: npaci-rocks-discussion at sdsc.edu
>   Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation
>   fails
>   Date: Fri, 26 Dec 2003 23:35:57 +0000
>
>   The line:
>
>   chkconfig --level 3 gpm on
>
>   works great from the command line, not in extend-compute.xml. Thanks
>   for
>   the new tool though, always glad. The line above is in a block
>   without
>   <eval shell="bash"> tags. I'll keep trying and rtm. Is it possible
>   this is
>   a 2.6.2 issue? The live environment restricts me from using a more
>   recent
>   version.
>
>
>   >From: "Mason J. Katz" <mjk at sdsc.edu>
>   >To: "Reed Scarce" <junkscarce at hotmail.com>
>   >CC: npaci-rocks-discussion at sdsc.edu
>   >Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation
>   fails
>   >Date: Tue, 23 Dec 2003 16:35:13 -0800
>   >
>   >"man chkconfig"
>   >
>   >If you use chkconfig you do not need to create the rc*.d/* files and
>   they
>   >are put in place for you.
>   >
>   >   -mjk
>   >
>   >On Dec 23, 2003, at 3:43 PM, Reed Scarce wrote:
>   >
>   >>Within /export/home/install/profiles/2.3.2/site-nodes
>   extend-compute.xml
>   >>lies code like this commented code:
>   >><post>
>   >>/bin/mkdir /mnt/plc/ <-- works -->
>   >>/bin/mkdir /mnt/plc/plc_data <-- works -->
>   >>/bin/ln -s /mnt/plc_data /data1 <-- works -->
>   >>/bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to
>   ln,
>   >>source exists -->
>   >></post>
>   >>
>   >>I don't understand why the ln to a directory succeeds but a ln to a
>   script
>   >>fails.
>   >>
>   >>BTW, Dr. Landman, I've attempted to use your build.pl but it seems
>   to
>   >>faill with:
>   >>Can't stat
>   >>`/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm
>   .
>   >>(my note: the path ends at RPMS) I swear I thought I saw a
>   solution to
>   >>this once but I can't find it again.
>   >>Upon reinstallation with the file your tool created
>   >>(/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm)
>   anaconda
>   >>threw back the exception: Traceback (innermost last): file
>   >>"/usr/bin/anaconda.real", line 633, in ? intf.run(id, dispatch,
>   >>configFileData) File
>   >>"/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py", line
>   427 in
>   >>run
>   >>ok save debug
>   >>
>   >>
>   >>TIA Reed Scarce
>   >>
>   >>_________________________________________________________________
>   >>Tired of slow downloads? Compare online deals from your local
>   high-speed
>   >>providers now. https://broadband.msn.com
>   >
>
>   _________________________________________________________________
>   Worried about inbox overload? Get MSN Extra Storage now!
>   http://join.msn.com/?PAGE=features/es
>
>
>   --__--__--
>
>   Message: 2
>   Cc: npaci-rocks-discussion at sdsc.edu
>   From: "Mason J. Katz" <mjk at sdsc.edu>
>   Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation
>   fails
>   Date: Fri, 26 Dec 2003 16:46:22 -0800
>   To: "Reed Scarce" <junkscarce at hotmail.com>
>
>   Not sure if this answers your question.   But..
>
>   The <eval></eval> blocks are for code to be run on the kickstart
>   server
>   (the one that generates the kickstart file). Code outside of the eval
>
>   blocks is run on the kickstarting host.
>
>      -mjk
>
>
>   On Dec 26, 2003, at 3:35 PM, Reed Scarce wrote:
>
>   > The line:
>   >
>   > chkconfig --level 3 gpm on
>   >
>   > works great from the command line, not in extend-compute.xml.
>   Thanks
>   > for the new tool though, always glad. The line above is in a block
>
>   > without <eval shell="bash"> tags.   I'll keep trying and rtm.   Is it
>
>   > possible this is a 2.6.2 issue?   The live environment restricts me
>
>   > from using a more recent version.
>   >
>   >
>   >> From: "Mason J. Katz" <mjk at sdsc.edu>
>   >> To: "Reed Scarce" <junkscarce at hotmail.com>
>   >> CC: npaci-rocks-discussion at sdsc.edu
>   >> Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation
>
>   >> fails
>   >> Date: Tue, 23 Dec 2003 16:35:13 -0800
>   >>
>   >> "man chkconfig"
>   >>
>   >> If you use chkconfig you do not need to create the rc*.d/* files
>   and
>   >> they are put in place for you.
>   >>
>   >> -mjk
>   >>
>   >> On Dec 23, 2003, at 3:43 PM, Reed Scarce wrote:
>   >>
>   >>> Within /export/home/install/profiles/2.3.2/site-nodes
>   >>> extend-compute.xml lies code like this commented code:
>   >>> <post>
>   >>> /bin/mkdir /mnt/plc/ <-- works -->
>   >>> /bin/mkdir /mnt/plc/plc_data <-- works -->
>   >>> /bin/ln -s /mnt/plc_data /data1 <-- works -->
>   >>> /bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to
>   ln,
>   >>> source exists -->
>   >>> </post>
>   >>>
>   >>> I don't understand why the ln to a directory succeeds but a ln to
>   a
>   >>> script fails.
>   >>>
>   >>> BTW, Dr. Landman, I've attempted to use your build.pl but it
>   seems
>   >>> to faill with:
>   >>> Can't stat
>   >>> `/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm
>   .
>   >>> (my note: the path ends at RPMS) I swear I thought I saw a
>   solution
>   >>> to this once but I can't find it again.
>   >>> Upon reinstallation with the file your tool created
>   >>> (/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm)
>   >>> anaconda threw back the exception: Traceback (innermost last):
>   file
>   >>> "/usr/bin/anaconda.real", line 633, in ? intf.run(id, dispatch,
>
>   >>> configFileData) File
>   >>> "/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py",
>   line
>   >>> 427 in run
>   >>> ok save debug
>   >>>
>   >>>
>   >>> TIA Reed Scarce
>   >>>
>   >>>
>   _________________________________________________________________
>   >>> Tired of slow downloads? Compare online deals from your local
>   >>> high-speed providers now. https://broadband.msn.com
>   >>
>   >
>   > _________________________________________________________________
>   > Worried about inbox overload? Get MSN Extra Storage now!
>   > http://join.msn.com/?PAGE=features/es
>
>
>
>   --__--__--
>
>   _______________________________________________
>   npaci-rocks-discussion mailing list
>   npaci-rocks-discussion at sdsc.edu
>   http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
>
>
> End of npaci-rocks-discussion Digest
>


From rocks_india at yahoo.co.in Sat Dec 27 20:20:40 2003
From: rocks_india at yahoo.co.in (Rocks India)
Date: Sun, 28 Dec 2003 04:20:40 +0000 (GMT)
Subject: [Rocks-Discuss]Rocks 3.0 Newbeeeeeeee
Message-ID: <20031228042040.88990.qmail@web8301.mail.in.yahoo.com>

Hello All,
          I am new to Rocks. I was able to download and install Rocks
3.0. I am not sure if Globus 3.0 gets installed during the installation
process; I tried to use simple CA commands and get a "command not
found" error.
          Do I need to download the Globus Toolkit and install it, or
is it installed along with Rocks?

Or can anyone direct me to a site or give me the steps that need to be
taken after installing Rocks to set up Globus?

                                Rocks-India

________________________________________________________________________
Yahoo! India Matrimony: Find your partner online.
Go to http://yahoo.shaadi.com


From bruno at rocksclusters.org Sat Dec 27 21:35:28 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Sat, 27 Dec 2003 21:35:28 -0800
Subject: [Rocks-Discuss]Rocks 3.0 Newbeeeeeeee
In-Reply-To: <20031228042040.88990.qmail@web8301.mail.in.yahoo.com>
References: <20031228042040.88990.qmail@web8301.mail.in.yahoo.com>
Message-ID: <A4E388DE-38F7-11D8-9E96-000A95C4E3B4@rocksclusters.org>

>             I am new to Rocks. I was able to download and install
>   Rocks 3.0. I am not sure if Globus 3.0 gets installed during the
>   installation process; I tried to use simple CA commands and get a
>   "command not found" error.
>             Do I need to download the Globus Toolkit and install it,
>   or is it installed along with Rocks?
>
>   Or can anyone direct me to a site or give me the steps that need to
>   be taken after installing Rocks to set up Globus?

here are the steps, but it would require reinstalling your frontend:

go to:
http://www.rocksclusters.org/rocks-documentation/3.1.0/iso-images.html

and download:

     Rocks Base, HPC Roll, SGE Roll and the Grid Roll

then burn them all to CD.

then follow the directions at:

http://www.rocksclusters.org/rocks-documentation/3.1.0/install-frontend.html


but, before you get started, you should consult this page too:

http://rocks.npaci.edu/roll-documentation/grid/3.0/adding-the-roll.html


at the end of the process, your frontend will be configured with globus.

  - gb



From ramonjt at ucia.gov Mon Dec 29 09:08:45 2003
From: ramonjt at ucia.gov (ramonjt)
Date: Mon, 29 Dec 2003 12:08:45 -0500
Subject: [Rocks-Discuss]Rocks 3.1.0
Message-ID: <3FF05F9D.F6A122F@ucia.gov>

Folks,

   Which set of Rocks 3.1.0 downloads support Xeon Processors, "Pentium
and Athlon" or "Itanium"?

Thanks,
Ramon



From bruno at rocksclusters.org Mon Dec 29 09:31:56 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Mon, 29 Dec 2003 09:31:56 -0800
Subject: [Rocks-Discuss]Rocks 3.1.0
In-Reply-To: <3FF05F9D.F6A122F@ucia.gov>
References: <3FF05F9D.F6A122F@ucia.gov>
Message-ID: <E60E6664-3A24-11D8-9E96-000A95C4E3B4@rocksclusters.org>

>    Which set of Rocks 3.1.0 downloads support Xeon Processors, "Pentium
> and Athlon" or "Itanium"?

xeons are x86 processors -- so you want the ISO images found under the
section:

     Software for x86 (Pentium and Athlon)

  - gb
From landman at scalableinformatics.com Mon Dec 29 10:49:49 2003
From: landman at scalableinformatics.com (landman)
Date: Mon, 29 Dec 2003 13:49:49 -0500
Subject: [Rocks-Discuss]3.1.0 surprises
Message-ID: <20031229183225.M11961@scalableinformatics.com>

Pulled the distro. Burned it after checking md5's. Ok. Booted/installed test
cluster, completely vanilla, just defaults.

SSH is too slow. Wow. 5-10 seconds to log in.

Ok, now out at a customer site with the disks.

Unhappily discovered that the following are missing:

a) md (e.g. Software RAID): Just try to build one. Anaconda will happily let
you do this ... though it will die in the formatting stages. Dropping into the
shell (Alt-F2) and looking for the md module (lsmod) shows nothing. Insmod the
md also doesn't do anything. Catting /proc/devices shows no md as a character
or block device.

If md is really not there anymore, it should be removed from anaconda, just
like ...

b) ext3. There is no ext3 available for the install.

Also discovered how incredibly fragile anaconda is. In order to install, you
have to wipe the disks. It will not install if there is an md (software raid)
device, choosing instead to crap out after you have entered all the
information. To say that this is annoying is a slight understatement. This is
an anaconda issue, not a ROCKS issue, though as a result of this issue, ROCKS
is less functional than it could be.

I also noted that there is no xfs option. This means that I will need to hack
new kernels later on after the install. Moreover, I will also need to turn on
the ext3 journaling features later on (post install).
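For what it's worth, turning ext3 journaling on after the install is one
tune2fs call per filesystem. A sketch, demonstrated here on a loopback file
image so it can be run without touching a real disk (on a node you would
point tune2fs at the partition device, e.g. /dev/hda5, and flip the fstab
entry from ext2 to ext3):

```shell
#!/bin/sh
# Create a small ext2 image, then add a journal to make it ext3.
IMG=/tmp/ext3-demo.img
dd if=/dev/zero of="$IMG" bs=1024 count=16384 2>/dev/null
mke2fs -F -q "$IMG"             # plain ext2, no journal yet
tune2fs -j "$IMG" >/dev/null    # add the journal: filesystem is now ext3
tune2fs -l "$IMG" | grep 'Filesystem features'   # lists has_journal
```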

Hopefully 3.1.1 or 3.2 will fix some of these things.

Joe



--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615



From junkscarce at hotmail.com Mon Dec 29 15:15:52 2003
From: junkscarce at hotmail.com (Reed Scarce)
Date: Mon, 29 Dec 2003 23:15:52 +0000
Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
Message-ID: <BAY1-F661p8yTXBgtnm0000a4dd@hotmail.com>

Are there any examples of Rocks 2.3.2 extend-compute.xml scripts that work?
I need to know the limitations of the distribution. As far as I can tell
the commands are available (`which command` locates the commands fine) but
they don't necessarily perform the job as expected. I had seen the
`eval...` clarification in the archives.

As it stands I plan to mkdir, ln and echo in the extend-c... but then run
the heart of the customization (scripted) once the nodes are up. It just
doesn't seem to be what was intended.

As always, thanks for your help
--Reed

>From: "Mason J. Katz" <mjk at sdsc.edu>
>To: "Reed Scarce" <junkscarce at hotmail.com>
>CC: npaci-rocks-discussion at sdsc.edu
>Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
>Date: Fri, 26 Dec 2003 16:46:22 -0800
>
>Not sure if this answers your question. But..
>
>The <eval></eval> blocks are for code to be run on the kickstart server
>(the one that generates the kickstart file). Code outside of the eval
>blocks is run on the kickstarting host.
>
>     -mjk
>
>
>On Dec 26, 2003, at 3:35 PM, Reed Scarce wrote:
>
>>The line:
>>
>>chkconfig --level 3 gpm on
>>
>>works great from the command line, not in extend-compute.xml. Thanks for
>>the new tool though, always glad. The line above is in a block without
>><eval shell="bash"> tags. I'll keep trying and rtm. Is it possible this
>>is a 2.6.2 issue? The live environment restricts me from using a more
>>recent version.
>>
>>
>>>From: "Mason J. Katz" <mjk at sdsc.edu>
>>>To: "Reed Scarce" <junkscarce at hotmail.com>
>>>CC: npaci-rocks-discussion at sdsc.edu
>>>Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
>>>Date: Tue, 23 Dec 2003 16:35:13 -0800
>>>
>>>"man chkconfig"
>>>
>>>If you use chkconfig you do not need to create the rc*.d/* files and they
>>>are put in place for you.
>>>
>>>   -mjk
>>>
>>>On Dec 23, 2003, at 3:43 PM, Reed Scarce wrote:
>>>
>>>>Within /export/home/install/profiles/2.3.2/site-nodes extend-compute.xml
>>>>lies code like this commented code:
>>>><post>
>>>>/bin/mkdir /mnt/plc/ <-- works -->
>>>>/bin/mkdir /mnt/plc/plc_data <-- works -->
>>>>/bin/ln -s /mnt/plc_data /data1 <-- works -->
>>>>/bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to ln,
>>>>source exists -->
>>>></post>
>>>>
>>>>I don't understand why the ln to a directory succeeds but a ln to a
>>>>script fails.
>>>>
>>>>BTW, Dr. Landman, I've attempted to use your build.pl but it seems to
>>>>faill with:
>>>>Can't stat
>>>>`/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm .
>>>>(my note: the path ends at RPMS) I swear I thought I saw a solution to
>>>>this once but I can't find it again.
>>>>Upon reinstallation with the file your tool created
>>>>(/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm) anaconda
>>>>threw back the exception: Traceback (innermost last): file
>>>>"/usr/bin/anaconda.real", line 633, in ? intf.run(id, dispatch,
>>>>configFileData) File
>>>>"/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py", line 427
>>>>in run
>>>>ok save debug
>>>>
>>>>
>>>>TIA Reed Scarce
>>>>
>>>>_________________________________________________________________
>>>>Tired of slow downloads? Compare online deals from your local high-speed
>>>>providers now. https://broadband.msn.com
>>>
>>
>>_________________________________________________________________
>>Worried about inbox overload? Get MSN Extra Storage now!
>>http://join.msn.com/?PAGE=features/es
>

_________________________________________________________________
Make your home warm and cozy this winter with tips from MSN House & Home.
http://special.msn.com/home/warmhome.armx



From dlane at ap.stmarys.ca Mon Dec 29 15:44:23 2003
From: dlane at ap.stmarys.ca (Dave Lane)
Date: Mon, 29 Dec 2003 19:44:23 -0400
Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
In-Reply-To: <BAY1-F661p8yTXBgtnm0000a4dd@hotmail.com>
Message-ID: <5.2.0.9.0.20031229194312.01ed0f40@ap.stmarys.ca>

At 11:15 PM 12/29/2003 +0000, Reed Scarce wrote:
>Are there any examples of Rocks 2.3.2 extend-compute.xml scripts that work?

Reed,

Below is a script that worked fine for me (with 2.3.2). What it does should
be fairly self-explanatory... Dave
--->>>

<post>
          <!-- Insert your post installation script here. This
          code will be executed on the destination node after the
          packages have been installed. Typically configuration files
          are built and services setup in this section. -->

mv /usr/local /usr/local-old
ln -s /home/local /usr/local
ln -s /home/opt/intel /opt/intel
ln -s /home/disc15 /disc15
mkdir /scratch/tmp
chmod 1777 /scratch/tmp
echo '#!/bin/bash' > /etc/init.d/wait
echo 'sleep 60' >> /etc/init.d/wait
chmod +x /etc/init.d/wait
ln -s /etc/init.d/wait /etc/rc3.d/S11wait
ln -s /etc/init.d/wait /etc/rc4.d/S11wait
ln -s /etc/init.d/wait /etc/rc5.d/S11wait

          <eval sh="python">
                  <!-- This is python code that will be executed on
                  the frontend node during kickstart generation. You
                  may contact the database, make network queries, etc.
                  These sections are generally used to help build
                  more complex configuration files.
                  The 'sh' attribute may point to any language interpreter
                  such as "bash", "perl", "ruby", etc.
                  -->
          </eval>
</post>



From bruno at rocksclusters.org Mon Dec 29 19:03:25 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Mon, 29 Dec 2003 19:03:25 -0800
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <20031229183225.M11961@scalableinformatics.com>
References: <20031229183225.M11961@scalableinformatics.com>
Message-ID: <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org>

> Pulled the distro. Burned it after checking md5's.     Ok.
> Booted/installed test
> cluster, completely vanilla, just defaults.

i'm assuming this is an x86 installation, yes?

> SSH is too slow.   Wow.   5-10 seconds to log in.

that is not the case on our clusters. in fact, we tested this on all
three architectures and all three are 'fast'.

> Ok, now out at a customer site with the disks.
>
> Unhappily discovered that the following are missing:
>
>   a) md (e.g. Software RAID): Just try    to build one.   Anaconda will
>   happily let
>   you do this ... though it will die in   the formatting stages.   Dropping
>   into the
>   shell (Alt-F2) and looking for the md   module (lsmod) shows nothing.
>   Insmod the
>   md also doesn't do anything. Catting    /proc/devices shows no md as a
>   character
>   or block device.
>
>   If md is really not there anymore, it should be removed from anaconda,
>   just like ...
>
>   b) ext3.   There is no ext3 available for the install.
>
>   Also discovered how incredibly fragile anaconda is. In order to
>   install, you
>   have to wipe the disks. It will not install if there is an md
>   (software raid)
>   device, choosing instead to crap out after you have entered in all the
>   information. To say that this is annoying is a slight understatement.
>    This is
>   an anaconda issue, not a ROCKS issue, though as a result of this
>   issue, ROCKS is
>    less functional than it could be.

we'll look into the above two issues.

> I also noted that there is no xfs option.      This means that I will need
> to hack
> new kernels later on after the install.

just curious, is xfs offered as an option on other redhat supported
products?

also (and i'm assuming this will be no consolation to you, but it may
be to others), building a new kernel RPM is straightforward in rocks:

http://www.rocksclusters.org/rocks-documentation/3.1.0/customization-
kernel.html

    - gb



From landman at scalableinformatics.com Mon Dec 29 19:44:16 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Mon, 29 Dec 2003 22:44:16 -0500
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org>
References: <20031229183225.M11961@scalableinformatics.com>
       <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org>
Message-ID: <1072755856.4432.15.camel@protein.scalableinformatics.com>

On Mon, 2003-12-29 at 22:03, Greg Bruno wrote:
> > Pulled the distro. Burned it after checking md5's.        Ok.
> > Booted/installed test
> > cluster, completely vanilla, just defaults.
>
> i'm assuming this is an x86 installation, yes?

Yes.

>
> > SSH is too slow. Wow. 5-10 seconds to log in.
>
> that is not the case on our clusters. in fact, we tested this on all
> three architectures and all three are 'fast'.

Two different clusters exhibited the same results. I fixed one by
applying dnsmasq to it.
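Slow logins like this are often reverse-DNS lookups timing out, which
would also explain why dnsmasq helped. A possible workaround (untested
here; note the option name depends on the OpenSSH version, with older
releases calling it VerifyReverseMapping instead of UseDNS) is to turn
the lookup off in sshd on the nodes:

```
# /etc/ssh/sshd_config -- then restart sshd
UseDNS no
```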

>
>   >   Ok, now out at a customer site with the disks.
>   >
>   >   Unhappily discovered that the following are missing:
>   >
>   >   a) md (e.g. Software RAID): Just try    to build one.   Anaconda will
>   >   happily let
>   >   you do this ... though it will die in   the formatting stages.   Dropping
>   >   into the
>   >   shell (Alt-F2) and looking for the md   module (lsmod) shows nothing.
>   >   Insmod the
>   >   md also doesn't do anything. Catting    /proc/devices shows no md as a
>   >   character
>   >   or block device.
>   >
>   >   If md is really not there anymore, it should be removed from anaconda,
>   >   just like ...
>   >
>   >   b) ext3.   There is no ext3 available for the install.
>   >
>   >   Also discovered how incredibly fragile anaconda is. In order to
>   >   install, you
>   >   have to wipe the disks. It will not install if there is an md
>   >   (software raid)
>   >   device, choosing instead to crap out after you have entered in all the
>   >   information. To say that this is annoying is a slight understatement.
>   >    This is
>   >   an anaconda issue, not a ROCKS issue, though as a result of this
>   >   issue, ROCKS is
>   >    less functional than it could be.
>
>   we'll look into the above two issues.

Thanks

>
>   > I also noted that there is no xfs option.      This means that I will need
>   > to hack
>   > new kernels later on after the install.
>
>   just curious, is xfs offered as an option on other redhat supported
>   products?

Nope, nor is Redhat likely to do this in the near/mid term. This is
fairly common knowledge. All the other major distros do offer XFS.
I hope that the defense of the current state isn't that "Redhat doesn't
support it". I might have misunderstood you, but Redhat is almost
completely uninterested in clusters, so whether Redhat supports it is
really not relevant.

Curiously, cAos, which is doing some of the same things ROCKS does in
terms of recompiling packages sans Redhat trademarks, has XFS and a
number of other useful things in there.

Regardless, having ext2 or vfat as your only fs options simply is not
reasonable, as neither of these is really appropriate for very large
disks or big file systems.

>
>   also (and i'm assuming this will be no consolation to you, but it may
>   be to others), building a new kernel RPM is straightforward in rocks:
>
>   http://www.rocksclusters.org/rocks-documentation/3.1.0/customization-
>   kernel.html

I had been planning to use a similar approach to this. I was/am simply
quite surprised that the two options for ROCKS file systems are really
not very good, and the good choices are unavailable. In all fairness
this is more likely a constraint of anaconda than of ROCKS.

I fixed the ext2/ext3 by a reboot after a quick tune2fs session and some
fixup of the /etc/fstab.
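The tune2fs fix can be sketched roughly as follows (this is an outline
of the idea, not the exact commands used; the device name and fstab
line are placeholders):

```shell
#!/bin/sh
# Step 1 (comment only -- it must run against a real, ideally unmounted,
# ext2 partition): add a journal, which turns ext2 into ext3.
#   tune2fs -j /dev/hda2
# Step 2: update the filesystem type field in /etc/fstab so the next
# boot mounts it as ext3. Shown here on a sample line:
fstab_line='/dev/hda2  /  ext2  defaults  1 1'
echo "$fstab_line" | sed 's/ext2/ext3/'
```

After the reboot, mount should then report the filesystem as ext3.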

I have to say that I get less and less impressed with anaconda as time
goes on.

I fixed the partitioning problem (anaconda dies when it runs in an md'ed
set of partitions) by wiping the disk and using knoppix to fdisk the
disks. Autopartitioning is not an option, as the default choices are
not all that good (another anaconda-ism).




>
>    - gb



From cdwan at mail.ahc.umn.edu Mon Dec 29 20:58:20 2003
From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB))
Date: Mon, 29 Dec 2003 22:58:20 -0600 (CST)
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <1072755856.4432.15.camel@protein.scalableinformatics.com>
References: <20031229183225.M11961@scalableinformatics.com>
 <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org>
 <1072755856.4432.15.camel@protein.scalableinformatics.com>
Message-ID: <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu>

I also encountered the software RAID problem today. It made upgrading an
existing ROCKS cluster a little tricky.

Another behavior I noticed was that the CDs were not ejecting as the node
installs finished. It was manageable, but required watching to prevent the
endless reinstall cycle.

-Chris Dwan


From bruno at rocksclusters.org Mon Dec 29 21:48:22 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Mon, 29 Dec 2003 21:48:22 -0800
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu>
References: <20031229183225.M11961@scalableinformatics.com>
<BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org>
<1072755856.4432.15.camel@protein.scalableinformatics.com>
<Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu>
Message-ID: <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org>

>   Another behavior I noticed was that the CDs were not ejecting as the
>   node
>   installs finished. It was manageable, but required watching to prevent
>   the
>   endless reinstall cycle.

actually, it isn't a problem as the last CD in the frontend will be a
roll and rolls are not bootable.

    - gb



From cdwan at mail.ahc.umn.edu Mon Dec 29 21:51:13 2003
From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB))
Date: Mon, 29 Dec 2003 23:51:13 -0600 (CST)
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org>
References: <20031229183225.M11961@scalableinformatics.com>
 <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org>
 <1072755856.4432.15.camel@protein.scalableinformatics.com>
 <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu>
 <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org>
Message-ID: <Pine.GSO.4.58.0312292350001.4644@lenti.med.umn.edu>

>   >   Another behavior I noticed was that the CDs were not ejecting as the
>   >   node
>   >   installs finished. It was manageable, but required watching to prevent
>   >   the
>   >   endless reinstall cycle.
>
>   actually, it isn't a problem as the last CD in the frontend will be a
>   roll and rolls are not bootable.

You're right about the frontend. It was the compute nodes where it gave
me trouble. Roll disks never go in those.

-Chris Dwan


From landman at scalableinformatics.com Mon Dec 29 22:03:06 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 30 Dec 2003 01:03:06 -0500
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org>
References: <20031229183225.M11961@scalableinformatics.com>
       <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org>
       <1072755856.4432.15.camel@protein.scalableinformatics.com>
       <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu>
       <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org>
Message-ID: <1072764186.4469.16.camel@protein.scalableinformatics.com>

What I had noticed is that some CD hardware does not eject when
prompted to swap in the roll. I swapped hardware and that fixed it.
Rather odd. I've seen this in 3 different systems; it worked OK with
previous ROCKS.

Is it possible to do something like a

     frontend askmethod

akin to the "linux askmethod", and specifically have the ISOs online in
a directory somewhere? Just curious... I find it interesting that 10
years after swapping floppies for OS installs, I am now swapping CDs...
There is irony here somewhere.

On Tue, 2003-12-30 at 00:48, Greg Bruno wrote:
> > Another behavior I noticed was that the CDs were not ejecting as the
> > node
> > installs finished. It was manageable, but required watching to prevent
> > the
> > endless reinstall cycle.
>
> actually, it isn't a problem as the last CD in the frontend will be a
> roll and rolls are not bootable.
>
>   - gb



From bruno at rocksclusters.org Mon Dec 29 22:28:45 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Mon, 29 Dec 2003 22:28:45 -0800
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <1072764186.4469.16.camel@protein.scalableinformatics.com>
References: <20031229183225.M11961@scalableinformatics.com>
<BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org>
<1072755856.4432.15.camel@protein.scalableinformatics.com>
<Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu>
<C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org>
<1072764186.4469.16.camel@protein.scalableinformatics.com>
Message-ID: <6BB0542E-3A91-11D8-9E96-000A95C4E3B4@rocksclusters.org>

> Is it possible to do something like a
>
>     frontend askmethod
>
> akin to the "linux askmethod" and specifically have the ISO's online in
> a directory somewhere? Just curious...

the ability to install frontends remotely is at the top of our priority
list for the next release.
> I find it interesting that 10
> years after swapping floppies for OS installs, I am now swapping CDs...
> There is irony here somewhere.

sorry, i'm going to have to evangelize rolls a bit.

joe, do you not have just a bit of appreciation for rolls and what is
going on under the sheets? we now have a formal way for you, that's
right you, to augment the installation of a cluster. you get to
programmatically interact with the installer at virtually any level.
you get to tell the installer what bits you want it to lay down and how
to configure them. and this is done completely independently of the
core. the core has no idea of your bits, yet, it installs it and
configures it to your specification.

for you, this could be having the 'scalable informatics' roll that
contains all your RPMS and XML configuration files. this ISO image
could be completely proprietary, yet, the installer installs it. you
could ship your roll worldwide and every one of your customers would,
within 2 hours, have a scalable informatics cluster online running the
applications you sold them. and, you know it would be running because
you embedded the correct configuration into the roll.

or, perhaps rolls work so smoothly, it just looks like CD swapping. :-)

    - gb



From landman at scalableinformatics.com Mon Dec 29 22:50:30 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 30 Dec 2003 01:50:30 -0500
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <6BB0542E-3A91-11D8-9E96-000A95C4E3B4@rocksclusters.org>
References: <20031229183225.M11961@scalableinformatics.com>
       <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org>
       <1072755856.4432.15.camel@protein.scalableinformatics.com>
       <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu>
       <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org>
       <1072764186.4469.16.camel@protein.scalableinformatics.com>
       <6BB0542E-3A91-11D8-9E96-000A95C4E3B4@rocksclusters.org>
Message-ID: <1072767030.4463.57.camel@protein.scalableinformatics.com>

On Tue, 2003-12-30 at 01:28, Greg Bruno wrote:

>   > There is irony here somewhere.
>
>   sorry, i'm going to have to evangelize rolls a bit.
>
>   joe, do you not have just a bit of appreciation for rolls and what is
>   going on under the sheets? we now have a formal way for you, that's
>   right you, to augment the installation of a cluster. you get to
>   programmatically interact with the installer at virtually any level.
>   you get to tell the installer what bits you want it to lay down and how
>   to configure them. and this is done completely independently of the
>   core. the core has no idea of your bits, yet, it installs it and
>   configures it to your specification.
Actually I do have a pretty good appreciation for them. I see that they
are a different way of solving the problems I have been solving for a
while using "other methods"
(http://scalableinformatics.com/downloads/finishing/finishing-v3.1.0.tar.gz).   What
I don't see is how to build them (yes, I did see the "source" messages, and
"cvs", ...).

The major issue for me is going to be anaconda, all its joy and bugs,
and what directions its use forces ROCKS to follow (vis-a-vis file
systems, etc).

>
>   for you, this could be having the 'scalable informatics' roll that
>   contains all your RPMS and XML configuration files. this ISO image
>   could be completely proprietary, yet, the installer installs it. you
>   could ship your roll worldwide and every one of your customers would,
>   within 2 hours, have a scalable informatics cluster online running the
>   applications you sold them. and, you know it would be running because
>   you embedded the correct configuration into the roll.

This is a nice vision, though it is unfortunately only a vision. The
customer would have to re-install the cluster head node when a new
version of the bits comes out, right? This is simply not tenable for a
production-cycle facility that needs to upgrade a package. Please let
me know if my understanding is incorrect; I would be quite happy to
hear this.

The "other method" that I developed doesn't have this problem: just
re-install the compute nodes and load the RPM on the head nodes.
In fact I built some tools which simplify both the "other method" and
the ROCKS method. As I have to worry about multiple different cluster
distros (not just ROCKS; sorry, customers get what they need/want), I
have to worry about interfacing with each distro. So I have some tools
(the auto-build scripts) which simplify adding and removing packages in
extend-compute.xml.

What I am hoping for from rolls are two things: 1) insertable/removable
from a live cluster without forcing a re-install of the head node
(compute nodes, that's fine, not the head nodes); 2) simple
documentation on how to build them. If they are really quite simple, I
see no reason I could not make the same tool I use to automate the
building of installable RPMS for the other method emit a ROCKS roll as
well. But I need to know how to do this. I am not sure I have
sufficient time to "read the source, Luke" for this. I would be happy
to do it given time and customer demand/need. The other method had
that, hence its development.

>
> or, perhaps rolls work so smoothly, it just looks like CD swapping. :-)

My point was that after inserting the SGE roll, I had to get up from the
console, walk over to the unit, swap in the next roll, iterate....

Felt like CD swapping to me.

Rolls won't solve other problems which are anaconda-specific (file
systems, partitioning, formatting, RAID, network detection, etc.). As
there are multiple similar RHEL de-redhatifying efforts, some of which
are drastically improving the installation process (by not using
anaconda), are you folks looking to move away from anaconda any time
soon?

>
>    - gb
--



From bruno at rocksclusters.org Mon Dec 29 23:45:52 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Mon, 29 Dec 2003 23:45:52 -0800
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <1072767030.4463.57.camel@protein.scalableinformatics.com>
References: <20031229183225.M11961@scalableinformatics.com>
<BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org>
<1072755856.4432.15.camel@protein.scalableinformatics.com>
<Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu>
<C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org>
<1072764186.4469.16.camel@protein.scalableinformatics.com>
<6BB0542E-3A91-11D8-9E96-000A95C4E3B4@rocksclusters.org>
<1072767030.4463.57.camel@protein.scalableinformatics.com>
Message-ID: <31636F66-3A9C-11D8-9E96-000A95C4E3B4@rocksclusters.org>

>   This is a nice vision, though it is unfortunately a vision. The
>   customer would have to re-install the cluster head node when a new
>   version of the bits comes out. Right? This is simply not tenable for a
>   production cycle facility that needs to upgrade a package. Please let
>   me know if my understanding is incorrect, I would be quite happy to
>   hear
>   this.

we've talked about this on the list and we've talked with you about
this in person. you know the above statement is true. you also know it
is a future direction for rolls.

>   What I am hoping for rolls are two things: 1) insertable/removable from
>   a live cluster without forcing a re-install of the head node (compute
>   nodes, thats fine, not the head nodes) 2) simple documentation on how
>   to
>   build. If they are really quite simple, I see no reason I could not
>   take the same tool I use to automate the building of installable RPMS
>   for the other method actually emit a ROCKS roll. But I need to know
>   how
>   to do this. I am not sure I have sufficient time to "read the source,
>   Luke" for this. I would be happy to do this given time, and customer
>   demand/need. The other method had that, hence its development.

a roll developer's guide is in progress. and, as stated above, adding
rolls to a live frontend is on our roadmap.

> Rolls wont solve other problems which are anaconda specific (file
> systems, partitioning, formatting, RAID, network detection, etc).

not true. if you wish to get deeply involved with the red hat
installer, you can develop a 'patch' roll that will change the
installer to do as you wish.

>   As
> there are multiple similar RHEL de-redhatifying efforts, some of which
> are drastically improving the installation process (by not using
> anaconda), are you folks looking to move away from anaconda any time
> soon?

please educate us -- where can we download these installers and find
the developer guides that describe how to interact with the installer.

as for moving away from anaconda, i don't think that will happen
anytime soon. anaconda has served us well. we have all had issues with
the installer, but i would rather work with anaconda rather than
reinvent it. the boys and girls at redhat have a vested interest in
detecting and configuring the latest hardware and i plan on leveraging
that.

of the issues you mention above, the only one we don't know how to
control yet is file system selection (but, we will look into it per
your earlier request). we already manipulate anaconda to partition and
format the drives to our specifications, and we have ideas on how to
handle RAID and network naming (which is what i think you mean by
network detection).

 - gb



From landman at scalableinformatics.com Tue Dec 30 00:55:37 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 30 Dec 2003 03:55:37 -0500
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <31636F66-3A9C-11D8-9E96-000A95C4E3B4@rocksclusters.org>
References: <20031229183225.M11961@scalableinformatics.com>
       <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org>
       <1072755856.4432.15.camel@protein.scalableinformatics.com>
       <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu>
       <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org>
       <1072764186.4469.16.camel@protein.scalableinformatics.com>
       <6BB0542E-3A91-11D8-9E96-000A95C4E3B4@rocksclusters.org>
       <1072767030.4463.57.camel@protein.scalableinformatics.com>
       <31636F66-3A9C-11D8-9E96-000A95C4E3B4@rocksclusters.org>
Message-ID: <1072774537.4463.131.camel@protein.scalableinformatics.com>

On Tue, 2003-12-30 at 02:45, Greg Bruno wrote:
> > This is a nice vision, though it is unfortunately a vision. The
> > customer would have to re-install the cluster head node when a new
> > version of the bits comes out. Right? This is simply not tenable for a
> > production cycle facility that needs to upgrade a package. Please let
> > me know if my understanding is incorrect, I would be quite happy to
> > hear
> > this.
>
> we've talked about this on the list and we've talked with you about
> this in person. you know the above statement is true. you also know it
> is a future direction for rolls.

I was simply responding to the evangelism, which seemed to imply the
functionality existed today. It doesn't, and we both agree that it is
necessary. Although the vision will provide innumerable benefits,
ROCKS is not there yet, and won't be for a while.
That's OK though, as I have a reasonable workaround for some of these
issues. And when I can insert and delete rolls live on a cluster,
I'll modify my tools to emit rolls. Until then it is, as you said, a
vision for the future.

[...]

> a roll developer's guide is in progress. and, as stated above, adding
> rolls to a live frontend is on our roadmap.

Adding and removing are needed as we have discussed.

>
>   > Rolls wont solve other problems which are anaconda specific (file
>   > systems, partitioning, formatting, RAID, network detection, etc).
>
>   not true. if you wish to get deeply involved with the red hat
>   installer, you can develop a 'patch' roll that will change the
>   installer to do as you wish.

I guess I am at a loss to understand what it is you are doing, then. If
you are telling me I can hack around anaconda to my heart's content, why
do you tell me later on that ROCKS is deeply wedded to anaconda and will
not change soon? I will assume I am missing something here. Can I
replace anaconda? This is what I think you are saying. If you are
instead saying no, don't replace it, just hack it, I am not sure I want
to do that. It is a very large and complex beast, with one system doing
the job of many. Jack of all trades.

More than half of the pain I have experienced deploying ROCKS is
directly attributable to anaconda. I would like to work around it. If
I can completely replace it under ROCKS this could be of interest. If I
cannot, and ROCKS will always remain closely tied to RedHat specific
technology (e.g. anaconda), that is also important to know.

>
>   >     As
>   >   there are multiple similar RHEL de-redhatifying efforts, some of which
>   >   are drastically improving the installation process (by not using
>   >   anaconda), are you folks looking to move away from anaconda any time
>   >   soon?
>
>   please educate us -- where can we download these installers and find
>   the developer guides that describe how to interact with the installer.

If you are serious about this, I would be happy to help you find more
development info and help make introductions to some of the people doing
this stuff. If you are not serious about this, that's fine too.

>   as for moving away from anaconda, i don't think that will happen
>   anytime soon. anaconda has served us well. we have all had issues with
>   the installer, but i would rather work with anaconda rather than
>   reinvent it. the boys and girls at redhat have a vested interest in
>   detecting and configuring the latest hardware and i plan on leveraging
>   that.

Knoppix makes good use of the anaconda detection routines without using
anaconda. You do not need anaconda in its entirety for the detection
routines.
While Redhat has a vested interest in making sure it detects hardware
well, the software that does its installation has been getting more and
more fragile compared to other installation systems. Simple failures of
one item or another in the SUSE YaST tool, the Mandrake installer, or
for that matter most of the non-anaconda-based installers do not force
you to start over from the beginning. Stack traces are not given, and
you are not asked to debug an arcane and complex python program from a
highly limited command window. You are brought back to a well-known and
well-defined state, and you have a finite and nonzero chance of
recovering from the failure. This is different from the anaconda
experience, where the slightest hiccup, which would be trivially
correctable given the opportunity, results in a complete failure of the
process.

This is how we discovered the RH9/RHEL fragility and sensitivity around
(and inability to handle) software RAID, partitioning, and related
tasks. It has wasted many hours of our collective time and made the
upgrade option unusable for those of us with software RAID systems.

As ROCKS depends critically upon this bit of technology that you
indicate later on is so important, ROCKS happens to share in its
pitfalls, even though these are not ROCKS problems. I am not sure you
understand how much time I have to spend explaining to customers and
end users why what they are seeing are not ROCKS problems but Redhat
artifacts. Part of the reason I am raising this issue in this forum is
that I have spent altogether too much time trying to explain this to
various users.

>   of the issues you mention above, the only one we don't know how to
>   control yet is file system selection (but, we will look into it per
>   your earlier request). we already manipulate anaconda to partition and
>   format the drives to our specifications, and we have ideas on how to
>   handle RAID and network naming (which is what i think you mean by
>   network detection).

Network detection is

a) getting the right network driver config
      1) by detection
      2) from floppy/usb/whatever

b) getting the correct network interface ordering (what you call naming)

The point you (somewhat whimsically) made was that I could create
Scalable Informatics rolls and ship them around the world for people to
use in 2 hours. Great. Good vision, and that is something like what I
am looking at. I have that now with my tools, but I can always expand
their functionality.

Now the problem is, if after shipping out my roll, when my end users
install it, anaconda barfs in some new and exciting manner (has happened
already with the finishing scripts, and I have worked hard to try to
figure out what is broken in anaconda to work around its bugs), who are
the customers going to blame?

My experience thus far is that ROCKS is taking more than its fair share
of heat over bugs that it has nothing to do with.


From fds at sdsc.edu Tue Dec 30 05:53:48 2003
From: fds at sdsc.edu (fds at sdsc.edu)
Date: Tue, 30 Dec 2003 05:53:48 -0800 (PST)
Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
In-Reply-To: <BAY1-F661p8yTXBgtnm0000a4dd@hotmail.com>
References: <BAY1-F661p8yTXBgtnm0000a4dd@hotmail.com>
Message-ID: <1291.194.125.171.53.1072792428.squirrel@uhura.sdsc.edu>

Code in the <post> section of an xml file (extend-compute or otherwise)
can be almost anything. When the script is run, the environment is not as
full as usual, which is why we always recommend specifying the full path
to commands. As you saw, /bin and /usr/bin are in the path, so certain
things like "which sed" will work, for example.

Remember that everything in the eval tags gets run at kickstart-generation
time (on the frontend). Everything else (the naked commands in the post
section) is run by the node being installed.
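To make the timing concrete, here is a minimal sketch in the style of
extend-compute.xml (the paths and commands are illustrative only):

```
<post>
/bin/mkdir -p /scratch/tmp
/bin/chmod 1777 /scratch/tmp

        <eval sh="bash">
                <!-- runs on the frontend at kickstart-generation
                time; typically a query whose result helps build a
                configuration file -->
                hostname
        </eval>
</post>
```

The naked mkdir/chmod lines run on the node being installed; only the
eval body runs on the frontend.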

We do intend for the heart of the customization to be performed at
kickstart time. I would be surprised if you had to postpone many tasks
until the node was up, although this does happen occasionally. The
globus and condor post-configuration contains tasks that cannot be done
at install time.

Send us the scripts in question and we will take a look.

-Federico

>   Are there any examples of Rocks 2.3.2 extend-compute.xml scripts that
>   work?
>   I need to know the limitations of the distribution. As far as I can tell
>   the commands are available (`which command`locates the commands fine) but
>   they don't necessarily perform the job as expected. I had seen the
>   `eval...` clairification in the archives.
>
>   As it stands I plan to mkdir, ln and echo in the extend-c... but then run
>   the heart of the customization (scripted) once the nodes are up. It just
>   doesn't seem to be what was intended.
>
>   As always, thanks for your help
>   --Reed
>



From purikk at hotmail.com Tue Dec 30 06:03:02 2003
From: purikk at hotmail.com (Purushotham Komaravolu)
Date: Tue, 30 Dec 2003 09:03:02 -0500
Subject: [Rocks-Discuss]Licensing
References: <200312300711.hBU7BeJ14002@postal.sdsc.edu>
Message-ID: <BAY1-DAV14HJL2WZcXm0000fc27@hotmail.com>

Hi All,
          I would like to know the list of the components that have to
be licensed when we install ROCKS as a commercial solution.
Thanks
Happy Holidays
Puru


From doug at seismo.berkeley.edu Tue Dec 30 10:53:36 2003
From: doug at seismo.berkeley.edu (Doug Neuhauser)
Date: Tue, 30 Dec 2003 10:53:36 -0800 (PST)
Subject: [Rocks-Discuss]Rocks 3.1.0 install problems
Message-ID: <200312301853.hBUIragp015469@perry.geo.berkeley.edu>

I am having a problem upgrading Rocks 2.3.2 to 3.1.0.
Both my head node and compute nodes are dual XEON 2.4 GHz boxes.

We burned the CDs from the following images:
      rocks-base-3.1.0.i386.iso
      roll-hpc-3.1.0-0.i386.iso
      roll-grid-3.1.0-0.any.iso
      roll-intel-3.1.0-0.any.iso
      roll-sge-3.1.0-0.any.iso
I verified the md5s both on the downloaded images from the rocks
web site and the md5s on the burned cds. They are fine.
I have run the upgrade several times -- at least once with all of the
rolls, and once with just the rocks base and hpc roll.
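The checksum step can be scripted so every image is checked the same
way; a small sketch (a throwaway file stands in for an ISO here so the
commands run anywhere; in real use the file would be the .iso image or
the CD device):

```shell
#!/bin/sh
# Print the md5 checksum of the image, to be compared by eye (or with
# md5sum -c) against the published value.
printf 'pretend iso contents' > /tmp/iso-check-demo
md5sum /tmp/iso-check-demo | awk '{print $1}'
```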

The head node installs with no problem using the command
      frontend upgrade
I can login and run insert-ethers, telling it to look for compute nodes.

When I power on a compute node, it boots grub, selects the only
kernel on its local disk
      Rocks Reinstall
and runs through the /sbin/loader.
The blue screen comes up, the compute node requests and receives a
dynamic IP address from the head node, but then within a few seconds
aborts with the messages:
      install exited abnormally - received signal 11
      sending termination signals ... done
      sending kill signals ... done
      disabling swap ...
      unmounting filesystems ...
      /proc/bus/usb done
      /proc done
      /dev/pts done
      You may safely reboot your system

It appears that the "Rocks Reinstall" kernel on the disk is not compatible
with Rocks 3.1.0. When I changed the compute node boot order to perform
a PXE boot before the hard disks, it properly downloads the 3.1.0 kernel
from the head node, reformats the disk, and installs 3.1.0 properly.
I have to catch it in the reboot, and change the boot order to use the
disk before PXE, or I get into an infinite loop.

Is there any better way to address this problem? The procedure of:
      set PXE boot first
      boot from net, install rocks 3.1.0 on disk
      reboot
      catch node during reboot, change boot order to floppy,disk,net
      reboot
for each node is tedious.

Did I do something wrong in how I shut my 2.3.2 cluster down before the
upgrade? If so, some notes about this in the install instructions would
be useful.

- Doug N

------------------------------------------------------------------------
Doug Neuhauser                University of California, Berkeley
doug at seismo.berkeley.edu   Berkeley Seismological Laboratory
Phone: 510-642-0931           215 McCone Hall # 4760
Fax:   510-643-5811           Berkeley, CA 94720-4760



From bruno at rocksclusters.org Tue Dec 30 11:29:14 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Tue, 30 Dec 2003 11:29:14 -0800
Subject: [Rocks-Discuss]Rocks 3.1.0 install problems
In-Reply-To: <200312301853.hBUIragp015469@perry.geo.berkeley.edu>
References: <200312301853.hBUIragp015469@perry.geo.berkeley.edu>
Message-ID: <73E3933E-3AFE-11D8-9E96-000A95C4E3B4@rocksclusters.org>

On Dec 30, 2003, at 10:53 AM, Doug Neuhauser wrote:

>   I am having a problem upgrading Rocks 2.3.2 to 3.1.0.
>   Both my head node and compute nodes are dual XEON 2.4 GHz boxes.
>
>   We burned the CDs from the following images:
>       rocks-base-3.1.0.i386.iso
>       roll-hpc-3.1.0-0.i386.iso
>       roll-grid-3.1.0-0.any.iso
>       roll-intel-3.1.0-0.any.iso
>       roll-sge-3.1.0-0.any.iso
>   I verified the md5s both on the downloaded images from the rocks
>   web site and the md5s on the burned cds. They are fine.
>   I have run the upgrade several times -- at least once with all of the
>   rolls, and once with nust the rocks base and hpc roll.
>
>   I head node installs with no problem using the command
>       frontend upgrade
>   I can login and run insert-ethers, telling it to look for compute
>   nodes.
>
>   When I power on a compute node, it boots grub, selects the only
>   kernel on its local disk
>       Rocks Reinstall
>   and runs through the /sbin/loader.
>   The blue screen comes up, the compute node requests and receives a
>   dynamic IP address from the head node, but then within a few seconds
>   aborts with the messages:
>       install exited abnormally - received signal 11
>       sending termination signals ... done
>       sending kill signals ... done
>       disabling swap ...
>       unmounting filesystems ...
>       /proc/bus/usb done
>      /proc done
>      /dev/pts done
>      You may safely reboot your system
>
>   It appears the the "Rocks Reinstall" kernel on the disk is not
>   compatible
>   with Rocks 3.1.0. When I changed the compute node boot order to
>   perform
>   a PXE boot before the hard disks, it properly downloads the 3.1.0
>   kernel
>   from the head node, reformats the disk, and installes 3.1.0 properly.
>   I have to catch it in the reboot, and change the boot order to use the
>   disk before PXE, or I get into an infinite loop.
>
>   Is there any better way to address this problem? The procedure of:
>       set PXE boot first
>       boot from net, install rocks 3.1.0 on disk
>       reboot
>       catch node during reboot, change boot order to floppy,disk,net
>       reboot
>   for each node is tedious.
>
>   Did I do something wrong in how I shut my 2.3.2 cluster down before the
>   upgrade? If so, some notes about this in the install instructions
>   would
>   be useful.

you're right, the 2.3.2 installer (anaconda from redhat's version 7.3) is
not compatible with the installer on rocks 3.1 (anaconda from redhat's
enterprise linux 3.0).

there are two ways to reinstall your cluster:

1) if your compute nodes support PXE boot triggered from the keyboard
-- that is, when you boot the node, the BIOS shows a message like
"Press F12 for Network Boot (PXE)". if your nodes have that, then
you'll have to boot the nodes one by one and, when you see the
message, press the F12 key, then move to the next node.

2) use the rocks base CD to boot each compute node. when insert-ethers
reports that it discovered the node, take the CD out and put it in the
next compute node.


but, if your compute nodes were initially installed with PXE, the
fastest way to upgrade the compute nodes is to simply turn all the
compute nodes off, upgrade the frontend, run insert-ethers, then turn
the compute nodes on one by one. the compute nodes should be set for
PXE boot, which will pull the updated installer from the frontend.

as you state above, we need to document this.

thanks for the bug report.

    - gb
From doug at seismo.berkeley.edu Tue Dec 30 11:45:59 2003
From: doug at seismo.berkeley.edu (Doug Neuhauser)
Date: Tue, 30 Dec 2003 11:45:59 -0800 (PST)
Subject: [Rocks-Discuss]Rocks 3.1.0 install problems
Message-ID: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu>

Greg,

1.   I don't have cdroms on my compute nodes, only floppy. :(
2.   My boot order on the compute nodes is normally:
       floppy, disk, PXE
3.   I don't have a hot-key override to force PXE boot.
     I have to change the BIOS boot order to enable PXE boot.

>   but, if your compute nodes were initially installed with PXE, the
>   fastest way to upgrade the compute nodes is to simply turn all the
>   compute nodes off, upgrade the frontend, run insert-ethers, then turn
>   the compute nodes on one by one. the compute nodes should be set for
>   PXE boot which will pull the installer from the frontend and therefore
>   be updated installer.

I don't understand this.

I can't leave the compute nodes with PXE boot first, or it will create an
endless loop. The compute node will boot via PXE, install rocks 3.1.0,
and then reboot via PXE and repeat the process ad-nauseum.

Can I use the old floppy boot image found at:
      ftp://rocksclusters.org/pub/rocks/current/i386/bootnet.img
to force a network boot?

The 3.1.0 online manual has a link in the section
      1.3 Install your Compute Nodes
to    ftp://www.rocksclusters.org/pub/rocks/bootnet.img
but this does not exist.

- Doug N
------------------------------------------------------------------------
Doug Neuhauser                University of California, Berkeley
doug at seismo.berkeley.edu   Berkeley Seismological Laboratory
Phone: 510-642-0931           215 McCone Hall # 4760
Fax:   510-643-5811           Berkeley, CA 94720-4760



From junkscarce at hotmail.com Tue Dec 30 11:57:16 2003
From: junkscarce at hotmail.com (Reed Scarce)
Date: Tue, 30 Dec 2003 19:57:16 +0000
Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
Message-ID: <BAY1-F39DSuCcN0o41B0005872e@hotmail.com>

I tested your echo ... wait and ln wait ... S11wait lines. They worked
perfectly. Then I tried the same with gpm and left wait in the script.
Wait worked as before, and gpm didn't work - like before. I've given up on
doing anything very fancy and have started to make a script to run the first
time it boots, with hand removal.

Thanks for the perspective,
--Reed
>From: Dave Lane <dlane at ap.stmarys.ca>
>To: "Reed Scarce" <junkscarce at hotmail.com>
>CC: npaci-rocks-discussion at sdsc.edu
>Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
>Date: Mon, 29 Dec 2003 19:44:23 -0400
>
>At 11:15 PM 12/29/2003 +0000, Reed Scarce wrote:
>>Are there any examples of Rocks 2.3.2 extend-compute.xml scripts that
>>work?
>
>Reed,
>
>Below is a script that worked fine for me (with 2.3.2). What it does should
>be fairly explanatory...Dave
>
>--->>>
>
><post>
>         <!-- Insert your post installation script here. This
>         code will be executed on the destination node after the
>         packages have been installed. Typically configuration files
>         are built and services setup in this section. -->
>
>mv /usr/local /usr/local-old
>ln -s /home/local /usr/local
>ln -s /home/opt/intel /opt/intel
>ln -s /home/disc15 /disc15
>mkdir /scratch/tmp
>chmod 1777 /scratch/tmp
>echo '#!/bin/bash' > /etc/init.d/wait
>echo 'sleep 60' >> /etc/init.d/wait
>chmod +x /etc/init.d/wait
>ln -s /etc/init.d/wait /etc/rc3.d/S11wait
>ln -s /etc/init.d/wait /etc/rc4.d/S11wait
>ln -s /etc/init.d/wait /etc/rc5.d/S11wait
>
>         <eval sh="python">
>                 <!-- This is python code that will be executed on
>                 the frontend node during kickstart generation. You
>                 may contact the database, make network queries, etc.
>                 These sections are generally used to help build
>                 more complex configuration files.
>                 The 'sh' attribute may point to any language interpreter
>                 such as "bash", "perl", "ruby", etc.
>                 -->
>         </eval>
></post>
>




From landman at scalableinformatics.com Tue Dec 30 12:01:44 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 30 Dec 2003 15:01:44 -0500
Subject: [Rocks-Discuss]Rocks 3.1.0 install problems
In-Reply-To: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu>
References: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu>
Message-ID: <1072814503.4469.196.camel@protein.scalableinformatics.com>

Hi Doug:

  As long as pxe is in there, you should be able to do this
(semi)-automatically. All you need to do is to wipe the partition
tables and boot sectors of the compute nodes. I seem to remember a
really simple single-floppy tool that did this.

  See http://paud.sourceforge.net/
and http://dban.sourceforge.net/

  I think dban is the right one. After that (only on compute nodes) you
should be able to pxe boot.

Joe

On Tue, 2003-12-30 at 14:45, Doug Neuhauser wrote:
> Greg,
>
> 1. I don't have cdroms on my compute nodes, only floppy. :(
> 2. My boot order on the compute nodes is normally:
>     floppy, disk, PXE
> 3. I don't have a hot-key override to force PXE boot.
>      I have to change the BIOS boot order to enable PXE boot.
>
> > but, if your compute nodes were initially installed with PXE, the
> > fastest way to upgrade the compute nodes is to simply turn all the
> > compute nodes off, upgrade the frontend, run insert-ethers, then turn
> > the compute nodes on one by one. the compute nodes should be set for
> > PXE boot which will pull the installer from the frontend and therefore
> > be updated installer.
>
> I don't understand this.
>
> I can't leave the compute nodes with PXE boot first, or it will create an
> endless loop. The compute node will boot via PXE, install rocks 3.1.0,
> and then reboot via PXE and repeat the process ad-nauseum.
>
> Can I use the old floppy boot image found at:
>     ftp://rocksclusters.org/pub/rocks/current/i386/bootnet.img
> to force a network boot?
>
> The 3.1.0 online manual has a link in the section
>     1.3 Install your Compute Nodes
> to ftp://www.rocksclusters.org/pub/rocks/bootnet.img
> but this does not exist.
>
> - Doug N
> ------------------------------------------------------------------------
> Doug Neuhauser               University of California, Berkeley
> doug at seismo.berkeley.edu        Berkeley Seismological Laboratory
> Phone: 510-642-0931          215 McCone Hall # 4760
> Fax:    510-643-5811         Berkeley, CA 94720-4760
From bruno at rocksclusters.org Tue Dec 30 12:07:34 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Tue, 30 Dec 2003 12:07:34 -0800
Subject: [Rocks-Discuss]Rocks 3.1.0 install problems
In-Reply-To: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu>
References: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu>
Message-ID: <CEAFCC25-3B03-11D8-9E96-000A95C4E3B4@rocksclusters.org>

On Dec 30, 2003, at 11:45 AM, Doug Neuhauser wrote:

> Greg,
>
> 1. I don't have cdroms on my compute nodes, only floppy. :(
> 2. My boot order on the compute nodes is normally:
>     floppy, disk, PXE
> 3. I don't have a hot-key override to force PXE boot.
>     I have to change the BIOS boot order to enable PXE boot.
>
>> but, if your compute nodes were initially installed with PXE, the
>> fastest way to upgrade the compute nodes is to simply turn all the
>> compute nodes off, upgrade the frontend, run insert-ethers, then turn
>> the compute nodes on one by one. the compute nodes should be set for
>> PXE boot which will pull the installer from the frontend and therefore
>> be updated installer.
>
> I don't understand this.

i'll try to give a better explanation.

when compute nodes are installed via PXE, rocks detects this and
manipulates the boot sector of the disk drive on the compute node in a
way that makes the disk non-bootable. that way, if the compute node is
reset, it will try to PXE boot. it will PXE boot even if your boot
order is: hard disk, cd/floppy, PXE. this occurs because the hard disk
is non-bootable, so the BIOS boot loader will skip the hard disk and
move on to the other boot devices.
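[Editor's sketch of the mechanism described above. A BIOS only treats a disk as bootable if its MBR ends with the 0x55 0xAA boot signature; clearing those two bytes makes the BIOS skip the disk and fall through to the next boot device, such as PXE. This is demonstrated on a scratch image file, not a real disk, and the exact manipulation Rocks performs may differ in detail.]

```shell
# Demonstrate the "non-bootable disk" trick on a scratch image file.
img=/tmp/mbr-demo.img

# build a fake 512-byte MBR carrying the 0x55 0xAA signature at offset 510
dd if=/dev/zero of="$img" bs=512 count=1 2>/dev/null
printf '\125\252' | dd of="$img" bs=1 seek=510 conv=notrunc 2>/dev/null

# "un-boot" the disk: zero just the two signature bytes
dd if=/dev/zero of="$img" bs=1 seek=510 count=2 conv=notrunc 2>/dev/null

# inspect the signature bytes; both are now zero, so a BIOS would
# skip this disk and try the next boot device
od -An -tx1 -j 510 -N 2 "$img"
```

Tim Carlson's `dd if=/dev/zero of=/dev/hda bs=1k count=512` suggestion later in this thread takes the blunter route of wiping the first 512 KB of the real disk outright.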

>   I can't leave the compute nodes with PXE boot first, or it will create
>   an
>   endless loop. The compute node will boot via PXE, install rocks 3.1.0,
>   and then reboot via PXE and repeat the process ad-nauseum.
>
>   Can I use the old floppy boot image found at:
>       ftp://rocksclusters.org/pub/rocks/current/i386/bootnet.img
>   to force a network boot?
>
>   The 3.1.0 online manual has a link in the section
>       1.3 Install your Compute Nodes
>   to ftp://www.rocksclusters.org/pub/rocks/bootnet.img
>   but this does not exist.

we are no longer supporting the boot floppy as it was problematic to
make one that contained the appropriate device drivers that worked on
most compute nodes.

    - gb
From doug at seismo.berkeley.edu Tue Dec 30 12:28:46 2003
From: doug at seismo.berkeley.edu (Doug Neuhauser)
Date: Tue, 30 Dec 2003 12:28:46 -0800 (PST)
Subject: [Rocks-Discuss]Rocks 3.1.0 install problems
Message-ID: <200312302028.hBUKSkgp017318@perry.geo.berkeley.edu>

Greg,

Thanks for the detailed boot/reboot explanation. My problem dates
back to my initial rocks 2.3.2 installation. My compute node
motherboards have 3 ethernet interfaces (1 100Mb, 2 1Gb), but initially
only the 100Mb interface supported PXE. When I used that for PXE boot,
Linux would then remap the interfaces so that it tried to use one of the
Gbit interfaces on the next reboot. Needless to say, the head node did
not respond to DHCP because the MAC address was unknown to it.

My solution was to get a new BIOS from Tyan that supported PXE on
all interfaces. However, since my cluster was initially installed using
the boot floppy, my compute nodes have the vestiges of floppy boot config,
not PXE boot config.

I'll try Joe Landman's suggestion of a scrub floppy to scrub the boot
sector of the boot disk on the compute nodes. If I can't do that, I
CAN go through the manual process of setting and resetting the boot
order on each compute node, but it is a slow and sequential process.

- Doug N
------------------------------------------------------------------------
Doug Neuhauser                University of California, Berkeley
doug at seismo.berkeley.edu   Berkeley Seismological Laboratory
Phone: 510-642-0931           215 McCone Hall # 4760
Fax:   510-643-5811           Berkeley, CA 94720-4760



From sjenks at uci.edu Tue Dec 30 12:37:26 2003
From: sjenks at uci.edu (Stephen Jenks)
Date: Tue, 30 Dec 2003 12:37:26 -0800
Subject: [Rocks-Discuss]Rocks 3.1.0 install problems
In-Reply-To: <CEAFCC25-3B03-11D8-9E96-000A95C4E3B4@rocksclusters.org>
References: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu>
<CEAFCC25-3B03-11D8-9E96-000A95C4E3B4@rocksclusters.org>
Message-ID: <FAA5CF63-3B07-11D8-BF62-000A95B96C68@uci.edu>

On Dec 30, 2003, at 12:07 PM, Greg Bruno wrote:
> when compute nodes are installed via PXE, rocks detects this and
> manipulates the boot sector of the disk drive on the compute node that
> makes the disk non-bootable. that way, if the compute node is reset,
> it will try to PXE boot. it will PXE boot even if your boot order is:
> hard disk, cd/floppy, PXE. this occurs because the hard disk is
> non-bootable so the BIOS boot loader will skip the hard disk and move
> on to the other boot devices.

Hi Greg, et al.

Is there any way to force this behavior even if I initially used a CD
to install the compute nodes? My nodes are capable of PXE boot, but
since I didn't use that, I presume they didn't do the non-bootable disk
trick upon install. Now that I'm clear about how the PXE install works,
I'd prefer to move to that, but don't really want to have to corrupt
the disks to cause the PXE install.

The nodes are currently loaded with 3.0, so perhaps that will work with
3.1's kickstart, but I'm curious about the PXE issue.

Thanks,

Steve Jenks



From bruno at rocksclusters.org Tue Dec 30 12:48:08 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Tue, 30 Dec 2003 12:48:08 -0800
Subject: [Rocks-Discuss]Rocks 3.1.0 install problems
In-Reply-To: <FAA5CF63-3B07-11D8-BF62-000A95B96C68@uci.edu>
References: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu>
<CEAFCC25-3B03-11D8-9E96-000A95C4E3B4@rocksclusters.org> <FAA5CF63-3B07-11D8-
BF62-000A95B96C68@uci.edu>
Message-ID: <796DAC35-3B09-11D8-9E96-000A95C4E3B4@rocksclusters.org>

On Dec 30, 2003, at 12:37 PM, Stephen Jenks wrote:

>
> On Dec 30, 2003, at 12:07 PM, Greg Bruno wrote:
>> when compute nodes are installed via PXE, rocks detects this and
>> manipulates the boot sector of the disk drive on the compute node
>> that makes the disk non-bootable. that way, if the compute node is
>> reset, it will try to PXE boot. it will PXE boot even if your boot
>> order is: hard disk, cd/floppy, PXE. this occurs because the hard
>> disk is non-bootable so the BIOS boot loader will skip the hard disk
>> and move on to the other boot devices.
>
> Hi Greg, et al.
>
> Is there any way to force this behavior even if I initially used a CD
> to install the compute nodes? My nodes are capable of PXE boot, but
> since I didn't use that, I presume they didn't do the non-bootable
> disk trick upon install. Now that I'm clear about how the PXE install
> works, I'd prefer to move to that, but don't really want to have to
> corrupt the disks to cause the PXE install.
>
> The nodes are currently loaded with 3.0, so perhaps that will work
> with 3.1's kickstart, but I'm curious about the PXE issue.

3.0 is based on redhat 7.3 and 3.1 is based on redhat enterprise linux
3.0 -- so you'll hit a similar problem as doug did when you perform an
upgrade.

give me a bit of time to cook up a procedure for forcing your compute
nodes to PXE boot.

  - gb
From cdwan at mail.ahc.umn.edu Tue Dec 30 14:22:18 2003
From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB))
Date: Tue, 30 Dec 2003 16:22:18 -0600 (CST)
Subject: [Rocks-Discuss]NIS outside, 411 inside?
Message-ID: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu>

Is there a preferred way to have the 411 server on the head node replicate
information (passwd and auto.whatever) from an external NIS server to the
compute nodes? It seems to me that a cron job like the one below does the
trick, but it feels crufty to me:

  ypcat passwd > yp.passwd
  cat /etc/passwd yp.passwd > 411.passwd
  # build the 411 distributed passwd from the file above instead of
  # /etc/passwd.

I'd love to hear suggestions for a more elegant solution.

-Chris Dwan
 The University of Minnesota




From bruno at rocksclusters.org Tue Dec 30 15:16:36 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Tue, 30 Dec 2003 15:16:36 -0800
Subject: [Rocks-Discuss]Rocks 3.1.0 install problems
In-Reply-To: <FAA5CF63-3B07-11D8-BF62-000A95B96C68@uci.edu>
References: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu>
<CEAFCC25-3B03-11D8-9E96-000A95C4E3B4@rocksclusters.org> <FAA5CF63-3B07-11D8-
BF62-000A95B96C68@uci.edu>
Message-ID: <3737B584-3B1E-11D8-9E96-000A95C4E3B4@rocksclusters.org>

On Dec 30, 2003, at 12:37 PM, Stephen Jenks wrote:

>
> On Dec 30, 2003, at 12:07 PM, Greg Bruno wrote:
>> when compute nodes are installed via PXE, rocks detects this and
>> manipulates the boot sector of the disk drive on the compute node
>> that makes the disk non-bootable. that way, if the compute node is
>> reset, it will try to PXE boot. it will PXE boot even if your boot
>> order is: hard disk, cd/floppy, PXE. this occurs because the hard
>> disk is non-bootable so the BIOS boot loader will skip the hard disk
>> and move on to the other boot devices.
>
> Hi Greg, et al.
>
> Is there any way to force this behavior even if I initially used a CD
> to install the compute nodes? My nodes are capable of PXE boot, but
> since I didn't use that, I presume they didn't do the non-bootable
> disk trick upon install. Now that I'm clear about how the PXE install
> works, I'd prefer to move to that, but don't really want to have to
> corrupt the disks to cause the PXE install.
>
> The nodes are currently loaded with 3.0, so perhaps that will work
> with 3.1's kickstart, but I'm curious about the PXE issue.

here's a procedure to ensure that your non-3.1.0 compute nodes PXE
install after a frontend upgrade.

this assumes your compute nodes support PXE installs.

before you upgrade the frontend, login to the frontend and execute:

     # ssh-agent $SHELL
     # ssh-add

     # cluster-fork 'touch /boot/grub/pxe-install'

     # cluster-fork '/boot/kickstart/cluster-kickstart --start'

     # cluster-fork '/sbin/chkconfig --del rocks-grub'

now you can shutdown your compute nodes.


then upgrade your frontend.

after you login to your new frontend, run insert-ethers, then reset
each compute node, one at a time.


doug, you'll have a bit harder time.

if you can find a bootable floppy, after the compute node boots, you
can chroot to the root partition on the disk and run the three
commands above (without the cluster-fork wrapper).

i apologize for making this procedure tough on you.

  - gb



From mjk at sdsc.edu Tue Dec 30 15:32:20 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 30 Dec 2003 15:32:20 -0800
Subject: [Rocks-Discuss]Licensing
In-Reply-To: <BAY1-DAV14HJL2WZcXm0000fc27@hotmail.com>
References: <200312300711.hBU7BeJ14002@postal.sdsc.edu> <BAY1-
DAV14HJL2WZcXm0000fc27@hotmail.com>
Message-ID: <69879D2D-3B20-11D8-98D0-000A95DA5638@sdsc.edu>

Nothing!

Rocks is entirely open source with various GNU, BSD, Artistic, etc. open
source licenses attached. The underlying RedHat OS (as of Rocks 3.1.0
-- available now) is recompiled from RedHat's publicly available SRPMS.
You are of course welcome to send us money and hardware to help further
the cause. Several vendors do in fact do this, and it helps us
support them.

     -mjk

On Dec 30, 2003, at 6:03 AM, Purushotham Komaravolu wrote:

> Hi All,
>            I would like to know the list of   the components that have
>   to be
>   licensed, when we install ROCKS as a commercial solution.
>   Thanks
>   Happy Holidays
>   Puru



From mjk at sdsc.edu Tue Dec 30 15:35:39 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 30 Dec 2003 15:35:39 -0800
Subject: [Rocks-Discuss]NIS outside, 411 inside?
In-Reply-To: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu>
References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu>
Message-ID: <E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu>

As of Rocks 3.1.0 we no longer use NIS "inside" the cluster. So in
some ways this job is simpler now, although no one has done this yet.
A simple ypcat like you have will do most of the right thing, and 411
will pick up the changes and send them around the cluster. But you
need to figure out how to merge the cluster information with the
external NIS information. This will include things like the IP addresses
for the cluster compute nodes.

       -mjk

On Dec 30, 2003, at 2:22 PM, Chris Dwan (CCGB) wrote:

>
>   Is there a preferred way to have the 411 server on the head node
>   replicate
>   information (passwd and auto.whatever) from an external NIS server to
>   the
>   compute nodes? It seems to me that a cron job like the one below does
>   the
>   trick, but it feels crufty to me:
>
>    ypcat passwd > yp.passwd;
>    cat /etc/passwd yp.passwd > 411.passwd
>    ** build the 411 distributed passwd from the file above instead of
>    ** /etc/passwd.
>
>   I'd love to hear suggestions for a more elegant solution.
>
>   -Chris Dwan
>    The University of Minnesota
>



From mitchskin at comcast.net Tue Dec 30 17:13:44 2003
From: mitchskin at comcast.net (Mitchell Skinner)
Date: Tue, 30 Dec 2003 17:13:44 -0800
Subject: [Rocks-Discuss]Rocks 3.1.0 install problems
In-Reply-To: <200312302028.hBUKSkgp017318@perry.geo.berkeley.edu>
References: <200312302028.hBUKSkgp017318@perry.geo.berkeley.edu>
Message-ID: <1072833146.8645.1114.camel@zeitgeist>
On Tue, 2003-12-30 at 12:28, Doug Neuhauser wrote:
> I'll try Joe Landman's suggestion of a scrub floppy to scrub the boot
> sector of the boot disk on the compute nodes. If I can't do that, I
> CAN go through the manual process of setting and resetting the boot
> order on each compute node, but it is a slow and sequential process.

Something I'm going to try and implement at our site is support for the
pxelinux 'localboot' option. If the hard drives have a valid boot
sector, I can leave the BIOS set to PXE boot before the hard drive, and
by changing the pxelinux configuration on the head node, I can set a
particular node to boot from the network or from the local disk. In
other words, when a node PXE boots, it might get either the kickstart
instructions or the 'boot from hard drive' instructions.
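[Editor's sketch of what such a per-node pxelinux configuration could look like. This follows standard pxelinux conventions; the file paths, labels, and kernel/initrd names are illustrative, not something Rocks ships.]

```
# /tftpboot/pxelinux.cfg/<node-specific name> -- per-node pxelinux config.
# To reinstall a node, change its default to "install"; to let it boot
# from its own disk, leave the default at "localdisk".
default localdisk

label localdisk
    # hand control back to the local disk / next BIOS boot device
    localboot 0

label install
    kernel vmlinuz
    append initrd=initrd.img ks ksdevice=eth0
```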

That will take some fiddling, I think, because the head node then has to
maintain some more state for all of the compute nodes. I really want to
avoid going through the BIOS setup on all my nodes more than once,
though.

Is this something that the ROCKS mainline would be interested in?

Mitch



From doug at seismo.berkeley.edu Tue Dec 30 17:51:49 2003
From: doug at seismo.berkeley.edu (Doug Neuhauser)
Date: Tue, 30 Dec 2003 17:51:49 -0800 (PST)
Subject: [Rocks-Discuss]Rocks 3.1.0 install problems
Message-ID: <200312310151.hBV1pngp026060@perry.geo.berkeley.edu>

My solution to force PXE boot is outlined below.

1.   Boot dban floppy (floppy image at http://dban.sourceforge.net/ ).

2.   Run "quick" purge of disks on system (I only have 1 disk on compute nodes).
     I let the disk purge get far enough into the disk to overwrite the boot
     sectors and filesystem -- I didn't wait for it to completely erase the
     entire disk.

3.   Reset the system, and CYCLE POWER on the compute node.

     NOTE: If you don't cycle power, the BIOS sees the disk, but reports that
     it has a fatal error reading from it. This caused the following problems:
     a. PXE boot worked, but the Rocks install also did not see the disk.
       It asked whether I wanted to manually configure the disk, but
       the configuration failed immediately regardless of whether I
       answered yes or no. The Rocks developers may want to look into
       this bug.
     b. By the time that I figured out that I needed to cycle power,
       the BIOS had already removed the disk from the boot order.
       My boot order was now:
             floppy, PXE, disk
       Rocks installed properly once, twice, .... until I reset the boot
       order to:
             floppy, disk, PXE.

4.   Compute node will now perform PXE boot, install Rocks 3.1.0, and
     subsequent "controlled reboots" will boot from disk. If the node
     is powered down or reset with the reset button, no boot block is
     left on disk, and the system will perform PXE boot and reinstall Rocks.

------------------------------------------------------------------------
Doug Neuhauser                University of California, Berkeley
doug at seismo.berkeley.edu   Berkeley Seismological Laboratory
Phone: 510-642-0931           215 McCone Hall # 4760
Fax:   510-643-5811           Berkeley, CA 94720-4760



From tim.carlson at pnl.gov Tue Dec 30 19:17:11 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Tue, 30 Dec 2003 19:17:11 -0800 (PST)
Subject: [Rocks-Discuss]Rocks 3.1.0 install problems
In-Reply-To: <200312310151.hBV1pngp026060@perry.geo.berkeley.edu>
Message-ID: <Pine.GSO.4.44.0312301914310.23660-100000@poincare.emsl.pnl.gov>

On Tue, 30 Dec 2003, Doug Neuhauser wrote:

> 2.   Run "quick" purge of disks on system (I only have 1 disk on compute nodes).
>      I let the disk purge get far enough into the disk to overwrite the boot
>      sectors and filesystem -- I didn't wait for it to completely erase the
>      entire disk.

Here is something that is a bit quicker

cluster-fork dd if=/dev/zero of=/dev/hda bs=1k count=512

Then either power cycle or

cluster-fork reboot

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support



From cdwan at mail.ahc.umn.edu Tue Dec 30 19:44:11 2003
From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB))
Date: Tue, 30 Dec 2003 21:44:11 -0600 (CST)
Subject: [Rocks-Discuss]NIS outside, 411 inside?
In-Reply-To: <E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu>
References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu>
 <E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu>
Message-ID: <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu>

>   As of Rocks 3.1.0 we no longer use NIS "inside" the cluster. So in
>   some ways this job is simpler now, although no one has done this yet.
>   A simple ypcat like you have will do most of the right thing and 411
>   will pick up the changed and send them around the cluster. But, you
>   need to figure out how to merge the cluster information with the
>   external NIS information. This will include things like the IP address
>   for the cluster compute nodes.
The shuffling below would work, I think, but it still gives me the
willies to be mucking with the passwd file every hour:

 mv /etc/passwd /etc/passwd.local
 ypcat passwd > /etc/passwd.nis
 cat /etc/passwd.local /etc/passwd.nis > /etc/passwd
 service 411 commit
 cp /etc/passwd.local /etc/passwd

Am I missing the simple way? I seem to have an affinity for finding the
maximally complex way to do things...

-Chris Dwan
 The University of Minnesota


From mjk at sdsc.edu Tue Dec 30 19:58:43 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 30 Dec 2003 19:58:43 -0800
Subject: [Rocks-Discuss]NIS outside, 411 inside?
In-Reply-To: <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu>
References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu>
<E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu>
<Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu>
Message-ID: <A04191F8-3B45-11D8-98D0-000A95DA5638@sdsc.edu>

This sounds reasonable, but you still have a chance of conflicting UIDs
in your password file. If you only issue accounts from your LAN NIS
server then you should be fine. I'd suggest adding the accounts
created by Rocks into your server (just look at the initial passwd
file). The SGE roll creates an SGE user; others may also exist.

You can also try setting up your frontend as an NIS client of your
external server, with the same UID issues above.

The bad news is we don't have a canned answer, and need someone to give
us one. The good news is with 411 in place only the frontend need be
changed and the compute nodes will still function as stock Rocks.

     -mjk


On Dec 30, 2003, at 7:44 PM, Chris Dwan (CCGB) wrote:

>
>> As of Rocks 3.1.0 we no longer use NIS "inside" the cluster. So in
>> some ways this job is simpler now, although no one has done this yet.
>> A simple ypcat like you have will do most of the right thing and 411
>> will pick up the changes and send them around the cluster. But, you
>> need to figure out how to merge the cluster information with the
>> external NIS information. This will include things like the IP
>> address
>> for the cluster compute nodes.
>
> The shuffling below would work, I think, but it still gives me the
> willies to be mucking with the passwd file every hour:
>
>   mv /etc/passwd /etc/passwd.local
>   ypcat passwd > /etc/passwd.nis
>   cat /etc/passwd.local /etc/passwd.nis > /etc/passwd
>   service 411 commit
>   cp /etc/passwd.local /etc/passwd
>
>   Am I missing the simple way? I seem to have an affinity for finding
>   the maximally complex way to do things...
>
>   -Chris Dwan
>    The University of Minnesota



From csamuel at vpac.org Tue Dec 30 19:59:51 2003
From: csamuel at vpac.org (Chris Samuel)
Date: Wed, 31 Dec 2003 14:59:51 +1100
Subject: [Rocks-Discuss]NIS outside, 411 inside?
In-Reply-To: <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu>
References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu>
<E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu>
<Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu>
Message-ID: <200312311459.54054.csamuel@vpac.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, 31 Dec 2003 02:44 pm, Chris Dwan (CCGB) wrote:

>    mv /etc/passwd /etc/passwd.local
>    ypcat passwd > /etc/passwd.nis
>    cat /etc/passwd.local /etc/passwd.nis > /etc/passwd
>    service 411 commit
>    cp /etc/passwd.local /etc/passwd

Hmm, how about:

ypcat passwd > /etc/passwd.nis
cat /etc/passwd /etc/passwd.nis > /etc/passwd.tmp
cp /etc/passwd /etc/passwd.local
mv /etc/passwd.tmp /etc/passwd
service 411 commit
mv /etc/passwd.local /etc/passwd

That should mean that you're never operating without a password file and the
overwrites should be approaching atomic (I hope).
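The atomic-swap idea can be made concrete with a small sketch that runs on scratch files (mktemp) instead of the real /etc/passwd, so it is safe to try; on a frontend the printf stand-ins would be the real local file and `ypcat passwd`:

```shell
# Build the merged file off to the side, then swap it into place with a
# single rename (atomic within one filesystem), so readers never see a
# half-written password file. Scratch files stand in for the real ones.
PASSWD=$(mktemp)                                    # stand-in for /etc/passwd
printf 'root:x:0:0:root:/root:/bin/bash\n' > "$PASSWD"
NISDUMP=$(mktemp)                                   # stand-in for: ypcat passwd
printf 'alice:x:1001:1001::/home/alice:/bin/bash\n' > "$NISDUMP"

MERGED=$(mktemp)
cat "$PASSWD" "$NISDUMP" > "$MERGED"
mv "$MERGED" "$PASSWD"       # atomic: old or new contents, never neither
wc -l < "$PASSWD"            # -> 2
rm -f "$PASSWD" "$NISDUMP"
```

The rename is what buys the "approaching atomic" property: `mv` within one filesystem is a single rename(2), so no reader ever observes an empty or partial file.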

Of course, it'd be nice if you could do whatever the 411 init file does on
something other than /etc/passwd :-)

Disclaimer: I have not tried this myself & don't (yet) have a 3.1 system to
test with, caveat emptor, batteries not included, IANAL, etc.

cheers!
Chris
- --
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE/8km3O2KABBYQAh8RAnpPAJ9a9oRdGXeBUBAokdX6wmwrVbgXkQCeKD0C
xh8eT6qTbZpxhu8+FHPSt90=
=lhiY
-----END PGP SIGNATURE-----



From csamuel at vpac.org Tue Dec 30 20:01:39 2003
From: csamuel at vpac.org (Chris Samuel)
Date: Wed, 31 Dec 2003 15:01:39 +1100
Subject: [Rocks-Discuss]NIS outside, 411 inside?
In-Reply-To: <200312311459.54054.csamuel@vpac.org>
References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu>
<Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu>
<200312311459.54054.csamuel@vpac.org>
Message-ID: <200312311501.43675.csamuel@vpac.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, 31 Dec 2003 02:59 pm, Chris Samuel wrote:

> cp /etc/passwd /etc/passwd.local

should be:

cp -p /etc/passwd /etc/passwd.local

Oh, and what happens if users overlap ? :-)

cheers,
Chris
- --
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE/8kojO2KABBYQAh8RAmWTAJwNhpm77IclXcWLoAuhp2/B4/GsCgCfZWek
me3Lk2I7VDmRj4ygTSLSaaY=
=Pv8G
-----END PGP SIGNATURE-----



From cdwan at mail.ahc.umn.edu Tue Dec 30 20:12:34 2003
From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB))
Date: Tue, 30 Dec 2003 22:12:34 -0600 (CST)
Subject: [Rocks-Discuss]NIS outside, 411 inside?
In-Reply-To: <200312311459.54054.csamuel@vpac.org>
References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu>
 <E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu>
 <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu>
<200312311459.54054.csamuel@vpac.org>
Message-ID: <Pine.GSO.4.58.0312302206370.25976@lenti.med.umn.edu>

> Of course, it'd be nice if you could do whatever the 411 init file does on
> something other than /etc/passwd :-)

That would be a really big step.   I'm deeply wary of cron jobs that
overwrite my passwd file.

The next step might be to put this functionality into 411 itself. It
would be truly cool to have an automatic, non-NIS way to make the passwd,
group, autofs, and host lookup stuff be consistent and static across the
cluster nodes.

On the other hand, I appreciate that this is probably a complex enough
system without trying to reinvent NIS but leave out the brittle server
bits. We can work around it for the time being.

-Chris Dwan


From doug at seismo.berkeley.edu Tue Dec 30 20:34:25 2003
From: doug at seismo.berkeley.edu (Doug Neuhauser)
Date: Tue, 30 Dec 2003 20:34:25 -0800 (PST)
Subject: [Rocks-Discuss]Mozilla / ssh DISPLAY problem with Rocks 3.1.0
Message-ID: <200312310434.hBV4YPgp028521@perry.geo.berkeley.edu>

I am having a problem using mozilla with the default Rocks monitor web page
over an ssh session to my headnode from a Sun workstation with a 24-bit
display. My workstation is Sun Blade 150 running Solaris 8, and I am
using SSH Secure Shell 3.2.5 (non-commercial version).

When I ssh to my frontend and run mozilla, I get an empty Mozilla frame.
Running mozilla with the debugging option "--g-fatal-warnings" I get:

Gdk-WARNING **: Attempt to draw a drawable with depth 24 to a drawable with
depth 8
aborting...

xwininfo shows the following window characteristics:

xwininfo: Window id: 0x9400034 "GCLCluster Cluster - Mozilla"

  Absolute upper-left X: 175
  Absolute upper-left Y: 150
  Relative upper-left X: 0
  Relative upper-left Y: 0
  Width: 1021
  Height: 738
  Depth: 8
  Visual Class: PseudoColor
  Border width: 0
  Class: InputOutput
  Colormap: 0x22 (installed)
  Bit Gravity State: NorthWestGravity
  Window Gravity State: NorthWestGravity
  Backing Store State: NotUseful
  Save Under State: no
  Map State: IsViewable
  Override Redirect State: no
Corners: +175+150 -84+150    -84-136   +175-136
  -geometry 1021x738-78+125

Is there a way to configure mozilla to use only an 8-bit drawable?

If I ssh from a workstation with an 8-bit display, mozilla starts up
OK, and creates an 8-bit window.

- Doug N
------------------------------------------------------------------------
Doug Neuhauser                University of California, Berkeley
doug at seismo.berkeley.edu   Berkeley Seismological Laboratory
Phone: 510-642-0931           215 McCone Hall # 4760
Fax:   510-643-5811           Berkeley, CA 94720-4760



From qian1129 at yahoo.com Tue Dec 30 22:47:57 2003
From: qian1129 at yahoo.com (li lee)
Date: Tue, 30 Dec 2003 22:47:57 -0800 (PST)
Subject: [Rocks-Discuss]How to install Roll CDs in Rocks 3.1.0
Message-ID: <20031231064757.52813.qmail@web41508.mail.yahoo.com>

Hi,

I want to install Rocks v3.1.0 on PCs, but I do not
want to use so many CDs:
      roll-grid-3.1.0-0.any.iso
      roll-intel-3.1.0-0.any.iso
      roll-sge-3.1.0-0.any.iso
        ......
So, how can I install all of these after the Rocks and HPC
installation on clusters?

Thanks

Li

__________________________________
Do you Yahoo!?
Find out what made the Top Yahoo! Searches of 2003
http://search.yahoo.com/top2003


From bruno at rocksclusters.org Tue Dec 30 23:35:28 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Tue, 30 Dec 2003 23:35:28 -0800
Subject: [Rocks-Discuss]How to install Roll CDs in Rocks 3.1.0
In-Reply-To: <20031231064757.52813.qmail@web41508.mail.yahoo.com>
References: <20031231064757.52813.qmail@web41508.mail.yahoo.com>
Message-ID: <E7D709AA-3B63-11D8-9E96-000A95C4E3B4@rocksclusters.org>

> I want to install Rocks v3.1.0 on PCs, but I do not
> want to use so many CDs:
>     roll-grid-3.1.0-0.any.iso
>     roll-intel-3.1.0-0.any.iso
>     roll-sge-3.1.0-0.any.iso
>         ......
> So, how to install all these after Rocks and HPC
> installation on clusters?

for now, we do not have a systematic way in which to incorporate rolls
after the frontend is up. this is on our 'todo' list.

    - gb



From tim.carlson at pnl.gov Wed Dec 31 07:29:21 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Wed, 31 Dec 2003 07:29:21 -0800 (PST)
Subject: [Rocks-Discuss]Mozilla / ssh DISPLAY problem with Rocks 3.1.0
In-Reply-To: <200312310434.hBV4YPgp028521@perry.geo.berkeley.edu>
Message-ID: <Pine.GSO.4.44.0312310727220.9033-100000@poincare.emsl.pnl.gov>

On Tue, 30 Dec 2003, Doug Neuhauser wrote:

>
>   I am having a problem using mozilla with the default Rocks monitor web page
>   over an ssh session to my headnode from a Sun workstation with a 24-bit
>   display. My workstation is Sun Blade 150 running Solaris 8, and I am
>   using SSH Secure Shell 3.2.5 (non-commercial version).
>
>   When I ssh to my frontend and run mozilla, I get an empty Mozilla frame.
>   Running mozilla with debugging options "--g-fatal-warnings" I get:

This sounds like an X tunnel problem. I see X tunnel errors all the time
(OpenGL, colormap, etc). What happens if you just set the DISPLAY
variable back to your Sun box and do the proper xhost command on the Sun?
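Tim's workaround, spelled out with hypothetical host names (note this bypasses ssh X forwarding entirely, so the X traffic goes over the wire unencrypted):

```shell
# On the Sun workstation: allow the frontend to open windows on the
# local display (host-based access control; xauth would be safer).
xhost +frontend.example.org          # hypothetical frontend name

# Then, inside the ssh session on the frontend:
export DISPLAY=sunbox.example.org:0.0   # hypothetical workstation name
mozilla &
```

If mozilla then opens a normal 24-bit window, the problem is in the ssh X proxy's colormap/depth handling rather than in mozilla itself.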


Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support




From mjk at sdsc.edu Wed Dec 31 09:45:49 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Wed, 31 Dec 2003 09:45:49 -0800
Subject: [Rocks-Discuss]How to install Roll CDs in Rocks 3.1.0
In-Reply-To: <20031231064757.52813.qmail@web41508.mail.yahoo.com>
References: <20031231064757.52813.qmail@web41508.mail.yahoo.com>
Message-ID: <2BEBEC90-3BB9-11D8-9BE3-000A95DA5638@sdsc.edu>

For this release you need all these CDs (if you want this
functionality). Think of Rolls as add-on packs to Rocks, and remember
that software belongs on a CD (not a tarball or ftp site). CDs are
the accepted commercial way of releasing software, and they are very nice.
But we have some issues with this that we are addressing right now:

- Meta-Rolls. That is, how do you merge multiple Rolls into a single CD
image? This is actually very easy to do, and we have some early code
for this; it will be there in the next release. For IA64 we merge the
HPC Roll onto the base DVD, so we have a proof of concept here.

- Rolls cannot be added after a cluster is installed, and must be used
during installation.

- Rolls cannot be uninstalled.

Rolls are maturing pretty quickly, and we know where they need to go.

          -mjk

On Dec 30, 2003, at 10:47 PM, li lee wrote:

>   Hi,
>
>   I want to install Rocks v3.1.0 on PCs, but I do not
>   want to use so many CDs:
>       roll-grid-3.1.0-0.any.iso
>       roll-intel-3.1.0-0.any.iso
>       roll-sge-3.1.0-0.any.iso
>           ......
>   So, how to install all these after Rocks and HPC
>   installation on clusters?
>
>   Thanks
>
>   Li
>
>   __________________________________
>   Do you Yahoo!?
>   Find out what made the Top Yahoo! Searches of 2003
>   http://search.yahoo.com/top2003



From michal at harddata.com Wed Dec 31 10:05:26 2003
From: michal at harddata.com (Michal Jaegermann)
Date: Wed, 31 Dec 2003 11:05:26 -0700
Subject: [Rocks-Discuss]NIS outside, 411 inside?
In-Reply-To: <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu>; from
cdwan@mail.ahc.umn.edu on Tue, Dec 30, 2003 at 09:44:11PM -0600
References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu>
<E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu>
<Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu>
Message-ID: <20031231110526.B11252@mail.harddata.com>

On Tue, Dec 30, 2003 at 09:44:11PM -0600, Chris Dwan (CCGB) wrote:
>
>
> The shuffling below would work, I think, but it still gives me the
> willies to be mucking with the passwd file every hour:
>
>   mv /etc/passwd /etc/passwd.local
>   ypcat passwd > /etc/passwd.nis
>   cat /etc/passwd.local /etc/passwd.nis > /etc/passwd
>   service 411 commit
>   cp /etc/passwd.local /etc/passwd
>
> Am I missing the simple way?
    cp -p /etc/passwd /etc/passwd.local
    ypcat passwd >> /etc/passwd
    service 411 commit
    mv /etc/passwd.local /etc/passwd

unless 'service 411' can be told to use another file. That way you
minimize the time gap when /etc/passwd is not in its normal state, you
make sure that the file attributes on /etc/passwd will remain intact,
and you are not left with extra files.

You can also play with (symbolic) links but I am not sure if every
possible /etc/passwd reader will indeed follow a link.

   Michal


From michal at harddata.com Wed Dec 31 10:16:18 2003
From: michal at harddata.com (Michal Jaegermann)
Date: Wed, 31 Dec 2003 11:16:18 -0700
Subject: [Rocks-Discuss]NIS outside, 411 inside?
In-Reply-To: <200312311501.43675.csamuel@vpac.org>; from csamuel@vpac.org on Wed,
Dec 31, 2003 at 03:01:39PM +1100
References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu>
<Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu>
<200312311459.54054.csamuel@vpac.org> <200312311501.43675.csamuel@vpac.org>
Message-ID: <20031231111618.C11252@mail.harddata.com>

On Wed, Dec 31, 2003 at 03:01:39PM +1100, Chris Samuel wrote:
> should be:
>
> cp -p /etc/passwd /etc/passwd.local
>
> Oh, and what happens if users overlap ? :-)


'sort -u' over the relevant fields after replacing ':'s with blanks? But
this is getting a tad more involved, and an "automatic
conflict resolution" still may screw up. A bit of coordination
between whoever maintains NIS and the local user data, like reserving
some names and uid ranges for one or the other, is likely more
effective in practice.
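As a sketch of what such automatic resolution could look like (scratch files, made-up account names), an awk one-liner that keeps only the first entry seen per UID, so local accounts win any conflict:

```shell
# Merge local + NIS password entries, keeping the FIRST line seen for
# each UID (field 3). Listing the local file first means local accounts
# win a UID clash. All names below are invented for illustration.
LOCAL=$(mktemp)
printf 'sge:x:400:400:SGE:/opt/gridengine:/bin/bash\n' > "$LOCAL"
NIS=$(mktemp)
printf 'bob:x:400:400::/home/bob:/bin/bash\n' > "$NIS"     # UID clash with sge
printf 'alice:x:1001:1001::/home/alice:/bin/bash\n' >> "$NIS"

cat "$LOCAL" "$NIS" | awk -F: '!seen[$3]++'
# -> the sge and alice lines; the conflicting NIS "bob" entry is dropped
rm -f "$LOCAL" "$NIS"
```

This is exactly the kind of silent dropping that can "screw up", which is why reserving UID ranges up front is the more robust policy.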

   Michal


From bruno at rocksclusters.org Wed Dec 31 10:42:21 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Wed, 31 Dec 2003 10:42:21 -0800
Subject: [Rocks-Discuss]Roll Documentation posted on the web site
Message-ID: <117308FA-3BC1-11D8-9E96-000A95C4E3B4@rocksclusters.org>

just posted documentation for some of the rolls on the web site -- see
the left-hand side of the web page:

http://www.rocksclusters.org/Rocks/


and here are the links to the roll documentation:
HPC Roll: http://www.rocksclusters.org/rocks-documentation/3.1.0/

SGE Roll: http://www.rocksclusters.org/roll-documentation/sge/3.1.0/

Grid Roll: http://www.rocksclusters.org/roll-documentation/grid/3.1.0/

Intel Roll: http://www.rocksclusters.org/roll-documentation/intel/3.1.0/


as a side note, for every one of the rolls you install above, the
documentation will be available on your frontend at:

     http://localhost/roll-documentation/

  - gb



From cdwan at mail.ahc.umn.edu Wed Dec 31 11:07:37 2003
From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB))
Date: Wed, 31 Dec 2003 13:07:37 -0600 (CST)
Subject: [Rocks-Discuss]NIS outside, 411 inside?
In-Reply-To: <20031231111618.C11252@mail.harddata.com>
References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu>
 <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu>
<200312311459.54054.csamuel@vpac.org>
 <200312311501.43675.csamuel@vpac.org> <20031231111618.C11252@mail.harddata.com>
Message-ID: <Pine.GSO.4.58.0312311239310.3992@lenti.med.umn.edu>

> this is getting a tad more involved and an "automatic
> conflict resolution" still may screw up.

I agree with this assessment. The key is to keep the local passwd file as
small as possible, and remove redundant accounts on the frontend node.
Since it consists mostly of non-login accounts, this shouldn't be
too difficult...and it's a one-time task anyway.

I've settled on the hourly cron job below. I'll report any weirdness as
appropriate. Thanks for all the suggestions and discussion.

#!/bin/sh
ypcat auto.master   >   /etc/auto.master
ypcat auto.home     >   /etc/auto.home
ypcat auto.net      >   /etc/auto.net
ypcat auto.web      >   /etc/auto.web

ypcat passwd      > /etc/passwd.nis
cat   /etc/passwd.local /etc/passwd.nis > /etc/passwd.combined
cp    /etc/passwd.combined /etc/passwd

ypcat group       > /etc/group.nis
cat   /etc/group.local /etc/group.nis > /etc/group.combined
cp    /etc/group.combined /etc/group

-Chris Dwan
 The University of Minnesota

From maz at tempestcomputers.com Wed Dec 31 11:37:09 2003
From: maz at tempestcomputers.com (John Mazza)
Date: Wed, 31 Dec 2003 14:37:09 -0500
Subject: [Rocks-Discuss]Rocks 3.1.0 with Adaptec I2O RAID
Message-ID: <200312311937.hBVJb9J25828@postal.sdsc.edu>

Does anyone know of a way to make the 3.1.0 (x86-64) version
work with an Adaptec 2100S SCSI RAID card? My master node
needs to use this card, but it doesn't appear to be in the
kernel on the CD. Also, does it support the SysKonnect
SK-9821 (Ver 2.0) Gig cards?

Thanks!




From tim.carlson at pnl.gov Wed Dec 31 12:49:25 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Wed, 31 Dec 2003 12:49:25 -0800 (PST)
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <20031229183225.M11961@scalableinformatics.com>
Message-ID: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov>

On Mon, 29 Dec 2003, landman wrote:

> SSH is too slow.    Wow.   5-10 seconds to log in.

Just getting around to this. I did a clean install on our test cluster
(Dell 1550 and 1750 boxes). No delays with ssh. As root or a normal
user, a "cluster-fork date" command on 4 nodes took under 0.6 seconds.

Sounds like you have some type of DNS issue. Did you get a bad
/etc/resolv.conf file on the nodes for some reason?

>   a) md (e.g. Software RAID): Just try to build one. Anaconda will
>   happily let you do this ... though it will die in the formatting stages.
>   Dropping into the shell (Alt-F2) and looking for the md module (lsmod)
>   shows nothing. Insmod the md also doesn't do anything. Catting
>   /proc/devices shows no md as a character or block device.

The odd bit here is that you can do a

modprobe raid0

on a running frontend and it gets installed but there is no associated
"md" module. Was "md" built directly into the kernel? very odd.

>b) ext3.   There is no ext3 available for the install.

This is a bit annoying. Nobody really uses ext2 anymore, do they? :) Not
having ext3 as an install option isn't a show stopper for me since I can
do a tune2fs after the fact. But ext3 should be there.
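The tune2fs route can be exercised safely on a scratch image file instead of a real partition (requires e2fsprogs; on a live system you would run `tune2fs -j` on the actual device and change the filesystem type in /etc/fstab from ext2 to ext3):

```shell
# Create a small ext2 filesystem inside a regular file, then add a
# journal with tune2fs -j, turning it into ext3. No root or mount needed.
IMG=$(mktemp)
dd if=/dev/zero of="$IMG" bs=1M count=32 2>/dev/null
mke2fs -F -q "$IMG"                      # plain ext2
tune2fs -j "$IMG" >/dev/null             # add a journal -> ext3
dumpe2fs -h "$IMG" 2>/dev/null | grep 'features'
# the "Filesystem features" line should now include has_journal
rm -f "$IMG"
```

The same `tune2fs -j` works on a mounted filesystem too (it creates the journal file `.journal`), which is why the conversion "after the fact" is painless.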

Having version 2.0.8 of the myrinet drivers up and running is a big + in
my book. SGE 5.3p5 is also nice to see.

It will be some time before I upgrade any production clusters given the
differences between Rh 7.3 and WS 3.0. Too big of a jump for me right now.
We first need to convert a couple hundred desktop boxes :)

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support




From James_ODell at Brown.edu Wed Dec 31 13:09:25 2003
From: James_ODell at Brown.edu (James O'Dell)
Date: Wed, 31 Dec 2003 16:09:25 -0500
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov>
References: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov>
Message-ID: <9CBB7CF1-3BD5-11D8-9574-0030656A27CC@Brown.edu>

For whatever it's worth, MPICH works MUCH better when run over rsh than
ssh. It seems as if ssh doesn't pass along signals nearly as well as
rsh. Since enabling rsh and configuring MPICH to use it, we have had no
zombie jobs on our compute nodes. When using ssh they were a common
occurrence. In fact, if you look at the MPICH implementation for
Myrinet, you'll see the contortions that they use to try and clean up
compute nodes when using ssh.

Jim

On Dec 31, 2003, at 3:49 PM, Tim Carlson wrote:

> On Mon, 29 Dec 2003, landman wrote:
>
>> SSH is too slow. Wow. 5-10 seconds to log in.
>
> Just getting around to this. I did a clean install on our test cluster
> (Dell 1550 and 1750 boxes). No delays with ssh. As root or a normal
> user, a "cluster-fork date" command on 4 nodes took under .6 seconds
>
> Sounds like you have some type of DNS issue. Did you get a bad
> /etc/resolv.conf file on the nodes for some reason?
>
>> a) md (e.g. Software RAID): Just try to build one. Anaconda will
>> happily let you do this ... though it will die in the formatting
>> stages.
>> Dropping into the shell (Alt-F2) and looking for the md module (lsmod)
>> shows nothing. Insmod the md also doesn't do anything. Catting
>> /proc/devices shows no md as a character or block device.
>
> The odd bit here is that you can do a
>
> modprobe raid0
>
> on a running frontend and it gets installed but there is no associated
> "md" module. Was "md" built directly into the kernel? very odd.
>
>> b) ext3. There is no ext3 available for the install.
>
> This is a bit annoying. Nobody really uses ext2 anymore do they? :) Not
> having ext3 as an install option isn't a show stopper for me since I
> can
> do a tune2fs after the fact. But ext3 should be there.
>
> Having version 2.0.8 of the myrinet drivers up and running is a big +
> in
> my book. SGE 5.3p5 is also nice to see.
>
> It will be some time before I upgrade any production clusters given the
> differences between Rh 7.3 and WS 3.0. Too big of a jump for me right
> now.
> We first need to convert a couple hundred desktop boxes :)
>
> Tim Carlson
> Voice: (509) 376 3423
> Email: Tim.Carlson at pnl.gov
> EMSL UNIX System Support
>



From landman at scalableinformatics.com Wed Dec 31 14:46:22 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 31 Dec 2003 17:46:22 -0500
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov>
References: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov>
Message-ID: <1072910782.4470.268.camel@protein.scalableinformatics.com>

On Wed, 2003-12-31 at 15:49, Tim Carlson wrote:
> On Mon, 29 Dec 2003, landman wrote:
>
> > SSH is too slow. Wow. 5-10 seconds to log in.
>
> Just getting around to this. I did a clean install on our test cluster
> (Dell 1550 and 1750 boxes). No delays with ssh. As root or a normal
> user, a "cluster-fork date" command on 4 nodes took under .6 seconds

Yeah, some weirdness in DNS. Re-load on one cluster head took care of
it, on the other applying dnsmasq helped.

>
>   Sounds like you have some type of DNS issue. Did you get a bad
>   /etc/resolv.conf file on the nodes for some reason?
>
>   >   a) md (e.g. Software RAID): Just try to build one. Anaconda will
>   >   happily let you do this ... though it will die in the formatting stages.
>   >   Dropping into the shell (Alt-F2) and looking for the md module (lsmod)
>   >   shows nothing. Insmod the md also doesn't do anything. Catting
>   >   /proc/devices shows no md as a character or block device.
>
>   The odd bit here is that you can do a
>
>   modprobe raid0
>
> on a running frontend and it gets installed but there is no associated
> "md" module. Was "md" built directly into the kernel? very odd.

True, but I wanted to do a RAID 1. I tried "insmod raid1" but it
didn't work; from what I can see the module was not in the build. This
is ok, as some of it can be done later.

>
>   >b) ext3.   There is no ext3 available for the install.
>
>   This is a bit annoying. Nobody really uses ext2 anymore do they? :) Not
>   having ext3 as an install option isn't a show stopper for me since I can
>   do a tune2fs after the fact. But ext3 should be there.

That's what I did. I'll post a quick set of instructions for this a
little later.

>
> Having version 2.0.8 of the myrinet drivers up and running is a big + in
> my book. SGE 5.3p5 is also nice to see.

I agree, though I would like to see people do a

      cluster-fork "/etc/init.d/rcsge stop"
      cluster-fork "chown -R root:root /opt/gridengine/bin /opt/gridengine/utilbin"
      cluster-fork "/etc/init.d/rcsge start"

to fix the compute node SGE permissions. Some of the utils don't work
otherwise.

>
> It will be some time before I upgrade any production clusters given the
> differences between Rh 7.3 and WS 3.0. Too big of a jump for me right now.
> We first need to convert a couple hundred desktop boxes :)

:)

>
>   Tim Carlson
>   Voice: (509) 376 3423
>   Email: Tim.Carlson at pnl.gov
>   EMSL UNIX System Support
>



From landman at scalableinformatics.com Wed Dec 31 14:48:08 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 31 Dec 2003 17:48:08 -0500
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <9CBB7CF1-3BD5-11D8-9574-0030656A27CC@Brown.edu>
References: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov>
       <9CBB7CF1-3BD5-11D8-9574-0030656A27CC@Brown.edu>
Message-ID: <1072910887.4464.271.camel@protein.scalableinformatics.com>

Hi James:

    Did you rebuild MPICH for this?   I noticed the signal handling bit
using mpiBLAST.   Lots of zombies to deal with.

Joe

On Wed, 2003-12-31 at 16:09, James O'Dell wrote:
> For whatever it's worth, MPICH works MUCH better when run over rsh than
> ssh. It seems as if ssh doesn't pass along
> signals nearly as well as rsh. Since enabling rsh and configuring MPICH
> to use it, we have had no Zombie jobs
> on our compute nodes. When using SSH they were a common occurrence. In
> fact, if you look at the MPICH implementation for myrinet, you'll see
> the contortions that they use to try and clean up compute nodes when
> using ssh.
>
> Jim
>
> On Dec 31, 2003, at 3:49 PM, Tim Carlson wrote:
>
> > On Mon, 29 Dec 2003, landman wrote:
> >
> >> SSH is too slow. Wow. 5-10 seconds to log in.
> >
> > Just getting around to this. I did a clean install on our test cluster
> > (Dell 1550 and 1750 boxes). No delays with ssh. As root or a normal
> > user, a "cluster-fork date" command on 4 nodes took under .6 seconds
> >
> > Sounds like you have some type of DNS issue. Did you get a bad
> > /etc/resolv.conf file on the nodes for some reason?
> >
> >> a) md (e.g. Software RAID): Just try to build one. Anaconda will
> >> happily let you do this ... though it will die in the formatting
> >> stages.
> >> Dropping into the shell (Alt-F2) and looking for the md module (lsmod)
> >> shows nothing. Insmod the md also doesn't do anything. Catting
> >> /proc/devices shows no md as a character or block device.
> >
> > The odd bit here is that you can do a
> >
> > modprobe raid0
> >
> > on a running frontend and it gets installed but there is no associated
> > "md" module. Was "md" built directly into the kernel? very odd.
> >
> >> b) ext3. There is no ext3 available for the install.
> >
> > This is a bit annoying. Nobody really uses ext2 anymore do they? :) Not
> > having ext3 as an install option isn't a show stopper for me since I
> > can
> > do a tune2fs after the fact. But ext3 should be there.
> >
> > Having version 2.0.8 of the myrinet drivers up and running is a big +
> > in
> > my book. SGE 5.3p5 is also nice to see.
> >
> > It will be some time before I upgrade any production clusters given the
> > differences between Rh 7.3 and WS 3.0. Too big of a jump for me right
> > now.
> > We first need to convert a couple hundred desktop boxes :)
> >
>   >   Tim Carlson
>   >   Voice: (509) 376 3423
>   >   Email: Tim.Carlson at pnl.gov
>   >   EMSL UNIX System Support
>   >



From James_ODell at Brown.edu Wed Dec 31 15:12:59 2003
From: James_ODell at Brown.edu (James O'Dell)
Date: Wed, 31 Dec 2003 18:12:59 -0500
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <1072910887.4464.271.camel@protein.scalableinformatics.com>
References: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov>
<9CBB7CF1-3BD5-11D8-9574-0030656A27CC@Brown.edu>
<1072910887.4464.271.camel@protein.scalableinformatics.com>
Message-ID: <DFF94A81-3BE6-11D8-9574-0030656A27CC@Brown.edu>

The cheap way to do it is to grep the bin directory and look for SSH in
the execution scripts. You can change them to RSH and MPICH will use
RSH to execute.

An alternative is to set RSHCOMMAND=rsh during a rebuild. I'm pretty
sure that this method accomplishes precisely the same thing as simply
editing the execution scripts.
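A sketch of both approaches; this follows the usual MPICH 1.2.x ch_p4 conventions, so treat the variable names as assumptions to check against your install:

```shell
# 1) Build time: MPICH's configure honors RSHCOMMAND, baking rsh into
#    the generated execution scripts (mpirun etc.).
RSHCOMMAND=rsh ./configure    # then: make

# 2) Run time: the ch_p4 device also honors P4_RSHCOMMAND, overriding
#    whatever remote-shell command was compiled in.
export P4_RSHCOMMAND=rsh
mpirun -np 4 ./a.out
```

Either way, compute nodes need rshd enabled and the appropriate .rhosts/hosts.equiv trust, which stock Rocks does not ship turned on.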

Jim

On Dec 31, 2003, at 5:48 PM, Joe Landman wrote:

> Hi James:
>
>   Did you rebuild MPICH for this? I noticed the signal handling bit
> using mpiBLAST. Lots of zombies to deal with.
>
> Joe
>
> On Wed, 2003-12-31 at 16:09, James O'Dell wrote:
>> For whatever it's worth, MPICH works MUCH better when run over rsh than
>> ssh. It seems as if ssh doesn't pass along
>> signals nearly as well as rsh. Since enabling rsh and configuring
>> MPICH
>> to use it, we have had no Zombie jobs
>> on our compute nodes. When using SSH they were a common occurrence.
>> In
>> fact, if you look at the MPICH implementation for myrinet, you'll see
>> the contortions that they use to try and clean up compute nodes when
>> using ssh.
>>
>> Jim
>>
>> On Dec 31, 2003, at 3:49 PM, Tim Carlson wrote:
>>
>>> On Mon, 29 Dec 2003, landman wrote:
>>>
>>>> SSH is too slow. Wow. 5-10 seconds to log in.
>>>
>>> Just getting around to this. I did a clean install on our test
>>> cluster
>>> (Dell 1550 and 1750 boxes). No delays with ssh. As root or a normal
>>> user, a "cluster-fork date" command on 4 nodes took under .6 seconds
>>>
>>> Sounds like you have some type of DNS issue. Did you get a bad
>>> /etc/resolv.conf file on the nodes for some reason?
>>>
>>>> a) md (e.g. Software RAID): Just try to build one. Anaconda will
>>>> happily let you do this ... though it will die in the formatting
>>>> stages.
>>>> Dropping into the shell (Alt-F2) and looking for the md module
>>>> (lsmod)
>>>> shows nothing. Insmod the md also doesn't do anything. Catting
>>>> /proc/devices shows no md as a character or block device.
>>>
>>> The odd bit here is that you can do a
>>>
>>> modprobe raid0
>>>
>>> on a running frontend and it gets installed but there is no
>>> associated
>>> "md" module. Was "md" built directly into the kernel? very odd.
>>>
>>>> b) ext3. There is no ext3 available for the install.
>>>
>>> This is a bit annoying. Nobody really uses ext2 anymore do they? :)
>>> Not
>>> having ext3 as an install option isn't a show stopper for me since I
>>> can
>>> do a tune2fs after the fact. But ext3 should be there.
>>>
>>> Having version 2.0.8 of the myrinet drivers up and running is a big +
>>> in
>>> my book. SGE 5.3p5 is also nice to see.
>>>
>>> It will be some time before I upgrade any production clusters given
>>> the
>>> differences between Rh 7.3 and WS 3.0. Too big of a jump for me right
>>> now.
>>> We first need to convert a couple hundred desktop boxes :)
>>>
>>> Tim Carlson
>>> Voice: (509) 376 3423
>>> Email: Tim.Carlson at pnl.gov
>>> EMSL UNIX System Support
>>>



From bruno at rocksclusters.org Wed Dec 31 15:46:23 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Wed, 31 Dec 2003 15:46:23 -0800
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <1072910782.4470.268.camel@protein.scalableinformatics.com>
References: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov>
<1072910782.4470.268.camel@protein.scalableinformatics.com>
Message-ID: <8ABA2E3A-3BEB-11D8-83CE-000A95C4E3B4@rocksclusters.org>

>> Having version 2.0.8 of the myrinet drivers up and running is a big +
>> in
>> my book. SGE 5.3p5 is also nice to see.
>
> I agree, though I would like to see people do a
>
>     cluster-fork "/etc/init.d/rcsge stop"
>     cluster-fork "chown -R root:root /opt/gridengine/bin
> /opt/gridengine/utilbin"
>     cluster-fork "/etc/init.d/rcsge start"
>
> to fix the compute node sge permissions. Some of the utils don't work
> otherwise.

so we can test the fixes, what utilities need the above changes?

 - gb



From landman at scalableinformatics.com Wed Dec 31 21:04:14 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Thu, 01 Jan 2004 00:04:14 -0500
Subject: [Rocks-Discuss]looking for a work-around
Message-ID: <1072933453.4463.293.camel@protein.scalableinformatics.com>

Ok, this one is weird. On two different clusters using the same
replace-auto-partition.xml I get two completely different behaviors. My
best guess is that this is an anaconda issue, but it could be something else.

Both systems have IDE hard disks. I made the second one (my office
system) match the other system, so the IDE hard disks are hda and hdb.
Yes, I know this is not ideal, and I know that this should be changed.
I am simply trying to match their system.

First the partitioning:

<main>
 <clearpart>--all</clearpart>
 <part> / --size 4096 --ondisk hda </part>
 <part> swap --size 1024 --ondisk hda </part>
 <part> raid.00 --size 1 --grow --ondisk hda </part>
 <part> /tmp --size 4096 --ondisk hdb </part>
 <part> swap --size 1024 --ondisk hdb </part>
 <part> raid.01 --size 1 --grow --ondisk hdb </part>
</main>

On one cluster (my office), this works perfectly.

On the other cluster, it fails with:

  An unhandled exception has occurred. This is most likely a bug. Please
  copy the full text of this exception or save the crash dump to a
  floppy, then file a detailed bug report against anaconda at
  http://bugzilla.redhat.com/bugzilla/

  Traceback (most recent call last):
    File "/usr/bin/anaconda.real", line 1081, in ?
      intf.run(id, dispatch, configFileData)
    File "/var/tmp/anaconda-9.1//usr/lib/anaconda/text.py", line 448, in run
    File "/tmp/ksclass.py", line 799, in __call__
  KeyError: swap

    [ OK ]    [ Save ]    [ Debug ]

(The dialog's box-drawing characters came through as question marks, so
I've transcribed the text.) It appears that this is a Python KeyError,
which is raised when the element being sought has not been found.
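For the curious, KeyError is Python's standard failed-lookup exception for dictionaries; a minimal, self-contained illustration (the dict below is hypothetical, not anaconda's actual data structure):

```python
# Minimal illustration of a Python KeyError (hypothetical data;
# not anaconda's internal structures).
partitions = {"/": "hda1", "/tmp": "hdb1"}  # note: no "swap" entry

try:
    device = partitions["swap"]
except KeyError as exc:
    # The missing key is carried in the exception itself.
    print("KeyError:", exc)  # prints: KeyError: 'swap'
```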

Any ideas?

Joe
--
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman at scalableinformatics.com
   web: http://scalableinformatics.com
phone: +1 734 612 4615

2003 December

  • 1.
    From angel atmiami.edu Mon Dec 1 10:25:34 2003 From: angel at miami.edu (Angel Li) Date: Mon, 01 Dec 2003 13:25:34 -0500 Subject: [Rocks-Discuss]cluster-fork Message-ID: <3FCB879E.8050905@miami.edu> Hi, I recently installed Rocks 3.0 on a Linux cluster and when I run the command "cluster-fork" I get this error: apple* cluster-fork ls Traceback (innermost last): File "/opt/rocks/sbin/cluster-fork", line 88, in ? import rocks.pssh File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ? import gmon.encoder ImportError: Bad magic number in /usr/lib/python1.5/site-packages/gmon/encoder.pyc Any thoughts? I'm also wondering where to find the python sources for files in /usr/lib/python1.5/site-packages/gmon. Thanks, Angel From jghobrial at uh.edu Mon Dec 1 11:35:06 2003 From: jghobrial at uh.edu (Joseph) Date: Mon, 1 Dec 2003 13:35:06 -0600 (CST) Subject: [Rocks-Discuss]cluster-fork In-Reply-To: <3FCB879E.8050905@miami.edu> References: <3FCB879E.8050905@miami.edu> Message-ID: <Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> On Mon, 1 Dec 2003, Angel Li wrote: Hello Angel, I have the same problem and so far there is no response when I posted about this a month ago. Is your frontend an AMD setup?? I am thinking this is an AMD problem. Thanks, Joseph > Hi, > > I recently installed Rocks 3.0 on a Linux cluster and when I run the > command "cluster-fork" I get this error: > > apple* cluster-fork ls > Traceback (innermost last): > File "/opt/rocks/sbin/cluster-fork", line 88, in ? > import rocks.pssh > File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ?
  • 2.
    > import gmon.encoder > ImportError: Bad magic number in > /usr/lib/python1.5/site-packages/gmon/encoder.pyc > > Any thoughts? I'm also wondering where to find the python sources for > files in /usr/lib/python1.5/site-packages/gmon. > > Thanks, > > Angel > From tim.carlson at pnl.gov Mon Dec 1 14:58:54 2003 From: tim.carlson at pnl.gov (Tim Carlson) Date: Mon, 01 Dec 2003 14:58:54 -0800 (PST) Subject: [Rocks-Discuss]odd kickstart problem In-Reply-To: <76AC0F5E-2025-11D8-804D-000393A4725A@sdsc.edu> Message-ID: <Pine.LNX.4.44.0312011453020.22892-100000@scorpion.emsl.pnl.gov> Trying to bring up an old dead node on a Rocks 2.3.2 cluster and I get the following error in /var/log/httpd/error_log Traceback (innermost last): File "/opt/rocks/sbin/kgen", line 530, in ? app.run() File "/opt/rocks/sbin/kgen", line 497, in run doc = FromXmlStream(file) File "/usr/lib/python1.5/site-packages/xml/dom/ext/reader/Sax2.py", line 386, in FromXmlStream return reader.fromStream(stream, ownerDocument) File "/usr/lib/python1.5/site-packages/xml/dom/ext/reader/Sax2.py", line 372, in fromStream self.parser.parse(s) File "/usr/lib/python1.5/site-packages/xml/sax/expatreader.py", line 58, in parse xmlreader.IncrementalParser.parse(self, source) File "/usr/lib/python1.5/site-packages/xml/sax/xmlreader.py", line 125, in parse self.close() File "/usr/lib/python1.5/site-packages/xml/sax/expatreader.py", line 154, in close self.feed("", isFinal = 1) File "/usr/lib/python1.5/site-packages/xml/sax/expatreader.py", line 148, in feed self._err_handler.fatalError(exc) File "/usr/lib/python1.5/site-packages/xml/dom/ext/reader/Sax2.py", line 340, in fatalError raise exception xml.sax._exceptions.SAXParseException: <stdin>:3298:0: no element found Doing a wget of http://frontend-0/install/kickstart.cgi? arch=i386&np=2&project=rocks on one of the working internal nodes yields the same error. Any thoughts on this?
  • 3.
    I've also donea fresh rocks-dist dist Tim From sjenks at uci.edu Mon Dec 1 15:35:54 2003 From: sjenks at uci.edu (Stephen Jenks) Date: Mon, 1 Dec 2003 15:35:54 -0800 Subject: [Rocks-Discuss]cluster-fork In-Reply-To: <Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> References: <3FCB879E.8050905@miami.edu> <Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> Message-ID: <1B15A45F-2457-11D8-A374-00039389B580@uci.edu> FYI, I have a dual Athlon frontend and didn't have that problem. I know that doesn't exactly help you, but at least it doesn't fail on all AMD machines. It looks like the .pyc file might be corrupt in your installation. The source .py file (encoder.py) is in the /usr/lib/python1.5/site-packages/gmon directory, so perhaps removing the .pyc file would regenerate it (if you run cluster-fork as root?) The md5sum for encoder.pyc on my system is: 459c78750fe6e065e9ed464ab23ab73d encoder.pyc So you can check if yours is different. Steve Jenks On Dec 1, 2003, at 11:35 AM, Joseph wrote: > On Mon, 1 Dec 2003, Angel Li wrote: > Hello Angel, I have the same problem and so far there is no response > when > I posted about this a month ago. > > Is your frontend an AMD setup?? > > I am thinking this is an AMD problem. > > Thanks, > Joseph > > >> Hi, >> >> I recently installed Rocks 3.0 on a Linux cluster and when I run the >> command "cluster-fork" I get this error: >> >> apple* cluster-fork ls >> Traceback (innermost last): >> File "/opt/rocks/sbin/cluster-fork", line 88, in ? >> import rocks.pssh >> File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ? >> import gmon.encoder >> ImportError: Bad magic number in
  • 4.
    >> /usr/lib/python1.5/site-packages/gmon/encoder.pyc >> >> Any thoughts? I'm also wondering where to find the python sources for >> files in /usr/lib/python1.5/site-packages/gmon. >> >> Thanks, >> >> Angel >> From mjk at sdsc.edu Mon Dec 1 19:03:16 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Mon, 1 Dec 2003 19:03:16 -0800 Subject: [Rocks-Discuss]odd kickstart problem In-Reply-To: <Pine.LNX.4.44.0312011453020.22892-100000@scorpion.emsl.pnl.gov> References: <Pine.LNX.4.44.0312011453020.22892-100000@scorpion.emsl.pnl.gov> Message-ID: <132DD626-2474-11D8-A7A4-000A95DA5638@sdsc.edu> You'll need to run the kpp and kgen steps (what kickstart.cgi does for your) manually to find if this is an XML error. # cd /home/install/profiles/current # kpp compute This will generate a kickstart file for a compute nodes, although some information will be missing since it isn't specific to a node (not like what ./kickstart.cgi --client=node-name generates). But what this does do is traverse the XML graph and build a monolithic XML kickstart profile. If this step works you can then "|" pipe the output into kgen to convert the XML to kickstart syntax. Something in this procedure should fail and point to the error. -mjk On Dec 1, 2003, at 2:58 PM, Tim Carlson wrote: > Trying to bring up an old dead node on a Rocks 2.3.2 cluster and I get > the > following error in /var/log/httpd/error_log > > > Traceback (innermost last): > File "/opt/rocks/sbin/kgen", line 530, in ? > app.run() > File "/opt/rocks/sbin/kgen", line 497, in run > doc = FromXmlStream(file) > File "/usr/lib/python1.5/site-packages/xml/dom/ext/reader/Sax2.py", > line > 386, in FromXmlStream > return reader.fromStream(stream, ownerDocument) > File "/usr/lib/python1.5/site-packages/xml/dom/ext/reader/Sax2.py", > line > 372, in fromStream > self.parser.parse(s) > File "/usr/lib/python1.5/site-packages/xml/sax/expatreader.py", line > 58, > in parse
  • 5.
    > xmlreader.IncrementalParser.parse(self, source) > File "/usr/lib/python1.5/site-packages/xml/sax/xmlreader.py", line > 125, > in parse > self.close() > File "/usr/lib/python1.5/site-packages/xml/sax/expatreader.py", line > 154, in close > self.feed("", isFinal = 1) > File "/usr/lib/python1.5/site-packages/xml/sax/expatreader.py", line > 148, in feed > self._err_handler.fatalError(exc) > File "/usr/lib/python1.5/site-packages/xml/dom/ext/reader/Sax2.py", > line > 340, in fatalError > raise exception > xml.sax._exceptions.SAXParseException: <stdin>:3298:0: no element found > > > Doing a wget of > http://frontend-0/install/kickstart.cgi? > arch=i386&np=2&project=rocks > on one of the working internal nodes yields the same error. > > Any thoughts on this? > > I've also done a fresh > rocks-dist dist > > Tim From tim.carlson at pnl.gov Mon Dec 1 20:42:51 2003 From: tim.carlson at pnl.gov (Tim Carlson) Date: Mon, 01 Dec 2003 20:42:51 -0800 (PST) Subject: [Rocks-Discuss]odd kickstart problem In-Reply-To: <132DD626-2474-11D8-A7A4-000A95DA5638@sdsc.edu> Message-ID: <Pine.GSO.4.44.0312012040250.3148-100000@paradox.emsl.pnl.gov> On Mon, 1 Dec 2003, Mason J. Katz wrote: > You'll need to run the kpp and kgen steps (what kickstart.cgi does for > your) manually to find if this is an XML error. > > # cd /home/install/profiles/current > # kpp compute That was the trick. This sent me down the correct path. I had uninstalled SGE on the frontend (I was having problems with SGE and wanted to start from scratch) Adding the 2 SGE XML files back to /home/install/profiles/2.3.2/nodes/ fixed everything Thanks! Tim
  • 6.
    From landman atscalableinformatics.com Tue Dec 2 04:15:07 2003 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 02 Dec 2003 07:15:07 -0500 Subject: [Rocks-Discuss]supermicro based MB's Message-ID: <3FCC824B.5060406@scalableinformatics.com> Folks: Working on integrating a Supermicro MB based cluster. Discovered early on that all of the compute nodes have an Intel based NIC that RedHat doesn't know anything about (any version of RH). Some of the administrative nodes have other similar issues. I am seeing simply a suprising number of mis/un detected hardware across the collection of MBs. Anyone have advice on where to get modules/module source for Redhat for these things? It looks like I will need to rebuild the boot CD, though the several times I have tried this previously have failed to produce a working/bootable system. It looks like new modules need to be created/inserted into the boot process (head node and cluster nodes) kernels, as well as into the installable kernels. Has anyone done this for a Supermicro MB based system? Thanks . Joe -- Joseph Landman, Ph.D Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://scalableinformatics.com phone: +1 734 612 4615 From jghobrial at uh.edu Tue Dec 2 08:28:08 2003 From: jghobrial at uh.edu (Joseph) Date: Tue, 2 Dec 2003 10:28:08 -0600 (CST) Subject: [Rocks-Discuss]cluster-fork In-Reply-To: <1B15A45F-2457-11D8-A374-00039389B580@uci.edu> References: <3FCB879E.8050905@miami.edu> <Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> <1B15A45F-2457-11D8-A374-00039389B580@uci.edu> Message-ID: <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu> Indeed my md5sum is different for encoder.pyc. However, when I pulled the file and run "cluster-fork" python responds about an import problem. So it seems that regeneration did not occur. Is there a flag I need to pass? I have also tried to figure out what package provides encoder and reinstall the package, but an rpm query reveals nothing. 
If this is a generated file, what generates it? It seems that an rpm file query on ganglia show that files in the directory belong to the package, but encoder.pyc does not. Thanks,
  • 7.
    Joseph On Mon, 1Dec 2003, Stephen Jenks wrote: > FYI, I have a dual Athlon frontend and didn't have that problem. I know > that doesn't exactly help you, but at least it doesn't fail on all AMD > machines. > > It looks like the .pyc file might be corrupt in your installation. The > source .py file (encoder.py) is in the > /usr/lib/python1.5/site-packages/gmon directory, so perhaps removing > the .pyc file would regenerate it (if you run cluster-fork as root?) > > The md5sum for encoder.pyc on my system is: > 459c78750fe6e065e9ed464ab23ab73d encoder.pyc > So you can check if yours is different. > > Steve Jenks > > > On Dec 1, 2003, at 11:35 AM, Joseph wrote: > > > On Mon, 1 Dec 2003, Angel Li wrote: > > Hello Angel, I have the same problem and so far there is no response > > when > > I posted about this a month ago. > > > > Is your frontend an AMD setup?? > > > > I am thinking this is an AMD problem. > > > > Thanks, > > Joseph > > > > > >> Hi, > >> > >> I recently installed Rocks 3.0 on a Linux cluster and when I run the > >> command "cluster-fork" I get this error: > >> > >> apple* cluster-fork ls > >> Traceback (innermost last): > >> File "/opt/rocks/sbin/cluster-fork", line 88, in ? > >> import rocks.pssh > >> File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ? > >> import gmon.encoder > >> ImportError: Bad magic number in > >> /usr/lib/python1.5/site-packages/gmon/encoder.pyc > >> > >> Any thoughts? I'm also wondering where to find the python sources for > >> files in /usr/lib/python1.5/site-packages/gmon. > >> > >> Thanks, > >> > >> Angel > >> >
  • 8.
    From angel atmiami.edu Tue Dec 2 09:02:55 2003 From: angel at miami.edu (Angel Li) Date: Tue, 02 Dec 2003 12:02:55 -0500 Subject: [Rocks-Discuss]cluster-fork In-Reply-To: <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu> References: <3FCB879E.8050905@miami.edu> <Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> <1B15A45F-2457-11D8- A374-00039389B580@uci.edu> <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu> Message-ID: <3FCCC5BF.3030903@miami.edu> Joseph wrote: >Indeed my md5sum is different for encoder.pyc. However, when I pulled the >file and run "cluster-fork" python responds about an import problem. So it >seems that regeneration did not occur. Is there a flag I need to pass? > >I have also tried to figure out what package provides encoder and >reinstall the package, but an rpm query reveals nothing. > >If this is a generated file, what generates it? > >It seems that an rpm file query on ganglia show that files in the >directory belong to the package, but encoder.pyc does not. > >Thanks, >Joseph > > > > I have finally found the python sources in the HPC rolls CD, filename ganglia-python-3.0.0-2.i386.rpm. I'm not familiar with python but it seems python "compiles" the .py files to ".pyc" and then deletes the source file the first time they are referenced? I also noticed that there are two versions of python installed. Maybe the pyc files from one version won't load into the other one? Angel From mjk at sdsc.edu Tue Dec 2 15:52:52 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Tue, 2 Dec 2003 15:52:52 -0800 Subject: [Rocks-Discuss]cluster-fork In-Reply-To: <3FCCC5BF.3030903@miami.edu> References: <3FCB879E.8050905@miami.edu> <Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> <1B15A45F-2457-11D8- A374-00039389B580@uci.edu> <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu> <3FCCC5BF.3030903@miami.edu> Message-ID: <A43157DE-2522-11D8-A7A4-000A95DA5638@sdsc.edu> Python creates the .pyc files for you, and does not remove the original .py file. 
I would be extremely surprised it two "identical" .pyc files had the same md5 checksum. I'd expect this to be more like C .o file which always contain random data to pad out to the end of a page and
  • 9.
    32/64 bit wordsizes. Still this is just a guess, the real point is you can always remove the .pyc files and the .py will regenerate it when imported (although standard UNIX file/dir permission still apply). What is the import error you get from cluster-fork? -mjk On Dec 2, 2003, at 9:02 AM, Angel Li wrote: > Joseph wrote: > >> Indeed my md5sum is different for encoder.pyc. However, when I pulled >> the file and run "cluster-fork" python responds about an import >> problem. So it seems that regeneration did not occur. Is there a flag >> I need to pass? >> >> I have also tried to figure out what package provides encoder and >> reinstall the package, but an rpm query reveals nothing. >> >> If this is a generated file, what generates it? >> >> It seems that an rpm file query on ganglia show that files in the >> directory belong to the package, but encoder.pyc does not. >> >> Thanks, >> Joseph >> >> >> > I have finally found the python sources in the HPC rolls CD, filename > ganglia-python-3.0.0-2.i386.rpm. I'm not familiar with python but it > seems python "compiles" the .py files to ".pyc" and then deletes the > source file the first time they are referenced? I also noticed that > there are two versions of python installed. Maybe the pyc files from > one version won't load into the other one? > > Angel > > From vrowley at ucsd.edu Mon Dec 1 14:27:03 2003 From: vrowley at ucsd.edu (V. Rowley) Date: Mon, 01 Dec 2003 14:27:03 -0800 Subject: [Rocks-Discuss]PXE boot problems Message-ID: <3FCBC037.5000302@ucsd.edu> We have installed a ROCKS 3.0.0 frontend on a DL380 and are trying to install a compute node via PXE. We are getting an error similar to the one mentioned in the archives, e.g. > Loading initrd.img.... > Ready > > Failed to free base memory >
  • 10.
    We have upgradedto syslinux-2.07-1, per the suggestion in the archives, but continue to get the same error. Any ideas? -- Vicky Rowley email: vrowley at ucsd.edu Biomedical Informatics Research Network work: (858) 536-5980 University of California, San Diego fax: (858) 822-0828 9500 Gilman Drive La Jolla, CA 92093-0715 See pictures from our trip to China at http://www.sagacitech.com/Chinaweb From naihh at imcb.a-star.edu.sg Tue Dec 2 18:50:55 2003 From: naihh at imcb.a-star.edu.sg (Nai Hong Hwa Francis) Date: Wed, 3 Dec 2003 10:50:55 +0800 Subject: [Rocks-Discuss]RE: When will Sun Grid Engine be included inRocks 3 for Itanium? Message-ID: <5E118EED7CC277468A275F11EEEC39B94CCC22@EXIMCB2.imcb.a-star.edu.sg> Hi Laurence, I just downloaded the Rocks3.0 for IA32 and installed it but SGE is still not working. Any idea? Nai Hong Hwa Francis Institute of Molecular and Cell Biology (A*STAR) 30 Medical Drive Singapore 117609. DID: (65) 6874-6196 -----Original Message----- From: Laurence Liew [mailto:laurence at scalablesys.com] Sent: Thursday, November 20, 2003 2:53 PM To: Nai Hong Hwa Francis Cc: npaci-rocks-discussion at sdsc.edu Subject: Re: [Rocks-Discuss]RE: When will Sun Grid Engine be included inRocks 3 for Itanium? Hi Francis GridEngine roll is ready for ia32. We will get a ia64 native version ready as soon as we get back from SC2003. It will be released in a few weeks time. Globus GT2.4 is included in the Grid Roll Cheers! Laurence On Thu, 2003-11-20 at 10:13, Nai Hong Hwa Francis wrote: > > Hi,
  • 11.
    > > Does anyonehave any idea when will Sun Grid Engine be included as part > of Rocks 3 distribution. > > I am a newbie to Grid Computing. > Anyone have any idea on how to invoke Globus in Rocks to setup a Grid? > > Regards > > Nai Hong Hwa Francis > > Institute of Molecular and Cell Biology (A*STAR) > 30 Medical Drive > Singapore 117609 > DID: 65-6874-6196 > > -----Original Message----- > From: npaci-rocks-discussion-request at sdsc.edu > [mailto:npaci-rocks-discussion-request at sdsc.edu] > Sent: Thursday, November 20, 2003 4:01 AM > To: npaci-rocks-discussion at sdsc.edu > Subject: npaci-rocks-discussion digest, Vol 1 #613 - 3 msgs > > Send npaci-rocks-discussion mailing list submissions to > npaci-rocks-discussion at sdsc.edu > > To subscribe or unsubscribe via the World Wide Web, visit > > http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion > or, via email, send a message with subject or body 'help' to > npaci-rocks-discussion-request at sdsc.edu > > You can reach the person managing the list at > npaci-rocks-discussion-admin at sdsc.edu > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of npaci-rocks-discussion digest..." > > > Today's Topics: > > 1. top500 cluster installation movie (Greg Bruno) > 2. Re: Running Normal Application on Rocks Cluster - > Newbie Question (Laurence Liew) > > --__--__-- > > Message: 1 > To: npaci-rocks-discussion at sdsc.edu > From: Greg Bruno <bruno at rocksclusters.org> > Date: Tue, 18 Nov 2003 13:41:15 -0800 > Subject: [Rocks-Discuss]top500 cluster installation movie > > here's a crew of 7, installing the 201st fastest supercomputer in the > world in under two hours on the showroom floor at SC 03: > > http://www.rocksclusters.org/rocks.mov >
  • 12.
    > warning: theabove file is ~65MB. > > - gb > > > --__--__-- > > Message: 2 > Subject: Re: [Rocks-Discuss]Running Normal Application on Rocks Cluster > - > Newbie Question > From: Laurence Liew <laurenceliew at yahoo.com.sg> > To: Leong Chee Shian <chee-shian.leong at schenker.com> > Cc: npaci-rocks-discussion at sdsc.edu > Date: Wed, 19 Nov 2003 12:31:18 +0800 > > Chee Shian, > > Thanks for your call. We will take this off list and visit you next week > in your office as you requested. > > Cheers! > laurence > > > > On Tue, 2003-11-18 at 17:29, Leong Chee Shian wrote: > > I have just installed Rocks 3.0 with one frontend and two compute > > node. > > > > A normal file based application is installed on the frontend and is > > NFS shared to the compute nodes . > > > > Question is : When run 5 sessions of my applications , the CPU > > utilization is all concentrated on the frontend node , nothing is > > being passed on to the compute nodes . How do I make these 3 computers > > to function as one and share the load ? > > > > Thanks everyone as I am really new to this clustering stuff.. > > > > PS : The idea of exploring rocks cluster is to use a few inexpensive > > intel machines to replace our existing multi CPU sun server, > > suggestions and recommendations are greatly appreciated. > > > > > > Leong > > > > > > > > > > --__--__-- > > _______________________________________________ > npaci-rocks-discussion mailing list
  • 13.
    > npaci-rocks-discussion atsdsc.edu > http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion > > > End of npaci-rocks-discussion Digest > > > DISCLAIMER: > This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person as it may be an offence under the Official Secrets Act. Thank you. -- Laurence Liew CTO, Scalable Systems Pte Ltd 7 Bedok South Road Singapore 469272 Tel : 65 6827 3953 Fax : 65 6827 3922 Mobile: 65 9029 4312 Email : laurence at scalablesys.com http://www.scalablesys.com DISCLAIMER: This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person as it may be an offence under the Official Secrets Act. Thank you. From laurence at scalablesys.com Tue Dec 2 19:10:08 2003 From: laurence at scalablesys.com (Laurence Liew) Date: Wed, 03 Dec 2003 11:10:08 +0800 Subject: [Rocks-Discuss]RE: When will Sun Grid Engine be included inRocks 3 for Itanium? In-Reply-To: <5E118EED7CC277468A275F11EEEC39B94CCC22@EXIMCB2.imcb.a-star.edu.sg> References: <5E118EED7CC277468A275F11EEEC39B94CCC22@EXIMCB2.imcb.a-star.edu.sg> Message-ID: <1070421007.2452.51.camel@scalable> Hi, SGE is in the SGE roll. You need to download the base, hpc and sge roll. The install is now different from V2.3.x Cheers! laurence On Wed, 2003-12-03 at 10:50, Nai Hong Hwa Francis wrote: > Hi Laurence, >
  • 14.
    > I just downloaded the Rocks3.0 for IA32 and installed it but SGE is > still not working. > > Any idea? > > Nai Hong Hwa Francis > Institute of Molecular and Cell Biology (A*STAR) > 30 Medical Drive > Singapore 117609. > DID: (65) 6874-6196 > > -----Original Message----- > From: Laurence Liew [mailto:laurence at scalablesys.com] > Sent: Thursday, November 20, 2003 2:53 PM > To: Nai Hong Hwa Francis > Cc: npaci-rocks-discussion at sdsc.edu > Subject: Re: [Rocks-Discuss]RE: When will Sun Grid Engine be included > inRocks 3 for Itanium? > > Hi Francis > > GridEngine roll is ready for ia32. We will get a ia64 native version > ready as soon as we get back from SC2003. It will be released in a few > weeks time. > > Globus GT2.4 is included in the Grid Roll > > Cheers! > Laurence > > > On Thu, 2003-11-20 at 10:13, Nai Hong Hwa Francis wrote: > > > > Hi, > > > > Does anyone have any idea when will Sun Grid Engine be included as > part > > of Rocks 3 distribution. > > > > I am a newbie to Grid Computing. > > Anyone have any idea on how to invoke Globus in Rocks to setup a Grid? > > > > Regards > > > > Nai Hong Hwa Francis > > > > Institute of Molecular and Cell Biology (A*STAR) > > 30 Medical Drive > > Singapore 117609 > > DID: 65-6874-6196 > > > > -----Original Message----- > > From: npaci-rocks-discussion-request at sdsc.edu > > [mailto:npaci-rocks-discussion-request at sdsc.edu] > > Sent: Thursday, November 20, 2003 4:01 AM > > To: npaci-rocks-discussion at sdsc.edu > > Subject: npaci-rocks-discussion digest, Vol 1 #613 - 3 msgs > > > > Send npaci-rocks-discussion mailing list submissions to
  • 15.
    > > npaci-rocks-discussion at sdsc.edu > > > > To subscribe or unsubscribe via the World Wide Web, visit > > > > http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion > > or, via email, send a message with subject or body 'help' to > > npaci-rocks-discussion-request at sdsc.edu > > > > You can reach the person managing the list at > > npaci-rocks-discussion-admin at sdsc.edu > > > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of npaci-rocks-discussion digest..." > > > > > > Today's Topics: > > > > 1. top500 cluster installation movie (Greg Bruno) > > 2. Re: Running Normal Application on Rocks Cluster - > > Newbie Question (Laurence Liew) > > > > --__--__-- > > > > Message: 1 > > To: npaci-rocks-discussion at sdsc.edu > > From: Greg Bruno <bruno at rocksclusters.org> > > Date: Tue, 18 Nov 2003 13:41:15 -0800 > > Subject: [Rocks-Discuss]top500 cluster installation movie > > > > here's a crew of 7, installing the 201st fastest supercomputer in the > > world in under two hours on the showroom floor at SC 03: > > > > http://www.rocksclusters.org/rocks.mov > > > > warning: the above file is ~65MB. > > > > - gb > > > > > > --__--__-- > > > > Message: 2 > > Subject: Re: [Rocks-Discuss]Running Normal Application on Rocks > Cluster > > - > > Newbie Question > > From: Laurence Liew <laurenceliew at yahoo.com.sg> > > To: Leong Chee Shian <chee-shian.leong at schenker.com> > > Cc: npaci-rocks-discussion at sdsc.edu > > Date: Wed, 19 Nov 2003 12:31:18 +0800 > > > > Chee Shian, > > > > Thanks for your call. We will take this off list and visit you next > week > > in your office as you requested. > > > > Cheers! > > laurence
  • 16.
    > > > > >> > > On Tue, 2003-11-18 at 17:29, Leong Chee Shian wrote: > > > I have just installed Rocks 3.0 with one frontend and two compute > > > node. > > > > > > A normal file based application is installed on the frontend and is > > > NFS shared to the compute nodes . > > > > > > Question is : When run 5 sessions of my applications , the CPU > > > utilization is all concentrated on the frontend node , nothing is > > > being passed on to the compute nodes . How do I make these 3 > computers > > > to function as one and share the load ? > > > > > > Thanks everyone as I am really new to this clustering stuff.. > > > > > > PS : The idea of exploring rocks cluster is to use a few inexpensive > > > intel machines to replace our existing multi CPU sun server, > > > suggestions and recommendations are greatly appreciated. > > > > > > > > > Leong > > > > > > > > > > > > > > > > > --__--__-- > > > > _______________________________________________ > > npaci-rocks-discussion mailing list > > npaci-rocks-discussion at sdsc.edu > > http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion > > > > > > End of npaci-rocks-discussion Digest > > > > > > DISCLAIMER: > > This email is confidential and may be privileged. If you are not the > intended recipient, please delete it and notify us immediately. Please > do not copy or use it for any purpose, or disclose its contents to any > other person as it may be an offence under the Official Secrets Act. > Thank you. -- Laurence Liew CTO, Scalable Systems Pte Ltd 7 Bedok South Road Singapore 469272 Tel : 65 6827 3953 Fax : 65 6827 3922 Mobile: 65 9029 4312 Email : laurence at scalablesys.com http://www.scalablesys.com
  • 17.
    From DGURGUL atPARTNERS.ORG Wed Dec 3 07:24:29 2003 From: DGURGUL at PARTNERS.ORG (Gurgul, Dennis J.) Date: Wed, 3 Dec 2003 10:24:29 -0500 Subject: [Rocks-Discuss]RE: When will Sun Grid Engine be included inRo cks 3 for Itanium? Message-ID: <BC447F1AD529D311B4DE0008C71BF2EB0AE157F7@phsexch7.mgh.harvard.edu> Where do we find the SGE roll? Under Lhoste at http://rocks.npaci.edu/Rocks/ there is a "Grid" roll listed. Is SGE in that? The userguide doesn't mention SGE. Dennis J. Gurgul Partners Health Care System Research Management Research Computing Core 617.724.3169 -----Original Message----- From: npaci-rocks-discussion-admin at sdsc.edu [mailto:npaci-rocks-discussion-admin at sdsc.edu]On Behalf Of Laurence Liew Sent: Tuesday, December 02, 2003 10:10 PM To: Nai Hong Hwa Francis Cc: npaci-rocks-discussion at sdsc.edu Subject: RE: [Rocks-Discuss]RE: When will Sun Grid Engine be included inRocks 3 for Itanium? Hi, SGE is in the SGE roll. You need to download the base, hpc and sge roll. The install is now different from V2.3.x Cheers! laurence On Wed, 2003-12-03 at 10:50, Nai Hong Hwa Francis wrote: > Hi Laurence, > > I just downloaded the Rocks3.0 for IA32 and installed it but SGE is > still not working. > > Any idea? > > Nai Hong Hwa Francis > Institute of Molecular and Cell Biology (A*STAR) > 30 Medical Drive > Singapore 117609. > DID: (65) 6874-6196 > > -----Original Message----- > From: Laurence Liew [mailto:laurence at scalablesys.com] > Sent: Thursday, November 20, 2003 2:53 PM
    > To: Nai Hong Hwa Francis > Cc: npaci-rocks-discussion at sdsc.edu > Subject: Re: [Rocks-Discuss]RE: When will Sun Grid Engine be included > inRocks 3 for Itanium? > > Hi Francis > > GridEngine roll is ready for ia32. We will get a ia64 native version > ready as soon as we get back from SC2003. It will be released in a few > weeks time. > > Globus GT2.4 is included in the Grid Roll > > Cheers! > Laurence > > > On Thu, 2003-11-20 at 10:13, Nai Hong Hwa Francis wrote: > > > > Hi, > > > > Does anyone have any idea when will Sun Grid Engine be included as > part > > of Rocks 3 distribution. > > > > I am a newbie to Grid Computing. > > Anyone have any idea on how to invoke Globus in Rocks to setup a Grid? > > > > Regards > > > > Nai Hong Hwa Francis > > > > Institute of Molecular and Cell Biology (A*STAR) > > 30 Medical Drive > > Singapore 117609 > > DID: 65-6874-6196 > > > > -----Original Message----- > > From: npaci-rocks-discussion-request at sdsc.edu > > [mailto:npaci-rocks-discussion-request at sdsc.edu] > > Sent: Thursday, November 20, 2003 4:01 AM > > To: npaci-rocks-discussion at sdsc.edu > > Subject: npaci-rocks-discussion digest, Vol 1 #613 - 3 msgs > > > > Send npaci-rocks-discussion mailing list submissions to > > npaci-rocks-discussion at sdsc.edu > > > > To subscribe or unsubscribe via the World Wide Web, visit > > > > http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion > > or, via email, send a message with subject or body 'help' to > > npaci-rocks-discussion-request at sdsc.edu > > > > You can reach the person managing the list at > > npaci-rocks-discussion-admin at sdsc.edu > > > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of npaci-rocks-discussion digest..." > >
    > > > > Today's Topics: > > > > 1. top500 cluster installation movie (Greg Bruno) > > 2. Re: Running Normal Application on Rocks Cluster - > > Newbie Question (Laurence Liew) > > > > --__--__-- > > > > Message: 1 > > To: npaci-rocks-discussion at sdsc.edu > > From: Greg Bruno <bruno at rocksclusters.org> > > Date: Tue, 18 Nov 2003 13:41:15 -0800 > > Subject: [Rocks-Discuss]top500 cluster installation movie > > > > here's a crew of 7, installing the 201st fastest supercomputer in the > > world in under two hours on the showroom floor at SC 03: > > > > http://www.rocksclusters.org/rocks.mov > > > > warning: the above file is ~65MB. > > > > - gb > > > > > > --__--__-- > > > > Message: 2 > > Subject: Re: [Rocks-Discuss]Running Normal Application on Rocks > Cluster > > - > > Newbie Question > > From: Laurence Liew <laurenceliew at yahoo.com.sg> > > To: Leong Chee Shian <chee-shian.leong at schenker.com> > > Cc: npaci-rocks-discussion at sdsc.edu > > Date: Wed, 19 Nov 2003 12:31:18 +0800 > > > > Chee Shian, > > > > Thanks for your call. We will take this off list and visit you next > week > > in your office as you requested. > > > > Cheers! > > laurence > > > > > > > > On Tue, 2003-11-18 at 17:29, Leong Chee Shian wrote: > > > I have just installed Rocks 3.0 with one frontend and two compute > > > node. > > > > > > A normal file based application is installed on the frontend and is > > > NFS shared to the compute nodes . > > > > > > Question is : When run 5 sessions of my applications , the CPU > > > utilization is all concentrated on the frontend node , nothing is > > > being passed on to the compute nodes . How do I make these 3 > computers
    > > >to function as one and share the load ? > > > > > > Thanks everyone as I am really new to this clustering stuff.. > > > > > > PS : The idea of exploring rocks cluster is to use a few inexpensive > > > intel machines to replace our existing multi CPU sun server, > > > suggestions and recommendations are greatly appreciated. > > > > > > > > > Leong > > > > > > > > > > > > > > > > > --__--__-- > > > > _______________________________________________ > > npaci-rocks-discussion mailing list > > npaci-rocks-discussion at sdsc.edu > > http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion > > > > > > End of npaci-rocks-discussion Digest > > > > > > DISCLAIMER: > > This email is confidential and may be privileged. If you are not the > intended recipient, please delete it and notify us immediately. Please > do not copy or use it for any purpose, or disclose its contents to any > other person as it may be an offence under the Official Secrets Act. > Thank you. -- Laurence Liew CTO, Scalable Systems Pte Ltd 7 Bedok South Road Singapore 469272 Tel : 65 6827 3953 Fax : 65 6827 3922 Mobile: 65 9029 4312 Email : laurence at scalablesys.com http://www.scalablesys.com From bruno at rocksclusters.org Wed Dec 3 07:32:14 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Wed, 3 Dec 2003 07:32:14 -0800 Subject: [Rocks-Discuss]RE: When will Sun Grid Engine be included inRo cks 3 for Itanium? In-Reply-To: <BC447F1AD529D311B4DE0008C71BF2EB0AE157F7@phsexch7.mgh.harvard.edu> References: <BC447F1AD529D311B4DE0008C71BF2EB0AE157F7@phsexch7.mgh.harvard.edu> Message-ID: <DF132702-25A5-11D8-86E6-000A95C4E3B4@rocksclusters.org> > Where do we find the SGE roll? Under Lhoste at > http://rocks.npaci.edu/Rocks/ > there is a "Grid" roll listed. Is SGE in that? The userguide doesn't > mention > SGE.
the SGE roll will be available in the upcoming v3.1.0 release. scheduled release date is december 15th.

- gb

From jlkaiser at fnal.gov Wed Dec 3 08:35:18 2003
From: jlkaiser at fnal.gov (Joe Kaiser)
Date: Wed, 03 Dec 2003 10:35:18 -0600
Subject: [Rocks-Discuss]supermicro based MB's
In-Reply-To: <3FCC824B.5060406@scalableinformatics.com>
References: <3FCC824B.5060406@scalableinformatics.com>
Message-ID: <1070469318.12324.13.camel@nietzsche.fnal.gov>

Hi,

You don't say what version of Rocks you are using. The following is for the X5DPA-GG board and Rocks 3.0. It requires modifying only the pcitable in the boot image on the tftp server. I believe the procedure for 2.3.2 requires a heck of a lot more work, (but it may not). I would have to dig deep for the notes about changing 2.3.2.

This should be done on the frontend:

cd /tftpboot/X86PC/UNDI/pxelinux/
cp initrd.img initrd.img.orig
cp initrd.img /tmp
cd /tmp
mv initrd.img initrd.gz
gunzip initrd.gz
mkdir /mnt/loop
mount -o loop initrd /mnt/loop
cd /mnt/loop/modules/
vi pcitable

Search for the e1000 drivers and add the following line:

0x8086 0x1013 "e1000" "Intel Corp.|82546EB Gigabit Ethernet Controller"

write the file, then:

cd /tmp
umount /mnt/loop
gzip initrd
mv initrd.gz initrd.img
mv initrd.img /tftpboot/X86PC/UNDI/pxelinux/

Then boot the node. Hope this helps.

Thanks,
Joe

On Tue, 2003-12-02 at 06:15, Joe Landman wrote:
    > Folks: > > Working on integrating a Supermicro MB based cluster. Discovered early > on that all of the compute nodes have an Intel based NIC that RedHat > doesn't know anything about (any version of RH). Some of the > administrative nodes have other similar issues. I am seeing simply a > suprising number of mis/un detected hardware across the collection of MBs. > > Anyone have advice on where to get modules/module source for Redhat > for these things? It looks like I will need to rebuild the boot CD, > though the several times I have tried this previously have failed to > produce a working/bootable system. It looks like new modules need to be > created/inserted into the boot process (head node and cluster nodes) > kernels, as well as into the installable kernels. > > Has anyone done this for a Supermicro MB based system? Thanks . > > Joe -- =================================================================== Joe Kaiser - Systems Administrator Fermi Lab CD/OSS-SCS Never laugh at live dragons. 630-840-6444 jlkaiser at fnal.gov =================================================================== From jghobrial at uh.edu Wed Dec 3 08:59:15 2003 From: jghobrial at uh.edu (Joseph) Date: Wed, 3 Dec 2003 10:59:15 -0600 (CST) Subject: [Rocks-Discuss]cluster-fork In-Reply-To: <A43157DE-2522-11D8-A7A4-000A95DA5638@sdsc.edu> References: <3FCB879E.8050905@miami.edu> <Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> <1B15A45F-2457-11D8-A374-00039389B580@uci.edu> <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu> <3FCCC5BF.3030903@miami.edu> <A43157DE-2522-11D8-A7A4-000A95DA5638@sdsc.edu> Message-ID: <Pine.LNX.4.56.0312031057280.11073@mail.tlc2.uh.edu> Here is the error I receive when I remove the file encoder.pyc and run the command cluster-fork Traceback (innermost last): File "/opt/rocks/sbin/cluster-fork", line 88, in ? import rocks.pssh File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ? 
import gmon.encoder ImportError: No module named encoder Thanks, Joseph On Tue, 2 Dec 2003, Mason J. Katz wrote: > Python creates the .pyc files for you, and does not remove the original
    > .py file. I would be extremely surprised it two "identical" .pyc files > had the same md5 checksum. I'd expect this to be more like C .o file > which always contain random data to pad out to the end of a page and > 32/64 bit word sizes. Still this is just a guess, the real point is > you can always remove the .pyc files and the .py will regenerate it > when imported (although standard UNIX file/dir permission still apply). > > What is the import error you get from cluster-fork? > > -mjk > > On Dec 2, 2003, at 9:02 AM, Angel Li wrote: > > > Joseph wrote: > > > >> Indeed my md5sum is different for encoder.pyc. However, when I pulled > >> the file and run "cluster-fork" python responds about an import > >> problem. So it seems that regeneration did not occur. Is there a flag > >> I need to pass? > >> > >> I have also tried to figure out what package provides encoder and > >> reinstall the package, but an rpm query reveals nothing. > >> > >> If this is a generated file, what generates it? > >> > >> It seems that an rpm file query on ganglia show that files in the > >> directory belong to the package, but encoder.pyc does not. > >> > >> Thanks, > >> Joseph > >> > >> > >> > > I have finally found the python sources in the HPC rolls CD, filename > > ganglia-python-3.0.0-2.i386.rpm. I'm not familiar with python but it > > seems python "compiles" the .py files to ".pyc" and then deletes the > > source file the first time they are referenced? I also noticed that > > there are two versions of python installed. Maybe the pyc files from > > one version won't load into the other one? > > > > Angel > > > > > From mjk at sdsc.edu Wed Dec 3 15:19:38 2003 From: mjk at sdsc.edu (Mason J. 
Katz)
Date: Wed, 3 Dec 2003 15:19:38 -0800
Subject: [Rocks-Discuss]cluster-fork
In-Reply-To: <Pine.LNX.4.56.0312031057280.11073@mail.tlc2.uh.edu>
References: <3FCB879E.8050905@miami.edu> <Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> <1B15A45F-2457-11D8-A374-00039389B580@uci.edu> <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu> <3FCCC5BF.3030903@miami.edu> <A43157DE-2522-11D8-A7A4-000A95DA5638@sdsc.edu> <Pine.LNX.4.56.0312031057280.11073@mail.tlc2.uh.edu>
Message-ID: <2A332131-25E7-11D8-A641-000A95DA5638@sdsc.edu>

This file comes from a ganglia package; what does

# rpm -q ganglia-receptor

Return?

 -mjk

On Dec 3, 2003, at 8:59 AM, Joseph wrote:
> Here is the error I receive when I remove the file encoder.pyc and run
> the command cluster-fork
>
> Traceback (innermost last):
>   File "/opt/rocks/sbin/cluster-fork", line 88, in ?
>     import rocks.pssh
>   File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ?
>     import gmon.encoder
> ImportError: No module named encoder
>
> Thanks,
> Joseph
>
> On Tue, 2 Dec 2003, Mason J. Katz wrote:
>
>> Python creates the .pyc files for you, and does not remove the
>> original .py file. I would be extremely surprised it two "identical"
>> .pyc files had the same md5 checksum. I'd expect this to be more like
>> C .o file which always contain random data to pad out to the end of a
>> page and 32/64 bit word sizes. Still this is just a guess, the real
>> point is you can always remove the .pyc files and the .py will
>> regenerate it when imported (although standard UNIX file/dir
>> permission still apply).
>>
>> What is the import error you get from cluster-fork?
>>
>> -mjk
>>
>> On Dec 2, 2003, at 9:02 AM, Angel Li wrote:
>>
>>> Joseph wrote:
>>>
>>>> Indeed my md5sum is different for encoder.pyc. However, when I
>>>> pulled the file and run "cluster-fork" python responds about an
>>>> import problem. So it seems that regeneration did not occur. Is
>>>> there a flag I need to pass?
>>>>
>>>> I have also tried to figure out what package provides encoder and
>>>> reinstall the package, but an rpm query reveals nothing.
>>>>
>>>> If this is a generated file, what generates it?
>>>>
>>>> It seems that an rpm file query on ganglia show that files in the
    >>>> directory belongto the package, but encoder.pyc does not. >>>> >>>> Thanks, >>>> Joseph >>>> >>>> >>>> >>> I have finally found the python sources in the HPC rolls CD, filename >>> ganglia-python-3.0.0-2.i386.rpm. I'm not familiar with python but it >>> seems python "compiles" the .py files to ".pyc" and then deletes the >>> source file the first time they are referenced? I also noticed that >>> there are two versions of python installed. Maybe the pyc files from >>> one version won't load into the other one? >>> >>> Angel >>> >>> >> From csamuel at vpac.org Wed Dec 3 18:09:26 2003 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 4 Dec 2003 13:09:26 +1100 Subject: [Rocks-Discuss]Confirmation of Rocks 3.1.0 Opteron support & RHEL trademark removal ? Message-ID: <200312041309.27986.csamuel@vpac.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi folks, Can someone confirm that the next Rocks release will support Opteron please ? Also, I noticed that the current Rocks release on Itanium based on RHEL still has a lot of mentions of RedHat in it, which from my reading of their trademark guidelines is not permitted, is that fixed in the new version ? cheers! Chris - -- Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (GNU/Linux) iD8DBQE/zpdWO2KABBYQAh8RAqB8AJ9FG+IjIeem21qlFS6XYIHamIMPmwCghVTV AgjAlVHWgdv/KzYQinHGPxs= =IAWU -----END PGP SIGNATURE----- From bruno at rocksclusters.org Wed Dec 3 18:46:30 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Wed, 3 Dec 2003 18:46:30 -0800
    Subject: [Rocks-Discuss]Confirmation ofRocks 3.1.0 Opteron support & RHEL trademark removal ? In-Reply-To: <200312041309.27986.csamuel@vpac.org> References: <200312041309.27986.csamuel@vpac.org> Message-ID: <10AD9827-2604-11D8-86E6-000A95C4E3B4@rocksclusters.org> > Can someone confirm that the next Rocks release will support Opteron > please ? yes, it will support opteron. > Also, I noticed that the current Rocks release on Itanium based on > RHEL still > has a lot of mentions of RedHat in it, which from my reading of their > trademark guidelines is not permitted, is that fixed in the new > version ? and yes, (even though it doesn't feel like the right thing to do, as redhat has offered to the community some outstanding technologies that we'd like to credit), all redhat trademarks will be removed from 3.1.0. - gb From fds at sdsc.edu Thu Dec 4 06:46:32 2003 From: fds at sdsc.edu (Federico Sacerdoti) Date: Thu, 4 Dec 2003 06:46:32 -0800 Subject: [Rocks-Discuss]cluster-fork In-Reply-To: <Pine.LNX.4.56.0312031057280.11073@mail.tlc2.uh.edu> References: <3FCB879E.8050905@miami.edu> <Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> <1B15A45F-2457-11D8- A374-00039389B580@uci.edu> <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu> <3FCCC5BF.3030903@miami.edu> <A43157DE-2522-11D8-A7A4-000A95DA5638@sdsc.edu> <Pine.LNX.4.56.0312031057280.11073@mail.tlc2.uh.edu> Message-ID: <A69923FA-2668-11D8-804D-000393A4725A@sdsc.edu> Please install the http://www.rocksclusters.org/errata/3.0.0/ganglia-python-3.0.1 -2.i386.rpm package, which includes the correct encoder.py file. (This package is listed on the 3.0.0 errata page) -Federico On Dec 3, 2003, at 8:59 AM, Joseph wrote: > Here is the error I receive when I remove the file encoder.pyc and run > the > command cluster-fork > > Traceback (innermost last): > File "/opt/rocks/sbin/cluster-fork", line 88, in ? > import rocks.pssh > File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ? 
> import gmon.encoder > ImportError: No module named encoder > > Thanks, > Joseph
    > > > On Tue,2 Dec 2003, Mason J. Katz wrote: > >> Python creates the .pyc files for you, and does not remove the >> original >> .py file. I would be extremely surprised it two "identical" .pyc >> files >> had the same md5 checksum. I'd expect this to be more like C .o file >> which always contain random data to pad out to the end of a page and >> 32/64 bit word sizes. Still this is just a guess, the real point is >> you can always remove the .pyc files and the .py will regenerate it >> when imported (although standard UNIX file/dir permission still >> apply). >> >> What is the import error you get from cluster-fork? >> >> -mjk >> >> On Dec 2, 2003, at 9:02 AM, Angel Li wrote: >> >>> Joseph wrote: >>> >>>> Indeed my md5sum is different for encoder.pyc. However, when I >>>> pulled >>>> the file and run "cluster-fork" python responds about an import >>>> problem. So it seems that regeneration did not occur. Is there a >>>> flag >>>> I need to pass? >>>> >>>> I have also tried to figure out what package provides encoder and >>>> reinstall the package, but an rpm query reveals nothing. >>>> >>>> If this is a generated file, what generates it? >>>> >>>> It seems that an rpm file query on ganglia show that files in the >>>> directory belong to the package, but encoder.pyc does not. >>>> >>>> Thanks, >>>> Joseph >>>> >>>> >>>> >>> I have finally found the python sources in the HPC rolls CD, filename >>> ganglia-python-3.0.0-2.i386.rpm. I'm not familiar with python but it >>> seems python "compiles" the .py files to ".pyc" and then deletes the >>> source file the first time they are referenced? I also noticed that >>> there are two versions of python installed. Maybe the pyc files from >>> one version won't load into the other one? >>> >>> Angel >>> >>> >> >> Federico Rocks Cluster Group, San Diego Supercomputing Center, CA
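[Editorial note: the "Bad magic number" at the root of this thread happens because every .pyc starts with a 4-byte magic number identifying the bytecode format of the Python that compiled it, and an interpreter refuses a .pyc whose magic doesn't match its own — here, an encoder.pyc under /usr/lib/python1.5 read by a different Python. A small sketch against a modern Python 3.x (using `importlib.util.MAGIC_NUMBER`, the running interpreter's own magic) illustrating the check:]

```python
import importlib.util
import os
import py_compile
import tempfile

def pyc_magic(path):
    # The first 4 bytes of any .pyc file are the magic number of
    # the interpreter that compiled it.
    with open(path, "rb") as f:
        return f.read(4)

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "mod.py")
    with open(src, "w") as f:
        f.write("x = 1\n")
    # Compile the module the same way an import would.
    pyc = py_compile.compile(src, cfile=os.path.join(d, "mod.pyc"))
    # A freshly compiled .pyc matches the running interpreter's magic;
    # a .pyc left behind by any other Python version would not, which
    # is exactly the "ImportError: Bad magic number" case.
    assert pyc_magic(pyc) == importlib.util.MAGIC_NUMBER
```

Removing a stale .pyc (as suggested in the thread) only helps if the matching .py is present to be recompiled, which is why reinstalling the ganglia-python package was the actual fix.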
    From jghobrial atuh.edu Thu Dec 4 07:14:21 2003 From: jghobrial at uh.edu (Joseph) Date: Thu, 4 Dec 2003 09:14:21 -0600 (CST) Subject: [Rocks-Discuss]cluster-fork In-Reply-To: <A69923FA-2668-11D8-804D-000393A4725A@sdsc.edu> References: <3FCB879E.8050905@miami.edu> <Pine.LNX.4.56.0312011331460.5615@mail.tlc2.uh.edu> <1B15A45F-2457-11D8-A374-00039389B580@uci.edu> <Pine.LNX.4.56.0312021000490.7581@mail.tlc2.uh.edu> <3FCCC5BF.3030903@miami.edu> <A43157DE-2522-11D8-A7A4-000A95DA5638@sdsc.edu> <Pine.LNX.4.56.0312031057280.11073@mail.tlc2.uh.edu> <A69923FA-2668-11D8-804D-000393A4725A@sdsc.edu> Message-ID: <Pine.LNX.4.56.0312040913110.13972@mail.tlc2.uh.edu> Thank you very much this solved the problem. Joseph On Thu, 4 Dec 2003, Federico Sacerdoti wrote: > Please install the > http://www.rocksclusters.org/errata/3.0.0/ganglia-python-3.0.1 > -2.i386.rpm package, which includes the correct encoder.py file. (This > package is listed on the 3.0.0 errata page) > > -Federico > > On Dec 3, 2003, at 8:59 AM, Joseph wrote: > > > Here is the error I receive when I remove the file encoder.pyc and run > > the > > command cluster-fork > > > > Traceback (innermost last): > > File "/opt/rocks/sbin/cluster-fork", line 88, in ? > > import rocks.pssh > > File "/opt/rocks/lib/python/rocks/pssh.py", line 96, in ? > > import gmon.encoder > > ImportError: No module named encoder > > > > Thanks, > > Joseph > > > > > > On Tue, 2 Dec 2003, Mason J. Katz wrote: > > > >> Python creates the .pyc files for you, and does not remove the > >> original > >> .py file. I would be extremely surprised it two "identical" .pyc > >> files > >> had the same md5 checksum. I'd expect this to be more like C .o file > >> which always contain random data to pad out to the end of a page and > >> 32/64 bit word sizes. Still this is just a guess, the real point is > >> you can always remove the .pyc files and the .py will regenerate it > >> when imported (although standard UNIX file/dir permission still > >> apply).
    > >> > >> What is the import error you get from cluster-fork? > >> > >> -mjk > >> > >> On Dec 2, 2003, at 9:02 AM, Angel Li wrote: > >> > >>> Joseph wrote: > >>> > >>>> Indeed my md5sum is different for encoder.pyc. However, when I > >>>> pulled > >>>> the file and run "cluster-fork" python responds about an import > >>>> problem. So it seems that regeneration did not occur. Is there a > >>>> flag > >>>> I need to pass? > >>>> > >>>> I have also tried to figure out what package provides encoder and > >>>> reinstall the package, but an rpm query reveals nothing. > >>>> > >>>> If this is a generated file, what generates it? > >>>> > >>>> It seems that an rpm file query on ganglia show that files in the > >>>> directory belong to the package, but encoder.pyc does not. > >>>> > >>>> Thanks, > >>>> Joseph > >>>> > >>>> > >>>> > >>> I have finally found the python sources in the HPC rolls CD, filename > >>> ganglia-python-3.0.0-2.i386.rpm. I'm not familiar with python but it > >>> seems python "compiles" the .py files to ".pyc" and then deletes the > >>> source file the first time they are referenced? I also noticed that > >>> there are two versions of python installed. Maybe the pyc files from > >>> one version won't load into the other one? > >>> > >>> Angel > >>> > >>> > >> > >> > Federico > > Rocks Cluster Group, San Diego Supercomputing Center, CA > From vrowley at ucsd.edu Thu Dec 4 12:29:55 2003 From: vrowley at ucsd.edu (V. Rowley) Date: Thu, 04 Dec 2003 12:29:55 -0800 Subject: [Rocks-Discuss]Re: PXE boot problems In-Reply-To: <3FCBC037.5000302@ucsd.edu> References: <3FCBC037.5000302@ucsd.edu> Message-ID: <3FCF9943.1020806@ucsd.edu> Uh, nevermind. We had upgraded syslinux on our frontend, not the node we were trying to PXE boot. Sigh. V. Rowley wrote:
    > We haveinstalled a ROCKS 3.0.0 frontend on a DL380 and are trying to > install a compute node via PXE. We are getting an error similar to the > one mentioned in the archives, e.g. > >> Loading initrd.img.... >> Ready >> >> Failed to free base memory >> > > We have upgraded to syslinux-2.07-1, per the suggestion in the archives, > but continue to get the same error. Any ideas? > -- Vicky Rowley email: vrowley at ucsd.edu Biomedical Informatics Research Network work: (858) 536-5980 University of California, San Diego fax: (858) 822-0828 9500 Gilman Drive La Jolla, CA 92093-0715 See pictures from our trip to China at http://www.sagacitech.com/Chinaweb From cdwan at mail.ahc.umn.edu Fri Dec 5 08:16:07 2003 From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB)) Date: Fri, 5 Dec 2003 10:16:07 -0600 (CST) Subject: [Rocks-Discuss]Private NIS master Message-ID: <Pine.GSO.4.58.0312042305070.18193@lenti.med.umn.edu> Hello all. Long time listener, first time caller. Thanks for all the great work. I'm integrating a Rocks cluster into an existing NIS domain. I noticed that while the cluster database now supports a PrivateNISMaster, that variable doesn't make it into the /etc/yp.conf on the compute nodes. They remain broadcast. Assume that, for whatever reason, I don't want to set up a repeater (slave) ypserv process on my frontend. I added the option "--nisserver <var name="Kickstart_PrivateNISMaster"/>" to the "profiles/3.0.0/nodes/nis-client.xml" file, removed the ypserver on my frontend, and it works like I want it to. Am I missing anything fundamental here? -Chris Dwan University of Minnesota From wyzhong78 at msn.com Mon Dec 8 06:18:34 2003 From: wyzhong78 at msn.com (zhong wenyu) Date: Mon, 08 Dec 2003 22:18:34 +0800 Subject: [Rocks-Discuss]3.0.0 problem: not able to boot up Message-ID: <BAY3-F14uFqD45TpNO40002c14c@hotmail.com> Hi,everyone!
    I installed rocks3.0.0 defautly, There wasn't any trouble in the installing. But I haven't be able to boot,it stopped at the beginning,the message "GRUB" showed on the screen,and waiting.... my hardware are double Xeon 2.4G,MSI 9138,Seagate SCSI disk. Any appreciate is welcome! _________________________________________________________________ ???? MSN Explorer: http://explorer.msn.com/lccn/ From angelini at vki.ac.be Mon Dec 8 06:20:45 2003 From: angelini at vki.ac.be (Angelini Giuseppe) Date: Mon, 08 Dec 2003 15:20:45 +0100 Subject: [Rocks-Discuss]How to use MPICH with ssh Message-ID: <3FD488BD.3EBBDB8D@vki.ac.be> Dear rocks folk, I have recently installed mpich with Lahay Fortran and now that I can compile and link, I would like to run but it seems that I have another problem. In fact I have the following error message when I try to run: [panara at compute-0-7 ~]$ mpirun -np $NPROC -machinefile $PBS_NODEFILE $DPT/hybflow p0_13226: p4_error: Path to program is invalid while starting /dc_03_04/panara/PREPRO_TESTS/hybflow with /usr/bin/rsh on compute-0-7: -1 p4_error: latest msg from perror: No such file or directory p0_13226: p4_error: Child process exited while making connection to remote process on compute-0-6: 0 p0_13226: (6.025133) net_send: could not write to fd=4, errno = 32 p0_13226: (6.025231) net_send: could not write to fd=4, errno = 32 I am wondering why it is looking for /usr/bin/rsh for the communication, I expected to use ssh and not rsh. Any help will be welcome. Regards. Giuseppe Angelini From casuj at cray.com Mon Dec 8 07:31:21 2003 From: casuj at cray.com (John Casu) Date: Mon, 8 Dec 2003 07:31:21 -0800 Subject: [Rocks-Discuss]How to use MPICH with ssh In-Reply-To: <3FD488BD.3EBBDB8D@vki.ac.be>; from Angelini Giuseppe on Mon, Dec 08, 2003 at 03:20:45PM +0100 References: <3FD488BD.3EBBDB8D@vki.ac.be> Message-ID: <20031208073121.A10151@stemp3.wc.cray.com>
    On Mon, Dec08, 2003 at 03:20:45PM +0100, Angelini Giuseppe wrote: > > Dear rocks folk, > > > I have recently installed mpich with Lahay Fortran and now that I can > compile and link, > I would like to run but it seems that I have another problem. In fact I > have the following > error message when I try to run: > > [panara at compute-0-7 ~]$ mpirun -np $NPROC -machinefile $PBS_NODEFILE > $DPT/hybflow > p0_13226: p4_error: Path to program is invalid while starting > /dc_03_04/panara/PREPRO_TESTS/hybflow with /usr/bin/rsh on compute-0-7: > -1 > p4_error: latest msg from perror: No such file or directory > p0_13226: p4_error: Child process exited while making connection to > remote process on compute-0-6: 0 > p0_13226: (6.025133) net_send: could not write to fd=4, errno = 32 > p0_13226: (6.025231) net_send: could not write to fd=4, errno = 32 > > I am wondering why it is looking for /usr/bin/rsh for the communication, > > I expected to use ssh and not rsh. > > Any help will be welcome. > build mpich thus: RSHCOMMAND=ssh ./configure ..... > > Regards. > > > Giuseppe Angelini -- "Roses are red, Violets are blue, You lookin' at me ? YOU LOOKIN' AT ME ?!" -- Get Fuzzy. ======================================================================= John Casu Cray Inc. casuj at cray.com 411 First Avenue South, Suite 600 Tel: (206) 701-2173 Seattle, WA 98104-2860 Fax: (206) 701-2500 ======================================================================= From davidow at molbio.mgh.harvard.edu Mon Dec 8 08:12:53 2003 From: davidow at molbio.mgh.harvard.edu (Lance Davidow) Date: Mon, 8 Dec 2003 11:12:53 -0500 Subject: [Rocks-Discuss]How to use MPICH with ssh In-Reply-To: <3FD488BD.3EBBDB8D@vki.ac.be>
    References: <3FD488BD.3EBBDB8D@vki.ac.be> Message-ID: <p06002001bbfa51fea005@[132.183.190.222]> Giuseppe, Here'san answer from a newbie who just faced the same problem. You are using the wrong flavor of mpich (and mpirun). There are several different distributions which work differently in ROCKS. the one you are using in the default path expects serv_p4 demons and .rhosts files in your home directory. The different flavors may be more compatible with different compilers as well. [lance at rescluster2 lance]$ which mpirun /opt/mpich-mpd/gnu/bin/mpirun the one you probably want is /opt/mpich/gnu/bin/mpirun [lance at rescluster2 lance]$ locate mpirun ... /opt/mpich-mpd/gnu/bin/mpirun ... /opt/mpich/myrinet/gnu/bin/mpirun ... /opt/mpich/gnu/bin/mpirun Cheers, Lance At 3:20 PM +0100 12/8/03, Angelini Giuseppe wrote: >Dear rocks folk, > > >I have recently installed mpich with Lahay Fortran and now that I can >compile and link, >I would like to run but it seems that I have another problem. In fact I >have the following >error message when I try to run: > >[panara at compute-0-7 ~]$ mpirun -np $NPROC -machinefile $PBS_NODEFILE >$DPT/hybflow >p0_13226: p4_error: Path to program is invalid while starting >/dc_03_04/panara/PREPRO_TESTS/hybflow with /usr/bin/rsh on compute-0-7: >-1 > p4_error: latest msg from perror: No such file or directory >p0_13226: p4_error: Child process exited while making connection to >remote process on compute-0-6: 0 >p0_13226: (6.025133) net_send: could not write to fd=4, errno = 32 >p0_13226: (6.025231) net_send: could not write to fd=4, errno = 32 > >I am wondering why it is looking for /usr/bin/rsh for the communication, > >I expected to use ssh and not rsh. > >Any help will be welcome. > >
    >Regards. > >Giuseppe Angelini -- Lance Davidow,PhD Director of Bioinformatics Dept of Molecular Biology Mass General Hospital Boston MA 02114 davidow at molbio.mgh.harvard.edu 617.726-5955 Fax: 617.726-6893 From rscarce at caci.com Fri Dec 5 16:43:00 2003 From: rscarce at caci.com (Reed Scarce) Date: Fri, 5 Dec 2003 19:43:00 -0500 Subject: [Rocks-Discuss]PXE and system images Message-ID: <OFF783DCCA.8F016562-ON85256DF3.008001FC-85256DF7.00043E45@caci.com> We want to initialize new hardware with a known good image from identical hardware currently in use. The process imagined would be to PXE boot to a disk image server, PXE would create a RAM system that would request the system disk image from the server, which would push the desired system disk image to the requesting system. Upon completion the system would be available as a cluster member. The lab configuration is a PC grade frontend with two 3Com 905s and a single server grade cluster node with integrated Intel 82551 (10/100)(the only PXE interface) and two integrated Intel 82546 (10/100/1000). The cluster node is one of the stock of nodes for the expansion. The stock of nodes have a Linux OS pre-installed, which would be eliminated in the process. Currently the node will PXE boot from the 10/100 and pickup an installation boot from one of the g-bit interfaces. From there kickstart wants to take over. Any recommendations how to get kickstart to push an image to the disk? Thanks, Reed Scarce -------------- next part -------------- An HTML attachment was scrubbed... URL: https://lists.sdsc.edu/pipermail/npaci-rocks- discussion/attachments/20031205/dad04521/attachment-0001.html From wyzhong78 at msn.com Mon Dec 8 05:36:37 2003 From: wyzhong78 at msn.com (zhong wenyu) Date: Mon, 08 Dec 2003 21:36:37 +0800 Subject: [Rocks-Discuss]Rocks 3.0.0 problem:not able to boot up Message-ID: <BAY3-F9yOi5AgJQlDrR0002a5da@hotmail.com> Hi,everyone! 
I have installed Rocks 3.0.0 with the default options successfully; there was no trouble. But when I boot it up, it stops at the beginning, just showing "GRUB" on
the screen and waiting...

Thanks for your help!

_________________________________________________________________
MSN Explorer: http://explorer.msn.com/lccn/

From daniel.kidger at quadrics.com Mon Dec 8 09:54:53 2003
From: daniel.kidger at quadrics.com (daniel.kidger at quadrics.com)
Date: Mon, 8 Dec 2003 17:54:53 -0000
Subject: [Rocks-Discuss]custom-kernels : naming conventions ? (Rocks 3.0.0)
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F622357C7@tardis0.quadrics.com>

Dear all,

Previously I have been installing a custom kernel on the compute nodes with an "extend-compute.xml" and an "/etc/init.d/qsconfigure" (to fix grub.conf). However I am now trying to do it the 'proper' way. So I do (on :

# cp qsnet-RedHat-kernel-2.4.18-27.3.10qsnet.i686.rpm /home/install/rocks-dist/7.3/en/os/i386/force/RPMS
# cd /home/install
# rocks-dist dist
# SSH_NO_PASSWD=1 shoot-node compute-0-0

Hence:

# find /home/install/ |xargs -l grep -nH qsnet

shows me that hdlist and hdlist2 now contain this RPM. (and indeed If I duplicate my rpm in that directory rocks-dist notices this and warns me.) However the node always ends up with "2.4.20-20.7smp" again. anaconda-ks.cfg contains just "kernel-smp" and install.log has "Installing kernel-smp-2.4.20-20.7."

So my question is: It looks like my RPM has a name that Rocks doesn't understand properly. What is wrong with my name ? and what are the rules for getting the correct name ? (.i686.rpm is of course correct, but I don't have -smp. in the name Is this the problem ?)

cf. Greg Bruno's wisdom: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-April/001770.html

Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505
----------------------- www.quadrics.com --------------------

> From DGURGUL at PARTNERS.ORG Mon Dec 8 11:09:27 2003 From: DGURGUL at PARTNERS.ORG (Gurgul, Dennis J.)
Date: Mon, 8 Dec 2003 14:09:27 -0500
Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
Message-ID: <BC447F1AD529D311B4DE0008C71BF2EB0AE15840@phsexch7.mgh.harvard.edu>

I just did "cluster-fork -Uvh /sourcedir/ganglia-python-3.0.1-2.i386.rpm"
and then "cluster-fork service gschedule restart" (not sure I had to do the
last). I also put 3.0.1-2 and restarted gschedule on the frontend.

Now I run "cluster-fork --mpd w".

I currently have a user who ssh'd to compute-0-8 from the frontend and one
who ssh'd into compute-0-17 from the front end.

But the return shows the users on lines for 17 (for the user on 0-8) and 10
(for the user on 0-17):

17:   1:58pm  up 24 days,  3:20,  1 user,  load average: 0.00, 0.00, 0.03
17: USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU  WHAT
17: lance    pts/0    rescluster2.mgh.  1:31pm  40.00s  0.02s  0.02s -bash

10:   1:58pm  up 24 days,  3:21,  1 user,  load average: 0.02, 0.04, 0.07
10: USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU  WHAT
10: dennis   pts/0    rescluster2.mgh.  1:57pm  17.00s  0.02s  0.02s -bash

When I do "cluster-fork w" (without the --mpd) the users show up on the
correct nodes.

Do the numbers on the left of the -mpd output correspond to the node names?

Thanks.

Dennis

Dennis J. Gurgul
Partners Health Care System
Research Management
Research Computing Core
617.724.3169

From DGURGUL at PARTNERS.ORG Mon Dec 8 11:28:30 2003
From: DGURGUL at PARTNERS.ORG (Gurgul, Dennis J.)
Date: Mon, 8 Dec 2003 14:28:30 -0500
Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
Message-ID: <BC447F1AD529D311B4DE0008C71BF2EB0AE15843@phsexch7.mgh.harvard.edu>

Maybe this is a better description of the "strangeness".

I did "cluster-fork --mpd hostname":

1: compute-0-0.local
2: compute-0-1.local
3: compute-0-3.local
4: compute-0-13.local
5: compute-0-11.local
6: compute-0-15.local
7: compute-0-16.local
8: compute-0-19.local
9: compute-0-21.local
10: compute-0-17.local
11: compute-0-5.local
12: compute-0-20.local
13: compute-0-18.local
14: compute-0-12.local
15: compute-0-9.local
16: compute-0-4.local
17: compute-0-8.local
18: compute-0-14.local
19: compute-0-2.local
20: compute-0-6.local
0: compute-0-7.local
21: compute-0-10.local

Dennis J. Gurgul
Partners Health Care System
Research Management
Research Computing Core
617.724.3169

-----Original Message-----
From: npaci-rocks-discussion-admin at sdsc.edu
[mailto:npaci-rocks-discussion-admin at sdsc.edu]On Behalf Of Gurgul, Dennis J.
Sent: Monday, December 08, 2003 2:09 PM
To: npaci-rocks-discussion at sdsc.edu
Subject: [Rocks-Discuss]cluster-fork --mpd strangeness

I just did "cluster-fork -Uvh /sourcedir/ganglia-python-3.0.1-2.i386.rpm"
and then "cluster-fork service gschedule restart" (not sure I had to do the
last). I also put 3.0.1-2 and restarted gschedule on the frontend.

Now I run "cluster-fork --mpd w".

I currently have a user who ssh'd to compute-0-8 from the frontend and one
who ssh'd into compute-0-17 from the front end.

But the return shows the users on lines for 17 (for the user on 0-8) and 10
(for the user on 0-17):

17:   1:58pm  up 24 days,  3:20,  1 user,  load average: 0.00, 0.00, 0.03
17: USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU  WHAT
17: lance    pts/0    rescluster2.mgh.  1:31pm  40.00s  0.02s  0.02s -bash

10:   1:58pm  up 24 days,  3:21,  1 user,  load average: 0.02, 0.04, 0.07
10: USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU  WHAT
10: dennis   pts/0    rescluster2.mgh.  1:57pm  17.00s  0.02s  0.02s -bash

When I do "cluster-fork w" (without the --mpd) the users show up on the
correct nodes.

Do the numbers on the left of the -mpd output correspond to the node names?
Thanks.

Dennis

Dennis J. Gurgul
Partners Health Care System
Research Management
Research Computing Core
617.724.3169

From tim.carlson at pnl.gov Mon Dec 8 12:35:16 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Mon, 08 Dec 2003 12:35:16 -0800 (PST)
Subject: [Rocks-Discuss]PXE and system images
In-Reply-To: <OFF783DCCA.8F016562-ON85256DF3.008001FC-85256DF7.00043E45@caci.com>
Message-ID: <Pine.LNX.4.44.0312081226270.19031-100000@scorpion.emsl.pnl.gov>

On Fri, 5 Dec 2003, Reed Scarce wrote:

> We want to initialize new hardware with a known good image from identical
> hardware currently in use. The process imagined would be to PXE boot to a
> disk image server, PXE would create a RAM system that would request the
> system disk image from the server, which would push the desired system
> disk image to the requesting system. Upon completion the system would be
> available as a cluster member.
>
> The lab configuration is a PC grade frontend with two 3Com 905s and a
> single server grade cluster node with integrated Intel 82551 (10/100)
> (the only PXE interface) and two integrated Intel 82546 (10/100/1000).
> The cluster node is one of the stock of nodes for the expansion. The
> stock of nodes have a Linux OS pre-installed, which would be eliminated
> in the process.
>
> Currently the node will PXE boot from the 10/100 and pick up an
> installation boot from one of the g-bit interfaces. From there kickstart
> wants to take over.
>
> Any recommendations how to get kickstart to push an image to the disk?

This sounds like you want to use Oscar instead of ROCKS.
http://oscar.openclustergroup.org/tiki-index.php

I'm not exactly sure why you think that the kickstart process won't give you
exactly the same image on every machine. If the hardware is the same, you'll
get the same image on each machine. We have boxes with the same setup:
10/100 PXE, and then dual gigabit.
Our method for installing ROCKS on this type of hardware is the following:

1) Run insert-ethers and choose the "manager" type of node.
2) Connect all the PXE interfaces to the switch and boot them all. Do not
   connect the gigabit interface.
3) Once all of the nodes have PXE booted, exit insert-ethers. Start
   insert-ethers again and this time choose compute node.
4) Hook up the gigabit interface and the PXE interface to your nodes. All
of your machines will now install.
5) In our case, we now quickly disconnect the PXE interface because we
   don't want to have the machine continually install.

The real ROCKS method would have you choose (HD/net) for booting in the
BIOS, but if you already have an OS on your machine, you would have to go
into the BIOS twice before the compute nodes were installed. We disable
rocks-grub and just connect up the PXE cable if we need to reinstall.

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support

From tim.carlson at pnl.gov Mon Dec 8 12:42:23 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Mon, 08 Dec 2003 12:42:23 -0800 (PST)
Subject: [Rocks-Discuss]custom-kernels : naming conventions ? (Rocks 3.0.0)
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F622357C7@tardis0.quadrics.com>
Message-ID: <Pine.LNX.4.44.0312081238270.19031-100000@scorpion.emsl.pnl.gov>

On Mon, 8 Dec 2003 daniel.kidger at quadrics.com wrote:

I've gotten confused from time to time as to where to place custom RPMS
(it's changed between releases), so my not-so-clean method is to just rip
out the kernels in /home/install/rocks-dist/7.3/en/os/i386/Redhat/RPMS and
drop my own in. Then do a

cd /home/install
rocks-dist dist
shoot-node

You are probably running into an issue where the "force" directory is more
of an "in addition to" directory and your 2.4.18 kernel is being noted, but
ignored since the 2.4.20 kernel is newer. I assume your nodes get both an
SMP and a UP version of 2.4.20 and that your custom 2.4.18 is nowhere to be
found on the compute node.

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support

> Previously I have been installing a custom kernel on the compute nodes
> with an "extend-compute.xml" and an "/etc/init.d/qsconfigure" (to fix
> grub.conf).
>
> However I am now trying to do it the 'proper' way.
> So I do (on :
> # cp qsnet-RedHat-kernel-2.4.18-27.3.10qsnet.i686.rpm
> /home/install/rocks-dist/7.3/en/os/i386/force/RPMS
> # cd /home/install
> # rocks-dist dist
> # SSH_NO_PASSWD=1 shoot-node compute-0-0
>
> Hence:
> # find /home/install/ | xargs -l grep -nH qsnet
> shows me that hdlist and hdlist2 now contain this RPM. (and indeed if I
> duplicate my rpm in that directory rocks-dist notices this and warns me.)
>
> However the node always ends up with "2.4.20-20.7smp" again.
> anaconda-ks.cfg contains just "kernel-smp" and install.log has
> "Installing kernel-smp-2.4.20-20.7."
>
> So my question is:
> It looks like my RPM has a name that Rocks doesn't understand properly.
> What is wrong with my name ?
> and what are the rules for getting the correct name ?
> (.i686.rpm is of course correct, but I don't have -smp. in the name
> Is this the problem ?)

From fds at sdsc.edu Mon Dec 8 12:51:12 2003
From: fds at sdsc.edu (Federico Sacerdoti)
Date: Mon, 8 Dec 2003 12:51:12 -0800
Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
In-Reply-To: <BC447F1AD529D311B4DE0008C71BF2EB0AE15843@phsexch7.mgh.harvard.edu>
References: <BC447F1AD529D311B4DE0008C71BF2EB0AE15843@phsexch7.mgh.harvard.edu>
Message-ID: <423D0494-29C0-11D8-804D-000393A4725A@sdsc.edu>

You are right, and I think this is a shortcoming of MPD. There is no obvious
way to force the MPD numbering to correspond to the order the nodes were
called out on the command line (cluster-fork --mpd actually makes a shell
call to mpirun, and it calls out all the node names explicitly). MPD seems
to number the output differently, as you found out.

So mpd for now may be more useful for jobs that are not sensitive to this.
If enough of you find this shortcoming to be a real annoyance, we could work
on putting the node name label on the output by explicitly calling
"hostname" or similar.

Good ideas are welcome :)
-Federico

On Dec 8, 2003, at 11:28 AM, Gurgul, Dennis J. wrote:

> Maybe this is a better description of the "strangeness".
>
> I did "cluster-fork --mpd hostname":
>
> 1: compute-0-0.local
> 2: compute-0-1.local
> 3: compute-0-3.local
> 4: compute-0-13.local
> 5: compute-0-11.local
> 6: compute-0-15.local
> 7: compute-0-16.local
> 8: compute-0-19.local
> 9: compute-0-21.local
> 10: compute-0-17.local
> 11: compute-0-5.local
> 12: compute-0-20.local
> 13: compute-0-18.local
> 14: compute-0-12.local
> 15: compute-0-9.local
> 16: compute-0-4.local
> 17: compute-0-8.local
> 18: compute-0-14.local
> 19: compute-0-2.local
> 20: compute-0-6.local
> 0: compute-0-7.local
> 21: compute-0-10.local
>
> Dennis J. Gurgul
> Partners Health Care System
> Research Management
> Research Computing Core
> 617.724.3169
>
>
> -----Original Message-----
> From: npaci-rocks-discussion-admin at sdsc.edu
> [mailto:npaci-rocks-discussion-admin at sdsc.edu]On Behalf Of Gurgul,
> Dennis J.
> Sent: Monday, December 08, 2003 2:09 PM
> To: npaci-rocks-discussion at sdsc.edu
> Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
>
>
> I just did "cluster-fork -Uvh
> /sourcedir/ganglia-python-3.0.1-2.i386.rpm"
> and then "cluster-fork service gschedule restart" (not sure I had to do
> the last).
> I also put 3.0.1-2 and restarted gschedule on the frontend.
>
> Now I run "cluster-fork --mpd w".
>
> I currently have a user who ssh'd to compute-0-8 from the frontend and
> one who ssh'd into compute-0-17 from the front end.
>
> But the return shows the users on lines for 17 (for the user on 0-8)
> and 10 (for the user on 0-17):
>
> 17:   1:58pm  up 24 days,  3:20,  1 user,  load average: 0.00, 0.00,
> 0.03
> 17: USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU  WHAT
> 17: lance    pts/0    rescluster2.mgh.  1:31pm  40.00s  0.02s  0.02s
> -bash
>
> 10:   1:58pm  up 24 days,  3:21,  1 user,  load average: 0.02, 0.04,
> 0.07
> 10: USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU  WHAT
> 10: dennis   pts/0    rescluster2.mgh.  1:57pm  17.00s  0.02s  0.02s
> -bash
>
> When I do "cluster-fork w" (without the --mpd) the users show up on the
> correct nodes.
>
> Do the numbers on the left of the -mpd output correspond to the node
> names?
>
> Thanks.
>
> Dennis
>
> Dennis J. Gurgul
> Partners Health Care System
> Research Management
> Research Computing Core
> 617.724.3169
>
Federico

Rocks Cluster Group, San Diego Supercomputing Center, CA

From DGURGUL at PARTNERS.ORG Mon Dec 8 12:55:13 2003
From: DGURGUL at PARTNERS.ORG (Gurgul, Dennis J.)
Date: Mon, 8 Dec 2003 15:55:13 -0500
Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
Message-ID: <BC447F1AD529D311B4DE0008C71BF2EB0AE15847@phsexch7.mgh.harvard.edu>

Thanks.

On a related note, when I did "cluster-fork service gschedule restart"
gschedule started with the "OK" output, but then the fork process hung on
each node and I had to ^c out for it to go on to the next node.

I tried to ssh to a node and then did the gschedule restart. Even then,
after I tried to "exit" out of the node, the session hung and I had to log
back in and kill it from the frontend.

Dennis J. Gurgul
Partners Health Care System
Research Management
Research Computing Core
617.724.3169

-----Original Message-----
From: Federico Sacerdoti [mailto:fds at sdsc.edu]
Sent: Monday, December 08, 2003 3:51 PM
To: Gurgul, Dennis J.
Cc: npaci-rocks-discussion at sdsc.edu
Subject: Re: [Rocks-Discuss]cluster-fork --mpd strangeness
You are right, and I think this is a shortcoming of MPD. There is no obvious
way to force the MPD numbering to correspond to the order the nodes were
called out on the command line (cluster-fork --mpd actually makes a shell
call to mpirun, and it calls out all the node names explicitly). MPD seems
to number the output differently, as you found out.

So mpd for now may be more useful for jobs that are not sensitive to this.
If enough of you find this shortcoming to be a real annoyance, we could work
on putting the node name label on the output by explicitly calling
"hostname" or similar.

Good ideas are welcome :)
-Federico

On Dec 8, 2003, at 11:28 AM, Gurgul, Dennis J. wrote:

> Maybe this is a better description of the "strangeness".
>
> I did "cluster-fork --mpd hostname":
>
> 1: compute-0-0.local
> 2: compute-0-1.local
> 3: compute-0-3.local
> 4: compute-0-13.local
> 5: compute-0-11.local
> 6: compute-0-15.local
> 7: compute-0-16.local
> Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
>
>
> I just did "cluster-fork -Uvh
> /sourcedir/ganglia-python-3.0.1-2.i386.rpm"
> and then "cluster-fork service gschedule restart" (not sure I had to do
> the last).
> I also put 3.0.1-2 and restarted gschedule on the frontend.
>
> Now I run "cluster-fork --mpd w".
>
> I currently have a user who ssh'd to compute-0-8 from the frontend and
> one who ssh'd into compute-0-17 from the front end.
>
> But the return shows the users on lines for 17 (for the user on 0-8)
> and 10 (for the user on 0-17):
>
> 17:   1:58pm  up 24 days,  3:20,  1 user,  load average: 0.00, 0.00,
> 0.03
> 17: USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU  WHAT
> 17: lance    pts/0    rescluster2.mgh.  1:31pm  40.00s  0.02s  0.02s
> -bash
>
> 10:   1:58pm  up 24 days,  3:21,  1 user,  load average: 0.02, 0.04,
> 0.07
> 10: USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU  WHAT
> 10: dennis   pts/0    rescluster2.mgh.  1:57pm  17.00s  0.02s  0.02s
> -bash
>
> When I do "cluster-fork w" (without the --mpd) the users show up on the
> correct nodes.
>
> Do the numbers on the left of the -mpd output correspond to the node
> names?
>
> Thanks.
>
> Dennis
>
> Dennis J. Gurgul
> Partners Health Care System
> Research Management
> Research Computing Core
> 617.724.3169
>
Federico

Rocks Cluster Group, San Diego Supercomputing Center, CA

From mjk at sdsc.edu Mon Dec 8 12:58:22 2003
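Federico's suggestion above (labeling output with the node name by calling
"hostname") can be approximated on the frontend without changing
cluster-fork. The sketch below is only an illustration, not part of Rocks:
it assumes you have first captured the rank-to-hostname mapping with
"cluster-fork --mpd hostname > map.txt", and the helper name is invented.

```shell
# Hypothetical helper, not part of Rocks: relabel the "N: ..." lines of a
# "cluster-fork --mpd" run using a rank->hostname map captured earlier
# with: cluster-fork --mpd hostname > map.txt
relabel() {
  # $1 = map file with lines like "17: compute-0-8.local";
  # stdin = output of a later "cluster-fork --mpd <cmd>" run.
  awk -F': ' '
    NR == FNR { map[$1] = $2; next }    # first file: read rank -> hostname
    { sub(/^[0-9]+/, map[$1]) } 1       # stdin: swap the rank for the name
  ' "$1" -
}

# Example usage: cluster-fork --mpd w | relabel map.txt
```

The MPD numbering can change between runs, so the map would need to be
refreshed alongside each job for the labels to be trustworthy.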
From: mjk at sdsc.edu (Mason J. Katz)
Date: Mon, 8 Dec 2003 12:58:22 -0800
Subject: [Rocks-Discuss]PXE and system images
In-Reply-To: <Pine.LNX.4.44.0312081226270.19031-100000@scorpion.emsl.pnl.gov>
References: <Pine.LNX.4.44.0312081226270.19031-100000@scorpion.emsl.pnl.gov>
Message-ID: <4261C250-29C1-11D8-AECB-000A95DA5638@sdsc.edu>

On Dec 8, 2003, at 12:35 PM, Tim Carlson wrote:

> 5) In our case, we now quickly disconnect the PXE interface because we
> don't want to have the machine continually install. The real ROCKS
> method would have you choose (HD/net) for booting in the BIOS, but
> if you already have an OS on your machine, you would have to go into
> the BIOS twice before the compute nodes were installed. We disable
> rocks-grub and just connect up the PXE cable if we need to reinstall.

For most boxes we've seen that support PXE there is an option to hit <F12>
to force a network PXE boot; this allows you to force a PXE boot even when a
valid OS/boot block exists on your hard disk. If you don't have this you do
indeed need to go into the BIOS twice -- a pain.

-mjk

From fds at sdsc.edu Mon Dec 8 13:26:46 2003
From: fds at sdsc.edu (Federico Sacerdoti)
Date: Mon, 8 Dec 2003 13:26:46 -0800
Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
In-Reply-To: <BC447F1AD529D311B4DE0008C71BF2EB0AE15847@phsexch7.mgh.harvard.edu>
References: <BC447F1AD529D311B4DE0008C71BF2EB0AE15847@phsexch7.mgh.harvard.edu>
Message-ID: <39CC5B05-29C5-11D8-804D-000393A4725A@sdsc.edu>

I've seen this before as well. I believe it has something to do with the way
the color "[ OK ]" characters are interacting with the ssh session from the
normal cluster-fork. We have yet to characterize this bug adequately.

-Federico

On Dec 8, 2003, at 12:55 PM, Gurgul, Dennis J. wrote:

> Thanks.
>
> On a related note, when I did "cluster-fork service gschedule restart"
> gschedule started with the "OK" output, but then the fork process hung
> on each node and I had to ^c out for it to go on to the next node.
>
> I tried to ssh to a node and then did the gschedule restart. Even
> then, after I tried to "exit" out of the node, the session hung and I
> had to log back in and kill it from the frontend.
>
>
> Dennis J. Gurgul
> Partners Health Care System
> Research Management
> Research Computing Core
> 617.724.3169
>
>
> -----Original Message-----
> From: Federico Sacerdoti [mailto:fds at sdsc.edu]
> Sent: Monday, December 08, 2003 3:51 PM
> To: Gurgul, Dennis J.
> Cc: npaci-rocks-discussion at sdsc.edu
> Subject: Re: [Rocks-Discuss]cluster-fork --mpd strangeness
>
>
> You are right, and I think this is a shortcoming of MPD. There is no
> obvious way to force the MPD numbering to correspond to the order the
> nodes were called out on the command line (cluster-fork --mpd actually
> makes a shell call to mpirun and it calls out all the node names
> explicitly). MPD seems to number the output differently, as you found
> out.
>
> So mpd for now may be more useful for jobs that are not sensitive to
> this. If enough of you find this shortcoming to be a real annoyance, we
> could work on putting the node name label on the output by explicitly
> calling "hostname" or similar.
>
> Good ideas are welcome :)
> -Federico
>
> On Dec 8, 2003, at 11:28 AM, Gurgul, Dennis J. wrote:
>
>> Maybe this is a better description of the "strangeness".
>>
>> I did "cluster-fork --mpd hostname":
>>
>> 1: compute-0-0.local
>> 2: compute-0-1.local
>> 3: compute-0-3.local
>> 4: compute-0-13.local
>> 5: compute-0-11.local
>> 6: compute-0-15.local
>> 7: compute-0-16.local
>> 8: compute-0-19.local
>> 9: compute-0-21.local
>> 10: compute-0-17.local
>> 11: compute-0-5.local
>> 12: compute-0-20.local
>> 13: compute-0-18.local
>> 14: compute-0-12.local
>> 15: compute-0-9.local
>> 16: compute-0-4.local
>> 17: compute-0-8.local
>> 18: compute-0-14.local
>> 19: compute-0-2.local
>> 20: compute-0-6.local
>> 0: compute-0-7.local
>> 21: compute-0-10.local
>>
>> Dennis J. Gurgul
>> Partners Health Care System
>> Research Management
>> Research Computing Core
>> 617.724.3169
>>
>>
>> -----Original Message-----
>> From: npaci-rocks-discussion-admin at sdsc.edu
>> [mailto:npaci-rocks-discussion-admin at sdsc.edu]On Behalf Of Gurgul,
>> Dennis J.
>> Sent: Monday, December 08, 2003 2:09 PM
>> To: npaci-rocks-discussion at sdsc.edu
>> Subject: [Rocks-Discuss]cluster-fork --mpd strangeness
>>
>>
>> I just did "cluster-fork -Uvh
>> /sourcedir/ganglia-python-3.0.1-2.i386.rpm"
>> and then "cluster-fork service gschedule restart" (not sure I had to
>> do the last).
>> I also put 3.0.1-2 and restarted gschedule on the frontend.
>>
>> Now I run "cluster-fork --mpd w".
>>
>> I currently have a user who ssh'd to compute-0-8 from the frontend and
>> one who ssh'd into compute-0-17 from the front end.
>>
>> But the return shows the users on lines for 17 (for the user on 0-8)
>> and 10 (for the user on 0-17):
>>
>> 17:   1:58pm  up 24 days,  3:20,  1 user,  load average: 0.00, 0.00,
>> 0.03
>> 17: USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU  WHAT
>> 17: lance    pts/0    rescluster2.mgh.  1:31pm  40.00s  0.02s  0.02s
>> -bash
>>
>> 10:   1:58pm  up 24 days,  3:21,  1 user,  load average: 0.02, 0.04,
>> 0.07
>> 10: USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU  WHAT
>> 10: dennis   pts/0    rescluster2.mgh.  1:57pm  17.00s  0.02s  0.02s
>> -bash
>>
>> When I do "cluster-fork w" (without the --mpd) the users show up on
>> the correct nodes.
>>
>> Do the numbers on the left of the -mpd output correspond to the node
>> names?
>>
>> Thanks.
>>
>> Dennis
>>
>> Dennis J. Gurgul
>> Partners Health Care System
>> Research Management
>> Research Computing Core
>> 617.724.3169
>>
> Federico
>
> Rocks Cluster Group, San Diego Supercomputing Center, CA
>
Federico

Rocks Cluster Group, San Diego Supercomputing Center, CA

From bruno at rocksclusters.org Mon Dec 8 15:31:08 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Mon, 8 Dec 2003 15:31:08 -0800
Subject: [Rocks-Discuss]Rocks 3.0.0 problem: not able to boot up
In-Reply-To: <BAY3-F9yOi5AgJQlDrR0002a5da@hotmail.com>
References: <BAY3-F9yOi5AgJQlDrR0002a5da@hotmail.com>
Message-ID: <9979F090-29D6-11D8-9715-000A95C4E3B4@rocksclusters.org>

> I have installed Rocks 3.0.0 with the default options successfully;
> there was no trouble. But when I boot it up, it stops at the beginning,
> just showing "GRUB" on the screen and waiting...

when you built the frontend, did you start with the rocks base CD then add
the HPC roll?

 - gb

From bruno at rocksclusters.org Mon Dec 8 15:37:46 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Mon, 8 Dec 2003 15:37:46 -0800
Subject: [Rocks-Discuss]custom-kernels : naming conventions ? (Rocks 3.0.0)
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F622357C7@tardis0.quadrics.com>
References: <30062B7EA51A9045B9F605FAAC1B4F622357C7@tardis0.quadrics.com>
Message-ID: <8700A2BE-29D7-11D8-9715-000A95C4E3B4@rocksclusters.org>

> Previously I have been installing a custom kernel on the compute nodes
> with an "extend-compute.xml" and an "/etc/init.d/qsconfigure" (to fix
> grub.conf).
>
> However I am now trying to do it the 'proper' way. So I do (on :
> # cp qsnet-RedHat-kernel-2.4.18-27.3.10qsnet.i686.rpm
> /home/install/rocks-dist/7.3/en/os/i386/force/RPMS
> # cd /home/install
> # rocks-dist dist
> # SSH_NO_PASSWD=1 shoot-node compute-0-0
>
> Hence:
> # find /home/install/ | xargs -l grep -nH qsnet
> shows me that hdlist and hdlist2 now contain this RPM. (and indeed if
> I duplicate my rpm in that directory rocks-dist notices this and warns
> me.)
>
> However the node always ends up with "2.4.20-20.7smp" again.
> anaconda-ks.cfg contains just "kernel-smp" and install.log has
> "Installing kernel-smp-2.4.20-20.7."
>
> So my question is:
> It looks like my RPM has a name that Rocks doesn't understand properly.
> What is wrong with my name ?
> and what are the rules for getting the correct name ?
> (.i686.rpm is of course correct, but I don't have -smp. in the name
> Is this the problem ?)

the anaconda installer looks for kernel packages with a specific format:

kernel-<kernel ver>-<redhat ver>.i686.rpm

and for smp nodes:

kernel-smp-<kernel ver>-<redhat ver>.i686.rpm

we have made the necessary patches to files under /usr/src/linux-2.4 in
order to produce redhat-compliant kernels. see:

http://www.rocksclusters.org/rocks-documentation/3.0.0/customization-kernel.html

also, would you be interested in making your changes for the quadrics
interconnect available to the general rocks community?

 - gb

From purikk at hotmail.com Mon Dec 8 20:23:35 2003
From: purikk at hotmail.com (purushotham komaravolu)
Date: Mon, 8 Dec 2003 23:23:35 -0500
Subject: [Rocks-Discuss]AMD Opteron
References: <200312082001.hB8K1KJ24139@postal.sdsc.edu>
Message-ID: <BAY1-DAV65Bp80SiEmA00005c14@hotmail.com>

Hello,

I am a newbie to ROCKS cluster. I wanted to set up clusters on 32-bit
architectures (Intel and AMD) and 64-bit architectures (Intel and AMD). I
found the 64-bit download for Intel on the website but not for AMD. Does it
work for AMD Opteron? If not, what is the ETA for AMD-64? We are planning
to buy AMD-64 bit machines shortly, and I would like to volunteer for the
beta testing if needed.

Thanks,
Regards,
Puru
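The kernel-package naming rule Greg Bruno describes above can be
sanity-checked before running rocks-dist. The helper below is only an
illustrative sketch of that rule (the function name and glob pattern are
mine, not from Rocks or anaconda): it accepts kernel-<ver>.i686.rpm and
kernel-smp-<ver>.i686.rpm filenames and rejects everything else, which is
why a qsnet-RedHat-kernel-* package is silently ignored.

```shell
# Sketch only: test whether a filename matches the kernel-package shape
# anaconda looks for (kernel-<ver>.i686.rpm or kernel-smp-<ver>.i686.rpm).
looks_like_kernel_rpm() {
  case "$1" in
    kernel-smp-[0-9]*.i686.rpm | kernel-[0-9]*.i686.rpm) return 0 ;;
    *) return 1 ;;
  esac
}

looks_like_kernel_rpm "kernel-smp-2.4.20-20.7.i686.rpm"   # accepted
looks_like_kernel_rpm "qsnet-RedHat-kernel-2.4.18-27.3.10qsnet.i686.rpm" \
  || echo "not a name anaconda will treat as the kernel package"
```

Renaming (or rebuilding) the custom kernel so its package name is plain
"kernel" or "kernel-smp", as the linked Rocks documentation describes, is
what actually makes the installer pick it up.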
From mjk at sdsc.edu Tue Dec 9 07:28:51 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 9 Dec 2003 07:28:51 -0800
Subject: [Rocks-Discuss]AMD Opteron
In-Reply-To: <BAY1-DAV65Bp80SiEmA00005c14@hotmail.com>
References: <200312082001.hB8K1KJ24139@postal.sdsc.edu>
	<BAY1-DAV65Bp80SiEmA00005c14@hotmail.com>
Message-ID: <6413D41A-2A5C-11D8-AECB-000A95DA5638@sdsc.edu>

We have a beta right now that we have sent to a few people. We plan on a
release this month, and AMD_64 will be part of this release along with the
usual x86, IA64 support.

If you want to help accelerate this process please talk to your vendor
about loaning/giving us some hardware for testing. Having access to a
variety of Opteron hardware (we own two boxes) is the only way we can have
good support for this chip.

-mjk

On Dec 8, 2003, at 8:23 PM, purushotham komaravolu wrote:

> Hello,
> I am a newbie to ROCKS cluster. I wanted to setup clusters on
> 32-bit Architectures( Intel and AMD) and 64-bit Architecture( Intel
> and AMD).
> I found the 64-bit download for Intel on the website but not for AMD.
> Does it work for AMD opteron? if not what is the ETA for AMD-64.
> We are planning to buy AMD-64 bit machines shortly, and I would like to
> volunteer for the beta testing if needed.
> Thanks
> Regards,
> Puru

From cdmaest at sandia.gov Tue Dec 9 07:48:31 2003
From: cdmaest at sandia.gov (Christopher D. Maestas)
Date: Tue, 09 Dec 2003 08:48:31 -0700
Subject: [Rocks-Discuss]AMD Opteron
In-Reply-To: <6413D41A-2A5C-11D8-AECB-000A95DA5638@sdsc.edu>
References: <200312082001.hB8K1KJ24139@postal.sdsc.edu>
	<BAY1-DAV65Bp80SiEmA00005c14@hotmail.com>
	<6413D41A-2A5C-11D8-AECB-000A95DA5638@sdsc.edu>
Message-ID: <1070984911.19042.12.camel@capdesk.sandia.gov>

What do I have to do to sign up to test? We have Opteron systems we can
test on here.

On Tue, 2003-12-09 at 08:28, Mason J. Katz wrote:

> We have a beta right now that we have sent to a few people. We plan on
> a release this month, and AMD_64 will be part of this release along
> with the usual x86, IA64 support.
>
> If you want to help accelerate this process please talk to your vendor
> about loaning/giving us some hardware for testing. Having access to a
> variety of Opteron hardware (we own two boxes) is the only way we can
> have good support for this chip.
>
> -mjk
>
>
> On Dec 8, 2003, at 8:23 PM, purushotham komaravolu wrote:
>
> > Hello,
> > I am a newbie to ROCKS cluster. I wanted to setup clusters on
> > 32-bit Architectures( Intel and AMD) and 64-bit Architecture( Intel
> > and AMD).
> > I found the 64-bit download for Intel on the website but not for AMD.
> > Does it work for AMD opteron? if not what is the ETA for AMD-64.
> > We are planning to buy AMD-64 bit machines shortly, and I would like
> > to volunteer for the beta testing if needed.
> > Thanks
> > Regards,
> > Puru

From vincent_b_fox at yahoo.com Tue Dec 9 11:10:40 2003
From: vincent_b_fox at yahoo.com (Vincent Fox)
Date: Tue, 9 Dec 2003 11:10:40 -0800 (PST)
Subject: [Rocks-Discuss]ATLAS rpm build problems on PII platform
Message-ID: <20031209191040.71171.qmail@web14811.mail.yahoo.com>

I tried doing a rebuild of the ATLAS libraries on a PII test cluster and no
go. I did an "export PATH=/opt/gcc32/bin:$PATH" first to make it easy on
myself.

The "make rpm" appears to get stuck in a loop on the xconfig part. I pause
it and it seems like the prompt is defining f77=-O and f77 FLAGS=y, which
doesn't work of course. My guess is that the spec file doesn't have an
answer for a previous question, so the /usr/bin/g77 answer is getting set
for the previous prompt, and since no f77 is defined, it gets stuck.

Anyhow, thought I would note this problem on the list for those more
qualified to address it.

__________________________________
Do you Yahoo!?
New Yahoo! Photos - easier uploading and sharing.
http://photos.yahoo.com/

From bryan at UCLAlumni.net Tue Dec 9 12:14:16 2003
From: bryan at UCLAlumni.net (Bryan Littlefield)
Date: Tue, 09 Dec 2003 12:14:16 -0800
Subject: [Rocks-Discuss] AMD Opteron - Contact Appro
In-Reply-To: <200312091531.hB9FV9J12694@postal.sdsc.edu>
References: <200312091531.hB9FV9J12694@postal.sdsc.edu>
Message-ID: <3FD62D18.7010208@UCLAlumni.net>

Hi Mason,

I suggest contacting Appro. We are using Rocks on our Opteron cluster, and
Appro would likely love to help. I will contact them as well to see if they
could help with getting an Opteron machine for testing. Contact info below:

Jian Chang - Regional Sales Manager
(408) 941-8100 x 202
(800) 927-5464 x 202
(408) 941-8111 Fax
jian at appro.com
http://www.appro.com

Thanks,
--Bryan

npaci-rocks-discussion-request at sdsc.edu wrote:

>From: "Mason J. Katz" <mjk at sdsc.edu>
>Subject: Re: [Rocks-Discuss]AMD Opteron
>Date: Tue, 9 Dec 2003 07:28:51 -0800
>To: "purushotham komaravolu" <purikk at hotmail.com>
>Cc: <npaci-rocks-discussion at sdsc.edu>
>
>We have a beta right now that we have sent to a few people. We plan on
>a release this month, and AMD_64 will be part of this release along
>with the usual x86, IA64 support.
>
>If you want to help accelerate this process please talk to your vendor
>about loaning/giving us some hardware for testing. Having access to a
>variety of Opteron hardware (we own two boxes) is the only way we can
>have good support for this chip.
>
> -mjk
>
>
>On Dec 8, 2003, at 8:23 PM, purushotham komaravolu wrote:
>
>>Hello,
>> I am a newbie to ROCKS cluster. I wanted to setup clusters on
>>32-bit Architectures( Intel and AMD) and 64-bit Architecture( Intel and
>>AMD).
>>I found the 64-bit download for Intel on the website but not for AMD.
>>Does it work for AMD opteron? if not what is the ETA for AMD-64.
>>We are planning to buy AMD-64 bit machines shortly, and I would like to
>>volunteer for the beta testing if needed.
>>Thanks
>>Regards,
>>Puru
>>
>
>_______________________________________________
>npaci-rocks-discussion mailing list
>npaci-rocks-discussion at sdsc.edu
>http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
>
>
>End of npaci-rocks-discussion Digest
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031209/611e65b4/attachment-0001.html

From vincent_b_fox at yahoo.com Tue Dec 9 13:22:59 2003
From: vincent_b_fox at yahoo.com (Vincent Fox)
Date: Tue, 9 Dec 2003 13:22:59 -0800 (PST)
Subject: [Rocks-Discuss]ATLAS rpm build problems on PII platform
Message-ID: <20031209212259.39587.qmail@web14810.mail.yahoo.com>

Okay, I came up with my own quick hack: edit atlas.spec.in, go to the
"other x86" section, and remove the 2 lines right above "linux"; it seems
to make the rpm now. A more formal patch would be to put in a section for
cpuid eq 4 with this correction, I suppose.

From landman at scalableinformatics.com Tue Dec 9 13:49:06 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 09 Dec 2003 16:49:06 -0500
Subject: [Rocks-Discuss]Has anyone tried Gaussian binary only on the ROCKS 3.1.0 beta?
Message-ID: <1071006546.18100.46.camel@squash.scalableinformatics.com>

Hi Folks

Working on building the same cluster from last week. The admin nodes are up
and functional (plain old RH9+XFS). I want to get the head nodes up, with
one of the requirements being running the Gaussian binary-only code.
Gaussian's page lists RH9.0 support, so I wanted to see if someone has
tried the beta with this code.

Thanks.
    Joe -- Joseph Landman, Ph.D ScalableInformatics LLC, email: landman at scalableinformatics.com web : http://scalableinformatics.com phone: +1 734 612 4615 From landman at scalableinformatics.com Tue Dec 9 13:59:37 2003 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 09 Dec 2003 16:59:37 -0500 Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets ... Message-ID: <1071007177.18100.58.camel@squash.scalableinformatics.com> Folks: As indicated previously, I am wrestling with a Supermicro based cluster. None of the RH distributions come with the correct E1000 driver, so a new kernel is needed (in the boot CD, and for installation). The problem I am running into is that it isn't at all obvious/easy how to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable this thing to work. Following the examples in the documentation have not met with success. Running "rocks-dist cdrom" with the new kernels (2.4.23 works nicely on the nodes) in the force/RPMS directory generates a bootable CD with the original 2.4.18BOOT kernel. What I (and I think others) need, is a simple/easy to follow method that will generate a bootable CD with the correct linux kernel, and the correct modules. Is this in process somewhere? What would be tremendously helpful is if we can generate a binary module, and put that into the boot process by placing it into the force/modules/binary directory (assuming one exists) with the appropriate entry of a similar name in the force/modules/meta directory as a simple XML document giving pci-ids, description, name, etc. Anything close to this coming? Modules are killing future ROCKS installs, the inability to easily inject a new module in there has created a problem whereby ROCKS does not function (as the underlying RH does not function). -- Joseph Landman, Ph.D Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://scalableinformatics.com phone: +1 734 612 4615
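As background to the driver-mapping problem Joe describes, the installer decides which module to load for a card by looking up the device's PCI vendor/device IDs in a pcitable-style file. The sketch below parses that format; it is illustrative only, not anaconda's actual code, and the second entry's description is an assumption (based on the 8086:1013 card discussed later in this digest):

```python
# Illustrative parser for Red Hat pcitable-style lines, which map a PCI
# vendor/device ID pair to a kernel driver module name. The two entries
# below are examples (the 82545EM line appears verbatim elsewhere in this
# digest; the 82541EI description is an assumption), not a complete table.
PCITABLE = '''\
0x8086\t0x100f\t"e1000"\t"Intel Corp. 82545EM Gigabit Ethernet Controller rev (01)"
0x8086\t0x1013\t"e1000"\t"Intel Corp. 82541EI Gigabit Ethernet Controller"
'''

def parse_pcitable(text):
    """Return a {(vendor, device): module} dict from pcitable text."""
    table = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        vendor, device, module = parts[0], parts[1], parts[2].strip('"')
        table[(vendor.lower(), device.lower())] = module
    return table

def module_for(table, vendor, device):
    """Look up the driver module for a PCI id, or None if unlisted."""
    return table.get((vendor.lower(), device.lower()))

table = parse_pcitable(PCITABLE)
print(module_for(table, "0x8086", "0x1013"))  # -> e1000
```

A device missing from the table (like the new cards reported in this thread) simply gets no module, which is why injecting both the module binary and its table entry matters.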
From tim.carlson at pnl.gov Tue Dec 9 14:11:43 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Tue, 09 Dec 2003 14:11:43 -0800 (PST)
Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets ...
In-Reply-To: <1071007177.18100.58.camel@squash.scalableinformatics.com>
Message-ID: <Pine.GSO.4.44.0312091406080.17458-100000@paradox.emsl.pnl.gov>

On Tue, 9 Dec 2003, Joe Landman wrote:

> The problem I am running into is that it isn't at all obvious/easy how
> to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable
> this thing to work. Following the examples in the documentation have
> not met with success. Running "rocks-dist cdrom" with the new kernels
> (2.4.23 works nicely on the nodes) in the force/RPMS directory generates
> a bootable CD with the original 2.4.18BOOT kernel.

So you built a 2.4.23BOOT rpm? The problem people have is with the naming
convention of kernels. A kernel.org spec file isn't going to generate
proper kernel rpms IMHO. What you really want to do (and maybe you are
already doing this) is steal the bit of the Redhat spec building scripts
that generate the -smp .i686 and BOOT rpms.

New hardware is tough for any distro.

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support

From tmartin at physics.ucsd.edu Tue Dec 9 15:57:17 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Tue, 09 Dec 2003 15:57:17 -0800
Subject: [Rocks-Discuss]Intel MT based Gigabit controllers
Message-ID: <3FD6615D.8090200@physics.ucsd.edu>

Does Rocks 3.0 support the Intel MT based Gigabit controllers (PCI
8086:1013) without any modifications? My new cluster has these new
controllers. Rocks 2.3.1 does not seem to detect/drive these cards
correctly (the install fails to detect them and the e1000 driver does not
seem to work).
So I was going to go ahead and move my new head node to 3.0.0 and was wondering if I am going to have to do additional work to get the intel drivers on the boot image (for cluster nodes) to have the working Intel driver with these cards. Terrence From tmartin at physics.ucsd.edu Tue Dec 9 15:59:29 2003 From: tmartin at physics.ucsd.edu (Terrence Martin) Date: Tue, 09 Dec 2003 15:59:29 -0800 Subject: [Rocks-Discuss]how to include custom driver In-Reply-To: <Pine.GSO.4.44.0306092142150.18083-100000@poincare.emsl.pnl.gov>
    References: <Pine.GSO.4.44.0306092142150.18083-100000@poincare.emsl.pnl.gov> Message-ID: <3FD661E1.90307@physics.ucsd.edu> TimCarlson wrote: > On Mon, 9 Jun 2003, Greg Bruno wrote: > > >>what driver did you have to add? >> >>we may be able to provide a patch for your compute nodes. > > > Ah!!!.. I didn't see this repsonse before I sent off my reply to Matthew. > Can I please have the aic79xx driver and while your at it can I get a > module-info file that has this entry for gigabit? Not sure if it is > already in there? ;) > > 0x8086 0x100f "e1000" "Intel Corp. 82545EM Gigabit Ethernet Controller rev (01)" > > It is also quite possible that I burned the 2.3.0 media instead of > 2.3.2. It was late in the day when I tried to do my install. > > Tim > > Tim Carlson > Voice: (509) 376 3423 > Email: Tim.Carlson at pnl.gov > EMSL UNIX System Support I would also like to request that this driver/change be made. I have a cluster with these newer Intel gigabit chipsets. Terrence From tmartin at physics.ucsd.edu Tue Dec 9 16:33:18 2003 From: tmartin at physics.ucsd.edu (Terrence Martin) Date: Tue, 09 Dec 2003 16:33:18 -0800 Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets ... In-Reply-To: <Pine.GSO.4.44.0312091406080.17458-100000@paradox.emsl.pnl.gov> References: <Pine.GSO.4.44.0312091406080.17458-100000@paradox.emsl.pnl.gov> Message-ID: <3FD669CE.1070700@physics.ucsd.edu> Tim Carlson wrote: > On Tue, 9 Dec 2003, Joe Landman wrote: > > >> The problem I am running into is that it isn't at all obvious/easy how >>to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable >>this thing to work. Following the examples in the documentation have >>not met with success. Running "rocks-dist cdrom" with the new kernels >>(2.4.23 works nicely on the nodes) in the force/RPMS directory generates >>a bootable CD with the original 2.4.18BOOT kernel. > > > So you built a 2.4.23BOOT rpm? The problem people have is with the naming
    > convention of kernels. A kernel.org spec file isn't going to generate > proper kernel rpms IMHO. What you really want to do (and maybe you are > already doing this) is steal the bit of the Redhat spec building scripts > that generage the -smp .i686 and BOOT rpms. > > New hardware is tough for any distro. > > Tim > > Tim Carlson > Voice: (509) 376 3423 > Email: Tim.Carlson at pnl.gov > EMSL UNIX System Support > Where do you start if you want to update the PXE boot image to support a new kernel? Terrence From tmartin at physics.ucsd.edu Tue Dec 9 16:58:08 2003 From: tmartin at physics.ucsd.edu (Terrence Martin) Date: Tue, 09 Dec 2003 16:58:08 -0800 Subject: [Rocks-Discuss]Could not allocate requested partitions Message-ID: <3FD66FA0.5070401@physics.ucsd.edu> I am getting the following error when trying to install a Rocks 3.0.0 headnode. The headnode works find in rocks 2.3.2. Could not allocate requested partitions: Partitioning failed: Could not allocate partitions as primary partitions What is also odd is when I alt-f2 and run fdisk /dev/hda it tells me it cannot find that device (unable to open /dev/hda). However when I watch the boot messages hda definitely comes up. Also the headnode works fine with 2.3.2. Any ideas? Terrence From tmartin at physics.ucsd.edu Tue Dec 9 17:33:24 2003 From: tmartin at physics.ucsd.edu (Terrence Martin) Date: Tue, 09 Dec 2003 17:33:24 -0800 Subject: [Rocks-Discuss]Could not allocate requested partitions In-Reply-To: <3FD66FA0.5070401@physics.ucsd.edu> References: <3FD66FA0.5070401@physics.ucsd.edu> Message-ID: <3FD677E4.8050806@physics.ucsd.edu> Terrence Martin wrote: > I am getting the following error when trying to install a Rocks 3.0.0 > headnode. The headnode works find in rocks 2.3.2. >
    > Could not allocate requested partitions: Partitioning failed: Could not > allocate partitions as primary partitions > > What is also odd is when I alt-f2 and run fdisk /dev/hda it tells me it > cannot find that device (unable to open /dev/hda). However when I watch > the boot messages hda definitely comes up. Also the headnode works fine > with 2.3.2. > > Any ideas? > > Terrence > > > Figured it out, aparently rocks 3.0.0 did not like my partitions from rocks 2.3.2. I booted knoppix, blew away the partition table and so far so good on the head node. Terrence From mjk at sdsc.edu Tue Dec 9 17:54:01 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Tue, 9 Dec 2003 17:54:01 -0800 Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets ... In-Reply-To: <1071007177.18100.58.camel@squash.scalableinformatics.com> References: <1071007177.18100.58.camel@squash.scalableinformatics.com> Message-ID: <BA0ADEC6-2AB3-11D8-981C-000A95DA5638@sdsc.edu> If the underlying RedHat doesn't support your hardware you are pretty much dead in the water. We do at times include drivers that RH does not but this is an exception and only for hardware we physically have access to. The rocks-boot (rocks/src/rock/boot in CVS) package controls the boot kernel and module selection. You can look into this to see what it would take to add your own module. We do plan on refining and documenting this not for several months. We also have some very good idea on how we can track this faster than RH, but again nothing coming in the next few months. To continue my earlier rant for today, until more hardware vendors start taking the linux market place seriously buying bleeding edge hardware and CPUs is asking for problems. It takes several months for any new hardware to become supported by RedHat and several years for any new CPU to be supported well. This isn't killing future Rocks installs, it's just correctly delaying them until the underlying OS supports the hardware. 
-mjk On Dec 9, 2003, at 1:59 PM, Joe Landman wrote: > Folks: > > As indicated previously, I am wrestling with a Supermicro based > cluster. None of the RH distributions come with the correct E1000 > driver, so a new kernel is needed (in the boot CD, and for
    > installation). > > The problem I am running into is that it isn't at all obvious/easy > how > to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable > this thing to work. Following the examples in the documentation have > not met with success. Running "rocks-dist cdrom" with the new kernels > (2.4.23 works nicely on the nodes) in the force/RPMS directory > generates > a bootable CD with the original 2.4.18BOOT kernel. > > What I (and I think others) need, is a simple/easy to follow method > that will generate a bootable CD with the correct linux kernel, and the > correct modules. > > Is this in process somewhere? What would be tremendously helpful is > if we can generate a binary module, and put that into the boot process > by placing it into the force/modules/binary directory (assuming one > exists) with the appropriate entry of a similar name in the > force/modules/meta directory as a simple XML document giving pci-ids, > description, name, etc. > > Anything close to this coming? Modules are killing future ROCKS > installs, the inability to easily inject a new module in there has > created a problem whereby ROCKS does not function (as the underlying RH > does not function). > > > > -- > Joseph Landman, Ph.D > Scalable Informatics LLC, > email: landman at scalableinformatics.com > web : http://scalableinformatics.com > phone: +1 734 612 4615 From gotero at linuxprophet.com Tue Dec 9 18:02:23 2003 From: gotero at linuxprophet.com (gotero at linuxprophet.com) Date: Tue, 09 Dec 2003 18:02:23 -0800 (PST) Subject: [Rocks-Discuss]custom-kernels : naming conventions ? (Rocks 3.0.0) Message-ID: <20031209180224.24711.h014.c001.wm@mail.linuxprophet.com.criticalpath.net> Daniel- I recently had the same problem when building a quadrics cluster on Rocks 2.3.2 with the qsnet-RedHat-kernel-2.4.18-27.3.4qsnet.i686.rpms. 
The problem is definitely in the naming of the rpms, in that anaconda
running on the compute nodes is not going to recognize kernel rpms that
begin with 'qsnet' as potential boot options. Unfortunately, being under a
severe time constraint, I resorted to manually installing the qsnet kernel
on all nodes of the cluster, which isn't the Rocks way. The long-term
solution is to mangle the kernel makefiles so that the qsnet kernel rpms
have conventional kernel rpm names, which is what Greg's post referred to.

Glen
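The naming issue described here can be sketched as a quick filename check: anaconda only considers packages whose RPM *name* field is an actual kernel name, so `qsnet-RedHat-kernel-...` is never a candidate. This is a rough heuristic, not anaconda's real selection logic, and the list of accepted package names is an assumption:

```python
# Heuristic sketch of why a renamed kernel rpm is ignored at install time.
# RPM filenames follow NAME-VERSION-RELEASE.ARCH.rpm; anaconda picks
# kernels by NAME. The RECOGNIZED list below is an assumption, not the
# exact list anaconda uses.
RECOGNIZED = ("kernel", "kernel-smp", "kernel-BOOT")

def rpm_name(filename):
    """Extract NAME from 'NAME-VERSION-RELEASE.ARCH.rpm',
    e.g. 'kernel-smp-2.4.20-20.7.i686.rpm' -> 'kernel-smp'."""
    base = filename.rsplit(".rpm", 1)[0]
    base = base.rsplit(".", 1)[0]        # drop arch (i686, athlon, ...)
    return base.rsplit("-", 2)[0]        # drop VERSION and RELEASE

def is_bootable_kernel(filename):
    return rpm_name(filename) in RECOGNIZED

print(is_bootable_kernel("kernel-smp-2.4.20-20.7.i686.rpm"))
# -> True
print(is_bootable_kernel("qsnet-RedHat-kernel-2.4.18-27.3.10qsnet.i686.rpm"))
# -> False
```

This is why the fix discussed in the thread is to rebuild the vendor kernel under a conventional `kernel*` name rather than to keep dropping the renamed rpm into force/RPMS.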
    On Mon, 8Dec 2003 17:54:53 -0000, daniel.kidger at quadrics.com wrote: > > Dear all, > Previously I have been installing a custom kernel on the compute nodes > with an "extend-compute.xml" and an "/etc/init.d/qsconfigure" (to fix grub.conf). > > However I am now trying to do it the 'proper' way. So I do (on : > # cp qsnet-RedHat-kernel-2.4.18-27.3.10qsnet.i686.rpm > /home/install/rocks-dist/7.3/en/os/i386/force/RPMS > # cd /home/install > # rocks-dist dist > # SSH_NO_PASSWD=1 shoot-node compute-0-0 > > Hence: > # find /home/install/ |xargs -l grep -nH qsnet > shows me that hdlist and hdlist2 now contain this RPM. (and indeed If I > duplicate my rpm in that directory rocks-dist notices this and warns me.) > > However the node always ends up with "2.4.20-20.7smp" again. > anaconda-ks.cfg contains just "kernel-smp" and install.log has "Installing > kernel-smp-2.4.20-20.7." > > So my question is: > It looks like my RPM has a name that Rocks doesn't understand properly. > What is wrong with my name ? > and what are the rules for getting the correct name ? > (.i686.rpm is of course correct, but I don't have -smp. in the name Is this > the problem ?) > > cf. Greg Bruno's wisdom: > https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-April/001770.html > > > Yours, > Daniel. > > -------------------------------------------------------------- > Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com > One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 > ----------------------- www.quadrics.com -------------------- > > > Glen Otero, Ph.D. Linux Prophet From gotero at linuxprophet.com Tue Dec 9 18:05:04 2003 From: gotero at linuxprophet.com (gotero at linuxprophet.com) Date: Tue, 09 Dec 2003 18:05:04 -0800 (PST) Subject: [Rocks-Discuss]Could not allocate requested partitions Message-ID: <20031209180504.716.h014.c001.wm@mail.linuxprophet.com.criticalpath.net>
    On Tue, 09Dec 2003 17:33:24 -0800, Terrence Martin wrote: > > Terrence Martin wrote: > > I am getting the following error when trying to install a Rocks 3.0.0 > > headnode. The headnode works find in rocks 2.3.2. > > > > Could not allocate requested partitions: Partitioning failed: Could not > > allocate partitions as primary partitions > > > > What is also odd is when I alt-f2 and run fdisk /dev/hda it tells me it > > cannot find that device (unable to open /dev/hda). However when I watch > > the boot messages hda definitely comes up. Also the headnode works fine > > with 2.3.2. > > > > Any ideas? > > > > Terrence > > > > > > > > Figured it out, aparently rocks 3.0.0 did not like my partitions from > rocks 2.3.2. I booted knoppix, blew away the partition table and so far > so good on the head node. I had the same problem with moving from 2.3.2 to 3.1. I'll try your solution. Glen > > Terrence Glen Otero, Ph.D. Linux Prophet From jorge at phys.ufl.edu Tue Dec 9 18:55:02 2003 From: jorge at phys.ufl.edu (Jorge L. Rodriguez) Date: Tue, 09 Dec 2003 21:55:02 -0500 Subject: [Rocks-Discuss]Adding partitions that are not reformatted under hard boots or shoot-node Message-ID: <3FD68B06.9010709@phys.ufl.edu> Hi, How do I add an extra partition to my compute nodes and retain the data on all non / partitions when system hard boots or is shot? I tried the suggestion in the documentation under "Customizing your ROCKS Installation" where you replace the auto-partition.xml but hard boots or shoot-nodes on these reformat all partitions instead of just the /. I have also tried to modify the installclass.xml so that an extra partition is added into the python code see below. This does mostly what I want but now I can't shoot-node even though a hard boot reinstalls without reformatting all but /. Is this the right approach? I'd rather avoid having to replace installclass since I don't really want to partition all nodes this way but if I must I will. Jorge
    # # set up the root partition # args = [ "/" , "--size" , "4096", "--fstype", "&fstype;", "--ondisk", devnames[0] ] KickstartBase.definePartition(self, id, args) # ---- Jorge, I added this args args = [ "/state/partition1" , "--size" , "55000", "--fstype", "&fstype;", "--ondisk", devnames[0] ] KickstartBase.definePartition(self, id, args) # ----- args = [ "swap" , "--size" , "1000", "--ondisk", devnames[0] ] KickstartBase.definePartition(self, id, args) # # greedy partitioning # # ----- Jorge, I change this from i = 1 i = 2 # ----- for devname in devnames: partname = "/state/partition%d" % (i) args = [ partname, "--size", "1", "--fstype", "&fstype;", "--grow", "--ondisk", devname ] KickstartBase.definePartition(self, id, args) i = i + 1 From bruno at rocksclusters.org Tue Dec 9 22:43:04 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Tue, 9 Dec 2003 22:43:04 -0800 Subject: [Rocks-Discuss]ATLAS rpm build problems on PII platform In-Reply-To: <20031209212259.39587.qmail@web14810.mail.yahoo.com> References: <20031209212259.39587.qmail@web14810.mail.yahoo.com> Message-ID: <1B097BEE-2ADC-11D8-9715-000A95C4E3B4@rocksclusters.org> > Okay, came up my own quick hack: > > Edit atlas.spec.in, go to "other x86" section, remove > 2 lines right above "linux", seems to make rpm now. > > A more formal patch would be put in a section for > cpuid eq 4 with this correction I suppose. if you provide the patch, we'll include it in our next release. - gb
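Jorge's installclass snippet above can be dry-run outside of anaconda to inspect the layout it produces. In the sketch below, the KickstartBase.definePartition() calls are replaced by collecting the argument lists, and the fstype value is a stand-in for the &fstype; entity Rocks substitutes at kickstart-generation time:

```python
# Stand-alone dry run of Jorge's modified partition scheme. The real code
# calls KickstartBase.definePartition(); here we just collect the argument
# lists so the resulting layout can be inspected. "ext3" stands in for the
# &fstype; entity, and devnames for the installclass's disk list.
def build_partitions(devnames, fstype="ext3"):
    parts = []
    # root partition
    parts.append(["/", "--size", "4096", "--fstype", fstype,
                  "--ondisk", devnames[0]])
    # Jorge's extra fixed-size data partition
    parts.append(["/state/partition1", "--size", "55000", "--fstype",
                  fstype, "--ondisk", devnames[0]])
    parts.append(["swap", "--size", "1000", "--ondisk", devnames[0]])
    # greedy partitioning: one grow partition per disk, numbered from 2
    # because partition1 is now taken
    i = 2
    for devname in devnames:
        parts.append(["/state/partition%d" % i, "--size", "1",
                      "--fstype", fstype, "--grow", "--ondisk", devname])
        i += 1
    return parts

for p in build_partitions(["hda", "hdb"]):
    print(p[0])
```

On a two-disk node this yields /, /state/partition1, swap, and grow partitions /state/partition2 and /state/partition3, matching the numbering change (i = 1 to i = 2) in the quoted code.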
    From tlw atcs.unm.edu Tue Dec 9 23:23:43 2003 From: tlw at cs.unm.edu (Tiffani Williams) Date: Wed, 10 Dec 2003 00:23:43 -0700 Subject: [Rocks-Discuss]PBS errors Message-ID: <3FD6C9FF.60603@cs.unm.edu> Hello, I am trying to submit a job through PBS, but I receive 2 errors. The first error is Job cannot be executed See job standard error file The second error is that the standard error file cannot be written into my home directory. I downloaded the sample script at http://rocks.npaci.edu/papers/rocks-documentation/launching-batch-jobs.html and have tried a more simple script with PBS directives and echo commands. I do not know what I am doing wrong? I have used PBS successfully on other clusters. Does anyone have any suggestions? Tiffani From bruno at rocksclusters.org Tue Dec 9 23:35:59 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Tue, 9 Dec 2003 23:35:59 -0800 Subject: [Rocks-Discuss]PBS errors In-Reply-To: <3FD6C9FF.60603@cs.unm.edu> References: <3FD6C9FF.60603@cs.unm.edu> Message-ID: <7F75D3D2-2AE3-11D8-9715-000A95C4E3B4@rocksclusters.org> > I am trying to submit a job through PBS, but I receive 2 errors. The > first error is > Job cannot be executed > See job standard error file > > The second error is that the standard error file cannot be written > into my home directory. > I downloaded the sample script at > > http://rocks.npaci.edu/papers/rocks-documentation/launching-batch- > jobs.html > and have tried a more simple script with PBS directives and echo > commands. >
    > I donot know what I am doing wrong? I have used PBS successfully on > other clusters. > > Does anyone have any suggestions? can you login to the compute nodes successfully? if not, try restarting autofs on all the compute nodes. on the frontend, execute: # ssh-agent $SHELL # ssh-add # cluster-fork "/etc/rc.d/init.d/autofs restart" we've found the startup of autofs to be flaky at times. - gb From tlw at cs.unm.edu Wed Dec 10 00:03:13 2003 From: tlw at cs.unm.edu (Tiffani Williams) Date: Wed, 10 Dec 2003 01:03:13 -0700 Subject: [Rocks-Discuss]PBS errors In-Reply-To: <7F75D3D2-2AE3-11D8-9715-000A95C4E3B4@rocksclusters.org> References: <3FD6C9FF.60603@cs.unm.edu> <7F75D3D2-2AE3-11D8-9715-000A95C4E3B4@rocksclusters.org> Message-ID: <3FD6D341.5070501@cs.unm.edu> >> I am trying to submit a job through PBS, but I receive 2 errors. >> The first error is >> Job cannot be executed >> See job standard error file >> >> The second error is that the standard error file cannot be written >> into my home directory. >> I downloaded the sample script at >> >> http://rocks.npaci.edu/papers/rocks-documentation/launching-batch- >> jobs.html >> and have tried a more simple script with PBS directives and echo >> commands. >> >> I do not know what I am doing wrong? I have used PBS successfully >> on other clusters. >> >> Does anyone have any suggestions? > > > can you login to the compute nodes successfully? > > if not, try restarting autofs on all the compute nodes. on the > frontend, execute: > > # ssh-agent $SHELL > # ssh-add > > # cluster-fork "/etc/rc.d/init.d/autofs restart"
    > > we've foundthe startup of autofs to be flaky at times. > > - gb Do these commands have to be run by an administrator? If so, I do not have such privileges. I can ssh to the compute nodes, but I am denied entry. Am I supposed to be able to login to a compute node as a user. Tiffani From bruno at rocksclusters.org Wed Dec 10 06:37:05 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Wed, 10 Dec 2003 06:37:05 -0800 Subject: [Rocks-Discuss]PBS errors In-Reply-To: <3FD6D341.5070501@cs.unm.edu> References: <3FD6C9FF.60603@cs.unm.edu> <7F75D3D2-2AE3-11D8-9715-000A95C4E3B4@rocksclusters.org> <3FD6D341.5070501@cs.unm.edu> Message-ID: <53451392-2B1E-11D8-9715-000A95C4E3B4@rocksclusters.org> On Dec 10, 2003, at 12:03 AM, Tiffani Williams wrote: > >>> I am trying to submit a job through PBS, but I receive 2 errors. >>> The first error is >>> Job cannot be executed >>> See job standard error file >>> >>> The second error is that the standard error file cannot be written >>> into my home directory. >>> I downloaded the sample script at >>> >>> http://rocks.npaci.edu/papers/rocks-documentation/launching-batch- >>> jobs.html >>> and have tried a more simple script with PBS directives and echo >>> commands. >>> >>> I do not know what I am doing wrong? I have used PBS successfully >>> on other clusters. >>> >>> Does anyone have any suggestions? >> >> >> can you login to the compute nodes successfully? >> >> if not, try restarting autofs on all the compute nodes. on the >> frontend, execute: >> >> # ssh-agent $SHELL >> # ssh-add >> >> # cluster-fork "/etc/rc.d/init.d/autofs restart" >> >> we've found the startup of autofs to be flaky at times. >>
    >> - gb > > >Do these commands have to be run by an administrator? If so, I do not > have such privileges. I can ssh to the compute nodes, but I am denied > entry. Am I supposed to be able to login to a compute node as a user. yes, you need to be 'root'. it appears your home directory is not being mounted when you login -- have your administrator run the commands above. - gb From mjk at sdsc.edu Wed Dec 10 07:20:47 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Wed, 10 Dec 2003 07:20:47 -0800 Subject: [Rocks-Discuss]PBS errors In-Reply-To: <53451392-2B1E-11D8-9715-000A95C4E3B4@rocksclusters.org> References: <3FD6C9FF.60603@cs.unm.edu> <7F75D3D2-2AE3-11D8-9715-000A95C4E3B4@rocksclusters.org> <3FD6D341.5070501@cs.unm.edu> <53451392-2B1E-11D8-9715-000A95C4E3B4@rocksclusters.org> Message-ID: <6E659550-2B24-11D8-981C-000A95DA5638@sdsc.edu> This is most likely the dreaded NIS-crash. You'll need to restart the ypserver on the frontend and the ypbind daemon on all the nodes. We've seen this on our clusters maybe 4 times (on production systems) in the last several years. Others have seen this on a weekly basis. This is why NIS is dead in Rocks 3.1 - it served us reasonably well but never matured to a stable system. -mjk On Dec 10, 2003, at 6:37 AM, Greg Bruno wrote: > > On Dec 10, 2003, at 12:03 AM, Tiffani Williams wrote: > >> >>>> I am trying to submit a job through PBS, but I receive 2 errors. >>>> The first error is >>>> Job cannot be executed >>>> See job standard error file >>>> >>>> The second error is that the standard error file cannot be written >>>> into my home directory. >>>> I downloaded the sample script at >>>> >>>> http://rocks.npaci.edu/papers/rocks-documentation/launching-batch- >>>> jobs.html >>>> and have tried a more simple script with PBS directives and echo >>>> commands. >>>> >>>> I do not know what I am doing wrong? I have used PBS successfully >>>> on other clusters. >>>>
    >>>> Does anyonehave any suggestions? >>> >>> >>> can you login to the compute nodes successfully? >>> >>> if not, try restarting autofs on all the compute nodes. on the >>> frontend, execute: >>> >>> # ssh-agent $SHELL >>> # ssh-add >>> >>> # cluster-fork "/etc/rc.d/init.d/autofs restart" >>> >>> we've found the startup of autofs to be flaky at times. >>> >>> - gb >> >> >> Do these commands have to be run by an administrator? If so, I do not >> have such privileges. I can ssh to the compute nodes, but I am >> denied entry. Am I supposed to be able to login to a compute node as >> a user. > > yes, you need to be 'root'. > > it appears your home directory is not being mounted when you login -- > have your administrator run the commands above. > > - gb From vincent_b_fox at yahoo.com Wed Dec 10 07:59:14 2003 From: vincent_b_fox at yahoo.com (Vincent Fox) Date: Wed, 10 Dec 2003 07:59:14 -0800 (PST) Subject: [Rocks-Discuss]one node short in "labels" Message-ID: <20031210155914.55789.qmail@web14812.mail.yahoo.com> So I go to the "labels" selection on the web page to print out the pretty labels. What a nice idea by the way! EXCEPT....it's one node short! I go up to 0-13 and this stops at 0-12. Any ideas where I should check to fix this? --------------------------------- Do you Yahoo!? New Yahoo! Photos - easier uploading and sharing -------------- next part -------------- An HTML attachment was scrubbed... URL: https://lists.sdsc.edu/pipermail/npaci-rocks- discussion/attachments/20031210/c5bf5e79/attachment-0001.html From cdwan at mail.ahc.umn.edu Wed Dec 10 12:04:53 2003 From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB)) Date: Wed, 10 Dec 2003 14:04:53 -0600 (CST) Subject: [Rocks-Discuss]Non-homogenous legacy hardware Message-ID: <Pine.GSO.4.58.0312101359380.22@lenti.med.umn.edu>
I am integrating legacy systems into a ROCKS cluster, and have hit a snag
with the auto-partition configuration: The new (old) systems have SCSI
disks, while the old (new) ones contain IDE. This is a non-issue so long as
the initial install does its default partitioning. However, I have a
"replace-auto-partition.xml" file which is unworkable for the SCSI-based
systems since it makes specific reference to "hda" rather than "sda."

I would like to have a site-nodes/replace-auto-partition.xml file with a
conditional such that "hda" or "sda" is used, based on the name of the node
(or some other criterion). Is this possible?

Thanks, in advance. If this is out there on the mailing list archives, a
pointer would be greatly appreciated.

-Chris Dwan
 The University of Minnesota

From tmartin at physics.ucsd.edu Wed Dec 10 12:09:11 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Wed, 10 Dec 2003 12:09:11 -0800
Subject: [Rocks-Discuss]Error during Make when building a new install floppy
Message-ID: <3FD77D67.7000708@physics.ucsd.edu>

I get the following error when I try to rebuild a boot floppy for rocks.
This is with the default CVS checkout with an update today according to the
rocks userguide. I have not actually attempted to make any changes.

make[3]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3/loader'
make[2]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3'
strip -o loader anaconda-7.3/loader/loader
strip: anaconda-7.3/loader/loader: No such file or directory
make[1]: *** [loader] Error 1
make[1]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader'
make: *** [loader] Error 2

Of course I could avoid all of this altogether and just put my binary
module into the appropriate location in the boot image.

Would it be correct to modify the following image file with my changes and
then write it to a floppy via dd?
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img

Basically I am injecting an updated e1000 driver with changes to pcitable
to support the address of my gigabit cards.

Terrence
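As background for the dd step Terrence asks about, writing an image to a floppy is a plain block-for-block copy. The Python sketch below mirrors `dd if=bootnet.img of=/dev/fd0 bs=1440k`; the paths are examples, and copying the image correctly says nothing about whether modifying bootnet.img alone is sufficient:

```python
# Block-for-block copy between two open file objects, equivalent to:
#   dd if=bootnet.img of=/dev/fd0 bs=1440k
# The device path is an example; run against a real floppy device only
# as root. 1474560 bytes = 1440 KiB, the capacity of a 3.5" floppy.
def dd_copy(src, dst, block_size=1474560):
    """Copy src to dst in fixed-size blocks; return total bytes copied."""
    copied = 0
    while True:
        block = src.read(block_size)
        if not block:
            break
        dst.write(block)
        copied += len(block)
    return copied

# Usage (illustrative):
# with open("bootnet.img", "rb") as s, open("/dev/fd0", "wb") as d:
#     dd_copy(s, d)
```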
From tim.carlson at pnl.gov Wed Dec 10 12:40:41 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Wed, 10 Dec 2003 12:40:41 -0800 (PST)
Subject: [Rocks-Discuss]Error during Make when building a new install floppy
In-Reply-To: <3FD77D67.7000708@physics.ucsd.edu>
Message-ID: <Pine.LNX.4.44.0312101235310.20272-100000@scorpion.emsl.pnl.gov>

On Wed, 10 Dec 2003, Terrence Martin wrote:

> I get the following error when I try to rebuild a boot floppy for rocks.

You can't make a boot floppy with Rocks 3.0. That isn't supported. Or at
least it wasn't the last time I checked.

> Of course I could avoid all of this together and just put my binary
> module into the appropriate location in the boot image.
>
> Would it be correct to modify the following image file with my changes
> and then write it to a floppy via dd?
>
> /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img
>
> Basically I am injecting an updated e1000 driver with changes to
> pcitable to support the address of my gigabit cards.

Modifying the bootnet.img is about 1/3 of what you need to do if you go
down that path. You also need to work on netstg1.img and you'll need to
update the driver in the kernel rpm that gets installed on the box. None of
this is trivial.
If it were me, I would go down the same path I took for updating the AIC79XX driver https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/003533.html Tim Tim Carlson Voice: (509) 376 3423 Email: Tim.Carlson at pnl.gov EMSL UNIX System Support From tim.carlson at pnl.gov Wed Dec 10 12:52:38 2003 From: tim.carlson at pnl.gov (Tim Carlson) Date: Wed, 10 Dec 2003 12:52:38 -0800 (PST) Subject: [Rocks-Discuss]Non-homogenous legacy hardware In-Reply-To: <Pine.GSO.4.58.0312101359380.22@lenti.med.umn.edu> Message-ID: <Pine.LNX.4.44.0312101249400.20272-100000@scorpion.emsl.pnl.gov> On Wed, 10 Dec 2003, Chris Dwan (CCGB) wrote: > > I am integrating legacy systems into a ROCKS cluster, and have hit a > snag with the auto-partition configuration: The new (old) systems have > SCSI disks, while old (new) ones contain IDE. This is a non-issue so
    > long as the initial install does its default partitioning. However, I > have a "replace-auto-partition.xml" file which is unworkable for the SCSI > based systems since it makes specific reference to "hda" rather than > "sda." If you have just a single drive, then you should be able to skip the "--ondisk" bits of your "part" command Otherwise, you would have first to do something ugly like the following: http://penguin.epfl.ch/slides/kickstart/ks.cfg You could probably (maybe) wrap most of that in an <eval sh="bash"> </eval> block in the <main> block. Just guessing.. haven't tried this. Tim Tim Carlson Voice: (509) 376 3423 Email: Tim.Carlson at pnl.gov EMSL UNIX System Support From agrajag at dragaera.net Wed Dec 10 10:21:07 2003 From: agrajag at dragaera.net (Jag) Date: Wed, 10 Dec 2003 13:21:07 -0500 Subject: [Rocks-Discuss]ssh_known_hosts and ganglia Message-ID: <1071080467.4693.6.camel@pel> I noticed a previous post on this list (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001934.html) indicating that Rocks distributes ssh keys for all the nodes over ganglia. Can anyone enlighten me as to how this is done? I looked through the ganglia docs and didn't see anything indicating how to do this, so I'm assuming Rocks made some changes. Unfortunately the rocks iso images don't seem to contain srpms, so I'm now coming here. What did Rocks do to ganglia to make the distribution of ssh keys work? Also, does anyone know where Rocks SRPMs can be found? I've done quite a bit of searching, but haven't found them anywhere. From mjk at sdsc.edu Wed Dec 10 14:39:15 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Wed, 10 Dec 2003 14:39:15 -0800 Subject: [Rocks-Discuss]ssh_known_hosts and ganglia In-Reply-To: <1071080467.4693.6.camel@pel> References: <1071080467.4693.6.camel@pel> Message-ID: <AF006859-2B61-11D8-981C-000A95DA5638@sdsc.edu> Most of the SRPMS are on our FTP site, but we've screwed this up
before. The SRPMS are entirely Rocks specific so they are of little value outside of Rocks. You can also check out our CVS tree (cvs.rocksclusters.org) where rocks/src/ganglia shows what we add. We have a ganglia-python package we created to allow us to write our own metrics at a higher level than the provided gmetric application. We've also moved from this method to a single cluster-wide ssh key for Rocks 3.1. -mjk On Dec 10, 2003, at 10:21 AM, Jag wrote: > I noticed a previous post on this list > (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/ > 001934.html) indicating that Rocks distributes ssh keys for all the > nodes over > ganglia. Can anyone enlighten me as to how this is done? > > I looked through the ganglia docs and didn't see anything indicating > how > to do this, so I'm assuming Rocks made some changes. Unfortunately the > rocks iso images don't seem to contain srpms, so I'm now coming here. > What did Rocks do to ganglia to make the distribution of ssh keys work? > > Also, does anyone know where Rocks SRPMs can be found? I've done quite > a bit of searching, but haven't found them anywhere. From vrowley at ucsd.edu Wed Dec 10 14:43:49 2003 From: vrowley at ucsd.edu (V. Rowley) Date: Wed, 10 Dec 2003 14:43:49 -0800 Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro Message-ID: <3FD7A1A5.2030805@ucsd.edu> When I run this: [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; rocks-dist --dist=cdrom cdrom on a server installed with ROCKS 3.0.0, I eventually get this: > Cleaning distribution > Resolving versions (RPMs) > Resolving versions (SRPMs) > Adding support for rebuild distribution from source > Creating files (symbolic links - fast) > Creating symlinks to kickstart files > Fixing Comps Database > Generating hdlist (rpm database) > Patching second stage loader (eKV, partioning, ...) > patching "rocks-ekv" into distribution ... > patching "rocks-piece-pipe" into distribution ... 
> patching "PyXML" into distribution ... > patching "expat" into distribution ... > patching "rocks-pylib" into distribution ... > patching "MySQL-python" into distribution ... > patching "rocks-kickstart" into distribution ...
> patching "rocks-kickstart-profiles" into distribution ... > patching "rocks-kickstart-dtds" into distribution ... > building CRAM filesystem ... > Cleaning distribution > Resolving versions (RPMs) > Resolving versions (SRPMs) > Creating symlinks to kickstart files > Generating hdlist (rpm database) > Segregating RPMs (rocks, non-rocks) > sh: ./kickstart.cgi: No such file or directory > sh: ./kickstart.cgi: No such file or directory > Traceback (innermost last): > File "/opt/rocks/bin/rocks-dist", line 807, in ? > app.run() > File "/opt/rocks/bin/rocks-dist", line 623, in run > eval('self.command_%s()' % (command)) > File "<string>", line 0, in ? > File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom > builder.build() > File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build > (rocks, nonrocks) = self.segregateRPMS() > File "/opt/rocks/lib/python/rocks/build.py", line 1107, in segregateRPMS > for pkg in ks.getSection('packages'): > TypeError: loop over non-sequence Any ideas? -- Vicky Rowley email: vrowley at ucsd.edu Biomedical Informatics Research Network work: (858) 536-5980 University of California, San Diego fax: (858) 822-0828 9500 Gilman Drive La Jolla, CA 92093-0715 See pictures from our trip to China at http://www.sagacitech.com/Chinaweb From bruno at rocksclusters.org Wed Dec 10 15:12:49 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Wed, 10 Dec 2003 15:12:49 -0800 Subject: [Rocks-Discuss]one node short in "labels" In-Reply-To: <20031210155914.55789.qmail@web14812.mail.yahoo.com> References: <20031210155914.55789.qmail@web14812.mail.yahoo.com> Message-ID: <5F8539FC-2B66-11D8-9715-000A95C4E3B4@rocksclusters.org> > So I go to the "labels" selection on the web page to print out the > pretty labels. What a nice idea by the way! > EXCEPT....it's one node short! I go up to 0-13 and this stops at > 0-12. Any ideas where I should check to fix this? yeah, we found this corner case -- it'll be fixed in the next release. 
thanks for the bug report. - gb
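The "TypeError: loop over non-sequence" in Vicky's traceback above is the symptom of iterating over None on interpreters of that era (modern Pythons word it "'NoneType' object is not iterable"). A minimal sketch of the likely failure mode and a defensive fix — `get_section` is an invented stand-in for `ks.getSection`, whose source is not shown in the thread:

```python
# Hypothetical stand-in for ks.getSection(): returns None when the
# requested kickstart section is absent -- the suspected failure mode.
def get_section(sections, name):
    return sections.get(name)

sections = {"main": ["part / --size 1024"]}   # no "packages" section

# for pkg in get_section(sections, "packages"):   # raises TypeError:
#     ...                                         # loop over non-sequence

# A defensive caller treats a missing section as empty instead:
packages = get_section(sections, "packages") or []
for pkg in packages:
    print(pkg)
```

This suggests the distribution being built was missing its packages section (consistent with the later discovery that the profiles directory had been moved aside), rather than a bug in the loop itself.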
From mjk at sdsc.edu Wed Dec 10 15:16:27 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Wed, 10 Dec 2003 15:16:27 -0800 Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro In-Reply-To: <3FD7A1A5.2030805@ucsd.edu> References: <3FD7A1A5.2030805@ucsd.edu> Message-ID: <E17B3F9E-2B66-11D8-981C-000A95DA5638@sdsc.edu> It looks like someone moved the profiles directory to profiles.orig. -mjk [root at rocks14 install]# ls -l total 56 drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 ftp.rocksclusters.org drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 ftp.rocksclusters.org.orig -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: > When I run this: > > [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; > rocks-dist --dist=cdrom cdrom > > on a server installed with ROCKS 3.0.0, I eventually get this: > >> Cleaning distribution >> Resolving versions (RPMs) >> Resolving versions (SRPMs) >> Adding support for rebuild distribution from source >> Creating files (symbolic links - fast) >> Creating symlinks to kickstart files >> Fixing Comps Database >> Generating hdlist (rpm database) >> Patching second stage loader (eKV, partioning, ...) >> patching "rocks-ekv" into distribution ... >> patching "rocks-piece-pipe" into distribution ... >> patching "PyXML" into distribution ... >> patching "expat" into distribution ... >> patching "rocks-pylib" into distribution ... >> patching "MySQL-python" into distribution ... >> patching "rocks-kickstart" into distribution ... >> patching "rocks-kickstart-profiles" into distribution ... 
>> patching "rocks-kickstart-dtds" into distribution ... >> building CRAM filesystem ... >> Cleaning distribution
>> Resolving versions (RPMs) >> Resolving versions (SRPMs) >> Creating symlinks to kickstart files >> Generating hdlist (rpm database) >> Segregating RPMs (rocks, non-rocks) >> sh: ./kickstart.cgi: No such file or directory >> sh: ./kickstart.cgi: No such file or directory >> Traceback (innermost last): >> File "/opt/rocks/bin/rocks-dist", line 807, in ? >> app.run() >> File "/opt/rocks/bin/rocks-dist", line 623, in run >> eval('self.command_%s()' % (command)) >> File "<string>", line 0, in ? >> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >> builder.build() >> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >> (rocks, nonrocks) = self.segregateRPMS() >> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in >> segregateRPMS >> for pkg in ks.getSection('packages'): >> TypeError: loop over non-sequence > > Any ideas? > > -- > Vicky Rowley email: vrowley at ucsd.edu > Biomedical Informatics Research Network work: (858) 536-5980 > University of California, San Diego fax: (858) 822-0828 > 9500 Gilman Drive > La Jolla, CA 92093-0715 > > > See pictures from our trip to China at > http://www.sagacitech.com/Chinaweb From tim.carlson at pnl.gov Wed Dec 10 17:23:25 2003 From: tim.carlson at pnl.gov (Tim Carlson) Date: Wed, 10 Dec 2003 17:23:25 -0800 (PST) Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro In-Reply-To: <3FD7BF48.9020409@ucsd.edu> Message-ID: <Pine.GSO.4.44.0312101722470.711-100000@poincare.emsl.pnl.gov> On Wed, 10 Dec 2003, V. Rowley wrote: Did you remove python by chance? kickstart.cgi calls python directly in /usr/bin/python while rocks-dist does an "env python" Tim > Yep, I did that, but only *AFTER* getting the error. [Thought it was > generated by the rocks-dist sequence, but apparently not.] Go ahead. > Move it back. Same difference. > > Vicky > > Mason J. Katz wrote: > > It looks like someone moved the profiles directory to profiles.orig. > > > > -mjk
> [root at rocks14 install]# ls -l > total 56 > drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom > drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig > drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 > ftp.rocksclusters.org > drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 > ftp.rocksclusters.org.orig > -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi > drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig > drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist > drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig > drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src > drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo > On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: > >> When I run this: >> >> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; >> rocks-dist --dist=cdrom cdrom >> >> on a server installed with ROCKS 3.0.0, I eventually get this: >> >>> Cleaning distribution >>> Resolving versions (RPMs) >>> Resolving versions (SRPMs) >>> Adding support for rebuild distribution from source >>> Creating files (symbolic links - fast) >>> Creating symlinks to kickstart files >>> Fixing Comps Database >>> Generating hdlist (rpm database) >>> Patching second stage loader (eKV, partioning, ...) >>> patching "rocks-ekv" into distribution ... >>> patching "rocks-piece-pipe" into distribution ... >>> patching "PyXML" into distribution ... >>> patching "expat" into distribution ... >>> patching "rocks-pylib" into distribution ... >>> patching "MySQL-python" into distribution ... >>> patching "rocks-kickstart" into distribution ... >>> patching "rocks-kickstart-profiles" into distribution ... >>> patching "rocks-kickstart-dtds" into distribution ... >>> building CRAM filesystem ... 
>>> Cleaning distribution >>> Resolving versions (RPMs) >>> Resolving versions (SRPMs) >>> Creating symlinks to kickstart files >>> Generating hdlist (rpm database) >>> Segregating RPMs (rocks, non-rocks) >>> sh: ./kickstart.cgi: No such file or directory >>> sh: ./kickstart.cgi: No such file or directory >>> Traceback (innermost last): >>> File "/opt/rocks/bin/rocks-dist", line 807, in ? >>> app.run() >>> File "/opt/rocks/bin/rocks-dist", line 623, in run >>> eval('self.command_%s()' % (command)) >>> File "<string>", line 0, in ? >>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >>> builder.build() >>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
    >>> (rocks, nonrocks) = self.segregateRPMS() >>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in >>> segregateRPMS >>> for pkg in ks.getSection('packages'): >>> TypeError: loop over non-sequence >> >> >> Any ideas? >> >> -- >> Vicky Rowley email: vrowley at ucsd.edu >> Biomedical Informatics Research Network work: (858) 536-5980 >> University of California, San Diego fax: (858) 822-0828 >> 9500 Gilman Drive >> La Jolla, CA 92093-0715 >> >> >> See pictures from our trip to China at http://www.sagacitech.com/Chinaweb > > > -- Vicky Rowley email: vrowley at ucsd.edu Biomedical Informatics Research Network work: (858) 536-5980 University of California, San Diego fax: (858) 822-0828 9500 Gilman Drive La Jolla, CA 92093-0715 See pictures from our trip to China at http://www.sagacitech.com/Chinaweb From tim.carlson at pnl.gov Wed Dec 10 17:23:25 2003 From: tim.carlson at pnl.gov (Tim Carlson) Date: Wed, 10 Dec 2003 17:23:25 -0800 (PST) Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro In-Reply-To: <3FD7BF48.9020409@ucsd.edu> Message-ID: <Pine.GSO.4.44.0312101722470.711-100000@poincare.emsl.pnl.gov> On Wed, 10 Dec 2003, V. Rowley wrote: Did you remove python by chance? kickstart.cgi calls python directly in /usr/bin/python while rocks-dist does an "env python" Tim > Yep, I did that, but only *AFTER* getting the error. [Thought it was > generated by the rocks-dist sequence, but apparently not.] Go ahead. > Move it back. Same difference. > > Vicky > > Mason J. Katz wrote: > > It looks like someone moved the profiles directory to profiles.orig. > > > > -mjk
    > > > > > > [root at rocks14 install]# ls -l > > total 56 > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom > > drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 > > ftp.rocksclusters.org > > drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 > > ftp.rocksclusters.org.orig > > -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi > > drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist > > drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src > > drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo > > On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: > > > >> When I run this: > >> > >> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; > >> rocks-dist --dist=cdrom cdrom > >> > >> on a server installed with ROCKS 3.0.0, I eventually get this: > >> > >>> Cleaning distribution > >>> Resolving versions (RPMs) > >>> Resolving versions (SRPMs) > >>> Adding support for rebuild distribution from source > >>> Creating files (symbolic links - fast) > >>> Creating symlinks to kickstart files > >>> Fixing Comps Database > >>> Generating hdlist (rpm database) > >>> Patching second stage loader (eKV, partioning, ...) > >>> patching "rocks-ekv" into distribution ... > >>> patching "rocks-piece-pipe" into distribution ... > >>> patching "PyXML" into distribution ... > >>> patching "expat" into distribution ... > >>> patching "rocks-pylib" into distribution ... > >>> patching "MySQL-python" into distribution ... > >>> patching "rocks-kickstart" into distribution ... > >>> patching "rocks-kickstart-profiles" into distribution ... > >>> patching "rocks-kickstart-dtds" into distribution ... > >>> building CRAM filesystem ... 
> >>> Cleaning distribution > >>> Resolving versions (RPMs) > >>> Resolving versions (SRPMs) > >>> Creating symlinks to kickstart files > >>> Generating hdlist (rpm database) > >>> Segregating RPMs (rocks, non-rocks) > >>> sh: ./kickstart.cgi: No such file or directory > >>> sh: ./kickstart.cgi: No such file or directory > >>> Traceback (innermost last): > >>> File "/opt/rocks/bin/rocks-dist", line 807, in ? > >>> app.run() > >>> File "/opt/rocks/bin/rocks-dist", line 623, in run > >>> eval('self.command_%s()' % (command)) > >>> File "<string>", line 0, in ? > >>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
> >>> builder.build() > >>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build > >>> (rocks, nonrocks) = self.segregateRPMS() > >>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in > >>> segregateRPMS > >>> for pkg in ks.getSection('packages'): > >>> TypeError: loop over non-sequence > >> > >> > >> Any ideas? > >> > >> -- > >> Vicky Rowley email: vrowley at ucsd.edu > >> Biomedical Informatics Research Network work: (858) 536-5980 > >> University of California, San Diego fax: (858) 822-0828 > >> 9500 Gilman Drive > >> La Jolla, CA 92093-0715 > >> > >> > >> See pictures from our trip to China at http://www.sagacitech.com/Chinaweb > > > > > > > > -- > Vicky Rowley email: vrowley at ucsd.edu > Biomedical Informatics Research Network work: (858) 536-5980 > University of California, San Diego fax: (858) 822-0828 > 9500 Gilman Drive > La Jolla, CA 92093-0715 > > > See pictures from our trip to China at http://www.sagacitech.com/Chinaweb > > From naihh at imcb.a-star.edu.sg Wed Dec 10 17:45:18 2003 From: naihh at imcb.a-star.edu.sg (Nai Hong Hwa Francis) Date: Thu, 11 Dec 2003 09:45:18 +0800 Subject: [Rocks-Discuss]RE: Do you have a list of the various models of Gigabit Ethernet Interfaces compatible to Rocks 3? Message-ID: <5E118EED7CC277468A275F11EEEC39B94CCD66@EXIMCB2.imcb.a-star.edu.sg> Hi All, Do you have a list of the various gigabit Ethernet interfaces that are compatible with Rocks 3? I am changing my nodes' connectivity from 10/100 to 1000. Has anyone done that, and what are the differences in performance or turnaround time? Has anyone successfully built a set of grid compute nodes using Rocks 3?
    Thanks and Regards NaiHong Hwa Francis Institute of Molecular and Cell Biology (A*STAR) 30 Medical Drive Singapore 117609. DID: (65) 6874-6196 -----Original Message----- From: npaci-rocks-discussion-request at sdsc.edu [mailto:npaci-rocks-discussion-request at sdsc.edu] Sent: Thursday, December 11, 2003 9:25 AM To: npaci-rocks-discussion at sdsc.edu Subject: npaci-rocks-discussion digest, Vol 1 #641 - 13 msgs Send npaci-rocks-discussion mailing list submissions to npaci-rocks-discussion at sdsc.edu To subscribe or unsubscribe via the World Wide Web, visit http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion or, via email, send a message with subject or body 'help' to npaci-rocks-discussion-request at sdsc.edu You can reach the person managing the list at npaci-rocks-discussion-admin at sdsc.edu When replying, please edit your Subject line so it is more specific than "Re: Contents of npaci-rocks-discussion digest..." Today's Topics: 1. Non-homogenous legacy hardware (Chris Dwan (CCGB)) 2. Error during Make when building a new install floppy (Terrence Martin) 3. Re: Error during Make when building a new install floppy (Tim Carlson) 4. Re: Non-homogenous legacy hardware (Tim Carlson) 5. ssh_known_hosts and ganglia (Jag) 6. Re: ssh_known_hosts and ganglia (Mason J. Katz) 7. "TypeError: loop over non-sequence" when trying to build CD distro (V. Rowley) 8. Re: one node short in "labels" (Greg Bruno) 9. Re: "TypeError: loop over non-sequence" when trying to build CD distro (Mason J. Katz) 10. Re: "TypeError: loop over non-sequence" when trying to build CD distro (V. Rowley) 11. Re: "TypeError: loop over non-sequence" when trying to build CD distro (Tim Carlson) --__--__-- Message: 1 Date: Wed, 10 Dec 2003 14:04:53 -0600 (CST) From: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu> To: npaci-rocks-discussion at sdsc.edu
Subject: [Rocks-Discuss]Non-homogenous legacy hardware I am integrating legacy systems into a ROCKS cluster, and have hit a snag with the auto-partition configuration: The new (old) systems have SCSI disks, while old (new) ones contain IDE. This is a non-issue so long as the initial install does its default partitioning. However, I have a "replace-auto-partition.xml" file which is unworkable for the SCSI based systems since it makes specific reference to "hda" rather than "sda." I would like to have a site-nodes/replace-auto-partition.xml file with a conditional such that "hda" or "sda" is used, based on the name of the node (or some other criterion). Is this possible? Thanks, in advance. If this is out there on the mailing list archives, a pointer would be greatly appreciated. -Chris Dwan The University of Minnesota --__--__-- Message: 2 Date: Wed, 10 Dec 2003 12:09:11 -0800 From: Terrence Martin <tmartin at physics.ucsd.edu> To: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu> Subject: [Rocks-Discuss]Error during Make when building a new install floppy I get the following error when I try to rebuild a boot floppy for rocks. This is with the default CVS checkout with an update today according to the rocks userguide. I have not actually attempted to make any changes. make[3]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3/loader' make[2]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3' strip -o loader anaconda-7.3/loader/loader strip: anaconda-7.3/loader/loader: No such file or directory make[1]: *** [loader] Error 1 make[1]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader' make: *** [loader] Error 2 Of course I could avoid all of this altogether and just put my binary module into the appropriate location in the boot image. Would it be correct to modify the following image file with my changes and then write it to a floppy via dd? 
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img
Basically I am injecting an updated e1000 driver with changes to pcitable to support the address of my gigabit cards. Terrence --__--__-- Message: 3 Date: Wed, 10 Dec 2003 12:40:41 -0800 (PST) From: Tim Carlson <tim.carlson at pnl.gov> Subject: Re: [Rocks-Discuss]Error during Make when building a new install floppy To: Terrence Martin <tmartin at physics.ucsd.edu> Cc: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu> Reply-to: Tim Carlson <tim.carlson at pnl.gov> On Wed, 10 Dec 2003, Terrence Martin wrote: > I get the following error when I try to rebuild a boot floppy for rocks. > You can't make a boot floppy with Rocks 3.0. That isn't supported. Or at least it wasn't the last time I checked. > Of course I could avoid all of this altogether and just put my binary > module into the appropriate location in the boot image. > > Would it be correct to modify the following image file with my changes > and then write it to a floppy via dd? > > /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img > > Basically I am injecting an updated e1000 driver with changes to > pcitable to support the address of my gigabit cards. Modifying the bootnet.img is about 1/3 of what you need to do if you go down that path. You also need to work on netstg1.img and you'll need to update the driver in the kernel rpm that gets installed on the box. None of this is trivial. If it were me, I would go down the same path I took for updating the AIC79XX driver https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/003533.html Tim Tim Carlson Voice: (509) 376 3423 Email: Tim.Carlson at pnl.gov EMSL UNIX System Support
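The pcitable change Terrence describes amounts to adding a PCI vendor/device-to-driver mapping line so the installer recognizes his gigabit cards. A hedged sketch of the edit you would script before rebuilding the images — the helper name is invented, and the tab-separated format is an assumption modeled on anaconda-era pcitable files:

```python
def add_pcitable_entry(lines, vendor, device, driver):
    """Return pcitable lines with (vendor, device) mapped to driver.

    Appends a tab-separated entry only if that PCI id is not already
    present, so the edit is safe to re-run on an already-patched file.
    """
    tag = "0x%04x\t0x%04x" % (vendor, device)
    for line in lines:
        if line.startswith(tag):
            return list(lines)          # id already mapped; no change
    return list(lines) + ['%s\t"%s"' % (tag, driver)]

# Example: map an additional Intel device id to the e1000 driver
# (sample ids for illustration only).
table = ['0x8086\t0x100e\t"e1000"']
table = add_pcitable_entry(table, 0x8086, 0x1010, "e1000")
```

As Tim notes above, this edit alone is only part of the job: the same id must be known to the driver in netstg1.img and in the kernel rpm that lands on the node.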
--__--__-- Message: 4 Date: Wed, 10 Dec 2003 12:52:38 -0800 (PST) From: Tim Carlson <tim.carlson at pnl.gov> Subject: Re: [Rocks-Discuss]Non-homogenous legacy hardware To: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu> Cc: npaci-rocks-discussion at sdsc.edu Reply-to: Tim Carlson <tim.carlson at pnl.gov> On Wed, 10 Dec 2003, Chris Dwan (CCGB) wrote: > > I am integrating legacy systems into a ROCKS cluster, and have hit a > snag with the auto-partition configuration: The new (old) systems have > SCSI disks, while old (new) ones contain IDE. This is a non-issue so > long as the initial install does its default partitioning. However, I > have a "replace-auto-partition.xml" file which is unworkable for the SCSI > based systems since it makes specific reference to "hda" rather than > "sda." If you have just a single drive, then you should be able to skip the "--ondisk" bits of your "part" command. Otherwise, you would first have to do something ugly like the following: http://penguin.epfl.ch/slides/kickstart/ks.cfg You could probably (maybe) wrap most of that in an <eval sh="bash"> </eval> block in the <main> block. Just guessing... haven't tried this. Tim Tim Carlson Voice: (509) 376 3423 Email: Tim.Carlson at pnl.gov EMSL UNIX System Support --__--__-- Message: 5 From: Jag <agrajag at dragaera.net> To: npaci-rocks-discussion at sdsc.edu Date: Wed, 10 Dec 2003 13:21:07 -0500 Subject: [Rocks-Discuss]ssh_known_hosts and ganglia I noticed a previous post on this list (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001934.html) indicating that Rocks distributes ssh keys for all the nodes over ganglia. Can anyone enlighten me as to how this is done?
I looked through the ganglia docs and didn't see anything indicating how to do this, so I'm assuming Rocks made some changes. Unfortunately the rocks iso images don't seem to contain srpms, so I'm now coming here. What did Rocks do to ganglia to make the distribution of ssh keys work? Also, does anyone know where Rocks SRPMs can be found? I've done quite a bit of searching, but haven't found them anywhere. --__--__-- Message: 6 Cc: npaci-rocks-discussion at sdsc.edu From: "Mason J. Katz" <mjk at sdsc.edu> Subject: Re: [Rocks-Discuss]ssh_known_hosts and ganglia Date: Wed, 10 Dec 2003 14:39:15 -0800 To: Jag <agrajag at dragaera.net> Most of the SRPMS are on our FTP site, but we've screwed this up before. The SRPMS are entirely Rocks specific so they are of little value outside of Rocks. You can also check out our CVS tree (cvs.rocksclusters.org) where rocks/src/ganglia shows what we add. We have a ganglia-python package we created to allow us to write our own metrics at a higher level than the provided gmetric application. We've also moved from this method to a single cluster-wide ssh key for Rocks 3.1. -mjk On Dec 10, 2003, at 10:21 AM, Jag wrote: > I noticed a previous post on this list > (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/ > 001934.html) indicating that Rocks distributes ssh keys for all the > nodes over > ganglia. Can anyone enlighten me as to how this is done? > > I looked through the ganglia docs and didn't see anything indicating > how > to do this, so I'm assuming Rocks made some changes. Unfortunately the > rocks iso images don't seem to contain srpms, so I'm now coming here. > What did Rocks do to ganglia to make the distribution of ssh keys work? > > Also, does anyone know where Rocks SRPMs can be found? I've done quite > a bit of searching, but haven't found them anywhere. --__--__-- Message: 7 Date: Wed, 10 Dec 2003 14:43:49 -0800 From: "V. 
Rowley" <vrowley at ucsd.edu> To: npaci-rocks-discussion at sdsc.edu Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro
When I run this: [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; rocks-dist --dist=cdrom cdrom on a server installed with ROCKS 3.0.0, I eventually get this: > Cleaning distribution > Resolving versions (RPMs) > Resolving versions (SRPMs) > Adding support for rebuild distribution from source > Creating files (symbolic links - fast) > Creating symlinks to kickstart files > Fixing Comps Database > Generating hdlist (rpm database) > Patching second stage loader (eKV, partioning, ...) > patching "rocks-ekv" into distribution ... > patching "rocks-piece-pipe" into distribution ... > patching "PyXML" into distribution ... > patching "expat" into distribution ... > patching "rocks-pylib" into distribution ... > patching "MySQL-python" into distribution ... > patching "rocks-kickstart" into distribution ... > patching "rocks-kickstart-profiles" into distribution ... > patching "rocks-kickstart-dtds" into distribution ... > building CRAM filesystem ... > Cleaning distribution > Resolving versions (RPMs) > Resolving versions (SRPMs) > Creating symlinks to kickstart files > Generating hdlist (rpm database) > Segregating RPMs (rocks, non-rocks) > sh: ./kickstart.cgi: No such file or directory > sh: ./kickstart.cgi: No such file or directory > Traceback (innermost last): > File "/opt/rocks/bin/rocks-dist", line 807, in ? > app.run() > File "/opt/rocks/bin/rocks-dist", line 623, in run > eval('self.command_%s()' % (command)) > File "<string>", line 0, in ? > File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom > builder.build() > File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build > (rocks, nonrocks) = self.segregateRPMS() > File "/opt/rocks/lib/python/rocks/build.py", line 1107, in segregateRPMS > for pkg in ks.getSection('packages'): > TypeError: loop over non-sequence Any ideas? 
-- Vicky Rowley email: vrowley at ucsd.edu Biomedical Informatics Research Network work: (858) 536-5980 University of California, San Diego fax: (858) 822-0828 9500 Gilman Drive La Jolla, CA 92093-0715
See pictures from our trip to China at http://www.sagacitech.com/Chinaweb --__--__-- Message: 8 Cc: rocks <npaci-rocks-discussion at sdsc.edu> From: Greg Bruno <bruno at rocksclusters.org> Subject: Re: [Rocks-Discuss]one node short in "labels" Date: Wed, 10 Dec 2003 15:12:49 -0800 To: Vincent Fox <vincent_b_fox at yahoo.com> > So I go to the "labels" selection on the web page to print out the > pretty labels. What a nice idea by the way! > EXCEPT....it's one node short! I go up to 0-13 and this stops at > 0-12. Any ideas where I should check to fix this? yeah, we found this corner case -- it'll be fixed in the next release. thanks for the bug report. - gb --__--__-- Message: 9 Cc: npaci-rocks-discussion at sdsc.edu From: "Mason J. Katz" <mjk at sdsc.edu> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro Date: Wed, 10 Dec 2003 15:16:27 -0800 To: "V. Rowley" <vrowley at ucsd.edu> It looks like someone moved the profiles directory to profiles.orig. -mjk [root at rocks14 install]# ls -l total 56 drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 ftp.rocksclusters.org drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 ftp.rocksclusters.org.orig -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: > When I run this:
> > [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; > rocks-dist --dist=cdrom cdrom > > on a server installed with ROCKS 3.0.0, I eventually get this: > >> Cleaning distribution >> Resolving versions (RPMs) >> Resolving versions (SRPMs) >> Adding support for rebuild distribution from source >> Creating files (symbolic links - fast) >> Creating symlinks to kickstart files >> Fixing Comps Database >> Generating hdlist (rpm database) >> Patching second stage loader (eKV, partioning, ...) >> patching "rocks-ekv" into distribution ... >> patching "rocks-piece-pipe" into distribution ... >> patching "PyXML" into distribution ... >> patching "expat" into distribution ... >> patching "rocks-pylib" into distribution ... >> patching "MySQL-python" into distribution ... >> patching "rocks-kickstart" into distribution ... >> patching "rocks-kickstart-profiles" into distribution ... >> patching "rocks-kickstart-dtds" into distribution ... >> building CRAM filesystem ... >> Cleaning distribution >> Resolving versions (RPMs) >> Resolving versions (SRPMs) >> Creating symlinks to kickstart files >> Generating hdlist (rpm database) >> Segregating RPMs (rocks, non-rocks) >> sh: ./kickstart.cgi: No such file or directory >> sh: ./kickstart.cgi: No such file or directory >> Traceback (innermost last): >> File "/opt/rocks/bin/rocks-dist", line 807, in ? >> app.run() >> File "/opt/rocks/bin/rocks-dist", line 623, in run >> eval('self.command_%s()' % (command)) >> File "<string>", line 0, in ? >> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >> builder.build() >> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >> (rocks, nonrocks) = self.segregateRPMS() >> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in >> segregateRPMS >> for pkg in ks.getSection('packages'): >> TypeError: loop over non-sequence > > Any ideas? 
> > -- > Vicky Rowley email: vrowley at ucsd.edu > Biomedical Informatics Research Network work: (858) 536-5980 > University of California, San Diego fax: (858) 822-0828 > 9500 Gilman Drive > La Jolla, CA 92093-0715 > > > See pictures from our trip to China at
http://www.sagacitech.com/Chinaweb --__--__-- Message: 10 Date: Wed, 10 Dec 2003 16:50:16 -0800 From: "V. Rowley" <vrowley at ucsd.edu> To: "Mason J. Katz" <mjk at sdsc.edu> CC: npaci-rocks-discussion at sdsc.edu Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro Yep, I did that, but only *AFTER* getting the error. [Thought it was generated by the rocks-dist sequence, but apparently not.] Go ahead. Move it back. Same difference. Vicky Mason J. Katz wrote: > It looks like someone moved the profiles directory to profiles.orig. > > -mjk > > > [root at rocks14 install]# ls -l > total 56 > drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom > drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig > drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 > ftp.rocksclusters.org > drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 > ftp.rocksclusters.org.orig > -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi > drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig > drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist > drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig > drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src > drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo > On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: > >> When I run this: >> >> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; >> rocks-dist --dist=cdrom cdrom >> >> on a server installed with ROCKS 3.0.0, I eventually get this: >> >>> Cleaning distribution >>> Resolving versions (RPMs) >>> Resolving versions (SRPMs) >>> Adding support for rebuild distribution from source >>> Creating files (symbolic links - fast) >>> Creating symlinks to kickstart files >>> Fixing Comps Database >>> Generating hdlist (rpm database) >>> Patching second stage loader (eKV, partioning, ...)
    >>> patching "rocks-ekv" into distribution ... >>> patching "rocks-piece-pipe" into distribution ... >>> patching "PyXML" into distribution ... >>> patching "expat" into distribution ... >>> patching "rocks-pylib" into distribution ... >>> patching "MySQL-python" into distribution ... >>> patching "rocks-kickstart" into distribution ... >>> patching "rocks-kickstart-profiles" into distribution ... >>> patching "rocks-kickstart-dtds" into distribution ... >>> building CRAM filesystem ... >>> Cleaning distribution >>> Resolving versions (RPMs) >>> Resolving versions (SRPMs) >>> Creating symlinks to kickstart files >>> Generating hdlist (rpm database) >>> Segregating RPMs (rocks, non-rocks) >>> sh: ./kickstart.cgi: No such file or directory >>> sh: ./kickstart.cgi: No such file or directory >>> Traceback (innermost last): >>> File "/opt/rocks/bin/rocks-dist", line 807, in ? >>> app.run() >>> File "/opt/rocks/bin/rocks-dist", line 623, in run >>> eval('self.command_%s()' % (command)) >>> File "<string>", line 0, in ? >>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >>> builder.build() >>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >>> (rocks, nonrocks) = self.segregateRPMS() >>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in >>> segregateRPMS >>> for pkg in ks.getSection('packages'): >>> TypeError: loop over non-sequence >> >> >> Any ideas? >> >> -- >> Vicky Rowley email: vrowley at ucsd.edu >> Biomedical Informatics Research Network work: (858) 536-5980 >> University of California, San Diego fax: (858) 822-0828 >> 9500 Gilman Drive >> La Jolla, CA 92093-0715 >> >> >> See pictures from our trip to China at http://www.sagacitech.com/Chinaweb > > > -- Vicky Rowley email: vrowley at ucsd.edu Biomedical Informatics Research Network work: (858) 536-5980 University of California, San Diego fax: (858) 822-0828 9500 Gilman Drive La Jolla, CA 92093-0715 See pictures from our trip to China at
    http://www.sagacitech.com/Chinaweb --__--__-- Message: 11 Date: Wed,10 Dec 2003 17:23:25 -0800 (PST) From: Tim Carlson <tim.carlson at pnl.gov> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro To: "V. Rowley" <vrowley at ucsd.edu> Cc: "Mason J. Katz" <mjk at sdsc.edu>, npaci-rocks-discussion at sdsc.edu Reply-to: Tim Carlson <tim.carlson at pnl.gov> On Wed, 10 Dec 2003, V. Rowley wrote: Did you remove python by chance? kickstart.cgi calls python directly in /usr/bin/python while rocks-dist does an "env python" Tim > Yep, I did that, but only *AFTER* getting the error. [Thought it was > generated by the rocks-dist sequence, but apparently not.] Go ahead. > Move it back. Same difference. > > Vicky > > Mason J. Katz wrote: > > It looks like someone moved the profiles directory to profiles.orig. > > > > -mjk > > > > > > [root at rocks14 install]# ls -l > > total 56 > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom > > drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 > > ftp.rocksclusters.org > > drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 > > ftp.rocksclusters.org.orig > > -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi > > drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist > > drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src > > drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo > > On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: > > > >> When I run this: > >> > >> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; > >> rocks-dist --dist=cdrom cdrom > >> > >> on a server installed with ROCKS 3.0.0, I eventually get this:
    > >> > >>>Cleaning distribution > >>> Resolving versions (RPMs) > >>> Resolving versions (SRPMs) > >>> Adding support for rebuild distribution from source > >>> Creating files (symbolic links - fast) > >>> Creating symlinks to kickstart files > >>> Fixing Comps Database > >>> Generating hdlist (rpm database) > >>> Patching second stage loader (eKV, partioning, ...) > >>> patching "rocks-ekv" into distribution ... > >>> patching "rocks-piece-pipe" into distribution ... > >>> patching "PyXML" into distribution ... > >>> patching "expat" into distribution ... > >>> patching "rocks-pylib" into distribution ... > >>> patching "MySQL-python" into distribution ... > >>> patching "rocks-kickstart" into distribution ... > >>> patching "rocks-kickstart-profiles" into distribution ... > >>> patching "rocks-kickstart-dtds" into distribution ... > >>> building CRAM filesystem ... > >>> Cleaning distribution > >>> Resolving versions (RPMs) > >>> Resolving versions (SRPMs) > >>> Creating symlinks to kickstart files > >>> Generating hdlist (rpm database) > >>> Segregating RPMs (rocks, non-rocks) > >>> sh: ./kickstart.cgi: No such file or directory > >>> sh: ./kickstart.cgi: No such file or directory > >>> Traceback (innermost last): > >>> File "/opt/rocks/bin/rocks-dist", line 807, in ? > >>> app.run() > >>> File "/opt/rocks/bin/rocks-dist", line 623, in run > >>> eval('self.command_%s()' % (command)) > >>> File "<string>", line 0, in ? > >>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom > >>> builder.build() > >>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build > >>> (rocks, nonrocks) = self.segregateRPMS() > >>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in > >>> segregateRPMS > >>> for pkg in ks.getSection('packages'): > >>> TypeError: loop over non-sequence > >> > >> > >> Any ideas? 
> >> > >> -- > >> Vicky Rowley email: vrowley at ucsd.edu > >> Biomedical Informatics Research Network work: (858) 536-5980 > >> University of California, San Diego fax: (858) 822-0828 > >> 9500 Gilman Drive > >> La Jolla, CA 92093-0715 > >> > >> > >> See pictures from our trip to China at http://www.sagacitech.com/Chinaweb > > > > > >
> >
> --
> Vicky Rowley email: vrowley at ucsd.edu
> Biomedical Informatics Research Network work: (858) 536-5980
> University of California, San Diego fax: (858) 822-0828
> 9500 Gilman Drive
> La Jolla, CA 92093-0715
>
>
> See pictures from our trip to China at http://www.sagacitech.com/Chinaweb

--__--__--

_______________________________________________
npaci-rocks-discussion mailing list
npaci-rocks-discussion at sdsc.edu
http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion

End of npaci-rocks-discussion Digest

DISCLAIMER: This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person as it may be an offence under the Official Secrets Act. Thank you.

From tmartin at physics.ucsd.edu Wed Dec 10 18:03:41 2003
From: tmartin at physics.ucsd.edu (Terrence Martin)
Date: Wed, 10 Dec 2003 18:03:41 -0800
Subject: [Rocks-Discuss]Rocks 3.0.0
Message-ID: <3FD7D07D.8090108@physics.ucsd.edu>

I am having a problem on install of rocks 3.0.0 on my new cluster.

The python error occurs right after anaconda starts and just before the install asks for the roll CDROM. The error refers to an inability to find or load rocks.file. The error is associated, I think, with the window that pops up and asks you to put the roll CDROM in.

The process I followed to get to this point is:

  Put the Rocks 3.0.0 CDROM into the CDROM drive
  Boot the system
  At the prompt type frontend
  Wait till anaconda starts
  Error referring to unable to load rocks.file.

I have successfully installed rocks on a smaller cluster but that has
different hardware. I used the same CDROM for both installs.

Any thoughts?

Terrence

From vrowley at ucsd.edu Wed Dec 10 19:52:49 2003
From: vrowley at ucsd.edu (V. Rowley)
Date: Wed, 10 Dec 2003 19:52:49 -0800
Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro
In-Reply-To: <Pine.GSO.4.44.0312101722470.711-100000@poincare.emsl.pnl.gov>
References: <Pine.GSO.4.44.0312101722470.711-100000@poincare.emsl.pnl.gov>
Message-ID: <3FD7EA11.10204@ucsd.edu>

Looks like python is okay:

> [root at rocks14 birn-oracle1]# which python
> /usr/bin/python
> [root at rocks14 birn-oracle1]# python --help
> Unknown option: --
> usage: python [option] ... [-c cmd | file | -] [arg] ...
> Options and arguments (and corresponding environment variables):
> -d     : debug output from parser (also PYTHONDEBUG=x)
> -i     : inspect interactively after running script, (also PYTHONINSPECT=x)
>          and force prompts, even if stdin does not appear to be a terminal
> -O     : optimize generated bytecode (a tad; also PYTHONOPTIMIZE=x)
> -OO    : remove doc-strings in addition to the -O optimizations
> -S     : don't imply 'import site' on initialization
> -t     : issue warnings about inconsistent tab usage (-tt: issue errors)
> -u     : unbuffered binary stdout and stderr (also PYTHONUNBUFFERED=x)
> -v     : verbose (trace import statements) (also PYTHONVERBOSE=x)
> -x     : skip first line of source, allowing use of non-Unix forms of #!cmd
> -X     : disable class based built-in exceptions
> -c cmd : program passed in as string (terminates option list)
> file   : program read from script file
> -      : program read from stdin (default; interactive mode if a tty)
> arg ...: arguments passed to program in sys.argv[1:]
> Other environment variables:
> PYTHONSTARTUP: file executed on interactive startup (no default)
> PYTHONPATH   : ':'-separated list of directories prefixed to the
>                default module search path. The result is sys.path.
> PYTHONHOME   : alternate <prefix> directory (or <prefix>:<exec_prefix>).
>                The default module search path uses <prefix>/python1.5.
> [root at rocks14 birn-oracle1]#

Tim Carlson wrote:
> On Wed, 10 Dec 2003, V. Rowley wrote:
>
> Did you remove python by chance? kickstart.cgi calls python directly in
> /usr/bin/python while rocks-dist does an "env python"
>
> Tim
>
    > >>Yep, I didthat, but only *AFTER* getting the error. [Thought it was >>generated by the rocks-dist sequence, but apparently not.] Go ahead. >>Move it back. Same difference. >> >>Vicky >> >>Mason J. Katz wrote: >> >>>It looks like someone moved the profiles directory to profiles.orig. >>> >>> -mjk >>> >>> >>>[root at rocks14 install]# ls -l >>>total 56 >>>drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom >>>drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig >>>drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 >>>ftp.rocksclusters.org >>>drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 >>>ftp.rocksclusters.org.orig >>>-r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi >>>drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig >>>drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist >>>drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig >>>drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src >>>drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo >>>On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: >>> >>> >>>>When I run this: >>>> >>>>[root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; >>>>rocks-dist --dist=cdrom cdrom >>>> >>>>on a server installed with ROCKS 3.0.0, I eventually get this: >>>> >>>> >>>>>Cleaning distribution >>>>>Resolving versions (RPMs) >>>>>Resolving versions (SRPMs) >>>>>Adding support for rebuild distribution from source >>>>>Creating files (symbolic links - fast) >>>>>Creating symlinks to kickstart files >>>>>Fixing Comps Database >>>>>Generating hdlist (rpm database) >>>>>Patching second stage loader (eKV, partioning, ...) >>>>> patching "rocks-ekv" into distribution ... >>>>> patching "rocks-piece-pipe" into distribution ... >>>>> patching "PyXML" into distribution ... >>>>> patching "expat" into distribution ... >>>>> patching "rocks-pylib" into distribution ... >>>>> patching "MySQL-python" into distribution ... >>>>> patching "rocks-kickstart" into distribution ... >>>>> patching "rocks-kickstart-profiles" into distribution ... 
>>>>> patching "rocks-kickstart-dtds" into distribution ... >>>>> building CRAM filesystem ... >>>>>Cleaning distribution
    >>>>>Resolving versions (RPMs) >>>>>Resolvingversions (SRPMs) >>>>>Creating symlinks to kickstart files >>>>>Generating hdlist (rpm database) >>>>>Segregating RPMs (rocks, non-rocks) >>>>>sh: ./kickstart.cgi: No such file or directory >>>>>sh: ./kickstart.cgi: No such file or directory >>>>>Traceback (innermost last): >>>>> File "/opt/rocks/bin/rocks-dist", line 807, in ? >>>>> app.run() >>>>> File "/opt/rocks/bin/rocks-dist", line 623, in run >>>>> eval('self.command_%s()' % (command)) >>>>> File "<string>", line 0, in ? >>>>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >>>>> builder.build() >>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >>>>> (rocks, nonrocks) = self.segregateRPMS() >>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in >>>>>segregateRPMS >>>>> for pkg in ks.getSection('packages'): >>>>>TypeError: loop over non-sequence >>>> >>>> >>>>Any ideas? >>>> >>>>-- >>>>Vicky Rowley email: vrowley at ucsd.edu >>>>Biomedical Informatics Research Network work: (858) 536-5980 >>>>University of California, San Diego fax: (858) 822-0828 >>>>9500 Gilman Drive >>>>La Jolla, CA 92093-0715 >>>> >>>> >>>>See pictures from our trip to China at http://www.sagacitech.com/Chinaweb >>> >>> >>> >>-- >>Vicky Rowley email: vrowley at ucsd.edu >>Biomedical Informatics Research Network work: (858) 536-5980 >>University of California, San Diego fax: (858) 822-0828 >>9500 Gilman Drive >>La Jolla, CA 92093-0715 >> >> >>See pictures from our trip to China at http://www.sagacitech.com/Chinaweb >> >> > > > > -- Vicky Rowley email: vrowley at ucsd.edu Biomedical Informatics Research Network work: (858) 536-5980 University of California, San Diego fax: (858) 822-0828 9500 Gilman Drive La Jolla, CA 92093-0715
See pictures from our trip to China at http://www.sagacitech.com/Chinaweb

From wyzhong78 at msn.com Wed Dec 10 20:38:53 2003
From: wyzhong78 at msn.com (zhong wenyu)
Date: Thu, 11 Dec 2003 12:38:53 +0800
Subject: [Rocks-Discuss]Rocks 3.0.0 problem:not able to boot up
Message-ID: <BAY3-F3296PnPlpNvHX000097eb@hotmail.com>

>From: Greg Bruno <bruno at rocksclusters.org>
>To: "zhong wenyu" <wyzhong78 at msn.com>
>CC: npaci-rocks-discussion at sdsc.edu
>Subject: Re: [Rocks-Discuss]Rocks 3.0.0 problem:not able to boot up
>Date: Mon, 8 Dec 2003 15:31:08 -0800
>
>>I have installed Rocks 3.0.0 with default options successful,there
>>was not any trouble.But I boot it up,it stopped at beginning,just
>>show "GRUB" on the screen and waiting...
>
>when you built the frontend, did you start with the rocks base CD
>then add the HPC roll?
>
> - gb

I have worked out this trouble, but I don't know why. I have one SCSI hard disk and one IDE disk on the frontend. I chose SCSI to be the first HDD and installed "/" on it; then it could not boot up. Even after disabling the IDE HDD and installing again, it still could not boot. At last I chose SCSI to be the first HDD for the install, then chose the IDE HDD to be first for boot-up, and it's OK! Must GRUB be installed on the IDE HDD?

thanks!

_________________________________________________________________
MSN Hotmail: http://www.hotmail.com

From wyzhong78 at msn.com Wed Dec 10 20:44:09 2003
From: wyzhong78 at msn.com (zhong wenyu)
Date: Thu, 11 Dec 2003 12:44:09 +0800
Subject: [Rocks-Discuss]I can't use xpbs in rocks
Message-ID: <BAY3-F24QLayI4TY7zD00009bf1@hotmail.com>

Hi, everyone!

I have installed rocks 2.3.2 and 3.0.0; xpbs cannot be used in either of them.

typed: xpbs [enter]
showed: xpbs: initialization failed!
output: invalid command name "Pref_Init"

thanks!

_________________________________________________________________
MSN Messenger: http://messenger.msn.com/cn
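[Editor's note] The workaround zhong describes (letting the machine boot from the IDE disk) matches the usual fix for mixed IDE/SCSI boxes: GRUB's stage1 must sit on whichever disk the BIOS actually reads first. A hedged sketch of a GRUB legacy `device.map` for such a box — the device names here are examples, not taken from zhong's report, and must be checked against the actual hardware:

```
# /boot/grub/device.map -- illustrative mapping only
# The BIOS boots (hd0); on this box that is the IDE disk, so GRUB's
# stage1 goes to the MBR of /dev/hda even though "/" lives on SCSI.
(hd0)   /dev/hda
(hd1)   /dev/sda
```

With a mapping like this in place, `grub-install '(hd0)'` (equivalently `grub-install /dev/hda`) puts the boot loader where the BIOS will actually look for it.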
From phil at sdsc.edu Wed Dec 10 21:26:50 2003
From: phil at sdsc.edu (Philip Papadopoulos)
Date: Wed, 10 Dec 2003 21:26:50 -0800
Subject: [Rocks-Discuss]Rocks 3.0.0 problem:not able to boot up
In-Reply-To: <BAY3-F3296PnPlpNvHX000097eb@hotmail.com>
References: <BAY3-F3296PnPlpNvHX000097eb@hotmail.com>
Message-ID: <3FD8001A.9030702@sdsc.edu>

There is a conflict between the way the BIOS numbers drives and the way the install kernel numbers them (and this is not standard). You should check whether your BIOS lets you select the boot device. If it just says "Hard Disk" (no choice between IDE and SCSI), then you are stuck with needing to have GRUB on the device that the BIOS thinks is the boot device. If you can choose, then SCSI can probably be made to work.

These sorts of issues (this is a general Red Hat/Linux problem) can be quite troublesome (and annoying). We had some older hardware that had two different types of SCSI controllers with drives on each controller. The boot kernel labeled /sda differently than the BIOS did. The install went fine, but the dreaded "OS Not Found" BIOS message appeared when rebooting. The cause was that the GRUB loader was being put on Linux's notion of /sda, but when the BIOS loaded, it found nothing (because GRUB was installed on the BIOS's idea of /sdb). For this particular machine, we were not able to change the BIOS's notion -- we had to force Linux to put the boot loader on Linux's idea of /sdb.

-P

zhong wenyu wrote:
>
>> From: Greg Bruno <bruno at rocksclusters.org>
>> To: "zhong wenyu" <wyzhong78 at msn.com>
>> CC: npaci-rocks-discussion at sdsc.edu
>> Subject: Re: [Rocks-Discuss]Rocks 3.0.0 problem:not able to boot up
>> Date: Mon, 8 Dec 2003 15:31:08 -0800
>>
>>> I have installed Rocks 3.0.0 with default options successful,there
>>> was not any trouble.But I boot it up,it stopped at beginning,just
>>> show "GRUB" on the screen and waiting...
>>
>> when you built the frontend, did you start with the rocks base CD
>> then add the HPC roll?
>>
>>  - gb
>>
> I have raveled out this trouble. But I don't know why.
> I have one SCSI harddisk and one IDE disk on the frontend. I choose
> SCSI to be the first HDD and installed "/" on it. then it can not boot
> up. Even disabled the IDE HDD and install it again, It can not boot up
> also. at last I choose SCSI to be the first HDD and install, then choose
> IDE HDD to be the first and boot up, it's ok!
> GRUB must be installed on IDE HDD?
> thanks!
>
> _________________________________________________________________
> MSN Hotmail: http://www.hotmail.com

From mjk at sdsc.edu Wed Dec 10 22:04:57 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Wed, 10 Dec 2003 22:04:57 -0800
Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro
In-Reply-To: <3FD7EA11.10204@ucsd.edu>
References: <Pine.GSO.4.44.0312101722470.711-100000@poincare.emsl.pnl.gov> <3FD7EA11.10204@ucsd.edu>
Message-ID: <F23F7B5E-2B9F-11D8-981C-000A95DA5638@sdsc.edu>

Hi Vicky,

The following directory cannot resolve its symlinks anymore. If you start moving the profiles and mirror directories around, Rocks cannot find them to build kickstart files.

 -mjk

[root at rocks14 default]# ls -l
total 16
lrwxrwxrwx 1 root root   113 Nov 13 20:19 core.xml -> /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/build/graphs/default/core.xml
-rwxrwsr-x 1 root wheel 3123 Sep  3 17:10 hpc.xml
-rwxr-xr-x 1 root root   495 Sep  9 22:55 patch.xml
-rwxrwsr-x 1 root wheel  452 Sep  3 17:10 root.xml
lrwxrwxrwx 1 root root   112 Nov 13 20:19 rsh.xml -> /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/build/graphs/default/rsh.xml
-rwxrwsr-x 1 root wheel  923 Sep  3 17:10 sge.xml

On Dec 10, 2003, at 7:52 PM, V. Rowley wrote:

> Looks like python is okay:
>
>> [root at rocks14 birn-oracle1]# which python
>> /usr/bin/python
>> [root at rocks14 birn-oracle1]# python --help
>> Unknown option: --
>> usage: python [option] ...
[-c cmd | file | -] [arg] ...
    >> Options andarguments (and corresponding environment variables): >> -d : debug output from parser (also PYTHONDEBUG=x) >> -i : inspect interactively after running script, (also >> PYTHONINSPECT=x) >> and force prompts, even if stdin does not appear to be a >> terminal >> -O : optimize generated bytecode (a tad; also PYTHONOPTIMIZE=x) >> -OO : remove doc-strings in addition to the -O optimizations >> -S : don't imply 'import site' on initialization >> -t : issue warnings about inconsistent tab usage (-tt: issue >> errors) >> -u : unbuffered binary stdout and stderr (also PYTHONUNBUFFERED=x) >> -v : verbose (trace import statements) (also PYTHONVERBOSE=x) >> -x : skip first line of source, allowing use of non-Unix forms of >> #!cmd >> -X : disable class based built-in exceptions >> -c cmd : program passed in as string (terminates option list) >> file : program read from script file >> - : program read from stdin (default; interactive mode if a tty) >> arg ...: arguments passed to program in sys.argv[1:] >> Other environment variables: >> PYTHONSTARTUP: file executed on interactive startup (no default) >> PYTHONPATH : ':'-separated list of directories prefixed to the >> default module search path. The result is sys.path. >> PYTHONHOME : alternate <prefix> directory (or >> <prefix>:<exec_prefix>). >> The default module search path uses <prefix>/python1.5. >> [root at rocks14 birn-oracle1]# > > > > Tim Carlson wrote: >> On Wed, 10 Dec 2003, V. Rowley wrote: >> Did you remove python by chance? kickstart.cgi calls python directly >> in >> /usr/bin/python while rocks-dist does an "env python" >> Tim >>> Yep, I did that, but only *AFTER* getting the error. [Thought it was >>> generated by the rocks-dist sequence, but apparently not.] Go ahead. >>> Move it back. Same difference. >>> >>> Vicky >>> >>> Mason J. Katz wrote: >>> >>>> It looks like someone moved the profiles directory to profiles.orig. 
>>>> >>>> -mjk >>>> >>>> >>>> [root at rocks14 install]# ls -l >>>> total 56 >>>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom >>>> drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig >>>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 >>>> ftp.rocksclusters.org >>>> drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 >>>> ftp.rocksclusters.org.orig >>>> -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40
    >>>> kickstart.cgi >>>> drwxr-xr-x 3 root root 4096 Dec 10 20:38 >>>> profiles.orig >>>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist >>>> drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 >>>> rocks-dist.orig >>>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src >>>> drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo >>>> On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: >>>> >>>> >>>>> When I run this: >>>>> >>>>> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; >>>>> rocks-dist --dist=cdrom cdrom >>>>> >>>>> on a server installed with ROCKS 3.0.0, I eventually get this: >>>>> >>>>> >>>>>> Cleaning distribution >>>>>> Resolving versions (RPMs) >>>>>> Resolving versions (SRPMs) >>>>>> Adding support for rebuild distribution from source >>>>>> Creating files (symbolic links - fast) >>>>>> Creating symlinks to kickstart files >>>>>> Fixing Comps Database >>>>>> Generating hdlist (rpm database) >>>>>> Patching second stage loader (eKV, partioning, ...) >>>>>> patching "rocks-ekv" into distribution ... >>>>>> patching "rocks-piece-pipe" into distribution ... >>>>>> patching "PyXML" into distribution ... >>>>>> patching "expat" into distribution ... >>>>>> patching "rocks-pylib" into distribution ... >>>>>> patching "MySQL-python" into distribution ... >>>>>> patching "rocks-kickstart" into distribution ... >>>>>> patching "rocks-kickstart-profiles" into distribution ... >>>>>> patching "rocks-kickstart-dtds" into distribution ... >>>>>> building CRAM filesystem ... >>>>>> Cleaning distribution >>>>>> Resolving versions (RPMs) >>>>>> Resolving versions (SRPMs) >>>>>> Creating symlinks to kickstart files >>>>>> Generating hdlist (rpm database) >>>>>> Segregating RPMs (rocks, non-rocks) >>>>>> sh: ./kickstart.cgi: No such file or directory >>>>>> sh: ./kickstart.cgi: No such file or directory >>>>>> Traceback (innermost last): >>>>>> File "/opt/rocks/bin/rocks-dist", line 807, in ? 
>>>>>> app.run() >>>>>> File "/opt/rocks/bin/rocks-dist", line 623, in run >>>>>> eval('self.command_%s()' % (command)) >>>>>> File "<string>", line 0, in ? >>>>>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >>>>>> builder.build() >>>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >>>>>> (rocks, nonrocks) = self.segregateRPMS() >>>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in >>>>>> segregateRPMS >>>>>> for pkg in ks.getSection('packages'):
>>>>>> TypeError: loop over non-sequence
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> --
>>>>> Vicky Rowley email: vrowley at ucsd.edu
>>>>> Biomedical Informatics Research Network work: (858) 536-5980
>>>>> University of California, San Diego fax: (858) 822-0828
>>>>> 9500 Gilman Drive
>>>>> La Jolla, CA 92093-0715
>>>>>
>>>>> See pictures from our trip to China at
>>>>> http://www.sagacitech.com/Chinaweb
>>>>
>>> --
>>> Vicky Rowley email: vrowley at ucsd.edu
>>> Biomedical Informatics Research Network work: (858) 536-5980
>>> University of California, San Diego fax: (858) 822-0828
>>> 9500 Gilman Drive
>>> La Jolla, CA 92093-0715
>>>
>>> See pictures from our trip to China at
>>> http://www.sagacitech.com/Chinaweb
>>>
> --
> Vicky Rowley email: vrowley at ucsd.edu
> Biomedical Informatics Research Network work: (858) 536-5980
> University of California, San Diego fax: (858) 822-0828
> 9500 Gilman Drive
> La Jolla, CA 92093-0715
>
> See pictures from our trip to China at
> http://www.sagacitech.com/Chinaweb

From bruno at rocksclusters.org Wed Dec 10 22:31:11 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Wed, 10 Dec 2003 22:31:11 -0800
Subject: [Rocks-Discuss]Rocks 3.0.0
In-Reply-To: <3FD7D07D.8090108@physics.ucsd.edu>
References: <3FD7D07D.8090108@physics.ucsd.edu>
Message-ID: <9C7EE8E9-2BA3-11D8-9715-000A95C4E3B4@rocksclusters.org>

> I am having a problem on install of rocks 3.0.0 on my new cluster.
>
> The python error occurs right after anaconda starts and just before
> the install asks for the roll CDROM.
>
> The error refers to an inability to find or load rocks.file. The error
> is associated I think with the window that pops up and asks you in put
    > the roll CDROM in. > > The process I followed to get to this point is > > Put the Rocks 3.0.0 CDROM into the CDROM drive > Boot the system > At the prompt type frontend > Wait till anaconda starts > Error referring to unable to load rocks.file. > > I have successfully installed rocks on a smaller cluster but that has > different hardware. I used the same CDROM for both installs. > > Any thoughts? hard to say -- but some folks had similar problems due to bad memory: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-February/ 001246.html - gb From vincent_b_fox at yahoo.com Wed Dec 10 22:43:21 2003 From: vincent_b_fox at yahoo.com (Vincent Fox) Date: Wed, 10 Dec 2003 22:43:21 -0800 (PST) Subject: [Rocks-Discuss]ATLAS rpm build problems on PII platform In-Reply-To: <1B097BEE-2ADC-11D8-9715-000A95C4E3B4@rocksclusters.org> Message-ID: <20031211064321.41781.qmail@web14801.mail.yahoo.com> Okay, here's the context diff as plain text. I test-applied it using "patch -p0 < atlas.patch" and did a compile on my PII box successfully. I can send it as attachment or submit to CVS or some other way if you need: *** atlas.spec.in.orig Thu Dec 11 06:27:13 2003 --- atlas.spec.in Thu Dec 11 06:30:46 2003 *************** *** 111,117 **** --- 111,133 ---- y " | make + elif [ $CPUID -eq 4 ] + then + # + # Pentium II + # + echo "0 + y + y + n + y + linux + 0 + /usr/bin/g77 + -O + y + " | make else
#

Greg Bruno <bruno at rocksclusters.org> wrote:
> Okay, came up my own quick hack:
>
> Edit atlas.spec.in, go to "other x86" section, remove
> 2 lines right above "linux", seems to make rpm now.
>
> A more formal patch would be put in a section for
> cpuid eq 4 with this correction I suppose.

if you provide the patch, we'll include it in our next release.

 - gb

---------------------------------
Do you Yahoo!?
New Yahoo! Photos - easier uploading and sharing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031210/be5c8b04/attachment-0001.html

From naihh at imcb.a-star.edu.sg Thu Dec 11 00:08:14 2003
From: naihh at imcb.a-star.edu.sg (Nai Hong Hwa Francis)
Date: Thu, 11 Dec 2003 16:08:14 +0800
Subject: [Rocks-Discuss]RE: Have anyone successfully build a set of grid compute nodes using Rocks?
Message-ID: <5E118EED7CC277468A275F11EEEC39B94CCDB9@EXIMCB2.imcb.a-star.edu.sg>

Hi,

Has anyone successfully built a set of grid compute nodes using Rocks 3? Anyone care to share?

Nai Hong Hwa Francis
Institute of Molecular and Cell Biology (A*STAR)
30 Medical Drive
Singapore 117609.
DID: (65) 6874-6196

-----Original Message-----
From: npaci-rocks-discussion-request at sdsc.edu
[mailto:npaci-rocks-discussion-request at sdsc.edu]
Sent: Thursday, December 11, 2003 11:54 AM
To: npaci-rocks-discussion at sdsc.edu
Subject: npaci-rocks-discussion digest, Vol 1 #642 - 4 msgs

Send npaci-rocks-discussion mailing list submissions to
	npaci-rocks-discussion at sdsc.edu

To subscribe or unsubscribe via the World Wide Web, visit
	http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
or, via email, send a message with subject or body 'help' to
	npaci-rocks-discussion-request at sdsc.edu

You can reach the person managing the list at
	npaci-rocks-discussion-admin at sdsc.edu

When replying, please edit your Subject line so it is more specific than
"Re: Contents of npaci-rocks-discussion digest..."

Today's Topics:

  1. RE: Do you have a list of the various models of Gigabit Ethernet Interfaces compatible to Rocks 3? (Nai Hong Hwa Francis)
  2. Rocks 3.0.0 (Terrence Martin)
  3. Re: "TypeError: loop over non-sequence" when trying to build CD distro (V. Rowley)

--__--__--

Message: 1
Date: Thu, 11 Dec 2003 09:45:18 +0800
From: "Nai Hong Hwa Francis" <naihh at imcb.a-star.edu.sg>
To: <npaci-rocks-discussion at sdsc.edu>
Subject: [Rocks-Discuss]RE: Do you have a list of the various models of Gigabit Ethernet Interfaces compatible to Rocks 3?

Hi All,

Do you have a list of the various gigabit Ethernet interfaces that are compatible with Rocks 3? I am changing my nodes' connectivity from 10/100 to 1000. Has anyone done that, and what are the differences in performance or turnaround time?

Thanks and Regards

Nai Hong Hwa Francis
Institute of Molecular and Cell Biology (A*STAR)
30 Medical Drive
Singapore 117609.
DID: (65) 6874-6196

-----Original Message-----
From: npaci-rocks-discussion-request at sdsc.edu
[mailto:npaci-rocks-discussion-request at sdsc.edu]
Sent: Thursday, December 11, 2003 9:25 AM
To: npaci-rocks-discussion at sdsc.edu
Subject: npaci-rocks-discussion digest, Vol 1 #641 - 13 msgs

Send npaci-rocks-discussion mailing list submissions to
	npaci-rocks-discussion at sdsc.edu
To subscribe or unsubscribe via the World Wide Web, visit
	http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
or, via email, send a message with subject or body 'help' to
	npaci-rocks-discussion-request at sdsc.edu

You can reach the person managing the list at
	npaci-rocks-discussion-admin at sdsc.edu

When replying, please edit your Subject line so it is more specific than
"Re: Contents of npaci-rocks-discussion digest..."

Today's Topics:

  1. Non-homogenous legacy hardware (Chris Dwan (CCGB))
  2. Error during Make when building a new install floppy (Terrence Martin)
  3. Re: Error during Make when building a new install floppy (Tim Carlson)
  4. Re: Non-homogenous legacy hardware (Tim Carlson)
  5. ssh_known_hosts and ganglia (Jag)
  6. Re: ssh_known_hosts and ganglia (Mason J. Katz)
  7. "TypeError: loop over non-sequence" when trying to build CD distro (V. Rowley)
  8. Re: one node short in "labels" (Greg Bruno)
  9. Re: "TypeError: loop over non-sequence" when trying to build CD distro (Mason J. Katz)
  10. Re: "TypeError: loop over non-sequence" when trying to build CD distro (V. Rowley)
  11. Re: "TypeError: loop over non-sequence" when trying to build CD distro (Tim Carlson)

--__--__--

Message: 1
Date: Wed, 10 Dec 2003 14:04:53 -0600 (CST)
From: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu>
To: npaci-rocks-discussion at sdsc.edu
Subject: [Rocks-Discuss]Non-homogenous legacy hardware

I am integrating legacy systems into a ROCKS cluster, and have hit a snag with the auto-partition configuration: the new (old) systems have SCSI disks, while the old (new) ones contain IDE. This is a non-issue so long as the initial install does its default partitioning. However, I have a "replace-auto-partition.xml" file which is unworkable for the SCSI-based systems, since it makes specific reference to "hda" rather than "sda."
I would like to have a site-nodes/replace-auto-partition.xml file with a conditional such that "hda" or "sda" is used, based on the name of the node (or some other criterion). Is this possible? Thanks, in advance. If this is out there on the mailing list archives,
a pointer would be greatly appreciated. -Chris Dwan The University of Minnesota -- __--__-- Message: 2 Date: Wed, 10 Dec 2003 12:09:11 -0800 From: Terrence Martin <tmartin at physics.ucsd.edu> To: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu> Subject: [Rocks-Discuss]Error during Make when building a new install floppy I get the following error when I try to rebuild a boot floppy for rocks. This is with the default CVS checkout with an update today according to the rocks userguide. I have not actually attempted to make any changes. make[3]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3/loader' make[2]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3' strip -o loader anaconda-7.3/loader/loader strip: anaconda-7.3/loader/loader: No such file or directory make[1]: *** [loader] Error 1 make[1]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader' make: *** [loader] Error 2 Of course I could avoid all of this together and just put my binary module into the appropriate location in the boot image. Would it be correct to modify the following image file with my changes and then write it to a floppy via dd? /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img Basically I am injecting an updated e1000 driver with changes to pcitable to support the address of my gigabit cards. Terrence -- __--__-- Message: 3 Date: Wed, 10 Dec 2003 12:40:41 -0800 (PST) From: Tim Carlson <tim.carlson at pnl.gov> Subject: Re: [Rocks-Discuss]Error during Make when building a new install floppy To: Terrence Martin <tmartin at physics.ucsd.edu> Cc: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu> Reply-to: Tim Carlson <tim.carlson at pnl.gov>
On Wed, 10 Dec 2003, Terrence Martin wrote: > I get the following error when I try to rebuild a boot floppy for rocks. > You can't make a boot floppy with Rocks 3.0. That isn't supported. Or at least it wasn't the last time I checked > Of course I could avoid all of this together and just put my binary > module into the appropriate location in the boot image. > > Would it be correct to modify the following image file with my changes > and then write it to a floppy via dd? > > /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img > > Basically I am injecting an updated e1000 driver with changes to > pcitable to support the address of my gigabit cards. Modifying the bootnet.img is about 1/3 of what you need to do if you go down that path. You also need to work on netstg1.img and you'll need to update the driver in the kernel rpm that gets installed on the box. None of this is trivial. If it were me, I would go down the same path I took for updating the AIC79XX driver https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/003533.html Tim Tim Carlson Voice: (509) 376 3423 Email: Tim.Carlson at pnl.gov EMSL UNIX System Support -- __--__-- Message: 4 Date: Wed, 10 Dec 2003 12:52:38 -0800 (PST) From: Tim Carlson <tim.carlson at pnl.gov> Subject: Re: [Rocks-Discuss]Non-homogenous legacy hardware To: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu> Cc: npaci-rocks-discussion at sdsc.edu Reply-to: Tim Carlson <tim.carlson at pnl.gov> On Wed, 10 Dec 2003, Chris Dwan (CCGB) wrote: > > I am integrating legacy systems into a ROCKS cluster, and have hit a > snag with the auto-partition configuration: The new (old) systems have > SCSI disks, while old (new) ones contain IDE. This is a non-issue so
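As a concrete sketch of the dd route Terrence asks about (and Tim cautions is only part of the job): writing an image to floppy is a raw block copy. The file names below are stand-ins, not the real bootnet.img path shown above, and the real target would be a floppy device such as /dev/fd0; this demo copies to an ordinary file so it can be run safely.

```shell
# Stand-in for the real bootnet.img; create a 1.44 MB dummy image.
dd if=/dev/zero of=bootnet-demo.img bs=1440k count=1 2>/dev/null

# Writing to an actual floppy would be:
#   dd if=bootnet-demo.img of=/dev/fd0 bs=1440k
# Here we copy to a plain file instead so the command is safe to exercise.
dd if=bootnet-demo.img of=floppy-demo.out bs=1440k 2>/dev/null

# Both files should be exactly 1474560 bytes (1440 KiB).
wc -c bootnet-demo.img floppy-demo.out
```

Note that, per Tim's reply, a modified bootnet.img alone is not enough for an installed system: netstg1.img and the kernel rpm's driver also need updating.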
> long as the initial install does its default partitioning. However, I > have a "replace-auto-partition.xml" file which is unworkable for the SCSI > based systems since it makes specific reference to "hda" rather than > "sda." If you have just a single drive, then you should be able to skip the "--ondisk" bits of your "part" command Otherwise, you would have first to do something ugly like the following: http://penguin.epfl.ch/slides/kickstart/ks.cfg You could probably (maybe) wrap most of that in an <eval sh="bash"> </eval> block in the <main> block. Just guessing.. haven't tried this. Tim Tim Carlson Voice: (509) 376 3423 Email: Tim.Carlson at pnl.gov EMSL UNIX System Support -- __--__-- Message: 5 From: Jag <agrajag at dragaera.net> To: npaci-rocks-discussion at sdsc.edu Date: Wed, 10 Dec 2003 13:21:07 -0500 Subject: [Rocks-Discuss]ssh_known_hosts and ganglia I noticed a previous post on this list (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001934.html) indicating that Rocks distributes ssh keys for all the nodes over ganglia. Can anyone enlighten me as to how this is done? I looked through the ganglia docs and didn't see anything indicating how to do this, so I'm assuming Rocks made some changes. Unfortunately the rocks iso images don't seem to contain srpms, so I'm now coming here. What did Rocks do to ganglia to make the distribution of ssh keys work? Also, does anyone know where Rocks SRPMs can be found? I've done quite a bit of searching, but haven't found them anywhere. -- __--__-- Message: 6 Cc: npaci-rocks-discussion at sdsc.edu From: "Mason J. Katz" <mjk at sdsc.edu> Subject: Re: [Rocks-Discuss]ssh_known_hosts and ganglia Date: Wed, 10 Dec 2003 14:39:15 -0800 To: Jag <agrajag at dragaera.net>
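Tim's <eval> idea above might look roughly like the sketch below. This is an unverified guess, in the spirit of his own "just guessing" caveat: the <main> and <eval sh="bash"> tags come from his message, but the SCSI probe and the emitted part lines are assumptions that would need checking against the Rocks kickstart documentation and DTD.

```xml
<!-- site-nodes/replace-auto-partition.xml (hypothetical sketch) -->
<main>
  <eval sh="bash">
    # Choose the device name based on what the machine actually has.
    if [ -e /proc/scsi/scsi ]; then disk=sda; else disk=hda; fi
    echo "part / --size 8192 --ondisk $disk"
    echo "part swap --size 1024 --ondisk $disk"
  </eval>
</main>
```

As Tim notes, if each node has only one drive, simply dropping "--ondisk" from the part lines avoids the problem entirely.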
Most of the SRPMS are on our FTP site, but we've screwed this up before. The SRPMS are entirely Rocks specific so they are of little value outside of Rocks. You can also checkout our CVS tree (cvs.rocksclusters.org) where rocks/src/ganglia shows what we add. We have a ganglia-python package we created to allow us to write our own metrics at a higher level than the provided gmetric application. We've also moved from this method to a single cluster-wide ssh key for Rocks 3.1. -mjk On Dec 10, 2003, at 10:21 AM, Jag wrote: > I noticed a previous post on this list > (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001934.html) indicating that Rocks distributes ssh keys for all the > nodes over > ganglia. Can anyone enlighten me as to how this is done? > > I looked through the ganglia docs and didn't see anything indicating > how > to do this, so I'm assuming Rocks made some changes. Unfortunately the > rocks iso images don't seem to contain srpms, so I'm now coming here. > What did Rocks do to ganglia to make the distribution of ssh keys work? > > Also, does anyone know where Rocks SRPMs can be found? I've done quite > a bit of searching, but haven't found them anywhere. -- __--__-- Message: 7 Date: Wed, 10 Dec 2003 14:43:49 -0800 From: "V. Rowley" <vrowley at ucsd.edu> To: npaci-rocks-discussion at sdsc.edu Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro When I run this: [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; rocks-dist --dist=cdrom cdrom on a server installed with ROCKS 3.0.0, I eventually get this: > Cleaning distribution > Resolving versions (RPMs) > Resolving versions (SRPMs) > Adding support for rebuild distribution from source
> Creating files (symbolic links - fast) > Creating symlinks to kickstart files > Fixing Comps Database > Generating hdlist (rpm database) > Patching second stage loader (eKV, partioning, ...) > patching "rocks-ekv" into distribution ... > patching "rocks-piece-pipe" into distribution ... > patching "PyXML" into distribution ... > patching "expat" into distribution ... > patching "rocks-pylib" into distribution ... > patching "MySQL-python" into distribution ... > patching "rocks-kickstart" into distribution ... > patching "rocks-kickstart-profiles" into distribution ... > patching "rocks-kickstart-dtds" into distribution ... > building CRAM filesystem ... > Cleaning distribution > Resolving versions (RPMs) > Resolving versions (SRPMs) > Creating symlinks to kickstart files > Generating hdlist (rpm database) > Segregating RPMs (rocks, non-rocks) > sh: ./kickstart.cgi: No such file or directory > sh: ./kickstart.cgi: No such file or directory > Traceback (innermost last): > File "/opt/rocks/bin/rocks-dist", line 807, in ? > app.run() > File "/opt/rocks/bin/rocks-dist", line 623, in run > eval('self.command_%s()' % (command)) > File "<string>", line 0, in ? > File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom > builder.build() > File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build > (rocks, nonrocks) = self.segregateRPMS() > File "/opt/rocks/lib/python/rocks/build.py", line 1107, in segregateRPMS > for pkg in ks.getSection('packages'): > TypeError: loop over non-sequence Any ideas?
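The "loop over non-sequence" failure above is Python 1.5's wording for iterating over something that is not a sequence. Given the two "sh: ./kickstart.cgi: No such file or directory" lines just before the traceback, a plausible reading is that ks.getSection('packages') returned None because kickstart.cgi never ran. A minimal reproduction of that failure mode (get_section is a hypothetical stand-in for ks.getSection, whose real code is not shown in the thread):

```python
# Illustrative only: get_section stands in for ks.getSection; the point
# is the failure mode, not the real Rocks API.
def get_section(sections, name):
    # Returns None when the section was never parsed -- e.g. because
    # kickstart.cgi could not be executed, as in the log above.
    return sections.get(name)

sections = {}  # kickstart.cgi never ran, so nothing was parsed
try:
    for pkg in get_section(sections, 'packages'):
        print(pkg)
except TypeError as exc:
    # Python 1.5 worded this "loop over non-sequence";
    # modern Pythons say "'NoneType' object is not iterable".
    print("TypeError:", exc)
```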
-- Vicky Rowley email: vrowley at ucsd.edu Biomedical Informatics Research Network work: (858) 536-5980 University of California, San Diego fax: (858) 822-0828 9500 Gilman Drive La Jolla, CA 92093-0715 See pictures from our trip to China at http://www.sagacitech.com/Chinaweb -- __--__-- Message: 8 Cc: rocks <npaci-rocks-discussion at sdsc.edu> From: Greg Bruno <bruno at rocksclusters.org> Subject: Re: [Rocks-Discuss]one node short in "labels" Date: Wed, 10 Dec 2003 15:12:49 -0800
To: Vincent Fox <vincent_b_fox at yahoo.com> > So I go to the "labels" selection on the web page to print out the > pretty labels. What a nice idea by the way! > > EXCEPT....it's one node short! I go up to 0-13 and this stops at > 0-12. Any ideas where I should check to fix this? yeah, we found this corner case -- it'll be fixed in the next release. thanks for bug report. - gb -- __--__-- Message: 9 Cc: npaci-rocks-discussion at sdsc.edu From: "Mason J. Katz" <mjk at sdsc.edu> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro Date: Wed, 10 Dec 2003 15:16:27 -0800 To: "V. Rowley" <vrowley at ucsd.edu> It looks like someone moved the profiles directory to profiles.orig. -mjk [root at rocks14 install]# ls -l total 56 drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 ftp.rocksclusters.org drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 ftp.rocksclusters.org.orig -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: > When I run this: > > [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; > rocks-dist --dist=cdrom cdrom > > on a server installed with ROCKS 3.0.0, I eventually get this: > >> Cleaning distribution >> Resolving versions (RPMs) >> Resolving versions (SRPMs) >> Adding support for rebuild distribution from source >> Creating files (symbolic links - fast)
>> Creating symlinks to kickstart files >> Fixing Comps Database >> Generating hdlist (rpm database) >> Patching second stage loader (eKV, partioning, ...) >> patching "rocks-ekv" into distribution ... >> patching "rocks-piece-pipe" into distribution ... >> patching "PyXML" into distribution ... >> patching "expat" into distribution ... >> patching "rocks-pylib" into distribution ... >> patching "MySQL-python" into distribution ... >> patching "rocks-kickstart" into distribution ... >> patching "rocks-kickstart-profiles" into distribution ... >> patching "rocks-kickstart-dtds" into distribution ... >> building CRAM filesystem ... >> Cleaning distribution >> Resolving versions (RPMs) >> Resolving versions (SRPMs) >> Creating symlinks to kickstart files >> Generating hdlist (rpm database) >> Segregating RPMs (rocks, non-rocks) >> sh: ./kickstart.cgi: No such file or directory >> sh: ./kickstart.cgi: No such file or directory >> Traceback (innermost last): >> File "/opt/rocks/bin/rocks-dist", line 807, in ? >> app.run() >> File "/opt/rocks/bin/rocks-dist", line 623, in run >> eval('self.command_%s()' % (command)) >> File "<string>", line 0, in ? >> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >> builder.build() >> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >> (rocks, nonrocks) = self.segregateRPMS() >> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in >> segregateRPMS >> for pkg in ks.getSection('packages'): >> TypeError: loop over non-sequence > > Any ideas? > > -- > Vicky Rowley email: vrowley at ucsd.edu > Biomedical Informatics Research Network work: (858) 536-5980 > University of California, San Diego fax: (858) 822-0828 > 9500 Gilman Drive > La Jolla, CA 92093-0715 > > > See pictures from our trip to China at > http://www.sagacitech.com/Chinaweb -- __--__-- Message: 10 Date: Wed, 10 Dec 2003 16:50:16 -0800 From: "V. Rowley" <vrowley at ucsd.edu> To: "Mason J.
Katz" <mjk at sdsc.edu> CC: npaci-rocks-discussion at sdsc.edu Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when
trying to build CD distro Yep, I did that, but only *AFTER* getting the error. [Thought it was generated by the rocks-dist sequence, but apparently not.] Go ahead. Move it back. Same difference. Vicky Mason J. Katz wrote: > It looks like someone moved the profiles directory to profiles.orig. > > -mjk > > > [root at rocks14 install]# ls -l > total 56 > drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom > drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig > drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 > ftp.rocksclusters.org > drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 > ftp.rocksclusters.org.orig > -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi > drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig > drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist > drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig > drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src > drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo > On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: > >> When I run this: >> >> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; >> rocks-dist --dist=cdrom cdrom >> >> on a server installed with ROCKS 3.0.0, I eventually get this: >> >>> Cleaning distribution >>> Resolving versions (RPMs) >>> Resolving versions (SRPMs) >>> Adding support for rebuild distribution from source >>> Creating files (symbolic links - fast) >>> Creating symlinks to kickstart files >>> Fixing Comps Database >>> Generating hdlist (rpm database) >>> Patching second stage loader (eKV, partioning, ...) >>> patching "rocks-ekv" into distribution ... >>> patching "rocks-piece-pipe" into distribution ... >>> patching "PyXML" into distribution ... >>> patching "expat" into distribution ... >>> patching "rocks-pylib" into distribution ... >>> patching "MySQL-python" into distribution ... >>> patching "rocks-kickstart" into distribution ... >>> patching "rocks-kickstart-profiles" into distribution ...
>>> patching "rocks-kickstart-dtds" into distribution ... >>> building CRAM filesystem ... >>> Cleaning distribution
>>> Resolving versions (RPMs) >>> Resolving versions (SRPMs) >>> Creating symlinks to kickstart files >>> Generating hdlist (rpm database) >>> Segregating RPMs (rocks, non-rocks) >>> sh: ./kickstart.cgi: No such file or directory >>> sh: ./kickstart.cgi: No such file or directory >>> Traceback (innermost last): >>> File "/opt/rocks/bin/rocks-dist", line 807, in ? >>> app.run() >>> File "/opt/rocks/bin/rocks-dist", line 623, in run >>> eval('self.command_%s()' % (command)) >>> File "<string>", line 0, in ? >>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >>> builder.build() >>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >>> (rocks, nonrocks) = self.segregateRPMS() >>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in >>> segregateRPMS >>> for pkg in ks.getSection('packages'): >>> TypeError: loop over non-sequence >> >> >> Any ideas? >> >> -- >> Vicky Rowley email: vrowley at ucsd.edu >> Biomedical Informatics Research Network work: (858) 536-5980 >> University of California, San Diego fax: (858) 822-0828 >> 9500 Gilman Drive >> La Jolla, CA 92093-0715 >> >> >> See pictures from our trip to China at http://www.sagacitech.com/Chinaweb > > > -- Vicky Rowley email: vrowley at ucsd.edu Biomedical Informatics Research Network work: (858) 536-5980 University of California, San Diego fax: (858) 822-0828 9500 Gilman Drive La Jolla, CA 92093-0715 See pictures from our trip to China at http://www.sagacitech.com/Chinaweb -- __--__-- Message: 11 Date: Wed, 10 Dec 2003 17:23:25 -0800 (PST) From: Tim Carlson <tim.carlson at pnl.gov> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro
To: "V. Rowley" <vrowley at ucsd.edu> Cc: "Mason J. Katz" <mjk at sdsc.edu>, npaci-rocks-discussion at sdsc.edu Reply-to: Tim Carlson <tim.carlson at pnl.gov> On Wed, 10 Dec 2003, V. Rowley wrote: Did you remove python by chance? kickstart.cgi calls python directly in /usr/bin/python while rocks-dist does an "env python" Tim > Yep, I did that, but only *AFTER* getting the error. [Thought it was > generated by the rocks-dist sequence, but apparently not.] Go ahead. > Move it back. Same difference. > > Vicky > > Mason J. Katz wrote: > > It looks like someone moved the profiles directory to profiles.orig. > > > > -mjk > > > > > > [root at rocks14 install]# ls -l > > total 56 > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom > > drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 > > ftp.rocksclusters.org > > drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 > > ftp.rocksclusters.org.orig > > -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi > > drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist > > drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src > > drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo > > On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: > > > >> When I run this: > >> > >> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; > >> rocks-dist --dist=cdrom cdrom > >> > >> on a server installed with ROCKS 3.0.0, I eventually get this: > >> > >>> Cleaning distribution > >>> Resolving versions (RPMs) > >>> Resolving versions (SRPMs) > >>> Adding support for rebuild distribution from source > >>> Creating files (symbolic links - fast) > >>> Creating symlinks to kickstart files > >>> Fixing Comps Database > >>> Generating hdlist (rpm database) > >>> Patching second stage loader (eKV, partioning, ...) > >>> patching "rocks-ekv" into distribution ...
> >>> patching "rocks-piece-pipe" into distribution ... > >>> patching "PyXML" into distribution ... > >>> patching "expat" into distribution ... > >>> patching "rocks-pylib" into distribution ... > >>> patching "MySQL-python" into distribution ... > >>> patching "rocks-kickstart" into distribution ... > >>> patching "rocks-kickstart-profiles" into distribution ... > >>> patching "rocks-kickstart-dtds" into distribution ... > >>> building CRAM filesystem ... > >>> Cleaning distribution > >>> Resolving versions (RPMs) > >>> Resolving versions (SRPMs) > >>> Creating symlinks to kickstart files > >>> Generating hdlist (rpm database) > >>> Segregating RPMs (rocks, non-rocks) > >>> sh: ./kickstart.cgi: No such file or directory > >>> sh: ./kickstart.cgi: No such file or directory > >>> Traceback (innermost last): > >>> File "/opt/rocks/bin/rocks-dist", line 807, in ? > >>> app.run() > >>> File "/opt/rocks/bin/rocks-dist", line 623, in run > >>> eval('self.command_%s()' % (command)) > >>> File "<string>", line 0, in ? > >>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom > >>> builder.build() > >>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build > >>> (rocks, nonrocks) = self.segregateRPMS() > >>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in > >>> segregateRPMS > >>> for pkg in ks.getSection('packages'): > >>> TypeError: loop over non-sequence > >> > >> > >> Any ideas?
> >> > >> -- > >> Vicky Rowley email: vrowley at ucsd.edu > >> Biomedical Informatics Research Network work: (858) 536-5980 > >> University of California, San Diego fax: (858) 822-0828 > >> 9500 Gilman Drive > >> La Jolla, CA 92093-0715 > >> > >> > >> See pictures from our trip to China at http://www.sagacitech.com/Chinaweb > > > > > > > > -- > Vicky Rowley email: vrowley at ucsd.edu > Biomedical Informatics Research Network work: (858) 536-5980 > University of California, San Diego fax: (858) 822-0828 > 9500 Gilman Drive > La Jolla, CA 92093-0715 > > > See pictures from our trip to China at http://www.sagacitech.com/Chinaweb
> > -- __--__-- _______________________________________________ npaci-rocks-discussion mailing list npaci-rocks-discussion at sdsc.edu http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion End of npaci-rocks-discussion Digest DISCLAIMER: This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person as it may be an offence under the Official Secrets Act. Thank you. --__--__-- Message: 2 Date: Wed, 10 Dec 2003 18:03:41 -0800 From: Terrence Martin <tmartin at physics.ucsd.edu> To: npaci-rocks-discussion at sdsc.edu Subject: [Rocks-Discuss]Rocks 3.0.0 I am having a problem on install of rocks 3.0.0 on my new cluster. The python error occurs right after anaconda starts and just before the install asks for the roll CDROM. The error refers to an inability to find or load rocks.file. The error is associated I think with the window that pops up and asks you to put the roll CDROM in. The process I followed to get to this point is Put the Rocks 3.0.0 CDROM into the CDROM drive Boot the system At the prompt type frontend Wait till anaconda starts Error referring to unable to load rocks.file. I have successfully installed rocks on a smaller cluster but that has different hardware. I used the same CDROM for both installs. Any thoughts? Terrence --__--__--
Message: 3 Date: Wed, 10 Dec 2003 19:52:49 -0800 From: "V. Rowley" <vrowley at ucsd.edu> To: npaci-rocks-discussion at sdsc.edu Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro Looks like python is okay: > [root at rocks14 birn-oracle1]# which python > /usr/bin/python > [root at rocks14 birn-oracle1]# python --help > Unknown option: -- > usage: python [option] ... [-c cmd | file | -] [arg] ... > Options and arguments (and corresponding environment variables): > -d : debug output from parser (also PYTHONDEBUG=x) > -i : inspect interactively after running script, (also PYTHONINSPECT=x) > and force prompts, even if stdin does not appear to be a terminal > -O : optimize generated bytecode (a tad; also PYTHONOPTIMIZE=x) > -OO : remove doc-strings in addition to the -O optimizations > -S : don't imply 'import site' on initialization > -t : issue warnings about inconsistent tab usage (-tt: issue errors) > -u : unbuffered binary stdout and stderr (also PYTHONUNBUFFERED=x) > -v : verbose (trace import statements) (also PYTHONVERBOSE=x) > -x : skip first line of source, allowing use of non-Unix forms of #!cmd > -X : disable class based built-in exceptions > -c cmd : program passed in as string (terminates option list) > file : program read from script file > - : program read from stdin (default; interactive mode if a tty) > arg ...: arguments passed to program in sys.argv[1:] > Other environment variables: > PYTHONSTARTUP: file executed on interactive startup (no default) > PYTHONPATH : ':'-separated list of directories prefixed to the > default module search path. The result is sys.path. > PYTHONHOME : alternate <prefix> directory (or <prefix>:<exec_prefix>). > The default module search path uses <prefix>/python1.5. > [root at rocks14 birn-oracle1]# Tim Carlson wrote: > On Wed, 10 Dec 2003, V. Rowley wrote: > > Did you remove python by chance?
kickstart.cgi calls python directly in > /usr/bin/python while rocks-dist does an "env python" > > Tim > > >>Yep, I did that, but only *AFTER* getting the error. [Thought it was >>generated by the rocks-dist sequence, but apparently not.] Go ahead.
>>Move it back. Same difference. >> >>Vicky >> >>Mason J. Katz wrote: >> >>>It looks like someone moved the profiles directory to profiles.orig. >>> >>> -mjk >>> >>> >>>[root at rocks14 install]# ls -l >>>total 56 >>>drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom >>>drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig >>>drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 >>>ftp.rocksclusters.org >>>drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 >>>ftp.rocksclusters.org.orig >>>-r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi >>>drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig >>>drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist >>>drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig >>>drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src >>>drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo >>>On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: >>> >>> >>>>When I run this: >>>> >>>>[root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; >>>>rocks-dist --dist=cdrom cdrom >>>> >>>>on a server installed with ROCKS 3.0.0, I eventually get this: >>>> >>>> >>>>>Cleaning distribution >>>>>Resolving versions (RPMs) >>>>>Resolving versions (SRPMs) >>>>>Adding support for rebuild distribution from source >>>>>Creating files (symbolic links - fast) >>>>>Creating symlinks to kickstart files >>>>>Fixing Comps Database >>>>>Generating hdlist (rpm database) >>>>>Patching second stage loader (eKV, partioning, ...) >>>>> patching "rocks-ekv" into distribution ... >>>>> patching "rocks-piece-pipe" into distribution ... >>>>> patching "PyXML" into distribution ... >>>>> patching "expat" into distribution ... >>>>> patching "rocks-pylib" into distribution ... >>>>> patching "MySQL-python" into distribution ... >>>>> patching "rocks-kickstart" into distribution ... >>>>> patching "rocks-kickstart-profiles" into distribution ... >>>>> patching "rocks-kickstart-dtds" into distribution ... >>>>> building CRAM filesystem ...
>>>>>Cleaning distribution >>>>>Resolving versions (RPMs) >>>>>Resolving versions (SRPMs)
>>>>>Creating symlinks to kickstart files >>>>>Generating hdlist (rpm database) >>>>>Segregating RPMs (rocks, non-rocks) >>>>>sh: ./kickstart.cgi: No such file or directory >>>>>sh: ./kickstart.cgi: No such file or directory >>>>>Traceback (innermost last): >>>>> File "/opt/rocks/bin/rocks-dist", line 807, in ? >>>>> app.run() >>>>> File "/opt/rocks/bin/rocks-dist", line 623, in run >>>>> eval('self.command_%s()' % (command)) >>>>> File "<string>", line 0, in ? >>>>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >>>>> builder.build() >>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >>>>> (rocks, nonrocks) = self.segregateRPMS() >>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in >>>>>segregateRPMS >>>>> for pkg in ks.getSection('packages'): >>>>>TypeError: loop over non-sequence >>>> >>>> >>>>Any ideas? >>>> >>>>-- >>>>Vicky Rowley email: vrowley at ucsd.edu >>>>Biomedical Informatics Research Network work: (858) 536-5980 >>>>University of California, San Diego fax: (858) 822-0828 >>>>9500 Gilman Drive >>>>La Jolla, CA 92093-0715 >>>> >>>> >>>>See pictures from our trip to China at http://www.sagacitech.com/Chinaweb >>> >>> >>> >>-- >>Vicky Rowley email: vrowley at ucsd.edu >>Biomedical Informatics Research Network work: (858) 536-5980 >>University of California, San Diego fax: (858) 822-0828 >>9500 Gilman Drive >>La Jolla, CA 92093-0715 >> >> >>See pictures from our trip to China at http://www.sagacitech.com/Chinaweb >> >> > > > > -- Vicky Rowley email: vrowley at ucsd.edu Biomedical Informatics Research Network work: (858) 536-5980 University of California, San Diego fax: (858) 822-0828 9500 Gilman Drive La Jolla, CA 92093-0715
See pictures from our trip to China at http://www.sagacitech.com/Chinaweb --__--__-- _______________________________________________ npaci-rocks-discussion mailing list npaci-rocks-discussion at sdsc.edu http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion End of npaci-rocks-discussion Digest From naihh at imcb.a-star.edu.sg Thu Dec 11 00:09:34 2003 From: naihh at imcb.a-star.edu.sg (Nai Hong Hwa Francis) Date: Thu, 11 Dec 2003 16:09:34 +0800 Subject: [Rocks-Discuss]RE: Install rocks on Titan64 Superblade Classic with Dual Opteron 244 Message-ID: <5E118EED7CC277468A275F11EEEC39B94CCDBA@EXIMCB2.imcb.a-star.edu.sg> Hi, Has anyone successfully installed rocks on Titan64 Superblade Classic with Dual Opteron 244? Nai Hong Hwa Francis Institute of Molecular and Cell Biology (A*STAR) 30 Medical Drive Singapore 117609. DID: (65) 6874-6196 -----Original Message----- From: npaci-rocks-discussion-request at sdsc.edu [mailto:npaci-rocks-discussion-request at sdsc.edu] Sent: Thursday, December 11, 2003 11:54 AM To: npaci-rocks-discussion at sdsc.edu Subject: npaci-rocks-discussion digest, Vol 1 #642 - 4 msgs Send npaci-rocks-discussion mailing list submissions to npaci-rocks-discussion at sdsc.edu To subscribe or unsubscribe via the World Wide Web, visit http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion or, via email, send a message with subject or body 'help' to
npaci-rocks-discussion-request at sdsc.edu You can reach the person managing the list at npaci-rocks-discussion-admin at sdsc.edu When replying, please edit your Subject line so it is more specific than "Re: Contents of npaci-rocks-discussion digest..." Today's Topics: 1. RE: Do you have a list of the various models of Gigabit Ethernet Interfaces compatible to Rocks 3? (Nai Hong Hwa Francis) 2. Rocks 3.0.0 (Terrence Martin) 3. Re: "TypeError: loop over non-sequence" when trying to build CD distro (V. Rowley) --__--__-- Message: 1 Date: Thu, 11 Dec 2003 09:45:18 +0800 From: "Nai Hong Hwa Francis" <naihh at imcb.a-star.edu.sg> To: <npaci-rocks-discussion at sdsc.edu> Subject: [Rocks-Discuss]RE: Do you have a list of the various models of Gigabit Ethernet Interfaces compatible to Rocks 3? Hi All, Do you have a list of the various gigabit Ethernet interfaces that are compatible to Rocks 3? I am changing my nodes connectivity from 10/100 to 1000. Have anyone done that and how are the differences in performance or turnaround time? Have anyone successfully build a set of grid compute nodes using Rocks 3? Thanks and Regards Nai Hong Hwa Francis Institute of Molecular and Cell Biology (A*STAR) 30 Medical Drive Singapore 117609. DID: (65) 6874-6196 -----Original Message----- From: npaci-rocks-discussion-request at sdsc.edu [mailto:npaci-rocks-discussion-request at sdsc.edu] Sent: Thursday, December 11, 2003 9:25 AM To: npaci-rocks-discussion at sdsc.edu Subject: npaci-rocks-discussion digest, Vol 1 #641 - 13 msgs Send npaci-rocks-discussion mailing list submissions to npaci-rocks-discussion at sdsc.edu
  • 122.
    To subscribe orunsubscribe via the World Wide Web, visit =09 http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion or, via email, send a message with subject or body 'help' to npaci-rocks-discussion-request at sdsc.edu You can reach the person managing the list at npaci-rocks-discussion-admin at sdsc.edu When replying, please edit your Subject line so it is more specific than "Re: Contents of npaci-rocks-discussion digest..." Today's Topics: 1. Non-homogenous legacy hardware (Chris Dwan (CCGB)) 2. Error during Make when building a new install floppy (Terrence Martin) 3. Re: Error during Make when building a new install floppy (Tim Carlson) 4. Re: Non-homogenous legacy hardware (Tim Carlson) 5. ssh_known_hosts and ganglia (Jag) 6. Re: ssh_known_hosts and ganglia (Mason J. Katz) 7. "TypeError: loop over non-sequence" when trying to build CD distro (V. Rowley) 8. Re: one node short in "labels" (Greg Bruno) 9. Re: "TypeError: loop over non-sequence" when trying to build CD distro (Mason J. Katz) 10. Re: "TypeError: loop over non-sequence" when trying to build CD distro (V. Rowley) 11. Re: "TypeError: loop over non-sequence" when trying to build CD distro (Tim Carlson) -- __--__-- Message: 1 Date: Wed, 10 Dec 2003 14:04:53 -0600 (CST) From: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu> To: npaci-rocks-discussion at sdsc.edu Subject: [Rocks-Discuss]Non-homogenous legacy hardware I am integrating legacy systems into a ROCKS cluster, and have hit a snag with the auto-partition configuration: The new (old) systems have SCSI disks, while old (new) ones contain IDE. This is a non-issue so long as the initial install does its default partitioning. However, I have a "replace-auto-partition.xml" file which is unworkable for the SCSI based systems since it makes specific reference to "hda" rather than "sda." 
I would like to have a site-nodes/replace-auto-partition.xml file with a conditional such that "hda" or "sda" is used, based on the name of the node (or some other criterion). Is this possible? Thanks, in advance. If this is out there on the mailing list archives,
a pointer would be greatly appreciated. -Chris Dwan The University of Minnesota -- __--__-- Message: 2 Date: Wed, 10 Dec 2003 12:09:11 -0800 From: Terrence Martin <tmartin at physics.ucsd.edu> To: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu> Subject: [Rocks-Discuss]Error during Make when building a new install floppy I get the following error when I try to rebuild a boot floppy for rocks. This is with the default CVS checkout, with an update today, according to the rocks userguide. I have not actually attempted to make any changes. make[3]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3/loader' make[2]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3' strip -o loader anaconda-7.3/loader/loader strip: anaconda-7.3/loader/loader: No such file or directory make[1]: *** [loader] Error 1 make[1]: Leaving directory `/home/install/rocks/src/rocks/boot/7.3/loader' make: *** [loader] Error 2 Of course I could avoid all of this altogether and just put my binary module into the appropriate location in the boot image. Would it be correct to modify the following image file with my changes and then write it to a floppy via dd? /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img Basically I am injecting an updated e1000 driver with changes to pcitable to support the address of my gigabit cards. Terrence -- __--__-- Message: 3 Date: Wed, 10 Dec 2003 12:40:41 -0800 (PST) From: Tim Carlson <tim.carlson at pnl.gov> Subject: Re: [Rocks-Discuss]Error during Make when building a new install floppy To: Terrence Martin <tmartin at physics.ucsd.edu> Cc: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu> Reply-to: Tim Carlson <tim.carlson at pnl.gov>
On Wed, 10 Dec 2003, Terrence Martin wrote: > I get the following error when I try to rebuild a boot floppy for rocks. > You can't make a boot floppy with Rocks 3.0. That isn't supported. Or at least it wasn't the last time I checked. > Of course I could avoid all of this altogether and just put my binary > module into the appropriate location in the boot image. > > Would it be correct to modify the following image file with my changes > and then write it to a floppy via dd? > > /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/7.3/en/os/i386/images/bootnet.img > > Basically I am injecting an updated e1000 driver with changes to > pcitable to support the address of my gigabit cards. Modifying the bootnet.img is about 1/3 of what you need to do if you go down that path. You also need to work on netstg1.img, and you'll need to update the driver in the kernel rpm that gets installed on the box. None of this is trivial. If it were me, I would go down the same path I took for updating the AIC79XX driver: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/003533.html Tim Tim Carlson Voice: (509) 376 3423 Email: Tim.Carlson at pnl.gov EMSL UNIX System Support -- __--__-- Message: 4 Date: Wed, 10 Dec 2003 12:52:38 -0800 (PST) From: Tim Carlson <tim.carlson at pnl.gov> Subject: Re: [Rocks-Discuss]Non-homogenous legacy hardware To: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu> Cc: npaci-rocks-discussion at sdsc.edu Reply-to: Tim Carlson <tim.carlson at pnl.gov> On Wed, 10 Dec 2003, Chris Dwan (CCGB) wrote: > > I am integrating legacy systems into a ROCKS cluster, and have hit a > snag with the auto-partition configuration: The new (old) systems have > SCSI disks, while old (new) ones contain IDE. This is a non-issue so
> long as the initial install does its default partitioning. However, I > have a "replace-auto-partition.xml" file which is unworkable for the SCSI > based systems since it makes specific reference to "hda" rather than > "sda." If you have just a single drive, then you should be able to skip the "--ondisk" bits of your "part" command. Otherwise, you would first have to do something ugly like the following: http://penguin.epfl.ch/slides/kickstart/ks.cfg You could probably (maybe) wrap most of that in an <eval sh="bash"> </eval> block in the <main> block. Just guessing... haven't tried this. Tim Tim Carlson Voice: (509) 376 3423 Email: Tim.Carlson at pnl.gov EMSL UNIX System Support -- __--__-- Message: 5 From: Jag <agrajag at dragaera.net> To: npaci-rocks-discussion at sdsc.edu Date: Wed, 10 Dec 2003 13:21:07 -0500 Subject: [Rocks-Discuss]ssh_known_hosts and ganglia I noticed a previous post on this list (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001934.html) indicating that Rocks distributes ssh keys for all the nodes over ganglia. Can anyone enlighten me as to how this is done? I looked through the ganglia docs and didn't see anything indicating how to do this, so I'm assuming Rocks made some changes. Unfortunately the rocks iso images don't seem to contain srpms, so I'm now coming here. What did Rocks do to ganglia to make the distribution of ssh keys work? Also, does anyone know where Rocks SRPMs can be found? I've done quite a bit of searching, but haven't found them anywhere. -- __--__-- Message: 6 Cc: npaci-rocks-discussion at sdsc.edu From: "Mason J. Katz" <mjk at sdsc.edu> Subject: Re: [Rocks-Discuss]ssh_known_hosts and ganglia Date: Wed, 10 Dec 2003 14:39:15 -0800 To: Jag <agrajag at dragaera.net>
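[Editorial note] Tim's suggestion above — wrapping the partition commands in an <eval sh="bash"> block so the right disk is chosen at install time — could start from a small detection helper along these lines. This is a sketch only, untested against a real Rocks install; the function name and the idea of passing the partitions file as an argument (rather than hard-coding /proc/partitions) are mine, for testability:

```shell
#!/bin/sh
# Pick "sda" or "hda" from a /proc/partitions-style listing, so a
# kickstart fragment can emit the matching "--ondisk" value.
pick_disk() {
    parts="$1"
    if grep -q ' sda$' "$parts"; then
        echo sda
    elif grep -q ' hda$' "$parts"; then
        echo hda
    else
        echo unknown
    fi
}

# Inside the <eval> block one might then write, e.g.:
#   DISK=$(pick_disk /proc/partitions)
#   echo "part / --size 4096 --ondisk $DISK"
```

The "part" line above is illustrative; the real sizes and mount points would come from your existing replace-auto-partition.xml.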
Most of the SRPMS are on our FTP site, but we've screwed this up before. The SRPMS are entirely Rocks specific, so they are of little value outside of Rocks. You can also check out our CVS tree (cvs.rocksclusters.org), where rocks/src/ganglia shows what we add. We have a ganglia-python package we created to allow us to write our own metrics at a higher level than the provided gmetric application. We've also moved from this method to a single cluster-wide ssh key for Rocks 3.1. -mjk On Dec 10, 2003, at 10:21 AM, Jag wrote: > I noticed a previous post on this list > (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001934.html) indicating that Rocks distributes ssh keys for all the > nodes over > ganglia. Can anyone enlighten me as to how this is done? > > I looked through the ganglia docs and didn't see anything indicating > how > to do this, so I'm assuming Rocks made some changes. Unfortunately the > rocks iso images don't seem to contain srpms, so I'm now coming here. > What did Rocks do to ganglia to make the distribution of ssh keys work? > > Also, does anyone know where Rocks SRPMs can be found? I've done quite > a bit of searching, but haven't found them anywhere.
> Creating files (symbolic links - fast) > Creating symlinks to kickstart files > Fixing Comps Database > Generating hdlist (rpm database) > Patching second stage loader (eKV, partioning, ...) > patching "rocks-ekv" into distribution ... > patching "rocks-piece-pipe" into distribution ... > patching "PyXML" into distribution ... > patching "expat" into distribution ... > patching "rocks-pylib" into distribution ... > patching "MySQL-python" into distribution ... > patching "rocks-kickstart" into distribution ... > patching "rocks-kickstart-profiles" into distribution ... > patching "rocks-kickstart-dtds" into distribution ... > building CRAM filesystem ... > Cleaning distribution > Resolving versions (RPMs) > Resolving versions (SRPMs) > Creating symlinks to kickstart files > Generating hdlist (rpm database) > Segregating RPMs (rocks, non-rocks) > sh: ./kickstart.cgi: No such file or directory > sh: ./kickstart.cgi: No such file or directory > Traceback (innermost last): > File "/opt/rocks/bin/rocks-dist", line 807, in ? > app.run() > File "/opt/rocks/bin/rocks-dist", line 623, in run > eval('self.command_%s()' % (command)) > File "<string>", line 0, in ? > File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom > builder.build() > File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build > (rocks, nonrocks) = self.segregateRPMS() > File "/opt/rocks/lib/python/rocks/build.py", line 1107, in segregateRPMS > for pkg in ks.getSection('packages'): > TypeError: loop over non-sequence Any ideas? 
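[Editorial note] The two "sh: ./kickstart.cgi: No such file or directory" lines just before the traceback suggest rocks-dist could not run kickstart.cgi from its working directory. A quick, hedged sanity check — the helper name is mine, and the ./kickstart.cgi invocation detail is inferred from the error text, not confirmed from the rocks-dist source:

```shell
#!/bin/sh
# Report whether kickstart.cgi is present and executable in a given
# directory; the error above implies rocks-dist invokes it as
# ./kickstart.cgi, so it must exist relative to the working directory.
check_kickstart_cgi() {
    if [ -x "$1/kickstart.cgi" ]; then
        echo "ok: $1/kickstart.cgi"
    else
        echo "missing or not executable: $1/kickstart.cgi"
    fi
}

# e.g.  check_kickstart_cgi /home/install
```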
-- Vicky Rowley email: vrowley at ucsd.edu Biomedical Informatics Research Network work: (858) 536-5980 University of California, San Diego fax: (858) 822-0828 9500 Gilman Drive La Jolla, CA 92093-0715 See pictures from our trip to China at http://www.sagacitech.com/Chinaweb -- __--__-- Message: 8 Cc: rocks <npaci-rocks-discussion at sdsc.edu> From: Greg Bruno <bruno at rocksclusters.org> Subject: Re: [Rocks-Discuss]one node short in "labels" Date: Wed, 10 Dec 2003 15:12:49 -0800
To: Vincent Fox <vincent_b_fox at yahoo.com> > So I go to the "labels" selection on the web page to print out the > pretty labels. What a nice idea by the way! > > EXCEPT....it's one node short! I go up to 0-13 and this stops at > 0-12. Any ideas where I should check to fix this? yeah, we found this corner case -- it'll be fixed in the next release. thanks for the bug report. - gb -- __--__-- Message: 9 Cc: npaci-rocks-discussion at sdsc.edu From: "Mason J. Katz" <mjk at sdsc.edu> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro Date: Wed, 10 Dec 2003 15:16:27 -0800 To: "V. Rowley" <vrowley at ucsd.edu> It looks like someone moved the profiles directory to profiles.orig. -mjk [root at rocks14 install]# ls -l total 56 drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 ftp.rocksclusters.org drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 ftp.rocksclusters.org.orig -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: > When I run this: > > [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; > rocks-dist --dist=cdrom cdrom > > on a server installed with ROCKS 3.0.0, I eventually get this: > >> Cleaning distribution >> Resolving versions (RPMs) >> Resolving versions (SRPMs) >> Adding support for rebuild distribution from source >> Creating files (symbolic links - fast)
    >> Creating symlinksto kickstart files >> Fixing Comps Database >> Generating hdlist (rpm database) >> Patching second stage loader (eKV, partioning, ...) >> patching "rocks-ekv" into distribution ... >> patching "rocks-piece-pipe" into distribution ... >> patching "PyXML" into distribution ... >> patching "expat" into distribution ... >> patching "rocks-pylib" into distribution ... >> patching "MySQL-python" into distribution ... >> patching "rocks-kickstart" into distribution ... >> patching "rocks-kickstart-profiles" into distribution ... >> patching "rocks-kickstart-dtds" into distribution ... >> building CRAM filesystem ... >> Cleaning distribution >> Resolving versions (RPMs) >> Resolving versions (SRPMs) >> Creating symlinks to kickstart files >> Generating hdlist (rpm database) >> Segregating RPMs (rocks, non-rocks) >> sh: ./kickstart.cgi: No such file or directory >> sh: ./kickstart.cgi: No such file or directory >> Traceback (innermost last): >> File "/opt/rocks/bin/rocks-dist", line 807, in ? >> app.run() >> File "/opt/rocks/bin/rocks-dist", line 623, in run >> eval('self.command_%s()' % (command)) >> File "<string>", line 0, in ? >> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >> builder.build() >> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >> (rocks, nonrocks) =3D self.segregateRPMS() >> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in=20 >> segregateRPMS >> for pkg in ks.getSection('packages'): >> TypeError: loop over non-sequence > > Any ideas? > > --=20 > Vicky Rowley email: vrowley at ucsd.edu > Biomedical Informatics Research Network work: (858) 536-5980 > University of California, San Diego fax: (858) 822-0828 > 9500 Gilman Drive > La Jolla, CA 92093-0715 > > > See pictures from our trip to China at=20 > http://www.sagacitech.com/Chinaweb -- __--__-- Message: 10 Date: Wed, 10 Dec 2003 16:50:16 -0800 From: "V. Rowley" <vrowley at ucsd.edu> To: "Mason J. 
Katz" <mjk at sdsc.edu> CC: npaci-rocks-discussion at sdsc.edu Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when
    trying to buildCD distro Yep, I did that, but only *AFTER* getting the error. [Thought it was=20 generated by the rocks-dist sequence, but apparently not.] Go ahead.=20 Move it back. Same difference. Vicky Mason J. Katz wrote: > It looks like someone moved the profiles directory to profiles.orig. >=20 > -mjk >=20 >=20 > [root at rocks14 install]# ls -l > total 56 > drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom > drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig > drwxr-sr-x 3 root wheel 4096 Dec 10 21:07=20 > ftp.rocksclusters.org > drwxr-sr-x 3 root wheel 4096 Dec 10 20:38=20 > ftp.rocksclusters.org.orig > -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi > drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig > drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist > drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig > drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src > drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo > On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: >=20 >> When I run this: >> >> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;=20 >> rocks-dist --dist=3Dcdrom cdrom >> >> on a server installed with ROCKS 3.0.0, I eventually get this: >> >>> Cleaning distribution >>> Resolving versions (RPMs) >>> Resolving versions (SRPMs) >>> Adding support for rebuild distribution from source >>> Creating files (symbolic links - fast) >>> Creating symlinks to kickstart files >>> Fixing Comps Database >>> Generating hdlist (rpm database) >>> Patching second stage loader (eKV, partioning, ...) >>> patching "rocks-ekv" into distribution ... >>> patching "rocks-piece-pipe" into distribution ... >>> patching "PyXML" into distribution ... >>> patching "expat" into distribution ... >>> patching "rocks-pylib" into distribution ... >>> patching "MySQL-python" into distribution ... >>> patching "rocks-kickstart" into distribution ... >>> patching "rocks-kickstart-profiles" into distribution ... 
>>> patching "rocks-kickstart-dtds" into distribution ... >>> building CRAM filesystem ... >>> Cleaning distribution
    >>> Resolving versions(RPMs) >>> Resolving versions (SRPMs) >>> Creating symlinks to kickstart files >>> Generating hdlist (rpm database) >>> Segregating RPMs (rocks, non-rocks) >>> sh: ./kickstart.cgi: No such file or directory >>> sh: ./kickstart.cgi: No such file or directory >>> Traceback (innermost last): >>> File "/opt/rocks/bin/rocks-dist", line 807, in ? >>> app.run() >>> File "/opt/rocks/bin/rocks-dist", line 623, in run >>> eval('self.command_%s()' % (command)) >>> File "<string>", line 0, in ? >>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >>> builder.build() >>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >>> (rocks, nonrocks) =3D self.segregateRPMS() >>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in=20 >>> segregateRPMS >>> for pkg in ks.getSection('packages'): >>> TypeError: loop over non-sequence >> >> >> Any ideas? >> >> --=20 >> Vicky Rowley email: vrowley at ucsd.edu >> Biomedical Informatics Research Network work: (858) 536-5980 >> University of California, San Diego fax: (858) 822-0828 >> 9500 Gilman Drive >> La Jolla, CA 92093-0715 >> >> >> See pictures from our trip to China at http://www.sagacitech.com/Chinaweb >=20 >=20 >=20 --=20 Vicky Rowley email: vrowley at ucsd.edu Biomedical Informatics Research Network work: (858) 536-5980 University of California, San Diego fax: (858) 822-0828 9500 Gilman Drive La Jolla, CA 92093-0715 See pictures from our trip to China at http://www.sagacitech.com/Chinaweb -- __--__-- Message: 11 Date: Wed, 10 Dec 2003 17:23:25 -0800 (PST) From: Tim Carlson <tim.carlson at pnl.gov> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro
    To: "V. Rowley"<vrowley at ucsd.edu> Cc: "Mason J. Katz" <mjk at sdsc.edu>, npaci-rocks-discussion at sdsc.edu Reply-to: Tim Carlson <tim.carlson at pnl.gov> On Wed, 10 Dec 2003, V. Rowley wrote: Did you remove python by chance? kickstart.cgi calls python directly in /usr/bin/python while rocks-dist does an "env python" Tim > Yep, I did that, but only *AFTER* getting the error. [Thought it was > generated by the rocks-dist sequence, but apparently not.] Go ahead. > Move it back. Same difference. > > Vicky > > Mason J. Katz wrote: > > It looks like someone moved the profiles directory to profiles.orig. > > > > -mjk > > > > > > [root at rocks14 install]# ls -l > > total 56 > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom > > drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 > > ftp.rocksclusters.org > > drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 > > ftp.rocksclusters.org.orig > > -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi > > drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist > > drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig > > drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src > > drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo > > On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: > > > >> When I run this: > >> > >> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; > >> rocks-dist --dist=3Dcdrom cdrom > >> > >> on a server installed with ROCKS 3.0.0, I eventually get this: > >> > >>> Cleaning distribution > >>> Resolving versions (RPMs) > >>> Resolving versions (SRPMs) > >>> Adding support for rebuild distribution from source > >>> Creating files (symbolic links - fast) > >>> Creating symlinks to kickstart files > >>> Fixing Comps Database > >>> Generating hdlist (rpm database) > >>> Patching second stage loader (eKV, partioning, ...) > >>> patching "rocks-ekv" into distribution ...
    > >>> patching "rocks-piece-pipe" into distribution ... > >>> patching "PyXML" into distribution ... > >>> patching "expat" into distribution ... > >>> patching "rocks-pylib" into distribution ... > >>> patching "MySQL-python" into distribution ... > >>> patching "rocks-kickstart" into distribution ... > >>> patching "rocks-kickstart-profiles" into distribution ... > >>> patching "rocks-kickstart-dtds" into distribution ... > >>> building CRAM filesystem ... > >>> Cleaning distribution > >>> Resolving versions (RPMs) > >>> Resolving versions (SRPMs) > >>> Creating symlinks to kickstart files > >>> Generating hdlist (rpm database) > >>> Segregating RPMs (rocks, non-rocks) > >>> sh: ./kickstart.cgi: No such file or directory > >>> sh: ./kickstart.cgi: No such file or directory > >>> Traceback (innermost last): > >>> File "/opt/rocks/bin/rocks-dist", line 807, in ? > >>> app.run() > >>> File "/opt/rocks/bin/rocks-dist", line 623, in run > >>> eval('self.command_%s()' % (command)) > >>> File "<string>", line 0, in ? > >>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom > >>> builder.build() > >>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build > >>> (rocks, nonrocks) =3D self.segregateRPMS() > >>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in > >>> segregateRPMS > >>> for pkg in ks.getSection('packages'): > >>> TypeError: loop over non-sequence > >> > >> > >> Any ideas? 
> >> > >> -- > >> Vicky Rowley email: vrowley at ucsd.edu > >> Biomedical Informatics Research Network work: (858) 536-5980 > >> University of California, San Diego fax: (858) 822-0828 > >> 9500 Gilman Drive > >> La Jolla, CA 92093-0715 > >> > >> > >> See pictures from our trip to China at http://www.sagacitech.com/Chinaweb > > > > > > > > -- > Vicky Rowley email: vrowley at ucsd.edu > Biomedical Informatics Research Network work: (858) 536-5980 > University of California, San Diego fax: (858) 822-0828 > 9500 Gilman Drive > La Jolla, CA 92093-0715 > > > See pictures from our trip to China at http://www.sagacitech.com/Chinaweb
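[Editorial note] Tim's observation in this thread — kickstart.cgi hard-codes /usr/bin/python while rocks-dist runs whatever `env python` finds first on $PATH — can be checked by comparing the script's shebang with the resolved interpreter. A small sketch; the helper is mine, not part of Rocks:

```shell
#!/bin/sh
# Print the interpreter named on the first (shebang) line of a script,
# so it can be compared with `command -v python` (what "env python"
# would resolve to). Prints nothing if the file has no shebang.
shebang_of() {
    head -n 1 "$1" | sed -n 's/^#! *//p'
}

# Example (paths assumed):
#   shebang_of /home/install/kickstart.cgi
#   command -v python
```

If the two paths differ, the two tools may be running different Python installations, which would explain one of them failing while the other works.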
> > -- __--__-- _______________________________________________ npaci-rocks-discussion mailing list npaci-rocks-discussion at sdsc.edu http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion End of npaci-rocks-discussion Digest DISCLAIMER: This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person as it may be an offence under the Official Secrets Act. Thank you. --__--__-- Message: 2 Date: Wed, 10 Dec 2003 18:03:41 -0800 From: Terrence Martin <tmartin at physics.ucsd.edu> To: npaci-rocks-discussion at sdsc.edu Subject: [Rocks-Discuss]Rocks 3.0.0 I am having a problem on install of rocks 3.0.0 on my new cluster. The python error occurs right after anaconda starts and just before the install asks for the roll CDROM. The error refers to an inability to find or load rocks.file. The error is associated, I think, with the window that pops up and asks you to put the roll CDROM in. The process I followed to get to this point is: Put the Rocks 3.0.0 CDROM into the CDROM drive. Boot the system. At the prompt, type frontend. Wait till anaconda starts. Error referring to being unable to load rocks.file. I have successfully installed rocks on a smaller cluster, but that has different hardware. I used the same CDROM for both installs. Any thoughts? Terrence --__--__--
    Message: 3 Date: Wed,10 Dec 2003 19:52:49 -0800 From: "V. Rowley" <vrowley at ucsd.edu> To: npaci-rocks-discussion at sdsc.edu Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when trying to build CD distro Looks like python is okay: > [root at rocks14 birn-oracle1]# which python > /usr/bin/python > [root at rocks14 birn-oracle1]# python --help > Unknown option: -- > usage: python [option] ... [-c cmd | file | -] [arg] ... > Options and arguments (and corresponding environment variables): > -d : debug output from parser (also PYTHONDEBUG=x) > -i : inspect interactively after running script, (also PYTHONINSPECT=x) > and force prompts, even if stdin does not appear to be a terminal > -O : optimize generated bytecode (a tad; also PYTHONOPTIMIZE=x) > -OO : remove doc-strings in addition to the -O optimizations > -S : don't imply 'import site' on initialization > -t : issue warnings about inconsistent tab usage (-tt: issue errors) > -u : unbuffered binary stdout and stderr (also PYTHONUNBUFFERED=x) > -v : verbose (trace import statements) (also PYTHONVERBOSE=x) > -x : skip first line of source, allowing use of non-Unix forms of #!cmd > -X : disable class based built-in exceptions > -c cmd : program passed in as string (terminates option list) > file : program read from script file > - : program read from stdin (default; interactive mode if a tty) > arg ...: arguments passed to program in sys.argv[1:] > Other environment variables: > PYTHONSTARTUP: file executed on interactive startup (no default) > PYTHONPATH : ':'-separated list of directories prefixed to the > default module search path. The result is sys.path. > PYTHONHOME : alternate <prefix> directory (or <prefix>:<exec_prefix>). > The default module search path uses <prefix>/python1.5. > [root at rocks14 birn-oracle1]# Tim Carlson wrote: > On Wed, 10 Dec 2003, V. Rowley wrote: > > Did you remove python by chance? 
kickstart.cgi calls python directly in > /usr/bin/python while rocks-dist does an "env python" > > Tim > > >>Yep, I did that, but only *AFTER* getting the error. [Thought it was >>generated by the rocks-dist sequence, but apparently not.] Go ahead.
    >>Move it back.Same difference. >> >>Vicky >> >>Mason J. Katz wrote: >> >>>It looks like someone moved the profiles directory to profiles.orig. >>> >>> -mjk >>> >>> >>>[root at rocks14 install]# ls -l >>>total 56 >>>drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom >>>drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig >>>drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 >>>ftp.rocksclusters.org >>>drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 >>>ftp.rocksclusters.org.orig >>>-r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi >>>drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig >>>drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist >>>drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 rocks-dist.orig >>>drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src >>>drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo >>>On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: >>> >>> >>>>When I run this: >>>> >>>>[root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; >>>>rocks-dist --dist=cdrom cdrom >>>> >>>>on a server installed with ROCKS 3.0.0, I eventually get this: >>>> >>>> >>>>>Cleaning distribution >>>>>Resolving versions (RPMs) >>>>>Resolving versions (SRPMs) >>>>>Adding support for rebuild distribution from source >>>>>Creating files (symbolic links - fast) >>>>>Creating symlinks to kickstart files >>>>>Fixing Comps Database >>>>>Generating hdlist (rpm database) >>>>>Patching second stage loader (eKV, partioning, ...) >>>>> patching "rocks-ekv" into distribution ... >>>>> patching "rocks-piece-pipe" into distribution ... >>>>> patching "PyXML" into distribution ... >>>>> patching "expat" into distribution ... >>>>> patching "rocks-pylib" into distribution ... >>>>> patching "MySQL-python" into distribution ... >>>>> patching "rocks-kickstart" into distribution ... >>>>> patching "rocks-kickstart-profiles" into distribution ... >>>>> patching "rocks-kickstart-dtds" into distribution ... >>>>> building CRAM filesystem ... 
>>>>>Cleaning distribution >>>>>Resolving versions (RPMs) >>>>>Resolving versions (SRPMs)
    >>>>>Creating symlinks tokickstart files >>>>>Generating hdlist (rpm database) >>>>>Segregating RPMs (rocks, non-rocks) >>>>>sh: ./kickstart.cgi: No such file or directory >>>>>sh: ./kickstart.cgi: No such file or directory >>>>>Traceback (innermost last): >>>>> File "/opt/rocks/bin/rocks-dist", line 807, in ? >>>>> app.run() >>>>> File "/opt/rocks/bin/rocks-dist", line 623, in run >>>>> eval('self.command_%s()' % (command)) >>>>> File "<string>", line 0, in ? >>>>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >>>>> builder.build() >>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >>>>> (rocks, nonrocks) = self.segregateRPMS() >>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in >>>>>segregateRPMS >>>>> for pkg in ks.getSection('packages'): >>>>>TypeError: loop over non-sequence >>>> >>>> >>>>Any ideas? >>>> >>>>-- >>>>Vicky Rowley email: vrowley at ucsd.edu >>>>Biomedical Informatics Research Network work: (858) 536-5980 >>>>University of California, San Diego fax: (858) 822-0828 >>>>9500 Gilman Drive >>>>La Jolla, CA 92093-0715 >>>> >>>> >>>>See pictures from our trip to China at http://www.sagacitech.com/Chinaweb >>> >>> >>> >>-- >>Vicky Rowley email: vrowley at ucsd.edu >>Biomedical Informatics Research Network work: (858) 536-5980 >>University of California, San Diego fax: (858) 822-0828 >>9500 Gilman Drive >>La Jolla, CA 92093-0715 >> >> >>See pictures from our trip to China at http://www.sagacitech.com/Chinaweb >> >> > > > > -- Vicky Rowley email: vrowley at ucsd.edu Biomedical Informatics Research Network work: (858) 536-5980 University of California, San Diego fax: (858) 822-0828 9500 Gilman Drive La Jolla, CA 92093-0715
See pictures from our trip to China at http://www.sagacitech.com/Chinaweb --__--__-- _______________________________________________ npaci-rocks-discussion mailing list npaci-rocks-discussion at sdsc.edu http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion End of npaci-rocks-discussion Digest From wyzhong78 at msn.com Thu Dec 11 07:27:39 2003 From: wyzhong78 at msn.com (zhong wenyu) Date: Thu, 11 Dec 2003 23:27:39 +0800 Subject: [Rocks-Discuss]3.0.0 problem:Does my namd job allocate to each node? Message-ID: <BAY3-F25UBUhr3ukkwu000156fe@hotmail.com> I have built a rocks cluster with four dual-Xeon computers to run namd: one frontend and the other three as compute nodes. With Intel's hyper-threading technology I have 16 CPUs in all. Now I have some trouble; maybe someone can help me. I created the PBS script below, named mytask. #!/bin/csh #PBS -N NAMD #PBS -m be #PBS -l ncpus=8 #PBS -l nodes=2 # cd $PBS_O_WORKDIR /charmrun namd2 +p8 mytask.namd I typed: qsub mytask qrun N then I used qstat -f N. The message feedback showed (I'm sorry, I can't copy the original message, just the meaning): host: compute-0-0/0+compute-0-0/1+compute-0-1/0+compute-0-1/1 cpu used: 8 It's strange: why 4 hosts and 8 CPUs used?
But when I saw ganglia (the cluster status), it showed me only one node used (for example, compute-0-0); the other two are idle. I want to know whether the job was being done by one node or two. So I created a new task specified to compute-0-1; the message feedback showed no resource available. When the task ended, I checked the information and found that the CPU time per step is half that of 4 CPUs (1 node), but the whole time (including wall time) is equal. Does my namd job get allocated to each node? Please help me! thanks _________________________________________________________________ MSN Messenger: http://messenger.msn.com/cn From bruno at rocksclusters.org Thu Dec 11 07:55:17 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Thu, 11 Dec 2003 07:55:17 -0800 Subject: [Rocks-Discuss]ATLAS rpm build problems on PII platform In-Reply-To: <20031211064321.41781.qmail@web14801.mail.yahoo.com> References: <20031211064321.41781.qmail@web14801.mail.yahoo.com> Message-ID: <6A67C95F-2BF2-11D8-B821-000A95C4E3B4@rocksclusters.org> outstanding -- thanks for the patch! i just committed the change to cvs. the fix will be reflected in the upcoming release (or immediately for anyone who has the rocks source tree checked out on their local frontend). - gb On Dec 10, 2003, at 10:43 PM, Vincent Fox wrote: > Okay, here's the context diff as plain text. I test-applied it using > "patch -p0 < atlas.patch" and did a compile on my PII box > successfully. I can send it as attachment or submit to CVS or some > other way if you need: > > *** atlas.spec.in.orig Thu Dec 11 06:27:13 2003 > --- atlas.spec.in Thu Dec 11 06:30:46 2003 > *************** > *** 111,117 **** > --- 111,133 ---- > y > " | make > + elif [ $CPUID -eq 4 ] > + then > + # > + # Pentium II > + # > + echo "0 > + y > + y > + n > + y > + linux
> + 0 > + /usr/bin/g77 > + -O > + y > + " | make > else > # > > > Greg Bruno <bruno at rocksclusters.org> wrote: > > Okay, came up with my own quick hack: > > > > Edit atlas.spec.in, go to "other x86" section, remove > > 2 lines right above "linux", seems to make rpm now. > > > > A more formal patch would be to put in a section for > > cpuid eq 4 with this correction, I suppose. > > if you provide the patch, we'll include it in our next release. > > - gb > > Do you Yahoo!? > New Yahoo! Photos - easier uploading and sharing From phil at sdsc.edu Thu Dec 11 08:00:06 2003 From: phil at sdsc.edu (Philip Papadopoulos) Date: Thu, 11 Dec 2003 12:00:06 -0400 Subject: [Rocks-Discuss]3.0.0 problem:Does my namd job allocate to each node? Message-ID: <1920451470-1071158479-cardhu_blackberry.rim.net-21416-@engine05> The important thing to understand is that PBS only gives an allocation of nodes (listed in the PBS_NODES environment variable) when the job is run. It is the user's responsibility to actually start the code on multiple nodes. This is the way PBS works on all platforms, not just Rocks. PBS will start the submitted code (usually a script) on the first node listed in PBS_NODES. This environment variable is only available once the queued job is running. Your mytask script must explicitly start on the allocated nodes. PBS (actually maui) will pack jobs onto nodes by default, so allocating 8 cpu jobs to four nodes is normal, but changeable. -p -----Original Message----- From: "zhong wenyu" <wyzhong78 at msn.com> Date: Thu, 11 Dec 2003 23:27:39 To:npaci-rocks-discussion at sdsc.edu Subject: [Rocks-Discuss]3.0.0 problem:Does my namd job allocate to each node? I have built a rocks cluster with four dual-Xeon computers to run namd: one frontend and the other three as compute nodes. With Intel's hyper-threading technology I have 16 CPUs in all. Now I have some trouble; maybe someone can help me. I created the PBS script below, named mytask. #!/bin/csh #PBS -N NAMD
#PBS -m be
#PBS -l ncpus=8
#PBS -l nodes=2
#
cd $PBS_O_WORKDIR
/charmrun namd2 +p8 mytask.namd

i typed: qsub mytask qrun N then i use qstat -f N the message feedback showed(i'm sorry i can't copy the orgin message,just the meaning) host: compute-0-0/0+compute-0-0/1+compute-0-1/0+compute-0-1/1 cpu used: 8 it's strange why 4 hosts and 8 cpu used? but when i saw ganlia, the cluster status. it show me only one node used (fore example ,compute-0-0).both the other two are idle. i want to know whether the job was doing by one or two node. so i creat a new task specify to compute-0-1,message feedback show no resource availabe. while the task ended,i checked the information, found that the cpu time per step is half of 4 cpus (1 nodes),but the whole time(include wall time) is equal. Does my namd job allocate to each node? please help me! thanks _________________________________________________________________ MSN Messenger: http://messenger.msn.com/cn Sent via BlackBerry - a service from AT&T Wireless. From jlkaiser at fnal.gov Thu Dec 11 08:28:08 2003 From: jlkaiser at fnal.gov (Joe Kaiser) Date: Thu, 11 Dec 2003 10:28:08 -0600 Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets ... In-Reply-To: <1071007177.18100.58.camel@squash.scalableinformatics.com> References: <1071007177.18100.58.camel@squash.scalableinformatics.com> Message-ID: <1071160088.18486.25.camel@nietzsche.fnal.gov> Hi, I'm sorry, I thought I sent email to the list reporting how I did this. You have not said what motherboard you are using or what the error exactly is. The instructions below are for the X5DPA-GG and the error isn't reported as an error, I just get prompted to insert my driver. If it is the X5DPA-GG then 3.0.0 will support the e1000 but you have to make a change to the pcitable on the initrd.img. The current pcitable on the initrd.img does NOT have the proper deviceId for the e1000 for this board.
If you look in /etc/sysconfig/hwconf and search for the
e1000, you will find this:

class: NETWORK
bus: PCI
detached: 0
device: eth
driver: e1000
desc: "Unknown vendor|Generic e1000 device"
vendorId: 8086
deviceId: 1013
subVendorId: 8086
subDeviceId: 1213
pciType: 1

The device ID is 1013. If you look in the pcitable that comes off of the initrd.img you will see that the highest the e1000 device id's go is 1012. Just add in the proper line to the initrd.img in your /tftpboot directory and it should work. Instructions are below.

Here are the instructions: This should be done on the frontend:

cd /tftpboot/X86PC/UNDI/pxelinux/
cp initrd.img initrd.img.orig
cp initrd.img /tmp
cd /tmp
mv initrd.img initrd.gz
gunzip initrd.gz
mkdir /mnt/loop
mount -o loop initrd /mnt/loop
cd /mnt/loop/modules/
vi pcitable

Search for the e1000 drivers and add the following line:

0x8086 0x1013 "e1000" "Intel Corp.|82546EB Gigabit Ethernet Controller"

write the file

cd /tmp
umount /mnt/loop
gzip initrd
mv initrd.gz initrd.img
mv initrd.img /tftpboot/X86PC/UNDI/pxelinux/

Then boot the node. Hope this helps. Thanks, Joe

On Tue, 2003-12-09 at 15:59, Joe Landman wrote: > Folks: > > As indicated previously, I am wrestling with a Supermicro based
> cluster. None of the RH distributions come with the correct E1000 > driver, so a new kernel is needed (in the boot CD, and for > installation). > > The problem I am running into is that it isn't at all obvious/easy how > to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable > this thing to work. Following the examples in the documentation have > not met with success. Running "rocks-dist cdrom" with the new kernels > (2.4.23 works nicely on the nodes) in the force/RPMS directory generates > a bootable CD with the original 2.4.18BOOT kernel. > > What I (and I think others) need, is a simple/easy to follow method > that will generate a bootable CD with the correct linux kernel, and the > correct modules. > > Is this in process somewhere? What would be tremendously helpful is > if we can generate a binary module, and put that into the boot process > by placing it into the force/modules/binary directory (assuming one > exists) with the appropriate entry of a similar name in the > force/modules/meta directory as a simple XML document giving pci-ids, > description, name, etc. > > Anything close to this coming? Modules are killing future ROCKS > installs, the inability to easily inject a new module in there has > created a problem whereby ROCKS does not function (as the underlying RH > does not function). > > > -- =================================================================== Joe Kaiser - Systems Administrator Fermi Lab CD/OSS-SCS Never laugh at live dragons. 
630-840-6444 jlkaiser at fnal.gov =================================================================== From jghobrial at uh.edu Thu Dec 11 08:41:42 2003 From: jghobrial at uh.edu (Joseph) Date: Thu, 11 Dec 2003 10:41:42 -0600 (CST) Subject: [Rocks-Discuss]Re: Rocks Pythone Error with rocks.file In-Reply-To: <3FD82F68.9070600@physics.ucsd.edu> References: <3FD82F68.9070600@physics.ucsd.edu> Message-ID: <Pine.LNX.4.56.0312111001150.9106@mail.tlc2.uh.edu> On Thu, 11 Dec 2003, Terrence Martin wrote: > I am having the exact same error that you reported to the list on my > cluster when I try to install rocks 3.0.0. > > X tries to start, fails, then just before the HPC roll is supposed to > start I get the python error about not being able to load the rocks.file. > > The thing is that my system is a dual Xeon supermicro not AMD, so it > must not be an AMD specific issue.
> > Did you ever find a resolution to the problem? > > Thanks, > > Terrence > Yes, I guess you should check your memory as Greg suggests, but my solution was to install the frontend on a different machine and then take the HD back to the original frontend. The only problem that I had was that the build box was a single processor setup so when I went back to the dual-AMD pvfs fails because it was built against a non-SMP kernel. I installed the SMP kernel and noticed this problem. It seems the problem may be related to an SMP issue do to the fact we both have an SMP setup. I did not check the frontend's memory so this may still be a factor, but I have had no trouble with the box after the installation. My initial problem was a booting problem on the frontend due to a cdrom issue. All my other attempts at installing failed with the error you mentioned, but as I posted early I tried 3 different AMD single processor boxes and they failed. The boxes are up all the time and stressed pretty hard so I don't believe it is a memory issue. This is some very strange behaviour. Thanks, Joseph From shewa at inel.gov Thu Dec 11 10:02:59 2003 From: shewa at inel.gov (Andrew Shewmaker) Date: Thu, 11 Dec 2003 11:02:59 -0700 Subject: [Rocks-Discuss]ssh_known_hosts and ganglia Message-ID: <3FD8B153.6000205@inel.gov> "Mason J. Katz" <mjk at sdsc.edu> wrote: > We've also moved from this method to a single cluster-wide ssh key for > Rocks 3.1. How does a single key work? I have successfully set up ssh host based authentication for some non-Rocks systems using http://www.omega.telia.net/vici/openssh/ (Note that OpenSSH_3.7.1p2 requires one more setting in addition to those mentioned in the above url. In <dir-of-ssh-conf-files>/ssh_config: EnableSSHKeysign yes) But I thought it still requires that each host in the has a key... am I wrong? Do you do it differently? Thanks,
Andrew -- Andrew Shewmaker, Associate Engineer Phone: 1-208-526-1415 Idaho National Eng. and Environmental Lab. P.O. Box 1625, M.S. 3605 Idaho Falls, Idaho 83415-3605 From tmartin at physics.ucsd.edu Thu Dec 11 11:13:16 2003 From: tmartin at physics.ucsd.edu (Terrence Martin) Date: Thu, 11 Dec 2003 11:13:16 -0800 Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets ... In-Reply-To: <1071160088.18486.25.camel@nietzsche.fnal.gov> References: <1071007177.18100.58.camel@squash.scalableinformatics.com> <1071160088.18486.25.camel@nietzsche.fnal.gov> Message-ID: <3FD8C1CC.20700@physics.ucsd.edu> Hi Joe, Do you know if 2.3.2 can also benefit from the same small change? Terrence Joe Kaiser wrote: > Hi, > > I'm sorry, I thought I sent email to the list reporting how I did this. > > You have not said what motherboard you are using or what the error > exactly is. The instructions below are for the X5DPA-GG and the error > isn't reported as an error, I just get prompted to insert my driver. > > If it is the X5DPA-GG then 3.0.0 will support the e1000 but you have to > make a change to the pcitable on the initrd.img. The current pcitable > on the initrd.img does NOT have the proper deviceId for the e1000 for > this board. If you look in /etc/sysconfig/hwconf and search for the > e1000, you will find this: > > class: NETWORK > bus: PCI > detached: 0 > device: eth > driver: e1000 > desc: "Unknown vendor|Generic e1000 device" > vendorId: 8086 > deviceId: 1013 > subVendorId: 8086 > subDeviceId: 1213 > pciType: 1 > > The device ID is 1013. If you look in the pcitable that comes off of > the initrd.img you will see that the highest the e1000 device id's go is > 1012. Just add in the proper line to the initrd.img in your /tftpboot > directory and it should work. Instructions are below.
    > > Here arethe instructions: > > This should be done on the frontend: > > cd /tftpboot/X86PC/UNDI/pxelinux/ > cp initrd.img initrd.img.orig > cp initrd.img /tmp > cd /tmp > mv initrd.img initrd.gz > gunzip initrd.gz > mkdir /mnt/loop > mount -o loop initrd /mnt/loop > cd /mnt/loop/modules/ > vi pcitable > > Search for the e1000 drivers and add the following line: > > 0x8086 0x1013 "e1000" "Intel Corp.|82546EB Gigabit Ethernet > Controller" > > write the file > > cd /tmp > umount /mnt/loop > gzip initrd > mv initrd.gz initrd.img > mv initrd.img /tftpboot/X86PC/UNDI/pxelinux/ > > Then boot the node. > > Hope this helps. > > Thanks, > > Joe > > On Tue, 2003-12-09 at 15:59, Joe Landman wrote: > >>Folks: >> >> As indicated previously, I am wrestling with a Supermicro based >>cluster. None of the RH distributions come with the correct E1000 >>driver, so a new kernel is needed (in the boot CD, and for >>installation). >> >> The problem I am running into is that it isn't at all obvious/easy how >>to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable >>this thing to work. Following the examples in the documentation have >>not met with success. Running "rocks-dist cdrom" with the new kernels >>(2.4.23 works nicely on the nodes) in the force/RPMS directory generates >>a bootable CD with the original 2.4.18BOOT kernel. >> >> What I (and I think others) need, is a simple/easy to follow method >>that will generate a bootable CD with the correct linux kernel, and the >>correct modules. >> >> Is this in process somewhere? What would be tremendously helpful is >>if we can generate a binary module, and put that into the boot process
    >>by placing itinto the force/modules/binary directory (assuming one >>exists) with the appropriate entry of a similar name in the >>force/modules/meta directory as a simple XML document giving pci-ids, >>description, name, etc. >> >> Anything close to this coming? Modules are killing future ROCKS >>installs, the inability to easily inject a new module in there has >>created a problem whereby ROCKS does not function (as the underlying RH >>does not function). >> >> >> From tmartin at physics.ucsd.edu Thu Dec 11 11:19:55 2003 From: tmartin at physics.ucsd.edu (Terrence Martin) Date: Thu, 11 Dec 2003 11:19:55 -0800 Subject: [Rocks-Discuss]Re: Rocks Pythone Error with rocks.file In-Reply-To: <Pine.LNX.4.56.0312111001150.9106@mail.tlc2.uh.edu> References: <3FD82F68.9070600@physics.ucsd.edu> <Pine.LNX.4.56.0312111001150.9106@mail.tlc2.uh.edu> Message-ID: <3FD8C35B.2090309@physics.ucsd.edu> I am fairly certain it is not the memory even without memtest86. I have in my office the same Supermicro 613A-Xi (SB-613A-Xi-B) with a SUPER X5DPA-GG motherboard as the ones at the SDSC but it is from a different vendor and completely different ram from another manufacturer. When I put rocks 3.0.0 into it I get the crash of the installer in the same spot, right after the system attempts to start Xwindows and fails (either it fails because it just fails to start X or if a mouse is not present) a python error comes up complaining that the rocks.file could not be found. On the exact same system rocks 2.3.2 installs fine. Terrence Joseph wrote: > On Thu, 11 Dec 2003, Terrence Martin wrote: > > >>I am having the exact same error that you reported to the list on my >>cluster when I try to install rocks 3.0.0. >> >>X tries to start, fails, then just before the HPC roll is supposed to >>start I get the python error about not being able to load the rocks.file. >> >>The thing is that my system is a dual Xeon supermicro not AMD, so it >>must not be an AMD specific issue. 
>> >>Did you ever find a resolution to the problem? >> >>Thanks, >> >>Terrence >>
    > > > Yes, Iguess you should check your memory as Greg suggests, but my > solution was to install the frontend on a different machine and then take > the HD back to the original frontend. The only problem that I had was that > the build box was a single processor setup so when I went back to the > dual-AMD pvfs fails because it was built against a non-SMP kernel. > I installed the SMP kernel and noticed this problem. > > It seems the problem may be related to an SMP issue do to the fact we both > have an SMP setup. I did not check the frontend's memory so this may still > be a factor, but I have had no trouble with the box after the installation. > > My initial problem was a booting problem on the frontend due to a cdrom > issue. All my other attempts at installing failed with the error you mentioned, but as I > posted early I tried 3 different AMD single processor boxes and they > failed. The boxes are up all the time and stressed pretty hard so I don't > believe it is a memory issue. > > This is some very strange behaviour. > > Thanks, > Joseph > From landman at scalableinformatics.com Thu Dec 11 11:42:14 2003 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 11 Dec 2003 14:42:14 -0500 Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets ... In-Reply-To: <3FD8C1CC.20700@physics.ucsd.edu> References: <1071007177.18100.58.camel@squash.scalableinformatics.com> <1071160088.18486.25.camel@nietzsche.fnal.gov> <3FD8C1CC.20700@physics.ucsd.edu> Message-ID: <1071171734.6164.12.camel@squash.scalableinformatics.com> Hi Terrence and Joe: These are indeed X5DPA-GG. I am working on a device driver disk for 3.0 ROCKS. If this works, it is a weak hack, but it might be fine. More later (testing it now as we speak).. Joe On Thu, 2003-12-11 at 14:13, Terrence Martin wrote: > Hi Joe, > > Do you know if 2.3.2 can also benefit from the same small change? > > Terrence > > Joe Kaiser wrote: > > Hi,
    > > > > I'm sorry, I thought I sent email to the list reporting how I did this. > > > > You have not said what motherboard you are using or what the error > > exactly is. The instructions below are for the X5DPA-GG and the error > > isn't reported as an error, I just get prompted to insert my driver. > > > > If it is the X5DPA-GG then 3.0.0 will support the e1000 but you have to > > make a change to the pcitable on the initrd.img. The current pcitable > > on the initrd.img does NOT have the proper deviceId for the e1000 for > > this board. If you look in /etc/sysconfig/hwconf and search for the > > e1000, you will find this: > > > > class: NETWORK > > bus: PCI > > detached: 0 > > device: eth > > driver: e1000 > > desc: "Unknown vendor|Generic e1000 device" > > vendorId: 8086 > > deviceId: 1013 > > subVendorId: 8086 > > subDeviceId: 1213 > > pciType: 1 > > > > The device ID is 1013. If you look in the pcitable that comes off of > > the initrd.img you will see that the highest the e1000 device id's go is > > 1012. Just add in the proper line to the initrd.img in your /tftpboot > > directory and it should work. Instructions are below. > > > > Here are the instructions: > > > > This should be done on the frontend: > > > > cd /tftpboot/X86PC/UNDI/pxelinux/ > > cp initrd.img initrd.img.orig > > cp initrd.img /tmp > > cd /tmp > > mv initrd.img initrd.gz > > gunzip initrd.gz > > mkdir /mnt/loop > > mount -o loop initrd /mnt/loop > > cd /mnt/loop/modules/ > > vi pcitable > > > > Search for the e1000 drivers and add the following line: > > > > 0x8086 0x1013 "e1000" "Intel Corp.|82546EB Gigabit Ethernet > > Controller" > > > > write the file > > > > cd /tmp > > umount /mnt/loop > > gzip initrd > > mv initrd.gz initrd.img > > mv initrd.img /tftpboot/X86PC/UNDI/pxelinux/ > > > > Then boot the node.
    > > > >Hope this helps. > > > > Thanks, > > > > Joe > > > > On Tue, 2003-12-09 at 15:59, Joe Landman wrote: > > > >>Folks: > >> > >> As indicated previously, I am wrestling with a Supermicro based > >>cluster. None of the RH distributions come with the correct E1000 > >>driver, so a new kernel is needed (in the boot CD, and for > >>installation). > >> > >> The problem I am running into is that it isn't at all obvious/easy how > >>to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable > >>this thing to work. Following the examples in the documentation have > >>not met with success. Running "rocks-dist cdrom" with the new kernels > >>(2.4.23 works nicely on the nodes) in the force/RPMS directory generates > >>a bootable CD with the original 2.4.18BOOT kernel. > >> > >> What I (and I think others) need, is a simple/easy to follow method > >>that will generate a bootable CD with the correct linux kernel, and the > >>correct modules. > >> > >> Is this in process somewhere? What would be tremendously helpful is > >>if we can generate a binary module, and put that into the boot process > >>by placing it into the force/modules/binary directory (assuming one > >>exists) with the appropriate entry of a similar name in the > >>force/modules/meta directory as a simple XML document giving pci-ids, > >>description, name, etc. > >> > >> Anything close to this coming? Modules are killing future ROCKS > >>installs, the inability to easily inject a new module in there has > >>created a problem whereby ROCKS does not function (as the underlying RH > >>does not function). > >> > >> > >> -- Joseph Landman, Ph.D Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://scalableinformatics.com phone: +1 734 612 4615 From jlkaiser at fnal.gov Thu Dec 11 11:33:03 2003 From: jlkaiser at fnal.gov (Joe Kaiser) Date: Thu, 11 Dec 2003 13:33:03 -0600 Subject: [Rocks-Discuss]a name for pain ... modules/kernels/ethernets ... 
In-Reply-To: <3FD8C1CC.20700@physics.ucsd.edu> References: <1071007177.18100.58.camel@squash.scalableinformatics.com> <1071160088.18486.25.camel@nietzsche.fnal.gov> <3FD8C1CC.20700@physics.ucsd.edu> Message-ID: <1071171183.18486.28.camel@nietzsche.fnal.gov>
    I am notsure. Presumably, yes.... On Thu, 2003-12-11 at 13:13, Terrence Martin wrote: > Hi Joe, > > Do you know if 2.3.2 can also benefit from the same small change? > > Terrence > > Joe Kaiser wrote: > > Hi, > > > > I'm sorry, I thought I sent email to the list reporting how I did this. > > > > You have not said what motherboard you are using or what the error > > exactly is. The instructions below are for the X5DPA-GG and the error > > isn't reported as an error, I just get prompted to insert my driver. > > > > If it is the X5DPA-GG then 3.0.0 will support the e1000 but you have to > > make a change to the pcitable on the initrd.img. The current pcitable > > on the initrd.img does NOT have the proper deviceId for the e1000 for > > this board. If you look in /etc/sysconfig/hwconf and search for the > > e1000, you will find this: > > > > class: NETWORK > > bus: PCI > > detached: 0 > > device: eth > > driver: e1000 > > desc: "Unknown vendor|Generic e1000 device" > > vendorId: 8086 > > deviceId: 1013 > > subVendorId: 8086 > > subDeviceId: 1213 > > pciType: 1 > > > > The device ID is 1013. If you look in the pcitable that comes off of > > the initrd.img you will see that the highest the e1000 device id's go is > > 1012. Just add in the proper line to the initrd.img in your /tftpboot > > directory and it should work. Instructions are below. > > > > Here are the instructions: > > > > This should be done on the frontend: > > > > cd /tftpboot/X86PC/UNDI/pxelinux/ > > cp initrd.img initrd.img.orig > > cp initrd.img /tmp > > cd /tmp > > mv initrd.img initrd.gz > > gunzip initrd.gz > > mkdir /mnt/loop > > mount -o loop initrd /mnt/loop > > cd /mnt/loop/modules/ > > vi pcitable > > > > Search for the e1000 drivers and add the following line: > >
> > 0x8086 0x1013 "e1000" "Intel Corp.|82546EB Gigabit Ethernet > > Controller" > > > > write the file > > > > cd /tmp > > umount /mnt/loop > > gzip initrd > > mv initrd.gz initrd.img > > mv initrd.img /tftpboot/X86PC/UNDI/pxelinux/ > > > > Then boot the node. > > > > Hope this helps. > > > > Thanks, > > > > Joe > > > > On Tue, 2003-12-09 at 15:59, Joe Landman wrote: > > > >>Folks: > >> > >> As indicated previously, I am wrestling with a Supermicro based > >>cluster. None of the RH distributions come with the correct E1000 > >>driver, so a new kernel is needed (in the boot CD, and for > >>installation). > >> > >> The problem I am running into is that it isn't at all obvious/easy how > >>to install a new kernel/modules into ROCKS (3.0 or otherwise) to enable > >>this thing to work. Following the examples in the documentation have > >>not met with success. Running "rocks-dist cdrom" with the new kernels > >>(2.4.23 works nicely on the nodes) in the force/RPMS directory generates > >>a bootable CD with the original 2.4.18BOOT kernel. > >> > >> What I (and I think others) need, is a simple/easy to follow method > >>that will generate a bootable CD with the correct linux kernel, and the > >>correct modules. > >> > >> Is this in process somewhere? What would be tremendously helpful is > >>if we can generate a binary module, and put that into the boot process > >>by placing it into the force/modules/binary directory (assuming one > >>exists) with the appropriate entry of a similar name in the > >>force/modules/meta directory as a simple XML document giving pci-ids, > >>description, name, etc. > >> > >> Anything close to this coming? Modules are killing future ROCKS > >>installs, the inability to easily inject a new module in there has > >>created a problem whereby ROCKS does not function (as the underlying RH > >>does not function). 
> >> > >> > >> -- =================================================================== Joe Kaiser - Systems Administrator Fermi Lab CD/OSS-SCS Never laugh at live dragons.
630-840-6444 jlkaiser at fnal.gov =================================================================== From landman at scalableinformatics.com Thu Dec 11 11:51:51 2003 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 11 Dec 2003 14:51:51 -0500 Subject: [Rocks-Discuss]driver disk for e1000 for rocks 3.0.0 Message-ID: <1071172311.6164.18.camel@squash.scalableinformatics.com> Folks: I have built a slightly modified RedHat 7.3 driver disk with the updated 5.2.22 e1000 driver. I verified that this does indeed work on my systems (during initial portion of ROCKS install, I can now insmod e1000 in the shell window and see the ethernet... this is a big change from before). If you want the driver disk grab it from http://scalableinformatics.com/downloads/newdrv.img . To use it while installing a front end, type frontend dd at the boot prompt (not just frontend). I believe it should work for the compute nodes as well (i will test it soon). Now it is time to work around the rest of the Supermicro "features". -- Joseph Landman, Ph.D Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://scalableinformatics.com phone: +1 734 612 4615 From dtwright at uiuc.edu Thu Dec 11 12:32:54 2003 From: dtwright at uiuc.edu (Dan Wright) Date: Thu, 11 Dec 2003 14:32:54 -0600 Subject: [Rocks-Discuss]3.0.0 problem:Does my namd job allocate to each node? In-Reply-To: <BAY3-F25UBUhr3ukkwu000156fe@hotmail.com> References: <BAY3-F25UBUhr3ukkwu000156fe@hotmail.com> Message-ID: <20031211203254.GP6476@uiuc.edu> NAMD2 needs some more information to be started on multiple nodes like that. You need to give it a nodelist, in particular, so it knows where to run itself. We run namd2 on several clusters here (UIUC chemistry department). Below is a script used to exec namd2 with the right options, etc, on a cluster. Below that is a script that automates the PBS job submission. Hope this helps! 
- Dan Wright (dtwright at uiuc.edu) (http://www.scs.uiuc.edu/) (UNIX Systems Administrator, School of Chemical Sciences) (333-1728)
    -- namd2.csh -- #!/bin/csh #Script to run NAMD2 on the cluster automatically. # Courtesy of Jim Phillips. setenv CONV_RSH ssh setenv TMPDIR /tmp setenv BINDIR /home/NAMD if ( $?PBS_JOBID ) then if ( $?PBS_NODEFILE ) then set nodes = `cat $PBS_NODEFILE` else set nodes = localhost endif set nodefile = $TMPDIR/namd2.nodelist.$PBS_JOBID echo group main >! $nodefile foreach node ( $nodes ) echo host $node >> $nodefile end $BINDIR/charmrun $BINDIR/namd2 +p$#nodes ++nodelist $nodefile $* else $BINDIR/charmrun $BINDIR/namd2 ++local $* endif ------------- Here's an example script using this to start namd2 on 8 uniprocessor nodes; you'd just run it as "namd2-8p <jobfile>" to automatically do the PBS job submission and everything. -- namd2-8p -- #!/bin/bash # This script runs namd2 on 8 nodes. # echo echo "Please remember to specify the FULL PATH to your namd2 job file." echo "If you haven't done that, please press ctrl-c now and re-run" echo "this command with the full path." echo sleep 10 export SCRIPTFILE=/tmp/namd2-script.$USER.`date "+%s"` export NAMD_SCRIPT=/usr/local/bin/namd2.csh NAMD_CMD="$NAMD_SCRIPT $* > $HOME/namd2.out.`date '+%d%b%Y-%H:%M:%S'` 2>&1" cat >$SCRIPTFILE <<EOF #!/bin/bash #PBS -l nodes=8 EOF echo $NAMD_CMD >> $SCRIPTFILE echo "exit" >> $SCRIPTFILE /usr/apps/pbs/bin/qsub -V $SCRIPTFILE
sleep 5
rm -f $SCRIPTFILE
-------------- zhong wenyu said: > I have build a rocks cluster with four double Xeon computer to run namd.one > frontend and the other three to be compute.with intel's hyper threading > tecnology i have 16 cpus at all. > now I have some troubles. Maybe someone can help me. > I created bellow pbs script named mytask. > #!/bin/csh > #PBS -N NAMD > #PBS -m be > #PBS -l ncpus=8 > #PBS -l nodes=2 > # > cd $PBS_O_WORKDIR > /charmrun namd2 +p8 mytask.namd > > i typed: > qsub mytask > qrun N > > then i use > qstat -f N > > the message feedback showed(i'm sorry i can't copy the orgin message,just > the meaning) > > host: compute-0-0/0+compute-0-0/1+compute-0-1/0+compute-0-1/1 > cpu used: 8 > > it's strange why 4 hosts and 8 cpu used? > but when i saw ganlia, the cluster status. it show me only one node used > (fore example ,compute-0-0).both the other two are idle. > i want to know whether the job was doing by one or two node. > so i creat a new task specify to compute-0-1,message feedback show no > resource availabe. > while the task ended,i checked the information, found that the cpu time per > step is half of 4 cpus (1 nodes),but the whole time(include wall time) is > equal. > Does my namd job allocate to each node? > please help me! > thanks > > _________________________________________________________________ > MSN Messenger: http://messenger.msn.com/cn > - Dan Wright (dtwright at uiuc.edu) (http://www.uiuc.edu/~dtwright) -] ------------------------------ [-] -------------------------------- [- ``Weave a circle round him thrice, / And close your eyes with holy dread, For he on honeydew hath fed, / and drunk the milk of Paradise.'' Samuel Taylor Coleridge, Kubla Khan
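[Editor's note] The mechanics that Philip Papadopoulos describes above — PBS only hands the job an allocation, and the job script itself must fan the work out over the hosts listed in the nodefile — can be sketched in bash. This is an illustrative sketch of the same nodelist-building step that namd2.csh performs in csh; the file names are made up, and since it may run outside PBS it fabricates a sample 2-node / 4-slot allocation instead of reading "$PBS_NODEFILE":

```shell
#!/bin/bash
# Build a charmrun nodelist from a PBS-style nodefile (sketch only).
# Under a real PBS job you would use "$PBS_NODEFILE" here; we fabricate
# a 2-node, 4-slot allocation so the script runs anywhere.
NODEFILE=$(mktemp)
printf 'compute-0-0\ncompute-0-0\ncompute-0-1\ncompute-0-1\n' > "$NODEFILE"

NODELIST="${TMPDIR:-/tmp}/namd2.nodelist.$$"
echo "group main" > "$NODELIST"          # charmrun nodelist header
while read -r node; do
    echo "host $node" >> "$NODELIST"     # one "host" line per allocated slot
done < "$NODEFILE"

NP=$(grep -c '^host ' "$NODELIST")       # one namd2 process per listed slot
echo "charmrun namd2 +p$NP ++nodelist $NODELIST"   # the command one would exec
```

In a real job script you would replace the fabricated NODEFILE with "$PBS_NODEFILE" and exec the printed charmrun command instead of echoing it.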
    -------------- next part-------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : https://lists.sdsc.edu/pipermail/npaci-rocks- discussion/attachments/20031211/417e39b4/attachment-0001.bin From mjk at sdsc.edu Thu Dec 11 13:16:45 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Thu, 11 Dec 2003 13:16:45 -0800 Subject: [Rocks-Discuss]ssh_known_hosts and ganglia In-Reply-To: <3FD8B153.6000205@inel.gov> References: <3FD8B153.6000205@inel.gov> Message-ID: <52B4A71C-2C1F-11D8-832A-000A95DA5638@sdsc.edu> Download 3.1 (out very soon now) and poke around. Basically there is a single SSH host key, and all the nodes have a copy. This kills the "man in the middle" warning every time you reinstall. -mjk On Dec 11, 2003, at 10:02 AM, Andrew Shewmaker wrote: > "Mason J. Katz" <mjk at sdsc.edu> wrote: > > > We've also moved from this method to a single cluster-wide ssh key > for > > Rocks 3.1. > > How does a single key work? I have successfully set up ssh host > based authentication for some non-Rocks systems using > > http://www.omega.telia.net/vici/openssh/ > > (Note that OpenSSH_3.7.1p2 requires one more setting in addition > to those mentioned in the above url. > > In <dir-of-ssh-conf-files>/ssh_config: > EnableSSHKeysign yes) > > But I thought it still requires that each host in the has a key... > am I wrong? Do you do it differently? > > Thanks, > > Andrew > > -- > Andrew Shewmaker, Associate Engineer > Phone: 1-208-526-1415 > Idaho National Eng. and Environmental Lab. > P.0. Box 1625, M.S. 3605 > Idaho Falls, Idaho 83415-3605 From landman at scalableinformatics.com Thu Dec 11 13:36:44 2003
    From: landman atscalableinformatics.com (Joe Landman) Date: Thu, 11 Dec 2003 16:36:44 -0500 Subject: [Rocks-Discuss]ssh_known_hosts and ganglia In-Reply-To: <52B4A71C-2C1F-11D8-832A-000A95DA5638@sdsc.edu> References: <3FD8B153.6000205@inel.gov> <52B4A71C-2C1F-11D8-832A-000A95DA5638@sdsc.edu> Message-ID: <1071178604.6164.46.camel@squash.scalableinformatics.com> Hi Mason: Eta? I have a non-functional cluster I think I can make function with 3.1. I would be happy to be a real world beta/gamma tester for it (immediately, eg. today). Please send me a URL. ... Joe On Thu, 2003-12-11 at 16:16, Mason J. Katz wrote: > Download 3.1 (out very soon now) and poke around. Basically there is a > single SSH host key, and all the nodes have a copy. This kills the > "man in the middle" warning every time you reinstall. > > -mjk > > On Dec 11, 2003, at 10:02 AM, Andrew Shewmaker wrote: > > > "Mason J. Katz" <mjk at sdsc.edu> wrote: > > > > > We've also moved from this method to a single cluster-wide ssh key > > for > > > Rocks 3.1. > > > > How does a single key work? I have successfully set up ssh host > > based authentication for some non-Rocks systems using > > > > http://www.omega.telia.net/vici/openssh/ > > > > (Note that OpenSSH_3.7.1p2 requires one more setting in addition > > to those mentioned in the above url. > > > > In <dir-of-ssh-conf-files>/ssh_config: > > EnableSSHKeysign yes) > > > > But I thought it still requires that each host in the has a key... > > am I wrong? Do you do it differently? > > > > Thanks, > > > > Andrew > > > > -- > > Andrew Shewmaker, Associate Engineer > > Phone: 1-208-526-1415 > > Idaho National Eng. and Environmental Lab. > > P.0. Box 1625, M.S. 3605 > > Idaho Falls, Idaho 83415-3605 -- Joseph Landman, Ph.D Scalable Informatics LLC, email: landman at scalableinformatics.com
    web : http://scalableinformatics.com phone:+1 734 612 4615 From mjk at sdsc.edu Thu Dec 11 13:34:30 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Thu, 11 Dec 2003 13:34:30 -0800 Subject: [Rocks-Discuss]ssh_known_hosts and ganglia In-Reply-To: <1071178604.6164.46.camel@squash.scalableinformatics.com> References: <3FD8B153.6000205@inel.gov> <52B4A71C-2C1F-11D8-832A-000A95DA5638@sdsc.edu> <1071178604.6164.46.camel@squash.scalableinformatics.com> Message-ID: <CD814510-2C21-11D8-832A-000A95DA5638@sdsc.edu> We're too close to send out more beta's right now, but if something bad happens before friday we'll reconsider. We are shooting for next week - but absolutely before the holidays. ho ho ho. We recognize that our delay on getting a current release out there is hurting new clusters, and just having the latest redhat kernel is going to fix most of these issues. -mjk On Dec 11, 2003, at 1:36 PM, Joe Landman wrote: > Hi Mason: > > Eta? I have a non-functional cluster I think I can make function > with > 3.1. I would be happy to be a real world beta/gamma tester for it > (immediately, eg. today). Please send me a URL. ... > > Joe > > On Thu, 2003-12-11 at 16:16, Mason J. Katz wrote: >> Download 3.1 (out very soon now) and poke around. Basically there is >> a >> single SSH host key, and all the nodes have a copy. This kills the >> "man in the middle" warning every time you reinstall. >> >> -mjk >> >> On Dec 11, 2003, at 10:02 AM, Andrew Shewmaker wrote: >> >>> "Mason J. Katz" <mjk at sdsc.edu> wrote: >>> >>>> We've also moved from this method to a single cluster-wide ssh key >>> for >>>> Rocks 3.1. >>> >>> How does a single key work? I have successfully set up ssh host >>> based authentication for some non-Rocks systems using >>> >>> http://www.omega.telia.net/vici/openssh/ >>> >>> (Note that OpenSSH_3.7.1p2 requires one more setting in addition >>> to those mentioned in the above url.
    >>> >>> In <dir-of-ssh-conf-files>/ssh_config: >>>EnableSSHKeysign yes) >>> >>> But I thought it still requires that each host in the has a key... >>> am I wrong? Do you do it differently? >>> >>> Thanks, >>> >>> Andrew >>> >>> -- >>> Andrew Shewmaker, Associate Engineer >>> Phone: 1-208-526-1415 >>> Idaho National Eng. and Environmental Lab. >>> P.0. Box 1625, M.S. 3605 >>> Idaho Falls, Idaho 83415-3605 > -- > Joseph Landman, Ph.D > Scalable Informatics LLC, > email: landman at scalableinformatics.com > web : http://scalableinformatics.com > phone: +1 734 612 4615 From purikk at hotmail.com Thu Dec 11 15:06:17 2003 From: purikk at hotmail.com (Purushotham Komaravolu) Date: Thu, 11 Dec 2003 18:06:17 -0500 Subject: [Rocks-Discuss]Kernal of Rocks 3.0 References: <200312112001.hBBK1IJ18815@postal.sdsc.edu> Message-ID: <BAY1-DAV391Zg8eBpx700008b71@hotmail.com> Hi, I am a newbie to Rocks and have a few questions. I would appreciate help with those. 1) what kernel does latest rocks use, if its not latest can I use latest kernal and how? 2) is there any way to have more than 1 fronend nodes for failover redundancy? 3) did anybody install penguin compilers over the cluster Thanks Regards, Puru From bruno at rocksclusters.org Thu Dec 11 15:42:27 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Thu, 11 Dec 2003 15:42:27 -0800 Subject: [Rocks-Discuss]Kernal of Rocks 3.0 In-Reply-To: <BAY1-DAV391Zg8eBpx700008b71@hotmail.com> References: <200312112001.hBBK1IJ18815@postal.sdsc.edu> <BAY1- DAV391Zg8eBpx700008b71@hotmail.com> Message-ID: <AD988A9F-2C33-11D8-B821-000A95C4E3B4@rocksclusters.org> > 1) what kernel does latest rocks use, if its not latest can I use > latest > kernal and how?
    our upcoming release(scheduled to release next week) has kernel version 2.4.21. additionally, the new release includes documentation on how to build your own kernel RPM from a kernel.org tarball. > 2) is there any way to have more than 1 fronend nodes for failover > redundancy? no, that has not yet been implemented. > 3) did anybody install penguin compilers over the cluster i apologize, but i'm not familiar with the penguin compiler. we do have experience with gnu compilers, intel compilers and the portland group compilers. additionally, some folks in the rocks community have also successfully deployed the lahey compiler. - gb From oconnor at ucsd.edu Thu Dec 11 14:29:46 2003 From: oconnor at ucsd.edu (Edward O'Connor) Date: Thu, 11 Dec 2003 14:29:46 -0800 Subject: [Rocks-Discuss]ia64 compute nodes with ia32 frontends? In-Reply-To: <ddptix48s6.fsf@oecpc11.ucsd.edu> (Edward O'Connor's message of "Fri, 22 Aug 2003 15:39:05 -0700") References: <793188FE-D411-11D7-8529-000393C7898E@sdsc.edu> <ddptix48s6.fsf@oecpc11.ucsd.edu> Message-ID: <ddwu930yzp.fsf_-_@oecpc11.ucsd.edu> Hi everybody, I'm trying to bring up some ia64 compute nodes in a cluster with an ia32 frontend. Normally, `cd /home/install; rocks-dist mirror dist` only sets up the frontend to handle ia32 compute nodes. I tried to manhandle `rocks-dist mirror` into mirroring the ia64 stuff from ftp.rocksclusters.org by giving it the --arch=ia64 option, but that didn't work, so I went ahead and did the mirroring step by hand. After having done so, `rocks-dist dist` still doesn't do the right thing. 
So, adding --arch=ia64 to that command yields this error output:

,----
| # rocks-dist --arch=ia64 dist
| Cleaning distribution
| Resolving versions (RPMs)
| Resolving versions (SRPMs)
| Adding support for rebuild distribution from source
| Creating files (symbolic links - fast)
| Creating symlinks to kickstart files
| Fixing Comps Database
| error - comps file is missing, skipping this step
| Generating hdlist (rpm database)
| error - could not find rpm anaconda-runtime
| error - could not find genhdlist
| Patching second stage loader (eKV, partioning, ...)
| error - could not find second stage, skipping this step
`----
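
[Editor's note: the three "could not find" lines suggest the ia64 side of the hand-built mirror lacks the anaconda pieces rocks-dist patches in. A hedged pre-flight check along these lines can confirm what is missing before rerunning — the relative paths below are illustrative guesses modeled on the error text, not the real Rocks 3.0 tree layout:]

```python
import os

def missing_pieces(distroot):
    """Report which pieces named in the rocks-dist errors are absent
    from a mirrored distribution tree. Paths are assumptions."""
    checks = {
        "comps file": os.path.join(distroot, "RedHat", "base", "comps.xml"),
        "anaconda-runtime": os.path.join(distroot, "usr", "lib", "anaconda-runtime"),
        "genhdlist": os.path.join(distroot, "usr", "lib", "anaconda-runtime", "genhdlist"),
    }
    return [name for name, path in checks.items() if not os.path.exists(path)]

# Hypothetical mirror location; substitute the real ia64 tree.
print(missing_pieces("/home/install/rocks-dist/ia64"))
```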
    So my questionis, what do I need to do to the ia32 frontend to enable it to kickstart an ia64 compute node? Thanks. Ted -- Edward O'Connor oconnor at ucsd.edu From gotero at linuxprophet.com Thu Dec 11 21:14:33 2003 From: gotero at linuxprophet.com (Glen Otero) Date: Thu, 11 Dec 2003 21:14:33 -0800 Subject: Fwd: [Rocks-Discuss]RE: Have anyone successfully build a set of grid compute nodes using Rocks? Message-ID: <1279F870-2C62-11D8-AAC6-000A95CD8EC8@linuxprophet.com> > > > We put two Itanium clusters and an x86 cluster together on a grid at > SC2003 using Rocks 3.1 beta and the Grid Roll. Simple CA is installed > on the cluster frontends for you, so all one has to do is create and > exchange certificates and update the grid-mapfiles. This grid was a > joint collaboration between SDSC, Promicro Systems and Callident. > > On Dec 11, 2003, at 12:08 AM, Nai Hong Hwa Francis wrote: > >> >> >> >> Hi, >> >> Have anyone successfully build a set of grid compute nodes using Rocks >> 3? >> Anyone care to share? >> >> >> Nai Hong Hwa Francis >> Institute of Molecular and Cell Biology (A*STAR) >> 30 Medical Drive >> Singapore 117609. >> DID: (65) 6874-6196 >> >> -----Original Message----- >> From: npaci-rocks-discussion-request at sdsc.edu >> [mailto:npaci-rocks-discussion-request at sdsc.edu] >> Sent: Thursday, December 11, 2003 11:54 AM >> To: npaci-rocks-discussion at sdsc.edu >> Subject: npaci-rocks-discussion digest, Vol 1 #642 - 4 msgs >> >> Send npaci-rocks-discussion mailing list submissions to >> npaci-rocks-discussion at sdsc.edu >> >> To subscribe or unsubscribe via the World Wide Web, visit >> >> http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion >> or, via email, send a message with subject or body 'help' to
    >> npaci-rocks-discussion-request at sdsc.edu >> >> You can reach the person managing the list at >> npaci-rocks-discussion-admin at sdsc.edu >> >> When replying, please edit your Subject line so it is more specific >> than "Re: Contents of npaci-rocks-discussion digest..." >> >> >> Today's Topics: >> >> 1. RE: Do you have a list of the various models of Gigabit Ethernet >> Interfaces compatible to Rocks 3? (Nai Hong Hwa Francis) >> 2. Rocks 3.0.0 (Terrence Martin) >> 3. Re: "TypeError: loop over non-sequence" when trying >> to build CD distro (V. Rowley) >> >> --__--__-- >> >> Message: 1 >> Date: Thu, 11 Dec 2003 09:45:18 +0800 >> From: "Nai Hong Hwa Francis" <naihh at imcb.a-star.edu.sg> >> To: <npaci-rocks-discussion at sdsc.edu> >> Subject: [Rocks-Discuss]RE: Do you have a list of the various models >> of >> Gigabit Ethernet Interfaces compatible to Rocks 3? >> >> >> >> Hi All, >> >> Do you have a list of the various gigabit Ethernet interfaces that are >> compatible to Rocks 3? >> >> I am changing my nodes connectivity from 10/100 to 1000. >> >> Have anyone done that and how are the differences in performance or >> turnaround time? >> >> >> >> Thanks and Regards >> >> Nai Hong Hwa Francis >> Institute of Molecular and Cell Biology (A*STAR) >> 30 Medical Drive >> Singapore 117609. >> DID: (65) 6874-6196 >> >> -----Original Message----- >> From: npaci-rocks-discussion-request at sdsc.edu >> [mailto:npaci-rocks-discussion-request at sdsc.edu]=20 >> Sent: Thursday, December 11, 2003 9:25 AM >> To: npaci-rocks-discussion at sdsc.edu >> Subject: npaci-rocks-discussion digest, Vol 1 #641 - 13 msgs >> >> Send npaci-rocks-discussion mailing list submissions to >> npaci-rocks-discussion at sdsc.edu >>
    >> To subscribe or unsubscribe via the World Wide Web, visit >> =09 >> http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion >> or, via email, send a message with subject or body 'help' to >> npaci-rocks-discussion-request at sdsc.edu >> >> You can reach the person managing the list at >> npaci-rocks-discussion-admin at sdsc.edu >> >> When replying, please edit your Subject line so it is more specific >> than "Re: Contents of npaci-rocks-discussion digest..." >> >> >> Today's Topics: >> >> 1. Non-homogenous legacy hardware (Chris Dwan (CCGB)) >> 2. Error during Make when building a new install floppy (Terrence >> Martin) >> 3. Re: Error during Make when building a new install floppy (Tim >> Carlson) >> 4. Re: Non-homogenous legacy hardware (Tim Carlson) >> 5. ssh_known_hosts and ganglia (Jag) >> 6. Re: ssh_known_hosts and ganglia (Mason J. Katz) >> 7. "TypeError: loop over non-sequence" when trying to build CD >> distro (V. Rowley) >> 8. Re: one node short in "labels" (Greg Bruno) >> 9. Re: "TypeError: loop over non-sequence" when trying to build CD >> distro (Mason J. Katz) >> 10. Re: "TypeError: loop over non-sequence" when trying >> to build CD distro (V. Rowley) >> 11. Re: "TypeError: loop over non-sequence" when trying to >> build CD distro (Tim Carlson) >> >> -- __--__-- >> Message: 1 >> Date: Wed, 10 Dec 2003 14:04:53 -0600 (CST) >> From: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu> >> To: npaci-rocks-discussion at sdsc.edu >> Subject: [Rocks-Discuss]Non-homogenous legacy hardware >> >> >> I am integrating legacy systems into a ROCKS cluster, and have hit a >> snag with the auto-partition configuration: The new (old) systems >> have >> SCSI disks, while old (new) ones contain IDE. This is a non-issue so >> long as the initial install does its default partitioning. However, I >> have a "replace-auto-partition.xml" file which is unworkable for the >> SCSI >> based systems since it makes specific reference to "hda" rather than >> "sda." 
>> >> I would like to have a site-nodes/replace-auto-partition.xml file >> with a >> conditional such that "hda" or "sda" is used, based on the name of the >> node (or some other criterion). >> >> Is this possible? >> >> Thanks, in advance. If this is out there on the mailing list
    >> archives, >> a >> pointer would be greatly appreciated. >> >> -Chris Dwan >> The University of Minnesota >> >> -- __--__-- >> Message: 2 >> Date: Wed, 10 Dec 2003 12:09:11 -0800 >> From: Terrence Martin <tmartin at physics.ucsd.edu> >> To: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu> >> Subject: [Rocks-Discuss]Error during Make when building a new install >> floppy >> >> I get the following error when I try to rebuild a boot floppy for >> rocks. >> >> This is with the default CVS checkout with an update today according >> to=20 >> the rocks userguide. I have not actually attempted to make any >> changes. >> >> make[3]: Leaving directory=20 >> `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3/loader' >> make[2]: Leaving directory=20 >> `/home/install/rocks/src/rocks/boot/7.3/loader/anaconda-7.3' >> strip -o loader anaconda-7.3/loader/loader >> strip: anaconda-7.3/loader/loader: No such file or directory >> make[1]: *** [loader] Error 1 >> make[1]: Leaving directory >> `/home/install/rocks/src/rocks/boot/7.3/loader' >> make: *** [loader] Error 2 >> >> Of course I could avoid all of this together and just put my binary=20 >> module into the appropriate location in the boot image. >> >> Would it be correct to modify the following image file with my >> changes=20 >> and then write it to a floppy via dd? >> >> /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/ >> 7.3 >> /en/os/i386/images/bootnet.img >> >> Basically I am injecting an updated e1000 driver with changes to=20 >> pcitable to support the address of my gigabit cards. >> >> Terrence >> >> >> -- __--__-- >> Message: 3 >> Date: Wed, 10 Dec 2003 12:40:41 -0800 (PST) >> From: Tim Carlson <tim.carlson at pnl.gov> >> Subject: Re: [Rocks-Discuss]Error during Make when building a new >> install floppy >> To: Terrence Martin <tmartin at physics.ucsd.edu> >> Cc: npaci-rocks-discussion <npaci-rocks-discussion at sdsc.edu>
    >> Reply-to: TimCarlson <tim.carlson at pnl.gov> >> >> On Wed, 10 Dec 2003, Terrence Martin wrote: >> >>> I get the following error when I try to rebuild a boot floppy for >> rocks. >>> >> >> You can't make a boot floppy with Rocks 3.0. That isn't supported. Or >> at >> least it wasn't the last time I checked >> >>> Of course I could avoid all of this together and just put my binary >>> module into the appropriate location in the boot image. >>> >>> Would it be correct to modify the following image file with my >>> changes >>> and then write it to a floppy via dd? >>> >>> >> /home/install/ftp.rocksclusters.org/pub/rocks/rocks-3.0.0/rocks-dist/ >> 7.3 >> /en/os/i386/images/bootnet.img >>> >>> Basically I am injecting an updated e1000 driver with changes to >>> pcitable to support the address of my gigabit cards. >> >> Modifiying the bootnet.img is about 1/3 of what you need to do if you >> go >> down that path. You also need to work on netstg1.img and you'll need >> to >> update the drive in the kernel rpm that gets installed on the box. >> None >> of >> this is trivial. >> >> If it were me, I would go down the same path I took for updating the >> AIC79XX driver >> >> https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/ >> 003 >> 533.html >> >> Tim >> >> Tim Carlson >> Voice: (509) 376 3423 >> Email: Tim.Carlson at pnl.gov >> EMSL UNIX System Support >> >> >> -- __--__-- >> Message: 4 >> Date: Wed, 10 Dec 2003 12:52:38 -0800 (PST) >> From: Tim Carlson <tim.carlson at pnl.gov> >> Subject: Re: [Rocks-Discuss]Non-homogenous legacy hardware >> To: "Chris Dwan (CCGB)" <cdwan at mail.ahc.umn.edu> >> Cc: npaci-rocks-discussion at sdsc.edu >> Reply-to: Tim Carlson <tim.carlson at pnl.gov>
    >> >> On Wed,10 Dec 2003, Chris Dwan (CCGB) wrote: >> >>> >>> I am integrating legacy systems into a ROCKS cluster, and have hit a >>> snag with the auto-partition configuration: The new (old) systems >> have >>> SCSI disks, while old (new) ones contain IDE. This is a non-issue so >>> long as the initial install does its default partitioning. However, >>> I >>> have a "replace-auto-partition.xml" file which is unworkable for the >> SCSI >>> based systems since it makes specific reference to "hda" rather than >>> "sda." >> >> If you have just a single drive, then you should be able to skip the >> "--ondisk" bits of your "part" command >> >> Otherwise, you would have first to do something ugly like the >> following: >> >> http://penguin.epfl.ch/slides/kickstart/ks.cfg >> >> You could probably (maybe) wrap most of that in an >> <eval sh=3D"bash"> >> </eval> >> >> block in the <main> block. >> >> Just guessing.. haven't tried this. >> >> Tim >> >> Tim Carlson >> Voice: (509) 376 3423 >> Email: Tim.Carlson at pnl.gov >> EMSL UNIX System Support >> >> >> -- __--__-- >> Message: 5 >> From: Jag <agrajag at dragaera.net> >> To: npaci-rocks-discussion at sdsc.edu >> Date: Wed, 10 Dec 2003 13:21:07 -0500 >> Subject: [Rocks-Discuss]ssh_known_hosts and ganglia >> >> I noticed a previous post on this list >> (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/ >> 001934 >> .html) indicating that Rocks distributes ssh keys for all the nodes >> over >> ganglia. Can anyone enlighten me as to how this is done? >> >> I looked through the ganglia docs and didn't see anything indicating >> how >> to do this, so I'm assuming Rocks made some changes. Unfortunately >> the >> rocks iso images don't seem to contain srpms, so I'm now coming >> here.=20
    >> What didRocks do to ganglia to make the distribution of ssh keys >> work? >> >> Also, does anyone know where Rocks SRPMs can be found? I've done >> quite >> a bit of searching, but haven't found them anywhere. >> >> >> -- __--__-- >> Message: 6 >> Cc: npaci-rocks-discussion at sdsc.edu >> From: "Mason J. Katz" <mjk at sdsc.edu> >> Subject: Re: [Rocks-Discuss]ssh_known_hosts and ganglia >> Date: Wed, 10 Dec 2003 14:39:15 -0800 >> To: Jag <agrajag at dragaera.net> >> >> Most of the SRPMS are on our FTP site, but we've screwed this up =20 >> before. The SRPMS are entirely Rocks specific so they are of little >> =20 >> value outside of Rocks. You can also checkout our CVS tree =20 >> (cvs.rocksclusters.org) where rocks/src/ganglia shows what we add. We >> =20 >> have a ganglia-python package we created to allow us to write our own >> =20 >> metrics at a high level than the provide gmetric application. We've >> =20 >> also moved from this method to a single cluster-wide ssh key for Rocks >> =20 >> 3.1. >> >> -mjk >> >> On Dec 10, 2003, at 10:21 AM, Jag wrote: >> >>> I noticed a previous post on this list >>> (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/=20 >>> 001934.html) indicating that Rocks distributes ssh keys for all the >> =20 >>> nodes over >>> ganglia. Can anyone enlighten me as to how this is done? >>> >>> I looked through the ganglia docs and didn't see anything indicating >> =20 >>> how >>> to do this, so I'm assuming Rocks made some changes. Unfortunately >> the >>> rocks iso images don't seem to contain srpms, so I'm now coming here. >>> What did Rocks do to ganglia to make the distribution of ssh keys >> work? >>> >>> Also, does anyone know where Rocks SRPMs can be found? I've done >> quite >>> a bit of searching, but haven't found them anywhere. >> >> >> -- __--__-- >> Message: 7 >> Date: Wed, 10 Dec 2003 14:43:49 -0800 >> From: "V. Rowley" <vrowley at ucsd.edu>
    >> To: npaci-rocks-discussionat sdsc.edu >> Subject: [Rocks-Discuss]"TypeError: loop over non-sequence" when >> trying >> to build CD distro >> >> When I run this: >> >> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; >> rocks-dist >> >> --dist=3Dcdrom cdrom >> >> on a server installed with ROCKS 3.0.0, I eventually get this: >> >>> Cleaning distribution >>> Resolving versions (RPMs) >>> Resolving versions (SRPMs) >>> Adding support for rebuild distribution from source >>> Creating files (symbolic links - fast) >>> Creating symlinks to kickstart files >>> Fixing Comps Database >>> Generating hdlist (rpm database) >>> Patching second stage loader (eKV, partioning, ...) >>> patching "rocks-ekv" into distribution ... >>> patching "rocks-piece-pipe" into distribution ... >>> patching "PyXML" into distribution ... >>> patching "expat" into distribution ... >>> patching "rocks-pylib" into distribution ... >>> patching "MySQL-python" into distribution ... >>> patching "rocks-kickstart" into distribution ... >>> patching "rocks-kickstart-profiles" into distribution ... >>> patching "rocks-kickstart-dtds" into distribution ... >>> building CRAM filesystem ... >>> Cleaning distribution >>> Resolving versions (RPMs) >>> Resolving versions (SRPMs) >>> Creating symlinks to kickstart files >>> Generating hdlist (rpm database) >>> Segregating RPMs (rocks, non-rocks) >>> sh: ./kickstart.cgi: No such file or directory >>> sh: ./kickstart.cgi: No such file or directory >>> Traceback (innermost last): >>> File "/opt/rocks/bin/rocks-dist", line 807, in ? >>> app.run() >>> File "/opt/rocks/bin/rocks-dist", line 623, in run >>> eval('self.command_%s()' % (command)) >>> File "<string>", line 0, in ? 
>>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom
>>> builder.build()
>>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build
>>> (rocks, nonrocks) = self.segregateRPMS()
>>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in segregateRPMS
>>> for pkg in ks.getSection('packages'):
>>> TypeError: loop over non-sequence
>>
>> Any ideas?
>>
>> --
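
[Editor's note: the failing line iterates whatever ks.getSection('packages') returns, and in Python 1.5 iterating None produced exactly "loop over non-sequence" — plausible here, since the two preceding lines show kickstart.cgi failed to run at all. A stripped-down illustration; getSection below is a stand-in, not the real rocks code, and modern Pythons word the error differently:]

```python
# Stand-in for the lookup in build.py: returns None when the requested
# section is missing, which is the likely trigger of the traceback.
def getSection(name, _sections={"packages": ["rocks-ekv", "expat"]}):
    return _sections.get(name)

try:
    for pkg in getSection("nonexistent"):  # iterating None raises TypeError
        pass
except TypeError as err:
    print("iteration failed:", err)

# Defensive version: treat a missing section as an empty package list.
for pkg in getSection("packages") or []:
    print(pkg)
```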
    >> Vicky Rowley email: vrowley at ucsd.edu >> Biomedical Informatics Research Network work: (858) 536-5980 >> University of California, San Diego fax: (858) 822-0828 >> 9500 Gilman Drive >> La Jolla, CA 92093-0715 >> >> >> See pictures from our trip to China at >> http://www.sagacitech.com/Chinaweb >> >> >> -- __--__-- >> Message: 8 >> Cc: rocks <npaci-rocks-discussion at sdsc.edu> >> From: Greg Bruno <bruno at rocksclusters.org> >> Subject: Re: [Rocks-Discuss]one node short in "labels" >> Date: Wed, 10 Dec 2003 15:12:49 -0800 >> To: Vincent Fox <vincent_b_fox at yahoo.com> >> >>> So I go to the "labels" selection on the web page to print out = >> the=3D20 >>> pretty labels. What a nice idea by the way! >>> =3DA0 >>> EXCEPT....it's one node short! I go up to 0-13 and this stops at=3D20 >>> 0-12.=3DA0 Any ideas where I should check to fix this? >> >> yeah, we found this corner case -- it'll be fixed in the next release. >> >> thanks for bug report. >> >> - gb >> >> >> -- __--__-- >> Message: 9 >> Cc: npaci-rocks-discussion at sdsc.edu >> From: "Mason J. Katz" <mjk at sdsc.edu> >> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when >> trying to build CD distro >> Date: Wed, 10 Dec 2003 15:16:27 -0800 >> To: "V. Rowley" <vrowley at ucsd.edu> >> >> It looks like someone moved the profiles directory to profiles.orig. >> >> -mjk >> >> >> [root at rocks14 install]# ls -l >> total 56 >> drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom >> drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig >> drwxr-sr-x 3 root wheel 4096 Dec 10 21:07=20 >> ftp.rocksclusters.org >> drwxr-sr-x 3 root wheel 4096 Dec 10 20:38=20 >> ftp.rocksclusters.org.orig >> -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi >> drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig >> drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist >> drwxrwsr-x 3 root wheel 4096 Dec 10 20:38
    >> rocks-dist.orig >> drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src >> drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo >> On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: >> >>> When I run this: >>> >>> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;=20 >>> rocks-dist --dist=3Dcdrom cdrom >>> >>> on a server installed with ROCKS 3.0.0, I eventually get this: >>> >>>> Cleaning distribution >>>> Resolving versions (RPMs) >>>> Resolving versions (SRPMs) >>>> Adding support for rebuild distribution from source >>>> Creating files (symbolic links - fast) >>>> Creating symlinks to kickstart files >>>> Fixing Comps Database >>>> Generating hdlist (rpm database) >>>> Patching second stage loader (eKV, partioning, ...) >>>> patching "rocks-ekv" into distribution ... >>>> patching "rocks-piece-pipe" into distribution ... >>>> patching "PyXML" into distribution ... >>>> patching "expat" into distribution ... >>>> patching "rocks-pylib" into distribution ... >>>> patching "MySQL-python" into distribution ... >>>> patching "rocks-kickstart" into distribution ... >>>> patching "rocks-kickstart-profiles" into distribution ... >>>> patching "rocks-kickstart-dtds" into distribution ... >>>> building CRAM filesystem ... >>>> Cleaning distribution >>>> Resolving versions (RPMs) >>>> Resolving versions (SRPMs) >>>> Creating symlinks to kickstart files >>>> Generating hdlist (rpm database) >>>> Segregating RPMs (rocks, non-rocks) >>>> sh: ./kickstart.cgi: No such file or directory >>>> sh: ./kickstart.cgi: No such file or directory >>>> Traceback (innermost last): >>>> File "/opt/rocks/bin/rocks-dist", line 807, in ? >>>> app.run() >>>> File "/opt/rocks/bin/rocks-dist", line 623, in run >>>> eval('self.command_%s()' % (command)) >>>> File "<string>", line 0, in ? 
>>>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >>>> builder.build() >>>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >>>> (rocks, nonrocks) =3D self.segregateRPMS() >>>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in=20 >>>> segregateRPMS >>>> for pkg in ks.getSection('packages'): >>>> TypeError: loop over non-sequence >>> >>> Any ideas? >>> >>> --=20 >>> Vicky Rowley email: vrowley at ucsd.edu >>> Biomedical Informatics Research Network work: (858) 536-5980
    >>> University ofCalifornia, San Diego fax: (858) 822-0828 >>> 9500 Gilman Drive >>> La Jolla, CA 92093-0715 >>> >>> >>> See pictures from our trip to China at=20 >>> http://www.sagacitech.com/Chinaweb >> >> >> -- __--__-- >> Message: 10 >> Date: Wed, 10 Dec 2003 16:50:16 -0800 >> From: "V. Rowley" <vrowley at ucsd.edu> >> To: "Mason J. Katz" <mjk at sdsc.edu> >> CC: npaci-rocks-discussion at sdsc.edu >> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when >> trying >> to build CD distro >> >> Yep, I did that, but only *AFTER* getting the error. [Thought it >> was=20 >> generated by the rocks-dist sequence, but apparently not.] Go >> ahead.=20 >> Move it back. Same difference. >> >> Vicky >> >> Mason J. Katz wrote: >>> It looks like someone moved the profiles directory to profiles.orig. >>> =20 >>> -mjk >>> =20 >>> =20 >>> [root at rocks14 install]# ls -l >>> total 56 >>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom >>> drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig >>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:07=20 >>> ftp.rocksclusters.org >>> drwxr-sr-x 3 root wheel 4096 Dec 10 20:38=20 >>> ftp.rocksclusters.org.orig >>> -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 kickstart.cgi >>> drwxr-xr-x 3 root root 4096 Dec 10 20:38 profiles.orig >>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist >>> drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 >> rocks-dist.orig >>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src >>> drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo >>> On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: >>> =20 >>>> When I run this: >>>> >>>> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ;=20 >>>> rocks-dist --dist=3Dcdrom cdrom >>>> >>>> on a server installed with ROCKS 3.0.0, I eventually get this: >>>> >>>>> Cleaning distribution >>>>> Resolving versions (RPMs)
    >>>>> Resolving versions(SRPMs) >>>>> Adding support for rebuild distribution from source >>>>> Creating files (symbolic links - fast) >>>>> Creating symlinks to kickstart files >>>>> Fixing Comps Database >>>>> Generating hdlist (rpm database) >>>>> Patching second stage loader (eKV, partioning, ...) >>>>> patching "rocks-ekv" into distribution ... >>>>> patching "rocks-piece-pipe" into distribution ... >>>>> patching "PyXML" into distribution ... >>>>> patching "expat" into distribution ... >>>>> patching "rocks-pylib" into distribution ... >>>>> patching "MySQL-python" into distribution ... >>>>> patching "rocks-kickstart" into distribution ... >>>>> patching "rocks-kickstart-profiles" into distribution ... >>>>> patching "rocks-kickstart-dtds" into distribution ... >>>>> building CRAM filesystem ... >>>>> Cleaning distribution >>>>> Resolving versions (RPMs) >>>>> Resolving versions (SRPMs) >>>>> Creating symlinks to kickstart files >>>>> Generating hdlist (rpm database) >>>>> Segregating RPMs (rocks, non-rocks) >>>>> sh: ./kickstart.cgi: No such file or directory >>>>> sh: ./kickstart.cgi: No such file or directory >>>>> Traceback (innermost last): >>>>> File "/opt/rocks/bin/rocks-dist", line 807, in ? >>>>> app.run() >>>>> File "/opt/rocks/bin/rocks-dist", line 623, in run >>>>> eval('self.command_%s()' % (command)) >>>>> File "<string>", line 0, in ? >>>>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >>>>> builder.build() >>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >>>>> (rocks, nonrocks) =3D self.segregateRPMS() >>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in=20 >>>>> segregateRPMS >>>>> for pkg in ks.getSection('packages'): >>>>> TypeError: loop over non-sequence >>>> >>>> >>>> Any ideas? 
>>>> >>>> --=20 >>>> Vicky Rowley email: vrowley at ucsd.edu >>>> Biomedical Informatics Research Network work: (858) 536-5980 >>>> University of California, San Diego fax: (858) 822-0828 >>>> 9500 Gilman Drive >>>> La Jolla, CA 92093-0715 >>>> >>>> >>>> See pictures from our trip to China at >> http://www.sagacitech.com/Chinaweb >>> =20 >>> =20 >>> =20 >> >> --=20 >> Vicky Rowley email: vrowley at ucsd.edu
    >> Biomedical InformaticsResearch Network work: (858) 536-5980 >> University of California, San Diego fax: (858) 822-0828 >> 9500 Gilman Drive >> La Jolla, CA 92093-0715 >> >> >> See pictures from our trip to China at >> http://www.sagacitech.com/Chinaweb >> >> >> -- __--__-- >> Message: 11 >> Date: Wed, 10 Dec 2003 17:23:25 -0800 (PST) >> From: Tim Carlson <tim.carlson at pnl.gov> >> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when >> trying to >> build CD distro >> To: "V. Rowley" <vrowley at ucsd.edu> >> Cc: "Mason J. Katz" <mjk at sdsc.edu>, npaci-rocks-discussion at sdsc.edu >> Reply-to: Tim Carlson <tim.carlson at pnl.gov> >> >> On Wed, 10 Dec 2003, V. Rowley wrote: >> >> Did you remove python by chance? kickstart.cgi calls python directly >> in >> /usr/bin/python while rocks-dist does an "env python" >> >> Tim >> >>> Yep, I did that, but only *AFTER* getting the error. [Thought it was >>> generated by the rocks-dist sequence, but apparently not.] Go ahead. >>> Move it back. Same difference. >>> >>> Vicky >>> >>> Mason J. Katz wrote: >>>> It looks like someone moved the profiles directory to profiles.orig. >>>> >>>> -mjk >>>> >>>> >>>> [root at rocks14 install]# ls -l >>>> total 56 >>>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom >>>> drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 contrib.orig >>>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 >>>> ftp.rocksclusters.org >>>> drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 >>>> ftp.rocksclusters.org.orig >>>> -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 >> kickstart.cgi >>>> drwxr-xr-x 3 root root 4096 Dec 10 20:38 >> profiles.orig >>>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist >>>> drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 >> rocks-dist.orig >>>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src >>>> drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo >>>> On Dec 10, 2003, at 2:43 PM, V. Rowley wrote:
    >>>> >>>>> When Irun this: >>>>> >>>>> [root at rocks14 install]# rocks-dist mirror ; rocks-dist dist ; >>>>> rocks-dist --dist=3Dcdrom cdrom >>>>> >>>>> on a server installed with ROCKS 3.0.0, I eventually get this: >>>>> >>>>>> Cleaning distribution >>>>>> Resolving versions (RPMs) >>>>>> Resolving versions (SRPMs) >>>>>> Adding support for rebuild distribution from source >>>>>> Creating files (symbolic links - fast) >>>>>> Creating symlinks to kickstart files >>>>>> Fixing Comps Database >>>>>> Generating hdlist (rpm database) >>>>>> Patching second stage loader (eKV, partioning, ...) >>>>>> patching "rocks-ekv" into distribution ... >>>>>> patching "rocks-piece-pipe" into distribution ... >>>>>> patching "PyXML" into distribution ... >>>>>> patching "expat" into distribution ... >>>>>> patching "rocks-pylib" into distribution ... >>>>>> patching "MySQL-python" into distribution ... >>>>>> patching "rocks-kickstart" into distribution ... >>>>>> patching "rocks-kickstart-profiles" into distribution ... >>>>>> patching "rocks-kickstart-dtds" into distribution ... >>>>>> building CRAM filesystem ... >>>>>> Cleaning distribution >>>>>> Resolving versions (RPMs) >>>>>> Resolving versions (SRPMs) >>>>>> Creating symlinks to kickstart files >>>>>> Generating hdlist (rpm database) >>>>>> Segregating RPMs (rocks, non-rocks) >>>>>> sh: ./kickstart.cgi: No such file or directory >>>>>> sh: ./kickstart.cgi: No such file or directory >>>>>> Traceback (innermost last): >>>>>> File "/opt/rocks/bin/rocks-dist", line 807, in ? >>>>>> app.run() >>>>>> File "/opt/rocks/bin/rocks-dist", line 623, in run >>>>>> eval('self.command_%s()' % (command)) >>>>>> File "<string>", line 0, in ? 
>>>>>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >>>>>> builder.build() >>>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >>>>>> (rocks, nonrocks) =3D self.segregateRPMS() >>>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in >>>>>> segregateRPMS >>>>>> for pkg in ks.getSection('packages'): >>>>>> TypeError: loop over non-sequence >>>>> >>>>> >>>>> Any ideas? >>>>> >>>>> -- >>>>> Vicky Rowley email: vrowley at ucsd.edu >>>>> Biomedical Informatics Research Network work: (858) 536-5980 >>>>> University of California, San Diego fax: (858) 822-0828 >>>>> 9500 Gilman Drive >>>>> La Jolla, CA 92093-0715
    >>>>> >>>>> >>>>> See picturesfrom our trip to China at >> http://www.sagacitech.com/Chinaweb >>>> >>>> >>>> >>> >>> -- >>> Vicky Rowley email: vrowley at ucsd.edu >>> Biomedical Informatics Research Network work: (858) 536-5980 >>> University of California, San Diego fax: (858) 822-0828 >>> 9500 Gilman Drive >>> La Jolla, CA 92093-0715 >>> >>> >>> See pictures from our trip to China at >> http://www.sagacitech.com/Chinaweb >>> >>> >> >> >> >> >> -- __--__-- >> _______________________________________________ >> npaci-rocks-discussion mailing list >> npaci-rocks-discussion at sdsc.edu >> http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion >> >> >> End of npaci-rocks-discussion Digest >> >> >> DISCLAIMER: >> This email is confidential and may be privileged. If you are not the = >> intended recipient, please delete it and notify us immediately. >> Please = >> do not copy or use it for any purpose, or disclose its contents to >> any = >> other person as it may be an offence under the Official Secrets Act. = >> Thank you. >> >> --__--__-- >> >> Message: 2 >> Date: Wed, 10 Dec 2003 18:03:41 -0800 >> From: Terrence Martin <tmartin at physics.ucsd.edu> >> To: npaci-rocks-discussion at sdsc.edu >> Subject: [Rocks-Discuss]Rocks 3.0.0 >> >> I am having a problem on install of rocks 3.0.0 on my new cluster. >> >> The python error occurs right after anaconda starts and just before >> the >> install asks for the roll CDROM. >> >> The error refers to an inability to find or load rocks.file. The error >> is associated I think with the window that pops up and asks you in put
    >> the rollCDROM in. >> >> The process I followed to get to this point is >> >> Put the Rocks 3.0.0 CDROM into the CDROM drive >> Boot the system >> At the prompt type frontend >> Wait till anaconda starts >> Error referring to unable to load rocks.file. >> >> I have successfully installed rocks on a smaller cluster but that has >> different hardware. I used the same CDROM for both installs. >> >> Any thoughts? >> >> Terrence >> >> >> >> --__--__-- >> >> Message: 3 >> Date: Wed, 10 Dec 2003 19:52:49 -0800 >> From: "V. Rowley" <vrowley at ucsd.edu> >> To: npaci-rocks-discussion at sdsc.edu >> Subject: Re: [Rocks-Discuss]"TypeError: loop over non-sequence" when >> trying >> to build CD distro >> >> Looks like python is okay: >> >>> [root at rocks14 birn-oracle1]# which python >>> /usr/bin/python >>> [root at rocks14 birn-oracle1]# python --help >>> Unknown option: -- >>> usage: python [option] ... [-c cmd | file | -] [arg] ... >>> Options and arguments (and corresponding environment variables): >>> -d : debug output from parser (also PYTHONDEBUG=x) >>> -i : inspect interactively after running script, (also >> PYTHONINSPECT=x) >>> and force prompts, even if stdin does not appear to be a >> terminal >>> -O : optimize generated bytecode (a tad; also PYTHONOPTIMIZE=x) >>> -OO : remove doc-strings in addition to the -O optimizations >>> -S : don't imply 'import site' on initialization >>> -t : issue warnings about inconsistent tab usage (-tt: issue >> errors) >>> -u : unbuffered binary stdout and stderr (also >>> PYTHONUNBUFFERED=x) >>> -v : verbose (trace import statements) (also PYTHONVERBOSE=x) >>> -x : skip first line of source, allowing use of non-Unix forms of >> #!cmd >>> -X : disable class based built-in exceptions >>> -c cmd : program passed in as string (terminates option list) >>> file : program read from script file >>> - : program read from stdin (default; interactive mode if a tty) >>> arg ...: arguments passed to program in sys.argv[1:] >>> Other 
environment variables: >>> PYTHONSTARTUP: file executed on interactive startup (no default)
    >>> PYTHONPATH : ':'-separated list of directories prefixed to the >>> default module search path. The result is sys.path. >>> PYTHONHOME : alternate <prefix> directory (or >> <prefix>:<exec_prefix>). >>> The default module search path uses >>> <prefix>/python1.5. >>> [root at rocks14 birn-oracle1]# >> >> >> >> Tim Carlson wrote: >>> On Wed, 10 Dec 2003, V. Rowley wrote: >>> >>> Did you remove python by chance? kickstart.cgi calls python directly >> in >>> /usr/bin/python while rocks-dist does an "env python" >>> >>> Tim >>> >>> >>>> Yep, I did that, but only *AFTER* getting the error. [Thought it >>>> was >>>> generated by the rocks-dist sequence, but apparently not.] Go >>>> ahead. >>>> Move it back. Same difference. >>>> >>>> Vicky >>>> >>>> Mason J. Katz wrote: >>>> >>>>> It looks like someone moved the profiles directory to >>>>> profiles.orig. >>>>> >>>>> -mjk >>>>> >>>>> >>>>> [root at rocks14 install]# ls -l >>>>> total 56 >>>>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:16 cdrom >>>>> drwxrwsr-x 5 root wheel 4096 Dec 10 20:38 >>>>> contrib.orig >>>>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:07 >>>>> ftp.rocksclusters.org >>>>> drwxr-sr-x 3 root wheel 4096 Dec 10 20:38 >>>>> ftp.rocksclusters.org.orig >>>>> -r-xrwsr-x 1 root wheel 19254 Sep 3 12:40 >>>>> kickstart.cgi >>>>> drwxr-xr-x 3 root root 4096 Dec 10 20:38 >>>>> profiles.orig >>>>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:15 rocks-dist >>>>> drwxrwsr-x 3 root wheel 4096 Dec 10 20:38 >> rocks-dist.orig >>>>> drwxr-sr-x 3 root wheel 4096 Dec 10 21:02 src >>>>> drwxr-sr-x 4 root wheel 4096 Dec 10 20:49 src.foo >>>>> On Dec 10, 2003, at 2:43 PM, V. Rowley wrote: >>>>> >>>>> >>>>>> When I run this: >>>>>>
    >>>>>> [root atrocks14 install]# rocks-dist mirror ; rocks-dist dist ; >>>>>> rocks-dist --dist=cdrom cdrom >>>>>> >>>>>> on a server installed with ROCKS 3.0.0, I eventually get this: >>>>>> >>>>>> >>>>>>> Cleaning distribution >>>>>>> Resolving versions (RPMs) >>>>>>> Resolving versions (SRPMs) >>>>>>> Adding support for rebuild distribution from source >>>>>>> Creating files (symbolic links - fast) >>>>>>> Creating symlinks to kickstart files >>>>>>> Fixing Comps Database >>>>>>> Generating hdlist (rpm database) >>>>>>> Patching second stage loader (eKV, partioning, ...) >>>>>>> patching "rocks-ekv" into distribution ... >>>>>>> patching "rocks-piece-pipe" into distribution ... >>>>>>> patching "PyXML" into distribution ... >>>>>>> patching "expat" into distribution ... >>>>>>> patching "rocks-pylib" into distribution ... >>>>>>> patching "MySQL-python" into distribution ... >>>>>>> patching "rocks-kickstart" into distribution ... >>>>>>> patching "rocks-kickstart-profiles" into distribution ... >>>>>>> patching "rocks-kickstart-dtds" into distribution ... >>>>>>> building CRAM filesystem ... >>>>>>> Cleaning distribution >>>>>>> Resolving versions (RPMs) >>>>>>> Resolving versions (SRPMs) >>>>>>> Creating symlinks to kickstart files >>>>>>> Generating hdlist (rpm database) >>>>>>> Segregating RPMs (rocks, non-rocks) >>>>>>> sh: ./kickstart.cgi: No such file or directory >>>>>>> sh: ./kickstart.cgi: No such file or directory >>>>>>> Traceback (innermost last): >>>>>>> File "/opt/rocks/bin/rocks-dist", line 807, in ? >>>>>>> app.run() >>>>>>> File "/opt/rocks/bin/rocks-dist", line 623, in run >>>>>>> eval('self.command_%s()' % (command)) >>>>>>> File "<string>", line 0, in ? 
>>>>>>> File "/opt/rocks/bin/rocks-dist", line 736, in command_cdrom >>>>>>> builder.build() >>>>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1223, in build >>>>>>> (rocks, nonrocks) = self.segregateRPMS() >>>>>>> File "/opt/rocks/lib/python/rocks/build.py", line 1107, in >>>>>>> segregateRPMS >>>>>>> for pkg in ks.getSection('packages'): >>>>>>> TypeError: loop over non-sequence >>>>>> >>>>>> >>>>>> Any ideas? >>>>>> >>>>>> -- >>>>>> Vicky Rowley email: vrowley at ucsd.edu >>>>>> Biomedical Informatics Research Network work: (858) 536-5980 >>>>>> University of California, San Diego fax: (858) 822-0828 >>>>>> 9500 Gilman Drive >>>>>> La Jolla, CA 92093-0715 >>>>>> >>>>>>
    >>>>>> See picturesfrom our trip to China at >> http://www.sagacitech.com/Chinaweb >>>>> >>>>> >>>>> >>>> -- >>>> Vicky Rowley email: vrowley at ucsd.edu >>>> Biomedical Informatics Research Network work: (858) 536-5980 >>>> University of California, San Diego fax: (858) 822-0828 >>>> 9500 Gilman Drive >>>> La Jolla, CA 92093-0715 >>>> >>>> >>>> See pictures from our trip to China at >> http://www.sagacitech.com/Chinaweb >>>> >>>> >>> >>> >>> >>> >> >> -- >> Vicky Rowley email: vrowley at ucsd.edu >> Biomedical Informatics Research Network work: (858) 536-5980 >> University of California, San Diego fax: (858) 822-0828 >> 9500 Gilman Drive >> La Jolla, CA 92093-0715 >> >> >> See pictures from our trip to China at >> http://www.sagacitech.com/Chinaweb >> >> >> >> --__--__-- >> >> _______________________________________________ >> npaci-rocks-discussion mailing list >> npaci-rocks-discussion at sdsc.edu >> http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion >> >> >> End of npaci-rocks-discussion Digest >> >> >> DISCLAIMER: >> This email is confidential and may be privileged. If you are not the >> intended recipient, please delete it and notify us immediately. >> Please do not copy or use it for any purpose, or disclose its >> contents to any other person as it may be an offence under the >> Official Secrets Act. Thank you. >> >> > Glen Otero, Ph.D. > Linux Prophet > 619.917.1772 > >
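The "TypeError: loop over non-sequence" in the traceback above is Python 1.5's message for iterating over None. A minimal sketch of the failure mode (the `Profile` class here is hypothetical, not the actual rocks/build.py code): when `./kickstart.cgi` fails with "No such file or directory", the profile parse yields nothing, `getSection()` returns None, and the for-loop raises.

```python
# Sketch of the failure mode seen in the traceback, not the Rocks source.
# A section lookup that returns None (rather than an empty list) makes
# any caller that loops over the result crash; Python 1.5 phrases this
# as "loop over non-sequence", newer Pythons as
# "'NoneType' object is not iterable".

class Profile:
    def __init__(self, sections):
        # e.g. {'packages': ['rocks-boot', 'rocks-pylib']}
        self.sections = sections

    def getSection(self, name):
        # Returns None, not [], when the section is absent.
        return self.sections.get(name)

ks = Profile({})  # kickstart.cgi produced no output

try:
    for pkg in ks.getSection('packages'):
        print(pkg)
except TypeError:
    print('loop over non-sequence')
```

A defensive `getSection()` returning an empty list would turn the crash into a no-op loop, but the actual fix discussed in the thread is restoring kickstart.cgi and the profiles directory so the section exists at all.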
    Glen Otero, Ph.D. LinuxProphet 619.917.1772 -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/enriched Size: 35605 bytes Desc: not available Url : https://lists.sdsc.edu/pipermail/npaci-rocks- discussion/attachments/20031211/1a0b38fb/attachment-0001.bin From tmartin at physics.ucsd.edu Fri Dec 12 10:26:58 2003 From: tmartin at physics.ucsd.edu (Terrence Martin) Date: Fri, 12 Dec 2003 10:26:58 -0800 Subject: [Rocks-Discuss]ftp.rocksclusters.org mirror? Message-ID: <3FDA0872.8010405@physics.ucsd.edu> I was wondering, does the command rocks-dist do anything else besides call wget on the correct tree at ftp.rocksclusters.org? I ask because some firewall restrictions on a system I am hesitant to fiddle are preventing me from running rocks-dist mirror from my head node. I would like to download the mirror of the rocks distro on another system, transfer the tree and then run rocks-dist dist to rebuild the rocks for my compute nodes. Is this reasonable? Also am I going to run into any problems with rocks 3.0.0 having installed the head node on a UP system but my compute nodes are SMP? I am making an assumption that once I get all of the packages into rocks (currently there is no smp kernels on the head node) the compute nodes will install the right kernel? BTW thanks for the help so far, the trick it seems to getting Rocks 3.0.0 on these supermicro systems is to install rocks on the hard drive in a separate computer and then install the hard disk. Thanks, Terrence From mjk at sdsc.edu Fri Dec 12 10:48:17 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Fri, 12 Dec 2003 10:48:17 -0800 Subject: [Rocks-Discuss]ftp.rocksclusters.org mirror? In-Reply-To: <3FDA0872.8010405@physics.ucsd.edu> References: <3FDA0872.8010405@physics.ucsd.edu> Message-ID: <BF99287A-2CD3-11D8-A2DC-000A95DA5638@sdsc.edu> - Yes, "rocks-dist mirror" does a python system() call to run the wget application. 
It does this several times for the various directories it needs. - No, the compute nodes do not need to be the same SMPness of the
    frontend. All installationsare done with Red Hat Kickstart (plus our pixie dust) so hardware is auto detected for you. This is not disk imaging :) -mjk On Dec 12, 2003, at 10:26 AM, Terrence Martin wrote: > I was wondering, does the command rocks-dist do anything else besides > call wget on the correct tree at ftp.rocksclusters.org? > > I ask because some firewall restrictions on a system I am hesitant to > fiddle are preventing me from running rocks-dist mirror from my head > node. I would like to download the mirror of the rocks distro on > another system, transfer the tree and then run rocks-dist dist to > rebuild the rocks for my compute nodes. Is this reasonable? > > Also am I going to run into any problems with rocks 3.0.0 having > installed the head node on a UP system but my compute nodes are SMP? I > am making an assumption that once I get all of the packages into rocks > (currently there is no smp kernels on the head node) the compute nodes > will install the right kernel? > > BTW thanks for the help so far, the trick it seems to getting Rocks > 3.0.0 on these supermicro systems is to install rocks on the hard > drive in a separate computer and then install the hard disk. > > Thanks, > > Terrence > > From mjk at sdsc.edu Fri Dec 12 10:54:03 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Fri, 12 Dec 2003 10:54:03 -0800 Subject: [Rocks-Discuss]ia64 compute nodes with ia32 frontends? In-Reply-To: <ddwu930yzp.fsf_-_@oecpc11.ucsd.edu> References: <793188FE-D411-11D7-8529-000393C7898E@sdsc.edu> <ddptix48s6.fsf@oecpc11.ucsd.edu> <ddwu930yzp.fsf_-_@oecpc11.ucsd.edu> Message-ID: <8E405599-2CD4-11D8-A2DC-000A95DA5638@sdsc.edu> We haven't done this for a while, and since our 3.0 release using different version of Red Hat for x86 and IA64 cross-building distribution may not work. 3.1.0 (since you are on campus you'll get a CD set from us next week) uses the same base RH for all architecture so this should be possible again. 
The mirror should have worked: # rocks-dist --arch=ia64 mirror Should be the ia64 tree from ftp.rocksclusters.org, you can also use your IA64 DVD, mount it on /mnt/cdrom and do a "rocks-dist copycd" to create the IA64 mirror. If this works you will need to use the --genhdlist flag w/ rocks-dist.
    For example: # cd /home/install # rocks-dist dist --- build the x86 distribution # rocks-dist --arch=ia64 --genhdlist=rocks-dist/.../i386/.../genhdlist You'll need to use find to determine the path of the genhdlist executable in you x86 distribution. This may still fail (since RH version differ), but it does work when the version are the same for both archs. -mjk On Dec 11, 2003, at 2:29 PM, Edward O'Connor wrote: > Hi everybody, > > I'm trying to bring up some ia64 compute nodes in a cluster with an > ia32 > frontend. Normally, `cd /home/install; rocks-dist mirror dist` only > sets > up the frontend to handle ia32 compute nodes. I tried to manhandle > `rocks-dist mirror` into mirroring the ia64 stuff from > ftp.rocksclusters.org by giving it the --arch=ia64 option, but that > didn't work, so I went ahead and did the mirroring step by hand. > > After having done so, `rocks-dist dist` still doesn't do the right > thing. So, adding --arch=ia64 to that command yields this error output: > > ,---- > | # rocks-dist --arch=ia64 dist > | Cleaning distribution > | Resolving versions (RPMs) > | Resolving versions (SRPMs) > | Adding support for rebuild distribution from source > | Creating files (symbolic links - fast) > | Creating symlinks to kickstart files > | Fixing Comps Database > | error - comps file is missing, skipping this step > | Generating hdlist (rpm database) > | error - could not find rpm anaconda-runtime > | error - could not find genhdlist > | Patching second stage loader (eKV, partioning, ...) > | error - could not find second stage, skipping this step > `---- > > So my question is, what do I need to do to the ia32 frontend to enable > it to kickstart an ia64 compute node? Thanks. > > > Ted > > -- > Edward O'Connor > oconnor at ucsd.edu From mjk at sdsc.edu Fri Dec 12 11:12:59 2003
    From: mjk atsdsc.edu (Mason J. Katz) Date: Fri, 12 Dec 2003 11:12:59 -0800 Subject: [Rocks-Discuss]I can't use xpbs in rocks In-Reply-To: <BAY3-F24QLayI4TY7zD00009bf1@hotmail.com> References: <BAY3-F24QLayI4TY7zD00009bf1@hotmail.com> Message-ID: <32F6A3BA-2CD7-11D8-A2DC-000A95DA5638@sdsc.edu> Unfortunately we don't have a fix here. We've moved to SGE (your can now use QMon). We do have a PBS roll but we plan to release 3.1 before the PBS roll is complete. -mjk On Dec 10, 2003, at 8:44 PM, zhong wenyu wrote: > Hi,everyone! > I have installed rocks 2.3.2 and 3.0.0,xpbs can not be use in both of > them. > typed:xpbs[enter] > showed:xpbs: initialization failed! output: invalid command name > "Pref_Init" > thanks! > > _________________________________________________________________ > ?????????????? MSN Messenger: http://messenger.msn.com/cn From fparnold at chem.northwestern.edu Fri Dec 12 06:52:45 2003 From: fparnold at chem.northwestern.edu (Fred P. Arnold) Date: Fri, 12 Dec 2003 08:52:45 -0600 (CST) Subject: [Rocks-Discuss]Gig E on HP ZX6000 Message-ID: <Pine.GSO.4.33.0312120850030.4235-100000@mercury.chem.northwestern.edu> Hello, I know this is a hardware question, not technically a Rocks one, but I can't find the answer in my HP manuals: On the ZX6000, there are two ethernet ports, a 10/100 basic/management port, and a 1000 which is designated the primary interface. Unfortunately, rocks always identifies the 10/100 as eth0. Does anyone know how to disable the 10/100 on a ZX6000? On an IA32, I'd go into the bios, but these don't technically have one. We'd like to run ours on a pure Gig network. Thanks. -Fred Frederick P. Arnold, Jr. NUIT, Northwestern U. f-arnold at northwestern.edu From mjk at sdsc.edu Fri Dec 12 11:16:42 2003 From: mjk at sdsc.edu (Mason J. Katz)
    Date: Fri, 12Dec 2003 11:16:42 -0800 Subject: [Rocks-Discuss]ScalablePBS. In-Reply-To: <200311212352.27000.Roy.Dragseth@cc.uit.no> References: <200311212352.27000.Roy.Dragseth@cc.uit.no> Message-ID: <B83C8894-2CD7-11D8-A2DC-000A95DA5638@sdsc.edu> hi Roy, This should become the basis of the PBS roll (currently openpbs). We are seeking developers who would like to help write and maintain this -- I'm not singling you out Roy, although you would be more than welcome, rather I'm taking advantage of your message to solicit other volunteers. Anyone? -mjk On Nov 21, 2003, at 2:52 PM, Roy Dragseth wrote: > Hi folks. > > I've been testing ScalablePBS (SPBS) from supercluster.org for a few > weeks now > and it seems like a fairly good replacement for OpenPBS. Only a few > minor > changes to the OpenPBS infrastructure were needed to accomplish the > neccessary changes in the kickstart generation to make the nodes > switch to > SPBS. > > SPBS is based on OpenPBS 2.3.12, but incorporates most provided patches > (sandia etc) and is actively developed by the same maintainers that > develop > maui. It scales better than OpenPBS, to around 2K nodes, has better > fault > tolerance and communicates better with maui. It has, as far as I can > see, no > user visible changes from OpenPBS. > > I know, a lot of people are moving away from pbs and into sge, I was > thinking > about making the switch too. The emergence SPBS seems to make the > switch > unneccessary and I don't have to teach myself (and the users) a new > queueing > interface... > > Configuration tested: > Rocks 3.0.0 > SPBS 1.0.1p0 (should leave beta phase next month) > Maui 3.2.6p6 (available for "Early Access Production") > > SPBS and Maui can be downloaded from http://www.supercluster.org/ > > Have a nice weekend, > r. > > -- >
> The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway. > phone:+47 77 64 41 07, fax:+47 77 64 41 00 > Roy Dragseth, High Performance Computing System Administrator > Direct call: +47 77 64 62 56. email: royd at cc.uit.no From jlkaiser at fnal.gov Fri Dec 12 11:25:58 2003 From: jlkaiser at fnal.gov (Joseph L. Kaiser) Date: Fri, 12 Dec 2003 13:25:58 -0600 Subject: [Rocks-Discuss](no subject) Message-ID: <1071257158.3719.9.camel@ajax.kaisergroup.net> My install of 3.0.0 is crapping out here: "/usr/src/build/90289-i386/install// a x x usr/lib/anaconda/comps.py", line a x x 153, in __getitem__ a x x KeyError: PyXML # x x Even though PyXML is in the distribution I have built. Is there anything that can cause this other than the missing RPM? Thanks, Joe From oconnor at soe.ucsd.edu Fri Dec 12 11:36:04 2003 From: oconnor at soe.ucsd.edu (Edward O'Connor) Date: Fri, 12 Dec 2003 11:36:04 -0800 Subject: [Rocks-Discuss]ia64 compute nodes with ia32 frontends? In-Reply-To: <8E405599-2CD4-11D8-A2DC-000A95DA5638@sdsc.edu> (Mason J. Katz's message of "Fri, 12 Dec 2003 10:54:03 -0800") References: <793188FE-D411-11D7-8529-000393C7898E@sdsc.edu> <ddptix48s6.fsf@oecpc11.ucsd.edu> <ddwu930yzp.fsf_-_@oecpc11.ucsd.edu> <8E405599-2CD4-11D8-A2DC-000A95DA5638@sdsc.edu> Message-ID: <ddiskl4ymz.fsf@oecpc11.ucsd.edu> > We haven't done this for a while, and since our 3.0 release using > different version of Red Hat for x86 and IA64 cross-building > distribution may not work. Ahh. After further travails (read below), I'm pretty willing to suspect that this indeed does not work in Rocks 3.0.0. I'm looking forward to those 3.1.0 CDs and DVDs next week! :) > you can also use your IA64 DVD mount it on /mnt/cdrom and do a > "rocks-dist copycd" to create the IA64 mirror. Unfortunately, the ia32 frontend machine doesn't have a DVD drive in it. So I mounted the ia64 ISO image on /mnt/cdrom via a loopback device and that worked fine.
    However, `rocks-dist copycd`seemed to have nuked the ia32 stuff under /home/install/ftp.rocksclusters.org/, or, if it didn't entirely nuke it, it made the bare `rocks-dist dist` of your next instructions fail: > If this works you will the to use the --genhdlist flag w/ rocks-dist. > For example: > > # cd /home/install > # rocks-dist dist --- build the x86 distribution As this failed, I went ahead and also ran a `rocks-dist mirror`, which proceeded to download a whole lot of stuff from you guys. After it finished, `rocks-dist dist` completed without error. I double-checked and the ia64 mirror from the `rocks-dist copycd` command still appears to be there. > # rocks-dist --arch=ia64 --genhdlist=rocks-dist/.../i386/.../genhdlist Should there be a `dist` at the end of that? The above command (with the substitution of the appropriate genhdlist path) appears to be a no-op. So I appended a `dist` as the idea is for it to create the appropriate symlinks for ia64 as well, and it bombs out too, in the same way as before: ,---- | # rocks-dist --arch=ia64 --genhdlist=rocks-dist/7.3/en/os/i386/usr/lib/anaconda- runtime/genhdlist dist | Cleaning distribution | Resolving versions (RPMs) | Resolving versions (SRPMs) | Adding support for rebuild distribution from source | Creating files (symbolic links - fast) | Creating symlinks to kickstart files | Fixing Comps Database | error - comps file is missing, skipping this step | Generating hdlist (rpm database) | error creating file /home/install/rocks- dist/desktop/7.3/en/os/ia64/RedHat/base/hdlist: No such file or directory | Patching second stage loader (eKV, partioning, ...) | error - could not find second stage, skipping this step `---- > You'll need to use find to determine the path of the genhdlist > executable in you x86 distribution. This may still fail (since RH > version differ), but it does work when the version are the same for > both archs. 
I suppose at this point that it's still failing due to the RH version mismatch, and that getting this to work in 3.0.0 is a lost cause. Ted -- Edward O'Connor oconnor at ucsd.edu
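Ted's observation that `rocks-dist` with only flags "appears to be a no-op" matches the dispatch style visible in the earlier rocks-dist traceback (`eval('self.command_%s()' % (command))` inside `run()`): options are parsed, but nothing executes unless a command word such as `dist` is left on the command line. A toy sketch of that pattern (the `App` class is hypothetical, not the rocks-dist source):

```python
# Toy command dispatcher in the style shown by the rocks-dist
# traceback earlier in the thread: each non-option argument is
# dispatched via eval('self.command_%s()').  With only --flags
# present, the command list is empty and run() does nothing.

class App:
    def __init__(self, args):
        self.flags = [a for a in args if a.startswith('--')]
        self.commands = [a for a in args if not a.startswith('--')]

    def command_dist(self):
        return 'building distribution'

    def run(self):
        results = []
        for command in self.commands:
            # getattr(self, 'command_' + command)() would be the safer
            # idiom; eval is used here only to mirror the traceback.
            results.append(eval('self.command_%s()' % command))
        return results

print(App(['--arch=ia64']).run())          # [] -- flags alone are a no-op
print(App(['--arch=ia64', 'dist']).run())  # ['building distribution']
```

This is why appending `dist`, as Ted did, is what makes the invocation actually do work; the remaining errors in his output come from the missing ia64 comps/anaconda pieces, not from argument handling.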
    From jared_hodge atiat.utexas.edu Fri Dec 12 12:07:32 2003 From: jared_hodge at iat.utexas.edu (Jared Hodge) Date: Fri, 12 Dec 2003 14:07:32 -0600 Subject: [Rocks-Discuss]I can't use xpbs in rocks References: <BAY3-F24QLayI4TY7zD00009bf1@hotmail.com> <32F6A3BA-2CD7-11D8- A2DC-000A95DA5638@sdsc.edu> Message-ID: <3FDA2004.3020203@iat.utexas.edu> OK, I've got a fix for this one. The problem is that xpbs thinks that it's in the directory /var/tmp/OpenPBS-buildroot/opt/OpenPBS/ Anyway, the path is mangled to get to some of the subroutines. The rocks guys can figure out a way to prevent this in future releases, but here's how you can get it working (and pbsmon while were at it): First fix the scripts: /opt/OpenPBS/bin/xpbs Need's the following changes: #set libdir /var/tmp/OpenPBS-buildroot/opt/OpenPBS/lib/xpbs #set appdefdir /var/tmp/OpenPBS-buildroot/opt/OpenPBS/lib/xpbs set libdir /opt/OpenPBS/lib/xpbs set appdefdir /opt/OpenPBS/lib/xpbs /opt/OpenPBS/bin/xpbsmon Needs the same thing plus the first line needs changed now do the following: cd /opt/OpenPBS/lib/xpbs rm tclIndex ./buildindex `pwd` cd /opt/OpenPBS/lib/xpbsmon rm tclIndex ./buildindex `pwd` That should fix it all up. I tested this on a 2.3.2 cluster, I assume it's the same on 3.0. -- Jared Hodge The Institute for Advanced Technology The University of Texas at Austin 3925 W. Braker Lane, Suite 400 Austin, Texas 78759 Phone: 512-232-4460 Fax: 512-471-9096 Email: jared_hodge at iat.utexas.edu Mason J. Katz wrote: > Unfortunately we don't have a fix here. We've moved to SGE (your can > now use QMon). We do have a PBS roll but we plan to release 3.1 > before the PBS roll is complete. > > -mjk
    > > On Dec10, 2003, at 8:44 PM, zhong wenyu wrote: > >> Hi,everyone! >> I have installed rocks 2.3.2 and 3.0.0,xpbs can not be use in both of >> them. >> typed:xpbs[enter] >> showed:xpbs: initialization failed! output: invalid command name >> "Pref_Init" >> thanks! >> >> _________________________________________________________________ >> ?????????????? MSN Messenger: http://messenger.msn.com/cn > > From jlkaiser at fnal.gov Fri Dec 12 14:39:42 2003 From: jlkaiser at fnal.gov (Joe Kaiser) Date: Fri, 12 Dec 2003 16:39:42 -0600 Subject: [Rocks-Discuss](no subject) In-Reply-To: <1071257158.3719.9.camel@ajax.kaisergroup.net> References: <1071257158.3719.9.camel@ajax.kaisergroup.net> Message-ID: <1071268782.22030.0.camel@nietzsche.fnal.gov> Sorry, creating extra links where they don't belong. Nevermind. On Fri, 2003-12-12 at 13:25, Joseph L. Kaiser wrote: > My install of 3.0.0 is crapping out here: > > "/usr/src/build/90289-i386/install// a x > x usr/lib/anaconda/comps.py", line a > x > x 153, in __getitem__ a > x > x KeyError: PyXML # > x > x > > > Even though PyXML is in the distribution I have built. Is there > anything that can cause this other than the missing RPM? > > Thanks, > > Joe -- =================================================================== Joe Kaiser - Systems Administrator Fermi Lab CD/OSS-SCS Never laugh at live dragons. 630-840-6444 jlkaiser at fnal.gov ===================================================================
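The xpbs repair Jared describes a couple of messages back (replacing the stale `/var/tmp/OpenPBS-buildroot` prefix in `/opt/OpenPBS/bin/xpbs` and `xpbsmon`) can be scripted rather than edited by hand. A hedged sketch of just the substitution step; the two paths are taken from his message, and the demo below runs on a scratch file rather than a real OpenPBS install:

```python
import os
import tempfile

# Stale RPM buildroot prefix and the real install prefix, as given
# in Jared's message; adjust for your own install.
BAD = '/var/tmp/OpenPBS-buildroot/opt/OpenPBS'
GOOD = '/opt/OpenPBS'

def fix_script(path):
    """Rewrite BAD -> GOOD in place; return True if anything changed."""
    f = open(path)
    text = f.read()
    f.close()
    fixed = text.replace(BAD, GOOD)
    if fixed != text:
        f = open(path, 'w')
        f.write(fixed)
        f.close()
        return True
    return False

# Demo on a scratch file standing in for /opt/OpenPBS/bin/xpbs.
fd, demo = tempfile.mkstemp()
os.write(fd, ('set libdir %s/lib/xpbs\n' % BAD).encode())
os.close(fd)
print(fix_script(demo))  # True
os.remove(demo)
```

After patching the scripts, the tclIndex files still have to be rebuilt (`rm tclIndex; ./buildindex \`pwd\`` in /opt/OpenPBS/lib/xpbs and /opt/OpenPBS/lib/xpbsmon), exactly as in the original message.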
    From jholland atcs.uh.edu Fri Dec 12 14:52:10 2003 From: jholland at cs.uh.edu (Jason Holland) Date: Fri, 12 Dec 2003 16:52:10 -0600 (CST) Subject: [Rocks-Discuss]Gig E on HP ZX6000 In-Reply-To: <Pine.GSO.4.33.0312120850030.4235-100000@mercury.chem.northwestern.edu> References: <Pine.GSO.4.33.0312120850030.4235-100000@mercury.chem.northwestern.edu> Message-ID: <Pine.GSO.4.58.0312121650350.4139@leibnitz.cs.uh.edu> Fred, Try flipping the modules in /etc/modules.conf. Flip eth0 with eth1 so that the gige interface comes up as eth0. Or, just turn off eth0 altogether with 'alias eth0 off'. I think thats the right syntax. We have 60 zx6000's and I have personally have never found a way to disable the port. Jason P Holland Texas Learning and Computation Center http://www.tlc2.uh.edu University of Houston Philip G Hoffman Hall rm 207A tel: 713-743-4850 On Fri, 12 Dec 2003, Fred P. Arnold wrote: > Hello, > > I know this is a hardware question, not technically a Rocks one, but I > can't find the answer in my HP manuals: > > On the ZX6000, there are two ethernet ports, a 10/100 basic/management > port, and a 1000 which is designated the primary interface. > Unfortunately, rocks always identifies the 10/100 as eth0. > > Does anyone know how to disable the 10/100 on a ZX6000? On an IA32, I'd > go into the bios, but these don't technically have one. We'd like to run > ours on a pure Gig network. > > Thanks. > > -Fred > > Frederick P. Arnold, Jr. > NUIT, Northwestern U. > f-arnold at northwestern.edu > From jian at appro.com Fri Dec 12 17:27:51 2003 From: jian at appro.com (Jian Chang) Date: Fri, 12 Dec 2003 17:27:51 -0800 Subject: [Rocks-Discuss]RE: Rocks-Discuss] AMD Opteron - Contact Appro Message-ID: <4AE58AD63966B24B99F95CA24C02EB1903414F@hawk.appro.com> Hello Mason / Puru, I got your contact information from Bryan Littlefield.
    I would liketo discuss with you regarding benchmark test systems you might need down the road. We can also share with you our findings as to what is compatible in the Opteron systems. Please reply with your phone number where I can reach you, and I will call promptly. Bryan, Thank you for the referral. Best regards, Jian Chang Regional Sales Manager (408) 941-8100 x 202 (800) 927-5464 x 202 (408) 941-8111 Fax jian at appro.com www.appro.com -----Original Message----- From: Bryan Littlefield [mailto:bryan at UCLAlumni.net] Sent: Tuesday, December 09, 2003 12:14 PM To: npaci-rocks-discussion at sdsc.edu; mjk at sdsc.edu Cc: Jian Chang Subject: Rocks-Discuss] AMD Opteron - Contact Appro Hi Mason, I suggest contacting Appro. We are using Rocks on our Opteron cluster and Appro would likely love to help. I will contact them as well to see if they could help getting a opteron machine for testing. Contact info below: Thanks --Bryan Jian Chang - Regional Sales Manager (408) 941-8100 x 202 (800) 927-5464 x 202 (408) 941-8111 Fax jian at appro.com http://www.appro.com npaci-rocks-discussion-request at sdsc.edu wrote: From: "Mason J. Katz" <mailto:mjk at sdsc.edu> <mjk at sdsc.edu> Subject: Re: [Rocks-Discuss]AMD Opteron Date: Tue, 9 Dec 2003 07:28:51 -0800 To: "purushotham komaravolu" <mailto:purikk at hotmail.com> <purikk at hotmail.com> We have a beta right now that we have sent to a few people. We plan on a release this month, and AMD_64 will be part of this release along with the usual x86, IA64 support. If you want to help accelerate this process please talk to your vendor about loaning/giving us some hardware for testing. Having access to a variety of Opteron hardware (we own two boxes) is the only way we can
    have good supportfor this chip. -mjk On Dec 8, 2003, at 8:23 PM, purushotham komaravolu wrote: Cc: <mailto:npaci-rocks-discussion at sdsc.edu> <npaci-rocks-discussion at sdsc.edu> Hello, I am a newbie to ROCKS cluster. I wanted to setup clusters on 32-bit Architectures( Intel and AMD) and 64-bit Architecture( Intel and AMD). I found the 64-bit download for Intel on the website but not for AMD. Does it work for AMD opteron? if not what is the ETA for AMD-64. We are planning to but AMD-64 bit machines shortly, and I would like to volunteer for the beta testing if needed. Thanks Regards, Puru _______________________________________________ npaci-rocks-discussion mailing list npaci-rocks-discussion at sdsc.edu http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion End of npaci-rocks-discussion Digest -------------- next part -------------- An HTML attachment was scrubbed... URL: https://lists.sdsc.edu/pipermail/npaci-rocks- discussion/attachments/20031212/dec7e41b/attachment-0001.html From landman at scalableinformatics.com Sat Dec 13 07:50:02 2003 From: landman at scalableinformatics.com (Joe Landman) Date: Sat, 13 Dec 2003 10:50:02 -0500 Subject: [Rocks-Discuss]Trying to integrate a new kernel into 3.0.0 Message-ID: <1071330602.4444.56.camel@protein.scalableinformatics.com> Folks: Finally built the 2.4.23 kernel into an RPM via the RedHat tools. Had to hack up the spec file a bit, but you can see the results at http://scalableinformatics.com/downloads/kernels/2.4.23/ These are 2.4.23 with the 2.4.24-pre1 patch (e.g. xfs is in there, woo hoo!). I had to strip out most of the previous patches as they were incompatible with .23 (and I don't want to spend time forward porting them). The spec file, the sources, etc are released under the normal licenses (GPL). No warranties, use at your own risk, and these are NOT
    official Redhat kernels.Don't ask them for support for these, they won't do it, and they will look at you funny. That said, I had also checked out the cvs tree to start the "Carlson" process :) indicated in the list a few months ago (see https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/003533.html) to build a more customized distribution. I got to the Build the boot RPM cd rocks/src/rocks/boot make rpm point, and lo and behold this is what I see ... rm version.mk rm arch rm -f /local/rocks/src/rocks/boot/.rpmmacros rm -f /usr/src/redhat/SOURCES/rocks-boot-3.1.0.tar rm -f /usr/src/redhat/SOURCES/rocks-boot-3.1.0.tar.gz ... Ok... I wanted to rebuild 3.0.0, as I cannot wait for 3.1.0 (my customer has a strong sense of urgency and little time to wait for an operational cluster). I checked out the system from CVS earlier this week. Is there any way to switch the build back to 3.0.0? Or am I really out of luck at this moment??? Clues/hints welcome. These kernels might work, though I don't have a method to try them in the distro yet. They work on the build machine. [root at head root]# uname -a Linux head.public 2.4.23-1 #1 SMP Sat Dec 13 14:41:06 GMT 2003 i686 unknown [root at head root]# rpm -qa | grep -i kernel kernel-2.4.23-1 kernel-BOOT-2.4.23-1 rocks-kernel-3.0.0-0 pvfs-kernel-1.6.0-1 kernel-doc-2.4.23-1 kernel-source-2.4.23-1 kernel-smp-2.4.23-1 The spec file is in the above download section, along with a .src.rpm and other stuff. If anyone does have a clue as to how to build with 3.0.0 given the current cvs, or if there is a tagged set I needed to get, please let me know. Joe -- Joseph Landman, Ph.D Scalable Informatics LLC email: landman at scalableinformatics.com web: http://scalableinformatics.com phone: +1 734 612 4615
    From tim.carlson atpnl.gov Sat Dec 13 08:31:03 2003 From: tim.carlson at pnl.gov (Tim Carlson) Date: Sat, 13 Dec 2003 08:31:03 -0800 (PST) Subject: [Rocks-Discuss]Trying to integrate a new kernel into 3.0.0 In-Reply-To: <1071330602.4444.56.camel@protein.scalableinformatics.com> Message-ID: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov> On Sat, 13 Dec 2003, Joe Landman wrote: > That said, I had also checked out the cvs tree to start the "Carlson" > process :) indicated in the list a few months ago (see yikes.. ! :) > > Ok... I wanted to rebuild 3.0.0, as I cannot wait for 3.1.0 (my customer > has a strong sense of urgency and little time to wait for an operational > cluster). I checked out the system from CVS earlier this week. You needed to check out the 3.0.0 tagged version ROCKS_3_0_0_i386 Off thread, but it would seem to me that the numbering scheme for ROCKS got out of whack somewhere. Shouldn't 3.0.0 have been 2.3.3 and the new 3.1 been 3.0? The reasoning being that the current 3.0.0 is still RH 7.3 based and the new 3.1 will be RH 3.0 based. Not that it matters. Just curious. Tim Tim Carlson Voice: (509) 376 3423 Email: Tim.Carlson at pnl.gov EMSL UNIX System Support From phil at sdsc.edu Sat Dec 13 08:51:29 2003 From: phil at sdsc.edu (Philip Papadopoulos) Date: Sat, 13 Dec 2003 08:51:29 -0800 Subject: [Rocks-Discuss]Trying to integrate a new kernel into 3.0.0 In-Reply-To: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov> References: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov> Message-ID: <3FDB4391.4080405@sdsc.edu> Tim Carlson wrote: >On Sat, 13 Dec 2003, Joe Landman wrote: > > > >>That said, I had also checked out the cvs tree to start the "Carlson"
    >>process :) indicatedin the list a few months ago (see >> >> > >yikes.. ! :) > > > >>Ok... I wanted to rebuild 3.0.0, as I cannot wait for 3.1.0 (my customer >>has a strong sense of urgency and little time to wait for an operational >>cluster). I checked out the system from CVS earlier this week. >> >> > >You needed to check out the 3.0.0 tagged version > >ROCKS_3_0_0_i386 > this is correct. > >Off thread, but it would seem to me that the numbering scheme for ROCKS >got out of whack somewhere. Shouldn't 3.0.0 have been 2.3.3 and the new >3.1 been 3.0? The reasoning being that the current 3.0.0 is still RH 7.3 >based and the new 3.1 will be RH 3.0 based. Not that it matters. Just >curious. > I blame Bruno ... We moved to 3.0 because rolls is very different from the way 2.3.2 was put together -- this wasn't a minor change and so a subminor revision number didn't make sense. 3.0 --> 3.1 change from 7.3 to recompiled RHEL, change from PBS as default to SGE as default. .... OK, you could argue that this is also a major change and shouldn't have a minor version #. We didn't want to go from 3.0 to 4.0 for some non-definable reasons :-), but mostly it's that 3.0 and 3.1 feel pretty similar in terms of the way they are put together (with rolls). -P > >Tim > >Tim Carlson >Voice: (509) 376 3423 >Email: Tim.Carlson at pnl.gov >EMSL UNIX System Support > > -------------- next part -------------- An HTML attachment was scrubbed... URL: https://lists.sdsc.edu/pipermail/npaci-rocks- discussion/attachments/20031213/69aa41fa/attachment-0001.html From landman at scalableinformatics.com Sat Dec 13 11:14:51 2003 From: landman at scalableinformatics.com (Joe Landman) Date: Sat, 13 Dec 2003 14:14:51 -0500
Subject: [Rocks-Discuss]Trying to integrate a new kernel into 3.0.0 In-Reply-To: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov> References: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov> Message-ID: <1071342891.4445.58.camel@protein.scalableinformatics.com> Thanks. Magic incantations, and I have the "Carlson" process implemented. Ok, next step is the roll-my-own ... more later On Sat, 2003-12-13 at 11:31, Tim Carlson wrote: > On Sat, 13 Dec 2003, Joe Landman wrote: > > > That said, I had also checked out the cvs tree to start the "Carlson" > > process :) indicated in the list a few months ago (see > > yikes.. ! :) > > > > > Ok... I wanted to rebuild 3.0.0, as I cannot wait for 3.1.0 (my customer > > has a strong sense of urgency and little time to wait for an operational > > cluster). I checked out the system from CVS earlier this week. > > You needed to check out the 3.0.0 tagged version > > ROCKS_3_0_0_i386 > > Off thread, but it would seem to me that the numbering scheme for ROCKS > got out of whack somewhere. Shouldn't 3.0.0 have been 2.3.3 and the new > 3.1 been 3.0? The reasoning being that the current 3.0.0 is still RH 7.3 > based and the new 3.1 will be RH 3.0 based. Not that it matters. Just > curious. > > Tim > > Tim Carlson > Voice: (509) 376 3423 > Email: Tim.Carlson at pnl.gov > EMSL UNIX System Support
From wyzhong78 at msn.com Mon Dec 15 00:02:15 2003 From: wyzhong78 at msn.com (zhong wenyu) Date: Mon, 15 Dec 2003 16:02:15 +0800 Subject: [Rocks-Discuss]about add-extra-nic Message-ID: <BAY3-F40JRkRy9Iwgel00056a6d@hotmail.com>
Hi, everyone! My compute node's motherboard is an MSI 9141, which has one 1000M NIC and one 100M NIC. I plan to use the 100M network for control and the 1000M network for applications, so I use a 100M switch to connect the compute nodes to the frontend, and a 1000M switch to connect the compute nodes to each other, not including the frontend. 
The first time I installed a compute node, it sat at "waiting for dhcp ip information" for too long and the install could not finish. I figured the 1000M NIC was responsible, so I disabled it in the BIOS. After that the install worked and the compute nodes appeared. Then I wanted to add the extra NIC: I ran the add-extra-nic command and shoot-node, and the compute node went to reboot (during the reboot I re-enabled the NIC) but came back to "waiting for dhcp ip information" again.
So I disabled it again and restarted; the node reinstalled fine and finished with no trouble. I can even see the boot message "start eth1....[ok]"! But "ifconfig eth1" still reports an error, even after I enable the 1000M NIC again. Thanks and regards!
From Roy.Dragseth at cc.uit.no Mon Dec 15 02:31:51 2003 From: Roy.Dragseth at cc.uit.no (Roy Dragseth) Date: Mon, 15 Dec 2003 11:31:51 +0100 Subject: [Rocks-Discuss]ia64 compute nodes with ia32 frontends? In-Reply-To: <ddwu930yzp.fsf_-_@oecpc11.ucsd.edu> References: <793188FE-D411-11D7-8529-000393C7898E@sdsc.edu> <ddptix48s6.fsf@oecpc11.ucsd.edu> <ddwu930yzp.fsf_-_@oecpc11.ucsd.edu> Message-ID: <200312151131.51410.Roy.Dragseth@cc.uit.no>
Hi. I've been running a setup like this for over a year now; it will not (ever?) work right out of the box due to some kernel problems. rocks-dist --arch ia64 dist will most likely crash an ia32 frontend: the ia32 kernel doesn't like to mount a cramfs image generated on an ia64 machine, and it gives me a kernel panic. Here is a rough guide to get this kind of setup going.
1. Set up the ia32 as usual, but allow root write access to /export by inserting "no_root_squash" as an option in /etc/exports.
2. Create a "fake" ia64 frontend using one of the ia64 nodes: let it configure eth0 by dhcp and let the ia32 frontend think it is a compute node.
3. On the fake frontend, turn off the NIS daemons except ypbind.
4. Edit /etc/auto.home to mount /home from the ia32 frontend and restart autofs.
5. On the fake frontend, do a rocks-dist copycd to dump the ia64 DVD into /home/install.
6. Now you can do a rocks-dist dist on the ia64 box.
7. Finally, you need a symlink to make the ia32 frontend happy: ln -s enterprise/2.1AW/en/os/ia64 rocks-dist/7.3/en/os/ia64
Now you can boot up your ia64 nodes from the ia32 frontend. 
After you are confident that your ia64 nodes are installed correctly you can reinstall the ia64 frontend as a regular compute node. Subsequent rocks-dist dist can be run on any ia64 compute node as long as it has the anaconda-runtime and rocks-dist rpms installed. Hope this helps,
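Roy's numbered steps above, restated as a hedged shell transcript. This is a sketch only: the service names and the point at which each command runs are assumptions on my part, while the rocks-dist commands and the symlink come straight from his steps (the symlink path is presumably relative to /home/install).

```shell
## On the ia32 frontend (step 1):
# add "no_root_squash" as an option on the /export line of /etc/exports,
# then re-export the filesystems:
exportfs -ra

## On the "fake" ia64 frontend (steps 2-6) -- an ia64 node that gets eth0
## via dhcp and looks like a compute node to the ia32 frontend:
service ypserv stop          # assumption: stop NIS daemons except ypbind
# edit /etc/auto.home to mount /home from the ia32 frontend, then:
service autofs restart
cd /home/install
rocks-dist copycd            # dump the ia64 DVD into /home/install
rocks-dist dist              # build the distribution on the ia64 box

## Back on the ia32 frontend (step 7), assumed from /home/install:
ln -s enterprise/2.1AW/en/os/ia64 rocks-dist/7.3/en/os/ia64
```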
r. -- The Computer Center, University of Tromsø, N-9037 TROMSØ Norway. phone:+47 77 64 41 07, fax:+47 77 64 41 00 Roy Dragseth, High Performance Computing System Administrator Direct call: +47 77 64 62 56. email: royd at cc.uit.no
From Roy.Dragseth at cc.uit.no Mon Dec 15 04:28:15 2003 From: Roy.Dragseth at cc.uit.no (Roy Dragseth) Date: Mon, 15 Dec 2003 13:28:15 +0100 Subject: [Rocks-Discuss]Gig E on HP ZX6000 In-Reply-To: <Pine.GSO.4.58.0312121650350.4139@leibnitz.cs.uh.edu> References: <Pine.GSO.4.33.0312120850030.4235-100000@mercury.chem.northwestern.edu> <Pine.GSO.4.58.0312121650350.4139@leibnitz.cs.uh.edu> Message-ID: <200312151328.15826.Roy.Dragseth@cc.uit.no>
I had similar problems on our HP rx2600 boxes and found a way to make the kernel ignore the 100Mb/s NIC by adding this append line in elilo.conf: append="reserve=0xd00,64" See my post https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-October/003483.html for details on how to figure out this parameter. Remark: this has to be modified both in elilo.conf and elilo-ks.conf in /boot/efi/efi/redhat/. The problem is that cluster-kickstart overwrites these files at every reboot, and the setup is hardcoded into the cluster-kickstart executable, so you need to figure out a way to work around this. I grabbed cluster-kickstart.c from cvs, did the necessary mods to it and installed the new one on every compute node. r. -- The Computer Center, University of Tromsø, N-9037 TROMSØ Norway. phone:+47 77 64 41 07, fax:+47 77 64 41 00 Roy Dragseth, High Performance Computing System Administrator Direct call: +47 77 64 62 56. 
email: royd at cc.uit.no From fds at sdsc.edu Mon Dec 15 11:31:01 2003 From: fds at sdsc.edu (Federico Sacerdoti) Date: Mon, 15 Dec 2003 11:31:01 -0800 Subject: [Rocks-Discuss]Trying to integrate a new kernel into 3.0.0 In-Reply-To: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov> References: <Pine.GSO.4.44.0312130824450.27600-100000@poincare.emsl.pnl.gov> Message-ID: <37508BEC-2F35-11D8-804D-000393A4725A@sdsc.edu> We did indeed move version scheming. We used to be "Redhat minus 5" so a RH 7.3-based Rocks was called 2.3.x. This became mute when Redhat
quickly went from 8 to 9 to Enterprise 3. So we decided to be selfish and move to 3.0.0 when we made a big internal change (Rolls and the end of monolithic Rocks). 3.1.0 is a minor number revision, which corresponds to how much has changed in the Rocks code, not the underlying Redhat system. A bugfix release would be 3.1.1, etc... We hope this versioning scheme will be more resilient to linux system changes (which are out of our control), while keeping the focus on the Rocks structure. On Dec 13, 2003, at 8:31 AM, Tim Carlson wrote: > Off thread, but it would seem to me that the numbering scheme for ROCKS > got out of whack somewhere. Shouldn't 3.0.0 have been 2.3.3 and the new > 3.1 been 3.0? The reasoning being that the current 3.0.0 is still RH > 7.3 > based and the new 3.1 will be RH 3.0 based. Not that it matters. Just > curious. > Federico Rocks Cluster Group, San Diego Supercomputing Center, CA
From jlkaiser at fnal.gov Mon Dec 15 11:43:43 2003 From: jlkaiser at fnal.gov (Joseph L. Kaiser) Date: Mon, 15 Dec 2003 13:43:43 -0600 Subject: [Rocks-Discuss]problem forcing a kernel Message-ID: <1071517423.3719.4.camel@ajax.kaisergroup.net>
Hi, I am trying to install this kernel: kernel-smp-2.4.20-20.XFS1.3.1.i686.rpm and keep getting the following whether I put it in the force directory of my distro or the regular RPMS directory or contrib: During package installation it gives me this: /mnt/sysimage/var/tmpkernel-smp-2.4.20-20.9.XFS1.3.1.i686.rpm cannot be opened. This is due to a missing file, a bad package, or bad media. Press <return> to try again. The file is there. The media is the network. I have installed the package on other systems by hand. Any ideas? Thanks, Joe
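Greg Bruno's follow-up later in the thread asks whether the distribution was rebuilt after the RPM was copied in; that is the usual missing step. A hedged sketch of the sequence — the exact force-directory path is an assumption (check the Rocks 3.0.0 docs for your layout); only the `rocks-dist dist` rebuild is confirmed by the thread:

```shell
# Copy the kernel RPM into the distribution's force directory.
# NOTE: /home/install/force/RPMS/ is an assumed path, not from the thread.
cp kernel-smp-2.4.20-20.XFS1.3.1.i686.rpm /home/install/force/RPMS/

# Rebuild the distribution so the installer's package metadata actually
# picks up the new RPM (this step is from Greg's reply).
cd /home/install
rocks-dist dist
```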
    From tmartin atphysics.ucsd.edu Mon Dec 15 15:58:51 2003 From: tmartin at physics.ucsd.edu (Terrence Martin) Date: Mon, 15 Dec 2003 15:58:51 -0800 Subject: [Rocks-Discuss]removing a node from the cluster Message-ID: <3FDE4ABB.6030302@physics.ucsd.edu> How does one go about removing a node from the cluster? Is there a straight forward way to do this? Terrence From ebpeele2 at pams.ncsu.edu Mon Dec 15 16:42:47 2003 From: ebpeele2 at pams.ncsu.edu (Elliot Peele) Date: Mon, 15 Dec 2003 19:42:47 -0500 Subject: [Rocks-Discuss]removing a node from the cluster In-Reply-To: <3FDE4ABB.6030302@physics.ucsd.edu> References: <3FDE4ABB.6030302@physics.ucsd.edu> Message-ID: <1071535367.1871.1.camel@localhost.localdomain> insert-ethers --replace hostname Select compute from the menu then exit insert-ethers. Elliot On Mon, 2003-12-15 at 18:58, Terrence Martin wrote: > How does one go about removing a node from the cluster? Is there a > straight forward way to do this? > > Terrence -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part Url : https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031215/ ebf9581b/attachment-0001.bin From phil at sdsc.edu Mon Dec 15 16:44:29 2003 From: phil at sdsc.edu (Philip Papadopoulos) Date: Mon, 15 Dec 2003 16:44:29 -0800 Subject: [Rocks-Discuss]removing a node from the cluster In-Reply-To: <3FDE4ABB.6030302@physics.ucsd.edu> References: <3FDE4ABB.6030302@physics.ucsd.edu> Message-ID: <3FDE556D.4040100@sdsc.edu> insert-ethers --replace "compute-0-0" select "compute" from the menu and then hit f1 to exit. This will re-create all of the files that have host names and remove the node (you are essentially replacing the node named "compute-0-0" with the empty set). PBS will likely be unhappy with this change -- If I remember correctly, it has an
additional file that it creates when a node is added to the queuing system -- when the node doesn't appear in the host table, it gets cranky. You should look in /opt/OpenPBS/server_priv/nodes to solve this problem -- suppose you want to delete compute-0-0. # qmgr -c "delete node compute-0-0" # insert-ethers --replace "compute-0-0" -P Terrence Martin wrote: > How does one go about removing a node from the cluster? Is there a > straight forward way to do this? > > Terrence -- == Philip Papadopoulos, Ph.D. == Program Director for San Diego Supercomputer Center == Grid and Cluster Computing 9500 Gilman Drive == Ph: (858) 822-3628 University of California, San Diego == FAX: (858) 822-5407 La Jolla, CA 92093-0505
From gotero at linuxprophet.com Mon Dec 15 16:52:23 2003 From: gotero at linuxprophet.com (Glen Otero) Date: Mon, 15 Dec 2003 16:52:23 -0800 Subject: [Rocks-Discuss]removing a node from the cluster In-Reply-To: <1071535367.1871.1.camel@localhost.localdomain> References: <3FDE4ABB.6030302@physics.ucsd.edu> <1071535367.1871.1.camel@localhost.localdomain> Message-ID: <1C2131BE-2F62-11D8-9436-000A95CD8EC8@linuxprophet.com>
On Dec 15, 2003, at 4:42 PM, Elliot Peele wrote: > insert-ethers --replace hostname > > Select compute from the menu then exit insert-ethers. Then run: # insert-ethers --update to update the database Check the database entries with:
    # dbreport hosts Glen > >Elliot > > On Mon, 2003-12-15 at 18:58, Terrence Martin wrote: >> How does one go about removing a node from the cluster? Is there a >> straight forward way to do this? >> >> Terrence >> Glen Otero, Ph.D. Linux Prophet 619.917.1772 From landman at scalableinformatics.com Mon Dec 15 17:13:29 2003 From: landman at scalableinformatics.com (Joe Landman) Date: Mon, 15 Dec 2003 20:13:29 -0500 Subject: [Rocks-Discuss]removing a node from the cluster In-Reply-To: <1C2131BE-2F62-11D8-9436-000A95CD8EC8@linuxprophet.com> References: <3FDE4ABB.6030302@physics.ucsd.edu> <1071535367.1871.1.camel@localhost.localdomain> <1C2131BE-2F62-11D8-9436-000A95CD8EC8@linuxprophet.com> Message-ID: <3FDE5C39.1030503@scalableinformatics.com> Harumph: rmnode nasty_compute_node insert-ethers --update (rmnode at http://scalableinformatics.com/downloads/rmnode.gz). I thought insert-ethers had a simple version of this. All rmnode is, is a hacked version of one of the other rocks tools. Joe Glen Otero wrote: > > On Dec 15, 2003, at 4:42 PM, Elliot Peele wrote: > >> insert-ethers --replace hostname >> >> Select compute from the menu then exit insert-ethers. > > > Then run: > > # insert-ethers --update > > to update the database >
    > Check thedatabase entries with: > > # dbreport hosts > > Glen > >> >> Elliot >> >> On Mon, 2003-12-15 at 18:58, Terrence Martin wrote: >> >>> How does one go about removing a node from the cluster? Is there a >>> straight forward way to do this? >>> >>> Terrence >>> > Glen Otero, Ph.D. > Linux Prophet > 619.917.1772 -- Joseph Landman, Ph.D Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://scalableinformatics.com phone: +1 734 612 4615 From csamuel at vpac.org Mon Dec 15 18:06:47 2003 From: csamuel at vpac.org (Chris Samuel) Date: Tue, 16 Dec 2003 13:06:47 +1100 Subject: [Rocks-Discuss]ScalablePBS. In-Reply-To: <B83C8894-2CD7-11D8-A2DC-000A95DA5638@sdsc.edu> References: <200311212352.27000.Roy.Dragseth@cc.uit.no> <B83C8894-2CD7-11D8- A2DC-000A95DA5638@sdsc.edu> Message-ID: <200312161306.55651.csamuel@vpac.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sat, 13 Dec 2003 06:16 am, Mason J. Katz wrote: > This should become the basis of the PBS roll (currently openpbs). We > are seeking developers who would like to help write and maintain this > -- I'm not singling you out Roy, although you would be more than > welcome, rather I'm taking advantage of your message to solicit other > volunteers. Anyone? I think we might be interested in getting involved with this, we migrated from OpenPBS to ScalablePBS some time ago and spent quite a bit of time tracking down memory leaks and the like with DJ and friends at SuperCluster. We've also started using Rocks on a cluster that we manage for one of our member institutions and a colleague of mine is having fun trying to get it to go onto an Itanium cluster at the moment plus we should have some Opteron boxes arriving in a month or so for a mini-cluster which we'd like to run
    Rocks on. Currently weinstall Rocks on the cluster and then remove PBS and MAUI RPM's and install SPBS and the 3.2.6 version of MAUI we have access to, so a version that came with SPBS ready to go would make life a lot simpler for us. :-) cheers! Chris - -- Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (GNU/Linux) iD8DBQE/3mi3O2KABBYQAh8RAuSLAJ9Bx/5aCF8kRjHFapUpiASQUJeCTwCcD9y7 Y/ZM38t0J8r5dAYj1MdiUWA= =bCIS -----END PGP SIGNATURE----- From bruno at rocksclusters.org Mon Dec 15 18:30:03 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Mon, 15 Dec 2003 18:30:03 -0800 Subject: [Rocks-Discuss]removing a node from the cluster In-Reply-To: <3FDE5C39.1030503@scalableinformatics.com> References: <3FDE4ABB.6030302@physics.ucsd.edu> <1071535367.1871.1.camel@localhost.localdomain> <1C2131BE-2F62-11D8-9436-000A95CD8EC8@linuxprophet.com> <3FDE5C39.1030503@scalableinformatics.com> Message-ID: <C13C5DE4-2F6F-11D8-B821-000A95C4E3B4@rocksclusters.org> > Harumph: > > rmnode nasty_compute_node > insert-ethers --update > > (rmnode at http://scalableinformatics.com/downloads/rmnode.gz). > > I thought insert-ethers had a simple version of this. All rmnode is, > is a hacked version of one of the other rocks tools. actually, since v3.0.0, i think it does: http://www.rocksclusters.org/rocks-documentation/3.0.0/faq- configuration.html#REMOVE-NODE - gb From bruno at rocksclusters.org Mon Dec 15 19:40:49 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Mon, 15 Dec 2003 19:40:49 -0800 Subject: [Rocks-Discuss]problem forcing a kernel In-Reply-To: <1071517423.3719.4.camel@ajax.kaisergroup.net>
    References: <1071517423.3719.4.camel@ajax.kaisergroup.net> Message-ID: <A3F73894-2F79-11D8-B821-000A95C4E3B4@rocksclusters.org> > I am trying to install this kernel: > > kernel-smp-2.4.20-20.XFS1.3.1.i686.rpm and keep getting the following > whether I put it in the force directory of my distro or the regular > RPMS > directory or contrib: > > During package installation it gives me this: > > > /mnt/sysimage/var/tmpkernel-smp-2.4.20-20.9.XFS1.3.1.i686.rpm cannot be > opened. This is due to a missing file, a bad package, or bad media. > Press <return> to try again. > > > The file is there. The media is the network. I have installed the > package on other systems by hand. Any ideas? just to be sure, do you run the following after you copy the RPM into the force directory: # cd /home/install # rocks-dist dist - gb From bruno at rocksclusters.org Mon Dec 15 19:56:51 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Mon, 15 Dec 2003 19:56:51 -0800 Subject: [Rocks-Discuss]Adding partitions that are not reformatted under hard boots or shoot-node In-Reply-To: <3FD68B06.9010709@phys.ufl.edu> References: <3FD68B06.9010709@phys.ufl.edu> Message-ID: <E12881B4-2F7B-11D8-B821-000A95C4E3B4@rocksclusters.org> sorry for the late response. i recently tested the manual partitioning procedure on our upcoming release and there was a bug. a fix has been committed for the next release -- so manual partitioning will work on 3.1.0 as explained in the 3.0.0 documentation. - gb On Dec 9, 2003, at 6:55 PM, Jorge L. Rodriguez wrote: > Hi, > > How do I add an extra partition to my compute nodes and retain the > data on all non / partitions when system hard boots or is shot? > I tried the suggestion in the documentation under "Customizing your > ROCKS Installation" where you replace the auto-partition.xml but hard > boots or shoot-nodes on these reformat all partitions instead of just
    > the /. I have also tried to modify the installclass.xml so that an > extra partition is added into the python code see below. This does > mostly what I want but now I can't shoot-node even though a hard boot > reinstalls without reformatting all but /. Is this the right approach? > I'd rather avoid having to replace installclass since I don't really > want to partition all nodes this way but if I must I will. > > Jorge > > # > # set up the root partition > # > args = [ "/" , "--size" , "4096", > "--fstype", "&fstype;", > "--ondisk", devnames[0] ] > KickstartBase.definePartition(self, id, args) > > # ---- Jorge, I added this args > args = [ "/state/partition1" , "--size" , > "55000", > "--fstype", "&fstype;", > "--ondisk", devnames[0] ] > KickstartBase.definePartition(self, id, args) > # ----- > args = [ "swap" , "--size" , "1000", > "--ondisk", devnames[0] ] > KickstartBase.definePartition(self, id, args) > > # > # greedy partitioning > # > # ----- Jorge, I change this from i = 1 > i = 2 > # ----- > for devname in devnames: > partname = "/state/partition%d" % (i) > args = [ partname, "--size", "1", > "--fstype", "&fstype;", > "--grow", "--ondisk", devname ] > KickstartBase.definePartition(self, id, > args) > > i = i + 1 > > > From jlkaiser at fnal.gov Mon Dec 15 20:17:52 2003 From: jlkaiser at fnal.gov (Joseph L. Kaiser) Date: Mon, 15 Dec 2003 22:17:52 -0600 Subject: [Rocks-Discuss]problem forcing a kernel In-Reply-To: <A3F73894-2F79-11D8-B821-000A95C4E3B4@rocksclusters.org> References: <1071517423.3719.4.camel@ajax.kaisergroup.net> <A3F73894-2F79-11D8-B821-000A95C4E3B4@rocksclusters.org> Message-ID: <1071548271.3720.0.camel@ajax.kaisergroup.net> yup
    On Mon, 2003-12-15at 21:40, Greg Bruno wrote: > > I am trying to install this kernel: > > > > kernel-smp-2.4.20-20.XFS1.3.1.i686.rpm and keep getting the following > > whether I put it in the force directory of my distro or the regular > > RPMS > > directory or contrib: > > > > During package installation it gives me this: > > > > > > /mnt/sysimage/var/tmpkernel-smp-2.4.20-20.9.XFS1.3.1.i686.rpm cannot be > > opened. This is due to a missing file, a bad package, or bad media. > > Press <return> to try again. > > > > > > The file is there. The media is the network. I have installed the > > package on other systems by hand. Any ideas? > > just to be sure, do you run the following after you copy the RPM into > the force directory: > > # cd /home/install > # rocks-dist dist > > - gb > From Roy.Dragseth at cc.uit.no Tue Dec 16 02:13:50 2003 From: Roy.Dragseth at cc.uit.no (Roy Dragseth) Date: Tue, 16 Dec 2003 11:13:50 +0100 Subject: [Rocks-Discuss]ScalablePBS. In-Reply-To: <B83C8894-2CD7-11D8-A2DC-000A95DA5638@sdsc.edu> References: <200311212352.27000.Roy.Dragseth@cc.uit.no> <B83C8894-2CD7-11D8- A2DC-000A95DA5638@sdsc.edu> Message-ID: <200312161113.50076.Roy.Dragseth@cc.uit.no> On Friday 12 December 2003 20:16, Mason J. Katz wrote: > This should become the basis of the PBS roll (currently openpbs). We > are seeking developers who would like to help write and maintain this > -- I'm not singling you out Roy, although you would be more than > welcome, rather I'm taking advantage of your message to solicit other > volunteers. Anyone? > I talked to my boss and he gave me thumbs up, so I'll be glad to take care of the Maui/PBS roll of rocks. I'd love to see some more hands in the air as maintainers/testers... r. -- The Computer Center, University of Troms?, N-9037 TROMS? Norway.
    phone:+47 77 6441 07, fax:+47 77 64 41 00 Roy Dragseth, High Performance Computing System Administrator Direct call: +47 77 64 62 56. email: royd at cc.uit.no From daniel.kidger at quadrics.com Tue Dec 16 07:08:44 2003 From: daniel.kidger at quadrics.com (Dan Kidger) Date: Tue, 16 Dec 2003 15:08:44 +0000 Subject: [Rocks-Discuss]custom-kernels : naming conventions ? (Rocks 3.0.0) In-Reply-To: <20031209180224.24711.h014.c001.wm@mail.linuxprophet.com.criticalpath.net> References: <20031209180224.24711.h014.c001.wm@mail.linuxprophet.com.criticalpath.net> Message-ID: <3FDF1FFC.60501@quadrics.com> Glen et al. >I recently had the same problem when building a quadrics cluster on Rocks 2.3.2 >with the qsnet-RedHat-kernel-2.4.18-27.3.4qsnet.i686.rpms. The problem is >definitely in the naming of the rpms, in that anaconda running on the compute >nodes is not going to recognize kernel rpms that begin with 'qsnet' as potential >boot options. Unfortunately, being under a severe time contraint, I resorted to >manually installing the qsnet kernel on all nodes of the cluster, which isn't >the Rocks way. The long term solution is to mangle the kernel makefiles so that >the qsnet kernel rpms have conventional kernel rpm names, which is what Greg's >post referred to. I have been thinking about this. I reckon that the long term solution is *not* to rename the kernel that we use. (nor indeed to change the naming convention of any other kernels that people want to work on). As well as the triplet version numbering and the architecture, the kernel naming that we use includes the kernel source tree (Redhat, Suse, LSY, Vanilia, ..) and our partch level version numering triplet. Quadrics cannot be the only people who need freedom to include extra information in our naming convention for kernels. The solution must lie in either annaconda itself or more likely a cleaner way to include extra kernel(s) as well as the stock one in the compute node install process. 
Using extend-nodes.xml this works, apart from niggles about the /boot/grub/menu.lst that our kernel post-install configures getting clobbered by Rocks. Yours, Daniel. gotero at linuxprophet.com wrote: >Daniel- > > > -- Yours, Daniel.
    -------------------------------------------------------------- Dr. Dan Kidger,Quadrics Ltd. daniel.kidger at quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- From mjk at sdsc.edu Tue Dec 16 07:09:56 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Tue, 16 Dec 2003 07:09:56 -0800 Subject: [Rocks-Discuss]ScalablePBS. In-Reply-To: <200312161113.50076.Roy.Dragseth@cc.uit.no> References: <200311212352.27000.Roy.Dragseth@cc.uit.no> <B83C8894-2CD7-11D8- A2DC-000A95DA5638@sdsc.edu> <200312161113.50076.Roy.Dragseth@cc.uit.no> Message-ID: <E89F1F82-2FD9-11D8-A2DC-000A95DA5638@sdsc.edu> Fanstastic! I think this puts us at three people that have volunteered to help out on this. I will followup on this and help organize, support, and do some of the development also. But I'm going to push this back until after we get 3.1 out which looks like monday. -mjk On Dec 16, 2003, at 2:13 AM, Roy Dragseth wrote: > On Friday 12 December 2003 20:16, Mason J. Katz wrote: >> This should become the basis of the PBS roll (currently openpbs). We >> are seeking developers who would like to help write and maintain this >> -- I'm not singling you out Roy, although you would be more than >> welcome, rather I'm taking advantage of your message to solicit other >> volunteers. Anyone? >> > > I talked to my boss and he gave me thumbs up, so I'll be glad to take > care of > the Maui/PBS roll of rocks. > > I'd love to see some more hands in the air as maintainers/testers... > > r. > > > -- > > The Computer Center, University of Troms?, N-9037 TROMS? Norway. > phone:+47 77 64 41 07, fax:+47 77 64 41 00 > Roy Dragseth, High Performance Computing System Administrator > Direct call: +47 77 64 62 56. email: royd at cc.uit.no From mjk at sdsc.edu Tue Dec 16 07:37:04 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Tue, 16 Dec 2003 07:37:04 -0800 Subject: [Rocks-Discuss]custom-kernels : naming conventions ? 
(Rocks 3.0.0) In-Reply-To: <3FDF1FFC.60501@quadrics.com> References:
    <20031209180224.24711.h014.c001.wm@mail.linuxprophet.com.criticalpath.net> <3FDF1FFC.60501@quadrics.com> Message-ID: <B3192AFA-2FDD-11D8-A2DC-000A95DA5638@sdsc.edu> If you rename the linux kernel to include other arbitrary strings the RedHat Kickstart installer will not recognize it as a kernel. This means you loose probing for the correct x86 cpu (386/486/585/686) and probing for SMP vs. uni. This implies you would need to re-write the anaconda code to do this for arbitrarily named packages, if you could convince RedHat to do this great, but it's not worth our development time to do this ourselves when properly named kernel packages work wonderfully. The unfortunate reality is the kernel RPM is not just another package -- it has some special installation logic to optimize for you hardware. Sure they could have done this better, but they do a darn good job as is. This is not a Rocks issue, it means you have created a package that does not work with RedHat. I understand why you need to include extra strings in the kernel name, but suggest that there are several alternatives to this that don't break RedHat kickstart. For example, you could: - Write a kernel version module to report on /proc/qsnet_kernel the same information. - Have the kernel RPM install a /usr/doc/qsnet/VERSION file - Have a subpackage of the kernel rpm that include the extra strings (and extra docs). - Stop patching the kernel and only use a module. True some things require kernel patches, but almost all driver changes can go into modules only. This was not always true a few years ago, the module system has improved a lot. We've faced numerous issues like this with RedHat in creating Rocks, and for every issue we have found a work around that keeps us w/in the RedHat way of doing things. This is not always optimal for development but always yields a simpler, and more supportable, system. -mjk On Dec 16, 2003, at 7:08 AM, Dan Kidger wrote: > Glen et al. 
> >> I recently had the same problem when building a quadrics cluster on >> Rocks 2.3.2 >> with the qsnet-RedHat-kernel-2.4.18-27.3.4qsnet.i686.rpms. The >> problem is >> definitely in the naming of the rpms, in that anaconda running on the >> compute >> nodes is not going to recognize kernel rpms that begin with 'qsnet' >> as potential >> boot options. Unfortunately, being under a severe time contraint, I >> resorted to >> manually installing the qsnet kernel on all nodes of the cluster, >> which isn't
    >> the Rocksway. The long term solution is to mangle the kernel >> makefiles so that >> the qsnet kernel rpms have conventional kernel rpm names, which is >> what Greg's >> post referred to. > > I have been thinking about this. > > I reckon that the long term solution is *not* to rename the kernel > that we use. (nor indeed to change the naming convention of any other > kernels that people want to work on). As well as the triplet version > numbering and the architecture, the kernel naming that we use includes > the kernel source tree (Redhat, Suse, LSY, Vanilia, ..) and our partch > level version numering triplet. > Quadrics cannot be the only people who need freedom to include extra > information in our naming convention for kernels. > The solution must lie in either annaconda itself or more likely a > cleaner way to include extra kernel(s) as well as the stock one in the > compute node install process. Using extend-nodes.xml this works apart > from niggles about the /boot/grub/menu.lst that our kernel > post-instal;l configures getting clobbered by Rocks. > > Yours, > Daniel. > > > gotero at linuxprophet.com wrote: > >> Daniel- >> >> > > -- > Yours, > Daniel. > > -------------------------------------------------------------- > Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com > One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 > ----------------------- www.quadrics.com -------------------- > From dtwright at uiuc.edu Tue Dec 16 11:45:55 2003 From: dtwright at uiuc.edu (Dan Wright) Date: Tue, 16 Dec 2003 13:45:55 -0600 Subject: [Rocks-Discuss]a minor ganglia question Message-ID: <20031216194554.GH26246@uiuc.edu> Hello all, I'm in the process of setting up a 3.0.0 cluster and have a question about the "Physical view" in ganglia. In this view (which is quite cool BTW :) is shows higher-numbered nodes on top and lower-numbered nodes on bottom: compute-0-12 ... compute-0-2
compute-0-1
compute-0-0

and my cluster is physically reversed from that:

compute-0-0
compute-0-1
compute-0-2
...
compute-0-12

Is there an easy way to switch this display around so it matches the real
physical layout? I poked around in ganglia for a few minutes and didn't
see anything obvious, so I thought I'd ask before I actually start
wasting time on this :)

Thanks,

- Dan Wright
(dtwright at uiuc.edu)
(http://www.scs.uiuc.edu/)
(UNIX Systems Administrator, School of Chemical Sciences, UIUC)
(333-1728)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031216/28f3eb5a/attachment-0001.bin

From purikk at hotmail.com Tue Dec 16 12:34:51 2003
From: purikk at hotmail.com (Purushotham Komaravolu)
Date: Tue, 16 Dec 2003 15:34:51 -0500
Subject: [Rocks-Discuss]hardware-setup for the Rocks cluster
References: <200312162016.hBGKGuJ05160@postal.sdsc.edu>
Message-ID: <BAY1-DAV575EPSM0omP0000cb94@hotmail.com>

Hi All,

We are trying to set up a Rocks cluster with 1 frontend and 20 compute
nodes.

Frontend:
 1) Dual Pentium Xeon 2.4 GHz PC 533 and 512k L2 Cache
 2) Dual port Gigabit Ethernet
 3) 1 GB DDR RAM
 4) 3* 200 GB EIDE ULTRA ATA 100

Compute nodes:
 1) Pentium Xeon 2.4 GHz PC 533 and 512k L2 Cache
 2) Dual port Gigabit Ethernet
 3) 1 GB DDR RAM
 4) 41 GB UDMA EIDE

1 HP Procurve 24 port switch

Does the setup look ok?
Does Rocks support the following features?

Remote power monitoring for individual nodes
*Temperature monitoring of individual processors
*Power sequencing on startup to prevent possible power spiking
*Remote power-down and reset of system and nodes
*Serial access to nodes
*Disk cloning
*Plug-In Extensible Architecture
*Image Manager

and also: how should the disks be set up? Do all the disks need to be
attached to the frontend, with compute nodes having small 3 or 4 GB
disks?

Can someone point me to clustering software which supports all the above
features if Rocks doesn't support them?

thanks a lot

Regards,
Puru

From purikk at hotmail.com Tue Dec 16 12:39:19 2003
From: purikk at hotmail.com (Purushotham Komaravolu)
Date: Tue, 16 Dec 2003 15:39:19 -0500
Subject: [Rocks-Discuss]Java Rocks cluster
Message-ID: <BAY1-DAV62R0rmTIVvL0000cc3a@hotmail.com>

I am a newbie to ROCKS. I have a question about running Java on a
Rockster. Is it possible that I can start only one JVM on one machine and
have the task run distributed on the cluster? It is a multi-threaded
application. Say I have an application with 100 threads: can I have 50
threads run on one machine and 50 on another by launching the
application (JVM) on one machine (similar to Sun Firebird)? I don't want
to use MPI or any special code.

Thanks
Sincerely
Puru
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031216/ee12ac80/attachment-0001.html

From mjk at sdsc.edu Tue Dec 16 13:20:24 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 16 Dec 2003 13:20:24 -0800
Subject: [Rocks-Discuss]Java Rocks cluster
In-Reply-To: <BAY1-DAV62R0rmTIVvL0000cc3a@hotmail.com>
References: <BAY1-DAV62R0rmTIVvL0000cc3a@hotmail.com>
Message-ID: <A9849F18-300D-11D8-A2DC-000A95DA5638@sdsc.edu>

There are a few research projects that do map Java threads onto processes
on cluster compute nodes. At the IEEE Cluster '03 conference a couple of
weeks ago in Hong Kong there were a few interesting Java talks on this
subject. You can see the schedule at the following link and do some
Google research for more info. I think the papers will be online soon...

http://www.csis.hku.hk/cluster2003/advance-program.html

Rocks 3.1 will include a Java Roll, but this is nothing more than Sun's
Java sdk/rte and doesn't do any cluster magic for you.

    -mjk

On Dec 16, 2003, at 12:39 PM, Purushotham Komaravolu wrote:

> I am a newbie to ROCKS. I have a question about running Java on a
> Rockster. Is it possible that I can start only one JVM on one machine
> and have the task run distributed on the cluster? It is a
> multi-threaded application. Say I have an application with 100 threads:
> can I have 50 threads run on one machine and 50 on another by launching
> the application (JVM) on one machine (similar to Sun Firebird)? I don't
> want to use MPI or any special code.
> Thanks
> Sincerely
> Puru

From phil at sdsc.edu Tue Dec 16 13:38:48 2003
From: phil at sdsc.edu (Philip Papadopoulos)
Date: Tue, 16 Dec 2003 13:38:48 -0800
Subject: [Rocks-Discuss]hardware-setup for the Rocks cluster
In-Reply-To: <BAY1-DAV575EPSM0omP0000cb94@hotmail.com>
References: <200312162016.hBGKGuJ05160@postal.sdsc.edu> <BAY1-DAV575EPSM0omP0000cb94@hotmail.com>
Message-ID: <3FDF7B68.3030302@sdsc.edu>

Purushotham Komaravolu wrote:

> Hi All,
>    We are trying to setup rocks cluster with 1 front and 20 computing
> nodes.
> Frontend:
>  1) Dual Pentium Xeon 2.4 GHz PC 533 and 512k L2 Cache
>  2) Dual port Gigabit Ethernet
>  3) 1 GB DDR RAM
>  4) 3* 200 GB EIDE ULTRA ATA 100
>
> Compute nodes:
>  1) Pentium Xeon 2.4 GHz PC 533 and 512k L2 Cache
>  2) Dual port Gigabit Ethernet
>  3) 1 GB DDR RAM
>  4) 41 GB UDMA EIDE
> 1 HP Procurve 24 port switch
>
> Does the setup look ok?

Setup looks fine.

> Does Rocks support the following features
> Remote power monitoring for individual nodes
>
> *Temperature monitoring of individual processors

Not directly -- there isn't a completely general solution to this --
though lm_sensors is good for non-server boards. However, nothing
prevents you from adding the proper software. It's fairly easy to add
metrics to ganglia if you have the baseline drivers for your particular
temperature monitoring software.

> *Power sequencing on startup to prevent possible power spiking
>
> *Remote power-down and reset of system and nodes
>
> *Serial access to nodes

All of these generally require another network (serial, lights-out
management, etc). We don't assume any of these extra networks exist.
Again, layering that functionality atop Rocks is very, very
straightforward. See the FAQ for how to add packages to nodes.

> *Disk cloning

No. Emphatically no. Disk cloning is not anywhere in the Rocks
vocabulary. We have distributions (Redhat + Rocks + cluster tools + your
own software) and a way to generate a kickstart file in a programmatic
way. Disk cloning assumes homogeneity of hardware (we don't), requires a
custom aftermarket installer to fix up a node after an image is put on it
(we use Redhat as the installer), and requires a completely different
image for every different functional type of node (frontend, compute,
nfs, web, pvfs, etc).

> *Plug-In Extensible Architecture
Uh. Yeah. That's the whole point. Again, see the FAQ for how you add
packages. Rolls are an additional extension mechanism that allows you to
add larger chunks of functionality at cluster build time. We extend base
Rocks with grid software, schedulers, Java, and community-specific
software stacks. You should wait (about 5 days) for the final release of
3.1.0 to see how rolls work.

> *Image Manager

Definitely no. There are no images in Rocks. We have distributions and
appliance types. A graph description of appliances is melded with
distributions to define a complete node. Shared configuration is truly
shared. None of that happens with images -- the base software and the
configuration are locked together.

> and also
>
> How should be the disk setup, does all the disks need to be attached to
> frontend and compute nodes have small 3 or 4 GB disks?

Nodes must be disk-full, of any type and size (8GB is probably minimal
given the size of Linux these days). You can put as many disks as you
want on your frontend and have it double as an NFS server for your
cluster (the default). You can build other NFS servers easily (and manage
them as easily as you do a compute node).

> Can someone point me to a clustering software which supports all above
> features if Rocks doesn't support them.

Sorry. Doesn't exist. Pick the things that you can live without today
(but would want to add tomorrow).

    -P

> thanks a lot
>
> Regards,
>
> Puru
--
== Philip Papadopoulos, Ph.D.          Program Director for
== San Diego Supercomputer Center      Grid and Cluster Computing
== 9500 Gilman Drive                   Ph: (858) 822-3628
== University of California, San Diego FAX: (858) 822-5407
== La Jolla, CA 92093-0505

From mjk at sdsc.edu Tue Dec 16 13:38:59 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 16 Dec 2003 13:38:59 -0800
Subject: [Rocks-Discuss]hardware-setup for the Rocks cluster
In-Reply-To: <BAY1-DAV575EPSM0omP0000cb94@hotmail.com>
References: <200312162016.hBGKGuJ05160@postal.sdsc.edu> <BAY1-DAV575EPSM0omP0000cb94@hotmail.com>
Message-ID: <421F6254-3010-11D8-A2DC-000A95DA5638@sdsc.edu>

On Dec 16, 2003, at 12:34 PM, Purushotham Komaravolu wrote:

> Hi All,
>    We are trying to setup rocks cluster with 1 front and 20 computing
> nodes.
> Frontend:
>  1) Dual Pentium Xeon 2.4 GHz PC 533 and 512k L2 Cache
>  2) Dual port Gigabit Ethernet
>  3) 1 GB DDR RAM
>  4) 3* 200 GB EIDE ULTRA ATA 100
>
> Compute nodes:
>  1) Pentium Xeon 2.4 GHz PC 533 and 512k L2 Cache
>  2) Dual port Gigabit Ethernet
>  3) 1 GB DDR RAM
>  4) 41 GB UDMA EIDE
> 1 HP Procurve 24 port switch
>
> Does the setup look ok?

Sounds good. If you have device driver issues, just wait until next week
when 3.1 comes out; it will have a new kernel and more supported
hardware.

> Does Rocks support the following features
> Remote power monitoring for individual nodes

Ethernet addressable power strips can be used for this.

> *Temperature monitoring of individual processors

No, although a ganglia module can be created to do this. The problem is
there isn't a common standard out there for *all* hardware right now.

> *Power sequencing on startup to prevent possible power spiking
Ethernet addressable power strips can be used for this.

> *Remote power-down and reset of system and nodes

Yes (using software). For hardware control you would need a remote
management board in every node, or ethernet addressable power strips.

> *Serial access to nodes

No, Rocks uses ssh and ethernet for this. But you can add your own serial
port concentrator if you need one.

> *Disk cloning

Nope, this doesn't scale in both system and people time. Rocks uses
RedHat's Kickstart to build the disk image on each node in a cluster
programmatically. This is extremely fast -- in fact a 128 node cluster
can be built from scratch (including hardware integration) in under 2
hours, and the entire cluster can be reinstalled in around 12 minutes. We
did this as a demonstration of Rocks' scalability at SC'03 (we even have
a movie of it).

> *Plug-In Extensible Architecture

Yes. You can add to the cluster database and extend our utilities.
Everything is open.

> *Image Manager

Rocks does not do system imaging. We have a utility called rocks-dist
that builds distributions for you. This, combined with the XML profile
graph, gives you what you want here.

> How should be the disk setup, does all the disks need to be attached to
> frontend and compute nodes have small 3 or 4 GB disks?

Buy the smallest modern HD you can for the compute node (4 GB is fine).
By default the frontend serves user directories over NFS, so you should
have more storage on the frontend node.

    -mjk

From landman at scalableinformatics.com Tue Dec 16 13:43:51 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 16 Dec 2003 16:43:51 -0500
Subject: [Rocks-Discuss]Java Rocks cluster
In-Reply-To: <BAY1-DAV62R0rmTIVvL0000cc3a@hotmail.com>
References: <BAY1-DAV62R0rmTIVvL0000cc3a@hotmail.com>
Message-ID: <1071611031.9903.77.camel@squash.scalableinformatics.com>

Hi Puru:

Java threads are shared memory objects at this moment.
You would need to look at thread-migration schemes to layer atop the process, and a distributed shared memory model to handle memory issues. I don't think Java natively supports this, so you will likely have to appeal to some
other method. Moreover, shared memory across slower cluster network
fabrics is painful at best. If you are going to work on a single system
image machine with shared memory, you want the fastest/best fabric you
can get.

If it is easier to re-architect your code as independent worker
processes, you could write it using JVMs and simple sockets or similar.
If it is threaded, you may have problems parallelizing it on a cluster.

Joe

On Tue, 2003-12-16 at 15:39, Purushotham Komaravolu wrote:

> I am a newbie to ROCKS. I have a question about running Java on a
> Rockster. Is it possible that I can start only one JVM on one machine
> and have the task run distributed on the cluster? It is a
> multi-threaded application. Say I have an application with 100 threads:
> can I have 50 threads run on one machine and 50 on another by launching
> the application (JVM) on one machine (similar to Sun Firebird)? I don't
> want to use MPI or any special code.
> Thanks
> Sincerely
> Puru

--
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615

From rscarce at caci.com Tue Dec 16 10:56:18 2003
From: rscarce at caci.com (Reed Scarce)
Date: Tue, 16 Dec 2003 13:56:18 -0500
Subject: [Rocks-Discuss]grub / boot / fdisk problem
Message-ID: <OF2C6AD168.EB3D778E-ON85256DFE.0067CF1C-85256DFE.006812B4@caci.com>

I installed Rocks on a primary master hard drive. It became necessary to
re-install. I took an identical hd and made it primary master. The first
drive, which boots fine, was left off the system to act as an archive, to
mount after the new system was up and running. The new system was
installed and works great; now to correctly install the old drive as
primary slave, reboot, mount, and copy the scripts and configs to the new
system! There the problem began.

When I boot either drive as primary master and only primary drive, they
boot fine.
When I connect either drive, correctly configured and recognized by the
BIOS, as primary or secondary slave, grub gives a GRUB prompt and won't
boot. Something interesting: when booted from a floppy (mkbootdisk) made
from the new disk, in /var/log/dmesg both drives are visible, but fdisk
reports that the partition table is empty - so I can't mount the drive
from a floppy boot.
dmesg is like this: (my comments)

hda: ST34321A, ... (pri master)
hdb: ST34321A, ... (pri slave)
hdc: FX4010M, ATAPI CD/DVD-ROM drive (secnd master)
hdd: ST320420A, ... (secnd slave)
ide0 at ... (ide pri chain)
ide1 at ... (ide secnd chain)
hda: 8404830 sectors ... (good)
hdb: 8404830 sectors ... (good)
hdd: 39851760 sectors ... (good)
ide-floppy driver ... (ok)
Partition check: (<---<<< this is where it gets interesting)
hda:
hdb:
hdd: hdd1 hdd2 hdd3 (<---<<< that's right, hdd is now the boot drive.
Even if I boot without the floppy, hdd is the boot drive.)

Any suggestions?

Reed Scarce
Systems Engineer
CACI, Inc.
1100 N. Glebe Rd
Arlington, VA 22201
(703) 841-3045
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031216/498124c7/attachment-0001.html

From ShiYi.Yue at astrazeneca.com Tue Dec 16 14:05:46 2003
From: ShiYi.Yue at astrazeneca.com (ShiYi.Yue at astrazeneca.com)
Date: Tue, 16 Dec 2003 23:05:46 +0100
Subject: [Rocks-Discuss]hardware compatibility check with Rocks 3.00
Message-ID: <D2A2B86E8730D711B8560008028AC980257A2E@camrd9.camrd.astrazeneca.net>

hi,

I was wondering if there is a way to set up a hardware compatibility
check in the kickstart of Rocks, and give us an opportunity to add the
drivers once incompatible hardware is detected.

I have some PCs with Broadcom Gbit 10/100/1000 network cards, and it
looks like Rocks 3.0 was not happy with these network cards. The only
thing I can do now (without rebuilding the distribution) is to replace
these cards. I am afraid this type of situation will happen again and
again, since RH7.3 is getting older and older. I hope I am wrong and
someone can point me to a solution.

Shi-Yi
shiyi.yue at astrazeneca.com

From mjk at sdsc.edu Tue Dec 16 14:55:38 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 16 Dec 2003 14:55:38 -0800
Subject: [Rocks-Discuss]hardware compatibility check with Rocks 3.00
In-Reply-To: <D2A2B86E8730D711B8560008028AC980257A2E@camrd9.camrd.astrazeneca.net>
References: <D2A2B86E8730D711B8560008028AC980257A2E@camrd9.camrd.astrazeneca.net>
Message-ID: <F7910D2D-301A-11D8-A2DC-000A95DA5638@sdsc.edu>

We've been thinking about this off and on for over a year -- it's a
pretty hard problem. The real trick to supporting all hardware is keeping
the boot kernel current. We've let our releases get old, and more and
more people are seeing hardware support issues. Rocks 3.1 (out next week)
will include the latest RedHat kernel from RHEL 3.0. This will fix most
of the hardware support issues out there. When we release, please
download 3.1 and try it with your hardware; if it still fails, please let
us know. Thanks.

    -mjk

On Dec 16, 2003, at 2:05 PM, ShiYi.Yue at astrazeneca.com wrote:

> hi,
>
> I was wondering if there is a way to set up a hardware compatibility
> check in the kickstart of Rocks, and give us an opportunity to add the
> drivers once incompatible hardware is detected.
>
> I have some PCs with Broadcom Gbit 10/100/1000 network cards, and it
> looks like Rocks 3.0 was not happy with these network cards. The only
> thing I can do now (without rebuilding the distribution) is to replace
> these cards. I am afraid this type of situation will happen again and
> again, since RH7.3 is getting older and older.
> I hope I am wrong and someone can point me to a solution.
> Shi-Yi
> shiyi.yue at astrazeneca.com

From msherman at informaticscenter.info Tue Dec 16 16:25:45 2003
From: msherman at informaticscenter.info (Mark Sherman)
Date: Tue, 16 Dec 2003 17:25:45 -0700
Subject: [Rocks-Discuss]RE: Rocks-Discuss] AMD Opteron - Contact Appro
Message-ID: <20031217002545.17912.qmail@webmail-2-2.mesa1.secureserver.net>

Hello,

I'm an administrator on a pure i386 cluster under Rocks 3.0.0, and our
clients are pushing us to include some Opteron nodes.
I'm trying to find out the feasibility of such an addition. I know
there's been a lot of talk about Opterons on the rocks list, so I'm
wondering if someone can give a boiled-down can-do / can't-do /
maybe-but-we-haven't-tested-it-yet kind of status. With that, I'd say I'm
probably willing to be a pseudo-beta site and give feedback on how the
system works.

Thank you very much, and keep up the good work. I love the Rocks system.

~M
______________________________________________
Mark Sherman
Computing Systems Administrator
Informatics Center
Massachusetts Biomedical Initiatives
Worcester MA 01605
508-797-4200
msherman at informaticscenter.info
----------------------~-----------------------

> -------- Original Message --------
> Subject: [Rocks-Discuss]RE: Rocks-Discuss] AMD Opteron - Contact Appro
> From: "Jian Chang" <jian at appro.com>
> Date: Fri, December 12, 2003 6:27 pm
> To: "Bryan Littlefield" <bryan at UCLAlumni.net>,
> npaci-rocks-discussion at sdsc.edu, mjk at sdsc.edu
>
> Hello Mason / Puru,
>
> I got your contact information from Bryan Littlefield.
> I would like to discuss with you regarding benchmark test systems you
> might need down the road.
> We can also share with you our findings as to what is compatible in the
> Opteron systems.
> Please reply with your phone number where I can reach you, and I will
> call promptly.
>
> Bryan,
>
> Thank you for the referral.
>
> Best regards,
>
> Jian Chang
> Regional Sales Manager
> (408) 941-8100 x 202
> (800) 927-5464 x 202
> (408) 941-8111 Fax
> jian at appro.com
> www.appro.com
>
> -----Original Message-----
> From: Bryan Littlefield [mailto:bryan at UCLAlumni.net]
> Sent: Tuesday, December 09, 2003 12:14 PM
> To: npaci-rocks-discussion at sdsc.edu; mjk at sdsc.edu
> Cc: Jian Chang
> Subject: Rocks-Discuss] AMD Opteron - Contact Appro
>
> Hi Mason,
>
> I suggest contacting Appro. We are using Rocks on our Opteron cluster
> and Appro would likely love to help. I will contact them as well to see
> if they could help getting an Opteron machine for testing. Contact info
> below:
>
> Thanks --Bryan
>
> Jian Chang - Regional Sales Manager
> (408) 941-8100 x 202
> (800) 927-5464 x 202
> (408) 941-8111 Fax
> jian at appro.com
> http://www.appro.com
>
> npaci-rocks-discussion-request at sdsc.edu wrote:
>
>> From: "Mason J. Katz" <mjk at sdsc.edu>
>> Subject: Re: [Rocks-Discuss]AMD Opteron
>> Date: Tue, 9 Dec 2003 07:28:51 -0800
>> To: "purushotham komaravolu" <purikk at hotmail.com>
>>
>> We have a beta right now that we have sent to a few people. We plan on
>> a release this month, and AMD_64 will be part of this release along
>> with the usual x86, IA64 support.
>>
>> If you want to help accelerate this process please talk to your vendor
>> about loaning/giving us some hardware for testing. Having access to a
>> variety of Opteron hardware (we own two boxes) is the only way we can
>> have good support for this chip.
>>
>> -mjk
>>
>> On Dec 8, 2003, at 8:23 PM, purushotham komaravolu wrote:
>>
>>> Cc: <npaci-rocks-discussion at sdsc.edu>
>>>
>>> Hello,
>>> I am a newbie to ROCKS cluster. I wanted to set up clusters on 32-bit
>>> architectures (Intel and AMD) and 64-bit architectures (Intel and
>>> AMD). I found the 64-bit download for Intel on the website but not
>>> for AMD. Does it work for AMD Opteron? If not, what is the ETA for
>>> AMD-64? We are planning to buy AMD-64 bit machines shortly, and I
>>> would like to volunteer for the beta testing if needed.
>>> Thanks
>>> Regards,
>>> Puru
>
> _______________________________________________
> npaci-rocks-discussion mailing list
> npaci-rocks-discussion at sdsc.edu
> http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion
> End of npaci-rocks-discussion Digest

From fds at sdsc.edu Tue Dec 16 18:04:47 2003
From: fds at sdsc.edu (Federico Sacerdoti)
Date: Tue, 16 Dec 2003 18:04:47 -0800
Subject: [Rocks-Discuss]a minor ganglia question
In-Reply-To: <20031216194554.GH26246@uiuc.edu>
References: <20031216194554.GH26246@uiuc.edu>
Message-ID: <63C818CD-3035-11D8-8652-000393A4725A@sdsc.edu>

Dan,

Good question. Unfortunately this behavior is hardwired into stock
Ganglia, not the Rocks-specific pages that we have more control over.

The good news is that I wrote the code for this page :) It's easy to fix
if you would like to do it yourself.

Edit the file /var/www/html/ganglia/functions.php. On line 386, you
should see:

krsort($racks[$rack]);

To get the ordering you desire, change this to:

ksort($racks[$rack]);

That's it. You should see the high-numbered compute nodes at the bottom
of the rack. I will see if we can get a config file button on the page to
give this option for a later release of Ganglia.

-Federico

On Dec 16, 2003, at 11:45 AM, Dan Wright wrote:

> Hello all,
>
> I'm in the process of setting up a 3.0.0 cluster and have a question
> about the "Physical view" in ganglia. In this view (which is quite cool
> BTW :) it shows higher-numbered nodes on top and lower-numbered nodes
> on bottom:
>
> compute-0-12
> ...
> compute-0-2
> compute-0-1
> compute-0-0
>
> and my cluster is physically reversed from that:
>
> compute-0-0
> compute-0-1
> compute-0-2
> ...
> compute-0-12
>
> Is there an easy way to switch this display around so it matches the
> real physical layout? I poked around in ganglia for a few minutes and
> didn't see anything obvious, so I thought I'd ask before I actually
> start wasting time on this :)
>
> Thanks,
>
> - Dan Wright
> (dtwright at uiuc.edu)
> (http://www.scs.uiuc.edu/)
> (UNIX Systems Administrator, School of Chemical Sciences, UIUC)
> (333-1728)

Federico

Rocks Cluster Group, San Diego Supercomputing Center, CA

From csamuel at vpac.org Tue Dec 16 18:49:22 2003
From: csamuel at vpac.org (Chris Samuel)
Date: Wed, 17 Dec 2003 13:49:22 +1100
Subject: [Rocks-Discuss]a minor ganglia question
In-Reply-To: <20031216194554.GH26246@uiuc.edu>
References: <20031216194554.GH26246@uiuc.edu>
Message-ID: <200312171349.24485.csamuel@vpac.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, 17 Dec 2003 06:45 am, Dan Wright wrote:

> Is there an easy way to switch this display around so it matches the
> real physical layout?

I think this is why they tell you to install the compute nodes from the
bottom of the rack. :-)

cheers,
Chris
- --
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE/38QyO2KABBYQAh8RAo+vAJ0XcP6tBJpwjxYnicEQkysRslWmmQCcDpeb
K8bNCLgiF5umMiJ/59ICN70=
=57YJ
-----END PGP SIGNATURE-----
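Federico's one-line fix above amounts to sorting the rack's node array by
key ascending (PHP's ksort) instead of descending (krsort) before drawing
it. A minimal Python sketch of the same idea; the slot numbers and the
`render` helper are made up for illustration, not part of Ganglia:

```python
# PHP's krsort() orders an array's keys descending (stock Ganglia:
# high-numbered nodes drawn at the top); ksort() orders them ascending
# (the suggested edit: low-numbered nodes drawn at the top).
# The rack here is a hypothetical mapping of slot number -> node name.
rack = {12: "compute-0-12", 0: "compute-0-0",
        2: "compute-0-2", 1: "compute-0-1"}

def render(rack, reverse):
    """Return node names in display order, top of the rack first."""
    return [rack[slot] for slot in sorted(rack, reverse=reverse)]

# krsort-style (stock): highest-numbered node appears first.
assert render(rack, reverse=True)[0] == "compute-0-12"
# ksort-style (after the edit): lowest-numbered node appears first.
assert render(rack, reverse=False)[0] == "compute-0-0"
```

The same reversal could of course be applied in any templating layer; the
point is only that the display order is a one-line sort-direction choice.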
From hermanns at tupi.dmt.upm.es Wed Dec 17 00:08:19 2003
From: hermanns at tupi.dmt.upm.es (Miguel Hermanns)
Date: Wed, 17 Dec 2003 09:08:19 +0100
Subject: [Rocks-Discuss]Creation of a hardware compatibility list?
Message-ID: <3FE00EF3.4020809@tupi.dmt.upm.es>

Since one of the strong features of Rocks is the possibility of fast
deployment of clusters, wouldn't it be of interest to create a hardware
compatibility list on the web page of Rocks? This list could be filled in
by the users of Rocks with their experience and the hardware they have.
In this way somebody interested in building a cluster as fast as possible
could check the list and buy something absolutely 100% compatible with
Rocks.

I know that in principle one could check the compatibility list of RH,
but my own experience was negative in that aspect (I installed an Adaptec
IDE RAID controller, supported by RH7.3, but Rocks 2.3 was unable to
recognize it).

Miguel

From mjk at sdsc.edu Wed Dec 17 09:03:00 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Wed, 17 Dec 2003 09:03:00 -0800
Subject: [Rocks-Discuss]Creation of a hardware compatibility list?
In-Reply-To: <3FE00EF3.4020809@tupi.dmt.upm.es>
References: <3FE00EF3.4020809@tupi.dmt.upm.es>
Message-ID: <DEEE58E0-30B2-11D8-9543-000A95DA5638@sdsc.edu>

We have thought about this, and have some ideas on how to set up a useful
page. Something like the old Linux laptop hardware list, but simpler to
mine for data. It's been on our long list of things to do for a while
now :)

    -mjk

On Dec 17, 2003, at 12:08 AM, Miguel Hermanns wrote:

> Since one of the strong features of Rocks is the possibility of fast
> deployment of clusters, wouldn't it be of interest to create a hardware
> compatibility list on the web page of Rocks? This list could be filled
> in by the users of Rocks with their experience and the hardware they
> have.
> In this way somebody interested in building a cluster as fast as
> possible could check the list and buy something absolutely 100%
> compatible with Rocks.
>
> I know that in principle one could check the compatibility list of RH,
> but my own experience was negative in that aspect (I installed an
> Adaptec IDE RAID controller, supported by RH7.3, but Rocks 2.3 was
> unable to recognize it).
>
> Miguel
From junkscarce at hotmail.com Wed Dec 17 09:31:21 2003
From: junkscarce at hotmail.com (Reed Scarce)
Date: Wed, 17 Dec 2003 17:31:21 +0000
Subject: [Rocks-Discuss]fdisk reports all zeros, need actual
Message-ID: <BAY1-F978XKPl5GDrPi0003db4e@hotmail.com>

Good ol' fdisk "print" on my compute node gives me a line:

Device Boot Start End Blocks Id System

but no data.

Extra functionality's "print" reports:

Nr AF  Hd Sec Cyl  Hd Sec Cyl  Start  Size  ID
 1 00   0   0   0   0   0   0      0     0   0
 2 00   0   0   0   0   0   0      0     0   0
 3 00   0   0   0   0   0   0      0     0   0
 4 00   0   0   0   0   0   0      0     0   0

How can I retrieve the information necessary for scripting at node
installation time?

TIA

--RRS

_________________________________________________________________
Enjoy the holiday season with great tips from MSN.
http://special.msn.com/network/happyholidays.armx

From dtwright at uiuc.edu Wed Dec 17 11:49:53 2003
From: dtwright at uiuc.edu (Dan Wright)
Date: Wed, 17 Dec 2003 13:49:53 -0600
Subject: [Rocks-Discuss]a minor ganglia question
In-Reply-To: <200312171349.24485.csamuel@vpac.org>
References: <20031216194554.GH26246@uiuc.edu> <200312171349.24485.csamuel@vpac.org>
Message-ID: <20031217194953.GS26246@uiuc.edu>

Eh... whatever ;-) I started using Rocks with 2.2.1 (when there was no
physical layout display) and haven't read the manual again since :)

Chris Samuel said:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Wed, 17 Dec 2003 06:45 am, Dan Wright wrote:
>
>> Is there an easy way to switch this display around so it matches the
>> real physical layout?
>
> I think this is why they tell you to install the compute nodes from the
> bottom of the rack. :-)
>
> cheers,
> Chris
> - --
> Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
> Victorian Partnership for Advanced Computing http://www.vpac.org/
> Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.2.2 (GNU/Linux)
>
> iD8DBQE/38QyO2KABBYQAh8RAo+vAJ0XcP6tBJpwjxYnicEQkysRslWmmQCcDpeb
> K8bNCLgiF5umMiJ/59ICN70=
> =57YJ
> -----END PGP SIGNATURE-----

- Dan Wright
(dtwright at uiuc.edu) (http://www.uiuc.edu/~dtwright)
-] ------------------------------ [-] -------------------------------- [-
``Weave a circle round him thrice, / And close your eyes with holy dread,
For he on honeydew hath fed, / and drunk the milk of Paradise.''
Samuel Taylor Coleridge, Kubla Khan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031217/a3718aef/attachment-0001.bin

From dtwright at uiuc.edu Wed Dec 17 11:51:00 2003
From: dtwright at uiuc.edu (Dan Wright)
Date: Wed, 17 Dec 2003 13:51:00 -0600
Subject: [Rocks-Discuss]a minor ganglia question
In-Reply-To: <63C818CD-3035-11D8-8652-000393A4725A@sdsc.edu>
References: <20031216194554.GH26246@uiuc.edu> <63C818CD-3035-11D8-8652-000393A4725A@sdsc.edu>
Message-ID: <20031217195100.GT26246@uiuc.edu>

Federico,

Thanks! That'll make this easy enough... maybe next time I'll read the
manual and install the machines in the Rocks-recommended order, as
another poster suggested :)

Federico Sacerdoti said:

> Dan,
>
> Good question. Unfortunately this behavior is hardwired into stock
> Ganglia, not the Rocks-specific pages that we have more control over.
>
> The good news is that I wrote the code for this page :) Its easy to fix
> if you would like to do it yourself.
>
> Edit the file /var/www/html/ganglia/functions.php. On line 386, you
> should see:
>
> krsort($racks[$rack]);
>
> To get the ordering you desire, change this to:
>
> ksort($racks[$rack]);
>
> That's it. You should see the high-numbered compute nodes at the bottom
> of the rack. I will see if we can get a config file button on the page
> to give this option for a later release of Ganglia.
>
> -Federico
>
> On Dec 16, 2003, at 11:45 AM, Dan Wright wrote:
>
>> Hello all,
>>
>> I'm in the process of setting up a 3.0.0 cluster and have a question
>> about the "Physical view" in ganglia. In this view (which is quite
>> cool BTW :) it shows higher-numbered nodes on top and lower-numbered
>> nodes on bottom:
>>
>> compute-0-12
>> ...
>> compute-0-2
>> compute-0-1
>> compute-0-0
>>
>> and my cluster is physically reversed from that:
>>
>> compute-0-0
>> compute-0-1
>> compute-0-2
>> ...
>> compute-0-12
>>
>> Is there an easy way to switch this display around so it matches the
>> real physical layout? I poked around in ganglia for a few minutes and
>> didn't see anything obvious, so I thought I'd ask before I actually
>> start wasting time on this :)
>>
>> Thanks,
>>
>> - Dan Wright
>> (dtwright at uiuc.edu)
>> (http://www.scs.uiuc.edu/)
>> (UNIX Systems Administrator, School of Chemical Sciences, UIUC)
>> (333-1728)
>
> Federico
>
> Rocks Cluster Group, San Diego Supercomputing Center, CA

- Dan Wright
(dtwright at uiuc.edu) (http://www.uiuc.edu/~dtwright)
-] ------------------------------ [-] -------------------------------- [-
``Weave a circle round him thrice, / And close your eyes with holy dread,
For he on honeydew hath fed, / and drunk the milk of Paradise.''
Samuel Taylor Coleridge, Kubla Khan
-------------- next part --------------
    A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20031217/620937b3/attachment-0001.bin From bruno at rocksclusters.org Wed Dec 17 12:52:30 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Wed, 17 Dec 2003 12:52:30 -0800 Subject: [Rocks-Discuss]fidsk reports all zeros, need actual In-Reply-To: <BAY1-F978XKPl5GDrPi0003db4e@hotmail.com> References: <BAY1-F978XKPl5GDrPi0003db4e@hotmail.com> Message-ID: <EDF0DAE8-30D2-11D8-B821-000A95C4E3B4@rocksclusters.org> > Good ol' fdisk "print" on my compute node gives me a line: > Device Boot Start End Blocks Id System > > but no data. > > Extra Functionality's "print" reports > Nr AF Hd Sec Cyl Hd Sec Cyl Start Size ID > 1 00 0 0 0 0 0 0 0 0 0 > 2 00 0 0 0 0 0 0 0 0 0 > 3 00 0 0 0 0 0 0 0 0 0 > 4 00 0 0 0 0 0 0 0 0 0 > > How can I retrieve the information necessary for scripted information > at node installation time? this should answer your question: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-February/001388.html - gb From anand at novaglobal.com.sg Wed Dec 17 20:14:45 2003 From: anand at novaglobal.com.sg (Anand Vaidya) Date: Wed, 17 Dec 2003 23:14:45 -0500 Subject: [Rocks-Discuss]Creation of a hardware compatibility list? In-Reply-To: <DEEE58E0-30B2-11D8-9543-000A95DA5638@sdsc.edu> References: <3FE00EF3.4020809@tupi.dmt.upm.es> <DEEE58E0-30B2-11D8-9543-000A95DA5638@sdsc.edu> Message-ID: <200312172314.48434.anand@novaglobal.com.sg> Why not create a Wiki? Wiki is easy enough to install (60 seconds?) and just the right tool for user-driven projects like Rocks. Nice example of wiki wiki webs are http://en.wikipedia.org/ or even my favourite GentooServer project has a very nice wiki at http://www.subverted.net/wakka/wakka.php?wakka=MainPage (Though not related to clustering) Regards, Anand
    On Wednesday 17 December 2003 12:03, Mason J. Katz wrote: > We have thought about this, and have some ideas on how to setup a > useful page. Something like the old Linux laptop hardware list but > simpler to mine for data. It's been on our long list of things to do > for a while now :) > > -mjk > > On Dec 17, 2003, at 12:08 AM, Miguel Hermanns wrote: > > Since one of the strong features of Rocks is the possibility of fast > > deployment of clusters, wouldn't it be of interest to create a > > hardware compatibility list on the web page of Rocks? This list could > > be filled in by the users of Rocks with their experience and the > > hardware they have. In this way somebody interested in building a > > cluster as fast as possible could check the list and buy something > > absolutely 100% compatible with Rocks. > > > > I know that in principle one could check the compatibility list of RH, > > but my own experience was negative in that aspect (I installed an > > Adaptec IDE RAID controller, supported by RH7.3, but Rocks 2.3 was > > unable to recognize it). > > > > Miguel - From mjk at sdsc.edu Thu Dec 18 08:02:14 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Thu, 18 Dec 2003 08:02:14 -0800 Subject: [Rocks-Discuss]Creation of a hardware compatibility list? In-Reply-To: <200312172314.48434.anand@novaglobal.com.sg> References: <3FE00EF3.4020809@tupi.dmt.upm.es> <DEEE58E0-30B2-11D8-9543-000A95DA5638@sdsc.edu> <200312172314.48434.anand@novaglobal.com.sg> Message-ID: <8BA1598E-3173-11D8-9543-000A95DA5638@sdsc.edu> I've been thinking about a rocks wiki for a few months now, but I'm a bit paranoid about the lack of authentication for updates (basically anyone can modify your site). If there is interest out there, we could just set one up, leave it alone, and let our users worry about the content.
Done well this could have information on:
- hardware issues
- bug reports
- feature requests
- contributed documentation (to be moved into our users manual)
- etc.
Basically a simple version of sourceforge (we have no plans to move to sourceforge -- the interface and bandwidth both stink). Ideas....? -mjk On Dec 17, 2003, at 8:14 PM, Anand Vaidya wrote:
    > Why not create a Wiki? Wiki is easy enough to install (60 seconds?) and > just > the right tool for user-driven projects like Rocks. > > Nice example of wiki wiki webs are http://en.wikipedia.org/ or even my > favourite GentooServer project has a very nice wiki at http:// > www.subverted.net/wakka/wakka.php?wakka=MainPage (Though not related to > clustering) > > Regards, > Anand > > On Wednesday 17 December 2003 12:03, Mason J. Katz wrote: >> We have thought about this, and have some ideas on how to setup a >> useful page. Something like the old Linux laptop hardware list but >> simpler to mine for data. It's been on our long list of things to do >> for a while now :) >> >> -mjk >> >> On Dec 17, 2003, at 12:08 AM, Miguel Hermanns wrote: >>> Since one of the strong features of Rocks is the possibility of fast >>> deployment of clusters, wouldn't it be of interest to create a >>> hardware compatibility list on the web page of Rocks? This list could >>> be filled in by the users of Rocks with their experience and the >>> hardware they have. In this way somebody interested in building a >>> cluster as fast as possible could check the list and buy something >>> absolutely 100% compatible with Rocks. >>> >>> I know that in principle one could check the compatibility list of >>> RH, >>> but my own experience was negative in that aspect (I installed an >>> Adaptec IDE RAID controller, supported by RH7.3, but Rocks 2.3 was >>> unable to recognize it). >>> >>> Miguel > > - From hermanns at tupi.dmt.upm.es Fri Dec 19 00:47:11 2003 From: hermanns at tupi.dmt.upm.es (Miguel Hermanns) Date: Fri, 19 Dec 2003 09:47:11 +0100 Subject: [Rocks-Discuss]Creation of a hardware compatibility list?
Message-ID: <3FE2BB0F.4060908@tupi.dmt.upm.es> >>I've been thinking about a rocks wiki for a few months now, but I'm a bit paranoid about the lack of authentication for updates (basically anyone can modify your site).<< One possible filter could be that only the users of the registered clusters can modify the wiki (So that when you submit the data of the cluster you also include a user and a password), although in that case I would be excluded, since our cluster has been unable to work with Rocks yet :-(. >> - hardware issues
>> - bug reports
>> - feature requests
>> - contributed documentation (to be moved into our users manual)
>> - etc.
So for example the cluster register could be editable by the registered users (each one only its entry) and could include a description of the installed hardware (not just the processor, but also the motherboard model, the hard disks, NICs, etc). So everybody interested in building a cluster could go to the register, have a look and click on the different clusters that are similar to the one in mind. After that with just a click the user could review the hardware configuration and the encountered problems. This would also be great if the Rocks clusters get updated, because then their builders could go and update their entry without needing to submit an email to the Rocks team, hence avoiding giving them extra work. In order to include the not yet working Rocks clusters, the database of clusters (with the corresponding users and passwords) could be extended by them, but their entries would not be shown on the Rocks register until they are fully working. In this way information on the hardware incompatibilities can be collected and could be shown on a different part of www.rocksclusters.org. The feature requests would still be handled through the mailing list, and for the contributed documentation I would place the source files in read-only mode on the ftp server; if somebody goes and makes modifications on them, then the new version should be emailed to the persons in charge of the docs to give their approval. Miguel From jkreuzig at uci.edu Fri Dec 19 16:58:58 2003 From: jkreuzig at uci.edu (James Kreuziger) Date: Fri, 19 Dec 2003 16:58:58 -0800 (PST) Subject: [Rocks-Discuss]Dell Power Connect 5224 In-Reply-To: <1062015636.6781.100.camel@babylon.physics.ncsu.edu> References: <1062015636.6781.100.camel@babylon.physics.ncsu.edu> Message-ID: <Pine.GSO.4.58.0312191642260.19504@massun.ucicom.uci.edu> Ok, I need some help here.
I've managed to setup my frontend node, and it is up and running. I have my 8 nodes all connected up to a Dell Power Connect 5224. I can access the switch through a serial terminal and get a command line interface. The little lights on the front of the switch are blinking, so that's good. However, I can't get the switch recognized by insert-ethers. I've even managed to change the IP of the switch through the CLI, but I can't see the switch from the frontend node. I can't telnet, get the web interface or anything. I haven't saved the configuration, so a reboot of the switch will reset the values. I'm grasping at straws here. I'm not a network engineer, so I could use some help getting this thing configured.
    If anybody can help me out, contact me by email. Thanks, -Jim ************************************************* Jim Kreuziger jkreuzig at uci.edu 949-824-4474 ************************************************* From tim.carlson at pnl.gov Fri Dec 19 17:24:22 2003 From: tim.carlson at pnl.gov (Tim Carlson) Date: Fri, 19 Dec 2003 17:24:22 -0800 (PST) Subject: [Rocks-Discuss]Dell Power Connect 5224 In-Reply-To: <Pine.GSO.4.58.0312191642260.19504@massun.ucicom.uci.edu> Message-ID: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov> On Fri, 19 Dec 2003, James Kreuziger wrote: I think we need a Rocks FAQ https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-August/002762.html You need to turn on fast-link. > Ok, I need some help here. I've managed to setup > my frontend node, and it is up and running. I have > my 8 nodes all connected up to a Dell Power Connect 5224. > I can access the switch through a serial terminal and > get a command line interface. The little lights on the > front of the switch are blinking, so that's good. > > However, I can't get the switch recognized by insert-ethers. > I've even managed to change the IP of the switch through > the CLI, but I can't see the switch from the frontend node. > I can't telnet, get the web interface or anything. I haven't > saved the configuration, so a reboot of the switch will > reset the values. > > I'm grasping at straws here. I'm not a network engineer, > so I could use some help getting this thing configured. > > If anybody can help me out, contact me by email. > > Thanks, > > -Jim > > ************************************************* > Jim Kreuziger > jkreuzig at uci.edu > 949-824-4474 > *************************************************
    > > > Tim Carlson Voice: (509)376 3423 Email: Tim.Carlson at pnl.gov EMSL UNIX System Support From Georgi.Kostov at umich.edu Fri Dec 19 17:34:15 2003 From: Georgi.Kostov at umich.edu (Georgi Kostov) Date: Fri, 19 Dec 2003 20:34:15 -0500 Subject: [Rocks-Discuss]Dell Power Connect 5224 In-Reply-To: <Pine.GSO.4.58.0312191642260.19504@massun.ucicom.uci.edu> References: <1062015636.6781.100.camel@babylon.physics.ncsu.edu> <Pine.GSO.4.58.0312191642260.19504@massun.ucicom.uci.edu> Message-ID: <1071884055.3fe3a717b3efc@carrierpigeon.mail.umich.edu> Jim, I have a 5224 here. What are your config settings on the switch? I.e. IP, sub-net mask, gateway settings - for both the switch and the interface of the head-node on which the 5224 is connected (I assume it's on the private subnet, so the subnet is something like 10.0.0.0/255.0.0.0 with the frontend internal interface (eth0) as 10.0.1.1, right?) One thing to try on the head node is use (as root) "tcpdump eth0", and watch for packets. To avoid clutter, I would either turn the rest (compute nodes, etc.) off, or filter them out with settings on tcpdump. With some more info we should be able to tease this out. --Georgi Michigan Center for Biological Information (MCBI) University of Michigan 3600 Green Court, Suite 700 Ann Arbor, MI 48105-1570 Phone/Fax: (734) 998-9236/8571 kostov at umich.edu www.ctaalliance.org Quoting James Kreuziger <jkreuzig at uci.edu>: > Ok, I need some help here. I've managed to setup > my frontend node, and it is up and running. I have > my 8 nodes all connected up to a Dell Power Connect 5224. > I can access the switch through a serial terminal and > get a command line interface. The little lights on the > front of the switch are blinking, so that's good. > > However, I can't get the switch recognized by insert-ethers. > I've even managed to change the IP of the switch through > the CLI, but I can't see the switch from the frontend node. > I can't telnet, get the web interface or anything. I haven't
    > saved the configuration, so a reboot of the switch will > reset the values. > > I'm grasping at straws here. I'm not a network engineer, > so I could use some help getting this thing configured. > > If anybody can help me out, contact me by email. > > Thanks, > > -Jim > > ************************************************* > Jim Kreuziger > jkreuzig at uci.edu > 949-824-4474 > ************************************************* > > > From daniel.kidger at quadrics.com Mon Dec 22 01:45:47 2003 From: daniel.kidger at quadrics.com (Dan Kidger) Date: Mon, 22 Dec 2003 09:45:47 +0000 Subject: Fwd: Re: [Rocks-Discuss]Dell Power Connect 5224 Message-ID: <200312220945.47665.daniel.kidger@quadrics.com> ---------- Forwarded Message ---------- Subject: Re: [Rocks-Discuss]Dell Power Connect 5224 Date: Mon, 22 Dec 2003 09:38:41 +0000 From: Dan Kidger <daniel.kidger at quadrics.com> To: Georgi Kostov <Georgi.Kostov at umich.edu> Cc: paci-rocks-discussion at sdsc.edu > Quoting James Kreuziger <jkreuzig at uci.edu>: > > Ok, I need some help here. I've managed to setup > > my frontend node, and it is up and running. I have > > my 8 nodes all connected up to a Dell Power Connect 5224. > > I can access the switch through a serial terminal and > > get a command line interface. The little lights on the > > front of the switch are blinking, so that's good. > > > > However, I can't get the switch recognized by insert-ethers. > > I've even managed to change the IP of the switch through > > the CLI, but I can't see the switch from the frontend node. > > I can't telnet, get the web interface or anything. I haven't > > saved the configuration, so a reboot of the switch will > > reset the values. I don't know much about the 5224 per se, but I do know that much of the time embedded devices *have* to be rebooted to pick up new settings for their IP.
Once done, I would try pinging the switch's IP and then doing 'arp -a' to see its MAC address (which should match that on the white sticky label on the back).
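[Editor's note] Before the ping/arp step, it is worth sanity-checking the config values Georgi asks about: the switch's static IP has to fall inside the frontend's private subnet at all, or nothing on eth0 will reach it. A quick sketch in modern Python (the addresses are made-up examples; Rocks' default private network of 10.0.0.0/255.0.0.0 is taken from the thread):

```python
# A managed switch is only reachable from the frontend's internal
# interface if its IP is inside the cluster's private network.
import ipaddress

private_net = ipaddress.ip_network("10.0.0.0/255.0.0.0")   # default Rocks layout
switch_ip = ipaddress.ip_address("10.1.1.254")             # hypothetical setting
stray_ip = ipaddress.ip_address("192.168.1.254")           # common factory default

print(switch_ip in private_net)  # True: the frontend can route to it
print(stray_ip in private_net)   # False: ping/telnet from eth0 will fail
```

A switch left on its factory-default address (often a 192.168.x.x value) would fail this check and explain the symptoms Jim reports.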
    Daniel. -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- ------------------------------------------------------- -- Yours, Daniel. -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- From daniel.kidger at quadrics.com Mon Dec 22 09:03:56 2003 From: daniel.kidger at quadrics.com (daniel.kidger at quadrics.com) Date: Mon, 22 Dec 2003 17:03:56 -0000 Subject: [Rocks-Discuss]RE:Writing a Roll ? Message-ID: <30062B7EA51A9045B9F605FAAC1B4F622357D0@tardis0.quadrics.com> Folks, I have made good headway in adding software and its configuration using extend-compute.xml and now have a robust system. (the head node install is still rather manual though :-( ) I would now like to move to doing this as a Roll. However I am not sure of the best way of proceeding - there appears to be little documentation - either on HOWTO or on the underlying concepts. I have mounted the HPC_roll.iso and browsed around: - the image seems to consist of 2 subdirectories - in the same style as RedHat CD's - as expected ./SRPMS contains the source RPMs, and ./RedHat/RPMS contains binary RPMs ( the latter contains many more RPMs than there is an SRPM for. ) There is no obvious configuration information until you explore: roll-hpc-kickstart-3.0.0-0.noarch.rpm This seems to contain lots of XML which at first glance is hard to decipher. So my question is: Should we be writing our own rolls, and if so how ? (examples?) Yours, Daniel. -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505
    ----------------------- www.quadrics.com -------------------- From daniel.kidger at quadrics.com Mon Dec 22 09:08:21 2003 From: daniel.kidger at quadrics.com (daniel.kidger at quadrics.com) Date: Mon, 22 Dec 2003 17:08:21 -0000 Subject: [Rocks-Discuss]shucks. Message-ID: <30062B7EA51A9045B9F605FAAC1B4F622461C9@tardis0.quadrics.com> # rpm -ql roll-hpc-kickstart |xargs -l grep -inH sucks /export/home/install/profiles/current/nodes/force-smp.xml:21: IBM sucks /export/home/install/profiles/current/nodes/ganglia-server.xml:134: perl sucks /export/home/install/profiles/current/nodes/ganglia-server.xml:148: Switch from ISC to RedHat's pump. Pump sucks but it is standard so /export/home/install/profiles/current/nodes/sendmail-masq.xml:31: m4 sucks :-) Have a good Christmas, Daniel. -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- From fds at sdsc.edu Mon Dec 22 10:22:54 2003 From: fds at sdsc.edu (Federico Sacerdoti) Date: Mon, 22 Dec 2003 10:22:54 -0800 Subject: [Rocks-Discuss]RE:Writing a Roll ? In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F622357D0@tardis0.quadrics.com> References: <30062B7EA51A9045B9F605FAAC1B4F622357D0@tardis0.quadrics.com> Message-ID: <DBF30128-34AB-11D8-8652-000393A4725A@sdsc.edu> You are right, we have little documentation on creating new rolls. I have lamented to Greg about this, and he has done the same to me. Basically we have been so busy trying to get the 3.1.0 release out that we haven't put our nose to the grindstone about the Developer docs. Here is a little primer since it sounds like you are indeed ready. 1. The first thing to realize is that rolls are not built from "scratch", but are done from the safe confines of our build environment.
This environment is the directory: [your local rocks CVS sandbox]/src/roll/ You must check out the Rocks CVS tree to get this. Instructions about how to do this (anonymously) are at http://cvs.rocksclusters.org/. Once you have this build environment on your frontend system, you are ready for the next step to building your roll. You should make a new directory here called "quadrics" - the name matters as it will be the identifier for your roll from now on.
    2. Now the best thing I can tell you is to look at the "hpc" and "sge" roll (two of our most mature) for the directory structure in "quadrics". It's fairly straightforward, and mirrors what we do for the base. The "nodes" directory will hold your "extend-compute.xml", etc. (more on this later). The "roll-quadrics-kickstart.noarch.rpm" is made automatically for you from information in these directories. 3. The "src" dir holds anything you need to compile. Anything in src should deposit an RPM package in the "RPMS" directory when its build is finished. 4. You type "make roll" to start the build process. It will take a bit of study for you to get things correct, but suffice it to say that you will have an iso file suitable for burning when you are done. Thank bruno for this sweet fact - everything is automatic except your intellectual property :) One more word on your XML files. Our philosophy of rolls is not to use the "extend/replace" strategy that we advocate for customization. As a roll builder, you are at the grass-roots level, and can rise above simple customization techniques. Your roll should define a "quadrics.xml" node in the kickstart graph. You define the node in the file "roll/quadrics/nodes/quadrics.xml" and the edges in the file "roll/quadrics/graphs/default/quadrics.xml". Look at the SGE roll for a good example of this. By defining your configuration this way, you have more power to do complex tasks (different configuration for different appliance types), and to leave room for future growth. Good luck, and we hope and pray for a good technical writer that will do this process justice. -Federico On Dec 22, 2003, at 9:03 AM, daniel.kidger at quadrics.com wrote: > Folks, > I have made good headway in adding software and its configuration > using extend-compute.xml and now have a robust system. (the head node > install is still rather manual though :-( ) > > I would now like to move to doing this as a Roll.
However I am not > sure of the best way of proceeding - there appears to be little > documentation - either on HOWTO or on the underlying concepts. > > I have mounted the HPC_roll.iso and browsed around: > - the image seems to consist of 2 subdirectories - in the same style > as RedHat CD's > - as expected ./SRPMS contains the source RPMs, and ./RedHat/RPMS > contains binary RPMs > ( the latter contains many more RPMs than there is an SRPM for. ) > > There is no obvious configuration information until you explore: > roll-hpc-kickstart-3.0.0-0.noarch.rpm > This seems to contain lots of XML which at first glance is hard to > decipher. >
    > So my question is: > Should we be writing our own rolls, and if so how ? (examples?) > > > Yours, > Daniel. > > -------------------------------------------------------------- > Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com > One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 > ----------------------- www.quadrics.com -------------------- > >> >> Federico Rocks Cluster Group, San Diego Supercomputing Center, CA From mjk at sdsc.edu Mon Dec 22 11:07:32 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Mon, 22 Dec 2003 11:07:32 -0800 Subject: [Rocks-Discuss]shucks. In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F622461C9@tardis0.quadrics.com> References: <30062B7EA51A9045B9F605FAAC1B4F622461C9@tardis0.quadrics.com> Message-ID: <18168448-34B2-11D8-8AD9-000A95DA5638@sdsc.edu> If these are the worst CVS log comments you've found you aren't looking very hard. The only one here I'm compelled to clarify is IBM. There are around 3-5 ways of probing the chipset to determine if the box is SMP; RedHat supports the most common ones, which everyone in the world except IBM uses. This forced us to patch anaconda to detect SMP for IBM hardware (or in this case just force it) -- didn't these guys invent the PC? -mjk On Dec 22, 2003, at 9:08 AM, daniel.kidger at quadrics.com wrote: > > # rpm -ql roll-hpc-kickstart |xargs -l grep -inH sucks > > /export/home/install/profiles/current/nodes/force-smp.xml:21: IBM > sucks > /export/home/install/profiles/current/nodes/ganglia-server.xml:134: > perl sucks > /export/home/install/profiles/current/nodes/ganglia-server.xml:148: > Switch from ISC to RedHat's pump. Pump sucks but it is standard so > /export/home/install/profiles/current/nodes/sendmail-masq.xml:31: m4 > sucks > > :-) > > Have a good Christmas, > Daniel. > > -------------------------------------------------------------- > Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com
    > One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 > ----------------------- www.quadrics.com -------------------- From mjk at sdsc.edu Mon Dec 22 11:13:30 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Mon, 22 Dec 2003 11:13:30 -0800 Subject: [Rocks-Discuss]RE:Writing a Roll ? In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F622357D0@tardis0.quadrics.com> References: <30062B7EA51A9045B9F605FAAC1B4F622357D0@tardis0.quadrics.com> Message-ID: <EDBC4D7D-34B2-11D8-9250-000A95DA5638@sdsc.edu> http://cvs.rocksclusters.org In the rocks/src/roll directory you can see several roll examples, all of which are built by typing "make roll". The roll-*-kickstart.*.noarch.rpm is the real magic that includes the XML profiles that are grafted onto the base kickstart graph. -mjk On Dec 22, 2003, at 9:03 AM, daniel.kidger at quadrics.com wrote: > Folks, > I have made good headway in adding software and its configuration > using extend-compute.xml and now have a robust system. (the head node > install is still rather manual though :-( ) > > I would now like to move to doing this as a Roll. However I am not > sure of the best way of proceeding - there appears to be little > documentation - either on HOWTO or on the underlying concepts. > > I have mounted the HPC_roll.iso and browsed around: > - the image seems to consist of 2 subdirectories - in the same style > as RedHat CD's > - as expected ./SRPMS contains the source RPMs, and ./RedHat/RPMS > contains binary RPMs > ( the latter contains many more RPMs than there is an SRPM for. ) > > There is no obvious configuration information until you explore: > roll-hpc-kickstart-3.0.0-0.noarch.rpm > This seems to contain lots of XML which at first glance is hard to > decipher. > > So my question is: > Should we be writing our own rolls, and if so how ? (examples?) > > > Yours, > Daniel. > > -------------------------------------------------------------- > Dr. Dan Kidger, Quadrics Ltd.
daniel.kidger at quadrics.com > One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 > ----------------------- www.quadrics.com -------------------- > >>
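[Editor's note] Pulling Federico's primer and mjk's pointer together, the roll skeleton they describe can be sketched in a few shell commands. This is a sketch only: the directory names follow the primer, and the CVS checkout itself (anonymous instructions at http://cvs.rocksclusters.org/) is deliberately left as a comment.

```shell
# Inside your checked-out Rocks CVS sandbox, rolls live under src/roll/.
# Create a "quadrics" roll skeleton next to the hpc and sge examples;
# the directory name becomes the roll's identifier.
mkdir -p src/roll/quadrics/nodes           # quadrics.xml kickstart node lives here
mkdir -p src/roll/quadrics/graphs/default  # edges grafted into the kickstart graph
mkdir -p src/roll/quadrics/src             # anything that must compile into an RPM
mkdir -p src/roll/quadrics/RPMS            # built (or prebuilt) binary RPMs land here

ls src/roll/quadrics
# With the skeleton populated, "make roll" in src/roll/quadrics drives the
# ISO build, per Federico's step 4.
```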
From daniel.kidger at quadrics.com Mon Dec 22 11:12:17 2003 From: daniel.kidger at quadrics.com (daniel.kidger at quadrics.com) Date: Mon, 22 Dec 2003 19:12:17 -0000 Subject: [Rocks-Discuss]RE:Writing a Roll ? Message-ID: <30062B7EA51A9045B9F605FAAC1B4F622357D1@tardis0.quadrics.com> Federico, > Here is a little primer since it sounds like you are indeed ready. > --- many very informative lines deleted --- thanks for that long reply. :-) I am currently pulling a copy of the source tree from cvs.rocksclusters.org (194MB of rocks/doc alone !) Just a couple of questions for now:
1. Do rolls have to be CD based ? (during development I would probably get through a lot of CDROMs - but more importantly it would get a bit fiddly - to keep walking round to the CD-writer - then nipping off to the room with the cluster in every time)
2. Do I have to reinstall the headnode from scratch each time I want to test a roll ? (even if the roll only affects RPMs that get installed on compute nodes)
3. Can a CD contain multiple rolls? (Once mature - a cluster may have quite a few rolls: pbs, sge, gm, IB, etc. and Quadrics would probably have two - the (open-source) hardware drivers, MPI, etc. and also RMS - our (closed-source) cluster Resource Manager.)
4. What subset of the cvs tree does a Roll developer need? The whole tree is clearly rather excessive.
5. I am a little concerned about the amount of bloat needed to install our five RPMs as a Roll. (The RPMs are already prebuilt by our own internal build procedures). So taking another case - let's say the Intel Compilers - These have 4 RPMs (plus a little sed-ery of their config files and pasting in the license file). Would these be best installed as a Roll or as a simple extend-compute.xml as I have currently?
Yours, Daniel. -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd.
daniel.kidger at quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- From sjenks at uci.edu Mon Dec 22 11:17:07 2003 From: sjenks at uci.edu (Stephen Jenks) Date: Mon, 22 Dec 2003 11:17:07 -0800 Subject: [Rocks-Discuss]rocks-dist suggestion Message-ID: <6F2FB100-34B3-11D8-88FD-000A95B96C68@uci.edu>
    Hi ROCKS folks, Just a suggestion for when you guys are bored after the 3.1 release 8-) I ran into some trouble installing some updates to a ROCKS 3.0 cluster that could easily be solved with some checking in rocks-dist: I put the openssh and other updates in the proper contrib directory under /home/install and ran "rocks-dist dist" which properly updated the distribution. The problem occurred when I tried to reload the compute nodes - the install failed when it hit any of the RPMs in the contrib directory. It turns out the protections on those RPMs were set to 600 because I had copied them out of root's home directory, thus they couldn't be read by the server to send them down to the compute nodes. After fixing the permissions, all was well. So rocks-dist should check (and possibly fix) permissions on files that will be included in the kickstart distribution. I realize that the mistake was entirely mine, but I'm probably not the only one to ever forget to set permissions correctly and the tool could easily catch such mistakes. Thanks for putting together such a useful cluster distribution! Steve Jenks From msherman at informaticscenter.info Mon Dec 22 11:50:03 2003 From: msherman at informaticscenter.info (Mark Sherman) Date: Mon, 22 Dec 2003 12:50:03 -0700 Subject: [Rocks-Discuss]MPI and memory + node rescue Message-ID: <20031222195003.7688.qmail@webmail4.mesa1.secureserver.net> just for future consideration... any time I need to look at a system without booting it or its ability to boot I just throw in the knoppix cd.
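[Editor's note] Steve's rocks-dist suggestion above amounts to a permissions walk over the distribution tree. A rough sketch of such a check (the function name and paths are illustrative, not the actual rocks-dist code):

```python
# Flag RPMs that lack world-read permission. Apache serves the kickstart
# distribution over http, so a mode-600 file fails exactly as Steve describes.
import os
import stat

def unreadable_rpms(tree):
    """Return paths of .rpm files under `tree` that 'other' users cannot read."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(tree):
        for name in filenames:
            if name.endswith(".rpm"):
                path = os.path.join(dirpath, name)
                if not os.stat(path).st_mode & stat.S_IROTH:
                    bad.append(path)
    return sorted(bad)
```

Running something like this over /home/install/contrib before (or from) "rocks-dist dist" would have caught the 600-mode files. Note it only checks file modes; as Dan Kidger's follow-up points out, directory traverse bits along symlinked paths matter too and are not covered here.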
www.knoppix.org ______________________________________________ Mark Sherman Computing Systems Administrator Informatics Center Massachusetts Biomedical Initiatives Worcester MA 01605 508-797-4200 msherman at informaticscenter.info ----------------------~----------------------- > -------- Original Message -------- > Subject: Re: [Rocks-Discuss]MPI and memory + node rescue > From: "Trond SAUE" <saue at quantix.u-strasbg.fr> > Date: Thu, November 27, 2003 1:38 am > To: "Stephen P. Lebeau" <lebeau at openbiosystems.com> > Cc: npaci-rocks-discussion at sdsc.edu > > On 2003.11.26 16:52, Stephen P. Lebeau wrote:
    > > If you go here, they talk about creating a Linux floppy > > repair disk. Make sure to read the README file... they > > require that you make a 1.68MB floppy ( README explains how ) > > > > http://www.tux.org/pub/people/kent-robotti/looplinux/rip/ > > > > If that doesn't work... > > > > http://www.toms.net/rb/download.html > > > > I've actually used this one before. > > > > -S > > > In order to have a look at the disk of my crashed node, I downloaded > RIP-2.2-1680.bin from the first site, but I was not able to boot > properly. However, tomsrtbt-2.0.103 from the second site worked very > well and allowed me to reboot the node as well as mount its disk to > look at messages. Unfortunately, they did not really tell me anything > more... However, it might be an idea for a future release of ROCKS to > include a second "standalone" boot option for the compute nodes, so > that one can access them independent of the frontend.... > All the best, > Trond Saue > -- > Trond SAUE (DIRAC: > http://dirac.chem.sdu.dk/) > Laboratoire de Chimie Quantique et Modélisation Moléculaire > Université Louis Pasteur ; 4, rue Blaise Pascal ; F-67000 STRASBOURG > tél: 03 90 24 13 01 fax: 03 90 24 15 89 email: saue at quantix.u- > strasbg.fr From daniel.kidger at quadrics.com Mon Dec 22 11:51:16 2003 From: daniel.kidger at quadrics.com (daniel.kidger at quadrics.com) Date: Mon, 22 Dec 2003 19:51:16 -0000 Subject: [Rocks-Discuss]rocks-dist suggestion Message-ID: <30062B7EA51A9045B9F605FAAC1B4F622461CD@tardis0.quadrics.com> > Just a suggestion for when you guys are bored after the 3.1 > release 8-) > The problem occurred when I tried to reload the compute nodes - the > install failed when it hit any of the RPMs in the contrib > directory. It > turns out the protections on those RPMs were set to 600 because I had > copied them out of root's home directory, thus they couldn't > be read by > the server to send them down to the compute nodes. After fixing the > permissions, all was well.
This is a 'me-too' reply. Rocks reads the RPMs using http - hence they
need to be readable by user apache. With symlinks it is all too easy,
even if the RPMs themselves are 644, for part of the directory tree to
be somewhere not walkable by a third-party userid like apache.

Yours,
Daniel.
--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------

From fds at sdsc.edu Mon Dec 22 15:26:01 2003
From: fds at sdsc.edu (Federico Sacerdoti)
Date: Mon, 22 Dec 2003 15:26:01 -0800
Subject: [Rocks-Discuss]RE: Writing a Roll?
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F622357D1@tardis0.quadrics.com>
References: <30062B7EA51A9045B9F605FAAC1B4F622357D1@tardis0.quadrics.com>
Message-ID: <34B2A95C-34D6-11D8-8652-000393A4725A@sdsc.edu>

On Dec 22, 2003, at 11:12 AM, daniel.kidger at quadrics.com wrote:

> Federico,
>
>> Here is a little primer since it sounds like you are indeed ready.
>> --- many very informative lines deleted ---
>
> Just a couple of questions for now:
> 1. Do rolls have to be CD based?
> (during development I would probably get through a lot of CDROMs -
> but more importantly it would get a bit fiddly to keep walking round
> to the CD-writer - then nipping off to the room with the cluster
> every time)

For distribution, the rolls should probably be CD based. For
development, however, that is not necessary. There is a make target
which will compile your source and "install" the roll into your local
distribution. This is "make intodist" and assumes you are building on a
frontend node. You would follow this call with a call to "rocks-dist
dist" in the "/home/install" directory. Of course, this makes most
sense for rolls that affect compute nodes. To test parts of your roll
that affect frontend functionality, you still need to use the CDs.

> 2. Do I have to reinstall the headnode from scratch each time I want
> to test a roll?
> (even if the roll only affects RPMs that get installed on compute
> nodes)

See comment above. We're working on a way to fully install frontends
over the network, but it will not make it into the new release.

> 3.
> Can a CD contain multiple rolls?
> (Once mature - a cluster may have quite a few rolls: pbs, sge, gm,
> IB, etc., and Quadrics would probably have two - the (open-source)
> hardware drivers, MPI, etc., and also RMS - our (closed-source)
> cluster Resource
> Manager.)

There is some support for this; we call them "Metarolls". We know they
are important, and we have some support for them now. The build process
for them is a bit different, and won't arrive for this release but soon
after.

> 4. What subset of the cvs tree does a Roll developer need? The whole
> tree is clearly rather excessive.

There are definitely areas of the tree not necessary for roll building.
It's always safest to have everything, but you're welcome to crop and
test.

> 5. I am a little concerned about the amount of bloat needed to
> install our five RPMs as a Roll. (The RPMs are already prebuilt by
> our own internal build procedures.)
> So taking another case - let's say the Intel Compilers - These have 4
> RPMs (plus a little sed-ery of their config files and pasting in the
> license file). Would these be best installed as a Roll or as a simple
> extend-compute.xml as I have currently?

It is better to put them in a roll. We have ways to combine,
distribute, sort, etc. these rolls, and they form a nice capsule of
software to introduce into the system. I understand that pulling the
whole source tree seems a bit excessive, but it is rather standard
practice for working on an open project. Plus only the developer needs
the source, the consumer does not. Good luck, and we're glad someone is
asking the questions. Rolls are intended for outside construction, and
we need to document the process. :)

-Federico

>
> Yours,
> Daniel.
>
> --------------------------------------------------------------
> Dr. Dan Kidger, Quadrics Ltd.
> daniel.kidger at quadrics.com
> One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505
> ----------------------- www.quadrics.com --------------------

Federico
Rocks Cluster Group, San Diego Supercomputing Center, CA

From tlinden at pcu.helsinki.fi Tue Dec 23 05:28:35 2003
From: tlinden at pcu.helsinki.fi (Tomas Lindén)
Date: Tue, 23 Dec 2003 15:28:35 +0200 (EET)
Subject: [Rocks-Discuss]Lost nodes during cluster-kickstart?
Message-ID: <Pine.OSF.4.58.0312231440260.353431@rock.helsinki.fi>

To reinstall a cluster I use the command
cluster-fork /boot/kickstart/cluster-kickstart

Now since all 32 nodes have been PXE installed, this means that the
reinstallation is performed by first doing a PXE boot to load the
installation kernel. My problem is that sometimes a few nodes fail
during this reinstallation process. The failing nodes seem to be
different whenever this problem occurs. The really strange thing is
that after more than a day or so some nodes somehow manage to finish
the reinstallation process!

Sometimes the whole cluster comes up fine without any lost node.

The problematic nodes _seem_ to get the installation kernel with PXE,
so it might not be a PXE problem but something odd that happens later?

Has anyone seen anything like this before?

I'm aware of a bug in the RedHat installation kernel on Athlon systems
when trying to run with a serial console.

https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/001988.html

This is why I run the installation kernel without a serial console, but
this makes debugging difficult because the serial console only shows
output during the PXE boot process. No output is generated by the
installation kernel itself. The next output is generated when the node
has finished the installation and loads the final kernel, which runs
fine with a serial console.

This is using Rocks 2.3.2 on a 32-node cluster with Tyan Tiger MPX
S2466N-4M motherboards and dual Athlon MP CPUs with no graphics
adapters, so the system has a 32-port serial console switch. The
motherboards have integrated 100 Mb/s 3Com 3C920 NICs (in practice a
3C905 NIC). The switch is made by Enterasys. The frontend private NIC
is also running at 100 Mb/s. When doing the cluster reinstallation the
network bandwidth over the frontend NIC saturates at 12.5 MB/s. Maybe
some packets are lost because of this?

The frontend private ethernet connection will be upgraded to Gb/s.
Hopefully this will solve this reinstallation problem.

Do you have any other ideas how to solve this problem?
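The saturation figure quoted above is exactly the ceiling of a fully loaded fast-ethernet link, which supports the idea that the 100 Mb/s frontend NIC is the bottleneck. A back-of-the-envelope check, nothing Rocks-specific:

```shell
# A saturated 100 Mb/s link moves 100/8 = 12.5 MB/s, matching the
# reported saturation point exactly.
awk 'BEGIN { printf "%.1f MB/s\n", 100 / 8 }'
```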
Best regards, Tomas Lindén
--------------------------------------------------------------------------
 Tomas Linden                  Helsinki Institute of Physics (HIP)
 Tomas.Linden at Helsinki.FI   P.O. Box 64 (Gustaf Hällströmin katu 2)
 phone: +358-9-191 505 63      FIN-00014 UNIVERSITY OF HELSINKI
 fax:   +358-9-191 505 53      Finland
 WWW: http://www.physics.helsinki.fi/~tlinden/eindex.html
--------------------------------------------------------------------------

From kjcruz at ece.uprm.edu Tue Dec 23 05:31:26 2003
From: kjcruz at ece.uprm.edu (Kennie Cruz)
Date: Tue, 23 Dec 2003 09:31:26 -0400 (AST)
Subject: [Rocks-Discuss]Error installing the compute node
Message-ID: <Pine.LNX.4.58.0312230921290.23333@alambique.ece.uprm.edu>

Hi,
I am trying to kickstart the compute nodes with Rocks 3.0.0; the
frontend is already working. I revised FAQ question 7.1.2, and the
services (dhcpd, httpd, mysqld and autofs) are running, but running
kickstart.cgi from the command line gives an error:

error - cannot kickstart external nodes

I made a quick search on the list, but without any success.

The compute node gets the assigned IP and insert-ethers detects the
appliance without any trouble, but it fails to run the kickstart.cgi
from the frontend. The web server error log says something like this:

[Tue Dec 23 09:10:08 2003] [error] [client 10.255.255.254] malformed header from script. Bad header=# @Copyright@: /var/www/html/install/kickstart.cgi

While the access log says this:

10.255.255.254 - - [23/Dec/2003:09:10:08 -0400] "GET /install/kickstart.cgi?arch=i386&np=2&if=eth0&project=rocks HTTP/1.0" 500 587 "-" "-"

I ran insert-ethers with the Ethernet Switches option. My nodes are
connected via 3 managed ethernet switches.

Any help will be appreciated.

Thanks in advance.

--
Kennie J. Cruz Gutierrez, System Administrator
Department of Electrical and Computer Engineering
University of Puerto Rico, Mayaguez Campus
Work Phone: (787) 832-4040 x 3798
Email: Kennie.Cruz at ece.uprm.edu
Web: http://ece.uprm.edu/~kennie/

[2003-12-23/09:21]
Black holes are created when God divides by zero!

From bruno at rocksclusters.org Tue Dec 23 08:33:39 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Tue, 23 Dec 2003 08:33:39 -0800
Subject: [Rocks-Discuss]Error installing the compute node
In-Reply-To: <Pine.LNX.4.58.0312230921290.23333@alambique.ece.uprm.edu>
References: <Pine.LNX.4.58.0312230921290.23333@alambique.ece.uprm.edu>
Message-ID: <C33DF11A-3565-11D8-B821-000A95C4E3B4@rocksclusters.org>

just to be clear, did you execute:

# cd /home/install
# ./kickstart.cgi --client compute-0-0

 - gb
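For background on the log line Kennie quoted: apache's "malformed header from script" means the CGI's very first output line did not look like an HTTP header (here it was a "# @Copyright@" comment, suggesting the script emitted a banner or error text before its headers). A generic check of a captured first line, not specific to kickstart.cgi:

```shell
# A CGI response must begin with "Name: value" header lines; apache logs
# "malformed header from script" otherwise -- exactly what a leading
# "# @Copyright@" comment triggers. First line below is from the log.
first_line='# @Copyright@: /var/www/html/install/kickstart.cgi'
if printf '%s\n' "$first_line" | grep -qE '^[A-Za-z][A-Za-z0-9-]*: '; then
    echo "valid header line"
else
    echo "malformed header: $first_line"
fi
```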
    On Dec 23,2003, at 5:31 AM, Kennie Cruz wrote: > Hi, > > I am trying to kickstart the compute nodes with Rocks 3.0.0, the > frontend > is already working. I revised the FAQ question 7.1.2, the services > (dhcpd, > httpd, mysqld and autofs) are running, but running kickstar.cgi from > the > command line give an error: > > error - cannot kickstart external nodes > > I made a quick search on the list, but without any success. > > The compute node gets the assigned IP and insert-ethers detect the > appliance without any trouble, but fails to run the kickstart.cgi from > the > frontend. The web server error log says something like this: > > [Tue Dec 23 09:10:08 2003] [error] [client 10.255.255.254] malformed > header > from script. Bad header=# @Copyright@: > /var/www/html/install/kickstart.cgi > > While the access log says this: > > 10.255.255.254 - - [23/Dec/2003:09:10:08 -0400] "GET > /install/kickstart.cgi?arch=i386&np=2&if=eth0&project=rocks HTTP/1.0" > 500 587 "-" "-" > > I ran insert-ethers with the Ethernet Switches option. My nodes are > connected via 3 managed ethernet switches. > > Any help will be appreciated. > > Thanks in advance. > > -- > Kennie J. Cruz Gutierrez, System Administrator > Department of Electrical and Computer Engineering > University of Puerto Rico, Mayaguez Campus > Work Phone: (787) 832-4040 x 3798 > Email: Kennie.Cruz at ece.uprm.edu > Web: http://ece.uprm.edu/~kennie/ > > [2003-12-23/09:21] > Black holes are created when God divides by zero! From daniel.kidger at quadrics.com Tue Dec 23 09:03:49 2003 From: daniel.kidger at quadrics.com (Daniel Kidger) Date: Tue, 23 Dec 2003 17:03:49 +0000 Subject: [Rocks-Discuss]Lost nodes during cluster-kickstart? In-Reply-To: <Pine.OSF.4.58.0312231440260.353431@rock.helsinki.fi> References: <Pine.OSF.4.58.0312231440260.353431@rock.helsinki.fi>
Message-ID: <3FE87575.5060807@quadrics.com>

Tomas Lindén wrote:

>To reinstall a cluster I use the command
> cluster-fork /boot/kickstart/cluster-kickstart
>Now since all 32 nodes have been PXE installed this means that the
>reinstallation is performed by first doing a PXE-boot to load the
>installation kernel. My problem is that sometimes a few nodes fail
>during this reinstallation process.
>

Although I haven't PXE installed a Rocks cluster of this size, I have
done PXE-based installs of (larger) RedHat clusters using a customised
kickstart file. What can go wrong is that I have seen timeouts if too
many nodes dhcp/tftp for their installer kernel simultaneously. You
could try to increase the timeout or, better, not do too many at once -
say start 8 at a time every 30 seconds. There is plenty of precedent
for this in, say, the automated installer of the AlphaServer SC Tru64
clusters.

Also, outside of Rocks, I have seen folk use multiple 'sub-master'
nodes to act as tftp/http fileservers during the install process.

It would be interesting to see what the Rocks developers' vision is for
the scalable installation of large clusters.

--
Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------

From mjk at sdsc.edu Tue Dec 23 09:44:14 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 23 Dec 2003 09:44:14 -0800
Subject: [Rocks-Discuss]Lost nodes during cluster-kickstart?
In-Reply-To: <Pine.OSF.4.58.0312231440260.353431@rock.helsinki.fi>
References: <Pine.OSF.4.58.0312231440260.353431@rock.helsinki.fi>
Message-ID: <9F7E8D1C-356F-11D8-8281-000A95DA5638@sdsc.edu>

The problem is PXE has an extremely short timeout, and once it fails it
does not retry. Since this is a BIOS thing, there isn't a lot to do. If
you boot your compute nodes off of CDs (and avoid PXE), the problem
goes away.
This is because, even if DHCP times out, we've modified our
installation to be extremely aggressive with DHCP requests, and the
entire installation process will actually watchdog-timeout and restart
itself if needed. Unfortunately, the PXE timeout cannot be fixed in the
same way. Our experience shows PXE scales to 128 nodes for a mass
re-install using current hardware. Older CPUs may show issues. The only
answer right now is to stage your re-install so the PXE server can
handle the load. This load is actually very low, but the PXE server for
Linux is still maturing.

 -mjk
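The staged re-install that Mason recommends (and that Daniel sketched as "8 at a time every 30 seconds") can be scripted. A sketch only: the node names are illustrative, and the echo stands in for whatever per-node command triggers the reinstall (e.g. an ssh to /boot/kickstart/cluster-kickstart):

```shell
# Kick off node reinstalls in small batches so the PXE/tftp server is
# never hit by all 32 nodes at once.
BATCH=8        # nodes started per batch
DELAY=1        # seconds between batches; ~30 is the suggestion above
i=0
for n in $(seq -f 'compute-0-%g' 0 31); do
    echo "kickstarting $n" &   # stand-in for: ssh "$n" /boot/kickstart/cluster-kickstart &
    i=$((i + 1))
    if [ $((i % BATCH)) -eq 0 ]; then
        wait                   # let this batch get going
        sleep "$DELAY"
    fi
done
wait
```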
    On Dec 23,2003, at 5:28 AM, Tomas Lind?n wrote: > To reinstall a cluster I use the command > cluster-fork /boot/kickstart/cluster-kickstart > Now since all 32 nodes have been PXE installed this means that the > reinstallation is performed by first doing a PXE-boot to load the > installation kernel. My problem is that sometimes a few nodes fail > during this reinstallation process. The failing nodes seem to be > different > whenever this problem occurs. The really strange thing is that after > more than a day or so some nodes somehow manage to finish the > reinstallation process! > > Sometimes the whole cluster comes up fine without any lost node. > > The problematic nodes _seem_ to get the installation kernel with PXE, > so > it might be not a PXE problem but something odd that happens later? > > Has anyone seen anything like this before? > > I'm aware of a bug in the RedHat installation kernel > on Athlon systems when trying to run with a serial console. > > https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2003-May/ > 001988.html > This is why I run the installation kernel without a serial console, but > this makes debugging difficult because the serial console only shows > output during the PXE boot process. No output is generated by the > installation kernel itself. The next output is generated when > the node has finished the installation and loads the final kernel which > runs fine with a serial console. > > This is using Rocks 2.3.2 on a 32 node cluster with Tyan Tiger MPX > S2466N-4M motherboards and dual Athlon MP CPUs with no graphics > adapters, so the system has a 32 port serial console switch. The > motherboards have integrated 100 Mb/s 3Com 3C920 NICs (in practice a > 3C905 NIC). The switch is made by Enterasys. The frontend private NIC > is > also running at 100 Mb/s. When doing the cluster reinstallation the > network bandwidth over the frontend NIC saturates at 12,5 MB/s. Maybe > some packets are lost because of this? 
> > The frontend private ethernet connection will be upgraded to Gb/s. > Hopefully this will solve this reinstallation problem. > > Do you have any other ideas how to solve this problem? > > Best regards, Tomas Lind?n > ----------------------------------------------------------------------- > --- > I , > I > I Tomas Linden Helsinki Institute of Physics (HIP) > I > I Tomas.Linden at Helsinki.FI P.O. Box 64 (Gustaf H?llstr?min katu > 2) I > I phone: +358-9-191 505 63 FIN-00014 UNIVERSITY OF HELSINKI
    > I > I fax: +358-9-191 505 53 Finland > I > I WWW: http://www.physics.helsinki.fi/~tlinden/eindex.html > I > ----------------------------------------------------------------------- > --- From Timothy.Carlson at pnl.gov Tue Dec 23 08:57:07 2003 From: Timothy.Carlson at pnl.gov (Carlson, Timothy S) Date: Tue, 23 Dec 2003 08:57:07 -0800 Subject: [Rocks-Discuss]Error installing the compute node Message-ID: <A383F042472668459D642266F8B41692056B9F@pnlmse24.pnl.gov> The problem he is having is that he chose "ethernet switches" when running insert-ethers. He should have chosen "Compute nodes". Only choose "ethernet switches" when you are assigning an IP address to an ethernet switch with DHCP. If your managed switches already have IP addresses, then just install "compute nodes" Tim -----Original Message----- From: Greg Bruno [mailto:bruno at rocksclusters.org] Sent: Tuesday, December 23, 2003 8:34 AM To: Kennie Cruz Cc: npaci-rocks-discussion at sdsc.edu Subject: Re: [Rocks-Discuss]Error installing the compute node just to be clear, did you execute: # cd /home/install # ./kickstart.cgi --client compute-0-0 - gb On Dec 23, 2003, at 5:31 AM, Kennie Cruz wrote: > Hi, > > I am trying to kickstart the compute nodes with Rocks 3.0.0, the > frontend > is already working. I revised the FAQ question 7.1.2, the services > (dhcpd, > httpd, mysqld and autofs) are running, but running kickstar.cgi from > the > command line give an error: > > error - cannot kickstart external nodes > > I made a quick search on the list, but without any success.
> The compute node gets the assigned IP and insert-ethers detects the
> appliance without any trouble, but fails to run the kickstart.cgi
> from the frontend. The web server error log says something like this:
>
> [Tue Dec 23 09:10:08 2003] [error] [client 10.255.255.254] malformed
> header from script. Bad header=# @Copyright@:
> /var/www/html/install/kickstart.cgi
>
> While the access log says this:
>
> 10.255.255.254 - - [23/Dec/2003:09:10:08 -0400] "GET
> /install/kickstart.cgi?arch=i386&np=2&if=eth0&project=rocks HTTP/1.0"
> 500 587 "-" "-"
>
> I ran insert-ethers with the Ethernet Switches option. My nodes are
> connected via 3 managed ethernet switches.
>
> Any help will be appreciated.
>
> Thanks in advance.
>
> --
> Kennie J. Cruz Gutierrez, System Administrator
> Department of Electrical and Computer Engineering
> University of Puerto Rico, Mayaguez Campus
> Work Phone: (787) 832-4040 x 3798
> Email: Kennie.Cruz at ece.uprm.edu
> Web: http://ece.uprm.edu/~kennie/
>
> [2003-12-23/09:21]
> Black holes are created when God divides by zero!

From purikk at hotmail.com Tue Dec 23 12:48:30 2003
From: purikk at hotmail.com (Purushotham Komaravolu)
Date: Tue, 23 Dec 2003 15:48:30 -0500
Subject: [Rocks-Discuss]beowulf and rocks
Message-ID: <BAY1-DAV43JrOq93dSA00011dba@hotmail.com>

Hi,
I keep hearing people mention beowulf and Rocks; can somebody point me
to the difference between them? Are they just two different solutions
for clusters?
Thanks
Regards,
Puru

From tim.carlson at pnl.gov Tue Dec 23 13:19:39 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Tue, 23 Dec 2003 13:19:39 -0800 (PST)
Subject: [Rocks-Discuss]beowulf and rocks
In-Reply-To: <BAY1-DAV43JrOq93dSA00011dba@hotmail.com>
Message-ID: <Pine.LNX.4.44.0312231314420.25800-100000@localhost.localdomain>
On Tue, 23 Dec 2003, Purushotham Komaravolu wrote:

> I keep hearing people mention beowulf and Rocks; can somebody point
> me to the difference between them? Are they just two different
> solutions for clusters?

Beowulf is a loose definition for a cluster of machines (typically off
the shelf hardware). Beowulf is not software. Rocks is a software
solution to manage your beowulf. You can compare rocks/oscar/scyld as
software systems for your beowulf cluster.

Read Robert Brown's book on beowulfs at this URL:

http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book/beowulf_book/index.html

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support

From dlane at ap.stmarys.ca Tue Dec 23 14:53:51 2003
From: dlane at ap.stmarys.ca (Dave Lane)
Date: Tue, 23 Dec 2003 18:53:51 -0400
Subject: [Rocks-Discuss]beowulf and rocks
In-Reply-To: <BAY1-DAV43JrOq93dSA00011dba@hotmail.com>
Message-ID: <5.2.0.9.0.20031223185219.01b444e8@ap.stmarys.ca>

At 03:48 PM 12/23/2003 -0500, Purushotham Komaravolu wrote:

>Hi,
> I keep hearing people mention beowulf and Rocks; can somebody point
>me to the difference between them? Are they just two different
>solutions for clusters?

Beowulf is a loosely-defined generic term (that I won't attempt to
define now!), while Rocks is one of several software distributions that
implement a beowulf cluster.

... Dave

From junkscarce at hotmail.com Tue Dec 23 15:43:05 2003
From: junkscarce at hotmail.com (Reed Scarce)
Date: Tue, 23 Dec 2003 23:43:05 +0000
Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
Message-ID: <BAY1-F147XhOous6jec0001512f@hotmail.com>

Within /export/home/install/profiles/2.3.2/site-nodes,
extend-compute.xml contains code like this:

<post>
/bin/mkdir /mnt/plc/ <-- works -->
/bin/mkdir /mnt/plc/plc_data <-- works -->
/bin/ln -s /mnt/plc_data /data1 <-- works -->
/bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to ln, source exists -->
</post>

I don't understand why the ln to a directory succeeds but a ln to a
script fails.

BTW, Dr. Landman, I've attempted to use your build.pl but it seems to
fail with:

Can't stat `/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm

(my note: the path ends at RPMS). I swear I thought I saw a solution to
this once but I can't find it again.

Upon reinstallation with the file your tool created
(/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm) anaconda
threw back the exception:

Traceback (innermost last):
  File "/usr/bin/anaconda.real", line 633, in ?
    intf.run(id, dispatch, configFileData)
  File "/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py", line 427, in run

ok save debug

TIA Reed Scarce

From landman at scalableinformatics.com Tue Dec 23 16:17:58 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 23 Dec 2003 19:17:58 -0500
Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
In-Reply-To: <BAY1-F147XhOous6jec0001512f@hotmail.com>
References: <BAY1-F147XhOous6jec0001512f@hotmail.com>
Message-ID: <1072225078.4501.82.camel@protein.scalableinformatics.com>

Hi Reed:

Which version of the finishing server fails on which version of ROCKS?
It looks like 3.0. I am up to 3.1.0 now. With a little bit of
modification I could make it work with 2.3.2. Likely just a single line
to point to the right path. Let me know and I'll see what I can do.

I would recommend using the 3.1.0 environment, as it is a significant
(read as massive) improvement over previous versions.
If you (and others) need it to work with older (pre-3.0) versions of
ROCKS, I think I can handle that. Let me know.

Joe

On Tue, 2003-12-23 at 18:43, Reed Scarce wrote:
> Within /export/home/install/profiles/2.3.2/site-nodes
> extend-compute.xml lies code like this commented code:
> <post>
> /bin/mkdir /mnt/plc/ <-- works -->
> /bin/mkdir /mnt/plc/plc_data <-- works -->
> /bin/ln -s /mnt/plc_data /data1 <-- works -->
> /bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to ln,
> source exists -->
> </post>
>
> I don't understand why the ln to a directory succeeds but a ln to a
> script fails.
>
> BTW, Dr. Landman, I've attempted to use your build.pl but it seems to
> fail with:
> Can't stat `/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm

From mjk at sdsc.edu Tue Dec 23 16:35:13 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Tue, 23 Dec 2003 16:35:13 -0800
Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
In-Reply-To: <BAY1-F147XhOous6jec0001512f@hotmail.com>
References: <BAY1-F147XhOous6jec0001512f@hotmail.com>
Message-ID: <09B1C3EA-35A9-11D8-8281-000A95DA5638@sdsc.edu>

"man chkconfig"

If you use chkconfig you do not need to create the rc*.d/* files; they
are put in place for you.

 -mjk

On Dec 23, 2003, at 3:43 PM, Reed Scarce wrote:

> Within /export/home/install/profiles/2.3.2/site-nodes
> extend-compute.xml lies code like this commented code:
> <post>
> /bin/mkdir /mnt/plc/ <-- works -->
> /bin/mkdir /mnt/plc/plc_data <-- works -->
> /bin/ln -s /mnt/plc_data /data1 <-- works -->
> /bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to ln,
> source exists -->
> </post>
>
> I don't understand why the ln to a directory succeeds but a ln to a
> script fails.
>
> BTW, Dr. Landman, I've attempted to use your build.pl but it seems to
> fail with:
> Can't stat
> `/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm .
> (my note: the path ends at RPMS) I swear I thought I saw a solution
> to this once but I can't find it again.
> Upon reinstallation with the file your tool created
> (/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm) anaconda
> threw back the exception: Traceback (innermost last): file
> "/usr/bin/anaconda.real", line 633, in ?
intf.run(id, dispatch, > configFileData) File > "/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py", line > 427 in run
    > ok save debug > > > TIA Reed Scarce > > _________________________________________________________________ > Tired of slow downloads? Compare online deals from your local > high-speed providers now. https://broadband.msn.com From jkreuzig at uci.edu Tue Dec 23 19:53:16 2003 From: jkreuzig at uci.edu (James Kreuziger) Date: Tue, 23 Dec 2003 19:53:16 -0800 (PST) Subject: [Rocks-Discuss]Dell Power Connect 5224 In-Reply-To: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov> References: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov> Message-ID: <Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu> Thanks everybody for the info. I was aware of the fast-link issue; However, after enabling it, we still were unable to see the switch from the frontend. We had a laptop hooked up to the switch via serial and ethernet and was able to turn on the fast-link, and assign an IP address. After that, the web-based interface came up on the laptop. Still, no response on the switch from the frontend. So after great gnashing of teeth, and dozens of re-installs of the frontend, success! The problem? The extra nic card on the frontend. We had bought the frontend with a dual 1GB card and a single 100MB card. Whenever the single nic card is installed, the system always takes this as eth0. This is something that was staring us right in the face, so that's why it probably took so long to figure out. After 3 years of trying to find the money, we finally have our first 8 node cluster up! 
-Jim

*************************************************
Jim Kreuziger
jkreuzig at uci.edu
949-824-4474
*************************************************

From landman at scalableinformatics.com Tue Dec 23 20:23:35 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 23 Dec 2003 23:23:35 -0500
Subject: [Rocks-Discuss]Dell Power Connect 5224
In-Reply-To: <Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu>
References: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov> <Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu>
Message-ID: <3FE914C7.3050001@scalableinformatics.com>

Hi James:

One of the things I do the first time I boot up a new head node is to map
the ethernet ports. I take out all but one of the network wires, and
make sure there is real network traffic. A ping on the subnet is fine.
Then I tcpdump the network port. What is surprising to me is how many
times the assumed network eth0 is mapped differently. Then by hand,
after mapping the rest of the ports, I manually modify the
/etc/modules.conf file to reflect what I need.

Just a suggestion. Having been bitten enough, I find simple sanity
checks help reduce the size or dimensionality of the space of possible
problems. This makes these debugging sessions usually faster, and
allows for better characterization of the issue.

Joe

James Kreuziger wrote:

>Thanks everybody for the info. I was aware of the fast-link issue;
>however, after enabling it, we still were unable to see the switch
>from the frontend. We had a laptop hooked up to the switch via serial
>and ethernet and were able to turn on the fast-link and assign an
>IP address. After that, the web-based interface came up on the laptop.
>Still, no response on the switch from the frontend.
>
>So after great gnashing of teeth, and dozens of re-installs of the
>frontend, success! The problem? The extra NIC card on the frontend.
>We had bought the frontend with a dual 1 Gb card and a single 100 Mb
>card. Whenever the single NIC card is installed, the system always
>takes this as eth0. This is something that was staring us right in the
>face, so that's why it probably took so long to figure out.
>
>After 3 years of trying to find the money, we finally have our first
>8-node cluster up!
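Joe's port-mapping check can be partly automated by listing every interface with its hardware address and comparing against the MACs printed on the cards. This sketch uses the modern /sys interface so it is runnable anywhere; on a 2003-era system you would read the same information from ifconfig output:

```shell
# Print each network interface with its MAC address so you can tell
# which physical port the kernel actually named eth0.
for dev in /sys/class/net/*; do
    printf '%-8s %s\n' "$(basename "$dev")" "$(cat "$dev/address")"
done
```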
>
>-Jim
>
>*************************************************
>Jim Kreuziger
>jkreuzig at uci.edu
>949-824-4474
>*************************************************

--
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615

From bruno at rocksclusters.org Tue Dec 23 21:26:08 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Tue, 23 Dec 2003 21:26:08 -0800
Subject: [Rocks-Discuss]Rocks 3.1.0 is released for x86, ia64 and x86-64
Message-ID: <ADA8CDD0-35D1-11D8-B821-000A95C4E3B4@rocksclusters.org>
Version 3.1.0 (Matterhorn) of the Rocks cluster distribution is
released and now supports three processor families: Intel IA-32, Intel
Itanium Processor Family, and AMD Opteron. This is the released version
of the software that was used to build a fully-functioning 128-node
grid-enabled cluster in under 2 hours on opening night last month at
SC2003 in Phoenix, AZ. Rocks is developed by the Grid and Cluster
Computing Group at SDSC and by partners at the University of
California, Berkeley, Scalable Systems in Singapore, and individual
open-source software developers.

This is a co-release for x86 (Pentium, Athlon, and others), Itanium2
(IA-64) and Opteron (x86-64) based clusters. Software is freely
available for download to burn onto a bootable CD set for x86 and
x86-64, or a single DVD for Itanium2. Versions for all processor
families are available at http://www.rocksclusters.org/.

Introduced in version 3.0.0, this version enhances the "roll" mechanism
to enable users, communities and others to easily add on optional
software and configuration. These optional "Roll CDs" extend the system
by integrating seamlessly and automatically into the management and
packaging mechanisms used by the base software. For all intents and
purposes, rolls appear as if they are part of the original CD
distribution. A number of defined extension rolls are freely available
and include HPC, Sun Grid Engine, Grid (based on NMI), Java and Intel
Compiler. An important feature is that new rolls can be created or
updated independently of the core distribution. This fundamentally
enables science teams and communities to add on domain-specific
software packages, define a particular grid configuration, or simply
modify any of the default configuration or package settings.
New features in NPACI Rocks 3.1.0 include:
- Opteron support
- Sun Grid Engine as the default queuing system
- Upgraded Ganglia server and client, used for collecting and
  visualizing cluster-wide monitoring metrics
- Upgraded MPICH-GM and Myrinet GM 2.0 for the latest Rev D cards
- Rocks-developed 411 information system to replace Network
  Information Service (NIS)
- Updated SSH version 3.7.1 with no login delay
- Several optional software rolls, including:
  - NSF Middleware Initiative version R4 grid distribution
  - Java 2
  - Intel Compilers for x86 and ia64

Rocks 3.1.0 is derived from Red Hat's publicly available source
packages (SRPMS) used in portions of their Enterprise Linux 3.0 product
line. All SRPMs have been recompiled to enable redistribution. All
available updates for these packages have been pre-applied.
Rocks-specific software and standard cluster and grid community
software is then added to create a complete clustering toolkit. All
Rocks source code is available in a public CVS repository.

From angel at miami.edu Wed Dec 24 13:14:59 2003
From: angel at miami.edu (Angel Li)
Date: Wed, 24 Dec 2003 16:14:59 -0500
    Subject: [Rocks-Discuss]Rocks 3.1.0 is released for x86, ia64 and x86-64 In-Reply-To: <ADA8CDD0-35D1-11D8-B821-000A95C4E3B4@rocksclusters.org> References: <ADA8CDD0-35D1-11D8-B821-000A95C4E3B4@rocksclusters.org> Message-ID: <3FEA01D3.8080204@miami.edu> Hi, I currently have a cluster running Rocks 3.0 and I'm considering upgrading to 3.1. Now that SGE is the default batch queue, is maui working? Also, the Intel compiler roll is included. What licensing issues will I encounter? We currently have a license for version 7. Thanks, Angel From bruno at rocksclusters.org Wed Dec 24 14:14:46 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Wed, 24 Dec 2003 14:14:46 -0800 Subject: [Rocks-Discuss]Rocks 3.1.0 is released for x86, ia64 and x86-64 In-Reply-To: <3FEA01D3.8080204@miami.edu> References: <ADA8CDD0-35D1-11D8-B821-000A95C4E3B4@rocksclusters.org> <3FEA01D3.8080204@miami.edu> Message-ID: <94F9D6F6-365E-11D8-B821-000A95C4E3B4@rocksclusters.org> > I currently have a cluster running Rocks 3.0 and I'm considering > upgrading to 3.1. Now that SGE is the default batch queue, is maui > working? maui and pbs are currently not available in rocks 3.1, but it will be soon. maui and pbs will be included in its own roll -- that effort will be driven by roy dragseth from the University of Tromsø. > Also, the Intel compiler roll is included. What licensing issues will > I encounter? We currently have a license for version 7. i'm not sure how the licenses transfer between versions. after you bring up a frontend with the intel roll, the following link is available on the frontend's home page: http://www.intel.com/software/products/distributors/rock_cluster.htm after you purchase a license, you just need to copy the license into the appropriate directory and then start compiling. 
for fortran, the appropriate directory is: /opt/intel_fc_80/licenses and for C, the appropriate directory is: /opt/intel_cc_80/licenses also, the intel roll contains a pre-built MPICH environment -- it is
    found under /opt/mpich/intel. - gb From cdwan at mail.ahc.umn.edu Wed Dec 24 14:17:28 2003 From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB)) Date: Wed, 24 Dec 2003 16:17:28 -0600 (CST) Subject: [Rocks-Discuss]Dell Power Connect 5224 In-Reply-To: <3FE914C7.3050001@scalableinformatics.com> References: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov> <Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu> <3FE914C7.3050001@scalableinformatics.com> Message-ID: <Pine.GSO.4.58.0312241612450.25288@lenti.med.umn.edu> Once upon a time, I decided to install a third interface in a rocks head node (Dell SC1400, and a Syskonnect 98x Gig NIC for the interested) for a data network. At boot time *everything* was broken. To make a long story less long, the system had remapped itself with the new gig card as eth0, and the other two shifted up by one. That was really close to "no fun at all." Happy holidays! I'm burning the new release right now! -C From michal at harddata.com Wed Dec 24 15:05:43 2003 From: michal at harddata.com (Michal Jaegermann) Date: Wed, 24 Dec 2003 16:05:43 -0700 Subject: [Rocks-Discuss]Dell Power Connect 5224 In-Reply-To: <Pine.GSO.4.58.0312241612450.25288@lenti.med.umn.edu>; from cdwan@mail.ahc.umn.edu on Wed, Dec 24, 2003 at 04:17:28PM -0600 References: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov> <Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu> <3FE914C7.3050001@scalableinformatics.com> <Pine.GSO.4.58.0312241612450.25288@lenti.med.umn.edu> Message-ID: <20031224160543.A25886@mail.harddata.com> On Wed, Dec 24, 2003 at 04:17:28PM -0600, Chris Dwan (CCGB) wrote: > > Once upon a time, I decided to install a third interface in a rocks head > node (Dell SC1400, and a Syskonnect 98x Gig NIC for the interested) for a > data network. At boot time *everything* was broken. I still cannot understand why people insists on NOT using 'nameif' utility. 
All network interfaces can be named whichever way you want and they will not move regardless how many NICs you will add or remove as long as MACs are not changed. If you replace a card with a different one then /etc/mactab needs to be edited to reflect your new configuration. On clients nodes with an automatic reinstall this indeed is not practical but for your front end machine this is another story. It is indeed the case that default startup scripts from Red Hat 7.3 need some simple additions as interface (re)naming need to be done before NICs are brought up for the first time. In RH9 and FC1
    'nameif' will be used "automagically" if HWADDR variable is defined (and with a correct value). Of course if you have different drivers for different NICs, and they are loaded as modules, then names can be assigned by editing /etc/modules.conf Michal From bruno at rocksclusters.org Wed Dec 24 15:41:25 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Wed, 24 Dec 2003 15:41:25 -0800 Subject: [Rocks-Discuss]Dell Power Connect 5224 In-Reply-To: <20031224160543.A25886@mail.harddata.com> References: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov> <Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu> <3FE914C7.3050001@scalableinformatics.com> <Pine.GSO.4.58.0312241612450.25288@lenti.med.umn.edu> <20031224160543.A25886@mail.harddata.com> Message-ID: <AFFB44D8-366A-11D8-B821-000A95C4E3B4@rocksclusters.org> >> Once upon a time, I decided to install a third interface in a rocks >> head >> node (Dell SC1400, and a Syskonnect 98x Gig NIC for the interested) >> for a >> data network. At boot time *everything* was broken. > > I still cannot understand why people insists on NOT using 'nameif' > utility. All network interfaces can be named whichever way you want > and they will not move regardless how many NICs you will add or > remove as long as MACs are not changed. If you replace a card with > a different one then /etc/mactab needs to be edited to reflect your > new configuration. On clients nodes with an automatic reinstall > this indeed is not practical but for your front end machine this is > another story. > > It is indeed the case that default startup scripts from Red Hat 7.3 > need some simple additions as interface (re)naming need to be done > before NICs are brought up for the first time. In RH9 and FC1 > 'nameif' will be used "automagically" if HWADDR variable is defined > (and with a correct value). 
michal, for this release, we looked at your suggestion of using nameif -- we did a quick prototype and it looks like it will be the right thing to do. we sketched out a design and found that the full solution will require many pieces (database changes, installer changes and the obvious XML file changes). we left this out of 3.1.0 but it is towards the top of our list for the next release. thanks for the suggestion of nameif -- it is suggestions like that which help us to define the direction of rocks. - gb
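[Editor's note: for readers unfamiliar with nameif, the following is a minimal sketch of the /etc/mactab file the discussion above refers to. The interface names and MAC addresses are made up for illustration; they are not from this thread.]

```
# /etc/mactab -- read by nameif: one "name MAC-address" pair per line,
# lines beginning with '#' are ignored
eth0   00:0A:95:11:22:33
eth1   00:0A:95:44:55:66
# third (gig) card pinned to its own name so it can never steal eth0
data0  00:50:04:77:88:99
```

Running nameif before the interfaces are brought up renames each interface to match its MAC address, so adding or removing NICs no longer reshuffles eth0/eth1 the way Chris Dwan described.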
    From landman at scalableinformatics.com Wed Dec 24 16:08:54 2003 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 24 Dec 2003 19:08:54 -0500 Subject: [Rocks-Discuss]Dell Power Connect 5224 In-Reply-To: <20031224160543.A25886@mail.harddata.com> References: <Pine.GSO.4.44.0312191723530.2317-100000@poincare.emsl.pnl.gov> <Pine.GSO.4.58.0312231934470.12276@massun.ucicom.uci.edu> <3FE914C7.3050001@scalableinformatics.com> <Pine.GSO.4.58.0312241612450.25288@lenti.med.umn.edu> <20031224160543.A25886@mail.harddata.com> Message-ID: <3FEA2A96.3060405@scalableinformatics.com> Michal Jaegermann wrote: >On Wed, Dec 24, 2003 at 04:17:28PM -0600, Chris Dwan (CCGB) wrote: > > >>Once upon a time, I decided to install a third interface in a rocks head >>node (Dell SC1400, and a Syskonnect 98x Gig NIC for the interested) for a >>data network. At boot time *everything* was broken. >> >> > >I still cannot understand why people insists on NOT using 'nameif' >utility. All network interfaces can be named whichever way you want >and they will not move regardless how many NICs you will add or >remove as long as MACs are not changed. If you replace a card with >a different one then /etc/mactab needs to be edited to reflect your >new configuration. On clients nodes with an automatic reinstall >this indeed is not practical but for your front end machine this is >another story. > > Agreed, though as far as I can tell, nameif is not used in the /etc/init.d scripts. It is used by ifup, so you would have to set HWADDR on each interface in the /etc/sysconfig/.../ifcfg-eth* files (the ... refers to that RH9 and RHEL3 have moved where these things sit from what we were used to in RH7.x). Still need to map the interfaces though, to see which physical port corresponds to which device/mac address. With that in hand, you can set up the HWADDR or just swap cables. With the advent of the folks making exactly the right length cables (e.g. 
not giving any play, and placing them under tension while plugged in...) the cable swap doesnt work well for mapping on some systems. Moreover, on a fair number of systems I have played with, the BIOS is setup so that if they PXE boot, they are doing so from the address that the installed version of ROCKS would see as eth1. Annoying. -- Joseph Landman, Ph.D Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://scalableinformatics.com phone: +1 734 612 4615
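[Editor's note: as a concrete illustration of the HWADDR mechanism Joe describes, here is a hypothetical RH9/FC1-style interface config file. The MAC address and IP values are invented for the example and are not from this thread.]

```
# /etc/sysconfig/network-scripts/ifcfg-eth0
# ifup checks HWADDR against the device's actual MAC; if another
# interface currently holds this MAC, it is renamed to eth0 first
DEVICE=eth0
HWADDR=00:0A:95:11:22:33
BOOTPROTO=static
IPADDR=10.1.1.1
NETMASK=255.255.255.0
ONBOOT=yes
```

With one such file per interface, each MAC address is pinned to a fixed ethN name regardless of module load order or how many cards are installed.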
    From junkscarce at hotmail.com Fri Dec 26 15:35:57 2003 From: junkscarce at hotmail.com (Reed Scarce) Date: Fri, 26 Dec 2003 23:35:57 +0000 Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails Message-ID: <BAY1-F88Kxt8zPdqJL900052b1b@hotmail.com> The line: chkconfig --level 3 gpm on works great from the command line, not in extend-compute.xml. Thanks for the new tool though, always glad. The line above is in a block without <eval shell="bash"> tags. I'll keep trying and rtm. Is it possible this is a 2.6.2 issue? The live environment restricts me from using a more recent version. >From: "Mason J. Katz" <mjk at sdsc.edu> >To: "Reed Scarce" <junkscarce at hotmail.com> >CC: npaci-rocks-discussion at sdsc.edu >Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails >Date: Tue, 23 Dec 2003 16:35:13 -0800 > >"man chkconfig" > >If you use chkconfig you do not need to create the rc*.d/* files and they >are put in place for you. > > -mjk > >On Dec 23, 2003, at 3:43 PM, Reed Scarce wrote: > >>Within /export/home/install/profiles/2.3.2/site-nodes extend-compute.xml >>lies code like this commented code: >><post> >>/bin/mkdir /mnt/plc/ <-- works --> >>/bin/mkdir /mnt/plc/plc_data <-- works --> >>/bin/ln -s /mnt/plc_data /data1 <-- works --> >>/bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to ln, >>source exists --> >></post> >> >>I don't understand why the ln to a directory succeeds but a ln to a script >>fails. >> >>BTW, Dr. Landman, I've attempted to use your build.pl but it seems to >>faill with: >>Can't stat >>`/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm . >>(my note: the path ends at RPMS) I swear I thought I saw a solution to >>this once but I can't find it again. >>Upon reinstallation with the file your tool created >>(/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm) anaconda >>threw back the exception: Traceback (innermost last): file >>"/usr/bin/anaconda.real", line 633, in ? 
intf.run(id, dispatch, >>configFileData) File >>"/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py", line 427 in >>run >>ok save debug
    >> >> >>TIA Reed Scarce >> >>_________________________________________________________________ >>Tired of slow downloads? Compare online deals from your local high-speed >>providers now. https://broadband.msn.com > _________________________________________________________________ Worried about inbox overload? Get MSN Extra Storage now! http://join.msn.com/?PAGE=features/es From mjk at sdsc.edu Fri Dec 26 16:46:22 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Fri, 26 Dec 2003 16:46:22 -0800 Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails In-Reply-To: <BAY1-F88Kxt8zPdqJL900052b1b@hotmail.com> References: <BAY1-F88Kxt8zPdqJL900052b1b@hotmail.com> Message-ID: <1759D2DF-3806-11D8-98D0-000A95DA5638@sdsc.edu> Not sure if this answers your question. But.. The <eval></eval> blocks are for code to be run on the kickstart server (the one the generates the kickstart file). Code outside of the eval blocks is run on the kickstarting host. -mjk On Dec 26, 2003, at 3:35 PM, Reed Scarce wrote: > The line: > > chkconfig --level 3 gpm on > > works great from the command line, not in extend-compute.xml. Thanks > for the new tool though, always glad. The line above is in a block > without <eval shell="bash"> tags. I'll keep trying and rtm. Is it > possible this is a 2.6.2 issue? The live environment restricts me > from using a more recent version. > > >> From: "Mason J. Katz" <mjk at sdsc.edu> >> To: "Reed Scarce" <junkscarce at hotmail.com> >> CC: npaci-rocks-discussion at sdsc.edu >> Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation >> fails >> Date: Tue, 23 Dec 2003 16:35:13 -0800 >> >> "man chkconfig" >> >> If you use chkconfig you do not need to create the rc*.d/* files and >> they are put in place for you. >> >> -mjk >>
    >> On Dec 23, 2003, at 3:43 PM, Reed Scarce wrote: >> >>> Within /export/home/install/profiles/2.3.2/site-nodes >>> extend-compute.xml lies code like this commented code: >>> <post> >>> /bin/mkdir /mnt/plc/ <-- works --> >>> /bin/mkdir /mnt/plc/plc_data <-- works --> >>> /bin/ln -s /mnt/plc_data /data1 <-- works --> >>> /bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to ln, >>> source exists --> >>> </post> >>> >>> I don't understand why the ln to a directory succeeds but a ln to a >>> script fails. >>> >>> BTW, Dr. Landman, I've attempted to use your build.pl but it seems >>> to faill with: >>> Can't stat >>> `/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm . >>> (my note: the path ends at RPMS) I swear I thought I saw a solution >>> to this once but I can't find it again. >>> Upon reinstallation with the file your tool created >>> (/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm) >>> anaconda threw back the exception: Traceback (innermost last): file >>> "/usr/bin/anaconda.real", line 633, in ? intf.run(id, dispatch, >>> configFileData) File >>> "/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py", line >>> 427 in run >>> ok save debug >>> >>> >>> TIA Reed Scarce >>> >>> _________________________________________________________________ >>> Tired of slow downloads? Compare online deals from your local >>> high-speed providers now. https://broadband.msn.com >> > > _________________________________________________________________ > Worried about inbox overload? Get MSN Extra Storage now! 
> http://join.msn.com/?PAGE=features/es From apseyed at bu.edu Sat Dec 27 12:32:40 2003 From: apseyed at bu.edu (apseyed at bu.edu) Date: Sat, 27 Dec 2003 15:32:40 -0500 Subject: [Rocks-Discuss]Re: npaci-rocks-discussion digest, Vol 1 #663 - 2 msgs In-Reply-To: <200312272013.hBRKDbJ15227@postal.sdsc.edu> References: <200312272013.hBRKDbJ15227@postal.sdsc.edu> Message-ID: <1072557160.3fedec68d07d6@www.bu.edu> For what its worth, Why don't you try specifying the absolute path (/sbin/chkconfig) and setting debug flags and output file. (If you can confirm /sbin is in $PATH during the life of the script nevermind the first suggestion.) echo "got to chkconfig beginning" > /tmp/ks.log
    /sbin/chkconfig --level 3 gpm on echo "go to chkconfig end" >> /tmp/ks.log /sbin/chkconfig --list | grep gpm >> /tmp/ks.log -Patrice Quoting npaci-rocks-discussion-request at sdsc.edu: > Send npaci-rocks-discussion mailing list submissions to > npaci-rocks-discussion at sdsc.edu > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion > or, via email, send a message with subject or body 'help' to > npaci-rocks-discussion-request at sdsc.edu > > You can reach the person managing the list at > npaci-rocks-discussion-admin at sdsc.edu > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of npaci-rocks-discussion digest..." > > > Today's Topics: > > 1. Re: Extend-compute.xml issue, ln creation fails (Reed Scarce) > 2. Re: Extend-compute.xml issue, ln creation fails (Mason J. > Katz) > > --__--__-- > > Message: 1 > From: "Reed Scarce" <junkscarce at hotmail.com> > To: mjk at sdsc.edu > Cc: npaci-rocks-discussion at sdsc.edu > Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation > fails > Date: Fri, 26 Dec 2003 23:35:57 +0000 > > The line: > > chkconfig --level 3 gpm on > > works great from the command line, not in extend-compute.xml. Thanks > for > the new tool though, always glad. The line above is in a block > without > <eval shell="bash"> tags. I'll keep trying and rtm. Is it possible > this is > a 2.6.2 issue? The live environment restricts me from using a more > recent > version. > > > >From: "Mason J. Katz" <mjk at sdsc.edu> > >To: "Reed Scarce" <junkscarce at hotmail.com> > >CC: npaci-rocks-discussion at sdsc.edu > >Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation
    > fails > >Date: Tue, 23 Dec 2003 16:35:13 -0800 > > > >"man chkconfig" > > > >If you use chkconfig you do not need to create the rc*.d/* files and > they > >are put in place for you. > > > > -mjk > > > >On Dec 23, 2003, at 3:43 PM, Reed Scarce wrote: > > > >>Within /export/home/install/profiles/2.3.2/site-nodes > extend-compute.xml > >>lies code like this commented code: > >><post> > >>/bin/mkdir /mnt/plc/ <-- works --> > >>/bin/mkdir /mnt/plc/plc_data <-- works --> > >>/bin/ln -s /mnt/plc_data /data1 <-- works --> > >>/bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to > ln, > >>source exists --> > >></post> > >> > >>I don't understand why the ln to a directory succeeds but a ln to a > script > >>fails. > >> > >>BTW, Dr. Landman, I've attempted to use your build.pl but it seems > to > >>faill with: > >>Can't stat > >>`/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm > . > >>(my note: the path ends at RPMS) I swear I thought I saw a > solution to > >>this once but I can't find it again. > >>Upon reinstallation with the file your tool created > >>(/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm) > anaconda > >>threw back the exception: Traceback (innermost last): file > >>"/usr/bin/anaconda.real", line 633, in ? intf.run(id, dispatch, > >>configFileData) File > >>"/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py", line > 427 in > >>run > >>ok save debug > >> > >> > >>TIA Reed Scarce > >> > >>_________________________________________________________________ > >>Tired of slow downloads? Compare online deals from your local > high-speed > >>providers now. https://broadband.msn.com > > > > _________________________________________________________________
    > Worried about inbox overload? Get MSN Extra Storage now! > http://join.msn.com/?PAGE=features/es > > > --__--__-- > > Message: 2 > Cc: npaci-rocks-discussion at sdsc.edu > From: "Mason J. Katz" <mjk at sdsc.edu> > Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation > fails > Date: Fri, 26 Dec 2003 16:46:22 -0800 > To: "Reed Scarce" <junkscarce at hotmail.com> > > Not sure if this answers your question. But.. > > The <eval></eval> blocks are for code to be run on the kickstart > server > (the one the generates the kickstart file). Code outside of the eval > > blocks is run on the kickstarting host. > > -mjk > > > On Dec 26, 2003, at 3:35 PM, Reed Scarce wrote: > > > The line: > > > > chkconfig --level 3 gpm on > > > > works great from the command line, not in extend-compute.xml. > Thanks > > for the new tool though, always glad. The line above is in a block > > > without <eval shell="bash"> tags. I'll keep trying and rtm. Is it > > > possible this is a 2.6.2 issue? The live environment restricts me > > > from using a more recent version. > > > > > >> From: "Mason J. Katz" <mjk at sdsc.edu> > >> To: "Reed Scarce" <junkscarce at hotmail.com> > >> CC: npaci-rocks-discussion at sdsc.edu > >> Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation > > >> fails > >> Date: Tue, 23 Dec 2003 16:35:13 -0800 > >> > >> "man chkconfig" > >> > >> If you use chkconfig you do not need to create the rc*.d/* files > and > >> they are put in place for you. > >> > >> -mjk > >> > >> On Dec 23, 2003, at 3:43 PM, Reed Scarce wrote:
    > >> > >>> Within /export/home/install/profiles/2.3.2/site-nodes > >>> extend-compute.xml lies code like this commented code: > >>> <post> > >>> /bin/mkdir /mnt/plc/ <-- works --> > >>> /bin/mkdir /mnt/plc/plc_data <-- works --> > >>> /bin/ln -s /mnt/plc_data /data1 <-- works --> > >>> /bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to > ln, > >>> source exists --> > >>> </post> > >>> > >>> I don't understand why the ln to a directory succeeds but a ln to > a > >>> script fails. > >>> > >>> BTW, Dr. Landman, I've attempted to use your build.pl but it > seems > >>> to faill with: > >>> Can't stat > >>> `/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm > . > >>> (my note: the path ends at RPMS) I swear I thought I saw a > solution > >>> to this once but I can't find it again. > >>> Upon reinstallation with the file your tool created > >>> (/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm) > >>> anaconda threw back the exception: Traceback (innermost last): > file > >>> "/usr/bin/anaconda.real", line 633, in ? intf.run(id, dispatch, > > >>> configFileData) File > >>> "/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py", > line > >>> 427 in run > >>> ok save debug > >>> > >>> > >>> TIA Reed Scarce > >>> > >>> > _________________________________________________________________ > >>> Tired of slow downloads? Compare online deals from your local > >>> high-speed providers now. https://broadband.msn.com > >> > > > > _________________________________________________________________ > > Worried about inbox overload? Get MSN Extra Storage now! > > http://join.msn.com/?PAGE=features/es > > > > --__--__-- > > _______________________________________________ > npaci-rocks-discussion mailing list > npaci-rocks-discussion at sdsc.edu > http://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion >
    > > End of npaci-rocks-discussion Digest > From rocks_india at yahoo.co.in Sat Dec 27 20:20:40 2003 From: rocks_india at yahoo.co.in (=?iso-8859-1?q?Rocks=20India?=) Date: Sun, 28 Dec 2003 04:20:40 +0000 (GMT) Subject: [Rocks-Discuss]Rocks 3.0 Newbeeeeeeee Message-ID: <20031228042040.88990.qmail@web8301.mail.in.yahoo.com> Hello All, I am new to Rocks, i was able to download and install Rocks 3.0. I am not sure if Globus 3.0 gets installed during the installation process.I tried to use simple ca commands and get command not found error. Do i need to download Globus Tool Kit and install it or would it be installed along with rocks. Or can any one direct me to a site or give me steps that need to be taken after installing rocks what need to be done for manipulating globus Rocks-India ________________________________________________________________________ Yahoo! India Matrimony: Find your partner online. Go to http://yahoo.shaadi.com From bruno at rocksclusters.org Sat Dec 27 21:35:28 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Sat, 27 Dec 2003 21:35:28 -0800 Subject: [Rocks-Discuss]Rocks 3.0 Newbeeeeeeee In-Reply-To: <20031228042040.88990.qmail@web8301.mail.in.yahoo.com> References: <20031228042040.88990.qmail@web8301.mail.in.yahoo.com> Message-ID: <A4E388DE-38F7-11D8-9E96-000A95C4E3B4@rocksclusters.org> > I am new to Rocks, i was able to download > and > install Rocks 3.0. I am not sure if Globus 3.0 gets > installed during the installation process.I tried to > use simple ca commands and get command not found > error. > Do i need to download Globus Tool Kit and > install it or would it be installed along with rocks. > > Or can any one direct me to a site or give me steps > that > need to be taken after installing rocks what need to > be done for manipulating globus here's the steps, but it would require reinstalling your frontend: go to:
    http://www.rocksclusters.org/rocks-documentation/3.1.0/iso-images.html and download: Rocks Base, HPC Roll, SGE Roll and the Grid Roll then burn them all to CD. then follow the directions at: http://www.rocksclusters.org/rocks-documentation/3.1.0/install-frontend.html but, before you get started, you should consult this page too: http://rocks.npaci.edu/roll-documentation/grid/3.0/adding-the-roll.html at the end of the process, your frontend will be configured with globus. - gb From ramonjt at ucia.gov Mon Dec 29 09:08:45 2003 From: ramonjt at ucia.gov (ramonjt) Date: Mon, 29 Dec 2003 12:08:45 -0500 Subject: [Rocks-Discuss]Rocks 3.1.0 Message-ID: <3FF05F9D.F6A122F@ucia.gov> Folks, Which set of Rocks 3.1.0 downloads support Xeon Processors, "Pentium and Athlon" or "Itanium"? Thanks, Ramon From bruno at rocksclusters.org Mon Dec 29 09:31:56 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Mon, 29 Dec 2003 09:31:56 -0800 Subject: [Rocks-Discuss]Rocks 3.1.0 In-Reply-To: <3FF05F9D.F6A122F@ucia.gov> References: <3FF05F9D.F6A122F@ucia.gov> Message-ID: <E60E6664-3A24-11D8-9E96-000A95C4E3B4@rocksclusters.org> > Which set of Rocks 3.1.0 downloads support Xeon Processors, "Pentium > and Athlon" or "Itanium"? xeons are x86 processors -- so you want the ISO images found under the section: Software for x86 (Pentium and Athlon) - gb
    From landman at scalableinformatics.com Mon Dec 29 10:49:49 2003 From: landman at scalableinformatics.com (landman) Date: Mon, 29 Dec 2003 13:49:49 -0500 Subject: [Rocks-Discuss]3.1.0 surprises Message-ID: <20031229183225.M11961@scalableinformatics.com> Pulled the distro. Burned it after checking md5's. Ok. Booted/installed test cluster, completely vanilla, just defaults. SSH is too slow. Wow. 5-10 seconds to log in. Ok, now out at a customer site with the disks. Unhappily discovered that the following are missing: a) md (e.g. Software RAID): Just try to build one. Anaconda will happily let you do this ... though it will die in the formatting stages. Dropping into the shell (Alt-F2) and looking for the md module (lsmod) shows nothing. Insmod the md also doesn't do anything. Catting /proc/devices shows no md as a character or block device. If md is really not there anymore, it should be removed from anaconda, just like ... b) ext3. There is no ext3 available for the install. Also discovered how incredibly fragile anaconda is. In order to install, you have to wipe the disks. It will not install if there is an md (software raid) device, chosing instead to crap out after you have entered in all the information. To say that this is annoying is a slight understatement. This is an anaconda issue, not a ROCKS issue, though as a result of this issue, ROCKS is less functional than it could be. I also noted that there is no xfs option. This means that I will need to hack new kernels later on after the install. Moreover, I will also need to turn on the ext3 journaling features later on (post install). Hopefully 3.1.1 or 3.2 will fix some of these things. 
Joe -- Joseph Landman, Ph.D Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://scalableinformatics.com phone: +1 734 612 4615 From junkscarce at hotmail.com Mon Dec 29 15:15:52 2003 From: junkscarce at hotmail.com (Reed Scarce) Date: Mon, 29 Dec 2003 23:15:52 +0000 Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails Message-ID: <BAY1-F661p8yTXBgtnm0000a4dd@hotmail.com>
    Are there any examples of Rocks 2.3.2 extend-compute.xml scripts that work? I need to know the limitations of the distribution. As far as I can tell the commands are available (`which command` locates the commands fine) but they don't necessarily perform the job as expected. I had seen the `eval...` clairification in the archives. As it stands I plan to mkdir, ln and echo in the extend-c... but then run the heart of the customization (scripted) once the nodes are up. It just doesn't seem to be what was intended. As always, thanks for your help --Reed >From: "Mason J. Katz" <mjk at sdsc.edu> >To: "Reed Scarce" <junkscarce at hotmail.com> >CC: npaci-rocks-discussion at sdsc.edu >Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails >Date: Fri, 26 Dec 2003 16:46:22 -0800 > >Not sure if this answers your question. But.. > >The <eval></eval> blocks are for code to be run on the kickstart server >(the one the generates the kickstart file). Code outside of the eval >blocks is run on the kickstarting host. > > -mjk > > >On Dec 26, 2003, at 3:35 PM, Reed Scarce wrote: > >>The line: >> >>chkconfig --level 3 gpm on >> >>works great from the command line, not in extend-compute.xml. Thanks for >>the new tool though, always glad. The line above is in a block without >><eval shell="bash"> tags. I'll keep trying and rtm. Is it possible this >>is a 2.6.2 issue? The live environment restricts me from using a more >>recent version. >> >> >>>From: "Mason J. Katz" <mjk at sdsc.edu> >>>To: "Reed Scarce" <junkscarce at hotmail.com> >>>CC: npaci-rocks-discussion at sdsc.edu >>>Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails >>>Date: Tue, 23 Dec 2003 16:35:13 -0800 >>> >>>"man chkconfig" >>> >>>If you use chkconfig you do not need to create the rc*.d/* files and they >>>are put in place for you. 
>>> >>> -mjk >>> >>>On Dec 23, 2003, at 3:43 PM, Reed Scarce wrote: >>> >>>>Within /export/home/install/profiles/2.3.2/site-nodes extend-compute.xml >>>>lies code like this commented code:
    >>>><post> >>>>/bin/mkdir /mnt/plc/ <-- works --> >>>>/bin/mkdir /mnt/plc/plc_data <-- works --> >>>>/bin/ln -s /mnt/plc_data /data1 <-- works --> >>>>/bin/ln /etc/rc.d/init.d/gpm /etc/rc.d/rc3.d/S15gpm <-- fails to ln, >>>>source exists --> >>>></post> >>>> >>>>I don't understand why the ln to a directory succeeds but a ln to a >>>>script fails. >>>> >>>>BTW, Dr. Landman, I've attempted to use your build.pl but it seems to >>>>faill with: >>>>Can't stat >>>>`/usr/src/redhat/RPMS/noarch//finishing-server-"3.0"-1.noarch.rpm . >>>>(my note: the path ends at RPMS) I swear I thought I saw a solution to >>>>this once but I can't find it again. >>>>Upon reinstallation with the file your tool created >>>>(/usr/src/RedHat/RPMS/i386/finishing-scripts-3.00-1.i386.rpm) anaconda >>>>threw back the exception: Traceback (innermost last): file >>>>"/usr/bin/anaconda.real", line 633, in ? intf.run(id, dispatch, >>>>configFileData) File >>>>"/usr/src/build.90289-i386/install//usr/lib/anaconda/text.py", line 427 >>>>in run >>>>ok save debug >>>> >>>> >>>>TIA Reed Scarce >>>> >>>>_________________________________________________________________ >>>>Tired of slow downloads? Compare online deals from your local high-speed >>>>providers now. https://broadband.msn.com >>> >> >>_________________________________________________________________ >>Worried about inbox overload? Get MSN Extra Storage now! >>http://join.msn.com/?PAGE=features/es > _________________________________________________________________ Make your home warm and cozy this winter with tips from MSN House & Home. 
http://special.msn.com/home/warmhome.armx From dlane at ap.stmarys.ca Mon Dec 29 15:44:23 2003 From: dlane at ap.stmarys.ca (Dave Lane) Date: Mon, 29 Dec 2003 19:44:23 -0400 Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails In-Reply-To: <BAY1-F661p8yTXBgtnm0000a4dd@hotmail.com> Message-ID: <5.2.0.9.0.20031229194312.01ed0f40@ap.stmarys.ca> At 11:15 PM 12/29/2003 +0000, Reed Scarce wrote: >Are there any examples of Rocks 2.3.2 extend-compute.xml scripts that work? Reed, Below is a script that worked fine for me (with 2.3.2). What it does should be fairly explanatory...Dave
    --->>> <post> <!-- Insert your post installation script here. This code will be executed on the destination node after the packages have been installed. Typically configuration files are built and services setup in this section. --> mv /usr/local /usr/local-old ln -s /home/local /usr/local ln -s /home/opt/intel /opt/intel ln -s /home/disc15 /disc15 mkdir /scratch/tmp chmod 1777 /scratch/tmp echo '#!/bin/bash' > /etc/init.d/wait echo 'sleep 60' >> /etc/init.d/wait chmod +x /etc/init.d/wait ln -s /etc/init.d/wait /etc/rc3.d/S11wait ln -s /etc/init.d/wait /etc/rc4.d/S11wait ln -s /etc/init.d/wait /etc/rc5.d/S11wait <eval sh="python"> <!-- This is python code that will be executed on the frontend node during kickstart generation. You may contact the database, make network queries, etc. These sections are generally used to help build more complex configuration files. The 'sh' attribute may point to any language interpreter such as "bash", "perl", "ruby", etc. --> </eval> </post> From bruno at rocksclusters.org Mon Dec 29 19:03:25 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Mon, 29 Dec 2003 19:03:25 -0800 Subject: [Rocks-Discuss]3.1.0 surprises In-Reply-To: <20031229183225.M11961@scalableinformatics.com> References: <20031229183225.M11961@scalableinformatics.com> Message-ID: <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org> > Pulled the distro. Burned it after checking md5's. Ok. > Booted/installed test > cluster, completely vanilla, just defaults. i'm assuming this is an x86 installation, yes? > SSH is too slow. Wow. 5-10 seconds to log in. that is not the case on our clusters. in fact, we tested this on all three architectures and all three are 'fast'. > Ok, now out at a customer site with the disks. > > Unhappily discovered that the following are missing: >
    > a) md (e.g. Software RAID): Just try to build one. Anaconda will > happily let > you do this ... though it will die in the formatting stages. Dropping > into the > shell (Alt-F2) and looking for the md module (lsmod) shows nothing. > Insmod the > md also doesn't do anything. Catting /proc/devices shows no md as a > character > or block device. > > If md is really not there anymore, it should be removed from anaconda, > just like ... > > b) ext3. There is no ext3 available for the install. > > Also discovered how incredibly fragile anaconda is. In order to > install, you > have to wipe the disks. It will not install if there is an md > (software raid) > device, chosing instead to crap out after you have entered in all the > information. To say that this is annoying is a slight understatement. > This is > an anaconda issue, not a ROCKS issue, though as a result of this > issue, ROCKS is > less functional than it could be. we'll look into the above two issues. > I also noted that there is no xfs option. This means that I will need > to hack > new kernels later on after the install. just curious, is xfs offered as an option on other redhat supported products? also (and i'm assuming this will be no consolation to you, but it may be to others), building a new kernel RPM is straightforward in rocks: http://www.rocksclusters.org/rocks-documentation/3.1.0/customization- kernel.html - gb From landman at scalableinformatics.com Mon Dec 29 19:44:16 2003 From: landman at scalableinformatics.com (Joe Landman) Date: Mon, 29 Dec 2003 22:44:16 -0500 Subject: [Rocks-Discuss]3.1.0 surprises In-Reply-To: <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org> References: <20031229183225.M11961@scalableinformatics.com> <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org> Message-ID: <1072755856.4432.15.camel@protein.scalableinformatics.com> On Mon, 2003-12-29 at 22:03, Greg Bruno wrote: > > Pulled the distro. Burned it after checking md5's. Ok. 
> > Booted/installed test > > cluster, completely vanilla, just defaults. >
> i'm assuming this is an x86 installation, yes? Yes. > > > SSH is too slow. Wow. 5-10 seconds to log in. > > that is not the case on our clusters. in fact, we tested this on all > three architectures and all three are 'fast'. 2 different clusters exhibited the same results. Fixed one by applying dnsmasq to one of them. > > > Ok, now out at a customer site with the disks. > > > > Unhappily discovered that the following are missing: > > > > a) md (e.g. Software RAID): Just try to build one. Anaconda will > > happily let > > you do this ... though it will die in the formatting stages. Dropping > > into the > > shell (Alt-F2) and looking for the md module (lsmod) shows nothing. > > Insmod the > > md also doesn't do anything. Catting /proc/devices shows no md as a > > character > > or block device. > > > > If md is really not there anymore, it should be removed from anaconda, > > just like ... > > > > b) ext3. There is no ext3 available for the install. > > > > Also discovered how incredibly fragile anaconda is. In order to > > install, you > > have to wipe the disks. It will not install if there is an md > > (software raid) > > device, chosing instead to crap out after you have entered in all the > > information. To say that this is annoying is a slight understatement. > > This is > > an anaconda issue, not a ROCKS issue, though as a result of this > > issue, ROCKS is > > less functional than it could be. > > we'll look into the above two issues. Thanks > > > I also noted that there is no xfs option. This means that I will need > > to hack > > new kernels later on after the install. > > just curious, is xfs offered as an option on other redhat supported > products? Nope, nor will Redhat likely do this in the near/mid term. This is fairly common knowledge. All the other major distros do offer XFS. I hope that the defense of the current state isn't that "Redhat doesn't
support it". I might have misunderstood you, but Redhat is almost completely disinterested in clusters, so Redhat supporting/not supporting it is really not relevant. Curiously, cAos which is doing some of the similar things ROCKS is doing in terms of recompiling packages sans Redhat trademarks, has XFS and a number of other useful things in there. Regardless, having ext2 or vfat as your only fs options simply is not reasonable, as neither of these is really appropriate for very large disks, or big file systems. > > also (and i'm assuming this will be no consolation to you, but it may > be to others), building a new kernel RPM is straightforward in rocks: > > http://www.rocksclusters.org/rocks-documentation/3.1.0/customization- > kernel.html I had been planning to use a similar approach to this. I was/am simply quite surprised that the two options for ROCKS file systems are really not very good, and the good choices are unavailable. In all fairness this is more likely a constraint of anaconda than of ROCKS. I fixed the ext2/ext3 by a reboot after a quick tune2fs session and some fixup of the /etc/fstab. I have to say that I get less and less impressed with anaconda as time goes on. I fixed the partitioning problem (anaconda dies when it runs in an md'ed set of partitions) by wiping the disk and using knoppix to fdisk the disks. Autopartitioning is not an option, as the default choices are not all that good (another anaconda-ism).
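The ext2-to-ext3 fix described above is usually just adding a journal and updating /etc/fstab. A minimal sketch (the device name is illustrative, and the fstab edit is demonstrated on a scratch copy so it is safe to run anywhere):

```shell
# tune2fs needs a real block device, so that step is shown commented out:
#   tune2fs -j /dev/hda2     # adds a journal; the fs becomes mountable as ext3
# The matching /etc/fstab change, sketched on a scratch copy:
printf '/dev/hda2 / ext2 defaults 1 1\n' > /tmp/fstab.demo
sed -i 's/ ext2 / ext3 /' /tmp/fstab.demo
cat /tmp/fstab.demo    # entry now reads ext3; a reboot or remount picks it up
```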
> > - gb From cdwan at mail.ahc.umn.edu Mon Dec 29 20:58:20 2003 From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB)) Date: Mon, 29 Dec 2003 22:58:20 -0600 (CST) Subject: [Rocks-Discuss]3.1.0 surprises In-Reply-To: <1072755856.4432.15.camel@protein.scalableinformatics.com> References: <20031229183225.M11961@scalableinformatics.com> <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072755856.4432.15.camel@protein.scalableinformatics.com> Message-ID: <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu> I also encountered the Software RAID problem today. It made upgrading an existing ROCKS cluster a little tricky. Another behavior I noticed was that the CDs were not ejecting as the node installs finished. It was managable, but required watching to prevent the
endless reinstall cycle. -Chris Dwan From bruno at rocksclusters.org Mon Dec 29 21:48:22 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Mon, 29 Dec 2003 21:48:22 -0800 Subject: [Rocks-Discuss]3.1.0 surprises In-Reply-To: <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu> References: <20031229183225.M11961@scalableinformatics.com> <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072755856.4432.15.camel@protein.scalableinformatics.com> <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu> Message-ID: <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org> > Another behavior I noticed was that the CDs were not ejecting as the > node > installs finished. It was managable, but required watching to prevent > the > endless reinstall cycle. actually, it isn't a problem as the last CD in the frontend will be a roll and rolls are not bootable. - gb From cdwan at mail.ahc.umn.edu Mon Dec 29 21:51:13 2003 From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB)) Date: Mon, 29 Dec 2003 23:51:13 -0600 (CST) Subject: [Rocks-Discuss]3.1.0 surprises In-Reply-To: <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org> References: <20031229183225.M11961@scalableinformatics.com> <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072755856.4432.15.camel@protein.scalableinformatics.com> <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu> <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org> Message-ID: <Pine.GSO.4.58.0312292350001.4644@lenti.med.umn.edu> > > Another behavior I noticed was that the CDs were not ejecting as the > > node > > installs finished. It was managable, but required watching to prevent > > the > > endless reinstall cycle. > > actually, it isn't a problem as the last CD in the frontend will be a > roll and rolls are not bootable. You're right about the frontend. It was the compute nodes where it gave me trouble. Roll disks never go in those.
-Chris Dwan From landman at scalableinformatics.com Mon Dec 29 22:03:06 2003 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 30 Dec 2003 01:03:06 -0500
Subject: [Rocks-Discuss]3.1.0 surprises In-Reply-To: <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org> References: <20031229183225.M11961@scalableinformatics.com> <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072755856.4432.15.camel@protein.scalableinformatics.com> <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu> <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org> Message-ID: <1072764186.4469.16.camel@protein.scalableinformatics.com> What I had noticed is that some CD hardware does not eject when prompting for swapping in the roll. I swapped hardware and that fixed it. Rather odd. Seen this in 3 different systems. Worked ok with previous ROCKS. Is it possible to do something like a frontend askmethod akin to the "linux askmethod" and specifically have the ISO's online in a directory somewhere? Just curious... I find it interesting that 10 years after swapping floppies for OS installs, I am now swapping CDs... There is irony here somewhere. On Tue, 2003-12-30 at 00:48, Greg Bruno wrote: > > Another behavior I noticed was that the CDs were not ejecting as the > > node > > installs finished. It was managable, but required watching to prevent > > the > > endless reinstall cycle. > > actually, it isn't a problem as the last CD in the frontend will be a > roll and rolls are not bootable.
> > - gb From bruno at rocksclusters.org Mon Dec 29 22:28:45 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Mon, 29 Dec 2003 22:28:45 -0800 Subject: [Rocks-Discuss]3.1.0 surprises In-Reply-To: <1072764186.4469.16.camel@protein.scalableinformatics.com> References: <20031229183225.M11961@scalableinformatics.com> <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072755856.4432.15.camel@protein.scalableinformatics.com> <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu> <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072764186.4469.16.camel@protein.scalableinformatics.com> Message-ID: <6BB0542E-3A91-11D8-9E96-000A95C4E3B4@rocksclusters.org> > Is it possible to do something like a > > frontend askmethod > > akin to the "linux askmethod" and specifically have the ISO's online in > a directory somewhere? Just curious... the ability to install frontends remotely is at the top of our priority list for the next release.
> I find it interesting that 10 > years after swapping floppies for OS installs, I am now swapping CDs... > There is irony here somewhere. sorry, i'm going to have to evangelize rolls a bit. joe, do you not have just a bit of appreciation for rolls and what is going on under the sheets? we now have a formal way for you, that's right you, to augment the installation of a cluster. you get to programmatically interact with the installer at virtually any level. you get to tell the installer what bits you want it to lay down and how to configure them. and this is done completely independently of the core. the core has no idea of your bits, yet, it installs it and configures it to your specification. for you, this could be having the 'scalable informatics' roll that contains all your RPMS and XML configuration files. this ISO image could be completely proprietary, yet, the installer installs it. you could ship your roll worldwide and every one of your customers would, within 2 hours, have a scalable informatics cluster online running the applications you sold them. and, you know it would be running because you embedded the correct configuration into the roll. or, perhaps rolls work so smoothly, it just looks like CD swapping.
:-) - gb From landman at scalableinformatics.com Mon Dec 29 22:50:30 2003 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 30 Dec 2003 01:50:30 -0500 Subject: [Rocks-Discuss]3.1.0 surprises In-Reply-To: <6BB0542E-3A91-11D8-9E96-000A95C4E3B4@rocksclusters.org> References: <20031229183225.M11961@scalableinformatics.com> <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072755856.4432.15.camel@protein.scalableinformatics.com> <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu> <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072764186.4469.16.camel@protein.scalableinformatics.com> <6BB0542E-3A91-11D8-9E96-000A95C4E3B4@rocksclusters.org> Message-ID: <1072767030.4463.57.camel@protein.scalableinformatics.com> On Tue, 2003-12-30 at 01:28, Greg Bruno wrote: > > There is irony here somewhere. > > sorry, i'm going to have to evangelize rolls a bit. > > joe, do you not have just a bit of appreciation for rolls and what is > going on under the sheets? we now have a formal way for you, that's > right you, to augment the installation of a cluster. you get to > programmatically interact with the installer at virtually any level. > you get to tell the installer what bits you want it to lay down and how > to configure them. and this is done completely independently of the > core. the core has no idea of your bits, yet, it installs it and > configures it to your specification.
Actually I do have a pretty good appreciation for them. I see that they are a different way of solving the problems I have been solving for a while using "other methods" (http://scalableinformatics.com/downloads/finishing/finishing-v3.1.0.tar.gz). What I don't see is how to build them (yes, I did see the "source" messages, and "cvs", ...). The major issue for me is going to be anaconda, all its joy and bugs, and what directions its use forces ROCKS to follow (vis-a-vis file systems, etc). > > for you, this could be having the 'scalable informatics' roll that > contains all your RPMS and XML configuration files. this ISO image > could be completely proprietary, yet, the installer installs it. you > could ship your roll worldwide and every one of your customers would, > within 2 hours, have a scalable informatics cluster online running the > applications you sold them. and, you know it would be running because > you embedded the correct configuration into the roll. This is a nice vision, though it is unfortunately a vision. The customer would have to re-install the cluster head node when a new version of the bits comes out. Right? This is simply not tenable for a production cycle facility that needs to upgrade a package. Please let me know if my understanding is incorrect, I would be quite happy to hear this. The "other method" that I developed doesn't have this as a problem. Just re-install the compute nodes, and load the RPM on the head nodes. In fact I built some tools which simplify both the "other method" and the ROCKS method. As I have to worry about multiple different cluster distros (not just ROCKS, sorry, customers get what they need/want), I have to worry about interfacing with that distro. So I have some tools (the auto-build scripts) which simplify adding/removing packages into the extend-compute.xml.
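As a rough illustration of the kind of auto-build step described above (hypothetical paths and package names, not the actual scalableinformatics tool), a script that emits an extend-compute.xml adding an RPM can be as small as a heredoc:

```shell
# Generate a minimal extend-compute.xml; 'example-app' and the /tmp path are
# placeholders (Rocks itself reads these from the site-profiles tree).
mkdir -p /tmp/site-profiles/nodes
cat > /tmp/site-profiles/nodes/extend-compute.xml <<'EOF'
<?xml version="1.0" standalone="no"?>
<kickstart>
  <package>example-app</package>
  <post>
    /bin/ln -sf /share/apps /apps
  </post>
</kickstart>
EOF
```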
What I am hoping for rolls are two things: 1) insertable/removable from a live cluster without forcing a re-install of the head node (compute nodes, thats fine, not the head nodes) 2) simple documentation on how to build. If they are really quite simple, I see no reason I could not take the same tool I use to automate the building of installable RPMS for the other method actually emit a ROCKS roll. But I need to know how to do this. I am not sure I have sufficient time to "read the source, Luke" for this. I would be happy to do this given time, and customer demand/need. The other method had that, hence its development. > > or, perhaps rolls work so smoothly, it just looks like CD swapping. :-) My point was that after inserting the SGE roll, I had to get up from the console, walk over to the unit, swap in the next roll, iterate.... Felt like CD swapping to me. Rolls wont solve other problems which are anaconda specific (file systems, partitioning, formatting, RAID, network detection, etc). As there are multiple similar RHEL de-redhatifying efforts, some of which are drastically improving the installation process (by not using anaconda), are you folks looking to move away from anaconda any time
    soon? > > - gb -- From bruno at rocksclusters.org Mon Dec 29 23:45:52 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Mon, 29 Dec 2003 23:45:52 -0800 Subject: [Rocks-Discuss]3.1.0 surprises In-Reply-To: <1072767030.4463.57.camel@protein.scalableinformatics.com> References: <20031229183225.M11961@scalableinformatics.com> <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072755856.4432.15.camel@protein.scalableinformatics.com> <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu> <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072764186.4469.16.camel@protein.scalableinformatics.com> <6BB0542E-3A91-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072767030.4463.57.camel@protein.scalableinformatics.com> Message-ID: <31636F66-3A9C-11D8-9E96-000A95C4E3B4@rocksclusters.org> > This is a nice vision, though it is unfortunately a vision. The > customer would have to re-install the cluster head node when a new > version of the bits comes out. Right? This is simply not tenable for a > production cycle facility that needs to upgrade a package. Please let > me know if my understanding is incorrect, I would be quite happy to > hear > this. we've talked about this on the list and we've talked with you about this in person. you know the above statement is true. you also know it is a future direction for rolls. > What I am hoping for rolls are two things: 1) insertable/removable from > a live cluster without forcing a re-install of the head node (compute > nodes, thats fine, not the head nodes) 2) simple documentation on how > to > build. If they are really quite simple, I see no reason I could not > take the same tool I use to automate the building of installable RPMS > for the other method actually emit a ROCKS roll. But I need to know > how > to do this. I am not sure I have sufficient time to "read the source, > Luke" for this. I would be happy to do this given time, and customer > demand/need. The other method had that, hence its development. 
a roll developer's guide is in progress. and, as stated above, adding rolls to a live frontend is on our roadmap. > Rolls wont solve other problems which are anaconda specific (file > systems, partitioning, formatting, RAID, network detection, etc). not true. if you wish to get deeply involved with the red hat installer, you can develop a 'patch' roll that will change the installer to do as you wish. > As > there are multiple similar RHEL de-redhatifying efforts, some of which
> are drastically improving the installation process (by not using > anaconda), are you folks looking to move away from anaconda any time > soon? please educate us -- where can we download these installers and find the developer guides that describe how to interact with the installer. as for moving away from anaconda, i don't think that will happen anytime soon. anaconda has served us well. we have all had issues with the installer, but i would rather work with anaconda rather than reinvent it. the boys and girls at redhat have a vested interest in detecting and configuring the latest hardware and i plan on leveraging that. of the issues you mention above, the only one we don't know how to control yet is file system selection (but, we will look into it per your earlier request). we already manipulate anaconda to partition and format the drives to our specifications, and we have ideas on how to handle RAID and network naming (which is what i think you mean by network detection). - gb From landman at scalableinformatics.com Tue Dec 30 00:55:37 2003 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 30 Dec 2003 03:55:37 -0500 Subject: [Rocks-Discuss]3.1.0 surprises In-Reply-To: <31636F66-3A9C-11D8-9E96-000A95C4E3B4@rocksclusters.org> References: <20031229183225.M11961@scalableinformatics.com> <BC750EE0-3A74-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072755856.4432.15.camel@protein.scalableinformatics.com> <Pine.GSO.4.58.0312292255360.2986@lenti.med.umn.edu> <C6F470F2-3A8B-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072764186.4469.16.camel@protein.scalableinformatics.com> <6BB0542E-3A91-11D8-9E96-000A95C4E3B4@rocksclusters.org> <1072767030.4463.57.camel@protein.scalableinformatics.com> <31636F66-3A9C-11D8-9E96-000A95C4E3B4@rocksclusters.org> Message-ID: <1072774537.4463.131.camel@protein.scalableinformatics.com> On Tue, 2003-12-30 at 02:45, Greg Bruno wrote: > > This is a nice vision, though it is unfortunately a vision.
The > > customer would have to re-install the cluster head node when a new > > version of the bits comes out. Right? This is simply not tenable for a > > production cycle facility that needs to upgrade a package. Please let > > me know if my understanding is incorrect, I would be quite happy to > > hear > > this. > > we've talked about this on the list and we've talked with you about > this in person. you know the above statement is true. you also know it > is a future direction for rolls. I was simply responding to the evangelism which seemed to imply the functionality existed today. It doesn't, and we both agree that it is necessary. Although the vision will provide innumerable benefits ... ROCKS is not there yet, and won't be for a while.
That's ok though, as I have a reasonable workaround for some of these issues. And when I can insert and delete rolls live into a cluster, I'll modify my tools to emit rolls. Until then, it is as you said, a vision for the future. [...] > a roll developer's guide is in progress. and, as stated above, adding > rolls to a live frontend is on our roadmap. Adding and removing are needed as we have discussed. > > > Rolls wont solve other problems which are anaconda specific (file > > systems, partitioning, formatting, RAID, network detection, etc). > > not true. if you wish to get deeply involved with the red hat > installer, you can develop a 'patch' roll that will change the > installer to do as you wish. I guess I am at a loss to understand what it is you are doing then. If you are telling me I can hack around anaconda to my heart's content, why do you tell me later on that ROCKS is deeply wedded to anaconda and will not change soon? I will assume I am missing something here. Can I replace anaconda? This is what I think you are saying. If you are instead saying, no don't replace, just hack it, I am not sure I want to do that. It is a very large and complex beast, with one system doing the job of many. Jack of all trades. More than half of the pain I have experienced deploying ROCKS is directly attributable to anaconda. I would like to work around it. If I can completely replace it under ROCKS this could be of interest. If I cannot, and ROCKS will always remain closely tied to RedHat specific technology (e.g. anaconda), that is also important to know. > > > As > > there are multiple similar RHEL de-redhatifying efforts, some of which > > are drastically improving the installation process (by not using > > anaconda), are you folks looking to move away from anaconda any time > > soon? > > please educate us -- where can we download these installers and find > the developer guides that describe how to interact with the installer.
If you are serious about this, I would be happy to help you find more development info and help make introductions to some of the people doing this stuff. If you are not serious about this, thats fine too. > as for moving away from anaconda, i don't think that will happen > anytime soon. anaconda has served us well. we have all had issues with > the installer, but i would rather work with anaconda rather than > reinvent it. the boys and girls at redhat have a vested interest in > detecting and configuring the latest hardware and i plan on leveraging > that. Knoppix makes good use of the anaconda detection routines without using anaconda. You do not need anaconda in its entirety for the detection routines.
While Redhat has a vested interest in making sure it detects hardware well, the software that does its installation has been getting more and more fragile compared to other installation systems. Simple failures of one item or the other in the SUSE YAST tool, or the Mandrake installer, or for that matter, most of the non-anaconda based installers do not force you to start over from the beginning. Stack traces are not given, and you are not asked to debug an arcane and complex python program from a highly limited command window. You are brought back to a well known and well defined state, and you have a finite and non zero chance of recovering from the failure. This is different than the anaconda experience, where the slightest hiccup, which would be trivially correctable given the opportunity, results in a complete failure of the process. This has resulted in our discovery of the RH9/RHEL fragility and sensitivity (and lack of ability) to software raid, partitioning, and related. This has wasted many hours of our collective time, and the inability to use the upgrade option for those of us with software RAID systems. As ROCKS depends critically upon this bit of technology that you indicate later on is so important, ROCKS happens to share in its pitfalls, even though these are not ROCKS problems. I am not sure if you understand how much time I have to spend explaining to customers and end users why what they are seeing are not ROCKS problems but Redhat artifacts. Part of the reason I am raising this issue in this forum is that I have spent altogether too much time trying to explain this to various users. > of the issues you mention above, the only one we don't know how to > control yet is file system selection (but, we will look into it per > your earlier request).
> we already manipulate anaconda to partition and > format the drives to our specifications, and we have ideas on how to > handle RAID and network naming (which is what i think you mean by > network detection). Network detection is a) getting the right network driver config, 1) by detection, 2) from floppy/usb/whatever; b) getting the correct network interface ordering (what you call naming). The point you (somewhat whimsically) made was that I could create Scalable Informatics rolls and ship them around the world for people to use in 2 hours. Great. Good vision, and that is something like what I am looking at. I have that now with my tools, but I can always expand their functionality. Now the problem is, if after shipping out my roll, when my end users install it, anaconda barfs in some new and exciting manner (has happened already with the finishing scripts, and I have worked hard to try to figure out what is broken in anaconda to work around its bugs), who are the customers going to blame? My experience thus far is that ROCKS is taking more than its fair share of heat over bugs that it has nothing to do with.
From fds at sdsc.edu Tue Dec 30 05:53:48 2003 From: fds at sdsc.edu (fds at sdsc.edu) Date: Tue, 30 Dec 2003 05:53:48 -0800 (PST) Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails In-Reply-To: <BAY1-F661p8yTXBgtnm0000a4dd@hotmail.com> References: <BAY1-F661p8yTXBgtnm0000a4dd@hotmail.com> Message-ID: <1291.194.125.171.53.1072792428.squirrel@uhura.sdsc.edu> Code in the <post> section of an xml file (extend-compute or otherwise) can be almost anything. When the script is run, the environment is not as full as usual, which is why we always recommend specifying the full path to commands. As you saw, /bin and /usr/bin are in the path, so certain things like "which sed" will work, for example. Remember that everything in the eval tags gets run at kickstart generation time (on the frontend). Everything else (the naked commands in the post section) are run by the node being installed. We do intend for the heart of the customization to be performed at kickstart time. I would be surprised if you had to postpone many tasks until the node was up, although this does happen occasionally. The globus and condor post configuration contain tasks that cannot be done at install time. Send us the scripts in question and we will take a look. -Federico > Are there any examples of Rocks 2.3.2 extend-compute.xml scripts that > work? > I need to know the limitations of the distribution. As far as I can tell > the commands are available (`which command` locates the commands fine) but > they don't necessarily perform the job as expected. I had seen the > `eval...` clarification in the archives. > > As it stands I plan to mkdir, ln and echo in the extend-c... but then run > the heart of the customization (scripted) once the nodes are up. It just > doesn't seem to be what was intended.
> > As always, thanks for your help > --Reed > From purikk at hotmail.com Tue Dec 30 06:03:02 2003 From: purikk at hotmail.com (Purushotham Komaravolu) Date: Tue, 30 Dec 2003 09:03:02 -0500 Subject: [Rocks-Discuss]Licensing References: <200312300711.hBU7BeJ14002@postal.sdsc.edu> Message-ID: <BAY1-DAV14HJL2WZcXm0000fc27@hotmail.com> Hi All, I would like to know the list of the components that have to be
licensed, when we install ROCKS as a commercial solution. Thanks Happy Holidays Puru From doug at seismo.berkeley.edu Tue Dec 30 10:53:36 2003 From: doug at seismo.berkeley.edu (Doug Neuhauser) Date: Tue, 30 Dec 2003 10:53:36 -0800 (PST) Subject: [Rocks-Discuss]Rocks 3.1.0 install problems Message-ID: <200312301853.hBUIragp015469@perry.geo.berkeley.edu> I am having a problem upgrading Rocks 2.3.2 to 3.1.0. Both my head node and compute nodes are dual XEON 2.4 GHz boxes. We burned the CDs from the following images: rocks-base-3.1.0.i386.iso roll-hpc-3.1.0-0.i386.iso roll-grid-3.1.0-0.any.iso roll-intel-3.1.0-0.any.iso roll-sge-3.1.0-0.any.iso I verified the md5s both on the downloaded images from the rocks web site and the md5s on the burned cds. They are fine. I have run the upgrade several times -- at least once with all of the rolls, and once with just the rocks base and hpc roll. The head node installs with no problem using the command frontend upgrade I can login and run insert-ethers, telling it to look for compute nodes. When I power on a compute node, it boots grub, selects the only kernel on its local disk Rocks Reinstall and runs through the /sbin/loader. The blue screen comes up, the compute node requests and receives a dynamic IP address from the head node, but then within a few seconds aborts with the messages: install exited abnormally - received signal 11 sending termination signals ... done sending kill signals ... done disabling swap ... unmounting filesystems ... /proc/bus/usb done /proc done /dev/pts done You may safely reboot your system It appears that the "Rocks Reinstall" kernel on the disk is not compatible with Rocks 3.1.0. When I changed the compute node boot order to perform a PXE boot before the hard disks, it properly downloads the 3.1.0 kernel from the head node, reformats the disk, and installs 3.1.0 properly. I have to catch it in the reboot, and change the boot order to use the disk before PXE, or I get into an infinite loop.
Is there any better way to address this problem? The procedure of:
  set PXE boot first
  boot from net, install rocks 3.1.0 on disk
  reboot
  catch node during reboot, change boot order to floppy,disk,net
reboot for each node is tedious. Did I do something wrong in how I shut my 2.3.2 cluster down before the upgrade? If so, some notes about this in the install instructions would be useful. - Doug N ------------------------------------------------------------------------ Doug Neuhauser University of California, Berkeley doug at seismo.berkeley.edu Berkeley Seismological Laboratory Phone: 510-642-0931 215 McCone Hall # 4760 Fax: 510-643-5811 Berkeley, CA 94720-4760 From bruno at rocksclusters.org Tue Dec 30 11:29:14 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Tue, 30 Dec 2003 11:29:14 -0800 Subject: [Rocks-Discuss]Rocks 3.1.0 install problems In-Reply-To: <200312301853.hBUIragp015469@perry.geo.berkeley.edu> References: <200312301853.hBUIragp015469@perry.geo.berkeley.edu> Message-ID: <73E3933E-3AFE-11D8-9E96-000A95C4E3B4@rocksclusters.org> On Dec 30, 2003, at 10:53 AM, Doug Neuhauser wrote: > I am having a problem upgrading Rocks 2.3.2 to 3.1.0. > Both my head node and compute nodes are dual XEON 2.4 GHz boxes. > > We burned the CDs from the following images: > rocks-base-3.1.0.i386.iso > roll-hpc-3.1.0-0.i386.iso > roll-grid-3.1.0-0.any.iso > roll-intel-3.1.0-0.any.iso > roll-sge-3.1.0-0.any.iso > I verified the md5s both on the downloaded images from the rocks > web site and the md5s on the burned cds. They are fine. > I have run the upgrade several times -- at least once with all of the > rolls, and once with just the rocks base and hpc roll. > > The head node installs with no problem using the command > frontend upgrade > I can login and run insert-ethers, telling it to look for compute > nodes. > > When I power on a compute node, it boots grub, selects the only > kernel on its local disk > Rocks Reinstall > and runs through the /sbin/loader. 
> The blue screen comes up, the compute node requests and receives a > dynamic IP address from the head node, but then within a few seconds > aborts with the messages: > install exited abnormally - received signal 11 > sending termination signals ... done > sending kill signals ... done > disabling swap ... > unmounting filesystems ... > /proc/bus/usb done
> /proc done > /dev/pts done > You may safely reboot your system > > It appears that the "Rocks Reinstall" kernel on the disk is not > compatible > with Rocks 3.1.0. When I changed the compute node boot order to > perform > a PXE boot before the hard disks, it properly downloads the 3.1.0 > kernel > from the head node, reformats the disk, and installs 3.1.0 properly. > I have to catch it in the reboot, and change the boot order to use the > disk before PXE, or I get into an infinite loop. > > Is there any better way to address this problem? The procedure of: > set PXE boot first > boot from net, install rocks 3.1.0 on disk > reboot > catch node during reboot, change boot order to floppy,disk,net > reboot > for each node is tedious. > > Did I do something wrong in how I shut my 2.3.2 cluster down before the > upgrade? If so, some notes about this in the install instructions > would > be useful. you're right, the 2.3.2 installer (anaconda from redhat's version 7.3) is not compatible with the installer on rocks 3.1 (anaconda from redhat's enterprise linux 3.0). you will have to reinstall your cluster in one of two ways: 1) if your compute nodes support PXE that is enabled from the keyboard -- that is, when you boot the node, in BIOS you see a message that says "Press F12 for Network Boot (PXE)". if your nodes have that, then you'll have to boot the nodes, one by one and, when you see the message, press the F12 key, then move to the next node. 2) use the rocks base CD to boot each compute node. when insert-ethers reports that it discovered the node, take the CD out and put it in the next compute node. but, if your compute nodes were initially installed with PXE, the fastest way to upgrade the compute nodes is to simply turn all the compute nodes off, upgrade the frontend, run insert-ethers, then turn the compute nodes on one by one. the compute nodes should be set for PXE boot which will pull the installer from the frontend and therefore get the updated installer. 
as you state above, we need to document this. thanks for the bug report. - gb
From doug at seismo.berkeley.edu Tue Dec 30 11:45:59 2003 From: doug at seismo.berkeley.edu (Doug Neuhauser) Date: Tue, 30 Dec 2003 11:45:59 -0800 (PST) Subject: [Rocks-Discuss]Rocks 3.1.0 install problems Message-ID: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu> Greg, 1. I don't have cdroms on my compute nodes, only floppy. :( 2. My boot order on the compute nodes is normally: floppy, disk, PXE 3. I don't have a hot-key override to force PXE boot. I have to change the BIOS boot order to enable PXE boot. > but, if your compute nodes were initially installed with PXE, the > fastest way to upgrade the compute nodes is to simply turn all the > compute nodes off, upgrade the frontend, run insert-ethers, then turn > the compute nodes on one by one. the compute nodes should be set for > PXE boot which will pull the installer from the frontend and therefore > get the updated installer. I don't understand this. I can't leave the compute nodes with PXE boot first, or it will create an endless loop. The compute node will boot via PXE, install rocks 3.1.0, and then reboot via PXE and repeat the process ad nauseam. Can I use the old floppy boot image found at: ftp://rocksclusters.org/pub/rocks/current/i386/bootnet.img to force a network boot? The 3.1.0 online manual has a link in the section 1.3 Install your Compute Nodes to ftp://www.rocksclusters.org/pub/rocks/bootnet.img but this does not exist. - Doug N ------------------------------------------------------------------------ Doug Neuhauser University of California, Berkeley doug at seismo.berkeley.edu Berkeley Seismological Laboratory Phone: 510-642-0931 215 McCone Hall # 4760 Fax: 510-643-5811 Berkeley, CA 94720-4760 From junkscarce at hotmail.com Tue Dec 30 11:57:16 2003 From: junkscarce at hotmail.com (Reed Scarce) Date: Tue, 30 Dec 2003 19:57:16 +0000 Subject: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails Message-ID: <BAY1-F39DSuCcN0o41B0005872e@hotmail.com> I tested your echo ... wait and ln wait... 
S11wait lines. They worked perfectly. Then I tried the same with gpm and left wait in the script. Wait worked as before, and gpm didn't work - like before. I've given up on doing anything very fancy and have started to make a script to run the first time it boots, with hand removal. Thanks for the perspective, --Reed
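Reed's fallback, a script that runs the first time a node boots and is then removed by hand, can be made self-removing so no hand cleanup is needed. A minimal sketch (the file names are illustrative and it writes to the current directory so it can be tried safely; on a real compute node the script would live in /etc/init.d with rc symlinks, as in Dave Lane's example quoted below):

```shell
#!/bin/sh
# Sketch: a "first boot" script that deletes itself after its only run.
cat > firstboot.sh <<'EOF'
#!/bin/sh
# one-time setup goes here (symlinks, gpm startup, etc.)
echo "first-boot tasks ran" > firstboot.log
# remove ourselves so we never run again
# (on a real node, also remove the /etc/rc?.d symlinks)
rm -f "$0"
EOF
chmod +x firstboot.sh
./firstboot.sh                              # first run: works, then unlinks itself
[ -f firstboot.sh ] || echo "script removed itself"
```

Running it prints "script removed itself", confirming the second boot would find nothing to execute.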
>From: Dave Lane <dlane at ap.stmarys.ca>
>To: "Reed Scarce" <junkscarce at hotmail.com>
>CC: npaci-rocks-discussion at sdsc.edu
>Subject: Re: [Rocks-Discuss]Extend-compute.xml issue, ln creation fails
>Date: Mon, 29 Dec 2003 19:44:23 -0400
>
>At 11:15 PM 12/29/2003 +0000, Reed Scarce wrote:
>>Are there any examples of Rocks 2.3.2 extend-compute.xml scripts that
>>work?
>
>Reed,
>
>Below is a script that worked fine for me (with 2.3.2). What it does should
>be fairly explanatory...Dave
>
>--->>>
>
><post>
> <!-- Insert your post installation script here. This
> code will be executed on the destination node after the
> packages have been installed. Typically configuration files
> are built and services setup in this section. -->
>
>mv /usr/local /usr/local-old
>ln -s /home/local /usr/local
>ln -s /home/opt/intel /opt/intel
>ln -s /home/disc15 /disc15
>mkdir /scratch/tmp
>chmod 1777 /scratch/tmp
>echo '#!/bin/bash' > /etc/init.d/wait
>echo 'sleep 60' >> /etc/init.d/wait
>chmod +x /etc/init.d/wait
>ln -s /etc/init.d/wait /etc/rc3.d/S11wait
>ln -s /etc/init.d/wait /etc/rc4.d/S11wait
>ln -s /etc/init.d/wait /etc/rc5.d/S11wait
>
> <eval sh="python">
> <!-- This is python code that will be executed on
> the frontend node during kickstart generation. You
> may contact the database, make network queries, etc.
> These sections are generally used to help build
> more complex configuration files.
> The 'sh' attribute may point to any language interpreter
> such as "bash", "perl", "ruby", etc.
> -->
> </eval>
></post>
>
_________________________________________________________________
Get reliable dial-up Internet access now with our limited-time introductory offer. http://join.msn.com/?page=dept/dialup

From landman at scalableinformatics.com Tue Dec 30 12:01:44 2003 From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 30 Dec 2003 15:01:44 -0500 Subject: [Rocks-Discuss]Rocks 3.1.0 install problems In-Reply-To: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu> References: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu> Message-ID: <1072814503.4469.196.camel@protein.scalableinformatics.com> Hi Doug: As long as pxe is in there, you should be able to do this (semi)-automatically. All you need to do is to wipe the partition tables and boot sectors of the compute nodes. I seem to remember a really simple single floppy that did this. See http://paud.sourceforge.net/ and http://dban.sourceforge.net/ I think dban is the right one. After that (only on compute nodes) you should be able to pxe boot. Joe On Tue, 2003-12-30 at 14:45, Doug Neuhauser wrote: > Greg, > > 1. I don't have cdroms on my compute nodes, only floppy. :( > 2. My boot order on the compute nodes is normally: > floppy, disk, PXE > 3. I don't have a hot-key override to force PXE boot. > I have to change the BIOS boot order to enable PXE boot. > > > but, if your compute nodes were initially installed with PXE, the > > fastest way to upgrade the compute nodes is to simply turn all the > > compute nodes off, upgrade the frontend, run insert-ethers, then turn > > the compute nodes on one by one. the compute nodes should be set for > > PXE boot which will pull the installer from the frontend and therefore > > be updated installer. > > I don't understand this. > > I can't leave the compute nodes with PXE boot first, or it will create an > endless loop. The compute node will boot via PXE, install rocks 3.1.0, > and then reboot via PXE and repeat the process ad-nauseum. > > Can I use the old floppy boot image found at: > ftp://rocksclusters.org/pub/rocks/current/i386/bootnet.img > to force a network boot? > > The 3.1.0 online manual has a link in the section > 1.3 Install your Compute Nodes > to ftp://www.rocksclusters.org/pub/rocks/bootnet.img > but this does not exist. 
> > - Doug N > ------------------------------------------------------------------------ > Doug Neuhauser University of California, Berkeley > doug at seismo.berkeley.edu Berkeley Seismological Laboratory > Phone: 510-642-0931 215 McCone Hall # 4760 > Fax: 510-643-5811 Berkeley, CA 94720-4760
From bruno at rocksclusters.org Tue Dec 30 12:07:34 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Tue, 30 Dec 2003 12:07:34 -0800 Subject: [Rocks-Discuss]Rocks 3.1.0 install problems In-Reply-To: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu> References: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu> Message-ID: <CEAFCC25-3B03-11D8-9E96-000A95C4E3B4@rocksclusters.org> On Dec 30, 2003, at 11:45 AM, Doug Neuhauser wrote: > Greg, > > 1. I don't have cdroms on my compute nodes, only floppy. :( > 2. My boot order on the compute nodes is normally: > floppy, disk, PXE > 3. I don't have a hot-key override to force PXE boot. > I have to change the BIOS boot order to enable PXE boot. > >> but, if your compute nodes were initially installed with PXE, the >> fastest way to upgrade the compute nodes is to simply turn all the >> compute nodes off, upgrade the frontend, run insert-ethers, then turn >> the compute nodes on one by one. the compute nodes should be set for >> PXE boot which will pull the installer from the frontend and therefore >> be updated installer. > > I don't understand this. i'll try to give a better explanation. when compute nodes are installed via PXE, rocks detects this and manipulates the boot sector of the disk drive on the compute node that makes the disk non-bootable. that way, if the compute node is reset, it will try to PXE boot. it will PXE boot even if your boot order is: hard disk, cd/floppy, PXE. this occurs because the hard disk is non-bootable so the BIOS boot loader will skip the hard disk and move on to the other boot devices. > I can't leave the compute nodes with PXE boot first, or it will create > an > endless loop. The compute node will boot via PXE, install rocks 3.1.0, > and then reboot via PXE and repeat the process ad-nauseum. > > Can I use the old floppy boot image found at: > ftp://rocksclusters.org/pub/rocks/current/i386/bootnet.img > to force a network boot? 
> > The 3.1.0 online manual has a link in the section > 1.3 Install your Compute Nodes > to ftp://www.rocksclusters.org/pub/rocks/bootnet.img > but this does not exist. we are no longer supporting the boot floppy as it was problematic to make one that contained the appropriate device drivers that worked on most compute nodes. - gb
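Greg's description suggests the mechanism at work: a BIOS will only boot a disk whose master boot record ends in the 0x55AA boot signature, so clearing it makes the BIOS fall through to the next device in the boot order (here, PXE). The thread does not say exactly which bytes Rocks rewrites, so the signature detail is an assumption; this sketch inspects a scratch image rather than a real device (substitute a device like /dev/hda at your own risk):

```shell
#!/bin/sh
# Sketch (assumption: BIOS fall-through is driven by the 0x55AA MBR
# signature at byte offsets 510-511). Works on a scratch file, not a disk.
check_bootable() {
    sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' ')
    if [ "$sig" = "55aa" ]; then echo "$1: bootable"; else echo "$1: not bootable"; fi
}

dd if=/dev/zero of=disk.img bs=512 count=1 2>/dev/null    # blank "MBR"
check_bootable disk.img            # not bootable: BIOS would move on to PXE
printf '\125\252' | dd of=disk.img bs=1 seek=510 conv=notrunc 2>/dev/null
check_bootable disk.img            # bootable: BIOS would boot the disk
```

Zeroing those bytes (or the whole first sector, as suggested later in the thread) is what turns a "boot from disk first" node into a PXE-booting one.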
From doug at seismo.berkeley.edu Tue Dec 30 12:28:46 2003 From: doug at seismo.berkeley.edu (Doug Neuhauser) Date: Tue, 30 Dec 2003 12:28:46 -0800 (PST) Subject: [Rocks-Discuss]Rocks 3.1.0 install problems Message-ID: <200312302028.hBUKSkgp017318@perry.geo.berkeley.edu> Greg, Thanks for the detailed boot/reboot explanation. My problem dates back to my initial rocks 2.3.2 installation. My compute node motherboards have 3 ethernet interfaces (1 100Mb, 2 1Gb), but initially only the 100 Mb supported PXE. When I used that for PXE boot, Linux would then remap the interfaces so that it tried to use one of the Gbit interfaces on the next reboot. Needless to say, the head node did not respond to DHCP because the MAC address was unknown to it. My solution was to get a new BIOS from Tyan that supported PXE on all interfaces. However, since my cluster was initially installed using the boot floppy, my compute nodes have the vestiges of floppy boot config, not PXE boot config. I'll try Joe Landman's suggestion of a scrub floppy to scrub the boot sector of the boot disk on the compute nodes. If I can't do that, I CAN go through the manual process of setting and resetting the boot order on each compute node, but it is a slow and sequential process. 
- Doug N ------------------------------------------------------------------------ Doug Neuhauser University of California, Berkeley doug at seismo.berkeley.edu Berkeley Seismological Laboratory Phone: 510-642-0931 215 McCone Hall # 4760 Fax: 510-643-5811 Berkeley, CA 94720-4760 From sjenks at uci.edu Tue Dec 30 12:37:26 2003 From: sjenks at uci.edu (Stephen Jenks) Date: Tue, 30 Dec 2003 12:37:26 -0800 Subject: [Rocks-Discuss]Rocks 3.1.0 install problems In-Reply-To: <CEAFCC25-3B03-11D8-9E96-000A95C4E3B4@rocksclusters.org> References: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu> <CEAFCC25-3B03-11D8-9E96-000A95C4E3B4@rocksclusters.org> Message-ID: <FAA5CF63-3B07-11D8-BF62-000A95B96C68@uci.edu> On Dec 30, 2003, at 12:07 PM, Greg Bruno wrote: > when compute nodes are installed via PXE, rocks detects this and > manipulates the boot sector of the disk drive on the compute node that > makes the disk non-bootable. that way, if the compute node is reset, > it will try to PXE boot. it will PXE boot even if your boot order is: > hard disk, cd/floppy, PXE. this occurs because the hard disk is > non-bootable so the BIOS boot loader will skip the hard disk and move > on to the other boot devices. Hi Greg, et al. Is there any way to force this behavior even if I initially used a CD to install the compute nodes? My nodes are capable of PXE boot, but
since I didn't use that, I presume they didn't do the non-bootable disk trick upon install. Now that I'm clear about how the PXE install works, I'd prefer to move to that, but don't really want to have to corrupt the disks to cause the PXE install. The nodes are currently loaded with 3.0, so perhaps that will work with 3.1's kickstart, but I'm curious about the PXE issue. Thanks, Steve Jenks From bruno at rocksclusters.org Tue Dec 30 12:48:08 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Tue, 30 Dec 2003 12:48:08 -0800 Subject: [Rocks-Discuss]Rocks 3.1.0 install problems In-Reply-To: <FAA5CF63-3B07-11D8-BF62-000A95B96C68@uci.edu> References: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu> <CEAFCC25-3B03-11D8-9E96-000A95C4E3B4@rocksclusters.org> <FAA5CF63-3B07-11D8-BF62-000A95B96C68@uci.edu> Message-ID: <796DAC35-3B09-11D8-9E96-000A95C4E3B4@rocksclusters.org> On Dec 30, 2003, at 12:37 PM, Stephen Jenks wrote: > > On Dec 30, 2003, at 12:07 PM, Greg Bruno wrote: >> when compute nodes are installed via PXE, rocks detects this and >> manipulates the boot sector of the disk drive on the compute node >> that makes the disk non-bootable. that way, if the compute node is >> reset, it will try to PXE boot. it will PXE boot even if your boot >> order is: hard disk, cd/floppy, PXE. this occurs because the hard >> disk is non-bootable so the BIOS boot loader will skip the hard disk >> and move on to the other boot devices. > > Hi Greg, et al. > > Is there any way to force this behavior even if I initially used a CD > to install the compute nodes? 
> > The nodes are currently loaded with 3.0, so perhaps that will work > with 3.1's kickstart, but I'm curious about the PXE issue. 3.0 is based on redhat 7.3 and 3.1 is based on redhat enterprise linux 3.0 -- so you'll hit a similar problem as doug did when you perform an upgrade. give me a bit of time to cook up a procedure for forcing your compute nodes to PXE boot. - gb
From cdwan at mail.ahc.umn.edu Tue Dec 30 14:22:18 2003 From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB)) Date: Tue, 30 Dec 2003 16:22:18 -0600 (CST) Subject: [Rocks-Discuss]NIS outside, 411 inside? Message-ID: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu> Is there a preferred way to have the 411 server on the head node replicate information (passwd and auto.whatever) from an external NIS server to the compute nodes? It seems to me that a cron job like the one below does the trick, but it feels crufty to me: ypcat passwd > yp.passwd; cat /etc/passwd yp.passwd > 411.passwd ** build the 411 distributed passwd from the file above instead of ** /etc/passwd. I'd love to hear suggestions for a more elegant solution. -Chris Dwan The University of Minnesota From bruno at rocksclusters.org Tue Dec 30 15:16:36 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Tue, 30 Dec 2003 15:16:36 -0800 Subject: [Rocks-Discuss]Rocks 3.1.0 install problems In-Reply-To: <FAA5CF63-3B07-11D8-BF62-000A95B96C68@uci.edu> References: <200312301945.hBUJjxgp016489@perry.geo.berkeley.edu> <CEAFCC25-3B03-11D8-9E96-000A95C4E3B4@rocksclusters.org> <FAA5CF63-3B07-11D8-BF62-000A95B96C68@uci.edu> Message-ID: <3737B584-3B1E-11D8-9E96-000A95C4E3B4@rocksclusters.org> On Dec 30, 2003, at 12:37 PM, Stephen Jenks wrote: > > On Dec 30, 2003, at 12:07 PM, Greg Bruno wrote: >> when compute nodes are installed via PXE, rocks detects this and >> manipulates the boot sector of the disk drive on the compute node >> that makes the disk non-bootable. that way, if the compute node is >> reset, it will try to PXE boot. it will PXE boot even if your boot >> order is: hard disk, cd/floppy, PXE. this occurs because the hard >> disk is non-bootable so the BIOS boot loader will skip the hard disk >> and move on to the other boot devices. > > Hi Greg, et al. > > Is there any way to force this behavior even if I initially used a CD > to install the compute nodes? 
> My nodes are capable of PXE boot, but > since I didn't use that, I presume they didn't do the non-bootable > disk trick upon install. Now that I'm clear about how the PXE install > works, I'd prefer to move to that, but don't really want to have to > corrupt the disks to cause the PXE install. > > The nodes are currently loaded with 3.0, so perhaps that will work > with 3.1's kickstart, but I'm curious about the PXE issue. here's a procedure to ensure that your non-3.1.0 compute nodes PXE
install after a frontend upgrade. this assumes your compute nodes support PXE installs. before you upgrade the frontend, login to the frontend and execute: # ssh-agent $SHELL # ssh-add # cluster-fork 'touch /boot/grub/pxe-install' # cluster-fork '/boot/kickstart/cluster-kickstart --start' # cluster-fork '/sbin/chkconfig --del rocks-grub' now you can shutdown your compute nodes. then upgrade your frontend. after you login to your new frontend, run insert-ethers, then reset each compute node, one at a time. doug, you'll have a bit harder time. if you can find a bootable floppy, after the compute node boots, you can chroot to the root partition on the disk and run the three cluster-fork commands above. i apologize for making this procedure tough on you. - gb From mjk at sdsc.edu Tue Dec 30 15:32:20 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Tue, 30 Dec 2003 15:32:20 -0800 Subject: [Rocks-Discuss]Licensing In-Reply-To: <BAY1-DAV14HJL2WZcXm0000fc27@hotmail.com> References: <200312300711.hBU7BeJ14002@postal.sdsc.edu> <BAY1-DAV14HJL2WZcXm0000fc27@hotmail.com> Message-ID: <69879D2D-3B20-11D8-98D0-000A95DA5638@sdsc.edu> Nothing! Rocks is entirely open source with various GNU, BSD, Artistic, etc open source licenses attached. The underlying RedHat OS (as of Rocks 3.1.0 -- available now) is recompiled from RedHat's publicly available SRPMS. You are of course welcome to send us money and hardware to help further the causes. Several vendors do in fact do this, and this helps us support them. -mjk On Dec 30, 2003, at 6:03 AM, Purushotham Komaravolu wrote: > Hi All,
> I would like to know the list of the components that have > to be > licensed, when we install ROCKS as a commercial solution. > Thanks > Happy Holidays > Puru From mjk at sdsc.edu Tue Dec 30 15:35:39 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Tue, 30 Dec 2003 15:35:39 -0800 Subject: [Rocks-Discuss]NIS outside, 411 inside? In-Reply-To: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu> References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu> Message-ID: <E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu> As of Rocks 3.1.0 we no longer use NIS "inside" the cluster. So in some ways this job is simpler now, although no one has done this yet. A simple ypcat like you have will do most of the right thing and 411 will pick up the changes and send them around the cluster. But, you need to figure out how to merge the cluster information with the external NIS information. This will include things like the IP address for the cluster compute nodes. -mjk On Dec 30, 2003, at 2:22 PM, Chris Dwan (CCGB) wrote: > > Is there a preferred way to have the 411 server on the head node > replicate > information (passwd and auto.whatever) from an external NIS server to > the > compute nodes? It seems to me that a cron job like the one below does > the > trick, but it feels crufty to me: > > ypcat passwd > yp.passwd; > cat /etc/passwd yp.passwd > 411.passwd > ** build the 411 distributed passwd from the file above instead of > ** /etc/passwd. > > I'd love to hear suggestions for a more elegant solution. > > -Chris Dwan > The University of Minnesota > From mitchskin at comcast.net Tue Dec 30 17:13:44 2003 From: mitchskin at comcast.net (Mitchell Skinner) Date: Tue, 30 Dec 2003 17:13:44 -0800 Subject: [Rocks-Discuss]Rocks 3.1.0 install problems In-Reply-To: <200312302028.hBUKSkgp017318@perry.geo.berkeley.edu> References: <200312302028.hBUKSkgp017318@perry.geo.berkeley.edu> Message-ID: <1072833146.8645.1114.camel@zeitgeist>
On Tue, 2003-12-30 at 12:28, Doug Neuhauser wrote: > I'll try Joe Landman's suggestion of a scrub floppy to scrub the boot > sector of the boot disk on the compute nodes. If I can't do that, I > CAN go through the manual process of setting and resetting the boot > order on each compute node, but it is a slow and sequential process. Something I'm going to try and implement at our site is support for the pxelinux 'localboot' option. If the hard drives have a valid boot sector, I can leave the BIOS set to PXE boot before the hard drive, and by changing the pxelinux configuration on the head node, I can set a particular node to boot from the network or from the local disk. In other words, when a node PXE boots, it might get either the kickstart instructions or the 'boot from hard drive' instructions. That will take some fiddling, I think, because the head node then has to maintain some more state for all of the compute nodes. I really want to avoid going through the BIOS setup on all my nodes more than once, though. Is this something that the ROCKS mainline would be interested in? Mitch From doug at seismo.berkeley.edu Tue Dec 30 17:51:49 2003 From: doug at seismo.berkeley.edu (Doug Neuhauser) Date: Tue, 30 Dec 2003 17:51:49 -0800 (PST) Subject: [Rocks-Discuss]Rocks 3.1.0 install problems Message-ID: <200312310151.hBV1pngp026060@perry.geo.berkeley.edu> My solution to force PXE boot is outlined below. 1. Boot dban floppy (floppy image at http://dban.sourceforge.net/ ). 2. Run "quick" purge of disks on system (I only have 1 disk on compute nodes). I let the disk purge get far enough into the disk to overwrite the boot sectors and filesystem -- I didn't wait for it to completely erase the entire disk. 3. Reset the system, and CYCLE POWER on the compute node. NOTE: If you don't cycle power, the BIOS sees the disk, but reports that it has a fatal error reading from it. This caused the following problems: a. PXE boot worked, but Rocks install also did not see the disk. 
It asked whether you want to manually configure the disk, but the configuration failed immediately regardless of whether I answered yes or no. The Rocks developers may want to look into this bug. b. By the time that I figured out that I needed to cycle power, the BIOS had already removed the disk from the boot order. My boot order was now: floppy, PXE, disk Rocks installed properly once, twice, .... until I reset the boot order to: floppy, disk, PXE. 4. Compute node will now perform PXE boot, install Rocks 3.1.0, and subsequent "controlled reboots" will boot from disk. If the node
is powered down or reset with reset button, no boot block is left on disk, and the system will perform PXE boot and reinstall Rocks. ------------------------------------------------------------------------ Doug Neuhauser University of California, Berkeley doug at seismo.berkeley.edu Berkeley Seismological Laboratory Phone: 510-642-0931 215 McCone Hall # 4760 Fax: 510-643-5811 Berkeley, CA 94720-4760 From tim.carlson at pnl.gov Tue Dec 30 19:17:11 2003 From: tim.carlson at pnl.gov (Tim Carlson) Date: Tue, 30 Dec 2003 19:17:11 -0800 (PST) Subject: [Rocks-Discuss]Rocks 3.1.0 install problems In-Reply-To: <200312310151.hBV1pngp026060@perry.geo.berkeley.edu> Message-ID: <Pine.GSO.4.44.0312301914310.23660-100000@poincare.emsl.pnl.gov> On Tue, 30 Dec 2003, Doug Neuhauser wrote: > 2. Run "quick" purge of disks on system (I only have 1 disk on compute nodes). > I let the disk purge get far enough into the disk to overwrite the boot > sectors and filesystem -- I didn't wait for it to completely erase the > entire disk. Here is something that is a bit quicker cluster-fork dd if=/dev/zero of=/dev/hda bs=1k count=512 Then either power cycle or cluster-fork reboot Tim Tim Carlson Voice: (509) 376 3423 Email: Tim.Carlson at pnl.gov EMSL UNIX System Support From cdwan at mail.ahc.umn.edu Tue Dec 30 19:44:11 2003 From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB)) Date: Tue, 30 Dec 2003 21:44:11 -0600 (CST) Subject: [Rocks-Discuss]NIS outside, 411 inside? In-Reply-To: <E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu> References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu> <E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu> Message-ID: <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu> > As of Rocks 3.1.0 we no longer use NIS "inside" the cluster. So in > some ways this job is simpler now, although no one has done this yet. > A simple ypcat like you have will do most of the right thing and 411 > will pick up the changes and send them around the cluster. 
> But, you > need to figure out how to merge the cluster information with the > external NIS information. This will include things like the IP address > for the cluster compute nodes.
The shuffling below would work, I think, but it still gives me the willies to be mucking with the passwd file every hour: mv /etc/passwd /etc/passwd.local ypcat /etc/passwd > /etc/passwd.nis cat /etc/passwd.local /etc/passwd.nis > /etc/passwd service 411 commit cp /etc/passwd.local /etc/passwd Am I missing the simple way? I seem to have an affinity for finding the maximally complex way to do things... -Chris Dwan The University of Minnesota From mjk at sdsc.edu Tue Dec 30 19:58:43 2003 From: mjk at sdsc.edu (Mason J. Katz) Date: Tue, 30 Dec 2003 19:58:43 -0800 Subject: [Rocks-Discuss]NIS outside, 411 inside? In-Reply-To: <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu> References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu> <E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu> <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu> Message-ID: <A04191F8-3B45-11D8-98D0-000A95DA5638@sdsc.edu> This sounds reasonable, but you still have a chance of conflicting UIDs in your password file. If you only issue accounts from your LAN NIS server then you should be fine. I'd suggest adding the accounts created by Rocks into your server (just look at the initial passwd file). The SGE roll creates an SGE user, others may also exist. You can also try setting up your frontend as an NIS client of your external server, with the same UID issues above. The bad news is we don't have a canned answer, and need someone to give us one. The good news is with 411 in place only the frontend need be changed and the compute node will still function as stock Rocks. -mjk On Dec 30, 2003, at 7:44 PM, Chris Dwan (CCGB) wrote: > >> As of Rocks 3.1.0 we no longer use NIS "inside" the cluster. So in >> some ways this job is simpler now, although no one has done this yet. >> A simple ypcat like you have will do most of the right thing and 411 >> will pick up the changes and send them around the cluster. 
But, you >> need to figure out how to merge the cluster information with the >> external NIS information. This will include things like the IP >> address >> for the cluster compute nodes. > > The shuffling below would work, I think, but it still gives me the > willies to be mucking with the passwd file every hour: > > mv /etc/passwd /etc/passwd.local > ypcat /etc/passwd > /etc/passwd.nis
> cat /etc/passwd.local /etc/passwd.nis > /etc/passwd > service 411 commit > cp /etc/passwd.local /etc/passwd > > Am I missing the simple way? I seem to have an affinity for finding > the > maximally complex way to do things... > > -Chris Dwan > The University of Minnesota From csamuel at vpac.org Tue Dec 30 19:59:51 2003 From: csamuel at vpac.org (Chris Samuel) Date: Wed, 31 Dec 2003 14:59:51 +1100 Subject: [Rocks-Discuss]NIS outside, 411 inside? In-Reply-To: <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu> References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu> <E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu> <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu> Message-ID: <200312311459.54054.csamuel@vpac.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Wed, 31 Dec 2003 02:44 pm, Chris Dwan (CCGB) wrote: > mv /etc/passwd /etc/passwd.local > ypcat /etc/passwd > /etc/passwd.nis > cat /etc/passwd.local /etc/passwd.nis > /etc/passwd Hmm, how about: ypcat passwd > /etc/passwd.nis cat /etc/passwd /etc/passwd.nis > /etc/passwd.tmp cp /etc/passwd /etc/passwd.local mv /etc/passwd.tmp /etc/passwd service 411 commit mv /etc/passwd.local /etc/passwd That should mean that you're never operating without a password file and the overwrites should be approaching atomic (I hope). Of course, it'd be nice if you could do whatever the 411 init file does on something else than /etc/passwd :-) Disclaimer: I have not tried this myself & don't (yet) have a 3.1 system to test with, caveat emptor, batteries not included, IANAL, etc.. cheers! Chris - -- Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
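The open questions in this thread are the merge itself (Mason's "conflicting UIDs" caveat) and the window where the passwd file is incomplete. A hedged sketch of one possible merge step, run here on scratch files rather than /etc/passwd: NIS entries whose login name or UID already exist locally are dropped, and the result is installed with a single mv, which is atomic within one filesystem. The merge_passwd helper, the awk filter, and the file names are illustrative, not anything 411 or Rocks provides.

```shell
#!/bin/sh
# Sketch: merge local + NIS passwd entries, skipping NIS lines whose
# login name or UID collide with a local entry, then rename atomically.
merge_passwd() {    # usage: merge_passwd LOCAL NIS DEST
    tmp=$(mktemp merge.XXXXXX)
    awk -F: 'NR==FNR { name[$1]; uid[$3]; print; next }
             !($1 in name) && !($3 in uid)' "$1" "$2" > "$tmp"
    mv "$tmp" "$3"  # readers see either the old or the new file, never a partial one
}

# demo data: "sge" (the user the SGE roll creates) exists on both sides
printf 'root:x:0:0::/root:/bin/bash\nsge:x:400:400::/opt/sge:/bin/bash\n' > passwd.local
printf 'alice:x:1001:100::/home/alice:/bin/bash\nsge:x:400:400::/opt/sge:/bin/bash\n' > passwd.nis
merge_passwd passwd.local passwd.nis passwd.merged
cat passwd.merged   # root and sge from local, alice from NIS; duplicate sge dropped
```

On a frontend the LOCAL input would come from the stock Rocks accounts and the NIS input from ypcat passwd, with the merged file handed to 411 instead of /etc/passwd itself.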
    -----BEGIN PGP SIGNATURE----- Version:GnuPG v1.2.2 (GNU/Linux) iD8DBQE/8km3O2KABBYQAh8RAnpPAJ9a9oRdGXeBUBAokdX6wmwrVbgXkQCeKD0C xh8eT6qTbZpxhu8+FHPSt90= =lhiY -----END PGP SIGNATURE----- From csamuel at vpac.org Tue Dec 30 20:01:39 2003 From: csamuel at vpac.org (Chris Samuel) Date: Wed, 31 Dec 2003 15:01:39 +1100 Subject: [Rocks-Discuss]NIS outside, 411 inside? In-Reply-To: <200312311459.54054.csamuel@vpac.org> References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu> <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu> <200312311459.54054.csamuel@vpac.org> Message-ID: <200312311501.43675.csamuel@vpac.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Wed, 31 Dec 2003 02:59 pm, Chris Samuel wrote: > cp /etc/passwd /etc/passwd.local should be: cp -p /etc/passwd /etc/passwd.local Oh, and what happens if users overlap ? :-) cheers, Chris - -- Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (GNU/Linux) iD8DBQE/8kojO2KABBYQAh8RAmWTAJwNhpm77IclXcWLoAuhp2/B4/GsCgCfZWek me3Lk2I7VDmRj4ygTSLSaaY= =Pv8G -----END PGP SIGNATURE----- From cdwan at mail.ahc.umn.edu Tue Dec 30 20:12:34 2003 From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB)) Date: Tue, 30 Dec 2003 22:12:34 -0600 (CST) Subject: [Rocks-Discuss]NIS outside, 411 inside? In-Reply-To: <200312311459.54054.csamuel@vpac.org> References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu> <E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu> <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu> <200312311459.54054.csamuel@vpac.org>
Message-ID: <Pine.GSO.4.58.0312302206370.25976@lenti.med.umn.edu>

> Of course, it'd be nice if you could do whatever the 411 init file does on
> something else than /etc/passwd :-)

That would be a really big step. I'm deeply wary of cron jobs that overwrite my passwd file.

The next step might be to put this functionality into 411 itself. It would be truly cool to have an automatic, non-NIS way to make the passwd, group, autofs, and host lookup stuff be consistent and static across the cluster nodes. On the other hand, I appreciate that this is probably a complex enough system without trying to reinvent NIS but leave out the brittle server bits. We can work around for the time being.

-Chris Dwan

From doug at seismo.berkeley.edu Tue Dec 30 20:34:25 2003
From: doug at seismo.berkeley.edu (Doug Neuhauser)
Date: Tue, 30 Dec 2003 20:34:25 -0800 (PST)
Subject: [Rocks-Discuss]Mozilla / ssh DISPLAY problem with Rocks 3.1.0
Message-ID: <200312310434.hBV4YPgp028521@perry.geo.berkeley.edu>

I am having a problem using mozilla with the default Rocks monitor web page over an ssh session to my headnode from a Sun workstation with a 24-bit display. My workstation is a Sun Blade 150 running Solaris 8, and I am using SSH Secure Shell 3.2.5 (non-commercial version).

When I ssh to my frontend and try to run mozilla, I get an empty Mozilla frame. Running mozilla with the debugging option "--g-fatal-warnings" I get:

Gdk-WARNING **: Attempt to draw a drawable with depth 24 to a drawable with depth 8
aborting...
xwininfo shows the following window characteristics:

xwininfo: Window id: 0x9400034 "GCLCluster Cluster - Mozilla"
  Absolute upper-left X: 175
  Absolute upper-left Y: 150
  Relative upper-left X: 0
  Relative upper-left Y: 0
  Width: 1021
  Height: 738
  Depth: 8
  Visual Class: PseudoColor
  Border width: 0
  Class: InputOutput
  Colormap: 0x22 (installed)
  Bit Gravity State: NorthWestGravity
  Window Gravity State: NorthWestGravity
  Backing Store State: NotUseful
  Save Under State: no
  Map State: IsViewable
  Override Redirect State: no
  Corners:  +175+150  -84+150  -84-136  +175-136
  -geometry 1021x738-78+125

Is there a way to configure mozilla to use only an 8-bit drawable? If I ssh from a workstation with an 8-bit display, mozilla starts up OK, and creates an 8-bit window.

- Doug N
------------------------------------------------------------------------
Doug Neuhauser                 University of California, Berkeley
doug at seismo.berkeley.edu    Berkeley Seismological Laboratory
Phone: 510-642-0931            215 McCone Hall # 4760
Fax: 510-643-5811              Berkeley, CA 94720-4760

From qian1129 at yahoo.com Tue Dec 30 22:47:57 2003
From: qian1129 at yahoo.com (li lee)
Date: Tue, 30 Dec 2003 22:47:57 -0800 (PST)
Subject: [Rocks-Discuss]How to install Roll CDs in Rocks 3.1.0
Message-ID: <20031231064757.52813.qmail@web41508.mail.yahoo.com>

Hi,

I want to install Rocks v3.1.0 on PCs, but I do not want so many CDs:

roll-grid-3.1.0-0.any.iso
roll-intel-3.1.0-0.any.iso
roll-sge-3.1.0-0.any.iso
......

So, how to install all these after Rocks and HPC installation on clusters?

Thanks

Li

__________________________________
Do you Yahoo!?
Find out what made the Top Yahoo! Searches of 2003
http://search.yahoo.com/top2003

From bruno at rocksclusters.org Tue Dec 30 23:35:28 2003
From: bruno at rocksclusters.org (Greg Bruno)
Date: Tue, 30 Dec 2003 23:35:28 -0800
Subject: [Rocks-Discuss]How to install Roll CDs in Rocks 3.1.0
In-Reply-To: <20031231064757.52813.qmail@web41508.mail.yahoo.com>
References: <20031231064757.52813.qmail@web41508.mail.yahoo.com>
Message-ID: <E7D709AA-3B63-11D8-9E96-000A95C4E3B4@rocksclusters.org>

> I want to install Rocks v3.1.0 on PCs, but I do not
> want so many CDs:
> roll-grid-3.1.0-0.any.iso
> roll-intel-3.1.0-0.any.iso
> roll-sge-3.1.0-0.any.iso
> ......
> So, how to install all these after Rocks and HPC
> installation on clusters?

for now, we do not have a systematic way in which to incorporate rolls after the frontend is up. this is on our 'todo' list.

- gb

From tim.carlson at pnl.gov Wed Dec 31 07:29:21 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Wed, 31 Dec 2003 07:29:21 -0800 (PST)
Subject: [Rocks-Discuss]Mozilla / ssh DISPLAY problem with Rocks 3.1.0
In-Reply-To: <200312310434.hBV4YPgp028521@perry.geo.berkeley.edu>
Message-ID: <Pine.GSO.4.44.0312310727220.9033-100000@poincare.emsl.pnl.gov>

On Tue, 30 Dec 2003, Doug Neuhauser wrote:

> I am having a problem using mozilla with the default Rocks monitor web page
> over an ssh session to my headnode from a Sun workstation with a 24-bit
> display. My workstation is Sun Blade 150 running Solaris 8, and I am
> using SSH Secure Shell 3.2.5 (non-commercial version).
>
> When I ssh to my frontend and to run mozilla, I get an empty Mozilla frame.
> Running mozilla with debugging options "--g-fatal-warnings" I get:

This sounds like an X tunnel problem. I see X tunnel errors all the time (OpenGL, colormap, etc). What happens if you just set the DISPLAY variable back to your Sun box and do the proper xhost command on the Sun?

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support

From mjk at sdsc.edu Wed Dec 31 09:45:49 2003
From: mjk at sdsc.edu (Mason J. Katz)
Date: Wed, 31 Dec 2003 09:45:49 -0800
Subject: [Rocks-Discuss]How to install Roll CDs in Rocks 3.1.0
In-Reply-To: <20031231064757.52813.qmail@web41508.mail.yahoo.com>
References: <20031231064757.52813.qmail@web41508.mail.yahoo.com>
Message-ID: <2BEBEC90-3BB9-11D8-9BE3-000A95DA5638@sdsc.edu>

For this release you need all these CDs (if you want this functionality). Think of Rolls as add-on packs to Rocks, and remember that software belongs on a CD (not a tar ball, or ftp site). CDs are the accepted commercial way of releasing software, and they are very nice.
But we have some issues with this that we are addressing right now: - Meta-Rolls. That is how do you merge multiple Rolls into a single CD image. This is actually very easy to do, and we have some early code for this, it will be there in the next release. For IA64 we merge the
HPC Roll onto the base DVD, so we have a proof of concept here.

- Rolls cannot be added after a cluster is installed, and must be used during installation.

- Rolls cannot be uninstalled.

Rolls are maturing pretty quickly, and we know where they need to go.

-mjk

On Dec 30, 2003, at 10:47 PM, li lee wrote:

> Hi,
>
> I want to install Rocks v3.1.0 in PCs, but I do not
> want to so many CDs:
> roll-grid-3.1.0-0.any.iso
> roll-intel-3.1.0-0.any.iso
> roll-sge-3.1.0-0.any.iso
> ......
> So, how to install all these after Rocks and HPC
> installation on clusters?
>
> Thanks
>
> Li
>
> __________________________________
> Do you Yahoo!?
> Find out what made the Top Yahoo! Searches of 2003
> http://search.yahoo.com/top2003

From michal at harddata.com Wed Dec 31 10:05:26 2003
From: michal at harddata.com (Michal Jaegermann)
Date: Wed, 31 Dec 2003 11:05:26 -0700
Subject: [Rocks-Discuss]NIS outside, 411 inside?
In-Reply-To: <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu>; from cdwan@mail.ahc.umn.edu on Tue, Dec 30, 2003 at 09:44:11PM -0600
References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu> <E05DB9B2-3B20-11D8-98D0-000A95DA5638@sdsc.edu> <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu>
Message-ID: <20031231110526.B11252@mail.harddata.com>

On Tue, Dec 30, 2003 at 09:44:11PM -0600, Chris Dwan (CCGB) wrote:
>
> The shuffling below would work, I think, but it still gives me the
> willies to be mucking with the passwd file every hour:
>
> mv /etc/passwd /etc/passwd.local
> ypcat /etc/passwd > /etc/passwd.nis
> cat /etc/passwd.local /etc/passwd.nis > /etc/passwd
> service 411 commit
> cp /etc/passwd.local /etc/passwd
>
> Am I missing the simple way?
cp -p /etc/passwd /etc/passwd.local
ypcat passwd >> /etc/passwd
service 411 commit
mv /etc/passwd.local /etc/passwd

unless 'service 411' can be told to use another file. That way you minimize the time gap when you are without /etc/passwd, you make sure that the file attributes on /etc/passwd will remain intact, and you are not left with extra files. You can also play with (symbolic) links, but I am not sure if every possible /etc/passwd reader will indeed follow a link.

   Michal

From michal at harddata.com Wed Dec 31 10:16:18 2003
From: michal at harddata.com (Michal Jaegermann)
Date: Wed, 31 Dec 2003 11:16:18 -0700
Subject: [Rocks-Discuss]NIS outside, 411 inside?
In-Reply-To: <200312311501.43675.csamuel@vpac.org>; from csamuel@vpac.org on Wed, Dec 31, 2003 at 03:01:39PM +1100
References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu> <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu> <200312311459.54054.csamuel@vpac.org> <200312311501.43675.csamuel@vpac.org>
Message-ID: <20031231111618.C11252@mail.harddata.com>

On Wed, Dec 31, 2003 at 03:01:39PM +1100, Chris Samuel wrote:
> should be:
>
> cp -p /etc/passwd /etc/passwd.local
>
> Oh, and what happens if users overlap ? :-)

'sort -u' over relevant fields after replacing ':'s with blanks? But this is getting a tad more involved, and an "automatic conflict resolution" still may screw up. A bit of coordination between whomever maintains NIS and local user data, like reserving some names and uid ranges for one or another, is likely more effective in practice.
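The overlap problem can also be handled mechanically by keeping only the first entry per username when the files are concatenated (local first, so local wins); a sketch using made-up sample entries rather than the real files:

```shell
# Keep only the first entry per username in a merged passwd stream.
# awk splits on ':'; seen[] remembers usernames already printed, so
# whichever source is concatenated first (here the "local" entry)
# shadows a later NIS entry of the same name.  Sample data only.
merged=$(printf '%s\n' \
    'alice:x:500:500:local alice:/home/alice:/bin/bash' \
    'bob:x:501:501::/home/bob:/bin/bash' \
    'alice:x:7500:7500:nis alice:/home/alice:/bin/csh' |
  awk -F: '!seen[$1]++')
echo "$merged"
```

This only de-duplicates by name; two different users sharing a numeric uid still needs the kind of administrative coordination (reserved names and uid ranges) suggested above.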
Michal From bruno at rocksclusters.org Wed Dec 31 10:42:21 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Wed, 31 Dec 2003 10:42:21 -0800 Subject: [Rocks-Discuss]Roll Documentation posted on the web site Message-ID: <117308FA-3BC1-11D8-9E96-000A95C4E3B4@rocksclusters.org> just posted documentation for some of the rolls on the web site -- see the left-hand side of the web page: http://www.rocksclusters.org/Rocks/ and here are the links to the roll documentation:
HPC Roll: http://www.rocksclusters.org/rocks-documentation/3.1.0/
SGE Roll: http://www.rocksclusters.org/roll-documentation/sge/3.1.0/
Grid Roll: http://www.rocksclusters.org/roll-documentation/grid/3.1.0/
Intel Roll: http://www.rocksclusters.org/roll-documentation/intel/3.1.0/

as a side note, for every one of the rolls you install above, the documentation will be available on your frontend at:

http://localhost/roll-documentation/

- gb

From cdwan at mail.ahc.umn.edu Wed Dec 31 11:07:37 2003
From: cdwan at mail.ahc.umn.edu (Chris Dwan (CCGB))
Date: Wed, 31 Dec 2003 13:07:37 -0600 (CST)
Subject: [Rocks-Discuss]NIS outside, 411 inside?
In-Reply-To: <20031231111618.C11252@mail.harddata.com>
References: <Pine.GSO.4.58.0312301614500.554@lenti.med.umn.edu> <Pine.GSO.4.58.0312302131450.24366@lenti.med.umn.edu> <200312311459.54054.csamuel@vpac.org> <200312311501.43675.csamuel@vpac.org> <20031231111618.C11252@mail.harddata.com>
Message-ID: <Pine.GSO.4.58.0312311239310.3992@lenti.med.umn.edu>

> this is getting a tad more involved and an "automatic
> conflict resolution" still may screw up.

I agree with this assessment. The key is to keep the local passwd file as small as possible, and to remove redundant accounts on the frontend node. Since it consists mostly of non-login accounts anyway, this shouldn't be too difficult...and it's a one-time task anyway.

I've settled on the hourly cron job below. I'll report any weirdness as appropriate. Thanks for all the suggestions and discussion.

#!/bin/sh
ypcat auto.master > /etc/auto.master
ypcat auto.home > /etc/auto.home
ypcat auto.net > /etc/auto.net
ypcat auto.web > /etc/auto.web

ypcat passwd > /etc/passwd.nis
cat /etc/passwd.local /etc/passwd.nis > /etc/passwd.combined
cp /etc/passwd.combined /etc/passwd

ypcat group > /etc/group.nis
cat /etc/group.local /etc/group.nis > /etc/group.combined
cp /etc/group.combined /etc/group

-Chris Dwan
The University of Minnesota
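One way to take some of the risk out of a cron job like this is to refuse to touch the merged file whenever the NIS dump comes back empty (e.g. ypserv is down). A sketch of that guard, with a fake `ypcat` and scratch files standing in for the real `ypcat passwd` and the files under /etc, so the logic can be exercised safely:

```shell
#!/bin/sh
# Guarded variant of the merge: never overwrite the combined file with
# the result of an empty NIS dump.  fake_ypcat stands in for
# `ypcat passwd`; "$work" stands in for /etc.  Illustration only.
fake_ypcat() { printf '%s' "$NIS_DATA"; }

merge() {  # merge <local-file> <nis-scratch-file> <combined-file>
    fake_ypcat > "$2" || return 1
    [ -s "$2" ] || return 1                 # empty dump: keep the old file
    cat "$1" "$2" > "$3.tmp" && mv "$3.tmp" "$3"
}

work=$(mktemp -d)
echo 'root:x:0:0::/root:/bin/bash' > "$work/passwd.local"
echo 'stale:x:1:1::/:/bin/sh'      > "$work/passwd"    # last good merge

NIS_DATA=''            # simulate ypserv being down: merge must refuse
merge "$work/passwd.local" "$work/passwd.nis" "$work/passwd" || true
after_outage=$(cat "$work/passwd")                     # still intact

NIS_DATA='alice:x:500:500::/home/alice:/bin/bash'      # NIS back up
merge "$work/passwd.local" "$work/passwd.nis" "$work/passwd"
```

In the real cron job the same guard would simply wrap the `ypcat passwd > /etc/passwd.nis` step, leaving the previous /etc/passwd in place when the map cannot be fetched.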
From maz at tempestcomputers.com Wed Dec 31 11:37:09 2003
From: maz at tempestcomputers.com (John Mazza)
Date: Wed, 31 Dec 2003 14:37:09 -0500
Subject: [Rocks-Discuss]Rocks 3.1.0 with Adaptec I2O RAID
Message-ID: <200312311937.hBVJb9J25828@postal.sdsc.edu>

Does anyone know of a way to make the 3.1.0 (x86-64) version work with an Adaptec 2100S SCSI RAID card? My master node needs to use this card, but it doesn't appear to be in the kernel on the CD. Also, does it support the SysKonnect SK-9821 (Ver 2.0) Gig cards?

Thanks!

From tim.carlson at pnl.gov Wed Dec 31 12:49:25 2003
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Wed, 31 Dec 2003 12:49:25 -0800 (PST)
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <20031229183225.M11961@scalableinformatics.com>
Message-ID: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov>

On Mon, 29 Dec 2003, landman wrote:

> SSH is too slow. Wow. 5-10 seconds to log in.

Just getting around to this. I did a clean install on our test cluster (Dell 1550 and 1750 boxes). No delays with ssh. As root or a normal user, a "cluster-fork date" command on 4 nodes took under .6 seconds.

Sounds like you have some type of DNS issue. Did you get a bad /etc/resolv.conf file on the nodes for some reason?

> a) md (e.g. Software RAID): Just try to build one. Anaconda will
> happily let you do this ... though it will die in the formatting stages.
> Dropping into the shell (Alt-F2) and looking for the md module (lsmod)
> shows nothing. Insmod the md also doesn't do anything. Catting
> /proc/devices shows no md as a character or block device.

The odd bit here is that you can do a

modprobe raid0

on a running frontend and it gets installed but there is no associated "md" module. Was "md" built directly into the kernel? Very odd.

> b) ext3. There is no ext3 available for the install.

This is a bit annoying. Nobody really uses ext2 anymore do they?
:) Not having ext3 as an install option isn't a show stopper for me since I can do a tune2fs after the fact. But ext3 should be there. Having version 2.0.8 of the myrinet drivers up and running is a big + in
my book. SGE 5.3p5 is also nice to see.

It will be some time before I upgrade any production clusters given the differences between RH 7.3 and WS 3.0. Too big of a jump for me right now. We first need to convert a couple hundred desktop boxes :)

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support

From James_ODell at Brown.edu Wed Dec 31 13:09:25 2003
From: James_ODell at Brown.edu (James O'Dell)
Date: Wed, 31 Dec 2003 16:09:25 -0500
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov>
References: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov>
Message-ID: <9CBB7CF1-3BD5-11D8-9574-0030656A27CC@Brown.edu>

For whatever it's worth, MPICH works MUCH better when run over rsh than ssh. It seems as if ssh doesn't pass along signals nearly as well as rsh. Since enabling rsh and configuring MPICH to use it, we have had no Zombie jobs on our compute nodes. When using SSH they were a common occurrence. In fact, if you look at the MPICH implementation for myrinet, you'll see the contortions that they use to try and clean up compute nodes when using ssh.

Jim

On Dec 31, 2003, at 3:49 PM, Tim Carlson wrote:

> On Mon, 29 Dec 2003, landman wrote:
>
>> SSH is too slow. Wow. 5-10 seconds to log in.
>
> Just getting around to this. I did a clean install on our test cluster
> (Dell 1550 and 1750 boxes). No delays with ssh. As root or a normal
> user, a "cluster-fork date" command on 4 nodes took under .6 seconds
>
> Sounds like you have some type of DNS issue. Did you get a bad
> /etc/resolv.conf file on the nodes for some reason?
>
>> a) md (e.g. Software RAID): Just try to build one. Anaconda will
>> happily let you do this ... though it will die in the formatting
>> stages.
>> Dropping into the shell (Alt-F2) and looking for the md module (lsmod)
>> shows nothing. Insmod the md also doesn't do anything. Catting
>> /proc/devices shows no md as a character or block device.
> > The odd bit here is that you can do a > > modprobe raid0 > > on a running frontend and it gets installed but there is no associated > "md" module. Was "md" built directly into the kernel? very odd.
    > >> b) ext3.There is no ext3 available for the install. > > This is a bit annoying. Nobody really uses ext2 anymore do they? :) Not > having ext3 as an install option isn't a show stopper for me since I > can > do a tune2fs after the fact. But ext3 should be there. > > Having version 2.0.8 of the myrinet drivers up and running is a big + > in > my book. SGE 5.3p5 is also nice to see. > > It will be some time before I upgrade any production clusters given the > differences between Rh 7.3 and WS 3.0. Too big of a jump for me right > now. > We first need to convert a couple hundred desktop boxes :) > > Tim Carlson > Voice: (509) 376 3423 > Email: Tim.Carlson at pnl.gov > EMSL UNIX System Support > From landman at scalableinformatics.com Wed Dec 31 14:46:22 2003 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 31 Dec 2003 17:46:22 -0500 Subject: [Rocks-Discuss]3.1.0 surprises In-Reply-To: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov> References: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov> Message-ID: <1072910782.4470.268.camel@protein.scalableinformatics.com> On Wed, 2003-12-31 at 15:49, Tim Carlson wrote: > On Mon, 29 Dec 2003, landman wrote: > > > SSH is too slow. Wow. 5-10 seconds to log in. > > Just getting around to this. I did a clean install on our test cluster > (Dell 1550 and 1750 boxes). No delays with ssh. As root or a normal > user, a "cluster-fork date" command on 4 nodes took under .6 seconds Yeah, some weirdness in DNS. Re-load on one cluster head took care of it, on the other applying dnsmasq helped. > > Sounds like you have some type of DNS issue. Did you get a bad > /etc/resolv.conf file on the nodes for some reason? > > > a) md (e.g. Software RAID): Just try to build one. Anaconda will > > happily let you do this ... though it will die in the formatting stages. > > Dropping into the shell (Alt-F2) and looking for the md module (lsmod) > > shows nothing. Insmod the md also doesn't do anything. 
Catting > > /proc/devices shows no md as a character or block device. > > The odd bit here is that you can do a > > modprobe raid0 >
> on a running frontend and it gets installed but there is no associated
> "md" module. Was "md" built directly into the kernel? very odd.

True, but I wanted to do a raid 1. I tried the insmod raid1 but it didn't work; from what I can see, the module was not in the build. This is ok, as some of it can be done later.

> > b) ext3. There is no ext3 available for the install.
>
> This is a bit annoying. Nobody really uses ext2 anymore do they? :) Not
> having ext3 as an install option isn't a show stopper for me since I can
> do a tune2fs after the fact. But ext3 should be there.

That's what I did. I'll post a quick set of instructions for this a little later.

> Having version 2.0.8 of the myrinet drivers up and running is a big + in
> my book. SGE 5.3p5 is also nice to see.

I agree, though I would like to see people do a

cluster-fork "/etc/init.d/rcsge stop"
cluster-fork "chown -R root:root /opt/gridengine/bin /opt/gridengine/utilbin"
cluster-fork "/etc/init.d/rcsge start"

to fix the compute node sge permissions. Some of the utils don't work otherwise.

> It will be some time before I upgrade any production clusters given the
> differences between Rh 7.3 and WS 3.0. Too big of a jump for me right now.
> We first need to convert a couple hundred desktop boxes :)

:)

> Tim Carlson
> Voice: (509) 376 3423
> Email: Tim.Carlson at pnl.gov
> EMSL UNIX System Support

From landman at scalableinformatics.com Wed Dec 31 14:48:08 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 31 Dec 2003 17:48:08 -0500
Subject: [Rocks-Discuss]3.1.0 surprises
In-Reply-To: <9CBB7CF1-3BD5-11D8-9574-0030656A27CC@Brown.edu>
References: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov> <9CBB7CF1-3BD5-11D8-9574-0030656A27CC@Brown.edu>
Message-ID: <1072910887.4464.271.camel@protein.scalableinformatics.com>

Hi James:

Did you rebuild MPICH for this? I noticed the signal handling bit
    using mpiBLAST. Lots of zombies to deal with. Joe On Wed, 2003-12-31 at 16:09, James O'Dell wrote: > For whatever its worth, MPICH works MUCH better when run over rsh that > ssh. It seems as if ssh doesn't pass along > signals nearly as well as rsh. Since enabling rsh and configuring MPICH > to use it, we have had no Zombie jobs > on our compute nodes. When using SSH they were a common occurrence. In > fact, if you look at the MPICH implementation for myrinet, you'll see > the contortions that they use to try and clean up compute nodes when > using ssh. > > Jim > > On Dec 31, 2003, at 3:49 PM, Tim Carlson wrote: > > > On Mon, 29 Dec 2003, landman wrote: > > > >> SSH is too slow. Wow. 5-10 seconds to log in. > > > > Just getting around to this. I did a clean install on our test cluster > > (Dell 1550 and 1750 boxes). No delays with ssh. As root or a normal > > user, a "cluster-fork date" command on 4 nodes took under .6 seconds > > > > Sounds like you have some type of DNS issue. Did you get a bad > > /etc/resolv.conf file on the nodes for some reason? > > > >> a) md (e.g. Software RAID): Just try to build one. Anaconda will > >> happily let you do this ... though it will die in the formatting > >> stages. > >> Dropping into the shell (Alt-F2) and looking for the md module (lsmod) > >> shows nothing. Insmod the md also doesn't do anything. Catting > >> /proc/devices shows no md as a character or block device. > > > > The odd bit here is that you can do a > > > > modprobe raid0 > > > > on a running frontend and it gets installed but there is no associated > > "md" module. Was "md" built directly into the kernel? very odd. > > > >> b) ext3. There is no ext3 available for the install. > > > > This is a bit annoying. Nobody really uses ext2 anymore do they? :) Not > > having ext3 as an install option isn't a show stopper for me since I > > can > > do a tune2fs after the fact. But ext3 should be there. 
> > > > Having version 2.0.8 of the myrinet drivers up and running is a big + > > in > > my book. SGE 5.3p5 is also nice to see. > > > > It will be some time before I upgrade any production clusters given the > > differences between Rh 7.3 and WS 3.0. Too big of a jump for me right > > now. > > We first need to convert a couple hundred desktop boxes :) > >
    > > Tim Carlson > > Voice: (509) 376 3423 > > Email: Tim.Carlson at pnl.gov > > EMSL UNIX System Support > > From James_ODell at Brown.edu Wed Dec 31 15:12:59 2003 From: James_ODell at Brown.edu (James O'Dell) Date: Wed, 31 Dec 2003 18:12:59 -0500 Subject: [Rocks-Discuss]3.1.0 surprises In-Reply-To: <1072910887.4464.271.camel@protein.scalableinformatics.com> References: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov> <9CBB7CF1-3BD5-11D8-9574-0030656A27CC@Brown.edu> <1072910887.4464.271.camel@protein.scalableinformatics.com> Message-ID: <DFF94A81-3BE6-11D8-9574-0030656A27CC@Brown.edu> The cheap way to do it is to grep the bin directory and look for SSH in the execution scripts. You can change them to RSH and MPICH will use RSH to execute. An alternative is to set RSHCOMMAND=rsh during a rebuild. I'm pretty sure that this method accomplishes precisely the same thing as simply editing the execution scripts. Jim On Dec 31, 2003, at 5:48 PM, Joe Landman wrote: > Hi James: > > Did you rebuild MPICH for this? I noticed the signal handling bit > using mpiBLAST. Lots of zombies to deal with. > > Joe > > On Wed, 2003-12-31 at 16:09, James O'Dell wrote: >> For whatever its worth, MPICH works MUCH better when run over rsh that >> ssh. It seems as if ssh doesn't pass along >> signals nearly as well as rsh. Since enabling rsh and configuring >> MPICH >> to use it, we have had no Zombie jobs >> on our compute nodes. When using SSH they were a common occurrence. >> In >> fact, if you look at the MPICH implementation for myrinet, you'll see >> the contortions that they use to try and clean up compute nodes when >> using ssh. >> >> Jim >> >> On Dec 31, 2003, at 3:49 PM, Tim Carlson wrote: >> >>> On Mon, 29 Dec 2003, landman wrote: >>> >>>> SSH is too slow. Wow. 5-10 seconds to log in. >>>
>>> Just getting around to this. I did a clean install on our test
>>> cluster
>>> (Dell 1550 and 1750 boxes). No delays with ssh. As root or a normal
>>> user, a "cluster-fork date" command on 4 nodes took under .6 seconds
>>>
>>> Sounds like you have some type of DNS issue. Did you get a bad
>>> /etc/resolv.conf file on the nodes for some reason?
>>>
>>>> a) md (e.g. Software RAID): Just try to build one. Anaconda will
>>>> happily let you do this ... though it will die in the formatting
>>>> stages.
>>>> Dropping into the shell (Alt-F2) and looking for the md module
>>>> (lsmod)
>>>> shows nothing. Insmod the md also doesn't do anything. Catting
>>>> /proc/devices shows no md as a character or block device.
>>>
>>> The odd bit here is that you can do a
>>>
>>> modprobe raid0
>>>
>>> on a running frontend and it gets installed but there is no
>>> associated
>>> "md" module. Was "md" built directly into the kernel? very odd.
>>>
>>>> b) ext3. There is no ext3 available for the install.
>>>
>>> This is a bit annoying. Nobody really uses ext2 anymore do they? :)
>>> Not
>>> having ext3 as an install option isn't a show stopper for me since I
>>> can
>>> do a tune2fs after the fact. But ext3 should be there.
>>>
>>> Having version 2.0.8 of the myrinet drivers up and running is a big +
>>> in
>>> my book. SGE 5.3p5 is also nice to see.
>>>
>>> It will be some time before I upgrade any production clusters given
>>> the
>>> differences between Rh 7.3 and WS 3.0. Too big of a jump for me right
>>> now.
>>> We first need to convert a couple hundred desktop boxes :) >>> >>> Tim Carlson >>> Voice: (509) 376 3423 >>> Email: Tim.Carlson at pnl.gov >>> EMSL UNIX System Support >>> From bruno at rocksclusters.org Wed Dec 31 15:46:23 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Wed, 31 Dec 2003 15:46:23 -0800 Subject: [Rocks-Discuss]3.1.0 surprises In-Reply-To: <1072910782.4470.268.camel@protein.scalableinformatics.com> References: <Pine.LNX.4.44.0312311230120.18826-100000@roach.emsl.pnl.gov> <1072910782.4470.268.camel@protein.scalableinformatics.com> Message-ID: <8ABA2E3A-3BEB-11D8-83CE-000A95C4E3B4@rocksclusters.org>
>> Having version 2.0.8 of the myrinet drivers up and running is a big +
>> in
>> my book. SGE 5.3p5 is also nice to see.
>
> I agree, though I would like to see people do a
>
> cluster-fork "/etc/init.d/rcsge stop"
> cluster-fork "chown -R root:root /opt/gridengine/bin /opt/gridengine/utilbin"
> cluster-fork "/etc/init.d/rcsge start"
>
> to fix the compute node sge permissions. Some of the utils don't work
> otherwise.

so we can test the fixes, what utilities need the above changes?

- gb

From landman at scalableinformatics.com Wed Dec 31 21:04:14 2003
From: landman at scalableinformatics.com (Joe Landman)
Date: Thu, 01 Jan 2004 00:04:14 -0500
Subject: [Rocks-Discuss]looking for a work-around
Message-ID: <1072933453.4463.293.camel@protein.scalableinformatics.com>

Ok, this one is weird. On two different clusters using the same replace-auto-partition.xml I get two completely different behaviors. I am positive this is an anaconda issue, but it could be something else.

Both systems have IDE hard disks. I made the second one (my office system) match the other system, so the IDE hard disks are hda and hdb. Yes, I know this is not ideal, and I know that this should be changed. I am simply trying to match their system.

First the partitioning:

<main>
 <clearpart>--all</clearpart>
 <part> / --size 4096 --ondisk hda </part>
 <part> swap --size 1024 --ondisk hda </part>
 <part> raid.00 --size 1 --grow --ondisk hda </part>
 <part> /tmp --size 4096 --ondisk hdb </part>
 <part> swap --size 1024 --ondisk hdb </part>
 <part> raid.01 --size 1 --grow --ondisk hdb </part>
</main>

On one cluster (my office), this works perfectly. On the other cluster, it fails with:

An unhandled exception has occurred. This is most likely a bug. Please copy the full text of this exception or save the crash dump to a floppy, then file a detailed bug report against anaconda at http://bugzilla.redhat.com/bugzilla/
Traceback (most recent call last):
  File "/usr/bin/anaconda.real", line 1081, in ?
    intf.run(id, dispatch, configFileData)
  File "/var/tmp/anaconda-9.1//usr/lib/anaconda/text.py", line 448, in run
  File "/tmp/ksclass.py", line 799, in __call__
KeyError: swap

[ OK ]  [ Save ]  [ Debug ]

(The question marks in the capture were box-drawing characters from the install dialog.) It appears that this is a python KeyError, which occurs when the element being sought has not been found.

Any ideas?

Joe

--
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman at scalableinformatics.com
web: http://scalableinformatics.com
phone: +1 734 612 4615
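One quick thing worth ruling out (this is only a guess at the failure mode, not a confirmed diagnosis of the anaconda KeyError) is whether the kickstart generator chokes on a repeated target name: in the XML above, "swap" appears once per disk. A sketch that lists the <part> targets and flags repeats, with the partitioning XML inlined for the demo:

```shell
# List the <part> targets in a replace-auto-partition.xml fragment and
# flag any that repeat.  Repeats are only a hint, not proof of a bug:
# one swap per disk is a perfectly normal layout.
xml='<main>
 <clearpart>--all</clearpart>
 <part> / --size 4096 --ondisk hda </part>
 <part> swap --size 1024 --ondisk hda </part>
 <part> raid.00 --size 1 --grow --ondisk hda </part>
 <part> /tmp --size 4096 --ondisk hdb </part>
 <part> swap --size 1024 --ondisk hdb </part>
 <part> raid.01 --size 1 --grow --ondisk hdb </part>
</main>'

# $2 is the first token after "<part>": the mount point / raid target.
dupes=$(printf '%s\n' "$xml" | awk '/<part>/ { print $2 }' | sort | uniq -d)
echo "repeated targets: $dupes"
```

If the cluster that fails has something extra in its generated kickstart (or only differs in package/driver set), comparing the two generated ks.cfg files directly would be the next step.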