Industrial PC booting issues

Petr Vavřinec [pvavrinec at snop.cz]

Thu, 12 Nov 2009 11:35:44 +0100

Almighty TAG!

First, sorry for the longish post. I'm almost at my wit's end.

I have an industrial PC, equipped with a touch screen and a "VIA Nehemiah" CPU (which behaves like an i386). The PC has no fan, no keyboard, no hard disk, and no flash drive, only an Ethernet card. It boots via PXE from my database server. /proc/meminfo on the PC reports MemTotal: 452060 kB. I'm booting "thinstation" (http://thinstation.sourceforge.net) with kernel 2.6.21.1. The PC runs from a ramdisk. I start Xorg with the twm window manager. Then I start the "Opera" browser as a client on my database server and "-display" it on the X server on the PC. This setup worked flawlessly for a couple of months.

Then my end users came with complaints that the PC no longer boots after flipping the mains switch. So far, I have found the following:

1. The PC sometimes doesn't boot at all. The boot process stops with the message:

   Uncompressing linux... OK, booting the kernel.

...and that's all. Usually, when I switch the PC off and on again, it boots OK. I don't change anything, I just flip that big button.

2. When it does boot all the way into X, the Opera browser isn't able to start properly. I tried to investigate the matter further. I modified the setup so that it now boots only into X+twm. This works.

Then I ran the following test on my database server:

    xlogo -display <ip_address_of_the_pc>:0

This works, the X logo is displayed on the screen of the PC.

Then I tried this test on the database server:

    xterm -display <ip_address_of_the_pc>:0

The result is this error message on the client side (i.e. on the database server):

   xterm:  fatal IO error 104 (Connection reset by peer) or KillClient on X server "192.168.100.171:0.0"

...and the X server on the PC really is killed (I can't find it in the process list on the PC anymore).
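
For reference, a display name such as "192.168.100.171:0.0" maps to a TCP port by the usual X11 convention of 6000 plus the display number. Below is a minimal sketch, assuming a remote display reached over TCP. The function names are made up for illustration, and the check only shows that something accepts connections on that port, not that the X server is healthy:

```python
import socket


def parse_display(display):
    """Split an X display name like 'host:0.0' into (host, tcp_port).

    Assumes the usual convention that display N listens on TCP port 6000 + N.
    """
    host, _, rest = display.rpartition(":")
    display_number = int(rest.split(".")[0])  # drop the optional screen part
    return host, 6000 + display_number


def x_server_reachable(display, timeout=2.0):
    """Return True if a TCP connection to the display's port succeeds."""
    host, port = parse_display(display)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Only the parsing step is run here, so no network access is needed.
print(parse_display("192.168.100.171:0.0"))
```

Calling x_server_reachable("192.168.100.171:0.0") before and after the xterm attempt would be one way to confirm from the database server that the X server has gone away.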

This is what I found in /var/log/boot.log on the PC:

------------ /var/log/boot.log starts here ----------------------------
X connection to :0.0 broken (explicit kill or server shutdown).
/etc/init.d/twm: /etc/init.d/twm: 184: xwChoice: not found
twm_startup
twm:  unable to open display ":0.0"
------------ /var/log/boot.log ends here ------------------------------

...and this is from /var/log/messages on the PC:

------------ /var/log/messages starts here ----------------------------
Nov 12 12:18:25 msweld03 daemon.info udevd[721]: udev_event_run: seq 790 forked, pid [2167], 'remove' 'vc', 0 seconds old
Nov 12 12:18:25 msweld03 daemon.info udevd[721]: udev_event_run: seq 791 forked, pid [2168], 'remove' 'vc', 0 seconds old
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2167]: udev_db_get_device: found a symlink as db file
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2167]: name_index: removing index: '/dev/.udev/names/vcs3/%2fclass%2fvc%2fvcs3'
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2167]: udev_node_remove: removing device node '/dev/vcs3'
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2167]: udev_event_run: seq 790 finished
Nov 12 12:18:25 msweld03 daemon.info udevd[721]: udev_done: seq 790, pid [2167] exit with 0, 0 seconds old
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2168]: udev_db_get_device: found a symlink as db file
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2168]: name_index: removing index: '/dev/.udev/names/vcsa3/%2fclass%2fvc%2fvcsa3'
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2168]: udev_node_remove: removing device node '/dev/vcsa3'
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2168]: udev_event_run: seq 791 finished
Nov 12 12:18:25 msweld03 daemon.info udevd[721]: udev_done: seq 791, pid [2168] exit with 0, 0 seconds old
Nov 12 11:18:25 msweld03 user.debug kernel: unhashed dentry being revalidated: CMD2095
------------ /var/log/messages ends here ------------------------------

I tried to fiddle with the BIOS parameters, i.e. reverted back to "factory setup" and re-entered the PXE-booting stuff. It didn't help.

I tried to limit the memory used by the ramdisk. My current parameters are:

   append ramdisk_blocksize=1024 initrd=initrd.C0A864AB root=/dev/ram0 ramdisk_size=131072 console=ttyS3 acpi=off noapic nolapic

...and this doesn't help, either.
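
For scale, here is a rough check of how much RAM that ramdisk_size could take, assuming the value is counted in KiB (as the kernel documents the parameter) and using the MemTotal figure quoted earlier. The function name is only illustrative, and the result is an upper bound, since a ramdisk consumes memory only as it fills:

```python
def ramdisk_budget(mem_total_kib, ramdisk_size_kib):
    """Return (ramdisk MiB, remaining MiB, ramdisk share of total RAM)."""
    ramdisk_mib = ramdisk_size_kib / 1024
    remaining_mib = (mem_total_kib - ramdisk_size_kib) / 1024
    share = ramdisk_size_kib / mem_total_kib
    return ramdisk_mib, remaining_mib, share


# MemTotal from /proc/meminfo and ramdisk_size from the append line.
ramdisk_mib, remaining_mib, share = ramdisk_budget(452060, 131072)
print(f"ramdisk: {ramdisk_mib:.0f} MiB, "
      f"left for the system: {remaining_mib:.0f} MiB, "
      f"share: {share:.0%}")
# prints: ramdisk: 128 MiB, left for the system: 313 MiB, share: 29%
```

By that estimate the ramdisk limit by itself should not exhaust memory on this machine.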

Does anyone on the honorable TAG staff have any clues that could help me? Thanks in advance for any help.

Petr

Paul Sephton [paul at inet.co.za]

Thu, 12 Nov 2009 21:51:34 +0200

On Thu, 2009-11-12 at 11:35 +0100, Petr Vavřinec wrote:

> 1. The PC sometimes doesn't boot at all. The boot process stops with
> the message:
> 
>    Uncompressing linux... OK, booting the kernel.
> 
> ...and that's all. Usually, when I switch the PC again off and on, it 
> boots OK, I mean I don't change anything, just flip that big button.

Just looking at this aspect of the problem, and for the moment disregarding the rest of the symptoms, it would appear that your problem is hardware related. I would suggest you start by checking the integrity of your RAM modules, or if you have ready access to replacement RAM, simply swap them out and see whether the problem has gone away.

My reasoning might be flawed, but based on the fact that the machine was working before, and the fact that you are using a RAM disk, together with the strange behaviour on bootup, I suspect that your RAM might not be refreshing properly. This failure to refresh might be caused by a faulty module, a loosely seated module (possible through vibration of the chassis over time), or a CPU memory bus problem.

Whether or not this is the case, it is easy enough to check.

Regards, Paul

René Pfeiffer [lynx at luchs.at]

Thu, 12 Nov 2009 20:55:17 +0100

On Nov 12, 2009 at 2151 +0200, Paul Sephton appeared and said:

> On Thu, 2009-11-12 at 11:35 +0100, Petr Vavřinec wrote:
> > 1. The PC sometimes doesn't boot at all. The boot process stops with
> > the message:
> > 
> >    Uncompressing linux... OK, booting the kernel.
> > 
> > ...and that's all. Usually, when I switch the PC again off and on, it 
> > boots OK, I mean I don't change anything, just flip that big button.
> 
> Just looking at this aspect of the problem, and for the moment
> disregarding the rest of the symptoms, it would appear that your problem
> is hardware related.  I would suggest you start by checking the
> integrity of your RAM modules, or if you have ready access to
> replacement RAM, simply swap them out and see whether the problem has
> gone away.

Running memtest86 for a couple of hours or even days is a good way of finding out if the memory works or not. You can find it here: http://www.memtest86.com/

Some live CDs allow booting into memtest86, too.
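
To illustrate the basic idea behind such a test, here is a toy write-and-verify pass over a buffer. This is only a sketch with made-up names, not how memtest86 itself is implemented: a userspace program cannot choose which physical RAM it touches, and CPU caches can hide faults, which is why memtest86 boots without an operating system.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each byte pattern, read it back, and return a
    list of (pattern, offset) pairs where the read value differed."""
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected  # write the pattern over the whole buffer
        if buf != expected:  # fast path: compare the whole buffer at once
            errors.extend(
                (pattern, offset)
                for offset, value in enumerate(buf)
                if value != pattern
            )
    return errors


errors = pattern_test(1024 * 1024)  # 1 MiB, tiny compared with real RAM
print("mismatches:", len(errors))
```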

Best, René.

Neil Youngman [ny at youngman.org.uk]

Thu, 12 Nov 2009 20:35:33 +0000

On Thursday 12 November 2009 19:55:17 René Pfeiffer wrote:

>
> Running memtest86 for a couple of hours or even days is a good way of
> finding out if the memory works or not. You can find it here:
> http://www.memtest86.com/

The last time I had flaky memory the system seized up about twice a day. Running memtest86 overnight didn't show any problems, but swapping in new memory solved the problem.

I would suggest that if memtest86 can find a problem you can rely on that, but a negative from memtest86 doesn't guarantee that your memory's OK.

Neil

Petr Vavřinec [pvavrinec at snop.cz]

Fri, 13 Nov 2009 14:16:41 +0100

René Pfeiffer wrote:

[...snip...]

> 
> Running memtest86 for a couple of hours or even days is a good way of
> finding out if the memory works or not. You can find it here:
> http://www.memtest86.com/
> 
> Some live CDs allow booting into memtest86, too.
> 
> Best,
> René.
>

Thank you guys,

I ran memtest86; when it reached 5000 errors, I switched it off. Fortunately, I was able to find a suitable replacement for my memory. I swapped it in, and now I'm running again without problems...

Thanks again, have a nice weekend,

Petr

Rick Moen [rick at linuxmafia.com]

Fri, 13 Nov 2009 15:04:30 -0800

Quoting Neil Youngman (ny at youngman.org.uk):

> The last time I had flaky memory the system seized up about twice a day. 
> Running memtest86 overnight didn't show any problems, but swapping in new 
> memory solved the problem.

Running memtest86 overnight will almost always pinpoint bad RAM. Running it for only a few hours and finding no errors doesn't mean much.

On one memorable occasion, bad RAM on my candidate replacement server got smoked out only through resorting to iterative kernel compiles using "make -j NN" tweaked to make sure I used all memory. (After a few tests, NN ended up being 256.)
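
A loop of that kind could be sketched roughly as follows. This is an illustration only, not the procedure from the linked posts: the function name, the cycle contents, and the cycle count are assumptions, and a real run would need to be started inside a configured kernel tree.

```python
import subprocess
import sys


def run_cycles(cycle, max_cycles):
    """Run every command in `cycle`, in order, up to max_cycles times.

    Returns (cycle_number, command) for the first command that exits
    non-zero, or None if all cycles completed cleanly.
    """
    for cycle_number in range(1, max_cycles + 1):
        for command in cycle:
            if subprocess.run(command).returncode != 0:
                return cycle_number, command
    return None


# Hypothetical stress cycle, to be run from a kernel source directory:
#   run_cycles([["make", "clean"], ["make", "-j", "256"]], max_cycles=20)

# Harmless stand-in so the sketch can be tried anywhere.
print(run_cycles([[sys.executable, "-c", "pass"]], max_cycles=2))
```

The point of the high job count is simply to keep enough compiler processes alive at once that the whole of RAM gets exercised; a first failure tells you which cycle broke, not which stick is bad.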

In the referenced case, the situation really was sort of my own fault, because, trying to save money, I'd deployed some sticks of RAM that had a dubious history. The entire saga of how I tracked down the bad sticks is here, and I personally think it makes pretty good reading:

http://linuxmafia.com/pipermail/conspire/2006-December/002662.html

http://linuxmafia.com/pipermail/conspire/2006-December/002668.html

http://linuxmafia.com/pipermail/conspire/2007-January/002743.html

Also worth considering is the Cerberus Test Control System (CTCS), the hardware burn-in suite that we used to qualify hardware at VA Linux Systems.

Information links on CTCS:

http://linuxmafia.com/faq/Hardware/cerberus.html

http://va-ctcs.cvs.sourceforge.net/va-ctcs/ctcs/FAQ?view=markup

-- 
Rick Moen                            "If accuracy / Is what you crave / 
rick at linuxmafia.com                 Then you should call it / Myanmar Shave."
                                                           -- FakeAPStylebook

Ben Okopnik [ben at linuxgazette.net]

Fri, 13 Nov 2009 18:24:24 -0500

On Fri, Nov 13, 2009 at 03:04:30PM -0800, Rick Moen wrote:

> Quoting Neil Youngman (ny at youngman.org.uk):
> 
> > The last time I had flaky memory the system seized up about twice a day. 
> > Running memtest86 overnight didn't show any problems, but swapping in new 
> > memory solved the problem. 
> 
> Running memtest86 overnight will almost always pinpoint bad RAM. 
> Running it for only a few hours and finding no errors doesn't mean much.

True in my experience as well. Not specifically for memtest86, but when I was building systems, the only time I considered the memory to have been properly "burned in" is after I ran it through "BURNIN" (this was back in the days of DOS) for 24 hours. It was annoying, but since the alternative was to have systems that would come back and that I'd have to fix on my own dime - not to speak of the associated loss of reputation for my then very-young business - it was a requirement for any machines I sold. I've also had systems with memory problems that required either a spritz of Freon or a blast from a hair dryer set on 'high' to confess their faults.

I'm quite happy that a) this kind of problem is quite rare these days, and b) I'm doing almost nothing but software now.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *

Raj Shekhar [rajlist2 at rajshekhar.net]

Mon, 16 Nov 2009 14:41:44 -0800

In infinite wisdom Ben Okopnik said the following On 11/13/09 3:24 PM:

> I've also had systems with memory problems that required either a spritz
> of Freon or a blast from a hair dryer set on 'high' to confess their
> faults.

I have never heard of this method for diagnosing RAM problems. What does it do?

Thomas Adam [thomas.adam22 at gmail.com]

Mon, 16 Nov 2009 22:44:56 +0000

2009/11/16 Raj Shekhar <rajlist2 at rajshekhar.net>:

> In infinite wisdom Ben Okopnik said the following On 11/13/09 3:24 PM:
>
>> I've also had systems with memory problems that required either a spritz
>> of Freon or a blast from a hair dryer set on 'high' to confess their
>> faults.
>
> I have never heard of this method for diagnosing RAM problems. What does it
> do?

man 7 salonandpermset

-- Thomas Adam

Ben Okopnik [ben at linuxgazette.net]

Mon, 16 Nov 2009 18:04:50 -0500

On Mon, Nov 16, 2009 at 10:44:56PM +0000, Thomas Adam wrote:

> 2009/11/16 Raj Shekhar <rajlist2 at rajshekhar.net>:
> > In infinite wisdom Ben Okopnik said the following On 11/13/09 3:24 PM:
> >
> >> I've also had systems with memory problems that required either a spritz
> >> of Freon or a blast from a hair dryer set on 'high' to confess their
> >> faults.
> >
> > I have never heard of this method for diagnosing RAM problems. What does it
> > do?
> 
> man 7 salonandpermset

<Ahem> "man 8 salonandpermset", please. There's nothing "miscellaneous" about serious stuff like this.

Raj: In the past, some memory chips would become thermally sensitive as they degraded, and could be made to fail by heating them up or (less often) by cooling them during a memory test. You had to use this with some discretion - obviously, you could crack the chip packages if you flipped the temperature too rapidly - but it was a really good test that would usually suss out otherwise difficult-to-diagnose, intermittent memory faults.

Memory quality has increased greatly since those days, so I don't know how applicable this would be today - but it was a standard part of a techie's toolkit back then.

-- 
                                                                       
[  Okopnik Consulting  |  Putting computing solutions within easy reach  ]
[  Expert-led Training |  Dynamic, vital websites | Custom programming   ]
[____________________________ http://okopnik.com ________________________]

Blaine Clark [thelight9 at comcast.net]

Tue, 17 Nov 2009 19:35:27 -0500

Just a note about temperature effects: a failed hard drive can sometimes be coaxed back to life by putting it in your fridge or, for a short time, in your freezer. Let it cool, pull it out, connect it up, and have your recovery process ready to roll NOW. You probably have only one chance to recover as much of your data as can be recovered. This doesn't always work, of course, and it doesn't always let you retrieve all that you want or need, but hey, it's a shot when you're desperate.
