...making Linux just a little more fun!
Neil Youngman [ny at youngman.org.uk]
Mon, 22 Jan 2007 21:23:58 +0000
I've been having problems with my SATA disk for some time and I've moved back to working off my old IDE disk, while I investigate the problem. I'm assuming the problem is hardware, but I don't have any suitable hardware to swap around to prove the point. I think my next step is to buy another SATA controller and swap that, but first I thought I'd see if the gang's collective wisdom had any pointers to offer.
First off, here's an extract from /var/log/messages as I try to copy a 1.4GB file onto the SATA disk.
Jan 22 15:52:57 tsr2 kernel: ata2: hard resetting port
Jan 22 15:52:57 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:52:57 tsr2 kernel: ata2.00: configured for UDMA/100
Jan 22 15:52:57 tsr2 kernel: ata2: EH complete
Jan 22 15:52:57 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:52:57 tsr2 kernel: sda: Write Protect is off
Jan 22 15:52:57 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:52:58 tsr2 kernel: ata2: hard resetting port
Jan 22 15:52:59 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:52:59 tsr2 kernel: ata2.00: configured for UDMA/100
Jan 22 15:52:59 tsr2 kernel: ata2: EH complete
Jan 22 15:52:59 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:52:59 tsr2 kernel: sda: Write Protect is off
Jan 22 15:52:59 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:52:59 tsr2 kernel: ata2: hard resetting port
Jan 22 15:52:59 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:52:59 tsr2 kernel: ata2.00: configured for UDMA/100
Jan 22 15:52:59 tsr2 kernel: ata2: EH complete
Jan 22 15:52:59 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:52:59 tsr2 kernel: sda: Write Protect is off
Jan 22 15:52:59 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:53:01 tsr2 kernel: ata2.00: limiting speed to UDMA/66
Jan 22 15:53:01 tsr2 kernel: ata2: hard resetting port
Jan 22 15:53:02 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:53:02 tsr2 kernel: ata2.00: configured for UDMA/66
Jan 22 15:53:02 tsr2 kernel: ata2: EH complete
Jan 22 15:53:02 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:53:02 tsr2 kernel: sda: Write Protect is off
Jan 22 15:53:02 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:53:04 tsr2 kernel: ata2.00: limiting speed to UDMA/44
Jan 22 15:53:04 tsr2 kernel: ata2: hard resetting port
Jan 22 15:53:05 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:53:05 tsr2 kernel: ata2.00: configured for UDMA/44
Jan 22 15:53:05 tsr2 kernel: ata2: EH complete
Jan 22 15:53:05 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:53:05 tsr2 kernel: sda: Write Protect is off
Jan 22 15:53:05 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:53:09 tsr2 kernel: ata2.00: limiting speed to UDMA/33
Jan 22 15:53:09 tsr2 kernel: ata2: hard resetting port
Jan 22 15:53:10 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:53:10 tsr2 kernel: ata2.00: configured for UDMA/33
Jan 22 15:53:10 tsr2 kernel: ata2: EH complete
Jan 22 15:53:10 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:53:10 tsr2 kernel: sda: Write Protect is off
Jan 22 15:53:10 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:53:11 tsr2 kernel: ata2.00: limiting speed to UDMA/25
Jan 22 15:53:11 tsr2 kernel: ata2: hard resetting port
Jan 22 15:53:12 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:53:12 tsr2 kernel: ata2.00: configured for UDMA/25
Jan 22 15:53:12 tsr2 kernel: ata2: EH complete
Jan 22 15:53:12 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:53:12 tsr2 kernel: sda: Write Protect is off
Jan 22 15:53:12 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:53:16 tsr2 kernel: ata2.00: limiting speed to UDMA/16
Jan 22 15:53:16 tsr2 kernel: ata2: hard resetting port
Jan 22 15:53:17 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:53:17 tsr2 kernel: ata2.00: configured for UDMA/16
Jan 22 15:53:17 tsr2 kernel: ata2: EH complete
Jan 22 15:53:17 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:53:17 tsr2 kernel: sda: Write Protect is off
Jan 22 15:53:17 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:53:18 tsr2 kernel: ata2.00: limiting speed to PIO4
Jan 22 15:53:18 tsr2 kernel: ata2: hard resetting port
Jan 22 15:53:19 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:53:19 tsr2 kernel: ata2.00: configured for PIO4
Jan 22 15:53:19 tsr2 kernel: ata2: EH complete

At this point 46MB has been copied and the machine is effectively hung.
Once I realised I had a problem, I naturally installed smartmontools, and this is what smartctl tells me.
# smartctl -d ata -l selftest /dev/sda
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error        00%             1685  -
# 2  Short offline       Completed without error        00%             1671  -
# 3  Short offline       Completed without error        00%             1667  -
# 4  Extended offline    Completed without error        00%             1661  -
# 5  Short offline       Completed without error        00%             1643  -
# 6  Short offline       Completed without error        00%             1628  -
# 7  Short offline       Completed without error        00%             1613  -
# 8  Short offline       Completed without error        00%             1598  -
# 9  Short offline       Completed without error        00%             1583  -
#10  Short offline       Completed without error        00%             1569  -
#11  Short offline       Completed without error        00%             1555  -
#12  Extended offline    Completed without error        00%             1549  -
#13  Short offline       Completed without error        00%             1538  -
#14  Extended offline    Completed without error        00%             1523  -

# smartctl -d ata -l error /dev/sda
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged

#

The lack of any errors suggests to me that the problem is not with the disk, hence the thought that I should replace the controller. Is this a reasonable conclusion from the data available?
I have tried reseating the controller card and cables and moved the SATA cable to the secondary port on the SATA controller.
Is there anything else I should be trying?
Neil Youngman
Benjamin A. Okopnik [ben at linuxgazette.net]
Tue, 23 Jan 2007 11:46:36 -0500
On Mon, Jan 22, 2007 at 09:23:58PM +0000, Neil Youngman wrote:
> # smartctl -d ata -l selftest /dev/sda
> smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF READ SMART DATA SECTION ===
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error        00%             1685  -
> # 2  Short offline       Completed without error        00%             1671  -
> # 3  Short offline       Completed without error        00%             1667  -
> # 4  Extended offline    Completed without error        00%             1661  -
> # 5  Short offline       Completed without error        00%             1643  -
> # 6  Short offline       Completed without error        00%             1628  -
> # 7  Short offline       Completed without error        00%             1613  -
> # 8  Short offline       Completed without error        00%             1598  -
> # 9  Short offline       Completed without error        00%             1583  -
> #10  Short offline       Completed without error        00%             1569  -
> #11  Short offline       Completed without error        00%             1555  -
> #12  Extended offline    Completed without error        00%             1549  -
> #13  Short offline       Completed without error        00%             1538  -
> #14  Extended offline    Completed without error        00%             1523  -
>
> # smartctl -d ata -l error /dev/sda
> smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF READ SMART DATA SECTION ===
> SMART Error Log Version: 1
> No Errors Logged
>
> #
>
> The lack of any errors suggests to me that the problem is not with the disk,
> hence the thought that I should replace the controller. Is this a reasonable
> conclusion from the data available?
>
> I have tried reseating the controller card and cables and moved the SATA cable
> to the secondary port on the SATA controller.
>
> Is there anything else I should be trying?
I'm wondering what would happen if you ran the above test in a logged loop while loading the disk - say, by running a large copy operation in another terminal. You'd build up a pretty good sized logfile after a while, but you might get some failure info that might point the way to a solution.
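[A rough sketch of what such a logged loop might look like, in plain POSIX shell. The function name log_loop, the log file name, the run count and the interval are all made up for illustration; only the smartctl invocation comes from the message above. The function just repeats whatever command it is given and appends timestamped output to a file, so it would need to be run as root for smartctl to reach the disk.]

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread): run a command
# repeatedly and append its timestamped output to a log file.
# Usage: log_loop LOGFILE COUNT INTERVAL_SECONDS COMMAND [ARGS...]
log_loop() {
    logfile=$1
    count=$2
    interval=$3
    shift 3                      # what remains in "$@" is the command to run

    i=0
    while [ "$i" -lt "$count" ]; do
        i=$((i + 1))
        {
            echo "=== $(date '+%Y-%m-%d %H:%M:%S') run $i ==="
            status=0
            "$@" 2>&1 || status=$?   # keep going even if the command fails
            echo "exit status: $status"
        } >> "$logfile"

        # Pause between runs, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example invocation (assumed values; start the large copy in another
# terminal first):
#   log_loop smart.log 120 5 smartctl -d ata -l error /dev/sda
```

[If the machine hangs as described, the last entries written to the file before the hang are the interesting ones. Whether they reach the disk at all depends on which disk the log lives on, so keeping it on the IDE disk rather than the SATA one seems the safer choice.]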
Coming at it from the hardware end, I'd say that you have the right idea: throwing in a different controller would be a pretty good test. Shotgunning does make sense as a troubleshooting technique when the possible number of affected parts is low.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *