A Nightmare on Tape Drive
And in the fourth month of my new job, I encountered numerous problems all related to backup.
Backup! My eyes glaze over. I suppose some people can get enthusiastic about backup, but not me. Oh, sure, I understand the point of backup. And I try very hard in my own endeavours to keep lots of backup. (What would happen to my writing if I lost it all! Oh, the horror.)
Currently, I seem to spend about a third of my time on backup. There are many aspects to this, and several reasons. Whatever the reasons, many of the activities are not a particularly good match for my skill-set: operating a Dymo labelmaker; searching for tapes; opening new tape boxes; removing shrink-wrapping; loading and unloading magazines; searching for tape cases; loading tape cases; removing tapes from their covers; inserting tapes into their covers; delivering tape cases to, and retrieving them from, a nearby site.
I yearn to throw the whole lot out and start again. But, of course, that is unthinkable.
We have 3 jukeboxes which feed 5 tape drives. About a month ago, the PX502 started to misbehave. I started scouring documentation, searching around in the logs and trying various commands. Somewhere I came across this helpful message:
The drive is not ready - it requires an initialization command
Oh, really? Well, if you are so smart, send the appropriate initialization command. Or tell me what the initialization command looks like and ask me to issue it.
Don't you just hate that? It's on a par with those stupid messages that compilers sometimes produce, something like:
Fatal error: extraneous comma
!! I mean, I wouldn't mind,
Warning: extraneous comma. Ignored.
But "Fatal error"?! Give me a break. That's just punitive.
It seemed to me that the aberrant behaviour was hardware-related. I chased around to see if the unit was under maintenance. There was some difficulty in locating the relevant paperwork, but I was given a phone number and assured it would be OK. It wasn't really OK, but David came out nonetheless and poked around.
It was one of those visits like when you take your car to the mechanic - or your kids to their friends. The car and the kids are suddenly on their best behaviour. The mechanic cannot hear any of those ominous noises that caused so much consternation. And the other parents report what sweet children you have.
As you drive away in frustration, the car starts its strange noises and the kids are attempting to kill each other in the back seat.
Two days later I was back on the phone to David. He assured me that our support was not with him but with some other crowd, but since he had started down this path, he would continue and work out some cross-charging with the other crowd. He told me to run some program and send him the output.
Apparently, he sent the output on to the US and was informed that one of the drives needed replacement. They shipped a replacement drive to David, who came out again a few days later.
And that should have been the end of it.
But no. This is the story of a nightmare; don't look for happy endings. The happiest part of a nightmare is when you wake up.
The software which coordinates all the backup tasks is Sun's StorageTek EBS, which is really Legato NetWorker under the covers. After the drive swapout, the StorageTek software showed only a single drive on the PX502.
I struggled with this for a while, believing it ought to be fairly straightforward to solve this problem. And this is where I demonstrate why I earn the big bucks.
When you encounter a problem like this you have the analogue of the manufacturer's make/buy dilemma. Do you try to fix it yourself? Or do you call in support? Before you can think about calling for support you have to answer a few questions.
First, have you investigated the problem? If your opening gambit is, "It doesn't work," you may find it hard to get useful help. Support may suggest politely that you try reading the manual.
Second, is the problem hardware or software? In many cases, my answer is that if I knew that I wouldn't be calling for support. Nevertheless, you need to be able to engage support, get them interested enough to want to help you. Otherwise, they'll just give you a lot of dopey menial tasks.
Finally, do you even have support?
First, I tried the Microsoft approach: if things don't work, try restarting them; or power cycling. I'm not proud to take this route, but I figure support is going to suggest it, so I may as well clear the decks. Unsurprisingly, nothing was gained.
I started trying to break things down. Does the machine see the hardware?
Before the problems, the PX502 controlled 2 tape drives:
/dev/rmt/0cbn
/dev/rmt/1cbn
The first of these was the faulty one. When it was swapped out, the new tape drive came up as /dev/rmt/2cbn. I don't understand why. As I write this, it occurs to me that another approach to this problem may have been to persuade Solaris at the lowest level that this tape drive was /dev/rmt/0cbn.
I also thought it might not be a bad idea to reboot the Sun to which the tape drive is attached, but this is the organisation's file server; rebooting it is not a task to be taken lightly. Fortuitously, it rebooted itself one night when one of its SAN disks had a hiccup. Even the reboot did not improve matters.
I was able to go to the front panel of the PX502, press some buttons and load a tape into the "invisible" drive. So, at least as far as the PX502 is concerned, the tape drive is present.
I then went to the Sun to which the tape drive is attached and issued:
mt -f /dev/rmt/2cbn status
Quantum DLT-S4 tape drive:
   sense key(0x6)= Unit Attention   residual= 0   retries= 0
   file no= 0   block no= 0
That's promising. It looks like the Sun knows about the tape drive.
Since the tape was one of the backup tapes, it had a label. I don't know exactly what a label looks like, but I expect it to be at the beginning of the tape. I did:
dd if=/dev/rmt/2cbn ibs=1000000 count=1 | od -Cx | less
This reads up to the first MB off the tape and pipes it into a dump format. Much of the dump was incomprehensible to me, but about 16 lines down I found a string that corresponded to the label (tapeBSA.3472).
0000360 \0 \0 \0 \f t a p e B S A . 3 4 7 2
At the lowest level, the drive is present and works fine.
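Picking a label out of a dump by eye works, but the same check can be done by letting the shell discard the unprintable bytes. The following is only a sketch under stated assumptions: the sample file is a made-up stand-in for the first block of a tape (a real NetWorker label block will look different), and the variable names are hypothetical.

```shell
# Stand-in for the start of a tape: a few NULs, a form feed, then the label.
# On the real system the input would come from something like:
#   dd if=/dev/rmt/2cbn ibs=1000000 count=1
sample=$(mktemp)
printf '\0\0\0\014tapeBSA.3472\0\0\0' > "$sample"

# Replace every non-printable byte with a newline, then keep only
# the lines that start with the expected label prefix.
label=$(tr -c '[:print:]' '\n' < "$sample" | grep '^tape')
echo "$label"

rm -f "$sample"
```

The output is the label text alone, which is easier to spot than a string spread across an od listing.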
When David had come out, he had shown me how to connect to the PX502 web interface. From the web interface one can control the PX502 more conveniently than from its front panel. I navigated from one screen to another, satisfying myself that it could see two tape drives, could move a tape from a magazine to a drive; and move it the other way as well. So it seemed fair to conclude that the PX502 was OK.
That left the interface between the PX502 and the Sun, or the StorageTek software. The StorageTek software claimed it could see the PX502 and one of the drives, so my money was on a problem with the StorageTek software. Time to find out if we have support (we do) and then get in touch with Sun.
After only a little bit of palaver over my inability to locate the correct paperwork, Sun routed my call to someone who took down the details. He was reasonably patient with my uncertainty as to whether this was hardware or software. After some discussion, he agreed with my view that it was probably the StorageTek software that merited attention. I was given a Tracking Number.
I didn't expect anything much to happen right away, especially as I'd called about 4:30 pm on a Friday. I began thinking about packing up and going home, expecting that I'd pick up the matter on Monday, so I was somewhat startled when the phone rang only a few minutes later.
"I'm ringing about your Sev 1."
Sev 1?! A Severity One error means something catastrophic like the entire business has ground to a halt! I freaked. Typically, to get an organisation like Sun to even allow you to call in a Sev 1, you have to be paying big bikkies. Normally you get "best effort" or maybe "response within one business day". Sev 1?! Maybe defence departments get to call Sev Ones, not me.
I hastened to assure the caller that I had never mentioned that the problem was critical. Far from it - I was in no great rush to get the problem solved. He took this with good grace, and we agreed to leave it until Monday.
I guess Sun runs some sort of tag-team problem-solving that follows the, um, sun. On Monday morning, my inbox had a couple of emails with the Tracking Number in the subject, which had arrived Friday after I had left.
The first email seemed to be on the ball. It asked me to provide details of the software I was using; the output of several commands; and the contents of several logs. This approach suits me well. I get to find out which instructions the experts use, so in the future I can help myself.
The second email came from the same sender. I'm guessing he's based in India. He had tried to call me and wanted to confirm that the number he had was correct. Since his email included a bit that went "... as I did not get a response @ 61-3 ...", I concluded that he had tried to make an international call. (I usually expect to see a plus (+) for international access, but "61" is the country code for Australia and "3" is my area code.) He also asked for a few more details.
I spent the rest of Monday fighting fires. It was Tuesday before I could gather the various responses. I was still a bit antsy about the Sev 1, so I prefaced my responses with:
I've said this before, but I'll repeat it just in case. The unit is usable (even though it seems to complain that it can't see the changer). We are doing backups with the one tape drive it can see. However, it had 2 tape drives and we want it to be able to use both tape drives.
I included in my email a summary of the situation to date, pretty much what I have written above.
When it came to sending the logs, I had a bit of a problem. One of the log files was 12 MB, the other 544 MB. It seems that these log files are never rolled over. I sent the last part of the log file, containing entries for the last few months.
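Cutting the tail off an oversized log is a one-liner. This is a minimal sketch, assuming a line-oriented log; the file here is a generated stand-in, and the line count of 100 is arbitrary (for a real log one would pick a count that covers the months of interest).

```shell
# Generated stand-in for a huge, never-rotated log file.
log=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "entry " i }' > "$log"

# Keep only the most recent lines in a separate file that is small enough to mail.
excerpt=$(mktemp)
tail -n 100 "$log" > "$excerpt"

excerpt_lines=$(wc -l < "$excerpt" | tr -d ' ')
first_kept=$(head -n 1 "$excerpt")
echo "kept $excerpt_lines lines, starting at: $first_kept"

rm -f "$log"
```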
I had been asked to "kindly provide the screenshot that shows the errors from console."
I had two methods for monitoring the backup: a Java GUI; and nsrwatch, a curses-based application. The GUI proved unhelpful, but nsrwatch displayed several messages.
The main error message was:
media warning: The /dev/rmt/0cbn is either skipped as requested (due to hardware problem) or no longer connected to the storage node.
media warning: Please remove /dev/rmt/0cbn from NetWorker if it is permanently disconnected.
Well, that's probably correct. The old drive was zero; for some reason the new one is two. There was also an analogous message for the changer. And yet, the software can drive the changer!
I wrote back:
I've attached a screenshot, but there are no error messages as such. Perhaps the following might be more useful:
# jbedit -j 'Quantum PX502' -a -f /dev/rmt/2cbn -E 81
Using 'unix33.alpha.wehi.edu.au' as NetWorker server host.
39078:jbedit: RAP error: The device '/dev/rmt/2cbn' is already part of jukebox 'Quantum PX502'.
# jbedit -j 'Quantum PX502' -d -f /dev/rmt/2cbn -E 81
Using 'unix33.alpha.wehi.edu.au' as NetWorker server host.
39077:jbedit: error, Cannot find device `/dev/rmt/2cbn' in jukebox `Quantum PX502'.
I included this to underline the point that the software had become very confused. I can't add the tape drive because it's already there. But I can't delete it because it isn't there.
Late the next afternoon a reply came summarising their understanding of the problem; then this:
Please Select the jukebox PX502 from GUI and do a scan for drives. I am sure this will show the missing drive into the configration. If this fails we need to delete this Jukebox and recreate it once again. It is suggested to do this when there is a downtime available.
There was an interesting bit that followed. Here's a fragment of my reply:
------------------------------------------------------------------------
-->please let me know how did you get the element number "81". Kindly
I honestly don't remember.
-->confirm the correct element number using the sjisn command.
sjisn 3.7.0
Serial Number data for 3.7.0 (QUANTUM PX500 ):
  Library:
    Serial Number: QP0714BDC00025
    SCSI-3 Device Identifiers:
      ATNN=QUANTUM PX500 QP0714BDC00025
      IENN=00E09EFFFF0B61FE
      WWNN=100000E09E0B61FE
  Drive at element address 128:
    SCSI-3 Device Identifiers:
      ATNN=QUANTUM DLT-S4 QP0713AMD00014
  Drive at element address 129:
    SCSI-3 Device Identifiers:
      ATNN=QUANTUM DLT-S4 QP0734AMD00102
I guess 81 (hex) = 129 (decimal).
------------------------------------------------------------------------
And here we have an encapsulation of some of the many ways that things can go wrong. This is so easy in hindsight. I have a PhD in hindsight. Rear vision is always 20/20.
Why had I used element number "81"? Because I am too clever by half! In the man page for jbedit(1m), I had seen an example which ended with "-E 82". When one is a Brilliant Expert, one gets a "feel" for the "shape" of numbers. A number like 81 or 82 in the context of devices is so obviously hexadecimal only a fool would imagine any other possibility. (Would the fool typing please put his hand up?) When I saw "129" it was obvious that I was meant to translate it to hex, so I did. Wrong!
Strangely, the support guys never corrected me. That's the next lesson. I had spent much time trying to establish how to determine the element number. I should never have responded the way I did.
I ought to know better. I have worked for many years in support. I am reluctant to let my customers tell me what they think is wrong. I am only interested in what they were trying to do, what they saw, and what they expected or wanted to see. Very often, if I let them tell me what they think is wrong, I get sucked in to their view of the problem. Had that view been helpful, they would have solved the problem already. They are coming to me for a fresh perspective.
In researching for this article, I came across this in the man page:
The data element address is the "decimal number" that the jukebox assigns to each of its drives.
This underlines another valuable lesson. Most people, most of the time, operate as if they subscribe to the theory "don't confuse me with facts, my mind's made up". Of course, if you try to say that to people out loud, they respond that it's ridiculous.
I may have read that part of the man page when I was trying to solve the problem. But, if I did, I never connected "decimal number" with the value associated with "-E". (I wonder why the man page has "decimal number" in quotes.)
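The arithmetic behind the mix-up is easy to check in the shell. This is a small sketch, assuming a POSIX printf; the variable names are invented for illustration and are not part of any NetWorker tool.

```shell
# Element address exactly as sjisn reports it (decimal).
element_dec=129

# The same value written in hexadecimal: this is the "81" that was typed after -E.
element_hex=$(printf '%x' "$element_dec")

# Reading that hex string back gives the original decimal value,
# whereas a tool that expects decimal sees plain eighty-one.
back=$(printf '%d' "0x$element_hex")
echo "decimal $element_dec = hex $element_hex; hex $element_hex = decimal $back"
```

Since the man page says the element address is a decimal number, a value of 81 would name element 81, not the drive at element 129.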
The next day I received another email suggesting that the GUI "database is corrupted"; that I pkgrm (uninstall) the GUI part of the software; delete the database components; and then pkgadd (reinstall) it. I'd seen a similar suggestion in posts on the Net.
Be careful what you wish for. And, once again, I justify my wage.
I am always reluctant to delete. Except in extremis, I always rename. Or take a copy and then pkgrm. Fortunately, that's what I did.
And when he got there the cupboard was ...
... not exactly bare. However, when I had pkgrmed the software, I then tried the pkgadd:
pkgadd: ERROR: no packages were found in </var/spool/pkg>
Oops.
I started rooting around the file system. However, there were some problems. I guess I should have said "more problems"; I seem to have them in spades.
This is a Sun running Solaris 10. Unlike sensible systems (Linux, FreeBSD), Sun does not offer "locate" by default. As I read somewhere, if ever a system needed "locate", it's Solaris. Further, this system is the file server: it serves over 20 TB. It's also extremely overloaded. I'm not going to be popular if I start doing a find across 20 TB; and I'm not going to get an answer any time soon. So, although I know what the package is called, I'm going to have a hard time finding it.
I eventually found the package and installed it. Um, not quite. I installed 7.3 and we were running 7.4. So although I can use the GUI, it doesn't have all the features which used to be present.
Some problems can be solved in one sitting. It should be clear that this saga continued over the course of a month, maybe more. Concurrently, I was doing other parts of my regular job, including the limping backups. Every now and then I would find the time to mount another assault on my problem du jour.
Time is a great healer in all sorts of activities. However, a balance must be struck. If all you do is monomaniacally try to solve the problem to the exclusion of other activities, firstly you will neglect important duties; but more importantly, you may get too close to the problem - you start going round in circles, stumbling blindly, getting confused, making mistakes.
I reckon I've solved more tough problems on the drive/walk/tram/train home, or in the shower, than I have at the keyboard.
One day, I tried a minor variation on an earlier command and had success.
jbedit -j 'Quantum PX502' -a -f /dev/rmt/2cbn -E 129
If you've persevered to here, you will recognise this as similar to a command listed earlier, but now I'm using 129 instead of the alleged hex 81.
Now 2 tape drives show up in the GUI. My original problem has been solved.
If only I hadn't uninstalled the console software!
Where was I going to find the version we used to have?
When you read job ads, you'll often see specifications that the prospective employer requires or desires. In my experience, these are rarely relevant. Sure, you want someone with a computer background, not a ballerina or a chef. You probably want someone with a Unix background if you run a Unix shop.
But getting more detailed than that seems to me pointless. If you're looking for a mechanic, you want someone who has worked on cars, but do you really need someone who has worked on 1975 Dodge Darts?
I'm looking for some software. Which part of my CV is going to demonstrate my ability to find it? The question is not going to be asked. But this is exactly the sort of real-world problem that must be solved.
And that's my point. Most organisations are sufficiently idiosyncratic that the way they do things contributes to the steepest part of the learning curve. In-depth knowledge of a particular release of a distribution, or a specific rev of some software package is only transiently helpful.
The real requirement is to be able to get answers and solve problems; to perform the necessary research.
Before this job, I had used Legato, but I had not been in charge of the backup. I had no familiarity with managing the backup. But that task has not been difficult to pick up. It's just like getting behind the wheel of an unfamiliar car.
Finding where the software might be is a horse of a different colour.
I decided I needed a locate database. Not a genuine, complete, standard locate database; but something simple and workable. I didn't need to update it regularly, because I was only interested in files which were already there. A df showed that the root directory and /var were on separate filesystems. All the other filesystems, the 20 TB, were out of my area of interest. I went with:
nice find / /var -mount >> ~/tmp/locate.output
Now I can grep for what I need. It's not brilliant, but it will do.
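The whole ersatz locate amounts to "build a list of paths once, then grep it". Here is a sketch of the idea on a throwaway directory tree, so it can be tried without touching a real file server. The directory layout and the package file name are hypothetical, invented purely for the demonstration.

```shell
# Throwaway tree standing in for / and /var.
root=$(mktemp -d)
mkdir -p "$root/var/spool/pkg" "$root/opt/stuff"
touch "$root/var/spool/pkg/example-console-7.4.pkg"   # hypothetical file name
touch "$root/opt/stuff/README"

# Build the index once: one path per line, without crossing into
# other mounted filesystems, and at reduced priority.
index=$(mktemp)
nice find "$root" -mount > "$index"

# "locate" is then just a case-insensitive grep over the index.
hits=$(grep -i 'console' "$index")
echo "$hits"

rm -rf "$root"
```

The sketch writes the index with > rather than >> so that rerunning it replaces the list instead of appending duplicate entries.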
The ersatz locate enabled me to find files which were helpful in other ways, but in creating it I managed to overlook /usr/local. A colleague pointed me to an obscure subdirectory of /usr/local where I found rev 7.4 of the Legato software.
It was close to what we'd had before, but even that wasn't the end of the story. In a patch directory I found something dated later than the other candidates. So once more I went through the cycle of uninstalling the old version and trying to install the new. But because the latest item was a patch, it complained that it could only be installed over an existing installation of the Legato software.
Patches have come to mean so many different things. When I was just starting out in computers, a patch was an analogue of what I use to repair my bike tubes. Typically it consisted of instructions to overwrite the contents of a file at a small number of locations with binary data; patching was done by hand. Over time, patching also came to mean a form of automated editing of source code. And patching is also used to describe the selective replacement of some files in a package. In fact, in the Solaris world, patches are packaged in a very similar way to software. There is not a lot of difference between pkgadd and patchadd. My guess is that they share an awful lot of codebase.
I have seen so-called jumbo patches from Sun which run to nearly a gigabyte. From humble beginnings...
I reinstalled 7.4, then installed the patch, and at last I was back to our original GUI console. After nearly a month, all the problems had been resolved.
A day or two later, another jukebox, attached to another machine, started to play up. As I write, it looks like I will be repeating the exercise above. I hope that my experience will allow me to get out the other side with less drama.
Henry has spent his days working with computers, mostly for computer manufacturers or software developers. His early computer experience includes relics such as punch cards, paper tape and mag tape. It is his darkest secret that he has been paid to do the sorts of things he would have paid money to be allowed to do. Just don't tell any of his employers.
He has used Linux as his personal home desktop since the family got its first PC in 1996. Back then, when the family shared the one PC, it was a dual-boot Windows/Slackware setup. Now that each member has his/her own computer, Henry somehow survives in a purely Linux world.
He lives in a suburb of Melbourne, Australia.