Henry's Techno-Musings: Philosophy of Troubleshooting: Sea of Troubles
To be or not to be that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles And, by opposing, end them. -- William Shakespeare, "Hamlet"
There's a story about a man whose factory is shut down because there is some sort of blockage in one of the pipes. He sends for a plumber who wanders around the factory looking here, listening there and soon steps up to a piece of pipe and gives it a whack with a hammer. Problem fixed. The factory owner is delighted.
However, when a bill for $1000 arrives, he thinks he has been charged quite a lot for very little effort on the plumber's part. He asks for an itemisation.
For tapping pipe $1 For knowing where to tap $999
In the various tasks performed by a sysadmin, problem solving is a lot like the story. Once you know what the problem is, the actual fixing is quite often a doddle.
The real value the sysadmin brings to the table is the ability to diagnose the problem. Importantly, what is presented as the problem is rarely what it seems.
"It doesn't work"
This week I was presented with a problem that epitomises a lot of what has been written above. It started with little more than, "It doesn't work."
In this case, it was, "When I go to URL http://machine06.example.com, it doesn't work." The domain name example.com is just an example; machine06 is one of our computers.
Having learned that my customers can often mislead me, I started by typing the URL into Firefox. Sure enough, it came up with some error.
Unable to connect Firefox can't establish a connection to the server at example.com.
I wasn't surprised, but I need to see it for myself. Sometimes, helpful information will be displayed by the browser, information the customer conveniently neglected to supply.
I'm sure that customers don't always deliberately mislead me. They will neglect to provide information for a host of reasons. They might not have noticed it. They might have seen it, but did not understand it. They might not have appreciated its significance.
Chris
Here's another example. Chris recently started working for us as a Help Desk person. I have found him to be a decent, honest, enthusiastic worker. He knows a fair bit, but he is relatively new to Linux. I'm trying to say that I don't think he deliberately tries to mislead me.
"I've copied a file to machine23, but I can't find it."
"Where on machine23 did you send it?" I asked.
"I don't know."
I figured it had to be either his home directory or /tmp. I looked, but there were no new files.
"Show me."
I went back with him to his machine. There it was, on his screen:
scp a_file machine23
It's not something I'd likely pick up if I just spoke to him. I explained why he needed a colon after the hostname and suggested that he get into the habit of typing:
scp a_file machine23:/tmp
I surmise that users look at messages from the perspective of their objectives. If the message does not help them achieve their objectives, they ignore it.
So it's always important to check that you can reproduce the problem - otherwise how will you know when you've fixed it? At the same time, you should check if there are any additional clues.
Gathering Information
Satisfied that I had a valid symptom, I set out to get more info. In this case, the next place to look seems pretty obvious. I SSHed into machine06, and tried:
psa | grep http
"psa?" you ask. Think of it as
ps auxw | grep http
"psa" is part of HAL (Henry's Abstraction Layer). I want a list of all processes running on the machine. Because I use it so often, I have abbreviated. But, more importantly, because I work on lots of different platforms, I find it convenient to have a single alias or function which achieves the same result independent of platform (hence the idea of an abstraction layer).
Whenever I login to a new platform for the first time, I bring my HAL (aliases, functions, scripts). If they work, fine. If not, I modify them to handle the new environment. I don't need to make mods very often these days.
machine06 runs Linux. Had it been a Sun, psa would have translated to
/usr/ucb/ps -auxw
(I could also use "ps -ef".)
Back to the story.
Normally, I would expect output like the following:
root 12344 0.0 2.2 29932 16516 ? Ss Aug21 0:03 /usr/sbin/httpd apache 21394 0.0 1.5 29932 11244 ? S Oct05 0:00 /usr/sbin/httpd apache 21395 0.0 1.5 29932 11208 ? S Oct05 0:00 /usr/sbin/httpd
There might be more lines like the last two, depending on how Apache had been set up.
(It's also true that machine06 might have been running some other software and not Apache. I assume Apache because I have some idea of what our organisation typically does.)
Well, if Apache is not running, that could account for the problem. I guess I could just restart Apache.
But why is it not running?
cd /var/log/httpd ls -lat | head
(I actually typed "dth", not the second line. More HAL.)
There were several files including error_log, error_log.0, error_log.1, ... and similar for access_log. As I write this, with the benefit of knowing the answer, there were heaps more clues which provided "back story" to the problem.
But at the time, I was only interested in recent files. error_log was most recent, and not very large:
cat error_log.2 [Wed Oct 06 01:30:14 2010] [crit] (28)No space left on device: mod_rewrite: could not create rewrite_log_lock Configuration Failed!
Even though I preach not to jump to conclusions, in the heat of the chase I confess that I sometimes do. In this case, the seed had been planted. The previous day I had come in to work to discover that outbound mail was not working because a disk had filled. Here was another message which seemed to say the same thing: "No space left on device".
I cursed whoever was responsible for the message. "Why doesn't is say which device?"
The second and third lines did not make much sense to me. Since I was predisposed to believe that a disk had filled (again), I rushed forward:
df -h Filesystem Size Used Avail Use% Mounted /dev/hda5 3.9G 909M 2.8G 25% / /dev/hda1 99M 56M 38M 60% /boot none 485M 0 485M 0% /dev/shm /dev/hda7 41G 35M 39G 1% /tmp /dev/hda2 53G 3.4G 47G 7% /usr /dev/hdb1 111G 102G 3.4G 97% /usr/local /dev/hda3 9.7G 5.4G 3.8G 59% /var
Hmm, not very helpful. None of the disks is close to full.
Perhaps a disk was full and is no longer full. I looked through all the earlier error logs:
ls -la error_log.* -rw-r--r-- 1 root root 132 Oct 6 01:30 error_log.2 -rw-r--r-- 1 root root 132 Oct 5 01:30 error_log.3 -rw-r--r-- 1 root root 132 Oct 4 01:30 error_log.4 -rw-r--r-- 1 root root 132 Oct 3 01:30 error_log.5 -rw-r--r-- 1 root root 132 Oct 2 01:30 error_log.6 -rw-r--r-- 1 root root 132 Oct 1 01:30 error_log.7 -rw-r--r-- 1 root root 132 Sep 30 01:30 error_log.8 -rw-r--r-- 1 root root 132 Sep 29 01:30 error_log.9
They were all similar; only the date was different.
That's really odd. This problem has been around for over a week, yet I'm only hearing about it now. And, if my tentative hypothesis is correct, the disk was full for all that time, but today there is space. Curiouser and curiouser, to quote Alice.
I've gone through all the information I can find, and I don't look like I'm any closer to a solution.
Here's the next difficulty. I think it's unarguable that computer systems are getting more and more complex. And I suspect technology is changing more and more rapidly. The producers of all this change and complexity do the best they can. In particular, they try to build resilience. What they don't get time to do is craft perfection.
Often errors are encountered, reported and brushed aside. I see countless warnings, errors and the like which seem to have no impact on functionality.
It may just be the case that the error "No space" is not responsible for, or relevant to, the problem I'm trying to solve. I don't have enough logs to go back to a time when things were working. (I do *now*, of course, because before I started writing this article I solved the problem. So I could look at recent logs.)
There have been times when I have been confronted with a problem. After a certain amount of research, I come to the conclusion that the problem is caused by one of two possible scenarios. But I can get no further. No amount of cerebration enables me to split the possibilities. And I'm still missing the vital next step.
Eventually, I construct an experiment or test to confirm or deny one of the two possibilities. As soon as the test rejects one of the possibilities, I get a flash of realisation and understanding of what's going on in the other scenario: what's wrong and the path to pursue for a fix.
Why has my state of mind changed so dramatically? Why could I have not come up with the flash simply by assuming that scenario 1 was false?
The question is not rhetorical. I do not have an answer.
In this case, the two scenarios are
- it's a disk-space problem
- it's not a disk-space problem.
I don't believe that the problem is to do with disk space. But I've run out of ideas, and the thought of the error message is going to keep me from moving forward with clarity.
So, to try to get more information:
/etc/init.d/httpd status httpd is stopped /etc/init.d/httpd start Starting httpd: [FAILED]
(There might have been a bit more. I'm doing this bit out of my head. It's a bit hard to bring down production functionality just to make a point.)
Looking in the log, I saw a recent time stamp and the "No space" message.
OK, I think that that proves it's not disk space.
At this point, one could go one of two ways. I went both ways, but the order in which I worked reflects my predilections.
I want to understand the problem, and I figure I have a better chance of understanding if I look under the covers.
truss /etc/init.d/httpd start
In this case, "truss" is not "truss" (which is a Solaris command) but rather "strace". And it's a much more complicated mapping than that. (More HAL.)
90% into the strace:
24313 semget(IPC_PRIVATE, 1, IPC_CREAT|0600) = -1 ENOSPC (No space left on device)
I looked up the man page for semget:
ENOSPC A semaphore set has to be created but the system limit for the maximum number of semaphore sets (SEMMNI), or the system wide maximum number of semaphores (SEMMNS), would be exceeded.
When is a door not a door? When it's ajar.
When is "No space" not "No space"? Here we have the answer.
All platforms have the notion of error numbers. Solaris environments document these in Intro(2). Fedora (all Linux distros?) document them in errno(3).
Further, every system call and every library call has its own section of the man page to document relevant error numbers.
Somewhere between, there is scope for more unintended misdirection.
On the one hand, it's highly commendable that different system/library routines return the same error number for the same sort of error. On the other hand, the universe does not always divide into such simple consistent components.
In this case, it seems that someone decided that running out of space in a table should return ENOSPC. Looked at from that point of view, the rationale seems reasonable.
However, looked at from the perspective of a long-suffering sysadmin, this noble ambition has simply misled me. It has added to my sea of troubles.
One More Turn of the Wheel
There's one more turn of the wheel left.
I now understand the problem. I probably know more about the problem than I ever wanted to. But I'm still no nearer to a solution.
Mr. Internet is your friend. I looked up "mod_rewrite: could not create rewrite_log_lock" and found this article: http://carlosrivero.com/fix-apache---no-space-left-on-device-couldnt-create-accept-lock
Perhaps because by now I had a good insight into the problem, I immediately recognised the relevance of this webpage.
I checked:
ipcs -s | grep apache ------ Semaphore Arrays -------- key semid owner perms nsems 0x00000000 8159232 apache 600 1 0x00000000 8192001 apache 600 1 ...
Sure enough, there were lots of semaphore arrays, owned by apache.
Since Apache was not running it seemed safe to delete them using the script provided in the article:
ipcs -s | grep apache | perl -e 'while (<STDIN>) { @a=split(/\s+/); print `ipcrm sem $a[1]`}'
After that, I was able to start Apache:
/etc/init.d/httpd start Starting httpd: [ OK ]
Back at my Firefox, I confirmed that I could visit http://machine06.example.com and get a reasonable response.
And One More Turn
Finally, because this time I remembered to be methodical, I wrote some notes for my help system in a file called help_apache_http_no_space.
With a bit of luck, if I have a problem in the future, I will visit my help system to check if there is any relevant material. Perhaps I'll find useful information. Instead of rediscovering the solution ab initio, I can skip to the last page and simply type the command to delete the Apache semaphores.
If someone asks me how I can charge so much for typing the 'ipcs' command above, I can respond along the following lines:
for typing the command $1 for knowing which command to type $999
Share |
Talkback: Discuss this article with The Answer Gang
Henry has spent his days working with computers, mostly for computer manufacturers or software developers. His early computer experience includes relics such as punch cards, paper tape and mag tape. It is his darkest secret that he has been paid to do the sorts of things he would have paid money to be allowed to do. Just don't tell any of his employers.
He has used Linux as his personal home desktop since the family got its first PC in 1996. Back then, when the family shared the one PC, it was a dual-boot Windows/Slackware setup. Now that each member has his/her own computer, Henry somehow survives in a purely Linux world.
He lives in a suburb of Melbourne, Australia.