Saturday, May 05, 2007

Linux is not a happy Time Traveller

Have you ever brought up a computer and noticed that the time is incorrect? Maybe it was a new computer, or just one that had been off for a long time. Possibly someone had changed the time while you weren't looking. Generally, we set the correct time and think nothing of it. In fact, we had been doing exactly that without consequence on our 14k+ thin clients for years, using NTP and a program called ntpdate.

Somehow, however, we (by we I suppose I mean *I*) created an image with a time adjustment on it that thrust the computer 44 hours into the future upon its first boot. Unfortunately, no one discovered this.

To give a little perspective on why this was actually a problem, and not a thing to forget about 44 hours later, I have to share that the image I had created was already in IBM's facilities being copied to almost 1800 new x3200 servers, and that modifying it would have cost us a significant amount of money.

OK, back to the point at which we "fixed" the time on one of these x3200 boxes at a live store. I received a phone call from my boss stating that the time was incorrect at this store and needed to be corrected. I nonchalantly ssh'd over to the box and ran "ntpdate tic" (tic happens to be the name of our NTP server). After grumbling that the uselessly configured ntp service on the SLES 10 server was blocking my update, I stopped that service and ran the command again, which successfully updated the time. I thought nothing more of it, and my two co-workers and I (the only three people in the company who can deal with a problem like the one that came next) went out for an extended lunch.

Before we had even driven halfway to lunch, one of our team members at the store called about a problem they were having. We had them try to open an xterm window, but the thin client froze. Hmmm, weird, I guess we should reboot it... It won't reboot; NFS is not allowing it to mount its file system properly. The brave, dedicated co-worker makes his way over to the store, and cannot manage to figure out what is going wrong. All of our DHCP, NFS, LTSP, and network settings are correct, but the thin clients will *NOT* boot. The brave one drives off to grab an image from another store, returning only moments after I get back to the office and notice an oddity in the process list. The nfsd processes are using up some of the processor (say 5-20%), which is quite odd since these boxes rarely register any processor usage for NFS. I ask them to shut down all the thin clients and then reboot the server. They shut down the server and stick in the drive from a working store.

Unfortunately, this step left doubt in my mind. I hadn't yet connected the time change with the problems at this store. I thought maybe a thin client had caused these runaway NFS processes, and that shutting them all down might have been the solution we needed. After the drive transplant, and after quickly (and easily, thanks to scripts we had written) reconfiguring the data to be correct for this store, they were up and running. We brought the original drives back into the office and attempted to reproduce the problem. No signs of the problem. We decided to take the drives back to the store and see if rebooting with those drives would reproduce the problem. (We still hadn't made the time connection, and by then that image had been off for 40+ hours.) The problem did not occur at that store, so we decided to go forward with another store install. We had not tackled the problem, but it had gone away, so we were hopeful that it would not resurface.

It did in fact resurface, but the next morning we were able to recall that both times the problem had begun shortly after a time change. NFS goes nuts if you change the server's time backwards! At first we thought the trigger was changing the time after thin clients had already been brought up over NFS, so we tried changing the time before any thin client had mounted its NFS drives. That led to further failures. Eventually, we discovered that uninstalling the NFS server, changing the time backwards, and then reinstalling NFS was a sufficient solution to the problem.

So far we don't know whether changing the time backwards broke anything else on the server, but we've worked around our NFS issues.

Summary:

Starting the NFS server and then changing the date backwards causes NFS to become inoperable (in fact, quite stuck in some sort of loop) until the clock passes that future date again, or until you uninstall and reinstall NFS.
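
For anyone who wants to guard against repeating my mistake, here is a minimal Python sketch of the check I wish I had made before running ntpdate. It is not something we actually deployed. The function name, the return labels, and the one-second tolerance are all made up for illustration, and it only decides what to do; it does not touch the clock or NFS. The "remove-nfs-first" label reflects the only workaround we confirmed (uninstall, change the time, reinstall); whether simply stopping the NFS server would be enough is something we did not establish.

```python
from datetime import datetime, timedelta


def plan_clock_correction(local_now, reference_now, nfs_server_running,
                          tolerance=timedelta(seconds=1)):
    """Classify a pending clock correction (hypothetical helper, illustration only).

    local_now          -- what the server's clock currently says
    reference_now      -- what the time source (e.g. the NTP server) says
    nfs_server_running -- True if the NFS server has been started on this box
    """
    # A negative offset means the local clock is ahead and must move backwards.
    offset = reference_now - local_now

    if abs(offset) <= tolerance:
        return "no-change"
    if offset > timedelta(0):
        return "step-forward"
    if nfs_server_running:
        # The case from this post: stepping backwards under a started NFS server.
        return "remove-nfs-first"
    return "step-backward"
```

For our image, the local clock sat 44 hours ahead of the reference with NFS already started, so the sketch would have returned "remove-nfs-first" instead of letting me cheerfully step the clock backwards.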
