The router from hell

My sordid tale begins two days after Christmas in Milwaukee where I was visiting my wife's family for the holidays.

I followed my usual routine: wake up, roll the pull-out couch mattress back into the couch (sleeping on the floor is much more desirable than waking up with an S-shaped spine), brew some strong coffee and fire up the laptop. During our holiday visit the year previous, I had taken my sister-in-law shopping for a wireless router as she had been wishing to have net access on her work-issued laptop (and so that I wouldn't have to haul my old Netgear with me every time I visited). The weather had recently become interesting: Temperatures in the 50s with considerable rain following a snowy extended cold snap the previous few weeks. It looked even more exciting back home where the forecast called for gusts up to 50 mph. I brought up the Weather Underground where I report data every few seconds from my home weather station. When I travel, I periodically check in on my station reports, less for the data than for verification that the house hasn't collapsed, flooded to the second floor, or burned to the ground.

When I looked at the time series data, I was immediately drawn to the fact that the last report, a couple hours ago, was of a wind gust of 55 mph - followed by no reports. Knowing full well what this meant, I tried to ssh into my home machine anyway and of course I couldn't. After racking my brain for what the hell the name of our power company was (hey, it's auto-paid) I discovered as suspected that there were widespread power outages across Michigan.

Now, my machine is on a monster UPS that can keep it and its many external drives running for about 15 minutes. But: I do not have the computer set to wake-on-power or wake-on-LAN. So, I assumed the UPS shut the computer down gracefully following about 15 minutes of loss of mains power (it did) - but it would not boot without human intervention when power was restored. What made this especially painful was that I would not know for sure when power would be restored - and the source of this wind was a strong low moving in, which in its wake would be bringing in some rather cold air. While it was only a remote possibility, without power for a long enough time we could be looking at burst pipes etc.

So as I sat there, I started thinking of how I might avoid this kind of not-knowing situation in the future - and also it bugged me that I wasn't reporting weather data (geek meteorologist double whammy). For a while I had been toying with the idea of putting Linux on a router and after some research decided that I was going to put together a wireless router that ran a Linux distribution. I figured I could run something like this for many hours on a UPS and, assuming the cable wasn't down, monitor things like room temperature through my weather station.

I settled on a PC Engines Alix 2d3. Specs were impressive for a small single-board computer: 500 MHz processor, 256 MB of memory. 3 ethernet ports would be enough for my current setup, and I also added an Atheros-based EMP8602+S (sold under the EnGenius brand) wireless card. A case, power supply, and two antennae rounded off the order.

Before it arrived at my door I spent a considerable amount of time looking at operating systems. I had put OpenWrt at the top of my list, and dd-wrt as another possibility. I quickly realized there were several Linux distributions designed specifically for routers out there, and the pros and cons of each were not clear to me.

So, the box arrives. The first thing I discover is they neglected to send the 256 MB compact flash card that came with the order (PC Engines sent one next-day air the following day - great customer service!). In the meantime, I did what any respectable geek would do: Took the CF card out of our digital camera (after pulling the handful of Christmas pictures off of it) and "flashed" it with OpenWrt. Flashing in this case involves mounting it on my main Linux PC through the USB card reader, and typing something like
sudo dd if=openwrt.iso of=/dev/sdk
checking several times first that /dev/sdk didn't point to one of my several external drives.
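
For what it's worth, that sort of paranoia can be scripted. What follows is only a sketch of the idea, not what I actually typed at the time: it assumes a Linux box where /proc/mounts lists the mounted filesystems, and is_mounted_device is a made-up helper name. It answers one question: is the target device, or any partition on it, currently mounted?

```shell
#!/bin/sh
# is_mounted_device DEVICE [MOUNTS_FILE]
# Hypothetical helper: succeeds (exit 0) if DEVICE or anything whose name
# starts with DEVICE (e.g. /dev/sdk1 for /dev/sdk) appears as a mounted
# device. MOUNTS_FILE defaults to /proc/mounts (an assumption: Linux).
is_mounted_device() {
    dev="$1"
    mounts="${2:-/proc/mounts}"
    # Field 1 of each line is the device; this is a simple prefix match.
    awk -v d="$dev" 'index($1, d) == 1 { found = 1 } END { exit !found }' "$mounts"
}

# Example use (commented out so nothing runs by accident):
#   if is_mounted_device /dev/sdk; then
#       echo "refusing: /dev/sdk is in use" >&2
#   else
#       sudo dd if=openwrt.iso of=/dev/sdk
#   fi
```

An unmounted external drive would still sail through this check, so it is a seat belt rather than a substitute for actually looking at which device is which.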

At this point, I was not ready to go, however. I needed a null modem cable so that I could talk to the thing. I had several straight serial cables so I sacrificed two of them and soldered together a null modem cable using the best null modem wiring configuration I could find (the one I settled on came from a Microsoft whitepaper, I believe).

So, the moment of truth was nigh! I hooked one end to the Alix, the other to a USB/serial converter I had bought on Ebay a while back, plugged the USB cable into my main Linux machine, started minicom, set the terminal settings and fired it up. Success! After a quick memory check it began the boot sequence. But there was a problem: I could not get the device to respond to keystrokes. Hence, I could not log into the device. This I found rather perplexing, and I spent far too much time worrying about handshaking, hardware and software flow control and the like (all bringing up bad memories of the early 1990s). I reversed the serial cable - no change. I triple-checked my wiring with an ohmmeter - no problem. Maybe I chose a bad null modem wiring plan? There are a few different ways to do it. So I head to the local mom-and-pop PC store which claimed to have null modem cables - well they did, but they were the wrong gender on both sides.

I am getting desperate and frustrated. I decide to try to talk to the damned thing using a Windows terminal emulator. No dice still! I finally try a different USB/serial controller and lo and behold, I can finally type at it! Apparently the cheapo USB/serial device was the culprit. Without further investigating, I put the crappy one in service on my weather station and it works just fine. Whatever.

You'd think by now the story would conclude with a happy ending after this rather annoying bump in the road. Not exactly. I can get a pre-built OpenWrt distro to boot but cannot for the life of me get the wireless mPCI card to be recognized. A couple days were spent getting the latest subversion snapshot and building it (this involved getting a Debian virtual machine going as I could not compile it under Fedora Linux). I finally ran into a show stopper where it would compile but the /boot/grub directory was huge due to some bug in one of the helper apps, and I finally gave up on OpenWrt.

I had chosen an Atheros chipset for the wireless card because all the information I could glean suggested that Atheros was supported quite well with Linux. Well, reality is a bit more complicated. I had unwittingly bought a rather new chipset (which could put out 50 mW of power, higher than most) and support was only in recent kernels. I also had a hell of a time actually nailing down what frigging chipset was in there... the spec sheet rather conveniently neglected to include that information. The lspci command reports "Atheros Communications Inc. AR5413 802.11abg NIC (rev 01)" which I assume is correct.

Next I tried dd-wrt. Very nice professional looking web interface, configurable like any router you'd buy at Staples or whatever. But no support for my wireless card.

Then, I discover a distro called Voyage Linux. Voyage is a stripped-down version of Debian Lenny (which as of this writing has not been officially released - still in development). Having never played with Debian, it took me some time to get familiar with the package management, /etc file structure etc. But, happily I could get my wireless card up. And it supported all the dpkg/apt Debian package commands which would make installing software a piece of cake.

So, after a week or so on this crap, I finally had a router that would boot and not do much more. One of my Christmas presents had been the Linux Networking Cookbook which, conveniently, had a "recipe" for building a Linux router. There were also some clues in crucial files including /etc/network/interfaces, /etc/hostapd/hostapd.conf and /etc/dnsmasq.conf.

The first recipe in the book had me using a network bridge to get the router working. I simply could not get this to work in a stable way. I continue to have a hard time wrapping my brain around what a bridge does. At one point I discovered I could, using a bridge, essentially turn my router into a switch which just moved packets across the bridge... and that made sense, but did not help me set up a box that would assign dhcp addresses to wired and wireless clients, forward traffic correctly, have a good firewall in place and handle all of the WPA2 wireless encryption stuff. The latter was new to me; up until now I had been running 64 bit WEP, as it was state of the art when I got my first wireless devices.

So I decide to go to the next recipe which uses iptables rules instead of a bridge to properly forward traffic around. I really, really like iptables and wanted to play with it more anyway... so after a couple more hours of transcribing from the book and mucking about with a few settings, I managed to get the wired part working where packets went where they should, dhcp addresses were assigned, and I could connect via WPA2 to the wireless network! But, I was not out of the woods yet... I had devices that are WPA/TKIP as well as WPA2/AES and this meant I would have to run in both modes. When I did this, I had an unstable network that would work for a day and then get stuck in a loop of authenticating / deauthenticating all my wireless devices. I finally decided to run WPA2/AES only and have the older devices (a printer and DVR-like device) stay 64 bit WEP on my old wireless access point. A sucky solution, but at least the network seemed to be relatively stable now.
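
To give a flavor of the iptables approach, here is a minimal sketch of NAT forwarding. It is not the recipe from the book and not my actual rule set: the interface names (eth0 facing the cable modem, eth1 and ath0 facing the house) are assumptions, and nat_rules is a made-up helper that only prints the commands so they can be read over before anything touches the firewall.

```shell
#!/bin/sh
# nat_rules WAN LAN [LAN...]
# Hypothetical helper: prints (does not run) a bare-bones NAT rule set.
# WAN is the Internet-facing interface; each LAN is an inside interface.
nat_rules() {
    wan="$1"; shift
    # Let the kernel route packets between interfaces.
    echo "sysctl -w net.ipv4.ip_forward=1"
    # Drop forwarded traffic unless a rule below allows it.
    echo "iptables -P FORWARD DROP"
    # Rewrite outgoing packets so they appear to come from the router.
    echo "iptables -t nat -A POSTROUTING -o $wan -j MASQUERADE"
    for lan in "$@"; do
        # Inside hosts may start connections to the outside...
        echo "iptables -A FORWARD -i $lan -o $wan -j ACCEPT"
        # ...and only replies to those connections are let back in.
        echo "iptables -A FORWARD -i $wan -o $lan -m state --state ESTABLISHED,RELATED -j ACCEPT"
    done
}
```

Running nat_rules eth0 eth1 ath0 prints seven commands; once they look right they could be piped to a root shell. Note that with the FORWARD policy set to DROP, traffic between the wired and wireless sides would need its own ACCEPT rules, which this sketch leaves out.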

So, after all this pain and obsession (occurring during the first couple of weeks of the semester when I arguably should have been working on more important or rewarding things in my spare time), I decide it's time to free some disk space before going to bed. I had used a thumb drive in one of the USB ports (the other was happily running my weather station - yes, my wireless router is running, among other things, perl) for backup storage. I had rsynced (or so I thought) the OS to the thumb drive as a backup a couple days earlier and wanted to clean things up. The following is a dump of the session I grabbed after, to my great horror, I realized what I had done. I saved it in order to serve as a cautionary tale to others.

See if you can figure out what went wrong.

voyage:/thumb1/orf/rrd% w
 21:23:19 up 23:13,  2 users,  load average: 0.67, 1.48, 1.69
 USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
 orf      pts/0    vista            16:52    1:17   0.36s  0.36s -bash
 orf      pts/1    vista            14:40    0.00s  0.30s  0.29s sshd: orf [priv]
 voyage:/thumb1/orf/rrd% df
 Filesystem           1K-blocks      Used Available Use% Mounted on
 rootfs                 1968368   1041280    827096  56% /
 udev                     10240        24     10216   1% /dev
 /dev/disk/by-label/ROOT_FS 1968368   1041280    827096  56% /
tmpfs                   128428     49096     79332  39% /lib/init/rw
varrun                  128428        80    128348   1% /var/run
varlock                 128428         0    128428   0% /var/lock
tmpfs                   128428         4    128424   1% /dev/shm
tmpfs                   128428       104    128324   1% /tmp
none                    128428     49096     79332  39% /var/log
none                    128428     49096     79332  39% /var/tmp
none                    128428     49096     79332  39% /home
none                    128428     49096     79332  39% /usr/local/weather
none                    128428     49096     79332  39% /var/www
none                    128428     49096     79332  39% /var/lib/munin
/dev/hda1              1968368   1041280    827096  56% /thumb1
voyage:/thumb1/orf/rrd% ls
./  ../  flows/  graphics/  interfaces/
voyage:/thumb1/orf/rrd% cd ..
voyage:/thumb1/orf% cd ..
voyage:/thumb1% ls
./     CHANGELOG*  initrd.img@  mnt/   README*          rw/       sys/       tmp/ voyage.depends.list
../    dev/        lib/         opt/   README.live-cd*  sbin/     test.conf  usr/ voyage.dpkg-l
bin/   etc/        lost+found/  orf/   README.pxe*      selinux/  tftpboot/  var/ voyage.dpkg.list
boot/  home/       media/       proc/  root/            srv/      thumb1/    vmlinuz@
voyage:/thumb1% sudo rm -rf bin boot dev etc home lib lost* media mnt op proc R* root rw sbin
selinux srv sys tftpboot usr var tmp
[sudo] password for orf:
rm: cannot remove directory `dev': Device or resource busy
rm: cannot remove directory `home': Device or resource busy
rm: cannot remove directory `lib/init/rw': Device or resource busy
rm: cannot remove directory `proc': Device or resource busy
rm: cannot remove directory `sys': Device or resource busy
rm: cannot remove directory `usr/local/weather': Device or resource busy
^C
voyage:/thumb1% ls
-bash: /bin/ls: No such file or directory
voyage:/thumb1% pwd
/thumb1
voyage:/thumb1% ls
-bash: /bin/ls: No such file or directory
voyage:/thumb1% cd
-bash: hostname: command not found
-bash: sed: command not found

First of all, look at the output of the df command. Because flash memory suffers from memory wear (it can only survive a finite number of write/erase cycles) I have directories which would see a lot of read/write activity mounted as volumes in memory, not the CF device.

But df shows /dev/hda1 mounted at /thumb1. No problem, right? When I mounted the thumb drive I issued the following commands:

% mkdir /thumb1
% mount -t ext3 /dev/hda1 /thumb1

The problem was, the device was actually /dev/sda1, not /dev/hda1. /dev/hda1 was the root filesystem for the compact flash card which had the operating system on it.

The mount command happily let me mount the root file system on another mount point. If you look carefully you can see that the space available on /thumb1 is identical to that available on /. Because I had (or so I thought) backed up the root file system to the thumb drive, it did not concern me that I was blowing away all the system directories - they were only copies, after all! Had I looked closer I would have noticed that /thumb1 should *not* have existed in /thumb1!
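
In hindsight, a one-line check before the rm would have saved me. The following is a sketch of such a check, assuming GNU stat (the -c option) and using a made-up helper name, same_filesystem. It compares the device numbers of two paths; when the same partition is mounted in two places, as it was here, the numbers match.

```shell
#!/bin/sh
# same_filesystem PATH1 PATH2
# Hypothetical helper: exit 0 if both paths are on the same filesystem,
# 1 if they are on different ones, 2 if either path cannot be examined.
# Assumes GNU coreutils stat, where -c '%d' prints the device number.
same_filesystem() {
    a=$(stat -c '%d' "$1") || return 2
    b=$(stat -c '%d' "$2") || return 2
    [ "$a" = "$b" ]
}

# Example use (commented out so nothing runs by accident):
#   if same_filesystem / /thumb1; then
#       echo "/thumb1 is the root filesystem - hands off!" >&2
#   fi
```

A supposed backup drive that reports the same device number as / is not a backup drive.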

You never, ever want to see "/bin/ls: No such file or directory."

So I went downstairs and stared at the terminal screen for a few minutes and then did the only thing I could: Took it apart and re-flashed the drive. All the work of the past 10 or so obsessive days was down the toilet. Thankfully, however, I had done a backup of the OS to my main machine a couple days earlier, but a lot of development had been done since then. Since a lot of it was still fresh in my mind it only took a few hours to get the network up again and another day to redo all the other crap that I had done (install and configure munin, get the super-fast system clock under control, install locale data, perl, etc. etc. etc.).

So that is my sordid tale. If there is a moral to the story, it's that you shouldn't nuke backup files after 12 hours of pretty much solid computing. In a refreshed state I probably would have caught myself.

The router has an uptime of 8 days as I type this. I've discovered that two of my wireless devices - my Linux laptop and my Vista laptop - occasionally get confused or confuse the router and get stuck in the deauth/auth cycle. Bringing ath0 down and up again stops the problem, but that's an unsatisfying kludge. So it's not as rock solid as I would like. I am hoping that perhaps the ath5k driver (which I now have on my Linux laptop and seems to have helped that particular problem) will be ready for running in access point mode soon. Also, perhaps because Voyage Linux is based upon a not-quite-ready-yet version of Debian, things will improve down the road. I still have not realized my original plan of setting things up so that a power outage would be better dealt with; I need to put the router on a separate UPS, or shut down my main machine when I'm gone (I am loath to do that as I often access stuff when on the road). It is pretty neat to be able to monitor all the traffic which passes through the router in excruciating detail. As a knob twister and meter geek, this is extremely satisfying. I feel I've learned a little about how iptables works as well as more than I ever intended to know about the world of open source wireless drivers for Linux. Had I known the pain this would entail I might have waited a year or so to do this (nah, who am I fooling).

Posted on Feb 8, 2009