Archive for January 2nd, 2008

Gentoo + jabberd = aargh

I’ve been running jabberd2 from ~x86 for ages. Tonight I went to make some config changes, then stopped and started jabberd using the init script as usual. Things were different this time, though: the init script didn’t shut down all the Jabber processes and I had to stop them manually. When I started it again, only two processes appeared instead of the full set of separate processes I was used to.

Nothing was being logged either, which I discovered while trying to find out why the processes weren’t starting. It was as if it was suddenly ignoring all my configuration files!

Careful inspection of some output from eix showed the problem: Jabberd 2 has been moved to its own ebuild (jabberd2), and the highest version in the jabberd ebuild is now a 1.4.4-something. Not only that, they’ve hard-masked jabberd2:

# Krzysiek Pawlik  (08 Oct 2007)
# Masked untill the split from net-im/jabberd is complete.
# See bug #178055 and bug #195091
net-im/jabberd2

Looks like the last time I emerged I downgraded my Jabberd 2 to 1.4. No wonder the thing was not responding to me.

This is the kind of thing that happens on Gentoo from time to time. It’s why I set up a regular Portage sync that emails me the output of an emerge pretend world: so that I don’t get too far behind and end up with a heap of these things to sort out. This one caught me off guard, though.

Note to self: pay closer attention to emerge output in future!
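One way to make a downgrade harder to miss would be to filter the pretend output for it. This is only a sketch: it assumes Portage's usual status column, in which an upper-case D inside the leading [ebuild ...] brackets marks a downgrade, and find_downgrades is a helper name I made up for the example.

```shell
#!/bin/sh
# Read `emerge --pretend` output on stdin and print only the lines that
# Portage has flagged as downgrades (an upper-case D in the status column,
# e.g. "[ebuild     UD]"). Assumption: the standard column layout is in use.
find_downgrades() {
    # `|| true` keeps the function from failing when nothing matches.
    grep -E '^\[ebuild[^]]*D' || true
}

# Hypothetical use from a nightly job (the temp file path is a placeholder):
#   emerge --pretend --update --deep world > /tmp/pretend.txt
#   find_downgrades < /tmp/pretend.txt
```

Anything this prints deserves a look before the real emerge runs; an empty result means no package was marked for downgrade.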


OpenLDAP database recovery

Something ugly happened to my LDAP database a while back, and I never noticed. I saw it had lost a bunch of records, but I’d put it down to some replication problem and never investigated. It wasn’t until I tried to replace one of the lost records, and got an error from LDAP telling me the non-existent record already existed, that I figured something was really wrong.

Multiple iterations of db_recover, attempts to re-index, dump-and-restores of the raw Berkeley DB files… Nothing helped. In the end, all that was left was the slapcat-delete-slapadd dance.

(You know that your OpenLDAP is especially sick when commands like slapcat generate glibc backtraces. :( )

So with what was left of my LDAP data, I started to compare against my replicated LDAP server. The first thing I noticed was that a number of records that I expected to have been replicated were not. I figured that records in the master directory that were lost to database corruption and not to an LDAP operation (a modify or delete) should have been present on the replicated copy. This was not the case, which makes me think that replication only takes effect after the master directory’s backend is updated, and if something like a corrupted database prevents the master from being updated then the replication doesn’t take place. As Zaphod might say, ten points for directory consistency but minus several million for data preservation… :)

(The more I think about this, though, the less sense it makes. If slapd had been unable to update the backend, and hence the replication didn’t take place, surely that would have been returned to me as an update error? I know for a fact that the data I lost made it to the database because I tested an app using the data. It seems unreasonable to think that BDB would have returned success on a write operation unless the write had actually happened, but I suppose write-caching might create an opportunity for that to occur… No, I suspect a different problem, maybe just replication being suspended at the time, as the real reason that some data was missing from the replica.)

Next I found that, despite what the missing records had led me to expect, quite a few of the records lost from the master were still present on the replica. This makes me think I’ve had multiple failures, apparently at different times, that have impaired my master directory: one that caused new updates to be lost, the other resulting in loss of existing data.

I’ve added a step to my Bacula processing that performs a slapcat and backs up the resulting LDIF, so if anything happens in the future I have a bit of a chance of running through old files and restoring. The other thing that I’ll kick off is a process to verify the accuracy or integrity of the replica — this might tip me off to a problem sooner rather than later.
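A rough sketch of what that replica check could look like: dump both directories to LDIF and compare the DNs. The function names and file paths below are placeholders of my own, and the comparison only looks at which entries exist, not at their attributes.

```shell
#!/bin/sh
# Print the dn line of every entry in an LDIF file, sorted and de-duplicated.
# Folded lines (continuations start with a single space) are joined first.
ldif_dns() {
    awk '
        /^ / { line = line substr($0, 2); next }
        { if (line ~ /^dn::? /) print line; line = $0 }
        END { if (line ~ /^dn::? /) print line }
    ' "$1" | LC_ALL=C sort -u
}

# Print the DNs that appear in the first LDIF file but not in the second.
dns_only_in() {
    first_dns=$(mktemp) || return 1
    second_dns=$(mktemp) || { rm -f "$first_dns"; return 1; }
    ldif_dns "$1" > "$first_dns"
    ldif_dns "$2" > "$second_dns"
    # comm -23 keeps only the lines unique to the first (sorted) input.
    LC_ALL=C comm -23 "$first_dns" "$second_dns"
    rm -f "$first_dns" "$second_dns"
}

# Hypothetical usage (file names are placeholders):
#   slapcat -l master.ldif                  # run on the master
#   slapcat -l replica.ldif                 # run on the replica
#   dns_only_in master.ldif replica.ldif    # entries missing from the replica
#   dns_only_in replica.ldif master.ldif    # entries missing from the master
```

Running it in both directions matters here, since I had records missing on each side. One caveat: a DN that is base64-encoded in one dump and plain in the other would show up as a difference.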

My theory on the cause of this hassle? Well, a while ago I was having a bit of trouble with partitions filling up. At a guess I’d say that OpenLDAP was trying to do something (update a transaction log maybe) at a time when the partition its data lives on was full, and got twisted. Soon I’m going to write a separate post with my (updated) thoughts about isolation of failure domains…
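If a full partition really was the trigger, a small check along these lines might give some warning next time. It is a sketch only: the function names, the data directory path and the 90% threshold are all assumptions, and it relies on the POSIX output format of df -P.

```shell
#!/bin/sh
# Read `df -P` output on stdin and print the use percentage of the first
# filesystem listed, without the trailing % sign.
capacity_from_df() {
    awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Succeed (exit status 0) if the filesystem holding path $1 is at or above
# $2 percent full.
partition_nearly_full() {
    used=$(df -P "$1" | capacity_from_df)
    [ "$used" -ge "$2" ]
}

# Hypothetical cron check (path and threshold are placeholders):
#   partition_nearly_full /var/lib/openldap-data 90 \
#       && echo "LDAP data partition is filling up"
```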

For those who haven’t seen it, here’s the process I used to get things back:

# cd /var/lib
# slapcat > whatsleft.ldif
# /etc/init.d/slapd stop
# mv openldap-data openldap-data-old
# mkdir openldap-data
# chown ldap:ldap openldap-data
# cp -a openldap-data-old/DB_CONFIG openldap-data/
# cd openldap-data
# slapadd < ../whatsleft.ldif
# chown ldap:ldap *
# /etc/init.d/slapd start

Obviously if you find yourself in the unfortunate position of having to use this process, substitute your distribution's values for the path to the OpenLDAP data directory and the user/group that LDAP runs under.
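A quick sanity check after a restore like this is to confirm that the rebuilt database holds as many entries as the LDIF that was loaded into it. A minimal sketch, with count_entries being a helper name invented for the example and the /tmp path a placeholder:

```shell
#!/bin/sh
# Count the entries in an LDIF file by counting its dn lines
# (this matches both plain "dn:" and base64 "dn::" lines).
count_entries() {
    grep -c '^dn:' "$1"
}

# Hypothetical usage once slapadd has finished:
#   count_entries /var/lib/whatsleft.ldif
#   slapcat > /tmp/restored.ldif && count_entries /tmp/restored.ldif
```

If the two numbers differ, slapadd skipped or rejected something and its output is worth re-reading.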