debugging a crash in haproxy

Horrors uncovered in kerberos

Last week we had several reports of disconnections from some of our haproxy-fronted services after we upgraded to a longer/stronger SSL certificate. As the version of haproxy we were running was quite old, we tried an upgrade to the latest (dev21). This fixed the disconnections, but on the production systems it reliably crashed after a few packets. However, we could not reproduce the problem on any test or development system, even though kickstart and puppet ensured they were identical to production.

This post details how we found the cause of the problem, so I have a record/teaching aid.

Starting with the bare facts: the systems are CentOS 6.4, running haproxy-1.5-dev12. To fix another problem related to disconnects and general SSL misery with older clients and 4096-bit keys, we wanted to upgrade to haproxy-1.5-dev21.

We had two attempts at deployment, each of which ended in a rollback after the new version crashed within seconds of startup.

Debugging

Haproxy-dev21 starts fine, and runs fine until it apparently receives a "packet of death", at which point /var/log/messages and the audit log record that it was sent a SIGABRT signal. man 7 signal tells us that SIGABRT comes from the abort() library call.

As we couldn't reproduce this on identical systems (thank you, kickstart and puppet), our initial theory after the first night was dodgy hardware. By default abrt is configured to run only against binaries from signed rpms, so we didn't catch a crash dump the first time it happened, as our haproxy rpm was not signed.

We rapidly changed that:

sed -i 's/OpenGPGCheck = yes/OpenGPGCheck = no/' \
    /etc/abrt/abrt-action-save-package-data.conf

We have a 4-minute maintenance window every night. I'm not one of those people who can type fast and accurately, so I script and test. This means there is little room for exploratory debugging during maintenance windows.

On the second night we failed over to the secondary (to test the dodgy hardware theory) and ran a packet capture. We got identical results and another SIGABRT crash, but this time we had something to analyse.

Still unable to reproduce this on any test or development system, the leading theory became that some packet on the wire in customer traffic was causing the fault.

The audit log tells us when haproxy got the SIGABRT signal. 

# grep ABEND audit.log.1 | tail -1
type=ANOM_ABEND msg=audit(1391032884.818:1967129): auid=809400027 uid=0 gid=0\
 ses=221308 pid=31826 comm="haproxy" sig=6

i.e. at Wed Jan 29 22:01:24 UTC 2014, plus 818 milliseconds.

The wireshark session shows us that the last few packets were:

4816 2014-01-29 22:01:24.790906 server client TLSv1 1154 Certificate
4819 2014-01-29 22:01:24.792635 client server TCP 60 38409 > https [ACK]
Seq=264 Ack=5197 Win=40320 Len=0
4823 2014-01-29 22:01:24.810406 client server TLSv1 252 Client Key Exchange,
 Change Cipher Spec, Encrypted Handshake Message
4824 2014-01-29 22:01:24.810422 server client TCP 54 https > 38409 [ACK]
Seq=5197 Ack=462 Win=16896 Len=0
4955 2014-01-29 22:01:24.825933 server client TCP 54 https > 38409 [RST, ACK]
Seq=5197 Ack=462 Win=16896 Len=0

(lines broken for clarity).

And we have a winner: the third line, an inbound SSL Change Cipher Spec packet at 22:01:24.810.

A bit of experimentation later, the packet of death can be reproduced at will with this command:

openssl s_client -cipher LOW -connect 192.168.1.1:443

We also had a core dump of haproxy from abrt. However, abrt truncated it, which is not so handy.

Also unhelpfully, the production machine has a stripped binary, which gives us even less to work with. Running these two commands (thanks Tim for the pointer):

abrt-action-generate-backtrace 
abrt-action-generate-core-backtrace

gives us:

1d6d7b3a0cf1571310fab6e29508eddee11469a5 0x328a5 raise libc.so.6 
1d6d7b3a0cf1571310fab6e29508eddee11469a5 0x34085 abort libc.so.6
605701a8ae551604303523b4f0d3a7e98cf9e153 0x87201 - libkrb5.so.3
605701a8ae551604303523b4f0d3a7e98cf9e153 0x87321 - libkrb5.so.3
605701a8ae551604303523b4f0d3a7e98cf9e153 0x84de7 krb5_get_default_realm libkrb5.so.3
605701a8ae551604303523b4f0d3a7e98cf9e153 0x4a331 krb5_kt_get_entry libkrb5.so.3
c953da0be16117bde323d1e021fc74878825c935 0x45740 kssl_keytab_is_available libssl.so.10
c953da0be16117bde323d1e021fc74878825c935 0x22a94 ssl3_choose_cipher libssl.so.10
c953da0be16117bde323d1e021fc74878825c935 0x1bd72 ssl3_get_client_hello libssl.so.10
c953da0be16117bde323d1e021fc74878825c935 0x1db6f ssl3_accept libssl.so.10
c953da0be16117bde323d1e021fc74878825c935 0x290d9 ssl23_get_client_hello libssl.so.10
c953da0be16117bde323d1e021fc74878825c935 0x298b3 ssl23_accept libssl.so.10
a59613382d68d83cd342ccb0c400066b88e439c4 0x6b36a - [exe]
a59613382d68d83cd342ccb0c400066b88e439c4 0x2fccd - [exe]
a59613382d68d83cd342ccb0c400066b88e439c4 0x687ee - [exe]
a59613382d68d83cd342ccb0c400066b88e439c4 0x4eef - [exe]
a59613382d68d83cd342ccb0c400066b88e439c4 0x73dc - [exe]

This tells us that the abort() happened in the depths of the kerberos library, which makes sense, as abort() does not appear anywhere near this code path in haproxy.

Running the packet of death against an unstripped haproxy on a test machine, with debuginfo packages loaded, gives us a far nicer backtrace:

 gdb /usr/sbin/haproxy core.2198
<headers snipped>
(gdb) where
#0 0x00007f79bea928e5 in raise () from /lib64/libc.so.6
#1 0x00007f79bea940c5 in abort () from /lib64/libc.so.6
#2 0x00007f79be35b241 in krb5int_translate_gai_error (num=<value optimized out>)
 at hst_realm.c:264
#3 0x00007f79be35b361 in get_fq_hostname (buf=<value optimized out>,
bufsiz=<value optimized out>) at hst_realm.c:113
#4 krb5int_get_fq_local_hostname (buf=<value optimized out>,
bufsiz=<value optimized out>) at hst_realm.c:132
#5 0x00007f79be358e27 in krb5_get_default_realm (context=0x185af10,
lrealm=0x7ffff8c07200) at def_realm.c:118
#6 0x00007f79be31e331 in krb5_kt_get_entry (context=0x185af10, keytab=0x185b480,
 principal=0x7ffff8c071d0, vno=0, enctype=0,
entry=<value optimized out>) at ktfns.c:66
#7 0x00007f79bf600740 in kssl_keytab_is_available () from /usr/lib64/libssl.so.10
#8 0x00007f79bf5dda94 in ssl3_choose_cipher () from /usr/lib64/libssl.so.10
#9 0x00007f79bf5d6d72 in ssl3_get_client_hello () from /usr/lib64/libssl.so.10
#10 0x00007f79bf5d8b6f in ssl3_accept () from /usr/lib64/libssl.so.10
#11 0x00007f79bf5e40d9 in ssl23_get_client_hello () from /usr/lib64/libssl.so.10
#12 0x00007f79bf5e48b3 in ssl23_accept () from /usr/lib64/libssl.so.10
#13 0x000000000046a81a in ssl_sock_handshake ()
#14 0x000000000042f66d in conn_fd_handler ()
#15 0x0000000000467cee in _do_poll ()
#16 0x0000000000404e0f in run_poll_loop ()
#17 0x000000000040726c in main ()

(Again, lines broken for readability).

So we can see that krb5int_translate_gai_error() calls the abort() that killed our production load balancers.

Frame 4 looks weird because the compiler has inlined a static function. I spent a couple of hours tracing the flow of control through the assembly here to make sure I understood what was going on.

<opinion>As a systems programmer this makes me angry. What on Earth is a library doing calling abort()? The contract is supposed to be that the library will return an error code, not make a unilateral decision to shoot you in the head.</opinion>

What made this particularly hard was that, without a crash dump to point the finger at kerberos, the only observation was that version 1.5-dev12 worked fine while version 1.5-dev21 crashed with an abort(). From that, your first thought is not exactly "aha, a bug in kerberos!" No, you're going to waste your time poring over haproxy.

So, let's have a look at this routine, from the SRPM, which, given the nature of the fault, may not be the code actually running on the system.

In all its glory, reproduced in full:

krb5_error_code
krb5int_translate_gai_error (int num)
{
    switch (num) {
#ifdef EAI_ADDRFAMILY
    case EAI_ADDRFAMILY:
        return EAFNOSUPPORT;
#endif
    case EAI_AGAIN:
        return EAGAIN;
    case EAI_BADFLAGS:
        return EINVAL;
    case EAI_FAIL:
        return KRB5_EAI_FAIL;
    case EAI_FAMILY:
        return EAFNOSUPPORT;
    case EAI_MEMORY:
        return ENOMEM;
#if defined(EAI_NODATA) && EAI_NODATA != EAI_NONAME
    case EAI_NODATA:
        return KRB5_EAI_NODATA;
#endif
    case EAI_NONAME:
        return KRB5_EAI_NONAME;
#if defined(EAI_OVERFLOW)
    case EAI_OVERFLOW:
        return EINVAL;          /* XXX */
#endif
    case EAI_SERVICE:
        return KRB5_EAI_SERVICE;
    case EAI_SOCKTYPE:
        return EINVAL;
#ifdef EAI_SYSTEM
    case EAI_SYSTEM:
        return errno;
#endif
    }
    abort ();
    return -1;
}

This is called on the error code returned by getaddrinfo(). The next frame up the stack, get_fq_hostname(), is:

static krb5_error_code
get_fq_hostname(char *buf, size_t bufsize, const char *name)
{
    struct addrinfo *ai, hints;
    int err;

    memset (&hints, 0, sizeof (hints));
    hints.ai_flags = AI_CANONNAME | AI_ADDRCONFIG;
    err = getaddrinfo (name, 0, &hints, &ai);
    if (err)
        return krb5int_translate_gai_error (err);
    if (ai->ai_canonname == 0)
        return KRB5_EAI_FAIL;
    strncpy (buf, ai->ai_canonname, bufsize);
    buf[bufsize-1] = 0;
    freeaddrinfo (ai);
    return 0;
}

So this means that if getaddrinfo() returns an error code that krb5int_translate_gai_error() doesn't understand, your program gets shot in the head.

So, why does this happen?

Looking at the man page for getaddrinfo, all of the return codes listed are accounted for in the translate routine above. So what's the error code that is killing us, and why is it being generated?

Running under a debugger, or indeed running a test case where we call krb5int_get_fq_local_hostname() with the same parameters as krb5_get_default_realm() does, shows that we're getting a return code of -5 from getaddrinfo().

Looking for this in /usr gives us:

# grep -- -5 /usr/include/netdb.h 
# define EAI_NODATA -5 /* No address associated with NAME. */

Which is odd, as this should have been handled by the translate routine above.

This leads to a couple of possible conclusions:

1) When the system library libkrb5.so was compiled, this preprocessor condition must have evaluated to false:

#if defined(EAI_NODATA) && EAI_NODATA != EAI_NONAME

2) The code running is not the patched code from the SRPM (after rpmbuild -bp) and does not handle EAI_NODATA. 

Either way, we have an inconsistency between the system libraries, with the net result that rather than defaulting to something reasonable like KRB5_EAI_FAIL, we get abort().

Clearly the original intent of the programmer was that the abort() represent an "impossible" condition. I'm not sure that signalling that with an abort() is the wisest course. The alternatives are all unappealing, but this one turns a drama into a crisis.
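For illustration only, here is a sketch of what a more defensive translate routine could look like. This is not the upstream krb5 fix, and MY_EAI_FAIL is a hypothetical stand-in (the real KRB5_EAI_FAIL value lives in krb5's headers); the point is simply that unrecognised codes fall through to a generic failure instead of abort():

```c
#include <errno.h>
#include <netdb.h>
#include <stdio.h>

/* Hypothetical stand-in for krb5's KRB5_EAI_FAIL error code. */
#define MY_EAI_FAIL (-1)

/* Translate a getaddrinfo() error into an errno-style code. Unlike the
 * routine above, unknown codes (such as an unexpected EAI_NODATA) map
 * to a generic failure rather than calling abort(). */
int translate_gai_error_safe(int num)
{
    switch (num) {
    case EAI_AGAIN:    return EAGAIN;
    case EAI_BADFLAGS: return EINVAL;
    case EAI_MEMORY:   return ENOMEM;
#ifdef EAI_SYSTEM
    case EAI_SYSTEM:   return errno;
#endif
    default:           return MY_EAI_FAIL;   /* never abort() */
    }
}

int main(void)
{
    /* -5 is EAI_NODATA on glibc: the very code that crashed haproxy. */
    printf("translate(-5) = %d\n", translate_gai_error_safe(-5));
    return 0;
}
```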

Moving on,

How does that -5 arise in the first place? 

Well, haproxy runs chrooted, and in that chroot we have no files. So when SNI kicks into action on being presented with the SSL packet of death and its LOW cipher, haproxy attempts (via ssl3_choose_cipher()) to resolve its own local hostname. This fails, as /etc/resolv.conf and /etc/nsswitch.conf are missing (for starters), and returns EAI_NODATA.

Conclusion

This is all fixed in CentOS 6.5, probably because the libkrb5 or glibc libraries have been compiled with matching flags, avoiding the "should not happen" call to abort().

In the meantime, we've removed haproxy from the chroot. That way, when it attempts to look up its hostname, the lookup will succeed (or fail with EAI_NONAME, which is caught by the translate routine), and all is good.

Whether IPv6 is enabled or disabled also plays a role, in that it influences whether getaddrinfo() returns EAI_NODATA or EAI_NONAME.

There's more here:

http://marc.info/?t=139111261200001&r=1&w=2

I'm now more suspicious of the kerberos library. A brief grep -r shows several more places where abort() is called.

We have a workaround, and this is already fixed upstream, so in the long run it's just a curiosity. It does serve as a reminder that things are not always as they seem at first glance: bugs exist in operating systems too.

Although I get to write this up here, many people contributed to fixing and root-causing this production issue, particularly Luke Bigum, Andy Millar and Tim Hughes, and of course James Hogarth, who paired with me for most of it. Thanks, guys.

Written by atp

Friday 31 January 2014 at 09:47 am
