Atp's external memory

linux syscalls on x86_64

This was going to be a post about the performance overhead of the Linux syscall interface. However, things have changed a little since I last looked at it seriously, back when I was trying to figure out how to map it onto the VAX System Control Block. So this is the collection of notes I've made about how things work in more modern Linux kernels.

Back then the Intel version of Linux used the 0x80 software interrupt. In fact you can still find this if you look;

arch/x86/vdso/vdso32/int80.S
__kernel_vsyscall:
.LSTART_vsyscall:
    int $0x80
    ret

(All kernel code fragments are from 2.6.39 on the excellent LXR here.)

Jumping ahead a little you can see that this has been buried away in a subdirectory called vdso, which stands for virtual dynamic shared object.

VDSO is one of the more interesting things from a performance and latency point of view. In fact if you look on a recent linux distro like ubuntu 10.04 (RedHat 5.4 is too old to have vdso) you'll see this strange entry in the link table for binaries;

atp@euston:~$ ldd /bin/true
    linux-vdso.so.1 =>  (0x00007fff1afff000)
    libc.so.6 => /lib/libc.so.6 (0x00007f5708905000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f5708caf000)

Note how there is no file system reference. That's because linux-vdso.so.1 isn't a file at all: it is mapped directly into the process by the kernel.

System Call opcodes

So, moving back to history;

In the 32 bit world, there were changes with processor architecture from the int 0x80 to sysenter mechanism that are described in great detail in [1], [2] and [3].

As this article focusses on the x86_64 world, we can leave the need for backwards compatibility behind us. For the gory details and the original motivation for the VDSO (originally called linux-gate.so.1) start (and probably end) with [4] and then check out the files in the linux kernel x86 vdso32 directory referred to earlier.

The x86_64 world has no need for a __kernel_vsyscall trampoline in the vdso to maintain compatibility across processor generations, as it has just the one syscall opcode.

The syscall opcode is described in the AMD "SYSCALL and SYSRET instruction specification" [5], whose overview says:

SYSCALL and SYSRET are instructions used for low-latency system calls and returns in operating systems with a flat memory model and no segmentation. These instructions have been optimized by reducing the number of checks and memory references that are normally made so that a call or return takes less than one-fourth the number of internal clock cycles when compared to the current CALL/RET instruction method.

Which sounds good to me from a performance point of view.

Checking the local machine, which is a;

vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Celeron(R) CPU        E3500  @ 2.70GHz

we can see it's using the syscall instruction.

atp@euston:~/c/vdso$ cat syscall.c
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(int argc, char *argv[])
{
   uid_t uid;
   uid = syscall(SYS_getuid);
   printf("uid=%d\n", uid);
   return 0;
}
atp@euston:~/c/vdso$ objdump -d syscall | grep -A10 '<syscall>:'
000000000040e770 <syscall>:
  40e770:    48 89 f8                 mov    %rdi,%rax
  40e773:    48 89 f7                 mov    %rsi,%rdi
  40e776:    48 89 d6                 mov    %rdx,%rsi
  40e779:    48 89 ca                 mov    %rcx,%rdx
  40e77c:    4d 89 c2                 mov    %r8,%r10
  40e77f:    4d 89 c8                 mov    %r9,%r8
  40e782:    4c 8b 4c 24 08           mov    0x8(%rsp),%r9
  40e787:    0f 05                    syscall
  40e789:    48 3d 01 f0 ff ff        cmp    $0xfffffffffffff001,%rax
  40e78f:    0f 83 3b 16 00 00        jae    40fdd0 <__syscall_error>

I'm using a static link there to pull the syscall() C library routine into the binary for objdump. Otherwise you'll see a lazy-linked jump to the ELF PLT like this, which is resolved at run time by the standard ld-linux.so runtime linker.

  400588:    b8 00 00 00 00           mov    $0x0,%eax
  40058d:    e8 e6 fe ff ff           callq  400478 <syscall@plt>
  400592:    89 45 fc                 mov    %eax,-0x4(%rbp)

VDSO and Vsyscall

The web pages listed in the references talk about the 32-bit implementation of Linux and the need for the vdso to fix the problem of deciding which syscall implementation to use (int 0x80 or sysenter) on a particular machine at boot time, in a way that is easy for the kernel to present to the C library.

On x86_64, why do we still have a vdso if we no longer need to switch between syscall methods?

Actually we have both vdso and vsyscall. vsyscall was the precursor, and is maintained for historical reasons so that old statically linked binaries keep working [6].

atp@euston:~$ cat /proc/self/maps | grep '\['
0113c000-0115d000 rw-p 00000000 00:00 0                                  [heap]
7fff4c87a000-7fff4c88f000 rw-p 00000000 00:00 0                          [stack]
7fff4c8b5000-7fff4c8b6000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

This probably isn't getting much clearer, so let me lay this out.

vdso and vsyscall on x86_64 essentially do the same thing: they provide a mechanism for certain system calls to be executed without the overhead of even the lighter-weight syscall instruction and its mode switch to ring 0. They do this by mapping a page of memory from the kernel into each process's address space. Thus a call to one of these frequently used syscalls is as lightweight as a call to any other C library function.

vsyscall is mapped at a fixed address. This is not a great idea, as it gives attackers a known location to target when hijacking system calls; one rather outdated example relies on finding the int 0x80 instruction. You can find better examples that work against vsyscall yourself by searching for a "guide to kernel exploitation".

vdso, by contrast, is mapped at a randomised address, as several runs of the command above will convince you.

So, in order to dump out the contents of the vdso and see what's there, we need to do it with a small program.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <elf.h>

static void *getsys(char **envp)
{
   Elf64_auxv_t *auxv;

   /* walk past all env pointers */
   while (*envp++ != NULL);

   /* and find ELF auxiliary vectors (if this was an ELF binary) */
   auxv = (Elf64_auxv_t *) envp;

   for ( ; auxv->a_type != AT_NULL; auxv++)
     if (auxv->a_type == AT_SYSINFO_EHDR)
       return (void *)auxv->a_un.a_val;

   fprintf(stderr, "no AT_SYSINFO_EHDR auxv entry found\n");
   exit(1);
}

int main(int argc, char *argv[], char **envp)
{
   unsigned char buffer[4096];
   void *p;

   p = getsys(envp);
   fprintf(stderr, "AT_SYSINFO_EHDR at %p\n", p);
   memcpy(buffer, p, 4096);
   write(1, buffer, 4096);
   return 0;
}

This uses the ELF auxiliary vectors method described in [7], lightly modified for x86_64, to pull out the AT_SYSINFO_EHDR entry for the vdso, then dumps the page to stdout.

Incidentally, it's easy to see the value of AT_SYSINFO_EHDR by using the LD_SHOW_AUXV environment variable, as mentioned in [7].

atp@euston:~/c/vdso$ LD_SHOW_AUXV=1 cat /proc/self/maps | egrep '\[vdso|AT_SYSINFO' 
AT_SYSINFO_EHDR: 0x7fff4e18b000
7fff4e18b000-7fff4e18c000 r-xp 00000000 00:00 0                          [vdso]

So, now we can dump out what's there. It's set up as an ELF binary (see vdso-layout.lds.S), so we can use objdump.

atp@euston:~/c/vdso$ ./dump_vdso > foo.img 
AT_SYSINFO_EHDR at 0x7fffafda1000
atp@euston:~/c/vdso$ objdump -T foo.img

foo.img:     file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
ffffffffff70030c l    d  .eh_frame_hdr    0000000000000000              .eh_frame_hdr
ffffffffff7008d0  w   DF .text    000000000000009c  LINUX_2.6   clock_gettime
0000000000000000 g    DO *ABS*    0000000000000000  LINUX_2.6   LINUX_2.6
ffffffffff700790 g    DF .text    000000000000008a  LINUX_2.6   __vdso_gettimeofday
ffffffffff700970 g    DF .text    000000000000003d  LINUX_2.6   __vdso_getcpu
ffffffffff700790  w   DF .text    000000000000008a  LINUX_2.6   gettimeofday
ffffffffff700970  w   DF .text    000000000000003d  LINUX_2.6   getcpu
ffffffffff7008d0 g    DF .text    000000000000009c  LINUX_2.6   __vdso_clock_gettime

which is actually not much different from what the vsyscall page offers at this time;

atp@euston:~/c/vdso$ cat /usr/include/asm/vsyscall.h 
#ifndef _ASM_X86_VSYSCALL_H
#define _ASM_X86_VSYSCALL_H

enum vsyscall_num {
    __NR_vgettimeofday,
    __NR_vtime,
    __NR_vgetcpu,
};

But it does include clock_gettime, which we can use for high-resolution timing.

So, the questions now are

  • how much faster is the vsyscall/vdso version?
  • what is the actual overhead of a system call?
  • does Java make effective use of these faster system calls?

We can use /proc/sys/kernel/vsyscall64 to turn the vsyscall page behaviour on or off, which is handy for question 1.  The other questions are a job for next week.

References

[1] http://davisdoesdownunder.blogspot.com/2011/02/linux-syscall-vsyscall-and-vdso-oh-my.html

[2] http://articles.manugarg.com/systemcallinlinux2_6.html

[3] http://www.win.tue.nl/~aeb/linux/lk/lk-4.html

[4] http://www.trilithium.com/johan/2005/08/linux-gate/

[5] http://support.amd.com/fr/Embedded_TechDocs/21086.pdf

[6] http://lkml.org/lkml/2007/9/16/127

[7] http://articles.manugarg.com/aboutelfauxiliaryvectors.html

Written by atp

Saturday 21 May 2011 at 7:41 pm

Posted in Linux
