Timing Precision under Linux

When you are after low latency, the first things to examine are your measurement tools. How precise are the different ways of timing events? What is the maximum resolution you can reach?

On my Celeron E3500 desktop I get a maximum resolution of 13118 picoseconds, with a 3 sigma spread of 7262 ps - roughly 13 +/- 7 ns.

On a server-class system (a Westmere X5650) the maximum resolution is 15000 ps (15 ns), with no variance over the test.
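
For reference, the obvious system-call route is clock_gettime()/clock_getres() with one of the POSIX clocks. The sketch below is purely illustrative - it is not the code behind the numbers above - but it shows the API that the TSC approach is being measured against (link with -lrt on older glibc):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec res, t1, t2;

    /* advertised resolution of the clock - often 1ns, which says nothing
       about how coarse or expensive an actual read is */
    clock_getres(CLOCK_MONOTONIC, &res);
    printf("advertised resolution: %ld ns\n", res.tv_nsec);

    /* granularity/cost of two back-to-back reads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    printf("back-to-back delta: %ld ns\n",
           (long)(t2.tv_sec - t1.tv_sec) * 1000000000L
           + (t2.tv_nsec - t1.tv_nsec));
    return 0;
}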

So measuring the smallest practical resolution means staying away from system calls. In many cases clocks or "high resolution" timers may have a relatively long update interval - perhaps they should be called "high precision" timers instead. A good summary is in [3]. If you are not interested in absolute time, and just want the maximum possible resolution, the time stamp counters are still your best bet, provided you:

  • peg your cpu frequency to the highest rate
  • use taskset to constrain your run to a single cpu

It also helps to make sure the system is not otherwise busy: at these timescales the cost of a cpu switch or preemption by another program is enormous.
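
Something along these lines should do it (cpufreq-set comes from the cpufrequtils package; the exact tools vary by distribution, so treat these commands as illustrative rather than exactly what was run here):

cpufreq-set -c 0 -g performance   # hold cpu 0 at its highest frequency
taskset -c 0 ./timer_resolution   # keep the test pinned to cpu 0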

This uses a bit of hand-written assembly and the rdtsc instruction. RDTSC has several health warnings associated with it: [1], for example, summarises the problems with multicore processors nicely. And, at least from some minor poking at it, Java's System.nanoTime() seems to be using RDTSC instructions on Linux.

Earlier issues with dynamic clock frequencies - see for example [2] - have been mainly overcome by the introduction of constant timestamp counters:

grep tsc /proc/cpuinfo | head -1
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc
arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2
ssse3 cx16 xtpr pdcm xsave lahf_lm tpr_shadow vnmi flexpriority

The remaining issue - that migrating from one cpu to another can give you a different TSC value - can be solved by using taskset, or by pinning the process to a single cpu from inside the program.
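
A sketch of the in-program version, using sched_setaffinity() (this is not part of the original timer_resolution.c - it is simply the programmatic equivalent of "taskset -c 0"):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to cpu 0 so that every rdtsc read comes from
   the same core's TSC. */
static int pin_to_cpu0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}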

So the TSC read itself comes down to something like this:

static __inline__ uint64_t fast_rdtsc()
{
     register uint64_t ticks;
     // just get the TSC
     __asm__ __volatile__ ( "rdtsc\nshl $0x20, %%rdx\nor  %%rax,%%rdx"
                : "=d" (ticks) :: "%rax");
     return ticks;
}


// this loop should compile to 7 instructions when fast_rdtsc() is inlined
    for (ii=0; ii<NOPS; ii++) {
        ticks[ii] = fast_rdtsc();
    }

which compiles to the following assembly:

  400b06:    0f 31                    rdtsc  
  400b08:    48 c1 e2 20              shl    $0x20,%rdx
  400b0c:    48 09 c2                 or     %rax,%rdx
  400b0f:    48 89 14 cb              mov    %rdx,(%rbx,%rcx,8)
  400b13:    48 83 c1 01              add    $0x1,%rcx
  400b17:    48 83 f9 64              cmp    $0x64,%rcx
  400b1b:    75 e9                    jne    400b06 <main+0x5f>

The function is inlined and the register hint is used to reduce the instruction count. The RDTSC instruction places its output in two 32-bit halves, in the %rax and %rdx registers (strictly, in %eax and %edx). We shift the top half up by 32 bits (shl $0x20,%rdx) and then "or" in the lower half. The other instructions store the result into the array, increment the loop counter and do the loop test.
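
In C terms, the combination step is just the following (hi and lo here are illustrative names for the values rdtsc leaves in %edx and %eax):

    // equivalent of "shl $0x20,%rdx; or %rax,%rdx"
    uint64_t ticks = ((uint64_t)hi << 32) | lo;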

The code was compiled like this:

gcc -Wall -O -o timer_resolution timer_resolution.c -lrt -lm
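
The full timer_resolution.c is not reproduced here, but a minimal sketch of the measurement and the post-processing behind the numbers below might look like the following (NOPS, the hard-coded clock rate and the variable names are my assumptions, not the original code):

#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define NOPS 100                  /* 100 iterations, as in the results below */

static __inline__ uint64_t fast_rdtsc(void)   /* as defined earlier */
{
    register uint64_t ticks;
    __asm__ __volatile__ ("rdtsc\nshl $0x20, %%rdx\nor %%rax,%%rdx"
              : "=d" (ticks) :: "%rax");
    return ticks;
}

int main(void)
{
    uint64_t ticks[NOPS];
    double cpu_ghz = 2.6;         /* clock rate in GHz, i.e. ticks per ns (X5650) */
    double sum = 0.0, sumsq = 0.0, mean, sigma;
    int ii;

    for (ii = 0; ii < NOPS; ii++)
        ticks[ii] = fast_rdtsc();

    /* mean and sigma of the back-to-back deltas, in ticks and in ns */
    for (ii = 1; ii < NOPS; ii++) {
        double d = (double)(ticks[ii] - ticks[ii - 1]);
        sum += d;
        sumsq += d * d;
    }
    mean  = sum / (NOPS - 1);
    sigma = sqrt(sumsq / (NOPS - 1) - mean * mean);

    printf("mean %.1f ticks (%.1f ns), sigma %.1f ticks (%.1f ns)\n",
           mean, mean / cpu_ghz, sigma, sigma / cpu_ghz);
    return 0;
}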

Running it takes 40 ticks to go round the loop on an Intel X5650, and a similar amount on an Intel E3500. Surprisingly, running it on an Opteron 2214 (which is rather old now, and lacks constant_tsc) takes only about 9 ticks. That works out to roughly 4 ns for the 2.2GHz 2214, versus 15 ns for the X5650 at 2.6GHz and 13 ns for the E3500, which runs at a slightly higher clock rate of 2.7GHz.

This gives us a baseline. The first thing to worry about is out-of-order cpus, where the net wisdom - inspired by an Intel application note for the Pentium II processor, "Using the RDTSC Instruction for Performance Monitoring" [4] - recommends using the CPUID instruction to serialise the instruction pipeline.

Which would look like this:

static __inline__ uint64_t careful_rdtsc()
{
     register uint64_t ticks;
     // Serialize
     __asm__ __volatile__ ("xorl %%eax,%%eax\ncpuid"
               ::: "%rax", "%rbx", "%rcx", "%rdx");
     // get the TSC
     __asm__ __volatile__ ("rdtsc\nshl $0x20, %%rdx\nor %%rax,%%rdx"
               : "=d" (ticks):: "%rax");
     return ticks;
}

Where the CPUID bit was taken from [5].
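
For the comparison below, the same measurement loop as before is simply pointed at the serialising version (again a sketch rather than the original code):

    // identical loop, but through the CPUID + RDTSC path
    for (ii = 0; ii < NOPS; ii++) {
        ticks[ii] = careful_rdtsc();
    }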

The result of including the cpuid instruction is:

Processor      fast_rdtsc()   fast   fast sigma   careful_rdtsc()   careful   careful sigma
               ticks          (ns)   (ns)         ticks             (ns)      (ns)
E3500 2.7GHz   36             13     2            237               88        10
2214  2.2GHz   10             5      2            68                31        5
X5650 2.6GHz   40             15     0            179               67        7

All figures are averages over 100 iterations. The E3500 is running a desktop environment, which may have a bearing on both the average and, in particular, the sigma.

The difference between the AMD and Intel processors in the number of ticks taken is also borne out in the responses to a similar question in [6], where the original poster likewise sees a difference in cycle counts between AMD and Intel. Without the CPUID instruction we don't really see that here. The cpuid instruction seems to be the most heavyweight part (or else the AMD architecture doesn't serialise on it).

In [6] they also throw doubt on whether CPUID is needed at all.

The detail, and perhaps the final word, should go to the Intel engineer(s) who responded in that thread:

  First, as to whether or not executing RDTSC distorts the measurement: the fact is that it will for shorter instruction sequences that execute within the "shadow" of an instance of RDTSC execution ... in that upon inserting pair of RDTSC, one essentially cannot measure time spans less than about twice the pipeline "shadow" ... recovered precision is in direct proportion to time span duration relative to time span of twice the pipeline "shadow". So this is lower bounds of what minimum time span can be measured using present instruction-based technology.

Which says that things get hairy when you measure a small number of instructions. Also from the same thread in [6]:

  Second, as to whether there is jitter among pairs of RDTSC used to measure time span of an instruction sequence. If purely executing within the core, e.g. recurrence relations, Pentium 4 is very faithful here when executing from the trace cache and no jitter is ever seen (at least by me). On Core u-arch, one will experience jitter, perhaps up to ~25% but typically ~5% of time span being measured.

So, depending on your microarchitecture, you will see jitter or you won't, which matches the observations above for the Westmere X5650 vs the E3500.

Conclusion.

To get to the highest resolution measurements - which in this case means about 100ns or below - you need to be very aware of the microarchitecture, and that means more than just Intel vs AMD, even once you've satisfied yourself that you have constant_tsc and that you're executing on a single core. You may or may not need to serialise using CPUID, and the devil is very much in the details. I suspect I've only scratched the surface here, and I'm wary of generalising that AMD is better for timing than Intel.

I suspect that all that can be drawn from this is that a practical highest resolution appears to be around the 100ns level with current processors.

Plus the same caveats probably apply to Java's System.nanoTime().

------------

[1] http://www.principiaprogramatica.com/?p=16

[2] http://lwn.net/Articles/209101/

[3] http://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks

[4] http://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf

[5] http://en.wikipedia.org/wiki/Time_Stamp_Counter

[6] http://software.intel.com/en-us/forums/showthread.php?t=52482

Written by atp

Sunday 12 June 2011 at 2:24 pm

Posted in Linux
