Atp's external memory

Time Stamp Counters

While investigating a different issue, I found myself looking again at accurate timing with the rdtsc instruction
to read the processor's Time Stamp Counter. I last blogged about this when looking at the jitter caused by system calls to the Linux kernel, and things have changed quite a lot. This post is the result of my trip around this particular block again.

There are plenty of blogs and resources out there that go into the details of the advantages and pitfalls
of using rdtsc, but few that bring everything together: rdtsc vs rdtscp, what the different tsc
processor flags mean, and how to convert (at least nominally) from clock ticks to wall clock time in nanoseconds.

The information here is drawn from other blogs, Stack Overflow, the Intel documentation and the Linux kernel source code. There are several Stack Overflow and Intel forum posts that refer to the painful trip through the documentation, but I've not been able to find a post or blog where the results or code are presented.

The kernel has to deal with a wider variety of hardware, so it takes a belt and braces approach to the conversion of tsc cycles to nanoseconds by calibrating against either the HPET or PIT timer. However, if you restrict yourself to recent server hardware, you can use the MSRs and read out the base clock ratio.

As always, the deeper you dig into the Intel architecture, the more confusing it gets. Nonetheless, there's no substitute for the Intel Software Developer's Manual (SDM) [4,5].

1) rdtsc/rdtscp.

The rdtsc and rdtscp instructions both read the CPU timestamp model specific register (MSR), and their presence is indicated by the rdtsc and rdtscp flags in /proc/cpuinfo respectively.

In processor terms rdtsc is indicated by bit 4 in the %EDX register for CPUID.01H, and rdtscp by bit 27 in %EDX of CPUID.80000001H. See [6] and tables 3-17 and 3-20 in the Intel Software Developer's Manual (SDM).

rdtscp differs from rdtsc in that it is partially serialising; rdtsc itself can be reordered around other instructions. So all those examples out there that
use code like

__asm__ __volatile__ ("cpuid\n\trdtsc" ::: "%eax", "%ebx", "%ecx", "%edx");

are using the cpuid as a serialising instruction. A white paper from Intel [3] uses it; however, I've also seen the lfence instruction used.

The rdtscp instruction however "waits until all previous instructions have been executed before reading the counter". In addition it returns data from the IA32_TSC_AUX MSR in an atomic operation. Linux stores socket and cpu numbers in that register, so that you can use rdtscp and detect if the cpu that the instruction ran on is actually the one you expected to run on. This turns out to be very handy later.

2) CPU capabilities.

The original implementation of rdtsc was actually not of much practical use. The counter could vary across processors, ticked at different rates as the core changed frequency, and would stop completely if the processor went into a sleep state. In short, next to useless for anything other than very specific timing uses on a single core.

These shortcomings have been progressively addressed by the following features. 

Linux /proc/cpuinfo flags;

  • constant_tsc (Intel calls this Constant TSC)
  • nonstop_tsc (Intel calls this Invariant TSC - CPUID.80000007H EDX bit 8).

For completeness there's also a deadline timer mode flag

  • tsc_deadline_timer (CPUID.01H ECX bit 24)

There are variant features, for example whether rdtsc can be synchronised by lfence or mfence, and whether the TSC really keeps running in S3 state or not (see arch/x86/include/asm/cpufeature.h for details).
However the above are the main ones.

There is much confusion about what those flags do. The internet wisdom that I've seen in many articles is that constant_tsc means the TSC is synchronised across cores, and invariant TSC means it won't be affected by frequency scaling or C or P states.

So, starting with the easy one.

Invariant TSC is set in the kernel as X86_FEATURE_NONSTOP_TSC. [7].

Section 17.14.1 of the SDM defines this as the "TSC running at a constant rate in all ACPI P, C and T states. Also, TSC reads are more efficient and do not incur the overhead of a ring transition (kernel syscall) or access to a platform resource (HPET, PIT)". So that's the one that you need to be sure that the thing is ticking properly no matter what.

However, the kernel source has a warning that ACPI sleep states will still cause problems by resetting the TSC even if the invariant tsc flag is present [8]. This is backed up by section 18.15.3 of the SDM which says "The time-stamp counter increments when the clock signal on the system bus is active and when the sleep pin is not asserted. The counter value can be read with the RDTSC instruction." It then goes on to say that the TSC and non-sleep clock ticks may not agree.

Luckily for me, I'm working with servers which get set to high performance modes. Laptop sleep modes are not something to worry about.

There is one interesting comment in the Linux source that says for nonstop_tsc that it is reliable across cores and sockets, but not cabinets [9]. It's been ages since I worked on a multi-cabinet computer. I'm guessing that's an Altix or something.

constant_tsc is another matter. The Linux kernel ([10], [11]) says that constant_tsc means the TSC is unchanged with P states, or ticks at a constant rate - whatever that means.

The Linux kernel seems to use constant_tsc mainly to stop frequency changes being applied to the tsc_khz clock. For example, time_cpufreq_notifier [12] just returns if we have constant_tsc.

The main lead we have that constant_tsc means that the individual cpus have synchronised clocks is the comment for the function unsynchronized_tsc() [13]. It states, "make a guess if the TSC is trustworthy and synchronised over all cpus", and returns 0 (indicating that the cpus are synchronised) if boot_cpu_has(X86_FEATURE_CONSTANT_TSC).

The Intel SDM is its usual cryptic self, and merely says that "Constant TSC behavior ensures that the duration
of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core
changes frequency". No direct mention of synchronisation across processors there. However, it does say that for Intel Xeon processors of family 0x06 (i.e. not NetBurst) the rate "may be set by max core-clock to bus-clock ratio, or by the max resolved frequency at which the processor is booted".

By inference then, if all the cores and sockets have the same core-clock to bus-clock ratio and the same max resolved frequency, and the tick rate is independent of processor frequency, then they're all likely to tick at the same rate. Hopefully that also means they're set to the same offset at the same time. Although I'm not sure what would happen in a hotplug cpu system. More research needed.

So, the net wisdom appears correct, although it would be nice to have that via a statement from an authoritative source, rather than by inference.

3) Converting ticks to nanoseconds.

CPU cycles or ticks are all very well, but nanoseconds would be more useful. Again this is where we descend into the world of processor specific behaviour. Assuming you have the constant/invariant TSC features above, there are broadly three approaches;

1) Use a timing loop off a deadline timer to determine the TSC frequency.

The timing loop is how the kernel does it, and how the SDM recommends you do it, at least in section 17.14. This works best if you have exclusive access to the machine to use either the ACPI or HPET timers (which are fed by an external clock source). The HPET timer spec [14] is rather abstract, but does define a minimum clock frequency. This is something you could do in userspace, but I'd be wary of the results. I'd expect them to occasionally be wrong.

2) Read out the core-clock to bus-ratio from the 0xCE MSR, make a guess about the bus clock and use that.

Reading out the core-clock to bus-ratio is actually the most accessible provided you have the right processor, and is what I'll cover here. The comment in the SDM above is bolstered by a comment on a forum post [15] that states "On Nehalem, the TSC runs at a constant frequency of MSR_PLATFORM_INFO[15:8] * 133.33MHz. MSR_PLATFORM_INFO[15:8] will report the lower of the ratio at which the part was stamped or a separate MSR to lower the ratio to provide TSC consistency across multi-socket systems with parts of different frequencies."

It's followed by another interesting sentence that lends weight to the point above that constant_tsc means all
clocks are synced;

"Synchronization of the TSC across multiple threads/cores/packages: As long as software does not write the TSC, the Nehalem TSC will remain synchronized across all threads, cores and packages connected to a single PCH."

3) If you have invariant TSC (nonstop_tsc), you can do some maths using values from CPUID.15H and the 'core crystal clock frequency'.

This last one, for me, founders on knowing the value known as the 'ART_Value' or 'core crystal clock frequency'. This is where my googling ran into the sand. It appears that this is known by convention/magic.

A patch from an Intel developer [16] defines it as being 24MHz, but it's not clear what processors that's for, and whether that's a generically useful number or specific to those processor signatures only.

Also in passing, he appears to read the crystal clock Hz from register ECX in CPUID.15H. Now, that field in table 3-17 of the SDM is marked as 'reserved' and 0. So, undocumented. Nice. The fact that there are comments on lkml in July 2015 [17] about how there's no way to query the nominal frequency of the ART leads me to believe that this approach is not going to be any more accurate than the 0xCE MSR one.

So, we'll leave this and concentrate on the MSR approach.

Overview of method

At a high level to convert from TSC to nanos, we'll be doing the following;

  • Getting the processor family and model type from CPUID.1H
  • Converting the processor family and model to a machine type (Nehalem, Westmere etc.)
  • Using that to define the base clock (BCLK) for the machine. 
  • Reading the MSR_PLATFORM_INFO MSR (0xCE) to get the non turbo clock multiplier 
  • Using the multiplier and base clock to derive a scale factor we can use to convert cycles to nanos.
  • Reading the timestamp counter with rdtscp and finding our nanosecond value.


Where this goes wrong is when the main board manufacturer is diddling with the base clock. On the surface it seemed possible we could get that from CPUID.16H, if that cpuid leaf is supported. However, looking at that leaf in the SDM (Table 3-17 again) there's a big fat warning that says that this is just marketing information: "Data is returned from this interface in accordance with the processor's specification and does not reflect actual values." and "The returned information should not be used for any other purpose as the returned information does not accurately correlate to information / counters returned by other processor interfaces."

So that's fairly useless then. Instead we have to go with hardwired magic numbers scoured from Section 35 of the SDM for each processor family. If your motherboard base clock is overclocked, then your numbers are all wrong, and you have no choice but to fall back to a timing loop.

If only there was a register or cpuid leaf or MSR that gave you the actual base clock. Or indeed the ART clock for that matter. If there is, I've been unable to find it... Email me if you know how.

Getting processor family and model.

static __inline__ uint32_t get_intel_family_model()
{
  uint32_t model_register, model, family;
  int eax = 1;

  __asm__ __volatile__ ("cpuid" : "=a" (model_register) : "a" (eax) : "%ebx", "%ecx", "%edx");
  model = ((model_register & 0xff) >> 4) | ((model_register & 0xf0000) >> 12);
  family = ((model_register & 0xf00) >> 8) | ((model_register & 0xff00000) >> 16);
  return (family << 16) | model;
}
We're after leaf 1 (in Intel parlance CPUID.01H), which we place in the input register EAX. We decode the response according to the instructions into a model and a family, and repack the family and model number into a single unsigned 32 bit integer, with the top 16 bit word for the family (in practice 0x06 or 0x0f). See [1] for more details. We could just return the model_register, but doing it this way makes it easy to read
the output and compare with /proc/cpuinfo.

This returns us a number like 0x6001f;

Convert to machine type and getting the base clock

From trawling through the SDM we get the following mappings;

enum {
  TOO_OLD = 0,
  NEHALEM,
  WESTMERE,
  SANDYBRIDGE,
  IVYBRIDGE,
  HASWELL,
  BROADWELL,
  SKYLAKE,
  PHI
} processor_enum;

struct processor_type_s {
   char *name;
   unsigned int base_clock_khz;
};

struct processor_type_s tsc_processor_types[] = {
 { "Too Old/Unknown", 100000u },
 { "Nehalem", 133330u },
 { "Westmere", 133330u },
 { "Sandybridge", 100000u },
 { "Ivybridge", 100000u },
 { "Haswell", 100000u },
 { "Broadwell", 100000u },
 { "Skylake", 100000u },
 { "Xeon Phi", 100000u }
};

int get_processor_type(uint32_t family_model)
{
  switch (family_model) {
    // nehalem - Section 35.5 Vol 3c
    case 0x6001a:
    case 0x6001e:
    case 0x6001f:
    case 0x6002e:
        return NEHALEM;
    // westmere - section 35.6
    case 0x60025:
    case 0x6002c:
    case 0x6002f:
        return WESTMERE;
    // sandy bridge - section 35.8
    case 0x6002a:
    case 0x6002d:
        return SANDYBRIDGE;
    case 0x6003a:
    case 0x6003e:
        return IVYBRIDGE;
    case 0x6003c:
    case 0x6003f:
    case 0x60045:
    case 0x60046:
        return HASWELL;
    case 0x6003d:
    case 0x60047:
    case 0x6004f:
    case 0x60056:
        return BROADWELL;
    case 0x6004e:
    case 0x6005e:
        return SKYLAKE;
    case 0x60057:
        return PHI;
    default:
        return TOO_OLD;
  }
}

 processor_type = get_processor_type(get_intel_family_model());
 base_clock = tsc_processor_types[processor_type].base_clock_khz;
 printf ("This is a %s processor with a base clock of %ukhz\n",
         tsc_processor_types[processor_type].name, base_clock);

The downside of this is that we need to maintain the list as new models come out. The upside is that Westmere is
getting pretty old, so soon all this code can effectively be eliminated in favour of a

base_clock = 100000u;

Unless of course, Intel changes the base clock in newer models.

Get the non turbo mode ratio.

This is the one thing that needs root;

// read msrs - stolen from cpupower helpers
int read_msr(int cpu, unsigned int idx, uint64_t *val)
{
  int fd;
  char msr_file_name[64];

  sprintf(msr_file_name, "/dev/cpu/%d/msr", cpu);
  fd = open(msr_file_name, O_RDONLY);
  if (fd < 0) return -1;
  if (lseek(fd, idx, SEEK_CUR) == -1) goto err;
  if (read(fd, val, sizeof *val) != sizeof *val) goto err;
  close(fd);
  return 0;
err:
  close(fd);
  return -1;
}

Used like this.

// 0xce is the intel MSR_PLATFORM_INFO.

if (read_msr(cpu, 0xce, &platform_info) == -1) {
 fprintf(stderr, "error reading MSR_PLATFORM_INFO - are you root?\n");
 return -1;
}
non_turbo_ratio = (platform_info & 0xff00) >> 8;
tsc_khz = non_turbo_ratio * tsc_processor_types[processor_type].base_clock_khz;

Why do we get the kHz rather than the MHz? It helps with the scaling factor.

// See the comment in arch/x86/kernel/tsc.c for the maths behind this.
uint32_t get_cycles_to_nsec_scale(unsigned int tsc_frequency_khz)
{
   return (uint32_t)((1000000 << 10) / tsc_frequency_khz);
}

uint64_t cycles_to_nsec(uint64_t cycles, uint32_t scale_factor)
{
   return (cycles * scale_factor) >> 10;
}

Read the TSC and convert to nanos

Here's our rdtscp routine. It's possible there's also a __rdtscp() intrinsic in gcc like there is for __rdtsc(); I didn't check. However, this one allows you to detect if the TSC was read on a different cpu to the one you expected. As noted above, constant_tsc _should_ mean things are all the same across the sockets and cores, but there may be circumstances when it isn't: for example sleep modes, possibly during a TSC reset if you don't have invariant TSC, or if someone has been fiddling with the MSR to adjust the TSC.

uint64_t rdtscp(uint32_t expected_cpu)
{
  uint32_t lo, hi, cpuid;
  uint32_t core, socket;

  __asm__ __volatile__ ("rdtscp" : "=a" (lo), "=d" (hi), "=c" (cpuid)::);
  socket = (cpuid & 0xfff000) >> 12;  // socket number, unused here
  (void)socket;
  core = cpuid & 0xfff;
  if (core != expected_cpu) return (uint64_t)-1;
  return ((uint64_t)hi << 32) | lo;
}

Which are used together like this;

// simple timing exercise to see if we're close to reality
 start_timestamp = rdtscp(cpu);
 nanosleep(&sleep_time, NULL);  // sleep_time requests a 500000000 nano sleep
 end_timestamp = rdtscp(cpu);
// The difference in timestamp cycles, converted to nanoseconds via the scale factor.
 cycles = end_timestamp - start_timestamp;
 nanos = (unsigned long long)cycles_to_nsec(cycles, cycles_nsec_scale);

Output on a variety of machines.

Nehalem (from our department of antiquities)

Detected processor with family/model of 6001a
This is a Nehalem processor with a base clock of 133330khz
Invariant TSC runs at 2533270 kHz, scale factor 404
Expected to sleep for 500000000 nanos, actually slept for 1267058865
cycles, 499894317 nanos

Westmere (from my venerable desktop)

Detected processor with family/model of 6002c
This is a Westmere processor with a base clock of 133330khz
Invariant TSC runs at 2399940 kHz, scale factor 426
Expected to sleep for 500000000 nanos, actually slept for 1197124827
cycles, 498022633 nanos

Ivy Bridge

Detected processor with family/model of 6003e
This is a Ivybridge processor with a base clock of 100000khz
Invariant TSC runs at 2600000 kHz, scale factor 393
Expected to sleep for 500000000 nanos, actually slept for 1300123672
cycles, 498973245 nanos


Haswell

Detected processor with family/model of 6003f
This is a Haswell processor with a base clock of 100000khz
Invariant TSC runs at 2500000 kHz, scale factor 409
Expected to sleep for 500000000 nanos, actually slept for 1250141184
cycles, 499323969 nanos


It's getting better, but from userspace there's still way too much inference, guesswork and magic numbers
required to get a decent cycles to nanoseconds conversion.

At least you can now get the TSC with a single serialising instruction that has the nice side effect of returning the cpu, and the TSC will carry on ticking at the same rate in most situations. All we need is that magic Always Running Timer 'core crystal clock frequency' in a nice cpuid leaf or MSR somewhere, and the story will be complete.

For now, on the server processors, the MSR method seems to work. 

Full code for the test app is available here;



Written by atp

Sunday 25 October 2015 at 10:22 pm

Posted in Linux
