Croaking Kero

High Precision Timing with Win32 in C

This tutorial is also available as a video. In this tutorial I'll show you how to perform at least microsecond-precision timing on Windows in C. I'll also show the two main applications for such timing, which are measuring code performance and rate limiting, like limiting your game's render loop to 60Hz.

Note: I'm compiling this code with GCC.

Note: Click any of the hyperlinked words to visit the MSDN documentation page for them.

just_timing.c

#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>

LARGE_INTEGER frequency, a, b;
float elapsed_seconds;

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd)
{
    QueryPerformanceFrequency(&frequency);
    printf("Frequency: %lld ticks per second.\n", frequency.QuadPart);

    QueryPerformanceCounter(&a);
    printf("A: %lld\n", a.QuadPart);
    QueryPerformanceCounter(&b);
    printf("B: %lld\n", b.QuadPart);

    elapsed_seconds = (float)(b.QuadPart - a.QuadPart) / frequency.QuadPart;
    printf("Elapsed time between A and B: %fs\n", elapsed_seconds);

    return 0;
}

zen_timer.h

#pragma once
#include <windows.h>
#include <stdint.h>

typedef struct
{
    LARGE_INTEGER start, end;
} zen_timer_t;

static LARGE_INTEGER zen_ticks_per_second = {.QuadPart = 1};
static int64_t zen_ticks_per_microsecond = 1;

// You MUST call ZenTimer_Init() to use ZenTimer, otherwise the tick rate will be set at 1 and you'll get garbage.
static inline void ZenTimer_Init()
{
    QueryPerformanceFrequency(&zen_ticks_per_second);
    zen_ticks_per_microsecond = zen_ticks_per_second.QuadPart / 1000000;
}

static inline zen_timer_t ZenTimer_Start()
{
    zen_timer_t timer;
    QueryPerformanceCounter(&timer.start);
    return timer;
}

// Returns time in microseconds
static inline int64_t ZenTimer_End(zen_timer_t *timer)
{
    QueryPerformanceCounter(&timer->end);
    return (timer->end.QuadPart - timer->start.QuadPart) / zen_ticks_per_microsecond;
}

using_zen_timer.c

#include "zen_timer.h"
#include <stdio.h>
#include <stdlib.h> // rand()

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd)
{
    ZenTimer_Init();

    zen_timer_t timer = ZenTimer_Start();
    for(int i = 0; i < 1000; ++i)
    {
        rand();
    }
    int64_t time = ZenTimer_End(&timer);

    printf("1000 rand()s took: %lldus\n", time);
    return 0;
}

rate_limiting.c

#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <mmsystem.h>
#include <stdio.h>
#include <stdint.h>
#include <conio.h>

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd)
{
    LARGE_INTEGER ticks_per_second, start, current;
    int64_t ticks_per_loop, ticks_per_millisecond;
    unsigned int loop_count = 0;

    timeBeginPeriod(1);
    QueryPerformanceFrequency(&ticks_per_second);
    ticks_per_millisecond = ticks_per_second.QuadPart / 1000;
    ticks_per_loop = ticks_per_second.QuadPart / 5;
    QueryPerformanceCounter(&start);

    while(!kbhit())
    {
        printf("%u ", ++loop_count);

        QueryPerformanceCounter(&current);
        static int64_t sleep_time;
        sleep_time = (start.QuadPart + ticks_per_loop - current.QuadPart) / ticks_per_millisecond - 2;
        printf("Sleeping: %lldms\n", sleep_time);
        if(sleep_time > 0) // Sleep takes an unsigned DWORD, so a negative value would wrap around (-1 becomes INFINITE)
            Sleep(sleep_time);

        do
        {
            Sleep(0);
            QueryPerformanceCounter(&current);
        } while(current.QuadPart < start.QuadPart + ticks_per_loop);

        start.QuadPart += ticks_per_loop;
        if(current.QuadPart - start.QuadPart > ticks_per_loop)
            start = current;
    }

    timeEndPeriod(1);
    return 0;
}

build.bat

gcc -o just_timing.exe just_timing.c
gcc -o using_zen_timer.exe using_zen_timer.c
gcc -o rate_limiting.exe rate_limiting.c -lwinmm

Code Walkthrough

Modern CPUs have an internal counter which increments at a fixed rate. All we need to do is find out how many times that counter increments per second; then we can read the counter's value at any time and divide by the tick rate to get time measurements. The two functions we'll use are QueryPerformanceFrequency and QueryPerformanceCounter, from windows.h. Both functions use Windows' LARGE_INTEGER type, which is a union used here as a 64-bit signed integer. To extract the int64 from the union we access the QuadPart member.

just_timing.c

#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>

LARGE_INTEGER frequency, a, b;
float elapsed_seconds;

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd)
{
    QueryPerformanceFrequency(&frequency);
    printf("Frequency: %lld ticks per second.\n", frequency.QuadPart);

    QueryPerformanceCounter(&a);
    printf("A: %lld\n", a.QuadPart);
    QueryPerformanceCounter(&b);
    printf("B: %lld\n", b.QuadPart);

    elapsed_seconds = (float)(b.QuadPart - a.QuadPart) / frequency.QuadPart;
    printf("Elapsed time between A and B: %fs\n", elapsed_seconds);

    return 0;
}

We call QueryPerformanceFrequency, passing a pointer to our frequency variable to be filled, and print the result. Next we call QueryPerformanceCounter a couple of times and print the values. To demonstrate calculating an elapsed time between two QPC calls, here I subtract a from b and divide by frequency to calculate the elapsed seconds. That's all you need for timing, and it doesn't take much to turn this into a code-performance measuring tool. Here's what I've come up with: Zen Timer; get the reference?
zen_timer.h

#pragma once
#include <windows.h>
#include <stdint.h>

typedef struct
{
    LARGE_INTEGER start, end;
} zen_timer_t;

static LARGE_INTEGER zen_ticks_per_second = {.QuadPart = 1};
static int64_t zen_ticks_per_microsecond = 1;

// You MUST call ZenTimer_Init() to use ZenTimer, otherwise the tick rate will be set at 1 and you'll get garbage.
static inline void ZenTimer_Init()
{
    QueryPerformanceFrequency(&zen_ticks_per_second);
    zen_ticks_per_microsecond = zen_ticks_per_second.QuadPart / 1000000;
}

static inline zen_timer_t ZenTimer_Start()
{
    zen_timer_t timer;
    QueryPerformanceCounter(&timer.start);
    return timer;
}

// Returns time in microseconds
static inline int64_t ZenTimer_End(zen_timer_t *timer)
{
    QueryPerformanceCounter(&timer->end);
    return (timer->end.QuadPart - timer->start.QuadPart) / zen_ticks_per_microsecond;
}

The Init function gets the tick frequency, Start creates a new timer structure and gets the starting tick count, and End takes an existing timer, gets the ending tick count and returns the real time in microseconds between start and end.

using_zen_timer.c

#include "zen_timer.h"
#include <stdio.h>
#include <stdlib.h> // rand()

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd)
{
    ZenTimer_Init();

    zen_timer_t timer = ZenTimer_Start();
    for(int i = 0; i < 1000; ++i)
    {
        rand();
    }
    int64_t time = ZenTimer_End(&timer);

    printf("1000 rand()s took: %lldus\n", time);
    return 0;
}

Here's an example usage of Zen Timer. We include zen_timer.h, initialize ZenTimer, start a new timer and store it locally, do something we want to time, then end the timer, recording the returned time locally. Lastly we print the time. Zen Timer can easily be inserted into whatever existing Windows code you have to measure code performance.

Now on to rate limiting. What we want to do is set a target amount of time for each update, render or whatever, then time how long the iteration takes and wait until we've reached the target amount of time.
Here's a simple program demonstrating rate limiting.

LARGE_INTEGER ticks_per_second, start, current;
int64_t ticks_per_loop;
unsigned int loop_count = 0;

QueryPerformanceFrequency(&ticks_per_second);
ticks_per_loop = ticks_per_second.QuadPart / 5;
QueryPerformanceCounter(&start);

Since our timing functions use ticks, we calculate the number of ticks per loop by taking the number of ticks per second and dividing by the number of loops we want to run per second, in this case 5. We get a starting tick count, then proceed through our loop.

while(!kbhit())
{
    printf("%u ", ++loop_count);
    do
    {
        QueryPerformanceCounter(&current);
    } while(current.QuadPart - start.QuadPart < ticks_per_loop);
    start = current;
}

In this case our loop just prints the loop count, which, without a rate limit, could easily run thousands of times per second. So we run the inner while loop, checking the current time, until the difference between start and current has exceeded the target number of ticks per loop, then set the new starting ticks to the current ticks. This delay loop gets the job done, but it will exceed the target amount of time by a tiny bit and constantly uses the CPU while waiting.

start.QuadPart += ticks_per_loop;

We can solve the overshoot by adding the ticks per loop to start instead of setting start to current. This way, even if one wait loop overshoots by a bit, that amount will be compensated for in the next loop. However, if our thread gets locked for, let's say, a full second, each subsequent loop won't wait at all until we've added the ticks-per-loop amount to the start variable enough times to catch up to all the time that's passed.

if(current.QuadPart - start.QuadPart > ticks_per_loop)
    start = current;

To fix that, we can cap the difference between start and current to one loop's worth of ticks, so that small overshoots in our loop and wait runtime are compensated for, while large overshoots like a temporary thread freeze act as if the program was paused for that period.
You can tune these values depending on the performance and requirements of your program.

Instead of a spin-lock, we'd rather put the thread to sleep and let Windows wake it back up when the time has elapsed, so that we can let the CPU cool off between loops instead of melting the thing. The best Windows gives us is the Sleep function, which takes a time in milliseconds to sleep for. By default, Sleep is actually only accurate to around 15ms intervals, so calling Sleep(1) will overshoot by a wide margin. We can increase the precision using timeBeginPeriod, from mmsystem.h. We now also need to link to winmm.

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd)
{
    LARGE_INTEGER ticks_per_second, start, current;
    int64_t ticks_per_loop, ticks_per_millisecond;
    unsigned int loop_count = 0;

    timeBeginPeriod(1);
    QueryPerformanceFrequency(&ticks_per_second);
    ticks_per_millisecond = ticks_per_second.QuadPart / 1000;
    ticks_per_loop = ticks_per_second.QuadPart / 5;
    QueryPerformanceCounter(&start);

    while(!kbhit())
    {
        printf("%u ", ++loop_count);

        QueryPerformanceCounter(&current);
        static int64_t sleep_time;
        sleep_time = (start.QuadPart + ticks_per_loop - current.QuadPart) / ticks_per_millisecond;
        printf("Sleeping: %lldms\n", sleep_time);
        if(sleep_time > 0) // Sleep takes an unsigned DWORD, so a negative value would wrap around (-1 becomes INFINITE)
            Sleep(sleep_time);

        start.QuadPart += ticks_per_loop;
        if(current.QuadPart - start.QuadPart > ticks_per_loop)
            start = current;
    }

    timeEndPeriod(1);
    return 0;
}

timeBeginPeriod sets the timer resolution to a number of milliseconds, so we set it to one, since that's the lowest. Until Windows 10 version 2004, this set the system-wide timer resolution and therefore increased CPU and power usage even after closing your program, so be sure to call timeEndPeriod with the same value at the end of the program. Since that Windows version this is handled per-process and is no longer a problem for most users.
Now we can Sleep with millisecond precision, though it's still inaccurate, often missing by 1 millisecond and occasionally by more than that. If you want the most accuracy, you can Sleep for a few milliseconds less than needed, then use a spin-lock for the rest; if you want the least CPU usage, you can Sleep to the nearest millisecond and let the 0-3 milliseconds of inaccuracy get absorbed into the next loop/Sleep cycle. How you choose to tune that depends on your program's requirements. One extra optimization: calling Sleep(0) is a kind of hidden feature which sleeps for the remainder of the current "time-slice". There's no guarantee how long that will be, but in my testing it's around 30us. By using this in a spin-lock you can reduce your CPU usage without losing any meaningful amount of precision. You can see these extra optimizations implemented in rate_limiting.c, above the Code Walkthrough. That's how to do precise timing, performance measuring and loop frequency limiting in C on Windows, and this one really was a blast to make. Soon I'll be starting my series of tutorials on how to make a small game in C.

Addendum

Following the release of this tutorial I was provided information regarding some undocumented code which can be used to either increase the accuracy of Sleep() or replace it with something more accurate. Thanks to Reddit users Sunius, skeeto and ack_error for providing the initial information and links.

timeBeginPeriod allows us to request a minimum of 1ms for the timer Windows uses to wake our thread. The very old and undocumented function NtSetTimerResolution allows us to request a timer interval with a precision of 100ns! We can request any interval, but as far as I have determined, modern systems with modern Windows are only able to set a minimum timer resolution of 496us, still about half the 1ms minimum of timeBeginPeriod. We'd like it to be a little lower than that, so I'm requesting 100us, but it'll round up to 496us unless a future version of Windows and/or hardware supports lower. NtSetTimerResolution is not exposed in any Windows headers, so we have to declare it ourselves, and the linker will find it in ntdll.
Here's some sample code using this function:

NtSetTimerResolution.c

#include <Windows.h>
#include <stdio.h>
#include <mmsystem.h>
#include "zen_timer.h"

extern NTSYSAPI NTSTATUS NTAPI NtSetTimerResolution(ULONG DesiredResolution, BOOLEAN SetResolution, PULONG CurrentResolution);

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd)
{
    static zen_timer_t timer;
    static long long slept;
    static unsigned long long avg = 0, min = -1, max = 0;
    static unsigned long time; // declared outside the #if so the cleanup call below always compiles

#if 0
    timeBeginPeriod(1);
#else
    NtSetTimerResolution(1000, TRUE, &time); // 1000 * 100ns = 100us requested
    printf("Timer resolution: %ldns\n", time * 100);
#endif

    ZenTimer_Init();

#define NUM_LOOPS 10000
    for(int i = 0; i < NUM_LOOPS; ++i)
    {
        timer = ZenTimer_Start();
        Sleep(1);
        slept = ZenTimer_End(&timer);
        avg += slept;
        if(slept < min) min = slept;
        if(slept > max) max = slept;
    }
    printf("Avg: %llu\nMin: %llu\nMax: %llu\n", avg / NUM_LOOPS, min, max);

    NtSetTimerResolution(0, FALSE, &time);
    return 0;
}

build-nt.bat

gcc -o nt.exe NtSetTimerResolution.c -lntdll -lwinmm

I've hybridized the above code to measure the difference between using timeBeginPeriod and NtSetTimerResolution. This code sets the thread timer either way, then, 10,000 times, records how long a call to Sleep(1) takes. At the end we also call NtSetTimerResolution, passing FALSE as the second argument to remove our requested time, just like calling timeEndPeriod. I find that the maximum occasionally has an outlier result, but the minimum and average are consistent between runs. Here are the results from running the above code (times in microseconds):

timeBeginPeriod:
Avg: 1887
Min: 1010
Max: 2689

NtSetTimerResolution:
Avg: 1249
Min: 1009
Max: 1929

As you can see from my results, and from running the code yourself, NtSetTimerResolution brings the average Sleep time much closer to the requested 1ms, and brings the maximum down from more than two and a half milliseconds to less than 2ms!
These are great improvements and, in my opinion, well worth the small cost of declaring the function ourselves and linking to ntdll. There are still question marks about this function, though: does it have an impact on other programs? What about overall system performance and battery life? For now it seems a safe bet to assume that it functions similarly to timeBeginPeriod within the Windows black box, but I invite anyone with reliable information about this function to share it.

There is a function which can entirely replace Sleep(): WaitForSingleObject(). This requires creating a "waitable timer" object with CreateWaitableTimerEx(), specifically using the undocumented option CREATE_WAITABLE_TIMER_HIGH_RESOLUTION. The biggest caveat with this combo is that CREATE_WAITABLE_TIMER_HIGH_RESOLUTION is, as far as I have been able to discover, not available when using the GCC/MinGW toolchain. I'm not sure of the details of how MinGW bundles/implements the Windows libraries, so if anyone has details on why this feature is available from Microsoft but not MinGW, I'm all ears.
In any case, here's the code using waitable timer objects:

waitable-timer.c

#include <Windows.h>
#include "zen_timer.h"
#include <stdio.h>

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd)
{
    if(!AttachConsole(ATTACH_PARENT_PROCESS))
    {
        return -1;
    }
    freopen("CONOUT$", "w", stdout);
    printf("\n");

    static zen_timer_t timer;
    static long long slept, num_overslept;
    static unsigned long long avg = 0, min = -1, max = 0;
    static unsigned long time;

    ZenTimer_Init();

    HANDLE waitable_timer = CreateWaitableTimerEx(NULL, NULL, CREATE_WAITABLE_TIMER_HIGH_RESOLUTION, TIMER_ALL_ACCESS);
    LARGE_INTEGER wait_time;
    wait_time.QuadPart = -5000; // negative = relative time, in 100ns units: 500us

#define NUM_LOOPS 10000
    for(int i = 0; i < NUM_LOOPS; ++i)
    {
        timer = ZenTimer_Start();
        SetWaitableTimer(waitable_timer, &wait_time, 0, 0, 0, 0);
        WaitForSingleObject(waitable_timer, INFINITE);
        slept = ZenTimer_End(&timer);
        avg += slept;
        if(slept < min) min = slept;
        if(slept > max) max = slept;
    }
    printf("Avg: %llu\nMin: %llu\nMax: %llu\n", avg / NUM_LOOPS, min, max);
    return 0;
}

build-wt.bat

cl waitable-timer.c

We create the waitable timer object, passing CREATE_WAITABLE_TIMER_HIGH_RESOLUTION. We store the amount of time we want to wait in hundreds of nanoseconds. A positive value is an absolute time and a negative value is a relative time, so -5000 represents 500us after calling SetWaitableTimer(). Then we call WaitForSingleObject(), specifying an infinite timeout, since the timer's own expiry is what ends the wait; you might want to choose something smaller than infinity when using this function, just in case of an error. The "AttachConsole [...] freopen" stuff just enables the program to output to a console again, since Windows-subsystem applications built with cl don't associate with a console window by default. Running this code, the following statistics are normal (times in microseconds):

Avg: 701
Min: 504
Max: 1405

We were able to request a sub-millisecond wait time, and the minimum and average are actually pretty close to the requested time!
The max is half a millisecond better than before as well, though it still suffers from the occasional outlier, as with Sleep(). Note that requesting less than 500us, in my testing, still results in the same values as above, so that seems to be the minimum. This may be related to the minimum supported by NtSetTimerResolution. Also note that neither timeBeginPeriod nor NtSetTimerResolution needs to be called when using waitable timer objects. That about covers this undocumented functionality. If you have further information about this stuff, with reliable sources, please send me an e-mail with the link below. Cheers.
If you've got questions about any of the code, feel free to e-mail me or comment on the YouTube video. I'll try to answer them, or someone else might come along and help you out. If you've got any extra tips about how this code can be better, or just more useful info about the code, let me know so I can update the tutorial. Thanks to Evgeny Borodin for e-mailing me to correct some errors in this tutorial. Thanks to Froggie717 for criticisms and correcting errors in this tutorial. Cheers.