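Linux Time Management: clocksource
The previous post covered the time-related hardware available on Linux: TSC, PIT, HPET, and ACPI_PM. Each of these devices ticks at a fixed frequency, and that is what lets us keep time. To manage this hardware uniformly, Linux abstracts it as a clocksource.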
- struct clocksource {
- /*
- * Hotpath data, fits in a single cache line when the
- * clocksource itself is cacheline aligned.
- */
- cycle_t (*read)(struct clocksource *cs);
- cycle_t cycle_last;
- cycle_t mask;
- u32 mult;
- u32 shift;
- u64 max_idle_ns;
- u32 maxadj;
- #ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
- struct arch_clocksource_data archdata;
- #endif
- const char *name;
- struct list_head list;
- int rating;
- int (*enable)(struct clocksource *cs);
- void (*disable)(struct clocksource *cs);
- unsigned long flags;
- void (*suspend)(struct clocksource *cs);
- void (*resume)(struct clocksource *cs);
- /* private: */
- #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
- /* Watchdog related data, used by the framework */
- struct list_head wd_list;
- cycle_t cs_last;
- cycle_t wd_last;
- #endif
- } ____cacheline_aligned;
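Of these fields, the important ones are rating, shift, and mult. The rating was covered in the previous post:
- 1--99: not suitable as a real clock source; only for boot or testing;
- 100--199: basically usable as a real clock source, but not recommended;
- 200--299: good precision; usable as a real clock source;
- 300--399: very good; an accurate clock source;
- 400--499: the ideal clock source; must be selected whenever it is available;
We have already seen most of the following: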
- include/linux/acpi_pmtmr.h
- ------------------------------------------
- #define PMTMR_TICKS_PER_SEC 3579545
- drivers/clocksource/acpi_pm.c
- ---------------------------------------------
- static struct clocksource clocksource_acpi_pm = {
- .name = "acpi_pm",
- .rating = 200,
- .read = acpi_pm_read,
- .mask = (cycle_t)ACPI_PM_MASK,
- .mult = 0, /*to be calculated*/
- .shift = 22,
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
- };
- dmesg output
- ------------------------
- [ 0.664201] hpet0: 8 comparators, 64-bit 14.318180 MHz counter
- arch/x86/kernel/hpet.c
- --------------------------------
- static struct clocksource clocksource_hpet = {
- .name = "hpet",
- .rating = 250,
- .read = read_hpet,
- .mask = HPET_MASK,
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
- .resume = hpet_resume_counter,
- #ifdef CONFIG_X86_64
- .archdata = { .vclock_mode = VCLOCK_HPET },
- #endif
- };
- dmesg output:
- -----------------------------
- [ 0.004000] Detected 2127.727 MHz processor.
- arch/x86/kernel/tsc.c
- --------------------------------------
- static struct clocksource clocksource_tsc = {
- .name = "tsc",
- .rating = 300,
- .read = read_tsc,
- .resume = resume_tsc,
- .mask = CLOCKSOURCE_MASK(64),
- .flags = CLOCK_SOURCE_IS_CONTINUOUS |
- CLOCK_SOURCE_MUST_VERIFY,
- #ifdef CONFIG_X86_64
- .archdata = { .vclock_mode = VCLOCK_TSC },
- #endif
- };
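As shown above, acpi_pm, hpet, and tsc have ratings of 200, 250, and 300. The ratings roughly track the frequencies (3.579545 MHz, 14.318180 MHz, and 2127.727 MHz respectively). The TSC, at 2127.727 MHz, beats the rest, gets the highest rating of 300, and is selected as current_clocksource: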
- root@manu:~# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
- tsc hpet acpi_pm
- root@manu:~# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
- tsc
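Besides rating, there are two more fields, shift and mult. What are they for?
Think about how you would keep time if you were handed a piece of hardware that ticks at a fixed frequency. Say the hardware runs at 1000 Hz and the counter currently reads 3500. Some time later I glance at the counter again and it reads 5500. 2000 cycles have gone by, so I know that 2000/1000 = 2 seconds have elapsed.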
- times_elapse = cycles_interval / frequency
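The formula can be sketched in a few lines of C. This is only an illustration, not kernel code: the function names are made up for this post, and masking the difference (so that a narrow counter wrapping around once still gives the right delta) is an assumption based on the mask field of struct clocksource.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles elapsed between two reads of a counter that is `mask` wide.
 * Masking the difference keeps the result right across one wrap-around.
 * (Hypothetical helper, named for this example only.) */
static uint64_t cycles_delta(uint64_t now, uint64_t last, uint64_t mask)
{
    return (now - last) & mask;
}

/* Whole seconds elapsed: cycles_interval / frequency. */
static uint64_t elapsed_seconds(uint64_t now, uint64_t last,
                                uint64_t mask, uint64_t freq_hz)
{
    assert(freq_hz != 0);
    return cycles_delta(now, last, mask) / freq_hz;
}
```

With the numbers from the example, elapsed_seconds(5500, 3500, mask, 1000) is 2 for any mask wide enough to hold the counter values.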
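Of course, "glancing at the counter" in the example above is hand-waving: reading the current count means talking to the hardware. struct clocksource has a member called read, a function that each clock source supplies when it registers. If you want its current count, you call this read function. Take the TSC as an example: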
- static struct clocksource clocksource_tsc = {
- .name = "tsc",
- .rating = 300,
- .read = read_tsc,
- .resume = resume_tsc,
- .mask = CLOCKSOURCE_MASK(64),
- .flags = CLOCK_SOURCE_IS_CONTINUOUS |
- CLOCK_SOURCE_MUST_VERIFY,
- #ifdef CONFIG_X86_64
- .archdata = { .vclock_mode = VCLOCK_TSC },
- #endif
- };
- /*--------- arch/x86/kernel/tsc.c -------------------*/
- static cycle_t read_tsc(struct clocksource *cs)
- {
- cycle_t ret = (cycle_t)get_cycles();
- return ret >= clocksource_tsc.cycle_last ?
- ret : clocksource_tsc.cycle_last;
- }
- /*------- arch/x86/include/asm/tsc.h----------------------*/
- static inline cycles_t get_cycles(void)
- {
- unsigned long long ret = 0;
- #ifndef CONFIG_X86_TSC
- if (!cpu_has_tsc)
- return 0;
- #endif
- rdtscll(ret);
- return ret;
- }
- /*------arch/x86/include/asm/msr.h-----------------*/
- #define rdtscll(val) \
- ((val) = __native_read_tsc())
- static __always_inline unsigned long long __native_read_tsc(void)
- {
- DECLARE_ARGS(val, low, high);
- asm volatile("rdtsc" : EAX_EDX_RET(val, low, high));
- return EAX_EDX_VAL(val, low, high);
- }
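Following this chain, we can see that in the end it is the rdtsc instruction that fetches the current cycle count.
After that long detour through the read member, we can return to shift and mult. They exist to evaluate this formula: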
- times_elapse = cycles_interval / frequency
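As the formula shows, the frequency alone is enough to keep time. So why introduce shift and mult? Because division is awkward in the kernel, so it has to be converted into a multiplication and a shift. The kernel contains many examples of turning a division into a multiplication this way. The formula becomes: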
- times_elapse = cycles_interval * mult >> shift
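The kernel replaces the division with a multiply and a shift, and computes how many nanoseconds have elapsed from the cycle count. In other words, mult / 2^shift approximates 10^9 / frequency: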
- /**
- * clocksource_cyc2ns - converts clocksource cycles to nanoseconds
- * @cycles: cycles
- * @mult: cycle to nanosecond multiplier
- * @shift: cycle to nanosecond divisor (power of two)
- *
- * Converts cycles to nanoseconds, using the given mult and shift.
- *
- * XXX - This could use some mult_lxl_ll() asm optimization
- */
- static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
- {
- return ((u64) cycles * mult) >> shift;
- }
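To make the substitution concrete, here is a small user-space sketch that converts cycles to nanoseconds both ways. It is not kernel code: cyc2ns_div, calc_mult, and cyc2ns_mult_shift are names invented for this example, and calc_mult simply rounds (10^9 << shift) / frequency for a shift the caller picks. The test values assume the ACPI PM timer frequency of 3579545 Hz and the shift of 22 shown earlier.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Reference conversion: one 64-bit division per call.
 * Only valid while cycles * NSEC_PER_SEC fits in 64 bits. */
static uint64_t cyc2ns_div(uint64_t cycles, uint32_t freq_hz)
{
    assert(freq_hz != 0);
    return cycles * NSEC_PER_SEC / freq_hz;
}

/* Precompute mult once so that mult / 2^shift ~= NSEC_PER_SEC / freq_hz.
 * The single division happens here, not on every conversion. */
static uint32_t calc_mult(uint32_t freq_hz, uint32_t shift)
{
    assert(freq_hz != 0 && shift <= 32);
    uint64_t m = ((NSEC_PER_SEC << shift) + freq_hz / 2) / freq_hz; /* rounded */
    assert(m <= UINT32_MAX);
    return (uint32_t)m;
}

/* Same shape as clocksource_cyc2ns(): multiply, then shift. */
static uint64_t cyc2ns_mult_shift(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    return (cycles * mult) >> shift;
}
```

For one second's worth of cycles, the two conversions agree to within 1 ns in this setup; the difference comes only from rounding mult to an integer.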
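Purely in terms of precision, the larger mult is, the better. But the multiplication can overflow, so mult cannot grow without bound. The calculation involves a magic number, 600: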
- void __clocksource_updatefreq_scale(struct clocksource *cs, u32 scale, u32 freq)
- {
- u64 sec;
- /*
- * Calc the maximum number of seconds which we can run before
- * wrapping around. For clocksources which have a mask > 32bit
- * we need to limit the max sleep time to have a good
- * conversion precision. 10 minutes is still a reasonable
- * amount. That results in a shift value of 24 for a
- * clocksource with mask >= 40bit and f >= 4GHz. That maps to
- * ~ 0.06ppm granularity for NTP. We apply the same 12.5%
- * margin as we do in clocksource_max_deferment()
- */
- sec = (cs->mask - (cs->mask >> 3));
- do_div(sec, freq);
- do_div(sec, scale);
- if (!sec)
- sec = 1;
- else if (sec > 600 && cs->mask > UINT_MAX)
- sec = 600;
- clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
- NSEC_PER_SEC / scale, sec * scale);
- /*
- * for clocksources that have large mults, to avoid overflow.
- * Since mult may be adjusted by ntp, add an safety extra margin
- *
- */
- cs->maxadj = clocksource_max_adjustment(cs);
- while ((cs->mult + cs->maxadj < cs->mult)
- || (cs->mult - cs->maxadj > cs->mult)) {
- cs->mult >>= 1;
- cs->shift--;
- cs->maxadj = clocksource_max_adjustment(cs);
- }
- cs->max_idle_ns = clocksource_max_deferment(cs);
- }
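The 600 means 600 seconds: the assumption is that two consecutive reads of the counter are never more than 10 minutes apart. The main concern is that once the system goes idle, the time information is no longer updated; as long as the system leaves idle within 10 minutes, the clocksource can still convert the interval correctly. The final limit is not necessarily exactly 10 minutes, though. It is computed by clocksource_max_deferment and stored in max_idle_ns.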
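The trade-off between a large mult and the overflow limit can be estimated with a rough sketch. This is a deliberate simplification, not the kernel's clocksource_max_deferment: it ignores maxadj, the 12.5% margin, and the counter mask, and the helper names are invented here. The values in the usage note (mult = 7885042 for a 2127727000 Hz TSC) are the ones this post arrives at for that machine.

```c
#include <assert.h>
#include <stdint.h>

/* Largest cycle delta for which cycles * mult still fits in 64 bits.
 * (Hypothetical helper; the kernel's own bound is more conservative.) */
static uint64_t max_safe_cycles(uint32_t mult)
{
    assert(mult != 0);
    return UINT64_MAX / mult;
}

/* The same limit expressed in whole seconds of counter run time. */
static uint64_t max_safe_seconds(uint32_t mult, uint64_t freq_hz)
{
    assert(freq_hz != 0);
    return max_safe_cycles(mult) / freq_hz;
}
```

For mult = 7885042 and a 2127727000 Hz counter, this estimate lands comfortably above the 600-second target, so an idle period of up to 10 minutes does not overflow the 64-bit product.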
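What readers usually care about is how the values are calculated and how precise the result is. Honestly, I am not fond of this kind of calculation; for one reason or another the kernel tends to end up with painful-looking code here, and working out its intent takes a lot of time for little gain. So I will not explain the implementation. Instead, I will take the TSC as an example and evaluate the precision of the mult+shift scheme.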
- #include<stdio.h>
- #include<stdlib.h>
- typedef unsigned int u32;
- typedef unsigned long long u64;
- #define NSEC_PER_SEC 1000000000L
- void
- clocks_calc_mult_shift(u32 *mult, u32 *shift, u32 from, u32 to, u32 maxsec)
- {
- u64 tmp;
- u32 sft, sftacc= 32;
- /*
- * Calculate the shift factor which is limiting the conversion
- * range:
- */
- tmp = ((u64)maxsec * from) >> 32;
- while (tmp) {
- tmp >>=1;
- sftacc--;
- }
- /*
- * Find the conversion shift/mult pair which has the best
- * accuracy and fits the maxsec conversion range:
- */
- for (sft = 32; sft > 0; sft--) {
- tmp = (u64) to << sft;
- tmp += from / 2;
- /* user-space stand-in for the kernel's do_div(tmp, from) */
- tmp = tmp / from;
- if ((tmp >> sftacc) == 0)
- break;
- }
- *mult = tmp;
- *shift = sft;
- }
- int main()
- {
- u32 tsc_mult;
- u32 tsc_shift ;
- u32 tsc_frequency = 2127727000/1000; //TSC frequency(KHz)
- clocks_calc_mult_shift(&tsc_mult,&tsc_shift,tsc_frequency,NSEC_PER_SEC/1000,600*1000); // NSEC_PER_SEC/1000 because the TSC registers via clocksource_register_khz
- fprintf(stderr,"mult = %u shift = %u\n",tsc_mult,tsc_shift);
- return 0;
- }
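The 600 is the argument derived from the TSC clocksource's mask (the mask is wider than 32 bits, so sec is clamped to 600). If you are interested, work through it yourself. The result: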
- mult = 7885042 shift = 24
- root@manu:~/code/c/self/time# python
- Python 2.7.3 (default, Apr 10 2013, 05:46:21)
- [GCC 4.6.3] on linux2
- Type "help", "copyright", "credits" or "license" for more information.
- >>> (2127727000*7885042)>>24
- 1000000045L
- >>>
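We know the TSC frequency is 2127727000 Hz, so when the counter advances by 2127727000 cycles, 1 second has passed, that is, 10^9 ns. Our mult/shift arithmetic gives 1000000045 ns. How large is this error? It is 45 ns for every 10^9 ns, the same ratio as 45 seconds for every 10^9 seconds. Put another way, it takes about 257 days of running to accumulate 1 second of computational error. Given that NTP exists, this precision is acceptable.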
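The same arithmetic can be checked in C rather than in the Python prompt. This is a small sketch with helper names invented for this post; it uses the frequency, mult, and shift from the example above and does nothing beyond the calculation just described.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Signed error, in ns, that mult/shift makes over one real second.
 * freq_hz cycles correspond to exactly one second. */
static int64_t drift_ns_per_sec(uint64_t freq_hz, uint32_t mult, uint32_t shift)
{
    uint64_t ns = (freq_hz * mult) >> shift;
    return (int64_t)ns - NSEC_PER_SEC;
}

/* Whole days of uptime until the accumulated error reaches one second. */
static int64_t days_until_one_second(int64_t drift_ns)
{
    assert(drift_ns != 0);
    if (drift_ns < 0)
        drift_ns = -drift_ns;
    return NSEC_PER_SEC / drift_ns / 86400;
}
```

Calling drift_ns_per_sec(2127727000ULL, 7885042u, 24) gives 45, and days_until_one_second(45) gives 257, matching the figures in the text.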
Next come registration and the contest among the clocksources.
Each clocksource registers itself by calling clocksource_register_khz or clocksource_register_hz.
- HPET (arch/x86/kernel/hpet.c)
- ----------------------------------------
- hpet_enable
- |_____hpet_clocksource_register
- |_____clocksource_register_hz
- TSC (arch/x86/kernel/tsc.c)
- ----------------------------------------
- device_initcall(init_tsc_clocksource);
- init_tsc_clocksource
- |_____clocksource_register_khz
- ACPI_PM (drivers/clocksource/acpi_pm.c)
- -------------------------------------------
- fs_initcall(init_acpi_pm_clocksource);
- init_acpi_pm_clocksource
- |_____clocksource_register_hz
All of them end up calling __clocksource_register_scale.
- int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
- {
- /* Initialize mult/shift and max_idle_ns */
- __clocksource_updatefreq_scale(cs, scale, freq);
- /* Add clocksource to the clcoksource list */
- mutex_lock(&clocksource_mutex);
- clocksource_enqueue(cs);
- clocksource_enqueue_watchdog(cs);
- clocksource_select();
- mutex_unlock(&clocksource_mutex);
- return 0;
- }
The first function, __clocksource_updatefreq_scale, computes shift, mult, and max_idle_ns, as described earlier.
clocksource_enqueue links the clocksource into the global list, ordered by rating, with higher ratings first.
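Rating-ordered insertion is easy to sketch with a plain singly linked list. This is an illustration only: struct cs_node, cs_enqueue, and cs_best are hypothetical names, the kernel uses its own list_head machinery, and the real selection logic has more to consider than the rating alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for struct clocksource. */
struct cs_node {
    const char *name;
    int rating;
    struct cs_node *next;
};

/* Insert cs so the list stays sorted by rating, highest first.
 * An entry with an equal rating goes after the ones already queued. */
static void cs_enqueue(struct cs_node **head, struct cs_node *cs)
{
    assert(head != NULL && cs != NULL);
    struct cs_node **pos = head;
    while (*pos != NULL && (*pos)->rating >= cs->rating)
        pos = &(*pos)->next;
    cs->next = *pos;
    *pos = cs;
}

/* With a sorted list, the highest-rated clocksource is simply the head. */
static const struct cs_node *cs_best(const struct cs_node *head)
{
    return head;
}
```

Enqueuing acpi_pm (200), tsc (300), and hpet (250) in any order leaves the list as tsc, hpet, acpi_pm, which is the same order that available_clocksource printed earlier.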
clocksource_select picks the best clocksource and records it in the global variable curr_clocksource; it also notifies timekeeping. Switching to a better clocksource leaves a line in the kernel log:
- manu@manu:~$ dmesg|grep Switching
- [ 0.673002] Switching to clocksource hpet
- [ 1.720643] Switching to clocksource tsc
clocksource_enqueue_watchdog puts the clocksource on the watchdog list. As the name suggests, the watchdog monitors the clocksources on that list:
- #define WATCHDOG_INTERVAL (HZ >> 1)
- #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
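If, over a 0.5-second interval, a clocksource's reading deviates by more than 0.0625 s, the clocksource is considered badly inaccurate and its rating is set to 0.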
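The threshold test itself is tiny. The sketch below assumes the comparison is between the nanoseconds elapsed according to the clocksource under test and the nanoseconds elapsed according to the watchdog over the same interval; cs_is_unstable is a name made up for this example, and only the WATCHDOG_THRESHOLD definition comes from the kernel snippet above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC       1000000000LL
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)   /* 62,500,000 ns = 0.0625 s */

/* cs_nsec: ns elapsed according to the clocksource under test.
 * wd_nsec: ns elapsed according to the watchdog over the same interval.
 * Returns true when the two disagree by more than the threshold. */
static bool cs_is_unstable(int64_t cs_nsec, int64_t wd_nsec)
{
    assert(cs_nsec >= 0 && wd_nsec >= 0);
    int64_t diff = cs_nsec - wd_nsec;
    if (diff < 0)
        diff = -diff;
    return diff > WATCHDOG_THRESHOLD;
}
```

Over a 0.5 s watchdog interval, a clocksource that reports 0.51 s passes, while one that reports 0.6 s or 0.4 s is flagged.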