My View of Mutual Exclusion in the Kernel

Date: 2006-12-12  Source: Internet


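/*e4gle: While modifying the Linux source code I was plagued by a large number of mutual-exclusion problems in the kernel, which have to be solved with kernel locks. I eventually resolved most of them and felt I should leave some notes behind, but never found the time to write them. I happened to come across this fellow's article and found it was exactly what I had wanted to put together, so I am sharing it here. On bottom_half and interrupts: in the TCP/IP bottom half you must never read or write files, otherwise the kernel panics. The enhancement I was adding to Linux did precisely that, which frustrated me for a long time. Discussion is welcome.
　　*/
　　My View of Mutual Exclusion in the Kernel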
　　by wheelz

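　　Having read the earlier discussion, I have some thoughts of my own to put up for debate.
　　One point needs clarifying: the choice of mutual-exclusion mechanism depends not on the size of the critical section but on its nature, and on which pieces of code, that is, which kernel execution paths, compete for it.
　　Strictly speaking, semaphore and spinlock_XXX are mutual-exclusion mechanisms at different levels, and the former is implemented on top of the latter. It is a bit like the relationship between HTTP and TCP: both are protocols, but they sit at different layers.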
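　　Take semaphore first. It works at the process level and provides mutual exclusion over a resource among multiple processes. The code still runs in the kernel, but the kernel execution path acts with the identity of a process and competes for the resource on that process's behalf. If it loses the competition there is a context switch: the process can go to sleep, but the CPU does not stop and goes on to run other execution paths. Conceptually this has nothing directly to do with whether there is one CPU or several. Only inside the implementation of semaphore itself, to make accesses to the semaphore structure atomic, is a spinlock needed on multi-CPU systems.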
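　　The layering described above can be sketched in userspace C. This is only an illustration under stated assumptions, not kernel code: every `toy_*` name is hypothetical, C11 `atomic_flag` stands in for the kernel spinlock, and `sched_yield()` stands in for sleeping on a wait queue. The point it shows is that the spinlock guards only the brief update of the counter, while a path that loses the competition gives the CPU away instead of spinning.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for a kernel spinlock (not the kernel API). */
typedef struct { atomic_flag busy; } toy_spinlock;

void toy_spin_lock(toy_spinlock *l)
{
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire))
        ;
}

void toy_spin_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}

/* Toy semaphore: the spinlock only guards the short update of `count`. */
typedef struct {
    toy_spinlock lock;
    int count;
} toy_semaphore;

void toy_sema_init(toy_semaphore *s, int count)
{
    assert(count >= 0);
    atomic_flag_clear(&s->lock.busy);
    s->count = count;
}

/* Returns 0 if the semaphore was acquired, 1 if it was busy; never waits. */
int toy_down_trylock(toy_semaphore *s)
{
    int failed = 1;

    toy_spin_lock(&s->lock);
    if (s->count > 0) {
        s->count--;
        failed = 0;
    }
    toy_spin_unlock(&s->lock);
    return failed;
}

/* Blocking acquire: on contention, give the CPU away instead of spinning. */
void toy_down(toy_semaphore *s)
{
    while (toy_down_trylock(s))
        sched_yield();  /* stand-in for sleeping on a wait queue */
}

void toy_up(toy_semaphore *s)
{
    toy_spin_lock(&s->lock);
    s->count++;
    toy_spin_unlock(&s->lock);
}
```

　　Note that `toy_down()` always drops the spinlock before it yields, so the spinlock is never held across the wait.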
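　　Within the kernel, the more common need is mutual exclusion of data access among the kernel's own execution paths. This is the most basic mutual-exclusion problem: keeping data modifications atomic. The implementation of semaphore depends on it too. On a single CPU the problem comes mainly from interrupts and bottom_half, so disabling and re-enabling interrupts is enough. On multiple CPUs there is the added interference from other CPUs, so a spinlock is needed as well. Combining the two gives spinlock_XXX. Its defining trait is that once a CPU enters spinlock_XXX it does nothing else and simply spins until it acquires the lock. This dictates that a critical section held under spinlock_XXX must not stall, still less context switch. It should access the data and get out quickly so that other spinning execution paths can obtain the spinlock. That is the basic rule of spinlock. If the current execution path really must context switch, it has to release the spinlock before calling schedule(); otherwise deadlock is likely. Interrupts and bh have no context and cannot context switch, so they can only spin waiting for the spinlock, and once you have context switched away, who knows when you will be back.
　　Since the original intent and purpose of spinlock is to keep data modifications atomic, there is no reason to linger in a critical section protected by a spinlock.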
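　　The rule "release the spinlock before schedule()" can be illustrated with a small userspace model. This is a sketch under assumptions, not kernel code: all `toy_*` names are hypothetical, `toy_atomic_depth` is an invented counter of spinlocks held by the current path, and `toy_schedule()` merely refuses to yield while that counter is nonzero.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for a kernel spinlock. */
typedef struct { atomic_flag busy; } toy_spinlock;

int toy_atomic_depth;      /* hypothetical: spinlocks held by this path */
long toy_shared_total;     /* the data the lock protects */
toy_spinlock toy_total_lock = { ATOMIC_FLAG_INIT };

void toy_spin_lock(toy_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire))
        ;                  /* spin until the lock is free */
    toy_atomic_depth++;
}

void toy_spin_unlock(toy_spinlock *l)
{
    assert(toy_atomic_depth > 0);
    toy_atomic_depth--;
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}

/*
 * Stand-in for schedule(): giving up the CPU is only allowed when no
 * spinlock is held. Returns 0 on success, -1 if the call was refused.
 */
int toy_schedule(void)
{
    if (toy_atomic_depth != 0)
        return -1;         /* spinners could wait forever for this lock */
    sched_yield();
    return 0;
}

/* Pattern from the text: touch the data, leave the critical section, then wait. */
int toy_add_then_wait(long delta)
{
    toy_spin_lock(&toy_total_lock);
    toy_shared_total += delta;   /* short critical section, no sleeping */
    toy_spin_unlock(&toy_total_lock);
    return toy_schedule();       /* the lock is already released here */
}
```

　　Calling `toy_schedule()` between `toy_spin_lock()` and `toy_spin_unlock()` is refused in this model, which mirrors the mistake the text warns against.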
　　spinlock_XXX comes in many forms:
　　spin_lock()/spin_unlock()
　　spin_lock_irq()/spin_unlock_irq()
　　spin_lock_irqsave()/spin_unlock_irqrestore()
　　spin_lock_bh()/spin_unlock_bh()
　　local_irq_disable()/local_irq_enable()
　　local_bh_disable()/local_bh_enable()
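　　One difference among the forms listed above, the one between the _irq pair and the _irqsave/_irqrestore pair, can be modelled in a few lines of userspace C. This is a sketch under assumptions: `toy_irqs_enabled` is a hypothetical stand-in for the local CPU's interrupt-enable flag, all `toy_*` names are invented, and the spinning part of the lock is omitted. The _irq unlock unconditionally re-enables interrupts, while the _irqsave pair puts back whatever state it found.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of this CPU's interrupt-enable flag. */
bool toy_irqs_enabled = true;

/* Model of the _irq pair: disable on lock, unconditionally enable on unlock. */
void toy_lock_irq(void)   { toy_irqs_enabled = false; }
void toy_unlock_irq(void) { toy_irqs_enabled = true; }

/* Model of the _irqsave pair: remember the old state and put it back. */
unsigned long toy_lock_irqsave(void)
{
    unsigned long flags = toy_irqs_enabled ? 1UL : 0UL;

    toy_irqs_enabled = false;
    return flags;
}

void toy_unlock_irqrestore(unsigned long flags)
{
    toy_irqs_enabled = (flags != 0);
}

/* Helpers that may be called with interrupts already off. Each returns
 * the interrupt state it leaves behind. */
bool toy_helper_using_irq(void)
{
    toy_lock_irq();
    assert(!toy_irqs_enabled);      /* critical section runs with irqs off */
    toy_unlock_irq();
    return toy_irqs_enabled;
}

bool toy_helper_using_irqsave(void)
{
    unsigned long flags = toy_lock_irqsave();

    assert(!toy_irqs_enabled);      /* critical section runs with irqs off */
    toy_unlock_irqrestore(flags);
    return toy_irqs_enabled;
}
```

　　If the caller had already disabled interrupts, `toy_helper_using_irq()` leaves them enabled on return, whereas `toy_helper_using_irqsave()` leaves them disabled as the caller expects.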
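　　So which one should be used in which situation? That depends on which kernel execution path you are in and which kernel execution paths you need to exclude. As we know, the main execution paths in the kernel are:
　　1 The kernel mode of a user process. There is a process context here, and the path is mainly executing system calls and the like on behalf of the process.
　　2 Interrupts, exceptions, traps and the like. Conceptually there is no process context here, and no context switch is possible.
　　3 bottom_half. Conceptually there is no process context here either.
　　4 In addition, the same execution path may be running on another CPU at the same time.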
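　　With these four factors in mind, we decide which form of spinlock to use by working out which of them will access the data we need to protect. If exclusion is needed only against other CPUs, use spin_lock/spin_unlock. If it is needed against irq and other CPUs, use spin_lock_irq/spin_unlock_irq. If it is needed against irq and other CPUs and the EFLAGS state must also be preserved, use spin_lock_irqsave/spin_unlock_irqrestore. If it is needed against bh and other CPUs, use spin_lock_bh/spin_unlock_bh. If it is not needed against other CPUs but only against irq, use local_irq_disable/local_irq_enable. If it is not needed against other CPUs but only against bh, use local_bh_disable/local_bh_enable. And so on. It is worth pointing out that for the same piece of data, different kernel execution paths may use different forms (see the example below).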
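　　The selection rule in the paragraph above can be written down as a small lookup function. This is a sketch only: the enum, the function name `toy_pick_lock`, and the returned strings are hypothetical and exist purely to restate the text. Two cases the text does not spell out are handled by assumption and marked as such in the comments.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flags naming who else may touch the data. */
enum toy_contender {
    TOY_OTHER_CPU = 1,   /* the same or another path on a different CPU */
    TOY_IRQ       = 2,   /* an interrupt handler */
    TOY_BH        = 4    /* a bottom half */
};

/*
 * Returns the name of the lock/unlock pair suggested by the text.
 * `preserve_irq_state` is nonzero when the previous interrupt state
 * (EFLAGS on x86) must be preserved across the critical section.
 */
const char *toy_pick_lock(unsigned contenders, int preserve_irq_state)
{
    int smp = (contenders & TOY_OTHER_CPU) != 0;

    if (contenders & TOY_IRQ) {
        /* Assumption: disabling irqs also covers bh on the local CPU. */
        if (!smp)
            return "local_irq_disable/local_irq_enable";
        return preserve_irq_state
            ? "spin_lock_irqsave/spin_unlock_irqrestore"
            : "spin_lock_irq/spin_unlock_irq";
    }
    if (contenders & TOY_BH)
        return smp ? "spin_lock_bh/spin_unlock_bh"
                   : "local_bh_disable/local_bh_enable";
    if (smp)
        return "spin_lock/spin_unlock";

    /* Assumption: nothing else touches the data, so no lock is needed. */
    return "none";
}
```

　　For example, `toy_pick_lock(TOY_OTHER_CPU | TOY_IRQ, 1)` names the spin_lock_irqsave/spin_unlock_irqrestore pair, which is the form setup_irq uses in the example that follows.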
　　Here is an example. The interrupt code has an array variable irq_desc[] of structures of type irq_desc_t. Each element of the array describes one irq and holds, among other things, that irq's handler. The irq_desc_t structure contains a spinlock that serializes access to (modification of) it. For a given element, irq_desc[irq], two kernel execution paths access it. One is when the irq's handler is installed (setup_irq), which usually happens during module initialization or system initialization. The other is in the interrupt handling function (do_IRQ). The code is as follows:
int setup_irq(unsigned int irq, struct irqaction * new)
{
    int shared = 0;
    unsigned long flags;
    struct irqaction *old, **p;
    irq_desc_t *desc = irq_desc + irq;

    /*
     * Some drivers like serial.c use request_irq() heavily,
     * so we have to be careful not to interfere with a
     * running system.
     */
    if (new->flags & SA_SAMPLE_RANDOM) {
        /*
         * This function might sleep, we want to call it first,
         * outside of the atomic block.
         * Yes, this might clear the entropy pool if the wrong
         * driver is attempted to be loaded, without actually
         * installing a new handler, but is this really a problem,
         * only the sysadmin is able to do this.
         */
        rand_initialize_irq(irq);
    }

    /*
     * The following block of code has to be executed atomically
     */
    /* [1] */ spin_lock_irqsave(&desc->lock,flags);
    p = &desc->action;
    if ((old = *p) != NULL) {
        /* Can't share interrupts unless both agree to */
        if (!(old->flags & new->flags & SA_SHIRQ)) {
            /* [2] */ spin_unlock_irqrestore(&desc->lock,flags);
            return -EBUSY;
        }

        /* add new interrupt at end of irq queue */
        do {
            p = &old->next;
            old = *p;
        } while (old);
        shared = 1;
    }

    *p = new;

    if (!shared) {
        desc->depth = 0;
        desc->status &= ~(IRQ_DISABLED | IRQ_AUTODETECT | IRQ_WAITING);
        desc->handler->startup(irq);
    }
    /* [3] */ spin_unlock_irqrestore(&desc->lock,flags);

    register_irq_proc(irq);
    return 0;
}
asmlinkage unsigned int do_IRQ(struct pt_regs regs)
{
    /*
     * We ack quickly, we don't want the irq controller
     * thinking we're snobs just because some other CPU has
     * disabled global interrupts (we have already done the
     * INT_ACK cycles, it's too late to try to pretend to the
     * controller that we aren't taking the interrupt).
     *
     * 0 return value means that this irq is already being
     * handled by some other CPU. (or is disabled)
     */
    int irq = regs.orig_eax & 0xff; /* high bits used in ret_from_ code */
    int cpu = smp_processor_id();
    irq_desc_t *desc = irq_desc + irq;
    struct irqaction * action;
    unsigned int status;

    kstat.irqs[cpu][irq]++;
    /* [4] */ spin_lock(&desc->lock);
    desc->handler->ack(irq);
    /*
     * REPLAY is when Linux resends an IRQ that was dropped earlier
     * WAITING is used by probe to mark irqs that are being tested
     */
    status = desc->status & ~(IRQ_REPLAY | IRQ_WAITING);
    status |= IRQ_PENDING; /* we _want_ to handle it */

    /*
     * If the IRQ is disabled for whatever reason, we cannot
     * use the action we have.
     */
    action = NULL;
    if (!(status & (IRQ_DISABLED | IRQ_INPROGRESS))) {
        action = desc->action;
        status &= ~IRQ_PENDING; /* we commit to handling */
        status |= IRQ_INPROGRESS; /* we are handling it */
    }
    desc->status = status;

    /*
     * If there is no IRQ handler or it was disabled, exit early.
     * Since we set PENDING, if another processor is handling
     * a different instance of this same irq, the other processor
     * will take care of it.
     */
    if (!action)
        goto out;

    /*
     * Edge triggered interrupts need to remember
     * pending events.
     * This applies to any hw interrupts that allow a second
     * instance of the same irq to arrive while we are in do_IRQ
     * or in the handler. But the code here only handles the _second_
     * instance of the irq, not the third or fourth. So it is mostly
     * useful for irq hardware that does not mask cleanly in an
     * SMP environment.
     */
    for (;;) {
        /* [5] */ spin_unlock(&desc->lock);
        handle_IRQ_event(irq, &regs, action);
        /* [6] */ spin_lock(&desc->lock);

Article URL: http://www.butianyuan.cn/article/258258.htm
