从上次修复jbd2的bug后到现在,已经四个月了,突然昨天DB的机器又出现了panic,用netoops抓到的信息如下:
jbd2 panic
居然又栽在jbd2上了,panic的位置和上次一样是

J_ASSERT(journal->j_running_transaction != NULL);

考虑到是运行很长时间的DB机器才出现这个问题,疑点还是落到tid上,由文卿同学找到了现象最符合的upstream的补丁:

commit deeeaf13b291420fe4a4a52606b9fc9128387340
Author: Theodore Ts'o 
Date:   Sun May 1 18:16:26 2011 -0400

    jbd2: fix fsync() tid wraparound bug
    
    If an application program does not make any changes to the indirect
    blocks or extent tree, i_datasync_tid will not get updated.  If there
    are enough commits (i.e., 2**31) such that tid_geq()'s calculations
    wrap, and there isn't a currently active transaction at the time of
    the fdatasync() call, this can end up triggering a BUG_ON in
    fs/jbd2/commit.c:
    
    	J_ASSERT(journal->j_running_transaction != NULL);
    
    It's pretty rare that this can happen, since it requires the use of
    fdatasync() plus *very* frequent and excessive use of fsync().  But
    with the right workload, it can.
    
    We fix this by replacing the use of tid_geq() with an equality test,
    since there's only one valid transaction id that we is valid for us to
    wait until it is commited: namely, the currently running transaction
    (if it exists).
    
    Signed-off-by: "Theodore Ts'o" 

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e0ec3db..8c10004 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -479,9 +479,12 @@ int __jbd2_log_space_left(journal_t *journal)
 int __jbd2_log_start_commit(journal_t *journal, tid_t target)
 {
 	/*
-	 * Are we already doing a recent enough commit?
+	 * The only transaction we can possibly wait upon is the
+	 * currently running transaction (if it exists).  Otherwise,
+	 * the target tid must be an old one.
 	 */
-	if (!tid_geq(journal->j_commit_request, target)) {
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == target) {
 		/*
 		 * We want a new commit: OK, mark the request and wakeup the
 		 * commit thread.  We do _not_ do the commit ourselves.
@@ -493,7 +496,14 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target)
 			  journal->j_commit_sequence);
 		wake_up(&journal->j_wait_commit);
 		return 1;
-	}
+	} else if (!tid_geq(journal->j_commit_request, target))
+		/* This should never happen, but if it does, preserve
+		   the evidence before kjournald goes into a loop and
+		   increments j_commit_sequence beyond all recognition. */
+		WARN(1, "jbd: bad log_start_commit: %u %u %u %u\n",
+		     journal->j_commit_request, journal->j_commit_sequence,
+		     target, journal->j_running_transaction ? 
+		     journal->j_running_transaction->t_tid : 0);
 	return 0;
 }

原来,做fdatasync()时,会调用 ext4_sync_file():

fs/ext4/fsync.c
ext4_sync_file()

        commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;                        
        printk("datasync:%lu, sync:%lu\n", ei->i_datasync_tid, ei->i_sync_tid);             
        if (jbd2_log_start_commit(journal, commit_tid)) {                                
		......

直接把i_datasync_tid赋给target,然后调用__jbd2_log_start_commit(),由于i_datasync_tid只在新创建inode或者ext4_iget()时更新,而数据库在打开某些文件后几乎再不关闭,而且一直是覆盖写,extent树不变,故而i_datasync_tid始终不更新,保持一个很小的值,而j_commit_request却随着jbd2日志的不断提交而不停增长,最后,它俩的差值越来越大,直至差值超过int型上限,tid_geq判断出错。

真是此恨绵绵,上次通过给i_datasync_tid赋正确的初值避免了“0比2177452108更晚”这一逻辑错误,修复了,但现在又是由于j_commit_request和target之间差值太大(大于了2的31次方)再次造成判断错误,所以又触发了BUG_ON。这次明显Ted哥也抓狂了,折腾tid_geq这破逻辑干嘛,直接判断j_running_transaction是不是NULL得了....

重现方法由伯瑜发现:打开一个文件,不停的覆盖写和fdatasync();另外启动一个进程打开另一个文件不停覆盖写和fsync()。重现时间需要三个多月,我们在线下修改了部分内核代码才加速复现的。
上次线上临时修复这个问题是umount ext4文件系统后用专用工具直接将tid置为0,这次其实临时方案更简单:umount后再mount就行了,因为这样会让i_datasync_tid更新为最新的tid,避免了它们之间差值不断变大。当然,文卿同学正在用kprobe的机制写了个hotfix的内核模块,这个才是正解。

DB的运维同学又渡过了一个难眠的夜晚,给大家添麻烦了,实在抱歉!