In the previous article, I found out the reason. But how to resolve it in Multi-GPU-Training was still a question. Following the suggestion in this issue on GitHub, I tried two ways to fix the problem.
First, I rewrote my Averaging-Gradients-Training to mimic what tf.slim.create_train_op() does:

...
# ops, control_flow_ops and tf_variables are TensorFlow's internal modules
# (tensorflow.python.framework.ops, tensorflow.python.ops.control_flow_ops,
# tensorflow.python.ops.variables).
def create_train_grads(total_loss, optimizer):
  # Make the loss depend on UPDATE_OPS (the batch-norm moving-average updates),
  # just like slim.learning.create_train_op() does.
  update_ops = set(ops.get_collection(ops.GraphKeys.UPDATE_OPS))
  with ops.control_dependencies(update_ops):
    barrier = control_flow_ops.no_op(name='update_barrier')
  total_loss = control_flow_ops.with_dependencies([barrier], total_loss)
  # Compute, but do not yet apply, the gradients for this tower.
  variables_to_train = tf_variables.trainable_variables()
  grads = optimizer.compute_gradients(total_loss, variables_to_train)
  return grads
...
          cross_entropy = tf.reduce_mean(cross_entropy)
          # Share variables across towers and collect this tower's gradients.
          tf.get_variable_scope().reuse_variables()
          grads = create_train_grads(cross_entropy, opt)
          tower_grads.append(grads)
...
  # Average the per-tower gradients and apply them once to the shared variables.
  grads = average_gradients(tower_grads)
  grad_updates = opt.apply_gradients(grads)
  with ops.name_scope('train_op'):
    # Ensure the train_op computes grad_updates.
    train_op = control_flow_ops.with_dependencies([grad_updates], cross_entropy)
  # Add the operation used for training to the 'train_op' collection.
  train_ops = ops.get_collection_ref(ops.GraphKeys.TRAIN_OP)
  if train_op not in train_ops:
    train_ops.append(train_op)
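
average_gradients() is not shown above; here is a minimal sketch of it, assuming it follows the per-variable averaging from the TensorFlow CIFAR-10 multi-GPU tutorial:

def average_gradients(tower_grads):
  # tower_grads is a list (one entry per GPU) of lists of (gradient, variable) pairs.
  average_grads = []
  for grad_and_vars in zip(*tower_grads):
    # grad_and_vars holds every tower's gradient for one variable.
    grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
    grad = tf.reduce_mean(tf.concat(grads, 0), 0)
    # The variables are shared across towers, so the first tower's variable is enough.
    average_grads.append((grad, grad_and_vars[0][1]))
  return average_grads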

But unfortunately, this didn't work at all; the inference result was still a mess.
Then I tried the other way: Asynchronous-Gradient-Training with tf.slim.create_train_op():

...
          cross_entropy = tf.reduce_mean(cross_entropy)
          # Build a separate train_op for each tower; no gradient averaging.
          train_op = tf.contrib.slim.learning.create_train_op(cross_entropy, opt)
          tower_ops.append(train_op)
...
  # Run every tower's train_op together in a single step.
  train_step = tf.group(*tower_ops)
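
Putting the second approach together, here is a minimal sketch of the whole tower loop. It assumes a hypothetical model_fn that builds the network (with slim batch-norm layers) and already-batched images/labels tensors, so the names are illustrative rather than the exact code from my script:

import tensorflow as tf

slim = tf.contrib.slim

def build_async_train_step(model_fn, images, labels, num_gpus, learning_rate=0.1):
  opt = tf.train.GradientDescentOptimizer(learning_rate)
  image_splits = tf.split(images, num_gpus)
  label_splits = tf.split(labels, num_gpus)
  tower_ops = []
  for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
      # Reuse variables after the first tower so all GPUs share one set of weights.
      with tf.variable_scope(tf.get_variable_scope(), reuse=(i > 0)):
        logits = model_fn(image_splits[i])
        cross_entropy = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=label_splits[i], logits=logits))
        # create_train_op() adds the UPDATE_OPS dependency (batch-norm moving
        # averages) and builds a per-tower apply_gradients op.
        train_op = tf.contrib.slim.learning.create_train_op(cross_entropy, opt)
        tower_ops.append(train_op)
  # One step runs every tower's train_op; each tower applies its own gradients.
  return tf.group(*tower_ops)

With this, one session.run() of the grouped op lets each tower apply its gradients independently to the shared variables; there is no averaging step that has to collect the gradients from every GPU first.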

Now inference works very well! Training is also a little faster than Averaging-Gradients-Training, because the averaging operation has to wait for the gradients from all GPUs before it can apply an update.