A problem with slim.batch_norm() in TensorFlow

After using resnet_v2_50 from tensorflow/models, I found that the inference results were totally incorrect, even though the training accuracy looked very good.
First, I suspected the regularization of the samples:

Indeed, I had padded the images to too large a size. But after I changed the padding size to ‘10’, the inference accuracy was still poor.
Then I checked the code that imports the data:

and changed my inference code to follow the same data-importing routines. But the problem remained.

About a week passed. Finally, I found this issue on GitHub. It answers all my questions: the cause is slim.batch_norm(). I then added the following code to my program (learned from slim.create_train_op()):

The inference accuracy was still low. With no other choice, I removed all the slim.batch_norm() calls, and this time the inference accuracy became the same as the training accuracy.
It looks like the problem has been partly solved, but I still need to find out why slim.batch_norm() doesn’t work well in inference …
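For context, batch normalization normalizes with the statistics of the current batch during training, but with stored moving averages during inference. If those moving averages are never updated, inference works with stale values even though training looks fine. The sketch below illustrates only that mechanism in plain Python; it is not slim's implementation, and the class name `ToyBatchNorm`, the decay value and the sample data are assumptions made for illustration.

```python
class ToyBatchNorm:
    """Scalar batch norm that keeps moving statistics, for illustration only."""

    def __init__(self, decay=0.9, eps=1e-5):
        self.decay = decay
        self.eps = eps
        self.moving_mean = 0.0  # initial values of a freshly created layer
        self.moving_var = 1.0

    def train_step(self, batch, run_update_ops=True):
        mean = sum(batch) / len(batch)
        var = sum((x - mean) ** 2 for x in batch) / len(batch)
        if run_update_ops:
            # Exponential moving average of the batch statistics.
            self.moving_mean = self.decay * self.moving_mean + (1 - self.decay) * mean
            self.moving_var = self.decay * self.moving_var + (1 - self.decay) * var
        # Training always normalizes with the statistics of the current batch.
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]

    def infer(self, batch):
        # Inference normalizes with the stored moving statistics.
        scale = (self.moving_var + self.eps) ** 0.5
        return [(x - self.moving_mean) / scale for x in batch]


batch = [98.0, 100.0, 102.0]  # made-up inputs with mean 100

updated = ToyBatchNorm()
stale = ToyBatchNorm()
for _ in range(200):
    updated.train_step(batch, run_update_ops=True)
    stale.train_step(batch, run_update_ops=False)

# Normalized value of the middle sample at inference time.
centre_updated = updated.infer(batch)[1]
centre_stale = stale.infer(batch)[1]
print(centre_updated, centre_stale)
```

With updates enabled, the middle sample is normalized to roughly zero; without them, it stays far from zero because the moving mean never left its initial value.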

An experiment with distributed TensorFlow

Here is my experimental code for distributed TensorFlow, which is adapted from the example.

The important thing is that we need to use tf.assign() to push the Variable back to the parameter server. In this example, the ‘tf.add’ operation runs on task 0 of the worker. But if we deploy a more complicated application with many tasks, things become weird: a pipeline operation sometimes even runs on the ‘ps’ role!

The official solution to this problem is ‘tf.train.replica_device_setter()’, which automatically places Variables on the parameter servers and Operations (many replicas) on the workers. What does ‘tf.train.replica_device_setter()’ do? Let’s look at the backbone of its implementation:

All the Variables are counted as ‘ps_ops’, and the deployment strategy for Operations is replication, which is why the class is called ‘_ReplicaDeviceChooser’.

Every ‘op’ whose type is in ‘self._ps_ops’ is placed on ‘ps_device’.
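The placement rule described above can be sketched in a few lines of plain Python. This is a toy model, not TensorFlow's code: the class name `ToyReplicaDeviceChooser`, the op type names in the default `ps_ops` tuple and the round-robin assignment over parameter server tasks are assumptions chosen to mirror the description.

```python
class ToyReplicaDeviceChooser:
    """Toy chooser: variable ops go to ps tasks, everything else to the worker."""

    def __init__(self, ps_tasks, worker_device, ps_ops=("Variable", "VariableV2")):
        self._ps_tasks = ps_tasks
        self._worker_device = worker_device
        self._ps_ops = ps_ops
        self._next_ps_task = 0

    def device_function(self, op_type):
        if op_type in self._ps_ops:
            # Spread variables over the parameter server tasks in round-robin order.
            task = self._next_ps_task
            self._next_ps_task = (self._next_ps_task + 1) % self._ps_tasks
            return "/job:ps/task:%d" % task
        # Any other operation stays on this replica's worker device.
        return self._worker_device


chooser = ToyReplicaDeviceChooser(ps_tasks=2, worker_device="/job:worker/task:0")
op_types = ["VariableV2", "Add", "VariableV2", "MatMul", "VariableV2"]
placements = [(op, chooser.device_function(op)) for op in op_types]
for op, device in placements:
    print(op, "->", device)
```

The point of the sketch is that the decision depends only on the type of each op, so no operation has to be pinned to a device by hand.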

A performance problem when training on images with MXNet

After running my MXNet application like this snippet:

I found that the training speed was only 300 samples per second, and the GPU usage looked very strange:

About two days later, I noticed some messages reported by MXNet:

After changing my command to:

the training speed rose to 690 samples per second, and the GPU usage became much smoother, since MXNet could now use more CPUs to decode images:

The problem of ‘bool’ type in argparse of Python 2.7

To learn from the distributed TensorFlow example, I wrote this snippet:

“parser.register()” is the TensorFlow way of registering a ‘bool’ type for the parser. But it doesn’t work! In my shell, I ran

They all print “Namespace(training=True)”, which means the code above cannot change the value of the ‘training’ argument (my Python version is 2.7.5).

The correct code should be:
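The original snippet is not reproduced here. As a hedged sketch of one common approach, an explicit conversion function can be passed as `type`. The helper name `str2bool` and the accepted spellings are assumptions for illustration. Note that the builtin `bool` is not suitable as a converter, because `bool()` of any non-empty string, including "False", is true.

```python
import argparse


def str2bool(value):
    """Hypothetical helper: map common textual spellings to a bool."""
    text = value.lower()
    if text in ("true", "t", "yes", "y", "1"):
        return True
    if text in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    parser = argparse.ArgumentParser()
    # argparse calls str2bool on the raw command-line string.
    parser.add_argument("--training", type=str2bool, default=True)
    return parser


# The builtin bool() only checks whether the string is empty.
print(bool("False"))
print(build_parser().parse_args(["--training", "False"]))
print(build_parser().parse_args(["--training", "true"]))
print(build_parser().parse_args([]))
```

Raising `argparse.ArgumentTypeError` for unknown spellings makes argparse report a usage error instead of silently falling back to a default.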

Using Python to access HBase through JPype

First, we need to write a Java function to get data from HBase:

Then use Maven to build it into a single jar file with all dependent libraries:

Now we can use Python to call this Java class through JPype:

This Python example runs correctly. But if we use it in tf.py_func(), it core dumps, which is difficult to debug. So in the end we chose to write the operation in C++ and access HBase through the Thrift server, which is better for stability and gives a cleaner architecture.

Small tips about containers in Intel Threading Building Blocks and C++11

Changing values in container

The code above will not change any value in the container ‘table’. ‘auto’ is deduced as std::pair, so ‘item’ is a copy of the real item in ‘table’, and modifying ‘item’ does not change the actual value in the container.
The correct way is:


Do traversal and modification concurrently in a container
Using concurrent_hash_map like this:

will cause the program to core dump.
The reason is that a concurrent_hash_map can’t be modified and traversed concurrently.

Actually, Intel provides another container for concurrent traversal and insertion: concurrent_unordered_map.
But be careful: concurrent_unordered_map supports simultaneous traversal and insertion, but not simultaneous traversal and erasure.

The CSE (Common Subexpression Elimination) problem when running a custom operation in TensorFlow

Recently, we created a new custom operation in TensorFlow:

It’s as simple as the example in the TensorFlow documentation. But when we run this Op in a session:

It only gets image_ids from the network once, and then reuses the result of the first ‘run’ forever, without ever calling the ‘Compute()’ function in the cpp code again!

It seems TensorFlow optimized the new Op and never runs it twice. My colleague suggested solving this problem by using tf.placeholder:

That looks a little tricky. The final solution is to add a flag in the cpp code so that the new Op avoids CSE (Common Subexpression Elimination):
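The original C++ snippet is not reproduced here. As a rough illustration of what CSE does and why a "stateful" marker prevents it, the toy Python sketch below merges stateless nodes that have the same op and the same inputs, and leaves stateful nodes alone. It is not TensorFlow's optimizer; `Node`, `eliminate_common_subexpressions` and the op name `FetchImageIds` are hypothetical.

```python
class Node:
    """A graph node: an op name, its input nodes, and a stateful marker."""

    def __init__(self, op, inputs=(), stateful=False):
        self.op = op
        self.inputs = tuple(inputs)
        self.stateful = stateful


def eliminate_common_subexpressions(nodes):
    """Map every node to a representative, merging identical stateless nodes.

    `nodes` must be in topological order (inputs before their consumers).
    """
    representative = {}  # original node -> node that replaces it
    by_signature = {}    # (op, ids of representative inputs) -> first node seen
    for node in nodes:
        if node.stateful:
            # A stateful op may produce a different result each time: never merge.
            representative[node] = node
            continue
        signature = (node.op, tuple(id(representative[i]) for i in node.inputs))
        representative[node] = by_signature.setdefault(signature, node)
    return representative


# Two stateless fetches look identical, so the second one is merged away.
fetch_a = Node("FetchImageIds")
fetch_b = Node("FetchImageIds")
merged = eliminate_common_subexpressions([fetch_a, fetch_b])
print(merged[fetch_b] is fetch_a)

# Marked as stateful, both fetches survive and would both be executed.
fetch_c = Node("FetchImageIds", stateful=True)
fetch_d = Node("FetchImageIds", stateful=True)
kept = eliminate_common_subexpressions([fetch_c, fetch_d])
print(kept[fetch_d] is fetch_d)
```

An op that reads from the network is a typical case where merging or caching is wrong, because two executions are not guaranteed to return the same value.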

Attachment of the ‘CMakeLists.txt’:

Some details about Arduino Uno

In a previous article, I reviewed some open source hardware for teaching children programming, and considered the Micro:bit the best choice. But recently, after searching through many different types of micro-controllers, I noticed that an advanced version of the Arduino Uno R3 (less than $3) is far cheaper than the Micro:bit (about $16) on a famous e-commerce website in China. Despite the low price, Arduino also has a simple and intuitive IDE (Integrated Development Environment):

The programming language for Arduino is a dialect of C/C++ (the IDE itself is derived from the Processing IDE), so it looks just like C. A Python developer can also pick it up fairly quickly, although unlike Python it requires variables to be declared with types.
I also found example code for blinking an LED with Arduino:

Easy enough, right? But there are even more convenient graphical programming tools, such as ArduBlock and Mixly.

The demo of ‘ArduBlock’

The demo of ‘Mixly’

With an easy-to-learn language and graphical programming interfaces, I think Arduino is also a good choice for teaching children programming.

Computing gradients for different parts of a model in TensorFlow

In TensorFlow, we can use an Optimizer to train a model:

But sometimes the model needs to be split into two parts that are trained separately, so we need to compute the gradients and apply them in two steps:

Then how can we deliver gradients from the first part to the second part? Here is the equation that answers this:

$$\frac{\partial \mathit{Loss}}{\partial W_{\text{second-part}}} = \frac{\partial \mathit{Loss}}{\partial IV} \cdot \frac{\partial IV}{\partial W_{\text{second-part}}}$$

IV means ‘intermediate vector’: it is the interface vector between the first part and the second part, and it belongs to both of them. W_{second-part} denotes the weights of the second part of the model. Therefore we can use tf.gradients() to connect the gradients of the two parts:
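The chain rule above can be checked numerically on a tiny scalar model. The sketch below is plain Python rather than TensorFlow, and the functions `first_part` and `second_part`, the weights and the input values are all made up for illustration. The first part computes the gradient of the loss with respect to IV, and the second part multiplies it by its own local derivative. In TensorFlow 1.x the same hand-over can be expressed through the `grad_ys` argument of tf.gradients().

```python
def second_part(w_second, x):
    """Lower half of the toy model: produces the intermediate value IV."""
    return w_second * x


def first_part(w_first, iv, target):
    """Upper half of the toy model: consumes IV and produces the loss."""
    return (w_first * iv - target) ** 2


x, target = 3.0, 5.0
w_first, w_second = 0.5, 2.0

iv = second_part(w_second, x)

# Step 1: the first part computes dLoss/dIV and hands it over.
dloss_div = 2.0 * (w_first * iv - target) * w_first
# Step 2: the second part multiplies by its local derivative dIV/dW_second.
div_dw_second = x
dloss_dw_second = dloss_div * div_dw_second

# Sanity check: central finite difference through the whole model.
h = 1e-6
numeric = (first_part(w_first, second_part(w_second + h, x), target)
           - first_part(w_first, second_part(w_second - h, x), target)) / (2 * h)
print(dloss_dw_second, numeric)
```

The two printed numbers should agree closely, which shows that passing only the gradient with respect to IV across the boundary is enough to train the second part.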

My choice between Raspberry PI, Arduino, Pyboard and Micro:bit

I want to teach my child programming. But teaching a child to sit still and keep watching a computer screen is not easy, I think, because children usually can’t focus on a boring development IDE for more than ten minutes. Therefore I tried to find a micro-controller that kids could use for interesting projects, such as reading temperature and humidity from the environment, or controlling the micro motors of a toy car.

There are many micro-controller and micro-computer boards on the market, so I had to compare them and choose the most suitable one.

Raspberry PI

Raspberry PI is very powerful. It can do almost anything that a personal computer or laptop can do. But the problem with Raspberry PI is that it is too difficult for a child to learn. Another reason I gave it up is the price: $35 for only the bare board without any peripherals.


Arduino

Arduino is cheap enough. But you can only use the C language to program it, and using C requires solid knowledge of computer science, such as memory models and data structures. Imagine using C to implement a working dictionary: for a pupil, it is like building a spaceship in the backyard.

By now, I can narrow my choices to chips that support Python, or MicroPython, because Python is easy to understand, looks much more straightforward, and can also be used in imperative mode. In a word, it’s much easier to learn than C.

So let’s take a look at the chips that support MicroPython.


Pyboard

Pyboard is simple and cheap enough, and it also supports MicroPython. Its only imperfection is that its hardware interface is hard to use for someone who doesn’t know hardware very well.


Micro:bit

Launched in 2015 by the BBC, Micro:bit is, in my opinion, the most suitable chip for a child learning programming and even IoT (Internet of Things). It is cheap: only $18. It supports programming in MicroPython and JavaScript (with a graphical IDE), so a child can use a few lines of Python code to control it. It also supports a bunch of peripherals such as micro motors, thermometers, hygrometers, Bluetooth and Wi-Fi. Children can use it as the core controller of an intelligent toy car. The Micro:bit even has Android/iOS apps for operating it, which is perfect for a little child under 7 years old.

So this is my choice: Micro:bit.