Usage limitations of HDFS’s C API

I had to change a program written in C so that it writes to HDFS instead of local files. After studying the libhdfs C API example, I replaced open()/write()/read() with hdfsOpenFile()/hdfsWrite()/hdfsRead() and so on. But when I ran the new program, many problems occurred. The first: after fork(), the program can’t open HDFS files anymore. This problem seems to be very common in the community and doesn’t have a solution yet.
So I had to try the hdfs-fuse tool. Following the steps in this article, I successfully built and ran hdfs-fuse:

But something weird happened:

After fsync(), “ls” on the mountpoint “/data” still showed the size of the file “my.db” as zero! This made the program report an error and stop processing.
The reason is that fuse-dfs hasn’t implemented the fuse_fsync() interface. After I added an implementation of fuse_fsync() based on hdfsHSync(), it worked. But the performance was too poor: about 10~20MB/s over the network.
Consequently, I decided to use GlusterFS instead of HDFS, because it needs no modification to the user program at all and has supported erasure coding since version 3.6 (which dramatically reduces the storage space used).
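The check that failed on the fuse-dfs mount can be sketched as below. This is only a minimal illustration, not the original program: the helper name size_after_fsync() is hypothetical, and it uses plain POSIX calls so it can be pointed at any mountpoint. On a file system that honours fsync(), the size reported by stat() should equal the number of bytes written.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Hypothetical helper: writes `len` bytes to `path`, calls fsync(), and
// returns the file size that stat() reports afterwards, or -1 on any error.
long long size_after_fsync(const char* path, const char* data, std::size_t len) {
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) return -1;

    ssize_t written = write(fd, data, len);
    if (written != static_cast<ssize_t>(len) || fsync(fd) != 0) {
        close(fd);
        return -1;
    }

    // Query the size through the path, the same way "ls" would see it.
    struct stat st;
    int rc = stat(path, &st);
    close(fd);
    return rc == 0 ? static_cast<long long>(st.st_size) : -1;
}
```

Calling it with a path under the mountpoint (for example a file below “/data”) and comparing the result with the number of bytes written reproduces the symptom described above when the result is zero.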

Network problem when installing docker-engine on CentOS 7

After installing docker-engine on CentOS 7, it failed to start with

When I ran

it showed:

To see more details, I ran

There was much more information now:

The docker0 ethernet interface could not be assigned an IPv4 address. Searching Google showed that many people have met this problem, but none of their solutions worked for me. Eventually I found this solution for my environment (removing the network bridge entirely):

and my docker service started up!

The key is: docker-engine will not assign an IP address to a docker0 interface that already exists.

The CPU usage of soft-irq is counted against a process

Redis usually runs as a single-process daemon, which makes it a perfect example of the UNIX philosophy: one process does just one thing. But current servers have many CPUs (cores), so we need to launch many Redis processes to provide service. After running multiple redis-server processes on one server, I found that one redis-server daemon always cost more CPU than the others in the “top” command:




I used perf to collect function samples from the different processes and noticed that some “softirq” functions had been called. Then I remembered that I hadn’t balanced the soft-irqs of the network card across the CPU cores. After running my script to balance the soft-irqs:

The view in the “top” command looked much better:



But I still had a question: why does the “top” command count the CPU usage of system soft-irqs against an innocent process? The answer is here: a soft-irq usually has no process of its own and runs on top of whatever process was interrupted on that CPU, so the kernel needs a “scapegoat” process to charge the CPU time to.
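One way to check whether soft-irq load is concentrated on a single core is to compare the softirq field of the per-CPU lines in /proc/stat. The sketch below is an illustration only (the helper name softirq_share() is hypothetical and is not part of the balancing script mentioned above). It assumes the usual field order of a “cpu” line: user, nice, system, idle, iowait, irq, softirq, steal.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: given one "cpu"/"cpuN" line from /proc/stat, returns
// the fraction of accounted time spent in softirq (0.0 to 1.0), or -1.0 if
// the line is not a CPU line or is too short.
double softirq_share(const std::string& line) {
    std::istringstream in(line);
    std::string label;
    in >> label;
    if (label.compare(0, 3, "cpu") != 0) return -1.0;

    unsigned long long value = 0, total = 0, softirq = 0;
    int field = 0;
    while (in >> value) {
        // Sum user..steal; the later guest fields are skipped here
        // (assumed to be already counted in user/nice).
        if (field < 8) total += value;
        if (field == 6) softirq = value;  // 7th number is softirq
        ++field;
    }
    if (field < 7 || total == 0) return -1.0;
    return static_cast<double>(softirq) / static_cast<double>(total);
}
```

Since the counters are cumulative since boot, a real check would read /proc/stat twice and compare the differences; a core whose share is much higher than the others is the one handling most of the network soft-irqs.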

Avoid “page allocation failure” in the Linux kernel on big-memory servers

After putting load on a key-value cluster, I found many errors in dmesg:

It’s hard to understand the “page allocation failure” error, because our servers have a very large memory capacity. Looking at the output of the “free” command, I noticed that a large amount of memory was used to cache files. Maybe the “free” memory was too small, so the kernel could not get enough pages when it needed many at once.
But how can we reserve more “free” memory in the Linux kernel? According to this article, we can modify “/proc/sys/vm/min_free_kbytes” to adjust the watermarks of Linux memory management, and the kernel will then try hard to keep enough memory “free”.



After changing “/proc/sys/vm/min_free_kbytes” to 1G, the errors became rare but still occurred. Then I changed it to 4G, and this time there were no more errors in dmesg.
In conclusion, the default value of “min_free_kbytes” in the kernel is too small for such machines; we’d better raise “min_free_kbytes” on machines with big memory.
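Because “min_free_kbytes” is expressed in kilobytes, the 1G and 4G settings have to be converted before they are written. The sketch below shows the arithmetic and how the current value can be read back. It assumes that “1G” and “4G” mean GiB (1024 × 1024 kB), which the text does not state, and the helper names are hypothetical.

```cpp
#include <fstream>
#include <string>

// Converts a size in GiB to the kilobyte unit used by min_free_kbytes.
// Assumption: "G" in the text means GiB.
constexpr unsigned long long gib_to_kbytes(unsigned long long gib) {
    return gib * 1024ULL * 1024ULL;
}

// Reads the current setting; returns -1 if the file cannot be read.
// The path is a parameter only so the function can be tried on other files.
long long read_min_free_kbytes(
        const std::string& path = "/proc/sys/vm/min_free_kbytes") {
    std::ifstream in(path);
    long long kbytes = 0;
    if (!(in >> kbytes)) return -1;
    return kbytes;
}
```

Under that assumption, 4G corresponds to writing 4194304 into “/proc/sys/vm/min_free_kbytes” (as root); comparing it with read_min_free_kbytes() shows how far the default is from that value.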

Too many “ext4-dio-unwrit” processes in the system

After putting load on an application that writes a tremendous amount of data to an ext4 file system, we saw many “ext4-dio-unwrit” kernel threads in “top”. Many people say this is normal, so I checked the ext4 source code in the 2.6.32 Linux kernel.

Writing a file starts in the write-back kernel thread, which calls generic_writepages() and then ext4_write_page():

Let’s look at ext4_add_complete_io():

It puts the “io_end” on the work queue “dio_unwritten_wq”, which lives in ext4_sb_info. But where does “dio_unwritten_wq” come from? In fs/ext4/super.c:

Oh, it is the “ext4-dio-unwritten” kernel thread (“top” shows the name truncated to “ext4-dio-unwrit”)! So the puzzle is solved: the application dirties pages in the file system cache, then the write-back kernel thread writes these dirty pages to the specific file system (ext4 in this article), and finally ext4 puts the io on the work queue “ext4-dio-unwritten”, where it waits for the unwritten extent to be converted into a written extent.
Therefore, if we don’t have unwritten extents in ext4 (for example, when just appending to a normal file with the write() system call), the “ext4-dio-unwritten” kernel threads will exist but not use any CPU.

Why you should update your gcc (and c++ library)

Consider the code below:

It compiles and runs on CentOS 5 (gcc-4.1.2), but core dumps at runtime.

The gdb stack shows that the crash is in string_hashfunc::operator():

Let’s look at the source code behind “ext/hash_map”, in /usr/include/c++/4.1.2/ext/hashtable.h:

And in the implementation of _M_bkt_num():

It uses _M_hash() to compute the bucket number of the key, and _M_hash() is actually string_hashfunc::operator(). The reason is clear now: to advance, the iterator calls operator++() --> _M_bkt_num() --> _M_bkt_num_key() --> _M_hash() --> string_hashfunc::operator(), which can’t read the key because it has already been freed by “delete it->first”.

How about a newer g++ and C++ library? Let’s write the same program on CentOS 7 (gcc-4.8.5), changing “ext/hash_map” to “unordered_map” (from the C++11 standard):

Then build it:

Everything works, because the new implementation of the C++ library uses “_M_nxt” to point to the next hash node instead of calling the hash function (see /usr/include/c++/4.8.5/bits/hashtable_policy.h).
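A way to free the keys that does not depend on how the iterator advances is to take the entry out of the table first and free the key afterwards. The following is only a sketch with hypothetical names (CStrHash, CStrEq, Table, dup_key, clear_and_free_keys); the original program is not shown, so the key and value types here are assumptions. It relies on the C++11 behaviour that unordered_map::erase(iterator) returns the iterator to the next element.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>
#include <unordered_map>

// Hash and equality for heap-allocated C-string keys (illustrative only).
struct CStrHash {
    std::size_t operator()(const char* s) const {
        return std::hash<std::string>()(s);
    }
};
struct CStrEq {
    bool operator()(const char* a, const char* b) const {
        return std::strcmp(a, b) == 0;
    }
};
using Table = std::unordered_map<const char*, int, CStrHash, CStrEq>;

// Allocates a copy of `s` that the table's owner must free with delete[].
char* dup_key(const char* s) {
    char* copy = new char[std::strlen(s) + 1];
    std::strcpy(copy, s);
    return copy;
}

// Removes every entry and frees its key; returns the number of keys freed.
std::size_t clear_and_free_keys(Table& table) {
    std::size_t freed = 0;
    for (Table::iterator it = table.begin(); it != table.end();) {
        const char* key = it->first;  // remember the key pointer
        it = table.erase(it);         // unlink the node while the key is valid
        delete[] key;                 // only now release the key's memory
        ++freed;
    }
    return freed;
}
```

With this ordering the container never holds a pointer to freed memory, so the loop behaves the same whether the library advances through a stored next pointer or by rehashing the key.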

Why you should update your gcc

Consider this C++ code:

I first compiled it on CentOS 5, where the gcc version is 4.1.2, and it reported:

Can anyone spot the problem at first glance in this messy report? C++ template error reports have been terribly difficult to understand ever since I first used them 9 years ago.
Then I tried compiling the source on CentOS 6 with gcc-4.4.6:

It looked almost the same. How about CentOS 7 with gcc-4.8.5?

Aha, much better, as it tells us the exact position of the problem: “vec.begin()” returns a “const_iterator”, which cannot be converted to an “iterator”.
To save time debugging C++ template code and enjoy life, please update your gcc.
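The original source is not reproduced here, but this kind of error typically arises when begin() is called on a const vector. The sketch below, with the hypothetical function sum_elements(), shows the situation and the matching iterator type.

```cpp
#include <vector>

// `vec` is a const reference, so vec.begin() returns a const_iterator.
// Declaring `it` as std::vector<int>::iterator here would be rejected by the
// compiler, because a const_iterator does not convert to an iterator.
int sum_elements(const std::vector<int>& vec) {
    int total = 0;
    for (std::vector<int>::const_iterator it = vec.begin();
         it != vec.end(); ++it) {
        total += *it;
    }
    return total;
}
```

With C++11 and later, writing `auto it = vec.begin()` sidesteps the question by letting the compiler pick the iterator type.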

Performance bottleneck in Jedis

I wrote a test program that uses Jedis to read/write a Redis Cluster. It creates 32 threads, and every thread initializes an instance of JedisCluster(). But creating all 32 JedisCluster instances takes more than half a minute.
Tracking down the problem, I found that the bottleneck is in setNodeIfNotExist():

In the method setNodeIfNotExist() of class JedisClusterInfoCache, “new JedisPool()” costs a lot of time because it uses Apache Commons Pool, which registers each pool as an MBean with the JMX MBean server. This JMX registration is the bottleneck.

The first solution for this problem is to disable JMX when calling JedisCluster():

The second solution is to create one JedisCluster() instance for all threads. After I submitted a patch to Jedis that disables JMX by default, Marcos Nils reminded me that JedisCluster() is thread-safe, because it uses commons-pool to manage its connection resources.

Upgrade to kernel-4.4.1 on CentOS 7

After compiling and installing kernel-4.4.1 (from kernel.org) on my CentOS 7 machine, I rebooted it. But it couldn’t boot up correctly.
Using

to extract the contents of the initramfs and check them, I found that the ‘mpt2sas’ kernel driver had not been added to the initramfs, so the /boot partition could not be loaded.

This problem seems to be common. Because changing the dracut source code or configuration file on all servers is not viable, I chose to add a command to my kernel rpm spec file:

This adds the drivers to the corresponding initramfs file.

But the kernel still could not boot. This time, I found that the kernel command line in GRUB2 looked like:

It looks like we should change it to a UUID. So I added another command to the kernel rpm spec file:

This gets the UUID of the boot disk from /proc/cmdline and puts it into the GRUB2 configuration file.

Now kernel-4.4.1 boots up correctly on CentOS 7.

The importance of declarations in C

Let’s create two C source files,

and compile & link them:

But after running “./test”, the result is

Why does the result of myNumber() become different in the main() function?

Let’s look at the assembly source of test.c (gcc -S test.c)

It only takes the 32-bit result from the function ‘myNumber’ (the %eax register is just 32 bits wide, smaller than a “long long”). We had actually omitted the declaration of myNumber() in test.c, so the compiler assumed that myNumber() returns an int and treated the result as 32 bits wide.
After adding the declaration of myNumber() to test.c, we can see that the assembly source has changed:

(The %rax register is 64 bits wide.)
And the result of running ‘./test’ is correct now.
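The effect of reading only %eax can be mimicked explicitly. The sketch below is written in C++, which rejects calls to undeclared functions outright, so the truncation is done by hand in the hypothetical helper low32(). The value returned by myNumber() is also made up for illustration, since the original source files are not reproduced here.

```cpp
#include <cstdint>

// Hypothetical 64-bit value that does not fit into 32 bits
// (the original function body is not shown in the text).
long long myNumber() {
    return 0x100000002LL;  // 4294967298
}

// What a caller observes if it reads only the low 32 bits of the return
// register (%eax) instead of the full 64 bits (%rax).
std::uint32_t low32(long long value) {
    return static_cast<std::uint32_t>(value);
}
```

Here myNumber() yields 4294967298, while low32(myNumber()) yields 2: the upper half of the value is silently lost, which is the kind of mismatch a missing declaration causes in C. Compiling C code with warnings enabled (for example gcc -Wall) reports the implicit declaration before it turns into a wrong result.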