ext4

“No space left on device” problem I met on ext4

After downloading the whole CC12M dataset from Hugging Face, I wrote a tool to extract all of the image-text pair files into one directory. But after extracting about 17 million files (17681010 exactly), the tool reported this error:

Exception: [Errno 28] No space left on device: '/home/robin/Downloads/cc12m/011647171.txt'

I checked the free space and inodes on my ext4 filesystem, and both seemed to have plenty of capacity left:

# "df -lh"
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.2G  2.0M  3.2G   1% /run
/dev/nvme0n1p1  916G  410G  461G  48% /
tmpfs            16G  412K   16G   1% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
efivarfs        192K  124K   64K  66% /sys/firmware/efi/efivars
/dev/sda2        96M   32M   65M  33% /boot/efi
tmpfs           3.2G   56K  3.2G   1% /run/user/1000
/dev/sdb2       3.7T  2.5T  1.3T  67% /mnt

# "df -i"
Filesystem       Inodes   IUsed    IFree IUse% Mounted on
tmpfs           4077303    1264  4076039    1% /run
/dev/nvme0n1p1 61054976 9308680 51746296   16% /
tmpfs           4077303     104  4077199    1% /dev/shm
tmpfs           4077303       4  4077299    1% /run/lock
efivarfs              0       0        0     - /sys/firmware/efi/efivars
/dev/sda2             0       0        0     - /boot/efi
tmpfs            815460      61   815399    1% /run/user/1000
/dev/sdb2             0       0        0     - /mnt
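
The two numbers that “df -lh” and “df -i” report can also be read from inside a tool before it starts writing. The snippet below is only a minimal sketch using Python's os.statvfs(); the function name free_space_and_inodes is my own, not part of the original tool. Note that in this case both values would have looked healthy, which is exactly why the error was so confusing.

```python
import os

def free_space_and_inodes(path):
    """Return (free_bytes, free_inodes) for the filesystem that holds `path`.

    Both values are the amounts available to unprivileged users,
    which is what "df" shows in its Avail and IFree columns.
    """
    st = os.statvfs(path)
    free_bytes = st.f_bavail * st.f_frsize   # free blocks * fragment size
    free_inodes = st.f_favail                # free inodes for non-root users
    return free_bytes, free_inodes

if __name__ == "__main__":
    free_bytes, free_inodes = free_space_and_inodes(".")
    print(f"free space: {free_bytes / 2**30:.1f} GiB, free inodes: {free_inodes}")
```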

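So why did the ext4 filesystem return a “No space” error? The reason is explained here: https://blog.merovius.de/posts/2013-10-20-ext4-mysterious-no-space-left-on/. In short, as I understand that post, the limit that was hit belongs to the directory rather than to the filesystem: ext4 indexes the entries of a large directory by a hash of the file name, and when the index has no room left for a new name's hash, the kernel reports ENOSPC even though free blocks and inodes remain.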

Running “sudo dumpe2fs /dev/nvme0n1p1” gave:

...
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      e10f88a7-1d8c-4c38-a796-6fa15bdf4e65
...

It seems the hash algorithm used by the “dir_index” feature of my ext4 filesystem is already “half_md4”, so my only choice was to try “tea”. (“half_md4” is the default “hash_alg” when you use “mke2fs”.)

But after I made the change:

sudo tune2fs -E "hash_alg=tea" /dev/nvme0n1p1

the “No space left on device” error still appeared…

There are two solutions left:

Rewrite my tool to generate a few big flat files, each of which contains many of the previous “small files”
Replace ext4 with xfs (I will test this after I get another NVMe SSD)
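
The first option could look roughly like the sketch below. It is an assumption about how such a rewrite might be done, not the tool I actually wrote: the function name pack_into_shards, the tar format, and the shard size of 100,000 files are all illustrative choices. Each (name, data) pair goes into a numbered tar file, so the directory ends up holding a few hundred big files instead of 17 million small ones.

```python
import io
import tarfile

def pack_into_shards(items, shard_prefix, files_per_shard=100_000):
    """Write (name, data) pairs into numbered tar files.

    `items` is any iterable of (str, bytes) pairs. A new shard is started
    every `files_per_shard` entries. Returns the list of shard paths.
    """
    shard_paths = []
    tar = None
    for index, (name, data) in enumerate(items):
        if index % files_per_shard == 0:
            if tar is not None:
                tar.close()
            path = f"{shard_prefix}-{len(shard_paths):05d}.tar"
            tar = tarfile.open(path, "w")
            shard_paths.append(path)
        info = tarfile.TarInfo(name=name)
        info.size = len(data)            # tar needs the size up front
        tar.addfile(info, io.BytesIO(data))
    if tar is not None:
        tar.close()
    return shard_paths
```

Reading a pair back is then a matter of opening the right shard with tarfile.open() and calling extractfile() on the member name.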

An interesting problem about ext4 mounting

When I logged in to my computer this morning and tried to run “tmux attach”, it reported a strange error:

/tmp/tmux-1001/default (Address already in use)

Intuitively, I thought this temporary file was stale, so I typed a command to delete it. But another error popped up: “The filesystem is read-only!”.

By looking at the mount points with “mount|grep ro,”, I noticed that my root directory was mounted with the “read-only” option. Checking /etc/fstab:

/dev/disk/by-uuid/69bf5a7f-4031-4a6d-b877-f83fc73a4440 / ext4 rw,discard,data=writeback, 0 1

I guessed that one of the mount options was wrong, so the operating system only mounted the filesystem read-only.

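After removing the options one by one and rebooting the machine many times, it turned out that “data=writeback” was the offending option. (ext4 itself does accept “data=writeback”; as far as I can tell, the catch is that the journaling mode of an already mounted filesystem cannot be changed on remount, and the root filesystem is first mounted with the default mode before /etc/fstab is applied, so the remount to read-write fails.)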

When I tried to modify /etc/fstab, the system reported “you can’t change file because the root filesystem is read-only”. It seemed I was trapped in a dead loop… so I used my final weapon:

sudo mount -o remount,rw /dev/nvme0n1p2 /

And it worked.

Now, with /etc/fstab set as follows, the ext4 filesystem is mounted with both read and write permission:

/dev/disk/by-uuid/69bf5a7f-4031-4a6d-b877-f83fc73a4440 / ext4 rw,discard,noatime 0 1
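
The read-only check done above with “mount|grep ro,” can also be scripted. The sketch below assumes the /proc/mounts format (device, mount point, type, options, dump, pass); the function name read_only_mounts is mine. It compares whole options, so an option such as errors=remount-ro, which a plain grep for “ro,” would also match, is not mistaken for a read-only mount.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains the exact option 'ro'.

    `mounts_text` is expected to look like the contents of /proc/mounts.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue                      # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result
```

On a Linux machine it can be fed with the text read from /proc/mounts; if “/” shows up in the result, the root filesystem is read-only.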

Too many “ext4-dio-unwrit” processes in system

After putting pressure on an application that writes a tremendous amount of data into an ext4 file system, we saw many “ext4-dio-unwrit” kernel threads on the “top” screen. Many people say this is normal, so I checked the ext4 source code in the 2.6.32 Linux kernel.
Writing a file starts in the write-back kernel thread, which calls generic_writepages() and then ext4_write_page():

ext4_write_page()
    --> ext4_set_bh_endio()
        --> ext4_end_io_buffer_write()
            --> ext4_add_complete_io()

Let’s look at ext4_add_complete_io():

/* Add the io_end to per-inode completed end_io list. */
void ext4_add_complete_io(ext4_io_end_t *io_end)
{
    struct ext4_inode_info *ei = EXT4_I(io_end->inode);
    struct workqueue_struct *wq;
    unsigned long flags;
    BUG_ON(!(io_end->flag & DIO_AIO_UNWRITTEN));
    wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
    spin_lock_irqsave(&ei->i_completed_io_lock, flags);
    if (list_empty(&ei->i_aio_dio_complete_list)) {
        io_end->flag |= DIO_AIO_QUEUED;
        queue_work(wq, &io_end->work);
    }
    list_add_tail(&io_end->list, &ei->i_aio_dio_complete_list);
    spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
}

It puts the “io_end” into the work queue “dio_unwritten_wq”, which lives in ext4_sb_info. But where does “dio_unwritten_wq” come from? In fs/ext4/super.c:

static int ext4_fill_super(struct super_block *sb, void *data, int silent)
{
......
    EXT4_SB(sb)->dio_unwritten_wq = create_workqueue("ext4-dio-unwritten");
    if (!EXT4_SB(sb)->dio_unwritten_wq) {
        printk(KERN_ERR "EXT4-fs: failed to create DIO workqueue\n");
        goto failed_mount_wq;
    }
......

Oh, it is the “ext4-dio-unwritten” kernel thread! (“top” shows the name truncated to 15 characters, hence “ext4-dio-unwrit”.) So the problem is solved: the application dirties pages in the file system cache, the write-back kernel thread writes these dirty pages into the specific file system (ext4 in this article), and finally ext4 puts the io into the work queue “ext4-dio-unwritten”, where it waits to have its unwritten extent converted into a written extent.
Therefore, if we don’t have any unwritten extents in ext4 (for example, when just using the write() system call to append to a normal file), the “ext4-dio-unwritten” kernel threads will exist but will not use any CPU.
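
The queueing rule in ext4_add_complete_io() can be mimicked with a toy model. This is only an illustration in Python, not kernel code: InodeModel, add_complete_io and run_worker are names I made up, and the worker draining the inode's whole list in one go is my assumption about what the work item does. What it does show faithfully is the rule from the C code above: a work item is queued only when the per-inode list was empty, and every io_end is then appended to that list.

```python
from collections import deque

class InodeModel:
    """Toy stand-in for the per-inode state (models i_aio_dio_complete_list)."""
    def __init__(self):
        self.completed = deque()

def add_complete_io(inode, io_end, workqueue):
    """Queue one work item only if the inode's list was empty, then append."""
    if not inode.completed:
        workqueue.append(inode)       # models queue_work(wq, &io_end->work)
    inode.completed.append(io_end)    # models list_add_tail(...)

def run_worker(workqueue, convert):
    """Assumed worker: drain each queued inode's list, converting every io_end."""
    while workqueue:
        inode = workqueue.popleft()
        while inode.completed:
            convert(inode.completed.popleft())
```

With this rule, a burst of completions on one inode costs a single wake-up of the worker, and when nothing is ever added the worker simply stays idle.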

About ext4 file system

Recently I gave a presentation about my past work on the Linux kernel, and I dug out the slides I made four years ago (when I was in the Kernel Team of Taobao Corp.). They are for anybody who is interested in file systems 🙂

why we need ext4 from Robin Dong

China Linux Storage & Filesystem 2015 workshop (second day)

Zheng Liu from Alibaba led the topic about ext4. The most important change in the EXT-series filesystems this year is that ext3 is gone: in the latest kernel (actually, in CentOS 7.0) people can only use ext3 by mounting it through ext4 with special arguments. The encryption feature has been completed in ext4.
Robin Dong (yes, it’s me) from Alibaba gave a presentation about cold storage (Slide is here). We developed a distributed storage system based on a small open-source project called “sheepdog”, and modified it heavily to improve data recovery performance and to make sure it can run on low-end but high-density storage servers.

Discussion in tea break

Yanhai Zhu from Alibaba (we have done so much work on storage) led a topic about caching in virtual machine environments. Alibaba chose Bcache as the code base for developing new cache software.
Robin: Why Bcache? Why not flashcache?
Yanhai: I started my work on flashcache first, but flashcache is not suitable for the production environment. First, flashcache is unfriendly to sequential writes. Second, it uses a hash data structure to distribute IO requests at the beginning, which splits up the cache data in a multi-tenant environment. Bcache uses a B-tree instead of a hash table to store data, which suits our requirements better.
They use an aggressive write-back strategy on the cache. It works very well because the cache makes the write IOs sequential and makes it easy for the backend to absorb pressure peaks.
The last topic was led by Zhongjie Wu from Memblaze, a famous Chinese startup in flash storage technology. It was about NVDIMM, the hottest hardware technology of recent years. An NVDIMM is not expensive; it is only a DDR DIMM with a capacitor. Memblaze has developed a new 1U storage server with an NVDIMM and many flash cards. It runs their own OS and can use Fibre Channel or Ethernet to connect to clients. The main purpose of the NVDIMM is to reduce latency, and they use a write-back strategy (of course).
The big problem they face with NVDIMM is that the CPU can’t flush the data in its L1 cache to the NVDIMM when the whole server loses power. To solve this problem, Memblaze uses write-combining across the CPU cores; it hurts performance a little but avoids losing data in the end.

All the attendees of CLSF 2015

Articles from other attendees:
https://blogs.oracle.com/linuxkernel/entry/china_linux_storage_and_file1

China Linux Storage & Filesystem 2014 workshop (PPT or PDF for download)

ext4: link
f2fs: link
ocfs2: link
ubifs: link
btrfs: link

China Linux Storage & Filesystem 2014 workshop (second day)

The first topic on the second day of CLSF 2014 was about NFS, led by Tao Peng from PrimaryData. The NFS protocol has been updated to 4.2, and the main job of the NFS community is to implement on the server side many features (such as “server side copy”) that are already used in local file systems.

Then Liang Xie, a distributed software developer at Xiaomi, introduced the basic infrastructure of Xiaomi Cloud Storage and reported some severe problems in the IO stack. The first problem is that heavy write pressure causes long latency on the ext4 file system, and the latency becomes short if the local file system is replaced by xfs.
Zheng Liu (from Alibaba): The journal implementation in xfs is better than the one in ext4. When a large number of write operations come to ext4, it has to write a checkpoint in the journal and flush the disk, which may take a long time. I think you could try the ‘no journal’ mode in ext4, which was developed by the Google guys.

#note: the way to use no journal mode in ext4
mkfs.ext4 -O ^has_journal, .... /dev/sdx

Another problem is that Xiaomi wants to use the deadline io-scheduler, but cgroup cannot be used with ‘deadline’.
Coly Li (from Alibaba): I suggest you try tpps, a simple and effective io-scheduler in ali_kernel.
The next topic, about ext4, was hosted by Zheng Liu. This year ext4 has added no new features (maybe that’s why it is so stable). In Google’s Chrome OS, they want to store something like cookies for every user, so they need to add an encryption feature to ext4. We asked why Chrome OS does not use eCryptfs on top of ext4. Zheng Liu’s answer: the requirement came from Google itself, so no one knows why. Ext4 also added the new options “sparse_super2” (to store the super block only at the beginning and the end of the ext4 groups) and “packed_meta_blocks” (to squeeze all of the ext4 metadata into the beginning of the disk, mainly for SMR disks).

The last topic was about OSv, the most popular and cutting-edge OS at this conference. OSv is an operating system designed for virtual machines and cloud environments. It trims the IO and network stacks, which makes it very fast and efficient. The JVM and some scripting languages (such as Python, Ruby, and Node.js) can already run on OSv, so I consider that it has won a large part of the cloud market, because it can run Hadoop/Spark and many front-end web applications.