Robin on Linux – Page 13 – All about technology

Run cron jobs in Kubernetes

In the old days of single machines, we used /etc/crontab to schedule jobs. But in the new age of distributed systems, especially Kubernetes, what should we use?

The first choice that comes to mind is a CronJob, like this one:

apiVersion: batch/v1beta1  # on Kubernetes 1.21 and later, use batch/v1 (v1beta1 was removed in 1.25)
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            imagePullPolicy: IfNotPresent
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure

Then my colleague recommended CronWorkflow (which needs at least Argo v2.5) to me. Its configuration is not very different:

apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: test-cron-wf
spec:
  schedule: "* * * * *"
  concurrencyPolicy: "Replace"
  startingDeadlineSeconds: 0
  workflowSpec:
    entrypoint: whalesay
    templates:
    - name: whalesay
      container:
        image: alpine:3.6
        command: [sh, -c]
        args: ["date; sleep 90"]

But CronWorkflow has an advantage over CronJob: the UI. We can use the web UI to access and manage different types of workflows.

The UI also makes it very easy to access pod logs.
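
A side note on the two schedules used above: "*/1 * * * *" and "* * * * *" mean the same thing, because a step of 1 over every minute is simply every minute. The sketch below is a toy expansion of the minute field only, not the parser that Kubernetes or Argo actually uses. The function name expand_minute_field is made up for illustration, and it only handles "*", "*/n", "a-b", "a-b/n" and comma-separated lists.

```python
def expand_minute_field(field):
    """Expand one cron minute field into the sorted list of matching minutes."""
    minutes = set()
    for part in field.split(","):
        # Split an optional "/step" suffix off the range part.
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            low, high = 0, 59
        elif "-" in base:
            low, high = (int(value) for value in base.split("-"))
        else:
            low = high = int(base)
        minutes.update(range(low, high + 1, step))
    return sorted(minutes)


if __name__ == "__main__":
    # Both schedules from the manifests fire on every minute of the hour.
    print(expand_minute_field("*/1") == expand_minute_field("*"))
    print(expand_minute_field("*/15"))
```

So the only real differences between the two manifests are the resource kind and the extra fields, not the timing.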

A struggle to keep the accuracy

This August, we got an evaluation accuracy of 0.83 on the DIB-10K dataset. But since we updated the dataset last month, the accuracy has only reached 0.82.

The first suspect was the Weight Standardization method we used for micro-batch training (since the model is too big). So I tried gradient accumulation instead, using this snippet as an example because it doesn’t require heavy changes to my code:

model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)
    loss = loss / accumulation_steps
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:
        optimizer.step()
        model.zero_grad()
        if (i+1) % evaluation_steps == 0:
            evaluate_model()
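
Why the loss is divided by accumulation_steps: gradients add up across backward() calls, so scaling each micro-batch loss by 1/accumulation_steps makes the accumulated gradient equal to the gradient of one big batch, as long as the loss is a mean over the batch and the micro-batches have equal size. Here is a minimal sketch in plain Python that checks this, with a toy one-parameter linear model and a hand-written MSE gradient standing in for the real network. All names and numbers are illustrative.

```python
import math


def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum the micro-batch gradients, each scaled by 1 / accumulation_steps."""
    accumulation_steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(accumulation_steps):
        start = i * micro_batch_size
        stop = start + micro_batch_size
        # Dividing here mirrors "loss = loss / accumulation_steps".
        total += grad_mse(w, xs[start:stop], ys[start:stop]) / accumulation_steps
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2]
full_batch = grad_mse(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)

if __name__ == "__main__":
    print(math.isclose(full_batch, accumulated))
```

This equivalence does not cover layers whose behaviour depends on batch statistics, such as batch normalization, which still see only the micro-batch.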

But after changing my code and retraining the model, the accuracy still stayed around 0.82:

Epoch     4: reducing learning rate of group 0 to 1.0000e-01.
[2020-12-16 05:53:29] Eval accuracy: 0.8283 | Train accuracy: 0.8187
[2020-12-16 10:01:40] Eval accuracy: 0.8284 | Train accuracy: 0.8938
[2020-12-16 14:11:35] Eval accuracy: 0.8284 | Train accuracy: 0.8313
Epoch     7: reducing learning rate of group 0 to 5.0000e-02.
[2020-12-16 18:21:47] Eval accuracy: 0.8285 | Train accuracy: 0.8750
[2020-12-16 22:31:19] Eval accuracy: 0.8285 | Train accuracy: 0.8313
[2020-12-17 02:41:37] Eval accuracy: 0.8284 | Train accuracy: 0.8625
Epoch    10: reducing learning rate of group 0 to 2.5000e-02.
[2020-12-17 06:52:05] Eval accuracy: 0.8286 | Train accuracy: 0.8500
[2020-12-17 11:02:11] Eval accuracy: 0.8285 | Train accuracy: 0.8063
[2020-12-17 15:12:23] Eval accuracy: 0.8286 | Train accuracy: 0.8375
Epoch    13: reducing learning rate of group 0 to 1.2500e-02.
[2020-12-17 19:22:04] Eval accuracy: 0.8285 | Train accuracy: 0.8313

This makes me really desperate. Maybe I should put this task aside for a while and move on to other work.

Directly update git commit

Previously, when I wanted to update my git branch, I just committed new modifications again and again. This made my branch log quite ugly.

Then my colleague introduced git commit --amend to me, which has helped me a lot. Now, I just use

git commit -a --amend --no-edit

to fold my modifications into the previous commit and

git push -f origin my_branch_name

to overwrite my branch on the remote. (A force push rewrites history, so only do this on a branch nobody else is working on.)

And if I also want to change the message of my previous commit, I only need to

git commit -a --amend -m "Let's rock again"

Get the schema of a parquet file

Previously, I just used this snippet to get all the column names of a parquet file:

import pandas as pd
df = pd.read_parquet("hello.parquet")
print(list(df.columns))

But if the parquet file is large (not even very large; 1GB is enough), it causes an OOM in my small VM (about 4GB RAM).

Actually, what I want is just the column names, not the whole data. Since the parquet format is well structured and stores its schema in the file metadata, there must be some way to read only the schema instead of all the data.

And, here it is:

import pyarrow.parquet as pq
schema = pq.read_schema("hello.parquet", memory_map=True)
print(list(schema.names))
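
This can work without loading the data because a parquet file keeps its schema in a footer at the end of the file: the last 8 bytes are a 4-byte little-endian footer length followed by the magic bytes PAR1. The sketch below only reads that footer length with the standard library, to show how little has to be read. It assumes an unencrypted file and does not parse the footer itself (pyarrow does that). The function name parquet_footer_length is hypothetical.

```python
import os
import struct

MAGIC = b"PAR1"


def parquet_footer_length(path):
    """Return the size in bytes of the footer metadata of a parquet file."""
    with open(path, "rb") as f:
        # Jump straight to the last 8 bytes; the data pages are never read.
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic at the end of the file")
    (length,) = struct.unpack("<I", tail[:4])
    return length


if __name__ == "__main__":
    # "hello.parquet" is the example file name used in the post.
    print(parquet_footer_length("hello.parquet"))
```

For a 1GB file the footer is usually tiny by comparison, which is why reading only the schema fits easily into a small VM.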

Export YOLOv5 models for mobile device

Somebody has already done the work of exporting YOLOv5 models to TFLite. To use it, we only need to:

git clone --single-branch --branch tf-export https://github.com/zldrobit/yolov5.git
cd yolov5
# it will download all pytorch models
sh -x weights/download_weights.sh
# export a tflite model from yolov5l
PYTHONPATH=. python3 models/tf.py --weights yolov5l.pt --cfg models/yolov5l.yaml --img 640
# there will be a tflite model file
ls yolov5l-fp16.tflite

The model file yolov5l-fp16.tflite is 91MB, which is a little big but can still fit on a mobile phone.

Attach references on arXiv.org

About a month ago, my former colleague Jian Mei and I published a paper about ornithology images on arXiv.org. But two days ago, Jian Mei found that I had written the wrong numbers of images and categories.

To update the paper, I just packed my .tex and .bib files into a zip and uploaded it to arXiv.org to replace my old pdf. But something strange happened: all the references were lost.

Then I read the documentation. Since arXiv.org only accepts a .bbl file, not a .bib file, I needed to compile the .bib into a .bbl file myself.

Could Overleaf do it? Fortunately, it could. Just upload the zip file and click the “Logs and output files” icon.

Then click “Other logs & files” and choose “bbl file”. The bbl file will be downloaded. Finally, rename the bbl file so that it has the same name as the .tex file (apart from the .bbl suffix) and pack them into a zip file again.

This time, all references came back to the paper.
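
The renaming and repacking step can be scripted. This is only a rough sketch using Python’s standard zipfile module; the function name pack_for_arxiv and the file names are hypothetical, and a real submission may need more files (figures, style files) than these two.

```python
import zipfile
from pathlib import Path


def pack_for_arxiv(tex_path, bbl_path, zip_path):
    """Zip a .tex file together with a .bbl file that shares its base name."""
    tex, bbl = Path(tex_path), Path(bbl_path)
    if tex.stem != bbl.stem:
        # The .bbl must be named after the .tex file, e.g. paper.tex / paper.bbl.
        raise ValueError(f"{bbl.name} should be named {tex.stem}.bbl")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(tex, arcname=tex.name)
        archive.write(bbl, arcname=bbl.name)
    return zip_path


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    pack_for_arxiv("paper.tex", "paper.bbl", "paper.zip")
```

The name check is there because a mismatched .bbl name is an easy mistake to make after downloading the file.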

Separate pip environment

A few days ago, I thought Anaconda could isolate different Python environments, so I created a new conda environment to install a newer version of a pip package. Unsurprisingly, it failed.

What I found: in my setup, conda isolated only the conda packages, not the pip packages. (This usually happens when the new environment has no pip of its own, so the pip on the PATH still belongs to another installation.) To isolate the pip packages, I used virtualenv.

# Create new env
virtualenv -p /home/my/anaconda3/bin/python3.6 venv_my
# Enter the new env
source venv_my/bin/activate
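
To confirm which environment pip packages will actually land in, it helps to ask the interpreter itself. Here is a small sketch using only the standard library; the function names are my own. Note that it detects virtualenv/venv environments only, so inside a plain conda environment in_virtualenv() is expected to return False.

```python
import sys
import sysconfig


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtualenv or venv."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases make sys.prefix differ from sys.base_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


def describe_environment():
    """Collect the paths that decide where packages are installed and imported from."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "site_packages": sysconfig.get_paths()["purelib"],
        "in_virtualenv": in_virtualenv(),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Run it with the python of the activated environment; if site_packages points outside venv_my, the activation did not take effect.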

The awesome YOLOv5

I just found the YOLOv5 repository on GitHub. Its models are not only accurate and fast but also easy to get and use.

Just download the code and install the dependencies. Then you can run a simple command:

python3 detect.py --weights yolov5l.pt

Then it will automatically download the YOLO v5 Large model, process all images in inference/images, and put the annotated images into inference/output.

Let’s see some images predicted by YOLO v5 Large:

Every image took no more than 2 seconds to predict on my laptop. Isn’t it awesome? At least it’s much better than my experiments…

TabNet: a new neural-network architecture for tabular data

Neural networks seem to be used mostly for computer vision and natural language processing, while tree models like GBDT are mainly used for tabular data.

But why?

Although this article tries to explain it, I didn’t find it very convincing. In my humble opinion, neural networks could eventually surpass, or at least be competitive with, GBDT models.

For example, the paper <TabNet: Attentive Interpretable Tabular Learning> describes a Transformer-like model that imitates tree models. The PyTorch implementation is here. I have used it on our own data and it finally reached 90% accuracy (the accuracy of LightGBM is 93%). Despite the lower accuracy, this is the first neural model to reach 90% accuracy on our private data. The authors have done a great job.

Some tips for using bash and Jsonnet

To build an integration test with Argo, I used bash to run scripts and Jsonnet to orchestrate the Argo workflow. As expected, I ran into a bunch of problems.

1. problem about base64

[default]
aws_access_key_id = apple
aws_secret_access_key = banana

For this text file (my.ini), we can use base64 to encode it and assign the result to an environment variable:

export MY=`cat my.ini |base64`
echo $MY|base64 -d

But it reports the error ‘invalid input’:

[default]
aws_access_key_id = apple
aws_secret_access_keybase64: invalid input

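The cause: base64 wraps its output at 76 characters, and the unquoted $MY lets the shell turn those line breaks into spaces, which base64 -d rejects. One way around it is -i, --ignore-garbage (quoting the variable as "$MY", or encoding with base64 -w 0, also avoids the problem):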

echo $MY|base64 -di
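
The same effect can be reproduced outside the shell. In this sketch Python’s base64 module stands in for the base64 command (an assumption for illustration, they are separate implementations): encodebytes() wraps at 76 characters like the command does by default, strict decoding rejects the spaces, and lenient decoding skips them much like -i does.

```python
import base64
import binascii

text = b"[default]\naws_access_key_id = apple\naws_secret_access_key = banana\n"

# encodebytes() inserts a newline after every 76 output characters.
wrapped = base64.encodebytes(text)
# An unquoted $MY lets the shell replace those newlines with spaces.
flattened = wrapped.replace(b"\n", b" ")


def strict_decode_fails(data):
    """Return True if strict decoding rejects the input."""
    try:
        base64.b64decode(data, validate=True)
    except binascii.Error:
        return True
    return False


# The default (validate=False) discards characters outside the base64 alphabet.
lenient = base64.b64decode(flattened)

if __name__ == "__main__":
    print(strict_decode_fails(flattened))
    print(lenient == text)
```

The first 76 characters encode exactly 57 bytes, which is why the failing output above stops right after aws_secret_access_key.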

2. problem about exit status of Bash

If a step (or a pod) in an Argo workflow fails, the whole workflow fails. That’s the behaviour we want. But if we run the diff command several times in one step, a failing diff (for diff, failure means the two files differ) does not fail the step unless it is the last command, because a script’s exit status is that of its last command. So I need to check the status of each diff command. Fortunately, it’s very easy:

diff a b || exit 1
diff c d || exit 1
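
The difference is easy to observe from outside the script. In this sketch, false stands in for a diff that found differences and true for a later command that succeeds. It assumes a POSIX sh is available on the PATH, and the helper name exit_status is my own.

```python
import subprocess


def exit_status(script):
    """Run a script with sh -c and return its exit status."""
    return subprocess.run(["sh", "-c", script]).returncode


if __name__ == "__main__":
    # The failure of the first command is lost: the last command decides.
    print(exit_status("false; true"))
    # With the guard, the script stops at the first failure.
    print(exit_status("false || exit 1; true"))
```

Starting the script with set -e is another common way to get the same stop-on-first-failure behaviour.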

3. writing multi-line strings in Jsonnet

local test_script = |||
  diff a b || exit 1
  diff c d || exit 1
|||;
...
  args = [test_script]