Robin on Linux – Page 12 – All about technology

Export YOLOv5 models for mobile device

Somebody has already done the work of exporting YOLOv5 models to the TFLite format. To use it, we only need to:

git clone --single-branch --branch tf-export https://github.com/zldrobit/yolov5.git
cd yolov5
# it will download all pytorch models
sh -x weights/download_weights.sh
# export a tflite model from yolov5l
PYTHONPATH=. python3 models/tf.py --weights yolov5l.pt --cfg models/yolov5l.yaml --img 640
# there will be a tflite model file
ls yolov5l-fp16.tflite

The model file yolov5l-fp16.tflite is 91MB, which is a little too big but can still fit on a mobile phone.

Attach references on arXiv.org

About one month ago, my old colleague Jian Mei and I published a paper about ornithology images on arXiv.org. But two days ago, Jian Mei found out that I had written the wrong number of images and categories.

To update the paper, I just packed my .tex and .bib files into a zip and uploaded it to arXiv.org to replace my old PDF. But something strange happened: all the references were lost.

Then I read the documentation. Since arXiv.org only supports .bbl files instead of .bib files, I needed to compile the .bib file into a .bbl file myself.

Could Overleaf do it? Fortunately, it can. Just upload the zip file and click the “Logs and output files” icon.

Then click “Other logs & files” and choose “bbl file”. The .bbl file will be downloaded. Finally, rename the .bbl file so that it has the same base name as the .tex file (only the suffix differs) and pack them into a zip file again.

This time, all references came back to the paper.
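The rename-and-repack step can also be scripted. Here is a minimal sketch in Python, assuming only the .tex file and the downloaded .bbl file need to go into the archive; the helper name pack_for_arxiv and the file names are placeholders of mine, and figures or style files would have to be added in the same way.

```python
import shutil
import zipfile
from pathlib import Path


def pack_for_arxiv(tex_file, bbl_file, zip_path):
    """Copy the .bbl next to the .tex under the same base name, then zip both."""
    tex_file = Path(tex_file)
    bbl_file = Path(bbl_file)
    # The .bbl must share the base name of the .tex file, e.g. paper.tex -> paper.bbl.
    target_bbl = tex_file.with_suffix(".bbl")
    if bbl_file.resolve() != target_bbl.resolve():
        shutil.copyfile(bbl_file, target_bbl)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(tex_file, arcname=tex_file.name)
        archive.write(target_bbl, arcname=target_bbl.name)
    # Re-open the archive to report what was actually stored.
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())
```

For example, pack_for_arxiv("paper.tex", "output.bbl", "upload.zip") would produce an archive holding paper.tex and paper.bbl.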

Separate pip environment

A few days ago, I assumed that Anaconda could isolate different Python environments, so I created a new conda environment to install a newer version of a pip package. It failed.

In my case, conda only isolated the conda packages, not the pip packages. (This can happen, for example, when the conda environment has no Python and pip of its own, so pip falls back to the base installation.) To isolate the pip packages, we can use virtualenv:

# Create new env
virtualenv -p /home/my/anaconda3/bin/python3.6 venv_my
# Enter the new env
source venv_my/bin/activate
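To confirm that the activated environment is really the one being used, a quick check from inside Python helps. This is a small sketch, not part of virtualenv itself; the function name in_virtualenv is my own.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # venv and recent virtualenv releases point sys.base_prefix at the parent
    # installation; very old virtualenv releases set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtualenv())
```

After source venv_my/bin/activate, the printed interpreter path should point into venv_my/bin.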

The awesome YOLOv5

I just found the YOLOv5 repository on GitHub. Not only are the models accurate and fast, they are also easy to get and to use.

Just download the code and install the dependencies. Then you can run a single command:

python3 detect.py --weights yolov5l.pt

Then it will automatically download the YOLO v5 Large model, process all images in inference/images, and put the annotated images into inference/output.

Let’s see some images predicted by YOLO v5 Large:

Each image took no more than 2 seconds to predict on my laptop. Isn’t it awesome? At least it’s much better than my own experiments…

TabNet: a new neural-network architecture for tabular data

Neural networks seem to be used mostly in Computer Vision and Natural Language Processing scenarios, while tree models like GBDT are mainly used for tabular data.

But why?

Although this article tries to explain it, I did not find the explanation convincing. In my humble opinion, neural networks could eventually surpass, or at least be competitive with, GBDT models.

For example, the paper <TabNet: Attentive Interpretable Tabular Learning> describes an attention-based model that imitates tree models. The PyTorch implementation is here. I used it on our own data and it reached 90% accuracy (the accuracy of LightGBM is 93%). In spite of the lower accuracy, this is the first neural model to reach 90% accuracy on our private data. The authors have done a great job.

Some tips for using bash and Jsonnet

To build an integration test with Argo, I used bash to run scripts and Jsonnet to orchestrate the Argo workflow. As expected, I ran into a bunch of problems with them.

1. Problem about base64

[default]
aws_access_key_id = apple
aws_secret_access_key = banana

For this text file (my.ini), we can use base64 to encode it and assign the result to an environment variable:

export MY=`cat my.ini |base64`
echo $MY|base64 -d

But decoding it reports the error ‘invalid input’:

[default]
aws_access_key_id = apple
aws_secret_access_keybase64: invalid input

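The cause is that GNU base64 wraps its output at 76 characters by default, and the unquoted $MY turns the line break into a space, which the decoder rejects. One way to fix it is to use -i, --ignore-garbage: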

echo $MY|base64 -di
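The same round trip can be sketched in Python, where the wrapping issue does not come up because b64encode() returns a single unwrapped line. The variable name MY and the file content are taken from the example above; everything else is just illustration.

```python
import base64
import os

ini_text = "[default]\naws_access_key_id = apple\naws_secret_access_key = banana\n"

# b64encode() does not wrap its output, so the value is a single line
# and survives being stored in an environment variable.
os.environ["MY"] = base64.b64encode(ini_text.encode("utf-8")).decode("ascii")

# Decoding the variable gives back the original text, line breaks included.
decoded = base64.b64decode(os.environ["MY"]).decode("utf-8")
print(decoded, end="")
```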

2. Problem about the exit status in Bash

If a step (or a pod) in an Argo workflow fails, the whole workflow fails. That’s the behaviour we want. But if we run the diff command several times in one step, a failing diff (for diff, a non-zero status means the two files are not the same) does not fail the step unless it is the last command, because the exit status of a script is the status of its last command. So I need to check the status of each diff command. Fortunately, it’s very easy:

diff a b || exit 1
diff c d || exit 1
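If the comparison step is written in Python instead of bash, the same fail-fast idea can be sketched with the standard filecmp module. The helper names first_mismatch and exit_code_for are mine, and unlike diff this only reports whether the files differ, not where.

```python
import filecmp


def first_mismatch(pairs):
    """Return the first (left, right) pair whose contents differ, or None."""
    for left, right in pairs:
        # shallow=False compares the file contents, not just size and mtime.
        if not filecmp.cmp(left, right, shallow=False):
            return (left, right)
    return None


def exit_code_for(pairs):
    """Map the comparison result to a process exit status (0 = all identical)."""
    return 0 if first_mismatch(pairs) is None else 1
```

A step script would then end with sys.exit(exit_code_for([("a", "b"), ("c", "d")])) so that Argo sees a non-zero status as soon as any pair differs.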

3. Writing multi-line strings in Jsonnet

Jsonnet text blocks are delimited by |||, and the common leading indentation is stripped:

local test_script = |||
  diff a b || exit 1
  diff c d || exit 1
|||;
...
  args: [test_script],

Efficient reading in pandas

My previous code read all the data and then kept only the one column that I needed:

import pandas as pd
df = pd.read_csv("data.csv")["card_id"]

In the test environment, this program used more than 10GB of memory because the data file is so large.

To reduce the memory usage, I switched to usecols:

import pandas as pd
df = pd.read_csv("data.csv", usecols=["card_id"])

After that, the program used less than 1GB of memory.

One caveat: the parameter is not named the same everywhere. read_csv() takes usecols, while read_sql() and read_parquet() take columns to read only selected columns.
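The difference between the two approaches can be seen on a tiny in-memory file. This is only a sketch: the amount and city columns are made up for the example, and a three-row file obviously says nothing about the memory savings on real data.

```python
import io

import pandas as pd

# A made-up CSV; only card_id comes from the example above.
csv_text = "card_id,amount,city\n1,10.5,Paris\n2,20.0,Rome\n3,7.25,Oslo\n"

# Read everything, then select: every column is parsed and kept in memory first.
full = pd.read_csv(io.StringIO(csv_text))
# Read only the needed column: the other columns are dropped while parsing.
only_ids = pd.read_csv(io.StringIO(csv_text), usecols=["card_id"])

print(list(full.columns))
print(list(only_ids.columns))
```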

Compare two tables in BigQuery

According to this answer, the best solution for comparing two tables in BigQuery is:

(
  SELECT * FROM table1
  EXCEPT DISTINCT
  SELECT * FROM table2
)
UNION ALL
(
  SELECT * FROM table2
  EXCEPT DISTINCT
  SELECT * FROM table1
)

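But in my test, two tables with the same rows were reported as different by the above snippet. It turned out that the order of the columns was different: SELECT * returns the columns in table order, and the set operators match columns by position, not by name. The order of the rows may differ too, although EXCEPT DISTINCT itself does not depend on it. So a more robust solution is to list the columns explicitly (the ORDER BY clauses only fix the row order of each sub-query):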

(
  (
  SELECT col1, col2, col3, col4
  FROM table1
  ORDER BY col1, col2
  )
  EXCEPT DISTINCT
  (
  SELECT col1, col2, col3, col4
  FROM table2
  ORDER BY
  col1, col2
  )
)
UNION ALL
(
  (
  SELECT col1, col2, col3, col4
  FROM table2
  ORDER BY col1, col2
  )
  EXCEPT DISTINCT
  (
  SELECT col1, col2, col3, col4
  FROM table1
  ORDER BY
  col1, col2
  )
)
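The logic of the query can be modelled with plain Python sets, which makes the role of the column order easy to see. This is a simplified sketch, not BigQuery code: each row is a tuple compared position by position, and the table contents are invented. In this model, rows in a different order compare as equal, while a different column order shows up as a difference.

```python
def table_diff(rows1, rows2):
    """Rows that appear in exactly one of the two tables."""
    set1, set2 = set(rows1), set(rows2)
    # (table1 EXCEPT DISTINCT table2) UNION ALL (table2 EXCEPT DISTINCT table1)
    return (set1 - set2) | (set2 - set1)


# Hypothetical two-column tables.
table1 = [("robin", "hood"), ("lion", "heart")]
reordered_rows = [("lion", "heart"), ("robin", "hood")]
swapped_columns = [("hood", "robin"), ("heart", "lion")]

print(table_diff(table1, reordered_rows))        # empty: same rows, other order
print(len(table_diff(table1, swapped_columns)))  # 4: every row is reported
```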

Some tips about pandas, again

1. pd.merge() may change the names of the original columns:

import pandas as pd
df1 = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
df2 = pd.DataFrame(data={"name": ["lion", "heart"], "age": [50, 60]})
merged = pd.merge(df1, df2, how="outer", on="name")
print(merged)

The output will not have a column named age, but two new columns named age_x and age_y instead. So when you merge two tables with many columns, be aware that the column names may change.
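Two ways to keep control over the resulting names are sketched below, reusing the two frames from above: the suffixes parameter of pd.merge() replaces the default _x and _y, and merging on every shared column avoids the renaming altogether. The suffix strings _left and _right are arbitrary choices of mine.

```python
import pandas as pd

df1 = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
df2 = pd.DataFrame(data={"name": ["lion", "heart"], "age": [50, 60]})

# Option 1: choose the suffixes for the overlapping column explicitly.
renamed = pd.merge(df1, df2, how="outer", on="name", suffixes=("_left", "_right"))
print(list(renamed.columns))

# Option 2: merge on all shared columns, so no column overlaps and none is renamed.
on_both = pd.merge(df1, df2, how="outer", on=["name", "age"])
print(list(on_both.columns))
```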

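2. Use iterrows() to traverse the rows of a DataFrame:

import pandas as pd
from multiprocessing import Pool
def process(item):
    # iterrows() yields (index, row) tuples, so the row is the second element
    index, row = item
    print(row)
if __name__ == "__main__":
    df = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
    # The context manager shuts the worker processes down when the work is done
    with Pool(6) as pool:
        pool.map(process, df.iterrows())

If we directly use pool.map(process, df), it will traverse the column names of the DataFrame instead of its rows.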
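The difference is easy to check without any worker processes. The short sketch below shows what each way of iterating yields; itertuples() is included as an alternative I would consider, since it produces plain tuples instead of a pd.Series per row.

```python
import pandas as pd

df = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})

# Iterating the DataFrame itself yields the column labels, not the rows.
column_labels = list(df)
print(column_labels)

# iterrows() yields (index, row) pairs, where each row is a pd.Series.
indexed_rows = [(index, row["name"], row["age"]) for index, row in df.iterrows()]
print(indexed_rows)

# itertuples(index=False, name=None) yields one plain tuple per row.
plain_rows = list(df.itertuples(index=False, name=None))
print(plain_rows)
```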

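3. How to append a pd.Series to a pd.DataFrame. According to this article, the easiest way is the following (note that DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0, so this needs an older pandas):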

import pandas as pd
df = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
series = pd.Series(["water", 50], index=["name", "age"])
print(df.append(series, ignore_index=True))

The result is

    name  age
0  robin   40
1   hood   30
2  water   50

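Or, we can give the pd.Series a name and remove ignore_index. This gives the same rows, except that the new row is labelled with the name of the Series instead of 2.

If the pd.Series is created without an index, so that its labels are 0 and 1 instead of the column names, the result will become: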

    name   age      0     1
0  robin  40.0    NaN   NaN
1   hood  30.0    NaN   NaN
2    NaN   NaN  water  50.0
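On pandas 2.0 and later, where DataFrame.append() no longer exists, the same row can be added with pd.concat(). This is a sketch of one common replacement, not the method from the linked article: the Series is first turned into a one-row DataFrame.

```python
import pandas as pd

df = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
series = pd.Series(["water", 50], index=["name", "age"])

# to_frame() makes a one-column frame; .T flips it into a single row
# whose columns are the index labels of the Series.
new_row = series.to_frame().T

# ignore_index=True renumbers the rows 0, 1, 2 as in the append() example.
result = pd.concat([df, new_row], ignore_index=True)
print(result)
```

As with append(), the index labels of the Series must match the column names, otherwise new columns are created.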

Build a Python module for MFCC’s C++ implementation

MFCC stands for Mel-frequency cepstral coefficients. It’s a powerful feature representation for sound. Although there are many MFCC implementations in different programming languages, they give quite different results for the same audio input.

To solve this problem, I took an open-source C++ implementation of MFCC and built a Python module for it. With SWIG, this work became less painful.

The function takes sample_rate and a one-dimensional array as input and returns a two-dimensional array as output. So the C++ header file looks like:

void mfcc(int sample_rate,
          short* in_array, int size_in,
          double** out_array, int* dim1, int* dim2
         );

We also need to pass numpy arrays in and out, so the interface file for SWIG is:

%module mfcc
%{
  #define SWIG_FILE_WITH_INIT
  #include "mfcc.hpp"
%}
%include "numpy.i"
%init %{
  import_array();
%}
%apply (short* IN_ARRAY1, int DIM1) {(short* in_array, int size_in)}
%apply (double** ARGOUTVIEW_ARRAY2, int* DIM1, int* DIM2) {(double** out_array, int* dim1, int* dim2)}
%rename (mfcc) my_mfcc;
%inline %{
  void my_mfcc(int sample_rate, short* in_array, int size_in, double** out_array, int* dim1, int* dim2) {
    mfcc(sample_rate, in_array, size_in, out_array, dim1, dim2);
  }
%}

Here is an example of using this module from Python:

import mfcc
import numpy as np
from scipy.io import wavfile
sr, audio = wavfile.read("mono.wav")
output = mfcc.mfcc(sr, audio)
print(output.shape, output)

All the code is in my repository.