Some tips about BigQuery on GCP

Migrate SQL script from AWS Redshift to BigQuery

in Redshift should be changed to

in BigQuery.
Since BigQuery doesn’t force type conversion, some NULL value in Redshift could be a NULL value or a ‘NULL’ string in BigQuery. Make sure you use both

and

for checking.
In BigQuery, we can also use UDF like this:

Performance improvement of BigQuery SQL
Remove ‘DISTINCT’ in SQL and de-dup data later in Pandas could boost whole performance for data processing. Even ‘CAST’ in BigQuery would hurt the performance. The best way to find the bottlenecks for your SQL is by looking at the ‘Execution details‘ in GUI.

Loading speed
For pandas-gbq, we can accelerate the speed of reading BigQuery table by adding argument ‘use_bqstorage_api=True’ in ‘read_gbq()’ function:

The nn.Sigmoid() of PyTorch on Android device

I have trained an EfficientNet model to classify more than ten thousand different categories of birds, by using PyTorch. To run this model on the mobile device, I built a program by learning the PyTorch iOS demo at first, make sure it runs well, and then try to build another Android program by learning the PyTorch Android demo. But after replacing the demo model with my model, the Android program prints out the result as all ‘NaN’.
To figure out the reason step by step. I tested a few widely used deep learning models:

Model Name Result on Android demo
EfficientNet Nan
ResNet-50 Normal
ResNet-101 Normal
RegNet+SE NaN
RegNet Normal

Seems that the only difference between ‘Nan’ and ‘Normal’ is the Squeeze-Excitation module. But Squeeze-Excitation module is quite simple. It’s just:

Aha, I saw the nn.Sigmoid. That’s the only layer which haven’t been used in ResNet.
Let’s dive into Sigmoid function


sigmoid

Looks the e^{-x} is the problem: it will became underflow if the X is too big!

Until now, I still have another question: why the ‘NaN’ appears in the Android program but not the iOS program. Perhaps it’s about JVM or the emulator that Android Studio used.

Use `psql` to download data as CSV file

Although SQL WorkBench is a handy tool for querying AWS Redshift, we still need to CLI tool for automation.
To install psql on MacOS, we need to

Then we could download data without using \copy but only –csv

Recent learned tips abou Numpy and Pandas

Precision

After running this snippet:

It print out:

Why np.float32 and np.float64 have the same output? The answer is: displaying of numpy array need to set options.
Let’s set option before print:

The result has became:

which looks much reasonable.
Furthermore, why it prints out ‘0.1122334455667789’ which has only ’16’ precision instead of ’18’? Because the float64 only support about 15~16 precisions, as this reference said.

Hidden metadata

There are two parquet files which look different after using ‘cksum’ to compare. But after we export them as CSV files:

The two output CSV files are exactly the same.
Then what happened in those previous two parquet files? Dose parquet file have some hidden metadata in it?
As a matter of fact, parquet file will save the ‘index’ of a DataFrame of Pandas while CSV file will not. If we drop the index before writing out the parquet file:

These two parquet files would become identical.

Problems about using treelite in Kubernetes

treelite is an easy-to-use tool to accelerate prediction speed of XGBoost/LightGBM models. Three days ago, I tested it in my 4-CPU-cores virtual machine and found out that it could reduce the running time half. But after I deployed it into our Kubernetes cluster, it ran even slower than LightGBM library!
There are some strange phenomenons I noticed in the Kubernetes environment:

  1. If I run only one pod, it will run as fastly as previous test in my virtual machine. But if I run 100 pods simutaneously, they will all run very slow.
  2. After login to a pod for profiling, I found out that CPU usage of the application would start from 400% (for I set the cpu request and limit to "4") and gradually descend to less than 10%
  3. There was no memory-swapping happend in any node
  4. I used sleep 3600 to hang all the pods, login to one pod with kubectl exec -ti <pod_name> /bin/bash and run my application manually. It could still run very fast.

This weird problem haunted me, dejected me, and frustrated me for two days and almost ruined one night of my sleeping. Only until I turned to look into the code of treelite, the answer jumped out:

The treelite wil pin its processes to each CPU-core by default!
This explained all the thing: when we start 100 pods, they were all trying to ping their 4 processes to CPU-core 0-3. But in the container environment, every pod could see all the CPU-cores of the node (Unlike a virtual machine, in which every VM could only see 4 CPU-cores of itself). Hence they all pin their processes to first 4 CPU-cores of that node. Using only 4 CPU-cores to run all 100 pods? That’s a terrible disaster!

The conclusion is, some acceleration tricks that work very well in classic-computer-era couldn’t be used in container-era.

Use both ‘withParam’ and ‘when’ in Argo Workflows (on Kubernetes)

In Argo, we can use ‘withParam’ to create loop logic:

But in my YAML, it also use when in Argo:

When the NEED_RUN is 0, the Argo will report error since it can’t find the {{steps.generate.outputs.result}}. Seems the YAML parser of Argo will try to parse withParam before when phrase.
Fortunately we don’t need to modify Argo or Kubernetes to solve this problem — we just need to let template gen-number-list generate a fake output (empty array):

Image pull policy in Kubernetes

Recently, we use Kubernetes for our project. Yesterday, a problem haunted me severely: even I have pushed the docker image to the GCR (Goolge Container Registry), the pod in Kubernetes will still use the stale image.
I tried many ways to solve the problem: removing the image in GCR, removing the image in local laptop, rebuild the image again and again. And finally I have found the reason and also realised that I am still a stupid starter on Kubernetes.
The reason for pod to use stale docker image is: the Kubernetes will (and should, I think) cache the docker images it used before for speed. Hence if you want it to re-pull image forcedly. You should use configuration item imagePullPlicy(ref), like:

Fortunately I can debug my docker image correctly now…

Be careful of the ternary operator in Python

The result will be:

Where is the last go? It goes with the no. The python interpreter will consider "no" / "last" under the else condition even it actually break the syntax rule. The correct way to write the ternary operator should be:

Now the result become:

Grab a hands-on realtime-object-detection tool

Try to get a fast (what I mean is detecting in lesss than 1 second on mainstream CPU) object-detection tool from Github, I experiment with some repositories written by PyTorch (because I am familiar with it). Below are some conclusions:
1. detectron2
This the official tool from Facebook Corporation. I download and installed it successfully. The test python code is:

Although can’t recognize all birds in below image, it will cost more than 5 seconds on CPU (my MackbookPro). Performance is not as good as my expectation.

2. efficientdet
From the paper, the EfficientDet should be fast and accurate. But after I wrote a test program, it totally couldn’t recognize the object at all. Then I gave up this solution.

3. EfficientDet.Pytorch
Couldn’t download models from it’s model_zoo.

4. ssd.pytorch
Finally, I came to my sweet ssd(Single Shot Detection). Since have studied it for more than half a year, I wrote below snippet quickly:

The result is not perfect but good enough for my current situation.

Some tips about Argo Workflows (on Kubernetes)

Using Argo to execute workflows last week, I met some problems and also find the solutions.
1. Can’t parse “outputs”
By submitting this YAML file:

I met the error:

Why the Argo could’t recognize the “steps.generate.outputs.result”? Because only “source” could have a default “output”, not “args”. So the template “generate-run” should be

2. Can’t parse parameters from JSON
If the Argo report:

it means the “output” of the previous step isn’t in standard JSON format. So make sure you have pretty JSON format output. For python, it should be like: