Robin on Linux – Page 4 – All about technology

Simple Python code for the Knapsack Problem

I just wrote this snippet to practice the 0-1 Knapsack Problem:

values  = [1, 2, 3, 4, 5]
weights = [3, 2, 1, 9, 6]
max_weight = 12

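def knapsack():
    n = len(values)
    # dp[i][j]: best total value using the first i items with capacity j.
    # Row 0 (no items) stays all zeros, so dp[i-1] never wraps around to dp[-1].
    dp = [[0] * (max_weight+1) for _ in range(n+1)]

    for i in range(1, n+1):
        for j in range(1, max_weight+1):
            if weights[i-1] > j:
                # Item i does not fit: carry over the best value without it
                dp[i][j] = dp[i-1][j]
            else:
                # Either skip item i or take it and use the remaining capacity
                dp[i][j] = max(dp[i-1][j], dp[i-1][j-weights[i-1]] + values[i-1])

    print(dp[-1][-1])

knapsack()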
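The same recurrence also fits in a single row, since each row only depends on the previous one. Below is a minimal sketch of that variant; the function name `knapsack_1d` and its parameters are my own choice for illustration, not part of the snippet above.

```python
def knapsack_1d(values, weights, capacity):
    # dp[j] holds the best total value achievable with capacity j.
    dp = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Walk from high to low capacity so each item is counted at most once.
        for j in range(capacity, weight - 1, -1):
            dp[j] = max(dp[j], dp[j - weight] + value)
    return dp[capacity]


print(knapsack_1d([1, 2, 3, 4, 5], [3, 2, 1, 9, 6], 12))  # prints 11
```

Iterating the capacity downwards is the important detail: going upwards would let the same item be reused, which is the unbounded knapsack instead.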

My misunderstanding of the Skip-list

After reading the classic paper about Skip-list, I tried to implement it by myself. But then I found a line of pseudo-code in the “Delete” function that I couldn’t understand:

It seems that every element in “update” points to “x”, so why do we need this check? Maybe I could skip it. Here is part of my code:

    def erase(self, num: int) -> bool:
        update = Node(-1)
        
        curr = self._head
        for level in range(MAX_LEVEL-1, -1, -1):
            while curr._forward[level] and curr._forward[level]._key < num:
                curr = curr._forward[level]
            update._forward[level] = curr
        
        curr = curr._forward[0]
        if curr == None or curr._key != num:
            return False
        curr._count -= 1
        if curr._count > 0:
            return True
        
        for level in range(MAX_LEVEL):
            update._forward[level]._forward[level] = curr._forward[level]
            
        del curr
        return True

Unfortunately, it failed a test case. While debugging, I realized that not all elements in “update” point to “x”. Let’s take figure 1 from the paper as an example:

As above, imagine we are deleting node “17”. The “forward[1]” of node “9” (indices start from 0, which is where my code differs from the paper) points to node “17”, so it should be redirected to node “25”. But the “forward[3]” of node “6” points to “NIL” and shouldn’t be redirected to node “25”, because “node6._forward[3]” never pointed to node “17”. The same holds for “forward[4]” and beyond of node “6”.

This is why the last few lines of my code should be:

......
        
        for level in range(MAX_LEVEL):
            if update._forward[level]._forward[level] != curr:
                break
            update._forward[level]._forward[level] = curr._forward[level]
            
        del curr
        return True

Just like the pseudo-code in the paper!
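To make the check easier to try out, here is a small self-contained sketch of a skip-list with insert and erase. It is not my original class: the names (`SkipList`, `_find_update`, `keys`), the tiny `MAX_LEVEL` and the seeded random generator are assumptions made for illustration, and it has no duplicate counting.

```python
import random

MAX_LEVEL = 4  # deliberately small, for illustration only


class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # forward[i] is the next node on level i


class SkipList:
    def __init__(self, seed=0):
        self.head = Node(None, MAX_LEVEL)
        self.rng = random.Random(seed)

    def _random_level(self):
        level = 1
        while level < MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        return level

    def _find_update(self, key):
        # update[i] is the last node on level i whose key is smaller than key.
        update = [self.head] * MAX_LEVEL
        curr = self.head
        for level in range(MAX_LEVEL - 1, -1, -1):
            while curr.forward[level] is not None and curr.forward[level].key < key:
                curr = curr.forward[level]
            update[level] = curr
        return update

    def insert(self, key):
        update = self._find_update(key)
        node = Node(key, self._random_level())
        for level in range(len(node.forward)):
            node.forward[level] = update[level].forward[level]
            update[level].forward[level] = node

    def erase(self, key):
        update = self._find_update(key)
        target = update[0].forward[0]
        if target is None or target.key != key:
            return False
        for level in range(MAX_LEVEL):
            # Above the target's height, update[level] already points past
            # the target (to a later node or None); leave those links alone.
            if update[level].forward[level] is not target:
                break
            update[level].forward[level] = target.forward[level]
        return True

    def keys(self, level=0):
        out, curr = [], self.head.forward[level]
        while curr is not None:
            out.append(curr.key)
            curr = curr.forward[level]
        return out
```

Without the `break`, the loop would either index past the end of `target.forward` or overwrite a link that never pointed to the target, which is exactly the mistake described above.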

I really respect these academic folks: every time I thought they were wrong, it turned out that I had missed something.

Extract only bird sounds from audio

This paper introduced a method to extract only segments with bird sound from an audio file. Since the paper didn’t give any code, I started to write it by myself.

Here is the Python implementation:

import cv2
import time
import torch
import librosa
import soundfile as sf
import numpy as np

from torchlibrosa.stft import LogmelFilterBank, Spectrogram

class CFG:
    n_fft = 2048
    hop_length = 512
    sample_rate = 32000
    n_mels = 64
    fmin = 150
    fmax = 15000  # must not exceed the Nyquist frequency (sample_rate / 2 = 16000)

class SignalExtractor:
    def __init__(self):
        self.spectrogram_extractor = Spectrogram(
            n_fft=CFG.n_fft, hop_length=CFG.hop_length, win_length=CFG.n_fft, window="hann",
            center=True, pad_mode="reflect", freeze_parameters=True)
        # Logmel feature extractor
        self.logmel_extractor = LogmelFilterBank(sr=CFG.sample_rate, n_fft=CFG.n_fft,
            n_mels=CFG.n_mels, fmin=CFG.fmin, fmax=CFG.fmax, ref=1.0, amin=1e-10, top_db=None, freeze_parameters=True)
        self.factors = [2.0, 1.8, 1.6, 1.4, 1.2, 1.1]
        self.kernel_size = 15
        self.sn_threshold = 0.2

    def extract(self, input):
        x = torch.from_numpy(input)
        x = x[None, :].float()

        x = self.spectrogram_extractor(x)
        x = self.logmel_extractor(x)

        x = x.squeeze(0).squeeze(0)
        x = x.permute(1, 0).numpy()
        x = x - np.amin(x)

        for factor in self.factors:
            sound, sn_ratio = self._factor_extract(input, x, factor)
            if sn_ratio >= self.sn_threshold:
                break

        return sound, sn_ratio

    def _factor_extract(self, input, x, factor: float):
        rows, cols = x.shape
        row_median = np.median(x, axis=1)
        row_median_matrix = np.tile(row_median, (cols, 1)).T * factor
        col_median = np.median(x, axis=0)
        col_median_matrix = np.tile(col_median, (rows, 1)) * factor

        y = x > row_median_matrix
        z = x > col_median_matrix
        res = np.logical_and(y, z) + np.zeros(x.shape)

        kernel = np.ones((self.kernel_size, self.kernel_size), np.uint8)
        img = cv2.dilate(res, kernel, iterations=1)

        indicator = np.sum(img, axis=0)
        chunk_size = input.shape[0] // indicator.shape[0]
        sounds = []
        for index, chunk in enumerate(indicator):
            if chunk > 0:
                sounds.append(input[index*chunk_size:(index+1)*chunk_size])
        if len(sounds) <= 0:
            return None, 0.0
        sound = np.concatenate(sounds)
        return sound, sound.shape[0]/input.shape[0]

The implementation differs from the method in the paper in a few ways:

- I didn’t use erosion, since dilation alone is good enough for picking up the bird-sound segments.
- A threshold of three times the median is too strict for most audio files, so I use an array of factors. When the 2.0 factor doesn’t pick up enough bird sound, the code automatically tries 1.8, and so on.
- I used a big kernel (15, 15) for dilation, since it works well on my samples.
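The factor fallback can be illustrated on a toy “spectrogram” without any audio libraries. This is only a sketch in plain Python: `signal_columns` and `pick_factor` are hypothetical names, the dilation step is left out, and the input is a list of rows (mel bins) by columns (time frames).

```python
from statistics import median


def signal_columns(spec, factor):
    """One flag per time frame: True if any cell in that column exceeds
    factor * its row median AND factor * its column median."""
    rows, cols = len(spec), len(spec[0])
    row_med = [median(row) for row in spec]
    col_med = [median(spec[r][c] for r in range(rows)) for c in range(cols)]
    return [
        any(
            spec[r][c] > factor * row_med[r] and spec[r][c] > factor * col_med[c]
            for r in range(rows)
        )
        for c in range(cols)
    ]


def pick_factor(spec, factors=(2.0, 1.8, 1.6, 1.4, 1.2, 1.1), threshold=0.2):
    """Try factors from strict to loose; stop once enough frames are kept."""
    factor, flags, ratio = None, [], 0.0
    for factor in factors:
        flags = signal_columns(spec, factor)
        ratio = sum(flags) / len(flags)
        if ratio >= threshold:
            break
    return factor, flags, ratio


# A flat background with one slightly louder cell in the third frame.
toy = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.5, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
factor, flags, ratio = pick_factor(toy)
print(factor, flags, ratio)
```

On this toy input the strict factors find nothing, and the loop only settles once the factor drops below the 1.5 peak, which is the behaviour the fallback is meant to provide.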

The original sample:

After extracting only bird sounds:

Intel extension for PyTorch

I tried to test the Intel extension for PyTorch (IPEX) in my project, but the import failed with an error:

Traceback (most recent call last):                                                                                                                                                                                                          
  File "reviewjpgs_optimaztion_testing.py", line 27, in <module>                                                                                                                                                                            
    import intel_extension_for_pytorch as ipex                                                                                                                                                                                              
  File "/home/hero/.pyenv/versions/3.8.12/lib/python3.8/site-packages/intel_extension_for_pytorch/__init__.py", line 11, in <module>                                                                                                        
    from .cpu import _cpu_isa                                                                                                                                                                                                               
  File "/home/hero/.pyenv/versions/3.8.12/lib/python3.8/site-packages/intel_extension_for_pytorch/cpu/__init__.py", line 1, in <module>                                                                                                     
    from . import runtime                                                                                                                                                                                                                   
  File "/home/hero/.pyenv/versions/3.8.12/lib/python3.8/site-packages/intel_extension_for_pytorch/cpu/runtime/__init__.py", line 3, in <module>                                                                                             
    from .multi_stream import MultiStreamModule, get_default_num_streams, \
  File "/home/hero/.pyenv/versions/3.8.12/lib/python3.8/site-packages/intel_extension_for_pytorch/cpu/runtime/multi_stream.py", line 4, in <module>
    import intel_extension_for_pytorch._C as core
ImportError: /home/hero/.pyenv/versions/3.8.12/lib/python3.8/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-cpu.so: undefined symbol: _ZNK3c1010TensorImpl22is_strides_like_customENS_12MemoryFormatE

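The answer is quite tricky: the installed IPEX package must match the installed version of PyTorch. The undefined symbol comes from the IPEX shared library being built against a different PyTorch release.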
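A quick way to spot such a mismatch is to compare the installed versions without importing either package (the import is what fails). The sketch below assumes that matching means the same major.minor release, and that the distributions are registered as `torch` and `intel_extension_for_pytorch`; both are assumptions, and the helper names are hypothetical.

```python
from importlib import metadata


def major_minor(version):
    # "2.0.1+cu117" -> (2, 0); the local build tag after "+" is ignored.
    parts = version.split("+")[0].split(".")
    return int(parts[0]), int(parts[1])


def versions_match(torch_version, ipex_version):
    # Assumption: IPEX and PyTorch must share the same major.minor release.
    return major_minor(torch_version) == major_minor(ipex_version)


def installed_versions_match():
    try:
        torch_version = metadata.version("torch")
        ipex_version = metadata.version("intel_extension_for_pytorch")
    except metadata.PackageNotFoundError:
        return None  # one of the two packages is not installed
    return versions_match(torch_version, ipex_version)


print(installed_versions_match())
```

If this prints `False`, reinstalling IPEX in the release that corresponds to the installed PyTorch is the first thing to try.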

After testing both `torch.jit.trace` and IPEX, we found that `torch.jit.trace` boosted prediction performance significantly, but IPEX did not.

Use a specific service account in the Argo job

I created a simple Argo job to pull messages from a Google Cloud Pub/Sub topic. The required permission had been granted to the service account used by GKE’s workload identity. But the Argo job failed with errors:

argo submit example.json -n argoproj

hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/api_core/grpc_helpers.py", line 72, in error_remapped_callable
hello-world-pqbm5:     return callable_(*args, **kwargs)
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/grpc/_channel.py", line 1030, in __call__
hello-world-pqbm5:     return _end_unary_response_blocking(state, call, False, None)
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
hello-world-pqbm5:     raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
hello-world-pqbm5: grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
hello-world-pqbm5:      status = StatusCode.PERMISSION_DENIED
hello-world-pqbm5:      details = "User not authorized to perform this action."
hello-world-pqbm5:      debug_error_string = "UNKNOWN:Error received from peer ipv4:74.125.69.95:443 {grpc_message:"User not authorized to perform this action.", grpc_status:7, created_time:"2023-05-15T01:10:43.128528579+00:00"}"
hello-world-pqbm5: >
hello-world-pqbm5: 
hello-world-pqbm5: The above exception was the direct cause of the following exception:
hello-world-pqbm5: 
hello-world-pqbm5: Traceback (most recent call last):
hello-world-pqbm5:   File "<string>", line 26, in <module>
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/pubsub_v1/services/subscriber/client.py", line 1495, in pull
hello-world-pqbm5:     response = rpc(
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/api_core/gapic_v1/method.py", line 113, in __call__
hello-world-pqbm5:     return wrapped_func(*args, **kwargs)
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/api_core/retry.py", line 349, in retry_wrapped_func
hello-world-pqbm5:     return retry_target(
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/api_core/retry.py", line 191, in retry_target
hello-world-pqbm5:     return target()
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/api_core/timeout.py", line 120, in func_with_timeout
hello-world-pqbm5:     return func(*args, **kwargs)
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/api_core/grpc_helpers.py", line 74, in error_remapped_callable
hello-world-pqbm5:     raise exceptions.from_grpc_error(exc) from exc
hello-world-pqbm5: google.api_core.exceptions.PermissionDenied: 403 User not authorized to perform this action.

Thanks to my colleagues, who reminded me that an Argo job needs to specify a service account when running in a namespace that uses workload identity.

argo submit example.json -n argoproj --serviceaccount argo-workflow

Or, I can add this service account to the YAML file:

apiVersion: argoproj.io/v1alpha1
kind: Workflow                  # new type of k8s spec
metadata:
  generateName: hello-world-    # name of the workflow spec
spec:
  entrypoint: whalesay          # invoke the whalesay template
  serviceAccountName: argo-workflow

Empty messages received by Pub/Sub pull()

I want my Python script to receive one message from a Pub/Sub topic and then go on to other work. The code is adapted from an example in the GCP documentation:

with subscriber:
    # The subscriber pulls a specific number of messages. The actual
    # number of messages pulled may be smaller than max_messages.
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": NUM_MESSAGES},
        retry=retry.Retry(deadline=300),
    )

    if len(response.received_messages) == 0:
        return

The problem is that the response can be empty, meaning that “len(response.received_messages)” is zero.

Where do these empty responses come from? Here is the answer:

Once a message is sent to a subscriber, the subscriber must either acknowledge or drop the message. A message is considered outstanding once it has been sent out for delivery and before a subscriber acknowledges it.

My solution is simply to keep pulling until a non-empty response arrives:

with subscriber:
    # The subscriber pulls a specific number of messages. The actual
    # number of messages pulled may be smaller than max_messages.
    while True:
        response = subscriber.pull(
            request={"subscription": subscription_path, "max_messages": NUM_MESSAGES},
            retry=retry.Retry(deadline=300),
        )

        if len(response.received_messages) > 0:
            break
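A `while True` loop spins forever if the topic stays quiet, so a bounded variant may be safer. The sketch below is independent of the Pub/Sub client: `pull_once` is a hypothetical callable that stands in for one `subscriber.pull(...)` call and returns a list of messages, and the attempt count and delay are arbitrary defaults.

```python
import time


def pull_until_nonempty(pull_once, max_attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call pull_once() until it returns a non-empty list or attempts run out."""
    for attempt in range(max_attempts):
        messages = pull_once()
        if messages:
            return messages
        if attempt < max_attempts - 1:
            sleep(delay_seconds)  # short pause before asking again
    return []


# Usage with a fake source: two empty responses, then one message.
fake_responses = iter([[], [], ["hello"]])
result = pull_until_nonempty(lambda: next(fake_responses), sleep=lambda _: None)
print(result)  # prints ['hello']
```

With the real client, `pull_once` could wrap the `subscriber.pull(...)` call from above and return `response.received_messages`.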

Hanging of PyTorch’s data loader

Long story short: I am trying to build a Siamese network for audio classification. With 50% probability, “dataset.py” tries to find a pair of audio clips in the same category but from different files (and, with the other 50%, a pair from different categories). But when evaluation starts, it hangs after fetching a few batches. The stack trace after interrupting it:

Traceback (most recent call last):                                                                                                                                                                                                        
  File "/home/robin/song/birdclef/old_train.py", line 395, in <module>                                                
    train(args, train_loader, eval_loader)                                                                                                                                                                                                  
  File "/home/robin/song/birdclef/old_train.py", line 280, in train                                                   
    accuracy = evaluate(args, net, eval_loader)                                                                                                                                                                                             
  File "/home/robin/song/birdclef/old_train.py", line 91, in evaluate                                                 
    sounds1, sounds2, type_ids = next(batch_iterator)                                                                 
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()                                                                                                                                                                                                                
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()                                                                                      
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1285, in _get_data                                                                                                              
    success, data = self._try_get_data()                                                                                                                                                                                                    
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)                                                                      
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/queue.py", line 180, in get                                   
    self.not_empty.wait(remaining)                                                                                    
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/threading.py", line 324, in wait                              
    gotit = waiter.acquire(True, timeout)                                                                                                                                                                                                   
KeyboardInterrupt

As usual, I started by suspecting PyTorch. Is PyTorch 2.0 so new that it still has some flaws? Then I quickly rejected that thought: if PyTorch were the problem, why didn’t the same hang occur without the Siamese network?

Then I found this issue on the PyTorch GitHub page. It pointed to the clue: the new code in “dataset.py”. Then I noticed the problem in my code:

            arr = self.cat_map[ebird_code]
            pair_wav_name = np.random.choice(arr)
            while pair_wav_name == wav_name:
                pair_wav_name = np.random.choice(arr)
            pair_sound = self.get_sound(pair_wav_name, ebird_code)

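If a category has only one file, this loop runs forever: the only possible choice always equals “wav_name”. The worker never returns its batch, so the data loader waits on its queue indefinitely. This is the reason for the hang.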

The solution is simple:

            arr = self.cat_map[ebird_code]
            if len(arr) > 1:
                pair_wav_name = np.random.choice(arr)
                while pair_wav_name == wav_name:
                    pair_wav_name = np.random.choice(arr)
            else:
                pair_wav_name = wav_name
            pair_sound = self.get_sound(pair_wav_name, ebird_code)
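An alternative that avoids the retry loop altogether is to filter the current file out before choosing. A minimal sketch, with a hypothetical helper name `pick_pair` and plain lists instead of my dataset class:

```python
import random


def pick_pair(names, current, rng=random):
    """Pick a different file from the same category, or fall back to current."""
    candidates = [name for name in names if name != current]
    if not candidates:
        # Only one file in this category: pair the file with itself.
        return current
    return rng.choice(candidates)


print(pick_pair(["a.wav"], "a.wav"))           # prints a.wav
print(pick_pair(["a.wav", "b.wav"], "a.wav"))  # prints b.wav
```

This makes a single random choice instead of retrying, so the running time no longer depends on luck when a category has only a couple of files.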

A powerful tool to monitor details of Intel CPU

While researching PCIe 3.0 versus PCIe 4.0, I started to wonder about the actual application scenario. What is the real bandwidth between the CPU and the GPU when we are training a deep learning model?

Finally, I found this tool: pcm

After building it, I ran “sudo ./bin/pcm” and got this:

I was glad to see that the tool even reports the IPC (Instructions Per Cycle) and the L2/L3 hit ratios. But the metric I care about most is the PCIe bandwidth. Where is it?

I tried “sudo bin/pcm-pcie” but it said my desktop CPU (i5-12400) is not supported:

The processor is not susceptible to Rogue Data Cache Load: yes
The processor supports enhanced IBRS                     : yes
Package thermal spec power: 65 Watt; Package minimum power: 0 Watt; Package maximum power: 0 Watt;

INFO: Linux perf interface to program uncore PMUs is present

For non-CSV mode delay < 1.0s does not make a lot of practical sense. Default delay 1s is used. Consider to use CSV mode for lower delay values
Update every 1 seconds

Detected 12th Gen Intel(R) Core(TM) i5-12400 "Intel(r) microarchitecture codename Alder Lake" stepping 5 microcode level 0x2c
Jaketown, Ivytown, Haswell, Broadwell-DE, Skylake, Icelake, Snowridge and Sapphirerapids Server CPU is required for this tool! Program aborted
Cleaning up
 Closed perf event handles
 Zeroed uncore PMU registers

Then a new idea came to mind: all my CPU does in this application is read data from files and push it to the GPU, so the memory read bandwidth is approximately the PCIe write bandwidth!

To verify my idea, I changed my model from “tf_efficientnetv2_s_in21k” to “tf_mobilenetv3_small_075” (a smaller model lets the CPU pump more data into the GPU).

As we can see, the memory READ bandwidth increased from “1.36GB” to “13.69GB”. This should be roughly equal to the PCIe bandwidth (since the data read from memory only goes to the GPU).

It seems we really need PCIe 4.0 for deep learning 🙂
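To put the measured number in context, the theoretical link bandwidth can be worked out from the per-lane transfer rate. The sketch below assumes the GPU sits on an x16 link and that the pcm figure is in GB per second; neither is stated above. The transfer rates (8 GT/s for PCIe 3.0, 16 GT/s for PCIe 4.0) and the 128b/130b encoding come from the PCIe specifications.

```python
def pcie_gb_per_s(transfers_per_second, lanes=16):
    # 128b/130b encoding: 128 payload bits for every 130 bits on the wire.
    payload_bits_per_second = transfers_per_second * lanes * 128 / 130
    return payload_bits_per_second / 8 / 1e9


pcie3 = pcie_gb_per_s(8e9)    # PCIe 3.0: 8 GT/s per lane
pcie4 = pcie_gb_per_s(16e9)   # PCIe 4.0: 16 GT/s per lane
measured = 13.69              # memory READ figure from pcm, assumed to be GB/s

print(f"PCIe 3.0 x16: {pcie3:.2f} GB/s")
print(f"PCIe 4.0 x16: {pcie4:.2f} GB/s")
print(f"Measured / PCIe 3.0 x16: {measured / pcie3:.0%}")
```

Under these assumptions the measured figure is already close to what a PCIe 3.0 x16 link can carry in one direction, and real links lose a little more to protocol overhead.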

Use bits instead of set for visited nodes. LeetCode #1434

My first idea was depth-first search: iterate over all people and try to give each of them a different hat. That solution got TLE (Time Limit Exceeded). Then, following a hint from the discussion forum, I iterated over hats (instead of people) and tried to give each hat to a different person. This solution also got TLE (even though I used lru_cache on the function):

import functools
from collections import defaultdict
from typing import List

class Solution:
        
    def numberWays(self, hats: List[List[int]]) -> int:
        hp = defaultdict(set)
        for index, hat in enumerate(hats):
            for _id in hat:
                hp[_id].add(index)
                
        hp = [people for people in hp.values()]
        @functools.lru_cache(None)
        def dfs(start, path) -> int:
            if len(path) == len(hats):
                return 1
            if start == len(hp):
                return 0
            total = 0
            for person in (hp[start] - set(path)):
                total += dfs(start + 1, tuple(list(path) + [person]))
            total += dfs(start + 1, path)
            return total % (10**9 + 7)

        return dfs(0, tuple())

Recording the visited people in a tuple (and rebuilding a set from it on every call) is not efficient enough in this case. Since there are no more than 10 people, the most efficient way to record the visited people is a bitmask: bit i is set when person i already has a hat.
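The three operations a visited-set needs map directly onto integer bit operations. A small sketch, with helper names of my own choosing:

```python
def is_taken(mask, person):
    # True if bit `person` is set in mask.
    return ((mask >> person) & 1) == 1


def take(mask, person):
    # Return a new mask with bit `person` set; the old mask is unchanged.
    return mask | (1 << person)


def count_taken(mask):
    # Number of set bits, i.e. how many people already have a hat.
    return bin(mask).count("1")


mask = 0
mask = take(mask, 0)
mask = take(mask, 3)
print(bin(mask), is_taken(mask, 3), is_taken(mask, 1), count_taken(mask))
```

Because the mask is a plain integer, it is hashable and cheap to compare, which also makes it a good cache key for lru_cache.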

My final solution still uses DFS (with lru_cache, it is effectively dynamic programming as well):

import functools
from collections import defaultdict
from typing import List

class Solution:
        
    def numberWays(self, hats: List[List[int]]) -> int:
        hp = defaultdict(set)
        for index, hat in enumerate(hats):
            for _id in hat:
                hp[_id].add(index)
                
        hp = [people for people in hp.values()]
        @functools.lru_cache(None)
        def dfs(start, mask) -> int:
            if bin(mask).count('1') == len(hats):
                return 1
            if start == len(hp):
                return 0
            total = 0
            for person in hp[start]:
                if (1 << person) & mask > 0:
                    continue
                mask |= 1 << person
                total += dfs(start + 1, mask)
                mask ^= 1 << person
            total += dfs(start + 1, mask)
            return total % (10**9 + 7)

        return dfs(0, 0)

Upgrade Ubuntu to solve a GPU problem

After installing an RTX 2080 Ti in an old 2016 desktop at the beginning of 2019, we used it to train YOLOv6 for a while. But recently the training job has occasionally hung, with the GPU no longer working. The only message I can see is from dmesg:

[ 8104.078794] NVRM: GPU at PCI:0000:01:00: GPU-b4f425ef-2d0f-f29e-5624-ff96b37c2c46
[ 8104.078796] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ 8104.078797] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 8104.078803] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
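When the hang happens overnight, it helps to scan the kernel log for these Xid lines automatically. Below is a minimal sketch that extracts the device and Xid code from dmesg text; the function name and the regular expression are my own, written against the log format shown above.

```python
import re

# Matches lines such as: "NVRM: Xid (PCI:0000:01:00): 79, pid=..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def find_xid_events(dmesg_text):
    """Return a list of (device, xid_code) tuples found in the log text."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("device"), int(match.group("code"))))
    return events


sample = """\
[ 8104.078794] NVRM: GPU at PCI:0000:01:00: GPU-b4f425ef-2d0f-f29e-5624-ff96b37c2c46
[ 8104.078796] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ 8104.078797] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
"""
print(find_xid_events(sample))  # prints [('PCI:0000:01:00', 79)]
```

The text passed in could come from the output of the `dmesg` command; how it is collected is left out here.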

At first, I suspected the NVIDIA driver was too new. But after rolling back to an older driver, the same errors showed up in dmesg. The problem even seemed to occur more frequently; sometimes the machine could not run for more than 24 hours.

Considering that Ubuntu 18.04 (and its Linux kernel) is too old, I tried to upgrade it. Although I have installed many Linux systems and kernels on different machines (servers, desktops, laptops, and even development boards), this was the first time I upgraded an existing Ubuntu system.

Following the guide, I just managed to upgrade from 18.04 to 20.04. Surprisingly, the new system works well with the older NVIDIA driver, and the GPU has now been running smoothly for more than 12 hours.

In conclusion, we should use a new system (with a new kernel) together with new hardware drivers. If the training job doesn’t report any errors, I will keep using 20.04 and save the time of upgrading to 22.04.