Virtual Thoughts

Virtualisation, Storage and various other ramblings.

Simplify Multus deployments with Rancher and RKE2

In my experience, some environments require multiple NICs on Kubernetes worker nodes as well as on the Pods they run. Because of this, I wanted to create a test environment to experiment with this kind of setup. Although this is more common in bare metal environments, I’ll create a virtualised equivalent.

Planning

This is what I have in mind:

In RKE2 vernacular, we refer to nodes that assume etcd and/or control plane roles as servers, and worker nodes as agents.

Server Nodes

Server nodes will not run any workloads. Therefore, they only require 1 NIC. This will reside on VLAN40 in my environment and will act as the overlay/management network for my cluster and will be used for node <-> node communication.

Agent Nodes

Agent nodes will be connected to multiple networks:

  • VLAN40 – Used for node <-> node communication.
  • VLAN50 – Used exclusively by Longhorn for replication traffic. Longhorn is a cloud-native distributed block storage solution for Kubernetes.
  • VLAN60 – Provides access to ancillary services.

Creating Nodes

For the purposes of experimenting, I will create my VMs first.

Server VM config:

Agent VM Config:

Rancher Cluster Configuration

Using Multus is as simple as selecting it from the dropdown list of CNIs. We still need an existing CNI for cluster networking, which is Canal in this example.

The section “Add-On Config” enables us to make changes to the various addons for our cluster:

This cluster has the following tweaks:

calico:
  ipAutoDetectionMethod: interface=ens192

flannel:
  backend: host-gw
  iface: ens192

The Canal CNI is a combination of both Calico and Flannel, which is why the specific interface used is defined in both sections.

With this set, we can extract the join command and run it on our nodes:

Tip – Store the desired node-ip in a config file before launching the command on the nodes. For example:

packerbuilt@mullti-homed-wrk-1:/$ cat /etc/rancher/rke2/config.yaml
node-ip: 172.16.40.47

After the nodes join, kubectl get nodes -o wide shows each node registered with its VLAN40 address:

NAME                 STATUS   ROLES                       AGE   VERSION          INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
multi-homed-cpl-1   Ready    control-plane,etcd,master   42h   v1.25.9+rke2r1   172.16.40.46   <none>        Ubuntu 22.04.1 LTS   5.15.0-71-generic   containerd://1.6.19-k3s1
multi-homed-cpl-2   Ready    control-plane,etcd,master   41h   v1.25.9+rke2r1   172.16.40.49   <none>        Ubuntu 22.04.1 LTS   5.15.0-71-generic   containerd://1.6.19-k3s1
multi-homed-cpl-3   Ready    control-plane,etcd,master   41h   v1.25.9+rke2r1   172.16.40.50   <none>        Ubuntu 22.04.1 LTS   5.15.0-71-generic   containerd://1.6.19-k3s1
multi-homed-wrk-1   Ready    worker                      42h   v1.25.9+rke2r1   172.16.40.47   <none>        Ubuntu 22.04.1 LTS   5.15.0-71-generic   containerd://1.6.19-k3s1
multi-homed-wrk-2   Ready    worker                      42h   v1.25.9+rke2r1   172.16.40.48   <none>        Ubuntu 22.04.1 LTS   5.15.0-71-generic   containerd://1.6.19-k3s1
multi-homed-wrk-3   Ready    worker                      25h   v1.25.9+rke2r1   172.16.40.51   <none>        Ubuntu 22.04.1 LTS   5.15.0-71-generic   containerd://1.6.19-k3s1
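
The node-ip tip can be scripted into a template or bootstrap step. This is a minimal sketch, assuming a hypothetical helper named write_node_ip and a scratch directory standing in for /etc/rancher/rke2. On a real node the file needs to exist before the join command starts the RKE2 service.

```shell
#!/bin/sh
# Hypothetical helper (not part of RKE2): pin the node IP by writing
# config.yaml before the Rancher join command is executed.
write_node_ip() {
  config_dir="$1"   # /etc/rancher/rke2 on a real node
  node_ip="$2"      # address on the node <-> node network (VLAN40 here)
  mkdir -p "$config_dir"
  printf 'node-ip: %s\n' "$node_ip" > "$config_dir/config.yaml"
}

# Demo against a scratch directory so nothing on the host is touched.
demo_dir="$(mktemp -d)"
write_node_ip "$demo_dir" "172.16.40.47"
cat "$demo_dir/config.yaml"
```

On a real node the call would look like write_node_ip /etc/rancher/rke2 172.16.40.47, run as root.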

Pod Networking

Multus is not a CNI in itself, but a meta CNI plugin, enabling the use of multiple CNIs in a Kubernetes cluster. At this point we have a functioning cluster with an overlay network in place for cluster communication, and every Pod will have an interface on that network. So which other CNIs can we use?

Out of the box, we can query the /opt/cni/bin directory for available plugins. You can also add additional CNIs if you wish.

packerbuilt@mullti-homed-wrk-1:/$ ls /opt/cni/bin/
bandwidth  calico       dhcp      flannel      host-local  ipvlan    macvlan  portmap  sbr     tuning  vrf
bridge     calico-ipam  firewall  host-device  install     loopback  multus   ptp      static  vlan

For this environment, macvlan will be used. It provides MAC addresses directly to Pod interfaces which makes it simple to integrate with network services like DHCP.

Defining the Networks

Through NetworkAttachmentDefinition objects, we can define the respective networks and bridge them to named, physical interfaces on the host:

apiVersion: v1
kind: Namespace
metadata:
  name: multus-network-attachments
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-longhorn-dhcp
  namespace: multus-network-attachments
spec:
  config: '{
      "cniVersion": "0.3.0",
      "type": "macvlan",
      "master": "ens224",
      "mode": "bridge",
      "ipam": {
        "type": "dhcp"
      }
    }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-private-dhcp
  namespace: multus-network-attachments
spec:
  config: '{
      "cniVersion": "0.3.0",
      "type": "macvlan",
      "master": "ens256",
      "mode": "bridge",
      "ipam": {
        "type": "dhcp"
      }
    }'
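
The two definitions above differ only in their name and master interface, so they can be generated from one template. This is a rough sketch, assuming a hypothetical helper named render_macvlan_nad; the namespace and macvlan settings are copied from the manifests above.

```shell
#!/bin/sh
# Hypothetical helper: print a macvlan NetworkAttachmentDefinition for a given
# name and host interface, mirroring the two manifests above.
render_macvlan_nad() {
  nad_name="$1"       # e.g. macvlan-longhorn-dhcp
  master_iface="$2"   # host NIC that the macvlan interface attaches to
  cat <<EOF
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ${nad_name}
  namespace: multus-network-attachments
spec:
  config: '{
      "cniVersion": "0.3.0",
      "type": "macvlan",
      "master": "${master_iface}",
      "mode": "bridge",
      "ipam": {
        "type": "dhcp"
      }
    }'
EOF
}

# Print both definitions as one multi-document YAML stream.
render_macvlan_nad macvlan-longhorn-dhcp ens224
echo "---"
render_macvlan_nad macvlan-private-dhcp ens256
```

The output can be piped into kubectl apply -f - once the namespace exists.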

We use an annotation to attach a Pod to additional networks:

apiVersion: v1
kind: Pod
metadata:
  name: net-tools
  namespace: multus-network-attachments
  annotations:
    k8s.v1.cni.cncf.io/networks: multus-network-attachments/macvlan-longhorn-dhcp,multus-network-attachments/macvlan-private-dhcp
spec:
  containers:
  - name: samplepod
    command: ["/bin/bash", "-c", "sleep 2000000000000"]
    image: ubuntu

Which we can validate within the pod:

root@net-tools:/# ip addr show
3: eth0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
    link/ether 1a:57:1a:c1:bf:f3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.42.5.27/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::1857:1aff:fec1:bff3/64 scope link 
       valid_lft forever preferred_lft forever
4: net1@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether aa:70:ab:b6:7a:86 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.16.50.40/24 brd 172.16.50.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::a870:abff:feb6:7a86/64 scope link 
       valid_lft forever preferred_lft forever
5: net2@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 62:a6:51:84:a9:30 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.16.60.30/24 brd 172.16.60.255 scope global net2
       valid_lft forever preferred_lft forever
    inet6 fe80::60a6:51ff:fe84:a930/64 scope link 
       valid_lft forever preferred_lft forever
root@net-tools:/# ip route
default via 169.254.1.1 dev eth0 
169.254.1.1 dev eth0 scope link 
172.16.50.0/24 dev net1 proto kernel scope link src 172.16.50.40 
172.16.60.0/24 dev net2 proto kernel scope link src 172.16.60.30
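
When checking several Pods, it helps to reduce the ip addr show output to one line per interface. The sketch below assumes a hypothetical helper named list_ipv4 and uses sample input trimmed from the output above.

```shell
#!/bin/sh
# Summarise "ip addr show" output as "<interface> <IPv4/prefix>" pairs.
list_ipv4() {
  awk '
    # Interface header line, e.g. "4: net1@if4: <...>": keep "net1".
    /^[0-9]+: / { iface = $2; sub(/:$/, "", iface); sub(/@.*/, "", iface) }
    # IPv4 address line belonging to the most recent interface header.
    $1 == "inet" { print iface, $2 }
  '
}

# Sample input trimmed from the Pod output above.
sample='3: eth0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    inet 10.42.5.27/32 scope global eth0
4: net1@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    inet 172.16.50.40/24 brd 172.16.50.255 scope global net1
5: net2@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    inet 172.16.60.30/24 brd 172.16.60.255 scope global net2'

printf '%s\n' "$sample" | list_ipv4
```

Against a live Pod, the same filter could be fed from kubectl exec -n multus-network-attachments net-tools -- ip addr show, assuming the ip tool is available in the container image.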

Testing access to a service on net2:

root@net-tools:/# curl 172.16.60.31
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>

Configuring Longhorn

Longhorn has a config setting to define the network used for storage operations:

If setting this post-install, the instance-manager pods will restart and attach a new interface:

instance-manager-e-437ba600ca8a15720f049790071aac70:/ # ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if51: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
    link/ether fe:da:f1:04:81:67 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.42.1.58/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::fcda:f1ff:fe04:8167/64 scope link 
       valid_lft forever preferred_lft forever
4: lhnet1@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 12:90:50:15:04:c7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.16.50.34/24 brd 172.16.50.255 scope global lhnet1
       valid_lft forever preferred_lft forever
    inet6 fe80::1090:50ff:fe15:4c7/64 scope link 
       valid_lft forever preferred_lft forever
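
As I understand the Longhorn documentation, the storage network setting takes a NetworkAttachmentDefinition reference in the form namespace/name, the same form used in the Pod annotation earlier. The sketch below only builds and sanity-checks that string. The helper name nad_reference is hypothetical, and the value should be checked against the documentation for your Longhorn version.

```shell
#!/bin/sh
# Hypothetical helper: build the "<namespace>/<name>" reference for a
# NetworkAttachmentDefinition and reject inputs that would not form one.
nad_reference() {
  namespace="$1"
  name="$2"
  if [ -z "$namespace" ] || [ -z "$name" ]; then
    echo "namespace and name are both required" >&2
    return 1
  fi
  case "$namespace$name" in
    */*) echo "namespace and name must not contain '/'" >&2; return 1 ;;
  esac
  printf '%s/%s\n' "$namespace" "$name"
}

# The definition created earlier for the VLAN50 replication network.
nad_reference multus-network-attachments macvlan-longhorn-dhcp
```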

Debugging cloud-init not executing runcmd commands

Background

Rancher leverages cloud-init for the provisioning of Virtual Machines on a number of infrastructure providers, as below:

I recently encountered an issue whereby vSphere based clusters using an Ubuntu VM template would successfully provision, but SLES based VM templates would not.

What does Rancher use cloud-init for?

This is covered in the Masterclass session I co-hosted, but as a refresher, particularly with the vSphere driver, Rancher will mount an ISO image to the VM to deliver the user-data portion of a cloud-init configuration. The contents of which look like this:

#cloud-config
groups:
- staff
hostname: scale-aio-472516f5-s82pz
runcmd:
- sh /usr/local/custom_script/install.sh
set_hostname:
- scale-aio-472516f5-s82pz
users:
- create_groups: false
  groups: staff
  lock_passwd: true
  name: docker
  no_user_group: true
  ssh_authorized_keys:
  - |
    ssh-rsa AAAAB3NzaC1yc.......
  sudo: ALL=(ALL) NOPASSWD:ALL
write_files:
- content: H4sIAAAAAAAA/wAAA...........
  encoding: gzip+b64
  path: /usr/local/custom_script/install.sh
  permissions: "0644"

Note: This is automatically generated. Any additional cloud-init config you include in the cluster configuration (below) gets merged with the above.

It saves a script with write_files and then runs this with runcmd – this will install the rancher-system-agent service and begin the process of installing RKE2/K3s.

The Issue

When I provisioned SLES based clusters using my existing Packer template, Rancher would indicate it was waiting for the agent to check in:

Investigating

Thinking cloud-init didn’t ingest the config, I ssh’d into the node to do some debugging. I noticed that the node name had changed:

sles-15-sp3-pool1-15a47a8f-xcspb:~ #

Which I verified with:

sles-15-sp3-pool1-15a47a8f-xcspb:/ # cat /var/lib/cloud/instance/user-data.txt | grep hostname
hostname: sles-15-sp3-pool1-15a47a8f-xcspb

Inspecting user-data.txt from that directory also matched what was in the mounted ISO. I could also see /usr/local/custom_script/install.sh was created, but nothing indicated that it was executed. It appeared everything else from the cloud-init file was processed – SSH keys, groups, writing the script, etc, but nothing from runcmd was executed.

I ruled out the script by creating a new cluster and adding my own command:

As expected, this was merged into the user-data.iso file mounted to the VM, but /tmp/test.txt didn’t exist, so it was never executed.

Checking cloud-init logs

Cloud-init has an easy way to collect logs: the cloud-init collect-logs command. This will generate a tarball:

sles-15-sp3-pool1-15a47a8f-xcspb:/ # cloud-init collect-logs
Wrote /cloud-init.tar.gz

I noted in cloud-init.log I could see the script file being saved:

2023-01-18 09:56:22,917 - helpers.py[DEBUG]: Running config-write-files using lock (<FileLock using file '/var/lib/cloud/instances/nocloud/sem/config_write_files'>)
2023-01-18 09:56:22,927 - util.py[DEBUG]: Writing to /usr/local/custom_script/install.sh - wb: [644] 29800 bytes
2023-01-18 09:56:22,928 - util.py[DEBUG]: Changing the ownership of /usr/local/custom_script/install.sh to 0:0

But nothing indicating it was executed.

I decided to extract a list of all the cloud-init modules that were initiated:

cat cloud-init.log | grep "Running module"

stages.py[DEBUG]: Running module migrator
stages.py[DEBUG]: Running module seed_random 
stages.py[DEBUG]: Running module bootcmd 
stages.py[DEBUG]: Running module write-files 
stages.py[DEBUG]: Running module growpart 
stages.py[DEBUG]: Running module resizefs 
stages.py[DEBUG]: Running module disk_setup
stages.py[DEBUG]: Running module mounts 
stages.py[DEBUG]: Running module set_hostname
stages.py[DEBUG]: Running module update_hostname 
stages.py[DEBUG]: Running module update_etc_hosts 
stages.py[DEBUG]: Running module rsyslog 
stages.py[DEBUG]: Running module users-groups 
stages.py[DEBUG]: Running module ssh

But still, no sign of runcmd.

Checking cloud-init configuration

Outside of the log bundle, /etc/cloud/cloud.cfg holds the configuration for cloud-init. Suspecting the runcmd module might not be loaded, I checked, but it was present:

# The modules that run in the 'config' stage
cloud_config_modules:
 - ssh-import-id
 - locale
 - set-passwords
 - zypper-add-repo
 - ntp
 - timezone
 - disable-ec2-metadata
 - runcmd

However, I noticed that nothing from the cloud_config_modules block was mentioned in cloud-init.log, whereas everything from cloud_init_modules was:

# The modules that run in the 'init' stage
cloud_init_modules:
 - migrator
 - seed_random
 - bootcmd
 - write-files
 - growpart
 - resizefs
 - disk_setup
 - mounts
 - set_hostname
 - update_hostname
 - update_etc_hosts
 - ca-certs
 - rsyslog
 - users-groups
 - ssh

So, it appeared the entire cloud_config_modules step wasn’t running. Weird.
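
The comparison I did by eye can be roughly automated. The sketch below assumes a hypothetical helper named modules_not_run and small sample files standing in for /etc/cloud/cloud.cfg and cloud-init.log. It treats every simple YAML list item in the config as a module name, which is good enough for a quick check but is not a real YAML parser.

```shell
#!/bin/sh
# Print modules listed in a cloud.cfg-style file that never show up as
# "Running module <name>" in a cloud-init.log-style file.
modules_not_run() {
  cfg_file="$1"
  log_file="$2"
  # Module names seen in the log, with "-" and "_" treated as equivalent.
  ran="$(sed -n 's/.*Running module \([A-Za-z0-9_-]*\).*/\1/p' "$log_file" | tr '-' '_')"
  # Simple list items such as " - runcmd" are taken as configured modules.
  sed -n 's/^ *- \([A-Za-z0-9_-]*\) *$/\1/p' "$cfg_file" |
    while read -r module; do
      normalized="$(printf '%s' "$module" | tr '-' '_')"
      printf '%s\n' "$ran" | grep -qx -- "$normalized" || printf '%s\n' "$module"
    done
}

# Small sample files; real inputs would be cloud.cfg and the collected log.
work_dir="$(mktemp -d)"
cat > "$work_dir/cloud.cfg" <<'EOF'
cloud_init_modules:
 - write-files
 - ssh
cloud_config_modules:
 - timezone
 - runcmd
EOF
cat > "$work_dir/cloud-init.log" <<'EOF'
stages.py[DEBUG]: Running module write-files
stages.py[DEBUG]: Running module ssh
EOF

modules_not_run "$work_dir/cloud.cfg" "$work_dir/cloud-init.log"
```

With the sample files, only the two cloud_config_modules entries are reported, which mirrors what I saw on the SLES node.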

Fixing

After speaking with someone from the cloud-init community, I found out that several cloud-init services exist on a host machine, each dedicated to a specific stage.

Default config on a SLES 15 SP4 machine:

sles-15-sp3-pool1-15a47a8f-xcspb:/ # sudo systemctl list-unit-files | grep cloud
cloud-config.service                    disabled        disabled     
cloud-final.service                     disabled        disabled     
cloud-init-local.service                disabled        disabled     
cloud-init.service                      enabled         disabled     
cloud-config.target                     static          -            
cloud-init.target                       enabled-runtime disabled

Default config on an Ubuntu 22.04 machine:

packerbuilt@SRV-RNC-1:~$ sudo systemctl list-unit-files | grep cloud
cloud-config.service                        enabled         enabled
cloud-final.service                         enabled         enabled
cloud-init-hotplugd.service                 static          -
cloud-init-local.service                    enabled         enabled
cloud-init.service                          enabled         enabled
cloud-init-hotplugd.socket                  enabled         enabled
cloud-config.target                         static          -
cloud-init.target                           enabled-runtime enabled

The cloud-config service was not enabled and therefore would not run any of the related modules. To rectify, I added the following to my Packer script when building the template:

# Ensure cloud-init services are enabled
systemctl enable cloud-init.service
systemctl enable cloud-init-local.service
systemctl enable cloud-config.service
systemctl enable cloud-final.service

After which, provisioning SLES based machines from Rancher worked.
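
To catch this earlier in a template build, the unit list can be filtered for cloud-init services that are not enabled. This is a sketch, assuming a hypothetical helper named cloud_services_not_enabled; the sample input is copied from the SLES output above.

```shell
#!/bin/sh
# List cloud-init service units that are neither "enabled" nor "static",
# reading "systemctl list-unit-files" style output on stdin.
cloud_services_not_enabled() {
  awk '$1 ~ /^cloud-.*\.service$/ && $2 != "enabled" && $2 != "static" { print $1 }'
}

# Sample input copied from the SLES output above.
sles_units='cloud-config.service                    disabled        disabled
cloud-final.service                     disabled        disabled
cloud-init-local.service                disabled        disabled
cloud-init.service                      enabled         disabled
cloud-config.target                     static          -
cloud-init.target                       enabled-runtime disabled'

printf '%s\n' "$sles_units" | cloud_services_not_enabled
```

On a template VM, systemctl list-unit-files | cloud_services_not_enabled should print nothing once the four services have been enabled.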

Installing & Using the Nvidia GPU Operator in K3s with Rancher

This post outlines the steps needed to use the Nvidia GPU operator in a K3s cluster. This example uses a gift from me to my homelab: a cheap Nvidia T400 GPU, which is on the supported list for the operator.

Step 1 – Configure Passthrough (If required)

For this environment, vSphere is used and therefore PCI Passthrough is required to present the GPU to the VM. The Nvidia GPU is represented as two devices – one for the video controller, and another for the audio controller – we only need the video controller. Steps after this are still relevant to bare metal deployments.

Step 2 – Create VM

When creating a VM, choose to add a PCI device, and specify the Nvidia GPU:

Step 3 – Install nvidia-container-runtime and K3s

In order for Containerd (within K3s) to pick up the Nvidia plugin when K3s starts, we need to install the corresponding container runtime:

root@ubuntu:~# curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey |   sudo apt-key add -
root@ubuntu:~# distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
root@ubuntu:~# curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list |   sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
root@ubuntu:~# apt update && apt install -y nvidia-container-runtime

root@ubuntu:~# curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.23.7+k3s1" sh
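
The distribution variable in the commands above is built by sourcing /etc/os-release and joining ID and VERSION_ID, which selects the matching repository list. The sketch below reproduces that against a sample file. The helper name distribution_id and the sample values are assumptions for illustration, not taken from my VM.

```shell
#!/bin/sh
# Reproduce the "distribution" one-liner against a given os-release file.
distribution_id() {
  os_release_file="$1"
  # Source the file in a subshell so ID and VERSION_ID do not leak out.
  ( . "$os_release_file"; printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Sample os-release content (illustrative values only).
sample_file="$(mktemp)"
cat > "$sample_file" <<'EOF'
NAME="Ubuntu"
ID=ubuntu
VERSION_ID="22.04"
EOF

distribution_id "$sample_file"
```

Running distribution_id /etc/os-release on the target VM shows the value that ends up in the repository URL.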

We can validate the Containerd config includes the Nvidia plugin with:

root@ubuntu:~# cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml | grep -i nvidia
[plugins.cri.containerd.runtimes."nvidia"]
[plugins.cri.containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

Step 4 – Import Cluster into Rancher and install the nvidia-gpu-operator

Follow this guide to import an existing cluster in Rancher.

After which, navigate to Rancher -> Cluster -> Apps -> Repositories -> Create

Add the Helm chart for the Nvidia GPU operator:

Select to install the GPU Operator chart by going to Cluster -> Apps -> Charts -> Search for "GPU":

Follow the instructions until you reach the Edit YAML section. At this point add the following configuration into the corresponding section; this is to cater to where K3s stores the Containerd config and socket endpoint:

toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock

Proceed with the installation and wait for the corresponding Pods to spin up. This will take some time as it’s compiling the GPU/CUDA drivers on the fly.

Note: You will notice several GPU-Operator Pods initially in a crashloop state. This is expected until the nvidia-driver-daemonset Pod has finished building and installing the Nvidia drivers. You can follow the Pod logs to get more insight as to what’s occurring.

root@ubuntu:~# kubectl logs nvidia-driver-daemonset-wmrxq
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-515.65.01
Verifying archive integrity... OK
root@ubuntu:~# kubectl logs nvidia-driver-daemonset-wmrxq -f
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-515.65.01
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 515.65.01............................................................................................................................................
root@ubuntu:~# kubectl get po
NAME                                                              READY   STATUS            RESTARTS      AGE
nvidia-dcgm-exporter-dkcz9                                        0/1     PodInitializing   0             4m42s
gpu-operator-v22-1669053133-node-feature-discovery-master-t4mrp   1/1     Running           0             6m26s
gpu-operator-v22-1669053133-node-feature-discovery-worker-rxxw5   1/1     Running           1 (91s ago)   6m1s
gpu-operator-8488c86579-gf7z8                                     1/1     Running           1 (10m ago)   30m
nvidia-container-toolkit-daemonset-mgn92                          1/1     Running           0             5m59s
nvidia-driver-daemonset-46sdp                                     1/1     Running           0             5m55s
nvidia-cuda-validator-cmt7x                                       0/1     Completed         0             74s
gpu-feature-discovery-4xw2q                                       1/1     Running           0             4m23s
nvidia-device-plugin-daemonset-8czgl                              1/1     Running           0             5m
nvidia-device-plugin-validator-tzpq8                              0/1     Completed         0             37s
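
While the drivers build, it is handy to list only the Pods that have not settled yet. This is a sketch, assuming a hypothetical helper named pods_still_settling; the sample rows are trimmed from the output above.

```shell
#!/bin/sh
# Print pods whose STATUS is neither Running nor Completed, reading
# "kubectl get po" style output on stdin (the header row is skipped).
pods_still_settling() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Sample rows trimmed from the output above.
sample_pods='NAME                                       READY   STATUS            RESTARTS      AGE
nvidia-dcgm-exporter-dkcz9                 0/1     PodInitializing   0             4m42s
nvidia-driver-daemonset-46sdp              1/1     Running           0             5m55s
nvidia-cuda-validator-cmt7x                0/1     Completed         0             74s'

printf '%s\n' "$sample_pods" | pods_still_settling
```

On the cluster, kubectl get po | pods_still_settling gives the same view; an empty result suggests the operator has finished rolling out.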

Step 5 – Validate and Test

First, check to see the runtimeClass is present:

root@ubuntu:~# kubectl get runtimeclass
NAME     HANDLER   AGE
nvidia   nvidia    30m

kubectl describe node should also list a GPU under the Allocatable resources:

Allocatable:
  cpu:                8
  ephemeral-storage:  49893476109
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16384596Ki
  nvidia.com/gpu:     1

We can use the following workload to test. Note the runtimeClassName reference in the Pod spec:

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

Logs from the Pod will indicate if it was successful:

root@ubuntu:~# kubectl logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED

Without the runtimeClassName in the spec, the Pod will error:

root@ubuntu:~# kubectl logs cuda-vectoradd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!

© 2025 Virtual Thoughts