Background
Rancher leverages cloud-init to provision virtual machines on a number of infrastructure providers.
I recently encountered an issue whereby vSphere-based clusters using an Ubuntu VM template would provision successfully, but those using a SLES-based VM template would not.
What does Rancher use cloud-init for?
This is covered in the Masterclass session I co-hosted, but as a refresher: with the vSphere driver in particular, Rancher mounts an ISO image to the VM to deliver the user-data portion of a cloud-init configuration. Its contents look like this:
#cloud-config
groups:
- staff
hostname: scale-aio-472516f5-s82pz
runcmd:
- sh /usr/local/custom_script/install.sh
set_hostname:
- scale-aio-472516f5-s82pz
users:
- create_groups: false
  groups: staff
  lock_passwd: true
  name: docker
  no_user_group: true
  ssh_authorized_keys:
  - |
    ssh-rsa AAAAB3NzaC1yc.......
  sudo: ALL=(ALL) NOPASSWD:ALL
write_files:
- content: H4sIAAAAAAAA/wAAA...........
  encoding: gzip+b64
  path: /usr/local/custom_script/install.sh
  permissions: "0644"
Note: this is automatically generated; any additional cloud-init config you include in the cluster configuration gets merged with the above.
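If you want to see exactly what was delivered to a node, the ISO is attached to the VM as a CD-ROM device and can be mounted by hand (the device name below is an assumption – it may differ):

# Mount the NoCloud-style ISO and read the user-data it carries
mount /dev/sr0 /mnt
cat /mnt/user-data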
This config saves a script with write_files and then runs it with runcmd – this will install the rancher-system-agent service and begin the process of installing RKE2/K3s.
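As an aside, the gzip+b64 encoding used for the script content is just gzip followed by base64, so a payload in that format can be produced or reversed with standard tools (a sketch – install.sh and payload.txt are illustrative file names):

# Encode a local script the same way (gzip, then unwrapped base64)
gzip -c install.sh | base64 -w0
# Decode a saved payload back into a readable script
base64 -d payload.txt | gunzip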
The Issue
When I provisioned SLES-based clusters using my existing Packer template, Rancher would indicate it was waiting for the agent to check in.
Investigating
Thinking cloud-init didn’t ingest the config, I ssh’d into the node to do some debugging. I noticed that the node name had changed:
sles-15-sp3-pool1-15a47a8f-xcspb:~ #
Which I verified with:
sles-15-sp3-pool1-15a47a8f-xcspb:/ # cat /var/lib/cloud/instance/user-data.txt | grep hostname
hostname: sles-15-sp3-pool1-15a47a8f-xcspb
The user-data.txt in that directory also matched what was in the mounted ISO. I could also see that /usr/local/custom_script/install.sh had been created, but nothing indicated that it was executed. Everything else from the cloud-init file appeared to have been processed – SSH keys, groups, the written script, etc. – but nothing from runcmd ran.
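Another detail worth knowing (based on cloud-init's standard layout, not something shown in the logs above): runcmd doesn't execute commands directly – the runcmd module renders them into a script under the instance directory, and the scripts-user module runs that script later. If the config stage never runs, the rendered script never appears:

# The runcmd module writes its script here; scripts-user executes it later
ls -l /var/lib/cloud/instance/scripts/
cat /var/lib/cloud/instance/scripts/runcmd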
I ruled out the script by creating a new cluster and adding my own command:
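The addition was along these lines (a representative snippet – any command that leaves a visible marker would do):

#cloud-config
runcmd:
- echo "it ran" > /tmp/test.txt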
As expected, this was merged into the user-data.iso file mounted to the VM, but /tmp/test.txt didn’t exist, so it was never executed.
Checking cloud-init logs
Cloud-init has an easy way to collect logs – the cloud-init collect-logs command. This will generate a tarball:
sles-15-sp3-pool1-15a47a8f-xcspb:/ # cloud-init collect-logs
Wrote /cloud-init.tar.gz
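To dig through the bundle, unpack it somewhere convenient (the target directory is arbitrary):

mkdir -p /tmp/ci-logs
tar -xzf /cloud-init.tar.gz -C /tmp/ci-logs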
In cloud-init.log, I could see the script file being saved:
2023-01-18 09:56:22,917 - helpers.py[DEBUG]: Running config-write-files using lock (<FileLock using file '/var/lib/cloud/instances/nocloud/sem/config_write_files'>)
2023-01-18 09:56:22,927 - util.py[DEBUG]: Writing to /usr/local/custom_script/install.sh - wb: [644] 29800 bytes
2023-01-18 09:56:22,928 - util.py[DEBUG]: Changing the ownership of /usr/local/custom_script/install.sh to 0:0
But there was nothing indicating that it had been executed.
I decided to extract a list of all the cloud-init modules that had run:
cat cloud-init.log | grep "Running module"
stages.py[DEBUG]: Running module migrator
stages.py[DEBUG]: Running module seed_random
stages.py[DEBUG]: Running module bootcmd
stages.py[DEBUG]: Running module write-files
stages.py[DEBUG]: Running module growpart
stages.py[DEBUG]: Running module resizefs
stages.py[DEBUG]: Running module disk_setup
stages.py[DEBUG]: Running module mounts
stages.py[DEBUG]: Running module set_hostname
stages.py[DEBUG]: Running module update_hostname
stages.py[DEBUG]: Running module update_etc_hosts
stages.py[DEBUG]: Running module rsyslog
stages.py[DEBUG]: Running module users-groups
stages.py[DEBUG]: Running module ssh
But still, no sign of runcmd.
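Newer cloud-init releases also provide a quick summary of whether all stages completed, which can be a useful first check (output format varies between versions):

# Show cloud-init's overall state and any recorded errors
cloud-init status --long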
Checking cloud-init configuration
Outside of the log bundle, /etc/cloud/cloud.cfg contains the configuration for cloud-init. Suspecting the runcmd module might not be loaded, I checked, but it was present:
# The modules that run in the 'config' stage
cloud_config_modules:
- ssh-import-id
- locale
- set-passwords
- zypper-add-repo
- ntp
- timezone
- disable-ec2-metadata
- runcmd
However, I noticed that nothing from the cloud_config_modules block was mentioned in cloud-init.log, whereas everything from cloud_init_modules was:
# The modules that run in the 'init' stage
cloud_init_modules:
- migrator
- seed_random
- bootcmd
- write-files
- growpart
- resizefs
- disk_setup
- mounts
- set_hostname
- update_hostname
- update_etc_hosts
- ca-certs
- rsyslog
- users-groups
- ssh
So, it appeared the entire cloud_config_modules step wasn’t running. Weird.
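One way to test that hypothesis (a debugging aid, not something from the original investigation) is to invoke the suspect module by hand:

# Run just the runcmd module; note this only renders the commands into
# /var/lib/cloud/instance/scripts/runcmd – it's scripts-user, in the
# final stage, that actually executes that script
cloud-init single --name runcmd --frequency always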
Fixing
After speaking with someone from the cloud-init community, I found out that there are several cloud-init services on a host machine, each dedicated to a specific stage.
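For reference, these systemd units map onto cloud-init's boot stages roughly as follows (module list names as they appear in /etc/cloud/cloud.cfg):

cloud-init-local.service – the "local" stage (early datasource discovery)
cloud-init.service – the "init" stage, which runs cloud_init_modules
cloud-config.service – the "config" stage, which runs cloud_config_modules (including runcmd)
cloud-final.service – the "final" stage, which runs cloud_final_modules (including scripts-user, which executes the script that runcmd rendered)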
Default config on a SLES 15 SP3 machine:
sles-15-sp3-pool1-15a47a8f-xcspb:/ # sudo systemctl list-unit-files | grep cloud
cloud-config.service disabled disabled
cloud-final.service disabled disabled
cloud-init-local.service disabled disabled
cloud-init.service enabled disabled
cloud-config.target static -
cloud-init.target enabled-runtime disabled
Default config on an Ubuntu 22.04 machine:
packerbuilt@SRV-RNC-1:~$ sudo systemctl list-unit-files | grep cloud
cloud-config.service enabled enabled
cloud-final.service enabled enabled
cloud-init-hotplugd.service static -
cloud-init-local.service enabled enabled
cloud-init.service enabled enabled
cloud-init-hotplugd.socket enabled enabled
cloud-config.target static -
cloud-init.target enabled-runtime enabled
The cloud-config service was not enabled and therefore would not run any of the related modules. To rectify this, I added the following to my Packer script when building the template:
# Ensure cloud-init services are enabled
systemctl enable cloud-init.service
systemctl enable cloud-init-local.service
systemctl enable cloud-config.service
systemctl enable cloud-final.service
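A quick way to confirm the result in the finished template (an illustrative check, not part of the original script):

# All four units should now report "enabled"
systemctl is-enabled cloud-init-local.service cloud-init.service cloud-config.service cloud-final.service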
After this, provisioning SLES-based machines from Rancher worked.