PVE2 Networking — The NIC That Wasn’t There

Some homelab problems announce themselves loudly. A RAID array degrading, a VM that won’t boot, a dashboard screaming red. This one didn’t. It just quietly sat there — a NIC that existed in lspci but nowhere else, and a cluster that had no idea anything was wrong.

This is the story of that NIC, the rabbit hole it sent us down, and the somewhat embarrassing simplicity of the fix.


The Lab Setup

The cluster is two repurposed desktop machines:

  • Node 1 (PVE) — workstation-class hardware, Xeon CPU, Intel NIC. Boring, reliable, exactly what you want.
  • Node 2 (PVE2) — compact i5 desktop, onboard Realtek RTL8168evl PCIe NIC. Less boring. Much less reliable, as it turned out.
  • Raspberry Pi — acting as Qdevice to keep quorum honest when one node goes dark.

Nothing exotic. The kind of setup you build from whatever hardware is lying around, which is both the joy and the curse of homelabbing.


Act 1 — Something Is Very Wrong With This NIC

The first sign wasn’t an alert. It was an iperf3 test.

Node 1 was hitting expected gigabit throughput. Node 2 was producing numbers that looked like a throttled broadband connection from 2009. Tens of megabits. On the same switch, same cable, same VLAN.

# On Node 2
iperf3 -c 192.168.200.200

Connecting to host 192.168.200.200, port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  28.4 MBytes  23.8 Mbits/sec     # something is deeply wrong

Time to look at the interface itself:

lspci | grep -i ethernet
# 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCIe Gigabit Ethernet Controller (rev 15)

lspci -k -s 02:00.0
# 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCIe Gigabit Ethernet Controller (rev 15)
#         Subsystem: ...
#         Kernel driver in use: (none)
#         Kernel modules: r8168, r8169

There it was. Kernel driver in use: (none).

The hardware exists. The kernel knows about two candidate modules. But nothing is loaded. The interface simply does not exist to the OS. No eth0, no enp2s0, nothing.
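
The same conclusion can be drawn straight from sysfs, without parsing lspci output. A small sketch; the helper name and the second parameter are ours, added so the logic can be exercised against a fake sysfs tree:

```shell
# check_driver SLOT [SYSFS]: print the kernel driver bound to a PCI slot,
# or a notice when nothing is bound (the exact failure mode above).
# SYSFS defaults to the real /sys; a test can point it at a scratch tree.
check_driver() {
    dev="${2:-/sys}/bus/pci/devices/$1"
    if [ -e "$dev/driver" ]; then
        echo "driver: $(basename "$(readlink -f "$dev/driver")")"
    else
        echo "no driver bound to $1"
    fi
}

check_driver 0000:02:00.0   # on the broken node: "no driver bound to 0000:02:00.0"
```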


Root Cause — The r8168-dkms Trap

A quick check of /etc/modprobe.d/ told the whole story:

cat /etc/modprobe.d/r8168-dkms.conf
# Realtek r8168-dkms
alias pci:v000010ECd00008168sv*sd*bc*sc*i* r8168

At some point — likely during the initial OS setup — r8168-dkms had been installed. Realtek ships an out-of-tree kernel module called r8168 as an alternative to the in-kernel r8169 driver. The DKMS package drops an alias into modprobe configuration that says: when you see this PCI ID, load r8168, not r8169.

The problem: the DKMS build was broken. r8168 never actually compiled. So when the kernel tried to honour that alias and load r8168, it found nothing. And because the alias explicitly redirected the PCI ID away from r8169, the in-kernel driver — which handles RTL8168 chipsets perfectly well — was silently blocked.

The NIC had been there the whole time. The driver had been deliberately told to stay away.

Node 1 never had this problem. Its Intel NIC loads e1000e cleanly, no DKMS involvement, no drama.


Act 2 — Life on a USB NIC

While the root cause was being tracked down, the node still needed to function. The short-term fix: plug in a USB 3.0 Realtek adapter and bridge it into vmbr0 as the primary network port.

This kept Node 2 alive. It did not keep it healthy.

# What vmbr0 looked like during the USB era
auto vmbr0
iface vmbr0 inet static
    address 192.168.200.201/24
    gateway 192.168.200.1
    bridge-ports usb0        # 👈 this is the problem
    bridge-stp off
    bridge-fd 0

Under load, throughput through the USB adapter topped out somewhere between 18 and 30 Mbps. For basic VM operation that’s survivable. For a Proxmox cluster, it’s a liability. HA decisions, live migration, Corosync heartbeats — all of that rides on reliable inter-node connectivity. A USB NIC with inconsistent throughput is not the foundation you want quorum riding on.

There was also a moment of being completely locked out of Node 2 remotely — no IPMI, no out-of-band management, no mesh VPN installed at the time. Any change that dropped the interface even briefly would mean a drive to the hardware. A hard lesson about the importance of having an escape hatch before touching network config on a remote machine.
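
The escape-hatch lesson can be reduced to a few lines of shell. This is a sketch of the general idea, not what was run at the time: `schedule_revert` is a name we made up, and on a real node the restore step would be followed by something like `ifreload -a` to actually reapply the config.

```shell
# schedule_revert FILE BACKUP SECONDS: snapshot FILE now, and restore the
# snapshot after SECONDS unless the printed PID is killed first.
# Run it before editing network config; kill the job once you confirm
# you can still SSH in after the change.
schedule_revert() {
    cp "$1" "$2"
    ( sleep "$3" && cp "$2" "$1" ) >/dev/null 2>&1 &
    echo $!   # the dead-man switch: kill this PID to keep your change
}

# Usage on a real node (illustrative paths):
# pid=$(schedule_revert /etc/network/interfaces /root/interfaces.bak 300)
# ...edit /etc/network/interfaces, reload, re-test SSH...
# kill "$pid"
```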


Act 3 — The Fix (Embarrassingly Simple)

Once the root cause was clear, the fix was three commands:

# 1. Disable the broken alias — comment it out rather than delete
sed -i 's/^alias/#alias/' /etc/modprobe.d/r8168-dkms.conf

# 2. Load the correct in-kernel driver immediately
modprobe r8169

# 3. Persist across reboots
echo "r8169" >> /etc/modules-load.d/r8169.conf
update-initramfs -u
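
For the curious, step 1 is deliberately non-destructive. Here is the same sed run against a scratch copy rather than the live file, to show exactly what it leaves behind:

```shell
# Reproduce the offending conf in a scratch file and apply step 1 to it.
conf=$(mktemp)
printf 'alias pci:v000010ECd00008168sv*sd*bc*sc*i* r8168\n' > "$conf"
sed -i 's/^alias/#alias/' "$conf"
cat "$conf"
# The alias line survives as a comment, so the record of what was tried
# stays in /etc/modprobe.d/ instead of being erased.
```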

After modprobe r8169, the onboard NIC came up with carrier almost instantly. The interface appeared, got its address, and iperf3 immediately started reporting gigabit numbers.

The final step — switching vmbr0’s bridge port from the USB adapter to the onboard NIC — was done carefully over a Tailscale mesh tunnel to avoid losing remote access mid-change:

# Edit /etc/network/interfaces
auto vmbr0
iface vmbr0 inet static
    address 192.168.200.201/24
    gateway 192.168.200.1
    bridge-ports enp2s0      # 👈 back to the real NIC
    bridge-stp off
    bridge-fd 0

# Apply without full reboot
ifdown vmbr0 && ifup vmbr0

Post-switch iperf3:

iperf3 -c 192.168.200.200

[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  1.09 GBytes   936 Mbits/sec     # that's more like it

936 Mbits/sec. From 23 Mbits/sec. Same hardware. Same cable. Same switch port. The only change was unblocking the driver that should have been there from day one.
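
As an aside, the before and after numbers are internally consistent with iperf3's units: it reports transfer in binary bytes (1 GByte = 2^30 bytes) and bitrate in decimal bits.

```shell
# 1.09 GBytes moved in 10 seconds, converted to decimal megabits/second.
awk 'BEGIN { printf "%.0f Mbit/s\n", 1.09 * 2^30 * 8 / 10 / 1e6 }'
# -> 936 Mbit/s, matching iperf3's own summary line
```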


What We Learned

r8168-dkms is a silent trap. The r8169 in-kernel driver handles RTL8168 chipsets just fine on modern kernels. There’s rarely a reason to install the out-of-tree r8168 module, and if the DKMS build fails (which it can, especially after kernel updates), you end up with exactly this scenario — a NIC that exists in hardware but nowhere in the OS.

If you ever run lspci -k and see Kernel driver in use: (none) on a NIC, check /etc/modprobe.d/ immediately. A ghost alias is a likely culprit.
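
That check can be automated. A sketch (the helper name `find_ghost_aliases` is ours) that treats a failing modinfo as the signal that an alias points at a module which doesn't exist:

```shell
# Scan a modprobe.d-style directory for alias lines whose target module
# modinfo cannot find: exactly the r8168 situation above.
find_ghost_aliases() {
    dir="${1:-/etc/modprobe.d}"
    grep -rhE '^alias ' "$dir" 2>/dev/null | while read -r _ _ mod; do
        modinfo "$mod" >/dev/null 2>&1 || echo "ghost alias -> $mod"
    done
}

find_ghost_aliases   # on the broken node this printed: ghost alias -> r8168
```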

Always have an out-of-band access path. Before touching NIC configuration on any remote or semi-remote node, have a way in that doesn’t depend on the interface you’re about to reconfigure. Tailscale, WireGuard on a second interface, a serial console, IPMI — anything. The moment you don’t have this is the moment you need it.

A USB NIC on vmbr0 is a liability. It works as a short-term fallback while you diagnose. It is not a permanent solution for a cluster node. USB has overhead, USB adapters have driver quirks, and USB connections are physically fragile compared to a PCIe NIC. If you find yourself running a production cluster bridge over USB, that’s a sign something else needs fixing.

Measure before and after. iperf3 is free and takes thirty seconds. Run it before any NIC change, run it after. Don’t assume gigabit just because the link light is green.


The Current State

Both nodes are now running on their respective onboard NICs — Intel e1000e on PVE, Realtek r8169 on PVE2. The USB adapter is in a drawer. The cluster is stable. Corosync is happy. The Raspberry Pi Qdevice continues its quiet existence arbitrating quorum for machines far more powerful than itself.

Would’ve been faster if we’d checked modprobe aliases first. But then there’d be nothing to write about.