Setting up Proxmox servers

To run ZFS on a RAID controller we need the controller in HBA or IT mode, plus a separate boot device, since you can't boot from a device attached to the HBA.

There are many ways to set up file systems for Proxmox. I will discuss a particular approach: one disk for boot, with a partition on the same disk for ZFS. Many of us don't want to dedicate the whole boot drive to Proxmox; the SSD or NVMe could serve other functions too.

Use PCIe-attached carriers for SSDs or NVMe drives, with the number of disks and the amount of storage you need. This way you can utilize the full speed of these disks. There are versions from x1 to x16 lanes, depending largely on the PCIe generation and the speed needed. You need to find the relevant specs for your CPU, PCIe bus and BIOS, which is not always easy or even possible.

See the notes and references at the end of the article for more in-depth information.
💡
Your motherboard, its BIOS and your CPU determine what speeds you can expect and even what devices you can use.

HBA Controllers

It's not recommended to run RAID controllers for ZFS. It's not only the extra complexity and additional points of failure, but also that they write unwanted data to the disks. HBA controllers are readily available and legacy controllers are not expensive, and home labs usually run on legacy servers. Many modern RAID controllers can be switched to HBA mode in the BIOS or with a utility program.

💡
SATA is half-duplex and SAS is full-duplex

For older servers

Look at your HBA to determine the speed. Below are some average speeds per port for older servers like HP DL360/380 Gen 5 or 6 with HBAs:

  • PCIe (gen2) x8 - 300 MB/s
  • PCIe (gen2) x4 - 150 MB/s
  • PCIe (gen1) x8 - 150 MB/s
  • PCI-X 64-bit 133MHz - 100 MB/s

The recommended base level device is the old but trusted LSI SAS2008: 6Gbps, PCIe (gen2) x8, 300-350MB/s per port. For older servers a good choice is the LSI SAS1068E: 3Gbps, PCIe (gen1) x8, 150-175MB/s per port, or PCI-X 64-bit 133MHz at 107MB/s per port.

For more modern servers

The best choices are the LSI/Avago/Broadcom SAS 93xx or 94xx series, but the old 92xx (LSI SAS2308 chip based) is still very useful. For instance the 9200-8e does 320,000 IOPS and the newer 9207-8e 650,000 IOPS.
If you have 4-8 disks, any of them will do, but if you are running more than 8 you need to pay attention to the performance of the HBA.

  • SAS is full duplex and much faster than SATA, which is half duplex.
  • SAS speeds range from SAS1 at 3 Gb/s (300 MB/s) to 24G SAS (SAS4) at 22.5 Gb/s, with SAS5 at 45 Gb/s coming soon. The speeds match the PCIe Gen 1-4 storage ecosystem.
  • SATA 1 is 1.5 Gb/s, SATA 2 is 3 Gb/s and SATA 3 is 6 Gb/s.
Some vendors use proprietary software to lock the controller to their own disks only.
NVMe can deliver a sustained read/write speed of around 20 Gb/s,
way faster than a SATA III SSD, which is limited to 6 Gb/s.
💡
Remember, ZFS works best with HBAs or IT-mode controllers. Avoid using RAID storage controllers, even configured as RAID 0.

Test your disk I/O and tune it to your needs. Link
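
A quick way to benchmark a drive is fio. Below is a minimal sketch of a non-destructive sequential read test; /dev/sdX is a placeholder for the disk you want to test and the numbers are just starting points, so adjust them to your workload:

apt install fio
fio --name=seqread --filename=/dev/sdX --rw=read --bs=1M \
    --ioengine=libaio --iodepth=16 --direct=1 --readonly \
    --runtime=30 --time_based --group_reporting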

Setup hack for Proxmox on NVMe/SSD

We have a 256 GB or bigger NVMe or SSD that we want to use for more than just a boot disk. We can use part of it as a ZFS pool for VMs, with replication and High Availability.

  • Set up Proxmox to use 32-64 GB for boot, swap and system storage. LVM is great here: if you run low on disk space and there is spare space on the drive, it can be a knight in shining armor. For a 256 GB SSD: 38 GB for Proxmox and 200 GB for the cluster.
  • The rest, 200-900 GB depending on drive size, goes to partition sda4, which becomes the 'storage' pool used for mission critical VMs/CTs with replication and HA.
  • ISO's are best to keep on a NAS.
  • Backups on Proxmox Backup Server.
  • You could also use 10G networking and shared storage like iSCSI or NFS.

Solid-state drives are getting more and more common. A problem that comes with SSDs is their limited cell lifetime. Depending on their manufacturing technique, each cell can be overwritten from about 1,000 times in consumer TLC SSDs up to 100,000 times in enterprise SLC based SSDs.

The value to keep an eye on is the guaranteed TBW (Terabytes Written) in the drive's specs. You can compare it against the drive's SMART data: take the total LBAs written, multiply by 512 and divide by 10^12 (or by 2^40 for TiB).
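
As a sketch, you can read the raw counter with smartctl and do the math. The attribute name and column position vary by vendor (many SATA SSDs use Total_LBAs_Written, NVMe drives report Data Units Written instead), so verify against your drive's datasheet:

smartctl -A /dev/sda | grep -i total_lbas_written
# TB written = RAW_VALUE x 512 / 10^12, for example with awk:
smartctl -A /dev/sda | awk '/Total_LBAs_Written/ {printf "%.2f TB written\n", $10*512/1e12}'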

Install Proxmox

We will use a special setup for the boot disk. In the installer, select your target disk, open the advanced disk options and set the disk size (hdsize) to 38G. After the initial boot we start setting up the rest.

Create a new partition on the boot disk

fdisk /dev/sda                         # a 256G NVMe, 38G used by Proxmox
Command (m for help): n                # create a new partition
Partition number (4-128, default 4): 4 # the new partition number
First sector.......................:   # just hit [enter]
Last sector........................: +200G # to create a 200G partition
Command (m for help): w          # writes the new partition table to your disk
Is this the best way? No, it's better to use more disks.
But many of us need this hack from time to time.

Setup of Proxmox for clustering

I focus here on normal cluster setups, this time without CephFS or GlusterFS.
We use migration and High Availability with 3 or 5 nodes.

Planning the cluster first - then implement

You need to plan the whole cluster before starting to add nodes.

  • Why are you clustering? What are the goals?
  • Power distribution and UPS
  • Cooling/Warming
  • Users and their roles
  • Networking including VLAN's
  • Storage strategy, including the types of storage
  • Rack space
  • SAS is better but SATA is OK in a home lab.
  • ZFS is the best for Proxmox and you get to use all the features of Proxmox.

Add a zpool - tank on sda4

Create the zpool tank from the CLI. Here I use the famous 'tank', but you could use something more interesting like MyHApool or TestClusterZFSpool.
Avoid using local and local-lvm for this - they are tiny. You can restrict their use with the pvesm set command, for example: pvesm set local --content snippets,vztmpl
Set up more disks for your storage needs.

In large arrays or high performance enterprise systems it's common to define the pools with a write cache (ZIL/SLOG) on a small mirrored NVMe, as much memory as possible for the read cache (ARC), and maybe, but not likely, an L2ARC cache SSD.
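
As an illustration only (the device IDs below are placeholders), adding a mirrored SLOG and an L2ARC device to an existing pool would look roughly like this:

zpool add tank log mirror /dev/disk/by-id/nvme-SLOG-DISK-1 /dev/disk/by-id/nvme-SLOG-DISK-2
zpool add tank cache /dev/disk/by-id/ata-L2ARC-SSD-1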

Wipe the disks before use

To prepare the disks, it's recommended to wipe them before creating the pool.

wipefs -a /dev/sdk /dev/sdl /dev/sdm
wipefs -a /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

What sector size do your disks have?

You should not mix different sector sizes in one array. Use fdisk -l to see the block size in use.

fdisk -l

You can also use

blockdev --report /dev/sda 

Print a report for the specified device

blockdev -v --getss /dev/sda 

Logical sector size in bytes - usually 512.

blockdev -v --getpbsz /dev/sda

Physical block (sector) size

blockdev -v --getbsz /dev/sda

Block size in bytes; this does not describe the device topology.

hdparm -I /dev/sda | grep -i physical

For a 512e drive this reports 512 bytes logical / 4096 bytes physical.

diskinfo -v da1

On BSD systems
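
On Linux, lsblk can also show the logical and physical sector sizes of all disks at once, which is a handy cross-check:

lsblk -o NAME,LOG-SEC,PHY-SEC,SIZE,MODEL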

Non-Advanced
512n disks have both logical and physical sectors of 512 bytes (also written 512 or 512 native). This was the norm before 2010.

Advanced Format (AF)
AF is a generic term for block devices with physical sectors larger than the traditional 512 bytes, these generally have a physical sector size of 4K.

512e
is a 512-byte emulation where the logical (addressable) sector is 512 bytes on top of a 4K physical sector
(8:1 logical:physical sector ratio).

4k
4096-byte native mode, where the logical (addressable) sector equals the 4K physical sector (1:1 logical:physical sector ratio). Also called 4K native, 4Kn or True 4K.

NOTE: 520-byte sectors are a special format that you need to low-level format back to 512, which takes a long time (these disks come from special RAID controllers, like the ones used in Fibre Channel systems).
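
If you do end up with 520-byte formatted SAS disks, the usual tool is sg_format from sg3_utils; a sketch (expect it to run for hours per disk, and it destroys all data - /dev/sdX is a placeholder):

apt install sg3-utils
sg_format --format --size=512 /dev/sdX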

Create the zpool

zpool create -f -o ashift=12 tank /dev/sda4
  • create: to create the pool.
  • -f: force creating the pool to bypass the “EFI label error”.
  • -m: mount point of the pool. If this is not specified, then the pool will be mounted to root as /pool.
  • pool: the name of the pool.
  • type: mirror, raidz, raidz2, raidz3. If omitted, default type is a stripe or raid0.

Don't refer to a device as /dev/sda; use the ID instead.

ls /dev/disk/by-id

Copy the output into a notepad and make it a one-liner. Add zpool create -f -o ashift=12 -m <mount> <pool> <type> before the <ids>. Many drives report a sector size of 512 but are actually 512e, i.e. 4k-native drives that report 512.

The best way to create a zpool

First list your disk IDs, then add them to the create command: zpool create lake raidz2 <id1> <id2> <id3> <id4> <id5> <id6>

ls /dev/disk/by-id

Copy the output into a notepad, make it one line and separate the IDs with spaces. Then add, in front of all the IDs:

For 512n drives: zpool create -f -m <mount> <pool> <type>
For 512e/4kn drives: zpool create -f -o ashift=12 -m <mount> <pool> <type>

NOTE: many drives report 512 but are actually 512e drives, i.e. 4k-native drives that report a sector size of 512. Do not mix the types in an array, but if it can't be helped, just use -o ashift=12.
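
Put together, a command built from the by-id output might look like this sketch; the IDs below are made-up placeholders, use the ones ls /dev/disk/by-id printed for your drives:

zpool create -f -o ashift=12 lake raidz2 \
  /dev/disk/by-id/ata-EXAMPLE_DISK_SERIAL1 \
  /dev/disk/by-id/ata-EXAMPLE_DISK_SERIAL2 \
  /dev/disk/by-id/ata-EXAMPLE_DISK_SERIAL3 \
  /dev/disk/by-id/ata-EXAMPLE_DISK_SERIAL4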

Another way of creating the zpools

Make a mirrored ZFS pool for speed,
or set up RAID-Z1, -Z2 or -Z3, or any combination of these.

zpool create tank mirror /dev/sdb /dev/sdc
zpool create spark mirror c1t0d0 c2t0d0 c3t0d0 mirror c4t0d0 c5t0d0 c6t0d0

RAID-Z1 is like RAID5 but better, -Z2 is similar to RAID6 and -Z3 is über good

zpool create -f -o ashift=12 tank raidz1 /dev/sdk /dev/sdl /dev/sdm
zpool create -f -o ashift=12 tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
Z1 for 3-7 disks, Z2 for 6-15 disks, and Z3 for large arrays.
Striped groups of mirrors are also very effective when you need high speed.

Use disk-by-id referencing

Device names like /dev/sda are bound to change after reboots or hardware changes, which can leave the zpool degraded. Hence we make the zpool use block device identifiers (/dev/disk/by-id) instead of device names. See: The best way to create a zpool.

Exporting the ZFS pool and importing it back with /dev/disk/by-id will also pin the disk references.

zpool export tank && zpool import -d /dev/disk/by-id tank 
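
A quick check that the pool now references devices by ID instead of /dev/sdX:

zpool status tank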

Add the ZFS pool to PVE Storage Manager

This pool is for storing disk images and containers only.

pvesm add zfspool tank -pool tank

Set compression on

zfs set compression=lz4 tank

If you need to store ISOs and backups on a zpool

It is recommended to create an extra ZFS file system to store your ISO images, backups or whatever else you need. Let's call the pool lake (use your imagination).

Create a zpool named lake as above and then add a directory storage on it:

zpool create -f -o ashift=12 lake /dev/sdl
zpool export lake && zpool import -d /dev/disk/by-id lake 
zfs set compression=lz4 lake
zfs create lake/vmdata
pvesm add dir vmdata --path /lake/vmdata --content backup,iso,snippets,vztmpl
💡
Use only high quality SAS/SATA cables that lock properly at both the disk and the controller. Loose cables can destroy the whole array.

Networking needs

Ports used by Proxmox by default

  • web interface: 8006 (TCP, HTTP/1.1 over TLS)
  • VNC Web console: 5900-5999 (TCP, WebSocket)
  • SPICE proxy: 3128 (TCP)
  • sshd (used for cluster actions): 22 (TCP)
  • rpcbind: 111 (UDP)
  • sendmail: 25 (TCP, outgoing)
  • corosync cluster traffic: 5405-5412 UDP
  • live migration (VM memory and local-disk data): 60000-60050 (TCP)
  • For SSH to the node (yes, you need it), add your own port number like 222 or 666
Open only the ports you need in the firewall and be as strict as possible.
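
As a sketch of what that can look like at the datacenter level (the source network 192.168.1.0/24 is a placeholder for your management LAN; check the pve-firewall syntax against the manual):

# /etc/pve/firewall/cluster.fw
[OPTIONS]
enable: 1

[RULES]
IN ACCEPT -source 192.168.1.0/24 -p tcp -dport 8006 # web UI
IN ACCEPT -source 192.168.1.0/24 -p tcp -dport 22 # ssh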

Configure your NIC's and vmbr's

  1. Management vmbr
  2. Cluster NIC (dedicated, need for speed - 10G)
  3. Replication NIC (dedicated, heavy traffic - 10G)
  4. 1 - n internal vmbr LAN's for the node's VM's to talk to each other
  5. A VM LAN for accessing the Internet
  6. VLAN's for the VM's (Servers, Infra, Guest, LAN, Backup, ... )
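
For illustration, a fragment of /etc/network/interfaces with a management bridge and a dedicated 10G link might look like this; interface names and addresses are placeholders for your own:

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.10/24
    gateway 192.168.1.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

auto ens2f0
iface ens2f0 inet static
    address 10.10.10.1/24      # dedicated cluster/replication link, no bridge needed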

Users and groups - security

You do not want to expose root too much. Add users and groups, and use Pools to restrictively give users access only to the VMs and features they need.

The documentation is good on this matter so I refer you to the wiki, helps and manuals.


Create a cluster

Prepare your second node (and likewise every node) by creating the zpool tank and the networks. You will need a fast dedicated NIC for the replication.

Adding your second node to the cluster makes it a real cluster, and you can add more nodes as the needs grow. For HA you need 3 or 5 nodes, always an odd number.
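
From the CLI this boils down to a couple of commands (the cluster name and IP are examples):

pvecm create mycluster        # on the first node
pvecm add 192.168.1.10        # on each additional node, pointing at the first node
pvecm status                  # verify quorum and membership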

After creating the cluster you need to add a new common storage for your VM's

  • In Proxmox/Datacenter add a ZFS storage with:
    Storage ID: storage
    ZFS Pool: tank
    and check Thin provision
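
The CLI equivalent is roughly the following, assuming the storage ID storage and the pool tank from above:

pvesm add zfspool storage --pool tank --sparse 1 --content images,rootdir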

Common external storage

You can also use a NAS to store your VM disks. Then a zpool for replication isn't strictly needed, but it is still recommended for VMs that need fast switchovers.

  • A 10G NIC is needed for decent speeds
  • Use NFS or iSCSI storage for speed.
  • ISO's and Templates are best to store on a NAS

Setting up a High Availability cluster

You might need to have a secure and always-on VM or two. The use case determines what configuration you need. The key questions are how much downtime per year is acceptable and at what cost.

💡
If there's a single point of failure - it's going to fail. Murphy's Law

Redundancy is the key: power, communication, server power supplies, HBA controllers, RAM, disks. For really high availability you need dual utilities and multiple sites, all with heartbeat monitoring.

  • First you need to have a UPS in your rack with the capability to take the load for the minimum time a power outage usually takes.
  • Your servers should have redundant power supplies
  • The recommended networking per server is 4 x 1G and 2 x 10G NICs
  • For larger disk arrays use several controllers and mirror your disk arrays

On each node

The nodes need to mirror each other as closely as possible for successful migration of VMs. The minimum is to have the same zpool on each node:

  • Set up the same zpool or zpool's
  • Set up the same NIC's and the VMBR's
  • Set up the same users

Create the cluster

  • In Proxmox/Datacenter add a ZFS storage with:
    Storage ID: storage
    ZFS Pool: tank
    and check Thin provision (as described above)

Create a Cloud-Init Bootstrap for Qemu-Guest-Agent

A vendor config snippet can be used to bootstrap cloud-init images. Link

To install the QEMU guest agent on Debian/Ubuntu VMs after the VM has been deployed, we use a snippet that runs at initial boot.

Note that the vendor config is executed on first boot only!

Create the directory snippets if not yet present in /var/lib/vz.
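
One way to do that from the shell:

mkdir -p /var/lib/vz/snippets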

Create the file vendor.yaml

nano /var/lib/vz/snippets/vendor.yaml
#cloud-config @ /var/lib/vz/snippets/vendor.yaml
runcmd:
    - sudo apt-get update
    - sudo apt-get install -y qemu-guest-agent
    - sudo systemctl start qemu-guest-agent
    - sudo apt-get dist-upgrade -y

Add it to your script or execute it now if needed/wanted.

qm set 9000 --cicustom "vendor=local:snippets/vendor.yaml"

Set up your replication and HA

Read the Proxmox help or wiki; they explain everything about groups and fencing. You need a watchdog configured.
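
As a rough sketch of the CLI side (VM ID 100, target node pve2 and the group name are placeholders; check the schedule and group options against the manual):

pvesr create-local-job 100-0 pve2 --schedule "*/15"          # replicate VM 100 every 15 min
ha-manager groupadd ha-group1 --nodes "pve1:2,pve2:1,pve3:1" # preferred node gets higher priority
ha-manager add vm:100 --state started --group ha-group1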

VM's that usually do not migrate well

Yes, there are many things that do not migrate well. Usually the cause is a VM using dedicated resources of its node.

  • Containers can't be migrated live
  • 👎 VMs with a controller passed through to them
  • 👎 Firewalls and routers are tricky
  • 👎 Desktops are tricky in many ways

What HW do I run?

apt update && apt install hwinfo

hwinfo --short

Other commands:

Usage: hwinfo [options]
Probe for hardware.
  --short           just a short listing
  --log logfile     write info to logfile
  --debug level     set debug level
  --version         show libhd version
  --dump-db n       dump hardware data base, 0: external, 1: internal
  --hw_item         probe for hw_item

hw_item is one of: all, bios, block, bluetooth, braille, bridge, camera, cdrom, chipcard, cpu, disk, dsl, dvb, fingerprint, floppy, framebuffer, gfxcard, hub, ide, isapnp, isdn, joystick, keyboard, memory, modem, monitor, mouse, netcard, network, partition, pci, pcmcia, pcmcia-ctrl, pppoe, printer, scanner, scsi, smp, sound, storage-ctrl, sys, tape, tv, usb, usb-ctrl, vbe, wlan, zip

Note: debug info is shown only in the log file. (If you specify a log file the debug level is implicitly set to a reasonable value.)

Tech note on disk types from Link


Notes and references

Proxmox manual and wiki [1] - NFS storage [2] - ZFS over iSCSI [3] - ZFS High Availability [4]
RAID-5 issues [5] - OpenZFS Sys Admin [6] - SATA/SAS controllers [7] - Hardware RAID controllers [8] - SAS Serial Attached SCSI [9] - SATA Serial AT Attachment [10] - Speeds and how to test SSDs [11] - SSD TBW Calculator [^tbw]
Phoronix Test Suite for professional testing [12].
Free and open-source [13].


  1. Proxmox Admin guide, Storage See the Web page. More info can be found on the Wiki pages. ↩︎

  2. NFS Storage See this Wiki page. ↩︎

  3. ISCSI Storage See the Wiki page. ↩︎

  4. Proxmox High Availability. See the Web page. ↩︎

  5. In addition to a mirrored storage pool configuration, ZFS provides a RAID-Z configuration with either single-, double-, or triple-parity fault tolerance. Single-parity RAID-Z (raidz or raidz1) is similar to RAID-5. Double-parity RAID-Z (raidz2) is similar to RAID-6.
    All traditional RAID-5-like algorithms (RAID-4, RAID-6, RDP, and EVEN-ODD, for example) might experience a problem known as the RAID-5 write hole. If only part of a RAID-5 stripe is written, and power is lost before all blocks have been written to disk, the parity will remain unsynchronized with the data, and therefore forever useless, (unless a subsequent full-stripe write overwrites it). In RAID-Z, ZFS uses variable-width RAID stripes so that all writes are full-stripe writes. This design is only possible because ZFS integrates file system and device management in such a way that the file system's metadata has enough information about the underlying data redundancy model to handle variable-width RAID stripes. RAID-Z is the world's first software-only solution to the RAID-5 write hole.
    A RAID-Z configuration with N disks of size X with P parity disks can hold approximately (N-P)*X bytes and can withstand P device(s) failing before data integrity is compromised. You need at least two disks for a single-parity RAID-Z configuration and at least three disks for a double-parity RAID-Z configuration, and so on. For example, if you have three disks in a single-parity RAID-Z configuration, parity data occupies disk space equal to one of the three disks. Otherwise, no special hardware is required to create a RAID-Z configuration. ↩︎

  6. OpenZFS System Administration See the Web page. ↩︎

  7. From 32 to 2 ports: Ideal SATA/SAS Controllers for ZFS & Linux MD RAID. See the Web page. ↩︎

  8. Hardware and Open ZFS. See this Information page. ↩︎

  9. Serial Attached SCSI (SAS). See the Wiki page. ↩︎

  10. Serial AT Attachment (SATA). See the Wiki page. ↩︎

  11. Ceph: how to test if your SSD is suitable as a journal device? See Web ↩︎

  12. Phoronix Test Suite. See the GitHub page. ↩︎

  [^tbw]: For TBW calculation use this calculator. ↩︎

  13. Free and open-source needs community support!

    There are many reoccurring costs involved with maintaining free, open-source, and privacy respecting software. Expenses which volunteer developers pitch in to cover out-of-pocket.
    These are just some fine examples of apps developed by people who care about their software, as well as the importance of keeping them maintained.

    Your support is absolutely vital to keep them innovating and maintaining! ↩︎