Proxmox HA

Making a Proxmox High Availability Virtual Machine cluster without a NAS, without Gluster, and without Ceph.

Some apps are extremely valuable and needed in our systems or our business. They need to be up 24/7. How do we do it? And are we willing to bear the cost?

See the chapter Basic rules for Pro Setups below for how the pros do it.

Levels of High Availability

We usually base our decisions on a calculated availability percentage and the corresponding downtime per year. Then we weigh the cost of running HA against the cost of downtime to the business. The cost of HA includes the necessary people, devices and services.

There is no point in setting up HA without 24/7 operator coverage, UPSes, and redundant infrastructure and hardware.

Cases where we can't use HA

  • no personnel available 24/7/365
  • lack of redundant hardware
  • lack of redundant infrastructure

Cases where we don't necessarily need HA

  • see Cases where we can't use HA above
  • 99% = 3.65 days of downtime per year; servers usually do that easily
  • 99.9% = 8.76 hours; most servers can do that

Cases where we probably want to use HA

  • 99.99% = 52.56 minutes
  • 99.999% = 5.26 minutes
  • 99.9999% = 31.5 seconds
  • 99.99999% = 3.15 seconds
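
These numbers follow directly from the definition: downtime per year = (1 - availability) x one year. A minimal sketch to compute them, assuming a 365-day year:

```bash
# Yearly downtime allowed by a given availability percentage.
# Usage: sh downtime.sh 99.99   (defaults to 99.99)
awk -v a="${1:-99.99}" 'BEGIN {
    s = (1 - a/100) * 365*24*3600          # seconds in a non-leap year
    printf "%.5g%% -> %.2f hours = %.2f minutes per year\n", a, s/3600, s/60
}'
```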

Mission statement

The goal is to have one primary server (pve1) running my web server VM and a secondary server (pve2) on standby in case pve1 fails, then pve3 if both pve1 and pve2 are out, and so on.

Basic requirements

The solution is simple: pve1 replicates the VM’s hard drive to pve2’s storage. There are some requirements:

  1. Both servers must be part of a cluster
  2. ZFS storage must be used instead of LVM
    (other options also exist but are outside the scope of this post)
  3. ZFS pools must have the same name on every node - lake in my cluster (see the check after this list)
  4. The network connection must be very reliable; the cluster is sensitive to latency and packet loss
  5. All nodes need to have the same network bridges: vmbr0, vmbr1 ...
  6. All nodes need to use the same VLANs
  7. Nodes should preferably use the same CPU family
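
A quick way to verify requirement 3 before configuring replication (pool name lake, as above):

```bash
# Run on every node: the pool must exist under the same name everywhere
zpool list lake
```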

Set up a Proxmox cluster

We need 3 identical (in the ideal case) servers (anything from a Raspberry Pi to server-grade hardware) with 1 or more extra disks for our cluster storage. We could also use iSCSI or NFS storage, but that is outside the scope of this post and calls for 10G networking.

  1. Create 3 (or a larger odd number of) Proxmox servers: pve1 - pve3
    - each having 1 unused disk, or better 4 unused disks, for ZFS storage
    - one free network card for cluster traffic - we need speed
    - the same root password on all nodes makes life easy
  2. Create the cluster CU (or any name you like)
  3. Join the nodes to the cluster, using pve1's cluster data and root password
    - join pve2 to the cluster CU
    - join pve3 to the cluster CU
  4. Create the cluster disk storage using ZFS
    - create on pve1 ZFS storage: CU-storage with Add storage checked
    - create on pve2 ZFS storage: CU-storage with Add storage not checked
    - create on pve3 ZFS storage: CU-storage with Add storage not checked
    - activate CU-storage under Datacenter > Storage: edit it and choose all nodes
  5. Create the new VM on pve1: VM900. Use CU-storage for its hard disk
  6. Configure Replication on pve1 for VM900 and add pve2 and then pve3
  7. Set up High Availability: in Datacenter > HA, add VM900 (the same flow on the command line is sketched after this list)
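
For those who prefer the shell, the same flow looks roughly like the sketch below. Treat it as a sketch, not a copy-paste recipe: the disk device /dev/sdb, the address placeholder, and the 15-minute schedule are assumptions; the cluster name CU, pool name lake, storage CU-storage, and VM 900 come from the steps above.

```bash
# Step 2 - on pve1: create the cluster
pvecm create CU

# Step 3 - on pve2 and pve3: join using pve1's address and root password
pvecm add <address-of-pve1>
pvecm status                  # all three nodes listed, quorum established

# Step 4 - on EVERY node: create the ZFS pool, same name everywhere
# (a single disk is shown; use mirror or raidz for real setups)
zpool create -f lake /dev/sdb

# Step 4 - once, on any node: register the storage for all nodes
pvesm add zfspool CU-storage --pool lake --nodes pve1,pve2,pve3

# Step 6 - on pve1: replicate VM 900 to pve2 and pve3 every 15 minutes
pvesr create-local-job 900-0 pve2 --schedule '*/15'
pvesr create-local-job 900-1 pve3 --schedule '*/15'

# Step 7 - put VM 900 under HA management
ha-manager add vm:900
```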

That's all folks - now we have a 24/7 VM running. The remaining single point of failure is the 230V supply, so at least one server needs a functioning UPS. Other risks are personnel (always the number 1 reason things go sideways), ISP outages, cable faults, weather and acts of God.

Read more about high availability

Proxmox wiki: High Availability (old wiki link)

Basic rules for Pro Setups

Requirements

You must meet the following requirements before you start with HA:

  • at least three cluster nodes (to get reliable quorum)
  • shared storage for VMs (ZFS/NFS) and containers (NFS)
  • hardware redundancy (everywhere)
  • use reliable “server” components
  • hardware watchdog - the Linux kernel watchdog (softdog) serves as a fallback (see the note after this list)
  • optional hardware fencing devices
  • active backup systems
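
On the watchdog point above: Proxmox VE falls back to softdog unless told otherwise. If your board has a hardware watchdog, the Proxmox HA documentation has you enable it in /etc/default/pve-ha-manager; iTCO_wdt below is only an example module (common on Intel chipsets), so check what your hardware actually provides.

```bash
# /etc/default/pve-ha-manager
# Load a hardware watchdog module instead of the softdog fallback;
# a reboot activates the change.
WATCHDOG_MODULE=iTCO_wdt
```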

Eliminate single points of failure (redundant components)

  • use an uninterruptible power supply (UPS)
  • use redundant power supplies on the main boards
  • use ECC-RAM
  • use redundant network hardware (separate boards, not just extra ports)
  • use RAIDZ for local storage (see the example after this list)
  • use distributed, redundant storage for VM data
  • use several ISPs
  • use several locations and even continents
  • use internal and external systems
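
To illustrate the RAIDZ point above, a pool that survives a single disk failure could be created like this (device names are assumptions; adjust to your hardware):

```bash
# RAIDZ1 over four disks: capacity of three, tolerates one disk failure
zpool create -f lake raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde
zpool status lake
```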

Reduce downtime

  • rapidly accessible administrators (24/7)
  • use monitoring software (see the checks after this list)
  • availability of spare parts (other nodes in a Proxmox VE cluster)
  • automatic error detection (provided by ha-manager)
  • automatic failover (provided by ha-manager)
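
Both ha-manager and the replication stack expose their state on the command line, which is handy to feed into the monitoring mentioned above. A minimal sketch, assuming the cluster from the first half of this post:

```bash
# Quick health checks, suitable for wiring into monitoring
ha-manager status        # HA stack: quorum, master, resource states
pvesr status             # replication jobs and last successful sync
pvecm status             # cluster membership and quorum details
```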