
Proxmox Backup Strategy

What if your servers stop working? Where is your data? When will it be up again?
You need to create a solid backup strategy and stick to it. You can and should refine the strategy on a regular basis. Proxmox Backup Server (PBS) is a highly recommended tool for automatic backups. You also need to store backup data at one or more remote locations.

Follow the 3-2-1 backup rule

Whether the disruption is caused by accidental deletion or hardware failure, or by something more severe like a natural disaster or malware attack, maintaining access to data is critical.

The 3-2-1 rule, attributed to photographer Peter Krogh over 20 years ago, follows these easy requirements:

3 Copies of Data – Maintain three copies of data — the original, and at least two copies.
2 Different Media – Use two different media types for storage. This can help reduce any impact that may be attributable to one specific storage media type. It’s your decision as to which storage medium will contain the original data and which will contain any of the additional copies.
1 Copy Offsite – Keep one copy offsite to prevent the possibility of data loss due to a site-specific failure.
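
The three requirements are easy to check mechanically. Here is a minimal sketch, assuming you keep a simple inventory of your copies; the `BackupCopy` class, its field names and the example inventory are my own illustration, not part of any Proxmox tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data (hypothetical inventory record)."""
    name: str      # label for the copy
    media: str     # media type, e.g. "ssd", "hdd", "tape", "cloud"
    offsite: bool  # True if stored away from the primary site


def check_321(copies):
    """Return a list of 3-2-1 violations (empty list means compliant)."""
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need at least 3")
    if len({c.media for c in copies}) < 2:
        problems.append("all copies share one media type, need at least 2")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    return problems


# Example inventory (made-up values): the original plus two backups.
copies = [
    BackupCopy("production data", "ssd", offsite=False),
    BackupCopy("local PBS datastore", "hdd", offsite=False),
    BackupCopy("remote PBS datastore", "hdd", offsite=True),
]
print(check_321(copies) or "3-2-1 satisfied")
```

Note that the original counts as one of the three copies, exactly as in the rule above.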

A single copy of critical data may seem to be sufficient to recover from. However, at the heart of every robust data protection plan is the 3-2-1 backup rule. Today, this rule is a universally accepted strategy within the IT industry and beyond. The 3-2-1 backup approach is recommended by information security professionals and government agencies like the Cybersecurity and Infrastructure Security Agency (CISA) in the USA (in the Data Backup Options document by US-CERT).

Most enterprises go much further than the basic 3-2-1 rule, using an extended 3-2-1-#-#-# type of rule. The 3-2-1 rule is over 20 years old but is still the de facto rule in IT.

💡

If a disaster can happen - it will happen at the worst time possible.

We need to prepare for anything to happen.
Do we still need to operate in those conditions, and at what cost?
Divide the challenges into groups like Natural, Financial and Political and ask the question for each of them. Make a summary.
Based on steps 1-3 we proceed. What can we do to prepare our business for the challenges, and at what cost?
What are the single points of failure in our system?
What is critical for our business to always have backed up?
Where to keep the backups?
Disaster recovery tactics
How often do our utilities fail (times per year or times per day)?
What are the backup utilities, and is there a need for a second stage?
VM by VM, create a schedule for backups (daily, hourly ...)
How many backups to keep, and for how long?

Working through these 12 steps, and iterating over them, forms a solid strategy and a budget.
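
To get a feeling for the last two steps (schedule and retention), here is a small sketch of a retention rule: keep the newest few backups plus the newest backup of each recent day. It is a simplified illustration of the idea only, not the prune algorithm that Proxmox Backup Server uses; the function name `select_keep` and its parameters are my own.

```python
from datetime import datetime, timedelta


def select_keep(timestamps, keep_last=3, keep_daily=7):
    """Pick the backups to keep: the newest `keep_last` backups, plus the
    newest backup from each of the most recent `keep_daily` calendar days."""
    ordered = sorted(timestamps, reverse=True)  # newest first
    keep = set(ordered[:keep_last])
    days_seen = set()
    for ts in ordered:
        if ts.date() in days_seen:
            continue  # this day already has its newest backup
        if len(days_seen) >= keep_daily:
            break  # enough days covered
        days_seen.add(ts.date())
        keep.add(ts)
    return sorted(keep)


# Example: hourly backups for ten days (made-up schedule).
start = datetime(2024, 1, 1)
hourly = [start + timedelta(hours=h) for h in range(240)]
kept = select_keep(hourly)
print(f"{len(hourly)} backups taken, {len(kept)} kept")
```

Running the numbers like this, VM by VM, gives you the storage you need to budget for.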

💡

Availability of 99% - 99.999% - 99.99999% means downtime per year of 3.65 days - 5.26 min - 3.15 sec
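
The arithmetic behind these figures is simple: the fraction of the year you are allowed to be down, multiplied by the length of a year. A minimal sketch, assuming a 365-day year (the function name is my own):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def downtime_seconds_per_year(availability_percent):
    """Allowed downtime per year, in seconds, for a given availability."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


for pct in (99, 99.999, 99.99999):
    seconds = downtime_seconds_per_year(pct)
    print(f"{pct}% -> {seconds:.2f} s ({seconds / 86400:.4f} days)")
```

Every extra nine cuts the allowed downtime by a factor of ten, and the cost of getting there rises accordingly.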

Some real life examples

Some battles we win, some we lose. All electronics will break, and all moving parts even faster. Here I describe some cases, without the details but with the general layout:

A customer with three sites backs up between sites A<->B, B<->C and C<->A, and has cloud storage as the backup's backup. Backups run nightly over OpenVPN tunnels.
A company has a small remote office 200 km away and backs up to the remote office over OpenVPN tunnels every night. The local backups also run nightly.
A multinational company has a production site in one country and a satellite production site in another country a thousand km away, and the whole production runs on a sophisticated computerized system. The main site does the heavy lifting and holds the databases, but the satellite has a hot standby system. The network links between the sites are routed through different countries for security and to safeguard against digger accidents. (They actually had one of the cables cut at a building site hundreds of km away.)
The owner of a small production company called me for emergency help to restore their server, which had crashed some time earlier. Their IT said it was an unrepairable hardware failure in the disk array. This server held all billing, all job descriptions and all orders. When I asked where the backup device was, he said they did not have it any more; they had used it for something else because they had RAID5.
A typical No Can Do situation.
Lesson learnt: RAID5 is not a backup.
New Year's Eve, after 21:00, a VP of production calls from 300 km away. They need a spare card within 4-5 hours or their whole site goes down, resulting in a cost of 5-10 million euros and a production stop of up to several months.
I found a cab driver willing to do the run and got my warehouse manager to go and pack the card and send it off with the cab. A very happy customer.
Lesson learnt: Keep all the essential spares on site.

It's up to you to create a strategy based on your business facts, and a budget to make it possible. Some things must be backed up and some do not need to be.

💡

Backups are important - but so is the bottom line.

How to reduce problems

Test your backups regularly
Keep backups at different locations
Use server grade or enterprise grade components in your build
Eliminate single points of failure (redundant components)
Use an uninterruptible power supply (UPS)
Use redundant power supplies in the server
Always use ECC-RAM - absolutely always
Use redundant network cards
Use RAID-Z1 or -Z2 for local storage
Avoid RAID cards
Beware of USB disk performance and stability problems
Beware of bit-rot in long-term storage - avoid RAID5
Use distributed, redundant storage for VM data
Use a Proxmox Backup Server or a redundant pair
Remember that SSDs have a maximum number of write cycles
- use spinning rust for NAS devices and heavily used swap
- use SSDs for static data, as boot devices and for ZFS logs
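
For the first point in the list, testing your backups, one simple approach is to restore to a scratch directory and compare checksums against the source. The sketch below assumes you have already restored the files somewhere; the function names and directory layout are my own, and PBS has its own verification features that this does not replace.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing or differ in the restored tree."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    bad = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = restored_dir / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            bad.append(str(rel))
    return bad


# Usage (hypothetical paths):
#   problems = verify_restore("/srv/data", "/mnt/restore-test")
#   an empty list means every source file was restored bit for bit
```

This only makes sense for data that has not changed since the backup was taken, so run it against a snapshot or a quiet dataset.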

Reduce downtime when disaster strikes

Rapidly accessible administrators (24/7)
Use of email notifications to a support mailing list
Availability of tested spare parts on site, like disks and NICs
Calculate SSD writes expected for a device and plan replacement cycle
Automatic error detection (provided by HA-manager)
Automatic fail over (provided by HA-manager)
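
The SSD replacement cycle mentioned in the list can be estimated from the drive's rated endurance and your measured write volume. A rough sketch, assuming you take the TBW rating from the datasheet and the written amount from SMART data; the function name and all numbers in the example are made up.

```python
def ssd_years_remaining(rated_tbw, written_tb, daily_writes_gb):
    """Estimate years until the rated endurance (TBW) is used up.

    rated_tbw       -- endurance rating in terabytes written (datasheet)
    written_tb      -- terabytes written so far (e.g. from SMART data)
    daily_writes_gb -- average gigabytes written per day
    """
    if daily_writes_gb <= 0:
        raise ValueError("daily_writes_gb must be positive")
    remaining_tb = max(rated_tbw - written_tb, 0)
    tb_per_year = daily_writes_gb * 365 / 1000  # decimal units, 1 TB = 1000 GB
    return remaining_tb / tb_per_year


# Example with made-up numbers: 600 TBW drive, 150 TB written, 200 GB/day.
years = ssd_years_remaining(rated_tbw=600, written_tb=150, daily_writes_gb=200)
print(f"About {years:.1f} years of rated endurance left")
```

Treat the result as a planning figure, not a promise: drives can fail long before or long after the rated endurance.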

💡

High Availability clusters need backups too!

Time Synchronization

The Proxmox VE cluster stack itself relies heavily on the fact that all the nodes have precisely synchronized time. Some other components, like Ceph, also won’t work properly if the local time on all nodes is not in sync.

See my post Start using Proxmox for how to set it up.

Use monitoring tools

Zabbix is my choice of monitoring system, see my blog. Just knowing a server is up doesn't really tell you anything. You need to know if it's behaving normally or not.

Monitor a multitude of things
Use alerts 🚨
Use email alerts 🔔.
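
What "behaving normally" means can be sketched as a comparison against a baseline: flag a value when it is far from what the server usually reports. This is only an illustration of the idea, not how Zabbix implements its triggers; the function name, the threshold of 3 standard deviations and the sample values are my own assumptions.

```python
from statistics import mean, stdev


def is_abnormal(history, current, threshold=3.0):
    """Flag `current` if it lies more than `threshold` standard
    deviations away from the mean of the `history` samples."""
    if len(history) < 2:
        raise ValueError("need at least two samples for a baseline")
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu  # perfectly flat baseline: any change is odd
    return abs(current - mu) / sigma > threshold


# Made-up load averages from a server on a normal day.
load_history = [0.8, 1.0, 0.9, 1.1, 1.0, 0.9, 1.2]
print(is_abnormal(load_history, 1.1))
print(is_abnormal(load_history, 4.0))
```

A server can answer ping all day long while its load, disk latency or memory use is far outside the baseline, and that is what you want the alert for.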

Systems Security Auditing

An internal SSA is the start of the process. Use a third-party Systems Security Audit if you can afford it; it’s damn expensive. See my blog.

Business Continuity Strategy

Create a Business Continuity Strategy for bad and worst-case scenarios. In many areas nature is not that easy on you: earthquakes are not nice, wildfires are also bad and getting more frequent every year as global warming continues, and flooding is a major problem in many places, with rising sea levels making it worse.

House fires happen, cars do drive through walls from time to time, and the electricity is interrupted by trees falling from snow and wind or by human intervention in some form. Diggers are a major problem for data and electrical cables everywhere, whatever we do, it seems.

Proxmox Backup Server

See my blog Proxmox Backup Server and Franken Proxmox How to install Proxmox VE and Proxmox Backup Server on the same server.

Hardening your Servers and

See my blogs Hardening Servers 3

Other blog posts of possible value

Security Audit and Hardening blog post
Hardening your Servers blog post
Proxmox Backup Server blog post
Monitoring your Servers blog post

What if backups fail?

Yes - this will happen, so be prepared for it! This is what makes IT people into heroes - or not. Take it as a great challenge or a fantastic journey.

Read or write errors

If a file does not open, try earlier or later copies - you might get lucky, or at least get to a point from where you can recreate the file.

Sometimes a file does not open as read-write but can still be used read-only. This is a typical sign of "Bit Rot".
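
Trying one copy after another is easy to automate. A minimal sketch, assuming you have a list of candidate paths (for example the same file restored from several backup dates); the function name is my own. It opens each copy read-only and returns the first one that can be read in full.

```python
def first_readable(paths):
    """Return (path, data) for the first copy that can be read completely.

    `paths` should be in order of preference, e.g. newest copy first.
    Returns (None, None) if no copy could be read.
    """
    for path in paths:
        try:
            with open(path, "rb") as f:  # read-only, binary mode
                return path, f.read()
        except OSError:
            continue  # unreadable or missing, try the next copy
    return None, None
```

A copy that reads without errors can still contain damaged data, so check the content or a checksum before trusting it.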

Obsolete software

If a program can't be run on modern hardware, use a VM to run it. I have some Windows XP VMs for running obsolete software.

No data to use

Scan code or documents from paper copies to recreate the data. Been there, seen that.

Wrong Format

If data backups are in the wrong format, some DBs have tools for converting them; otherwise, just start writing conversion programs. I have done conversions with MySQL tools, Basic, Cobol, Bash and, of course, the workhorse for everything: Excel.