My New Lab – ZFS
ZFS is the new normal and replaces the RAID-card generation. Why? Our disks have become too big, and the problem of bit rot had to be addressed. After the code was released as open source in 2005 it quickly moved into macOS, FreeBSD and Linux, and a major milestone was the founding of OpenZFS in 2013. The integrity guarantees of ZFS made the world move quickly towards it as the de-facto standard: the bit rot problem of standard RAID approaches is not present in ZFS, and unlike RAID cards, ZFS does not impose branding marks onto the disks. This is a guide for Proxmox users. #zfs
ZFS Basics for Proxmox users
ZFS (from Zettabyte File System) is a file system with volume management capabilities. Its development began at Sun Microsystems in 2001 as part of the Solaris operating system. OpenZFS is now very widely used across Unix and Linux systems, and FreeBSD ships with ZFS support in the base system.
Comparison chart
ZFS layout | Striped ≈ RAID0 | Mirror ≈ RAID1 | RAIDZ ≈ RAID5 | RAIDZ2 ≈ RAID6 | RAIDZ3 (ZFS only) | Striped mirror ≈ RAID10 |
---|---|---|---|---|---|---|
Min # of disks | 1 | 2 | 3 | 4 | 5 | 4 |
Fault tolerance (# of disks) | None | N-1 | 1 | 2 | 3 | note 1 |
Overhead (# of disks) | None | (N-1)/N | 1 | 2 | 3 | note 2 |
Read speed | Fast | Fast | Slower | Slower | Slower | Fast |
Write speed | Fast | Fair | Slower | Slower | Slower | Fair |
Note 1: N-1 disks in each N-disk mirror
Note 2: (N-1)·P disks for a P-wide stripe over N-disk mirrors
Hardware cost:
Stripe = Cheap, Mirror = Very high–highest, RAIDZ = High, RAIDZ2 = High–very high, RAIDZ3 = Very high and RAID10 = Very high–highest
ZFS RAID levels
This becomes a very complex issue once you start to create larger arrays. Knowing how to lay out the pool's top-level vdevs and the vdevs themselves isn't easy. How many spares do we need, and are they hot standbys or not? What is the effect of grouping on performance? What would the net array size be in TiB? How long does a resilver take?
You will need a ZFS calculator like this or this, or at least this or this.
Striping is a technique to store data on a disk array. The contiguous stream of data is divided into blocks, and the blocks are written to multiple disks in a specific pattern. Striping is used with all RAIDZ levels due to the specifics of the ZFS filesystem used on RAIDZ pools.
Block size is dynamically selected for each data row to be written to the ZFS pool.
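As a hedged illustration of this (the dataset name tank/media is just an example), the upper limit for those dynamically sized blocks is the recordsize property, which you can inspect and tune per dataset:
zfs get recordsize tank/media
zfs set recordsize=1M tank/media
The default is 128K; 1M suits large sequential files such as media and backups, while databases usually prefer smaller values.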
Stripe pool is almost equal to RAID0
A stripe pool and RAID0 are almost equal to each other. The capacity of such a volume is the sum of the capacities of all disks, but RAID0 does not add any redundancy, so the failure of a single drive makes the volume unusable. Efficient storage, but poor on integrity; often used for temporary data.
If a striped ZFS pool fails logically but all the disks are present and healthy, you can do a ZFS stripe pool recovery relatively easily using ZFS recovery software. However, keep in mind that if a disk failure happens, the data is lost irreversibly.
With four 1 TB drives in a ZFS stripe you'd get 4 TB of usable disk space, up to 4 x read and write speed gain, and no fault tolerance.
Mirror pool is very similar to a RAID1
Also called “mirroring”. Data is written identically to all disks. This mode requires at least 2 disks with the same size. The resulting capacity is that of a single disk. Often used for boot drives.
If you combine four 1 TB drives in a ZFS mirror, you'd only get 1 TB of usable disk space and 3 TB goes to redundancy. Up to 4 x read speed, no write speed gain. Fault tolerance is 3 drives.
Striped mirror pool is similar to a RAID10
RAID 10 consists of two or more striped RAID 1 disk sets. Requires at least 4 disks. Speed and integrity but capacity is only half of the combined size.
With 4 x 1 TB drives the capacity is 2 TB. Up to 2 x read and write speed gain. Fault tolerance: up to 2 disks can fail without data loss, as long as they are in different mirror groups.
With six 2-disk mirrors of 1 TB drives (12 disks) the capacity is 6 TB. Up to 6 x read and write speed gain. Fault tolerance: up to 6 disks can fail without data loss, at most one from each mirror group.
RAIDZ or RAIDZ1 or Z1 is most similar to a traditional RAID5
A variation on RAID-5, single parity. Requires at least 3 disks.
If you combine four 1 TB drives in a RAIDZ you'd get 3 TB of usable disk space and 1 TB goes to redundancy. Up to 4 x read speed, no write speed gain. Fault tolerance is 1 drive.
RAIDZ2 or Z2
A variation on RAID-6, double parity. Requires at least 4 disks.
If you combine four 1 TB drives in a RAIDZ2 you'd get 2 TB of usable disk space and 2 TB goes to redundancy. Up to 2 x read speed, no write speed gain. Fault tolerance is 2 drives.
With 12 x 1 TB drives the capacity is 10 TB. Up to 10 x read speed, no write speed gain. Fault tolerance is 2 drives.
RAIDZ3 or Z3
An extended variation on RAID-6 with triple parity. Requires at least 5 disks. It is sometimes called RAID 7, but that is an incorrect name.
With 12 x 1 TB drives the capacity is 9 TB. Up to 9 x read speed, no write speed gain. Fault tolerance is 3 drives.
ZFS dRAID (dRAID1, dRAID2, dRAID3)
In a ZFS dRAID (de-clustered RAID) the hot spare drive(s) participate in the RAID. Their spare capacity is reserved and used for rebuilding when one drive fails. This provides, depending on the configuration, faster rebuilding compared to a RAIDZ in case of drive failure. More information can be found in the official OpenZFS documentation.

There will be many situations where traditional RAIDZ vdevs make more sense than deploying dRAID. In general, dRAID will be in contention if you’re working with a large quantity of hard disks (say 30+) and you would otherwise deploy 10-12 wide Z2/Z3 vdevs for bulk storage applications.
dRAID
dRAID1 or dRAID: requires at least 2 disks, 1 can fail before data is lost
dRAID2
dRAID2: requires at least 3 disks, 2 can fail before data is lost
dRAID3
dRAID3: requires at least 4 disks, 3 can fail before data is lost
Things to Use in Large ZFS Arrays
You may have all, some or none of them in your setup, except the ARC, which you always have. On larger storage servers, L2ARC and SLOG are really powerful enhancements.
ARC
ARC is the ZFS main memory cache (in DRAM), which can be accessed with sub microsecond latency.
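A hedged sketch of how to inspect the ARC and cap its size on a Proxmox/Linux host (the 8 GiB limit is only an example value):
arc_summary | head -n 40
cat /proc/spl/kstat/zfs/arcstats | grep -w -e size -e c_max
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u
The module option takes effect after a reboot; at runtime you can also write the value to /sys/module/zfs/parameters/zfs_arc_max.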
L2ARC
L2ARC sits in between, extending the RAM cache using fast storage devices, such as SSDs and NVMes or even a RAM disk.
When using NVMes or SSDs, the partition for the ZIL should be half the size of the system RAM and the rest can be dedicated to the L2ARC.
Setup zpool add tank cache sdy
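A hedged sketch of the split described above, assuming a 64 GB RAM host, a single spare NVMe and the hypothetical device name /dev/nvme0n1: carve out a 32 GB SLOG partition and give the rest to L2ARC.
sgdisk -n1:0:+32G /dev/nvme0n1
sgdisk -n2:0:0 /dev/nvme0n1
zpool add tank log /dev/nvme0n1p1
zpool add tank cache /dev/nvme0n1p2
A single unmirrored SLOG is only a sketch; for production you would mirror the log across two devices.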
ZFS Special Device
A special device (Special Allocation Class) in a pool is used to store metadata, de-duplication tables, and optionally small file blocks. The rule of thumb is to use 0.3% of the pool size, but it can vary.
A special device can improve the speed of a pool consisting of slow spinning rust with a lot of metadata changes. Workloads that involve creating, updating or deleting a large number of files will benefit from the presence of a special device. ZFS datasets can also be configured to store whole small files on the special device, which can further improve performance. Use fast, enterprise-grade SSDs for the special device (consumer-grade drives are not up to the write load ahead).
Don't mix SLOG/L2ARC with the special device; it is very hard to debug.
Setup like zpool create tank sdb special sdc
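A hedged, more production-like sketch with the special vdev mirrored and small blocks redirected to it (device names and the 32K threshold are assumptions):
zpool create -o ashift=12 tank raidz2 sdb sdc sdd sde sdf special mirror sdg sdh
zfs set special_small_blocks=32K tank
Blocks smaller than 32K then land on the fast special vdev instead of the spinning disks; losing the special vdev loses the pool, hence the mirror.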
ZIL and SLOG
ZFS takes extensive measures to safeguard your data, and these two acronyms represent the key safeguards for synchronous writes. On large systems they bring both speed and security in case of power cuts; where you see 100 G NICs, you will usually find them. SLOG devices are ideally fast, mirrored NVMe drives, but even a spinning rust drive will improve performance. My smallest storage array with a ZIL/SLOG has 12 disks; they are good to have with more than 10 disks and the preferred way with 24 or more disks in your storage array, because of the speed advantages and the security they bring.
- ZIL: Acronym for ZFS Intent Log. Logs synchronous operations to disk
- SLOG: Acronym for (S)eparate (LOG) device
Think about a 92-disk storage server with 2 x 100 G NICs bonded to give 200 Gbps (25 GB/s), 90 HDDs working in parallel at up to 250 MB/s each (22.5 GB/s) and a pair of SAS SSDs as a mirrored ZIL that writes close to 1 GB/s: the bottleneck would be the SLOG/ZIL. For the ZIL we use a mirrored pair of SSDs, or even better, really fast NVMes.
Setup zpool add tank log sdj1
or zpool add tank log mirror sdj1 sdk1
ZIL ZFS Intent Log
It keeps track of in-progress synchronous write operations so they can be completed or rolled back after a system crash or power failure. Standard caching generally utilizes system memory, and that data is lost in those scenarios; the ZIL prevents that. The ZIL normally lives inside the pool, but it can be placed on a dedicated device, which is then called a Separate Intent Log (SLOG).
ZIL and SLOG for Speed
If you only need the speed part, you might set up a mirrored RAM disk for the SLOG instead of SSDs. The ZIL can be on a mirrored SSD to take care of the integrity. The L2ARC will dramatically improve read speeds.
Encrypted ZFS Datasets
Native ZFS encryption in Proxmox VE is experimental. Known limitations and issues include Replication with encrypted datasets, as well as checksum errors when using Snapshots or ZVOLs.
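A hedged example of creating a natively encrypted dataset with a passphrase (pool and dataset names are assumptions); keep the limitations above in mind before relying on it:
zfs create -o encryption=on -o keyformat=passphrase tank/secure
zfs get encryption,keystatus tank/secure
After a reboot or re-import, unlock it with zfs load-key tank/secure and then zfs mount tank/secure.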
SWAP on ZFS
Swap space created on a zvol may cause trouble, like blocking the server or generating a high IO load, often seen when starting a backup to an external storage. It is strongly recommended to use enough memory, so that you normally do not run into low-memory situations. Should you need or want to add swap, it is preferred to create a partition on a physical disk and use it as a swap device. You can leave some space free for this purpose in the advanced options of the installer. Additionally, you can lower the “swappiness” value. A good value for servers is 10.
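A hedged sketch of both recommendations, assuming the spare space ends up as partition /dev/sda4 (a made-up name):
mkswap /dev/sda4
swapon /dev/sda4
echo "/dev/sda4 none swap sw 0 0" >> /etc/fstab
sysctl -w vm.swappiness=10
echo "vm.swappiness = 10" >> /etc/sysctl.conf
The sysctl line changes swappiness immediately; the sysctl.conf entry makes it persistent across reboots.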
An example: a pool named tank with an L2ARC cache and a ZIL/SLOG
zpool create -o ashift=12 -O compression=lz4 tank sdb sdc sde sdf sdg sdh sdi sdj sdk
zpool add tank log mirror /dev/sdv /dev/sdx
zpool add tank cache /dev/sdy /dev/sdz
Note that cache (L2ARC) devices cannot be mirrored; they are simply added side by side.
Create a Pool
Booting from ZFS is possible; I do it for all my Proxmox, pfSense and TrueNAS machines. There is excellent documentation at OpenZFS and from most of the distros, except Ubuntu.
The -f flag prevents the error message from blocking the creation. Be careful when using this, as you could overwrite existing pools/partitions without any warning.
Newer ZFS releases use libblkid to search for the correct disk, but they always return the found disk as /dev/sdX, as long as you do not use a cachefile.
Selecting /dev/ names when creating a pool (see more in the References): it is always better to use disk IDs, but ZFS isn't picky about the order in which it finds the disks, quite the opposite of RAID arrays, so using /dev/sdX or /dev/hdX is fine. That said, the size of your pool determines what to use: /dev/sdX, /dev/disk/by-id, /dev/disk/by-path or /dev/disk/by-vdev. Only /dev/disk/by-uuid/ is not a great option.
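As a hedged example of the by-id variant (the model and serial strings below are made up), creating a small mirror with stable names:
zpool create -o ashift=12 tank mirror /dev/disk/by-id/ata-WDC_WD40EFRX_WX11D0000001 /dev/disk/by-id/ata-WDC_WD40EFRX_WX11D0000002
zpool status then shows the stable IDs, which do not change when the kernel enumerates the disks in a different order.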
Create Options
- -f: Force creating the pool to bypass the “EFI label error”.
- -m: The mount point of the pool. If this is not specified, the pool will be mounted at the root as /pool.
- pool: The name of the pool.
- type: single disk, mirror, raidz, raidz2, raidz3. If omitted, the default type is a stripe: one disk in the GUI, in the CLI as many as you like, even of different sizes (which is not really wise).
- ids: The names of the drives/partitions to include in the pool, obtained from ls /dev/disk/by-id.
Creating pools
Single Disk
sudo zpool create -f [pool name] /dev/sdb
Single disk, replace sdb with the correct one or preferably use disk ID
RAID0
sudo zpool create -f [pool name] /dev/sdb /dev/sdc
Striped or RAID0
sudo zpool add [existing pool name] /dev/sdd
You can add drives to a pool to increase its capacity. Any new data will be dynamically striped across the pool, but existing data will not be moved to balance the pool.
RAID1
sudo zpool create -f [pool name] mirror /dev/sdb /dev/sdc
To create a RAID1 pool (or mirror), add the command mirror when creating or adding drives
sudo zpool add [existing pool name] mirror /dev/sdd /dev/sde
This adds another mirror vdev to the existing mirrored pool (giving you striped mirrors). You need to match the number of disks in the new mirror.
RAID10
sudo zpool create [pool name] \
mirror /dev/sde /dev/sdf \
mirror /dev/sdg /dev/sdh
To create a RAID10 array in a single command
RAIDZ (similar to RAID5)
sudo zpool create -f [pool name] raidz /dev/sdb /dev/sdc /dev/sdd
To create a RAIDZ array in a single command. You can't add disks to an existing RAIDZ vdev.
RAIDZ2 (similar to RAID6)
sudo zpool create -f [pool name] raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde
You need a minimum of 4 drives. Your array can lose any 2 drives without loss of data.
RAIDZ3
sudo zpool create -f [pool name] raidz3 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
The minimum number of drives is 5. Your array can lose 3 drives without loss of data.
dRAID
# zpool create <pool> draid[<parity>][:<data>d][:<children>c][:<spares>s] <vdevs...>
A dRAID vdev is created like any other by using the zpool create command and enumerating the disks which should be used. Like raidz, the parity level is specified immediately after the draid vdev type. However, unlike raidz, additional colon-separated options can be specified. The most important of these is the :<spares>s option, which controls the number of distributed hot spares to create; by default, no spares are created. The :<data>d option can be specified to set the number of data devices to use in each RAID stripe (D+P). When unspecified, reasonable defaults are chosen. The individual options are listed below, and a concrete sketch follows the list.
- parity - The parity level (1-3). Defaults to one.
- data - The number of data devices per redundancy group. In general a smaller value of D will increase IOPS, improve the compression ratio, and speed up resilvering at the expense of total usable capacity. Defaults to 8, unless N-P-S is less than 8.
- children - The expected number of children. Useful as a cross-check when listing a large number of devices. An error is returned when the provided number of children differs.
- spares - The number of distributed hot spares. Defaults to zero.
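As a concrete, hedged sketch under those rules (device names are placeholders, and the layout loosely follows the illustration used in the OpenZFS docs): an 11-disk dRAID2 with 4 data disks per redundancy group and one distributed spare.
zpool create tank draid2:4d:1s:11c /dev/sd[b-l]
Here the parity is 2, each redundancy group holds 4 data disks, the vdev expects 11 children in total, and one disk's worth of capacity is spread across all members as a distributed spare.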
Other
You can mix and match vdevs in a pool. The vdevs are always "striped" together, creating an effective RAID 0 across them; you may need to re-balance/re-silver if you add to an existing pool. This means you can combine any number of mirrors and RAIDZ vdevs to create almost any kind of configuration.
Since ZFS does not have a built-in tool to re-stripe existing data when a drive has been added, you need a tool for it, like zfs-balancer (see GitHub), or use this approach from JRS Systems.
Destroying Pools
sudo zpool destroy [pool name]
Remember to release any cache and/or log devices first!
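A hedged sketch of the order of operations (device and vdev names are assumptions; check zpool status for the real ones):
zpool remove tank /dev/sdy
zpool remove tank mirror-1
zpool destroy tank
Here /dev/sdy is a cache device and mirror-1 is the name of a mirrored log vdev as reported by zpool status.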
List Drives
ls /dev/disk/by-id
To check your pool
zpool list or zpool list -v
zpool iostat or zpool iostat -v
Check that the Proxmox Storage Manager knows it exists:
pvesm zfsscan
Configure your ZFS pool
zfs create zstorage/iso
zfs create zstorage/share
zfs create zstorage/vmstorage
To set quota
zfs set quota=1000G zstorage/iso
To check
zfs list
zpool status
zpool iostat -v
Usage with Proxmox
In the GUI go to Datacenter -> Storage -> Add -> Directory -> zstorage/iso (make sure only “ISO image” and “Container template” are selected).
Then Datacenter -> Storage -> Add -> ZFS -> ID: vmstorage -> ZFS Pool: zstorage/vmstorage
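Roughly the same can be done from the CLI with pvesm; a hedged sketch (the storage IDs are assumptions):
pvesm add dir iso-store --path /zstorage/iso --content iso,vztmpl
pvesm add zfspool vmstorage --pool zstorage/vmstorage --content images,rootdir
pvesm status then lists the new storages next to the existing ones.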
Examples:
Creating datasets on external (USB) disks for backups and ISOs.
USB disks are notoriously bad, but there are still use cases for them. External backups are one good example; testing is another. To overcome some of the problems we use them as a mirror. The good thing with ZFS is that it takes care of the RAID problem of placing the disks in the right order.
The usual name for a ZFS pool is tank, so I use it here as well. We use RAID1 mirroring: data is written identically to all disks. This mode requires at least 2 disks of the same size, and the resulting capacity is that of a single disk.
Go to node -> Disks -> ZFS and hit the [Create: ZFS] button.
Name = tank, RAID Level = Mirror, leave compression on, and untick Add Storage.
First totally wipe the disks: fdisk /dev/sdh, then the commands g (create a new GPT partition table) and w (write the table to disk and exit).
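A hedged, non-interactive alternative to the fdisk dance (double-check the device name, this is destructive):
wipefs -a /dev/sdh
sgdisk --zap-all /dev/sdh
wipefs removes old filesystem and RAID signatures, and sgdisk --zap-all destroys the GPT and MBR partition tables, so ZFS gets a clean disk.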
SSD/SAS/NVMe Disks
Checking the speed and basic info on voltage and power consumption.
For VM systemdisks or small but fast NAS servers PCIe NVMe drives are the best.
But, for server builds I prefer SAS drives over SATA/SATA SSD drives all day long.
Why? Size available and speed: you can't have both.
Some modern AIC and U.2 NVMe SSDs can reach sequential read rates close to 7,000 MB/s and write rates over 4,000 MB/s. But they can be half of that, too.
With SATA SSDs we are in the 500+ MB/s range. SAS can span multiple racks, using copper or fiber for connectivity. SAS supports SATA as well, and does things at a scale that NVMe doesn't.
I/F | IOPS | Throughput | Latency | Queues | Commands / queue |
---|---|---|---|---|---|
SATA | 60–100k | 6 Gbps | < 1 – 100+ ms | 1 | 32 |
SAS | 200–400k | 12–24 Gbps | < 100 μs – >10 ms | 1 | 256 |
NVMe | 200–10,000k | 16–32 Gbps | < 10 – 255 μs | 65,535 | 64,000 |
Choosing the Right Interface
Selecting the optimal storage interface depends on factors such as workload requirements, scalability, stability, lifetime, and budget considerations:
- NVMe drives, for enterprises demanding maximum performance and scalability; they represent the pinnacle of storage technology, offering unparalleled speed and efficiency, but at a very high cost.
- SAS drives, with their reliability and versatility, cater to a wide range of enterprise applications, striking a balance between performance and cost.
- SATA drives remain relevant for budget-conscious users and scenarios where performance demands are modest, providing a cost-effective storage solution for everyday computing tasks.
Check disk drive speeds
Use the following command as a first tool:
sudo hdparm -Tt /dev/sda /dev/sdb /dev/nvme0n1
Replace with the drive you want to test, or use a block like /dev/sd[a-i]
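For a more realistic benchmark than hdparm, a hedged fio sketch (read-only, but still point it at the correct device):
fio --name=seqread --readonly --filename=/dev/sdb --rw=read --bs=1M --direct=1 --ioengine=libaio --iodepth=32 --runtime=30 --time_based
This measures sustained sequential reads; switching to --rw=randread and --bs=4k gives a number closer to the IOPS column in the table above.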


SSD/SAS Voltage
- 3.5" HDD, requires 12v and 5v
- 2.5" HDD or SSD, requires 5v
- 2.5" SAS, requires 5v and 12v
- 1.8″ SSD devices require 3.3 volts
SSD Power Consumption in Watts
SSD Type | Idle | Read | Write |
---|---|---|---|
2.5″ SATA | 0.30 – 2 W | 4.5 – 8 W | 4.5 – 8 W |
mSATA | 0.20 – 2 W | 1 – 5 W | 4 – 8 W |
M.2 SATA | 0.40 – 2 W | 2.5 – 6 W | 4 – 9 W |
M.2 NVMe Gen 4.0 | 1 – 3 W | 2 – 8 W | 4 – 10 W |
M.2 NVMe Gen 5.0 | 1 – 3 W | 4 – 10 W | 4 – 12 W |
AIC PCIe | 2 – 6 W | 4 – 8 W | 8 – 20 W |
References
OpenZFS [1] ZFS RAID levels [2], TrueNAS Documentation [3], ZFS Recovery Tools [4]
- See the Wikipedia article, the man pages, Selecting /dev/ names when creating a pool, and the FAQ
- TrueNAS blog ZIL demystified, blog SLOG or not to SLOG, and documentation on L2ARC
- RAIDZ recovery tools: ReclaiMe Pro software homepage, Klennet ZFS Recovery homepage, RAID Recovery™ homepage