Setting up dm-cache on Arch Linux

On Linux, dm-cache is the device mapper target for implementing tiered storage. Unfortunately, there isn't much documentation on how to set it up directly, as most resources guide you towards lvmcache (which is just LVM metadata management on top of dm-cache).

Why dm-cache?

Setting it up

I am assuming you have a filesystem on /dev/disk/by-id/ata-HDD (a slow device) and your cache disk is /dev/disk/by-id/ata-SSD (a fast device).

First, you need to settle on a block size[3]. I will use 256 sectors (128 KiB; device-mapper sectors are always 512 bytes, regardless of the devices' physical sector sizes).
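The kernel documentation constrains this choice: the block size must lie between 32 KiB and 1 GiB and be a multiple of 32 KiB (64 sectors). A quick sanity check for a candidate value (a sketch; 256 sectors is the value used below):

```shell
# dm-cache block size limits, per the kernel's cache.rst:
# between 64 and 2097152 sectors, and a multiple of 64 sectors (32 KiB)
BLOCK_SECTORS=256
if (( BLOCK_SECTORS >= 64 && BLOCK_SECTORS <= 2097152 && BLOCK_SECTORS % 64 == 0 )); then
  echo "ok: $BLOCK_SECTORS sectors is a valid dm-cache block size"
else
  echo "invalid block size" >&2
fi
```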

Then you need to figure out the size of the metadata area. There is no official documentation on this that I can find, but this mailing list message[4] indicates 4 MiB (8192 sectors) plus 16 bytes per cache block.

BLOCK_SIZE=$(( 128*1024 ))
SSD_SECTORS=$(cat /sys/block/$(basename $(realpath /dev/disk/by-id/ata-SSD))/size)
METADATA_SECTORS=$(( 8192 + 16 * $SSD_SECTORS  / $BLOCK_SIZE ))
echo $METADATA_SECTORS

In my case I get 38718 sectors for metadata. To be extra cautious and to avoid potential alignment issues, I round it up to 20 MiB (40960 sectors).

METADATA_SECTORS=40960
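If you would rather compute the rounding than eyeball it, here is a small helper (a sketch; `round_up_mib` is my own name, and 38718 is the figure from my run above — the strict next-MiB boundary is 38912, I simply padded further to 40960):

```shell
# round a sector count up to the next multiple of 1 MiB (2048 sectors of 512 bytes)
round_up_mib() {
  echo $(( ($1 + 2047) / 2048 * 2048 ))
}
round_up_mib 38718   # prints 38912 (19 MiB); any multiple of 2048 >= this is fine
```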

Make the two logical block devices for metadata and cache blocks. See dmsetup(8) and dm-linear documentation for help with linear tables.

CACHE_SECTORS=$(( $SSD_SECTORS - $METADATA_SECTORS ))
dmsetup create SSD-metadata --table "0 $METADATA_SECTORS linear /dev/disk/by-id/ata-SSD 0"
dmsetup create SSD-blocks   --table "0 $CACHE_SECTORS linear /dev/disk/by-id/ata-SSD $METADATA_SECTORS"

Erase the metadata zone; cat will exit with a "No space left on device" error once it reaches the end of the device, which is expected. The next step may fail with obscure messages (such as "requires a block device") if this area isn't blank.

cat /dev/zero > /dev/mapper/SSD-metadata

Finally, create the cached logical device. Change writethrough to writeback if you understand the added risks.

HDD_SECTORS=$(cat /sys/block/$(basename $(realpath /dev/disk/by-id/ata-HDD))/size)
BLOCK_SECTORS=$(( $BLOCK_SIZE / 512 ))
dmsetup create HDD-cached --table "0 $HDD_SECTORS cache /dev/mapper/SSD-metadata /dev/mapper/SSD-blocks /dev/disk/by-id/ata-HDD $BLOCK_SECTORS 1 writethrough default 0"

We have now created /dev/mapper/HDD-cached, which behaves just like /dev/disk/by-id/ata-HDD but benefits from the cache.

mount /dev/mapper/HDD-cached /mnt/hdd

Automating it with systemd

The steps above will not persist across a shutdown (the cache contents will), so some automation is needed to set everything up again at boot.

In my configuration, I use dm-cache below dm-crypt (to avoid having to deal with extra encryption/key material for the SSD; dm-cache will directly cache ciphered data instead). I want the cached device to play nice with systemd's handling of crypttab and fstab devices/mountpoints.

I found little documentation on how to do it, but the solution below works (/etc/systemd/system/setup-cached-HDD.service). Change values as needed. Use systemd-escape -p /dev/disk/by-id/* to get dev-disk-*.device unit names.

[Unit]
Description=setup dm-cached device (HDD-cached)
DefaultDependencies=no
IgnoreOnIsolate=true
Before=cryptsetup-pre.target
BindsTo=dev-disk-by\x2did-ata\x2dHDD.device dev-disk-by\x2did-ata\x2dSSD.device
After=dev-disk-by\x2did-ata\x2dHDD.device dev-disk-by\x2did-ata\x2dSSD.device
RequiresMountsFor=/usr/bin/dmsetup

[Service]
Type=oneshot
RemainAfterExit=yes

ExecStartPre=/usr/bin/dmsetup create SSD-metadata --table '0 40960 linear /dev/disk/by-id/ata-SSD 0'
ExecStartPre=/usr/bin/dmsetup create SSD-blocks --table '0 249938608 linear /dev/disk/by-id/ata-SSD 40960'
ExecStart=/usr/bin/dmsetup create HDD-cached --table '0 3907029168 cache /dev/mapper/SSD-metadata /dev/mapper/SSD-blocks /dev/disk/by-id/ata-HDD 256 1 writethrough default 0'

ExecStop=/usr/bin/dmsetup remove HDD-cached
ExecStopPost=/usr/bin/dmsetup remove SSD-metadata
ExecStopPost=/usr/bin/dmsetup remove SSD-blocks

[Install]
WantedBy=systemd-cryptsetup@HDD.service

Enable the systemd unit:

systemctl daemon-reload
systemctl enable setup-cached-HDD.service

Here are relevant entries in crypttab and fstab, respectively:

# /etc/crypttab: mappings for encrypted partitions
HDD /dev/mapper/HDD-cached /root/luks/HDD.key no-read-workqueue,no-write-workqueue,submit-from-crypt-cpus,noauto,nofail,header=/root/luks/HDD.hdr
# /etc/fstab: static file system information
/dev/mapper/HDD /mnt/HDD btrfs defaults,space_cache=v2,noatime,commit=300,flushoncommit,compress-force=zstd:7,nofail 0 0

Note the nofail options, which allow the system to continue booting if a problem happens with the device. If you are caching your root (/) filesystem, you should remove nofail, use the sd-encrypt initramfs hook (using /etc/crypttab.initramfs instead) and add the dmsetup binary to your initramfs.
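For the root-filesystem case, the Arch-specific pieces look roughly like this (a sketch of /etc/mkinitcpio.conf, assuming the default systemd-based hook set; adapt to your own configuration):

```shell
# /etc/mkinitcpio.conf (excerpt)
# dmsetup must be available in the initramfs before sd-encrypt opens the root device
BINARIES=(dmsetup)
HOOKS=(base systemd autodetect modconf block sd-encrypt filesystems fsck)
```

Regenerate the initramfs afterwards with mkinitcpio -P.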

Monitoring the cache

Various statistics about the cache (such as usage, read/write hits and misses, dirty blocks, etc.) can be accessed using:

dmsetup status /dev/mapper/HDD-cached

See the dm-cache documentation for explanations.
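The status line packs many counters in a fixed order (metadata block size, used/total metadata blocks, cache block size, used/total cache blocks, then read hits, read misses, write hits, write misses, and so on). As an illustration, here is how you might pull out a read hit ratio — the status line below is a fabricated sample in that documented layout, not real output:

```shell
# fields 1-3 are the dmsetup prefix (start, length, target name);
# per the kernel's cache.rst, read hits and misses then sit at fields 8 and 9
status='0 3907029168 cache 8 512/10240 256 1024/976432 5000 1000 800 200 0 0 0 1 writethrough 2 migration_threshold 2048 smq 0 rw -'
read_hits=$(echo "$status" | awk '{print $8}')
read_misses=$(echo "$status" | awk '{print $9}')
echo "read hit ratio: $(( 100 * read_hits / (read_hits + read_misses) ))%"
```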

Bonus: using multiple devices

It's possible to use dm-cache with multiple caching and/or backing devices. Imagine you want two SSDs to cache three HDDs:

/dev/disk/by-id/ata-SSD1, 1000000 sectors
/dev/disk/by-id/ata-SSD2, 2000000 sectors
/dev/disk/by-id/ata-HDD1, 40000000 sectors
/dev/disk/by-id/ata-HDD2, 60000000 sectors
/dev/disk/by-id/ata-HDD3, 80000000 sectors
  1. Create one logical "cache device" and one logical "backing device", using dm-linear (if the SSDs have similar sizes, you can also use RAID0 aka dm-stripe with a stripe size equal to cache block size to increase performance and spread the writes more evenly);

  2. Create the logical "cached device" using dm-cache;

  3. Split the logical "cached device" again with dm-linear, back to the three hard disks equivalents.

This method is very versatile, as any block on any caching device can be used to cache any block on any backing device. However, if you use writeback caching, expect to lose data on all the backing devices if any single device fails; this can be mitigated by mirroring (RAID1). Writethrough is much safer, as the backing devices are always in a consistent state: a failing SSD won't cause data loss, and a failing HDD won't affect data on the other HDDs.

dmsetup create logical-SSD <<EOF
0 1000000 linear /dev/disk/by-id/ata-SSD1 0
1000000 2000000 linear /dev/disk/by-id/ata-SSD2 0
EOF
dmsetup create logical-HDD <<EOF
0 40000000 linear /dev/disk/by-id/ata-HDD1 0
40000000 60000000 linear /dev/disk/by-id/ata-HDD2 0
100000000 80000000 linear /dev/disk/by-id/ata-HDD3 0
EOF

(--table only accepts one-line tables, so multi-line tables are fed to dmsetup on standard input instead.)

dmsetup create logical-SSD-metadata --table "0 20480 linear /dev/mapper/logical-SSD 0"
dmsetup create logical-SSD-blocks   --table "0 2979520 linear /dev/mapper/logical-SSD 20480"
cat /dev/zero > /dev/mapper/logical-SSD-metadata
dmsetup create logical-HDD-cached   --table "0 180000000 cache /dev/mapper/logical-SSD-metadata /dev/mapper/logical-SSD-blocks /dev/mapper/logical-HDD 256 1 writethrough default 0"

dmsetup create HDD1-cached --table "0 40000000 linear /dev/mapper/logical-HDD-cached 0"
dmsetup create HDD2-cached --table "0 60000000 linear /dev/mapper/logical-HDD-cached 40000000"
dmsetup create HDD3-cached --table "0 80000000 linear /dev/mapper/logical-HDD-cached 100000000"

mount /dev/mapper/HDD1-cached /mnt/hdd1
mount /dev/mapper/HDD2-cached /mnt/hdd2
mount /dev/mapper/HDD3-cached /mnt/hdd3

  1. https://www.kernel.org/doc/html/latest/admin-guide/bcache.html#troubleshooting-performance

  2. https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/cache-policies.html

  3. https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/cache.html#fixed-block-size

  4. https://www.redhat.com/archives/dm-devel/2012-December/msg00046.html, from https://blog.kylemanna.com/linux/ssd-caching-using-dmcache-tutorial/