linux_dsm_epyc7002/fs
Hans van Kranenburg 583b723151 btrfs: Do not use data_alloc_cluster in ssd mode
This patch provides a band aid to improve the 'out of the box'
behaviour of btrfs for disks that are detected as being an ssd.  In a
general purpose mixed workload scenario, the current ssd mode causes
overallocation of available raw disk space for data, while leaving
behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.

This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode.  The metadata behaviour and
additional ssd_spread option stay untouched so far.

Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings.  The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.

    The SSD story...

    In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurance of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".

The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).

A somewhat simplified explanation of what happens is that, when for
example, the 'cluster' size is set to 2MiB, when we do some writes, the
data allocator will search for a free space block that is 2MiB big, and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option kills off this
option and requires fully free space.

The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.

Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.

    However, there's a number of problems with this approach.

    First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.

Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.

Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free.  There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data.  The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space previously written in.

Fourth, the rotational flag in /sys/ does not reliably indicate if the
device is a locally attached ssd. For example, iSCSI or NBD displays as
non-rotational, while a loop device on an ssd shows up as rotational.

The combination of the second and third problem effectively means that
despite all the good intentions, the btrfs ssd mode reliably causes the
ssd hardware and the filesystem structures and performance to be choked
to death. The clickbait version of the title of this story would have
been "Btrfs ssd optimizations considered harmful for ssds".

The current nossd 'tetris' mode (even still without discard) allows a
pattern of overwriting much more previously used space, causing many
more implicit discards to happen because of the overwrite information
the ssd gets. The actual location in the physical address space, as seen
from the point of view of btrfs is irrelevant, because the actual writes
to the low level flash are reordered anyway thanks to the FTL.

    Changes made in the code

1. Make ssd mode data allocation identical to tetris mode, like nossd.
2. Adjust and clean up filesystem mount messages so that we can easily
identify if a kernel has this patch applied or not, when providing
support to end users. Also, make better use of the *_and_info helpers to
only trigger messages on actual state changes.

    Backporting notes

Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
* First apply commit 951e7966 "btrfs: drop the nossd flag when
  remounting with -o ssd", or fixup the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So,
  for example, instead of using fs_info, it's root->fs_info in
  extent-tree.c

Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
..
9p 9p: Implement show_options 2017-07-11 06:08:58 -04:00
adfs
affs affs: Implement show_options 2017-07-11 06:06:17 -04:00
afs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
autofs4 Fix up over-eager 'wait_queue_t' renaming 2017-07-10 11:40:19 -07:00
befs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
bfs bfs: fix sanity checks for empty files 2017-07-12 16:26:00 -07:00
btrfs btrfs: Do not use data_alloc_cluster in ssd mode 2017-08-21 17:47:42 +02:00
cachefiles sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming 2017-06-20 12:19:14 +02:00
ceph ceph: fix race in concurrent readdir 2017-07-17 14:54:59 +02:00
cifs Add wait_for_random_bytes() and get_random_*_wait() functions so that 2017-07-15 12:44:02 -07:00
coda fs: implement vfs_iter_write using do_iter_write 2017-06-29 17:49:23 -04:00
configfs configfs: Introduce config_item_get_unless_zero() 2017-06-12 13:20:20 +02:00
cramfs
crypto The first major feature for ext4 this merge window is the largedir 2017-07-09 09:31:22 -07:00
debugfs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
devpts
dlm
ecryptfs ecryptfs: Convert to separately allocated bdi 2017-04-20 12:09:55 -06:00
efivarfs VFS: Kill off s_options and helpers 2017-07-11 06:09:21 -04:00
efs
exofs mm: drop "wait" parameter from write_one_page() 2017-07-05 18:44:22 -04:00
exportfs
ext2 ext2: preserve i_mode if ext2_set_acl() fails 2017-07-18 11:23:56 +02:00
ext4 ext4: fix copy paste error in ext4_swap_extents() 2017-08-06 01:33:07 -04:00
f2fs f2fs: avoid cpu lockup 2017-07-17 19:23:18 -07:00
fat
freevxfs
fscache
fuse Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse 2017-08-11 11:20:48 -07:00
gfs2 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
hfs fs: semove set but not checked AOP_FLAG_UNINTERRUPTIBLE flag 2017-05-08 17:15:14 -07:00
hfsplus hfsplus: Don't clear SGID when inheriting ACLs 2017-07-18 18:23:39 +02:00
hostfs
hpfs
hugetlbfs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
isofs isofs: Fix off-by-one in 'session' mount option parsing 2017-07-18 12:33:16 +02:00
jbd2 Writeback error handling fixes (pile #2) 2017-07-07 19:38:17 -07:00
jffs2 jffs2: fix spelling mistake: "requestied" -> "requested" 2017-04-19 11:35:55 -07:00
jfs JFS fixes for 4.13 2017-07-25 08:51:57 -07:00
kernfs
lockd sunrpc: mark all struct svc_version instances as const 2017-07-13 15:58:03 -04:00
minix Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-08 10:50:54 -07:00
ncpfs mm: per-cgroup memory reclaim stats 2017-07-06 16:24:35 -07:00
nfs Some more NFS client bugfixes for 4.13 2017-08-11 13:54:09 -07:00
nfs_common
nfsd nfsd: Fix a memory scribble in the callback channel 2017-07-17 13:15:06 -04:00
nilfs2 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-07-03 13:08:04 -07:00
nls
notify dentry name snapshots 2017-07-07 20:09:10 -04:00
ntfs ntfs: Use ERR_CAST() to avoid cross-structure cast 2017-05-28 10:11:48 -07:00
ocfs2 ocfs2: don't clear SGID when inheriting ACLs 2017-08-02 17:16:13 -07:00
omfs omfs: Implement show_options 2017-07-06 03:31:46 -04:00
openpromfs
orangefs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
overlayfs ovl: check for bad and whiteout index on lookup 2017-07-20 11:08:21 +02:00
proc mm: fix KSM data corruption 2017-08-10 15:54:07 -07:00
pstore Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
qnx4
qnx6
quota quota: add get_inode_usage callback to transfer multi-inode charges 2017-06-22 11:46:48 -04:00
ramfs ramfs: Implement show_options 2017-07-06 03:31:46 -04:00
reiserfs reiserfs: preserve i_mode if __reiserfs_set_acl() fails 2017-07-18 11:24:08 +02:00
romfs
squashfs
sysfs sysfs: be careful of error returns from ops->show() 2017-04-08 17:33:32 +02:00
sysv mm: drop "wait" parameter from write_one_page() 2017-07-05 18:44:22 -04:00
tracefs VFS: Don't use save/replace_mount_options if not using generic_show_options 2017-07-06 03:31:46 -04:00
ubifs ubifs: Set double hash cookie also for RENAME_EXCHANGE 2017-07-14 22:50:57 +02:00
udf udf: Convert udf_disk_stamp_to_time() to use mktime64() 2017-06-14 11:21:02 +02:00
ufs Writeback error handling fixes (pile #1) 2017-07-07 18:39:15 -07:00
xfs xfs: Fix per-inode DAX flag inheritance 2017-08-04 13:43:36 -07:00
aio.c fs: add O_DIRECT and aio support for sending down write life time hints 2017-06-27 12:05:36 -06:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c binfmt_elf: safely increment argv pointers 2017-07-10 16:32:36 -07:00
binfmt_em86.c
binfmt_flat.c binfmt_flat: Use %u to format u32 2017-07-16 09:24:05 -07:00
binfmt_misc.c fs: constify tree_descr arrays passed to simple_fill_super() 2017-04-26 23:54:06 -04:00
binfmt_script.c
block_dev.c Writeback error handling fixes (pile #2) 2017-07-07 19:38:17 -07:00
buffer.c fs/buffer.c: make bh_lru_install() more efficient 2017-07-10 16:32:30 -07:00
char_dev.c
compat_binfmt_elf.c
compat_ioctl.c Merge branch 'work.__copy_in_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-08 10:15:02 -07:00
compat.c fs/compat.c: trim unused includes 2017-04-17 12:52:27 -04:00
coredump.c
dax.c Writeback error handling fixes (pile #2) 2017-07-07 19:38:17 -07:00
dcache.c Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
dcookies.c
direct-io.c fs: add O_DIRECT and aio support for sending down write life time hints 2017-06-27 12:05:36 -06:00
drop_caches.c
eventfd.c There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
eventpoll.c kcmp: fs/epoll: wrap kcmp code with CONFIG_CHECKPOINT_RESTORE 2017-07-12 16:26:01 -07:00
exec.c exec: Limit arg stack to at most 75% of _STK_LIM 2017-07-07 20:05:08 -07:00
fcntl.c vfs: fix flock compat thinko 2017-07-07 13:48:18 -07:00
fhandle.c fhandle: move compat syscalls from compat.c 2017-04-17 12:52:26 -04:00
file_table.c fs: new infrastructure for writeback error handling and reporting 2017-07-06 07:02:25 -04:00
file.c fs/file.c: replace alloc_fdmem() with kvmalloc() alternative 2017-07-06 16:24:30 -07:00
filesystems.c Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
fs_pin.c sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming 2017-06-20 12:19:14 +02:00
fs_struct.c
fs-writeback.c writeback: rework wb_[dec|inc]_stat family of functions 2017-07-12 16:26:05 -07:00
inode.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-08 10:50:54 -07:00
internal.h Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-05-09 09:12:53 -07:00
ioctl.c
iomap.c Changes since last update: 2017-07-14 22:57:32 -07:00
Kconfig fs/Kconfig: kill CONFIG_PERCPU_RWSEM some more 2017-07-12 16:26:00 -07:00
Kconfig.binfmt
libfs.c fs: convert __generic_file_fsync to use errseq_t based reporting 2017-07-06 07:02:29 -04:00
locks.c fs/locks: pass kernel struct flock to fcntl_getlk/setlk 2017-05-27 06:07:19 -04:00
Makefile
mbcache.c ext4: xattr inode deduplication 2017-06-22 11:44:55 -04:00
mount.h Now that IPC and other changes have landed, enable manual markings for 2017-07-19 08:55:18 -07:00
mpage.c There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
namei.c Now that IPC and other changes have landed, enable manual markings for 2017-07-19 08:55:18 -07:00
namespace.c Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
no-block.c
nsfs.c VFS: Provide empty name qstr 2017-07-06 03:27:09 -04:00
open.c Writeback error handling fixes (pile #2) 2017-07-07 19:38:17 -07:00
pipe.c VFS: Provide empty name qstr 2017-07-06 03:27:09 -04:00
pnode.c mnt: Make propagate_umount less slow for overlapping mount propagation trees 2017-05-23 08:41:17 -05:00
pnode.h
posix_acl.c
proc_namespace.c
read_write.c Merge branch 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-07 21:48:15 -07:00
readdir.c readdir: move compat syscalls from compat.c 2017-04-17 12:52:24 -04:00
select.c Merge branch 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-06 20:57:13 -07:00
seq_file.c mm: introduce kv[mz]alloc helpers 2017-05-08 17:15:12 -07:00
signalfd.c sched/wait: Rename wait_queue_t => wait_queue_entry_t 2017-06-20 12:18:27 +02:00
splice.c fs: implement vfs_iter_write using do_iter_write 2017-06-29 17:49:23 -04:00
stack.c
stat.c ufs: restore maintaining ->i_blocks 2017-06-09 16:28:01 -04:00
statfs.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-08 10:50:54 -07:00
super.c VFS: Kill off s_options and helpers 2017-07-11 06:09:21 -04:00
sync.c fs: remove call_fsync helper function 2017-07-05 18:44:23 -04:00
timerfd.c timerfd: Use get_itimerspec64() and put_itimerspec64() 2017-06-30 04:14:38 -04:00
userfaultfd.c userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage 2017-08-10 15:54:07 -07:00
utimes.c utimes: move compat syscalls from compat.c 2017-04-17 12:52:23 -04:00
xattr.c treewide: use kv[mz]alloc* rather than opencoded variants 2017-05-08 17:15:13 -07:00