linux_dsm_epyc7002/fs/xfs/xfs_aops.h
Darrick J. Wong cb357bf3d1 xfs: implement per-inode writeback completion queues
When scheduling writeback of dirty file data in the page cache, XFS uses
IO completion workqueue items to ensure that filesystem metadata only
updates after the write completes successfully.  This is essential for
converting unwritten extents to real extents at the right time and
performing COW remappings.

Unfortunately, XFS queues each IO completion work item to an unbounded
workqueue, which means that the kernel can spawn dozens of threads to
try to handle the items quickly.  These threads need to take the ILOCK
to update file metadata, which results in heavy ILOCK contention if a
large number of the work items target a single file, which is
inefficient.

Worse yet, the writeback completion threads get stuck waiting for the
ILOCK while holding transaction reservations, which can use up all
available log reservation space.  When that happens, metadata updates to
other parts of the filesystem grind to a halt, even if the filesystem
could otherwise have handled it.

Even worse, if one of the things grinding to a halt happens to be a
thread in the middle of a defer-ops finish holding the same ILOCK and
trying to obtain more log reservation having exhausted the permanent
reservation, we now have an ABBA deadlock - writeback completion has a
transaction reserved and wants the ILOCK, and someone else has the ILOCK
and wants a transaction reservation.

Therefore, we create a per-inode writeback io completion queue + work
item.  When writeback finishes, it can add the ioend to the per-inode
queue and let the single worker item process that queue.  This
dramatically cuts down on the number of kworkers and ILOCK contention in
the system, and seems to have eliminated an occasional deadlock I was
seeing while running generic/476.

Testing with a program that simulates a heavy random-write workload to a
single file demonstrates that the number of kworkers drops from
approximately 120 threads per file to 1, without dramatically changing
write bandwidth or pagecache access latency.

Note that we leave the xfs-conv workqueue's max_active alone because we
still want to be able to run ioend processing for as many inodes as the
system can handle.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-16 10:01:57 -07:00

36 lines
1.1 KiB
C

// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (c) 2005-2006 Silicon Graphics, Inc.
* All Rights Reserved.
*/
#ifndef __XFS_AOPS_H__
#define __XFS_AOPS_H__
extern struct bio_set xfs_ioend_bioset;
/*
* Structure for buffered I/O completions.
*/
struct xfs_ioend {
struct list_head io_list; /* next ioend in chain */
int io_fork; /* inode fork written back */
xfs_exntst_t io_state; /* extent state */
struct inode *io_inode; /* file being written to */
size_t io_size; /* size of the extent */
xfs_off_t io_offset; /* offset in the file */
struct xfs_trans *io_append_trans;/* xact. for size update */
struct bio *io_bio; /* bio being built */
struct bio io_inline_bio; /* MUST BE LAST! */
};
extern const struct address_space_operations xfs_address_space_operations;
extern const struct address_space_operations xfs_dax_aops;
int xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
extern void xfs_count_page_state(struct page *, int *, int *);
extern struct block_device *xfs_find_bdev_for_inode(struct inode *);
extern struct dax_device *xfs_find_daxdev_for_inode(struct inode *);
#endif /* __XFS_AOPS_H__ */