linux_dsm_epyc7002/fs/hfsplus
Hin-Tak Leung 7cb74be6fd hfs,hfsplus: cache pages correctly between bnode_create and bnode_free
Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
hfs_bnode_find() for finding or creating pages corresponding to an inode)
are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
and should not be page_cache_release()'ed until hfs_bnode_free().

This patch fixes a problem I first saw in July 2012: merely running "du"
on a large hfsplus-mounted directory a few times on a reasonably loaded
system would get the hfsplus driver all confused and complaining about
B-tree inconsistencies, and generates a "BUG: Bad page state".  Most
recently, I can generate this problem on up-to-date Fedora 22 with shipped
kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
lightly-used QEMU VM image of the full Mac OS X 10.9:

$ df -i / /home /mnt
Filesystem                  Inodes   IUsed      IFree IUse% Mounted on
/dev/mapper/fedora-root    3276800  551665    2725135   17% /
/dev/mapper/fedora-home   52879360  716221   52163139    2% /home
/dev/nbd0p2             4294967295 1387818 4293579477    1% /mnt

After applying the patch, I was able to run "du /" (60+ times) and "du
/mnt" (150+ times) continuously and simultaneously for 6+ hours.

There are many reports of the hfsplus driver getting confused under load
and generating "BUG: Bad page state" or other similar issues over the
years.  [1]

The unpatched code [2] has always been wrong since it entered the kernel
tree.  The only reason why it gets away with it is that the
kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
the kernel has not had a chance to reuse the memory for something else,
most of the time.

The current RW driver appears to have followed the design and development
of the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec
2001) had a B-tree node-centric approach to
read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
migrating towards version 0.2 (June 2002) of caching and releasing pages
per inode extents.  When the current RW code first entered the kernel [2]
in 2005, there was an REF_PAGES conditional (and "//" commented out code)
to switch between B-node centric paging to inode-centric paging.  There
was a mistake with the direction of one of the REF_PAGES conditionals in
__hfs_bnode_create().  In a subsequent "remove debug code" commit [4], the
read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
removed, but a page_cache_release() was mistakenly left in (propagating
the "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out
page_cache_release() in bnode_release() (which should be spanned by
!REF_PAGES) was never enabled.

References:
[1]:
Michael Fox, Apr 2013
http://www.spinics.net/lists/linux-fsdevel/msg63807.html
("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")

Sasha Levin, Feb 2015
http://lkml.org/lkml/2015/2/20/85 ("use after free")

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
https://bugzilla.kernel.org/show_bug.cgi?id=42342
https://bugzilla.kernel.org/show_bug.cgi?id=63841
https://bugzilla.kernel.org/show_bug.cgi?id=78761

[2]:
http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
Author: Andrew Morton <akpm@osdl.org>
Date:   Wed Feb 25 16:17:36 2004 -0800

    [PATCH] HFS rewrite

http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd

commit 91556682e0bf004d98a529bf829d339abb98bbbd
Author: Andrew Morton <akpm@osdl.org>
Date:   Wed Feb 25 16:17:48 2004 -0800

    [PATCH] HFS+ support

[3]:
http://sourceforge.net/projects/linux-hfsplus/

http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/

http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
fs/hfsplus/bnode.c?r1=1.4&r2=1.5

Date:   Thu Jun 6 09:45:14 2002 +0000
Use buffer cache instead of page cache in bnode.c. Cache inode extents.

[4]:
http://git.kernel.org/cgit/linux/kernel/git/\
stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d

commit a5e3985fa0
Author: Roman Zippel <zippel@linux-m68k.org>
Date:   Tue Sep 6 15:18:47 2005 -0700

[PATCH] hfs: remove debug code

Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Sougata Santra <sougata@tuxera.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
..
acl.h hfsplus: use generic posix ACL infrastructure 2014-01-25 23:58:20 -05:00
attributes.c hfsplus: remove unused routine hfsplus_attr_build_key_uni 2014-06-06 16:08:09 -07:00
bfind.c fs/hfsplus: replace if/BUG by BUG_ON 2015-04-17 09:04:05 -04:00
bitmap.c hfsplus: remove duplicated message prefix in hfsplus_block_free() 2013-04-30 17:04:05 -07:00
bnode.c hfs,hfsplus: cache pages correctly between bnode_create and bnode_free 2015-09-10 13:29:01 -07:00
brec.c hfsplus: fix B-tree corruption after insertion at position 0 2015-03-25 16:20:31 -07:00
btree.c hfsplus: fix "unused node is not erased" error 2014-06-06 16:08:10 -07:00
catalog.c hfsplus: add missing curly braces in hfsplus_delete_cat() 2015-04-17 09:04:04 -04:00
dir.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
extents.c fs/hfsplus: fix pr_foo() and hfs_dbg formats 2014-06-06 16:08:10 -07:00
hfsplus_fs.h fs: cleanup slight list_entry abuse 2015-06-23 18:01:59 -04:00
hfsplus_raw.h hfsplus: fix "unused node is not erased" error 2014-06-06 16:08:10 -07:00
inode.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
ioctl.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
Kconfig hfsplus: add necessary declarations for POSIX ACLs support 2013-09-11 15:59:00 -07:00
Makefile hfsplus: integrate POSIX ACLs support into driver 2013-09-11 15:59:01 -07:00
options.c fs: create and use seq_show_option for escaping 2015-09-04 16:54:41 -07:00
part_tbl.c hfsplus: ensure bio requests are not smaller than the hardware sectors 2011-07-22 16:37:44 +02:00
posix_acl.c hfsplus: use generic posix ACL infrastructure 2014-01-25 23:58:20 -05:00
super.c writeback: separate out include/linux/backing-dev-defs.h 2015-06-02 08:33:34 -06:00
tables.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
unicode.c Don't pass inode to ->d_hash() and ->d_compare() 2013-06-29 12:57:36 +04:00
wrapper.c fs/hfsplus/wrapper.c: replace shift loop by ilog2 2014-06-06 16:08:10 -07:00
xattr_security.c fs/hfsplus: move xattr_name allocation in hfsplus_setxattr() 2015-04-17 09:04:05 -04:00
xattr_trusted.c fs/hfsplus: move xattr_name allocation in hfsplus_setxattr() 2015-04-17 09:04:05 -04:00
xattr_user.c fs/hfsplus: move xattr_name allocation in hfsplus_setxattr() 2015-04-17 09:04:05 -04:00
xattr.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
xattr.h fs/hfsplus: move xattr_name allocation in hfsplus_setxattr() 2015-04-17 09:04:05 -04:00