If the bitmap block on disk is bad, ext4_mb_load_buddy() returns an
error. This error is returned to the caller,
ext4_mb_regular_allocator() and then to ext4_mb_new_blocks(). But
ext4_mb_new_blocks() did not check for the return value of
ext4_mb_regular_allocator() and would repeatedly try to load the
bitmap block. The fix simply catches the return value and exits out of
the 'repeat' loop after cleanup.
We also take the opportunity to clean up the error handling in
ext4_mb_new_blocks().
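The control flow described above can be sketched in userspace C. This is
only an illustration, not the mballoc code: regular_allocator(),
new_blocks() and MAX_ATTEMPTS are hypothetical stand-ins, and the -ENOSPC
fallback is an assumption made so the sketch terminates.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MAX_ATTEMPTS 3	/* arbitrary bound so the sketch always terminates */

/*
 * Hypothetical stand-in for ext4_mb_regular_allocator(): reports -EIO when
 * the bitmap is bad, otherwise "finds" blocks on the second attempt.
 */
static int regular_allocator(bool bitmap_bad, int attempt, bool *found)
{
	if (bitmap_bad)
		return -EIO;
	*found = attempt >= 2;
	return 0;
}

/* Hypothetical stand-in for ext4_mb_new_blocks(). */
static int new_blocks(bool bitmap_bad, int *attempts)
{
	bool found = false;
	int err;

	*attempts = 0;
repeat:
	(*attempts)++;
	err = regular_allocator(bitmap_bad, *attempts, &found);
	if (err)
		goto out;	/* the point of the fix: do not retry on error */
	if (!found && *attempts < MAX_ATTEMPTS)
		goto repeat;
	if (!found)
		err = -ENOSPC;
out:
	/* cleanup (releasing the allocation context, etc.) would go here */
	return err;
}
```

With a bad bitmap the sketch gives up after a single attempt instead of
looping; with a good one it retries until blocks are found.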
Google-Bug-Id: 2853530
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In data=journal mode, we still use block_write_begin() to prepare
page for writing. This function can occasionally mark buffer dirty
which violates journalling assumptions - when a buffer is part of
a transaction, it should not be dirty and a buffer can be already part
of a forget list of some transaction when block_write_begin()
gets called. This violation of journalling assumptions then results
in "JBD: Spotted dirty metadata buffer..." warnings.
In fact, temporarily dirtying the buffer while the page is still locked
does not really cause problems to the journalling because we won't write
the buffer until the page gets unlocked. So we just have to make sure
to clear dirty bits before unlocking the page.
Signed-off-by: Jan Kara <jack@suse.cz>
commit 3d0518f4, "ext4: New rec_len encoding for very
large blocksizes" made several changes to this path, but from
a perf perspective, un-inlining ext4_rec_len_from_disk() seems
most significant. This function is called from ext4_check_dir_entry(),
which on a file-creation workload is called extremely often.
I tested this with bonnie:
# bonnie++ -u root -s 0 -f -x 200 -d /mnt/test -n 32
(this does 200 iterations) and got this for the file creations:
ext4 stock: Average = 21206.8 files/s
ext4 inlined: Average = 22346.7 files/s (+5%)
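For reference, the kind of helper being inlined looks roughly like the
sketch below. It is a userspace approximation written from memory of the
large-blocksize rec_len encoding, so the constants and the name
rec_len_from_disk() should be treated as assumptions rather than as the
kernel source; the value is assumed to be in CPU byte order already.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_REC_LEN ((1u << 16) - 1)	/* assumed sentinel value */

/*
 * Decode an on-disk 16-bit rec_len. A value of 0 or MAX_REC_LEN is taken to
 * mean "the entry spans the whole block"; otherwise the two low bits carry
 * bits 16 and 17 of the real length.
 */
static inline unsigned int rec_len_from_disk(uint16_t dlen,
					     unsigned int blocksize)
{
	unsigned int len = dlen;

	if (len == MAX_REC_LEN || len == 0)
		return blocksize;
	return (len & 65532) | ((len & 3) << 16);
}
```

A function this small costs more in call overhead than in work when it is
called once per directory entry, which is why keeping it static inline
matters on a file-creation workload.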
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Lockstat reports have shown that j_state_lock is a major source of
lock contention, especially on systems with more than 4 CPU cores. So
change it to be a read/write spinlock.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Allow mount options to be stored in the superblock. Also add default
mount option bits for nobarrier, block_validity, discard, and nodelalloc.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Perform the full sync procedure so that any delayed allocation blocks are
allocated and quota will be consistent.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Commit 6b0310fbf0 caused a regression resulting in deadlocks
when freezing a filesystem which had active IO; the vfs_check_frozen
level (SB_FREEZE_WRITE) did not let the freeze-related IO syncing
through. Duh.
Changing the test to FREEZE_TRANS should let the normal freeze
syncing get through the fs, but still block any transactions from
starting once the fs is completely frozen.
I tested this by running fsstress in the background while periodically
snapshotting the fs and running fsck on the result. I ran into
occasional deadlocks, but different ones. I think this is a
fine fix for the problem at hand, and the other deadlocky things
will need more investigation.
Reported-by: Phillip Susi <psusi@cfl.rr.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There were some error paths in ext4_delete_inode() which were not
dropping the inode from the orphan list. This could lead to a BUG_ON
on umount when the orphan list is discovered to be non-empty.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There are some drivers which may not set bdev->bd_dev. So make sure
it is non-NULL before dereferencing it.
Google-Bug-Id: 1773557
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
I often get emails containing the "This should not happen!!" message,
conveniently trimmed to remove things like:
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 03 13 c9 70 00 00 28 00
end_request: I/O error, dev sda, sector 51628400
Aborting journal on device dm-0-8.
EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (dm-0): Remounting filesystem read-only
I don't think there is any value to the verbosity if the reason is
due to a filesystem abort; it just obfuscates the root cause.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_get_blocks got renamed to ext4_map_blocks, but left stale
comments and a prototype littered around.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When journaled quota options are not specified, we do writes
to quota files just in data=ordered mode. This actually causes
warnings from JBD2 about dirty journaled buffer because ext4_getblk
unconditionally treats a block allocated by it as metadata. Since
quota actually is filesystem metadata, the easiest way to get rid
of the warning is to always treat quota writes as metadata...
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Under heavy memory pressure we may hit an out-of-memory
situation, and as a result the kstrdup'ed options will not be
freed. Fix it.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If the user attempts to make a non-extent-mapped file too large,
return EFBIG, but don't call ext4_std_err() which will end up marking
the file system as containing an error.
Thanks to Toshiyuki Okajima-san at Fujitsu for pointing this out.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
For some reason, today mballoc only allocates IOs which are exactly
stripe-sized on a stripe boundary. If you have a multiple (say, a
128k IO on a 64k stripe) you may end up unaligned.
It seems to me that a simple change to align stripe-multiple IOs
on stripe boundaries would be a very good idea, unless this breaks
some other mballoc heuristic for some reason...
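The difference between the old and the proposed condition can be sketched
as follows. The helper names are hypothetical and this is not the mballoc
code; lengths and the stripe are in filesystem blocks, so with 4k blocks a
64k stripe is 16 blocks and a 128k IO is 32 blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old behaviour: only requests of exactly one stripe are stripe-aligned. */
static bool wants_stripe_align_old(uint32_t len, uint32_t stripe)
{
	return stripe != 0 && len == stripe;
}

/* Proposed behaviour: any whole multiple of the stripe is stripe-aligned. */
static bool wants_stripe_align_new(uint32_t len, uint32_t stripe)
{
	return stripe != 0 && len % stripe == 0;
}

/* Round a starting block up to the next stripe boundary. */
static uint64_t round_up_to_stripe(uint64_t block, uint32_t stripe)
{
	assert(stripe > 0);
	return (block + stripe - 1) / stripe * stripe;
}
```

In this sketch a 32-block request on a 16-block stripe is rejected by the
old test and accepted by the new one, while a 24-block request is left
alone by both.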
Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch is to be applied on top of Christoph's "direct-io: move aio_complete
into ->end_io" patch. It adds iocb and result fields to struct ext4_io_end_t,
so that we can call aio_complete from ext4_end_io_nolock() after the extent
conversion has finished.
I have verified this with Christoph's aio-dio test, which used to fail after a
few runs on an original kernel but now succeeds on the patched kernel.
See http://thread.gmane.org/gmane.comp.file-systems.ext4/19659 for details.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Filesystems with unwritten extent support must not complete an AIO request
until the transaction to convert the extent has been committed. That means
the aio_complete call needs to be moved into the ->end_io callback so
that the filesystem can control when to call it exactly.
This makes a bit of a mess out of dio_complete and makes the ->end_io
callback prototype even more complicated.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Issue discard request in ext4_free_blocks() when ext4 has no journal and
is mounted with discard option.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We have experienced bitmap inconsistencies after crash during file
delete under heavy load. The crash is not file system related and I
believe the following patch in ext4_free_branches() fixes the recovery
problem.
If the transaction is restarted and there is a crash before the new
transaction is committed, then after recovery, the blocks that this
indirect block points to have been freed, but the indirect block
itself has not been freed and may still point to some of the free
blocks (because of the ext4_forget()).
So ext4_forget() should be called inside ext4_free_blocks() to avoid
this problem.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This allows us to grab any file system error messages by scraping
/var/log/messages. This will make it easy for us to do error analysis
across the very large number of machines as we deploy ext4 across the
fleet.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Save number of file system errors, and the time, function name, line
number, block number, and inode number of the first and most recent
errors reported on the file system in the superblock.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4 didn't update the ctime of the file when its permission was
changed.
Steps to reproduce:
# touch aaa
# stat -c %Z aaa
1275289822
# setfacl -m 'u::x,g::x,o::x' aaa
# stat -c %Z aaa
1275289822 <- unchanged
But, according to the spec of the ctime, ext4 must update it.
Port of ext3 patch by Miao Xie <miaox@cn.fujitsu.com>.
CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The nobh option was only supported for writeback mode, but given that all
write paths actually create buffer heads it effectively was a no-op already.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
No real bugs found, just removed some dead code.
Found by gcc 4.6's new warnings.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We don't need to set s_dirt in most of the ext4 code when journaling
is enabled. In ext3/4 some of the summary statistics for # of free
inodes, blocks, and directories are calculated from the per-block
group statistics when the file system is mounted or unmounted. As a
result the superblock doesn't have to be updated, either via the
journal or by setting s_dirt. There are a few exceptions, most
notably when resizing the file system, where the superblock needs to
be modified --- and in that case it should be done as a journalled
operation if possible, and s_dirt set only in no-journal mode.
This patch will optimize out some unneeded disk writes when using ext4
with a journal.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
A few functions were still modifying i_flags in a racy manner.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Dan Rosenberg has reported a problem with the MOVE_EXT ioctl. If the
donor file is an append-only file, we should not allow the operation
to proceed, lest we end up overwriting the contents of an append-only
file.
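The check amounts to something like the sketch below. The flag values and
the name check_donor() are hypothetical, and returning -EPERM (as well as
treating an immutable donor the same way) is an assumption made for
illustration, not a statement about the actual patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits for this sketch; not the kernel's values. */
#define SKETCH_IMMUTABLE 0x1u
#define SKETCH_APPEND    0x2u

/*
 * Refuse to use a donor file whose contents must not be overwritten.
 * Returns 0 when the donor is acceptable, -EPERM otherwise (assumed code).
 */
static int check_donor(unsigned int donor_flags)
{
	if (donor_flags & (SKETCH_IMMUTABLE | SKETCH_APPEND))
		return -EPERM;
	return 0;
}
```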
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dan Rosenberg <dan.j.rosenberg@gmail.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
quota: Convert quota statistics to generic percpu_counter
ext3 uses rb_node = NULL; to zero rb_root.
quota: Fixup dquot_transfer
reiserfs: Fix resuming of quotas on remount read-write
pohmelfs: Remove dead quota code
ufs: Remove dead quota code
udf: Remove dead quota code
quota: rename default quotactl methods to dquot_
quota: explicitly set ->dq_op and ->s_qcop
quota: drop remount argument to ->quota_on and ->quota_off
quota: move unmount handling into the filesystem
quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
quota: move remount handling into the filesystem
ocfs2: Fix use after free on remount read-only
Fix up conflicts in fs/ext4/super.c and fs/ufs/file.c
We don't name our generic fsync implementations very well currently.
The no-op implementation for in-memory filesystems currently is called
simple_sync_file which doesn't make too much sense to start with,
and the generic one for simple filesystems is called simple_fsync
which can lead to some confusion.
This patch renames the generic file fsync method to generic_file_fsync
to match the other generic_file_* routines it is supposed to be used
with, and the no-op implementation to noop_fsync to make it obvious
what to expect. In addition add some documentation for both methods.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
ext4: Make fsync sync new parent directories in no-journal mode
ext4: Drop whitespace at end of lines
ext4: Fix compat EXT4_IOC_ADD_GROUP
ext4: Conditionally define compat ioctl numbers
tracing: Convert more ext4 events to DEFINE_EVENT
ext4: Add new tracepoints to track mballoc's buddy bitmap loads
ext4: Add a missing trace hook
ext4: restart ext4_ext_remove_space() after transaction restart
ext4: Clear the EXT4_EOFBLOCKS_FL flag only when warranted
ext4: Avoid crashing on NULL ptr dereference on a filesystem error
ext4: Use bitops to read/modify i_flags in struct ext4_inode_info
ext4: Convert calls of ext4_error() to EXT4_ERROR_INODE()
ext4: Convert callers of ext4_get_blocks() to use ext4_map_blocks()
ext4: Add new abstraction ext4_map_blocks() underneath ext4_get_blocks()
ext4: Use our own write_cache_pages()
ext4: Show journal_checksum option
ext4: Fix for ext4_mb_collect_stats()
ext4: check for a good block group before loading buddy pages
ext4: Prevent creation of files larger than RLIMIT_FSIZE using fallocate
ext4: Remove extraneous newlines in ext4_msg() calls
...
Fixed up trivial conflict in fs/ext4/fsync.c
Follow the dquot_* style used elsewhere in dquot.c.
[Jan Kara: Fixed up missing conversion of ext2]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Remount handling has fully moved into the filesystem, so all this is
superfluous now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently the VFS calls into the quotactl interface for unmounting
filesystems. This means filesystems with their own quota handling
can't easily distinguish between user-space originating quotaoff
and an unmount. Instead move the responsibility of the unmount handling
into the filesystem to be consistent with all other dquot handling.
Note that we do call dquot_disable a lot later now, e.g. after
a sync_filesystem. But this is fine as the quota code does all its
writes via blockdev's mapping and that is synced even later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Instead of having wrappers in the VFS namespace, export the dquot_suspend
and dquot_resume helpers directly. Also rename vfs_quota_disable to
dquot_disable while we're at it.
[Jan Kara: Moved dquot_suspend to quotaops.h and made it inline]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently do_remount_sb calls into the dquot code to tell it about going
from rw to ro and ro to rw. Move this code into the filesystem to
not depend on the dquot code in the VFS - note ocfs2 already ignores
these calls and handles remount by itself. This gets rid of overloading
the quotactl calls and allows unifying the VFS and XFS codepaths in
that area later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
get rid of home-grown mutex in cris eeprom.c
switch ecryptfs_write() to struct inode *, kill on-stack fake files
switch ecryptfs_get_locked_page() to struct inode *
simplify access to ecryptfs inodes in ->readpage() and friends
AFS: Don't put struct file on the stack
Ban ecryptfs over ecryptfs
logfs: replace inode uid,gid,mode initialization with helper function
ufs: replace inode uid,gid,mode initialization with helper function
udf: replace inode uid,gid,mode init with helper
ubifs: replace inode uid,gid,mode initialization with helper function
sysv: replace inode uid,gid,mode initialization with helper function
reiserfs: replace inode uid,gid,mode initialization with helper function
ramfs: replace inode uid,gid,mode initialization with helper function
omfs: replace inode uid,gid,mode initialization with helper function
bfs: replace inode uid,gid,mode initialization with helper function
ocfs2: replace inode uid,gid,mode initialization with helper function
nilfs2: replace inode uid,gid,mode initialization with helper function
minix: replace inode uid,gid,mode init with helper
ext4: replace inode uid,gid,mode init with helper
...
Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)