Pull CIFS fixes from Steve French.
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Fix memory leaks in SMB2_open
cifs: ensure that vol->username is not NULL before running strlen on it
Clarify SMB2/SMB3 create context and add missing ones
Do not send ClientGUID on SMB2.02 dialect
cifs: Set client guid on per connection basis
fs/cifs/netmisc.c: convert printk to pr_foo()
fs/cifs/cifs.c: replace seq_printf by seq_puts
Update cifs version number to 2.03
fs: cifs: new helper: file_inode(file)
cifs: fix potential races in cifs_revalidate_mapping
cifs: new helper function: cifs_revalidate_mapping
cifs: convert booleans in cifsInodeInfo to a flags field
cifs: fix cifs_uniqueid_to_ino_t not to ever return 0
This caused reduced performance for some users with advanced post
processing enabled. We need a better method to pick the
UVD state based on the amount of post processing required or tune
the advanced post processing to fit within the lower power state
envelope.
This reverts commit 14a9579ddb.
Cc: "3.15" <stable@vger.kernel.org>
And also domain to prefered_domains. That matches better
what those values represent.
Signed-off-by: Christian König <christian.koenig@amd.com>
Cc: Marek Olšák <maraeo@gmail.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The underlying reason for the crashes seems to be fixed now.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
We never check the return value anyway and if the
index isn't valid would crash way before calling
the functions.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
When we set the valid bit on invalid GART entries they are
loaded into the TLB when an adjacent entry is loaded. This
poisons the TLB with invalid entries which are sometimes
not correctly removed on TLB flush.
For stable inclusion the patch probably needs to be modified a bit.
Signed-off-by: Christian König <christian.koenig@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Make sure that a hdmi deep color mode can't exceed the max tmds
clock limit of a hdmi sink if such a limit is defined by edid.
If requested deep color bpc would exceed the limit given the mode
to be set, try to degrade gracefully to lower supported deep color
bpc or to standard 8 bpc if needed.
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
HDMI deep color setup must know which modes are supported if
it needs to degrade gracefully, as only 12 bpc / dc_36 is
guaranteed, but 10 bpc / dc_30 is optional. The maximum bpc
is not sufficient for this.
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Replace occurrences of "v & 0xffffffff" with lower_32_bits(v)
when it's next to an upper_32_bits(v). Also remove unnecessary
"upper_32_bits(v) & 0xffffffff" code snippets.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This patch consists of the usual driver updates (qla2xxx, qla4xxx, lpfc,
be2iscsi, fnic, ufs, NCR5380) The NCR5380 is the addition to maintained status
of a long neglected driver for older hardware. In addition there are a lot of
minor fixes and cleanups and some more updates to make scsi mq ready.
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJTlcwiAAoJEDeqqVYsXL0M5qsIALzVPLd4yxA16zCiaPQUeIV5
mfYmwISFlN+qW3AcUeSH4D13YgegCjEBfqaDMWvIkgouxLy/7jpxtChutq3MCzUE
cDT1B9+ZrzoqBISRNHEh/gx5F1MOF2VPuqG2pe0J90wyRCNzJscB6PbtWMAo86CA
2eu7wq3K9FXxCC1qY0PzwBLXHqUcgk5GWiK9CM/k4W0NiTVeNmwPeh5i91IQnBHx
E2l7NAXgNLyCf5tyeswvZ4pW0T3hlaswNmBB4qC8oJm4U6UqMN+tk4ML63Pz7uPe
4mlHG0uI8Vbdi13iv1EDUZ9Vo8iqVrzP2UAhakgP9poKSGqE4d/MD0EKNGQB2Es=
=cBY8
-----END PGP SIGNATURE-----
Merge tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This patch consists of the usual driver updates (qla2xxx, qla4xxx,
lpfc, be2iscsi, fnic, ufs, NCR5380) The NCR5380 is the addition to
maintained status of a long neglected driver for older hardware. In
addition there are a lot of minor fixes and cleanups and some more
updates to make scsi mq ready"
* tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (130 commits)
include/scsi/osd_protocol.h: remove unnecessary __constant
mvsas: Recognise device/subsystem 9485/9485 as 88SE9485
Revert "be2iscsi: Fix processing cqe for cxn whose endpoint is freed"
mptfusion: fix msgContext in mptctl_hp_hostinfo
acornscsi: remove linked command support
scsi/NCR5380: dprintk macro
fusion: Remove use of DEF_SCSI_QCMD
fusion: Add free msg frames to the head, not tail of list
mpt2sas: Add free smids to the head, not tail of list
mpt2sas: Remove use of DEF_SCSI_QCMD
mpt2sas: Remove uses of serial_number
mpt3sas: Remove use of DEF_SCSI_QCMD
mpt3sas: Remove uses of serial_number
qla2xxx: Use kmemdup instead of kmalloc + memcpy
qla4xxx: Use kmemdup instead of kmalloc + memcpy
qla2xxx: fix incorrect debug printk
be2iscsi: Bump the driver version
be2iscsi: Fix processing cqe for cxn whose endpoint is freed
be2iscsi: Fix destroy MCC-CQ before MCC-EQ is destroyed
be2iscsi: Fix memory corruption in MBX path
...
Pull input updates from Dmitry Torokhov:
"A big update to the Atmel touchscreen driver, devm support for polled
input devices, several drivers have been converted to using managed
resources, and assorted driver fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (87 commits)
Input: synaptics - fix resolution for manually provided min/max
Input: atmel_mxt_ts - fix invalid return from mxt_get_bootloader_version
Input: max8997_haptic - add error handling for regulator and pwm
Input: elantech - don't set bit 1 of reg_10 when the no_hw_res quirk is set
Input: elantech - deal with clickpads reporting right button events
Input: edt-ft5x06 - fix an i2c write for M09 support
Input: omap-keypad - remove platform data support
ARM: OMAP2+: remove unused omap4-keypad file and code
Input: ab8500-ponkey - switch to using managed resources
Input: max8925_onkey - switch to using managed resources
Input: 88pm860x-ts - switch to using managed resources
Input: 88pm860x_onkey - switch to using managed resources
Input: intel-mid-touch - switch to using managed resources
Input: wacom - process outbound for newer Cintiqs
Input: wacom - set stylus_in_proximity when pen is in range
DTS: ARM: OMAP3-N900: Add tsc2005 support
Input: tsc2005 - add DT support
Input: add common DT binding for touchscreens
Input: jornada680_kbd - switch top using managed resources
Input: adp5520-keys - switch to using managed resources
...
Add OMAP DT data:
* omap5 display subsystem
* display data for omap5 uEVM board
* am43xx display subsystem
* display data for am43xx ePOS and GP boards (LCD only)
* display data for GTA04 board
* display data for overo board
* display data for duovero-parlor board
* display data for omap3 evm and ldp boards
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTlbIaAAoJEPo9qoy8lh71zKsP/2RuWCEKuyZ8acg5a58y98eD
aguogxUjxKURoqA2FtrH1qAokYnqds9KHcB8GtOtYL+5Q8GMhGsk4YS/++twyCBm
9JOq/2FBdbTKi8mkGmURRJQjWwd+BJyeOQb/F54jif+akEmw3oL4SNL3YaTitQqT
Yhg+QZa7djwBBCSGy2sHygnrYlEVJiz9gjdMye0kdPEPmg1LKZny0HJZgMkndsCH
oEs836pY78SiWGpFjz5Jsk4zjitPJOLwa7/RdL27s+OWyJb/RMxc4SDdL6de5H+u
L2GSOe3vxG+0lrTslosRM3qJwIQGKWbYqOEXMDFdDKANS24QbQYw5NCFwUCfeCCy
Rxlw9ntr7v/iPyQ3t8oMoNG9Xm0o4gvet8LIbj/33mqFMAESnnbi6GmujmlA9S9p
x6DAasBN2LAf6eQhshE7W/6XiEnDH2cVLXVGQwj6yiuhPp/GblGhHIh1MTHA41Vi
A/AN/svDsPjkzhZyMETRljSpdHwQXf+vIYSeipSQFW0poBQ7o5bLUuli/VB21kbi
UNDhegCNrTKjqxVZL4DI7/E8JYdwGjKKfmbgiGWOjyu8Jd3/0KAhZA6JIv0DeAyN
ankSfioyEleXFm4iPC+dN7dZPWTb3SudCndEwbmjVR3VDKjBeiy60HGyxBJ4/uWm
9FkkdQfnRfNHYYXHJi1R
=/UHU
-----END PGP SIGNATURE-----
Merge tag 'fbdev-omap-dt-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
Pull OMAP DT fbdev updates from Tomi Valkeinen:
"Here are display related device tree data changes for OMAP. They are
based on an already merged branch to satisfy the dependencies for the
dts file changes.
Add OMAP DT data:
- omap5 display subsystem
- display data for omap5 uEVM board
- am43xx display subsystem
- display data for am43xx ePOS and GP boards (LCD only)
- display data for GTA04 board
- display data for overo board
- display data for duovero-parlor board
- display data for omap3 evm and ldp boards"
* tag 'fbdev-omap-dt-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
ARM: omap5.dtsi: Add audio related parameters to hdmi node
ARM: omap4.dtsi: Add audio related parametes to hdmi node
ARM: dts: duovero-parlor: Add HDMI output
ARM: dts: overo: Add support for 3.5'' LCD output
ARM: dts: overo: Add support for 4.3'' LCD output
ARM: dts: overo: Add support for DVI output
ARM: dts: Add LCD panel sharp ls037v7dw01 support for omap3-evm and ldp
ARM: dts: omap3-gta04: Add display support
ARM: dts: omap5-uevm.dts: add display nodes
ARM: dts: omap5-uevm.dts: add tca6424a
ARM: dts: omap5.dtsi: add DSS nodes
ARM: dts: am43x-epos-evm: add LCD data
ARM: dts: am437x-gp-evm: add LCD data
ARM: dts: am4372.dtsi: add DSS information
Pull MIPS updates from Ralf Baechle:
- three fixes for 3.15 that didn't make it in time
- limited Octeon 3 support.
- paravirtualization support
- improvment to platform support for Netlogix SOCs.
- add support for powering down the Malta eval board in software
- add many instructions to the in-kernel microassembler.
- add support for the BPF JIT.
- minor cleanups of the BCM47xx code.
- large cleanup of math emu code resulting in significant code size
reduction, better readability of the code and more accurate
emulation.
- improvments to the MIPS CPS code.
- support C3 power status for the R4k count/compare clock device.
- improvments to the GIO support for older SGI workstations.
- increase number of supported CPUs to 256; this can be reached on
certain embedded multithreaded ccNUMA configurations.
- various small cleanups, updates and fixes
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (173 commits)
MIPS: IP22/IP28: Improve GIO support
MIPS: Octeon: Add twsi interrupt initialization for OCTEON 3XXX, 5XXX, 63XX
DEC: Document the R4k MB ASIC mini interrupt controller
DEC: Add self as the maintainer
MIPS: Add microMIPS MSA support.
MIPS: Replace calls to obsolete strict_strto call with kstrto* equivalents.
MIPS: Replace obsolete strict_strto call with kstrto
MIPS: BFP: Simplify code slightly.
MIPS: Call find_vma with the mmap_sem held
MIPS: Fix 'write_msa_##' inline macro.
MIPS: Fix MSA toolchain support detection.
mips: Update the email address of Geert Uytterhoeven
MIPS: Add minimal defconfig for mips_paravirt
MIPS: Enable build for new system 'paravirt'
MIPS: paravirt: Add pci controller for virtio
MIPS: Add code for new system 'paravirt'
MIPS: Add functions for hypervisor call
MIPS: OCTEON: Add OCTEON3 to __get_cpu_type
MIPS: Add function get_ebase_cpunum
MIPS: Add minimal support for OCTEON3 to c-r4k.c
...
* Moving cache disabling to early boot
* ARC UART enabled only if earlyprintk setup in cmdline
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJTlbXaAAoJEGnX8d3iisJezBoP/2pNo6Pc+s9IWEtqyKQWujTW
AVUxC/kKboVXk9KA4Uzbou9Up3yZOXKYqCjTqHvrHSUSx1mhU4tRQQfeFUWZ1YVr
qRsHOit2eMEknJG8PP9a/qjlxMxIQ/DYoQumQCzK9F4wwFoirgevvSZeR23owphL
2WSJB1wyAqmVt1mvTOJP7AvH9xY8hBp+lMm8skL9Nc7ay5Z7jhETINdpu5X6w5uP
H4T8hbz289iwH7R73zPyyaT0eWewFIe5zQpxF/l7vHB96fZY7x8/Gf0hyh0ufCsa
p9O760tdNzd2duA2nnjT5vqtfHo66PvNmnwIrr/LchVX41tWtvYDeeIwy13+NhjS
i+XZ0vKEFaXZM4gLbPRa2sd/D3s/y1qQhygrl1tyImYGsBEq0gIq1iUl7P/CJ1KP
1M/sx7JoDXwkkq+HbcMpRUotTG7gZXbEwFKL2ZSX6XJ7v5SGPlSKSYqASdGSRjpa
ynvTVLTHYX0Q1CfYLrUOFoa4R9AvaTC3MxP09Dv8LGjByHcZGM5yEp+4JAfTZlfL
jP7wgrtrQNPw+hf0Zw/98uuG4IpnjwRv2nJROzBoddAWhOvihT8AuBuOTUDDIDEv
8gilSSY4lqxQvEMdZ07djtdvL86eMib0jFaeRSz3APxk5q3vsBXhuUXdE+DHnzTS
+gMza/QUMw73amwdzBUm
=neQc
-----END PGP SIGNATURE-----
Merge tag 'arc-v3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull ARC updates from Vineet Gupta:
"Nothing too exciting here, just minor fixes/cleanup. Only noteworthy
ones are:
- Moving cache disabling to early boot
- ARC UART enabled only if earlyprintk setup in cmdline"
* tag 'arc-v3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: Disable caches in early boot if so configured
ARC: [arcfpga] Early ARC UART to be only activated by cmdline
ARC: [arcfpga] Get rid of legacy BVCI latency unit support
ARC: remove duplicate header exports
ARC: arc_local_timer_setup() need not pass own cpu id
ARC: Fixed spelling errors within comments
ARC: make start_thread() out-of-line
ARC: fix mmuv2 warning
ARC: [SMP] ISS SMP extension bitrot
The raid5 sync_request() processing calls handle_stripe() within the context of
the resync-thread. The resync-thread issues the first set of read requests
and this adds execution latency and slows down the scheduling of the next
sync_request().
The current rebuild/resync speed of raid5 is not much faster than what
rotational HDDs can sustain.
Testing the following patch on a 6-drive array, I can increase the rebuild
speed from 100 MB/s to 175 MB/s.
The sync_request() now just sets STRIPE_HANDLE and releases the stripe. This
creates some more parallelism between the resync-thread and raid5 kernel daemon.
Signed-off-by: Eivind Sarto <esarto@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The skinny extents are intepreted incorrectly in scrub_print_warning(),
and end up hitting the BUG() in btrfs_extent_inline_ref_size.
Reported-by: Konstantinos Skarlatos <k.skarlatos@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
When cloning into a file, we were correctly replacing the extent
items in the target range and removing the extent maps. However
we weren't replacing the extent maps with new ones that point to
the new extents - as a consequence, an incremental fsync (when the
inode doesn't have the full sync flag) was a NOOP, since it relies
on the existence of extent maps in the modified list of the inode's
extent map tree, which was empty. Therefore add new extent maps to
reflect the target clone range.
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
We want to make sure the point is still within the extent item, not to verify
the memory it's pointing to.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
The backref code was looking at nodes as well as leaves when we tried to
populate extent item entries. This is not good, and although we go away with it
for the most part because we'd skip where disk_bytenr != random_memory,
sometimes random_memory would match and suddenly boom. This fixes that problem.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
In inode.c:btrfs_page_exists_in_range(), if the page we got from
the radix tree is an exception entry, which can't be retried, we
exit the loop with a non-NULL page and then call page_cache_release
against it, which is not ok since it's not a valid page. This could
also make us return true when we shouldn't.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
In inode.c:btrfs_page_exists_in_range(), if the page we get from the
radix tree is an exception which should make us retry, set page to
NULL in order to really retry, because otherwise we don't get another
loop iteration executed (page != NULL makes the while loop exit).
This also was making us call page_cache_release after exiting the loop,
which isn't correct because page doesn't point to a valid page, and
possibly return true from the function when we shouldn't.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
In inode.c:btrfs_page_exists_in_range(), if we can't get the page
we need to retry. However we weren't retrying because we weren't
setting page to NULL, which makes the while loop exit immediately
and will make us call page_cache_release after exiting the loop
which is incorrect because our page get didn't succeed. This could
also make us return true when we shouldn't.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
To return EOPNOTSUPP is more user friendly than to return EINVAL,
and then user-space tool will show that the dev_replace operation
for raid56 is not currently supported rather than showing that
there is an invalid argument.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Antonio Ospite <ao2@ao2.it>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
Several reports about leaf corruption has been floating on the list, one of them
points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted
after __btrfs_drop_extents(), it's really a rare case but it does exist.
The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents().
So in btrfs_next_leaf(), we release the current path to re-search the last key of
the leaf for locating next leaf, and we've taken it into account that there might
be balance operations between leafs during this 'unlock and re-lock' dance, so
we check the path again and advance it if there are now more items available.
But things are a bit different if that last key happens to be removed and balance
gets a bigger key as the last one, and btrfs_search_slot will return it with
ret > 0, IOW, nothing change in this leaf except the new last key, then we think
we're okay because there is no more item balanced in, fine, we thinks we can
go to the next leaf.
However, we should return that bigger key, otherwise we deserve leaf corruption,
for example, in endio, skipping that key means that __btrfs_drop_extents() thinks
it has dropped all extent matched the required range and finish_ordered_io can
safely insert a new extent, but it actually doesn't and ends up a leaf
corruption.
One may be asking that why our locking on extent io tree doesn't work as
expected, ie. it should avoid this kind of race situation. But in
__btrfs_drop_extents(), we don't always find extents which are included within
our locking range, IOW, extents can start before our searching start, in this
case locking on extent io tree doesn't protect us from the race.
This takes the special case into account.
Reviewed-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
We might have had an item with the previous key in the tree right
before we released our path. And after we released our path, that
item might have been pushed to the first slot (0) of the leaf we
were holding due to a tree balance. Alternatively, an item with the
previous key can exist as the only element of a leaf (big fat item).
Therefore account for these 2 cases, so that our callers (like
btrfs_previous_item) don't miss an existing item with a key matching
the previous key we computed above.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
If the NO_HOLES feature is enabled holes don't have file extent items in
the btree that represent them anymore. This made the clone operation
ignore the gaps that exist between consecutive file extent items and
therefore not create the holes at the destination. When not using the
NO_HOLES feature, the holes were created at the destination.
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
On heavy workloads, we're seeing soft lockup warnings on
root->inode_lock in __btrfs_release_delayed_node. The low hanging fruit
is to reduce the size of the critical section.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
To be accurate about the error case,
if the new size is beyond ULLONG_MAX, return ERANGE instead of EINVAL.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
If btrfs_log_dentry_safe() returns an error, we set ret to 1 and
fall through with the goal of committing the transaction. However,
in the case where the inode doesn't need a full sync, we would call
btrfs_wait_ordered_range() against the target range for our inode,
and if it returned an error, we would return without commiting or
ending the transaction.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_punch_hole() will truncate unaligned pages or punch hole on a
already existed hole.
This will cause unneeded zero page or holes splitting the original huge
hole.
This patch will skip already existed holes before any page truncating or
hole punching.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
On snapshot creation (either writable or read-only), we do orphan cleanup
against the root of the snapshot. If the cleanup did remove any orphans,
then the current root node will be different from the commit root node
until the next transaction commit happens.
A send operation always uses the commit root of a snapshot - this means
it will see the orphans if it starts computing the send stream before the
next transaction commit happens (triggered by a timer or sync() for .e.g),
which is when the commit root gets assigned a reference to current root,
where the orphans are not visible anymore. The consequence of send seeing
the orphans is explained below.
For example:
mkfs.btrfs -f /dev/sdd
mount -o commit=999 /dev/sdd /mnt
# open a file with O_TMPFILE and leave it open
# write some data to the file
btrfs subvolume snapshot -r /mnt /mnt/snap1
btrfs send /mnt/snap1 -f /tmp/send.data
The send operation will fail with the following error:
ERROR: send ioctl failed with -116: Stale file handle
What happens here is that our snapshot has an orphan inode still visible
through the commit root, that corresponds to the tmpfile. However send
will attempt to call inode.c:btrfs_iget(), with the goal of reading the
file's data, which will return -ESTALE because it will use the current
root (and not the commit root) of the snapshot.
Of course, there are other cases where we can get orphans, but this
example using a tmpfile makes it much easier to reproduce the issue.
Therefore on snapshot creation, after calling btrfs_orphan_cleanup, if
the commit root is different from the current root, just commit the
transaction associated with the snapshot's root (if it exists), so that
a send will not see any orphans that don't exist anymore. This also
guarantees a send will always see the same content regardless of whether
a transaction commit happened already before the send was requested and
after the orphan cleanup (meaning the commit root and current roots are
the same) or it hasn't happened yet (commit and current roots are
different).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
In ioctl.c:lock_extent_range(), after locking our target range, the
ordered extent that btrfs_lookup_first_ordered_extent() returns us
may not overlap our target range at all. In this case we would just
unlock our target range, wait for any new ordered extents that overlap
the range to complete, lock again the range and repeat all these steps
until we don't get any ordered extent and the delalloc flag isn't set
in the io tree for our target range.
Therefore just stop if we get an ordered extent that doesn't overlap
our target range and the dealalloc flag isn't set for the range in
the inode's io tree.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
When cloning a range of a file, we were visiting all the extent items in
the btree that belong to our source inode. We don't need to visit those
extent items that don't overlap the range we are cloning, as doing so only
makes us waste time and do unnecessary btree navigations (btrfs_next_leaf)
for inodes that have a large number of file extent items in the btree.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
We were setting the BTRFS_ROOT_SUBVOL_DEAD flag on the root of the
parent of our target snapshot, instead of setting it in the target
snapshot's root.
This is easy to observe by running the following scenario:
mkfs.btrfs -f /dev/sdd
mount /dev/sdd /mnt
btrfs subvolume create /mnt/first_subvol
btrfs subvolume snapshot -r /mnt /mnt/mysnap1
btrfs subvolume delete /mnt/first_subvol
btrfs subvolume snapshot -r /mnt /mnt/mysnap2
btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data
The send command failed because the send ioctl returned -EPERM.
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
We were cleaning the clone target file range from the page cache before
we did replace the file extent items in the fs tree. This was racy,
as right after cleaning the relevant range from the page cache and before
replacing the file extent items, a read against that range could be
performed by another task and populate again the page cache with stale
data (stale after the cloning finishes). This would result in reads after
the clone operation successfully finishes to get old data (and potentially
for a very long time). Therefore evict the pages after replacing the file
extent items, so that subsequent reads will always get the new data.
Similarly, we were prone to races while cloning the file extent items
because we weren't locking the target range and wait for any existing
ordered extents against that range to complete. It was possible that
after cloning the extent items, a write operation that was performed
before the clone operation and overlaps the same range, would end up
undoing all or part of the work the clone operation did (a worker task
running inode.c:btrfs_finish_ordered_io). Therefore lock the target
range in the io tree, wait for all pending ordered extents against that
range to finish and then safely perform the cloning.
The issue of reading stale data after the clone operation is easy to
reproduce by running the following C program in a loop until it exits
with return value 1.
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>
#include <fcntl.h>
#include <assert.h>
#include <asm/types.h>
#include <linux/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#define SRC_FILE "/mnt/sdd/foo"
#define DST_FILE "/mnt/sdd/bar"
#define FILE_SIZE (16 * 1024)
#define PATTERN_SRC 'X'
#define PATTERN_DST 'Y'
struct btrfs_ioctl_clone_range_args {
__s64 src_fd;
__u64 src_offset, src_length;
__u64 dest_offset;
};
#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \
struct btrfs_ioctl_clone_range_args)
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int clone_done = 0;
static int reader_ready = 0;
static int stale_data = 0;
static void *reader_loop(void *arg)
{
char buf[4096], want_buf[4096];
memset(want_buf, PATTERN_SRC, 4096);
pthread_mutex_lock(&mutex);
reader_ready = 1;
pthread_mutex_unlock(&mutex);
while (1) {
int done, fd, ret;
fd = open(DST_FILE, O_RDONLY);
assert(fd != -1);
pthread_mutex_lock(&mutex);
done = clone_done;
pthread_mutex_unlock(&mutex);
ret = read(fd, buf, 4096);
assert(ret == 4096);
close(fd);
if (done) {
ret = memcmp(buf, want_buf, 4096);
if (ret == 0) {
printf("Found new content\n");
} else {
printf("Found old content\n");
pthread_mutex_lock(&mutex);
stale_data = 1;
pthread_mutex_unlock(&mutex);
}
break;
}
}
return NULL;
}
int main(int argc, char *argv[])
{
pthread_t reader;
int ret, i, fd;
struct btrfs_ioctl_clone_range_args clone_args;
int fd1, fd2;
ret = remove(SRC_FILE);
if (ret == -1 && errno != ENOENT) {
fprintf(stderr, "Error deleting src file: %s\n", strerror(errno));
return 1;
}
ret = remove(DST_FILE);
if (ret == -1 && errno != ENOENT) {
fprintf(stderr, "Error deleting dst file: %s\n", strerror(errno));
return 1;
}
fd = open(SRC_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);
assert(fd != -1);
for (i = 0; i < FILE_SIZE; i++) {
char c = PATTERN_SRC;
ret = write(fd, &c, 1);
assert(ret == 1);
}
close(fd);
fd = open(DST_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);
assert(fd != -1);
for (i = 0; i < FILE_SIZE; i++) {
char c = PATTERN_DST;
ret = write(fd, &c, 1);
assert(ret == 1);
}
close(fd);
sync();
ret = pthread_create(&reader, NULL, reader_loop, NULL);
assert(ret == 0);
while (1) {
int r;
pthread_mutex_lock(&mutex);
r = reader_ready;
pthread_mutex_unlock(&mutex);
if (r) break;
}
fd1 = open(SRC_FILE, O_RDONLY);
if (fd1 < 0) {
fprintf(stderr, "Error open src file: %s\n", strerror(errno));
return 1;
}
fd2 = open(DST_FILE, O_RDWR);
if (fd2 < 0) {
fprintf(stderr, "Error open dst file: %s\n", strerror(errno));
return 1;
}
clone_args.src_fd = fd1;
clone_args.src_offset = 0;
clone_args.src_length = 4096;
clone_args.dest_offset = 0;
ret = ioctl(fd2, BTRFS_IOC_CLONE_RANGE, &clone_args);
assert(ret == 0);
close(fd1);
close(fd2);
pthread_mutex_lock(&mutex);
clone_done = 1;
pthread_mutex_unlock(&mutex);
ret = pthread_join(reader, NULL);
assert(ret == 0);
pthread_mutex_lock(&mutex);
ret = stale_data ? 1 : 0;
pthread_mutex_unlock(&mutex);
return ret;
}
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
There is otherwise a risk of a possible null pointer dereference.
Was largely found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Chris Mason <clm@fb.com>
We are currently allocating space_info objects in an array when we
allocate space_info. When a user does something like:
# btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt
# btrfs balance start -mconvert=single -dconvert=single /mnt -f
# btrfs balance start -mconvert=raid1 -dconvert=raid1 /
We can end up with memory corruption since the kobject hasn't
been reinitialized properly and the name pointer was left set.
The rationale behind allocating them statically was to avoid
creating a separate kobject container that just contained the
raid type. It used the index in the array to determine the index.
Ultimately, though, this wastes more memory than it saves in all
but the most complex scenarios and introduces kobject lifetime
questions.
This patch allocates the kobjects dynamically instead. Note that
we also remove the kobject_get/put of the parent kobject since
kobject_add and kobject_del do that internally.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
We were limiting the sum of the xattr name and value lengths to PATH_MAX,
which is not correct, specially on filesystems created with btrfs-progs
v3.12 or higher, where the default leaf size is max(16384, PAGE_SIZE), or
systems with page sizes larger than 4096 bytes.
Xattrs have their own specific maximum name and value lengths, which depend
on the leaf size, therefore use these limits to be able to send xattrs with
sizes larger than PATH_MAX.
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we are doing an incremental send and the base snapshot has a
directory with name X that doesn't exist anymore in the second
snapshot and a new subvolume/snapshot exists in the second snapshot
that has the same name as the directory (name X), the incremental
send would fail with -ENOENT error. This is because it attempts
to lookup for an inode with a number matching the objectid of a
root, which doesn't exist.
Steps to reproduce:
mkfs.btrfs -f /dev/sdd
mount /dev/sdd /mnt
mkdir /mnt/testdir
btrfs subvolume snapshot -r /mnt /mnt/mysnap1
rmdir /mnt/testdir
btrfs subvolume create /mnt/testdir
btrfs subvolume snapshot -r /mnt /mnt/mysnap2
btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data
A test case for xfstests follows.
Reported-by: Robert White <rwhite@pobox.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Delayed extent operations are triggered during transaction commits.
The goal is to queue up a healthly batch of changes to the extent
allocation tree and run through them in bulk.
This farms them off to async helper threads. The goal is to have the
bulk of the delayed operations being done in the background, but this is
also important to limit our stack footprint.
Signed-off-by: Chris Mason <clm@fb.com>
__extent_writepage has two unrelated parts. First it does the delayed
allocation dance and second it does the mapping and IO for the page
we're actually writing.
This splits it up into those two parts so the stack from one doesn't
impact the stack from the other.
Signed-off-by: Chris Mason <clm@fb.com>
In these instances, we are trying to determine if a page has been accessed
since we began the operation for the sake of retry. This is easily
accomplished by doing a gang lookup in the page mapping radix tree, and it
saves us the dependency on the flag (so that we might eventually delete
it).
btrfs_page_exists_in_range borrows heavily from find_get_page, replacing
the radix tree look up with a gang lookup of 1, so that we can find the
next highest page >= index and see if it falls into our lock range.
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Alex Gartrell <agartrell@fb.com>
This adds noinline_for_stack to two helpers used by
btree_write_cache_pages. It shaves us down from 424 bytes on the
stack to 280.
Signed-off-by: Chris Mason <clm@fb.com>