.. SPDX-License-Identifier: GPL-2.0

Layout
------

The layout of a standard block group is approximately as follows (each
of these fields is discussed in a separate section below):

.. list-table::
   :widths: 1 1 1 1 1 1 1 1
   :header-rows: 1

   * - Group 0 Padding
     - ext4 Super Block
     - Group Descriptors
     - Reserved GDT Blocks
     - Data Block Bitmap
     - inode Bitmap
     - inode Table
     - Data Blocks
   * - 1024 bytes
     - 1 block
     - many blocks
     - many blocks
     - 1 block
     - 1 block
     - many blocks
     - many more blocks

For the special case of block group 0, the first 1024 bytes are unused,
to allow for the installation of x86 boot sectors and other oddities.
The superblock will start at offset 1024 bytes, whichever block that
happens to be (usually 0). However, if the block size is 1024 bytes,
then block 0 is marked in use and the superblock goes in block 1.
For all other block groups, there is no padding.
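
The rule above can be checked with a few lines of arithmetic. The sketch
below is only an illustration: the helper name ``superblock_location`` is
hypothetical and not part of any ext4 tool, and it assumes the only input
that matters is the block size in bytes.

```python
SUPERBLOCK_OFFSET = 1024  # bytes from the start of the filesystem


def superblock_location(block_size):
    """Return (block number, byte offset inside that block) of the
    primary superblock for a given block size in bytes."""
    return divmod(SUPERBLOCK_OFFSET, block_size)


for block_size in (1024, 2048, 4096):
    block, offset = superblock_location(block_size)
    print(f"block size {block_size}: block {block}, offset {offset}")
```

With a 1024-byte block size the superblock lands at the start of block 1;
with larger block sizes it sits 1024 bytes into block 0.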

The ext4 driver primarily works with the superblock and the group
descriptors that are found in block group 0. Redundant copies of the
superblock and group descriptors are written to some of the block groups
across the disk in case the beginning of the disk gets trashed, though
not all block groups necessarily host a redundant copy (see following
paragraph for more details). If the group does not have a redundant
copy, the block group begins with the data block bitmap. Note also that
when the filesystem is freshly formatted, mkfs will allocate “reserve
GDT block” space after the block group descriptors and before the start
of the block bitmaps to allow for future expansion of the filesystem. By
default, a filesystem is allowed to increase in size by a factor of
1024x over the original filesystem size.

The location of the inode table is given by ``grp.bg_inode_table_*``. It
is a contiguous range of blocks large enough to contain
``sb.s_inodes_per_group * sb.s_inode_size`` bytes.
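
As a rough illustration of the size formula above, the following sketch
rounds the byte count up to whole blocks. The function name and the sample
values (8192 inodes per group, 256-byte inodes, 4096-byte blocks) are
assumptions chosen for the example, not values read from a real filesystem.

```python
def inode_table_blocks(inodes_per_group, inode_size, block_size):
    """Number of blocks needed to hold one group's inode table."""
    table_bytes = inodes_per_group * inode_size
    # Ceiling division: a partially filled block still occupies a block.
    return -(-table_bytes // block_size)


# Hypothetical geometry, used only to exercise the formula.
print(inode_table_blocks(8192, 256, 4096))
```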

As for the ordering of items in a block group, it is generally
established that the super block and the group descriptor table, if
present, will be at the beginning of the block group. The bitmaps and
the inode table can be anywhere, and it is quite possible for the
bitmaps to come after the inode table, or for both to be in different
groups (flex\_bg). Leftover space is used for file data blocks, indirect
block maps, extent tree blocks, and extended attributes.

Flexible Block Groups
---------------------

Starting in ext4, there is a new feature called flexible block groups
(flex\_bg). In a flex\_bg, several block groups are tied together as one
logical block group; the bitmap spaces and the inode table space in the
first block group of the flex\_bg are expanded to include the bitmaps
and inode tables of all other block groups in the flex\_bg. For example,
if the flex\_bg size is 4, then group 0 will contain (in order) the
superblock, group descriptors, data block bitmaps for groups 0-3, inode
bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
space in group 0 is for file data. The effect of this is to group the
block group metadata close together for faster loading, and to enable
large files to be contiguous on disk. Backup copies of the superblock
and group descriptors are always at the beginning of block groups, even
if flex\_bg is enabled. The number of block groups that make up a
flex\_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
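
The grouping can be sketched as follows. This is a minimal illustration
that assumes the packed layout described above; the function name
``flex_group_info`` is hypothetical, and on a real filesystem the actual
bitmap and inode table locations must still be read from the group
descriptors.

```python
def flex_group_info(group, s_log_groups_per_flex):
    """Return (groups per flex_bg, flex_bg index, first group of that flex_bg)
    for a block group number."""
    groups_per_flex = 1 << s_log_groups_per_flex  # 2 ^ s_log_groups_per_flex
    flex_index = group // groups_per_flex
    first_group = flex_index * groups_per_flex
    return groups_per_flex, flex_index, first_group


# With s_log_groups_per_flex = 2 the flex_bg size is 4, so groups 4-7
# form flex_bg 1 and their metadata is packed into group 4.
print(flex_group_info(6, 2))
```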

Meta Block Groups
-----------------

Without the META\_BG option, for safety reasons, all block group
descriptor copies are kept in the first block group. Given the default
128MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27/64 = 2^21 block groups. This limits the entire
filesystem size to 2^21 ∗ 2^27 = 2^48 bytes or 256TiB.
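
The limit can be reproduced directly from the numbers in the paragraph
above. The variable names below are chosen for this example only.

```python
GROUP_BYTES = 1 << 27   # default block group size: 128 MiB
DESC_BYTES = 64         # size of one group descriptor

# All descriptors must fit inside the first block group.
max_groups = GROUP_BYTES // DESC_BYTES
max_fs_bytes = max_groups * GROUP_BYTES

print(max_groups == 2**21)          # True
print(max_fs_bytes // 2**40, "TiB") # 256 TiB
```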

The solution to this problem is to use the metablock group feature
(META\_BG), which is already in ext3 for all 2.6 releases. With the
META\_BG feature, ext4 filesystems are partitioned into many metablock
groups. Each metablock group is a cluster of block groups whose group
descriptor structures can be stored in a single disk block. For ext4
filesystems with 4 KB block size, a single metablock group partition
includes 64 block groups, or 8 GiB of disk space. The metablock group
feature moves the location of the group descriptors from the congested
first block group of the whole filesystem into the first group of each
metablock group itself. The backups are in the second and last group of
each metablock group. This increases the 2^21 maximum block groups limit
to the hard limit 2^32, allowing support for a 512PiB filesystem.
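
The figures in this paragraph follow from the same kind of arithmetic. The
sketch below assumes 64-byte descriptors and 128 MiB block groups, as in
the preceding text; ``meta_bg_geometry`` is a hypothetical helper name.

```python
def meta_bg_geometry(block_size, desc_size, group_bytes):
    """Return (block groups per metablock group, bytes covered by one
    metablock group)."""
    groups_per_meta_bg = block_size // desc_size  # descriptors in one block
    return groups_per_meta_bg, groups_per_meta_bg * group_bytes


groups, span = meta_bg_geometry(4096, 64, 1 << 27)
print(groups, span // 2**30, "GiB")       # 64 groups, 8 GiB

# With the 2^32 hard limit on block groups:
max_fs_bytes = (1 << 32) * (1 << 27)
print(max_fs_bytes // 2**50, "PiB")       # 512 PiB
```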

The change in the filesystem format replaces the current scheme where
the superblock is followed by a variable-length set of block group
descriptors. Instead, the superblock and a single block group descriptor
block are placed at the beginning of the first, second, and last block
groups in a meta-block group. A meta-block group is a collection of
block groups which can be described by a single block group descriptor
block. Since the size of the block group descriptor structure is 32
bytes, a meta-block group contains 32 block groups for filesystems with
a 1KB block size, and 128 block groups for filesystems with a 4KB
block size. Filesystems can either be created using this new block group
descriptor layout, or existing filesystems can be resized on-line, and
the field s\_first\_meta\_bg in the superblock will indicate the first
block group using this new layout.
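
To make the placement rule concrete, the sketch below maps a block group
number to its meta-block group and lists the three groups that would hold
that meta-block group's descriptor block. It is a simplification under
stated assumptions: the helper name is hypothetical, every group is assumed
to use the META\_BG layout (``s_first_meta_bg`` is ignored), and a short
final meta-block group at the end of the filesystem is not handled.

```python
def meta_bg_descriptor_groups(group, block_size, desc_size=32):
    """Return (meta-block group index, [first, second, last] block groups
    of that meta-block group)."""
    per_meta_bg = block_size // desc_size   # groups described by one block
    meta_index = group // per_meta_bg
    first = meta_index * per_meta_bg
    return meta_index, [first, first + 1, first + per_meta_bg - 1]


# 4KB blocks, 32-byte descriptors: 128 groups per meta-block group.
print(meta_bg_descriptor_groups(200, 4096))
# 1KB blocks, 32-byte descriptors: 32 groups per meta-block group.
print(meta_bg_descriptor_groups(40, 1024))
```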

Please see an important note about ``BLOCK_UNINIT`` in the section about
block and inode bitmaps.

Lazy Block Group Initialization
-------------------------------

A new feature for ext4 is a set of three block group descriptor flags
that enable mkfs to skip initializing other parts of the block group
metadata. Specifically, the INODE\_UNINIT and BLOCK\_UNINIT flags mean
that the inode and block bitmaps for that group can be calculated and
therefore the on-disk bitmap blocks are not initialized. This is
generally the case for an empty block group or a block group containing
only fixed-location block group metadata. The INODE\_ZEROED flag means
that the inode table has been initialized; mkfs will unset this flag and
rely on the kernel to initialize the inode tables in the background.
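
A reader decoding a descriptor's flag field might interpret the three
flags as in the sketch below. The numeric values are intended to match the
kernel's ``EXT4_BG_*`` definitions (0x1, 0x2, 0x4) but are not stated in
this section, so verify them against your headers; the function and key
names are hypothetical.

```python
# Flag values assumed to match EXT4_BG_* in the kernel headers.
EXT4_BG_INODE_UNINIT = 0x1  # inode bitmap not initialized on disk
EXT4_BG_BLOCK_UNINIT = 0x2  # block bitmap not initialized on disk
EXT4_BG_INODE_ZEROED = 0x4  # inode table has been zeroed


def describe_bg_flags(bg_flags):
    """Decode a group descriptor flag field into booleans."""
    return {
        "inode_bitmap_uninit": bool(bg_flags & EXT4_BG_INODE_UNINIT),
        "block_bitmap_uninit": bool(bg_flags & EXT4_BG_BLOCK_UNINIT),
        "inode_table_zeroed": bool(bg_flags & EXT4_BG_INODE_ZEROED),
    }


# An empty group right after mkfs: both bitmaps uninitialized and the
# inode table still waiting for the kernel to zero it.
print(describe_bg_flags(EXT4_BG_INODE_UNINIT | EXT4_BG_BLOCK_UNINIT))
```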

By not writing zeroes to the bitmaps and inode table, mkfs time is
reduced considerably. Note the feature flag is RO\_COMPAT\_GDT\_CSUM,
but the dumpe2fs output prints this as “uninit\_bg”. They are the same
thing.