XFS (Part 5) – Multi-Block Directories

Life gets more interesting when directories get large enough to occupy multiple blocks. Let’s take a look at my /etc directory:

[root@localhost hal]# ls -lid /etc
67146849 drwxr-xr-x. 141 root root 8192 May 26 20:37 /etc

The file size is 8192 bytes, or two 4K blocks.

Now we’ll use xfs_db to get more information:

xfs_db> inode 67146849
xfs_db> print
core.size = 8192
core.nblocks = 3
core.extsize = 0
core.nextents = 3
u3.bmx[0-2] = [startoff,startblock,blockcount,extentflag] 

I’ve removed much of the output here to make things more readable. The directory file is fragmented, requiring multiple single-block extents, which is common for directories in XFS. The directory would start as a single block. Eventually enough files will be added to the directory that it needs more than one block to hold all the file entries. But by this time, the blocks immediately following the original directory block have been consumed– often by the files which make up the content of the directory. When the directory needs to grow, it typically has to fragment.

What is really interesting about multi-block directories in XFS is that they are sparse files. Looking at the list of extents at the end of the xfs_db output, we see that the first two blocks are at logical block offsets 0 and 1, but the third block is at logical block offset 8388608. What the heck is going on here?

If you recall from our discussion of block directories in the last installment, XFS directories have a hash lookup table at the end for faster searching. When a directory consumes multiple blocks, the hash lookup table and “tail record” move into their own block. For consistency, XFS places this information at logical offset XFS_DIR2_LEAF_OFFSET, which is currently set to 32GB. 32GB divided by our 4K block size gives a logical block offset of 8388608.

From a file size perspective, we can see that xfs_db agrees with our earlier ls output, saying the directory is 8192 bytes. However, the xfs_db output clearly shows that the directory consumes three blocks, which should give it a file size of 3*4096 = 12288 bytes. Based on my testing, the directory “size” in XFS only counts the blocks that contain directory entries.

We can use xfs_db to examine the directory data blocks in more detail:

xfs_db> addr u3.bmx[0].startblock
xfs_db> print
dhdr.hdr.magic = 0x58444433 ("XDD3")
dhdr.hdr.crc = 0xe3a7892d (correct)
dhdr.hdr.bno = 38872696
dhdr.hdr.lsn = 0x2200007442
dhdr.hdr.uuid = e56c3b41-ca03-4b41-b15c-dd609cb7da71
dhdr.hdr.owner = 67146849
dhdr.bestfree[0].offset = 0x220
dhdr.bestfree[0].length = 0x8
dhdr.bestfree[1].offset = 0x258
dhdr.bestfree[1].length = 0x8
dhdr.bestfree[2].offset = 0x368
dhdr.bestfree[2].length = 0x8
du[0].inumber = 67146849
du[0].namelen = 1
du[0].name = "."
du[0].filetype = 2
du[0].tag = 0x40
du[1].inumber = 64
du[1].namelen = 2
du[1].name = ".."
du[1].filetype = 2
du[1].tag = 0x50
du[2].inumber = 34100330
du[2].namelen = 5
du[2].name = "fstab"
du[2].filetype = 1
du[2].tag = 0x60
du[3].inumber = 67146851
du[3].namelen = 8
du[3].name = "crypttab"

I’m using the addr command in xfs_db to select the startblock value from the first extent in the array (the zero element of the array).

The beginning of this first data block is nearly identical to the block directories we looked at previously. The only difference is that single block directories have a magic number “XDB3”, while data blocks in multi-block directories use “XDD3” as we see here. Remember that the value that xfs_db lobels dhdr.hdr.bno is actually the sector offset to this block and not the block number.

Let’s look at the next data block:

xfs_db> inode 67146849
xfs_db> addr u3.bmx[1].startblock
xfs_db> print
dhdr.hdr.magic = 0x58444433 ("XDD3")
dhdr.hdr.crc = 0xa0dba9dc (correct)
dhdr.hdr.bno = 38905568
dhdr.hdr.lsn = 0x2200007442
dhdr.hdr.uuid = e56c3b41-ca03-4b41-b15c-dd609cb7da71
dhdr.hdr.owner = 67146849
dhdr.bestfree[0].offset = 0xad8
dhdr.bestfree[0].length = 0x20
dhdr.bestfree[1].offset = 0xc18
dhdr.bestfree[1].length = 0x20
dhdr.bestfree[2].offset = 0xd78
dhdr.bestfree[2].length = 0x20
du[0].inumber = 67637117
du[0].namelen = 10
du[0].name = "machine-id"
du[0].filetype = 1
du[0].tag = 0x40
du[1].inumber = 67146855
du[1].namelen = 9
du[1].name = "localtime"

Again we see the same header information. Note that each data block has it’s own “free space” array, tracking available space in that data block.

Finally, we have the block containing the hash lookup table and tail record. We could use xfs_db to decode this block, but it turns out that there are some interesting internal structures to see here. Here’s the hex editor view of the start of the block:

Extent Directory Tail Block:

0-3     Forward link                        0
4-7     Backward link                       0
8-9     Magic number                        0x3df1
10-11   Padding                             zeroed
12-15   CRC32                               0xef654461

16-23   Sector offset                       38883440
24-31   Log seq number last update          0x2200008720
32-47   UUID                                e56c3b41-...-dd609cb7da71
48-55   Inode number                        67146849

56-57   Number of entries                   0x0126 = 294
58-59   Unused entries                      1
60-63   Padding for alignment               zeroed

The “forward” and “backward” links would come into play if this were a multi-node B+Tree data structure rather than a single block. Unlike previous magic number values, the magic value here (0x3df1) does not correspond to printable ASCII characters.

After the typical XFS header information, there is a two-byte value tracking the number of entries in the directory, and therefore the number of entries in the hash lookup table that follows. The next two bytes tell us that there is one unused entry– typically a record for a deleted file.

We find this unused record near the end of the hash lookup array. The entry starting at block offset 0x840 has an offset value of zero, indicating the entry is unused:

Extent Directory Tail block 0x820

Interestingly, right after the end of the hash lookup array, we see what appears to be the extended attribute information from an inode. This is apparently residual data left over from an earlier use of the block.

At the end of the block is data which tracks free space in the directory:

Extent Directory Tail Block 0xFFF

The last four bytes in the block are the number of blocks containing directory entries– two in this case. Preceding those four bytes is a “best free” array that tracks the length of the largest chunk of free space in each block. You will notice that the array values here correspond to the dhdr.bestfree[0].length values for each block in the xfs_db output above. When new directory entries are added, this array helps the file system locate the best spot to place the new entry.

We see the two bytes immediately before the “best free” array are identical to the first entry in the array. Did the /etc directory once consume three blocks and later shrink back to two? Based on limited testing, this appears to be the case. Unlike directories in traditional Unix file systems, which never shrink once blocks have been allocated, XFS directories will grow and shrink dynamically as needed.

So far we’ve looked at the three most common directory types in XFS: small “short form” directories stored in the inode, single block directories, and in this case a multi-block directories tracked with an extent array in the inode. In rare cases, when the directory is very large and very fragmented, the extent array in the inode is insufficient. In these cases, XFS uses a B+Tree to track the extent information. We will examine this scenario in the next installment.


XFS (Part 4) – Block Directories

In the previous installment, we looked at small directories stored in “short form” in the inode. While these small directories can make up as much as 90% of the total directories in a typical Linux file system, eventually directories get big enough that they can no longer be packed into the inode data fork. When this happens, directory data moves out to blocks on disk.

In the inode, the data fork type (byte 5) changes to indicate that the data is no longer stored within the inode. Extents are used to track the location of the disk blocks containing the directory data. Here is the inode core and extent list for a directory that only occupies a single block:

Inode detail for directory occupying a single block

The data fork type is 2, indicating an extent list follows the inode core. Bytes 76-79 indicate that there is only a single extent. The extent starts at byte 176 (0x0B0), immediately after the inode core. The last 21 bits of the extent structure show that the extent only contains a single block. Parsing the rest of the extent yields a block address of  0x8118e7, or relative block 71911 in AG 2.

We can extract this block and examine it in our hex editor. Here is the data in the beginning of the block:

Block Directory Header and Entries

The directory block begins with a 48 byte header:

0-3      Magic number                       XDB3
4-7      CRC32 checksum                     0xaf6a416d
8-15     Sector offset of this block        39409464

16-23    Last LSN update                    0x20000061fe
24-39    UUID                               e56c3b41-...-dd609cb7da71
40-47    inode that points to this block    0x0408e66d

You may compare the UUID and inode values in the directory block header with the corresponding values in the inode to see that they match.

The XFS documentation describes the sector offset field as the “block number”. However, using the formula from Part 1 of this series, we can calculate the physical block number of this block as:

(AG number) * (blocks per AG) + (relative block offset)
     2      *    2427136      +         71911   =   4926183

Multiply the block offset 4926183 by 8 sectors per block to get the sector offset value 39409464 that we see in the directory block header.

Following the header is a “free space” array that consumes 12 bytes, plus 4 bytes of padding to preserve 64-bit alignment. The free space array contains three elements which indicate where the three largest chunks of unused space are located in this directory block. Each element is a 2 byte offset and a 2 byte length field. The elements of the array are sorted in descending order by the length of each chunk.

In this directory block, there is only a single chunk of free space, starting at offset 1296 (0x0510) and having 2376 bytes (0x0948) of space. The other elements of the free space array are zeroed, indicating no other free space is available.

The directory entries start at byte 64 (0x040) and can be read sequentially like a typical Unix directory. However, XFS uses a hash-based lookup table, growing up from the bottom of the directory block, for more efficient searching:

Block Directory Tail Record and Hash Array

The last 8 bytes of the directory block are a “tail record” containing two 4 byte values: the number of directory entries (0x34 or 52) and the number of unused entries (zero). Immediately preceding the tail record will be an array of 8 byte records, one record per directory entry (52 records in this case). Each record contains a hash value computed from the file name, and the offset in the directory block where the directory entry for that file is located. The array is sorted by hash value so that binary search can quickly find the desired record. The offsets are in 8 byte units.

The xfs_db program can compute hash values for us:

xfs_db> hash 03_smallfile

If we locate this hash value in the array, we see the byte offset value is 0x12 or 18. Since the offset units are 8 bytes, this translates to byte offset 144 (0x090) from the start of the directory block.

Here are the first six directory entries from this block, including the entry for “03_smallfile”:

Directory Entry Detail

Directory entries are variable length, but always 8 byte (64-bit) aligned. The fields in each directory entry are:

     Len (bytes)       Field
     ===========       =====
          8            Inode number
          1            File name length
          varies       File name
          1            File type
          varies       Padding for alignment
          2            Byte offset of this directory entry

64-bit inode addresses are always used. This is different from “short form” directories, where 32-bit inode addresses will be used if possible.

File name length is a single byte, limiting file names to 255 characters. The file type byte uses the same numbering scheme we saw in “short form” directories:

    1   Regular file
    2   Directory
    3   Character special device
    4   Block special device
    5   FIFO
    6   Socket
    7   Symlink

Padding for alignment is only included if necessary. Our “03_smallfile” entry starting at offset 0x090 is exactly 24 bytes long and needs no padding for alignment. You can clearly see the padding in the “.” and “..” entries starting at offset 0x040 and 0x050 respectively.

Deleting a File

If we remove “03_smallfile” from this directory, the inode updates similarly to what we saw with the “short form” directory in the last installment of this series. The mtime and ctime values are updated, and the CRC32 and Logfile Sequence Number fields as well. The file size does not change, since the directory still occupies one block.

The “tail record” and hash array at the end of the directory block change:

Tail record and hash array post delete

The tail record still shows 34 entries, but one of them is now unused. If we look at the entry for hash 0x3F07FDEC, we see the offset value has been zeroed, indicating an unused record.

We also see changes at the beginning of the block:

Directory entry detail post delete

The free space array now uses the second element, showing 24 (0x18) bytes free at byte offset 0x90– the location where the “03_smallfile” entry used to reside.

Looking at offset 0x90, we see that the first two bytes of the inode field are overwritten with 0xFFFF, indicating an unused entry. The next two bytes are the length of the free space. Again we see 0x18, or 24 bytes.

However, since inode addresses in this file system fit in 32 bits, the original inode address associated with this file is still clearly visible. The rest of the original directory entry is untouched until a new entry overwrites this space. This should make file recovery easier.

Not Quite Done With Directories

When directories get large enough to occupy multiple blocks, the directory structure gets more complicated. We’ll examine larger directories in our next installment.