XFS (Part 5) – Multi-Block Directories

Life gets more interesting when directories get large enough to occupy multiple blocks. Let’s take a look at my /etc directory:

[root@localhost hal]# ls -lid /etc
67146849 drwxr-xr-x. 141 root root 8192 May 26 20:37 /etc

The file size is 8192 bytes, or two 4K blocks.

Now we’ll use xfs_db to get more information:

xfs_db> inode 67146849
xfs_db> print
core.size = 8192
core.nblocks = 3
core.extsize = 0
core.nextents = 3
u3.bmx[0-2] = [startoff,startblock,blockcount,extentflag] 

I’ve removed much of the output here to make things more readable. The directory file is fragmented, requiring multiple single-block extents, which is common for directories in XFS. The directory would start as a single block. Eventually enough files will be added to the directory that it needs more than one block to hold all the file entries. But by this time, the blocks immediately following the original directory block have been consumed– often by the files which make up the content of the directory. When the directory needs to grow, it typically has to fragment.

What is really interesting about multi-block directories in XFS is that they are sparse files. Looking at the list of extents at the end of the xfs_db output, we see that the first two blocks are at logical block offsets 0 and 1, but the third block is at logical block offset 8388608. What the heck is going on here?

If you recall from our discussion of block directories in the last installment, XFS directories have a hash lookup table at the end for faster searching. When a directory consumes multiple blocks, the hash lookup table and “tail record” move into their own block. For consistency, XFS places this information at logical offset XFS_DIR2_LEAF_OFFSET, which is currently set to 32GB. 32GB divided by our 4K block size gives a logical block offset of 8388608.

From a file size perspective, we can see that xfs_db agrees with our earlier ls output, saying the directory is 8192 bytes. However, the xfs_db output clearly shows that the directory consumes three blocks, which should give it a file size of 3*4096 = 12288 bytes. Based on my testing, the directory “size” in XFS only counts the blocks that contain directory entries.

We can use xfs_db to examine the directory data blocks in more detail:

xfs_db> addr u3.bmx[0].startblock
xfs_db> print
dhdr.hdr.magic = 0x58444433 ("XDD3")
dhdr.hdr.crc = 0xe3a7892d (correct)
dhdr.hdr.bno = 38872696
dhdr.hdr.lsn = 0x2200007442
dhdr.hdr.uuid = e56c3b41-ca03-4b41-b15c-dd609cb7da71
dhdr.hdr.owner = 67146849
dhdr.bestfree[0].offset = 0x220
dhdr.bestfree[0].length = 0x8
dhdr.bestfree[1].offset = 0x258
dhdr.bestfree[1].length = 0x8
dhdr.bestfree[2].offset = 0x368
dhdr.bestfree[2].length = 0x8
du[0].inumber = 67146849
du[0].namelen = 1
du[0].name = "."
du[0].filetype = 2
du[0].tag = 0x40
du[1].inumber = 64
du[1].namelen = 2
du[1].name = ".."
du[1].filetype = 2
du[1].tag = 0x50
du[2].inumber = 34100330
du[2].namelen = 5
du[2].name = "fstab"
du[2].filetype = 1
du[2].tag = 0x60
du[3].inumber = 67146851
du[3].namelen = 8
du[3].name = "crypttab"

I’m using the addr command in xfs_db to select the startblock value from the first extent in the array (the zero element of the array).

The beginning of this first data block is nearly identical to the block directories we looked at previously. The only difference is that single block directories have a magic number “XDB3”, while data blocks in multi-block directories use “XDD3” as we see here. Remember that the value that xfs_db lobels dhdr.hdr.bno is actually the sector offset to this block and not the block number.

Let’s look at the next data block:

xfs_db> inode 67146849
xfs_db> addr u3.bmx[1].startblock
xfs_db> print
dhdr.hdr.magic = 0x58444433 ("XDD3")
dhdr.hdr.crc = 0xa0dba9dc (correct)
dhdr.hdr.bno = 38905568
dhdr.hdr.lsn = 0x2200007442
dhdr.hdr.uuid = e56c3b41-ca03-4b41-b15c-dd609cb7da71
dhdr.hdr.owner = 67146849
dhdr.bestfree[0].offset = 0xad8
dhdr.bestfree[0].length = 0x20
dhdr.bestfree[1].offset = 0xc18
dhdr.bestfree[1].length = 0x20
dhdr.bestfree[2].offset = 0xd78
dhdr.bestfree[2].length = 0x20
du[0].inumber = 67637117
du[0].namelen = 10
du[0].name = "machine-id"
du[0].filetype = 1
du[0].tag = 0x40
du[1].inumber = 67146855
du[1].namelen = 9
du[1].name = "localtime"

Again we see the same header information. Note that each data block has it’s own “free space” array, tracking available space in that data block.

Finally, we have the block containing the hash lookup table and tail record. We could use xfs_db to decode this block, but it turns out that there are some interesting internal structures to see here. Here’s the hex editor view of the start of the block:

Extent Directory Tail Block:

0-3     Forward link                        0
4-7     Backward link                       0
8-9     Magic number                        0x3df1
10-11   Padding                             zeroed
12-15   CRC32                               0xef654461

16-23   Sector offset                       38883440
24-31   Log seq number last update          0x2200008720
32-47   UUID                                e56c3b41-...-dd609cb7da71
48-55   Inode number                        67146849

56-57   Number of entries                   0x0126 = 294
58-59   Unused entries                      1
60-63   Padding for alignment               zeroed

The “forward” and “backward” links would come into play if this were a multi-node B+Tree data structure rather than a single block. Unlike previous magic number values, the magic value here (0x3df1) does not correspond to printable ASCII characters.

After the typical XFS header information, there is a two-byte value tracking the number of entries in the directory, and therefore the number of entries in the hash lookup table that follows. The next two bytes tell us that there is one unused entry– typically a record for a deleted file.

We find this unused record near the end of the hash lookup array. The entry starting at block offset 0x840 has an offset value of zero, indicating the entry is unused:

Extent Directory Tail block 0x820

Interestingly, right after the end of the hash lookup array, we see what appears to be the extended attribute information from an inode. This is apparently residual data left over from an earlier use of the block.

At the end of the block is data which tracks free space in the directory:

Extent Directory Tail Block 0xFFF

The last four bytes in the block are the number of blocks containing directory entries– two in this case. Preceding those four bytes is a “best free” array that tracks the length of the largest chunk of free space in each block. You will notice that the array values here correspond to the dhdr.bestfree[0].length values for each block in the xfs_db output above. When new directory entries are added, this array helps the file system locate the best spot to place the new entry.

We see the two bytes immediately before the “best free” array are identical to the first entry in the array. Did the /etc directory once consume three blocks and later shrink back to two? Based on limited testing, this appears to be the case. Unlike directories in traditional Unix file systems, which never shrink once blocks have been allocated, XFS directories will grow and shrink dynamically as needed.

So far we’ve looked at the three most common directory types in XFS: small “short form” directories stored in the inode, single block directories, and in this case a multi-block directories tracked with an extent array in the inode. In rare cases, when the directory is very large and very fragmented, the extent array in the inode is insufficient. In these cases, XFS uses a B+Tree to track the extent information. We will examine this scenario in the next installment.