Life gets more interesting when directories get large enough to occupy multiple blocks. Let’s take a look at my /etc directory:
[root@localhost hal]# ls -lid /etc
67146849 drwxr-xr-x. 141 root root 8192 May 26 20:37 /etc
The file size is 8192 bytes, or two 4K blocks.
Now we’ll use xfs_db to get more information:
xfs_db> inode 67146849
xfs_db> print
[...]
core.size = 8192
core.nblocks = 3
core.extsize = 0
core.nextents = 3
[...]
u3.bmx[0-2] = [startoff,startblock,blockcount,extentflag]
0:[0,8393423,1,0]
1:[1,8397532,1,0]
2:[8388608,8394766,1,0]
[...]
I’ve removed much of the output here to make things more readable. The directory file is fragmented, requiring multiple single-block extents, which is common for directories in XFS. A directory starts as a single block. Eventually enough files are added to the directory that it needs more than one block to hold all the file entries. But by this time, the blocks immediately following the original directory block have usually been consumed, often by the files that make up the content of the directory. When the directory needs to grow, it typically has to fragment.
What is really interesting about multi-block directories in XFS is that they are sparse files. Looking at the list of extents at the end of the xfs_db output, we see that the first two blocks are at logical block offsets 0 and 1, but the third block is at logical block offset 8388608. What the heck is going on here?
If you recall from our discussion of block directories in the last installment, XFS directories have a hash lookup table at the end for faster searching. When a directory consumes multiple blocks, the hash lookup table and “tail record” move into their own block. For consistency, XFS places this information at logical offset XFS_DIR2_LEAF_OFFSET, which is currently set to 32GB. 32GB divided by our 4K block size gives a logical block offset of 8388608.
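To sanity-check that number, here’s the arithmetic as a couple of lines of Python (the 4K block size comes from the ls and xfs_db output above):

XFS_DIR2_LEAF_OFFSET = 32 * 2**30    # 32GB, per the current XFS headers
BLOCK_SIZE = 4096                    # our file system block size

print(XFS_DIR2_LEAF_OFFSET // BLOCK_SIZE)    # prints: 8388608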
From a file size perspective, we can see that xfs_db agrees with our earlier ls output, saying the directory is 8192 bytes. However, the xfs_db output clearly shows that the directory consumes three blocks, which should give it a file size of 3*4096 = 12288 bytes. Based on my testing, the directory “size” in XFS only counts the blocks that contain directory entries.
We can use xfs_db to examine the directory data blocks in more detail:
xfs_db> addr u3.bmx[0].startblock
xfs_db> print
dhdr.hdr.magic = 0x58444433 ("XDD3")
dhdr.hdr.crc = 0xe3a7892d (correct)
dhdr.hdr.bno = 38872696
dhdr.hdr.lsn = 0x2200007442
dhdr.hdr.uuid = e56c3b41-ca03-4b41-b15c-dd609cb7da71
dhdr.hdr.owner = 67146849
dhdr.bestfree[0].offset = 0x220
dhdr.bestfree[0].length = 0x8
dhdr.bestfree[1].offset = 0x258
dhdr.bestfree[1].length = 0x8
dhdr.bestfree[2].offset = 0x368
dhdr.bestfree[2].length = 0x8
du[0].inumber = 67146849
du[0].namelen = 1
du[0].name = "."
du[0].filetype = 2
du[0].tag = 0x40
du[1].inumber = 64
du[1].namelen = 2
du[1].name = ".."
du[1].filetype = 2
du[1].tag = 0x50
du[2].inumber = 34100330
du[2].namelen = 5
du[2].name = "fstab"
du[2].filetype = 1
du[2].tag = 0x60
du[3].inumber = 67146851
du[3].namelen = 8
du[3].name = "crypttab"
[...]
I’m using the addr command in xfs_db to select the startblock value from the first extent in the array (element zero of the array).
The beginning of this first data block is nearly identical to the block directories we looked at previously. The only difference is that single-block directories use the magic number “XDB3”, while data blocks in multi-block directories use “XDD3”, as we see here. Remember that the value xfs_db labels dhdr.hdr.bno is actually the sector offset of this block and not the block number.
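If you want to verify the decoding of the magic number for yourself, the four bytes of 0x58444433 are just ASCII. A Python one-liner shows this:

print(bytes.fromhex("58444433").decode("ascii"))    # prints: XDD3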
Let’s look at the next data block:
xfs_db> inode 67146849
xfs_db> addr u3.bmx[1].startblock
xfs_db> print
dhdr.hdr.magic = 0x58444433 ("XDD3")
dhdr.hdr.crc = 0xa0dba9dc (correct)
dhdr.hdr.bno = 38905568
dhdr.hdr.lsn = 0x2200007442
dhdr.hdr.uuid = e56c3b41-ca03-4b41-b15c-dd609cb7da71
dhdr.hdr.owner = 67146849
dhdr.bestfree[0].offset = 0xad8
dhdr.bestfree[0].length = 0x20
dhdr.bestfree[1].offset = 0xc18
dhdr.bestfree[1].length = 0x20
dhdr.bestfree[2].offset = 0xd78
dhdr.bestfree[2].length = 0x20
du[0].inumber = 67637117
du[0].namelen = 10
du[0].name = "machine-id"
du[0].filetype = 1
du[0].tag = 0x40
du[1].inumber = 67146855
du[1].namelen = 9
du[1].name = "localtime"
[...]
Again we see the same header information. Note that each data block has its own “free space” array, tracking available space in that data block.
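To make the layout of these headers concrete, here’s a minimal Python sketch that unpacks the fixed portion of a data block like the ones above, including the per-block “best free” array. The 48-byte base header, the big-endian byte order, and the function name are my own assumptions based on the standard v5 on-disk format, not anything xfs_db guarantees:

import struct
import uuid

def parse_dir3_data_hdr(buf):
    """Unpack the fixed header of a v5 XFS directory data block:
    magic, CRC, sector offset, LSN, UUID, and owner inode (48 bytes),
    followed by three 4-byte "best free" entries. Big-endian on disk."""
    magic, crc, blkno, lsn = struct.unpack_from(">4sIQQ", buf, 0)
    fs_uuid = uuid.UUID(bytes=bytes(buf[24:40]))
    (owner,) = struct.unpack_from(">Q", buf, 40)
    bestfree = [struct.unpack_from(">HH", buf, 48 + 4 * i) for i in range(3)]
    return {
        "magic": magic.decode("ascii"),   # "XDD3" for multi-block data blocks
        "crc": hex(crc),
        "bno": blkno,                     # sector offset, not a block number
        "lsn": hex(lsn),
        "uuid": str(fs_uuid),
        "owner": owner,                   # inode number of the directory
        "bestfree": [(hex(off), hex(length)) for off, length in bestfree],
    }

Fed the first 64 bytes of either data block, this should reproduce the dhdr values that xfs_db printed above.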
Finally, we have the block containing the hash lookup table and tail record. We could use xfs_db to decode this block, but it turns out that there are some interesting internal structures to see here. Here’s a breakdown of the start of the block, as seen in a hex editor:
Bytes    Field                                Value
0-3      Forward link                         0
4-7      Backward link                        0
8-9      Magic number                         0x3df1
10-11    Padding                              zeroed
12-15    CRC32                                0xef654461
16-23    Sector offset                        38883440
24-31    Log sequence number of last update   0x2200008720
32-47    UUID                                 e56c3b41-...-dd609cb7da71
48-55    Inode number                         67146849
56-57    Number of entries                    0x0126 = 294
58-59    Unused entries                       1
60-63    Padding for alignment                zeroed
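Here’s a companion Python sketch that decodes this 64-byte leaf block header, following the byte offsets in the table above (again, the big-endian fields and the function name are my assumptions):

import struct
import uuid

def parse_dir3_leaf_hdr(buf):
    """Decode the 64-byte header of a v5 XFS directory leaf block,
    per the byte offsets tabulated above. Big-endian on disk."""
    forw, back, magic, _pad, crc, blkno, lsn = struct.unpack_from(">IIHHIQQ", buf, 0)
    fs_uuid = uuid.UUID(bytes=bytes(buf[32:48]))
    owner, count, stale = struct.unpack_from(">QHH", buf, 48)
    return {
        "forw": forw, "back": back,   # sibling links, zero for a lone leaf
        "magic": hex(magic),          # 0x3df1, not printable ASCII
        "crc": hex(crc),
        "bno": blkno,                 # sector offset of this block
        "lsn": hex(lsn),
        "uuid": str(fs_uuid),
        "owner": owner,               # directory inode number
        "count": count,               # hash table entries (294 here)
        "stale": stale,               # unused entries (1 here)
    }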
The “forward” and “backward” links would come into play if this were a multi-node B+Tree data structure rather than a single block. Unlike previous magic number values, the magic value here (0x3df1) does not correspond to printable ASCII characters.
After the typical XFS header information, there is a two-byte value tracking the number of entries in the directory, and therefore the number of entries in the hash lookup table that follows. The next two bytes tell us that there is one unused entry, typically a record for a deleted file.
We find this unused record near the end of the hash lookup array. The entry starting at block offset 0x840 has an offset value of zero, indicating the entry is unused.
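The hash lookup array itself is just a packed sequence of 8-byte entries, each a 4-byte hash value followed by a 4-byte offset, beginning immediately after the 64-byte header. Here’s a short sketch of scanning it for unused entries, under the same layout assumptions as above:

import struct

def find_unused_entries(buf, count, hdr_size=64):
    """Walk the (hash, offset) pairs that follow the leaf header and
    return the block offsets of entries whose offset value is zero."""
    unused = []
    for i in range(count):
        pos = hdr_size + 8 * i
        hashval, offset = struct.unpack_from(">II", buf, pos)
        if offset == 0:               # zero marks an unused (deleted) entry
            unused.append(hex(pos))
    return unused

For this block, with count = 294, the scan should flag just the entry at 0x840.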
Interestingly, right after the end of the hash lookup array, we see what appears to be the extended attribute information from an inode. This is apparently residual data left over from an earlier use of the block.
At the end of the block is data that tracks free space in the directory.
The last four bytes in the block are the number of blocks containing directory entries– two in this case. Preceding those four bytes is a “best free” array that tracks the length of the largest chunk of free space in each block. You will notice that the array values here correspond to the dhdr.bestfree[0].length values for each block in the xfs_db output above. When new directory entries are added, this array helps the file system locate the best spot to place the new entry.
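Here’s one last Python sketch, decoding the tail of the leaf block under the same assumptions: the final four bytes hold the count of data blocks, immediately preceded by one two-byte “best free” value per block:

import struct

def parse_leaf_tail(block, block_size=4096):
    """Pull the block count and "best free" array from the end of a
    directory leaf block. Big-endian on disk."""
    (bestcount,) = struct.unpack_from(">I", block, block_size - 4)
    bests_offset = block_size - 4 - 2 * bestcount
    bests = struct.unpack_from(">{}H".format(bestcount), block, bests_offset)
    return bestcount, [hex(b) for b in bests]

For our /etc leaf block, this should return 2 along with the two blocks’ bestfree[0].length values, 0x8 and 0x20.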
We see that the two bytes immediately before the “best free” array are identical to the first entry in the array. Did the /etc directory once consume three data blocks and later shrink back to two? Based on limited testing, this appears to be the case. Unlike directories in traditional Unix file systems, which never shrink once blocks have been allocated, XFS directories grow and shrink dynamically as needed.
So far we’ve looked at the three most common directory forms in XFS: small “short form” directories stored in the inode, single-block directories, and, as in this case, multi-block directories tracked with an extent array in the inode. In rare cases, when the directory is very large and very fragmented, the extent array in the inode is insufficient. In these cases, XFS uses a B+Tree to track the extent information. We will examine this scenario in the next installment.