XFS Part 6 – B+Tree Directories

Look here for earlier posts in this series.

Just to see what would happen, I created a directory containing 5000 files. Let’s start with the inode:

B+Tree Directory

The number of extents (bytes 76-79) is 0x2A, or 42. This is too many extents to fit in an extent array in the inode. The data fork type (byte 5) is 3, which means the data fork is the root of a B+Tree.

The root of the B+Tree starts at byte offset 176 (0x0B0), right after the inode core. The first two bytes are the level of this node in the tree. The value 1 indicates that this is an interior node in the tree, rather than a leaf node. The next two bytes are the number of entries in the arrays which track the nodes below us in the tree– there is only one node and one array entry. Four padding bytes are used to maintain 64-bit alignment.

The rest of the space in the data fork is divided into two arrays for tracking sub-nodes. The first array is made up for four byte logical offset values, tracking where each chunk of file data belongs. The second array is the absolute block address of the node which tracks the extents at the corresponding logical offset. In our case that block isĀ 0x8118e4 = 8460516 (aka relative block 71908 in AG 2), which tracks the extents starting from the start of the file (logical offset zero).

This is a small file system and the absolute block addresses fit in 32 bits. What’s not clear in the documentation is what happens when the file system is large enough to require 64-bit block addresses? More research is needed here.

Let’s examine blockĀ 8460516 which holds the extent information. Here are the first 256 bytes in a hex editor:

B+Tree Directory Leaf

0-3     Magic number                        BMA3
4-5     Level in tree                       0 (leaf node)
6-7     Number of extents                   42
8-15    Left sibling pointer                -1 (NULL)

16-23   Right sibling pointer               -1 (NULL)
24-31   Sector offset of this block         0x02595720 = 39409440

32-39   LSN of last update                  0x200000631b
40-55   UUID                                e56c...da71
56-63   Inode owner of this block           0x022f4d7d = 36654461

64-67   CRC32 of this block                 0x9d14d936
68-71   Padding for 64-bit alignment        zeroed

This node is at level zero in the tree, which means it’s a leaf node containing data. In this case the data is extent structures, and there are 42 of them following the header.

If there were more than one leaf node, the left and right sibling pointers would be used. Since we only have the one leaf, both of these values are set to -1, which is used as a NULL pointer in XFS metadata structures.

As far as decoding the extent structures, it’s easier to use xfs_db:

xfs_db> inode 36654461
xfs_db> addr u3.bmbt.ptrs[1]
xfs_db> print
magic = 0x424d4133
level = 0
numrecs = 42
leftsib = null
rightsib = null
bno = 39409440
lsn = 0x200000631b
uuid = e56c3b41-ca03-4b41-b15c-dd609cb7da71
owner = 36654461
crc = 0x9d14d936 (correct)
recs[1-42] = [startoff,startblock,blockcount,extentflag] 

As we saw in the previous installment, multi-block directories in XFS are sparse files:

  • Starting at logical offset zero, we have extents 1-33 containing the first 35 blocks of the directory file. This is where the directory entries live.
  • Extents 34-41 starting at logical offset 8388608 (XFS_DIR2_LEAF_OFFSET) contain the hash lookup table for finding directory entries.
  • Because the hash lookup table is large enough to require multiple blocks, the “tail record” for the directory moves into its own block tracked by the final extent (extent 42 in our example above). The logical offset for the tail record is 2*XFS_DIR2_LEAF_OFFSET or 16777216.

The Tail Record

0-3      Magic number                       XDF3
4-7      CRC32 checksum                     0xf56e9aba
8-15     Sector offset of this block        22517032

16-23    Last LSN update                    0x200000631b
24-39    UUID                               e56c...da71
40-47    Inode that points to this block    0x022f4d7d

48-51    Starting block offset              0
52-55    Size of array                      35
56-59    Array entries used                 35
60-63    Padding for 64-bit alignment       zeroed

The last two fields describe an array whose elements correspond to the blocks in the hash lookup table for this directory. The array itself follows immediately after the header as shown above. Each element of the array is a two-byte number representing the largest chunk of free space available in each block. In our example, all of the blocks are full (zero free bytes) except for the last block which has at least a 0x0440 = 1088 byte chunk available.

Decoding the Hash Lookup Table

The hash lookup table for this directory is contained in the fifteen blocks starting at logical file offset 8388608. Because the hash lookup table spans multiple blocks, it is also formatted as a B+Tree. The initial block at logical offset 8388608 should be the root of this tree. This block is shown below.

0-3      "Forward" pointer                  0
4-7      "Back" pointer                     0
8-9      Magic number                       0x3ebe
10-11    Padding for alignment              zeroed
12-15    CRC32 checksum                     0x129cf461

16-23    Sector offset of this block        22517064
24-31    LSN of last update                 0x200000631b

32-47    UUID                               e56c...da71

48-55    Parent inode                       0x022f4d7d
56-57    Number of array entries            14
58-59    Level in tree                      1
59-63    Padding for alignment              zeroed

We confirm this is an interior node of a B+Tree by looking at the “Level in tree” value at bytes 58-59– interior nodes have non-zero values here. The “forward” and “back” pointers being zeroed mean there are no other nodes at this level, so we’re sitting at the root of the tree.

The fourteen other blocks that hold the directory entries are tracked by an array here in the root block. Bytes 56-57 track the size of the array, and the array itself starts at byte 64. Each array entry contains a four byte hash value and a four byte logical block offset. The hash value in each array entry is the largest hash value in the given block.

It’s easier to decode these values using xfs_db:

xfs_db> fsblock 4581801
xfs_db> type dir3
xfs_db> p
nhdr.info.hdr.forw = 0
nhdr.info.hdr.back = 0
nhdr.info.hdr.magic = 0x3ebe
nhdr.info.crc = 0x129cf461 (correct)
nhdr.info.bno = 22517064
nhdr.info.lsn = 0x200000631b
nhdr.info.uuid = e56c3b41-ca03-4b41-b15c-dd609cb7da71
nhdr.info.owner = 36654461
nhdr.count = 14
nhdr.level = 1
nbtree[0-13] = [hashval,before]

If you look at the residual data in the block after the hash array, it looks like hash values and block offsets similar to what we’ve seen in previous installments. I speculate that this is residual data from when the hash lookup table was able to fit into a single block. Once the directory grew to a point where the B+Tree was necessary, the new B+Tree root node simply took over this block, leaving a significant amount of residual data in the slack space.

To understand the function of the leaf nodes in the B+Tree, suppose we wanted to find the directory entry for the file “0003_smallfile”. First we can use xfs_db to compute the hash value for this filename:

xfs_db> hash 0003_smallfile

According to the array, that hash value should be in logical block 8388614. We then have to refer back to the list of extents we decoded earlier to discover that this local offset corresponds to block address 8460517 (AG 2, block 71909). Here is the breakdown of that block:

0-3      Forward pointer                    0x800001 = 8388609
4-7      Back pointer                       0x800008 = 8388616
8-9      Magic number                       0x3dff
10-11    Padding for alignment              zeroed
12-15    CRC32 checksum                     0xdb227061

16-23    Sector offset of this block        39409448
24-31    LSN of last update                 0x200000631b

32-47    UUID                               e56c...da71

48-55    Parent inode                       0x022f4d7d
56-57    Number of array entries            0x01a0 = 416
58-59    Unused entries                     0
59-63    Padding for alignment              zeroed

Following the 64-byte header is an array holding the hash lookup structures. Each structure contains a four byte hash value and a four byte offset. The array is sorted by hash value for binary search. Offsets are in 8 byte units.

The has value for “0003_smallfile” was 0xbc07fded. We have to look fairly far down in the array to find the offset for this value:

The offset tells us that the directory entry of “0003_smallfile” should be 0x13 = 19 * 8 = 152 bytes from the start of the directory file. That puts it near the beginning of the first block at logical offset zero.

The Directory Entries

To find the first block of the directory file we need to refer back to the extent list we decoded from the inode at the very start of this article. According to that list, the initial block is 4581802 (AG 1, block 387498). Let’s take a closer look at this block:

0-3      Magic number                       XDD3
4-7      CRC32 checksum                     0xaf173b31
8-15     Sector offset to this block        22517072

16-23    LSN of last update                 0x200000631b
24-39    UUID                               e56c...da71
40-47    Parent inode                       0x022f4d7d

Bytes 48-59 are a three element array indicating where there is available free space in this directory. Each array element is a 2 byte offset (in bytes) to the free space and a 2 byte length (in bytes). There is no free space in this block, so all array entries are zeroed. Bytes 60-63 are padding for alignment.

Following this header are variable length directory entries defined as follows:

     Len (bytes)     Field
     ===========     ======
       8             absolute inode number
       1             file name length (bytes)
       varies        file name
       1             file type
       varies        padding as necessary for 64bit alignment
       2             offset to beginning of this entry

Here is the decoding of the directory entries shown above:

    Inode        Len    File Name         Type    Offset
    =====        ===    =========         ====    ========
    0x022f4d7d    1     .                 2       0x0040
    0x04159fa1    2     ..                2       0x0050
    0x022f4d7e   14     0001_smallfile    1       0x0060
    0x022f4d7f   12     0002_bigfile      1       0x0080
    0x022f4d80   14     0003_smallfile    1       0x0098
    0x022f4d81   12     0004_bigfile      1       0x00B8

File type bytes are as described in Part Three of this series (1 is a regular file, 2 is a directory). Note that the starting offset of the “0003_smallfile” entry is 152 bytes (0x0098), exactly as the hash table lookup told us.

What Happens Upon Deletion?

Let’s see what happens when we delete “0003_smallfile”. When doing this sort of testing, always be careful to force the file system cache to flush to disk before busting out the trusty hex editor:

# rm -f /root/dir-testing/bigdir/0003_smallfile
# sync; echo 3 > /proc/sys/vm/drop_caches

The mtime and ctime in the directory inode are set to the deletion time of “0003_smallfile”. The LSN and CRC32 checksum in the inode are also updated.

The removal of a single file is typically not a big enough event to modify the size of the directory. In this case, neither the extent tree root or leaf block changes. We would have to purge a significant number of files the impact this data.

However, the “tail record” for the directory is impacted by the file deletion.

The CRC32 checksum and LSN (now highlighted in red) are updated. Also the free space array now shows 0x20 = 32 bytes free in the first block.

Again, a single file deletion is not significant enough to impact the root of the hash B+Tree. However, one of the leaf nodes does register the change.

Again we see updates to the CRC32 checksum and LSN fields. The “Unused entries” field for the hash array now shows one unused entry. Looking farther down in the block, we find the unused entry for our hash 0xbc07fded. The offset is zeroed to indicate this entry is unused. We saw similar behavior in other block-based directories in previous installments of this series.

Changes to the directory entries are also similar to the behavior we’ve seen previously for block-based directory files:

Again we see the usual CRC32 and LSN updates. But now the free space array starting at byte 48 shows 0x0020 = 32 bytes free at offset 0x0098 = 152. The first two bytes of the inode field in this directory are overwritten with 0xFFFF to indicate the unused space, and the next two bytes indicate 0x0020 = 32 bytes of free space. However, since the inodes in this file system fit in 32 bits, the original inode number for the file is still fully visible and the file could potentially be recovered using residual data in the inode.

Wrapping Up and Planning for the Future

This post concludes my walk-through of the major on-disk data structures in the XFS file system. If anything was unclear or you want more detailed explanations in any area, feel free to reach me through the comments or via any of my social media accounts.

The colorized hex dumps that appear in these posts where made with a combination of Synalize It! and Hexinator. Along the way I created “grammar” files that you can use to produce similar colored breakdowns on your own XFS data structures.

I have multiple pages of research questions that came up as I was working through this series. But what I’m most interested in at the moment is the process of recovering deleted data from XFS file systems. This is what I will be looking at in upcoming posts.