Look here for earlier posts in this series.
Just to see what would happen, I created a directory containing 5000 files. Let’s start with the inode:
The number of extents (bytes 76-79) is 0x2A, or 42. This is too many extents to fit in an extent array in the inode. The data fork type (byte 5) is 3, which means the data fork is the root of a B+Tree.
The root of the B+Tree starts at byte offset 176 (0x0B0), right after the inode core. The first two bytes are the level of this node in the tree. The value 1 indicates that this is an interior node in the tree, rather than a leaf node. The next two bytes are the number of entries in the arrays which track the nodes below us in the tree– there is only one node and one array entry. Four padding bytes are used to maintain 64-bit alignment.
The rest of the space in the data fork is divided into two arrays for tracking sub-nodes. The first array is made up for four byte logical offset values, tracking where each chunk of file data belongs. The second array is the absolute block address of the node which tracks the extents at the corresponding logical offset. In our case that block is 0x8118e4 = 8460516 (aka relative block 71908 in AG 2), which tracks the extents starting from the start of the file (logical offset zero).
This is a small file system and the absolute block addresses fit in 32 bits. What’s not clear in the documentation is what happens when the file system is large enough to require 64-bit block addresses? More research is needed here.
Let’s examine block 8460516 which holds the extent information. Here are the first 256 bytes in a hex editor:
0-3 Magic number BMA3 4-5 Level in tree 0 (leaf node) 6-7 Number of extents 42 8-15 Left sibling pointer -1 (NULL) 16-23 Right sibling pointer -1 (NULL) 24-31 Sector offset of this block 0x02595720 = 39409440 32-39 LSN of last update 0x200000631b 40-55 UUID e56c...da71 56-63 Inode owner of this block 0x022f4d7d = 36654461 64-67 CRC32 of this block 0x9d14d936 68-71 Padding for 64-bit alignment zeroed
This node is at level zero in the tree, which means it’s a leaf node containing data. In this case the data is extent structures, and there are 42 of them following the header.
If there were more than one leaf node, the left and right sibling pointers would be used. Since we only have the one leaf, both of these values are set to -1, which is used as a NULL pointer in XFS metadata structures.
As far as decoding the extent structures, it’s easier to use xfs_db:
xfs_db> inode 36654461 xfs_db> addr u3.bmbt.ptrs xfs_db> print magic = 0x424d4133 level = 0 numrecs = 42 leftsib = null rightsib = null bno = 39409440 lsn = 0x200000631b uuid = e56c3b41-ca03-4b41-b15c-dd609cb7da71 owner = 36654461 crc = 0x9d14d936 (correct) recs[1-42] = [startoff,startblock,blockcount,extentflag] 1:[0,4581802,1,0] 2:[1,4581800,1,0] 3:[2,4581799,1,0] 4:[3,4581798,1,0] 5:[4,4581794,1,0] 6:[5,4581793,1,0] 7:[6,4581791,1,0] 8:[7,4581790,1,0] 9:[8,4581789,1,0] 10:[9,4581787,1,0] 11:[10,4581786,1,0] 12:[11,4582219,1,0] 13:[12,4582236,1,0] 14:[13,4587210,1,0] 15:[14,4688117,3,0] 16:[17,4695931,1,0] 17:[18,4695948,1,0] 18:[19,4701245,1,0] 19:[20,4703737,1,0] 20:[21,4706394,1,0] 21:[22,4711526,1,0] 22:[23,4714191,1,0] 23:[24,4721971,1,0] 24:[25,4729743,1,0] 25:[26,4740155,1,0] 26:[27,4742820,1,0] 27:[28,4745312,1,0] 28:[29,4747961,1,0] 29:[30,4753101,1,0] 30:[31,4761038,1,0] 31:[32,4768818,1,0] 32:[33,4776747,1,0] 33:[34,4797727,1,0] 34:[8388608,4581801,1,0] 35:[8388609,4581796,1,0] 36:[8388610,4581795,1,0] 37:[8388611,4581792,1,0] 38:[8388612,4581788,1,0] 39:[8388613,8459337,1,0] 40:[8388614,8460517,2,0] 41:[8388616,8682827,7,0] 42:[16777216,4581797,1,0]
As we saw in the previous installment, multi-block directories in XFS are sparse files:
- Starting at logical offset zero, we have extents 1-33 containing the first 35 blocks of the directory file. This is where the directory entries live.
- Extents 34-41 starting at logical offset 8388608 (XFS_DIR2_LEAF_OFFSET) contain the hash lookup table for finding directory entries.
- Because the hash lookup table is large enough to require multiple blocks, the “tail record” for the directory moves into its own block tracked by the final extent (extent 42 in our example above). The logical offset for the tail record is 2*XFS_DIR2_LEAF_OFFSET or 16777216.
The Tail Record
0-3 Magic number XDF3 4-7 CRC32 checksum 0xf56e9aba 8-15 Sector offset of this block 22517032 16-23 Last LSN update 0x200000631b 24-39 UUID e56c...da71 40-47 Inode that points to this block 0x022f4d7d 48-51 Starting block offset 0 52-55 Size of array 35 56-59 Array entries used 35 60-63 Padding for 64-bit alignment zeroed
The last two fields describe an array whose elements correspond to the blocks in the hash lookup table for this directory. The array itself follows immediately after the header as shown above. Each element of the array is a two-byte number representing the largest chunk of free space available in each block. In our example, all of the blocks are full (zero free bytes) except for the last block which has at least a 0x0440 = 1088 byte chunk available.
Decoding the Hash Lookup Table
The hash lookup table for this directory is contained in the fifteen blocks starting at logical file offset 8388608. Because the hash lookup table spans multiple blocks, it is also formatted as a B+Tree. The initial block at logical offset 8388608 should be the root of this tree. This block is shown below.
0-3 "Forward" pointer 0 4-7 "Back" pointer 0 8-9 Magic number 0x3ebe 10-11 Padding for alignment zeroed 12-15 CRC32 checksum 0x129cf461 16-23 Sector offset of this block 22517064 24-31 LSN of last update 0x200000631b 32-47 UUID e56c...da71 48-55 Parent inode 0x022f4d7d 56-57 Number of array entries 14 58-59 Level in tree 1 59-63 Padding for alignment zeroed
We confirm this is an interior node of a B+Tree by looking at the “Level in tree” value at bytes 58-59– interior nodes have non-zero values here. The “forward” and “back” pointers being zeroed mean there are no other nodes at this level, so we’re sitting at the root of the tree.
The fourteen other blocks that hold the directory entries are tracked by an array here in the root block. Bytes 56-57 track the size of the array, and the array itself starts at byte 64. Each array entry contains a four byte hash value and a four byte logical block offset. The hash value in each array entry is the largest hash value in the given block.
It’s easier to decode these values using xfs_db:
xfs_db> fsblock 4581801 xfs_db> type dir3 xfs_db> p nhdr.info.hdr.forw = 0 nhdr.info.hdr.back = 0 nhdr.info.hdr.magic = 0x3ebe nhdr.info.crc = 0x129cf461 (correct) nhdr.info.bno = 22517064 nhdr.info.lsn = 0x200000631b nhdr.info.uuid = e56c3b41-ca03-4b41-b15c-dd609cb7da71 nhdr.info.owner = 36654461 nhdr.count = 14 nhdr.level = 1 nbtree[0-13] = [hashval,before] 0:[0x4863b0b3,8388610] 1:[0x4a63d132,8388619] 2:[0x4c63d1f3,8388615] 3:[0x4e63f070,8388622] 4:[0x9c46fd6d,8388612] 5:[0xa446fd6d,8388621] 6:[0xac46fd6d,8388618] 7:[0xb446fd6d,8388616] 8:[0xbc275ded,8388614] 9:[0xbc777c6d,8388609] 10:[0xc463d170,8388611] 11:[0xc863f170,8388620] 12:[0xcc63d072,8388613] 13:[0xce63f377,8388617]
If you look at the residual data in the block after the hash array, it looks like hash values and block offsets similar to what we’ve seen in previous installments. I speculate that this is residual data from when the hash lookup table was able to fit into a single block. Once the directory grew to a point where the B+Tree was necessary, the new B+Tree root node simply took over this block, leaving a significant amount of residual data in the slack space.
To understand the function of the leaf nodes in the B+Tree, suppose we wanted to find the directory entry for the file “0003_smallfile”. First we can use xfs_db to compute the hash value for this filename:
xfs_db> hash 0003_smallfile 0xbc07fded
According to the array, that hash value should be in logical block 8388614. We then have to refer back to the list of extents we decoded earlier to discover that this local offset corresponds to block address 8460517 (AG 2, block 71909). Here is the breakdown of that block:
0-3 Forward pointer 0x800001 = 8388609 4-7 Back pointer 0x800008 = 8388616 8-9 Magic number 0x3dff 10-11 Padding for alignment zeroed 12-15 CRC32 checksum 0xdb227061 16-23 Sector offset of this block 39409448 24-31 LSN of last update 0x200000631b 32-47 UUID e56c...da71 48-55 Parent inode 0x022f4d7d 56-57 Number of array entries 0x01a0 = 416 58-59 Unused entries 0 59-63 Padding for alignment zeroed
Following the 64-byte header is an array holding the hash lookup structures. Each structure contains a four byte hash value and a four byte offset. The array is sorted by hash value for binary search. Offsets are in 8 byte units.
The has value for “0003_smallfile” was 0xbc07fded. We have to look fairly far down in the array to find the offset for this value:
The offset tells us that the directory entry of “0003_smallfile” should be 0x13 = 19 * 8 = 152 bytes from the start of the directory file. That puts it near the beginning of the first block at logical offset zero.
The Directory Entries
To find the first block of the directory file we need to refer back to the extent list we decoded from the inode at the very start of this article. According to that list, the initial block is 4581802 (AG 1, block 387498). Let’s take a closer look at this block:
0-3 Magic number XDD3 4-7 CRC32 checksum 0xaf173b31 8-15 Sector offset to this block 22517072 16-23 LSN of last update 0x200000631b 24-39 UUID e56c...da71 40-47 Parent inode 0x022f4d7d
Bytes 48-59 are a three element array indicating where there is available free space in this directory. Each array element is a 2 byte offset (in bytes) to the free space and a 2 byte length (in bytes). There is no free space in this block, so all array entries are zeroed. Bytes 60-63 are padding for alignment.
Following this header are variable length directory entries defined as follows:
Len (bytes) Field =========== ====== 8 absolute inode number 1 file name length (bytes) varies file name 1 file type varies padding as necessary for 64bit alignment 2 offset to beginning of this entry
Here is the decoding of the directory entries shown above:
Inode Len File Name Type Offset ===== === ========= ==== ======== 0x022f4d7d 1 . 2 0x0040 0x04159fa1 2 .. 2 0x0050 0x022f4d7e 14 0001_smallfile 1 0x0060 0x022f4d7f 12 0002_bigfile 1 0x0080 0x022f4d80 14 0003_smallfile 1 0x0098 0x022f4d81 12 0004_bigfile 1 0x00B8
File type bytes are as described in Part Three of this series (1 is a regular file, 2 is a directory). Note that the starting offset of the “0003_smallfile” entry is 152 bytes (0x0098), exactly as the hash table lookup told us.
What Happens Upon Deletion?
Let’s see what happens when we delete “0003_smallfile”. When doing this sort of testing, always be careful to force the file system cache to flush to disk before busting out the trusty hex editor:
# rm -f /root/dir-testing/bigdir/0003_smallfile # sync; echo 3 > /proc/sys/vm/drop_caches
The mtime and ctime in the directory inode are set to the deletion time of “0003_smallfile”. The LSN and CRC32 checksum in the inode are also updated.
The removal of a single file is typically not a big enough event to modify the size of the directory. In this case, neither the extent tree root or leaf block changes. We would have to purge a significant number of files the impact this data.
However, the “tail record” for the directory is impacted by the file deletion.
The CRC32 checksum and LSN (now highlighted in red) are updated. Also the free space array now shows 0x20 = 32 bytes free in the first block.
Again, a single file deletion is not significant enough to impact the root of the hash B+Tree. However, one of the leaf nodes does register the change.
Again we see updates to the CRC32 checksum and LSN fields. The “Unused entries” field for the hash array now shows one unused entry. Looking farther down in the block, we find the unused entry for our hash 0xbc07fded. The offset is zeroed to indicate this entry is unused. We saw similar behavior in other block-based directories in previous installments of this series.
Changes to the directory entries are also similar to the behavior we’ve seen previously for block-based directory files:
Again we see the usual CRC32 and LSN updates. But now the free space array starting at byte 48 shows 0x0020 = 32 bytes free at offset 0x0098 = 152. The first two bytes of the inode field in this directory are overwritten with 0xFFFF to indicate the unused space, and the next two bytes indicate 0x0020 = 32 bytes of free space. However, since the inodes in this file system fit in 32 bits, the original inode number for the file is still fully visible and the file could potentially be recovered using residual data in the inode.
Wrapping Up and Planning for the Future
This post concludes my walk-through of the major on-disk data structures in the XFS file system. If anything was unclear or you want more detailed explanations in any area, feel free to reach me through the comments or via any of my social media accounts.
The colorized hex dumps that appear in these posts where made with a combination of Synalize It! and Hexinator. Along the way I created “grammar” files that you can use to produce similar colored breakdowns on your own XFS data structures.
I have multiple pages of research questions that came up as I was working through this series. But what I’m most interested in at the moment is the process of recovering deleted data from XFS file systems. This is what I will be looking at in upcoming posts.