Sunday, December 20, 2009

ZFS Raidz Data Walk

Several months ago, I wrote in my blog about the raidz on-disk format. In that post, I went over the high-level details. Here, I thought I would show the low-level steps I used to determine the layout. I am using a modified zdb and mdb to walk through the on-disk data structures to find the data for a copy of the /usr/dict/words file that I made on a raidz file system.

The raidz volume contains 5 equal-size devices. Since I don't have 5 disks lying around, I created 5 equal-sized files (/export/home/max/r0 through /export/home/max/r4). I'll use the term disk throughout this discussion to refer to one of these files.

# zpool status -v tank
pool: tank
state: ONLINE
scrub: none requested

tank ONLINE 0 0 0
raidz1 ONLINE 0 0 0
/export/home/max/r0 ONLINE 0 0 0
/export/home/max/r1 ONLINE 0 0 0
/export/home/max/r2 ONLINE 0 0 0
/export/home/max/r3 ONLINE 0 0 0
/export/home/max/r4 ONLINE 0 0 0

errors: No known data errors

I'll umount the file system so things don't change while I'm examining the on-disk structures.

# zfs umount tank

And, as I have done in the past, I walk the data structures to get to the "words" file by starting at the uberblock_t. If you get lost during this walk, you can always refer to the diagram "ZFS On-Disk Layout - The Big Picture" (page 4) from the OpenSolaris Developer's Conference, 2008 in Prague.

First, the active uberblock_t.

# zdb -uuu tank

magic = 0000000000bab10c
version = 13
txg = 1280
guid_sum = 6800651560363961243
timestamp = 1239197133 UTC = Wed Apr 8 15:25:33 2009
rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:1e007400:400> DVA[1]=<0:9400:400> DVA[2]=<0:3c003800:400> fletcher4 lzjb LE contiguous birth=1280 fill=27 cksum=9ad89e117:40b4956a12c:db76af09e62f:1f779fd1db6115

Now, I use a new command I added to zdb to allow me to see the raidz mapping. The "-Z" option takes the pool name, device id, location, and physical size as arguments, and prints the device index, location, and size for each piece of the corresponding data plus parity.

# ./zdb -Z tank:0:1e007400:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 3, offset = 6001600, size = 200
devidx = 4, offset = 6001600, size = 200

So, the 0x200 byte parity is on the fourth disk (devidx = 3), and the 0x200 byte objset_phys_t is on the fifth disk (devidx = 4). (Of course, either one will work: with only one data column, the parity block is identical to the data.)
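Either block works here because single parity is, as I understand it, a bytewise XOR of the data columns, and the XOR of a single column is that column itself. A minimal Python sketch of that idea follows; the function name is mine and is not taken from the ZFS source.

```python
def xor_parity(columns):
    """Bytewise XOR of data columns; shorter columns are treated as zero-padded."""
    parity = bytearray(max(len(c) for c in columns))
    for col in columns:
        for i, byte in enumerate(col):
            parity[i] ^= byte
    return bytes(parity)

# With a single data column, the parity is an identical copy of the data.
data = bytes(range(16))
assert xor_parity([data]) == data

# With several columns, any one column can be rebuilt from parity plus the rest.
cols = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(cols)
assert xor_parity([parity, cols[0], cols[2]]) == cols[1]
```

The zero-padding also shows why the parity column has to be as large as the largest data column.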

Now, convert the hex offset to an absolute decimal block number (in 512-byte blocks). The 0x400000 (4MB) skips the labels and boot block area at the front of each device in the volume.

# mdb
> (6001600>>9)+(400000>>9)=D

Attempting to use zdb with the -R option to read the blocks causes an assertion failure in zdb (at least, that was the state back in April, when I wrote the original blog on raidz). So, instead I use dd to dump the raw data into a file.

# dd if=/export/home/max/r4 of=/tmp/objset_t bs=512 count=1 iseek=204811
1+0 records in
1+0 records out
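The iseek value passed to dd comes from the mdb expression shown earlier. The same arithmetic in Python, as a small sketch (the function and constant names are mine):

```python
LABEL_SKIP = 0x400000   # bytes skipped at the front of each device (from the text)
SECTOR_SHIFT = 9        # 512-byte blocks, matching dd's bs=512

def dd_sector(offset):
    """Convert a per-device offset in bytes to a 512-byte block number for dd."""
    return (offset >> SECTOR_SHIFT) + (LABEL_SKIP >> SECTOR_SHIFT)

print(dd_sector(0x6001600))  # 204811
```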

Now, I'll uncompress the data. The size after decompression should be 0x400 bytes (as specified in the block pointer in the uberblock_t above). For this, I use a utility I wrote called zuncompress. This utility takes an option which allows one to specify the compression algorithm used. The default is lzjb. The output should be the objset_phys_t for the meta object set (MOS).

# ./zuncompress -p 200 -l 400 /tmp/objset_t > /tmp/objset
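The source for zuncompress is not shown here. For readers who want to follow along without it, here is a Python sketch of the LZJB decoding loop as I understand it from the ZFS lzjb.c source. Treat the constants and the function name as my assumptions rather than as what zuncompress actually does.

```python
NBBY = 8
MATCH_BITS = 6
MATCH_MIN = 3
OFFSET_MASK = (1 << (16 - MATCH_BITS)) - 1

def lzjb_decompress(src, d_len):
    """Decode LZJB-compressed bytes into exactly d_len bytes of output."""
    dst = bytearray()
    pos = 0
    copymask = 1 << (NBBY - 1)
    copymap = 0
    while len(dst) < d_len:
        copymask <<= 1
        if copymask == (1 << NBBY):
            # New group: one control byte describes the next 8 items.
            copymask = 1
            copymap = src[pos]
            pos += 1
        if copymap & copymask:
            # Back-reference: 6 bits of length, 10 bits of offset.
            mlen = (src[pos] >> (NBBY - MATCH_BITS)) + MATCH_MIN
            offset = ((src[pos] << NBBY) | src[pos + 1]) & OFFSET_MASK
            pos += 2
            start = len(dst) - offset
            if offset == 0 or start < 0:
                raise ValueError("back-reference outside the output buffer")
            for _ in range(mlen):
                if len(dst) >= d_len:
                    break
                dst.append(dst[start])  # byte by byte, so overlaps work
                start += 1
        else:
            # Literal byte.
            dst.append(src[pos])
            pos += 1
    return bytes(dst)
```

With the dump above, the call would be along the lines of lzjb_decompress(open("/tmp/objset_t", "rb").read(), 0x400).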

And now, I'll use my modified mdb to print the objset_phys_t.

# mdb /tmp/objset
> 0::print -a zfs`objset_phys_t
0 os_meta_dnode = {
0 dn_type = 0xa <-- DMU_OT_DNODE
1 dn_indblkshift = 0xe
2 dn_nlevels = 0x1
40 dn_blkptr = [
40 blk_dva = [
40 dva_word = [ 0x8, 0xf0050 ]
50 dva_word = [ 0x8, 0x40 ]
60 dva_word = [ 0x8, 0x1e0028 ]

And the blkptr_t at 0x40:

> 40::blkptr
DVA[0]: vdev_id 0 / 1e00a000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 100000000000
DVA[0]: :0:1e00a000:a00:d
DVA[1]: vdev_id 0 / 8000
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 100000000000
DVA[1]: :0:8000:a00:d
DVA[2]: vdev_id 0 / 3c005000
DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 100000000000
DVA[2]: :0:3c005000:a00:d
LSIZE: 4000 PSIZE: a00
BIRTH: 500 LEVEL: 0 FILL: 1a00000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: a182339fe8:ded0f7be7047:bcb1c1a96b94cc:765bd519587bfb41

So, "LEVEL: 0" means no indirection. The next object is the MOS, which is an array of dnode_phys_t. Let's see how the MOS is laid out on the raidz volume.
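The dva_word pairs printed by ::print and the "vdev_id / offset" lines printed by ::blkptr are two views of the same data. Here is a sketch of the decoding in Python, assuming the field layout given in the ZFS on-disk format documentation (vdev in the high 32 bits of word 0, asize in the low 24 bits, offset in the low 63 bits of word 1, both in 512-byte units). The function name is mine.

```python
SECTOR_SHIFT = 9

def decode_dva(word0, word1):
    """Split the two 64-bit words of a DVA into fields (sizes and offsets in bytes)."""
    return {
        "vdev":   word0 >> 32,
        "grid":   (word0 >> 24) & 0xFF,
        "asize":  (word0 & 0xFFFFFF) << SECTOR_SHIFT,
        "gang":   bool(word1 >> 63),
        "offset": (word1 & ((1 << 63) - 1)) << SECTOR_SHIFT,
    }

dva = decode_dva(0x8, 0xf0050)
print(dva["vdev"], hex(dva["offset"]), hex(dva["asize"]))  # 0 0x1e00a000 0x1000
```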

# ./zdb -Z tank:0:1e00a000:a00
Columns = 5, bigcols = 2, asize = 1000, firstdatacol = 1
devidx = 0, offset = 6002000, size = 400
devidx = 1, offset = 6002000, size = 400
devidx = 2, offset = 6002000, size = 200
devidx = 3, offset = 6002000, size = 200
devidx = 4, offset = 6002000, size = 200

So, disk 0 contains parity, and disks 1, 2, 3, and 4 contain the MOS. The MOS is compressed with lzjb compression. We'll use dd to dump the 4 blocks containing the MOS to a file, then decompress the MOS.
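For readers without the modified zdb, the column layout can be approximated with a sketch based on my reading of vdev_raidz_map_alloc(). The function name and defaults (5 devices, single parity, 512-byte units) are assumptions matching this pool. As far as I know, the real code also swaps the first two columns for some offsets on single-parity raidz; that is left out here.

```python
def raidz_map(offset, psize, dcols=5, nparity=1, unit_shift=9):
    """Return ([(devidx, offset, size), ...], bigcols, asize); column 0 is parity."""
    b = offset >> unit_shift            # starting sector within the raidz vdev
    s = psize >> unit_shift             # number of data sectors
    f = b % dcols                       # first device used
    o = (b // dcols) << unit_shift      # byte offset on that device
    q, r = divmod(s, dcols - nparity)   # full rows, leftover sectors
    bigcols = 0 if r == 0 else r + nparity
    acols = bigcols if q == 0 else dcols
    cols = []
    for c in range(acols):
        devidx, coff = f + c, o
        if devidx >= dcols:             # wrap around to the next row
            devidx -= dcols
            coff += 1 << unit_shift
        size = (q + (1 if c < bigcols else 0)) << unit_shift
        cols.append((devidx, coff, size))
    # NOTE: the parity/data swap done by the real code is omitted (assumption).
    unit = (nparity + 1) << unit_shift
    asize = -(-sum(sz for _, _, sz in cols) // unit) * unit
    return cols, bigcols, asize

cols, bigcols, asize = raidz_map(0x1e00a000, 0xa00)
for devidx, off, size in cols:
    print(f"devidx = {devidx}, offset = {off:x}, size = {size:x}")
```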

I'll use mdb to translate the blkptr DVA address to a block on the disks. Note that all blocks in this example are at the same location (0x6002000).

# mdb
> (6002000>>9)+(400000>>9)=D

And now dd each of the blocks. The first disk (/export/home/max/r0) is parity. The second disk contains 0x400 bytes. The other 3 disks contain 0x200 bytes each. So the total size of the compressed data is 0x400 + 0x200 + 0x200 + 0x200, or 0xa00 bytes, which agrees with the PSIZE field in the blkptr_t. Note that the size of the parity block must be equal to the size of the largest block (0x400).

# dd if=/export/home/max/r1 of=/tmp/mos_z1 iseek=204816 count=2
2+0 records in
2+0 records out
# dd if=/export/home/max/r2 of=/tmp/mos_z2 iseek=204816 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r3 of=/tmp/mos_z3 iseek=204816 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r4 of=/tmp/mos_z4 iseek=204816 count=1
1+0 records in
1+0 records out
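The dd invocations above follow mechanically from the column list printed by zdb -Z. A small sketch that derives them, assuming the first column is parity (firstdatacol = 1) and using hypothetical helper names of my own:

```python
def dd_plan(columns, label_skip=0x400000, sector=512):
    """Return (device index, iseek, count) per data column; columns[0] is parity."""
    return [(dev, (off + label_skip) // sector, size // sector)
            for dev, off, size in columns[1:]]

mos_columns = [(0, 0x6002000, 0x400), (1, 0x6002000, 0x400),
               (2, 0x6002000, 0x200), (3, 0x6002000, 0x200),
               (4, 0x6002000, 0x200)]
for dev, iseek, count in dd_plan(mos_columns):
    print(f"dd if=/export/home/max/r{dev} iseek={iseek} count={count}")
```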

Now, concatenate the files to get the compressed MOS.

# cat /tmp/mos_z* > /tmp/mos_comp
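Before decompressing, the reassembled block could be checked against the CKSUM field of the blkptr_t. The sketch below is a plain Python fletcher4, assuming (my understanding, not verified in this walk) that the checksum covers the PSIZE bytes exactly as stored on disk, read as little-endian 32-bit words.

```python
import struct

def fletcher4(data):
    """Fletcher-4 over little-endian 32-bit words; returns four 64-bit sums."""
    mask = (1 << 64) - 1
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return a, b, c, d

def cksum_string(sums):
    """Format the sums the way the CKSUM line prints them."""
    return ":".join(f"{s:x}" for s in sums)
```

Running cksum_string(fletcher4(open("/tmp/mos_comp", "rb").read())) and comparing the result with the CKSUM line would show whether the pieces were concatenated in the right order.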

And uncompress. According to the blkptr_t, the size after decompression is 0x4000 (the LSIZE field).

# ./zuncompress -l 4000 -p a00 /tmp/mos_comp > /tmp/mos

And I'll use the modified mdb to dump out the MOS.

# mdb /tmp/mos
> ::sizeof zfs`dnode_phys_t
sizeof (zfs`dnode_phys_t) = 0x200

> 4000%200=K
20 <-- There are 32 dnode_phys_t in the MOS
> 0,20::print -a zfs`dnode_phys_t
0 dn_type = 0 <-- DMU_OT_NONE, first is unused
200 dn_type = 0x1 <-- DMU_OT_OBJECT_DIRECTORY
240 dn_blkptr = [
240 blk_dva = [
240 dva_word = [ 0x2, 0x24 ]
400 dn_type = 0xc <-- DMU_OT_DSL_DIR (DSL Directory)
404 dn_bonustype = 0xc
4c0 dn_bonus = [ 0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0, 0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0,
... ]
600 dn_type = 0xf
1600 dn_type = 0x10 <-- DMU_OT_DSL_DATASET (DSL DataSet)
1604 dn_bonustype = 0x10
16c0 dn_bonus = [ 0x8, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, 0x1, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
... ]

The blkptr_t at 0x240 is for the Object Directory. Let's take a closer look.

> 240::blkptr
DVA[0]: vdev_id 0 / 4800
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[0]: :0:4800:200:d
DVA[1]: vdev_id 0 / 1e004800
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[1]: :0:1e004800:200:d
DVA[2]: vdev_id 0 / 3c000000
DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[2]: :0:3c000000:200:d
LSIZE: 200 PSIZE: 200
ENDIAN: LITTLE TYPE: object directory
BIRTH: 4 LEVEL: 0 FILL: 100000000
CKFUNC: fletcher4 COMP: uncompressed
CKSUM: 5d4dec3ac:1e59c2be429:5825c81154e8:b9b170eedd49e

We'll use zdb to find out where ZFS has put the 0x200 byte object directory.

# ./zdb -Z tank:0:4800:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 1, offset = e00, size = 200
devidx = 2, offset = e00, size = 200

So, the parity is on the second disk (devidx = 1), and the object directory (a ZAP object) is on the third disk.

We'll convert the offset into a block address.

# mdb
> (e00>>9)+(400000>>9)=D

And dump the 0x200 (i.e., 512-byte) block.

# dd if=/export/home/max/r2 of=/tmp/objdir iseek=8199 count=1
1+0 records in
1+0 records out

The ZAP object is not compressed (see the above blkptr_t). So, no need to uncompress. We'll use mdb to look at the zap.

# mdb /tmp/objdir
> 0/J
0: 8000000000000003 <-- a microzap object

> 0::print -a -t zfs`mzap_phys_t
0 uint64_t mz_block_type = 0x8000000000000003
8 uint64_t mz_salt = 0x32064dbb
10 uint64_t mz_normflags = 0
18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
40 mzap_ent_phys_t [1] mz_chunk = [
40 uint64_t mze_value = 0x2
48 uint32_t mze_cd = 0
4c uint16_t mze_pad = 0
4e char [50] mze_name = [ "root_dataset" ]

There are more mzap_ent_phys_t in the object, but we are only concerned with the root dataset. This is object id 2, so we'll go back to the MOS, and examine the dnode_phys_t at index 2.
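Finding a dnode_phys_t from its object id is simple arithmetic. A sketch, using the 0x200 dnode size and 0x4000 block size seen above (the function name is mine):

```python
DNODE_SIZE = 0x200    # sizeof (dnode_phys_t), per ::sizeof
BLOCK_SIZE = 0x4000   # LSIZE of the block holding the dnodes

def dnode_location(object_id):
    """Return (block number, byte offset within that block) for an object id."""
    per_block = BLOCK_SIZE // DNODE_SIZE
    return object_id // per_block, (object_id % per_block) * DNODE_SIZE

print([hex(x) for x in dnode_location(2)])  # ['0x0', '0x400']
```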

# mdb /tmp/mos
> 2*200::print -a zfs`dnode_phys_t <-- Each dnode_phys_t is 0x200 bytes
400 dn_type = 0xc <-- DMU_OT_DSL_DIR
404 dn_bonustype = 0xc <-- DMU_OT_DSL_DIR
4c0 dn_bonus = [ 0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0, 0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, ... ]

The bonus buffer contains a dsl_dir_phys_t.

> 4c0::print -a zfs`dsl_dir_phys_t
4c0 dd_creation_time = 0x49db7539
4c8 dd_head_dataset_obj = 0x10
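The same two fields can be read straight from the dn_bonus bytes, since they are little-endian 64-bit integers. A short Python sketch, using only the first 16 bytes printed above:

```python
import struct

# First 16 bytes of dn_bonus for object 2, as printed by mdb.
bonus = bytes([0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0,
               0x10, 0, 0, 0, 0, 0, 0, 0])

# Two little-endian uint64_t: dd_creation_time, dd_head_dataset_obj.
creation_time, head_dataset_obj = struct.unpack("<QQ", bonus)
print(hex(creation_time), hex(head_dataset_obj))  # 0x49db7539 0x10
```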

The DSL DataSet is object id 0x10 (dd_head_dataset_obj = 0x10).

> 10*200::print -a zfs`dnode_phys_t
2000 dn_type = 0x10 <-- DMU_OT_DSL_DATASET
2004 dn_bonustype = 0x10 <-- bonus buffer contains dsl_dataset_phys_t
20c0 dn_bonus = [ 0x2, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, 0x1, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]

Now, the dsl_dataset_phys_t in the bonus buffer of the DSL DataSet dnode.

> 20c0::print -a zfs`dsl_dataset_phys_t
20c0 ds_dir_obj = 0x2
2140 ds_bp = {
2140 blk_dva = [
2140 dva_word = [ 0x2, 0xf0038 ]
2150 dva_word = [ 0x2, 0x6 ]
2160 dva_word = [ 0, 0 ]

The blkptr_t at 0x2140 will give us the objset_phys_t of the root dataset of the file system.

> 2140::blkptr
DVA[0]: vdev_id 0 / 1e007000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[0]: :0:1e007000:200:d
DVA[1]: vdev_id 0 / c00
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[1]: :0:c00:200:d
LSIZE: 400 PSIZE: 200
BIRTH: 500 LEVEL: 0 FILL: a00000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 6bb79a7b2:2e0d64756dd:9fc17017938b:176b8a4b6c4756

Now get the locations where the file system objset_phys_t resides.

# ./zdb -Z tank:0:1e007000:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 1, offset = 6001600, size = 200
devidx = 2, offset = 6001600, size = 200

So, parity is on the second disk, and the data is on the third disk, both at offset 0x6001600.

# mdb
> (6001600>>9)+(400000>>9)=D

And again use dd to dump the compressed objset_phys_t to a file.

# dd if=/export/home/max/r2 of=/tmp/dmu_objset_comp iseek=204811 count=1
1+0 records in
1+0 records out

And uncompress the objset_phys_t.

# ./zuncompress -l 400 -p 200 /tmp/dmu_objset_comp > /tmp/dmu_objset

Now, mdb to examine the objset_phys_t.

# mdb /tmp/dmu_objset
> 0::print -a zfs`objset_phys_t
0 os_meta_dnode = {
0 dn_type = 0xa <-- DMU_OT_DNODE
1 dn_indblkshift = 0xe
2 dn_nlevels = 0x7 <-- 6 levels of indirection (levels 6 down to 1, then level 0)
40 dn_blkptr = [
40 blk_dva = [
40 dva_word = [ 0x4, 0xf0020 ]
50 dva_word = [ 0x4, 0x20 ]
60 dva_word = [ 0, 0 ]
> 40::blkptr
DVA[0]: vdev_id 0 / 1e004000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[0]: :0:1e004000:400:id
DVA[1]: vdev_id 0 / 4000
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[1]: :0:4000:400:id
LSIZE: 4000 PSIZE: 400
BIRTH: 500 LEVEL: 6 FILL: 900000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 5b884586fa:3f9c7d79ba1f:17674db0ee38e0:6077d2d63aa75b6

There are 6 levels of indirection to get to the MOS for the file system. Next, we'll get the disk locations for the 6th level of indirection.

# ./zdb -Z tank:0:1e004000:400
Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1
devidx = 2, offset = 6000c00, size = 200
devidx = 3, offset = 6000c00, size = 200
devidx = 4, offset = 6000c00, size = 200

So, the third disk contains parity, and the fourth and fifth disks contain the indirect block.

# mdb
> (6000c00>>9)+(400000>>9)=D

Again, we'll use dd to get the data from the 2 disks, then concatenate the dd outputs, then uncompress.

# dd if=/export/home/max/r3 of=/tmp/i6_1z iseek=204806 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r4 of=/tmp/i6_2z iseek=204806 count=1
1+0 records in
1+0 records out
# cat /tmp/i6*z > /tmp/i6_z

Now, uncompress. The size after decompression is 0x4000 bytes, as specified in the LSIZE field of the blkptr_t.

# ./zuncompress -l 4000 -p 400 /tmp/i6_z > /tmp/i6

And use mdb to examine the blkptr_t structures. We are only interested in the first one, since it will take us to the beginning dnode_phys_t in the file system.

# mdb/intel/ia32/mdb/mdb /tmp/i6
> 0::blkptr
DVA[0]: vdev_id 0 / 1e003800
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[0]: :0:1e003800:400:id
DVA[1]: vdev_id 0 / 3800
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[1]: :0:3800:400:id
LSIZE: 4000 PSIZE: 400
BIRTH: 500 LEVEL: 5 FILL: 900000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 59defe2103:3e0ac53edc13:16a8c688ba6d69:5cafeb97a9046d7

Now at level 5, we again need to know where the data is on the physical disks.

# ./zdb -Z tank:0:1e003800:400
Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1
devidx = 3, offset = 6000a00, size = 200
devidx = 4, offset = 6000a00, size = 200
devidx = 0, offset = 6000c00, size = 200

So, parity on fourth disk and data on fifth and first.

# mdb
> (6000a00>>9)+(400000>>9)=D
> (6000c00>>9)+(400000>>9)=D

And use dd to dump the blocks.

# dd if=/export/home/max/r4 of=/tmp/i5_1z iseek=204805 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r0 of=/tmp/i5_2z iseek=204806 count=1
1+0 records in
1+0 records out

And concatenate...

# cat /tmp/i5*z > /tmp/i5_z

And uncompress...

# ./zuncompress -p 400 -l 4000 /tmp/i5_z > /tmp/i5

And get to the 4th level of indirection...

# mdb /tmp/i5
> 0::blkptr
DVA[0]: vdev_id 0 / 1e003000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[0]: :0:1e003000:400:id
DVA[1]: vdev_id 0 / 3000
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[1]: :0:3000:400:id
LSIZE: 4000 PSIZE: 400
BIRTH: 500 LEVEL: 4 FILL: 900000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 5aaaf038c7:3ecd4215b2cf:1705e4d4343d71:5e8d71a8535f678

Rather than show all 6 levels, we'll jump ahead. Repeating the same steps down to the level 1 indirect block (/tmp/i1) gives the blkptr_t for level 0.

# mdb /tmp/i1
> 0::blkptr
DVA[0]: vdev_id 0 / 1e001000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[0]: :0:1e001000:600:d
DVA[1]: vdev_id 0 / 1000
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[1]: :0:1000:600:d
LSIZE: 4000 PSIZE: 600
BIRTH: 500 LEVEL: 0 FILL: 900000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 7e1ebca68d:4f0370c6d404:23a24df0937608:ce6838f39084f95

Locate the data for the stripe:

# ./zdb -Z tank:0:1e001000:600
Columns = 4, bigcols = 4, asize = 800, firstdatacol = 1
devidx = 3, offset = 6000200, size = 200
devidx = 4, offset = 6000200, size = 200
devidx = 0, offset = 6000400, size = 200
devidx = 1, offset = 6000400, size = 200

# mdb
> (6000200>>9)+(400000>>9)=D
> (6000400>>9)+(400000>>9)=D

Get the data from the individual disks...

# dd if=/export/home/max/r4 of=/tmp/fs_mos_1z iseek=204801 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r0 of=/tmp/fs_mos_2z iseek=204802 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r1 of=/tmp/fs_mos_3z iseek=204802 count=1
1+0 records in
1+0 records out

Concatenate the data...

# cat /tmp/fs_mos_* > /tmp/fs_mos_z


# ./zuncompress -l 4000 -p 600 /tmp/fs_mos_z > /tmp/fs_mos

We should now be at the MOS for the root data set.

# mdb /tmp/fs_mos
> 0,20::print -a zfs`dnode_phys_t
0 dn_type = 0 <-- first is not used
200 dn_type = 0x15 <-- DMU_OT_MASTER_NODE
240 dn_blkptr = [
240 blk_dva = [
240 dva_word = [ 0x2, 0 ]
250 dva_word = [ 0x2, 0xf0000 ]
260 dva_word = [ 0, 0 ]
600 dn_type = 0x14 <-- DMU_OT_DIRECTORY_CONTENTS (probably for root of fs)
604 dn_bonustype = 0x11 <-- bonus buffer is a znode_phys_t
640 dn_blkptr = [
640 blk_dva = [
640 dva_word = [ 0x2, 0xf0006 ]
6c0 dn_bonus = [ 0x26, 0xa0, 0xdb, 0x49, 0, 0, 0, 0, 0x8e, 0xf0, 0xf7, 0x25, 0, 0, 0, 0, 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0xf3, 0x80, 0xdd, 0x34, 0, 0, 0 , 0, ... ]
800 dn_type = 0x13 <-- DMU_OT_PLAIN_FILE_CONTENTS
804 dn_bonustype = 0x11 <-- bonus buffer is znode_phys_t
840 dn_blkptr = [
840 blk_dva = [
840 dva_word = [ 0x2, 0xf0004 ]
8c0 dn_bonus = [ 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0x5e, 0xe2, 0xdc, 0x34, 0, 0, 0, 0, 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0x58, 0x9e, 0xde, 0x34, 0, 0, 0 , 0, ... ]

We should start with the ZAP object specified by the blkptr_t for the master node to get to the root directory object of the file system. Instead, we'll assume the dnode_phys_t at 0x600 is for the root of the file system, and we'll dump the blkptr_t. This should be for a ZAP object which should contain the list of files in the directory.

> 640::blkptr
DVA[0]: vdev_id 0 / 1e000c00
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[0]: :0:1e000c00:200:d
DVA[1]: vdev_id 0 / 800
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[1]: :0:800:200:d
LSIZE: 200 PSIZE: 200
BIRTH: 500 LEVEL: 0 FILL: 100000000
CKFUNC: fletcher4 COMP: uncompressed
CKSUM: 60d062a16:197ca3c8839:4691877f93d3:946b572aca5a2

Find the location on the disk(s) for the directory ZAP object.

# ./zdb -Z tank:0:1e000c00:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 1, offset = 6000200, size = 200
devidx = 2, offset = 6000200, size = 200
# mdb
> (6000200>>9)+(400000>>9)=D

Dump the contents.

# dd if=/export/home/max/r2 of=/tmp/rootdir iseek=204801 count=1
1+0 records in
1+0 records out

Examine the directory.

# mdb /tmp/rootdir
> ::sizeof zfs`mzap_phys_t
sizeof (zfs`mzap_phys_t) = 0x80
> ::sizeof zfs`mzap_ent_phys_t
sizeof (zfs`mzap_ent_phys_t) = 0x40
> 0::print zfs`mzap_phys_t
mz_block_type = 0x8000000000000003
mz_salt = 0x14187c7
mz_normflags = 0
mz_pad = [ 0, 0, 0, 0, 0 ]
mz_chunk = [
mze_value = 0x8000000000000004
mze_cd = 0
mze_pad = 0
mze_name = [ "smallfile" ]
> (200-80)%40=K
6 <-- there are 6 more mzap_ent_phys_t
> 80,6::print zfs`mzap_ent_phys_t
mze_value = 0
mze_cd = 0
mze_pad = 0
mze_name = [ '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0 ', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', ... ]
mze_value = 0x8000000000000006
mze_cd = 0
mze_pad = 0
mze_name = [ "words" ] <-- here is the file we want, object id is 6
mze_value = 0x8000000000000007
mze_cd = 0
mze_pad = 0
mze_name = [ "foo" ]
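The microzap block can also be parsed outside mdb. The sketch below assumes the layout shown by ::print (a 0x40 byte header followed by 0x40 byte entries) and, as I understand the ZPL directory entry encoding, that the low 48 bits of mze_value hold the object number while the top bits hold the file type. Function names are mine.

```python
import struct

# mze_value (uint64), mze_cd (uint32), mze_pad (uint16), mze_name (char[50])
MZAP_ENT = struct.Struct("<QIH50s")

def parse_microzap(block):
    """Return {name: mze_value} for the non-empty entries of a microzap block."""
    entries = {}
    for off in range(0x40, len(block), MZAP_ENT.size):
        value, _cd, _pad, raw = MZAP_ENT.unpack_from(block, off)
        name = raw.split(b"\0", 1)[0].decode("ascii")
        if name:
            entries[name] = value
    return entries

def dirent_object(value):
    """Object number stored in the low 48 bits of a directory entry value."""
    return value & ((1 << 48) - 1)
```

For example, dirent_object(parse_microzap(open("/tmp/rootdir", "rb").read())["words"]) should give 6 for the dump above.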

Now, go back to the file system MOS to look at object id 6. If the object ID were 32 (0x20) or greater, there would have been more work looking at other indirect blocks from the objset_phys_t for the file system. We assumed above that the root directory for the file system would be a low object number, and, fortunately, the file we want to examine is also a low object number.
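To make the "more work" concrete, here is a sketch of which blkptr_t index would have to be followed at each indirect level for a given object id. It assumes the values seen in this walk (dn_nlevels = 7, dn_indblkshift = 0xe, 0x4000 byte dnode blocks, 0x200 byte dnodes) and a 128-byte blkptr_t. The function name is mine.

```python
def indirect_path(object_id, nlevels=7, indblkshift=0xe,
                  datablkshift=14, dnode_shift=9, blkptr_shift=7):
    """Return ([(level, blkptr index), ...], level-0 block id, offset in block)."""
    epbs = indblkshift - blkptr_shift        # log2(blkptrs per indirect block)
    per_block_shift = datablkshift - dnode_shift
    blkid = object_id >> per_block_shift     # which level-0 block holds the dnode
    path = []
    for level in range(nlevels - 1, 0, -1):  # indirect blocks, top level first
        index = (blkid >> (epbs * (level - 1))) & ((1 << epbs) - 1)
        path.append((level, index))
    offset = (object_id & ((1 << per_block_shift) - 1)) << dnode_shift
    return path, blkid, offset

path, blkid, offset = indirect_path(6)
print(path, blkid, hex(offset))
```

For object id 6 every index is 0, which is why looking only at the first blkptr_t at each level was enough here.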

# mdb /tmp/fs_mos
> (6*200)::print -a zfs`dnode_phys_t
c00 dn_type = 0x13 <-- plain file contents
c01 dn_indblkshift = 0xe
c02 dn_nlevels = 0x2 <-- one level of indirection
c03 dn_nblkptr = 0x1
c04 dn_bonustype = 0x11 <-- bonus buffer contains znode_phys_t
c40 dn_blkptr = [
c40 blk_dva = [
c40 dva_word = [ 0x4, 0x5ec ]
c50 dva_word = [ 0x4, 0xf00ec ]
c60 dva_word = [ 0, 0 ]
cc0 dn_bonus = [ 0x22, 0x86, 0xdb, 0x49, 0, 0, 0, 0, 0x9, 0x31, 0x20, 0x28, 0, 0, 0, 0, 0x22, 0x86, 0xdb, 0x49, 0, 0, 0, 0, 0x71, 0x48, 0x2b, 0x28, 0, 0, 0, 0, ... ]

A quick look at the znode_phys_t in the bonus buffer...

> cc0::print zfs`znode_phys_t
zp_atime = [ 0x49db8622, 0x28203109 ]
zp_mtime = [ 0x49db8622, 0x282b4871 ]
zp_ctime = [ 0x49db8622, 0x282b4871 ]
zp_crtime = [ 0x49db8622, 0x28203109 ]
zp_gen = 0x97
zp_mode = 0x8124
zp_size = 0x32752
zp_parent = 0x3
zp_links = 0x1
zp_xattr = 0
zp_rdev = 0
zp_flags = 0x40800000004
zp_uid = 0
zp_gid = 0
zp_zap = 0
zp_pad = [ 0, 0, 0 ]
zp_acl = {
z_acl_extern_obj = 0
z_acl_size = 0x30
z_acl_version = 0x1
z_acl_count = 0x6
z_ace_data = [ 0x1, 0, 0, 0x10, 0x26, 0, 0, 0, 0, 0, 0, 0x10, 0x11, 0x1,
0xc, 0, 0x1, 0, 0x40, 0x20, 0x26, 0, 0, 0, 0, 0, 0x40, 0x20, 0x1, 0, 0, 0, ...

When was the file created?

> 49db8622=Y
2009 Apr 7 18:58:10
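The timestamp is seconds since the epoch, so it can be converted outside mdb as well. Note that mdb's =Y prints local time; the sketch below prints UTC, and the two-hour difference from the output above would be consistent with a session run in a timezone two hours ahead of UTC (my assumption).

```python
from datetime import datetime, timezone

def zfs_time(seconds):
    """Interpret a ZFS timestamp (seconds since the epoch) as a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

print(zfs_time(0x49db8622).isoformat())  # 2009-04-07T16:58:10+00:00
```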

Now, let's look at the blkptr_t.

> c40::blkptr
DVA[0]: vdev_id 0 / bd800
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[0]: :0:bd800:400:id
DVA[1]: vdev_id 0 / 1e01d800
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[1]: :0:1e01d800:400:id
LSIZE: 4000 PSIZE: 400
BIRTH: 97 LEVEL: 1 FILL: 200000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 600e97db0e:411c4ea86350:1790b46d936d46:602547566d07cc7

We're at level 1.

# ./zdb -Z tank:0:bd800:400
Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1
devidx = 1, offset = 25e00, size = 200
devidx = 2, offset = 25e00, size = 200
devidx = 3, offset = 25e00, size = 200

# mdb
> (25e00>>9)+(400000>>9)=D

# dd if=/export/home/max/r2 of=/tmp/words_i1z iseek=8495 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r3 of=/tmp/words_i2z iseek=8495 count=1
1+0 records in
1+0 records out
# cat /tmp/words_*z > /tmp/words_iz


# ./zuncompress -l 4000 -p 400 /tmp/words_iz > /tmp/words_i

So, /tmp/words_i should contain the uncompressed blkptr_t array. These blkptr_t should take us to the data for the words file.

# mdb /tmp/words_i
> 0::blkptr
DVA[0]: vdev_id 0 / c0000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2800000000000
DVA[0]: :0:c0000:20000:d
LSIZE: 20000 PSIZE: 20000
BIRTH: 97 LEVEL: 0 FILL: 100000000
CKFUNC: fletcher2 COMP: uncompressed
CKSUM: f5cbf93a151abcac:5b5d6ca83588d8ad:574d9b8bf334944b:ad78d30af51771d8
DVA[0]: vdev_id 0 / e8000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2800000000000
DVA[0]: :0:e8000:20000:d
LSIZE: 20000 PSIZE: 20000
BIRTH: 97 LEVEL: 0 FILL: 100000000
CKFUNC: fletcher2 COMP: uncompressed
CKSUM: f39ae34f048ae079:de2ef1af7d1fb495:ec3ae3f7985b2a98:c6d33ac68cb042b6

So, where is the data?

# ./zdb -Z tank:0:c0000:20000
Columns = 5, bigcols = 0, asize = 28000, firstdatacol = 1
devidx = 1, offset = 26600, size = 8000
devidx = 2, offset = 26600, size = 8000
devidx = 3, offset = 26600, size = 8000
devidx = 4, offset = 26600, size = 8000
devidx = 0, offset = 26800, size = 8000

A little hex to decimal conversion for dd...

# mdb
> (26600>>9)+(400000>>9)=D
> (26800>>9)+(400000>>9)=D

Now, dump the blocks...

# dd if=/export/home/max/r2 of=/tmp/w1 iseek=8499 bs=512 count=64
64+0 records in
64+0 records out
# dd if=/export/home/max/r3 of=/tmp/w2 iseek=8499 bs=512 count=64
64+0 records in
64+0 records out
# dd if=/export/home/max/r4 of=/tmp/w3 iseek=8499 bs=512 count=64
64+0 records in
64+0 records out
# dd if=/export/home/max/r0 of=/tmp/w4 iseek=8500 bs=512 count=64
64+0 records in
64+0 records out

And concatenate the 4 files...

# cat /tmp/w[1-4]

That is the first 128k of the file. To get the remainder of the file, we would need to follow the next blkptr_t at level 1. But, not today...
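How much is left can be worked out from zp_size (0x32752) and the 0x20000 byte LSIZE of each data block. A small sketch, with a function name of my own:

```python
def data_blocks(file_size, block_size=0x20000):
    """Return (number of level-0 blocks, valid bytes in the last block)."""
    full, tail = divmod(file_size, block_size)
    return full + (1 if tail else 0), tail or block_size

print(data_blocks(0x32752))  # (2, 75602)
```

So the file needs two data blocks, which matches the two blkptr_t printed from /tmp/words_i, and only the first 75602 bytes of the second block belong to the file.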
