Max Bruning's weblog: ZFS Raidz Data Walk

Several months ago, I wrote in my blog about the raidz on-disk format (see http://mbruning.blogspot.com/2009/04/raidz-on-disk-format.html). In that post, I went over the high-level details. Here, I'll show the low-level steps I used to determine the layout. I am using a modified zdb and a modified mdb to walk through the on-disk data structures to find the data for a copy of the /usr/dict/words file that I made on a raidz file system.

The raidz volume contains 5 equal size devices. Since I don't have 5 disks lying around, I created 5 equal sized files (/export/home/max/r0 through /export/home/max/r4). I'll use the term disk throughout this discussion to refer to one of these files.


# zpool status -v tank
pool: tank
state: ONLINE
scrub: none requested
config:

  NAME                     STATE     READ WRITE CKSUM
  tank                     ONLINE       0     0     0
    raidz1                 ONLINE       0     0     0
      /export/home/max/r0  ONLINE       0     0     0
      /export/home/max/r1  ONLINE       0     0     0
      /export/home/max/r2  ONLINE       0     0     0
      /export/home/max/r3  ONLINE       0     0     0
      /export/home/max/r4  ONLINE       0     0     0

errors: No known data errors
#

I'll umount the file system so things don't change while I'm examining the on-disk structures.


# zfs umount tank
#

And, as I have done in the past, I walk the data structures to get to the "words" file by starting at the uberblock_t. If you get lost during this walk, you can always refer to the diagram "ZFS On-Disk Layout - The Big Picture", page 4 in http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf from the OpenSolaris Developer's Conference, 2008 in Prague.

First, the active uberblock_t.


# zdb -uuu tank
Uberblock

  magic = 0000000000bab10c
  version = 13
  txg = 1280
  guid_sum = 6800651560363961243
  timestamp = 1239197133 UTC = Wed Apr  8 15:25:33 2009
  rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:1e007400:400> DVA[1]=<0:9400:400> DVA[2]=<0:3c003800:400> fletcher4 lzjb LE contiguous birth=1280 fill=27 cksum=9ad89e117:40b4956a12c:db76af09e62f:1f779fd1db6115
#

Now, I use a new command I added to zdb to allow me to see the raidz mapping. The "-Z" option takes the pool name, device id, location, and physical size as arguments, and prints the device index, location, and size for each piece of the corresponding data plus parity.


# ./zdb -Z tank:0:1e007400:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 3, offset = 6001600, size = 200
devidx = 4, offset = 6001600, size = 200
#

So, the 0x200 byte parity is on the fourth disk (devidx = 3), and the 0x200 byte objset_phys_t is on the fifth disk (devidx = 4). (Of course, with only one data column the parity is an identical copy of the data, so either disk will work.)
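
The mapping that "-Z" prints can be approximated with a little arithmetic. What follows is only a sketch in Python, not the code behind the modified zdb: it assumes a single-parity raidz with 5 children and 512-byte sectors, and it leaves out the parity/data column swap that the real raidz code may apply for some offsets. The function name raidz_map is made up for this illustration. Under those assumptions it reproduces the "-Z" output shown in this walk.

```python
# Sketch of a raidz1 column mapping. This is an assumption-based
# reconstruction, NOT the code behind "zdb -Z".
# Assumed: 5 child devices, single parity, 512-byte sectors (shift 9).
# Omitted: the parity/data column swap the real code may do for some offsets.

def raidz_map(offset, psize, dcols=5, nparity=1, shift=9):
    """Return (columns, bigcols, asize) for a block at DVA offset/psize.

    columns is a list of (devidx, offset, size); entry 0 is parity.
    """
    b = offset >> shift                 # starting sector within the raidz vdev
    s = psize >> shift                  # number of data sectors
    f = b % dcols                       # first child device used
    o = (b // dcols) << shift           # byte offset on that child
    q, r = divmod(s, dcols - nparity)   # full rows, leftover data sectors
    bigcols = 0 if r == 0 else r + nparity
    acols = bigcols if q == 0 else dcols
    cols = []
    for c in range(acols):
        dev, coff = f + c, o
        if dev >= dcols:                # wrapped past the last device
            dev -= dcols
            coff += 1 << shift
        size = (q + (1 if c < bigcols else 0)) << shift
        cols.append((dev, coff, size))
    unit = (nparity + 1) << shift
    total = sum(size for _, _, size in cols)
    asize = -(-total // unit) * unit    # round up to a multiple of unit
    return cols, bigcols, asize

cols, bigcols, asize = raidz_map(0x1e007400, 0x200)
for dev, off, size in cols:
    print("devidx = %d, offset = %x, size = %x" % (dev, off, size))
```

The first column in the returned list is the parity column (firstdatacol = 1), and the remaining columns hold the data in order.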

Now, convert the hex offset to an absolute decimal block number. The 0x400000 skips the disk labels at the front of each device in the volume.


# mdb
> (6001600>>9)+(400000>>9)=D
              204811
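
The same conversion comes up at every step of the walk, so here it is as a small Python helper. This is just the mdb expression above restated; the name dd_block is made up for this sketch.

```python
SECTOR_SHIFT = 9         # 512-byte sectors, as in the mdb expression
LABEL_SKIP = 0x400000    # bytes skipped at the front of each device

def dd_block(offset):
    """Convert a per-device offset from zdb -Z into a 512-byte block number for dd."""
    return (offset >> SECTOR_SHIFT) + (LABEL_SKIP >> SECTOR_SHIFT)

print(dd_block(0x6001600))   # the offset of the MOS objset_phys_t above
```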

Attempting to use zdb with the -R option to read the blocks causes an assertion failure in zdb (at least, that was the state back in April, when I wrote the original post on raidz). So, instead I use dd to dump the raw data into a file.


# dd if=/export/home/max/r4 of=/tmp/objset_t bs=512 count=1 iseek=204811
1+0 records in
1+0 records out
#

Now, I'll uncompress the data. The size after decompression should be 0x400 bytes (as specified in the block pointer in the uberblock_t above). For this, I use a utility I wrote called zuncompress. This utility takes an option which allows one to specify the compression algorithm used. The default is lzjb. The output should be the objset_phys_t for the meta object set (MOS).


# ./zuncompress -p 200 -l 400 /tmp/objset_t > /tmp/objset
#
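
For readers who do not have zuncompress, the following is a minimal Python sketch of LZJB decompression. It is an illustration of the algorithm as it is generally described (a control byte followed by up to 8 items, each either a literal byte or a 2-byte back-reference), not the source of zuncompress. The function name lzjb_decompress is made up here.

```python
# Sketch of LZJB decompression; an illustration only, not zuncompress itself.
MATCH_BITS = 6
MATCH_MIN = 3
OFFSET_MASK = (1 << (16 - MATCH_BITS)) - 1

def lzjb_decompress(src, dst_len):
    """Expand LZJB-compressed bytes into exactly dst_len bytes."""
    dst = bytearray()
    i = 0
    copymask = 1 << 7
    copymap = 0
    while len(dst) < dst_len:
        copymask <<= 1
        if copymask == 1 << 8:          # time for a fresh control byte
            copymask = 1
            copymap = src[i]
            i += 1
        if copymap & copymask:          # item is a back-reference
            mlen = (src[i] >> (8 - MATCH_BITS)) + MATCH_MIN
            offset = ((src[i] << 8) | src[i + 1]) & OFFSET_MASK
            i += 2
            start = len(dst) - offset
            if offset == 0 or start < 0:
                raise ValueError("bad match offset")
            for k in range(mlen):       # byte-wise so overlapping copies work
                if len(dst) >= dst_len:
                    break
                dst.append(dst[start + k])
        else:                           # item is a literal byte
            dst.append(src[i])
            i += 1
    return bytes(dst)

# Two literals "ab", then a back-reference copying 4 bytes from 2 bytes back.
print(lzjb_decompress(bytes([0x04, 0x61, 0x62, 0x04, 0x02]), 6))
```

With a function like this, the physical size (-p) corresponds to how many input bytes are available and the logical size (-l) corresponds to dst_len.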

And now, I'll use my modified mdb to print the objset_phys_t.


# mdb /tmp/objset
> 0::print -a zfs`objset_phys_t
{
  0 os_meta_dnode = {
      0 dn_type = 0xa  <-- DMU_OT_DNODE 
      1 dn_indblkshift = 0xe 
      2 dn_nlevels = 0x1
 ...         
     40 dn_blkptr = [
             {      
          40 blk_dva = [  
                {      
                   40 dva_word = [ 0x8, 0xf0050 ] 
                    }    
                { 
                   50 dva_word = [ 0x8, 0x40 ] 
                } 
                { 
                   60 dva_word = [ 0x8, 0x1e0028 ] 
                }  
           ]
 ... 
}

And the blkptr_t at 0x40:


> 40::blkptr
DVA[0]: vdev_id 0 / 1e00a000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 100000000000
DVA[0]: :0:1e00a000:a00:d
DVA[1]: vdev_id 0 / 8000
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 100000000000
DVA[1]: :0:8000:a00:d
DVA[2]: vdev_id 0 / 3c005000
DVA[2]:       GANG: FALSE  GRID:  0000  ASIZE: 100000000000
DVA[2]: :0:3c005000:a00:d
LSIZE:  4000                            PSIZE: a00
ENDIAN: LITTLE                                  TYPE:  DMU dnode
BIRTH:  500                LEVEL: 0     FILL:  1a00000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  a182339fe8:ded0f7be7047:bcb1c1a96b94cc:765bd519587bfb41
$q
#
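
The dva_word pairs printed by ::print and the DVA lines printed by ::blkptr carry the same information. As a sketch, assuming the usual DVA bit layout (vdev in the upper 32 bits of word 0, asize in the low 24 bits in units of 512-byte sectors, offset in word 1 in sectors with the top bit as the gang flag), the words can be decoded in Python like this. The function name decode_dva is made up for this example.

```python
def decode_dva(word0, word1, shift=9):
    """Split the two 64-bit DVA words into their fields (sizes in bytes)."""
    return {
        "vdev": word0 >> 32,
        "grid": (word0 >> 24) & 0xff,
        "asize": (word0 & 0xffffff) << shift,
        "gang": bool(word1 >> 63),
        "offset": (word1 & ((1 << 63) - 1)) << shift,
    }

# dva_word = [ 0x8, 0xf0050 ] from the objset_phys_t above
d = decode_dva(0x8, 0xf0050)
print("%d:%x asize=%x" % (d["vdev"], d["offset"], d["asize"]))
```

For the first DVA this gives vdev 0 and offset 0x1e00a000, which agrees with the DVA[0] line above.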

So, "LEVEL: 0" means no indirection. The next object is the MOS, which is an array of dnode_phys_t. Let's see how the MOS is laid out on the raidz volume.


# ./zdb -Z tank:0:1e00a000:a00
Columns = 5, bigcols = 2, asize = 1000, firstdatacol = 1
devidx = 0, offset = 6002000, size = 400
devidx = 1, offset = 6002000, size = 400
devidx = 2, offset = 6002000, size = 200
devidx = 3, offset = 6002000, size = 200
devidx = 4, offset = 6002000, size = 200
#

So, disk 0 contains parity, and disks 1, 2, 3, and 4 contain the MOS. The MOS is compressed with lzjb compression. We'll use dd to dump the 4 blocks containing the MOS to a file, then decompress the MOS.

I'll use mdb to translate the blkptr DVA address to a block on the disks. Note that all blocks in this example are at the same location (0x6002000).


# mdb
> (6002000>>9)+(400000>>9)=D
              204816

And now dd each of the blocks. The first disk (/export/home/max/r0) is parity. The second disk contains 0x400 bytes. The other 3 disks contain 0x200 bytes each. So the total size of the compressed data is 0x400 + 0x200 + 0x200 + 0x200, or 0xa00 bytes, which agrees with the PSIZE field in the blkptr_t. Note that the size of the parity column must equal the size of the largest data column (0x400).


# dd if=/export/home/max/r1 of=/tmp/mos_z1 iseek=204816 count=2
2+0 records in
2+0 records out
# dd if=/export/home/max/r2 of=/tmp/mos_z2 iseek=204816 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r3 of=/tmp/mos_z3 iseek=204816 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r4 of=/tmp/mos_z4 iseek=204816 count=1
1+0 records in
1+0 records out
#
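
The parity column could be checked against the data that was just dumped. The Python sketch below assumes that single-parity raidz computes parity as a plain XOR of the data columns, with shorter columns treated as zero-padded to the length of the longest one. The function name xor_parity is made up for this illustration.

```python
def xor_parity(data_cols):
    """XOR the data columns together; short columns count as zero-padded."""
    width = max(len(col) for col in data_cols)
    parity = bytearray(width)
    for col in data_cols:
        for i, byte in enumerate(col):
            parity[i] ^= byte
    return bytes(parity)

# Tiny example: a 2-byte column and a 1-byte column.
print(xor_parity([b"\x01\x02", b"\x03"]).hex())
```

Under that assumption, feeding in the contents of /tmp/mos_z1 through /tmp/mos_z4 should give the same 0x400 bytes as the parity column on /export/home/max/r0 at block 204816.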

Now, concatenate the files to get the compressed MOS.


# cat /tmp/mos_z* > /tmp/mos_comp
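
The dd and cat steps can also be folded into one Python helper. This is a sketch only: it takes the column list in the order "zdb -Z" prints it, skips the first (parity) column, and concatenates the data columns. The function name read_data_columns and its arguments are made up for this example; the 0x400000 label skip is the same constant used in the mdb expressions.

```python
SECTOR = 512
LABEL_SKIP = 0x400000   # bytes skipped at the front of each device

def read_data_columns(paths, cols, label_skip=LABEL_SKIP):
    """Read and concatenate the data columns of one raidz block.

    paths: device file names indexed by devidx.
    cols:  (devidx, offset, size) tuples in zdb -Z order; cols[0] is parity.
    """
    out = bytearray()
    for devidx, offset, size in cols[1:]:
        with open(paths[devidx], "rb") as dev:
            dev.seek(label_skip + offset)
            chunk = dev.read(size)
        if len(chunk) != size:
            raise IOError("short read on %s" % paths[devidx])
        out += chunk
    return bytes(out)

# Hypothetical usage for the MOS block above (not run here):
# paths = ["/export/home/max/r%d" % n for n in range(5)]
# cols = [(0, 0x6002000, 0x400), (1, 0x6002000, 0x400), (2, 0x6002000, 0x200),
#         (3, 0x6002000, 0x200), (4, 0x6002000, 0x200)]
# mos_comp = read_data_columns(paths, cols)
```

Seeking to label_skip + offset is the same position that dd reaches with iseek, since both values are multiples of 512.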

And uncompress. The size after decompression, according to the blkptr is 0x4000 (LSIZE in the blkptr).


# ./zuncompress -l 4000 -p a00 /tmp/mos_comp > /tmp/mos

And I'll use the modified mdb to dump out the MOS.


# mdb /tmp/mos
> ::sizeof zfs`dnode_phys_t
sizeof (zfs`dnode_phys_t) = 0x200

> 4000%200=K
              20              <-- There are 32 dnode_phys_t in the MOS
> 0,20::print -a zfs`dnode_phys_t
{
       0 dn_type = 0  <-- DMU_OT_NONE, first is unused
 ... 
} 
{ 
 200 dn_type = 0x1  <-- DMU_OT_OBJECT_DIRECTORY
 ...
     240 dn_blkptr = [
         {
             240 blk_dva = [
                 {
                     240 dva_word = [ 0x2, 0x24 ]
                 }
 ...
}
{
     400 dn_type = 0xc  <-- DMU_OT_DSL_DIR (DSL Directory)
     ...
     404 dn_bonustype = 0xc
 ...
     4c0 dn_bonus = [ 0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0, 0x10, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0,
 ... ]
}
{
     600 dn_type = 0xf
 ...
{
    1600 dn_type = 0x10  <-- DMU_OT_DSL_DATASET (DSL DataSet)
 ...
    1604 dn_bonustype = 0x10
 ...
    16c0 dn_bonus = [ 0x8, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, 0x1, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 ... ]
}
 ...

The blkptr_t at 0x240 is for the Object Directory. Let's take a closer look.


> 240::blkptr
DVA[0]: vdev_id 0 / 4800
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[0]: :0:4800:200:d
DVA[1]: vdev_id 0 / 1e004800
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[1]: :0:1e004800:200:d
DVA[2]: vdev_id 0 / 3c000000
DVA[2]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[2]: :0:3c000000:200:d
LSIZE:  200                             PSIZE: 200
ENDIAN: LITTLE                                  TYPE:  object directory
BIRTH:  4                  LEVEL: 0     FILL:  100000000
CKFUNC: fletcher4                       COMP:  uncompressed
CKSUM:  5d4dec3ac:1e59c2be429:5825c81154e8:b9b170eedd49e
$q
#

We'll use zdb to find out where ZFS has put the 0x200 byte object directory.


# ./zdb -Z tank:0:4800:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 1, offset = e00, size = 200
devidx = 2, offset = e00, size = 200
#

So, the parity is on the second disk (devidx = 1), and the object directory (a ZAP object), is on the third disk.

We'll convert the offset into a block address.


# mdb
> (e00>>9)+(400000>>9)=D
              8199

And dump the 0x200 (i.e., 512-byte) block.


# dd if=/export/home/max/r2 of=/tmp/objdir iseek=8199 count=1
1+0 records in
1+0 records out
#

The ZAP object is not compressed (see the above blkptr_t). So, no need to uncompress. We'll use mdb to look at the zap.


# mdb /tmp/objdir
> 0/J
0:              8000000000000003   <-- a microzap object
>

> 0::print -a -t zfs`mzap_phys_t
{
  0 uint64_t mz_block_type = 0x8000000000000003
  8 uint64_t mz_salt = 0x32064dbb
  10 uint64_t mz_normflags = 0
  18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
  40 mzap_ent_phys_t [1] mz_chunk = [
      {
          40 uint64_t mze_value = 0x2
          48 uint32_t mze_cd = 0
          4c uint16_t mze_pad = 0
          4e char [50] mze_name = [ "root_dataset" ]
      }
  ]
}
$q
#

There are more mzap_ent_phys_t in the object, but we are only concerned with the root dataset. This is object id 2, so we'll go back to the MOS, and examine the dnode_phys_t at index 2.


# mdb /tmp/mos
> 2*200::print -a zfs`dnode_phys_t  <-- Each dnode_phys_t is 0x200 bytes
{
     400 dn_type = 0xc  <-- DMU_OT_DSL_DIR
 ...
     404 dn_bonustype = 0xc  <-- DMU_OT_DSL_DIR
 ...
     4c0 dn_bonus = [ 0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0, 0x10, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, ... ]
}

The bonus buffer contains a dsl_dir_phys_t.


> 4c0::print -a zfs`dsl_dir_phys_t
{
  4c0 dd_creation_time = 0x49db7539
  4c8 dd_head_dataset_obj = 0x10
...
}
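
The dn_bonus array is just the raw bytes of the structure, in little-endian order (ENDIAN: LITTLE in the blkptr_t output). As a small Python sketch, the first 16 bytes of the bonus buffer shown above decode to the two fields printed by mdb:

```python
import struct

# First 16 bytes of dn_bonus from the DSL directory dnode at 0x4c0.
bonus = bytes([0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0,
               0x10, 0, 0, 0, 0, 0, 0, 0])

# Two little-endian 64-bit integers.
dd_creation_time, dd_head_dataset_obj = struct.unpack("<2Q", bonus)
print(hex(dd_creation_time), hex(dd_head_dataset_obj))
```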

The DSL DataSet is object id 0x10 (dd_head_dataset_obj = 0x10).


> 10*200::print -a zfs`dnode_phys_t
{
  2000 dn_type = 0x10  <-- DMU_OT_DSL_DATASET
 ...
  2004 dn_bonustype = 0x10  <-- bonus buffer contains dsl_dataset_phys_t
 ...
  20c0 dn_bonus = [ 0x2, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, 0x1, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ] 
}

Now, the dsl_dataset_phys_t in the bonus buffer of the DSL DataSet dnode.


> 20c0::print -a zfs`dsl_dataset_phys_t
{
  20c0 ds_dir_obj = 0x2
...
  2140 ds_bp = {
      2140 blk_dva = [
          {
              2140 dva_word = [ 0x2, 0xf0038 ]
          }
          {
              2150 dva_word = [ 0x2, 0x6 ]
          }
          {
              2160 dva_word = [ 0, 0 ]
          }
      ]
...
}

The blkptr_t at 0x2140 will give us the objset_phys_t of the root dataset of the file system.


> 2140::blkptr
DVA[0]: vdev_id 0 / 1e007000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[0]: :0:1e007000:200:d
DVA[1]: vdev_id 0 / c00
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[1]: :0:c00:200:d
LSIZE:  400                             PSIZE: 200
ENDIAN: LITTLE                                  TYPE:  DMU objset
BIRTH:  500                LEVEL: 0     FILL:  a00000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  6bb79a7b2:2e0d64756dd:9fc17017938b:176b8a4b6c4756
$q
#

Now get the locations where the file system objset_phys_t resides.


# ./zdb -Z tank:0:1e007000:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 1, offset = 6001600, size = 200
devidx = 2, offset = 6001600, size = 200
#

So, parity is on the second disk, and the data is on the third disk, both at offset 0x6001600.


# mdb
> (6001600>>9)+(400000>>9)=D
              204811

And again use dd to dump the compressed objset_phys_t to a file.


# dd if=/export/home/max/r2 of=/tmp/dmu_objset_comp iseek=204811 count=1
1+0 records in
1+0 records out
#

And uncompress the objset_phys_t.


# ./zuncompress -l 400 -p 200 /tmp/dmu_objset_comp > /tmp/dmu_objset
#

Now, mdb to examine the objset_phys_t.


# mdb /tmp/dmu_objset
> 0::print -a zfs`objset_phys_t
{
  0 os_meta_dnode = {
      0 dn_type = 0xa  <-- DMU_OT_DNODE
      1 dn_indblkshift = 0xe
      2 dn_nlevels = 0x7  <-- 7 levels, i.e. 6 levels of indirection
 ...
     40 dn_blkptr = [
             {
                 40 blk_dva = [
                     {
                         40 dva_word = [ 0x4, 0xf0020 ]
                     }
                     {
                         50 dva_word = [ 0x4, 0x20 ]
                     }
                     {
                         60 dva_word = [ 0, 0 ]
                     }
                 ]
 ...
}
> 40::blkptr
DVA[0]: vdev_id 0 / 1e004000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[0]: :0:1e004000:400:id
DVA[1]: vdev_id 0 / 4000
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[1]: :0:4000:400:id
LSIZE:  4000                            PSIZE: 400
ENDIAN: LITTLE                                  TYPE:  DMU dnode
BIRTH:  500                LEVEL: 6     FILL:  900000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  5b884586fa:3f9c7d79ba1f:17674db0ee38e0:6077d2d63aa75b6
$q
#

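There are 6 levels of indirection (levels 6 through 1) above the level 0 blocks that hold the dnode_phys_t array for the file system (which I'll loosely call the MOS for the file system). Next, we'll get the disk locations for the 6th level of indirection.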


# ./zdb -Z tank:0:1e004000:400
Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1
devidx = 2, offset = 6000c00, size = 200
devidx = 3, offset = 6000c00, size = 200
devidx = 4, offset = 6000c00, size = 200
#

So, the third disk contains parity, and the fourth and fifth disks contain the indirect block.


# mdb
> (6000c00>>9)+(400000>>9)=D
              204806

Again, we'll use dd to get the data from the 2 disks, then concatenate the dd outputs, then uncompress.


# dd if=/export/home/max/r3 of=/tmp/i6_1z iseek=204806 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r4 of=/tmp/i6_2z iseek=204806 count=1
1+0 records in
1+0 records out
# cat /tmp/i6*z > /tmp/i6_z
#

Now, uncompress. The size after decompression is 0x4000 bytes, as specified in the LSIZE field of the blkptr_t.


# ./zuncompress -l 4000 -p 400 /tmp/i6_z > /tmp/i6
#

And use mdb to examine the blkptr_t structures. We are only interested in the first one, since it will take us to the beginning dnode_phys_t in the file system.


# mdb/intel/ia32/mdb/mdb /tmp/i6
> 0::blkptr
DVA[0]: vdev_id 0 / 1e003800
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[0]: :0:1e003800:400:id
DVA[1]: vdev_id 0 / 3800
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[1]: :0:3800:400:id
LSIZE:  4000                            PSIZE: 400
ENDIAN: LITTLE                                  TYPE:  DMU dnode
BIRTH:  500                LEVEL: 5     FILL:  900000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  59defe2103:3e0ac53edc13:16a8c688ba6d69:5cafeb97a9046d7
$q
#

Now at level 5, we again need to know where the data is on the physical disks.


# ./zdb -Z tank:0:1e003800:400
Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1
devidx = 3, offset = 6000a00, size = 200
devidx = 4, offset = 6000a00, size = 200
devidx = 0, offset = 6000c00, size = 200
#

So, parity on fourth disk and data on fifth and first.


# mdb
> (6000a00>>9)+(400000>>9)=D
              204805       
> (6000c00>>9)+(400000>>9)=D
              204806

And use dd to dump the blocks.


# dd if=/export/home/max/r4 of=/tmp/i5_1z iseek=204805 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r0 of=/tmp/i5_2z iseek=204806 count=1
1+0 records in
1+0 records out
#

And concatenate...


# cat /tmp/i5*z > /tmp/i5_z
#

And uncompress...


# ./zuncompress -p 400 -l 4000 /tmp/i5_z > /tmp/i5
#

And get to the 4th level of indirection...


# mdb /tmp/i5
> 0::blkptr
DVA[0]: vdev_id 0 / 1e003000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[0]: :0:1e003000:400:id
DVA[1]: vdev_id 0 / 3000
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[1]: :0:3000:400:id
LSIZE:  4000                            PSIZE: 400
ENDIAN: LITTLE                                  TYPE:  DMU dnode
BIRTH:  500                LEVEL: 4     FILL:  900000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  5aaaf038c7:3ecd4215b2cf:1705e4d4343d71:5e8d71a8535f678
$q
#

Rather than show all 6 levels, we'll jump ahead to the level 1 indirect block (uncompressed into /tmp/i1), whose first blkptr_t takes us to level 0.


# mdb /tmp/i1
> 0::blkptr
DVA[0]: vdev_id 0 / 1e001000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[0]: :0:1e001000:600:d
DVA[1]: vdev_id 0 / 1000
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[1]: :0:1000:600:d
LSIZE:  4000                            PSIZE: 600
ENDIAN: LITTLE                                  TYPE:  DMU dnode
BIRTH:  500                LEVEL: 0     FILL:  900000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  7e1ebca68d:4f0370c6d404:23a24df0937608:ce6838f39084f95
$q
#

Locate the data for the stripe:


# ./zdb -Z tank:0:1e001000:600
Columns = 4, bigcols = 4, asize = 800, firstdatacol = 1
devidx = 3, offset = 6000200, size = 200
devidx = 4, offset = 6000200, size = 200
devidx = 0, offset = 6000400, size = 200
devidx = 1, offset = 6000400, size = 200
#

# mdb
> (6000200>>9)+(400000>>9)=D
              204801       
> (6000400>>9)+(400000>>9)=D
              204802

Get the data from the individual disks...


# dd if=/export/home/max/r4 of=/tmp/fs_mos_1z iseek=204801 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r0 of=/tmp/fs_mos_2z iseek=204802 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r1 of=/tmp/fs_mos_3z iseek=204802 count=1
1+0 records in
1+0 records out
#

Concatenate the data...


# cat /tmp/fs_mos_* > /tmp/fs_mos_z
#

Uncompress...


# ./zuncompress -l 4000 -p 600 /tmp/fs_mos_z > /tmp/fs_mos
#

We should now be at the MOS for the root data set, that is, the array of dnode_phys_t for the objects in the file system.


# mdb /tmp/fs_mos
> 0,20::print -a zfs`dnode_phys_t
{
  0 dn_type = 0 <-- first is not used
 ...
}
{
     200 dn_type = 0x15  <-- DMU_OT_MASTER
     ...
     240 dn_blkptr = [
         {
             240 blk_dva = [
                 {
                     240 dva_word = [ 0x2, 0 ]
                 }
                 {
                     250 dva_word = [ 0x2, 0xf0000 ]
                 }
                 {
                     260 dva_word = [ 0, 0 ]
                 }
             ]
 ...
}
 ...
{
     600 dn_type = 0x14  <-- DMU_OT_DIRECTORY_CONTENTS (probably for root of fs)
     ...
     604 dn_bonustype = 0x11  <-- bonus buffer is a znode_phys_t
     ...
     640 dn_blkptr = [
         {
             640 blk_dva = [
                 {
                     640 dva_word = [ 0x2, 0xf0006 ]
                 }
         ...
     6c0 dn_bonus = [ 0x26, 0xa0, 0xdb, 0x49, 0, 0, 0, 0, 0x8e, 0xf0, 0xf7, 0x25,  0, 0, 0, 0, 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0xf3, 0x80, 0xdd, 0x34, 0, 0, 0 , 0, ... ] 
} 
{
     800 dn_type = 0x13  <-- DMU_OT_PLAIN_FILE_CONTENTS
     ...
     804 dn_bonustype = 0x11  <-- bonus buffer is znode_phys_t
     ...
     840 dn_blkptr = [
         {
             840 blk_dva = [
                 {
                     840 dva_word = [ 0x2, 0xf0004 ]
                 }
...
     8c0 dn_bonus = [ 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0x5e, 0xe2, 0xdc, 0x34,  0, 0, 0, 0, 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0x58, 0x9e, 0xde, 0x34, 0, 0, 0 , 0, ... ] 
}
 ... 
>

Strictly speaking, we should start with the ZAP object specified by the blkptr_t for the master node to get to the root directory object of the file system. Instead, as a shortcut, we'll assume the dnode_phys_t at 0x600 is for the root of the file system, and we'll dump its blkptr_t. This should be for a ZAP object which contains the list of files in the directory.


> 640::blkptr
DVA[0]: vdev_id 0 / 1e000c00
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[0]: :0:1e000c00:200:d
DVA[1]: vdev_id 0 / 800
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[1]: :0:800:200:d
LSIZE:  200                             PSIZE: 200
ENDIAN: LITTLE                                  TYPE:  ZFS directory
BIRTH:  500                LEVEL: 0     FILL:  100000000
CKFUNC: fletcher4                       COMP:  uncompressed
CKSUM:  60d062a16:197ca3c8839:4691877f93d3:946b572aca5a2
$q
#

Find the location on the disk(s) for the directory ZAP object.


# ./zdb -Z tank:0:1e000c00:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 1, offset = 6000200, size = 200
devidx = 2, offset = 6000200, size = 200
# mdb
> (6000200>>9)+(400000>>9)=D
              204801

Dump the contents.


# dd if=/export/home/max/r2 of=/tmp/rootdir iseek=204801 count=1
1+0 records in
1+0 records out
#

Examine the directory.


# mdb /tmp/rootdir
> ::sizeof zfs`mzap_phys_t
sizeof (zfs`mzap_phys_t) = 0x80
> ::sizeof zfs`mzap_ent_phys_t
sizeof (zfs`mzap_ent_phys_t) = 0x40
> 0::print zfs`mzap_phys_t
{
  mz_block_type = 0x8000000000000003
  mz_salt = 0x14187c7
  mz_normflags = 0
  mz_pad = [ 0, 0, 0, 0, 0 ]
  mz_chunk = [
      {
          mze_value = 0x8000000000000004
          mze_cd = 0
          mze_pad = 0
          mze_name = [ "smallfile" ]
      }
  ]
}
> (200-80)%40=K
              6  <-- there are 6 more mzap_ent_phys_t
> 80,6::print zfs`mzap_ent_phys_t
 {
     mze_value = 0
     mze_cd = 0
     mze_pad = 0
     mze_name = [ '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0 ', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',  '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', ... ]
 }
 {
     mze_value = 0x8000000000000006
     mze_cd = 0
     mze_pad = 0
     mze_name = [ "words" ]  <-- here is the file we want, object id is 6
 }
 {
     mze_value = 0x8000000000000007
     mze_cd = 0
     mze_pad = 0
     mze_name = [ "foo" ]
 }
 ...
$q
#
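
The microzap block is simple enough to parse by hand. The Python sketch below walks the 0x40-byte entries that follow the 0x40-byte header, using the field offsets printed by mdb above. Splitting mze_value into an object number (low 48 bits) and a type (top 4 bits) is an assumption about how ZFS directory entries are encoded; for the MOS object directory the value is simply the object number. The function name mzap_entries is made up for this example.

```python
import struct

MZAP_HDR = 0x40   # mz_block_type, mz_salt, mz_normflags, mz_pad[5]
MZAP_ENT = 0x40   # mze_value, mze_cd, mze_pad, mze_name[50]

def mzap_entries(block):
    """Return (name, object number, type bits) for each used microzap entry."""
    entries = []
    for off in range(MZAP_HDR, len(block) - MZAP_ENT + 1, MZAP_ENT):
        value, _cd, _pad, raw = struct.unpack_from("<QIH50s", block, off)
        name = raw.split(b"\0", 1)[0].decode("ascii", "replace")
        if name:                                   # skip unused slots
            obj = value & ((1 << 48) - 1)          # assumed: low 48 bits
            dtype = value >> 60                    # assumed: top 4 bits
            entries.append((name, obj, dtype))
    return entries

# Hypothetical usage on the block dumped above (not run here):
# print(mzap_entries(open("/tmp/rootdir", "rb").read()))
```

A 0x200-byte block has room for 7 entries after the header, which matches the one entry inside mzap_phys_t plus the 6 more counted above.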

Now, go back to the file system MOS to look at object id 6. If the object ID were 32 (0x20) or greater, there would have been more work to do, following other block pointers in the indirect blocks from the objset_phys_t for the file system. Above, we assumed that the root directory for the file system would have a low object number, and, fortunately, the file we want to examine also has a low object number.
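
To see why low object numbers are convenient, here is a Python sketch of where a given object's dnode_phys_t lives. It assumes what the output above shows: 0x200-byte dnodes, 0x4000-byte dnode blocks (32 dnodes each), and 0x4000-byte indirect blocks of 0x80-byte blkptr_t (128 each). The function name dnode_path is made up for this illustration.

```python
DNODE_SHIFT = 9     # sizeof (dnode_phys_t) = 0x200
BLKPTR_SHIFT = 7    # sizeof (blkptr_t) = 0x80

def dnode_path(obj, nlevels, datablkshift=14, indblkshift=14):
    """Return ([(level, blkptr index), ...], byte offset in the level 0 block)."""
    per_blk_shift = datablkshift - DNODE_SHIFT    # 32 dnodes per block
    epbs = indblkshift - BLKPTR_SHIFT             # 128 blkptr_t per indirect
    blkid = obj >> per_blk_shift                  # which level 0 block
    path = [(level, (blkid >> (epbs * (level - 1))) & ((1 << epbs) - 1))
            for level in range(nlevels - 1, 0, -1)]
    offset = (obj & ((1 << per_blk_shift) - 1)) << DNODE_SHIFT
    return path, offset

path, offset = dnode_path(6, 7)
print(path, hex(offset))
```

For object 6 every index is 0, so only the first blkptr_t at each level matters. For object 32, the index at level 1 becomes 1, meaning the second blkptr_t (at 0x80) in the level 1 indirect block.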


# mdb /tmp/fs_mos
> (6*200)::print -a zfs`dnode_phys_t
{
  c00 dn_type = 0x13  <-- plain file contents
  c01 dn_indblkshift = 0xe
  c02 dn_nlevels = 0x2  <-- one level of indirection
  c03 dn_nblkptr = 0x1
  c04 dn_bonustype = 0x11 <-- bonus buffer contains znode_phys_t
     ...
  c40 dn_blkptr = [
         {
             c40 blk_dva = [
                 {
                     c40 dva_word = [ 0x4, 0x5ec ]
                 }
                 {
                     c50 dva_word = [ 0x4, 0xf00ec ]
                 }
                 {
                     c60 dva_word = [ 0, 0 ]
                 }
             ]
 ...
  cc0 dn_bonus = [ 0x22, 0x86, 0xdb, 0x49, 0, 0, 0, 0, 0x9, 0x31, 0x20, 0x28, 0, 0, 0, 0, 0x22, 0x86, 0xdb, 0x49, 0, 0, 0, 0, 0x71, 0x48, 0x2b, 0x28, 0, 0, 0,  0, ... ] 
}

A quick look at the znode_phys_t in the bonus buffer...


> cc0::print zfs`znode_phys_t
{
  zp_atime = [ 0x49db8622, 0x28203109 ]
  zp_mtime = [ 0x49db8622, 0x282b4871 ]
  zp_ctime = [ 0x49db8622, 0x282b4871 ]
  zp_crtime = [ 0x49db8622, 0x28203109 ]
  zp_gen = 0x97
  zp_mode = 0x8124
  zp_size = 0x32752
  zp_parent = 0x3
  zp_links = 0x1
  zp_xattr = 0
  zp_rdev = 0
  zp_flags = 0x40800000004
  zp_uid = 0
  zp_gid = 0
  zp_zap = 0
  zp_pad = [ 0, 0, 0 ]
  zp_acl = {
      z_acl_extern_obj = 0
      z_acl_size = 0x30
      z_acl_version = 0x1
      z_acl_count = 0x6
      z_ace_data = [ 0x1, 0, 0, 0x10, 0x26, 0, 0, 0, 0, 0, 0, 0x10, 0x11, 0x1,
0xc, 0, 0x1, 0, 0x40, 0x20, 0x26, 0, 0, 0, 0, 0, 0x40, 0x20, 0x1, 0, 0, 0, ...
]
  }
}

When was the file created?


> 49db8622=Y
              2009 Apr  7 18:58:10
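
The same value can be converted outside of mdb. This small Python sketch prints the time in UTC; the result is two hours behind the time mdb printed, which suggests mdb displayed it in the local time zone of the machine.

```python
from datetime import datetime, timezone

crtime = 0x49db8622   # first word of zp_crtime, seconds since the epoch
when = datetime.fromtimestamp(crtime, tz=timezone.utc)
print(when.isoformat())
```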

Now, let's look at the blkptr_t.


> c40::blkptr
DVA[0]: vdev_id 0 / bd800
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[0]: :0:bd800:400:id
DVA[1]: vdev_id 0 / 1e01d800
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[1]: :0:1e01d800:400:id
LSIZE:  4000                            PSIZE: 400
ENDIAN: LITTLE                                  TYPE:  ZFS plain file
BIRTH:  97                 LEVEL: 1     FILL:  200000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  600e97db0e:411c4ea86350:1790b46d936d46:602547566d07cc7
$q
#

We're at level 1.


# ./zdb -Z tank:0:bd800:400
Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1
devidx = 1, offset = 25e00, size = 200
devidx = 2, offset = 25e00, size = 200
devidx = 3, offset = 25e00, size = 200
#

# mdb
> (25e00>>9)+(400000>>9)=D
              8495         

# dd if=/export/home/max/r2 of=/tmp/words_i1z iseek=8495 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r3 of=/tmp/words_i2z iseek=8495 count=1
1+0 records in
1+0 records out
# cat /tmp/words_*z > /tmp/words_iz

Uncompress...


# ./zuncompress -l 4000 -p 400 /tmp/words_iz > /tmp/words_i

So, /tmp/words_i should contain the uncompressed blkptr_t structures. These blkptr_t should take us to the data for the words file.


# mdb /tmp/words_i
> 0::blkptr
DVA[0]: vdev_id 0 / c0000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 2800000000000
DVA[0]: :0:c0000:20000:d
LSIZE:  20000                           PSIZE: 20000
ENDIAN: LITTLE                                  TYPE:  ZFS plain file
BIRTH:  97                 LEVEL: 0     FILL:  100000000
CKFUNC: fletcher2                       COMP:  uncompressed
CKSUM:  f5cbf93a151abcac:5b5d6ca83588d8ad:574d9b8bf334944b:ad78d30af51771d8
> 80::blkptr
DVA[0]: vdev_id 0 / e8000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 2800000000000
DVA[0]: :0:e8000:20000:d
LSIZE:  20000                           PSIZE: 20000
ENDIAN: LITTLE                                  TYPE:  ZFS plain file
BIRTH:  97                 LEVEL: 0     FILL:  100000000
CKFUNC: fletcher2                       COMP:  uncompressed
CKSUM:  f39ae34f048ae079:de2ef1af7d1fb495:ec3ae3f7985b2a98:c6d33ac68cb042b6
$q
#

So, where is the data?


# ./zdb -Z tank:0:c0000:20000
Columns = 5, bigcols = 0, asize = 28000, firstdatacol = 1
devidx = 1, offset = 26600, size = 8000
devidx = 2, offset = 26600, size = 8000
devidx = 3, offset = 26600, size = 8000
devidx = 4, offset = 26600, size = 8000
devidx = 0, offset = 26800, size = 8000

A little hex to decimal conversion for dd...


# mdb
> (26600>>9)+(400000>>9)=D
              8499         
> (26800>>9)+(400000>>9)=D
              8500         
> 8000>>9=D
              64

Now, dump the blocks...


# dd if=/export/home/max/r2 of=/tmp/w1 iseek=8499 bs=512 count=64
64+0 records in
64+0 records out
# dd if=/export/home/max/r3 of=/tmp/w2 iseek=8499 bs=512 count=64
64+0 records in
64+0 records out
# dd if=/export/home/max/r4 of=/tmp/w3 iseek=8499 bs=512 count=64
64+0 records in
64+0 records out
# dd if=/export/home/max/r0 of=/tmp/w4 iseek=8500 bs=512 count=64
64+0 records in
64+0 records out

And concatenate the 4 files...


# cat /tmp/w[1-4]
10th
1st
2nd
3rd
4th
5th
6th
7th
8th
9th
a
AAA
AAAS
Aarhus
Aaron
AAU
ABA
Ababa
aback
...
Nostrand
nostril
not
notary
notate
notch
note
notebook
noteworthy
nothing
notice
notice#

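That is the first 128K of the file. To get the remainder, we would need to follow the next blkptr_t at level 1 (the one at 0x80 shown above, which points to 0:e8000). Since zp_size is 0x32752, that second block holds the rest of the file. But, not today...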

Sunday, December 20, 2009
