Max Bruning's weblog: 2009

Sunday, December 20, 2009

ZFS Raidz Data Walk

Several months ago, I wrote in my blog about raidz on disk format (see http://mbruning.blogspot.com/2009/04/raidz-on-disk-format.html). In that blog, I went over the high level details. Here, I thought I would show the low level stuff that I did to determine the layout. I am using a modified zdb and mdb to walk through the on-disk data structures to find the data for a copy of the /usr/dict/words file that I made on a raidz file system.

The raidz volume contains 5 equal size devices. Since I don't have 5 disks lying around, I created 5 equal sized files (/export/home/max/r0 through /export/home/max/r4). I'll use the term disk throughout this discussion to refer to one of these files.


# zpool status -v tank
pool: tank
state: ONLINE
scrub: none requested
config:

  NAME                     STATE     READ WRITE CKSUM
  tank                     ONLINE       0     0     0
    raidz1                 ONLINE       0     0     0
      /export/home/max/r0  ONLINE       0     0     0
      /export/home/max/r1  ONLINE       0     0     0
      /export/home/max/r2  ONLINE       0     0     0
      /export/home/max/r3  ONLINE       0     0     0
      /export/home/max/r4  ONLINE       0     0     0

errors: No known data errors
#

I'll umount the file system so things don't change while I'm examining the on-disk structures.


# zfs umount tank
#

And, as I have done in the past, I walk the data structures to get to the "words" file by starting at the uberblock_t. If you get lost during this walk, you can always refer to the diagram "ZFS On-Disk Layout - The Big Picture", page 4 in http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf from the OpenSolaris Developer's Conference, 2008 in Prague.

First, the active uberblock_t.


# zdb -uuu tank
Uberblock

  magic = 0000000000bab10c
  version = 13
  txg = 1280
  guid_sum = 6800651560363961243
  timestamp = 1239197133 UTC = Wed Apr  8 15:25:33 2009
  rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:1e007400:400> DVA[1]=<0:9400:400> DVA[2]=<0:3c003800:400> fletcher4 lzjb LE contiguous birth=1280 fill=27 cksum=9ad89e117:40b4956a12c:db76af09e62f:1f779fd1db6115
#

Now, I use a new command I added to zdb to allow me to see the raidz mapping. The "-Z" option takes the pool name, device id, location, and physical size as arguments, and prints the device index, location, and size for each piece of the corresponding data plus parity.


# ./zdb -Z tank:0:1e007400:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 3, offset = 6001600, size = 200
devidx = 4, offset = 6001600, size = 200
#

So, the 0x200 byte parity is on the fourth disk (devidx = 3), and the 0x200 byte objset_phys_t is on the fifth disk (devidx = 4). (Of course, either one will work since there are only 2).

Now, convert the hex offset to an absolute decimal block number. The 0x400000 skips the disk labels at the front of each device in the volume.


# mdb
> (6001600>>9)+(400000>>9)=D
              204811

Attempting to use zdb with the -R option to read the blocks causes a assertion failure in zdb (at least, that was the state back in April, when I wrote the original blog on raidz). So, instead I use dd to dump the raw data into a file.


# dd if=/export/home/max/r4 of=/tmp/objset_t bs=512 count=1 iseek=204811
1+0 records in
1+0 records out
#

Now, I'll uncompress the data. The size after decompression should be 0x400 bytes (as specified in the block pointer in the uberblock_t above). For this, I use a utility I wrote called zuncompress. This utility takes an option which allows one to specify the compression algorithm used. The default is lzjb. The output should be the objset_phys_t for the meta object set (MOS).


# ./zuncompress -p 200 -l 400 /tmp/objset_t > /tmp/objset
#

And now, I'll use my modified mdb to print the objset_phys_t.


# mdb /tmp/objset
> 0::print -a zfs`objset_phys_t
{
  0 os_meta_dnode = {
      0 dn_type = 0xa  <-- DMU_OT_DNODE 
      1 dn_indblkshift = 0xe 
      2 dn_nlevels = 0x1
 ...         
     40 dn_blkptr = [
             {      
          40 blk_dva = [  
                {      
                   40 dva_word = [ 0x8, 0xf0050 ] 
                    }    
                { 
                   50 dva_word = [ 0x8, 0x40 ] 
                } 
                { 
                   60 dva_word = [ 0x8, 0x1e0028 ] 
                }  
           ]
 ... 
}

And the blkptr_t at 0x40:


> 40::blkptr
DVA[0]: vdev_id 0 / 1e00a000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 100000000000
DVA[0]: :0:1e00a000:a00:d
DVA[1]: vdev_id 0 / 8000
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 100000000000
DVA[1]: :0:8000:a00:d
DVA[2]: vdev_id 0 / 3c005000
DVA[2]:       GANG: FALSE  GRID:  0000  ASIZE: 100000000000
DVA[2]: :0:3c005000:a00:d
LSIZE:  4000                            PSIZE: a00
ENDIAN: LITTLE                                  TYPE:  DMU dnode
BIRTH:  500                LEVEL: 0     FILL:  1a00000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  a182339fe8:ded0f7be7047:bcb1c1a96b94cc:765bd519587bfb41
$q
#

So, "LEVEL: 0" means no indirection. The next object is the MOS, which is an array of dnode_phys_t. Let's see how the MOS is layed out on the raidz volume.


# ./zdb -Z tank:0:1e00a000:a00
Columns = 5, bigcols = 2, asize = 1000, firstdatacol = 1
devidx = 0, offset = 6002000, size = 400
devidx = 1, offset = 6002000, size = 400
devidx = 2, offset = 6002000, size = 200
devidx = 3, offset = 6002000, size = 200
devidx = 4, offset = 6002000, size = 200
#

So, disk 0 contains parity, and disks 1, 2, 3, and 4 contain the MOS. The MOS is compressed with lzjb compression. We'll use dd to dump the 4 blocks containing the MOS to a file, then decompress the MOS.

I'll use mdb to translate the blkptr DVA address to a block on the disks. Note that all blocks in this example are at the same location (0x6002000).


# mdb
> (6002000>>9)+(400000>>9)=D
              204816

And now dd each of the blocks. The first disk (/export/home/max/r0) is parity. The second disk contains 0x400 bytes. The other 3 disks contain 0x200 bytes each. So total size of compressed data is 0x400 + 0x200 + 0x200 + 0x200, or 0xa00 bytes, which agrees with the PSIZE field in the blkptr_t. Note that size of the parity block must be equal to the size of the largest block (0x400).


# dd if=/export/home/max/r1 of=/tmp/mos_z1 iseek=204816 count=2
2+0 records in
2+0 records out
# dd if=/export/home/max/r2 of=/tmp/mos_z2 iseek=204816 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r3 of=/tmp/mos_z3 iseek=204816 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r4 of=/tmp/mos_z4 iseek=204816 count=1
1+0 records in
1+0 records out
#

Now, concatenate the files to get the compressed MOS.


# cat /tmp/mos_z* > /tmp/mos_comp

And uncompress. The size after decompression, according to the blkptr is 0x4000 (LSIZE in the blkptr).


# ./zuncompress -l 4000 -p a00 /tmp/mos_comp > /tmp/mos

And I'll use the modified mdb to dump out the MOS.


# mdb /tmp/mos
> ::sizeof zfs`dnode_phys_t
sizeof (zfs`dnode_phys_t) = 0x200

> 4000%200=K
              20              <-- There are 32 dnode_phys_t in the MOS
> 0,20::print -a zfs`dnode_phys_t
{
       0 dn_type = 0  <-- DMU_OT_NONE, first is unused
 ... 
} 
{ 
 200 dn_type = 0x1  <-- DMU_OT_OBJECT_DIRECTORY
 ...
     240 dn_blkptr = [
         {
             240 blk_dva = [
                 {
                     240 dva_word = [ 0x2, 0x24 ]
                 }
 ...
}
{
     400 dn_type = 0xc  <-- DMU_OT_DSL_DIR (DSL Directory)
     ...
     404 dn_bonustype = 0xc
 ...
     4c0 dn_bonus = [ 0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0, 0x10, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0,
 ... ]
}
{
     600 dn_type = 0xf
 ...
{
    1600 dn_type = 0x10  <-- DMU_OT_DSL_DATASET (DSL DataSet)
 ...
    1604 dn_bonustype = 0x10
 ...
    16c0 dn_bonus = [ 0x8, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, 0x1, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 ... ]
}
 ...

The blkptr_t at 0x240 is for the Object Directory. Let's take a closer look.


> 240::blkptr
DVA[0]: vdev_id 0 / 4800
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[0]: :0:4800:200:d
DVA[1]: vdev_id 0 / 1e004800
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[1]: :0:1e004800:200:d
DVA[2]: vdev_id 0 / 3c000000
DVA[2]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[2]: :0:3c000000:200:d
LSIZE:  200                             PSIZE: 200
ENDIAN: LITTLE                                  TYPE:  object directory
BIRTH:  4                  LEVEL: 0     FILL:  100000000
CKFUNC: fletcher4                       COMP:  uncompressed
CKSUM:  5d4dec3ac:1e59c2be429:5825c81154e8:b9b170eedd49e
$q
#

We'll use zdb to find out where ZFS has put the 0x200 byte object directory.


# ./zdb -Z tank:0:4800:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 1, offset = e00, size = 200
devidx = 2, offset = e00, size = 200
#

So, the parity is on the second disk (devidx = 1), and the object directory (a ZAP object), is on the third disk.

We'll convert the offset into a block address.


# mdb
> (e00>>9)+(400000>>9)=D
              8199

And dump the 0x200 (i.e, 512byte) block.


# dd if=/export/home/max/r2 of=/tmp/objdir iseek=8199 count=1
1+0 records in
1+0 records out
#

The ZAP object is not compressed (see the above blkptr_t). So, no need to uncompress. We'll use mdb to look at the zap.


# mdb /tmp/objdir
> 0/J
0:              8000000000000003   <-- a microzap object
>

> 0::print -a -t zfs`mzap_phys_t
{
  0 uint64_t mz_block_type = 0x8000000000000003
  8 uint64_t mz_salt = 0x32064dbb
  10 uint64_t mz_normflags = 0
  18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
  40 mzap_ent_phys_t [1] mz_chunk = [
      {
          40 uint64_t mze_value = 0x2
          48 uint32_t mze_cd = 0
          4c uint16_t mze_pad = 0
          4e char [50] mze_name = [ "root_dataset" ]
      }
  ]
}
$q
#

There are more mzap_ent_phys_t in the object, but we are only concerned with the root dataset. This is object id 2, so we'll go back to the MOS, and examine the dnode_phys_t at index 2.


# mdb /tmp/mos
> 2*200::print -a zfs`dnode_phys_t  <-- Each dnode_phys_t is 0x200 bytes
{
     400 dn_type = 0xc  <-- DMU_OT_DSL_DIR
 ...
     404 dn_bonustype = 0xc  <-- DMU_OT_DSL_DIR
 ...
     4c0 dn_bonus = [ 0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0, 0x10, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, ... ]
}

The bonus buffer contains a dsl_dir_phys_t.


> 4c0::print -a zfs`dsl_dir_phys_t
{
  4c0 dd_creation_time = 0x49db7539
  4c8 dd_head_dataset_obj = 0x10
...
}

The DSL DataSet is object id 0x10 (dd_head_dataset_obj = 0x10).


> 10*200::print -a zfs`dnode_phys_t
{
  2000 dn_type = 0x10  <-- DMU_OT_DSL_DATASET
 ...
  2004 dn_bonustype = 0x10  <-- bonus buffer contains dsl_dataset_phys_t
 ...
  20c0 dn_bonus = [ 0x2, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, 0x1, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ] 
}

Now, the dsl_dataset_phys_t in the bonus buffer of the DSL DataSet dnode.


> 20c0::print -a zfs`dsl_dataset_phys_t
{
  20c0 ds_dir_obj = 0x2
...
  2140 ds_bp = {
      2140 blk_dva = [
          {
              2140 dva_word = [ 0x2, 0xf0038 ]
          }
          {
              2150 dva_word = [ 0x2, 0x6 ]
          }
          {
              2160 dva_word = [ 0, 0 ]
          }
      ]
...
}

The blkptr_t at 0x2140 will give us the objset_phys_t of the root dataset of the file system.


> 2140::blkptr
DVA[0]: vdev_id 0 / 1e007000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[0]: :0:1e007000:200:d
DVA[1]: vdev_id 0 / c00
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[1]: :0:c00:200:d
LSIZE:  400                             PSIZE: 200
ENDIAN: LITTLE                                  TYPE:  DMU objset
BIRTH:  500                LEVEL: 0     FILL:  a00000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  6bb79a7b2:2e0d64756dd:9fc17017938b:176b8a4b6c4756
$q
#

Now get the locations where the file system objset_phys_t resides.


# ./zdb -Z tank:0:1e007000:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 1, offset = 6001600, size = 200
devidx = 2, offset = 6001600, size = 200
#

So, parity is on the second disk, and the data is on the third disk, both at offset 0x6001600.


# mdb
(6001600>>9)+(400000>>9)=D
              204811

And again use dd to dump the compressed objset_phys_t to a file.


# dd if=/export/home/max/r2 of=/tmp/dmu_objset_comp iseek=204811 count=1
1+0 records in
1+0 records out
#

And uncompress the objset_phys_t.


# ./zuncompress -l 400 -p 200 /tmp/dmu_objset_comp > /tmp/dmu_objset
#

Now, mdb to example the objset_phys_t.


# mdb /tmp/dmu_objset
> 0::print -a zfs`objset_phys_t
{
  0 os_meta_dnode = {
      0 dn_type = 0xa  <-- DMU_OT_DNODE
      1 dn_indblkshift = 0xe
      2 dn_nlevels = 0x7  <-- 7 levels of indirection
 ...
     40 dn_blkptr = [
             {
                 40 blk_dva = [
                     {
                         40 dva_word = [ 0x4, 0xf0020 ]
                     }
                     {
                         50 dva_word = [ 0x4, 0x20 ]
                     }
                     {
                         60 dva_word = [ 0, 0 ]
                     }
                 ]
 ...
}
> 40::blkptr
DVA[0]: vdev_id 0 / 1e004000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[0]: :0:1e004000:400:id
DVA[1]: vdev_id 0 / 4000
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[1]: :0:4000:400:id
LSIZE:  4000                            PSIZE: 400
ENDIAN: LITTLE                                  TYPE:  DMU dnode
BIRTH:  500                LEVEL: 6     FILL:  900000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  5b884586fa:3f9c7d79ba1f:17674db0ee38e0:6077d2d63aa75b6
$q
#

There are 6 levels of indirection to get the MOS for the file system. Next, we'll get the disk locations for the 6th level of indirection.


# ./zdb -Z tank:0:1e004000:400
Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1
devidx = 2, offset = 6000c00, size = 200
devidx = 3, offset = 6000c00, size = 200
devidx = 4, offset = 6000c00, size = 200
#

So, the third disk contains parity, and the fourth and fifth disks contain the indirect block.


# mdb
> (6000c00>>9)+(400000>>9)=D
              204806

Again, we'll use dd to get the data from the 2 disks, then concatenate the dd outputs, then uncompress.


# dd if=/export/home/max/r3 of=/tmp/i6_1z iseek=204806 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r4 of=/tmp/i6_2z iseek=204806 count=1
1+0 records in
1+0 records out
# cat /tmp/i6*z > /tmp/i6_z
#

Now, uncompress. The size after decompression is 0x4000 bytes, as specified in the LSIZE field of the blkptr_t.


# ./zuncompress -l 4000 -p 400 /tmp/i6_z > /tmp/i6
#

And use mdb to examine the blkptr_t structures. We are only interested in the first one, since it will take us to the beginning dnode_phys_t in the file system.


# mdb/intel/ia32/mdb/mdb /tmp/i6
> 0::blkptr
DVA[0]: vdev_id 0 / 1e003800
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[0]: :0:1e003800:400:id
DVA[1]: vdev_id 0 / 3800
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[1]: :0:3800:400:id
LSIZE:  4000                            PSIZE: 400
ENDIAN: LITTLE                                  TYPE:  DMU dnode
BIRTH:  500                LEVEL: 5     FILL:  900000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  59defe2103:3e0ac53edc13:16a8c688ba6d69:5cafeb97a9046d7
$q
#

Now at level 5, we again need to know where on the physical disks are the data.


# ./zdb -Z tank:0:1e003800:400
Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1
devidx = 3, offset = 6000a00, size = 200
devidx = 4, offset = 6000a00, size = 200
devidx = 0, offset = 6000c00, size = 200
#

So, parity on fourth disk and data on fifth and first.


# mdb
> (6000a00>>9)+(400000>>9)=D
              204805       
> (6000c00>>9)+(400000>>9)=D
              204806

And use dd to dump the blocks.


# dd if=/export/home/max/r4 of=/tmp/i5_1z iseek=204805 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r0 of=/tmp/i5_2z iseek=204806 count=1
1+0 records in
1+0 records out
#

And concatenate...


# cat /tmp/i5*z > /tmp/i5_z
#

And uncompress...


# ./zuncompress -p 400 -l 4000 /tmp/i5_z > /tmp/i5
#

And get to the 4th level of indirection...


# mdb /tmp/i5
> 0::blkptr
DVA[0]: vdev_id 0 / 1e003000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[0]: :0:1e003000:400:id
DVA[1]: vdev_id 0 / 3000
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[1]: :0:3000:400:id
LSIZE:  4000                            PSIZE: 400
ENDIAN: LITTLE                                  TYPE:  DMU dnode
BIRTH:  500                LEVEL: 4     FILL:  900000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  5aaaf038c7:3ecd4215b2cf:1705e4d4343d71:5e8d71a8535f678
$q
#

Rather than show all 6 levels, we'll jump to level 0.


# mdb /tmp/i1
> 0::blkptr
DVA[0]: vdev_id 0 / 1e001000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[0]: :0:1e001000:600:d
DVA[1]: vdev_id 0 / 1000
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[1]: :0:1000:600:d
LSIZE:  4000                            PSIZE: 600
ENDIAN: LITTLE                                  TYPE:  DMU dnode
BIRTH:  500                LEVEL: 0     FILL:  900000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  7e1ebca68d:4f0370c6d404:23a24df0937608:ce6838f39084f95
$q
#

Locate the data for the stripe:


# ./zdb -Z tank:0:1e001000:600
Columns = 4, bigcols = 4, asize = 800, firstdatacol = 1
devidx = 3, offset = 6000200, size = 200
devidx = 4, offset = 6000200, size = 200
devidx = 0, offset = 6000400, size = 200
devidx = 1, offset = 6000400, size = 200
#

# mdb
> (6000200>>9)+(400000>>9)=D
              204801       
> (6000400>>9)+(400000>>9)=D
              204802

Get the data from the individual disks...


# dd if=/export/home/max/r4 of=/tmp/fs_mos_1z iseek=204801 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r0 of=/tmp/fs_mos_2z iseek=204802 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r1 of=/tmp/fs_mos_3z iseek=204802 count=1
1+0 records in
1+0 records out
#

Concatenate the data...


# cat /tmp/fs_mos_* > /tmp/fs_mos_z
#

Uncompress...


# ./zuncompress -l 4000 -p 600 /tmp/fs_mos_z > /tmp/fs_mos
#

We should now be at the MOS for the root data set.


# mdb /tmp/fs_mos
> 0,20::print -a zfs`dnode_phys_t
{
  0 dn_type = 0 <-- first is not used
 ...
}
{
     200 dn_type = 0x15  <-- DMU_OT_MASTER
     ...
     240 dn_blkptr = [
         {
             240 blk_dva = [
                 {
                     240 dva_word = [ 0x2, 0 ]
                 }
                 {
                     250 dva_word = [ 0x2, 0xf0000 ]
                 }
                 {
                     260 dva_word = [ 0, 0 ]
                 }
             ]
 ...
}
 ...
{
     600 dn_type = 0x14  <-- DMU_OT_DIRECTORY_CONTENTS (probably for root of fs)
     ...
     604 dn_bonustype = 0x11  <-- bonus buffer is a znode_phys_t
     ...
     640 dn_blkptr = [
         {
             640 blk_dva = [
                 {
                     640 dva_word = [ 0x2, 0xf0006 ]
                 }
         ...
     6c0 dn_bonus = [ 0x26, 0xa0, 0xdb, 0x49, 0, 0, 0, 0, 0x8e, 0xf0, 0xf7, 0x25,  0, 0, 0, 0, 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0xf3, 0x80, 0xdd, 0x34, 0, 0, 0 , 0, ... ] 
} 
{
     800 dn_type = 0x13  <-- DMU_OT_PLAIN_FILE_CONTENTS
     ...
     804 dn_bonustype = 0x11  <-- bonus buffer is znode_phys_t
     ...
     840 dn_blkptr = [
         {
             840 blk_dva = [
                 {
                     840 dva_word = [ 0x2, 0xf0004 ]
                 }
...
     8c0 dn_bonus = [ 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0x5e, 0xe2, 0xdc, 0x34,  0, 0, 0, 0, 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0x58, 0x9e, 0xde, 0x34, 0, 0, 0 , 0, ... ] 
}
 ... 
>

We should start with the ZAP object specified by the blkptr_t for the master node to get to the root directory object of the file system. Instead, we'll assume the dnode_phys_t at 0x600 is for the root of the file system, and we'll dump the blkptr_t. This should be for a ZAP object which should contain the list of files in the directory.


> 640::blkptr
DVA[0]: vdev_id 0 / 1e000c00
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[0]: :0:1e000c00:200:d
DVA[1]: vdev_id 0 / 800
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 40000000000
DVA[1]: :0:800:200:d
LSIZE:  200                             PSIZE: 200
ENDIAN: LITTLE                                  TYPE:  ZFS directory
BIRTH:  500                LEVEL: 0     FILL:  100000000
CKFUNC: fletcher4                       COMP:  uncompressed
CKSUM:  60d062a16:197ca3c8839:4691877f93d3:946b572aca5a2
$q
#

Find the location on the disk(s) for the directory ZAP object.


# ./zdb -Z tank:0:1e000c00:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 1, offset = 6000200, size = 200
devidx = 2, offset = 6000200, size = 200
# mdb
> (6000200>>9)+(400000>>9)=D
              204801

Dump the contents.


# dd if=/export/home/max/r2 of=/tmp/rootdir iseek=204801 count=1
1+0 records in
1+0 records out
#

Examine the directory.


# mdb /tmp/rootdir
> ::sizeof zfs`mzap_phys_t
sizeof (zfs`mzap_phys_t) = 0x80
> ::sizeof zfs`mzap_ent_phys_t
sizeof (zfs`mzap_ent_phys_t) = 0x40
> 0::print zfs`mzap_phys_t
{
  mz_block_type = 0x8000000000000003
  mz_salt = 0x14187c7
  mz_normflags = 0
  mz_pad = [ 0, 0, 0, 0, 0 ]
  mz_chunk = [
      {
          mze_value = 0x8000000000000004
          mze_cd = 0
          mze_pad = 0
          mze_name = [ "smallfile" ]
      }
  ]
}
> (200-80)%40=K
              6  <-- there are 6 more mzap_ent_phys_t
> 80,6::print zfs`mzap_ent_phys_t
 {
     mze_value = 0
     mze_cd = 0
     mze_pad = 0
     mze_name = [ '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0 ', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',  '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', ... ]
 }
 {
     mze_value = 0x8000000000000006
     mze_cd = 0
     mze_pad = 0
     mze_name = [ "words" ]  <-- here is the file we want, object id is 6
 }
 {
     mze_value = 0x8000000000000007
     mze_cd = 0
     mze_pad = 0
     mze_name = [ "foo" ]
 }
 ...
$q
#

Now, go back to the file system MOS to look at object id 6. If the object ID was greater than 32 (0x20), there would have been more work looking at other indirect blocks from the objset_phys_t for the file system. We assumed that the root directory for the file system would be a low object number above, and, fortunately,
the file we want to examine is also a low object number.


# mdb /tmp/fs_mos
> (6*200)::print -a zfs`dnode_phys_t
{
  c00 dn_type = 0x13  <-- plain file contents
  c01 dn_indblkshift = 0xe
  c02 dn_nlevels = 0x2  <-- one level of indirection
  c03 dn_nblkptr = 0x1
  c04 dn_bonustype = 0x11 <-- bonus buffer contains znode_phys_t
     ...
  c40 dn_blkptr = [
         {
             c40 blk_dva = [
                 {
                     c40 dva_word = [ 0x4, 0x5ec ]
                 }
                 {
                     c50 dva_word = [ 0x4, 0xf00ec ]
                 }
                 {
                     c60 dva_word = [ 0, 0 ]
                 }
             ]
 ...
  cc0 dn_bonus = [ 0x22, 0x86, 0xdb, 0x49, 0, 0, 0, 0, 0x9, 0x31, 0x20, 0x28, 0, 0, 0, 0, 0x22, 0x86, 0xdb, 0x49, 0, 0, 0, 0, 0x71, 0x48, 0x2b, 0x28, 0, 0, 0,  0, ... ] 
}

A quick look at the znode_phys_t in the bonus buffer...


> cc0::print zfs`znode_phys_t
{
  zp_atime = [ 0x49db8622, 0x28203109 ]
  zp_mtime = [ 0x49db8622, 0x282b4871 ]
  zp_ctime = [ 0x49db8622, 0x282b4871 ]
  zp_crtime = [ 0x49db8622, 0x28203109 ]
  zp_gen = 0x97
  zp_mode = 0x8124
  zp_size = 0x32752
  zp_parent = 0x3
  zp_links = 0x1
  zp_xattr = 0
  zp_rdev = 0
  zp_flags = 0x40800000004
  zp_uid = 0
  zp_gid = 0
  zp_zap = 0
  zp_pad = [ 0, 0, 0 ]
  zp_acl = {
      z_acl_extern_obj = 0
      z_acl_size = 0x30
      z_acl_version = 0x1
      z_acl_count = 0x6
      z_ace_data = [ 0x1, 0, 0, 0x10, 0x26, 0, 0, 0, 0, 0, 0, 0x10, 0x11, 0x1,
0xc, 0, 0x1, 0, 0x40, 0x20, 0x26, 0, 0, 0, 0, 0, 0x40, 0x20, 0x1, 0, 0, 0, ...
]
  }
}

When was the file created?


> 49db8622=Y
              2009 Apr  7 18:58:10

Now, let's look at the blkptr_t.


> c40::blkptr
DVA[0]: vdev_id 0 / bd800
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[0]: :0:bd800:400:id
DVA[1]: vdev_id 0 / 1e01d800
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 80000000000
DVA[1]: :0:1e01d800:400:id
LSIZE:  4000                            PSIZE: 400
ENDIAN: LITTLE                                  TYPE:  ZFS plain file
BIRTH:  97                 LEVEL: 1     FILL:  200000000
CKFUNC: fletcher4                       COMP:  lzjb
CKSUM:  600e97db0e:411c4ea86350:1790b46d936d46:602547566d07cc7
$q
#

We're at level 1.


# ./zdb -Z tank:0:bd800:400
Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1
devidx = 1, offset = 25e00, size = 200
devidx = 2, offset = 25e00, size = 200
devidx = 3, offset = 25e00, size = 200
#

# mdb
> (25e00>>9)+(400000>>9)=D
              8495         

# dd if=/export/home/max/r2 of=/tmp/words_i1z iseek=8495 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r3 of=/tmp/words_i2z iseek=8495 count=1
1+0 records in
1+0 records out
# cat /tmp/words_*z > /tmp/words_iz

Uncompress...


# ./zuncompress -l 4000 -p 400 /tmp/words_iz > /tmp/words_i

So, /tmp/words_i should contain uncompressed blkptr_t. These blkptr_t should take us to the data for the words file.


# mdb /tmp/words_i
> 0::blkptr
DVA[0]: vdev_id 0 / c0000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 2800000000000
DVA[0]: :0:c0000:20000:d
LSIZE:  20000                           PSIZE: 20000
ENDIAN: LITTLE                                  TYPE:  ZFS plain file
BIRTH:  97                 LEVEL: 0     FILL:  100000000
CKFUNC: fletcher2                       COMP:  uncompressed
CKSUM:  f5cbf93a151abcac:5b5d6ca83588d8ad:574d9b8bf334944b:ad78d30af51771d8
80::blkptr
DVA[0]: vdev_id 0 / e8000
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 2800000000000
DVA[0]: :0:e8000:20000:d
LSIZE:  20000                           PSIZE: 20000
ENDIAN: LITTLE                                  TYPE:  ZFS plain file
BIRTH:  97                 LEVEL: 0     FILL:  100000000
CKFUNC: fletcher2                       COMP:  uncompressed
CKSUM:  f39ae34f048ae079:de2ef1af7d1fb495:ec3ae3f7985b2a98:c6d33ac68cb042b6
$q
#

So, where is the data?


# ./zdb -Z tank:0:c0000:20000
Columns = 5, bigcols = 0, asize = 28000, firstdatacol = 1
devidx = 1, offset = 26600, size = 8000
devidx = 2, offset = 26600, size = 8000
devidx = 3, offset = 26600, size = 8000
devidx = 4, offset = 26600, size = 8000
devidx = 0, offset = 26800, size = 8000

A little hex to decimal conversion for dd...


# mdb
> (26600>>9)+(400000>>9)=D
              8499         
> (26800>>9)+(400000>>9)=D
              8500         
8000>>9=D
              64

Now, dump the blocks...


# dd if=/export/home/max/r2 of=/tmp/w1 iseek=8499 bs=512 count=64
64+0 records in
64+0 records out
# dd if=/export/home/max/r3 of=/tmp/w2 iseek=8499 bs=512 count=64
64+0 records in
64+0 records out
# dd if=/export/home/max/r4 of=/tmp/w3 iseek=8499 bs=512 count=64
64+0 records in
64+0 records out
# dd if=/export/home/max/r0 of=/tmp/w4 iseek=8500 bs=512 count=64
64+0 records in
64+0 records out

And concatenate the 4 files...


# cat /tmp/w[1-4]
10th
1st
2nd
3rd
4th
5th
6th
7th
8th
9th
a
AAA
AAAS
Aarhus
Aaron
AAU
ABA
Ababa
aback
...
Nostrand
nostril
not
notary
notate
notch
note
notebook
noteworthy
nothing
notice
notice#

The first 128k of the file. To get the remainder of the file, we would need to look at the next blkptr_t at level 1. But, not today...

Wednesday, December 16, 2009

ZFS Data Recovery

I knew there was a reason to understand the ZFS on-disk format besides wanting to be able to teach about it...

Recently, I was sent an email from someone who had 15 years of video and music stored in a 10TB ZFS pool that, after a power failure, became defective. He unfortunately did not have a backup. He was using ZFS version 6 on FreeBSD 7, and he asked if I could help. He also said that he had spoken with various people, including engineers within Sun, and was told that basically, it was probably not possible to recover the data.

He also got in touch with a data recovery company who told him they would assign a technician to examine the problem at a cost of $15,000/month. And if they could not restore his data, he would only have to pay 1/2 of the cost.

After spending about 1 week examining the data on the disk, I was able to restore basically all of it. The recovery was done on OpenSolaris 0609 (build 111b). After the recovery, he was able to view his data on the FreeBSD system (as well as OpenSolaris). I would be happy to do this for anyone else who runs into a problem where they can no longer access their ZFS pool, especially at $15k/month. And if I can not do it, you would not need to pay anything!

Tuesday, December 15, 2009

segpages dmod source code

Source is at ftp://ftp.bruningsystems.com/segpages.tar

Monday, December 14, 2009

Examining address spaces with mdb

A while ago, I was interested in more details about process address spaces. For instance, if a page is mapped into an address space, where is the page in physical memory? Or if a page is on swap, where is it on swap? Are there pages that are in memory, but not currently valid for a process? The meminfo(2) system call can be used by an application to examine the locations of physical pages corresponding to a range of virtual addresses that the process is using. Is there a tool for doing this from outside the process? Is there any tool for determining the locations of pages in memory when one is using liblgrp(3)? liblgrp(3) provides an API for specifying a "locality group". A locality group, as the man page says, "represents the set of CPU-like and memory-like hardware devices that are at most some locality apart from each other". Essentially, using liblgrp(3), one can specify the desired memory placement for memory that threads within a process are using.

So, I have written a dcmd, called segpages, for mdb that allows one to examine each virtual page of a segment in a process address space. The command gives the following information:

The virtual address of the page.
If the page is in memory, the physical address of the page.
If the page is on swap, the location on swap, and which swap device/file.
If the page is not currently in memory or on swap, a "-".
If the page is mapped from a file, the pathname of the file, and the offset within the file.
If the page is anonymous, the command prints "anon".
If the page is mapped to a device, the command only prints the physical address it is mapped to, and the path to the device.
The "share count" for the page, i.e., the number of processes sharing the same page.
The dcmd command also prints the status of the page:
- VALID -- The page is mapped
- INMEMORY -- The page is in memory (it may not be valid for the process).
- SWAPPED -- The page is on swap. Note that a page may be INMEMORY and SWAPPED. What I find more interesting, is pages that are SWAPPED and VALID. I expect to find INMEMORY pages that are also on swap. I did not expect to find SWAPPED pages that are also VALID, since I assumed that a page that was read in from swap and is now valid would not have a copy on swap. From a quick look at the code, it appears the swap slot is not freed until the reference count on the anon struct that is mapping the page has gone to 0. Anyone with a more complete understanding of this is welcome to comment.

Here is (very abbreviated) output for a running bash process.

First, a look at pmap output. Each line of the pmap output represents a "segment" of the address space. The different columns are described in the pmap(1) man page.
$ pmap -x 919 919: /bin/bash --noediting -i< Address Kbytes RSS Anon Locked Mode Mapped File 08045000 12 12 4 - rw--- [ stack ] 08050000 644 644 - - r-x-- bash 08100000 80 80 12 - rwx-- bash 08114000 52 52 28 - rwx-- [ heap ] CE410000 624 512 - - r-x-- libnsl.so.1 CE4BC000 16 16 4 - rw--- libnsl.so.1 CE4C0000 20 8 - - rw--- libnsl.so.1 CE4F0000 56 52 - - r-x-- methods_unicode.so.3 CE50D000 4 4 - - rwx-- methods_unicode.so.3 CE510000 2416 752 - - r-x-- en_US.UTF-8.so.3 CE77B000 4 4 - - rwx-- en_US.UTF-8.so.3 CE960000 64 16 - - rwx-- [ anon ] CE97E000 4 4 - - rwxs- [ anon ] CE980000 4 4 - - rwx-- [ anon ] CE990000 24 12 4 - rwx-- [ anon ] CE9A0000 4 4 4 - rwx-- [ anon ] CE9B0000 1280 972 - - r-x-- libc_hwcap1.so.1 CEAF0000 28 28 16 - rwx-- libc_hwcap1.so.1 CEAF7000 8 8 - - rwx-- libc_hwcap1.so.1 CEB00000 4 4 - - r-x-- libdl.so.1 CEB10000 4 4 4 - rwx-- [ anon ] CEB20000 56 56 - - r-x-- libsocket.so.1 CEB3E000 4 4 - - rw--- libsocket.so.1 CEB40000 180 136 - - r-x-- libcurses.so.1 CEB7D000 28 28 - - rw--- libcurses.so.1 CEB84000 8 - - - rw--- libcurses.so.1 CEB90000 4 4 4 - rwx-- [ anon ] CEBA0000 4 4 4 - rw--- [ anon ] CEBB0000 4 4 - - rw--- [ anon ] CEBBF000 180 180 - - r-x-- ld.so.1 CEBFC000 8 8 4 - rwx-- ld.so.1 CEBFE000 4 4 4 - rwx-- ld.so.1 -------- ------- ------- ------- ------- total Kb 5832 3620 92 - # mdb -k Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs usba sockfs ip hook neti sctp arp uhci sd fctl md lofs audiosup fcip fcp random cpc crypto logindmux ptm ufs sppp ipc ]

First, load the dmod containing the new dcmd.
> ::load /wd320/max/source/mdb/segpages/i386/segpages.so >

Now, walk through the segments of the process address space, showing
each virtual page in the segment. Note that output has been omitted.
> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages VA PA FILE OFFSET SHARES DISPOSITION 8045000 378C5000 [anon] 54518000 1 VALID 8046000 6EB06000 [anon] 54118000 1 VALID 8047000 5F9C7000 [anon] 540B8000 1 VALID VA PA FILE OFFSET SHARES DISPOSITION 8050000 600A7000 bash 0 7 VALID 8051000 74368000 bash 1000 7 VALID 8052000 72669000 bash 2000 7 VALID 8053000 66C6A000 bash 3000 7 VALID 8054000 636AB000 bash 4000 0 INVALID,INMEMORY 8055000 5FDEC000 bash 5000 0 INVALID,INMEMORY 8056000 63EED000 bash 6000 0 INVALID,INMEMORY 8057000 62EAE000 bash 7000 0 INVALID,INMEMORY 8058000 5C52F000 bash 8000 7 VALID 8059000 5C5B0000 bash 9000 7 VALID ... output omitted 80ED000 5C2C4000 bash 9D000 7 VALID 80EE000 5C245000 bash 9E000 7 VALID 80EF000 5C286000 bash 9F000 3 VALID 80F0000 63A97000 bash A0000 0 INVALID,INMEMORY VA PA FILE OFFSET SHARES DISPOSITION 8100000 79940000 [anon] 541D8000 1 VALID 8101000 5F0C1000 [anon] 62F00000 1 VALID 8102000 378C2000 [anon] 54438000 1 VALID 8103000 5EF5A000 bash A3000 6 VALID 8104000 5EEDB000 bash A4000 6 VALID 8105000 37885000 [anon] 543B8000 1 VALID 8106000 60E1D000 bash A6000 7 VALID ... VA PA FILE OFFSET SHARES DISPOSITION 8114000 37914000 [anon] 54478000 1 VALID 8115000 79DD5000 [anon] 54368000 1 VALID 8116000 55356000 [anon] 62F90000 1 VALID ... VA PA FILE OFFSET SHARES DISPOSITION CE410000 7AE40000 libnsl.so.1 0 55 VALID CE411000 7AEC1000 libnsl.so.1 1000 57 VALID CE412000 7AE42000 libnsl.so.1 2000 57 VALID CE413000 7AE83000 libnsl.so.1 3000 57 VALID CE414000 7AE84000 libnsl.so.1 4000 57 VALID ... CE42D000 6EE96000 libnsl.so.1 1D000 1A INVALID,INMEMORY CE42E000 6E797000 libnsl.so.1 1E000 1A INVALID,INMEMORY ... VA PA FILE OFFSET SHARES DISPOSITION CE4F0000 17D9000 methods_unicode.so.3 0 29 VALID CE4F1000 17DA000 methods_unicode.so.3 1000 2C VALID ... VA PA FILE OFFSET SHARES DISPOSITION CE510000 1869000 en_US.UTF-8.so.3 0 28 VALID CE511000 18AA000 en_US.UTF-8.so.3 1000 2A VALID ... CE518000 6F1EA000 en_US.UTF-8.so.3 8000 0 INVALID,INMEMORY CE519000 6F1EB000 en_US.UTF-8.so.3 9000 0 INVALID,INMEMORY CE51A000 6F1EC000 en_US.UTF-8.so.3 A000 0 INVALID,INMEMORY ... CE5FF000 6DB60000 en_US.UTF-8.so.3 EF000 5 INVALID,INMEMORY CE600000 1659000 en_US.UTF-8.so.3 F0000 7 INVALID,INMEMORY ... CE6EE000 1687000 en_US.UTF-8.so.3 1DE000 9 INVALID,INMEMORY CE6EF000 1688000 en_US.UTF-8.so.3 1DF000 9 INVALID,INMEMORY CE6F0000 1649000 en_US.UTF-8.so.3 1E0000 9 INVALID,INMEMORY ... CE729000 1782000 en_US.UTF-8.so.3 219000 29 VALID CE72A000 1783000 en_US.UTF-8.so.3 21A000 29 VALID ... CE730000 1709000 en_US.UTF-8.so.3 220000 29 VALID CE731000 6F143000 en_US.UTF-8.so.3 221000 0 INVALID,INMEMORY CE732000 6F144000 en_US.UTF-8.so.3 222000 0 INVALID,INMEMORY ... VA PA FILE OFFSET SHARES DISPOSITION CE9B0000 76A42000 libc_hwcap1.so.1 0 5B VALID CE9B1000 76AC3000 libc_hwcap1.so.1 1000 5B VALID ... VA PA FILE OFFSET SHARES DISPOSITION ... CEBC4000 2A34000 ld.so.1 5000 47 VALID CEBC5000 28B5000 ld.so.1 6000 47 VALID CEBC6000 29F6000 ld.so.1 7000 60 VALID CEBC7000 2937000 ld.so.1 8000 60 VALID ... >

Some general things to note:

Physical pages are randomly distributed. However, pages from ld.so.1 tend to be in low memory with comparison to anonymous pages. This should be expected as most pages of ld.so.1 are probably loaded early on in the system lifetime as most every application uses it.
There are many pages that are not valid, but they are in memory. In general, text and data pages are prefetched when a program starts, unless the program is large, or there is not enough free memory. Although pages are prefetched, it appears that they are not mapped to the process address space until/unless they are actually used.
Bash is not very large. Running the command above finishes in 5-10 seconds. Running the same command on a large program (for instance, firefox-bin), takes several minutes to complete. Running the command on a large 64-bit application will take considerably longer.
This is being run on a live system, so the address space of the process being examined may change while it is being walked.
At this point in time, no pages are swapped out.

Now, let's get some general statistics.

First, a count of the pages currently valid for the process. This is the current mapped RSS. Note that the pmap command reports "RSS", which, at 3620k is 905 4k-byte pages. But only 558 pages (or 2232k) are currently valid.
> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i " valid" | wc 558 3348 35712 >

Now, the pages in memory, but not currently valid in the page table(s) for the process.
> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i "inmemory" | wc 347 2082 26025 >

Note that the valid pages plus the in memory pages is 905, or the value reported by pmap. So RSS as reported by pmap does not imply that page faults will not happen for all of those pages. But if a page fault occurs the correct page will be found in memory.

How many pages are currently not valid (and not in memory).
> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i " invalid$" | wc 553 3298 36498 >

How large is the address space?
> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -v OFFSET | wc 1458 8728 98235 >

Note that this is 5832k, the total size as reported by pmap.

How many pages have been swapped out?
> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i swapped | wc 0 0 0 >

Now, we'll induce memory load on the system, and again examine the address space. The memory usage induced should be enough to cause pages to be swapped (paged) out.

First, pmap output after the memory stress.
$ pmap -x 919 919: /bin/bash --noediting -i Address Kbytes RSS Anon Locked Mode Mapped File 08045000 12 4 - - rw--- [ stack ] 08050000 644 508 - - r-x-- bash 08100000 80 80 - - rwx-- bash 08114000 52 44 28 - rwx-- [ heap ] CE410000 624 320 - - r-x-- libnsl.so.1 CE4BC000 16 16 4 - rw--- libnsl.so.1 CE4C0000 20 8 - - rw--- libnsl.so.1 CE4F0000 56 36 - - r-x-- methods_unicode.so.3 CE50D000 4 4 - - rwx-- methods_unicode.so.3 CE510000 2416 124 - - r-x-- en_US.UTF-8.so.3 CE77B000 4 4 - - rwx-- en_US.UTF-8.so.3 CE960000 64 16 - - rwx-- [ anon ] CE97E000 4 4 - - rwxs- [ anon ] CE980000 4 4 - - rwx-- [ anon ] CE990000 24 12 4 - rwx-- [ anon ] CE9A0000 4 4 4 - rwx-- [ anon ] CE9B0000 1280 952 - - r-x-- libc_hwcap1.so.1 CEAF0000 28 28 12 - rwx-- libc_hwcap1.so.1 CEAF7000 8 8 - - rwx-- libc_hwcap1.so.1 CEB00000 4 4 - - r-x-- libdl.so.1 CEB10000 4 4 4 - rwx-- [ anon ] CEB20000 56 56 - - r-x-- libsocket.so.1 CEB3E000 4 4 - - rw--- libsocket.so.1 CEB40000 180 68 - - r-x-- libcurses.so.1 CEB7D000 28 28 - - rw--- libcurses.so.1 CEB84000 8 - - - rw--- libcurses.so.1 CEB90000 4 4 4 - rwx-- [ anon ] CEBA0000 4 4 4 - rw--- [ anon ] CEBB0000 4 4 - - rw--- [ anon ] CEBBF000 180 180 - - r-x-- ld.so.1 CEBFC000 8 8 4 - rwx-- ld.so.1 CEBFE000 4 4 4 - rwx-- ld.so.1 -------- ------- ------- ------- ------- total Kb 5832 2544 72 - $

As expected, the RSS has gone down, but the virtual size remains the same. It is a little interesting that the amount reported under anon has also dropped by 20k.

Again, we'll use the new dcmd to examine the address space more closely.
> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages VA PA FILE OFFSET SHARES DISPOSITION 8045000 378C5000 [anon] 54518000 1 VALID 8046000 - /dev/zvol/dsk/rpool/swap 17EBF000 0 INVALID,SWAPPED 8047000 - /dev/zvol/dsk/rpool/swap 1D8FE000 0 INVALID,SWAPPED VA PA FILE OFFSET SHARES DISPOSITION 8050000 16B86000 bash 0 1 VALID 8051000 13A07000 bash 1000 1 VALID 8052000 6B088000 bash 2000 1 VALID 8053000 1889000 bash 3000 1 VALID 8054000 2430A000 bash 4000 1 VALID 8055000 6440B000 bash 5000 1 VALID 8056000 6684C000 bash 6000 1 VALID 8057000 7308D000 bash 7000 1 VALID 8058000 6DCCE000 bash 8000 0 INVALID,INMEMORY 8059000 3784F000 bash 9000 0 INVALID,INMEMORY ... 80ED000 4CB23000 bash 9D000 0 INVALID,INMEMORY 80EE000 76BE4000 bash 9E000 0 INVALID,INMEMORY 80EF000 5BA5000 bash 9F000 0 INVALID,INMEMORY 80F0000 36836000 bash A0000 0 INVALID,INMEMORY VA PA FILE OFFSET SHARES DISPOSITION 8100000 - /dev/zvol/dsk/rpool/swap 247C2000 0 INVALID,SWAPPED 8101000 - /dev/zvol/dsk/rpool/swap 7CCD000 0 INVALID,SWAPPED 8102000 378C2000 [anon] 54438000 1 VALID 8103000 75479000 bash A3000 0 INVALID,INMEMORY 8104000 532BA000 bash A4000 0 INVALID,INMEMORY 8105000 37885000 [anon] 543B8000 1 VALID 8106000 7443C000 bash A6000 0 INVALID,INMEMORY ... VA PA FILE OFFSET SHARES DISPOSITION 8114000 37914000 [anon] 54478000 1 VALID 8115000 79DD5000 [anon] 54368000 1 VALID 8116000 55356000 [anon] 62F90000 1 VALID ... VA PA FILE OFFSET SHARES DISPOSITION CE410000 7AE40000 libnsl.so.1 0 4C VALID CE411000 7AEC1000 libnsl.so.1 1000 4E VALID CE412000 7AE42000 libnsl.so.1 2000 4E VALID CE413000 7AE83000 libnsl.so.1 3000 4E VALID CE414000 7AE84000 libnsl.so.1 4000 4E VALID ... CE42D000 6EE96000 libnsl.so.1 1D000 18 INVALID,INMEMORY CE42E000 6E797000 libnsl.so.1 1E000 18 INVALID,INMEMORY ... VA PA FILE OFFSET SHARES DISPOSITION CE4F0000 17D9000 methods_unicode.so.3 0 27 VALID CE4F1000 17DA000 methods_unicode.so.3 1000 2A VALID ... VA PA FILE OFFSET SHARES DISPOSITION CE510000 1869000 en_US.UTF-8.so.3 0 26 VALID CE511000 18AA000 en_US.UTF-8.so.3 1000 28 VALID ... CE518000 - en_US.UTF-8.so.3 8000 0 INVALID CE519000 - en_US.UTF-8.so.3 9000 0 INVALID CE51A000 - en_US.UTF-8.so.3 A000 0 INVALID ... CE5FF000 - en_US.UTF-8.so.3 EF000 0 INVALID CE600000 - en_US.UTF-8.so.3 F0000 0 INVALID ... CE6EE000 1687000 en_US.UTF-8.so.3 1DE000 A INVALID,INMEMORY CE6EF000 1688000 en_US.UTF-8.so.3 1DF000 A INVALID,INMEMORY CE6F0000 1649000 en_US.UTF-8.so.3 1E0000 A INVALID,INMEMORY ... CE729000 1782000 en_US.UTF-8.so.3 219000 27 VALID CE72A000 1783000 en_US.UTF-8.so.3 21A000 27 VALID ... CE730000 1709000 en_US.UTF-8.so.3 220000 27 VALID CE731000 - en_US.UTF-8.so.3 221000 0 INVALID CE732000 - en_US.UTF-8.so.3 222000 0 INVALID ... VA PA FILE OFFSET SHARES DISPOSITION CE9B0000 76A42000 libc_hwcap1.so.1 0 51 VALID CE9B1000 76AC3000 libc_hwcap1.so.1 1000 51 VALID ... VA PA FILE OFFSET SHARES DISPOSITION CEBC4000 2A34000 ld.so.1 5000 42 VALID CEBC5000 28B5000 ld.so.1 6000 42 VALID CEBC6000 29F6000 ld.so.1 7000 57 VALID CEBC7000 2937000 ld.so.1 8000 57 VALID ... >

As expected, many pages that were previously valid are now invalid. Many of these pages are still in memory, but some have been swapped out. The output does not show it, but some pages that are swapped out can also be in memory (the page was swapped out, put on a freelist, but has not yet been re-used for some other purpose. It is interesting that some pages with reasonably high share counts are still in memory, but no longer valid for this instance of bash. The pageout code checks the share counts, and skips pages being shared by more than po_share processes. On my system, po_share is 8. So I am not sure what is marking the pages invalid (maybe a job for DTrace).

As before, I'll get some counts of valid, invalid, inmemory, and
swapped pages.
> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i " valid" | wc 413 2478 26432 >

Previously, the number of valid pages was 558, so 145 pages have been marked invalid and possibly swapped out.

The number of invalid pages is now:
> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i " invalid$" | wc 818 4888 53988 >

Previously, this was 553, so 265 pages that previously were valid are
now invalid.
> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i "inmemory" | wc 215 1290 16125 >

And 215 pages that are invalid are still in memory, but the page table entries for the bash instance does not have the pages mapped.
> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i swapped | wc 12 72 936

And 12 pages of bash are on swap.

It would be nice to be able to show this graphically. For instance, a large box representing the address space, with different colored pixels to represent the state of the different pages of the address space. I have been told that JavaFX is good for this, but my knowledge of Java is really not up to it. Especially for large processes, a graphical view would be nice (well, at least interesting to look at...).

I have not tried the dcmd on SPARC or x64, but I expect it to work (at least on x64). I would also like to try this on a large machine which has latency groups set up. If anyone has such a machine and would like to try this out, please let me know.

I also have a version of the command that only prints summary information. I want to add an option that prints page sizes, but currently the command assumes all pages are the native page size (4k on x86/x64 and 8k on SPARC).

If there is interest, I'll make the code for the dcmd available.

Saturday, September 05, 2009

Correction for classes in Berlin

Oops. I've got the wrong dates for the Solaris/OpenSolaris Internals classes in Berlin. The correct dates are Sept. 21-25, and Sept. 28 - Oct. 2. The first week is full, but there are still openings for the second week. Hope to see you there!

Tuesday, August 04, 2009

OpenSolaris Internals course announcements

I will be teaching 2 back-to-back 5 day OpenSolaris Internals classes in Berlin, Germany the weeks of September 28 through October 2, and again from October 5 through October 10. For details about topics covered, price, and availability, please visit http://www.workshops-berlin.de/. Note that this website is in German. The course itself will be in English. If you are interested, but cannot read German, send me an email.

Thursday, April 09, 2009

RAIDZ On-Disk Format

A while back, I came up with a way to examine ZFS on-disk format using a modified mdb and zdb (see paper and slides. I also used the method described to recover a removed file (see here in my blog. During the past week, I decided to try to understand the layout of raidz. In other words, how raidz organizes data on disk. It's simple to say that raidz on disk is basically raid5 with variable length stripes, but what does that really mean?

To do this, I once again use the modified mdb, and made a further modification to zdb. In addtion, I implemented a new command (zuncompress) which allow me to uncompress ZFS data existing in a regular file. Since I fear that most of the 10 people or so who read this will not want to read a long description of how I determined the layout, here I'll just give a summary. If anyone really wants the details, reply to the blog and maybe I'll go into them.

First, some general characteristics:

- Each disk contains the 4MB labels at the beginning and end of the disk. For information on these, please see the ZFS On-Disk Specification paper. The starting point for any walk of ZFS on-disk starts with an uberblock_t which is found in this area.

- The metadata used for raidz is the same as for other ZFS objects. In other words, uberblock_t contains the location of a objset_phys_t, which in turn contains the location of the meta-object set (mos), and so on. A difference is that physically, individual structures on disk may exist across different disks, and not necessarily all of the disks. For example, let's take a mos (basically an array of dnode_phys_t structures), on a 5 disk raidz volume. This might be compressed to 0xa00 (2560 decimal) bytes. This may be organized on the raidz disks as follows:
- 512 bytes on disk 0
- 512 bytes on disk 1
- 512 bytes on disk 2
- 1024 bytes on disk 3
- 1024 bytes on disk 4
If you do the arithmetic, you'll find this is 0xe00 bytes (3.5k) and not 0xa00 (2.5K) bytes. The actual allocated size may be still larger. The reason for the extra 1k bytes is the next point.

- Each metadata object (as well as data itself) has its own parity. The extra 1k bytes in the previous point is for parity. If the parity in the above example is on disk 4, it must be 1024 bytes large, since the largest size of any of the blocks containing the object is 1k bytes. Even a metadata structure that only takes up 512 bytes (for instance, an objset_phys_t), will take up 1024 bytes on the disks, one disk containing the 512-byte structure, and another containing 512-bytes of parity.

- Block offsets as reported by zdb (and described in the ZFS On-Disk Specification) are for the entire space (i.e., if you have 5 100GB disks making up a raidz pool, the block offsets start at 0 and go to 500GB).

- Since block offsets cover the entire pool, you cannot simply look at the offsets reported by zdb and map them to locations on disk. The kernel routine, vdev_raidz_map_alloc() (see http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c#644), converts offset and size to locations on the disks. I have added an option to zdb that, given a raidz pool, offset, and size (as reported by zdb), calls this routine and prints out the values of the returned map. This shows the location on the disk(s) and sizes for both the data itself, and parity.

- I recently saw on #opensolaris irc, a person stating that a write of a 1 byte file results in a write to all disks in raidz. That may be true (I haven't checked), but only 1024 bytes are used for the 1 byte file. One 512 byte block containing the 1 byte of data, and the other 512 byte block on a different disk containing parity. It is not using space on all n disks for a 1 byte file.

- ZFS basically uses a scatter-gather approach to reading and writing data on a raidz pool. The disks are read at the correct offsets into a buffer large enough to contain the data. So on a read, data on the first disk is read into the beginning of the buffer, data from the second disk is read into the same buffer virtually contiguous with the end of the data from the first disk, and so on. The resulting buffer is then de-compressed, and the data returned to the requestor.

So, that's the basics. I was going to turn my attention to the in-memory representation of ZFS, but now think instead I'll take a stab at automating the techniques I am using. Once I have that done, I'll try automating recovery of a removed file. From there, we'll see.

Saturday, April 04, 2009

More information about Free One Day OpenSolaris Internals Training

I thought I would say a few words about what is planned for the free one day OpenSolaris Internals training class (see http://sl.osunix.org/FreeKernelTrainingDay for a list of topics, and to sign up).

Regardless of the topics covered, I want to make this as close to a "classroom" setting as possible. For me, this means that attendees should be able to follow along with anything I am doing on OpenSolaris by doing it themselves. So, for instance, if I am using mdb to examine some data structure, students should be able to do the same on their machines. For some topics, notably ZFS, this will require students to either build an mdb dmod, and the modified mdb and zdb that I use, or load a version of OpenSolaris that contains these (to be provided by osunix.org). Source for the modified mdb, zdb, and rawzfs mdb dmod is available for download at ftp://ftp.bruningsystems.com/mdb.tar.Z, ftp://ftp.bruningsystems.com/zdb.tar.Z, and ftp://ftp.bruningsystems.com/raw_dmods.tar.Z. If we do a kmdb session, students will either need to run OpenSolaris in a VM (virtualbox), or have 2 machines connectable via tip or a terminal server for console access.

Currently, the plan is to give attendees access to some slides, use IRC, and give students access to a window on my machine where they can see what I am doing, and try the same on their machine. Best would be a window where everyone can "see" my desktop, but I'm still looking into the best way to do that (any suggestions for this are welcome). It would be great to have audio, preferably conferencing, but this may cost money, and... the class is free. That should mean free for me as well. If anyone has a suggestion for free, conferenced audio, I would appreciate it.

I would like to decide on topics to be covered in the next week or so. So, if you are interested in attending, please go to http://sl.osunix.org/FreeKernelTrainingDay, take a look at the topics, and sign up. If you have ideas for other kernel-related topics, please let me know. Depending on how this goes, I may do more of these in the future.

Thursday, April 02, 2009

Free One-day OpenSolaris Internals class

I am holding a free, one day OpenSolaris Internals class on-line on April 18 or 19. We'll cover 2 topics as determined by a vote of topics that may be covered. For more information, see http://sl.osunix.org/FreeKernelTrainingDay. I hope to see you there!

OpenSolaris Internals class

I am teaching an OpenSolaris Internals class at Systemics in Warsaw, Poland the week of May 4-8. The course will be held in English. For a detailed topic outline, see here. For pricing, location information, and availability, please send email to magdalena.sternick@systemics.pl. If you have questions about course content, please email me at max@bruningsystems.com.

Tuesday, March 31, 2009

A faster memstat for mdb

I have implemented a version of the ::memstat dcmd for mdb that gives results in less than half the time of the ::memstat currently in mdb. If you are interested, it is available for download here.

So, how does it work? The current version of memstat uses the page walker (::walk page) to walk all cached pages in the system. The new version simply examines all pages.

Each page-able page in the system is represented by a page_t structure. Basically all of memory except for the unix and genunix kernel modules, and a few other odds and ends is considered "page-able". Every page, that is in use (either by the kernel, user process, anonymous, or cached) has an identity. This is a vnode/offset pair. The identify uniquely determines how the page is being used. For instance, a page for the code of a running bash process will have the vnode_t for /usr/bin/bash, and the offset within /usr/bin/bash of where the page comes from. For a kernel page, there is a special vnode_t (kvp). For anonymous space, the page has a swapfsvnode (also used for shared memory and tmpfs files). When a process gets a page fault, the fault handling code first checks to see if the faulting address is mapped in the process' address space. If not, a segmentation violation (SIGSEGV) is sent to the process. If the address is within the address space, the fault handling code sees if the page is already in memory. It does this by retrieving the vnode/offset for the faulting page, and hashing into an array called page_hash. Each entry in page_hash is the beginning of a linked list of page_t structures. So the fault handling code does a hash to get a page_hash array entry, then walks the page_t structures starting at that entry to look for a matching vnode_t/offset. If the page_t is in the hash, the fault handling code sets up a page table entry (translation table entry on SPARC) to map to the corresponding physical page.

The page_hash array is sized so the the average search, given a page_hash bucket, is no longer than 4 (PAGE_HASHAVELEN in vm/page.h) entries. This makes searching for a cached page fairly fast.

The page walker that mdb uses to do memstat walks every hash bucket looking for pages. Basically, if the page is found from a hash bucket, the page is either in use by the kernel, some process, or tmpfs, or, the page is free but cached. Any page not hanging off of a page_hash bucket is considered free (i.e., the free (freelist) statistic).

The new memstat takes a different approach. Rather than scanning each hash bucket for pages, it simply reads all of the page_t structures on the system, then examines each one to determine if it is a kernel page, executable page, anonymous page, and so on. Any page_t that does not have a vnode_t is considered a free page, and is counted in the free (freelist) statistic.

How are the page_t structures found on the system? There is a linked list of memseg structures that are bookkeeping for page-able page structures. The list is headed by a pointer, memsegs, and is built early on in the startup code when the system is booting. I suspect the list could change due to dynamic reconfiguration events, but I'll leave that as an exercise for the reader...
To see the list, you can do "::memseg_list" in mdb. On my system, this gives:

> ::memseg_list ADDR PAGES EPAGES BASE END fbe00028 fbe94160 fe652260 00000c00 0007fed0 fbe00014 fbe85160 fbe94160 00000100 00000400 fbe00000 fbe82050 fbe85160 00000002 0000009f >

The ADDR column is the address of a memseg structure. PAGES is the address of the first page_t in an array. EPAGES points to the end of the page_t array. BASE is the starting page frame number of the memseg, and END is the ending page frame number. So, on my system, page frames between 2 and 9f, 100 and 400, and c00 and 7fed0 have page_t structures. Physical page 0 and 1 are not in the list. Also pages between 9f and 100, and between 400 and c00 are not in the list. This is either because the physical memory does not exist, or it is not considered pageable.

The new memstat uses the memseg list to read in all of the page_t structures. On my system, this means 3 read calls (though the read from fbe94160 to fe652260 is quite large), versus thousands of read calls in the existing memstat via the page walker. The new memstat assumes there will never be more than 256 memseg structures. (This was arbitrarily chosen. I have not seen machines with more than 6 memseg structures, but I don't get on very large machines very often). A more correct way would be to build the memseg list within the dcmd, but I am lazy.

Using dtrace and counting system calls during running of the two version of memstat shows that the original memstat makes 1741225 system calls, while the new memstat makes 737344. So over 10000 fewer system calls in the new memstat.

I think memstat performance could be improved even more by using mmap to map in the page arrays. Then there would be no need for using mdb_alloc, and no need to mdb_vread the page_t structs.