Monday, August 18, 2008

recovering removed file on zfs disk

I have used my modified mdb and zdb (see
http://www.osdevcon.org/2008/files/osdevcon2008-proceedings.pdf and
http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf)
to recover a file that was removed from a zfs file system.
The technique is to locate the active uberblock_t after the file
was created, but before the file was removed, and follow the data
structures from that uberblock_t. This technique would probably not
work on a near full file system, and probably not on a very busy file
system, but it works here. Also, this will not work with RAID-Z,
but works fine for mirrors. (I shall get around to figuring out
RAID-Z, but not now...)

It is possible to follow all of the steps and still not have the right
data because you chose the wrong uberblock_t, or one of the blocks containing
metadata (or the data itself) has been re-used.

The modified mdb and zdb have been updated to work with Nevada,
build 94. It took about 15 minutes to merge the versions I was using
for build 79 into build 94. For the source of the changes and
the zfs dmod, send mail to me at max@bruningsystems.com.

It might be possible, with a bit more clever use of mdb
and some shell scripting, to automate this... Also, it might be
useful to add an option to zdb so that a transaction id
other than the active one can be used for its traversals.
Then you might be able to do everything using zdb.

The following describes the steps taken.

First, I copy a file with known contents to the zfs file system.


# cp /usr/dict/words /zfs_fs/words
#


We'll get the object id (inumber) for /zfs_fs. We'll use it later.


# ls -aid /zfs_fs
3 /zfs_fs
#

Next, I'll try to make sure everything is on the disk.

# sync
#

Now, I'll use zdb to get the root blkptr from the uberblock.
This will also give me a transaction ID. Generally, you would not
use zdb to get the uberblock_t every time that you add/remove a
file to a zfs file system. That is ok. I have written a dcmd
(output shown below), that walks the uberblock_t array on disk.
Then you can, by trial and error, locate the uberblock_t you need
(assuming it still exists in the array, and assuming the metadata
it points to has not been re-used for another purpose).


# ./zdb -uuu zfs_fs
Uberblock

magic = 0000000000bab10c
version = 11
txg = 1282 <-- transaction id in decimal
guid_sum = 8876692711396000182
timestamp = 1218963748 UTC = Sun Aug 17 11:02:28 2008
rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:11a00:200>
DVA[1]=<0:c010e00:200> DVA[2]=<0:18008e00:200> fletcher4
lzjb LE contiguous birth=1282 fill=27
cksum=81f780ec5:361b52dda06:b6f3f410036f:1a2b8b10bfdb5c


Next, I'll remove the file I just created.


# rm /zfs_fs/words
#


Let's take a look at the active uberblock_t.


# ./zdb -uuu zfs_fs
Uberblock

magic = 0000000000bab10c
version = 11
txg = 1282 <-- nothing has changed
guid_sum = 8876692711396000182
timestamp = 1218963748 UTC = Sun Aug 17 11:02:28 2008
rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:11a00:200>
DVA[1]=<0:c010e00:200> DVA[2]=<0:18008e00:200> fletcher4
lzjb LE contiguous birth=1282 fill=27
cksum=81f780ec5:361b52dda06:b6f3f410036f:1a2b8b10bfdb5c


Let's try to make sure it is on the disk.

# sync
#

And check the active uberblock_t again.

# ./zdb -uuu zfs_fs
Uberblock

magic = 0000000000bab10c
version = 11
txg = 1284 <-- new transaction id, after file was removed
guid_sum = 8876692711396000182
timestamp = 1218963808 UTC = Sun Aug 17 11:03:28 2008
rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:15200:200>
DVA[1]=<0:c014600:200> DVA[2]=<0:1800a000:200> fletcher4
lzjb LE contiguous birth=1284 fill=27
cksum=87431704e:37f154f58d7:bbddb76e9703:1aaf346847004f


Now, let's make sure nothing changes in the file system.


# zfs umount zfs_fs
#


And look at the active uberblock_t again.


# ./zdb -uuu zfs_fs
Uberblock

magic = 0000000000bab10c
version = 11
txg = 1284 <-- ok. nothing changed
guid_sum = 8876692711396000182
timestamp = 1218963808 UTC = Sun Aug 17 11:03:28 2008
rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:15200:200>
DVA[1]=<0:c014600:200> DVA[2]=<0:1800a000:200> fletcher4
lzjb LE contiguous birth=1284 fill=27
cksum=87431704e:37f154f58d7:bbddb76e9703:1aaf346847004f


Ok. So nothing changed when the file system was unmounted.
Now, we'll use the modified mdb to examine the uberblock_t array on disk.
The uberblock_t we want has transaction id 1282 decimal.


# ./mdb /export/home/max/zfsfile

First, convert decimal 1282 to hex.

> 0t1282=X
502

Now, load kernel CTF and a few dcmds that work with zfs on disk.

> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so

Walk the uberblock_t array on disk. This shows all possible 1024 uberblocks.
Here, we'll only show the entry with ub_txg = 0x502. Again, if I had not
retrieved the value of the active uberblock_t after the file was created
and before the file was removed, I could dump every uberblock_t using the
following command and then search backwards, trying all transaction ids
that are less than the current one (i.e., current after the file was removed
and the file system unmounted).

> ::walk uberblock | ::print -a -t zfs`uberblock_t
...
{
20800 uint64_t ub_magic = 0xbab10c
20808 uint64_t ub_version = 0xb
20810 uint64_t ub_txg = 0x502 <-- the correct transaction id
20818 uint64_t ub_guid_sum = 0x7b3058fd830ec1b6
20820 uint64_t ub_timestamp = 0x48a7e924
20828 blkptr_t ub_rootbp = { <-- blkptr is at 0x20828 on disk
20828 dva_t [3] blk_dva = [
{
20828 uint64_t [2] dva_word = [ 0x1, 0x8d ]
}
{
20838 uint64_t [2] dva_word = [ 0x1, 0x60087 ]
}
{
20848 uint64_t [2] dva_word = [ 0x1, 0xc0047 ]
}
]
20858 uint64_t blk_prop = 0x800b070300000001
20860 uint64_t [3] blk_pad = [ 0, 0, 0 ]
20878 uint64_t blk_birth = 0x502
20880 uint64_t blk_fill = 0x1b
20888 zio_cksum_t blk_cksum = {
20888 uint64_t [4] zc_word = [ 0x81f780ec5, 0x361b52dda06,
0xb6f3f410036f, 0x1a2b8b10bfdb5c ]
}
}
}
...
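The backwards search described above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not part of the modified mdb: the `Uberblock` tuple and the `candidates` helper are hypothetical names, and the ring contents are partly made up (only the txg 1282 and 1284 entries come from the zdb output in this post).

```python
from typing import NamedTuple

UBERBLOCK_MAGIC = 0x00bab10c  # ub_magic as printed by zdb and mdb above


class Uberblock(NamedTuple):
    magic: int
    txg: int
    timestamp: int  # ub_timestamp, seconds since the epoch


def candidates(ring, active_txg):
    """Return valid uberblocks older than active_txg, newest first."""
    older = [ub for ub in ring
             if ub.magic == UBERBLOCK_MAGIC and ub.txg < active_txg]
    return sorted(older, key=lambda ub: ub.txg, reverse=True)


# Illustrative ring. txg 1282 and 1284 are the values seen in this post;
# the txg 1280 entry and the empty slot are made up to show the filtering.
ring = [
    Uberblock(0xbab10c, 1280, 1218963688),  # made up
    Uberblock(0xbab10c, 1282, 1218963748),  # after cp, before rm
    Uberblock(0xbab10c, 1284, 1218963808),  # active, after rm
    Uberblock(0, 0, 0),                     # unused slot
]

for ub in candidates(ring, active_txg=1284):
    print("try txg %d (0x%x), timestamp %d" % (ub.txg, ub.txg, ub.timestamp))
```

Each candidate still has to be checked by hand, since a surviving uberblock_t does not guarantee that the blocks it points to have not been re-used.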

Let's dump the blkptr_t for this uberblock_t.

> 20828::blkptr
DVA[0]: vdev_id 0 / 11a00
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000
DVA[0]: :0:11a00:200:d
DVA[1]: vdev_id 0 / c010e00
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000
DVA[1]: :0:c010e00:200:d
DVA[2]: vdev_id 0 / 18008e00
DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 20000000000
DVA[2]: :0:18008e00:200:d
LSIZE: 400 PSIZE: 200
ENDIAN: LITTLE TYPE: DMU objset
BIRTH: 502 LEVEL: 0 FILL: 1b00000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 81f780ec5:361b52dda06:b6f3f410036f:1a2b8b10bfdb5c
$q
#
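For readers wondering how dva_word = [ 0x1, 0x8d ] turns into offset 11a00 and size 200, here is a minimal Python sketch of the decoding. The bit positions are an assumption based on my reading of the on-disk format for this pool version (vdev in the top 32 bits of word 0, allocated size in the low 24 bits, offset in the low 63 bits of word 1, both in 512-byte units); `decode_dva` is a hypothetical helper, not the ::blkptr dcmd.

```python
SECTOR_SHIFT = 9  # assumed: DVA offsets and sizes are stored in 512-byte units


def decode_dva(word0, word1):
    """Split the two 64-bit words of a dva_t into its fields."""
    return {
        "vdev": word0 >> 32,
        "grid": (word0 >> 24) & 0xff,
        "asize": (word0 & 0xffffff) << SECTOR_SHIFT,
        "gang": bool(word1 >> 63),
        "offset": (word1 & ((1 << 63) - 1)) << SECTOR_SHIFT,
    }


# The three dva_word pairs from ub_rootbp in the ::print output above.
for w0, w1 in [(0x1, 0x8d), (0x1, 0x60087), (0x1, 0xc0047)]:
    d = decode_dva(w0, w1)
    print("vdev %d offset %x asize %x gang %s"
          % (d["vdev"], d["offset"], d["asize"], d["gang"]))
```

The decoded offsets and sizes agree with the `:0:11a00:200:d` style lines that ::blkptr prints.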

Now, using the modified zdb, let's dump the mos objset_phys_t.


# ./zdb -R zfs_fs:0:11a00:200:d,lzjb,400 2> /tmp/mos
Found vdev: /export/home/max/zfsfile
#
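The argument passed to the modified zdb -R follows a pattern that can be read off the commands in this post: pool, vdev, offset and physical size in hex, then either `r` for an uncompressed block or `d,<compression>,<logical size>` for a compressed one. The Python helper below only builds that string. It is a sketch inferred from these transcripts (`zdb_read_spec` is a hypothetical name), so check it against your own copy of the modified zdb.

```python
def zdb_read_spec(pool, vdev, offset, psize, comp="uncompressed", lsize=None):
    """Build the block specifier used with the modified 'zdb -R' in this post.

    Assumed format, inferred from the commands shown:
      uncompressed: pool:vdev:offset:psize:r
      compressed:   pool:vdev:offset:psize:d,<comp>,<lsize>
    All numbers are written in hex without a 0x prefix.
    """
    base = "%s:%x:%x:%x" % (pool, vdev, offset, psize)
    if comp == "uncompressed":
        return base + ":r"
    if lsize is None:
        raise ValueError("lsize is required for a compressed block")
    return "%s:d,%s,%x" % (base, comp, lsize)


# The root block pointer from the uberblock_t (LSIZE 400, PSIZE 200, lzjb).
print(zdb_read_spec("zfs_fs", 0, 0x11a00, 0x200, "lzjb", 0x400))
```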

Back to mdb to examine the objset_phys_t for the meta object set (mos).


# ./mdb /tmp/mos
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 0::print -a -t zfs`objset_phys_t
{
0 dnode_phys_t os_meta_dnode = {
0 uint8_t dn_type = 0xa <-- this is DMU_OT_DNODE
...
40 blkptr_t [1] dn_blkptr = [
{
40 dva_t [3] blk_dva = [
{
40 uint64_t [2] dva_word = [ 0x5, 0x88 ]
}
{
50 uint64_t [2] dva_word = [ 0x5, 0x60082 ]
}
{
60 uint64_t [2] dva_word = [ 0x5, 0xc0042 ]
}
]
70 uint64_t blk_prop = 0x800a07030004001f
78 uint64_t [3] blk_pad = [ 0, 0, 0 ]
90 uint64_t blk_birth = 0x502
98 uint64_t blk_fill = 0x1a
a0 zio_cksum_t blk_cksum = {
a0 uint64_t [4] zc_word = [ 0xa9af50f215, 0xec01e192b95e,
0xc523efad092ebc, 0x7a3a8be19416f454 ]
}
}
]
...

And dump the blkptr_t in the objset_phys_t.


> 40::blkptr
DVA[0]: vdev_id 0 / 11000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: a0000000000
DVA[0]: :0:11000:a00:d
DVA[1]: vdev_id 0 / c010400
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: a0000000000
DVA[1]: :0:c010400:a00:d
DVA[2]: vdev_id 0 / 18008400
DVA[2]: GANG: FALSE GRID: 0000 ASIZE: a0000000000
DVA[2]: :0:18008400:a00:d
LSIZE: 4000 PSIZE: a00
ENDIAN: LITTLE TYPE: DMU dnode
BIRTH: 502 LEVEL: 0 FILL: 1a00000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: a9af50f215:ec01e192b95e:c523efad092ebc:7a3a8be19416f454
$q
#
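The LSIZE, PSIZE, TYPE, LEVEL, CKFUNC and COMP values printed by ::blkptr all come from the single blk_prop word. A minimal Python sketch of that decoding follows. The field positions and the numeric codes in the two small tables are assumptions from my recollection of the on-disk format and zio.h for this pool version, and only the codes that appear in this post are listed.

```python
# Only the codes seen in this post; numbering assumed from zio.h.
COMPRESS = {2: "uncompressed", 3: "lzjb"}
CHECKSUM = {6: "fletcher2", 7: "fletcher4"}


def decode_blk_prop(prop):
    """Split a 64-bit blk_prop word into its fields (assumed layout)."""
    return {
        "lsize": ((prop & 0xffff) + 1) << 9,          # stored as sectors - 1
        "psize": (((prop >> 16) & 0xffff) + 1) << 9,  # stored as sectors - 1
        "comp": (prop >> 32) & 0xff,
        "cksum": (prop >> 40) & 0xff,
        "type": (prop >> 48) & 0xff,                  # DMU_OT_* value
        "level": (prop >> 56) & 0x1f,
        "little_endian": bool(prop >> 63),
    }


# blk_prop of ub_rootbp and of the mos os_meta_dnode blkptr_t shown above.
for prop in (0x800b070300000001, 0x800a07030004001f):
    p = decode_blk_prop(prop)
    print("LSIZE: %x PSIZE: %x TYPE: %#x LEVEL: %d CKFUNC: %s COMP: %s"
          % (p["lsize"], p["psize"], p["type"], p["level"],
             CHECKSUM.get(p["cksum"], "?"), COMPRESS.get(p["comp"], "?")))
```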

Using zdb with the offset specified for the first ditto block
in the above blkptr output, we get the mos dnode array.
Note that the "LEVEL: 0" blkptr output means there are
no levels of indirection. On larger zfs file systems, you may
need to go through block(s) of indirect blkptr_t's. An example of this
is shown a bit later.


# ./zdb -R zfs_fs:0:11000:a00:d,lzjb,4000 2> /tmp/metadnode
Found vdev: /export/home/max/zfsfile
#

Now, we'll look at the metadnode for the DMU_OT_OBJECT_DIRECTORY. This
will tell us about objects in the zfs file system. For every zfs file
system that I have tried this on, this is dnode number 1 (starting from
0). Regardless, the field to check is "dn_type = 0x1". It is possible
(I assume) for this to be at a different index into the metadnode array,
and possibly not in the 0x4000 bytes read and decompressed from 0x11000.
In that case, the LEVEL field would not have been 0, and you would have to
look at indirect blkptr_t's. But not here...


# ./mdb /tmp/metadnode
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 0,20::print -a -t zfs`dnode_phys_t <-- dnode_phys_t is 0x200 bytes, so 0x20
{ <-- of these in 0x4000.
0 uint8_t dn_type = 0 <-- First entry is not used (DMU_OT_NONE)
...
}
{ <-- start of the second entry
200 uint8_t dn_type = 0x1 <-- DMU_OT_OBJECT_DIRECTORY (see dmu.h)
...
240 blkptr_t [1] dn_blkptr = [ <-- blkptr_t is 0x240 in /tmp/metadnode
... <-- lots of output omitted, we'll look at some of this later.
}


Now we'll look at the blkptr_t for the Object Directory.

> 240::blkptr
DVA[0]: vdev_id 0 / 2400
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000
DVA[0]: :0:2400:200:d
DVA[1]: vdev_id 0 / c002400
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000
DVA[1]: :0:c002400:200:d
DVA[2]: vdev_id 0 / 18000000
DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 20000000000
DVA[2]: :0:18000000:200:d
LSIZE: 200 PSIZE: 200
ENDIAN: LITTLE TYPE: object directory
BIRTH: 4 LEVEL: 0 FILL: 100000000
CKFUNC: fletcher4 COMP: uncompressed
CKSUM: 5a40238b4:1cd8f9f7e19:522eab3e03f0:a9c92410b009e
$q

#

Now, we'll read the (uncompressed) 0x200 bytes of the object directory using zdb. The "2400" is the (hex) offset from the blkptr_t above.


# ./zdb -R zfs_fs:0:2400:200:r 2> /tmp/objdir
Found vdev: /export/home/max/zfsfile
#

Back to mdb to look at the object directory. Object directories are "zap"
objects. Zap objects contain name/value pairs. The first 64 bits
identify the type of the zap (micro zap or fat zap). A "fat zap" is a zap
object that uses indirection. Micro zaps contain name/value pairs directly
(i.e., no indirection). I have not seen a fat zap, but the largest zfs
file system I have used is only ~140GB, and I have not examined large
directories. (Directory entries are stored in zap objects.)

# ./mdb /tmp/objdir
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so

> 0/J <-- look at the first 64 bits as hex
0: 8000000000000003 <-- a "signature" for a micro zap

> 0::print -a -t zfs`mzap_phys_t <-- the beginning of the microzap is
{ <-- an mzap_phys_t
0 uint64_t mz_block_type = 0x8000000000000003
8 uint64_t mz_salt = 0x129c2c3
10 uint64_t mz_normflags = 0
18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
40 mzap_ent_phys_t [1] mz_chunk = [ <-- there may be more than one
{ <-- mzap_ent_phys_t starting here
40 uint64_t mze_value = 0x2 <-- object id of "root_dataset"
48 uint32_t mze_cd = 0
4c uint16_t mze_pad = 0
4e char [50] mze_name = [ "root_dataset" ]
}
]
}
$q
#

Now, we go back to the mos metadnode array in /tmp/metadnode, and
examine object id 2 (the third entry in the array).
Each entry is 0x200 bytes, so we want the dnode_phys_t starting
at 2*0x200 = 0x400 bytes into the file (mdb reads "2*200" as hex).
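The same arithmetic works for any object id. As a small Python sketch (the helper name `dnode_location` is hypothetical; the sizes are the 0x200-byte dnode_phys_t and 0x4000-byte block seen in this walkthrough):

```python
DNODE_SIZE = 0x200   # size of one dnode_phys_t
BLOCK_SIZE = 0x4000  # decompressed size of one block of dnodes


def dnode_location(object_id):
    """Return (dnode block number, byte offset inside that block)."""
    per_block = BLOCK_SIZE // DNODE_SIZE  # 0x20 dnodes per block
    return object_id // per_block, (object_id % per_block) * DNODE_SIZE


for obj in (2, 0x10, 0x20):
    blk, off = dnode_location(obj)
    print("object %#x: block %d, offset %#x" % (obj, blk, off))
```

Object ids of 0x20 and above fall outside the first block, which is where indirect blkptr_t's come into play.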

# ./mdb /tmp/metadnode
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 2*200::print -a -t zfs`dnode_phys_t <-- get object id 2
{
400 uint8_t dn_type = 0xc <-- DMU_OT_DSL_DIR (a dataset directory object)
...
404 uint8_t dn_bonustype = 0xc <-- bonus buffer contains a dsl_dir_phys_t
...
440 blkptr_t [1] dn_blkptr = [ <-- not used for this object
{
440 dva_t [3] blk_dva = [
{
440 uint64_t [2] dva_word = [ 0, 0 ]
...
4c0 uint8_t [320] dn_bonus = [ 0xe5, 0xa9, 0xa6, 0x48, 0, 0, 0, 0, 0x10,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, ... ]
}

And dump the bonus buffer at 0x4c0.

> 4c0::print -a -t zfs`dsl_dir_phys_t
{
4c0 uint64_t dd_creation_time = 0x48a6a9e5
4c8 uint64_t dd_head_dataset_obj = 0x10 <-- object id for dataset head
...
}

Let's get object id 0x10 (the dd_head_dataset_obj value) from the metadnode array.

> 10*200::print -a -t zfs`dnode_phys_t
{
2000 uint8_t dn_type = 0x10 <-- DMU_OT_DSL_DATASET
...
2004 uint8_t dn_bonustype = 0x10 <-- bonus buffer contains dsl_dataset_phys_t
...
2040 blkptr_t [1] dn_blkptr = [ <-- again, not used here
{
2040 dva_t [3] blk_dva = [
{
2040 uint64_t [2] dva_word = [ 0, 0 ]
...
20c0 uint8_t [320] dn_bonus = [ 0x2, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0
, 0, 0, 0x1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]
}

At 0x20c0 in the /tmp/metadnode file is a dsl_dataset_phys_t (the bonus
buffer).

> 20c0::print -a -t zfs`dsl_dataset_phys_t
{
20c0 uint64_t ds_dir_obj = 0x2
...
2140 blkptr_t ds_bp = {
2140 dva_t [3] blk_dva = [
{
2140 uint64_t [2] dva_word = [ 0x1, 0x79 ]
}
{
2150 uint64_t [2] dva_word = [ 0x1, 0x60073 ]
}
{
2160 uint64_t [2] dva_word = [ 0, 0 ]
}
]
...
}

Let's look at the blkptr_t in the dsl_dataset_phys_t.


> 2140::blkptr
DVA[0]: vdev_id 0 / f200
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000
DVA[0]: :0:f200:200:d
DVA[1]: vdev_id 0 / c00e600
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000
DVA[1]: :0:c00e600:200:d
LSIZE: 400 PSIZE: 200
ENDIAN: LITTLE TYPE: DMU objset
BIRTH: 502 LEVEL: 0 FILL: 600000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 9cb4e346a:403aa7532bf:d688fac60e1e:1e67a933734ea5
$q
#

The blkptr_t in the dsl_dataset_phys_t is for another DMU objset.
(The first DMU objset came from the uberblock_t rootbp and describes
the set of objects in the mos.) The objset described by the dsl_dataset_phys_t describes
the set of objects in the file system (i.e., files and directories (and...?)).
Back to zdb to get this data.

# ./zdb -R zfs_fs:0:f200:200:d,lzjb,400 2> /tmp/root_dataset_mos
Found vdev: /export/home/max/zfsfile
#

And back to mdb to display the objset_phys_t for the root dataset.

# ./mdb /tmp/root_dataset_mos
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 0::print -a -t zfs`objset_phys_t
{
0 dnode_phys_t os_meta_dnode = {
0 uint8_t dn_type = 0xa <-- DMU_OT_DNODE again (meta dnode for the file system's objset)
...
40 blkptr_t [1] dn_blkptr = [ <-- blkptr_t is 0x40 bytes into the file
...

And dump the blkptr_t...

> 40::blkptr
DVA[0]: vdev_id 0 / 10800
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[0]: :0:10800:400:id
DVA[1]: vdev_id 0 / c00fc00
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[1]: :0:c00fc00:400:id
LSIZE: 4000 PSIZE: 400
ENDIAN: LITTLE TYPE: DMU dnode
BIRTH: 502 LEVEL: 6 FILL: 500000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 58461f1c5e:3c7272ace4a1:15e2cf555fd2ac:58b9cdc6bcd0b54
$q
#

Note the "LEVEL: 6" in the above output. There are 6 levels of indirection
to get to another array of dnode_phys_t. We will follow the levels, always
using the first indirect blkptr_t at each level since the file was in
a directory whose object id is 3 (from "ls -aid /zfs_fs" back at the
beginning). If I want the dnode_phys_t for a different object id, I
can use the technique explained in the paper and slides referenced
at the beginning.

# ./zdb -R zfs_fs:0:10800:400:d,lzjb,4000 2> /tmp/dnode_l6
Found vdev: /export/home/max/zfsfile
#

Each indirect blkptr_t array contains 0x80 blkptr_t structures. (The size
of a blkptr_t is 0x80 bytes. 0x4000 (i.e., the size of the decompressed
data) divided by 0x80 = 0x80). We'll use mdb to examine blkptr 0 in the
array.

# ./mdb /tmp/dnode_l6
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 0::blkptr
DVA[0]: vdev_id 0 / 10400
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[0]: :0:10400:400:id
DVA[1]: vdev_id 0 / c00f800
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[1]: :0:c00f800:400:id
LSIZE: 4000 PSIZE: 400
ENDIAN: LITTLE TYPE: DMU dnode
BIRTH: 502 LEVEL: 5 FILL: 500000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 593bf7cd50:3d5bdfbff40e:1652191251855c:5af2260f72aa12a
$q
#
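Entry 0 is the right one at every level here only because the object ids of interest are tiny. For a larger object id, the index to follow in each indirect block can be computed as in the Python sketch below. It assumes the sizes seen in this walkthrough (0x4000-byte blocks, 0x200-byte dnodes, 0x80-byte blkptr_t's); `indirect_path` is a hypothetical helper, and the paper and slides referenced at the beginning remain the authoritative description.

```python
DNODES_PER_BLOCK = 0x4000 // 0x200     # 0x20 dnode_phys_t per level 0 block
BLKPTRS_PER_INDIRECT = 0x4000 // 0x80  # 0x80 blkptr_t per indirect block
BLKPTR_SIZE = 0x80


def indirect_path(object_id, levels):
    """Index of the blkptr_t to follow in each indirect block.

    The list runs from the highest indirect level down to level 1.
    """
    blkid = object_id // DNODES_PER_BLOCK  # which level 0 block holds the dnode
    return [(blkid // BLKPTRS_PER_INDIRECT ** (lvl - 1)) % BLKPTRS_PER_INDIRECT
            for lvl in range(levels, 0, -1)]


# Object id 3 (the root directory) with the 6 levels seen above.
print(indirect_path(3, 6))

# A made-up larger object id, to show non-zero indexes and the mdb address
# (index * 0x80) to hand to ::blkptr in each dumped indirect block.
for idx in indirect_path(5000, 2):
    print("index %d -> %x::blkptr" % (idx, idx * BLKPTR_SIZE))
```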

Great. Let's get the indirect array for level 5.

# ./zdb -R zfs_fs:0:10400:400:d,lzjb,4000 2> /tmp/dnode_l5
Found vdev: /export/home/max/zfsfile
#

And back to mdb to display it...

# ./mdb /tmp/dnode_l5
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 0::blkptr
DVA[0]: vdev_id 0 / 10000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[0]: :0:10000:400:id
DVA[1]: vdev_id 0 / c00f400
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[1]: :0:c00f400:400:id
LSIZE: 4000 PSIZE: 400
ENDIAN: LITTLE TYPE: DMU dnode
BIRTH: 502 LEVEL: 4 FILL: 500000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 5a4787d7ae:3e5a99501b03:16cbd983ac6802:5d616f6513864cb
$q
#

And now to level 4. Note that BIRTH value corresponds to the
transaction id we want... (0x502 = 0t1282).

# ./zdb -R zfs_fs:0:10000:400:d,lzjb,4000 2> /tmp/dnode_l4
Found vdev: /export/home/max/zfsfile
#

Back to mdb...

# ./mdb /tmp/dnode_l4
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 0::blkptr
DVA[0]: vdev_id 0 / fc00
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[0]: :0:fc00:400:id
DVA[1]: vdev_id 0 / c00f000
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[1]: :0:c00f000:400:id
LSIZE: 4000 PSIZE: 400
ENDIAN: LITTLE TYPE: DMU dnode
BIRTH: 502 LEVEL: 3 FILL: 500000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 580321bd90:3cf0cb3827a9:1647f21a4fee83:5b2042e25b8771b
$q
#

Now to level 3.

# ./zdb -R zfs_fs:0:fc00:400:d,lzjb,4000 2> /tmp/dnode_l3
Found vdev: /export/home/max/zfsfile
#
# ./mdb /tmp/dnode_l3
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 0::blkptr
DVA[0]: vdev_id 0 / f800
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[0]: :0:f800:400:id
DVA[1]: vdev_id 0 / c00ec00
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[1]: :0:c00ec00:400:id
LSIZE: 4000 PSIZE: 400
ENDIAN: LITTLE TYPE: DMU dnode
BIRTH: 502 LEVEL: 2 FILL: 500000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 58e75640d6:3d0bc696c0c2:162c02c075c9ab:5a30099a876cabe
$q
#

And level 2...

# ./zdb -R zfs_fs:0:f800:400:d,lzjb,4000 2> /tmp/dnode_l2
Found vdev: /export/home/max/zfsfile
#
# ./mdb /tmp/dnode_l2
mdb: no terminal data available for TERM=emacs
mdb: term init failed: command-line editing and prompt will not be available
::loadctf
::load /export/home/max/source/mdb/i386/rawzfs.so
0::blkptr
DVA[0]: vdev_id 0 / f400
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[0]: :0:f400:400:id
DVA[1]: vdev_id 0 / c00e800
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[1]: :0:c00e800:400:id
LSIZE: 4000 PSIZE: 400
ENDIAN: LITTLE TYPE: DMU dnode
BIRTH: 502 LEVEL: 1 FILL: 500000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 5763205f2d:3c57f7df68a9:15fea170721af5:59a7686491c24a7
$q
#

And level 1.

# ./zdb -R zfs_fs:0:f400:400:d,lzjb,4000 2> /tmp/dnode_l1
Found vdev: /export/home/max/zfsfile
#
# ./mdb /tmp/dnode_l1
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 0::blkptr
DVA[0]: vdev_id 0 / ec00
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 60000000000
DVA[0]: :0:ec00:600:d
DVA[1]: vdev_id 0 / c00e000
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 60000000000
DVA[1]: :0:c00e000:600:d
LSIZE: 4000 PSIZE: 600
ENDIAN: LITTLE TYPE: DMU dnode
BIRTH: 502 LEVEL: 0 FILL: 500000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 6fb5c84271:61d7d7ffe6a4:2f9cbc90dcaa4c:10f07885852e2558
$q
#

Level 0 will contain the beginning of the array of dnode_phys_t
for files and directories within the file system.
We'll again use zdb to retrieve the block containing the first
0x20 entries. (Again, decompressed size is 0x4000, dnode_phys_t size
is 0x200, so there are 0x20 entries in the first level 0 block).

# ./zdb -R zfs_fs:0:ec00:600:d,lzjb,4000 2> /tmp/dnode_l0
Found vdev: /export/home/max/zfsfile
#

# ./mdb /tmp/dnode_l0
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 0,20::print -t -a zfs`dnode_phys_t
{
0 uint8_t dn_type = 0 <-- first entry is not used
...
}
{ <-- second entry (object id 1)
200 uint8_t dn_type = 0x15 <-- DMU_OT_MASTER_NODE
...
240 blkptr_t [1] dn_blkptr = [
{
...
}
{ <-- third entry (object id 2)
400 uint8_t dn_type = 0x16
...
{ <-- fourth entry (object id 3, should be root directory for the fs)
600 uint8_t dn_type = 0x14 <-- DMU_OT_DIRECTORY_CONTENTS
...
604 uint8_t dn_bonustype = 0x11 <-- bonus buffer contains znode_phys_t
...
640 blkptr_t [1] dn_blkptr = [ <-- this blkptr_t is a zap for directory entries
{
640 dva_t [3] blk_dva = [
{
640 uint64_t [2] dva_word = [ 0x1, 0x73 ]
}
...
6c0 uint8_t [320] dn_bonus = [ 0x1e, 0xe9, 0xa7, 0x48, 0, 0, 0, 0,
0xc3, 0x61, 0x34, 0xf, 0, 0, 0, 0, 0x1f, 0xe9, 0xa7, 0x48, 0,
0, 0, 0, 0x1, 0x43, 0x79, 0x3a, 0, 0, 0, 0, ... ]
... <-- lots omitted
}

At this point, we could go to the fourth entry in the above output
(object id 3 at 0x600 bytes into the file) and look at the directory
contents to see if the removed file is there. (Remember, ls -aid on
the directory containing the removed file shows inumber 3.)
However, we'll be safe and examine the master node to get to
the root directory of the file system. The master node
is object id 1 (at 0x200 in the above output). The block pointer
for this dnode_phys_t is for a zap object.
We'll use mdb to dump the master node blkptr_t.

> 240::blkptr
DVA[0]: vdev_id 0 / 0
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000
DVA[0]: :0:0:200:d
DVA[1]: vdev_id 0 / c000000
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000
DVA[1]: :0:c000000:200:d
LSIZE: 200 PSIZE: 200
ENDIAN: LITTLE TYPE: ZFS master node
BIRTH: 4 LEVEL: 0 FILL: 100000000
CKFUNC: fletcher4 COMP: uncompressed
CKSUM: 233dfc135:e10dd7aa27:2e8c1eba771e:6a380d575d3d6
$q
#

And now back to zdb to get the zfs master node zap object. Note
that this is not compressed, and is at the beginning of the disk
(following label 0 and label 1).

# ./zdb -R zfs_fs:0:0:200:r 2> /tmp/master_node
Found vdev: /export/home/max/zfsfile
#

Back to mdb to examine the master node zap.

# ./mdb /tmp/master_node
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 0/J <-- let's see what kind of zap it is
0: 8000000000000003 <-- micro zap
> 0::print -a -t zfs`mzap_phys_t
{
0 uint64_t mz_block_type = 0x8000000000000003
8 uint64_t mz_salt = 0x3d3b
10 uint64_t mz_normflags = 0
18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
40 mzap_ent_phys_t [1] mz_chunk = [
{
40 uint64_t mze_value = 0x3
48 uint32_t mze_cd = 0
4c uint16_t mze_pad = 0
4e char [50] mze_name = [ "VERSION" ]
}
]
}

The mzap_phys_t is 0x80 bytes large. Following this are zero or more
mzap_ent_phys_t. Each mzap_ent_phys_t is 0x40 bytes. The following
will dump all mzap_ent_phys_t following the mzap_phys_t in the block.

> 80,((200-80)%40)::print -a -t zfs`mzap_ent_phys_t
...
{
c0 uint64_t mze_value = 0x3 <-- the object id for the root of the fs
c8 uint32_t mze_cd = 0
cc uint16_t mze_pad = 0
ce char [50] mze_name = [ "ROOT" ] <-- this is root
}
$q
#
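The micro zap layout shown by mdb (a 0x40-byte header followed by 0x40-byte mzap_ent_phys_t entries) is simple enough to parse directly. The Python sketch below does that for a block dumped to a file or held in memory. The struct formats mirror the field offsets in the ::print output above; `parse_microzap` and `make_block` are hypothetical helpers, and the test block is synthetic, holding only the two entries this post shows (VERSION at 0x40 and ROOT at 0xc0).

```python
import struct

MZAP_HEADER = struct.Struct("<QQQ5Q")  # mz_block_type, mz_salt, mz_normflags, mz_pad[5]
MZAP_ENTRY = struct.Struct("<QIH50s")  # mze_value, mze_cd, mze_pad, mze_name[50]
MICRO_ZAP = 0x8000000000000003         # the "signature" seen with 0/J


def parse_microzap(block):
    """Return {name: value} for every used entry in a micro zap block."""
    block_type = MZAP_HEADER.unpack_from(block, 0)[0]
    if block_type != MICRO_ZAP:
        raise ValueError("not a micro zap: %#x" % block_type)
    entries = {}
    last = len(block) - MZAP_ENTRY.size
    for off in range(MZAP_HEADER.size, last + 1, MZAP_ENTRY.size):
        value, _cd, _pad, raw = MZAP_ENTRY.unpack_from(block, off)
        name = raw.split(b"\0", 1)[0].decode("ascii", "replace")
        if name:  # unused slots have an empty name
            entries[name] = value
    return entries


def make_block(slots, salt=0x3d3b, size=0x200):
    """Build a synthetic micro zap block for testing (not real disk data)."""
    block = bytearray(size)
    MZAP_HEADER.pack_into(block, 0, MICRO_ZAP, salt, 0, 0, 0, 0, 0, 0)
    for i, (name, value) in enumerate(slots):
        MZAP_ENTRY.pack_into(block, 0x40 + i * 0x40, value, 0, 0, name.encode())
    return bytes(block)


# VERSION at 0x40 and ROOT at 0xc0; the slot at 0x80 is elided in the post,
# so it is left empty here.
block = make_block([("VERSION", 3), ("", 0), ("ROOT", 3)])
for name, value in parse_microzap(block).items():
    print("%s = %#x" % (name, value))
```

To use it on a real dump, read the file written by zdb (for example `open("/tmp/master_node", "rb").read()`) and pass the first 0x200 bytes.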

Now, back to the level 0 dnode_phys_t array to look at the root directory
dnode_phys_t.

# ./mdb /tmp/dnode_l0
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 3*200::print -a -t zfs`dnode_phys_t
{
600 uint8_t dn_type = 0x14 <-- DMU_OT_DIRECTORY_CONTENTS
...
604 uint8_t dn_bonustype = 0x11 <-- bonus buffer contains znode_phys_t
...
640 blkptr_t [1] dn_blkptr = [
{
640 dva_t [3] blk_dva = [
{
640 uint64_t [2] dva_word = [ 0x1, 0x73 ]
}
...
}

The blkptr_t is for a zap object containing filename/object id
values for files in the root directory of the file system.

> 640::blkptr
DVA[0]: vdev_id 0 / e600
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000
DVA[0]: :0:e600:200:d
DVA[1]: vdev_id 0 / c00da00
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000
DVA[1]: :0:c00da00:200:d
LSIZE: 200 PSIZE: 200
ENDIAN: LITTLE TYPE: ZFS directory
BIRTH: 502 LEVEL: 0 FILL: 100000000
CKFUNC: fletcher4 COMP: uncompressed
CKSUM: 25f50a2fc:fe963fd84e:36937666328d:7f9475424708c
$q
#

Now read the root directory zap object.

# ./zdb -R zfs_fs:0:e600:200:r 2> /tmp/rootdir
Found vdev: /export/home/max/zfsfile
#

And use mdb to look at the zap entries.

# ./mdb /tmp/rootdir
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 0/J
0: 8000000000000003 <-- a micro zap


> 0::print -a -t zfs`mzap_phys_t
{
0 uint64_t mz_block_type = 0x8000000000000003
8 uint64_t mz_salt = 0x3e0f
10 uint64_t mz_normflags = 0
18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
40 mzap_ent_phys_t [1] mz_chunk = [
{
40 uint64_t mze_value = 0x8000000000000004
48 uint32_t mze_cd = 0
4c uint16_t mze_pad = 0
4e char [50] mze_name = [ "foo" ]
}
]
}

And dump the rest of the zap entries.

> 80,((200-80)%40)::print -a -t zfs`mzap_ent_phys_t
{
80 uint64_t mze_value = 0x8000000000000005
88 uint32_t mze_cd = 0
8c uint16_t mze_pad = 0
8e char [50] mze_name = [ "words" ] <-- here is the removed file
}
...
> 5*200=X <-- we want dnode_phys_t object id 5.
a00 <-- Offset within /tmp/dnode_l0 where the object resides
$q
#
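The step from mze_value = 0x8000000000000005 to "object id 5" deserves a word: in a directory zap, the value carries the file type in its top bits and the object id in its low bits. The Python sketch below shows the split. The bit positions (type in the top 4 bits, object id in the low 48) are an assumption from my recollection of the ZFS_DIRENT_TYPE and ZFS_DIRENT_OBJ macros, and `decode_dirent` is a hypothetical helper.

```python
DNODE_SIZE = 0x200


def decode_dirent(value):
    """Split a directory-entry zap value into (file type, object id)."""
    return value >> 60, value & ((1 << 48) - 1)


for name, value in (("foo", 0x8000000000000004), ("words", 0x8000000000000005)):
    ftype, obj = decode_dirent(value)
    # The offset is where the dnode_phys_t sits in /tmp/dnode_l0.
    print("%s: type %d, object id %d, dnode at %#x"
          % (name, ftype, obj, obj * DNODE_SIZE))
```

A type of 8 is what a regular file would give if the type is derived from the mode bits (0x8000 shifted right by 12).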

We'll go back and get the dnode for object id 5.

# ./mdb /tmp/dnode_l0
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> a00::print -a -t zfs`dnode_phys_t
{
a00 uint8_t dn_type = 0x13 <-- DMU_OT_PLAIN_FILE_CONTENTS
...
a04 uint8_t dn_bonustype = 0x11 <-- znode_phys_t for "words" file
...
a40 blkptr_t [1] dn_blkptr = [ <-- blkptr_t for data or indirect blocks
{
...
ac0 uint8_t [320] dn_bonus = [ 0x1f, 0xe9, 0xa7, 0x48, 0, 0, 0, 0, 0xcb,
0x96, 0x78, 0x3a, 0, 0, 0, 0, 0x1f, 0xe9, 0xa7, 0x48, 0, 0, 0, 0, 0xd1, 0xb1,
0x83, 0x3a, 0, 0, 0, 0, ... ]
}

Now, let's take a quick look at the znode_phys_t for this file.
It is in the bonus buffer at 0xac0.

> ac0::print -a -t zfs`znode_phys_t
{
ac0 uint64_t [2] zp_atime = [ 0x48a7e91f, 0x3a7896cb ]
ad0 uint64_t [2] zp_mtime = [ 0x48a7e91f, 0x3a83b1d1 ]
ae0 uint64_t [2] zp_ctime = [ 0x48a7e91f, 0x3a83b1d1 ]
af0 uint64_t [2] zp_crtime = [ 0x48a7e91f, 0x3a7896cb ]
b00 uint64_t zp_gen = 0x502
b08 uint64_t zp_mode = 0x8124
b10 uint64_t zp_size = 0x32752 <-- should be same size as /usr/dict/words
b18 uint64_t zp_parent = 0x3
b20 uint64_t zp_links = 0x1
...
}

> 32752=D
206674
> !ls -l /usr/dict/words
-r--r--r-- 1 root bin 206674 Jul 11 02:57 /usr/dict/words
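The other znode_phys_t fields can be sanity-checked the same way. This Python sketch decodes zp_mode, zp_size and the seconds half of zp_mtime from the output above using only the standard library; `describe` is a hypothetical helper.

```python
import stat
from datetime import datetime, timezone

zp_mode = 0x8124
zp_size = 0x32752
zp_mtime = (0x48a7e91f, 0x3a83b1d1)  # seconds, nanoseconds


def describe(mode, size, mtime):
    """Return (ls-style mode string, size in bytes, mtime as ISO 8601 UTC)."""
    when = datetime.fromtimestamp(mtime[0], tz=timezone.utc)
    return stat.filemode(mode), size, when.isoformat()


print(describe(zp_mode, zp_size, zp_mtime))
```

The mode string and size should match the `ls -l /usr/dict/words` line, and the modification time should fall on the day the file was copied.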

Looks good. Let's look at the blkptr_t for this dnode_phys_t.

> a40::blkptr
DVA[0]: vdev_id 0 / e800
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[0]: :0:e800:400:id
DVA[1]: vdev_id 0 / c00dc00
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[1]: :0:c00dc00:400:id
LSIZE: 4000 PSIZE: 400
ENDIAN: LITTLE TYPE: ZFS plain file
BIRTH: 502 LEVEL: 1 FILL: 200000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 5e9a82c0c2:3ff97cbecacc:1714169599f4c8:5dd02ff967dd42c
$q
#

Note the "LEVEL: 1". This means there is one level of indirect blocks.
We'll use zdb to retrieve the indirect block.

# ./zdb -R zfs_fs:0:e800:400:d,lzjb,4000 2> /tmp/iblock
Found vdev: /export/home/max/zfsfile
#

And mdb to look at the indirect block.

# ./mdb /tmp/iblock
> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so
> 0::blkptr
DVA[0]: vdev_id 0 / 20000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2000000000000
DVA[0]: :0:20000:20000:d
LSIZE: 20000 PSIZE: 20000
ENDIAN: LITTLE TYPE: ZFS plain file
BIRTH: 502 LEVEL: 0 FILL: 100000000
CKFUNC: fletcher2 COMP: uncompressed
CKSUM: f5cbf93a151abcac:5b5d6ca83588d8ad:574d9b8bf334944b:ad78d30af51771d8

The blkptr_t above is for the first 0x20000 (128k) of the file.
The next blkptr_t in the indirect block should contain the remainder
of the file. (The file is less than 256k large).

> 80::blkptr
DVA[0]: vdev_id 0 / 40000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2000000000000
DVA[0]: :0:40000:20000:d
LSIZE: 20000 PSIZE: 20000
ENDIAN: LITTLE TYPE: ZFS plain file
BIRTH: 502 LEVEL: 0 FILL: 100000000
CKFUNC: fletcher2 COMP: uncompressed
CKSUM: f39ae34f048ae079:de2ef1af7d1fb495:ec3ae3f7985b2a98:c6d33ac68cb042b6
$q
#

Now we'll use zdb to retrieve the data blocks.

# ./zdb -R zfs_fs:0:20000:20000:r 2> /tmp/data <-- first data block
Found vdev: /export/home/max/zfsfile
# ./zdb -R zfs_fs:0:40000:20000:r 2> /tmp/data1 <-- second data block
Found vdev: /export/home/max/zfsfile
#

# cat /tmp/data /tmp/data1 > /tmp/foo <-- concatenate them

Size of the file, according to the znode_phys_t is 206674 bytes.
We'll lop off remaining bytes.

# dd if=/tmp/foo bs=206674 count=1 of=/tmp/finalwords
1+0 records in
1+0 records out
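The concatenate-and-truncate step generalizes to any file size. A minimal Python sketch, assuming the 0x20000-byte (128k) data blocks seen above; `blocks_needed` and `assemble` are hypothetical helpers:

```python
BLOCK = 0x20000   # LSIZE of each level 0 data block above (128k)
ZP_SIZE = 206674  # zp_size from the znode_phys_t


def blocks_needed(size, block=BLOCK):
    """Number of data blocks that hold 'size' bytes (ceiling division)."""
    return -(-size // block)


def assemble(blocks, size):
    """Concatenate data blocks in order and drop the padding after 'size'."""
    return b"".join(blocks)[:size]


n = blocks_needed(ZP_SIZE)
print("blocks to read: %d, bytes to lop off: %d" % (n, n * BLOCK - ZP_SIZE))

# With real data, 'blocks' would be the contents of /tmp/data and /tmp/data1.
recovered = assemble([bytes(BLOCK)] * n, ZP_SIZE)
print(len(recovered))
```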

Now, let's see if we have the correct data.

# diff /tmp/finalwords /usr/dict/words
# <-- no differences

2 comments:

Anonymous said...

how did you get the modified mdb and zdb. Is it available for general usage? Where do i get it for my purpose.

Max Bruning said...

Hi. The modified mdb and zdb are available by sending me email at max@bruningsystems.com. These are modifications that I made. The modification to mdb allows one to use kernel CTF information on raw disks (or any other raw data file). The modified zdb allows one to dump decompressed blocks from the disk. There are also 2 dmods, one for zfs on raw disk, the other for ufs on raw disk.