Thursday, April 09, 2009

RAIDZ On-Disk Format

A while back, I came up with a way to examine the ZFS on-disk format using a modified mdb and zdb (see paper and slides). I also used the method described there to recover a removed file (see here in my blog). During the past week, I decided to try to understand the layout of raidz, in other words, how raidz organizes data on disk. It's simple to say that raidz on disk is basically raid5 with variable length stripes, but what does that really mean?

To do this, I once again used the modified mdb, and made a further modification to zdb. In addition, I implemented a new command (zuncompress) which allows me to uncompress ZFS data existing in a regular file. Since I fear that most of the 10 people or so who read this will not want to read a long description of how I determined the layout, here I'll just give a summary. If anyone really wants the details, reply to the blog and maybe I'll go into them.

First, some general characteristics:

- Each disk contains the 4MB labels at the beginning and end of the disk. For information on these, please see the ZFS On-Disk Specification paper. Any walk of the ZFS on-disk format starts with an uberblock_t, which is found in this area.

- The metadata used for raidz is the same as for other ZFS objects. In other words, uberblock_t contains the location of an objset_phys_t, which in turn contains the location of the meta-object set (mos), and so on. A difference is that physically, an individual structure may be spread across several disks, and not necessarily all of the disks. For example, let's take a mos (basically an array of dnode_phys_t structures) on a 5 disk raidz volume. This might be compressed to 0xa00 (2560 decimal) bytes. This may be organized on the raidz disks as follows:
- 512 bytes on disk 0
- 512 bytes on disk 1
- 512 bytes on disk 2
- 1024 bytes on disk 3
- 1024 bytes on disk 4
If you do the arithmetic, you'll find this is 0xe00 bytes (3.5K) and not 0xa00 bytes (2.5K). The actual allocated size may be still larger. The reason for the extra 1K bytes is given in the next point.

- Each metadata object (as well as data itself) has its own parity. The extra 1K bytes in the previous point is for parity. If the parity in the above example is on disk 4, it must be 1024 bytes long, since the largest piece of the object on any one disk is 1K bytes. Even a metadata structure that only takes up 512 bytes (for instance, an objset_phys_t) will take up 1024 bytes on the disks, one disk containing the 512-byte structure, and another containing 512 bytes of parity.

- Block offsets as reported by zdb (and described in the ZFS On-Disk Specification) are for the entire space (i.e., if you have 5 100GB disks making up a raidz pool, the block offsets start at 0 and go to 500GB).

- Since block offsets cover the entire pool, you cannot simply look at the offsets reported by zdb and map them to locations on disk. The kernel routine vdev_raidz_map_alloc() converts an offset and size to locations on the disks. I have added an option to zdb that, given a raidz pool, offset, and size (as reported by zdb), calls this routine and prints out the values of the returned map. This shows the locations on the disk(s) and the sizes for both the data itself and the parity.

- I recently saw a person on #opensolaris irc stating that a write of a 1 byte file results in a write to all disks in raidz. That may be true (I haven't checked), but only 1024 bytes are used for the 1 byte file: one 512 byte block containing the 1 byte of data, and another 512 byte block on a different disk containing parity. It is not using space on all n disks for a 1 byte file.

- ZFS basically uses a scatter-gather approach to reading and writing data on a raidz pool. The disks are read at the correct offsets into a buffer large enough to contain the data. So on a read, data on the first disk is read into the beginning of the buffer, data from the second disk is read into the same buffer immediately after the end of the data from the first disk, and so on. The resulting buffer is then decompressed, and the data returned to the requestor.
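
To make the points above about per-disk sizes, parity, and pool-wide offsets a bit more concrete, here is a rough Python sketch of how an (offset, size) pair could be split into per-disk columns. It is modeled loosely on my understanding of what vdev_raidz_map_alloc() does, and is not the kernel code or the zdb option itself. The names raidz_map, Column, and allocated_size are made up for illustration. It assumes 512-byte sectors and single parity, leaves out the kernel's occasional swap of the parity and first data column, and does not add the label area at the start of each disk to the offsets. The rounding in allocated_size is likewise an assumption about how the allocated size is padded.

```python
from dataclasses import dataclass

SECTOR_SHIFT = 9  # assumption: 512-byte sectors


@dataclass
class Column:
    disk: int        # index of the disk within the raidz vdev
    offset: int      # byte offset on that disk (label area not included)
    size: int        # bytes stored on that disk
    is_parity: bool  # True for the parity column(s)


def raidz_map(offset, size, ndisks, nparity=1, shift=SECTOR_SHIFT):
    """Split a pool-wide (offset, size) pair into per-disk columns."""
    b = offset >> shift                        # starting sector, pool-wide
    s = (size + (1 << shift) - 1) >> shift     # data sectors, rounded up
    first = b % ndisks                         # disk holding the first column
    base = (b // ndisks) << shift              # byte offset of that row per disk
    ndata = ndisks - nparity
    q, r = divmod(s, ndata)                    # full rows, leftover data sectors
    big = 0 if r == 0 else r + nparity         # columns that get one extra sector
    ncols = big if q == 0 else ndisks          # small blocks touch fewer disks

    cols = []
    for c in range(ncols):
        disk, off = first + c, base
        if disk >= ndisks:                     # wrapped past the last disk
            disk -= ndisks
            off += 1 << shift
        sectors = q + (1 if c < big else 0)
        cols.append(Column(disk, off, sectors << shift, c < nparity))
    return cols


def allocated_size(cols, nparity=1, shift=SECTOR_SHIFT):
    """Total bytes used, padded to a multiple of (nparity + 1) sectors (assumed)."""
    unit = (nparity + 1) << shift
    total = sum(c.size for c in cols)
    return -(-total // unit) * unit


# The 0xa00-byte mos from the example; offset 0x600 is a made-up value.
for col in raidz_map(0x600, 0xa00, ndisks=5):
    print(col)
```

With the made-up offset 0x600, the sketch gives 1024 bytes on each of two disks and 512 bytes on each of the other three, 0xe00 bytes in all, which matches the example. A 512-byte structure comes out as two 512-byte columns, one of them parity, on two disks only.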

So, that's the basics. I was going to turn my attention to the in-memory representation of ZFS, but now think instead I'll take a stab at automating the techniques I am using. Once I have that done, I'll try automating recovery of a removed file. From there, we'll see.


tobi11 said...

Very nice explanation. It helped me understand how ZFS operates on a Raid-Z and why the expansion of a Raid-Z is such a tricky thing.
There is not much to find on this topic in the official On-Disk-Layout Specification.
So great work.

jamesd_wi said...

I must be user #9, and I would love a more detailed version. ZFS rocks, and I could always use more details on the black magic it uses.

Max Bruning said...

I have a file that contains all of the steps I took, and output for each step, to find data for a file on a 5 disk raidz. As soon as I get around to annotating it, I'll post it here. It's basically the same "simple" 25 or so steps from the osdevcon paper, plus another 10-15 steps to reconstruct things.

Ulrich Gräf said...

About: writing one byte needs writing to all disks:
If you are writing the byte it may be writing to only 2 blocks on different disks, but performing the transaction needs updates to all the uberblock lists in all devices of the zpool.
So you get 4 writes on each disk at each transaction commit (default: every 5 seconds) as long as you are writing on this zpool - independent of the vdev structure.
(I can hear it on my workstation...).

Max Bruning said...

Ulrich, You are absolutely correct. I was only thinking of the data for the file being written. In fact, updating of other metadata besides uberblocks may increase the number of writes as well. Fortunately, a lot of the metadata may be close together, so it's not like a separate write is needed for every structure.

Anonymous said...

very nice work max. Are you working together with the zfs forensics project?


Max Bruning said...

I am not working with the zfs forensics project, but there is a link on their page to some of my stuff. I'll send them an email to let them know about the raid-z stuff. thanks.