Saturday, September 05, 2009
Correction for classes in Berlin
Oops. I've got the wrong dates for the Solaris/OpenSolaris Internals classes in Berlin. The correct dates are Sept. 21-25, and Sept. 28 - Oct. 2. The first week is full, but there are still openings for the second week. Hope to see you there!
Tuesday, August 04, 2009
OpenSolaris Internals course announcements
I will be teaching 2 back-to-back 5 day OpenSolaris Internals classes in Berlin, Germany the weeks of September 28 through October 2, and again from October 5 through October 10. For details about topics covered, price, and availability, please visit http://www.workshops-berlin.de/. Note that this website is in German. The course itself will be in English. If you are interested, but cannot read German, send me an email.
Thursday, April 09, 2009
RAIDZ On-Disk Format
A while back, I came up with a way to examine ZFS on-disk format using a modified mdb and zdb (see paper and slides). I also used the method described to recover a removed file (see here in my blog). During the past week, I decided to try to understand the layout of raidz. In other words, how raidz organizes data on disk. It's simple to say that raidz on disk is basically raid5 with variable length stripes, but what does that really mean?
To do this, I once again used the modified mdb, and made a further modification to zdb. In addition, I implemented a new command (zuncompress) which allows me to uncompress ZFS data existing in a regular file. Since I fear that most of the 10 people or so who read this will not want to read a long description of how I determined the layout, here I'll just give a summary. If anyone really wants the details, reply to the blog and maybe I'll go into them.
First, some general characteristics:
- Each disk contains the 4MB labels at the beginning and end of the disk. For information on these, please see the ZFS On-Disk Specification paper. Any walk of the ZFS on-disk format starts with an uberblock_t, which is found in this area.
- The metadata used for raidz is the same as for other ZFS objects. In other words, uberblock_t contains the location of an objset_phys_t, which in turn contains the location of the meta-object set (mos), and so on. A difference is that physically, individual structures on disk may exist across different disks, and not necessarily all of the disks. For example, let's take a mos (basically an array of dnode_phys_t structures), on a 5 disk raidz volume. This might be compressed to 0xa00 (2560 decimal) bytes. This may be organized on the raidz disks as follows:
- 512 bytes on disk 0
- 512 bytes on disk 1
- 512 bytes on disk 2
- 1024 bytes on disk 3
- 1024 bytes on disk 4
If you do the arithmetic, you'll find this is 0xe00 bytes (3.5K) and not 0xa00 (2.5K) bytes. The actual allocated size may be still larger. The reason for the extra 1K bytes is the next point.
- Each metadata object (as well as data itself) has its own parity. The extra 1K bytes in the previous point is for parity. If the parity in the above example is on disk 4, it must be 1024 bytes large, since the largest size of any of the blocks containing the object is 1K bytes. Even a metadata structure that only takes up 512 bytes (for instance, an objset_phys_t), will take up 1024 bytes on the disks, one disk containing the 512-byte structure, and another containing 512 bytes of parity.
- Block offsets as reported by zdb (and described in the ZFS On-Disk Specification) are for the entire space (i.e., if you have 5 100GB disks making up a raidz pool, the block offsets start at 0 and go to 500GB).
- Since block offsets cover the entire pool, you cannot simply look at the offsets reported by zdb and map them to locations on disk. The kernel routine, vdev_raidz_map_alloc() (see http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c#644), converts offset and size to locations on the disks. I have added an option to zdb that, given a raidz pool, offset, and size (as reported by zdb), calls this routine and prints out the values of the returned map. This shows the location on the disk(s) and sizes for both the data itself, and parity.
- I recently saw a person on #opensolaris IRC stating that a write of a 1 byte file results in a write to all disks in raidz. That may be true (I haven't checked), but only 1024 bytes are used for the 1 byte file: one 512 byte block containing the 1 byte of data, and another 512 byte block on a different disk containing parity. It is not using space on all n disks for a 1 byte file.
- ZFS basically uses a scatter-gather approach to reading and writing data on a raidz pool. The disks are read at the correct offsets into a buffer large enough to contain the data. So on a read, data on the first disk is read into the beginning of the buffer, data from the second disk is read into the same buffer virtually contiguous with the end of the data from the first disk, and so on. The resulting buffer is then de-compressed, and the data returned to the requestor.
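To make the offset-to-disk mapping a bit more concrete, here is a rough Python sketch of the kind of arithmetic involved. It is a simplification written for illustration, not the kernel's code: the function name raidz_map and its parameters are made up, it assumes 512-byte sectors and single parity, it ignores the labels at the start of each disk, and it always puts parity in the first column (the real routine has more details than this).

```python
def raidz_map(offset, size, ndisks, nparity=1, sector=512):
    """Hypothetical sketch: split a pool-wide (offset, size) into per-disk columns.

    Returns a list of (disk, disk_offset, nbytes, is_parity) tuples.
    """
    b = offset // sector            # starting sector in the pool-wide offset space
    s = -(-size // sector)          # number of data sectors, rounded up
    first = b % ndisks              # disk holding the first column
    row_off = (b // ndisks) * sector
    ndata = ndisks - nparity
    q, r = divmod(s, ndata)         # full rows, leftover data sectors
    big = r + nparity if r else 0   # columns that carry one extra sector
    ncols = ndisks if q else big    # small blocks do not touch every disk
    cols = []
    for c in range(ncols):
        disk = first + c
        off = row_off
        if disk >= ndisks:          # wrap around to the next row of sectors
            disk -= ndisks
            off += sector
        nbytes = (q + (1 if c < big else 0)) * sector
        cols.append((disk, off, nbytes, c < nparity))
    return cols


# The 0xa00-byte mos example on a 5 disk raidz (offset chosen for illustration).
for disk, off, nbytes, is_parity in raidz_map(0x600, 0xa00, 5):
    kind = "parity" if is_parity else "data"
    print(f"disk {disk}: offset {off:#x}, {nbytes} bytes ({kind})")
```

With these inputs the sketch gives 512 bytes on disks 0, 1, and 2 and 1024 bytes on disks 3 and 4, for 0xe00 bytes in total. A 1 byte write comes out as two 512-byte columns on two disks.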
So, that's the basics. I was going to turn my attention to the in-memory representation of ZFS, but now think instead I'll take a stab at automating the techniques I am using. Once I have that done, I'll try automating recovery of a removed file. From there, we'll see.
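As a footnote to the parity and scatter-gather points above, here is a small Python sketch of the idea. It assumes single parity computed as a plain XOR of the data columns (as in raid5), with shorter columns treated as zero-padded. The helper names are made up for illustration, and the column sizes are the ones from the 0xa00-byte example.

```python
def xor_parity(columns):
    """XOR the columns together; shorter columns count as zero-padded."""
    parity = bytearray(max(len(c) for c in columns))
    for col in columns:
        for i, byte in enumerate(col):
            parity[i] ^= byte
    return bytes(parity)


def gather(columns):
    """Reassemble the block: each data column follows the previous one."""
    return b"".join(columns)


def rebuild(survivors, parity, length):
    """Recover one lost data column from the parity and the surviving columns."""
    return xor_parity(list(survivors) + [parity])[:length]


# 0xa00 bytes of (compressed) data split into one 1024-byte and three 512-byte columns.
data = bytes(range(256)) * 10
sizes = [1024, 512, 512, 512]
columns, pos = [], 0
for n in sizes:
    columns.append(data[pos:pos + n])
    pos += n

parity = xor_parity(columns)
lost = 2
survivors = columns[:lost] + columns[lost + 1:]
recovered = rebuild(survivors, parity, sizes[lost])
print(len(parity), gather(columns) == data, recovered == columns[lost])
```

The parity column is as large as the largest data column, which is why the parity in the example takes 1024 bytes even though most of the data columns are 512 bytes.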
Saturday, April 04, 2009
More information about Free One Day OpenSolaris Internals Training
I thought I would say a few words about what is planned for the free one day OpenSolaris Internals training class (see http://sl.osunix.org/FreeKernelTrainingDay for a list of topics, and to sign up).
Regardless of the topics covered, I want to make this as close to a "classroom" setting as possible. For me, this means that attendees should be able to follow along with anything I am doing on OpenSolaris by doing it themselves. So, for instance, if I am using mdb to examine some data structure, students should be able to do the same on their machines. For some topics, notably ZFS, this will require students to either build an mdb dmod, and the modified mdb and zdb that I use, or load a version of OpenSolaris that contains these (to be provided by osunix.org). Source for the modified mdb, zdb, and rawzfs mdb dmod is available for download at ftp://ftp.bruningsystems.com/mdb.tar.Z, ftp://ftp.bruningsystems.com/zdb.tar.Z, and ftp://ftp.bruningsystems.com/raw_dmods.tar.Z. If we do a kmdb session, students will either need to run OpenSolaris in a VM (virtualbox), or have 2 machines connectable via tip or a terminal server for console access.
Currently, the plan is to give attendees access to some slides, use IRC, and give students access to a window on my machine where they can see what I am doing, and try the same on their machine. Best would be a window where everyone can "see" my desktop, but I'm still looking into the best way to do that (any suggestions for this are welcome). It would be great to have audio, preferably conferencing, but this may cost money, and... the class is free. That should mean free for me as well. If anyone has a suggestion for free, conferenced audio, I would appreciate it.
I would like to decide on topics to be covered in the next week or so. So, if you are interested in attending, please go to http://sl.osunix.org/FreeKernelTrainingDay, take a look at the topics, and sign up. If you have ideas for other kernel-related topics, please let me know. Depending on how this goes, I may do more of these in the future.
Thursday, April 02, 2009
Free One-day OpenSolaris Internals class
I am holding a free, one day OpenSolaris Internals class on-line on April 18 or 19. We'll cover 2 topics as determined by a vote of topics that may be covered. For more information, see http://sl.osunix.org/FreeKernelTrainingDay. I hope to see you there!
OpenSolaris Internals class
I am teaching an OpenSolaris Internals class at Systemics in Warsaw, Poland the week of May 4-8. The course will be held in English. For a detailed topic outline, see here. For pricing, location information, and availability, please send email to magdalena.sternick@systemics.pl. If you have questions about course content, please email me at max@bruningsystems.com.
Tuesday, March 31, 2009
A faster memstat for mdb
I have implemented a version of the ::memstat dcmd for mdb that gives results in less than half the time of the ::memstat currently in mdb. If you are interested, it is available for download here.
So, how does it work? The current version of memstat uses the page walker (::walk page) to walk all cached pages in the system. The new version simply examines all pages.
Each page-able page in the system is represented by a page_t structure. Basically all of memory except for the unix and genunix kernel modules, and a few other odds and ends, is considered "page-able". Every page that is in use (either by the kernel, user process, anonymous, or cached) has an identity. This is a vnode/offset pair. The identity uniquely determines how the page is being used. For instance, a page for the code of a running bash process will have the vnode_t for /usr/bin/bash, and the offset within /usr/bin/bash of where the page comes from. For a kernel page, there is a special vnode_t (kvp). For anonymous space, the page has a swapfsvnode (also used for shared memory and tmpfs files).
When a process gets a page fault, the fault handling code first checks to see if the faulting address is mapped in the process' address space. If not, a segmentation violation (SIGSEGV) is sent to the process. If the address is within the address space, the fault handling code sees if the page is already in memory. It does this by retrieving the vnode/offset for the faulting page, and hashing into an array called page_hash. Each entry in page_hash is the beginning of a linked list of page_t structures. So the fault handling code does a hash to get a page_hash array entry, then walks the page_t structures starting at that entry to look for a matching vnode_t/offset. If the page_t is in the hash, the fault handling code sets up a page table entry (translation table entry on SPARC) to map to the corresponding physical page.
The page_hash array is sized so that the average search, given a page_hash bucket, is no longer than 4 (PAGE_HASHAVELEN in vm/page.h) entries. This makes searching for a cached page fairly fast.
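For readers who prefer code, here is a toy Python model of the lookup just described. It is only a sketch: the class and method names are made up, Python's built-in hash stands in for the kernel's hash function, and sizing the table as the page count divided by PAGE_HASHAVELEN is an assumption about how the "average chain of 4" goal might be met.

```python
from collections import namedtuple

PAGE_HASHAVELEN = 4  # target average chain length (the constant named in vm/page.h)

# A toy page_t: identity (vnode, offset) plus the physical page frame number.
Page = namedtuple("Page", ["vnode", "offset", "pfn"])


class PageHash:
    """Hypothetical model of page_hash: an array of chains of pages."""

    def __init__(self, npages):
        # Assumption: one bucket per PAGE_HASHAVELEN pages.
        self.buckets = [[] for _ in range(max(1, npages // PAGE_HASHAVELEN))]

    def _bucket(self, vnode, offset):
        return self.buckets[hash((vnode, offset)) % len(self.buckets)]

    def insert(self, page):
        self._bucket(page.vnode, page.offset).append(page)

    def lookup(self, vnode, offset):
        # Walk the chain looking for a matching vnode/offset identity.
        for page in self._bucket(vnode, offset):
            if page.vnode == vnode and page.offset == offset:
                return page
        return None  # not in memory: the page would have to be brought in


ph = PageHash(npages=1024)
ph.insert(Page("/usr/bin/bash", 0x2000, pfn=0x1234))
hit = ph.lookup("/usr/bin/bash", 0x2000)
miss = ph.lookup("/usr/bin/bash", 0x3000)
print(hit, miss)
```

A hit corresponds to the case where the fault handler only has to set up the mapping to the physical page it found.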
The page walker that mdb uses to do memstat walks every hash bucket looking for pages. Basically, if the page is found from a hash bucket, the page is either in use by the kernel, some process, or tmpfs, or, the page is free but cached. Any page not hanging off of a page_hash bucket is considered free (i.e., the free (freelist) statistic).
The new memstat takes a different approach. Rather than scanning each hash bucket for pages, it simply reads all of the page_t structures on the system, then examines each one to determine if it is a kernel page, executable page, anonymous page, and so on. Any page_t that does not have a vnode_t is considered a free page, and is counted in the free (freelist) statistic.
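The scan-everything approach can be sketched in a few lines of Python. This is a toy model, not the dcmd itself: each page is reduced to just its vnode, the strings "kvp" and "swapfsvnode" stand in for the real vnode pointers, and everything with any other vnode is lumped into one "file-backed" bucket (so executable pages and free-but-cached pages are not separated here).

```python
from collections import Counter

KVP = "kvp"             # stand-in for the kernel's special vnode
SWAPFS = "swapfsvnode"  # stand-in for the vnode used for anonymous memory


def classify(vnode):
    """Decide how a page is being used from its vnode alone."""
    if vnode is None:
        return "free (freelist)"
    if vnode == KVP:
        return "kernel"
    if vnode == SWAPFS:
        return "anon"
    return "file-backed"


def memstat_by_scan(page_vnodes):
    """Examine every page once, with no hash bucket walking."""
    return Counter(classify(v) for v in page_vnodes)


pages = [KVP, KVP, SWAPFS, "/usr/bin/bash", None, None, None]
counts = memstat_by_scan(pages)
for kind, n in sorted(counts.items()):
    print(f"{kind:16} {n}")
```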
How are the page_t structures found on the system? There is a linked list of memseg structures that are bookkeeping for page-able page structures. The list is headed by a pointer, memsegs, and is built early on in the startup code when the system is booting. I suspect the list could change due to dynamic reconfiguration events, but I'll leave that as an exercise for the reader...
To see the list, you can do "::memseg_list" in mdb. On my system, this gives:
> ::memseg_list
ADDR     PAGES    EPAGES   BASE     END
fbe00028 fbe94160 fe652260 00000c00 0007fed0
fbe00014 fbe85160 fbe94160 00000100 00000400
fbe00000 fbe82050 fbe85160 00000002 0000009f
>
The ADDR column is the address of a memseg structure. PAGES is the address of the first page_t in an array. EPAGES points to the end of the page_t array. BASE is the starting page frame number of the memseg, and END is the ending page frame number. So, on my system, page frames between 2 and 9f, 100 and 400, and c00 and 7fed0 have page_t structures. Physical pages 0 and 1 are not in the list. Also, pages between 9f and 100, and between 400 and c00, are not in the list. This is either because the physical memory does not exist, or it is not considered pageable.
The new memstat uses the memseg list to read in all of the page_t structures. On my system, this means 3 read calls (though the read from fbe94160 to fe652260 is quite large), versus thousands of read calls in the existing memstat via the page walker. The new memstat assumes there will never be more than 256 memseg structures. (This was arbitrarily chosen. I have not seen machines with more than 6 memseg structures, but I don't get on very large machines very often). A more correct way would be to build the memseg list within the dcmd, but I am lazy.
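To get a feel for how much data those 3 reads cover, the ::memseg_list output above can be run through a little arithmetic. The Python below is just a calculator over the numbers already shown. It assumes that EPAGES minus PAGES is the size of the page_t array in bytes and that END minus BASE is the number of pages in the memseg (the divisions come out exact, which suggests END is exclusive).

```python
# The three rows printed by ::memseg_list above (ADDR PAGES EPAGES BASE END).
MEMSEG_LIST = """\
fbe00028 fbe94160 fe652260 00000c00 0007fed0
fbe00014 fbe85160 fbe94160 00000100 00000400
fbe00000 fbe82050 fbe85160 00000002 0000009f
"""


def summarize(text):
    """Return (addr, npages, nbytes, bytes_per_page_t) for each memseg row."""
    rows = []
    for line in text.splitlines():
        addr, pages, epages, base, end = (int(field, 16) for field in line.split())
        npages = end - base        # page frames covered by this memseg
        nbytes = epages - pages    # size of its page_t array
        rows.append((addr, npages, nbytes, nbytes // npages))
    return rows


rows = summarize(MEMSEG_LIST)
for addr, npages, nbytes, per_page in rows:
    print(f"memseg {addr:x}: {npages} pages, {nbytes} bytes, {per_page} bytes per page_t")

total_pages = sum(r[1] for r in rows)
total_bytes = sum(r[2] for r in rows)
print(f"total: {total_pages} pages, {total_bytes} bytes of page_t structures")
```

On these numbers each page_t works out to 80 bytes, and the three reads together pull in roughly 40MB of page_t structures, nearly all of it from the first memseg.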
Using dtrace to count system calls while running the two versions of memstat shows that the original memstat makes 1741225 system calls, while the new memstat makes 737344. So the new memstat makes over 1,000,000 fewer system calls.
I think memstat performance could be improved even more by using mmap to map in the page arrays. Then there would be no need for using mdb_alloc, and no need to mdb_vread the page_t structs.
Monday, August 25, 2008
Update to bruningsystems.com website
I have added a section called "articles", which has links to various articles, presentations, and some course materials on OpenSolaris. You can see it here.