tag:blogger.com,1999:blog-72455182024-03-07T14:44:20.477-08:00Max Bruning's weblogVery occasional tech stuff...Unknownnoreply@blogger.comBlogger23125tag:blogger.com,1999:blog-7245518.post-31560210973862643492012-11-12T00:35:00.000-08:002012-11-12T00:35:05.170-08:00Hadoop bug on SmartOS<b id="internal-source-marker_0.8931944540236145"><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">Recently I had a chance to help with a problem that occurred when trying to run a Hadoop benchmark on SmartOS. Some of the Java code written for Hadoop implicitly assumed that it was running on Linux. When running the benchmark, the following error showed up:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">12/10/01 20:58:49 INFO mapred.JobClient: Task Id : attempt_201209262235_0003_m_000003_0, Status : FAILED</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">ENOENT: No such file or directory</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">at 
org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:312)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:385)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">at org.apache.hadoop.mapred.Child.main(Child.java:229)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">The NativeIO.open call basically calls the open(2) system call. Here, it is being called from</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">createForWrite() in SecureIOUtils.java at line 161. 
Here is the code for SecureIOUtils.java:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">/**</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * Open the specified File for write access, ensuring that it does not exist.</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * @param f the file that we want to create</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * @param permissions we want to have on the file (if security is enabled)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> *</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * @throws AlreadyExistsException if the file already exists</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * @throws IOException if any other error occurred</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static FileOutputStream createForWrite(File f, int permissions)</span><br /><span 
style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> throws IOException {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> if (skipSecurity) {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> return insecureCreateForWrite(f, permissions);</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> } else {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> // Use the native wrapper around open(2)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> try {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> FileDescriptor fd = NativeIO.open(f.getAbsolutePath(), &lt;-- line 161</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> NativeIO.O_WRONLY | NativeIO.O_CREAT | NativeIO.O_EXCL,</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> permissions);</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> return new FileOutputStream(fd);</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> } catch (NativeIOException nioe) {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: 
normal; vertical-align: baseline; white-space: pre-wrap;"> if (nioe.getErrno() == Errno.EEXIST) {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> throw new AlreadyExistsException(nioe);</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> }</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> throw nioe;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> }</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> }</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> }</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">So, the open is called with O_WRONLY, O_CREAT, and O_EXCL flags. However, the truss(1) output</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">shows a different story. 
We started the following truss on a slave machine, and ran the test again:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"># truss -f -a -wall -topen,close,fork,write,stat,fstat -o ~/mapred.truss -p $(pgrep -f Djava.library.path) </span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">And here is the relevant truss output:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">51039/28: open("/opt/local/hadoop/bin/../logs/userlogs/job_201210171129_0008/attempt_201210171129_0008_m_000002_1/log.tmp", O_WRONLY|O_DSYNC|O_NONBLOCK) Err#2 ENOENT</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">The error message is emitted shortly after the above open(2) system call. So, the code shows O_WRONLY, O_CREAT, and O_EXCL, which is what one</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">would expect for a routine that is called createForWrite(). 
However, the flags actually passed to open() are: O_WRONLY, O_DSYNC, and O_NONBLOCK.</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">Why the difference?</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">Grepping for O_CREAT in the hadoop source finds it defined at:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">./trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">/**</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * JNI wrappers for various native IO-related calls not available in Java.</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * These functions should generally be used alongside a fallback to another</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * more portable mechanism.</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> */</span><br /><span 
style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">public class NativeIO {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> // Flags for open() call from bits/fcntl.h</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_RDONLY = 00;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_WRONLY = 01;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_RDWR = 02;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_CREAT = 0100;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_EXCL = 0200;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_NOCTTY = 0400;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_TRUNC = 01000;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_APPEND = 02000;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_NONBLOCK = 04000;</span><br /><span 
style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_SYNC = 010000;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_ASYNC = 020000;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_FSYNC = O_SYNC;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_NDELAY = O_NONBLOCK;</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">The comment in the above code says that the flags for the open(2) call come from bits/fcntl.h, a Linux-specific header.</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">However, on SmartOS (as well as illumos and Solaris), the same flags in sys/fcntl.h show:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">/*</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * Flag values accessible to open(2) and fcntl(2)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * The first five can only be set (exclusively) by open(2).</span><br /><span 
style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_RDONLY 0</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_WRONLY 1</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_RDWR 2</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_SEARCH 0x200000</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_EXEC 0x400000</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#if defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_NDELAY 0x04 /* non-blocking I/O */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#endif /* defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE) */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_APPEND 0x08 /* append (writes guaranteed at the end) */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#if defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE) || \</span><br /><span style="font-family: 'Courier New'; 
font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> (_POSIX_C_SOURCE > 2) || defined(_XOPEN_SOURCE)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_SYNC 0x10 /* synchronized file update option */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_DSYNC 0x40 /* synchronized data update option */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_RSYNC 0x8000 /* synchronized file update option */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> /* defines read/write file integrity */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#endif /* defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE) ... 
*/</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_NONBLOCK 0x80 /* non-blocking I/O (POSIX) */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#ifdef _LARGEFILE_SOURCE</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_LARGEFILE 0x2000</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#endif</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">/*</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * Flag values accessible only to open(2).</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_CREAT 0x100 /* open with file create (uses third arg) */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_TRUNC 0x200 /* open with truncation */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_EXCL 0x400 /* exclusive open */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; 
white-space: pre-wrap;">#define O_NOCTTY 0x800 /* don't allocate controlling tty (POSIX) */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_XATTR 0x4000 /* extended attribute */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_NOFOLLOW 0x20000 /* don't follow symlinks */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_NOLINKS 0x40000 /* don't allow multiple hard links */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">The O_CREAT flag (from bits/fcntl.h) is 0100 (octal) in the NativeIO.java file, but 0x100 on SmartOS. Octal 0100 is hex 0x40, which corresponds to O_DSYNC on SmartOS. Similarly, the octal O_EXCL value of 0200 is hex 0x80, which is O_NONBLOCK on SmartOS. Whoever wrote this code assumed it would run on a Linux system. The flags are different yet again on FreeBSD and Mac OS (for instance, O_CREAT is 0x200 on those systems). My colleague, Filip Hajny, changed the flags to match the SmartOS values and rebuilt everything to fix the problem.</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">This problem reminds me how many little things like this can occur when porting an application developed on one operating system to another. 
For all but the simplest of applications, some changes are going to be needed. In this case, POSIX specifies the flags that open(2) can take (O_CREAT, O_RDWR, etc.), but does not specify the values of those flags. If the native code took the flag values from the platform's own fcntl.h rather than hard-coding them, the problem would not occur. It is an important reminder that all code should be reviewed and tested on as many different systems as possible.</span></span></b>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-86149898157764354952012-07-09T00:32:00.000-07:002012-07-09T00:55:14.546-07:00Why take a SmartOS/illumos Internals or ZFS Internals course?<br />
<div style="background-color: white; margin-bottom: 8px; margin-left: 16px; margin-right: 16px; margin-top: 8px; min-width: 0px; width: 653px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
I have been teaching OS internals courses for many years, starting with Bell Labs/AT&T Unix System III in 1982, moving on to System V, SVR2, SVR3, and SVR4, and teaching Solaris internals since 1994. Along the way, I have also taught HP-UX internals, various device driver courses, and kernel debugging courses. I started using Unix with the Sixth Edition in 1975. I have also done a fair amount of kernel development and debugging, along with some user-level work.</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
The audiences for my internals courses have been quite varied. Many of the people I have taught have been in support or sustaining organizations, but I have also taught developers, system administrators, Java programmers, QA people, hardware engineers, and even end users. Along the way, I have been asked by various people (many of them managers), "Why should I or my team take this course? What will I or my team get out of this training?"</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
In response to the first question, I usually tell people that an internals course should teach students how the system works, and why it works the way it does. In other words, the course teaches the data structures and algorithms the operating system uses to manage the resources of the computer, and explains the architecture of the system as well as the rationale behind its design decisions. My view is that knowledge of how the system works can benefit everyone. For developers (especially kernel developers), it is key to adding new functionality. For system administrators, it helps with troubleshooting and performance analysis. Tools like DTrace become even more useful when one knows what's going on in the system. In general, knowledge of how the system works allows everyone to make better use of it.</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
As for the specific skills acquired in an internals course, I make very extensive use of the tools that come with the system throughout the training, both when I am lecturing and in lab work. My view has always been that in order to learn the concepts being taught, one must be able to actually "see" them. Tools like DTrace, mdb, kmdb, and other observability mechanisms are key to doing this. I do not "teach" the tools; rather, we use them in many examples throughout the course. At the end of the course, I am satisfied if my students can start to learn things on their own; a good internals course should be an "enabling" course. Some students may never again use these specific tools in the specific ways we use them, but they will know that one can actually determine what the system is doing at any given time. Others will use the tools consistently in their work.</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
As with all training, you only get out of it what you put into it.</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
If you're interested in internals training, please visit <a href="http://smartos.org/2012/06/14/training-from-joyent/">Training from Joyent</a>.</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
I hope to see people in class soon!</div>
<div>
<br /></div>
</div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-21004791056086811972012-06-15T08:50:00.000-07:002012-06-15T08:50:17.348-07:00SmartOS/Illumos TrainingIf you are reading this, you are probably here either because you saw my post on Twitter, or because you searched for "zfs recovery" (see <a href="http://mbruning.blogspot.ch/2009/12/zfs-data-recovery.html">here</a>). This is the first post I have written here since 2009, so it is time to write again.<br />
<br />
On a different blog, I have written about using <a href="http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/">flamegraphs</a>; see <a href="http://smartos.org/2012/02/28/using-flamegraph-to-solve-ip-scaling-issue-dce/">Using flamegraph to Solve IP Scaling Issue</a>.
Rather than spending time saying what I've been doing since my last blog entry here, I want to talk a bit about what I am doing now. If you're interested in what I have been doing, see <a href="http://smartos.org/2011/08/22/its-here-kvm-on-illumos/">KVM on Illumos</a>.<br />
<br />
Since I wrote the posts on ZFS recovery, I have been getting emails, at a rate that is slowly increasing over time (now ~2 per week), from people asking if I can help with ZFS problems. If I had received this many emails when I wrote those posts, I might be working full time now on ZFS recovery issues. As it is, I am now very busy working for <a href="http://www.joyent.com/">Joyent</a>, and have not had time to answer as many of the ZFS requests as I would like. My apologies to the people whom I have not answered. For those of you who have asked for my mdb and zdb modifications, please send me email at <a href="mailto:max@joyent.com">max_at_joyent_dot_com</a>. If I get enough requests, it is possible that the modifications may find their way into SmartOS (and Illumos).<br />
<br />
If you would like help with ZFS problems, I can better justify my time if you download Joyent's SmartDataCenter product, available <a href="http://www.joyent.com/adoption">here</a>, and give it a try. If you're interested in SmartOS (simply the best Solaris-based operating system you can use), it is available for download at <a href="http://www.smartos.org/">www.smartos.org</a>. Joyent fully supports SmartOS for use in its SmartDataCenter product, so you are more likely to get help in a timely fashion with problems than I am able to provide on my own.<br />
<br />
And, what am I doing now? Joyent is offering classes on SmartDataCenter, DTrace, and SmartOS/Illumos Internals. I am involved with developing the courseware, and shall be (along with <a href="http://dtrace.org/blogs/brendan/">Brendan Gregg</a>) delivering the courses. For more information, see <a href="http://smartos.org/2012/06/14/training-from-joyent/">Training from Joyent</a>.<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-56239066546241864342009-12-20T23:54:00.000-08:002009-12-21T05:05:23.485-08:00ZFS Raidz Data WalkSeveral months ago, I wrote in my blog about raidz on disk format (see <a href="http://mbruning.blogspot.com/2009/04/raidz-on-disk-format.html">http://mbruning.blogspot.com/2009/04/raidz-on-disk-format.html</a>). In that blog, I went over the high level details. Here, I thought I would show the low level stuff that I did to determine the layout. I am using a modified zdb and mdb to walk through the on-disk data structures to find the data for a copy of the /usr/dict/words file that I made on a raidz file system.<br /><br />The raidz volume contains 5 equal size devices. Since I don't have 5 disks lying around, I created 5 equal sized files (/export/home/max/r0 through /export/home/max/r4). I'll use the term disk throughout this discussion to refer to one of these files.<br /><code><br /># zpool status -v tank<br />pool: tank<br />state: ONLINE<br />scrub: none requested<br />config:<br /><br /> NAME STATE READ WRITE CKSUM<br /> tank ONLINE 0 0 0<br /> raidz1 ONLINE 0 0 0<br /> /export/home/max/r0 ONLINE 0 0 0<br /> /export/home/max/r1 ONLINE 0 0 0<br /> /export/home/max/r2 ONLINE 0 0 0<br /> /export/home/max/r3 ONLINE 0 0 0<br /> /export/home/max/r4 ONLINE 0 0 0<br /><br />errors: No known data errors<br />#<br /></code><br />I'll umount the file system so things don't change while I'm examining the on-disk structures.<br /><code><br /># zfs umount tank<br />#<br /></code><br />And, as I have done in the past, I walk the data structures to get to the "words" file by starting at the uberblock_t. 
If you get lost during this walk, you can always refer to the diagram "ZFS On-Disk Layout - The Big Picture", page 4 in <a href="http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf">http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf</a> from the OpenSolaris Developer's Conference, 2008 in Prague.<br /><br />First, the active uberblock_t.<br /><code><br /># zdb -uuu tank<br />Uberblock<br /><br /> magic = 0000000000bab10c<br /> version = 13<br /> txg = 1280<br /> guid_sum = 6800651560363961243<br /> timestamp = 1239197133 UTC = Wed Apr 8 15:25:33 2009<br /> rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:1e007400:400> DVA[1]=<0:9400:400> DVA[2]=<0:3c003800:400> fletcher4 lzjb LE contiguous birth=1280 fill=27 cksum=9ad89e117:40b4956a12c:db76af09e62f:1f779fd1db6115<br />#<br /></code><br />Now, I use a new command I added to zdb to allow me to see the raidz mapping. The "-Z" option takes the pool name, device id, location, and physical size as arguments, and prints the device index, location, and size for each piece of the corresponding data plus parity.<br /><code><br /># ./zdb -Z tank:0:1e007400:200<br />Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1<br />devidx = 3, offset = 6001600, size = 200<br />devidx = 4, offset = 6001600, size = 200<br />#<br /></code><br />So, the 0x200 byte parity is on the fourth disk (devidx = 3), and the 0x200 byte objset_phys_t is on the fifth disk (devidx = 4). (Of course, either one will work since there are only 2).<br /><br />Now, convert the hex offset to an absolute decimal block number. The 0x400000 skips the disk labels at the front of each device in the volume.<br /><code><br /># mdb<br />> (6001600>>9)+(400000>>9)=D<br /> 204811 <br /></code><br />Attempting to use zdb with the -R option to read the blocks causes an assertion failure in zdb (at least, that was the state back in April, when I wrote the original blog on raidz).
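As an aside, this offset arithmetic, along with decoding the raw dva_word pairs that mdb prints, recurs at every step of the walk, so it is worth scripting. The Python below is my own helper sketch (not part of zdb or mdb); it assumes the standard DVA encoding (asize in the low 24 bits of word 0 and vdev in its high 32 bits, offset in the low 63 bits of word 1, both counted in 512-byte sectors) and the 4MB label/boot area at the front of each device.

```python
# Sketch: decode a ZFS dva_word pair, and convert a zdb -Z column offset
# to an absolute 512-byte block number usable as an iseek value for dd.

LABEL_SKIP = 0x400000  # two 256KB labels plus the boot region at the front of each device

def decode_dva(word0, word1):
    """Return (vdev, offset, asize, is_gang), with offset/asize in bytes."""
    vdev = word0 >> 32                       # top 32 bits of word 0
    asize = (word0 & 0xffffff) << 9          # low 24 bits of word 0, in sectors
    gang = bool(word1 >> 63)                 # high bit of word 1
    offset = (word1 & ((1 << 63) - 1)) << 9  # remaining 63 bits, in sectors
    return vdev, offset, asize, gang

def disk_block(col_offset):
    """Absolute 512-byte block for an offset reported by zdb -Z."""
    return (col_offset >> 9) + (LABEL_SKIP >> 9)
```

For instance, disk_block(0x6001600) gives 204811, the block number computed with mdb above, and decode_dva(0x2, 0xf0038) recovers offset 0x1e007000 with asize 0x400, a DVA that appears later in the walk.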
So, instead I use dd to dump the raw data into a file.<br /><code><br /># dd if=/export/home/max/r4 of=/tmp/objset_t bs=512 count=1 iseek=204811<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />Now, I'll uncompress the data. The size after decompression should be 0x400 bytes (as specified in the block pointer in the uberblock_t above). For this, I use a utility I wrote called zuncompress. This utility takes an option which allows one to specify the compression algorithm used. The default is lzjb. The output should be the objset_phys_t for the meta object set (MOS).<br /><code><br /># ./zuncompress -p 200 -l 400 /tmp/objset_t > /tmp/objset<br />#<br /></code><br />And now, I'll use my modified mdb to print the objset_phys_t.<br /><code><br /># mdb /tmp/objset<br />> 0::print -a zfs`objset_phys_t<br />{<br /> 0 os_meta_dnode = {<br /> 0 dn_type = 0xa <-- DMU_OT_DNODE <br /> 1 dn_indblkshift = 0xe <br /> 2 dn_nlevels = 0x1<br /> ... <br /> 40 dn_blkptr = [<br /> { <br /> 40 blk_dva = [ <br /> { <br /> 40 dva_word = [ 0x8, 0xf0050 ] <br /> } <br /> { <br /> 50 dva_word = [ 0x8, 0x40 ] <br /> } <br /> { <br /> 60 dva_word = [ 0x8, 0x1e0028 ] <br /> } <br /> ]<br /> ... <br />} <br /></code><br />And the blkptr_t at 0x40:<br /><code><br />> 40::blkptr<br />DVA[0]: vdev_id 0 / 1e00a000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 100000000000<br />DVA[0]: :0:1e00a000:a00:d<br />DVA[1]: vdev_id 0 / 8000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 100000000000<br />DVA[1]: :0:8000:a00:d<br />DVA[2]: vdev_id 0 / 3c005000<br />DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 100000000000<br />DVA[2]: :0:3c005000:a00:d<br />LSIZE: 4000 PSIZE: a00<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 500 LEVEL: 0 FILL: 1a00000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: a182339fe8:ded0f7be7047:bcb1c1a96b94cc:765bd519587bfb41<br />$q<br />#<br /></code><br />So, "LEVEL: 0" means no indirection. The next object is the MOS, which is an array of dnode_phys_t. 
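It is also worth seeing where the zdb -Z numbers come from. The Columns, bigcols, devidx, offset, and size values follow the raidz mapping the kernel builds in vdev_raidz_map_alloc(). The Python below is my own transcription of that logic (simplified: no error handling, and the parity column is simply listed first), so treat it as a sketch rather than the authoritative code.

```python
# Sketch of the raidz column mapping (after vdev_raidz_map_alloc()).
# offset and psize are byte values from a DVA; dcols is the number of
# child devices; unit_shift=9 means 512-byte allocation units.

def raidz_map(offset, psize, dcols, nparity=1, unit_shift=9):
    """Return [(devidx, offset, size), ...], parity column first."""
    b = offset >> unit_shift            # starting unit on the logical device
    s = psize >> unit_shift             # number of data units to place
    f = b % dcols                       # column holding the first (parity) unit
    o = (b // dcols) << unit_shift      # byte offset within each child device
    q, r = divmod(s, dcols - nparity)   # q full rows, plus r leftover units
    bc = 0 if r == 0 else r + nparity   # "big columns" get one extra unit
    acols = dcols if q > 0 else bc      # columns actually used
    cols = []
    for c in range(acols):
        col, coff = f + c, o
        if col >= dcols:                # wrap around to the next row
            col -= dcols
            coff += 1 << unit_shift
        size = (q + (1 if c < bc else 0)) << unit_shift
        cols.append((col, coff, size))
    return cols
```

Called as raidz_map(0x1e007400, 0x200, 5), it returns the same two columns shown above: devidx 3 (parity) and devidx 4, both at offset 0x6001600 with size 0x200.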
Let's see how the MOS is laid out on the raidz volume.<br /><code><br /># ./zdb -Z tank:0:1e00a000:a00<br />Columns = 5, bigcols = 2, asize = 1000, firstdatacol = 1<br />devidx = 0, offset = 6002000, size = 400<br />devidx = 1, offset = 6002000, size = 400<br />devidx = 2, offset = 6002000, size = 200<br />devidx = 3, offset = 6002000, size = 200<br />devidx = 4, offset = 6002000, size = 200<br />#<br /></code><br />So, disk 0 contains parity, and disks 1, 2, 3, and 4 contain the MOS. The MOS is compressed with lzjb compression. We'll use dd to dump the 4 blocks containing the MOS to a file, then decompress the MOS.<br /><br />I'll use mdb to translate the blkptr DVA address to a block on the disks. Note that all blocks in this example are at the same location (0x6002000).<br /><code><br /># mdb<br />> (6002000>>9)+(400000>>9)=D<br /> 204816 <br /></code><br />And now dd each of the blocks. The first disk (/export/home/max/r0) is parity. The second disk contains 0x400 bytes. The other 3 disks contain 0x200 bytes each. So the total size of the compressed data is 0x400 + 0x200 + 0x200 + 0x200, or 0xa00 bytes, which agrees with the PSIZE field in the blkptr_t. Note that the size of the parity block must be equal to the size of the largest block (0x400).<br /><code><br /># dd if=/export/home/max/r1 of=/tmp/mos_z1 iseek=204816 count=2<br />2+0 records in<br />2+0 records out<br /># dd if=/export/home/max/r2 of=/tmp/mos_z2 iseek=204816 count=1<br />1+0 records in<br />1+0 records out<br /># dd if=/export/home/max/r3 of=/tmp/mos_z3 iseek=204816 count=1<br />1+0 records in<br />1+0 records out<br /># dd if=/export/home/max/r4 of=/tmp/mos_z4 iseek=204816 count=1<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />Now, concatenate the files to get the compressed MOS.<br /><code><br /># cat /tmp/mos_z* > /tmp/mos_comp<br /></code><br />And uncompress.
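If you do not have my zuncompress utility, the decompression step can be reproduced with a few lines of code; lzjb's decompression side is tiny. The Python below is my own transcription of lzjb_decompress() from the OpenSolaris source, offered as an illustrative sketch rather than a drop-in replacement.

```python
def lzjb_decompress(src, d_len):
    """Decompress lzjb bytes in src, producing exactly d_len output bytes.

    lzjb interleaves a "copymap" byte before every 8 items; a clear bit
    means a literal byte, a set bit means a (length, offset) back-reference
    packed into two bytes (6 bits of length, 10 bits of offset).
    """
    MATCH_BITS, MATCH_MIN = 6, 3
    OFFSET_MASK = (1 << (16 - MATCH_BITS)) - 1
    dst = bytearray()
    copymap, copymask = 0, 1 << 7
    i = 0
    while len(dst) < d_len:
        copymask <<= 1
        if copymask == 1 << 8:          # consumed 8 items; fetch next map byte
            copymask = 1
            copymap = src[i]; i += 1
        if copymap & copymask:          # back-reference into earlier output
            mlen = (src[i] >> (8 - MATCH_BITS)) + MATCH_MIN
            offset = ((src[i] << 8) | src[i + 1]) & OFFSET_MASK
            i += 2
            start = len(dst) - offset
            if start < 0:
                raise ValueError("back-reference before start of output")
            for _ in range(mlen):
                if len(dst) >= d_len:
                    break
                dst.append(dst[start]); start += 1
        else:                           # literal byte
            dst.append(src[i]); i += 1
    return bytes(dst)
```

Running it over /tmp/mos_comp with d_len = 0x4000 (the LSIZE from the blkptr_t) should yield the same bytes zuncompress produces.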
The size after decompression, according to the blkptr is 0x4000 (LSIZE in the blkptr).<br /><code><br /># ./zuncompress -l 4000 -p a00 /tmp/mos_comp > /tmp/mos<br /></code><br />And I'll use the modified mdb to dump out the MOS.<br /><code><br /># mdb /tmp/mos<br />> ::sizeof zfs`dnode_phys_t<br />sizeof (zfs`dnode_phys_t) = 0x200<br /><br />> 4000%200=K<br /> 20 <-- There are 32 dnode_phys_t in the MOS<br />> 0,20::print -a zfs`dnode_phys_t<br />{<br /> 0 dn_type = 0 <-- DMU_OT_NONE, first is unused<br /> ... <br />} <br />{ <br /> 200 dn_type = 0x1 <-- DMU_OT_OBJECT_DIRECTORY<br /> ...<br /> 240 dn_blkptr = [<br /> {<br /> 240 blk_dva = [<br /> {<br /> 240 dva_word = [ 0x2, 0x24 ]<br /> }<br /> ...<br />}<br />{<br /> 400 dn_type = 0xc <-- DMU_OT_DSL_DIR (DSL Directory)<br /> ...<br /> 404 dn_bonustype = 0xc<br /> ...<br /> 4c0 dn_bonus = [ 0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0, 0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0,<br /> ... ]<br />}<br />{<br /> 600 dn_type = 0xf<br /> ...<br />{<br /> 1600 dn_type = 0x10 <-- DMU_OT_DSL_DATASET (DSL DataSet)<br /> ...<br /> 1604 dn_bonustype = 0x10<br /> ...<br /> 16c0 dn_bonus = [ 0x8, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, 0x1, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,<br /> ... ]<br />}<br /> ...<br /></code><br />The blkptr_t at 0x240 is for the Object Directory. 
Let's take a closer look.<br /><code><br />> 240::blkptr<br />DVA[0]: vdev_id 0 / 4800<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:4800:200:d<br />DVA[1]: vdev_id 0 / 1e004800<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:1e004800:200:d<br />DVA[2]: vdev_id 0 / 3c000000<br />DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[2]: :0:3c000000:200:d<br />LSIZE: 200 PSIZE: 200<br />ENDIAN: LITTLE TYPE: object directory<br />BIRTH: 4 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher4 COMP: uncompressed<br />CKSUM: 5d4dec3ac:1e59c2be429:5825c81154e8:b9b170eedd49e<br />$q<br />#<br /></code><br />We'll use zdb to find out where ZFS has put the 0x200 byte object directory.<br /><code><br /># ./zdb -Z tank:0:4800:200<br />Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1<br />devidx = 1, offset = e00, size = 200<br />devidx = 2, offset = e00, size = 200<br />#<br /></code><br />So, the parity is on the second disk (devidx = 1), and the object directory (a ZAP object), is on the third disk.<br /><br />We'll convert the offset into a block address.<br /><code><br /># mdb<br />> (e00>>9)+(400000>>9)=D<br /> 8199 <br /></code><br /><br />And dump the 0x200 (i.e, 512byte) block.<br /><code><br /># dd if=/export/home/max/r2 of=/tmp/objdir iseek=8199 count=1<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />The ZAP object is not compressed (see the above blkptr_t). So, no need to uncompress. 
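The 512-byte block just dumped is a microzap, the compact ZAP form used for small objects, as the mdb session below will show. Parsing one outside mdb is straightforward; here is a Python sketch assuming the classic layout (a 64-byte mzap_phys_t header whose first word is 0x8000000000000003, followed by 64-byte mzap_ent_phys_t entries: u64 value, u32 cd, u16 pad, 50-byte name):

```python
import struct

MZAP_ENT_LEN = 64
ZBT_MICRO = (1 << 63) | 3   # 0x8000000000000003, the microzap block type

def read_mzap(block):
    """Yield (name, value) pairs from a microzap block."""
    (block_type,) = struct.unpack_from("<Q", block, 0)
    assert block_type == ZBT_MICRO, "not a microzap block"
    # Entries start right after the 64-byte mzap_phys_t header.
    for off in range(64, len(block), MZAP_ENT_LEN):
        value, cd, pad = struct.unpack_from("<QIH", block, off)
        # mze_name is the trailing 50 bytes, NUL-terminated.
        name = block[off + 14 : off + MZAP_ENT_LEN].split(b"\0", 1)[0].decode()
        if name:                # skip unused (all-zero) entries
            yield name, value
```

Note that in the object directory the value is the plain object id, while in directory ZAPs the top bits of mze_value also encode the object type, which is why values like 0x8000000000000004 appear later in the walk.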
We'll use mdb to look at the zap.<br /><code><br /># mdb /tmp/objdir<br />> 0/J<br />0: 8000000000000003 <-- a microzap object<br />><br /><br />> 0::print -a -t zfs`mzap_phys_t<br />{<br /> 0 uint64_t mz_block_type = 0x8000000000000003<br /> 8 uint64_t mz_salt = 0x32064dbb<br /> 10 uint64_t mz_normflags = 0<br /> 18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]<br /> 40 mzap_ent_phys_t [1] mz_chunk = [<br /> {<br /> 40 uint64_t mze_value = 0x2<br /> 48 uint32_t mze_cd = 0<br /> 4c uint16_t mze_pad = 0<br /> 4e char [50] mze_name = [ "root_dataset" ]<br /> }<br /> ]<br />}<br />$q<br />#<br /></code><br />There are more mzap_ent_phys_t in the object, but we are only concerned with the root dataset. This is object id 2, so we'll go back to the MOS, and examine the dnode_phys_t at index 2.<br /><code><br /># mdb /tmp/mos<br />> 2*200::print -a zfs`dnode_phys_t <-- Each dnode_phys_t is 0x200 bytes<br />{<br /> 400 dn_type = 0xc <-- DMU_OT_DSL_DIR<br /> ...<br /> 404 dn_bonustype = 0xc <-- DMU_OT_DSL_DIR<br /> ...<br /> 4c0 dn_bonus = [ 0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0, 0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, ... ]<br />}<br /></code><br />The bonus buffer contains a dsl_dir_phys_t.<br /><code><br />> 4c0::print -a zfs`dsl_dir_phys_t<br />{<br /> 4c0 dd_creation_time = 0x49db7539<br /> 4c8 dd_head_dataset_obj = 0x10<br />...<br />}<br /></code><br />The DSL DataSet is object id 0x10 (dd_head_dataset_obj = 0x10).<br /><code><br />> 10*200::print -a zfs`dnode_phys_t<br />{<br /> 2000 dn_type = 0x10 <-- DMU_OT_DSL_DATASET<br /> ...<br /> 2004 dn_bonustype = 0x10 <-- bonus buffer contains dsl_dataset_phys_t<br /> ...<br /> 20c0 dn_bonus = [ 0x2, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, 0x1, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 
] <br />} <br /></code><br />Now, the dsl_dataset_phys_t in the bonus buffer of the DSL DataSet dnode.<br /><code><br />> 20c0::print -a zfs`dsl_dataset_phys_t<br />{<br /> 20c0 ds_dir_obj = 0x2<br />...<br /> 2140 ds_bp = {<br /> 2140 blk_dva = [<br /> {<br /> 2140 dva_word = [ 0x2, 0xf0038 ]<br /> }<br /> {<br /> 2150 dva_word = [ 0x2, 0x6 ]<br /> }<br /> {<br /> 2160 dva_word = [ 0, 0 ]<br /> }<br /> ]<br />...<br />}<br /></code><br />The blkptr_t at 0x2140 will give us the objset_phys_t of the root dataset of the file system.<br /><code><br />> 2140::blkptr<br />DVA[0]: vdev_id 0 / 1e007000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:1e007000:200:d<br />DVA[1]: vdev_id 0 / c00<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00:200:d<br />LSIZE: 400 PSIZE: 200<br />ENDIAN: LITTLE TYPE: DMU objset<br />BIRTH: 500 LEVEL: 0 FILL: a00000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 6bb79a7b2:2e0d64756dd:9fc17017938b:176b8a4b6c4756<br />$q<br />#<br /></code><br />Now get the locations where the file system objset_phys_t resides.<br /><code><br /># ./zdb -Z tank:0:1e007000:200<br />Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1<br />devidx = 1, offset = 6001600, size = 200<br />devidx = 2, offset = 6001600, size = 200<br />#<br /></code><br />So, parity is on the second disk, and the data is on the third disk, both at offset 0x6001600.<br /><code><br /># mdb<br />(6001600>>9)+(400000>>9)=D<br /> 204811 <br /></code><br />And again use dd to dump the compressed objset_phys_t to a file.<br /><code><br /># dd if=/export/home/max/r2 of=/tmp/dmu_objset_comp iseek=204811 count=1<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />And uncompress the objset_phys_t.<br /><code><br /># ./zuncompress -l 400 -p 200 /tmp/dmu_objset_comp > /tmp/dmu_objset<br />#<br /></code><br />Now, use mdb to examine the objset_phys_t.<br /><code><br /># mdb /tmp/dmu_objset<br />> 0::print -a zfs`objset_phys_t<br />{<br
/> 0 os_meta_dnode = {<br /> 0 dn_type = 0xa <-- DMU_OT_DNODE<br /> 1 dn_indblkshift = 0xe<br /> 2 dn_nlevels = 0x7 <-- 7 levels of indirection<br /> ...<br /> 40 dn_blkptr = [<br /> {<br /> 40 blk_dva = [<br /> {<br /> 40 dva_word = [ 0x4, 0xf0020 ]<br /> }<br /> {<br /> 50 dva_word = [ 0x4, 0x20 ]<br /> }<br /> {<br /> 60 dva_word = [ 0, 0 ]<br /> }<br /> ]<br /> ...<br />}<br />> 40::blkptr<br />DVA[0]: vdev_id 0 / 1e004000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[0]: :0:1e004000:400:id<br />DVA[1]: vdev_id 0 / 4000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[1]: :0:4000:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 500 LEVEL: 6 FILL: 900000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 5b884586fa:3f9c7d79ba1f:17674db0ee38e0:6077d2d63aa75b6<br />$q<br />#<br /></code><br />There are 6 levels of indirection to get the MOS for the file system. Next, we'll get the disk locations for the 6th level of indirection.<br /><code><br /># ./zdb -Z tank:0:1e004000:400<br />Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1<br />devidx = 2, offset = 6000c00, size = 200<br />devidx = 3, offset = 6000c00, size = 200<br />devidx = 4, offset = 6000c00, size = 200<br />#<br /></code><br />So, the third disk contains parity, and the fourth and fifth disks contain the indirect block.<br /><code><br /># mdb<br />> (6000c00>>9)+(400000>>9)=D<br /> 204806 <br /></code><br /><br />Again, we'll use dd to get the data from the 2 disks, then concatenate the dd outputs, then uncompress.<br /><code><br /># dd if=/export/home/max/r3 of=/tmp/i6_1z iseek=204806 count=1<br />1+0 records in<br />1+0 records out<br /># dd if=/export/home/max/r4 of=/tmp/i6_2z iseek=204806 count=1<br />1+0 records in<br />1+0 records out<br /># cat /tmp/i6*z > /tmp/i6_z<br />#<br /></code><br />Now, uncompress. 
The size after decompression is 0x4000 bytes, as specified in the LSIZE field of the blkptr_t.<br /><code><br /># ./zuncompress -l 4000 -p 400 /tmp/i6_z > /tmp/i6<br />#<br /></code><br />And use mdb to examine the blkptr_t structures. We are only interested in the first one, since it will take us to the beginning dnode_phys_t in the file system.<br /><code><br /># mdb/intel/ia32/mdb/mdb /tmp/i6<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / 1e003800<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[0]: :0:1e003800:400:id<br />DVA[1]: vdev_id 0 / 3800<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[1]: :0:3800:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 500 LEVEL: 5 FILL: 900000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 59defe2103:3e0ac53edc13:16a8c688ba6d69:5cafeb97a9046d7<br />$q<br />#<br /></code><br />Now at level 5, we again need to know where on the physical disks are the data.<br /><code><br /># ./zdb -Z tank:0:1e003800:400<br />Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1<br />devidx = 3, offset = 6000a00, size = 200<br />devidx = 4, offset = 6000a00, size = 200<br />devidx = 0, offset = 6000c00, size = 200<br />#<br /></code><br />So, parity on fourth disk and data on fifth and first.<br /><code><br /># mdb<br />> (6000a00>>9)+(400000>>9)=D<br /> 204805 <br />> (6000c00>>9)+(400000>>9)=D<br /> 204806 <br /></code><br />And use dd to dump the blocks.<br /><code><br /># dd if=/export/home/max/r4 of=/tmp/i5_1z iseek=204805 count=1<br />1+0 records in<br />1+0 records out<br /># dd if=/export/home/max/r0 of=/tmp/i5_2z iseek=204806 count=1<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />And concatenate...<br /><code><br /># cat /tmp/i5*z > /tmp/i5_z<br />#<br /></code><br />And uncompress...<br /><code><br /># ./zuncompress -p 400 -l 4000 /tmp/i5_z > /tmp/i5<br />#<br /></code><br />And get to the 4th level of indirection...<br /><code><br /># mdb /tmp/i5<br />> 
0::blkptr<br />DVA[0]: vdev_id 0 / 1e003000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[0]: :0:1e003000:400:id<br />DVA[1]: vdev_id 0 / 3000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[1]: :0:3000:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 500 LEVEL: 4 FILL: 900000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 5aaaf038c7:3ecd4215b2cf:1705e4d4343d71:5e8d71a8535f678<br />$q<br />#<br /></code><br />Rather than show all 6 levels, we'll jump to level 0.<br /><code><br /># mdb /tmp/i1<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / 1e001000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[0]: :0:1e001000:600:d<br />DVA[1]: vdev_id 0 / 1000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[1]: :0:1000:600:d<br />LSIZE: 4000 PSIZE: 600<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 500 LEVEL: 0 FILL: 900000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 7e1ebca68d:4f0370c6d404:23a24df0937608:ce6838f39084f95<br />$q<br />#<br /></code><br />Locate the data for the stripe:<br /><code><br /># ./zdb -Z tank:0:1e001000:600<br />Columns = 4, bigcols = 4, asize = 800, firstdatacol = 1<br />devidx = 3, offset = 6000200, size = 200<br />devidx = 4, offset = 6000200, size = 200<br />devidx = 0, offset = 6000400, size = 200<br />devidx = 1, offset = 6000400, size = 200<br />#<br /><br /># mdb<br />> (6000200>>9)+(400000>>9)=D<br /> 204801 <br />> (6000400>>9)+(400000>>9)=D<br /> 204802 <br /></code><br /><br />Get the data from the individual disks...<br /><code><br /># dd if=/export/home/max/r4 of=/tmp/fs_mos_1z iseek=204801 count=1<br />1+0 records in<br />1+0 records out<br /># dd if=/export/home/max/r0 of=/tmp/fs_mos_2z iseek=204802 count=1<br />1+0 records in<br />1+0 records out<br /># dd if=/export/home/max/r1 of=/tmp/fs_mos_3z iseek=204802 count=1<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />Concatenate the data...<br /><code><br /># cat 
/tmp/fs_mos_* > /tmp/fs_mos_z<br />#<br /></code><br />Uncompress...<br /><code><br /># ./zuncompress -l 4000 -p 600 /tmp/fs_mos_z > /tmp/fs_mos<br />#<br /></code><br />We should now be at the MOS for the root data set.<br /><code><br /># mdb /tmp/fs_mos<br />> 0,20::print -a zfs`dnode_phys_t<br />{<br /> 0 dn_type = 0 <-- first is not used<br /> ...<br />}<br />{<br /> 200 dn_type = 0x15 <-- DMU_OT_MASTER<br /> ...<br /> 240 dn_blkptr = [<br /> {<br /> 240 blk_dva = [<br /> {<br /> 240 dva_word = [ 0x2, 0 ]<br /> }<br /> {<br /> 250 dva_word = [ 0x2, 0xf0000 ]<br /> }<br /> {<br /> 260 dva_word = [ 0, 0 ]<br /> }<br /> ]<br /> ...<br />}<br /> ...<br />{<br /> 600 dn_type = 0x14 <-- DMU_OT_DIRECTORY_CONTENTS (probably for root of fs)<br /> ...<br /> 604 dn_bonustype = 0x11 <-- bonus buffer is a znode_phys_t<br /> ...<br /> 640 dn_blkptr = [<br /> {<br /> 640 blk_dva = [<br /> {<br /> 640 dva_word = [ 0x2, 0xf0006 ]<br /> }<br /> ...<br /> 6c0 dn_bonus = [ 0x26, 0xa0, 0xdb, 0x49, 0, 0, 0, 0, 0x8e, 0xf0, 0xf7, 0x25, 0, 0, 0, 0, 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0xf3, 0x80, 0xdd, 0x34, 0, 0, 0 , 0, ... ] <br />} <br />{<br /> 800 dn_type = 0x13 <-- DMU_OT_PLAIN_FILE_CONTENTS<br /> ...<br /> 804 dn_bonustype = 0x11 <-- bonus buffer is znode_phys_t<br /> ...<br /> 840 dn_blkptr = [<br /> {<br /> 840 blk_dva = [<br /> {<br /> 840 dva_word = [ 0x2, 0xf0004 ]<br /> }<br />...<br /> 8c0 dn_bonus = [ 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0x5e, 0xe2, 0xdc, 0x34, 0, 0, 0, 0, 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0x58, 0x9e, 0xde, 0x34, 0, 0, 0 , 0, ... ] <br />}<br /> ... <br />><br /></code><br />We should start with the ZAP object specified by the blkptr_t for the master node to get to the root directory object of the file system. Instead, we'll assume the dnode_phys_t at 0x600 is for the root of the file system, and we'll dump the blkptr_t. 
This should be for a ZAP object which should contain the list of files in the directory.<br /><code><br />> 640::blkptr<br />DVA[0]: vdev_id 0 / 1e000c00<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:1e000c00:200:d<br />DVA[1]: vdev_id 0 / 800<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:800:200:d<br />LSIZE: 200 PSIZE: 200<br />ENDIAN: LITTLE TYPE: ZFS directory<br />BIRTH: 500 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher4 COMP: uncompressed<br />CKSUM: 60d062a16:197ca3c8839:4691877f93d3:946b572aca5a2<br />$q<br />#<br /></code><br />Find the location on the disk(s) for the directory ZAP object.<br /><code><br /># ./zdb -Z tank:0:1e000c00:200<br />Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1<br />devidx = 1, offset = 6000200, size = 200<br />devidx = 2, offset = 6000200, size = 200<br /># mdb<br />> (6000200>>9)+(400000>>9)=D<br /> 204801 <br /></code><br />Dump the contents.<br /><code><br /># dd if=/export/home/max/r2 of=/tmp/rootdir iseek=204801 count=1<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />Examine the directory.<br /><code><br /># mdb /tmp/rootdir<br />> ::sizeof zfs`mzap_phys_t<br />sizeof (zfs`mzap_phys_t) = 0x80<br />> ::sizeof zfs`mzap_ent_phys_t<br />sizeof (zfs`mzap_ent_phys_t) = 0x40<br />> 0::print zfs`mzap_phys_t<br />{<br /> mz_block_type = 0x8000000000000003<br /> mz_salt = 0x14187c7<br /> mz_normflags = 0<br /> mz_pad = [ 0, 0, 0, 0, 0 ]<br /> mz_chunk = [<br /> {<br /> mze_value = 0x8000000000000004<br /> mze_cd = 0<br /> mze_pad = 0<br /> mze_name = [ "smallfile" ]<br /> }<br /> ]<br />}<br />> (200-80)%40=K<br /> 6 <-- there are 6 more mzap_ent_phys_t<br />> 80,6::print zfs`mzap_ent_phys_t<br /> {<br /> mze_value = 0<br /> mze_cd = 0<br /> mze_pad = 0<br /> mze_name = [ '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0 ', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', 
'\0', ... ]<br /> }<br /> {<br /> mze_value = 0x8000000000000006<br /> mze_cd = 0<br /> mze_pad = 0<br /> mze_name = [ "words" ] <-- here is the file we want, object id is 6<br /> }<br /> {<br /> mze_value = 0x8000000000000007<br /> mze_cd = 0<br /> mze_pad = 0<br /> mze_name = [ "foo" ]<br /> }<br /> ...<br />$q<br /># <br /></code><br />Now, go back to the file system MOS to look at object id 6. If the object ID was greater than 32 (0x20), there would have been more work looking at other indirect blocks from the objset_phys_t for the file system. We assumed that the root directory for the file system would be a low object number above, and, fortunately,<br />the file we want to examine is also a low object number.<br /><code><br /># mdb /tmp/fs_mos<br />> (6*200)::print -a zfs`dnode_phys_t<br />{<br /> c00 dn_type = 0x13 <-- plain file contents<br /> c01 dn_indblkshift = 0xe<br /> c02 dn_nlevels = 0x2 <-- one level of indirection<br /> c03 dn_nblkptr = 0x1<br /> c04 dn_bonustype = 0x11 <-- bonus buffer contains znode_phys_t<br /> ...<br /> c40 dn_blkptr = [<br /> {<br /> c40 blk_dva = [<br /> {<br /> c40 dva_word = [ 0x4, 0x5ec ]<br /> }<br /> {<br /> c50 dva_word = [ 0x4, 0xf00ec ]<br /> }<br /> {<br /> c60 dva_word = [ 0, 0 ]<br /> }<br /> ]<br /> ...<br /> cc0 dn_bonus = [ 0x22, 0x86, 0xdb, 0x49, 0, 0, 0, 0, 0x9, 0x31, 0x20, 0x28, 0, 0, 0, 0, 0x22, 0x86, 0xdb, 0x49, 0, 0, 0, 0, 0x71, 0x48, 0x2b, 0x28, 0, 0, 0, 0, ... 
] <br />} <br /></code><br />A quick look at the znode_phys_t in the bonus buffer...<br /><code><br />> cc0::print zfs`znode_phys_t<br />{<br /> zp_atime = [ 0x49db8622, 0x28203109 ]<br /> zp_mtime = [ 0x49db8622, 0x282b4871 ]<br /> zp_ctime = [ 0x49db8622, 0x282b4871 ]<br /> zp_crtime = [ 0x49db8622, 0x28203109 ]<br /> zp_gen = 0x97<br /> zp_mode = 0x8124<br /> zp_size = 0x32752<br /> zp_parent = 0x3<br /> zp_links = 0x1<br /> zp_xattr = 0<br /> zp_rdev = 0<br /> zp_flags = 0x40800000004<br /> zp_uid = 0<br /> zp_gid = 0<br /> zp_zap = 0<br /> zp_pad = [ 0, 0, 0 ]<br /> zp_acl = {<br /> z_acl_extern_obj = 0<br /> z_acl_size = 0x30<br /> z_acl_version = 0x1<br /> z_acl_count = 0x6<br /> z_ace_data = [ 0x1, 0, 0, 0x10, 0x26, 0, 0, 0, 0, 0, 0, 0x10, 0x11, 0x1,<br />0xc, 0, 0x1, 0, 0x40, 0x20, 0x26, 0, 0, 0, 0, 0, 0x40, 0x20, 0x1, 0, 0, 0, ...<br />]<br /> }<br />}<br /></code><br />When was the file created?<br /><code><br />> 49db8622=Y<br /> 2009 Apr 7 18:58:10 <br /></code><br />Now, let's look at the blkptr_t.<br /><code><br />> c40::blkptr<br />DVA[0]: vdev_id 0 / bd800<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[0]: :0:bd800:400:id<br />DVA[1]: vdev_id 0 / 1e01d800<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[1]: :0:1e01d800:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: ZFS plain file<br />BIRTH: 97 LEVEL: 1 FILL: 200000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 600e97db0e:411c4ea86350:1790b46d936d46:602547566d07cc7<br />$q<br />#<br /></code><br />We're at level 1.<br /><code><br /># ./zdb -Z tank:0:bd800:400<br />Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1<br />devidx = 1, offset = 25e00, size = 200<br />devidx = 2, offset = 25e00, size = 200<br />devidx = 3, offset = 25e00, size = 200<br />#<br /><br /># mdb<br />> (25e00>>9)+(400000>>9)=D<br /> 8495 <br /><br /># dd if=/export/home/max/r2 of=/tmp/words_i1z iseek=8495 count=1<br />1+0 records in<br />1+0 records out<br /># dd 
if=/export/home/max/r3 of=/tmp/words_i2z iseek=8495 count=1<br />1+0 records in<br />1+0 records out<br /># cat /tmp/words_*z > /tmp/words_iz<br /></code><br />Uncompress...<br /><code><br /># ./zuncompress -l 4000 -p 400 /tmp/words_iz > /tmp/words_i<br /></code><br />So, /tmp/words_i should contain uncompressed blkptr_t. These blkptr_t should take us to the data for the words file.<br /><code><br /># mdb /tmp/words_i<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / c0000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2800000000000<br />DVA[0]: :0:c0000:20000:d<br />LSIZE: 20000 PSIZE: 20000<br />ENDIAN: LITTLE TYPE: ZFS plain file<br />BIRTH: 97 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher2 COMP: uncompressed<br />CKSUM: f5cbf93a151abcac:5b5d6ca83588d8ad:574d9b8bf334944b:ad78d30af51771d8<br />80::blkptr<br />DVA[0]: vdev_id 0 / e8000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2800000000000<br />DVA[0]: :0:e8000:20000:d<br />LSIZE: 20000 PSIZE: 20000<br />ENDIAN: LITTLE TYPE: ZFS plain file<br />BIRTH: 97 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher2 COMP: uncompressed<br />CKSUM: f39ae34f048ae079:de2ef1af7d1fb495:ec3ae3f7985b2a98:c6d33ac68cb042b6<br />$q<br />#<br /></code><br />So, where is the data?<br /><code><br /># ./zdb -Z tank:0:c0000:20000<br />Columns = 5, bigcols = 0, asize = 28000, firstdatacol = 1<br />devidx = 1, offset = 26600, size = 8000<br />devidx = 2, offset = 26600, size = 8000<br />devidx = 3, offset = 26600, size = 8000<br />devidx = 4, offset = 26600, size = 8000<br />devidx = 0, offset = 26800, size = 8000<br /></code><br />A little hex to decimal conversion for dd...<br /><code><br /># mdb<br />> (26600>>9)+(400000>>9)=D<br /> 8499 <br />> (26800>>9)+(400000>>9)=D<br /> 8500 <br />8000>>9=D<br /> 64 <br /></code><br />Now, dump the blocks...<br /><code><br /># dd if=/export/home/max/r2 of=/tmp/w1 iseek=8499 bs=512 count=64<br />64+0 records in<br />64+0 records out<br /># dd if=/export/home/max/r3 of=/tmp/w2 iseek=8499 bs=512 count=64<br />64+0 
records in<br />64+0 records out<br /># dd if=/export/home/max/r4 of=/tmp/w3 iseek=8499 bs=512 count=64<br />64+0 records in<br />64+0 records out<br /># dd if=/export/home/max/r0 of=/tmp/w4 iseek=8500 bs=512 count=64<br />64+0 records in<br />64+0 records out<br /></code><br />And concatenate the 4 files...<br /><code><br /># cat /tmp/w[1-4]<br />10th<br />1st<br />2nd<br />3rd<br />4th<br />5th<br />6th<br />7th<br />8th<br />9th<br />a<br />AAA<br />AAAS<br />Aarhus<br />Aaron<br />AAU<br />ABA<br />Ababa<br />aback<br />...<br />Nostrand<br />nostril<br />not<br />notary<br />notate<br />notch<br />note<br />notebook<br />noteworthy<br />nothing<br />notice<br />notice#<br /></code><br />The first 128k of the file. To get the remainder of the file, we would need to look at the next blkptr_t at level 1. But, not today...Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-89620513721037322192009-12-16T05:55:00.000-08:002009-12-16T06:02:08.789-08:00ZFS Data Recovery<p>I knew there was a reason to understand the ZFS on-disk format besides wanting to be able to teach about it...<br /><p><br />Recently, I was sent an email from someone who had 15 years of video and music stored in a 10TB ZFS pool that, after a power failure, became defective. He unfortunately did not have a backup. He was using ZFS version 6 on FreeBSD 7, and he asked if I could help. He also said that he had spoken with various people, including engineers within Sun, and was told that basically, it was probably not possible to recover the data.<br /><p><br />He also got in touch with a data recovery company who told him they would assign a technician to examine the problem at a cost of $15,000/month. And if they could not restore his data, he would only have to pay 1/2 of the cost.<br /><p><br />After spending about 1 week examining the data on the disk, I was able to restore basically all of it. The recovery was done on OpenSolaris 0609 (build 111b). 
After the recovery, he was able to view his data on the FreeBSD system (as well as OpenSolaris). I would be happy to do this for anyone else who runs into a problem where they can no longer access their ZFS pool, especially at $15k/month. And if I can not do it, you would not need to pay anything!Unknownnoreply@blogger.com14tag:blogger.com,1999:blog-7245518.post-54240876169777274012009-12-15T04:39:00.000-08:002009-12-15T08:33:27.832-08:00segpages dmod source codeSource is at <a href="ftp://ftp.bruningsystems.com/segpages.tar">ftp://ftp.bruningsystems.com/segpages.tar</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-43998805292829489862009-12-14T05:19:00.000-08:002009-12-14T06:03:30.187-08:00Examining address spaces with mdb<p><br />A while ago, I was interested in more details about process address spaces. For instance, if a page is mapped into an address space, where is the page in physical memory? Or if a page is on swap, where is it on swap? Are there pages that are in memory, but not currently valid for a process? The meminfo(2) system call can be used by an application to examine the locations of physical pages corresponding to a range of virtual addresses that the process is using. Is there a tool for doing this from outside the process? Is there any tool for determining the locations of pages in memory when one is using liblgrp(3)? liblgrp(3) provides an API for specifying a "locality group". A locality group, as the man page says, "represents the set of CPU-like and memory-like hardware devices that are at most some locality apart from each other". Essentially, using liblgrp(3), one can specify the desired memory placement for memory that threads within a process are using.<br /><p><br />So, I have written a dcmd, called segpages, for mdb that allows one to examine each virtual page of a segment in a process address space. 
The command gives the following information:<br /><ul><br /><li> The virtual address of the page.<br /><li> If the page is in memory, the physical address of the page.<br /><li> If the page is on swap, the location on swap, and which swap device/file.<br /><li> If the page is not currently in memory or on swap, a "-".<br /><li> If the page is mapped from a file, the pathname of the file, and the offset within the file.<br /><li> If the page is anonymous, the command prints "anon".<br /><li> If the page is mapped to a device, the command only prints the physical address it is mapped to, and the path to the device.<br /><li> The "share count" for the page, i.e., the number of processes sharing the same page.<br /><li> The dcmd command also prints the status of the page:<br /><ul><br /><li> VALID -- The page is mapped<br /><li> INMEMORY -- The page is in memory (it may not be valid for the process).<br /><li> SWAPPED -- The page is on swap. Note that a page may be INMEMORY and SWAPPED. What I find more interesting, is pages that are SWAPPED and VALID. I expect to find INMEMORY pages that are also on swap. I did not expect to find SWAPPED pages that are also VALID, since I assumed that a page that was read in from swap and is now valid would not have a copy on swap. From a quick look at the code, it appears the swap slot is not freed until the reference count on the anon struct that is mapping the page has gone to 0. Anyone with a more complete understanding of this is welcome to comment.<br /></ul><br /></ul><br /><p><br />Here is (very abbreviated) output for a running bash process.<br /><p><br />First, a look at pmap output. Each line of the pmap output represents a "segment" of the address space. 
The different columns are described in the pmap(1) man page.<br /><code><br />$ pmap -x 919<br />919: /bin/bash --noediting -i<<br />Address Kbytes RSS Anon Locked Mode Mapped File<br />08045000 12 12 4 - rw--- [ stack ]<br />08050000 644 644 - - r-x-- bash<br />08100000 80 80 12 - rwx-- bash<br />08114000 52 52 28 - rwx-- [ heap ]<br />CE410000 624 512 - - r-x-- libnsl.so.1<br />CE4BC000 16 16 4 - rw--- libnsl.so.1<br />CE4C0000 20 8 - - rw--- libnsl.so.1<br />CE4F0000 56 52 - - r-x-- methods_unicode.so.3<br />CE50D000 4 4 - - rwx-- methods_unicode.so.3<br />CE510000 2416 752 - - r-x-- en_US.UTF-8.so.3<br />CE77B000 4 4 - - rwx-- en_US.UTF-8.so.3<br />CE960000 64 16 - - rwx-- [ anon ]<br />CE97E000 4 4 - - rwxs- [ anon ]<br />CE980000 4 4 - - rwx-- [ anon ]<br />CE990000 24 12 4 - rwx-- [ anon ]<br />CE9A0000 4 4 4 - rwx-- [ anon ]<br />CE9B0000 1280 972 - - r-x-- libc_hwcap1.so.1<br />CEAF0000 28 28 16 - rwx-- libc_hwcap1.so.1<br />CEAF7000 8 8 - - rwx-- libc_hwcap1.so.1<br />CEB00000 4 4 - - r-x-- libdl.so.1<br />CEB10000 4 4 4 - rwx-- [ anon ]<br />CEB20000 56 56 - - r-x-- libsocket.so.1<br />CEB3E000 4 4 - - rw--- libsocket.so.1<br />CEB40000 180 136 - - r-x-- libcurses.so.1<br />CEB7D000 28 28 - - rw--- libcurses.so.1<br />CEB84000 8 - - - rw--- libcurses.so.1<br />CEB90000 4 4 4 - rwx-- [ anon ]<br />CEBA0000 4 4 4 - rw--- [ anon ]<br />CEBB0000 4 4 - - rw--- [ anon ]<br />CEBBF000 180 180 - - r-x-- ld.so.1<br />CEBFC000 8 8 4 - rwx-- ld.so.1<br />CEBFE000 4 4 4 - rwx-- ld.so.1<br />-------- ------- ------- ------- -------<br />total Kb 5832 3620 92 -<br /><br /># mdb -k<br />Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs usba sockfs ip hook neti sctp arp uhci sd fctl md lofs audiosup fcip fcp random cpc crypto logindmux ptm ufs sppp ipc ]<br /><br /></code><br /><p><br />First, load the dmod containing the new dcmd.<br /><code><br /><br />> ::load /wd320/max/source/mdb/segpages/i386/segpages.so <br />><br 
/></code><br /><p><br />Now, walk through the segments of the process address space, showing<br />each virtual page in the segment. Note that output has been omitted.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8045000 378C5000 [anon] 54518000 1 VALID<br /> 8046000 6EB06000 [anon] 54118000 1 VALID<br /> 8047000 5F9C7000 [anon] 540B8000 1 VALID<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8050000 600A7000 bash 0 7 VALID<br /> 8051000 74368000 bash 1000 7 VALID<br /> 8052000 72669000 bash 2000 7 VALID<br /> 8053000 66C6A000 bash 3000 7 VALID<br /> 8054000 636AB000 bash 4000 0 INVALID,INMEMORY<br /> 8055000 5FDEC000 bash 5000 0 INVALID,INMEMORY<br /> 8056000 63EED000 bash 6000 0 INVALID,INMEMORY<br /> 8057000 62EAE000 bash 7000 0 INVALID,INMEMORY<br /> 8058000 5C52F000 bash 8000 7 VALID<br /> 8059000 5C5B0000 bash 9000 7 VALID<br />... output omitted<br /> 80ED000 5C2C4000 bash 9D000 7 VALID<br /> 80EE000 5C245000 bash 9E000 7 VALID<br /> 80EF000 5C286000 bash 9F000 3 VALID<br /> 80F0000 63A97000 bash A0000 0 INVALID,INMEMORY<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8100000 79940000 [anon] 541D8000 1 VALID<br /> 8101000 5F0C1000 [anon] 62F00000 1 VALID<br /> 8102000 378C2000 [anon] 54438000 1 VALID<br /> 8103000 5EF5A000 bash A3000 6 VALID<br /> 8104000 5EEDB000 bash A4000 6 VALID<br /> 8105000 37885000 [anon] 543B8000 1 VALID<br /> 8106000 60E1D000 bash A6000 7 VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8114000 37914000 [anon] 54478000 1 VALID<br /> 8115000 79DD5000 [anon] 54368000 1 VALID<br /> 8116000 55356000 [anon] 62F90000 1 VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE410000 7AE40000 libnsl.so.1 0 55 VALID<br />CE411000 7AEC1000 libnsl.so.1 1000 57 VALID<br />CE412000 7AE42000 libnsl.so.1 2000 57 VALID<br />CE413000 7AE83000 libnsl.so.1 3000 57 VALID<br />CE414000 7AE84000 libnsl.so.1 4000 57 VALID<br />...<br 
/>CE42D000 6EE96000 libnsl.so.1 1D000 1A INVALID,INMEMORY<br />CE42E000 6E797000 libnsl.so.1 1E000 1A INVALID,INMEMORY<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE4F0000 17D9000 methods_unicode.so.3 0 29 VALID<br />CE4F1000 17DA000 methods_unicode.so.3 1000 2C VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE510000 1869000 en_US.UTF-8.so.3 0 28 VALID<br />CE511000 18AA000 en_US.UTF-8.so.3 1000 2A VALID<br />...<br />CE518000 6F1EA000 en_US.UTF-8.so.3 8000 0 INVALID,INMEMORY<br />CE519000 6F1EB000 en_US.UTF-8.so.3 9000 0 INVALID,INMEMORY<br />CE51A000 6F1EC000 en_US.UTF-8.so.3 A000 0 INVALID,INMEMORY<br />...<br />CE5FF000 6DB60000 en_US.UTF-8.so.3 EF000 5 INVALID,INMEMORY<br />CE600000 1659000 en_US.UTF-8.so.3 F0000 7 INVALID,INMEMORY<br />...<br />CE6EE000 1687000 en_US.UTF-8.so.3 1DE000 9 INVALID,INMEMORY<br />CE6EF000 1688000 en_US.UTF-8.so.3 1DF000 9 INVALID,INMEMORY<br />CE6F0000 1649000 en_US.UTF-8.so.3 1E0000 9 INVALID,INMEMORY<br />...<br />CE729000 1782000 en_US.UTF-8.so.3 219000 29 VALID<br />CE72A000 1783000 en_US.UTF-8.so.3 21A000 29 VALID<br />...<br />CE730000 1709000 en_US.UTF-8.so.3 220000 29 VALID<br />CE731000 6F143000 en_US.UTF-8.so.3 221000 0 INVALID,INMEMORY<br />CE732000 6F144000 en_US.UTF-8.so.3 222000 0 INVALID,INMEMORY<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE9B0000 76A42000 libc_hwcap1.so.1 0 5B VALID<br />CE9B1000 76AC3000 libc_hwcap1.so.1 1000 5B VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />...<br />CEBC4000 2A34000 ld.so.1 5000 47 VALID<br />CEBC5000 28B5000 ld.so.1 6000 47 VALID<br />CEBC6000 29F6000 ld.so.1 7000 60 VALID<br />CEBC7000 2937000 ld.so.1 8000 60 VALID<br />...<br />><br /></code><br /><p><br />Some general things to note:<br /><ul><br /><li> Physical pages are randomly distributed. However, pages from ld.so.1 tend to be in low memory with comparison to anonymous pages. 
This should be expected as most pages of ld.so.1 are probably loaded early on in the system lifetime, as almost every application uses it.<br /><li> There are many pages that are not valid, but they are in memory. In general, text and data pages are prefetched when a program starts, unless the program is large, or there is not enough free memory. Although pages are prefetched, it appears that they are not mapped to the process address space until/unless they are actually used.<br /><li> Bash is not very large. Running the command above finishes in 5-10 seconds. Running the same command on a large program (for instance, firefox-bin) takes several minutes to complete. Running the command on a large 64-bit application will take considerably longer.<br /><li> This is being run on a live system, so the address space of the process being examined may change while it is being walked.<br /><li> At this point in time, no pages are swapped out.<br /></ul><br /><p><br />Now, let's get some general statistics.<br /><p><br />First, a count of the pages currently valid for the process. This is the current mapped RSS. Note that the pmap command reports "RSS", which, at 3620k, is 905 4k-byte pages. But only 558 pages (or 2232k) are currently valid.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i " valid" | wc<br /> 558 3348 35712<br />><br /></code><br /><p><br />Now, the pages in memory, but not currently valid in the page table(s) for the process.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i "inmemory" | wc<br /> 347 2082 26025<br />><br /></code><br /><p><br />Note that the valid pages plus the in-memory pages total 905, the RSS value reported by pmap. So the RSS reported by pmap does not mean that page faults will not happen for those pages. 
But if a page fault occurs the correct page will be found in memory.<br /><p><br />How many pages are currently not valid (and not in memory).<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i " invalid$" | wc<br /> 553 3298 36498<br />><br /></code><br /><p><br />How large is the address space?<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -v OFFSET | wc<br /> 1458 8728 98235<br />><br /></code><br /><p><br />Note that this is 5832k, the total size as reported by pmap.<br /><p><br />How many pages have been swapped out?<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i swapped | wc<br /> 0 0 0<br />><br /></code><br /><p><br />Now, we'll induce memory load on the system, and again examine the address space. The memory usage induced should be enough to cause pages to be swapped (paged) out.<br /><p><br />First, pmap output after the memory stress.<br /><code><br /><br />$ pmap -x 919<br />919: /bin/bash --noediting -i<br /> Address Kbytes RSS Anon Locked Mode Mapped File<br />08045000 12 4 - - rw--- [ stack ]<br />08050000 644 508 - - r-x-- bash<br />08100000 80 80 - - rwx-- bash<br />08114000 52 44 28 - rwx-- [ heap ]<br />CE410000 624 320 - - r-x-- libnsl.so.1<br />CE4BC000 16 16 4 - rw--- libnsl.so.1<br />CE4C0000 20 8 - - rw--- libnsl.so.1<br />CE4F0000 56 36 - - r-x-- methods_unicode.so.3<br />CE50D000 4 4 - - rwx-- methods_unicode.so.3<br />CE510000 2416 124 - - r-x-- en_US.UTF-8.so.3<br />CE77B000 4 4 - - rwx-- en_US.UTF-8.so.3<br />CE960000 64 16 - - rwx-- [ anon ]<br />CE97E000 4 4 - - rwxs- [ anon ]<br />CE980000 4 4 - - rwx-- [ anon ]<br />CE990000 24 12 4 - rwx-- [ anon ]<br />CE9A0000 4 4 4 - rwx-- [ anon ]<br />CE9B0000 1280 952 - - r-x-- libc_hwcap1.so.1<br />CEAF0000 28 28 12 - rwx-- libc_hwcap1.so.1<br />CEAF7000 8 8 - - rwx-- libc_hwcap1.so.1<br />CEB00000 4 4 - - r-x-- libdl.so.1<br />CEB10000 4 4 4 - 
rwx-- [ anon ]<br />CEB20000 56 56 - - r-x-- libsocket.so.1<br />CEB3E000 4 4 - - rw--- libsocket.so.1<br />CEB40000 180 68 - - r-x-- libcurses.so.1<br />CEB7D000 28 28 - - rw--- libcurses.so.1<br />CEB84000 8 - - - rw--- libcurses.so.1<br />CEB90000 4 4 4 - rwx-- [ anon ]<br />CEBA0000 4 4 4 - rw--- [ anon ]<br />CEBB0000 4 4 - - rw--- [ anon ]<br />CEBBF000 180 180 - - r-x-- ld.so.1<br />CEBFC000 8 8 4 - rwx-- ld.so.1<br />CEBFE000 4 4 4 - rwx-- ld.so.1<br />-------- ------- ------- ------- -------<br />total Kb 5832 2544 72 -<br /><br />$<br /></code><br /><p><br />As expected, the RSS has gone down, but the virtual size remains the same. It is a little interesting that the amount reported under anon has also dropped by 20k.<br /><p><br />Again, we'll use the new dcmd to examine the address space more closely.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8045000 378C5000 [anon] 54518000 1 VALID<br /> 8046000 - /dev/zvol/dsk/rpool/swap 17EBF000 0 INVALID,SWAPPED<br /> 8047000 - /dev/zvol/dsk/rpool/swap 1D8FE000 0 INVALID,SWAPPED<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8050000 16B86000 bash 0 1 VALID<br /> 8051000 13A07000 bash 1000 1 VALID<br /> 8052000 6B088000 bash 2000 1 VALID<br /> 8053000 1889000 bash 3000 1 VALID<br /> 8054000 2430A000 bash 4000 1 VALID<br /> 8055000 6440B000 bash 5000 1 VALID<br /> 8056000 6684C000 bash 6000 1 VALID<br /> 8057000 7308D000 bash 7000 1 VALID<br /> 8058000 6DCCE000 bash 8000 0 INVALID,INMEMORY<br /> 8059000 3784F000 bash 9000 0 INVALID,INMEMORY<br />...<br /> 80ED000 4CB23000 bash 9D000 0 INVALID,INMEMORY<br /> 80EE000 76BE4000 bash 9E000 0 INVALID,INMEMORY<br /> 80EF000 5BA5000 bash 9F000 0 INVALID,INMEMORY<br /> 80F0000 36836000 bash A0000 0 INVALID,INMEMORY<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8100000 - /dev/zvol/dsk/rpool/swap 247C2000 0 INVALID,SWAPPED<br /> 8101000 - /dev/zvol/dsk/rpool/swap 7CCD000 0 
INVALID,SWAPPED<br /> 8102000 378C2000 [anon] 54438000 1 VALID<br /> 8103000 75479000 bash A3000 0 INVALID,INMEMORY<br /> 8104000 532BA000 bash A4000 0 INVALID,INMEMORY<br /> 8105000 37885000 [anon] 543B8000 1 VALID<br /> 8106000 7443C000 bash A6000 0 INVALID,INMEMORY<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8114000 37914000 [anon] 54478000 1 VALID<br /> 8115000 79DD5000 [anon] 54368000 1 VALID<br /> 8116000 55356000 [anon] 62F90000 1 VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE410000 7AE40000 libnsl.so.1 0 4C VALID<br />CE411000 7AEC1000 libnsl.so.1 1000 4E VALID<br />CE412000 7AE42000 libnsl.so.1 2000 4E VALID<br />CE413000 7AE83000 libnsl.so.1 3000 4E VALID<br />CE414000 7AE84000 libnsl.so.1 4000 4E VALID<br />...<br />CE42D000 6EE96000 libnsl.so.1 1D000 18 INVALID,INMEMORY<br />CE42E000 6E797000 libnsl.so.1 1E000 18 INVALID,INMEMORY<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE4F0000 17D9000 methods_unicode.so.3 0 27 VALID<br />CE4F1000 17DA000 methods_unicode.so.3 1000 2A VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE510000 1869000 en_US.UTF-8.so.3 0 26 VALID<br />CE511000 18AA000 en_US.UTF-8.so.3 1000 28 VALID<br />...<br />CE518000 - en_US.UTF-8.so.3 8000 0 INVALID<br />CE519000 - en_US.UTF-8.so.3 9000 0 INVALID<br />CE51A000 - en_US.UTF-8.so.3 A000 0 INVALID<br />...<br />CE5FF000 - en_US.UTF-8.so.3 EF000 0 INVALID<br />CE600000 - en_US.UTF-8.so.3 F0000 0 INVALID<br />...<br />CE6EE000 1687000 en_US.UTF-8.so.3 1DE000 A INVALID,INMEMORY<br />CE6EF000 1688000 en_US.UTF-8.so.3 1DF000 A INVALID,INMEMORY<br />CE6F0000 1649000 en_US.UTF-8.so.3 1E0000 A INVALID,INMEMORY<br />...<br />CE729000 1782000 en_US.UTF-8.so.3 219000 27 VALID<br />CE72A000 1783000 en_US.UTF-8.so.3 21A000 27 VALID<br />...<br />CE730000 1709000 en_US.UTF-8.so.3 220000 27 VALID<br />CE731000 - en_US.UTF-8.so.3 221000 0 INVALID<br />CE732000 - en_US.UTF-8.so.3 222000 0 INVALID<br />...<br /> VA PA FILE OFFSET SHARES 
DISPOSITION<br />CE9B0000 76A42000 libc_hwcap1.so.1 0 51 VALID<br />CE9B1000 76AC3000 libc_hwcap1.so.1 1000 51 VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CEBC4000 2A34000 ld.so.1 5000 42 VALID<br />CEBC5000 28B5000 ld.so.1 6000 42 VALID<br />CEBC6000 29F6000 ld.so.1 7000 57 VALID<br />CEBC7000 2937000 ld.so.1 8000 57 VALID<br />...<br />><br /></code><br /><p><br />As expected, many pages that were previously valid are now invalid. Many of these pages are still in memory, but some have been swapped out. The output does not show it, but some pages that are swapped out can also be in memory (the page was swapped out, put on a freelist, but has not yet been re-used for some other purpose). It is interesting that some pages with reasonably high share counts are still in memory, but no longer valid for this instance of bash. The pageout code checks the share counts, and skips pages being shared by more than po_share processes. On my system, po_share is 8. So I am not sure what is marking the pages invalid (maybe a job for DTrace).<br /><p><br />As before, I'll get some counts of valid, invalid, inmemory, and<br />swapped pages.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i " valid" | wc<br /> 413 2478 26432<br />><br /></code><br /><p><br />Previously, the number of valid pages was 558, so 145 pages have been marked invalid and possibly swapped out.<br /><p><br />The number of invalid pages is now:<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i " invalid$" | wc<br /> 818 4888 53988<br />><br /></code><br /><p><br />Previously, this was 553, so 265 pages that previously were valid are<br />now invalid.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i "inmemory" | wc<br /> 215 1290 16125<br />><br /></code><br /><p><br />And 215 pages that are invalid are still in memory, but the page table entries for 
the bash instance do not have the pages mapped.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i swapped | wc<br /> 12 72 936<br /></code><br /><p><br />And 12 pages of bash are on swap.<br /><p><br />It would be nice to be able to show this graphically. For instance, a large box representing the address space, with different colored pixels to represent the state of the different pages of the address space. I have been told that JavaFX is good for this, but my knowledge of Java is really not up to it. Especially for large processes, a graphical view would be nice (well, at least interesting to look at...).<br /><p><br />I have not tried the dcmd on SPARC or x64, but I expect it to work (at least on x64). I would also like to try this on a large machine which has latency groups set up. If anyone has such a machine and would like to try this out, please let me know.<br /><p><br />I also have a version of the command that only prints summary information. I want to add an option that prints page sizes, but currently the command assumes all pages are the native page size (4k on x86/x64 and 8k on SPARC).<br /><p><br />If there is interest, I'll make the code for the dcmd available.Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-7245518.post-42475889316787152112009-09-05T06:08:00.000-08:002009-09-05T06:10:52.151-07:00Correction for classes in BerlinOops. I've got the wrong dates for the Solaris/OpenSolaris Internals classes in Berlin. The correct dates are Sept. 21-25, and Sept. 28 - Oct. 2. The first week is full, but there are still openings for the second week. 
Hope to see you there!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-62561991985057167592009-08-04T05:24:00.000-07:002009-08-04T05:37:01.306-07:00OpenSolaris Internals course announcementsI will be teaching 2 back-to-back 5 day OpenSolaris Internals classes in Berlin, Germany the weeks of September 28 through October 2, and again from October 5 through October 10. For details about topics covered, price, and availability, please visit <a href="http://www.workshops-berlin.de/">http://www.workshops-berlin.de/</a>. Note that this website is in German. The course itself will be in English. If you are interested, but cannot read German, send me an email.Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-7245518.post-80676593123334968592009-04-09T05:50:00.000-07:002009-04-09T09:43:46.418-07:00RAIDZ On-Disk FormatA while back, I came up with a way to examine ZFS on-disk format using a modified mdb and zdb (see <a href="http://www.bruningsystems.com/osdevcon_draft3.pdf">paper</a> and <a href="http://www.bruningsystems.com/zfs_ondisk_slides.pdf">slides</a>). I also used the method described to recover a removed file (see <a href="http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html">here in my blog</a>). During the past week, I decided to try to understand the layout of raidz. In other words, how raidz organizes data on disk. It's simple to say that raidz on disk is basically raid5 with variable length stripes, but what does that really mean?<br /><br />To do this, I once again used the modified mdb, and made a further modification to zdb. In addition, I implemented a new command (zuncompress) which allows me to uncompress ZFS data existing in a regular file. Since I fear that most of the 10 people or so who read this will not want to read a long description of how I determined the layout, here I'll just give a summary. 
If anyone really wants the details, reply to the blog and maybe I'll go into them.<br /><br />First, some general characteristics:<br /><br /> - Each disk contains the 4MB labels at the beginning and end of the disk. For information on these, please see the <a href="http://opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf">ZFS On-Disk Specification</a> paper. Any walk of the ZFS on-disk structures starts with an uberblock_t, which is found in this area.<br /><br /> - The metadata used for raidz is the same as for other ZFS objects. In other words, uberblock_t contains the location of an objset_phys_t, which in turn contains the location of the meta-object set (mos), and so on. A difference is that, physically, an individual structure may be spread across several of the disks, and not necessarily all of them. For example, let's take a mos (basically an array of dnode_phys_t structures) on a 5-disk raidz volume. This might be compressed to 0xa00 (2560 decimal) bytes. It may be organized on the raidz disks as follows:<br /> - 512 bytes on disk 0<br /> - 512 bytes on disk 1<br /> - 512 bytes on disk 2<br /> - 1024 bytes on disk 3<br /> - 1024 bytes on disk 4<br />If you do the arithmetic, you'll find this is 0xe00 bytes (3.5k) and not 0xa00 (2.5k) bytes. The actual allocated size may be still larger. The reason for the extra 1k bytes is the next point.<br /><br /> - Each metadata object (as well as data itself) has its own parity. The extra 1k bytes in the previous point is for parity. If the parity in the above example is on disk 4, it must be 1024 bytes large, since the largest size of any of the blocks containing the object is 1k bytes. 
Even a metadata structure that only takes up 512 bytes (for instance, an objset_phys_t), will take up 1024 bytes on the disks, one disk containing the 512-byte structure, and another containing 512-bytes of parity.<br /><br /> - Block offsets as reported by zdb (and described in the ZFS On-Disk Specification) are for the entire space (i.e., if you have 5 100GB disks making up a raidz pool, the block offsets start at 0 and go to 500GB).<br /><br /> - Since block offsets cover the entire pool, you cannot simply look at the offsets reported by zdb and map them to locations on disk. The kernel routine, vdev_raidz_map_alloc() (see http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c#644), converts offset and size to locations on the disks. I have added an option to zdb that, given a raidz pool, offset, and size (as reported by zdb), calls this routine and prints out the values of the returned map. This shows the location on the disk(s) and sizes for both the data itself, and parity.<br /><br /> - I recently saw on #opensolaris irc, a person stating that a write of a 1 byte file results in a write to all disks in raidz. That may be true (I haven't checked), but only 1024 bytes are used for the 1 byte file. One 512 byte block containing the 1 byte of data, and the other 512 byte block on a different disk containing parity. It is not using space on all n disks for a 1 byte file.<br /><br /> - ZFS basically uses a scatter-gather approach to reading and writing data on a raidz pool. The disks are read at the correct offsets into a buffer large enough to contain the data. So on a read, data on the first disk is read into the beginning of the buffer, data from the second disk is read into the same buffer virtually contiguous with the end of the data from the first disk, and so on. The resulting buffer is then de-compressed, and the data returned to the requestor.<br /><br />So, that's the basics. 
I was going to turn my attention to the in-memory representation of ZFS, but now think instead I'll take a stab at automating the techniques I am using. Once I have that done, I'll try automating recovery of a removed file. From there, we'll see.Unknownnoreply@blogger.com7tag:blogger.com,1999:blog-7245518.post-43159389623814231972009-04-04T02:04:00.000-07:002009-04-06T11:22:42.698-07:00More information about Free One Day OpenSolaris Internals TrainingI thought I would say a few words about what is planned for the free one day OpenSolaris Internals training class (see <a href="http://sl.osunix.org/FreeKernelTrainingDay">http://sl.osunix.org/FreeKernelTrainingDay</a> for a list of topics, and to sign up).<br /><br />Regardless of the topics covered, I want to make this as close to a "classroom" setting as possible. For me, this means that attendees should be able to follow along with anything I am doing on OpenSolaris by doing it themselves. So, for instance, if I am using mdb to examine some data structure, students should be able to do the same on their machines. For some topics, notably ZFS, this will require students to either build an mdb dmod, and the modified mdb and zdb that I use, or load a version of OpenSolaris that contains these (to be provided by osunix.org). Source for the modified mdb, zdb, and rawzfs mdb dmod is available for download at <a href="ftp://ftp.bruningsystems.com/mdb.tar.Z">ftp://ftp.bruningsystems.com/mdb.tar.Z</a>, <a href="ftp://ftp.bruningsystems.com/zdb.tar.Z">ftp://ftp.bruningsystems.com/zdb.tar.Z</a>, and <a href="ftp://ftp.bruningsystems.com/raw_dmods.tar.Z">ftp://ftp.bruningsystems.com/raw_dmods.tar.Z</a>. 
If we do a kmdb session, students will either need to run OpenSolaris in a VM (virtualbox), or have 2 machines connectable via tip or a terminal server for console access.<br /><br />Currently, the plan is to give attendees access to some slides, use IRC, and give students access to a window on my machine where they can see what I am doing, and try the same on their machine. Best would be a window where everyone can "see" my desktop, but I'm still looking into the best way to do that (any suggestions for this are welcome). It would be great to have audio, preferably conferencing, but this may cost money, and... the class is free. That should mean free for me as well. If anyone has a suggestion for free, conferenced audio, I would appreciate it.<br /><br />I would like to decide on topics to be covered in the next week or so. So, if you are interested in attending, please go to http://sl.osunix.org/FreeKernelTrainingDay, take a look at the topics, and sign up. If you have ideas for other kernel-related topics, please let me know. Depending on how this goes, I may do more of these in the future.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-29348603156369932002009-04-02T23:36:00.000-07:002009-04-02T23:44:13.404-07:00Free One-day OpenSolaris Internals classI am holding a free, one day OpenSolaris Internals class on-line on April 18 or 19. We'll cover 2 topics as determined by a vote of topics that may be covered. For more information, see <a href="http://sl.osunix.org/FreeKernelTrainingDay">http://sl.osunix.org/FreeKernelTrainingDay</a>. I hope to see you there!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-28996318489705485972009-04-02T23:25:00.000-07:002009-04-02T23:36:01.605-07:00OpenSolaris Internals classI am teaching an OpenSolaris Internals class at Systemics in Warsaw, Poland the week of May 4-8. The course will be held in English. 
For a detailed topic outline, see <a href="http://www.bruningsystems.com/page14/page13/page13.html">here</a>. For pricing, location information, and availability, please send email to <a href="mailto:magdalena.sternik@systemics.pl">magdalena.sternik@systemics.pl</a>. If you have questions about course content, please email me at <a href="mailto:max@bruningsystems.com">max@bruningsystems.com</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-51511474514644844872009-03-31T02:31:00.000-07:002009-03-31T04:20:04.668-07:00A faster memstat for mdbI have implemented a version of the ::memstat dcmd for mdb that gives results in less than half the time of the ::memstat currently in mdb. If you are interested, it is available for download <a href="ftp://ftp.bruningsystems.com/memstat.tar.Z">here</a>.<br /><br />So, how does it work? The current version of memstat uses the page walker (::walk page) to walk all cached pages in the system. The new version simply examines all pages.<br /><br />Each page-able page in the system is represented by a page_t structure. Basically, all of memory except for the unix and genunix kernel modules and a few other odds and ends is considered "page-able". Every page that is in use (by the kernel, a user process, anonymous memory, or the file cache) has an identity. This is a vnode/offset pair. The identity uniquely determines how the page is being used. For instance, a page for the code of a running bash process will have the vnode_t for /usr/bin/bash, and the offset within /usr/bin/bash of where the page comes from. For a kernel page, there is a special vnode_t (kvp). For anonymous space, the page has a swapfsvnode (also used for shared memory and tmpfs files). When a process gets a page fault, the fault handling code first checks to see if the faulting address is mapped in the process' address space. If not, a segmentation violation (SIGSEGV) is sent to the process. 
If the address is within the address space, the fault handling code sees if the page is already in memory. It does this by retrieving the vnode/offset for the faulting page, and hashing into an array called page_hash. Each entry in page_hash is the beginning of a linked list of page_t structures. So the fault handling code does a hash to get a page_hash array entry, then walks the page_t structures starting at that entry to look for a matching vnode_t/offset. If the page_t is in the hash, the fault handling code sets up a page table entry (translation table entry on SPARC) to map to the corresponding physical page.<br /><br />The page_hash array is sized so that the average search, given a page_hash bucket, is no longer than 4 (PAGE_HASHAVELEN in vm/page.h) entries. This makes searching for a cached page fairly fast.<br /><br />The page walker that mdb uses to do memstat walks every hash bucket looking for pages. Basically, if the page is found from a hash bucket, the page is either in use by the kernel, some process, or tmpfs, or the page is free but cached. Any page not hanging off of a page_hash bucket is considered free (i.e., the free (freelist) statistic).<br /><br />The new memstat takes a different approach. Rather than scanning each hash bucket for pages, it simply reads all of the page_t structures on the system, then examines each one to determine if it is a kernel page, executable page, anonymous page, and so on. Any page_t that does not have a vnode_t is considered a free page, and is counted in the free (freelist) statistic.<br /><br />How are the page_t structures found on the system? There is a linked list of memseg structures that do the bookkeeping for the page-able page structures. The list is headed by a pointer, memsegs, and is built early on in the startup code when the system is booting. 
I suspect the list could change due to dynamic reconfiguration events, but I'll leave that as an exercise for the reader...<br />To see the list, you can do "::memseg_list" in mdb. On my system, this gives:<br /><p><br /><code><br />> ::memseg_list<br /> ADDR PAGES EPAGES BASE END<br />fbe00028 fbe94160 fe652260 00000c00 0007fed0<br />fbe00014 fbe85160 fbe94160 00000100 00000400<br />fbe00000 fbe82050 fbe85160 00000002 0000009f<br />><br /></code><br /></p><br />The ADDR column is the address of a memseg structure. PAGES is the address of the first page_t in an array. EPAGES points to the end of the page_t array. BASE is the starting page frame number of the memseg, and END is the ending page frame number. So, on my system, page frames between 2 and 9f, 100 and 400, and c00 and 7fed0 have page_t structures. Physical pages 0 and 1 are not in the list. Also pages between 9f and 100, and between 400 and c00 are not in the list. This is either because the physical memory does not exist, or it is not considered pageable.<br /><br />The new memstat uses the memseg list to read in all of the page_t structures. On my system, this means 3 read calls (though the read from fbe94160 to fe652260 is quite large), versus thousands of read calls in the existing memstat via the page walker. The new memstat assumes there will never be more than 256 memseg structures. (This was arbitrarily chosen. I have not seen machines with more than 6 memseg structures, but I don't get on very large machines very often). A more correct way would be to build the memseg list within the dcmd, but I am lazy.<br /><br />Using DTrace to count system calls while running the two versions of memstat shows that the original memstat makes 1741225 system calls, while the new memstat makes 737344. So over a million fewer system calls in the new memstat.<br /><br />I think memstat performance could be improved even more by using mmap to map in the page arrays. 
Then there would be no need for using mdb_alloc, and no need to mdb_vread the page_t structs.Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-7245518.post-47058565406632840322008-08-25T07:31:00.000-07:002008-08-25T07:36:36.248-07:00Update to bruningsystems.com websiteI have added a section called "articles", which has links to various articles, presentations, and some course materials on OpenSolaris. You can see it <a href="http://www.bruningsystems.com/page16/page16.html">here</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-52320337067358495172008-08-18T10:05:00.000-07:002008-08-19T01:46:29.884-07:00recovering removed file on zfs diskI have used my modified mdb and zdb (see<br /><a href="http://www.osdevcon.org/2008/files/osdevcon2008-proceedings.pdf">http://www.osdevcon.org/2008/files/osdevcon2008-proceedings.pdf </a> and<br /><a href="http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf">http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf</a>)<br />to recover a file that was removed from a zfs file system.<br />The technique is to locate the active uberblock_t after the file<br />was created, but before the file was removed, and follow the data<br />structures from that uberblock_t. This technique would probably not<br />work on a nearly full file system, and probably not on a very busy file<br />system, but it works here. Also, this will not work with RAID-z,<br />but works fine for mirrors. (I shall get around to figuring out<br />raid-z, but not now...).<br /><br />It is possible to follow all of the steps and still not have the right<br />data because you chose the wrong uberblock_t, or one of the blocks containing<br />metadata (or the data itself) has been re-used.<br /><br />The modified mdb and zdb have been updated to work with Nevada,<br />build 94. It took about 15 minutes to merge the versions I was using<br />for build 79 into build 94. 
For source for the changes, and<br />the zfs dmod, send mail to me at max@bruningsystems.com.<br /><br />It might be possible, with a bit more clever use of mdb<br />and some shell scripting, to automate this... Also, it might be<br />useful to add an option to zdb so that a transaction<br />id other than the active one can be used for its traversals.<br />Then you might be able to do everything using zdb.<br /><br />The following describes the steps taken.<br /><br />First, I copy a file with known contents to the zfs file system.<br /><pre><br /><br /># cp /usr/dict/words /zfs_fs/words<br />#<br /><br /></pre><br />We'll get the object id (inumber) for /zfs_fs. We'll use it later.<br /><pre><br /><br /># ls -aid /zfs_fs<br /> 3 /zfs_fs<br />#<br /></pre><br />Next, I'll try to make sure everything is on the disk.<br /><pre><br /># sync<br />#<br /></pre><br />Now, I'll use zdb to get the root blkptr from the uberblock.<br />This will also give me a transaction ID. Generally, you would not<br />use zdb to get the uberblock_t every time you add or remove a<br />file on a zfs file system. That is ok. 
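The reason it is ok: the pool keeps a ring of uberblocks, so an older one can still be found after the fact. In miniature (toy Python with made-up txg values, not the on-disk layout):

```python
# Sketch of the recovery idea: a label holds a ring of uberblocks, the
# active one is the entry with the highest txg, and an older entry can
# be picked instead.  The txg values here are made up for illustration.

ring = [{"txg": t} for t in (1278, 1280, 1282, 1284)]

def active_uberblock(ring):
    # what the pool considers current
    return max(ring, key=lambda ub: ub["txg"])

def uberblock_for_txg(ring, txg):
    # for recovery: an earlier txg, if it still exists in the ring and
    # the metadata it points to has not been reused
    for ub in ring:
        if ub["txg"] == txg:
            return ub
    return None
```

With these values, the active entry is txg 1284, but txg 1282 — the one from before the remove — is still sitting in the ring.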
I have written a dcmd<br />(output shown below), that walks the uberblock_t array on disk.<br />Then you can, by trial and error, locate the uberblock_t you need<br />(assuming it still exists in the array, and assuming the metadata<br />it points to has not been re-used for another purpose).<br /><pre><br /><br /># ./zdb -uuu zfs_fs<br />Uberblock<br /><br /> magic = 0000000000bab10c<br /> version = 11<br /> txg = 1282 <-- transaction id in decimal<br /> guid_sum = 8876692711396000182<br /> timestamp = 1218963748 UTC = Sun Aug 17 11:02:28 2008<br /> rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:11a00:200><br /> DVA[1]=<0:c010e00:200> DVA[2]=<0:18008e00:200> fletcher4<br /> lzjb LE contiguous birth=1282 fill=27<br /> cksum=81f780ec5:361b52dda06:b6f3f410036f:1a2b8b10bfdb5c<br /><br /></pre><br />Next, I'll remove the file I just created.<br /><pre><br /><br /># rm /zfs_fs/words <br />#<br /><br /></pre><br />Let's take a look at the active uberblock_t.<br /><pre><br /><br /># ./zdb -uuu zfs_fs<br />Uberblock<br /><br /> magic = 0000000000bab10c<br /> version = 11<br /> txg = 1282 <-- nothing has changed<br /> guid_sum = 8876692711396000182<br /> timestamp = 1218963748 UTC = Sun Aug 17 11:02:28 2008<br /> rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:11a00:200> <br /> DVA[1]=<0:c010e00:200> DVA[2]=<0:18008e00:200> fletcher4<br /> lzjb LE contiguous birth=1282 fill=27 <br /> cksum=81f780ec5:361b52dda06:b6f3f410036f:1a2b8b10bfdb5c<br /><br /></pre><br />Let's try to make sure it is on the disk.<br /><pre><br /># sync<br />#<br /></pre><br />And check the active uberblock_t again.<br /><pre><br /># ./zdb -uuu zfs_fs<br />Uberblock<br /><br /> magic = 0000000000bab10c<br /> version = 11<br /> txg = 1284 <-- new transaction id, after file was removed<br /> guid_sum = 8876692711396000182<br /> timestamp = 1218963808 UTC = Sun Aug 17 11:03:28 2008<br /> rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:15200:200> <br /> DVA[1]=<0:c014600:200> DVA[2]=<0:1800a000:200> fletcher4<br 
/> lzjb LE contiguous birth=1284 fill=27 <br /> cksum=87431704e:37f154f58d7:bbddb76e9703:1aaf346847004f<br /><br /></pre><br />Now, let's make sure nothing changes in the file system.<br /><pre><br /><br /># zfs umount zfs_fs<br />#<br /><br /></pre><br />And look at the active uberblock_t again.<br /><pre><br /><br /># ./zdb -uuu zfs_fs<br />Uberblock<br /><br /> magic = 0000000000bab10c<br /> version = 11<br /> txg = 1284 <-- ok. nothing changed<br /> guid_sum = 8876692711396000182<br /> timestamp = 1218963808 UTC = Sun Aug 17 11:03:28 2008<br /> rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:15200:200> <br /> DVA[1]=<0:c014600:200> DVA[2]=<0:1800a000:200> fletcher4 <br /> lzjb LE contiguous birth=1284 fill=27 <br /> cksum=87431704e:37f154f58d7:bbddb76e9703:1aaf346847004f<br /><br /></pre><br />Ok. So nothing changed when the file system was unmounted.<br />Now, we'll use the modified mdb to examine the uberblock_t array on disk.<br />The uberblock_t we want has transaction id 1282 decimal.<br /><pre><br /><br /># ./mdb /export/home/max/zfsfile <br /></pre><br />First, convert decimal 1282 to hex.<br /><pre><br />> 0t1282=X<br /> 502 <br /></pre><br />Now, load kernel CTF and a few dcmds that work with zfs on disk.<br /><pre><br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so <br /></pre><br />Walk the uberblock_t array on disk. This shows all possible 1024 uberblocks.<br />Here, we'll only show the entry with ub_txg = 0x502. 
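As an aside, the DVA[0]=&lt;0:11a00:200&gt; notation that zdb printed above and the raw dva_word pairs stored in a blkptr_t are related by simple shifts. A hypothetical decoder, assuming the usual on-disk DVA layout (24-bit asize and 32-bit vdev in word 0, offset in 512-byte sectors in word 1, with bit 63 of word 1 being the gang flag):

```python
# Decode raw dva_word pairs into the vdev:offset:asize form that zdb
# and ::blkptr display.  Assumes the usual DVA layout: word 0 holds
# asize (bits 0-23), grid (24-31) and vdev (32-63); word 1 holds the
# offset in 512-byte sectors, with bit 63 as the gang flag.

SECTOR_SHIFT = 9  # offsets and sizes are stored in 512-byte sectors

def decode_dva(word0, word1):
    vdev = word0 >> 32
    asize = (word0 & 0xFFFFFF) << SECTOR_SHIFT
    offset = (word1 & ((1 << 63) - 1)) << SECTOR_SHIFT
    return vdev, offset, asize

# The three dva_word pairs from the rootbp of the txg-0x502 uberblock:
dvas = [decode_dva(0x1, 0x8d),
        decode_dva(0x1, 0x60087),
        decode_dva(0x1, 0xc0047)]
```

These decode to 0:11a00:200, 0:c010e00:200, and 0:18008e00:200, matching the three DVAs in the zdb output.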
Again, if I had not<br />retrieved the value of the active uberblock_t after the file was created,<br />and before the file was removed, I could dump all uberblock_t using the<br />following command, and then search backwards, trying all transaction ids<br />that are less than the current (i.e., current after the file was removed<br />and the file system unmounted).<br /><pre><br />> ::walk uberblock | ::print -a -t zfs`uberblock_t<br />...<br />{<br /> 20800 uint64_t ub_magic = 0xbab10c<br /> 20808 uint64_t ub_version = 0xb<br /> 20810 uint64_t ub_txg = 0x502 <-- the correct transaction id<br /> 20818 uint64_t ub_guid_sum = 0x7b3058fd830ec1b6<br /> 20820 uint64_t ub_timestamp = 0x48a7e924<br /> 20828 blkptr_t ub_rootbp = { <-- blkptr is at 0x20828 on disk<br /> 20828 dva_t [3] blk_dva = [<br /> {<br /> 20828 uint64_t [2] dva_word = [ 0x1, 0x8d ]<br /> }<br /> {<br /> 20838 uint64_t [2] dva_word = [ 0x1, 0x60087 ]<br /> }<br /> {<br /> 20848 uint64_t [2] dva_word = [ 0x1, 0xc0047 ]<br /> }<br /> ]<br /> 20858 uint64_t blk_prop = 0x800b070300000001<br /> 20860 uint64_t [3] blk_pad = [ 0, 0, 0 ]<br /> 20878 uint64_t blk_birth = 0x502<br /> 20880 uint64_t blk_fill = 0x1b<br /> 20888 zio_cksum_t blk_cksum = {<br /> 20888 uint64_t [4] zc_word = [ 0x81f780ec5, 0x361b52dda06, <br />0xb6f3f410036f, 0x1a2b8b10bfdb5c ]<br /> }<br /> }<br />}<br />...<br /></pre><br />Let's dump the blkptr_t for this uberblock_t.<br /><pre><br />> 20828::blkptr<br />DVA[0]: vdev_id 0 / 11a00<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[0]: :0:11a00:200:d<br />DVA[1]: vdev_id 0 / c010e00<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[1]: :0:c010e00:200:d<br />DVA[2]: vdev_id 0 / 18008e00<br />DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[2]: :0:18008e00:200:d<br />LSIZE: 400 PSIZE: 200<br />ENDIAN: LITTLE TYPE: DMU objset<br />BIRTH: 502 LEVEL: 0 FILL: 1b00000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 
81f780ec5:361b52dda06:b6f3f410036f:1a2b8b10bfdb5c<br />$q<br />#<br /></pre><br />Now, using the modified zdb, let's dump the mos objset_phys_t.<br /><pre><br /><br /># ./zdb -R zfs_fs:0:11a00:200:d,lzjb,400 2> /tmp/mos<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />Back to mdb to examine the objset_phys_t for the meta object set (mos).<br /><pre><br /><br /># ./mdb /tmp/mos<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so <br />> 0::print -a -t zfs`objset_phys_t<br />{<br /> 0 dnode_phys_t os_meta_dnode = {<br /> 0 uint8_t dn_type = 0xa <-- this is DMU_OT_DNODE<br /> ...<br /> 40 blkptr_t [1] dn_blkptr = [<br /> {<br /> 40 dva_t [3] blk_dva = [<br /> {<br /> 40 uint64_t [2] dva_word = [ 0x5, 0x88 ]<br /> }<br /> {<br /> 50 uint64_t [2] dva_word = [ 0x5, 0x60082 ]<br /> }<br /> {<br /> 60 uint64_t [2] dva_word = [ 0x5, 0xc0042 ]<br /> }<br /> ]<br /> 70 uint64_t blk_prop = 0x800a07030004001f<br /> 78 uint64_t [3] blk_pad = [ 0, 0, 0 ]<br /> 90 uint64_t blk_birth = 0x502<br /> 98 uint64_t blk_fill = 0x1a<br /> a0 zio_cksum_t blk_cksum = {<br /> a0 uint64_t [4] zc_word = [ 0xa9af50f215, 0xec01e192b95e, <br />0xc523efad092ebc, 0x7a3a8be19416f454 ]<br /> }<br /> }<br /> ]<br /> ...<br /></pre><br />And dump the blkptr_t in the objset_phys_t.<br /><pre><br /><br />> 40::blkptr<br />DVA[0]: vdev_id 0 / 11000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: a0000000000<br />DVA[0]: :0:11000:a00:d<br />DVA[1]: vdev_id 0 / c010400<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: a0000000000<br />DVA[1]: :0:c010400:a00:d<br />DVA[2]: vdev_id 0 / 18008400<br />DVA[2]: GANG: FALSE GRID: 0000 ASIZE: a0000000000<br />DVA[2]: :0:18008400:a00:d<br />LSIZE: 4000 PSIZE: a00<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 0 FILL: 1a00000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: a9af50f215:ec01e192b95e:c523efad092ebc:7a3a8be19416f454<br />$q<br />#<br /></pre><br />Using zdb with the offset specified for the first ditto block<br />in 
the above blkptr output, we get the mos dnode array.<br />Note that the "LEVEL: 0" blkptr output means there are<br />no levels of indirection. On larger zfs file systems, you may<br />need to go through block(s) of indirect blkptr_t's. An example of this<br />is shown a bit later.<br /><pre><br /><br /># ./zdb -R zfs_fs:0:11000:a00:d,lzjb,4000 2> /tmp/metadnode<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />Now, we'll look at the metadnode for the DMU_OT_OBJECT_DIRECTORY. This<br />will tell us about objects in the zfs file system. For every zfs file<br />system that I have tried this on, this is dnode number 1 (starting from<br />0). Regardless, the field to check is "dn_type = 0x1". It is possible<br />(I assume) for this to be at a different index into the metadnode array,<br />and possibly not in the 0x4000 bytes read and decompressed from 0x11000.<br />In that case, the LEVEL field would not have been 0, and you would have to<br />look at indirect blkptr_t's. But not here...<br /><pre><br /><br /># ./mdb /tmp/metadnode <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0,20::print -a -t zfs`dnode_phys_t <-- dnode_phys_t is 0x200 bytes, so 0x20<br />{ <-- of these in 0x4000.<br /> 0 uint8_t dn_type = 0 <-- First entry is not used (DMU_OT_NONE)<br />...<br />}<br />{ <-- start of the second entry<br /> 200 uint8_t dn_type = 0x1 <-- DMU_OT_OBJECT_DIRECTORY (see dmu.h)<br /> ...<br /> 240 blkptr_t [1] dn_blkptr = [ <-- blkptr_t is 0x240 in /tmp/metadnode<br /> ... 
<-- lots of output omitted, we'll look at some of this later.<br />}<br /><br /></pre><br />Now we'll look at the blkptr_t for the Object Directory.<br /><pre><br />240::blkptr<br />DVA[0]: vdev_id 0 / 2400<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[0]: :0:2400:200:d<br />DVA[1]: vdev_id 0 / c002400<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[1]: :0:c002400:200:d<br />DVA[2]: vdev_id 0 / 18000000<br />DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[2]: :0:18000000:200:d<br />LSIZE: 200 PSIZE: 200<br />ENDIAN: LITTLE TYPE: object directory<br />BIRTH: 4 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher4 COMP: uncompressed<br />CKSUM: 5a40238b4:1cd8f9f7e19:522eab3e03f0:a9c92410b009e<br />$q<br /><br />#<br /></pre><br />Now, we'll read the (uncompressed) 0x200 bytes of the object directory using zdb. The "2400" is the (hex) offset from the blkptr_t above.<br /><pre><br /><br /># ./zdb -R zfs_fs:0:2400:200:r 2> /tmp/objdir<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />Back to mdb to look at the object directory. Object directories are "zap"<br />objects. Zap objects contain name/value pairs. The first 64 bits<br />identify the type of the zap (micro zap or fat zap). A "fat zap" is a zap<br />object that uses indirection. Micro zaps contain name/value pairs directly<br />(i.e., no indirection). I have not seen a fat zap (but the largest zfs<br />file system I have used is only ~140GB, and I have not examined large<br />directories). 
(Directory entries are stored in zap objects).<br /><pre><br /># ./mdb /tmp/objdir <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br /><br />> 0/J <-- look at the first 64 bits as hex<br />0: 8000000000000003 <-- a "signature" for a micro zap<br /><br />> 0::print -a -t zfs`mzap_phys_t <-- the beginning of the microzap is<br />{ <-- an mzap_phys_t<br /> 0 uint64_t mz_block_type = 0x8000000000000003<br /> 8 uint64_t mz_salt = 0x129c2c3<br /> 10 uint64_t mz_normflags = 0<br /> 18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]<br /> 40 mzap_ent_phys_t [1] mz_chunk = [ <-- there may be more than one<br /> { <-- mzap_ent_phys_t starting here<br /> 40 uint64_t mze_value = 0x2 <-- object id of "root_dataset"<br /> 48 uint32_t mze_cd = 0<br /> 4c uint16_t mze_pad = 0<br /> 4e char [50] mze_name = [ "root_dataset" ] <br /> }<br /> ]<br />}<br />$q<br />#<br /></pre><br />Now, we go back to the mos metadnode array in /tmp/metadnode, and<br />examine object id 2 (the third entry in the array).<br />Each entry is 0x200 bytes, so we want the dnode_phys_t starting<br />at (2*200) bytes into the file.<br /><pre><br /># ./mdb /tmp/metadnode <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 2*200::print -a -t zfs`dnode_phys_t <-- get object id 2<br />{<br /> 400 uint8_t dn_type = 0xc <-- DMU_OT_DSL_DIR (a dataset directory object)<br /> ...<br /> 404 uint8_t dn_bonustype = 0xc <-- bonus buffer contains a dsl_dir_phys_t<br /> ...<br /> 440 blkptr_t [1] dn_blkptr = [ <-- not used for this object<br /> {<br /> 440 dva_t [3] blk_dva = [<br /> {<br /> 440 uint64_t [2] dva_word = [ 0, 0 ]<br /> ...<br /> 4c0 uint8_t [320] dn_bonus = [ 0xe5, 0xa9, 0xa6, 0x48, 0, 0, 0, 0, 0x10,<br /> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, ... 
]<br />}<br /></pre><br />And dump the bonus buffer at 0x4c0.<br /><pre><br />> 4c0::print -a -t zfs`dsl_dir_phys_t<br />{<br /> 4c0 uint64_t dd_creation_time = 0x48a6a9e5<br /> 4c8 uint64_t dd_head_dataset_obj = 0x10 <-- object id for dataset head<br /> ...<br />}<br /></pre><br />Let's get object id 10 from the metadnode array.<br /><pre><br />> 10*200::print -a -t zfs`dnode_phys_t<br />{<br /> 2000 uint8_t dn_type = 0x10 <-- DMU_OT_DSL_DATASET<br /> ...<br /> 2004 uint8_t dn_bonustype = 0x10 <-- bonus buffer contains dsl_dataset_phys_t<br /> ...<br /> 2040 blkptr_t [1] dn_blkptr = [ <-- again, not used here<br /> {<br /> 2040 dva_t [3] blk_dva = [<br /> {<br /> 2040 uint64_t [2] dva_word = [ 0, 0 ]<br /> ...<br /> 20c0 uint8_t [320] dn_bonus = [ 0x2, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0<br />, 0, 0, 0x1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]<br />}<br /></pre><br />At 0x20c0 in the /tmp/metadnode file is a dsl_dataset_phys_t (the bonus<br />buffer).<br /><pre><br />> 20c0::print -a -t zfs`dsl_dataset_phys_t<br />{<br /> 20c0 uint64_t ds_dir_obj = 0x2<br /> ...<br /> 2140 blkptr_t ds_bp = {<br /> 2140 dva_t [3] blk_dva = [<br /> {<br /> 2140 uint64_t [2] dva_word = [ 0x1, 0x79 ]<br /> }<br /> {<br /> 2150 uint64_t [2] dva_word = [ 0x1, 0x60073 ]<br /> }<br /> {<br /> 2160 uint64_t [2] dva_word = [ 0, 0 ]<br /> }<br /> ]<br /> ...<br />}<br /></pre><br />Let's look at the blkptr_t in the dsl_dataset_phys_t.<br /><pre><br /><br />> 2140::blkptr<br />DVA[0]: vdev_id 0 / f200<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[0]: :0:f200:200:d<br />DVA[1]: vdev_id 0 / c00e600<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[1]: :0:c00e600:200:d<br />LSIZE: 400 PSIZE: 200<br />ENDIAN: LITTLE TYPE: DMU objset<br />BIRTH: 502 LEVEL: 0 FILL: 600000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 9cb4e346a:403aa7532bf:d688fac60e1e:1e67a933734ea5<br />$q<br />#<br /></pre><br />The blkptr_t for the dsl_dataset_phys_t is for 
another DMU objset.<br />(The first DMU objset was from the uberblock_t rootbp and describes<br />the meta object set. The objset described by the dsl_dataset_phys_t describes<br />the set of objects in the file system, i.e., files and directories (and...?).)<br />Back to zdb to get this data.<br /><pre><br /># ./zdb -R zfs_fs:0:f200:200:d,lzjb,400 2> /tmp/root_dataset_mos<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />And back to mdb to display the objset_phys_t for the root dataset.<br /><pre><br /># ./mdb /tmp/root_dataset_mos <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::print -a -t zfs`objset_phys_t<br />{<br /> 0 dnode_phys_t os_meta_dnode = {<br /> 0 uint8_t dn_type = 0xa <-- DMU_OT_DNODE again, this time for the root dataset<br /> ...<br /> 40 blkptr_t [1] dn_blkptr = [ <-- blkptr_t is 0x40 bytes into the file<br /> ...<br /></pre><br />And dump the blkptr_t...<br /><pre><br />> 40::blkptr<br />DVA[0]: vdev_id 0 / 10800<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:10800:400:id<br />DVA[1]: vdev_id 0 / c00fc00<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00fc00:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 6 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 58461f1c5e:3c7272ace4a1:15e2cf555fd2ac:58b9cdc6bcd0b54<br />$q<br />#<br /></pre><br />Note the "LEVEL: 6" in the above output. There are 6 levels of indirection<br />to get to another array of dnode_phys_t. We will follow the levels, always<br />using the first indirect blkptr_t at each level since the file was in <br />a directory whose object id is 3 (from "ls -aid /zfs_fs" back at the<br />beginning). 
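The index arithmetic behind "always take the first blkptr_t" can be written down explicitly. This is my own sketch, using the sizes from this walkthrough (0x20 dnode_phys_t per 0x4000-byte level-0 block, 0x80 blkptr_t per 0x4000-byte indirect block):

```python
# Which blkptr_t index to follow at each indirection level to reach the
# level-0 block holding a given dnode object id.  Sizes as in the text:
# 0x4000-byte blocks, 0x200-byte dnode_phys_t, 0x80-byte blkptr_t.

DNODES_PER_BLOCK = 0x4000 // 0x200     # 0x20 dnodes per level-0 block
BLKPTRS_PER_INDIRECT = 0x4000 // 0x80  # 0x80 blkptrs per indirect block

def indirect_path(objid, levels):
    """Blkptr index at each level, from the top level down to level 1."""
    block = objid // DNODES_PER_BLOCK  # which level-0 block holds the dnode
    return [(block // BLKPTRS_PER_INDIRECT ** (lvl - 1)) % BLKPTRS_PER_INDIRECT
            for lvl in range(levels, 0, -1)]
```

For object id 3, the path through all six levels is index 0 at every step, which is why the walk just keeps taking blkptr 0; an object id of 0x25 or higher would already need a nonzero index at level 1.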
If I want the dnode_phys_t for a different object id, I<br />can use the technique explained in the paper and slides referenced<br />at the beginning.<br /><pre><br /># ./zdb -R zfs_fs:0:10800:400:d,lzjb,4000 2> /tmp/dnode_l6<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />Each indirect blkptr_t array contains 0x80 blkptr_t structures. (The size<br />of a blkptr_t is 0x80 bytes. 0x4000 (i.e., the size of the decompressed<br />data) divided by 0x80 = 0x80). We'll use mdb to examine blkptr 0 in the<br />array.<br /><pre><br /># ./mdb /tmp/dnode_l6 <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / 10400<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:10400:400:id<br />DVA[1]: vdev_id 0 / c00f800<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00f800:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 5 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 593bf7cd50:3d5bdfbff40e:1652191251855c:5af2260f72aa12a<br />$q<br />#<br /></pre><br />Great. Let's get the indirect array for level 5.<br /><pre><br /># ./zdb -R zfs_fs:0:10400:400:d,lzjb,4000 2> /tmp/dnode_l5<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />And back to mdb to display it...<br /><pre><br /># ./mdb /tmp/dnode_l5<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / 10000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:10000:400:id<br />DVA[1]: vdev_id 0 / c00f400<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00f400:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 4 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 5a4787d7ae:3e5a99501b03:16cbd983ac6802:5d616f6513864cb<br />$q<br />#<br /></pre><br />And now to level 4. 
Note that BIRTH value corresponds to the<br />transaction id we want... (0x502 = 0t1282).<br /><pre><br /># ./zdb -R zfs_fs:0:10000:400:d,lzjb,4000 2> /tmp/dnode_l4<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />Back to mdb...<br /><pre><br /># ./mdb /tmp/dnode_l4<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / fc00<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:fc00:400:id<br />DVA[1]: vdev_id 0 / c00f000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00f000:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 3 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 580321bd90:3cf0cb3827a9:1647f21a4fee83:5b2042e25b8771b<br />$q<br />#<br /></pre><br />Now to level 3.<br /><pre><br /># ./zdb -R zfs_fs:0:fc00:400:d,lzjb,4000 2> /tmp/dnode_l3<br />Found vdev: /export/home/max/zfsfile<br />#<br /># ./mdb /tmp/dnode_l3<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / f800<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:f800:400:id<br />DVA[1]: vdev_id 0 / c00ec00<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00ec00:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 2 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 58e75640d6:3d0bc696c0c2:162c02c075c9ab:5a30099a876cabe<br />$q<br />#<br /></pre><br />And level 2...<br /><pre><br /># ./zdb -R zfs_fs:0:f800:400:d,lzjb,4000 2> /tmp/dnode_l2<br />Found vdev: /export/home/max/zfsfile<br />#<br /># ./mdb /tmp/dnode_l2<br />mdb: no terminal data available for TERM=emacs<br />mdb: term init failed: command-line editing and prompt will not be available<br />::loadctf<br />::load /export/home/max/source/mdb/i386/rawzfs.so<br />0::blkptr<br />DVA[0]: vdev_id 0 / f400<br />DVA[0]: 
GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:f400:400:id<br />DVA[1]: vdev_id 0 / c00e800<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00e800:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 1 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 5763205f2d:3c57f7df68a9:15fea170721af5:59a7686491c24a7<br />$q<br />#<br /></pre><br />And level 1.<br /><pre><br /># ./zdb -R zfs_fs:0:f400:400:d,lzjb,4000 2> /tmp/dnode_l1<br />Found vdev: /export/home/max/zfsfile<br />#<br /># ./mdb /tmp/dnode_l1<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / ec00<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 60000000000<br />DVA[0]: :0:ec00:600:d<br />DVA[1]: vdev_id 0 / c00e000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 60000000000<br />DVA[1]: :0:c00e000:600:d<br />LSIZE: 4000 PSIZE: 600<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 0 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 6fb5c84271:61d7d7ffe6a4:2f9cbc90dcaa4c:10f07885852e2558<br />$q<br />#<br /></pre><br />Level 0 will contain the beginning of the array of dnode_phys_t<br />for files and directories within the file system.<br />We'll again use zdb to retrieve the block containing the first<br />0x20 entries. 
(Again, decompressed size is 0x4000, dnode_phys_t size<br />is 0x200, so there are 0x20 entries in the first level 0 block).<br /><pre><br /># ./zdb -R zfs_fs:0:ec00:600:d,lzjb,4000 2> /tmp/dnode_l0<br />Found vdev: /export/home/max/zfsfile<br />#<br /><br /># ./mdb /tmp/dnode_l0<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0,20::print -t -a zfs`dnode_phys_t<br />{<br /> 0 uint8_t dn_type = 0 <-- first entry is not used<br />...<br />}<br />{ <-- second entry (object id 1)<br /> 200 uint8_t dn_type = 0x15 <-- DMU_OT_MASTER_NODE<br /> ...<br /> 240 blkptr_t [1] dn_blkptr = [<br /> {<br /> ...<br />}<br />{ <-- third entry (object id 2)<br /> 400 uint8_t dn_type = 0x16<br /> ...<br />{ <-- fourth entry (object id 3, should be root directory for the fs)<br /> 600 uint8_t dn_type = 0x14 <-- DMU_OT_DIRECTORY_CONTENTS<br /> ...<br /> 604 uint8_t dn_bonustype = 0x11 <-- bonus buffer contains znode_phys_t<br /> ...<br /> 640 blkptr_t [1] dn_blkptr = [ <-- this blkptr_t is a zap for directory entries<br /> {<br /> 640 dva_t [3] blk_dva = [<br /> {<br /> 640 uint64_t [2] dva_word = [ 0x1, 0x73 ]<br /> }<br /> ...<br /> 6c0 uint8_t [320] dn_bonus = [ 0x1e, 0xe9, 0xa7, 0x48, 0, 0, 0, 0,<br /> 0xc3, 0x61, 0x34, 0xf, 0, 0, 0, 0, 0x1f, 0xe9, 0xa7, 0x48, 0,<br /> 0, 0, 0, 0x1, 0x43, 0x79, 0x3a, 0, 0, 0, 0, ... ]<br /> ... <-- lots omitted<br />}<br /></pre><br />At this point, we could go to the fourth entry in the above output<br />(object id 3 at 0x600 bytes into the file) and look at the directory<br />contents to see if the removed file is there. (Remember, ls -aid on<br />the directory containing the removed file shows inumber 3).<br />However, we'll be safe and examine the master node to get to <br />the root directory of the file system. The master node<br />is object id 1 (at 0x200 in the above output). 
The block pointer<br />for this dnode_phys_t is for a zap object.<br />We'll use mdb to dump the master node blkptr_t.<br /><pre><br />> 240::blkptr<br />DVA[0]: vdev_id 0 / 0<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[0]: :0:0:200:d<br />DVA[1]: vdev_id 0 / c000000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[1]: :0:c000000:200:d<br />LSIZE: 200 PSIZE: 200<br />ENDIAN: LITTLE TYPE: ZFS master node<br />BIRTH: 4 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher4 COMP: uncompressed<br />CKSUM: 233dfc135:e10dd7aa27:2e8c1eba771e:6a380d575d3d6<br />$q<br />#<br /></pre><br />And now back to zdb to get the zfs master node zap object. Note<br />that this is not compressed, and is at the beginning of the disk<br />(following label 0 and label 1).<br /><pre><br /># ./zdb -R zfs_fs:0:0:200:r 2> /tmp/master_node<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />Back to mdb to examine the master node zap.<br /><pre><br /># ./mdb /tmp/master_node <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0/J <-- let's see what kind of zap it is<br />0: 8000000000000003 <-- micro zap<br />> 0::print -a -t zfs`mzap_phys_t<br />{<br /> 0 uint64_t mz_block_type = 0x8000000000000003<br /> 8 uint64_t mz_salt = 0x3d3b<br /> 10 uint64_t mz_normflags = 0<br /> 18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]<br /> 40 mzap_ent_phys_t [1] mz_chunk = [<br /> {<br /> 40 uint64_t mze_value = 0x3<br /> 48 uint32_t mze_cd = 0<br /> 4c uint16_t mze_pad = 0<br /> 4e char [50] mze_name = [ "VERSION" ]<br /> }<br /> ]<br />}<br /></pre><br />The mzap_phys_t is 0x80 bytes large. Following this are zero or more<br />mzap_ent_phys_t. Each mzap_ent_phys_t is 0x40 bytes. 
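A quick sanity check on those sizes (plain Python, mirroring the layout just described; the first chunk is the mz_chunk[0] embedded at 0x40 inside the 0x80-byte mzap_phys_t):

```python
# Chunk offsets inside a 0x200-byte microzap block: a 0x40-byte header,
# then 0x40-byte mzap_ent_phys_t chunks.  The first chunk (mz_chunk[0])
# is the one embedded at offset 0x40 inside the 0x80-byte mzap_phys_t.

BLOCK_SIZE = 0x200
HEADER_SIZE = 0x40
CHUNK_SIZE = 0x40

chunk_offsets = list(range(HEADER_SIZE, BLOCK_SIZE, CHUNK_SIZE))

# Count of the chunks that follow the mzap_phys_t: (0x200 - 0x80) / 0x40
chunks_after_header_struct = (BLOCK_SIZE - 0x80) // CHUNK_SIZE
```

That puts chunks at 0x40, 0x80, 0xc0, ..., 0x1c0, and six chunks after the mzap_phys_t, matching the count in the mdb expression `80,((200-80)%40)` (mdb's `%` is division).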
The following<br />will dump all mzap_ent_phys_t following the mzap_phys_t in the block.<br /><pre><br />> 80,((200-80)%40)::print -a -t zfs`mzap_ent_phys_t<br />...<br />{<br /> c0 uint64_t mze_value = 0x3 <-- the object id for the root of the fs<br /> c8 uint32_t mze_cd = 0<br /> cc uint16_t mze_pad = 0<br /> ce char [50] mze_name = [ "ROOT" ] <-- this is root<br />}<br />$q<br />#<br /></pre><br />Now, back to the level 0 dnode_phys_t array to look at the root directory<br />dnode_phys_t.<br /><pre><br /># ./mdb /tmp/dnode_l0<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 3*200::print -a -t zfs`dnode_phys_t<br />{<br /> 600 uint8_t dn_type = 0x14 <-- DMU_OT_DIRECTORY_CONTENTS<br /> ...<br /> 604 uint8_t dn_bonustype = 0x11 <-- bonus buffer contains znode_phys_t<br /> ...<br /> 640 blkptr_t [1] dn_blkptr = [<br /> {<br /> 640 dva_t [3] blk_dva = [<br /> {<br /> 640 uint64_t [2] dva_word = [ 0x1, 0x73 ]<br /> }<br /> ...<br />}<br /></pre><br />The blkptr_t is for a zap object containing filename/object id<br />values for files in the root directory of the file system.<br /><pre><br />> 640::blkptr<br />DVA[0]: vdev_id 0 / e600<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[0]: :0:e600:200:d<br />DVA[1]: vdev_id 0 / c00da00<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[1]: :0:c00da00:200:d<br />LSIZE: 200 PSIZE: 200<br />ENDIAN: LITTLE TYPE: ZFS directory<br />BIRTH: 502 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher4 COMP: uncompressed<br />CKSUM: 25f50a2fc:fe963fd84e:36937666328d:7f9475424708c<br />$q<br />#<br /></pre><br />Now read the root directory zap object.<br /><pre><br /># ./zdb -R zfs_fs:0:e600:200:r 2> /tmp/rootdir<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />And use mdb to look at the zap entries.<br /><pre><br /># ./mdb /tmp/rootdir <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0/J<br />0: 8000000000000003 <-- a micro zap<br 
/><br /><br />> 0::print -a -t zfs`mzap_phys_t<br />{<br /> 0 uint64_t mz_block_type = 0x8000000000000003<br /> 8 uint64_t mz_salt = 0x3e0f<br /> 10 uint64_t mz_normflags = 0<br /> 18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]<br /> 40 mzap_ent_phys_t [1] mz_chunk = [<br /> {<br /> 40 uint64_t mze_value = 0x8000000000000004<br /> 48 uint32_t mze_cd = 0<br /> 4c uint16_t mze_pad = 0<br /> 4e char [50] mze_name = [ "foo" ]<br /> }<br /> ]<br />}<br /></pre><br />And dump the rest of the zap entries.<br /><pre><br />> 80,((200-80)%40)::print -a -t zfs`mzap_ent_phys_t<br />{<br /> 80 uint64_t mze_value = 0x8000000000000005<br /> 88 uint32_t mze_cd = 0<br /> 8c uint16_t mze_pad = 0<br /> 8e char [50] mze_name = [ "words" ] <-- here is the removed file<br />}<br />...<br />5*200=X <-- we want dnode_phys_t object id 5.<br /> a00 <-- Offset within /tmp/dnode_l0 where the object resides<br />$q<br />#<br /></pre><br />We'll go back and get the dnode for object id 5.<br /><pre><br /># ./mdb /tmp/dnode_l0<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> a00::print -a -t zfs`dnode_phys_t<br />{<br /> a00 uint8_t dn_type = 0x13 <-- DMU_OT_PLAIN_FILE_CONTENTS<br /> ...<br /> a04 uint8_t dn_bonustype = 0x11 <-- znode_phys_t for "words" file<br /> ...<br /> a40 blkptr_t [1] dn_blkptr = [ <-- blkptr_t for data or indirect blocks<br /> {<br /> ...<br /> ac0 uint8_t [320] dn_bonus = [ 0x1f, 0xe9, 0xa7, 0x48, 0, 0, 0, 0, 0xcb, <br />0x96, 0x78, 0x3a, 0, 0, 0, 0, 0x1f, 0xe9, 0xa7, 0x48, 0, 0, 0, 0, 0xd1, 0xb1, <br />0x83, 0x3a, 0, 0, 0, 0, ... 
]<br />}<br /></pre><br />Now, let's take a quick look at the znode_phys_t for this file.<br />It is in the bonus buffer at 0xac0.<br /><pre><br />> ac0::print -a -t zfs`znode_phys_t<br />{<br /> ac0 uint64_t [2] zp_atime = [ 0x48a7e91f, 0x3a7896cb ]<br /> ad0 uint64_t [2] zp_mtime = [ 0x48a7e91f, 0x3a83b1d1 ]<br /> ae0 uint64_t [2] zp_ctime = [ 0x48a7e91f, 0x3a83b1d1 ]<br /> af0 uint64_t [2] zp_crtime = [ 0x48a7e91f, 0x3a7896cb ]<br /> b00 uint64_t zp_gen = 0x502<br /> b08 uint64_t zp_mode = 0x8124<br /> b10 uint64_t zp_size = 0x32752 <-- should be same size as /usr/dict/words<br /> b18 uint64_t zp_parent = 0x3<br /> b20 uint64_t zp_links = 0x1<br /> ...<br />}<br /><br />> 32752=D<br /> 206674 <br />> !ls -l /usr/dict/words<br />-r--r--r-- 1 root bin 206674 Jul 11 02:57 /usr/dict/words<br /></pre><br />Looks good. Let's look at the blkptr_t for this dnode_phys_t.<br /><pre><br />> a40::blkptr<br />DVA[0]: vdev_id 0 / e800<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:e800:400:id<br />DVA[1]: vdev_id 0 / c00dc00<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00dc00:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: ZFS plain file<br />BIRTH: 502 LEVEL: 1 FILL: 200000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 5e9a82c0c2:3ff97cbecacc:1714169599f4c8:5dd02ff967dd42c<br />$q<br />#<br /></pre><br />Note the "LEVEL: 1". 
This means there is one level of indirect blocks.<br />We'll use zdb to retrieve the indirect block.<br /><pre><br /># ./zdb -R zfs_fs:0:e800:400:d,lzjb,4000 2> /tmp/iblock<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />And use mdb to look at the indirect block.<br /><pre><br /># ./mdb /tmp/iblock <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / 20000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2000000000000<br />DVA[0]: :0:20000:20000:d<br />LSIZE: 20000 PSIZE: 20000<br />ENDIAN: LITTLE TYPE: ZFS plain file<br />BIRTH: 502 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher2 COMP: uncompressed<br />CKSUM: f5cbf93a151abcac:5b5d6ca83588d8ad:574d9b8bf334944b:ad78d30af51771d8<br /></pre><br />The blkptr_t above is for the first 0x20000 (128 KB) of the file.<br />The next blkptr_t in the indirect block should contain the remainder<br />of the file. (The file is smaller than 256 KB.)<br /><pre><br />> 80::blkptr<br />DVA[0]: vdev_id 0 / 40000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2000000000000<br />DVA[0]: :0:40000:20000:d<br />LSIZE: 20000 PSIZE: 20000<br />ENDIAN: LITTLE TYPE: ZFS plain file<br />BIRTH: 502 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher2 COMP: uncompressed<br />CKSUM: f39ae34f048ae079:de2ef1af7d1fb495:ec3ae3f7985b2a98:c6d33ac68cb042b6<br />$q<br />#<br /></pre><br />Now we'll use zdb to retrieve the data blocks.<br /><pre><br /># ./zdb -R zfs_fs:0:20000:20000:r 2> /tmp/data <-- first data block<br />Found vdev: /export/home/max/zfsfile<br /># ./zdb -R zfs_fs:0:40000:20000:r 2> /tmp/data1 <-- second data block<br />Found vdev: /export/home/max/zfsfile<br />#<br /><br /># cat /tmp/data /tmp/data1 > /tmp/foo <-- concatenate them<br /></pre><br />The size of the file, according to the znode_phys_t, is 206674 bytes.<br />We'll lop off the remaining bytes.<br /><pre><br /># dd if=/tmp/foo bs=206674 count=1 of=/tmp/finalwords<br />1+0 records in<br />1+0 records out<br /></pre><br />Now, 
let's see if we have the correct data.<br /><pre><br /># diff /tmp/finalwords /usr/dict/words<br /># <-- no differences<br /></pre>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-7245518.post-50437315056481615362008-02-06T01:53:00.000-08:002008-02-06T08:03:36.624-08:00new opensolaris course materialI recently wrote the material and taught the second day of a three-day course<br />on OpenSolaris for a group of university professors in Bangalore, India. The<br />first day was largely a "getting started on OpenSolaris" session, along with<br />an introduction to dtrace, and the third day was mostly administration,<br />specifically zones and zfs. The second day was an introduction to<br />OpenSolaris Internals. The topics I covered included processes and threads,<br />synchronization, memory management, and file systems. Those who have taken<br />the five-day Solaris Internals course with me will find that several of the<br />diagrams and hands-on exercises I use during that course are now written<br />up in this material. Note that the material for the second day uses only two<br />pages that come from pre-existing sources. So this is new material.<br />The material contains both overhead slides and accompanying text.<br /><br />I did something similar for professors in China almost two years ago, except<br />that session was a five-day internals session. The Indian professors looked at<br />materials prepared by the Chinese professors, and decided that those materials<br />went too deep to use as a starting point for a course<br />on operating systems. Having looked at the Chinese professors' materials, I<br />am inclined to agree with the Indian professors. The Chinese material is<br />an excellent guide through the source code and through McDougall and Mauro's<br />Solaris Internals book. As a study guide for people trying to learn about<br />the way OpenSolaris works, it is quite good and complete. 
As a tool for<br />teaching, especially for classes without prior operating-system experience, I feel<br />it misses the mark. While a professor who has good operating-system knowledge<br />can use the Chinese material to learn OpenSolaris internals on his or her<br />own, I think the materials may assume too much prior knowledge on the part of the<br />students. The new material tries to give professors a starting point that can<br />be used to teach OpenSolaris, not just to learn it.<br /><br />For those of you who already have an operating systems background (though not<br />in OpenSolaris), I think you'll find this new material quite useful. The<br />new material is not elementary; there is plenty of useful information, even<br />for people with extensive OS and even extensive Solaris kernel experience.<br />It assumes, for instance, that you understand why one needs locks,<br />or why virtual memory is useful, among other things. The material explains the<br />implementation of various concepts and mechanisms using a combination of tools,<br />mostly mdb and dtrace, and various diagrams. In a one-day session, many topics<br />go uncovered, and some are covered only superficially, but most<br />of the major OS topics are covered in a good amount of detail.<br /><br />The new material is at: http://www.opensolaris.org/os/community/edu/curriculum_development.<br />Look at the OpenSolaris Curriculum "Plugins Preparation" section for overheads<br />and course guides.Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-7245518.post-75142023030502224642007-09-16T02:00:00.000-07:002007-09-23T02:35:22.659-07:00Using kernel ctf with raw diskThe following shows a use of a modified version of mdb<br />which allows one to use the CTF information from the kernel<br />to examine data structures on disk. The disk used below contains<br />a ufs file system. 
The same techniques can be used to examine a<br />zfs file system on disk, which is why I did this in the first place.<br />Once I have "mapped out" the on-disk format of zfs using this modified<br />version of mdb, I'll write about it.<br /><br />In the meantime, I'll probably add a few dcmds and walkers for ufs that<br />use the kernel CTF information.<br /><br />In the following, annotation starts with "<--" except for a<br />few places where I have embedded code from header files.<br />Also, the output has been truncated in a few places.<br />I am assuming that you have some knowledge of mdb (for instance,<br />"value1 % value2 = X" returns the (hex) value of value1 divided by value2).<br />If you need more mdb, read the Modular Debugger Guide on docs.sun.com,<br />or, even better, take a course.<br /><br />This example will start with the superblock, and from there examine the root<br />inode and then the root directory. From there, the example gets the /var inode<br />and then the /var directory. Next, we go to the /var/sadm directory<br />and look at the contents of the /var/sadm/README file. 
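The one address computation this walk repeats for every inode (root, /var, /var/sadm, README) is the inode locator. Here is a sketch of it in Python, using the constants this particular filesystem reports later in the session (fs_fpg=0xc000, fs_cgoffset=0x40, fs_iblkno=0x20, fs_ipg=0x16c0, fragment size 0x400, on-disk inode size 0x80); it follows the simplified form used in the annotations below, which ignores the fs_cgmask wrap-around in the real cgstart() macro (valid here since only the first few cylinder groups are touched):

```python
# Constants from ::print struct fs for this filesystem; not universal.
FS_FPG = 0xc000      # fragments per cylinder group
FS_CGOFFSET = 0x40   # cylinder group offset (in fragments)
FS_IBLKNO = 0x20     # first inode block within a group (in fragments)
FS_FSIZE = 0x400     # fragment size in bytes
FS_IPG = 0x16c0      # inodes per cylinder group
INODE_SIZE = 0x80    # sizeof (struct icommon)

def inode_offset(inum):
    """Byte offset of on-disk inode `inum` in the raw device."""
    cg = inum // FS_IPG            # which cylinder group (the itog macro)
    idx = inum - cg * FS_IPG       # index within that group (inum mod fs_ipg)
    cg_base = (FS_FPG * cg + FS_CGOFFSET * cg + FS_IBLKNO) * FS_FSIZE
    return cg_base + idx * INODE_SIZE

print(hex(inode_offset(2)))        # 0x8100: the root inode, as used below
print(hex(inode_offset(0x16c0)))   # 0x3018000: the /var inode
```

The two printed offsets match the hand-computed mdb expressions (20*400)+(2*80) and (c000*400*1)+(40*400)+(20*400)+(0*80) in the session that follows.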
All of this is<br />done by examining the relevant data structures on disk.<br /><br /># ./mdb /dev/rdsk/c0d0s0 <-- this is the root fs<br />mdb: no terminal data available for TERM=emacs<br />mdb: term init failed: command-line editing and prompt will not be available<br />mdb: no module 'mdb_ks' could be found <-- kernel support module not loaded(?)<br />mdb: failed to load kernel support module -- some modules may not load<br />::print struct anon <-- try ::print, normally this does not work with raw disk<br />{<br /> an_vp <-- it works!<br /> an_pvp <br /> an_off <br /> an_poff <br /> an_hash <br /> an_refcnt <br />}<br />2000::print struct fs <-- superblock should be 8192 bytes into fs (see sys/fs/ufs_fs.h)<br />{<br /> fs_link = 0 <-- see fs_magic below for sanity check<br /> fs_rolled = 0x2<br /> fs_sblkno = 0x10<br /> fs_cblkno = 0x18<br /> fs_iblkno = 0x20<br /> fs_dblkno = 0x2f8<br /> fs_cgoffset = 0x40<br /> fs_cgmask = 0xffffffc0<br /> fs_time = 0x46eb90d4<br /> fs_size = 0x32e3519<br /> fs_dsize = 0x321e0c8<br /> fs_ncg = 0x43e<br /> fs_bsize = 0x2000<br /> fs_fsize = 0x400<br /><-- output omitted<br /> fs_fsmnt = [ "/" ]<br /><-- output omitted<br /> fs_magic = 0x11954 <-- check against FS_MAGIC in sys/fs/ufs_fs.h (correct)<br /> fs_space = [ 0x8 ]<br />}<br /><br />::status <-- what does mdb say I'm debugging<br />debugging file '/dev/rdsk/c0d0s0' (object file)<br /><br /><-- The following only prints the fields I am interested in:<br />2000::print struct fs fs_sblkno fs_cblkno fs_iblkno fs_cgoffset fs_magic fs_ipg<br />fs_sblkno = 0x10 <-- location of the superblock in the cylinder group<br />fs_cblkno = 0x18 <-- location of the cylinder group block (struct cg)<br />fs_iblkno = 0x20 <-- location of start of inodes (in cylinder group)<br />fs_cgoffset = 0x40 <-- offset of cylinder group<br />fs_magic = 0x11954 <-- magic number <br />fs_ipg = 0x16c0 <-- inodes per cylinder group<br /><br /><-- immediately following superblock is back up<br 
/>4000::print struct fs fs_sblkno fs_cblkno fs_iblkno fs_cgoffset fs_magic<br />fs_sblkno = 0x10<br />fs_cblkno = 0x18<br />fs_iblkno = 0x20<br />fs_cgoffset = 0x40<br />fs_magic = 0x11954<br /><br />6000::print struct cg <-- next block should be first cylinder group block<br />{<br /> cg_link = 0<br /> cg_magic = 0x90255 <-- magic number is good<br /> cg_time = 0x46e78f3e<br /><-- output omitted<br />}<br /><br />::sizeof struct icommon <-- how big is the disk inode<br />sizeof (struct icommon) = 0x80<br /><br /><-- the following is the root inode. Root for ufs is inumber 2. The fs_iblkno<br /><-- value (0x20) is multiplied by the fragment size to get the start of the<br /><-- inodes (in the first cylinder group); each inode is 0x80 bytes large. The<br /><-- second (i.e., root inode) is then at disk location (20*400)+(2*80)<br />(20*400)+(2*80)::print -a struct icommon <br />{<br /> 8100 ic_smode = 0x41ed<br /> 8102 ic_nlink = 0x30<br /> 8104 ic_suid = 0<br /> 8106 ic_sgid = 0<br /> 8108 ic_lsize = 0x600<br /> 8110 ic_atime = {<br /> 8110 tv_sec = 0x46eba57f<br /> 8114 tv_usec = 0x5a69b<br /> }<br /><-- output omitted<br /> 8128 ic_db = [ 0x2ff410, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ] <-- one data block for / directory<br /> 8158 ic_ib = [ 0, 0, 0 ]<br /><-- output omitted<br />}<br /><br />8110\Y <-- this is disk address of atime stamp on root inode<br />0x8110: 2007 Sep 15 11:27:27 <-- looks good<br /><br />!date <-- current time<br />Sat Sep 15 11:36:07 CEST 2007<br /><br /><-- now the block number from ic_db[0] is used to dump the contents of the<br /><-- root directory<br />2ff410*400::print struct direct<br />{<br /> d_ino = 0x2<br /> d_reclen = 0xc<br /> d_namlen = 0x1<br /> d_name = [ "." ]<br />}<br />(2ff410*400)+c::print struct direct <-- second entry (first+d_reclen)<br />{<br /> d_ino = 0x2<br /> d_reclen = 0xc<br /> d_namlen = 0x2<br /> d_name = [ ".." 
]<br />}<br />(2ff410*400)+c+c::print struct direct <-- third entry (first + d_reclen of first and second)<br />{<br /> d_ino = 0x3<br /> d_reclen = 0x14<br /> d_namlen = 0xa<br /> d_name = [ "lost+found" ]<br />}<br />(2ff410*400)+c+c+14::print struct direct <-- fourth entry (easily made into walker?)<br />{<br /> d_ino = 0x16c0<br /> d_reclen = 0xc<br /> d_namlen = 0x3<br /> d_name = [ "var" ]<br />}<br /><-- check "/var" inumber<br />!ls -id /var <-- get inumber<br /> 5824 /var<br />16c0=D <-- d_ino from direct entry<br /> 5824 <-- match<br /><br /><-- the following is a back up superblock in the second cylinder group.<br /><-- The relevant macros for this are in sys/fs/ufs_fs.h and are shown here:<br />/*<br /> * Cylinder group macros to locate things in cylinder groups.<br /> * They calc file system addresses of cylinder group data structures.<br /> */<br />#define cgbase(fs, c) ((daddr32_t)((fs)->fs_fpg * (c)))<br /><br />#define cgstart(fs, c) \<br /> (cgbase(fs, c) + (fs)->fs_cgoffset * ((c) & ~((fs)->fs_cgmask)))<br /><br />#define cgsblock(fs, c) (cgstart(fs, c) + (fs)->fs_sblkno) /* super blk */<br /><br />#define cgtod(fs, c) (cgstart(fs, c) + (fs)->fs_cblkno) /* cg block */<br /><br />#define cgimin(fs, c) (cgstart(fs, c) + (fs)->fs_iblkno) /* inode blk */<br /><br />#define cgdmin(fs, c) (cgstart(fs, c) + (fs)->fs_dblkno) /* 1st data */<br /><br />/*<br /> * Macros for handling inode numbers:<br /> * inode number to file system block offset.<br /> * inode number to cylinder group number.<br /> * inode number to file system block address.<br /> */<br />#define itoo(fs, x) ((x) % (uint32_t)INOPB(fs))<br /><br />#define itog(fs, x) ((x) / (uint32_t)(fs)->fs_ipg)<br /><-- So. Here the fs_fpg field from the superblock (= 0xc000) is used to<br /><-- get the fragments per group. This is multiplied times the fragment size (0x400)<br /><-- Then the fs_cgoffset field (cylinder group offset) is added (40*400), then the<br /><-- fs_sblkno offset (10*400). 
The resulting address is the location on the<br /><-- disk of the backup superblock in the second cylinder group. To see the third,<br /><-- use (c000*2*400)+(40*400)+(10*400)::print struct fs<br /><-- To see the fourth, (c000*3*400)+(40*400)+(10*400)::print struct fs, etc.<br /><br />(c000*400)+(40*400)+(10*400)::print struct fs fs_sblkno fs_cblkno fs_iblkno fs_cgoffset fs_magic<br />fs_sblkno = 0x10<br />fs_cblkno = 0x18<br />fs_iblkno = 0x20<br />fs_cgoffset = 0x40<br />fs_magic = 0x11954<br /><br /><-- Now, let's take a look at the inode for the "/var" directory.<br /><-- Above, the direct structure for /var says the inumber is 0x16c0.<br /><-- There are 16c0 inodes per cylinder group (the fs_ipg field in the<br /><-- superblock), so this inode should be the first inode in the <br /><-- second cylinder group. (c000*400) is the base of the second cylinder<br /><-- group. (40*400) is the starting offset. (20*400) is the starting<br /><-- inode offset. Given an inumber, the formula for finding the inode<br /><-- on disk is:<br /><-- (inumber % fs_ipg)=X This returns an index indicating which cylinder<br /><-- group the inode is in. 
This is "cg_index" in the next calculation.<br /><-- (inumber - (cg_index * fs_ipg))=X This returns the index<br /><-- (offset) within the cylinder group ("cg_offset") (Actually, inumber mod fs_ipg).<br /><-- Then: ((fs_fpg*400)*cg_index)+((fs_cgoffset*400)*cg_index)+(fs_iblkno*400)+(cg_offset*80).<br /><-- Here, 400 is the fragment size (from fs_fsize) and 80 is the size of the<br /><-- disk inode.<br /><br />(16c0%16c0)=X<br /> 1 <-- the second cylinder group<br />16c0-(1*16c0)=X<br /> 0 <-- the first inode in the group <br /><br /><-- this is the inode for /var<br />(c000*400*1)+(40*400)+(20*400)+(0*80)::print struct icommon<br />{<br /> ic_smode = 0x41ed<br /> ic_nlink = 0x2c<br /> ic_suid = 0<br /> ic_sgid = 0x3<br /> ic_lsize = 0x400<br /> ic_atime = {<br /> tv_sec = 0x46eb30e8<br /> tv_usec = 0x992bc<br /> }<br /><-- output omitted<br /> ic_db = [ 0xc348, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]<br /> ic_ib = [ 0, 0, 0 ]<br /><-- output omitted<br />}<br /><br /><-- So /var has one block (ic_db[0] = 0xc348). This should be a directory. Now dump out some<br /><-- entries. The following is a good candidate for a walker...<br />(c348*400)::print struct direct<br />{<br /> d_ino = 0x16c0<br /> d_reclen = 0xc<br /> d_namlen = 0x1<br /> d_name = [ "." ]<br />}<br />(c348*400)+c::print struct direct<br />{<br /> d_ino = 0x2<br /> d_reclen = 0xc<br /> d_namlen = 0x2<br /> d_name = [ ".." ]<br />}<br />(c348*400)+c+c::print struct direct<br />{<br /> d_ino = 0x16c1<br /> d_reclen = 0x10<br /> d_namlen = 0x4<br /> d_name = [ "sadm" ]<br />}<br /><br /><-- let's check the work...<br />!ls -id /var/sadm<br /> 5825 /var/sadm<br />16c1=D<br /> 5825 <-- match looks good<br /><-- Ok. Now let's look at the inode for /var/sadm. This is<br /><-- inumber 16c1. 
<br /><br />16c1%16c0=X<br /> 1 <-- the second cylinder group<br />16c1-(16c0*1)=X<br /> 1 <-- the second inode in the group<br /><br />(c000*400*1)+(40*400)+(20*400)+(1*80)::print struct icommon<br />{<br /> ic_smode = 0x41ed<br /> ic_nlink = 0xd<br /> ic_suid = 0<br /> ic_sgid = 0x3<br /> ic_lsize = 0x200<br /><-- output omitted<br /> ic_db = [ 0xc349, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]<br /> ic_ib = [ 0, 0, 0 ]<br /><-- output omitted<br />}<br />c349*400::print struct direct<br />{<br /> d_ino = 0x16c1<br /> d_reclen = 0xc<br /> d_namlen = 0x1<br /> d_name = [ "." ]<br />}<br />(c349*400)+c::print struct direct<br />{<br /> d_ino = 0x16c0<br /> d_reclen = 0xc<br /> d_namlen = 0x2<br /> d_name = [ ".." ]<br />}<br />(c349*400)+c+c::print struct direct<br />{<br /> d_ino = 0x16c2<br /> d_reclen = 0x10<br /> d_namlen = 0x7<br /> d_name = [ "install" ]<br />}<br />(c349*400)+c+c+10::print struct direct<br />{<br /> d_ino = 0x16c5<br /> d_reclen = 0xc<br /> d_namlen = 0x3<br /> d_name = [ "pkg" ]<br />}<br />(c349*400)+c+c+10+c::print struct direct<br />{<br /> d_ino = 0x1d41<br /> d_reclen = 0x10<br /> d_namlen = 0x6<br /> d_name = [ "system" ]<br />}<br />(c349*400)+c+c+10+c+10::print struct direct<br />{<br /> d_ino = 0x7e4d<br /> d_reclen = 0x18<br /> d_namlen = 0xc<br /> d_name = [ "install_data" ]<br />}<br />(c349*400)+c+c+10+c+10+18::print struct direct<br />{<br /> d_ino = 0x7e4e<br /> d_reclen = 0x14<br /> d_namlen = 0x8<br /> d_name = [ "softinfo" ]<br />}<br />(c349*400)+c+c+10+c+10+18+14::print struct direct<br />{<br /> d_ino = 0x27d9<br /> d_reclen = 0x10<br /> d_namlen = 0x6<br /> d_name = [ "README" ] <-- here is the file we want to examine<br />}<br /><br />27d9%16c0=X<br /> 1 <-- the second cylinder group<br />27d9-(1*16c0)=X<br /> 1119 <-- inode index 0x1119 (4377 decimal) in the group <br />(1*c000*400)+(40*400)+(20*400)+(1119*80)::print struct icommon<br />{<br /> ic_smode = 0x8124<br /> ic_nlink = 0x1<br /> ic_suid = 0<br /> ic_sgid = 0x3<br /> ic_lsize = 0x444<br /><-- 
output omitted<br /> ic_db = [ 0x111cc, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]<br /> ic_ib = [ 0, 0, 0 ]<br /><-- output omitted<br />}<br />111cc*400,200/c <-- ok, let's dump the first 512 bytes<br />0x4473000: -------------------------------<br /> /var/sadm DIRECTORY RESTRUCTURE<br /> -------------------------------<br /> <br /> The system administration directory has been reorganized to bett<br /> er<br /> service the needs of Solaris administrators and administration s<br /> oftware.<br /> The old and new locations for files important to our customers a<br /> re:<br /> <br /> OLD LOCATION NEW LOCATION<br /> ------------ ------------<br /> install_data/install_log system/logs/install_log<br /> install_data/upgrade_log system/log<br /><br /><-- check the work<br />!head /var/sadm/README<br /> -------------------------------<br /> /var/sadm DIRECTORY RESTRUCTURE<br /> -------------------------------<br /><br />The system administration directory has been reorganized to better<br />service the needs of Solaris administrators and administration software.<br />The old and new locations for files important to our customers are:<br /><br /> OLD LOCATION NEW LOCATION<br /> ------------ ------------<br />$q<br />#<br /><br />bash-3.00$Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-1131472680169914082005-11-08T09:47:00.000-08:002006-11-29T19:23:59.846-08:00Using dtrace and mdb to examine virtual memory<a href="http://www.bruningsystems.com/swmm.html">here</a> is a short example using<br />dtrace and mdb to examine page faults and process address spaces.<br />This will be used in a workshop being given to professors teaching operating systems<br />within China. 
The workshop will cover Solaris internals using the opensolaris source code<br />and tools such as dtrace, mdb, and kmdb.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-1119635567018383462005-06-24T10:50:00.000-07:002006-09-17T06:51:39.140-07:00new web site for Bruning SystemsHi,<br />I have finally gotten around to updating the web site. Still need to<br />add the dtrace scripts, but first I need to document them.<br /><br /><a href="http://www.bruningsystems.com">www.bruningsystems.com</a><br /><br />maxUnknownnoreply@blogger.com1tag:blogger.com,1999:blog-7245518.post-1112718367726406252005-04-05T09:23:00.000-07:002005-04-05T09:26:07.726-07:00snooping gld-based NIC drivers using dtraceHi.<br />I have just posted a script at http://www.bruningsystems.com/rtlsio.p<br />that allows one to snoop incoming/outgoing packets on a Realtek NIC.<br />The script is very easy to change for any other GLD-based NIC (see<br />the comment at the beginning of the script to determine what needs<br />to be changed.)<br />To run the script, save it and then:<br /><br /># dtrace -q -C -s ./rtlsio.p<br /><br />Let me know what you think.<br /><br />maxUnknownnoreply@blogger.com2tag:blogger.com,1999:blog-7245518.post-1108239353650703152005-02-12T12:14:00.000-08:002005-02-12T12:15:53.650-08:00dtrace script to trace kernel thread state changesI just posted a script that traces kernel thread state changes to the dtrace<br />forum on forum.sun.com. You can also find it at http://www.bruningsystems.com/runq.d<br /><br />Have fun.<br />maxUnknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-1105724174317757712005-01-14T09:34:00.000-08:002005-01-14T09:36:14.316-08:00solaris kernel memory usageThis is a test. I am currently working on answering a question
<br />from a former student about kernel memory. The question and
<br />answer will be here shortly.
<br />
<br />Unknownnoreply@blogger.com1