Monday, November 12, 2012

Hadoop bug on SmartOS



Recently I had a chance to help with a problem that occurred when trying to run a Hadoop benchmark on SmartOS.  Some of the Java code written for Hadoop makes an implicit assumption that it is running on Linux.  When running the benchmark, the following error showed up:


12/10/01 20:58:49 INFO mapred.JobClient: Task Id : attempt_201209262235_0003_m_000003_0, Status : FAILED
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161)
at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:312)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:385)
at org.apache.hadoop.mapred.Child.main(Child.java:229)

The NativeIO.open call is a thin wrapper around the open(2) system call.  In this case, it is being called from
createForWrite() in SecureIOUtils.java at line 161.  Here is the relevant code from SecureIOUtils.java:

 /**
  * Open the specified File for write access, ensuring that it does not exist.
  * @param f the file that we want to create
  * @param permissions we want to have on the file (if security is enabled)
  *
  * @throws AlreadyExistsException if the file already exists
  * @throws IOException if any other error occurred
  */
 public static FileOutputStream createForWrite(File f, int permissions)
 throws IOException {
   if (skipSecurity) {
     return insecureCreateForWrite(f, permissions);
   } else {
     // Use the native wrapper around open(2)
     try {
        FileDescriptor fd = NativeIO.open(f.getAbsolutePath(),   // <-- line 161
         NativeIO.O_WRONLY | NativeIO.O_CREAT | NativeIO.O_EXCL,
         permissions);
       return new FileOutputStream(fd);
     } catch (NativeIOException nioe) {
       if (nioe.getErrno() == Errno.EEXIST) {
         throw new AlreadyExistsException(nioe);
       }
       throw nioe;
     }
   }
 }

So, the open is called with O_WRONLY, O_CREAT, and O_EXCL flags.  However, the truss(1) output
tells a different story.  We started the following truss on a slave machine and ran the test again:

# truss -f -a -wall -topen,close,fork,write,stat,fstat -o ~/mapred.truss -p $(pgrep -f Djava.library.path)

And here is the relevant truss output:

51039/28: open("/opt/local/hadoop/bin/../logs/userlogs/job_201210171129_0008/attempt_201210171129_0008_m_000002_1/log.tmp", O_WRONLY|O_DSYNC|O_NONBLOCK) Err#2 ENOENT

The error message is emitted shortly after the above open(2) system call.  So, the code shows O_WRONLY, O_CREAT, and O_EXCL, which is what one
would expect for a routine called createForWrite().  However, the flags actually passed to open() are O_WRONLY, O_DSYNC, and O_NONBLOCK.
Why the difference?

Grepping for O_CREAT in the hadoop source finds it defined at:

./trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java:

/**
* JNI wrappers for various native IO-related calls not available in Java.
* These functions should generally be used alongside a fallback to another
* more portable mechanism.
*/
public class NativeIO {
 // Flags for open() call from bits/fcntl.h
 public static final int O_RDONLY   =    00;
 public static final int O_WRONLY   =    01;
 public static final int O_RDWR     =    02;
 public static final int O_CREAT    =  0100;
 public static final int O_EXCL     =  0200;
 public static final int O_NOCTTY   =  0400;
 public static final int O_TRUNC    = 01000;
 public static final int O_APPEND   = 02000;
 public static final int O_NONBLOCK = 04000;
 public static final int O_SYNC   =  010000;
 public static final int O_ASYNC  =  020000;
 public static final int O_FSYNC = O_SYNC;
 public static final int O_NDELAY = O_NONBLOCK;

The comment in the above code says that the flags for the open(2) call are coming from bits/fcntl.h.
However, on SmartOS (as well as illumos and Solaris), the same flags in sys/fcntl.h show:

/*
 * Flag values accessible to open(2) and fcntl(2)
 * The first five can only be set (exclusively) by open(2).
 */
#define O_RDONLY        0
#define O_WRONLY        1
#define O_RDWR          2
#define O_SEARCH        0x200000
#define O_EXEC          0x400000
#if defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE)
#define O_NDELAY        0x04    /* non-blocking I/O */
#endif /* defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE) */
#define O_APPEND        0x08    /* append (writes guaranteed at the end) */
#if defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE) || \
        (_POSIX_C_SOURCE > 2) || defined(_XOPEN_SOURCE)
#define O_SYNC          0x10    /* synchronized file update option */
#define O_DSYNC         0x40    /* synchronized data update option */
#define O_RSYNC         0x8000  /* synchronized file update option */
                                /* defines read/write file integrity */
#endif /* defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE) ... */
#define O_NONBLOCK      0x80    /* non-blocking I/O (POSIX) */
#ifdef  _LARGEFILE_SOURCE
#define O_LARGEFILE     0x2000
#endif

/*
 * Flag values accessible only to open(2).
 */
#define O_CREAT         0x100   /* open with file create (uses third arg) */
#define O_TRUNC         0x200   /* open with truncation */
#define O_EXCL          0x400   /* exclusive open */
#define O_NOCTTY        0x800   /* don't allocate controlling tty (POSIX) */
#define O_XATTR         0x4000  /* extended attribute */
#define O_NOFOLLOW      0x20000 /* don't follow symlinks */
#define O_NOLINKS       0x40000 /* don't allow multiple hard links */

The O_CREAT flag (from bits/fcntl.h on Linux) is 0100 (octal) in the NativeIO.java file, but 0x100 on SmartOS.  The octal value 0100 is hex 0x40, which corresponds to O_DSYNC on SmartOS.  Similarly, the O_EXCL value of 0200 (octal) is hex 0x80, which is O_NONBLOCK on SmartOS.  Whoever wrote this code assumed it would be running on a Linux system.  The flags are different yet again on FreeBSD and Mac OS (for instance, O_CREAT is 0x200 on those systems).  My colleague, Filip Hajny, changed the flags to match the SmartOS values and rebuilt everything to fix the problem.
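
The mismatch is easy to demonstrate on any system by comparing the values hard-coded in NativeIO.java with what the platform's fcntl.h actually defines.  Here is a minimal sketch (not part of Hadoop; the HADOOP_* macros are simply the Linux octal values copied from NativeIO.java):

#include <stdio.h>
#include <fcntl.h>

/* The octal values hard-coded in NativeIO.java (correct only for Linux). */
#define HADOOP_O_CREAT  0100
#define HADOOP_O_EXCL   0200

int
main(void)
{
        /*
         * On SmartOS this prints platform values 0x100 and 0x400, while the
         * hard-coded values are 0x40 (O_DSYNC) and 0x80 (O_NONBLOCK).
         */
        printf("O_CREAT: platform 0x%x, NativeIO 0x%x\n", O_CREAT, HADOOP_O_CREAT);
        printf("O_EXCL:  platform 0x%x, NativeIO 0x%x\n", O_EXCL, HADOOP_O_EXCL);
        printf("O_DSYNC = 0x%x, O_NONBLOCK = 0x%x\n", O_DSYNC, O_NONBLOCK);
        return (0);
}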

This problem reminds me how many little things like this can occur when porting an application developed on one operating system to run on another.  For all but the simplest of applications, some changes are probably going to be needed.  For the above problem, POSIX specifies the flags that open(2) can take (O_CREAT, O_RDWR, etc.), but does not specify the values of those flags.  Basically, if the code took its flag values from the correct header file (fcntl.h in both cases) instead of hard-coding them, the problem would not occur.  It is an important reminder that all code should be reviewed and tested on as many different systems as possible.

Monday, July 09, 2012

Why take a SmartOS/illumos Internals or ZFS Internals course?


I have been teaching OS internals courses for many years, starting with Bell Labs/AT&T Unix System III in 1982, onto System V, SVR2, SVR3, SVR4, and Solaris Unixes since 1994.  Along the way, I have also taught HP-UX internals, various device driver courses, and kernel debugging courses.  I started using unix on the Sixth Edition in 1975.  I have also done a fair amount of kernel development and debugging, along with some user level stuff.

The audiences I have had for internals courses have been quite varied.  Many of the people I have taught have been in support or sustaining organizations, but I have also taught developers, system administrators, Java programmers, QA people, hardware engineers, and even end users.  Along the way, I have been asked by various people (many of them managers), "Why should I or my team take this course?  What will I or my team get out of this training?"

In response to the first question, I usually tell people that an internals course should teach students how the system works, and why it works the way it does.  In other words, the course teaches the data structures and algorithms used by the operating system to manage the resources of the computer, and explains the architecture of the system, as well as the rationale behind the design decisions that were made in implementing it.  My view is that knowledge of how the system works can benefit everyone.  For developers (especially kernel developers), knowledge of the system is key to adding new functionality.  For system administrators, knowledge of the OS can help with troubleshooting and performance analysis.  Tools like DTrace become even more useful when one has knowledge of what's going on in the system.  In general, knowledge of how the system works allows everyone who uses the system to make better use of it.

As for what specific skills are acquired in an internals course, I make very extensive use of tools that come with the system during the training, both when I am lecturing and in lab work.  My view has always been that in order to learn the concepts being taught, one must be able to actually "see" them.  Tools like DTrace, mdb, kmdb, and other observability mechanisms are key to doing this.  I do not "teach" the tools, but rather we use the tools in lots of examples throughout the course.  At the end of the course, I am satisfied if my students can start to learn things on their own.  Basically, a good internals course should be an "enabling" course: it should enable the student to learn more on their own.  Some students may never use the specific tools in the specific ways they are used during the class, but they will come away knowing that one can actually determine what the system is doing at any given time.  Others will use the tools consistently in their work.

As with all training, you only get out of it what you put into it.
If you're interested in Internals training, please visit Training from Joyent.

I hope to see people in class soon!

Friday, June 15, 2012

SmartOS/Illumos Training

If you are reading this, you probably are here either because you saw my post on twitter, or you searched for "zfs recovery" (see here). This is the first blog I have written here since 2009, so it is time to write again.

I have written (in a different blog) about using flame graphs; see Using flamegraph to Solve IP Scaling Issue. Rather than spending time saying what I've been doing since my last blog entry here, I want to talk a bit about what I am doing now. If you're interested in what I have been doing, see KVM on Illumos.

Since I wrote the blogs on ZFS recovery, I have been getting emails, at a rate that is slowly increasing over time (now ~2 per week), from people asking if I can help with ZFS problems. If I had received this many emails when I wrote the blog, I might be working full time now on ZFS recovery issues. As it is, I am now very busy working for Joyent, and have not had time to answer as many of the ZFS requests as I would like. My apologies to people who I have not answered. For those of you who have asked for my mdb and zdb modifications, please send me email at max_at_joyent_dot_com. If I get enough requests, it is possible that the modifications may find their way into SmartOS (and Illumos).

If you would like help with ZFS problems, I can better justify my time if you download Joyent's SmartDataCenter product, available here, and give it a try. If you're interested in SmartOS (simply the best Solaris-based operating system you can use), it is available for download at www.smartos.org. Joyent fully supports SmartOS for use in its SmartDataCenter product, so you are more likely to get help in a timely fashion with problems than I am able to provide on my own.

And, what am I doing now? Joyent is offering classes on SmartDataCenter, DTrace, and SmartOS/Illumos Internals. I am involved with developing the courseware, and shall be (along with Brendan Gregg) delivering the courses. For more information, see Training from Joyent.

Sunday, December 20, 2009

ZFS Raidz Data Walk

Several months ago, I wrote in my blog about raidz on disk format (see http://mbruning.blogspot.com/2009/04/raidz-on-disk-format.html). In that blog, I went over the high level details. Here, I thought I would show the low level stuff that I did to determine the layout. I am using a modified zdb and mdb to walk through the on-disk data structures to find the data for a copy of the /usr/dict/words file that I made on a raidz file system.

The raidz volume contains 5 equal-sized devices. Since I don't have 5 disks lying around, I created 5 equal-sized files (/export/home/max/r0 through /export/home/max/r4). I'll use the term disk throughout this discussion to refer to one of these files.

# zpool status -v tank
pool: tank
state: ONLINE
scrub: none requested
config:

NAME                        STATE     READ WRITE CKSUM
tank                        ONLINE       0     0     0
  raidz1                    ONLINE       0     0     0
    /export/home/max/r0     ONLINE       0     0     0
    /export/home/max/r1     ONLINE       0     0     0
    /export/home/max/r2     ONLINE       0     0     0
    /export/home/max/r3     ONLINE       0     0     0
    /export/home/max/r4     ONLINE       0     0     0

errors: No known data errors
#

I'll umount the file system so things don't change while I'm examining the on-disk structures.

# zfs umount tank
#

And, as I have done in the past, I walk the data structures to get to the "words" file by starting at the uberblock_t. If you get lost during this walk, you can always refer to the diagram "ZFS On-Disk Layout - The Big Picture", page 4 in http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf from the OpenSolaris Developer's Conference, 2008 in Prague.

First, the active uberblock_t.

# zdb -uuu tank
Uberblock

magic = 0000000000bab10c
version = 13
txg = 1280
guid_sum = 6800651560363961243
timestamp = 1239197133 UTC = Wed Apr 8 15:25:33 2009
rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:1e007400:400> DVA[1]=<0:9400:400> DVA[2]=<0:3c003800:400> fletcher4 lzjb LE contiguous birth=1280 fill=27 cksum=9ad89e117:40b4956a12c:db76af09e62f:1f779fd1db6115
#
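
For reference, the fields zdb prints correspond directly to the on-disk uberblock structure.  Here is a simplified sketch (from memory of the structure definitions; the blkptr_t is collapsed to its raw 128 bytes, which really hold the three DVAs, properties, birth txg, fill count, and checksum shown in rootbp above):

#include <stdint.h>

typedef struct blkptr {
        uint64_t blk_words[16];    /* 128-byte block pointer */
} blkptr_t;

typedef struct uberblock {
        uint64_t ub_magic;         /* 0x00bab10c, the magic printed above */
        uint64_t ub_version;       /* on-disk version, 13 here */
        uint64_t ub_txg;           /* transaction group of last sync, 1280 here */
        uint64_t ub_guid_sum;      /* sum of the guids of all vdevs in the pool */
        uint64_t ub_timestamp;     /* UTC seconds of last sync */
        blkptr_t ub_rootbp;        /* block pointer to the objset_phys_t of the MOS */
} uberblock_t;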

Now, I use a new command I added to zdb to allow me to see the raidz mapping. The "-Z" option takes the pool name, device id, location, and physical size as arguments, and prints the device index, location, and size for each piece of the corresponding data plus parity.

# ./zdb -Z tank:0:1e007400:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 3, offset = 6001600, size = 200
devidx = 4, offset = 6001600, size = 200
#

So, the 0x200 byte parity is on the fourth disk (devidx = 3), and the 0x200 byte objset_phys_t is on the fifth disk (devidx = 4). (Of course, either one will work since there are only 2).
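
The mapping that "-Z" prints can be reproduced with a few lines of arithmetic.  Below is a hedged sketch of the logic from vdev_raidz_map_alloc(), simplified for single-parity raidz with a 512-byte sector size (ashift 9) and the 5 disks used here.  Running it with the DVA offset and psize from the uberblock reproduces the output above:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define UNIT_SHIFT      9       /* 512-byte sectors (ashift 9) */
#define DCOLS           5       /* number of disks in the raidz vdev */
#define NPARITY         1       /* raidz1 */

int
main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: raidzmap offset psize (both hex)\n");
                return (1);
        }
        uint64_t offset = strtoull(argv[1], NULL, 16);   /* DVA offset, e.g. 1e007400 */
        uint64_t psize  = strtoull(argv[2], NULL, 16);   /* physical size, e.g. 200 */

        uint64_t b = offset >> UNIT_SHIFT;          /* starting sector on the vdev */
        uint64_t s = psize >> UNIT_SHIFT;           /* number of data sectors */
        uint64_t f = b % DCOLS;                     /* column holding the parity sector */
        uint64_t o = (b / DCOLS) << UNIT_SHIFT;     /* byte offset on each disk */
        uint64_t q = s / (DCOLS - NPARITY);         /* full rows of data */
        uint64_t r = s % (DCOLS - NPARITY);         /* data sectors in the last row */
        uint64_t bc = (r == 0 ? 0 : r + NPARITY);   /* "big" columns (one extra sector) */
        uint64_t acols = (q == 0 ? bc : DCOLS);     /* columns actually used */

        printf("Columns = %llu\n", (unsigned long long)acols);
        for (uint64_t c = 0; c < acols; c++) {
                uint64_t col = f + c;
                uint64_t coff = o;
                if (col >= DCOLS) {                 /* wrap around to the next row */
                        col -= DCOLS;
                        coff += 1ULL << UNIT_SHIFT;
                }
                /* column 0 is parity; each column gets q sectors, +1 for big columns */
                printf("devidx = %llu, offset = %llx, size = %llx\n",
                    (unsigned long long)col, (unsigned long long)coff,
                    (unsigned long long)((q + (c < bc)) << UNIT_SHIFT));
        }
        return (0);
}

Feeding the other DVA offsets from this walk through the same arithmetic reproduces each of the later zdb -Z outputs as well.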

Now, convert the hex offset to an absolute decimal block number. The 0x400000 skips the disk labels at the front of each device in the volume.

# mdb
> (6001600>>9)+(400000>>9)=D
204811

Attempting to use zdb with the -R option to read the blocks causes an assertion failure in zdb (at least, that was the state back in April, when I wrote the original blog on raidz). So, instead I use dd to dump the raw data into a file.

# dd if=/export/home/max/r4 of=/tmp/objset_t bs=512 count=1 iseek=204811
1+0 records in
1+0 records out
#

Now, I'll uncompress the data. The size after decompression should be 0x400 bytes (as specified in the block pointer in the uberblock_t above). For this, I use a utility I wrote called zuncompress. This utility takes an option which allows one to specify the compression algorithm used. The default is lzjb. The output should be the objset_phys_t for the meta object set (MOS).

# ./zuncompress -p 200 -l 400 /tmp/objset_t > /tmp/objset
#
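
zuncompress is my own utility, but its core is just the lzjb decompression loop from the ZFS source (the -p argument is the compressed/physical size, -l the logical size).  For reference, a standalone version of that loop looks roughly like this:

#include <stddef.h>

#define MATCH_BITS      6
#define MATCH_MIN       3
#define OFFSET_MASK     ((1 << (16 - MATCH_BITS)) - 1)
#define NBBY            8

/* Decompress lzjb data at src into d_len bytes at dst. */
int
lzjb_decompress(const unsigned char *src, unsigned char *dst,
    size_t s_len, size_t d_len)
{
        unsigned char *d_start = dst;
        unsigned char *d_end = dst + d_len;
        unsigned char copymap = 0;
        int copymask = 1 << (NBBY - 1);

        (void) s_len;   /* decompression is driven by the output length */
        while (dst < d_end) {
                if ((copymask <<= 1) == (1 << NBBY)) {
                        copymask = 1;
                        copymap = *src++;       /* 8 literal/match flag bits */
                }
                if (copymap & copymask) {
                        /* 2-byte token: 6-bit match length, 10-bit back offset */
                        int mlen = (src[0] >> (NBBY - MATCH_BITS)) + MATCH_MIN;
                        int offset = ((src[0] << NBBY) | src[1]) & OFFSET_MASK;
                        const unsigned char *cpy;
                        src += 2;
                        if ((cpy = dst - offset) < d_start)
                                return (-1);    /* corrupt input */
                        while (--mlen >= 0 && dst < d_end)
                                *dst++ = *cpy++;
                } else {
                        *dst++ = *src++;        /* literal byte */
                }
        }
        return (0);
}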

And now, I'll use my modified mdb to print the objset_phys_t.

# mdb /tmp/objset
> 0::print -a zfs`objset_phys_t
{
0 os_meta_dnode = {
0 dn_type = 0xa <-- DMU_OT_DNODE
1 dn_indblkshift = 0xe
2 dn_nlevels = 0x1
...
40 dn_blkptr = [
{
40 blk_dva = [
{
40 dva_word = [ 0x8, 0xf0050 ]
}
{
50 dva_word = [ 0x8, 0x40 ]
}
{
60 dva_word = [ 0x8, 0x1e0028 ]
}
]
...
}

And the blkptr_t at 0x40:

> 40::blkptr
DVA[0]: vdev_id 0 / 1e00a000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 100000000000
DVA[0]: :0:1e00a000:a00:d
DVA[1]: vdev_id 0 / 8000
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 100000000000
DVA[1]: :0:8000:a00:d
DVA[2]: vdev_id 0 / 3c005000
DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 100000000000
DVA[2]: :0:3c005000:a00:d
LSIZE: 4000 PSIZE: a00
ENDIAN: LITTLE TYPE: DMU dnode
BIRTH: 500 LEVEL: 0 FILL: 1a00000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: a182339fe8:ded0f7be7047:bcb1c1a96b94cc:765bd519587bfb41
$q
#
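
A short aside on how the raw dva_word pairs printed by mdb relate to the ::blkptr output: the first word packs the vdev id and the allocated size (in 512-byte sectors), and the second word holds the offset (also in sectors) plus a gang-block bit.  A hedged sketch of the decoding (the GRID byte is ignored here):

#include <stdio.h>
#include <stdint.h>

/* Decode one DVA from the two 64-bit words mdb printed, e.g. [ 0x8, 0xf0050 ]. */
static void
decode_dva(uint64_t w0, uint64_t w1)
{
        uint64_t asize  = (w0 & 0xffffff) << 9;         /* low 24 bits, in sectors */
        uint64_t vdev   = w0 >> 32;                     /* top 32 bits */
        uint64_t offset = (w1 & ~(1ULL << 63)) << 9;    /* low 63 bits, in sectors */
        int gang        = (w1 >> 63) & 1;

        printf("vdev=%llu offset=0x%llx asize=0x%llx gang=%d\n",
            (unsigned long long)vdev, (unsigned long long)offset,
            (unsigned long long)asize, gang);
}

int
main(void)
{
        decode_dva(0x8, 0xf0050);       /* DVA[0] above: vdev 0, offset 1e00a000 */
        decode_dva(0x8, 0x40);          /* DVA[1] above: vdev 0, offset 8000 */
        return (0);
}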

So, "LEVEL: 0" means no indirection. The next object is the MOS, which is an array of dnode_phys_t. Let's see how the MOS is laid out on the raidz volume.

# ./zdb -Z tank:0:1e00a000:a00
Columns = 5, bigcols = 2, asize = 1000, firstdatacol = 1
devidx = 0, offset = 6002000, size = 400
devidx = 1, offset = 6002000, size = 400
devidx = 2, offset = 6002000, size = 200
devidx = 3, offset = 6002000, size = 200
devidx = 4, offset = 6002000, size = 200
#

So, disk 0 contains parity, and disks 1, 2, 3, and 4 contain the MOS. The MOS is compressed with lzjb compression. We'll use dd to dump the 4 blocks containing the MOS to a file, then decompress the MOS.

I'll use mdb to translate the blkptr DVA address to a block on the disks. Note that all blocks in this example are at the same location (0x6002000).

# mdb
> (6002000>>9)+(400000>>9)=D
204816

And now dd each of the blocks. The first disk (/export/home/max/r0) is parity. The second disk contains 0x400 bytes. The other 3 disks contain 0x200 bytes each. So the total size of the compressed data is 0x400 + 0x200 + 0x200 + 0x200, or 0xa00 bytes, which agrees with the PSIZE field in the blkptr_t. Note that the size of the parity block must be equal to the size of the largest data block (0x400).

# dd if=/export/home/max/r1 of=/tmp/mos_z1 iseek=204816 count=2
2+0 records in
2+0 records out
# dd if=/export/home/max/r2 of=/tmp/mos_z2 iseek=204816 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r3 of=/tmp/mos_z3 iseek=204816 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r4 of=/tmp/mos_z4 iseek=204816 count=1
1+0 records in
1+0 records out
#
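
As an aside, the parity block on the first disk was not dumped above, but for single-parity raidz it is simply the byte-wise XOR of the data columns, with the shorter columns treated as zero-padded out to the parity size.  If you dd it out as well, a few lines of C can verify that.  This is a hedged sketch, and the file names in the usage comment are made up:

#include <stdio.h>
#include <string.h>

/* usage: raidzck parity_file data_file1 [data_file2 ...] */
int
main(int argc, char **argv)
{
        unsigned char parity[65536], acc[65536], buf[65536];
        size_t psize, n;
        FILE *fp;

        if (argc < 3) {
                fprintf(stderr, "usage: %s parity_file data_file...\n", argv[0]);
                return (1);
        }
        if ((fp = fopen(argv[1], "r")) == NULL) {
                perror(argv[1]);
                return (1);
        }
        psize = fread(parity, 1, sizeof (parity), fp);  /* e.g. the 0x400-byte parity column */
        fclose(fp);

        memset(acc, 0, sizeof (acc));
        for (int i = 2; i < argc; i++) {
                if ((fp = fopen(argv[i], "r")) == NULL) {
                        perror(argv[i]);
                        return (1);
                }
                n = fread(buf, 1, sizeof (buf), fp);    /* short columns stay zero-padded */
                fclose(fp);
                for (size_t j = 0; j < n; j++)
                        acc[j] ^= buf[j];
        }
        printf("parity %s\n",
            memcmp(parity, acc, psize) == 0 ? "matches" : "does NOT match");
        return (0);
}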

Now, concatenate the files to get the compressed MOS.

# cat /tmp/mos_z* > /tmp/mos_comp

And uncompress. The size after decompression, according to the blkptr is 0x4000 (LSIZE in the blkptr).

# ./zuncompress -l 4000 -p a00 /tmp/mos_comp > /tmp/mos

And I'll use the modified mdb to dump out the MOS.

# mdb /tmp/mos
> ::sizeof zfs`dnode_phys_t
sizeof (zfs`dnode_phys_t) = 0x200

> 4000%200=K
20 <-- There are 32 dnode_phys_t in the MOS
> 0,20::print -a zfs`dnode_phys_t
{
0 dn_type = 0 <-- DMU_OT_NONE, first is unused
...
}
{
200 dn_type = 0x1 <-- DMU_OT_OBJECT_DIRECTORY
...
240 dn_blkptr = [
{
240 blk_dva = [
{
240 dva_word = [ 0x2, 0x24 ]
}
...
}
{
400 dn_type = 0xc <-- DMU_OT_DSL_DIR (DSL Directory)
...
404 dn_bonustype = 0xc
...
4c0 dn_bonus = [ 0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0, 0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0,
... ]
}
{
600 dn_type = 0xf
...
{
1600 dn_type = 0x10 <-- DMU_OT_DSL_DATASET (DSL DataSet)
...
1604 dn_bonustype = 0x10
...
16c0 dn_bonus = [ 0x8, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, 0x1, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
... ]
}
...

The blkptr_t at 0x240 is for the Object Directory. Let's take a closer look.

> 240::blkptr
DVA[0]: vdev_id 0 / 4800
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[0]: :0:4800:200:d
DVA[1]: vdev_id 0 / 1e004800
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[1]: :0:1e004800:200:d
DVA[2]: vdev_id 0 / 3c000000
DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[2]: :0:3c000000:200:d
LSIZE: 200 PSIZE: 200
ENDIAN: LITTLE TYPE: object directory
BIRTH: 4 LEVEL: 0 FILL: 100000000
CKFUNC: fletcher4 COMP: uncompressed
CKSUM: 5d4dec3ac:1e59c2be429:5825c81154e8:b9b170eedd49e
$q
#

We'll use zdb to find out where ZFS has put the 0x200 byte object directory.

# ./zdb -Z tank:0:4800:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 1, offset = e00, size = 200
devidx = 2, offset = e00, size = 200
#

So, the parity is on the second disk (devidx = 1), and the object directory (a ZAP object) is on the third disk.

We'll convert the offset into a block address.

# mdb
> (e00>>9)+(400000>>9)=D
8199


And dump the 0x200 (i.e., 512-byte) block.

# dd if=/export/home/max/r2 of=/tmp/objdir iseek=8199 count=1
1+0 records in
1+0 records out
#

The ZAP object is not compressed (see the above blkptr_t). So, no need to uncompress. We'll use mdb to look at the zap.

# mdb /tmp/objdir
> 0/J
0: 8000000000000003 <-- a microzap object
>

> 0::print -a -t zfs`mzap_phys_t
{
0 uint64_t mz_block_type = 0x8000000000000003
8 uint64_t mz_salt = 0x32064dbb
10 uint64_t mz_normflags = 0
18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
40 mzap_ent_phys_t [1] mz_chunk = [
{
40 uint64_t mze_value = 0x2
48 uint32_t mze_cd = 0
4c uint16_t mze_pad = 0
4e char [50] mze_name = [ "root_dataset" ]
}
]
}
$q
#
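
As an aside, this whole 512-byte block is a microzap: a 64-byte header followed by 64-byte name/value chunks, exactly the mzap_phys_t/mzap_ent_phys_t layout mdb printed.  A hedged sketch of walking it without mdb, reading the block dumped to /tmp/objdir above:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

#define MZAP_ENT_LEN    64
#define MZAP_NAME_LEN   (MZAP_ENT_LEN - 8 - 4 - 2)      /* 50 */

typedef struct mzap_ent_phys {
        uint64_t mze_value;             /* low bits: object id; for directory ZAPs
                                           the high bits encode the entry type */
        uint32_t mze_cd;
        uint16_t mze_pad;
        char     mze_name[MZAP_NAME_LEN];
} mzap_ent_phys_t;

typedef struct mzap_phys {
        uint64_t mz_block_type;         /* 0x8000000000000003 for a microzap */
        uint64_t mz_salt;
        uint64_t mz_normflags;
        uint64_t mz_pad[5];
        mzap_ent_phys_t mz_chunk[1];    /* chunks run to the end of the block */
} mzap_phys_t;

int
main(void)
{
        uint64_t block[512 / sizeof (uint64_t)];
        FILE *fp = fopen("/tmp/objdir", "r");   /* the block dumped with dd above */

        if (fp == NULL || fread(block, 1, sizeof (block), fp) != sizeof (block))
                return (1);
        fclose(fp);

        mzap_phys_t *mz = (mzap_phys_t *)block;
        int nent = (sizeof (block) - offsetof(mzap_phys_t, mz_chunk)) / MZAP_ENT_LEN;

        for (int i = 0; i < nent; i++) {
                mzap_ent_phys_t *mze = &mz->mz_chunk[i];
                if (mze->mze_name[0] != '\0')
                        printf("%-20s -> 0x%llx\n", mze->mze_name,
                            (unsigned long long)mze->mze_value);
        }
        return (0);
}

Run against /tmp/objdir, this should print root_dataset -> 0x2 among the entries, matching the mdb output above.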

There are more mzap_ent_phys_t in the object, but we are only concerned with the root dataset. This is object id 2, so we'll go back to the MOS, and examine the dnode_phys_t at index 2.

# mdb /tmp/mos
> 2*200::print -a zfs`dnode_phys_t <-- Each dnode_phys_t is 0x200 bytes
{
400 dn_type = 0xc <-- DMU_OT_DSL_DIR
...
404 dn_bonustype = 0xc <-- DMU_OT_DSL_DIR
...
4c0 dn_bonus = [ 0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0, 0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, ... ]
}

The bonus buffer contains a dsl_dir_phys_t.

> 4c0::print -a zfs`dsl_dir_phys_t
{
4c0 dd_creation_time = 0x49db7539
4c8 dd_head_dataset_obj = 0x10
...
}

The DSL DataSet is object id 0x10 (dd_head_dataset_obj = 0x10).

> 10*200::print -a zfs`dnode_phys_t
{
2000 dn_type = 0x10 <-- DMU_OT_DSL_DATASET
...
2004 dn_bonustype = 0x10 <-- bonus buffer contains dsl_dataset_phys_t
...
20c0 dn_bonus = [ 0x2, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, 0x1, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]
}

Now, the dsl_dataset_phys_t in the bonus buffer of the DSL DataSet dnode.

> 20c0::print -a zfs`dsl_dataset_phys_t
{
20c0 ds_dir_obj = 0x2
...
2140 ds_bp = {
2140 blk_dva = [
{
2140 dva_word = [ 0x2, 0xf0038 ]
}
{
2150 dva_word = [ 0x2, 0x6 ]
}
{
2160 dva_word = [ 0, 0 ]
}
]
...
}

The blkptr_t at 0x2140 will give us the objset_phys_t of the root dataset of the file system.

> 2140::blkptr
DVA[0]: vdev_id 0 / 1e007000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[0]: :0:1e007000:200:d
DVA[1]: vdev_id 0 / c00
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[1]: :0:c00:200:d
LSIZE: 400 PSIZE: 200
ENDIAN: LITTLE TYPE: DMU objset
BIRTH: 500 LEVEL: 0 FILL: a00000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 6bb79a7b2:2e0d64756dd:9fc17017938b:176b8a4b6c4756
$q
#

Now get the locations where the file system objset_phys_t resides.

# ./zdb -Z tank:0:1e007000:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 1, offset = 6001600, size = 200
devidx = 2, offset = 6001600, size = 200
#

So, parity is on the second disk, and the data is on the third disk, both at offset 0x6001600.

# mdb
> (6001600>>9)+(400000>>9)=D
204811

And again use dd to dump the compressed objset_phys_t to a file.

# dd if=/export/home/max/r2 of=/tmp/dmu_objset_comp iseek=204811 count=1
1+0 records in
1+0 records out
#

And uncompress the objset_phys_t.

# ./zuncompress -l 400 -p 200 /tmp/dmu_objset_comp > /tmp/dmu_objset
#

Now, use mdb to examine the objset_phys_t.

# mdb /tmp/dmu_objset
> 0::print -a zfs`objset_phys_t
{
0 os_meta_dnode = {
0 dn_type = 0xa <-- DMU_OT_DNODE
1 dn_indblkshift = 0xe
2 dn_nlevels = 0x7 <-- 7 levels, i.e., 6 levels of indirection
...
40 dn_blkptr = [
{
40 blk_dva = [
{
40 dva_word = [ 0x4, 0xf0020 ]
}
{
50 dva_word = [ 0x4, 0x20 ]
}
{
60 dva_word = [ 0, 0 ]
}
]
...
}
> 40::blkptr
DVA[0]: vdev_id 0 / 1e004000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[0]: :0:1e004000:400:id
DVA[1]: vdev_id 0 / 4000
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[1]: :0:4000:400:id
LSIZE: 4000 PSIZE: 400
ENDIAN: LITTLE TYPE: DMU dnode
BIRTH: 500 LEVEL: 6 FILL: 900000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 5b884586fa:3f9c7d79ba1f:17674db0ee38e0:6077d2d63aa75b6
$q
#

There are 6 levels of indirection to get the MOS for the file system. Next, we'll get the disk locations for the 6th level of indirection.

# ./zdb -Z tank:0:1e004000:400
Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1
devidx = 2, offset = 6000c00, size = 200
devidx = 3, offset = 6000c00, size = 200
devidx = 4, offset = 6000c00, size = 200
#

So, the third disk contains parity, and the fourth and fifth disks contain the indirect block.

# mdb
> (6000c00>>9)+(400000>>9)=D
204806


Again, we'll use dd to get the data from the 2 disks, then concatenate the dd outputs, then uncompress.

# dd if=/export/home/max/r3 of=/tmp/i6_1z iseek=204806 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r4 of=/tmp/i6_2z iseek=204806 count=1
1+0 records in
1+0 records out
# cat /tmp/i6*z > /tmp/i6_z
#

Now, uncompress. The size after decompression is 0x4000 bytes, as specified in the LSIZE field of the blkptr_t.

# ./zuncompress -l 4000 -p 400 /tmp/i6_z > /tmp/i6
#

And use mdb to examine the blkptr_t structures. We are only interested in the first one, since it will take us to the beginning dnode_phys_t in the file system.

# mdb/intel/ia32/mdb/mdb /tmp/i6
> 0::blkptr
DVA[0]: vdev_id 0 / 1e003800
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[0]: :0:1e003800:400:id
DVA[1]: vdev_id 0 / 3800
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[1]: :0:3800:400:id
LSIZE: 4000 PSIZE: 400
ENDIAN: LITTLE TYPE: DMU dnode
BIRTH: 500 LEVEL: 5 FILL: 900000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 59defe2103:3e0ac53edc13:16a8c688ba6d69:5cafeb97a9046d7
$q
#

Now at level 5, we again need to know where on the physical disks are the data.

# ./zdb -Z tank:0:1e003800:400
Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1
devidx = 3, offset = 6000a00, size = 200
devidx = 4, offset = 6000a00, size = 200
devidx = 0, offset = 6000c00, size = 200
#

So, parity on fourth disk and data on fifth and first.

# mdb
> (6000a00>>9)+(400000>>9)=D
204805
> (6000c00>>9)+(400000>>9)=D
204806

And use dd to dump the blocks.

# dd if=/export/home/max/r4 of=/tmp/i5_1z iseek=204805 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r0 of=/tmp/i5_2z iseek=204806 count=1
1+0 records in
1+0 records out
#

And concatenate...

# cat /tmp/i5*z > /tmp/i5_z
#

And uncompress...

# ./zuncompress -p 400 -l 4000 /tmp/i5_z > /tmp/i5
#

And get to the 4th level of indirection...

# mdb /tmp/i5
> 0::blkptr
DVA[0]: vdev_id 0 / 1e003000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[0]: :0:1e003000:400:id
DVA[1]: vdev_id 0 / 3000
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[1]: :0:3000:400:id
LSIZE: 4000 PSIZE: 400
ENDIAN: LITTLE TYPE: DMU dnode
BIRTH: 500 LEVEL: 4 FILL: 900000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 5aaaf038c7:3ecd4215b2cf:1705e4d4343d71:5e8d71a8535f678
$q
#

Rather than show all 6 levels, we'll jump to level 0.

# mdb /tmp/i1
> 0::blkptr
DVA[0]: vdev_id 0 / 1e001000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[0]: :0:1e001000:600:d
DVA[1]: vdev_id 0 / 1000
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[1]: :0:1000:600:d
LSIZE: 4000 PSIZE: 600
ENDIAN: LITTLE TYPE: DMU dnode
BIRTH: 500 LEVEL: 0 FILL: 900000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 7e1ebca68d:4f0370c6d404:23a24df0937608:ce6838f39084f95
$q
#

Locate the data for the stripe:

# ./zdb -Z tank:0:1e001000:600
Columns = 4, bigcols = 4, asize = 800, firstdatacol = 1
devidx = 3, offset = 6000200, size = 200
devidx = 4, offset = 6000200, size = 200
devidx = 0, offset = 6000400, size = 200
devidx = 1, offset = 6000400, size = 200
#

# mdb
> (6000200>>9)+(400000>>9)=D
204801
> (6000400>>9)+(400000>>9)=D
204802


Get the data from the individual disks...

# dd if=/export/home/max/r4 of=/tmp/fs_mos_1z iseek=204801 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r0 of=/tmp/fs_mos_2z iseek=204802 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r1 of=/tmp/fs_mos_3z iseek=204802 count=1
1+0 records in
1+0 records out
#

Concatenate the data...

# cat /tmp/fs_mos_* > /tmp/fs_mos_z
#

Uncompress...

# ./zuncompress -l 4000 -p 600 /tmp/fs_mos_z > /tmp/fs_mos
#

We should now be at the MOS for the root data set.

# mdb /tmp/fs_mos
> 0,20::print -a zfs`dnode_phys_t
{
0 dn_type = 0 <-- first is not used
...
}
{
200 dn_type = 0x15 <-- DMU_OT_MASTER
...
240 dn_blkptr = [
{
240 blk_dva = [
{
240 dva_word = [ 0x2, 0 ]
}
{
250 dva_word = [ 0x2, 0xf0000 ]
}
{
260 dva_word = [ 0, 0 ]
}
]
...
}
...
{
600 dn_type = 0x14 <-- DMU_OT_DIRECTORY_CONTENTS (probably for root of fs)
...
604 dn_bonustype = 0x11 <-- bonus buffer is a znode_phys_t
...
640 dn_blkptr = [
{
640 blk_dva = [
{
640 dva_word = [ 0x2, 0xf0006 ]
}
...
6c0 dn_bonus = [ 0x26, 0xa0, 0xdb, 0x49, 0, 0, 0, 0, 0x8e, 0xf0, 0xf7, 0x25, 0, 0, 0, 0, 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0xf3, 0x80, 0xdd, 0x34, 0, 0, 0 , 0, ... ]
}
{
800 dn_type = 0x13 <-- DMU_OT_PLAIN_FILE_CONTENTS
...
804 dn_bonustype = 0x11 <-- bonus buffer is znode_phys_t
...
840 dn_blkptr = [
{
840 blk_dva = [
{
840 dva_word = [ 0x2, 0xf0004 ]
}
...
8c0 dn_bonus = [ 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0x5e, 0xe2, 0xdc, 0x34, 0, 0, 0, 0, 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0x58, 0x9e, 0xde, 0x34, 0, 0, 0 , 0, ... ]
}
...
>

We should start with the ZAP object specified by the blkptr_t for the master node to get to the root directory object of the file system. Instead, we'll assume the dnode_phys_t at 0x600 is for the root of the file system, and we'll dump the blkptr_t. This should be for a ZAP object which should contain the list of files in the directory.

> 640::blkptr
DVA[0]: vdev_id 0 / 1e000c00
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[0]: :0:1e000c00:200:d
DVA[1]: vdev_id 0 / 800
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000
DVA[1]: :0:800:200:d
LSIZE: 200 PSIZE: 200
ENDIAN: LITTLE TYPE: ZFS directory
BIRTH: 500 LEVEL: 0 FILL: 100000000
CKFUNC: fletcher4 COMP: uncompressed
CKSUM: 60d062a16:197ca3c8839:4691877f93d3:946b572aca5a2
$q
#

Find the location on the disk(s) for the directory ZAP object.

# ./zdb -Z tank:0:1e000c00:200
Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1
devidx = 1, offset = 6000200, size = 200
devidx = 2, offset = 6000200, size = 200
# mdb
> (6000200>>9)+(400000>>9)=D
204801

Dump the contents.

# dd if=/export/home/max/r2 of=/tmp/rootdir iseek=204801 count=1
1+0 records in
1+0 records out
#

Examine the directory.

# mdb /tmp/rootdir
> ::sizeof zfs`mzap_phys_t
sizeof (zfs`mzap_phys_t) = 0x80
> ::sizeof zfs`mzap_ent_phys_t
sizeof (zfs`mzap_ent_phys_t) = 0x40
> 0::print zfs`mzap_phys_t
{
mz_block_type = 0x8000000000000003
mz_salt = 0x14187c7
mz_normflags = 0
mz_pad = [ 0, 0, 0, 0, 0 ]
mz_chunk = [
{
mze_value = 0x8000000000000004
mze_cd = 0
mze_pad = 0
mze_name = [ "smallfile" ]
}
]
}
> (200-80)%40=K
6 <-- there are 6 more mzap_ent_phys_t
> 80,6::print zfs`mzap_ent_phys_t
{
mze_value = 0
mze_cd = 0
mze_pad = 0
mze_name = [ '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0 ', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', ... ]
}
{
mze_value = 0x8000000000000006
mze_cd = 0
mze_pad = 0
mze_name = [ "words" ] <-- here is the file we want, object id is 6
}
{
mze_value = 0x8000000000000007
mze_cd = 0
mze_pad = 0
mze_name = [ "foo" ]
}
...
$q
#

Now, go back to the file system MOS to look at object id 6. If the object ID were greater than 32 (0x20), there would have been more work following other indirect blocks from the objset_phys_t for the file system. We assumed above that the root directory for the file system would have a low object number, and, fortunately,
the file we want to examine also has a low object number.
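
The arithmetic behind that: each 0x4000-byte level-0 block of the dnode array holds 32 dnode_phys_t, and each 16KB indirect block (dn_indblkshift = 0xe) holds 128 blkptr_t of 0x80 bytes each.  A hedged sketch of computing which level-0 block, which slot, and which blkptr index to follow at each level for a given object id:

#include <stdio.h>
#include <stdint.h>

#define DNODE_SIZE      0x200                   /* bytes per dnode_phys_t */
#define DNODE_BLOCK     0x4000                  /* LSIZE of a dnode block (32 dnodes) */
#define DNODES_PER_BLK  (DNODE_BLOCK / DNODE_SIZE)
#define BPS_PER_INDIR   (0x4000 / 0x80)         /* 16K indirect block / 128-byte blkptr */

int
main(void)
{
        uint64_t objid = 6;                     /* the "words" file */
        int nlevels = 7;                        /* dn_nlevels from the meta dnode above */

        uint64_t blkid = objid / DNODES_PER_BLK;    /* which level-0 dnode block */
        printf("object %llu: level-0 block %llu, slot %llu (byte offset 0x%llx)\n",
            (unsigned long long)objid, (unsigned long long)blkid,
            (unsigned long long)(objid % DNODES_PER_BLK),
            (unsigned long long)((objid % DNODES_PER_BLK) * DNODE_SIZE));

        /* index to follow in the indirect block at each level, top-down */
        for (int level = nlevels - 1; level >= 1; level--) {
                uint64_t div = 1;
                for (int i = 1; i < level; i++)
                        div *= BPS_PER_INDIR;
                printf("level %d: blkptr index %llu\n", level,
                    (unsigned long long)((blkid / div) % BPS_PER_INDIR));
        }
        return (0);
}

For object id 6, every index is 0 and the byte offset is 0xc00, which is why the walk above could simply follow the first blkptr_t at each level and then look at offset 6*0x200 in the level-0 block.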

# mdb /tmp/fs_mos
> (6*200)::print -a zfs`dnode_phys_t
{
c00 dn_type = 0x13 <-- plain file contents
c01 dn_indblkshift = 0xe
c02 dn_nlevels = 0x2 <-- one level of indirection
c03 dn_nblkptr = 0x1
c04 dn_bonustype = 0x11 <-- bonus buffer contains znode_phys_t
...
c40 dn_blkptr = [
{
c40 blk_dva = [
{
c40 dva_word = [ 0x4, 0x5ec ]
}
{
c50 dva_word = [ 0x4, 0xf00ec ]
}
{
c60 dva_word = [ 0, 0 ]
}
]
...
cc0 dn_bonus = [ 0x22, 0x86, 0xdb, 0x49, 0, 0, 0, 0, 0x9, 0x31, 0x20, 0x28, 0, 0, 0, 0, 0x22, 0x86, 0xdb, 0x49, 0, 0, 0, 0, 0x71, 0x48, 0x2b, 0x28, 0, 0, 0, 0, ... ]
}

A quick look at the znode_phys_t in the bonus buffer...

> cc0::print zfs`znode_phys_t
{
zp_atime = [ 0x49db8622, 0x28203109 ]
zp_mtime = [ 0x49db8622, 0x282b4871 ]
zp_ctime = [ 0x49db8622, 0x282b4871 ]
zp_crtime = [ 0x49db8622, 0x28203109 ]
zp_gen = 0x97
zp_mode = 0x8124
zp_size = 0x32752
zp_parent = 0x3
zp_links = 0x1
zp_xattr = 0
zp_rdev = 0
zp_flags = 0x40800000004
zp_uid = 0
zp_gid = 0
zp_zap = 0
zp_pad = [ 0, 0, 0 ]
zp_acl = {
z_acl_extern_obj = 0
z_acl_size = 0x30
z_acl_version = 0x1
z_acl_count = 0x6
z_ace_data = [ 0x1, 0, 0, 0x10, 0x26, 0, 0, 0, 0, 0, 0, 0x10, 0x11, 0x1,
0xc, 0, 0x1, 0, 0x40, 0x20, 0x26, 0, 0, 0, 0, 0, 0x40, 0x20, 0x1, 0, 0, 0, ...
]
}
}

When was the file created?

> 49db8622=Y
2009 Apr 7 18:58:10
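
The znode timestamps are stored as [seconds, nanoseconds] pairs, so the same conversion works outside of mdb as well; a quick hedged sketch:

#include <stdio.h>
#include <time.h>

int
main(void)
{
        time_t crtime = 0x49db8622;     /* zp_crtime[0] from the bonus buffer above */
        char buf[64];

        strftime(buf, sizeof (buf), "%Y %b %e %H:%M:%S", localtime(&crtime));
        printf("created: %s\n", buf);
        return (0);
}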

Now, let's look at the blkptr_t.

> c40::blkptr
DVA[0]: vdev_id 0 / bd800
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[0]: :0:bd800:400:id
DVA[1]: vdev_id 0 / 1e01d800
DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000
DVA[1]: :0:1e01d800:400:id
LSIZE: 4000 PSIZE: 400
ENDIAN: LITTLE TYPE: ZFS plain file
BIRTH: 97 LEVEL: 1 FILL: 200000000
CKFUNC: fletcher4 COMP: lzjb
CKSUM: 600e97db0e:411c4ea86350:1790b46d936d46:602547566d07cc7
$q
#

We're at level 1.

# ./zdb -Z tank:0:bd800:400
Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1
devidx = 1, offset = 25e00, size = 200
devidx = 2, offset = 25e00, size = 200
devidx = 3, offset = 25e00, size = 200
#

# mdb
> (25e00>>9)+(400000>>9)=D
8495

# dd if=/export/home/max/r2 of=/tmp/words_i1z iseek=8495 count=1
1+0 records in
1+0 records out
# dd if=/export/home/max/r3 of=/tmp/words_i2z iseek=8495 count=1
1+0 records in
1+0 records out
# cat /tmp/words_*z > /tmp/words_iz

Uncompress...

# ./zuncompress -l 4000 -p 400 /tmp/words_iz > /tmp/words_i

So, /tmp/words_i should contain uncompressed blkptr_t structures. These should take us to the data for the words file.

# mdb /tmp/words_i
> 0::blkptr
DVA[0]: vdev_id 0 / c0000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2800000000000
DVA[0]: :0:c0000:20000:d
LSIZE: 20000 PSIZE: 20000
ENDIAN: LITTLE TYPE: ZFS plain file
BIRTH: 97 LEVEL: 0 FILL: 100000000
CKFUNC: fletcher2 COMP: uncompressed
CKSUM: f5cbf93a151abcac:5b5d6ca83588d8ad:574d9b8bf334944b:ad78d30af51771d8
> 80::blkptr
DVA[0]: vdev_id 0 / e8000
DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2800000000000
DVA[0]: :0:e8000:20000:d
LSIZE: 20000 PSIZE: 20000
ENDIAN: LITTLE TYPE: ZFS plain file
BIRTH: 97 LEVEL: 0 FILL: 100000000
CKFUNC: fletcher2 COMP: uncompressed
CKSUM: f39ae34f048ae079:de2ef1af7d1fb495:ec3ae3f7985b2a98:c6d33ac68cb042b6
$q
#

So, where is the data?

# ./zdb -Z tank:0:c0000:20000
Columns = 5, bigcols = 0, asize = 28000, firstdatacol = 1
devidx = 1, offset = 26600, size = 8000
devidx = 2, offset = 26600, size = 8000
devidx = 3, offset = 26600, size = 8000
devidx = 4, offset = 26600, size = 8000
devidx = 0, offset = 26800, size = 8000

A little hex to decimal conversion for dd...

# mdb
> (26600>>9)+(400000>>9)=D
8499
> (26800>>9)+(400000>>9)=D
8500
> 8000>>9=D
64

Now, dump the blocks...

# dd if=/export/home/max/r2 of=/tmp/w1 iseek=8499 bs=512 count=64
64+0 records in
64+0 records out
# dd if=/export/home/max/r3 of=/tmp/w2 iseek=8499 bs=512 count=64
64+0 records in
64+0 records out
# dd if=/export/home/max/r4 of=/tmp/w3 iseek=8499 bs=512 count=64
64+0 records in
64+0 records out
# dd if=/export/home/max/r0 of=/tmp/w4 iseek=8500 bs=512 count=64
64+0 records in
64+0 records out

And concatenate the 4 files...

# cat /tmp/w[1-4]
10th
1st
2nd
3rd
4th
5th
6th
7th
8th
9th
a
AAA
AAAS
Aarhus
Aaron
AAU
ABA
Ababa
aback
...
Nostrand
nostril
not
notary
notate
notch
note
notebook
noteworthy
nothing
notice
notice#

That is the first 128k of the file. To get the remainder, we would need to follow the next blkptr_t at level 1 (the one shown above at 0x80). But, not today...

Wednesday, December 16, 2009

ZFS Data Recovery

I knew there was a reason to understand the ZFS on-disk format besides wanting to be able to teach about it...


Recently, I was sent an email from someone who had 15 years of video and music stored in a 10TB ZFS pool that, after a power failure, was no longer usable. He unfortunately did not have a backup. He was using ZFS version 6 on FreeBSD 7, and he asked if I could help. He also said that he had spoken with various people, including engineers within Sun, and was told that basically, it was probably not possible to recover the data.


He also got in touch with a data recovery company who told him they would assign a technician to examine the problem at a cost of $15,000/month. And if they could not restore his data, he would only have to pay 1/2 of the cost.


After spending about 1 week examining the data on the disk, I was able to restore basically all of it. The recovery was done on OpenSolaris 2009.06 (build 111b). After the recovery, he was able to view his data on the FreeBSD system (as well as OpenSolaris). I would be happy to do this for anyone else who runs into a problem where they can no longer access their ZFS pool, especially at $15k/month. And if I cannot do it, you would not need to pay anything!

Monday, December 14, 2009

Examining address spaces with mdb


A while ago, I was interested in more details about process address spaces. For instance, if a page is mapped into an address space, where is the page in physical memory? Or if a page is on swap, where is it on swap? Are there pages that are in memory, but not currently valid for a process? The meminfo(2) system call can be used by an application to examine the locations of physical pages corresponding to a range of virtual addresses that the process is using. Is there a tool for doing this from outside the process? Is there any tool for determining the locations of pages in memory when one is using liblgrp(3)? liblgrp(3) provides an API for specifying a "locality group". A locality group, as the man page says, "represents the set of CPU-like and memory-like hardware devices that are at most some locality apart from each other". Essentially, using liblgrp(3), one can specify the desired memory placement for memory that threads within a process are using.
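
For reference, here is roughly what the in-process approach looks like: a hedged sketch based on the meminfo(2) man page that asks for the physical address and page size backing a single virtual address. It only works on addresses in your own address space, which is part of what motivated the dcmd described below.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>

int
main(void)
{
        char *p = malloc(8192);
        uint64_t inaddr;
        uint_t info_req[] = { MEMINFO_VPHYSICAL, MEMINFO_VPAGESIZE };
        uint64_t outdata[2];
        uint_t validity;

        if (p == NULL)
                return (1);
        p[0] = 1;                       /* touch the page so it is resident */
        inaddr = (uint64_t)(uintptr_t)p;

        if (meminfo(&inaddr, 1, info_req, 2, outdata, &validity) != 0) {
                perror("meminfo");
                return (1);
        }
        if (validity & 2)               /* bit 1: first requested item is valid */
                printf("va %p -> pa 0x%llx\n", (void *)p,
                    (unsigned long long)outdata[0]);
        if (validity & 4)               /* bit 2: second requested item is valid */
                printf("page size %llu bytes\n", (unsigned long long)outdata[1]);
        return (0);
}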


So, I have written a dcmd, called segpages, for mdb that allows one to examine each virtual page of a segment in a process address space. The command gives the following information:


  • The virtual address of the page.
  • If the page is in memory, the physical address of the page.
  • If the page is on swap, the location on swap, and which swap device/file.
  • If the page is not currently in memory or on swap, a "-".
  • If the page is mapped from a file, the pathname of the file, and the offset within the file.
  • If the page is anonymous, the command prints "anon".
  • If the page is mapped to a device, the command only prints the physical address it is mapped to, and the path to the device.
  • The "share count" for the page, i.e., the number of processes sharing the same page.
  • The dcmd also prints the status of the page:

    • VALID -- The page is mapped
    • INMEMORY -- The page is in memory (it may not be valid for the process).
    • SWAPPED -- The page is on swap. Note that a page may be INMEMORY and SWAPPED. What I find more interesting, is pages that are SWAPPED and VALID. I expect to find INMEMORY pages that are also on swap. I did not expect to find SWAPPED pages that are also VALID, since I assumed that a page that was read in from swap and is now valid would not have a copy on swap. From a quick look at the code, it appears the swap slot is not freed until the reference count on the anon struct that is mapping the page has gone to 0. Anyone with a more complete understanding of this is welcome to comment.



Here is (very abbreviated) output for a running bash process.


First, a look at pmap output. Each line of the pmap output represents a "segment" of the address space. The different columns are described in the pmap(1) man page.

$ pmap -x 919
919: /bin/bash --noediting -i
Address Kbytes RSS Anon Locked Mode Mapped File
08045000 12 12 4 - rw--- [ stack ]
08050000 644 644 - - r-x-- bash
08100000 80 80 12 - rwx-- bash
08114000 52 52 28 - rwx-- [ heap ]
CE410000 624 512 - - r-x-- libnsl.so.1
CE4BC000 16 16 4 - rw--- libnsl.so.1
CE4C0000 20 8 - - rw--- libnsl.so.1
CE4F0000 56 52 - - r-x-- methods_unicode.so.3
CE50D000 4 4 - - rwx-- methods_unicode.so.3
CE510000 2416 752 - - r-x-- en_US.UTF-8.so.3
CE77B000 4 4 - - rwx-- en_US.UTF-8.so.3
CE960000 64 16 - - rwx-- [ anon ]
CE97E000 4 4 - - rwxs- [ anon ]
CE980000 4 4 - - rwx-- [ anon ]
CE990000 24 12 4 - rwx-- [ anon ]
CE9A0000 4 4 4 - rwx-- [ anon ]
CE9B0000 1280 972 - - r-x-- libc_hwcap1.so.1
CEAF0000 28 28 16 - rwx-- libc_hwcap1.so.1
CEAF7000 8 8 - - rwx-- libc_hwcap1.so.1
CEB00000 4 4 - - r-x-- libdl.so.1
CEB10000 4 4 4 - rwx-- [ anon ]
CEB20000 56 56 - - r-x-- libsocket.so.1
CEB3E000 4 4 - - rw--- libsocket.so.1
CEB40000 180 136 - - r-x-- libcurses.so.1
CEB7D000 28 28 - - rw--- libcurses.so.1
CEB84000 8 - - - rw--- libcurses.so.1
CEB90000 4 4 4 - rwx-- [ anon ]
CEBA0000 4 4 4 - rw--- [ anon ]
CEBB0000 4 4 - - rw--- [ anon ]
CEBBF000 180 180 - - r-x-- ld.so.1
CEBFC000 8 8 4 - rwx-- ld.so.1
CEBFE000 4 4 4 - rwx-- ld.so.1
-------- ------- ------- ------- -------
total Kb 5832 3620 92 -

# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs usba sockfs ip hook neti sctp arp uhci sd fctl md lofs audiosup fcip fcp random cpc crypto logindmux ptm ufs sppp ipc ]



First, load the dmod containing the new dcmd.


> ::load /wd320/max/source/mdb/segpages/i386/segpages.so
>


Now, walk through the segments of the process address space, showing
each virtual page in the segment. Note that some output has been omitted.


> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages
VA PA FILE OFFSET SHARES DISPOSITION
8045000 378C5000 [anon] 54518000 1 VALID
8046000 6EB06000 [anon] 54118000 1 VALID
8047000 5F9C7000 [anon] 540B8000 1 VALID
VA PA FILE OFFSET SHARES DISPOSITION
8050000 600A7000 bash 0 7 VALID
8051000 74368000 bash 1000 7 VALID
8052000 72669000 bash 2000 7 VALID
8053000 66C6A000 bash 3000 7 VALID
8054000 636AB000 bash 4000 0 INVALID,INMEMORY
8055000 5FDEC000 bash 5000 0 INVALID,INMEMORY
8056000 63EED000 bash 6000 0 INVALID,INMEMORY
8057000 62EAE000 bash 7000 0 INVALID,INMEMORY
8058000 5C52F000 bash 8000 7 VALID
8059000 5C5B0000 bash 9000 7 VALID
... output omitted
80ED000 5C2C4000 bash 9D000 7 VALID
80EE000 5C245000 bash 9E000 7 VALID
80EF000 5C286000 bash 9F000 3 VALID
80F0000 63A97000 bash A0000 0 INVALID,INMEMORY
VA PA FILE OFFSET SHARES DISPOSITION
8100000 79940000 [anon] 541D8000 1 VALID
8101000 5F0C1000 [anon] 62F00000 1 VALID
8102000 378C2000 [anon] 54438000 1 VALID
8103000 5EF5A000 bash A3000 6 VALID
8104000 5EEDB000 bash A4000 6 VALID
8105000 37885000 [anon] 543B8000 1 VALID
8106000 60E1D000 bash A6000 7 VALID
...
VA PA FILE OFFSET SHARES DISPOSITION
8114000 37914000 [anon] 54478000 1 VALID
8115000 79DD5000 [anon] 54368000 1 VALID
8116000 55356000 [anon] 62F90000 1 VALID
...
VA PA FILE OFFSET SHARES DISPOSITION
CE410000 7AE40000 libnsl.so.1 0 55 VALID
CE411000 7AEC1000 libnsl.so.1 1000 57 VALID
CE412000 7AE42000 libnsl.so.1 2000 57 VALID
CE413000 7AE83000 libnsl.so.1 3000 57 VALID
CE414000 7AE84000 libnsl.so.1 4000 57 VALID
...
CE42D000 6EE96000 libnsl.so.1 1D000 1A INVALID,INMEMORY
CE42E000 6E797000 libnsl.so.1 1E000 1A INVALID,INMEMORY
...
VA PA FILE OFFSET SHARES DISPOSITION
CE4F0000 17D9000 methods_unicode.so.3 0 29 VALID
CE4F1000 17DA000 methods_unicode.so.3 1000 2C VALID
...
VA PA FILE OFFSET SHARES DISPOSITION
CE510000 1869000 en_US.UTF-8.so.3 0 28 VALID
CE511000 18AA000 en_US.UTF-8.so.3 1000 2A VALID
...
CE518000 6F1EA000 en_US.UTF-8.so.3 8000 0 INVALID,INMEMORY
CE519000 6F1EB000 en_US.UTF-8.so.3 9000 0 INVALID,INMEMORY
CE51A000 6F1EC000 en_US.UTF-8.so.3 A000 0 INVALID,INMEMORY
...
CE5FF000 6DB60000 en_US.UTF-8.so.3 EF000 5 INVALID,INMEMORY
CE600000 1659000 en_US.UTF-8.so.3 F0000 7 INVALID,INMEMORY
...
CE6EE000 1687000 en_US.UTF-8.so.3 1DE000 9 INVALID,INMEMORY
CE6EF000 1688000 en_US.UTF-8.so.3 1DF000 9 INVALID,INMEMORY
CE6F0000 1649000 en_US.UTF-8.so.3 1E0000 9 INVALID,INMEMORY
...
CE729000 1782000 en_US.UTF-8.so.3 219000 29 VALID
CE72A000 1783000 en_US.UTF-8.so.3 21A000 29 VALID
...
CE730000 1709000 en_US.UTF-8.so.3 220000 29 VALID
CE731000 6F143000 en_US.UTF-8.so.3 221000 0 INVALID,INMEMORY
CE732000 6F144000 en_US.UTF-8.so.3 222000 0 INVALID,INMEMORY
...
VA PA FILE OFFSET SHARES DISPOSITION
CE9B0000 76A42000 libc_hwcap1.so.1 0 5B VALID
CE9B1000 76AC3000 libc_hwcap1.so.1 1000 5B VALID
...
VA PA FILE OFFSET SHARES DISPOSITION
...
CEBC4000 2A34000 ld.so.1 5000 47 VALID
CEBC5000 28B5000 ld.so.1 6000 47 VALID
CEBC6000 29F6000 ld.so.1 7000 60 VALID
CEBC7000 2937000 ld.so.1 8000 60 VALID
...
>


Some general things to note:


  • Physical pages are randomly distributed. However, pages from ld.so.1 tend to be in low memory in comparison to anonymous pages. This should be expected, since most pages of ld.so.1 are probably loaded early in the system's lifetime, as almost every application uses it.
  • There are many pages that are not valid, but are in memory. In general, text and data pages are prefetched when a program starts, unless the program is large or there is not enough free memory. Although pages are prefetched, it appears that they are not mapped into the process address space until/unless they are actually used.
  • Bash is not very large. Running the command above finishes in 5-10 seconds. Running the same command on a large program (for instance, firefox-bin), takes several minutes to complete. Running the command on a large 64-bit application will take considerably longer.
  • This is being run on a live system, so the address space of the process being examined may change while it is being walked.
  • At this point in time, no pages are swapped out.


Now, let's get some general statistics.


First, a count of the pages currently valid for the process. This is the currently mapped RSS. Note that the pmap command reports an "RSS" of 3620k, which is 905 4k-byte pages, but only 558 pages (or 2232k) are currently valid.


> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i " valid" | wc
558 3348 35712
>


Now, the pages in memory, but not currently valid in the page table(s) for the process.


> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i "inmemory" | wc
347 2082 26025
>


Note that the valid pages plus the in-memory pages add up to 905, the value reported by pmap. So RSS as reported by pmap does not imply that page faults will not happen for those pages. But if a page fault does occur, the correct page will be found in memory.


How many pages are currently not valid (and not in memory)?


> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i " invalid$" | wc
553 3298 36498
>


How large is the address space?


> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -v OFFSET | wc
1458 8728 98235
>


Note that this is 5832k, the total size as reported by pmap.


How many pages have been swapped out?


> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i swapped | wc
0 0 0
>


Now, we'll induce memory load on the system, and again examine the address space. The memory usage induced should be enough to cause pages to be swapped (paged) out.


First, pmap output after the memory stress.


$ pmap -x 919
919: /bin/bash --noediting -i
Address Kbytes RSS Anon Locked Mode Mapped File
08045000 12 4 - - rw--- [ stack ]
08050000 644 508 - - r-x-- bash
08100000 80 80 - - rwx-- bash
08114000 52 44 28 - rwx-- [ heap ]
CE410000 624 320 - - r-x-- libnsl.so.1
CE4BC000 16 16 4 - rw--- libnsl.so.1
CE4C0000 20 8 - - rw--- libnsl.so.1
CE4F0000 56 36 - - r-x-- methods_unicode.so.3
CE50D000 4 4 - - rwx-- methods_unicode.so.3
CE510000 2416 124 - - r-x-- en_US.UTF-8.so.3
CE77B000 4 4 - - rwx-- en_US.UTF-8.so.3
CE960000 64 16 - - rwx-- [ anon ]
CE97E000 4 4 - - rwxs- [ anon ]
CE980000 4 4 - - rwx-- [ anon ]
CE990000 24 12 4 - rwx-- [ anon ]
CE9A0000 4 4 4 - rwx-- [ anon ]
CE9B0000 1280 952 - - r-x-- libc_hwcap1.so.1
CEAF0000 28 28 12 - rwx-- libc_hwcap1.so.1
CEAF7000 8 8 - - rwx-- libc_hwcap1.so.1
CEB00000 4 4 - - r-x-- libdl.so.1
CEB10000 4 4 4 - rwx-- [ anon ]
CEB20000 56 56 - - r-x-- libsocket.so.1
CEB3E000 4 4 - - rw--- libsocket.so.1
CEB40000 180 68 - - r-x-- libcurses.so.1
CEB7D000 28 28 - - rw--- libcurses.so.1
CEB84000 8 - - - rw--- libcurses.so.1
CEB90000 4 4 4 - rwx-- [ anon ]
CEBA0000 4 4 4 - rw--- [ anon ]
CEBB0000 4 4 - - rw--- [ anon ]
CEBBF000 180 180 - - r-x-- ld.so.1
CEBFC000 8 8 4 - rwx-- ld.so.1
CEBFE000 4 4 4 - rwx-- ld.so.1
-------- ------- ------- ------- -------
total Kb 5832 2544 72 -

$


As expected, the RSS has gone down, but the virtual size remains the same. It is a little interesting that the amount reported under anon has also dropped by 20k.


Again, we'll use the new dcmd to examine the address space more closely.


> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages
VA PA FILE OFFSET SHARES DISPOSITION
8045000 378C5000 [anon] 54518000 1 VALID
8046000 - /dev/zvol/dsk/rpool/swap 17EBF000 0 INVALID,SWAPPED
8047000 - /dev/zvol/dsk/rpool/swap 1D8FE000 0 INVALID,SWAPPED
VA PA FILE OFFSET SHARES DISPOSITION
8050000 16B86000 bash 0 1 VALID
8051000 13A07000 bash 1000 1 VALID
8052000 6B088000 bash 2000 1 VALID
8053000 1889000 bash 3000 1 VALID
8054000 2430A000 bash 4000 1 VALID
8055000 6440B000 bash 5000 1 VALID
8056000 6684C000 bash 6000 1 VALID
8057000 7308D000 bash 7000 1 VALID
8058000 6DCCE000 bash 8000 0 INVALID,INMEMORY
8059000 3784F000 bash 9000 0 INVALID,INMEMORY
...
80ED000 4CB23000 bash 9D000 0 INVALID,INMEMORY
80EE000 76BE4000 bash 9E000 0 INVALID,INMEMORY
80EF000 5BA5000 bash 9F000 0 INVALID,INMEMORY
80F0000 36836000 bash A0000 0 INVALID,INMEMORY
VA PA FILE OFFSET SHARES DISPOSITION
8100000 - /dev/zvol/dsk/rpool/swap 247C2000 0 INVALID,SWAPPED
8101000 - /dev/zvol/dsk/rpool/swap 7CCD000 0 INVALID,SWAPPED
8102000 378C2000 [anon] 54438000 1 VALID
8103000 75479000 bash A3000 0 INVALID,INMEMORY
8104000 532BA000 bash A4000 0 INVALID,INMEMORY
8105000 37885000 [anon] 543B8000 1 VALID
8106000 7443C000 bash A6000 0 INVALID,INMEMORY
...
VA PA FILE OFFSET SHARES DISPOSITION
8114000 37914000 [anon] 54478000 1 VALID
8115000 79DD5000 [anon] 54368000 1 VALID
8116000 55356000 [anon] 62F90000 1 VALID
...
VA PA FILE OFFSET SHARES DISPOSITION
CE410000 7AE40000 libnsl.so.1 0 4C VALID
CE411000 7AEC1000 libnsl.so.1 1000 4E VALID
CE412000 7AE42000 libnsl.so.1 2000 4E VALID
CE413000 7AE83000 libnsl.so.1 3000 4E VALID
CE414000 7AE84000 libnsl.so.1 4000 4E VALID
...
CE42D000 6EE96000 libnsl.so.1 1D000 18 INVALID,INMEMORY
CE42E000 6E797000 libnsl.so.1 1E000 18 INVALID,INMEMORY
...
VA PA FILE OFFSET SHARES DISPOSITION
CE4F0000 17D9000 methods_unicode.so.3 0 27 VALID
CE4F1000 17DA000 methods_unicode.so.3 1000 2A VALID
...
VA PA FILE OFFSET SHARES DISPOSITION
CE510000 1869000 en_US.UTF-8.so.3 0 26 VALID
CE511000 18AA000 en_US.UTF-8.so.3 1000 28 VALID
...
CE518000 - en_US.UTF-8.so.3 8000 0 INVALID
CE519000 - en_US.UTF-8.so.3 9000 0 INVALID
CE51A000 - en_US.UTF-8.so.3 A000 0 INVALID
...
CE5FF000 - en_US.UTF-8.so.3 EF000 0 INVALID
CE600000 - en_US.UTF-8.so.3 F0000 0 INVALID
...
CE6EE000 1687000 en_US.UTF-8.so.3 1DE000 A INVALID,INMEMORY
CE6EF000 1688000 en_US.UTF-8.so.3 1DF000 A INVALID,INMEMORY
CE6F0000 1649000 en_US.UTF-8.so.3 1E0000 A INVALID,INMEMORY
...
CE729000 1782000 en_US.UTF-8.so.3 219000 27 VALID
CE72A000 1783000 en_US.UTF-8.so.3 21A000 27 VALID
...
CE730000 1709000 en_US.UTF-8.so.3 220000 27 VALID
CE731000 - en_US.UTF-8.so.3 221000 0 INVALID
CE732000 - en_US.UTF-8.so.3 222000 0 INVALID
...
VA PA FILE OFFSET SHARES DISPOSITION
CE9B0000 76A42000 libc_hwcap1.so.1 0 51 VALID
CE9B1000 76AC3000 libc_hwcap1.so.1 1000 51 VALID
...
VA PA FILE OFFSET SHARES DISPOSITION
CEBC4000 2A34000 ld.so.1 5000 42 VALID
CEBC5000 28B5000 ld.so.1 6000 42 VALID
CEBC6000 29F6000 ld.so.1 7000 57 VALID
CEBC7000 2937000 ld.so.1 8000 57 VALID
...
>


As expected, many pages that were previously valid are now invalid. Many of these pages are still in memory, but some have been swapped out. The output does not show it, but some pages that are swapped out can also be in memory (the page was swapped out and put on a freelist, but has not yet been re-used for some other purpose). It is interesting that some pages with reasonably high share counts are still in memory, but no longer valid for this instance of bash. The pageout code checks the share counts, and skips pages being shared by more than po_share processes. On my system, po_share is 8. So I am not sure what is marking the pages invalid (maybe a job for DTrace).


As before, I'll get some counts of valid, invalid, inmemory, and
swapped pages.


> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i " valid" | wc
413 2478 26432
>


Previously, the number of valid pages was 558, so 145 pages have been marked invalid and possibly swapped out.


The number of invalid pages is now:


> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i " invalid$" | wc
818 4888 53988
>


Previously, this was 553, so 265 pages that previously were valid are
now invalid.


> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i "inmemory" | wc
215 1290 16125
>


And 215 pages that are invalid are still in memory, but the page table entries for the bash instance do not have the pages mapped.


> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i swapped | wc
12 72 936


And 12 pages of bash are on swap.


It would be nice to be able to show this graphically. For instance, a large box representing the address space, with different colored pixels to represent the state of the different pages of the address space. I have been told that JavaFX is good for this, but my knowledge of Java is really not up to it. Especially for large processes, a graphical view would be nice (well, at least interesting to look at...).


I have not tried the dcmd on SPARC or x64, but I expect it to work (at least on x64). I would also like to try this on a large machine which has latency groups set up. If anyone has such a machine and would like to try this out, please let me know.


I also have a version of the command that only prints summary information. I want to add an option that prints page sizes, but currently the command assumes all pages are the native page size (4k on x86/x64 and 8k on SPARC).


If there is interest, I'll make the code for the dcmd available.