tag:blogger.com,1999:blog-72455182024-03-07T14:44:20.477-08:00Max Bruning's weblogVery occasional tech stuff...Unknownnoreply@blogger.comBlogger23125tag:blogger.com,1999:blog-7245518.post-31560210973862643492012-11-12T00:35:00.000-08:002012-11-12T00:35:05.170-08:00Hadoop bug on SmartOS<b id="internal-source-marker_0.8931944540236145"><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">Recently I had a chance to help with a problem that occurred when trying to run a Hadoop benchmark on SmartOS. Some of the Java code written for Hadoop implicitly assumed that it was running on Linux. When running the benchmark, the following error showed up:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">12/10/01 20:58:49 INFO mapred.JobClient: Task Id : attempt_201209262235_0003_m_000003_0, Status : FAILED</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">ENOENT: No such file or directory</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">at 
org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:312)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:385)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">at org.apache.hadoop.mapred.Child.main(Child.java:229)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">The NativeIO.open call basically calls the open(2) system call. Here, it is being called from</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">createForWrite() in SecureIOUtils.java at line 161. 
Here is the code for SecureIOUtils.java:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">/**</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * Open the specified File for write access, ensuring that it does not exist.</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * @param f the file that we want to create</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * @param permissions we want to have on the file (if security is enabled)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> *</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * @throws AlreadyExistsException if the file already exists</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * @throws IOException if any other error occurred</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static FileOutputStream createForWrite(File f, int permissions)</span><br /><span 
style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> throws IOException {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> if (skipSecurity) {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> return insecureCreateForWrite(f, permissions);</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> } else {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> // Use the native wrapper around open(2)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> try {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> FileDescriptor fd = NativeIO.open(f.getAbsolutePath(), &lt;-- line 161</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> NativeIO.O_WRONLY | NativeIO.O_CREAT | NativeIO.O_EXCL,</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> permissions);</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> return new FileOutputStream(fd);</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> } catch (NativeIOException nioe) {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: 
normal; vertical-align: baseline; white-space: pre-wrap;"> if (nioe.getErrno() == Errno.EEXIST) {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> throw new AlreadyExistsException(nioe);</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> }</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> throw nioe;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> }</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> }</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> }</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">So, the open is called with O_WRONLY, O_CREAT, and O_EXCL flags. However, the truss(1) output</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">shows a different story. 
We started the following truss on a slave machine, and ran the test again:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"># truss -f -a -wall -topen,close,fork,write,stat,fstat -o ~/mapred.truss -p $(pgrep -f Djava.library.path) </span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">And here is the relevant truss output:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">51039/28: open("/opt/local/hadoop/bin/../logs/userlogs/job_201210171129_0008/attempt_201210171129_0008_m_000002_1/log.tmp", O_WRONLY|O_DSYNC|O_NONBLOCK) Err#2 ENOENT</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">The error message is emitted shortly after the above open(2) system call. So, the code shows O_WRONLY, O_CREAT, and O_EXCL, which is what one</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">would expect for a routine that is called createForWrite(). 
However, the flags actually passed to open() are: O_WRONLY, O_DSYNC, and O_NONBLOCK.</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">Why the difference?</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">Grepping for O_CREAT in the hadoop source finds it defined at:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">./trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">/**</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * JNI wrappers for various native IO-related calls not available in Java.</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * These functions should generally be used alongside a fallback to another</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * more portable mechanism.</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> */</span><br /><span 
style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">public class NativeIO {</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> // Flags for open() call from bits/fcntl.h</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_RDONLY = 00;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_WRONLY = 01;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_RDWR = 02;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_CREAT = 0100;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_EXCL = 0200;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_NOCTTY = 0400;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_TRUNC = 01000;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_APPEND = 02000;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_NONBLOCK = 04000;</span><br /><span 
style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_SYNC = 010000;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_ASYNC = 020000;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_FSYNC = O_SYNC;</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> public static final int O_NDELAY = O_NONBLOCK;</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">The comment in the above code says that the flags for the open(2) call come from bits/fcntl.h, a Linux-specific header.</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">However, on SmartOS (as well as illumos and Solaris), the same flags in sys/fcntl.h show:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">/*</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * Flag values accessible to open(2) and fcntl(2)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * The first five can only be set (exclusively) by open(2).</span><br /><span 
style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_RDONLY 0</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_WRONLY 1</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_RDWR 2</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_SEARCH 0x200000</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_EXEC 0x400000</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#if defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_NDELAY 0x04 /* non-blocking I/O */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#endif /* defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE) */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_APPEND 0x08 /* append (writes guaranteed at the end) */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#if defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE) || \</span><br /><span style="font-family: 'Courier New'; 
font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> (_POSIX_C_SOURCE > 2) || defined(_XOPEN_SOURCE)</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_SYNC 0x10 /* synchronized file update option */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_DSYNC 0x40 /* synchronized data update option */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_RSYNC 0x8000 /* synchronized file update option */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> /* defines read/write file integrity */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#endif /* defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE) ... 
*/</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_NONBLOCK 0x80 /* non-blocking I/O (POSIX) */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#ifdef _LARGEFILE_SOURCE</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_LARGEFILE 0x2000</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#endif</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">/*</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> * Flag values accessible only to open(2).</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_CREAT 0x100 /* open with file create (uses third arg) */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_TRUNC 0x200 /* open with truncation */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_EXCL 0x400 /* exclusive open */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; 
white-space: pre-wrap;">#define O_NOCTTY 0x800 /* don't allocate controlling tty (POSIX) */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_XATTR 0x4000 /* extended attribute */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_NOFOLLOW 0x20000 /* don't follow symlinks */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">#define O_NOLINKS 0x40000 /* don't allow multiple hard links */</span><br /><span style="font-family: 'Courier New'; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">The O_CREAT flag (from bits/fcntl.h) is 0100 (octal) in the NativeIO.java file, but 0x100 on SmartOS. Octal 0100 is hex 0x40, which corresponds to O_DSYNC on SmartOS. Similarly, the octal O_EXCL value of 0200 is hex 0x80, which is O_NONBLOCK on SmartOS. Whoever wrote this code assumed it would run on a Linux system. The flags are different yet again on FreeBSD and Mac OS (for instance, O_CREAT is 0x200 on those systems). My colleague, Filip Hajny, changed the flags to match the SmartOS values and rebuilt everything to fix the problem.</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">This problem reminds me how many little things like this can occur when porting an application developed on one operating system to another. 
For all but the simplest of applications, some changes are going to be needed. In this case, POSIX specifies the flags that open(2) can take (O_CREAT, O_RDWR, etc.), but does not specify the values of those flags. If the native code took the flag values from the platform's own fcntl.h rather than hard-coding them, the problem would not occur. It is an important reminder that all code should be reviewed and tested on as many different systems as possible.</span></span></b>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-86149898157764354952012-07-09T00:32:00.000-07:002012-07-09T00:55:14.546-07:00Why take a SmartOS/illumos Internals or ZFS Internals course?<br />
<div style="background-color: white; margin-bottom: 8px; margin-left: 16px; margin-right: 16px; margin-top: 8px; min-width: 0px; width: 653px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
I have been teaching OS internals courses for many years, starting with Bell Labs/AT&T Unix System III in 1982, moving on to System V, SVR2, SVR3, and SVR4, and teaching Solaris internals since 1994. Along the way, I have also taught HP-UX internals, various device driver courses, and kernel debugging courses. I started using Unix with the Sixth Edition in 1975. I have also done a fair amount of kernel development and debugging, along with some user-level work.</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
The audiences for my internals courses have been quite varied. Many of the people I have taught have been in support or sustaining organizations, but I have also taught developers, system administrators, Java programmers, QA people, hardware engineers, and even end users. Along the way, I have been asked by various people (many of them managers), "Why should I or my team take this course? What will I or my team get out of this training?"</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
In response to the first question, I usually tell people that an internals course should teach students how the system works, and why it works the way it does. In other words, the course teaches the data structures and algorithms the operating system uses to manage the resources of the computer, and explains the architecture of the system as well as the rationale behind its design decisions. My view is that knowledge of how the system works can benefit everyone. For developers (especially kernel developers), it is key to adding new functionality. For system administrators, it helps with troubleshooting and performance analysis. Tools like DTrace become even more useful when one knows what's going on in the system. In general, knowledge of how the system works allows everyone to make better use of it.</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
As for the specific skills acquired in an internals course, I make very extensive use of the tools that come with the system throughout the training, both when I am lecturing and in lab work. My view has always been that in order to learn the concepts being taught, one must be able to actually "see" them. Tools like DTrace, mdb, kmdb, and other observability mechanisms are key to doing this. I do not "teach" the tools; rather, we use them in many examples throughout the course. At the end of the course, I am satisfied if my students can start to learn things on their own; a good internals course should be an "enabling" course. Some students may never again use these specific tools in the specific ways we use them, but they will know that one can actually determine what the system is doing at any given time. Others will use the tools consistently in their work.</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
As with all training, you only get out of it what you put into it.</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
If you're interested in internals training, please visit <a href="http://smartos.org/2012/06/14/training-from-joyent/">Training from Joyent</a>.</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
I hope to see people in class soon!</div>
<div>
<br /></div>
</div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-21004791056086811972012-06-15T08:50:00.000-07:002012-06-15T08:50:17.348-07:00SmartOS/Illumos TrainingIf you are reading this, you are probably here either because you saw my post on Twitter, or because you searched for "zfs recovery" (see <a href="http://mbruning.blogspot.ch/2009/12/zfs-data-recovery.html">here</a>). This is the first post I have written here since 2009, so it is time to write again.<br />
<br />
On a different blog, I have written about using <a href="http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/">flamegraphs</a>; see <a href="http://smartos.org/2012/02/28/using-flamegraph-to-solve-ip-scaling-issue-dce/">Using flamegraph to Solve IP Scaling Issue</a>.
Rather than spending time saying what I've been doing since my last blog entry here, I want to talk a bit about what I am doing now. If you're interested in what I have been doing, see <a href="http://smartos.org/2011/08/22/its-here-kvm-on-illumos/">KVM on Illumos</a>.<br />
<br />
Since I wrote the posts on ZFS recovery, I have been getting emails, at a rate that is slowly increasing over time (now ~2 per week), from people asking if I can help with ZFS problems. If I had received this many emails when I wrote those posts, I might be working full time now on ZFS recovery issues. As it is, I am now very busy working for <a href="http://www.joyent.com/">Joyent</a>, and have not had time to answer as many of the ZFS requests as I would like. My apologies to the people whom I have not answered. For those of you who have asked for my mdb and zdb modifications, please send me email at <a href="mailto:max@joyent.com">max_at_joyent_dot_com</a>. If I get enough requests, it is possible that the modifications may find their way into SmartOS (and Illumos).<br />
<br />
If you would like help with ZFS problems, I can better justify my time if you download Joyent's SmartDataCenter product, available <a href="http://www.joyent.com/adoption">here</a>, and give it a try. If you're interested in SmartOS (simply the best Solaris-based operating system you can use), it is available for download at <a href="http://www.smartos.org/">www.smartos.org</a>. Joyent fully supports SmartOS for use in its SmartDataCenter product, so you are more likely to get help in a timely fashion with problems than I am able to provide on my own.<br />
<br />
And, what am I doing now? Joyent is offering classes on SmartDataCenter, DTrace, and SmartOS/Illumos Internals. I am involved with developing the courseware, and shall be (along with <a href="http://dtrace.org/blogs/brendan/">Brendan Gregg</a>) delivering the courses. For more information, see <a href="http://smartos.org/2012/06/14/training-from-joyent/">Training from Joyent</a>.<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-56239066546241864342009-12-20T23:54:00.000-08:002009-12-21T05:05:23.485-08:00ZFS Raidz Data WalkSeveral months ago, I wrote in my blog about raidz on disk format (see <a href="http://mbruning.blogspot.com/2009/04/raidz-on-disk-format.html">http://mbruning.blogspot.com/2009/04/raidz-on-disk-format.html</a>). In that blog, I went over the high level details. Here, I thought I would show the low level stuff that I did to determine the layout. I am using a modified zdb and mdb to walk through the on-disk data structures to find the data for a copy of the /usr/dict/words file that I made on a raidz file system.<br /><br />The raidz volume contains 5 equal size devices. Since I don't have 5 disks lying around, I created 5 equal sized files (/export/home/max/r0 through /export/home/max/r4). I'll use the term disk throughout this discussion to refer to one of these files.<br /><code><br /># zpool status -v tank<br />pool: tank<br />state: ONLINE<br />scrub: none requested<br />config:<br /><br /> NAME STATE READ WRITE CKSUM<br /> tank ONLINE 0 0 0<br /> raidz1 ONLINE 0 0 0<br /> /export/home/max/r0 ONLINE 0 0 0<br /> /export/home/max/r1 ONLINE 0 0 0<br /> /export/home/max/r2 ONLINE 0 0 0<br /> /export/home/max/r3 ONLINE 0 0 0<br /> /export/home/max/r4 ONLINE 0 0 0<br /><br />errors: No known data errors<br />#<br /></code><br />I'll umount the file system so things don't change while I'm examining the on-disk structures.<br /><code><br /># zfs umount tank<br />#<br /></code><br />And, as I have done in the past, I walk the data structures to get to the "words" file by starting at the uberblock_t. 
If you get lost during this walk, you can always refer to the diagram "ZFS On-Disk Layout - The Big Picture", page 4 in <a href="http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf">http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf</a> from the OpenSolaris Developer's Conference, 2008 in Prague.<br /><br />First, the active uberblock_t.<br /><code><br /># zdb -uuu tank<br />Uberblock<br /><br /> magic = 0000000000bab10c<br /> version = 13<br /> txg = 1280<br /> guid_sum = 6800651560363961243<br /> timestamp = 1239197133 UTC = Wed Apr 8 15:25:33 2009<br /> rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:1e007400:400> DVA[1]=<0:9400:400> DVA[2]=<0:3c003800:400> fletcher4 lzjb LE contiguous birth=1280 fill=27 cksum=9ad89e117:40b4956a12c:db76af09e62f:1f779fd1db6115<br />#<br /></code><br />Now, I use a new command I added to zdb to allow me to see the raidz mapping. The "-Z" option takes the pool name, device id, location, and physical size as arguments, and prints the device index, location, and size for each piece of the corresponding data plus parity.<br /><code><br /># ./zdb -Z tank:0:1e007400:200<br />Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1<br />devidx = 3, offset = 6001600, size = 200<br />devidx = 4, offset = 6001600, size = 200<br />#<br /></code><br />So, the 0x200 byte parity is on the fourth disk (devidx = 3), and the 0x200 byte objset_phys_t is on the fifth disk (devidx = 4). (Of course, either one will work since there are only 2).<br /><br />Now, convert the hex offset to an absolute decimal block number. The 0x400000 skips the disk labels at the front of each device in the volume.<br /><code><br /># mdb<br />> (6001600>>9)+(400000>>9)=D<br /> 204811 <br /></code><br />Attempting to use zdb with the -R option to read the blocks causes an assertion failure in zdb (at least, that was the state back in April, when I wrote the original blog on raidz).
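As an aside, this offset arithmetic, along with decoding the raw dva_word pairs that mdb prints, recurs at every step of the walk, so it is worth scripting. The Python below is my own helper sketch (not part of zdb or mdb); it assumes the standard DVA encoding (asize in the low 24 bits of word 0 and vdev in its high 32 bits, offset in the low 63 bits of word 1, both counted in 512-byte sectors) and the 4MB label/boot area at the front of each device.

```python
# Sketch: decode a ZFS dva_word pair, and convert a zdb -Z column offset
# to an absolute 512-byte block number usable as an iseek value for dd.

LABEL_SKIP = 0x400000  # two 256KB labels plus the boot region at the front of each device

def decode_dva(word0, word1):
    """Return (vdev, offset, asize, is_gang), with offset/asize in bytes."""
    vdev = word0 >> 32                       # top 32 bits of word 0
    asize = (word0 & 0xffffff) << 9          # low 24 bits of word 0, in sectors
    gang = bool(word1 >> 63)                 # high bit of word 1
    offset = (word1 & ((1 << 63) - 1)) << 9  # remaining 63 bits, in sectors
    return vdev, offset, asize, gang

def disk_block(col_offset):
    """Absolute 512-byte block for an offset reported by zdb -Z."""
    return (col_offset >> 9) + (LABEL_SKIP >> 9)
```

For instance, disk_block(0x6001600) gives 204811, the block number computed with mdb above, and decode_dva(0x2, 0xf0038) recovers offset 0x1e007000 with asize 0x400, a DVA that appears later in the walk.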
So, instead I use dd to dump the raw data into a file.<br /><code><br /># dd if=/export/home/max/r4 of=/tmp/objset_t bs=512 count=1 iseek=204811<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />Now, I'll uncompress the data. The size after decompression should be 0x400 bytes (as specified in the block pointer in the uberblock_t above). For this, I use a utility I wrote called zuncompress. This utility takes an option which allows one to specify the compression algorithm used. The default is lzjb. The output should be the objset_phys_t for the meta object set (MOS).<br /><code><br /># ./zuncompress -p 200 -l 400 /tmp/objset_t > /tmp/objset<br />#<br /></code><br />And now, I'll use my modified mdb to print the objset_phys_t.<br /><code><br /># mdb /tmp/objset<br />> 0::print -a zfs`objset_phys_t<br />{<br /> 0 os_meta_dnode = {<br /> 0 dn_type = 0xa <-- DMU_OT_DNODE <br /> 1 dn_indblkshift = 0xe <br /> 2 dn_nlevels = 0x1<br /> ... <br /> 40 dn_blkptr = [<br /> { <br /> 40 blk_dva = [ <br /> { <br /> 40 dva_word = [ 0x8, 0xf0050 ] <br /> } <br /> { <br /> 50 dva_word = [ 0x8, 0x40 ] <br /> } <br /> { <br /> 60 dva_word = [ 0x8, 0x1e0028 ] <br /> } <br /> ]<br /> ... <br />} <br /></code><br />And the blkptr_t at 0x40:<br /><code><br />> 40::blkptr<br />DVA[0]: vdev_id 0 / 1e00a000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 100000000000<br />DVA[0]: :0:1e00a000:a00:d<br />DVA[1]: vdev_id 0 / 8000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 100000000000<br />DVA[1]: :0:8000:a00:d<br />DVA[2]: vdev_id 0 / 3c005000<br />DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 100000000000<br />DVA[2]: :0:3c005000:a00:d<br />LSIZE: 4000 PSIZE: a00<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 500 LEVEL: 0 FILL: 1a00000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: a182339fe8:ded0f7be7047:bcb1c1a96b94cc:765bd519587bfb41<br />$q<br />#<br /></code><br />So, "LEVEL: 0" means no indirection. The next object is the MOS, which is an array of dnode_phys_t. 
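It is also worth seeing where the zdb -Z numbers come from. The Columns, bigcols, devidx, offset, and size values follow the raidz mapping the kernel builds in vdev_raidz_map_alloc(). The Python below is my own transcription of that logic (simplified: no error handling, and the parity column is simply listed first), so treat it as a sketch rather than the authoritative code.

```python
# Sketch of the raidz column mapping (after vdev_raidz_map_alloc()).
# offset and psize are byte values from a DVA; dcols is the number of
# child devices; unit_shift=9 means 512-byte allocation units.

def raidz_map(offset, psize, dcols, nparity=1, unit_shift=9):
    """Return [(devidx, offset, size), ...], parity column first."""
    b = offset >> unit_shift            # starting unit on the logical device
    s = psize >> unit_shift             # number of data units to place
    f = b % dcols                       # column holding the first (parity) unit
    o = (b // dcols) << unit_shift      # byte offset within each child device
    q, r = divmod(s, dcols - nparity)   # q full rows, plus r leftover units
    bc = 0 if r == 0 else r + nparity   # "big columns" get one extra unit
    acols = dcols if q > 0 else bc      # columns actually used
    cols = []
    for c in range(acols):
        col, coff = f + c, o
        if col >= dcols:                # wrap around to the next row
            col -= dcols
            coff += 1 << unit_shift
        size = (q + (1 if c < bc else 0)) << unit_shift
        cols.append((col, coff, size))
    return cols
```

Called as raidz_map(0x1e007400, 0x200, 5), it returns the same two columns shown above: devidx 3 (parity) and devidx 4, both at offset 0x6001600 with size 0x200.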
Let's see how the MOS is laid out on the raidz volume.<br /><code><br /># ./zdb -Z tank:0:1e00a000:a00<br />Columns = 5, bigcols = 2, asize = 1000, firstdatacol = 1<br />devidx = 0, offset = 6002000, size = 400<br />devidx = 1, offset = 6002000, size = 400<br />devidx = 2, offset = 6002000, size = 200<br />devidx = 3, offset = 6002000, size = 200<br />devidx = 4, offset = 6002000, size = 200<br />#<br /></code><br />So, disk 0 contains parity, and disks 1, 2, 3, and 4 contain the MOS. The MOS is compressed with lzjb compression. We'll use dd to dump the 4 blocks containing the MOS to a file, then decompress the MOS.<br /><br />I'll use mdb to translate the blkptr DVA address to a block on the disks. Note that all blocks in this example are at the same location (0x6002000).<br /><code><br /># mdb<br />> (6002000>>9)+(400000>>9)=D<br /> 204816 <br /></code><br />And now dd each of the blocks. The first disk (/export/home/max/r0) is parity. The second disk contains 0x400 bytes. The other 3 disks contain 0x200 bytes each. So the total size of the compressed data is 0x400 + 0x200 + 0x200 + 0x200, or 0xa00 bytes, which agrees with the PSIZE field in the blkptr_t. Note that the size of the parity block must be equal to the size of the largest block (0x400).<br /><code><br /># dd if=/export/home/max/r1 of=/tmp/mos_z1 iseek=204816 count=2<br />2+0 records in<br />2+0 records out<br /># dd if=/export/home/max/r2 of=/tmp/mos_z2 iseek=204816 count=1<br />1+0 records in<br />1+0 records out<br /># dd if=/export/home/max/r3 of=/tmp/mos_z3 iseek=204816 count=1<br />1+0 records in<br />1+0 records out<br /># dd if=/export/home/max/r4 of=/tmp/mos_z4 iseek=204816 count=1<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />Now, concatenate the files to get the compressed MOS.<br /><code><br /># cat /tmp/mos_z* > /tmp/mos_comp<br /></code><br />And uncompress.
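If you do not have my zuncompress utility, the decompression step can be reproduced with a few lines of code; lzjb's decompression side is tiny. The Python below is my own transcription of lzjb_decompress() from the OpenSolaris source, offered as an illustrative sketch rather than a drop-in replacement.

```python
def lzjb_decompress(src, d_len):
    """Decompress lzjb bytes in src, producing exactly d_len output bytes.

    lzjb interleaves a "copymap" byte before every 8 items; a clear bit
    means a literal byte, a set bit means a (length, offset) back-reference
    packed into two bytes (6 bits of length, 10 bits of offset).
    """
    MATCH_BITS, MATCH_MIN = 6, 3
    OFFSET_MASK = (1 << (16 - MATCH_BITS)) - 1
    dst = bytearray()
    copymap, copymask = 0, 1 << 7
    i = 0
    while len(dst) < d_len:
        copymask <<= 1
        if copymask == 1 << 8:          # consumed 8 items; fetch next map byte
            copymask = 1
            copymap = src[i]; i += 1
        if copymap & copymask:          # back-reference into earlier output
            mlen = (src[i] >> (8 - MATCH_BITS)) + MATCH_MIN
            offset = ((src[i] << 8) | src[i + 1]) & OFFSET_MASK
            i += 2
            start = len(dst) - offset
            if start < 0:
                raise ValueError("back-reference before start of output")
            for _ in range(mlen):
                if len(dst) >= d_len:
                    break
                dst.append(dst[start]); start += 1
        else:                           # literal byte
            dst.append(src[i]); i += 1
    return bytes(dst)
```

Running it over /tmp/mos_comp with d_len = 0x4000 (the LSIZE from the blkptr_t) should yield the same bytes zuncompress produces.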
The size after decompression, according to the blkptr is 0x4000 (LSIZE in the blkptr).<br /><code><br /># ./zuncompress -l 4000 -p a00 /tmp/mos_comp > /tmp/mos<br /></code><br />And I'll use the modified mdb to dump out the MOS.<br /><code><br /># mdb /tmp/mos<br />> ::sizeof zfs`dnode_phys_t<br />sizeof (zfs`dnode_phys_t) = 0x200<br /><br />> 4000%200=K<br /> 20 <-- There are 32 dnode_phys_t in the MOS<br />> 0,20::print -a zfs`dnode_phys_t<br />{<br /> 0 dn_type = 0 <-- DMU_OT_NONE, first is unused<br /> ... <br />} <br />{ <br /> 200 dn_type = 0x1 <-- DMU_OT_OBJECT_DIRECTORY<br /> ...<br /> 240 dn_blkptr = [<br /> {<br /> 240 blk_dva = [<br /> {<br /> 240 dva_word = [ 0x2, 0x24 ]<br /> }<br /> ...<br />}<br />{<br /> 400 dn_type = 0xc <-- DMU_OT_DSL_DIR (DSL Directory)<br /> ...<br /> 404 dn_bonustype = 0xc<br /> ...<br /> 4c0 dn_bonus = [ 0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0, 0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0,<br /> ... ]<br />}<br />{<br /> 600 dn_type = 0xf<br /> ...<br />{<br /> 1600 dn_type = 0x10 <-- DMU_OT_DSL_DATASET (DSL DataSet)<br /> ...<br /> 1604 dn_bonustype = 0x10<br /> ...<br /> 16c0 dn_bonus = [ 0x8, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, 0x1, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,<br /> ... ]<br />}<br /> ...<br /></code><br />The blkptr_t at 0x240 is for the Object Directory. 
Let's take a closer look.<br /><code><br />> 240::blkptr<br />DVA[0]: vdev_id 0 / 4800<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:4800:200:d<br />DVA[1]: vdev_id 0 / 1e004800<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:1e004800:200:d<br />DVA[2]: vdev_id 0 / 3c000000<br />DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[2]: :0:3c000000:200:d<br />LSIZE: 200 PSIZE: 200<br />ENDIAN: LITTLE TYPE: object directory<br />BIRTH: 4 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher4 COMP: uncompressed<br />CKSUM: 5d4dec3ac:1e59c2be429:5825c81154e8:b9b170eedd49e<br />$q<br />#<br /></code><br />We'll use zdb to find out where ZFS has put the 0x200 byte object directory.<br /><code><br /># ./zdb -Z tank:0:4800:200<br />Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1<br />devidx = 1, offset = e00, size = 200<br />devidx = 2, offset = e00, size = 200<br />#<br /></code><br />So, the parity is on the second disk (devidx = 1), and the object directory (a ZAP object), is on the third disk.<br /><br />We'll convert the offset into a block address.<br /><code><br /># mdb<br />> (e00>>9)+(400000>>9)=D<br /> 8199 <br /></code><br /><br />And dump the 0x200 (i.e, 512byte) block.<br /><code><br /># dd if=/export/home/max/r2 of=/tmp/objdir iseek=8199 count=1<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />The ZAP object is not compressed (see the above blkptr_t). So, no need to uncompress. 
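The 512-byte block just dumped is a microzap, the compact ZAP form used for small objects, as the mdb session below will show. Parsing one outside mdb is straightforward; here is a Python sketch assuming the classic layout (a 64-byte mzap_phys_t header whose first word is 0x8000000000000003, followed by 64-byte mzap_ent_phys_t entries: u64 value, u32 cd, u16 pad, 50-byte name):

```python
import struct

MZAP_ENT_LEN = 64
ZBT_MICRO = (1 << 63) | 3   # 0x8000000000000003, the microzap block type

def read_mzap(block):
    """Yield (name, value) pairs from a microzap block."""
    (block_type,) = struct.unpack_from("<Q", block, 0)
    assert block_type == ZBT_MICRO, "not a microzap block"
    # Entries start right after the 64-byte mzap_phys_t header.
    for off in range(64, len(block), MZAP_ENT_LEN):
        value, cd, pad = struct.unpack_from("<QIH", block, off)
        # mze_name is the trailing 50 bytes, NUL-terminated.
        name = block[off + 14 : off + MZAP_ENT_LEN].split(b"\0", 1)[0].decode()
        if name:                # skip unused (all-zero) entries
            yield name, value
```

Note that in the object directory the value is the plain object id, while in directory ZAPs the top bits of mze_value also encode the object type, which is why values like 0x8000000000000004 appear later in the walk.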
We'll use mdb to look at the zap.<br /><code><br /># mdb /tmp/objdir<br />> 0/J<br />0: 8000000000000003 <-- a microzap object<br />><br /><br />> 0::print -a -t zfs`mzap_phys_t<br />{<br /> 0 uint64_t mz_block_type = 0x8000000000000003<br /> 8 uint64_t mz_salt = 0x32064dbb<br /> 10 uint64_t mz_normflags = 0<br /> 18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]<br /> 40 mzap_ent_phys_t [1] mz_chunk = [<br /> {<br /> 40 uint64_t mze_value = 0x2<br /> 48 uint32_t mze_cd = 0<br /> 4c uint16_t mze_pad = 0<br /> 4e char [50] mze_name = [ "root_dataset" ]<br /> }<br /> ]<br />}<br />$q<br />#<br /></code><br />There are more mzap_ent_phys_t in the object, but we are only concerned with the root dataset. This is object id 2, so we'll go back to the MOS, and examine the dnode_phys_t at index 2.<br /><code><br /># mdb /tmp/mos<br />> 2*200::print -a zfs`dnode_phys_t <-- Each dnode_phys_t is 0x200 bytes<br />{<br /> 400 dn_type = 0xc <-- DMU_OT_DSL_DIR<br /> ...<br /> 404 dn_bonustype = 0xc <-- DMU_OT_DSL_DIR<br /> ...<br /> 4c0 dn_bonus = [ 0x39, 0x75, 0xdb, 0x49, 0, 0, 0, 0, 0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, ... ]<br />}<br /></code><br />The bonus buffer contains a dsl_dir_phys_t.<br /><code><br />> 4c0::print -a zfs`dsl_dir_phys_t<br />{<br /> 4c0 dd_creation_time = 0x49db7539<br /> 4c8 dd_head_dataset_obj = 0x10<br />...<br />}<br /></code><br />The DSL DataSet is object id 0x10 (dd_head_dataset_obj = 0x10).<br /><code><br />> 10*200::print -a zfs`dnode_phys_t<br />{<br /> 2000 dn_type = 0x10 <-- DMU_OT_DSL_DATASET<br /> ...<br /> 2004 dn_bonustype = 0x10 <-- bonus buffer contains dsl_dataset_phys_t<br /> ...<br /> 20c0 dn_bonus = [ 0x2, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, 0x1, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 
] <br />} <br /></code><br />Now, the dsl_dataset_phys_t in the bonus buffer of the DSL DataSet dnode.<br /><code><br />> 20c0::print -a zfs`dsl_dataset_phys_t<br />{<br /> 20c0 ds_dir_obj = 0x2<br />...<br /> 2140 ds_bp = {<br /> 2140 blk_dva = [<br /> {<br /> 2140 dva_word = [ 0x2, 0xf0038 ]<br /> }<br /> {<br /> 2150 dva_word = [ 0x2, 0x6 ]<br /> }<br /> {<br /> 2160 dva_word = [ 0, 0 ]<br /> }<br /> ]<br />...<br />}<br /></code><br />The blkptr_t at 0x2140 will give us the objset_phys_t of the root dataset of the file system.<br /><code><br />> 2140::blkptr<br />DVA[0]: vdev_id 0 / 1e007000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:1e007000:200:d<br />DVA[1]: vdev_id 0 / c00<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00:200:d<br />LSIZE: 400 PSIZE: 200<br />ENDIAN: LITTLE TYPE: DMU objset<br />BIRTH: 500 LEVEL: 0 FILL: a00000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 6bb79a7b2:2e0d64756dd:9fc17017938b:176b8a4b6c4756<br />$q<br />#<br /></code><br />Now get the locations where the file system objset_phys_t resides.<br /><code><br /># ./zdb -Z tank:0:1e007000:200<br />Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1<br />devidx = 1, offset = 6001600, size = 200<br />devidx = 2, offset = 6001600, size = 200<br />#<br /></code><br />So, parity is on the second disk, and the data is on the third disk, both at offset 0x6001600.<br /><code><br /># mdb<br />(6001600>>9)+(400000>>9)=D<br /> 204811 <br /></code><br />And again use dd to dump the compressed objset_phys_t to a file.<br /><code><br /># dd if=/export/home/max/r2 of=/tmp/dmu_objset_comp iseek=204811 count=1<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />And uncompress the objset_phys_t.<br /><code><br /># ./zuncompress -l 400 -p 200 /tmp/dmu_objset_comp > /tmp/dmu_objset<br />#<br /></code><br />Now, use mdb to examine the objset_phys_t.<br /><code><br /># mdb /tmp/dmu_objset<br />> 0::print -a zfs`objset_phys_t<br />{<br
/> 0 os_meta_dnode = {<br /> 0 dn_type = 0xa <-- DMU_OT_DNODE<br /> 1 dn_indblkshift = 0xe<br /> 2 dn_nlevels = 0x7 <-- 7 levels of indirection<br /> ...<br /> 40 dn_blkptr = [<br /> {<br /> 40 blk_dva = [<br /> {<br /> 40 dva_word = [ 0x4, 0xf0020 ]<br /> }<br /> {<br /> 50 dva_word = [ 0x4, 0x20 ]<br /> }<br /> {<br /> 60 dva_word = [ 0, 0 ]<br /> }<br /> ]<br /> ...<br />}<br />> 40::blkptr<br />DVA[0]: vdev_id 0 / 1e004000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[0]: :0:1e004000:400:id<br />DVA[1]: vdev_id 0 / 4000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[1]: :0:4000:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 500 LEVEL: 6 FILL: 900000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 5b884586fa:3f9c7d79ba1f:17674db0ee38e0:6077d2d63aa75b6<br />$q<br />#<br /></code><br />There are 6 levels of indirection to get the MOS for the file system. Next, we'll get the disk locations for the 6th level of indirection.<br /><code><br /># ./zdb -Z tank:0:1e004000:400<br />Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1<br />devidx = 2, offset = 6000c00, size = 200<br />devidx = 3, offset = 6000c00, size = 200<br />devidx = 4, offset = 6000c00, size = 200<br />#<br /></code><br />So, the third disk contains parity, and the fourth and fifth disks contain the indirect block.<br /><code><br /># mdb<br />> (6000c00>>9)+(400000>>9)=D<br /> 204806 <br /></code><br /><br />Again, we'll use dd to get the data from the 2 disks, then concatenate the dd outputs, then uncompress.<br /><code><br /># dd if=/export/home/max/r3 of=/tmp/i6_1z iseek=204806 count=1<br />1+0 records in<br />1+0 records out<br /># dd if=/export/home/max/r4 of=/tmp/i6_2z iseek=204806 count=1<br />1+0 records in<br />1+0 records out<br /># cat /tmp/i6*z > /tmp/i6_z<br />#<br /></code><br />Now, uncompress. 
The size after decompression is 0x4000 bytes, as specified in the LSIZE field of the blkptr_t.<br /><code><br /># ./zuncompress -l 4000 -p 400 /tmp/i6_z > /tmp/i6<br />#<br /></code><br />And use mdb to examine the blkptr_t structures. We are only interested in the first one, since it will take us to the beginning dnode_phys_t in the file system.<br /><code><br /># mdb/intel/ia32/mdb/mdb /tmp/i6<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / 1e003800<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[0]: :0:1e003800:400:id<br />DVA[1]: vdev_id 0 / 3800<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[1]: :0:3800:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 500 LEVEL: 5 FILL: 900000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 59defe2103:3e0ac53edc13:16a8c688ba6d69:5cafeb97a9046d7<br />$q<br />#<br /></code><br />Now at level 5, we again need to know where on the physical disks are the data.<br /><code><br /># ./zdb -Z tank:0:1e003800:400<br />Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1<br />devidx = 3, offset = 6000a00, size = 200<br />devidx = 4, offset = 6000a00, size = 200<br />devidx = 0, offset = 6000c00, size = 200<br />#<br /></code><br />So, parity on fourth disk and data on fifth and first.<br /><code><br /># mdb<br />> (6000a00>>9)+(400000>>9)=D<br /> 204805 <br />> (6000c00>>9)+(400000>>9)=D<br /> 204806 <br /></code><br />And use dd to dump the blocks.<br /><code><br /># dd if=/export/home/max/r4 of=/tmp/i5_1z iseek=204805 count=1<br />1+0 records in<br />1+0 records out<br /># dd if=/export/home/max/r0 of=/tmp/i5_2z iseek=204806 count=1<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />And concatenate...<br /><code><br /># cat /tmp/i5*z > /tmp/i5_z<br />#<br /></code><br />And uncompress...<br /><code><br /># ./zuncompress -p 400 -l 4000 /tmp/i5_z > /tmp/i5<br />#<br /></code><br />And get to the 4th level of indirection...<br /><code><br /># mdb /tmp/i5<br />> 
0::blkptr<br />DVA[0]: vdev_id 0 / 1e003000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[0]: :0:1e003000:400:id<br />DVA[1]: vdev_id 0 / 3000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[1]: :0:3000:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 500 LEVEL: 4 FILL: 900000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 5aaaf038c7:3ecd4215b2cf:1705e4d4343d71:5e8d71a8535f678<br />$q<br />#<br /></code><br />Rather than show all 6 levels, we'll jump to level 0.<br /><code><br /># mdb /tmp/i1<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / 1e001000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[0]: :0:1e001000:600:d<br />DVA[1]: vdev_id 0 / 1000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[1]: :0:1000:600:d<br />LSIZE: 4000 PSIZE: 600<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 500 LEVEL: 0 FILL: 900000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 7e1ebca68d:4f0370c6d404:23a24df0937608:ce6838f39084f95<br />$q<br />#<br /></code><br />Locate the data for the stripe:<br /><code><br /># ./zdb -Z tank:0:1e001000:600<br />Columns = 4, bigcols = 4, asize = 800, firstdatacol = 1<br />devidx = 3, offset = 6000200, size = 200<br />devidx = 4, offset = 6000200, size = 200<br />devidx = 0, offset = 6000400, size = 200<br />devidx = 1, offset = 6000400, size = 200<br />#<br /><br /># mdb<br />> (6000200>>9)+(400000>>9)=D<br /> 204801 <br />> (6000400>>9)+(400000>>9)=D<br /> 204802 <br /></code><br /><br />Get the data from the individual disks...<br /><code><br /># dd if=/export/home/max/r4 of=/tmp/fs_mos_1z iseek=204801 count=1<br />1+0 records in<br />1+0 records out<br /># dd if=/export/home/max/r0 of=/tmp/fs_mos_2z iseek=204802 count=1<br />1+0 records in<br />1+0 records out<br /># dd if=/export/home/max/r1 of=/tmp/fs_mos_3z iseek=204802 count=1<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />Concatenate the data...<br /><code><br /># cat 
/tmp/fs_mos_* > /tmp/fs_mos_z<br />#<br /></code><br />Uncompress...<br /><code><br /># ./zuncompress -l 4000 -p 600 /tmp/fs_mos_z > /tmp/fs_mos<br />#<br /></code><br />We should now be at the MOS for the root data set.<br /><code><br /># mdb /tmp/fs_mos<br />> 0,20::print -a zfs`dnode_phys_t<br />{<br /> 0 dn_type = 0 <-- first is not used<br /> ...<br />}<br />{<br /> 200 dn_type = 0x15 <-- DMU_OT_MASTER<br /> ...<br /> 240 dn_blkptr = [<br /> {<br /> 240 blk_dva = [<br /> {<br /> 240 dva_word = [ 0x2, 0 ]<br /> }<br /> {<br /> 250 dva_word = [ 0x2, 0xf0000 ]<br /> }<br /> {<br /> 260 dva_word = [ 0, 0 ]<br /> }<br /> ]<br /> ...<br />}<br /> ...<br />{<br /> 600 dn_type = 0x14 <-- DMU_OT_DIRECTORY_CONTENTS (probably for root of fs)<br /> ...<br /> 604 dn_bonustype = 0x11 <-- bonus buffer is a znode_phys_t<br /> ...<br /> 640 dn_blkptr = [<br /> {<br /> 640 blk_dva = [<br /> {<br /> 640 dva_word = [ 0x2, 0xf0006 ]<br /> }<br /> ...<br /> 6c0 dn_bonus = [ 0x26, 0xa0, 0xdb, 0x49, 0, 0, 0, 0, 0x8e, 0xf0, 0xf7, 0x25, 0, 0, 0, 0, 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0xf3, 0x80, 0xdd, 0x34, 0, 0, 0 , 0, ... ] <br />} <br />{<br /> 800 dn_type = 0x13 <-- DMU_OT_PLAIN_FILE_CONTENTS<br /> ...<br /> 804 dn_bonustype = 0x11 <-- bonus buffer is znode_phys_t<br /> ...<br /> 840 dn_blkptr = [<br /> {<br /> 840 blk_dva = [<br /> {<br /> 840 dva_word = [ 0x2, 0xf0004 ]<br /> }<br />...<br /> 8c0 dn_bonus = [ 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0x5e, 0xe2, 0xdc, 0x34, 0, 0, 0, 0, 0xca, 0xa5, 0xdc, 0x49, 0, 0, 0, 0, 0x58, 0x9e, 0xde, 0x34, 0, 0, 0 , 0, ... ] <br />}<br /> ... <br />><br /></code><br />We should start with the ZAP object specified by the blkptr_t for the master node to get to the root directory object of the file system. Instead, we'll assume the dnode_phys_t at 0x600 is for the root of the file system, and we'll dump the blkptr_t. 
This should be for a ZAP object which should contain the list of files in the directory.<br /><code><br />> 640::blkptr<br />DVA[0]: vdev_id 0 / 1e000c00<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:1e000c00:200:d<br />DVA[1]: vdev_id 0 / 800<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:800:200:d<br />LSIZE: 200 PSIZE: 200<br />ENDIAN: LITTLE TYPE: ZFS directory<br />BIRTH: 500 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher4 COMP: uncompressed<br />CKSUM: 60d062a16:197ca3c8839:4691877f93d3:946b572aca5a2<br />$q<br />#<br /></code><br />Find the location on the disk(s) for the directory ZAP object.<br /><code><br /># ./zdb -Z tank:0:1e000c00:200<br />Columns = 2, bigcols = 2, asize = 400, firstdatacol = 1<br />devidx = 1, offset = 6000200, size = 200<br />devidx = 2, offset = 6000200, size = 200<br /># mdb<br />> (6000200>>9)+(400000>>9)=D<br /> 204801 <br /></code><br />Dump the contents.<br /><code><br /># dd if=/export/home/max/r2 of=/tmp/rootdir iseek=204801 count=1<br />1+0 records in<br />1+0 records out<br />#<br /></code><br />Examine the directory.<br /><code><br /># mdb /tmp/rootdir<br />> ::sizeof zfs`mzap_phys_t<br />sizeof (zfs`mzap_phys_t) = 0x80<br />> ::sizeof zfs`mzap_ent_phys_t<br />sizeof (zfs`mzap_ent_phys_t) = 0x40<br />> 0::print zfs`mzap_phys_t<br />{<br /> mz_block_type = 0x8000000000000003<br /> mz_salt = 0x14187c7<br /> mz_normflags = 0<br /> mz_pad = [ 0, 0, 0, 0, 0 ]<br /> mz_chunk = [<br /> {<br /> mze_value = 0x8000000000000004<br /> mze_cd = 0<br /> mze_pad = 0<br /> mze_name = [ "smallfile" ]<br /> }<br /> ]<br />}<br />> (200-80)%40=K<br /> 6 <-- there are 6 more mzap_ent_phys_t<br />> 80,6::print zfs`mzap_ent_phys_t<br /> {<br /> mze_value = 0<br /> mze_cd = 0<br /> mze_pad = 0<br /> mze_name = [ '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0 ', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', 
'\0', ... ]<br /> }<br /> {<br /> mze_value = 0x8000000000000006<br /> mze_cd = 0<br /> mze_pad = 0<br /> mze_name = [ "words" ] <-- here is the file we want, object id is 6<br /> }<br /> {<br /> mze_value = 0x8000000000000007<br /> mze_cd = 0<br /> mze_pad = 0<br /> mze_name = [ "foo" ]<br /> }<br /> ...<br />$q<br /># <br /></code><br />Now, go back to the file system MOS to look at object id 6. If the object ID was greater than 32 (0x20), there would have been more work looking at other indirect blocks from the objset_phys_t for the file system. We assumed that the root directory for the file system would be a low object number above, and, fortunately,<br />the file we want to examine is also a low object number.<br /><code><br /># mdb /tmp/fs_mos<br />> (6*200)::print -a zfs`dnode_phys_t<br />{<br /> c00 dn_type = 0x13 <-- plain file contents<br /> c01 dn_indblkshift = 0xe<br /> c02 dn_nlevels = 0x2 <-- one level of indirection<br /> c03 dn_nblkptr = 0x1<br /> c04 dn_bonustype = 0x11 <-- bonus buffer contains znode_phys_t<br /> ...<br /> c40 dn_blkptr = [<br /> {<br /> c40 blk_dva = [<br /> {<br /> c40 dva_word = [ 0x4, 0x5ec ]<br /> }<br /> {<br /> c50 dva_word = [ 0x4, 0xf00ec ]<br /> }<br /> {<br /> c60 dva_word = [ 0, 0 ]<br /> }<br /> ]<br /> ...<br /> cc0 dn_bonus = [ 0x22, 0x86, 0xdb, 0x49, 0, 0, 0, 0, 0x9, 0x31, 0x20, 0x28, 0, 0, 0, 0, 0x22, 0x86, 0xdb, 0x49, 0, 0, 0, 0, 0x71, 0x48, 0x2b, 0x28, 0, 0, 0, 0, ... 
] <br />} <br /></code><br />A quick look at the znode_phys_t in the bonus buffer...<br /><code><br />> cc0::print zfs`znode_phys_t<br />{<br /> zp_atime = [ 0x49db8622, 0x28203109 ]<br /> zp_mtime = [ 0x49db8622, 0x282b4871 ]<br /> zp_ctime = [ 0x49db8622, 0x282b4871 ]<br /> zp_crtime = [ 0x49db8622, 0x28203109 ]<br /> zp_gen = 0x97<br /> zp_mode = 0x8124<br /> zp_size = 0x32752<br /> zp_parent = 0x3<br /> zp_links = 0x1<br /> zp_xattr = 0<br /> zp_rdev = 0<br /> zp_flags = 0x40800000004<br /> zp_uid = 0<br /> zp_gid = 0<br /> zp_zap = 0<br /> zp_pad = [ 0, 0, 0 ]<br /> zp_acl = {<br /> z_acl_extern_obj = 0<br /> z_acl_size = 0x30<br /> z_acl_version = 0x1<br /> z_acl_count = 0x6<br /> z_ace_data = [ 0x1, 0, 0, 0x10, 0x26, 0, 0, 0, 0, 0, 0, 0x10, 0x11, 0x1,<br />0xc, 0, 0x1, 0, 0x40, 0x20, 0x26, 0, 0, 0, 0, 0, 0x40, 0x20, 0x1, 0, 0, 0, ...<br />]<br /> }<br />}<br /></code><br />When was the file created?<br /><code><br />> 49db8622=Y<br /> 2009 Apr 7 18:58:10 <br /></code><br />Now, let's look at the blkptr_t.<br /><code><br />> c40::blkptr<br />DVA[0]: vdev_id 0 / bd800<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[0]: :0:bd800:400:id<br />DVA[1]: vdev_id 0 / 1e01d800<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 80000000000<br />DVA[1]: :0:1e01d800:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: ZFS plain file<br />BIRTH: 97 LEVEL: 1 FILL: 200000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 600e97db0e:411c4ea86350:1790b46d936d46:602547566d07cc7<br />$q<br />#<br /></code><br />We're at level 1.<br /><code><br /># ./zdb -Z tank:0:bd800:400<br />Columns = 3, bigcols = 3, asize = 800, firstdatacol = 1<br />devidx = 1, offset = 25e00, size = 200<br />devidx = 2, offset = 25e00, size = 200<br />devidx = 3, offset = 25e00, size = 200<br />#<br /><br /># mdb<br />> (25e00>>9)+(400000>>9)=D<br /> 8495 <br /><br /># dd if=/export/home/max/r2 of=/tmp/words_i1z iseek=8495 count=1<br />1+0 records in<br />1+0 records out<br /># dd 
if=/export/home/max/r3 of=/tmp/words_i2z iseek=8495 count=1<br />1+0 records in<br />1+0 records out<br /># cat /tmp/words_*z > /tmp/words_iz<br /></code><br />Uncompress...<br /><code><br /># ./zuncompress -l 4000 -p 400 /tmp/words_iz > /tmp/words_i<br /></code><br />So, /tmp/words_i should contain uncompressed blkptr_t. These blkptr_t should take us to the data for the words file.<br /><code><br /># mdb /tmp/words_i<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / c0000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2800000000000<br />DVA[0]: :0:c0000:20000:d<br />LSIZE: 20000 PSIZE: 20000<br />ENDIAN: LITTLE TYPE: ZFS plain file<br />BIRTH: 97 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher2 COMP: uncompressed<br />CKSUM: f5cbf93a151abcac:5b5d6ca83588d8ad:574d9b8bf334944b:ad78d30af51771d8<br />80::blkptr<br />DVA[0]: vdev_id 0 / e8000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2800000000000<br />DVA[0]: :0:e8000:20000:d<br />LSIZE: 20000 PSIZE: 20000<br />ENDIAN: LITTLE TYPE: ZFS plain file<br />BIRTH: 97 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher2 COMP: uncompressed<br />CKSUM: f39ae34f048ae079:de2ef1af7d1fb495:ec3ae3f7985b2a98:c6d33ac68cb042b6<br />$q<br />#<br /></code><br />So, where is the data?<br /><code><br /># ./zdb -Z tank:0:c0000:20000<br />Columns = 5, bigcols = 0, asize = 28000, firstdatacol = 1<br />devidx = 1, offset = 26600, size = 8000<br />devidx = 2, offset = 26600, size = 8000<br />devidx = 3, offset = 26600, size = 8000<br />devidx = 4, offset = 26600, size = 8000<br />devidx = 0, offset = 26800, size = 8000<br /></code><br />A little hex to decimal conversion for dd...<br /><code><br /># mdb<br />> (26600>>9)+(400000>>9)=D<br /> 8499 <br />> (26800>>9)+(400000>>9)=D<br /> 8500 <br />8000>>9=D<br /> 64 <br /></code><br />Now, dump the blocks...<br /><code><br /># dd if=/export/home/max/r2 of=/tmp/w1 iseek=8499 bs=512 count=64<br />64+0 records in<br />64+0 records out<br /># dd if=/export/home/max/r3 of=/tmp/w2 iseek=8499 bs=512 count=64<br />64+0 
records in<br />64+0 records out<br /># dd if=/export/home/max/r4 of=/tmp/w3 iseek=8499 bs=512 count=64<br />64+0 records in<br />64+0 records out<br /># dd if=/export/home/max/r0 of=/tmp/w4 iseek=8500 bs=512 count=64<br />64+0 records in<br />64+0 records out<br /></code><br />And concatenate the 4 files...<br /><code><br /># cat /tmp/w[1-4]<br />10th<br />1st<br />2nd<br />3rd<br />4th<br />5th<br />6th<br />7th<br />8th<br />9th<br />a<br />AAA<br />AAAS<br />Aarhus<br />Aaron<br />AAU<br />ABA<br />Ababa<br />aback<br />...<br />Nostrand<br />nostril<br />not<br />notary<br />notate<br />notch<br />note<br />notebook<br />noteworthy<br />nothing<br />notice<br />notice#<br /></code><br />The first 128k of the file. To get the remainder of the file, we would need to look at the next blkptr_t at level 1. But, not today...Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-89620513721037322192009-12-16T05:55:00.000-08:002009-12-16T06:02:08.789-08:00ZFS Data Recovery<p>I knew there was a reason to understand the ZFS on-disk format besides wanting to be able to teach about it...<br /><p><br />Recently, I was sent an email from someone who had 15 years of video and music stored in a 10TB ZFS pool that, after a power failure, became defective. He unfortunately did not have a backup. He was using ZFS version 6 on FreeBSD 7, and he asked if I could help. He also said that he had spoken with various people, including engineers within Sun, and was told that basically, it was probably not possible to recover the data.<br /><p><br />He also got in touch with a data recovery company who told him they would assign a technician to examine the problem at a cost of $15,000/month. And if they could not restore his data, he would only have to pay 1/2 of the cost.<br /><p><br />After spending about 1 week examining the data on the disk, I was able to restore basically all of it. The recovery was done on OpenSolaris 0609 (build 111b). 
After the recovery, he was able to view his data on the FreeBSD system (as well as OpenSolaris). I would be happy to do this for anyone else who runs into a problem where they can no longer access their ZFS pool, especially at $15k/month. And if I can not do it, you would not need to pay anything!Unknownnoreply@blogger.com14tag:blogger.com,1999:blog-7245518.post-54240876169777274012009-12-15T04:39:00.000-08:002009-12-15T08:33:27.832-08:00segpages dmod source codeSource is at <a href="ftp://ftp.bruningsystems.com/segpages.tar">ftp://ftp.bruningsystems.com/segpages.tar</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-43998805292829489862009-12-14T05:19:00.000-08:002009-12-14T06:03:30.187-08:00Examining address spaces with mdb<p><br />A while ago, I was interested in more details about process address spaces. For instance, if a page is mapped into an address space, where is the page in physical memory? Or if a page is on swap, where is it on swap? Are there pages that are in memory, but not currently valid for a process? The meminfo(2) system call can be used by an application to examine the locations of physical pages corresponding to a range of virtual addresses that the process is using. Is there a tool for doing this from outside the process? Is there any tool for determining the locations of pages in memory when one is using liblgrp(3)? liblgrp(3) provides an API for specifying a "locality group". A locality group, as the man page says, "represents the set of CPU-like and memory-like hardware devices that are at most some locality apart from each other". Essentially, using liblgrp(3), one can specify the desired memory placement for memory that threads within a process are using.<br /><p><br />So, I have written a dcmd, called segpages, for mdb that allows one to examine each virtual page of a segment in a process address space. 
The command gives the following information:<br /><ul><br /><li> The virtual address of the page.<br /><li> If the page is in memory, the physical address of the page.<br /><li> If the page is on swap, the location on swap, and which swap device/file.<br /><li> If the page is not currently in memory or on swap, a "-".<br /><li> If the page is mapped from a file, the pathname of the file, and the offset within the file.<br /><li> If the page is anonymous, the command prints "anon".<br /><li> If the page is mapped to a device, the command only prints the physical address it is mapped to, and the path to the device.<br /><li> The "share count" for the page, i.e., the number of processes sharing the same page.<br /><li> The dcmd command also prints the status of the page:<br /><ul><br /><li> VALID -- The page is mapped<br /><li> INMEMORY -- The page is in memory (it may not be valid for the process).<br /><li> SWAPPED -- The page is on swap. Note that a page may be INMEMORY and SWAPPED. What I find more interesting, is pages that are SWAPPED and VALID. I expect to find INMEMORY pages that are also on swap. I did not expect to find SWAPPED pages that are also VALID, since I assumed that a page that was read in from swap and is now valid would not have a copy on swap. From a quick look at the code, it appears the swap slot is not freed until the reference count on the anon struct that is mapping the page has gone to 0. Anyone with a more complete understanding of this is welcome to comment.<br /></ul><br /></ul><br /><p><br />Here is (very abbreviated) output for a running bash process.<br /><p><br />First, a look at pmap output. Each line of the pmap output represents a "segment" of the address space. 
The different columns are described in the pmap(1) man page.<br /><code><br />$ pmap -x 919<br />919: /bin/bash --noediting -i<<br />Address Kbytes RSS Anon Locked Mode Mapped File<br />08045000 12 12 4 - rw--- [ stack ]<br />08050000 644 644 - - r-x-- bash<br />08100000 80 80 12 - rwx-- bash<br />08114000 52 52 28 - rwx-- [ heap ]<br />CE410000 624 512 - - r-x-- libnsl.so.1<br />CE4BC000 16 16 4 - rw--- libnsl.so.1<br />CE4C0000 20 8 - - rw--- libnsl.so.1<br />CE4F0000 56 52 - - r-x-- methods_unicode.so.3<br />CE50D000 4 4 - - rwx-- methods_unicode.so.3<br />CE510000 2416 752 - - r-x-- en_US.UTF-8.so.3<br />CE77B000 4 4 - - rwx-- en_US.UTF-8.so.3<br />CE960000 64 16 - - rwx-- [ anon ]<br />CE97E000 4 4 - - rwxs- [ anon ]<br />CE980000 4 4 - - rwx-- [ anon ]<br />CE990000 24 12 4 - rwx-- [ anon ]<br />CE9A0000 4 4 4 - rwx-- [ anon ]<br />CE9B0000 1280 972 - - r-x-- libc_hwcap1.so.1<br />CEAF0000 28 28 16 - rwx-- libc_hwcap1.so.1<br />CEAF7000 8 8 - - rwx-- libc_hwcap1.so.1<br />CEB00000 4 4 - - r-x-- libdl.so.1<br />CEB10000 4 4 4 - rwx-- [ anon ]<br />CEB20000 56 56 - - r-x-- libsocket.so.1<br />CEB3E000 4 4 - - rw--- libsocket.so.1<br />CEB40000 180 136 - - r-x-- libcurses.so.1<br />CEB7D000 28 28 - - rw--- libcurses.so.1<br />CEB84000 8 - - - rw--- libcurses.so.1<br />CEB90000 4 4 4 - rwx-- [ anon ]<br />CEBA0000 4 4 4 - rw--- [ anon ]<br />CEBB0000 4 4 - - rw--- [ anon ]<br />CEBBF000 180 180 - - r-x-- ld.so.1<br />CEBFC000 8 8 4 - rwx-- ld.so.1<br />CEBFE000 4 4 4 - rwx-- ld.so.1<br />-------- ------- ------- ------- -------<br />total Kb 5832 3620 92 -<br /><br /># mdb -k<br />Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs usba sockfs ip hook neti sctp arp uhci sd fctl md lofs audiosup fcip fcp random cpc crypto logindmux ptm ufs sppp ipc ]<br /><br /></code><br /><p><br />First, load the dmod containing the new dcmd.<br /><code><br /><br />> ::load /wd320/max/source/mdb/segpages/i386/segpages.so <br />><br 
/></code><br /><p><br />Now, walk through the segments of the process address space, showing<br />each virtual page in the segment. Note that output has been omitted.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8045000 378C5000 [anon] 54518000 1 VALID<br /> 8046000 6EB06000 [anon] 54118000 1 VALID<br /> 8047000 5F9C7000 [anon] 540B8000 1 VALID<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8050000 600A7000 bash 0 7 VALID<br /> 8051000 74368000 bash 1000 7 VALID<br /> 8052000 72669000 bash 2000 7 VALID<br /> 8053000 66C6A000 bash 3000 7 VALID<br /> 8054000 636AB000 bash 4000 0 INVALID,INMEMORY<br /> 8055000 5FDEC000 bash 5000 0 INVALID,INMEMORY<br /> 8056000 63EED000 bash 6000 0 INVALID,INMEMORY<br /> 8057000 62EAE000 bash 7000 0 INVALID,INMEMORY<br /> 8058000 5C52F000 bash 8000 7 VALID<br /> 8059000 5C5B0000 bash 9000 7 VALID<br />... output omitted<br /> 80ED000 5C2C4000 bash 9D000 7 VALID<br /> 80EE000 5C245000 bash 9E000 7 VALID<br /> 80EF000 5C286000 bash 9F000 3 VALID<br /> 80F0000 63A97000 bash A0000 0 INVALID,INMEMORY<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8100000 79940000 [anon] 541D8000 1 VALID<br /> 8101000 5F0C1000 [anon] 62F00000 1 VALID<br /> 8102000 378C2000 [anon] 54438000 1 VALID<br /> 8103000 5EF5A000 bash A3000 6 VALID<br /> 8104000 5EEDB000 bash A4000 6 VALID<br /> 8105000 37885000 [anon] 543B8000 1 VALID<br /> 8106000 60E1D000 bash A6000 7 VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8114000 37914000 [anon] 54478000 1 VALID<br /> 8115000 79DD5000 [anon] 54368000 1 VALID<br /> 8116000 55356000 [anon] 62F90000 1 VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE410000 7AE40000 libnsl.so.1 0 55 VALID<br />CE411000 7AEC1000 libnsl.so.1 1000 57 VALID<br />CE412000 7AE42000 libnsl.so.1 2000 57 VALID<br />CE413000 7AE83000 libnsl.so.1 3000 57 VALID<br />CE414000 7AE84000 libnsl.so.1 4000 57 VALID<br />...<br 
/>CE42D000 6EE96000 libnsl.so.1 1D000 1A INVALID,INMEMORY<br />CE42E000 6E797000 libnsl.so.1 1E000 1A INVALID,INMEMORY<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE4F0000 17D9000 methods_unicode.so.3 0 29 VALID<br />CE4F1000 17DA000 methods_unicode.so.3 1000 2C VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE510000 1869000 en_US.UTF-8.so.3 0 28 VALID<br />CE511000 18AA000 en_US.UTF-8.so.3 1000 2A VALID<br />...<br />CE518000 6F1EA000 en_US.UTF-8.so.3 8000 0 INVALID,INMEMORY<br />CE519000 6F1EB000 en_US.UTF-8.so.3 9000 0 INVALID,INMEMORY<br />CE51A000 6F1EC000 en_US.UTF-8.so.3 A000 0 INVALID,INMEMORY<br />...<br />CE5FF000 6DB60000 en_US.UTF-8.so.3 EF000 5 INVALID,INMEMORY<br />CE600000 1659000 en_US.UTF-8.so.3 F0000 7 INVALID,INMEMORY<br />...<br />CE6EE000 1687000 en_US.UTF-8.so.3 1DE000 9 INVALID,INMEMORY<br />CE6EF000 1688000 en_US.UTF-8.so.3 1DF000 9 INVALID,INMEMORY<br />CE6F0000 1649000 en_US.UTF-8.so.3 1E0000 9 INVALID,INMEMORY<br />...<br />CE729000 1782000 en_US.UTF-8.so.3 219000 29 VALID<br />CE72A000 1783000 en_US.UTF-8.so.3 21A000 29 VALID<br />...<br />CE730000 1709000 en_US.UTF-8.so.3 220000 29 VALID<br />CE731000 6F143000 en_US.UTF-8.so.3 221000 0 INVALID,INMEMORY<br />CE732000 6F144000 en_US.UTF-8.so.3 222000 0 INVALID,INMEMORY<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE9B0000 76A42000 libc_hwcap1.so.1 0 5B VALID<br />CE9B1000 76AC3000 libc_hwcap1.so.1 1000 5B VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />...<br />CEBC4000 2A34000 ld.so.1 5000 47 VALID<br />CEBC5000 28B5000 ld.so.1 6000 47 VALID<br />CEBC6000 29F6000 ld.so.1 7000 60 VALID<br />CEBC7000 2937000 ld.so.1 8000 60 VALID<br />...<br />><br /></code><br /><p><br />Some general things to note:<br /><ul><br /><li> Physical pages are randomly distributed. However, pages from ld.so.1 tend to be in low memory with comparison to anonymous pages. 
This should be expected as most pages of ld.so.1 are probably loaded early on in the system lifetime, as almost every application uses it.<br /><li> There are many pages that are not valid, but they are in memory. In general, text and data pages are prefetched when a program starts, unless the program is large, or there is not enough free memory. Although pages are prefetched, it appears that they are not mapped to the process address space until/unless they are actually used.<br /><li> Bash is not very large. Running the command above finishes in 5-10 seconds. Running the same command on a large program (for instance, firefox-bin) takes several minutes to complete. Running the command on a large 64-bit application will take considerably longer.<br /><li> This is being run on a live system, so the address space of the process being examined may change while it is being walked.<br /><li> At this point in time, no pages are swapped out.<br /></ul><br /><p><br />Now, let's get some general statistics.<br /><p><br />First, a count of the pages currently valid for the process. This is the current mapped RSS. Note that the pmap command reports "RSS", which, at 3620k, is 905 4k-byte pages. But only 558 pages (or 2232k) are currently valid.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i " valid" | wc<br /> 558 3348 35712<br />><br /></code><br /><p><br />Now, the pages in memory, but not currently valid in the page table(s) for the process.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i "inmemory" | wc<br /> 347 2082 26025<br />><br /></code><br /><p><br />Note that the valid pages plus the in-memory pages total 905, the RSS value reported by pmap. So the RSS reported by pmap does not mean that page faults will not happen for those pages. 
But if a page fault occurs the correct page will be found in memory.<br /><p><br />How many pages are currently not valid (and not in memory).<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i " invalid$" | wc<br /> 553 3298 36498<br />><br /></code><br /><p><br />How large is the address space?<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -v OFFSET | wc<br /> 1458 8728 98235<br />><br /></code><br /><p><br />Note that this is 5832k, the total size as reported by pmap.<br /><p><br />How many pages have been swapped out?<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i swapped | wc<br /> 0 0 0<br />><br /></code><br /><p><br />Now, we'll induce memory load on the system, and again examine the address space. The memory usage induced should be enough to cause pages to be swapped (paged) out.<br /><p><br />First, pmap output after the memory stress.<br /><code><br /><br />$ pmap -x 919<br />919: /bin/bash --noediting -i<br /> Address Kbytes RSS Anon Locked Mode Mapped File<br />08045000 12 4 - - rw--- [ stack ]<br />08050000 644 508 - - r-x-- bash<br />08100000 80 80 - - rwx-- bash<br />08114000 52 44 28 - rwx-- [ heap ]<br />CE410000 624 320 - - r-x-- libnsl.so.1<br />CE4BC000 16 16 4 - rw--- libnsl.so.1<br />CE4C0000 20 8 - - rw--- libnsl.so.1<br />CE4F0000 56 36 - - r-x-- methods_unicode.so.3<br />CE50D000 4 4 - - rwx-- methods_unicode.so.3<br />CE510000 2416 124 - - r-x-- en_US.UTF-8.so.3<br />CE77B000 4 4 - - rwx-- en_US.UTF-8.so.3<br />CE960000 64 16 - - rwx-- [ anon ]<br />CE97E000 4 4 - - rwxs- [ anon ]<br />CE980000 4 4 - - rwx-- [ anon ]<br />CE990000 24 12 4 - rwx-- [ anon ]<br />CE9A0000 4 4 4 - rwx-- [ anon ]<br />CE9B0000 1280 952 - - r-x-- libc_hwcap1.so.1<br />CEAF0000 28 28 12 - rwx-- libc_hwcap1.so.1<br />CEAF7000 8 8 - - rwx-- libc_hwcap1.so.1<br />CEB00000 4 4 - - r-x-- libdl.so.1<br />CEB10000 4 4 4 - 
rwx-- [ anon ]<br />CEB20000 56 56 - - r-x-- libsocket.so.1<br />CEB3E000 4 4 - - rw--- libsocket.so.1<br />CEB40000 180 68 - - r-x-- libcurses.so.1<br />CEB7D000 28 28 - - rw--- libcurses.so.1<br />CEB84000 8 - - - rw--- libcurses.so.1<br />CEB90000 4 4 4 - rwx-- [ anon ]<br />CEBA0000 4 4 4 - rw--- [ anon ]<br />CEBB0000 4 4 - - rw--- [ anon ]<br />CEBBF000 180 180 - - r-x-- ld.so.1<br />CEBFC000 8 8 4 - rwx-- ld.so.1<br />CEBFE000 4 4 4 - rwx-- ld.so.1<br />-------- ------- ------- ------- -------<br />total Kb 5832 2544 72 -<br /><br />$<br /></code><br /><p><br />As expected, the RSS has gone down, but the virtual size remains the same. It is a little interesting that the amount reported under anon has also dropped by 20k.<br /><p><br />Again, we'll use the new dcmd to examine the address space more closely.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8045000 378C5000 [anon] 54518000 1 VALID<br /> 8046000 - /dev/zvol/dsk/rpool/swap 17EBF000 0 INVALID,SWAPPED<br /> 8047000 - /dev/zvol/dsk/rpool/swap 1D8FE000 0 INVALID,SWAPPED<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8050000 16B86000 bash 0 1 VALID<br /> 8051000 13A07000 bash 1000 1 VALID<br /> 8052000 6B088000 bash 2000 1 VALID<br /> 8053000 1889000 bash 3000 1 VALID<br /> 8054000 2430A000 bash 4000 1 VALID<br /> 8055000 6440B000 bash 5000 1 VALID<br /> 8056000 6684C000 bash 6000 1 VALID<br /> 8057000 7308D000 bash 7000 1 VALID<br /> 8058000 6DCCE000 bash 8000 0 INVALID,INMEMORY<br /> 8059000 3784F000 bash 9000 0 INVALID,INMEMORY<br />...<br /> 80ED000 4CB23000 bash 9D000 0 INVALID,INMEMORY<br /> 80EE000 76BE4000 bash 9E000 0 INVALID,INMEMORY<br /> 80EF000 5BA5000 bash 9F000 0 INVALID,INMEMORY<br /> 80F0000 36836000 bash A0000 0 INVALID,INMEMORY<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8100000 - /dev/zvol/dsk/rpool/swap 247C2000 0 INVALID,SWAPPED<br /> 8101000 - /dev/zvol/dsk/rpool/swap 7CCD000 0 
INVALID,SWAPPED<br /> 8102000 378C2000 [anon] 54438000 1 VALID<br /> 8103000 75479000 bash A3000 0 INVALID,INMEMORY<br /> 8104000 532BA000 bash A4000 0 INVALID,INMEMORY<br /> 8105000 37885000 [anon] 543B8000 1 VALID<br /> 8106000 7443C000 bash A6000 0 INVALID,INMEMORY<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br /> 8114000 37914000 [anon] 54478000 1 VALID<br /> 8115000 79DD5000 [anon] 54368000 1 VALID<br /> 8116000 55356000 [anon] 62F90000 1 VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE410000 7AE40000 libnsl.so.1 0 4C VALID<br />CE411000 7AEC1000 libnsl.so.1 1000 4E VALID<br />CE412000 7AE42000 libnsl.so.1 2000 4E VALID<br />CE413000 7AE83000 libnsl.so.1 3000 4E VALID<br />CE414000 7AE84000 libnsl.so.1 4000 4E VALID<br />...<br />CE42D000 6EE96000 libnsl.so.1 1D000 18 INVALID,INMEMORY<br />CE42E000 6E797000 libnsl.so.1 1E000 18 INVALID,INMEMORY<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE4F0000 17D9000 methods_unicode.so.3 0 27 VALID<br />CE4F1000 17DA000 methods_unicode.so.3 1000 2A VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CE510000 1869000 en_US.UTF-8.so.3 0 26 VALID<br />CE511000 18AA000 en_US.UTF-8.so.3 1000 28 VALID<br />...<br />CE518000 - en_US.UTF-8.so.3 8000 0 INVALID<br />CE519000 - en_US.UTF-8.so.3 9000 0 INVALID<br />CE51A000 - en_US.UTF-8.so.3 A000 0 INVALID<br />...<br />CE5FF000 - en_US.UTF-8.so.3 EF000 0 INVALID<br />CE600000 - en_US.UTF-8.so.3 F0000 0 INVALID<br />...<br />CE6EE000 1687000 en_US.UTF-8.so.3 1DE000 A INVALID,INMEMORY<br />CE6EF000 1688000 en_US.UTF-8.so.3 1DF000 A INVALID,INMEMORY<br />CE6F0000 1649000 en_US.UTF-8.so.3 1E0000 A INVALID,INMEMORY<br />...<br />CE729000 1782000 en_US.UTF-8.so.3 219000 27 VALID<br />CE72A000 1783000 en_US.UTF-8.so.3 21A000 27 VALID<br />...<br />CE730000 1709000 en_US.UTF-8.so.3 220000 27 VALID<br />CE731000 - en_US.UTF-8.so.3 221000 0 INVALID<br />CE732000 - en_US.UTF-8.so.3 222000 0 INVALID<br />...<br /> VA PA FILE OFFSET SHARES 
DISPOSITION<br />CE9B0000 76A42000 libc_hwcap1.so.1 0 51 VALID<br />CE9B1000 76AC3000 libc_hwcap1.so.1 1000 51 VALID<br />...<br /> VA PA FILE OFFSET SHARES DISPOSITION<br />CEBC4000 2A34000 ld.so.1 5000 42 VALID<br />CEBC5000 28B5000 ld.so.1 6000 42 VALID<br />CEBC6000 29F6000 ld.so.1 7000 57 VALID<br />CEBC7000 2937000 ld.so.1 8000 57 VALID<br />...<br />><br /></code><br /><p><br />As expected, many pages that were previously valid are now invalid. Many of these pages are still in memory, but some have been swapped out. The output does not show it, but some pages that are swapped out can also be in memory (the page was swapped out, put on a freelist, but has not yet been re-used for some other purpose). It is interesting that some pages with reasonably high share counts are still in memory, but no longer valid for this instance of bash. The pageout code checks the share counts, and skips pages being shared by more than po_share processes. On my system, po_share is 8. So I am not sure what is marking the pages invalid (maybe a job for DTrace).<br /><p><br />As before, I'll get some counts of valid, invalid, inmemory, and<br />swapped pages.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i " valid" | wc<br /> 413 2478 26432<br />><br /></code><br /><p><br />Previously, the number of valid pages was 558, so 145 pages have been marked invalid and possibly swapped out.<br /><p><br />The number of invalid pages is now:<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i " invalid$" | wc<br /> 818 4888 53988<br />><br /></code><br /><p><br />Previously, this was 553, so 265 pages that previously were valid are<br />now invalid.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !egrep -i "inmemory" | wc<br /> 215 1290 16125<br />><br /></code><br /><p><br />And 215 pages that are invalid are still in memory, but the page table entries for 
the bash instance do not have the pages mapped.<br /><code><br /><br />> 0t919::pid2proc | ::print proc_t p_as | ::walk seg | ::segpages !grep -i swapped | wc<br /> 12 72 936<br /></code><br /><p><br />And 12 pages of bash are on swap.<br /><p><br />It would be nice to be able to show this graphically. For instance, a large box representing the address space, with different colored pixels to represent the state of the different pages of the address space. I have been told that JavaFX is good for this, but my knowledge of Java is really not up to it. Especially for large processes, a graphical view would be nice (well, at least interesting to look at...).<br /><p><br />I have not tried the dcmd on SPARC or x64, but I expect it to work (at least on x64). I would also like to try this on a large machine which has latency groups set up. If anyone has such a machine and would like to try this out, please let me know.<br /><p><br />I also have a version of the command that only prints summary information. I want to add an option that prints page sizes, but currently the command assumes all pages are the native page size (4k on x86/x64 and 8k on SPARC).<br /><p><br />If there is interest, I'll make the code for the dcmd available.Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-7245518.post-42475889316787152112009-09-05T06:08:00.000-08:002009-09-05T06:10:52.151-07:00Correction for classes in BerlinOops. I've got the wrong dates for the Solaris/OpenSolaris Internals classes in Berlin. The correct dates are Sept. 21-25, and Sept. 28 - Oct. 2. The first week is full, but there are still openings for the second week. 
Hope to see you there!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-62561991985057167592009-08-04T05:24:00.000-07:002009-08-04T05:37:01.306-07:00OpenSolaris Internals course announcementsI will be teaching 2 back-to-back 5 day OpenSolaris Internals classes in Berlin, Germany the weeks of September 28 through October 2, and again from October 5 through October 10. For details about topics covered, price, and availability, please visit <a href="http://www.workshops-berlin.de/">http://www.workshops-berlin.de/</a>. Note that this website is in German. The course itself will be in English. If you are interested, but cannot read German, send me an email.Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-7245518.post-80676593123334968592009-04-09T05:50:00.000-07:002009-04-09T09:43:46.418-07:00RAIDZ On-Disk FormatA while back, I came up with a way to examine ZFS on-disk format using a modified mdb and zdb (see <a href="http://www.bruningsystems.com/osdevcon_draft3.pdf">paper</a> and <a href="http://www.bruningsystems.com/zfs_ondisk_slides.pdf">slides</a>). I also used the method described to recover a removed file (see <a href="http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html">here in my blog</a>). During the past week, I decided to try to understand the layout of raidz. In other words, how raidz organizes data on disk. It's simple to say that raidz on disk is basically raid5 with variable length stripes, but what does that really mean?<br /><br />To do this, I once again used the modified mdb, and made a further modification to zdb. In addition, I implemented a new command (zuncompress) which allows me to uncompress ZFS data existing in a regular file. Since I fear that most of the 10 people or so who read this will not want to read a long description of how I determined the layout, here I'll just give a summary. 
If anyone really wants the details, reply to the blog and maybe I'll go into them.<br /><br />First, some general characteristics:<br /><br /> - Each disk contains the 4MB labels at the beginning and end of the disk. For information on these, please see the <a href="http://opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf">ZFS On-Disk Specification</a> paper. Any walk of the ZFS on-disk structures starts with an uberblock_t, which is found in this area.<br /><br /> - The metadata used for raidz is the same as for other ZFS objects. In other words, uberblock_t contains the location of an objset_phys_t, which in turn contains the location of the meta-object set (mos), and so on. A difference is that, physically, an individual structure may be spread across several of the disks, and not necessarily all of them. For example, let's take a mos (basically an array of dnode_phys_t structures) on a 5-disk raidz volume. This might be compressed to 0xa00 (2560 decimal) bytes. It may be organized on the raidz disks as follows:<br /> - 512 bytes on disk 0<br /> - 512 bytes on disk 1<br /> - 512 bytes on disk 2<br /> - 1024 bytes on disk 3<br /> - 1024 bytes on disk 4<br />If you do the arithmetic, you'll find this is 0xe00 bytes (3.5k) and not 0xa00 (2.5k) bytes. The actual allocated size may be still larger. The reason for the extra 1k bytes is the next point.<br /><br /> - Each metadata object (as well as data itself) has its own parity. The extra 1k bytes in the previous point is for parity. If the parity in the above example is on disk 4, it must be 1024 bytes large, since the largest size of any of the blocks containing the object is 1k bytes. 
Even a metadata structure that only takes up 512 bytes (for instance, an objset_phys_t), will take up 1024 bytes on the disks, one disk containing the 512-byte structure, and another containing 512-bytes of parity.<br /><br /> - Block offsets as reported by zdb (and described in the ZFS On-Disk Specification) are for the entire space (i.e., if you have 5 100GB disks making up a raidz pool, the block offsets start at 0 and go to 500GB).<br /><br /> - Since block offsets cover the entire pool, you cannot simply look at the offsets reported by zdb and map them to locations on disk. The kernel routine, vdev_raidz_map_alloc() (see http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c#644), converts offset and size to locations on the disks. I have added an option to zdb that, given a raidz pool, offset, and size (as reported by zdb), calls this routine and prints out the values of the returned map. This shows the location on the disk(s) and sizes for both the data itself, and parity.<br /><br /> - I recently saw on #opensolaris irc, a person stating that a write of a 1 byte file results in a write to all disks in raidz. That may be true (I haven't checked), but only 1024 bytes are used for the 1 byte file. One 512 byte block containing the 1 byte of data, and the other 512 byte block on a different disk containing parity. It is not using space on all n disks for a 1 byte file.<br /><br /> - ZFS basically uses a scatter-gather approach to reading and writing data on a raidz pool. The disks are read at the correct offsets into a buffer large enough to contain the data. So on a read, data on the first disk is read into the beginning of the buffer, data from the second disk is read into the same buffer virtually contiguous with the end of the data from the first disk, and so on. The resulting buffer is then de-compressed, and the data returned to the requestor.<br /><br />So, that's the basics. 
I was going to turn my attention to the in-memory representation of ZFS, but now think instead I'll take a stab at automating the techniques I am using. Once I have that done, I'll try automating recovery of a removed file. From there, we'll see.Unknownnoreply@blogger.com7tag:blogger.com,1999:blog-7245518.post-43159389623814231972009-04-04T02:04:00.000-07:002009-04-06T11:22:42.698-07:00More information about Free One Day OpenSolaris Internals TrainingI thought I would say a few words about what is planned for the free one day OpenSolaris Internals training class (see <a href="http://sl.osunix.org/FreeKernelTrainingDay">http://sl.osunix.org/FreeKernelTrainingDay</a> for a list of topics, and to sign up).<br /><br />Regardless of the topics covered, I want to make this as close to a "classroom" setting as possible. For me, this means that attendees should be able to follow along with anything I am doing on OpenSolaris by doing it themselves. So, for instance, if I am using mdb to examine some data structure, students should be able to do the same on their machines. For some topics, notably ZFS, this will require students to either build an mdb dmod, and the modified mdb and zdb that I use, or load a version of OpenSolaris that contains these (to be provided by osunix.org). Source for the modified mdb, zdb, and rawzfs mdb dmod is available for download at <a href="ftp://ftp.bruningsystems.com/mdb.tar.Z">ftp://ftp.bruningsystems.com/mdb.tar.Z</a>, <a href="ftp://ftp.bruningsystems.com/zdb.tar.Z">ftp://ftp.bruningsystems.com/zdb.tar.Z</a>, and <a href="ftp://ftp.bruningsystems.com/raw_dmods.tar.Z">ftp://ftp.bruningsystems.com/raw_dmods.tar.Z</a>. 
If we do a kmdb session, students will either need to run OpenSolaris in a VM (virtualbox), or have 2 machines connectable via tip or a terminal server for console access.<br /><br />Currently, the plan is to give attendees access to some slides, use IRC, and give students access to a window on my machine where they can see what I am doing, and try the same on their machine. Best would be a window where everyone can "see" my desktop, but I'm still looking into the best way to do that (any suggestions for this are welcome). It would be great to have audio, preferably conferencing, but this may cost money, and... the class is free. That should mean free for me as well. If anyone has a suggestion for free, conferenced audio, I would appreciate it.<br /><br />I would like to decide on topics to be covered in the next week or so. So, if you are interested in attending, please go to http://sl.osunix.org/FreeKernelTrainingDay, take a look at the topics, and sign up. If you have ideas for other kernel-related topics, please let me know. Depending on how this goes, I may do more of these in the future.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-29348603156369932002009-04-02T23:36:00.000-07:002009-04-02T23:44:13.404-07:00Free One-day OpenSolaris Internals classI am holding a free, one day OpenSolaris Internals class on-line on April 18 or 19. We'll cover 2 topics as determined by a vote of topics that may be covered. For more information, see <a href="http://sl.osunix.org/FreeKernelTrainingDay">http://sl.osunix.org/FreeKernelTrainingDay</a>. I hope to see you there!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-28996318489705485972009-04-02T23:25:00.000-07:002009-04-02T23:36:01.605-07:00OpenSolaris Internals classI am teaching an OpenSolaris Internals class at Systemics in Warsaw, Poland the week of May 4-8. The course will be held in English. 
For a detailed topic outline, see <a href="http://www.bruningsystems.com/page14/page13/page13.html">here</a>. For pricing, location information, and availability, please send email to <a href="mailto:magdalena.sternik@systemics.pl">magdalena.sternik@systemics.pl</a>. If you have questions about course content, please email me at <a href="mailto:max@bruningsystems.com">max@bruningsystems.com</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-51511474514644844872009-03-31T02:31:00.000-07:002009-03-31T04:20:04.668-07:00A faster memstat for mdbI have implemented a version of the ::memstat dcmd for mdb that gives results in less than half the time of the ::memstat currently in mdb. If you are interested, it is available for download <a href="ftp://ftp.bruningsystems.com/memstat.tar.Z">here</a>.<br /><br />So, how does it work? The current version of memstat uses the page walker (::walk page) to walk all cached pages in the system. The new version simply examines all pages.<br /><br />Each page-able page in the system is represented by a page_t structure. Basically, all of memory except for the unix and genunix kernel modules and a few other odds and ends is considered "page-able". Every page that is in use (by the kernel, a user process, anonymous memory, or the file cache) has an identity. This is a vnode/offset pair. The identity uniquely determines how the page is being used. For instance, a page for the code of a running bash process will have the vnode_t for /usr/bin/bash, and the offset within /usr/bin/bash of where the page comes from. For a kernel page, there is a special vnode_t (kvp). For anonymous space, the page has a swapfsvnode (also used for shared memory and tmpfs files). When a process gets a page fault, the fault handling code first checks to see if the faulting address is mapped in the process' address space. If not, a segmentation violation (SIGSEGV) is sent to the process. 
If the address is within the address space, the fault handling code sees if the page is already in memory. It does this by retrieving the vnode/offset for the faulting page, and hashing into an array called page_hash. Each entry in page_hash is the beginning of a linked list of page_t structures. So the fault handling code does a hash to get a page_hash array entry, then walks the page_t structures starting at that entry to look for a matching vnode_t/offset. If the page_t is in the hash, the fault handling code sets up a page table entry (translation table entry on SPARC) to map to the corresponding physical page.<br /><br />The page_hash array is sized so that the average search, given a page_hash bucket, is no longer than 4 (PAGE_HASHAVELEN in vm/page.h) entries. This makes searching for a cached page fairly fast.<br /><br />The page walker that mdb uses to do memstat walks every hash bucket looking for pages. Basically, if the page is found from a hash bucket, the page is either in use by the kernel, some process, or tmpfs, or the page is free but cached. Any page not hanging off of a page_hash bucket is considered free (i.e., the free (freelist) statistic).<br /><br />The new memstat takes a different approach. Rather than scanning each hash bucket for pages, it simply reads all of the page_t structures on the system, then examines each one to determine if it is a kernel page, executable page, anonymous page, and so on. Any page_t that does not have a vnode_t is considered a free page, and is counted in the free (freelist) statistic.<br /><br />How are the page_t structures found on the system? There is a linked list of memseg structures that do the bookkeeping for the page-able page structures. The list is headed by a pointer, memsegs, and is built early on in the startup code when the system is booting. 
I suspect the list could change due to dynamic reconfiguration events, but I'll leave that as an exercise for the reader...<br />To see the list, you can do "::memseg_list" in mdb. On my system, this gives:<br /><p><br /><code><br />> ::memseg_list<br /> ADDR PAGES EPAGES BASE END<br />fbe00028 fbe94160 fe652260 00000c00 0007fed0<br />fbe00014 fbe85160 fbe94160 00000100 00000400<br />fbe00000 fbe82050 fbe85160 00000002 0000009f<br />><br /></code><br /></p><br />The ADDR column is the address of a memseg structure. PAGES is the address of the first page_t in an array. EPAGES points to the end of the page_t array. BASE is the starting page frame number of the memseg, and END is the ending page frame number. So, on my system, page frames between 2 and 9f, 100 and 400, and c00 and 7fed0 have page_t structures. Physical pages 0 and 1 are not in the list. Also pages between 9f and 100, and between 400 and c00 are not in the list. This is either because the physical memory does not exist, or it is not considered pageable.<br /><br />The new memstat uses the memseg list to read in all of the page_t structures. On my system, this means 3 read calls (though the read from fbe94160 to fe652260 is quite large), versus thousands of read calls in the existing memstat via the page walker. The new memstat assumes there will never be more than 256 memseg structures. (This was arbitrarily chosen. I have not seen machines with more than 6 memseg structures, but I don't get on very large machines very often). A more correct way would be to build the memseg list within the dcmd, but I am lazy.<br /><br />Using DTrace to count system calls while running the two versions of memstat shows that the original memstat makes 1741225 system calls, while the new memstat makes 737344. So over a million fewer system calls in the new memstat.<br /><br />I think memstat performance could be improved even more by using mmap to map in the page arrays. 
Then there would be no need for using mdb_alloc, and no need to mdb_vread the page_t structs.Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-7245518.post-47058565406632840322008-08-25T07:31:00.000-07:002008-08-25T07:36:36.248-07:00Update to bruningsystems.com websiteI have added a section called "articles", which has links to various articles, presentations, and some course materials on OpenSolaris. You can see it <a href="http://www.bruningsystems.com/page16/page16.html">here</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-52320337067358495172008-08-18T10:05:00.000-07:002008-08-19T01:46:29.884-07:00recovering removed file on zfs diskI have used my modified mdb and zdb (see<br /><a href="http://www.osdevcon.org/2008/files/osdevcon2008-proceedings.pdf">http://www.osdevcon.org/2008/files/osdevcon2008-proceedings.pdf </a> and<br /><a href="http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf">http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf</a>)<br />to recover a file that was removed from a zfs file system.<br />The technique is to locate the active uberblock_t after the file<br />was created, but before the file was removed, and follow the data<br />structures from that uberblock_t. This technique would probably not<br />work on a nearly full file system, and probably not on a very busy file<br />system, but it works here. Also, this will not work with RAID-z,<br />but works fine for mirrors. (I shall get around to figuring out<br />raid-z, but not now...).<br /><br />It is possible to follow all of the steps and still not have the right<br />data because you chose the wrong uberblock_t, or one of the blocks containing<br />metadata (or the data itself) has been re-used.<br /><br />The modified mdb and zdb have been updated to work with Nevada,<br />build 94. It took about 15 minutes to merge the versions I was using<br />for build 79 into build 94. 
For source for the changes, and<br />the zfs dmod, send mail to me at max@bruningsystems.com.<br /><br />It might be possible, with a bit more clever use of mdb<br />and some shell scripting, to automate this... Also, it might be<br />useful to add an option to zdb so that a transaction<br />id other than the active one can be used for its traversals.<br />Then you might be able to do everything using zdb.<br /><br />The following describes the steps taken.<br /><br />First, I copy a file with known contents to the zfs file system.<br /><pre><br /><br /># cp /usr/dict/words /zfs_fs/words<br />#<br /><br /></pre><br />We'll get the object id (inumber) for /zfs_fs. We'll use it later.<br /><pre><br /><br /># ls -aid /zfs_fs<br /> 3 /zfs_fs<br />#<br /></pre><br />Next, I'll try to make sure everything is on the disk.<br /><pre><br /># sync<br />#<br /></pre><br />Now, I'll use zdb to get the root blkptr from the uberblock.<br />This will also give me a transaction ID. Generally, you would not<br />use zdb to get the uberblock_t every time you add or remove a<br />file on a zfs file system. That is ok. 
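The reason it is ok: the pool keeps a ring of uberblocks, so an older one can still be found after the fact. In miniature (toy Python with made-up txg values, not the on-disk layout):

```python
# Sketch of the recovery idea: a label holds a ring of uberblocks, the
# active one is the entry with the highest txg, and an older entry can
# be picked instead.  The txg values here are made up for illustration.

ring = [{"txg": t} for t in (1278, 1280, 1282, 1284)]

def active_uberblock(ring):
    # what the pool considers current
    return max(ring, key=lambda ub: ub["txg"])

def uberblock_for_txg(ring, txg):
    # for recovery: an earlier txg, if it still exists in the ring and
    # the metadata it points to has not been reused
    for ub in ring:
        if ub["txg"] == txg:
            return ub
    return None
```

With these values, the active entry is txg 1284, but txg 1282 — the one from before the remove — is still sitting in the ring.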
I have written a dcmd<br />(output shown below), that walks the uberblock_t array on disk.<br />Then you can, by trial and error, locate the uberblock_t you need<br />(assuming it still exists in the array, and assuming the metadata<br />it points to has not been re-used for another purpose).<br /><pre><br /><br /># ./zdb -uuu zfs_fs<br />Uberblock<br /><br /> magic = 0000000000bab10c<br /> version = 11<br /> txg = 1282 <-- transaction id in decimal<br /> guid_sum = 8876692711396000182<br /> timestamp = 1218963748 UTC = Sun Aug 17 11:02:28 2008<br /> rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:11a00:200><br /> DVA[1]=<0:c010e00:200> DVA[2]=<0:18008e00:200> fletcher4<br /> lzjb LE contiguous birth=1282 fill=27<br /> cksum=81f780ec5:361b52dda06:b6f3f410036f:1a2b8b10bfdb5c<br /><br /></pre><br />Next, I'll remove the file I just created.<br /><pre><br /><br /># rm /zfs_fs/words <br />#<br /><br /></pre><br />Let's take a look at the active uberblock_t.<br /><pre><br /><br /># ./zdb -uuu zfs_fs<br />Uberblock<br /><br /> magic = 0000000000bab10c<br /> version = 11<br /> txg = 1282 <-- nothing has changed<br /> guid_sum = 8876692711396000182<br /> timestamp = 1218963748 UTC = Sun Aug 17 11:02:28 2008<br /> rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:11a00:200> <br /> DVA[1]=<0:c010e00:200> DVA[2]=<0:18008e00:200> fletcher4<br /> lzjb LE contiguous birth=1282 fill=27 <br /> cksum=81f780ec5:361b52dda06:b6f3f410036f:1a2b8b10bfdb5c<br /><br /></pre><br />Let's try to make sure it is on the disk.<br /><pre><br /># sync<br />#<br /></pre><br />And check the active uberblock_t again.<br /><pre><br /># ./zdb -uuu zfs_fs<br />Uberblock<br /><br /> magic = 0000000000bab10c<br /> version = 11<br /> txg = 1284 <-- new transaction id, after file was removed<br /> guid_sum = 8876692711396000182<br /> timestamp = 1218963808 UTC = Sun Aug 17 11:03:28 2008<br /> rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:15200:200> <br /> DVA[1]=<0:c014600:200> DVA[2]=<0:1800a000:200> fletcher4<br 
/> lzjb LE contiguous birth=1284 fill=27 <br /> cksum=87431704e:37f154f58d7:bbddb76e9703:1aaf346847004f<br /><br /></pre><br />Now, let's make sure nothing changes in the file system.<br /><pre><br /><br /># zfs umount zfs_fs<br />#<br /><br /></pre><br />And look at the active uberblock_t again.<br /><pre><br /><br /># ./zdb -uuu zfs_fs<br />Uberblock<br /><br /> magic = 0000000000bab10c<br /> version = 11<br /> txg = 1284 <-- ok. nothing changed<br /> guid_sum = 8876692711396000182<br /> timestamp = 1218963808 UTC = Sun Aug 17 11:03:28 2008<br /> rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:15200:200> <br /> DVA[1]=<0:c014600:200> DVA[2]=<0:1800a000:200> fletcher4 <br /> lzjb LE contiguous birth=1284 fill=27 <br /> cksum=87431704e:37f154f58d7:bbddb76e9703:1aaf346847004f<br /><br /></pre><br />Ok. So nothing changed when the file system was unmounted.<br />Now, we'll use the modified mdb to examine the uberblock_t array on disk.<br />The uberblock_t we want has transaction id 1282 decimal.<br /><pre><br /><br /># ./mdb /export/home/max/zfsfile <br /></pre><br />First, convert decimal 1282 to hex.<br /><pre><br />> 0t1282=X<br /> 502 <br /></pre><br />Now, load kernel CTF and a few dcmds that work with zfs on disk.<br /><pre><br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so <br /></pre><br />Walk the uberblock_t array on disk. This shows all possible 1024 uberblocks.<br />Here, we'll only show the entry with ub_txg = 0x502. 
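As an aside, the DVA[0]=&lt;0:11a00:200&gt; notation that zdb printed above and the raw dva_word pairs stored in a blkptr_t are related by simple shifts. A hypothetical decoder, assuming the usual on-disk DVA layout (24-bit asize and 32-bit vdev in word 0, offset in 512-byte sectors in word 1, with bit 63 of word 1 being the gang flag):

```python
# Decode raw dva_word pairs into the vdev:offset:asize form that zdb
# and ::blkptr display.  Assumes the usual DVA layout: word 0 holds
# asize (bits 0-23), grid (24-31) and vdev (32-63); word 1 holds the
# offset in 512-byte sectors, with bit 63 as the gang flag.

SECTOR_SHIFT = 9  # offsets and sizes are stored in 512-byte sectors

def decode_dva(word0, word1):
    vdev = word0 >> 32
    asize = (word0 & 0xFFFFFF) << SECTOR_SHIFT
    offset = (word1 & ((1 << 63) - 1)) << SECTOR_SHIFT
    return vdev, offset, asize

# The three dva_word pairs from the rootbp of the txg-0x502 uberblock:
dvas = [decode_dva(0x1, 0x8d),
        decode_dva(0x1, 0x60087),
        decode_dva(0x1, 0xc0047)]
```

These decode to 0:11a00:200, 0:c010e00:200, and 0:18008e00:200, matching the three DVAs in the zdb output.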
Again, if I had not<br />retrieved the value of the active uberblock_t after the file was created,<br />and before the file was removed, I could dump all uberblock_t using the<br />following command, and then search backwards, trying all transaction ids<br />that are less than the current (i.e., current after the file was removed<br />and the file system unmounted).<br /><pre><br />> ::walk uberblock | ::print -a -t zfs`uberblock_t<br />...<br />{<br /> 20800 uint64_t ub_magic = 0xbab10c<br /> 20808 uint64_t ub_version = 0xb<br /> 20810 uint64_t ub_txg = 0x502 <-- the correct transaction id<br /> 20818 uint64_t ub_guid_sum = 0x7b3058fd830ec1b6<br /> 20820 uint64_t ub_timestamp = 0x48a7e924<br /> 20828 blkptr_t ub_rootbp = { <-- blkptr is at 0x20828 on disk<br /> 20828 dva_t [3] blk_dva = [<br /> {<br /> 20828 uint64_t [2] dva_word = [ 0x1, 0x8d ]<br /> }<br /> {<br /> 20838 uint64_t [2] dva_word = [ 0x1, 0x60087 ]<br /> }<br /> {<br /> 20848 uint64_t [2] dva_word = [ 0x1, 0xc0047 ]<br /> }<br /> ]<br /> 20858 uint64_t blk_prop = 0x800b070300000001<br /> 20860 uint64_t [3] blk_pad = [ 0, 0, 0 ]<br /> 20878 uint64_t blk_birth = 0x502<br /> 20880 uint64_t blk_fill = 0x1b<br /> 20888 zio_cksum_t blk_cksum = {<br /> 20888 uint64_t [4] zc_word = [ 0x81f780ec5, 0x361b52dda06, <br />0xb6f3f410036f, 0x1a2b8b10bfdb5c ]<br /> }<br /> }<br />}<br />...<br /></pre><br />Let's dump the blkptr_t for this uberblock_t.<br /><pre><br />> 20828::blkptr<br />DVA[0]: vdev_id 0 / 11a00<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[0]: :0:11a00:200:d<br />DVA[1]: vdev_id 0 / c010e00<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[1]: :0:c010e00:200:d<br />DVA[2]: vdev_id 0 / 18008e00<br />DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[2]: :0:18008e00:200:d<br />LSIZE: 400 PSIZE: 200<br />ENDIAN: LITTLE TYPE: DMU objset<br />BIRTH: 502 LEVEL: 0 FILL: 1b00000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 
81f780ec5:361b52dda06:b6f3f410036f:1a2b8b10bfdb5c<br />$q<br />#<br /></pre><br />Now, using the modified zdb, let's dump the mos objset_phys_t.<br /><pre><br /><br /># ./zdb -R zfs_fs:0:11a00:200:d,lzjb,400 2> /tmp/mos<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />Back to mdb to examine the objset_phys_t for the meta object set (mos).<br /><pre><br /><br /># ./mdb /tmp/mos<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so <br />> 0::print -a -t zfs`objset_phys_t<br />{<br /> 0 dnode_phys_t os_meta_dnode = {<br /> 0 uint8_t dn_type = 0xa <-- this is DMU_OT_DNODE<br /> ...<br /> 40 blkptr_t [1] dn_blkptr = [<br /> {<br /> 40 dva_t [3] blk_dva = [<br /> {<br /> 40 uint64_t [2] dva_word = [ 0x5, 0x88 ]<br /> }<br /> {<br /> 50 uint64_t [2] dva_word = [ 0x5, 0x60082 ]<br /> }<br /> {<br /> 60 uint64_t [2] dva_word = [ 0x5, 0xc0042 ]<br /> }<br /> ]<br /> 70 uint64_t blk_prop = 0x800a07030004001f<br /> 78 uint64_t [3] blk_pad = [ 0, 0, 0 ]<br /> 90 uint64_t blk_birth = 0x502<br /> 98 uint64_t blk_fill = 0x1a<br /> a0 zio_cksum_t blk_cksum = {<br /> a0 uint64_t [4] zc_word = [ 0xa9af50f215, 0xec01e192b95e, <br />0xc523efad092ebc, 0x7a3a8be19416f454 ]<br /> }<br /> }<br /> ]<br /> ...<br /></pre><br />And dump the blkptr_t in the objset_phys_t.<br /><pre><br /><br />> 40::blkptr<br />DVA[0]: vdev_id 0 / 11000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: a0000000000<br />DVA[0]: :0:11000:a00:d<br />DVA[1]: vdev_id 0 / c010400<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: a0000000000<br />DVA[1]: :0:c010400:a00:d<br />DVA[2]: vdev_id 0 / 18008400<br />DVA[2]: GANG: FALSE GRID: 0000 ASIZE: a0000000000<br />DVA[2]: :0:18008400:a00:d<br />LSIZE: 4000 PSIZE: a00<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 0 FILL: 1a00000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: a9af50f215:ec01e192b95e:c523efad092ebc:7a3a8be19416f454<br />$q<br />#<br /></pre><br />Using zdb with the offset specified for the first ditto block<br />in 
the above blkptr output, we get the mos dnode array.<br />Note that the "LEVEL: 0" blkptr output means there are<br />no levels of indirection. On larger zfs file systems, you may<br />need to go through block(s) of indirect blkptr_t's. An example of this<br />is shown a bit later.<br /><pre><br /><br /># ./zdb -R zfs_fs:0:11000:a00:d,lzjb,4000 2> /tmp/metadnode<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />Now, we'll look at the metadnode for the DMU_OT_OBJECT_DIRECTORY. This<br />will tell us about objects in the zfs file system. For every zfs file<br />system that I have tried this on, this is dnode number 1 (starting from<br />0). Regardless, the field to check is "dn_type = 0x1". It is possible<br />(I assume) for this to be at a different index into the metadnode array,<br />and possibly not in the 0x4000 bytes read and decompressed from 0x11000.<br />In that case, the LEVEL field would not have been 0, and you would have to<br />look at indirect blkptr_t's. But not here...<br /><pre><br /><br /># ./mdb /tmp/metadnode <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0,20::print -a -t zfs`dnode_phys_t <-- dnode_phys_t is 0x200 bytes, so 0x20<br />{ <-- of these in 0x4000.<br /> 0 uint8_t dn_type = 0 <-- First entry is not used (DMU_OT_NONE)<br />...<br />}<br />{ <-- start of the second entry<br /> 200 uint8_t dn_type = 0x1 <-- DMU_OT_OBJECT_DIRECTORY (see dmu.h)<br /> ...<br /> 240 blkptr_t [1] dn_blkptr = [ <-- blkptr_t is 0x240 in /tmp/metadnode<br /> ... 
<-- lots of output omitted, we'll look at some of this later.<br />}<br /><br /></pre><br />Now we'll look at the blkptr_t for the Object Directory.<br /><pre><br />240::blkptr<br />DVA[0]: vdev_id 0 / 2400<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[0]: :0:2400:200:d<br />DVA[1]: vdev_id 0 / c002400<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[1]: :0:c002400:200:d<br />DVA[2]: vdev_id 0 / 18000000<br />DVA[2]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[2]: :0:18000000:200:d<br />LSIZE: 200 PSIZE: 200<br />ENDIAN: LITTLE TYPE: object directory<br />BIRTH: 4 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher4 COMP: uncompressed<br />CKSUM: 5a40238b4:1cd8f9f7e19:522eab3e03f0:a9c92410b009e<br />$q<br /><br />#<br /></pre><br />Now, we'll read the (uncompressed) 0x200 bytes of the object directory using zdb. The "2400" is the (hex) offset from the blkptr_t above.<br /><pre><br /><br /># ./zdb -R zfs_fs:0:2400:200:r 2> /tmp/objdir<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />Back to mdb to look at the object directory. Object directories are "zap"<br />objects. Zap objects contain name/value pairs. The first 64 bits<br />identify the type of the zap (micro zap or fat zap). A "fat zap" is a zap<br />object that uses indirection. Micro zaps contain name/value pairs directly<br />(i.e., no indirection). I have not seen a fat zap (but the largest zfs<br />file system I have used is only ~140GB, and I have not examined large<br />directories). 
(Directory entries are stored in zap objects).<br /><pre><br /># ./mdb /tmp/objdir <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br /><br />> 0/J <-- look at the first 64 bits as hex<br />0: 8000000000000003 <-- a "signature" for a micro zap<br /><br />> 0::print -a -t zfs`mzap_phys_t <-- the beginning of the microzap is<br />{ <-- an mzap_phys_t<br /> 0 uint64_t mz_block_type = 0x8000000000000003<br /> 8 uint64_t mz_salt = 0x129c2c3<br /> 10 uint64_t mz_normflags = 0<br /> 18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]<br /> 40 mzap_ent_phys_t [1] mz_chunk = [ <-- there may be more than one<br /> { <-- mzap_ent_phys_t starting here<br /> 40 uint64_t mze_value = 0x2 <-- object id of "root_dataset"<br /> 48 uint32_t mze_cd = 0<br /> 4c uint16_t mze_pad = 0<br /> 4e char [50] mze_name = [ "root_dataset" ] <br /> }<br /> ]<br />}<br />$q<br />#<br /></pre><br />Now, we go back to the mos metadnode array in /tmp/metadnode, and<br />examine object id 2 (the third entry in the array).<br />Each entry is 0x200 bytes, so we want the dnode_phys_t starting<br />at (2*200) bytes into the file.<br /><pre><br /># ./mdb /tmp/metadnode <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 2*200::print -a -t zfs`dnode_phys_t <-- get object id 2<br />{<br /> 400 uint8_t dn_type = 0xc <-- DMU_OT_DSL_DIR (a dataset directory object)<br /> ...<br /> 404 uint8_t dn_bonustype = 0xc <-- bonus buffer contains a dsl_dir_phys_t<br /> ...<br /> 440 blkptr_t [1] dn_blkptr = [ <-- not used for this object<br /> {<br /> 440 dva_t [3] blk_dva = [<br /> {<br /> 440 uint64_t [2] dva_word = [ 0, 0 ]<br /> ...<br /> 4c0 uint8_t [320] dn_bonus = [ 0xe5, 0xa9, 0xa6, 0x48, 0, 0, 0, 0, 0x10,<br /> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0, 0, 0, ... 
]<br />}<br /></pre><br />And dump the bonus buffer at 0x4c0.<br /><pre><br />> 4c0::print -a -t zfs`dsl_dir_phys_t<br />{<br /> 4c0 uint64_t dd_creation_time = 0x48a6a9e5<br /> 4c8 uint64_t dd_head_dataset_obj = 0x10 <-- object id for dataset head<br /> ...<br />}<br /></pre><br />Let's get object id 10 from the metadnode array.<br /><pre><br />> 10*200::print -a -t zfs`dnode_phys_t<br />{<br /> 2000 uint8_t dn_type = 0x10 <-- DMU_OT_DSL_DATASET<br /> ...<br /> 2004 uint8_t dn_bonustype = 0x10 <-- bonus buffer contains dsl_dataset_phys_t<br /> ...<br /> 2040 blkptr_t [1] dn_blkptr = [ <-- again, not used here<br /> {<br /> 2040 dva_t [3] blk_dva = [<br /> {<br /> 2040 uint64_t [2] dva_word = [ 0, 0 ]<br /> ...<br /> 20c0 uint8_t [320] dn_bonus = [ 0x2, 0, 0, 0, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0<br />, 0, 0, 0x1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]<br />}<br /></pre><br />At 0x20c0 in the /tmp/metadnode file is a dsl_dataset_phys_t (the bonus<br />buffer).<br /><pre><br />> 20c0::print -a -t zfs`dsl_dataset_phys_t<br />{<br /> 20c0 uint64_t ds_dir_obj = 0x2<br /> ...<br /> 2140 blkptr_t ds_bp = {<br /> 2140 dva_t [3] blk_dva = [<br /> {<br /> 2140 uint64_t [2] dva_word = [ 0x1, 0x79 ]<br /> }<br /> {<br /> 2150 uint64_t [2] dva_word = [ 0x1, 0x60073 ]<br /> }<br /> {<br /> 2160 uint64_t [2] dva_word = [ 0, 0 ]<br /> }<br /> ]<br /> ...<br />}<br /></pre><br />Let's look at the blkptr_t in the dsl_dataset_phys_t.<br /><pre><br /><br />> 2140::blkptr<br />DVA[0]: vdev_id 0 / f200<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[0]: :0:f200:200:d<br />DVA[1]: vdev_id 0 / c00e600<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[1]: :0:c00e600:200:d<br />LSIZE: 400 PSIZE: 200<br />ENDIAN: LITTLE TYPE: DMU objset<br />BIRTH: 502 LEVEL: 0 FILL: 600000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 9cb4e346a:403aa7532bf:d688fac60e1e:1e67a933734ea5<br />$q<br />#<br /></pre><br />The blkptr_t for the dsl_dataset_phys_t is for 
another DMU objset.<br />(The first DMU objset was from the uberblock_t rootbp and describes<br />the meta object set. The objset described by the dsl_dataset_phys_t describes<br />the set of objects in the file system, i.e., files and directories (and...?).)<br />Back to zdb to get this data.<br /><pre><br /># ./zdb -R zfs_fs:0:f200:200:d,lzjb,400 2> /tmp/root_dataset_mos<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />And back to mdb to display the objset_phys_t for the root dataset.<br /><pre><br /># ./mdb /tmp/root_dataset_mos <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::print -a -t zfs`objset_phys_t<br />{<br /> 0 dnode_phys_t os_meta_dnode = {<br /> 0 uint8_t dn_type = 0xa <-- DMU_OT_DNODE again, this time for the root dataset<br /> ...<br /> 40 blkptr_t [1] dn_blkptr = [ <-- blkptr_t is 0x40 bytes into the file<br /> ...<br /></pre><br />And dump the blkptr_t...<br /><pre><br />> 40::blkptr<br />DVA[0]: vdev_id 0 / 10800<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:10800:400:id<br />DVA[1]: vdev_id 0 / c00fc00<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00fc00:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 6 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 58461f1c5e:3c7272ace4a1:15e2cf555fd2ac:58b9cdc6bcd0b54<br />$q<br />#<br /></pre><br />Note the "LEVEL: 6" in the above output. There are 6 levels of indirection<br />to get to another array of dnode_phys_t. We will follow the levels, always<br />using the first indirect blkptr_t at each level since the file was in <br />a directory whose object id is 3 (from "ls -aid /zfs_fs" back at the<br />beginning). 
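The index arithmetic behind "always take the first blkptr_t" can be written down explicitly. This is my own sketch, using the sizes from this walkthrough (0x20 dnode_phys_t per 0x4000-byte level-0 block, 0x80 blkptr_t per 0x4000-byte indirect block):

```python
# Which blkptr_t index to follow at each indirection level to reach the
# level-0 block holding a given dnode object id.  Sizes as in the text:
# 0x4000-byte blocks, 0x200-byte dnode_phys_t, 0x80-byte blkptr_t.

DNODES_PER_BLOCK = 0x4000 // 0x200     # 0x20 dnodes per level-0 block
BLKPTRS_PER_INDIRECT = 0x4000 // 0x80  # 0x80 blkptrs per indirect block

def indirect_path(objid, levels):
    """Blkptr index at each level, from the top level down to level 1."""
    block = objid // DNODES_PER_BLOCK  # which level-0 block holds the dnode
    return [(block // BLKPTRS_PER_INDIRECT ** (lvl - 1)) % BLKPTRS_PER_INDIRECT
            for lvl in range(levels, 0, -1)]
```

For object id 3, the path through all six levels is index 0 at every step, which is why the walk just keeps taking blkptr 0; an object id of 0x25 or higher would already need a nonzero index at level 1.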
If I want the dnode_phys_t for a different object id, I<br />can use the technique explained in the paper and slides referenced<br />at the beginning.<br /><pre><br /># ./zdb -R zfs_fs:0:10800:400:d,lzjb,4000 2> /tmp/dnode_l6<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />Each indirect blkptr_t array contains 0x80 blkptr_t structures. (The size<br />of a blkptr_t is 0x80 bytes. 0x4000 (i.e., the size of the decompressed<br />data) divided by 0x80 = 0x80). We'll use mdb to examine blkptr 0 in the<br />array.<br /><pre><br /># ./mdb /tmp/dnode_l6 <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / 10400<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:10400:400:id<br />DVA[1]: vdev_id 0 / c00f800<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00f800:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 5 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 593bf7cd50:3d5bdfbff40e:1652191251855c:5af2260f72aa12a<br />$q<br />#<br /></pre><br />Great. Let's get the indirect array for level 5.<br /><pre><br /># ./zdb -R zfs_fs:0:10400:400:d,lzjb,4000 2> /tmp/dnode_l5<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />And back to mdb to display it...<br /><pre><br /># ./mdb /tmp/dnode_l5<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / 10000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:10000:400:id<br />DVA[1]: vdev_id 0 / c00f400<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00f400:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 4 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 5a4787d7ae:3e5a99501b03:16cbd983ac6802:5d616f6513864cb<br />$q<br />#<br /></pre><br />And now to level 4. 
Note that BIRTH value corresponds to the<br />transaction id we want... (0x502 = 0t1282).<br /><pre><br /># ./zdb -R zfs_fs:0:10000:400:d,lzjb,4000 2> /tmp/dnode_l4<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />Back to mdb...<br /><pre><br /># ./mdb /tmp/dnode_l4<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / fc00<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:fc00:400:id<br />DVA[1]: vdev_id 0 / c00f000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00f000:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 3 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 580321bd90:3cf0cb3827a9:1647f21a4fee83:5b2042e25b8771b<br />$q<br />#<br /></pre><br />Now to level 3.<br /><pre><br /># ./zdb -R zfs_fs:0:fc00:400:d,lzjb,4000 2> /tmp/dnode_l3<br />Found vdev: /export/home/max/zfsfile<br />#<br /># ./mdb /tmp/dnode_l3<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / f800<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:f800:400:id<br />DVA[1]: vdev_id 0 / c00ec00<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00ec00:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 2 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 58e75640d6:3d0bc696c0c2:162c02c075c9ab:5a30099a876cabe<br />$q<br />#<br /></pre><br />And level 2...<br /><pre><br /># ./zdb -R zfs_fs:0:f800:400:d,lzjb,4000 2> /tmp/dnode_l2<br />Found vdev: /export/home/max/zfsfile<br />#<br /># ./mdb /tmp/dnode_l2<br />mdb: no terminal data available for TERM=emacs<br />mdb: term init failed: command-line editing and prompt will not be available<br />::loadctf<br />::load /export/home/max/source/mdb/i386/rawzfs.so<br />0::blkptr<br />DVA[0]: vdev_id 0 / f400<br />DVA[0]: 
GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:f400:400:id<br />DVA[1]: vdev_id 0 / c00e800<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00e800:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 1 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 5763205f2d:3c57f7df68a9:15fea170721af5:59a7686491c24a7<br />$q<br />#<br /></pre><br />And level 1.<br /><pre><br /># ./zdb -R zfs_fs:0:f400:400:d,lzjb,4000 2> /tmp/dnode_l1<br />Found vdev: /export/home/max/zfsfile<br />#<br /># ./mdb /tmp/dnode_l1<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / ec00<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 60000000000<br />DVA[0]: :0:ec00:600:d<br />DVA[1]: vdev_id 0 / c00e000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 60000000000<br />DVA[1]: :0:c00e000:600:d<br />LSIZE: 4000 PSIZE: 600<br />ENDIAN: LITTLE TYPE: DMU dnode<br />BIRTH: 502 LEVEL: 0 FILL: 500000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 6fb5c84271:61d7d7ffe6a4:2f9cbc90dcaa4c:10f07885852e2558<br />$q<br />#<br /></pre><br />Level 0 will contain the beginning of the array of dnode_phys_t<br />for files and directories within the file system.<br />We'll again use zdb to retrieve the block containing the first<br />0x20 entries. 
(Again, decompressed size is 0x4000, dnode_phys_t size<br />is 0x200, so there are 0x20 entries in the first level 0 block).<br /><pre><br /># ./zdb -R zfs_fs:0:ec00:600:d,lzjb,4000 2> /tmp/dnode_l0<br />Found vdev: /export/home/max/zfsfile<br />#<br /><br /># ./mdb /tmp/dnode_l0<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0,20::print -t -a zfs`dnode_phys_t<br />{<br /> 0 uint8_t dn_type = 0 <-- first entry is not used<br />...<br />}<br />{ <-- second entry (object id 1)<br /> 200 uint8_t dn_type = 0x15 <-- DMU_OT_MASTER_NODE<br /> ...<br /> 240 blkptr_t [1] dn_blkptr = [<br /> {<br /> ...<br />}<br />{ <-- third entry (object id 2)<br /> 400 uint8_t dn_type = 0x16<br /> ...<br />{ <-- fourth entry (object id 3, should be root directory for the fs)<br /> 600 uint8_t dn_type = 0x14 <-- DMU_OT_DIRECTORY_CONTENTS<br /> ...<br /> 604 uint8_t dn_bonustype = 0x11 <-- bonus buffer contains znode_phys_t<br /> ...<br /> 640 blkptr_t [1] dn_blkptr = [ <-- this blkptr_t is a zap for directory entries<br /> {<br /> 640 dva_t [3] blk_dva = [<br /> {<br /> 640 uint64_t [2] dva_word = [ 0x1, 0x73 ]<br /> }<br /> ...<br /> 6c0 uint8_t [320] dn_bonus = [ 0x1e, 0xe9, 0xa7, 0x48, 0, 0, 0, 0,<br /> 0xc3, 0x61, 0x34, 0xf, 0, 0, 0, 0, 0x1f, 0xe9, 0xa7, 0x48, 0,<br /> 0, 0, 0, 0x1, 0x43, 0x79, 0x3a, 0, 0, 0, 0, ... ]<br /> ... <-- lots omitted<br />}<br /></pre><br />At this point, we could go to the fourth entry in the above output<br />(object id 3 at 0x600 bytes into the file) and look at the directory<br />contents to see if the removed file is there. (Remember, ls -aid on<br />the directory containing the removed file shows inumber 3).<br />However, we'll be safe and examine the master node to get to <br />the root directory of the file system. The master node<br />is object id 1 (at 0x200 in the above output). 
The block pointer<br />for this dnode_phys_t is for a zap object.<br />We'll use mdb to dump the master node blkptr_t.<br /><pre><br />> 240::blkptr<br />DVA[0]: vdev_id 0 / 0<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[0]: :0:0:200:d<br />DVA[1]: vdev_id 0 / c000000<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[1]: :0:c000000:200:d<br />LSIZE: 200 PSIZE: 200<br />ENDIAN: LITTLE TYPE: ZFS master node<br />BIRTH: 4 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher4 COMP: uncompressed<br />CKSUM: 233dfc135:e10dd7aa27:2e8c1eba771e:6a380d575d3d6<br />$q<br />#<br /></pre><br />And now back to zdb to get the zfs master node zap object. Note<br />that this is not compressed, and is at the beginning of the disk<br />(following label 0 and label 1).<br /><pre><br /># ./zdb -R zfs_fs:0:0:200:r 2> /tmp/master_node<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />Back to mdb to examine the master node zap.<br /><pre><br /># ./mdb /tmp/master_node <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0/J <-- let's see what kind of zap it is<br />0: 8000000000000003 <-- micro zap<br />> 0::print -a -t zfs`mzap_phys_t<br />{<br /> 0 uint64_t mz_block_type = 0x8000000000000003<br /> 8 uint64_t mz_salt = 0x3d3b<br /> 10 uint64_t mz_normflags = 0<br /> 18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]<br /> 40 mzap_ent_phys_t [1] mz_chunk = [<br /> {<br /> 40 uint64_t mze_value = 0x3<br /> 48 uint32_t mze_cd = 0<br /> 4c uint16_t mze_pad = 0<br /> 4e char [50] mze_name = [ "VERSION" ]<br /> }<br /> ]<br />}<br /></pre><br />The mzap_phys_t is 0x80 bytes large. Following this are zero or more<br />mzap_ent_phys_t. Each mzap_ent_phys_t is 0x40 bytes. 
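A quick sanity check on those sizes (plain Python, mirroring the layout just described; the first chunk is the mz_chunk[0] embedded at 0x40 inside the 0x80-byte mzap_phys_t):

```python
# Chunk offsets inside a 0x200-byte microzap block: a 0x40-byte header,
# then 0x40-byte mzap_ent_phys_t chunks.  The first chunk (mz_chunk[0])
# is the one embedded at offset 0x40 inside the 0x80-byte mzap_phys_t.

BLOCK_SIZE = 0x200
HEADER_SIZE = 0x40
CHUNK_SIZE = 0x40

chunk_offsets = list(range(HEADER_SIZE, BLOCK_SIZE, CHUNK_SIZE))

# Count of the chunks that follow the mzap_phys_t: (0x200 - 0x80) / 0x40
chunks_after_header_struct = (BLOCK_SIZE - 0x80) // CHUNK_SIZE
```

That puts chunks at 0x40, 0x80, 0xc0, ..., 0x1c0, and six chunks after the mzap_phys_t, matching the count in the mdb expression `80,((200-80)%40)` (mdb's `%` is division).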
The following<br />will dump all mzap_ent_phys_t following the mzap_phys_t in the block.<br /><pre><br />> 80,((200-80)%40)::print -a -t zfs`mzap_ent_phys_t<br />...<br />{<br /> c0 uint64_t mze_value = 0x3 <-- the object id for the root of the fs<br /> c8 uint32_t mze_cd = 0<br /> cc uint16_t mze_pad = 0<br /> ce char [50] mze_name = [ "ROOT" ] <-- this is root<br />}<br />$q<br />#<br /></pre><br />Now, back to the level 0 dnode_phys_t array to look at the root directory<br />dnode_phys_t.<br /><pre><br /># ./mdb /tmp/dnode_l0<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 3*200::print -a -t zfs`dnode_phys_t<br />{<br /> 600 uint8_t dn_type = 0x14 <-- DMU_OT_DIRECTORY_CONTENTS<br /> ...<br /> 604 uint8_t dn_bonustype = 0x11 <-- bonus buffer contains znode_phys_t<br /> ...<br /> 640 blkptr_t [1] dn_blkptr = [<br /> {<br /> 640 dva_t [3] blk_dva = [<br /> {<br /> 640 uint64_t [2] dva_word = [ 0x1, 0x73 ]<br /> }<br /> ...<br />}<br /></pre><br />The blkptr_t is for a zap object containing filename/object id<br />values for files in the root directory of the file system.<br /><pre><br />> 640::blkptr<br />DVA[0]: vdev_id 0 / e600<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[0]: :0:e600:200:d<br />DVA[1]: vdev_id 0 / c00da00<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 20000000000<br />DVA[1]: :0:c00da00:200:d<br />LSIZE: 200 PSIZE: 200<br />ENDIAN: LITTLE TYPE: ZFS directory<br />BIRTH: 502 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher4 COMP: uncompressed<br />CKSUM: 25f50a2fc:fe963fd84e:36937666328d:7f9475424708c<br />$q<br />#<br /></pre><br />Now read the root directory zap object.<br /><pre><br /># ./zdb -R zfs_fs:0:e600:200:r 2> /tmp/rootdir<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />And use mdb to look at the zap entries.<br /><pre><br /># ./mdb /tmp/rootdir <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0/J<br />0: 8000000000000003 <-- a micro zap<br 
/><br /><br />> 0::print -a -t zfs`mzap_phys_t<br />{<br /> 0 uint64_t mz_block_type = 0x8000000000000003<br /> 8 uint64_t mz_salt = 0x3e0f<br /> 10 uint64_t mz_normflags = 0<br /> 18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]<br /> 40 mzap_ent_phys_t [1] mz_chunk = [<br /> {<br /> 40 uint64_t mze_value = 0x8000000000000004<br /> 48 uint32_t mze_cd = 0<br /> 4c uint16_t mze_pad = 0<br /> 4e char [50] mze_name = [ "foo" ]<br /> }<br /> ]<br />}<br /></pre><br />And dump the rest of the zap entries.<br /><pre><br />> 80,((200-80)%40)::print -a -t zfs`mzap_ent_phys_t<br />{<br /> 80 uint64_t mze_value = 0x8000000000000005<br /> 88 uint32_t mze_cd = 0<br /> 8c uint16_t mze_pad = 0<br /> 8e char [50] mze_name = [ "words" ] <-- here is the removed file<br />}<br />...<br />5*200=X <-- we want dnode_phys_t object id 5.<br /> a00 <-- Offset within /tmp/dnode_l0 where the object resides<br />$q<br />#<br /></pre><br />We'll go back and get the dnode for object id 5.<br /><pre><br /># ./mdb /tmp/dnode_l0<br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> a00::print -a -t zfs`dnode_phys_t<br />{<br /> a00 uint8_t dn_type = 0x13 <-- DMU_OT_PLAIN_FILE_CONTENTS<br /> ...<br /> a04 uint8_t dn_bonustype = 0x11 <-- znode_phys_t for "words" file<br /> ...<br /> a40 blkptr_t [1] dn_blkptr = [ <-- blkptr_t for data or indirect blocks<br /> {<br /> ...<br /> ac0 uint8_t [320] dn_bonus = [ 0x1f, 0xe9, 0xa7, 0x48, 0, 0, 0, 0, 0xcb, <br />0x96, 0x78, 0x3a, 0, 0, 0, 0, 0x1f, 0xe9, 0xa7, 0x48, 0, 0, 0, 0, 0xd1, 0xb1, <br />0x83, 0x3a, 0, 0, 0, 0, ... 
]<br />}<br /></pre><br />Now, let's take a quick look at the znode_phys_t for this file.<br />It is in the bonus buffer at 0xac0.<br /><pre><br />> ac0::print -a -t zfs`znode_phys_t<br />{<br /> ac0 uint64_t [2] zp_atime = [ 0x48a7e91f, 0x3a7896cb ]<br /> ad0 uint64_t [2] zp_mtime = [ 0x48a7e91f, 0x3a83b1d1 ]<br /> ae0 uint64_t [2] zp_ctime = [ 0x48a7e91f, 0x3a83b1d1 ]<br /> af0 uint64_t [2] zp_crtime = [ 0x48a7e91f, 0x3a7896cb ]<br /> b00 uint64_t zp_gen = 0x502<br /> b08 uint64_t zp_mode = 0x8124<br /> b10 uint64_t zp_size = 0x32752 <-- should be same size as /usr/dict/words<br /> b18 uint64_t zp_parent = 0x3<br /> b20 uint64_t zp_links = 0x1<br /> ...<br />}<br /><br />> 32752=D<br /> 206674 <br />> !ls -l /usr/dict/words<br />-r--r--r-- 1 root bin 206674 Jul 11 02:57 /usr/dict/words<br /></pre><br />Looks good. Let's look at the blkptr_t for this dnode_phys_t.<br /><pre><br />> a40::blkptr<br />DVA[0]: vdev_id 0 / e800<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[0]: :0:e800:400:id<br />DVA[1]: vdev_id 0 / c00dc00<br />DVA[1]: GANG: FALSE GRID: 0000 ASIZE: 40000000000<br />DVA[1]: :0:c00dc00:400:id<br />LSIZE: 4000 PSIZE: 400<br />ENDIAN: LITTLE TYPE: ZFS plain file<br />BIRTH: 502 LEVEL: 1 FILL: 200000000<br />CKFUNC: fletcher4 COMP: lzjb<br />CKSUM: 5e9a82c0c2:3ff97cbecacc:1714169599f4c8:5dd02ff967dd42c<br />$q<br />#<br /></pre><br />Note the "LEVEL: 1". 
This means there is one level of indirect blocks.<br />We'll use zdb to retrieve the indirect block.<br /><pre><br /># ./zdb -R zfs_fs:0:e800:400:d,lzjb,4000 2> /tmp/iblock<br />Found vdev: /export/home/max/zfsfile<br />#<br /></pre><br />And use mdb to look at the indirect block.<br /><pre><br /># ./mdb /tmp/iblock <br />> ::loadctf<br />> ::load /export/home/max/source/mdb/i386/rawzfs.so<br />> 0::blkptr<br />DVA[0]: vdev_id 0 / 20000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2000000000000<br />DVA[0]: :0:20000:20000:d<br />LSIZE: 20000 PSIZE: 20000<br />ENDIAN: LITTLE TYPE: ZFS plain file<br />BIRTH: 502 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher2 COMP: uncompressed<br />CKSUM: f5cbf93a151abcac:5b5d6ca83588d8ad:574d9b8bf334944b:ad78d30af51771d8<br /></pre><br />The blkptr_t above is for the first 0x20000 (128 KB) of the file.<br />The next blkptr_t in the indirect block should contain the remainder<br />of the file. (The file is smaller than 256 KB.)<br /><pre><br />> 80::blkptr<br />DVA[0]: vdev_id 0 / 40000<br />DVA[0]: GANG: FALSE GRID: 0000 ASIZE: 2000000000000<br />DVA[0]: :0:40000:20000:d<br />LSIZE: 20000 PSIZE: 20000<br />ENDIAN: LITTLE TYPE: ZFS plain file<br />BIRTH: 502 LEVEL: 0 FILL: 100000000<br />CKFUNC: fletcher2 COMP: uncompressed<br />CKSUM: f39ae34f048ae079:de2ef1af7d1fb495:ec3ae3f7985b2a98:c6d33ac68cb042b6<br />$q<br />#<br /></pre><br />Now we'll use zdb to retrieve the data blocks.<br /><pre><br /># ./zdb -R zfs_fs:0:20000:20000:r 2> /tmp/data <-- first data block<br />Found vdev: /export/home/max/zfsfile<br /># ./zdb -R zfs_fs:0:40000:20000:r 2> /tmp/data1 <-- second data block<br />Found vdev: /export/home/max/zfsfile<br />#<br /><br /># cat /tmp/data /tmp/data1 > /tmp/foo <-- concatenate them<br /></pre><br />The size of the file, according to the znode_phys_t, is 206674 bytes.<br />We'll lop off the remaining bytes.<br /><pre><br /># dd if=/tmp/foo bs=206674 count=1 of=/tmp/finalwords<br />1+0 records in<br />1+0 records out<br /></pre><br />Now, 
let's see if we have the correct data.<br /><pre><br /># diff /tmp/finalwords /usr/dict/words<br /># <-- no differences<br /></pre>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-7245518.post-50437315056481615362008-02-06T01:53:00.000-08:002008-02-06T08:03:36.624-08:00new opensolaris course materialI recently wrote the material and taught the second day of a three-day course<br />on OpenSolaris for a group of university professors in Bangalore, India. The<br />first day was largely a "getting started on OpenSolaris" session, along with<br />an introduction to dtrace, and the third day was mostly administration,<br />specifically zones and zfs. The second day was an introduction to<br />OpenSolaris Internals. The topics I covered included processes and threads,<br />synchronization, memory management, and file systems. Those who have taken<br />the five-day Solaris Internals course with me will find that several of the<br />diagrams and hands-on exercises I use during that course are now written<br />up in this material. Note that the material for the second day uses only two<br />pages that come from pre-existing sources. So this is new material.<br />The material contains both overhead slides and accompanying text.<br /><br />I did something similar for professors in China almost two years ago, except<br />that session was a five-day internals session. The Indian professors looked at<br />materials prepared by the Chinese professors, and decided that those materials<br />went too deep to use as a starting point for a course<br />on operating systems. Having looked at the Chinese professors' materials, I<br />am inclined to agree with the Indian professors. The Chinese material is<br />an excellent guide through the source code and through McDougall and Mauro's<br />Solaris Internals book. As a study guide for people trying to learn about<br />the way OpenSolaris works, it is quite good and complete. 
As a tool for<br />teaching, especially for classes without prior operating-system experience, I feel<br />it misses the mark. While a professor who has good operating-system knowledge<br />can use the Chinese material to learn OpenSolaris internals on his or her<br />own, I think the materials may assume too much prior knowledge on the part of the<br />students. The new material tries to give professors a starting point that can<br />be used to teach OpenSolaris, not just to learn it.<br /><br />For those of you who already have an operating systems background (though not<br />in OpenSolaris), I think you'll find this new material quite useful. The<br />new material is not elementary; there is plenty of useful information, even<br />for people with extensive OS and even extensive Solaris kernel experience.<br />It assumes, for instance, that you understand why one needs locks,<br />or why virtual memory is useful, among other things. The material explains the<br />implementation of various concepts and mechanisms using a combination of tools,<br />mostly mdb and dtrace, and various diagrams. In a one-day session, many topics<br />go uncovered, and some are covered only superficially, but most<br />of the major OS topics are covered in a good amount of detail.<br /><br />The new material is at: http://www.opensolaris.org/os/community/edu/curriculum_development.<br />Look at the OpenSolaris Curriculum "Plugins Preparation" section for overheads<br />and course guides.Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-7245518.post-75142023030502224642007-09-16T02:00:00.000-07:002007-09-23T02:35:22.659-07:00Using kernel ctf with raw diskThe following shows a use of a modified version of mdb<br />which allows one to use the CTF information from the kernel<br />to examine data structures on disk. The disk used below contains<br />a ufs file system. 
The same techniques can be used to examine a<br />zfs file system on disk, which is why I did this in the first place.<br />Once I have "mapped out" the on-disk format of zfs using this modified<br />version of mdb, I'll write about it.<br /><br />In the meantime, I'll probably add a few dcmds and walkers for ufs that<br />use the kernel CTF information.<br /><br />In the following, annotation starts with "<--" except for a<br />few places where I have embedded code from header files.<br />Also, the output has been truncated in a few places.<br />I am assuming that you have some knowledge of mdb (for instance,<br />"value1 % value2 = X" returns the (hex) value of value1 divided by value2).<br />If you need more mdb, read the Modular Debugger Guide on docs.sun.com,<br />or, even better, take a course.<br /><br />This example will start with the superblock, and from there examine the root<br />inode and then the root directory. From there, the example gets the /var inode<br />and then the /var directory. Next, we go to the /var/sadm directory<br />and look at the contents of the /var/sadm/README file. 
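The one address computation this walk repeats for every inode (root, /var, /var/sadm, README) is the inode locator. Here is a sketch of it in Python, using the constants this particular filesystem reports later in the session (fs_fpg=0xc000, fs_cgoffset=0x40, fs_iblkno=0x20, fs_ipg=0x16c0, fragment size 0x400, on-disk inode size 0x80); it follows the simplified form used in the annotations below, which ignores the fs_cgmask wrap-around in the real cgstart() macro (valid here since only the first few cylinder groups are touched):

```python
# Constants from ::print struct fs for this filesystem; not universal.
FS_FPG = 0xc000      # fragments per cylinder group
FS_CGOFFSET = 0x40   # cylinder group offset (in fragments)
FS_IBLKNO = 0x20     # first inode block within a group (in fragments)
FS_FSIZE = 0x400     # fragment size in bytes
FS_IPG = 0x16c0      # inodes per cylinder group
INODE_SIZE = 0x80    # sizeof (struct icommon)

def inode_offset(inum):
    """Byte offset of on-disk inode `inum` in the raw device."""
    cg = inum // FS_IPG            # which cylinder group (the itog macro)
    idx = inum - cg * FS_IPG       # index within that group (inum mod fs_ipg)
    cg_base = (FS_FPG * cg + FS_CGOFFSET * cg + FS_IBLKNO) * FS_FSIZE
    return cg_base + idx * INODE_SIZE

print(hex(inode_offset(2)))        # 0x8100: the root inode, as used below
print(hex(inode_offset(0x16c0)))   # 0x3018000: the /var inode
```

The two printed offsets match the hand-computed mdb expressions (20*400)+(2*80) and (c000*400*1)+(40*400)+(20*400)+(0*80) in the session that follows.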
All of this is<br />done by examining the relevant data structures on disk.<br /><br /># ./mdb /dev/rdsk/c0d0s0 <-- this is the root fs<br />mdb: no terminal data available for TERM=emacs<br />mdb: term init failed: command-line editing and prompt will not be available<br />mdb: no module 'mdb_ks' could be found <-- kernel support module not loaded(?)<br />mdb: failed to load kernel support module -- some modules may not load<br />::print struct anon <-- try ::print, normally this does not work with raw disk<br />{<br /> an_vp <-- it works!<br /> an_pvp <br /> an_off <br /> an_poff <br /> an_hash <br /> an_refcnt <br />}<br />2000::print struct fs <-- superblock should be 8192 bytes into fs (see sys/fs/ufs_fs.h)<br />{<br /> fs_link = 0 <-- see fs_magic below for sanity check<br /> fs_rolled = 0x2<br /> fs_sblkno = 0x10<br /> fs_cblkno = 0x18<br /> fs_iblkno = 0x20<br /> fs_dblkno = 0x2f8<br /> fs_cgoffset = 0x40<br /> fs_cgmask = 0xffffffc0<br /> fs_time = 0x46eb90d4<br /> fs_size = 0x32e3519<br /> fs_dsize = 0x321e0c8<br /> fs_ncg = 0x43e<br /> fs_bsize = 0x2000<br /> fs_fsize = 0x400<br /><-- output omitted<br /> fs_fsmnt = [ "/" ]<br /><-- output omitted<br /> fs_magic = 0x11954 <-- check against FS_MAGIC in sys/fs/ufs_fs.h (correct)<br /> fs_space = [ 0x8 ]<br />}<br /><br />::status <-- what does mdb say I'm debugging<br />debugging file '/dev/rdsk/c0d0s0' (object file)<br /><br /><-- The following only prints the fields I am interested in:<br />2000::print struct fs fs_sblkno fs_cblkno fs_iblkno fs_cgoffset fs_magic fs_ipg<br />fs_sblkno = 0x10 <-- location of the superblock in the cylinder group<br />fs_cblkno = 0x18 <-- location of the cylinder group block (struct cg)<br />fs_iblkno = 0x20 <-- location of start of inodes (in cylinder group)<br />fs_cgoffset = 0x40 <-- offset of cylinder group<br />fs_magic = 0x11954 <-- magic number <br />fs_ipg = 0x16c0 <-- inodes per cylinder group<br /><br /><-- immediately following superblock is back up<br 
/>4000::print struct fs fs_sblkno fs_cblkno fs_iblkno fs_cgoffset fs_magic<br />fs_sblkno = 0x10<br />fs_cblkno = 0x18<br />fs_iblkno = 0x20<br />fs_cgoffset = 0x40<br />fs_magic = 0x11954<br /><br />6000::print struct cg <-- next block should be first cylinder group block<br />{<br /> cg_link = 0<br /> cg_magic = 0x90255 <-- magic number is good<br /> cg_time = 0x46e78f3e<br /><-- output omitted<br />}<br /><br />::sizeof struct icommon <-- how big is the disk inode<br />sizeof (struct icommon) = 0x80<br /><br /><-- the following is the root inode. Root for ufs is inumber 2. The fs_iblkno<br /><-- value (0x20) is multiplied by the fragment size to get the start of the<br /><-- inodes (in the first cylinder group); each inode is 0x80 bytes large. The<br /><-- second (i.e., root inode) is then at disk location (20*400)+(2*80)<br />(20*400)+(2*80)::print -a struct icommon <br />{<br /> 8100 ic_smode = 0x41ed<br /> 8102 ic_nlink = 0x30<br /> 8104 ic_suid = 0<br /> 8106 ic_sgid = 0<br /> 8108 ic_lsize = 0x600<br /> 8110 ic_atime = {<br /> 8110 tv_sec = 0x46eba57f<br /> 8114 tv_usec = 0x5a69b<br /> }<br /><-- output omitted<br /> 8128 ic_db = [ 0x2ff410, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ] <-- one data block for / directory<br /> 8158 ic_ib = [ 0, 0, 0 ]<br /><-- output omitted<br />}<br /><br />8110\Y <-- this is disk address of atime stamp on root inode<br />0x8110: 2007 Sep 15 11:27:27 <-- looks good<br /><br />!date <-- current time<br />Sat Sep 15 11:36:07 CEST 2007<br /><br /><-- now the block number from ic_db[0] is used to dump the contents of the<br /><-- root directory<br />2ff410*400::print struct direct<br />{<br /> d_ino = 0x2<br /> d_reclen = 0xc<br /> d_namlen = 0x1<br /> d_name = [ "." ]<br />}<br />(2ff410*400)+c::print struct direct <-- second entry (first+d_reclen)<br />{<br /> d_ino = 0x2<br /> d_reclen = 0xc<br /> d_namlen = 0x2<br /> d_name = [ ".." 
]<br />}<br />(2ff410*400)+c+c::print struct direct <-- third entry (first + d_reclen of first and second)<br />{<br /> d_ino = 0x3<br /> d_reclen = 0x14<br /> d_namlen = 0xa<br /> d_name = [ "lost+found" ]<br />}<br />(2ff410*400)+c+c+14::print struct direct <-- fourth entry (easily made into walker?)<br />{<br /> d_ino = 0x16c0<br /> d_reclen = 0xc<br /> d_namlen = 0x3<br /> d_name = [ "var" ]<br />}<br /><-- check "/var" inumber<br />!ls -id /var <-- get inumber<br /> 5824 /var<br />16c0=D <-- d_ino from direct entry<br /> 5824 <-- match<br /><br /><-- the following is a back up superblock in the second cylinder group.<br /><-- The relevant macros for this are in sys/fs/ufs_fs.h and are shown here:<br />/*<br /> * Cylinder group macros to locate things in cylinder groups.<br /> * They calc file system addresses of cylinder group data structures.<br /> */<br />#define cgbase(fs, c) ((daddr32_t)((fs)->fs_fpg * (c)))<br /><br />#define cgstart(fs, c) \<br /> (cgbase(fs, c) + (fs)->fs_cgoffset * ((c) & ~((fs)->fs_cgmask)))<br /><br />#define cgsblock(fs, c) (cgstart(fs, c) + (fs)->fs_sblkno) /* super blk */<br /><br />#define cgtod(fs, c) (cgstart(fs, c) + (fs)->fs_cblkno) /* cg block */<br /><br />#define cgimin(fs, c) (cgstart(fs, c) + (fs)->fs_iblkno) /* inode blk */<br /><br />#define cgdmin(fs, c) (cgstart(fs, c) + (fs)->fs_dblkno) /* 1st data */<br /><br />/*<br /> * Macros for handling inode numbers:<br /> * inode number to file system block offset.<br /> * inode number to cylinder group number.<br /> * inode number to file system block address.<br /> */<br />#define itoo(fs, x) ((x) % (uint32_t)INOPB(fs))<br /><br />#define itog(fs, x) ((x) / (uint32_t)(fs)->fs_ipg)<br /><-- So. Here the fs_fpg field from the superblock (= 0xc000) is used to<br /><-- get the fragments per group. This is multiplied times the fragment size (0x400)<br /><-- Then the fs_cgoffset field (cylinder group offset) is added (40*400), then the<br /><-- fs_sblkno offset (10*400). 
The resulting address is the location on the<br /><-- disk of the backup superblock in the second cylinder group. To see the third,<br /><-- use (c000*2*400)+(40*400)+(10*400)::print struct fs<br /><-- To see the fourth, (c000*3*400)+(40*400)+(10*400)::print struct fs, etc.<br /><br />(c000*400)+(40*400)+(10*400)::print struct fs fs_sblkno fs_cblkno fs_iblkno fs_cgoffset fs_magic<br />fs_sblkno = 0x10<br />fs_cblkno = 0x18<br />fs_iblkno = 0x20<br />fs_cgoffset = 0x40<br />fs_magic = 0x11954<br /><br /><-- Now, let's take a look at the inode for the "/var" directory.<br /><-- Above, the direct structure for /var says the inumber is 0x16c0.<br /><-- There are 16c0 inodes per cylinder group (the fs_ipg field in the<br /><-- superblock), so this inode should be the first inode in the <br /><-- second cylinder group. (c000*400) is the base of the second cylinder<br /><-- group. (40*400) is the starting offset. (20*400) is the starting<br /><-- inode offset. Given an inumber, the formula for finding the inode<br /><-- on disk is:<br /><-- (inumber % fs_ipg)=X This returns an index indicating which cylinder<br /><-- group the inode is in. 
This is "cg_index" in the next calculation.<br /><-- (inumber - (cg_index * fs_ipg))=X This returns the index<br /><-- (offset) within the cylinder group ("cg_offset") (Actually, inumber mod fs_ipg).<br /><-- Then: ((fs_fpg*400)*cg_index)+((fs_cgoffset*400)*cg_index)+(fs_iblkno*400)+(cg_offset*80).<br /><-- Here, 400 is the fragment size (from fs_fsize) and 80 is the size of the<br /><-- disk inode.<br /><br />(16c0%16c0)=X<br /> 1 <-- the second cylinder group<br />16c0-(1*16c0)=X<br /> 0 <-- the first inode in the group <br /><br /><-- this is the inode for /var<br />(c000*400*1)+(40*400)+(20*400)+(0*80)::print struct icommon<br />{<br /> ic_smode = 0x41ed<br /> ic_nlink = 0x2c<br /> ic_suid = 0<br /> ic_sgid = 0x3<br /> ic_lsize = 0x400<br /> ic_atime = {<br /> tv_sec = 0x46eb30e8<br /> tv_usec = 0x992bc<br /> }<br /><-- output omitted<br /> ic_db = [ 0xc348, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]<br /> ic_ib = [ 0, 0, 0 ]<br /><-- output omitted<br />}<br /><br /><-- So /var has one block (ic_db[0] = 0xc348). This should be a directory. Now dump out some<br /><-- entries. The following is a good candidate for a walker...<br />(c348*400)::print struct direct<br />{<br /> d_ino = 0x16c0<br /> d_reclen = 0xc<br /> d_namlen = 0x1<br /> d_name = [ "." ]<br />}<br />(c348*400)+c::print struct direct<br />{<br /> d_ino = 0x2<br /> d_reclen = 0xc<br /> d_namlen = 0x2<br /> d_name = [ ".." ]<br />}<br />(c348*400)+c+c::print struct direct<br />{<br /> d_ino = 0x16c1<br /> d_reclen = 0x10<br /> d_namlen = 0x4<br /> d_name = [ "sadm" ]<br />}<br /><br /><-- let's check the work...<br />!ls -id /var/sadm<br /> 5825 /var/sadm<br />16c1=D<br /> 5825 <-- match looks good<br /><-- Ok. Now let's look at the inode for /var/sadm. This is<br /><-- inumber 16c1. 
<br /><br />16c1%16c0=X<br /> 1 <-- the second cylinder group<br />16c1-(16c0*1)=X<br /> 1 <-- the second inode in the group<br /><br />(c000*400*1)+(40*400)+(20*400)+(1*80)::print struct icommon<br />{<br /> ic_smode = 0x41ed<br /> ic_nlink = 0xd<br /> ic_suid = 0<br /> ic_sgid = 0x3<br /> ic_lsize = 0x200<br /><-- output omitted<br /> ic_db = [ 0xc349, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]<br /> ic_ib = [ 0, 0, 0 ]<br /><-- output omitted<br />}<br />c349*400::print struct direct<br />{<br /> d_ino = 0x16c1<br /> d_reclen = 0xc<br /> d_namlen = 0x1<br /> d_name = [ "." ]<br />}<br />(c349*400)+c::print struct direct<br />{<br /> d_ino = 0x16c0<br /> d_reclen = 0xc<br /> d_namlen = 0x2<br /> d_name = [ ".." ]<br />}<br />(c349*400)+c+c::print struct direct<br />{<br /> d_ino = 0x16c2<br /> d_reclen = 0x10<br /> d_namlen = 0x7<br /> d_name = [ "install" ]<br />}<br />(c349*400)+c+c+10::print struct direct<br />{<br /> d_ino = 0x16c5<br /> d_reclen = 0xc<br /> d_namlen = 0x3<br /> d_name = [ "pkg" ]<br />}<br />(c349*400)+c+c+10+c::print struct direct<br />{<br /> d_ino = 0x1d41<br /> d_reclen = 0x10<br /> d_namlen = 0x6<br /> d_name = [ "system" ]<br />}<br />(c349*400)+c+c+10+c+10::print struct direct<br />{<br /> d_ino = 0x7e4d<br /> d_reclen = 0x18<br /> d_namlen = 0xc<br /> d_name = [ "install_data" ]<br />}<br />(c349*400)+c+c+10+c+10+18::print struct direct<br />{<br /> d_ino = 0x7e4e<br /> d_reclen = 0x14<br /> d_namlen = 0x8<br /> d_name = [ "softinfo" ]<br />}<br />(c349*400)+c+c+10+c+10+18+14::print struct direct<br />{<br /> d_ino = 0x27d9<br /> d_reclen = 0x10<br /> d_namlen = 0x6<br /> d_name = [ "README" ] <-- here is the file we want to examine<br />}<br /><br />27d9%16c0=X<br /> 1 <-- the second cylinder group<br />27d9-(1*16c0)=X<br /> 1119 <-- inode index 0x1119 (4377 decimal) in the group <br />(1*c000*400)+(40*400)+(20*400)+(1119*80)::print struct icommon<br />{<br /> ic_smode = 0x8124<br /> ic_nlink = 0x1<br /> ic_suid = 0<br /> ic_sgid = 0x3<br /> ic_lsize = 0x444<br /><-- 
output omitted<br /> ic_db = [ 0x111cc, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]<br /> ic_ib = [ 0, 0, 0 ]<br /><-- output omitted<br />}<br />111cc*400,200/c <-- ok, let's dump the first 512 bytes<br />0x4473000: -------------------------------<br /> /var/sadm DIRECTORY RESTRUCTURE<br /> -------------------------------<br /> <br /> The system administration directory has been reorganized to bett<br /> er<br /> service the needs of Solaris administrators and administration s<br /> oftware.<br /> The old and new locations for files important to our customers a<br /> re:<br /> <br /> OLD LOCATION NEW LOCATION<br /> ------------ ------------<br /> install_data/install_log system/logs/install_log<br /> install_data/upgrade_log system/log<br /><br /><-- check the work<br />!head /var/sadm/README<br /> -------------------------------<br /> /var/sadm DIRECTORY RESTRUCTURE<br /> -------------------------------<br /><br />The system administration directory has been reorganized to better<br />service the needs of Solaris administrators and administration software.<br />The old and new locations for files important to our customers are:<br /><br /> OLD LOCATION NEW LOCATION<br /> ------------ ------------<br />$q<br />#<br /><br />bash-3.00$Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-1131472680169914082005-11-08T09:47:00.000-08:002006-11-29T19:23:59.846-08:00Using dtrace and mdb to examine virtual memory<a href="http://www.bruningsystems.com/swmm.html">here</a> is a short example using<br />dtrace and mdb to examine page faults and process address spaces.<br />This will be used in a workshop being given to professors teaching operating systems<br />within China. 
The workshop will cover Solaris internals using the opensolaris source code<br />and tools such as dtrace, mdb, and kmdb.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-1119635567018383462005-06-24T10:50:00.000-07:002006-09-17T06:51:39.140-07:00new web site for Bruning SystemsHi,<br />I have finally gotten around to updating the web site. Still need to<br />add the dtrace scripts, but first I need to document them.<br /><br /><a href="http://www.bruningsystems.com">www.bruningsystems.com</a><br /><br />maxUnknownnoreply@blogger.com1tag:blogger.com,1999:blog-7245518.post-1112718367726406252005-04-05T09:23:00.000-07:002005-04-05T09:26:07.726-07:00snooping gld-based NIC drivers using dtraceHi.<br />I have just posted a script at http://www.bruningsystems.com/rtlsio.p<br />that allows one to snoop incoming/outgoing packets on a Realtek NIC.<br />The script is very easy to change for any other GLD-based NIC (see<br />the comment at the beginning of the script to determine what needs<br />to be changed.)<br />To run the script, save it and then:<br /><br /># dtrace -q -C -s ./rtlsio.p<br /><br />Let me know what you think.<br /><br />maxUnknownnoreply@blogger.com2tag:blogger.com,1999:blog-7245518.post-1108239353650703152005-02-12T12:14:00.000-08:002005-02-12T12:15:53.650-08:00dtrace script to trace kernel thread state changesI just posted a script that traces kernel thread state changes to the dtrace<br />forum on forum.sun.com. You can also find it at http://www.bruningsystems.com/runq.d<br /><br />Have fun.<br />maxUnknownnoreply@blogger.com0tag:blogger.com,1999:blog-7245518.post-1105724174317757712005-01-14T09:34:00.000-08:002005-01-14T09:36:14.316-08:00solaris kernel memory usageThis is a test. I am currently working on answering a question
<br />from a former student about kernel memory. The question and
<br />answer will be here shortly.
<br />
<br />Unknownnoreply@blogger.com1