Monday, November 12, 2012

Hadoop bug on SmartOS



Recently I had a chance to help with a problem that occurred when trying to run a Hadoop benchmark on SmartOS.  Basically, some of the Java code written for Hadoop was making an implicit assumption that the code was being run on Linux.  When running the benchmark, the following error showed up:


12/10/01 20:58:49 INFO mapred.JobClient: Task Id : attempt_201209262235_0003_m_000003_0, Status : FAILED
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161)
at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:312)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:385)
at org.apache.hadoop.mapred.Child.main(Child.java:229)

The NativeIO.open call is a thin JNI wrapper around the open(2) system call.  Here, it is being called from
createForWrite() in SecureIOUtils.java at line 161.  Here is the code for createForWrite():

 /**
  * Open the specified File for write access, ensuring that it does not exist.
  * @param f the file that we want to create
  * @param permissions we want to have on the file (if security is enabled)
  *
  * @throws AlreadyExistsException if the file already exists
  * @throws IOException if any other error occurred
  */
 public static FileOutputStream createForWrite(File f, int permissions)
 throws IOException {
   if (skipSecurity) {
     return insecureCreateForWrite(f, permissions);
   } else {
     // Use the native wrapper around open(2)
     try {
       FileDescriptor fd = NativeIO.open(f.getAbsolutePath(),  // <-- line 161
         NativeIO.O_WRONLY | NativeIO.O_CREAT | NativeIO.O_EXCL,
         permissions);
       return new FileOutputStream(fd);
     } catch (NativeIOException nioe) {
       if (nioe.getErrno() == Errno.EEXIST) {
         throw new AlreadyExistsException(nioe);
       }
       throw nioe;
     }
   }
 }

So, open is called with the O_WRONLY, O_CREAT, and O_EXCL flags.  However, the truss(1) output
tells a different story.  We started the following truss on a slave machine, and ran the test again:

# truss -f -a -wall -topen,close,fork,write,stat,fstat -o ~/mapred.truss -p $(pgrep -f Djava.library.path)

And here is the relevant truss output:

51039/28: open("/opt/local/hadoop/bin/../logs/userlogs/job_201210171129_0008/attempt_201210171129_0008_m_000002_1/log.tmp", O_WRONLY|O_DSYNC|O_NONBLOCK) Err#2 ENOENT

The error message is emitted shortly after the above open(2) system call.  So, the code shows O_WRONLY, O_CREAT, and O_EXCL, which is what one
would expect for a routine that is called createForWrite().  However, the flags actually passed to open() are: O_WRONLY, O_DSYNC, and O_NONBLOCK.
Why the difference?

Grepping for O_CREAT in the hadoop source finds it defined at:

./trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java:

/**
* JNI wrappers for various native IO-related calls not available in Java.
* These functions should generally be used alongside a fallback to another
* more portable mechanism.
*/
public class NativeIO {
 // Flags for open() call from bits/fcntl.h
 public static final int O_RDONLY   =    00;
 public static final int O_WRONLY   =    01;
 public static final int O_RDWR     =    02;
 public static final int O_CREAT    =  0100;
 public static final int O_EXCL     =  0200;
 public static final int O_NOCTTY   =  0400;
 public static final int O_TRUNC    = 01000;
 public static final int O_APPEND   = 02000;
 public static final int O_NONBLOCK = 04000;
 public static final int O_SYNC   =  010000;
 public static final int O_ASYNC  =  020000;
 public static final int O_FSYNC = O_SYNC;
 public static final int O_NDELAY = O_NONBLOCK;

The comment in the above code says that the flags for the open(2) call come from bits/fcntl.h, which is a Linux header.
However, on SmartOS (as well as illumos and Solaris), the same flags are defined in sys/fcntl.h as:

/*
* Flag values accessible to open(2) and fcntl(2)
* The first five can only be set (exclusively) by open(2).
*/
#define O_RDONLY        0
#define O_WRONLY        1
#define O_RDWR          2
#define O_SEARCH        0x200000
#define O_EXEC          0x400000
#if defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE)
#define O_NDELAY        0x04    /* non-blocking I/O */
#endif /* defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE) */
#define O_APPEND        0x08    /* append (writes guaranteed at the end) */
#if defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE) || \
       (_POSIX_C_SOURCE > 2) || defined(_XOPEN_SOURCE)
#define O_SYNC          0x10    /* synchronized file update option */
#define O_DSYNC         0x40    /* synchronized data update option */
#define O_RSYNC         0x8000  /* synchronized file update option */
                                /* defines read/write file integrity */
#endif /* defined(__EXTENSIONS__) || !defined(_POSIX_C_SOURCE) ... */
#define O_NONBLOCK      0x80    /* non-blocking I/O (POSIX) */
#ifdef _LARGEFILE_SOURCE
#define O_LARGEFILE     0x2000
#endif

/*
* Flag values accessible only to open(2).
*/
#define O_CREAT         0x100   /* open with file create (uses third arg) */
#define O_TRUNC         0x200   /* open with truncation */
#define O_EXCL          0x400   /* exclusive open */
#define O_NOCTTY        0x800   /* don't allocate controlling tty (POSIX) */
#define O_XATTR         0x4000  /* extended attribute */
#define O_NOFOLLOW      0x20000 /* don't follow symlinks */
#define O_NOLINKS       0x40000 /* don't allow multiple hard links */

The O_CREAT flag (from bits/fcntl.h) is 0100 (octal) in the NativeIO.java file, but 0x100 on SmartOS.  Octal 0100 is 0x40, which corresponds to O_DSYNC on SmartOS.  Similarly, the O_EXCL value of 0200 is 0x80, which is O_NONBLOCK on SmartOS.  O_WRONLY is 1 on both systems, which is why truss shows O_WRONLY|O_DSYNC|O_NONBLOCK.  Since O_CREAT never actually reaches the kernel, the open(2) of the not-yet-existing log.tmp fails with ENOENT.  Whoever wrote this code assumed that it would be running on a Linux system.  The flags are different yet again on FreeBSD and Mac OS X (for instance, O_CREAT is 0x200 on these systems).  My colleague, Filip Hajny, changed the flags to match the SmartOS flags, and rebuilt everything to fix the problem.
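The octal-versus-hex arithmetic is easy to check.  Here is a minimal sketch (the class and method names are mine, not Hadoop's) that takes the Linux values hard-coded in NativeIO.java and decodes them using the SmartOS values from the sys/fcntl.h excerpt above.  Only a subset of the SmartOS flags is included.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: decodes an open(2) flag word
// using the SmartOS values from the sys/fcntl.h excerpt.
class OpenFlagDecoder {
    // Linux values as hard-coded in NativeIO.java (a leading 0 means octal in Java).
    static final int LINUX_O_WRONLY = 01;
    static final int LINUX_O_CREAT = 0100;
    static final int LINUX_O_EXCL = 0200;

    // A subset of the SmartOS flag bits, in ascending order of value.
    static final Map<String, Integer> SMARTOS_FLAGS = new LinkedHashMap<>();
    static {
        SMARTOS_FLAGS.put("O_NDELAY", 0x04);
        SMARTOS_FLAGS.put("O_APPEND", 0x08);
        SMARTOS_FLAGS.put("O_SYNC", 0x10);
        SMARTOS_FLAGS.put("O_DSYNC", 0x40);
        SMARTOS_FLAGS.put("O_NONBLOCK", 0x80);
        SMARTOS_FLAGS.put("O_CREAT", 0x100);
        SMARTOS_FLAGS.put("O_TRUNC", 0x200);
        SMARTOS_FLAGS.put("O_EXCL", 0x400);
        SMARTOS_FLAGS.put("O_NOCTTY", 0x800);
    }

    // Returns the flag names a SmartOS kernel would see for this flag word.
    static String decodeOnSmartOS(int flags) {
        List<String> names = new ArrayList<>();
        // The low two bits select the access mode (0, 1, or 2).
        switch (flags & 0x3) {
            case 0: names.add("O_RDONLY"); break;
            case 1: names.add("O_WRONLY"); break;
            case 2: names.add("O_RDWR"); break;
            default: names.add("(invalid access mode)"); break;
        }
        for (Map.Entry<String, Integer> e : SMARTOS_FLAGS.entrySet()) {
            if ((flags & e.getValue()) != 0) {
                names.add(e.getKey());
            }
        }
        return String.join("|", names);
    }

    public static void main(String[] args) {
        // What createForWrite() passes when built with the Linux constants.
        int flags = LINUX_O_WRONLY | LINUX_O_CREAT | LINUX_O_EXCL;
        System.out.println("0x" + Integer.toHexString(flags) + " -> " + decodeOnSmartOS(flags));
        // prints: 0xc1 -> O_WRONLY|O_DSYNC|O_NONBLOCK
    }
}
```

The decoded result matches the flags in the truss output line for line.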

This problem reminds me how many little things like this can occur when porting an application that was developed on one operating system to run on another.  For all but the simplest of applications, some changes are likely to be needed.  For the above problem, POSIX specifies the flags that open(2) can take (O_CREAT, O_RDWR, etc.), but does not specify the values of those flags.  Java code cannot include a C header, so the values were copied in by hand.  If the code could instead pick up the values from the correct header file (fcntl.h on both systems), the problem would not occur.  It is an important reminder that all code should be reviewed and tested on as many different systems as possible.
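Short of taking the values from the header in native code, one way to avoid baking a single platform's numbers into Java is to choose them per platform at run time.  The following is only a sketch of that idea, not the fix Filip applied and not what Hadoop does.  The class and method names are hypothetical, and it assumes that the JVM's os.name property is "Linux" on Linux and "SunOS" on SmartOS, illumos, and Solaris.  Only the two platforms whose values appear above are covered.

```java
// Hypothetical sketch: per-platform open(2) flag values selected by OS name.
class PlatformOpenFlags {
    // O_WRONLY is 1 on both Linux and SmartOS.
    static final int O_WRONLY = 1;

    final int oCreat;
    final int oExcl;

    private PlatformOpenFlags(int oCreat, int oExcl) {
        this.oCreat = oCreat;
        this.oExcl = oExcl;
    }

    // Looks up the flag values for the given os.name string.
    static PlatformOpenFlags forOsName(String osName) {
        if (osName.startsWith("Linux")) {
            // Values from the NativeIO.java excerpt (octal).
            return new PlatformOpenFlags(0100, 0200);
        }
        if (osName.startsWith("SunOS")) {
            // Values from the sys/fcntl.h excerpt (hex).
            return new PlatformOpenFlags(0x100, 0x400);
        }
        // Fail loudly rather than guess at another platform's values.
        throw new UnsupportedOperationException("no open(2) flag table for " + osName);
    }

    // The flag word createForWrite() intends: write-only, create, exclusive.
    int createExclusiveForWrite() {
        return O_WRONLY | oCreat | oExcl;
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        try {
            int flags = forOsName(osName).createExclusiveForWrite();
            System.out.println(osName + ": 0x" + Integer.toHexString(flags));
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Throwing on an unknown platform is deliberate: a silent fallback to one system's values is exactly the assumption that caused the original bug.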

Monday, July 09, 2012

Why take a SmartOS/illumos Internals or ZFS Internals course?


I have been teaching OS internals courses for many years, starting with Bell Labs/AT&T Unix System III in 1982, onto System V, SVR2, SVR3, SVR4, and Solaris Unixes since 1994.  Along the way, I have also taught HP-UX internals, various device driver courses, and kernel debugging courses.  I started using Unix on the Sixth Edition in 1975.  I have also done a fair amount of kernel development and debugging, along with some user level stuff.

The audiences I have had for internals courses have been quite varied.  Many of the people I have taught have been in support or sustaining organizations, but I have also taught developers, system administrators, Java programmers, QA people, hardware engineers, and even end users.  Along the way, I have been asked by various people (many of them managers), "Why should I or my team take this course?  What will I or my team get out of this training?"

In response to the first question, I usually tell people that an internals course should teach students how the system works, and why it works the way it does.  In other words, the course teaches the data structures and algorithms used by the operating system to manage the resources of the computer, and explains the architecture of the system, as well as the rationale behind the design decisions that have been made to implement the system.  My view is that knowledge of how the system works can benefit everyone.  For developers (especially kernel developers), knowledge of the system is key to adding new functionality.  For system administrators, knowledge of the OS can help to do troubleshooting and performance analysis.  Tools like DTrace become even more useful when one has knowledge of what's going on in the system.  In general, knowledge of how the system works allows everyone who uses the system to make better use of the system.

As for what specific skills are acquired in an internals course, I make very extensive use of tools that come with the system during the training, both when I am lecturing and in lab work.  My view has always been that in order to learn the concepts being taught, one must be able to actually "see" them.  Tools like DTrace, mdb, kmdb, and other observability mechanisms are key to doing this.  I do not "teach" the tools, but rather we use the tools in lots of examples throughout the course.  At the end of the course, I am satisfied if my students can start to learn things on their own.  Basically, a good internals course should be an "enabling" course.  It should enable the student to learn more on their own.  Some students may never use the specific tools from the class in the specific ways they were used, but they will come away knowing that one can actually determine what the system is doing at any given time.  Others will be using the tools consistently in their work.

As with all training, you only get out of it what you put into it.
If you're interested in internals training, please visit Training from Joyent.

I hope to see people in class soon!

Friday, June 15, 2012

SmartOS/Illumos Training

If you are reading this, you are probably here either because you saw my post on Twitter, or because you searched for "zfs recovery" (see here). This is the first blog post I have written here since 2009, so it is time to write again.

I have written (in a different blog) on using flamegraphs at Using flamegraph to Solve IP Scaling Issue. Rather than spending time saying what I've been doing since my last blog entry here, I want to talk a bit about what I am doing now. If you're interested in what I have been doing, see KVM on Illumos.

Since I wrote the blogs on ZFS recovery, I have been getting emails, at a rate that is slowly increasing over time (now ~2 per week), from people asking if I can help with ZFS problems. If I had received this many emails when I wrote the blog, I might be working full time now on ZFS recovery issues. As it is, I am now very busy working for Joyent, and have not had time to answer as many of the ZFS requests as I would like. My apologies to people who I have not answered. For those of you who have asked for my mdb and zdb modifications, please send me email at max_at_joyent_dot_com. If I get enough requests, it is possible that the modifications may find their way into SmartOS (and Illumos).

If you would like help with ZFS problems, I can better justify my time if you download Joyent's SmartDataCenter product, available here, and give it a try. If you're interested in SmartOS (simply the best Solaris-based operating system you can use), it is available for download at www.smartos.org. Joyent fully supports SmartOS for use in its SmartDataCenter product, so you are more likely to get help in a timely fashion with problems than I am able to provide on my own.

And, what am I doing now? Joyent is offering classes on SmartDataCenter, DTrace, and SmartOS/Illumos Internals. I am involved with developing the courseware, and shall be (along with Brendan Gregg) delivering the courses. For more information, see Training from Joyent.