Tuesday, March 31, 2009

A faster memstat for mdb

I have implemented a version of the ::memstat dcmd for mdb that gives results in less than half the time of the ::memstat currently in mdb. If you are interested, it is available for download here.

So, how does it work? The current version of memstat uses the page walker (::walk page) to walk all cached pages in the system. The new version simply examines all pages.

Each page-able page in the system is represented by a page_t structure. Basically all of memory except for the unix and genunix kernel modules, and a few other odds and ends is considered "page-able". Every page, that is in use (either by the kernel, user process, anonymous, or cached) has an identity. This is a vnode/offset pair. The identify uniquely determines how the page is being used. For instance, a page for the code of a running bash process will have the vnode_t for /usr/bin/bash, and the offset within /usr/bin/bash of where the page comes from. For a kernel page, there is a special vnode_t (kvp). For anonymous space, the page has a swapfsvnode (also used for shared memory and tmpfs files). When a process gets a page fault, the fault handling code first checks to see if the faulting address is mapped in the process' address space. If not, a segmentation violation (SIGSEGV) is sent to the process. If the address is within the address space, the fault handling code sees if the page is already in memory. It does this by retrieving the vnode/offset for the faulting page, and hashing into an array called page_hash. Each entry in page_hash is the beginning of a linked list of page_t structures. So the fault handling code does a hash to get a page_hash array entry, then walks the page_t structures starting at that entry to look for a matching vnode_t/offset. If the page_t is in the hash, the fault handling code sets up a page table entry (translation table entry on SPARC) to map to the corresponding physical page.

The page_hash array is sized so the the average search, given a page_hash bucket, is no longer than 4 (PAGE_HASHAVELEN in vm/page.h) entries. This makes searching for a cached page fairly fast.

The page walker that mdb uses to do memstat walks every hash bucket looking for pages. Basically, if the page is found from a hash bucket, the page is either in use by the kernel, some process, or tmpfs, or, the page is free but cached. Any page not hanging off of a page_hash bucket is considered free (i.e., the free (freelist) statistic).

The new memstat takes a different approach. Rather than scanning each hash bucket for pages, it simply reads all of the page_t structures on the system, then examines each one to determine if it is a kernel page, executable page, anonymous page, and so on. Any page_t that does not have a vnode_t is considered a free page, and is counted in the free (freelist) statistic.

How are the page_t structures found on the system? There is a linked list of memseg structures that are bookkeeping for page-able page structures. The list is headed by a pointer, memsegs, and is built early on in the startup code when the system is booting. I suspect the list could change due to dynamic reconfiguration events, but I'll leave that as an exercise for the reader...
To see the list, you can do "::memseg_list" in mdb. On my system, this gives:

> ::memseg_list
fbe00028 fbe94160 fe652260 00000c00 0007fed0
fbe00014 fbe85160 fbe94160 00000100 00000400
fbe00000 fbe82050 fbe85160 00000002 0000009f

The ADDR column is the address of a memseg structure. PAGES is the address of the first page_t in an array. EPAGES points to the end of the page_t array. BASE is the starting page frame number of the memseg, and END is the ending page frame number. So, on my system, page frames between 2 and 9f, 100 and 400, and c00 and 7fed0 have page_t structures. Physical page 0 and 1 are not in the list. Also pages between 9f and 100, and between 400 and c00 are not in the list. This is either because the physical memory does not exist, or it is not considered pageable.

The new memstat uses the memseg list to read in all of the page_t structures. On my system, this means 3 read calls (though the read from fbe94160 to fe652260 is quite large), versus thousands of read calls in the existing memstat via the page walker. The new memstat assumes there will never be more than 256 memseg structures. (This was arbitrarily chosen. I have not seen machines with more than 6 memseg structures, but I don't get on very large machines very often). A more correct way would be to build the memseg list within the dcmd, but I am lazy.

Using dtrace and counting system calls during running of the two version of memstat shows that the original memstat makes 1741225 system calls, while the new memstat makes 737344. So over 10000 fewer system calls in the new memstat.

I think memstat performance could be improved even more by using mmap to map in the page arrays. Then there would be no need for using mdb_alloc, and no need to mdb_vread the page_t structs.