Wednesday, December 16, 2009

ZFS Data Recovery

I knew there was a reason to understand the ZFS on-disk format besides wanting to be able to teach about it...

Recently, I received an email from someone who had 15 years of video and music stored in a 10TB ZFS pool that became unreadable after a power failure. Unfortunately, he did not have a backup. He was using ZFS version 6 on FreeBSD 7, and he asked if I could help. He also said that he had spoken with various people, including engineers within Sun, and was told that it was probably not possible to recover the data.

He also got in touch with a data recovery company who told him they would assign a technician to examine the problem at a cost of $15,000/month. And if they could not restore his data, he would only have to pay 1/2 of the cost.

After spending about a week examining the data on the disks, I was able to recover essentially all of it. The recovery was done on OpenSolaris 2009.06 (build 111b). Afterwards, he was able to view his data on the FreeBSD system (as well as on OpenSolaris). I would be happy to do this for anyone else who runs into a problem where they can no longer access their ZFS pool, especially at $15k/month. And if I cannot do it, you would not need to pay anything!


Anonymous said...

About 1 year ago we lost 4 zpools... and a lot of money (much more than $15k).
Now we have moved back to NetApp.
Peace of mind :)

Anonymous said...

Any details on what happened and what was done to recover? I figure that if it took you 1 week of tinkering to figure out what was wrong, it's probably something more interesting than a case of corrupted data in the last TXG.

Anonymous said...

The takeaway shouldn't be that ZFS is dangerous. The takeaway should be: ALWAYS HAVE BACKUPS!

Max Bruning said...

I agree with you about "Always have backups". The person will continue to use ZFS. He says one of the best things about it is the performance. I did not mean to imply that "ZFS is dangerous".

Max Bruning said...

As to what happened... It started with a controller failure, then a failure during a scrub (I am not sure of the details). As to what was wrong, the labels were not consistent. I found a label with what appeared to be consistent data, then found an uberblock_t within that label which, when I walked the data structures it pointed to, led to what looked correct. I then zeroed out all uberblocks with a transaction id greater than the txg in the "good" uberblock, and imported the pool.
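To make the uberblock hunt concrete, here is a minimal Python sketch of the scan step. This is illustrative only, not the actual tool used for the recovery; it assumes a little-endian label read from a 512-byte-sector vdev, with the standard on-disk constants (a 256K label whose second 128K holds the uberblock array, 1K per slot, magic 0x00bab10c).

```python
import struct

# On-disk constants from the ZFS label layout (assumed here; verify locally):
UBERBLOCK_MAGIC = 0x00bab10c   # ub_magic, "oo-ba-bloc"
UB_ARRAY_OFF    = 128 * 1024   # uberblock array fills the second half of a 256K label
UB_SLOT_SIZE    = 1024         # one 1K slot per uberblock on 512-byte-sector vdevs

def scan_uberblocks(label):
    """Return (slot, txg, timestamp) for every slot whose magic checks out.

    `label` is the raw 256K vdev label read off the disk.
    """
    found = []
    for slot in range(128):
        off = UB_ARRAY_OFF + slot * UB_SLOT_SIZE
        # uberblock_t begins: ub_magic, ub_version, ub_txg, ub_guid_sum, ub_timestamp
        magic, version, txg, guid_sum, ts = struct.unpack_from("<5Q", label, off)
        if magic == UBERBLOCK_MAGIC:
            found.append((slot, txg, ts))
    return found

def best_uberblock(label):
    """The active uberblock is normally the valid one with the highest txg."""
    ubs = scan_uberblocks(label)
    return max(ubs, key=lambda u: u[1]) if ubs else None
```

Zeroing out the damaged later uberblocks then amounts to overwriting every slot whose txg is greater than the chosen "good" txg with zeros, in all four labels of every vdev, so that the import code falls back to the known-good uberblock.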

Peter William Lount said...

All the "protections" of zfs are useless since I can't "import" the eight zfs disks.

I just "lost" 6TB of data when attempting to change controller cards for an upgrade. After exporting, shutting down, changing the controller, adding drives, and rebooting, the import no longer works. It took several tries to find a controller card that would work with the motherboard, so I had already gone through this procedure a number of times without any problems with the import; clearly this controller works fine. It's a 16-port HighPoint RAID controller, model 2340, on FreeBSD 7.2.

The configuration is 8 drives in two raidz1 groups of four. One group shows online for all of its drives and at the raidz1 vdev level; however, the second group shows online for each drive but "corrupted data" on its vdev. Thus the top-level "pod1" vdev shows "unavail" and "insufficient replicas". On "zpool import" the state of the pool is "faulted" and the action says "The pool cannot be imported due to damaged devices or data."

If you could share your methods of data recovery I'd surely appreciate it. Don't have any money to spend and the data is confidential so I can't send you the drives.

Which programs did you use to view the data structures? Which data structures are the most important ones? What is the best source for information on the data structures? Any links? Docs? Or did you just use the source code?

Fortunately, none of the data on this zfs installation is encrypted or compressed, and with two raidz1 groups of four disks each there should be enough redundancy to recover files even if one disk in each set of four has been erased.

By the way, it's insane for any file system provider not to provide data recovery software (fsck or better, as needed to recover data) as part of its design and implementation. Claiming that users should rely on backups doesn't make sense, especially for a file system that doesn't support live replication to a backup system as a simple option. The whole point is to protect the data. No matter how "good" a file system is claimed to be, there will always be cases where metadata, disks, controllers, or data are damaged, resulting in the need for recovery software to attempt a rescue.

File systems must give recovery tools the ability to "scrape" as many files, or fragments of files, as possible off a disk with damaged metadata. A metadata inspector is also a needed tool: one that can recover metadata even when some or all of it has been corrupted or destroyed. Clearly something better than zfs is needed. For one thing, metadata needs to be made hugely redundant, even if that means a performance penalty.

By the way, my upgrade was supposed to add 8 more disks so that I could keep an ongoing, somewhat-live backup copy in the same box, with two "pods" of data. Sigh.


Peter William Lount said...

Max I found a couple of your other articles and have started reading them.

Is there a program to view the zfs data structures on an "exported" set of zfs disks?

Peter William Lount said...

Using the following command, I was able to verify that uberblocks exist on all eight drives in the "damaged" pod1 (ad4, ad6, ad8, ad10, da0, da1, da2 and da3)!

od -A x -x /dev/da0 | grep "b10c    00ba"


Note that I had to use four blanks and not just one as in the linked example.

Note that the "corrupted" set is da0-da3. Those drives were on a four-port RocketRAID controller and are now on either a four-port or a sixteen-port controller, with the same import problem either way. The disks are in an individual, non-joined JBOD configuration.

I was really worried that one of the disks might have been overwritten during bios configuration of the hard drive controllers.

Well, that's an important first step. There are zfs structures visible on all the drives! The data could be recoverable yet...

Max Bruning said...

There is a tool for examining metadata: zdb(1). As for sharing my tools, please send me email; I can be reached at max at bruningsystems dot com. By the way, have you tried importing on a recent build of OpenSolaris? There is supposed to be a recovery mechanism in the newest builds.

Peter William Lount said...

Hi Max,

Thanks for the OpenSolaris suggestion... I might give that a try before digging into the data structures any deeper... they seem inordinately and needlessly complex...

Unfortunately zdb only works with live (imported) zfs disk sets and not exported ones, at least on FreeBSD 7.2. Are you suggesting it is different on OpenSolaris, i.e. that zdb can work with exported volumes that can't be imported due to vague claims of "data corruption" by zfs?

This failure that my 8 disks are currently suffering means that zfs is not just dangerous but a highly dangerous place to store precious data. Once I recover the data, I'll be thinking twice about keeping it on zfs. Any file system must provide recovery tools. Part of that is that each on-disk metadata structure (and even data chunks) must have a magic number, so the record type can be identified out of the noise, plus redundant pointers and information for reconstruction processes. Failure to provide that is, in my view, highly irresponsible and short-sighted on a vendor's part.

I do look forward to trying the new version of OpenSolaris to see if it can recover the data for me on import.

Thanks again... I'll keep you advised as I progress through this recovery of crucial data.

Oh, I've noticed that zfs is very particular about which disk controller ports its various drives are on. Why is that, when each disk can be identified by its on-disk guid and other information? Could this be what is causing the problem with the second set of four disks being corrupted? It's a real pain going through all the permutations of disks to controller ports, and all that rebooting. I thought that export and import would let the drives be moved to different controller ports?


Max Bruning said...

Hi Peter,
This would be much simpler to do by email, but I don't have an email address for you.

No, zdb will not work with exported pools in opensolaris.

I have also noticed zfs has a problem if disks are moved around, but it has not been a big problem for me. There has been quite a bit of discussion about this lately on the zfs discussion list on opensolaris. You might want to try reading/posting there.

As for ZFS losing your data, I doubt it. I suspect your data is there, but you need to find the right way to get at it.

The tools I am using are: a few tools that deal directly with the raw disk, a modified mdb, and a modified zdb. I posted the modified mdb and zdb code quite a while ago and have kept them up to date (i.e., the tools work with the latest OpenSolaris release, 2009.06, build 111b).

Good luck!

Anonymous said...

Hi Max,

What if you accidentally deleted some files on zfs? Is there a way to recover the files? Note that no snapshot had been taken, and after the files were deleted nothing else was done on the system. Any help would be great!

Max Bruning said...

If you have deleted files, and the file system has not changed a lot since the deletions, you should be able to recover the deleted files. I have not been following zfs too closely as of late (though I am using it all the time). I thought there was a command now existing that allowed you to roll back to an earlier uberblock for recovery. Regardless, as I very seldom post to this blog (looks like almost 1 year now), your best bet if you want me to try to help is to send me email. max at bruningsystems dot com.