Friday, October 30, 2009

zpool recovery support integrated!

Zpool recovery was just integrated into the ONNV gate and will be part of OpenSolaris development build 128, due in about a month. Read more about it in this previous entry: Zpool recovery support (PSARC/2009/479).

In short, if you have hardware that does not honor cache flush requests or write ordering (some cheap USB drives, for example), combined with, say, a loss of power, a pool can end up damaged. This option provides a chance of recovering the pool in an automated way by reverting to an older but sane transaction group.

From the new zpool.1m:
"Recovery mode for a non-importable pool. Attempt to
return the pool to an importable state by discarding the
last few transactions. Not all damaged pools can be
recovered by using this option. If successful, the data
from the discarded transactions is irreversibly lost.
This option is ignored if the pool is importable or
already imported."

"# zpool clear -F data
Pool data returned to its state as of Tue Sep 08 13:23:35 2009.
Discarded approximately 29 seconds of transactions."
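
The same recovery logic is also wired into zpool import for a pool that refuses to import at all. A minimal sketch of how that would be used; the -n dry-run flag comes from the PSARC case and the output is not shown here:

# zpool import -F -n data
# zpool import -F data

The first command only reports whether discarding the last few transactions would make the pool importable again; the second actually performs the rollback and imports the pool.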

Here is the changeset: http://hg.genunix.org/onnv-gate.hg/rev/8aac17999e4d

Thursday, October 29, 2009

The curious case of the strange ARC

I've recently encountered some strange ZFS behavior on my OpenSolaris laptop. It was installed about a year ago with OpenSolaris 2008.11, and I have since upgraded through just about every development release available. Every upgrade downloads lots of updated packages, and the Image Packaging System caches the downloaded data in a directory that can grow quite large over time. In my case the "/var/pkg/download" directory held half a million files consuming 6.6 gigabytes of disk space.

Traversing through all these files using du(1B) took about 90 minutes, a terribly long time even for half a million files. Performing the same operation a second time took about as long, which raised two questions. First, why is it so terribly slow to begin with? Second, why doesn't the ARC cache the metadata so that the second run is much faster?
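
For reference, the traversal was nothing more exotic than a recursive disk-usage summary; something along these lines, wrapped in ptime(1) to capture the wall-clock time (the exact invocation is illustrative, not the precise command from the original test):

# ptime /usr/ucb/du -s /var/pkg/download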

Looking at the activity on the machine, the CPU was almost idle but the disk was close to 100 percent busy for the entire run. arcstat reported that the space used by the ARC was only a third of its target.

ZFS can run into fragmentation problems if a pool is allowed to get close to full. This pool had been upgraded twelve times over the last year, and every upgrade creates a clone of the filesystem and performs a lot of updates to the /var/pkg/download directory structure. Fragmentation could explain the extremely slow initial run, but the ARC should still cache the data for a fast second pass.

Replicating the directory to another machine running the very same OSOL development release (b124) and repeating the test gives much better results:

Initial run time:      source ~90m   replica ~3m
Second run time:       source ~90m   replica ~15s
Reads/s, initial run:  source ~200   replica ~5-6K
Reads/s, second run:   source ~200   replica ~46K

If we are right about fragmentation being the problem, all data has now been rewritten in one go to a pool with plenty of free space, which explains why the initial pass is much faster. But why does the ARC speed up the second run on this machine? Both machines are of the same class (~2GHz x86, 2.5" SATA boot disk), but the second machine has more memory. That shouldn't matter, though, since there is plenty of room left in the ARC even on the slow machine. Some digging shows that there is a separate limit on the amount of metadata cached in the ARC.

# echo "::arc" | mdb -k |grep arc_meta
arc_meta_used = 854 MB
arc_meta_limit = 752 MB
arc_meta_max = 855 MB

This is something to look into. What's happening is that the ARC is of no use at all in this situation: it is based on least recently used (LRU) and most frequently used lists, and everything under this directory is read the same number of times each pass and in the same order, so the ARC fills up and pushes out entries before they can be used again. The arc_meta_limit defaults to 1/4 of the ARC, which is too small on this 4GB system, so let's try raising the limit to 1GB and run the tests again:

# echo "arc_meta_limit/Z 0x4000000" | mdb -kw

Traversing the directory now takes about 10 minutes after the initial run. Better, but still terribly slow for cached data, and the machine is still hitting the disks extensively. This is caused by access time updates on all files and directories, and remember that the filesystem is terribly fragmented, which of course hurts this operation as well. We can turn off atime updates for a filesystem in ZFS:

# zfs set atime=off rpool/ROOT/opensolaris-12
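
A quick check that the property took effect (atime is inheritable, so child datasets pick it up unless they override it):

# zfs get atime rpool/ROOT/opensolaris-12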

Now a second pass over all the data takes well under a minute. The ARC problem is solved, and the fragmentation problem can be fixed by making some room in the pool and copying the directory so that all data is rewritten. Note that this data could also simply be removed without any problems, more on this here.
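
A rough sketch of that rewrite, assuming enough free space has been made first and that nothing touches the directory while it is swapped out (paths and flags are illustrative, not the exact commands I ran):

# cp -Rp /var/pkg/download /var/pkg/download.new
# rm -rf /var/pkg/download
# mv /var/pkg/download.new /var/pkg/download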

There is no defragmentation tool or function in ZFS (yet? It would depend on bp_rewrite, like everything else), and most of the time it's not needed. ZFS is a copy-on-write (COW) filesystem that is fragmented by design and should deal with it in a good way. It works perfectly fine most of the time; I've never had any issues with fragmentation on ZFS before. In the future I will make sure to keep some more free space in my pools to minimize the risk of something similar happening again.

I plan to write about the fragmentation part of this issue in more detail in a later post, so stay tuned.

Thursday, October 22, 2009

Solaris 10 containers integrated

Solaris 10 has a technology known as brandz, which can be seen as a translation layer between the Solaris kernel and a local zone. This layer can be used to provide an execution environment inside the zone that mimics another release of Solaris or even another OS. It has been used to create the lx brand that runs Linux inside a container, and it has also been used to create Solaris 8 and Solaris 9 containers.

This has probably helped accelerate Solaris 10 adoption and made it easier for customers to take advantage of new Solaris features and hardware even when their application environment could not be upgraded. A good example is that you can take a Solaris 8 application, put it in a branded zone, and use DTrace, which was not available before Solaris 10, to debug the application.

Now that Solaris 10 containers have integrated, the plan is to provide a similar service for Solaris 10, so that existing installations and zones can be migrated into Solaris 10 containers on OpenSolaris. This is probably even more important now, since all versions of Solaris after 10 will feature a whole new packaging system. Upgrading from Solaris 10 to any later Solaris release is not possible, and probably never will be in the way we have known upgrades in earlier Solaris releases.

It will be possible to migrate both existing Solaris 10 zones (v2v) and whole installations (p2v) into containers on OpenSolaris.
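
As a rough sketch of what a p2v migration could look like, assuming the solaris10 brand follows the pattern of the existing solaris8/solaris9 brands; the template name, archive path and install flags below are my assumptions, not confirmed syntax:

# zonecfg -z s10zone
zonecfg:s10zone> create -t SYSsolaris10
zonecfg:s10zone> set zonepath=/zones/s10zone
zonecfg:s10zone> exit
# zoneadm -z s10zone install -u -a /archives/s10-host.flar
# zoneadm -z s10zone boot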

You can take a look at the change here where there is also a link to the PSARC which contains much more information.

Wednesday, October 21, 2009

OSOL 2010.03, the story so far

It has been some time since the OpenSolaris 2009.06 release, and the developers have been busy. I've compiled a list of changes in ONNV that caught my attention, along with a few new packages added to the repository.

A few of the major features so far, in my opinion:
  • Triple-parity raidz, offering even better data protection than raidz2 and enabling wider stripes and/or more fault tolerance for zpools (a small example follows this list).
  • xVM is synced with Xen 3.3, up from version 3.1, offering better compatibility and performance.
  • Crossbow enhancements: bridging, anti-spoofing link protection, and Solaris packet capture.
  • ZFS user and group quotas (see the example after this list).
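
A minimal sketch of the first and last items; the disk and dataset names are made up for illustration:

# zpool create tank raidz3 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0
# zfs set userquota@alice=10G tank/home
# zfs get userused@alice tank/home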

If you find anything you like, you can try it out in the latest development release which is available here.

ZFS/Storage
triple-parity RAID-Z (6854612)
ZFS user/group quotas & space accounting (PSARC/2009/204)
iSCSI initiator tunables (PSARC/2009/369)
zpool autoexpand property (PSARC/2008/353)
COMSTAR Infiniband SRP Target - PSARC/2009/111
ZFS logbias property (PSARC/2009/423)
Removing a slog doesn't work (6574286)
FCoE (Fibre Channel over Ethernet) Target (PSARC/2008/310)
FCoE (Fibre Channel over Ethernet) Initiator (PSARC/2008/311)
Multiple disk sector size support (PSARC/2008/769)
zfs snapshot holds (PSARC/2009/297)
SATA Framework Port Multiplier Support (PSARC/2009/394)
zfs checksum ereport payload additions (PSARC/2009/497)
Solaris needs reparse point support (PSARC/2009/387)
ZFS support for Access Based Enumeration (PSARC/2009/246)
If 'zfs destroy' fails, it can leave a zvol device link missing (6438937)
zpool destruction/export should better handle stale zvol links (6573142)
zpool import with 8500 snapshots took 11hours (6761786)
zfs caching performance problem (6859997)
stat() performance on files on zfs should be improved (6775100)

Network
Solaris gldv3/wifi needs to support 802.11n (6814606)
RBridges: Routing Bridges (PSARC/2007/596)
Solaris Bridging (PSARC/2008/055)
Bridging Updates (PSARC/2009/344)
Clearview IP Tunneling (PSARC/2009/373)
Datalink Administration from Non-Global Zones (PSARC/2009/410)
Solaris Packet Capture (PSARC/2009/232)
Anti-spoofing Link Protection (PSARC/2009/436)
flowadm(1m) remote_port flow attribute (PSARC/2009/488)

Other
Boomer: Next Generation Solaris Audio (PSARC/2008/318)
ls enhancements (PSARC/2009/228)
Upgrade NTP to Version 4 (PSARC/2009/244)
Solaris on Extended partition (PSARC/2006/379)
Disk IO PM Enhancement (PSARC/2009/310)
System Management Agent (SMA1.0) migration to Net-SNMP 5.4.1 (LSARC/2008/355)
Upgrade OpenSSL to 0.9.8k (6806386)
Need to synch with newer versions of Xen and associated tools (6849090)
LatencyTOP for OpenSolaris (PSARC/2009/339)
Wireless USB support (PSARC/2007/425)

New drivers:
Atmel AT76C50x USB IEEE 802.11b Wireless Device Driver (PSARC/2009/143)
Ralink RT2700/2800 IEEE802.11 a/b/g/n wireless network device (PSARC/2009/167)
RealTek RTL8187L USB 802.11b/g Wireless Driver - (PSARC/2008/754)
add a VIA Rhine Ethernet driver to Solaris (PSARC/2008/619)
audio1575 driver available
audiocmi driver (PSARC/2009/263)
bfe fast ethernet driver (PSARC/2009/242)
Driver for LSI MPT2.0 compliant SAS controller (PSARC/2008/443)
Atheros AR5416/5418/9280/9281/9285 wireless Driver (PSARC/2009/322)
audiovia97 (PSARC/2009/321)
Myricom 10 Gigabit Ethernet Driver (PSARC/2009/185)
Atheros/Attansic Gigabit Ethernet Driver (PSARC/2009/405)
audiols driver (PSARC/2009/385)
audiosbp16x audio driver (PSARC/2009/384)
Marvell Yukon Gigabit Ethernet Driver (PSARC/2009/190)
audiosolo driver (PSARC/2009/487)

Some highlights among the over 200 packages added to the repository:
SUNWwireshark Wireshark - Network protocol analyzer
SUNWparted GNU Parted - Partition Editor
SUNWiperf tool for measuring maximum TCP and UDP bandwidth
SUNWiftop iftop - Display bandwidth usage on an interface
SUNWsnort snort - Network Intrusion Detector
SUNWdosbox DosBox - DOS Emulator
SUNWejabberd ejabberd - Jabber/XMPP instant messaging server
SUNWiozone iozone - a filesystem benchmark tool
SUNWrtorrent rtorrent - a BitTorrent client for ncurses
SUNWsynergy Synergy Mouse/Keyboard sharing
SUNWareca Areca backup utilities

Many Python modules, new languages, and OpenJDK 7 have also been added.

Tuesday, October 13, 2009

OpenSolaris 2010.03

It looks like OpenSolaris 2010.02 has had a slight schedule adjustment and will now be known as OpenSolaris 2010.03. The target build for this release is now 135 instead of the previously planned build 132. You can read what's available so far here.

Monday, October 12, 2009

Oracle OpenWorld keynote

Ben Rockwood is attending Oracle OpenWorld and has written a good summary of Scott's and Larry's keynote on his blog.

There are indeed some signs that something good might come out of the acquisition after all.

Wednesday, October 7, 2009

Solaris 10 10/09 (Update 8) available

Solaris 10 10/09 is now available for download, but it has yet to be announced and to have all download links updated.

Anyway, here is a working link. New documentation is also available on docs.sun.com, including the What's New in the Solaris 10 10/09 Release.

The changes are pretty much in line with my earlier predictions.

Update: Joerg Moellenkamp has a nice summary of what's new.