Thursday, November 6, 2008

SYSV Packages fundamentals

If you are new to Solaris or never bothered looking at how the SYSV packages are built up, i found a good article at bigadmin that could be worth a read:

Introduction to Package Components for the Solaris OS

I found this article while researching why a file was not updated when patching and upgrading a system. It turned out that the file is flagged as a editable file, which means it will not be overwritten by the package system. Such files should be handled by a class action script that are responsible for making the required changes to the file.

In this particular case we had problems with OPL servers (M5000,M8000) that did not have a working console login. We had installed them with a flash archive created with a Solaris release predating OPL support, so the /etc/ou.ap used by autopush was not updated with information required for OPL hardware.

Monday, November 3, 2008

Follow the white rabbit

Yesterday i encountered a strange problem which was very entertaining to debug and dtrace came to the rescue as as always. I had done a fresh install of Solaris Nevada build 101 on a AMD64 machine at home. The installation went fine and i started to configure on the host, but after a while i begun to get error messages like: "couldn't set locale correctly". I checked the SYSV package containing my locale, en_US.8859-1:

# pkgchk SUNWlang-en-extra
ERROR: /usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3
pathname does not exist


The installation logs showed that the package had been installed correctly. I tried to reinstall the package which temporarily fixed the problem but after 10 minutes or so it the file was missing again. I repeated this twice to see if there was any time pattern, but it gave me no clue to why the files was removed. I decided to use both BSM auditing and dtrace to see what might be causing the removal of the file. I waited and waited, but the file stayed there so i started to use the system as ordinary, that somehow triggered the removal of the file!?

A dtrace script created by Chris Gerhard showed me what caused the removal of the file:

# ./delete.d
dtrace: script './trace.d' matched 2 probes
CPU ID FUNCTION:NAME
0 75683 unlink:return man prctl
UID 0 PPID 11244 sh


Audit also logged the unlink of the file, following the trail it could have shoved me that man was responsible:

header,242,2,unlink(2),,ollespappa,2008-11-03 02:52:01.096 +01:00
path,/usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3
attribute,100555,root,bin,65538,320199,18446754073709451615
subject,arne,root,root,root,root,15984,2799982011,1330 5632 xx.xx.xx.xx
return,success,0


So, man(1) unlinked the file, that did not make any sense at all. A dtrace one liner gave me more information:

# dtrace -n 'pid$target:libc:unlink:entry { printf("unlink(%s)\n",copyinstr(arg0)); ustack(); }' -c 'man ls'
CPU ID FUNCTION:NAME
1 76165 unlink:entry unlink(/usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3)
libc.so.1`unlink
man`format+0xd54
man`searchdir+0x14d
man`mandir+0x194
man`manual+0x1c1
man`main+0x5d4
man`_start+0x7d


The unlink was called from the format function of man, Looking at the source I identified which line was responsible for the unlink:
2754 (void) unlink(tmpdir);

After some more investigation I realized that the argument string to the unlink command might not get initialized, it then points out in space. For this system it pointed at a string which represented the path to a file which could be unlinked. This occurs only if the manual page is in another format than SGML since the variables are initialized inside an SGML-related if-statement :

2612 if (sgml_flag == 1) {
2613 if (check_flag == 0) {
2614 strcpy(tmpdir, "/tmp/sman_XXXXXX");


This bug seems to have been in Solaris for ages, but the uninitialized string might never have come to point to anything useful to unlink, it could perhaps be the switch to SunStudio 12 in snv100 that triggered this behavior.

This is now CR 6767074.

Sunday, November 2, 2008

Update 6 released and attach/detach (updated)

After a long wait update 6 of Solaris was released a few days ago and is available for download. All new features is listed here, as i have mentioned earlier the bulk of the changes are in the major ZFS update, including ZFS boot. This is a great update to a already good OS, but i was extremely disappointed to see that they have not fixed the attach/dettach bug related to patch history yet, it have been waiting for this fix for well over a year. In this update bugid 6637869 (zone attach doesn't handle obsolete patches correctly) is fixed but a new one is listed in known issues, bugid 6710545 (fix for 6550154 doesn't always ignore older patches).

What this means is that it is still not enough to be on the exact same patch level on both machines when moving zones in a supported way, all patches have to be applied in the same way, have the same history. Otherwise it might not work. I don't know what update on attach can do for this, but that is not the point, attach is still broken.

Update, Gerald kindly informed me that this is indeed fixed in Solaris 10 10/08 (137137-09) But it is stated as an issue in the release notes, which i based this post on.