OpenCSW Bug Tracker


ID: 0003730
Category: [nrpe] regular use
Severity: crash
Reproducibility: always
Date Submitted: 2009-06-26 00:28
Last Update: 2012-07-12 11:05
Reporter: gadavis
View Status: public
Assigned To: ja
Priority: normal
Resolution: fixed
Status: closed
Summary: 0003730: svcadm disable cswnrpe does not gracefully handle missing pid_file param, hangs system on shutdown
Description: CSWnrpe 2.12,REV=2009.06.18 fails to shut down cleanly on Solaris 10 SPARC. This causes something in the SMF framework to hang when the system is init 6'd or init 1'd. The system must be Stop-A'd and rebooted.

When a manual svcadm enable/ svcadm disable is issued, the following is observed in /var/svc/log/application-cswnrpe:default.log:

[ Jun 25 22:01:14 Executing start method ("/var/opt/csw/svc/method/svc-cswnrpe start") ]
[ Jun 25 22:01:16 Method "start" exited with status 0 ]
[ Jun 25 22:02:12 Stopping because service disabled. ]
[ Jun 25 22:02:12 Executing stop method ("/var/opt/csw/svc/method/svc-cswnrpe stop") ]
/usr/bin/kill[8]: kill: bad argument count
[ Jun 25 22:02:13 Method "stop" exited with status 0 ]

svcs -xv shows:
# svcs -xv cswnrpe
svc:/application/cswnrpe:default (?)
 State: online since June 25, 2009 10:02:12 PM UTC
   See: /var/svc/log/application-cswnrpe:default.log
Impact: None.

Digging a bit further, it appears that the stop method script does not have any sort of error checking to see if pid_file is defined in nrpe.cfg.

This is a bit of a problem for those of us upgrading from an older version of NRPE that didn't support the pid_file argument.
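
The "kill: bad argument count" line in the log above is consistent with kill being invoked without any PID, which is what an empty pid_file setting would produce once the command substitution expands to nothing. A minimal sketch of a guard, assuming a hypothetical helper named safe_kill that is not part of the shipped method script:

```shell
#!/bin/sh
# Hypothetical helper: forward to kill only when at least one argument
# (a PID, optionally preceded by a signal option) was supplied.
# Returns 1 without calling kill when the argument list is empty.
safe_kill() {
    # "$*" joins all arguments; empty means the caller found no PID.
    if [ -z "$*" ]; then
        echo "no PID given, nothing to stop" >&2
        return 1
    fi
    kill "$@"
}

# Usage sketch: an empty PID list no longer reaches kill.
pids=""
safe_kill $pids || echo "skipped kill"
```

With a guard like this the stop method can report a clear message and choose its own exit status instead of passing an empty argument list to kill.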

Relationships
related to 0003764 (closed, bonivart) cswclassutils: Problems with service manifest generation and buggy stop methods

Notes
(0006342)
ja (developer)
2009-06-26 14:52

Did I understand correctly that the pid_file directive is missing from your config file?

If so, this should do the trick in /var/opt/csw/svc/method/svc-cswnrpe. Do you agree?

'stop')
        if [ -f "$pidfile" ]; then
            [ -n "`pgrep -x -u 0,1,$NRPE_USER nrpe`" ] && /usr/bin/kill `head -1 $pidfile`
            rm "$pidfile"
        else
            /usr/bin/kill `pgrep -x -u 0,1,$NRPE_USER nrpe`
        fi
        ;;
(0006344)
gadavis (reporter)
2009-06-27 01:07

The restart function looks like it would still be broken, but that seems like it will work.

Now that I look at things more closely, I would almost consider treating a configuration file without a pid_file declared as an error on Solaris 10 and higher, because pgrep will find multiple pids if it is run in the global zone and there are non-global zones running nrpe as well. As it currently stands, the script will attempt to kill all of them if it is run without a pidfile.

You might also want to replace lines 32 and 33 with:

pidfile=`awk -F'=' '/^[ \t]*pid_file/ {print $2}' $CONFIG_FILE`
NRPE_USER=`awk -F'=' '/^[ \t]*nrpe_user/ { print $2 }' $CONFIG_FILE`

This fixes a couple of problems: leading spaces before either config option are now tolerated, and commented-out pid_file lines are no longer matched.
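
To see what the anchored pattern does, the two assignments can be run against a small sample file. The config contents below are made up for illustration, and the empty-value check is only one possible way to treat a missing pid_file as an error:

```shell
#!/bin/sh
# Sample config (invented for this sketch): one commented-out pid_file line,
# one indented pid_file line, and an nrpe_user line.
CONFIG_FILE=`mktemp`
cat > "$CONFIG_FILE" <<'EOF'
#pid_file=/var/run/old-nrpe.pid
  pid_file=/var/run/nrpe.pid
nrpe_user=nagios
EOF

# The two assignments as proposed in the note.
pidfile=`awk -F'=' '/^[ \t]*pid_file/ {print $2}' $CONFIG_FILE`
NRPE_USER=`awk -F'=' '/^[ \t]*nrpe_user/ { print $2 }' $CONFIG_FILE`

# Assumption: a missing pid_file is reported instead of silently ignored.
if [ -z "$pidfile" ]; then
    echo "pid_file is not set in $CONFIG_FILE" >&2
else
    echo "pidfile=$pidfile user=$NRPE_USER"
fi
rm -f "$CONFIG_FILE"
```

Note that a line written as "pid_file = /path" would still yield a value with a leading space, since the field separator is only the equals sign.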
(0006345)
ja (developer)
2009-06-27 12:39

Good point with the zones! What do you think about this?

'stop')
        # remove pid file
        if [ -f "$pidfile" ]; then
            [ -n "`pgrep -x -u 0,1,$NRPE_USER nrpe`" ] && /usr/bin/kill `head -1 $pidfile`
            rm "$pidfile"
        else
            if [ `uname -r` = 5.8 -o `uname -r` = 5.9 ]
            then
                /usr/bin/kill `pgrep -x -u 0,1,$NRPE_USER nrpe`
            else
                /usr/bin/kill `pgrep -x -u 0,1,$NRPE_USER -z \`zonename\` nrpe`
            fi
        fi
        ;;

This works reliably for me in a global zone and works around a missing pid_file line in the config.

Thanks for the modified lines 32 and 33, cool!
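
The release check can also be kept separate from the kill itself, which makes the zone handling easy to try on any machine. This is a sketch only: nrpe_pgrep_args is a hypothetical helper, and it takes the output of uname -r and zonename as arguments instead of calling them:

```shell
#!/bin/sh
# Hypothetical helper: build the pgrep arguments used to find nrpe.
# $1 = OS release as printed by `uname -r`
# $2 = nrpe user
# $3 = zone name as printed by `zonename` (ignored on 5.8 and 5.9)
nrpe_pgrep_args() {
    case "$1" in
        5.8|5.9)
            # Solaris 8 and 9 have no zones, so no -z option.
            echo "-x -u 0,1,$2 nrpe"
            ;;
        *)
            # Restrict the match to processes in the given zone.
            echo "-x -u 0,1,$2 -z $3 nrpe"
            ;;
    esac
}

nrpe_pgrep_args 5.9 nagios ""
nrpe_pgrep_args 5.10 nagios global
```

As in the stop method above, any release other than 5.8 or 5.9 is assumed to support zones.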
(0006360)
gadavis (reporter)
2009-06-29 22:32

Looks like it should work
(0006361)
ja (developer)
2009-06-29 23:39

I put packages with the fixed start / stop script into testing. Please test them; if there are no other issues, I will release them at the end of the week.
(0006381)
gadavis (reporter)
2009-07-03 02:51

I'm not quite sure where to look for this package. I don't see it on the ibiblio or purdue mirrors under unstable or testing in the 5.10 directories. Most recent version I see is: nrpe-2.12,REV=2009.06.25-SunOS5.8-sparc-CSW, and this version predates me opening this ticket.

Am I looking in the right places?
(0006382)
ja (developer)
2009-07-03 08:55

Please look at http://mirror.opencsw.org/testing.html - there you will find all the packages in testing :)
(0006406)
gadavis (reporter)
2009-07-07 18:36

I tried to install the package but got errors in the non-global zones when the zones are not booted. It only installs in zones that are currently running.

I don't think I had noticed the error before, but the old versions of the package apparently give the same error.

# zoneadm list -cv
  ID NAME             STATUS     PATH                  BRAND    IP
   0 global           running    /                     native   shared
   1 anfweb-dev       running    /zones/anfweb-dev     native   shared
   - anfwfproc        installed  /zones/anfwfproc      native   shared

# pkgadd -d nrpe-2.12\,REV\=2009.06.30-SunOS5.8-sparc-CSW.pkg all
## Verifying package <CSWnrpe> dependencies in zone <anfweb-dev>
## Booting non-running zone <anfwfproc> into administrative state
## Verifying package <CSWnrpe> dependencies in zone <anfwfproc>
## Restoring state of global zone <anfwfproc>

The package <CSWnrpe> contains scripts which will be executed on
zones <anfwfproc, anfweb-dev> with super-user permission during the
process of installing this package.

Do you want to continue with the installation of <CSWnrpe> [y,n,?] y

Processing package instance <CSWnrpe> from </root/nrpe-2.12,REV=2009.06.30-SunOS5.8-sparc-CSW.pkg>
## Installing package <CSWnrpe> in global zone

nrpe - nagios remote plugin executor(sparc) 2.12,REV=2009.06.30
http://downloads.sourceforge.net/nagios/ packaged for CSW by Juergen Arndt
## Executing checkinstall script.
nagios user detected
nagios group detected
## Processing package information.
## Processing system information.
   2 package pathnames are already properly installed.
## Verifying package dependencies.
## Verifying disk space requirements.
## Checking for conflicts with packages already installed.
## Checking for setuid/setgid programs.

This package contains scripts which will be executed with super-user
permission during the process of installing this package.

Do you want to continue with the installation of <CSWnrpe> [y,n,?] y

Installing nrpe - nagios remote plugin executor as <CSWnrpe>

## Executing preinstall script.
## Installing part 1 of 1.
/opt/csw/bin/nrpe <symbolic link>
/opt/csw/bin/nrpe_1k
/opt/csw/bin/nrpe_8k
/opt/csw/share/doc/nrpe/LEGAL
/opt/csw/share/doc/nrpe/NRPE.pdf
/opt/csw/share/doc/nrpe/README
/opt/csw/share/doc/nrpe/README.SSL
/opt/csw/share/doc/nrpe/README_8k
/opt/csw/share/doc/nrpe/SECURITY
[ verifying class <none> ]
Restoring /etc/opt/csw/preserve/CSWnrpe/nrpe.cfg

[ verifying class <cswpreserveconf> ]
Installing class <cswinitsmf> ...
Creating /var/opt/csw/svc/manifest/application ...
Creating service script in /var/opt/csw/svc/method/svc-cswnrpe ...
Creating manifest ...
Configuring service in SMF ...
CSWnrpe is using Service Management Facility. The FMRI is svc:/application/cswnrpe:default
[ verifying class <cswinitsmf> ]

Installation of <CSWnrpe> was successful.
## Installing package <CSWnrpe> in zone <anfweb-dev>

nrpe - nagios remote plugin executor(sparc) 2.12,REV=2009.06.30
## Executing checkinstall script.
nagios user detected
nagios group detected
## Processing package information.
## Processing system information.
   2 package pathnames are already properly installed.

Installing nrpe - nagios remote plugin executor as <CSWnrpe>

## Executing preinstall script.
## Installing part 1 of 1.
/opt/csw/bin/nrpe <symbolic link>
/opt/csw/bin/nrpe_1k
/opt/csw/bin/nrpe_8k
/opt/csw/share/doc/nrpe/LEGAL
/opt/csw/share/doc/nrpe/NRPE.pdf
/opt/csw/share/doc/nrpe/README
/opt/csw/share/doc/nrpe/README.SSL
/opt/csw/share/doc/nrpe/README_8k
/opt/csw/share/doc/nrpe/SECURITY
[ verifying class <none> ]
Copying sample config to /opt/csw/etc/nrpe.cfg

[ verifying class <cswpreserveconf> ]
Installing class <cswinitsmf> ...
Creating service script in /var/opt/csw/svc/method/svc-cswnrpe ...
Creating manifest ...
Configuring service in SMF ...
CSWnrpe is using Service Management Facility. The FMRI is svc:/application/cswnrpe:default
[ verifying class <cswinitsmf> ]

Installation of <CSWnrpe> on zone <anfweb-dev> was successful.
## Booting non-running zone <anfwfproc> into administrative state
## Installing package <CSWnrpe> in zone <anfwfproc>

nrpe - nagios remote plugin executor(sparc) 2.12,REV=2009.06.30
## Executing checkinstall script.
nagios user detected
nagios group detected
/var/tmp//installM_aiEa/checkinstallR_aiEa: /tmp/sh2470: cannot create
pkginstall: ERROR: checkinstall script did not complete successfully

Installation of <CSWnrpe> on zone <anfwfproc> failed.
No changes were made to the system.
## Restoring state of global zone <anfwfproc>
(0006407)
gadavis (reporter)
2009-07-07 18:42

Another oddity, and probably the reason why the system hangs when the method script errors out, is that the timeout values are all set to something huge.

[root@plinian:/root]
{516}# svccfg -s cswnrpe listprop start/timeout_seconds
start/timeout_seconds count 18446744073709551615
[root@plinian:/root]
{517}# svccfg -s cswnrpe listprop stop/timeout_seconds
stop/timeout_seconds count 18446744073709551615
[root@plinian:/root]
{518}# svccfg -s cswnrpe listprop restart/timeout_seconds
restart/timeout_seconds count 18446744073709551615

Could you tweak your manifest so that those timeout values are brought down to something reasonable like 60 seconds?

You might also consider just changing the stop/method property to ":kill". This negates the whole pid_file problem as well as the zone problem.
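
Until the manifest is changed, the same settings could be tried on an installed service with svccfg. The sketch below only prints the commands so they can be reviewed before anything is run. The 60-second value is the one requested above, and stop/exec is assumed to be the property that holds the stop method (the note refers to it as stop/method):

```shell
#!/bin/sh
# Print, rather than run, the commands that would lower the method timeouts.
# SERVICE and TIMEOUT are illustrative values taken from this report.
SERVICE=cswnrpe
TIMEOUT=60

timeout_commands() {
    for method in start stop restart; do
        echo "svccfg -s $SERVICE setprop $method/timeout_seconds = count: $TIMEOUT"
    done
    # Assumption: stop/exec holds the stop method; :kill asks SMF to signal
    # the processes in the service's contract instead of running a script.
    echo "svccfg -s $SERVICE setprop stop/exec = astring: :kill"
    # A refresh is needed before the running service picks up the changes.
    echo "svcadm refresh $SERVICE"
}

timeout_commands
```

Printing first is a deliberate choice here: the property names should be checked against svccfg listprop output on the target system before the commands are executed.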
(0006426)
ja (developer)
2009-07-12 21:00

I'll try to reproduce the strange behaviour when installing on a system with zones.

Concerning the timeout values I have to investigate the reason for this. Give me some time, because I'm a little bit under load these days.
(0006429)
gadavis (reporter)
2009-07-13 19:00

I get the feeling both are related to cswclassutils or MGAR, specifically the automatic manifest generation routines in cswclassutils. I actually opened bug 0003764 against cswclassutils but haven't heard back from the maintainer yet.
(0006430)
gadavis (reporter)
2009-07-13 19:21

Further research shows that the manifest script generated by http://sourceforge.net/apps/trac/gar/browser/csw/mgar/pkg/cswclassutils/trunk/files/CSWcswclassutils.i.cswinitsmf tries to set the timeout values to -1. I get the feeling that 18446744073709551615 is what happens when you print a signed 64-bit integer as an UNsigned 64-bit integer.
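
That guess is easy to check from a shell. A small sketch, assuming bash or a similar shell on a 64-bit system, where printf converts its argument to the native unsigned width (as_unsigned is a name chosen for this example):

```shell
#!/bin/sh
# Print a signed value reinterpreted as an unsigned integer of the
# shell's native width (64 bits is assumed here).
as_unsigned() {
    printf '%u\n' "$1"
}

as_unsigned -1
```

On such a system this prints 18446744073709551615, which is 2^64 - 1 and matches the timeout_seconds values reported by svccfg.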
(0010017)
ja (developer)
2012-07-12 11:05

Issue closed. Start / stop method redesigned and tested.

