One step forward…

As I mentioned here, we upgraded our mail server to Tiger. All looked good, until we started to see this in our logs:

Potential VM growth in DirectoryService since client PID: 0, 
  has 550 open references when the warning limit is 500.

According to posts in Apple’s OS X Server forum (here, here, and the tail end of here), it looks like this nasty problem involves servermgrd, DirectoryService, or both. The number of open references grows continuously until it renders your server unusable – usually not considered a good thing.

I don’t have a fix for the problem.

I do, however, have a band-aid.

I created a perl script in /usr/local/bin/ called plugleak.pl. The script checks the system.log looking for the telltale error message above. If it finds it, it restarts both servermgrd and DirectoryService.
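
For illustration, the same check-and-restart idea can be sketched in shell. This is not the original plugleak.pl, which is Perl and is not reproduced here. The function names, the `plugleak` syslog tag, the 200-line window, and the use of killall to trigger a restart (assuming launchd relaunches both daemons) are all assumptions made for the sketch.

```shell
#!/bin/sh
# Sketch only, not the original plugleak.pl. All names here are hypothetical.

WARNING='Potential VM growth in DirectoryService'
TAG='plugleak'   # assumed syslog tag for this script's own messages

# Return 0 if the last N lines of the log contain the warning.
# Lines carrying this script's own tag are skipped, so the check
# cannot be triggered by its own log output.
leak_warning_seen() {
    log_file=$1
    lines=${2:-200}
    tail -n "$lines" "$log_file" 2>/dev/null \
        | grep -v "$TAG" \
        | grep -q "$WARNING"
}

# Restart both daemons. Assumption: launchd relaunches them after they exit.
restart_services() {
    logger -t "$TAG" "leak warning found, restarting services"
    killall servermgrd
    killall DirectoryService
}
```

A periodic job would then run something like `leak_warning_seen /var/log/system.log && restart_services`. Matching the full warning text rather than a fragment, and skipping the script's own tagged lines, keeps the check from reacting to its own log entries.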

Using Lingon (a very nice little GUI for launchd items) I created a periodic task (/Library/LaunchDaemons/com.pimedia.plugleak.plist) to run the plugleak.pl script every 10 minutes.
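
For anyone without Lingon, a roughly equivalent launchd job can be written by hand. The sketch below generates a minimal plist from a shell function. The label and script path come from the post; the function name is hypothetical, and the choice of `StartInterval` with 600 seconds is an assumption about how the ten-minute schedule could be expressed, not a copy of what Lingon produced.

```shell
#!/bin/sh
# Sketch: write a launchd job that runs the script every 600 seconds.
# Label and script path are from the post; the rest is assumed.

write_plugleak_plist() {
    plist_path=$1
    cat > "$plist_path" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.pimedia.plugleak</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/plugleak.pl</string>
    </array>
    <key>StartInterval</key>
    <integer>600</integer>
</dict>
</plist>
EOF
}
```

Run as root, `write_plugleak_plist /Library/LaunchDaemons/com.pimedia.plugleak.plist` followed by `launchctl load` on that path should register the job.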

Not necessarily an elegant solution. But, well, ya do what ya gotta do.

Update

I don’t do a lot of admin support anymore, but I do scan the Mac OS X Server mailing list, and I came across a post that led me to this thread on the Darwin Kernel list that goes a long way to explaining what is going on here.

Very interesting and worth a read.

17 thoughts on “One step forward…”

  1. Great Lord almighty – this is a classical “you shot your own knee”-tool…

    I’ve found a system suffering from this problem where local7 is getting logged into /var/log/system.log.

    Guess what – your log message “Found that damn Potential VM notice in the log, killing services” contains the words “Potential VM”, too…

    recursion — see recursive
    recursive — see recursion

    Might be time to change the log message to protect idiots from themselves…

  2. Thanks Noses,

    I’ve changed the script, though there was little chance of recursion on the afflicted server given the 10 minute script frequency. There was always enough system.log content to prevent the script’s log message from showing up in sequential tails of the log.

  3. The sad thing is that the bug is still present in 10.5.2, two years later. I just installed your script after having this exact problem … thanks!

  4. I’m seeing this too with ManagedClient as the trigger in Mac OS X 10.5.2 *client*. If I kill that process, then everything goes normal.

    It appears to be related to issues with the machine record in Open Directory.

  5. I see this as well on a 10.5.2 server running only mail services and bound to an OD domain. Applying your plugleak workaround has fixed things. Thanks!

  6. Trying to use your script, but it bombs. Since I am not a perl guru, I’m not sure why. I tried the commands, tail & grep; they are there and work. And since the first two bomb, the secondary command fails too.
    10.5.4
    /usr/bin/plugleak.pl: line 18: =: command not found
    /usr/bin/plugleak.pl: line 19: syntax error near unexpected token `"Checking logs"'
    /usr/bin/plugleak.pl: line 19: `logthis("Checking logs");'

  7. Hi, when I try this, I get a ‘permission denied’.. even when running it as ‘super user’.. so no clue.. but I may manually run this when I get the error and test it.. Seems to happen every 3 weeks for me.

  8. Here it is almost four years later, and your little perl script is still saving people’s bacon! When 10.6 came out, I upgraded my production 10.4 Server to 10.5.8 (I try to stay one version back), and I’ve been beating my head against this DirectoryService issue ever since–until I managed to end up here. Since Apple can’t seem to make it work properly, this script is the best “fix” out there. A million thanks to you for sharing this with the world!

  9. One helpful mod…it’s a good idea to try killing the DirectoryService process more gently before using the -9 flag. I wrote a script that tries a “gentle” kill and resorts to the -9 if that doesn’t work. It also restarts servermgrd and AppleFileServer, both of which are negatively affected by restarting DirectoryService.

    ———————–

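    #!/bin/sh
    # remember the PID DirectoryService is running under now
    dspo=`/bin/cat /var/run/DirectoryService.pid`
    /usr/bin/killall -HUP servermgrd
    # ask DS to stop gracefully (killall sends TERM by default)
    /usr/bin/killall DirectoryService
    /bin/sleep 10
    dspn=`/bin/cat /var/run/DirectoryService.pid`
    # if DS didn't stop gracefully, kill it harder
    # (quoted so the test still works if the PID file is missing or empty)
    if [ "$dspo" = "$dspn" ]
    then
        /usr/bin/killall -9 DirectoryService
        /bin/sleep 10
    fi
    /usr/bin/killall AppleFileServer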

    ———————–

  10. Wow, I can’t believe that this is still an ongoing issue.

    I’ve not needed to deal with it in years, but thanks for the script.

    ;david

  11. I also wrote a script that monitors the CPU usage of the DirectoryService process, and kills it (using the script above) if DS is hogging the CPU for several minutes straight. This has proven to be fairly effective at straightening out DS before it runs out of resources. Tweak it for your installation.

    ———————–

    #!/bin/sh
    log=/var/log/checkDScpu.log
    cnt=0
    while :
    do
    sum=0
    # calculate the ten-second average CPU usage for DS
    for pct in `/usr/bin/top -l11 | /usr/bin/grep DirectoryS | /usr/bin/cut -c18-25 | /usr/bin/cut -f1 -d'%' | /usr/bin/cut -f1 -d'.' | /usr/bin/tail -10`
    do
    sum=`/bin/expr $sum + $pct`
    done
    sum=`/bin/expr $sum / 10`
    # log it (handy for making graphs)
    echo `/bin/date +%Y-%m-%d-%H%M%S` $sum% >> $log
    # if DS is using more than 95% CPU, remember it
    if [ $sum -gt 95 ]
    then
    cnt=`/bin/expr $cnt + 1`
    else
    cnt=0
    fi
    # if DS uses more than 95% CPU for four minutes, kill it
    if [ $cnt -gt 3 ]
    then
    cnt=0
    /usr/bin/mail -s 'DirectoryService CPU is running away; killing it' root < /dev/null
    /usr/bin/logger -p local7.notice -t "checkDScpu" "DirectoryService is running away with the server, killing it..."
    /usr/local/admin/killds.sh
    fi
    sleep 50
    done

    ———————–

  12. We’re having the exact same problem on Mac OS X Server 10.5.8. It’s amazing to know the issue has been around for so long. Does anyone know if an upgrade to Snow Leopard fixes the problem?

  13. My script checks the last minute’s entries in system.log and kills the DirectoryService process if it finds “Potential VM growth in DirectoryService”.
    —————– DSKiller.awk ——————-

    #!/usr/bin/awk -f

    BEGIN {
        # Usage:
        # 1. Login as root
        # 2. Run "chmod +x DSKiller.awk"
        # 3. Execute "crontab -e"
        # 4. Enter "i"
        # 5. Write this:
        # */1 * * * * /Path/to/this/script/DSKiller.awk
        # 6. Enter "esc"
        # 7. Enter ":wq"
        #
        file_log = "/var/log/system.log"
        bin_kill = "/bin/kill"
        arg_kill = " -s KILL "
        bin_date = "/bin/date"
        arg_date = " -v -1M"
        bin_env = "/usr/bin/env"
        arg_env = " LC_TIME=C "
        bin_logger = "/usr/bin/logger"
        arg_logger = (" -p local3.notice -t DSKiller ")
        # skip reading standard input and go straight to the END block
        exit 0
    }

    END {
        # get the date and time as of one minute ago
        cmd_current_date = (bin_env arg_env bin_date arg_date)
        cmd_current_date | getline current_date
        close(cmd_current_date)

        # build a "Mon DD HH:MM" mask to match log lines from that minute
        if (split(current_date,TMPDATE)>0) {
            mask_date = (TMPDATE[2]" "TMPDATE[3]" "(substr(TMPDATE[4],1,5)))
        }
        split("",TMPDATE)

        # scan the log for the warning within that minute
        dead_is = 0
        while ((getline line_in < file_log) > 0) {
            if (line_in ~ mask_date && line_in ~ "Potential VM growth in DirectoryService") {
                dead_is = 1
                split(line_in,TMPdr)
                break
            }
        }
        close (file_log)

        if (dead_is == 1) {
            # field 5 looks like "DirectoryService[PID]:"; pull out the PID
            split(substr(TMPdr[5],18),TMPdp,"]")
            split("",TMPdr)
            dead_pid = TMPdp[1]
            split("",TMPdp)

            do_kill = (bin_kill arg_kill dead_pid" 2> /dev/null")
            if (system(do_kill) == 0) {
                system(bin_logger arg_logger "Kill signal successfully sent with args"arg_kill"to the DirectoryService daemon PID "dead_pid)
            } else {
                system(bin_logger arg_logger "Kill signal unsuccessfully sent with args"arg_kill"to the DirectoryService daemon PID "dead_pid)
            }
        }
    }
    # Ver 0.9

    —————– DSKiller.awk ——————-

  14. I’m preparing to upgrade my Xserve G5 running 10.5.8 to an Intel Xserve running 10.6, but in the process, I’ve hit upon some good fixes for the DirectoryService issue in 10.5. I believe there are two primary things that cause DirectoryService to fail.

    One is the twistd service that is part of the wiki service. On my server, it doesn’t matter that I don’t have the calendaring enabled; the system starts the wiki service anyway, and there’s apparently a bug in the service that causes it to hammer DirectoryService with authentication requests. To disable the service, edit /usr/share/caldavd/bin/twistd and, after the “import” line, add a line that reads “sys.exit(0)”. This will cause the twistd process to exit whenever it’s started. You’ll see a com.apple.wikid error in your system log every ten seconds, but it can be safely ignored.

    The second DirectoryService killer is authentication hack attacks on various daemons–VPN, IMAP, POP, SMTP/SASL, FTP, Apache, etc. I use fail2ban to monitor the various system logs, and configured it to use the built-in adaptive firewall to dynamically block attacks. If you’re interested in these scripts (which should work just fine for 10.6 too), shoot me an email at jlg at mac dot com.

    The upshot of these changes is that overall CPU usage is way down, and DirectoryService is much better behaved. It always used to die at least once or twice on Monday mornings, but thus far it’s cranking right along nicely without a hitch.
