Nagios odds and ends - the diary of file-glob, a.k.a. k.daiba

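One day,

the network felt somewhat unstable, so on a whim I looked at the CPU utilization of the core switch. It read 98%. I suddenly woke up to the need for system monitoring. I considered running MRTG or RRDTool, but I built that sort of thing long ago and it would be boring... er, ahem, I mean that not being able to see events in real time is a problem for troubleshooting, so I decided to install Nagios. I ran it on the Mac mini I always use, and my reference was the feature in the October 2007 issue of Software Design, "Network & System Visualization Plan" #4, "Visualizing system health: how to make the most of Nagios".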

nagios.cfg

First, I set the "particularly important settings in nagios.cfg" as follows.

log_file=/usr/local/nagios/var/nagios.log
cfg_file=/usr/local/nagios/etc/commands.cfg
cfg_file=/usr/local/nagios/etc/localhost.cfg
nagios_user=nagios
nagios_group=nagios
lock_file=/usr/local/nagios/var/nagios.lock

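I wanted to add the nagios account to the user and group databases from the command line, so I used nidump and niload as described on the "NetInfo - MacWiki" page. A new version of Mac OS X should be out soon, and I wonder whether this approach will still work there.
Anyway, the configuration files that this file pulls in are commands.cfg and localhost.cfg. I will start with the contents of commands.cfg.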

commands.cfg

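I kept check_ping and check-host-alive, which Nagios uses by default for liveness monitoring, as they are. You would think that if network equipment went down, complaints from users would tell me right away, but I recently hit a management-plane bug where a switch stopped answering ping and telnet while still forwarding packets without trouble, so I use these checks just in case.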

# 'check_ping' command definition
define command{
  command_name    check_ping
  command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
}
# 'check-host-alive' command definition
define command{
  command_name    check-host-alive
  command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
}

On the other hand, I thought "notification by mail is so-so; fault announcements belong on IRC", so I set up commands that notify a specific IRC channel.

# 'notify-by-irc' command definition
define command{
        command_name    notify-by-irc
        command_line    $USER1$/irclient.pl "$NOTIFICATIONTYPE$ - $HOSTALIAS$ is $SERVICEOUTPUT$"
}
# 'host-notify-by-irc' command definition
define command{
        command_name    host-notify-by-irc
        command_line    $USER1$/irclient.pl "Host $HOSTSTATE$ - $HOSTNAME$!"
}

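The implementation of irclient.pl is described later. For CPU utilization, I found the page http://www.cisco.com/japanese/warp/public/3/jp/service/tac/477/SNMP/collect_cpu_util_snmp-j.shtml and wrote a simple script based on it. The script is called check_busyPer.pl, and it is also described later. The configuration that uses it is shown below. Here it reports WARNING when CPU utilization reaches 20% and CRITICAL when it reaches 80%.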

# 'check_busyPer' command definition
define command{
  command_name    check_busyPer
  command_line    $USER1$/check_busyPer.pl -H $HOSTADDRESS$ -C $ARG1$ -w 20 -c 80
}

localhost.cfg

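Nagios configuration relies on the idea of "inheritance", and it is hard to follow without understanding that. For an explanation, please see the Software Design article that I also used as a reference. First, the settings for who to contact when trouble occurs look like this: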

# '24x7' timeperiod definition
define timeperiod{
  timeperiod_name 24x7
  alias           24 Hours A Day, 7 Days A Week
  sunday          00:00-24:00
  monday          00:00-24:00
  tuesday         00:00-24:00
  wednesday       00:00-24:00
  thursday        00:00-24:00
  friday          00:00-24:00
  saturday        00:00-24:00
}

# 'nagios-admin' contact definition
define contact{
  contact_name                  nagios-admin
  alias                         Nagios Admin
  service_notification_period   24x7
  host_notification_period      24x7
  service_notification_options  w,u,c,r       ; warning, unknown, critical, recovery
  host_notification_options     d,u,r         ; down, unreachable, recovery
  host_notification_commands    host-notify-by-irc
  service_notification_commands notify-by-irc
  email                         error@example.com
}

# 'admins' contactgroup definition
define contactgroup{
  contactgroup_name admins
  alias             Nagios Administrators
  members           nagios-admin
}

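Notifications go to IRC around the clock. Since I had switched to IRC, I assumed the email setting was no longer needed, but it appears to be required. I also include u in service_notification_options and host_notification_options; for services this reports UNKNOWN results, which helps when debugging the scripts (for hosts, u stands for unreachable rather than unknown). Next, the settings for the switches look like this: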

# 'generic-host' definition
define host{
  name                         generic-host
  notifications_enabled        1
  event_handler_enabled        1
  flap_detection_enabled       1
  failure_prediction_enabled   1
  process_perf_data            1
  retain_status_information    1
  retain_nonstatus_information 1
  notification_period          24x7
  register                     0
}
# 'network-element' definition
define host{
  name                  network-element
  use                   generic-host
  check_period          24x7
  max_check_attempts    10               ; Check each host 10 times (max)
  check_command         check-host-alive ; Default command to check hosts
  notification_period   24x7
  notification_interval 120              ; Resend notification every 2 hours
  notification_options  d,u,r
  contact_groups        admins           ; Notifications get sent to the admins by default
  register              0                ; ITS NOT A REAL HOST, JUST A TEMPLATE!
}
# core switches
define host{
  use       network-element
  host_name coresw01
  alias     coresw01
  address   10.0.0.2
}
define host{
  use       network-element
  host_name coresw02
  alias     coresw02
  address   10.0.0.3
}
define hostgroup{
  hostgroup_name core-sw
  alias          Core Switches
  members        coresw01,coresw02
}
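
The `use` chain in the host definitions above (coresw01 uses network-element, which uses generic-host) can be pictured as a recursive merge in which an object's own directives win over inherited ones. The following is only a minimal Perl sketch of that idea, not how Nagios itself is implemented. The `%objects` hash and the `resolve` function are hypothetical names, only a few directives are copied in, and dropping `register` on inheritance reflects my understanding of the Nagios documentation.

```perl
use strict;
use warnings;

# Hypothetical in-memory copy of three host objects from localhost.cfg
# (only a few directives are kept for brevity).
my %objects = (
    'generic-host' => {
        'notifications_enabled' => 1,
        'notification_period'   => '24x7',
        'register'              => 0,
    },
    'network-element' => {
        'use'                => 'generic-host',
        'check_command'      => 'check-host-alive',
        'max_check_attempts' => 10,
        'register'           => 0,
    },
    'coresw01' => {
        'use'     => 'network-element',
        'address' => '10.0.0.2',
    },
);

# Resolve one object: take everything from its template (recursively),
# then let the object's own directives override what it inherited.
sub resolve {
    my ($name) = @_;
    my %own       = %{ $objects{$name} };
    my $parent    = delete $own{'use'};
    my %inherited = defined $parent ? %{ resolve($parent) } : ();
    delete $inherited{'register'};    # assumption: "register" is not inherited
    return { %inherited, %own };
}

my $host = resolve('coresw01');
print "$_ = $host->{$_}\n" for sort keys %$host;
```

In this sketch coresw01 ends up with check_command and max_check_attempts from network-element and notification_period from generic-host, while keeping its own address.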

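I use the original settings almost unchanged. The monitored devices are coresw01 and coresw02, and they are monitored around the clock. Finally, the settings for the monitored services look like this.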

define service{
  name                         generic-service
  active_checks_enabled        1
  passive_checks_enabled       1
  parallelize_check            1
  obsess_over_service          1
  check_freshness              0
  notifications_enabled        1
  event_handler_enabled        1
  flap_detection_enabled       1
  failure_prediction_enabled   1
  process_perf_data            1
  retain_status_information    1
  retain_nonstatus_information 1
  is_volatile                  0
  register                     0
}
define service{
  name                  local-service
  use                   generic-service
  check_period          24x7
  max_check_attempts    4
  normal_check_interval 5
  retry_check_interval  1
  contact_groups        admins
  notification_options  w,u,c,r
  notification_interval 60
  notification_period   24x7
  register              0
}
define service{
  use                 local-service
  host_name           coresw01,coresw02
  service_description PING
  check_command       check_ping!100.0,20%!500.0,60%
}
define service{
  use                 local-service
  host_name           coresw01,coresw02
  service_description CPU
  check_command       check_busyPer!communityString
}

generic-service, local-service, and PING use the original settings. The CPU service runs the check_busyPer command defined in commands.cfg with 'communityString' as its argument. In other words, it is equivalent to running:

$USER1$/check_busyPer.pl -H $HOSTADDRESS$ -C communityString -w 20 -c 80
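
The step from `check_busyPer!communityString` to a full command line can be sketched in a few lines of Perl: split the check_command on `!`, bind the pieces to $ARG1$, $ARG2$, and so on, then substitute every $NAME$ macro. This is an illustration only and not Nagios's actual code. The function name `expand_check_command`, the value used for $USER1$ (a common default plugin directory), and the choice to blank out unknown macros are all assumptions.

```perl
use strict;
use warnings;

# Command template copied from commands.cfg.
my %commands = (
    'check_busyPer' =>
      '$USER1$/check_busyPer.pl -H $HOSTADDRESS$ -C $ARG1$ -w 20 -c 80',
);

# Expand a check_command such as "name!arg1!arg2".
# $macros holds the non-argument macros (USER1, HOSTADDRESS, ...).
sub expand_check_command {
    my ( $check_command, $macros ) = @_;
    my ( $name, @args ) = split /!/, $check_command;
    my $line = $commands{$name};
    die "unknown command: $name\n" unless defined $line;

    # Bind positional arguments to ARG1, ARG2, ...
    my %values = %$macros;
    $values{ 'ARG' . ( $_ + 1 ) } = $args[$_] for 0 .. $#args;

    # Replace each $NAME$ with its value; in this sketch unknown macros
    # simply become empty strings.
    $line =~ s/\$(\w+)\$/defined $values{$1} ? $values{$1} : ''/ge;
    return $line;
}

my $expanded = expand_check_command(
    'check_busyPer!communityString',
    {
        USER1       => '/usr/local/nagios/libexec',    # assumed plugin path
        HOSTADDRESS => '10.0.0.2',                     # coresw01
    },
);
print "$expanded\n";
```

Because host_name lists both coresw01 and coresw02, the same expansion happens once per host with a different $HOSTADDRESS$.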

irclient.pl

irclient.pl is a small client script for sending a message to IRC. How small? This small.

#!/usr/local/bin/perl
use strict;
use warnings;
use POE::Component::IKC::ClientLite;

my $r = POE::Component::IKC::ClientLite::create_ikc_client(
    port => 9999,
    ip   => "localhost",
    name => "cl$$",
    timeout => 5,
) || die "create_ikc\n";

$r->post('notify_irc/update', $ARGV[0]);

Of course, this script cannot reach IRC on its own; it needs a server script. Here it is.

#!/usr/local/bin/perl
use warnings;
use strict;

use POE qw(
  Session
  Component::IRC
  Component::IKC::Server
  Component::IKC::Specifier
);

use Term::ANSIColor qw(:constants);
sub msg (@) { print GREEN, BOLD, " * ", RESET, "@_\n" }
sub err (@) { print RED,   BOLD, " * ", RESET, "@_\n" }

msg 'loading configuration';
require YAML;
my $config = { %{ YAML::LoadFile('config.yaml') || {} }, };

msg 'creating daemon component';
POE::Component::IKC::Server->spawn(
    port => $config->{notify_irc_daemon_port},
    name => 'NotifyIRCBot',
);

msg 'creating irc component';
POE::Component::IRC->spawn( alias => 'bot' )
  or die "Couldn't create IRC POE session: $!";

msg 'creating kernel session';
POE::Session->create(
    inline_states => {
        _start           => \&bot_start,
        _stop            => \&bot_stop,
        irc_001          => \&bot_connected,
        irc_372          => \&bot_motd,
        irc_disconnected => \&bot_reconnect,
        irc_error        => \&bot_reconnect,
        irc_socketerr    => \&bot_reconnect,
        autoping         => \&bot_do_autoping,
        update           => \&update,
        _default         => $ENV{DEBUG} ? \&bot_default : sub { },
    }
);

msg 'starting the kernel';
POE::Kernel->run();
msg 'exiting';
exit 0;

sub bot_default {
    my ( $event, $args ) = @_[ ARG0 .. $#_ ];
    err "unhandled $event";
    err "  - $_" foreach @$args;
    return 0;
}

sub update {
    my ( $kernel, $heap, $msg ) = @_[ KERNEL, HEAP, ARG0 ];
    eval {
        for my $channel ( @{ $config->{notify_irc_server_channels} } )
        {
            $kernel->post( bot => privmsg => $channel, $msg );
        }
        msg sprintf( "%s on %s", $msg, scalar localtime(time) );
    };
    err "update error: $@" if $@;
}

sub bot_start {
    my ( $kernel, $heap ) = @_[ KERNEL, HEAP ];
    msg "starting irc session";
    $kernel->alias_set('notify_irc');
    $kernel->call( IKC => publish => notify_irc => ['update'] );
    $kernel->post( bot => register => 'all' );
    $kernel->post(
        bot => connect => {
            Nick     => $config->{notify_irc_nickname},
            Ircname  => $config->{notify_irc_ircname},
            Username => $ENV{USER},
            Server   => $config->{notify_irc_server_host},
            Port     => $config->{notify_irc_server_port},
        }
    );
}

sub bot_stop {
    msg "stopping bot";
}

sub bot_connected {
    my ( $kernel, $heap ) = @_[ KERNEL, HEAP ];

    for my $channel ( @{ $config->{notify_irc_server_channels} } ) {
        msg "joining channel $channel";
        $kernel->post( bot => join => $channel );
    }
}

sub bot_motd {
    msg '[motd] ' . $_[ARG1];
}

sub bot_do_autoping {
    my ( $kernel, $heap ) = @_[ KERNEL, HEAP ];
    $kernel->post( bot => userhost => $config->{notify_irc_nickname} )
      unless $heap->{seen_traffic};
    $heap->{seen_traffic} = 0;
    $kernel->delay( autoping => 300 );
}

sub bot_reconnect {
    my ( $kernel, $heap ) = @_[ KERNEL, HEAP ];
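    err "reconnect: " . $_[ARG0];
    $kernel->delay( autoping => undef );
    # NOTE: no 'connect' state is registered in inline_states above, so this
    # delayed event is only seen by the _default handler.
    $kernel->delay( connect  => 60 );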
}

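I did not write this from scratch either; it is a lightly modified version of notify_irc.pl from the source of Kwiki::Notify::IRC ("Kwiki::Notify::IRC - announce updates to your Kwiki on IRC channels - metacpan.org"). The config.yaml needed to run this script looks like this.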

---
notify_irc_daemon_host: localhost
notify_irc_daemon_port: 9999
notify_irc_nickname: nagios
notify_irc_ircname: "nagios notifyer"
notify_irc_server_host: irc.foo.co.jp
notify_irc_server_port: 6667
notify_irc_server_channels:
  - "#nagios"

The original script produced a runtime error, and its YAML handling for the list of channels to post to was odd, so I changed those parts.

check_busyPer.pl

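check_busyPer.pl is a Nagios plugin. For the structure of plugins and how to write them, I recommend reading "NagiosのPluginをPerlで+NRPE" (an article on writing Nagios plugins in Perl, plus NRPE). In short, a plugin reports its state to Nagios through the script's exit code (OK, WARNING, CRITICAL, or UNKNOWN) and its message through standard output.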

#! /usr/local/bin/perl -wT

use Getopt::Long qw(:config no_ignore_case);
use Net::SNMP;
use strict;
use warnings;

my $version = 'snmpv2c';
my $host    = 'localhost';
my $commun  = 'public';
my $port    = 161;
my $w_level = 20;
my $c_level = 80;
my $busyPer = '1.3.6.1.4.1.9.2.1.56.0';
my ( $session, $error, $result, $cpu );

GetOptions(
    'H=s' => \$host,
    'C=s' => \$commun,
    'w=i' => \$w_level,
    'c=i' => \$c_level
);

( $session, $error ) = Net::SNMP->session(
    -version   => $version,
    -hostname  => $host,
    -community => $commun,
    -port => $port,
);

if ( !defined($session) ) {
    printf( "CPU UNKNOWN - %s.\n", $error );
    exit 3;
}

$result = $session->get_request( -varbindlist => [$busyPer] );

if ( !defined($result) ) {
    printf( "CPU UNKNOWN - %s.\n", $session->error );
    $session->close;
    exit 3;
}

$cpu = $result->{$busyPer};

if ( $cpu >= $c_level ) {
    printf( "CPU CRITICAL - %d%%\n", $cpu );
    $session->close;
    exit 2;
}
elsif ( $cpu >= $w_level ) {
    printf( "CPU WARNING - %d%%\n", $cpu );
    $session->close;
    exit 1;
}
else {
    printf( "CPU OK - %d%%\n", $cpu );
    $session->close;
    exit 0;
}
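
The threshold logic at the end of the script can be pulled out into a small pure function, which makes it easy to try out without an SNMP agent. This is a minimal sketch under a few assumptions: `classify_cpu` and `%EXIT` are hypothetical names, the exit codes are the ones the script above already uses (0, 1, 2, 3), and treating a missing or non-numeric reading as UNKNOWN is an addition that the original script does not make.

```perl
use strict;
use warnings;

# Exit codes as used by the plugin above.
my %EXIT = ( OK => 0, WARNING => 1, CRITICAL => 2, UNKNOWN => 3 );

# Map a CPU reading to a state name, using the same >= comparisons as
# check_busyPer.pl. A missing or non-numeric reading yields UNKNOWN
# (an addition of this sketch).
sub classify_cpu {
    my ( $cpu, $warn, $crit ) = @_;
    return 'UNKNOWN' unless defined $cpu && $cpu =~ /^\d+$/;
    return 'CRITICAL' if $cpu >= $crit;
    return 'WARNING'  if $cpu >= $warn;
    return 'OK';
}

# Try a few readings against the thresholds from commands.cfg (-w 20 -c 80).
for my $cpu ( 10, 20, 97, undef ) {
    my $state = classify_cpu( $cpu, 20, 80 );
    printf "%-8s exit %d\n", $state, $EXIT{$state};
}
```

Note that a reading exactly equal to a threshold already triggers that state, since the comparisons are inclusive.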

Summary

With this configuration running, messages like these show up on IRC.

nagios: PROBLEM - coresw01 is CPU CRITICAL - 97%
nagios: RECOVERY - coresw01 is CPU OK - 10%

I am glad it works, but 97%, really? There is nothing for it, so I am gradually widening the scope of monitoring to track down the cause. Sigh.