AIX - Reboot

Question
Why did AIX shut down and reboot, or shut down and halt?

Answer

Introduction
How does a system go down?
How does a system boot up after going down?
What system logs and commands can I use to investigate an unexpected shut down?
System shut down events

Conclusion

Introduction
Sometimes an AIX Operating System might shut down for no apparent reason, and with no evidence of a user initiating the shut down by running a command. When this happens, the system should be investigated for clues as to what caused the outage. While it is possible for AIX to perform a delayed shut down in response to a shutdown command with the appropriate timing options, AIX will never shut itself down, unless the system crashes, or there is a hardware failure. Even though AIX does not shut itself down automatically, cluster management software such as PowerHA, Oracle RAC, or Veritas Cluster Server might force a system down under certain conditions. If a system goes down unexpectedly, various AIX commands and system logs can be used to find information about the cause. If a node in a cluster goes down unexpectedly, the cluster manager logs should also be reviewed to see if the shut down might have been initiated by the cluster management software. If there is a system dump that was generated during the shut down, it should be analyzed by AIX Software Support to see what the system was doing at the time the dump was captured.

How does a system go down?
An AIX system can be in one of three states: running, halted, or hung. A running system that is fully operational should respond to commands. A halted system is not running AIX, and power may or may not be turned off on the machine. A hung system is running AIX but for some reason is no longer responding to commands. The most common way for a system to go down is for a user, script, or program with root authority to run one of the AIX shut down commands. If the system is an LPAR, a user with hscroot authority can also run a restart or shutdown command on the Hardware Management Console (HMC). A hung system might be powered off or reset by an operator.

Below is a list of events that can cause a running or hung system to shut down and reboot, or shut down and halt:

  • A user with root authority runs one of the shut down commands on the command line to bring the system down immediately, or after a specified period of time.
  • A script or program running on the system with root authority executes one of the shut down commands. This could be a script or program started from a cron job or in inittab, or started from the command line. Cluster management software is one example of this type of software.
  • A user with hscroot authority selects a menu option on the HMC to restart or shut down an LPAR, or uses the command line on the HMC to run a shutdown command. If appropriate options are selected, a system dump will be generated if the dump facility is properly configured.
  • A user manually resets the system by pressing the reset button on the front console, or by selecting specific functions on a system with a multi-button front panel display. A system dump should be written if the dump facility is properly configured.
  • The system crashes due to an operating system defect.
  • The system crashes due to a hardware malfunction.
  • The system is powered off.
  • The system loses power and a UPS does not provide power backup.

How does a system boot up after going down?
There are four basic boot types. An older term for booting is Initial Program Load (IPL).

  • Cold boot: Booting a system that is not powered on by turning on power.
  • Soft/Warm boot: Booting a running system by performing a shut down and boot in a single operation.
  • Hard boot: Booting a system by cycling power (abruptly turning power off and back on). This can also be accomplished with a physical reset button, or for an LPAR, a menu command on the HMC. A crashing system is also an example of a hard boot.
  • Timed boot: Booting a halted system automatically after a specified period of time.

When a system goes down for any reason other than catastrophic hardware failure, it will either automatically reboot, or it will remain in the halted state. If a system is brought down by one of the shut down commands, it will only reboot if an appropriate reboot option is included with the command, or if the command itself specifies that the machine will be rebooted. For example, the -r flag when used with the shutdown command will cause the system to automatically reboot after the shut down. Or the reboot command will shut down a system and then automatically reboot. If a system goes down and then remains in the halted state, the power button must be pressed on a stand-alone system to boot, or if the system is an LPAR, the LPAR must be activated using the HMC.

If a system crashes or if a user resets the system or runs the sysdumpstart command to force a system dump, it will automatically reboot if the autorestart flag is enabled. Use the following command to view the current value of the autorestart flag:

# lsattr -E -l sys0 | grep auto
autorestart    true    Automatically REBOOT system after a crash

The default value for this flag is true. To change this value, use the following command:

# chdev -l sys0 -a autorestart=value
where value = true or false
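
The check above can be wrapped in a small helper. This is only a sketch: the function names autorestart_value and check_autorestart are hypothetical, and the parsing assumes the attribute name is in the first column and its value in the second, as in the sample output shown above.

```shell
# Print the autorestart value found in lsattr-style output read from stdin.
# Assumes column 1 is the attribute name and column 2 is its value.
autorestart_value() {
    awk '$1 == "autorestart" { print $2; found = 1 }
         END { exit(found ? 0 : 1) }'
}

# Explain what the current setting means after a crash, reset, or forced dump.
check_autorestart() {
    value=$(autorestart_value) || { echo "autorestart attribute not found"; return 2; }
    if [ "$value" = "true" ]; then
        echo "autorestart=true: the system reboots itself after a crash or forced dump"
    else
        echo "autorestart=$value: the system stays halted after a crash or forced dump"
    fi
}
```

On an AIX system this might be used as: lsattr -E -l sys0 -a autorestart | check_autorestart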

What system logs and commands can I use to investigate an unexpected shut down?
A number of AIX system logs and commands can be used to investigate the cause of an unexpected shut down. Also on HMC managed LPARs, logs are maintained on the HMC that can provide information about shut down and dump related commands that have been executed on the HMC. These logs and commands should be used to help determine the cause of an unexpected system shut down. Because the logs cannot be read until a system has been booted, logs will usually show boot entries just after the shut down entries from the most recent shut down.

Below is a list of the most useful system logs and commands that can be used to investigate an unexpected shut down:

  • AIX error log (read with the errpt command)
  • /var/adm/wtmp account file (read with the last command)
  • /var/adm/pacct account files (read with the lastcomm command)
  • AIX console log (read with the alog -t console -o command)
  • su log file (read with cat /var/adm/sulog)
  • Shell history file (read with the fc command)
  • /etc/shutdown.log file (read with cat /etc/shutdown.log)
  • HMC log files (consult HMC documentation)
  • AIX audit log
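
Because several of these logs can be overwritten or rotated, it can help to copy them to one place soon after the system comes back up. The following is a minimal sketch, not an official procedure: the function names capture and collect_shutdown_evidence and the default output directory are hypothetical, and each command is run only if it is found on the system.

```shell
#!/bin/sh
# Run one evidence-gathering command if it exists on this host and save its
# output to a file; otherwise report that it was skipped.
# Arguments: output file, then the command and its arguments.
capture() {
    out=$1; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$out" 2>&1 || echo "warning: '$*' exited non-zero" >>"$out"
        echo "saved: $out"
    else
        echo "skipped: $1 not found"
    fi
}

# Gather the logs listed above into one directory.
# The default directory is a hypothetical location; pass another as $1.
collect_shutdown_evidence() {
    dir=${1:-/tmp/shutdown_evidence}
    mkdir -p "$dir" || return 1
    capture "$dir/errpt.txt"   errpt -a
    capture "$dir/last.txt"    last
    capture "$dir/console.txt" alog -t console -o
    for f in /var/adm/sulog /etc/shutdown.log; do
        if [ -r "$f" ]; then
            cp "$f" "$dir/" && echo "saved: $dir/$(basename "$f")"
        else
            echo "skipped: $f not readable"
        fi
    done
}
```

Run collect_shutdown_evidence as root, optionally passing a target directory, and review the saved files with the guidance in the sections that follow.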

The AIX error report
The AIX error log can be read with the errpt command. This log contains many different types of error and informational entries. Some of the most useful entries for providing information about a system shut down and reboot are REBOOT_ID, ERRLOG_ON, ERRLOG_OFF, SYS_RESET, DUMP_STATS, and MINIDUMP_LOG. Some of the entries that will be logged if a system crashes due to a software problem are DSI_PROC, ISI_PROC, and PROGRAM_INT. Some of the entries that might be logged if a system crashes due to a hardware malfunction are SCAN_ERROR_CHRP and SCANOUT.

REBOOT_ID

This entry is written into the error log whenever a system boots, so it is used for both warm boots and cold boots. The Detail Data section in the entry specifies whether the boot was warm, cold, or timed.

---------------------------------------------------------------------------
LABEL:          REBOOT_ID
IDENTIFIER:     2BFA76F6

Date/Time:       Sun Nov 23 13:45:12 CST 2008
Sequence Number: 199
Machine Id:      0002FBB2D900
Node Id:         vegas
Class:           S
Type:            TEMP
Resource Name:   SYSPROC

Description
SYSTEM SHUTDOWN BY USER

Probable Causes
SYSTEM SHUTDOWN

Detail Data
USER ID
           0
0=SOFT IPL 1=HALT 2=TIME REBOOT
           0
TIME TO REBOOT (FOR TIMED REBOOT ONLY)
           0
---------------------------------------------------------------------------

The Date/Time is the time that the entry was logged into the error report at the beginning of the boot process, and so is a good approximation of the time when the system was booted. If the system was warm booted, meaning that a running system was rebooted with one of the reboot commands, this time stamp will be later than the time that the reboot command was actually executed, because of the additional time required for the system to shut down before the reboot. There is no dedicated entry in the error report for a normal system shut down that gives the precise time that the shut down command was executed. However, under normal circumstances the AIX error log is turned off when a system is shutting down, so the ERRLOG_OFF entry, described below, can be used to approximate the time that the shut down was initiated. Also, the last command (described below in the section on the /var/adm/wtmp file) can be used to read the shutdown record to find the time that a system was shut down.

The boot type is displayed in the Detail Data section.
0=SOFT IPL:
Soft/warm boot, meaning that a running system was shut down and rebooted.
1=HALT:
Cold boot, meaning that a system that had previously been halted was booted.
2=TIME REBOOT:
Timed boot, meaning that a halted system was booted automatically after a specified period of time.
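
The boot type can be pulled out of detailed error report output with a short filter. This is a sketch that assumes the entry layout shown in the REBOOT_ID example above (a Date/Time line, and the code on the line after the "0=SOFT IPL 1=HALT 2=TIME REBOOT" line); the function name reboot_id_boot_types is hypothetical.

```shell
# Read detailed error report text on stdin and print one line per REBOOT_ID
# entry: the entry's Date/Time followed by the decoded boot type.
reboot_id_boot_types() {
    awk '
        # Remember the time stamp of the entry currently being read.
        /^Date\/Time:/ { sub(/^Date\/Time:[ \t]*/, ""); when = $0 }
        # The boot type code is on the next non-empty line after this legend.
        /^0=SOFT IPL 1=HALT 2=TIME REBOOT/ { want = 1; next }
        want && NF {
            code = $1
            if      (code == 0) kind = "soft/warm boot"
            else if (code == 1) kind = "cold boot (system had been halted)"
            else if (code == 2) kind = "timed boot"
            else                kind = "unknown code " code
            print when ": " kind
            want = 0
        }
    '
}
```

On an AIX system this might be used as: errpt -a -J REBOOT_ID | reboot_id_boot_types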

ERRLOG_OFF, ERRLOG_ON

The AIX error logging system is turned off whenever a system is shut down normally using any of the shut down commands, except for sysdumpstart. It is always turned on when the system is booted. So if a system is shut down and rebooted in the normal way, the error log should contain an ERRLOG_OFF that is written when the system is shutting down, and then an ERRLOG_ON that is written after the system boots back up. The time stamp on the ERRLOG_OFF entry will approximate the actual time that the shut down was initiated, and the time stamp on the ERRLOG_ON will approximate the actual time the system was rebooted. If a system shuts down in the normal way, these two entries will normally exist in the error report, one right after the other. But they can also be written one after the other if error logging is manually turned off and then immediately turned back on.
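
To see how ERRLOG_OFF, ERRLOG_ON, and the other shut down related entries line up in time, the detailed error report can be reduced to a label and time stamp per entry. This is a sketch under the assumption that each entry has a "LABEL:" line followed by a "Date/Time:" line, as in the REBOOT_ID example above; the function name errpt_timeline and its default label list are hypothetical choices.

```shell
# Read detailed error report text on stdin and print "LABEL<TAB>Date/Time"
# for each entry whose label is in the list given as $1 (space separated).
# Entries are printed in the order they appear in the input.
errpt_timeline() {
    awk -v labels="${1:-REBOOT_ID ERRLOG_ON ERRLOG_OFF SYS_RESET DUMP_STATS}" '
        BEGIN { n = split(labels, l, " "); for (i = 1; i <= n; i++) keep[l[i]] = 1 }
        /^LABEL:/      { label = $2 }
        /^Date\/Time:/ {
            sub(/^Date\/Time:[ \t]*/, "")
            if (label in keep) print label "\t" $0
        }
    '
}
```

On an AIX system this might be used as: errpt -a | errpt_timeline. An ERRLOG_OFF shortly before an ERRLOG_ON and REBOOT_ID suggests an orderly shut down; an ERRLOG_ON with no matching ERRLOG_OFF suggests the system went down abruptly.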

DUMP_STATS

This entry is written into the error log to show that a system dump was attempted. If the dump facility is properly configured, a system dump will be captured when a system crashes, or when a dump is forced by a user. Whenever a dump is written, the system is shut down immediately with no warnings, and running processes are killed and not terminated in an orderly way. The time stamp on this entry is the time that the entry was written into the error log after the system was rebooted, and so is not the time that the system actually went down. A second time stamp in the detail section of the entry reports the time that the system dump was started, and this would be the time when the system crashed, or was reset. Contact AIX Software Support for assistance with analyzing a system dump to determine the reason why the dump was created.
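
Because the useful time stamp is the one in the detail section, it can be convenient to extract it directly. The sketch below assumes the DUMP_STATS layout shown in the examples later in this document, where a line containing only TIME is followed by the dump start time; the function name dump_start_times is hypothetical.

```shell
# Read detailed error report text on stdin and print the dump start time
# recorded in the Detail Data of each DUMP_STATS entry.
dump_start_times() {
    awk '
        # Track whether the current entry is a DUMP_STATS entry.
        /^LABEL:/ { in_dump = ($2 == "DUMP_STATS"); want = 0 }
        # The start time is on the next non-empty line after the TIME label.
        in_dump && /^TIME[ \t]*$/ { want = 1; next }
        want && NF { sub(/^[ \t]+/, ""); print; want = 0 }
    '
}
```

On an AIX system this might be used as: errpt -a -J DUMP_STATS | dump_start_times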

SYS_RESET

This entry is written into the error report when a system is manually reset by pressing the reset button or function buttons on the front panel, or by selecting the Restart menu on an HMC. When a system is reset, a system dump will be created if the dump facility is properly configured. The SYS_RESET entry is not written into the error report when the sysdumpstart command is executed.

DSI_PROC, ISI_PROC, PROGRAM_INT

These entries are written into the error report when a system crashes due to some type of defect in the kernel, kernel extensions, or device drivers. If the system dump facility is properly configured, a DUMP_STATS entry should also be written into the error report about the same time as one of these entries. Contact AIX Software Support for assistance with these types of errors.

SCAN_ERROR_CHRP, SCANOUT

These are hardware related entries and might be logged into the error report if the system crashes due to a hardware malfunction, or if there is an unexpected loss of power. Contact IBM Hardware Support for assistance with these types of errors.

The /var/adm/wtmp account file
This binary file is used to store various types of login information. One type of information stored in this file is user login records. These records document the user name and time of login. Pseudo user names are used for shutdown and reboot. So when a system is shut down using one of the shut down commands, a record with the user name shutdown will be logged into the wtmp file. Similarly when a system is booted, a record with the user name reboot will be written into the wtmp file. Some shut down commands have flags that can be used to suppress login records in the wtmp file.

Note: Technically a reboot is a warm boot but the pseudo user name reboot is written into the wtmp file for both warm boots and cold boots.

The wtmp file can be read by using the last command. Below is example output from the last command.

# last
root      pts/0        sig-9-65-19-99.mts.ibm.com       Nov 23 13:50   still logged in.
A root user logged in at 13:50

reboot    ~                  Nov 23 13:45
The system was booted at 13:45

shutdown  pts/0              Nov 23 13:44
The system was shut down at 13:44 on the remote terminal pts/0

shutdown  vty0               Nov 23 15:16
The system was shut down at 15:16 on vty0, the virtual console on the HMC

root      pts/0        sig-9-65-19-99.mts.ibm.com       Nov 23 13:43 - System is halted by system administrator.   (00:00)
A root user logged back in at 13:43 and less than a minute later (00:00), or almost immediately, the system was halted.
Note: The system was not necessarily halted on this particular terminal! The actual terminal where the halt was executed will be listed on the shutdown record, as in the example shutdown record above.

root      pts/0        sig-9-65-19-99.mts.ibm.com       Nov 23 13:29 - 13:43  (00:13)
A root user logged in at 13:29 and logged out at 13:43, for a total login time of about 13 minutes (00:13)
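
On a busy system the boot and shut down records are easy to miss among ordinary logins. The filter below is a small sketch that keeps only the reboot and shutdown pseudo user records, plus login records that ended because the system halted; it assumes the output format shown in the example above, and the function name boot_and_shutdown_records is hypothetical.

```shell
# Read `last` output on stdin and keep only the records that relate to a
# boot or shut down: the reboot and shutdown pseudo users, and sessions
# that ended because the system was halted.
boot_and_shutdown_records() {
    awk '$1 == "reboot" || $1 == "shutdown" ||
         /System halted abnormally/ ||
         /System is halted by system administrator/'
}
```

On an AIX system this might be used as: last | boot_and_shutdown_records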

The /var/adm/pacct account files
The /var/adm/pacct process accounting files store information about the last commands that have been executed on the system. These files are read with the lastcomm command. This command displays information, in reverse chronological order, about all of the previously executed commands that are still recorded in the accounting files. The /usr/sbin/acct/startup command must be executed before the lastcomm command can be used. The startup command does not persist across a reboot, so to keep command recording active at all times, add the startup command to /etc/inittab or some other startup script. The shut down commands themselves, such as shutdown, reboot, and halt, are not recorded. However, other commands that are executed during the shut down process are recorded. When a script is executed, all commands called within the script are logged, so the lastcomm command output can be difficult to read. See the man page for more information about this command.

The AIX console log
The AIX console log is a binary log file that can be read with the following command:

# alog -t console -o

A number of AIX system processes log information into the console log when starting up during the boot process, and when shutting down during the shut down process. Time stamps are written with each entry, so this log can contain valuable information that can be used to investigate an unexpected shut down.

su log file
The su log file is used to log attempts to become a superuser. This log can be useful when trying to track down who might have gained root access to shut down a system. The su log file is located in /var/adm/sulog and has messages that look like this:

# cat /var/adm/sulog
SU 07/08 10:57 + pts/0 root-root
SU 07/11 12:44 + pts/0 root-nobody
SU 07/25 16:37 + pts/5 dcoca-root
SU 09/11 10:21 + pts/1 mrj1-root
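
To focus on who became root, the log can be filtered on its last field. This is a sketch that assumes the line format shown above (SU, date, time, a + or - result flag, terminal, and a from-to user pair); the function name su_to_root is hypothetical.

```shell
# Read sulog lines on stdin and print only attempts to become root, as:
#   date time result from-user on terminal
# The fourth field is + for a successful attempt and - for a failed one.
su_to_root() {
    awk '$1 == "SU" && $6 ~ /-root$/ {
        from = $6
        sub(/-root$/, "", from)      # strip the target user, keep the source
        printf "%s %s %s %s on %s\n", $2, $3, ($4 == "+" ? "success" : "FAILED"), from, $5
    }'
}
```

On an AIX system this might be used as: su_to_root < /var/adm/sulog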

Shell history file
If the root account uses a shell that supports a history file, this file can be used to view a history of commands that were executed by a root user. The Korn shell will write the history file to the file named in the HISTFILE environment variable ($HOME/.sh_history by default). Of course it is possible for a user who has gained root access to disable the file temporarily before running commands. The Korn shell history file is read with the fc command.

/etc/shutdown.log
This log file is created or appended to if the -l option is used with the shutdown command. The file contains a time stamp to show the time of the shut down. It also logs the shut down of specific subsystems such as syslogd, the unmounting of file systems, and bringing down network interfaces. Here is example output from this log file:

# cat /etc/shutdown.log

Sun Nov 30 11:45:31 CST 2008
shutdown:  THE SYSTEM IS BEING SHUT DOWN NOW

User(s) currently logged in:
 root

Stopping some active subsystems...

0513-044 The syslogd Subsystem was requested to stop.
0513-044 The hostmibd Subsystem was requested to stop.
0513-044 The snmpmibd Subsystem was requested to stop.
...

Unmounting the file systems...

/lgfs unmounted successfully.
/download unmounted successfully.
...
umount: 0506-349 Cannot unmount /dev/hd3: The requested resource is busy.

Bringing down network interfaces:

detached en0 from the network interface list
detached lo0 from the network interface list
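
Since the file is appended to on every shutdown -l, it can hold records of many shut downs. The sketch below lists just the time stamp of each one, assuming the layout shown above where the time stamp is the last non-empty line before the "shutdown:" banner; the function name shutdown_log_times is hypothetical.

```shell
# Read /etc/shutdown.log text on stdin and print the time stamp line that
# precedes each "shutdown: ... SHUT DOWN" banner.
shutdown_log_times() {
    awk '
        /^shutdown:.*SHUT DOWN/ { if (prev != "") print prev }
        NF { prev = $0 }     # remember the most recent non-empty line
    '
}
```

On an AIX system this might be used as: shutdown_log_times < /etc/shutdown.log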

HMC system logs
The HMC maintains a system log file that records information about commands that have been executed on the HMC. If an LPAR is shut down, restarted, halted, or dumped, the HMC log should contain a record of the command and the time the command was executed, if the system was shut down using the HMC. Consult your HMC documentation for details about how to access this log.

AIX audit log

AIX includes an auditing subsystem that can be used to log information about commands that have been executed on the system. If a system is going down repeatedly due to one of the shut down commands, auditing can be enabled to help provide information about who or what is executing the command. For an overview on the AIX auditing subsystem, see technote T1000212.

System shut down events
The table below contains a list of the most common events that cause a system to shut down and reboot, or shut down and halt. Note that only the most commonly used command options are listed - consult the AIX man pages and HMC manuals for more comprehensive documentation.

Event: shutdown, shutdown -h, shutdown -v

Description: Shuts down a running system with multiple users in an orderly way, and then halts. Notifies users with the wall command of the impending shut down. If this command is used on a system with software control of the power supply, power will be turned off.

Note: All three of these commands shut down the system essentially the same way, and generate identical entries in AIX logs.

Logs
Error Report
LABEL:          REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
            1
----------
LABEL:          ERRLOG_ON
-----------
LABEL:          ERRLOG_OFF

wtmp
reboot    ~     Nov 23 15:20
shutdown  vty0  Nov 23 15:16
root      pts/0 hostname   Nov 23 15:13 - 15:16  (00:03)

Event: shutdown -r, shutdown -Fr

Description: Shuts down a running system with multiple users in an orderly way and then calls the reboot command to reboot the system. Notifies users with the wall command of the impending shut down, unless the -F flag is used, in which case the system is shut down as quickly as possible with no user notification.

Logs
Error Report
LABEL:          REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
            0
----------
LABEL:          ERRLOG_ON
-----------
LABEL:          ERRLOG_OFF

wtmp
reboot    ~     Nov 23 16:33
shutdown  vty0  Nov 23 16:32
root      pts/0 hostname   Nov 23 16:28 - 16:32  (00:04)

Event: shutdown -l

Description: The -l option can be used alone or added to other options to create or append to the AIX system log file /etc/shutdown.log. This option can be used to debug problems with the shut down process.

Logs
/etc/shutdown.log
Sun Nov 30 11:45:31 CST 2008
shutdown:  THE SYSTEM IS BEING SHUT DOWN NOW
User(s) currently logged in:
 root
Stopping some active subsystems...
0513-044 The syslogd Subsystem was requested to stop.
...

Note: The Error Report and wtmp entries are the same as above, depending on options used in addition to the -l option.

Event: reboot, fastboot

Description: Shuts down a running system in an orderly way and then reboots. This command should not be used if other users are logged into the system. Use shutdown -r instead.

Note: fastboot is identical to reboot and is provided for BSD compatibility.

Logs
Error Report
LABEL:          REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
            0
----------
LABEL:          ERRLOG_ON
-----------
LABEL:          ERRLOG_OFF

wtmp
reboot    ~     Nov 23 13:45
shutdown  vty0  Nov 23 13:44
root      pts/0 hostname   Nov 23 13:43 - System is halted by system administrator.   (00:00)

Event: reboot -l (the -n and -q options imply -l)

Description: Shuts down a running system in an orderly way and then reboots, but does not log a shutdown record in the /var/adm/wtmp accounting file. This command should not be used if other users are logged into the system. Use shutdown -r instead.

Note: The -l option should normally not be used by a system administrator. It is intended for other commands such as shutdown -r that call the reboot command but log an entry in wtmp themselves.

Logs
Error Report
LABEL:          REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
            0
----------
LABEL:          ERRLOG_ON
-----------
LABEL:          ERRLOG_OFF

wtmp
reboot    ~     Dec 06 14:29
root      pts/0 hostname   Dec 06 14:22 - System halted abnormally.   (00:06)

Event: reboot -nq (the -l option is implied)

Description: Shuts down a running system as quickly as possible and then reboots, but does not log a shutdown record in the /var/adm/wtmp accounting file. Does not call sync to flush file buffers and does not send processes a SIGTERM. Normally this command should not be used by a system administrator.

Note: This command is sometimes used by cluster management software such as Oracle RAC to evict a node as quickly as possible to preserve the integrity of the database.

Logs
Error Report
LABEL:          REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
            0
----------
LABEL:          ERRLOG_ON

wtmp
reboot    ~     Dec 05 11:32
root      pts/0 hostname   Dec 05 11:03 - System halted abnormally.   (00:29)

Note: There is no ERRLOG_OFF entry in the error report because the -q option causes the reboot command to shut down immediately without sending processes a SIGTERM to shut them down in an orderly way.

Event: halt, fasthalt

Description: Shuts down a running system in an orderly way and then halts. This command should not be used if other users are logged into the system. Use shutdown -h instead.

Note: fasthalt is identical to halt and is provided for BSD compatibility.

Logs
Error Report
LABEL:          REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
            1
----------
LABEL:          ERRLOG_ON
-----------
LABEL:          ERRLOG_OFF

wtmp
reboot    ~     Nov 23 14:35
shutdown  vty0  Nov 23 14:32
root      pts/0 hostname   Nov 23 14:30 - System is halted by system administrator.   (00:01)

Event: halt -l (the -n and -q options imply -l)

Description: The same as a halt with no options, except that a shutdown record will not be logged in the /var/adm/wtmp accounting file.

Note: The -l option should normally not be used by a system administrator. It is intended for other commands such as shutdown -h that call the halt command but log an entry in wtmp themselves.

Logs
Error Report
LABEL:          REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
            1
----------
LABEL:          ERRLOG_ON
-----------
LABEL:          ERRLOG_OFF

wtmp
reboot    ~     Dec 06 16:08
root      pts/0 hostname   Dec 06 15:51 - 16:04  (00:12)

Event: halt -q

Description: Shuts down a running system quickly. Does not issue a sync and does not send the terminate signal to running processes. Normally this command should not be used by a system administrator. Use shutdown -h instead.

Note: This command is sometimes used by cluster management software such as PowerHA to quickly bring a node down so that applications can fail over to a secondary node.

Logs
Error Report
LABEL:          REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
            1
----------
LABEL:          ERRLOG_ON

wtmp
root      pts/0        sig-9-49-130-54.mts.ibm.com       Jan 01 09:00 - System halted abnormally.   (00:06)

Event: sysdumpstart -p

Description: Immediately stops AIX and initiates a system dump to the primary dump device if the dump facility is properly configured. Afterwards the system will automatically reboot if the autorestart flag is true. Otherwise the system will halt.

Note: This command is sometimes used by cluster management software such as Oracle RAC to evict a node and create a system dump. If a node in an Oracle RAC cluster is evicted with the sysdumpstart command, contact Oracle Support and IBM AIX Support for assistance with analyzing the system dump.

Logs
Error Report
LABEL:          DUMP_STATS
Description
SYSTEM DUMP
User Causes
SYSTEM DUMP REQUESTED BY USER
Detail Data
DUMP DEVICE
/dev/lg_dumplv
DUMP SIZE
              39740416
TIME
Sun Dec  7 09:45:21 2008
...
----------
LABEL:          MINIDUMP_LOG
----------
LABEL:          ERRLOG_ON

wtmp
reboot    ~     Dec 07 09:52
root      pts/0 hostname   Dec 07 09:44 - System halted abnormally.   (00:07)

# sysdumpdev -L
0453-039
Device name:         /dev/lg_dumplv
Size:                39740416 bytes
Uncompressed Size:   398315509 bytes
Date/Time:           Sun Dec  7 09:45:21 CST 2008
Dump status:         0
dump completed successfully

Note: No REBOOT_ID entry is logged in the error report. Also there is no ERRLOG_OFF entry in the error report because this command shuts down the system immediately without sending processes a SIGTERM to shut them down in an orderly way. No shutdown record is logged in the wtmp file.

Event: Forced reset on system front panel, or Restart command with the dump option executed on an HMC

Description: The system is reset by pressing the reset button on the front panel of the machine. Or if the machine has function buttons, the system is reset by executing one or more functions on the front panel. Initiates a system dump to the primary dump device if the dump facility is properly configured. The system will automatically reboot if the autorestart flag is set to true.

If the system is an LPAR managed by an HMC, the system is reset by running the Restart command with the dump option selected on the HMC.

Logs
Error Report
LABEL:          DUMP_STATS
Description
SYSTEM DUMP
User Causes
SYSTEM DUMP REQUESTED BY USER
Detail Data
DUMP DEVICE
/dev/lg_dumplv
DUMP SIZE
              56760832
TIME
Wed Dec  3 08:14:41 2008
...
----------
LABEL:          MINIDUMP_LOG
----------
LABEL:          SYS_RESET
Description
SYSTEM RESET INTERRUPT RECEIVED
----------
LABEL:          ERRLOG_ON

wtmp
reboot    ~     Dec 03 08:21
root      pts/0 hostname   Dec 03 08:04 - 08:12  (00:07)

# sysdumpdev -L
0453-039
Device name:         /dev/lg_dumplv
Size:                56760832 bytes
Uncompressed Size:   399727524 bytes
Date/Time:           Wed Dec  3 08:14:41 CST 2008
Dump status:         0
dump completed successfully

Note: No REBOOT_ID entry is logged in the error report. Also there is no ERRLOG_OFF entry in the error report because a system reset shuts down the system immediately without sending processes a SIGTERM to shut them down in an orderly way. No shutdown record is logged in the wtmp file.

Event: Software system crash

Description: A software system crash or kernel panic is most often caused by some type of problem in the kernel, kernel extensions, or device drivers. If the system dump facility is properly configured, a system dump will be created. The system will automatically reboot if the autorestart flag is set to true.

If a system crashes, contact IBM AIX Support for assistance.

Logs
Error Report
LABEL:          DUMP_STATS
Description
SYSTEM DUMP
User Causes
SYSTEM DUMP REQUESTED BY USER
Detail Data
DUMP DEVICE
/dev/hd7
DUMP SIZE
              189803520
TIME
Fri Oct 17 02:02:02 2008
...
Note: The DUMP_STATS entry might report that the system dump was requested by user, even though the system crashed.
----------
LABEL:          MINIDUMP_LOG
----------
LABEL:          PROGRAM_INT
OR
LABEL:          DSI_PROC
OR
LABEL:          ISI_PROC
----------
LABEL:          ERRLOG_ON

wtmp
reboot    ~     Oct 17 02:05

# sysdumpdev -L
0453-039
Device name:         /dev/hd7
Size:                189803520 bytes
Uncompressed Size:   6548975244 bytes
Date/Time:           Fri Oct 17 02:02:02 CST 2008
Dump status:         0
dump completed successfully

Note: No REBOOT_ID entry is logged in the error report. Also there is no ERRLOG_OFF entry in the error report because a system crash shuts down the system immediately without sending processes a SIGTERM to shut them down in an orderly way. No shutdown record is logged in the wtmp file.

Event: Hardware system crash

Description: A hardware system crash is caused by some type of hardware failure. Even if the system dump facility is properly configured, depending on the type of hardware failure, a system dump might not be created. The system will automatically reboot if the autorestart flag is set to true, and if the hardware is operational enough to allow the system to boot.

If a system crashes due to hardware failure, contact IBM Hardware Support for assistance.

Logs
Error Report
LABEL:          DUMP_STATS
Description
SYSTEM DUMP
User Causes
SYSTEM DUMP REQUESTED BY USER
Detail Data
DUMP DEVICE
/dev/lg_dumplv
DUMP SIZE
              284503126
TIME
Sat Oct 18 04:01:05 2008
...
Note: The DUMP_STATS entry might report that the system dump was requested by user, even though the system crashed.
----------
LABEL:          MINIDUMP_LOG
----------
LABEL:          SCAN_ERROR_CHRP
OR
LABEL:          SCANOUT
OR
possibly other hardware related entries
----------
LABEL:          ERRLOG_ON

wtmp
reboot    ~     Oct 18 04:05

# sysdumpdev -L
0453-039
Device name:         /dev/lg_dumplv
Size:                284503126 bytes
Date/Time:           Sat Oct 18 04:01:05 CST 2008
Dump status:         0
dump completed successfully

Note: No REBOOT_ID entry is logged in the error report. Also there is no ERRLOG_OFF entry in the error report because a system crash shuts down the system immediately without sending processes a SIGTERM to shut them down in an orderly way. No shutdown record is logged in the wtmp file.

Event: Loss of power

Description: A power failure occurs and a UPS does not provide backup power. Or, the power button is pressed without first shutting the system down with one of the shut down commands.

Logs
Error Report
LABEL:          SCAN_ERROR_CHRP

Note: The reference code in this entry will indicate an unexpected loss of power.

Event: HMC menu commands such as Operations:Activate, Operations:Shutdown, and Operations:Restart, or HMC command line commands

Description: An LPAR can also be shut down and rebooted using commands on the HMC. For example, commands can be executed on the HMC that will shut down and halt an LPAR, shut down and reboot an LPAR, or initiate a system dump on an LPAR. Discussion of these methods is beyond the scope of this document. Consult your HMC documentation for details.

Logs
HMC Console Log
hscroot@pkdahmc5:~> lssvcevents -t console
time=03/12/2008 08:04:34,text=HSCE2174 User hscroot Login from remote host pcp684467pcs.central.sprint with IP address 10.86.10.151 was successful.
time=03/06/2008 13:43:22,text=HSCE2016 User name hscroot Logical Partition dkda0177 with ID 1 of managed system 9133-55A*10C104G has been activated with profile pkda0177.
time=03/06/2008 13:43:21,text=HSCE2245 User name 1: Activating the partition 9133-55A*10C104G succeeded on managed system {2}.

time=03/06/2008 13:42:14,text=HSCE2121 User name hscroot: Immediate shut down executed successfully on partition dkda0177 with ID 1 on the managed system Server-9133-55A-SN10C104G.

time=03/06/2008 13:42:14,text=HSCE2254 User name 1*9133-55A*10C104G: Dump to load source for partition 9133-55A*10C104G succeeded on managed system {2}.

Note: The log output above is presented only as an example. Consult your HMC documentation for details about how to view HMC logs.

Note: Some of the shut down and restart commands on the HMC have options that will cause the HMC to send AIX shut down commands to the LPAR. These options are documented in the HMC interface with the text "Operating System". If an Operating System command is executed on the HMC, the AIX system logs will be the same as if the command had been executed directly with the Unix command line on the LPAR.
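
The error report signatures listed for the AIX events above can be condensed into a rough first-pass classifier. This is only a sketch of the patterns described in this document, not a diagnostic tool: the function names has_label and classify_shutdown are hypothetical, and the caller is assumed to pass only the labels logged around the outage being investigated (for example, copied by hand from errpt output), not labels from earlier shut downs.

```shell
# Return success when label $1 appears in the space-separated list $2.
has_label() {
    case " $2 " in *" $1 "*) return 0 ;; esac
    return 1
}

# Suggest the kind of shut down from the error report labels seen around it.
# Arguments: the labels, one per argument.
classify_shutdown() {
    seen="$*"
    if has_label DUMP_STATS "$seen"; then
        if has_label SYS_RESET "$seen"; then
            echo "manual reset (front panel or HMC restart with dump)"
        elif has_label DSI_PROC "$seen" || has_label ISI_PROC "$seen" ||
             has_label PROGRAM_INT "$seen"; then
            echo "software crash"
        elif has_label SCAN_ERROR_CHRP "$seen" || has_label SCANOUT "$seen"; then
            echo "hardware crash"
        else
            echo "forced dump (for example sysdumpstart)"
        fi
    elif has_label REBOOT_ID "$seen"; then
        if has_label ERRLOG_OFF "$seen"; then
            echo "orderly shutdown, halt, or reboot command"
        else
            echo "quick shut down without process termination (for example reboot -nq or halt -q)"
        fi
    elif has_label SCAN_ERROR_CHRP "$seen"; then
        echo "possible loss of power or hardware fault"
    else
        echo "no matching signature"
    fi
}
```

For example, classify_shutdown ERRLOG_ON REBOOT_ID suggests a quick shut down, because no ERRLOG_OFF entry accompanied the boot. Treat the result as a starting point and confirm it against the wtmp records and, in a cluster, the cluster manager logs.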

Conclusion
If AIX shuts down unexpectedly, there are a number of log files and commands that can be used to investigate the cause. AIX does not shut itself down on its own, but software running on AIX might initiate a shut down. This document shows there are a number of ways for a system to go down, and provides information about system log files and commands that can help to determine the cause. If a node in a cluster shuts down unexpectedly, review the cluster manager logs to see if there are any entries related to the shut down that might provide additional information about the cause. If a system dump was created after the shut down, contact AIX Software Support for assistance with analyzing the system dump.
