AIX, HACMP, PowerVM: October 2013

AIX - Reboot

Question
Why did AIX shut down and reboot, or shut down and halt?

Answer

Introduction
How does a system go down?
How does a system boot up after going down?
What system logs and commands can I use to investigate an unexpected shut down?
System shut down events

Conclusion

Introduction
Sometimes an AIX Operating System might shut down for no apparent reason, and with no evidence of a user initiating the shut down by running a command. When this happens, the system should be investigated for clues as to what caused the outage. While it is possible for AIX to perform a delayed shut down in response to a shutdown command with the appropriate timing options, AIX will never shut itself down, unless the system crashes, or there is a hardware failure. Even though AIX does not shut itself down automatically, cluster management software such as PowerHA, Oracle RAC, or Veritas Cluster Server might force a system down under certain conditions. If a system goes down unexpectedly, various AIX commands and system logs can be used to find information about the cause. If a node in a cluster goes down unexpectedly, the cluster manager logs should also be reviewed to see if the shut down might have been initiated by the cluster management software. If there is a system dump that was generated during the shut down, it should be analyzed by AIX Software Support to see what the system was doing at the time the dump was captured.

How does a system go down?
An AIX system can be in one of three states: running, halted, or hung. A running system that is fully operational should respond to commands. A halted system is not running AIX, and power may or may not be turned off on the machine. A hung system is running AIX but for some reason is no longer responding to commands. The most common way for a system to go down is for a user, script, or program with root authority to run one of the AIX shut down commands. If the system is an LPAR, a user with hscroot authority can also run a restart or shutdown command on the Hardware Management Console (HMC). A hung system might be powered off or reset by an operator.

Below is a list of events that can cause a running or hung system to shut down and reboot, or shut down and halt:

A user with root authority runs one of the shut down commands on the command line to bring the system down immediately, or after a specified period of time.
A script or program running on the system with root authority executes one of the shut down commands. This could be a script or program started from a cron job or in inittab, or started from the command line. Cluster management software is one example of this type of software.
A user with hscroot authority selects a menu option on the HMC to restart or shut down an LPAR, or uses the command line on the HMC to run a shutdown command. If appropriate options are selected, a system dump will be generated if the dump facility is properly configured.
A user manually resets the system by pressing the reset button on the front console, or by selecting specific functions on a system with a multi-button front panel display. A system dump should be written if the dump facility is properly configured.
The system crashes due to an operating system defect.
The system crashes due to a hardware malfunction.
The system is powered off.
The system loses power and a UPS does not provide power backup.

How does a system boot up after going down?
There are four basic boot types. An older term for booting is Initial Program Load (IPL).

Cold boot: Booting a system that is not powered on by turning on power.
Soft/Warm boot: Booting a running system by performing a shut down and boot in a single operation.
Hard boot: Booting a system by recycling power (abruptly turning power off and back on). This can also be accomplished with a physical reset button, or for an LPAR, a menu command on the HMC. A crashing system is also an example of a hard boot.
Timed boot: Booting a halted system automatically after a specified period of time.

When a system goes down for any reason other than catastrophic hardware failure, it will either automatically reboot, or it will remain in the halted state. If a system is brought down by one of the shut down commands, it will only reboot if an appropriate reboot option is included with the command, or if the command itself specifies that the machine will be rebooted. For example, the -r flag when used with the shutdown command will cause the system to automatically reboot after the shut down. Or the reboot command will shut down a system and then automatically reboot. If a system goes down and then remains in the halted state, the power button must be pressed on a stand-alone system to boot, or if the system is an LPAR, the LPAR must be activated using the HMC.

If a system crashes or if a user resets the system or runs the sysdumpstart command to force a system dump, it will automatically reboot if the autorestart flag is enabled. Use the following command to view the current value of the autorestart flag:

# lsattr -E -l sys0 | grep auto
autorestart true Automatically REBOOT system after a crash

The default value for this flag is true. To change this value, use the following command:

# chdev -l sys0 -a autorestart=value
where value = true or false
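
As a minimal sketch, the check can be wrapped in a small shell function so that a health-check script can warn when a crashed system would stay halted. The function name autorestart_value is hypothetical. It only parses text in the format shown above from standard input, on the assumption that the attribute name is in the first column and its value in the second. On a real system it would be fed from the lsattr command instead of the sample line.

```shell
# Hypothetical helper: print the autorestart value ("true" or "false")
# found in lsattr-style output read from standard input.
# Returns a non-zero status if no autorestart line is present.
autorestart_value() {
    awk '$1 == "autorestart" { print $2; found = 1; exit }
         END { if (!found) exit 1 }'
}

# Demo with the sample line from this document instead of a live lsattr call.
sample='autorestart true Automatically REBOOT system after a crash'
value=$(printf '%s\n' "$sample" | autorestart_value)
if [ "$value" = "true" ]; then
    echo "autorestart is enabled: the system reboots after a crash"
else
    echo "autorestart is disabled: the system stays halted after a crash"
fi
```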

What system logs and commands can I use to investigate an unexpected shut down?
A number of AIX system logs and commands can be used to investigate the cause of an unexpected shut down. Also on HMC managed LPARs, logs are maintained on the HMC that can provide information about shut down and dump related commands that have been executed on the HMC. These logs and commands should be used to help determine the cause of an unexpected system shut down. Because the logs cannot be read until a system has been booted, logs will usually show boot entries just after the shut down entries from the most recent shut down.

Below is a list of the most useful system logs and commands that can be used to investigate an unexpected shut down:
AIX error log (read with the errpt command)
/var/adm/wtmp account file (read with the last command)
/var/adm/pacct account files (read with the lastcomm command)
AIX console log (read with the alog -t console -o command)
su log file (read with cat /var/adm/sulog)
Shell history file (read with the fc command)
/etc/shutdown.log file (read with cat /etc/shutdown.log)
HMC log files (consult HMC documentation)
AIX audit log

The AIX error report
The AIX error log can be read with the errpt command. This log contains many different types of error and informational entries. Some of the most useful entries for providing information about a system shut down and reboot are REBOOT_ID, ERRLOG_ON, ERRLOG_OFF, SYS_RESET, DUMP_STATS, and MINIDUMP_LOG. Some of the entries that will be logged if a system crashes due to a software problem are DSI_PROC, ISI_PROC, and PROGRAM_INT. Some of the entries that might be logged if a system crashes due to a hardware malfunction are SCAN_ERROR_CHRP and SCANOUT.

REBOOT_ID

This entry is written into the error log whenever a system boots, so it is used for both warm boots and cold boots. The Detail Data section in the entry specifies whether the boot was warm, cold, or timed.

---------------------------------------------------------------------------
LABEL: REBOOT_ID
IDENTIFIER: 2BFA76F6

Date/Time: Sun Nov 23 13:45:12 CST 2008
Sequence Number: 199
Machine Id: 0002FBB2D900
Node Id: vegas
Class: S
Type: TEMP
Resource Name: SYSPROC

Description
SYSTEM SHUTDOWN BY USER

Probable Causes
SYSTEM SHUTDOWN

Detail Data
USER ID
0
0=SOFT IPL 1=HALT 2=TIME REBOOT
0
TIME TO REBOOT (FOR TIMED REBOOT ONLY)
0
---------------------------------------------------------------------------

The Date/Time is the time that the entry was logged into the error report at the beginning of the boot process, and so is a good approximation of the time when the system was booted. If the system was warm booted, meaning that a running system was rebooted with one of the reboot commands, this time stamp will be later than the time that the reboot command was actually executed, because of the additional time required for the system to shut down before the reboot. There is no dedicated entry in the error report for a normal system shut down that gives the precise time that the shut down command was executed. However, under normal circumstances the AIX error log is turned off when a system is shutting down, so the ERRLOG_OFF entry, described below, can be used to approximate the time that the shut down was initiated. Also, the last command, described below, can be used to read the shutdown record to find the exact time that a system was shut down.

The boot type is displayed in the Detail Data section.
0=SOFT IPL:
Soft/warm boot, meaning that a running system was shut down and rebooted.
1=HALT:
Cold boot, meaning that the system had previously been shut down and halted, and was then booted from the halted state.
2=TIME REBOOT:
Timed boot, meaning that a halted system was booted automatically after a specified period of time.
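
The decoding above can be sketched as a small shell function. The name reboot_id_boot_type is hypothetical, and the function assumes that one REBOOT_ID entry in the detailed format shown earlier is supplied on standard input, with the boot type code on the line directly after the "0=SOFT IPL 1=HALT 2=TIME REBOOT" legend. The demo uses the sample entry from this document rather than live errpt output.

```shell
# Hypothetical helper: read one REBOOT_ID entry (detailed errpt format)
# on standard input and print the boot type named by the code that
# follows the legend line.
reboot_id_boot_type() {
    awk '
        legend { code = $1; exit }    # the line after the legend holds the code
        /0=SOFT IPL 1=HALT 2=TIME REBOOT/ { legend = 1 }
        END {
            if (code == "0")      print "soft/warm boot"
            else if (code == "1") print "cold boot after halt"
            else if (code == "2") print "timed boot"
            else { print "unknown"; exit 1 }
        }'
}

# Demo with the Detail Data section of the sample entry from this document.
reboot_id_boot_type <<'EOF'
LABEL: REBOOT_ID
Detail Data
USER ID
0
0=SOFT IPL 1=HALT 2=TIME REBOOT
0
TIME TO REBOOT (FOR TIMED REBOOT ONLY)
0
EOF
```

For the sample entry the function reports a soft/warm boot. If several entries are piped in, only the first one is decoded.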

ERRLOG_OFF, ERRLOG_ON

The AIX error logging system is turned off whenever a system is shut down normally using any of the shut down commands, except for sysdumpstart. It is always turned on when the system is booted. So if a system is shut down and rebooted in the normal way, the error log should contain an ERRLOG_OFF that is written when the system is shutting down, and then an ERRLOG_ON that is written after the system boots back up. The time stamp on the ERRLOG_OFF entry will approximate the actual time that the shut down was initiated, and the time stamp on the ERRLOG_ON will approximate the actual time the system was rebooted. If a system shuts down in the normal way, these two entries will normally exist in the error report, one right after the other. But they can also be written one after the other if error logging is manually turned off and then immediately turned back on.

DUMP_STATS

This entry is written into the error log to show that a system dump was attempted. If the dump facility is properly configured, a system dump will be captured when a system crashes, or when a dump is forced by a user. Whenever a dump is written, the system is shut down immediately with no warnings, and running processes are killed and not terminated in an orderly way. The time stamp on this entry is the time that the entry was written into the error log after the system was rebooted, and so is not the time that the system actually went down. A second time stamp in the detail section of the entry reports the time that the system dump was started, and this would be the time when the system crashed, or was reset. Contact AIX Software Support for assistance with analyzing a system dump to determine the reason why the dump was created.
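
Because the second time stamp is the one that matters, a short sketch for pulling it out of a DUMP_STATS entry may be useful. The function name dump_start_time is hypothetical, and it assumes the detailed entry layout used in the samples in this document, where the dump start time is on the line directly after a line containing only the word TIME.

```shell
# Hypothetical helper: print the dump start time from one DUMP_STATS entry
# (detailed errpt format) read on standard input.
# Returns a non-zero status if no TIME heading is found.
dump_start_time() {
    awk '
        want { print; found = 1; exit }       # line after the TIME heading
        NF == 1 && $1 == "TIME" { want = 1 }
        END { if (!found) exit 1 }'
}

# Demo with sample Detail Data taken from this document.
dump_start_time <<'EOF'
LABEL: DUMP_STATS
Detail Data
DUMP DEVICE
/dev/lg_dumplv
DUMP SIZE
39740416
TIME
Sun Dec 7 09:45:21 2008
EOF
```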

SYS_RESET

This entry is written into the error report when a system is manually reset by pressing the reset button or function buttons on the front panel, or by selecting the Restart menu on an HMC. When a system is reset, a system dump will be created if the dump facility is properly configured. The SYS_RESET entry is not written into the error report when the sysdumpstart command is executed.

DSI_PROC, ISI_PROC, PROGRAM_INT

These entries are written into the error report when a system crashes due to some type of defect in the kernel, kernel extensions, or device drivers. If the system dump facility is properly configured, a DUMP_STATS entry should also be written into the error report about the same time as one of these entries. Contact AIX Software Support for assistance with these types of errors.

SCAN_ERROR_CHRP, SCANOUT

These are hardware related entries and might be logged into the error report if the system crashes due to a hardware malfunction, or if there is an unexpected loss of power. Contact IBM Hardware Support for assistance with these types of errors.
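
The way these labels combine can be sketched as a rough classifier. The function name classify_outage and its output wording are hypothetical, and the rules only restate the label combinations described in this document, so the result is a starting point for investigation and not a diagnosis. The function assumes it is given the labels logged around a single outage, one per line. On a real system these could be taken from the LABEL lines of the detailed error report, limited to the period of the outage.

```shell
# Hypothetical helper: read error log labels (one per line) that were
# logged around a single outage and print the most likely shut down type.
classify_outage() {
    awk '
        { seen[$1] = 1 }
        END {
            if (seen["SCAN_ERROR_CHRP"] || seen["SCANOUT"])
                print "hardware crash or loss of power"
            else if (seen["DSI_PROC"] || seen["ISI_PROC"] || seen["PROGRAM_INT"])
                print "software crash"
            else if (seen["SYS_RESET"])
                print "manual reset (front panel or HMC restart)"
            else if (seen["DUMP_STATS"])
                print "forced dump (for example sysdumpstart)"
            else if (seen["REBOOT_ID"] && seen["ERRLOG_OFF"])
                print "orderly shut down command"
            else if (seen["REBOOT_ID"])
                print "quick shut down command (no ERRLOG_OFF)"
            else
                print "undetermined"
        }'
}

# Demo: labels from a normal shutdown and reboot, newest first.
printf 'REBOOT_ID\nERRLOG_ON\nERRLOG_OFF\n' | classify_outage
```

The hardware and crash labels are checked first because a DUMP_STATS entry can accompany any of them, while a DUMP_STATS entry on its own points to a forced dump.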

The /var/adm/wtmp account file
This binary file is used to store various types of login information. One type of information stored in this file is user login records. These records document the user name and time of login. Pseudo user names are used for shutdown and reboot. So when a system is shut down using one of the shut down commands, a record with the user name shutdown will be logged into the wtmp file. Similarly when a system is booted, a record with the user name reboot will be written into the wtmp file. Some shut down commands have flags that can be used to suppress login records in the wtmp file.

Note: Technically a reboot is a warm boot but the pseudo user name reboot is written into the wtmp file for both warm boots and cold boots.

The wtmp file can be read by using the last command. Below is example output from the last command.

# last
root pts/0 sig-9-65-19-99.mts.ibm.com Nov 23 13:50 still logged in.
A root user logged in at 13:50

reboot ~ Nov 23 13:45
The system was booted at 13:45

shutdown pts/0 Nov 23 13:44
The system was shut down at 13:44 on the remote terminal pts/0

shutdown vty0 Nov 23 15:16
The system was shut down at 15:16 on vty0, the virtual console on the HMC

root pts/0 sig-9-65-19-99.mts.ibm.com Nov 23 13:43 - System is halted by system administrator. (00:00)
A root user logged back in at 13:43 and less than a minute later (00:00), or almost immediately, the system was halted.
Note: The system was not necessarily halted on this particular terminal! The actual terminal where the halt was executed will be listed on the shutdown record, as in the example shutdown record above.

root pts/0 sig-9-65-19-99.mts.ibm.com Nov 23 13:29 - 13:43 (00:13)
A root user logged in at 13:29 and logged out at 13:43, for a total login time of about 13 minutes (00:13)
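
On a busy system the boot and shut down records are buried among ordinary logins. The following sketch pairs each reboot record with the next older boot or shutdown record, so that boots with no preceding shutdown record stand out. The function name reboots_without_shutdown and the wording of its output are hypothetical. It reads text in the format of the last command output shown above, newest record first.

```shell
# Hypothetical helper: read "last"-style output (newest first) and report,
# for each reboot record, whether the next older boot/shutdown record
# is a shutdown record.
reboots_without_shutdown() {
    awk '
        $1 == "reboot" || $1 == "shutdown" {
            if (pending != "") {
                # pending holds a newer reboot; this line is the next older record
                print pending " : " ($1 == "shutdown" ? "shutdown record found" : "no shutdown record")
                pending = ""
            }
            if ($1 == "reboot") pending = $0
        }
        END { if (pending != "") print pending " : no older record in file" }'
}

# Demo with records in the format used in this document.
reboots_without_shutdown <<'EOF'
root pts/0 hostname Nov 23 13:50 still logged in.
reboot ~ Nov 23 13:45
shutdown pts/0 Nov 23 13:44
root pts/0 hostname Nov 23 13:29 - 13:43 (00:13)
EOF
```

A boot with no shutdown record is not proof of a crash, because some shut down command flags suppress the record, but it narrows down which outages need a closer look in the error report.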

The /var/adm/pacct account files
The /var/adm/pacct accounting files store information about the commands that have been executed on the system. These files are read with the lastcomm command, which displays information, in reverse chronological order, about all of the previously executed commands that are still recorded in the accounting files. The /usr/sbin/acct/startup command must be executed before the lastcomm command can be used. Accounting started with the startup command does not persist across a reboot, so to keep command recording active at all times, add the startup command to /etc/inittab or some other startup script. The shut down commands themselves, such as shutdown, reboot, and halt, are not recorded. However, other commands that are executed during the shut down process are recorded. When a script is executed, all commands called within the script are logged, so the lastcomm command output can be difficult to read. See the man page for more information about this command.

The AIX console log
The AIX console log is a binary log file that can be read with the following command:

# alog -t console -o

A number of AIX system processes log information into the console log when starting up during the boot process, and when shutting down during the shut down process. Time stamps are written with each entry, so this log can contain valuable information that can be used to investigate an unexpected shut down.

su log file
The su log file is used to log attempts to become a superuser. This log can be useful when trying to track down who might have gained root access to shut down a system. The su log file is located in /var/adm/sulog and has messages that look like this:

# cat /var/adm/sulog
SU 07/08 10:57 + pts/0 root-root
SU 07/11 12:44 + pts/0 root-nobody
SU 07/25 16:37 + pts/5 dcoca-root
SU 09/11 10:21 + pts/1 mrj1-root

Shell history file
If the root account uses a shell that supports a history file, this file can be used to view a history of commands that were executed by a root user. The Korn shell writes its history to the file named in the HISTFILE environment variable ($HOME/.sh_history by default). Of course, it is possible for a user who has gained root access to disable the file temporarily before running commands. The Korn shell history file is read with the fc command.

/etc/shutdown.log
This log file is created or appended to if the -l option is used with the shutdown command. The file contains a time stamp to show the time of the shut down. It also logs the shut down of specific subsystems such as syslogd, the unmounting of file systems, and bringing down network interfaces. Here is example output from this log file:

# cat /etc/shutdown.log

Sun Nov 30 11:45:31 CST 2008
shutdown: THE SYSTEM IS BEING SHUT DOWN NOW

User(s) currently logged in:
root

Stopping some active subsystems...

0513-044 The syslogd Subsystem was requested to stop.
0513-044 The hostmibd Subsystem was requested to stop.
0513-044 The snmpmibd Subsystem was requested to stop.
...

Unmounting the file systems...

/lgfs unmounted successfully.
/download unmounted successfully.
...
umount: 0506-349 Cannot unmount /dev/hd3: The requested resource is busy.

Bringing down network interfaces:

detached en0 from the network interface list
detached lo0 from the network interface list
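
Because this file is appended to on every logged shut down, it can grow long. The sketch below prints only the start time of each shut down and any file systems that could not be unmounted. The function name shutdown_log_summary is hypothetical, and it assumes the layout of the sample above, where the time stamp is the last non-blank line before the "THE SYSTEM IS BEING SHUT DOWN NOW" message. Other message wordings are not handled.

```shell
# Hypothetical helper: summarize shutdown.log-style input by printing the
# time stamp of each shut down and any unmount failures.
shutdown_log_summary() {
    awk '
        /^shutdown: THE SYSTEM IS BEING SHUT DOWN NOW/ { print "shut down started: " prev }
        /Cannot unmount/ { print "unmount problem: " $0 }
        NF { prev = $0 }            # remember the last non-blank line
    '
}

# Demo with an excerpt of the sample log from this document.
shutdown_log_summary <<'EOF'

Sun Nov 30 11:45:31 CST 2008
shutdown: THE SYSTEM IS BEING SHUT DOWN NOW

Unmounting the file systems...

/lgfs unmounted successfully.
umount: 0506-349 Cannot unmount /dev/hd3: The requested resource is busy.
EOF
```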

HMC system logs
The HMC maintains a system log file that records information about commands that have been executed on the HMC. If an LPAR is shut down, restarted, halted, or dumped, the HMC log should contain a record of the command and the time the command was executed, if the system was shut down using the HMC. Consult your HMC documentation for details about how to access this log.

AIX audit log

AIX includes an auditing subsystem that can be used to log information about commands that have been executed on the system. If a system is going down repeatedly due to one of the shut down commands, auditing can be enabled to help provide information about who or what is executing the command. For an overview on the AIX auditing subsystem, see technote T1000212.

System shut down events
The table below contains a list of the most common events that cause a system to shut down and reboot, or shut down and halt. Note that only the most commonly used command options are listed - consult the AIX man pages and HMC manuals for more comprehensive documentation.

Event
Description
shutdown
shutdown -h
shutdown -v

Shuts down a running system with multiple users in an orderly way, and then halts. Notifies users with the wall command of the impending shut down. If this command is used on a system with software control of the power supply, power will be turned off.

Note: All three of these commands shut down the system essentially the same way, and generate identical entries in AIX logs.
Logs
Error Report
LABEL: REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
1
----------
LABEL: ERRLOG_ON
-----------
LABEL: ERRLOG_OFF

wtmp
reboot ~ Nov 23 15:20
shutdown vty0 Nov 23 15:16
root pts/0 hostname Nov 23 15:13 - 15:16 (00:03)
Event
Description
shutdown -r
shutdown -Fr
Shuts down a running system with multiple users in an orderly way and then calls the reboot command to reboot the system. Notifies users with the wall command of the impending shut down, unless the -F flag is used, in which case the system is shut down as quickly as possible with no user notification.

Logs
Error Report
LABEL: REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
0
----------
LABEL: ERRLOG_ON
-----------
LABEL: ERRLOG_OFF

wtmp
reboot ~ Nov 23 16:33
shutdown vty0 Nov 23 16:32
root pts/0 hostname Nov 23 16:28 - 16:32 (00:04)
Event
Description
shutdown -l
The -l option can be used alone or added to other options to create or append to the AIX system log file /etc/shutdown.log. This option can be used to debug problems with the shut down process.

Logs
/etc/shutdown.log
Sun Nov 30 11:45:31 CST 2008
shutdown: THE SYSTEM IS BEING SHUT DOWN NOW
User(s) currently logged in:
root
Stopping some active subsystems...
0513-044 The syslogd Subsystem was requested to stop.
...

Note: The Error Report and wtmp entries are the same as above, depending on options used in addition to the -l option.
Event
Description
reboot
fastboot

Shuts down a running system in an orderly way and then reboots. This command should not be used if other users are logged into the system. Use shutdown -r instead.

Note: fastboot is identical to reboot and is provided for BSD compatibility.
Logs
Error Report
LABEL: REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
0
----------
LABEL: ERRLOG_ON
-----------
LABEL: ERRLOG_OFF

wtmp
reboot ~ Nov 23 13:45
shutdown vty0 Nov 23 13:44
root pts/0 hostname Nov 23 13:43 - System is halted by system administrator. (00:00)
Event
Description
reboot -l

Note: the -n and -q options imply -l
Shuts down a running system in an orderly way and then reboots, but does not log a shutdown record in the /var/adm/wtmp accounting file. This command should not be used if other users are logged into the system. Use shutdown -r instead.

Note: The -l option should normally not be used by a system administrator. It is intended for other commands such as shutdown -r that call the reboot command but log an entry in wtmp themselves.

Logs
Error Report
LABEL: REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
0
----------
LABEL: ERRLOG_ON
-----------
LABEL: ERRLOG_OFF

wtmp
reboot ~ Dec 06 14:29
root pts/0 hostname Dec 06 14:22 - System halted abnormally. (00:06)
Event
Description
reboot -nq

Note: the -l option is implied
Shuts down a running system as quickly as possible and then reboots, but does not log a shutdown record in the /var/adm/wtmp accounting file. Does not call sync to flush file buffers and does not send processes a SIGTERM. Normally this command should not be used by a system administrator.

Note: This command is sometimes used by cluster management software such as Oracle RAC to evict a node as quickly as possible to preserve the integrity of the database.
Logs
Error Report
LABEL: REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
0
----------
LABEL: ERRLOG_ON

wtmp
reboot ~ Dec 05 11:32
root pts/0 hostname Dec 05 11:03 - System halted abnormally. (00:29)

Note: There is no ERRLOG_OFF entry in the error report because the -q option causes the reboot command to shut down immediately without sending processes a SIGTERM to shut them down in an orderly way.
Event
Description
halt
fasthalt
Shuts down a running system in an orderly way and then halts. This command should not be used if other users are logged into the system. Use shutdown -h instead.

Note: fasthalt is identical to halt and is provided for BSD compatibility.

Logs
Error Report
LABEL: REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
1
----------
LABEL: ERRLOG_ON
-----------
LABEL: ERRLOG_OFF

wtmp
reboot ~ Nov 23 14:35
shutdown vty0 Nov 23 14:32
root pts/0 hostname Nov 23 14:30 - System is halted by system administrator. (00:01)
Event
Description
halt -l

Note: the -n and -q options imply -l
The same as a halt with no options, except that a shutdown record will not be logged in the /var/adm/wtmp accounting file.

Note: The -l option should normally not be used by a system administrator. It is intended for other commands such as shutdown -h that call the halt command but log an entry in wtmp themselves.
Logs
Error Report
LABEL: REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
1
----------
LABEL: ERRLOG_ON
-----------
LABEL: ERRLOG_OFF

wtmp
reboot ~ Dec 06 16:08
root pts/0 hostname Dec 06 15:51 - 16:04 (00:12)
Event
Description
halt -q
Shuts down a running system quickly. Does not issue a sync and does not send the terminate signal to running processes. Normally this command should not be used by a system administrator. Use shutdown -h instead.

Note: This command is sometimes used by cluster management software such as PowerHA to quickly bring a node down so that applications can failover to a secondary node.
Logs
Error Report
LABEL: REBOOT_ID
0=SOFT IPL 1=HALT 2=TIME REBOOT
1
----------
LABEL: ERRLOG_ON

wtmp
root pts/0 sig-9-49-130-54.mts.ibm.com Jan 01 09:00 - System halted abnormally. (00:06)
Event
Description
sysdumpstart -p
Immediately stops AIX and initiates a system dump to the primary dump device if the dump facility is properly configured. Afterwards the system will automatically reboot if the autorestart flag is true. Otherwise the system will halt.

Note: This command is sometimes used by cluster management software such as Oracle RAC to evict a node and create a system dump. If a node in an Oracle RAC is evicted with the sysdumpstart command, contact Oracle Support and IBM AIX support for assistance with analyzing the system dump.
Logs
Error Report
LABEL: DUMP_STATS
Description
SYSTEM DUMP
User Causes
SYSTEM DUMP REQUESTED BY USER
Detail Data
DUMP DEVICE
/dev/lg_dumplv
DUMP SIZE
39740416
TIME
Sun Dec 7 09:45:21 2008
...
----------
LABEL: MINIDUMP_LOG
----------
LABEL: ERRLOG_ON

wtmp
reboot ~ Dec 07 09:52
root pts/0 hostname Dec 07 09:44 - System halted abnormally. (00:07)

# sysdumpdev -L
0453-039
Device name: /dev/lg_dumplv
Size: 39740416 bytes
Uncompressed Size: 398315509 bytes
Date/Time: Sun Dec 7 09:45:21 CST 2008
Dump status: 0
dump completed successfully

Note: No REBOOT_ID entry is logged in the error report. Also there is no ERRLOG_OFF entry in the error report because this command shuts down the system immediately without sending processes a SIGTERM to shut them down in an orderly way. No shutdown record is logged in the wtmp file.
Event
Description
Forced reset on system front panel

OR

Restart command with the dump option is executed on an HMC
The system is reset by pressing the reset button on the front panel of the machine. Or if the machine has function buttons, the system is reset by executing one or more functions on the front panel. Initiates a system dump to the primary dump device if the dump facility is properly configured. The system will automatically reboot if the autorestart flag is set to true.

If the system is an LPAR managed by an HMC, the system is reset by running the Restart command with the dump option selected on the HMC.
Logs
Error Report
LABEL: DUMP_STATS
Description
SYSTEM DUMP
User Causes
SYSTEM DUMP REQUESTED BY USER
Detail Data
DUMP DEVICE
/dev/lg_dumplv
DUMP SIZE
56760832
TIME
Wed Dec 3 08:14:41 2008
...
----------
LABEL: MINIDUMP_LOG
----------
LABEL: SYS_RESET
Description
SYSTEM RESET INTERRUPT RECEIVED
----------
LABEL: ERRLOG_ON

wtmp
reboot ~ Dec 03 08:21
root pts/0 hostname Dec 03 08:04 - 08:12 (00:07)

# sysdumpdev -L
0453-039
Device name: /dev/lg_dumplv
Size: 56760832 bytes
Uncompressed Size: 399727524 bytes
Date/Time: Wed Dec 3 08:14:41 CST 2008
Dump status: 0
dump completed successfully

Note: No REBOOT_ID entry is logged in the error report. Also there is no ERRLOG_OFF entry in the error report because a system reset shuts down the system immediately without sending processes a SIGTERM to shut them down in an orderly way. No shutdown record is logged in the wtmp file.
Event
Description
Software system crash
A software system crash or kernel panic is most often caused by some type of problem in the kernel, kernel extensions, or device drivers. If the system dump facility is properly configured, a system dump will be created. The system will automatically reboot if the autorestart flag is set to true.

If a system crashes, contact IBM AIX Support for assistance.

Logs
Error Report
LABEL: DUMP_STATS
Description
SYSTEM DUMP
User Causes
SYSTEM DUMP REQUESTED BY USER
Detail Data
DUMP DEVICE
/dev/hd7
DUMP SIZE
189803520
TIME
Fri Oct 17 02:02:02 2008
...
Note: The DUMP_STATS entry might report that the system dump was requested by user, even though the system crashed.
----------
LABEL: MINIDUMP_LOG
----------
LABEL: PROGRAM_INT
OR
LABEL: DSI_PROC
OR
LABEL: ISI_PROC
----------
LABEL: ERRLOG_ON

wtmp
reboot ~ Oct 17 02:05

# sysdumpdev -L
0453-039
Device name: /dev/hd7
Size: 189803520 bytes
Uncompressed Size: 6548975244 bytes
Date/Time: Fri Oct 17 02:02:02 CST 2008
Dump status: 0
dump completed successfully

Note: No REBOOT_ID entry is logged in the error report. Also there is no ERRLOG_OFF entry in the error report because a system crash shuts down the system immediately without sending processes a SIGTERM to shut them down in an orderly way. No shutdown record is logged in the wtmp file.
Event
Description
Hardware system crash
A hardware system crash is caused by some type of hardware failure. Even if the system dump facility is properly configured, depending on the type of hardware failure, a system dump might not be created. The system will automatically reboot if the autorestart flag is set to true, and if the hardware is operational enough to allow the system to boot.

If a system crashes due to hardware failure, contact IBM Hardware Support for assistance.

Logs
Error Report
LABEL: DUMP_STATS
Description
SYSTEM DUMP
User Causes
SYSTEM DUMP REQUESTED BY USER
Detail Data
DUMP DEVICE
/dev/lg_dumplv
DUMP SIZE
284503126
TIME
Sat Oct 18 04:01:05 2008
...
Note: The DUMP_STATS entry might report that the system dump was requested by user, even though the system crashed.
----------
LABEL: MINIDUMP_LOG
----------
LABEL: SCAN_ERROR_CHRP
OR
LABEL: SCANOUT
OR
possibly other hardware related entries
----------
LABEL: ERRLOG_ON

wtmp
reboot ~ Oct 18 04:05

# sysdumpdev -L
0453-039
Device name: /dev/lg_dumplv
Size: 284503126 bytes
Date/Time: Sat Oct 18 04:01:05 CST 2008
Dump status: 0
dump completed successfully

Note: No REBOOT_ID entry is logged in the error report. Also there is no ERRLOG_OFF entry in the error report because a system crash shuts down the system immediately without sending processes a SIGTERM to shut them down in an orderly way. No shutdown record is logged in the wtmp file.
Event
Description
Loss of power
A power failure occurs and a UPS does not provide backup power. Or, the power button is pressed without first shutting the system down with one of the shut down commands.
Logs
Error Report
LABEL: SCAN_ERROR_CHRP

Note: The reference code in this entry will indicate an unexpected loss of power.
Event
Description
HMC menu commands such as:

Operations:Activate
Operations:Shutdown
Operations:Restart

OR

HMC command line commands

An LPAR can also be shut down and rebooted using commands on the HMC. For example, commands can be executed on the HMC that will shut down and halt an LPAR, shut down and reboot an LPAR, or initiate a system dump on an LPAR. Discussion of these methods is beyond the scope of this document. Consult your HMC documentation for details.
Logs
HMC Console Log
hscroot@pkdahmc5:~> lssvcevents -t console
time=03/12/2008 08:04:34,text=HSCE2174 User hscroot Login from remote host pcp684467pcs.central.sprint with IP address 10.86.10.151 was successful.
time=03/06/2008 13:43:22,text=HSCE2016 User name hscroot Logical Partition dkda0177 with ID 1 of managed system 9133-55A*10C104G has been activated with profile pkda0177.
time=03/06/2008 13:43:21,text=HSCE2245 User name 1: Activating the partition 9133-55A*10C104G succeeded on managed system {2}.

time=03/06/2008 13:42:14,text=HSCE2121 User name hscroot: Immediate shut down executed successfully on partition dkda0177 with ID 1 on the managed system Server-9133-55A-SN10C104G.

time=03/06/2008 13:42:14,text=HSCE2254 User name 1*9133-55A*10C104G: Dump to load source for partition 9133-55A*10C104G succeeded on managed system {2}.

Note: The log output above is presented only as an example. Consult your HMC documentation for details about how to view HMC logs.

Note: Some of the shut down and restart commands on the HMC have options that will cause the HMC to send AIX shut down commands to the LPAR. These options are documented in the HMC interface with the text "Operating System". If an Operating System command is executed on the HMC, the AIX system logs will be the same as if the command had been executed directly with the Unix command line on the LPAR.

Conclusion
If AIX shuts down unexpectedly, there are a number of log files and commands that can be used to investigate the cause. AIX does not shut itself down on its own, but software running on AIX might initiate a shut down. This document shows there are a number of ways for a system to go down, and provides information about system log files and commands that can help to determine the cause. If a node in a cluster shuts down unexpectedly, review the cluster manager logs to see if there are any entries related to the shut down that might provide additional information about the cause. If a system dump was created after the shut down, contact AIX Software Support for assistance with analyzing the system dump.

LPM - Limitations

LPM Limitations:

LPM cannot be performed on a stand-alone LPAR; the partition must be a VIOS client.
The partition must have virtual adapters for both network and storage.
LPM requires PowerVM Enterprise Edition.
The VIOS cannot be migrated.
When migrating between systems, only the active profile is updated for the partition and VIOS.
A partition that is in a crashed or failed state is not capable of being migrated.
A server that is running on battery power is not allowed to be the destination of a migration. A server that is running on battery power may be the source of a migrating partition.
For a migration to be performed, the destination server must have resources (for example, processors and memory) available that are equivalent to the current configuration of the migrating partition. If a reduction or increase of resources is required then a DLPAR operation needs to be performed separate from migration.
LPM is not a replacement for a PowerHA solution or a disaster recovery solution.
The partition data is not encrypted when transferred between MSPs.

AIX - devscan

Thanks to IBM,

Name

devscan

Purpose

Diagnostic tool for Storage Area Networks

Syntax

devscan [ options ]

Description

The devscan tool facilitates the debugging of storage problems by rapidly gathering a great deal of information about the SAN. It then displays the information in an easy-to-understand manner. The information devscan displays is gathered from the SAN itself or the device driver, not from ODM, with exceptions described below in Further details. The data is therefore current and correct.

Devscan scans a set of SCSI adapters, and then issues a set of commands to a set of targets and LUNs on those adapters. In the default case, devscan finds every Fibre Channel, SAS, iSCSI, and VSCSI adapter in the system and traverses each one. It issues SCSI Report LUNs and Inquiry commands to every target and LUN it finds. The set of adapters to be scanned, targets and LUNs to be traversed, and commands to be issued may be controlled with several of the optional flags.

You can run devscan from any AIX host, including VIO clients, or from a VIOS.

In the default case, devscan is unable to change any state on the SAN or on the host, making it safe to run even in production environments. In all cases, devscan is safer to run than cfgmgr, because it cannot change the ODM. Some of the optional commands devscan can use are able to cause a state change on the SAN. Details are provided in the Flags section.

Flags

-t, --types=

Specify which adapter types to scan. Valid subflags are v, s, i, and f, for VSCSI, SAS, iSCSI, and FCP, respectively.

-c, --commands=

Commands may be specified as a level from 0 to 9, defaulting to 3, or as a series of subflags naming specific commands that are desired.

The levels have the following meanings:

0  No commands issued; devscan will only report on the adapters it finds.
1  The special LUN 0 is Started and Report LUNs is issued, but no commands are sent to the other LUNs. The list of LUNs is printed.
3  Normal behavior. Every reported LUN is Started and an Inquiry is sent.
5  Normal behavior, plus PVID checking.
7  Everything except performance testing.
9  Everything.

The available commands are:

l  Report LUNs
i  Inquiry
t  Test Unit Ready
a  ALUA commands (RTPG)
c  Read Capacity
r  Reservation commands (PR In & Read)
p  Performance testing (Read)
v  Check PVID (Read)

Some SCSI commands require others to be done. Specifying a command that requires others will cause the prerequisite commands to be performed as well.

Inquiry -> Start
Test Unit Ready -> Start
RTPG -> Start
Read Capacity -> Start
Read -> Read Capacity

Report LUNs is required for awareness of any LUN besides LUN 0, but it is not a prerequisite of any command. If Report LUNs is not requested, the specified set of commands will be sent only to LUN 0.
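
The prerequisite list above can be read as a chain: each command pulls in the one it depends on, which may in turn pull in another. The sketch below walks that chain for a given command name. The function devscan_prereqs is hypothetical and only encodes the relationships listed in this section; it does not call devscan.

```shell
# Print the prerequisite chain for a devscan SCSI command name,
# following the list in this section. Names must be spelled as shown there.
devscan_prereqs() {
    cmd=$1
    chain=$cmd
    while :; do
        case $cmd in
            "Inquiry"|"Test Unit Ready"|"RTPG"|"Read Capacity") cmd="Start" ;;
            "Read") cmd="Read Capacity" ;;
            *) break ;;   # Start, Report LUNs, or anything else: no prerequisite
        esac
        chain="$chain -> $cmd"
    done
    printf '%s\n' "$chain"
}
```

For example, devscan_prereqs "Read" prints the full chain from Read through Read Capacity to Start.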

Some of the devscan SCSI commands can consume a SCSI Check Condition type Unit Attention. It is possible this Unit Attention was actually generated by another application on the host, including the device driver. In that case, devscan will consume a Unit Attention that the other application needs to know about, potentially putting the host and the target device into inconsistent states. Because of this possibility, command levels above 3 or command subflags a, r, p, v, t, and c require confirmation that the user wishes to proceed, either on the command line or via the -F flag.
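
The confirmation rule above can be expressed as a small check, which may be useful in a wrapper script that decides whether to pass -F. This is a minimal sketch: devscan_needs_confirmation is a hypothetical name, and the function only inspects the value that would follow -c, using the levels and subflag letters given in this section.

```shell
# Return 0 (true) if the given -c value would require confirmation:
# a numeric level above 3, or any of the subflags a, r, p, v, t, c.
devscan_needs_confirmation() {
    arg=$1
    case $arg in
        [0-9])
            # Single digit: treat as a command level
            [ "$arg" -gt 3 ] ;;
        *[arpvtc]*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```

A wrapper could then call, for instance, devscan_needs_confirmation 7 and add -F only after the operator has agreed to proceed.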

-n, --npiv=

NPIV mode. Devscan masquerades as an NPIV client when running on a VIOS using the given WWPN. The WWPN must be specified as a 64-bit hexadecimal number.

--intra_npiv_delay=

Devscan waits at least the specified time after issuing a STARTINITR for an NPIV login before issuing the corresponding STOPINITR.

--inter_npiv_delay=

Devscan waits at least the specified time after issuing a STOPINITR for an NPIV login before issuing the next STARTINITR.

--dev=

Devscan scans only the specified adapter, rather than all adapters. The device name must be either the adapter or protocol driver instance name, and may optionally be preceded by "/dev/".

--iscsitargets=

Devscan by default will traverse /etc/iscsi/targets or /etc/iscsi/targetshw, depending on the iSCSI adapter type. In addition to the default, the user may pass in another file listing iSCSI targets. The file name may be "-", and devscan will read from stdin. The format of the file is a whitespace-delimited list, similar to the format of /etc/iscsi/targets, except the subsequent fields may be omitted and devscan will substitute the default port of 3260, the default name of "iscsi", and default to using no authentication.

target-address [ port [ iscsi-name [ authentication ]]]

--blacklist=
--whitelist=

A file containing a list of descriptors may be passed in to be either white or black listed. The file name may be "-", and devscan will read from stdin. White and black listing may not be used at the same time.

If a LUN does not match any entry on the white list, or does match any entry on the black list, no commands are issued to it, except that Start and Report LUNs will be issued to LUN 0 regardless. This has two primary purposes: to limit the time it takes devscan to run on large SANs, and to limit the number of devices that may be affected if a command level greater than 3 is in use. See -c flag information.

Devices may be specified by name, or by location. To specify by name, simply enter the ODM name of the device, one per line. To enter the location, specify the device type (f, i, v, or s), followed by a "|" delimited list of specifiers appropriate to that type, as follows.

At least one specifier must be provided per entry. More may be provided as desired. Specifiers may be left empty.

--concise

Devscan will output in a machine-parseable format. Every LUN will be displayed on one line in a delimited list. The default delimiter is "|", but another may be passed in using the --delim flag. A header line will be printed describing each field as the first line of output. Error output is suppressed.

--delim=

Specify a string up to 8 characters to use as the delimiter for the --concise flag.
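
Because --concise prints a header line that names each field, a script can look up a column by name rather than by position. The sketch below does this with awk and splits on the delimiter literally, so a multi-character delimiter is handled as plain text. The function name concise_field is hypothetical, and the field names in any real output must be taken from the header devscan actually prints; none are assumed here.

```shell
# Hypothetical helper: print one named column from --concise style output.
# The first input line is treated as the header that names each field.
# Usage: concise_field FIELD_NAME [DELIMITER] < devscan_output
concise_field() {
    awk -v want="$1" -v delim="${2:-|}" '
        # Split s on the literal string d into array out; return field count
        function split_literal(s, out, d,    n, p) {
            split("", out)          # portable way to clear the array
            n = 0
            while ((p = index(s, d)) > 0) {
                out[++n] = substr(s, 1, p - 1)
                s = substr(s, p + length(d))
            }
            out[++n] = s
            return n
        }
        NR == 1 {
            n = split_literal($0, hdr, delim)
            for (i = 1; i <= n; i++) if (hdr[i] == want) col = i
            if (!col) { print "field not found: " want > "/dev/stderr"; exit 1 }
            next
        }
        {
            split_literal($0, f, delim)
            print f[col]
        }
    '
}
```

Looking the column up in the header keeps the script working if the field order differs between devscan versions.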

-v, --verbosity=

Verbosity level, from 0 to 9. Default is 3.

-V, --version

Print version information.

-o, --outputfile=

Devscan writes to filename instead of stdout.

-F, --force

Force flag. See -c flag information.

--timestamps=

Timestamps. Valid subflags are l, t, a, and T, and will cause timestamps to be printed for LUN, target, adapter, and total, respectively.

-?, --usage, --help

Print usage information.

Further details

ODM names and path IDs

ODM names and path IDs are provided for convenience, but they are obtained from the ODM. If the ODM has, for whatever reason, erroneous data, devscan will be misled. The ODM names and path IDs are therefore not guaranteed to be accurate.

Devscan does not construct the unique ID for SAN devices. Devscan attempts to match devices it finds on the SAN with devices in ODM using their location. The fields it uses to do this vary by adapter type, by necessity. In FCP, devscan uses the WWPN and LUN ID. In SAS, devscan matches the SAS ID and LUN ID. In iSCSI, the target name and LUN ID are matched. In VSCSI, only the LUN ID is needed.

PVID checking

If the command level is 5 or greater, or if the -cv flag is passed in, devscan will read the PVID location on every device it encounters and use it to match that device against the ODM, in addition to the device's location. If the device does not have a PVID, then this field is ignored.

Active/Active, Active/Passive and ALUA devices

Active paths appear with no special designation in devscan.

Passive paths can be revealed on most devices by invoking Test Unit Ready with -ct or -c7, and on all devices by issuing a Read with -cv or -c5. Passive paths will return with a failure condition.

Devscan automatically identifies ALUA-capable devices. ALUA state of each path will be ascertained if the ALUA commands are requested with the -c7 or -ca flag. An extra field will be printed for each ALUA-capable path revealing its state.

Usage examples

To run against all SCSI adapters with the default command set (Start, Report LUNs, and Inquiry):
devscan
To run against only the fscsi3 adapter and gather SCSI Status from all attached devices:
devscan -c7 --dev=fscsi3
To determine what the NPIV client using WWPN C0507601A673002A can see through all Fibre Channel adapters on the VIOS (e.g., because the client cannot boot):
devscan -t f -n C0507601A673002A
To run devscan in machine-parseable mode using "::" as the field delimiter:
devscan --concise --delim="::"
To run devscan against only the VSCSI adapters in the system and write the output to /tmp/vscsi_scan_results:
devscan -tv -o /tmp/vscsi_scan_results
To scan only the storage port 5001738000330193:
echo "f|||5001738000330193" | devscan --whitelist=-
To scan only the storage at SCSI ID 0x010400:
echo "f|010400" | devscan --whitelist=-
To scan only for hdisk15:
echo "hdisk15" | devscan --whitelist=-
To scan for all targets except the one with WWNN 5001738000330000:
echo "f||||5001738000330000" | devscan --blacklist=-
To scan for an iSCSI target at 192.168.3.147:
echo "192.168.3.147" | devscan --iscsitargets=-
To check the SCSI status of hdisk71 on all the Fibre adapters in the system and send the output to /tmp/devscan.out:
echo "hdisk71" | devscan --whitelist=- -o /tmp/devscan.out -tf -c7 -F

Output examples

Processing FC device: Adapter driver: fcs4 Protocol driver: fscsi4 Connection type: none Local SCSI ID: 0x000000 Device ID: df1000fe Microcode level: 271102

The connection type of "none" indicates this adapter has never had a link.

Processing FC device: Adapter driver: fcs0 Protocol driver: fscsi0 Connection type: fabric Link State: down Current link speed: 4 Gbps Local SCSI ID: 0x180600 Device ID: 77102224 Microcode level: 0125040024

The link state of "down" indicates this adapter had a link up since the last time it was configured, but does not currently.

Nameserver query succeeded, but indicated no targets are available on the SAN

This means the adapter's link to the switch is good, but no storage is available, typically because the storage has unexpectedly left the SAN or because it was not zoned to this host port.

Processing iSCSI device: Protocol driver: iscsi0 No targets found Elapsed time this adapter: 0.001358 seconds

For non-Fibre Channel devices, there is no name server, so the no-targets condition appears as shown above.

00000000001f7d00 0000000000000000 START failed with errno ECONNREFUSED

Devscan is able to reach this device, so the host is connected to the SAN and the nameserver is reporting it, but devscan is not able to log in to the device. This is an end device problem.

Vendor ID: IBM Device ID: 2107900 Rev: 5.90 NACA: yes PDQ: Not connected PDT: Unknown or no device Dynamic Tracking Enabled TUR SCSI status: Check Condition (sense key: ABORTED_COMMAND; ASCQ: LOGICAL UNIT NOT SUPPORTED) ALUA-capable device Report LUNs failed with errno ENXIO Extended Inquiry failed with errno ETIMEDOUT Test Unit Ready failed with errno EIO

Devscan is successfully talking to this device, so the complete end-to-end connection is working. The SCSI Inquiry even succeeded, but the device is responding to further SCSI commands with errors for some reason. This is an end device problem.

651b00 0000000000000000 201400a0b82697ac 200400a0b82697ac Vendor ID: IBM Device ID: 1815 Rev: 0914 NACA: yes PDQ: Connected PDT: Block (Disc) Name: No ODM match 1 targets found, reporting 0 LUNs, 1 of which responded to SCIOLSTART. Responsive LUNs can exceed reported LUNs when LUN 0 is not reported, or when the target is a single-LUN device. This is not an error.

Devscan is again talking to this device, so the complete end-to-end connection is working, but only LUN 0 is responding. This is generally a LUN mapping or other configuration problem on the end storage device.
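
When devscan output has been saved with -o, the failure signatures discussed in this section can be located quickly with a small filter. The following is a minimal sketch: devscan_triage is a hypothetical helper, the patterns are copied from the sample output in this section, and the hints paraphrase the explanations given with each sample. Real output may contain other conditions that this filter does not recognize.

```shell
# Hypothetical triage helper: report which lines of saved devscan output
# match the failure signatures shown in the output examples.
devscan_triage() {
    awk '
        /Connection type: none/              { print "line " NR ": adapter has never had a link" }
        /Link State: down/                   { print "line " NR ": link was up earlier but is down now" }
        /indicated no targets are available/ { print "line " NR ": link is good but no storage is visible" }
        /failed with errno/                  { print "line " NR ": a command failed, suspect the end device" }
    ' "$@"
}
```

For example, devscan_triage /tmp/devscan.out would scan the file written by the last usage example.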

Limitations

ODM name matching is done using location, not unique ID. See Further details, above.

Due to a limitation of the device driver interface, devscan is unable to issue SCSI commands when using the -n flag. Using -n effectively forces the use of -c1 on all NPIV-capable adapters. Note that non-NPIV adapters (e.g., VSCSI or iSCSI adapters) are not affected and will use the default setting or whatever -c level was explicitly passed in.

Devscan supports multiple flags that can be directed to read from stdin, but only one may do so at a time. For example, the following command is invalid.

devscan --blacklist=- --iscsitargets=-
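
A wrapper script can guard against this mistake before calling devscan. The sketch below counts how many arguments ask for stdin, assuming they are written in the --flag=- form used in the usage examples. The function name devscan_stdin_ok is hypothetical.

```shell
# Return 0 if at most one argument asks devscan to read from stdin
# (a long option whose value is "-", written as --flag=-), 1 otherwise.
devscan_stdin_ok() {
    count=0
    for arg in "$@"; do
        case $arg in
            --*=-) count=$((count + 1)) ;;
        esac
    done
    [ "$count" -le 1 ]
}
```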

Exit status

0 Successful completion

>0 Error status

AIX - lspath

To check for missing or failed paths, run this command:

root@mohisrv:/root# lspath | awk '{print $1,$NF}' |sort |uniq -c
18 Enabled fscsi0
6 Enabled fscsi1
12 Failed fscsi1
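
The summary above counts paths per adapter. To see which disks are affected, the same lspath output can be filtered per device. This is a minimal sketch that assumes the default three-column lspath output (status, device name, parent); the function name paths_not_enabled is hypothetical.

```shell
# List every path whose status is not Enabled as "device parent status".
# Input is default lspath output: status, device name, parent adapter.
paths_not_enabled() {
    awk 'NF >= 3 && $1 != "Enabled" { print $2, $NF, $1 }' "$@" | sort
}
```

Run it as lspath | paths_not_enabled. No output means every path is Enabled.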

If some paths have failed, you can try to re-enable them with this one-liner. It is quick and painless, and it does no harm:

root@mohisrv:/root# lspath|grep Failed | awk '{print "chpath -l "$2" -s enable -p "$3}'|ksh
paths Changed
paths Changed
paths Changed
paths Changed
paths Changed
paths Changed
paths Changed
paths Changed
paths Changed
paths Changed
paths Changed
paths Changed
root@lpar:/root# lspath | awk '{print $1" " $NF}' |sort |uniq -c
18 Enabled fscsi0
18 Enabled fscsi1
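
On a production system it can be more comfortable to review the generated chpath commands before running them. The sketch below prints the commands instead of piping them straight to ksh, and matches the status column exactly rather than grepping the whole line. The function name gen_enable_cmds is hypothetical; the chpath options are the same ones used in the one-liner above.

```shell
# Print (do not run) one chpath command per failed path so the list can
# be reviewed first. Input is default lspath output: status, device, parent.
gen_enable_cmds() {
    awk 'NF >= 3 && $1 == "Failed" {
        printf "chpath -l %s -s enable -p %s\n", $2, $3
    }' "$@"
}
```

Run lspath | gen_enable_cmds to inspect the list, then lspath | gen_enable_cmds | ksh once it looks right.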