Sysdump Device

Dump - Core

AIX generates a system dump when a severe error occurs. A system dump is a snapshot of the system's memory contents. If the AIX kernel crashes, kernel data is written to the primary dump device, and the system must then be rebooted. During the next boot, the dump is copied into a dump directory (the default is /var/adm/ras). The dump file name is vmcore.x, where x is a sequence number (e.g. vmcore.0).

When installing the operating system, the dump device is automatically configured. By default, the primary device is /dev/hd6, which is a paging logical volume, and the secondary device is /dev/sysdumpnull.

As a rule of thumb, a dump is about 1/4 the size of real memory. The command "sysdumpdev -e" provides an estimate of the dump space needed for your machine. (The estimate can vary, and tends to be higher under heavy load, because more kernel space is in use at that time.)

When a system dump is taken, the dump image is not written to disk in mirrored form. A dump to a mirrored LV therefore results in an inconsistent dump and should be avoided. The reasoning is that if the mirroring code itself caused the crash, it cannot be trusted to handle the mirrored write either. Mirroring a dump device is thus a waste of resources and is not recommended.

Since the default dump device is the primary paging lv, you should create a separate dump lv if you mirror your paging lv (which is suggested). If a valid secondary dump device exists and the primary dump device cannot be reached, the dump is written to the secondary dump device instead.

IBM recommendation:
IBM recommends forcing a dump the next time the problem occurs. This makes it possible to check which process was hanging or what caused the system to stop responding. This can be done via the HMC using the following steps:
   Operations -> Restart -> Dump
As a general recommendation, always force a dump when a system is hanging. There are only very few cases in which the reason for a hanging system can be determined without a dump available for analysis.
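
If you prefer the HMC command line over the GUI path above, the same dump-and-restart can be requested with chsysstate. This is only a sketch: the managed system and partition names are placeholders, and the option should be verified on your HMC level.

    chsysstate -m <managed_system> -r lpar -n <lpar_name> -o dumprestart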



Traditional vs Firmware-assisted dump:

Surprised !!! 

Yes, there are two types of dumps.

Up to POWER5, only traditional dumps were available; POWER6 processor-based systems introduced firmware-assisted system dumps. When performing a firmware-assisted dump, system memory is frozen and the partition is rebooted, which allows a new instance of the operating system to complete the dump.



Traditional dump: it is generated before the partition is rebooted.
(When the system crashes, the memory contents are copied to the dump device at that moment.)

Firmware-assisted dump: it takes place while the partition is restarting.
(When the system crashes, memory is frozen, the hypervisor (firmware) allocates a new memory area in RAM, and the memory contents are copied there. During the reboot the dump is then copied from this memory area to the dump device.)

Firmware-assisted dump offers improved reliability over the traditional dump, by rebooting the partition and using a new kernel to dump data from the previous kernel crash.

When an administrator attempts to switch from a traditional to firmware-assisted system dump, system memory is checked against the firmware-assisted system dump memory requirements. If these memory requirements are not met, then the "sysdumpdev -t" command output reports the required minimum system memory to allow for firmware-assisted dump to be configured. Changing from traditional to firmware-assisted dump requires a reboot of the partition for the dump changes to take effect.
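
For reference, the switch itself is done with the -t flag mentioned above (a reboot is still needed afterwards); the values below are given to the best of my knowledge, so check the sysdumpdev man page on your level:

    sysdumpdev -t fw             switch to firmware-assisted dump
    sysdumpdev -t traditional    switch back to traditional dump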

Firmware-assisted system dumps can be one of these types:

Selective memory dump: A selective memory dump is triggered by, or uses, the AIX instance that must be dumped.
Full memory dump: The whole partition memory is dumped without any interaction with an AIX instance that is failing.



Use the sysdumpdev command to query or change the primary or secondary dump devices.
    - Primary:    usually used when you wish to save the dump data
    - Secondary: can be used to discard dump data (that is, /dev/sysdumpnull)


Flags for sysdumpdev command:
    -l                list the current dump destination
    -e                estimates the size of the dump (in bytes)
    -p                primary
    -s                secondary
    -P                make change permanent
    -C                turns on compression
    -c                turns off compression
    -L                shows info about last dump
    -K                turns on: always allow system dump

sysdumpdev -P -p /dev/dumpdev    change the primary dump device permanently to /dev/dumpdev

root@aix1: /root # sysdumpdev -l
primary              /dev/dumplv
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    TRUE          if it is FALSE, it can be changed in smitty sysdumpdev (or with sysdumpdev -K)
dump compression     ON            if it is OFF, sysdumpdev -C turns it ON (-c turns it OFF)
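
The copy directory and forced copy flag shown above can be changed with the -d/-D flags; the behaviour described below is how I understand the man page, so double-check it on your level:

sysdumpdev -d /var/adm/ras    set the copy directory; if the copy fails at boot, the dump is ignored (forced copy flag FALSE)
sysdumpdev -D /var/adm/ras    set the copy directory; if the copy fails at boot, you are prompted to copy the dump to external media (forced copy flag TRUE)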



Other commands:

sysdumpstart            starts a dump (smitty dump) (it will reboot the system as well)
kdb                     analyzes the dump
/usr/lib/ras/dumpcheck  checks if the dump device and copy directory are able to receive the system dump
If the dump device is a paging space, it verifies that enough free space exists in the copy directory to hold the dump
If the dump device is a logical volume, it verifies that it is large enough to hold a dump
(man dumpcheck)
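
On most AIX levels the root crontab already runs dumpcheck daily; the stock entry typically looks something like the line below (check crontab -l to confirm on your system):

    0 15 * * * /usr/lib/ras/dumpcheck >/dev/null 2>&1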


 SNAP:

snap
    -a                copies all system config. information to /tmp/ibmsupt directory tree
    -c                creates a  compressed tar image (snap.tar.Z) of all files in the /tmp/ibmsupt
    -g                gather general information

    -e                for HACMP, it runs clverification and gathers the data creating a snap

1. snap -r        removes old snap from /tmp/ibmsupt
2. snap -gc     creates a new snap file

Reading a compressed snap file:
1. snap -ac          creates a compressed snap file (/tmp/ibmsupt/snap.pax.Z)
2. uncompress snap.pax.Z        uncompresses it, we will have a snap.pax file
3. pax -rvf snap.pax           unpacks the files; after that they can be read



Creating a dump device

1. sysdumpdev -e                                    shows an estimate of how much space is required for a dump (see the sizing sketch after these steps)
2. mklv -t sysdump -y lg_dumplv rootvg 3 hdisk0     creates a sysdump lv with 3 PPs
3. sysdumpdev -Pp /dev/lg_dumplv                    makes it the primary dump device (the system will now use this lv for dumps)
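
A rough sizing sketch for step 1, assuming the byte estimate is the last field of the sysdumpdev -e output and that the dump lv will live in rootvg (verify the output format on your level):

    EST_BYTES=$(sysdumpdev -e | awk '{print $NF}')                      # estimated dump size in bytes
    PP_MB=$(lsvg rootvg | sed -n 's/.*PP SIZE: *\([0-9]*\).*/\1/p')     # PP size of rootvg in MB
    echo "estimate: $EST_BYTES bytes -> about $(( EST_BYTES / (PP_MB * 1024 * 1024) + 1 )) PPs of $PP_MB MB"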


System dump initiated by a user

!!!reboot will take place automatically!!!

1. sysdumpstart -p              initiates a dump to the primary device
(Reboot will be done automatically)
(If a dedicated dump device is used, user initiated dumps are not copied automatically to copy directory.)
(If paging space is used for dump, then dump will be copied automatically to /var/adm/ras)

2. sysdumpdev -L               shows that a dump took place on the primary device, with time, size, etc. (errpt will show it as well)
3. savecore -d /var/adm/ras    copies the last dump from the system dump device to the /var/adm/ras directory (not needed if paging space is used)


Specifying the default gateway on a specific interface

When you're using HACMP, you usually have multiple network adapters installed and thus multiple network interfaces to deal with. If AIX configured the default gateway on the wrong interface (for example on your management interface instead of the boot interface), you might want to change this, so network traffic isn't sent over the management interface. Here's how you can do this:

First, stop HACMP or do a take-over of the resource groups to another node; this will avoid any problems with applications when you start fiddling with the network configuration.

Then open up a virtual terminal window to the host on your HMC. Otherwise you would lose the connection as soon as you drop the current default gateway.

Now you need to determine where your current default gateway is configured. You can do this by typing:
# lsattr -El inet0
# netstat -nr
The lsattr command will show you the current default gateway route and the netstat command will show you the interface it is configured on. You can also check the ODM:
# odmget -q"attribute=route" CuAt
Now, delete the default gateway like this (only do this after you fully understand what it does):
# lsattr -El inet0 | awk '$2 ~ /hopcount/ { print $2 }' | read GW
# chdev -l inet0 -a delroute=${GW}
If you were now to use the route command to specify the default gateway on a specific interface, like this:
# route add 0 [ip address of default gateway: xxx.xxx.xxx.254] -if enX
You will have a working entry for the default gateway. But the route command does not change anything in the ODM, so as soon as your system reboots, the default gateway is gone again. Not a good idea.

A better solution is to use the chdev command:
# chdev -l inet0 -a addroute=net,-hopcount,0,,0,[ip address of default gateway]
This will set the default gateway to the first interface available.

To specify the interface use:
# chdev -l inet0 -a addroute=net,-hopcount,0,if,enX,,0,[ip address of default gateway]
Substitute the correct interface for enX in the command above.

If you previously used the route add command and after that use chdev to enter the default gateway, the chdev will fail. You have to delete the route first using route delete 0, and then give the chdev command.
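
Putting the pieces together, an end-to-end sketch with purely hypothetical values (gateway 192.168.10.254 on interface en1) could look like this:

# lsattr -El inet0 | awk '$2 ~ /hopcount/ { print $2 }' | read GW    # capture the current route attribute
# chdev -l inet0 -a delroute=${GW}                                   # remove the old default gateway from the ODM
# chdev -l inet0 -a addroute=net,-hopcount,0,if,en1,,0,192.168.10.254
# lsattr -El inet0                                                   # verify the new route attribute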

Afterwards, check if the new default gateway is properly configured:
# lsattr -El inet0
# odmget -q"attribute=route" CuAt
And of course, try to ping the IP address of the default gateway and some outside address. Now reboot your system and check if the default gateway remains configured on the correct interface. And start up HACMP again!

AIX - Filesystem space management

Fix for the AIX filesystems and general search techniques:

If the file system recently overflowed, use the -newer flag to find recently modified files.

To produce a file for the -newer flag to find against, use the following touch command:
touch mmddhhmm filename   (eg: touch 01192000 test_mohi)
Where mm is the month, dd is the date, hh is the hour in 24–hour format, mm is the minute, and filename is the name of the file you are creating with the touch command.
 
After you have created the touched file, you can use the following command to find newer large files:
find /filesystem_name -xdev -newer touch_filename -ls
(eg: find /var -xdev -newer /test_mohi -ls)
 
You can also use the find command to locate files that have been changed in the last 24 hours, as shown in the following example:
find /filesystem_name -xdev -mtime 0 -ls
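 
To rank the largest files instead of filtering by age, a minimal sketch (assuming the file size is the seventh field of the find -ls output, which is the usual layout):
 
find /filesystem_name -xdev -type f -ls | sort -rn -k7,7 | head -10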
 
******
 

/ (root) overflow

Check the following when the root file system (/) has become full.
  • Use the following command to read the contents of the /etc/security/failedlogin file:
    who /etc/security/failedlogin
    The condition of TTYs recreating too rapidly can create failed login entries. To clear the file after reading or saving the output, execute the following command:
    cp /dev/null /etc/security/failedlogin
  • Check the /dev directory for a device name that is typed incorrectly. If a device name is typed incorrectly, such as rmto instead of rmt0, a file will be created in /dev called rmto. The command will normally proceed until the entire root file system is filled before failing. /dev is part of the root (/) file system. Look for entries that are not devices (that do not have a major or minor number). To check for this situation, use the following command:
    cd /dev
    ls -l | pg
    In the same location that would indicate a file size for an ordinary file, a device file has two numbers separated by a comma. For example:
    crw-rw-rw-   1 root     system    12,0 Oct 25 10:19 rmt0
    If the file name or size location indicates an invalid device, as shown in the following example, remove the associated file:
    crw-rw-rw-   1 root     system   9375473 Oct 25 10:19 rmto
    Note:
    • Do not remove valid device names in the /dev directory. One indicator of an invalid device is an associated file size that is larger than 500 bytes.
    • If system auditing is running, the default /audit directory can rapidly fill up and require attention.
  • Check for very large files that might be removed using the find command. For example, to find all files in the root (/) directory larger than 1 MB, use the following command:
    find / -xdev -size  +2048 -ls |sort -r -n +6
    This command finds all files greater than 1 MB and sorts them in reverse order with the largest files first. Other flags for the find command, such as -newer, might be useful in this search. For detailed information, see the command description for the find command.
    Note: When checking the root directory, major and minor numbers for devices in the /dev directory will be interspersed with real files and file sizes. Major and minor numbers, which are separated by a comma, can be ignored.
    Before removing any files, use the following command to ensure a file is not currently in use by a user process:
    fuser filename
    Where filename is the name of the suspect large file. If a file is open at the time of removal, it is only removed from the directory listing. The blocks allocated to that file are not freed until the process holding the file open is killed.
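    On AIX levels where fuser supports the -d flag (an assumption to verify with man fuser), you can also list the processes that still hold deleted files open in a file system, which is the usual cause of space not being freed after an rm:
    fuser -d /
    Where / is the mount point of the file system being investigated.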
 
******
 

Resolving overflows in the /var file system

Check the following when the /var file system has become full.
  • You can use the find command to look for large files in the /var directory. For example:
find /var -xdev -size +2048 -ls | sort -r -n +6
    For detailed information, see the command description for the find command.
  • Check for obsolete or leftover files in /var/tmp.
  • Check the size of the /var/adm/wtmp file, which logs all logins, rlogins and telnet sessions. The log will grow indefinitely unless system accounting is running. System accounting clears it out nightly. The /var/adm/wtmp file can be cleared out or edited to remove old and unwanted information. To clear it, use the following command:
    cp /dev/null  /var/adm/wtmp
    To edit the /var/adm/wtmp file, first copy the file temporarily with the following command:
    /usr/sbin/acct/fwtmp < /var/adm/wtmp >/tmp/out
    Edit the /tmp/out file to remove unwanted entries then replace the original file with the following command:
    /usr/sbin/acct/fwtmp -ic < /tmp/out > /var/adm/wtmp
  • Clear the error log in the /var/adm/ras directory using the following procedure. The error log is never cleared unless it is manually cleared.
    Note: Never use the cp /dev/null command to clear the error log. A zero-length errlog file disables the error logging functions of the operating system and must be replaced from a backup.
    1. Stop the error daemon using the following command:
      /usr/lib/errstop
    2. Remove or move to a different filesystem the error log file by using one of the following commands:
      rm /var/adm/ras/errlog
      or
      mv /var/adm/ras/errlog filename
      Where filename is the name of the moved errlog file.
      Note: The historical error data is deleted if you remove the error log file.
    3. Restart the error daemon using the following command:
      /usr/lib/errdemon
    Note: Consider limiting the errlog by running the following entries in cron:
    0 11 * * * /usr/bin/errclear -d S,O 30    
    0 12 * * * /usr/bin/errclear -d H 90
  • Check whether the trcfile file in this directory is large. If it is large and a trace is not currently being run, you can remove the file using the following command:
    rm /var/adm/ras/trcfile
  • If your dump device is set to hd6 (which is the default), there might be a number of vmcore* files in the /var/adm/ras directory. If their file dates are old or you do not want to retain them, you can remove them with the rm command.
  • Check the /var/spool directory, which contains the queuing subsystem files. Clear the queueing subsystem using the following commands:
    stopsrc -s qdaemon
    rm /var/spool/lpd/qdir/*
    rm /var/spool/lpd/stat/*
    rm /var/spool/qdaemon/*
    startsrc -s qdaemon
  • Check the /var/adm/acct directory, which contains accounting records. If accounting is running, this directory may contain several large files.
  • Check the /var/preserve directory for terminated vi sessions. Generally, it is safe to remove these files. If a user wants to recover a session, you can use the vi -r command to list all recoverable sessions. To recover a specific session, use vi -r filename.
  • Modify the /var/adm/sulog file, which records the number of attempted uses of the su command and whether each was successful. This is a flat file and can be viewed and modified with a favorite editor. If it is removed, it will be recreated by the next attempted su command. Modify the /var/tmp/snmpd.log, which records events from the snmpd daemon. If the file is removed it will be recreated by the snmpd daemon.
    Note: The size of the /var/tmp/snmpd.log file can be limited so that it does not grow indefinitely. Edit the /etc/snmpd.conf file to change the number (in bytes) in the appropriate section for size.
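    For reference, the logging stanza in /etc/snmpd.conf typically looks something like the following (the exact defaults may differ by level); the size value is the limit in bytes:
    logging         file=/var/tmp/snmpd.log         enabled
    logging         size=100000                     level=0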
 

Power VM - Dual VIO Migration

VIOS is essentially the AIX operating system with additional software on top.

A single-VIOS migration is a straightforward approach: shut down all the VIO clients and perform the migration on the VIOS.

A dual-VIOS migration is different, since the clients can keep running over their redundant disk paths and network while one VIOS at a time is being migrated.

Note: Make sure that disk path and network redundancy is in place on all the VIO clients.

Step by step Procedure for dual VIOS migration:

Considerations:
1. Make sure the control channel VLAN is the same on both VIO servers and that it is healthy.
2. Identify the primary and the secondary VIO server.
3. Make sure the SEA adapters' "ha_mode" is set to "auto".
4. Check that every client's disk paths are redundant (at least 2 paths, each coming from a different VIO server); a quick check is sketched after this list.
5. If the disks of a mirrored rootvg on a VIO client come from each VIO server and each disk has only a single path, run syncvg and wait for synchronization to complete after each update.
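
A quick sketch for consideration 4, run as root on each VIO client; it assumes the default lspath output format of "status  disk  parent":

lspath -s Enabled | awk '{c[$2]++} END {for (d in c) if (c[d] < 2) print d": only "c[d]" enabled path(s)"}'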

Procedure:

1. Check the VIO servers' SEA adapter state and update the VIO server that is in backup mode first.

To check the VIO servers' state, run the following script:

# list each SEA adapter and its HA state (AIX-style commands, so run as root / from oem_setup_env)
for x in $(lsdev -Cc adapter | grep -i shared | awk '{print $1}')
do
    echo $x
    entstat -d $x | grep State
done

ent10
    State: BACKUP
ent11
    State: BACKUP


2. On the VIO Server which is in backup mode

Check ioslevel in padmin mode

Note: If any interim fixes are installed, remove them first.
$ oem_setup_env
# emgr -P               lists the installed interim fixes
# emgr -r -L <label>    removes an interim fix by its label
$ ioslevel
2.1.3.10-FP-23
Commit all the updates
$ updateios -commit
There are no uncommitted updates.

To be on the safe side, force the SEA adapters into backup (standby) state:

chdev -dev ent10 -attr ha_mode=standby
chdev -dev ent11 -attr ha_mode=standby

Mount the remote file system where the VIO updates are kept, and run:
updateios -accept -install -dev <directory>
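
A hedged example as padmin, with a hypothetical NIM server and export path holding the fix pack:

$ mount nim_server:/export/vios_update /mnt
$ updateios -accept -install -dev /mnt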

Reboot the VIO server after the installation is complete, and accept the license:
$ license -accept
$ ioslevel
2.2.1.3
Now change the SEA adapters back to auto mode:
chdev -dev ent10 -attr ha_mode=auto
chdev -dev ent11 -attr ha_mode=auto
3. Now move to the other VIO server and change its SEA adapters to standby mode:

chdev -dev ent10 -attr ha_mode=standby
chdev -dev ent11 -attr ha_mode=standby

Now the SEA adapters on the previously updated VIO server should show as PRIMARY, and the adapters on the current VIO server as BACKUP.

Now follow the same procedure to update this VIO server as for the previous one. After the reboot and license acceptance, change the SEA adapters back to auto mode.