Errors and troubleshooting
The following topics are discussed in this chapter:
Unexpected file system events
This section discusses several unwanted file system events and how Integrated Manager for Lustre software responds to them.
A server’s connection to a storage target is lost
Immediate file system consequences:
Lustre clients will block if they have requested a file from an unavailable OST. The block continues until the connection to the OST is restored and the OST is again fully online. For OSTs that are still connected to their servers, client access continues unaffected.
Manager software / Peer server response:
No automatic failover. No alerts.
Suggested remedies:
Repair the connection to the target. In the meantime, the superuser may manually fail the target over to the peer server.
A server’s connection to LNet is lost
Immediate file system consequences:
Lustre clients will block waiting for the connection to be re-established. Those portions of the file system that are presented by the affected server are unavailable until then.
Manager software / Peer server response:
No automatic failover. No alerts.
Suggested remedies:
Repair the server’s connection to LNet. In the meantime, the superuser may manually fail the server’s targets over to the peer server.
Manager software connection to a server (via the management network, ring0) is lost
Immediate file system consequences:
No direct file system impact; the file system remains operational. However, Integrated Manager for Lustre software can no longer manage or monitor the server.
Manager software / Peer server response:
Alerts the administrator to the loss of the network connection to the server.
Suggested remedies:
Re-establish the management network connection to the server.
A Lustre server loses connectivity with the power control device for its peer server (IPMI or PDU)
Immediate file system consequences:
None. The file system continues to operate normally. In the event of a peer server failure, the server that has lost connectivity to power control will be unable to power off the failed server and assume responsibility for its resources.
Manager software / Peer server response:
No response to the loss of connectivity if the file system is operating normally. In the event of a server failure, automatic failover of Lustre targets from the failed server may be disabled.
Suggested remedies:
Repair the network link to power control (IPMI or PDU).
The Integrated Manager for Lustre software loses connection with a server’s power control device (IPMI or PDU)
Immediate file system consequences:
The software's ability to shut down the server is lost.
Manager software / Peer server response:
Alerts the administrator to the loss of the connection to the power control device.
Suggested remedies:
Restore the connection between the Manager software server and the affected server’s power control device.
A crossover cable between servers is disconnected or the network is down
Immediate file system consequences:
This is the loss of the ring1 network link, but the ring0 link (the management network) provides complete redundancy. The file system is not affected.
Manager software / Peer server response:
No automatic failover. No alerts.
Suggested remedies:
Replace or reconnect the crossover cable, or restore the network.
A primary server’s OS kernel crashes
Immediate file system consequences:
Each server acts as both a primary and a secondary server. Access to served storage is temporarily delayed while failover occurs.
Manager software / Peer server response:
Peer server performs STONITH, failover occurs.
Suggested remedies:
No administrator action needed. Successful STONITH causes the server to be rebooted.
LBUG, a Lustre crash on a server
Immediate file system consequences:
This will also crash Linux on the affected server. Access to served storage is temporarily delayed while failover occurs.
Manager software / Peer server response:
Peer server performs STONITH, failover occurs.
Suggested remedies:
No administrator action needed.
The primary server spontaneously reboots
Immediate file system consequences:
Access to served storage is temporarily delayed while failover occurs.
Manager software / Peer server response:
Peer server performs STONITH, failover occurs.
Suggested remedies:
No administrator action needed.
The management network (ring0) and a peer crossover network (ring1) are both down
Immediate file system consequences:
The file system is not directly affected and client operations may continue. Affected peer servers may attempt STONITH.
Manager software / Peer server response:
Peer server performs STONITH and failover occurs. However, each affected server may attempt STONITH on its peer.
Suggested remedies:
This condition is unlikely and unstable. The superuser needs to restore the network connections for both the management network and the crossover link between the affected servers.
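Because the loss of the ring1 link raises no alert, it can be useful to check ring health by hand on a server. The following is a minimal sketch, not part of the Integrated Manager for Lustre software. It assumes that the ring0 and ring1 links are Corosync rings, that `corosync-cfgtool -s` is available on the server, and that it reports a failed ring with the word FAULTY in its status line (Corosync 2.x behavior). The function name `faulty_rings` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the text printed by `corosync-cfgtool -s` on
# standard input and prints one line (for example "ring1") for every ring
# whose status line contains FAULTY.
# Returns 1 if any ring is faulty, 0 otherwise.
# Assumption: output has "RING ID <n>" lines followed by "status = ..." lines.
faulty_rings() {
    awk '
        /^RING ID/ { ring = $3 }                  # remember the current ring number
        /FAULTY/   { print "ring" ring; bad = 1 } # report a ring marked faulty
        END        { exit bad + 0 }               # non-zero exit if any ring is faulty
    '
}

# Example use on a Lustre server (run as root):
#   corosync-cfgtool -s | faulty_rings || echo "at least one ring is down"
```

If the function reports ring1 only, the file system is not affected, but the crossover link should be repaired before a second failure takes ring0 down as well.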
Running Integrated Manager for Lustre software diagnostics
If Integrated Manager for Lustre software is not operating normally and you require support, you may be asked to run iml-diagnostics on any servers that are suspected of having problems, and/or on the server hosting the Integrated Manager for Lustre software dashboard. The results of running the diagnostics should be attached to the ticket you are filing describing the problem. These diagnostics are described next.
Run diagnostics
- Log in to the server in question. Admin login is required in order to collect all desired data.
- Enter the following command at the prompt:
# iml-diagnostics
This command generates a compressed tar.xz file that you can email to customer support. The following are sample displayed results of running this command. (The resulting tar.xz file will have a different file name.)
sosreport (version 3.4)
This command will collect diagnostic and configuration information from
this CentOS Linux system and installed applications.
An archive containing the collected information will be generated in
/var/tmp/sos.p3Djuo and may be provided to a CentOS support
representative.
Any information provided to CentOS will be treated in accordance with
the published support policies at:
https://wiki.centos.org/
The generated archive may contain data considered sensitive and its
content should be reviewed by the originating organization before being
passed to any third party.
No changes will be made to system configuration.
Setting up archive ...
Setting up plugins ...
Running plugins. Please wait ...
Running 1/10: block...
Running 2/10: filesys...
Running 3/10: iml...
Running 4/10: kernel...
Running 5/10: logs...
Running 6/10: memory...
Running 7/10: pacemaker...
Running 8/10: pci...
Running 9/10: processor...
Running 10/10: yum...
Creating compressed archive...
Your sosreport has been generated and saved in:
/var/tmp/sosreport-iml.dev-20171017003954.tar.xz
The checksum is: f018ba301df835862e559aa98465e9fc
Please send this file to your support representative.
You can also decompress the file and examine the results. To unpack and extract the files, use this command:
tar --xz -xvpf <file_name>.tar.xz
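Before sending the archive, you may want to confirm that the file on disk matches the checksum printed by the command. The following is a minimal sketch that assumes you saved the displayed results to a log file and that the printed checksum is an MD5 digest (the 32-hex-digit value in the sample output has that length). The function names and the log file path are hypothetical and are not part of iml-diagnostics.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking a diagnostics archive against the
# checksum shown in the command output. Assumes the checksum is MD5.

# Print the archive path: the line that follows "...generated and saved in:".
sos_archive_path() {
    awk '/generated and saved in:/ {
             getline                          # the path is on the next line
             gsub(/^[ \t]+|[ \t]+$/, "")      # trim surrounding whitespace
             print
             exit
         }' "$1"
}

# Print the checksum reported on the "The checksum is:" line.
sos_reported_checksum() {
    awk '/The checksum is:/ { print $NF; exit }' "$1"
}

# Return 0 if the MD5 digest of file $1 equals the expected value $2.
sos_checksum_matches() {
    local actual
    actual=$(md5sum "$1" | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}

# Example use (log path is illustrative):
#   iml-diagnostics | tee /tmp/iml-diagnostics.log
#   archive=$(sos_archive_path /tmp/iml-diagnostics.log)
#   sum=$(sos_reported_checksum /tmp/iml-diagnostics.log)
#   sos_checksum_matches "$archive" "$sum" && echo "checksum OK"
```

To review the archive contents without unpacking them, `tar --xz -tvf <file_name>.tar.xz` lists the files instead of extracting them.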
Help for iml-diagnostics
Generally, if you are asked to run this command, run it without options, as this will generate the needed data. Enter iml-diagnostics -h to see help for this command, as follows:
# iml-diagnostics -h
Usage: sosreport [options]
Options:
-h, --help show this help message and exit
-l, --list-plugins list plugins and available plugin options
-n NOPLUGINS, --skip-plugins=NOPLUGINS
disable these plugins
--experimental enable experimental plugins
-e ENABLEPLUGINS, --enable-plugins=ENABLEPLUGINS
enable these plugins
-o ONLYPLUGINS, --only-plugins=ONLYPLUGINS
enable these plugins only
-k PLUGOPTS, --plugin-option=PLUGOPTS
plugin options in plugname.option=value format (see
-l)
--log-size=LOG_SIZE set a limit on the size of collected logs (in MiB)
-a, --alloptions enable all options for loaded plugins
--all-logs collect all available logs regardless of size
--batch batch mode - do not prompt interactively
--build preserve the temporary directory and do not package
results
-v, --verbose increase verbosity
--verify perform data verification during collection
--quiet only print fatal errors
--debug enable interactive debugging using the python debugger
--ticket-number=CASE_ID
specify ticket number
--case-id=CASE_ID specify case identifier
-p PROFILES, --profile=PROFILES
enable plugins selected by the given profiles
--list-profiles display a list of available profiles and plugins that
they include
--name=CUSTOMER_NAME specify report name
--config-file=CONFIG_FILE
specify alternate configuration file
--tmp-dir=TMP_DIR specify alternate temporary directory
--no-report disable HTML/XML reporting
-s SYSROOT, --sysroot=SYSROOT
system root directory path (default='/')
-c CHROOT, --chroot=CHROOT
chroot executed commands to SYSROOT [auto, always,
never] (default=auto)
-z COMPRESSION_TYPE, --compression-type=COMPRESSION_TYPE
compression technology to use [auto, gzip, bzip2, xz]
(default=auto)
Some examples:
enable dlm plugin only and collect dlm lockdumps:
# sosreport -o dlm -k dlm.lockdump
disable memory and samba plugins, turn off rpm -Va collection:
# sosreport -n memory,samba -k rpm.rpmva=off