Monitoring Lustre file systems

You can easily monitor one or more file systems at the Dashboard, Status, and Logs windows. The Dashboard window displays a set of charts that provide usage and performance data at several levels in the file systems being monitored, while the Status and Logs windows keep you informed of file system activity relevant to current and past file system health and performance.

View charts on the Dashboard

The Dashboard displays a set of graphical charts that provide real-time usage and performance data at several levels in the file systems being monitored. All Dashboard charts are available for both monitored-only and managed/monitored file systems.

At the top, the Dashboard lists the file system(s) being managed or monitored-only. The following information is provided for each file system:

Persistent Chart Configuration

You can configure certain data display parameters for each chart. Your chart configuration persists until you reload or refresh the Dashboard page in the browser.

View charts for one or all file systems

When you first log in, the Dashboard displays the following six charts for all file systems combined. Click on the links here to learn more.

To view these six charts for a single file system:

  1. If it is not displayed, click Dashboard to access the Dashboard window. The default view is for all six charts to be displayed.
  2. Click Configure Dashboard.
  3. Under File System, select the file system you wish to view.
  4. Click Update.

View charts for all servers combined

Viewing charts for all servers is similar to viewing charts for all file systems. To do this:

  1. On the Dashboard, click Configure Dashboard.
  2. Leave All Servers selected in the Server drop-down menu.
  3. Click Update.

View charts for an individual server

  1. On the Dashboard, click Configure Dashboard.
  2. Select Server.
  3. Under Server, select the server of interest and click Update.

The following charts are displayed for an individual server. Click on the links to learn about these charts.

View charts for an OST or MDT

To view charts for a specific OST or MDT:

  1. On the Dashboard, click Configure Dashboard.
  2. Select Server.
  3. At the Server drop-down menu, select the server hosting the desired target.
  4. At the Target drop-down menu, select the desired target. Then click Update.

The following charts are displayed for OSTs:

The following charts are displayed for MDTs:

Check file systems status

The file systems Status light provides a quick glance at the status and health of all file systems managed by Integrated Manager for Lustre software. This indicator is located along the top banner of the manager GUI and reflects the worst-case condition. For example, an Error message for any file system will always display a red Status light.

Click Status to open the Status window and learn more. See View commands and status messages on the Status window.
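
The "worst-case" behavior of the Status light can be pictured with a small shell sketch. This is an illustration only, not the manager's actual logic: the severity names and their ordering (info, warning, error) and both helper functions are assumptions introduced here.

```shell
#!/bin/sh
# Sketch of worst-case aggregation across file systems.
# Assumption: severities are ordered info < warning < error.
severity_rank() {
    case "$1" in
        info)    echo 0 ;;
        warning) echo 1 ;;
        error)   echo 2 ;;
        *)       echo 0 ;;   # unknown values are treated as lowest severity
    esac
}

worst_status() {
    # Arguments: one status per file system. Prints the most severe one.
    worst=info
    for s in "$@"; do
        if [ "$(severity_rank "$s")" -gt "$(severity_rank "$worst")" ]; then
            worst=$s
        fi
    done
    echo "$worst"
}

# Three file systems, one of which reports an error.
worst_status info error warning
```

In this sketch a single error among otherwise healthy file systems is enough to make the aggregate result `error`, which corresponds to the red Status light described above.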

View job stats

Job statistics are available from two locations:

To view job statistics:

  1. Before viewing job statistics, you will need to run a command to enable this feature. Run this command for each file system. The following command is an example to be run on the management server (MGS):

     lctl conf_param <fsname>.sys.jobid_var=procname_uid
     # where `<fsname>` is the file system name (refer to "Using job stats with other job schedulers" for more information).
    
  2. The variable <fsname>.mdt.job_cleanup_interval sets the period after which collected statistics are cleared out. If this interval is too short, statistics may get cleared while you’re viewing job statistics. Set this interval to a value greater than your collection/viewing period. As an example, you could set this interval to 70 minutes (4200 seconds) using the following command:

     lctl conf_param <fsname>.mdt.job_cleanup_interval=4200
    
  3. View the Read/Write Heat Map chart on the Dashboard window.
  4. Each row on the Read/Write Heat Map corresponds to an OST, and consecutive columns, from left to right, correspond to consecutive time intervals. Mouse over the cells to find an OST and time interval of interest, and click the desired cell.

    The Job Stats window opens. The top banner shows the OST and time interval. Each job executing during that interval is displayed as a row, with its average data throughput for that interval. Only the top five read and write jobs are displayed. The window displays the Read Bytes, Write Bytes, Read IOPS, and Write IOPS for the top five jobs, listed by Job ID.

  5. To change the duration of the job statistics sampling period, return to the Read/Write Heat Map chart. Click Change Duration and set the time period for the heat map. If you set the time period to one day (as an example), the 24-hour period is divided into 20 equal, consecutive cells, starting 24 hours ago and ending now. Each Read/Write Heat Map cell now covers 1.2 hours. Clicking a cell now opens a Job Stats window that averages 1.2 hours of read/write operations.
  6. To send this Job Stats window to another person, select and copy the URL from the browser's address bar. Next, paste the URL into an email message body and send it.
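
The cell width in step 5 is simple arithmetic, sketched below in shell. The sketch assumes the heat map always splits the selected duration into 20 cells, as in the one-day example. That count is taken from the example only, and `cell_minutes` is a hypothetical helper, not part of the product.

```shell
#!/bin/sh
# Sketch: width of one heat map cell, in minutes.
# Assumption: the selected duration is always divided into 20 cells,
# as in the one-day example in step 5.
HEATMAP_CELLS=20

cell_minutes() {
    # $1 = heat map duration in whole hours
    echo $(( $1 * 60 / HEATMAP_CELLS ))
}

# One-day heat map: 24 h * 60 min / 20 cells = 72 minutes (1.2 hours) per cell.
cell_minutes 24
```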

Note: The Job Stats window is static, specific to that time period and OST. To view another time period or OST, return to the Read/Write Heat Map chart and select the desired cell.
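
As a convenience for step 2, the following shell sketch converts an interval in minutes to seconds and prints the matching command. The function name `cleanup_interval_cmd` is hypothetical. The sketch only prints the command and does not run `lctl`, so you can review the output before running it on the MGS.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the command that sets
# job_cleanup_interval for a file system, given the interval in minutes.
cleanup_interval_cmd() {
    fsname=$1
    minutes=$2
    echo "lctl conf_param ${fsname}.mdt.job_cleanup_interval=$(( minutes * 60 ))"
}

# Example: a 70-minute interval (4200 seconds) for a file system named testfs.
cleanup_interval_cmd testfs 70
```

Choose an interval longer than the period you expect to spend collecting and viewing statistics, as described in step 2.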

Using job stats with other job schedulers

The job stats code extracts the job identifier from an environment variable set by the scheduler when the job is started. Integrated Manager for Lustre software sets a jobstats environment variable to work with SLURM; however, you can set the variable to work with other job schedulers. To enable job stats to work with a desired scheduler, set jobid_var to the name of the environment variable set by that scheduler. For example, SLURM sets the SLURM_JOB_ID environment variable with the unique job ID on each client. To permanently enable jobstats on the testfs file system, run this command on the MGS:

$ lctl conf_param testfs.sys.jobid_var=<environment variable>

where <environment variable> is one of the following:

| Job Scheduler | Environment Variable |
| --- | --- |
| Simple Linux Utility for Resource Management (SLURM) | SLURM_JOB_ID |
| Sun Grid Engine (SGE) | JOB_ID |
| Load Sharing Facility (LSF) | LSB_JOBID |
| LoadLeveler | LOADL_JOBID |
| Portable Batch Scheduler (PBS)/MAUI | PBS_JOBID |
| Cray Application Level Placement Scheduler (ALPS) | ALPS_APP_ID |

To disable job stats, set jobid_var to disable:

$ lctl conf_param testfs.sys.jobid_var=disable

To track job stats per process name and user ID (for debugging, or if no job scheduler is in use), set jobid_var to procname_uid:

$ lctl conf_param testfs.sys.jobid_var=procname_uid
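
The scheduler table and the commands above can be combined in a small shell sketch that looks up the environment variable for a scheduler and prints the matching command. The function names and the short scheduler keys (`slurm`, `sge`, and so on) are hypothetical and introduced only for illustration. The variable names are the ones listed in the table. The sketch prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: map a scheduler key to the environment variable
# listed in the scheduler table.
jobid_var_for() {
    case "$1" in
        slurm)       echo SLURM_JOB_ID ;;
        sge)         echo JOB_ID ;;
        lsf)         echo LSB_JOBID ;;
        loadleveler) echo LOADL_JOBID ;;
        pbs|maui)    echo PBS_JOBID ;;
        alps)        echo ALPS_APP_ID ;;
        none)        echo procname_uid ;;   # no scheduler: track by process name and UID
        *)           echo "unknown scheduler: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the command that enables job stats.
jobstats_cmd() {
    # $1 = file system name, $2 = scheduler key
    var=$(jobid_var_for "$2") || return 1
    echo "lctl conf_param $1.sys.jobid_var=$var"
}

jobstats_cmd testfs slurm
```

An unrecognized scheduler key makes both functions return a non-zero status, so a wrapper script can stop before printing an incomplete command.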

View and manage file system parameters

After you have created a file system, you can view its configuration and manage the file system at the File System Details window.

View a server’s detail window

To view all parameters available for a server, click the Configuration drop-down menu on the menu bar and click Servers. Select the server to view its Server Detail window.

View commands and status messages on the Status window

The Integrated Manager for Lustre software provides status messages about the health of each managed file system.

View all status messages

Click Status to view all status messages. Messages are displayed with the most recent message first. Note that warning and error messages are displayed as alerts. The Status window displays messages in five categories:

For more information see Status window.

View Logs

Click Logs on the menu bar to view all system logs.

The Logs window displays log information and lets you filter events by date range, host, service, and source (messages from Lustre only or from all sources). The Logs window also features querying with auto-complete and linkable host names.

View HSM Copytool activities

To view current copytool activities, click Configuration and select HSM. To learn about HSM capabilities supported in Integrated Manager for Lustre software, see Configuring and using Hierarchical Storage Management.

After HSM is set up for a file system, the HSM Copytool chart displays a moving timeline of waiting copytool requests, current copytool operations, and the number of idle copytool workers.

HSM Operations
