ESXi SMART Script to Check All Disk Health

Posted on 2024-07-17

In managing an ESXi server, ensuring all disks are healthy and functional is crucial to maintaining overall system reliability and performance. A failed disk can lead to data loss or even system downtime. Thus, proactively monitoring disk health using SMART (Self-Monitoring, Analysis, and Reporting Technology) data is a smart practice. This blog post walks you through a shell script designed to check the health of all disks on your ESXi server.

Introduction to the Script

The script leverages esxcli commands to fetch SMART data for each disk and formats the output in a readable table, logging it along with a timestamp. Key information such as health status, drive temperature, power-on hours, power cycle count, and reallocated sector count is extracted and displayed, helping you quickly identify potential issues.

You will be able to check the SMART information as in the sample below:
> ./smart.sh

Drive                                   | Health Status | Drive Temperature | Power-on Hours | Power Cycle Count | Reallocated Sector Count
---------------------------------------------------------------------------------------------------------------------------------------------
eui.0000000001000000b7d6c88d0a080f00      OK              40/77               3118             81                  0/90
t10.NVMe.WD.PC.SN740.SDDPTQE2D2T00.02F2E  OK              42/84               549              1473                0/90
t10.NVMe.EDILOCA.EN870.4TB.4334424002000  OK              44/90               20               3                   0/99
t10.NVMe.SAMSUNG.MZVL22T0HBLB2D00B00.5D8  OK              44/81               11498            346                 0/90
t10.NVMe.EDILOCA.EN870.4TB.5435424002000  OK              44/90               19               4                   0/99
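
Because the processed drive names contain no spaces, the health status is the second whitespace-separated field of every data row, so a table like the one above can be filtered for drives that need attention. The following is a minimal sketch under that assumption, not part of smart.sh: the helper name flag_unhealthy and the failing sample row (including the status value FAILED) are made up for illustration.

```shell
#!/bin/sh
# flag_unhealthy: hypothetical helper, not part of smart.sh.
# Reads smart.sh table rows on stdin and prints every drive whose
# Health Status column is anything other than "OK" (this includes "N/A").
flag_unhealthy() {
    awk '
    # Only data rows start with a device prefix, which skips the header,
    # the separator, timestamp lines, and blank lines.
    $1 ~ /^(t10|eui)\./ && $2 != "OK" { print $1 ": " $2 }
    '
}

# Example input: one row from the sample above plus one invented failing row.
sample='Drive                                   | Health Status | Drive Temperature
---------------------------------------------------------------------------
eui.0000000001000000b7d6c88d0a080f00      OK              40/77               3118             81                  0/90
t10.NVMe.EXAMPLE.DRIVE.0000000000000000   FAILED          70/77               100              5                   12/90'

printf '%s\n' "$sample" | flag_unhealthy
```

The same function could be fed from the log instead, for example with flag_unhealthy < smart.log, since timestamp and header lines are skipped by the prefix match.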

Prerequisites

A working ESXi server
SSH access enabled on the ESXi server
Basic familiarity with shell scripts

The Script

You can download it here: GitHub Download
or you can copy the following script to your ESXi host.
Here is the complete shell script for checking disk health using SMART data on ESXi. Place it on your ESXi host and run chmod +x smart.sh so you can execute the script.

smart.sh:

#!/bin/sh
#ESXi SMART Script by @Upinel https://upinel.github.io. All Rights Reserved.

# Get the list of all device UIDs (only IDs starting with "t10." or "eui.")
device_list=$(esxcli storage core device list | grep -E '^(t10|eui)\.' | awk '{print $1}')

# Timestamp for log
timestamp=$(date '+%Y-%m-%d %H:%M:%S')

# Header for the output
header="Drive                                   | Health Status | Drive Temperature | Power-on Hours | Power Cycle Count | Reallocated Sector Count"
separator="---------------------------------------------------------------------------------------------------------------------------------------------"
echo "$header"
echo "$separator"
# Begin logging output with timestamp
{
  echo "Timestamp: $timestamp"
  echo "$header"
  echo "$separator"
} >> smart.log

# Function to process device name
process_device_name() {
    local device_name=$1
    # Replace each run of one or more underscores with a single period
    device_name=$(echo "$device_name" | sed 's/_\+/\./g')
    # Truncate to a maximum of 40 characters
    device_name=$(echo "$device_name" | cut -c 1-40)
    echo "$device_name"
}

# Iterate through each device UID and fetch its SMART data
for device in $device_list
do
    # Process the device name
    processed_device_name=$(process_device_name "$device")
    
    # Get the SMART data
    output=$(esxcli storage core device smart get -d "$device")
    
    # Format the output
    formatted_output=$(echo "$output" | awk -v device="$processed_device_name" '
    BEGIN {
        # Initialize default values
        status["Health Status"] = "N/A"
        status["Power-on Hours"] = "N/A"
        status["Drive Temperature"] = "N/A/N/A"
        status["Power Cycle Count"] = "N/A"
        status["Reallocated Sector Count"] = "N/A/N/A"
    }
    # Capture specific parameters and thresholds.
    # Fields are compared against "" so that a legitimate value of 0 is not treated as missing.
    /Health Status/ {status["Health Status"] = ($3 != "") ? $3 : "N/A"}
    /Power-on Hours/ {status["Power-on Hours"] = ($3 != "") ? $3 : "N/A"}
    /Drive Temperature/ {
        value = ($3 != "") ? $3 : "-"
        threshold = ($4 != "") ? $4 : "-"
        status["Drive Temperature"] = value "/" threshold
    }
    /Power Cycle Count/ {status["Power Cycle Count"] = ($4 != "") ? $4 : "N/A"}
    /Reallocated Sector Count/ {
        value = ($4 != "") ? $4 : "0"
        threshold = ($5 != "") ? $5 : "-"
        status["Reallocated Sector Count"] = value "/" threshold
    }
    END {
        # Print the results with formatted and truncated drive name (40 characters max)
        printf "%-41s %-15s %-19s %-16s %-19s %-36s\n",
        device,
        status["Health Status"],
        status["Drive Temperature"],
        status["Power-on Hours"],
        status["Power Cycle Count"],
        status["Reallocated Sector Count"]
    }')

    # Append formatted output to the log file
    echo "$formatted_output" >> smart.log
    echo "$formatted_output"
done

# Print an empty line for readability in the log
echo "" >> smart.log
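
To see what the script's two text-processing steps do without touching a real host, they can be dry-run on canned input. The sketch below is illustrative only: the raw device ID is invented, the SMART table is an assumed shape of the esxcli storage core device smart get output (its values mirror the first row of the sample table), and the helper names clean_name, smart_sample, and smart_row are hypothetical. It shows why the value column lands in $3 for two-word labels such as "Health Status" but in $4 for three-word labels such as "Power Cycle Count".

```shell
#!/bin/sh
# Dry run of the text processing in smart.sh on canned input (no esxcli needed).

# Invented raw device ID; real IDs come from `esxcli storage core device list`.
raw_id='t10.NVMe____EXAMPLE_DRIVE_1TB___________________0123456789ABCDEF'

# Equivalent to process_device_name: each run of underscores becomes one
# period, then the name is cut to at most 40 characters.
clean_name() {
    printf '%s\n' "$1" | sed 's/__*/./g' | cut -c 1-40
}

# Assumed shape of the SMART table; not captured from a real host.
smart_sample() {
    cat <<'EOF'
Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Power-on Hours                3118   N/A        N/A
Power Cycle Count             81     N/A        N/A
Reallocated Sector Count      0      90         N/A
Drive Temperature             40     77         N/A
EOF
}

# awk splits on whitespace, so the value is the field right after the
# label words: $3 for two-word labels, $4 for three-word labels.
smart_row() {
    awk -v device="$1" '
    /Health Status/            { health  = $3 }
    /Power-on Hours/           { hours   = $3 }
    /Drive Temperature/        { temp    = $3 "/" $4 }
    /Power Cycle Count/        { cycles  = $4 }
    /Reallocated Sector Count/ { realloc = $4 "/" $5 }
    END { printf "%s %s %s %s %s %s\n", device, health, temp, hours, cycles, realloc }
    '
}

smart_sample | smart_row "$(clean_name "$raw_id")"
# prints: t10.NVMe.EXAMPLE.DRIVE.1TB.0123456789ABC OK 40/77 3118 81 0/90
```

Keeping the parsing in a function that reads stdin makes it easy to test against saved output before running anything on the ESXi host itself.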

Conclusion

Regularly checking your disk health can help prevent unexpected failures and prolong the lifespan of your hardware. The script presented here aims to facilitate this process, making it easier to monitor and log the SMART health status of all drives on your ESXi server.