In managing an ESXi server, ensuring all disks are healthy and functional is crucial to maintaining overall system reliability and performance. A failed disk can lead to data loss or even system downtime. Thus, proactively monitoring disk health using SMART (Self-Monitoring, Analysis, and Reporting Technology) data is a smart practice. This blog post walks you through a shell script designed to check the health of all disks on your ESXi server.
Introduction to the Script
The script leverages esxcli commands to fetch SMART data for each disk and formats the output in a readable table, logging it along with a timestamp. Key information such as health status, drive temperature, power-on hours, power cycle count, and reallocated sector count is extracted and displayed, helping you quickly identify potential issues.
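To make the extraction step concrete, here is a minimal sketch that applies the same kind of awk field matching to a hard-coded sample. The sample text is an assumption about how `esxcli storage core device smart get` lays out its table (Parameter, Value, Threshold, Worst columns), the values reuse the first row of this post's sample output, and the variable names are illustrative only.

```shell
#!/bin/sh
# Illustrative sample only: the column layout is an assumption, and the
# values are copied from the first row of this post's sample output.
sample_output='Parameter                 Value  Threshold  Worst
------------------------  -----  ---------  -----
Health Status             OK     N/A        N/A
Power-on Hours            3118   N/A        N/A
Power Cycle Count         81     N/A        N/A
Reallocated Sector Count  0      90         N/A
Drive Temperature         40     77         N/A'

# awk splits on whitespace, so a two-word parameter name puts the value in $3
# and a three-word parameter name puts it in $4.
summary=$(printf '%s\n' "$sample_output" | awk '
    /Health Status/            { health  = $3 }
    /Power-on Hours/           { hours   = $3 }
    /Drive Temperature/        { temp    = $3 "/" $4 }
    /Power Cycle Count/        { cycles  = $4 }
    /Reallocated Sector Count/ { realloc = $4 "/" $5 }
    END { print health, temp, hours, cycles, realloc }
')

echo "$summary"
```

This is why the script reads some values from the third field and others from the fourth: the position of the value depends on how many words the parameter name contains.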
Running the script prints the SMART information in a table like the sample below:
> ./smart.sh
Drive | Health Status | Drive Temperature | Power-on Hours | Power Cycle Count | Reallocated Sector Count
---------------------------------------------------------------------------------------------------------------------------------------------
eui.0000000001000000b7d6c88d0a080f00      OK              40/77               3118             81                  0/90
t10.NVMe.WD.PC.SN740.SDDPTQE2D2T00.02F2E  OK              42/84               549              1473                0/90
t10.NVMe.EDILOCA.EN870.4TB.4334424002000  OK              44/90               20               3                   0/99
t10.NVMe.SAMSUNG.MZVL22T0HBLB2D00B00.5D8  OK              44/81               11498            346                 0/90
t10.NVMe.EDILOCA.EN870.4TB.5435424002000  OK              44/90               19               4                   0/99
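Because each data row starts with the drive name followed by the health status, a report in this layout can be scanned for drives that need attention. The following is a minimal sketch and not part of the original script. It assumes the second whitespace-separated column is the health status and that healthy drives report `OK`. The second row, with the hypothetical name `t10.NVMe.EXAMPLE.DRIVE` and a `WARNING` status, is made up for illustration.

```shell
#!/bin/sh
# Two rows in the report's layout. The first is taken from the sample output;
# the second is a made-up failing drive used only for illustration.
report='eui.0000000001000000b7d6c88d0a080f00 OK 40/77 3118 81 0/90
t10.NVMe.EXAMPLE.DRIVE WARNING 71/77 20011 512 12/90'

# Keep only device rows (names starting with eui. or t10.) whose second
# column is not OK, and print the drive name with its status.
unhealthy=$(printf '%s\n' "$report" |
    awk '$1 ~ /^(eui|t10)\./ && $2 != "OK" { print $1 ": " $2 }')

if [ -n "$unhealthy" ]; then
    echo "Drives needing attention:"
    echo "$unhealthy"
else
    echo "All drives report OK."
fi
```

On a real host, the same awk filter could read smart.log directly instead of the hard-coded sample. The header, separator, and timestamp lines are skipped because they do not start with a device prefix.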
Prerequisites
A working ESXi server
SSH access enabled on the ESXi server
Basic familiarity with shell scripts
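Once you are logged in over SSH, a quick pre-flight check can confirm that the commands the script depends on are present. This is a minimal sketch under the assumption that checking `PATH` is sufficient; the helper name `have_cmd` is hypothetical and not part of the script.

```shell
#!/bin/sh
# Return success if the named command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# esxcli is expected on an ESXi host; grep, awk, sed, and cut are the text
# tools the script relies on.
missing=""
for tool in esxcli grep awk sed cut; do
    have_cmd "$tool" || missing="$missing $tool"
done

if [ -n "$missing" ]; then
    echo "Missing commands:$missing"
else
    echo "All required commands are available."
fi
```

If `esxcli` is reported missing, the shell is most likely not running on the ESXi host itself.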
The Script
You can download the script here: GitHub Download. Alternatively, copy the complete script below onto your ESXi host. Either way, place it on the host and run chmod +x smart.sh to make it executable.
#!/bin/sh
#ESXi SMART Script by @Upinel https://upinel.github.io. All Rights Reserved.

# Get the list of all device UIDs
device_list=$(esxcli storage core device list | grep -E '^t10\.|^eui\.' | awk '{print $1}')

# Timestamp for log
timestamp=$(date '+%Y-%m-%d %H:%M:%S')

# Header for the output
header="Drive | Health Status | Drive Temperature | Power-on Hours | Power Cycle Count | Reallocated Sector Count"
separator="---------------------------------------------------------------------------------------------------------------------------------------------"
echo "$header"
echo "$separator"

# Begin logging output with timestamp
{
    echo "Timestamp: $timestamp"
    echo "$header"
    echo "$separator"
} >> smart.log

# Function to process device name
process_device_name() {
    local device_name=$1
    # Replace each run of underscores with a single period
    device_name=$(echo "$device_name" | sed 's/_\+/\./g')
    # Truncate to a maximum of 40 characters
    device_name=$(echo "$device_name" | cut -c 1-40)
    echo "$device_name"
}

# Iterate through each device UID and fetch its SMART data
for device in $device_list
do
    # Process the device name
    processed_device_name=$(process_device_name "$device")

    # Get the SMART data
    output=$(esxcli storage core device smart get -d "$device")

    # Format the output
    formatted_output=$(echo "$output" | awk -v device="$processed_device_name" '
    BEGIN {
        # Initialize default values
        status["Health Status"] = "N/A"
        status["Power-on Hours"] = "N/A"
        status["Drive Temperature"] = "N/A/N/A"
        status["Power Cycle Count"] = "N/A"
        status["Reallocated Sector Count"] = "N/A/N/A"
    }
    # Capture specific parameters and thresholds
    /Health Status/ {status["Health Status"] = $3 ? $3 : "N/A"}
    /Power-on Hours/ {status["Power-on Hours"] = $3 ? $3 : "N/A"}
    /Drive Temperature/ {
        value = $3 ? $3 : "-"
        threshold = $4 ? $4 : "-"
        status["Drive Temperature"] = value "/" threshold
    }
    /Power Cycle Count/ {status["Power Cycle Count"] = $4 ? $4 : "N/A"}
    /Reallocated Sector Count/ {
        value = $4 ? $4 : "0"
        threshold = $5 ? $5 : "-"
        status["Reallocated Sector Count"] = value "/" threshold
    }
    END {
        # Print the results with formatted and truncated drive name (40 characters max)
        printf "%-41s %-15s %-19s %-16s %-19s %-36s\n", device, status["Health Status"], status["Drive Temperature"], status["Power-on Hours"], status["Power Cycle Count"], status["Reallocated Sector Count"]
    }')

    # Append formatted output to the log file
    echo "$formatted_output" >> smart.log
    echo "$formatted_output"
done

# Print an empty line for readability in the log
echo "" >> smart.log
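The name clean-up done by `process_device_name` can be tried on its own. The sketch below is a stand-alone variant rather than the script's exact code: the function name `shorten_name` is hypothetical, the raw identifier is invented to resemble an ESXi device UID (its trailing digits are made up), and the sed pattern uses the POSIX form `__*` in place of `_\+`, which is an extension that not every sed supports.

```shell
#!/bin/sh
# Collapse each run of underscores into one period, then keep 40 characters.
# 's/__*/./g' is the POSIX spelling of "one or more underscores".
shorten_name() {
    printf '%s\n' "$1" | sed 's/__*/./g' | cut -c 1-40
}

# Hypothetical raw identifier: the trailing digits are invented for the example.
raw_name='t10.NVMe____WD_PC_SN740_SDDPTQE2D2T00__________02F2E1234567'
short_name=$(shorten_name "$raw_name")
echo "$short_name"
```

Truncating to 40 characters keeps the table columns aligned, but two drives whose names differ only after the 40th character would appear identical in the report.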
Conclusion
Regularly checking your disk health can help prevent unexpected failures and prolong the lifespan of your hardware. The script presented here aims to facilitate this process, making it easier to monitor and log the SMART health status of all drives on your ESXi server.