ESXi SMART Script to Check All Disk Health

In managing an ESXi server, ensuring all disks are healthy and functional is crucial to maintaining overall system reliability and performance. A failed disk can lead to data loss or even system downtime. Thus, proactively monitoring disk health using SMART (Self-Monitoring, Analysis, and Reporting Technology) data is a smart practice. This blog post walks you through a shell script designed to check the health of all disks on your ESXi server.

Introduction to the Script

The script leverages esxcli commands to fetch SMART data for each disk and formats the output in a readable table, logging it along with a timestamp. Key information such as health status, drive temperature, power-on hours, power cycle count, and reallocated sector count is extracted and displayed, helping you quickly identify potential issues.
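
To make the extraction step concrete, here is a minimal sketch of how a single field can be pulled out of the table printed by `esxcli storage core device smart get -d <device>`. The sample text is illustrative only: its values are made up, and the Parameter/Value/Threshold/Worst layout is an assumption. `smart_field` is a hypothetical helper name and is not part of the script.

```shell
#!/bin/sh
# Illustrative sample of SMART output for one device.
# Assumption: columns are Parameter, Value, Threshold, Worst; values are made up.
sample_output='Parameter                 Value  Threshold  Worst
------------------------  -----  ---------  -----
Health Status             OK     N/A        N/A
Power-on Hours            3118   N/A        N/A
Power Cycle Count         81     N/A        N/A
Reallocated Sector Count  0      90         N/A
Drive Temperature         40     77         N/A'

# smart_field NAME: read the table on stdin and print "value threshold"
# for the row whose label is NAME (hypothetical helper).
# The label itself is split into words, so the value is the field that
# comes right after the last word of the label.
smart_field() {
    awk -v name="$1" '
        index($0, name) == 1 {
            n = split(name, words, " ")
            print $(n + 1), $(n + 2)
        }'
}

echo "$sample_output" | smart_field "Drive Temperature"         # prints: 40 77
echo "$sample_output" | smart_field "Reallocated Sector Count"  # prints: 0 90
```

This is also why the script reads `$3` for two-word labels such as "Health Status" and `$4` for three-word labels such as "Power Cycle Count": awk splits on whitespace, so every word of the label counts as a field.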

Running the script prints the SMART information in a table like the sample below:
> ./smart.sh

Drive                                   | Health Status | Drive Temperature | Power-on Hours | Power Cycle Count | Reallocated Sector Count
---------------------------------------------------------------------------------------------------------------------------------------------
eui.0000000001000000b7d6c88d0a080f00      OK              40/77               3118             81                  0/90
t10.NVMe.WD.PC.SN740.SDDPTQE2D2T00.02F2E  OK              42/84               549              1473                0/90
t10.NVMe.EDILOCA.EN870.4TB.4334424002000  OK              44/90               20               3                   0/99
t10.NVMe.SAMSUNG.MZVL22T0HBLB2D00B00.5D8  OK              44/81               11498            346                 0/90
t10.NVMe.EDILOCA.EN870.4TB.5435424002000  OK              44/90               19               4                   0/99
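
With several drives in the table, it can help to print only the rows that need attention. The following is a small sketch, not part of the original script: `flag_unhealthy` is a hypothetical helper that assumes the six whitespace-separated columns shown above and treats any health status other than `OK` as worth a look. The second input row, including its `FAILED` status, is invented for the demonstration.

```shell
#!/bin/sh
# flag_unhealthy: read table rows on stdin and report drives whose
# Health Status column is anything other than "OK" (hypothetical helper).
# Data rows have exactly six fields; the header, separator, timestamp,
# and blank lines have a different field count and are skipped.
flag_unhealthy() {
    awk 'NF == 6 && $2 != "OK" { print "CHECK:", $1, "health=" $2 }'
}

# Demo input: one healthy row and one invented unhealthy row.
printf '%s\n' \
    'eui.0000000001000000b7d6c88d0a080f00 OK 40/77 3118 81 0/90' \
    't10.NVMe.EXAMPLE.DRIVE FAILED 45/80 900 12 0/90' | flag_unhealthy
# prints: CHECK: t10.NVMe.EXAMPLE.DRIVE health=FAILED
```

Drives that do not report SMART data show `N/A` in this column, so they are flagged as well; whether that is desirable depends on your hardware.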

Prerequisites

  • A working ESXi server
  • SSH access enabled on the ESXi server
  • Basic familiarity with shell scripts

The Script

You can download the script here: GitHub Download
Alternatively, copy the complete script below to your ESXi host, save it as smart.sh, and run chmod +x smart.sh to make it executable.

smart.sh:

#!/bin/sh
#ESXi SMART Script by @Upinel https://upinel.github.io. All Rights Reserved.

# Get the list of all device UIDs (identifiers starting with "t10." or "eui.")
device_list=$(esxcli storage core device list | grep -E '^(t10|eui)\.' | awk '{print $1}')

# Timestamp for log
timestamp=$(date '+%Y-%m-%d %H:%M:%S')

# Header for the output ("Drive" is padded to line up with the 41-character name column)
header="Drive                                   | Health Status | Drive Temperature | Power-on Hours | Power Cycle Count | Reallocated Sector Count"
separator="---------------------------------------------------------------------------------------------------------------------------------------------"
echo "$header"
echo "$separator"

# Begin logging output with timestamp (smart.log is created in the current directory)
{
    echo "Timestamp: $timestamp"
    echo "$header"
    echo "$separator"
} >> smart.log

# Function to process device name
process_device_name() {
    local device_name=$1
    # Replace each run of underscores with a single period
    device_name=$(echo "$device_name" | sed 's/_\+/./g')
    # Truncate to a maximum of 40 characters
    device_name=$(echo "$device_name" | cut -c 1-40)
    echo "$device_name"
}

# Iterate through each device UID and fetch its SMART data
for device in $device_list
do
    # Process the device name
    processed_device_name=$(process_device_name "$device")

    # Get the SMART data (quoted so the identifier is passed as one argument)
    output=$(esxcli storage core device smart get -d "$device")

    # Format the output
    formatted_output=$(echo "$output" | awk -v device="$processed_device_name" '
    BEGIN {
        # Initialize default values
        status["Health Status"] = "N/A"
        status["Power-on Hours"] = "N/A"
        status["Drive Temperature"] = "N/A/N/A"
        status["Power Cycle Count"] = "N/A"
        status["Reallocated Sector Count"] = "N/A/N/A"
    }
    # Capture specific parameters and thresholds.
    # Fields are compared against "" so that a legitimate value of 0 is kept.
    /Health Status/  {status["Health Status"] = ($3 != "") ? $3 : "N/A"}
    /Power-on Hours/ {status["Power-on Hours"] = ($3 != "") ? $3 : "N/A"}
    /Drive Temperature/ {
        value = ($3 != "") ? $3 : "-"
        threshold = ($4 != "") ? $4 : "-"
        status["Drive Temperature"] = value "/" threshold
    }
    /Power Cycle Count/ {status["Power Cycle Count"] = ($4 != "") ? $4 : "N/A"}
    /Reallocated Sector Count/ {
        value = ($4 != "") ? $4 : "0"
        threshold = ($5 != "") ? $5 : "-"
        status["Reallocated Sector Count"] = value "/" threshold
    }
    END {
        # Print the results with formatted and truncated drive name (40 characters max)
        printf "%-41s %-15s %-19s %-16s %-19s %-36s\n",
            device,
            status["Health Status"],
            status["Drive Temperature"],
            status["Power-on Hours"],
            status["Power Cycle Count"],
            status["Reallocated Sector Count"]
    }')

    # Append formatted output to the log file
    echo "$formatted_output" >> smart.log
    echo "$formatted_output"
done

# Print an empty line for readability in the log
echo "" >> smart.log
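
The name-shortening step is easier to follow in isolation. The sketch below mirrors the two operations in process_device_name (collapse underscores, then truncate) under a hypothetical name, `shorten_name`, and uses `__*` as a portable way to write "one or more underscores". The raw identifier passed in is invented for illustration and only imitates the underscore padding seen in real device names.

```shell
#!/bin/sh
# shorten_name RAW: collapse each run of underscores into one period,
# then keep only the first 40 characters (hypothetical helper that
# mirrors process_device_name from smart.sh).
shorten_name() {
    echo "$1" | sed 's/__*/./g' | cut -c 1-40
}

# Invented raw identifier with underscore padding.
shorten_name "t10.NVMe____WD_PC_SN740_SDDPTQE2D2T00___________02F2E0123456789"
# prints: t10.NVMe.WD.PC.SN740.SDDPTQE2D2T00.02F2E
```

Truncating to 40 characters keeps the table columns aligned, at the cost of dropping the tail of long identifiers. Two drives of the same model may then differ only in the last few visible characters.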

Conclusion

Regularly checking disk health helps you catch a failing drive before it causes data loss or downtime. The script presented here makes that routine easier by displaying and logging the SMART health status of all drives on your ESXi server in one step.
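
Because every run appends a new block to smart.log, the file grows over time. As a small sketch for reviewing it, the hypothetical helper `last_run` below prints only the most recent block. It assumes each run starts with a "Timestamp: " line, as written by the script, and the demo file path and its contents are invented.

```shell
#!/bin/sh
# last_run FILE: print the most recent run from a smart.log-style file.
# Hypothetical helper; assumes each run begins with a "Timestamp: " line.
last_run() {
    awk '
        /^Timestamp: / { block = "" }      # a new run starts: forget the older one
        { block = block $0 "\n" }          # collect every line of the current run
        END { printf "%s", block }' "$1"
}

# Demo with an invented two-run log.
demo_log=/tmp/smart-demo.log
printf '%s\n' \
    'Timestamp: 2024-01-01 08:00:00' 'driveA OK 40/77 3118 81 0/90' '' \
    'Timestamp: 2024-01-02 08:00:00' 'driveA OK 41/77 3142 81 0/90' '' > "$demo_log"
last_run "$demo_log"
rm -f "$demo_log"
```

Comparing the latest block with an older one shows at a glance whether temperatures or counters are drifting.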