MegaRAID is a family of hardware RAID controllers made by LSI (now Broadcom) and found in a wide range of physical servers. Here at Jaguar Network we run such controllers (although our newer servers use Dell PERC instead), and we manage them directly from Linux with the megacli command.
This memo lists the most useful commands and arguments of this small but powerful tool. It is not full documentation of the tool, just a summary of the commands we use most often.
MegaCli can be used to monitor multiple objects. The most interesting ones are:
- The Adapter: this is the physical adapter itself, connected to the drives and exposing RAID arrays to the OS
- The Virtual Drives (VD): those are the RAID aggregates that are exposed to the operating system
- The Physical Drives (PD): those are the physical disks themselves
- The Backup Battery Unit (BBU): this battery preserves the data held in the controller’s cache during a power outage
The MegaCli command line switches aren’t the most sysadmin-friendly ones, so we will list the commands we commonly use to monitor and diagnose our MegaRAID controllers on Linux systems.
Adapter
# megacli -AdpAllInfo -a0
This very comprehensive command shows all the settings and information related to the adapter itself. If you have multiple controllers, replace -a0 with -aALL to query all of them.
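Since -AdpAllInfo prints a lot, it often helps to keep only a few lines. Below is a minimal sketch of a hypothetical helper, mega_fields, that filters any megacli output down to the "Label : value" lines whose label you pass as arguments. It assumes megacli prints each field as a label, optional padding, a colon and the value; labels are treated as extended regular expressions. The two labels in the example comment are assumptions about what your controller and MegaCli version print, so check the raw output first.

```shell
# Hypothetical helper: keep only the "Label : value" lines of megacli output
# whose label matches one of the arguments (labels are used as extended regexes).
# Example: megacli -AdpAllInfo -a0 | mega_fields "Product Name" "FW Package Build"
mega_fields() {
    if [ "$#" -eq 0 ]; then
        echo "usage: mega_fields LABEL [LABEL...]" >&2
        return 2
    fi
    # Join the labels into an alternation such as "Label A|Label B".
    pattern=$(printf '%s|' "$@")
    pattern=${pattern%|}
    # Allow indentation and padding before the colon, as megacli aligns its labels.
    grep -E "^[[:space:]]*($pattern)[[:space:]]*:"
}
```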
Physical Drives
# megacli -PDList -aALL
The most helpful fields are:
- Media Error Count
- Predictive Failure Count
- Firmware state
- Drive has flagged a S.M.A.R.T alert
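These fields are spread over a long block per drive. A minimal sketch, assuming each drive block starts with an "Enclosure Device ID" line and that the labels appear as listed above (they may differ slightly between MegaCli versions): the hypothetical pd_summary function condenses the output into one line per drive.

```shell
# Hypothetical helper: condense "megacli -PDList -aALL" output to one line per drive.
# Assumes every drive block starts with an "Enclosure Device ID" line and that the
# fields are printed as "Label: value"; labels may differ between MegaCli versions.
# Example: megacli -PDList -aALL | pd_summary
pd_summary() {
    awk -F' *: *' '
        # Print the drive collected so far (if any), then reset all fields.
        function flush() {
            if (enc != "")
                printf "[%s:%s] state=%s media_err=%s pred_fail=%s smart=%s\n",
                    enc, slot, state, media, pred, smart
            enc = slot = state = media = pred = smart = ""
        }
        /^Enclosure Device ID/                     { flush(); enc = $2 }
        /^Slot Number/                             { slot = $2 }
        /^Media Error Count/                       { media = $2 }
        /^Predictive Failure Count/                { pred = $2 }
        /^Firmware state/                          { state = $2 }
        /^Drive has flagged a S\.M\.A\.R\.T alert/ { smart = $2 }
        END { flush() }
    '
}
```

Piping the result through grep -v 'media_err=0 pred_fail=0 smart=No' leaves only the drives with non-zero error counters or a S.M.A.R.T alert. Starting a new record at each "Enclosure Device ID" line, rather than relying on the order of fields inside a block, keeps the parser tolerant of extra lines.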
Virtual Drives
# megacli -LDInfo -Lall -aALL
The most helpful fields are:
- RAID Level
- Default Cache Policy & Current Cache Policy: if the cache is WriteThrough while it should be WriteBack, expect reduced performance
- State: should be Optimal
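These checks lend themselves to a small script. The following is a sketch, not a reference implementation: the hypothetical ld_check function reads -LDInfo output on standard input, prints a line for every VD whose State is not Optimal or whose current cache mode (the first item of the policy, e.g. WriteBack or WriteThrough) differs from its default, and exits with status 1 in that case. It assumes the "Label : value" layout and that the Default Cache Policy line comes before the Current Cache Policy line of the same VD.

```shell
# Hypothetical helper: report VDs that are not Optimal or whose current cache
# mode differs from the default one. Exits 1 if anything was reported.
# Example: megacli -LDInfo -Lall -aALL | ld_check
ld_check() {
    awk -F' *: *' '
        # Header line, e.g. "Virtual Drive: 0 (Target Id: 0)"; some versions print "Virtual Disk".
        /^Virtual (Drive|Disk):/ { split($2, hdr, " "); vd = hdr[1] }
        /^State/ && $2 != "Optimal" {
            printf "VD %s: state is %s\n", vd, $2; bad = 1
        }
        # Only the first comma-separated item (WriteBack/WriteThrough) is compared.
        /^Default Cache Policy/ { split($2, d, ","); def = d[1] }
        /^Current Cache Policy/ {
            split($2, c, ",")
            if (c[1] != def) {
                printf "VD %s: cache is %s, default is %s\n", vd, c[1], def; bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```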
If high iowait is noticed on the server, the Cache Policy has most probably fallen back to WriteThrough (which bypasses the cache) instead of WriteBack.
This is often caused by a defective Backup Battery Unit or by an automatically triggered Relearn Cycle.
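The reasoning above (check for WriteThrough first, then look at the BBU) can be turned into a rough triage helper. This is a sketch under assumptions: cache_triage is a hypothetical name, it works on saved outputs of the -LDInfo and -AdpBbuCmd commands from this memo, and the labels it greps for ("Current Cache Policy", "Learn Cycle Active", "Battery State") are assumed to match your MegaCli version.

```shell
# Hypothetical triage helper for high iowait.
# $1: file containing "megacli -LDInfo -Lall -aALL" output
# $2: file containing "megacli -AdpBbuCmd -aALL" output
# Returns 1 when at least one VD is in WriteThrough, 0 otherwise.
cache_triage() {
    ld_out=$1
    bbu_out=$2
    # Label, optional padding, colon, optional padding (megacli aligns its labels).
    if ! grep -Eq '^[[:space:]]*Current Cache Policy[[:space:]]*:[[:space:]]*WriteThrough' "$ld_out"; then
        echo "No VD reports WriteThrough: the iowait probably has another cause"
        return 0
    fi
    echo "At least one VD is running in WriteThrough"
    if grep -Eq '^[[:space:]]*Learn Cycle Active[[:space:]]*:[[:space:]]*Yes' "$bbu_out"; then
        echo "Probable cause: a Relearn Cycle is in progress"
    elif grep -Eq '^[[:space:]]*Battery State[[:space:]]*:' "$bbu_out" &&
         ! grep -Eq '^[[:space:]]*Battery State[[:space:]]*:[[:space:]]*Optimal' "$bbu_out"; then
        echo "Probable cause: the BBU is not in Optimal state"
    else
        echo "No obvious BBU cause: review the full -AdpBbuCmd output"
    fi
    return 1
}
```

For example, save both outputs with megacli -LDInfo -Lall -aALL > /tmp/ld.txt and megacli -AdpBbuCmd -aALL > /tmp/bbu.txt, then run cache_triage /tmp/ld.txt /tmp/bbu.txt.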
Backup Battery Unit
# megacli -AdpBbuCmd -aALL
There is a lot of information to check here:
- Voltage, Current, Temperature, Battery State: a quick overview of the battery’s health
- Charging Status: should be None
- Learn Cycle Active: if Yes, this might be the cause of poor performance
- Relative State of Charge: should be 100%
- Next Learn time: when the next Relearn Cycle will be triggered
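The same checks can be automated. A minimal sketch, assuming the labels listed above appear as "Label : value" lines, possibly indented: the hypothetical bbu_check function prints a warning for a Charging Status other than None, an active Learn Cycle, or a Relative State of Charge below 100 %, echoes the Next Learn time line, and exits with status 1 if it warned about anything. The non-zero exit status makes it easy to call from a monitoring check.

```shell
# Hypothetical helper: flag the BBU conditions listed above.
# Exits 1 if any warning was printed.
# Example: megacli -AdpBbuCmd -aALL | bbu_check
bbu_check() {
    awk -F' *: *' '
        # Strip indentation (sub-fields may be indented); awk then re-splits $0.
        { sub(/^[ \t]+/, "") }
        $1 == "Charging Status" && $2 != "None" {
            print "Charging Status is " $2; bad = 1
        }
        $1 == "Learn Cycle Active" && $2 == "Yes" {
            print "A Learn Cycle is active"; bad = 1
        }
        # "100 %" converts to the number 100, so compare numerically.
        $1 == "Relative State of Charge" && $2 + 0 < 100 {
            print "Relative State of Charge is " $2; bad = 1
        }
        # The value may itself contain colons (a time of day), so print the whole line.
        $1 == "Next Learn time" { print }
        END { exit (bad ? 1 : 0) }
    '
}
```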
A Relearn Cycle discharges the battery and then recharges it to determine its capacity. The controller runs it automatically on a regular basis to keep track of the battery’s current capacity.
A side effect of this procedure is that when the battery charge drops below a certain threshold, the controller considers the battery defective and stops caching writes (the cache falls back to WriteThrough), which can have a huge impact on performance depending on the server’s workload.
You can also use MegaCli to configure the controller.
Change Disk State to Online
# megacli -PDOnline -PhysDrv [enclosure:slot] -a0
Change Disk State to Offline
# megacli -PDOffline -PhysDrv [enclosure:slot] -a0
Mark Disk as Missing
# megacli -PDMarkMissing -PhysDrv [enclosure:slot] -a0
Prepare Disk for Removal
# megacli -PdPrpRmv -PhysDrv [enclosure:slot] -a0
Rebuild Disk (Start, Stop, Show Progress)
# megacli -PDRbld -Start -PhysDrv [enclosure:slot] -a0
# megacli -PDRbld -Stop -PhysDrv [enclosure:slot] -a0
# megacli -PDRbld -ShowProg -PhysDrv [enclosure:slot] -a0
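A rebuild can take hours, so rather than re-running -ShowProg by hand you may want to poll it. This is a sketch under assumptions: wait_rebuild is a hypothetical function, and it assumes the -ShowProg output contains the progress as a percentage (such as "Completed 45%") while a rebuild is running, and no percentage otherwise. Note the quotes around the bracketed drive: unquoted, the shell treats [enclosure:slot] as a glob pattern and could replace it with a matching file name from the current directory.

```shell
# Hypothetical helper: poll -PDRbld -ShowProg until the rebuild reports 100%
# or stops reporting a percentage (assumed to mean "not rebuilding").
# Usage: wait_rebuild ENCLOSURE:SLOT [INTERVAL_SECONDS]
# Set MEGACLI to override the binary name (e.g. MegaCli64).
wait_rebuild() {
    drive=$1
    interval=${2:-60}
    while :; do
        # Quote the brackets so the shell never expands [enc:slot] as a glob.
        status=$("${MEGACLI:-megacli}" -PDRbld -ShowProg -PhysDrv "[$drive]" -a0)
        pct=$(printf '%s\n' "$status" | grep -o '[0-9]\{1,3\}%' | head -n 1)
        if [ -z "$pct" ]; then
            # No percentage found: show the raw status and stop polling.
            printf '%s\n' "$status"
            return 0
        fi
        echo "$(date '+%F %T') rebuild of [$drive]: $pct"
        [ "$pct" = "100%" ] && return 0
        sleep "$interval"
    done
}
```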
Reset Disk Status to Good
# megacli -PDMakeGood -PhysDrv [enclosure:slot] -a0
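All the per-disk commands above share the same -PhysDrv [enclosure:slot] -aN shape, so a thin wrapper can validate the drive argument before anything reaches the controller. The pd_action function below is hypothetical: it handles a single drive, assumes numeric enclosure and slot IDs (as printed in the Enclosure Device ID and Slot Number fields of -PDList), and quotes the brackets so the shell cannot expand them as a glob.

```shell
# Hypothetical wrapper for the per-disk commands above.
# Usage: pd_action ACTION ENCLOSURE:SLOT [ADAPTER]
#   e.g. pd_action -PDMakeGood 32:4
# Assumes numeric enclosure and slot IDs; set MEGACLI to override the binary name.
pd_action() {
    action=$1
    drive=$2
    adapter=${3:-0}
    # Accept exactly "digits:digits"; reject anything else before calling megacli.
    case "$drive" in
        "" | *[!0-9:]* | :* | *: | *:*:*) valid=no ;;
        *:*) valid=yes ;;
        *) valid=no ;;
    esac
    if [ "$valid" != yes ]; then
        echo "pd_action: expected ENCLOSURE:SLOT, got '$drive'" >&2
        return 2
    fi
    "${MEGACLI:-megacli}" "$action" -PhysDrv "[$drive]" "-a$adapter"
}
```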
Force Cache Policy to WriteBack
This forces the cache into WriteBack mode even if the BBU is defective. It is very useful during an incident caused by a Relearn Cycle having been triggered.
# megacli -LDSetProp ForcedWB -Immediate -Lall -aALL
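Since forcing WriteBack is usually a reaction to an incident, a cautious approach is to apply it only when a VD actually reports WriteThrough. The sketch below does that with the -LDInfo and -LDSetProp commands from this memo; force_wb_if_needed is a hypothetical name, and the "Current Cache Policy" label is assumed to match your output. Keep in mind that the battery exists to protect cached writes: with ForcedWB and a defective or discharged BBU, a power outage can lose data that was still in the controller cache.

```shell
# Hypothetical helper: apply ForcedWB only when at least one VD is in WriteThrough.
# Set MEGACLI to override the binary name.
force_wb_if_needed() {
    mc=${MEGACLI:-megacli}
    if "$mc" -LDInfo -Lall -aALL |
        grep -Eq '^[[:space:]]*Current Cache Policy[[:space:]]*:[[:space:]]*WriteThrough'; then
        echo "WriteThrough detected: forcing WriteBack on all VDs"
        "$mc" -LDSetProp ForcedWB -Immediate -Lall -aALL
    else
        echo "No VD reports WriteThrough: nothing to do"
    fi
}
```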