Sunday, March 3, 2013

VMware ESXi 5.1 RAID Email Alerts

So I bought myself a new 3ware 9650SE-4LPML RAID controller for my ESXi 5.1 server and ran into a few issues with email alerts. It installed just fine, but there wasn't any software that would automatically check the status of the array and email me if something was wrong. After a little searching, I found that an app called tw_cli can check the status of the RAID array from the command line. You should be able to download it here:

http://www.lsi.com/downloads/Public/SATA/SATA%20Common%20Files/CLI_linux-from_the_10.2.2.1_9.5.5.1_codesets.zip
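The script below pulls each unit's status out of tw_cli's output by keeping the lines that start with "u" and printing their third column. Here is a minimal sketch of that parsing step, run against a hypothetical sample of "tw_cli info c0 unitstatus" output. The sample text and the function names parse_unit_status and sample_output are illustrative assumptions, so check the column layout against your own controller's output.

```sh
#!/bin/sh
# Sketch: collapse the per-unit status column of tw_cli output into one line.
# ASSUMPTION: unit lines start with "u" and the status is the 3rd column,
# which is what the script in this post relies on.

parse_unit_status() {
  # Reads tw_cli output on stdin; prints the statuses space-separated, e.g. "OK DEGRADED"
  grep -E '^u' | awk '{ printf "%s%s", sep, $3; sep = " " } END { print "" }'
}

# Hypothetical sample output, for illustration only.
sample_output() {
  cat <<'EOF'

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    OK             -       -       64K     1396.95   RiW    ON
u1    RAID-1    DEGRADED       -       -       -       465.651   RiW    ON
EOF
}

sample_output | parse_unit_status
```

On the ESXi host you would pipe the real output in instead, e.g. /vmfs/volumes/Datastore/3Ware/tw_cli info c0 unitstatus | parse_unit_status, where c0 is whatever controller name tw_cli info reports.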

The real issue arose when I wanted to send email using Gmail as my SMTP server. Figuring out the syntax was a real headache. I finally got it working using openssl, as shown in the code below. The trick was to sleep between lines of input so Gmail's servers had time to respond. Without the sleep commands, openssl would just hang after about two lines of input.
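The AUTH PLAIN step expects the base64 encoding of a NUL byte, the username, another NUL byte and the password (the SASL PLAIN format from RFC 4616). A minimal sketch of building that token, assuming openssl is on the path as it is on the ESXi host; auth_plain_token is a hypothetical name. It uses printf's %s rather than echo -ne "\0"$PASSWORD, so a password that begins with a digit is not misread as part of an octal escape, and -A keeps openssl from wrapping long output onto a second line.

```sh
#!/bin/sh
# Sketch: build the SMTP AUTH PLAIN token ("\0user\0password", base64-encoded).

auth_plain_token() {
  # $1 = SMTP username, $2 = SMTP password
  # printf inserts both values verbatim; -A emits the base64 on a single line
  printf '\0%s\0%s' "$1" "$2" | openssl base64 -A
}

auth_plain_token user pass; echo   # prints AHVzZXIAcGFzcw==
```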

VMware makes you jump through a few hoops to schedule tasks, and you have to be careful where you put files so they aren't erased after a reboot. For ESXi 5.1, edit /etc/rc.local.d/local.sh and add the following lines:
/bin/kill $(cat /var/run/crond.pid)
/bin/echo "*/5 * * * * /vmfs/volumes/vol/emailalerts/tw_diskcheck" >> /var/spool/cron/crontabs/root
/usr/lib/vmware/busybox/bin/busybox crond

This sets up a cron job that runs the script every 5 minutes. You also have to make sure your files are stored on a datastore under /vmfs/volumes so they don't get erased after a reboot.
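Because these lines append to root's crontab every time they run, running them by hand more than once (for example while testing without a reboot) leaves duplicate entries. A small guard, sketched here as a hypothetical helper add_cron_line (not something ESXi provides), appends the entry only if that exact line is not already present. The crontab may need chmod u+w first, as the script's comments note.

```sh
#!/bin/sh
# Sketch: append a cron entry only if it is not already present.
# add_cron_line is a hypothetical helper for illustration.

add_cron_line() {
  # $1 = crontab file, $2 = full cron line
  # grep -qxF: quiet, whole-line match, pattern taken as a fixed string
  if ! grep -qxF "$2" "$1" 2>/dev/null; then
    echo "$2" >> "$1"
  fi
}

# On the ESXi host, between the kill and crond lines above:
# add_cron_line /var/spool/cron/crontabs/root \
#   "*/5 * * * * /vmfs/volumes/vol/emailalerts/tw_diskcheck"
```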

Hopefully this script can also help someone who wants to use openssl to send email through Gmail. That was a real pain to get working.
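The core of the openssl approach is feeding SMTP commands slowly enough that the server has answered each one before the next arrives. A minimal sketch that separates composing the session from sending it; smtp_dialogue and PAUSE are illustrative names. The command sequence follows the script's, with From:/To: headers, the blank line that ends the headers, and dot-stuffing of body lines that begin with "." (as SMTP requires) added as assumptions of this sketch.

```sh
#!/bin/sh
# Sketch: print an SMTP session, pausing between commands so the server
# can reply before the next one arrives. Pipe it into openssl s_client.

PAUSE=${PAUSE:-3}   # seconds to wait after each command (the script uses 3)

smtp_dialogue() {
  # $1 = AUTH PLAIN token, $2 = from, $3 = to, $4 = subject, $5 = body
  for line in "EHLO localhost" "AUTH PLAIN $1" "MAIL FROM: <$2>" \
              "RCPT TO: <$3>" "DATA"; do
    echo "$line"
    sleep "$PAUSE"
  done
  echo "From: <$2>"
  echo "To: <$3>"
  echo "Subject: $4"
  echo ""                                   # blank line ends the headers
  printf '%s\n' "$5" | sed 's/^\./../'      # dot-stuff lines starting with "."
  echo "."                                  # a lone dot ends the message
  sleep "$PAUSE"
  echo "QUIT"
}

# Example (not run here):
# smtp_dialogue "$TOKEN" me@gmail.com you@example.com "test" "hello" |
#   openssl s_client -connect smtp.gmail.com:465 -crlf -ign_eof
```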

UPDATE! Thanks to Paul Atherton for creating a modified version of the script. Apparently there were some issues with the way the date was parsed outside of the USA. Paul also added some enhancements to the script, including much better comments, and his version is now below. My old script is still running on my ESX system and working fine. In case anyone wants to reference the old version, it is located here. Thanks Paul!
#!/bin/sh
# To set up the script to run every 5 minutes via cron, edit /etc/rc.local.d/local.sh (ESXi 5.1; /etc/rc.local on ESXi 5.0 and earlier) and add the following lines:
# /bin/kill $(cat /var/run/crond.pid)
# /bin/echo "*/5  *    *   *   *   /vmfs/volumes/Datastore/3Ware/tw_diskcheck" >> /var/spool/cron/crontabs/root
# /bin/crond

# To set this up immediately (without a reboot), write these lines to a script and prefix them with:
# chmod u+w /var/spool/cron/crontabs/root
# Save the script, make it executable (chmod 755 script_name), and run it directly (./script_name).
# If all is working, within 5 minutes a lol.log file should appear in /vmfs/volumes/Datastore/3Ware/ and you should receive your first status email.

# User defined variables
USERNAME=myemail@gmail.com              # your SMTP username
PASSWORD=mypassword                     # your SMTP password
ADDRESS=smtp.gmail.com                  # your SMTP server FQDN
PORT=465                                # your SMTP server port number
TO=toemail@mymaildomain.com             # your destination e-mail address
FROM=senderemail@mydomain.com           # the sending e-mail address
PROG_PATH=/vmfs/volumes/Datastore/3Ware # the server path of this script (and tw_cli)

LOCALHOST=localhost
SLEEP=3

# Create log file if it doesn't exist - used to record changes in unit status
if [ ! -f $PROG_PATH/lol.log ]; then
  echo `date`" START OF FILE" > $PROG_PATH/lol.log
fi

# Create the firewall exception file and refresh the firewall to apply it - runs only if not already present
# (a reboot will lose the exception and file, so the first run of this script after a reboot will re-create it)

if [ ! -f /etc/vmware/firewall/email.xml ]; then
  cat > /etc/vmware/firewall/email.xml <<EOF
<ConfigRoot>
  <service>
    <id>email</id>
    <rule id='0000'>
      <direction>outbound</direction>
      <protocol>tcp</protocol>
      <porttype>dst</porttype>
      <port>$PORT</port>
    </rule>
    <enabled>true</enabled>
    <required>false</required>
  </service>
</ConfigRoot>
EOF
  esxcli network firewall refresh
fi

# Test up to 3 times to see if firewall rule is present
for i in 1 2 3
do
  WORKING_EMAIL=`esxcli network firewall ruleset list | grep email | awk '{print $2}'`
  echo "Checking Firewall rule exists - attempt: "$i
  if [ "$WORKING_EMAIL" = true ]; then
    echo "Firewall rule checked out OK on attempt: "$i
    break
  fi  
done
if [ "$WORKING_EMAIL" != true ]; then
  echo `date`" After 3 attempts the firewall rule could not be detected. Aborting." # >> $PROG_PATH/lol.log
  exit
fi

TWCLI=$PROG_PATH/tw_cli
ENC_PASS=`printf '\0%s\0%s' "$USERNAME" "$PASSWORD" | openssl base64 -A` # encode username and password for AUTH PLAIN (printf avoids octal-escape surprises; -A keeps the output on one line)
CTL_NAME=`$TWCLI info|grep -E "^c"|awk '{print $1}'` #get controller name

# Get day name (e.g. "Sun") for use below in Sunday status update
DAY=`date +%a`

# Build time as a serial (HHMMSS, e.g. 100507 for 10:05:07) - used as time source for Sunday status update
# (asking date for the exact fields avoids depending on the layout of its default output)
TIME=`date +%H%M%S`

# Get unit status for each unit - all on one line - each unit status separated by a space
UNITSTATUS=`$TWCLI info $CTL_NAME unitstatus|grep -E "^u"|awk '{printf "%s ",$3}'|sed 's/ *$//'`

# Get the last unit status report from the log file
LAST_STATUS=`tail -1 $PROG_PATH/lol.log`

# Write status to screen
echo "Previous Unit Status   (from log): "$LAST_STATUS
echo "Current Unit Status (from tw_cli): "$UNITSTATUS

# If the unit status has changed since the last log report then...
if [ "$UNITSTATUS" != "$LAST_STATUS" ]; then
  # Compose and send the e-mail
  (
    echo -e "EHLO $LOCALHOST"
    echo -e "AUTH PLAIN $ENC_PASS"
    echo -e "MAIL FROM: <$FROM>"; sleep $SLEEP
    echo -e "RCPT TO: <$TO>"; sleep $SLEEP
    echo -e 'DATA'; sleep $SLEEP
    echo -e "From: <$FROM>"
    echo -e "To: <$TO>"
    echo -e "Subject: `hostname` DISK STATUS: $UNITSTATUS"
    echo ""   # blank line ends the headers; the tw_cli report is the body
    sleep $SLEEP
    $TWCLI info $CTL_NAME; sleep $SLEEP
    echo -e '.'; sleep $SLEEP
    echo -e 'quit'
  ) | openssl s_client -pause -connect $ADDRESS:$PORT -ign_eof -crlf
  # then write the new status update to the log
  echo `date` >> $PROG_PATH/lol.log
  echo $UNITSTATUS >> $PROG_PATH/lol.log
fi

# Email once on Sunday around 10am. Lets me know the script is still running.
# The window is one 5-minute cron interval wide (10:00:00 to 10:04:59), so exactly one run matches.
if [ "$DAY" = "Sun" ] && [ "$TIME" -ge 100000 ] && [ "$TIME" -lt 100500 ]; then
  (
    echo -e "EHLO $LOCALHOST"
    echo -e "AUTH PLAIN $ENC_PASS"
    echo -e "MAIL FROM: <$FROM>"; sleep $SLEEP
    echo -e "RCPT TO: <$TO>"; sleep $SLEEP
    echo -e 'DATA'; sleep $SLEEP
    echo -e "From: <$FROM>"
    echo -e "To: <$TO>"
    echo -e "Subject: `hostname` WEEKLY DISK CHECK: $UNITSTATUS"
    echo ""   # blank line ends the headers; the tw_cli report is the body
    sleep $SLEEP
    $TWCLI info $CTL_NAME; sleep $SLEEP
    echo -e '.'; sleep $SLEEP
    echo -e 'quit'
  ) | openssl s_client -pause -connect $ADDRESS:$PORT -ign_eof -crlf
  echo `date` >> $PROG_PATH/lol.log
  # No leading space: the next run compares this line with the current status
  echo $UNITSTATUS >> $PROG_PATH/lol.log
fi
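The firewall check in the script makes its three attempts back to back with no pause, so if the refreshed ruleset takes a moment to show up, all three attempts may miss it. A hedged alternative is a small retry helper that waits between attempts. The names retry and email_rule_enabled and the 2-second delay in the example are hypothetical choices; email_rule_enabled simply wraps the same esxcli pipeline the script uses.

```sh
#!/bin/sh
# Sketch: run a command up to N times, sleeping between failed attempts.
# retry and email_rule_enabled are hypothetical helpers, not ESXi tools.

retry() {
  # $1 = max attempts, $2 = seconds to wait between attempts, rest = command
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "Succeeded on attempt: $i"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Same check the script performs: is the "email" ruleset listed as enabled?
email_rule_enabled() {
  [ "`esxcli network firewall ruleset list | grep email | awk '{print $2}'`" = true ]
}

# Example (on the ESXi host):
# retry 3 2 email_rule_enabled || exit 1
```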
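The alerting logic boils down to "compare the current status with the last line of the log, and only e-mail and log when they differ". A sketch of that comparison as two small functions; status_changed and record_status are hypothetical names, and the log layout (a date line followed by a status line) matches what the script writes.

```sh
#!/bin/sh
# Sketch: detect a change in RAID unit status against the last logged value.

status_changed() {
  # $1 = log file, $2 = current status; succeeds if they differ
  last=$(tail -n 1 "$1" 2>/dev/null)
  [ "$2" != "$last" ]
}

record_status() {
  # $1 = log file, $2 = status; same two-line layout as the script's log
  date >> "$1"
  echo "$2" >> "$1"
}
```

Because the comparison is an exact string match, anything else written as the last log line (even a stray leading space) counts as a change and triggers another e-mail on the next run.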
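Since cron starts the script every 5 minutes, a "once a week" check only fires once if its time window is exactly one cron interval wide. A sketch of that test as a predicate, assuming the day comes from date +%a and the time from date +%H%M%S; weekly_slot_due is a hypothetical name.

```sh
#!/bin/sh
# Sketch: true for exactly one 5-minute cron run per week (Sunday 10:00:00-10:04:59).

weekly_slot_due() {
  # $1 = abbreviated day name, $2 = time as HHMMSS
  [ "$1" = "Sun" ] && [ "$2" -ge 100000 ] && [ "$2" -lt 100500 ]
}

# Example: weekly_slot_due "$(date +%a)" "$(date +%H%M%S)" && echo "send weekly report"
```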