Sunday, March 3, 2013

VMware ESXi 5.1 RAID Email Alerts

So I bought myself a new 3ware 9650SE-4LPML RAID controller for my ESXi 5.1 server and ran into a few issues with email alerts. It installed just fine, but there wasn't any software that would automatically check the status of the array and email me if something was wrong. After a little searching, I found an app called tw_cli that can check the status of the RAID array from the command line. You should be able to download it here:

http://www.lsi.com/downloads/Public/SATA/SATA%20Common%20Files/CLI_linux-from_the_10.2.2.1_9.5.5.1_codesets.zip

The real issue arose when I wanted to send an email using Gmail as my SMTP server. Figuring out the syntax was a bunch of headaches. I finally got it working using openssl, as can be seen in the code below. The trick was making the input sleep between commands so Gmail's servers have time to respond. Without the sleep command, openssl would just hang after about 2 lines of input.
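
To make the sleep trick concrete, here is a minimal sketch of the idea on its own, separate from the full script further down. The function name smtp_dialogue and its argument order are made up for illustration; the openssl flags in the usage comment are the same ones the script uses.

```shell
#!/bin/sh
# Sketch: emit SMTP commands one at a time, pausing after each so the
# server has time to answer. smtp_dialogue is an illustrative name only.
smtp_dialogue() {
  # $1 = AUTH PLAIN token, $2 = sender, $3 = recipient,
  # $4 = subject, $5 = body text, $6 = pause in seconds
  for line in \
    "EHLO localhost" \
    "AUTH PLAIN $1" \
    "MAIL FROM: <$2>" \
    "RCPT TO: <$3>" \
    "DATA"
  do
    printf '%s\n' "$line"
    sleep "$6"
  done
  # Header, blank line, body, then a lone dot to end the message
  printf 'Subject: %s\n\n%s\n.\n' "$4" "$5"
  sleep "$6"
  printf 'quit\n'
}

# Usage (not run here, it needs network access and real credentials):
# smtp_dialogue "$ENC_PASS" "$FROM" "$TO" "RAID status" "all units OK" 3 |
#   openssl s_client -pause -connect smtp.gmail.com:465 -ign_eof -crlf
```

Keeping the dialogue in a function means the pause can be set to 0 for a dry run that just prints the commands.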

VMware makes you jump through a bunch of hoops to schedule tasks and to keep files from being erased after a reboot. For ESXi 5.1 you have to edit /etc/rc.local.d/local.sh and add the following lines:
/bin/kill $(cat /var/run/crond.pid)
/bin/echo "*/5 * * * * /vmfs/volumes/vol/emailalerts/tw_diskcheck" >> /var/spool/cron/crontabs/root
/usr/lib/vmware/busybox/bin/busybox crond

This will set up a cron job so the script runs every 5 minutes. You also have to make sure your files are stored on a volume under /vmfs/volumes so they don't get erased after a reboot.
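
One thing to watch: those lines append to the crontab every time they run, so running them twice between reboots leaves two entries and the script fires twice per interval. A small guard avoids that. This is only a sketch; add_cron_entry is a made-up helper name, and the path in the usage comment is the example path from above.

```shell
#!/bin/sh
# Sketch: append a cron line only if the exact same line is not already there.
# add_cron_entry is an illustrative helper, not part of the original post.
add_cron_entry() {
  # $1 = crontab file, $2 = full cron line
  grep -qxF "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Usage (not run here, it would modify the real crontab):
# add_cron_entry /var/spool/cron/crontabs/root \
#   "*/5 * * * * /vmfs/volumes/vol/emailalerts/tw_diskcheck"
```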

Hopefully this script can also help someone who wants to use openssl to send email using gmail. That was a real pain to get working.
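
For anyone borrowing just the Gmail part: the AUTH PLAIN token is the base64 encoding of a NUL byte, the username, another NUL byte, and the password. Here is a minimal sketch of building it. auth_plain_token is a made-up helper name, and the fallback to the base64 command is my own assumption for machines without openssl. The -A flag keeps openssl from wrapping long output across several lines, which would break the AUTH line.

```shell
#!/bin/sh
# Sketch: build an SMTP AUTH PLAIN token (NUL + username + NUL + password, base64).
# auth_plain_token is an illustrative name only.
auth_plain_token() {
  # $1 = SMTP username, $2 = SMTP password
  if command -v openssl >/dev/null 2>&1; then
    # -A writes the base64 on a single line, however long it is
    printf '\0%s\0%s' "$1" "$2" | openssl base64 -A
  else
    # Assumed fallback: plain base64 with any line breaks removed
    printf '\0%s\0%s' "$1" "$2" | base64 | tr -d '\n'
  fi
}

auth_plain_token user pass; echo   # prints AHVzZXIAcGFzcw==
```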

UPDATE! Thanks to Paul Atherton for creating a modified version of the script. Apparently there were some issues with the way the date was parsed outside of the USA. Paul also added some enhancements, including much better comments, to the script, which is now below. My old script is still running on my ESX system and working fine. In case anyone wants to reference the old version it is located here. Thanks Paul!
# To set up to run every 5 mins via cron, edit /etc/rc.local (on ESXi 5.1, /etc/rc.local.d/local.sh) and add the following lines:
# /bin/kill $(cat /var/run/crond.pid)
# /bin/echo "*/5  *    *   *   *   /vmfs/volumes/Datastore/3Ware/tw_diskcheck" >> /var/spool/cron/crontabs/root
# /bin/crond

# To set this up instantly (before a reboot), write the lines above to a script, with this line added first:
# chmod u+w /var/spool/cron/crontabs/root
# Save the script, make it executable (chmod 755 script_name), and run it directly (./script_name).
# If all is working, after 5 mins a lol.log file should appear in /vmfs/volumes/Datastore/3Ware/ and you should receive your first status email.

# User defined variables
USERNAME=myemail@gmail.com              # your SMTP username
PASSWORD=mypassword                     # your SMTP password
ADDRESS=smtp.gmail.com                  # your SMTP server FQDN
PORT=465                                # your SMTP server port number
TO=toemail@mymaildomain.com             # your destination e-mail address
FROM=senderemail@mydomain.com           # the sending e-mail address
PROG_PATH=/vmfs/volumes/Datastore/3Ware # the server path of this script (and tw_cli)

LOCALHOST=localhost
SLEEP=3

# Create log file if it doesn't exist - used to record changes in unit status
if [ ! -f $PROG_PATH/lol.log ]; then
  echo `date`" START OF FILE" > $PROG_PATH/lol.log
fi

# Create Firewall Exception file and restart service to apply - runs only if not already present
# (a restart will lose the exception and file, so first run of this script will re-create it)

# Tag names below follow the standard ESXi firewall service file layout
if [ ! -f /etc/vmware/firewall/email.xml ]; then
  echo "<!-- Firewall configuration information for email -->" > /etc/vmware/firewall/email.xml
  echo "<ConfigRoot>" >> /etc/vmware/firewall/email.xml
  echo "    <service>" >> /etc/vmware/firewall/email.xml
  echo "        <id>email</id>" >> /etc/vmware/firewall/email.xml
  echo "        <rule id='0000'>" >> /etc/vmware/firewall/email.xml
  echo "            <direction>outbound</direction>" >> /etc/vmware/firewall/email.xml
  echo "            <protocol>tcp</protocol>" >> /etc/vmware/firewall/email.xml
  echo "            <porttype>dst</porttype>" >> /etc/vmware/firewall/email.xml
  echo "            <port>$PORT</port>" >> /etc/vmware/firewall/email.xml
  echo "        </rule>" >> /etc/vmware/firewall/email.xml
  echo "        <enabled>true</enabled>" >> /etc/vmware/firewall/email.xml
  echo "        <required>false</required>" >> /etc/vmware/firewall/email.xml
  echo "    </service>" >> /etc/vmware/firewall/email.xml
  echo "</ConfigRoot>" >> /etc/vmware/firewall/email.xml
  esxcli network firewall refresh
fi

# Test up to 3 times to see if firewall rule is present
for i in 1 2 3
do
  WORKING_EMAIL=`esxcli network firewall ruleset list | grep email | awk '{print $2}'`
  echo "Checking Firewall rule exists - attempt: "$i
  if [ "$WORKING_EMAIL" = true ]; then
    echo "Firewall rule checked out OK on attempt: "$i
    break
  fi
  sleep 1   # give the firewall refresh a moment before the next attempt
done
if [ "$WORKING_EMAIL" != true ]; then
  echo `date`" After 3 attempts the firewall rule could not be detected. Aborting." # >> $PROG_PATH/lol.log
  exit
fi

TWCLI=$PROG_PATH/tw_cli
ENC_PASS=`echo -ne "\0"$USERNAME"\0"$PASSWORD | openssl base64` #encode username and password
CTL_NAME=`$TWCLI info|grep -E "^c"|awk '{print $1}'` #get controller name

# Get day name for use below in Sunday status update
DAY=`date|awk '{print $1}'`

# Build time as a serial - i.e. remove colons - used as time source for Sunday status update
TIME=`date|awk '{print $4}'`
HH=`echo $TIME | awk -F\: '{print $1}'`
MM=`echo $TIME | awk -F\: '{print $2}'`
SS=`echo $TIME | awk -F\: '{print $3}'`
TIME=$HH$MM$SS

# Get unit status for each unit - all on one line - each unit status separated by a space
UNITSTATUS=`$TWCLI info $CTL_NAME unitstatus|grep -E "^u"|awk '{printf "%s ",$3}'|sed 's/ *$//'`

# Get the last unit status report from the log file
LAST_STATUS=`tail -1 $PROG_PATH/lol.log`

# Write status to screen
echo "Previous Unit Status   (from log): "$LAST_STATUS
echo "Current Unit Status (from tw_cli): "$UNITSTATUS

# If the unit status has changed since the last log report then...
if [ "$UNITSTATUS" != "$LAST_STATUS" ]; then
  # Compose and send the e-mail
  (echo -e "EHLO $LOCALHOST";echo -e "AUTH PLAIN $ENC_PASS";echo -e "MAIL FROM: <$FROM>";sleep $SLEEP;echo -e "RCPT TO: <$TO>";sleep $SLEEP;echo -e 'DATA';sleep $SLEEP;echo -e "SUBJECT: `hostname` DISK STATUS: $UNITSTATUS";sleep $SLEEP;$TWCLI info $CTL_NAME;sleep $SLEEP;echo -e '.';sleep $SLEEP;echo -e 'quit')|openssl s_client -pause -connect $ADDRESS:$PORT -ign_eof -crlf
  # then write the new status update to the log
  echo `date` >> $PROG_PATH/lol.log
  echo $UNITSTATUS >> $PROG_PATH/lol.log
fi

# Email once on Sunday around 10am. Lets me know the script is still running.
# The window is 5 minutes wide so that only one run of the 5-minute cron job falls inside it.
if [ "$DAY" = "Sun" ] && [ "$TIME" -ge "100000" ] && [ "$TIME" -lt "100500" ]; then
  (echo -e "EHLO $LOCALHOST";echo -e "AUTH PLAIN $ENC_PASS";echo -e "MAIL FROM: <$FROM>";sleep $SLEEP;echo -e "RCPT TO: <$TO>";sleep $SLEEP;echo -e 'DATA';sleep $SLEEP;echo -e "SUBJECT: `hostname` WEEKLY DISK CHECK: $UNITSTATUS";sleep $SLEEP;$TWCLI info $CTL_NAME;sleep $SLEEP;echo -e '.';sleep $SLEEP;echo -e 'quit')|openssl s_client -pause -connect $ADDRESS:$PORT -ign_eof -crlf
  echo `date` >> $PROG_PATH/lol.log
  echo "$UNITSTATUS" >> $PROG_PATH/lol.log
fi
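
The day and time handling above can also be done without splitting up the default date output. This is a sketch of an alternative, not what the script does: it asks date for a numeric weekday and a fixed HH:MM:SS string (assuming the host's date supports the %u and %H:%M:%S formats) and matches a 5-minute window with a case pattern. in_weekly_window is a made-up name.

```shell
#!/bin/sh
# Sketch: decide whether a given moment falls inside the weekly status window.
# in_weekly_window is an illustrative helper, not part of the script above.
in_weekly_window() {
  # $1 = weekday number from "date +%u" (1 = Monday ... 7 = Sunday)
  # $2 = time as HH:MM:SS
  [ "$1" -eq 7 ] || return 1
  case "$2" in
    10:0[0-4]:*) return 0 ;;   # 10:00:00 up to 10:04:59
    *)           return 1 ;;
  esac
}

if in_weekly_window "$(date +%u)" "$(date +%H:%M:%S)"; then
  echo "inside the weekly window"
else
  echo "outside the weekly window"
fi
```

Matching on the text pattern avoids numeric comparisons on values with leading zeros, and the numeric weekday does not depend on how day names are spelled.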

30 comments:

  1. Can't believe this is such a recent post - I just started trying to figure out exactly how to do this very same task for my 3Ware 9650 controller!

    I get an error if I try to run the script directly from an ssh session:

    /tw_diskcheck: line 1: syntax error: Bad substitution

    Any ideas how I should debug this? Sorry, I am new to shell scripting in ESXi.

    Thanks, Paul A

    Replies
    1. Did you directly copy the script from above? What is the first line of your script?

  2. Josh, I copied the script exactly from the above text into a text file on my mac and scp'd it to my ESXi server path: /vmfs/volumes/Datastore/3Ware to sit alongside tw_cli

    I changed only the values assigned to - USERNAME, PASSWORD, ADDRESS, TO, FROM

    plus also changed PROG_PATH=/vmfs/volumes/Datastore/3Ware

    Aside from my initial issue which is still present, my password does have a # character in it, so not sure how the script handles the setting of this env variable. Can I quote it in some way so the script doesn't see it as a comment?

    Replies
    1. Are you sure you are on esxi5.1?

      Are tw_cli and tw_diskcheck executable?

      Can you run
      ./tw_cli info
      from your 3Ware directory?

      You should be able to put single quotes around your password value if you want to, but it is not required. I tested it on my machine without the quotes and it worked fine. You could test it in your script by adding the line

      echo $PASSWORD

      after your PASSWORD=P@ssw1th# line

    2. Yes to all your questions.
      It's ESXi 5.0 Update 1.
      Yes, both scripts are executable and they do execute - otherwise there wouldn't be a script error! tw_cli is fully functional.

      Regarding the password - I changed it to not have the # and got the same results (I knew the script error had nothing to do with the password in any case).

      In your article, I think you may also have a bit missing in the cron bit:

      above you have:
      /bin/echo "*/5 * * * * /vmfs/volumes/vol/emailalerts/tw_diskcheck" >> /var/spool/cron/crontabs/root
      when it should surely read:
      /bin/echo "*/5 * * * * /vmfs/volumes/vol/emailalerts/tw_diskcheck/tw_diskcheck" >> /var/spool/cron/crontabs/root

      ...assuming your folder is called tw_diskcheck and your script is also named this.

    3. Josh, Thinking it may be time related. I am in UK, so UK date format is dd/mm/yy.

      Also note, your script has:
      DAY=`date|awk '{print $1}'`
      TIME=`date|awk '{print $4}'`
      TIME=`echo ${TIME//:/}`

      note, you are assigning one time variable to overwrite the other - unless I'm mistaken with how things work in UNIX?

    4. I commented out the second TIME line and the script no longer errors. I don't receive an e-mail though, so I'm not sure what is still going wrong.

  3. I understand your second TIME variable assignment now (in order to remove colon chars from the time), but this does not seem to work on my esxi server - no matter what chars I enter for search or substitution using this ${varname//search/replace} notation, it always reports the error:
    ./tw_diskcheck: line 1: syntax error: Bad substitution
    The Line 1 report is a mis-interpretation. If you set the variable assignment to:
    TIME=${TIME//:/} then it reports the error on line 37

    is the : removal required for this script to work?

  4. Josh,
    I managed to figure an alternate way to get the DATE variable set and this is now fine. I used this code instead:

    TIME=`date|awk '{print $4}'`
    # TIME=`echo ${TIME//:/}`
    HH=`echo $TIME | awk -F\: '{print $1}'`
    MM=`echo $TIME | awk -F\: '{print $2}'`
    SS=`echo $TIME | awk -F\: '{print $3}'`
    TIME=$HH$MM$SS

    Moving on...
    Your code for creating email.xml looks like it has not rendered properly in your original post. It is missing all the xml tags. I amended this in my code and got the script to write email.xml as a properly formatted firewall xml file and the new firewall rule is then applied successfully with your refresh line. I can see the firewall rule called email now present in the GUI also, but still no message being sent by the script. Instead I get the error:

    connect: Connection timed out
    connect:errno=110
    ./tw_diskcheck: line 66: :f: not found

    Hopefully this is the last hurdle! Any ideas?

    Yours hoping, Paul A

  5. Was your firewall rule refreshed?

    esxcli network firewall refresh

    Can you connect to gmail running this command?

    openssl s_client -pause -connect 74.125.133.108:465 -ign_eof -crlf

    Glad to see you are making progress. I didn't even think of the differences in time formats...

    Replies
    1. Yes the rule had refreshed, as I can see it now in the vSphere client GUI - but looking at this, I just spotted it being under the inbound ruleset - obviously my mistake! I just amended tw_diskcheck to make the xml correct for an outbound rule, deleted the email.xml on my server and ran tw_diskcheck again - it applied the amended rule correctly and sent me an e-mail! Awesome!
      I do still get an error though at the end of the script:
      ./tw_diskcheck: line 69: :f: not found
      I have no idea what this is referring to in the script. Any ideas? I could send you my script in full with modifications if it is easier to read?

      Josh, I think you do need to re-post your script in your original article as the xml tags do not exist in your pasted code - this was one of the issues I had to resolve. Also, I have no idea how you are getting the:
      TIME=`echo ${TIME//:/}`
      line to run, as my esxi implementation just errors on that line, which was why I had to tweak the code in my version (see 2 posts above).

      One further question - does this script consider all RAID units? I.e. if any one unit has a problem, will it e-mail me to let me know? I have 3 units on my 16-port card.

      Thanks again for your continued support - awesome!

    2. Josh, As I have 3 units configured on my controller, the script adds 3 lines to lol.log each time it is run, in the format:

      Sun Mar 10 22:11:35 UTC 2013 OK
      OK
      OK

      This happens even if the status hasn't changed. So there is still an issue here. The script is coded to only report to the log file (and send the e-mail) when this test is true:

      if [ "$UNITSTATUS" != "$LAST_STATUS" ]

      ..but for me with my 3 lines per check, this always evaluates to TRUE, so reports every time it runs. The reason this seems to happen is because of the way that the script writes the unit status output to the lol.log file. An example of the log for each run is thus:

      Sun Mar 10 22:11:35 UTC 2013 OK
      OK
      OK

      so $LAST_STATUS always evaluates to the last line here which is "OK", so this evaluation:

      if [ "$UNITSTATUS" != "$LAST_STATUS" ]

      translates to:

      if [ "OK OK OK" != "OK" ]

      which of course is TRUE, so the script reports to the log and e-mails me every 5 mins! (or it would do if I had the cron job configured).

      There needs to be a way that when the status is written to the log, it writes the date/time on line 1 and the Unit statuses in series on line 2, then that test would properly evaluate. So something like:

      Sun Mar 10 22:11:35 UTC 2013
      OK OK OK

      This would then only report to the log and e-mail when something changed in one or more units.

      Any ideas how to achieve the log reporting in this way Josh?

      Paul A

  6. Josh, I have a fully working script now which detects issues with multiple units (RAID arrays). I would like to post it here, but the char limit prevents me. Can I please e-mail it to you to post on my behalf perhaps? My e-mail is paul (at) pjamedia (dot) com (please excuse the spamarrest protection I have if you do message me).

    Thanks again for your awesome script, without which I would still be head scratching.

    Paul A

  7. Awesome guide. Works with small modifications also on ESX 5.0 Update 2
    Thank you.

    Kind Regards, Stefan

    Replies
    1. Stefan, I imagine you had issues with the firewall section, as for some reason this blog has removed all the xml tags, so none shown in the above code, which doesn't help.

      I am actually having a small issue with the script - I receive a duplicate e-mail with each legit one - so every status update, I get 2 identical e-mails. I have gone over the script but can't figure why this is happening!

      If you have any ideas or even theories, I'd be glad to hear them.

      Thanks, Paul A

  8. Hello

    Nice script. But I have the problem that I get this message:


    /vmfs/volumes/51c1c98f-7ec9769e-3bc3-002590a8ca7a/Raid # ./tw_diskcheck
    Checking Firewall rule exists - attempt: 1
    Checking Firewall rule exists - attempt: 2
    Checking Firewall rule exists - attempt: 3
    Thu Jun 20 17:36:39 UTC 2013 After 3 attempts the firewall rule could not be detected. Aborting.

    Using ESXi 5.1.0 1065491

    What's wrong?

    Greetings, Chris

    Replies
    1. Same issue happened for me.

      Greetings, Stanislav

  9. I cannot find any command to install the tw_cli, the docs that come with the link above refers to the rpm install for vmware and I know that will not work with ESXi 5.x. I have the 9750 installed and the driver loaded 5TB storage setup.

    I have been searching for the tw_cli install package, can anybody point me in the right direction?

    Thanks

  10. Hi

    I have 2 controllers in my server: one 9690SA and one 9650SE.

    With two controllers the script does not work. I always get this message:


    /vmfs/volumes/51c5aa60-923f1ee7-08e2-002590a8ca7a/Raid # ./tw_diskcheck
    Checking Firewall rule exists - attempt: 1
    Firewall rule checked out OK on attempt: 1
    Error: (CLI:012) Invalid argument in info command.


    Previous Unit Status (from log):
    Current Unit Status (from tw_cli):

    With one controller the script works. Maybe someone can help me change the script so that it works with 2 controllers.

    Greetings, Christopher

  11. Can't find tw_diskcheck anywhere. Help please.

  12. tw_diskcheck is what you name the script file when you save it.

  13. I know this is old but I'm still trying to solve the same problems. Can't seem to get tw_cli to exec on ESXi 5.5. Tried setting it as executable (-x) but I just get access denied when I try to run it with ./tw_cli

  14. OK, where can I find the XML syntax for the firewall?

  15. Paul from PJA Media reworked this a bit and said I could post it, so in case anyone else needs it, get it at http://bucket.valreytech.com/tw_diskcheck.sh
    ..scott..

  16. All right, I just executed this PuTTY command from my Windows 8 desktop and it successfully transferred the firewall XML to my desktop.

    pscp -pw Localacce$$ root@192.168.1.136:/etc/vmware/firewall/service.xml service.xml

    I should be able to take the output from tw_cli and transfer it to a system where the script will run properly.

  17. I'd insert a short sleep in the for loop where you check the firewall rules. Otherwise, nicely done!

  18. This is exactly what I was looking for... it didn't work with Gmail, but it worked immediately with the SMTP server of my ISP. I skipped the FW part of the script, because changes in service.xml are persistent, so I just made my changes there.
    Thanks a million... if you happen to visit Berlin some time, let me know - I owe you a beer!

  19. The automatic check of the e-mail state saves our time on the regular entrance. It is very convenient advantage of this update.

  20. It is very valuable information for me. It is helped me with the work in the e-mail. I am very happy about it.
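
On the two-controller problem from comment 10: with more than one controller, CTL_NAME ends up holding several names, and tw_cli rejects them in a single info call. A possible fix is to query each controller in turn. This is a sketch that has not been tried on real hardware; all_unit_status and the c0/u0=OK output format are my own inventions, and it assumes tw_cli info lists one controller per line starting with a lowercase c, as the script already does.

```shell
#!/bin/sh
# Sketch: gather unit status from every controller, one tw_cli call per controller.
# The default path is the example path from the script; override TWCLI as needed.
TWCLI=${TWCLI:-/vmfs/volumes/Datastore/3Ware/tw_cli}

all_unit_status() {
  # Controller names are the first field of the lines starting with "c"
  for ctl in $($TWCLI info | grep -E '^c' | awk '{print $1}'); do
    # Unit lines start with "u"; field 3 is the status column
    $TWCLI info "$ctl" unitstatus | grep -E '^u' |
      awk -v c="$ctl" '{printf "%s/%s=%s ", c, $1, $3}'
  done | sed 's/ *$//'
}

# Usage inside the script, in place of the single-controller UNITSTATUS line:
# UNITSTATUS=$(all_unit_status)
```

Because everything still ends up on one line, the comparison against the last line of lol.log keeps working unchanged.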
