After almost 2 years of no problems our Exchange2003 server has gotten sick recently. We've run diagnostic utilities on it since it acts like a hardware issue, but reports all come back green. Reboot the server and all is well for a few days, but gradually things start to degrade. People can still access their info as normal, but the response time becomes higher and higher ... until finally after a week or 2 the box stops responding.
We've determined two things that certainly aren't helping matters. First is the size of our Store (the database that holds all the messages). In almost 2 years we've gone from 16GB to 55GB! For only having 135 mailboxes, that's a ridiculous size. We have several users with 1GB+ mailboxes ... the largest is a whoppin' 5.5GB!! Yes, I know ... we should be using quotas. We will in the near future. We also haven't run an offline defrag, but it's not going to reduce the size dramatically.
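To put those Store numbers in perspective, here's a rough back-of-the-envelope sketch. The sizes and mailbox count come from the figures above; the 24-month span is my assumption for "almost 2 years", and the function names are just made up for illustration. The Store size isn't exactly the sum of the mailbox sizes, so treat the results as ballpark only.

```python
# Figures quoted in the post. MONTHS_ELAPSED is an assumption ("almost 2 years").
STORE_START_GB = 16
STORE_NOW_GB = 55
MAILBOXES = 135
MONTHS_ELAPSED = 24  # assumption, not a measured value


def average_mailbox_mb(store_gb, mailboxes):
    """Average Store space per mailbox in MB (1 GB = 1024 MB)."""
    return store_gb * 1024 / mailboxes


def monthly_growth_gb(start_gb, now_gb, months):
    """Average Store growth per month in GB, assuming linear growth."""
    return (now_gb - start_gb) / months


if __name__ == "__main__":
    avg = average_mailbox_mb(STORE_NOW_GB, MAILBOXES)
    growth = monthly_growth_gb(STORE_START_GB, STORE_NOW_GB, MONTHS_ELAPSED)
    print(f"Average per mailbox: {avg:.0f} MB")
    print(f"Average growth: {growth:.3f} GB/month")
```

That works out to a bit over 400MB per mailbox on average, which shows how much the handful of 1GB+ mailboxes skew things.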
The exchange box is beefy ... dual Xeon 2GHz, 3GB ram, RAID5 10k rpm SCSI drives ... and the processors just sit a hair above idle all the time. Store.exe eats up most of the ram, but that's what it's supposed to do. Event logs don't show anything out of the ordinary and MOM (MS Operations Manager) isn't throwing any red flags either. Obviously something is jacked up, but there's nothing you would typically look for indicating a problem.
The second thing we're pointing fingers at is the fact that several weeks ago we removed Domain Controller functions from the exchange box. I've found several articles advising that if you place DC roles on an exchange box, don't remove them ... doh! We've had more problems since removing the DC roles, but we also had some issues just prior to that as well. So, yes, a potential contributing factor, but not the smoking gun.
So now what? Well, we could spend time trying to figure out this puzzle or rebuild the box. I figure this is a great time to test running Exchange in a virtual environment. It also gives us an opportunity to test moving virtual machines from one physical box to another and see how that works.
So first a little info on our VMware experiment thus far. We have a Dell PowerEdge 2650 that's now our VM server. It's got dual Xeon 3GHz processors with 4GB ram ... and sadly only 137GB drive space (10k RAID5 SCSI). The base OS is MS Server2003 SP1. Then we have the free VMware server installed. Inside of the VMware server we have the following separate virtual machines running ...
- MS Server2003 running as our secondary Domain Controller
- MS Server2003 running Terminal Services
- MS Server2003 running WSUS (Windows Server Update Services)
- MS Server2003 running SpySweeper Ent Admin Console ... also slated to run Track-It, Symantec Corp Ed AV, and MOM2005
- Ubuntu Linux ... it's installed but I haven't had time to play with it yet.
- XP pro ... installed just as an XP test environment
Putting Exchange on the VM server required freeing up a TON of drive space ... remember the Store itself is 55GB. So we shut down all the virtual machines except the DC. Then we moved each virtual machine's folder (in VMware everything sits in one folder and 1 file basically contains the entire server) over to another temp box. It took a while to transfer 100GB. This temp box is a GX260 with a 2GHz P4 proc and 2GB ram ... not beefy by any means, but it already had Server2003 R2 installed running our MOM2005.
After installing VMware (which is simple) and launching it, I clicked File > Open, picked the folder containing the virtual machine I wanted to start up, then clicked "start virtual machine". Voila ... Server2003 running Terminal Services was back up and running. Same IP, same configs, same everything. Yes, it still blows my mind how this technology works. I moved a server from one physical box to another simply by moving a folder!! I then did the same with the Server2003 for SpySweeper. So our little GX260 was now supporting Server2003 R2 as the base OS along with 2 Server2003 virtual servers ... average CPU load was 90%, spiking to 100% at times. You could log into the virtual servers as normal, though with a tolerable lag. That's just too kewl ... 3 separate installs of Server2003 running on a simple desktop PC. This is not a good final solution, but in a pinch it works fine.
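Since a virtual machine is basically just a folder, it's worth checking that the destination box has room before kicking off a long copy. Here's a minimal sketch of that check in Python. The function names, the 10% headroom figure, and the example paths are all hypothetical, not something we actually ran.

```python
import os
import shutil


def folder_size_bytes(path):
    """Total size in bytes of every file under a VM folder."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def fits_on_target(vm_folder, target_dir, headroom=0.10):
    """True if the VM folder plus some headroom fits in the target's free space.

    The 10% default headroom is an arbitrary assumption.
    """
    needed = folder_size_bytes(vm_folder)
    free = shutil.disk_usage(target_dir).free
    return needed * (1 + headroom) <= free


if __name__ == "__main__":
    # Hypothetical paths for illustration only.
    vm_folder = r"D:\VirtualMachines\TerminalServices"
    target_dir = r"E:\TempVMs"
    if fits_on_target(vm_folder, target_dir):
        print("Enough room, go ahead and copy the folder.")
    else:
        print("Not enough free space on the target.")
```

The virtual machine should be shut down before you measure or copy its folder, same as we did above.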
We have a plain jane Server2003 VMware template we created ... so making a new Server2003 virtual machine is quick and easy. Ed's been working on getting Exchange installed on that virtual machine today. If all goes as planned we'll start moving a few mailboxes over to this new virtual Exchange box in the next day or 2. Once everyone is moved to the new VM Exchange, we'll completely wipe and rebuild the prior physical box. I'm very curious to see how Exchange runs in a virtual environment. We're giving it 3GB of the 4 available on the physical server, so it's a close replica of the environment it had prior. We do plan to move Exchange from virtual back to its prior physical box once it's rebuilt ... but if performance is good we might consider keeping it virtual.
In a way this is also testing our DR (disaster recovery) plans. If our Exchange server suddenly burst into flames, how long before we could have another Exchange box up and running? This is helping us create procedures and timelines to handle such an event. Another benefit of this rebuild is we'll move to Exchange2003 SP2, which has built-in Windows Mobile "Direct Push" technology. So if you have a smart/PocketPC phone, you can continuously sync with Exchange a la BlackBerry. Given the fact that Ed and I will be getting PocketPC phones soon, this is certainly worth some of the pain involved in this ordeal :-)
I really wish the new version of Exchange was now available ... it's still in beta. This would be a perfect time to upgrade. Of course that would also mean buying a new 64 bit server as Exchange is 64 bit only from here out. So in the end I'm looking at this through a positive lens. It bites to have to mess with this right now, but we'll come out on the other end having learned a lot about stuff we've been wanting to test. The hard part will be letting go of the fact we'll prob never know why we're having our current problems :-)