I ran into an issue with CPU utilization in our Hyper-V guest VMs over the last few weeks. Every few days, one of our guest VMs would suddenly go to 100% CPU and stay there, occasionally even across reboots. What was really strange was that the guest OS didn't report any process using that CPU time. No matter what I killed, the CPU stayed at 100%.
I started to gather information to define the scope of the issue. This was occurring on both Windows VMs and Ubuntu 14.04 VMs, and happened on more than one of our Hyper-V nodes, though it did happen on one node more than the others. The more vCPUs attached to a guest VM, the more likely it was to have an issue (our Exchange server was the worst), but this was even occurring on single vCPU guest VMs. During the course of troubleshooting, I noticed that if I live-migrated the guest VM, the CPU would stabilize. I could even move it back to the node where it was having an issue without the CPU destabilizing again. All of our hardware is identical, all of our Hyper-V servers are configured exactly the same, the nodes all have the same updates installed, and the issue did not follow specific guest VMs from node to node.
Once I had enough data to show that it was happening most on host 2, and didn't see anything else standing out, I drained the node and started up Prime95 hoping that it would give me a little more direction. After starting up Prime95 in hardware stress testing mode, I kept an eye on temperatures as well as the CPU, RAM, and power utilization. A few minutes into the test I saw the CPU utilization go from 100% to about 30%, and not go back up.
The power utilization also dropped considerably at the same time.
Now, if you've used Prime95 before, you know that means something isn't working right: a stress test should hold the CPU at 100% for as long as it runs. So I rebooted the host and tried again, with the same results.
After a little contemplation, I decided to check the power saving settings in the OS. Sure enough, the power plan was set to 'Balanced'. After setting the plan to 'High Performance' and rebooting, Prime95 was able to stay at 100% CPU usage for a full 24 hours, and the power utilization was stable.
I checked the other hosts, and they were also set to 'Balanced', so I unpaused my node, moved a few VMs to it and waited. After a couple of days, I added a couple more. By the end of a week, I had the node that used to be the most problematic loaded down with guest VMs and was seeing no issues. I then updated the power plan on the other servers as well. It has now been two weeks without issues, and I'm comfortable saying the power settings were causing the problem.
If you are curious which power plan your server is running right now, you can check from a cmd prompt with 'powercfg -l', which lists the plans and marks the active one with an asterisk. To change the plan, take the GUID of the desired plan from that list, run 'powercfg -s $GUID' (with $GUID replaced by the actual GUID), and reboot. You can verify that the plan changed by running 'powercfg -l' again.
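If you have more than a couple of hosts to check, reading the 'powercfg -l' output by eye gets tedious. Below is a rough sketch, in Python, of how that check could be scripted. It is not what I actually ran. The function names are made up for illustration, the sample listing only shows the usual shape of the output on an English-language Windows install with the default plans, and the parsing assumes that English "GUID: ... (Name) *" layout. Treat it as a starting point, not a finished tool.

```python
import re
import subprocess

# Illustrative sample of 'powercfg -l' output (assumed English locale,
# default plans only). Your hosts may list additional or vendor plans.
SAMPLE_LISTING = """\
Existing Power Schemes (* Active)
-----------------------------------
Power Scheme GUID: 381b4222-f694-41f0-9685-ff5bb260df2e  (Balanced) *
Power Scheme GUID: 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c  (High performance)
Power Scheme GUID: a1841308-3541-4fab-bc81-f71556f20b4a  (Power saver)
"""

# One plan per line: a 36-character GUID, the plan name in parentheses,
# and an optional trailing asterisk on the active plan.
PLAN_LINE = re.compile(r"GUID:\s*([0-9a-fA-F-]{36})\s+\((.+?)\)\s*(\*)?\s*$")


def parse_power_plans(listing):
    """Return a list of (guid, name, is_active) tuples from 'powercfg -l' text."""
    plans = []
    for line in listing.splitlines():
        match = PLAN_LINE.search(line)
        if match:
            guid, name, star = match.groups()
            plans.append((guid.lower(), name, star is not None))
    return plans


def active_plan(listing):
    """Return (guid, name) of the plan marked with an asterisk, or None."""
    for guid, name, is_active in parse_power_plans(listing):
        if is_active:
            return guid, name
    return None


def find_plan_guid(listing, wanted_name):
    """Return the GUID of the plan with the given name (case-insensitive), or None."""
    for guid, name, _ in parse_power_plans(listing):
        if name.lower() == wanted_name.lower():
            return guid
    return None


def read_powercfg_listing():
    """Run 'powercfg -l' on the local Windows host and return its text output."""
    result = subprocess.run(
        ["powercfg", "-l"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On a Windows host, active_plan(read_powercfg_listing()) tells you which plan is in effect, and find_plan_guid(read_powercfg_listing(), "High performance") gives you the GUID to pass to 'powercfg -s'. I left the actual plan change out of the sketch on purpose, since that is something I would rather run by hand on each host.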