Azure CPU Linear Growth/Ramp Up Issue

Recently Resgrid experienced an issue with our backend cloud service web role instances. After a deployment, CPU usage on all instances would ramp up from a normal of around 10% to over 90%, stay there for a while, then 'reset' back to 10% and begin the process again.

[Image: 2015-05-08_13-50-47, screenshot of the Azure portal monitoring tab]

Above is a screenshot of the monitoring tab in the Azure portal for the visually inclined. Sometimes CPU would peak at over 90% and stay there, dramatically affecting our API performance for an extended period of time. This is a huge deal, as most Resgrid users interact with the API at some level (calls going out, emails being imported and of course our mobile apps).

We were at a complete loss as to why this was happening. It seemed to have started a while ago, but was exacerbated by new work. It was only happening on our API stack; our Web stack was following a normal CPU pattern, hovering around 20%. Both projects shared the exact same service, database, caching and other provider code, and the API layer itself was pretty thin. So what in the world was going on?

After a lot of debugging and redeployments we still had no answer. Time to call in the big guns: Microsoft. The good thing about the Azure Support plans is that you get support as soon as you sign up and pay the $29, so there's no need to be paying for a plan all the time.

We sent around 10 GB worth of dumps and PerfView traces to MS to review. So what did they discover?

From majority of the callstack, we see StackExchange_Redis!StackExchange.Redis.SocketManager calls either for reading or writing to the queues. Now there are 2153 active threads in the process! This seems to be too high and it would be interesting to see if you are running into connection problems mentioned here: https://social.msdn.microsoft.com/Forums/en-US/5e075053-802a-4a46-9fea-a0e859e9a7a9/redis-cache-sudden-100-cpu-and-crash?forum=azurecache

Now there are 1065 StackExchange.Redis.ConnectionMultiplexer objects in the managed heap! The dump shows that most of these ConnectionMultiplexer objects has connection failure message “UnableToResolvePhysicalConnection on PING”. So it seems there are lot of connection issues happening.

I would highly recommend customer to update StackExchange.Redis from v1.0.333 to v1.0.450 (https://www.nuget.org/packages/StackExchange.Redis/1.0.450 ). The older version might have had such 100% high CPU issues. Also are you creating multiple multiplexer object? Redis cache recommend to have one object and reuse it.

We started using Azure’s Redis cache a while back to cache our Geocoding results. We’ve been doing more and more work with that recently as it’s critical information to first responders. Getting a cached value from Redis is far faster and more cost-effective than contacting Google, Yahoo or Bing.

When we set it up we installed version 1.0.333, which was supposed to fix the CPU issue, and may have in some cases, but not in ours. We use Ninject to control the lifecycle of our objects and had our RedisProvider in a singleton scope, but that may have been part of the issue as well.

We upgraded to the latest StackExchange.Redis (v1.0.450) and marked our ConnectionMultiplexer as static, and that fixed the issue. So if you’re seeing a CPU ramp up and using Redis, check your packages/DLLs and ensure your ConnectionMultiplexer is static.
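For anyone wondering what a static ConnectionMultiplexer can look like, here is a minimal sketch rather than our exact code. The RedisConnection and GeocodeCache class names, the REDIS_CONNECTION_STRING environment variable, the localhost fallback and the "geocode:" key prefix are all assumptions made up for illustration. The idea is simply that the multiplexer is created once per process and reused for every call.

```csharp
using System;
using StackExchange.Redis;

// Hypothetical holder class: the name and the environment variable below
// are illustrative assumptions, not Resgrid's actual code.
public static class RedisConnection
{
    // Lazy<T> is thread-safe by default, so Connect runs at most once
    // per process no matter how many requests race to use it.
    private static readonly Lazy<ConnectionMultiplexer> LazyConnection =
        new Lazy<ConnectionMultiplexer>(() =>
        {
            string config =
                Environment.GetEnvironmentVariable("REDIS_CONNECTION_STRING")
                ?? "localhost:6379,abortConnect=false";
            return ConnectionMultiplexer.Connect(config);
        });

    // Every caller shares this one multiplexer instance.
    public static ConnectionMultiplexer Instance
    {
        get { return LazyConnection.Value; }
    }
}

// Hypothetical consumer showing how the shared instance would be used.
public static class GeocodeCache
{
    public static string Get(string address)
    {
        // GetDatabase() is a lightweight handle, so it is fine to call it
        // per operation; only the multiplexer itself needs to be shared.
        IDatabase db = RedisConnection.Instance.GetDatabase();
        return db.StringGet("geocode:" + address);
    }
}
```

With something like this in place, the number of ConnectionMultiplexer objects in the process should stay at one instead of growing with traffic, which is the symptom Microsoft found in our dumps.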

Lessons Learned:

  • Always enable Remote Desktop for the roles, web or worker. This was amazingly helpful when Microsoft needed us to install software on the machine.
  • Pick an instance to let fail and cycle the other instances. This keeps your service up and running while allowing you to test. The Azure load balancer seems to be round robin, so your high CPU instance will still get traffic.
  • In Cloud Service deployments, turning off Update Deployment does not issue you a fresh VM. Anything you install on the VM is safe across a deploy made without an Update Deployment method (Incremental or Simultaneous) selected.
  • To get a fresh VM you need to “Reimage” from the Azure Management Portal, Instances section. Deployments and reimages will keep the same machine name, in case you have something else that keys off machine name.
  • Azure CPU metrics are based on averages over a 5 minute period. Just because Azure is reporting 90% CPU utilization doesn’t mean you’ll see the CPU pegged at 90% when you log into the VM.
  • Have another tool to monitor performance. I’ll be reviewing NewRelic in a later blog post.
  • Do not rely on Profiling, IntelliTrace or Remote Debugging. In both VS2013 and VS2015RC we were unable to get those to work correctly.
  • When using the Debug Diagnostics Collection tool, the HTTP Response time trigger did nothing. Although the application was slow, the way the IIS Server determined whether it was ‘slow’ didn’t work. Performance Counters worked best.

Resgrid is a SaaS product utilizing Microsoft Azure, providing logistics, management and communication tools to first responder organizations like volunteer fire departments, career fire departments, EMS, search and rescue, CERT, public safety, disaster relief organizations, etc. It was founded in late 2012 by myself and Jason Jarrett (staxmanade).

About: Shawn Jackson

I’ve spent the last 18 years in the world of Information Technology on both the IT and Development sides of the aisle. I’m currently a Software Engineer for Paylocity. In addition to working at Paylocity, I’m also the Founder of Resgrid, a cloud services company dedicated to providing logistics and management solutions to first responder organizations, volunteer and career fire departments, EMS, ambulance services, search and rescue, public safety, HAZMAT and others.