Heroku’s log-runtime-metrics Can Crash Your Rails App

Update 2: I was wrong about log-runtime-metrics! Turns out the culprit was a concurrency bug inside a Gem. Read how I fixed it here.
Update 1: Hacker News discussion here.

TL;DR: As of April 25, 2013, do not run log-runtime-metrics for extended periods of time, as it can crash your Heroku dynos (and cause them to reboot) multiple times daily. (Update: I was wrong! Please see Beware of the Zombie (Process) Apocalypse.)

Introduction

In the wake of the Rap Genius brouhaha, Heroku has been on a tear, releasing many new features intended to help you tune and debug your Rails app’s performance. One of them is the Heroku Labs feature log-runtime-metrics, which, combined with the Log2Viz feature, gives you great insight into your app’s memory usage. You get a detailed breakdown of memory usage per dyno in your logs, and armed with this information you can scale your Unicorn worker count accordingly.
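The feature is a Heroku Labs flag, enabled with `heroku labs:enable log-runtime-metrics`. As a minimal sketch of how you would then act on what it tells you, here is the usual pattern of making the Unicorn worker count configurable via an environment variable (the `WEB_CONCURRENCY` name is just a common convention, and this is illustrative rather than my exact production config):

```ruby
# config/unicorn.rb -- illustrative sketch, not an exact production config.
# Once log-runtime-metrics shows how much memory each worker uses, you can
# bump the count with `heroku config:set WEB_CONCURRENCY=4` and restart.
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)

timeout 30        # kill workers stuck on a single request longer than 30s
preload_app true  # load the app once in the master process, then fork workers
```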

Emerging Problems

With this kind of amazing visibility into your dynos, it is tempting to leave this Labs feature enabled. After all, what can go wrong? Well, it turns out, lots. After having this feature enabled for a few days, I started noticing some unexplained dyno reboots inside NewRelic.

[Screenshot: unexplained Heroku dyno reboots shown in NewRelic]

These restarts were causing 503 timeouts, with Heroku H13 errors manifesting in the logs. (NB: In my case, I am running Unicorn with multiple processes, so each of these instance restarts is actually a single Unicorn worker restarting.)

Investigating the Mystery

Digging deeper into the mystery, I went into the Logentries add-on and examined the Heroku logs around that time. Notably, I was not receiving any exceptions in AirBrake, so I knew it had to be a middleware problem or a system-level problem.

So let’s take a look at the logs. When things are running fine, you can see that log-runtime-metrics outputs your dyno’s stats every 20 seconds or so.
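If you have never turned the feature on, the entries look roughly like the following; the values are illustrative placeholders, not the actual figures from my dynos:

```
source=web.1 dyno=heroku.12345678.aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee sample#load_avg_1m=0.20 sample#load_avg_5m=0.25 sample#load_avg_15m=0.30
source=web.1 dyno=heroku.12345678.aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee sample#memory_total=512.00MB sample#memory_rss=500.00MB sample#memory_cache=12.00MB sample#memory_swap=0.00MB
```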

(As an aside, I am using a 2X dyno, so my memory usage of ~811 MB is not a problem — in fact, it looks like I have enough memory to add another Unicorn worker. Again, log-runtime-metrics is great.)

Around 04:11:21 things start to go wrong:

Pointing the Finger at the Wrong Suspect

You can see that one of the Unicorn workers started to die repeatedly with fairly ominous error messages. At this point, my first thought was that the NewRelic gem was causing the reboots, because of this stack trace:

But here’s the spoiler: it turns out NewRelic was just the fall guy! Here’s why. If you look at the log-runtime-metrics output above, you can see that the metrics are emitted every 20 seconds. Well, hmm, what needs a timer thread like that? After a bit of thought, it hit me: log-runtime-metrics, of course!

Catching the Real Crook

As you can see from the logs, the time that the thread creation error occurs is almost exactly 20 seconds from the last metric report — which tells me that log-runtime-metrics is trying to start a thread to wake itself up in 20 seconds for the next report. However, it encounters difficulty doing so:
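For context, when MRI cannot obtain another native thread from the operating system, Thread.new surfaces the failure as a ThreadError. The general shape is roughly the following (a reconstruction of the typical message, not a verbatim excerpt from my logs):

```
ThreadError: can't create Thread: Resource temporarily unavailable
```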

This starts a cascading chain of failures which causes the entire dyno to reboot:

log-runtime-metrics obviously has a thread leak, as no further threads can be created. I surmise that log-runtime-metrics is not joining its timer threads when they are done and that it is running into the PTHREAD_THREADS_MAX limit in the underlying Ubuntu OS that runs the dyno.
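To make that hypothesis concrete, here is a deliberately simplified sketch of a thread-per-tick reporter. This is my guess at the failure pattern, not the actual log-runtime-metrics source: as long as each per-tick thread exits promptly nothing bad happens, but the moment they stop exiting, live threads pile up until the process hits its thread limit.

```ruby
# Illustrative sketch of a thread leak -- NOT the real log-runtime-metrics code.
INTERVAL = 20 # seconds between metric samples

def report_metrics
  # Stand-in for the real emitter. If this ever blocks indefinitely (a wedged
  # socket, a held lock, ...), the thread running it never exits.
  puts "sample#memory_total=..."
end

# Thread-per-tick scheduling: a fresh thread is spawned every INTERVAL seconds.
# While each thread finishes promptly this is merely wasteful; once they stop
# exiting, live threads accumulate until Thread.new itself raises ThreadError,
# which matches the cascading failure visible in the logs above.
loop do
  Thread.new { report_metrics }
  sleep INTERVAL
end
```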

Note that during this two-minute thrashing interval, no further runtime metrics are reported, so log-runtime-metrics really is dead.

After the reboot, the metrics on web.1 start reporting every 20 seconds again, even before Unicorn is booted:

All this definitely points toward log-runtime-metrics as the culprit in this scenario. Accordingly, I have turned log-runtime-metrics off in my production system, and after a few hours of run time there have been no more unexplained dyno reboots. I will keep an eye on it over the next 24 hours to verify that log-runtime-metrics did indeed cause the problem.
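For reference, Labs features are toggled per app from the Heroku CLI, so turning this one off is a single command (with `your-app-name` as a placeholder):

```
heroku labs:disable log-runtime-metrics --app your-app-name
```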

So the next time you see this notice:

[Screenshot: Heroku Labs experimental (beta) feature warning]

…pay attention and take it at face value: it really is an experimental feature!

Unabashed Self-Promotion

Got problems like these? I am a 7-year Ruby on Rails veteran who previously founded (and subsequently exited) a startup as CTO. I am available for short- and long-term consulting contracts involving CTO-level advice and Ruby on Rails project work. Please feel free to get in touch.

7 comments
  1. Thanks for the digging, Fred. We’re investigating…

    • Fred said:

      Thanks for the quick reply Ryan!

  2. It’s funny, I had noticed that our dynos have been getting a lot of boot timeouts even when we have not deployed anything.

    • Ryan Daigle said:

      Hey Fred, thanks for the follow up! We’re always looking for ways to improve the visibility and tools available to Heroku users and will use your example here as we move forward.
