Never Schedule Future Jobs

There are many situations in which you might be tempted to schedule a job in the future. Maybe you want to follow up with a user a few days after they purchased something to see how they’re liking it. Maybe you want to record some summary metrics a week after a new user signs up.

Whatever your reason, don’t make this happen by putting a job in a queue to be picked up by a worker n days after it’s created. Both Sidekiq and Amazon SQS support scheduled jobs, but don’t do it!

I’ll explain why I think future jobs are a mistake, then describe an experience that convinced me never to do this again. Finally, I’ll give an alternative pattern you can use to accomplish the same result in a much safer way.

Future Jobs are Dangerous

If you were using Sidekiq and had the following code:

PurchaseFollowUpEmailWorker.perform_in(3.days, purchase_id)

What would you expect to happen in 3 days? Well, presumably, an email will go out to the user asking how they like their purchase. But how do you know that’s what’s going to happen? Tests? Wrong!

You’ve broken the logical connection between the triggers and the behavior of your system.

Tests will tell you what will happen to users who purchased something 3 days ago, but as to 3 days from now? No test will ever be able to tell you what changes will be made to your codebase between now and then.

And this is the core reason why future jobs are so dangerous. You’re essentially writing a blank check to this worker. When we compose logical systems, we say “If x and y are true, then do z”. This is safe because our tests express the meaning of x, y, and z. Tests express the common assumptions made by different components of your system that allow them to interact in a meaningful and correct way. This common understanding is unachievable with future jobs, however, because you can’t test code you haven’t written yet. You’ve broken the logical connection between the triggers (x and y) and the behavior (z) of your system.

An Example

You can’t go back in time to change the old assumptions any more than you can go forward into the future to know the exact behavior of your system.

In this simplified but real-life example, a peer-to-peer marketplace depends on users shipping items that were purchased by other users. If the seller does not ship their item within 7 days, the buyer receives a refund. At the time of purchase then, a future job was scheduled:

RefundWorker.perform_in(7.days, purchase_id)

Then 7 days later, the worker wakes up:

class RefundWorker
  include Sidekiq::Worker

  def perform(purchase_id)
    purchase = Purchase.find(purchase_id)
    purchase.refund! unless purchase.shipped?
  end
end

At a certain point, it was decided to change the design of the system. Because these jobs were stored in Redis, which presented the risk of losing a job entirely, a nightly auditor already had to look for old, unshipped, unrefunded purchases. Rather than maintain this overhead just to support scheduling future jobs in Redis, the decision was made to run a nightly job that grabbed all relevant purchases and refunded them:

class NightlyRefunder
  include Sidekiq::Worker

  def perform
    purchases = Purchase.unshipped.where('created_at < ?', 7.days.ago)
    purchases.pluck(:id).each do |purchase_id|
      RefundWorker.perform_async(purchase_id)
    end
  end
end

Well, now that we’re only scheduling RefundWorker for unshipped items, we can take out that conditional.

class RefundWorker
  include Sidekiq::Worker

  def perform(purchase_id)
    Purchase.find(purchase_id).refund!
  end
end

Unfortunately, this code got all the way to production. The author of this code forgot that there were already thousands of jobs in the queue that were scheduled under the assumption that the unless purchase.shipped? guard clause was there. When the conditions for triggering this job changed, so did the behavior of the job, but of course you can’t go back in time to change the old assumptions any more than you can go forward into the future to know the exact behavior of your system.

Now this clearly could have been avoided in ways other than getting rid of future jobs. A different worker could have been used for these immediate refunds. Perhaps the original worker was misnamed and could have been called something like CheckRefundableWorker, which would have made it clear that it shouldn’t be re-used by the NightlyRefunder. The old jobs could have been deleted when the new code deployed (sounds tricky, though).
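To make the renaming idea concrete, here’s a minimal sketch of how splitting the conditional and unconditional paths into separately named workers would have prevented the mix-up. Purchase is stubbed as a Struct here; in the real app it would be an ActiveRecord model and each worker would include Sidekiq::Worker and receive an id rather than an object.

```ruby
# Purchase stub: in production this would be an ActiveRecord model.
Purchase = Struct.new(:shipped, :refunded) do
  def shipped?
    shipped
  end

  def refund!
    self.refunded = true
  end
end

# Scheduled at purchase time: its name makes clear that it decides
# *whether* a refund is due.
class CheckRefundableWorker
  def perform(purchase)
    purchase.refund! unless purchase.shipped?
  end
end

# Enqueued only by the nightly auditor, which has already filtered
# for unshipped purchases, so it refunds unconditionally.
class RefundWorker
  def perform(purchase)
    purchase.refund!
  end
end
```

With distinct names, nobody is tempted to “simplify” the conditional worker just because a new caller happens to pre-filter its inputs.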

Don’t make your life hard. Make your life easy.

So yes, there are other improvements that could have been made within the frame of future jobs, but look at what they all require: careful hindsight and forethought. By using this design, you’re requiring that your engineers remember and evaluate the assumptions that were previously made about the system. That’s hard! Don’t make your life hard. Make your life easy. Programming with the code you’ve got is hard enough without thinking about code from the past.

An Easier, Safer Way

Don’t saddle yourself with having to consider assumptions made in past versions of your code. Just write down what should happen right now. Grab all the records that are due for some action and schedule an immediate worker to do that on each one.

Having a batch job that schedules other jobs does raise the possibility of race conditions. If your worker is not idempotent (maybe there’s a refund email that goes out), just have your worker lock a resource:

class RefundWorker
  include Sidekiq::Worker

  def perform(purchase_id)
    purchase = Purchase.find(purchase_id)
    purchase.with_lock do
      purchase.refund! unless purchase.refunded?
    end
  end
end
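Here’s a plain-Ruby simulation (no Sidekiq or ActiveRecord) of why that lock-plus-guard pattern makes double-enqueues safe: even if two workers race on the same purchase, only one refund happens. A Mutex stands in for the database row lock that with_lock takes.

```ruby
class Purchase
  attr_reader :refund_count

  def initialize
    @refund_count = 0
    @lock = Mutex.new # stand-in for ActiveRecord's row-level lock
  end

  def refunded?
    @refund_count > 0
  end

  def refund!
    @refund_count += 1
  end

  def with_lock(&block)
    @lock.synchronize(&block)
  end
end

def refund_worker(purchase)
  purchase.with_lock do
    # The guard runs *inside* the lock, so the second worker
    # always sees the first worker's refund.
    purchase.refund! unless purchase.refunded?
  end
end

purchase = Purchase.new
2.times.map { Thread.new { refund_worker(purchase) } }.each(&:join)
# purchase.refund_count is 1: the second worker saw refunded? and skipped
```

The key design point is that the idempotency check happens inside the lock; checking refunded? before acquiring the lock would leave a window where both workers see an unrefunded purchase.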

Just Because You Can, Doesn’t Mean You Should

Popular async tools like SQS and Sidekiq support scheduling jobs far in the future, but I think using this functionality is too dangerous. You have no idea what you’re asking for when you schedule these future jobs because you don’t know what is going to change between now and then. Be safe and only express your business logic in terms of what should happen right now.

9 thoughts on “Never Schedule Future Jobs”

  1. I’ve found future jobs useful for quick feedback cycles. For example, we have a cache that we update when a user first signs up and we use an exponential backoff of future jobs to update that cache. That way they’ll get cache updates at 15.seconds, 30.seconds, 1.minute, 2.minutes, etc. to give the user some feedback while slowly rolling them over to the normal schedule that our already-onboarded customers use. I agree that scheduling things far in the future can be bad though. I think the key here is that the job is not critical and just makes the user experience a little nicer.

    1. Yeah, that seems like a legit use case. Because you’re just updating a cache, this worker isn’t actually making any decisions. That kinda makes it immune to my main point that it could have been scheduled under a different set of assumptions. Thanks!

  2. Thank you for your post.

    I think that developers use future jobs to avoid workers or cron tasks like the one you describe at the end of the article, because of performance issues. I understand that often this is “superstitious programming”, but what approach would you recommend instead?

    1. Yeah, the Big Batch Job can definitely be a problem, especially if you end up loading thousands of full-blown ActiveRecord objects into memory. I think memory issues are pretty easily avoided though, as long as your Big Batch Job is only responsible for scheduling many small jobs that do all the heavy lifting.

  3. This makes a lot of sense.

    If you use a cron style batch task to schedule the jobs, how do you handle a situation where the cron task fails to run for some reason? If you have tasks that must run, this could be a problem. Is there a pattern or tool for guaranteed scheduling?

    1. One thing I’ve done for this is increment counters in something like Librato for all the events in your system. Then you can set up an alert if these events go above or below some thresholds. This is a nice way to detect the kind of flatlines you’re describing.

  4. Lance, this is a great blog post.

    You’ve basically fleshed out everything I’ve experienced with future background jobs, especially the race condition possibility when there’s a central worker that pings the database every few minutes to execute workers that need to start.

    I find what’s helpful is to queue up all the jobs for the next 30 minutes, for instance, and to check this every 30 minutes. There’s a slight possibility some jobs will be queued twice, but this would be mitigated by an internal flag and locking the record.

  5. I’m in full agreement with you on this—currently in the process of migrating from jobs scheduled days in the future to an approach which will schedule jobs that should be executed ‘right now’.

    Not only is this a safer mode of scheduling for the reasons you have outlined, but it also paves the way for us to use simpler queue stores, such as SQS.

    On that note, the maximum delay for SQS appears to be 15 minutes. That, coupled with the limit on in-flight messages means that I don’t think it has ever been intended for use with jobs scheduled far in the future, unless I’m missing something?

    Either way thanks for the article!

    1. 15 minutes is definitely safer than 15 days, but there will still be a point when you deploy new code and the old workers may not be correct with your new version. The fact that it’s limited kinda does point to future jobs being an off-label use, though.
