There are many situations in which you might be tempted to schedule a job in the future. Maybe you want to follow up with a user a few days after they purchased something to see how they’re liking it. Maybe you want to record some summary metrics a week after a new user signs up.
Whatever your reason, don’t make this happen by putting a job in a queue to be picked up by a worker n days after it gets created. Sidekiq supports scheduled jobs, and Amazon SQS supports delaying messages as well, but don’t do it!
I’ll explain why I think future jobs are a mistake, then describe an experience that convinced me never to do this again. Finally, I’ll give an alternative pattern you can use to accomplish the same result in a much safer way.
Future Jobs are Dangerous
Suppose you were using Sidekiq and, right after a purchase, scheduled a follow-up email job to run 3 days from now.
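A minimal sketch of such a call follows. The worker name `FollowUpEmailWorker` is hypothetical, and the small `FakeWorker` module is only a stand-in that mimics the shape of Sidekiq’s `perform_in(interval, *args)` so the example runs without Sidekiq or Redis. In a real app you would `include Sidekiq::Worker` and, with ActiveSupport loaded, pass `3.days` as the interval.

```ruby
# Stand-in for Sidekiq::Worker (an assumption for illustration only).
# It records what would be scheduled; real Sidekiq stores the job in Redis.
module FakeWorker
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    def scheduled_jobs
      @scheduled_jobs ||= []
    end

    # Same argument shape as Sidekiq's perform_in(interval, *args).
    def perform_in(interval_seconds, *args)
      scheduled_jobs << { run_at: Time.now + interval_seconds, args: args }
    end
  end
end

SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical worker; the name is not from the original post.
class FollowUpEmailWorker
  include FakeWorker

  def perform(user_id)
    puts "Emailing user #{user_id}: how do you like your purchase?"
  end
end

# Only the arguments are stored. The code that eventually runs is whatever
# version of FollowUpEmailWorker is deployed three days from now.
FollowUpEmailWorker.perform_in(3 * SECONDS_PER_DAY, 42)
```

Notice that nothing about the worker’s behavior is captured at scheduling time, only its name and arguments.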
What would you expect to happen in 3 days? Well presumably, an email will go out to the user, asking how they like their purchase. How do you know what’s going to happen? Tests? Wrong!
Tests will tell you what will happen to users who purchased something 3 days ago, but as to 3 days from now? No test will ever be able to tell you what changes will be made to your codebase between now and then.
And this is the core reason why future jobs are so dangerous. You’re essentially writing a blank check to this worker. When we compose logical systems, we say “If x and y are true, then do z”. This is safe because our tests express the meaning of x, y, and z. Tests express the common assumptions made by different components of your system that allow them to interact in a meaningful and correct way. This common understanding is unachievable with future jobs, however, because you can’t test code you haven’t written yet. You’ve broken the logical connection between the triggers (x and y) and the behavior (z) of your system.
In this simplified but real-life example, a peer-to-peer marketplace depends on users shipping items that were purchased by other users. If the seller does not ship their item within 7 days, the buyer receives a refund. At the time of purchase, then, a future job was scheduled to run 7 days later, with a call along the lines of RefundWorker.perform_in(7.days, purchase.id).
Then 7 days later, the worker wakes up:
    class RefundWorker
      include Sidekiq::Worker

      # Refund the purchase only if the seller never shipped it.
      def perform(purchase_id)
        purchase = Purchase.find(purchase_id)
        purchase.refund! unless purchase.shipped?
      end
    end
At a certain point it was decided to change the design of the system. Because these jobs were being stored in Redis, which presented the risk of losing the job entirely, there had to be a nightly auditor that looked for old, unshipped, unrefunded purchases. Rather than maintaining this overhead to support scheduling future jobs in Redis, the decision was made to run a nightly job that grabbed all relevant purchases and refunded them:
    class NightlyRefunder
      include Sidekiq::Worker

      # Enqueue an immediate refund for every purchase still unshipped after 7 days.
      def perform
        purchases = Purchase.unshipped.where('created_at < ?', 7.days.ago)
        purchases.pluck(:id).each do |purchase_id|
          RefundWorker.perform_async(purchase_id)
        end
      end
    end
Well, now that we’re only scheduling RefundWorker for unshipped items, we can take out that conditional.
    class RefundWorker
      include Sidekiq::Worker

      # Guard clause removed: every job that reaches this worker now refunds.
      def perform(purchase_id)
        Purchase.find(purchase_id).refund!
      end
    end
Unfortunately, this code got all the way to production. The author of this code forgot that there were already thousands of jobs in the queue that were scheduled under the assumption that the unless purchase.shipped? guard clause was there. As those jobs came due, they refunded purchases whether or not the items had shipped. When the conditions for triggering this job changed, so did the behavior of the job, but of course you can’t go back in time to change the old assumptions any more than you can go forward into the future to know the exact behavior of your system.
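To see the failure in miniature, here is a plain-Ruby simulation. Everything in it is a hypothetical stand-in: the `Purchase` struct replaces the real model, the lambdas replace the two versions of the worker, and an array of ids replaces the Redis queue.

```ruby
# In-memory stand-in for the Purchase model (illustration only).
Purchase = Struct.new(:id, :shipped, :refunded) do
  def shipped?
    shipped
  end

  def refund!
    self.refunded = true
  end
end

def fresh_purchases
  {
    1 => Purchase.new(1, true, false),  # shipped on time, must not be refunded
    2 => Purchase.new(2, false, false)  # never shipped, refund is due
  }
end

# Worker as it existed when the jobs were scheduled.
OLD_WORKER = lambda do |purchases, purchase_id|
  purchase = purchases.fetch(purchase_id)
  purchase.refund! unless purchase.shipped?
end

# Worker as deployed later, with the guard clause removed.
NEW_WORKER = lambda do |purchases, purchase_id|
  purchases.fetch(purchase_id).refund!
end

# Runs queued ids through a worker and returns the ids of shipped
# purchases that ended up refunded anyway.
def run_stale_jobs(worker, purchases, queued_ids)
  queued_ids.each { |purchase_id| worker.call(purchases, purchase_id) }
  purchases.values.select { |purchase| purchase.shipped? && purchase.refunded }.map(&:id)
end

# A job was queued for every purchase, shipped or not.
stale_queue = [1, 2]

puts "old worker, wrongly refunded: #{run_stale_jobs(OLD_WORKER, fresh_purchases, stale_queue)}"
puts "new worker, wrongly refunded: #{run_stale_jobs(NEW_WORKER, fresh_purchases, stale_queue)}"
```

The queue is identical in both runs. Only the code that happened to be deployed when the jobs came due differs, and that alone decides whether shipped purchases get refunded.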
Now this clearly could have been avoided in other ways than getting rid of future jobs. A different worker could have been used for these immediate refunds. Perhaps the original worker was misnamed and could have been called something like CheckRefundableWorker, which would have made it clear that it shouldn’t be re-used by the NightlyRefunder. The old jobs could have been deleted when the new code deployed (sounds tricky though).
So yes, there are other improvements that could have been made within the frame of future jobs, but look at what they all require: careful hindsight and forethought. By using this design, you’re requiring that your engineers remember and evaluate the assumptions that were previously made about the system. That’s hard! Don’t make your life hard. Make your life easy. Programming with the code you’ve got is hard enough without thinking about code from the past.
An Easier, Safer Way
Don’t saddle yourself with having to consider assumptions made in past versions of your code. Just write down what should happen right now. Grab all the records that are due for some action and schedule an immediate worker to do that on each one.
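Applied to the follow-up email from the start of this post, the pattern might look roughly like the sketch below. It is plain Ruby with hypothetical names throughout (`Order`, `due_for_follow_up`, `send_follow_up`, `nightly_follow_up`); in a Rails app the selection would be a database query and `send_follow_up` would be an immediate job such as a `perform_async` call.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical in-memory record standing in for a database-backed model.
Order = Struct.new(:id, :purchased_at, :followed_up)

# Selects the orders that are due right now, according to today's rules.
def due_for_follow_up(orders, now)
  cutoff = now - 3 * SECONDS_PER_DAY
  orders.select { |order| order.purchased_at <= cutoff && !order.followed_up }
end

# Stand-in for an immediate worker; it marks the order as handled.
def send_follow_up(order)
  order.followed_up = true
end

# The nightly batch: decide what is due now, act on it now,
# and return the ids that were handled.
def nightly_follow_up(orders, now)
  due = due_for_follow_up(orders, now)
  due.each { |order| send_follow_up(order) }
  due.map(&:id)
end

now = Time.now
orders = [
  Order.new(1, now - 5 * SECONDS_PER_DAY, false), # due
  Order.new(2, now - 1 * SECONDS_PER_DAY, false), # too recent
  Order.new(3, now - 4 * SECONDS_PER_DAY, true)   # already followed up
]

puts "first run handled:  #{nightly_follow_up(orders, now)}"
puts "second run handled: #{nightly_follow_up(orders, now)}"
```

Because the trigger and the behavior are evaluated at the same moment, a test written against today’s code describes exactly what tonight’s run will do.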
Having a batch job that schedules other jobs does raise the possibility of race conditions. If your worker is not idempotent (maybe there’s a refund email that goes out), just have your worker lock a resource:
    class RefundWorker
      include Sidekiq::Worker

      def perform(purchase_id)
        purchase = Purchase.find(purchase_id)
        # Take a row lock and re-check, so a duplicate job cannot refund twice.
        purchase.with_lock do
          purchase.refund! unless purchase.refunded?
        end
      end
    end
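To check that the guard inside the lock is what makes a duplicate job harmless, here is a plain-Ruby sketch. It swaps in a `Mutex` for the database row lock that ActiveRecord’s `with_lock` takes, so it only models workers inside one process; `LockablePurchase` and `refund_once` are hypothetical names.

```ruby
# In-memory stand-in for a purchase record. The Mutex plays the role of
# the database row lock in the ActiveRecord version (an assumption made
# so the sketch runs without a database).
class LockablePurchase
  attr_reader :refund_count

  def initialize
    @lock = Mutex.new
    @refunded = false
    @refund_count = 0
  end

  def refunded?
    @refunded
  end

  def with_lock(&block)
    @lock.synchronize(&block)
  end

  # Counts calls so that a double refund would be visible.
  def refund!
    @refund_count += 1
    @refunded = true
  end
end

# Same shape as the worker body: check and act under one lock.
def refund_once(purchase)
  purchase.with_lock do
    purchase.refund! unless purchase.refunded?
  end
end

purchase = LockablePurchase.new

# Ten duplicate jobs racing for the same purchase.
threads = 10.times.map { Thread.new { refund_once(purchase) } }
threads.each(&:join)

puts "refunds issued: #{purchase.refund_count}"
```

The check and the refund happen under the same lock, so whichever job arrives second sees `refunded?` as true and does nothing.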
Just Because You Can, Doesn’t Mean You Should
Popular async tools like SQS and Sidekiq support scheduling jobs far in the future, but I think using this functionality is too dangerous. You have no idea what you’re asking for when you schedule these future jobs because you don’t know what is going to change between now and then. Be safe and only express your business logic in terms of what should happen right now.