The Background Job

Background jobs are chunks of code that run "in the background" of your application. They are typically resource-intensive operations such as web scraping, emailing, and image processing. You don't want to handle these operations inside the request/response cycle; that cycle needs to be speedy so that other users can always make requests.

So, you queue up resource-intensive operations to be processed some time later, independently of the request/response cycle.

There are two main classes, so to speak, of background jobs:

  1. Queued Tasks
  2. Scheduled Tasks

Queued Tasks are what gems like Delayed::Job and Resque handle. You "enqueue" a task to be run when resources are available. Then, when either a) resources are available or b) you manually tell them to run, the tasks are "dequeued" and executed.

Scheduled Tasks are queued tasks too, but they are most commonly executed when Cron says so. So you say things like "fetch a feed every 5 minutes" for a scheduled task, and "fetch this feed later" for a queued task. Scheduled tasks are usually defined at application startup, while queued tasks are usually based on user input. That said, if you have 10 scheduled tasks all set to run at the same time, then they are effectively "queued" as well.
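
To make the distinction concrete, here is a toy in-memory sketch. It is not how Delayed::Job, Resque, or Cron are implemented; the TinyQueue and TinySchedule classes and their methods are made up for illustration. It only shows that a scheduled task ends up on the same queue once it is due.

```ruby
# A toy illustration only; the class and method names are hypothetical.
class TinyQueue
  def initialize
    @jobs = []
  end

  # Queued task: "do this later", usually triggered by user input.
  def enqueue(name, &work)
    @jobs << [name, work]
  end

  # Run everything that is waiting, oldest first, and return the results.
  def work_off
    results = []
    until @jobs.empty?
      name, work = @jobs.shift
      results << [name, work.call]
    end
    results
  end

  def size
    @jobs.size
  end
end

# Scheduled task: "do this every N seconds", usually defined at startup.
class TinySchedule
  def initialize(queue)
    @queue = queue
    @entries = []
  end

  def every(seconds, name, &work)
    @entries << { every: seconds, name: name, work: work, next_run: 0 }
  end

  # Called on each clock tick (Cron would play this role). Tasks that are
  # due are simply pushed onto the queue.
  def tick(now)
    @entries.each do |entry|
      next if now < entry[:next_run]
      @queue.enqueue(entry[:name], &entry[:work])
      entry[:next_run] = now + entry[:every]
    end
  end
end

queue = TinyQueue.new
schedule = TinySchedule.new(queue)
schedule.every(300, "fetch feed every 5 minutes") { :scheduled_fetch }
queue.enqueue("fetch this feed later") { :queued_fetch }
schedule.tick(0)   # the scheduled task is due, so it joins the queue
schedule.tick(60)  # not due again until t = 300, so nothing is added
```

After the two ticks the queue holds two jobs, and calling queue.work_off runs them in the order they were added.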

Just some hopefully helpful distinctions.

The Cost of Background Jobs

Update: Use the hire-fire gem, it solves the problem now!

Background jobs are necessary for any app that has resource-intensive operations. The problem is that Heroku charges a lot to do this, and setting it up yourself on, say, Slicehost is a lot of work.

Background jobs on Heroku are handled with Delayed::Job, and Heroku calls the processes that run them workers. By default, each worker costs $36/month and runs via cron every hour. You can run cron daily for free, but that is not an option if you have, say, a job posting aggregator where you want near real-time feed updates. Alternatively, you can do what Eric Anderson described in the comments below and start/stop the workers using the Heroku commands from within Ruby. $36/month isn't that bad for the first app, but if you wanted to experiment with 5 or 10 background-job-requiring Rails/Sinatra apps on Heroku, you'd be out as much as $360/month in a hurry. This is prohibitively expensive. Plus, you are limited to hourly cron (though if you specify run_at in Delayed::Job, you can enqueue the next task at the end of the current task's execution). Update: if you manually start and stop the workers only when you need to do work, you'll end up spending only pennies. I'm still pretty excited about this GAE business anyway ;).
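
The arithmetic behind those numbers, as a rough sketch. The $36/month figure comes from the paragraph above. The 30-day month and the hourly pro-rating are my assumptions for illustration, not Heroku's published billing rules, and the method names are made up.

```ruby
WORKER_PRICE_PER_MONTH = 36.0  # dollars per worker, from the text above
HOURS_PER_MONTH = 30 * 24      # assumption: a 30-day month

# One always-on worker per app.
def always_on_cost(apps)
  apps * WORKER_PRICE_PER_MONTH
end

# Assumption: workers are billed pro rata by the hour, and you only start
# them when there is work to do.
def on_demand_cost(apps, worker_hours_per_app)
  hourly_rate = WORKER_PRICE_PER_MONTH / HOURS_PER_MONTH
  (apps * worker_hours_per_app * hourly_rate).round(2)
end

always_on = always_on_cost(10)     # 10 apps, one worker each
on_demand = on_demand_cost(10, 2)  # 10 apps, 2 worker-hours each per month
```

Under these assumptions, ten always-on workers come to $360/month, while ten apps that each need two worker-hours a month come to about a dollar in total.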

Background jobs on Slicehost require a custom setup. A starting Slicehost slice costs $19/month, and then you have to install the entire deployment stack yourself. The benefit is you could use more advanced tools like RabbitMQ to run background daemons, but then you're in the field of operating system best-practices and that's over my head. You could use Delayed::Job as well. But this requires a lot of time and effort.

Free Background Jobs with Google App Engine

Google App Engine (GAE) is a free service from Google, basically like Heroku, except targeted at Java and Python. It's definitely not as easy to use as Heroku, and it's not Ruby, but it has its use cases.

In particular, GAE has built-in support for Cron and Task Queues. It also has built-in email support via HTTP, but that's another topic :).

Since GAE is from Google it is very inexpensive. While hourly Cron is $36/month on Heroku, minutely Cron is FREE on GAE. And Task Queues are free as well. GAE has clearly defined quotas specifying at what point they will actually start charging you, but it's much more lenient than Heroku. Basically, if you're a small team, you could use their Task Queue and Cron all you want for free.

Free Background Jobs on Heroku

I just hacked together a project on Github called Queuable which demonstrates how to give a Sinatra app on Heroku background-job support using Google App Engine. You could easily make the demo in Queuable a Rails app or some PHP or Python app just the same.

There are 3 parts to this setup:

  1. A minimal Google App Engine application that just handles managing queues and scheduled cron jobs. You don't need to modify this; you can just deploy it to GAE with appcfg.py update . as long as you have a free GAE account.
  2. A template Sinatra application which you would use to process what would normally be background processes. The GAE application sends scheduled tasks to this Sinatra app (on Heroku), you do your feed-fetching or whatever, and the response goes back to GAE.
  3. An API. The last piece is your actual app, say a feed aggregator Rails app with lots of models and controllers. When you need to do a complex calculation (fetch, parse a large feed), you push that responsibility off to GAE. You just send it some parameters (hash, string, anything), and it will queue/schedule it, and send it to your Sinatra worker app to take those params and do the operation, freeing your app from time-consuming operations. When everything's complete, GAE will send the result back to your app for you to save to the database.
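
As a rough sketch of the third part, this is how a main app might build the request that hands a job to the GAE queue app, using only Ruby's standard library. The field names url, params, and callback match the example payloads in this post. The build_enqueue_request helper, the root endpoint path, and the form encoding are my assumptions, so check the Queuable source for the real interface.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) the POST that hands a job
# to the GAE queue app. The endpoint path and form encoding are assumptions.
def build_enqueue_request(queue_url, worker_url, params, callback_url)
  uri = URI.parse(queue_url)
  request = Net::HTTP::Post.new(uri.request_uri)
  request.set_form_data(
    "url"      => worker_url,    # where GAE should send the job
    "params"   => params,        # whatever the worker needs
    "callback" => callback_url   # where GAE should send the result
  )
  [uri, request]
end

uri, request = build_enqueue_request(
  "http://my-gae-queuable.appspot.com/",
  "http://sinatra-worker.heroku.com/handle",
  "hello!",
  "http://my-real-app.heroku.com"
)

# To actually send it, something like this would do:
#   Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```

Building the request separately from sending it keeps the example runnable offline, and it makes the payload easy to inspect before anything goes over the wire.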

Demo some Background Jobs

Install Google App Engine

First, you need to get up and running with Google App Engine (I know, I know, but it's free). Here's a helpful getting started with GAE article on Squidoo. I followed this tutorial for setting up GAE as a CDN.

All you really need is to download Google App Engine for the command-line tools, and create an application (just fill out a form with the application's name).

Setup the Demo Background Job Sinatra App

Download Queuable.

git clone git@github.com:viatropos/queuable.git
cd queuable

Open app.yaml and replace the field application: name with your application name. This is my-app-name in http://my-app-name.appspot.com, from when you created the GAE application.

Then install the gems used in the demo app (not required for the core functionality, just the demo):

gem install haml json pauldix-feedzirra

Run the 3 servers

You'll need to run the GAE app, the demo app, and the demo worker app:

dev_appserver.py .         # http://localhost:8080
ruby demo/app/app.rb       # http://localhost:4567
ruby demo/worker/worker.rb # http://localhost:4568

Queue Something!

Now just open http://localhost:4567/ and submit a feed url (it has a default), and check the terminal to see what just happened.

Then this is the flow of requests:

  1. Your main app makes a POST to your Queuable on GAE. A request might look like this:

    http://my-gae-queuable.appspot.com/url=http://sinatra-worker.heroku.com/handle&params=hello!&callback=http://my-real-app.heroku.com

  2. Queuable on GAE will then POST to http://sinatra-worker.heroku.com/handle, and your Sinatra worker app will receive this:

    {'url': 'http://sinatra-worker.heroku.com/handle', 'callback': 'http://my-real-app.heroku.com', 'params': 'hello!'}

  3. You then do whatever you want with that data, IN THE REQUEST CYCLE. We can do it in the request cycle because nobody sees this app. The whole point is to not use cron/delayed_job on Heroku because it costs money.

  4. When you return some result, Queuable on GAE will send this back to the original app:

    {"body"=>"All your content (json, html, xml, anything)", "request"=>"{'url': 'http://sinatra-worker.heroku.com/handle', 'callback': 'http://my-real-app.heroku.com', 'params': 'hello!'}", "task"=>"queue", "status"=>"200", "headers"=>"{'content-length': '0', 'set-cookie': 'rack.session=AQz7AA%3D%3D%0A; path=/', 'server': 'nginx/0.6.39', 'connection': 'keep-alive', 'date': 'Thu, 26 Aug 2010 20:30:30 GMT', 'content-type': 'text/html'}"}
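
Steps 2 and 3 on the worker side can be sketched as a plain Ruby method, so it runs without Sinatra installed. The handle_job name and the upcase "work" are placeholders I made up. The job hash mirrors the payload shown in step 2, though I am assuming it arrives as ordinary request parameters.

```ruby
# Hypothetical worker logic. `job` is the hash of fields the worker
# receives from GAE in step 2.
def handle_job(job)
  params = job["params"].to_s
  # Stand-in for the real work (feed fetching, image processing, ...).
  result = params.upcase
  # Whatever is returned here becomes the response body, which GAE then
  # relays to job["callback"].
  "processed: #{result}"
end

job = {
  "url"      => "http://sinatra-worker.heroku.com/handle",
  "callback" => "http://my-real-app.heroku.com",
  "params"   => "hello!"
}
body = handle_job(job)

# In the Sinatra worker this could be wired up roughly as:
#   post "/handle" do
#     handle_job(params)
#   end
```

Keeping the work in a plain method also means you can test it without going through GAE at all.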

This means that your main app just sends a request to GAE and returns immediately. GAE then queues up the task, and calls your worker app to handle it (so you can program in Ruby). Whenever the worker is done, it sends the processed result back to GAE, and GAE POSTs it back to the original app. So your app isn't tied up in long-running operations, and you don't have to pay lots or configure a ton to have background processes.
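
On the receiving end, the main app only has to accept that final POST and store the result. A minimal sketch, assuming the payload arrives as a hash with the status, task, and body keys shown in step 4. The handle_callback method and the SAVED_RESULTS array (standing in for a database table) are hypothetical.

```ruby
# Hypothetical callback handling in the main app.
SAVED_RESULTS = []  # stand-in for a database table

def handle_callback(payload)
  # Only store results when the worker answered with HTTP 200.
  return false unless payload["status"].to_s == "200"
  SAVED_RESULTS << { task: payload["task"], body: payload["body"] }
  true
end

ok = handle_callback({ "status" => "200", "task" => "queue", "body" => "All your content" })
failed = handle_callback({ "status" => "500", "task" => "queue", "body" => "" })
```

Checking the status first keeps a failed worker run from being saved as if it were real content.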

Deploy Your Background Job Workers

First, deploy Queuable to Google App Engine with this one-liner:

appcfg.py update .

Then just deploy 2 apps to Heroku: the Sinatra worker (following the demo code as a guideline) and your main app, whatever it may be (Rails, Sinatra, etc.).

Then your apps basically are speaking "background job" via HTTP with GAE as the queueing proxy.

That's it.

What are your thoughts on this? It isn't anything like a long-term bullet-proof solution; it's just something to make possible what would otherwise be prohibitively expensive.