I’m trying to get back into technical blogging as I encounter interesting situations on a daily basis … and I get so much information from others doing the same thing.
In this case I’m moving a fairly large blog from a custom deployment platform on EngineYard to Heroku. Heroku enforces a 30-second request timeout, so the web server can’t be used for heavy, long-running tasks like generating a large CSV file.
The solution is to move the generation of the CSV file into a background task and store the generated CSV file on Amazon S3. Since the data I am compiling into the CSV file is private, I also show how to configure Paperclip so that the generated CSV file can be downloaded only by authenticated users.
Here’s a brief (30 second) video showing the UI you can build by following these steps:
Heroku Necessities: generate CSV files in the background with “delayed_job” and store them on S3 with “paperclip” … from Kevin Trowbridge on Vimeo.
The Model: ExportedDataCsv.rb
In my case I have a few large sets of data that are stored in the database, that need to be exportable from the system for reporting and administrative tasks. Think … the ‘Users’ table (full list of users with email addresses, names, and so on) … or the ‘Stories’ table (for a blog, all of the ‘stories’ that have ever been written for the site). So this is stateful. We’re going to turn the Users table into a CSV file and save it on Amazon S3. We’ll be storing specific information about the file:
- What’s its exact name?
- When was it generated?
- Is it actively generating right now, or is it available for download?
We’re using Paperclip to handle the mechanics of saving the file to S3, but we’ll need to set up a model in order to configure Paperclip, as well as to store that stateful information.
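To make the stateful part concrete, here is a rough plain-Ruby stand-in for such a model. It is a sketch under assumptions, not the real thing: in the Rails app this would be an ActiveRecord model, job_id and updated_at would be database columns, and the Tempfile would be assigned to a Paperclip attachment. The class name ExportedDataCsvSketch, the PAPERCLIP_OPTIONS constant, the bucket name, the path and the two hook names (filename, csv_string) are all hypothetical.

```ruby
require "tempfile"

# Options one might pass to Paperclip in the Rails model, for example
# `has_attached_file :csv_file, PAPERCLIP_OPTIONS`. Bucket and path are
# placeholders; :private keeps the S3 object from being publicly readable.
PAPERCLIP_OPTIONS = {
  storage: :s3,
  bucket: "my-app-exports", # hypothetical bucket name
  s3_permissions: :private,
  s3_credentials: {
    access_key_id: ENV["AWS_ACCESS_KEY_ID"],
    secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"]
  },
  path: "exports/:filename"
}.freeze

# Plain-Ruby stand-in for the model; it only tracks state in memory.
class ExportedDataCsvSketch
  attr_accessor :job_id
  attr_reader :csv_file, :updated_at

  # True while a background job has been queued for this export.
  def generating?
    !job_id.nil?
  end

  # True once a generated file is available.
  def csv_file_exists?
    !csv_file.nil?
  end

  # The two hooks that each subclass is expected to override.
  def filename
    raise NotImplementedError, "subclasses name their own file"
  end

  def csv_string
    raise NotImplementedError, "subclasses build their own CSV data"
  end

  # Turns the CSV string into a Tempfile. The Rails version would hand this
  # Tempfile to the Paperclip attachment and save the record.
  def write_csv_file
    file = Tempfile.new([filename, ".csv"])
    file.write(csv_string)
    file.rewind
    @csv_file = file
    @updated_at = Time.now
    @job_id = nil # the job is done, so we are no longer generating
    file
  end
end
```

The view only ever needs the two predicates: generating? decides whether to show a “please wait” message, and csv_file_exists? decides whether to show a download link.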
Now that you’ve seen it, let’s discuss this model in more detail:
- has_attached_file: the first line is the familiar way of configuring Paperclip.
- acts_as_singleton: I’m only storing a single version of each ExportedDataCSV file, so I am using the acts_as_singleton gem … the model associated with the exported CSV file will be a singleton.
- generating? & csv_file_exists?: two methods I can use in my view to determine the immediate state of the CSV file.
- trigger_csv_generation: this method gets called by the application server’s controller method to queue up the write_csv_file background job.
- write_csv_file: this is the actual method that turns a CSV string into a Tempfile, which is then handed off to Paperclip.
Then there are two methods to be overridden in subclasses … oh yes, did I fail to mention? Since we are generating several distinct types of CSV files, each with its own name and data, I am using what’s called Rails ‘single table inheritance’ to create a set of subclasses to model this.
CreateExportedDataCsv db migration
Here’s the migration to create the ExportedDataCSV table in the database.
- The presence of the type string makes the Single Table Inheritance work.
- has_attached_file is the paperclip migration helper.
- job_id is used to track the delayed_job and make the model’s generating? method work.
- timestamps will keep track of when it was last updated.
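Put together, a migration with those four pieces might look roughly like the fragment below. Treat it as a sketch: the table name exported_data_csvs and the attachment name csv_file are my assumptions, and it only runs inside a Rails app with Paperclip loaded. As far as I know, newer Paperclip versions call the helper t.attachment instead.

```ruby
# Rails migration fragment; needs ActiveRecord and Paperclip, so it will not
# run as a standalone script.
class CreateExportedDataCsv < ActiveRecord::Migration
  def self.up
    create_table :exported_data_csvs do |t|
      t.string  :type               # enables single table inheritance
      t.has_attached_file :csv_file # Paperclip's attachment columns
      t.integer :job_id             # id of the queued delayed_job, if any
      t.timestamps                  # created_at / updated_at
    end
  end

  def self.down
    drop_table :exported_data_csvs
  end
end
```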
Subclassed Models
With the previous two files written, it’s trivial to create a CSV file:
The information to be put into the CSV file is simply a string. Please see https://github.com/crafterm/comma for more information on working with CSV files in Ruby.
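For illustration, here is a self-contained sketch of what such a subclass could look like. Note the swaps: it uses Ruby’s standard CSV library rather than the comma gem, a trimmed-down stand-in base class rather than the real model, and a Struct in place of the Users table. The names ExportedDataCsvBase, UsersCsv, filename and csv_string are assumptions made for this example.

```ruby
require "csv"

# Trimmed-down stand-in for the base model: just the two hooks that
# subclasses override.
class ExportedDataCsvBase
  def filename
    raise NotImplementedError, "subclasses name their own file"
  end

  def csv_string
    raise NotImplementedError, "subclasses build their own CSV data"
  end
end

# Hypothetical record type standing in for rows of the Users table.
User = Struct.new(:email, :name)

class UsersCsv < ExportedDataCsvBase
  # In the Rails app the rows would come from a query rather than being
  # passed in like this.
  def initialize(users)
    @users = users
  end

  def filename
    "users.csv"
  end

  # Builds the whole file as one string: a header row, then one row per user.
  def csv_string
    CSV.generate do |csv|
      csv << %w[email name]
      @users.each { |user| csv << [user.email, user.name] }
    end
  end
end

users = [
  User.new("ada@example.com", "Ada"),
  User.new("grace@example.com", "Grace")
]
puts UsersCsv.new(users).csv_string
```

Each additional export type is then one more small subclass with its own file name and its own rows.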
GenerateCsvJob: The Delayed Job
We use the now-standard delayed_job gem to handle the passing off of the long-running task (the write_csv_file method in the root model).
Here’s my ‘job’ file:
Credit – this stackoverflow post was very helpful to me: http://stackoverflow.com/questions/5582017/polling-with-delayed-job
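As a rough sketch of the shape such a job can take: delayed_job accepts any object that responds to perform, and a Struct is a common way to write one. The member name csv_class_name is my own, and I am assuming each export class exposes a singleton accessor called instance, which is how I understand acts_as_singleton to work.

```ruby
# Custom job in the shape delayed_job expects: an object with a `perform`
# method. It stores a class name rather than an object, since the job gets
# serialized when it is queued.
GenerateCsvJob = Struct.new(:csv_class_name) do
  def perform
    # Look the class up by name. In Rails, csv_class_name.constantize would
    # be the usual way to do this.
    export = Object.const_get(csv_class_name).instance
    export.write_csv_file
  end
end

# Inside the Rails app, queuing would look something like:
#   job = Delayed::Job.enqueue(GenerateCsvJob.new("UsersCsv"))
# The returned job's id is what the model can keep in job_id.
```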
The Controller
The controller is pretty simple … there are two methods.
- generate_csv: queue up a new delayed job to generate the CSV file and immediately redirect_to :back
- index: point the client to the S3 ‘expiring url’ path (the URL only lasts 5 minutes) to download the CSV file, if it exists.
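Here is a minimal sketch of those two actions, written as a module so that it can be read (and run) without Rails. In the app they would simply be methods on the controller. The module name, the current_export hook that returns the relevant export record, and the fallback redirect when no file exists yet are all assumptions. expiring_url is Paperclip’s method for signed S3 URLs, and 300 seconds matches the five-minute lifetime.

```ruby
module CsvExportActions
  FIVE_MINUTES = 300 # seconds the signed S3 URL stays valid

  # Queue the background job, then return to the page the user came from.
  def generate_csv
    current_export.trigger_csv_generation
    redirect_to :back
  end

  # Send the client to a short-lived signed S3 URL when a file exists.
  def index
    if current_export.csv_file_exists?
      redirect_to current_export.csv_file.expiring_url(FIVE_MINUTES)
    else
      redirect_to :back # assumption: nothing to download yet
    end
  end
end
```

Because the S3 object itself is private, the signed URL is the only way in, and it is only ever handed out by a controller action that sits behind your normal authentication.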
routes.rb
In the routes file we just need to add a custom route to allow the client to access the generate_csv action that we created in the controller:
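A route along these lines would do it. This is only a fragment for config/routes.rb and needs a Rails app around it. The application name, the resource name and the choice of POST (because the action has a side effect) are my assumptions.

```ruby
# config/routes.rb fragment (Rails 3 style); will not run outside a Rails app.
MyApp::Application.routes.draw do
  resources :exported_data_csvs, only: [:index] do
    collection do
      post :generate_csv # adds POST /exported_data_csvs/generate_csv
    end
  end
end
```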
The last tricky bit … the view
The last tricky piece is the view. In the view we determine whether a CSV has been generated yet … if not, we allow the user to trigger the generation of a CSV file … if so we show the link to it, but also allow the user to refresh the file as it may be far out of date.
Since we’re building a framework that will allow us to have many different CSV files … we first create an abstracted partial that will accept various input variables and that we can use all over our site:
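To show the three states such a partial has to handle, here is a stand-alone sketch that renders a small template with Ruby’s built-in ERB library. It is not a real Rails partial: a Rails view would use helpers such as link_to and button_to (which also take care of the CSRF token) instead of hand-written markup. The local variable names title, export, download_path and generate_path are assumptions.

```ruby
require "erb"

# Stand-in for the partial. Three states: generating, ready, not yet generated.
CSV_EXPORT_PARTIAL = <<~'TEMPLATE'
  <h3><%= title %></h3>
  <% if export.generating? -%>
  <p>Generating, check back shortly.</p>
  <% elsif export.csv_file_exists? -%>
  <a href="<%= download_path %>">Download</a> (generated <%= export.updated_at %>)
  <form action="<%= generate_path %>" method="post"><button>Refresh</button></form>
  <% else -%>
  <form action="<%= generate_path %>" method="post"><button>Generate CSV</button></form>
  <% end -%>
TEMPLATE

# Renders the template with the given locals, much as `render partial:` does
# with its :locals option in Rails.
def render_csv_export(locals)
  ERB.new(CSV_EXPORT_PARTIAL, trim_mode: "-").result_with_hash(locals)
end

# Hypothetical stand-in for the model, exposing just what the view asks for.
ExportState = Struct.new(:generating, :exists, :updated_at) do
  def generating?
    generating
  end

  def csv_file_exists?
    exists
  end
end

puts render_csv_export(
  title: "Users",
  export: ExportState.new(false, true, "2011-06-01 12:00"),
  download_path: "/exported_data_csvs",
  generate_path: "/exported_data_csvs/generate_csv"
)
```

Passing everything in as locals is what keeps the partial reusable: each export type supplies its own title, record and paths.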
Here’s an example of how to call the partial:
Summary
There are lots of moving parts in this scheme but once you get your head around it all, it’s a pretty straightforward pattern and a variant of this could be used in other situations as well. Enjoy and good luck!