Heroku Necessities — Generate CSV Files in the Background With Delayed_job and Store Them on S3 With Paperclip – Kevin Trowbridge: Software Developer. Husband. I *heart* Webpages.

I’m trying to get back into technical blogging as I encounter interesting situations on a daily basis … and I get so much information from others doing the same thing.

In this case I’m moving a fairly large blog from a custom deployment platform on EngineYard to Heroku. Heroku enforces a 30-second request timeout, so the web server can’t be used for heavy, long-running tasks like generating a large CSV file.

The solution is to move the generation of the CSV file into a background task and store the generated file on Amazon S3. Since in my case the data that I am compiling into the CSV file is private, I also show how to configure Paperclip so that the generated file can be downloaded only by authenticated users.

Here’s a brief (30 second) video showing the UI you can build by following these steps:

Heroku Necessities: generate CSV files in the background with “delayed_job” and store them on S3 with “paperclip” … from Kevin Trowbridge on Vimeo.

The Model: ExportedDataCsv.rb

In my case I have a few large sets of data that are stored in the database, that need to be exportable from the system for reporting and administrative tasks. Think … the ‘Users’ table (full list of users with email addresses, names, and so on) … or the ‘Stories’ table (for a blog, all of the ‘stories’ that have ever been written for the site). So this is stateful. We’re going to turn the Users table into a CSV file and save it on Amazon S3. We’ll be storing specific information about the file:

What’s its exact name?
When was it generated?
Is it actively generating right now, or is it available for download?

We’re using Paperclip to handle the mechanics of saving the file to S3, but we’ll need to set up a model in order to configure Paperclip, as well as to store that stateful information.

app/models/exported_data_csv.rb

class ExportedDataCsv < ActiveRecord::Base
  has_attached_file :csv_file, {:s3_protocol => 'https', :s3_permissions => "authenticated_read"}

  acts_as_singleton

  def generating?
    job_id.present?
  end

  def csv_file_exists?
    !self.csv_file_file_name.blank?
  end

  def trigger_csv_generation
    job = Delayed::Job.enqueue GenerateCsvJob.new({:csv_instance => self})
    update_attribute(:job_id, job.id)
  end

  def write_csv
    file = Tempfile.new([self.filename, '.csv'])
    begin
      file.write self.data_string
      file.flush # make sure the data is on disk before Paperclip copies the file
      self.csv_file = file
      self.save
    ensure
      file.close
      file.unlink # deletes the temp file
    end
  end

  protected

  # Kevin says: override me in subclasses ...
  def filename
    'exported_data_csv_'
  end

  def data_string
    ''
  end
end

Now that you’ve seen it, let’s discuss this model in more detail:

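The first line, has_attached_file, is the familiar way of configuring Paperclip. The :s3_protocol option makes the generated URLs use https, and :s3_permissions controls the ACL that is set on the uploaded file. One caveat worth checking against the S3 documentation: the authenticated-read canned ACL grants read access to any authenticated AWS user, not only to your own account, so :private may be the stricter choice.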

acts_as_singleton: I’m only storing a single version of each ExportedDataCsv file, so I am using the acts_as_singleton gem. The model associated with the exported CSV file will be a singleton.
generating? and csv_file_exists?: two methods I can use in my view to determine the immediate state of the CSV file.
trigger_csv_generation: this method gets called by the application server’s controller method to queue up the write_csv background job.
write_csv: this is the actual method that turns a CSV string into a Tempfile, which is then handed off to Paperclip.

Then there are two methods to be overridden in subclasses … oh yes, did I fail to mention? Since we are generating several distinct types of CSV files, each with its own name and data, I am using what’s called Rails ‘single table inheritance’ to create a set of subclasses to model this.
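The override pattern and the Tempfile handling can be tried outside of Rails. What follows is only a sketch in plain Ruby: SketchExport and SketchUsersExport are made-up stand-ins for the real models, there is no ActiveRecord or Paperclip involved, and instead of handing the file to Paperclip the method simply reads it back from disk.

```ruby
require 'tempfile'

# Hypothetical plain-Ruby stand-in for ExportedDataCsv (no ActiveRecord, no Paperclip).
class SketchExport
  # Writes data_string to a temp file and returns what ended up on disk.
  def write_csv
    file = Tempfile.new([filename, '.csv'])
    begin
      file.write data_string
      file.flush # push buffered bytes to disk before anything reads the path
      { :basename => File.basename(file.path), :contents => File.read(file.path) }
    ensure
      file.close
      file.unlink # deletes the temp file
    end
  end

  protected

  # Hooks meant to be overridden in subclasses.
  def filename
    'exported_data_csv_'
  end

  def data_string
    ''
  end
end

# Hypothetical subclass: only the two hooks change.
class SketchUsersExport < SketchExport
  protected

  def filename
    'users_'
  end

  def data_string
    "email,name\nkevin@example.com,Kevin\n"
  end
end

result = SketchUsersExport.new.write_csv
puts result[:basename] # starts with "users_" and ends with ".csv"
puts result[:contents]
```

Tempfile puts a unique string between the prefix and the '.csv' suffix, which is why the stored file name is longer than the name you eventually want to show to the user.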

CreateExportedDataCsv db migration

Here’s the migration to create the exported_data_csvs table in the database.

The presence of the type string makes the Single Table Inheritance work.
has_attached_file is the paperclip migration helper.
job_id is used to track the delayed_job and make the model’s generating? method work.
timestamps will keep track of when it was last updated.

db/migrate/create_exported_data_csv.rb

class CreateExportedDataCsv < ActiveRecord::Migration
  def up
    create_table :exported_data_csvs do |t|
      t.timestamps
      t.has_attached_file :csv_file
      t.string :type
      t.integer :job_id
    end
  end

  def down
    drop_table :exported_data_csvs
  end
end

Subclassed Models

With the previous two files written, it’s trivial to create a CSV file:

app/models/users_csv.rb

class UsersCsv < ExportedDataCsv

  protected

  def filename
    'users_'
  end

  def data_string
    User.all.to_comma
  end
end

The information to be put into the CSV file is simply a string. Please see https://github.com/crafterm/comma for more information on working with CSV files in Ruby.
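If you’d like to see what such a string looks like without pulling in the comma gem, here is a rough sketch using Ruby’s own csv library instead. This is a swapped-in technique, not what the post uses: SketchUser and users_to_csv are hypothetical names, and a plain Struct stands in for the User model.

```ruby
require 'csv'

# Hypothetical stand-in for the User model.
SketchUser = Struct.new(:email, :name)

# Builds the whole CSV as one string: a header row, then one row per user.
def users_to_csv(users)
  CSV.generate do |csv|
    csv << %w[email name]
    users.each { |user| csv << [user.email, user.name] }
  end
end

users = [
  SketchUser.new('ann@example.com', 'Ann, Jr.'),
  SketchUser.new('bob@example.com', 'Bob')
]
puts users_to_csv(users)
```

The library quotes fields that contain commas, which is the main reason not to build the string by hand with join(',').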

GenerateCsvJob: The Delayed Job

We use the now-standard delayed_job gem to hand off the long-running task (the write_csv method in the root model) to a background worker.

Here’s my ‘job’ file:

lib/delayed_jobs/generate_csv_job.rb

class GenerateCsvJob < Struct.new(:options)
  def perform
    csv_instance = options[:csv_instance]
    begin
      csv_instance.write_csv
    ensure
      csv_instance.update_attribute(:job_id, nil)
    end
  end
end

Credit – this stackoverflow post was very helpful to me: http://stackoverflow.com/questions/5582017/polling-with-delayed-job
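The ensure block in the job is what keeps the UI from getting stuck on “Generating CSV ...” when something goes wrong. Here is a small sketch of that behavior outside of Rails. FakeCsvRecord is a made-up stand-in for the model (its update_attribute just sets an instance variable), and SketchGenerateCsvJob mirrors the job above under a different name.

```ruby
# Hypothetical stand-in for the ActiveRecord model.
class FakeCsvRecord
  attr_reader :job_id

  def initialize(job_id, fail_on_write = false)
    @job_id = job_id
    @fail_on_write = fail_on_write
  end

  def generating?
    !@job_id.nil?
  end

  def write_csv
    raise 'upload failed' if @fail_on_write
  end

  # Mimics the model method by setting the matching instance variable.
  def update_attribute(name, value)
    instance_variable_set("@#{name}", value)
  end
end

# Same shape as GenerateCsvJob, renamed so it cannot clash with the real one.
class SketchGenerateCsvJob < Struct.new(:options)
  def perform
    csv_instance = options[:csv_instance]
    begin
      csv_instance.write_csv
    ensure
      csv_instance.update_attribute(:job_id, nil) # runs on success and on failure
    end
  end
end

broken = FakeCsvRecord.new(43, true)
begin
  # Explicit braces keep the hash a positional argument on newer Rubies.
  SketchGenerateCsvJob.new({ :csv_instance => broken }).perform
rescue RuntimeError => e
  puts "job failed: #{e.message}"
end
puts broken.generating?
```

Even though write_csv raised, job_id has been cleared, so the record no longer reports that it is generating.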

The Controller

The controller is pretty simple … there are two methods.

generate_csv – queue up a new delayed job to generate the CSV file and immediately redirect_to :back
index – point the client to the S3 ‘expiring url’ path (the URL only lasts 5 minutes) to download the CSV file, if it exists.

def index
  respond_to do |format|
    format.csv do
      if Rails.env[/production|demo/]
        redirect_to UsersCsv.instance.csv_file.expiring_url(5.minutes)
      else
        send_file UsersCsv.instance.csv_file.path
      end
    end
  end
end

def generate_csv
  UsersCsv.instance.trigger_csv_generation
  flash[:notice] = "We're generating your CSV file. Refresh the page in a minute or so to download it."
  redirect_to :back
end
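A quick note on the Rails.env[/production|demo/] check in the index action: Rails.env behaves like a String, and String#[] with a regular expression returns the matched text or nil. The sketch below uses plain strings in place of Rails.env, and s3_environment? is a hypothetical helper name.

```ruby
# Plain strings stand in for Rails.env here.
def s3_environment?(env_name)
  !env_name[/production|demo/].nil?
end

%w[production demo development test].each do |env_name|
  puts "#{env_name}: #{s3_environment?(env_name)}"
end
```

The pattern is not anchored, so an environment named something like "preproduction" would match too. Use /\A(production|demo)\z/ if that matters for your setup.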

routes.rb

In the routes file we just need to add a custom route to allow the client to access the generate_csv action that we created in the controller:

  resources :users do
    collection do
      post :generate_csv
    end
  end

The last tricky bit … the view

The last tricky piece is the view. In the view we determine whether a CSV has been generated yet … if not, we allow the user to trigger the generation of a CSV file … if so we show the link to it, but also allow the user to refresh the file as it may be far out of date.

Since we’re building a framework that will allow us to have many different CSV files … we first create an abstracted partial that will accept various input variables and that we can use all over our site:

<% if csv_object.generating? %>
  Generating CSV ...
<% else %>
  <% if csv_object.csv_file_exists? %>
    <% shortened_filename = csv_object.csv_file_file_name.slice(/(^.*)_/, 1) + '.csv' %>
    <%= link_to shortened_filename, download_path %>
    Last updated:
    <%= csv_object.updated_at.to_s(:viewable) %>
  <% else %>
    No CSV exists.
  <% end %>
  <%= link_to "#{csv_object.csv_file_exists? ? 'Update' : 'Generate'} CSV.", trigger_generation_path, :method => :post %>
<% end %>
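The shortened_filename line in the partial strips the unique part that Tempfile adds to the stored name. A small sketch of what that expression does, pulled out into a hypothetical helper; the sample file names are invented for illustration:

```ruby
# Keeps everything before the last underscore and puts '.csv' back on the end.
def shortened_filename(stored_name)
  stored_name.slice(/(^.*)_/, 1) + '.csv'
end

puts shortened_filename('users_20120514-4021-1a2b3c.csv')
puts shortened_filename('exported_data_csv_20120514-4021-1a2b3c.csv')
```

Because .* is greedy, the match runs up to the last underscore, so prefixes that contain underscores themselves survive intact. A stored name with no underscore at all would make slice return nil and raise a NoMethodError, which is one more reason to end every filename prefix with an underscore.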

Here’s an example of how to call the partial:

<li>
  <%= link_to 'Users', admin_users_path %>
  <br/>
  Download all:
  <%= render :partial => '/common/csv_generation_ui', :locals => {:csv_object => UsersCsv.instance, :trigger_generation_path => generate_csv_admin_users_path, :download_path => admin_users_path(:format => :csv)} %>
</li>

Summary

There are lots of moving parts in this scheme but once you get your head around it all, it’s a pretty straightforward pattern and a variant of this could be used in other situations as well. Enjoy and good luck!

Kevin Trowbridge

Software Developer. Husband. I ❤ Webpages.
