Just Enough Developed Infrastructure

Rails and Large, Large file Uploads: looking at the alternatives

Uploading files in rails is a relatively easy task. There are a lot of helpers that make this even more flexible, such as attachment_fu or paperclip. But what happens if you upload *VERY VERY LARGE* files (say 5GB) in rails? Do the standard solutions still apply? The main thing is that we want to avoid strategies that load the whole file in memory, and avoid multiple temporary file writes.
This document describes our findings from uploading these kinds of files in a rails environment. We tried the following alternatives:
  • Using Webrick
  • Using Mongrel
  • Using Merb
  • Using Mongrel Handlers
  • Using Sinatra
  • Using Rack Metal
  • Using Mod_Rails aka Passenger
  • Non-Rails Alternatives
And I'm afraid the news is not that good. For now...

A simple basic Upload Handler (to get started)
Ok, let's make a little upload application (loosely based upon http://www.tutorialspoint.com/ruby-on-rails/rails-file-uploading.htm).
Install rails (just to show you the version I used)
$ gem install rails
Successfully installed rake-0.8.4
Successfully installed activesupport-2.3.2
Successfully installed activerecord-2.3.2
Successfully installed actionpack-2.3.2
Successfully installed actionmailer-2.3.2
Successfully installed activeresource-2.3.2
Successfully installed rails-2.3.2

$ gem install sqlite3-ruby
$ rails upload-test
$ cd upload-test
$ script/generate controller Upload
      exists  app/controllers/
      exists  app/helpers/
      create  app/views/upload
      exists  test/functional/
      create  test/unit/helpers/
      create  app/controllers/upload_controller.rb
      create  test/functional/upload_controller_test.rb
      create  app/helpers/upload_helper.rb
      create  test/unit/helpers/upload_helper_test.rb
The first step is to create a controller that has two actions: 'index' shows the form 'uploadfile.html.erb', and 'upload' handles the upload.
#app/controllers/upload_controller.rb
class UploadController < ApplicationController
  def index
     render :file => 'app/views/upload/uploadfile.html.erb'
  end
  def upload
    post = Datafile.save(params[:uploadform])
    render :text => "File has been uploaded successfully"
  end
end
The second step is to create the view that shows the file upload form in the browser. Note the multipart parameter, which is needed to POST a file.
#app/views/upload/uploadfile.html.erb
<% form_for :uploadform, :url => { :action => 'upload'}, :html => {:multipart => true}  do |f| %>
  <%= f.file_field :datafile %><br />
  <%= f.submit 'Create' %>
<% end %>
The last step is to create the model, which saves the uploaded file to public/data. Note that we use original_filename to name the saved file.
#app/models/datafile.rb
class Datafile < ActiveRecord::Base
  def self.save(upload)
    name =  upload['datafile'].original_filename
    directory = "public/data"
    # create the file path
    path = File.join(directory, name)
    # write the file
    File.open(path, "wb") { |f| f.write(upload['datafile'].read) }
  end
end


Before we start the server, we create the public/data directory:
$ mkdir public/data
$ ./script/server webrick
=> Booting WEBrick
=> Rails 2.3.2 application starting on http://0.0.0.0:3000
=> Call with -d to detach
=> Ctrl-C to shutdown server
[2009-04-10 13:18:27] INFO  WEBrick 1.3.1
[2009-04-10 13:18:27] INFO  ruby 1.8.6 (2008-03-03) [universal-darwin9.0]
[2009-04-10 13:18:27] INFO  WEBrick::HTTPServer#start: pid=5057 port=3000
Point your browser to http://localhost:3000/upload and you can upload a file. If all goes well, there should be a file in public/data with the same name as the file you uploaded.
Scripting a large Upload
Browsers have their limitations for file uploads. Depending on whether you are working with a 64-bit OS and a 64-bit browser, you can upload larger files, but 2GB seems to be the limit.
For scripting the upload we will use curl to do the same thing. To upload a file called large.zip to our form, you can use:
curl -Fuploadform['datafile']=@large.zip http://localhost:3000/upload/upload
If you use this, rails throws the following error: "ActionController::InvalidAuthenticityToken (ActionController::InvalidAuthenticityToken):"
As described in http://ryandaigle.com/articles/2007/9/24/what-s-new-in-edge-rails-better-cross-site-request-forging-prevention this check is used to protect rails against cross-site request forgery. We need to have rails skip this filter.
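#app/controllers/upload_controller.rb
class UploadController < ApplicationController
  # skip the authenticity token check so the scripted curl upload is accepted
  skip_before_filter :verify_authenticity_token
  # index and upload actions as before
end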
Webrick and Large File Uploads
Webrick is the default webserver that ships with rails. Now let's upload a large file and see what happens.
Ok, it's natural that this takes longer to handle. But zoom in on the memory usage of your ruby process, for instance with top:
 7895 ruby        16.0%  0:26.61   2    33    144  559M   188K   561M   594M
====> Memory GROWS: We see that the ruby process is growing and growing. I guess it is because webrick loads the body in a string first.
#gems/rails-2.3.2/lib/webrick_server.rb
def handle_dispatch(req, res, origin = nil) #:nodoc:
    data = StringIO.new
    Dispatcher.dispatch(
      CGI.new("query", create_env_table(req, origin), StringIO.new(req.body || "")),
      ActionController::CgiRequest::DEFAULT_SESSION_OPTIONS,
      data
    )
=====> Files get written to disk multiple times for the multipart parsing: when the file is uploaded, you see a message appearing in the webrick log. It refers to a file in /var/folders/EI/....
Processing UploadController#upload (for ::1 at 2009-04-09 13:51:23) [POST]
  Parameters: {"commit"=>"Create", "authenticity_token"=>"rf4V5bmHpxG74q6ueI3hUjJzwhTLUJCp9VO1uMV1Rd4=", "uploadform"=>{"datafile"=>#<File:/var/folders/EI/EIPLmNwOEea96YJDLHTrhU+++TI/-Tmp-/RackMultipart.7895.1>}}
[2009-04-09 14:09:03] INFO  WEBrick::HTTPServer#start: pid=7974 port=3000
It turns out that the part that handles the multipart writes the files to disk in the $TMPDIR. It creates files like
$ ls $TMPDIR/
RackMultipart.7974.0
RackMultipart.7974.1
Strange, two times? We only uploaded one file. I figure this is handled by the rack/utils.rb bundled in action_controller. Possibly related is the bug described at https://rails.lighthouseapp.com/projects/8994/tickets/1904-rack-middleware-parse-request-parameters-twice
#gems/actionpack-2.3.2/lib/action_controller/vendor/rack-1.0/rack/utils.rb
    # Stolen from Mongrel, with some small modifications:
    def self.parse_multipart(env)
      # ... (this is where the multipart body gets written to tempfiles)
Optimizing the last write to disk
Instead of
# write the file
File.open(path, "wb") { |f| f.write(upload['datafile'].read) }
We can use the following to avoid writing to disk ourselves:
FileUtils.mv upload['datafile'].path, path
This makes use of the fact that the file is already on disk, and a file move is much faster than rewriting the file.
Still, this might not be usable in all cases: if your TMPDIR is on another filesystem than your final destination, the move turns into a full copy and this trick won't help you.
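The trade-off can be sketched as follows: move the file when both paths sit on the same device, otherwise copy it in fixed-size chunks so that it is never held in memory as a whole. `store_upload` and `CHUNK_SIZE` are names made up for this illustration. FileUtils.mv would fall back to a copy by itself; the explicit device check is only there to make the cost visible.

```ruby
require 'fileutils'

CHUNK_SIZE = 1024 * 1024 # 1 MB per read; an arbitrary value for this sketch

# Move the temp file to its destination. Returns :renamed when both paths live
# on the same filesystem (cheap), or :copied when the data had to be rewritten.
def store_upload(tmp_path, dest_path)
  same_device = File.stat(tmp_path).dev == File.stat(File.dirname(dest_path)).dev
  if same_device
    FileUtils.mv(tmp_path, dest_path)
    :renamed
  else
    # Different filesystem: stream chunk by chunk instead of read-ing it all.
    File.open(tmp_path, 'rb') do |src|
      File.open(dest_path, 'wb') do |dst|
        while (chunk = src.read(CHUNK_SIZE))
          dst.write(chunk)
        end
      end
    end
    File.delete(tmp_path)
    :copied
  end
end
```

In the model this would replace the File.open line, for example `store_upload(upload['datafile'].path, path)`.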
Mongrel and Large File Uploads
The behaviour of Webrick was already discussed on the mongrel mailing list (http://osdir.com/ml/lang.ruby.mongrel.general/2007-10/msg00096.html) and is supposed to be fixed. So let's install mongrel:
$ gem install mongrel
Successfully installed gem_plugin-0.2.3
Successfully installed daemons-1.0.10
Successfully installed fastthread-1.0.7
Successfully installed cgi_multipart_eof_fix-2.5.0
Successfully installed mongrel-1.1.5
$ mongrel_rails start
Ok, let's start the upload again using our curl:
======> Memory does not grow: that's good news.
======> 4 file writes for 1 upload! Because Mongrel does not keep the upload in memory, it writes it to a tempfile in the $TMPDIR. Depending on the size of the body, it will create a tempfile (when larger than MAX_BODY) or just keep a string in memory.
#lib/mongrel/const.rb
    # This is the maximum header that is allowed before a client is booted.  The parser detects
    # this, but we'd also like to do this as well.
    MAX_HEADER=1024 * (80 + 32)

    # Maximum request body size before it is moved out of memory and into a tempfile for reading.
    MAX_BODY=MAX_HEADER
#lib/mongrel/http_request.rb
        # must read more data to complete body
        if remain > Const::MAX_BODY
          # huge body, put it in a tempfile
          @body = Tempfile.new(Const::MONGREL_TMP_BASE)
          @body.binmode
        else
          # small body, just use that
          @body = StringIO.new
        end
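To make the threshold concrete: with the constants above, MAX_BODY works out to 1024 * (80 + 32) = 114688 bytes, or 112 KB, so any realistic file upload ends up in a tempfile. The sketch below mirrors that decision in plain Ruby. `body_buffer_for` is a made-up helper, and the 'mongrel' tempfile prefix is an assumption standing in for Const::MONGREL_TMP_BASE.

```ruby
require 'stringio'
require 'tempfile'

MAX_HEADER = 1024 * (80 + 32) # 114688 bytes, as in lib/mongrel/const.rb
MAX_BODY   = MAX_HEADER

# Pick the buffer type the way the excerpt above does: bodies larger than
# MAX_BODY go to a tempfile, smaller ones stay in memory.
def body_buffer_for(remaining_bytes)
  if remaining_bytes > MAX_BODY
    body = Tempfile.new('mongrel') # prefix is an assumption for this sketch
    body.binmode
    body
  else
    StringIO.new
  end
end
```

A 5 GB upload, `body_buffer_for(5 * 1024**3)`, takes the tempfile branch, while a small form post stays in a StringIO.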
In our tests, we saw that aside from the RackMultipart.<pid>.x files, there is an additional file written in $TMPDIR: mongrel.<pid>.0
That means that for 5 GB, we now have 4 x 5 GB: 1 mongrel + 2 RackMultipart + 1 final file (depending on the move or not) = 20 GB
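The arithmetic is worth spelling out, because the temporary copies add up quickly. A small sketch using the copy counts reported in this section (`disk_footprint_gb` is a name invented here):

```ruby
# Total disk space written for one upload: every copy holds the full file.
def disk_footprint_gb(upload_gb, copies)
  upload_gb * copies.values.inject(0) { |sum, n| sum + n }
end

mongrel_copies = {
  'mongrel.<pid>.0'       => 1,
  'RackMultipart.<pid>.x' => 2,
  'final file'            => 1
}
puts disk_footprint_gb(5, mongrel_copies) # 4 copies of a 5 GB upload
```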
======> No reliable, predictable results?
Also, we sometimes saw a different behaviour during the upload: mongrel did not create the RackMultipart files but CGI.<pid>.0 instead. We are unsure what the reason is.
Merb and Large File Uploads
One of the solutions you often see for handling file uploads is Merb, the main reason being that there is less blocking of your handlers. Let's try this:
$ gem install merb
Successfully installed dm-aggregates-0.9.11
Successfully installed dm-validations-0.9.11
Successfully installed randexp-0.1.4
Successfully installed dm-sweatshop-0.9.11
Successfully installed dm-serializer-0.9.11
Successfully installed merb-1.0.11
Let's create the merb application:
$ merb-gen app uploader-app
$ cd uploader-app
We need to create the controller, but this is a bit different from our original controller:
  • the file is called upload.rb instead of upload_controller.rb
  • the skip_before_filter is removed
  • in Merb the base class is Application and not ApplicationController
#app/controllers/upload.rb
class Upload < Application
  def index
    render :file => 'app/views/upload/uploadfile.rhtml'
  end
  def upload
    post = Datafile.save(params[:uploadform])
    render :text => "File has been uploaded successfully"
  end
end
The model looks like this:
  • Remove the ActiveRecord
  • include DataMapper::Resource
  • original_filename does not exist: merb passes it in the variable filename
  • the temporary file is passed differently too: merb puts it in the 'tempfile' key
#app/models/datafile.rb
class Datafile
  include DataMapper::Resource
  def self.save(upload)
    name = upload['datafile']['filename']
    directory = "public/data"
    # create the file path
    path = File.join(directory, name)
    # write the file
    File.open(path, "wb") { |f| f.write(upload['datafile']['tempfile'].read) }
  end
end
We create the public/data directory:
$ mkdir public/data
And start merb:
$ merb
 ~ Connecting to database...
 ~ Loaded slice 'MerbAuthSlicePassword' ...
 ~ Parent pid: 57318
 ~ Compiling routes...
 ~ Activating slice 'MerbAuthSlicePassword' ...
merb : worker (port 4000) ~ Starting Mongrel at port 4000
When you start the upload, a merb worker becomes active.
=====> No memory increases : good!
merb : worker (port 4000) ~ Successfully bound to port 4000
=====> 3 Filewrites: 1 mongrel + 1 merb + 1 final write
Mongrel first starts writing its mongrel.<pid>.0 in our $TMPDIR/
merb : worker (port 4000) ~ Params: {"format"=>nil, "action"=>"upload", "id"=>nil, "controller"=>"upload", "uploadform"=>{"datafile"=>{"content_type"=>"application/octet-stream",
 "size"=>306609434, "tempfile"=>#<File:/var/folders/EI/EIPLmNwOEea96YJDLHTrhU+++TI/-Tmp-/Merb.13243.0>, "filename"=>"large.zip"}}}
merb : worker (port 4000) ~
After that, Merb handles the multipart stream and writes once to $TMPDIR/Merb.<pid>.0
Sinatra and Large Files:
Sinatra is a simple framework for describing the controllers yourself. Because it seemed to have direct access to the stream, I hoped that I would be able to stream it directly without the multipart handling of Rack. First step, install sinatra:
$ gem install sinatra
Successfully installed sinatra-0.9.1.1
1 gem installed
Installing ri documentation for sinatra-0.9.1.1...
Installing RDoc documentation for sinatra-0.9.1.1...
Create a sample upload handler:
#sinatra-test-upload.rb
require 'rubygems'
require 'sinatra'
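post '/upload' do
  # params[:datafile]['file'] is a hash; the uploaded data sits behind its :tempfile key
  File.open("/tmp/theuploadedfile","wb") { |f| f.write(params[:datafile]['file'][:tempfile].read) }
end
$ ruby sinatra-test-upload.rb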
== Sinatra/0.9.1.1 has taken the stage on 4567 for development with backup from Mongrel

So instead of port 3000 it listens on port 4567.
====> No memory increase: good!
====> 4 file writes: again we see 4 = 1 mongrel.<pid>.* + 2 x RackMultipart.<pid>.* + 1 final file write
Using Mongrel handlers to bypass other handlers
Up until now, we have three writers: the webserver, the multipart parser and the final write. So how can we skip the writes of the webserver or the multipart parser without consuming all the memory?
I found another approach by using a standalone mongrel handler. This allows you to interact with the incoming stream before Rack/Multipart kicks in.
Let's create an example Mongrel handler. It only shows that you can access the request directly:
require 'rubygems'
require 'mongrel'

class HelloWorldHandler < Mongrel::HttpHandler
  def process(request, response)
    puts request.body.path
    response.start(200) do |head,out|
      head['Content-Type'] = "text/plain"
      out << "Hello world!"
    end
  end
  def request_progress(params, clen, total)
  end
end
Mongrel::Configurator.new do
  listener :port => 3000 do
    uri "/", :handler => HelloWorldHandler.new
  end
  run; join
end
=====>No memory increase: good!
=====>1 FILE and direct access, but still needs multipart parsing:
It turns out that request.body.path is the mongrel.<pid>.0 file, giving us direct access to the first uploaded file.
request.body.path = /var/folders/EI/EIPLmNwOEea96YJDLHTrhU+++TI/-Tmp-/mongrel.93690.0
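Since the mongrel.<pid>.0 file still holds the raw multipart body, the file part has to be cut out of it. A minimal sketch, assuming the body contains exactly one part (as with a single curl -F field) and nothing after the closing boundary: read the part headers, then copy the bytes between them and the closing delimiter in chunks, using offsets instead of scanning the data. `extract_single_part` and `PART_CHUNK` are hypothetical names, and this is not a general multipart parser.

```ruby
PART_CHUNK = 1024 * 1024 # bytes per read; arbitrary for this sketch

# Copy the payload of a single-part multipart body to dest_path and return
# the part headers. Assumes one part only and no epilogue after the trailer.
def extract_single_part(body_path, dest_path)
  File.open(body_path, 'rb') do |body|
    boundary_line = body.gets("\r\n") # "--BOUNDARY\r\n"
    headers = {}
    while (line = body.gets("\r\n")) && line != "\r\n"
      name, value = line.chomp("\r\n").split(': ', 2)
      headers[name.downcase] = value
    end
    data_start = body.pos
    # The closing delimiter is "\r\n" + "--BOUNDARY" + "--\r\n"
    trailer_len = 2 + boundary_line.chomp("\r\n").bytesize + 4
    remaining = File.size(body_path) - data_start - trailer_len
    File.open(dest_path, 'wb') do |dest|
      while remaining > 0
        chunk = body.read([PART_CHUNK, remaining].min)
        break if chunk.nil?
        dest.write(chunk)
        remaining -= chunk.bytesize
      end
    end
    headers
  end
end
```

Inside the handler this could be called as `extract_single_part(request.body.path, 'public/data/large.zip')`. It is still one extra write, but the two RackMultipart copies are avoided.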

Using Rails Metal
Metal is an addition to Rails 2.3 that allows you to bypass the regular rails routing and controller stack with a bare Rack endpoint.
# Allow the metal piece to run in isolation
require(File.dirname(__FILE__) + "/../../config/environment") unless defined?(Rails)
class Uploader
  def self.call(env)
    if env["PATH_INFO"] =~ /^\/uploader/
      puts env["rack.input"].path
      [200, {"Content-Type" => "text/html"}, ["It worked"]]
    else
      # a 404 makes Metal pass the request on to the regular rails stack
      [404, {"Content-Type" => "text/html"}, ["Not Found"]]
    end
  end
end


Similar to the Mongrel HTTP handler, we can get access to the mongrel upload file:
env["rack.input"].path = /var/folders/EI/EIPLmNwOEea96YJDLHTrhU+++TI/-Tmp-/mongrel.81685.0
If we want to parse this, we can pass the env to Rack::Request.new, but this kicks in the RackMultipart again.
request = Rack::Request.new(env)
puts request.POST
#uploaded_file = request.POST["file"][:tempfile].read
=====>No memory increase: good!
=====>1 FILE and direct access, but still needs multipart parsing
=====>Can still run traditional rails and metal rails in the same webserver
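One way to sidestep the multipart parsing altogether is to have the client send the file as a raw request body (for example with curl -T large.zip, which issues a PUT) and copy rack.input to its destination in chunks. The class below is a sketch of such a Rack-style endpoint, not part of the tests above: `RawUploader`, the chunk size and the destination path are all assumptions. With Mongrel in front, rack.input is already the mongrel tempfile, so this is still a second write.

```ruby
# A plain Rack-style endpoint that streams a raw PUT body to disk.
class RawUploader
  CHUNK = 1024 * 1024 # bytes per read; arbitrary for this sketch

  def initialize(dest_path)
    @dest_path = dest_path
  end

  def call(env)
    unless env['REQUEST_METHOD'] == 'PUT'
      return [405, { 'Content-Type' => 'text/plain' }, ["Use PUT\n"]]
    end
    input = env['rack.input']
    written = 0
    File.open(@dest_path, 'wb') do |dest|
      # read(length) returns nil at end of input, which ends the loop
      while (chunk = input.read(CHUNK))
        dest.write(chunk)
        written += chunk.bytesize
      end
    end
    [200, { 'Content-Type' => 'text/plain' }, ["Stored #{written} bytes\n"]]
  end
end
```

Memory use stays bounded by the chunk size, and no RackMultipart files are created because nothing asks Rack to parse the body.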

Using Mod_rails aka Passenger
Mod_rails seems to be becoming the new standard for running rails applications without the blocking hassle, using plain apache as a good, stable, proven technology.
One of its main benefits is that it does not keep a rails handler blocked while the request is still coming in: the handler only gets involved once the complete request has been received. Sounds like good technology here!
curl -v -F datafile['file']=@large.zip http://localhost:80/
 * About to connect() to localhost port 80
 *   Trying 127.0.0.1... connected
 * Connected to localhost (127.0.0.1) port 80
 > POST /datafiles HTTP/1.1
 > User-Agent: curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
 > Host: localhost
 > Accept: */*
 > Content-Length: 421331151
 > Expect: 100-continue
 > Content-Type: multipart/form-data; boundary=----------------------------1bf75aea2f35
 >
 < HTTP/1.1 100 Continue
Setting up mod_rails is beyond the scope of this document, so we assume you have it working for your rails app.
In my /etc/httpd/conf/httpd.conf:
LoadModule passenger_module /opt/ruby-enterprise-1.8.6-20090201/lib/ruby/gems/1.8/gems/passenger-2.1.3/ext/apache2/mod_passenger.so
PassengerRoot /opt/ruby-enterprise-1.8.6-20090201/lib/ruby/gems/1.8/gems/passenger-2.1.3
PassengerRuby /opt/ruby-enterprise-1.8.6-20090201/bin/ruby
Mod_rails has a nice setting that lets you specify the directory it uses for temporary files:
See http://www.modrails.com/documentation/Users%20guide.html#_passengertempdir_lt_directory_gt for more details

5.10. PassengerTempDir <directory>

Specifies the directory that Phusion Passenger should use for storing temporary files. This includes things such as Unix socket files, buffered file uploads, etc.

This option may be specified once, in the global server configuration. The default temp directory that Phusion Passenger uses is /tmp.

This option is especially useful if Apache is not allowed to write to /tmp (which is the case on some systems with strict SELinux policies) or if the partition that /tmp lives on doesn’t have enough disk space.
Ok let's start the upload and see what happens:
=====> Memory goes up!
# ./passenger-memory-stats
 -------------- Apache processes ---------------
 PID    PPID   Threads  VMSize    Private  Name
 -----------------------------------------------
 30840  1      1        184.3 MB  0.0 MB   /usr/sbin/httpd
 30852  30840  1        186.2 MB  ?        /usr/sbin/httpd
 30853  30840  1        184.3 MB  ?        /usr/sbin/httpd
 30854  30840  1        184.3 MB  ?        /usr/sbin/httpd
 30855  30840  1        184.3 MB  ?        /usr/sbin/httpd
 30856  30840  1        184.3 MB  ?        /usr/sbin/httpd
 30857  30840  1        184.3 MB  ?        /usr/sbin/httpd
 30858  30840  1        184.3 MB  ?        /usr/sbin/httpd
 30859  30840  1        184.3 MB  ?        /usr/sbin/httpd
 ### Processes: 9
 ### Total private dirty RSS: 0.03 MB (?)
---------- Passenger processes -----------
 PID    Threads  VMSize     Private   Name
 ------------------------------------------
 30847  4        14.1 MB    0.1 MB    /opt/ruby-enterprise-1.8.6-20090201/lib/ruby/gems/1.8/gems/passenger-2.1.3/ext/apache2/ApplicationPoolServerExecutable 0
/opt/ruby-enterprise-1.8.6-20090201/lib/ruby/gems/1.8/gems/passenger-2.1.3/bin/passenger-spawn-server  /opt/ruby-enterprise-1.8.6-20090201/bin/ruby
/tmp/passenger.30840/info/status.fifo
 30848  1        87.7 MB    ?         Passenger spawn server
 30888  1        123.6 MB   0.0 MB    Passenger ApplicationSpawner: /home/myrailsapp
 30892  1        1777.4 MB  847.5 MB  Rails: /home/myrailsapp
 ### Processes: 4
 ### Total private dirty RSS: 847.62 MB (?)
Very strange. In /opt/ruby-enterprise-1.8.6-20090201/lib/ruby/gems/1.8/gems/passenger-2.1.3/ext/apache2/Hooks.cpp of the passenger source we find:
     expectingUploadData = ap_should_client_block(r);
     if (expectingUploadData && atol(lookupHeader(r, "Content-Length"))
             > UPLOAD_ACCELERATION_THRESHOLD) {
          uploadData = receiveRequestBody(r);
     }
The expectingUploadData part is the one that answers curl's
 > Expect: 100-continue
But it seems curl isn't handling this response: it keeps on streaming the file, ignoring the response.
To avoid having mod_rails send this, we can fall back to HTTP/1.0 using the -0 curl option.
$ curl -v -0 -F datafile['file']=@large.zip http://localhost:80
* About to connect() to localhost port 80
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 80
> POST /uploader/ HTTP/1.0
> User-Agent: curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> Host: localhost
> Accept: */*
> Content-Length: 421331151
> Content-Type: multipart/form-data; boundary=----------------------------1b04b7cb6566
Now the correct mechanism kicks in, and the upload is buffered in a file under /tmp:
/tmp/passenger.1291/backends/backend.g0mi40ARBFbEdb08pxB3uzyh3JJyfR1eaI9xPuQwyLEd3NjQ24rbpSBb9FrZfNX5WI5VYQ
====> Memory doesn't go up: good! (again)
====> Same number of files = 1 /tmp/passenger file + the same files as in the previous examples
The alternatives: (non-rails)
The problem so far is mainly a problem of implementation: there is no reason why streaming a file upload would not be possible in rails.
The correct hooks for streaming the file directly to a handler, without temporary files or memory growth, are currently just not there.
I hope we will eventually see an upload streaming API (similar to the download streaming API) and a streamable multipart handler.
Alternative 1: Have the webserver handle our stream directly
Alternative 2: Write our own HTTP server in ruby
Using a raw HTTP server, with plain sockets to implement the webserver: http://lxscmn.com/tblog/?p=25
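To give an idea of what the plain-socket route involves, here is a minimal sketch (not the code from the linked post): read the request line and headers, then copy exactly Content-Length bytes from the socket to a file in chunks. It assumes a raw request body rather than multipart, handles one connection at a time and has no error handling. `handle_upload`, `run_server` and the chunk size are names and values invented for this example.

```ruby
require 'socket'

SOCKET_CHUNK = 64 * 1024 # bytes per read; arbitrary for this sketch

# Read one HTTP request from io and stream its body straight to dest_path.
def handle_upload(io, dest_path)
  io.gets("\r\n") # request line, ignored in this sketch
  content_length = 0
  while (line = io.gets("\r\n")) && line != "\r\n"
    name, value = line.chomp("\r\n").split(':', 2)
    next unless value
    content_length = value.strip.to_i if name.downcase == 'content-length'
  end
  remaining = content_length
  File.open(dest_path, 'wb') do |dest|
    while remaining > 0 && (chunk = io.read([SOCKET_CHUNK, remaining].min))
      dest.write(chunk)
      remaining -= chunk.bytesize
    end
  end
  stored = content_length - remaining
  io.write("HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nstored #{stored} bytes\n")
end

# Accept connections one at a time and hand each one to handle_upload.
def run_server(port, dest_path)
  server = TCPServer.new('127.0.0.1', port)
  loop do
    client = server.accept
    handle_upload(client, dest_path)
    client.close
  end
end
```

Calling `run_server(3000, '/tmp/theuploadedfile')` and uploading with curl -0 -T large.zip http://localhost:3000/ would write the file exactly once, at the price of doing all the HTTP handling yourself.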


Alternative 3: Use apache commons fileupload component in java
This component is exactly what we need in rails/ruby. http://commons.apache.org/fileupload/
For now, this is what we will use. It has a streamable API for both the incoming request AND the multiparts!
Read more at http://www.jedi.be/blog/2009/04/10/java-servlets-and-large-large-file-uploads-enter-apache-fileupload/