How to handle uploaded files

10 Nov 2006

Are you a web developer who wants to know some tips or tricks about storing files in a database? Would you like a quick example on how to handle a file upload in Rails and stuff that file in your database? My answer to both questions is: don’t do it. Here’s why.

The database software you are most likely using is a relational database, which was designed to hold discrete and related chunks of data and to make it easy for you to retrieve that data based on relationships to other data. Your filesystem was designed to store and organize files and to make it easy for you to retrieve those files in a fairly random fashion. In other words, use the file system for storing files, not your database. Still not convinced? Alright, then I can give you an example.

Let’s say you are hosting a popular site that is database-backed but has a lot of static content and various types of media assets (images, videos, etc.). A lot of Rails apps fall into this “custom CMS” type of application. Sooner or later, your application gets more popular, you need to find a way to improve performance, and you look at distributing the load of all those assets you are serving. Following the standard “shared-nothing” approach to scaling a web application, you decide to move image and asset hosting to another server to spread the load. Rails even provides a configuration option for making this easy: ActionController::Base.asset_host.
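To make the asset host idea concrete, here is a rough sketch in plain Ruby of what such a setting amounts to: asset paths get prefixed with another server's address. This is only a model of the behavior, not Rails' own code. The asset_url helper and the assets.example.com hostname are made up for illustration.

```ruby
# Illustrative only: a plain-Ruby model of what an asset host setting does.
# In a Rails app the setting named in the article would be assigned
# something like this (assets.example.com is a placeholder hostname):
#   ActionController::Base.asset_host = "http://assets.example.com"

# Hypothetical helper: prefix an asset path with the asset host, if one is set.
def asset_url(path, asset_host = nil)
  # With no asset host configured, assets are served by the application host.
  return path if asset_host.nil? || asset_host.empty?

  # Join host and path with exactly one slash between them.
  asset_host.chomp("/") + "/" + path.sub(%r{\A/+}, "")
end

puts asset_url("/images/logo.png")
puts asset_url("/images/logo.png", "http://assets.example.com")
```

The point is that the application keeps generating the same relative paths. Only the host in front of them changes, which is why files on disk are easy to move to a second server.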

If your images are stored in the database, now you have a problem. It’s actually a non-trivial problem to replicate database data from one database server to another. Sure, every decent database engine has replication support, but the problem is that in almost every case database replication is difficult to set up, difficult to maintain, or simply error-prone. However, if you have your asset files on the filesystem where they belong, the solution is much simpler.

Rsync is a tried and tested solution to the problem of synchronizing the filesystems among hosts. It’s darned easy to replicate the files from one host to another. You can even get easy, secure, and unattended synchronization by installing rsync on both hosts, setting up key-based login via ssh, and then running this on a regular basis (from cron, for example) from the host that has the uploaded assets: rsync -cure ssh /source/path/ destination_server.com:/destination/path/. This command will recursively copy all the files that have been updated or have a different checksum to the destination server via ssh. It’s that easy.
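If you would rather drive that sync from a script than type the command into cron directly, a minimal sketch might look like the following. It builds the same rsync invocation with the bundled -cure flags spelled out one by one. The rsync_command and sync_assets names are hypothetical, and the paths and hostname are the placeholders from the command above.

```ruby
# Build the rsync invocation from the article as an argument list.
# Passing a list to system avoids shell quoting problems with odd paths.
def rsync_command(source_dir, dest_host, dest_dir)
  [
    "rsync",
    "-c",        # compare files by checksum instead of size and timestamp
    "-u",        # skip files that are newer on the destination
    "-r",        # recurse into directories
    "-e", "ssh", # use ssh as the remote shell
    source_dir,
    "#{dest_host}:#{dest_dir}"
  ]
end

# Hypothetical wrapper: run the sync and report success as true or false.
# Assumes rsync is installed on both hosts and key-based ssh login works.
def sync_assets(source_dir, dest_host, dest_dir)
  system(*rsync_command(source_dir, dest_host, dest_dir)) == true
end

# Print the command rather than running it, since the hosts are placeholders.
puts rsync_command("/source/path/", "destination_server.com", "/destination/path/").join(" ")
```

Note the trailing slash on the source path: rsync then copies the contents of the directory rather than the directory itself.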

Of course, I haven’t even addressed the most compelling reason to store your files on the filesystem: performance for serving those files. Storing your images and other media files on the filesystem makes it much easier to serve them via a very fast web server such as lighttpd, which incurs a much smaller performance cost than streaming the data out of the database through your Rails application.

I hope I have convinced you to store your files on the filesystem, whether for performance or for ease of scaling. If you do, the world will be a better place.
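For the curious, here is a rough sketch of what the filesystem approach could look like in plain Ruby. It is one possible shape under my own assumptions, not a Rails API: the store_upload helper, the random subdirectory scheme, and the filename sanitizing rule are all made up for illustration. A database row would then hold only the returned path plus whatever metadata you need.

```ruby
require "fileutils"
require "securerandom"
require "stringio"
require "tmpdir"

# Hypothetical helper: write an uploaded IO stream to disk under `root`
# and return the path relative to `root`.
def store_upload(io, original_filename, root)
  # Never trust the client's filename: keep only the base name and
  # replace anything outside a conservative set of characters.
  safe_name = File.basename(original_filename.to_s).gsub(/[^\w.\-]/, "_")
  safe_name = "upload" if safe_name.empty? || safe_name.start_with?(".")

  # A random subdirectory keeps two uploads with the same name apart.
  relative_path = File.join(SecureRandom.hex(8), safe_name)
  absolute_path = File.join(root, relative_path)

  FileUtils.mkdir_p(File.dirname(absolute_path))
  File.open(absolute_path, "wb") { |file| IO.copy_stream(io, file) }
  relative_path
end

# Usage with an in-memory stream standing in for a real upload.
Dir.mktmpdir do |root|
  path = store_upload(StringIO.new("fake image bytes"), "../../my photo.png", root)
  puts path
  puts File.read(File.join(root, path))
end
```

Because the stored path is relative, the same record stays valid when the files are synced to another host under a different root directory.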


5 responses to “How to handle uploaded files”

David Bock (14:19:18) :

Agree totally! Files don’t belong in a database. Most of the time, files just belong in the filesystem. You can store any meta info on the file you need in a record in the database, along with the path to the file. If you find you need something ‘extra’, like versioning of files, then stand up a CVS or Subversion server to dump the files into. I have done this from Java, but haven’t done it from rails yet.

Kevin Teague (20:41:23) :

The flip side to this is if you are trying to store many different types of files, or you don’t have high traffic or really large file sets. Here the database makes a lot of sense, because all your data is in one place. Your code is a lot simpler and your sys admin needs are reduced. If the integrity of your data is important, it can also give you good peace of mind to have all your data handled through transactions. In some organizations there can be a gap between those who manage the filesystems and the databases, and the application developers who write the code. Using a hybrid approach can lead to systems where a portion of your data set is one day just ‘gone’.

Tramline (http://www.infrae.com/products/tramline) takes an interesting approach to the files-on-the-filesystem solution in that it intercepts the files with Apache, handles file storage from there, and fakes out the file sent to the app server with a simple tramline id. Quite useful if you originally design a system to store data in the database and only later realize your blunder when performance grinds to a halt.

Database replication can be a PITA with most open source relational databases, but there are systems that are trivial to do replication with. ZEO scales the Zope object database very easily (this isn’t a relational database though). This isn’t full replication, but simply caching subsets of the data on client servers as it’s accessed, which isn’t going to work for all problems … and the ZODB still sucks at large BLOB support.

ben (07:08:06) :

I’m thinking of keeping images and uploaded files in Amazon’s S3. SmugMug’s blog had an interesting write-up on using S3: http://blogs.smugmug.com/onethumb/2006/11/10/amazon-s3-show-me-the-money/

Seems like it’d take the load off the server without the bother of setting up an asset server for each new project as well.

Benj (19:55:17) :

That smugmug/S3 article was a great read, thanks for that gem!

Ben (06:02:30) :

I recently started using S3 at work, following the tips outlined here: http://blog.eberly.org/2006/10/09/how-automate-your-backup-to-amazon-s3-using-s3sync/

Basically, you can just replace rsync with s3sync and be off to the races. :) When you combine that with being able to point one of your domain hosts via CNAME to S3, it’s an easy way to offload your asset traffic.