Jimmy Theis

MAR 8, 2012

Cutting S3 Costs with Metadata

Having GitHub Pages provide free hosting for this static blog is a huge blessing, but it obviously does come with some limitations. Specifically, a free GitHub account only gives you 0.30 GB (300 MB) of storage space. Even with a Micro Account (free through GitHub educational accounts), there's a limit of 0.60 GB (600 MB). This is obviously plenty of space for source code, pages, etc., but is simply not suited for hosting media like videos, slide decks, or large images.

As a result, I, like many others, have turned to Amazon S3, a service actually geared toward hosting media and other non-text content. With S3's Free Usage Tier, I actually haven't had to pay a dime yet (it lasts for one year), and when I do, it'll be a very small amount.

S3 uses a pay-for-what-you-use style of billing, where you're billed based on how much storage you're using, how many requests are made for the files you've uploaded, and how much data actually gets transferred. The pricing has changed a few times, so I won't bother copying down the billing plans here, as they'd likely be outdated quickly. Instead, I'll just link to them for anyone interested.

Because S3 bills based on how much data gets requested per month, anything we can do to reduce that number actually saves money. Is there something we can do? Well, yes.

Making Our Files Smarter

It's sort of a perfect solution: reduce the number of requests coming into our S3 account and speed up page load times, without the user being negatively affected at all. All this is accomplished by setting the Expires value of our files to some date off in the future. This will become the HTTP Expires header value that lets browsers know not to request the resources over and over again, instead loading them (nearly instantly) from their local caches.

NOTE: This is obviously only useful for resources that actually won't change, like images.

We can set this metadata value from the S3 Console either at upload time (Set Details -> Set Permissions -> Set Metadata in the upload wizard), or from the Properties pane that can be opened for any existing file (find it under the Metadata tab).

In either case, use the Add more metadata button to create a new key/value pair with Expires for the key and the file's expiration date for the value:

(Screenshot: setting the Expires key during the initial upload)

(Screenshot: setting the Expires key on an existing file from the Properties pane)
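If you'd rather script your uploads than click through the console, the same metadata can be set programmatically. Here's a minimal sketch using the aws-sdk Ruby gem (the v1-era API); the bucket and key are placeholders borrowed from this post, and I'm assuming the :expires option on write is what maps to the HTTP Expires header, so double-check against your SDK version's documentation:

require 'aws-sdk' # gem install aws-sdk
require 'cgi'

# Assumes credentials in the standard AWS_ACCESS_KEY_ID /
# AWS_SECRET_ACCESS_KEY environment variables.
s3 = AWS::S3.new

# Placeholder bucket and key -- substitute your own.
bucket = s3.buckets['dl.jetheis.com']

# The :expires option should become the object's HTTP Expires header.
bucket.objects['s3-metadata/spocktocat.jpg'].write(
  File.read('spocktocat.jpg'),
  :content_type => 'image/jpeg',
  :expires => CGI.rfc1123_date(Time.local(2013, 7, 14))
)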

Now, here's a possible sticking point. HTTP specifies that Expires dates should be in RFC 1123 format. It's a very exact format that's easy to mess up if you try to write it by hand. Luckily, Ruby's CGI class has a built-in way of generating these dates. If you're using a Mac (or a Linux/BSD box with Ruby installed), use this handy one-liner in a terminal (the Terminal app on a Mac, under Utilities in the Applications folder) to generate the date (without the $):

$ ruby -rcgi -e 'puts CGI.rfc1123_date(Time.local(2013, 7, 14))'

Instead of 2013, 7, 14 (my 24th birthday), use whichever date you have in mind. The format is year, month, day. If you're really fancy (and sure that your files won't change), use the max value for a 32-bit timestamp:

$ ruby -rcgi -e 'puts CGI.rfc1123_date(Time.at(2 ** 31 - 1))'

Here's that value for any impatient copy-and-pasters:

Tue, 19 Jan 2038 03:14:07 GMT
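As an aside, if the CGI trick feels obscure, Ruby's standard time library produces the same RFC 1123-style date via Time#httpdate:

$ ruby -rtime -e 'puts Time.local(2013, 7, 14).httpdate'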

Now that we know how to set our file expiration date, let's look at what actually happens when we do.

Does It Actually Work?

Here's an S3-hosted image of Octobi Wan Catnobi (octobiwan.jpg) that has no extra metadata set, and here's an S3-hosted image of Spocktocat (spocktocat.jpg) that has its Expires header set.

We can verify that this is true pretty easily. If you're using a UNIX-like operating system (basically anything but Windows), open a terminal and enter the following command (again, without the $) to make a request for the first image:

$ curl -v http://dl.jetheis.com/s3-metadata/octobiwan.jpg > /dev/null

Now look at the portion of the output that is prefixed with a <. Those are all of the response headers:

< HTTP/1.1 200 OK
< x-amz-id-2: +JWpez+WdCrliLIuJgb1JQ9GrdUaM2QkRF/Cc+oZBmvtZWMBCsYb4bavdptvXJEo
< x-amz-request-id: 616907A681625FE3
< Date: Wed, 07 Mar 2012 17:05:14 GMT
< Last-Modified: Wed, 07 Mar 2012 16:38:05 GMT
< ETag: "dcef3abedf0e0761203aaeb85886a6f3"
< Accept-Ranges: bytes
< Content-Type: image/jpeg
< Content-Length: 65307
< Server: AmazonS3

No Expires. Let's try the image we set that header on:

$ curl -v http://dl.jetheis.com/s3-metadata/spocktocat.jpg > /dev/null

We'll look at the output prefixed with < again:

< HTTP/1.1 200 OK
< x-amz-id-2: IZevTGh6eOzexdotN5RaSTaROUtmWc+7hWpq1t7CVD6Mz98G2m174S35wFEkk8pD
< x-amz-request-id: E321C35ACC2CB79E
< Date: Wed, 07 Mar 2012 17:10:07 GMT
< Expires: Sun, 14 Jul 2013 04:00:00 GMT
< Last-Modified: Wed, 07 Mar 2012 16:44:52 GMT
< ETag: "c817148a2d2589ca977fed81c2e5a6f2"
< Accept-Ranges: bytes
< Content-Type: image/jpeg
< Content-Length: 69947
< Server: AmazonS3

And there it is: our shiny new Expires header:

Expires: Sun, 14 Jul 2013 04:00:00 GMT
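If you'd rather not dig through curl's verbose output, the same check works as a Ruby one-liner (standard library only) that prints just the header we care about:

$ ruby -rnet/http -e 'puts Net::HTTP.get_response(URI("http://dl.jetheis.com/s3-metadata/spocktocat.jpg"))["Expires"]'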

A Better Example

While this is all well and good, we haven't quite proven that setting this header does anything or cuts any costs. We'll now conduct a true test of the new configuration using Google Chrome (this one can be performed on Windows, too). Follow along if you'd like.

First, browse to this page using Chrome and open the Developer Tools pane (Alt + Command + I on a Mac, or by clicking the wrench and selecting Tools -> Developer Tools).

In the Developer Tools pane, click the Network tab. We'll see that there isn't anything there yet, because Chrome isn't "recording" network activity unless the pane's open.

Now click the Jimmy Theis link at the top of the blog to jump back to the home page. We're navigating away from and back to the page because pushing Refresh would actually force all resources to be reloaded.

Now here's the slightly wonky part. Chrome does a great job of making browsing fast, so it actually caches resources without expiration dates for a short time anyway. To get around this, the resources seem to need to go untouched for quite a while. So forget about all this, go browse Reddit or Hacker News or something, and come back in about 30 minutes.

Once you've browsed some and allowed sufficient time, find the link to this post on the homepage (or the archives if this is old enough), and click on it. Make sure the Developer Tools pane is still open.

Once the page has loaded, scroll through the network activity. You should see both octobiwan.jpg and spocktocat.jpg in the list, probably somewhat close to each other.

Whoa! Something actually happened. We can see that spocktocat.jpg was loaded "(from cache)" and took a total of 0 ms to load, while octobiwan.jpg shows a 304 "Not Modified" response, a data transfer of 286 bytes, and a latency of 206 ms. So what does this mean?

Let's look at our base case first: octobiwan.jpg, which has no information dictating when it expires, could have changed since the last visit (from the browser's point of view), so Chrome makes a new request for the image. Amazon's server answers with a 304 response, which tells the browser that the resource hasn't changed and won't be sent again. This saves us the bandwidth of downloading the entire image (the response is 286 bytes instead of the full 60+ KB the image takes up), but it still counts as a request to S3, and that costs money.
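To see exactly what the browser is doing here, we can replay its conditional request ourselves. Here's a short sketch in plain Ruby (standard library only), using the Last-Modified value from the earlier curl output as the If-Modified-Since date:

require 'net/http'

uri = URI('http://dl.jetheis.com/s3-metadata/octobiwan.jpg')
req = Net::HTTP::Get.new(uri.request_uri)

# Replay the Last-Modified date from the earlier curl output.
req['If-Modified-Since'] = 'Wed, 07 Mar 2012 16:38:05 GMT'

res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }

# Prints "304": no image body is sent, but S3 still counts a request.
puts res.code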

On the other hand, spocktocat.jpg is marked as not expiring until years from now, so Chrome doesn't even bother making a request at all. It simply grabs the image from its cache and displays it. This speeds up page load and saves us money on our Amazon bill. As a side note, if that image were ever to change, doing a hard refresh of the page would force it to be reloaded, so we're not stuck with an outdated image if we ever change our minds.

Conclusion

It's relatively simple to set the Expires value of content hosted on S3, and doing so has a real effect on both site performance and the number of requests made to Amazon. In fact, apart from the work of setting those values, there isn't much of a downside to the practice. Some software suites offer this functionality in a somewhat more accessible interface, and lots of other posts recommend using these (paid) suites to set extra metadata values. The purpose of this post, however, is to highlight a potential problem and present a solution that doesn't require spending extra money. Feel free to try any S3 management software, but know that you've already got the tools you need to start tuning your account for cost effectiveness.

Cheers,
