Understanding HTTP Caching

HTTP is rather powerful in and of itself. This is often forgotten as application developers build caching technologies into their own code. As useful as SQL query caching and application-level internal caching are, HTTP provides its own caching functions that we would be remiss not to utilize.

Let us provide some definitions of commonly used words in this article.

Term	Definition
Origin Server	The HTTP server that the entity is originally served from and hosted on.
Caching Server	An HTTP server that sits between the client and the origin server. This could be a CDN caching server or a caching proxy on the client's network. A caching server may or may not be in use. When one is, the client does not contact the origin server directly; it contacts the caching server, which in turn contacts the origin.
Client	The HTTP client that makes requests against a caching server or an origin server.
Entity / Entity-body	The document being accessed. An entity includes the headers, whereas the entity-body is only the asset returned by the HTTP request. Images, JavaScript files, CSS files and HTML documents are all entities.

Headers

Everything to do with how an entity is served, and often whether an entity-body is served at all, is controlled through HTTP headers. There are two sets of headers in HTTP: request headers and response headers.

All headers are sent in a key-value format; it is worth noting that some values are actually lists, as we will see later.
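To make the key-value idea concrete, here is a minimal sketch in Python. The helper names parse_headers and split_list_value are our own, not part of any library, and the comma split is deliberately naive: it is fine for a value like Cache-Control, but it would wrongly split a date value, which contains a comma of its own.

```python
def parse_headers(raw: str) -> dict:
    """Parse 'Name: value' lines into a dict keyed by lowercased header name."""
    headers = {}
    for line in raw.strip().splitlines():
        # Split on the first colon only; values such as dates contain colons too.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def split_list_value(value: str) -> list:
    """Split a comma-separated header value into its individual items."""
    return [item.strip() for item in value.split(",") if item.strip()]


raw = """Content-Type: image/gif
Cache-Control: max-age=3600, s-maxage=1200"""

headers = parse_headers(raw)
print(headers["content-type"])
print(split_list_value(headers["cache-control"]))
```

Header names are case-insensitive, which is why the sketch lowercases them before storing them.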

Request headers are sent by the HTTP client and include information on the capabilities of the client, the request it is making, and any caching information it already knows about an entity.

Response headers are sent by the HTTP server and include information describing the entity-body it's returning, information describing the host server that is returning the entity and caching information about the entity.

It is worth noting that some headers fall outside of the above description, such as cookies and custom headers (typically denoted by an X- prefix).

In addition to HTTP headers there is the HTTP status code. The HTTP status code is a three-digit numerical code that gives information about the response. We will focus on three HTTP status codes in this document:

HTTP Status	Meaning
200 OK	This is the typical response that says everything is okay. The entity-body is returned to the client and the document is served.
304 Not Modified	This is the response to a conditional GET request in which the origin server reports that the entity has not changed. Conditional GET requests use special request headers, which will be discussed later.
504 Gateway Timeout	This is the response from a caching server that is unable to request the entity from an origin. This response indicates that the caching server MUST validate the entity and has failed to do so.

Cache-Control

Cache-Control provides a mechanism for a server to give caching information about an entity to a client or to a caching server. Let's take a look at some of the most common Cache-Control tokens:

max-age

The max-age token specifies, in seconds, how long an entity remains fresh. The format of the token is max-age=N, where N is the number of seconds after the entity was served during which it is considered fresh. This value is aimed at clients; however, a caching server may use it if an s-maxage token is not present.

s-maxage

The s-maxage token is formatted in the same way as the max-age token; however, it is disregarded by clients. This token is used by caching servers.

must-revalidate

The must-revalidate token instructs a client or caching server that, once the caching information indicates the entity is stale, it must revalidate the entity against the origin, either with a conditional GET or a standard GET.

Although many aspects of the HTTP specification are left to the server developers' discretion, this behavior is not.

In the scope of a caching server, an entity that is stale and has a must-revalidate token in Cache-Control must be revalidated. If the origin server does not respond to the validation, then a 504 Gateway Timeout response MUST be given. If this token were not included in the original response headers, it would be at the discretion of the caching server to either generate a 504 Gateway Timeout or serve the stale entity in this situation.
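The decision a caching server faces can be sketched as a small Python function. This is only an illustration of the rules described above, not how any particular caching server is implemented; the function name, its parameters and the serve_stale_on_error switch (standing in for the server's own discretion) are assumptions of ours.

```python
def caching_server_action(is_stale: bool, must_revalidate: bool,
                          origin_reachable: bool,
                          serve_stale_on_error: bool = True) -> str:
    """Return what a caching server does with a cached entity."""
    if not is_stale:
        # A fresh entity is served straight from cache.
        return "serve from cache"
    if origin_reachable:
        # A stale entity is revalidated against the origin when possible.
        return "revalidate with origin"
    if must_revalidate:
        # must-revalidate removes any discretion: stale content may not be served.
        return "504 Gateway Timeout"
    # Without must-revalidate the choice is left to the caching server.
    if serve_stale_on_error:
        return "serve stale from cache"
    return "504 Gateway Timeout"


print(caching_server_action(is_stale=True, must_revalidate=True, origin_reachable=False))
print(caching_server_action(is_stale=True, must_revalidate=False, origin_reachable=False))
```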

Let's say that we want an asset to be cached for one hour by end users and 20 minutes by a caching server. We may use the following header:

Cache-Control: max-age=3600, s-maxage=1200

Following the above logic, a client's web browser will continue to use the entity from its local cache on the hard drive for 3600 seconds (1 hour). A caching server will continue to serve the entity from cache for 1200 seconds (20 minutes) before requesting it again.
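As a rough sketch of that logic in Python, the function below picks the freshness lifetime for either kind of cache. The names parse_cache_control and freshness_seconds, and the shared flag that distinguishes a caching server from a client, are hypothetical names chosen for this example.

```python
from typing import Optional


def parse_cache_control(value: str) -> dict:
    """Turn 'max-age=3600, s-maxage=1200' into {'max-age': '3600', 's-maxage': '1200'}."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        # Tokens without a value, such as must-revalidate, are stored as None.
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def freshness_seconds(value: str, shared: bool) -> Optional[int]:
    """Pick the lifetime for a caching server (shared=True) or a client (shared=False)."""
    directives = parse_cache_control(value)
    # Caching servers prefer s-maxage and fall back to max-age; clients ignore s-maxage.
    if shared and directives.get("s-maxage") is not None:
        return int(directives["s-maxage"])
    if directives.get("max-age") is not None:
        return int(directives["max-age"])
    return None


header = "max-age=3600, s-maxage=1200"
print(freshness_seconds(header, shared=False))  # client
print(freshness_seconds(header, shared=True))   # caching server
```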

If the only caching information we have about the entity is Cache-Control, it is unlikely that a conditional GET will be sent; instead, a standard GET request will be used. Conditional GET requests occur when a Last-Modified or ETag header is present.

Last-Modified

The Last-Modified header provides the date that an entity was last modified. Programmers are very good at obvious naming conventions. When the entity is updated, the Last-Modified header is also updated. Let's take a look at a real-world example from CNN's caching servers:


$ curl -I http://i.cdn.turner.com/cnn/.element/img/2.0/sect/connect/avatar.gif
HTTP/1.1 200 OK
Date: Mon, 16 Aug 2010 13:52:46 GMT
Expires: Mon, 16 Aug 2010 14:12:56 GMT
Last-Modified: Fri, 23 Oct 2009 19:22:47 GMT
Cache-Control: max-age=3600
Content-Type: image/gif
Accept-Ranges: bytes
Server: Apache
Content-Length: 365

As we see here, we have a Cache-Control header that gives a max-age, and we have a Last-Modified date. In this situation a client will continue to use the image from its local cache for 3600 seconds (1 hour). Once that time is up, the entity is considered stale. Because we have a Last-Modified header, we can send a conditional request:


$ curl --header "If-Modified-Since: Fri, 23 Oct 2009 19:22:47 GMT" \
> -i http://i.cdn.turner.com/cnn/.element/img/2.0/sect/connect/avatar.gif
HTTP/1.1 304 Not Modified
Date: Mon, 16 Aug 2010 13:59:27 GMT
Expires: Mon, 16 Aug 2010 14:47:41 GMT
Last-Modified: Fri, 23 Oct 2009 19:22:47 GMT
Cache-Control: max-age=3600
Connection: keep-alive

The --header argument to curl allows us to inject our own headers into the request. In this case we made a conditional GET request by sending, as If-Modified-Since, the Last-Modified date that the server last gave us.

The server returned a 304 Not Modified response, which tells us to use the entity we have on disk. You may notice that we lowercased the -i this time. The difference is that -I sends only a HEAD request, while -i sends a real GET request and includes the response headers in the output. We got no entity-body from the real GET request because a 304 does not return an entity-body, only the headers and the HTTP status code.

If we give an If-Modified-Since date that is earlier than the server's current Last-Modified date, we will get a 200:


$ curl --header "If-Modified-Since: Fri, 23 Oct 2009 10:22:47 GMT" \
> -I http://i.cdn.turner.com/cnn/.element/img/2.0/sect/connect/avatar.gif
HTTP/1.1 200 OK
Cache-Control: max-age=3600
Content-Length: 365
Content-Type: image/gif
Expires: Mon, 16 Aug 2010 14:47:41 GMT
Last-Modified: Fri, 23 Oct 2009 19:22:47 GMT
Accept-Ranges: bytes
Server: Apache
Date: Mon, 16 Aug 2010 14:02:48 GMT
Connection: keep-alive

The If-Modified-Since header is a request header. As we saw, the request header had a direct impact on the server's response. Any time an aggressive cache is needed, the Last-Modified date should be used, as it is the simplest form of revalidation.
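The comparison the server makes can be sketched in a few lines of Python. This is a simplified illustration and not the logic of any specific server; status_for_if_modified_since is a name we made up. The dates are the ones from the curl examples above, parsed with the standard library's email.utils.parsedate_to_datetime.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def status_for_if_modified_since(last_modified: str,
                                 if_modified_since: Optional[str]) -> int:
    """Return 304 if the entity is unchanged since the client's date, else 200."""
    if if_modified_since is None:
        # No conditional header was sent: this is a standard request.
        return 200
    modified = parsedate_to_datetime(last_modified)
    since = parsedate_to_datetime(if_modified_since)
    # Unchanged means the entity was not modified after the date the client holds.
    return 304 if modified <= since else 200


last_modified = "Fri, 23 Oct 2009 19:22:47 GMT"
print(status_for_if_modified_since(last_modified, "Fri, 23 Oct 2009 19:22:47 GMT"))
print(status_for_if_modified_since(last_modified, "Fri, 23 Oct 2009 10:22:47 GMT"))
```

The first call corresponds to the 304 response we received, and the second to the 200 response we got with the earlier date.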

Expires

There are a lot of misconceptions about the Expires header. Let's take a look at the headers from our request for CNN's image:


HTTP/1.1 200 OK
Date: Mon, 16 Aug 2010 13:52:46 GMT
Expires: Mon, 16 Aug 2010 14:12:56 GMT
Last-Modified: Fri, 23 Oct 2009 19:22:47 GMT
Cache-Control: max-age=3600
Content-Type: image/gif
Accept-Ranges: bytes
Server: Apache
Content-Length: 365

We notice something interesting here. The Expires header dictates that the response expires 20 minutes and 10 seconds after we got it. The Cache-Control says it expires an hour after we got it. Which is right?

If you guessed Expires… you're wrong. Section 13.2.4 of RFC 2616 says the following:

The max-age directive takes priority over Expires, so if max-age is present in a response, the calculation is simply:

    freshness_lifetime = max_age_value

Otherwise, if Expires is present in the response, the calculation is:

    freshness_lifetime = expires_value - date_value

As we can see, in this case the freshness_lifetime would be 3600 seconds, and the Expires header is completely ignored. If the Cache-Control header were not present, the freshness_lifetime would be 1210 seconds (20 minutes, 10 seconds), as dictated by the Expires header.

It is worthwhile to note that all time operations should be done with the information the server provides. A delta generated from the Expires header should be calculated against the Date provided by the server, not the local system date, as clock skew or time zone differences can affect the delta.
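The calculation can be sketched in Python as follows. The function name freshness_lifetime mirrors the RFC's wording, but the code is our own simplified illustration: it reads only max-age, Expires and Date, and it takes both dates from the server's headers rather than from the local clock.

```python
import re
from email.utils import parsedate_to_datetime
from typing import Optional


def freshness_lifetime(headers: dict) -> Optional[int]:
    """Return the freshness lifetime in seconds, preferring max-age over Expires."""
    cache_control = headers.get("Cache-Control", "")
    match = re.search(r"(?:^|,)\s*max-age=(\d+)", cache_control, re.IGNORECASE)
    if match:
        # max-age takes priority, so Expires is not consulted at all.
        return int(match.group(1))
    if "Expires" in headers and "Date" in headers:
        # Use the server's own Date header, not the local system clock.
        expires = parsedate_to_datetime(headers["Expires"])
        date = parsedate_to_datetime(headers["Date"])
        return int((expires - date).total_seconds())
    return None


cnn = {
    "Date": "Mon, 16 Aug 2010 13:52:46 GMT",
    "Expires": "Mon, 16 Aug 2010 14:12:56 GMT",
    "Cache-Control": "max-age=3600",
}
print(freshness_lifetime(cnn))  # 3600

# The same response without Cache-Control falls back to Expires minus Date.
without_cache_control = {k: v for k, v in cnn.items() if k != "Cache-Control"}
print(freshness_lifetime(without_cache_control))  # 1210
```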

ETag

An ETag is a programmatically generated string that identifies a specific version of the entity-body. The assumption that makes an ETag work is that the generated ETag changes whenever the entity-body changes. Think of it as an MD5 or SHA1 of the entity-body.
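Taking the hash analogy literally, a minimal Python sketch of generating an ETag might look like this. It is one possible scheme and an assumption on our part; servers are free to build ETags however they like, and make_etag is a hypothetical helper name.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Build a quoted ETag from the SHA1 hex digest of the entity-body."""
    return '"' + hashlib.sha1(body).hexdigest() + '"'


original = make_etag(b"alert('hello');")
unchanged = make_etag(b"alert('hello');")
changed = make_etag(b"alert('goodbye');")

print(original == unchanged)  # the same body gives the same ETag
print(original == changed)    # a different body gives a different ETag
```

The surrounding double quotes are part of the ETag value itself, which is why the sketch adds them.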


$ curl -I http://symkat.com/s/jquery.beautyOfCode.js
HTTP/1.1 200 OK
Content-Type: application/javascript
Accept-Ranges: bytes
ETag: "1028101788"
Last-Modified: Fri, 30 Jul 2010 19:48:57 GMT
Content-Length: 8275
Date: Mon, 16 Aug 2010 14:26:35 GMT
Server: lighttpd/1.4.19

We can see here an ETag value of 1028101788 for this entity. We can inject the If-None-Match header into our request to see if the asset has changed:


$ curl -I --header 'If-None-Match: "1028101788"' \
> http://symkat.com/s/jquery.beautyOfCode.js
HTTP/1.1 304 Not Modified
Content-Type: application/javascript
Accept-Ranges: bytes
ETag: "1028101788"
Last-Modified: Fri, 30 Jul 2010 19:48:57 GMT
Date: Mon, 16 Aug 2010 14:31:30 GMT
Server: lighttpd/1.4.19

As we can see, we got another 304 response and saved some bandwidth by not transferring a file we already have the current version of. The above transaction shows that SymKat is using only part of what HTTP caching offers. The website provides a method for validating entities; however, it does not provide caching rules through Cache-Control or Expires.
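The check behind that 304 can be sketched in Python. This is a simplified illustration for GET and HEAD requests, not lighttpd's actual logic; status_for_if_none_match and strip_weak are names of our own, and the comma split assumes the ETags themselves contain no commas.

```python
from typing import Optional


def strip_weak(tag: str) -> str:
    """Drop the W/ prefix so weak and strong ETags compare equal."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def status_for_if_none_match(current_etag: str,
                             if_none_match: Optional[str]) -> int:
    """Return 304 if any ETag the client holds matches the current one, else 200."""
    if if_none_match is None:
        # No conditional header was sent: this is a standard request.
        return 200
    candidates = [strip_weak(tag) for tag in if_none_match.split(",")]
    # "*" matches any current version of the entity.
    if "*" in candidates or strip_weak(current_etag) in candidates:
        return 304
    return 200


print(status_for_if_none_match('"1028101788"', '"1028101788"'))
print(status_for_if_none_match('"1028101788"', '"0000000000"'))
```

If-None-Match may carry several ETags at once, which is another case of a header value that is really a list.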

A properly aggressive caching system should make use of all of the above, combining Last-Modified and ETag headers, which allow revalidation without entity-body transfers, with an aggressive Cache-Control policy.