A Reverse Caching Proxy allows you to put a cache between “the Internet” and your application server. It processes all incoming traffic and for each request decides: “can I serve this from my cache, or should I pass it on to the application server?”. Since retrieving a static file from the cache takes a few milliseconds, while having the application server process the request can take a second or more, this can dramatically improve performance, scalability and the user experience. It also reduces your need to invest in expensive application servers.
Is a Reverse Caching Proxy for you?
Well, in order to use one, a caching proxy needs to be able to cache the page as a static file under a unique key, usually the url. So, if you request http://www.mysite.com/products/ps3.html?x=3, the proxy caches the response under that url, and anyone who subsequently requests that url gets the cached version (until it expires).
This is a problem if you have a personalized message for logged-in users: Hi John! You don’t want Peter to get the cached file that John requested and see Hi John instead of Hi Peter.
So, this means that in designing your website, you need to keep the requirements of the reverse caching proxy in mind. Basically this means that a unique page needs to have a unique url. Personalized pages are the norm these days and fortunately there are ways to deal with the cacheability of these pages.
Server Side Includes (SSI)
Server Side Includes allow you to specify regions in your page that should remain “dynamic”. In essence the proxy post-processes the page and replaces the areas that are designated as dynamic with dynamically generated fragments. Various reverse caching proxies offer modules that support SSI.
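As a sketch: a cached page can embed an SSI directive that the proxy resolves on every request. The /fragments/greeting url below is a made-up endpoint on the application server, used here only for illustration:

```html
<!-- The page around this include is served from the cache; the proxy's
     SSI module fetches the fragment freshly on each request. -->
<p>Welcome back, <!--#include virtual="/fragments/greeting" -->!</p>
```

This way the expensive page body is cached once, while the tiny personalized fragment stays per-user.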
Edge Side Includes (ESI)
Edge Side Includes is a specification that was submitted to the W3C and is fairly similar to SSI, but uses an XML-based language. ESI seems a little less well supported.
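The equivalent of the SSI idea in ESI looks like this (again, the fragment url is a hypothetical example):

```xml
<!-- The surrounding page comes from the cache; the proxy fetches the
     src fragment per request. -->
<p>Welcome back, <esi:include src="/fragments/greeting" />!</p>
```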
A Reverse Caching Proxy works very much like a browser cache but on the server side. It can be configured in various ways to determine how long a page should be cached or whether it shouldn’t be cached at all. You can pre-configure this in the caching configuration, but the best way is to use header information that is generated by your application server. Usually, your application server will allow you to fine-tune which pages to cache and for how long. You don’t want the account page to be cached, but perhaps the homepage should be cached for 10 minutes.
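For the homepage-cached-for-10-minutes example, the application server could emit a standard Cache-Control response header like:

```http
Cache-Control: public, max-age=600
```

while the account page could send `Cache-Control: private, no-store` instead. Exactly which directives are honored varies per proxy, so check the documentation of yours.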
It’s a common mistake to cache pages for a very long time with a caching proxy. This is not necessary. If you cache a page for even 1 minute on a busy site, you will see dramatic performance increases. Let’s say you have 100 users requesting the homepage in that 1 minute. User number 1 requests the uncached page: the proxy retrieves it from the application server, caches it and returns it to the user. Users 2 through 100 get the cached page. So Reverse Proxy Caching can certainly be used in a scenario where content gets updated frequently.
Another issue you may run into is session expiry. An application server will usually expire a session after a period of inactivity, usually 20 minutes or so. If your user logs in, but then spends more than 20 minutes doing stuff on the site that is all cached, the session will expire because the application server never gets a request from the user.
Forcing a cache refresh
Sometimes you may want to force a refresh of the cache in order to purge obsolete content. It’s important to realize that this is relatively hard to do programmatically: most caching proxies don’t offer an API that you can use to refresh a page. One way to deal with this is a hard reload (SHIFT + RELOAD), which sends no-cache request headers and can also be mimicked programmatically in an http request, but not all Reverse Caching Proxies honor this, or at least not by default. What you can usually do is implement some special url parameter or header that forces a refresh for a particular page.
It’s a good idea to make sure that you implement a way to force a cache refresh from the client side. Otherwise, you may find yourself purging cache directories on the server when management needs it done NOW.
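With nginx’s proxy cache, for example, one way to sketch this is a secret refresh header. The header name below is made up, and anyone who knows it can bust your cache, so treat it like a credential:

```nginx
# If the client sends X-Refresh with a non-empty value, nginx bypasses
# the cached copy, fetches a fresh response from the backend, and
# stores that fresh response in the cache.
location / {
    proxy_pass http://backend;
    proxy_cache appcache;
    proxy_cache_bypass $http_x_refresh;
}
```

A request like `curl -H 'X-Refresh: 1' http://www.mysite.com/products/ps3.html` then forces a fresh copy into the cache from the client side.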
Cookies and Sessions
One of the things you have to take care of in configuring your reverse caching proxy is how to deal with sessions and cookies being sent from the server. You don’t want to store Set-Cookie directives in a cached page, because you may inadvertently share a session with another user! The risk of accidentally showing content from other users is something you have to be actively aware of. Reverse Proxies can strip header information (such as Set-Cookie) before storing a file in the cache. Check the documentation of your Reverse Proxy to make sure how to handle this.
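In nginx, for instance, this is a two-directive sketch; other proxies spell it differently, so verify the behavior against your proxy’s documentation:

```nginx
# By default nginx refuses to cache responses that carry Set-Cookie.
# proxy_ignore_headers makes such responses cacheable anyway, and
# proxy_hide_header keeps the cookie out of what clients receive.
proxy_ignore_headers Set-Cookie;
proxy_hide_header Set-Cookie;
```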
Reverse Caching Proxies are usually configured (even by default) to cache GET requests and not cache POST requests. If you have an application that stores form content and validation exceptions in the session and redirects the user back to the form page to display this information, you have to be careful. Like all GET pages that show content based on session info, you may cache the wrong thing. It’s therefore wise to make sure you can capture these kinds of interactions behind a unique url pattern that you don’t cache, e.g. /registration/
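In nginx terms, such an exclusion is one location block; the /registration/ path is the example from above:

```nginx
# Everything under /registration/ goes straight to the application
# server and is never cached, so session-based form pages stay correct.
location /registration/ {
    proxy_pass http://backend;
    proxy_cache off;
}
```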
Popular caching proxies
There are several popular caching proxies to choose from. Let’s start with the webserver most people already know.
If you have some experience with hosting in the Cloud, then you will probably use Apache as a webserver. Apache is great: there are a lot of modules, loads of documentation and it has great performance. But it also eats a lot of memory. Yes, that depends on which modules you use, but chances are that with the modules you probably need, that 512MB webserver that looked so nice and cheap in the beginning will be swapping its heart out once you start working with it. Apache is just not very memory efficient. What’s more, when you add load, you also need to add memory. Those two go hand in hand with Apache.
Nginx is a Russian-built, lightweight webserver and reverse (caching) proxy with a low and constant memory profile. I will spare you the technicalities, but the key point is that it uses a lot less memory than Apache, we’re talking hundreds of megabytes, and that it keeps memory usage low under load.
Why is this nice? Because on a Cloud server, costs go up significantly when you need more RAM. On a dedicated hosting server, you may care less: just throw in a couple more megs, no sweat. On the hourly model, that’s a more expensive proposition.
Some more advantages of nginx
- Configuration files are clear and easy to configure
- nginx can be used as a reverse caching proxy: now that’s a performance enhancer
- nginx can be used as a load balancer
- English documentation is available and there is a strong community
- Russia has a history of great engineering ;-)
So, you can just put up an nginx front end webserver on a lightweight Cloud server of, say, 512MB (I’m interested in 256MB experiences: it should be possible) and put one or more Tomcats behind that. Activate the proxy cache and you’ll have a top performing application infrastructure for a few dimes an hour.
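A minimal sketch of such a setup, assuming a single Tomcat on port 8080 behind nginx (the cache path, zone name and sizes are all example values to adjust):

```nginx
http {
    # Disk location and shared-memory zone for cached responses.
    proxy_cache_path /var/cache/nginx levels=1:2
                     keys_zone=appcache:10m max_size=200m;

    upstream tomcat {
        server 127.0.0.1:8080;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://tomcat;
            proxy_cache appcache;
            # Even 1 minute of caching pays off on a busy site.
            proxy_cache_valid 200 1m;
        }
    }
}
```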
BTW, make sure you get the nginx 0.8.x version. Many Linux distributions still ship with the older 0.7 version.
Well, today we heard that Rackspace will be joining forces with Akamai, the people who basically put CDN on the map in the first place.
When operational, we’ll finally be able to have https access to those assets, so we won’t have to worry about those nasty “insecure content” messages.
That’s great and will save some time here and there and improve performance. But what I want to talk about here is when you actually have to put something into Cloud Files, and lots of it. Cloud Files is accessed through a webservice API, and consequently uploading is very slow. So, if you want to upload the 500 assets in your asset directory to launch your sweet new website, the script can take forever. Wait, what? Script? Yes, you actually have to use a script to get it there, since currently you can only upload to Cloud Files manually, one file at a time!
What is needed is a bulk upload tool that allows you to upload a zip/tar file containing a structured directory. When you upload one of those, Cloud Files should smartly pull together the directory names and file name into one single filename string. Remember, Cloud Files doesn’t support directory structures. A container can only contain a list of files. But those files can contain forward slashes, allowing you to simulate directory names.
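The flattening step such a tool would do is easy to sketch. The helper below maps a local directory tree to slash-separated object names; the commented upload loop is an untested outline of the PUT-per-object calls against the Cloud Files API (storage_url, container and token would come from the auth call):

```python
import os

def object_names(root):
    """Map local files under `root` to Cloud Files object names.

    Cloud Files has no real directories, but forward slashes in an
    object name simulate them, so assets/css/main.css works as a key.
    """
    names = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for fn in filenames:
            local = os.path.join(dirpath, fn)
            rel = os.path.relpath(local, root)
            names[local] = rel.replace(os.sep, "/")
    return names

# Sketch of the upload loop (not run against the live service):
#
#   for local, name in object_names("assets").items():
#       requests.put(storage_url + "/" + container + "/" + name,
#                    data=open(local, "rb"),
#                    headers={"X-Auth-Token": token})
```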
A proposal for just such a tool can be found here. Why not give it a vote?
And then there is the CDN itself. The Cloud Files CDN in its current form allows you to cache a file for… 72 hours. Wow, that’s something else than an expiry date somewhere in 2037! I have a suspicion that the reason for this ridiculously low TTL has nothing to do with performance and a lot with the fact that traffic is billed on a per-GB model. High TTLs mean less traffic. If this is the reason, it would be quite annoying, since I don’t mind paying for the service but I do mind lower performance on my sites!
Update: As it turns out, there is a tool that pretty much does what I want and more. It’s called Cyberduck and you can find it here. Use the Synchronize function to upload an entire directory structure. Rackspace support was kind enough to point it out. Mind you, this is still pretty slow. What is really needed is a native solution that processes the zip file on the server side for acceptable performance.