Introduction:
Over the years of developing TheNPG.com, I have had to learn about new technologies that I couldn’t have imagined when I first started this project. In fact, when I started coding the first version of TheNPG.com, I had no idea how to code JavaScript and my knowledge of PHP was very basic, which meant my code was not efficient.
Efficiency is the key when programming a web service for medium-to-large-scale usage; oftentimes the technology that we are most familiar with becomes the screwdriver when we really need a hammer, if you will. However, if we take a moment to step back and invest a little time in research, we may find a better solution, and we may just learn something in the long run.
Today, TheNPG.com, while it still has its drawbacks, is pretty darn efficient considering how large the script is as a whole. Over 2MB of code makes up all of the classes and external libraries TheNPG.com uses to perform its functions. That’s quite a bit, but thankfully it is not all run at once.
I rarely write about the back-end technology of TheNPG.com partially because a lot of the technology is proprietary and a core part of our business, but the past few days have unearthed some knowledge that I would like to pass on to others who may be in a similar situation.
Overview:
PHP is full of built-in functions that can perform just about anything a programmer would want to do, but oftentimes there is a better way. Case in point: file downloads. It sounds easy enough: there’s a file on a server, somebody wants to download it, let them download it. In reality, it is far more complex, especially when factoring in efficiency and different technologies. TheNPG.com uses Amazon’s S3 service to store files, and there is the option to use Amazon’s CloudFront to serve the files as a content delivery network. It really is a fantastic service and it works well for file hosting, if you know the ins and outs of getting the files where they need to be and making sure that the people who are downloading those files are allowed to do so. We could have PHP do the bulk of the work by verifying privileges, fetching the file and serving it to the browser, but a PHP process then stays tied up for the whole transfer and, depending on how the file is read and buffered, may hold the entire file in memory. Our efficiency is reduced and we have the potential of bringing our server to a standstill.
The Objective:
The objective of our download service is to provide quick access to file downloads, which can be very large, while reducing the costs associated with transferring files between servers. You see, Amazon S3 charges not only for storage of files, but also for the requests to put and get each file and for the bandwidth required to move it; it adds up!
Why not just host the files on our web server that has unlimited storage and bandwidth?
It took me a while to understand the answer to this question. To make sense of it, we have to consider the two different types of servers: the web server and the file server. The web server is busy serving pages and dynamic content, and it is not set up for redundancy. In other words, the disk could fail, but since the web server only holds the files that are required to generate a web page, it’s not a big deal because these files are generally small and easy to back up. A file server, on the other hand, is built for redundancy; it might have several disk drives of seemingly unlimited capacity, and when one, two, or even three drives fail, there is still enough data on the others that the files can be recovered. A file server is like a rock and a web server is like a race car; sometimes the race car crashes, but the rock will always remain intact. So, while using a web server as a combined web server and file server is cheap, easy and really fast, it doesn’t make sense when you’re dealing with other people’s prized files. For this reason, we have integrated Amazon’s S3 file service into TheNPG.com.
So, how do you connect the two services efficiently while keeping costs down?
Simple [of course I would say that since I've figured it out]. For each file that is uploaded to TheNPG.com servers, we create different sized images, if the file is an image, and then put them all on Amazon’s servers using a PHP class for S3. During this upload, we set special headers on the upload request; S3 stores them with the file and sends them back whenever the file is requested over HTTP, which causes the browser to open a dialog box asking the user to either open the file or save it. Because our PHP class accepts an array of request headers as an argument, our code looks like this:
$request_headers = array(
"Content-Type" => $filetype,
// the filename is quoted so that names containing spaces survive intact
"Content-Disposition" => "attachment; filename=\"".$filename."\"");
This sets two headers that will be sent to the browser whenever the file is requested over HTTP.
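Since the filename usually comes from the uploader, it may be worth cleaning it before it goes into the header. Here is a minimal sketch of how that could look; build_download_headers() is a made-up helper name for illustration, not part of the S3 class or of TheNPG.com's code.

```php
<?php
// Hypothetical helper: builds the header array that is handed to the S3 upload call.
function build_download_headers(string $filename, string $filetype): array
{
    // Keep only the last path component, then drop characters that could
    // break out of the quoted filename or smuggle in an extra header line.
    $safeName = str_replace(array('"', '\\', "\r", "\n"), '', basename($filename));

    return array(
        'Content-Type' => $filetype,
        'Content-Disposition' => 'attachment; filename="' . $safeName . '"',
    );
}

$request_headers = build_download_headers('holiday photo.jpg', 'image/jpeg');
```

The resulting array has the same two keys as the snippet above, so it can be passed to the upload call in the same way.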
But why?
TheNPG.com uses a caching system which works like this:
- User clicks on link to download the original file
- PHP checks the cache to see if the file has been downloaded in the past 24 hours or so
- If the file has not been cached, we do two things simultaneously. We send a header to the browser that redirects it to the CDN server; when the browser connects to the CDN server, the aforementioned headers are output, which causes the browser to bring up the download dialog. While this is happening, we use the following code to fetch the file for our cache:
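// escapeshellarg() keeps an unexpected character in either value from being run by the shell
exec("wget -q -nc ".escapeshellarg($location)." -O ".escapeshellarg($store)." > /dev/null 2>&1 &");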
The code above executes the shell program “wget” with the options to be quiet [-q] and to not download the file if it already exists [-nc]. Then we pass the cleaned location of where the file is, followed by the -O option and the location where we wish to store the file. The last bit, "> /dev/null 2>&1 &", is the real gem: it discards wget’s output and runs it in the background, so exec returns immediately and wget carries on by itself, free from the bounds of the PHP script that called it. In other words, we’re really doing two things at once!
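Put together, the steps in the list above could be sketched roughly like this. It is only an illustration under a few assumptions: plan_download() is a hypothetical function, the cache entry is assumed to be an array with a 'path' key (or null when nothing is cached), and the CDN URL and local path are placeholders.

```php
<?php
// Hypothetical sketch of the decision: serve from the local cache, or
// redirect to the CDN and fetch a copy for the cache in the background.
function plan_download(?array $cacheEntry, string $cdnUrl, string $storePath): array
{
    if ($cacheEntry !== null && is_file($cacheEntry['path'])) {
        return array('action' => 'serve_local', 'path' => $cacheEntry['path']);
    }

    // Same wget call as above, with both values escaped for the shell.
    $command = sprintf(
        'wget -q -nc %s -O %s > /dev/null 2>&1 &',
        escapeshellarg($cdnUrl),
        escapeshellarg($storePath)
    );

    return array('action' => 'redirect', 'location' => $cdnUrl, 'command' => $command);
}

$plan = plan_download(null, 'https://cdn.example.com/files/report.pdf', '/tmp/cache/report.pdf');

// Only act on the plan when running under a web server, not from the command line.
if ($plan['action'] === 'redirect' && PHP_SAPI !== 'cli') {
    header('Location: ' . $plan['location'], true, 302);
    exec($plan['command']);
}
```

Keeping the decision in a function that only returns a plan makes it easy to check the logic without actually sending headers or launching wget.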
Why use exec and wget instead of readfile()?
readfile() works really well for smaller files in PHP; however, if we were to use readfile() for larger files, say 25MB, then we would have to wait for the file to be downloaded by PHP before allowing anything else to happen. If we can quickly send special headers that allow the user to immediately begin downloading their file, then the user doesn’t have to wait for the two servers to finish talking to each other first. Also, wget has many neat features built into it that give it an advantage over any PHP function, including the -nc option. Why use PHP to do such heavy lifting when we can use a program that is built for this purpose?
So, once the file is cached on the local web server, we have the same issue of deciding how to serve the file. Do we let PHP do it since the file is local now? Yes and no. For small files, it’s OK to let PHP handle them, but for the larger files there’s another gem called mod_xsendfile. This is an Apache module that is useful for handling larger file downloads. Essentially, when this module is installed and enabled, the webmaster simply sets the X-Sendfile header and mod_xsendfile does the rest without tying up PHP. Here is an example of how to set the header:
header("Content-Disposition: attachment; filename=\"".$filename."\"");
header("Content-Type: ".$filetype);
header("Content-Transfer-Encoding: binary");
header("Content-Length: ".filesize($filepath));
header("X-Sendfile: ".$filepath);
And with that, PHP is free and Apache takes over handling the local file serving to the user. It’s just as quick as the first download request and just as transparent to the user. If the file is popular, there can be significant savings in bandwidth transfer and request charges when using this method. Of course, every first request pulls the file from Amazon twice (once for the user and once for the cache), so if the file is never requested again while it is cached, that second transfer is wasted. That’s OK because, in the end, the savings outweigh the waste.
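The "yes and no" above, with small files going through PHP and large ones through mod_xsendfile, might be sketched as follows. The 1MB cut-off and the delivery_headers() name are assumptions made for the example; the right threshold depends on the server.

```php
<?php
// Hypothetical helper: returns the header lines for a cached file and picks
// X-Sendfile for anything at or above the (assumed) size threshold.
function delivery_headers(string $filename, string $filetype, string $filepath, int $size, int $threshold = 1048576): array
{
    $headers = array(
        'Content-Disposition: attachment; filename="' . $filename . '"',
        'Content-Type: ' . $filetype,
    );

    if ($size >= $threshold) {
        // Large file: Apache (mod_xsendfile) streams it and PHP is done.
        $headers[] = 'X-Sendfile: ' . $filepath;
    } else {
        // Small file: PHP will send the body itself with readfile().
        $headers[] = 'Content-Length: ' . $size;
    }

    return $headers;
}

$large = delivery_headers('video.mp4', 'video/mp4', '/var/cache/files/abc123', 26214400);
$small = delivery_headers('note.txt', 'text/plain', '/var/cache/files/def456', 2048);
```

In the real script each line would be passed to header(), and for the small case readfile($filepath) would follow.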
To keep the server clean and the cache fresh, we set up a cron job to run a script that checks whether any of the files in the cache have gone unrequested for a given amount of time. We maintain a fetch-count along with a “last-requested” time stamp, which lets us prune the cache so that our web host stays happy.
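As a rough idea of what the cleanup script does, here is a sketch of picking out the stale entries. The row layout ('path', 'fetch_count', 'last_requested') and the function name are assumptions for illustration only.

```php
<?php
// Hypothetical helper: returns the cache rows that have not been requested
// within the allowed idle time. Timestamps are Unix seconds.
function find_stale_entries(array $entries, int $now, int $maxIdleSeconds): array
{
    return array_values(array_filter(
        $entries,
        function (array $entry) use ($now, $maxIdleSeconds) {
            return ($now - $entry['last_requested']) > $maxIdleSeconds;
        }
    ));
}

$now = 1700000000;
$entries = array(
    array('path' => '/var/cache/files/a', 'fetch_count' => 12, 'last_requested' => $now - 3600),
    array('path' => '/var/cache/files/b', 'fetch_count' => 1, 'last_requested' => $now - 172800),
);

// With a 24-hour limit, only the second entry is stale.
$stale = find_stale_entries($entries, $now, 86400);
```

The cron script would then delete each stale file with unlink() and remove its row from the cache index.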
Also, in keeping with the quick delivery of files to the user, it was debated whether or not to use memcache, or a similar program, to maintain the cache index, or even the files themselves if they are small enough. Ultimately, the decision was made to not use memcache at this time because it is something else that we would need to install and maintain, when we already have the ability to use MySQL memory [HEAP] tables without much extra effort and with similar fetch speeds. Our memory table simply keeps track of the file location, the original file name, the file MIME type, the fetch-count, and the time created and last accessed, along with a hash index. By keeping the table small and using queries that hit the index, the fetch time from MySQL is very quick: 0.000011 seconds, on average.
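For anyone wanting to try the same thing, a memory table along these lines might look like the sketch below. The table name, column names and sizes are guesses based on the description above, not our actual schema, and the lookup assumes a PDO connection to MySQL.

```php
<?php
// Hypothetical schema for the cache index. MEMORY tables cannot hold
// TEXT/BLOB columns, so everything is a fixed-size type.
$cacheTableDdl = <<<'SQL'
CREATE TABLE download_cache (
    hash CHAR(40) NOT NULL,
    location VARCHAR(255) NOT NULL,
    original_name VARCHAR(255) NOT NULL,
    mime_type VARCHAR(100) NOT NULL,
    fetch_count INT UNSIGNED NOT NULL DEFAULT 0,
    created_at INT UNSIGNED NOT NULL,
    last_accessed INT UNSIGNED NOT NULL,
    PRIMARY KEY USING HASH (hash)
) ENGINE=MEMORY
SQL;

// Assumed key scheme: the SHA-1 of the file's remote location.
function cache_key(string $location): string
{
    return sha1($location);
}

// Looks up one row by its hash, so the query is answered from the index.
function find_cache_entry(PDO $db, string $hash): ?array
{
    $stmt = $db->prepare(
        'SELECT location, original_name, mime_type FROM download_cache WHERE hash = ?'
    );
    $stmt->execute(array($hash));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    return $row === false ? null : $row;
}
```

One thing to remember with this design is that a memory table is emptied when MySQL restarts, so the index has to be rebuilt from the files on disk or simply allowed to refill.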
Conclusion:
There’s a lot of information here, and probably a lot more rambling seeing that it is pretty late at night, but hopefully this information is helpful to someone else wondering how best to manage file downloads.
In short:
- Use a local cache, but keep it clean and down to only the records which are required
- Get files for the cache using exec and wget -q -nc behind the scenes
- When files are put onto S3 servers, set the headers that cause a download box to display when the download is clicked, then redirect the user to the S3 server so the file is downloaded directly from there. Either use DNS to cloak the location of the file so the user cannot know exactly where it is hosted, OR use time-sensitive pre-signed URLs and only allow authenticated access to the files hosted on S3.
- Use MySQL memory [HEAP] tables, or memcache, to maintain a catalog of the cache
- Use mod_xsendfile to serve local files without tying up PHP and a lot of memory
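For the time-sensitive pre-signed URLs mentioned in the list above, here is a sketch of S3's older query-string authentication (Signature Version 2). It assumes the bucket still accepts that signature version, which newer regions do not; in that case the official AWS SDK is the way to go. The function name, bucket, key and credentials are all placeholders.

```php
<?php
// Hypothetical helper: builds a GET URL that S3 will honor until $expires
// (a Unix timestamp), using legacy Signature Version 2 query-string auth.
function presign_s3_url(string $accessKey, string $secretKey, string $bucket, string $key, int $expires): string
{
    // Encode the object key for use in a URL path, keeping "/" as the separator.
    $path = str_replace('%2F', '/', rawurlencode($key));

    // SigV2 string to sign: method, empty Content-MD5, empty Content-Type,
    // expiry time, then the resource as /bucket/key.
    $stringToSign = "GET\n\n\n{$expires}\n/{$bucket}/{$path}";
    $signature = base64_encode(hash_hmac('sha1', $stringToSign, $secretKey, true));

    return sprintf(
        'https://%s.s3.amazonaws.com/%s?AWSAccessKeyId=%s&Expires=%d&Signature=%s',
        $bucket,
        $path,
        rawurlencode($accessKey),
        $expires,
        rawurlencode($signature)
    );
}

// Example: a link that stops working five minutes from now.
$url = presign_s3_url('AKIDEXAMPLE', 'example-secret', 'my-bucket', 'files/my report.pdf', time() + 300);
```

The redirect header would then point at this URL instead of a public one, so the file itself can stay private on S3.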
EDIT:
I have read on other forums and blogs a dislike of mod_xsendfile because, so it is claimed, the programmer lets go of control of the file, so other things cannot be done, such as database logging. This is not true if you run PHP as FastCGI; the FastCGI processes do not end once the file is passed to the user; rather, they continue to process until the script completes. This method is superior to other methods, and that’s the bottom line.