Improving Disk I/O in PHP Apps

For the release of Smarty 3.1.0 I refactored most of Smarty's disk access. As this optimization from September 2011 kept raising questions on the Smarty forums, I finally felt the need to explain what I did. Although Smarty is the reason I discovered this, this post applies to PHP in general, not Smarty in particular. This post is all about laying the groundwork for you to realize two things:

  1. Your operations are not atomic
  2. Avoid accessing the hard disk unnecessarily

Warning: Unless you know what you're doing and why you're doing it, this content is to be considered harmful advice!



Multiple Processes and Race Conditions

You probably remember that your computer switches different processes (running applications, …) in and out of the CPU. This is done so that processes appear to run in parallel on a single CPU core. This is what we call multitasking, and the same principles apply to computers with more than one CPU core.

This means that your program (PHP script) is not executed consecutively. Some of the program is executed, then it's paused so something else can run, then it continues execution, then it's paused again, and so forth.

In some languages we can tell the executing host to treat a bunch of operations as a single operation - we call that atomicity. PHP doesn't know this concept. It's safe to say that your PHP script can be disrupted at any given time at any given operation.

Unless you've had the chance of working on fairly high-traffic sites, you've probably never seen a race condition in action. A race condition is what we call two processes executing in parallel and doing contradictory things. For example, Process 1 is writing the file /tmp/file.txt while Process 2 is trying to delete that same file.

Those processes don't have to run in the same context. While Process 1 could be a PHP script, Process 2 could be a shell script triggered by Cron, or some manual rm /tmp/file.txt via SSH.

File Locking

To prevent these race conditions, we're allowed to lock files. When a file is locked by Process 1 and Process 2 tries to acquire the lock, Process 2 blocks until Process 1 releases the lock. PHP provides this functionality with flock().

flock() has a couple of problems, though. For one it only works with resources, so you need to have the file opened with fopen() prior to obtaining a lock. Also flock() will fail on certain file systems like FAT or NFS. On top of that it seems quite ridiculous to open a file, only to obtain a lock, only to delete the file.

So in real life, where a PHP script does not know which file system is used, flock() won't help.
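For reference, this is roughly what the lock-based approach looks like where it does work. A minimal sketch, assuming a local file system with working advisory locks; withExclusiveLock() and the lock-file name are made up for this illustration.

```php
<?php
// Hypothetical helper: run $callback while holding an exclusive advisory lock.
function withExclusiveLock(string $lockFile, callable $callback)
{
    // flock() needs a stream, so the lock file has to be opened first.
    // Mode 'c' creates the file if needed and never truncates it.
    $handle = fopen($lockFile, 'c');
    if ($handle === false) {
        throw new RuntimeException("Cannot open lock file: $lockFile");
    }
    try {
        // Blocks until no other process holds the lock.
        if (!flock($handle, LOCK_EX)) {
            throw new RuntimeException("Cannot lock: $lockFile");
        }
        return $callback();
    } finally {
        flock($handle, LOCK_UN);
        fclose($handle);
    }
}

$lockFile = sys_get_temp_dir() . '/some.file.lock';
$result = withExclusiveLock($lockFile, function () {
    // Work on the protected file would go here.
    return 'done';
});
```

Note that the lock is advisory: it only keeps out processes that ask for the same lock. A cron job running rm doesn't ask.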

Potential Race Condition

At first glance, the following looks like good code, as we check if a file exists prior to unlinking it. We do that because unlink() issues an E_WARNING whenever it can't find the file to unlink:

$filepath = "/tmp/some.file";
if (file_exists($filepath)) {
  unlink($filepath);
}

But we remember that PHP has no atomic operator and a script can be disrupted at any given time:

$filepath = "/tmp/some.file";
if (file_exists($filepath)) {
  // <- potential race condition
  unlink($filepath);
}

Considering the above code to be Process 1, we could encounter the following condition:

*Process 1*: file_exists("/tmp/some.file")
*Process 2*: unlink("/tmp/some.file")
*Process 1*: unlink("/tmp/some.file") -> E_WARNING, file not found!

Between checking if the file existed and actually removing it, another process had the chance to delete the file. Now the unlink() of our script issues an E_WARNING because the unlink() failed.
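The sequence can be reproduced in a single script by playing both roles. This is only a sketch: the first unlink() stands in for Process 2, and the temporary error handler exists just to capture the warning for inspection.

```php
<?php
$filepath = tempnam(sys_get_temp_dir(), 'race');

// Collect warnings instead of printing them (illustration only).
$warnings = [];
set_error_handler(function ($errno, $errstr) use (&$warnings) {
    $warnings[] = $errstr;
    return true;
}, E_WARNING);

if (file_exists($filepath)) {
    // Stand-in for Process 2 deleting the file in between.
    unlink($filepath);
    // Process 1 continues, unaware that the file is gone.
    $deleted = unlink($filepath);
}

restore_error_handler();
```

After the run, $deleted is false and $warnings holds the one warning raised by the second unlink().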

Mitigating the Race Condition

Fear not, PHP knows the almighty @ silence-operator. Prefixing a function call with @ makes PHP ignore any errors issued by that function call scope. The following code will prevent any E_WARNING issued due to a race condition (or any other fault, for that matter):

$filepath = "/tmp/some.file";
if (file_exists($filepath)) {
  @unlink($filepath);
}

With that little @ we've opened the door to a slight simplification of our code. Since we're performing the file_exists() to make sure unlink() won't issue any warnings, and @unlink() won't issue any warnings, we can simply drop file_exists():

$filepath = "/tmp/some.file";
@unlink($filepath);

Et voila, we have successfully mitigated the race condition. And by doing so, we have accidentally reduced the Disk I/O by 50%.

Reducing Disk I/O (stats)

Besides the implications for race conditions, ditching file_exists() has the other benefit of reducing stat calls. Whenever you have to touch a hard disk, imagine your Ferrari-application hitting the brakes. Compared to the CPU, any disk (yes, even an SSD) is a turtle chained to a rock. So the ultimate goal is to avoid touching the file system whenever possible.

Consider the following well coded program to identify if a file exists and when it's been modified last:

$filepath = "/tmp/some.file";
$file_exists = file_exists($filepath);
$file_mtime = null;
if ($file_exists) {
  $file_mtime = filemtime($filepath);
}

Did you know that filemtime() returns false (and issues an E_WARNING) if it can't find the file? So how about reversing things and ditching the file_exists():

$filepath = "/tmp/some.file";
$file_mtime = @filemtime($filepath);
$file_exists = !!$file_mtime;
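The same idea can be wrapped in a small helper. This is a sketch; statFile() is a name made up for this example. It compares against false strictly, so a file whose modification time happens to be 0 still counts as existing.

```php
<?php
// Hypothetical helper: one silenced stat call yields both facts.
function statFile(string $filepath): array
{
    $mtime = @filemtime($filepath);
    return [
        'exists' => $mtime !== false,
        'mtime'  => $mtime === false ? null : $mtime,
    ];
}

$existing = tempnam(sys_get_temp_dir(), 'stat');
$present = statFile($existing);
$missing = statFile($existing . '.does-not-exist');
unlink($existing);
```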

Custom Error Handling

As mentioned initially, ditching file_exists() was done to Smarty 3.1.0. We did numerous tests and benchmarks and came to the conclusion that we'd be stupid not to do it. And at that point I figured nobody would ever notice. That might've been true, had it not been for set_error_handler().

set_error_handler() allows you to register your own custom method for handling errors. It's pretty neat to push certain errors to a database or send mails or something like that. It gives you absolute power over each and every notice or warning issued. Even those that would've been masked by error_reporting() or the @ operator.

Apparently some people register custom error handlers to get ALL THE ERRORS. Even the masked ones. Some developers failed to understand hints in the docs, others did it deliberately. Intentions aside, these ill-conceived error handlers break the way we expect PHP to work. All of a sudden errors like error in 'test.php' on line 2: unlink(/tmp/some.file): No such file or directory (2) started popping up.
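For comparison, here is a sketch of a handler that does respect the @ operator. It relies on the documented behavior that error_reporting() reflects the silence operator while the handler runs. The $logged array and the file name are placeholders for this example.

```php
<?php
error_reporting(E_ALL);
$logged = [];

set_error_handler(function ($errno, $errstr, $errfile, $errline) use (&$logged) {
    // A masked level means the caller asked for silence,
    // either through error_reporting() or through @.
    if (!(error_reporting() & $errno)) {
        return false;
    }
    $logged[] = "$errstr in $errfile on line $errline";
    return true;
});

$missing = sys_get_temp_dir() . '/no-such-file-' . uniqid();
@unlink($missing);  // silenced: the handler skips it
unlink($missing);   // not silenced: the handler records it

restore_error_handler();
```

Only the second unlink() ends up in $logged.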

In their minds Smarty was misbehaving. After all, its code was raising E_WARNINGs all over the place. They didn't know (and didn't care) about the improvements we'd made. They didn't want to "fix" their error handlers, as they did not consider them broken. So in Smarty 3.1.2 I introduced Smarty::muteExpectedErrors() - a custom error handler that would proxy their handlers, filtering out errors Smarty actually expected to happen.
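The proxy idea can be sketched like this. To be clear, this is not Smarty's actual implementation: installMutingProxy() and its path-prefix rule are assumptions made for illustration. Silenced errors raised in files under the given prefix are swallowed, and everything else is forwarded to whichever handler was registered before.

```php
<?php
error_reporting(E_ALL);

// Hypothetical sketch, not Smarty's code.
function installMutingProxy(string $pathPrefix): void
{
    $previous = null;
    $proxy = function ($errno, $errstr, $errfile, $errline) use (&$previous, $pathPrefix) {
        $muted = !(error_reporting() & $errno);
        if ($muted && strpos($errfile, $pathPrefix) === 0) {
            return true; // expected by the library: swallow it
        }
        // Everything else goes to the previously registered handler.
        return $previous
            ? (bool) call_user_func($previous, $errno, $errstr, $errfile, $errline)
            : false;
    };
    $previous = set_error_handler($proxy);
}

// A "greedy" handler of the kind described above: it records everything.
$seen = [];
set_error_handler(function ($errno, $errstr) use (&$seen) {
    $seen[] = $errstr;
    return true;
});

installMutingProxy(__FILE__);

$missing = sys_get_temp_dir() . '/no-such-file-' . uniqid();
@unlink($missing);  // silenced and raised in this file: filtered out
unlink($missing);   // not silenced: still reaches the greedy handler
```

In this demo the script's own path serves as the prefix. A library would pass its own directories instead.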

Warning (added Jan 10th 2013)

This post appeared on Hacker News, triggering a couple of comments. I added a warning to the top of the post. That said, here are a couple of reasons I chose this route:

  • I really don't care if a file couldn't be accessed due to privilege (or any other) reasons. There is a global systems-check to take care of that. This code assumes everything is fine. If it is not, the systems-check will tell us what is going on.
  • This is the least amount of code needed to "just make it work" across any setup.
    • Regardless of the number of physical machines running in parallel.
    • Regardless of the filesystem used (yes, some don't provide locking)
    • Regardless of the frequency and concurrency a single file is touched
    • Regardless of the PECLs some environment may have installed

This is the fire and forget approach. This is something you can do when you're caching, when you simply don't care about integrity and persistence.

Would I do any of the above if I cared about the data and could define the environment? HELL NO! But then, I probably wouldn't be using PHP either…

Comments

joe on :

file_exists() and filemtime() are both automatically cached in PHP. That means that after the first access to the HD, file_exists() will check the cache instead of accessing the HD in the future. The result of my test is that file_exists() is a little faster than filemtime() when iterating more than 1000 times. I prefer using file_exists().

Monte Ohrt on :

I don't think the performance difference between file_exists() and filemtime() is the point. You would use filemtime() if you need to find the last modified time of the file, which is necessary for a lot of Smarty processes. So if you need the filemtime(), no sense in checking file_exists() first creating a potential race condition... just use filemtime() silenced by the @ operator.

Halmai, Csongor on :

> Since we're performing the file_exists() to make sure unlink() won't issue any warnings, and @unlink() won't issue any warnings, we can simply drop file_exists():

This is not true in itself, I think. The unlink can issue a warning if it is unable to delete a read-only file. This warning is much more important than the other one about deletion attempt of a non-existing file.

If the file is not deletable because it is not there this is not a real problem because "my job has been already done". Contrary, if the file is not deletable because it is write protected then it is a serious problem and maybe script execution should be aborted.

The problem is that @ suppresses the latter reason as well, however, the real reason of the warning should be differentiated.

Rodney Rehm on :

You are right about that. But since PHP itself does not differentiate these cases properly, we would have to scan all notices to identify the single one we actually need.

Phin Pope on :

Great for stopping disk I/O but suppressing the error can actually cause longer processing, as the error is still raised but just not displayed.

http://goo.gl/CIxib

Kowach on :

Isn't suppressing errors with @ too expensive an operation? The problem with Smarty cached templates is solved by moving them into memory with APC or Memcache.

Rodney Rehm on :

In principle, yes. There is a certain overhead with any errors, no matter the way they are suppressed (silence, error_level, …). But our benchmarks have shown the error-overhead to be less significant than the overhead you get from disk I/O - even via statcache.

This post is not only about performance, but about possible race condition holes as well. Please keep that in mind.

Memcache and APC are great. But they come with a - severe - downside. Code that is stored in them must be eval()ed. There is, at least at the moment, no way to have APC (eAccelerator, …) cache the opcode resulting from the eval. So while you will avoid hitting the disk, you'll also put some additional strain on your CPU. I opened an issue for #59787 enable OpCodeCache for eval() / streams, but nothing has been done about it, yet.

mc0e on :

I'd say you're right about memcached though. Not a good place to stash code.

APC does do opcode caching though. I tend to think of its other capabilities as secondary to that.

Razvan on :

Please put a warning or something on the fact that you are using the '@' operator which is undesirable. Novice people will look at this example and will try and exploit the '@' operator.

Andrey Repin on :

The error suppression operator is evil, and the big one. You don't mention it in your article, which surprises me, but let me restate - HIDING ERRORS IS ASKING FOR ALL KINDS OF BAD THINGS TO FALL ON YOUR HEAD. At the least expected moment. Getting used to ignoring errors is bad practice.

Also, one major reason to wrap the error handler is to get better error notification by providing a backtrace, like http://php.net/manual/en/class.errorexception.php#95415 If you want to deal properly with this kind of speed hack, you'd write something like

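try {
    unlink($filepath);
}
catch (Exception $e) {
    // A missing file means the job is already done; anything else is a real failure.
    if (file_exists($filepath))
        throw $e;
}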

Of course, providing a wrapper to convert errors to exceptions.

Andrey Repin on :

However, the same effect could be achieved without ErrorException wrapper, like in the code below:

if (!@unlink($filepath)) {
    if (file_exists($filepath)) throw new ErrorException("Could not delete $filepath");
}

but this will fail in the event of custom error_handler present, as you've mentioned earlier.

butcher on :

Even small speed gains are important, but sweeping errors under the carpet isn't such a good idea. SSDs are pretty fast these days, and I don't think that file_exists() is such an issue.

Also @Andrey Repin's solution seems interesting for speeding up things a little.

streaky on :

@unlink($filepath); - this doesn't mitigate the race condition, it simply hides the effect of it, which is totally not the same and is Drupal school of code. What you're actually looking for (assuming you want to do it right) is exclusive lock/check/[delete]/release lock.
