OK, this took me some time to figure out.
Search engine spiders keep requesting resources that were removed from my site long ago. Since they keep coming back, they don't seem to process the repeated 404s they've been receiving. So, to let them know those resources will not return, I want to send HTTP response code 410 (Gone) instead of 404 (Not Found).
The Apache documentation describes how this can be done using the RewriteRule directive combined with the [G] flag, like this:
RewriteRule ^news/politics.* - [G]
Together with the 410 response code, I also configured Apache to send a page explaining the error, using an ErrorDocument directive:
ErrorDocument 410 /error410
404 instead of 410?
Surprisingly, with these configuration changes, requesting one of the removed resources still results in a 404, together with the normal 404 error page.
Checking the error page /error410 by requesting it directly in the browser returned the 410 page, so that part seems to be OK.
One thing you need to know is that my website uses a PHP framework I wrote myself. This has a single entry point for nearly all requests. This script examines the incoming request and executes the corresponding script to render the page.
The Apache configuration that sends the requests to this script is also a RewriteRule:
RewriteRule ^(.*)$ framework.php
This rule is the last rewrite rule in the configuration, so it catches all requests not handled by any other specific rule. The main script uses the REQUEST_URI server variable to determine which page to render.
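As a rough illustration of such a single entry point, here is a minimal sketch of how a script could map REQUEST_URI to a page. The route table, the file names and the function name route_for are hypothetical and only serve the example; my actual framework is organised differently.

```php
<?php
// Hypothetical route table: request path => script that renders the page.
// Paths and file names are illustrative only.
const ROUTES = [
    '/blog'     => 'pages/blog.php',
    '/error410' => 'pages/error410.php',
];

/**
 * Map a request URI to the script that should render it.
 * Falls back to a (hypothetical) 404 script when no route matches.
 */
function route_for(string $requestUri): string
{
    // Drop the query string so only the path is matched.
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path)) {
        $path = '/';
    }
    return ROUTES[$path] ?? 'pages/error404.php';
}

// In the entry script the lookup would start from the server variable:
// $script = route_for($_SERVER['REQUEST_URI']);
```

With a lookup like this, any URI that is not in the table ends up at the 404 page, whatever the reason for the request.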
To examine what happens I looked at the value of the $_SERVER['REQUEST_URI'] parameter during script execution of a normal request like http://kwebble.com/blog and a gone URL like http://kwebble.com/news/politics/no-longer-here.
For the first one the value is /blog, as expected. For the other URL I expected /error410, the URI of the configured ErrorDocument. To my surprise it was /news/politics/no-longer-here, the original URI.
I expected the error page because I thought Apache would execute the configured URI as a separate request. Here I was wrong: Apache internally redirects to the configured error document instead of making a separate request, so REQUEST_URI keeps its original value. This explains the 404, because that URL no longer points to a valid resource. But how to detect the error?
Looking at the server variables I noticed some differences between the two requests:
- The values of REDIRECT_STATUS differ: 200 for the normal URL and 410 for the incorrect URL.
- A variable called REDIRECT_REDIRECT_STATUS is only present for the incorrect URL. Its value is 410.
- The values of REDIRECT_URL differ: /kwebble-site/blog for the normal URL and /kwebble-site/error410 for the incorrect URL.
Server variables with a REDIRECT_ prefix, such as REDIRECT_REDIRECT_STATUS, are created by Apache when it does an internal redirect. On an error this happens twice: first when the error is detected and then again to create the error document.
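To compare the variables between two requests, a small helper along these lines can be useful. This is only a sketch: the function name redirect_vars is made up for the example, and it simply filters whatever array it is given (normally $_SERVER) on the REDIRECT_ prefix.

```php
<?php
/**
 * Return only the entries whose name starts with "REDIRECT_",
 * sorted by name so two dumps are easy to compare.
 */
function redirect_vars(array $server): array
{
    $found = [];
    foreach ($server as $name => $value) {
        // 9 is the length of the "REDIRECT_" prefix.
        if (strncmp((string) $name, 'REDIRECT_', 9) === 0) {
            $found[$name] = $value;
        }
    }
    ksort($found);
    return $found;
}

// Typical use while debugging: write the values to the PHP error log.
// error_log(print_r(redirect_vars($_SERVER), true));
```

Running it for both a normal and a gone URL and comparing the two dumps shows the differences listed above.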
To render the 410 page I changed the code to look at REDIRECT_STATUS. If it's 200, the rendered page is based on the value of REQUEST_URI; otherwise the value of REDIRECT_URL is used.
Instead of always using the value from REDIRECT_URL, an additional check on REDIRECT_STATUS is done. I added this because in my search for information I found several pages suggesting that REDIRECT_URL is not always present.
This extra check makes sure an error page is always generated and, if possible, reported as 410. Otherwise it is reported as a 404, as it always has been, which is the next best alternative.
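The selection described above could look roughly like the sketch below. It is not the literal code from my framework: the function name uri_to_render is invented for the example, treating a missing REDIRECT_STATUS as 200 is an assumption, and stripping a base path such as /kwebble-site from REDIRECT_URL is left out.

```php
<?php
/**
 * Pick the URI the framework should render.
 * $server is expected to look like $_SERVER.
 */
function uri_to_render(array $server): string
{
    // Assumption: no REDIRECT_STATUS at all means a normal request.
    $status = (string) ($server['REDIRECT_STATUS'] ?? '200');
    $requestUri = (string) ($server['REQUEST_URI'] ?? '/');

    // Normal request: render what was asked for.
    if ($status === '200') {
        return $requestUri;
    }

    // Error redirect: prefer the error document's URL when Apache provides it,
    // otherwise fall back to the original URI, which ends in the usual 404.
    $redirectUrl = (string) ($server['REDIRECT_URL'] ?? '');
    return $redirectUrl !== '' ? $redirectUrl : $requestUri;
}
```

The fallback in the last line is what keeps the old 404 behaviour when REDIRECT_URL is missing.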