A tip for Webmasters: Redirect your 404 pages for maximum SEO impact!

Monday, August 31, 2009

I migrate hundreds of websites every year, and one item in my standard list of tools is overlooked so often that I feel it needs to be revisited. I can't begin to tell you how many clients I run into every day who have moved their websites from one host to another, or from one technology to another, only to find themselves with a long list of "page not found" errors on Google searches. These same clients are usually also lamenting the fact that their "new" website, their pride and joy, does not perform as well in search rankings as the old website did!

This article addresses why this happens and what the savvy webmaster can do to save the day and ensure no click goes unhandled!

Why do website migrations or redesigns cause "page not found" errors?
When clients decide they want to redesign their website (or change hosts), they often forget that in doing so, many of the "old" pages will be replaced with "new" pages, often with different web addresses (URLs). Even if the same web content platform is retained, it is not uncommon for a redesign to change the site map of a website, sometimes pretty dramatically. The problem can be compounded by a change in web content delivery platform (like changing from PHP to ASP.NET), because the extensions of the web pages will most likely change from "*.php" or "*.html" to "*.aspx" and others. The result is that when these new websites go live, many inbound links from both Google and other websites will be broken. Over time, these broken links will be dropped from most of the indexes, but the impact on your search ranking can be quite large and can last for a long time.

Here's what some actual web addresses might be before a redesign:
http://myCoolSite.com/prod.php
http://myCoolSite.com/aboutwidgetabc.htm

After a redesign, the above web addresses and others might be "enhanced", which is a common SEO technique:
http://myCoolSite.com/Product-Information-CoolWidget.aspx
http://myCoolSite.com/About-Widget-ABC.aspx

Notice the difference in the URLs: any links to the "old" pages will be broken after the new site goes live. Some pages on the old website might also be dropped altogether, which causes errors as well.

How can Page-not-Found errors be handled?
Different server platforms offer different alternatives, but the solutions fall into two areas. First, there are HTTP level solutions, which tend to be costly and difficult to implement and manage. Second, and the focus of this article, there are scripting solutions that are easily implemented by a webmaster.

With the HTTP level solution, the web server software inspects each page request and, if necessary, hands it over to a processing engine of some kind. This technique works well, but it requires "low level" web server programming, a skill not usually associated with webmasters, especially those tasked with managing many servers and keeping them all running efficiently! The HTTP level redirect might also be costly depending on your resource structure, especially if you have to involve an expensive IT or programming resource.

The scripting solution, however, is easier to accomplish, because most web servers have a configurable parameter that lets you set which page should be displayed for a "page not found" error. Traditionally, the page-not-found page is customized with the website's logos and colors, but it can often do more with a little bit of script. The technique is simple. The script looks at the address of the page requested, and if this address is on a list of known "old" addresses, the script redirects to a more appropriate page!
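To make the idea concrete, here is a minimal sketch of that lookup. It is written in Python purely for illustration and is not the ASP.NET script discussed later; the `REDIRECTS` mapping and the `resolve` function are hypothetical names, and pairing the example URLs in this way is my assumption.

```python
# Hypothetical list of retired paths and their replacements, built from the
# example URLs above. The pairing of old and new pages is an assumption.
REDIRECTS = {
    "/prod.php": "/Product-Information-CoolWidget.aspx",
    "/aboutwidgetabc.htm": "/About-Widget-ABC.aspx",
}


def resolve(requested_path):
    """Return the replacement path for a retired URL, or None if there is none."""
    # Assumption: paths are compared case-insensitively, so keys are lowercase.
    return REDIRECTS.get(requested_path.lower())


if __name__ == "__main__":
    print(resolve("/Prod.php"))     # a known old page
    print(resolve("/missing.htm"))  # not on the list, so show the normal 404
```

When `resolve` returns nothing, the script would simply fall through and show the regular custom error page.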

Handling Page-Not-Found errors with ASP.NET script, an example for Windows Server & IIS

On Windows server, inside the properties for any website, we are able to configure the "Custom Errors" tab to display custom messages for different error conditions the server might encounter. Of course, a "page not found" is also a standard "404" error in web-server speak! So, on this screen, if you scroll down to the "404" error, notice that it can be "edited" to point to a custom page instead of the web server's default page.






Once on the edit window for the 404 error, notice it's possible to change the "Message Type" to "URL", meaning the web server will simply call a web address on your given website whenever a 404 condition is encountered. For this example, I am setting this value to "/404catch.aspx" which will be a script, on my website's root, that will handle the 404 errors.




Now, the web server will call your http://YourSite.com/404catch.aspx whenever a user requests an invalid page! Even better, the web server will usually pass you a query string specifying the user's original intended request. This is true of most web servers, although the syntax will be slightly different for different servers.

For IIS, the page is called like this:
http://YourSite.com/404catch.aspx?404;http://somesite.com:80/somepath/somepage.htm?someQuery=SomeValue

Note the query string: everything after the first "?" in the address above!

The first part of the query string is the "404" error number. In our example it will always be "404" since we only trapped the "404" error - but if you use your script for more than one error, it would be possible to differentiate which error triggered your script by inspecting this part of the query string.

The second part tells you which web address (URL) the user was looking for when the "page not found" triggered. Notice that it looks pretty much like a standard URL, except that after the first part (usually called the "host") you are also passed a colon and the port number the website is responding under. The port number is usually of no consequence and I ignore it in my script!

However, especially with the second part of the query string, you can now do some string comparison to determine if this URL is one where you could redirect to another page or display a custom message!
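As a rough sketch of how that value could be taken apart, the snippet below splits the IIS-style query string with a regular expression. It is Python used for illustration only and is not taken from 404catch.aspx; the pattern, the group names and the `parse_iis_404` function are all my own assumptions.

```python
import re

# Assumed pattern for the IIS-style value "404;http://host:port/path?query".
PATTERN = re.compile(
    r"^(?P<error_code>\d{3});"   # the error number, e.g. "404"
    r"(?P<protocol>https?://)"   # "http://" or "https://"
    r"(?P<host>[^/:?]+)"         # host name, without port or path
    r"(?::(?P<port>\d+))?"       # optional ":80" style port
    r"(?P<path>/[^?]*)?"         # path, up to the first "?"
    r"(?P<query>\?.*)?$"         # the user's own query string, if any
)


def parse_iis_404(raw_query):
    """Split the raw query string into its parts, or return None if it does not match."""
    match = PATTERN.match(raw_query)
    if match is None:
        return None
    parts = match.groupdict()
    parts["path"] = parts["path"] or "/"
    parts["query"] = parts["query"] or ""
    return parts


if __name__ == "__main__":
    sample = "404;http://somesite.com:80/somepath/somepage.htm?someQuery=SomeValue"
    print(parse_iis_404(sample))
```

The port is captured separately so that it can be ignored when the URL is compared against a list of known addresses.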

You can see a complete listing for my version of 404catch.aspx here (also see the update on 9/22/09 below). My version receives the page request, parses the query string to determine the user's original intended page, then uses a database to decide whether this "original" or "receiver" page should be redirected to another page (the "target").

Some items to note as you look at my script:
  1. I used a Regex to split the query string into several pieces of interest:

    • sErrorCode - will contain the number of the error that triggered the script
    • sProtocol - will contain "http://" or "https://" depending on how the page was called
    • sHost - will contain the "host" part of the URL, typically "somesite.com" (in the example above) sans any "path" that might follow.
    • sPath - will contain the "path" part of the URL, typically "/somepath/somepage.htm" in the example above.
    • sQueryString - will contain any query string the user might have passed, typically "?someQuery=SomeValue" in the example above.


  2. After the Regex, I look for the URL (protocol + host + path) in a redirect table for my website and, if I find it, I redirect to a designated page. Notice this script can support several types of redirects:

    • 301 (permanent redirect)
    • 302 (temporary redirect)


  3. Notice also this script can return a custom 404 or an inline 404.
  4. Notice the redirect is done by changing the "response.status" property and calling the "response.AddHeader" method to send the special HTTP code that tells the requesting browser to "redirect" to a new page!
Of course, in a modified version of 404catch.aspx, it is possible to eliminate all the database code in favor of a text list or array that contains the receiver and target URLs and how they should be treated. The point of the example is not the database code, but the fact that one can catch the 404 and call a custom script, and that custom script can then figure out what page the user was trying to access and decide whether another page should be served instead OR a custom error displayed!
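A database-free variant could look something like the sketch below. It is Python for illustration only, not the actual 404catch.aspx code; `REDIRECT_TABLE`, `STATUS_LINES` and `build_response` are hypothetical names, and the choice of 301 versus 302 for each example URL is arbitrary.

```python
# Hypothetical in-memory table: receiver URL -> (redirect type, target URL).
# Keys are lowercase so lookups can ignore the case of the incoming URL.
REDIRECT_TABLE = {
    "http://mycoolsite.com/prod.php":
        ("301", "http://myCoolSite.com/Product-Information-CoolWidget.aspx"),
    "http://mycoolsite.com/aboutwidgetabc.htm":
        ("302", "http://myCoolSite.com/About-Widget-ABC.aspx"),
}

# Standard HTTP status lines for the two redirect types.
STATUS_LINES = {"301": "301 Moved Permanently", "302": "302 Found"}


def build_response(receiver_url):
    """Return (status line, headers) for a URL that triggered the 404 handler."""
    entry = REDIRECT_TABLE.get(receiver_url.lower())
    if entry is None:
        # Not in the table: report a genuine "page not found".
        return "404 Not Found", {}
    redirect_type, target = entry
    # A redirect is just a status code plus a Location header.
    return STATUS_LINES[redirect_type], {"Location": target}


if __name__ == "__main__":
    print(build_response("http://myCoolSite.com/prod.php"))
    print(build_response("http://myCoolSite.com/no-such-page.htm"))
```

The status line and the Location header play the same roles as the "response.status" property and the "response.AddHeader" call mentioned in the list above.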

By the way, if you are interested in our "table" structure, you can download the DDL to create a table that will work with the 404catch.aspx script. If you choose to use a table, in your website's web.config, be sure to include the table name and the connection string to your database table with the following two lines:

<add key="RedirectManagerConn" value="Server=xxx.xxx.xxx.xxx;Database=YourDatabaseName;uid=YourUserName;pwd=YourPassword;Application Name=404Catch_RedirectManager"></add>
<add key="RedirectManagerTable" value="EZNP_RedirectManager_Redirects"></add>


So, with this information, you now have a powerful tool that will let you handle website migrations without losing a single click, and perhaps while retaining a lot of your search engine ranking! Good use of 301/302 redirects is a very powerful tool in the hands of the effective SEO-aware webmaster.

Update 9/22/09 - support for URL Masking

Weeks after I created this post, I had a need to "mask" certain URLs instead of performing 301 redirects. In this case, instead of the browser being redirected, we wanted the "contents" of the target page to be displayed in response to the vanity URL - without the user (or the browser) ever knowing that the page was not actually there. For instance, we might want the URL http://myCoolSite.com/prod.php to be displayed to the user BUT it might not exist and we want the contents from http://myCoolSite.com/aboutwidgetabc.htm to be returned. Masking the URLs in a 404 catch technique will do this job well and it turns out the original script already had 99% of all the code required to accomplish this!

For instance, notice the procedure "OutputRemote404", which was part of the original script. Its job in the original script was to fetch the "content" of some remote 404 page (perhaps on another website or server) and return it "inline" as the result of the current 404. This might be useful when you have more than one website in support of one project (say a blog and a website that are "designed" to work together). Rather than customizing a 404 page on two platforms, one could customize one and simply "call" that page from the other platform. Regardless of usage, this procedure accomplished the remote 404 by performing an HTTP "GetResponse" on the server side, which is a documented .NET method for reading HTTP from a web server. The result of that call to a remote server (using standard HTTP, by the way) is then stored in a string and finally output as the response to the current request.

Well, this was exactly what was needed for masking to work. We wanted to fetch the "content" of some remote page and display it "inline" as the result of the current request. The only change needed was to set the response code to "200 OK" (the response code for a successful HTTP request/response) so that the browser (and any analytics or log files) would not confuse this with a 404 error!
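In outline, the masking step amounts to something like the following sketch. It is Python for illustration, not the .NET code from the script; `fetch_over_http` and `mask_response` are hypothetical names, with `urllib.request.urlopen` standing in for the server-side "GetResponse" call.

```python
from urllib.request import urlopen


def fetch_over_http(url):
    """Fetch a page on the server side; a stand-in for the .NET GetResponse call."""
    with urlopen(url) as response:
        return response.read()


def mask_response(target_url, fetch=fetch_over_http):
    """Return the target page's content as if it lived at the requested URL."""
    body = fetch(target_url)
    # Report success rather than 404, so browsers, analytics and logs treat
    # the masked page as a normal hit.
    return "200 OK", body


if __name__ == "__main__":
    # A fake fetcher keeps this demo off the network.
    status, body = mask_response(
        "http://myCoolSite.com/aboutwidgetabc.htm",
        fetch=lambda url: b"<html>Widget ABC</html>",
    )
    print(status, body)
```

Passing the fetcher in as a parameter is only a convenience for trying the idea out without a live server.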

One more change is needed because we are now returning "content" from our script (instead of a redirect): we need to set the content type of the response correctly. By default, this will be set to "text/html", which works well for returning plain old HTML pages. However, sometimes we want to return XML, or maybe even an image or PDF (since masking is possible for "any" type of content). So, notice the new script has a procedure called "OutputInline" that handles the output. This procedure is identical to "OutputRemote404", except that it checks whether the extension of the file is ".xml". If it is, the Content Type of the response is set to "text/xml" so that the browser displays the content correctly. Of course, for this script to be truly complete, we'd need to add more types of content (images, PDFs, etc.). However, for brevity and to get this out quickly, I've only added XML handling for now. Also, my first choice would not be to hard code these extension checks (although that might be the easiest). Instead, I'd prefer something where the actual MIME mappings on the server are used, which would handle literally ANY content type the server is configured to return without needing specific code for each! This, however, will have to wait for another time when I can research it more thoroughly.
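To illustrate the "use a MIME table instead of hard coding" idea, here is a small sketch in Python. It is not how 404catch.aspx works: Python's standard `mimetypes` module stands in for the server's MIME mappings, and `OVERRIDES`, `DEFAULT_TYPE` and `content_type_for` are hypothetical names.

```python
import mimetypes
import posixpath
from urllib.parse import urlsplit

# Explicit override so ".xml" maps to "text/xml", as described above.
OVERRIDES = {".xml": "text/xml"}

# Fallback for URLs whose type cannot be determined.
DEFAULT_TYPE = "text/html"


def content_type_for(url):
    """Pick a Content-Type for a target URL based on its file extension."""
    # Drop the query string so it cannot be mistaken for part of the extension.
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lower()
    if extension in OVERRIDES:
        return OVERRIDES[extension]
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or DEFAULT_TYPE


if __name__ == "__main__":
    for url in ("http://myCoolSite.com/feed.xml?page=2",
                "http://myCoolSite.com/manual.pdf",
                "http://myCoolSite.com/about"):
        print(url, "->", content_type_for(url))
```

With a lookup like this, supporting images or PDFs would not require a new extension check for each type.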

Download the modified 404catch.aspx here, with support for masked URLs. Of note is that this script still supports the values 301, 302 and 404 in the database field "RedirectType" plus now an additional "inline" value that performs the masking instead of a redirect. The values of the other fields are exactly the same. So, one could easily change from 301 redirects to masks by changing the value of RedirectType to "inline" for the desired records in the database.

I hope you find this helpful! Leave any comments or questions below.
