Canonicalize URLs for static website
Problem
I want to "canonicalize" URLs for my static (files and folders) website.
The 'Aims' section describes what I want to accomplish. The 'Code' section gives my current `.htaccess`. Everything works right now, but I wonder if the code could be improved:
- Any changes for better performance?
- Can some rules be removed?
- Can some rules be merged?
- Can something be shortened?
- Are some of the explanatory comments wrong?
My `.htaccess` shouldn't contain anything beyond what is described in 'Aims' and my examples. Anything unrelated to that is probably unneeded (unless I'm overlooking something important).
Aims
- Strip the `.html` file extension (but keep all other extensions).
- If the file is `index.html`, strip "index" too, and don't keep a (folder) trailing slash.
- No trailing slashes for files or folders.
- If someone adds a trailing slash, redirect to the variant without it.
Example 1
- Physical file: `example.com/foo/bar.html`
- Desired URL: `example.com/foo/bar`
- URLs that should redirect (301) to the desired URL:
  - `example.com/foo/bar.html`
  - `example.com/foo/bar/`
Example 2 (if `index.html`)
- Physical file: `example.com/foo/index.html`
- Desired URL: `example.com/foo`
- URLs that should redirect (301) to the desired URL:
  - `example.com/foo/index.html`
  - `example.com/foo/index`
  - `example.com/foo/`
  - `example.com/foo.html`
Example 3 (if non-HTML file)
- Physical file: `example.com/foo/bar.png`
- Desired URL: `example.com/foo/bar.png` (same as physical)
- URLs that should redirect (301) to the desired URL:
  - none
Code (`.htaccess`)
```
# Turn MultiViews off. (MultiViews on causes /abc to go to /abc.ext.)
Options +FollowSymLinks -MultiViews
# It stops DirectorySlash from being processed if mod_rewrite isn't.
# Disable mod_dir adding missing trailing slashes to directory requests.
DirectorySlash Off
RewriteEngine On
# If it's a request to index(.html)
RewriteCond %{THE_REQUEST} \ /(.+/)?index(\.html)?(\?.*)?\ [NC]
# Remove it.
RewriteRule ^(.+/)?index(\.html)?$ /%1 [R=301,L]
```
Solution
First, I'll start with an observation that applying these rewriting rules — in particular, removing trailing slashes in the URL — can break pages that reference relative URLs. I assume you know that and want to proceed anyway.
In general, you should realize that every `RewriteRule` is conditional. If the path does not match the pattern, the `RewriteRule` and all of its `RewriteCond`s are skipped. You should take advantage of that fact to eliminate a few superfluous `RewriteCond`s.

You have seven `RewriteRule`s:

- Rule 1

      # If it's a request to index(.html)
      RewriteCond %{THE_REQUEST} \ /(.+/)?index(\.html)?(\?.*)?\ [NC]
      # Remove it.
      RewriteRule ^(.+/)?index(\.html)?$ /%1 [R=301,L]

  It is highly unorthodox to use `%{THE_REQUEST}`, which works at the raw HTTP level (e.g., `GET /index.html HTTP/1.1`). It contains the raw URL, whereas you usually care about the decoded path. The fix is to remove the `RewriteCond` altogether, since it is redundant anyway.

  I recommend incorporating Rule 6 into this rule as a simplification.
- Rule 2

      # if request has a trailing slash
      RewriteCond %{REQUEST_URI} ^/(.*)/$
      # but it isn't a directory
      RewriteCond %{DOCUMENT_ROOT}/%1 !-d
      # and if the trailing slash is removed and a .html appended to the end, it IS a file
      RewriteCond %{DOCUMENT_ROOT}/%1.html -f
      # redirect without trailing slash
      RewriteRule ^ /%1 [L,R=301]

  Again, the first `RewriteCond` should be incorporated into the `RewriteRule` instead. There's no need to use `%{REQUEST_URI}`, since `RewriteRule` naturally works on paths.

  Avoid hard-coding the assumption that the resources are relative to the document root. They may have been remapped to another portion of the filesystem.
- Rule 3

      # Add missing trailing slashes to directories if a matching .html does not exist.
      # If it's a request to a directory.
      RewriteCond %{REQUEST_FILENAME}/ -d
      # And a HTML file does not (!) exist.
      RewriteCond %{REQUEST_FILENAME}/index.html !-f
      # And there is not trailing slash redirect to add it.
      RewriteRule [^/]$ %{REQUEST_URI}/ [R=301,L]

  Avoid referencing `%{REQUEST_URI}` in the `RewriteRule`. Instead, use regular expression capturing:

      RewriteRule (.*[^/])$ $1/ [R=301,L]

- Rule 4

      RewriteCond %{REQUEST_FILENAME} -d
      # And a HTML file exists.
      RewriteCond %{REQUEST_FILENAME}/index.html -f
      # And there is a trailing slash redirect to remove it.
      RewriteRule ^(.*?)/$ /$1 [R=301,L]

  The directory test is superfluous: if `index.html` exists inside the requested path, that path is necessarily a directory. Also, since Rule 2 strips trailing slashes too, I would swap this with Rule 3 so that the rules that strip trailing slashes are placed together.
- Rule 5

      RewriteCond %{REQUEST_FILENAME} -d
      # And a HTML file exists.
      RewriteCond %{REQUEST_FILENAME}/index.html -f
      # And there is no trailing slash show the index.html.
      RewriteRule [^/]$ %{REQUEST_URI}/index.html [L]

  Again, the directory test is superfluous, and you can avoid referencing `%{REQUEST_URI}` by using regular expression capturing.

- Rule 6

      # Remove HTML extensions.
      # If it's a request from a browser, not an internal request by Apache/mod_rewrite.
      RewriteCond %{ENV:REDIRECT_STATUS} ^$
      # And the request has a HTML extension. Redirect to remove it.
      RewriteRule ^(.+)\.html$ /$1 [R=301,L]

  As previously mentioned, this can be incorporated into Rule 1.
- Rule 7

      # If the request exists with a .html extension.
      RewriteCond %{SCRIPT_FILENAME}.html -f
      # And there is no trailing slash, rewrite to add the .html extension.
      RewriteRule [^/]$ %{REQUEST_URI}.html [QSA,L]

  Use regular expression capturing. Also, use of `%{SCRIPT_FILENAME}` is weird and inconsistent with the rest of the rules.

I found that the ruleset was prone to infinite redirects. When the non-redirecting rules rewrite the URL, they trigger an internal redirect in Apache, causing the entire ruleset to be evaluated again. You probably encountered this problem and put in the check for `%{ENV:REDIRECT_STATUS}` as a workaround. I would generalize that check by making it the very first rule.
```
Options +FollowSymLinks -MultiViews

# Disable mod_dir adding missing trailing slashes to directory requests.
DirectorySlash Off

RewriteEngine On

######################################################################
# Canonicalizing redirects
######################################################################

# Skip all rewrites on internal redirects (see below).
RewriteCond %{ENV:REDIRECT_STATUS} .
RewriteRule . - [L]

# Strip .html or /index.html
RewriteRule ^(.*?)(/index)?(\.html)$ $1 [NC,R=301,L]

# Strip trailing slash if...
# It isn't a directory, and if the trailing slash is removed and a .html
# appended to the end, it IS a file.
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} (.*?)/?$
RewriteCond %1.html -f
RewriteRule (.*)/$ $1 [R=301,L]

# Strip trailing slash if...
# It is a directory that contains an index.html file.
Rewr
```
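The capture-based style suggested for Rules 5 and 7 could be sketched roughly as follows. This is an illustrative assumption rather than a tested ruleset: it presumes the `.htaccess` file sits in the document root, that `DirectorySlash Off` is in effect, and that the `%{ENV:REDIRECT_STATUS}` guard has already run as the first rule.

```apache
# Sketch only (assumed, untested): internal rewrites using captures
# instead of %{REQUEST_URI}.

# Rule 5 equivalent: serve /foo from /foo/index.html if that file exists.
# No separate -d test: the -f test can only succeed inside a directory.
RewriteCond %{REQUEST_FILENAME}/index.html -f
RewriteRule ^(.*[^/])$ $1/index.html [L]

# Rule 7 equivalent: serve /foo/bar from /foo/bar.html if that file exists.
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(.*[^/])$ $1.html [L]
```

Because these rules use `[L]` without `[R]`, the rewrite stays internal and the browser keeps showing the extensionless URL. On Apache 2.4 or later, the `END` flag may be an alternative to the `REDIRECT_STATUS` guard, since it also prevents further rewrite processing after the internal redirect.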
Context
StackExchange Code Review Q#18440, answer score: 2