Score:1

Apache htaccess ruleset: Try request as: 1) file as-is 2) file + .html suffix 3) DirectoryIndex 4) Else to index.php CMS router


I'd like to express this ruleset

If the request is /hello then try the following in the given order:

  1. /hello — File of that name exists (file without file extension).
    • Edit: Not a necessity. Only a possibility/compromise. Drop if too complicated.
  2. /hello.html - File of that name plus .html extension exists.
  3. /hello/index.(htm|html|php) — Folder of that name with index file exists. Note: /hello/ directory listing shall explicitly be forbidden
  4. /index.php — If nothing of the above matched hand over to CMS index.php (e.g. Wordpress)

My .htaccess in the domain root folder of my shared hosting account:

## Request without file extension
# e.g. "/hello"

### First look for DirectoryIndex files (with mod_dir)
# e.g. "/hello" shall serve "/hello/index.(html|htm|php)" if present
# Explicitly forbidding directory listings (for security/privacy)
<IfModule mod_dir.c>
Options -Indexes
DirectoryIndex index.html index.htm index.php
</IfModule>

### If no DirectoryIndex found then try with .html suffix (with mod_rewrite)
# e.g. "/hello" shall serve "/hello.html" if present
<IfModule mod_rewrite.c>
RewriteRule ^([^\.]+)$ $1.html [NC,L]
</IfModule>

## Everything else goes to Wordpress index.php and its standard htaccess configuration like this:
# BEGIN WordPress
# The directives (lines) between "BEGIN WordPress" and "END WordPress" are
# dynamically generated, and should only be modified via WordPress filters.
# Any changes to the directives between these markers will be overwritten.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteRule .* - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization}]
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress

Problems

  1. mod_dir's DirectoryIndex alone works:
  • ✅ Requesting /a01 serves /a01 being a file without suffix.
  • ✅ Requesting /a02 serves /a02/index.html which is the DirectoryIndex.
  2. mod_rewrite's RewriteRule, which tries the request with an added .html suffix, works on its own:
  • ✅ Requesting /a03 serves /a03.html.
  • ❌ But now requesting /a02 returns Apache error page 403 Access forbidden.
  • ❌ And now requesting /a01 returns Wordpress error page 404 Not found.
    • Strange, because that file exists, so in the htaccess Wordpress section RewriteCond %{REQUEST_FILENAME} !-f should not even be met. How can the request land in Wordpress routing at all?
  3. So the mod_dir and mod_rewrite rules being active together seemingly result in a conflict.
  • Is a central ruleset of my webhost interfering with this?
  • Or is this a general interplay issue of the two modules? How do I get them to work together as intended?
Regarding #3... if an index file does not exist, do you want processing to continue to #4 (ie. passed to the CMS) or fail? Regarding #1 are you saying you have physical files without a file extension?
porg avatar
My order is from least to most likely, so that if a less likely form is present it gets served and any others in the chain are ignored. If there is no matching #3 DirectoryIndex then continue to #4 CMS which constitutes 95% of all content. Failure shall only ever occur in the CMS. Ad #1: I like to preserve the possibility for files without a suffix, although I prefer them as #2 with a suffix for easier handling in the OS and for text editor syntax highlighting. But I don't want any file suffixes in links, which is realized by the RewriteRule.
porg avatar
@MrWhite now that I told you the last specifications unclear to you, do you have a proposed solution?
porg avatar
@MrWhite any answer is appreciated, also a "not possible" or "I dunno".
ezra-s avatar
Have you tried loading "mod_negotiation" and then setting "Options +MultiViews" in the appropriate directory? Quoting the docs here: "A MultiViews search is enabled by the MultiViews Options. If the server receives a request for /some/dir/foo and /some/dir/foo does not exist, then the server reads the directory looking for all files named foo.*, and effectively fakes up a type map which names all those files, assigning them the same media types and content-encodings it would have if the client had asked for one of them by name. It then chooses the best match to the client's requirements..."
ezra-s avatar
https://httpd.apache.org/docs/2.4/mod/mod_negotiation.html
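
The MultiViews suggestion above could be sketched roughly as follows. This is a minimal, untested sketch: it assumes mod_negotiation is loaded and that the host allows `Options` overrides in .htaccess. Note that MultiViews matches any `hello.*` file (not only `hello.html`) and may interact with the mod_rewrite rules in the same file.

```apache
# Sketch only: content negotiation as an alternative to the ".html" rewrite.
# Assumes mod_negotiation is available and "Options" may be overridden here.
<IfModule mod_negotiation.c>
    # Keep directory listings forbidden; enable the MultiViews search.
    # A request for /hello may then be served from hello.html (or another hello.* file).
    Options -Indexes +MultiViews
</IfModule>
```
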
Score:1

If there is no matching #3 DirectoryIndex then continue to #4 CMS

You can't fail "gracefully" with mod_dir's DirectoryIndex to then do something else with the request using mod_rewrite (ie. route the request to #4 the CMS). mod_dir is processed too late. So, instead of using DirectoryIndex we would need to simulate this with mod_rewrite.

However, another (minor) issue here is that the WordPress code block (that, as the comment states, should not be edited manually) needs to be edited to allow requests for filesystem directories to be passed to the CMS.

I'm assuming that any direct requests for a directory should include a trailing slash. For example, if /hello is a physical directory then you should be requesting /hello/ (with a trailing slash). We will append the trailing slash if omitted (which is what mod_dir will do by default anyway, but we need to do this manually if overriding DirectoryIndex). We could disable the trailing slash (and make the canonical URL the one without a trailing slash) but this requires additional rewriting.

So, to satisfy your requirements, you could do it like this in the root .htaccess file:

Options -Indexes

# Required for the root directory (eg. the homepage of the CMS)
DirectoryIndex index.html index.htm index.php

RewriteEngine On

# Initially part of the WordPress/CMS block
# (This is just an optimisation)
RewriteRule ^index\.php$ - [L]

# Abort early if a file is requested directly
# (Regardless of whether that file includes a file extension.)
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule . - [L]

# If a directory is requested, which is missing the trailing slash then append it
RewriteCond %{DOCUMENT_ROOT}/$1 -d
RewriteRule ^(.*[^/])$ /$1/ [R=301,L]

# Test if "<url>.html" exists and rewrite if so
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^([^.]*[^/])$ $1.html [L]

# Optimisation: If a directory is not requested then skip the next 3 rules
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . - [S=3]

# Check for "DirectoryIndex" documents in order: index.html, index.htm and index.php
# NB: Directories end in a trailing slash (enforced above)
RewriteCond %{DOCUMENT_ROOT}/$1/index.html -f
RewriteRule ^(.+)/$ $1/index.html [L]
RewriteCond %{DOCUMENT_ROOT}/$1/index.htm -f
RewriteRule ^(.+)/$ $1/index.htm [L]
RewriteCond %{DOCUMENT_ROOT}/$1/index.php -f
RewriteRule ^(.+)/$ $1/index.php [L]

# CMS Fallback...
# But note that the two conditions (filesystem checks) are removed.
# The first one that checks for a "file" is simply not required.
# However, the second check MUST be removed otherwise directories that do not contain a "DirectoryIndex" are not routed to the CMS.

# WordPress...

RewriteRule ^ - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization}]

# CMS / Front-Controller
RewriteRule . /index.php [L]

Additional notes:

  • As an optimisation, I'm assuming your URLs that map to .html files do not contain dots. This is what you had done, so I assume that's OK. (There is no need to backslash-escape a literal dot when used inside a regex character class.)

  • I've removed the WordPress comment markers and reduced the WordPress code block to all that's required. One of the RewriteRule directives is moved to the top of the .htaccess file (since this is an optimisation, it doesn't make much sense to have it at the end anymore). You would need to configure WordPress (or your file perms) to prevent WordPress from trying to maintain the .htaccess file (although this could cause issues with plugins).

  • Passing filesystem directories to the CMS is certainly non-standard. And boilerplate code (front-controller pattern) for most CMSs will explicitly exclude physical directories. However, the added complication here is that you only want directories where the DirectoryIndex document is not present in that directory, to be passed to the CMS.

I like to preserve the possibility for files without a suffix.

The "problem" with not having file extensions on the underlying file is that Apache does not necessarily know how to handle the request and what "Content-Type" header to send (so the browser does not know how to handle the response).

A workaround in this case is to have all extensionless "files" of a specific type in a known subdirectory and force all those requests with the same Content-Type.
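
A minimal sketch of that workaround, assuming the extensionless HTML files live in a dedicated subdirectory (the name `/pages-noext/` is hypothetical) and that the host allows `FileInfo` overrides in .htaccess:

```apache
# Hypothetical file: /pages-noext/.htaccess  (the directory name is an assumption)
# Force every file served from this directory to be sent as HTML,
# regardless of its (missing) file extension.
ForceType text/html
```

Since `ForceType` applies to every file in that directory, anything that is not HTML (images, stylesheets) would need to be kept elsewhere.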

Note that files and URLs are very different in this respect. URLs without extensions are not a problem.


Aside:

RewriteRule ^([^\.]+)$ $1.html [NC,L]

The problem with this rule is that you are unconditionally applying the .html extension to any URL that does not contain a dot. /a01 is rewritten to /a01.html, which is not a file, so the WordPress condition RewriteCond %{REQUEST_FILENAME} !-f is successful and the request is routed to index.php. /a01 (the URL that WP sees) is not a registered WP URL, so the result is a 404 generated by the CMS/WordPress.

porg avatar
First of all: A big thank you! So much expertise in one answer! Fantastic! Once I have finished my testing and our discussion I will certainly mark this as the answer! You deserve the reputation points big time! Now my feedback/questions in the follow-up comment(s).
porg avatar
To simplify my ruleset: `/hello` which is an HTML file without suffix is not a necessity for me. I just thought of it as a possibility or an acceptable compromise, if the rest of the ruleset struggles. I thought that the webserver simply uses Unix `file` (where the magic library inspects the file header) to detect the filetype (e.g. from the opening HTML tag in the file). But I guess nothing beats "determine MIME type from file extension" in terms of performance. So please drop suffix-less HTML files entirely from the .htaccess code snippet if that improves performance and stability.
porg avatar
`RewriteRule ^index\.php$ - [L]` early on in the file. This is for the unusual case that somebody/something directly requests `index.php` that this is applied and the mod_rewrite block is ended ( `[L]` as in "last" rule to apply). On a Wordpress with beautiful URLs activated a link pointing to `index.php` literally happens in 0.001% of humanly click-able links I guess. But on a JS or HTML-asset-level index.php plus some arguments after it are probably requested frequently? I just ask to be sure whether performance-wise this is good early on in the file.
porg avatar
`RewriteRule ^([^.]*[^/])$ $1.html [L]` with the RegEx that I interpret as "Has zero or more non-dot characters followed by a character other than slash" and `Assumption: URLs that map to .html files do not contain dots`. For now this is true. Should I want to make use of language codes in HTML files like `hello.en.html` and `hello.fr.html` I would need to revise that RegEx to what?
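
One possible revision for URLs such as `/hello.en`, offered as an untested sketch rather than the answerer's own reply: since the no-dots restriction is described in the answer as an optimisation, the pattern could be relaxed to accept dots, leaving the preceding filesystem check to decide whether a rewrite happens. The cost is one extra file check for every non-file URL that contains a dot.

```apache
# Sketch: let the ".html" lookup accept dots in the URL,
# e.g. "/hello.en" -> "/hello.en.html" and "/hello.fr" -> "/hello.fr.html".
# The pattern still excludes URLs ending in a slash (directories).
# The condition guards the rewrite: nothing happens unless "<url>.html" exists.
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^(.*[^/])$ $1.html [L]
```

This would replace the existing `.html` rule at the same position, after the "Abort early if a file is requested directly" rule, so direct requests for existing files are still handled first.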
porg avatar
Again on "suffix-less HTML files": I guess I would only use them very very rarely. For this the longer "content-type" detection by the server would be justified. But `RewriteCond %{REQUEST_FILENAME} -f` needs to be evaluated for each request, except those where the earlier `RewriteRule ^index\.php$ - [L]` was met. But on the other hand checking the existence of a literal filename with no wildcards in it is very minimal. Nevertheless a few nanoseconds. So I will remove that for now, and only if that becomes a real scenario put it back at the position it has in your template.
porg avatar
`Passing filesystem directories to the CMS is certainly non-standard […]`. I want to check that neither of us has a misunderstanding here. Filesystem contents: `/aaa/index.php` and `/bbb/` as an empty dir. Intended behavior: Requesting `/aaa`: Files `/aaa` or `/aaa.html` do both not exist. Hence request transformed to `/aaa/`. Folder `/aaa/` contains index file hence DirectoryIndex is served. Requesting `/bbb`. Exists as dir. Request gets transformed to `/bbb/`. Now same situation as if requested as `/bbb/` right away. Continue. No index in it, hence fwd to central CMS router at `/index.php`.
porg avatar
Requesting `/ccc`. No files `/ccc` or `/ccc.html`. No dir `/ccc/`. Hence fwd to CMS router at `/index.php`.
porg avatar
Note: Wordpress ends its canonical URLs with a trailing slash. If you request a beautiful URL without a trailing slash it HTTP 301 redirects you to the one with the trailing slash.
porg avatar
So I think the removal of the file and directory checks from the [standard Wordpress .htaccess](https://wordpress.org/documentation/article/htaccess/#basic-wp) are totally ok, because we conduct these ourselves early on in the .htaccess file, with some further tweaks added. So I think this is ok. Merely from inspection, have not performed practical tests yet. I will await your responses, and then continue to practical testing.
porg avatar
And thanks for the info that `mod_dir` is processed after `mod_rewrite`. The official doc of [mod_dir](https://httpd.apache.org/docs/2.4/mod/mod_dir.html) doesn't mention when it kicks in. My Webhosting support agent couldn't explain why it fails either. Great that it is possible to **specify how DirectoryIndex shall create its output** with the directives `DirectoryIndex index.html index.htm index.php` and `Options -Indexes` but that the **index file detection** for which it is too late in the chain can be delegated to `mod_rewrite`. Elegant sharing of work! Glad that this solution exists.
porg avatar
Now had the time to test it out intensively. Accepted it as the answer. Works well mostly!
porg avatar
**The Simulated DirectoryIndex:** ✅ Works great and even offers a **stealth mode**! 1) If a valid `index` file as stated in `DirectoryIndex` is present, it gets served. 2) If not, instead of returning `HTTP 403 Forbidden` as happens in a pure `mod_dir` setup, the request ends in the last `RewriteRule . /index.php [L]` (the CMS router), which returns `HTTP 404 Not Found`. The stealth is perfect when requesting `/secrets/` with no index files inside, but noticeable to pros, because requesting `/secrets` gets visibly redirected to `/secrets/` and is then served as a 404.
porg avatar
**HTML file without a suffix:** The web server serves them as-is **without a content-type (MIME) header**. No magic-filetype parsing and setting of a MIME type by the server. But my browser (Mac Safari) nevertheless recognized HTML on the parsing level and shows it just fine. I will almost never use those, but good to have the ruleset if needed. This costs no extra processing in the `.htaccess` as the early ruleset `RewriteCond %{REQUEST_FILENAME} -f` → `RewriteRule . - [L]` handles them like any directly requested file.
porg avatar
**Fine tuning extra: .HTM(L) suffix-less URL as the canonical URL**: Normalize `$1.(htm|html)` requests to just the beautiful `$1`. With the `.htaccess` snippet as of 2023-02-20 requesting `/hello` serves the underlying `/hello.html`. Requesting `/hello.html` serves it as is. This means duplicate content. For `/hello.htm(l)` I'd like an enforced HTTP 301 redirect to just `/hello`. The syntax for this is seemingly easy. But the correct interplay with the other rulesets ain't, especially with `Abort early if a file is requested directly` which should still work in all other cases.
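
One common way to approach this canonical redirect, given here as an untested sketch and not as part of the accepted answer: redirect only on the initial client request by checking the `REDIRECT_STATUS` environment variable, which is empty at that point and set once mod_rewrite has performed an internal rewrite. The sketch covers `.html` only, because the ruleset above looks up `<url>.html` and has no matching lookup for `.htm`.

```apache
# Sketch: place this BEFORE the "Abort early if a file is requested directly" rule.
# REDIRECT_STATUS is empty on the initial request and set after an internal
# rewrite, so the internally rewritten "/hello.html" is not redirected again.
RewriteCond %{ENV:REDIRECT_STATUS} ^$
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^(.+)\.html$ /$1 [R=301,L]
```

With this in place a direct request for `/hello.html` would be redirected to `/hello`, which the existing `.html` rule then maps back to the file internally. A direct request for `/aaa/index.html` would end up at `/aaa/index` rather than `/aaa/`, so that case may need its own rule. Testing with a temporary `R=302` first avoids browsers caching a wrong redirect.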