htaccess: password-protect a directory except for requests coming from validator.w3.org

If you have an htaccess password-protected directory, validator.w3.org will be unable to test pages inside it, giving this error:

IO Error: HTTP resource not retrievable. The HTTP status from the remote server was: 401.

Is there a way to bypass the htaccess password just for pings from the validator?

I tried this, but it didn't work: https://gist.github.com/susanBuck/ad27c228ccf137d9770f
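
For reference, a typical .htaccess password-protection block of the kind in question looks something like this (a minimal sketch; the AuthName and the .htpasswd path are placeholders):

    # Prompt every visitor for a username and password
    AuthType Basic
    AuthName "Restricted Area"
    # Placeholder path to a password file created with the htpasswd tool
    AuthUserFile /path/to/.htpasswd
    Require valid-user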

awarded to slang800


1 Solution


No, there isn't, because the validator doesn't "ping" your site; it downloads it and parses the markup. If you want to validate a resource that's password-protected, then you should run a validator manually (you can even run that particular validator locally: https://dvcs.w3.org/hg/markup-validator/file/tip/).

And if you did make an exception for the validator, then there would be no point in having a password at all, because anyone can spoof the user agent string (or whatever header you choose to filter with) to match the validator.
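
To illustrate, such an exception would look something like the following added to a standard Basic Auth block (a sketch using Apache 2.2-style directives; the User-Agent string matched here is an assumption):

    # Flag any request whose User-Agent claims to be the validator
    SetEnvIf User-Agent "W3C_Validator" from_validator
    Order Deny,Allow
    Deny from all
    # Flagged requests skip the password entirely
    Allow from env=from_validator
    # Access is granted if EITHER the host rules OR the login succeed
    Satisfy Any

A single curl flag (--user-agent) is enough to defeat it.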

PS: You might be able to do something clever like passing the username & password to the validator in the URL (like: http://username:password@example.com)... but in terms of security this is almost as bad as adding an exception for the validator.
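
Concretely, with the validator's check endpoint that would mean percent-encoding the credentialed URL into the uri parameter, along these lines (untested; whether the validator actually forwards the embedded credentials is an open question):

    http://validator.w3.org/check?uri=http%3A%2F%2Fusername%3Apassword%40example.com%2F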

That makes sense. So the validator is not even using HTTP to view the site. Regarding the latter issue, it's just for some low-level protection; people taking the effort to bypass it by spoofing User-Agent strings are not a concern.
Difranco almost 2 years ago
What kind of low-level protection? If you're just looking to make your site a little harder to find, then use your robots.txt to prevent Google from indexing it (see the snippet below), and maybe use a random segment in your URL / subdomain so that the link is not guessable. And the validator does use HTTP to access your site: it downloads your site in exactly the same way a browser or even cURL does. It does not just ping it, because it needs to read the markup returned by the HTTP server... It's not just checking that your machine is still up.
slang800 almost 2 years ago
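
(For reference, the robots.txt mentioned above is just two lines at the site root; note that it only asks well-behaved crawlers to stay away and doesn't enforce anything:)

    User-agent: *
    Disallow: /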
It's for sites built in a college course. We just want to keep the general public out. Robots.txt, random segments, etc. would work in most cases, but in this instance, since it's a lot of beginner students, we want to keep things as straightforward as possible. Regarding what you're saying about HTTP, though, I'm still a little curious why the original htaccess I put up there wouldn't conceptually work. I believe the idea is that it'll ask for an htaccess password for any IP except the two listed: Allow from 128.30.52.73 and Allow from 128.30.52.96 (see the sketch after this comment). Those are the two IPs I detected as incoming traffic from the validator, but perhaps I have that wrong.
Difranco almost 2 years ago
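
(For reference, the approach being described would conceptually look like this sketch, added to a standard Basic Auth block; the Apache 2.2-style directives and the Satisfy Any pairing are assumptions, since the gist itself isn't reproduced here:)

    Order Deny,Allow
    Deny from all
    # The two addresses observed as incoming validator traffic
    Allow from 128.30.52.73
    Allow from 128.30.52.96
    # Either an allowed IP or a valid login gets through
    Satisfy Any

(One common reason such configs fail is that Apache defaults to Satisfy All, which requires the IP rules and the password to both pass, so the allowed IPs would still be prompted for credentials.)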
Yeah, that htaccess should work, except they probably run that validator on several different machines (with different addresses) to balance the load (I assume it gets quite a bit of traffic). And if these are just sites for learning, then you could keep all the public out by just not putting the sites on the web. It's pretty easy to run a local HTTP server... Python even comes bundled with one, and you can run it via python -m http.server (to serve the current directory). Or, if you want server-side languages to be executed, I hear https://www.mamp.info/ is pretty good.
slang800 almost 2 years ago
And, of course, passing the username/password in the URL that you give to the validator (like http://username:password@example.com) is still an option.
slang800 almost 2 years ago
I think you're right that we'll never be able to pin down the exact IPs, since they change. Local servers aren't an option given the time constraints of the course. Thanks for bouncing these thoughts back and forth. I'll categorize this as not quite resolved, but certainly closed, and award you the bounty.
Difranco almost 2 years ago