2023-01-11 Working around broken urls for my website
If you're bored enough to look at the sources for my webpages you'll notice I make a lot of use of<base href="https://idefix.net/~koos/">This changes the base for all relative urls from https://idefix.net/ to https://idefix.net/~koos/ because my whole site is based on being in my userdir, but https://idefix.net/ is the easy url. I use a lot of relative urls for local things because why make them longer. And this eases developing and debugging on the developer site. All browsers support the 'base href' meta tag, but some bots ignore it. And there has been a case a few years ago where a bug in one script made all urls seem 'below' other urls. The net result is that my logs are currently filled with entries like:[11/Jan/2023:17:09:34 +0100] "GET /~koos/irregular.php/morenews.cgi/2022/newstag.cgi/morenews.cgi/draadloosnetwerk/morenews.cgi/newsitem.cgi/morenews.cgi/morenews.cgi/newstag.cgi/asterisk/morenews.cgi/morenews.cgi/morenews.cgi/morenews.cgi/morenews.cgi/morenews.cgi/morenews.cgi/morenews.cgi/newstag.cgi/newstag.cgi/kismet/morenews.cgi/newstag.cgi/newsitem.cgi/morenews.cgi/morenews.cgi/2023 HTTP/1.1" 410all those entries seem for http:// versions of the urls so I now adjusted the http to https redirect function to stop at urls that look like ^\/~koos/irregular.php\/.+\.cgi to give a status 410 immediately. This 'saves' a bit of traffic because it never gets the redirect to the https version. While checking this I see multiple stupid bots, like:126.96.36.199 - - [11/Jan/2023:17:02:14 +0100] "GET /homeserver.html HTTP/1.1" 404 972 "-" "Buck/2.3.2; (+https://app.hypefactors.com/media-monitoring/about.html)"This one clearly doesn't parse the base href tag.
Seeing the resultsI'm clearly seeing the results in 4xx versus 3xx status code counters in haproxy for the prod-http backend. Interesting is also that googlebot seems to be going through the 'wrong' urls faster than before. This is good, I want googlebot to forget about that collection of wrong urls as soon as possible.