Working around broken urls for my website / 2023-01-11

If you're bored enough to look at the source of my webpages you'll notice I make a lot of use of
<base href="https://idefix.net/~koos/">
This changes the base for all relative urls from https://idefix.net/ to https://idefix.net/~koos/. I need this because my whole site lives in my userdir, but https://idefix.net/ is the easy url.
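As a rough illustration of what the base href does, here is a minimal Python sketch using the standard urllib.parse.urljoin. The resolve() helper is a name made up for this example, and the homeserver.html link is borrowed from the bot log line further down; a client that ignores the base ends up at the second url.

```python
from urllib.parse import urljoin

PAGE_URL = "https://idefix.net/"          # the easy url the page is served from
BASE_HREF = "https://idefix.net/~koos/"   # value of the base href in the page
LINK = "homeserver.html"                  # a relative link as used on the site

def resolve(link, page_url, base_href=None):
    """Hypothetical helper: resolve a relative link against the base href
    when the client honours it, otherwise against the page url."""
    return urljoin(base_href or page_url, link)

# Client that honours the base href
print(resolve(LINK, PAGE_URL, BASE_HREF))  # https://idefix.net/~koos/homeserver.html
# Client that ignores it
print(resolve(LINK, PAGE_URL))             # https://idefix.net/homeserver.html
```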

I use a lot of relative urls for local things because there is no reason to make them longer. It also makes developing and debugging on the development site easier.

All browsers support the 'base href' tag, but some bots ignore it. On top of that, a few years ago a bug in one script made all urls appear to be 'below' other urls. The net result is that my logs are currently filled with entries like:
[11/Jan/2023:17:09:34 +0100] "GET /~koos/irregular.php/morenews.cgi/2022/newstag.cgi/morenews.cgi/draadloosnetwerk/morenews.cgi/newsitem.cgi/morenews.cgi/morenews.cgi/newstag.cgi/asterisk/morenews.cgi/morenews.cgi/morenews.cgi/morenews.cgi/morenews.cgi/morenews.cgi/morenews.cgi/morenews.cgi/newstag.cgi/newstag.cgi/kismet/morenews.cgi/newstag.cgi/newsitem.cgi/morenews.cgi/morenews.cgi/2023 HTTP/1.1" 410
All those entries seem to be for the http:// versions of the urls, so I have now adjusted the http to https redirect function to stop at urls that look like ^\/~koos/irregular.php\/.+\.cgi and give a status 410 immediately.
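The decision described here can be sketched roughly as follows. This is not the actual redirect function, which the post does not show; the function name, the default host argument and the 301 status for the normal redirect are assumptions made for the example. Only the pattern is copied from the text above.

```python
import re

# Pattern copied verbatim from the post; everything else is an assumption.
GONE_PATTERN = re.compile(r"^\/~koos/irregular.php\/.+\.cgi")

def http_response_for(path, host="idefix.net"):
    """Hypothetical stand-in for the http to https redirect function.

    Returns (status, location). Urls matching the pattern get 410 Gone
    without a redirect; everything else is redirected to https
    (301 is assumed here, the real status is not stated in the post).
    """
    if GONE_PATTERN.search(path):
        return 410, None
    return 301, f"https://{host}{path}"

# A shortened version of the nested url from the log entry above
print(http_response_for("/~koos/irregular.php/morenews.cgi/2022/newstag.cgi"))
# A normal url still gets the redirect
print(http_response_for("/~koos/irregular.php"))
```

Note that the pattern needs at least one .cgi component after irregular.php/, so a plain request for irregular.php itself is still redirected.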

This 'saves' a bit of traffic because the bot never gets the redirect to the https version, so it never makes the second request either.

While checking this I see multiple stupid bots, like:
35.209.99.100 - - [11/Jan/2023:17:02:14 +0100] "GET /homeserver.html HTTP/1.1" 404 972 "-" "Buck/2.3.2; (+https://app.hypefactors.com/media-monitoring/about.html)"
This one clearly doesn't parse the base href tag.

Seeing the results

I'm clearly seeing the results in the 4xx versus 3xx status code counters in haproxy for the prod-http backend. It is also interesting that googlebot seems to be going through the 'wrong' urls faster than before. This is good: I want googlebot to forget about that collection of wrong urls as soon as possible.

My opinions are my own, what I write is protected by copyrights. Some publications contain an explicit license statement which allows sharing without asking permission.