Angelina Jordan Wiki:Robots

This page is about web robots, web crawlers, web spiders, etc. It isn't about bots that are granted specific permissions to make changes on the wiki. There are no such bots working on the wiki at this time (mid 2026).

Web bots have been causing havoc on the wiki since shortly after it opened to the public in January 2026. Problems have included exceeding our disk quota, our monthly bandwidth limit, and our CPU usage limits.

The wiki's admin has taken the countermeasures described below.

(Note: The term "anons" will be used to refer to readers of the wiki who have not registered and logged in. Bots see the wiki like anons do. Unfortunately this means that most countermeasures that place limits on bots also limit human anons in the same way.)

robots.txt

This only places restrictions on well-behaved bots that actually obey robots.txt directives. It affects neither human anons (which is good) nor badly behaved bots that ignore the directives (which is bad).

For what it's worth, our robots.txt file attempts to prohibit web bots from accessing wiki functionality through "ugly URLs" of the form …/w/index.php?… (which ideally would prevent most resource-intensive requests) while allowing them to access normal page content through "short URLs" of the form …/wiki/… as well as skin information (…/w/skins/…), sitewide CSS (…/w/load.php?…), and images, so they can save properly rendered pages.

It also tries to limit the rate at which bots make requests, but some bots just don't understand the relevant directive, since it's not actually part of the standard.

In practice, this doesn't accomplish very much, since the vast majority of the problematic bot traffic comes from badly behaved bots that disregard robots.txt.
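As a rough illustration of the path rules described above, the following Python sketch feeds a hypothetical robots.txt to the standard library's parser and asks which URLs a compliant bot would be allowed to fetch. The file contents, the host wiki.example, the bot name ExampleBot, and the delay value are all assumptions made for the example; the wiki's real robots.txt may differ. Crawl-delay is assumed here to be the nonstandard rate directive mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt modeled on the description above.
# The real file may use different rules and a different delay.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/index.php
Allow: /wiki/
Allow: /w/skins/
Allow: /w/load.php
Crawl-delay: 10
"""

BASE = "https://wiki.example"  # placeholder host, not the real wiki


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)


def allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if a compliant bot may fetch the given path."""
    return parser.can_fetch(agent, BASE + path)


if __name__ == "__main__":
    for path in (
        "/wiki/Main_Page",
        "/w/index.php?title=Main_Page&action=history",
        "/w/load.php?modules=site.styles",
    ):
        print(path, "->", "allowed" if allowed(path) else "disallowed")
    print("crawl delay:", parser.crawl_delay("ExampleBot"))
```

A check like this only says what a well-behaved bot should do. It has no bearing on bots that ignore robots.txt.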

For the record, the following bots seem to be relatively well behaved:

These bots seem to respect the path restrictions in robots.txt but not its rate limit:

  • (any?)

.htaccess

For bots that don't respect robots.txt, there are (enforced) restrictions placed in our .htaccess file(s).

Blocking IPs

The following bots, which seem to disregard our robots.txt entirely, have been blocked by IP address:

  • Anthropic's ClaudeBot
    • range 216.73.216.0 – 216.73.219.255
      • currently blocking only 216.73.216.0 – 216.73.216.255
  • Meta's webcrawlers
    • currently blocking 2a03:2880:f814:0000:0000:0000:0000:0000 – 2a03:2880:f814:00ff:ffff:ffff:ffff:ffff
  • Alibaba Cloud (don't identify themselves as bots)
    • range 47.74.0.0 – 47.87.255.255
      • currently blocking only 47.79.192.0 – 47.79.219.255
    • range 8.208.0.0 – 8.223.255.255
      • currently blocking only 8.219.214.204
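Server configurations commonly express address blocks in CIDR notation rather than as first/last pairs, so it can help to convert the ranges listed above. The sketch below does this with Python's ipaddress module and adds a small membership check. Only the currently blocked ranges from the list are used; the helper names to_cidr and is_blocked are made up for the example, and nothing here reflects how the wiki's .htaccess is actually written.

```python
import ipaddress

# Currently blocked ranges, copied from the list above as (first, last) pairs.
BLOCKED_RANGES = [
    ("216.73.216.0", "216.73.216.255"),
    ("2a03:2880:f814::", "2a03:2880:f814:ff:ffff:ffff:ffff:ffff"),
    ("47.79.192.0", "47.79.219.255"),
    ("8.219.214.204", "8.219.214.204"),
]


def to_cidr(first: str, last: str) -> list[str]:
    """Express an inclusive address range as the smallest list of CIDR blocks."""
    networks = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(net) for net in networks]


BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for first, last in BLOCKED_RANGES
    for cidr in to_cidr(first, last)
]


def is_blocked(addr: str) -> bool:
    """Return True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(addr)
    return any(
        ip.version == net.version and ip in net for net in BLOCKED_NETWORKS
    )


if __name__ == "__main__":
    for first, last in BLOCKED_RANGES:
        print(f"{first} - {last}: {', '.join(to_cidr(first, last))}")
```

Note that a range that does not start and end on a power-of-two boundary, such as 47.79.192.0 – 47.79.219.255, needs more than one CIDR block (three in this case).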

Denying specific requests

There are also some rather hacky limitations set on URLs of specific types. (See below. Once these are working as intended, they will be moved here.)

These .htaccess restrictions should have the benefit of being triggered before MediaWiki gets involved in page rendering (thereby cutting down on CPU usage), but have the disadvantage of resulting in rather jarring error pages for human anons.

The exact directives being used are based on recommendations at Manual:Handling web crawlers#Apache. Unfortunately, the code provided there does not work as-is on this wiki, for some reason. It has been cut down quite a bit to only the stuff that seems to work.

LocalSettings.php

Restrictions implemented by way of a "hook" in LocalSettings.php are also rather hacky, but perhaps less jarring to human anons, since the errors are displayed within the normal wiki interface. These are modeled on the code at Manual:Handling web crawlers#Custom PHP. Again, the code has been modified from the original, but this time mainly to implement additional restrictions that should have been in .htaccess but weren't working there.

(Once things are working somewhat well, the actual restrictions being enforced by LocalSettings.php will be listed here.)

Core MediaWiki settings

The following restrictions are set in LocalSettings.php in the usual way. They apply to both anons and bots.

  • Anons cannot edit the wiki (but they can view the raw wiki markup).
  • Anons cannot upload or re-upload files.
  • Anons cannot create accounts (and therefore cannot log in unless an account has been specifically created for them).

Extension settings

Cargo

Per the instructions at Extension:Cargo/Download and installation#Permissions, the following Cargo-related special pages have been limited to logged-in users only, by using the normal permissions features of MediaWiki:

(Note: These are allowed or denied together. There is no method provided by the extension to allow one and deny the other. That would have to be done in either .htaccess or LocalSettings.php.)

URL restrictions

Here is a set of special pages / URLs that potentially or definitely should be denied to anons and bots, but not to logged-in users (i.e., logged-in users should be able to do "anything"). They are listed here so they can be easily tested as the various configurations discussed above are worked on. Except for things that are terribly abused by bots, this shouldn't include things that can already be restricted using the normal user rights approach.

(Most of these items aren't linked yet, since I need to check which things have been most abused in the past and whether those things are currently being denied effectively before I link them here. After all, bots tend to crawl newly created content the most.)

  • Special:Diff (links to come) — definitely deny arbitrary diffs (both "diff=" and "oldid=" set to numbers); maybe deny "diff=prev" with "oldid="; for "oldid=" with no "diff=", see PermanentLink
  • Special:Drilldown (links to come) — massively abused by bots
  • Special:History (links to come) — maybe not necessary if we deny the most problematic kinds of diffs
  • Special:Log (links to come) — maybe not necessary?
  • Special:PageInfo (links to come) — ?
  • Special:PageValues (links to come) — ?
  • Special:PermanentLink (i.e., "oldid=" without "diff=") (links to come) — maybe not necessary
  • Special:RecentChanges (links to come) — maybe not necessary once enough other things are being denied
  • Special:RecentChangesLinked (/wiki/, /w/, API?) — a huge target for bots
  • Special:WhatLinksHere (links to come) — a big target for bots, and a bit resource-heavy

(Additional notes to come, as well.)

And these URLs should not be restricted for anyone:

  • Special:UserLogin (more specific links to come)
  • action=edit (links to come) — anons can see the raw wikitext (markup, source) of normal pages with this; other uses may be problematic
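Since these lists exist mainly for testing, a small classifier can serve as a reference for what the server is expected to do with a given URL. The Python sketch below encodes one possible reading of the notes above: arbitrary diffs and the special pages described as heavily targeted by bots are denied to anons, and everything else (including Special:UserLogin and action=edit) is left alone. Which entries end up in the deny set is an assumption, as are the function names; items marked "maybe" or "?" above are deliberately not denied here, and localized namespace aliases are not handled.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Special pages the notes above describe as heavily abused by bots.
# This selection is an assumption for illustration, not the wiki's config.
DENY_SPECIAL = {"drilldown", "recentchangeslinked", "whatlinkshere"}


def special_page(url: str) -> str:
    """Return the lowercased special page name, or "" if the URL is not one."""
    parts = urlsplit(url)
    if parts.path.startswith("/wiki/"):
        # Short URL: the title is the rest of the path.
        title = unquote(parts.path[len("/wiki/"):])
    else:
        # Ugly URL: the title is carried in the query string.
        title = parse_qs(parts.query).get("title", [""])[0]
    namespace, _, rest = title.partition(":")
    if namespace.lower() != "special":
        return ""
    # Drop any subpage, e.g. Special:WhatLinksHere/Some_Page.
    return rest.split("/", 1)[0].lower()


def is_arbitrary_diff(url: str) -> bool:
    """True when both diff= and oldid= are set to numbers."""
    query = parse_qs(urlsplit(url).query)
    diff = query.get("diff", [""])[0]
    oldid = query.get("oldid", [""])[0]
    return diff.isdigit() and oldid.isdigit()


def deny_for_anons(url: str) -> bool:
    """Decide whether an anonymous request for this URL should be denied."""
    return is_arbitrary_diff(url) or special_page(url) in DENY_SPECIAL


if __name__ == "__main__":
    for url in (
        "/w/index.php?title=Foo&diff=123&oldid=100",
        "/wiki/Special:WhatLinksHere/Foo",
        "/wiki/Special:UserLogin",
        "/w/index.php?title=Foo&action=edit",
    ):
        print(url, "->", "deny" if deny_for_anons(url) else "allow")
```

The expected decisions from such a function could then be compared against the status codes the live server actually returns for the same URLs, once for an anon and once for a logged-in user.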