I've written a preliminary version of a script to detect and revert spam on a remote wiki.
More details at [SpamClean].
I'm planning to run it periodically (every few days) here on UseModWiki. If it finds and reverts spam, it'll leave "spamclean.py spam revert" in the change summary field. It'll probably come from the host user-10cm0sv.cable.mindspring.com.
Right now, I have to manually approve each revert that it suggests. If you'd prefer to have it run reliably every day, I can set it up to revert without manual approval and run it as a cron job. This way you could ignore spam for a day and assume the script will handle it, and only deal with it if the script misses it (at which point you should submit a regexp to [BannedContent] so that we can catch that piece in the future). Let me know if you'd like me to do this.
I've been running the bot about every day; however, people are often reverting spam manually before the bot gets to it.
Please note that, right now, the bot does NOT delete newly created spam pages; it only reverts pages for which it can find a prior, spamless revision. This will eventually be changed.
So far I've been manually approving each revert and there have been no "false positives". So I'm planning to make this an automated cron script which runs many times a day. This means that if there is some legitimate content which the bot thinks is spam, the bot will blithely revert it, and someone will probably have to email me to update the spam regexes. I don't anticipate this happening too frequently. But, please let me know if anyone objects to me making this a fully automated script.
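The detection-and-revert logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual spamclean.py code; the pattern list and function names are hypothetical stand-ins for the [BannedContent] regexps:

```python
import re

# Hypothetical spam patterns, standing in for the [BannedContent] regexps.
SPAM_PATTERNS = [re.compile(p, re.IGNORECASE) for p in [
    r"cheap-pills\.example\.com",
    r"casino-links\.example\.net",
]]

def is_spam(text):
    """Return True if any banned-content regexp matches the page text."""
    return any(p.search(text) for p in SPAM_PATTERNS)

def last_clean_revision(revisions):
    """Given page revisions newest-first, return the most recent one that
    contains no spam, or None if every stored revision is spammed
    (in which case the bot, as noted above, leaves the page alone)."""
    for rev in revisions:
        if not is_spam(rev):
            return rev
    return None
```

If `last_clean_revision` returns None, the page has no spamless revision to revert to, which matches the current behaviour of skipping newly created spam pages.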
OK, in the absence of any dissenters, I've set up spamclean.py to run as an automated cron job every 2 hours. Right now this'll only work when my computer is turned on, but in a few days I'll switch it to an old server computer I have.
Bayle, I use usemod for my homepage (which regularly gets spammed with a ton of addresses with Chinese characters in them). I was thinking about an approach to stop them adding the content onto my site in the first place: the type of spam content that I get is almost always a large set of links with brief descriptions. How about checking the total number of normal http hyperlinks in the page before and after the edit, and if there's a large delta (say more than 5), simply dump the change? I think that this would certainly stop the spam on my site, but perhaps I get a different set than everyone else.
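The link-delta check described above is simple enough to sketch in a few lines. This is an illustrative version in Python (the actual patch discussed below is to UseMod, which is Perl); the threshold of 5 comes from the suggestion above:

```python
import re

LINK_RE = re.compile(r"https?://", re.IGNORECASE)
MAX_NEW_LINKS = 5  # threshold suggested above; tune to taste

def count_links(text):
    """Count http/https hyperlinks in the page text."""
    return len(LINK_RE.findall(text))

def edit_allowed(old_text, new_text, max_new_links=MAX_NEW_LINKS):
    """Reject (return False for) an edit whose link count grows by
    more than max_new_links relative to the previous revision."""
    return count_links(new_text) - count_links(old_text) <= max_new_links
```

Note that this only catches link-dump spam; an edit that adds five or fewer links, or that replaces existing text without adding links, passes through untouched.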
Just thought of something else: how about displaying one of those images that has a simple character code in it, but rendered such that it is difficult to do OCR on, that the user would have to enter in a field on the edit page form in order for it to work. This would (presumably) stop the automated spam immediately.
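The server-side half of that idea (everything except rendering the hard-to-OCR image, which would need an image library) might look something like this. All names here are hypothetical; the code only illustrates the generate-store-verify flow:

```python
import random
import string

def new_challenge(length=5):
    """Generate a random code for the user to type back. Rendering it
    as a distorted image (the hard-to-OCR part) is deliberately left
    out; the wiki would store this code keyed by the edit form's
    session token and embed the image in the edit page."""
    return "".join(random.choices(string.ascii_uppercase, k=length))

def verify(stored_code, user_input):
    """On form submit, compare what the user typed against the stored
    code; a tolerant comparison (strip whitespace, ignore case) keeps
    false rejections down for human editors."""
    return user_input.strip().upper() == stored_code
```

A bot that can't read the image can't fill in the field, so automated spam is rejected before the edit is even considered.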
Ronan Cremin (ronan _AT_ cremin.com)
UPDATE on this (15 December 2004): my tiny patch to usemod has caught the vast majority (all bar about 3 or so) of several hundred automated spammings of my site in the last month or so. These spams always take the same form: a vast chunk of URLs placed at random points in all pages in the wiki (presumably these are all to increase google pageranks for the linked sites). So, for this type of spam at least, this method has proved very successful.
Both of those ideas have promise, but unfortunately a remote bot like mine can't proactively block spam; only a patch to UseMod could allow you to do that.
I am particularly fond of the idea of the image with the character code in it. However, I believe there is a LOT of potential for bots to interact with wikis (see [WikiGatewayMotivation] for examples), so I think there should be a way for some users to obtain "bot privileges", allowing bots which sign on as those users to work without having to deal with the character code.
Note, though, that a lot of the current spam may be human-generated.
Bayle -- I went ahead and implemented a crude patch to Usemod to prevent automated spam. It works on the first idea above: it simply counts the number of HTTP links in the previous revision and in the submitted update. If the delta between the two is greater than 5, it dumps the edit and leaves the page as is (and emails me to let me know). Since I implemented it, it has successfully caught all spam attempts on my site (about 10 over the past 4 days). So far so good. If anybody wants the code let me know. If they get smarter I'll add a [captcha] -- there are several fairly easy ways to add these using public services or Perl libraries.
Ronan Cremin (ronan _AT_ cremin.com)
Sorry that spamclean wasn't operating for a while, and then wasn't posting change summaries for a few days. Since it's being run off my home computer, it won't have a chance to run when I'm offline.
Also, I added two new features:
Note that the same technology can be used to spam or destroy wikis. We refused to publish these scripts for years because we knew the potential for disaster they imply. An arms race is not a good solution. -- SunirShah