Dealing with Comment Spam

One of the first plugins I wanted to write for WordPress was a decent spam filter. At the time (late 2004) there were several plugins already out there, but they were all rules-based and I wanted something more adaptive. So I wrote a few simpler plugins for practice and soon abandoned the idea, since I really didn’t get much spam — or even many comments for that matter.

After a seven-month absence, I came back here to find over 4000 comments in moderation, most of them spam. It took a week to sort through and clean up that mess, all to preserve the 5% of legitimate comments buried in there. Then, after a two-week vacation, I was stunned to see my empty moderation queue back up to 2700 comments. No rest for the wicked, I guess.

So I fired up my PHP editor and got to work. The resulting LightPress plugin is now active on this site, tagging spam as soon as it arrives. It’s fairly simple and is based on my interpretation of Paul Graham’s essay, “A Plan for Spam”. Basically, it’s a Bayesian filtering approach that compares tokens from incoming comments against known ham and spam tokens from previously scored comments. The nice part is that it adapts to evolving spam, so it shouldn’t require too much attention once I’ve finished tuning the settings.
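
For readers who haven’t seen the essay, the scoring idea can be sketched roughly as follows. This is a Python illustration of the formulas published in “A Plan for Spam”, not the plugin’s actual PHP code. The function names, the tokenizer, the toy training data, and the choice to use the essay’s constants unchanged (ham counts doubled, a minimum of 5 occurrences, 0.4 for unknown tokens, the 15 most “interesting” tokens) are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical tokenizer: letters, digits, dollar signs, apostrophes, dashes.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    """Split a comment into lowercase tokens (lowercasing is a simplification)."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def train(comments):
    """Count how often each token appears across a list of comments."""
    counts = Counter()
    for text in comments:
        counts.update(tokenize(text))
    return counts


def token_probability(token, ham, spam, n_ham, n_spam):
    """Estimate the probability that a comment containing `token` is spam."""
    g = 2 * ham[token]   # ham counts are doubled to bias against false positives
    b = spam[token]
    if g + b < 5:
        return 0.4       # rarely seen or unknown token: mildly innocent
    ham_freq = min(1.0, g / n_ham)
    spam_freq = min(1.0, b / n_spam)
    p = spam_freq / (ham_freq + spam_freq)
    return max(0.01, min(0.99, p))  # never fully certain either way


def spam_score(text, ham, spam, n_ham, n_spam, n_interesting=15):
    """Combine the most 'interesting' token probabilities into one score."""
    probs = [token_probability(t, ham, spam, n_ham, n_spam)
             for t in set(tokenize(text))]
    # "Interesting" means farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = 1.0
    prod_ham = 1.0
    for p in probs[:n_interesting]:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


# Toy corpora (made up for this example): five ham and five spam comments.
ham_comments = ["great post about the plugin thanks"] * 5
spam_comments = ["buy cheap medications online now"] * 5
ham, spam = train(ham_comments), train(spam_comments)
n_ham, n_spam = len(ham_comments), len(spam_comments)

print(spam_score("buy cheap medications", ham, spam, n_ham, n_spam))
print(spam_score("great post thanks", ham, spam, n_ham, n_spam))
```

In the essay, a message is treated as spam when the combined score exceeds 0.9. Because the token counts come from previously scored comments, every newly scored comment can be fed back into the counts, which is what lets this kind of filter adapt over time.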

I also added a bulk scoring feature, so that I could not only build my token database but also sort through that imposing pile of 2700 comments (it’s even worse at Photos de la Route). It’s working great so far, plus it’s kind of fun to see the spam scores. But I have to agree with Paul Graham: reading lots of spam is depressing.

So comment away! Comments are no longer going straight to moderation, although you might find yourself there (or worse) if you try to sell me any medications.

To answer a few obvious questions:

Why didn’t you use Akismet?
I think it’s great that Matt and company are offering this service, but I have a fundamental problem with Akismet: it’s not open. I know this was done to keep spammers from getting around their filters, but it also means that I can’t fully trust the service. How do they determine what’s spam and what’s not? If they’re using criteria from other sites, how well do those apply to the comments that I receive here? The only way to build trust in the service is to use it and constantly check that it isn’t generating false positives (i.e. good comments flagged as spam). I’d rather be able to coast, knowing that if there is a problem, I can find it easily since the data is open.

Will this work for WordPress?
Probably. 98% of the plugin code runs on both LightPress (LP) and WordPress (WP) (a real feat, let me tell you), simply because comment moderation, editing and the bulk actions all happen in WP’s administration interface. But I’m not keen on converting the last 2% to WP, because most of the WP code was a real pain to create. So when it’s released as an LP plugin, you can go ahead and convert it yourself.

Feedback

A few more postscript notes:

1. Scanning through token scores, I've actually found comments that I incorrectly marked as spam some time ago. Shows you how effective the manual method is, eh?

2. We're going to have to improve comment plugin integration in LightPress. The plugin works great, but can't change the default email sent out by LP when a new comment is submitted.

jerome — 03-Sep-2006 15:14

Sorry for my lack of understanding - i don't get what a 'default email sent out by LP' is for. Do you wanna _not_ notify the author when it's _just_ spam? Or what are you getting at?

btw: Do you remember my question about the technorati-tags? I just stumbled over "how to style an rss-feed", which is damn cool! especially this advanced version. So, while still struggling with IE-FF differences, i took a look at your feed and found one thing i like: i'll need entities escaped, just like you did. is feed.php the right place for this? and one thing that is missing: the technorati-tags ;) regards

erik — 06-Sep-2006 14:43

Erik, I was referring to the email that the weblog owner gets whenever a new comment is posted. Ideally, I'd like to show how the comment scored within that email, but right now I have to send myself a separate email with the score.

It's working very well so far — no false positives and several hundred spam blocked already. But I'm seeing issues on my other site (which still uses LP 1.1 and sees 10x the amount of spam).

Regarding tags & feeds, thanks for the links — I'll check'em out. As for styling your feed, do it in the feed templates rather than hacking feed.php. Ludo & I discussed moving them into the themes dir (way back when), but never came to a final decision. And thanks for the tip about my feed tags — I did some upgrading on Sunday and must have broken something. I need to switch to a non-alpha of LP! :)

jerome — 06-Sep-2006 21:25

After hitting submit, I remembered something: I had a custom feed.php that inserted the tags into the atom feed. Actually, all it really did was run the appropriate plugin hook. Somehow that never made it into the core…

jerome — 06-Sep-2006 21:27

"Somehow that never made it into the core…"

oh, that should. :)

ok, i see what the email is for, but i have to say that the akismet-integration does a really good job on my blog: it kept me from moderating about 900 comments in two months. But anyway: go ahead ;) i don't need to understand what this is all about as long as i can just hit "Activate" later on ;)

Well, on the xsl-stylesheet thing: it's just nicer than viewing a cryptic tree/source code if someone hits an rss-link by accident. same goes for the "feed:" protocol-message that should instead declare the filetype (in order to open the feed in an aggregator).

erik — 07-Sep-2006 03:23

Well another link to a Blogpost with an XSLT 2 approach. Sorry for all this off-topic stuff, but this one is interesting regarding tags on feeds, as it also allows filtering and some more funstuff - ok, regarding the differences between browsers (opera vs. ie vs. ie7 vs. firefox) it really seems to be a funstuff-thing… anyways.

erik — 07-Sep-2006 05:12

I'm not saying that Akismet doesn't work as well — in fact, I've never tried it. But I don't like using someone else's unpublished rules to tell me what's spam and what isn't.

What I may end up doing is integrating an Akismet check and using that to verify against my Bayesian score — but it won't be the final arbiter of what goes into the spam bin.

jerome — 07-Sep-2006 23:30

Stats to date:

15 valid comments approved

797 spam comments filtered out

1 spam comment approved (tolerable)

0 false positives

I'm pleased with the results so far.

jerome — 09-Sep-2006 12:39

It turns out that the lone spam that got through was due to a bug: I was using the wrong variable name and only looked at one third of the "interesting" tokens that I should have been. It's telling that it worked so well even with this glitch. The comment now scores very positively as spam. :)

jerome — 09-Sep-2006 13:29

That's cool. my combination akismet+spamkill (spamkarma doesn't work for lightpress) has checked around 800 comments so far, and around 15-25 spam comments have been approved… not so tolerable

erik — 09-Sep-2006 16:17

Thanks Erik, that's good to know! :)

Another one slipped through (not due to a bug), so I guess I'm back to where I was before the bug was fixed.

jerome — 10-Sep-2006 11:16

the more asiatic design-sites link to my blog, the more comments i receive that aren't correctly recognized as spam :( in the admin-panel these comments mostly show up with one or a lot of links, and the rest (all the asiatic text, be it chinese or japanese) is neither recognized nor displayed. this is a bit confusing because i thought using UTF-8 is good because it has the widest character-set range. Does your Anti-Spam approach have a solution for foreign-language spam (very foreign from my point of knowledge)? (eg. check that the same commenter doesn't post 20 comments in 30 seconds…) that is getting more and more annoying.

erik — 27-Sep-2006 10:33

btw: 1,360 comments killed, of which maybe 60 to 80 i had to delete manually. (i also see a lot of crawlers checking for non-existent urls, mostly with the word comments as a parameter, so it seems to me they even "try" to check if the rubbish has been published…)

erik — 27-Sep-2006 10:38

Any idea when you will get around to releasing this plugin? :)

Ørvar — 09-Oct-2006 13:42

I bet these spam dudes hacked it all in manually… or did your methods fail? well. 3 months later akismet stopped around 6.100 spam-comments, where i had to manually delete another 30 … false-positives.

erik — 10-Dec-2006 13:13

Interesting … just when I was getting interested in this, scrolling down … and the post is full of spam :-).

That says all there is to this comment spam solution :-)

George Appiah — 03-Jan-2007 00:34

@George: not any more!

A large portion of the spam was being filtered correctly, but a couple of new spam "signatures" made it through (and I haven't moderated anything manually since late October). Obviously this is still a work in progress, but I still feel more comfortable using this than relying on Akismet.

jerome — 13-Jan-2007 19:01

