by George Taniwaki

About comment spam

Comment spam is a real problem. For each legitimate comment, most websites that allow comments (like mine) receive over 100 spam messages that link to unethical or fraudulent websites.

Luckily, there are excellent spam filters that identify and remove these annoying click-bait messages. For instance, the service that hosts this blog, WordPress, uses a service called Akismet. These spam filters use pattern recognition to find suspicious messages based on characteristics like message content, sender email address, sender IP address, web page commented on, etc. Suspect messages are tagged as spam and moved to a junk comment folder.

Naturally, in the spam arms race, the creators of spam campaigns need tools to rapidly create comments, ideally a unique one for every blog post, so as to avoid being detected.

The message

I recently received a comment on this blog that reveals how comment spammers create messages. What arrived was not the intended comment. Rather, the spammer sent me the over 300 lines of code they used to create custom-looking comments. Phrases that could be customized were enclosed in curly braces {}. The options for the words in a phrase were separated by vertical pipes |. The curly braces could be nested to allow multiple levels of customization. In fact, the entire comment starts with a curly brace so that different versions of the message could be sent. The spam message generator is partially reproduced below.

Note in particular how many of the characters are accented letters or Unicode homoglyphs (look-alike characters borrowed from other alphabets). The resulting words look like English, but will not appear in any dictionary that a spam filter might use to detect phrases common in spam messages. Of special note, a word used multiple times will often have a different glyph replacement in each instance.


{ӏ have|I’ve} bеen {surfing|browsing} online mοrе thаn {three|3|2|4} hours todaу, ƴet I
never found any іnteresting article like
yours. {It’s|It іs} pretty worth enoսgh for me. {Іn mу opinion|Personally|In my view}, іf
ɑll {webmasters|site owners|website owners|web owners} аnd
bloggers mаde gooԁ content as ƴou dіd, tҺe {internet|net|web} will bе {much moгe|a lot more} useful than ever beforе.|
I {couldn’t|could not} {resist|refrain fгom} commenting.

{Very wеll|Perfectly|Well|Exceptionally well} written!|
{ӏ wіll|І’ll} {rіght awaʏ|immeԀiately} {tɑke
hold of|grab|clutch|grasp|seize|snatch} уoսr {rss|rss feed} ɑs I {can not|ϲаn’t} {іn finding|fіnd|to find} yοur {email|е-mail} subscription {link|hyperlink} օr
{newsletter|e-newsletter} service. Ɗo {yoս ɦave|yoս’ve} any?
{Please|Kindly} {аllow|permit|lеt} me {realize|recognize|understand|recognise|кnow}
{sߋ tɦat|in orԁer that} I {may juѕt|may|cοuld} subscribe.
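The brace-and-pipe syntax (often called spintax) is simple enough to expand in a few lines. The following is a minimal sketch of how a generator might pick one variant, not the spammer's actual tool. The function name `spin` and the sample template are my own, and the sample uses plain ASCII letters for readability.

```python
import random
import re

# Matches a brace group that contains no other braces, i.e. an innermost group.
INNERMOST = re.compile(r"\{([^{}]*)\}")

def spin(template, rng=None):
    """Return one randomly chosen variant of a {a|b|{c|d}} style template."""
    rng = rng or random.Random()

    def pick(match):
        # Options inside a group are separated by vertical pipes.
        return rng.choice(match.group(1).split("|"))

    # Resolve innermost groups first and repeat until no groups remain,
    # which handles nesting without a full parser.
    while True:
        expanded, count = INNERMOST.subn(pick, template)
        if count == 0:
            return expanded
        template = expanded

# Prints one of four possible sentences.
print(spin("{I have|I've} been {surfing|browsing} online"))
```

Because every group is resolved independently, even this short template yields four distinct sentences, and a template with hundreds of groups yields far more variants than a filter could ever list.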

The string of faux-fawning gibberish continues for another 290 lines or so and finally ends with this heartfelt closing.

Thɑnks fоr {greɑt|wonderful|fantastic|magnificent|excellent} {іnformation|info} ӏ wɑs looking for thіs {informatіon|info} for my mission.|
{Hi|Hello}, i tɦink that і saw you visited my {blog|weblog|website|web site|site} {ѕo|thus}
i сame to “return the favor”.{I аm|I’m} {trying to|attempting tߋ} find thіngs to {improve|enhance}
mʏ {website|site|web site}!І suppose its ok to use {some of|a fеw of} уօur ideas!\

I’m somewhat surprised that messages generated from the template above can confuse a spam filter. A pattern recognition algorithm could be designed to detect which forms of phrases, misspellings, and glyph substitutions are most commonly seen in spam rather than in messages typed by honest but error-prone humans.
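As a rough illustration of the glyph-substitution part, one crude signal is a word whose letters come from more than one script, such as a Cyrillic letter inside an otherwise Latin word. The sketch below is my own assumption about how such a check might look, not how Akismet or any real filter works. It takes the first word of each character's Unicode name as the script, so it misses look-alikes that are themselves Latin letters (for example, y with hook) and would wrongly flag languages that legitimately mix scripts within a word.

```python
import re
import unicodedata

# Runs of letters only (no digits, underscores, or punctuation).
WORD = re.compile(r"[^\W\d_]+")

def scripts_in(word):
    """Approximate the scripts in a word by the first word of each Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in word}

def mixed_script_words(text):
    """Return the words in text whose letters come from more than one script."""
    return [w for w in WORD.findall(text) if len(scripts_in(w)) > 1]

# Escapes make the substitutions visible: \u0435 and \u0430 are Cyrillic,
# \u03bf is Greek. Each looks like a Latin letter when displayed.
sample = "b\u0435en surfing online m\u03bfr\u0435 th\u0430n three hours"
suspicious = mixed_script_words(sample)
print(len(suspicious), "of", len(WORD.findall(sample)), "words mix scripts")
```

Here three of the seven words are flagged. An honest typo almost never swaps in a letter from another alphabet, so a comment where a large share of words mix scripts is a strong hint that the text was machine-generated.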

Anyway, I want to thank this incompetent spammer for providing me with content for this blog post. And of course, thanks for the {kind|wonderful|supporting} message.

For examples of actual blog spam that preys on people who might be persuaded to sell a kidney, see this previous blog post.