Real Numeracy

by George Taniwaki


I’m a libertarian by nature. (That’s libertarian with a small L, meaning I believe in government transparency and clarity. Please don’t confuse it with Libertarian with a capital L, which I associate with mindless anarchy.) Every two years, I dutifully check for my ballot and voter pamphlet (Washington has vote by mail). The list of items seems to be getting longer, especially voter initiatives.

Here is my method of deciding how to cast my ballot on voter initiatives. First, I start skeptically. Most voter initiatives are funded by political extremists who do not consider the consequences of adopting their pet idea. But I do my online research, checking analysis produced by hopefully reputable and unbiased sources. Ultimately though, I usually vote against them.

This year in Washington, there is a really bizarre ballot issue. It is Initiative Measure No. 1501, “Increased Penalties for Crimes Against Vulnerable Individuals.”

This measure would increase the penalties for criminal identity theft and civil consumer fraud targeted at seniors or vulnerable individuals; and exempt certain information of vulnerable individuals and in-home caregivers from public disclosure.

Should this measure be enacted into law? Yes [ ] No [ ]

How could anyone be against this? We want to help seniors, right? Well, it’s not that simple.

A convoluted story

There is a very complex story about this initiative. It involves a union, an antiunion think tank, and the U.S. Supreme Court. Initiative 1501 is sponsored by the Service Employees International Union (SEIU), which represents healthcare workers who work in nursing homes or provide in-home care. Washington, like most states, requires certain workers, such as nurses, to have a license in order to provide services to the public. About one-third of all service workers in the U.S. require licenses. In many cases, these workers are also unionized.

Enter the Freedom Foundation. This antiunion policy group is headquartered in Olympia, Washington. It was founded by Bob Williams, who was formerly with the American Legislative Exchange Council (ALEC). You may have heard of ALEC; it is a corporate funded lobbying group that writes model legislation (which obviously is designed to further the goals of its corporate clients) which it then provides to state legislators to review. The legislators can then submit the bills for approval into law. The Freedom Foundation provides very similar services.

In 2014, the U.S. Supreme Court ruled 5-4 in Harris v. Quinn that an Illinois state law that allowed the SEIU to collect a representation fee (union dues) from in-home healthcare workers’ wages was unconstitutional. The reasoning was that the fee violated the First Amendment rights of the workers to not provide financial support for collective bargaining.

After the ruling, the Freedom Foundation complained that the SEIU was not doing enough to inform its members that they did not have to pay the representation fee in order to belong to the union. Through a public records act, it sued the union and the state, won, and started to send communications to members encouraging them to stop paying the fee.

Since a Supreme Court ruling covers the entire U.S., not just Illinois, the SEIU realized that it was very vulnerable to attack by the Freedom Foundation or other antiunion organizations.

Now the initiative makes sense

In Washington, the SEIU proactively sponsored Initiative 1501 as a direct attack against the Freedom Foundation. The SEIU wants to avoid having to release the names, addresses, and phone numbers of its members (or having the state reveal these either). Initiative 1501 does this by saying that in-home caregivers are a protected class, like seniors or vulnerable individuals, whose personal information the state and the union cannot release.

After all that research, the story starts to make sense. This is a battle between two parties that a libertarian like me dislikes. But more transparency is better than less. So I will vote no. Sorry seniors and vulnerable individuals, you will have to rely on existing statutes to protect you.

by George Taniwaki

Did you watch the debate on Monday night? I did. But I am also very interested in the post-debate media coverage and analysis. This morning, two articles that combine big data and the debate caught my eye. Both are novel and much more interesting than the tired stories that simply show changes in polls after a debate.

First, the New York Times reports that during the presidential debate (between 9:00 and 10:30 PM EDT) there was a high correlation between the Betfair prediction market for who will win the presidential election and after-hours S&P 500 futures prices (see chart 1).


Chart 1. Betfair prediction market for Mrs. Clinton compared to S&P 500 futures. Courtesy of New York Times

Correlation between markets is not a new phenomenon. For several decades financial analysts have measured the covariance between commodity prices, especially crude oil, and equity indices. But this is the first time I have seen an article illustrating the covariance between a “fun” market for guessing who will become president and a “real” market. Check out the two graphs above; the similarity in shape is striking, including the fact that both continue to rise for about an hour after the debate ended.

In real-time, while the debate was being broadcast, players on Betfair believed the chance Mrs. Clinton will win the election rose by 5 percent. Meanwhile, the price of S&P 500 futures rose by 0.6%, meaning investors (who may be the same speculators who play on Betfair) believed the stock market prices in November were likely to be higher than before the debates started. There was no other surprise economic news that evening, so the debate is the most likely explanation for the surge. Pretty cool.

If the two markets are perfectly correlated (they aren’t) and markets are perfectly efficient (they aren’t), then one can estimate the difference in equity futures market value between the two candidates. If a 5% decrease in likelihood of a Trump win translates to a 0.6% increase in equity futures values, then the difference between Mr. Trump and Mrs. Clinton being elected (a 100% change in probability) results in about a 12% or $1.2 trillion (the total market cap of the S&P 500 is about $10 trillion) change in market value. (Note that I assume perfect correlation between the S&P 500 futures market and the actual market for the stocks used to calculate the index.)

Further, nearly all capital assets (stocks, bonds, commodities, real estate) in the US are now highly correlated. So the total difference is about $24 trillion (assuming total assets in the US are $200 trillion). Ironically, this probably means Donald Trump would be financially better off if he were to lose the election.
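The back-of-the-envelope extrapolation above can be written out as a short calculation. This is only a sketch of the arithmetic: it uses the figures quoted in the text and assumes the relationship between win probability and prices is perfectly linear. The variable and function names are made up for illustration.

```python
# Linear extrapolation of the debate-night moves, using figures from the text.
PROB_SHIFT = 0.05           # change in Betfair win probability during the debate
FUTURES_SHIFT = 0.006       # change in S&P 500 futures over the same window
SP500_MARKET_CAP = 10e12    # S&P 500 market cap figure used in the text, in dollars
US_TOTAL_ASSETS = 200e12    # total US assets assumed in the text, in dollars


def implied_swing(prob_shift, futures_shift):
    """Scale the observed price move up to a full 100% change in probability."""
    return futures_shift / prob_shift


swing = implied_swing(PROB_SHIFT, FUTURES_SHIFT)
sp500_dollars = swing * SP500_MARKET_CAP
all_assets_dollars = swing * US_TOTAL_ASSETS

print(f"Implied swing: {swing:.0%}")
print(f"S&P 500: ${sp500_dollars / 1e12:.1f} trillion")
print(f"All US assets: ${all_assets_dollars / 1e12:.0f} trillion")
```

The weak point is the linearity assumption, not the arithmetic. A 5-point move observed over ninety minutes says little about what a 100-point move would do.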


The other article that caught my eye involves Google Trends data. According to the Washington Post, the phrase “registrarse para votar” was the third highest trending search term the day after the debate was broadcast. The number of searches is about four times higher than in the days prior to the debates (see chart 2). Notice the spike in searches matches a spike in Sep 2012 after the first Obama-Romney debate.

The article says that it is not clear if it was the debate itself that caused the increase or the fact that Google recently introduced Spanish-language voting guides to its automated Knowledge Box, which presumably led to more searches for “registrarse para votar”. (This is the problem with confounding events.)

After a bit of research, I discovered an even more interesting fact. The spike in searches did not stop on Sep 27. Today, on Sep 30, four days after the debates, the volume of searches is 10 times higher than on Sep 27, or a total of 40x higher than before the debate (see chart 3). The two charts are scaled to make the data comparable.


Chart 2. Searches for “registrarse para votar” past 5 years to Sep 27. Courtesy of Washington Post and Google Trends


Chart 3. Searches for “registrarse para votar” past 5 years to Sep 30. Courtesy of Google Trends

I wanted to see if the spike was due to the debate or due to the addition of Spanish voter information to the Knowledge Box. To do this, I compared “registrarse para votar” to “register to vote”. The red line in chart 4 shows Google Trends data for “register to vote” scaled so that the bump in Sept 2012 is the same height as in the charts above. I’d say the debate really had an unprecedented effect on interest in voting and the effect was probably bigger for Spanish-speaking web users.
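Scaling two search series to a shared reference bump, as described above, amounts to dividing each series by its own value at the reference date. Here is a minimal sketch of that idea. The numbers are made up purely for illustration and are not real Google Trends values, and `rescale_to_reference` is a hypothetical helper name.

```python
def rescale_to_reference(series, reference_key, target_height=1.0):
    """Scale a {label: value} series so the reference point equals target_height."""
    reference_value = series[reference_key]
    if reference_value == 0:
        raise ValueError("reference value must be nonzero")
    return {
        label: value / reference_value * target_height
        for label, value in series.items()
    }


# Made-up index values, NOT real Google Trends data.
spanish = {"2012-09": 5.0, "2016-08": 2.0, "2016-09": 50.0}
english = {"2012-09": 50.0, "2016-08": 20.0, "2016-09": 100.0}

spanish_scaled = rescale_to_reference(spanish, "2012-09")
english_scaled = rescale_to_reference(english, "2012-09")

for label in spanish:
    print(label, spanish_scaled[label], english_scaled[label])
```

Once both series equal 1.0 at the 2012 bump, the later spikes can be compared directly as multiples of that bump.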


Chart 4. Searches for “register to vote” past 5 years to Sep 30. Courtesy of Google Trends

Finally, I wanted to see how the search requests were distributed geographically. The key here is that most Hispanic communities vote Democratic and many states with a large Hispanic population are already blue (such as California, Washington, New Mexico, New Jersey, and New York). The exception is Florida with a large population of Cuban immigrants who tend to vote Republican.


Chart 5. Searches for “registrarse para votar” past 5 years to Sep 30 by county. Courtesy of Google Trends

If you are a supporter of Democrats like Mrs. Clinton, the good news is that a large number of queries are coming from Arizona and Texas, two states where changes in demographics are slowly turning voting preferences from red to blue.

In Florida, it is not clear which candidate gains from an increased number of Spanish-speaking voters. However, since the increase is a result of the debate (during which it was revealed that Mr. Trump had insulted and berated a beauty pageant winner from Venezuela, calling her “miss housekeeping”), I will speculate many newly registered voters are going to be Clinton supporters.

If the Google search trend continues, it may be driven by new reports that Mr. Trump may have violated the US sanctions forbidding business transactions in Cuba. Cuban-Americans searching for information on voter registration after hearing this story are more likely to favor Mrs. Clinton.

by George Taniwaki


Moon pies for cheap. Photo by George Taniwaki

I love moon pies (apparently, I was a southerner in a past life). Surprisingly, they are big in South Korea too (who knew, for history see Wikipedia).

Incidentally, don’t confuse moon pies with moon cakes which are another Asian sweet (which I usually don’t like because of the salty egg flavor).

Anyway, today, I found a really cheap source of my favorite confection. Lotte brand is $3.50 for 335g or 29 cents a pie. Mysteriously, they are hidden next to weird spices in the international food aisle, not prominently displayed with the other cookies in the snack aisle. Perhaps it’s a form of American food protectionism by US cookie makers, Asian segregationist policy or redlining by the store, or the result of some other nativist conspiracy plot.

It’s crazy that a South Korean company can import all the ingredients, process them, ship them back to the U.S., and still be cheaper than US-made cookies. But I don’t care as long as I get my fix of graham cracker, marshmallow, and sugary goodness.

by George Taniwaki

NASA recently celebrated the fifteenth anniversary of the launching of the Chandra X-ray Observatory by releasing several new images. One of the images, shown below, is an amazing composite that reveals in exquisite detail the turbulence surrounding the remnants of the Tycho supernova. (Scroll down to Figure 2, then come back.)

Tycho supernova

The scientific name of the Tycho supernova remnant is SN 1572, where SN means supernova and 1572 refers to the year it was first observed. That’s right, over 400 years ago, in November 1572, many people noticed a new bright object in the sky near the constellation Cassiopeia. Reports at the time indicated that it was as bright as Venus (peak magnitude of –4) meaning it was visible during the day.

SN 1572 is called the Tycho supernova because a Danish scientist named Tycho Brahe published a paper detailing his observations. His paper is considered one of the most important in the history of astronomy, and science in the Renaissance.


Figure 1. Star map drawn by Tycho Brahe showing position of SN 1572 (labelled I) within the constellation Cassiopeia. Image from Wikipedia

What people at the time didn’t know was that SN 1572 was about 9,000 light years away, meaning it was unimaginably far away. The explosion that caused it happened long ago but the light had just reached the earth.

(Actually, SN 1572 is fairly close to us relative to the size of the Milky Way, which is 100,000 light years across, and extremely close relative to the size of the observable universe, which is about 93 billion light years across. Space is just really unimaginably large.)

What they also didn’t know was that SN 1572 was probably a Type Ia supernova. This type of supernova is common, and has a very specific cause. It starts with a binary star system. Two stars orbit one another very closely. Over time, one of the stars consumes all of its hydrogen and dies out, leaving a carbon-oxygen core. Its gravity causes it to accrete the gas surrounding it until its mass reaches what is called the Chandrasekhar limit and it collapses. The increased pressure causes carbon fusion to start. This results in a runaway reaction, causing the star to explode.

About a supernova remnant

In the 400 years since SN 1572 exploded, the debris from it has been flying away at 5,000 km/s (3,100 mi/s). It is hard to see this debris. Imagine a large explosion on the earth that occurs at night.

The debris itself doesn’t generate very much light, but it does produce some. Space is not a vacuum. It is a very thin gas. When electrons from the moving debris of the supernova remnant strike a stationary particle, it gives off a photon (which depending on the energy of the collision, is seen as radio waves, microwaves, visible light, UV, or x-rays). This energy also heats up the remaining particles, releasing additional photons, making them detectable with a very sensitive telescope.
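To get a feel for the expansion speed quoted above, here is a small sketch that converts it to miles per second and estimates how far debris would travel in 400 years. It assumes the debris keeps a constant speed the whole time, which is a simplification made only for this example. The function names are hypothetical.

```python
KM_PER_MILE = 1.609344               # exact, by definition of the mile
SPEED_OF_LIGHT_KM_S = 299_792.458    # exact, by definition of the metre


def km_s_to_mi_s(speed_km_s):
    """Convert a speed from kilometres per second to miles per second."""
    return speed_km_s / KM_PER_MILE


def light_years_travelled(speed_km_s, years):
    """Distance covered at constant speed, expressed in light years.

    Something moving at a fraction f of light speed covers f light years
    per year, so no seconds-per-year conversion is needed.
    """
    return speed_km_s / SPEED_OF_LIGHT_KM_S * years


speed_mi_s = km_s_to_mi_s(5000)
radius_ly = light_years_travelled(5000, 400)

print(f"{speed_mi_s:.0f} mi/s")
print(f"{radius_ly:.1f} light years")
```

Under that constant-speed assumption the debris would have covered several light years, which is tiny next to the roughly 9,000 light years separating us from the remnant.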

About false color images

The Chandra X-ray Observatory was launched in 1999 from the space shuttle Columbia. As the name implies, it can take digital images of objects in the x-ray range of light. Since humans cannot see in this range, images taken in the x-ray range are often color coded in the range from green to blue to purple.

Often, composite images of space objects are created using telescopes designed to capture photons from different wavelengths. For instance, visible light telescopes like the Hubble Space Telescope often have the colors in their images compressed to black and white. Images from infrared telescopes, like the Spitzer Space Telescope, and ground-based radio telescopes are often given a false color range from red to orange.

Pictures please

All right, finally the results. Below is the most detailed image ever of the Tycho supernova remnant. It is a composite created by layering multiple, long-exposure, high-resolution images from the Chandra X-ray Observatory. The press release says, “The outer shock has produced a rapidly moving shell of extremely high-energy electrons (blue), and the reverse shock has heated the expanding debris to millions of degrees (red and green).

“This composite image of the Tycho supernova remnant combines X-ray and infrared observations obtained with NASA’s Chandra X-ray Observatory and Spitzer Space Telescope, respectively, and the Calar Alto observatory, Spain.

“The explosion has left a blazing hot cloud of expanding debris (green and yellow) visible in X-rays. The location of ultra-energetic electrons in the blast’s outer shock wave can also be seen in X-rays (the circular blue line). Newly synthesized dust in the ejected material and heated pre-existing dust from the area around the supernova radiate at infrared wavelengths of 24 microns (red).”


Figure 2. Tycho supernova remnant composite image released in 2014. Image from NASA

Compare Figure 2 above to an image of the Tycho supernova remnant that NASA released in 2009 using data from observations made in 2003 shown below. Notice the lack of details. Also notice the large number of stars in the background, some even shining through the dust of the explosion. Apparently, the image above has been modified to eliminate most of these distractions.

These two images, dated only a few years apart, reveal what are likely remarkable advances in software for manipulating space images. I say that because the hardware in the telescopes themselves, such as optics, detectors, and transmitters, probably has not changed much since launch. Thus, any improvements in resolution and contrast between the two images are a result of better capabilities of the software used to process images after the raw data is collected.


Figure 3. Tycho supernova remnant composite image released in 2009. Image from NASA

by George Taniwaki

Your smartphone is more than an addictive toy. With simple modifications, it can become a lifesaving medical device. The phone can already receive and send data to medical sensors and controllers wirelessly. By adding the right software, a smartphone can do a better job than a more expensive standalone hospital-grade machine.

In addition, smartphones are portable and patients can be trained to use them outside a clinical setting. The spread of smartphones has the potential to revolutionize the treatment of chronic conditions like diabetes. This can enhance the quality of life of the patient and significantly increase survival.

Monitoring blood sugar

Type 1 diabetes mellitus is an autoimmune disease in which the body attacks the pancreas and interrupts the production of insulin. Insulin is a hormone that causes the cells in the body to absorb glucose (a type of sugar) from the blood and metabolize it. Blood sugar must be controlled to a very tight range to stay healthy.

A lack of insulin after meals can lead to persistent and repeated episodes of high blood sugar, called hyperglycemia. This in turn can lead to complications such as damage to nerves, blood vessels, and organs, including the kidneys. Too much insulin can deplete glucose from the blood, a situation called hypoglycemia. This can cause dizziness, seizures, unconsciousness, cardiac arrhythmias, and even brain damage or death.

Back when I was growing up (the 1970s), patients with type 1 diabetes had to prick their finger several times a day to get a blood sample and determine if their glucose level was too low or too high. If it was too low, they had to eat a snack or meal. (But not one containing sugar.)

They would also test themselves about an hour after each meal. Often, their glucose level was too high, and they had to calculate the correct dose of insulin to self-inject into their abdomen, arm, or leg to reduce it. If they were noncompliant (forgetful, busy, unable to afford the medication, fearful or distrustful of medical institutions or personnel, etc.), they would eventually undergo diabetic ketoacidosis, which often would require a hospital stay to treat.


Figure 1a. Example of blood glucose test strip. Photo from Mistry Medical


Figure 1b. Boy demonstrating how to inject insulin in his leg. Photo from Science Photo Library

If all these needle pricks and shots sound painful and tedious, they were and still are. There are better test devices available now and better insulin injectors, but they still rely on a patient being diligent and awake.

Yes, being awake is a problem. It is not realistic to ask a patient to wake up several times a night to monitor her glucose level and inject herself with insulin. So most patients give themselves an injection just before going to bed and hope they don’t give themselves too much and that it will last all night.

Continuous glucose monitoring

Taking a blood sample seven or eight times a day is a hassle. But even then, it doesn’t provide information about how quickly or how much a patient’s glucose level changes after a meal, after exercise, or while sleeping.

More frequent measurements would be needed to estimate the rate at which a patient’s glucose level would likely rise or fall after a meal, exercise, or sleeping. Knowing the rate would allow the patient to inject insulin before the glucose level was outside the safe range or reduce the background dosage if it is too high.

In the 1980s, the first continuous glucose meters were developed to help estimate the correct background dosage of insulin and the correct additional amounts to inject after snacks and meals.

The early devices were bulky and hard to use. They consisted of a sensor that was inserted under the skin (usually in the abdomen) during a doctor visit and had wires that connected it to a monitoring device that the patient carried around her waist. The sensor reported the glucose level every five to ten seconds and the monitor had enough memory to store the average reading every five to ten minutes over the course of a week.

The devices were not very accurate and had to be calibrated using the blood prick method several times a day. The patient would also have to keep a paper diary of the times of meals, medication, snacks, exercise, and sleep. After a week, the patient would return to the doctor to have the sensor removed.

The doctor would then have to interpret the results and calculate an estimated required background dose of insulin during the day and during the night and the correct amount of additional injections after snacks and meals. The patient would repeat the process every year or so to ensure the insulin dosages were keeping the glucose levels within the desired range.

Today, continuous glucose monitors can measure glucose levels using a disposable sensor patch on the skin that will stay in place for a week. The sensor transmits data to the monitor wirelessly. Using a keypad, the patient can also record eating, medication, exercise, and sleeping on the monitor. The monitor can store months of personal data and calculate the amount of insulin needed in real time. Alerts remind the patient when to inject insulin and how much. The monitors are cheap enough and portable enough that the patient never has to stop wearing one.


Figure 2. Wireless continuous blood glucose monitor and display device. Image from Diabetes Healthy Solutions

Continuous insulin pump

Also in the 1980s, the first generation of subcutaneous insulin pumps was commercialized. These pumps could supply a low background dose of insulin rather than the big spikes provided by manual injections. The first pumps were expensive, bulky, and hard to use. By the early 2000s though, insulin pumps became widely available and were shown to reliably reduce the fluctuations in glucose levels seen in patients who relied on manual injections. By providing a low dose of insulin continuously during the day and at night, with the ability of the patient to manually apply larger doses after meals, the pumps lowered the average level of glucose while also reducing the incidence of hypoglycemia. Over longer periods they also reduced the incidence of complications commonly seen with diabetes.


Figure 3a and 3b. Early insulin pump (left) and modern version (right). Images from Medtronic

There is one drawback to the continuous insulin pump. It can provide too much insulin at night while the patient is asleep. While sleeping, the patient’s glucose level falls. Since she is not performing blood tests, she will not notice that the insulin pump is set too high. Further, since she is asleep she may not realize that she is in danger, a condition called nocturnal hypoglycemia.

Software to control the pump

Imagine combining the continuous glucose meter with the continuous insulin pump. Now you have a system that mimics the behavior of the human pancreas. Sensors constantly monitor the patient’s glucose level, and anticipate changes caused by activities like eating, sleeping, and exercise.

The key is to use a well-written algorithm to predict the amount of insulin needed to be injected by the pump to keep sugar levels within the acceptable range. Instead of a human, software controls the insulin pump. If the glucose level does not stay within the desired levels, the algorithm learns from its mistake and corrects it.

The initial goal of the combined monitor and pump was to predict low glucose levels while a patient was sleeping and suspend the pumping of insulin to prevent nocturnal hypoglycemia. Ironically, the US FDA panel rejected the first application submitted for the device saying that the traditional uncontrolled continuous insulin pump was actually safer than a new device because of the new device’s lack of field experience.
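The idea of predicting a low glucose level and suspending the pump can be illustrated with a toy example. This is not the algorithm used in the Medtronic device or in any real product. It is a minimal sketch that assumes readings arrive every five minutes and that a straight-line extrapolation from the last two readings is good enough. The function names, the 30-minute horizon, and the sample readings are all invented for illustration. The 70 mg/dl threshold is the lower end of the desired range cited later in this post.

```python
def predict_glucose(readings, minutes_ahead, interval_min=5):
    """Linearly extrapolate glucose (mg/dl) from the last two readings."""
    if len(readings) < 2:
        raise ValueError("need at least two readings")
    slope_per_min = (readings[-1] - readings[-2]) / interval_min
    return readings[-1] + slope_per_min * minutes_ahead


def should_suspend_pump(readings, low_threshold=70, horizon_min=30,
                        interval_min=5):
    """Return True if the extrapolated level falls below the low threshold."""
    predicted = predict_glucose(readings, horizon_min, interval_min)
    return predicted < low_threshold


# Invented overnight readings, one every five minutes.
falling = [110, 100, 92]    # dropping 1.6 mg/dl per minute
steady = [100, 101, 100]    # essentially flat

print(should_suspend_pump(falling))   # extrapolates to 44 mg/dl in 30 minutes
print(should_suspend_pump(steady))    # extrapolates to 94 mg/dl in 30 minutes
```

A real controller would need to smooth noisy sensor data and account for insulin already in the body, which is part of why the device took years of studies to approve.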

After years of additional studies, the combined device, manufactured by Medtronic, was approved for use in the US in 2013. Results of a study involving 25 patients in the UK were published in Lancet Jun 2014. Another trial, involving 95 patients in Australia, was published in J. Amer. Med. Assoc. Sept 2013.


Figure 4. Combined glucose meter and insulin pump form a bionic pancreas. Image from Medtronic

Better software and smartphones

The Medtronic combined device is proprietary. But several groups are hacking it to make improvements. For instance, researchers led by Z. Mahmoudi and M. Jensen at Aalborg University in Denmark have published several papers (Diabetes Techn Ther Jun 2014, Diabetes Sci Techn Apr 2014, Diabetes Techn Ther Oct 2013) on control algorithms that may be superior to the one currently used in the Medtronic device.

Another interesting paper appeared in the New Engl J Med Jun 2014. It reports a study by Dr. Steven Russell of Massachusetts General Hospital and his colleagues. They wrote an app for a smartphone (Apple’s iPhone 4S) that could receive the wireless data from the Medtronic glucose meter and wirelessly control the Medtronic insulin pump.

Smartphones are ideal platforms for use in developing medical devices because they can communicate wirelessly with other devices, have sufficient computing power and memory for even the most complex control tasks, are designed to be easy to program and easy to use, and many people already own one.

Dr. Russell and his colleagues used a machine learning algorithm they had previously developed (J Clin Endocrinol Metab May 2014) to couple the two.

Even though this is a research project, not a commercial product, the results are pretty impressive. The study lasted 5 days, with the first day used to calibrate the algorithm and days 2-5 as the test.

As can be seen in Figure 5, after a day of “training”, patients using the bionic pancreas (solid black line) had lower average glucose levels than patients on the standard protocol (solid red line). Further, the variance of their glucose level (black shaded area) was smaller than for patients on the standard protocol (red shaded area). Notice how much better the control is using the bionic pancreas, especially at night.


Figure 5. Variation in mean glucose level among adults during 5-day study. Image from New Engl J Med

Another measure of quality is the amount of time the patients’ glucose levels were within the desired range of 70 to 120 mg/dl (the green shaded region in Figure 6). Patients with the bionic pancreas (solid black line) spent about 55% of the time within the desired range. They also had fewer incidents of hypoglycemia (pink shaded region) or hyperglycemia (white region on right) than patients using the standard protocol (red line).

Note that even with the bionic pancreas, 15% of the time patients had a glucose level above 180, so there is still plenty of room to improve control.
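The time-in-range measure described above is easy to compute from a list of glucose readings. Here is a minimal sketch, assuming the readings are evenly spaced so that the share of readings equals the share of time. The 70, 120, and 180 mg/dl cutoffs come from the text. The function name and the sample readings are invented for illustration and are not data from the study.

```python
def time_in_ranges(readings, low=70, high=120, very_high=180):
    """Return the fraction of readings (mg/dl) that fall in each band."""
    if not readings:
        raise ValueError("need at least one reading")
    n = len(readings)
    return {
        "below": sum(r < low for r in readings) / n,
        "in_range": sum(low <= r <= high for r in readings) / n,
        "above": sum(r > high for r in readings) / n,
        "above_very_high": sum(r > very_high for r in readings) / n,
    }


# Invented readings, NOT data from the study.
sample = [60, 80, 100, 130, 190]
shares = time_in_ranges(sample)

for band, share in shares.items():
    print(f"{band}: {share:.0%}")
```

Note that the "above" band includes everything over 120 mg/dl, so "above_very_high" is a subset of it and not a separate slice.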


Figure 6. Cumulative glucose level in adults during day 1 where the bionic pancreas adapted to the patient (dashed line) and days 2-5 (solid black). Image from New Engl J Med

by George Taniwaki

Patients are often frustrated and confused when navigating the healthcare system. Part of the problem is that being sick or hurt reduces your cognitive abilities. But it is also because hospitals are busy places with little funding for improving the user experience. Often the layout of the rooms, the signage, the forms and instructions, and the language used by the staff are not tailored to the needs of patients who are unfamiliar with the system.

Design to reduce patient violence

A significant problem in hospital emergency medical departments (called A&E in Britain, ER in America) is abusive and violent patients. According to the National Audit Office, violence and aggression towards hospital staff costs the NHS at least £69 million a year in staff absence, loss of productivity and additional security.

Some other statistics from the Design Council report: More than 150 incidents of violence and aggression are reported each day within the NHS system. In 2010, the incidence rate of violence and aggression was about 1 per 1000 patients. In 2009, 21% of staff reported bullying, harassment, and abuse by patients, and 11% reported physical attacks by patients.

Working with the National Health Service, a design firm called PearsonLloyd developed some low-cost methods to reduce the incidence of violence and aggression, increase patient satisfaction, improve staff morale, and reduce security costs. They call their program A Better A&E. The program was pilot tested at St. George’s Hospital in London and Southampton General. For an introduction, see the video below.


Figure 1. Still from video “A Better A&E”. Image from Vimeo

Signage and brochure

The program consisted of three parts. First, improved signage was installed that included estimated wait times, along with a brochure that explained why a patient who arrived after you could be seen by a doctor before you.


Figures 2a and 2b. Large screen monitor alternately shows how busy the A&E is and then how long the wait time is for different categories of patients. Images from Design Council report



Figure 3a and 3b. A page from brochure explaining why wait times differ among patients and what to expect at each station. Signage posted at each patient area keyed to the brochure. Images from

Root cause analysis

The second part of the redesign was the introduction of a program to capture information from doctors, nurses, and other staff about factors that led to violent and abusive behavior. The program included root cause analysis and a prominently posted Incident Tally Chart to record the “variables within the system that might hinder the ability of staff to deliver high quality care.”


Figure 4. Incident tally posted where staff can record any events during their shift. Images from Design Council Report

Toolkit and patterns

The final part of the program was to design a toolkit that would take the lessons from the A&E departments of the two pilot hospitals and generalize them so that they could be adopted by any hospital within the NHS system. The toolkit is presented as an easy to use website,


Surveys of patients and staff taken after the redesign indicated that both groups saw benefits.

  • 88% of patients felt the guidance solution was clear
  • 75% of patients felt the signage reduced their frustration during waiting times
  • Staff reported a 50% drop in threatening body language and aggressive behavior
  • NHS calculated that each £1 spent on design solutions resulted in £3 in benefits

by George Taniwaki

About comment spam

Comment spam is a real problem. Most websites that allow comments (like mine) receive over 100 spam messages that link to unethical or fraudulent websites for each legitimate comment they receive.

Luckily, there are excellent spam filters that identify and remove these annoying click-bait messages. For instance, the service that hosts this blog, WordPress, uses a service called Akismet. These spam filters use pattern recognition to find suspicious messages based on characteristics like message content, sender email address, sender IP address, web page commented on, etc. Suspect messages are tagged as spam and moved to a junk comment folder.

Naturally, in the spam arms race, the creators of spam campaigns need tools to rapidly create comments, ideally a unique one for every blog post, so as to avoid being detected.

The message

I recently received a comment on this blog that reveals how comment spammers create messages. The comment was actually not the intended comment. Rather, the spammer sent me over 300 lines of code they used to create custom-looking comments. Phrases that could be customized were enclosed in curly braces {}. The options for the words in a phrase were separated by vertical pipes |. The curly braces could be nested to allow multiple levels of customization. In fact, the entire comment starts with a curly brace so that different versions of the message could be sent. The spam message generator is partially reproduced below.
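The template format described above, with nested curly braces and pipe-separated options, can be expanded with very little code. Here is a minimal sketch of one way to do it. It is not the spammer’s actual tool, which I never saw. The `spin` function name and the plain-ASCII sample template are made up for illustration.

```python
import random
import re

# Matches a brace group that contains no other braces, i.e. an innermost group.
INNERMOST = re.compile(r"\{([^{}]*)\}")


def spin(template, rng=random):
    """Expand {a|b|c} groups, innermost first, until none remain."""
    while True:
        template, count = INNERMOST.subn(
            lambda match: rng.choice(match.group(1).split("|")),
            template,
        )
        if count == 0:
            return template


template = "{Very well|{Perfectly|Exceptionally well}} written! {I have|I've} been {surfing|browsing} online."
for _ in range(3):
    print(spin(template))
```

Resolving the innermost groups first keeps the code short because nesting never has to be tracked explicitly. The cost is that choices are also made inside branches that end up discarded, which does not matter for a template of a few hundred lines.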

Note in particular how many of the characters (highlighted in yellow) are accented or Unicode homoglyphs, meaning they form words that look like English, but will not appear in any dictionary that might be used by a spam filter to detect phrases often used in spam messages. Of special note is that words used multiple times will often have a different glyph replacement in each instance.


{ӏ have|I’ve} bеen {surfing|browsing} online mοrе thаn {three|3|2|4} hours todaу, ƴet I
never found any іnteresting article like
yours. {It’s|It іs} pretty worth enoսgh for me. {Іn mу opinion|Personally|In my view}, іf
ɑll {webmasters|site owners|website owners|web owners} аnd
bloggers mаde gooԁ content as ƴou dіd, tҺe {internet|net|web} will bе {much moгe|a lot more} useful than ever beforе.|
I {couldn’t|could not} {resist|refrain fгom} commenting.

{Very wеll|Perfectly|Well|Exceptionally well} written!|
{ӏ wіll|І’ll} {rіght awaʏ|immeԀiately} {tɑke
hold of|grab|clutch|grasp|seize|snatch} уoսr {rss|rss feed} ɑs I {can not|ϲаn’t} {іn finding|fіnd|to find} yοur {email|е-mail} subscription {link|hyperlink} օr
{newsletter|e-newsletter} service. Ɗo {yoս ɦave|yoս’ve} any?
{Please|Kindly} {аllow|permit|lеt} me {realize|recognize|understand|recognise|кnow}
{sߋ tɦat|in orԁer that} I {may juѕt|may|cοuld} subscribe.

The string of faux-fawning gibberish continues for another 290 lines or so and finally ends with this heart-felt closing.

Thɑnks fоr {greɑt|wonderful|fantastic|magnificent|excellent} {іnformation|info} ӏ wɑs looking for thіs {informatіon|info} for my mission.|
{Hi|Hello}, i tɦink that і saw you visited my {blog|weblog|website|web site|site} {ѕo|thus}
i сame to “return the favor”.{I аm|I’m} {trying to|attempting tߋ} find thіngs to {improve|enhance}
mʏ {website|site|web site}!І suppose its ok to use {some of|a fеw of} уօur ideas!\

I’m somewhat surprised the code above can confuse a spam filter. A pattern recognition algorithm could be designed to detect which forms of phrases, misspellings, and glyph substitutions are most commonly seen in spam rather than in messages typed by honest but error-prone humans.
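As a rough illustration of how glyph substitutions might be caught, here is a sketch that flags words mixing letters from more than one script, using the Unicode character names in Python’s standard `unicodedata` module. This is not how Akismet or any real filter works as far as I know. It is a simple heuristic, the function names are hypothetical, and it will miss substitutions that stay within Latin (such as a Latin alpha standing in for "a").

```python
import unicodedata


def scripts_in_word(word):
    """Return the set of script names (LATIN, CYRILLIC, ...) used by letters."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER IE".
                scripts.add(name.split()[0])
    return scripts


def mixed_script_words(text):
    """Return the whitespace-separated words that mix two or more scripts."""
    return [word for word in text.split() if len(scripts_in_word(word)) > 1]


# \u0435 is Cyrillic "ie" and \u03bf is Greek omicron; both look like Latin letters.
sample = "I've b\u0435en surfing online m\u03bfr\u0435 than 3 hours"
suspicious = mixed_script_words(sample)
print(suspicious)
```

An honest typo almost never switches alphabets in the middle of a word, so even this crude check separates the template above from ordinary misspellings.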

Anyway, I want to thank this incompetent spammer for providing me with content for this blog post. And of course, thanks for the {kind|wonderful|supporting} message.

For examples of actual blog spam that prey on people who might be persuaded to sell a kidney, see this previous blog post.
