Does using Internet Explorer make you stupid? I think not, but sometimes it can trick you. (See part 1 of this story here.)

I use a variety of browsers and operating systems, but my favorite is Internet Explorer 9 running on Windows 7. I like the feature that combines the address bar with the search box into a single text edit field. It allows me to just type a company name in the search box and the browser will resolve it into a domain name for me. (Of course, not everyone likes this design.)

Anyway, a few minutes ago I was using Safari on my Mac and typed “Ikea” in the address bar. Naturally, what I really wanted was “”. Safari doesn’t automatically send invalid URLs to the search engine like IE9 does. I have Comcast broadband at home. Comcast detects and captures any invalid URLs and displays its own custom DNS error page, a practice called DNS hijacking. A portion of the page is shown below.


Custom DNS error page. Image from Comcast

Notice that the first item is a sponsored link that has the title “ – Official Site” and has the URL that I wanted highlighted in green. Naturally, I clicked on it. After a few redirections, this is what I see:


It sort of looks like an Ikea home page. Image from

This looks like it could be the official IKEA site, but it isn’t. The domain name displayed in the address bar is not for but for, one of those credit card scam companies that is basically a phishing site. The top part of the page is designed to look like it is complete. But you will notice that the scroll bar indicates there is more content below the fold. If you are willing to scroll down, you’ll see the following disclaimer:

IKEA is a registered trademark of Inter IKEA Systems B.V. is not affiliated with IKEA®. All IKEA® trademarks are the property of IKEA® and does not, in any way, claim to represent or own any of the IKEA® trademarks or rights. IKEA® does not own, endorse, or promote or this promotion.

This Gift Program is not endorsed, sponsored by or affiliated with the manufacturers and retailers of the gift items listed above in anyway. All trademarks, service marks and logos are property of their respective owners.

Well, I guess that disclaimer may protect them from lawsuits by Ikea (trademark infringement) or from disgruntled customers and state attorneys general (fraud and deceptive trade practices). But I doubt it.

This sucks. Only a credulous rube would actually purchase a prepaid credit card. But everyone is forced to waste time figuring out that this is not the Ikea website and then either manually typing in the correct URL or going back to Comcast’s search page and clicking on a different link.

However, I don’t blame Comcast for this travesty, at least not directly. I believe the search results on the DNS server not found error page are provided by Yahoo (which uses Microsoft Bing as its search engine) and that Yahoo and Microsoft run the keyword auctions that populate the sponsored links. Thus, it is up to them to ensure that the green text in the sponsored link ads matches the domain that the user will be redirected to.
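The check itself is not complicated. Here is a minimal sketch in Python, assuming the ad system knows both the display URL shown in green and the final landing URL after all redirects. The function names and the example.com addresses are hypothetical, and comparing bare hostnames is a simplification; a real check would probably compare registrable domains using the Public Suffix List.

```python
from urllib.parse import urlsplit


def hostname(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    # urlsplit only recognizes the host when a scheme is present,
    # so add one for display URLs such as "www.example.com".
    if "://" not in url:
        url = "http://" + url
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host


def display_matches_destination(display_url, final_url):
    """True when the ad's displayed host equals the landing page host."""
    return hostname(display_url) == hostname(final_url)


# Hypothetical ad: the green text and the landing page agree.
print(display_matches_destination("www.example.com", "https://example.com/home"))
# Hypothetical ad: the green text says one site, the redirect ends at another.
print(display_matches_destination("www.example.com", "https://example.net/offer"))
```

An ad that fails this comparison could be held for review instead of being shown with the misleading green text.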

Today, Microsoft Bing announced that it would offer a free music download to the first 500,000 customers who signed up on its website. I went to Softpedia to read about it (see the screenshot below).


Bing story. Image from Softpedia

Notice the big download button above the story. I clicked on it, which led me to the following download page.


I clicked on the Download Free button, which downloads the file and opens the IE security warning dialog.


Do you notice something unusual here? The file isn’t an mp3 and the publisher isn’t Microsoft Corporation. Instead, it is an executable from some company called

Very clever. A company called SearchAle bought a Google display ad that leads people to believe that Microsoft Bing’s free music download is obtained by clicking a big button. I gotta quit clicking on big buttons like that. Though since I run on a Mac with a virtual Windows 7 machine, the worst that can happen is I ruin the VM and need to reimage it.

Google Books is a project sponsored by search engine giant Google to scan the pages of every book available, convert the scans to text using OCR, and make the resulting text corpus searchable. Notwithstanding any remaining copyright disputes surrounding the project, Google has reached an agreement with most of the major copyright holders (authors and the publishers that represent them) around the world. So far, Google has scanned over 15 million books, most of which are no longer in print or commercially available. This database is a treasure trove for history scholars.

Last week, Google released a new statistics tool for the Google Books project called the Ngram Viewer. It is simple to use. You enter a list of words or phrases separated by commas (each is called an n-gram and is case-sensitive), the language (American English and British English can be searched separately), the date range (from 1500 to 2008, though the number of books is sparse before 1780, which makes the early data very spiky), and the amount of moving average smoothing to apply (long trends are easier to see with smoothing, but the individual yearly data is lost).
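To make the smoothing setting concrete, here is a rough sketch in Python of a centered moving average. I am assuming that a smoothing of n averages each year with up to n years on either side, and that the window simply shrinks at the ends of the date range. The function name and the sample numbers are made up for illustration.

```python
def smooth(values, smoothing):
    """Centered moving average with a window of up to 2 * smoothing + 1 points."""
    smoothed = []
    for i in range(len(values)):
        # Clamp the window so it never runs off either end of the series.
        start = max(0, i - smoothing)
        end = min(len(values), i + smoothing + 1)
        window = values[start:end]
        smoothed.append(sum(window) / len(window))
    return smoothed


yearly = [1, 2, 3, 4, 5]   # made-up yearly frequencies
print(smooth(yearly, 1))   # → [1.5, 2.0, 3.0, 4.0, 4.5]
print(smooth(yearly, 0))   # smoothing of 0 leaves the data unchanged
```

With a larger smoothing value, a one-year spike gets spread across its neighbors, which is why the long trends become easier to see and the yearly detail disappears.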

The Ngram Viewer is the best time series graphing toy I’ve seen.

There have been lots of stories in the press showing interesting trends in the popularity of certain words and phrases in books. For instance, Jennifer Valentino-DeVries of the Wall St. J. shows that Merry Christmas beats Happy Holidays by a big margin.


Merry Christmas vs Happy Holidays, Image from Google labs

Slate’s Tom Scocca has been posting an Ngram of the Day on the site comparing the frequency of words like shopping vs salvation and television vs the Bible. He even does a comparison between words and shows the year in which the two cross in popularity. For instance, anxiety passes shame in 1942.


anxiety vs. shame. Image from Google labs
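If you download the yearly frequencies, finding the crossover year is a simple scan. This is only a sketch in Python: the function name is hypothetical and the numbers are invented, not taken from the Google data.

```python
def first_crossover(years, a, b):
    """Return the first year in which series a rises above series b
    after being at or below it the year before, or None if it never does."""
    for i in range(1, len(years)):
        if a[i - 1] <= b[i - 1] and a[i] > b[i]:
            return years[i]
    return None


# Invented frequencies for two words over four years.
years = [2000, 2001, 2002, 2003]
rising = [1.0, 1.4, 2.1, 2.5]
flat = [2.0, 2.0, 2.0, 2.0]
print(first_crossover(years, rising, flat))  # → 2002
```

On raw data, a noisy pair of series might cross several times, so it probably makes sense to smooth both series before looking for the crossing.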

Here’s a couple charts I created for the n-grams “independence” and “rebellion” in U.S. and British English. I have no idea what conclusions to draw from this data, but it just begs for an explanation based on unfounded speculation. There is a spike in both words in U.S. books in the 1770s, but no spike in British literature. Independence becomes more popular than rebellion in books from both countries after about 1820. The word rebellion has a spike again in the U.S. from 1860 to 1870. The absolute occurrence of both words is similar in the two countries starting around 1900. Independence shows a spike from 1940 to 1942 and another spike around 1968 to 1970.



independence vs. rebellion, British (top) vs. American (bottom). Images from Google labs.

For the truly hardcore programmer, the n-gram datasets are available for download from Google. Their use is covered under the Creative Commons Attribution 3.0 Unported license.

The next step is for Google to add its Ngram Viewer toolkit to its Public Data Explorer visualization tool (see Mar 2010 blog post) to allow animations and drill down. I can hardly wait.

[Update: I rescaled the first two graphs to normalize the time spans.]

by George Taniwaki

Facebook has a new application (or widget) currently in beta release called Questions that allows users to post questions and wait for other users to answer them. The questions are categorized into groups and users are shown questions that other people who have similar interests have answered. If you know anything about search and recommendation, you realize that Facebook is trying to solve two really hard computing problems simultaneously.

First, how do you categorize the questions? What keywords and contexts do you use? For instance, what weight do you give to the interests of the person asking the question? And how do you categorize those interests? What weight do you give to the length of the question? How do you handle misspelled words? Do you give any weight to the fact that any words are misspelled?

Second, how do you decide which questions to show which user? Should you predict if the potential answerer is actually qualified to answer the question? Is it more important to generate lots of responses or to get the correct response quickly? Or is it actually more important to entertain users with a stream of interesting questions, regardless of whether they answer them? (This would be really hard to predict since Facebook will never get any feedback from users regarding the questions they don’t answer.)

Community-run Q&A sites are not new. Yahoo! Answers and have both been around for years and are quite popular. However, I believe that most of the answers are written by a small group of dedicated users who vie for points and recognition. Facebook’s goal is to engage the entire community, since the longer you stay at their site, the more likely you are to click on some ads.

Anyway, I want to show some screenshots. This may violate some promise I made to Facebook. The first shows a few examples of questions from the Questions widget. The widget appears in the right column under the Sponsored links widget. Notice how many of the questions seem to be factual and could be more quickly (and correctly) answered using standard web-based research skills.



Two examples of the Questions widget. Image from Facebook

If you click on a question in Questions, you will be taken to a page showing all the responses for that question. You can then vote yea or nay for any response. The example below shows how introspective Facebook users are.


Question responses. Image from Facebook

Finally, if you click on the Asked about link, you will see a list of all the questions related to that category. Notice the example below for the category “Roots”. As I mentioned above, categorizing questions is tough. And was this question really asked by that Kristin Bell?


Category detail. Image from Facebook

Here’s a funny email exchange between my wife and me. I reversed the thread so that you can read it from top to bottom.

From: George Taniwaki
Sent: Friday, June 11, 2010 10:48 AM
To: Susan Wolcott
Subject: Using search engines to pick stocks

Check out this story,

From: Susan Wolcott
Sent: Friday, June 11, 2010 10:58 AM
To: George Taniwaki
Subject: RE: Using search engines to pick stocks

And the posted comments are interesting, but in a completely different way…

From: George Taniwaki
Sent: Friday, June 11, 2010 11:40 AM
To: Susan Wolcott
Subject: RE: Using search engines to pick stocks

Yeah. Who are these people and how do they decide 1) to read Technology Review and 2) write political rants?

From: Susan Wolcott
Sent: Friday, June 11, 2010 1:44 PM
To: George Taniwaki
Subject: RE: Using search engines to pick stocks

It’s part of that valuable cognitive surplus.


Sue and I obviously have too much time, er… valuable cognitive surplus, on our hands. If you don’t get the reference to cognitive surplus, read this book.

by George Taniwaki

Time series data are hard to display in a way that shows other relationships. That’s because one generally uses a static 2-dimensional chart. The x-axis is used to display time, leaving only one other axis to show another variable. The display can be greatly enhanced by using animation. Time can flow while the x and y axes can display relationship data or geographic data. This can aid in understanding how relationships vary over time.

Last week, Google announced the release of Google Public Data Explorer. This web-based data visualization tool (because all Google tools are web-based) takes time series data from public sources and provides a way to create animations. Data Explorer allows you to create line charts, bar charts, maps and scatterplots from web data. It has a pretty clear interface to change the axes. For scatterplots you can also change the color and size of the data points.

Google’s animated trend charts are based on technology called Trendalyzer that it acquired in 2007 from Gapminder, a nonprofit foundation in Sweden. I mentioned Gapminder in a 2008 blog post that has been accidentally deleted.

Below is an example scatterplot I created using World Bank data for fertility rate by life expectancy worldwide by country. The data covers 1960 to 2007. First note that in 1960, there was a strong correlation between high fertility rate and low life expectancy. Also note the strong correlation between high income and high life expectancy. Many researchers have shown that women are likely to have more children if they fear that some of the children will not grow to adulthood. Note the one outlier among the high income countries that in 1960 had low life expectancy (54 years) and high fertility (5.7). That country is South Korea. If you play the animation, you can see Korea move down and to the right over time.


Country fertility rate by life expectancy. Image by George Taniwaki

I’ve marked a few countries that display unusual behavior over time. China is an interesting outlier because in 1960 it was poor with a low life expectancy, but also with a low fertility rate. But if you play the animation, you see that this may have been a statistical anomaly because in 1962, the fertility rate jumps to 7.5. The fertility rate rapidly falls to under 3 as the one-child policy takes effect and continues to fall below the 2.1 replacement level in the 1990s. Rwanda, Timor-Leste, and Cambodia show short, horrifying drops in both life expectancy and fertility during genocide campaigns. Guinea-Bissau shows large fluctuations in fertility rate without a corresponding change in life expectancy. I can’t explain it. The country is very poor and perhaps its health statistics are unreliable, though this is true for many other countries. Lesotho and Zimbabwe are poor countries with high levels of HIV infection and AIDS that are causing both fertility rates and life expectancy to fall. The AIDS epidemic is taking a toll even on wealthier countries like South Africa.

Below is another scatterplot I created showing population by income in the U.S. by state. I would have liked to have shown income by education, city/rural ratio, or some other correlated variable, but no such variable was available in the dataset. I use a log scale for population since there is such a large range in state size. I also use a log scale for income because the data is not adjusted for inflation and I want the data range to show the change in income dispersion over time.


State population by income. Image by George Taniwaki
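The reason a log scale helps with unadjusted income is that equal ratios take up equal distances on the axis. A quick sketch in Python shows the idea; the function name and the dollar figures are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def log_position(value, lo, hi):
    """Fraction (0 to 1) of the way along a log-scaled axis running from lo to hi."""
    return (math.log10(value) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))


# Hypothetical axis running from $1,000 to $64,000.
low_step = log_position(2000, 1000, 64000) - log_position(1000, 1000, 64000)
high_step = log_position(32000, 1000, 64000) - log_position(16000, 1000, 64000)

# Doubling from $1,000 and doubling from $16,000 cover the same distance.
print(math.isclose(low_step, high_step))
```

So if inflation doubles every state’s income, the whole cloud of points shifts by a fixed amount without stretching, and any change in the spread between states reflects a real change in relative incomes.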

One of the most surprising things about the chart is that over the 40 years the rankings of state incomes don’t vary much. In 1969 the poorest state was Mississippi, and all the poorest states were in the southeast census region. If you play the animation, you’ll see some horizontal reordering, but not much. Even fast growing states like Nevada and Arizona don’t change order. They shoot up, but their income ranking doesn’t move much. Same with states with falling populations like Michigan and Ohio. The only exceptions are small states and districts like Hawaii, Alaska, and Washington DC. They zip around like flies. And maybe that explains why the data doesn’t move much. State level data is too coarse and it would be better to see data for the 100 largest cities or some other finer geographic region.


About a year ago, my friend Steve Duenser pointed me to a couple of nice animations by FlowingData that show the growth of Target and Wal-Mart. The animations use Modest Maps, a Flash-based mapping tool.

By an odd coincidence, the three biggest discount store chains, Wal-Mart, Target, and Kmart, all started in 1962, which makes a time series comparison of store openings easier. Target is headquartered in Minneapolis and slowly began to expand throughout the U.S., jumping to large metropolitan areas. Wal-Mart grew much more quickly, but was invisible to most people because it concentrated on rural areas near its headquarters in Bentonville. It blanketed the southeastern U.S. before its steady expansion to the rest of the country. The videos showing geographic distribution make the different growth strategies easier to observe than a tabular list would. It would be even more obvious if the two animations were combined to show overlays.



Target vs Wal-Mart. Images from FlowingData


Finally, my friend Carol Borthwick pointed out an interesting way of visualizing Twitter feeds across both time and space. An interactive map was shown in The New York Times after the Super Bowl last year. It’s a fun chart and makes you realize that there’s a lot of data on the web that’s just waiting to be mined. Unfortunately, there is no story indicating how the data was collected and organized, what tools were used, or how you can make your own Twitter charts.


Map of Popular Twitter comments in real-time. Image from New York Times

[Update: I added a paragraph stating that Google’s animated trend charts are based on technology it acquired from Gapminder.]

Google has an interesting new search feature. If you run a query on a band name, the top result is a sample of a few songs. I ran a search for “Fleetwood Mac” using IE on Windows and got these 4 songs from their 2 best-selling albums in the 1970s with a button to play the sample from


However, when I run the same query using Safari on my Mac, I get these results from


Why do the results on different platforms use music from different sources? And now that Apple bought Lala, will it become the music provider when users run queries on Macs too?

A more interesting question is, what album are these songs from? The tag says “Greatest Hits – 2009”, but these songs are all live versions that I’ve never heard before. Fleetwood Mac is on tour this year, but they don’t have a new live album out (that I know of). And these songs feature the voice of Christine McVie who retired from the band ten years ago, so I’m guessing they must be recordings from the 1990s. If you click on the link to the iLike website, it takes you to a download page for the album versions of these songs, not the live versions you just listened to.

Anyone who knows how I can get these songs, please post a comment.

[Update: I confirmed that using Safari on Windows gives the same results as using IE; the sample songs are from]