by George Taniwaki

Did you watch the debate on Monday night? I did. But I am also very interested in the post-debate media coverage and analysis. This morning, two articles that combine big data and the debate caught my eye. Both are novel and much more interesting than the tired stories that simply show changes in polls after a debate.

First, the New York Times reports that during the presidential debate (between 9:00 and 10:30 PM EDT) there was a high correlation between the Betfair prediction market for who will win the presidential election and after-hours S&P 500 futures prices (see chart 1).


Chart 1. Betfair prediction market for Mrs. Clinton compared to S&P 500 futures. Courtesy of New York Times

Correlation between markets is not a new phenomenon. For several decades financial analysts have measured the covariance between commodity prices, especially crude oil, and equity indices. But this is the first time I have seen an article illustrating the covariance between a “fun” market for guessing who will become president and a “real” market. Check out the two graphs above: the similarity in shape is striking, including the fact that both continue to rise for about an hour after the debate ended.

In real time, while the debate was being broadcast, players on Betfair came to believe the chance Mrs. Clinton would win the election had risen by 5 percentage points. Meanwhile, the price of S&P 500 futures rose by 0.6%, meaning investors (who may be the same speculators who play on Betfair) believed stock market prices in November were likely to be higher than before the debate started. There was no other surprising economic news that evening, so the debate is the most likely explanation for the surge. Pretty cool.

If the two markets were perfectly correlated (they aren’t) and markets were perfectly efficient (they aren’t), then one could estimate the difference in equity futures market value between the two candidates. If a 5% decrease in the likelihood of a Trump win translates to a 0.6% increase in equity futures values, then the difference between Mr. Trump and Mrs. Clinton being elected (a 100% change in probability) results in about a 12%, or $1.2 trillion, change in market value (the total market cap of the S&P 500 is about $10 trillion). (Note that I assume perfect correlation between the S&P 500 futures market and the actual market for the stocks used to calculate the index.)

Further, nearly all capital assets (stocks, bonds, commodities, real estate) in the US are now highly correlated. So the total difference is about $24 trillion (assuming total assets in the US are $200 trillion). Ironically, this probably means Donald Trump would be financially better off if he were to lose the election.
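The back-of-envelope arithmetic in the last two paragraphs can be written out in a few lines. Every input is an estimate quoted in this post (not live market data), and the perfect-correlation and market-efficiency caveats still apply, so treat the outputs as rough magnitudes, not forecasts.

```python
# Back-of-envelope estimate of the market value difference between the
# two election outcomes, using only the figures quoted in the post.

prob_change = 0.05        # Betfair: Clinton's win chance rose ~5 points
futures_change = 0.006    # S&P 500 futures rose ~0.6%
sp500_market_cap = 10e12  # total S&P 500 market cap, ~$10 trillion
total_us_assets = 200e12  # total US capital assets, ~$200 trillion

# Assume a linear relationship and scale the 5-point move up to a
# 100-point (certain) change in the election outcome.
value_swing = futures_change / prob_change    # 0.12, i.e. 12%

sp500_swing = value_swing * sp500_market_cap  # ~$1.2 trillion
total_swing = value_swing * total_us_assets   # ~$24 trillion

print(f"Implied S&P 500 swing: ${sp500_swing / 1e12:.1f} trillion")
print(f"Implied total-asset swing: ${total_swing / 1e12:.0f} trillion")
```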


The other article that caught my eye involves Google Trends data. According to the Washington Post, the phrase “registrarse para votar” was the third highest trending search term the day after the debate was broadcast. The number of searches was about four times higher than in the days prior to the debate (see chart 2). Notice that the spike in searches matches a spike in Sep 2012 after the first Obama-Romney debate.

The article says that it is not clear if it was the debate itself that caused the increase or the fact that Google recently introduced Spanish-language voting guides to its automated Knowledge Box, which presumably led to more searches for “registrarse para votar”. (This is the problem with confounding events.)

After a bit of research, I discovered an even more interesting fact. The spike in searches did not stop on Sep 27. Today, on Sep 30, four days after the debates, the volume of searches is 10 times higher than on Sep 27, or a total of 40x higher than before the debate (see chart 3). The two charts are scaled to make the data comparable.


Chart 2. Searches for “registrarse para votar” past 5 years to Sep 27. Courtesy of Washington Post and Google Trends


Chart 3. Searches for “registrarse para votar” past 5 years to Sep 30. Courtesy of Google Trends

I wanted to see if the spike was due to the debate or due to the addition of Spanish voter information to the Knowledge Box. To do this, I compared “registrarse para votar” to “register to vote”. The red line in chart 4 shows Google Trends data for “register to vote” scaled so that the bump in Sep 2012 is the same height as in the charts above. I’d say the debate really had an unprecedented effect on interest in voting, and the effect was probably bigger for Spanish-speaking web users.


Chart 4. Searches for “register to vote” past 5 years to Sep 30. Courtesy of Google Trends
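The rescaling used to make charts 2 through 4 comparable can be sketched as follows. The series values below are invented for illustration; only the method, multiplying one series so that a shared reference bump (here, Sep 2012) has the same height in both, reflects what the charts do.

```python
# Hypothetical sketch of rescaling one Google Trends series so that a
# shared reference bump has the same height in both series.
# All values below are made up for illustration.

espanol = [2, 3, 25, 4, 3, 5, 40]       # "registrarse para votar"
english = [20, 22, 80, 25, 24, 30, 95]  # "register to vote"

ref = 2  # index of the Sep 2012 bump in both series

# Multiply the English series by a constant so the bumps line up.
factor = espanol[ref] / english[ref]
english_scaled = [round(x * factor, 2) for x in english]

# After scaling, the two series can be plotted on one axis and their
# post-debate spikes compared directly.
print(english_scaled)
```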

Finally, I wanted to see how the search requests were distributed geographically. The key here is that most Hispanic communities vote Democratic and many states with a large Hispanic population are already blue (such as California, Washington, New Mexico, New Jersey, and New York). The exception is Florida with a large population of Cuban immigrants who tend to vote Republican.


Chart 5. Searches for “registrarse para votar” past 5 years to Sep 30 by county. Courtesy of Google Trends

If you are a supporter of Democrats like Mrs. Clinton, the good news is that a large number of queries are coming from Arizona and Texas, two states where changes in demographics are slowly turning voting preferences from red to blue.

In Florida, it is not clear which candidate gains from an increased number of Spanish-speaking voters. However, since the increase is a result of the debate (during which it was revealed that Mr. Trump had insulted and berated a beauty pageant winner from Venezuela, calling her “miss housekeeping”), I will speculate that many newly registered voters are going to be Clinton supporters.

If the Google search trend continues, it may be driven by news reports that Mr. Trump may have violated the US sanctions forbidding business transactions in Cuba. Cuban-Americans searching for information on voter registration after hearing this story are more likely to favor Mrs. Clinton.

by George Taniwaki

In a Dec 2011 blog post, I critiqued an article in The Fiscal Times that compared the cost of eating a meal at home against dining out at a restaurant. The article purported to show that eating at a restaurant was cheaper. I pointed out the errors in the analysis.

One of the errors was in the way data for expenditures for grocers and restaurants were shown in a line graph. The two lines were at different scales and aligned to different baselines making comparisons difficult. The original and corrected charts are shown below. Correcting the baseline makes it clear that restaurant expenditures are significantly lower than for groceries. Correcting the scale shows that restaurant expenditures are not significantly more volatile than for groceries.


Figures 1a and 1b. Original chart (left) and corrected version (right)

Another error I pointed out in the article was the assumption that the lower inflation rate of meals at restaurants compared to meals at home should favor eating more meals at restaurants. I didn’t give an explanation why it shouldn’t. I will do so here.

Consider an office worker who needs to decide today whether to make a sandwich for lunch or to buy a hamburger at a restaurant. Let’s say she knows that the price of bread and lunch meat has doubled over the past year (100% inflation rate) while the cost of a hamburger has not changed (0% inflation rate). Which should she buy?

The answer is, she doesn’t have enough information to decide. The inflation rate over the past year is irrelevant to her decision today, or at least it should be. What is relevant are the actual costs and utilities today.

Let’s say she likes shopping, making sandwiches, and cleaning up, so the opportunity cost for the sandwich option is zero. Let’s also say she values sandwiches and hamburgers equally and doesn’t value variety. Now, if the price today of lunch meat and bread for a single sandwich is 50 cents while a hamburger is 75 cents, then she should make a sandwich. Next year, if inflation continues as before, making a sandwich will cost $1.00 while a hamburger remains 75 cents. In that case, she should buy a hamburger. But that decision is in the future.

Let’s consider an extreme case where inflation rates may affect purchase decisions today. What if the price of sandwich fixings is 50 cents today but inflation is expected to be 100% during the work week (so prices will be $.57, $.66, $.76, and $.87 over the next four days)? Such high inflation rates are called hyperinflation and can lead to severe economic distortions.

Let’s also assume hamburgers are 75 cents today and will remain fixed at that price by law. (Arbitrary but stringent price controls are another common feature of an economy experiencing hyperinflation.) Further, let’s assume that sandwich fixings can be stored in the refrigerator for a week for future use but hamburgers cannot be bought and stored for future consumption.

Finally, let’s assume it is early Monday and our office worker has no sandwich fixings or hamburgers but has $5 available for lunches for the upcoming week. What should she buy each day?

I would recommend trying to buy $2.50 in sandwich fixings today (enough for 5 sandwiches at today’s 50-cent price). Here’s why. During a period of hyperinflation, you want to get rid of money as fast as possible because cash loses its value every day you hold it. Thus, buying as much food as possible today is a good investment (called a price hedge).

Ah, you say. Why not make sandwiches the first two days of the week and then switch to the relatively cheaper hamburgers for the last three days? That is unlikely to work because the restaurant is caught between paying rising prices for the food it buys while getting a fixed price for what it sells. Long lines will form as customers seek cheap food. The restaurant will either run out of food, go bankrupt, or close its doors. Regardless, our office worker shouldn’t rely on her ability to buy cheap hamburgers later in the week.
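The three possible lunch strategies can be compared with a short script, using the prices and the 100%-per-week inflation rate assumed above:

```python
# Compare three lunch strategies under the post's hyperinflation
# scenario: sandwich fixings cost $0.50 today and double in price over
# the 5-day week, while hamburgers are fixed at $0.75 by law.

daily_factor = 2 ** (1 / 5)  # 100% inflation spread over 5 days

# Daily price of fixings for one sandwich, rounded to the cent.
fixings = [round(0.50 * daily_factor ** day, 2) for day in range(5)]

buy_all_monday = 5 * fixings[0]   # stock up Monday at today's price
buy_fixings_daily = sum(fixings)  # pay the inflated price each day
hamburgers_daily = 5 * 0.75       # assumes the restaurant stays open

print(fixings)
print(buy_all_monday, buy_fixings_daily, hamburgers_daily)
```

Stocking up on Monday is the cheapest of the three totals, and it does not depend on the price-controlled restaurant staying in business.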

So why am I updating a blog post from almost two years ago? Well, it’s because I noticed a big spike in traffic landing on it last week. It turns out my wife, Susan Wolcott, assigned it as a reading for a class she is teaching to undergraduate business students at Aalto University School of Business, in Mikkeli, Finland. (The school was formerly known as the Helsinki School of Economics.)

Normally, this blog receives about 30 page views a day. On days that I post an entry on kidney disease or organ donation (pet topics of mine) traffic goes up. Of a typical day’s 30 hits, I presume about half are not human; they come from web crawlers looking for sites to send spam to (I receive about 15 spam comments a day on this blog).

But check out the big spike in page views for my blog on the day my wife assigned the reading. This blog received 264 page views from 91 unique visitors. That’s the kind of traffic social media experts would die for. Maybe I’ve hit upon an idea for generating lots of traffic to a website: convince college professors to assign it as required reading for a class.


Figure 2. Web traffic statistics for this blog

Naturally, I expect another big spike of traffic again today when my wife tells her students about this new blog post.

by George Taniwaki

I recently came across a fascinating article with the novel claim that life is older than the earth. The argument is based on a regression analysis that estimates the rate of increase in the maximum genomic complexity of living organisms over time. (Note, this argument only makes sense if you believe that the universe can be more than 5,000 years old, that genes mutate randomly and at a constant rate, that genetic changes are inherited and accumulate, and that mathematics can be used to explain how things work. Each of those assumptions can be argued separately, but that is beyond the scope of this blog post.)

The article, entitled Life Before Earth, was posted in Mar 2013. The authors are Alexei Sharov, a staff scientist at the National Institute on Aging, and Richard Gordon, a theoretical biologist at the Gulf Specimen Marine Laboratory.

They draw a log-linear plot of the number of base pairs in a selection of terrestrial organisms against the time when each organism first appeared on earth (see Fig 1). For instance, the simplest bacteria, called prokaryotes, have between 500,000 and 6 million base pairs in their genome and are believed to have first appeared on earth about 3.5 billion years ago. At the other extreme, mammals, including humans, have between about 2.7 billion and 3.2 billion base pairs in their genome. The fossil record indicates the first mammals appeared about 225 million years ago, during the Triassic period. All other known organisms can be plotted on these axes and the trend appears linear, meaning the growth in genome complexity is nearly exponential.

Extrapolating the data back in time, one can estimate when the maximum complexity was only one base pair. That is, the date when the first protoörganisms formed. The trend line indicates this occurred 9.7 billion years ago, or about 4 billion years after the big bang.


Figure 1. The growth in maximum base pair count per genome seems to grow exponentially over time. Image from
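A rough two-point version of the extrapolation can be reproduced from just the figures quoted above. Sharov and Gordon fit many more organisms, which is why their estimate of 9.7 billion years differs somewhat from this sketch:

```python
import math

# Two anchor points from the text: (billions of years ago, base pairs).
prokaryotes = (3.5, 1e6)   # first bacteria, ~10^6 bp
mammals = (0.225, 3e9)     # first mammals, ~3 x 10^9 bp

# Slope of log10(base pairs) per billion years on the log-linear plot.
slope = (math.log10(mammals[1]) - math.log10(prokaryotes[1])) / (
    prokaryotes[0] - mammals[0]
)

# Extrapolate back to log10(bp) = 0, i.e. a single base pair.
origin = prokaryotes[0] + math.log10(prokaryotes[1]) / slope
print(f"Estimated origin of life: {origin:.1f} billion years ago")
```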

The earth is estimated to be only 4.5 billion years old. Thus, if these results are accepted, the implications are pretty astounding.

1. Life did not start on earth. It started somewhere else in the galaxy and reached the microörganism stage. Through some cosmic explosion, that life was transported here. Alternatively, life started on one of the planets around the star that went supernova before collapsing and forming our present-day sun. This hypothesis is called exogenesis

2. It is unlikely that these alien microörganisms only landed on earth as our solar system formed. They probably coated every asteroid, comet, planet, and moon in our solar system. They may still be alive in many locations and are either evolving or dormant

3. If all of the microörganisms that reached our solar system came from the same source, they likely have the same genetic structure. That is, if we find life elsewhere in our solar system, it is likely to contain a right-handed double helix of DNA built from the same four nucleotide bases, and proteins built from the same left-handed amino acids, as life on earth. With effort, we could construct an evolutionary tree that contains these organisms

4. In fact, the same microörganisms may be very common throughout the galaxy, meaning life has arrived on many other planets, or perhaps every planet in our galaxy, even ones with no star, a hypothesis called panspermia

5. It solves Fermi’s paradox. In 1950, Enrico Fermi noted that our Sun is a young star and that there are billions of stars in our galaxy billions of years older. Why haven’t we already been contacted by intelligent life from other planets? The answer based on this analysis is because intelligent life takes 9.7 billion years to form and we may be one of the first organisms to reach the level of intelligence necessary to achieve intentional interstellar communication and travel. Ironically, if Sharov and Gordon are right, unintentional interstellar travel is already quite common and has been for billions of years.

Relationship to Moore’s Law

If the exponential growth of complexity shown in Figure 1 above looks familiar, it is because it is the same shape as the increase in the number of transistors in microprocessor chips over time, a relationship called Moore’s Law. The authors cite this analogy as a possible argument in favor of their case.


Figure 2. The growth in maximum transistor count per processor has grown exponentially over time. Image from Wikipedia

Is this reasonable?

Like I said, this is a fascinating article. But it is all speculation. We have no direct evidence of many of the claims and inferences made in this paper. Specifically, we don’t know:

  1. The exact size of the genome that various organisms had in the earth’s past
  2. The nature of possible organisms less complex than prokaryotes
  3. The existence of any alien microörganisms or evidence that any landed on early earth
  4. The speed of genetic complexity changes in the early earth environment, or on other planetary environments in the case of the alien microörganisms prior to arrival on earth
  5. Whether any modern earth organisms, or any potential alien microörganisms, could withstand the rigors of travel through space for the millions of years it would take to get to earth from another star system

Finally, we have no clear explanation why the rate of change in genome complexity should be exponential. The use of the Moore’s Law chart to show that exponential growth in complexity is reasonable is slightly disingenuous. Moore’s Law is generally used as a forecast for the future growth in complexity of a commercial product based on historical data. Further, the forecast is used to estimate product demand, research costs, and necessary production investment, all of which tend to drive advancements and make the prediction self-fulfilling.

On the other hand, genome complexity is not directed. Evolution is a random process that will generate greater complexity only if a new, more complex organism can take advantage of an ecological niche that cannot be exploited by simpler organisms. Nothing is driving greater genome complexity.

Anyway, this is a very controversial theory. But I believe it may lead to new insights regarding microbiology, astrobiology, the molecular clock hypothesis, and the use of mathematical models in evolutionary biology.


How long is 13.7 billion years?

As a side note, sometimes we wonder, how could something as complex as DNA and cellular organisms have started from nothing? It seems impossible to comprehend. But if you take a look at Figure 1, you will see that it may have taken over six billion years for a single base pair of DNA to grow in complexity to form an organism that we think of as a simple, primitive prokaryote. Then it took another 3.5 billion years before mammals appeared. Then it only took 200 million more years before our human ancestors appeared. And finally only a few hundred thousand years passed before you and I were born.

To give you a feel for how long 13.7 billion years is, watch this extremely boring YouTube video that compresses each billion years into a one-minute increment.


Figure 3. Age of Universe, boring, even with music. Video still from Craig Hall


A final thought to close this blog post: there may be aliens living on earth, but don’t be afraid, because it’s us.

by George Taniwaki

A Dec 2011 article in The Fiscal Times purports to show that eating at restaurants is cheaper than cooking at home. It’s an intriguing idea that has appeared in many articles in the past. However, the analysis presented in The Fiscal Times article is flawed and the conclusions are not supportable.

Before going into the specifics of the errors in The Fiscal Times article, let’s consider how one could compare whether cooking at home is more expensive than eating at restaurants. The typical cost-benefit analysis for eating in versus dining out goes something like this. Cooking a meal at home isn’t free. From a classical economic point of view one should include the opportunity cost of the time needed to buy groceries, drive them home, store them, prepare a meal, and clean up afterwards. Further, one should include the implicit rental value of the automobile used to transport the groceries and of the kitchen and dining room used to prepare and serve the meal.

However, shopping, cooking, and cleaning are not just chores that one is required to do. They are a form of entertainment, social interaction, and a way to share your skills with others as Nathan Myhrvold insightfully states in this Dec 2011 Slate interview. The cook receives utility from hosting a meal, even if it is a regular daily event. Naturally, if one hates to shop, cook, or clean, then there can be disutility as well. When deciding whether to eat at home or dine out, a person will want to maximize the expected utility from the decision.

Examining the wrong factors

The Fiscal Times article briefly mentions some of the above factors, but then totally ignores them when doing the price comparisons. Instead, it mentions differing inflation rates between in-home meals and restaurant meals. Relative inflation rates should be irrelevant to the decision to eat at home or dine out. The author also throws in a few additional factors that seem equally irrelevant in comparing costs:

“We also didn’t factor in whether one meal or another would be healthier, or friendlier to the environment. But that’s part of the point: Eating right and finding the extra savings that could be had by comparison shopping comes with a time trade-off that many families can’t afford to make these days.”

Hard to interpret charts

The Fiscal Times article has two time series charts which I will reproduce below. Some of the problems I found in the first chart:

  1. For some reason the first chart is labeled Chart 2 and the second is labeled Chart 1.
  2. Chart 2 (the first chart) has two different scales (the left scale has a range of 1.4% while the right scale has a range of 0.4%) even though both display values from the same dataset (percent share of consumption). This means the data using the right scale will appear to be more variable
  3. The black arrows both point toward the right scale, though “Grocers” (what’s with the quotation marks?) says it is set to the left hand scale (LHS)
  4. Neither scale shows the 0% origin point or the 100% end point. (Note that if the scale did go from 0 to 100%, there would be no need for two different scales.)
  5. Assuming the left scale applies to “Grocers” and the right scale to “Restaurants”, then the “Grocers” share is always above the “Restaurants” share. The lines should not cross as they do in the chart
  6. There is no source attribution for the data so no way to judge how valid it is or to review the original data


The second chart (entitled Chart 1) also has several flaws.

  1. The color code has been reversed. Dining out share was shown in the blue line in the first chart, while inflation is shown in gold. Similarly, eat at home share was shown in gold in the first chart, while inflation is in blue
  2. The label for each line has changed. “Restaurants” in the first chart is now called Food away from home while “Grocers” is now Food at home
  3. The use of different labels makes one wonder if the same assumptions, data sets, and cost allocations are used in the two charts and whether the same analysts produced both charts. My guess is no, which means the two charts cannot be used together
  4. As mentioned above, relative inflation rate should not directly impact the consumer’s choice to eat at home or at a restaurant, so this chart isn’t very useful


Nonequivalent price comparisons

The Fiscal Times article includes a slideshow that compares the cost of selected meals at restaurants with the cost of preparing the meal at home. In five out of six cases, the restaurant meal is cheaper.

If you simply compare the price of store-bought food with the price of a cooked meal at a restaurant, there is almost no way the price of food ingredients in a competitively priced retail store could exceed the price of the meal in a non-subsidized restaurant. Certain restaurants can serve meals at lower than expected prices because of subsidized food (school lunch programs), volunteer labor (homeless shelters or church meal programs), or subsidized rent (canteen stores or cafeterias in office buildings).

So how did The Fiscal Times get these unlikely results? I think the following errors were made:

  1. The restaurant meal prices are for a single serving while the grocery store prices are for full cans, boxes, or other packages. This will provide much more food than the restaurant meal
  2. The grocery store prices are for FreshDirect, a grocery delivery service in New York. Delivery groceries are more expensive than self-serve and NYC is the most expensive city in the U.S.
  3. The restaurant meal prices exclude the tip
  4. The grocery store prices include some prepared deli foods. Grocery store deli food can be more expensive than restaurant food since it is an impulse buy

*An orangery is a British term for a greenhouse. They were mostly used to grow citrus fruit (hence the term orangery). Now most citrus fruit in Great Britain is imported. One of the most famous orangeries is located in the Royal Botanic Gardens in London. It is now used as a restaurant where people dine out, not a greenhouse used to grow oranges that people eat at home.

[Update: Expanded the footnote to make clear the irony of an orangery being used as a restaurant.

For a clearer explanation why differing inflation rates should not affect the choice between eating in and going to a restaurant, see this Sep 2013 blog post.]

Charts showing data over time, called time series, are an excellent way to show trends and to make forecasts for the future. However, time series data may need to be adjusted before plotting to avoid errors in analysis.

First, if the data being displayed is in monetary units (for instance, dollars), then one has to be careful to take into account the effect of inflation. For short periods, omitting inflation isn’t a critical error. However, as the time scale increases, the error can become significant, especially if the time scale includes the 1970s and 1980s when inflation was quite high.

When looking at historical price or monetary data, the unadjusted raw dollar amounts are called nominal values. The adjusted dollar amounts are called real values. (One has to be aware that there are many different inflation adjustments available, but that is beyond the scope of this blog post.)
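As a sketch, converting nominal dollars to real dollars is just a ratio of price index values. The CPI-U figures below are approximate annual averages and the sales numbers are illustrative; for real work, pull the official series from the BLS:

```python
# Convert nominal dollars to real (inflation-adjusted) dollars using a
# price index. Index values are approximate CPI-U annual averages and
# the sales figures are for illustration only.

cpi = {1999: 166.6, 2009: 214.5}
nominal_sales = {1999: 14.6, 2009: 7.7}  # $ billions, nominal

base_year = 2009  # express everything in 2009 dollars

real_sales = {
    year: value * cpi[base_year] / cpi[year]
    for year, value in nominal_sales.items()
}
print(real_sales)  # 1999 sales are ~$18.8 billion in 2009 dollars
```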

Second, when comparing country economic data such as gross domestic product (GDP) across multiple years, it should be corrected for population growth. Both the U.S. and the world population are much larger today than 50 years ago while the populations of many European countries and Japan are stable or shrinking. Thus, when comparing the well-being of these countries, one should use changes in per capita real GDP. Notice that U.S. nominal GDP must grow about 3 percent a year (roughly 2 percent inflation plus 1 percent population growth) just to keep per capita real GDP constant.

Finally, when comparing spending on a particular good or service, the data should be adjusted for changes in household income, consumption patterns, and the quality of the good or service. This is often difficult to do and there is disagreement on the best way to do it. The biggest example is the change in house prices. The median house today is much larger than a house 20 years ago, and the quality of appliances and materials is much higher. These quality improvements apply to the median home and are even greater in high-end homes, which skews the average price upward.

Consumption patterns change as income increases. The median U.S. and world household income is higher today than it was 50 years ago, even after adjusting for inflation. However, much of this gain has gone to college graduates who work in jobs located in big cities on the east and west coasts. This has driven the price of houses upward in cities like New York, Boston, San Francisco, and Los Angeles relative to houses in smaller cities away from the coasts. The interaction between income and house prices is hard to analyze separately.

Rising income also affects the composition of goods and services a household consumes. As a proportion of income, people spend less on food and durable goods and more on entertainment, education, and health care.

Example of a badly drawn chart and a corrected version

In January 2011, Bain & Company released a report on electronic publishing that included a warning that publishers should not repeat the mistakes of the music industry. The warning was illustrated with the following chart.


The rise and fall of the music industry. Image from Bain

Michael DeGusta picks apart this chart in a Feb 2011 Business Insider article. He notes that the chart makes it look like the music industry grew steadily from 1973 to a peak in 1999 and has since fallen about 40%.

However, he corrects the chart for changes in purchasing power and population growth as shown in the revised chart below.


The rise, fall, rise, and collapse of the music industry. Image from Business Insider

This revised chart is much more interesting. It shows that real per capita music sales peaked in 1978 and began to fall until CDs arrived. Sales reached a new peak in 1999, but then fell. Digital music sales have not been able to stem the fall, and per capita sales in 2009 were 63% (not 40%) below the level of ten years earlier.

Mr. DeGusta goes on to provide an insightful analysis of why sales of digital music have not been able to replace CDs. Read his blog for more details.

A front-page story in the Jan 2011 Chron. Higher Educ. claims that newly elected Republican governors are more likely to cut spending than Democrats or even incumbent Republicans. Since education (which includes both K-12 and higher ed) is one of the largest components of state budgets, this implies that higher education expenditures will fall in states where governors plan to make budget cuts.

As evidence, the article contains the graphic shown below. The chart shows the total projected budget shortfall in 2012 for six states compared to the higher education appropriation in 2010.


Look at all the pretty circles. Image from Chron. Higher Educ.

There are so many problems with this graphic that it is hard to know where to begin. But I will try to critique it.

1. The most serious problem is that this chart displays a meaningless comparison. There is no a priori reason to believe that higher-education appropriations in FY2010 and total projected budget shortfall two years later in FY2012 should be correlated. Nor should one expect the two numbers to be about the same size. My guess is the fact that the two numbers appear to be the same magnitude for these six states for these two years is just a coincidence.

2. The chart should include more years. If expenditures and budget gap are correlated, then the relationship should be stable over time. Prove it by showing time series data of total budget and higher-education appropriations during previous years. It would be even better if data from previous recessions was available. My guess is that the numbers vary greatly and the relationship disappears.

3. The chart should include more states. The article and the chart predict newly elected Republican governors will cut higher education budgets more than other classes of governors. But the chart only shows data for six states with newly elected Republican governors. There is no comparative data for incumbent Republicans or any Democrats. Again, if the relationship is due to the party affiliation and incumbency of the governor, then showing data for all states will emphasize it. My guess is such data would show there is little or no relationship between governor status and budgets.

4. The chart should be adjusted to aid comparisons. The size of state budgets and higher education expenditures vary because of many factors. At a minimum, the data should be adjusted for population. A few other possible adjustments that come to mind are median household income, population of 18-25 year olds, proportion of population with college degrees (presumably, grads will be more likely to favor higher ed spending), enrollment at in-state public colleges (presumably parents of current students will be more likely to favor higher ed spending), etc.

5. The chart should not use areas to represent linear values. Look at the two circles under Ohio. The red circle representing the budget gap of $3,000 million is more than twice as large as the orange circle showing the higher education appropriation of $1,900 million even though $3,000 million is less than 60% bigger than $1,900 million. That’s wrong. The area of a circle is proportional to the square of the diameter. One of the basic rules of drawing a graph (for example, see Edward Tufte’s The Visual Display of Quantitative Information) is to make sure the size and shape of data markers actually assist in understanding the values being measured.
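The fix for point 5 takes only a few lines: to keep circle areas proportional to the values they represent, the radius must scale with the square root of the value. Using the two Ohio figures:

```python
import math

budget_gap = 3000  # Ohio budget gap, $ millions
higher_ed = 1900   # Ohio higher education appropriation, $ millions

# Wrong: making the diameter proportional to the value squares the
# visual (area) ratio.
wrong_area_ratio = (budget_gap / higher_ed) ** 2  # ~2.49, looks >2x

# Right: radius proportional to sqrt(value) keeps the area ratio equal
# to the value ratio.
r_gap = math.sqrt(budget_gap)
r_ed = math.sqrt(higher_ed)
right_area_ratio = r_gap ** 2 / r_ed ** 2  # ~1.58, matches the $ ratio

print(round(wrong_area_ratio, 2), round(right_area_ratio, 2))
```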


My proposed improvement to the original chart is shown below. The data come from the same sources used in the Chron. Higher Educ. chart: state budget gaps from the Center on Budget and Policy Priorities, state higher education expenditures from the Illinois State Univ. Grapevine Project, and political alignment from the Nat. Conf. State Legis. I also added U.S. Census annual population estimates and census counts so that I could calculate per capita state budget gaps and higher education expenditures. The data are posted on SkyDrive in case you want to create your own charts.

My scatterplot shows, on the horizontal axis, the cumulative budget gap per capita at the end of FY2009 for each state and, on the vertical axis, the increase or decrease in higher education expenditures from state monies (excluding federal stimulus monies) from FY2009 to FY2010. My hypothesis is that deficits at the end of one fiscal year may affect changes in expenditures in the following year.

Each data point is colored to show the political affiliation of the governor (using blue for Democrat and red for Republican). The marker shape indicates if the party changed hands (triangle) or if it did not (square or diamond) in the 2010 election. The green line shows the least squares line through the data (not weighted for state population).
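The green trend line is an ordinary unweighted least-squares fit, which takes only a few lines to reproduce. The figures below are invented for illustration and are not the actual state data:

```python
def least_squares(xs, ys):
    # Unweighted least-squares line: slope = cov(x, y) / var(x),
    # intercept chosen so the line passes through the means.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical per-capita figures, for illustration only:
# FY2009 budget gap ($/person) vs. FY2010 change in spending ($/person)
gap = [100.0, 200.0, 300.0, 400.0]
change = [10.0, 5.0, -5.0, -10.0]

slope, intercept = least_squares(gap, change)
print(slope, intercept)  # slope ≈ -0.07, intercept ≈ 17.5
```

A fit weighted by state population would be a reasonable alternative; I deliberately left the line unweighted, as noted above.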


Look at all the tiny dots. Chart by George Taniwaki

I didn’t spend much time analyzing this chart, but a few things caught my eye. First, the slope of the regression line is negative, meaning that states with the largest deficits tended to cut education expenditures the least. Specifically, California, with the largest deficit at the end of FY2009 at $1,004 per capita, actually increased higher education expenditures in FY2010 by $7 per capita.

Second, the states with the largest budget gaps at the end of FY2009 were more likely to have Republican governors. They were also more likely to elect a Democratic governor in 2010. The three states that fit this pattern are California, Rhode Island, and Connecticut.

Finally, the states that made the biggest cuts in higher education expenditures in FY2010 were more likely to have a Democratic governor. They were also more likely to elect a Republican governor in 2010. The three states that fit this pattern are Wyoming, Iowa, and New Mexico.

It could be that these correlations are spurious. But my conclusions are no less reliable than those drawn from the weak analysis in the original chart.

Much thanks to Susan Wolcott who pointed out the original chart to me.

Google Books is a project sponsored by search engine giant Google to scan the pages of every book available, convert the scans to text using OCR, and make the resulting text corpus searchable. Notwithstanding some remaining copyright disputes surrounding the project, Google has reached agreements with most of the major copyright holders (authors and the publishers that represent them) around the world. So far, Google has scanned over 15 million books, most of which are no longer in print or commercially available. This database is a treasure trove for history scholars.

Last week, Google released a new statistics tool for the Google Books project called the Ngram Viewer. It is simple to use. You enter a list of words or phrases separated by commas (each called an n-gram; searches are case-sensitive), the language (American English and British English can be searched separately), the date range (from 1500 to 2008, though the number of books is sparse before 1780, which makes the early data very spiky), and the amount of moving average smoothing to apply (long trends are easier to see with smoothing, but the individual yearly data is lost).
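As I understand the smoothing control, a smoothing of n replaces each year's frequency with a centered moving average over that year and its n neighbors on each side. A sketch:

```python
def smooth(series, n):
    # Centered moving average: each year becomes the mean of itself
    # and up to n neighbors on each side, truncated at the ends of
    # the series. n = 0 returns the raw yearly data.
    out = []
    for i in range(len(series)):
        lo = max(0, i - n)
        hi = min(len(series), i + n + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

print(smooth([0.0, 3.0, 6.0], 1))  # [1.5, 3.0, 4.5]
```

This is why heavy smoothing flattens one-year spikes like the ones discussed below: the spike's value gets averaged with its quieter neighbors.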

The Ngram Viewer is the best time series graphing toy I’ve seen.

There have been lots of stories in the press showing interesting trends in the popularity of certain words and phrases in books. For instance, Jennifer Valentino-DeVries of the Wall St. J. shows that “Merry Christmas” beats “Happy Holidays” by a big margin.


Merry Christmas vs Happy Holidays, Image from Google labs

Slate’s Tom Scocca has been posting an Ngram of the Day on the site comparing the frequency of words like shopping vs salvation and television vs the Bible. He even does a comparison between words and shows the year in which the two cross in popularity. For instance, anxiety passes shame in 1942.
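Finding a crossover year like that is straightforward once you have the two yearly frequency series. A sketch with made-up numbers (real frequencies would come from the viewer or the raw datasets):

```python
def crossover_year(years, freq_a, freq_b):
    # First year in which series A's frequency exceeds series B's;
    # returns None if A never passes B.
    for year, a, b in zip(years, freq_a, freq_b):
        if a > b:
            return year
    return None

# Made-up frequencies, for illustration only
years = [1940, 1941, 1942, 1943]
anxiety = [1.0, 2.0, 5.0, 6.0]
shame = [3.0, 3.0, 4.0, 4.0]
print(crossover_year(years, anxiety, shame))  # 1942
```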


anxiety vs. shame. Image from Google labs

Here are a couple of charts I created for the n-grams “independence” and “rebellion” in U.S. and British English. I have no idea what conclusions to draw from this data, but it just begs for an explanation based on unfounded speculation. There is a spike in both words in U.S. books in the 1770s, but no spike in British literature. Independence becomes more popular than rebellion in books from both countries after about 1820. The word rebellion spikes again in the U.S. from 1860 to 1870. The absolute occurrence of both words is similar in the two countries starting around 1900. Independence shows a spike from 1940 to 1942 and another around 1968 to 1970.



independence vs. rebellion, British (top) vs. American (bottom). Images from Google labs.

For the truly hardcore programmer, the n-gram datasets are available for download from Google. Their use is covered under the Creative Commons Attribution 3.0 Unported license.
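Working with the raw files is mostly line parsing. As I read the dataset notes, each row of the 1-gram files is tab-separated: n-gram, year, match count, page count, and volume count (verify the exact column layout against Google's documentation before relying on it). A sketch that totals yearly match counts for one word:

```python
from collections import defaultdict

def counts_by_year(lines, target):
    # Assumes the tab-separated layout described above:
    # ngram <TAB> year <TAB> match_count <TAB> page_count <TAB> volume_count
    totals = defaultdict(int)
    for line in lines:
        ngram, year, matches, _pages, _volumes = line.rstrip("\n").split("\t")
        if ngram == target:
            totals[int(year)] += int(matches)
    return dict(totals)

# Invented sample rows, for illustration only
sample = [
    "independence\t1776\t1200\t900\t400\n",
    "independence\t1777\t800\t700\t300\n",
    "rebellion\t1776\t950\t800\t350\n",
]
print(counts_by_year(sample, "independence"))  # {1776: 1200, 1777: 800}
```

To plot relative frequency the way the viewer does, you would divide each year's match count by the total words published that year, which Google ships as a separate totals file.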

The next step is for Google to add its Ngram Viewer toolkit to its Public Data Explorer visualization tool (see Mar 2010 blog post) to allow animations and drill down. I can hardly wait.

[Update: I rescaled the first two graphs to normalize the time spans.]