by George Taniwaki

Did you watch the debate on Monday night? I did. But I am also very interested in the post-debate media coverage and analysis. This morning, two articles that combine big data and the debate caught my eye. Both are novel and much more interesting than the tired stories that simply show changes in polls after a debate.

First, the New York Times reports that during the presidential debate (between 9:00 and 10:30 PM EDT) there was a high correlation between the Betfair prediction market for who will win the presidential election and after-hours S&P 500 futures prices (see chart 1).


Chart 1. Betfair prediction market for Mrs. Clinton compared to S&P 500 futures. Courtesy of New York Times

Correlation between markets is not a new phenomenon. For several decades, financial analysts have measured the covariance between commodity prices, especially crude oil, and equity indices. But this is the first time I have seen an article illustrating the covariance between a “fun” market for guessing who will become president and a “real” market. Check out the two graphs above: the similarity in shape is striking, including the fact that both continue to rise for about an hour after the debate ended.

In real time, while the debate was being broadcast, players on Betfair came to believe the chance that Mrs. Clinton would win the election had risen by 5 percentage points. Meanwhile, the price of S&P 500 futures rose by 0.6%, meaning investors (who may be the same speculators who play on Betfair) believed stock market prices in November were likely to be higher than they had expected before the debate started. There was no other surprise economic news that evening, so the debate is the most likely explanation for the surge. Pretty cool.

If the two markets are perfectly correlated (they aren’t) and markets are perfectly efficient (they aren’t), then one can estimate the difference in equity futures market value between the two candidates. If a 5% decrease in the likelihood of a Trump win translates to a 0.6% increase in equity futures values, then the difference between Mr. Trump and Mrs. Clinton being elected (a 100% change in probability) results in about a 12%, or $1.2 trillion, change in market value (the total market cap of the S&P 500 is about $10 trillion). (Note that I assume perfect correlation between the S&P 500 futures market and the actual market for the stocks used to calculate the index.)
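As a quick sketch of that back-of-the-envelope arithmetic (all inputs are the rough figures quoted above):

```python
# Back-of-the-envelope version of the arithmetic above. All inputs are
# the approximate figures quoted in the text.
prob_change = 0.05        # Betfair: +5 points for Clinton during the debate
futures_change = 0.006    # S&P 500 futures: +0.6% over the same window
sp500_market_cap = 10e12  # total S&P 500 market cap, roughly $10 trillion

# Scale the 5-point move up to a 100-point (certain) change in the winner
implied_swing = futures_change / prob_change        # = 0.12, i.e. 12%
implied_dollars = implied_swing * sp500_market_cap  # about $1.2 trillion

print(f"Implied swing: {implied_swing:.0%} or ${implied_dollars / 1e12:.1f} trillion")
```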

Further, nearly all capital assets (stocks, bonds, commodities, real estate) in the US are now highly correlated. So the total difference is about $24 trillion (assuming total assets in the US are $200 trillion). Ironically, this probably means Donald Trump would be financially better off if he were to lose the election.


The other article that caught my eye involves Google Trends data. According to the Washington Post, the phrase “registrarse para votar” (Spanish for “register to vote”) was the third highest trending search term the day after the debate was broadcast. The number of searches was about four times higher than in the days prior to the debate (see chart 2). Notice the spike in searches matches a spike in Sep 2012 after the first Obama-Romney debate.

The article says that it is not clear if it was the debate itself that caused the increase or the fact that Google recently introduced Spanish-language voting guides to its automated Knowledge Box, which presumably led to more searches for “registrarse para votar”. (This is the problem with confounding events.)

After a bit of research, I discovered an even more interesting fact. The spike in searches did not stop on Sep 27. Today, on Sep 30, four days after the debates, the volume of searches is 10 times higher than on Sep 27, or a total of 40x higher than before the debate (see chart 3). The two charts are scaled to make the data comparable.


Chart 2. Searches for “registrarse para votar” past 5 years to Sep 27. Courtesy of Washington Post and Google Trends


Chart 3. Searches for “registrarse para votar” past 5 years to Sep 30. Courtesy of Google Trends

I wanted to see if the spike was due to the debate or due to the addition of Spanish voter information to the Knowledge Box. To do this, I compared “registrarse para votar” to “register to vote”. The red line in chart 4 shows Google Trends data for “register to vote” scaled so that the bump in Sep 2012 is the same height as in the charts above. I’d say the debate really had an unprecedented effect on interest in voting, and the effect was probably bigger for Spanish-speaking web users.


Chart 4. Searches for “register to vote” past 5 years to Sep 30. Courtesy of Google Trends

Finally, I wanted to see how the search requests were distributed geographically. The key here is that most Hispanic communities vote Democratic, and many states with a large Hispanic population are already blue (such as California, Washington, New Mexico, New Jersey, and New York). The exception is Florida, with a large population of Cuban immigrants who tend to vote Republican.


Chart 5. Searches for “registrarse para votar” past 5 years to Sep 30 by county. Courtesy of Google Trends

If you are a supporter of Democrats like Mrs. Clinton, the good news is that a large number of queries are coming from Arizona and Texas, two states where changes in demographics are slowly turning voting preferences from red to blue.

In Florida, it is not clear which candidate gains from an increased number of Spanish-speaking voters. However, since the increase is a result of the debate (during which it was revealed that Mr. Trump had insulted and berated a beauty pageant winner from Venezuela, calling her “Miss Housekeeping”), I will speculate that many newly registered voters are going to be Clinton supporters.

If the Google search trend continues, it may be driven by news reports that Mr. Trump may have violated the US sanctions forbidding business transactions in Cuba. Cuban-Americans searching for information on voter registration after hearing this story are more likely to favor Mrs. Clinton.

by George Taniwaki

Oftentimes, one has lots of data to display that are grouped in pairs, such as men vs. women. We want to show the pairs side by side, but the comparison we care about is not between the two members of a pair. Instead, we are more interested in comparing values across groups within one member of the pair. For instance, our groups could be age ranges, and we are more interested in comparing 20-24 women to 25-29 women than 20-24 women to 20-24 men.

The diverging stacked bar chart is a very good way to display a pair of values next to each other. To allow easier comparison, adjacent bars usually have no white space between them, and the left and right bars of each pair are distinguished by color rather than by a gap. The values for the bars running to the left are not negative; they are positive, just like the values on the right. The left and right bars are paired and measure the values for two different related groups.

The most common use of the diverging stacked bar chart is to display the age distribution of a population broken down by gender. This type of chart is often called a population pyramid. An example of a population pyramid using data from the 2000 U.S. Census is shown in Figure 1.


Figure 1. Population pyramid for the U.S. based on Census 2000 data. Image from Censusscope.

The population pyramid is a special case of the diverging stacked bar chart. Notice that each of the horizontal bars is the same width and covers the same age range (except the oldest group). Thus, the height of each bar represents the same number of years, and the stack of bars forms a vertical axis showing age. Similarly, the area of each bar represents the proportion of the population in that age group, and the area of all the bars shows the total size of the population. A well-drawn population pyramid shows three dimensions at once: age, gender, and counts.
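To make the construction concrete, here is a minimal matplotlib sketch of a population pyramid; the age-group counts are invented, and the one trick is plotting one gender's counts as negative values so the bars diverge from a shared vertical axis.

```python
# A minimal population pyramid: one gender's counts are plotted as
# negative values so the bars diverge left and right from a shared
# vertical age axis. The counts below are made up for illustration.
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np

ages = ["0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+"]
male = np.array([20, 21, 21, 22, 21, 18, 12, 8])     # millions, illustrative
female = np.array([19, 20, 21, 22, 22, 19, 14, 11])

fig, ax = plt.subplots()
ax.barh(ages, -male, color="steelblue", label="Male")    # left side
ax.barh(ages, female, color="salmon", label="Female")    # right side
# Relabel the x axis so both sides read as positive counts
ax.xaxis.set_major_formatter(FuncFormatter(lambda v, pos: f"{abs(v):.0f}"))
ax.set_xlabel("Population (millions)")
ax.legend()
plt.show()
```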

The shape of a population pyramid tells a lot about the population growth (which itself is a result of economic and political conditions that affect fertility, infant survival, immigration and emigration, and longevity) of a group. Figure 2 shows the four commonly seen shapes for a population pyramid.

The two triangles at the right (labeled stage 1 and stage 2) describe a group where a combination of high birthrates, high emigration, and high mortality causes the number of young to greatly exceed the old. Several countries in Sub-Saharan Africa and India have population pyramids of this shape.

The flatter shape (labeled stage 3) describes a group where births, immigration/emigration, and mortality are in balance. Most of the developing countries and the U.S. have population pyramids of this shape.

Finally, the egg-shaped pyramid (labeled stage 4) has a base that is smaller than the center. This describes a group where a combination of a low birth rate, a high immigration rate, and low mortality causes a bulge in the middle. If the fertility rate is below the replacement rate (about 2.1 children per woman over her lifetime), then the population is growing older and may even be shrinking. Nearly all of the developed countries and China have population pyramids of this shape.


Figure 2. Four commonly seen shapes for population pyramid. Image from Wikipedia

If you would like to explore population pyramids on a national, state, and metro area basis, go to


Diverging stacked bar charts can be used in cases where there are more than two categories. In a paper presented at the 2011 Joint Statistical Meeting, Naomi Robbins and Richard Heiberger suggest that Likert scale data should be presented using this method. If the questionnaire uses the standard 5-point scale, they argue that the “Strongly disagree,” “Disagree,” and half the “Neither agree nor disagree” counts should be shown on the left bar. The counts for “Strongly agree,” “Agree,” and half of “Neither agree nor disagree” should be shown on the right bar. An example is shown in Figure 3.


Figure 3. Diverging stacked bar chart used to display Likert scale data. Image from 2011 JSM

I’ve tried a bunch of different ways of presenting Likert scale data (as well as other scaled data for importance, satisfaction, and other opinions) and have never been happy with my efforts. I really like this technique. If you review the paper, you will see eight common methods for displaying Likert scale data that the authors label as “Not recommended.” I’ve used many of them.

For instance, I’ve used the standard colored bar chart like the one shown in Figure 4. The problem is that every bar is the same length, so the ends of the bars, which your eyes are drawn to, don’t convey any data. All of the data is conveyed at the interior points of the bars. By comparison, in Figure 3, the data is conveyed by the length of each bar and the proportion of each bar that is filled with the darker shade of color.


Figure 4. Standard bar chart to display Likert scale data. Image from 2011 JSM

So how do you create your own diverging stacked bar charts? If you are an R language user, you can use functions available in the HH package and the latticeExtra package. These functions are also available through the RExcel add-in for Excel on Windows.

If you are not an R user, you can create diverging stacked bar charts manually using Excel or Tableau. For instructions using Excel, Amy Emery has a good tutorial on

Incidentally, if you have time, check out some of Ms. Emery’s other slide shows, they are quite good and cover a range of topics. There is even one to help R novices like me get started in learning the language.
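If you would rather script the chart than build it by hand, here is a minimal matplotlib sketch of the layout Robbins and Heiberger recommend; the survey counts below are invented for illustration.

```python
# A minimal sketch of the diverging stacked bar layout for Likert data:
# "strongly disagree", "disagree", and half the neutral counts stack to
# the left of zero; the rest stack to the right. The counts are made up.
import matplotlib.pyplot as plt
import numpy as np

questions = ["Q1", "Q2", "Q3"]
# Columns: strongly disagree, disagree, neutral, agree, strongly agree
counts = np.array([[10, 25, 20, 30, 15],
                   [5, 15, 10, 40, 30],
                   [20, 30, 25, 15, 10]])

half_neutral = counts[:, 2] / 2
# Everything left of zero: strongly disagree, disagree, half the neutrals
left_edge = -(counts[:, 0] + counts[:, 1] + half_neutral)

fig, ax = plt.subplots()
colors = ["#b2182b", "#ef8a62", "#cccccc", "#67a9cf", "#2166ac"]
labels = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]
start = left_edge.copy()
for i in range(5):
    width = counts[:, i].astype(float)
    ax.barh(questions, width, left=start, color=colors[i], label=labels[i])
    start += width
ax.axvline(0, color="black", linewidth=0.8)  # the diverging baseline
ax.legend(ncol=5, fontsize="small")
plt.show()
```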

Much thanks to my friend and colleague Carol Borthwick for pointing me to this new use for the diverging stacked bar chart.

by George Taniwaki

This week’s issue of the New Engl J Med (subscription required) should be of special interest to those who follow kidney disease. The issue contains several articles on medical investigations into treatments and risk factors for kidney disease, along with related editorials. Unfortunately, most of the news is not good.


However, there is an important lesson to gain from these studies. Scientific knowledge advances in two ways. First is the knowledge gained by learning what works. There is the obvious clinical benefit of knowing the best treatment for a patient. But successful studies also point the direction for other researchers, showing where they can expect the greatest promise for future investigation.

Second, failures are valuable learning experiences. Knowing what doesn’t work reduces the chance that doctors or patients will try the same therapy on their own. But an unsuccessful trial does not mean a line of research should be abandoned. Rather, a failure should teach us to look at root causes.

Every experiment or medical trial is expected to be successful (otherwise you should invest your time and effort in a different project with a greater potential payoff). When it isn’t, we are temporarily surprised. But that should lead to a new investigation as to why the trial did not work as intended. And that investigation will hopefully lead to new insights that can be added to the body of human knowledge.

Trial of ACE inhibitors and ARBs

In the first article, Linda Fried of the Univ. Pittsburgh School of Medicine and her coauthors examine the effect of angiotensin-converting enzyme (ACE) inhibitors and angiotensin-receptor blockers (ARBs) on patients with type 2 diabetes who also have kidney disease. These two drugs are often prescribed for the treatment of hypertension and congestive heart failure.

For kidney patients, the use of these drugs was intended to slow the decline in glomerular filtration rate (GFR). Previous studies had shown that ACE inhibitors and ARBs could benefit patients who already showed signs of proteinuria (protein in the urine, a sign of kidney disease). The goal of this study was to see if prescribing ACE inhibitors and ARBs to kidney patients earlier could forestall the progress toward end-stage renal disease (ESRD).

Since the progress of kidney disease for a particular patient is uncertain and can take many years, this study required a large sample that would be willing to participate by taking a prescription drug (or a placebo) for a multiyear period.

As reported in the article, the trial was stopped after four years because of safety concerns. There were more adverse events in the therapy group than in the placebo group. The most common problem was acute kidney injury, with the next most common being hyperkalemia (high potassium levels in the blood, which if untreated can cause an irregular heartbeat). Because the study was stopped early, we now know that the combination therapy of ACE inhibitors and ARBs can cause injury, but we don’t know if it can delay the onset of ESRD.

Trial of bardoxolone methyl

The next article (online first) by Dick de Zeeuw of the Univ. Groningen and coauthors summarizes the results of treating patients who have both type 2 diabetes and stage 4 kidney disease with bardoxolone methyl. This drug is an antioxidant that can be taken orally and has been shown to reduce serum creatinine.

Similar to the ACE inhibitor and ARB study, the sample size was large and the test was intended to span several years. However, also like the other study, it was ended early due to safety concerns. Those in the therapy group had more adverse events than those in the placebo group. Those who received the treatment had significantly higher GFR (a good thing) but experienced higher rates of heart failure, nonfatal stroke, hospitalization for heart failure, and higher death rate from cardiovascular causes.

Off-label use of abatacept

Abatacept (sold under the trade name Orencia) is a protein that inhibits a molecule called B7-1 that activates T cells. It is approved for the treatment of rheumatoid arthritis. It is also in clinical trials for the treatment of multiple sclerosis, type 1 diabetes, and lupus. These are all autoimmune diseases.

Chih-Chuan Yu of Harvard Medical School and coauthors noted elevated levels of B7-1 in certain patients with proteinuric kidney disease, including primary focal segmental glomerulosclerosis (FSGS). They conducted a series of in vitro studies (laboratory experiments) to show that abatacept would block the migration of podocytes (a type of kidney cell). They then recruited patients whose FSGS did not respond to standard treatments. They selected four kidney transplant patients with rituximab-resistant recurrent FSGS and one patient with glucocorticoid-resistant primary FSGS. They treated all five with abatacept, and all five patients experienced remission.

APOL1 risk variants

APOL1 is the gene that encodes the apolipoprotein L1, a component of HDL, also called good cholesterol. Although the exact purpose of APOL1 is not known, we do know that certain variants of APOL1, called G1 and G2, circulating in plasma can suppress Trypanosoma brucei, the parasite that causes sleeping sickness. We also know that these variants are associated with ESRD, though the mechanism isn’t known.

We know that the G1 and G2 variants are more common among African-Americans than among white/Caucasians. And we know that African-Americans have 3 to 5 times the risk of ESRD of white/Caucasians, even though the prevalence of earlier stages of kidney disease is roughly equal for both racial groups. Thus, the question is whether these variants of APOL1 are responsible for some of the difference in rates of ESRD between blacks and whites.

A paper by Afshin Parsa, et al., attempts to answer that question by looking at data from two studies, one called the African-American Study of Kidney Disease and Hypertension (AASK) and the other called Chronic Renal Insufficiency Cohort (CRIC). They find direct evidence that “APOL1 high-risk variants are associated with increased disease progression over the long-term.”

Data for the AASK patient group are shown in the table below. Some items to notice:

  1. There is very little difference in CKD incidence between the patients with no copies of the APOL1 risk variants and those with one copy. This indicates that the trait is recessive
  2. Even patients with no copies of the risk variants have high rates of CKD. This indicates there are more factors left to be discovered
  3. There is a high prevalence (23%) of patients with 2 copies of the risk variants within the African-American population
  4. The risk variants may explain only about 5% (= 23% * (58% – 35%)) of the difference in the incidence rate of ESRD between blacks and whites. Further, the association does not explain the cause of kidney disease in patients with two copies of the risk variants. It does, however, seem to rule out hypertension and diabetes, since the study controlled for these factors
                                    All patients        CKD at end*
                                    Count   Col %       Count   Col %   Row %
No copies of APOL1 risk variants      234     34%          83     29%     35%
1 copy of APOL1 risk variants         299     43%         112     39%     37%
2 copies of APOL1 risk variants       160     23%          93     35%     58%
TOTAL                                 693    100%         288    100%     42%

*Number with ESRD or doubling of serum creatinine by end of study


All four papers described above were the subject of editorials in this week’s issue of New Engl J Med. One, written by Dr. de Zeeuw, the lead author of the bardoxolone methyl paper, points out that the failure of the ACE inhibitor and ARB combination therapy may indicate that “improvement in surrogate markers — lower blood pressure or less albuminuria — does not translate into risk reduction.” In fact, he writes that it may go further: the use of these two measures “as therapeutic targets in our patients with type 2 diabetes” may be in doubt. He also promotes the use of “enrichment design” to select patients who are less likely to display an adverse event.

Another editorial, by Jonathan Himmelfarb and Katherine Tuttle of the Univ. Washington School of Medicine (and the Kidney Research Institute), makes three recommendations to improve the safety and likelihood of success of clinical trials. First, all researchers should make more preclinical data available so that others can conduct better preclinical analysis. Second, researchers should consider the possible off-target effects of a proposed agent and collect data before starting clinical trials. The development of organ-on-a-chip technology may greatly help with this. Finally, researchers should exercise caution whenever a drug has known side effects, for instance when a “drug for diabetic kidney disease increases, rather than decreases, the degree of albuminuria.”

In a third editorial, Börje Haraldsson of the Univ. Gothenburg says the work of Dr. Yu and his colleagues “may signal the start of a new era in the treatment of patients with proteinuric kidney disease.” Let us hope that is true. As we discover more about how the immune system works, how it interacts with its cellular and microbial environment, and how it can be modulated, treatment of many chronic conditions, cancer, and even old age may be affected.

by George Taniwaki

In a Dec 2011 blog post, I critiqued an article in The Fiscal Times that compared the cost of eating a meal at home against dining out at a restaurant. The article purported to show that eating at a restaurant was cheaper. I pointed out the errors in the analysis.

One of the errors was in the way data for expenditures at grocery stores and restaurants were shown in a line graph. The two lines were drawn at different scales and aligned to different baselines, making comparisons difficult. The original and corrected charts are shown below. Correcting the baseline makes it clear that restaurant expenditures are significantly lower than grocery expenditures. Correcting the scale shows that restaurant expenditures are not significantly more volatile than grocery expenditures.


Figures 1a and 1b. Original chart (left) and corrected version (right)

Another error in the article I pointed out was that the lower inflation rate of meals at restaurants compared to meals at home should not favor eating more meals at restaurants. I didn’t give an explanation why. I will do so here.

Consider an office worker who needs to decide today whether to make a sandwich for lunch or to buy a hamburger at a restaurant. Let’s say she knows that the price of bread and lunch meat has doubled over the past year (100% inflation rate) while the cost of a hamburger has not changed (0% inflation rate). Which should she buy?

The answer is, she doesn’t have enough information to decide. The inflation rate over the past year is irrelevant to her decision today, or at least it should be. What is relevant is the actual costs and utilities today.

Let’s say she likes shopping, making sandwiches, and cleaning up, so the opportunity cost for the sandwich option is zero. Let’s also say she likes sandwiches and hamburgers equally and values them equally and doesn’t value variety. Now, if the price today for lunch meat and bread for a single sandwich is 50 cents while a hamburger is 75 cents, then she should make a sandwich. Next year if inflation continues as before, making a sandwich will cost $1.00 while a hamburger remains 75 cents. In that case, she should buy a hamburger. But that decision is in the future.

Let’s consider an extreme case where inflation rates may affect purchase decisions today. What if the price of sandwich fixings is 50 cents today but inflation is expected to be 100% during the work week (so prices will be $.57, $.66, $.76, and $.87 over the next four days)? Such high inflation rates are called hyperinflation and can lead to severe economic distortions.
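Those four prices come from compounding the weekly rate daily; a quick sketch:

```python
# The weekday price series quoted above: sandwich fixings start at 50
# cents on Monday and inflate 100% over the five-day work week,
# compounding daily.
price = 0.50
daily_factor = 2 ** (1 / 5)   # doubling over 5 days is ~14.9% per day

for day in ("Tue", "Wed", "Thu", "Fri"):
    price *= daily_factor
    print(f"{day}: ${price:.2f}")
# Prints $0.57, $0.66, $0.76, $0.87, matching the figures in the text
```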

Let’s also assume hamburgers are 75 cents today and will remain fixed at that price by law. (Arbitrary but stringent price controls are another common feature of an economy experiencing hyperinflation.) Further, let’s assume that sandwich fixings can be stored in the refrigerator for a week for future use but hamburgers cannot be bought and stored for future consumption.

Finally, let’s assume it is early Monday and our office worker has no sandwich fixings or hamburgers but has $5 available for lunches for the upcoming week. What should she buy each day?

I would recommend trying to buy $2.50 in sandwich fixings today (enough for five sandwiches at today’s prices). Here’s why. During a period of hyperinflation, you want to get rid of money as fast as possible because cash loses its value every day you hold it. Thus, buying as much food as possible today is a good investment (called a price hedge).

Ah, you say. Why not make sandwiches the first two days of the week and then switch to the relatively cheaper hamburgers for the last three days? That is unlikely to work because the restaurant is caught between paying rising prices for the food it buys while getting a fixed price for what it sells. Long lines will form as customers seek cheap food. The restaurant will either run out of food, go bankrupt, or close its doors. Regardless, our office worker shouldn’t rely on her ability to buy cheap hamburgers later in the week.

So why am I updating a blog post from almost two years ago? Well, it’s because I noticed a big spike in traffic landing on it last week. It turns out my wife, Susan Wolcott, assigned it as a reading for a class she is teaching to undergraduate business students at the Aalto University School of Business in Mikkeli, Finland. (The school was formerly known as the Helsinki School of Economics.)

Normally, this blog receives about 30 page views a day. On days that I post an entry on kidney disease or organ donation (pet topics of mine) traffic goes up. Of a typical day’s 30 hits, I presume about half of that traffic is not human. It is from web crawlers looking for sites to send spam to (I receive about 15 spam comments a day on this blog).

But check out the big spike in page views for my blog on the day my wife assigned the reading. This blog received 264 page views from 91 unique visitors. That’s the kind of traffic social media experts die for. Maybe I’ve hit upon an idea for generating lots of traffic to a website: convince college professors to assign it as required reading for a class.


Figure 2. Web traffic statistics for this blog

Naturally, I expect another big spike of traffic again today when my wife tells her students about this new blog post.

by George Taniwaki

I recently came across a fascinating article with the novel claim that life is older than the earth. The argument is based on a regression analysis that estimates the rate of increase in the maximum genomic complexity of living organisms over time. (Note, this argument only makes sense if you believe that the universe can be more than 5,000 years old, that genes mutate randomly and at a constant rate, that genetic changes are inherited and accumulate, and that mathematics can be used to explain how things work. Each of those assumptions can be argued separately, but that is beyond the scope of this blog post.)

The article entitled Life Before Earth was posted on in Mar 2013. The authors are Alexei Sharov, a staff scientist at the National Institute on Aging, and Richard Gordon, a theoretical biologist at the Gulf Specimen Marine Laboratory.

They draw a log-linear plot of the number of base pairs in a selection of terrestrial organisms against the time when each organism first appeared on earth (see Fig 1). For instance, the simplest bacteria, called prokaryotes, have between 500,000 and 6 million base pairs in their genome and are believed to have first appeared on earth about 3.5 billion years ago. At the other extreme, mammals, including humans, have between about 2.7 billion and 3.2 billion base pairs in their genome. The fossil record indicates the first mammals appeared about 225 million years ago, during the Triassic period. All other known organisms can be plotted on these axes, and the trend appears linear, meaning the growth in genome complexity is nearly exponential.

Extrapolating the data back in time, one can estimate when the maximum complexity was only one base pair. That is, the date when the first protoörganisms formed. The trend line indicates this occurred 9.7 billion years ago, or about 4 billion years after the big bang.
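As a rough sketch of that extrapolation, here is the calculation using just two illustrative anchor points from the text rather than the authors' full data set:

```python
# A minimal sketch of the extrapolation described above, using two
# illustrative anchor points (prokaryotes: ~1e6 base pairs, appearing
# ~3.5 billion years ago; mammals: ~3e9 base pairs, appearing ~225
# million years ago). The paper fits many more organisms than this.
import numpy as np

t = np.array([-3.5, -0.225])             # billions of years before present
log_bp = np.log10(np.array([1e6, 3e9]))  # log10 of genome size in base pairs

slope, intercept = np.polyfit(t, log_bp, 1)  # fit log10(bp) = slope*t + intercept

t_origin = -intercept / slope   # time when log10(bp) = 0, i.e. a single base pair
print(f"Implied origin of life: {-t_origin:.1f} billion years ago")
# With only these two points the answer is roughly 9 billion years ago,
# in the same ballpark as the paper's 9.7 billion year estimate
```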


Figure 1. The growth in maximum base pair count per genome seems to grow exponentially over time. Image from

The earth is estimated to be only 4.5 billion years old. Thus, if these results are accepted, the implications are pretty astounding.

1. Life did not start on earth. It started somewhere else in the galaxy and reached the microörganism stage. Through some cosmic explosion, the life was transported here. Alternatively, life started on one of the planets around the star that went supernova before collapsing and forming our present-day sun. This hypothesis is called exogenesis

2. It is unlikely that these alien microörganisms only landed on earth as our solar system formed. They probably coated every asteroid, comet, planet, and moon in our solar system. They may still be alive in many locations and are either evolving or dormant

3. If all of the microörganisms that reached our solar system came from the same source, they likely have the same genetic structure. That is, if we find life elsewhere in our solar system, it is likely to contain the same right-handed double helix of DNA using the same four nucleotide bases as life on earth. With effort, we could construct an evolutionary tree that contains these organisms

4. In fact, the same microörganisms may be very common throughout the galaxy, meaning life has arrived on many other planets, or perhaps every planet in our galaxy, even ones with no star, a hypothesis called panspermia

5. It solves Fermi’s paradox. In 1950, Enrico Fermi noted that our Sun is a young star and that there are billions of stars in our galaxy billions of years older. Why haven’t we already been contacted by intelligent life from other planets? The answer based on this analysis is because intelligent life takes 9.7 billion years to form and we may be one of the first organisms to reach the level of intelligence necessary to achieve intentional interstellar communication and travel. Ironically, if Sharov and Gordon are right, unintentional interstellar travel is already quite common and has been for billions of years.

Relationship to Moore’s Law

If the exponential growth of complexity shown in Figure 1 above looks familiar, it is because it is the same shape as the increase in the number of transistors in microprocessor chips over time, a relationship called Moore’s Law. The authors cite this analogy as a possible argument in favor of their case.


Figure 2. The growth in maximum transistor count per processor has grown exponentially over time. Image from Wikipedia

Is this reasonable?

Like I said, this is a fascinating article. But it is all speculation. We have no direct evidence of many of the claims and inferences made in this paper. Specifically, we don’t know:

  1. The exact size of the genome that various organisms had in the earth’s past
  2. The nature of possible organisms less complex than prokaryotes
  3. The existence of any alien microörganisms or evidence that any landed on early earth
  4. The speed of genetic complexity changes in the early earth environment, or on other planetary environments in the case of the alien microörganisms prior to arrival on earth
  5. Whether any modern earth organisms, or any potential alien microörganisms, could withstand the rigors of travel through space for the millions of years it would take to get to earth from another star system

Finally, we have no clear explanation why the rate of change in genome complexity should be exponential. The use of the Moore’s Law chart to show that exponential growth in complexity is reasonable is slightly disingenuous. Moore’s Law is generally used as a forecast for the future growth in complexity for a commercial product based on historical data. Further, the forecast is used to estimate product demand, research costs, and necessary production investment, all of which tends to drive advancements and make the prediction self-fulfilling.

On the other hand, genome complexity is not directed. Evolution is a random process that will generate greater complexity only if a new, more complex organism can take advantage of an ecological niche that cannot be exploited by simpler organisms. Nothing is driving greater genome complexity.

Anyway, this is a very controversial theory. But I believe it may lead to new insights regarding microbiology, astrobiology, molecular clock hypothesis, and the use of mathematical models in evolutionary biology.


How long is 13.7 billion years?

As a side note, sometimes we wonder, how could something as complex as DNA and cellular organisms have started from nothing? It seems impossible to comprehend. But if you take a look at Figure 1, you will see that it may have taken over six billion years for a single base pair of DNA to grow in complexity to form an organism that we think of as a simple, primitive prokaryote. Then it took another 3.5 billion years before mammals appeared. Then it only took 200 million more years before our human ancestors appeared. And finally only a few hundred thousand years passed before you and I were born.

To give you a feel for how long 13.7 billion years is, watch this extremely boring YouTube video that compresses each billion years into one minute increments.


Figure 3. Age of Universe, boring, even with music. Video still from Craig Hall


A final thought to close this blog post, there may be aliens living on earth, but don’t be afraid, because it’s us.

by George Taniwaki

A graphic in Bloomberg Businessweek Mar 2013 (reproduced below) lists the four metro areas with the greatest economic growth over the five-year period 2007-2011. It also gives their population change during the same period. And it lists the four cities that had both negative population growth and negative GDP growth during the same period.


Figure 1. Ranking 8 cities by total growth. Image from Bloomberg Businessweek

This chart is a bit light on data, containing only 16 data points. And the changes to population and GDP are not directly comparable since the population change is reported cumulatively for the four years (total number of years minus one) while GDP is annualized. Let’s calculate the cumulative GDP change as follows:

Total change GDP = (1 + Annual change GDP)^Years – 1.
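For example, here is a quick sketch of the conversion, with a hypothetical annualized figure:

```python
# Converting an annualized GDP change to the cumulative change over the
# four year-to-year intervals, per the formula above. The 5.1% input is
# a hypothetical annualized figure of the kind shown in the graphic.
annual_change = 0.051
years = 4

total_change = (1 + annual_change) ** years - 1
print(f"{total_change:.0%}")   # about 22%
```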

Also, notice that the data has differing numbers of significant digits. The annualized GDP changes are displayed with two digits. The population changes show one, except for Chicago and Providence which have several. I’m sure this was done to show that the populations of these two cities were falling rather than flat. Let’s get rid of those extra digits.

This chart ranks the best and worst performing metro areas. One could reasonably argue that the metro areas with the greatest absolute GDP growth are the best. (I will argue otherwise shortly.) But should the worst performing areas be defined as the four that had both declining population and declining GDP? For a counterexample, consider a city where the population is growing but GDP is falling. I would say it is actually in worse shape based on the negative value of its per capita GDP growth. In fact, any city where the population is growing faster than GDP (or shrinking slower than GDP) would have negative per capita GDP growth. Perhaps GDP per capita is a better measure of performance than total GDP change.

To address this, let’s calculate the change in GDP per capita as follows:

Change GDP per capita = ((Change in GDP + 1) / (Change in pop. +1)) –1.

The normalized data from the chart is summarized in the table below.

Metro area      Total change    Total change    Total change in
                in pop.         in GDP          GDP per capita
Portland          4%              22%             18%
San Jose          3%              19%             15%
Austin           12%              14%              2%
New Orleans      16%               8%             -7%
Detroit          -4%             -19%            -16%
Cleveland        -1%              -5%             -4%
Chicago          -0%              -2%             -2%
Providence       -0%              -1%             -1%
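As a quick sanity check on the per capita column, using Portland's rounded figures:

```python
# A quick check of the per capita arithmetic in the table above, using
# Portland's rounded figures (4% population growth, 22% GDP growth).
pop_change = 0.04
gdp_change = 0.22

gdp_per_capita_change = (1 + gdp_change) / (1 + pop_change) - 1
print(f"{gdp_per_capita_change:.1%}")
# Prints 17.3%; the table's 18% presumably reflects unrounded inputs
```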

Notice now that New Orleans, despite having very high GDP growth, has large negative per capita GDP growth because its population is growing faster than its GDP. Austin’s performance now looks less impressive too. And while Cleveland, Chicago, and Providence all have negative per capita GDP growth, they are not doing as badly as it first appears.

Even when normalized, the data in the table is still lacking context. It doesn’t give the reader a feel for the big picture. For instance, how many metropolitan areas with over 1 million people are there in the U.S.? What is the average population change and GDP change among those cities? Which cities had the greatest change, either positive or negative, in population, GDP, and per capita GDP?

Continuing that analysis, we would want to know if most cities were growing near the average rate or if there is a large dispersion. What is the shape of this dispersion? Are there geographic location, city size, or other factors that correlate with growth? Finally, are there time series trends? To answer these questions we need to go back to the source data and create our own charts.

Creating the metro population and GDP dataset

The footnote to the Bloomberg Businessweek chart says the data is from the Bureau of Economic Analysis and the Census Bureau. The BEA GDP data is available from an interactive website. I selected Table = GDP by metro area, Industry = All industry total, Area = All MSAs, Measures = Levels, and Year = 2007 to 2011.

The Census Bureau population estimates for the metropolitan statistical areas (MSAs) are available for download from the Census website. I downloaded the historical decennial data for 2000 to 2009 and the current decennial data that covers 2010 to 2012. I merged the three data sets keyed off of Census Bureau statistical area (CBSA) code.

Note that several of the CBSAs changed in 2010, meaning the code changed too. The most significant is that Los Angeles-Long Beach-Santa Ana, CA (31100) changed to Los Angeles-Long Beach-Anaheim, CA (31080).

In addition to the MSA records, I created two additional records. One contains the total population and GDP for all MSAs and the other for MSAs with population greater than one million.

Since the geographic names of the MSAs are often quite long, I want to find shorter labels that I can use on a scatterplot. I decide to use airport codes. These are short, unique, cover any big city with an airport worldwide, and if you travel a lot, you’ve possibly memorized quite a few, so you don’t need a legend to decode them. I append this to each record.

Finally, I calculate the following descriptive statistics for each MSA and append them to the records:

Change in population = (Pop on Jul 2011 / Pop on Jul 2007) – 1

Change in GDP = (GDP for 2011 / GDP for 2007) – 1

Per capita GDP for year 20xx = GDP for year 20xx / Pop on Jul of year 20xx

Change in GDP per capita = (Per capita GDP for 2011 / Per capita GDP for 2007) – 1
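In code, the derived columns look something like this minimal pandas sketch; the column names and the two sample rows are hypothetical stand-ins for the real merged table keyed by CBSA code:

```python
# A minimal pandas sketch of the derived columns described above. The
# column names and sample values are hypothetical, for illustration only.
import pandas as pd

df = pd.DataFrame({
    "msa": ["Portland", "Detroit"],
    "pop_2007": [2_217_000, 4_330_000],   # illustrative values
    "pop_2011": [2_317_000, 4_165_000],
    "gdp_2007": [118_000, 200_000],       # millions of dollars, illustrative
    "gdp_2011": [145_000, 195_000],
})

df["pop_change"] = df["pop_2011"] / df["pop_2007"] - 1
df["gdp_change"] = df["gdp_2011"] / df["gdp_2007"] - 1
df["gdp_per_capita_change"] = ((df["gdp_2011"] / df["pop_2011"]) /
                               (df["gdp_2007"] / df["pop_2007"])) - 1
print(df)
```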

An Excel spreadsheet containing the original data, the merged table, and the scatterplot is available on SkyDrive.


Comparing the data I collected with the Bloomberg Businessweek data, the ranking for the top four cities match, but the values for population change and GDP change do not. This could be because different data was used (historical population and GDP estimates are revised annually).

The data for the bottom four cities don’t match at all. The data I collected shows only one city that had falling population and GDP during the time period, Detroit. The three other cities showed rising GDP and two showed rising population as well. And despite the falling population and GDP, all four cities showed rising GDP per capita.

The data for those 8 metro areas plus a few outliers are shown below. The means for all 51 MSAs with population greater than one million are included for comparison.

Metro area        Total change    Total change    Total change in
                  in pop.         in GDP          GDP per capita
Portland             4.5%           23.1%           17.8%
San Jose             5.0%           28.6%           12.9%
Austin              11.7%           18.4%            6.0%
New Orleans          9.4%           17.5%            7.4%
Salt Lake City       1.3%           14.9%           13.4%
Mean                 4.2%            6.9%            2.6%
Detroit             -3.8%           -2.4%            1.4%
Cleveland           -1.5%            3.5%            5.0%
Chicago              0.5%            5.4%            4.9%
Providence           0.0%            6.7%            6.6%
Las Vegas            7.0%           -5.9%          -12.1%
Charlotte           36.7%            9.3%          -20.1%

Visualization and Analysis

I generated a simple scatterplot of change in GDP against change in population for all 51 MSAs. The cities from the table above are highlighted in green and red. I added a population-weighted trend line, shown in brown. The trend line passes through the mean (4.2%, 6.9%) and has a y-intercept at 2.6%.
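A minimal matplotlib sketch of that chart, assuming the df from the earlier sketch extended to all 51 MSAs (note that numpy's polyfit squares its weights, so the population weights go in as square roots):

```python
# A minimal sketch of the scatterplot described above: GDP change vs.
# population change with a population-weighted linear trend line. It
# assumes the hypothetical df built earlier, extended to all 51 MSAs.
import matplotlib.pyplot as plt
import numpy as np

x, y = df["pop_change"], df["gdp_change"]
w = np.sqrt(df["pop_2011"])   # polyfit squares its weights, so take roots

slope, intercept = np.polyfit(x, y, 1, w=w)
xs = np.linspace(x.min(), x.max(), 100)

plt.scatter(x, y)
plt.plot(xs, slope * xs + intercept, color="brown")  # weighted trend line
plt.xlabel("Total change in population, 2007-2011")
plt.ylabel("Total change in GDP, 2007-2011")
plt.show()
```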

I could have made the chart fancy by adding information using the size, shape, and color of the markers. For instance, I could change the size of the markers based on the population of the MSA, change the shape of the marker based on whether the city was coastal or inland, and change the color of the marker based on which Census region it was in.


Figure 2. A simple scatterplot showing the top 51 MSAs. Image by George Taniwaki

The four metro areas with the highest GDP growth are all above the trend line and have high per capita GDP growth. However, the Bloomberg Businessweek chart leaves off Salt Lake City which has lower GDP growth but because its population only grew 1.3% during the period, its per capita GDP growth is a very high 13.4%.

The fastest shrinking metro area is Detroit, which matches the Bloomberg Businessweek result. Note that it lies above the 45-degree diagonal running through the origin, meaning its GDP decline is less than its population decline, and so it has positive GDP per capita growth. However, it is still below the trend line, meaning it is growing slower than the average.

The other three metro areas in the Bloomberg Businessweek chart, Cleveland, Chicago, and Providence, all show slow or negative population growth, but are all above the trend line. They probably should not be considered bust-towns. The true bust-town in the scatterplot is Las Vegas. It is an outlier with a population growth of 7% but a GDP decline of 6% which results in a 12% drop in GDP per capita.

The final outlier is Charlotte. It shows a population gain of nearly 37%, which is more than double that of the next fastest growing city. But it has only a 9% increase in GDP, leaving it with a 20% drop in GDP per capita. This is a sign that rapid growth can actually be very bad for a city.

Data error and bias

The statement in the paragraph above assumes that the data for change in economic activity and in population at the MSA level developed by two separate organizations for separate reasons is accurate and comparable. Neither of these assumptions is particularly sound. Specifically, there is a big discontinuity in the population estimate for Charlotte between Jul 2009 (the last estimate based on the 2000 census) and Jul 2010 (the first estimate based on the 2010 census) that accounts for most of the population gain. Thus, the annual population estimates may need to be smoothed before calculating the change between years.

I believe the BEA estimate of economic activity for an MSA is based partly on the population estimate for the MSA. Thus, if the population estimate changes (it is revised annually), then the GDP estimate will no longer be valid and will need to be updated.

Finally, you should be careful when combining data from different sources and comparing them. We do it all the time, but we have to be conscious of what the consequences are. This is an especially important point since everybody today is rapidly building giant data warehouses and running analytics on data that has never been combined before.

A common task that webmasters are asked to perform is to get bids for hosting a website. When gathering information to prepare a quote, the vendor will often ask what the peak load (in server requests per second) will be. As a webmaster, you may well ask, how the heck do I estimate that?

Estimating page views per month

Estimating peak server requests per second is a four step process. First, we must estimate page views per month. Next, we estimate average page views per second during the heaviest or prime viewing period. Then we estimate peak page views per second during the prime viewing period. Finally, we estimate peak server requests per second during the prime viewing period.

Average page views per month can be obtained by looking at server logs. If server logs are not available or you are creating a new site, then it can be estimated using logs from similar sites. For this blog post, we will assume the website generates an estimated 2.6 million page views per month. (By comparison, this blog generates about 2,000 page views per month, mostly from bots, I think.)

Estimating average page views per prime viewing second

Assume your website generates 2.6 million page views per month and traffic is fairly steady all day and all night, every day. That is, of the 730 hours in a month (= 365 days per year / 12 months per year * 24 hours per day), all of them will be prime viewing hours. In that case, we can calculate the mean page views per second by doing some simple arithmetic.

page views per second = 2.6 million page views per month / 730 hours per month / 3600 seconds per hour = approx. 1 page view per second

But what if traffic to the website isn’t steady? What if people only visit it during work hours? Well, there are about 168 work hours per month compared to about 730 actual hours per month, a ratio of about 4.3 to 1. So during prime viewing hours there will be about 4.3 page views per second and 0 page views per second during non-work hours. (This assumes everyone works the same days and same hours regardless of time zone.)

The prime viewing hours for a website can be even more compressed. Let’s say you run a website for NBC with a blog that contains a synopsis of the television show Grimm, and an update is posted immediately after each new episode airs. In that case, perhaps all of the page views will occur during a four-hour period starting at 10:00 pm every Friday. Thus, there will be 16 prime viewing hours per month, during which there will be 45 page views per second, and 0 page views per second during the rest of the month.
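All three scenarios reduce to one division; a quick sketch:

```python
# The three traffic scenarios above: 2.6 million monthly page views
# spread over different numbers of prime viewing hours.
PAGE_VIEWS_PER_MONTH = 2_600_000

scenarios = {
    "steady all month": 730,                        # 365 / 12 * 24 hours
    "work hours only": 168,
    "4 hours after each weekly episode": 16,
}

for name, hours in scenarios.items():
    rate = PAGE_VIEWS_PER_MONTH / (hours * 3600)
    print(f"{name}: {rate:.1f} page views per second")
# Prints roughly 1.0, 4.3, and 45.1, the figures used in the text
```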

The chart below shows the page view distribution for the three cases described above. This model is quite simplified. It can obviously be made more complex by assuming that the prime viewing hours depend on time zone, by assuming that page views do not drop to zero during the non-prime viewing hours, and by having multiple variables that affect page views during a particular hour.


Three ways to achieve 2.6 million page views per month. Image by George Taniwaki

Let’s look at the distribution of page views in more detail. In the four-hour prime viewing period case we said that there were 0 page views per second before and after the prime viewing period and an average of 45 page views per second during the prime viewing period. If the number of page views is constant throughout the prime viewing period, then the distribution curve is rectangular as shown by the blue line in the chart below.

But it is unlikely that the change in page view rate is so abrupt. It is more likely that page views rise steadily to a peak and then fall. If the distribution is triangular and spread across four hours, then the average page views at the maximum point will be 90 per second (=45*2) as shown by the brown line below. If the distribution is bell shaped, called the normal distribution, then the average peak page views at the maximum point will be somewhere in between as shown by the green line below.


Three ways to achieve 625,000 page views in an evening. Image by George Taniwaki

One caveat: I tried to draw the curves in the chart above so that all three would have similar variance, but I didn’t actually do the calculations to verify it.

Estimating peak page views per prime viewing second

All of the work above was to find the average number of page views per second during the prime viewing time. However, visitors to the website will arrive randomly. So we can expect that there will be some fluctuation in the number of page views during a second. Some seconds during the prime viewing time will have fewer than the average number of visitors and some seconds will have more. We can model this random arrival of visitors using the Poisson distribution.

Since the arrival of visitors is random, we cannot estimate the maximum number of visitors the website will ever receive in a second; there is no finite maximum. But we can estimate it for a variety of confidence levels, such as 90%, 99%, and even 99.999% (the so-called five 9s availability level). In this case, the confidence level indicates the proportion of one-second intervals that will be below the peak.

Using Excel’s Poisson distribution function we can estimate the ratio between peak page views per second to average page views per second at various confidence levels. The results are shown in the three tables below. Notice that although the average page views per second can be a fraction, the peak page views per second is always an integer.

[Three tables, one per confidence level (0.90, 0.99, and 0.99999), with columns: avg. page views per month; avg. page views per second at max. point*; peak page views per second at that confidence level; and ratio of peak to average at max. point. The numeric rows did not survive in this copy; at low traffic levels the peak column reads “0 or 1”.]

*Assumes 168 prime viewing hours per month with uniform distribution
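If you prefer to work outside Excel, the same calculation can be sketched in Python with scipy; the monthly traffic levels below are illustrative choices, not the original table rows:

```python
# A Python equivalent of the Excel Poisson calculation described above:
# for each average traffic level, find the peak page views per second
# that will not be exceeded at a given confidence level. The monthly
# traffic levels are illustrative, not the original table's rows.
from scipy.stats import poisson

SECONDS_PER_MONTH = 168 * 3600   # 168 prime viewing hours per month

for views_per_month in (605, 60_480, 604_800, 6_048_000):
    mu = views_per_month / SECONDS_PER_MONTH   # average page views per second
    for conf in (0.9, 0.99, 0.99999):
        peak = poisson.ppf(conf, mu)   # smallest k with P(X <= k) >= conf
        print(f"{views_per_month:>9}/mo, conf {conf}: peak {peak:.0f}/s, "
              f"ratio {peak / mu:.1f}x")
```

Running it shows the pattern discussed next: at low traffic the peak is 0 or 1 page views per second, while at high traffic the ratio of peak to average approaches 1.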

Also notice that when the average page views per second is low, the peak page views per second can have two solutions, 0 or 1. These cases occur when the average page views per second is below 1 – confidence level. For instance, if all you care about is that your web server can handle all the traffic 99% of the time, and your average traffic is less than 0.01 page views per second, you don’t need a web server at all! That’s because 99% of the time (during the prime viewing period), there is no traffic to your website.

However, if your goal is to be able to serve 99% of your visitors during the peak viewing time, then you need a web server that can deliver at least one page view per second. And if you provide a full web server to handle traffic that averages one page view every 100 seconds, your ratio of peak to average will be 100. Providing a complete web server ready to serve the rare visitor results in tremendous overhead costs, which is why cloud computing, where resources are shared among many websites, is becoming so popular.

Finally, notice that when the average page views per second is high (say 1,000), then the ratio of peak to average is close to 1 and does not grow very much even at high availability (or confidence) levels. At high levels of average page views, the error in estimating the average number of page views per second is likely to be much greater than the error introduced by ignoring the random distribution of page views per second.

Estimating peak server requests per prime viewing second

Our last step is to estimate the number of server calls generated by a single web page request from a user. A typical web page consists of a static html file plus one or more images, videos, ads, and JavaScript widgets displayed on the page. (In the case of dynamic pages, the content will be generated on the server, usually as a .jsp file, or an .aspx file if you are using the Microsoft .NET Framework.) If the page is not cached, sending all of the page contents may take over 100 requests to the server.

Assume your website consists of a single page that contains 100 items and all of those items reside on a single server. Now assume a single user calls for that page and you don’t want the user to have to wait more than one second before being able to interact with any part of it. That means the web server will need to be able to handle at least 100 requests per second per page view. (There are other potential bottlenecks in rendering the web page including Internet traffic, ISP speed, and the speed of the client computer, but we’ll ignore these for purposes of this blog post.)

Using the five nines confidence level, the final results for page views and server calls are shown in the table below. For our website with an expected 2.6 million page views per month, we need a web server that can handle 1,600 requests per second.

[Table with columns: avg. page views per month; avg. server requests per month*; peak page views per second at the 0.99999 confidence level**; and peak server requests per second at the 0.99999 confidence level*,**. The numeric rows did not survive in this copy.]

*Assumes 100 server calls per page view
**Assumes 168 prime viewing hours per month with uniform distribution
***Assumes goal is to satisfy 99.999% of visitor requests, not 99.999% of time
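As a minimal sketch, the headline figure can be rebuilt from the assumptions in the footnotes:

```python
# Rebuilding the headline figure above. 2.6 million monthly page views
# over 168 prime viewing hours averages ~4.3 page views per second; the
# five-nines Poisson peak is 16 page views per second, and at 100 server
# requests per page view that is 1,600 requests per second.
from scipy.stats import poisson

avg_views_per_sec = 2_600_000 / (168 * 3600)            # ~4.3
peak_views = poisson.ppf(0.99999, avg_views_per_sec)    # 16
print(f"Peak server requests per second: {peak_views * 100:.0f}")  # 1600
```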


Update: If your website is one of many hosted on a single server, then you should skip the calculations for estimating peak page views per prime viewing second. That’s because traffic to your site will be combined with traffic to other sites. In that case, it is up to the company hosting the sites to combine the average traffic from all the sites first, then calculate the peak page views based on their promised availability.